Quota Enforcement
Design Principle
Two-tier enforcement matches the two consistency domains (ADR-004). Hard limits are enforced at the quorum (strong consistency; they cannot be violated). Soft limits are enforced at the scheduler (eventual consistency; they may temporarily overshoot, and are self-correcting).
Hard Quotas (Quorum-Enforced)
Hard quotas are checked during Raft proposal validation, before commit. A proposal that would violate a hard quota is rejected immediately.
| Quota | Scope | Enforcement |
|---|---|---|
| `max_nodes` | Per tenant | Quorum rejects allocation proposals that would exceed the tenant's maximum concurrent node count |
| `max_concurrent_allocations` | Per tenant | Quorum rejects proposals that would exceed the tenant's maximum number of running allocations |
| `sensitive_pool_size` | System-wide | Hard limit on the number of nodes that can be claimed for sensitive use |
Guarantees: These quotas cannot be violated, even momentarily. If two vCluster schedulers propose conflicting allocations that together would exceed a hard quota, the first to commit wins; the second is rejected and retried in the next cycle.
Error handling: Hard quota rejection returns a clear error to the user:
allocation rejected: tenant "physics" would exceed max_nodes quota (current: 195, requested: 10, limit: 200)
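A minimal sketch of the quorum-side check, assuming illustrative type and field names (these are not the actual lattice internals). It validates a proposal against the per-tenant hard quotas and formats rejections in the style of the error above:

```rust
/// Per-tenant hard quotas (field names are assumptions for this sketch).
struct TenantQuota {
    max_nodes: u64,
    max_concurrent_allocations: u64,
}

/// Current committed usage for a tenant.
struct TenantUsage {
    nodes_in_use: u64,
    running_allocations: u64,
}

/// Runs during Raft proposal validation, before commit: a violating
/// proposal is rejected here and never enters the log.
fn validate_proposal(
    tenant: &str,
    requested_nodes: u64,
    quota: &TenantQuota,
    usage: &TenantUsage,
) -> Result<(), String> {
    if usage.nodes_in_use + requested_nodes > quota.max_nodes {
        return Err(format!(
            "allocation rejected: tenant \"{tenant}\" would exceed max_nodes quota \
             (current: {}, requested: {}, limit: {})",
            usage.nodes_in_use, requested_nodes, quota.max_nodes
        ));
    }
    if usage.running_allocations + 1 > quota.max_concurrent_allocations {
        return Err(format!(
            "allocation rejected: tenant \"{tenant}\" would exceed \
             max_concurrent_allocations quota (current: {}, limit: {})",
            usage.running_allocations, quota.max_concurrent_allocations
        ));
    }
    Ok(())
}
```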
Soft Quotas (Scheduler-Level)
Soft quotas are tracked with eventual consistency. They influence scheduling decisions through the cost function but do not hard-block allocations.
GPU-Hours Budget
gpu_hours_budget: 100000 # per billing period (month)
gpu_hours_used: 87500 # eventually consistent counter
Behavior: The scheduler uses the remaining budget as a penalty in the cost function (see the sketch after this list). As the budget depletes:
- 0-80% used: no penalty
- 80-100% used: increasing penalty (lower scheduling priority)
- >100% used: very low score (effective starvation for new allocations, but not hard rejection)
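As a rough illustration, the tiers above could be realized as a piecewise function of budget utilization; the exact shape and weights are defined by the cost function in scheduling-algorithm.md, so treat this as a sketch:

```rust
/// Piecewise budget penalty mirroring the tiers above (illustrative
/// thresholds and magnitudes; the real weights live in the cost function).
fn budget_penalty(hours_used: f64, hours_budget: f64) -> f64 {
    let utilization = if hours_budget > 0.0 {
        hours_used / hours_budget
    } else {
        1.0 // no budget configured: treated as fully used in this sketch
    };
    if utilization < 0.80 {
        0.0 // 0-80% used: no penalty
    } else if utilization < 1.0 {
        (utilization - 0.80) / 0.20 // 80-100% used: ramps from 0 to 1
    } else {
        10.0 // >100% used: large penalty -> very low score, never a hard reject
    }
}
```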
Consistency window: Up to ~30 seconds of lag. Acceptable because: (a) scheduling cycle is 5-30s, (b) over-allocation is self-correcting via fair-share scoring, (c) GPU-hours tracking is for billing, not safety.
Fair Share Target
fair_share_target: 0.15 # tenant should get ~15% of system capacity
Behavior: Feeds into f₃ (fair_share_deficit) in the cost function. Tenants below their share get priority; tenants above are deprioritized. Not a hard ceiling — a tenant can use more than their share when resources are idle.
Burst Allowance
burst_allowance: 1.5 # allow up to 150% of fair share when resources idle
Behavior: Allows temporary over-allocation when the system has spare capacity. When demand increases and other tenants need their share, burst allocations are the first candidates for preemption (via checkpoint cost model).
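A sketch of how the two knobs might interact, assuming tenant usage and system idle capacity are expressed as fractions of total capacity (names here are illustrative; f₃'s actual definition is in scheduling-algorithm.md):

```rust
/// Fair-share deficit as fed into f3: positive when the tenant is below
/// its target (gets priority), negative when above (deprioritized).
fn fair_share_deficit(usage_frac: f64, fair_share_target: f64) -> f64 {
    fair_share_target - usage_frac
}

/// Burst gate: a tenant may exceed its share up to target x allowance,
/// but only while the system has idle capacity. Burst allocations are
/// the first preemption candidates when demand returns.
fn may_burst(
    usage_frac: f64,
    fair_share_target: f64,
    burst_allowance: f64,
    idle_frac: f64,
) -> bool {
    idle_frac > 0.0 && usage_frac < fair_share_target * burst_allowance
}
```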
Internal Budget Ledger
When Waldur is unavailable or not configured, the scheduler computes GPU-hours consumption internally from allocation records in the quorum. This replaces the previously empty `budget_utilization` map in the cost function.
Computation
Two metrics are tracked:
node_hours_used = Σ (end_time - started_at).hours × assigned_nodes.len()
gpu_hours_used = Σ (end_time - started_at).hours × Σ gpu_count_per_node
- For running allocations: `end_time = now`
- For completed/failed/cancelled allocations: `end_time = completed_at`
- Only allocations within the configured `budget_period_days` (default: 90 days, rolling window) are included
- Node GPU count is looked up from the current hardware inventory; unknown nodes default to 1 GPU
- Node-hours is the universal metric (works for CPU-only and GPU nodes)
- When both `gpu_hours_budget` and `node_hours_budget` are set, the worse (higher) utilization fraction drives the budget penalty
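Put together, the ledger computation might look like the following sketch (record fields and the inventory lookup are assumptions; clipping allocations that partially overlap the window is left out for brevity):

```rust
use std::time::{Duration, SystemTime};

/// Minimal allocation record for this sketch (assumed fields).
struct AllocationRecord {
    started_at: SystemTime,
    completed_at: Option<SystemTime>, // None while still running
    assigned_nodes: Vec<String>,
}

/// Rolling-window ledger: returns (node_hours_used, gpu_hours_used).
/// `gpu_count` stands in for the hardware-inventory lookup; unknown
/// nodes default to 1 GPU, per the rules above.
fn ledger(
    allocs: &[AllocationRecord],
    budget_period_days: u64,
    gpu_count: impl Fn(&str) -> Option<u32>,
) -> (f64, f64) {
    let now = SystemTime::now();
    let window_start = now - Duration::from_secs(budget_period_days * 86_400);
    let (mut node_hours, mut gpu_hours) = (0.0, 0.0);
    for a in allocs {
        // end_time = now for running allocations, completed_at otherwise.
        let end = a.completed_at.unwrap_or(now);
        if end < window_start {
            continue; // outside the rolling budget window
        }
        let hours = end
            .duration_since(a.started_at)
            .unwrap_or_default()
            .as_secs_f64()
            / 3600.0;
        node_hours += hours * a.assigned_nodes.len() as f64;
        let gpus: u32 = a
            .assigned_nodes
            .iter()
            .map(|n| gpu_count(n).unwrap_or(1))
            .sum();
        gpu_hours += hours * f64::from(gpus);
    }
    (node_hours, gpu_hours)
}

/// When both budgets are set, the worse (higher) fraction drives the penalty.
fn budget_utilization(node_frac: f64, gpu_frac: f64) -> f64 {
    node_frac.max(gpu_frac)
}
```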
Budget Period
Configurable via `scheduling.budget_period_days` (default: 90). This is a rolling window, not a calendar-aligned reset. Calendar-aligned resets require Waldur to push new `gpu_hours_budget` values at period boundaries.
Waldur Override
When Waldur is available, its remaining_budget() response takes precedence over the internal ledger. When Waldur is unavailable (transient failure), the internal ledger provides fallback data so budget enforcement continues.
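The precedence rule is simple enough to show directly; the client type and error shape below are stand-ins (only the `remaining_budget()` call name comes from this section):

```rust
/// Stand-in for the Waldur integration client; the method name
/// `remaining_budget` comes from this section, the signature is assumed.
struct WaldurClient;

impl WaldurClient {
    fn remaining_budget(&self) -> Result<f64, String> {
        Err("waldur unavailable".into()) // network call elided in this sketch
    }
}

/// Waldur is authoritative when reachable; the internal ledger is the
/// fallback, so budget enforcement continues through transient outages.
fn effective_remaining(waldur: Option<&WaldurClient>, ledger_remaining: f64) -> f64 {
    waldur
        .and_then(|w| w.remaining_budget().ok())
        .unwrap_or(ledger_remaining)
}
```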
API Access
- gRPC: `GetTenantUsage` / `GetUserUsage` RPCs in `AdminService`
- REST: `GET /api/v1/tenants/{id}/usage?days=90` / `GET /api/v1/usage?user=alice&days=90`
- Rust SDK: `client.tenant_usage("physics", 90)` / `client.user_usage("alice", 90)`
- CLI: `lattice usage --tenant physics` / `lattice usage` (uses gRPC)
Exhausted Budget Behavior
GPU-Hours Budget Exhausted
- New allocations for this tenant receive a very low scheduling score (effective starvation, not hard rejection)
- Tenant admin notified via API event
- Running allocations continue to completion (no preemption for budget reasons)
- If Waldur integration enabled: Waldur can update the budget (cross-ref: accounting.md)
- Tenant admin can request budget increase through Waldur self-service portal
Max Nodes Exhausted
- Hard rejection at quorum — clear error returned to user
- User must wait for running allocations to complete or cancel existing allocations
- No waiting queue for hard-quota-blocked allocations: the submit is rejected, and the user resubmits when capacity is available (one possible client-side retry loop is sketched below)
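Because there is no server-side queue, waiting is the client's job. A sketch of one way a client might poll for headroom (the error variant and submit callback are hypothetical, not SDK API):

```rust
use std::{thread, time::Duration};

/// Hypothetical client-side error split: hard-quota rejections are
/// retryable once running allocations free up capacity.
enum SubmitError {
    QuotaExceeded(String),
    Other(String),
}

/// Resubmit on hard-quota rejection with exponential backoff; the quorum
/// rejects immediately and never queues, so the waiting happens client-side.
fn submit_with_retry(
    mut submit: impl FnMut() -> Result<u64, SubmitError>,
    max_attempts: u32,
) -> Result<u64, SubmitError> {
    let mut delay = Duration::from_secs(30);
    for _ in 0..max_attempts {
        match submit() {
            Err(SubmitError::QuotaExceeded(msg)) => {
                eprintln!("{msg}; retrying in {}s", delay.as_secs());
                thread::sleep(delay);
                delay = delay.saturating_mul(2); // exponential backoff
            }
            other => return other,
        }
    }
    Err(SubmitError::Other("gave up waiting for quota headroom".into()))
}
```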
Quota Update Flow
Administrative Update
System admin updates tenant quotas via CLI or API:
# CLI (uses gRPC UpdateTenant RPC)
lattice admin tenant update physics \
--max-nodes 250 \
--max-concurrent-allocations 50 \
--gpu-hours-budget 150000 \
--node-hours-budget 500000
# Python SDK
await client.update_tenant("physics", {
"max_nodes": 250,
"max_concurrent_allocations": 50,
"gpu_hours_budget": 150000,
"node_hours_budget": 500000,
})
# REST
PUT /api/v1/tenants/{id}
{
"max_nodes": 250,
"max_concurrent_allocations": 50,
"gpu_hours_budget": 150000,
"node_hours_budget": 500000
}
Hard quota changes are Raft-committed (immediate effect). Soft quota changes propagate eventually.
Waldur-Driven Update
When Waldur integration is enabled, Waldur can push quota changes:
- Waldur determines budget exhaustion or a contract change
- Waldur calls lattice-api: `PUT /api/v1/tenants/{id}` (authenticated with the Waldur service token)
- Hard quotas are committed via Raft; soft quotas are propagated to schedulers
- Reducing `max_nodes` below current usage does not preempt running allocations; it only prevents new ones
Quota Reduction While Allocations Are Running
When a quota is reduced below current usage (e.g., Waldur reduces `max_nodes` from 200 to 100, but the tenant is currently using 150):
Hard Quota Reduction
- Running allocations are not preempted. The reduced quota only blocks new allocations.
- Current usage (150) exceeds new limit (100): all new proposals for this tenant are rejected until usage drops below 100.
- The user receives a clear error on new submissions:
  allocation rejected: tenant "physics" exceeds max_nodes quota
  Current usage: 150 nodes
  New limit: 100 nodes
  Hint: Wait for running allocations to complete, or contact your tenant admin.
- As running allocations complete naturally, usage drops. When usage < new limit, new allocations are accepted again.
Soft Quota Reduction
- Reduced `gpu_hours_budget`: scheduling score penalty increases. Pending allocations get lower priority but are not rejected.
- Reduced `fair_share_target`: tenant gets deprioritized but can still schedule when resources are idle.
- No immediate impact on running allocations.
Pending Allocations
Allocations that are `Pending` (in the scheduler queue but not yet committed) when a hard quota is reduced:
- They are not retroactively cancelled.
- If proposed to quorum, the proposal is rejected due to the new quota.
- The scheduler will not re-propose them until quota headroom exists.
- User sees the allocation stuck in `Pending` state. `lattice status` shows the reason: "waiting for quota headroom".
Sensitive Quota Considerations
Sensitive quotas are always hard quotas:
- `sensitive_pool_size` — System-wide hard limit, quorum-enforced
- Sensitive node claims always go through quorum (strong consistency)
- No soft/eventual quota mechanisms for sensitive resources
- Idle sensitive nodes (claimed but unused) are not reclaimable — they remain allocated to the claiming user
Cross-ref: sensitive-workloads.md for the full sensitive workload model.
Cross-References
- scheduling-algorithm.md — f₃ fair_share_deficit uses soft quota targets
- accounting.md — Waldur quota feedback loop
- sensitive-workloads.md — Sensitive quotas are always hard
- autoscaling.md — Scale-up respects hard quota limits