Preemption Policy

Design Principle

Preemption is a last resort for resource rebalancing. The scheduler prefers waiting, backfill, and elastic borrowing over preemption. When preemption is necessary, it targets allocations with the lowest preemption cost (fast checkpoint, low priority, short remaining runtime). Sensitive allocations are never preempted.

Preemption Classes

Each allocation has a preemption_class (0-10):

| Class | Meaning | Typical Use | Preemptible By |
|-------|---------|-------------|----------------|
| 0 | Best-effort | Scavenger jobs, testing | Any higher class |
| 1-3 | Low priority | Batch exploration, sweeps | Class 4+ |
| 4-6 | Normal | Production training, simulation | Class 7+ |
| 7-9 | High priority | Time-sensitive production | Class 10 only |
| 10 | Critical / Sensitive | Sensitive claims, emergency | Never preempted |

Rule: Preemption only moves down — a class-5 allocation can preempt class 0-4 allocations but never class 5+.

Enforcement: The preemption_class range (0-10) is validated at API admission. Values outside this range are rejected with a 400 Bad Request error before reaching the scheduler.

Tie-breaking within class: If multiple allocations have the same preemption class, the scheduler prefers to preempt the one with the lowest checkpoint cost (f₈).
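
The admission check, the "only moves down" rule, and the tie-break can be sketched in a few lines of Python. This is a minimal illustration, not the scheduler's actual code: the names `InvalidPreemptionClass`, `validate_preemption_class`, `can_preempt`, and `pick_within_class` are hypothetical, and the sketch follows the strict "lower class only" rule stated above rather than the class bands in the table.

```python
from typing import List, Tuple

MIN_CLASS, MAX_CLASS = 0, 10  # documented preemption_class range


class InvalidPreemptionClass(ValueError):
    """Hypothetical error that an API layer could map to 400 Bad Request."""


def validate_preemption_class(value: int) -> int:
    # Reject non-integers and anything outside 0-10 before scheduling.
    if isinstance(value, bool) or not isinstance(value, int):
        raise InvalidPreemptionClass(f"preemption_class must be an integer: {value!r}")
    if not MIN_CLASS <= value <= MAX_CLASS:
        raise InvalidPreemptionClass(f"preemption_class out of range 0-10: {value}")
    return value


def can_preempt(pending_class: int, victim_class: int) -> bool:
    # Preemption only moves down; class 10 is never a victim.
    return victim_class < pending_class and victim_class != MAX_CLASS


def pick_within_class(candidates: List[Tuple[str, float]]) -> Tuple[str, float]:
    # candidates: (allocation name, checkpoint cost) pairs of the same class.
    # Tie-break: prefer the allocation with the lowest checkpoint cost.
    return min(candidates, key=lambda pair: pair[1])
```

For example, `can_preempt(5, 4)` is true while `can_preempt(5, 5)` is false, matching the class-5 example in the rule.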

Preemption Triggers

1. Higher-Priority Demand

A pending allocation with class N cannot be scheduled because all suitable nodes are occupied by lower-class allocations. The scheduler evaluates whether preempting one or more lower-class allocations would free enough resources.

2. Elastic Reclamation

A vCluster’s idle nodes were borrowed by another vCluster (elastic sharing). The home vCluster now needs them back. Borrowed nodes carry an implicit preemption risk — the checkpoint cost model (f₈) accounts for this.

3. Sensitive Node Claim

A sensitive user claims nodes that are currently occupied by non-sensitive allocations. Sensitive claims are class 10 (highest). The scheduler triggers immediate checkpoint + preemption of the occupying allocations.

4. Quota Enforcement

A tenant would exceed its hard quota because of a race condition (two concurrent proposals, of which the first has already been committed). The quorum rejects the second proposal. This is rejection, not preemption: running allocations are never preempted for quota enforcement.

Preemption Decision Algorithm

PreemptionDecision(pending_job, candidates):

1. Filter candidates:
   - Only allocations with preemption_class < pending_job.preemption_class
   - Exclude sensitive allocations (never preempted)
   - Exclude allocations in Checkpointing state (already being preempted)

2. Score each candidate by preemption cost:
   preemption_cost(c) = checkpoint_time(c)
                       + recompute_if_no_checkpoint(c)
                       + remaining_walltime_value(c)

   checkpoint_time(c):
     If checkpoint == Auto: estimated_checkpoint_minutes from f₈
     If checkpoint == Manual: assume application handles it, use configured timeout
     If checkpoint == None: no checkpoint is taken, so recompute_if_no_checkpoint applies instead

   recompute_if_no_checkpoint(c):
     time_since_last_checkpoint(c) × node_count(c) × gpu_per_node(c)
     (GPU-hours that would be lost)

   remaining_walltime_value(c):
     If c is near completion (>90% walltime used): high cost (let it finish)
     If c just started (<10% walltime used): low cost (little invested)

3. Select victim set:
   Greedy: pick candidates with lowest preemption_cost until enough nodes freed.
   Constraint: freed nodes must satisfy pending_job's topology/conformance requirements.

4. If no valid victim set exists: pending_job stays queued (preemption not possible).

5. If valid victim set found: initiate preemption sequence.
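
The filter, score, and greedy-select steps above could look roughly like this in Python. It is a sketch under several assumptions: the `Candidate` fields are hypothetical, the three cost terms are summed without unit conversion exactly as in the pseudocode, the numeric values used for `remaining_walltime_value` are placeholders (the text only says "high" and "low"), and the topology/conformance constraint is reduced to a simple node count.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    preemption_class: int
    node_count: int
    gpus_per_node: int
    checkpoint: str                    # "auto", "manual", or "none"
    checkpoint_minutes: float          # f8 estimate (auto) or configured timeout (manual)
    hours_since_last_checkpoint: float
    walltime_used_fraction: float      # 0.0 .. 1.0
    sensitive: bool = False
    state: str = "running"


# Placeholder magnitudes; the source does not give concrete numbers.
NEAR_COMPLETION_COST, MID_RUN_COST, JUST_STARTED_COST = 100.0, 10.0, 0.0


def preemption_cost(c: Candidate) -> float:
    no_checkpoint = c.checkpoint == "none"
    checkpoint_time = 0.0 if no_checkpoint else c.checkpoint_minutes
    # GPU-hours lost when nothing can be checkpointed.
    recompute = (c.hours_since_last_checkpoint * c.node_count * c.gpus_per_node
                 if no_checkpoint else 0.0)
    if c.walltime_used_fraction > 0.9:
        remaining = NEAR_COMPLETION_COST   # let it finish
    elif c.walltime_used_fraction < 0.1:
        remaining = JUST_STARTED_COST      # little invested
    else:
        remaining = MID_RUN_COST
    return checkpoint_time + recompute + remaining


def select_victims(pending_class: int, nodes_needed: int,
                   candidates: List[Candidate]) -> Optional[List[Candidate]]:
    # Step 1: filter by class, sensitivity, and state.
    eligible = [c for c in candidates
                if c.preemption_class < pending_class
                and not c.sensitive
                and c.state != "checkpointing"]
    # Steps 2-3: greedily take the cheapest candidates until enough nodes are freed.
    victims, freed = [], 0
    for c in sorted(eligible, key=preemption_cost):
        if freed >= nodes_needed:
            break
        victims.append(c)
        freed += c.node_count
    # Step 4: None means the pending job stays queued.
    return victims if freed >= nodes_needed else None
```

A real implementation would also need to verify that the freed nodes satisfy the pending job's topology and conformance requirements before committing to a victim set.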

Preemption Sequence

1. Scheduler identifies victim allocations
2. For each victim:
   a. If checkpoint == Auto or Manual:
      - Checkpoint broker sends CHECKPOINT_HINT to node agents
      - Application checkpoints (signal, shmem, or gRPC callback)
      - Timeout: checkpoint_timeout (default: 10 minutes)
   b. If checkpoint == None:
      - SIGTERM sent immediately
      - Grace period (30s) → SIGKILL
3. When checkpoint completes (or timeout):
   - Allocation transitions to Suspended state
   - Nodes released to quorum (Raft commit)
4. Freed nodes assigned to pending allocation
5. Suspended allocations re-enter queue with:
   - Original submission time preserved (no wait-time penalty)
   - Resume-from-checkpoint flag set
   - Preempted-count incremented
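
As a rough illustration of the sequence above, the per-victim steps and the requeue bookkeeping might be modelled as follows. The `QueueEntry` type and both function names are hypothetical; only the timeout and grace-period defaults come from the text.

```python
from dataclasses import dataclass, replace
from typing import List

CHECKPOINT_TIMEOUT_S = 600  # documented default: 10 minutes
GRACE_PERIOD_S = 30         # documented SIGTERM -> SIGKILL grace period


@dataclass(frozen=True)
class QueueEntry:
    allocation_id: str
    submitted_at: float              # original submission time
    resume_from_checkpoint: bool = False
    preempted_count: int = 0


def victim_steps(checkpoint_mode: str) -> List[str]:
    # Step 2: the first actions depend on the checkpoint mode.
    if checkpoint_mode in ("auto", "manual"):
        first = [f"send CHECKPOINT_HINT (timeout {CHECKPOINT_TIMEOUT_S}s)"]
    elif checkpoint_mode == "none":
        first = ["send SIGTERM", f"wait {GRACE_PERIOD_S}s, then SIGKILL"]
    else:
        raise ValueError(f"unknown checkpoint mode: {checkpoint_mode!r}")
    # Step 3: suspend and release nodes through the quorum.
    return first + ["transition to Suspended", "release nodes to quorum"]


def requeue_after_preemption(entry: QueueEntry) -> QueueEntry:
    # Step 5: submitted_at is left untouched, so there is no wait-time penalty.
    return replace(entry,
                   resume_from_checkpoint=True,
                   preempted_count=entry.preempted_count + 1)
```

Keeping `QueueEntry` immutable and returning a modified copy makes it harder to reset the submission time by accident when an allocation is requeued.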

Checkpoint Timeout Handling

When a checkpointing allocation fails to complete within the timeout:

| Scenario | Action |
|----------|--------|
| Application responds but slow | Extend timeout by 50%, once |
| Application unresponsive | SIGTERM → grace period → SIGKILL. Mark as failed (not suspended). Requeue if policy allows. |
| gRPC callback: application requests deferral | Grant deferral up to max_deferral (default: 5 minutes). Then force. |
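
The three scenarios can be expressed as a small decision function. This is a sketch only: the `Scenario` enum, the action names, and the function signature are hypothetical, and the behaviour after the single 50% extension has been used up is an assumption (the sketch falls back to forced termination).

```python
from enum import Enum
from typing import Tuple


class Scenario(Enum):
    SLOW = "application responds but slow"
    UNRESPONSIVE = "application unresponsive"
    DEFERRAL_REQUESTED = "application requests deferral"


def on_checkpoint_timeout(scenario: Scenario,
                          timeout_s: float,
                          already_extended: bool = False,
                          requested_deferral_s: float = 0.0,
                          max_deferral_s: float = 300.0) -> Tuple[str, float]:
    """Return (action, extra_seconds_granted)."""
    if scenario is Scenario.SLOW:
        if not already_extended:
            return ("extend", timeout_s * 0.5)   # 50% extension, granted once
        return ("force_terminate", 0.0)          # assumption: no second extension
    if scenario is Scenario.UNRESPONSIVE:
        # SIGTERM, grace period, SIGKILL; marked failed rather than suspended.
        return ("terminate_and_mark_failed", 0.0)
    if scenario is Scenario.DEFERRAL_REQUESTED:
        # Cap the deferral at max_deferral (default: 5 minutes), then force.
        return ("defer", min(requested_deferral_s, max_deferral_s))
    raise ValueError(f"unknown scenario: {scenario!r}")
```

With the default 10-minute timeout, a slow application gets `("extend", 300.0)` on the first timeout, and a deferral request of 15 minutes is capped to `("defer", 300.0)`.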

Multi-Victim Preemption

Sometimes freeing one allocation isn’t enough. The scheduler can preempt multiple allocations in a single decision:

Constraints:

  • Maximum victims per decision: configurable (default: 3)
  • All victims must have lower preemption class than the pending job
  • Total preemption cost must be less than the pending job’s estimated value
  • Scheduler prefers preempting fewer, larger allocations over many small ones

Ordering: Victims are preempted in parallel (all receive checkpoint hints simultaneously). The pending job starts once all victims have released their nodes.
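
One way to honour these constraints together is to try victim sets in order of increasing size, which encodes the preference for fewer, larger allocations. The sketch below is illustrative, not the scheduler's actual search: the `Victim` type and `choose_victim_set` are hypothetical, `pending_value` stands in for the pending job's estimated value, and node count again stands in for the full topology check.

```python
from itertools import combinations
from typing import NamedTuple, Optional, Sequence, Tuple


class Victim(NamedTuple):
    name: str
    preemption_class: int
    node_count: int
    cost: float  # preemption cost from the decision algorithm


def choose_victim_set(candidates: Sequence[Victim],
                      pending_class: int,
                      nodes_needed: int,
                      pending_value: float,
                      max_victims: int = 3) -> Optional[Tuple[Victim, ...]]:
    # All victims must have a lower class than the pending job.
    lower = [v for v in candidates if v.preemption_class < pending_class]
    # Smaller sets first: prefer fewer, larger allocations.
    for size in range(1, max_victims + 1):
        valid = [
            group for group in combinations(lower, size)
            if sum(v.node_count for v in group) >= nodes_needed
            and sum(v.cost for v in group) < pending_value
        ]
        if valid:
            # Among sets of the same size, take the cheapest.
            return min(valid, key=lambda group: sum(v.cost for v in group))
    return None  # no valid set within max_victims
```

Exhaustive enumeration is only practical because `max_victims` is small; with the default of 3 the search stays cubic in the number of candidates.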

Per-vCluster Preemption Policy

| vCluster Type | Preemption Allowed | Notes |
|---------------|--------------------|-------|
| HPC Batch | Yes | Class-based, checkpoint-aware |
| ML Training | Yes | Checkpoint cost heavily weighted (w₈=0.15) |
| Service | Yes (borrowed nodes only) | Services on home nodes are not preempted; borrowed nodes reclaimable |
| Sensitive | Never preempted | Class 10, no exceptions |
| Interactive | Yes | Short-lived, low cost to preempt |
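
The policy table above reduces to a short lookup. The sketch below assumes hypothetical string identifiers for the vCluster types and a hypothetical `on_borrowed_node` flag; it covers only whether preemption is permitted at all, not the class-based rules that apply afterwards.

```python
def preemption_allowed(vcluster_type: str, on_borrowed_node: bool) -> bool:
    # Sensitive vClusters are class 10 and never preempted.
    if vcluster_type == "sensitive":
        return False
    # Services are preemptible only on nodes borrowed from another vCluster.
    if vcluster_type == "service":
        return on_borrowed_node
    # The remaining types allow class-based preemption.
    if vcluster_type in {"hpc_batch", "ml_training", "interactive"}:
        return True
    raise ValueError(f"unknown vCluster type: {vcluster_type!r}")
```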

Non-Preemptible Allocations

An allocation is effectively non-preemptible when:

  1. checkpoint: None AND preemption_class >= 7 — High cost to preempt (all progress lost), high priority
  2. Sensitive allocations (always class 10)
  3. Allocations within 5 minutes of walltime completion (configurable: near_completion_threshold)

The scheduler avoids placing non-preemptible allocations on borrowed nodes, since those nodes may need to be reclaimed.
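
The three conditions can be combined into a single predicate, which a placement step could then consult before using borrowed nodes. This is a minimal sketch: the function names and parameters are hypothetical, and "within 5 minutes" is read as less than or equal to the threshold.

```python
def is_effectively_non_preemptible(checkpoint: str,
                                   preemption_class: int,
                                   sensitive: bool,
                                   minutes_to_walltime_end: float,
                                   near_completion_threshold: float = 5.0) -> bool:
    # 1. No checkpoint and high priority: all progress would be lost.
    if checkpoint == "none" and preemption_class >= 7:
        return True
    # 2. Sensitive allocations are always class 10.
    if sensitive:
        return True
    # 3. Close enough to walltime completion to let it finish.
    return minutes_to_walltime_end <= near_completion_threshold


def may_place_on_borrowed_node(checkpoint: str,
                               preemption_class: int,
                               sensitive: bool,
                               minutes_to_walltime_end: float) -> bool:
    # Borrowed nodes may be reclaimed, so avoid placing non-preemptible work there.
    return not is_effectively_non_preemptible(
        checkpoint, preemption_class, sensitive, minutes_to_walltime_end)
```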

Preemption Metrics

| Metric | Type | Description |
|--------|------|-------------|
| lattice_preemptions_total | counter | Labels: vcluster, reason (priority/reclaim/sensitive) |
| lattice_preemption_checkpoint_duration_seconds | histogram | Time from hint to checkpoint completion |
| lattice_preemption_victim_requeue_total | counter | Preempted allocations re-entering queue |
| lattice_preemption_failed_checkpoint_total | counter | Checkpoint timeouts during preemption |

Cross-References