Testing Strategy

Design Principle

Scheduler correctness is non-negotiable. The testing strategy covers four levels: unit tests for individual functions, integration tests for component interactions, simulation tests for scheduling behavior, and chaos tests for fault tolerance. Every level must pass before a release.

Test Levels

┌─────────────────────────────────────────────────┐
│ Level 4: Chaos Tests (fault injection)          │
│   Raft leader loss, network partitions,         │
│   node failures, storage unavailability         │
├─────────────────────────────────────────────────┤
│ Level 3: Simulation (RM-Replay)                 │
│   Production workload replay, weight tuning,    │
│   fairness validation, SLO compliance           │
├─────────────────────────────────────────────────┤
│ Level 2: Integration Tests                      │
│   Multi-component scenarios, API contracts,     │
│   end-to-end allocation lifecycle               │
├─────────────────────────────────────────────────┤
│ Level 1: Unit Tests                             │
│   Cost function, topology solver, state machine,│
│   protobuf serialization, error handling        │
└─────────────────────────────────────────────────┘

Level 1: Unit Tests

In-module tests (#[cfg(test)]), run via cargo test.
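A minimal sketch of what a Level 1 in-module test looks like, using a hypothetical `wait_time_score` stand-in for the f₂ cost component (the real signature lives in lattice-scheduler; names and the saturation formula here are illustrative only):

```rust
/// Hypothetical stand-in for cost component f2: grows with queue wait,
/// saturating toward 1.0. Not the real lattice-scheduler implementation.
fn wait_time_score(wait_secs: f64, half_life_secs: f64) -> f64 {
    1.0 - (-wait_secs / half_life_secs).exp()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn longer_wait_scores_higher() {
        assert!(wait_time_score(600.0, 3600.0) < wait_time_score(7200.0, 3600.0));
    }

    #[test]
    fn score_stays_in_unit_interval() {
        for w in [0.0, 1.0, 1e6] {
            let s = wait_time_score(w, 3600.0);
            assert!((0.0..=1.0).contains(&s));
        }
    }
}

fn main() {
    println!("f2(1h wait, 1h half-life) = {:.3}", wait_time_score(3600.0, 3600.0));
}
```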

Critical Paths

| Crate | What to Test | Example |
|---|---|---|
| lattice-scheduler | Cost function components (f₁–f₉) | Given inputs, verify score output |
| lattice-scheduler | Knapsack solver | Given nodes and allocations, verify placement |
| lattice-scheduler | Topology packing | Given groups and node count, verify group selection |
| lattice-scheduler | Conformance group selection | Given fingerprints, verify grouping |
| lattice-quorum | Raft proposal validation | Hard quota rejection, ownership conflict |
| lattice-quorum | State machine transitions | Node state changes, allocation lifecycle |
| lattice-common | Type serialization/deserialization | Protobuf round-trip for all types |
| lattice-common | Allocation state machine | Valid and invalid state transitions |
| lattice-api | Request validation | Reject invalid allocations (cycles in DAG, bad constraints) |
| lattice-api | SBATCH directive parsing | Translate Slurm directives to Intent API |
| lattice-checkpoint | Cost model evaluation | Given metrics, verify checkpoint decision |
| lattice-cli | Argument parsing | Flag combinations, error messages |
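The allocation state machine row above is the kind of test where an explicit transition table pays off. A sketch, with illustrative state names (the real set in lattice-common may differ):

```rust
/// Illustrative allocation states; the real lattice-common enum may differ.
#[derive(Clone, Copy, PartialEq, Debug)]
enum AllocState { Pending, Scheduled, Running, Completed, Failed }

/// Whitelist of legal transitions; everything else is rejected.
fn can_transition(from: AllocState, to: AllocState) -> bool {
    use AllocState::*;
    matches!(
        (from, to),
        (Pending, Scheduled)
            | (Scheduled, Running)
            | (Scheduled, Pending)  // requeued before start
            | (Running, Completed)
            | (Running, Failed)
            | (Running, Pending)    // preempted and requeued
    )
}

fn main() {
    assert!(can_transition(AllocState::Pending, AllocState::Scheduled));
    assert!(!can_transition(AllocState::Completed, AllocState::Running));
    println!("transition table checks passed");
}
```

Encoding the whitelist in a single `matches!` keeps the valid and invalid cases exhaustively testable from one place.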

Property-Based Tests

Use proptest for property-based testing of the cost function and solver:

  • Cost function monotonicity: Increasing wait time always increases f₂
  • Fair share bounds: f₃ always in [0, 1]
  • Solver validity: Every placement returned by the solver satisfies all constraints
  • Topology packing: Solver never spans more groups than necessary
  • State machine: No invalid state transitions accepted
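The real suite would express these properties with proptest generators and shrinking; a dependency-free sketch of the fair-share bounds property, with a hypothetical `fair_share_score` standing in for f₃ and a tiny xorshift PRNG in place of proptest's generators:

```rust
/// Hypothetical stand-in for f3: deviation of actual from target share,
/// clamped to [0, 1]. Not the real lattice-scheduler implementation.
fn fair_share_score(actual_share: f64, target_share: f64) -> f64 {
    ((actual_share - target_share).abs() / target_share.max(1e-9)).min(1.0)
}

/// Tiny xorshift PRNG so the sketch needs no external crates.
struct XorShift(u64);
impl XorShift {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn main() {
    let mut rng = XorShift(0x5eed);
    for _ in 0..10_000 {
        let (a, t) = (rng.next_f64(), rng.next_f64());
        let s = fair_share_score(a, t);
        assert!((0.0..=1.0).contains(&s), "f3 out of bounds: {}", s);
    }
    println!("f3 bounds held for 10000 random samples");
}
```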

Level 2: Integration Tests

In tests/ directories, using real components with mock external dependencies.

Test Harness

Integration tests share a harness that spins up:

  • In-memory Raft cluster (3 members, using openraft test utilities)
  • Mock node agents (report capabilities, respond to heartbeats)
  • Mock VAST API (storage queries return configurable responses)
  • Real scheduler instances
  • Real API server (in-process)

Scenarios

| Scenario | What It Tests |
|---|---|
| Submit → Schedule → Complete | Full allocation lifecycle through all components |
| DAG submission | Multi-allocation workflow with dependency resolution |
| Preemption | Higher-priority allocation preempts lower-priority |
| Elastic borrowing | vCluster borrows and returns nodes |
| Quota rejection | Hard quota exceeded → proposal rejected |
| Sensitive claim | Node claim, audit logging, wipe on release |
| Session lifecycle | Session create → terminal → disconnect → cleanup |
| Rolling upgrade simulation | Mixed-version node agents, protocol negotiation |
| Conformance drift | Node fingerprint changes → scheduling impact |
| Reactive scaling | Metric threshold triggers scale-up/down |

API Contract Tests

For every API endpoint, test:

  • Valid request → expected response
  • Invalid request → appropriate error code and message
  • Authorization: user sees own allocations only, tenant-admin sees tenant, system-admin sees all
  • Rate limiting: exceeded rate → 429 with Retry-After header

Protobuf Compatibility

Test backward compatibility:

  • Deserialize messages from previous version with new code (additive fields)
  • Deserialize messages from new version with old code (unknown fields ignored)
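Both directions rest on one wire-format property: a decoder skips fields it does not recognize. A self-contained sketch of that property for varint fields (real code would use prost; this hand-rolled decoder exists only to make the mechanism concrete):

```rust
/// Decode a protobuf varint, returning (value, bytes consumed).
fn read_varint(buf: &[u8]) -> (u64, usize) {
    let (mut val, mut shift, mut i) = (0u64, 0, 0);
    loop {
        let b = buf[i];
        val |= ((b & 0x7f) as u64) << shift;
        i += 1;
        if b & 0x80 == 0 {
            return (val, i);
        }
        shift += 7;
    }
}

/// "Old" decoder: only knows field 1 (a varint). Any other varint field is
/// skipped — which is why additive protobuf changes are backward compatible.
fn decode_v1(mut buf: &[u8]) -> Option<u64> {
    let mut field1 = None;
    while !buf.is_empty() {
        let (key, n) = read_varint(buf);
        buf = &buf[n..];
        match (key >> 3, key & 7) {
            (1, 0) => { let (v, n) = read_varint(buf); buf = &buf[n..]; field1 = Some(v); }
            (_, 0) => { let (_, n) = read_varint(buf); buf = &buf[n..]; } // unknown: skip
            _ => return None, // other wire types omitted in this sketch
        }
    }
    field1
}

fn main() {
    // "New" message: field 1 = 42, plus an added field 2 = 7 the old code never saw.
    let msg = [0x08, 42, 0x10, 7];
    assert_eq!(decode_v1(&msg), Some(42));
    println!("old decoder read field 1 and ignored the new field");
}
```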

Level 3: Simulation (RM-Replay)

Purpose

RM-Replay replays production workload traces through the scheduler to validate scheduling behavior without risking production. Essential for:

  • Tuning cost function weights before deployment
  • Validating fairness across tenants
  • Regression testing after scheduler changes

Workflow

1. Capture: Record production workload traces
   - Allocation submissions (arrival time, resources, constraints, tenant)
   - Allocation completions (duration, exit status)
   - Node inventory (capabilities, topology)

2. Configure: Set cost function weights and vCluster policies

3. Replay: Feed traces through lattice-scheduler in simulation mode
   - No real nodes or quorum — mock environment
   - Simulated time (runs in seconds, not hours)
   - Deterministic (same trace + same weights = same result)

4. Evaluate: Measure scheduling outcomes
   - Utilization: fraction of GPU-hours used
   - Wait time: p50, p95, p99 queue wait per priority class
   - Fairness: actual share vs. target share per tenant (Jain's fairness index)
   - Backfill effectiveness: percentage of idle slots filled
   - SLO compliance: percentage of allocations meeting target wait time
   - Preemption rate: preemptions per hour

5. Iterate: Adjust weights, re-run, compare
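The fairness metric in step 4 is a standard formula: for per-tenant shares x₁…xₙ, Jain's index is (Σxᵢ)² / (n·Σxᵢ²), ranging from 1/n (one tenant gets everything) to 1.0 (perfectly fair). A direct implementation (whether the replay tool normalizes shares against targets first is an evaluation choice, not shown here):

```rust
/// Jain's fairness index over per-tenant shares: (Σx)² / (n · Σx²).
/// 1.0 = perfectly fair; 1/n = one tenant consumes everything.
fn jains_index(shares: &[f64]) -> f64 {
    let n = shares.len() as f64;
    let sum: f64 = shares.iter().sum();
    let sum_sq: f64 = shares.iter().map(|x| x * x).sum();
    if sum_sq == 0.0 {
        return 1.0; // no usage at all: trivially fair
    }
    sum * sum / (n * sum_sq)
}

fn main() {
    println!("equal shares: {:.3}", jains_index(&[0.25, 0.25, 0.25, 0.25])); // 1.000
    println!("one hog:      {:.3}", jains_index(&[1.0, 0.0, 0.0, 0.0]));    // 0.250
}
```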

Regression Suite

Maintain a library of representative workload traces:

| Trace | Description | Key Metric |
|---|---|---|
| steady-state.trace | Normal mixed workload (HPC + ML + services) | Utilization > 85% |
| burst.trace | Sudden spike in submissions | No starvation (p99 wait < 4h) |
| unfair.trace | One tenant submits heavily | Fair share deviation < 10% |
| sensitive-claim.trace | Sensitive claims interleaved with HPC | Sensitive wait = 0 (immediate) |
| preemption-heavy.trace | Many priority inversions | Checkpoint success rate > 95% |
| empty-to-full.trace | Cluster goes from idle to full | Ramp-up time, scheduling cycle latency |

Each trace has a pass/fail threshold for key metrics. CI runs the regression suite on every scheduler change.

Level 4: Chaos Tests

Fault injection tests that validate the failure modes documented in failure-modes.md.

Fault Injection Framework

Use a test harness that can inject faults at configurable times:

| Fault | Injection Method | Validates |
|---|---|---|
| Raft leader kill | Stop leader process | Leader election, in-flight proposal retry |
| Raft member kill | Stop follower process | Continued operation with minority loss |
| Network partition (node↔quorum) | Drop heartbeats | Degraded → Down transition, allocation requeue |
| Network partition (quorum split) | Partition Raft members | Minority stalls, majority continues |
| Node agent crash | Kill agent process | Heartbeat timeout, allocation requeue |
| Storage unavailability | Mock VAST returns errors | Staging pauses, running allocations continue |
| Checkpoint timeout | Application ignores checkpoint hint | Forced preemption after timeout |
| API server crash | Kill API server | Client retry, no state loss |
| Quorum snapshot corruption | Corrupt snapshot file | Recovery from previous valid snapshot |
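The heartbeat-drop fault above validates a time-based state classification. A sketch in simulated time, with illustrative thresholds and state names (the real values and types are defined elsewhere in the design):

```rust
/// Illustrative node health states; thresholds below are assumptions,
/// not the real lattice configuration.
#[derive(Debug, PartialEq, Clone, Copy)]
enum NodeState { Healthy, Degraded, Down }

const DEGRADED_AFTER: u64 = 15; // seconds without a heartbeat
const DOWN_AFTER: u64 = 60;

fn classify(last_heartbeat: u64, now: u64) -> NodeState {
    match now.saturating_sub(last_heartbeat) {
        t if t >= DOWN_AFTER => NodeState::Down,
        t if t >= DEGRADED_AFTER => NodeState::Degraded,
        _ => NodeState::Healthy,
    }
}

fn main() {
    // Inject a partition at t=100: heartbeats stop arriving after that point.
    let last = 100;
    assert_eq!(classify(last, 105), NodeState::Healthy);
    assert_eq!(classify(last, 120), NodeState::Degraded);
    assert_eq!(classify(last, 161), NodeState::Down); // allocations requeue here
    println!("Degraded → Down transition observed under injected partition");
}
```

Because the classification is a pure function of timestamps, the chaos harness can drive it with simulated clocks instead of real wall-clock waits.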

Chaos Test Scenarios

| Scenario | Steps | Expected Outcome |
|---|---|---|
| Leader election under load | Submit 50 allocations, kill leader mid-cycle | New leader elected < 5s, no proposals lost, all allocations eventually scheduled |
| Node failure with requeue | Start 10 allocations, kill 2 node agents | Allocations requeued, rescheduled on healthy nodes, total delay < 2 min |
| Split-brain prevention | Partition 3-member quorum into 1+2 | Minority (1) cannot commit, majority (2) continues, no divergent state |
| Cascade failure | Kill 3 node agents simultaneously | Allocations on all 3 nodes requeued, scheduling continues for remaining nodes |
| Sensitive node failure | Kill sensitive node agent | Extended grace period, operator alert, no auto-requeue |
| Recovery from full quorum loss | Kill all quorum members, restore from snapshot | State restored, node agents reconnect, scheduling resumes |

Execution

Chaos tests run in CI on a dedicated stage (not on every commit):

  • Nightly: full chaos suite
  • On release branch: full chaos suite must pass

Performance Benchmarks

Scheduling Cycle Latency

| Benchmark | Configuration | Target |
|---|---|---|
| 100 pending allocations, 1000 nodes | HPC backfill | Cycle < 5s |
| 500 pending allocations, 5000 nodes | HPC backfill | Cycle < 15s |
| 1000 pending allocations, 10000 nodes | HPC backfill | Cycle < 30s |
| Raft commit (single proposal) | 3-member quorum | p99 < 50ms |
| Raft commit (single proposal) | 5-member quorum | p99 < 100ms |
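The shape of a cycle-latency measurement, with a toy scoring loop standing in for the real scheduler (a production benchmark would use a harness like criterion for warm-up and statistical sampling rather than a single `Instant` reading):

```rust
use std::time::Instant;

/// Toy stand-in for one scheduling cycle: score every (allocation, node)
/// pair. The arithmetic is meaningless; only the measurement shape matters.
fn scheduling_cycle(pending: usize, nodes: usize) -> u64 {
    let mut best = 0u64;
    for a in 0..pending as u64 {
        for n in 0..nodes as u64 {
            best = best.max(a.wrapping_mul(31).wrapping_add(n) % 1000);
        }
    }
    best
}

fn main() {
    let start = Instant::now();
    let _ = scheduling_cycle(100, 1000);
    let elapsed = start.elapsed();
    println!("cycle took {:?}", elapsed);
    assert!(elapsed.as_secs() < 5); // the "100 pending / 1000 nodes" target
}
```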

Load Tests

| Test | Description | Target |
|---|---|---|
| API throughput | Concurrent submission requests | > 1000 req/s |
| Heartbeat load | 10000 node agents reporting | < 1% CPU on quorum |
| Log streaming | 100 concurrent log streams | < 5% CPU on API server |

CI Pipeline

On every commit:
  cargo fmt --check
  cargo clippy --all-targets
  cargo test (Level 1: unit tests)

On every PR:
  Level 1 + Level 2 (integration tests)
  Protobuf backward compatibility check

Nightly:
  Level 1 + Level 2 + Level 3 (RM-Replay regression) + Level 4 (chaos)
  Performance benchmarks (track regressions)

On release:
  All levels must pass
  Performance benchmarks must meet targets

Cross-References