Performance Tuning Guide

Design Principle

Tuning Lattice mostly means adjusting the cost function weights per vCluster. The RM-Replay simulator is the primary tool: capture production traces, replay them with different weights, measure outcomes, and deploy with confidence.

Cost Function Sensitivity

Weight Impact Matrix

Each cost function weight controls a trade-off. Increasing one weight reduces the influence of others:

| Weight Increased | Positive Effect | Negative Effect | When to Increase |
|---|---|---|---|
| w₁ (priority) | High-priority jobs scheduled faster | Low-priority jobs starve longer | Many priority levels with strict SLAs |
| w₂ (wait_time) | Better anti-starvation, fairer wait distribution | May schedule low-value jobs before high-value ones | Long tail of wait times |
| w₃ (fair_share) | Tenants get closer to contracted share | May reduce overall utilization (leaving resources idle) | Multi-tenant with strict fairness requirements |
| w₄ (topology) | Better placement, higher network performance | May increase wait time (holding out for ideal placement) | Network-sensitive workloads (NCCL, MPI allreduce) |
| w₅ (data_readiness) | Less I/O stall at job start | May delay jobs whose data isn't pre-staged | Large-dataset workloads |
| w₆ (backlog) | System responds to queue pressure | May destabilize scheduling when queue fluctuates | Bursty submission patterns |
| w₇ (energy) | Lower electricity costs | Jobs may wait for cheap-energy windows | Time-flexible workloads, sites with TOU pricing |
| w₈ (checkpoint) | More flexible resource rebalancing | Overhead from frequent checkpointing | Preemption-heavy environments |
| w₉ (conformance) | Fewer driver-mismatch issues | Fewer candidate nodes (smaller conformance groups) | Multi-node GPU workloads |
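
Conceptually, the weights combine per-candidate feature scores into a single cost. A minimal Python sketch, assuming feature scores f₁..f₉ normalized to [0, 1] (the feature values below are illustrative, not from a real deployment; the weights mirror the baseline hpc-batch profile shown later in this guide):

```python
# Baseline-style weights (one entry per cost function term).
weights = {
    "priority": 0.20, "wait_time": 0.25, "fair_share": 0.25,
    "topology": 0.15, "data_readiness": 0.10, "backlog": 0.05,
    "energy": 0.00, "checkpoint": 0.00, "conformance": 0.10,
}

# Hypothetical normalized feature scores for one candidate placement.
features = {
    "priority": 0.8, "wait_time": 0.6, "fair_share": 0.4,
    "topology": 0.9, "data_readiness": 1.0, "backlog": 0.2,
    "energy": 0.0, "checkpoint": 0.0, "conformance": 1.0,
}

# Weighted sum: raising one weight shrinks the relative influence of the rest.
score = sum(weights[k] * features[k] for k in weights)
print(round(score, 3))  # → 0.755
```

Because the score is a weighted sum, only the weights' relative sizes matter: doubling every weight changes no scheduling decision.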

Common Trade-offs

Throughput vs. Fairness (w₃):

  • Low w₃ (0.05): maximize utilization — schedule whatever fits, regardless of tenant share
  • High w₃ (0.35): enforce fairness — tenants below their share get priority even if it means idle resources

Typical compromise: w₃ = 0.15-0.25

Wait Time vs. Topology (w₂ vs. w₄):

  • High w₂, low w₄: schedule quickly in any topology — reduces wait but may hurt network performance
  • Low w₂, high w₄: wait for good topology — increases wait but improves job runtime

Typical for HPC: w₂ = 0.25, w₄ = 0.15
Typical for ML training: w₂ = 0.10, w₄ = 0.30

Utilization vs. Energy (w₇):

  • w₇ = 0.00: schedule immediately regardless of energy cost (default for most sites)
  • w₇ = 0.10-0.15: delay time-flexible jobs to cheap-energy windows

Only relevant for sites with significant time-of-use electricity pricing.

Using RM-Replay

Overview

RM-Replay replays production workload traces through the scheduler in simulation mode. No real resources are used. Simulation runs in seconds, not hours.

Reference: Martinasso et al., “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management” (SC18).

Step 1: Capture Traces

Record workload traces from production (or synthetic workloads):

# Enable trace capture (writes to S3)
lattice admin config set scheduler.trace_capture=true
lattice admin config set scheduler.trace_path="s3://lattice-traces/"

# Capture for a representative period (1 week recommended)
# Traces include:
#   - Allocation submissions (arrival time, resources, constraints, tenant, priority)
#   - Allocation completions (actual duration, exit status)
#   - Node inventory (capabilities, topology, conformance groups)

Trace format is a timestamped event log (JSON lines):

{"ts": "2026-03-01T00:00:01Z", "type": "submit", "alloc": {"nodes": 64, "gpu_type": "GH200", "walltime": "72h", "tenant": "physics", "priority": 4}}
{"ts": "2026-03-01T00:00:05Z", "type": "complete", "alloc_id": "abc-123", "duration": "68h", "exit": 0}
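
Because the trace is plain JSON lines, it is easy to inspect with standard tooling. A minimal sketch using the two example events above:

```python
import json

# The two example trace events from above, as a JSON-lines string.
trace = '''{"ts": "2026-03-01T00:00:01Z", "type": "submit", "alloc": {"nodes": 64, "gpu_type": "GH200", "walltime": "72h", "tenant": "physics", "priority": 4}}
{"ts": "2026-03-01T00:00:05Z", "type": "complete", "alloc_id": "abc-123", "duration": "68h", "exit": 0}'''

# Parse one JSON object per line, then filter by event type.
events = [json.loads(line) for line in trace.splitlines()]
submits = [e for e in events if e["type"] == "submit"]

print(len(events), submits[0]["alloc"]["tenant"])  # → 2 physics
```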

Step 2: Configure Weights

Create weight profiles to compare:

# profiles/baseline.yaml (current production weights)
hpc-batch:
  priority: 0.20
  wait_time: 0.25
  fair_share: 0.25
  topology: 0.15
  data_readiness: 0.10
  backlog: 0.05
  energy: 0.00
  checkpoint: 0.00
  conformance: 0.10

# profiles/fairness-boost.yaml (experiment: more fairness)
hpc-batch:
  priority: 0.15
  wait_time: 0.20
  fair_share: 0.35        # increased
  topology: 0.15
  data_readiness: 0.10
  backlog: 0.05
  energy: 0.00
  checkpoint: 0.00
  conformance: 0.10
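
Before replaying, it can help to sanity-check exactly what an experimental profile changes relative to baseline. A small sketch using plain dicts that mirror the YAML profiles above (in practice you would load the YAML files themselves):

```python
# Baseline hpc-batch weights, mirroring profiles/baseline.yaml.
baseline = {
    "priority": 0.20, "wait_time": 0.25, "fair_share": 0.25,
    "topology": 0.15, "data_readiness": 0.10, "backlog": 0.05,
    "energy": 0.00, "checkpoint": 0.00, "conformance": 0.10,
}

# fairness-boost: raise fair_share, lower priority and wait_time.
experiment = dict(baseline, priority=0.15, wait_time=0.20, fair_share=0.35)

# Report only the weights that differ, as (old, new) pairs.
changed = {k: (baseline[k], experiment[k])
           for k in baseline if baseline[k] != experiment[k]}
print(changed)
```

Keeping the diff small per experiment makes the replay results easier to attribute to a single trade-off.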

Step 3: Replay

# Replay with baseline weights
rm-replay --trace=traces/week-2026-03.jsonl \
          --weights=profiles/baseline.yaml \
          --nodes=inventory/alps.yaml \
          --output=results/baseline/

# Replay with experimental weights
rm-replay --trace=traces/week-2026-03.jsonl \
          --weights=profiles/fairness-boost.yaml \
          --nodes=inventory/alps.yaml \
          --output=results/fairness-boost/

Step 4: Evaluate

RM-Replay produces a summary report:

=== RM-Replay Results: fairness-boost ===

Utilization:
  GPU-hours consumed: 1,234,567 / 1,500,000 available (82.3%)
  ↓ 2.1% vs baseline (84.4%)

Wait Time:
  p50: 12 min  (baseline: 10 min)  ↑ 20%
  p95: 2.1 hr  (baseline: 2.5 hr)  ↓ 16%
  p99: 8.3 hr  (baseline: 12.1 hr) ↓ 31%

Fairness (Jain's Index):
  0.94 (baseline: 0.87)  ↑ 8%

Tenant Share Deviation:
  Max deviation: 3.2%  (baseline: 8.7%)  ↓ 63%

Backfill:
  Backfill jobs: 342 (baseline: 367)  ↓ 7%

Preemptions:
  Total: 15 (baseline: 12)  ↑ 25%
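
Jain's index, reported above, is defined as (Σx)² / (n · Σx²) over per-tenant allocations (for example, GPU-hours normalized by contracted share; the sample values below are hypothetical). It ranges over (0, 1], with 1.0 meaning perfectly even:

```python
def jains_index(xs):
    # Jain's fairness index: (sum x)^2 / (n * sum x^2).
    # 1.0 = perfectly even; 1/n = one tenant gets everything.
    n = len(xs)
    return sum(xs) ** 2 / (n * sum(x * x for x in xs))

# Equal normalized shares are perfectly fair.
print(jains_index([1.0, 1.0, 1.0, 1.0]))  # → 1.0

# Skewed shares pull the index down.
print(round(jains_index([1.0, 0.5, 1.5, 0.3]), 2))  # → 0.76
```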

Step 5: Decide and Deploy

Compare results across profiles. When satisfied:

# Deploy new weights (hot-reloadable, no restart)
lattice admin vcluster set-weights --name=hpc-batch \
  --priority=0.15 --wait-time=0.20 --fair-share=0.35 \
  --topology=0.15 --data-readiness=0.10 --backlog=0.05 \
  --energy=0.00 --checkpoint=0.00 --conformance=0.10

Weights take effect on the next scheduling cycle.

Scheduling Cycle Tuning

The scheduling cycle interval affects responsiveness vs. overhead:

| Interval | Effect | Recommended For |
|---|---|---|
| 5s | Fast scheduling, higher CPU on scheduler | Interactive vCluster, small clusters |
| 15s | Balanced | HPC batch, ML training |
| 30s | Lower overhead, slower response | Large clusters (5000+ nodes), service vCluster |

lattice admin vcluster set-config --name=hpc-batch --cycle-interval=15s

Backfill Tuning

Backfill depth controls how many future reservations the solver considers:

| Depth | Effect |
|---|---|
| 0 | No backfill (only first-fit): simple but low utilization |
| 10 | Moderate backfill: good balance |
| 50 | Deep backfill: higher utilization but longer cycle time |

For most sites, depth 10-20 is optimal. Increase if utilization is below target.
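
To build intuition for the depth parameter, a toy sketch: the head-of-queue job cannot start, so the scheduler scans up to `depth` jobs behind it for ones that fit the free capacity now. (This toy ignores walltime reservations; real backfill only starts jobs that will not delay the head job's reservation.)

```python
def backfill(queue, free, depth):
    # queue: list of (job_name, nodes_needed), head first.
    # Scan up to `depth` jobs behind the head; start any that fit `free` nodes.
    started = []
    for job, nodes in queue[1:depth + 1]:
        if nodes <= free:
            started.append(job)
            free -= nodes
    return started

# Head job A needs 8 nodes but only 4 are free, so it must wait.
queue = [("A", 8), ("B", 2), ("C", 3), ("D", 1), ("E", 2)]

print(backfill(queue, free=4, depth=0))  # → [] (no backfill)
print(backfill(queue, free=4, depth=3))  # → ['B', 'D']
```

Deeper scans find more fillers (here B and D squeeze into the 4 free nodes), at the cost of a longer scheduling cycle.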

Conformance Group Sizing

If conformance groups are too small (many distinct fingerprints), multi-node jobs have fewer candidate sets:

  • Symptom: High wait times for multi-node jobs, f₉ scores consistently low
  • Diagnosis: lattice nodes -o wide shows many distinct conformance hashes
  • Fix: Coordinate with OpenCHAMI to standardize firmware versions. Prioritize GPU driver and NIC firmware alignment.
  • Workaround: Reduce w₉ for tolerant workloads (services, interactive)
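
The diagnosis step amounts to counting distinct fingerprints across the inventory. A hypothetical sketch (node names and fingerprints are invented for illustration; in practice the data would come from the node listing):

```python
from collections import Counter

# Hypothetical inventory records: (node_name, conformance_fingerprint).
nodes = [
    ("nid001", "fp-a1"), ("nid002", "fp-a1"), ("nid003", "fp-b2"),
    ("nid004", "fp-a1"), ("nid005", "fp-c3"), ("nid006", "fp-b2"),
]

# Group size per fingerprint; many small groups means few candidate
# sets for multi-node jobs that must land inside one group.
groups = Counter(fp for _, fp in nodes)
print(len(groups), groups.most_common(1))  # → 3 [('fp-a1', 3)]
```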

Cross-References