Performance Tuning Guide
Design Principle
Tuning Lattice mostly means adjusting the cost function weights for each vCluster. The RM-Replay simulator is the primary tool: capture production traces, replay them with different weights, compare the outcomes, and deploy once the results support the change.
Cost Function Sensitivity
Weight Impact Matrix
Each cost function weight controls a trade-off. Increasing one weight reduces the influence of others:
| Weight Increased | Positive Effect | Negative Effect | When to Increase |
|---|---|---|---|
| w₁ (priority) | High-priority jobs scheduled faster | Low-priority jobs starve longer | Many priority levels with strict SLAs |
| w₂ (wait_time) | Better anti-starvation, fairer wait distribution | May schedule low-value jobs before high-value ones | Long tail of wait times |
| w₃ (fair_share) | Tenants get closer to contracted share | May reduce overall utilization (leaving resources idle) | Multi-tenant with strict fairness requirements |
| w₄ (topology) | Better placement, higher network performance | May increase wait time (holding out for ideal placement) | Network-sensitive workloads (NCCL, MPI allreduce) |
| w₅ (data_readiness) | Less I/O stall at job start | May delay jobs whose data isn’t pre-staged | Large-dataset workloads |
| w₆ (backlog) | System responds to queue pressure | May destabilize scheduling when queue fluctuates | Bursty submission patterns |
| w₇ (energy) | Lower electricity costs | Jobs may wait for cheap-energy windows | Time-flexible workloads, sites with TOU pricing |
| w₈ (checkpoint) | More flexible resource rebalancing | Overhead from frequent checkpointing | Preemption-heavy environments |
| w₉ (conformance) | Fewer driver-mismatch issues | Fewer candidate nodes (smaller conformance groups) | Multi-node GPU workloads |
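To make the matrix concrete, here is a minimal sketch of how raising one weight can reorder two queued jobs. It assumes the cost function is a weighted sum of per-component scores in [0, 1]; the authoritative definition lives in scheduling-algorithm.md. The job names, component scores, and the two-weight profiles are invented for illustration.

```python
# Sketch only: assumes score = sum(w_i * f_i) with each f_i in [0, 1].
# The real cost function is defined in scheduling-algorithm.md.

def score(weights, components):
    """Weighted sum over the components named in `weights`."""
    return sum(w * components.get(name, 0.0) for name, w in weights.items())

# Hypothetical component scores for two queued jobs.
JOBS = {
    "job_a": {"priority": 0.9, "fair_share": 0.2},  # urgent, tenant over its share
    "job_b": {"priority": 0.3, "fair_share": 0.9},  # routine, tenant under its share
}

# Two hypothetical profiles that differ only in the fair_share weight.
low_fairness = {"priority": 0.20, "fair_share": 0.05}
high_fairness = {"priority": 0.20, "fair_share": 0.35}

def ranking(weights):
    """Job names ordered from highest to lowest score."""
    return sorted(JOBS, key=lambda name: score(weights, JOBS[name]), reverse=True)

print(ranking(low_fairness))   # ['job_a', 'job_b']
print(ranking(high_fairness))  # ['job_b', 'job_a']
```

With the fair_share weight at 0.05 the high-priority job ranks first; at 0.35 the job from the under-served tenant overtakes it, which is the "positive effect / negative effect" pair in the w₃ row.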
Common Trade-offs
Throughput vs. Fairness (w₃):
- Low w₃ (0.05): maximize utilization — schedule whatever fits, regardless of tenant share
- High w₃ (0.35): enforce fairness — tenants below their share get priority even if it means idle resources
Typical compromise: w₃ = 0.15-0.25
Wait Time vs. Topology (w₂ vs. w₄):
- High w₂, low w₄: schedule quickly in any topology — reduces wait but may hurt network performance
- Low w₂, high w₄: wait for good topology — increases wait but improves job runtime
Typical for HPC: w₂ = 0.25, w₄ = 0.15
Typical for ML training: w₂ = 0.10, w₄ = 0.30
Utilization vs. Energy (w₇):
- w₇ = 0.00: schedule immediately regardless of energy cost (default for most sites)
- w₇ = 0.10-0.15: delay time-flexible jobs to cheap-energy windows
Only relevant for sites with significant time-of-use electricity pricing.
Using RM-Replay
Overview
RM-Replay replays production workload traces through the scheduler in simulation mode. No real resources are used. Simulation runs in seconds, not hours.
Reference: Martinasso et al., “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management” (SC18).
Step 1: Capture Traces
Record workload traces from production (or synthetic workloads):
# Enable trace capture (writes to S3)
lattice admin config set scheduler.trace_capture=true
lattice admin config set scheduler.trace_path="s3://lattice-traces/"
# Capture for a representative period (1 week recommended)
# Traces include:
# - Allocation submissions (arrival time, resources, constraints, tenant, priority)
# - Allocation completions (actual duration, exit status)
# - Node inventory (capabilities, topology, conformance groups)
Trace format is a timestamped event log (JSON lines):
{"ts": "2026-03-01T00:00:01Z", "type": "submit", "alloc": {"nodes": 64, "gpu_type": "GH200", "walltime": "72h", "tenant": "physics", "priority": 4}}
{"ts": "2026-03-01T00:00:05Z", "type": "complete", "alloc_id": "abc-123", "duration": "68h", "exit": 0}
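Before replaying a trace, it can help to sanity-check what it contains. The following is a rough sketch that reads the JSON-lines format shown above and totals requested node-hours per tenant. It relies only on the fields visible in the two sample events, and it assumes durations always use the `h` suffix seen there; real traces may carry other units or event types.

```python
import json
from collections import Counter

def parse_hours(text):
    """Convert a duration such as "72h" to hours (only the 'h' suffix is handled)."""
    if not text.endswith("h"):
        raise ValueError(f"unsupported duration: {text!r}")
    return float(text[:-1])

def summarize(lines):
    """Count events by type and total requested node-hours per tenant."""
    events = Counter()
    node_hours = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        events[event["type"]] += 1
        if event["type"] == "submit":
            alloc = event["alloc"]
            node_hours[alloc["tenant"]] += alloc["nodes"] * parse_hours(alloc["walltime"])
    return events, node_hours

# The two sample events from the trace format above.
sample = [
    '{"ts": "2026-03-01T00:00:01Z", "type": "submit", "alloc": {"nodes": 64, '
    '"gpu_type": "GH200", "walltime": "72h", "tenant": "physics", "priority": 4}}',
    '{"ts": "2026-03-01T00:00:05Z", "type": "complete", "alloc_id": "abc-123", '
    '"duration": "68h", "exit": 0}',
]

events, node_hours = summarize(sample)
print(dict(events))      # {'submit': 1, 'complete': 1}
print(dict(node_hours))  # {'physics': 4608.0}
```

To run it against a captured file, pass an open file handle instead of `sample`, since iterating a text file yields one line at a time.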
Step 2: Configure Weights
Create weight profiles to compare:
# profiles/baseline.yaml (current production weights)
hpc-batch:
  priority: 0.20
  wait_time: 0.25
  fair_share: 0.25
  topology: 0.15
  data_readiness: 0.10
  backlog: 0.05
  energy: 0.00
  checkpoint: 0.00
  conformance: 0.10
# profiles/fairness-boost.yaml (experiment: more fairness)
hpc-batch:
  priority: 0.15
  wait_time: 0.20
  fair_share: 0.35  # increased
  topology: 0.15
  data_readiness: 0.10
  backlog: 0.05
  energy: 0.00
  checkpoint: 0.00
  conformance: 0.10
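When several profiles are in play, it is easy to lose track of what an experiment actually changes. As an illustration, the sketch below diffs the two profiles above. The weights are copied into plain dictionaries to keep the example free of dependencies; loading the YAML files is left out, and `diff_profiles` is a hypothetical helper rather than part of Lattice.

```python
# Weights for the hpc-batch vCluster, copied from the two profiles above.
BASELINE = {
    "priority": 0.20, "wait_time": 0.25, "fair_share": 0.25,
    "topology": 0.15, "data_readiness": 0.10, "backlog": 0.05,
    "energy": 0.00, "checkpoint": 0.00, "conformance": 0.10,
}
FAIRNESS_BOOST = {
    "priority": 0.15, "wait_time": 0.20, "fair_share": 0.35,
    "topology": 0.15, "data_readiness": 0.10, "backlog": 0.05,
    "energy": 0.00, "checkpoint": 0.00, "conformance": 0.10,
}

def diff_profiles(old, new):
    """Return {weight: change} for every weight whose value differs."""
    return {
        name: round(new[name] - old[name], 4)
        for name in old
        if abs(new[name] - old[name]) > 1e-9
    }

changes = diff_profiles(BASELINE, FAIRNESS_BOOST)
print(changes)  # {'priority': -0.05, 'wait_time': -0.05, 'fair_share': 0.1}

# Compare the totals to see whether weight was shifted or added.
print(round(sum(BASELINE.values()), 4), round(sum(FAIRNESS_BOOST.values()), 4))
```

Here the 0.10 added to `fair_share` is taken in equal parts from `priority` and `wait_time`, so both profiles have the same total and any difference in the replay results comes from the redistribution alone.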
Step 3: Replay
# Replay with baseline weights
rm-replay --trace=traces/week-2026-03.jsonl \
  --weights=profiles/baseline.yaml \
  --nodes=inventory/alps.yaml \
  --output=results/baseline/
# Replay with experimental weights
rm-replay --trace=traces/week-2026-03.jsonl \
  --weights=profiles/fairness-boost.yaml \
  --nodes=inventory/alps.yaml \
  --output=results/fairness-boost/
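With more than two profiles, the replays can be driven from a small script. The sketch below builds the same command lines as above, using only the four flags shown there. The `profiles/<name>.yaml` and `results/<name>/` layout is taken from the examples, and the helper names are hypothetical. It defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess

def build_replay_cmd(profile,
                     trace="traces/week-2026-03.jsonl",
                     nodes="inventory/alps.yaml"):
    """Build the rm-replay argument list for one weight profile."""
    return [
        "rm-replay",
        f"--trace={trace}",
        f"--weights=profiles/{profile}.yaml",
        f"--nodes={nodes}",
        f"--output=results/{profile}/",
    ]

def replay_all(profiles, dry_run=True):
    """Run one replay per profile, or only print the commands when dry_run is set."""
    for profile in profiles:
        cmd = build_replay_cmd(profile)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a replay exits non-zero.
            subprocess.run(cmd, check=True)

replay_all(["baseline", "fairness-boost"])
```

Replaying every profile against the same trace and node inventory keeps the comparison fair: the weights are the only variable.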
Step 4: Evaluate
RM-Replay produces a summary report:
=== RM-Replay Results: fairness-boost ===
Utilization:
GPU-hours consumed: 1,234,567 / 1,500,000 available (82.3%)
↓ 2.1% vs baseline (84.4%)
Wait Time:
p50: 12 min (baseline: 10 min) ↑ 20%
p95: 2.1 hr (baseline: 2.5 hr) ↓ 16%
p99: 8.3 hr (baseline: 12.1 hr) ↓ 31%
Fairness (Jain's Index):
0.94 (baseline: 0.87) ↑ 8%
Tenant Share Deviation:
Max deviation: 3.2% (baseline: 8.7%) ↓ 63%
Backfill:
Backfill jobs: 342 (baseline: 367) ↓ 7%
Preemptions:
Total: 15 (baseline: 12) ↑ 25%
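For readers unfamiliar with the fairness metric in the report, Jain's index for n values x is (Σx)² / (n · Σx²). It ranges from 1/n (one party receives everything) to 1.0 (all equal). The sketch below computes it and reproduces one of the relative changes from the report. Which per-tenant quantity RM-Replay feeds into the index is not stated here; a common choice, assumed in this sketch, is each tenant's usage divided by its contracted share.

```python
def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(values)
    squares = sum(x * x for x in values)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero value")
    total = sum(values)
    return total * total / (n * squares)

def pct_change(new, old):
    """Relative change of `new` against `old`, in percent."""
    return 100.0 * (new - old) / old

# Assumed input: per-tenant usage divided by contracted share.
print(jain_index([1.0, 1.0, 1.0, 1.0]))  # 1.0  (every tenant exactly at its share)
print(jain_index([1.0, 0.0, 0.0, 0.0]))  # 0.25 (one of four tenants gets everything)

# p50 wait time from the report: 12 min against a 10 min baseline.
print(round(pct_change(12, 10)))  # 20
```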
Step 5: Decide and Deploy
Compare results across profiles. Once a profile meets your targets, deploy it:
# Deploy new weights (hot-reloadable, no restart)
lattice admin vcluster set-weights --name=hpc-batch \
  --priority=0.15 --wait-time=0.20 --fair-share=0.35 \
  --topology=0.15 --data-readiness=0.10 --backlog=0.05 \
  --energy=0.00 --checkpoint=0.00 --conformance=0.10
Weights take effect on the next scheduling cycle.
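What counts as good enough to deploy is site policy. One way to make that decision repeatable is to write the acceptance criteria down as explicit guardrails, as in the sketch below. The thresholds, metric names, and the `acceptable` helper are all hypothetical; the numbers are taken from the fairness-boost report in Step 4.

```python
def acceptable(baseline, candidate,
               max_util_drop_pts=3.0, max_p50_increase_pct=25.0):
    """Check a candidate profile against hypothetical guardrails.

    Returns (ok, reasons), where reasons lists every guardrail that failed.
    """
    reasons = []

    util_drop = baseline["utilization_pct"] - candidate["utilization_pct"]
    if util_drop > max_util_drop_pts:
        reasons.append(f"utilization dropped {util_drop:.1f} points")

    p50_increase = 100.0 * (candidate["p50_wait_min"] - baseline["p50_wait_min"]) \
        / baseline["p50_wait_min"]
    if p50_increase > max_p50_increase_pct:
        reasons.append(f"p50 wait rose {p50_increase:.0f}%")

    if candidate["jain_index"] <= baseline["jain_index"]:
        reasons.append("fairness did not improve")

    return (not reasons, reasons)

# Metrics copied from the Step 4 report.
baseline = {"utilization_pct": 84.4, "p50_wait_min": 10, "jain_index": 0.87}
fairness_boost = {"utilization_pct": 82.3, "p50_wait_min": 12, "jain_index": 0.94}

ok, reasons = acceptable(baseline, fairness_boost)
print(ok, reasons)  # True []
```

Under these example thresholds the fairness-boost profile passes: it gives up 2.1 points of utilization and 20% on median wait in exchange for a higher Jain's index. A site that tolerates at most a 2-point utilization drop would reject the same profile.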
Scheduling Cycle Tuning
The scheduling cycle interval trades responsiveness against scheduler overhead:
| Interval | Effect | Recommended For |
|---|---|---|
| 5s | Fast scheduling, higher CPU on scheduler | Interactive vCluster, small clusters |
| 15s | Balanced | HPC batch, ML training |
| 30s | Lower overhead, slower response | Large clusters (5000+ nodes), service vCluster |
lattice admin vcluster set-config --name=hpc-batch --cycle-interval=15s
Backfill Tuning
Backfill depth controls how many future reservations the solver considers:
| Depth | Effect |
|---|---|
| 0 | No backfill (only first-fit) — simple but low utilization |
| 10 | Moderate backfill — good balance |
| 50 | Deep backfill — higher utilization but longer cycle time |
For most sites, a depth of 10-20 works well. Increase it if utilization is below target, keeping in mind that deeper backfill lengthens the cycle time.
Conformance Group Sizing
If conformance groups are too small (many distinct fingerprints), multi-node jobs have fewer candidate sets:
- Symptom: High wait times for multi-node jobs, f₉ scores consistently low
- Diagnosis: `lattice nodes -o wide` shows many distinct conformance hashes
- Fix: Coordinate with OpenCHAMI to standardize firmware versions. Prioritize GPU driver and NIC firmware alignment.
- Workaround: Reduce w₉ for tolerant workloads (services, interactive)
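The effect of fragmentation on multi-node jobs can be estimated from the node inventory. The sketch below groups nodes by conformance hash and reports which groups are large enough to host a job of a given size. The node names and hashes are invented, and the sketch assumes a job scores best on f₉ when all of its nodes come from a single group; it ignores whether those nodes are currently free.

```python
from collections import Counter

def usable_groups(node_hashes, job_nodes):
    """Return {conformance_hash: size} for groups with at least job_nodes nodes."""
    sizes = Counter(node_hashes.values())
    return {h: n for h, n in sizes.items() if n >= job_nodes}

# Hypothetical inventory: node name -> conformance hash.
nodes = {
    "n001": "a1f3", "n002": "a1f3", "n003": "a1f3", "n004": "a1f3",
    "n005": "9bc0", "n006": "9bc0",
    "n007": "77e2",
    "n008": "d410",
}

print(len(set(nodes.values())))  # 4 distinct fingerprints across 8 nodes
print(usable_groups(nodes, 2))   # {'a1f3': 4, '9bc0': 2}
print(usable_groups(nodes, 4))   # {'a1f3': 4}
print(usable_groups(nodes, 8))   # {}
```

In this invented inventory an 8-node job has no fully conformant placement at all, even though 8 nodes exist. Standardizing firmware so that groups merge is what restores those candidate sets.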
Cross-References
- scheduling-algorithm.md — Cost function definition, weight profiles
- testing-strategy.md — RM-Replay regression suite
- conformance.md — Conformance groups and drift
- telemetry.md — Scheduler self-monitoring metrics for observing tuning impact