Lattice
A distributed workload scheduler for large-scale scientific computing, AI/ML training, inference services, and regulated workloads.
Lattice schedules both finite jobs (batch training, simulations) and infinite jobs (inference services, monitoring) on shared HPC infrastructure with topology-aware placement, federated multi-site operation, and a unified API for human users and autonomous agents.
Architecture at a Glance
| Plane | Components |
|---|---|
| User Plane | lattice-cli + lattice-api (OIDC via hpc-auth) |
| Software Plane | uenv (SquashFS) + Sarus (OCI) + Registry |
| Scheduling Plane | Raft Quorum + vCluster Schedulers (knapsack) |
| Data Plane | VAST (NFS/S3) tiered storage + data mover |
| Network Fabric | Slingshot / Ultra Ethernet (libfabric) |
| Node Plane | Node Agent + mount namespaces + eBPF telemetry |
| Infrastructure | OpenCHAMI (Redfish BMC, boot, inventory) |
Start with System Architecture for the full picture, or jump to API Design to see how users interact with the system.
Source Code
The project is organized as a Rust workspace with 9 crates:
| Crate | Purpose |
|---|---|
| lattice-common | Shared types, config, protobuf bindings |
| lattice-quorum | Raft consensus, global state machine, audit log |
| lattice-scheduler | vCluster schedulers, knapsack solver, cost function |
| lattice-api | gRPC + REST server, OIDC, RBAC, mTLS |
| lattice-checkpoint | Checkpoint broker, cost evaluator |
| lattice-node-agent | Per-node daemon, GPU discovery, eBPF telemetry |
| lattice-cli | CLI binary (submit, status, cancel, session, telemetry) |
| lattice-test-harness | Shared mocks, fixtures, builders |
| lattice-acceptance | BDD scenarios and property tests |
Plus a Python SDK, an RM-Replay simulator, and deployment configs in infra/.
Getting Started
Overview
Lattice is a distributed workload scheduler for HPC and AI infrastructure. It schedules both batch jobs (training runs, simulations) and long-running services (inference endpoints, monitoring) on shared GPU-accelerated clusters.
If you’re coming from Slurm, most concepts map directly — see the Slurm migration guide for a quick comparison.
Prerequisites
- A running Lattice cluster (ask your admin for the API endpoint)
- The lattice CLI installed on your workstation or login node
- Your tenant credentials (OIDC token or mTLS certificate)
Installing the CLI
# Determine architecture
ARCH=$(uname -m | sed 's/aarch64/arm64/')
# Download from GitHub Releases
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-${ARCH}.tar.gz" | tar xz
sudo mv lattice /usr/local/bin/
# Or build from source
cargo build --release -p lattice-cli
sudo cp target/release/lattice /usr/local/bin/
Configuration
Create ~/.config/lattice/config.yaml:
endpoint: "lattice-api.example.com:50051"
tenant: "my-team"
# Optional: default vCluster
vcluster: "gpu-batch"
Or use environment variables:
export LATTICE_ENDPOINT="lattice-api.example.com:50051"
export LATTICE_TENANT="my-team"
Your First Job
Submit a batch script
lattice submit train.sh
# Submitted allocation a1b2c3d4
Check status
lattice status
# ID NAME STATE NODES WALLTIME ELAPSED VCLUSTER
# a1b2c3d4 train.sh Running 4 24:00:00 00:12:34 gpu-batch
View logs
lattice logs a1b2c3d4
# [2026-03-05T10:00:12Z] Epoch 1/100, loss=2.341
# [2026-03-05T10:01:45Z] Epoch 2/100, loss=1.892
Cancel a job
lattice cancel a1b2c3d4
Next Steps
- Submitting Workloads — detailed submission options
- Interactive Sessions — attach a terminal to running jobs
- DAG Workflows — multi-step pipelines with dependencies
- Python SDK — programmatic access from notebooks and agents
Submitting Workloads
Basic Submission
# Run a script on 4 nodes for up to 24 hours
lattice submit --nodes=4 --walltime=24h train.sh
# With GPU constraints
lattice submit --nodes=8 --walltime=72h --constraint="gpu_type=GH200" -- torchrun train.py
# With a software environment (uenv)
lattice submit --nodes=2 --uenv=prgenv-gnu/24.11:v1 -- make -j run
Script Directives
Lattice parses #LATTICE directives from your script (and #SBATCH for compatibility):
#!/bin/bash
#LATTICE --nodes=64
#LATTICE --walltime=72h
#LATTICE --uenv=prgenv-gnu/24.11:v1
#LATTICE --vcluster=ml-training
#LATTICE --tenant=physics
#LATTICE --name=large-training-run
torchrun --nproc_per_node=4 train.py --data /scratch/dataset
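The directive scan above can be sketched roughly as follows. This is a hypothetical helper for illustration, not the actual lattice-cli parser: it reads `#LATTICE` (or `#SBATCH`) comment lines and collects `--key=value` options, treating bare flags as booleans.

```python
import re

def parse_directives(script: str) -> dict:
    """Collect --key[=value] options from #LATTICE and #SBATCH comment lines."""
    opts = {}
    for line in script.splitlines():
        m = re.match(r"#(?:LATTICE|SBATCH)\s+--([\w-]+)(?:=(.*))?$", line.strip())
        if m:
            key, value = m.groups()
            opts[key] = value if value is not None else True  # bare flags become True
    return opts

script = """#!/bin/bash
#LATTICE --nodes=64
#LATTICE --walltime=72h
#LATTICE --name=large-training-run
torchrun --nproc_per_node=4 train.py
"""
assert parse_directives(script) == {
    "nodes": "64", "walltime": "72h", "name": "large-training-run"}
```

Command lines like the `torchrun` invocation are ignored; only leading comment directives contribute options.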
Resource Constraints
# GPU type
lattice submit --constraint="gpu_type=GH200,gpu_count=4" script.sh
# Memory requirements
lattice submit --constraint="memory_gb>=512" script.sh
# Require unified memory (GH200/MI300A superchip)
lattice submit --constraint="require_unified_memory" script.sh
# Prefer same NUMA domain
lattice submit --constraint="prefer_same_numa" script.sh
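A constraint string like the ones above can be read as a comma-separated list of comparisons and boolean flags. The sketch below (hypothetical helper, assumed semantics) shows one way to split it into `(key, operator, value)` triples:

```python
def parse_constraints(expr: str):
    """Split a --constraint string into (key, op, value) triples.
    Bare names like require_unified_memory are treated as boolean flags."""
    triples = []
    for part in expr.split(","):
        part = part.strip()
        for op in (">=", "<=", "="):
            if op in part:
                key, value = part.split(op, 1)
                triples.append((key, op, value))
                break
        else:
            triples.append((part, "=", "true"))
    return triples

assert parse_constraints("gpu_type=GH200,gpu_count=4") == [
    ("gpu_type", "=", "GH200"), ("gpu_count", "=", "4")]
assert parse_constraints("memory_gb>=512") == [("memory_gb", ">=", "512")]
assert parse_constraints("require_unified_memory") == [
    ("require_unified_memory", "=", "true")]
```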
Task Groups (Job Arrays)
Submit multiple instances of the same job:
# 100 tasks, 20 running concurrently
lattice submit --task-group=0-99%20 sweep.sh
# Task index available as $LATTICE_TASK_INDEX
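The `0-99%20` spec breaks down into a task index range and a concurrency cap. A small sketch of that parsing (hypothetical helper name):

```python
def parse_task_group(spec: str):
    """Parse a START-END[%MAX] task-group spec into task indices and a concurrency cap."""
    rng, _, limit = spec.partition("%")
    start, _, end = rng.partition("-")
    indices = list(range(int(start), int(end) + 1))
    return indices, (int(limit) if limit else None)

indices, max_concurrent = parse_task_group("0-99%20")
assert len(indices) == 100 and indices[0] == 0 and indices[-1] == 99
assert max_concurrent == 20
```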
Dependencies
# Run after job succeeds
lattice submit --depends-on=a1b2c3d4:success postprocess.sh
# Run after job completes (success or failure)
lattice submit --depends-on=a1b2c3d4:any cleanup.sh
# Multiple dependencies
lattice submit --depends-on=job1:success,job2:success merge.sh
Data Staging
Lattice can pre-stage data to the hot tier before your job starts:
lattice submit --data-mount="s3://bucket/dataset:/data" --nodes=4 train.sh
The scheduler evaluates data readiness as part of the cost function — jobs with data already on the hot tier are prioritized.
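The docs do not spell out the cost function's exact form; as an illustrative toy model (both the weights and the linear shape are invented here), data readiness might enter placement priority like this:

```python
def placement_score(queue_wait_s: float, data_ready_fraction: float,
                    w_wait: float = 1.0, w_data: float = 600.0) -> float:
    """Toy priority score: longer queue waits and hotter data both raise it."""
    return w_wait * queue_wait_s + w_data * data_ready_fraction

# A job whose dataset is fully staged outranks an identical job with cold data.
staged = placement_score(queue_wait_s=600, data_ready_fraction=1.0)
cold = placement_score(queue_wait_s=600, data_ready_fraction=0.1)
assert staged > cold
```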
Lifecycle Types
Bounded (batch) — default
lattice submit --walltime=24h train.sh
Job runs until completion or walltime, then terminates.
Unbounded (service)
lattice submit --service --expose=8080 serve.sh
Runs indefinitely. Exposed ports are reachable via the network domain.
Reactive (autoscaling)
lattice submit --reactive --min-nodes=1 --max-nodes=8 \
--scale-metric=gpu_utilization --scale-target=0.8 serve.sh
Automatically scales between min and max nodes based on the target metric.
Preemption Classes
Higher preemption class = harder to preempt:
# Best-effort (preempted first)
lattice submit --preemption-class=0 experiment.sh
# Normal priority (default: 5)
lattice submit train.sh
# High priority
lattice submit --preemption-class=8 critical-training.sh
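Victim selection under this scheme can be pictured as sorting candidates by class, lowest first. This is only a sketch; the real scheduler presumably weighs other factors too:

```python
jobs = [
    {"id": "experiment", "preemption_class": 0},  # best-effort
    {"id": "train", "preemption_class": 5},       # default
    {"id": "critical", "preemption_class": 8},    # high priority
]
# Lower class is preempted first, so ascending order gives the victim queue.
victims = sorted(jobs, key=lambda j: j["preemption_class"])
assert [j["id"] for j in victims] == ["experiment", "train", "critical"]
```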
Checkpointing
If your application supports checkpointing, declare it:
# Signal-based (receives SIGUSR1 before preemption)
lattice submit --checkpoint=signal train.sh
# gRPC callback
lattice submit --checkpoint=grpc --checkpoint-port=9999 train.sh
# Shared memory flag
lattice submit --checkpoint=shmem train.sh
# Non-preemptible (no checkpoint, never preempted)
lattice submit --no-preempt train.sh
Slurm Compatibility
Existing Slurm scripts work with minimal changes:
# These are equivalent
sbatch --nodes=4 --time=24:00:00 --partition=gpu train.sh
lattice submit --nodes=4 --walltime=24h --vcluster=gpu train.sh
Supported #SBATCH directives are automatically translated. See Slurm Migration for details.
Output Formats
# Default: human-readable table
lattice status
# JSON (for scripting)
lattice status -o json
# YAML
lattice status -o yaml
# Wide (more columns)
lattice status -o wide
Interactive Sessions
Interactive sessions give you a terminal attached to allocated compute nodes — similar to salloc + srun --pty in Slurm.
Creating a Session
# Basic interactive session (1 node, 4 hours)
lattice session --walltime=4h
# With GPU and software environment
lattice session --nodes=1 --constraint="gpu_type=GH200" --uenv=prgenv-gnu/24.11:v1
# Specify vCluster
lattice session --vcluster=interactive --walltime=2h
The session enters the queue like any other allocation. Once scheduled, your terminal automatically attaches to the first node.
Attaching to Running Allocations
You can attach a terminal to any running allocation (not just sessions):
# Attach to a running job
lattice attach a1b2c3d4
# Attach to a specific node in a multi-node allocation
lattice attach a1b2c3d4 --node=nid001234
# Run a specific command instead of a shell
lattice attach a1b2c3d4 -- htop
Multiple Terminals
You can open multiple terminals to the same allocation:
# Terminal 1
lattice attach a1b2c3d4
# Terminal 2 (different shell window)
lattice attach a1b2c3d4
Session Lifecycle
- Pending — waiting in the queue for resources
- Running — terminal is attached, you’re working
- Disconnected — if you lose connection, the session keeps running (use tmux/screen inside for persistence)
- Completed — walltime expired or you exited
Tips
- Use tmux or screen inside your session for disconnect resilience
- Sessions respect the same preemption rules as batch jobs — use --preemption-class=7 for important interactive work
- If preempted, you’ll see checkpoint progress in your terminal before disconnection
- The --walltime flag is mandatory for sessions (prevents runaway resource usage)
DAG Workflows
DAGs (Directed Acyclic Graphs) let you define multi-step pipelines where allocations depend on each other.
YAML Definition
# workflow.yaml
name: training-pipeline
allocations:
- name: preprocess
entrypoint: "python preprocess.py"
nodes: 2
walltime: "2h"
- name: train
entrypoint: "torchrun train.py"
nodes: 64
walltime: "72h"
uenv: "prgenv-gnu/24.11:v1"
depends_on:
- preprocess: success
- name: evaluate
entrypoint: "python eval.py"
nodes: 1
walltime: "1h"
depends_on:
- train: success
- name: notify-failure
entrypoint: "python notify.py --status=failed"
nodes: 1
walltime: "10m"
depends_on:
- train: failure
Submitting a DAG
lattice dag submit workflow.yaml
# Submitted DAG d1e2f3g4 with 4 allocations
Dependency Conditions
| Condition | Meaning |
|---|---|
| success | Run after dependency completes successfully |
| failure | Run after dependency fails |
| any | Run after dependency completes (success or failure) |
| corresponding | For task groups: task N depends on task N of the parent |
Monitoring DAGs
# DAG status overview
lattice dag status d1e2f3g4
# Detailed graph view
lattice dag status d1e2f3g4 --graph
# Output:
# preprocess [Completed] → train [Running] → evaluate [Pending]
# ↘ notify-failure [Pending]
Cancelling a DAG
# Cancel all allocations in the DAG
lattice dag cancel d1e2f3g4
Cancellation cascades — downstream allocations that haven’t started are cancelled automatically.
Failure Propagation
- If a success dependency fails, downstream allocations are cancelled
- If a failure dependency succeeds, those downstream allocations are skipped
- any dependencies always run regardless of upstream outcome
Limits
- Maximum 1000 allocations per DAG (configurable by admin)
- Cycles are rejected at submission time
- Duplicate allocation names within a DAG are rejected
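The duplicate-name and cycle checks can be sketched with Kahn's topological sort. This is an illustrative model, not Lattice's actual validator, and it ignores dependency conditions (success/failure/any), treating `depends_on` as a plain list of parent names:

```python
from collections import defaultdict, deque

def validate_dag(allocations):
    """Submission-time checks: reject duplicate names, then reject cycles via Kahn's algorithm."""
    names = [a["name"] for a in allocations]
    if len(names) != len(set(names)):
        raise ValueError("duplicate allocation names")
    known = set(names)
    parents = {a["name"]: [p for p in a.get("depends_on", []) if p in known]
               for a in allocations}
    indegree = {n: len(parents[n]) for n in names}
    children = defaultdict(list)
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    ready = deque(n for n in names if indegree[n] == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if visited != len(names):
        raise ValueError("cycle detected")
    return True

pipeline = [
    {"name": "preprocess"},
    {"name": "train", "depends_on": ["preprocess"]},
    {"name": "evaluate", "depends_on": ["train"]},
    {"name": "notify-failure", "depends_on": ["train"]},
]
assert validate_dag(pipeline)
```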
Monitoring & Observability
Allocation Status
# Your allocations
lattice status
# Specific allocation
lattice status a1b2c3d4
# Filter by state
lattice status --state=running
lattice status --state=pending
# All tenant allocations (requires permissions)
lattice status --all
# Watch mode (refreshes every 5 seconds)
lattice status --watch
lattice watch a1b2c3d4
Logs
# View logs (from S3 persistent store)
lattice logs a1b2c3d4
# Live tail (streaming)
lattice logs a1b2c3d4 --follow
# Last N lines
lattice logs a1b2c3d4 --tail=100
Metrics
Query metrics for a running allocation:
# Snapshot of current metrics
lattice metrics a1b2c3d4
# Output:
# METRIC VALUE UNIT
# gpu_utilization 87.3 %
# gpu_memory_used 71.2 GB
# cpu_utilization 45.1 %
# memory_used 384.0 GB
# network_rx 12.4 GB/s
# network_tx 8.7 GB/s
Live metrics stream:
lattice metrics a1b2c3d4 --stream
Diagnostics
Combined view of network and storage health for an allocation:
lattice diagnostics a1b2c3d4
# Network diagnostics only
lattice diagnostics a1b2c3d4 --network
# Storage diagnostics only
lattice diagnostics a1b2c3d4 --storage
Cross-Allocation Comparison
Compare metrics between two allocations (useful for A/B experiments):
lattice compare a1b2c3d4 e5f6g7h8 --metric=gpu_utilization
Cluster Overview
# List all nodes
lattice nodes
# Filter by state
lattice nodes --state=ready
lattice nodes --state=draining
# Specific node details
lattice nodes nid001234
Python SDK
The Lattice Python SDK provides an async client for interacting with the REST API from notebooks, scripts, and autonomous agents.
Installation
pip install lattice-sdk
Quick Start
import asyncio
from lattice_sdk import LatticeClient, AllocationSpec
async def main():
async with LatticeClient("lattice-api.example.com", 8080) as client:
# Submit an allocation
alloc = await client.submit(AllocationSpec(
entrypoint="python train.py",
nodes=4,
walltime="24h",
tenant="ml-team",
))
print(f"Submitted: {alloc.id}")
# Check status
status = await client.status(alloc.id)
print(f"State: {status.state}")
# Wait for completion
async for event in client.watch(alloc.id):
print(f"State changed: {event.state}")
if event.state in ("Completed", "Failed", "Cancelled"):
break
asyncio.run(main())
Core Methods
Submission
# Basic submission
alloc = await client.submit(AllocationSpec(
entrypoint="torchrun train.py",
nodes=64,
walltime="72h",
uenv="prgenv-gnu/24.11:v1",
constraints={"gpu_type": "GH200"},
))
# Submit DAG
dag = await client.submit_dag("workflow.yaml")
Status & Listing
# Get allocation
alloc = await client.status(alloc_id)
# List allocations
allocs = await client.list_allocations(state="running")
# List nodes
nodes = await client.list_nodes(state="ready")
Monitoring
# Stream logs
async for line in client.stream_logs(alloc_id):
print(line.message)
# Query metrics
metrics = await client.query_metrics(alloc_id)
print(f"GPU util: {metrics.gpu_utilization}%")
# Stream metrics
async for snapshot in client.stream_metrics(alloc_id):
print(f"GPU: {snapshot.gpu_utilization}%")
# Watch state changes
async for event in client.watch(alloc_id):
print(f"State: {event.state}")
Management
# Cancel
await client.cancel(alloc_id)
# Checkpoint
await client.checkpoint(alloc_id)
Tenants & vClusters
tenants = await client.list_tenants()
vclusters = await client.list_vclusters()
Error Handling
from lattice_sdk import LatticeError, LatticeNotFoundError, LatticeAuthError
try:
alloc = await client.status("nonexistent-id")
except LatticeNotFoundError:
print("Allocation not found")
except LatticeAuthError:
print("Authentication failed")
except LatticeError as e:
print(f"API error ({e.status_code}): {e}")
Authentication
# Token-based (OIDC)
client = LatticeClient("api.example.com", 8080, token="eyJ...")
# Headers
client = LatticeClient("api.example.com", 8080, headers={"X-Tenant": "my-team"})
Slurm Migration
Command Mapping
| Slurm | Lattice | Notes |
|---|---|---|
| sbatch script.sh | lattice submit script.sh | #SBATCH directives are parsed |
| squeue | lattice status | |
| squeue -u $USER | lattice status | Default shows own jobs |
| scancel 12345 | lattice cancel 12345 | |
| salloc | lattice session | Interactive allocation |
| srun --pty bash | lattice attach <id> | Attach terminal |
| sinfo | lattice nodes | Cluster node overview |
| sacct | lattice status --all | Historical view |
Directive Mapping
| #SBATCH Directive | Lattice Equivalent | Notes |
|---|---|---|
| --nodes=N | --nodes=N | Exact match |
| --ntasks=N | — | Mapped to node count: ceil(N / tasks_per_node) |
| --ntasks-per-node=N | — | Passed as task config |
| --time=HH:MM:SS | --walltime=HH:MM:SS | Also accepts 24h, 30m shorthand |
| --partition=X | --vcluster=X | Configurable partition→vCluster mapping |
| --account=X | --tenant=X | Account→tenant mapping |
| --job-name=X | --name=X | |
| --output=file | — | Logs always go to persistent store; download path configurable |
| --error=file | — | Same as --output |
| --constraint=X | --constraint=X | Feature matching |
| --gres=gpu:N | --constraint="gpu_count=N" | |
| --qos=X | --preemption-class=N | Configurable QOS→class mapping |
| --array=0-99%20 | --task-group=0-99%20 | |
| --dependency=afterok:ID | --depends-on=ID:success | |
| --exclusive | Default | Lattice always allocates full nodes |
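Two of the mappings above involve arithmetic rather than renaming. A sketch of both conversions (hypothetical helper names; walltime units assumed to be h/m/s as shown in the table):

```python
import math

def parse_walltime(s: str) -> int:
    """Accept HH:MM:SS or shorthand like 24h / 30m; return seconds."""
    if ":" in s:
        h, m, sec = (int(x) for x in s.split(":"))
        return h * 3600 + m * 60 + sec
    return int(s[:-1]) * {"h": 3600, "m": 60, "s": 1}[s[-1]]

def ntasks_to_nodes(ntasks: int, tasks_per_node: int) -> int:
    """--ntasks=N maps to ceil(N / tasks_per_node) full nodes."""
    return math.ceil(ntasks / tasks_per_node)

assert parse_walltime("24:00:00") == parse_walltime("24h") == 86400
assert parse_walltime("30m") == 1800
assert ntasks_to_nodes(10, 4) == 3  # 10 tasks at 4 per node need 3 full nodes
```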
Environment Variables
When Slurm compatibility is enabled (compat.set_slurm_env: true), Lattice sets familiar environment variables inside allocations:
| Variable | Value |
|---|---|
| SLURM_JOB_ID | Allocation ID |
| SLURM_JOB_NAME | Allocation name |
| SLURM_NNODES | Number of allocated nodes |
| SLURM_NODELIST | Comma-separated node list |
| SLURM_NTASKS | Task count |
| SLURM_SUBMIT_DIR | Working directory at submission |
Lattice also sets its own LATTICE_* equivalents.
What’s Different
Full-Node Scheduling
Lattice always allocates full nodes (no sub-node sharing). This simplifies resource management and improves performance isolation. If you’re used to --ntasks=1 on a shared node, you’ll get the whole node.
No Partitions — vClusters
Slurm partitions map to Lattice vClusters, but vClusters are more flexible: each has its own scheduling policy (backfill, bin-pack, FIFO, reservation) and weight tuning.
Topology-Aware Placement
Lattice automatically packs multi-node jobs within the same Slingshot dragonfly group for optimal network performance. No manual --switches needed.
Data Staging
Lattice can pre-stage data during queue wait time. Add --data-mount="s3://bucket/data:/data" and the scheduler factors data locality into placement decisions.
Checkpointing
Unlike Slurm’s --requeue, Lattice coordinates checkpointing before preemption. Declare --checkpoint=signal and your job receives SIGUSR1 before being suspended.
Migration Steps
- Start with existing scripts — #SBATCH directives work out of the box
- Replace sbatch/squeue/scancel with lattice submit/status/cancel
- Gradually adopt native features — data staging, checkpointing, DAGs, uenv
- Tune scheduling weights — use the RM-Replay simulator for A/B comparison
Deployment & Administration
Architecture Overview
A Lattice deployment consists of:
- 3-5 quorum members — Raft consensus nodes running lattice-server
- N compute nodes — each running lattice-agent
- VictoriaMetrics (or compatible TSDB) — telemetry storage
- S3-compatible storage — checkpoint and log persistence
- VAST (optional) — data staging and QoS
Deployment Methods
Docker Compose (dev/test)
cd infra/docker
docker compose up -d
This starts a 3-node quorum with VictoriaMetrics. See infra/docker/docker-compose.yml.
Systemd (production)
Download binaries from GitHub Releases and install:
ARCH=$(uname -m | sed 's/aarch64/arm64/')
# Server (quorum members)
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-server-${ARCH}.tar.gz" | tar xz
sudo mv lattice-server /usr/local/bin/
sudo cp infra/systemd/lattice-server.service /etc/systemd/system/
sudo cp config/production.yaml /etc/lattice/config.yaml
sudo systemctl enable --now lattice-server
# Agent (compute nodes) — single binary per architecture, all GPU support included
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-agent-${ARCH}.tar.gz" | tar xz
sudo mv lattice-agent /usr/local/bin/
sudo cp infra/systemd/lattice-agent.service /etc/systemd/system/
sudo systemctl enable --now lattice-agent
Configuration
Example configs are in config/:
| File | Purpose |
|---|---|
| config/minimal.yaml | Single-node dev mode, no optional features |
| config/production.yaml | Full reference with all sections documented |
See the production config for every option with explanations.
Required Sections
- quorum — Raft node ID, peers, data directory
- api — gRPC and REST listen addresses
- storage — S3 endpoint, NFS paths
- telemetry — TSDB endpoint, aggregation mode
Optional Sections
- node_agent — heartbeat timing, grace periods
- network — VNI pool range for Slingshot
- checkpoint — checkpoint evaluation and timeout tuning
- scheduling — cycle interval, backfill depth
- accounting — Waldur integration (requires accounting feature)
- rate_limit — per-user API rate limiting
- federation — Sovra cross-site federation (requires federation feature)
- compat — Slurm compatibility settings
Authentication & Authorization
Overview
Lattice authenticates three types of callers:
| Caller | Auth method | Token source |
|---|---|---|
| Humans (CLI) | OIDC (PKCE flow) → RS256 JWT | IdP (Keycloak, Dex) |
| Agents (node agent) | mTLS (production) or Bearer token (dev) | SPIRE SVID / bootstrap certs / LATTICE_AGENT_TOKEN |
| Services (AI/MCP) | OIDC (client_credentials) → RS256 JWT | IdP service account |
Server OIDC Configuration
api:
oidc_issuer: "https://keycloak.example.com/realms/hpc" # IdP discovery URL
oidc_client_id: "lattice" # Expected `aud` claim
# oidc_hmac_secret: "dev-secret-only" # HMAC fallback (dev only)
| Config field | Env var | Purpose |
|---|---|---|
| api.oidc_issuer | — | OIDC provider URL. Enables JWKS (RS256/ES256) validation. |
| api.oidc_client_id | — | Expected aud claim. Returned by auth discovery endpoint. |
| api.oidc_hmac_secret | LATTICE_OIDC_HMAC_SECRET | Shared secret for HS256 validation (dev/testing/break-glass). |
Priority: JWKS (if oidc_issuer set) > HMAC (if secret set) > no auth (warning logged).
The auth discovery endpoint GET /api/v1/auth/discovery is public (no auth required) and returns {idp_url, client_id, issuer} so the CLI can bootstrap login.
Roles
Role derivation checks OIDC scopes first, then cross-system role claims (pact_role, lattice_role). First match wins.
| Role | OIDC scope | Cross-system claim | Permissions |
|---|---|---|---|
| SystemAdmin | admin or system:admin | pact-platform-admin or system-admin | Unrestricted — all operations |
| TenantAdmin | tenant:admin | tenant-admin | Manage own tenant’s allocations, vClusters, quotas. Drain nodes. Query audit. |
| Operator | operator | operator | Drain/undrain/disable/enable nodes. Cannot create tenants or manage federation. |
| ClaimingUser | sensitive:claim | — | User + claim/release sensitive nodes |
| ReadOnly | readonly | — | GET/LIST/WATCH only, no mutations |
| User | (default — any authenticated user) | — | Submit/cancel own allocations, view nodes, create sessions |
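The scopes-first, first-match-wins derivation described above can be modeled directly from the table. A sketch (ordering and claim names taken from the table; the real implementation may differ in detail):

```python
# Checked in order; first match wins.
SCOPE_ROLES = [
    ("admin", "SystemAdmin"), ("system:admin", "SystemAdmin"),
    ("tenant:admin", "TenantAdmin"), ("operator", "Operator"),
    ("sensitive:claim", "ClaimingUser"), ("readonly", "ReadOnly"),
]
CLAIM_ROLES = [
    ("pact-platform-admin", "SystemAdmin"), ("system-admin", "SystemAdmin"),
    ("tenant-admin", "TenantAdmin"), ("operator", "Operator"),
]

def derive_role(scopes: set, cross_claims: set) -> str:
    """OIDC scopes are checked before cross-system role claims."""
    for scope, role in SCOPE_ROLES:
        if scope in scopes:
            return role
    for claim, role in CLAIM_ROLES:
        if claim in cross_claims:
            return role
    return "User"  # default for any authenticated caller

assert derive_role({"tenant:admin"}, set()) == "TenantAdmin"
assert derive_role(set(), {"pact-platform-admin"}) == "SystemAdmin"
assert derive_role(set(), set()) == "User"
```

Note that a scope match short-circuits the claim check: a token carrying both `readonly` and `tenant-admin` resolves to ReadOnly.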
IdP Setup (Keycloak / Dex)
Configure your IdP to include the appropriate scopes in issued tokens:
Keycloak:
- Create client lattice with PKCE (Authorization Code) flow
- Create client scopes: admin, tenant:admin, operator, sensitive:claim, readonly
- Assign scopes to users/groups via role mappings
- For pact+lattice co-deployment: add pact_role as a custom claim in the token mapper
Dex:
staticClients:
- id: lattice
name: Lattice Scheduler
redirectURIs: ['http://localhost:8400/callback']
public: true # PKCE, no client secret
Dex passes through upstream IdP claims. Configure pact_role / scopes in the upstream IdP (LDAP groups, SAML attributes, etc.).
Agent Authentication
Node agents authenticate to lattice-server for registration and heartbeats.
Production (mTLS): Agent acquires identity via the cascade: SPIRE → SelfSigned CA → Bootstrap certs. The gRPC channel uses ClientTlsConfig with the acquired cert/key/CA. Server verifies the client certificate.
# Bootstrap cert path (used until SPIRE is available)
lattice-agent \
--quorum-endpoint=https://lattice-01:50051 \
--bootstrap-cert=/etc/lattice/tls/agent.crt \
--bootstrap-key=/etc/lattice/tls/agent.key \
--bootstrap-ca=/etc/lattice/tls/ca.crt \
...
Dev/testing (Bearer token): When no mTLS identity is available, the agent falls back to LATTICE_AGENT_TOKEN.
LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
--quorum-endpoint=http://lattice-01:50051 \
...
| Env var | Purpose |
|---|---|
| LATTICE_AGENT_TOKEN | Bearer token for agent→server auth (dev/testing/break-glass) |
| LATTICE_SPIRE_SOCKET | SPIRE agent socket path (default: /run/spire/agent.sock) |
| LATTICE_BOOTSTRAP_CERT | Bootstrap cert PEM path |
| LATTICE_BOOTSTRAP_KEY | Bootstrap key PEM path |
| LATTICE_BOOTSTRAP_CA | Bootstrap CA PEM path |
mTLS takes priority. Token auth is the fallback. In production, leave LATTICE_AGENT_TOKEN unset.
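The identity cascade and token fallback described above amount to a simple priority chain. A sketch (function and source names are illustrative, not the agent's actual API):

```python
def resolve_agent_identity(spire=None, self_signed=None, bootstrap=None, token=None):
    """Try mTLS sources in cascade order; the bearer token is the last resort."""
    for source, cred in (("spire", spire),
                         ("self_signed", self_signed),
                         ("bootstrap", bootstrap)):
        if cred:
            return ("mtls", source, cred)
    if token:
        return ("bearer", "env_token", token)
    raise RuntimeError("no agent identity available")

# Bootstrap certs win over a token that is also present.
assert resolve_agent_identity(bootstrap="agent.crt", token="dev-token")[1] == "bootstrap"
# With no certs at all, the agent falls back to the token.
assert resolve_agent_identity(token="dev-token")[0] == "bearer"
```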
Quorum Management
Initial Bootstrap
The first quorum member initializes the Raft cluster using the --bootstrap flag. This flag must only be passed once — on the very first startup of node 1. All subsequent restarts (including systemd restarts) omit it.
# First-ever start of node 1 — initializes the Raft cluster:
lattice-server --config /etc/lattice/server.yaml --bootstrap
# All subsequent restarts — no --bootstrap:
lattice-server --config /etc/lattice/server.yaml
# (or via systemd, which never passes --bootstrap)
Configure peers in each node’s config:
quorum:
node_id: 1
data_dir: /var/lib/lattice/raft
peers:
- id: 2
address: "lattice-02:9000"
- id: 3
address: "lattice-03:9000"
Nodes 2 and 3 never need --bootstrap — they join via Raft membership replication from the leader.
Raft Status
curl http://lattice-01:8080/api/v1/raft/status
Backup & Restore
# Create backup
curl -X POST http://lattice-01:8080/api/v1/admin/backup
# Verify backup integrity
curl http://lattice-01:8080/api/v1/admin/backup/verify
# Restore (requires restart)
curl -X POST http://lattice-01:8080/api/v1/admin/restore \
-d '{"path": "/var/lib/lattice/backups/backup-20260305T120000Z.tar.gz"}'
Node Management
Agent Registration
Agents register automatically on startup. Authentication uses mTLS (production) or Bearer token (dev/testing):
# Production: mTLS via bootstrap certs (SPIRE preferred when available)
lattice-agent \
--node-id=nid001234 \
--quorum-endpoint=https://lattice-01:50051 \
--bootstrap-cert=/etc/lattice/tls/agent.crt \
--bootstrap-key=/etc/lattice/tls/agent.key \
--bootstrap-ca=/etc/lattice/tls/ca.crt \
--gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512
# Dev/testing: Bearer token auth (no certs needed)
LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
--node-id=nid001234 \
--quorum-endpoint=http://lattice-01:50051 \
--gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512
The agent tries the identity cascade (SPIRE → SelfSigned → Bootstrap) first. If no mTLS identity is available, it falls back to LATTICE_AGENT_TOKEN.
Draining Nodes
The drain lifecycle is: Ready → Draining → Drained → Ready.
# Drain a node (existing jobs complete, no new jobs scheduled)
lattice admin drain nid001234 --reason="maintenance"
# If no active allocations, node goes directly to Drained.
# If allocations are running, node stays in Draining until they complete.
# The scheduler loop automatically transitions Draining → Drained.
# Undrain (only works from Drained state)
lattice admin undrain nid001234
Undrain only works when the node is in Drained state. If the node is still Draining (allocations running), wait for them to complete or cancel them first.
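The drain rules above form a small state machine. A sketch of those transitions (illustrative model, not the scheduler's actual code):

```python
def drain_transition(state: str, active_allocations: int) -> str:
    """Drain lifecycle: Ready -> Draining -> Drained, skipping Draining if the node is idle."""
    if state == "Ready":
        return "Drained" if active_allocations == 0 else "Draining"
    if state == "Draining" and active_allocations == 0:
        return "Drained"  # the scheduler loop performs this step automatically
    return state

def undrain(state: str) -> str:
    """Undrain is only valid from the Drained state."""
    if state != "Drained":
        raise ValueError("undrain only works from Drained")
    return "Ready"

assert drain_transition("Ready", active_allocations=2) == "Draining"
assert drain_transition("Ready", active_allocations=0) == "Drained"
assert drain_transition("Draining", active_allocations=0) == "Drained"
assert undrain("Drained") == "Ready"
```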
Node States
| State | Meaning |
|---|---|
| Ready | Available for scheduling |
| Draining | No new jobs; existing jobs continue |
| Down | Heartbeat lost beyond grace period |
| Degraded | Heartbeat late but within grace period |
| Claimed | Reserved for sensitive workload |
Tenant Management
# Create a tenant
lattice admin tenant create --name="physics" --max-nodes=100
# List tenants
lattice admin tenant list
# Update quota
lattice admin tenant update physics --max-nodes=200
TLS Configuration
Server TLS
api:
tls_cert: /etc/lattice/tls/server.crt
tls_key: /etc/lattice/tls/server.key
Mutual TLS (mTLS)
api:
tls_cert: /etc/lattice/tls/server.crt
tls_key: /etc/lattice/tls/server.key
tls_ca: /etc/lattice/tls/ca.crt # Require client certificates
Feature Flags
Compile-time features control optional integrations:
| Feature | Crate | Enables |
|---|---|---|
| oidc | lattice-api | JWT/OIDC token validation |
| accounting | lattice-api | Waldur billing integration |
| federation | lattice-api | Sovra cross-site federation |
| nvidia | lattice-node-agent | NVIDIA GPU discovery (nvml-wrapper) |
| rocm | lattice-node-agent | AMD GPU discovery (rocm-smi) |
| ebpf | lattice-node-agent | eBPF kernel telemetry (Linux only) |
Pre-built release binaries ship with all features enabled. GPU libraries are loaded at runtime — nodes without GPUs simply report no GPU hardware. To build from source:
# Server with all features
cargo build --release -p lattice-api --all-features
# Agent with all features
cargo build --release -p lattice-node-agent --all-features
Release Artifacts
| Artifact | Architecture | GPU Support |
|---|---|---|
| lattice-server-x86_64.tar.gz | x86_64 | n/a |
| lattice-server-arm64.tar.gz | arm64 | n/a |
| lattice-x86_64.tar.gz | x86_64 | n/a (CLI) |
| lattice-arm64.tar.gz | arm64 | n/a (CLI) |
| lattice-agent-x86_64.tar.gz | x86_64 | NVIDIA + AMD ROCm + eBPF |
| lattice-agent-arm64.tar.gz | arm64 | NVIDIA + AMD ROCm + eBPF |
| rm-replay-x86_64.tar.gz | x86_64 | n/a |
| rm-replay-arm64.tar.gz | arm64 | n/a |
GPU discovery is automatic at runtime. The agent detects available hardware and uses the appropriate provider:
| Hardware | Discovery Method | Runtime Dependency |
|---|---|---|
| NVIDIA (H100, A100, GH200) | nvml-wrapper (libnvidia-ml.so via dlopen) | NVIDIA driver installed |
| AMD (MI300X, MI250) | rocm-smi CLI | ROCm toolkit installed |
| CPU-only nodes | No GPU discovery runs | None |
GCP Test Cluster
For integration testing without production hardware:
# 1. Build Packer image (once, ~5 min)
cd infra/gcp/packer
packer build -var project_id=YOUR_PROJECT lattice-compute.pkr.hcl
# 2. Provision infrastructure (~2 min)
cd infra/gcp
terraform apply -var="project_id=YOUR_PROJECT" -var="use_packer_image=true"
# 3. Build + bundle binaries
cargo build --release --target x86_64-unknown-linux-gnu
./scripts/deploy/make-provision-bundle.sh target/x86_64-unknown-linux-gnu/release /tmp/lattice-provision.tar.gz
# 4. Deploy to nodes (SCP bundle + run install scripts)
# See scripts/deploy/install-quorum.sh and install-compute.sh
# 5. Run validation test matrix
./scripts/deploy/validate.sh http://QUORUM1_IP:8080 x1000c0s0b0n0,x1000c0s0b0n1
# 6. Teardown
cd infra/gcp && terraform destroy
The test cluster includes: 3 quorum nodes, 2 compute nodes (with podman + squashfs-tools), 1 OCI registry, 1 VictoriaMetrics. The validate.sh script runs 15 tests covering health, auth, submit, drain, restart, and validation.
Deploy scripts (scripts/deploy/install-*.sh) are reusable on-prem — no GCP-specific logic.
Cluster Monitoring & Observability
Prometheus Metrics
Lattice exposes Prometheus-compatible metrics at GET /metrics on the REST port (default 8080).
Key Metrics
| Metric | Type | Description |
|---|---|---|
| lattice_allocations_total | Counter | Total allocations by state |
| lattice_allocations_active | Gauge | Currently running allocations |
| lattice_scheduling_cycle_duration_seconds | Histogram | Scheduling cycle latency |
| lattice_scheduling_placements_total | Counter | Successful placements |
| lattice_scheduling_preemptions_total | Counter | Preemption events |
| lattice_raft_commit_latency_seconds | Histogram | Raft commit latency |
| lattice_raft_sensitive_audit_entries_total | Counter | Sensitive audit log entries |
| lattice_api_request_duration_seconds | Histogram | API request latency |
| lattice_api_requests_total | Counter | API requests by method and status |
| lattice_nodes_total | Gauge | Nodes by state |
| lattice_checkpoint_duration_seconds | Histogram | Checkpoint operation latency |
Scrape Configuration
# prometheus.yml
scrape_configs:
- job_name: 'lattice'
static_configs:
- targets:
- 'lattice-01:8080'
- 'lattice-02:8080'
- 'lattice-03:8080'
Grafana Dashboards
Pre-built dashboards are in infra/grafana/dashboards/:
- Cluster Overview — node states, allocation throughput, queue depth
- Scheduling Performance — cycle latency, placement rate, preemption rate
- Raft Health — commit latency, leader elections, log compaction
- Per-Tenant Usage — resource consumption, fair-share deficit
Import via Grafana UI or provision from infra/grafana/provisioning/.
Alerting Rules
Pre-configured alerting rules in infra/alerting/:
| Alert | Condition |
|---|---|
| LatticeRaftNoLeader | No Raft leader for > 30s |
| LatticeNodeDown | Node heartbeat lost for > 5m |
| LatticeSchedulingStalled | No placements for > 10m with pending jobs |
| LatticeHighPreemptionRate | > 10 preemptions/minute |
| LatticeCheckpointFailure | Checkpoint success rate < 90% |
| LatticeDiskSpaceLow | Raft data directory > 80% full |
TSDB Integration
Lattice pushes per-node telemetry to VictoriaMetrics (or any Prometheus-compatible remote write endpoint).
telemetry:
tsdb_endpoint: "http://victoriametrics:8428"
prod_interval_seconds: 30
Telemetry includes CPU, memory, GPU utilization, network I/O, and disk I/O per node.
Audit Log
Sensitive workload operations are recorded in the Raft-committed audit log:
# Query audit log
curl "http://lattice-01:8080/api/v1/audit?tenant=sensitive-team&from=2026-03-01"
Audit entries include: node claims/releases, allocation lifecycle events, and access log entries. Retention: 7 years (configurable).
Health Check
curl http://lattice-01:8080/healthz
# {"status": "ok"}
Used by Docker/Kubernetes health probes and load balancers.
Managing Sensitive Workloads
Sensitive workloads (financial, defense, regulated research) require strict isolation, auditing, and data handling. Lattice provides a dedicated scheduling mode for these workloads.
How It Works
- User claims nodes — not the scheduler. The user’s identity is recorded as the owner in the Raft audit log.
- Full isolation — claimed nodes run only the owner’s workloads. No sharing.
- Hardened OS — OpenCHAMI provisions a hardened boot image for claimed nodes.
- Encrypted storage — a dedicated encrypted pool is assigned. All access is logged.
- Signed software only — only vulnerability-scanned, signed uenv images are allowed.
- Wipe on release — when the claim ends, storage is crypto-erased and nodes are re-provisioned.
Submitting Sensitive Workloads
# Submit to the sensitive vCluster
lattice submit --vcluster=sensitive --nodes=4 --walltime=168h analysis.sh
The sensitive scheduler uses a reservation model (not backfill). Priority is fixed at the highest level; the only tiebreaker is conformance fitness.
Node Claiming
Sensitive allocations claim specific nodes. Once claimed:
- Nodes are exclusively owned by the claiming user
- The claim is Raft-committed with the user’s identity
- No other workloads (even from the same tenant) can run on claimed nodes
Audit Trail
Every sensitive operation is logged:
# Query sensitive audit entries
curl "http://lattice-01:8080/api/v1/audit?scope=sensitive"
Logged events:
- Node claim / release
- Allocation start / completion
- Data access (read/write operations)
- Software image loads
- Storage wipe confirmation
Retention: 7 years (per regulatory requirements).
Network Isolation
Sensitive allocations get a unique Slingshot VNI (network domain). Ingress and egress are denied except to the designated data gateway. With Ultra Ethernet, wire-level encryption is enabled.
Admin Responsibilities
- Provision hardened images via OpenCHAMI for sensitive nodes
- Maintain signed uenv registry — only approved images should be signed
- Monitor audit log — set up alerting for unexpected access patterns
- Test wipe procedures — verify crypto-erase completes on node release
- Designate sensitive-capable nodes — not all nodes need to support sensitive workloads
Configuration
No special server configuration is needed. The sensitive scheduler is a built-in vCluster type. Create a sensitive vCluster:
lattice admin vcluster create \
--name=sensitive \
--scheduler-type=sensitive-reservation \
--description="Regulated workloads with full isolation"
System Architecture
Overview
Lattice has a seven-layer architecture in which each layer has a clear responsibility and communicates with adjacent layers via defined interfaces.
┌─ User Plane ───────────────────────────────────────────────────┐
│ lattice-cli + lattice-api (OIDC via hpc-auth) │
│ ├── Job lifecycle (submit, monitor, cancel) │
│ ├── Interactive sessions (WebSocket terminal) │
│ ├── Data management (stage, browse, transfer) │
│ ├── uenv management (list, pull, test) │
│ ├── Observability (attach, logs, metrics, diagnostics) │
│ └── Sensitive: user-level node claim/release │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Software Plane ──────────┴────────────────────────────────────┐
│ Default: uenv (squashfs + mount namespace) │
│ Optional: OCI/Sarus (isolation, third-party images) │
│ Registry: JFrog/Nexus → S3 backing (VAST hot tier) │
│ Node-local NVMe image cache (optional) │
│ Sensitive: signed images only, vulnerability-scanned │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Scheduling Plane ────────┴────────────────────────────────────┐
│ Quorum (Raft, 3-5 replicas) │
│ Strong: (1) node ownership (2) sensitive audit log │
│ Eventual: job queues, telemetry, quotas │
│ │
│ vCluster Schedulers: │
│ ├── HPC: backfill + dragonfly group packing │
│ ├── Service: bin-pack + autoscale │
│ ├── Sensitive: user-claim reservation, dedicated nodes │
│ └── Interactive: FIFO, short-lived, node-sharing via Sarus │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Data Plane ──────────────┴────────────────────────────────────┐
│ Hot: VAST (NFS + S3, single flash tier) │
│ ├── Home dirs, scratch, active datasets (NFS) │
│ ├── Checkpoints, image cache, objects (S3) │
│ ├── Scheduler integration: QoS, pre-staging, snapshots │
│ └── Sensitive: encrypted view, audit-logged, dedicated pool │
│ Warm: Capacity store (S3-compat, cost-optimized) │
│ Cold: Tape archive (S3-compat, regulatory retention) │
│ Data mover: pre-stages during queue wait, policy-driven │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Network Fabric ──────────┴────────────────────────────────────┐
│ Slingshot (current) / Ultra Ethernet (future path) │
│ ├── libfabric abstraction for workload communication │
│ ├── VNI-based network domains (job isolation) │
│ ├── Traffic classes: compute | management | telemetry │
│ ├── CSIG for in-band congestion telemetry │
│ └── Sensitive: encrypted RDMA, dedicated VNI │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Node Plane ──────────────┴────────────────────────────────────┐
│ Node Agent (per node) │
│ ├── squashfs-mount (uenv delivery) │
│ ├── Sarus (OCI container runtime, when needed) │
│ ├── eBPF telemetry + CSIG tap │
│ ├── Node-local NVMe (optional): scratch + image cache │
│ ├── Conformance fingerprint (driver/firmware/kernel hash) │
│ └── Health reporting → OpenCHAMI SMD │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Infrastructure Plane ────┴────────────────────────────────────┐
│ OpenCHAMI │
│ ├── Magellan: Redfish BMC discovery & inventory │
│ ├── SMD: State Management Daemon (hardware lifecycle) │
│ ├── BSS: Boot Script Service (image selection per node) │
│ ├── OPAAL: Authentication & identity │
│ ├── Cloud-init: per-node config injection │
│ └── Manta CLI: admin tooling │
└────────────────────────────────────────────────────────────────┘
Component Interactions
Allocation Lifecycle
1. User/Agent → lattice-cli → lattice-api (Intent API or Compat API)
2. lattice-api validates request, resolves uenv, creates Allocation object
3. Allocation placed in vCluster scheduler's queue (eventually consistent)
4. vCluster scheduler runs scheduling cycle:
a. Scores pending allocations with cost function
b. Solves knapsack: maximize value subject to resource constraints
c. Proposes allocation → quorum
5. Quorum validates (node ownership, quotas, sensitive isolation)
6. Quorum commits: node ownership updated (strong consistency)
7. Quorum notifies node agents of new allocation
8. Node agents:
a. Pull uenv squashfs image (from cache or registry)
b. Mount via squashfs-mount
c. Start processes in mount namespace
d. Begin log capture (ring buffer + S3 persistence)
e. Accept attach sessions (if user connects)
f. Report health/telemetry
8.5. During execution, users can:
- Attach interactive terminal (nsenter into allocation namespace)
- Stream logs (live tail from ring buffer or historical from S3)
- Query metrics (lattice top → TSDB) or stream them (lattice watch → node agents)
- View diagnostics (network health, storage performance)
- Compare metrics across allocations (TSDB multi-query)
9. On completion: node agents report, quorum releases nodes
Preemption Flow
1. Higher-priority allocation arrives, needs nodes currently in use
2. Scheduler evaluates: which running allocations are cheapest to preempt?
→ checkpoint_efficiency score from cost function
3. Checkpoint broker sends CHECKPOINT_HINT to target allocation's node agents
4. Application checkpoints (or: timeout → forced preemption)
5. Nodes released, reassigned to higher-priority allocation
6. Preempted allocation re-queued, will resume from checkpoint when resources available
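The victim-selection step (2) can be sketched as a greedy sort over running allocations by their f₈ checkpoint_efficiency score. This is an illustrative sketch, not the lattice-scheduler implementation; the tuple shape and helper names are assumptions, and the real scheduler also weighs priority class.

```python
def checkpoint_efficiency(est_checkpoint_minutes: float) -> float:
    """f8 from the cost function: cheaper-to-checkpoint jobs score higher."""
    return 1.0 / (1.0 + est_checkpoint_minutes)

def pick_preemption_victims(running, nodes_needed):
    """Greedily pick victims that free enough nodes at the lowest preemption cost.

    `running` is a list of (alloc_id, node_count, est_checkpoint_minutes) tuples
    (illustrative shape, not the actual crate API).
    """
    # Cheapest-to-checkpoint allocations first
    ranked = sorted(running, key=lambda a: checkpoint_efficiency(a[2]), reverse=True)
    victims, freed = [], 0
    for alloc_id, node_count, _ in ranked:
        if freed >= nodes_needed:
            break
        victims.append(alloc_id)
        freed += node_count
    return victims
```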
Federation Flow (when enabled)
1. User at Site A submits allocation targeting Site B
2. Site A's federation broker signs request with Sovra token
3. Request arrives at Site B's federation broker
4. Site B verifies Sovra token, checks policy (OPA)
5. If accepted: allocation enters Site B's scheduling plane
6. Site B's local quorum manages the allocation entirely
7. Results/logs accessible to user at Site A via federation catalog
Topology Model
The scheduler maintains a model of the Slingshot dragonfly topology:
System
├── Group 0 (electrical group, ~hundreds of nodes)
│ ├── Switch 0
│ │ ├── Node 0..N
│ │ └── ...
│ └── Switch M
├── Group 1
│ └── ...
└── Group K
└── ...
Intra-group: electrical, low latency, high bandwidth
Inter-group: optical, higher latency, potential congestion
Scheduling rule: pack jobs into fewest groups possible. Jobs below group size → single group. Large jobs → minimize group span, prefer adjacent groups. Network-sensitive jobs (NCCL) get stricter placement constraints.
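The packing rule above can be sketched as a two-phase group search: try a single group first, then span the fewest groups possible. This assumes only a per-group free-node count; the real solver additionally honors GPU constraints, conformance groups, and power budgets.

```python
def pack_into_groups(free_by_group: dict, nodes_needed: int):
    """Pick the fewest dragonfly groups that can host `nodes_needed` nodes.

    Illustrative sketch: prefer a single group (the smallest that fits, to
    keep large groups free for big jobs); otherwise take the largest free
    groups first to minimize group span.
    """
    single = [g for g, free in free_by_group.items() if free >= nodes_needed]
    if single:
        return [min(single, key=lambda g: free_by_group[g])]
    chosen, remaining = [], nodes_needed
    for g, free in sorted(free_by_group.items(), key=lambda kv: -kv[1]):
        chosen.append(g)
        remaining -= free
        if remaining <= 0:
            return chosen
    return None  # cannot place with current free capacity
```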
State Machine
The quorum manages a replicated state machine with the following state:
GlobalState {
nodes: Map<NodeId, NodeState>, // ownership, health, capabilities
allocations: Map<AllocId, Allocation>, // all active allocations
tenants: Map<TenantId, TenantState>, // quotas, fair-share counters
vclusters: Map<VClusterId, VClusterConfig>, // scheduler configs
topology: TopologyModel, // dragonfly group structure
sensitive_audit: AppendOnlyLog<AuditEvent>, // strong consistency
}
NodeState {
owner: Option<(TenantId, VClusterId, AllocId)>,
health: NodeHealth,
capabilities: NodeCapabilities, // GPU type, memory, features
group: GroupId, // topology position
conformance_group: ConformanceGroupId, // fingerprint of driver/firmware/kernel
}
Transitions are proposed by vCluster schedulers and validated by the quorum before commit. Only node ownership changes and sensitive audit events require Raft consensus; everything else is eventually consistent.
Note: Observability data (logs, metrics, attach sessions, diagnostics) is NOT stored in the Raft state machine. This data lives in the TSDB, S3, and node agent memory. Only sensitive audit events about observability actions (e.g., “Dr. X attached to allocation Y”) flow through Raft consensus (per ADR-004).
API Design
Two-Tier API Model
Tier 1: Intent API (Agent-Native)
Agents and advanced users interact with the Intent API. They declare what they need; the scheduler resolves how.
Core Resources
Allocation — The universal work unit.
POST /v1/allocations Create allocation (or DAG of allocations)
GET /v1/allocations List allocations (filterable)
GET /v1/allocations/{id} Get allocation status
DELETE /v1/allocations/{id} Cancel allocation
PATCH /v1/allocations/{id} Update allocation (e.g., extend walltime, switch telemetry)
POST /v1/allocations/{id}/tasks Launch tasks within an existing allocation (srun equivalent)
POST /v1/allocations/{id}/checkpoint Request checkpoint
Observability — User-facing debugging and monitoring.
POST /v1/allocations/{id}/attach Attach interactive terminal (WebSocket upgrade)
GET /v1/allocations/{id}/logs Historical logs from S3
GET /v1/allocations/{id}/logs/stream Live log tail (SSE / gRPC stream)
GET /v1/allocations/{id}/metrics Query metrics snapshot from TSDB
GET /v1/allocations/{id}/metrics/stream Push-based live metrics stream
GET /v1/allocations/{id}/diagnostics Combined network + storage diagnostics
GET /v1/allocations/{id}/diagnostics/network Network-specific diagnostics
GET /v1/allocations/{id}/diagnostics/storage Storage-specific diagnostics
GET /v1/compare Cross-allocation metric comparison
DAGs — Workflow graph management.
POST /v1/dags Submit a DAG of allocations
GET /v1/dags List DAGs (filterable by tenant, user, state)
GET /v1/dags/{id} Get DAG status (overall state + per-allocation states)
GET /v1/dags/{id}/graph Get DAG structure (allocations + dependency edges)
DELETE /v1/dags/{id} Cancel all allocations in a DAG
Session — Interactive allocation with WebSocket terminal.
POST /v1/sessions Create interactive session
GET /v1/sessions/{id}/terminal WebSocket terminal endpoint
Nodes — Read-only view of cluster state.
GET /v1/nodes List nodes (filterable by vCluster, tenant, state)
GET /v1/nodes/{id} Get node details
Tenants / vClusters — Administrative.
GET /v1/tenants List tenants
GET /v1/vclusters List vClusters
GET /v1/vclusters/{id}/queue View vCluster queue
Accounting
GET /v1/accounting Query usage history
Allocation Request Schema
# Full Intent API allocation request
allocation:
# Identity
tenant: "ml-team"
project: "gpt-training"
vcluster: "ml-training" # optional: scheduler can infer from intent
tags: { experiment: "run-42" }
# What to run
intent: "train" # optional hint for scheduler
environment:
uenv: "prgenv-gnu/24.11:v1" # uenv name/version
view: "default" # uenv view to activate
# OR:
image: "registry.example.com/my-training:latest" # OCI image via Sarus
entrypoint: "torchrun --nproc_per_node=4 train.py"
# Resources
resources:
nodes: 64 # can be exact or range: { min: 32, max: 128 }
constraints:
gpu_type: "GH200"
features: ["nvme_scratch"]
topology: "tight" # scheduler hint: pack into fewest groups
# Lifecycle
lifecycle:
type: "bounded" # bounded | unbounded | reactive
walltime: "72h" # for bounded
preemption_class: 2 # 0 = lowest, higher = harder to preempt
# For reactive:
# scale_policy: { min: 4, max: 16, metric: "request_latency_p99", target: "100ms" }
# Data
data:
mounts:
- source: "s3://datasets/imagenet"
target: "/data/input"
access: "read-only"
tier_hint: "hot" # scheduler pre-stages if needed
defaults: true # auto-mount home, scratch, output dir
# Networking
connectivity:
network_domain: "ml-workspace" # shared domain for cross-allocation communication
expose: # for services
- name: "metrics"
port: 9090
# Dependencies (for DAG submissions)
depends_on:
- ref: "preprocess-job"
condition: "success" # success | failure | any | corresponding
# Checkpointing
checkpoint:
strategy: "auto" # auto | manual | none
# auto: scheduler decides based on cost function
# manual: application manages its own checkpointing
# none: non-checkpointable, treated as non-preemptible
# Telemetry
telemetry:
mode: "prod" # prod | debug | audit
DAG Submission
Submit multiple allocations as a workflow graph:
dag:
allocations:
- id: "stage-data"
entrypoint: "python stage.py"
resources: { nodes: 1 }
lifecycle: { type: "bounded", walltime: "2h" }
- id: "train"
entrypoint: "torchrun train.py"
resources: { nodes: 64, constraints: { topology: "tight" } }
lifecycle: { type: "bounded", walltime: "72h" }
depends_on: [{ ref: "stage-data", condition: "success" }]
- id: "evaluate"
entrypoint: "python eval.py"
resources: { nodes: 4 }
depends_on: [{ ref: "train", condition: "any" }]
DAG size limit: Maximum 1000 allocations per DAG (configurable). Submissions exceeding this limit are rejected at validation time. See dag-scheduling.md for details.
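The dependency gating semantics (root allocations enter the queue immediately; downstream allocations enter once their dependency conditions are satisfied) can be sketched as follows. This covers only the success/failure/any conditions (the `corresponding` condition is elided), and the function names are illustrative, not the lattice-scheduler API.

```python
TERMINAL = {"success", "failure"}

def condition_met(condition: str, dep_state) -> bool:
    """Evaluate one dependency edge (assumed semantics per dag-scheduling.md)."""
    if condition == "any":
        return dep_state in TERMINAL      # dependency finished either way
    return dep_state == condition          # "success" / "failure" match exactly

def ready_allocations(dag: dict, states: dict) -> list:
    """Allocations whose dependencies are all satisfied and which haven't started.

    `dag` maps alloc id -> list of (dep_id, condition); roots have no deps.
    `states` maps alloc id -> terminal state, or absent if not yet finished.
    """
    ready = []
    for alloc, deps in dag.items():
        if states.get(alloc) is not None:
            continue  # already queued, running, or finished
        if all(condition_met(cond, states.get(dep)) for dep, cond in deps):
            ready.append(alloc)
    return ready
```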
Task Groups (Job Arrays)
allocation:
type: "task_group"
template:
entrypoint: "python sweep.py --config=${INDEX}"
resources: { nodes: 1, constraints: { gpu_type: "GH200" } }
lifecycle: { type: "bounded", walltime: "4h" }
range: { start: 0, end: 99 }
concurrency: 20 # max simultaneous tasks
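The template semantics amount to `${INDEX}` substitution over the range; a minimal sketch is below. The helper name is illustrative, and the concurrency cap (enforced by the scheduler, not at expansion time) is not modeled.

```python
def expand_task_group(template: str, start: int, end: int):
    """Expand a task-group entrypoint template into per-task commands.

    Substitutes ${INDEX} for each index in the inclusive range.
    """
    return [template.replace("${INDEX}", str(i)) for i in range(start, end + 1)]
```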
Tier 2: Compatibility API (Slurm-like)
Translates familiar Slurm commands to Intent API calls. Implemented as CLI wrappers + lattice-api REST endpoints.
Command Mapping
| Slurm | Lattice CLI | Intent API |
|---|---|---|
sbatch script.sh | lattice submit script.sh | POST /v1/allocations |
sbatch --array=0-99%20 script.sh | lattice submit --task-group=0-99%20 script.sh | POST /v1/allocations (task_group) |
sbatch --dependency=afterok:123 script.sh | lattice submit --depends-on=123:success script.sh | POST /v1/allocations (depends_on) |
squeue | lattice status | GET /v1/allocations |
squeue -u $USER | lattice status --user=$USER | GET /v1/allocations?user= |
scancel 123 | lattice cancel 123 | DELETE /v1/allocations/123 |
salloc -N2 | lattice session --nodes=2 | POST /v1/sessions |
srun -n4 hostname | lattice launch --alloc=123 -n4 hostname | POST /v1/allocations/123/tasks |
sinfo | lattice nodes | GET /v1/nodes |
sacct | lattice history | GET /v1/accounting |
--constraint="gpu" | --constraint="gpu" | constraints.features |
--partition=debug | --vcluster=interactive | vcluster field |
--qos=high | --priority=high | preemption_class |
--uenv=prgenv-gnu/24.11:v1 | --uenv=prgenv-gnu/24.11:v1 | environment.uenv |
srun --jobid=123 --pty bash | lattice attach 123 | Attach RPC (bidir stream) |
cat slurm-123.out | lattice logs 123 | GET /v1/allocations/123/logs |
tail -f slurm-123.out | lattice logs 123 --follow | StreamLogs RPC |
sstat -j 123 | lattice top 123 | QueryMetrics RPC |
| (no equivalent) | lattice watch 123 | StreamMetrics RPC |
| (no equivalent) | lattice diag 123 | GetDiagnostics RPC |
| (no equivalent) | lattice compare 123 456 | CompareMetrics RPC |
Script Parsing
The compatibility layer parses #SBATCH directives from submission scripts and translates them to Intent API fields. Unknown directives produce warnings but are not fatal (graceful degradation).
#!/bin/bash
#SBATCH --nodes=64
#SBATCH --time=72:00:00
#SBATCH --gres=gpu:4
#SBATCH --constraint=GH200
#SBATCH --uenv=prgenv-gnu/24.11:v1
#SBATCH --view=default
#SBATCH --account=ml-team
#SBATCH --job-name=training-run
torchrun --nproc_per_node=4 train.py
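A minimal sketch of that parsing step, assuming a small directive-to-field mapping table. The directive subset and Intent API field names shown here are illustrative, not the full compatibility layer; note the graceful-degradation behavior for unknown directives.

```python
# Hypothetical subset of the directive -> Intent API field mapping
KNOWN = {
    "--nodes": ("resources", "nodes"),
    "--time": ("lifecycle", "walltime"),
    "--uenv": ("environment", "uenv"),
    "--account": (None, "tenant"),
    "--job-name": (None, "name"),
}

def parse_sbatch(script: str):
    """Translate #SBATCH directives into Intent API fields.

    Unknown directives are collected as warnings, not rejected.
    """
    intent, warnings = {}, []
    for line in script.splitlines():
        if not line.startswith("#SBATCH"):
            continue
        directive = line.split(None, 1)[1]
        key, _, value = directive.partition("=")
        if key in KNOWN:
            section, field = KNOWN[key]
            target = intent.setdefault(section, {}) if section else intent
            target[field] = value
        else:
            warnings.append(f"ignoring unknown directive {key}")
    return intent, warnings
```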
Wire Format
gRPC (protobuf) is the primary protocol. REST is provided via gRPC-gateway for browser/curl access.
Protobuf definitions in proto/ directory. See proto/README.md for schema details.
Proto Coverage
The protobuf definitions in proto/lattice/v1/allocations.proto currently cover:
| Service / Area | Proto Status | Notes |
|---|---|---|
| AllocationService (submit, get, list, cancel, update, watch, checkpoint) | Defined | Core allocation lifecycle |
| Observability RPCs (attach, logs, metrics, diagnostics, compare) | Defined | Part of AllocationService |
| DAG RPCs (get, list, cancel) | Defined | Part of AllocationService |
| NodeService (list, get, drain, undrain, disable, enable, health) | Defined | proto/lattice/v1/nodes.proto |
| AdminService (tenant CRUD, vCluster CRUD, Raft status, backup, audit, accounting) | Defined | proto/lattice/v1/admin.proto |
| Session RPCs (create, get, delete) | Defined | Part of AllocationService |
| Service Discovery (lookup, list) | Defined | Part of AdminService, admin.proto |
| LivenessProbeSpec | Defined | Part of AllocationSpec, allocations.proto |
All planned services have been implemented as RPCs within the existing three services (AllocationService, NodeService, AdminService). Both gRPC and REST endpoints are available for all operations.
Service Discovery Endpoints
| Method | Endpoint | Description |
|---|---|---|
| gRPC | AdminService.LookupService(name) | Returns endpoints for a named service (tenant-filtered) |
| gRPC | AdminService.ListServices() | Lists all registered service names (tenant-filtered) |
| REST | GET /api/v1/services | JSON list of registered service names |
| REST | GET /api/v1/services/{name} | JSON endpoints for a named service |
Tenant filtering: requests carrying an x-lattice-tenant header see only services belonging to that tenant. Without the header, all services are visible (admin mode).
Liveness Probe Schema
Allocations can include an optional liveness_probe in the submission spec:
message LivenessProbeSpec {
string probe_type = 1; // "tcp" or "http"
uint32 port = 2; // 1-65535
string path = 3; // HTTP path (e.g., "/healthz")
uint32 period_secs = 4; // default: 30
uint32 initial_delay_secs = 5;
uint32 failure_threshold = 6; // default: 3
uint32 timeout_secs = 7; // default: 5
}
When failure_threshold consecutive probes fail, the allocation is marked Failed. The reconciliation loop then requeues it (for Unbounded/Reactive allocations with appropriate requeue policy).
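The failure-threshold semantics can be sketched as a small state tracker: consecutive failures accumulate, any success resets the counter, and crossing the threshold marks the allocation Failed. This mirrors the behavior described above but is not the node-agent code.

```python
class LivenessTracker:
    """Track consecutive probe failures against failure_threshold (sketch)."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.failed = False

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; returns True once the allocation is Failed."""
        if probe_ok:
            self.consecutive_failures = 0   # any success resets the counter
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.failed = True          # allocation marked Failed
        return self.failed
```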
Client SDKs
| SDK | Protocol | Location |
|---|---|---|
Python (lattice-sdk) | REST (httpx) | sdk/python/ |
Rust (lattice-client) | gRPC (tonic) | crates/lattice-client/ |
The Rust SDK re-exports all proto types as lattice_client::proto — consumers do not need to depend on lattice-common directly.
Authentication
All API calls require OIDC bearer token. The lattice CLI handles the OIDC flow via hpc-auth (institutional IdP integration). The lattice-api server validates tokens against the configured OIDC provider.
Sensitive tenant tokens include additional claims for audit trail binding.
Scheduling Algorithm
Overview
Lattice uses a multi-dimensional knapsack formulation with a composite cost function, executed independently by each vCluster scheduler. The quorum provides global coordination.
The Knapsack Formulation
Resources (Knapsack Dimensions)
Each scheduling decision must respect multiple resource constraints simultaneously:
| Dimension | Unit | Source |
|---|---|---|
| Nodes | count | Quorum (available nodes owned by or borrowable by vCluster) |
| GPU-hours | nodes × walltime | Derived from allocation request |
| Topology span | group count | Topology model (dragonfly groups consumed) |
| Storage I/O bandwidth | GB/s | VAST API (current utilization + allocation estimate) |
| Power budget | kW | OpenCHAMI BMC telemetry (per-node power draw) |
Value (Cost Function)
Score(j) = Σ wᵢ · fᵢ(j)
Component Functions
f₁: priority_class(j) — Static priority tier (0-10). Sensitive claims are highest. Preemption only moves down tiers.
f₂: wait_time_factor(j) — Anti-starvation. Increases monotonically with time in queue.
f₂(j) = log(1 + wait_seconds / reference_wait)
reference_wait is tunable (default: 1 hour). Log prevents wait time from dominating all other factors.
f₃: fair_share_deficit(j) — How far the tenant is from their contracted share. See quota-enforcement.md for hard vs. soft quota semantics.
f₃(j) = max(0, target_share(tenant) - actual_usage(tenant)) / target_share(tenant)
Ranges from 0 (tenant at or above share) to 1 (tenant has used nothing). Tenants below their share get priority.
f₄: topology_fitness(j) — How well the job fits available topology. For intra-node GPU topology, see gpu-topology.md.
f₄(j) = 1.0 - (groups_needed(j) / max_groups_available)
Jobs that fit in a single group score highest. Penalty for spanning groups scales with group count.
f₅: data_readiness(j) — Is the job’s input data on hot tier?
f₅(j) = fraction_of_input_data_on_hot_tier(j)
If unknown (user didn’t specify data requirements), defaults to 0.5 (neutral).
f₆: backlog_pressure(t) — Global signal, not per-job. High when queue is deep.
f₆(t) = min(1.0, queued_gpu_hours / running_gpu_hours)
Capped at 1.0. Affects all jobs equally — it’s a system-level urgency signal.
f₇: energy_cost(j, t) — Time-varying electricity price at scheduling time.
f₇(j, t) = 1.0 - normalized_energy_price(t)
Jobs score higher when energy is cheap. In federated mode, extends to energy_cost(j, t, site).
f₈: checkpoint_efficiency(j) — How cheaply can this job be preempted?
f₈(j) = 1.0 / (1.0 + estimated_checkpoint_minutes(j))
Jobs with fast checkpointing are more attractive to schedule on borrowed/preemptible nodes.
f₉: conformance_fitness(j, candidates) — How well do the candidate nodes match each other’s configuration?
f₉(j, candidates) = largest_conformance_group_size(candidates) / j.requested_nodes
Scores 1.0 when all candidate nodes share the same conformance fingerprint, lower when the node set is heterogeneous. Critical for multi-node jobs where driver/firmware mismatches cause subtle performance degradation or correctness issues (e.g., NCCL hangs from mismatched NIC firmware).
The conformance fingerprint is a hash of: GPU driver version, NIC firmware version, BIOS/BMC firmware version, and kernel parameters. The node agent computes and reports this fingerprint alongside health data. Nodes with identical fingerprints belong to the same conformance group.
This factor is evaluated during node selection (step 2a in the solver), not during scoring. The solver prefers to select nodes from the largest available conformance group that satisfies the allocation’s constraints.
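Putting the component functions together, Score(j) can be sketched as below. The field names on `j` and the per-component normalizations (e.g. dividing priority_class by 10) are assumptions for illustration; f₉ is omitted because it is applied during node selection rather than scoring, and f₂ uses the default reference_wait of one hour.

```python
import math

def score(j: dict, w: dict, backlog_pressure: float, energy_price: float) -> float:
    """Composite cost function: Score(j) = sum_i w_i * f_i(j). Illustrative sketch."""
    f = {
        "priority":   j["priority_class"] / 10.0,                            # f1
        "wait_time":  math.log(1.0 + j["wait_seconds"] / 3600.0),            # f2
        "fair_share": max(0.0, j["target_share"] - j["actual_usage"])
                      / j["target_share"],                                   # f3
        "topology":   1.0 - j["groups_needed"] / j["max_groups_available"],  # f4
        "data_ready": j.get("hot_fraction", 0.5),                            # f5, neutral default
        "backlog":    min(1.0, backlog_pressure),                            # f6, system-wide
        "energy":     1.0 - energy_price,                                    # f7, normalized price
        "checkpoint": 1.0 / (1.0 + j["est_checkpoint_minutes"]),             # f8
    }
    return sum(w[k] * f[k] for k in w)
```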
See data-staging.md for details on how input data is pre-staged during queue wait to improve f₅ scores. See preemption.md for how preemption classes interact with f₁ priority scoring. See network-domains.md for the VNI assignment that enables topology-aware placement (f₄).
Weight Profiles
| Weight | HPC Batch | ML Training | Service | Sensitive | Interactive |
|---|---|---|---|---|---|
| w₁ (priority) | 0.15 | 0.10 | 0.15 | 0.90 | 0.10 |
| w₂ (wait_time) | 0.20 | 0.10 | 0.05 | 0.00 | 0.30 |
| w₃ (fair_share) | 0.20 | 0.10 | 0.10 | 0.00 | 0.10 |
| w₄ (topology) | 0.15 | 0.25 | 0.05 | 0.00 | 0.00 |
| w₅ (data_ready) | 0.10 | 0.15 | 0.10 | 0.00 | 0.05 |
| w₆ (backlog) | 0.05 | 0.05 | 0.05 | 0.00 | 0.15 |
| w₇ (energy) | 0.00 | 0.05 | 0.10 | 0.00 | 0.00 |
| w₈ (checkpoint) | 0.05 | 0.10 | 0.10 | 0.00 | 0.00 |
| w₉ (conformance) | 0.10 | 0.10 | 0.30 | 0.10 | 0.30 |
Sensitive scheduler is degenerate: priority dominates because node claims are non-negotiable (w₁=0.90). Conformance (w₉=0.10) acts as a tiebreaker among conformant nodes; non-conformant nodes are excluded entirely as a hard constraint at the solver level (step 2a), not via the weight system.
Note: The CostWeights::default() in crates/lattice-common/src/types.rs provides a “balanced HPC” baseline (w₁=0.20, w₂=0.20, w₃=0.20, w₄=0.15, w₅=0.10, w₆=0.05, w₇=0.00, w₈=0.00, w₉=0.10). This is not identical to any named profile in the table above — it is a general-purpose starting point. Each vCluster should have its weights tuned for its workload type, either manually or via RM-Replay simulation.
Solver
The multi-dimensional knapsack is NP-hard in general. For our scale (tens to hundreds of pending large allocations), a greedy heuristic with backfill is sufficient:
Algorithm: GreedyTopologyAwareBackfill
1. Sort pending allocations by Score(j) descending
2. For each allocation j in sorted order:
a. Find the smallest set of available nodes that satisfies:
- Node count >= j.requested_nodes
- All nodes in fewest possible dragonfly groups
- All nodes in same conformance group (prefer) or fewest groups (fallback)
- Constraints satisfied (GPU type, features, etc.)
- Power budget not exceeded
b. If nodes found: PROPOSE allocation to quorum
c. If not found: try backfill (can j fit in gaps left by higher-priority reservations?)
3. Collect quorum responses (commit or reject)
4. For rejected proposals: re-queue, will try next cycle
Scheduling cycle: every 5-30 seconds (configurable per vCluster)
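One pass of the algorithm above can be sketched as a sort-then-propose loop. The callback names (`find_nodes`, `propose`) are illustrative stand-ins for node selection (step 2a) and the quorum round-trip (step 3); backfill is elided.

```python
def scheduling_cycle(pending, score_fn, find_nodes, propose):
    """One GreedyTopologyAwareBackfill pass, as a sketch.

    `find_nodes(alloc)` returns a node set or None; `propose(alloc, nodes)`
    submits to the quorum and returns True on commit.
    """
    requeued = []
    for alloc in sorted(pending, key=score_fn, reverse=True):   # step 1
        nodes = find_nodes(alloc)                               # step 2a
        if nodes is None or not propose(alloc, nodes):          # steps 2b-3
            requeued.append(alloc)                              # step 4: retry next cycle
    return requeued
```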
DAG Dependencies
DAGs (directed acyclic graphs) are first-class workflow primitives. Individual allocations within a DAG are scored by the knapsack solver like any other allocation — the DAG structure controls when allocations enter the queue, not how they are scored. Root allocations enter immediately; downstream allocations enter when their dependency conditions are satisfied. See dag-scheduling.md for the full DAG lifecycle and dependency conditions.
Reactive Scaling
Reactive allocations (autoscaling services) start at min_nodes and scale based on metric thresholds. Scale-up and scale-down are proposed as node ownership changes through the quorum. The knapsack solver handles each scale proposal as a regular allocation change. See autoscaling.md for the scaling loop, metrics, and cooldown behavior.
Elastic Resource Sharing
Nodes can be “borrowed” across vClusters:
vCluster A: 200 dedicated nodes, currently using 150
→ 50 idle nodes advertised as "borrowable" to other vClusters
vCluster B: 100 dedicated nodes, needs 120 for a pending job
→ Borrows 20 nodes from vCluster A's idle pool
→ These borrowed nodes have a preemption penalty in the cost function
→ If vCluster A needs them back: checkpoint + reclaim
The quorum tracks ownership at two levels:
- Home vCluster: permanent assignment (based on tenant contracts)
- Current vCluster: who is actually using the node right now
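The two ownership levels can be sketched as a small ledger that tracks home versus current vCluster per node. This is borrow/reclaim accounting only; in the real system every transition is a node-ownership change committed through the Raft quorum, and reclaim is preceded by a checkpoint.

```python
class NodeLedger:
    """Track home vs. current vCluster per node (illustrative sketch)."""

    def __init__(self, home: dict):
        self.home = dict(home)        # node -> home vCluster (contractual)
        self.current = dict(home)     # node -> vCluster using it right now

    def borrowable(self, vcluster: str, idle: set) -> list:
        """Idle nodes whose home is `vcluster` and that are not already lent out."""
        return [n for n in idle
                if self.home[n] == vcluster and self.current[n] == vcluster]

    def borrow(self, node: str, borrower: str):
        self.current[node] = borrower          # quorum-committed in reality

    def reclaim(self, node: str):
        self.current[node] = self.home[node]   # checkpoint + reclaim
```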
Checkpoint Cost Model
See checkpoint-broker.md for the full checkpoint decision framework.
Summary: checkpoint when Value > Cost, where value includes recompute_saved + preemptability + backlog_relief, and cost includes write_time + compute_waste + storage_cost. Backlog pressure increases checkpoint aggressiveness.
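The decision rule can be written down directly. The scalar terms here are illustrative placeholders for the broker's actual estimates (see checkpoint-broker.md); the backlog_relief term is what grows with queue depth and makes the broker more aggressive under backlog pressure.

```python
def should_checkpoint(recompute_saved: float, preemptability: float,
                      backlog_relief: float, write_time: float,
                      compute_waste: float, storage_cost: float) -> bool:
    """Checkpoint when Value > Cost (terms as in the summary above)."""
    value = recompute_saved + preemptability + backlog_relief
    cost = write_time + compute_waste + storage_cost
    return value > cost
```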
Simulation and Tuning
Use RM-Replay (tools/rm-replay/) to test scheduling configurations:
- Capture production workload traces
- Configure weight profiles
- Replay through simulator
- Evaluate: utilization, wait times, QoS compliance, fairness
- Iterate on weights before deploying to production
Reference: Martinasso et al., “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management” (SC18).
CLI Design
Design Principle
The CLI is the primary user interface. It should feel natural to Slurm users while exposing Lattice’s richer capabilities. Commands follow a consistent lattice <verb> [resource] [flags] pattern. Output is human-readable by default, machine-parseable with --output=json.
Command Structure
lattice <command> [subcommand] [arguments] [flags]
Global Flags
| Flag | Short | Description |
|---|---|---|
--output | -o | Output format: table (default), json, yaml, wide |
--quiet | -q | Suppress non-essential output |
--verbose | -v | Verbose output (debug info) |
--tenant | -t | Override tenant (for multi-tenant users) |
--vcluster | | Override vCluster selection |
--config | | Config file path (default: ~/.config/lattice/config.yaml) |
--no-color | | Disable colored output |
Authentication Commands
Login (lattice login)
Authenticate with the lattice server. Uses hpc-auth for OIDC token acquisition with cascading flow selection.
# Login (auto-discovers IdP from lattice-api auth discovery endpoint)
lattice login
# Force device code flow (for SSH sessions without browser)
lattice login --flow device
# Force manual paste flow
lattice login --flow manual
# Login to a specific server
lattice login --server cluster.example.com
Token is cached per-server in ~/.config/lattice/tokens.json with 0600 permissions (lenient mode: warn and fix if wrong).
Logout (lattice logout)
Clear cached token and revoke at IdP (best-effort).
lattice logout
Unauthenticated Commands
These commands do not require a token (INV-A1):
- lattice login / lattice logout
- lattice --version
- lattice --help
- lattice completions <shell>
All other commands require authentication. If no valid token is cached, the CLI prints:
Not logged in. Run `lattice login` first.
Expired tokens are silently refreshed if a valid refresh token exists.
Core Commands
Submit (lattice submit)
Submit an allocation or batch script.
# Submit a script (Slurm-compatible directives parsed)
lattice submit script.sh
# Submit with inline arguments
lattice submit --nodes=64 --walltime=72h --uenv=prgenv-gnu/24.11:v1 -- torchrun train.py
# Submit a task group (job array)
lattice submit --task-group=0-99%20 script.sh
# Submit with dependencies
lattice submit --depends-on=12345:success script.sh
# Submit a DAG from YAML
lattice dag submit workflow.yaml
# Submit to a specific vCluster
lattice submit --vcluster=ml-training script.sh
Output: Allocation ID on success.
Submitted allocation 12345
Status (lattice status)
Query allocation status.
# List own allocations
lattice status
# Specific allocation
lattice status 12345
# Filter by state
lattice status --state=running
# All allocations (tenant admin)
lattice status --all
# Watch mode (refresh every 5s)
lattice status --watch
Default output (table):
ID NAME STATE NODES WALLTIME ELAPSED VCLUSTER
12345 training-run Running 64 72:00:00 14:23:01 ml-training
12346 eval-job Pending 4 02:00:00 — hpc-batch
12347 sweep Running 1×20 04:00:00 01:12:33 hpc-batch
Wide output (-o wide): Adds columns: tenant, project, uenv, GPU type, dragonfly groups.
Cancel (lattice cancel)
Cancel allocations.
# Cancel single
lattice cancel 12345
# Cancel multiple
lattice cancel 12345 12346 12347
# Cancel all own pending allocations
lattice cancel --state=pending --all-mine
# Cancel a DAG
lattice dag cancel dag-789
Session (lattice session)
Create an interactive session. See sessions.md for details.
# Basic session
lattice session --walltime=4h
# With resources
lattice session --nodes=2 --constraint=gpu_type:GH200 --walltime=8h
# With uenv
lattice session --uenv=prgenv-gnu/24.11:v1 --walltime=4h
Attach (lattice attach)
Attach a terminal to a running allocation. See observability.md.
lattice attach 12345
lattice attach 12345 --node=x1000c0s0b0n3
lattice attach 12345 --command="nvidia-smi -l 1"
Launch (lattice launch)
Run a task within an existing allocation (srun equivalent).
# Run on all nodes
lattice launch --alloc=12345 hostname
# Run on specific number of tasks
lattice launch --alloc=12345 -n 4 ./my_program
# Run interactively with PTY
lattice launch --alloc=12345 --pty bash
Logs (lattice logs)
View allocation logs. See observability.md.
lattice logs 12345
lattice logs 12345 --follow
lattice logs 12345 --stderr --node=x1000c0s0b0n3
lattice logs 12345 --tail=100
Top / Watch / Diag / Compare
Monitoring commands. See observability.md.
lattice top 12345 # Metrics snapshot
lattice top 12345 --per-gpu # Per-GPU breakdown
lattice watch 12345 # Live streaming metrics
lattice watch 12345 --alerts-only # Alerts only
lattice diag 12345 # Network + storage diagnostics
lattice compare 12345 12346 --metric=gpu_util # Cross-allocation comparison
Telemetry (lattice telemetry)
Switch telemetry mode.
lattice telemetry --alloc=12345 --mode=debug --duration=30m
Nodes (lattice nodes)
View cluster nodes (read-only).
# List all nodes
lattice nodes
# Filter by state
lattice nodes --state=ready
# Filter by vCluster
lattice nodes --vcluster=hpc-batch
# Specific node details
lattice nodes x1000c0s0b0n0
Output:
NODE STATE GPUS VCLUSTER TENANT GROUP CONFORMANCE
x1000c0s0b0n0 Ready 4×GH200 hpc-batch physics 3 a1b2c3
x1000c0s0b0n1 Ready 4×GH200 hpc-batch physics 3 a1b2c3
x1000c0s1b0n0 Draining 4×GH200 ml-training ml-team 7 a1b2c3
History (lattice history)
Query completed allocations (accounting data).
lattice history
lattice history --since=2026-03-01 --until=2026-03-02
lattice history --output=json
DAG Commands (lattice dag)
lattice dag submit workflow.yaml # Submit a DAG
lattice dag status dag-789 # DAG status with per-allocation states
lattice dag list # List DAGs
lattice dag cancel dag-789 # Cancel a DAG
Cache Commands (lattice cache)
lattice cache warm --image=prgenv-gnu/24.11:v1 --group=3
lattice cache status --node=x1000c0s0b0n0
lattice cache evict --image=prgenv-gnu/24.11:v1 --node=x1000c0s0b0n0
Admin Commands (lattice admin)
Administrative commands require system-admin role.
# Node management
lattice node drain x1000c0s0b0n0
lattice node drain x1000c0s0b0n0 --urgent
lattice node undrain x1000c0s0b0n0
lattice node disable x1000c0s0b0n0
lattice node enable x1000c0s0b0n0
# Tenant management
lattice admin tenant create --name=physics --max-nodes=200
lattice admin tenant set-quota --name=physics --max-nodes=250
# vCluster management
lattice admin vcluster create --name=hpc-batch --scheduler=hpc-backfill --tenant=physics
lattice admin vcluster set-weights --name=hpc-batch --priority=0.20 ...
# Configuration
lattice admin config get accounting.enabled
lattice admin config set accounting.enabled=true
# Raft status
lattice admin raft status
Output Formats
| Format | Flag | Use Case |
|---|---|---|
table | Default | Human-readable, aligned columns |
wide | -o wide | Extended columns |
json | -o json | Machine-parseable, scripting |
yaml | -o yaml | Machine-parseable, config integration |
All formats support piping and redirection. JSON output uses newline-delimited JSON for streaming commands (`logs --follow`, `watch`).
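Because streaming commands emit newline-delimited JSON, a client can parse one event per line without buffering the whole stream. A minimal sketch (the event fields shown are illustrative, not a confirmed schema):

```python
import json
from typing import Iterator

def ndjson_events(lines) -> Iterator[dict]:
    """Parse a newline-delimited JSON stream (e.g. `lattice logs --follow -o json`),
    skipping blank keep-alive lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulated stream; real input would come from the CLI's stdout.
stream = [
    '{"ts": "2026-03-01T10:00:00Z", "node": "x1000c0s0b0n0", "line": "step 100"}',
    '',
    '{"ts": "2026-03-01T10:00:01Z", "node": "x1000c0s0b0n0", "line": "step 101"}',
]
events = list(ndjson_events(stream))
print(len(events))  # → 2
```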
Error Messages
Errors are human-readable with actionable guidance:
Error: allocation rejected — tenant "physics" exceeds max_nodes quota
Current: 195 nodes in use
Requested: 10 additional nodes
Limit: 200 nodes
Hint: Cancel running allocations or request a quota increase from your tenant admin.
Error: no nodes available matching constraints
GPU type: GH200
Nodes requested: 64
Available: 42 (22 in use by your allocations, 136 by other tenants)
Hint: Reduce node count, use --topology=any, or wait for resources.
Shell Completion
Shell completion is generated for bash, zsh, and fish:
# Generate completion
lattice completions bash > /etc/bash_completion.d/lattice
lattice completions zsh > ~/.zfunc/_lattice
lattice completions fish > ~/.config/fish/completions/lattice.fish
Completions cover: subcommands, flag names, allocation IDs (from recent lattice status), node IDs, vCluster names, uenv names.
Configuration File
# ~/.config/lattice/config.yaml
api_url: "https://lattice.example.com:50051"
default_tenant: "physics"
default_vcluster: "hpc-batch"
default_uenv: "prgenv-gnu/24.11:v1"
output_format: "table"
color: true
Environment variables override config file: LATTICE_API_URL, LATTICE_TENANT, LATTICE_VCLUSTER.
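The override order can be sketched as a small resolver: environment variables win over the config file, which wins over built-in defaults. The `ENV_KEYS` mapping covers the variables documented here; everything else about the resolver is an assumption, not the actual CLI implementation:

```python
# Precedence (highest wins): environment variable > config file > default.
# Mapping of config keys to their documented environment overrides.
ENV_KEYS = {
    "api_url": "LATTICE_API_URL",
    "default_tenant": "LATTICE_TENANT",
    "default_vcluster": "LATTICE_VCLUSTER",
}

def resolve(key: str, file_cfg: dict, env: dict, default=None):
    """Resolve one config value according to the documented precedence."""
    env_var = ENV_KEYS.get(key)
    if env_var and env_var in env:
        return env[env_var]
    return file_cfg.get(key, default)

cfg = {"api_url": "https://lattice.example.com:50051", "default_tenant": "physics"}
print(resolve("default_tenant", cfg, {"LATTICE_TENANT": "ml-team"}))  # → ml-team
print(resolve("api_url", cfg, {}))  # → https://lattice.example.com:50051
```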
Slurm Compatibility Aliases
For sites migrating from Slurm, optional shell aliases:
# Source from lattice-provided script
source $(lattice compat-aliases)
# Provides:
# sbatch → lattice submit
# squeue → lattice status
# scancel → lattice cancel
# salloc → lattice session
# srun → lattice launch
# sinfo → lattice nodes
# sacct → lattice history
These aliases translate Slurm flags to Lattice flags where possible. See slurm-migration.md for details.
Cross-References
- api-design.md — API endpoints that CLI commands map to
- sessions.md — Interactive session lifecycle
- observability.md — Monitoring commands (top, watch, diag, compare)
- slurm-migration.md — Slurm command translation details
Telemetry Architecture
Design Principle
Collect at high resolution, aggregate at configurable resolution, transmit out-of-band.
Three-Layer Pipeline
Layer 1: Collection (eBPF, always-on)
eBPF programs JIT-compiled into kernel, attached to tracepoints and kprobes.
Kernel-level metrics:
- CPU: context switches, runqueue depth, scheduling latency histograms
- Network: per-flow bytes/packets, Slingshot CSIG congestion signals from packet headers
- Block I/O: latency histograms, throughput per device (NVMe scratch, network mounts)
- Memory: allocation/free rates, NUMA locality, page faults
GPU metrics (via NVML/DCGM hooks):
- SM occupancy, memory utilization, power draw
- PCIe/NVLink throughput
- ECC error counts (feeds into checkpoint cost model)
Collection overhead: ~0.3% on compute-bound workloads. eBPF programs run in kernel context, with no syscall overhead and no userspace daemon polling.
Data flows into per-CPU ring buffers (BPF_MAP_TYPE_RINGBUF), consumed by the node agent.
Layer 2: Aggregation (Node Agent, switchable)
The node agent reads ring buffers and aggregates based on the current mode.
Mode: prod (default)
- 30-second aggregation windows
- Statistical summaries: p50, p95, p99, mean, max, count
- Cubic spline interpolation for time-series smoothing (reduces storage, preserves trends)
- Transmitted on Slingshot telemetry traffic class (separate from compute traffic)
- Additional overhead: ~0.1%
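The prod-mode summaries above amount to a per-window reducer over raw samples. A minimal sketch; nearest-rank percentiles stand in for whatever estimator the node agent actually uses:

```python
from statistics import mean

def summarize_window(samples: list[float]) -> dict:
    """Aggregate one 30-second window into the summary statistics listed
    above (p50/p95/p99, mean, max, count)."""
    s = sorted(samples)

    def pct(p: float) -> float:
        # Nearest-rank percentile; an illustrative stand-in estimator.
        return s[min(len(s) - 1, int(p / 100 * len(s)))]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99),
            "mean": mean(s), "max": s[-1], "count": len(s)}

window = [float(v) for v in range(100)]  # e.g. 100 latency samples
print(summarize_window(window)["p95"])  # → 95.0
```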
Mode: debug (per-job or per-node, time-limited)
- 1-second or sub-second raw event streams
- Full per-flow network traces
- GPU kernel-level profiling (CUPTI integration)
- Stored to job-specific S3 path for user analysis
- Additional overhead: ~2-5% (acceptable for debugging)
- Auto-reverts to prod after configured duration (default: 30 minutes)
Mode: audit (sensitive vCluster)
- All file access events (open, read, write, close) with user identity
- All API calls logged with request/response metadata
- Network flow summaries (source, destination, bytes, duration)
- Signed with Sovra keys (if federation enabled) for tamper evidence
- Additional overhead: ~1%
- Retention: 7 years (cold tier, S3-compatible archive)
Layer 3: Storage and Query
Time-series store — recommended: VictoriaMetrics (single-node or cluster) for single-site deployments; Thanos on top of Prometheus for federated multi-site deployments that need a global query view across sites:
- Ingestion: all nodes stream aggregated metrics
- Auto-downsampling: raw → 1m → 5m → 1h → 1d
- Retention policy configurable per tenant/vCluster
Four materialized views (label-based access control):
| View | Audience | Content |
|---|---|---|
| Holistic | System admins | System-wide utilization, power, health, scheduling efficiency |
| Tenant | Tenant admins | Per-tenant resource usage, quota tracking, job statistics |
| vCluster | Scheduler | Metrics feeding into cost function (GPU util, I/O, congestion) |
| User | Allocation owners | Per-allocation metrics scoped by OIDC identity (via lattice-api) |
Query interface: PromQL-compatible API. Grafana dashboards for visualization.
Debug traces: Stored to s3://{tenant}/{project}/{job_id}/telemetry/ with short retention (7 days default, configurable).
Audit logs: Append-only, encrypted at rest, stored to dedicated audit storage with long retention. Queryable for compliance reporting.
Switching Telemetry Mode
Via Intent API:
PATCH /v1/allocations/{id}
{ "telemetry": { "mode": "debug", "duration": "30m" } }
Via CLI:
lattice telemetry --alloc=12345 --mode=debug --duration=30m
Switching is instant — the eBPF programs are always collecting at full resolution. Only the aggregation behavior changes.
User-Facing Telemetry Query
The telemetry pipeline serves admin dashboards and the scheduler cost function. The user-facing query layer adds scoped access so allocation owners can query their own metrics without admin intervention.
Query Path
User → lattice-api → PromQL (scoped by alloc/tenant/user) → TSDB → response
The lattice-api injects label filters to ensure users only see metrics for their own allocations. Tenant admins can query any allocation within their tenant.
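Label injection can be sketched as a rewrite over metric selectors. The label names (`alloc_id`, `tenant`) and the naive regex are illustrative only, not the actual lattice-api implementation (which would need a real PromQL parser to handle existing label sets):

```python
import re

# Illustrative list of user-visible metric prefixes from the catalog below.
_METRIC = r'\b(gpu_\w+|cpu_\w+|memory_\w+|network_\w+|io_\w+)\b'

def scope_query(promql: str, alloc_id: str, tenant: str) -> str:
    """Append scoping label matchers to every bare metric selector so a
    caller can only see their own allocation's series."""
    labels = f'{{alloc_id="{alloc_id}",tenant="{tenant}"}}'
    return re.sub(_METRIC, lambda m: m.group(1) + labels, promql)

print(scope_query("avg(gpu_utilization)", "12345", "physics"))
# → avg(gpu_utilization{alloc_id="12345",tenant="physics"})
```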
Scoping Rules
| Caller | Visible Scope |
|---|---|
| Allocation owner | Metrics for their own allocations |
| Tenant admin | Metrics for any allocation in their tenant |
| System admin | All metrics (holistic view) |
User Metrics Catalog
| Metric | Description | Available In |
|---|---|---|
gpu_utilization | SM occupancy per GPU | prod, debug, audit |
gpu_memory_used | GPU memory in use | prod, debug, audit |
gpu_power_draw | GPU power consumption | prod, debug, audit |
cpu_utilization | CPU usage per node | prod, debug, audit |
memory_used | System memory in use | prod, debug, audit |
network_tx_bytes | Network bytes sent per second | prod, debug, audit |
network_rx_bytes | Network bytes received per second | prod, debug, audit |
io_read_bytes | Storage read throughput | prod, debug, audit |
io_write_bytes | Storage write throughput | prod, debug, audit |
io_latency_p99 | Storage I/O latency (p99) | prod, debug, audit |
Telemetry Streaming
For use cases requiring push-based updates (e.g., lattice watch), the StreamMetrics RPC fans out to node agents running the target allocation and merges their streams.
Architecture
lattice-api receives StreamMetrics request
→ identifies nodes running allocation (from quorum state)
→ opens per-node metric streams to node agents
→ merges streams with allocation-scoped labels
→ returns unified server-streaming response to client
In prod mode, node agents emit aggregated snapshots every 30 seconds. In debug mode, raw events stream at 1-second intervals. The client receives the same resolution as the current telemetry mode — switching mode (via PATCH /v1/allocations/{id}) takes effect on active streams.
Alert Generation
Node agents evaluate threshold rules locally and inject MetricAlert events into the stream when:
- GPU utilization < 10% for > 60s (potential hang)
- GPU memory > 95% (OOM risk)
- Network error rate exceeds 0.1%
- I/O p99 latency exceeds 10ms
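The first rule (low GPU utilization sustained for over 60 s) can be sketched as a run-length check over aggregated samples; the exact evaluator in the node agent may differ:

```python
def low_util_alert(samples, threshold=10.0, hold_s=60, interval_s=30) -> bool:
    """True if utilization stayed below `threshold` for roughly `hold_s`
    seconds, given samples spaced `interval_s` apart (prod mode: 30 s)."""
    needed = hold_s // interval_s + 1  # consecutive samples spanning ~hold_s
    run = 0
    for v in samples:
        run = run + 1 if v < threshold else 0
        if run >= needed:
            return True
    return False

print(low_util_alert([85, 4, 3, 2, 90]))  # → True (3 low samples ≈ 60 s)
```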
Cross-Allocation Comparison
Users can compare metrics across multiple allocations (e.g., successive training runs) via the CompareMetrics RPC or GET /v1/compare.
TSDB Query
The lattice-api issues parallel PromQL queries for each allocation ID, scoped to the requesting user’s permissions. Results are aligned by relative time (see below).
Relative Time Alignment
Allocations may run at different wall-clock times. Comparison uses relative-to-start alignment: each allocation’s metric series is indexed from t=0 (the allocation’s started_at timestamp). This allows apples-to-apples comparison of metrics across runs that started hours or days apart.
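Relative-to-start alignment amounts to re-indexing each series by its own `started_at`. A minimal sketch:

```python
from datetime import datetime, timezone

def align_to_start(series, started_at: datetime):
    """Re-index (timestamp, value) pairs to seconds since allocation start,
    so runs that began at different wall-clock times can be overlaid."""
    t0 = started_at.timestamp()
    return [(ts.timestamp() - t0, v) for ts, v in series]

start = datetime(2026, 3, 1, 10, 0, tzinfo=timezone.utc)
series = [(datetime(2026, 3, 1, 10, 0, 30, tzinfo=timezone.utc), 91.0),
          (datetime(2026, 3, 1, 10, 1, 0, tzinfo=timezone.utc), 93.5)]
print(align_to_start(series, start))  # → [(30.0, 91.0), (60.0, 93.5)]
```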
Feedback to Scheduler
The telemetry system feeds key metrics back to the scheduling cost function:
| Metric | Cost Function Component | Effect |
|---|---|---|
| GPU utilization per job | Efficiency scoring | Low util → deprioritize for topology-premium placement |
| Network congestion (CSIG) | topology_fitness | Congested groups → avoid placing new jobs there |
| I/O throughput per job | data_readiness | High I/O demand → ensure storage QoS before scheduling |
| Node ECC errors | checkpoint cost model | Rising errors → increase checkpoint urgency |
| Power draw per node | energy_cost | Feeds into power budget constraint |
Telemetry Aggregation Topology
For large systems (10,000+ nodes), direct streaming to a central store creates an ingestion bottleneck. Use hierarchical aggregation:
Nodes (per-group) → Group Aggregator → Central Store
- Each Slingshot dragonfly group has a designated aggregator node.
- Group aggregators perform first-level aggregation (merging per-node summaries).
- The central store receives per-group aggregated streams.
- Debug mode bypasses group aggregation; that job's nodes stream directly to the central store.
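A group aggregator's first-level merge can be sketched for the simple statistics. This is illustrative only: count, max, and a count-weighted mean merge exactly from per-node summaries, while percentiles do not, which is why real aggregators merge histograms or sketches instead:

```python
def merge_summaries(per_node: list[dict]) -> dict:
    """Merge per-node window summaries into one per-group summary."""
    count = sum(s["count"] for s in per_node)
    return {
        "count": count,
        "max": max(s["max"] for s in per_node),
        # Count-weighted mean is exact; percentiles would not be.
        "mean": sum(s["mean"] * s["count"] for s in per_node) / count,
    }

a = {"count": 10, "max": 9.0, "mean": 5.0}
b = {"count": 30, "max": 7.0, "mean": 3.0}
print(merge_summaries([a, b]))  # → {'count': 40, 'max': 9.0, 'mean': 3.5}
```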
Scheduler Self-Monitoring
Internal metrics for monitoring Lattice’s own health. These metrics feed into canary criteria during rolling upgrades (cross-ref: upgrades.md) and are available on the holistic dashboard.
Scheduling Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_scheduling_cycle_duration_seconds | histogram | vcluster | Time to complete one scheduling cycle |
lattice_scheduling_queue_depth | gauge | vcluster | Number of pending allocations |
lattice_scheduling_proposals_total | counter | vcluster, result (accepted/rejected) | Proposals sent to quorum |
lattice_scheduling_cost_function_duration_seconds | histogram | vcluster | Time to evaluate the cost function for all candidates |
lattice_scheduling_backfill_jobs_total | counter | vcluster | Allocations placed via backfill |
Quorum Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_raft_leader | gauge | member_id | 1 if this member is leader, 0 if follower |
lattice_raft_commit_latency_seconds | histogram | member_id | Time from proposal to commit |
lattice_raft_log_entries | gauge | member_id | Number of entries in the Raft log |
lattice_raft_snapshot_duration_seconds | histogram | member_id | Time to create a Raft snapshot |
API Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_api_requests_total | counter | method, status | Total API requests |
lattice_api_request_duration_seconds | histogram | method | Request latency |
lattice_api_active_streams | gauge | stream_type (attach/logs/metrics) | Active streaming connections |
Node Agent Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_agent_heartbeat_latency_seconds | histogram | node_id | Heartbeat round-trip time |
lattice_agent_allocation_startup_seconds | histogram | node_id | Time from allocation assignment to process start (includes uenv pull/mount) |
lattice_agent_ebpf_overhead_percent | gauge | node_id | Measured eBPF collection overhead |
Accounting Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_accounting_events_buffered | gauge | — | Events in the in-memory accounting buffer |
lattice_accounting_events_dropped_total | counter | — | Events dropped due to buffer overflow |
Federation Broker Metrics
When federation is enabled, the federation broker exposes additional metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_federation_proposals_total | counter | peer, result (accepted/rejected/timeout) | Placement proposals sent to/from peers |
lattice_federation_proposal_latency_seconds | histogram | peer | Round-trip time for federation proposals |
lattice_federation_peer_status | gauge | peer | 1 = connected, 0 = unreachable |
lattice_federation_data_gravity_score | gauge | peer, dataset | Data gravity score for placement decisions (higher = more data at peer) |
These metrics are only active when federation.enabled = true. The federation broker exposes them on the same /metrics endpoint as other components (default port: 9105).
Alerting Rules
Example alerting rules (PromQL-compatible):
| Rule | Condition | Severity |
|---|---|---|
| Scheduling cycle slow | histogram_quantile(0.99, lattice_scheduling_cycle_duration_seconds) > 30 | warning |
| Queue depth high | lattice_scheduling_queue_depth > 100 for 5 minutes | warning |
| Raft commit slow | histogram_quantile(0.99, lattice_raft_commit_latency_seconds) > 5 | critical |
| Node heartbeat missing | time() - lattice_agent_last_heartbeat_timestamp > 60 | node degraded |
| API error rate spike | rate(lattice_api_requests_total{status=~"5.."}[5m]) / rate(lattice_api_requests_total[5m]) > 0.05 | warning |
| Accounting buffer filling | lattice_accounting_events_buffered > 8000 | warning |
| VNI pool exhaustion approaching | (lattice_network_vni_pool_total - lattice_network_vni_pool_available) / lattice_network_vni_pool_total > 0.90 | warning |
| Quota utilization high | lattice_quota_used_nodes / lattice_quota_max_nodes > 0.95 for 10 minutes | warning |
| Raft disk usage high | lattice_raft_disk_used_bytes / lattice_raft_disk_total_bytes > 0.80 | warning |
| Snapshot storage growth | rate(lattice_raft_snapshot_size_bytes[1h]) > 100e6 | info |
Dashboard Views
Three views matching the existing telemetry pattern:
| Dashboard | Audience | Key Panels |
|---|---|---|
| Holistic | System admins | All scheduler cycle times, quorum health, total queue depth, API throughput |
| Per-vCluster | Scheduler operators | vCluster-specific queue depth, cycle time, proposal accept rate, backfill rate |
| Per-quorum-member | Quorum operators | Raft log size, commit latency, leader status, snapshot timing |
Monitoring Deployment
Prometheus Scrape Configuration
All Lattice components expose metrics on a /metrics endpoint (Prometheus exposition format):
| Component | Default Metrics Port | Endpoint |
|---|---|---|
| Quorum members | 9100 | http://{quorum-host}:9100/metrics |
| API servers | 9101 | http://{api-host}:9101/metrics |
| vCluster schedulers | 9102 | http://{scheduler-host}:9102/metrics |
| Node agents | 9103 | http://{node-host}:9103/metrics |
| Checkpoint broker | 9104 | http://{checkpoint-host}:9104/metrics |
Example Prometheus scrape config:
scrape_configs:
- job_name: "lattice-quorum"
static_configs:
- targets: ["quorum-1:9100", "quorum-2:9100", "quorum-3:9100"]
- job_name: "lattice-api"
static_configs:
- targets: ["api-1:9101", "api-2:9101"]
- job_name: "lattice-scheduler"
static_configs:
- targets: ["scheduler-hpc:9102", "scheduler-ml:9102", "scheduler-interactive:9102"]
- job_name: "lattice-agents"
file_sd_configs:
- files: ["/etc/prometheus/lattice-agents.json"]
refresh_interval: 5m
# Node agents are numerous; use file-based service discovery
# populated from OpenCHAMI node inventory
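The target file referenced by `file_sd_configs` follows Prometheus's standard file-based service discovery format (a JSON array of target groups); the labels shown are illustrative:

```json
[
  {
    "targets": ["x1000c0s0b0n0:9103", "x1000c0s0b0n1:9103"],
    "labels": { "job": "lattice-agents", "dragonfly_group": "3" }
  }
]
```

An inventory sync job can regenerate this file from OpenCHAMI; Prometheus picks up changes on its own at the configured `refresh_interval`, with no reload required.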
Alert Routing
Alerts are routed via Alertmanager (or compatible system):
| Severity | Route | Response Time |
|---|---|---|
| Critical | PagerDuty / on-call | Immediate (< 15 min) |
| Warning | Slack #lattice-alerts | Business hours (< 4 hours) |
| Info | Slack #lattice-info | Best effort |
Example Alertmanager route:
route:
receiver: "slack-info"
routes:
- match: { severity: "critical" }
receiver: "pagerduty-oncall"
- match: { severity: "warning" }
receiver: "slack-alerts"
Grafana Dashboards
Pre-built dashboards cover the three views described above, plus per-node and user-allocation views. Dashboards are defined as JSON and version-controlled in infra/grafana/:
infra/grafana/
├── holistic.json # System-wide overview
├── per-vcluster.json # vCluster-specific scheduling
├── per-quorum-member.json # Raft health
├── per-node.json # Individual node health
└── user-allocation.json # User-facing allocation metrics
Each dashboard uses the standard Lattice metric names. Data source: Prometheus (or compatible TSDB).
TSDB Sizing
| Cluster Size | Metric Cardinality | Ingestion Rate | Storage (30-day retention) |
|---|---|---|---|
| 100 nodes | ~50,000 series | ~10k samples/s | ~50 GB |
| 1,000 nodes | ~500,000 series | ~100k samples/s | ~500 GB |
| 10,000 nodes | ~5,000,000 series | ~1M samples/s | ~5 TB |
For clusters > 1000 nodes, use a horizontally scalable TSDB (VictoriaMetrics cluster, Mimir, or Thanos) with the hierarchical aggregation described in the Telemetry Aggregation Topology section above.
User-Facing Observability & Debugging
Design Principle
Lattice already collects high-resolution telemetry (eBPF, TSDB, three aggregation modes) for operator and scheduler use. This document describes the user-facing surface that lets job owners debug, monitor, and profile their own allocations without admin intervention.
All observability data flows through existing pipelines — no new collection infrastructure is required. The user-facing layer adds scoped query access, streaming endpoints, and interactive attach.
Overview
┌─ User ───────────────────────────────────────────────────────┐
│ lattice attach / logs / top / watch / diag / compare │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌─────────── lattice-api (gRPC + REST) ───────────────┐ │
│ │ Attach ──────────────── bidir stream to node agent │ │
│ │ Logs ────────────────── ring buffer (live) + S3 │ │
│ │ Metrics ─────────────── PromQL query to TSDB │ │
│ │ StreamMetrics ───────── fan-out to node agents │ │
│ │ Diagnostics ─────────── TSDB + fabric telemetry │ │
│ │ Compare ─────────────── multi-alloc TSDB query │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Node Agents S3 logs TSDB Slingshot CSIG │
└──────────────────────────────────────────────────────────────┘
| Capability | Data Source | Latency | CLI Command |
|---|---|---|---|
| Attach to running allocation | Node agent (nsenter) | Real-time | lattice attach <id> |
| Log streaming (live tail) | Node agent ring buffer | Sub-second | lattice logs <id> --follow |
| Historical logs | S3 | Seconds | lattice logs <id> |
| Live metrics (top) | TSDB | 30s (prod mode) | lattice top <id> |
| Live telemetry stream (watch) | Node agents (push) | 1-30s | lattice watch <id> |
| Diagnostics | TSDB + fabric telemetry | 30s | lattice diag <id> |
| Cross-allocation comparison | TSDB | Seconds | lattice compare <id1> <id2> |
| Application profiling | User tools (via tools_uenv) | N/A | User-driven |
Attach to Running Allocation
Architecture
The attach mechanism provides an interactive terminal session inside a running allocation’s execution environment. The node agent uses nsenter to enter the allocation’s mount and network namespaces — this is not a new allocation, just a terminal session in the existing one.
User → lattice-cli → lattice-api → gRPC bidir stream → node agent
│
nsenter into
mount/net ns
│
PTY ↔ shell
Terminal Protocol
The gRPC bidirectional stream carries:
- Client → Server: stdin bytes, terminal resize events, signals (SIGINT, SIGTSTP)
- Server → Client: stdout/stderr bytes, exit code on completion
The stream begins with an AttachStart message specifying the target node (for multi-node allocations) and command (default: user’s shell).
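A hypothetical sketch of the `AttachStart` message (field names and numbering are illustrative, not the actual lattice-common schema):

```protobuf
// Illustrative sketch only — not the actual lattice-common definition.
message AttachStart {
  string allocation_id = 1;
  string node_id = 2;    // optional: target node for multi-node allocations
  string command = 3;    // default: user's shell
  uint32 term_rows = 4;  // initial terminal dimensions for the PTY
  uint32 term_cols = 5;
}
```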
Authorization Model
| vCluster Type | Who Can Attach | Additional Constraints |
|---|---|---|
| HPC (backfill) | Allocation owner | — |
| Service (bin-pack) | Allocation owner | — |
| Interactive (FIFO) | Allocation owner | Already has session; attach is secondary terminal |
| Sensitive (reservation) | Claiming user only | Session recorded, audit trail, signed uenv only |
Sensitive Constraints
- Only the user who claimed the nodes (identity from Raft audit log) can attach
- All attach sessions are recorded (input + output) to the sensitive audit log
- Attach is only permitted when the allocation runs a signed uenv
- Session start/end events are Raft-committed audit entries
Attach During Node Crash
If the node hosting an attach session crashes or becomes unreachable:
- The gRPC bidirectional stream is dropped (connection reset).
- The API server detects the stream drop and sets `ended_at` on the `AttachSession` record.
- For sensitive allocations, the session end event is recorded in the audit log with reason `node_unreachable`.
- The client receives a stream error and can display: "connection to node lost — attach session ended".
Attach During Preemption
If the allocation is preempted while an attach session is active, the session is terminated gracefully. See sessions.md for the detailed preemption sequence. If the allocation is in Checkpointing state, new attach requests are rejected with: "allocation is being checkpointed — attach unavailable until rescheduled".
CLI Usage
# Attach to allocation (first node, user's shell)
lattice attach 12345
# Attach to a specific node
lattice attach 12345 --node=x1000c0s0b0n3
# Attach with a specific command
lattice attach 12345 --command="nvidia-smi -l 1"
Slurm Compatibility
| Slurm | Lattice |
|---|---|
srun --jobid=123 --pty bash | lattice attach 123 |
Log Streaming
Dual-Path Architecture
Logs use two paths to balance latency and durability:
1. Ring buffer (live tail): Each node agent maintains a per-allocation ring buffer (default 64 MB) of stdout/stderr. It supports low-latency streaming for `--follow` mode. Data is ephemeral — lost when the allocation ends or the buffer wraps.
2. S3 persistence: Node agents periodically flush log chunks to S3 for durable storage, available during and after allocation execution.
Process stdout/stderr
│
├──→ Ring buffer (node agent, 64 MB)
│ │
│ └──→ gRPC StreamLogs (live tail)
│
└──→ S3 flush (periodic, configurable interval)
│
└──→ REST GET /logs (historical)
Log Storage Layout
s3://{tenant}/{project}/{alloc_id}/logs/
├── stdout/{node_id}/{chunk_000..N}.log.zst
├── stderr/{node_id}/{chunk_000..N}.log.zst
└── metadata.json # timestamps, byte offsets, node list
Logs are compressed with zstd. The metadata file enables efficient range queries by time or byte offset.
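With per-chunk timestamps recorded in `metadata.json`, a range query only needs to fetch the chunks that overlap the requested window. A minimal sketch, assuming a hypothetical metadata schema:

```python
def chunks_in_range(metadata: dict, since: float, until: float) -> list[str]:
    """Select log chunks overlapping [since, until], using per-chunk
    start/end timestamps (metadata schema here is illustrative)."""
    return [c["key"] for c in metadata["chunks"]
            if c["end_ts"] >= since and c["start_ts"] <= until]

meta = {"chunks": [
    {"key": "stdout/x1000c0s0b0n0/chunk_000.log.zst", "start_ts": 0, "end_ts": 600},
    {"key": "stdout/x1000c0s0b0n0/chunk_001.log.zst", "start_ts": 600, "end_ts": 1200},
]}
print(chunks_in_range(meta, 700, 800))
# → ['stdout/x1000c0s0b0n0/chunk_001.log.zst']
```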
Streaming (Live Tail)
Via gRPC StreamLogs RPC (server-streaming). The client specifies:
- Allocation ID
- Stream filter: stdout, stderr, or both
- Node filter: specific node or all nodes
- Follow mode: whether to keep streaming as new output arrives
- Tail lines: number of lines from the ring buffer to replay on connect
Historical Log Access
Via REST GET /v1/allocations/{id}/logs:
- Query params: `stream` (stdout/stderr), `node`, `since`, `until`, `offset`, `limit`
- Returns paginated log entries from S3
- Available after allocation completion (subject to retention policy)
Sensitive Constraints
- Logs from sensitive allocations are encrypted at rest in the dedicated sensitive S3 pool
- All log access events are recorded in the sensitive audit log
- Log retention follows sensitive data retention policy (user-specified, minimum per regulation)
- Logs are only accessible to the claiming user and designated compliance reviewers
CLI Usage
# View logs (all nodes, both streams)
lattice logs 12345
# Follow mode (live tail)
lattice logs 12345 --follow
# Filter by stream and node
lattice logs 12345 --stderr --node=x1000c0s0b0n3
# Tail last 100 lines
lattice logs 12345 --tail=100
# Historical range
lattice logs 12345 --since="2026-03-01T10:00:00Z" --until="2026-03-01T11:00:00Z"
Slurm Compatibility
| Slurm | Lattice |
|---|---|
cat slurm-123.out | lattice logs 123 |
tail -f slurm-123.out | lattice logs 123 --follow |
User-Facing Live Metrics (lattice top)
Query Path
Metrics are served from the TSDB (not directly from node agents). The lattice-api translates user queries into PromQL, scoped to the requesting user’s allocations.
lattice top <id> → lattice-api → PromQL → TSDB → response
This reuses the existing telemetry pipeline. In prod mode, data has 30-second resolution. In debug mode (if switched), 1-second resolution.
Metrics Catalog
| Metric | Description | Unit |
|---|---|---|
gpu_utilization | SM occupancy per GPU | % |
gpu_memory_used | GPU memory in use | bytes |
gpu_power_draw | GPU power consumption | watts |
cpu_utilization | CPU usage per node | % |
memory_used | System memory in use | bytes |
network_tx_bytes | Network bytes sent | bytes/s |
network_rx_bytes | Network bytes received | bytes/s |
io_read_bytes | Storage read throughput | bytes/s |
io_write_bytes | Storage write throughput | bytes/s |
io_latency_p99 | Storage I/O latency (p99) | microseconds |
Display Modes
| Mode | Flag | Content |
|---|---|---|
| Summary (default) | — | Aggregated across all nodes: mean GPU%, total mem, total I/O |
| Per-node | --per-node | One row per node |
| Per-GPU | --per-gpu | One row per GPU across all nodes |
| Wide | --wide | All metrics in a wide table |
REST + gRPC Access
- REST:
GET /v1/allocations/{id}/metrics?mode=summary&duration=5m - gRPC:
QueryMetricsRPC withMetricsQueryRequest
CLI Usage
# Summary view (default)
lattice top 12345
# Per-node breakdown
lattice top 12345 --per-node
# Per-GPU breakdown
lattice top 12345 --per-gpu
# Wide format with all metrics
lattice top 12345 --wide
# Custom time window
lattice top 12345 --duration=1h
Live Telemetry Stream (lattice watch)
Push-Based Event Stream
Unlike lattice top (which queries TSDB), lattice watch opens a push-based stream from node agents for near-real-time events.
lattice watch <id> → lattice-api → fan-out → node agents
↑
stream merge
↑
per-node MetricsEvent streams
Relationship to Telemetry Modes
| Telemetry Mode | lattice top Resolution | lattice watch Resolution |
|---|---|---|
| prod | 30s (TSDB) | 30s (prod aggregation from node agent) |
| debug | 1s (TSDB) | 1s (raw events from node agent) |
| audit | 30s (TSDB) | 30s + access events |
Switching to debug mode (lattice telemetry --alloc=12345 --mode=debug) increases resolution for both top and watch.
Stream Content
Each MetricsEvent contains:
- Timestamp and node ID
- Current metric values (GPU, CPU, memory, network, I/O)
- Threshold alerts (if any metric exceeds configured bounds)
Alerts are generated by node agents when metrics cross thresholds:
- GPU utilization drops below 10% (potential hang)
- GPU memory utilization exceeds 95% (OOM risk)
- Network error rate exceeds threshold
- I/O latency spike detected
CLI Usage
# Watch all metrics (refreshing display)
lattice watch 12345
# Watch specific metrics
lattice watch 12345 --metrics=gpu_utilization,memory_used
# Watch with alerts only (suppress normal updates)
lattice watch 12345 --alerts-only
Diagnostics View
Network Diagnostics
Network health is critical for multi-node allocations. Diagnostics expose Slingshot-specific metrics that are otherwise invisible to users.
| Metric | Description | Source |
|---|---|---|
| CSIG congestion | In-band congestion signals per Slingshot group | eBPF CSIG tap |
| Group span | Number of dragonfly groups the allocation spans | Topology model |
| Inter-node bandwidth | Measured bandwidth between node pairs | eBPF network flow |
| NVLink throughput | GPU-to-GPU bandwidth (intra-node) | NVML |
Storage Diagnostics
| Metric | Description | Source |
|---|---|---|
| QoS floor vs actual | Configured storage QoS vs measured throughput | VAST API + eBPF I/O |
| Latency histogram | I/O latency distribution (p50/p95/p99) | eBPF block I/O |
| Mount health | Per-mount status (NFS, S3, scratch) | Node agent |
| IOPS | Read/write operations per second | eBPF block I/O |
Combined Diagnostics
lattice diag combines network and storage diagnostics into a single view with health indicators:
$ lattice diag 12345
Network:
Group span: 2 groups (g3, g7)
CSIG congestion: LOW (0.02 avg)
Inter-node BW: 190 GB/s avg (target: 200 GB/s) ✓
Storage:
/data/input (NFS): 12.5 GB/s read (QoS floor: 10 GB/s) ✓
/scratch (NVMe): 6.2 GB/s write, p99 latency: 45µs ✓
/home (NFS): 0.1 GB/s (idle) ✓
GPUs:
SM occupancy: 92% avg across 256 GPUs ✓
NVLink: 850 GB/s avg (of 900 GB/s) ✓
ECC errors: 0 ✓
CLI Usage
# Full diagnostics
lattice diag 12345
# Network only
lattice diag 12345 --network
# Storage only
lattice diag 12345 --storage
Cross-Allocation Comparison
TSDB Query
Compares metrics across multiple allocations by querying the same TSDB data used for lattice top. Useful for regression detection across training runs.
Time Alignment
Comparisons use relative-to-start time alignment: each allocation’s metrics are indexed from t=0 (allocation start), not wall clock time. This allows meaningful comparison of allocations that ran at different times.
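As a sketch, the alignment step amounts to re-indexing each sample against its allocation's start time (illustrative code, not the SDK):

```python
from datetime import datetime, timedelta

def align_relative(samples, alloc_start):
    """Re-index (timestamp, value) samples to seconds since allocation start."""
    return [((ts - alloc_start).total_seconds(), v) for ts, v in samples]

# Two runs that started at different wall-clock times...
start_a = datetime(2026, 3, 1, 9, 0)
start_b = datetime(2026, 3, 2, 14, 30)
run_a = [(start_a + timedelta(seconds=s), 0.90) for s in (0, 60, 120)]
run_b = [(start_b + timedelta(seconds=s), 0.85) for s in (0, 60, 120)]

# ...become directly comparable on a shared t=0 axis.
aligned_a = align_relative(run_a, start_a)
aligned_b = align_relative(run_b, start_b)
assert [t for t, _ in aligned_a] == [t for t, _ in aligned_b] == [0.0, 60.0, 120.0]
```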
CLI Usage
# Compare two allocations
lattice compare 12345 12346
# Compare specific metric
lattice compare 12345 12346 --metric=gpu_utilization
# JSON output for scripting
lattice compare 12345 12346 --output=json
REST Interface
GET /v1/compare?ids=12345,12346&metrics=gpu_utilization,io_write_bytes&align=relative
Application Profiling Integration
Scope
Lattice provides mechanisms for profiling, not profiler implementations. Users bring their own profiling tools, delivered via tools_uenv.
Profiler Delivery
Profiling tools are packaged as uenv images and mounted alongside the application uenv:
environment:
uenv: "prgenv-gnu/24.11:v1" # application stack
tools_uenv: "profiling/2024.1" # profilers: nsight, vtune, darshan, etc.
The tools_uenv mount provides profiler binaries without contaminating the application environment.
Usage Patterns
Batch profiling (non-interactive):
# Submit with profiling tools
lattice submit --uenv=prgenv-gnu/24.11:v1 --tools-uenv=profiling/2024.1 script.sh
# Script uses profiler internally (e.g., nsys profile ./train)
# Results written to output directory
Interactive profiling (attach-based):
# Attach and run profiler interactively
lattice attach 12345 --command="nsys profile --delay=60 -o /scratch/profile ./train"
Darshan / Score-P Integration Notes
- Darshan: LD_PRELOAD-based I/O profiling. No Lattice-specific integration needed; user loads Darshan from `tools_uenv` and sets `LD_PRELOAD`. Darshan logs written to scratch/output.
- Score-P: Instrumentation-based profiling. User compiles with Score-P wrappers from `tools_uenv`. Lattice provides no special support beyond tools delivery and `attach`.
Security Model
Authorization
All observability endpoints are scoped by OIDC token claims:
- Users can only query their own allocations (or allocations in their tenant, if tenant-admin)
- Token scopes: `allocations:read` (metrics, logs, diagnostics), `allocations:attach` (interactive attach)
- Sensitive allocations: only the claiming user (verified against Raft audit log)
Rate Limiting
All rate limits are per user (identified by OIDC subject claim). Tenant admins and system admins share the same limits unless overridden in system configuration.
| Endpoint | Rate Limit | Scope | Rationale |
|---|---|---|---|
| Attach | 5 concurrent sessions | Per user | Resource-intensive (PTY per session) |
| StreamLogs | 10 concurrent streams | Per user | Memory (ring buffer readers) |
| QueryMetrics | 60 req/min | Per user | TSDB query load |
| StreamMetrics | 5 concurrent streams | Per user | Node agent fan-out |
| Diagnostics | 30 req/min | Per user | TSDB + fabric query load |
| Compare | 10 req/min | Per user | Multi-alloc TSDB queries |
When rate limit is exceeded:
- Concurrent limits (Attach, StreamLogs, StreamMetrics): New request rejected with `429 Too Many Requests` and a message: "maximum concurrent sessions reached (5/5). Close an existing session to open a new one."
- Request-rate limits (QueryMetrics, Diagnostics, Compare): Request rejected with `429 Too Many Requests` and a `Retry-After` header indicating seconds until the next request is allowed.
- No queueing — rejected requests must be retried by the client.
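Since rejected requests are never queued, well-behaved clients back off using the `Retry-After` header. A client-side sketch (the `send` callable and the response triple are illustrative, not part of any Lattice SDK):

```python
import time

def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Retry a request-rate-limited call, honoring Retry-After on 429."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return body
        # The server says how many seconds until the next request is allowed.
        sleep(int(headers.get("Retry-After", "1")))
    raise RuntimeError(f"rate limited after {max_attempts} attempts")

# Simulated server: first call rejected, second succeeds.
responses = iter([(429, {"Retry-After": "2"}, None), (200, {}, "ok")])
waits = []
result = call_with_retry(lambda: next(responses), sleep=waits.append)
assert result == "ok" and waits == [2]
```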
Admin override: System admins can adjust per-user rate limits via configuration:
rate_limits:
attach_max_concurrent: 10 # override default of 5
query_metrics_per_minute: 120 # override default of 60
Data Sensitivity
| Data Type | Sensitivity | Handling |
|---|---|---|
| Metrics (GPU%, CPU%, I/O) | Low | Standard OIDC scoping |
| Logs (stdout/stderr) | Medium | May contain application data; encrypted at rest for sensitive |
| Attach (interactive terminal) | High | Session recorded for sensitive; PTY access = code execution |
| Diagnostics (network/storage) | Low | Infrastructure metrics, no application data |
| Profiling output | Medium | Written to user’s storage, no Lattice-managed persistence |
Security Architecture
Design Principle
Defense in depth with zero-trust internal communication. Every component authenticates to every other component. Trust boundaries are explicit and enforced by mTLS, RBAC, and network segmentation.
Trust Boundaries
User ──OIDC──→ lattice-api (direct, via hpc-auth) ──mTLS──→ quorum
│ │
│ mTLS │ mTLS
▼ ▼
node-agents ──namespace──→ workloads
│
│ mTLS/REST
▼
VAST / OpenCHAMI
Federation (optional):
quorum ──Sovra mTLS──→ federation-broker ──Sovra mTLS──→ remote quorum
STRIDE Threat Analysis
Spoofing
| Boundary | Attack | Mitigation |
|---|---|---|
| User → lattice-api | Stolen OIDC token | Short-lived tokens (5 min), token binding to client cert, MFA enforcement at IdP |
| Internal services | Rogue node agent | mTLS with site PKI (OpenCHAMI OPAAL-issued certificates). Node agents receive certs during boot via cloud-init. Cert CN must match node identity in quorum. |
| Federation | Rogue remote site | Sovra workspace-scoped certificates. Each site’s identity is cryptographically bound to its Sovra workspace. Revocable. |
Tampering
| Boundary | Attack | Mitigation |
|---|---|---|
| Quorum ↔ node agent | Fake heartbeat / state update | mTLS + message signing. Heartbeats include monotonic sequence number — replay detection. |
| uenv images | Compromised image | Image signing with site PKI (or Sovra PKI for federated images). Node agent verifies signature + hash before mount. Unsigned images rejected. |
| Raft log | Log manipulation | Raft log entries are chained (each entry references previous). Stored on local SSD with integrity checks. Snapshot checksums verified on restore. |
| API requests | Request modification in transit | TLS for all external connections. mTLS for all internal connections. |
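The chained-log mitigation above can be sketched as a hash chain: each entry embeds the digest of its predecessor, so altering any committed entry invalidates verification of every later entry. (Field names and the JSON encoding are illustrative; the actual Raft log format differs.)

```python
import hashlib, json

def chain(payloads):
    """Link log entries: each stores the SHA-256 of the previous entry."""
    prev, out = "0" * 64, []
    for payload in payloads:
        entry = {"prev": prev, "payload": payload}
        prev = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        out.append(entry)
    return out

def verify(entries):
    """Recompute the chain; any tampered entry breaks the link to its successor."""
    prev = "0" * 64
    for entry in entries:
        if entry["prev"] != prev:
            return False
        prev = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    return True

log = chain(["add-node x1000c0s0b0n0", "claim sensitive", "release"])
assert verify(log)
log[1]["payload"] = "claim sensitive (forged)"   # tamper with a middle entry
assert not verify(log)
```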
Repudiation
| Boundary | Attack | Mitigation |
|---|---|---|
| Sensitive actions | User denies accessing sensitive data | Raft-committed audit log with user identity (from OIDC). Cryptographically signed entries (Sovra keys if available, otherwise site PKI). 7-year retention. Tamper-evident chain. |
| Allocation submission | User denies submitting allocation | All API requests logged with authenticated user identity. Audit trail in lattice-api access logs. |
| Node claims | Deny claiming sensitive nodes | Node claim is a Raft-committed operation with user identity. Cannot be repudiated. |
Information Disclosure
| Boundary | Attack | Mitigation |
|---|---|---|
| Node ↔ storage | Data exfiltration via network sniffing | Encrypted transport: NFS-over-TLS (VAST supports), S3 over HTTPS. Sensitive: encrypted at rest (VAST encrypted pool). |
| Cross-tenant | Side-channel via co-location | Full-node scheduling (ADR-007): no co-location of different tenants by default. Interactive vCluster uses Sarus containers with seccomp for intra-node isolation. |
| Telemetry | Metric leakage between tenants | Label-based access control on TSDB queries. lattice-api injects tenant/user scope filters. |
| Memory | Data remnants after allocation | Node agent zeroes GPU memory and clears scratch storage (NVMe or tmpfs) on allocation release. Sensitive: full node wipe via OpenCHAMI. |
| API responses | Enumeration of other tenants’ data | RBAC filtering on all list/query endpoints. Users see only their own allocations; tenant admins see their tenant. |
Denial of Service
| Boundary | Attack | Mitigation |
|---|---|---|
| User → API | API flooding | Rate limiting per tenant (token bucket). Admission control: reject requests that exceed tenant’s request quota. lattice-api provides rate limiting via Tower middleware. |
| Node → quorum | Heartbeat storm | Heartbeat coalescing: node agents batch heartbeats. Quorum-side rate limiting per node (max 1 heartbeat per interval). |
| Scheduling | Malicious allocation specs | Validation at API layer: max resource requests bounded, max array size bounded, DAG cycle detection. Reject before reaching scheduler. |
| Storage | Storage exhaustion | Per-tenant storage quotas enforced by VAST. Checkpoint storage bounded per allocation. |
Elevation of Privilege
| Boundary | Attack | Mitigation |
|---|---|---|
| User → scheduler | Escalate priority class | RBAC: priority class tied to tenant contract, not user request. Users cannot set priority above their tenant’s maximum. |
| Node agent → host | Container/namespace escape | Sarus: seccomp profile, no root in container, read-only rootfs. uenv: mount namespace only (no user namespace needed), processes run as submitting user. No setuid binaries in uenv images (enforced at build time). |
| Tenant admin → system admin | Escalate administrative scope | Distinct RBAC roles with no implicit promotion. System admin requires separate authentication (not derivable from tenant admin token). |
| Workload → network | Break out of network domain | Slingshot VNI enforcement at NIC level (hardware-enforced). Workloads can only communicate within their assigned network domain. |
Internal Service Authentication
All inter-component communication uses mTLS in production. Node agents acquire certificates via the identity cascade (SPIRE → SelfSigned CA → Bootstrap certs). When no mTLS identity is available (dev, testing, break-glass), agents fall back to Bearer token auth via LATTICE_AGENT_TOKEN.
| Component | Certificate Source | Rotation | Fallback |
|---|---|---|---|
| Quorum members | Pre-provisioned during deployment | Annual rotation, Raft membership change for re-keying | — |
| Node agents | Identity cascade: SPIRE SVID → SelfSigned (quorum CA) → Bootstrap certs | CertRotator at 2/3 lifetime | LATTICE_AGENT_TOKEN Bearer token |
| API servers | Pre-provisioned or OPAAL | Annual rotation | — |
| vCluster schedulers | Pre-provisioned or OPAAL | Annual rotation | — |
| Checkpoint broker | Pre-provisioned or OPAAL | Annual rotation | — |
Agent authentication priority:
- mTLS (production) — agent acquires a `WorkloadIdentity` via the identity cascade and configures the gRPC channel with `ClientTlsConfig`. Server verifies the client certificate. No Bearer token needed.
- Bearer token (dev/testing/break-glass) — when no mTLS identity is available, agent reads `LATTICE_AGENT_TOKEN` from the environment and injects it as `Authorization: Bearer <token>` on all gRPC calls. Server validates via HMAC or JWKS.
Both paths coexist — mTLS takes priority. The LATTICE_AGENT_TOKEN path should be disabled in production (env var unset).
Certificate CN format: {component}.{site}.lattice.internal (e.g., node-042.alps.lattice.internal).
CA trust chain: Site root CA → intermediate CA (OPAAL) → component certificates.
Secret Management
Sensitive values are never stored in configuration files:
| Secret | Storage | Access Pattern |
|---|---|---|
| Waldur API token | Secrets manager (HashiCorp Vault or equivalent) | Referenced by path: vault://lattice/waldur-token |
| VAST API credentials | Secrets manager | Referenced by path |
| TLS private keys | Local filesystem (mode 0600) or TPM | Loaded at startup |
| OIDC client secret | Secrets manager | Used by hpc-auth (CLI) or lattice-api (server-side validation) |
| Sovra workspace key | Sovra key store (HSM-backed) | Used by federation broker |
Configuration files reference secrets by path, never by value:
waldur:
token_secret_ref: "vault://lattice/waldur-token"
vast:
credentials_ref: "vault://lattice/vast-creds"
RBAC Model
Three base roles, plus a sensitive-specific role:
| Role | Scope | Permissions |
|---|---|---|
| user | Own allocations | Submit, cancel, query own allocations. View own metrics. Attach to own sessions. |
| tenant-admin | Tenant’s allocations | All user permissions for any allocation in tenant. Manage tenant quotas (within limits). View tenant-level metrics. |
| system-admin | All | All operations. Manage vClusters, nodes, tenants. View holistic metrics. |
| claiming-user | Claimed sensitive nodes | User role + claim/release sensitive nodes. Access sensitive storage pool. All actions audit-logged. |
Role assignment:
- `user`: derived from OIDC token (any authenticated user)
- `tenant-admin`: assigned per-tenant in quorum state, or via `tenant-admin` role claim
- `system-admin`: assigned via quorum configuration, or via `admin`/`system:admin` scope
- `claiming-user`: assigned per-tenant by tenant-admin (sensitive tenants only)
- `operator`: assigned via `operator` scope or role claim
Cross-system role mapping (pact+lattice co-deployment):
When pact delegates operations to lattice (e.g., drain, cordon), the pact admin’s token carries a pact_role claim instead of lattice scopes. Lattice recognizes these cross-system role claims:
| Token claim | Value | Lattice role |
|---|---|---|
| `pact_role` | `pact-platform-admin` | SystemAdmin |
| `pact_role` or `lattice_role` | `system-admin` | SystemAdmin |
| `pact_role` or `lattice_role` | `tenant-admin` | TenantAdmin |
| `pact_role` or `lattice_role` | `operator` | Operator |
Standard OIDC scopes take precedence over role claims. Both are checked by derive_role().
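A sketch of that precedence order (role and scope names taken from this section; the real `derive_role()` in lattice-api may differ in detail):

```python
# Scope names from this document; illustrative, not the actual implementation.
SCOPE_ROLES = {"admin": "SystemAdmin", "system:admin": "SystemAdmin", "operator": "Operator"}
CLAIM_ROLES = {
    "pact-platform-admin": "SystemAdmin",
    "system-admin": "SystemAdmin",
    "tenant-admin": "TenantAdmin",
    "operator": "Operator",
}

def derive_role(token: dict) -> str:
    # 1. Standard OIDC scopes take precedence over role claims.
    for scope in token.get("scope", "").split():
        if scope in SCOPE_ROLES:
            return SCOPE_ROLES[scope]
    # 2. Cross-system role claims (pact+lattice co-deployment).
    for claim in ("pact_role", "lattice_role"):
        if token.get(claim) in CLAIM_ROLES:
            return CLAIM_ROLES[token[claim]]
    # 3. Any authenticated user gets the base role.
    return "User"

assert derive_role({"pact_role": "pact-platform-admin"}) == "SystemAdmin"
assert derive_role({"scope": "admin", "pact_role": "tenant-admin"}) == "SystemAdmin"
assert derive_role({"sub": "alice"}) == "User"
```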
Network Security
| Traffic Class | Network | Isolation |
|---|---|---|
| Management (mTLS, heartbeats) | Slingshot management traffic class | Dedicated bandwidth reservation |
| Compute (MPI, NCCL) | Slingshot compute VNIs | Hardware-isolated per network domain |
| Storage (NFS, S3) | Slingshot storage traffic class | QoS-enforced bandwidth |
| Telemetry (metrics) | Slingshot telemetry traffic class | Separate from compute, low priority |
| User access (API, SSH) | Out-of-band Ethernet | Firewalled, rate-limited |
Slingshot traffic classes provide hardware-enforced isolation — compute traffic cannot starve management traffic and vice versa.
Certificate Rotation
Quorum Members
- Generate new certificate from site CA (same CN format)
- Deploy new cert + key to the target member’s TLS directory
- Perform Raft membership change: remove old member, add “new” member (same node, new cert)
- Verify: `lattice admin raft status` shows member healthy with new cert serial
- Repeat for each member (one at a time, maintaining majority)
Node Agents
Node agents receive certificates from OPAAL during boot. Rotation is automatic on reboot:
- Drain the node: `lattice node drain <id>`
- Reboot (or reimage) via OpenCHAMI
- Node boots with new OPAAL-issued certificate
- Undrain: `lattice node undrain <id>`
For batch rotation without reboot (if OPAAL supports renewal):
- Node agent requests new cert from OPAAL
- Node agent reloads TLS context (graceful, no connection drop)
- New cert active on next heartbeat
API Servers and Schedulers
- Generate new certificate from site CA
- Deploy new cert + key to the component’s TLS directory
- Restart the component (stateless — no data loss)
- Load balancer health check confirms the component is back
Federation (Sovra Certificates)
Sovra workspace keys are managed by the Sovra key rotation protocol. Lattice components use derived tokens, which are automatically refreshed. No Lattice-side action is required for routine Sovra key rotation.
For emergency revocation: revoke the Sovra shared workspace (see federation.md — Removing a Federation Peer).
Additional Security Considerations
OIDC Token Refresh for Long-Lived Streams
Long-lived gRPC streams (Attach, StreamLogs, StreamMetrics) may outlive the OIDC access token’s lifetime:
- Token validation at stream open. The API server validates the OIDC token when the stream is established.
- Periodic re-validation. For streams lasting longer than `token_revalidation_interval` (default: 5 minutes), the API server re-validates the token’s claims against the OIDC provider. If the token has been revoked or the user’s permissions have changed, the stream is terminated with an `UNAUTHENTICATED` error.
- Client responsibility. Clients should refresh their access token before it expires and present the new token on reconnection if the stream is terminated.
Anti-Replay for API Requests
API requests are protected against replay attacks:
- TLS as primary defense. All external API communication uses TLS, which provides replay protection at the transport layer.
- Request idempotency. Mutating operations (Submit, Cancel, Update) use client-generated `request_id` fields for idempotency. Duplicate `request_id` values within a time window are rejected.
- Raft proposal deduplication. The quorum deduplicates proposals using the proposing scheduler’s identity and a monotonic sequence number. Replayed proposals are ignored.
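The windowed `request_id` dedup can be sketched as follows (the window length and data structure are illustrative, not the actual implementation):

```python
import time

class IdempotencyWindow:
    """Reject duplicate request_ids seen within `window` seconds."""
    def __init__(self, window=300.0, clock=time.monotonic):
        self.window, self.clock, self.seen = window, clock, {}

    def admit(self, request_id):
        now = self.clock()
        # Evict entries that have aged out of the window.
        self.seen = {r: t for r, t in self.seen.items() if now - t < self.window}
        if request_id in self.seen:
            return False              # duplicate within the window: reject
        self.seen[request_id] = now
        return True

t = [0.0]
w = IdempotencyWindow(window=300.0, clock=lambda: t[0])
assert w.admit("req-1")               # first submission accepted
assert not w.admit("req-1")           # replayed within window: rejected
t[0] = 301.0
assert w.admit("req-1")               # window elapsed: treated as a new request
```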
RBAC for Node Management
Node management operations (drain, undrain, disable) require the system-admin role:
| Operation | Required Role | Notes |
|---|---|---|
| `ListNodes`, `GetNode` | user | Read-only, filtered by tenant scope |
| `DrainNode`, `UndrainNode` | system-admin | Affects scheduling across all tenants |
| `DisableNode` | system-admin | Removes node from scheduling entirely |
| Sensitive node claim | claiming-user | Sensitive-specific role within tenant |
Certificate CN vs NodeId Mapping
Node agent certificates use a deterministic CN format that maps to the node’s xname identity:
- Format: `{xname}.{site}.lattice.internal` (e.g., `x1000c0s0b0n0.alps.lattice.internal`)
- Validation: On each heartbeat, the quorum verifies that the certificate CN matches the node ID reported in the heartbeat payload. A mismatch triggers an `UNAUTHENTICATED` error and an alert.
- Prevents: A compromised node agent from impersonating a different node.
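The validation step reduces to a string comparison against the expected CN (a sketch; real certificate parsing and extraction of the CN are omitted):

```python
def cn_matches_node(cert_cn: str, node_id: str, site: str) -> bool:
    """Check that a cert CN of the form {xname}.{site}.lattice.internal
    matches the node ID reported in a heartbeat."""
    return cert_cn == f"{node_id}.{site}.lattice.internal"

assert cn_matches_node("x1000c0s0b0n0.alps.lattice.internal", "x1000c0s0b0n0", "alps")
# Mismatch: a compromised agent reporting a different node's ID is caught.
assert not cn_matches_node("x1000c0s0b0n0.alps.lattice.internal", "x1000c0s0b0n1", "alps")
```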
Sensitive Session Recording Storage
Attach session recordings for sensitive allocations are stored alongside the audit log:
- Path: `s3://sensitive-audit/{tenant}/{alloc_id}/sessions/{session_id}.recording`
- Format: Raw byte stream (input + output interleaved with timestamps), compressed with zstd
- Encryption: Encrypted at rest using the sensitive storage pool’s encryption keys
- Retention: 7 years (matching sensitive audit log retention)
- Access: Only the claiming user and tenant-admin (compliance reviewer) can access recordings via the audit query API
Audit Signing Key Persistence
The Ed25519 signing key for audit log entries is loaded from a persistent file configured via QuorumConfig.audit_signing_key_path. This ensures:
- Chain continuity: Archived audit entries (in S3) can be verified after quorum restart
- Non-repudiation: The same key signs all entries, forming a verifiable chain
- Key rotation: Replace the file and restart the quorum to rotate (old entries remain verifiable with the old public key)
- Dev mode: When `audit_signing_key_path` is not set, a random key is generated (suitable for testing only)
REST API Authentication
REST and gRPC endpoints require authentication when OIDC or HMAC is configured:
- Bearer token required in the `Authorization` header (validated on every request)
- Two validation modes: JWKS (production, via `oidc_issuer`) or HMAC-SHA256 (dev/testing, via `LATTICE_OIDC_HMAC_SECRET`)
- REST middleware validates asynchronously (supports JWKS network fetch on cache miss)
- gRPC interceptor validates synchronously using cached JWKS keys (pre-fetched at startup) or HMAC
- Rate limiting applied per user
- Public endpoints exempt: `/healthz`, `/api/v1/auth/discovery`
- OIDC discovery client disables HTTP redirects (JWKS cache poisoning prevention)
- Non-HTTPS issuer URLs produce a warning (MITM risk)
- Server logs a prominent warning on startup if no authentication is configured
Service Discovery Isolation
Service discovery endpoints (LookupService, ListServices) are tenant-filtered:
- `x-lattice-tenant` header constrains results to the requesting tenant’s services
- Without the header, all services are visible (admin/operator access)
- Prevents cross-tenant information disclosure of service topology
Session Security
Interactive sessions are tracked globally in Raft state:
- `CreateSession`/`DeleteSession` are Raft-committed operations
- Sensitive allocations: at most one concurrent session globally (INV-C2)
- Sessions survive API server restart (persisted in quorum state)
- Ownership verified: only the allocation’s user can create sessions
Cross-References
- sensitive-workloads.md — Sensitive-specific security requirements
- failure-modes.md — Security implications of failure scenarios
- upgrades.md — Certificate rotation during upgrades
- accounting.md — Waldur API token management
Deployment & Bootstrapping
Design Principle
Lattice deploys on bare metal managed by OpenCHAMI. The bootstrap sequence is deterministic: infrastructure first, then control plane, then compute nodes. Each step is idempotent and can be retried. The system can be fully rebuilt from configuration files and Raft snapshots.
Prerequisites
Before deploying Lattice:
| Dependency | Required | Notes |
|---|---|---|
| OpenCHAMI | Yes | Node inventory, BMC discovery, boot service, identity (OPAAL) |
| VAST (or compatible NFS+S3) | Yes | Hot tier storage, QoS API |
| OIDC Provider | Yes | User authentication (institutional IdP) |
| PKI / Certificate Authority | Yes | mTLS certificates for all components |
| Secrets Manager | Yes | API tokens, TLS keys (Vault or equivalent) |
| Time-series database | Yes | VictoriaMetrics, Mimir, or Thanos |
| Slingshot/UE fabric | Yes | Network with VNI support |
| Waldur | Optional | External accounting (feature-flagged) |
| Sovra | Optional | Federation trust (feature-flagged) |
Network Topology
Lattice runs on the high-speed network (HSN — Slingshot/Ultra Ethernet, 200G+). When co-deployed with PACT, the two systems use different networks for clean failure isolation (PACT ADR-017):
| System | Network | Ports | Traffic |
|---|---|---|---|
| PACT | Management (1G) | gRPC 9443, Raft 9444 | Admin ops, boot overlay, config, shell |
| Lattice | HSN (200G+) | gRPC 50051, Raft 9000, REST 8080 | Scheduling, heartbeats, telemetry, allocation lifecycle |
Node (dual-homed):
├── Management NIC (1G Ethernet)
│ └── pact-agent ←mTLS→ pact-journal:9443
│
├── HSN NIC (200G+ Slingshot/UE)
│ ├── lattice-node-agent ←mTLS→ lattice-quorum:50051
│ └── workload traffic (MPI, NCCL, storage data plane)
│
└── SPIRE agent socket (local, network-agnostic)
├── pact-agent obtains SVID → uses on management net
└── lattice-node-agent obtains SVID → uses on HSN
Configuration: Set bind_network: hsn in quorum and node-agent config (default). This resolves to the HSN interface at startup. In standalone mode without PACT, bind_network: any (default 0.0.0.0) is acceptable.
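A minimal config fragment illustrating both modes (the `bind_network` key as described here; the full schema lives in the deployment configs):

```yaml
# Co-deployed with PACT: bind Lattice services to the HSN interface.
bind_network: hsn        # resolved to the HSN interface at startup (default)

# Standalone, no PACT:
# bind_network: any      # binds 0.0.0.0
```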
Failure isolation: Management net down → PACT degraded, lattice unaffected. HSN down → lattice paused, PACT unaffected (admin access works). See specs/failure-modes.md for full matrix.
Bootstrap Sequence
Phase 1: Infrastructure (OpenCHAMI)
1. Deploy OpenCHAMI services:
- Magellan (BMC discovery)
- SMD (State Management Daemon)
- BSS (Boot Script Service)
- OPAAL (Authentication)
2. Discover nodes via Redfish BMC scan
3. Register node inventory in SMD
4. Prepare boot images:
- Standard compute image (Linux + node agent)
- Sensitive hardened image (minimal kernel, SELinux, no SSH)
5. Generate PKI:
- Site root CA
- Intermediate CA for OPAAL
- Pre-provision quorum member certificates
Phase 2: Control Plane
1. Deploy quorum members (3 or 5 nodes, dedicated hardware):
a. Install lattice-quorum binary
b. Configure Raft cluster membership
c. Load TLS certificates (pre-provisioned)
d. Initialize Raft cluster:
- First member bootstraps as single-node cluster
- Additional members join via Raft AddMember
e. Verify: Raft leader elected, all members healthy
2. Deploy API servers (2+ for redundancy):
a. Install lattice-api binary
b. Configure quorum endpoints, TLS, OIDC provider
c. Place behind load balancer
d. Health check: /healthz returns 200
3. Deploy vCluster schedulers:
a. One scheduler instance per vCluster type
b. Configure cost function weights (from config file or quorum)
c. Verify: scheduling cycle runs (empty, no nodes yet)
4. Deploy checkpoint broker:
a. Install lattice-checkpoint binary
b. Configure quorum and VAST API endpoints
Phase 3: Compute Nodes
1. Configure BSS with standard compute image + cloud-init template:
- cloud-init installs node agent binary
- cloud-init generates TLS certificate via OPAAL
- cloud-init configures quorum endpoint
2. Boot nodes (batch: groups of 50-100):
- PXE boot → BSS serves image → cloud-init runs → node agent starts
3. Node agent startup:
a. Generate TLS cert from OPAAL (if not pre-provisioned)
b. Discover local hardware (GPUs via NVML/ROCm-SMI, NVMe if present, NIC)
c. Compute conformance fingerprint
d. Register with quorum (first heartbeat)
e. Report capabilities and health
4. Quorum auto-discovers nodes from first heartbeat.
No manual node registration required.
5. Verify: `lattice node list` shows all nodes in Ready state.
Phase 4: Configuration
1. Create tenants:
lattice admin tenant create --name="physics" --max-nodes=200
2. Create vClusters:
lattice admin vcluster create --name="hpc-batch" \
--scheduler=hpc-backfill \
--tenant=physics \
--nodes=x1000c0s0b0n[0-199]
3. Configure cost function weights (or use defaults):
lattice admin vcluster set-weights --name="hpc-batch" \
--priority=0.20 --wait-time=0.25 --fair-share=0.25 ...
4. (Optional) Configure Waldur accounting:
lattice admin config set accounting.enabled=true
lattice admin config set accounting.waldur.api_url="https://..."
5. (Optional) Configure federation:
lattice admin federation add-peer --endpoint=... --workspace=...
6. Test: submit a test allocation.
Quorum Initialization
First-Time Bootstrap
The first quorum member initializes a new Raft cluster using the --bootstrap
flag. This flag must only be passed once — on the very first startup of
node 1. All subsequent restarts omit it; the persisted Raft state (WAL +
snapshots) is sufficient to rejoin.
# First-ever start of node 1:
lattice-server --config /etc/lattice/server.yaml --bootstrap
# All subsequent restarts (including systemd):
lattice-server --config /etc/lattice/server.yaml
This creates an empty Raft log and elects node 1 as leader.
Adding Members
Subsequent members join the existing cluster:
# On the leader (or any member):
lattice-quorum membership add --node-id=quorum-2 --addr=quorum-2:4001
# On the new member:
lattice-quorum --join=quorum-1:4001 \
--node-id=quorum-2 \
--listen=0.0.0.0:4001 \
--data-dir=/var/lib/lattice/raft
The new member syncs the Raft log from the leader and becomes a follower.
Initial State
A freshly bootstrapped quorum has:
- Empty node registry (populated when nodes boot)
- Empty tenant/vCluster configuration (created by admin)
- Empty sensitive audit log
- Default system configuration
Disaster Recovery
Raft Snapshot + WAL Recovery
The quorum periodically snapshots its state and writes a WAL (Write-Ahead Log):
/var/lib/lattice/raft/
├── snapshots/
│ ├── snap-000100.bin # Raft state at log index 100
│ └── snap-000200.bin # Raft state at log index 200
├── wal/
│ ├── wal-000200-000300 # Log entries 200-300
│ └── wal-000300-000400 # Log entries 300-400
└── metadata.json # Current term, voted_for, last_applied
Backup: Snapshots are replicated to S3 (configurable interval, default: hourly):
s3://lattice-backup/raft/snap-{timestamp}.bin
Recovery Procedure
If all quorum members are lost:
1. Provision new quorum hardware (3 or 5 nodes)
2. Retrieve latest snapshot from S3:
aws s3 cp s3://lattice-backup/raft/snap-latest.bin /var/lib/lattice/raft/
3. Bootstrap from snapshot:
lattice-quorum --recover-from=/var/lib/lattice/raft/snap-latest.bin \
--node-id=quorum-1 --bootstrap
4. Add remaining quorum members (join the recovered leader)
5. Node agents will reconnect automatically (they retry with backoff)
6. Verify state:
lattice admin raft status
lattice node list
Data loss window: From the last snapshot to the failure. With hourly snapshots, at most 1 hour of Raft commits could be lost. In practice, node ownership changes are infrequent (scheduling cycles), so data loss is minimal.
Partial Quorum Loss
If a minority of quorum members fail (1 of 3, or 2 of 5):
- The cluster continues operating (Raft majority maintained)
- Replace failed members via Raft membership change:
  lattice-quorum membership remove --node-id=quorum-2
  lattice-quorum membership add --node-id=quorum-2-new --addr=...
- New member syncs from leader automatically
- No data loss, no downtime
Non-Raft State Backup
The Raft snapshot captures quorum state (node ownership, tenants, sensitive audit). Other stateful components require separate backup strategies:
| Component | State Location | Backup Strategy |
|---|---|---|
| TSDB (metrics) | VictoriaMetrics / Thanos | TSDB-native snapshot + S3 replication |
| S3 logs | s3://{tenant}/{project}/{alloc_id}/logs/ | S3 bucket versioning + cross-region replication |
| Accounting WAL | /var/lib/lattice/accounting-wal | Include in node backup or replicate to S3 |
| Sensitive audit log | Raft state (primary) + S3 archive (cold) | Covered by Raft snapshot; S3 archive has its own retention |
| Grafana dashboards | infra/grafana/ (version-controlled) | Git repository |
Recommended schedule: Daily backup verification for TSDB snapshots. Accounting WAL backed up on the same schedule as Raft snapshots.
Quorum Hardware Replacement
When a quorum member’s hardware fails and must be replaced:
1. Remove the failed member from the Raft cluster:
   lattice-quorum membership remove --node-id=quorum-2
   The cluster continues operating with the remaining majority.
2. Provision new hardware:
   - Install the same OS and lattice-quorum binary
   - Generate a new TLS certificate from the site CA (same CN format)
   - Configure the same data directory path
3. Add the new member to the cluster:
   # On an existing member:
   lattice-quorum membership add --node-id=quorum-2-new --addr=new-host:4001
   # On the new hardware:
   lattice-quorum --join=quorum-1:4001 \
     --node-id=quorum-2-new \
     --listen=0.0.0.0:4001 \
     --data-dir=/var/lib/lattice/raft
4. Verify: The new member syncs the full Raft log from the leader. Check with `lattice admin raft status`.
5. Cleanup: Remove the old member’s data directory from failed hardware (if recoverable). Update monitoring/alerting to reference the new member.
Important: Replace one member at a time. Wait for the new member to fully sync before replacing another. For a 3-member quorum, never have more than 1 member down simultaneously.
Configuration Management
All configuration is stored in two places:
| Configuration | Storage | Update Mechanism |
|---|---|---|
| Raft cluster membership | Raft log | Membership change commands |
| Tenant/vCluster definitions | Raft state machine | API calls (Raft-committed) |
| Cost function weights | Raft state machine | Hot-reloadable via API |
| Component config (listen addr, TLS paths) | Local config files | Restart required |
| Node agent config | cloud-init template | Reboot to apply changes |
Config files are version-controlled alongside deployment manifests. Changes to Raft-stored configuration are applied via API and take effect immediately.
Capacity Planning
| Cluster Size | Quorum Members | API Servers | Scheduler Instances | Quorum Hardware |
|---|---|---|---|---|
| < 100 nodes | 3 | 2 | 1 per vCluster type | 4 CPU, 16 GB RAM, 100 GB SSD |
| 100-1000 nodes | 3 | 3 | 1 per vCluster type | 8 CPU, 32 GB RAM, 200 GB SSD |
| 1000-5000 nodes | 5 | 5 | 2 per vCluster type | 16 CPU, 64 GB RAM, 500 GB SSD |
| 5000+ nodes | 5 | 5+ (behind LB) | 2+ per vCluster type | 32 CPU, 128 GB RAM, 1 TB SSD |
Quorum hardware notes: Quorum members are latency-sensitive (Raft commits). Dedicated NVMe SSD for WAL. Not co-located with compute workloads. Prefer separate hardware or at minimum separate failure domains.
Backup Verification
Snapshots replicated to S3 should be verified periodically to ensure they are restorable:
```
# Verify the latest snapshot is readable and consistent
lattice admin backup verify --source=s3://lattice-backup/raft/snap-latest.bin

# Verify a specific snapshot
lattice admin backup verify --source=s3://lattice-backup/raft/snap-20260301T120000.bin
```
Verification checks:
- Snapshot file integrity (checksum match)
- Raft metadata consistency (term, index, membership)
- Deserialization of state machine (all entries parseable)
Recommended schedule: Weekly automated verification via cron or CI pipeline. Alert on failure.
Snapshot Retention Policy
Local snapshots are retained on quorum member disks:
- Keep the last 5 snapshots (default, configurable via `raft.snapshot_retention_count`)
- Older snapshots are deleted after a new snapshot is confirmed written
S3 snapshots follow a lifecycle policy:
- Keep all snapshots for 7 days (hourly granularity)
- After 7 days: keep one snapshot per day for 30 days
- After 30 days: keep one snapshot per week for 90 days
- After 90 days: delete (unless sensitive audit retention requires longer)
Configure via S3 lifecycle rules on the lattice-backup bucket.
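The tiered policy above can be sketched as a small retention decision. This is a hypothetical helper, not the shipped implementation; real thinning would run as a cleanup job keyed on snapshot timestamps:

```rust
// Sketch of the S3 retention tiers described above (hypothetical helper).
// Ages are in hours since snapshot creation.
#[derive(Debug, PartialEq)]
enum Retention {
    KeepAll,   // < 7 days: every hourly snapshot
    KeepDaily, // 7-30 days: one snapshot per day
    KeepWeekly, // 30-90 days: one snapshot per week
    Delete,    // > 90 days (absent longer audit retention)
}

fn retention_tier(age_hours: u64) -> Retention {
    match age_hours {
        h if h < 7 * 24 => Retention::KeepAll,
        h if h < 30 * 24 => Retention::KeepDaily,
        h if h < 90 * 24 => Retention::KeepWeekly,
        _ => Retention::Delete,
    }
}
```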
Component Log Management
Lattice components log to stdout/stderr by default, managed by the system’s init system (systemd journald or equivalent).
Recommended log rotation:
| Component | Log Volume | Rotation |
|---|---|---|
| Quorum members | Low (Raft events, membership changes) | journald default (rotate at 4 GB or 1 month) |
| API servers | Medium (request logs, access logs) | journald or file rotation (rotate at 1 GB, keep 7 files) |
| vCluster schedulers | Low-Medium (scheduling cycle logs) | journald default |
| Node agents | Low per-node (heartbeats, allocation lifecycle) | journald default |
| Checkpoint broker | Low (checkpoint decisions) | journald default |
For centralized log collection, configure journald to forward to a log aggregator (e.g., Loki, Elasticsearch) via systemd-journal-remote or a sidecar agent.
Structured logging: All components emit JSON-formatted logs with fields: timestamp, level, component, message, and context-specific fields (e.g., allocation_id, node_id).
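A log line might look like the following (illustrative values; `allocation_id` and `node_id` are the documented context fields):

```json
{
  "timestamp": "2026-03-01T12:00:00Z",
  "level": "info",
  "component": "lattice-node-agent",
  "message": "allocation started",
  "allocation_id": "alloc-1234",
  "node_id": "node-042"
}
```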
Test/Dev Deployment (GCP)
For integration testing without bare metal, use the GCP test infrastructure:
infra/gcp/
├── terraform/main.tf              # 3 quorum + 2 compute + registry + TSDB
└── packer/lattice-compute.pkr.hcl # Pre-baked image with podman + squashfs-tools
scripts/deploy/
├── make-provision-bundle.sh       # Single tarball: binaries + scripts + systemd units
├── install-quorum.sh              # Reusable, no GCP-specific logic
├── install-compute.sh             # Reusable, HMAC token generation
└── validate.sh                    # Structured test runner (15 tests)
Workflow:
1. `packer build` — create compute image (once)
2. `terraform apply` — provision VMs
3. `make-provision-bundle.sh` — package release
4. SCP bundle to nodes, run `install-quorum.sh` (node 1 with `--bootstrap`), then `install-compute.sh`
5. `validate.sh` — run test matrix
6. `terraform destroy` — manual cleanup
The deploy scripts are reusable on-prem — no GCP-specific logic in install-*.sh.
Cross-References
- system-architecture.md — Seven-layer architecture overview
- security.md — PKI, mTLS, certificate provisioning
- upgrades.md — Rolling upgrade procedure (after initial deployment)
- failure-modes.md — Component failure and recovery
- node-lifecycle.md — Node boot and registration
Failure Modes and Recovery
Design Principle
Fail-safe defaults. Running allocations survive component failures. Modeled after Slurm’s proven failure patterns, mapped to Lattice’s distributed architecture: requeue on node failure, state recovery on controller restart, running jobs unaffected by control plane restarts.
Component Failures
Quorum Member Loss
Detection: Raft heartbeat timeout (default: 500ms).
Recovery: Raft tolerates minority failure. A 3-member quorum tolerates 1 failure; a 5-member quorum tolerates 2. The remaining majority continues serving reads and commits. No scheduling disruption.
Action: Alert ops. Replace failed member via Raft membership change (add new → remove old). No data loss — Raft log is replicated.
Quorum Leader Loss
Detection: Raft follower timeout triggers leader election.
Recovery: New leader elected within seconds (typically 1-3s depending on election timeout configuration). In-flight proposals that were not committed are retried by the proposing vCluster scheduler on the next scheduling cycle.
Data loss risk: None. Uncommitted proposals are re-proposed. Committed state is durable.
Complete Quorum Loss
Detection: All quorum members unreachable. API server returns unavailable.
Recovery: Restore from most recent Raft snapshot + WAL replay (analogous to slurmctld --recover). The latest snapshot is stored on persistent storage (local SSD + replicated to S3). Recovery restores node ownership and sensitive audit state to the last committed entry.
Impact during outage: No new allocations can be scheduled (proposals cannot be committed). Running allocations continue — node agents operate autonomously. Node agents buffer heartbeats and replay on quorum recovery.
Node Agent Crash
Detection: Heartbeat timeout (default: 30s) followed by grace period (default: 60s). Total time to Down transition: ~90s. Analogous to Slurm’s SlurmdTimeout.
Recovery:
- Quorum marks node as `Degraded` after first missed heartbeat
- After grace period (default: 60s), node transitions to `Down`
- Allocations on the node are requeued (if `requeue` policy allows) or marked `Failed`
- Node agent restarts → loads persisted state from `/var/lib/lattice/agent-state.json` → reattaches to surviving workload processes (PID liveness check via `kill(pid, 0)`) → cleans up orphaned cgroups → re-registers with quorum → health check → re-enters scheduling pool
Workloads survive agent restart because the systemd unit uses KillMode=process (only the agent process is killed, not children in their own cgroup scopes).
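The reattach step can be sketched as follows. This is a simplified stand-in: the real agent uses `kill(pid, 0)`, while this sketch probes `/proc/<pid>` to stay libc-free, and `reattachable` is a hypothetical helper name:

```rust
use std::path::Path;

// Liveness stand-in for kill(pid, 0): on Linux, a live process has a /proc entry.
fn pid_alive(pid: u32) -> bool {
    Path::new(&format!("/proc/{pid}")).exists()
}

// On restart, keep only persisted state entries whose workload process still
// exists; anything else is treated as orphaned and its cgroup is cleaned up.
fn reattachable(persisted_pids: &[u32]) -> Vec<u32> {
    persisted_pids.iter().copied().filter(|&p| pid_alive(p)).collect()
}
```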
Sensitive nodes: Longer grace period (default: 5 minutes) to avoid false positives from transient issues. Sensitive allocations are never automatically requeued — operator intervention required.
Node Hardware Failure
Detection: Dual-path: heartbeat timeout (node agent) + OpenCHAMI Redfish BMC polling (out-of-band).
Recovery: Same as agent crash, but OpenCHAMI can detect hardware failures (PSU, memory ECC uncorrectable, GPU fallen off bus) before heartbeat timeout. BMC-detected failures trigger immediate Down transition, skipping the grace period.
vCluster Scheduler Crash
Detection: Health check failure (liveness probe).
Recovery: vCluster schedulers are stateless — they read pending allocations and node state from the quorum on each scheduling cycle. Restart from quorum state. No scheduling occurs for this vCluster during downtime, but running allocations continue unaffected (like slurmctld crash: running jobs are fine).
Data loss risk: None. Pending allocations are persisted in the quorum.
API Server Crash
Detection: Load balancer health check / liveness probe.
Recovery: API servers are stateless. Restart and resume serving. Multiple API server replicas behind a load balancer provide redundancy. Client retries with exponential backoff. No job loss.
Checkpoint Broker Crash
Detection: Health check failure.
Recovery: Pending checkpoint requests are lost (they were in-memory). On restart, the broker re-evaluates all running allocations against the checkpoint cost model. Allocations that should have been checkpointed will be identified on the next evaluation cycle.
Data loss risk: Minimal. At worst, one evaluation cycle’s worth of checkpoint decisions are delayed. No allocation data is lost.
Infrastructure Failures
Network Partition: Node ↔ Quorum
Detection: Heartbeat timeout on the quorum side; connection failure on the node side.
Recovery:
- Quorum side: nodes marked unreachable → `Degraded` → `Down` after grace period. Allocations requeued.
- Node side: node agent continues running allocations autonomously. Buffers heartbeats and state updates. When connectivity restores, replays buffered state to quorum.
- If partition heals before grace period: node returns to `Ready`, no allocation disruption.
Sensitive: Extended grace period (5 minutes). Network partitions are logged as audit events.
Network Partition: Quorum Split-Brain
Detection: Raft protocol prevents split-brain by design.
Recovery: The minority partition cannot achieve quorum and therefore cannot commit any proposals. The majority partition continues operating normally. When the partition heals, the minority members catch up via Raft log replication. No divergent state is possible.
Storage Unavailability (VAST Down)
Detection: Failed VAST API calls / NFS mount timeouts.
Impact:
- Data staging for new allocations pauses (cannot pre-stage input data)
- Running allocations with data already mounted continue (local NVMe cache, if present, persists)
- Checkpoint writes fail → broker pauses checkpoint scheduling
- New allocation proposals that require data staging are held in queue
Recovery: Automatic retry with backoff. Alert raised. Staging resumes when VAST recovers. On nodes with NVMe cache, locally cached data persists through storage outage.
OpenCHAMI Unavailable
Detection: Failed API calls to OpenCHAMI endpoints.
Impact:
- Node boot/reimaging blocked (cannot provision new nodes)
- Node wipe-on-release blocked (sensitive nodes held in quarantine state)
- Running allocations unaffected
- Scheduling of new allocations to already-booted nodes continues normally
Recovery: Operations that require OpenCHAMI are queued and retried. Alert raised.
Allocation-Level Failures
Prologue Failure (uenv Pull/Mount)
Detection: Node agent reports prologue error to quorum.
Recovery:
- Node drained for this allocation (other allocations on the node unaffected)
- Allocation retried on different nodes (analogous to Slurm PrologSlurmctld failure)
- Max retries configurable (default: 3)
- After max retries: allocation moves to `Failed` state, user notified
Common causes: Corrupted uenv image (hash mismatch), local cache full (if NVMe present), registry unavailable.
Application Crash
Detection: Node agent detects process exit with non-zero status.
Recovery:
- Allocation moves to `Failed` state
- Nodes released back to scheduling pool
- If allocation has `requeue: on_node_failure` or `requeue: always`: re-enter queue
- DAG dependencies evaluated (cross-ref: dag-scheduling.md)
Walltime Exceeded
Detection: Node agent timer.
Recovery:
- `SIGTERM` sent to all processes in the allocation
- Grace period (default: 30s) for clean shutdown
- `SIGKILL` if processes still running after grace period
- Nodes released
- Allocation marked as `Failed` with reason `walltime_exceeded`
Walltime Exceeded During Checkpoint
If an allocation’s walltime expires while a checkpoint is in progress:
- Walltime takes priority. The walltime timer is not extended to accommodate an in-progress checkpoint.
- `SIGTERM` is sent as normal. If the checkpoint completes within the SIGTERM grace period (default: 30s), the checkpoint is usable and the allocation is marked `Suspended` (can be resumed).
- If the checkpoint does not complete within the grace period, `SIGKILL` is sent. The incomplete checkpoint is discarded and the allocation is marked `Failed` with reason `walltime_exceeded`.
- The checkpoint broker tracks this race condition via the `lattice_checkpoint_walltime_conflict_total` counter metric.
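The outcome rules above reduce to a small decision function (hypothetical types, not the broker's API):

```rust
// Sketch of the walltime-vs-checkpoint race outcome (hypothetical types).
#[derive(Debug, PartialEq)]
enum Outcome {
    Suspended, // checkpoint finished within the SIGTERM grace period; resumable
    Failed,    // SIGKILL sent; incomplete checkpoint discarded
}

fn walltime_checkpoint_outcome(checkpoint_secs_remaining: u32, grace_secs: u32) -> Outcome {
    if checkpoint_secs_remaining <= grace_secs {
        Outcome::Suspended
    } else {
        Outcome::Failed
    }
}
```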
Recovery Matrix
| Failure | Detection | Recovery Action | Data Loss Risk |
|---|---|---|---|
| Quorum member loss | Raft heartbeat | Leader election, continue | None |
| Quorum leader loss | Raft timeout | New election (1-3s) | None (uncommitted retried) |
| Complete quorum loss | All members down | Snapshot + WAL recovery | None (last committed state) |
| Node agent crash | Heartbeat timeout (30s) + grace (60s) | Degrade → Down → requeue | Running allocation output since last checkpoint |
| Node hardware failure | BMC + heartbeat | Immediate Down → requeue | Running allocation output since last checkpoint |
| vCluster scheduler crash | Health check | Stateless restart | None |
| API server crash | Health check | Stateless restart | None |
| Checkpoint broker crash | Health check | Restart, re-evaluate | Delayed checkpoint decisions |
| Network partition (node) | Heartbeat timeout | Grace period → requeue | None if heals in time |
| Network partition (quorum) | Raft protocol | Minority stalls, majority continues | None |
| VAST down | API timeout | Queue staging, continue running | None |
| OpenCHAMI down | API timeout | Queue provisioning ops | None |
| Prologue failure | Agent report | Retry on different nodes | None |
| Application crash | Process exit | Release nodes, optional requeue | Application-dependent |
| Walltime exceeded | Agent timer | SIGTERM → SIGKILL → release | Unsaved work |
Allocation Requeue Policy
Configurable per allocation at submission time:
| Policy | Behavior |
|---|---|
| `never` | Allocation fails permanently on any node failure. Default for interactive sessions. |
| `on_node_failure` | Requeue only when the failure is node-side (hardware, agent crash, network partition). Default for batch allocations. |
| `always` | Requeue on any failure, including application crash. Use with caution — can cause infinite requeue loops for buggy applications. |
Max requeue count: Default 3. Configurable per allocation (max 100, validated at submission). After max requeues, allocation transitions to Failed regardless of policy. Requeue uses optimistic concurrency (expected_requeue_count) to prevent double-increment from concurrent reconcilers.
Requeue behavior: Requeued allocations retain their original submission time for fair-share and wait-time calculations (no queue-jumping penalty, no starvation). Just-requeued allocations are excluded from the pending set in the same scheduler cycle (TOCTOU prevention).
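The optimistic-concurrency guard might look like this (hypothetical struct and handler names; the real logic lives in the quorum state machine):

```rust
// Sketch of the expected_requeue_count compare-and-increment guard.
struct Allocation {
    requeue_count: u32,
    max_requeue: u32, // default 3, capped at 100 at submission
}

#[derive(Debug, PartialEq)]
enum RequeueResult {
    Requeued(u32), // new requeue_count
    StaleRequest,  // expected_requeue_count mismatch: a concurrent reconciler won
    MaxedOut,      // allocation transitions to Failed instead
}

fn try_requeue(alloc: &mut Allocation, expected_requeue_count: u32) -> RequeueResult {
    if alloc.requeue_count != expected_requeue_count {
        return RequeueResult::StaleRequest; // prevents double-increment
    }
    if alloc.requeue_count >= alloc.max_requeue {
        return RequeueResult::MaxedOut;
    }
    alloc.requeue_count += 1;
    RequeueResult::Requeued(alloc.requeue_count)
}
```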
Service Failure Detection (Liveness Probes)
For Unbounded and Reactive allocations with a liveness_probe configured:
- Node agent runs the probe periodically (TCP connect or HTTP GET)
- Consecutive failures tracked by ProbeManager (per-allocation counter)
- Threshold exceeded → allocation marked Failed by node agent
- Reconciler detects Failed service → requeues per policy (if not at max_requeue)
- Scheduler re-places the allocation on available nodes
Timeline: initial_delay (default 10s) → periodic probes (default 30s) → failure_threshold (default 3) → Failed → next scheduler cycle requeues.
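The failure-streak logic can be sketched as follows (simplified; the real ProbeManager also handles `initial_delay` and probe execution):

```rust
// Per-allocation consecutive-failure counter, as tracked by ProbeManager.
struct ProbeState {
    consecutive_failures: u32,
    failure_threshold: u32, // default 3
}

impl ProbeState {
    /// Record one probe result; returns true when the allocation
    /// should be marked Failed by the node agent.
    fn record(&mut self, probe_ok: bool) -> bool {
        if probe_ok {
            self.consecutive_failures = 0; // any success resets the streak
        } else {
            self.consecutive_failures += 1;
        }
        self.consecutive_failures >= self.failure_threshold
    }
}
```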
Service Registry Failure
If the service registry becomes inconsistent (e.g., allocation completes but endpoint not deregistered):
- Registry is part of the Raft state machine — same consistency guarantees as node ownership
- Endpoint registration/deregistration happens atomically in the `update_allocation_state()` handler
- Deregistration also occurs in the `requeue_allocation()` handler
- Empty service entries are cleaned up automatically
Cross-References
- scheduling-algorithm.md — f₈ checkpoint_efficiency affects preemption cost
- dag-scheduling.md — Failure propagation in DAG workflows
- sensitive-workloads.md — Sensitive-specific failure handling (longer grace periods, no auto-requeue)
- accounting.md — Accounting service failure buffering
- upgrades.md — Failure detection during canary rollouts
- sessions.md — Interactive session disconnect/reconnect during node failures
Upgrades and Rollouts
Design Principle
Zero-downtime upgrades. No running allocation is disrupted by an upgrade. Components are upgraded independently. Protocol backward compatibility ensures mixed-version operation during rolling upgrades.
Protocol Versioning
All gRPC services are versioned (`lattice.v1.*`):
- New fields are additive (backward compatible within a major version)
- Breaking changes require a new version (`lattice.v2.*`)
- During rolling upgrades, node agents and quorum members must support both version N and N-1
- Version negotiation on connection establishment: components advertise supported versions and use the highest common version
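Highest-common-version selection can be sketched as follows (illustrative; the real negotiation rides on connection establishment):

```rust
// Pick the highest protocol version supported by both sides, or None
// if there is no overlap (the connection would be refused).
fn negotiate(ours: &[u32], theirs: &[u32]) -> Option<u32> {
    ours.iter()
        .copied()
        .filter(|v| theirs.contains(v))
        .max()
}
```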
Upgrade Order
Components are upgraded in dependency order, from leaf to core:
1. Node agents (rolling, batched)
2. vCluster schedulers (rolling)
3. API servers (rolling)
4. Quorum members (Raft rolling membership change, one at a time)
This order ensures that core components (quorum) speak the old protocol until all clients (node agents, schedulers) are upgraded. The quorum is upgraded last because it’s the most critical and the hardest to roll back.
Node Agent Rolling Upgrade
Procedure
For each batch of nodes:
1. Drain: Stop scheduling new allocations to the node. Node enters `Draining` state. If no allocations are running, it transitions directly to `Drained`.
2. Wait: Running allocations complete naturally. The scheduler loop transitions the node from `Draining` to `Drained` once all allocations finish. For urgent upgrades: checkpoint running allocations and migrate (cross-ref: checkpoint-broker.md).
3. Upgrade: Replace the node agent binary while the node is `Drained`. Configuration is preserved.
4. Restart: Node agent starts, re-registers with quorum using the new protocol version.
5. Health check: Node passes health check (heartbeat, GPU detection, network test).
6. Undrain: Operator runs `undrain`. Node transitions from `Drained` to `Ready` and is available for scheduling.
Canary Strategy
- Upgrade 1-2 nodes first (canary set)
- Monitor canary nodes for the observation window (default: 15 minutes):
- Scheduling cycle latency within SLO (cross-ref: telemetry.md scheduler self-monitoring)
- No increase in allocation failures on canary nodes
- Heartbeat latency stable
- Node health check pass rate = 100%
- If canary passes: proceed with rolling batches (batch size configurable, default: 5% of nodes)
- If canary fails: stop rollout, revert canary nodes (see Rollback below)
Batch Sizing
| Cluster Size | Canary Size | Batch Size | Total Batches |
|---|---|---|---|
| < 50 nodes | 1 node | 5 nodes | ~10 |
| 50-500 nodes | 2 nodes | 25 nodes | ~20 |
| 500+ nodes | 5 nodes | 50 nodes | varies |
vCluster Scheduler Rolling Upgrade
Schedulers are stateless — they read state from the quorum each cycle:
- Stop scheduler instance
- Upgrade binary
- Restart
- Verify: scheduling cycle completes successfully, proposals accepted by quorum
During scheduler downtime, the affected vCluster pauses scheduling (no new allocations). Running allocations are unaffected. Multiple scheduler replicas (if deployed) provide continuity.
API Server Rolling Upgrade
API servers are stateless, behind a load balancer:
- Remove instance from load balancer
- Drain active connections (grace period: 30s)
- Upgrade binary
- Restart
- Health check passes → re-add to load balancer
Client impact: brief connection reset for long-lived streams (StreamMetrics, StreamLogs). Clients reconnect automatically.
Quorum Rolling Upgrade
The most sensitive upgrade. One member at a time, maintaining quorum majority throughout:
3-Member Quorum
- Upgrade follower A: remove from Raft group → upgrade → re-add
- Wait for follower A to catch up (Raft log sync)
- Upgrade follower B: remove → upgrade → re-add
- Wait for follower B to catch up
- Trigger leader transfer to an upgraded follower
- Upgrade old leader: remove → upgrade → re-add
Constraint: Never more than 1 member down simultaneously (2/3 majority required).
5-Member Quorum
Same procedure but can upgrade 2 followers in parallel (3/5 majority maintained):
- Upgrade followers A and B in parallel
- Wait for catch-up
- Upgrade followers C and D in parallel
- Wait for catch-up
- Leader transfer → upgrade old leader
Constraint: Never more than 2 members down simultaneously (3/5 majority required).
Quorum Upgrade Verification
After each member upgrade:
- Raft log replication is current (no lag)
- Commit latency within SLO (< 5s)
- Leader election succeeds if triggered
- All node ownership state is consistent
Canary Criteria
Metrics from scheduler self-monitoring (cross-ref: telemetry.md) that gate rollout progression:
| Metric | Threshold | Severity |
|---|---|---|
| `lattice_scheduling_cycle_duration_seconds` | p99 < 30s | Warning: pause rollout |
| `lattice_scheduling_proposals_total{result="rejected"}` | No increase > 10% | Warning: pause rollout |
| `lattice_agent_heartbeat_latency_seconds` | p99 < 5s | Warning: pause rollout |
| `lattice_raft_commit_latency_seconds` | p99 < 5s | Critical: stop rollout |
| `lattice_api_requests_total{status="5xx"}` | No increase > 5% | Warning: pause rollout |
| Allocation failure rate | No increase | Critical: stop rollout |
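A sketch of how these gates might be combined, with critical gates taking precedence (hypothetical function; thresholds drawn from the table above):

```rust
// Sketch of canary gate evaluation (illustrative; real thresholds come
// from quorum config and real values from telemetry queries).
#[derive(Debug, PartialEq)]
enum Gate {
    Proceed,
    Pause, // warning threshold breached: hold rollout for operator review
    Stop,  // critical threshold breached: stop rollout
}

fn evaluate_canary(raft_commit_p99_s: f64, cycle_p99_s: f64, alloc_failure_increase: bool) -> Gate {
    // Critical gates first: any breach stops the rollout outright.
    if raft_commit_p99_s >= 5.0 || alloc_failure_increase {
        return Gate::Stop;
    }
    // Warning gates pause the rollout.
    if cycle_p99_s >= 30.0 {
        return Gate::Pause;
    }
    Gate::Proceed
}
```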
Rollback
Node Agent Rollback
- Drain canary/failed nodes
- Replace binary with previous version
- Restart
- Verify old-version operation
- Protocol backward compatibility ensures the rolled-back agent works with the rest of the cluster
Scheduler/API Rollback
Stateless — replace binary and restart.
Quorum Rollback
- Remove new-version member from Raft group
- Add old-version member back
- Protocol backward compatibility ensures mixed-version operation during the transition
Rollback is always safe because N-1 protocol support is maintained throughout the upgrade window.
Configuration Hot-Reload
Not all changes require a binary upgrade. Configuration changes that can be hot-reloaded via quorum without restart:
| Change | Hot-Reloadable | Mechanism |
|---|---|---|
| Cost function weights | Yes | Quorum config update, schedulers pick up next cycle |
| vCluster policies | Yes | Quorum config update |
| Telemetry mode (prod/debug/audit) | Yes | API call to node agent |
| Tenant quotas | Yes | Quorum config update |
| Node drain/undrain | Yes | API call |
| Protocol version | No | Binary upgrade required |
| Raft cluster size | No | Membership change (safe, but not hot-reload) |
Cross-References
- telemetry.md — Scheduler self-monitoring metrics used for canary criteria
- failure-modes.md — Failure detection during upgrades
- security.md — Certificate rotation during upgrades
- checkpoint-broker.md — Checkpoint before drain for urgent upgrades
Testing Strategy
Design Principle
Scheduler correctness is non-negotiable. The testing strategy covers four levels: unit tests for individual functions, integration tests for component interactions, simulation tests for scheduling behavior, and chaos tests for fault tolerance. Every level must pass before a release.
Test Levels
┌─────────────────────────────────────────────────┐
│ Level 4: Chaos Tests (fault injection) │
│ Raft leader loss, network partitions, │
│ node failures, storage unavailability │
├─────────────────────────────────────────────────┤
│ Level 3: Simulation (RM-Replay) │
│ Production workload replay, weight tuning, │
│ fairness validation, SLO compliance │
├─────────────────────────────────────────────────┤
│ Level 2: Integration Tests │
│ Multi-component scenarios, API contracts, │
│ end-to-end allocation lifecycle │
├─────────────────────────────────────────────────┤
│ Level 1: Unit Tests │
│ Cost function, topology solver, state machine,│
│ protobuf serialization, error handling │
└─────────────────────────────────────────────────┘
Level 1: Unit Tests
In-module tests (`#[cfg(test)]`), run via `cargo test`.
Critical Paths
| Crate | What to Test | Example |
|---|---|---|
| `lattice-scheduler` | Cost function components (f₁-f₉) | Given inputs, verify score output |
| `lattice-scheduler` | Knapsack solver | Given nodes and allocations, verify placement |
| `lattice-scheduler` | Topology packing | Given groups and node count, verify group selection |
| `lattice-scheduler` | Conformance group selection | Given fingerprints, verify grouping |
| `lattice-quorum` | Raft proposal validation | Hard quota rejection, ownership conflict |
| `lattice-quorum` | State machine transitions | Node state changes, allocation lifecycle |
| `lattice-common` | Type serialization/deserialization | Protobuf round-trip for all types |
| `lattice-common` | Allocation state machine | Valid and invalid state transitions |
| `lattice-api` | Request validation | Reject invalid allocations (cycles in DAG, bad constraints) |
| `lattice-api` | SBATCH directive parsing | Translate Slurm directives to Intent API |
| `lattice-checkpoint` | Cost model evaluation | Given metrics, verify checkpoint decision |
| `lattice-cli` | Argument parsing | Flag combinations, error messages |
Property-Based Tests
Use proptest for property-based testing of the cost function and solver:
- Cost function monotonicity: Increasing wait time always increases f₂
- Fair share bounds: f₃ always in [0, 1]
- Solver validity: Every placement returned by the solver satisfies all constraints
- Topology packing: Solver never spans more groups than necessary
- State machine: No invalid state transitions accepted
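Two of these properties, sketched with stdlib-only stand-ins for the cost components (the real f₂/f₃ live in lattice-scheduler, and proptest would generate the inputs rather than this fixed sweep):

```rust
// Hypothetical saturating aging curve standing in for f2 (wait time).
fn f2_wait_time(wait_s: f64) -> f64 {
    1.0 - (-wait_s / 3600.0).exp()
}

// Hypothetical fair-share ratio standing in for f3, clamped to [0, 1].
fn f3_fair_share(used: f64, target: f64) -> f64 {
    (used / target.max(f64::MIN_POSITIVE)).min(1.0).max(0.0)
}

/// Monotonicity: more waiting never lowers f2. Bounds: f3 stays in [0, 1].
fn properties_hold(samples: u32) -> bool {
    (0..samples).all(|i| {
        let w = i as f64 * 60.0;
        f2_wait_time(w + 60.0) >= f2_wait_time(w)
            && (0.0..=1.0).contains(&f3_fair_share(i as f64, 10.0))
    })
}
```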
Level 2: Integration Tests
In tests/ directories, using real components with mock external dependencies.
Test Harness
A test harness that spins up:
- In-memory Raft cluster (3 members, using `openraft` test utilities)
- Mock node agents (report capabilities, respond to heartbeats)
- Mock VAST API (storage queries return configurable responses)
- Real scheduler instances
- Real API server (in-process)
Scenarios
| Scenario | What It Tests |
|---|---|
| Submit → Schedule → Complete | Full allocation lifecycle through all components |
| DAG submission | Multi-allocation workflow with dependency resolution |
| Preemption | Higher-priority allocation preempts lower-priority |
| Elastic borrowing | vCluster borrows and returns nodes |
| Quota rejection | Hard quota exceeded → proposal rejected |
| Sensitive claim | Node claim, audit logging, wipe on release |
| Session lifecycle | Session create → terminal → disconnect → cleanup |
| Rolling upgrade simulation | Mixed-version node agents, protocol negotiation |
| Conformance drift | Node fingerprint changes → scheduling impact |
| Reactive scaling | Metric threshold triggers scale-up/down |
API Contract Tests
For every API endpoint, test:
- Valid request → expected response
- Invalid request → appropriate error code and message
- Authorization: user sees own allocations only, tenant-admin sees tenant, system-admin sees all
- Rate limiting: exceeded rate → 429 with Retry-After header
Protobuf Compatibility
Test backward compatibility:
- Deserialize messages from previous version with new code (additive fields)
- Deserialize messages from new version with old code (unknown fields ignored)
Level 3: Simulation (RM-Replay)
Purpose
RM-Replay replays production workload traces through the scheduler to validate scheduling behavior without risking production. Essential for:
- Tuning cost function weights before deployment
- Validating fairness across tenants
- Regression testing after scheduler changes
Workflow
1. Capture: Record production workload traces
- Allocation submissions (arrival time, resources, constraints, tenant)
- Allocation completions (duration, exit status)
- Node inventory (capabilities, topology)
2. Configure: Set cost function weights and vCluster policies
3. Replay: Feed traces through lattice-scheduler in simulation mode
- No real nodes or quorum — mock environment
- Simulated time (runs in seconds, not hours)
- Deterministic (same trace + same weights = same result)
4. Evaluate: Measure scheduling outcomes
- Utilization: fraction of GPU-hours used
- Wait time: p50, p95, p99 queue wait per priority class
- Fairness: actual share vs. target share per tenant (Jain's fairness index)
- Backfill effectiveness: percentage of idle slots filled
- SLO compliance: percentage of allocations meeting target wait time
- Preemption rate: preemptions per hour
5. Iterate: Adjust weights, re-run, compare
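Jain's fairness index, used for the fairness measurement above, is (Σx)² / (n·Σx²) over per-tenant shares; a minimal implementation:

```rust
// Jain's fairness index: 1.0 means perfectly equal shares;
// it approaches 1/n as a single tenant dominates.
fn jain_index(shares: &[f64]) -> f64 {
    let n = shares.len() as f64;
    let sum: f64 = shares.iter().sum();
    let sum_sq: f64 = shares.iter().map(|x| x * x).sum();
    if sum_sq == 0.0 {
        return 1.0; // no usage at all: trivially fair
    }
    (sum * sum) / (n * sum_sq)
}
```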
Regression Suite
Maintain a library of representative workload traces:
| Trace | Description | Key Metric |
|---|---|---|
| `steady-state.trace` | Normal mixed workload (HPC + ML + services) | Utilization > 85% |
| `burst.trace` | Sudden spike in submissions | No starvation (p99 wait < 4h) |
| `unfair.trace` | One tenant submits heavily | Fair share deviation < 10% |
| `sensitive-claim.trace` | Sensitive claims interleaved with HPC | Sensitive wait = 0 (immediate) |
| `preemption-heavy.trace` | Many priority inversions | Checkpoint success rate > 95% |
| `empty-to-full.trace` | Cluster goes from idle to full | Ramp-up time, scheduling cycle latency |
Each trace has a pass/fail threshold for key metrics. CI runs the regression suite on every scheduler change.
Level 4: Chaos Tests
Fault injection tests that validate the failure modes documented in failure-modes.md.
Fault Injection Framework
Use a test harness that can inject faults at configurable times:
| Fault | Injection Method | Validates |
|---|---|---|
| Raft leader kill | Stop leader process | Leader election, in-flight proposal retry |
| Raft member kill | Stop follower process | Continued operation with minority loss |
| Network partition (node↔quorum) | Drop heartbeats | Degraded → Down transition, allocation requeue |
| Network partition (quorum split) | Partition Raft members | Minority stalls, majority continues |
| Node agent crash | Kill agent process | Heartbeat timeout, allocation requeue |
| Storage unavailability | Mock VAST returns errors | Staging pauses, running allocations continue |
| Checkpoint timeout | Application ignores checkpoint hint | Forced preemption after timeout |
| API server crash | Kill API server | Client retry, no state loss |
| Quorum snapshot corruption | Corrupt snapshot file | Recovery from previous valid snapshot |
Chaos Test Scenarios
| Scenario | Steps | Expected Outcome |
|---|---|---|
| Leader election under load | Submit 50 allocations, kill leader mid-cycle | New leader elected < 5s, no proposals lost, all allocations eventually scheduled |
| Node failure with requeue | Start 10 allocations, kill 2 node agents | Allocations requeued, rescheduled on healthy nodes, total delay < 2 min |
| Split-brain prevention | Partition 3-member quorum into 1+2 | Minority (1) cannot commit, majority (2) continues, no divergent state |
| Cascade failure | Kill 3 node agents simultaneously | Allocations on all 3 nodes requeued, scheduling continues for remaining nodes |
| Sensitive node failure | Kill sensitive node agent | Extended grace period, operator alert, no auto-requeue |
| Recovery from full quorum loss | Kill all quorum members, restore from snapshot | State restored, node agents reconnect, scheduling resumes |
Execution
Chaos tests run in CI on a dedicated stage (not on every commit):
- Nightly: full chaos suite
- On release branch: full chaos suite must pass
Performance Benchmarks
Scheduling Cycle Latency
| Benchmark | Configuration | Target |
|---|---|---|
| 100 pending allocations, 1000 nodes | HPC backfill | Cycle < 5s |
| 500 pending allocations, 5000 nodes | HPC backfill | Cycle < 15s |
| 1000 pending allocations, 10000 nodes | HPC backfill | Cycle < 30s |
| Raft commit (single proposal) | 3-member quorum | p99 < 50ms |
| Raft commit (single proposal) | 5-member quorum | p99 < 100ms |
Load Tests
| Test | Description | Target |
|---|---|---|
| API throughput | Concurrent submission requests | > 1000 req/s |
| Heartbeat load | 10000 node agents reporting | < 1% CPU on quorum |
| Log streaming | 100 concurrent log streams | < 5% CPU on API server |
CI Pipeline
On every commit:
- `cargo fmt --check`
- `cargo clippy --all-targets`
- `cargo test` (Level 1: unit tests)

On every PR:
- Level 1 + Level 2 (integration tests)
- Protobuf backward compatibility check

Nightly:
- Level 1 + Level 2 + Level 3 (RM-Replay regression) + Level 4 (chaos)
- Performance benchmarks (track regressions)

On release:
- All levels must pass
- Performance benchmarks must meet targets
Cross-References
- failure-modes.md — Failure scenarios validated by chaos tests
- scheduling-algorithm.md — Cost function tested by unit tests and RM-Replay
- upgrades.md — Rolling upgrade validated by integration tests
- conformance.md — Conformance behavior validated by integration tests
DAG Scheduling
Design Principle
DAGs are first-class workflow primitives. The scheduler resolves dependencies; users declare intent. Dependency semantics are Slurm-compatible (afterok, afternotok, afterany, aftercorr) to ease migration.
DAG Submission
A DAG is a set of allocation specs with dependency edges, submitted as a single unit via the Intent API:
```rust
DagSpec {
    allocations: Vec<AllocationSpec>, // each spec has an id and depends_on fields
}
```
Dependencies are expressed inline via each `AllocationSpec.depends_on` field (a list of `DependencySpec` with `ref_id` and `condition`), not as separate edge objects. This matches the protobuf definition in proto/lattice/v1/allocations.proto.
Dependency Conditions
Defined in crates/lattice-common/src/types.rs (DependencyCondition enum):
| Condition | Slurm Equivalent | Semantics |
|---|---|---|
| `Success` | `afterok` | Successor runs only if predecessor exits 0 |
| `Failure` | `afternotok` | Successor runs only if predecessor exits non-zero |
| `Any` | `afterany` | Successor runs regardless of predecessor’s exit status |
| `Corresponding` | `aftercorr` | Task group: array element N depends on predecessor’s element N |
| `Mutex` | `singleton` | Only one allocation with this mutex name runs at a time |
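As an illustration of these semantics, the condition check for a single edge might look like the sketch below. The `Terminal` and `Condition` types here are hypothetical stand-ins (the real `DependencyCondition` enum lives in `crates/lattice-common/src/types.rs`), and the `Corresponding`/`Mutex` cases are omitted since they depend on array-index and mutex state:

```rust
// Hypothetical simplification of the real types; not the crate's actual code.
#[derive(Clone, Copy)]
enum Terminal {
    Completed { exit_code: i32 },
    Failed,
    Cancelled,
}

#[derive(Clone, Copy)]
enum Condition {
    Success, // afterok
    Failure, // afternotok
    Any,     // afterany
}

/// Is this dependency edge satisfied by the predecessor's terminal state?
fn satisfied(cond: Condition, pred: Terminal) -> bool {
    match (cond, pred) {
        // Success: predecessor must have exited 0
        (Condition::Success, Terminal::Completed { exit_code }) => exit_code == 0,
        (Condition::Success, _) => false,
        // Failure: predecessor must have failed or exited non-zero
        (Condition::Failure, Terminal::Completed { exit_code }) => exit_code != 0,
        (Condition::Failure, Terminal::Failed) => true,
        (Condition::Failure, Terminal::Cancelled) => false,
        // Any: successor runs regardless of outcome
        (Condition::Any, _) => true,
    }
}
```

Note the sketch treats a cancelled predecessor as satisfying `Any` but not `Failure`; the exact treatment of cancellation is a policy choice.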
DAG Lifecycle
1. Submission and Validation
- User submits `DagSpec` via `POST /v1/dags` or `lattice dag submit`
- lattice-api validates the graph:
  - No cycles (topological sort must succeed)
  - All `depends_on.ref_id` values reference allocation IDs within the DAG
  - All allocation specs individually valid
- The DAG receives a unique `dag_id`
- Individual allocations receive `allocation_id` values and are tagged with `dag_id`
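The cycle check can be sketched with Kahn's algorithm: repeatedly release nodes whose dependencies are all accounted for; if any node is never released, the graph contains a cycle. A minimal sketch over a hypothetical adjacency map (the actual validation lives in lattice-api), assuming every allocation ID has an entry, possibly with an empty dependency list:

```rust
use std::collections::HashMap;

/// Returns true if the dependency graph is acyclic.
/// `deps` maps each allocation id to the ids it depends on.
fn is_acyclic(deps: &HashMap<&str, Vec<&str>>) -> bool {
    // in-degree = number of unresolved dependencies per allocation
    let mut indegree: HashMap<&str, usize> =
        deps.iter().map(|(id, d)| (*id, d.len())).collect();
    // reverse edges: predecessor -> successors
    let mut successors: HashMap<&str, Vec<&str>> = HashMap::new();
    for (id, d) in deps {
        for p in d {
            successors.entry(*p).or_default().push(*id);
        }
    }
    // root nodes: no incoming dependency edges
    let mut ready: Vec<&str> = indegree
        .iter()
        .filter(|(_, deg)| **deg == 0)
        .map(|(id, _)| *id)
        .collect();
    let mut visited = 0;
    while let Some(id) = ready.pop() {
        visited += 1;
        if let Some(succs) = successors.get(id) {
            for s in succs {
                let deg = indegree.get_mut(s).unwrap();
                *deg -= 1;
                if *deg == 0 {
                    ready.push(*s);
                }
            }
        }
    }
    // every node released exactly once <=> no cycle
    visited == deps.len()
}
```

This runs in O(V+E), matching the complexity claim for submission-time validation.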
2. Root Node Scheduling
- Allocations with no incoming dependency edges (root nodes) enter their vCluster scheduler queue immediately
- Root nodes are scored and scheduled like any other allocation
3. Dependency Resolution
- When an allocation completes (any terminal state), the system evaluates outgoing edges:
- For each outgoing edge, check if the condition is satisfied
- If all incoming edges to a successor are satisfied, the successor enters the scheduler queue
- Dependency resolution is eventually consistent (handled by lattice-api or a lightweight DAG controller, not the quorum)
4. DAG Completion
- DAG completes when all allocations reach a terminal state (Completed, Failed, or Cancelled)
- DAG state: `Running` while any allocation is pending or running, `Completed` when all are done, `Failed` if any required allocation failed without a catching edge
5. DAG Cancellation
- `DELETE /v1/dags/{id}` or `lattice dag cancel {id}`
- Cancels all pending and running allocations in the DAG
- Running allocations receive SIGTERM → grace period → SIGKILL (same as walltime exceeded)
Failure Propagation
Default: Success Dependencies
If allocation A fails and B depends on A via Success:
- B is cancelled (dependency can never be satisfied)
- B’s downstream dependencies are also evaluated (cascading cancellation)
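Cascading cancellation is a reachability walk over `Success` edges from the failed allocation. A simplified sketch over a hypothetical adjacency map (evaluation of `Failure` and `Any` edges along the way is omitted):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Collect every downstream allocation whose Success-dependency chain
/// can no longer be satisfied once `failed` has failed.
fn cascade_cancel(
    failed: &str,
    success_edges: &HashMap<&str, Vec<&str>>, // predecessor -> Success successors
) -> HashSet<String> {
    let mut cancelled = HashSet::new();
    let mut queue = VecDeque::from([failed]);
    while let Some(id) = queue.pop_front() {
        for succ in success_edges.get(id).into_iter().flatten() {
            // A Success dependency on a failed (or cancelled) predecessor
            // can never be satisfied, so the successor is cancelled too,
            // and its own successors are re-examined.
            if cancelled.insert(succ.to_string()) {
                queue.push_back(*succ);
            }
        }
    }
    cancelled
}
```

In the full controller, each cancelled allocation would also trigger evaluation of its outgoing `Failure`/`Any` edges, per the semantics above.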
Error Handling Paths
With Failure edges, users can build error-handling workflows:
train ──Success──→ evaluate ──Success──→ deploy
  │                    │
  └──Failure──→ notify_failure
                       │
                       └──Failure──→ notify_failure
- `notify_failure` runs only if `train` or `evaluate` fails
- `deploy` runs only if both `train` and `evaluate` succeed
Any Dependencies
With Any edges, successors run regardless:
run_experiment ──Any──→ cleanup
cleanup runs whether run_experiment succeeds or fails. Useful for teardown tasks.
Corresponding Dependencies (Task Groups)
For task groups (array jobs), Corresponding creates element-wise dependencies:
preprocess[0..N] ──Corresponding──→ train[0..N]
train[i] starts only when preprocess[i] completes successfully. Other array elements are independent.
State Tracking
DAG state is eventually consistent, following ADR-004:
- The quorum tracks individual allocation states (ownership, terminal states). It does not know about DAG structure.
- The DAG controller (runs within lattice-api) evaluates dependency edges when allocation state changes. It reads allocation states from the quorum and determines which successors to release into the scheduler queue.
- This separation keeps the quorum simple and avoids adding DAG-specific logic to the Raft state machine.
DAG Queries
| Endpoint | Description |
|---|---|
| `GET /v1/dags/{id}` | DAG status: overall state, per-allocation states |
| `GET /v1/dags/{id}/graph` | DAG structure: allocations and edges |
| `GET /v1/dags?tenant={id}` | List DAGs for a tenant |
| `DELETE /v1/dags/{id}` | Cancel DAG |
CLI equivalents: `lattice dag status`, `lattice dag list`, `lattice dag cancel`.
Edge Cases
Node Failure During DAG Execution
When a node fails while running a DAG allocation:
- The allocation follows its `requeue_policy` (see failure-modes.md)
- If requeued: the allocation re-enters the scheduler queue with its original priority. Downstream dependencies remain blocked until it completes.
- If failed: downstream `Success` dependencies are cancelled. `Failure` and `Any` edges are evaluated normally.
- DAG state remains `Running` as long as any allocation is pending or active.
Task Group with Corresponding Dependencies and Mixed Exit Codes
When a task group has Corresponding dependencies and individual elements exit with different codes:
- Each `Corresponding` edge is evaluated independently per array index
- `train[3]` failing does not affect `train[4]`’s dependency on `preprocess[4]`
- The downstream task group may have a mix of running, cancelled, and completed elements
- DAG completion waits for all evaluable elements to reach terminal states
Corresponding Dependencies with Mismatched Array Sizes
When two task groups have Corresponding dependencies but different array sizes (e.g., preprocess[0..9] → train[0..14]):
- Array indices that exist in both groups are matched normally: `train[i]` depends on `preprocess[i]` for `i` in `0..9`.
- Extra indices in the successor group (`train[10..14]`) have no matching predecessor element. These extra indices are treated as having their `Corresponding` dependency satisfied immediately — they enter the scheduler queue as if they were root nodes.
- This design avoids silent failures: users get all successor elements running, not just the matched subset.
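The matching rule reduces to a small function. A sketch (illustrative only, not the crate's API):

```rust
/// For a Corresponding edge between a predecessor group of size `pred_len`
/// and a successor group of size `succ_len`, return the predecessor array
/// index that successor element `i` must wait on, if one exists.
fn corresponding_predecessor(i: usize, pred_len: usize, succ_len: usize) -> Option<usize> {
    assert!(i < succ_len, "index must be within the successor group");
    if i < pred_len {
        Some(i) // matched: waits on predecessor element i
    } else {
        None // unmatched: dependency treated as satisfied immediately
    }
}
```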
Max DAG Size
DAGs are validated at submission time with a maximum allocation count (default: 1000 allocations per DAG). Submitting a DAG exceeding this limit returns an error:
Error: DAG exceeds maximum size (1234 allocations, limit: 1000)
Hint: Split the workflow into smaller DAGs or increase the limit via system configuration.
The limit is configurable via lattice admin config set scheduling.max_dag_size=2000. Cycle detection runs in O(V+E) and is not a bottleneck, but very large DAGs increase dependency resolution overhead in the DAG controller.
Cross-References
- api-design.md — DagSpec in protobuf definition
- scheduling-algorithm.md — DAG members are scored individually by the knapsack solver
- failure-modes.md — Allocation-level failure recovery interacts with DAG propagation
- types.rs — `Dependency`, `DependencyCondition` enum definitions
Preemption Policy
Design Principle
Preemption is a last resort for resource rebalancing. The scheduler prefers waiting, backfill, and elastic borrowing over preemption. When preemption is necessary, it targets allocations with the lowest preemption cost (fast checkpoint, low priority, short remaining runtime). Sensitive allocations are never preempted.
Preemption Classes
Each allocation has a preemption_class (0-10):
| Class | Meaning | Typical Use | Preemptible By |
|---|---|---|---|
| 0 | Best-effort | Scavenger jobs, testing | Any higher class |
| 1-3 | Low priority | Batch exploration, sweeps | Class 4+ |
| 4-6 | Normal | Production training, simulation | Class 7+ |
| 7-9 | High priority | Time-sensitive production | Class 10 only |
| 10 | Critical / Sensitive | Sensitive claims, emergency | Never preempted |
Rule: Preemption only moves down — a class-5 allocation can preempt class 0-4 allocations but never class 5+.
Enforcement: The preemption_class range (0-10) is validated at API admission. Values outside this range are rejected with a 400 Bad Request error before reaching the scheduler.
Tie-breaking within class: If multiple allocations have the same preemption class, the scheduler prefers to preempt the one with the lowest checkpoint cost (f₈).
Preemption Triggers
1. Higher-Priority Demand
A pending allocation with class N cannot be scheduled because all suitable nodes are occupied by lower-class allocations. The scheduler evaluates whether preempting one or more lower-class allocations would free enough resources.
2. Elastic Reclamation
A vCluster’s idle nodes were borrowed by another vCluster (elastic sharing). The home vCluster now needs them back. Borrowed nodes carry an implicit preemption risk — the checkpoint cost model (f₈) accounts for this.
3. Sensitive Node Claim
A sensitive user claims nodes that are currently occupied by non-sensitive allocations. Sensitive claims are class 10 (highest). The scheduler triggers immediate checkpoint + preemption of the occupying allocations.
4. Quota Enforcement
A tenant exceeds their hard quota due to a race condition (two concurrent proposals, first committed). The quorum rejects the second proposal — this is not preemption but rejection. Running allocations are never preempted for quota enforcement.
Preemption Decision Algorithm
PreemptionDecision(pending_job, candidates):
1. Filter candidates:
- Only allocations with preemption_class < pending_job.preemption_class
- Exclude sensitive allocations (never preempted)
- Exclude allocations in Checkpointing state (already being preempted)
2. Score each candidate by preemption cost:
preemption_cost(c) = checkpoint_time(c)
+ recompute_if_no_checkpoint(c)
+ remaining_walltime_value(c)
checkpoint_time(c):
If checkpoint == Auto: estimated_checkpoint_minutes from f₈
If checkpoint == Manual: assume application handles it, use configured timeout
If checkpoint == None: recompute_if_no_checkpoint applies
recompute_if_no_checkpoint(c):
time_since_last_checkpoint(c) × node_count(c) × gpu_per_node
(GPU-hours that would be lost)
remaining_walltime_value(c):
If c is near completion (>90% walltime used): high cost (let it finish)
If c just started (<10% walltime used): low cost (little invested)
3. Select victim set:
Greedy: pick candidates with lowest preemption_cost until enough nodes freed.
Constraint: freed nodes must satisfy pending_job's topology/conformance requirements.
4. If no valid victim set exists: pending_job stays queued (preemption not possible).
5. If valid victim set found: initiate preemption sequence.
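Steps 1-3 above can be sketched as a greedy pass over hypothetical candidate records; the topology/conformance constraint from step 3 is omitted here, and the field names are illustrative:

```rust
#[derive(Clone)]
struct Candidate {
    id: String,
    preemption_class: u8,
    sensitive: bool,
    checkpointing: bool,   // already being preempted
    nodes: u32,
    preemption_cost: f64,  // checkpoint_time + recompute + remaining-walltime value
}

/// Greedy victim selection: cheapest eligible candidates first, until
/// enough nodes are freed. Returns None if no valid victim set exists.
fn select_victims(
    pending_class: u8,
    nodes_needed: u32,
    candidates: &[Candidate],
) -> Option<Vec<String>> {
    // Step 1: filter — preemption only moves down, never sensitive,
    // never allocations already in Checkpointing state.
    let mut eligible: Vec<&Candidate> = candidates
        .iter()
        .filter(|c| c.preemption_class < pending_class)
        .filter(|c| !c.sensitive)
        .filter(|c| !c.checkpointing)
        .collect();
    // Step 2: score — lowest preemption cost first.
    eligible.sort_by(|a, b| a.preemption_cost.total_cmp(&b.preemption_cost));
    // Step 3: accumulate victims until enough nodes are freed.
    let mut freed = 0;
    let mut victims = Vec::new();
    for c in eligible {
        if freed >= nodes_needed {
            break;
        }
        freed += c.nodes;
        victims.push(c.id.clone());
    }
    if freed >= nodes_needed { Some(victims) } else { None }
}
```

Note a real implementation would also apply the max-victims-per-decision cap and prefer fewer, larger victims, per the multi-victim constraints below.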
Preemption Sequence
1. Scheduler identifies victim allocations
2. For each victim:
a. If checkpoint == Auto or Manual:
- Checkpoint broker sends CHECKPOINT_HINT to node agents
- Application checkpoints (signal, shmem, or gRPC callback)
- Timeout: checkpoint_timeout (default: 10 minutes)
b. If checkpoint == None:
- SIGTERM sent immediately
- Grace period (30s) → SIGKILL
3. When checkpoint completes (or timeout):
- Allocation transitions to Suspended state
- Nodes released to quorum (Raft commit)
4. Freed nodes assigned to pending allocation
5. Suspended allocations re-enter queue with:
- Original submission time preserved (no wait-time penalty)
- Resume-from-checkpoint flag set
- Preempted-count incremented
Checkpoint Timeout Handling
When a checkpointing allocation fails to complete within the timeout:
| Scenario | Action |
|---|---|
| Application responds but slow | Extend timeout by 50%, once |
| Application unresponsive | SIGTERM → grace period → SIGKILL. Mark as failed (not suspended). Requeue if policy allows. |
| gRPC callback: application requests deferral | Grant deferral up to max_deferral (default: 5 minutes). Then force. |
Multi-Victim Preemption
Sometimes freeing one allocation isn’t enough. The scheduler can preempt multiple allocations in a single decision:
Constraints:
- Maximum victims per decision: configurable (default: 3)
- All victims must have lower preemption class than the pending job
- Total preemption cost must be less than the pending job’s estimated value
- Scheduler prefers preempting fewer, larger allocations over many small ones
Ordering: Victims are preempted in parallel (all receive checkpoint hints simultaneously). The pending job starts once all victims have released their nodes.
Per-vCluster Preemption Policy
| vCluster Type | Preemption Allowed | Notes |
|---|---|---|
| HPC Batch | Yes | Class-based, checkpoint-aware |
| ML Training | Yes | Checkpoint cost heavily weighted (w₈=0.15) |
| Service | Yes (borrowed nodes only) | Services on home nodes are not preempted; borrowed nodes reclaimable |
| Sensitive | Never preempted | Class 10, no exceptions |
| Interactive | Yes | Short-lived, low cost to preempt |
Non-Preemptible Allocations
An allocation is effectively non-preemptible when:
- `checkpoint: None` AND `preemption_class >= 7` — high cost to preempt (all progress lost), high priority
- Sensitive allocations (always class 10)
- Allocations within 5 minutes of walltime completion (configurable: `near_completion_threshold`)
The scheduler avoids placing non-preemptible allocations on borrowed nodes, since those nodes may need to be reclaimed.
Preemption Metrics
| Metric | Type | Description |
|---|---|---|
| `lattice_preemptions_total` | counter | Labels: `vcluster`, `reason` (priority/reclaim/sensitive) |
| `lattice_preemption_checkpoint_duration_seconds` | histogram | Time from hint to checkpoint completion |
| `lattice_preemption_victim_requeue_total` | counter | Preempted allocations re-entering queue |
| `lattice_preemption_failed_checkpoint_total` | counter | Checkpoint timeouts during preemption |
Cross-References
- scheduling-algorithm.md — f₈ checkpoint_efficiency in cost function
- checkpoint-broker.md — Checkpoint cost model and application protocol
- failure-modes.md — Requeue policy for preempted allocations
- node-lifecycle.md — Node state transitions during preemption
- sensitive-workloads.md — Sensitive allocations never preempted
Checkpoint Broker
Purpose
The checkpoint broker coordinates between the scheduler’s resource management decisions and running applications’ checkpoint capabilities. It enables cost-aware preemption: the scheduler can reclaim resources from running jobs by triggering checkpoints, with the decision driven by an economic cost function.
Cost Model
When to Checkpoint
Should_checkpoint(j, t) = Value(j, t) > Cost(j, t)
Cost Components
Cost(j, t) = write_time(j) + compute_waste(j) + storage_cost(j)
write_time(j):
Estimated from: checkpoint_size(j) / storage_write_bandwidth
checkpoint_size(j) estimated from: GPU memory usage × node count
storage_write_bandwidth from: VAST API current throughput metrics
compute_waste(j):
GPU-seconds lost during checkpoint I/O
= write_time(j) × node_count(j) × gpu_per_node
storage_cost(j):
= checkpoint_size(j) × cost_per_GB_on_target_tier
Value Components
Value(j, t) = recompute_saved(j, t) + preemptability(j, t) + backlog_relief(t)
recompute_saved(j, t):
GPU-hours that would be lost if the job fails and restarts from scratch
= time_since_last_checkpoint(j) × node_count(j) × gpu_per_node
Weighted by failure_probability(j, t) which increases with:
- Job duration (longer jobs more likely to hit hardware issues)
- Node health signals (ECC errors, thermal warnings from BMC)
preemptability(j, t):
Value of being able to preempt this job if a higher-priority job arrives
= Σ (waiting_higher_priority_jobs × their urgency) × preemption_probability
High when higher-priority work is queued and this job sits on reclaimable nodes
backlog_relief(t):
= backlog_pressure(t) × estimated_queue_wait_reduction_if_nodes_freed
Global signal: how much would freeing these nodes help the overall queue?
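Putting the components together, the decision is a direct comparison of the two sums. A sketch in GPU-hour units, with a hypothetical snapshot struct standing in for the broker's inputs (VAST metrics, quorum state, queue state):

```rust
/// Inputs to the checkpoint decision for one running job.
/// Field names are illustrative, not the broker's actual types.
struct JobSnapshot {
    checkpoint_size_gb: f64,
    write_bw_gb_per_s: f64,           // from storage throughput metrics
    node_count: f64,
    gpu_per_node: f64,
    hours_since_last_checkpoint: f64,
    failure_probability: f64,         // 0.0..1.0, from node health + runtime
    storage_cost_per_gb: f64,         // normalized to GPU-hour equivalents
    preempt_value: f64,               // preemptability term, GPU-hour equivalent
    backlog_relief: f64,              // backlog term, GPU-hour equivalent
}

fn should_checkpoint(j: &JobSnapshot) -> bool {
    // Cost(j, t) = write_time + compute_waste + storage_cost
    let write_time_h = j.checkpoint_size_gb / j.write_bw_gb_per_s / 3600.0;
    let compute_waste = write_time_h * j.node_count * j.gpu_per_node;
    let storage_cost = j.checkpoint_size_gb * j.storage_cost_per_gb;
    let cost = write_time_h + compute_waste + storage_cost;

    // Value(j, t) = recompute_saved + preemptability + backlog_relief
    let recompute_saved = j.hours_since_last_checkpoint
        * j.node_count
        * j.gpu_per_node
        * j.failure_probability;
    let value = recompute_saved + j.preempt_value + j.backlog_relief;

    value > cost
}
```

With this shape, a storage outage (`write_bw_gb_per_s → 0`) drives `write_time` toward infinity and naturally suppresses checkpointing, as described under Storage Outage Behavior.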
Decision Dynamics
| Scenario | backlog | preempt demand | node health | Decision |
|---|---|---|---|---|
| Quiet system, healthy nodes | Low | Low | Good | Checkpoint infrequently (every 6h) |
| Deep queue, sensitive job waiting | High | High | Good | Checkpoint now, preempt |
| Node ECC errors increasing | Low | Low | Degrading | Checkpoint proactively, migrate |
| Large job nearing walltime | Low | Low | Good | Checkpoint for restart capability |
Application Protocol
Three Communication Modes
Applications opt into checkpoint coordination via one of three mechanisms:
1. Signal-based (legacy compatibility)
Node agent sends SIGUSR1 to the application's process group.
Application catches the signal, writes a checkpoint, and signals completion via a sentinel file.
Timeout: if no completion signal within checkpoint_timeout, assume non-checkpointable.
2. Shared memory flag (low-latency)
Node agent sets a flag in a shared memory region mapped at a well-known path.
Application polls the flag (or uses futex wait) and initiates checkpoint.
Completion: application clears the flag and sets a "done" flag.
Best for performance-sensitive applications that can't afford signal handler overhead.
3. gRPC callback (agent-aware applications)
Application registers a checkpoint endpoint with the node agent at startup.
Node agent calls the endpoint when checkpoint is requested.
Application responds with estimated completion time, then streams progress.
Most expressive: supports negotiation (application can request deferral).
Checkpoint Destinations
Checkpoints are written to a standard location:
s3://{tenant}/{project}/{allocation_id}/checkpoints/{checkpoint_id}/
Or, if NFS is preferred for POSIX-style checkpoint (e.g., MPI checkpoint/restart):
/scratch/{tenant}/{project}/{allocation_id}/checkpoints/{checkpoint_id}/
The checkpoint broker coordinates with the data plane to ensure bandwidth is available.
Non-Checkpointable Applications
If an application declares checkpoint: none or fails to respond to checkpoint hints:
- The allocation is marked as non-preemptible in the cost function
- It receives a penalty in the knapsack solver (ties up resources without flexibility)
- The scheduler avoids placing it on borrowed/elastic nodes
Fallback option: DMTCP (Distributed MultiThreaded Checkpointing) for transparent process-level checkpointing. Higher overhead, but works for unmodified applications.
Integration with Scheduler
The checkpoint broker runs as part of the scheduler plane, with access to:
- Running allocation state (from quorum)
- Node health telemetry (from eBPF/OpenCHAMI)
- Storage metrics (from VAST API)
- Queue state (from vCluster schedulers)
It evaluates the cost function continuously (every 30-60 seconds for each running allocation) and issues checkpoint hints when the threshold is crossed.
Storage Outage Behavior
When the checkpoint destination (VAST S3 or NFS) is unavailable:
- Detection: Checkpoint broker detects storage unavailability via failed write probes or VAST API health checks
- Immediate effect: All pending checkpoint requests are paused (not cancelled)
- Cost function adjustment: `storage_write_bandwidth` drops to 0, making `write_time(j)` infinite — the cost function naturally suppresses checkpoint decisions
- Running allocations: Continue running. They are effectively non-preemptible during the outage (no checkpoint possible)
- Preemption requests: If preemption is forced (e.g., sensitive claim), the victim receives SIGTERM without checkpoint. The allocation is marked `Failed` (not `Suspended`) since no checkpoint was written
- Recovery: When storage recovers, the broker re-evaluates all running allocations on the next cycle. Allocations with high `recompute_saved` value are prioritized for immediate checkpoint
- Alert: `lattice_checkpoint_storage_unavailable` gauge set to 1; critical alert fired
Edge Cases
Reactive Allocation Checkpointing
Reactive (autoscaling) allocations pose unique challenges for the checkpoint broker:
- Variable node count. The checkpoint size estimate (`GPU memory × node count`) changes as the allocation scales. The broker re-evaluates cost on each cycle using the current node count.
- Scale-down as implicit checkpoint trigger. When the scheduler decides to scale down a reactive allocation, it triggers a checkpoint on the nodes being released before removing them from the allocation. This ensures state is preserved.
- Recommendation: For reactive allocations with complex distributed state, use `checkpoint: manual` and implement application-level checkpoint coordination. The broker’s automatic checkpointing works best for static-size allocations where checkpoint size is predictable.
Walltime vs Checkpoint Race
When an allocation’s walltime expires while a checkpoint is in progress:
- Walltime takes priority. The walltime timer is not extended to accommodate the checkpoint.
- If the checkpoint completes before the SIGTERM grace period expires, the checkpoint is usable for restart.
- If the checkpoint is still in progress when SIGKILL is sent, the checkpoint is considered incomplete and is not used for restart. The allocation is marked `Failed` with reason `walltime_exceeded`.
- To avoid this race, schedule checkpoints proactively as walltime approaches (the `recompute_saved` value naturally increases near walltime expiration).
Cross-References
- scheduling-algorithm.md — f₈ checkpoint_efficiency in the cost function
- preemption.md — Preemption sequence and checkpoint timeout handling
- failure-modes.md — Checkpoint broker crash recovery
- telemetry.md — Node health signals (ECC errors) feeding into checkpoint urgency
- sensitive-workloads.md — Sensitive allocations and checkpoint constraints
- data-staging.md — Storage bandwidth sharing with checkpoint writes
Autoscaling
Design Principle
Simple, metric-driven scaling. No complex control theory. The scheduler adjusts node count within bounds based on a single metric threshold. Users set bounds, the scheduler respects them.
Reactive Lifecycle
Defined in crates/lattice-common/src/types.rs (LifecycleType::Reactive):
Reactive {
min_nodes: u32,
max_nodes: u32,
metric: String, // e.g., "gpu_utilization", "queue_depth", "request_rate"
target: String, // e.g., "0.80" (80% GPU utilization target)
}
Reactive allocations are unbounded in duration (like services) but have variable node count.
Scaling Loop
- Start: Allocation begins with `min_nodes`
- Evaluate: Every evaluation interval (default: 60s), the scheduler queries TSDB for the allocation’s metric
- Scale up: If metric > target for `scale_up_window` (default: 2 minutes):
  - Propose adding 1 node (conservative: avoid large jumps)
  - Quorum validates the node addition (ownership transfer)
  - Node agent starts processes on the new node
  - Repeat until metric ≤ target or `max_nodes` reached
- Scale down: If metric < target × `scale_down_threshold` (default: 0.5) for `scale_down_window` (default: 5 minutes):
  - Propose removing 1 node (least-loaded or most-recently-added)
  - Graceful drain: stop sending work to the node, wait for in-flight requests
  - Node released back to scheduling pool
  - Repeat until metric ≥ target × `scale_down_threshold` or `min_nodes` reached
- Cooldown: After any scale event, no further scaling for `cooldown_period` (default: 3 minutes)
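The decision step of the loop can be sketched as a pure function over elapsed-time counters. The parameter names are illustrative; the constants match the defaults above:

```rust
#[derive(Debug, PartialEq)]
enum ScaleAction {
    Up,
    Down,
    None,
}

/// One evaluation of the reactive scaling loop. `above_secs` / `below_secs`
/// track how long the metric has been above target, or below
/// target × scale_down_threshold; `since_last_event_secs` enforces cooldown.
fn evaluate(
    nodes: u32,
    min_nodes: u32,
    max_nodes: u32,
    above_secs: u64,
    below_secs: u64,
    since_last_event_secs: u64,
) -> ScaleAction {
    const SCALE_UP_WINDOW: u64 = 120;   // 2 minutes
    const SCALE_DOWN_WINDOW: u64 = 300; // 5 minutes
    const COOLDOWN: u64 = 180;          // 3 minutes
    if since_last_event_secs < COOLDOWN {
        return ScaleAction::None; // no scaling during cooldown
    }
    if above_secs >= SCALE_UP_WINDOW && nodes < max_nodes {
        return ScaleAction::Up; // add one node, conservatively
    }
    if below_secs >= SCALE_DOWN_WINDOW && nodes > min_nodes {
        return ScaleAction::Down; // remove one node after graceful drain
    }
    ScaleAction::None
}
```

The asymmetric windows and the cooldown give the loop hysteresis: short metric spikes and dips do not trigger scale events.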
Why Conservative Scaling
- Adding 1 node at a time prevents overshooting (workloads often have non-linear resource curves)
- Scale-down windows are longer than scale-up windows (scale down is more disruptive)
- Cooldown prevents oscillation from metric noise
Built-In Scaling Metrics
| Metric | Description | Source | Best For |
|---|---|---|---|
| `gpu_utilization` | Mean GPU SM occupancy across allocation | eBPF / NVML | ML inference services |
| `cpu_utilization` | Mean CPU usage across allocation | eBPF | CPU-bound services |
| `request_rate` | Inbound requests per second | eBPF (network flow tracking) | API/web services |
| `queue_depth` | Pending request queue length | Application-reported or eBPF | Batch-processing services |
Custom Metrics
Any metric available in TSDB can be used for scaling by specifying a label matcher:
lifecycle:
type: reactive
min_nodes: 2
max_nodes: 20
metric: "custom_metric{job='my-inference'}"
target: "100" # e.g., 100 pending requests
The scheduler queries TSDB with the label matcher scoped to the allocation’s nodes.
Configuration Defaults
| Parameter | Default | Configurable |
|---|---|---|
| `evaluation_interval` | 60s | Per allocation |
| `scale_up_window` | 2 minutes | Per allocation |
| `scale_down_window` | 5 minutes | Per allocation |
| `scale_down_threshold` | 0.5 (50% of target) | Per allocation |
| `cooldown_period` | 3 minutes | Per allocation |
Quota Interaction
Scale-up respects the tenant’s `max_nodes` hard quota (cross-ref: quota-enforcement.md):
- Before proposing a scale-up, the scheduler checks if the tenant has remaining node capacity
- If `max_nodes` would be exceeded: scale-up is a no-op, allocation continues at current size
- No error raised — the allocation operates within its current bounds
- If quota is later increased (e.g., via Waldur), scaling resumes automatically
Preemption Interaction
Borrowed nodes (from elastic resource sharing) are valid targets for reactive scaling, but they carry a preemption risk:
- Scaling onto borrowed nodes gives the allocation more capacity temporarily
- If the home vCluster reclaims the node: reactive allocation scales down gracefully
- Minimum guarantee: the `min_nodes` baseline always comes from the allocation’s home vCluster (not borrowed)
Error Handling
Metric Query Failure (TSDB Down)
If the scheduler cannot query TSDB for the scaling metric:
- First failure: skip this evaluation cycle, log warning
- Consecutive failures (3+): alert raised (`lattice_autoscaling_metric_query_failures_total`)
- No scaling decisions made while metric is unavailable — allocation stays at current size
- When TSDB recovers: normal evaluation resumes on next cycle
The allocation is never scaled blindly. No metric = no action.
Scale-Up Proposal Rejected
If the quorum rejects a scale-up proposal (e.g., race condition with another vCluster):
- Retry on next evaluation cycle (60s later)
- Maximum 3 consecutive retries for the same scale-up
- After 3 rejections: log warning, back off for 2 cooldown periods
- Scale-up resumes when conditions change (nodes become available)
Scale-Down During Borrowed Node Reclamation
If a borrowed node is reclaimed by the home vCluster while the reactive allocation is scaling down:
- The reclamation takes priority (home vCluster always wins)
- The reactive allocation loses the node immediately (graceful drain attempted, but not guaranteed)
- If this drops below `min_nodes`: scheduler attempts to acquire a replacement node from the home vCluster
- If no replacement available: allocation operates below `min_nodes` temporarily, alert raised
Metric Oscillation
If the metric oscillates around the target, causing repeated scale-up/scale-down:
- The cooldown period (default: 3 minutes) prevents rapid oscillation
- If scale events alternate for more than 5 cycles: alert raised suggesting the user adjust their target or increase cooldown
- No automatic target adjustment — the user must update the configuration
Preemption During Scale-Up
If a reactive allocation is scaling up while simultaneously being preempted (e.g., a higher-priority job arrives):
- The preemption takes priority — the checkpoint/preemption sequence begins
- Any in-flight scale-up proposals are cancelled (quorum rejects proposals for allocations in `Checkpointing` state)
- After preemption completes: the allocation is suspended with its last stable node count
- When resumed: scaling restarts from `min_nodes`, re-evaluating the metric from scratch
- The cooldown period applies after resume to prevent immediate re-scaling
If preemption and scale-up proposals race at the quorum:
- The quorum serializes all proposals — one wins, the other is rejected
- The rejected proposal is retried on the next scheduling cycle (if still applicable)
Cross-References
- scheduling-algorithm.md — Reactive allocations scored by the knapsack solver like any allocation
- quota-enforcement.md — Hard quota limits on scale-up
- telemetry.md — Metric sources for scaling decisions
- preemption.md — Borrowed node reclamation
- types.rs — `LifecycleType::Reactive` definition
Quota Enforcement
Design Principle
Two-tier enforcement matching the two consistency domains (ADR-004). Hard limits enforced at the quorum (strong consistency, cannot be violated). Soft limits enforced at the scheduler (eventual consistency, may temporarily overshoot, self-correcting).
Hard Quotas (Quorum-Enforced)
Hard quotas are checked during Raft proposal validation, before commit. A proposal that would violate a hard quota is rejected immediately.
| Quota | Scope | Enforcement |
|---|---|---|
| `max_nodes` | Per tenant | Quorum rejects allocation proposals that would exceed the tenant’s maximum concurrent node count |
| `max_concurrent_allocations` | Per tenant | Quorum rejects proposals that would exceed the tenant’s maximum number of running allocations |
| `sensitive_pool_size` | System-wide | Hard limit on the number of nodes that can be claimed for sensitive use |
Guarantees: These quotas cannot be violated, even momentarily. Two vCluster schedulers proposing conflicting allocations that together would exceed a hard quota: the first committed wins, the second is rejected and retried next cycle.
Error handling: Hard quota rejection returns a clear error to the user:
allocation rejected: tenant "physics" would exceed max_nodes quota (current: 195, requested: 10, limit: 200)
Soft Quotas (Scheduler-Level)
Soft quotas are tracked with eventual consistency. They influence scheduling decisions through the cost function but do not hard-block allocations.
GPU-Hours Budget
gpu_hours_budget: 100000 # per billing period (month)
gpu_hours_used: 87500 # eventually consistent counter
Behavior: The scheduler uses remaining budget as a penalty in the cost function. As budget depletes:
- 0-80% used: no penalty
- 80-100% used: increasing penalty (lower scheduling priority)
- Over 100% used: very low score (effective starvation for new allocations, but not hard rejection)
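The tiers above suggest a piecewise penalty curve. A sketch (the weight actually applied in the cost function is a scheduler parameter; the 0-to-1 shape here is illustrative):

```rust
/// Budget penalty as a function of utilization fraction (used / budget),
/// following the tiers described above.
fn budget_penalty(utilization: f64) -> f64 {
    if utilization <= 0.8 {
        0.0 // 0-80% used: no penalty
    } else if utilization <= 1.0 {
        (utilization - 0.8) / 0.2 // 80-100%: penalty ramps from 0 to 1
    } else {
        1.0 // over 100%: maximum penalty (soft starvation, not rejection)
    }
}
```

Because the penalty is continuous rather than a hard cutoff, a tenant approaching its budget is gradually deprioritized instead of abruptly blocked.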
Consistency window: Up to ~30 seconds of lag. Acceptable because: (a) scheduling cycle is 5-30s, (b) over-allocation is self-correcting via fair-share scoring, (c) GPU-hours tracking is for billing, not safety.
Fair Share Target
fair_share_target: 0.15 # tenant should get ~15% of system capacity
Behavior: Feeds into f₃ (fair_share_deficit) in the cost function. Tenants below their share get priority; tenants above are deprioritized. Not a hard ceiling — a tenant can use more than their share when resources are idle.
Burst Allowance
burst_allowance: 1.5 # allow up to 150% of fair share when resources idle
Behavior: Allows temporary over-allocation when the system has spare capacity. When demand increases and other tenants need their share, burst allocations are the first candidates for preemption (via checkpoint cost model).
Internal Budget Ledger
When Waldur is unavailable or not configured, the scheduler computes GPU-hours consumption internally from allocation records in the quorum. This replaces the previously empty budget_utilization map in the cost function.
Computation
Two metrics are tracked:
node_hours_used = Σ (end_time - started_at).hours × assigned_nodes.len()
gpu_hours_used = Σ (end_time - started_at).hours × Σ gpu_count_per_node
- For running allocations: `end_time = now`
- For completed/failed/cancelled: `end_time = completed_at`
- Only allocations within the configured `budget_period_days` (default: 90 days, rolling window) are included
- Node GPU count looked up from current hardware inventory; unknown nodes default to 1 GPU
- Node-hours is the universal metric (works for CPU-only and GPU nodes)
- When both `gpu_hours_budget` and `node_hours_budget` are set, the worse (higher) utilization fraction drives the budget penalty
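The two ledger metrics can be sketched as a fold over allocation records. The record shape here is hypothetical, and the `budget_period_days` window filter is omitted for brevity:

```rust
/// One allocation record as the internal ledger sees it (illustrative shape).
struct AllocRecord {
    started_at_h: f64,       // start time in hours (simplified timestamp)
    ended_at_h: Option<f64>, // None = still running
    node_gpus: Vec<u32>,     // GPU count per assigned node (unknown nodes: 1)
}

/// Returns (node_hours_used, gpu_hours_used) per the rules above.
fn ledger(records: &[AllocRecord], now_h: f64) -> (f64, f64) {
    let mut node_hours = 0.0;
    let mut gpu_hours = 0.0;
    for r in records {
        // Running allocations: end_time = now; otherwise completed_at.
        let end = r.ended_at_h.unwrap_or(now_h);
        let hours = (end - r.started_at_h).max(0.0);
        node_hours += hours * r.node_gpus.len() as f64;
        gpu_hours += hours * r.node_gpus.iter().map(|g| *g as f64).sum::<f64>();
    }
    (node_hours, gpu_hours)
}
```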
Budget Period
Configurable via `scheduling.budget_period_days` (default: 90). This is a rolling window, not a calendar-aligned reset. Calendar-aligned resets require Waldur to push new `gpu_hours_budget` values at period boundaries.
Waldur Override
When Waldur is available, its remaining_budget() response takes precedence over the internal ledger. When Waldur is unavailable (transient failure), the internal ledger provides fallback data so budget enforcement continues.
API Access
- gRPC: `GetTenantUsage` / `GetUserUsage` RPCs in AdminService
- REST: `GET /api/v1/tenants/{id}/usage?days=90` / `GET /api/v1/usage?user=alice&days=90`
- Rust SDK: `client.tenant_usage("physics", 90)` / `client.user_usage("alice", 90)`
- CLI: `lattice usage --tenant physics` / `lattice usage` (uses gRPC)
Exhausted Budget Behavior
GPU-Hours Budget Exhausted
- New allocations for this tenant receive a very low scheduling score (effective starvation, not hard rejection)
- Tenant admin notified via API event
- Running allocations continue to completion (no preemption for budget reasons)
- If Waldur integration enabled: Waldur can update the budget (cross-ref: accounting.md)
- Tenant admin can request budget increase through Waldur self-service portal
Max Nodes Exhausted
- Hard rejection at quorum — clear error returned to user
- User must wait for running allocations to complete or cancel existing allocations
- No waiting queue for hard-quota-blocked allocations (submit is rejected, user resubmits when capacity is available)
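The contrast between the two exhaustion modes (soft starvation for budgets, hard rejection for node caps) can be sketched as below. The names `admit`, `Admission`, and `score_multiplier` are illustrative, not the lattice-scheduler API.

```rust
struct TenantQuota {
    max_nodes: u32,        // hard quota, quorum-enforced
    gpu_hours_budget: f64, // soft quota, drives the cost function penalty
}

enum Admission {
    Accept { score_multiplier: f64 },
    Reject { reason: String },
}

fn admit(q: &TenantQuota, nodes_in_use: u32, requested_nodes: u32, gpu_hours_used: f64) -> Admission {
    // Hard quota: reject outright with a clear error.
    if nodes_in_use + requested_nodes > q.max_nodes {
        return Admission::Reject {
            reason: format!(
                "exceeds max_nodes quota ({} > {})",
                nodes_in_use + requested_nodes,
                q.max_nodes
            ),
        };
    }
    // Soft quota: never reject; drive the scheduling score toward zero
    // instead (effective starvation while the budget is exhausted).
    let score_multiplier = if gpu_hours_used >= q.gpu_hours_budget { 0.01 } else { 1.0 };
    Admission::Accept { score_multiplier }
}
```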
Quota Update Flow
Administrative Update
System admin updates tenant quotas via CLI or API:
# CLI (uses gRPC UpdateTenant RPC)
lattice admin tenant update physics \
--max-nodes 250 \
--max-concurrent-allocations 50 \
--gpu-hours-budget 150000 \
--node-hours-budget 500000
# Python SDK
await client.update_tenant("physics", {
"max_nodes": 250,
"max_concurrent_allocations": 50,
"gpu_hours_budget": 150000,
"node_hours_budget": 500000,
})
# REST
PUT /api/v1/tenants/{id}
{
"max_nodes": 250,
"max_concurrent_allocations": 50,
"gpu_hours_budget": 150000,
"node_hours_budget": 500000
}
Hard quota changes are Raft-committed (immediate effect). Soft quota changes propagate eventually.
Waldur-Driven Update
When Waldur integration is enabled, Waldur can push quota changes:
- Waldur determines budget exhaustion or contract change
- Waldur calls lattice-api: `PUT /api/v1/tenants/{id}` (authenticated with a Waldur service token)
- Hard quotas are committed via Raft; soft quotas are propagated to schedulers
- Reducing `max_nodes` below current usage does not preempt running allocations — it prevents new ones
Quota Reduction While Allocations Are Running
When a quota is reduced below current usage (e.g., Waldur reduces max_nodes from 200 to 100, but tenant is currently using 150):
Hard Quota Reduction
- Running allocations are not preempted. The reduced quota only blocks new allocations.
- Current usage (150) exceeds new limit (100): all new proposals for this tenant are rejected until usage drops below 100.
- The user receives a clear error on new submissions:

      allocation rejected: tenant "physics" exceeds max_nodes quota
      Current usage: 150 nodes
      New limit: 100 nodes
      Hint: Wait for running allocations to complete, or contact your tenant admin.

- As running allocations complete naturally, usage drops; once usage falls below the new limit, new allocations are accepted again.
Soft Quota Reduction
- Reduced `gpu_hours_budget`: the scheduling score penalty increases. Pending allocations get lower priority but are not rejected.
- Reduced `fair_share_target`: the tenant is deprioritized but can still schedule when resources are idle.
- No immediate impact on running allocations.
Pending Allocations
Allocations that are Pending (in the scheduler queue but not yet committed) when a hard quota is reduced:
- They are not retroactively cancelled.
- If proposed to quorum, the proposal is rejected due to the new quota.
- The scheduler will not re-propose them until quota headroom exists.
- The user sees the allocation stuck in the `Pending` state. `lattice status` shows the reason: `"waiting for quota headroom"`.
Sensitive Quota Considerations
Sensitive quotas are always hard quotas:
- `sensitive_pool_size` — system-wide hard limit, quorum-enforced
- Sensitive node claims always go through the quorum (strong consistency)
- No soft/eventual quota mechanisms for sensitive resources
- Idle sensitive nodes (claimed but unused) are not reclaimable — they remain allocated to the claiming user
Cross-ref: sensitive-workloads.md for the full sensitive workload model.
Cross-References
- scheduling-algorithm.md — f₃ fair_share_deficit uses soft quota targets
- accounting.md — Waldur quota feedback loop
- sensitive-workloads.md — Sensitive quotas are always hard
- autoscaling.md — Scale-up respects hard quota limits
GPU Topology
Design Principle
Vendor-neutral abstraction over GPU interconnect topologies. The scheduler reasons about “GPU domains” and “link bandwidth,” not vendor-specific terms. Node agents discover and report topology; the scheduler uses it for placement decisions.
Vendor Support
| Vendor | GPU Family | Interconnect | Topology Discovery | Metrics Collection |
|---|---|---|---|---|
| NVIDIA | H100, GH200, B200 | NVLink, NVSwitch | NVML (nvmlDeviceGetTopologyCommonAncestor) | NVML / DCGM |
| AMD | MI300X, MI300A | Infinity Fabric, xGMI | ROCm-SMI (rsmi_topo_get_link_type) | ROCm-SMI / rocm_smi_lib |
Additional vendors can be supported by implementing the topology discovery trait in the node agent.
Abstraction Model
struct GpuTopology {
    gpus: Vec<GpuDevice>,
    links: Vec<GpuLink>,
    nic_affinity: Map<GpuIndex, NicId>, // which NIC is closest to which GPU
}

struct GpuDevice {
    index: u32,
    vendor: GpuVendor,          // Nvidia | Amd
    model: String,              // "H100", "MI300X"
    memory_bytes: u64,
    compute_capability: String, // CUDA CC or GCN/CDNA arch
}

struct GpuLink {
    gpu_a: u32,
    gpu_b: u32,
    link_type: GpuLinkType, // NvLink | NvSwitch | InfinityFabric | Xgmi | Pcie
    bandwidth_gbps: f64,
}
The node agent populates this structure at startup using vendor-specific APIs and reports it alongside node capabilities and health data.
Link Types and Bandwidth
| Link Type | Typical Bandwidth | Latency | Notes |
|---|---|---|---|
| NVLink (H100) | 450 GB/s per link | ~1 μs | Direct GPU-to-GPU |
| NVSwitch (H100) | 900 GB/s all-to-all | ~1 μs | Full-bisection via switch |
| Infinity Fabric (MI300X) | 896 GB/s aggregate | ~1 μs | XGMI links between dies |
| PCIe Gen5 | 64 GB/s | ~2-5 μs | Fallback, cross-socket |
| PCIe Gen4 | 32 GB/s | ~2-5 μs | Older systems |
Actual bandwidth is discovered at runtime via vendor APIs, not hardcoded.
Intra-Node Scheduling Impact
ADR-007 defines “full-node scheduling with intra-node packing.” GPU topology informs the intra-node packing:
Multi-GPU Jobs Within a Node
For allocations requesting fewer GPUs than the node has, the node agent packs on GPUs with direct high-bandwidth links:
- Prefer GPUs connected via NVLink/NVSwitch/InfinityFabric (direct high-bandwidth)
- Avoid splitting across PCIe domains when high-bandwidth links are available
- For NCCL/RCCL workloads, contiguous GPU groups minimize communication overhead
Multi-Node Jobs
For allocations spanning multiple nodes:
- Prefer nodes where GPU-to-NIC affinity matches — GPUs closest to the NIC used for inter-node communication (Slingshot/Ultra Ethernet)
- NIC affinity reduces PCIe hops for inter-node traffic, improving MPI/NCCL allreduce performance
- Combined with f₄ (topology_fitness): inter-node placement minimizes dragonfly group span, intra-node placement maximizes link bandwidth
Selection Algorithm
For a k-GPU allocation on a node with n GPUs:
1. Build a graph of GPUs weighted by link bandwidth
2. Find the k-GPU subgraph with maximum minimum link bandwidth
3. If multiple subgraphs tie: prefer the one with best NIC affinity
4. Assign allocation to selected GPUs via cgroup/device isolation
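On typical node sizes (n ≤ 8 GPUs, so at most 256 subsets) the selection can be done by brute force. A sketch under that assumption, with the NIC-affinity tie-break of step 3 omitted for brevity; `bw` is a symmetric GPU-to-GPU bandwidth matrix:

```rust
// Weakest link inside a candidate subset (the quantity we maximize).
fn min_internal_bandwidth(subset: &[usize], bw: &[Vec<f64>]) -> f64 {
    let mut min_bw = f64::INFINITY;
    for (i, &a) in subset.iter().enumerate() {
        for &b in &subset[i + 1..] {
            min_bw = min_bw.min(bw[a][b]);
        }
    }
    min_bw
}

// Enumerate all k-GPU subsets via bitmask; keep the subset whose
// weakest internal link is strongest (max-min bandwidth).
fn select_gpus(n: usize, k: usize, bw: &[Vec<f64>]) -> Vec<usize> {
    let mut best: (f64, Vec<usize>) = (f64::NEG_INFINITY, vec![]);
    for mask in 0u32..(1u32 << n) {
        if mask.count_ones() as usize != k {
            continue;
        }
        let subset: Vec<usize> = (0..n).filter(|&i| mask & (1u32 << i) != 0).collect();
        let score = min_internal_bandwidth(&subset, bw);
        if score > best.0 {
            best = (score, subset);
        }
    }
    best.1
}
```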
MIG / GPU Partitioning
NVIDIA Multi-Instance GPU (MIG)
H100 can partition into up to 7 MIG instances, each with isolated memory, cache, and compute:
| MIG Profile | GPU Memory | SMs | Use Case |
|---|---|---|---|
| 1g.10gb | 10 GB | 1/7 | Interactive, notebooks |
| 2g.20gb | 20 GB | 2/7 | Small inference |
| 3g.40gb | 40 GB | 3/7 | Medium training |
| 4g.40gb | 40 GB | 4/7 | Medium training |
| 7g.80gb | 80 GB | 7/7 | Full GPU (no partitioning) |
MIG is relevant for interactive/small-job vClusters where intra-node packing is used. Each MIG instance is a separate schedulable GPU resource.
AMD
No equivalent partitioning exists as of the MI300 generation. MI300X allocations always get full GPU dies.
Scheduler Integration
- MIG instances are reported as individual `GpuDevice` entries with reduced `memory_bytes` and a `partitioned: true` flag
- The scheduler treats MIG instances like smaller GPUs — no special MIG logic in the knapsack solver
- MIG configuration is managed by the node agent, not the scheduler (reconfiguration requires idle GPU)
Integration with Cost Function
GPU topology extends f₄ (topology_fitness) to include intra-node topology quality:
f₄(j) = α · inter_node_fitness(j) + (1-α) · intra_node_fitness(j)
inter_node_fitness = 1.0 - (groups_needed / max_groups_available) // existing
intra_node_fitness = min_link_bandwidth(selected_gpus) / max_link_bandwidth(node)
α = 0.0 for single-node jobs (only intra-node topology matters)
α = 0.7 for multi-node jobs (inter-node dominates but intra-node still relevant)
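The blend above transcribes directly into code. This is a sketch, not the actual lattice-scheduler implementation; α is passed in per the vCluster profile:

```rust
// Existing inter-node term: fewer dragonfly groups spanned is better.
fn inter_node_fitness(groups_needed: f64, max_groups_available: f64) -> f64 {
    1.0 - groups_needed / max_groups_available
}

// Intra-node term: ratio of the selected GPUs' weakest link to the
// node's best link.
fn intra_node_fitness(min_link_bw_gbps: f64, max_link_bw_gbps: f64) -> f64 {
    min_link_bw_gbps / max_link_bw_gbps
}

// f4 = alpha * inter + (1 - alpha) * intra
fn topology_fitness(inter: f64, intra: f64, alpha: f64) -> f64 {
    alpha * inter + (1.0 - alpha) * intra
}
```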
The node agent reports GpuTopology alongside capabilities and health on every heartbeat (topology is static, but health/utilization changes).
Conformance Interaction
GPU driver version and firmware version are part of the conformance fingerprint (cross-ref: conformance.md). For multi-node GPU jobs, mismatched drivers cause NCCL/RCCL hangs. The conformance fitness factor (f₉) ensures nodes in a multi-GPU allocation share the same driver stack.
Cross-References
- scheduling-algorithm.md — f₄ topology_fitness, f₉ conformance_fitness
- conformance.md — GPU driver version in conformance fingerprint
- telemetry.md — GPU metrics collection (NVML/DCGM, ROCm-SMI)
Memory Topology
Design Principle
Vendor-neutral abstraction over CPU-memory-GPU memory topology. The scheduler reasons about “memory domains” and “interconnect bandwidth,” not vendor-specific terms like NUMA node IDs or NVLink-C2C. Node agents discover and report memory topology; the scheduler uses it for placement decisions and memory policy configuration.
This complements gpu-topology.md, which models GPU interconnects. Memory topology models the CPU-memory-GPU memory hierarchy: NUMA domains, unified memory architectures, and CXL-attached memory tiers.
Memory Domain Types
| Type | Hardware Example | Characteristics | Discovery |
|---|---|---|---|
| Discrete NUMA | Multi-socket Intel Xeon, AMD EPYC | Separate DRAM per socket, asymmetric access latencies | /sys/devices/system/node/ |
| Unified CPU-GPU | NVIDIA Grace Hopper GH200 | NVLink-C2C coherent, single address space across CPU and GPU | NVML + /sys/devices/system/node/ |
| APU / Unified Die | AMD MI300A | CPU + GPU on same package, shared HBM3 pool | ROCm-SMI + hwloc |
| CXL-Attached | CXL Type 3 memory expanders | Pooled or device-attached memory, higher latency than local DRAM | /sys/bus/cxl/ |
| Single-Socket | Single-socket servers | Trivial: one NUMA node, uniform access | /sys/devices/system/node/ |
Abstraction Model
struct MemoryTopology {
    domains: Vec<MemoryDomain>,
    interconnects: Vec<MemoryInterconnect>,
    total_capacity_bytes: u64,
}

struct MemoryDomain {
    id: u32,
    domain_type: MemoryDomainType, // Dram | Hbm | CxlAttached | Unified
    capacity_bytes: u64,
    numa_node: Option<u32>,  // Linux NUMA node ID, if applicable
    attached_cpus: Vec<u32>, // CPU IDs with local access
    attached_gpus: Vec<u32>, // GPU indices with local/coherent access
}

struct MemoryInterconnect {
    domain_a: u32,
    domain_b: u32,
    link_type: MemoryLinkType, // NumaLink | CxlSwitch | CoherentFabric
    bandwidth_gbps: f64,
    latency_ns: u64,
}
enum MemoryDomainType { Dram, Hbm, CxlAttached, Unified }
enum MemoryLinkType { NumaLink, CxlSwitch, CoherentFabric }
The node agent populates this structure at startup alongside GpuTopology and reports it with node capabilities and health data.
Interconnect Bandwidth and Latency
| Link Type | Typical Bandwidth | Typical Latency | Notes |
|---|---|---|---|
| Local DRAM access | 50-100 GB/s per channel | ~80 ns | Same-socket, same NUMA node |
| Remote NUMA (UPI/xGMI) | 20-40 GB/s | ~150-300 ns | Cross-socket, 1.5-3x local latency |
| NVLink-C2C (GH200) | 900 GB/s | ~100 ns | CPU-GPU coherent fabric |
| Infinity Fabric (MI300A) | 896 GB/s aggregate | ~100 ns | On-package CPU-GPU interconnect |
| CXL 2.0 (Type 3) | 32-64 GB/s | ~200-400 ns | Memory expander, higher latency |
| PCIe Gen5 (discrete GPU) | 64 GB/s | ~1-2 us | Non-coherent, requires explicit transfer |
Actual bandwidth and latency are discovered at runtime, not hardcoded.
Superchip Architectures
NVIDIA Grace Hopper (GH200)
Grace CPU + Hopper GPU connected via NVLink-C2C (900 GB/s bidirectional). The CPU and GPU share a single coherent address space — no explicit cudaMemcpy required for data movement.
┌────────────────────────────────────────────────────┐
│ GH200 Superchip │
│ │
│ ┌─────────────────┐ NVLink-C2C ┌─────────────┐ │
│ │ Grace CPU │◄──900 GB/s───►│ Hopper GPU │ │
│ │ 72 cores │ coherent │ 80 GB HBM3 │ │
│ │ 512 GB LPDDR5X │ │ │ │
│ └─────────────────┘ └─────────────┘ │
│ │
│ Single coherent address space (CPU + GPU) │
│ → Maps to one Unified MemoryDomain │
└────────────────────────────────────────────────────┘
Mapping to abstraction:
- One `MemoryDomain { type: Unified }` spanning CPU LPDDR5X + GPU HBM3
- `attached_cpus`: all Grace cores; `attached_gpus`: [Hopper GPU index]
- One `MemoryInterconnect { type: CoherentFabric, bandwidth: 900 }` between CPU and GPU sub-domains
AMD Instinct MI300A
APU with CDNA 3 GPU + Zen 4 CPU on the same package, sharing HBM3 memory pool. No discrete CPU DRAM — all memory is HBM3 accessible by both CPU and GPU.
┌──────────────────────────────────────────────────┐
│ MI300A Package │
│ │
│ ┌─────────────┐ Infinity ┌────────────────┐ │
│ │ Zen 4 CPU │ ◄──Fabric──► │ CDNA 3 GPU │ │
│ │ 24 cores │ 896 GB/s │ 6 XCDs │ │
│ └──────┬──────┘ └───────┬────────┘ │
│ │ │ │
│ └──────┐ ┌───────────┘ │
│ ▼ ▼ │
│ ┌─────────────────────┐ │
│ │ Shared HBM3 Pool │ │
│ │ 128 GB │ │
│ └─────────────────────┘ │
│ │
│ → Maps to one Unified MemoryDomain │
└──────────────────────────────────────────────────┘
Mapping to abstraction:
- One `MemoryDomain { type: Unified }` for the shared HBM3 pool
- `attached_cpus`: all Zen 4 cores; `attached_gpus`: [MI300A GPU index]
- The internal Infinity Fabric interconnect is not separately modeled (on-package, always present)
Discovery
The node agent discovers memory topology at startup using platform-specific sources:
| Source | What It Provides | Platform |
|---|---|---|
| `/sys/devices/system/node/` | NUMA node count, CPU-to-node mapping, memory per node | Linux (all) |
| `numactl --hardware` | NUMA distances (latency matrix between nodes) | Linux (all) |
| `hwloc` | Portable topology discovery, cache hierarchy, PCI locality | Linux (all) |
| NVML | GPU-to-NUMA affinity, NVLink-C2C detection (GH200) | NVIDIA GPUs |
| ROCm-SMI | GPU-to-NUMA affinity, MI300A detection | AMD GPUs |
| `/sys/bus/cxl/` | CXL device enumeration, memory regions, interleave config | CXL-capable systems |
Superchip Detection
GH200 and MI300A superchips are identified by GPU model string during GPU discovery (cross-ref: gpu-topology.md). When detected:
- The node agent queries the coherent memory size via vendor API (NVML for GH200, ROCm-SMI for MI300A)
- NUMA nodes associated with both the CPU and the GPU are merged into a single `Unified` domain
- The coherent interconnect bandwidth is reported as a `CoherentFabric` link
Discovery Fallback
If vendor APIs are unavailable (e.g., driver not loaded), the node agent falls back to hwloc for topology and reports Dram domains only. GPU memory domains are still reported via the GPU topology path but without coherent interconnect metadata.
Scheduling Impact
Extending f₄ (topology_fitness)
Memory topology extends the intra-node component of f₄ alongside GPU topology:
intra_node_fitness = β · gpu_link_fitness + (1-β) · memory_locality_fitness
memory_locality_fitness(j, selected_nodes) =
average over selected nodes of:
fraction of allocation's CPUs and GPUs in the same memory domain
β = 0.7 for GPU-heavy workloads (GPU interconnect dominates)
β = 0.3 for CPU-heavy workloads with GPU offload (memory locality dominates)
β = 0.5 default
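A sketch of the blend and of one plausible way to compute the locality term (the fraction of the allocation's CPUs and GPUs landing in a single memory domain). The domain-ID inputs are illustrative; the real scheduler works from `MemoryTopology`:

```rust
use std::collections::HashMap;

// Fraction of the allocation's CPUs and GPUs in the best single domain.
fn memory_locality_fitness(cpu_domains: &[u32], gpu_domains: &[u32]) -> f64 {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &d in cpu_domains.iter().chain(gpu_domains) {
        *counts.entry(d).or_insert(0) += 1;
    }
    let total = cpu_domains.len() + gpu_domains.len();
    if total == 0 {
        return 1.0; // nothing to place: trivially local
    }
    let best = counts.values().copied().max().unwrap_or(0);
    best as f64 / total as f64
}

// intra_node_fitness = beta * gpu_link + (1 - beta) * memory_locality
fn intra_node_fitness(gpu_link: f64, mem_locality: f64, beta: f64) -> f64 {
    beta * gpu_link + (1.0 - beta) * mem_locality
}
```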
Constraint Hints
Allocations can specify memory topology preferences:
| Constraint | Effect |
|---|---|
| `prefer_same_numa` | Soft: prefer placing all CPUs in a single NUMA domain |
| `require_unified_memory` | Hard: only schedule on nodes with `Unified` memory domains (GH200, MI300A) |
| `prefer_local_memory` | Soft: prefer NUMA-local memory allocation policy |
| `allow_cxl_memory` | Opt-in: allow scheduling on CXL-expanded memory capacity |
Hard constraints filter nodes before the knapsack solver runs. Soft constraints contribute to memory_locality_fitness.
Intra-Node CPU-GPU Co-location
On discrete NUMA systems (e.g., dual-socket with 4 GPUs per socket), the node agent co-locates an allocation’s CPU cores and GPUs within the same NUMA domain when possible:
For an allocation requesting k CPUs and g GPUs on a multi-NUMA node:
1. Identify NUMA domains that have both free CPUs and GPUs with local affinity
2. Prefer the domain where GPU-to-NIC affinity is best (for inter-node traffic)
3. Assign CPUs and GPUs from the same domain via cgroup/cpuset
4. If the allocation spans domains: prefer domains connected by highest-bandwidth link
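Steps 1-3 can be sketched as a single-domain selection pass; the spanning fallback of step 4 is left to the caller. `NumaDomain` and its fields are illustrative, not the lattice-node-agent types:

```rust
struct NumaDomain {
    id: u32,
    free_cpus: u32,
    free_gpus: u32,
    nic_local: bool, // true if the inter-node NIC hangs off this domain
}

/// Pick a NUMA domain that can host k CPUs and g GPUs, preferring
/// NIC-local domains for inter-node traffic. Returns None if no single
/// domain fits (the caller then spans domains, preferring the
/// highest-bandwidth link between them).
fn pick_domain(domains: &[NumaDomain], k: u32, g: u32) -> Option<u32> {
    domains
        .iter()
        .filter(|d| d.free_cpus >= k && d.free_gpus >= g)
        .max_by_key(|d| (d.nic_local, d.free_gpus)) // NIC affinity first
        .map(|d| d.id)
}
```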
Memory Mapping Policies
The node agent configures memory allocation policy at allocation start via numactl (or equivalent). This is transparent to the user unless they specify a preference.
| Policy | numactl Flag | When Used |
|---|---|---|
| Local | `--localalloc` | Default: allocate on the NUMA node where the thread runs |
| Interleave | `--interleave=all` | Large shared datasets that all threads access equally |
| Preferred | `--preferred=<node>` | Pin to a specific NUMA node (for known data locality) |
| Bind | `--membind=<nodes>` | Strict: only allocate from the specified nodes (sensitive isolation) |
On unified memory architectures (GH200, MI300A), NUMA policy has reduced impact since CPU and GPU share the same memory pool. The node agent skips numactl configuration for allocations on unified nodes unless the user explicitly requests a policy.
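One plausible way the node agent could translate the policy table into numactl flags is sketched below. The `MemPolicy` enum is illustrative; the flags themselves are standard numactl options:

```rust
enum MemPolicy {
    Local,
    Interleave,
    Preferred(u32),
    Bind(Vec<u32>),
}

// Build the numactl arguments that precede the user command.
fn numactl_args(policy: &MemPolicy) -> Vec<String> {
    match policy {
        MemPolicy::Local => vec!["--localalloc".into()],
        MemPolicy::Interleave => vec!["--interleave=all".into()],
        MemPolicy::Preferred(node) => vec![format!("--preferred={}", node)],
        MemPolicy::Bind(nodes) => {
            let list: Vec<String> = nodes.iter().map(|n| n.to_string()).collect();
            vec![format!("--membind={}", list.join(","))]
        }
    }
}
```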
Allocation-Level Override
Users can specify memory policy in the allocation request:
resources:
cpus: 24
gpus: 1
memory_gb: 128
constraints:
memory_policy: interleave # optional: local | interleave | preferred | bind
require_unified_memory: true # optional: only unified architectures
CXL Memory Tiers
CXL Type 3 memory expanders add a new capacity tier: higher latency than local DRAM but lower cost per GB. The scheduler treats CXL memory as a separate resource dimension.
Capacity Model
Node memory capacity:
local_dram_bytes: 512 GB (fast, NUMA-local)
cxl_memory_bytes: 2 TB (slower, CXL-attached)
total_bytes: 2.5 TB
Allocation can request:
memory_gb: 256 # scheduler satisfies from local DRAM
memory_gb: 1024 # scheduler must use CXL tier (exceeds local DRAM)
memory_gb: 1024
allow_cxl_memory: true # explicit opt-in for CXL tier
Scheduling Rules
- By default, allocations are placed using local DRAM capacity only
- If `allow_cxl_memory: true`, CXL capacity is included in available memory
- Allocations requesting more memory than local DRAM are placed on CXL-capable nodes only when the constraint is set
- CXL memory appears as a separate `CxlAttached` domain in `MemoryTopology`
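The placement rules reduce to a simple capacity check per node. A sketch with illustrative types:

```rust
struct NodeMemory {
    local_dram_bytes: u64,
    cxl_memory_bytes: u64,
}

/// Can this node satisfy the memory request under the rules above?
fn fits(node: &NodeMemory, requested_bytes: u64, allow_cxl: bool) -> bool {
    let available = if allow_cxl {
        node.local_dram_bytes + node.cxl_memory_bytes // opt-in: count the CXL tier
    } else {
        node.local_dram_bytes // default: local DRAM only
    };
    requested_bytes <= available
}
```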
Cross-References
- gpu-topology.md — GPU interconnect topology, NIC affinity, intra-node GPU selection
- telemetry.md — NUMA locality metrics collection (eBPF), memory utilization
- scheduling-algorithm.md — f₄ topology_fitness, knapsack solver, constraint handling
- node-lifecycle.md — Node agent startup, health reporting, capability discovery
- conformance.md — Hardware configuration fingerprint (includes memory architecture)
Performance Tuning Guide
Design Principle
Tuning Lattice is primarily about tuning the cost function weights per vCluster. The RM-Replay simulator is the primary tool: capture production traces, replay with different weights, measure outcomes, deploy with confidence.
Cost Function Sensitivity
Weight Impact Matrix
Each cost function weight controls a trade-off. Increasing one weight reduces the influence of others:
| Weight Increased | Positive Effect | Negative Effect | When to Increase |
|---|---|---|---|
| w₁ (priority) | High-priority jobs scheduled faster | Low-priority jobs starve longer | Many priority levels with strict SLAs |
| w₂ (wait_time) | Better anti-starvation, fairer wait distribution | May schedule low-value jobs before high-value ones | Long tail of wait times |
| w₃ (fair_share) | Tenants get closer to contracted share | May reduce overall utilization (leaving resources idle) | Multi-tenant with strict fairness requirements |
| w₄ (topology) | Better placement, higher network performance | May increase wait time (holding out for ideal placement) | Network-sensitive workloads (NCCL, MPI allreduce) |
| w₅ (data_readiness) | Less I/O stall at job start | May delay jobs whose data isn’t pre-staged | Large-dataset workloads |
| w₆ (backlog) | System responds to queue pressure | May destabilize scheduling when queue fluctuates | Bursty submission patterns |
| w₇ (energy) | Lower electricity costs | Jobs may wait for cheap-energy windows | Time-flexible workloads, sites with TOU pricing |
| w₈ (checkpoint) | More flexible resource rebalancing | Overhead from frequent checkpointing | Preemption-heavy environments |
| w₉ (conformance) | Fewer driver-mismatch issues | Fewer candidate nodes (smaller conformance groups) | Multi-node GPU workloads |
Common Trade-offs
Throughput vs. Fairness (w₃):
- Low w₃ (0.05): maximize utilization — schedule whatever fits, regardless of tenant share
- High w₃ (0.35): enforce fairness — tenants below their share get priority even if it means idle resources
Typical compromise: w₃ = 0.15-0.25
Wait Time vs. Topology (w₂ vs. w₄):
- High w₂, low w₄: schedule quickly in any topology — reduces wait but may hurt network performance
- Low w₂, high w₄: wait for good topology — increases wait but improves job runtime
Typical for HPC: w₂ = 0.25, w₄ = 0.15. Typical for ML training: w₂ = 0.10, w₄ = 0.30.
Utilization vs. Energy (w₇):
- w₇ = 0.00: schedule immediately regardless of energy cost (default for most sites)
- w₇ = 0.10-0.15: delay time-flexible jobs to cheap-energy windows
Only relevant for sites with significant time-of-use electricity pricing.
Using RM-Replay
Overview
RM-Replay replays production workload traces through the scheduler in simulation mode. No real resources are used. Simulation runs in seconds, not hours.
Reference: Martinasso et al., “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management” (SC18).
Step 1: Capture Traces
Record workload traces from production (or synthetic workloads):
# Enable trace capture (writes to S3)
lattice admin config set scheduler.trace_capture=true
lattice admin config set scheduler.trace_path="s3://lattice-traces/"
# Capture for a representative period (1 week recommended)
# Traces include:
# - Allocation submissions (arrival time, resources, constraints, tenant, priority)
# - Allocation completions (actual duration, exit status)
# - Node inventory (capabilities, topology, conformance groups)
Trace format is a timestamped event log (JSON lines):
{"ts": "2026-03-01T00:00:01Z", "type": "submit", "alloc": {"nodes": 64, "gpu_type": "GH200", "walltime": "72h", "tenant": "physics", "priority": 4}}
{"ts": "2026-03-01T00:00:05Z", "type": "complete", "alloc_id": "abc-123", "duration": "68h", "exit": 0}
Step 2: Configure Weights
Create weight profiles to compare:
# profiles/baseline.yaml (current production weights)
hpc-batch:
priority: 0.20
wait_time: 0.25
fair_share: 0.25
topology: 0.15
data_readiness: 0.10
backlog: 0.05
energy: 0.00
checkpoint: 0.00
conformance: 0.10
# profiles/fairness-boost.yaml (experiment: more fairness)
hpc-batch:
priority: 0.15
wait_time: 0.20
fair_share: 0.35 # increased
topology: 0.15
data_readiness: 0.10
backlog: 0.05
energy: 0.00
checkpoint: 0.00
conformance: 0.10
Step 3: Replay
# Replay with baseline weights
rm-replay --trace=traces/week-2026-03.jsonl \
--weights=profiles/baseline.yaml \
--nodes=inventory/alps.yaml \
--output=results/baseline/
# Replay with experimental weights
rm-replay --trace=traces/week-2026-03.jsonl \
--weights=profiles/fairness-boost.yaml \
--nodes=inventory/alps.yaml \
--output=results/fairness-boost/
Step 4: Evaluate
RM-Replay produces a summary report:
=== RM-Replay Results: fairness-boost ===
Utilization:
GPU-hours consumed: 1,234,567 / 1,500,000 available (82.3%)
↓ 2.1% vs baseline (84.4%)
Wait Time:
p50: 12 min (baseline: 10 min) ↑ 20%
p95: 2.1 hr (baseline: 2.5 hr) ↓ 16%
p99: 8.3 hr (baseline: 12.1 hr) ↓ 31%
Fairness (Jain's Index):
0.94 (baseline: 0.87) ↑ 8%
Tenant Share Deviation:
Max deviation: 3.2% (baseline: 8.7%) ↓ 63%
Backfill:
Backfill jobs: 342 (baseline: 367) ↓ 7%
Preemptions:
Total: 15 (baseline: 12) ↑ 25%
Step 5: Decide and Deploy
Compare results across profiles. When satisfied:
# Deploy new weights (hot-reloadable, no restart)
lattice admin vcluster set-weights --name=hpc-batch \
--priority=0.15 --wait-time=0.20 --fair-share=0.35 \
--topology=0.15 --data-readiness=0.10 --backlog=0.05 \
--energy=0.00 --checkpoint=0.00 --conformance=0.10
Weights take effect on the next scheduling cycle.
Scheduling Cycle Tuning
The scheduling cycle interval affects responsiveness vs. overhead:
| Interval | Effect | Recommended For |
|---|---|---|
| 5s | Fast scheduling, higher CPU on scheduler | Interactive vCluster, small clusters |
| 15s | Balanced | HPC batch, ML training |
| 30s | Lower overhead, slower response | Large clusters (5000+ nodes), service vCluster |
lattice admin vcluster set-config --name=hpc-batch --cycle-interval=15s
Backfill Tuning
Backfill depth controls how many future reservations the solver considers:
| Depth | Effect |
|---|---|
| 0 | No backfill (only first-fit) — simple but low utilization |
| 10 | Moderate backfill — good balance |
| 50 | Deep backfill — higher utilization but longer cycle time |
For most sites, depth 10-20 is optimal. Increase if utilization is below target.
Conformance Group Sizing
If conformance groups are too small (many distinct fingerprints), multi-node jobs have fewer candidate sets:
- Symptom: High wait times for multi-node jobs, f₉ scores consistently low
- Diagnosis: `lattice nodes -o wide` shows many distinct conformance hashes
- Fix: Coordinate with OpenCHAMI to standardize firmware versions. Prioritize GPU driver and NIC firmware alignment.
- Workaround: Reduce w₉ for tolerant workloads (services, interactive)
Cross-References
- scheduling-algorithm.md — Cost function definition, weight profiles
- testing-strategy.md — RM-Replay regression suite
- conformance.md — Conformance groups and drift
- telemetry.md — Scheduler self-monitoring metrics for observing tuning impact
Node Lifecycle
Design Principle
Nodes follow a formal state machine with well-defined transitions, timeouts, and operator actions. The node agent drives transitions locally; the quorum records ownership changes with strong consistency. Running allocations are never disrupted by state transitions unless the node is genuinely unhealthy.
State Machine
┌────────────────────────────────────────────┐
│ │
▼ │
┌─────────┐ boot ┌──────────┐ health ok ┌─────────┐ │
│ Unknown │────────→ │ Booting │──────────────→│ Ready │ │
└─────────┘ └──────────┘ └────┬────┘ │
▲ │ │ │
│ boot fail │ │
│ │ ┌────────────────┤ │
│ ▼ │ │ │
│ ┌──────────┐ │ drain cmd │ │
│ │ Failed │ │ │ │ │
│ └──────────┘ │ ▼ │ │
│ │ │ ┌──────────┐ │ remediated
│ wipe/reboot │ │ Draining │ │ │
│ │ │ └─────┬────┘ │ │
│ │ │ allocs done │ │
│ │ │ │ │ │
│ │ │ ▼ │ │
│ │ │ ┌──────────┐ │ │
│ │ │ │ Drained │ │ │
│ │ │ └─────┬────┘ │ │
│ │ │ undrain│ │ │
│ │ │ │ │ │
│ │ │ ▼ │ │
│ │ └──→ (Ready) ◄───┘ │
│ │ │
│ │ heartbeat miss ┌───────────┐│
│ │ ┌────────────────→│ Degraded ││
│ │ │ (Ready) └─────┬─────┘│
│ │ │ grace timeout│
│ │ │ │ │
│ │ │ ▼ │
│ └────┼──────────────────┌─────────┐ │
│ │ │ Down │ │
└──────────────────────────┼──────────────────└────┬────┘ │
│ reboot│ │
│ └──────┘
│
heartbeat resume
(within grace)
│
└──→ (Ready)
States
| State | Description | Schedulable | Allocations Run |
|---|---|---|---|
| `Unknown` | Node exists in inventory but has never reported | No | No |
| `Booting` | OpenCHAMI is booting/reimaging the node | No | No |
| `Ready` | Healthy, agent reporting, available for scheduling | Yes | Yes |
| `Degraded` | Heartbeat missed or minor issue detected | No (new) | Yes (existing) |
| `Down` | Confirmed failure, grace period expired | No | No (requeued) |
| `Draining` | Operator or scheduler requested drain; waiting for allocations to finish | No (new) | Yes (existing, draining) |
| `Drained` | All allocations completed/migrated after drain | No | No |
| `Failed` | Boot failure or unrecoverable hardware error | No | No |
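The lifecycle diagram above can be captured as a small transition table. A minimal sketch; the event names are illustrative, not the lattice-quorum API:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeState { Unknown, Booting, Ready, Degraded, Down, Draining, Drained, Failed }

#[derive(Clone, Copy)]
enum Event {
    Boot, BootFail, HealthOk, HeartbeatMiss, HeartbeatResume,
    GraceTimeout, DrainCmd, AllocsDone, Undrain, Reboot,
}

// One step of the state machine; unlisted (state, event) pairs are no-ops.
fn step(state: NodeState, event: Event) -> NodeState {
    use NodeState::*;
    match (state, event) {
        (Unknown, Event::Boot) | (Down, Event::Reboot) | (Failed, Event::Reboot) => Booting,
        (Booting, Event::HealthOk) => Ready,
        (Booting, Event::BootFail) => Failed,
        (Ready, Event::HeartbeatMiss) => Degraded,
        (Degraded, Event::HeartbeatResume) => Ready,
        (Degraded, Event::GraceTimeout) => Down,
        (Ready, Event::DrainCmd) => Draining,
        (Draining, Event::AllocsDone) => Drained,
        (Drained, Event::Undrain) => Ready,
        (s, _) => s,
    }
}
```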
Transitions
Ready → Degraded
Trigger: First missed heartbeat.
Timeout: `heartbeat_timeout` (default: 30s). If no heartbeat is received within this window, the quorum marks the node Degraded.
Effect: Node is removed from scheduling candidates for new allocations. Running allocations continue undisturbed. No user notification.
Sensitive override: Sensitive nodes use a longer degradation window (default: 2 minutes) to avoid false positives from transient network issues.
Degraded → Ready
Trigger: Heartbeat resumes within the grace period.
Effect: Node re-enters the scheduling pool. No allocation disruption occurred. Event logged but no alert.
Degraded → Down
Trigger: Grace period expired without heartbeat recovery.
Timeouts:
| Node Type | Grace Period | Rationale |
|---|---|---|
| Standard | 60s | Balance between fast recovery and false positive avoidance |
| Sensitive | 5 minutes | Sensitive allocations are high-value; avoid premature requeue |
| Borrowed | 30s | Borrowed nodes should be reclaimed quickly |
Effect:
- All allocations on the node are evaluated per their requeue policy (cross-ref: failure-modes.md)
- Node ownership released (Raft commit)
- Alert raised to operators
- OpenCHAMI notified for out-of-band investigation (Redfish BMC check)
Ready → Draining
Trigger: Explicit operator command (lattice node drain <id>) or scheduler-initiated (upgrade, conformance drift on sensitive node).
Effect:
- Node removed from scheduling candidates
- Running allocations continue until completion
- For urgent drains: scheduler may trigger checkpoint on running allocations (cross-ref: checkpoint-broker.md)
- No new allocations assigned
Draining → Drained
Trigger: All running allocations on the node have completed, been checkpointed, or been migrated.
Effect: Node is idle and safe for maintenance. Operator can upgrade, reboot, or reimage.
Drained → Ready
Trigger: Operator undrain (lattice node undrain <id>). Typically after maintenance.
Precondition: Node agent health check passes (heartbeat, GPU detection, network test, conformance fingerprint computed).
Effect: Node re-enters scheduling pool.
Any → Down (hardware failure)
Trigger: OpenCHAMI Redfish BMC detects critical hardware failure (PSU, uncorrectable ECC, GPU fallen off bus).
Effect: Immediate transition to Down, bypassing grace period. Same allocation handling as Degraded → Down.
Down → Booting
Trigger: Operator or automated remediation initiates reboot/reimage via OpenCHAMI.
Effect: Node enters Booting state. OpenCHAMI BSS serves the appropriate image.
Booting → Ready
Trigger: Node agent starts, passes health check, reports to quorum.
Health check: Heartbeat received, GPU count matches capabilities, NIC firmware detected, conformance fingerprint computed and reported.
Booting → Failed
Trigger: Boot timeout (default: 10 minutes) or repeated boot failures (3 consecutive).
Effect: Node marked Failed. Alert raised. Operator must investigate.
Sensitive Node Lifecycle Extensions
Sensitive nodes have additional constraints:
| Event | Standard Node | Sensitive Node |
|---|---|---|
| Claim | Scheduler assigns | User claims explicitly, Raft-committed |
| Degraded grace | 60s | 5 minutes |
| Down → requeue | Automatic | Operator intervention required |
| Release | Node returns to pool | Node must be wiped (OpenCHAMI secure erase) before returning |
| Conformance drift | Deprioritized | Immediate Draining, audit logged |
Sensitive Release Sequence
1. User releases sensitive allocation
2. Quorum releases node ownership (Raft commit, audit entry)
3. Node enters Draining (if other sensitive allocations) or proceeds to wipe
4. OpenCHAMI initiates secure wipe:
a. GPU memory clear
b. NVMe secure erase (if present)
c. RAM scrub
d. Reboot into clean image
5. Wipe confirmation reported to quorum (Raft commit, audit entry)
6. Node transitions to Ready and returns to general pool
Wipe Failure Handling
If the OpenCHAMI secure wipe fails or times out during sensitive node release:
- Timeout: Default wipe timeout is 30 minutes (configurable: `sensitive.wipe_timeout`). If the wipe does not complete within this window, the node enters a `Quarantine` state (treated as `Down` by the scheduler).
- Quarantine: Quarantined nodes are excluded from scheduling and flagged for operator intervention. They do not return to the general pool.
- Operator intervention: The operator investigates (BMC console, hardware diagnostics) and either:
  - Retries the wipe: `lattice admin node wipe <id> --force`
  - Replaces the node hardware
  - Marks the node as permanently failed: `lattice node disable <id>`
- Audit: Wipe failures are logged as critical audit events (Raft-committed for sensitive nodes). The audit entry records: node ID, wipe start time, failure reason, operator action.
- Alert: `lattice_sensitive_wipe_failure_total` counter incremented; critical alert fired.
Operator Commands
| Command | Effect | Confirmation Required |
|---|---|---|
| `lattice node drain <id>` | Start draining | No |
| `lattice node drain <id> --urgent` | Drain with checkpoint trigger | Yes (allocations will be checkpointed) |
| `lattice node undrain <id>` | Re-enable scheduling | No |
| `lattice node disable <id>` | Transition to Down immediately | Yes (allocations will be requeued/failed) |
| `lattice node enable <id>` | Re-enable a disabled node (Down → Ready) | No |
| `lattice node status <id>` | Show current state, allocations, health | No |
| `lattice node list --state=degraded` | List nodes in a specific state | No |
Heartbeat Protocol
Node agents send heartbeats to the quorum at a configurable interval:
| Parameter | Default | Description |
|---|---|---|
| `heartbeat_interval` | 10s | How often the agent sends a heartbeat |
| `heartbeat_timeout` | 30s | Quorum marks Degraded after this silence |
| `grace_period` | 60s | Degraded → Down after this additional silence |
| `sensitive_grace_period` | 5m | Extended grace for sensitive nodes |
Heartbeats include:
- Monotonic sequence number (replay detection)
- Node health summary (GPU count, temperature, ECC errors)
- Conformance fingerprint (if recomputed since last heartbeat)
- Running allocation count
Heartbeats are lightweight (~200 bytes) and sent over the management traffic class (cross-ref: security.md).
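The quorum-side liveness decision implied by the table above can be sketched as a pure function of heartbeat silence. This is an illustrative sketch, not the actual `lattice-quorum` code; the function and type names are assumptions.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum NodeState { Ready, Degraded, Down }

/// Decide a node's liveness state from the time since its last heartbeat,
/// using the defaults from the table above. `sensitive` selects the
/// extended grace period. (Illustrative only; names are assumptions.)
fn liveness(silence: Duration, sensitive: bool) -> NodeState {
    let timeout = Duration::from_secs(30); // heartbeat_timeout
    let grace = if sensitive {
        Duration::from_secs(5 * 60)        // sensitive_grace_period
    } else {
        Duration::from_secs(60)            // grace_period
    };
    if silence < timeout {
        NodeState::Ready
    } else if silence < timeout + grace {
        NodeState::Degraded
    } else {
        NodeState::Down
    }
}

fn main() {
    assert_eq!(liveness(Duration::from_secs(10), false), NodeState::Ready);
    assert_eq!(liveness(Duration::from_secs(45), false), NodeState::Degraded);
    assert_eq!(liveness(Duration::from_secs(95), false), NodeState::Down);
    // The same 95s of silence leaves a sensitive node only Degraded.
    assert_eq!(liveness(Duration::from_secs(95), true), NodeState::Degraded);
}
```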
Agent Restart and State Recovery
The node agent persists active allocation state to /var/lib/lattice/agent-state.json (configurable via --state-file). This enables workload survival across agent restarts.
On graceful shutdown (SIGTERM):
- Agent writes current allocation state (PIDs, cgroup paths, runtime type, mount points) to the state file
- Agent exits without killing workloads (systemd `KillMode=process`)
On startup:
- Agent reads the persisted state file
- For each allocation, checks whether the process is still alive (`kill(pid, 0)`)
- Alive processes are reattached; the agent resumes heartbeating their status
- Dead processes are treated as orphans: cgroup scopes are destroyed, mounts cleaned up
- Stray cgroup scopes under `workload.slice/alloc-*.scope` with no matching state entry are also cleaned up
- Agent re-registers with quorum and resumes normal operation
Crash recovery: If the agent crashes without writing the state file, the startup scan of cgroup scopes under workload.slice/ provides a fallback discovery mechanism for orphaned workloads.
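The reconciliation on startup reduces to classifying each workload as reattach, orphan, or stray scope. A minimal sketch, assuming a pre-collected view of persisted state and discovered cgroup scopes (the real agent reads these from the state file and `workload.slice/`; all names here are hypothetical):

```rust
use std::collections::HashSet;

/// Outcome of reconciling persisted agent state against what is actually
/// running after a restart. (Sketch; variant names are assumptions.)
#[derive(Debug, PartialEq)]
enum Recovery<'a> {
    Reattach(&'a str),    // process alive: resume heartbeating its status
    CleanOrphan(&'a str), // process dead: destroy cgroup scope, clean mounts
    StrayScope(&'a str),  // cgroup scope with no matching state entry
}

fn reconcile<'a>(
    persisted: &[(&'a str, bool)], // (alloc_id, pid still alive per kill(pid, 0))
    cgroup_scopes: &[&'a str],     // alloc ids found under workload.slice/
) -> Vec<Recovery<'a>> {
    let known: HashSet<&str> = persisted.iter().map(|&(id, _)| id).collect();
    let mut out = Vec::new();
    for &(id, alive) in persisted {
        out.push(if alive { Recovery::Reattach(id) } else { Recovery::CleanOrphan(id) });
    }
    // The cgroup scan doubles as crash-recovery fallback: scopes with no
    // state entry are orphaned workloads to clean up.
    for &scope in cgroup_scopes {
        if !known.contains(scope) {
            out.push(Recovery::StrayScope(scope));
        }
    }
    out
}

fn main() {
    let persisted = [("alloc-1", true), ("alloc-2", false)];
    let scopes = ["alloc-1", "alloc-2", "alloc-9"];
    let plan = reconcile(&persisted, &scopes);
    assert_eq!(plan, vec![
        Recovery::Reattach("alloc-1"),
        Recovery::CleanOrphan("alloc-2"),
        Recovery::StrayScope("alloc-9"),
    ]);
}
```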
Cross-References
- failure-modes.md — Allocation requeue on node failure
- conformance.md — Conformance drift triggers drain on sensitive nodes
- upgrades.md — Drain/undrain during rolling upgrades
- checkpoint-broker.md — Checkpoint on urgent drain
- sensitive-workloads.md — Sensitive node claim/release/wipe
- security.md — Heartbeat authentication (mTLS, sequence numbers)
Node Conformance & Configuration Drift
Problem
In large-scale HPC systems, nodes gradually drift from their intended configuration: firmware versions diverge, driver updates are applied unevenly, kernel parameters change. This configuration drift causes:
- Silent performance degradation. A 64-node NCCL training run where one node has a different NIC firmware version may see unexplained slowdowns or hangs.
- Correctness issues. Mismatched GPU driver versions can produce different numerical results.
- Compliance violations. Regulated workloads require provable consistency of the execution environment.
Design Principle
The scheduler does not manage node configuration — OpenCHAMI does. The scheduler only needs to know whether nodes are the same or different, and how strict the workload’s homogeneity requirements are. Detection is the node agent’s job. Remediation is OpenCHAMI’s job.
Conformance Fingerprint
Each node agent computes a conformance fingerprint: a hash of the node’s configuration-critical software and firmware versions.
Components included in the fingerprint:
- GPU driver version (e.g., NVIDIA 550.54.14)
- NIC firmware version (Slingshot/UE adapter firmware)
- BIOS/BMC firmware version (reported via Redfish/OpenCHAMI)
- Kernel version and boot parameters
- uenv base image hash (for sensitive: the hardened OS image)
The fingerprint is a content hash (SHA-256 of the sorted component list). Nodes with identical fingerprints belong to the same conformance group.
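The key property is that the component list is sorted before hashing, so reporting order cannot change the fingerprint. A dependency-free sketch of that property (the real implementation uses SHA-256; this sketch substitutes the std hasher so it runs without external crates):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Compute a conformance fingerprint from (component, version) pairs.
/// Sketch only: the real fingerprint is SHA-256 of the sorted component
/// list; the sort-then-hash structure shown here is the same.
fn fingerprint(components: &[(&str, &str)]) -> u64 {
    let mut sorted: Vec<String> = components
        .iter()
        .map(|(name, version)| format!("{name}={version}"))
        .collect();
    sorted.sort(); // order-independence: same components => same hash
    let mut h = DefaultHasher::new();
    sorted.join("\n").hash(&mut h);
    h.finish()
}

fn main() {
    let a = fingerprint(&[
        ("gpu_driver", "550.54.14"),
        ("nic_firmware", "2.1.0"),
        ("kernel", "6.5.0"),
    ]);
    // Same components in a different order: same conformance group.
    let b = fingerprint(&[
        ("kernel", "6.5.0"),
        ("gpu_driver", "550.54.14"),
        ("nic_firmware", "2.1.0"),
    ]);
    assert_eq!(a, b);
    // A driver bump puts the node in a different conformance group.
    let c = fingerprint(&[
        ("gpu_driver", "550.90.07"),
        ("nic_firmware", "2.1.0"),
        ("kernel", "6.5.0"),
    ]);
    assert_ne!(a, c);
}
```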
Reporting
The node agent reports the conformance fingerprint alongside its existing health data. This is eventually consistent — conformance group membership does not go through Raft (it’s derived from node agent reports, same as health status).
Exception: for sensitive nodes, conformance state changes are recorded in the Raft-committed audit log (per sensitive workload requirements).
Staleness
The node agent recomputes the fingerprint:
- On startup
- Periodically (default: every 6 hours)
- On explicit request from the scheduler (e.g., after OpenCHAMI remediation)
If a node hasn’t reported a fingerprint within the staleness window, the scheduler treats it as unknown conformance — equivalent to a unique conformance group of one.
Scheduling Integration
Cost Function (f₉)
See scheduling-algorithm.md for the full cost function. The conformance factor f₉ scores how homogeneous the candidate node set is:
f₉(j, candidates) = largest_conformance_group_size(candidates) / j.requested_nodes
- 1.0 → all candidate nodes share the same fingerprint
- 0.5 → half the nodes match, half differ
- Low values → highly heterogeneous set
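The f₉ formula above can be sketched directly: group candidates by fingerprint, take the largest group, divide by the requested node count. (Illustrative only; the solver's real data structures differ.)

```rust
use std::collections::HashMap;

/// Conformance factor f9: size of the largest conformance group among the
/// candidate nodes, divided by the requested node count. Candidates are
/// (node_id, fingerprint) pairs. (Sketch of the formula above.)
fn f9(candidates: &[(&str, &str)], requested_nodes: usize) -> f64 {
    if requested_nodes <= 1 {
        return 1.0; // single-node jobs: conformance is trivially satisfied
    }
    let mut groups: HashMap<&str, usize> = HashMap::new();
    for &(_, fp) in candidates {
        *groups.entry(fp).or_insert(0) += 1;
    }
    let largest = groups.values().copied().max().unwrap_or(0);
    largest as f64 / requested_nodes as f64
}

fn main() {
    // All four candidates share one fingerprint: perfectly homogeneous.
    let uniform = [("n1", "a"), ("n2", "a"), ("n3", "a"), ("n4", "a")];
    assert_eq!(f9(&uniform, 4), 1.0);
    // Two groups of two: at most half the requested set can match.
    let split = [("n1", "a"), ("n2", "a"), ("n3", "b"), ("n4", "b")];
    assert_eq!(f9(&split, 4), 0.5);
}
```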
Node Selection
During node selection (solver step 2a), the solver prefers nodes from the same conformance group:
- Among nodes satisfying constraints (GPU type, topology, etc.), group by conformance fingerprint
- Select the largest conformance group that can satisfy the node count
- If no single group is large enough, merge groups (with a scoring penalty via f₉)
- For single-node jobs, conformance is irrelevant (f₉ = 1.0 trivially)
Per-vCluster Policy
| vCluster Type | Conformance Behavior |
|---|---|
| HPC Batch | Soft preference (w₉=0.10). Prefers homogeneous sets but will mix if needed. |
| ML Training | Strong preference (w₉=0.25). Multi-node training is sensitive to driver mismatches. |
| Service | Weak preference (w₉=0.05). Services are usually single-node or tolerate heterogeneity. |
| Sensitive | Hard constraint at solver level (drifted nodes excluded before scoring). w₉=0.10 as tiebreaker among conformant nodes. |
| Interactive | Ignored (w₉=0.00). Short-lived, single-node, not sensitive to drift. |
Drift Response
When the scheduler detects that a node’s conformance fingerprint has changed (or diverged from the majority in its group):
- Continue running workloads. Existing allocations are not disrupted — the drift already happened, and disrupting would make things worse.
- Stop scheduling new work. The node is deprioritized for new allocations (it now belongs to a smaller conformance group, scoring lower on f₉).
- Signal OpenCHAMI. The scheduler (or node agent) notifies OpenCHAMI that the node has drifted, triggering remediation (firmware update, reboot into correct image, etc.).
- For sensitive nodes: additionally flag the drift in the audit log and set the node to `Draining` (transitioning to `Drained` once active allocations complete); no new sensitive claims until remediated and verified. After remediation, an operator undoes the drain (`Drained` → `Ready`).
The scheduler does not attempt to remediate drift itself. It only avoids scheduling on drifted nodes and signals the infrastructure layer to fix them.
OpenCHAMI Coordination
When the scheduler detects drift:
- Signal: The node agent (or scheduler) calls OpenCHAMI SMD to report the drift:

      PATCH /hsm/v2/State/Components/{xname}
      { "Flag": "Warning", "FlagMsg": "conformance_drift: expected=<hash_a>, actual=<hash_b>" }

- OpenCHAMI response: OpenCHAMI evaluates the drift against its remediation policy:
- Minor drift (kernel param change): schedule firmware update at next maintenance window
- Major drift (GPU driver version): schedule immediate reboot into correct image via BSS
- Critical drift (sensitive node): immediate remediation, operator notified
- Wait for remediation: The scheduler does not re-enable the node automatically. After OpenCHAMI remediates (reboot, firmware flash), the node agent:
- Recomputes conformance fingerprint on startup
- Reports new fingerprint to quorum
- If fingerprint matches expected baseline: node returns to Ready
- If still drifted: remains deprioritized, alert escalated
- Timeout: If a node remains drifted for longer than `drift_remediation_timeout` (default: 24 hours):
  - Alert escalated to critical
  - Node transitions to `Down` (removed from scheduling entirely)
  - Operator must investigate and manually undrain after the fix
- Sensitive nodes (stricter):
  - Drift triggers immediate `Draining` (no grace period for new claims)
  - Remediation timeout: 4 hours (shorter, due to regulatory risk)
  - After remediation: conformance re-verified AND admin approval required before accepting sensitive claims again
Relationship to Existing Concepts
- NodeHealth tracks whether the node is functional (Healthy/Degraded/Down/Draining). Conformance is orthogonal — a node can be Healthy but drifted.
- NodeCapabilities tracks what the node has (GPU type, memory). Conformance tracks whether the node’s software stack matches expectations.
- Topology (GroupId) tracks physical location. Conformance tracks software configuration. Both are inputs to node selection: pack by topology AND by conformance group.
Network Domains
Design Principle
Network domains provide L3 reachability between allocations that need to communicate. They map to Slingshot VNIs (Virtual Network Identifiers) which provide hardware-enforced network isolation. Domains are created on demand, scoped to tenants, and cleaned up automatically.
What is a Network Domain
A network domain is a named group of allocations that share network reachability:
# Two allocations sharing a domain:
allocation_a:
connectivity:
network_domain: "ml-workspace"
allocation_b:
connectivity:
network_domain: "ml-workspace"
Allocations in the same domain can communicate over the Slingshot fabric. Allocations in different domains (or with no domain) are network-isolated at the hardware level.
VNI Lifecycle
Allocation
1. User submits allocation with network_domain: "ml-workspace"
2. lattice-api checks if domain "ml-workspace" exists for this tenant:
a. If exists: allocation joins the existing domain
b. If not: create new domain, allocate VNI from pool
3. VNI assignment is stored in quorum state (eventually consistent)
4. Node agents configure Slingshot NIC with the VNI for the allocation's traffic
VNI Pool
VNIs are allocated from a configured pool:
network:
vni_pool_start: 1000
vni_pool_end: 4095
# Reserved VNIs:
# 1 = management
# 2 = telemetry
# 3-999 = reserved for future use
VNIs are allocated sequentially from the pool. When freed, they return to the available set.
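The allocation discipline described above (lowest available VNI first, freed VNIs returned to the set) can be sketched with a small pool type. This is an illustrative sketch; the real pool also persists assignments in quorum state, and the type name is an assumption.

```rust
use std::collections::BTreeSet;

/// Minimal VNI pool: sequential allocation from a configured range;
/// freed VNIs return to the available set. (Sketch only.)
struct VniPool {
    available: BTreeSet<u32>,
}

impl VniPool {
    fn new(start: u32, end: u32) -> Self {
        Self { available: (start..=end).collect() }
    }
    /// Lowest available VNI, or None when the pool is exhausted
    /// (which surfaces to users as the error shown below).
    fn allocate(&mut self) -> Option<u32> {
        let vni = self.available.iter().next().copied()?;
        self.available.remove(&vni);
        Some(vni)
    }
    fn release(&mut self, vni: u32) {
        self.available.insert(vni);
    }
}

fn main() {
    let mut pool = VniPool::new(1000, 1002); // tiny range for illustration
    assert_eq!(pool.allocate(), Some(1000));
    assert_eq!(pool.allocate(), Some(1001));
    assert_eq!(pool.allocate(), Some(1002));
    assert_eq!(pool.allocate(), None); // exhausted: new domains fail
    pool.release(1001);
    assert_eq!(pool.allocate(), Some(1001)); // freed VNI is reusable
}
```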
Release
1. Last allocation in the domain completes (or is cancelled)
2. Domain enters "draining" state for grace_period (default: 5 minutes)
- Allows brief gaps between allocations in a long-running workflow
3. After grace period with no new allocations: domain is released
4. VNI returns to the available pool
5. Domain name can be reused by the same tenant
The grace period prevents VNI churn in DAG workflows where allocations start and stop in sequence but share a domain.
DAG Domain Persistence
DAG workflows often have sequential stages that share a network domain but have gaps between stages (one allocation completes before the next starts). The grace period (default: 5 minutes) covers these gaps:
- If the next DAG stage starts within the grace period: it joins the existing domain (same VNI, no churn)
- If the gap exceeds the grace period: the domain is released and a new VNI is allocated when the next stage starts
- For long-running DAGs with predictable inter-stage gaps, increase the grace period per domain: `lattice admin network set-grace --domain=<name> --grace=15m`
- The grace period timer resets each time a new allocation joins the domain
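The release decision reduces to two conditions: the domain has no active allocations, and the grace timer (running since the last allocation left, reset on join) has expired. A minimal sketch, with hypothetical names:

```rust
use std::time::Duration;

/// Decide whether a draining domain should be released. The grace timer
/// runs from the moment the last allocation left and resets whenever a
/// new allocation joins. (Sketch; names are assumptions.)
fn should_release(
    active_allocations: usize,
    since_last_departure: Duration,
    grace: Duration,
) -> bool {
    active_allocations == 0 && since_last_departure >= grace
}

fn main() {
    let grace = Duration::from_secs(5 * 60); // default grace period
    // 3 minutes into a gap between DAG stages: keep the VNI.
    assert!(!should_release(0, Duration::from_secs(180), grace));
    // Gap exceeded the grace period: release the domain and its VNI.
    assert!(should_release(0, Duration::from_secs(360), grace));
    // An allocation is still running: never release.
    assert!(!should_release(1, Duration::from_secs(360), grace));
}
```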
Scoping Rules
| Rule | Enforcement |
|---|---|
| Domain names are scoped to a tenant | Two tenants can use the same domain name without conflict |
| Only allocations from the same tenant can share a domain | Cross-tenant domains are not allowed (isolation requirement) |
| Sensitive domains are per-allocation | Each sensitive allocation gets a unique domain (no sharing, even within tenant) |
| Domain names are user-chosen strings | No system-generated names; users pick meaningful names |
Capacity
| Parameter | Default | Notes |
|---|---|---|
| VNI pool size | 3095 (1000-4095) | Sufficient for typical HPC deployments |
| Max domains per tenant | 50 | Configurable per tenant |
| Max allocations per domain | Unlimited | Practical limit: node count |
VNI Exhaustion
If the VNI pool is exhausted:
- New domain creation fails with a clear error:

      Error: cannot create network domain — VNI pool exhausted (3095/3095 in use)
      Hint: Wait for running allocations to complete, or contact your system admin.

- Allocations without `network_domain` are unaffected (they don't need a VNI)
- Alert raised for operators
VNI Exhaustion Mid-DAG
If the VNI pool is exhausted while a DAG has pending allocations that require a new network domain:
- The allocation that needs the new domain enters `Pending` state with reason `vni_pool_exhausted`.
- The DAG stalls at this allocation; downstream dependencies remain blocked.
- Already-running DAG allocations with existing domains are unaffected.
- Mitigation: Use a shared network domain across DAG stages where possible. This avoids new VNI allocation for each stage and reduces pool pressure.
- Recovery: When other allocations complete and release VNIs, the pending allocation is re-evaluated on the next scheduling cycle.
Default Behavior
If an allocation does not specify network_domain:
- Single-node allocations: no VNI needed, no network isolation beyond the default
- Multi-node allocations: automatically assigned a domain named `alloc-{id}` (private to this allocation)
- Services with `expose` ports: automatically assigned a domain if not specified
Service Exposure
For allocations exposing service endpoints:
connectivity:
network_domain: "inference-cluster"
expose:
- name: "api"
port: 8080
protocol: "http"
Exposed ports are reachable from:
- Other allocations in the same network domain (always)
- The lattice-api REST gateway (for external access)
- Not directly reachable from outside the fabric (Slingshot is not routable from Ethernet)
Sensitive Network Domains
Sensitive allocations get strict network isolation:
connectivity:
network_domain: "sensitive-{user}-{alloc_id}" # auto-generated, unique
policy:
ingress: deny-all-except:
- same_domain # only processes in this allocation
- data_gateway # controlled data ingress
egress: deny-all-except:
- data_gateway # controlled data egress
- Each sensitive allocation gets its own domain (no sharing)
- Ingress/egress restricted to a data gateway endpoint
- With Ultra Ethernet: network-level encryption enabled for the VNI
- VNI released immediately on allocation completion (no grace period)
VNI Pool Expansion
To expand the VNI pool when approaching exhaustion:
1. Update the configuration to extend `vni_pool_end`:

       network:
         vni_pool_start: 1000
         vni_pool_end: 8191  # expanded from 4095

2. Restart the API server to pick up the new pool range. Existing domains and their VNI assignments are not affected.

3. Verify: the `lattice_network_vni_pool_total` metric should reflect the new pool size.
Note: The expanded range must not overlap with reserved VNIs (1-999) or VNIs used by other systems on the Slingshot fabric. Coordinate with network administrators before expanding.
Cross-References
- system-architecture.md — Network fabric layer, VNI-based isolation
- sensitive-workloads.md — Sensitive network isolation policy
- security.md — Network security, traffic classes
- api-design.md — Connectivity field in allocation request
MPI Process Management
Design Principle
Lattice must launch and manage multi-node MPI processes without relying on SSH between compute nodes. The node agent provides a Process Management Interface (PMI) so that MPI implementations (OpenMPI, MPICH, Cray MPICH) can perform rank discovery and key-value exchange through Lattice rather than through SSH or a Slurm-specific launcher.
Problem Statement
In Slurm, srun serves as both a process launcher (fan-out to nodes) and a PMI server (rank discovery, KV exchange). Lattice replaces srun with lattice launch / the LaunchTasks RPC, but the current implementation is a stub that does not:
- Fan out process launch to node agents
- Provide PMI wire-up so MPI ranks can discover each other
- Manage CXI credentials for Slingshot/Ultra Ethernet fabric access
Without this, users calling mpirun directly fall back to SSH for remote process spawning, which is:
- A security risk (SSH keys between compute nodes)
- Incompatible with network-domain-only L3 reachability
- Incompatible with the sensitive workload isolation model
- Operationally fragile (SSH host key management, authorized_keys distribution)
Supported MPI Implementations
| Implementation | PMI-2 Support | PMIx Support | Default Launcher | Notes |
|---|---|---|---|---|
| MPICH | Native (PMI-2 origin) | Via external PMIx | Hydra (SSH) | PMI-2 is the natural fit |
| OpenMPI | Yes (OMPI_MCA_pmix=pmi2) | Preferred (PRRTE) | ORTE/PRRTE (SSH) | PMI-2 fully functional |
| Cray MPICH | Native (via PALS) | Via PALS | PALS | PMI-2 without PALS works |
All three support PMI-2. PMIx is preferred by OpenMPI but not required.
Architecture
Two-Tier Design
┌─────────────────────────────────────────────────────────┐
│ Default: Native PMI-2 Server (built into node agent) │
│ Simple, no external dependencies, covers 95%+ of MPI │
│ workloads. ~8 wire commands over Unix domain socket. │
├─────────────────────────────────────────────────────────┤
│ Optional: OpenPMIx Sidecar (feature-flagged) │
│ Full PMIx v4/v5 support for workloads that require │
│ PMIx-specific features (spawn, tools API, events). │
│ Node agent manages OpenPMIx server lifecycle. │
└─────────────────────────────────────────────────────────┘
Launch Flow
User: lattice launch --alloc=123 -n 256 --tasks-per-node=4 ./my_mpi_app
│
▼
lattice-api (LaunchTasks RPC)
│
├─ Validates: allocation is Running, user owns it
├─ Computes rank layout: N nodes × tasks_per_node = total ranks
│ Rank assignment: node 0 gets ranks [0..3], node 1 gets [4..7], ...
├─ Generates launch_id, PMI job attributes (appnum, size, universe_size)
├─ Provisions CXI credentials if Slingshot fabric (see below)
│
▼ Fan-out: gRPC LaunchProcesses to each node agent in the allocation
Node Agent 0 Node Agent 1 Node Agent N-1
│ │ │
├─ Creates PMI-2 server ├─ Creates PMI-2 server ├─ ...
│ (Unix domain socket) │ (Unix domain socket) │
│ │ │
├─ Spawns local ranks ├─ Spawns local ranks │
│ rank 0: ./my_mpi_app │ rank 4: ./my_mpi_app │
│ rank 1: ./my_mpi_app │ rank 5: ./my_mpi_app │
│ rank 2: ./my_mpi_app │ rank 6: ./my_mpi_app │
│ rank 3: ./my_mpi_app │ rank 7: ./my_mpi_app │
│ │ │
│ Each rank inherits: │ │
│ - PMI_FD (socket fd) │ │
│ - PMI_RANK (global rank) │ │
│ - PMI_SIZE (world size) │ │
│ │ │
▼ ▼ ▼
MPI_Init() → PMI-2 fullinit → local KVS puts (libfabric endpoint addr)
│ │ │
▼ ─────────── kvsfence (cross-node KVS exchange via gRPC) ────────────
│ │ │
MPI_Init() completes MPI_Init() completes ...
│ │ │
(application runs) (application runs) ...
│ │ │
MPI_Finalize() → PMI-2 finalize
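The block rank layout used in the fan-out above (node 0 gets ranks [0..3], node 1 gets [4..7], and so on) can be sketched as a simple function. Illustrative only; the function name is an assumption.

```rust
/// Block rank layout: node i hosts global ranks
/// [i * tasks_per_node, (i + 1) * tasks_per_node).
/// Returns (first_rank, num_ranks) per node, matching the
/// first_rank / tasks_per_node fields of LaunchProcessesRequest.
fn rank_layout(nodes: u32, tasks_per_node: u32) -> Vec<(u32, u32)> {
    (0..nodes)
        .map(|i| (i * tasks_per_node, tasks_per_node))
        .collect()
}

fn main() {
    // 256 ranks over 64 nodes at 4 tasks per node.
    let layout = rank_layout(64, 4);
    assert_eq!(layout.len(), 64);
    assert_eq!(layout[0], (0, 4));    // node 0: ranks 0..3
    assert_eq!(layout[1], (4, 4));    // node 1: ranks 4..7
    assert_eq!(layout[63], (252, 4)); // last node: ranks 252..255
    let world_size: u32 = layout.iter().map(|(_, n)| n).sum();
    assert_eq!(world_size, 256);
}
```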
PMI-2 Wire Protocol
The PMI-2 wire protocol is text-based over a Unix domain socket. The node agent implements these commands:
| Command | Direction | Purpose |
|---|---|---|
| `fullinit` | rank → agent | Initialize PMI connection, receive rank/size/appnum |
| `job-getinfo` | rank → agent | Query job attributes (e.g., universe size) |
| `kvsput` | rank → agent | Store a key-value pair (e.g., libfabric endpoint address) |
| `kvsget` | rank → agent | Retrieve a key-value pair |
| `kvsfence` | rank → agent | Barrier + distribute all KV pairs across all ranks |
| `finalize` | rank → agent | Clean shutdown of PMI connection |
| `abort` | rank → agent | Signal abnormal termination |
| `spawn` | rank → agent | Dynamic process spawning (optional, rarely used) |
Cross-Node KVS Exchange (Fence)
The kvsfence operation is the only cross-node PMI operation. It requires all ranks across all nodes to synchronize and exchange accumulated KV pairs. This is implemented via gRPC between node agents:
kvsfence triggered on all nodes
│
▼
Phase 1: Local collection
Each node agent collects all kvsput entries from its local ranks.
Phase 2: Exchange (star topology via designated head node)
┌─────────────┐
│ Head Agent │ ◄──── gRPC PmiFence(local_kvs) ──── Agent 1
│ (rank 0's │ ◄──── gRPC PmiFence(local_kvs) ──── Agent 2
│ node) │ ◄──── gRPC PmiFence(local_kvs) ──── Agent N-1
│ │
│ Merges all │
│ KVS entries │
│ │
│ Broadcasts │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent 1
│ merged KVS │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent 2
│ │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent N-1
└─────────────┘
Phase 3: Local completion
Each node agent unblocks its local ranks' kvsfence.
Ranks can now kvsget any key from any node.
The head agent is the node agent hosting rank 0. For large jobs (>128 nodes), a tree-based reduction can be used instead of a star to reduce head-node pressure.
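The head agent's merge step in Phase 2 is a plain union of the per-node KVS snapshots: each rank writes keys unique to itself (e.g., its endpoint address), so no conflict resolution is needed. A minimal sketch of that merge, with hypothetical names:

```rust
use std::collections::HashMap;

/// Head-agent side of the star fence: merge the per-node KVS snapshots
/// received via PmiFence into one map, which is then broadcast back in
/// PmiFenceComplete. (Sketch; rank keys are unique, so a plain union
/// suffices.)
fn merge_fence(snapshots: Vec<HashMap<String, String>>) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for kvs in snapshots {
        merged.extend(kvs);
    }
    merged
}

fn main() {
    // Each agent contributes the libfabric endpoint addresses of its
    // local ranks (illustrative key/value formats).
    let agent0 = HashMap::from([("rank0-addr".to_string(), "fi://n0/0".to_string())]);
    let agent1 = HashMap::from([("rank4-addr".to_string(), "fi://n1/0".to_string())]);
    let merged = merge_fence(vec![agent0, agent1]);
    // After the broadcast, any rank can kvsget any other rank's address.
    assert_eq!(merged.len(), 2);
    assert_eq!(merged["rank0-addr"], "fi://n0/0");
    assert_eq!(merged["rank4-addr"], "fi://n1/0");
}
```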
Node Agent gRPC Extensions
New RPCs on the node agent service for MPI process management:
service NodeAgentService {
// Existing RPCs...
// Launch MPI ranks on this node (called by API server during fan-out)
rpc LaunchProcesses(LaunchProcessesRequest) returns (LaunchProcessesResponse);
// PMI fence exchange between node agents
rpc PmiFence(PmiFenceRequest) returns (PmiFenceResponse);
// PMI fence completion broadcast from head agent
rpc PmiFenceComplete(PmiFenceCompleteRequest) returns (PmiFenceCompleteResponse);
// Notify all local ranks to abort (e.g., one node failed)
rpc AbortProcesses(AbortProcessesRequest) returns (AbortProcessesResponse);
}
message LaunchProcessesRequest {
string launch_id = 1;
string allocation_id = 2;
string entrypoint = 3;
repeated string args = 4;
uint32 tasks_per_node = 5;
uint32 first_rank = 6; // global rank offset for this node
uint32 world_size = 7; // total ranks across all nodes
map<string, string> env = 8; // additional env vars
PmiMode pmi_mode = 9; // PMI2 (default) or PMIX
// CXI credentials for Slingshot fabric
optional CxiCredentials cxi_credentials = 10;
// Peer node agents for fence exchange
repeated PeerInfo peers = 11;
// Index of the head node (for fence coordination)
uint32 head_node_index = 12;
}
message PeerInfo {
string node_id = 1;
string grpc_address = 2; // node agent address (reachable via management network)
uint32 first_rank = 3;
uint32 num_ranks = 4;
}
enum PmiMode {
PMI2 = 0;
PMIX = 1;
}
message CxiCredentials {
uint32 vni = 1;
bytes auth_key = 2;
uint32 svc_id = 3;
}
PMI-2 Server Implementation
Each node agent runs a PMI-2 server per launch (one Unix socket per launch_id):
Node Agent
│
├─ LaunchProcesses received
│ ├─ Create Unix socket: /tmp/lattice-pmi-{launch_id}.sock
│ ├─ Start PMI-2 server task (tokio)
│ ├─ Fork/exec ranks with:
│ │ PMI_FD={fd} # inherited socket fd
│ │ PMI_RANK={rank} # global rank
│ │ PMI_SIZE={world_size} # world size
│ │ PMI_SPAWNED=0 # not dynamically spawned
│ │ LATTICE_LAUNCH_ID={launch_id}
│ │ LATTICE_ALLOC_ID={allocation_id}
│ │ LATTICE_NODELIST={comma-separated node list}
│ │ LATTICE_NNODES={node_count}
│ │ LATTICE_NPROCS={world_size}
│ │ # CXI env (if Slingshot):
│ │ FI_CXI_DEFAULT_VNI={vni}
│ │ FI_CXI_AUTH_KEY={key}
│ └─ Monitor all rank processes, report exit status
│
├─ PMI-2 server handles:
│ ├─ fullinit → return rank, size, appnum, debug flag
│ ├─ kvsput → store in local HashMap
│ ├─ kvsget → lookup local, or merged (post-fence)
│ ├─ kvsfence → collect local, trigger cross-node exchange, block until complete
│ ├─ finalize → mark rank done
│ └─ abort → signal all local ranks, notify head agent
│
└─ Cleanup on launch completion
├─ Remove Unix socket
├─ Report per-rank exit codes to API server
└─ Clean up CXI credentials
Environment Variables
Lattice sets these environment variables for MPI processes:
| Variable | Value | Purpose |
|---|---|---|
| `PMI_FD` | fd number | PMI-2 socket (inherited) |
| `PMI_RANK` | global rank | MPI rank |
| `PMI_SIZE` | world size | MPI world size |
| `PMI_SPAWNED` | 0 | Not dynamically spawned |
| `LATTICE_LAUNCH_ID` | UUID | Launch identifier |
| `LATTICE_ALLOC_ID` | UUID | Allocation identifier |
| `LATTICE_NODELIST` | comma-separated | All nodes in this launch |
| `LATTICE_NNODES` | integer | Node count |
| `LATTICE_NPROCS` | integer | Total rank count |
| `LATTICE_LOCAL_RANK` | 0..tasks_per_node-1 | Node-local rank |
| `LATTICE_LOCAL_SIZE` | tasks_per_node | Ranks on this node |
| `FI_CXI_DEFAULT_VNI` | VNI number | Slingshot VNI (if applicable) |
| `FI_CXI_AUTH_KEY` | hex string | CXI auth key (if applicable) |
| `FI_PROVIDER` | cxi or verbs | libfabric provider hint |
For Slurm compatibility (`compat.set_slurm_env=true`), Lattice also sets `SLURM_PROCID`, `SLURM_NPROCS`, `SLURM_LOCALID`, and `SLURM_NODELIST`.
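Building the per-rank environment from the table is mechanical; the sketch below shows a subset of the variables plus the Slurm-compat mirroring. Illustrative only; the function name and parameter set are assumptions.

```rust
use std::collections::HashMap;

/// Build a subset of the per-rank environment from the table above.
/// `slurm_compat` mirrors values under Slurm names
/// (compat.set_slurm_env=true). (Sketch; not the agent's real code.)
fn rank_env(rank: u32, world: u32, local: u32, slurm_compat: bool) -> HashMap<String, String> {
    let mut env = HashMap::from([
        ("PMI_RANK".to_string(), rank.to_string()),
        ("PMI_SIZE".to_string(), world.to_string()),
        ("PMI_SPAWNED".to_string(), "0".to_string()),
        ("LATTICE_LOCAL_RANK".to_string(), local.to_string()),
    ]);
    if slurm_compat {
        // Same values, Slurm names, so legacy scripts keep working.
        env.insert("SLURM_PROCID".to_string(), rank.to_string());
        env.insert("SLURM_NPROCS".to_string(), world.to_string());
    }
    env
}

fn main() {
    let env = rank_env(5, 256, 1, true);
    assert_eq!(env["PMI_RANK"], "5");
    assert_eq!(env["PMI_SIZE"], "256");
    assert_eq!(env["SLURM_PROCID"], env["PMI_RANK"]);
    let plain = rank_env(5, 256, 1, false);
    assert!(!plain.contains_key("SLURM_PROCID"));
}
```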
CXI Credential Management (Slingshot)
On Slingshot systems, MPI communication requires CXI (Cassini eXtended Interface) credentials tied to the allocation’s VNI. Without valid credentials, libfabric’s CXI provider refuses to open endpoints.
Credential Lifecycle
1. Allocation scheduled → network domain assigned → VNI allocated
2. LaunchTasks RPC → API server requests CXI credentials from fabric manager
- Input: VNI, allocation ID, node list
- Output: auth_key, svc_id (bound to VNI + node set)
3. Credentials included in LaunchProcessesRequest to each node agent
4. Node agent sets FI_CXI_DEFAULT_VNI and FI_CXI_AUTH_KEY for spawned ranks
5. On launch completion → API server revokes CXI credentials
Fabric Manager Integration
The Slingshot fabric manager provides a REST API for credential management:
| Operation | Endpoint | When |
|---|---|---|
| Create CXI service | POST /fabric/cxi/services | Launch start |
| Get auth key | GET /fabric/cxi/services/{id}/auth | Launch start |
| Revoke CXI service | DELETE /fabric/cxi/services/{id} | Launch end |
This is a new integration point, similar to the existing VAST API integration for storage.
Optional: OpenPMIx Sidecar (Feature-Flagged)
For workloads requiring full PMIx v4/v5 support (dynamic process spawning, PMIx tools API, event notification, PMIx groups), Lattice can run an OpenPMIx server as a managed sidecar process.
When to Use PMIx Mode
| Scenario | PMI-2 (default) | PMIx (optional) |
|---|---|---|
| Standard MPI (init, communication, finalize) | Yes | Yes |
| Multi-application launch (MPMD) | Limited | Yes |
| Dynamic process spawning (`MPI_Comm_spawn`) | No | Yes |
| PMIx tools API (debugger attach) | No | Yes |
| PMIx event notification | No | Yes |
| OpenMPI with PMIx-only features | No | Yes |
Architecture
Node Agent
│
├─ PmiMode::PMIX requested in LaunchProcessesRequest
│
├─ Spawns OpenPMIx server (pmix_server binary)
│ ├─ Configured via tmpdir/pmix-{launch_id}/
│ ├─ Node agent implements the PMIx "host" callback interface
│ │ via a small C shim library (libpmix-lattice-host.so)
│ │ that calls back to the node agent via Unix socket
│ ├─ Cross-node exchange: host callbacks route to node agent gRPC
│ └─ pmix_server provides Unix rendezvous socket for ranks
│
├─ Spawns ranks with:
│ PMIX_SERVER_URI={rendezvous_uri}
│ PMIX_NAMESPACE={launch_id}
│ PMIX_RANK={rank}
│ (instead of PMI_FD/PMI_RANK/PMI_SIZE)
│
└─ On completion: stops pmix_server, cleans up
Host Callback Shim
The OpenPMIx server requires the host (resource manager) to provide certain callbacks for cross-node operations. These are implemented via a small C shared library (libpmix-lattice-host.so) that:
- Is loaded by `pmix_server` at startup via `--host-lib` or `LD_PRELOAD`
- Implements: `pmix_server_fencenb_fn`, `pmix_server_dmodex_fn`, `pmix_server_spawn_fn`
pmix_server_fencenb_fn,pmix_server_dmodex_fn,pmix_server_spawn_fn - Each callback sends a request over a Unix socket to the node agent
- Node agent handles cross-node coordination via gRPC (same as PMI-2 fence)
This keeps the C code minimal (~200 lines) while leveraging the full OpenPMIx implementation.
Build and Deployment
# Cargo.toml (lattice-node-agent)
[features]
pmix = [] # enables PMIx sidecar support
When the `pmix` feature is enabled:
- The `pmix_server` binary must be installed on compute nodes (packaged separately or via uenv)
- `libpmix-lattice-host.so` is built from `infra/pmix-host/` and installed alongside the node agent
- The node agent detects `pmix_server` availability at startup and reports it as a node capability

When disabled: `PmiMode::PMIX` requests return an error with a clear message.
Integration with Existing Runtimes
uenv Runtime
PMI-2 socket and environment variables are available inside the mount namespace with no special handling (mount namespace does not isolate Unix sockets in the parent namespace).
Sarus Runtime
The PMI-2 Unix socket must be bind-mounted into the container:
sarus run --mount=type=bind,source=/tmp/lattice-pmi-{launch_id}.sock,destination=/tmp/lattice-pmi.sock ...
The --mpi flag in Sarus already handles MPI wire-up for Slurm; for Lattice, we configure Sarus to use the Lattice-provided PMI socket instead. This requires the Sarus MPI hook to be configured for PMI-2 mode rather than Slurm PMI mode.
DMTCP (Checkpoint/Restart)
DMTCP wraps the MPI process. The PMI-2 socket is outside the DMTCP checkpoint boundary. On restart, the node agent creates a new PMI-2 server and the restarted ranks re-initialize PMI. DMTCP’s MPI plugin handles reconnecting MPI communicators.
Failure Handling
Rank Failure
1. Rank exits with non-zero code (or is killed by signal)
2. Local node agent detects via process monitor
3. Node agent sends RankFailed notification to head agent
4. Head agent:
a. If allocation requeue policy = "on_any_failure": abort all ranks, requeue allocation
b. If MPI_ERRORS_RETURN semantics: notify remaining ranks via PMI-2 abort
c. Default: abort all ranks, report failure to API server
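The head agent's decision in step 4 is a three-way policy dispatch. A minimal sketch of that dispatch, assuming the policy is known as two booleans; variant and function names are hypothetical:

```rust
/// Head-agent action on a failed rank, following the three cases above.
/// (Sketch; variant names are assumptions.)
#[derive(Debug, PartialEq)]
enum FailureAction {
    AbortAndRequeue, // requeue policy "on_any_failure": abort all, requeue
    NotifyRanks,     // MPI_ERRORS_RETURN semantics: notify via PMI-2 abort
    AbortAndReport,  // default: abort all ranks, report to API server
}

fn on_rank_failure(requeue_on_any_failure: bool, errors_return: bool) -> FailureAction {
    if requeue_on_any_failure {
        FailureAction::AbortAndRequeue
    } else if errors_return {
        FailureAction::NotifyRanks
    } else {
        FailureAction::AbortAndReport
    }
}

fn main() {
    assert_eq!(on_rank_failure(true, false), FailureAction::AbortAndRequeue);
    assert_eq!(on_rank_failure(false, true), FailureAction::NotifyRanks);
    assert_eq!(on_rank_failure(false, false), FailureAction::AbortAndReport);
}
```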
Node Agent Failure
1. Node agent crashes or becomes unreachable
2. Head agent detects via gRPC timeout during fence (or heartbeat miss)
3. Head agent aborts the launch on all surviving nodes
4. API server handles allocation state transition (same as node failure)
Fence Timeout
1. kvsfence does not complete within timeout (default: 60s, configurable)
2. Head agent declares fence failure
3. All ranks aborted with PMI-2 abort message
4. Launch reported as failed with "PMI fence timeout" reason
User-Facing Changes
lattice launch (CLI)
# MPI launch (replaces srun -n 256 ./app)
lattice launch --alloc=123 -n 256 ./my_mpi_app
# With tasks-per-node control
lattice launch --alloc=123 --tasks-per-node=4 ./my_mpi_app
# Force PMIx mode (requires pmix feature on nodes)
lattice launch --alloc=123 -n 256 --pmi=pmix ./my_mpi_app
# Launch with environment variables
lattice launch --alloc=123 -n 256 --env OMP_NUM_THREADS=8 ./my_mpi_app
Submission Script
#!/bin/bash
#LATTICE nodes=64
#LATTICE walltime=2:00:00
#LATTICE vcluster=hpc-batch
#LATTICE network_domain=my-training-run
# No SSH, no mpirun, no srun needed.
# The entrypoint IS the MPI program; Lattice handles process launch and PMI.
lattice launch -n 256 --tasks-per-node=4 ./my_mpi_training
# Or for Slurm compatibility:
# srun -n 256 ./my_mpi_training (compat layer translates to lattice launch)
Direct mpirun (Escape Hatch)
Users who want to call mpirun directly can still do so. Lattice provides a Hydra-compatible launcher script (lattice-mpi-launcher) that uses the node agent gRPC instead of SSH:
# mpirun detects the Lattice launcher via:
# HYDRA_LAUNCHER=manual
# HYDRA_LAUNCHER_EXEC=lattice-mpi-launcher
# These are set automatically by the node agent when an allocation starts.
# So this "just works" inside an allocation:
mpirun -np 256 ./my_mpi_app
The lattice-mpi-launcher script:
- Receives the launch command from Hydra/ORTE
- Calls the local node agent's `LaunchProcesses` gRPC to spawn on the target node
- Returns the PID to the MPI launcher
This provides backward compatibility for scripts that use mpirun directly while still avoiding SSH.
Performance Considerations
| Operation | Latency | Bottleneck | Mitigation |
|---|---|---|---|
| Launch fan-out | ~100ms for 256 nodes | gRPC round-trips | Parallel fan-out from API server |
| PMI-2 fence (star) | ~10ms for <128 nodes | Head agent merge | Acceptable for typical HPC |
| PMI-2 fence (tree) | ~20ms for 1000+ nodes | Tree depth (log N) | Only needed at extreme scale |
| CXI credential provisioning | ~50ms | Fabric manager API | Cached for allocation lifetime |
MPI_Init typically takes 100-500ms. The Lattice PMI overhead is well within this budget.
Cross-References
- network-domains.md – VNI allocation, L3 reachability
- security.md – CXI credentials, network isolation
- slurm-migration.md – srun replacement
- node-lifecycle.md – Node agent process management
- failure-modes.md – Rank and node failure handling
- checkpoint-broker.md – DMTCP + MPI checkpoint interaction
- sessions.md – Interactive allocations with MPI launch
- ADR-010: Native PMI-2 with optional PMIx sidecar
Data Plane & Storage Architecture
Tiered Storage Model
┌─ Hot Tier (VAST-like) ─────────────────────────────────┐
│ Protocol: NFS + S3 (native multiprotocol) │
│ Use: active datasets, home dirs, checkpoints, scratch │
│ Performance: NVMe-speed, low-latency │
│ Scheduler integration: QoS per export, pre-staging │
│ Sensitive: encrypted pool, access-logged │
└────────────────────┬───────────────────────────────────┘
│ policy-driven data mover
┌────────────────────┴───────────────────────────────────┐
│ Warm Tier (capacity storage) │
│ Protocol: S3-compatible │
│ Use: completed outputs, older datasets, cold models │
│ Cost: significantly lower than hot │
└────────────────────┬───────────────────────────────────┘
│ archive policy
┌────────────────────┴───────────────────────────────────┐
│ Cold Tier (tape/object archive) │
│ Protocol: S3-compatible (Glacier-style retrieval) │
│ Use: regulatory retention, long-term archival │
│ Sensitive: 7+ year retention, immutable │
└────────────────────────────────────────────────────────┘
Protocol Standardization
Only two protocols for user-facing access:
- NFS: POSIX workloads, home directories, uenv images, legacy codes that expect a filesystem
- S3: Object access for checkpoints, datasets, model artifacts, any cloud-native tooling
No Lustre/GPFS client required. VAST delivers parallel-file-system performance via NFS.
Job Data Requirements
Explicit Declaration
Users who know their data needs can declare them:
data:
mounts:
- source: "s3://training-data/imagenet"
target: "/data/input"
tier_hint: "hot"
access: "read-only"
- source: "nfs://home/{user}"
target: "/home/{user}"
access: "read-write"
output: "s3://{tenant}/{project}/{allocation_id}/"
scratch_per_node: "500GB"
Sane Defaults (for users who don’t specify)
Every allocation automatically gets:
- Home directory: mounted via NFS from hot tier (`/home/{user}`)
- Node-local scratch: NVMe-backed ephemeral storage (`/scratch/local/`) if NVMe is available; tmpfs or network scratch otherwise
- Output directory: `s3://{tenant}/{project}/{allocation_id}/` (auto-created)
- Checkpoint directory: `s3://{tenant}/{project}/{allocation_id}/checkpoints/` (if checkpoint != none)
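A minimal sketch of how these defaults could be synthesized. The path templates come from the list above; the `default_mounts` helper and its field names are hypothetical, not part of the Lattice codebase.

```python
# Illustrative sketch: fill in the default data mounts for an allocation
# that declares no data section.

def default_mounts(tenant, project, alloc_id, user,
                   checkpoint="none", has_nvme=True) -> dict:
    base = f"s3://{tenant}/{project}/{alloc_id}/"
    mounts = {
        "home": f"/home/{user}",                                # NFS, hot tier
        "scratch": "/scratch/local/" if has_nvme else "/scratch/tmp/",
        "output": base,                                         # auto-created
    }
    if checkpoint != "none":
        mounts["checkpoint"] = base + "checkpoints/"
    return mounts
```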
Data Staging (Scheduler-Integrated)
The scheduler integrates with the storage API for intelligent data movement:
- Pre-staging during queue wait: When a job is queued and its data is on the warm/cold tier, the data mover begins warming it to the hot tier. Queue wait time becomes useful instead of idle.
- QoS allocation at job start: The scheduler calls the VAST API to set bandwidth guarantees for the job's NFS export. Prevents I/O-intensive jobs from starving latency-sensitive services.
- Checkpoint coordination: The checkpoint broker pre-allocates storage bandwidth windows to avoid I/O storms when many jobs checkpoint simultaneously.
VAST API Integration Points
| Operation | VAST API | When |
|---|---|---|
| Create export with QoS | POST /exports + QoS policy | Job starts |
| Query data locality | GET /catalog?path=… | Scheduling (data_readiness score) |
| Create snapshot | POST /snapshots | Job start (reproducibility) or checkpoint |
| Pre-stage from warm | POST /dataspace/prefetch | Job queued, data not on hot tier |
| Set bandwidth floor | PATCH /exports/{id}/qos | Job starts |
| Audit log query | GET /audit/logs?path=… | Compliance reporting |
Sensitive Storage Policy
vcluster: sensitive-secure
storage_policy:
encryption: aes-256-at-rest
pool: dedicated # separate VAST view/tenant
wipe_on_release: true # scrub after allocation ends
access_logging: full # every read/write logged
data_sovereignty: "ch" # data stays in Swiss jurisdiction
retention:
data: "as_specified_by_user"
audit_logs: "7_years"
tier_restriction: "hot_only" # no unencrypted copies on warm/cold
Log Storage
Allocation logs are persisted to S3 alongside output data. See observability.md for the log storage layout:
s3://{tenant}/{project}/{alloc_id}/logs/
├── stdout/{node_id}/{chunk_000..N}.log.zst
├── stderr/{node_id}/{chunk_000..N}.log.zst
└── metadata.json
Sensitive allocation logs are stored in the encrypted sensitive S3 pool with access logging enabled.
Node-Local Storage (Optional)
Nodes may have NVMe SSDs managed by the node agent. Local storage is not a hard requirement — nodes without NVMe operate with reduced performance but full functionality.
When NVMe is present:
- Scratch: ephemeral, wiped between allocations. For temp files, staging.
- Image cache: persistent across allocations. Caches uenv squashfs images and OCI layers.
  - LRU eviction policy
  - Cache hit avoids network pull from registry
  - Popular images stay warm automatically
When NVMe is absent:
- Scratch: falls back to tmpfs (RAM-backed) or a network-mounted scratch directory. Capacity is limited by available RAM or network storage quota.
- Image cache: no persistent local cache. Images are pulled from the registry on every allocation start (or served from a shared NFS cache if configured). Higher startup latency.
- Allocations requesting the `nvme_scratch` feature constraint will not be scheduled on these nodes.
The node agent detects local storage at startup and reports its availability as part of node capabilities (features: ["nvme_scratch"]).
Data Staging & Cache Lifecycle
Design Principle
Data staging is invisible to users. The scheduler pre-stages data during queue wait time, manages node-local caches with bounded eviction, and coordinates storage bandwidth to prevent I/O storms. Users declare data requirements; the system handles placement.
This document extends data-plane.md with operational details for staging, caching, and eviction.
Pre-Staging Pipeline
Trigger
When an allocation enters the `Pending` state and declares data mounts with `tier_hint: hot`:
- Scheduler queries the VAST API for data locality (`GET /catalog?path=...`)
- If the data is on the warm/cold tier: the scheduler issues a pre-stage request (`POST /dataspace/prefetch`)
- The allocation transitions to the `Staging` state (visible to the user via `lattice status`)
- When staging completes: the allocation is eligible for scheduling
Staging During Queue Wait
Pre-staging runs concurrently with queue waiting. If the allocation reaches the front of the scheduling queue before staging completes:
| Scenario | Action |
|---|---|
| Staging complete | Schedule immediately |
| Staging >80% complete | Schedule, accept brief I/O stall at start |
| Staging <80% complete | Hold in queue, f₅ (data_readiness) penalizes scheduling |
| Staging failed | Retry up to 3 times, then alert user and keep in queue |
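The table above reduces to a small decision function. `staging_decision` and its return labels are illustrative names; the 80% threshold and 3-retry limit are taken from the table.

```python
# Illustrative sketch of the scheduling decision when an allocation reaches
# the front of the queue before staging completes.

def staging_decision(state: str, progress: float, retries: int = 0) -> str:
    if state == "failed":
        return "retry" if retries < 3 else "alert_user_keep_queued"
    if progress >= 1.0:
        return "schedule_now"
    if progress > 0.80:
        return "schedule_accept_io_stall"   # brief I/O stall at start
    return "hold_in_queue"                  # f5 (data_readiness) penalizes scheduling
```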
Priority
Pre-stage requests are prioritized by:
- Estimated scheduling time (jobs closer to front of queue stage first)
- Data size (smaller datasets stage faster, unblock more jobs)
- Tenant fair share (tenants below their share get staging priority)
Bandwidth Coordination
The scheduler tracks aggregate staging bandwidth to avoid saturating the VAST system:
max_concurrent_staging_bandwidth = 0.3 × total_VAST_write_bandwidth
When the staging bandwidth limit is reached, additional staging requests are queued. This prevents staging from impacting running allocations’ I/O performance.
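The cap can be sketched as an admission controller. `StagingAdmission` is a hypothetical class; only the 0.3 factor comes from the formula above.

```python
# Illustrative sketch: admit a pre-stage request only while aggregate
# staging bandwidth stays under 30% of total VAST write bandwidth;
# requests beyond the cap are queued.

class StagingAdmission:
    def __init__(self, total_vast_write_bw_gbps: float):
        self.cap = 0.3 * total_vast_write_bw_gbps
        self.in_use = 0.0
        self.queued = []

    def request(self, bw_gbps: float) -> bool:
        """True if staging starts now; False if queued behind the cap."""
        if self.in_use + bw_gbps <= self.cap:
            self.in_use += bw_gbps
            return True
        self.queued.append(bw_gbps)
        return False

    def release(self, bw_gbps: float) -> None:
        self.in_use = max(0.0, self.in_use - bw_gbps)
```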
Node-Local Image Cache
Nodes with NVMe SSDs use a dedicated partition for image caching (uenv SquashFS and OCI layers). Local storage is optional — nodes without NVMe pull images directly from the registry on every allocation start, or use a shared NFS-based cache if configured. The scheduler accounts for this via the nvme_scratch feature: jobs that benefit from local caching can request it as a constraint.
Cache Layout
/var/cache/lattice/
├── uenv/ # SquashFS images
│ ├── prgenv-gnu_24.11_v1.squashfs
│ ├── pytorch_2.4_cuda12.squashfs
│ └── ...
├── oci/ # OCI container layers
│ ├── sha256:<hash>/
│ └── ...
└── metadata.json # Cache index: image → size, last_used, pin
Cache Parameters
| Parameter | Default | Description |
|---|---|---|
| `cache_partition_size` | 80% of NVMe (if present) | Reserved for image cache; ignored on nodes without NVMe |
| `cache_high_watermark` | 90% | Eviction starts when usage exceeds this |
| `cache_low_watermark` | 70% | Eviction stops when usage drops below this |
| `min_free_space` | 50 GB | Absolute minimum free space (overrides watermarks) |
Eviction Policy
LRU with pinning:
- When cache usage exceeds `cache_high_watermark`:
  - Evict least-recently-used images until usage drops below `cache_low_watermark`
- Never evict images marked as
stickyby admin (base OS images, common frameworks)
- Evict least-recently-used images until usage drops below
- Eviction order: LRU by last mount time, largest images first among equally-old entries
- If eviction cannot free enough space (all images pinned or sticky): alert raised, staging for new allocations pauses on this node
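The policy above can be modeled in a few lines. `eviction_order` and `evict_until` are illustrative names; in the real node agent the cache index lives in `metadata.json`.

```python
# Illustrative model of the eviction policy: LRU by last mount time,
# largest-first among equally old images, never evicting pinned or sticky
# entries.

def eviction_order(images: list) -> list:
    """Names of evictable images, in the order they would be evicted."""
    evictable = [i for i in images if not i.get("pinned") and not i.get("sticky")]
    # Oldest last_used first; among equal ages, largest size first.
    evictable.sort(key=lambda i: (i["last_used"], -i["size_gb"]))
    return [i["name"] for i in evictable]

def evict_until(images, usage_gb, low_watermark_gb):
    """Evict per eviction_order until usage drops below the low watermark."""
    freed, victims = 0.0, []
    by_name = {i["name"]: i for i in images}
    for name in eviction_order(images):
        if usage_gb - freed <= low_watermark_gb:
            break
        freed += by_name[name]["size_gb"]
        victims.append(name)
    return victims
```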
Cache-Full During Staging
If the node-local cache is full when a new allocation needs to pull an image:
- Check if eviction can free space → run eviction
- If eviction insufficient (all pinned): allocation’s prologue waits with backoff
- After 3 retries (5 minutes total): node marked as cache-full, scheduler avoids this node for allocations requiring uncached images
- Scheduler selects alternative nodes with cache space (or where the image is already cached)
Cache Warming
Administrators can pre-warm caches for anticipated workloads:
# Warm a uenv image on all nodes in a group
lattice cache warm --image=prgenv-gnu/24.11:v1 --group=3
# Warm on specific nodes
lattice cache warm --image=pytorch/2.4:cuda12 --nodes=x1000c0s0b0n0,x1000c0s0b0n1
Post-Reboot Cache Consistency
After a node reboot (nodes with NVMe only):
- Node agent reads `metadata.json` from the cache partition
- Validates each cached image (hash check against registry manifest)
- Images that fail validation are evicted
- Images that pass remain in cache (NVMe is persistent across reboots)
- Cache index rebuilt in ~seconds (metadata only, no full re-scan)
On nodes without NVMe, there is no persistent cache to recover — images are pulled fresh after reboot.
Allocation Data Lifecycle
Start (Prologue)
1. Node agent receives allocation assignment
2. Pull uenv image:
a. Check node-local cache → hit: mount directly
b. Cache miss: pull from registry → write to cache → mount
3. Mount data volumes:
a. NFS mounts (home, shared data): mount with VAST QoS policy
b. S3 mounts: FUSE or native S3 client
4. Create scratch directory: /scratch/local/{alloc_id}/ (NVMe) or /scratch/tmp/{alloc_id}/ (tmpfs/network)
5. Create output directory (S3): s3://{tenant}/{project}/{alloc_id}/
6. If checkpoint != none: create checkpoint directory
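Step 2's pull-through-cache logic can be sketched as follows. The cache and registry here are stand-in dicts for the node agent's internals, and `obtain_image` is a hypothetical name.

```python
# Illustrative sketch of the prologue image pull: check the node-local
# cache first, otherwise pull from the registry and write through to
# the cache before mounting.

def obtain_image(name: str, cache: dict, registry: dict) -> str:
    """Return the local path to mount, pulling through the cache on a miss."""
    if name in cache:
        return cache[name]                       # cache hit: mount directly
    if name not in registry:
        raise KeyError(f"image {name} not in registry")
    path = f"/var/cache/lattice/uenv/{name}.squashfs"
    cache[name] = path                           # write-through to cache
    return path
```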
During Execution
- NFS QoS maintained by VAST (bandwidth floor set at prologue)
- Scratch is node-local NVMe (if available) or tmpfs/network scratch
- Output is written to S3 (async, application-driven)
- Checkpoint broker coordinates checkpoint writes to avoid bandwidth storms
End (Epilogue)
1. Processes terminated (completed, failed, or killed)
2. Flush pending log chunks to S3
3. Unmount uenv image (stays in cache for future use)
4. Unmount NFS volumes
5. Clean scratch: rm -rf /scratch/local/{alloc_id}/
6. Release VAST QoS policy
7. Sensitive: trigger secure wipe sequence (cross-ref: node-lifecycle.md)
Data Retention
| Data Type | Location | Retention |
|---|---|---|
| uenv images | Node-local cache | Until evicted (LRU) |
| Logs | S3 | Configurable (default: 30 days) |
| Checkpoints | S3 | Configurable (default: 7 days after completion) |
| Output | S3 | User-managed (not auto-deleted) |
| Scratch | NVMe or tmpfs | Deleted at allocation end |
| Debug traces | S3 | Short (default: 7 days) |
| Sensitive audit logs | Cold tier (S3) | 7 years |
Storage Tier Migration
Data automatically migrates between tiers based on access patterns:
Hot (VAST NFS+S3) → Warm (capacity S3) → Cold (archive S3)
↑ pre-stage ↑ restore ↑ retrieve
| Trigger | Direction | Mechanism |
|---|---|---|
| Allocation queued with `tier_hint: hot` | Warm → Hot | Scheduler-initiated pre-stage |
| Data untouched for 30 days | Hot → Warm | VAST policy-driven (automatic) |
| Data untouched for 90 days | Warm → Cold | Storage policy (automatic) |
| User request or allocation references cold data | Cold → Warm/Hot | Explicit retrieval (may take hours) |
Sensitive exception: Sensitive data on hot tier stays on hot tier (no automatic migration). tier_restriction: hot_only prevents copies on shared warm/cold tiers.
Cross-References
- data-plane.md — Storage architecture, VAST API integration, protocol standardization
- scheduling-algorithm.md — f₅ data_readiness in cost function
- node-lifecycle.md — Sensitive node wipe sequence
- failure-modes.md — VAST unavailability handling
- sensitive-workloads.md — Sensitive storage policy
Federation Architecture
Design Principle
Federation is opt-in and sovereignty-first. The system is fully functional without it. When enabled, each site retains full control over its resources. The federation broker suggests, the local scheduler decides.
Feature Gate
Federation is compile-time optional via Rust feature flag:
# Cargo.toml (lattice-api)
[features]
default = []
federation = ["lattice-common/federation"]
When federation feature is disabled:
- No Sovra dependency
- No federation broker binary
- No cross-site API endpoints
- System operates as a standalone site
Trust Model: Sovra Integration
Sovra provides federated sovereign key management. Each site runs its own Sovra instance with its own root key.
Site A Sovra Instance Site B Sovra Instance
├── Site A Root Key (sovereign) ├── Site B Root Key (sovereign)
├── Workspace: "hpc-general" ├── Workspace: "hpc-general"
│ (shared federation key) │ (federated with Site A)
├── Workspace: "sensitive-ch" └── Policy: Site B OPA rules
│ (hospital CRK, delegated)
└── Policy: Site A OPA rules
Sovra Federation Protocol (peer-to-peer, no central authority)
Key Management Principles
- Site root keys never leave the site. All cross-site authentication uses derived keys from shared workspaces.
- Federation is revocable. Revoking a shared workspace invalidates all cross-site tokens. Instant defederation.
- Sensitive keys are tenant-controlled. The hospital (data owner) holds the Customer Root Key. The operating site holds a delegated key. If the relationship ends, the hospital retains access.
- Audit logs are cryptographically signed. Each site signs its audit entries with its own key. Cross-site audit trails are verifiable by any party in the trust chain.
Federation Components
Federation Broker
A Go service that runs alongside the scheduler (when federation feature is enabled).
Responsibilities:
- Advertises site capabilities to federated peers (available capacity, GPU types, energy prices, data locality)
- Receives federated allocation requests from peer sites
- Signs outbound requests with Sovra tokens
- Verifies inbound requests against Sovra trust chain + OPA policy
- Routes accepted requests into the local scheduling plane
Communication: gRPC over mTLS, with Sovra-signed metadata in request headers.
Federation Catalog
A read-mostly, eventually consistent shared catalog across federated sites:
| Content | Update Frequency | Consistency |
|---|---|---|
| Site capabilities (GPU types, node counts) | Hourly | Eventual |
| uenv image registry (cross-site name resolution) | On publish | Eventual |
| Dataset catalog (where data physically resides) | On change | Eventual |
| Tenant identity mapping (OIDC trust) | On federation setup | Strong (Sovra) |
| Energy prices per site | Every 15 minutes | Eventual |
Catalog Consistency and Staleness
The federation catalog is eventually consistent. Entries may be stale, missing, or outdated. The system must handle this gracefully:
Staleness bounds:
| Entry Type | Max Staleness | Effect of Stale Data |
|---|---|---|
| Site capabilities | 2 hours (hourly sync + margin) | May route job to site that no longer has capacity → remote rejection, retry locally |
| Energy prices | 30 minutes | May choose suboptimal site for energy cost → acceptable, not a correctness issue |
| Dataset catalog | Minutes (event-driven) | May not know data was moved → routing decision based on old location |
| uenv registry | Minutes (event-driven) | May reference image version not yet available at remote → prologue retry |
Handling completely stale entries:
If a peer site has not reported a catalog update within 2× the expected interval (e.g., no capability update in 2 hours):
- Federation broker marks the peer as `stale` in its local view
- Routing decisions deprioritize stale peers (not excluded, just scored lower)
- Alert raised: `lattice_federation_peer_stale{peer="site-b"}`
- If stale for more than 24 hours: peer marked `unreachable`, excluded from routing
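The staleness rules reduce to a single classification, sketched below with hypothetical names; the 2× interval and 24-hour cutoff come from the text.

```python
# Illustrative sketch: classify a federation peer by the age of its last
# catalog update.

def peer_state(hours_since_update: float, expected_interval_h: float = 1.0) -> str:
    if hours_since_update > 24.0:
        return "unreachable"   # excluded from routing entirely
    if hours_since_update > 2.0 * expected_interval_h:
        return "stale"         # deprioritized, not excluded
    return "healthy"
```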
Handling peer unavailability:
If a federated request fails (peer broker unreachable):
- First failure: retry with exponential backoff (1s, 2s, 4s, max 30s)
- After 3 retries: return failure to the user with explanation
- If `--site=auto`: fall back to local scheduling (no remote attempt)
- Peer marked as `degraded` in catalog; future requests deprioritize it
- Peer returns to `healthy` on next successful heartbeat/catalog sync
Cross-site uenv resolution:
uenv images are resolved via the federation catalog:
- User submits `--uenv=prgenv-gnu/24.11:v1` targeting Site B
- Federation broker checks whether Site B's catalog includes this image
- If present: proceed (Site B has the image or can pull it)
- If absent: warn user and proceed (Site B may pull from a shared registry)
- If pull fails at Site B: prologue failure, allocation retried or failed per policy
Job Routing Logic
The federation broker’s routing decision is advisory, not mandatory:
Input: Allocation request from remote site (or local user targeting remote)
Output: Recommendation (run locally, run at site X, reject)
Factors:
1. Data gravity: where does the input data physically reside?
→ Strong bias toward running where data is
2. Compute availability: does the target site have capacity?
→ Check advertised capacity (may be stale)
3. Energy cost: which site has cheaper power right now?
→ Time-varying electricity prices from catalog
4. Tenant authorization: is this user allowed at the target site?
→ OPA policy check via Sovra-delegated credentials
5. Data sovereignty: can the data legally transit to the target site?
→ Sensitive data: check jurisdiction constraints
Decision: route to site with best composite score, or reject if no site qualifies
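As an illustrative sketch, routing might filter on the hard constraints (authorization, sovereignty) and rank the remaining sites by a composite score. The weights below are invented; only the factors come from the list above.

```python
# Illustrative sketch of the advisory routing decision: hard constraints
# filter, soft factors score.

def route(sites: list):
    """Return the best site name, or None (reject) if no site qualifies."""
    def eligible(s):
        return s["authorized"] and s["sovereignty_ok"]

    def score(s):
        return (3.0 * s["data_local"]        # strong bias toward data gravity
                + 1.0 * s["capacity_frac"]   # advertised capacity (may be stale)
                - 0.5 * s["energy_price"])   # cheaper power scores higher

    candidates = [s for s in sites if eligible(s)]
    if not candidates:
        return None
    return max(candidates, key=score)["name"]
```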
Federated Allocation Flow
1. User at Site A submits: lattice submit --site=B train.sh
2. Site A lattice-api receives request, passes to federation broker
3. Federation broker:
a. Signs request with Sovra token (Site A workspace key)
b. Resolves target: Site B (explicit) or best-fit (if --site=auto)
c. Forwards to Site B's federation broker
4. Site B federation broker:
a. Verifies Sovra token (Site A is trusted peer)
b. Checks OPA policy (user authorized, resources available)
c. Injects allocation into Site B's scheduling plane
5. Site B local quorum manages allocation entirely
6. Status/logs available to user at Site A via federation catalog query
7. On completion: Site B reports results, Site A's user notified
Cross-Site Data Access
When a federated job runs at a remote site but needs data from the home site:
- Small data (<1 GB): Fetched on demand via S3 over WAN
- Medium data (1 GB - 1 TB): Pre-staged during queue wait via VAST DataSpace sync
- Large data (>1 TB): Strong recommendation to run job at data’s home site
- Sensitive data: Never transferred. Job must run at data’s home site. No exceptions.
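The size thresholds reduce to a small rule, sketched below with a hypothetical `data_transfer_plan` function; the 1 GB / 1 TB cut-offs and the sensitive-data rule come from the list above.

```python
# Illustrative sketch: pick a cross-site data-access strategy. Sensitive
# data short-circuits everything -- it is never transferred.

def data_transfer_plan(size_gb: float, sensitive: bool) -> str:
    if sensitive:
        return "run_at_home_site"            # never transferred, no exceptions
    if size_gb < 1:
        return "fetch_on_demand_s3_wan"
    if size_gb <= 1000:
        return "prestage_vast_dataspace"     # during queue wait
    return "recommend_run_at_home_site"
```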
Operational Considerations
Adding a Federation Peer
- Exchange Sovra workspace keys (out-of-band, verified by site admins)
- Configure federation broker with peer endpoint + workspace ID
- Define OPA policies for cross-site access
- Test with non-production allocations
- Enable in production
Removing a Federation Peer
- Revoke Sovra shared workspace
- All in-flight federated allocations continue to completion (or are cancelled by policy)
- Remove peer from federation broker config
- Immediate: no new federated requests accepted
Federation Requests During Leader Election
When the local Raft quorum is undergoing a leader election (typically 1-3 seconds):
- Inbound federated requests from peer sites receive a `503 Service Unavailable` with a `Retry-After: 5` header
- The federation broker does not queue inbound requests during election; the remote site's retry logic handles resubmission
- Outbound federated requests (local user targeting a remote site) are unaffected — routing and signing happen in the federation broker, not the quorum
- If the election takes longer than 10 seconds (unusual): the federation broker marks the local site as `degraded` in catalog updates to peers
Cross-References
- system-architecture.md — Control plane architecture
- security.md — Sovra trust model, mTLS
- sensitive-workloads.md — Sensitive data sovereignty
- failure-modes.md — Quorum leader loss recovery
Interactive Sessions
Design Principle
Interactive sessions are allocations with a terminal. They reuse the standard allocation lifecycle with additional terminal protocol handling. Sessions are not a separate concept — they are bounded or unbounded allocations with an attached PTY as the primary interaction mode.
Global session tracking (F20): Sessions are now tracked in GlobalState via Raft-committed CreateSession/DeleteSession commands. This enables:
- Global session limit enforcement: sensitive allocations limited to one concurrent session (INV-C2)
- Session survival across API server restarts
- Ownership verification at creation time (allocation must be Running, user must own it)
Session Creation
A session is created via POST /v1/sessions (or lattice session):
session:
tenant: "ml-team"
vcluster: "interactive" # typically the interactive FIFO vCluster
resources:
nodes: 1 # default: 1 node
constraints:
gpu_type: "GH200"
lifecycle:
type: "bounded"
walltime: "4h" # interactive sessions have walltime
environment:
uenv: "prgenv-gnu/24.11:v1"
Internally, the API server creates a standard Allocation with:
- `lifecycle.type = Bounded { walltime }`
- A flag indicating the terminal should auto-attach on scheduling
- Allocation state follows the normal lifecycle (Pending → Running → Completed)
Terminal Protocol
Connection Setup
1. Client connects: POST /v1/sessions → returns session_id + allocation_id
2. Allocation is scheduled (may wait in queue)
3. Once Running, client opens terminal: GET /v1/sessions/{id}/terminal (WebSocket upgrade)
4. WebSocket connection established to lattice-api
5. lattice-api opens gRPC bidirectional stream to the node agent
6. Node agent spawns PTY + user shell in allocation's mount/network namespace
Wire Protocol
The gRPC bidirectional stream carries framed messages:
Client → Server:
| Message Type | Content |
|---|---|
| `StdinData` | Raw bytes from client terminal |
| `Resize` | Terminal dimensions (rows, cols) |
| `Signal` | SIGINT, SIGTSTP, SIGHUP, SIGQUIT |
| `Keepalive` | Heartbeat (every 30s) |
Server → Client:
| Message Type | Content |
|---|---|
| `StdoutData` | Raw bytes from PTY (stdout + stderr merged) |
| `ExitCode` | Process exit code (terminal message) |
| `Error` | Error description (e.g., "allocation not running") |
Initial Terminal Size
The client sends a Resize message as the first message after connection. The node agent configures the PTY with these dimensions. If no Resize is sent, defaults to 80x24.
Signal Handling
| Signal | Client Action | Server Action |
|---|---|---|
| SIGINT (Ctrl+C) | Send Signal(SIGINT) | Node agent sends SIGINT to foreground process group |
| SIGTSTP (Ctrl+Z) | Send Signal(SIGTSTP) | Node agent sends SIGTSTP to foreground process group |
| SIGHUP | Connection close | Node agent sends SIGHUP to session process group |
| SIGQUIT (Ctrl+\) | Send Signal(SIGQUIT) | Node agent sends SIGQUIT to foreground process group |
| SIGWINCH | Send Resize(rows, cols) | Node agent calls ioctl(TIOCSWINSZ) on PTY |
Session Lifecycle
Active Session
While the terminal is connected:
- PTY output streams to client in real-time
- Client input streams to PTY stdin
- Keepalive every 30s to detect stale connections
- Session remains active as long as the WebSocket is open AND the shell process is alive
Disconnect and Reconnect
Client disconnect (network drop, laptop close):
- WebSocket closes (or keepalive timeout: 90s)
- Node agent sends SIGHUP to the session’s process group
- Default behavior: processes receive SIGHUP and exit
- If the user's shell ignores SIGHUP (e.g., `tmux`, `screen`):
  - Processes continue running in the background
  - User can reconnect: `lattice attach <alloc_id>`
  - Allocation walltime continues counting
Deliberate detach:
Users who want background sessions should use tmux or screen inside the session. Lattice does not implement a detach/reattach protocol — it delegates to proven tools.
Session Timeout
| Timeout | Default | Description |
|---|---|---|
| `idle_timeout` | 30 minutes | If no stdin for this duration, warn user. No auto-kill. |
| `walltime` | User-specified | Hard deadline. SIGTERM → SIGKILL → release. |
| `keepalive_timeout` | 90s | WebSocket keepalive. Missed → treat as disconnect. |
Idle warning: After idle_timeout, the terminal displays:
[lattice] Warning: session idle for 30 minutes. Walltime remaining: 3h 12m.
No automatic termination on idle — the user may be running a long computation.
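The three timers above act together as sketched below (hypothetical `session_events`; defaults from the table). Only walltime terminates the session; idle merely warns, and a missed keepalive follows the disconnect path.

```python
# Illustrative sketch of session timeout handling.

def session_events(idle_s, since_keepalive_s, elapsed_s, walltime_s,
                   idle_timeout_s=1800, keepalive_timeout_s=90) -> list:
    events = []
    if elapsed_s >= walltime_s:
        return ["sigterm_then_sigkill"]        # hard deadline wins
    if since_keepalive_s > keepalive_timeout_s:
        events.append("treat_as_disconnect")   # SIGHUP to process group
    if idle_s >= idle_timeout_s:
        events.append("warn_idle")             # warning only, no auto-kill
    return events
```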
Cleanup
When the session’s allocation reaches a terminal state (Completed, Failed, Cancelled):
- SIGTERM to all remaining processes
- Grace period (30s)
- SIGKILL
- Unmount uenv, release scratch, release nodes
- Session terminal sends
ExitCodeand closes WebSocket
Preemption During Active Session
When a session’s allocation is preempted while a terminal is connected:
- The checkpoint sequence begins (if `checkpoint != None`)
- The terminal remains connected during checkpointing; the user sees normal output
- When the checkpoint completes and the allocation transitions to `Suspended`:
  - Server sends a terminal message: `[lattice] Allocation preempted. Session suspended. Use 'lattice attach <id>' to reconnect after rescheduling.`
  - Server sends `ExitCode(-1)` and closes the stream
- When the allocation is rescheduled and resumes:
  - The user must manually reconnect: `lattice attach <id>`
  - The session starts a fresh shell (PTY state is not checkpointed)
  - Application state is restored from checkpoint (if the application supports it)
Multi-Node Sessions
For sessions requesting multiple nodes:
- The terminal connects to the first node (node 0)
- The user’s shell runs on node 0
- Other nodes are accessible via `ssh` (intra-allocation, uses the network domain)
- Or via `lattice attach <alloc_id> --node=<node_id>` (opens a second terminal to a specific node)
Concurrent Attach
| Scenario | Allowed | Notes |
|---|---|---|
| Same user, multiple terminals | Yes | Multiple attach sessions to the same allocation |
| Different users (non-sensitive) | No | Only the allocation owner can attach |
| Different users (sensitive) | No | Only the claiming user; one session at a time |
| Same user, different nodes | Yes | Each attach targets a specific node |
Slurm Compatibility
| Slurm | Lattice | Notes |
|---|---|---|
| `salloc -N2` | `lattice session --nodes=2` | Creates session allocation |
| `srun --jobid=123 --pty bash` | `lattice attach 123` | Attach to existing allocation |
| `salloc` then `srun` | `lattice session` then `lattice launch` | Session + task within allocation |
CLI Usage
# Create a session (waits for scheduling, then opens terminal)
lattice session --nodes=1 --walltime=4h --uenv=prgenv-gnu/24.11:v1
# Create with specific constraints
lattice session --nodes=2 --constraint=gpu_type:GH200 --walltime=8h
# Create in a specific vCluster
lattice session --vcluster=interactive --walltime=2h
# Attach to an existing session's allocation
lattice attach 12345
# Attach to a specific node
lattice attach 12345 --node=x1000c0s0b0n3
# Attach with a specific command (not the default shell)
lattice attach 12345 --command="nvidia-smi -l 1"
Cross-References
- observability.md — Attach architecture, authorization model, rate limiting
- api-design.md — Session API endpoints
- sensitive-workloads.md — Sensitive session constraints (one session, recording, signed uenv)
- cli-design.md — Full CLI command reference
Sensitive & Regulated Workload Design
Threat Model
Sensitive workloads on shared HPC infrastructure face regulatory requirements (Swiss FADP, EU GDPR, potentially HIPAA for international collaboration). The design must be defensible to an auditor.
What we must prove:
- Sensitive data was only accessible to authorized users during processing
- No other tenant’s workload ran on the same physical nodes simultaneously
- Data was encrypted at rest and in transit
- All access was logged with user identity and timestamp
- Data was destroyed when no longer needed
- Data did not leave the designated jurisdiction
Isolation Model: User Claims Node
Unlike other vClusters where the scheduler assigns nodes, sensitive nodes are claimed by a specific user:
Dr. X authenticates via OIDC (institutional IdP)
→ Requests 4 nodes via lattice CLI: lattice submit --sensitive
→ Quorum records: nodes N1-N4 owned by user:dr-x, tenant:hospital-a
→ Strong consistency: Raft commit before any workload starts
→ OpenCHAMI boots N1-N4 with hardened sensitive image (if not already)
→ All activity on N1-N4 audited under dr-x's identity
→ When released:
→ Quorum releases node ownership (Raft commit)
→ OpenCHAMI wipes node (memory scrub, storage secure erase if NVMe present)
→ Node returns to general pool only after wipe confirmation
No clever optimization on sensitive nodes. If Dr. X claims 4 nodes at 9am and runs nothing until 2pm, those nodes sit idle. The cost is real and should be visible to the tenant’s accounting. But there is no co-scheduling, no borrowing, no time-sharing.
Concurrent Sensitive Claims
If two users simultaneously attempt to claim overlapping nodes:
- First Raft commit wins. Node ownership is a strong consistency domain. The quorum serializes all claim requests via Raft.
- The second claim request receives an OwnershipConflict error with a message identifying which nodes are already claimed and by which user.
- The second user must select different nodes or wait for the first user to release.
- There is no queueing or waitlist for sensitive node claims — they are immediate or rejected.
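The first-commit-wins rule can be sketched as a serialized claim registry. This is an illustration only: names like ClaimRegistry are hypothetical, and in Lattice the serialization point is the Raft quorum, not an in-process map.

```python
class OwnershipConflict(Exception):
    """Raised when a claim overlaps nodes already owned by another user."""

class ClaimRegistry:
    def __init__(self):
        self._owner = {}  # node_id -> owning user

    def claim(self, user, nodes):
        # All-or-nothing: if any requested node is owned, the whole claim
        # is rejected immediately. No queueing, no waitlist.
        taken = {n: self._owner[n] for n in nodes if n in self._owner}
        if taken:
            raise OwnershipConflict(f"already claimed: {taken}")
        for n in nodes:
            self._owner[n] = user

    def release(self, user, nodes):
        for n in nodes:
            if self._owner.get(n) == user:
                del self._owner[n]
```

Note that a rejected claim commits nothing, so a later claim for the disjoint nodes succeeds once the conflict is resolved.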
OS Image
Sensitive nodes boot a hardened image via OpenCHAMI BSS:
- Minimal kernel, no unnecessary services
- Mandatory access control (SELinux/AppArmor enforcing)
- No SSH daemon (all access via API gateway)
- Encrypted swap (if any)
- Audit daemon (auditd) logging all syscalls to the audit subsystem
- Node agent with audit mode telemetry enabled by default
Software Delivery
Sensitive allocations use signed uenv images only:
```yaml
environment:
  uenv: "sensitive/validated-2024.1"   # curated, audited base stack
  sign_required: true                  # image signature verified before mount
  scan_required: true                  # CVE scan passed
  approved_bases_only: true            # can only use admin-approved base images
```
The uenv registry enforces:
- Image signing (with Sovra keys or site-specific PKI)
- Vulnerability scanning (integrated with JFrog/Nexus security scanning)
- Approved base image list (maintained by site security team)
- Audit log of all image pulls
Storage
Sensitive data lives in a dedicated storage pool:
```yaml
storage_policy:
  pool: "sensitive-encrypted"       # dedicated VAST view/tenant
  encryption: "aes-256-at-rest"     # VAST native encryption
  access_logging: "full"            # every read/write logged via VAST audit
  wipe_on_release: true             # VAST secure delete on allocation end
  data_sovereignty: "ch"            # data stays in Swiss jurisdiction
  retention:
    data: "user_specified"          # user declares retention period
    audit_logs: "7_years"           # regulatory minimum
  tier_restriction: "hot_only"      # no copies on shared warm/cold tiers
```
Network Isolation
Sensitive allocations get a dedicated Slingshot VNI:
```yaml
connectivity:
  network_domain: "sensitive-{user}-{alloc_id}"  # unique per allocation
  policy:
    ingress:
      deny-all-except:
        - same_domain      # only processes in this allocation
        - data_gateway     # controlled data ingress endpoint
    egress:
      deny-all-except:
        - data_gateway     # controlled data egress
```
With Ultra Ethernet: network-level encryption (UET built-in) provides an additional layer without performance penalty.
Audit Trail
What is logged (strong consistency via Raft):
- Node claim: user identity, timestamp, node IDs
- Node release: user identity, timestamp, wipe confirmation
- Allocation start/stop: what ran, which uenv image (with hash), which data paths
- Data access: every file open/read/write (from eBPF audit telemetry)
- API calls: every lattice-api call related to sensitive allocations
- Checkpoint events: when, where, what was written
- Attach sessions: user identity, start/end timestamps, target node, session recording reference
- Log access events: who accessed logs, when, which allocation
- Metrics queries: user identity, allocation queried, timestamp
Storage:
- Append-only log (no deletions, no modifications)
- Encrypted at rest (Sovra-managed keys if federation enabled, site PKI otherwise)
- 7-year retention on cold tier (S3-compatible, immutable storage)
- Cryptographically signed entries (tamper-evident)
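A minimal sketch of how append-only, signed entries become tamper-evident, assuming a hash chain in which each entry's digest covers the previous entry's digest. Here sha256 stands in for the real PKI/Sovra signatures.

```python
import hashlib
import json

class AuditLog:
    """Append-only log; any in-place edit breaks the digest chain."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._prev = self.GENESIS

    def append(self, event: dict) -> str:
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((self._prev + payload).encode()).hexdigest()
        self.entries.append({"event": event, "prev": self._prev, "digest": digest})
        self._prev = digest
        return digest

    def verify(self) -> bool:
        # Recompute the chain from the genesis value; a single modified,
        # deleted, or reordered entry fails verification.
        prev = self.GENESIS
        for e in self.entries:
            payload = json.dumps(e["event"], sort_keys=True)
            if e["prev"] != prev:
                return False
            if hashlib.sha256((prev + payload).encode()).hexdigest() != e["digest"]:
                return False
            prev = e["digest"]
        return True
```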
Query Interface
The audit log is queryable via a dedicated API endpoint and CLI:
API:

```
GET /v1/audit/logs?user=dr-x&since=2026-03-01&until=2026-03-15
GET /v1/audit/logs?allocation=12345
GET /v1/audit/logs?node=x1000c0s0b0n0&since=2026-03-01
GET /v1/audit/logs?data_path=s3://sensitive-data/subject-001/
```

CLI:

```shell
lattice audit query --user=dr-x --since=2026-03-01 --until=2026-03-15
lattice audit query --alloc=12345
lattice audit query --node=x1000c0s0b0n0 --since=2026-03-01 --output=json
```
Scoping:
| Caller | Visible Scope |
|---|---|
| Claiming user | Own audit events only |
| Tenant admin (compliance reviewer) | All audit events for their tenant |
| System admin | All audit events |
Indexing: Audit entries are indexed by:
- User ID (primary query dimension for compliance reporting)
- Allocation ID (all events for a specific allocation)
- Node ID (all events on a specific node)
- Timestamp (range queries, required for all queries)
- Event type (filter by: claim, release, data_access, attach, etc.)
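A toy illustration of filtering along these index dimensions. Field names are assumptions; a real query hits an indexed store, not a linear scan.

```python
def query(entries, *, user=None, allocation=None, node=None,
          since, until, event_type=None):
    # Timestamp range is mandatory, matching "required for all queries".
    out = []
    for e in entries:
        if not (since <= e["ts"] <= until):
            continue
        if user and e["user"] != user:
            continue
        if allocation and e["alloc"] != allocation:
            continue
        if node and e["node"] != node:
            continue
        if event_type and e["type"] != event_type:
            continue
        out.append(e)
    return out
```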
Performance targets:
| Query Scope | Expected Latency |
|---|---|
| Single allocation (any timeframe) | < 1s |
| Single user, 1-day range | < 2s |
| Single user, 30-day range | < 10s |
| Tenant-wide, 1-day range | < 30s |
Queries spanning more than 90 days may be served from cold tier (S3 archive) with higher latency (minutes).
Export: For regulatory submissions, audit logs can be exported as signed JSON bundles:
```shell
lattice audit export --user=dr-x --since=2026-01-01 --until=2026-06-30 --output=audit-report.json.sig
```
The export includes cryptographic signatures for tamper evidence.
Observability Constraints
Every user-facing observability feature has sensitive-specific restrictions. The principle: observability must not weaken the isolation model.
Attach
- Claiming user only. The user who claimed the nodes (identity verified against Raft audit log) is the only user permitted to attach. No delegation, no shared access.
- Session recording. All attach sessions are recorded (input + output bytes) and stored at s3://sensitive-audit/{tenant}/{alloc_id}/sessions/{session_id}.recording (zstd-compressed, encrypted at rest, 7-year retention). The session recording reference is a Raft-committed audit entry.
- Signed uenv only. Attach is only permitted when the allocation runs a signed, vulnerability-scanned uenv image. This prevents attaching to environments with unvetted tools.
- No concurrent attach from different sessions. One active attach session per allocation at a time (prevents accidental data exposure via shared terminal).
Logs
- Encrypted at rest. Logs from sensitive allocations are stored in the dedicated encrypted S3 pool (same as sensitive data).
- Access-logged. Every log access (live tail or historical) generates an audit entry with user identity and timestamp.
- Restricted access. Only the claiming user and designated compliance reviewers (via tenant admin role) can access logs.
- Retention follows data policy. Log retention matches the allocation’s sensitive data retention policy, not the default log retention.
Metrics
- Low sensitivity, still scoped. Metrics (GPU%, CPU%, I/O rates) do not contain sensitive data, but are still scoped to the claiming user. Tenant admins can view aggregated usage.
- No cross-tenant visibility. Even system admins see sensitive allocation metrics only in aggregate (holistic view), not per-allocation detail.
Diagnostics
- No cross-allocation comparison for sensitive. The CompareMetrics RPC rejects requests that include sensitive allocation IDs alongside non-sensitive ones. Comparison within a single sensitive tenant is permitted (same claiming user).
- Network diagnostics scoped. Network diagnostics for sensitive allocations only show the allocation’s own VNI traffic, not fabric-wide metrics.
Profiling
- Signed tools_uenv only. Profiling tools must be delivered via a signed, approved tools_uenv image. Users cannot load arbitrary profiler binaries.
- Profile output stays in sensitive pool. All profiling output is written to the encrypted sensitive storage pool and is subject to the same access logging and retention policies.
Federation Constraints
Sensitive data does not federate by default:
- Data stays at the designated site (data sovereignty)
- Compute can theoretically federate (run at remote site), but only if:
  - Remote site meets the same compliance requirements
  - Data does not transit (remote compute accesses data via encrypted API, not bulk transfer)
  - Both sites’ Sovra instances have a sensitive workspace with hospital CRK
- In practice: sensitive jobs run where the data is. Period.
Conformance Requirements
Sensitive nodes have strict conformance enforcement. Unlike general workloads where conformance is a soft preference, sensitive workloads treat configuration drift as a hard constraint:
- Pre-claim validation. Before a node can be claimed for sensitive use, the scheduler verifies its conformance fingerprint matches the expected baseline for the sensitive vCluster. Drifted nodes are rejected.
- Drift triggers drain. If a sensitive node’s conformance fingerprint changes during operation (e.g., a firmware update was missed), the node agent flags the drift. The scheduler will not assign new sensitive claims to the node until OpenCHAMI remediates it.
- Audit trail. Conformance state changes on sensitive nodes are recorded in the Raft-committed audit log (which firmware/driver versions were active during the allocation).
This is deliberately conservative: sensitive workloads do not tolerate the subtle failures that configuration drift can cause, and regulatory compliance requires provable consistency of the execution environment.
Scheduler Behavior
The sensitive vCluster scheduler is intentionally simple:
- Algorithm: Reservation-based (not knapsack). User claims nodes, scheduler validates and commits.
- No backfill. Sensitive nodes are not shared.
- No preemption. Sensitive allocations are never preempted.
- No elastic borrowing. Sensitive nodes cannot be borrowed by other vClusters.
- Fair-share: Not applicable (nodes are user-claimed, not queue-scheduled).
- Conformance: Hard constraint — only nodes matching the expected conformance baseline are eligible.
- Cost function weights: priority=0.90, conformance=0.10 (tiebreaker among conformant nodes; non-conformant nodes are excluded as a hard constraint at the solver level, not via the weight system), everything else near-zero.
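The hard-constraint behavior can be sketched as follows. This is an illustration of the reservation path with hypothetical names, not the actual solver: non-conformant nodes are filtered out before any scoring, so the weights only ever tiebreak among already-eligible nodes.

```python
def claim_nodes(requested_count, nodes, baseline):
    # Hard constraint: non-conformant or non-ready nodes are excluded
    # outright, before any cost-function scoring happens.
    conformant = [n for n in nodes
                  if n["fingerprint"] == baseline and n["state"] == "ready"]
    if len(conformant) < requested_count:
        raise RuntimeError("insufficient conformant nodes for sensitive claim")
    # Reservation-based: validate and commit the user's request as-is.
    # No knapsack, no backfill, no preemption, no elastic borrowing.
    return [n["id"] for n in conformant[:requested_count]]
```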
Accounting
Design Principle
Lattice schedules, Waldur accounts. Accounting is asynchronous and optional (feature-flagged like federation). Waldur unavailability never blocks scheduling.
What is Waldur
Waldur is a hybrid cloud orchestrator with HPC integration, accounting, billing, and self-service portal. It provides:
- Resource usage tracking and billing
- Project-level budget management
- Self-service quota requests
- Invoice generation
Integration is via Waldur’s REST API.
Integration Pattern
```
Lattice ──async push──→ Waldur  (accounting events)
Waldur  ──API call──→ Lattice   (quota updates)
```
Lattice pushes accounting events to Waldur asynchronously. Waldur can push quota updates back. The two systems are loosely coupled — neither depends on the other for core functionality.
Accounting Events
Events pushed from Lattice to Waldur:
| Event | Trigger | Payload |
|---|---|---|
allocation.started | Allocation enters Running state | tenant, project, user, resources (nodes, GPUs, GPU type), estimated duration |
allocation.completed | Allocation reaches terminal state | actual duration, GPU-hours consumed, exit status, storage bytes written |
allocation.checkpointed | Checkpoint written | checkpoint storage consumed, checkpoint duration |
node.claimed | Sensitive node claimed by a user | tenant, user, node IDs, claiming timestamp |
node.released | Sensitive node released | tenant, user, node IDs, release timestamp, wipe confirmation |
quota.updated | Waldur updates a tenant’s quota | new quota values (Waldur → Lattice direction) |
Events are timestamped and include the allocation ID for correlation.
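A sketch of event assembly, showing the timestamp and correlation key every event carries. Field names are assumptions drawn from the table; the real payload types live in lattice-common.

```python
import time

def make_event(kind, allocation_id, tenant, **payload):
    # Every event is timestamped and carries the allocation ID so Waldur
    # can correlate started/completed/checkpointed events for one allocation.
    return {"event": kind,
            "allocation_id": allocation_id,
            "tenant": tenant,
            "timestamp": time.time(),
            **payload}
```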
Entity Mapping
| Lattice Entity | Waldur Entity | Notes |
|---|---|---|
| Tenant | Customer | 1:1 mapping |
| Project (within tenant) | Project | 1:1 mapping |
| vCluster | Offering | Each vCluster type is a service offering |
| Allocation | Order | Each allocation is a resource order |
Waldur API Endpoints Used
| Direction | Endpoint | Purpose |
|---|---|---|
| Lattice → Waldur | POST /api/marketplace-orders/ | Report resource usage |
| Lattice → Waldur | POST /api/invoices/{id}/items/ | Add billing line items |
| Waldur → Lattice | GET /api/customers/{id}/quotas/ | Read project quotas |
| Waldur → Lattice | PUT /api/v1/tenants/{id} | Update tenant quotas in Lattice |
Authentication
Waldur API token is stored in a secrets manager (never in config files):
```yaml
waldur:
  token_secret_ref: "vault://lattice/waldur-token"
```
The token is loaded at startup and refreshed on rotation. Cross-ref: security.md for secret management.
Failure Handling
Waldur unavailability must never block scheduling:
- Buffer: Accounting events are buffered in a bounded in-memory queue (default: 10,000 events)
- Persist: If the buffer fills, overflow events are persisted to disk (WAL-style append log)
- Replay: On Waldur reconnection, buffered and persisted events are replayed in order
- Alert: If the disk buffer exceeds a threshold (default: 100,000 events), an alert is raised via scheduler self-monitoring (cross-ref: telemetry.md)
- Degrade gracefully: If both buffer and disk are full, events are dropped with a counter metric (lattice_accounting_events_dropped_total). Scheduling continues.
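The degradation path can be sketched as follows. This is illustrative: the real overflow store is an on-disk WAL, modeled here as a second in-memory queue.

```python
from collections import deque

class AccountingBuffer:
    def __init__(self, mem_capacity=10_000, disk_capacity=100_000):
        self.mem = deque()
        self.disk = deque()            # stands in for the on-disk append log
        self.mem_capacity = mem_capacity
        self.disk_capacity = disk_capacity
        self.dropped_total = 0         # lattice_accounting_events_dropped_total

    def push(self, event):
        if len(self.mem) < self.mem_capacity:
            self.mem.append(event)
        elif len(self.disk) < self.disk_capacity:
            self.disk.append(event)    # overflow persisted WAL-style
        else:
            self.dropped_total += 1    # degrade gracefully; scheduling continues

    def replay(self):
        # On Waldur reconnection, drain in original order: the in-memory
        # queue filled first, so it holds the oldest events.
        while self.mem:
            yield self.mem.popleft()
        while self.disk:
            yield self.disk.popleft()
```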
Operational Response to Buffer Overflow
When the accounting buffer fills and events are dropped:
- Detect: the lattice_accounting_events_dropped_total counter increments. Alert fires when > 0.
- Impact: Billing data is incomplete. GPU-hours and allocation events are missing from Waldur. This affects invoice accuracy but never affects scheduling.
- Respond:
  - Check Waldur availability (lattice admin accounting status)
  - If Waldur is down: wait for recovery. Buffered events will replay. Dropped events are lost.
  - If Waldur is up but slow: check push interval and batch size. Increase push_interval_seconds to allow larger batches.
- Recovery: Dropped events cannot be recovered from the accounting pipeline. However, the quorum has allocation state (start/end times, node assignments). An admin can reconstruct missing billing data from quorum logs with lattice admin accounting reconcile --since=2026-03-01 --until=2026-03-02. This command reads allocation history from the quorum and generates compensating events for Waldur.
- Prevention: Size the buffer for expected Waldur outage duration. Rule of thumb: buffer_size = events_per_minute × max_expected_outage_minutes. For a busy cluster (100 events/min) and a 2-hour outage target: buffer_size = 12000.
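The sizing rule as arithmetic:

```python
def buffer_size(events_per_minute, max_expected_outage_minutes):
    # One buffered slot per event expected during the worst-case outage.
    return events_per_minute * max_expected_outage_minutes
```

For the example above: 100 events/min over a 120-minute outage target gives 12,000 events.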
Quota Feedback Loop
Waldur can act as the budget authority, updating Lattice tenant quotas:
1. Waldur detects budget exhaustion (e.g., project spent its allocated compute hours)
2. Waldur calls lattice-api: PUT /api/v1/tenants/{id} with reduced limits
3. Lattice updates hard/soft quotas (cross-ref: quota-enforcement.md)
4. Effect: tenant’s new allocations are blocked (hard quota) or deprioritized (soft quota)
Conversely, when a tenant purchases more compute:
1. Waldur increases the tenant’s quota
2. Lattice picks up the new limits
3. Previously-starved allocations can now be scheduled
Sensitive Accounting
Sensitive allocations have additional accounting requirements:
- All accounting events include the claiming user’s identity (not just tenant)
- Idle node time (nodes claimed but no running allocation) is billable — Waldur receives node.claimed and node.released events
- Accounting events for sensitive allocations are also written to the Raft-committed audit log (cross-ref: sensitive-workloads.md)
- Waldur must retain sensitive billing records for 7 years (configured on the Waldur side)
Configuration
```yaml
accounting:
  enabled: true                      # feature flag, default: false
  provider: "waldur"
  waldur:
    api_url: "https://waldur.example.com/api/"
    token_secret_ref: "vault://lattice/waldur-token"
    push_interval_seconds: 60        # batch push interval
    buffer_size: 10000               # in-memory event buffer
    disk_buffer_path: "/var/lib/lattice/accounting-wal"
    disk_buffer_max_events: 100000
```
When accounting.enabled is false, no accounting code runs and no Waldur dependency exists (same pattern as federation).
Cross-References
- quota-enforcement.md — Waldur updates quotas, hard vs. soft semantics
- failure-modes.md — Accounting service failure buffering
- security.md — Waldur API token management
- sensitive-workloads.md — Sensitive billing and audit requirements
- telemetry.md — Accounting buffer metrics in scheduler self-monitoring
Slurm Migration
Design Principle
Migration from Slurm should be gradual and low-risk. Existing Slurm scripts should work with minimal changes via the compatibility layer. Users can adopt Lattice-native features incrementally. The goal is not perfect Slurm emulation — it’s a smooth on-ramp.
Migration Phases
Phase 1: Dual-Stack (Recommended Start)
Run Lattice alongside Slurm on a subset of nodes. Users can submit to either system. This provides:
- Side-by-side comparison of scheduling behavior
- Gradual user migration with rollback to Slurm
- Time to validate RM-Replay weight tuning
Phase 2: Compat-Mode Cutover
Move all nodes to Lattice. Users continue using sbatch/squeue via compatibility aliases. Slurm daemons are decommissioned.
Phase 3: Native Adoption
Users migrate scripts to native lattice CLI, adopting features not available in Slurm (reactive scaling, metric-driven autoscaling, DAG workflows, data staging hints).
Script Compatibility
Supported #SBATCH Directives
| Slurm Directive | Lattice Mapping | Notes |
|---|---|---|
--nodes=N | resources.nodes: N | Exact match |
--ntasks=N | Mapped to node count | nodes = ceil(N / tasks_per_node) |
--ntasks-per-node=N | Passed as task config | Used by launcher |
--time=HH:MM:SS | lifecycle.walltime | Exact match |
--partition=X | vcluster: X | Partition name → vCluster name mapping |
--account=X | tenant: X | Account → tenant mapping |
--job-name=X | tags.name: X | Stored as tag |
--output=file | Log path hint | Logs always go to S3; --output sets download path |
--error=file | Log path hint | Same as --output |
--constraint=X | constraints.features | Feature matching |
--gres=gpu:N | constraints.gpu_count | Mapped to GPU constraint |
--exclusive | Default behavior | Lattice schedules full nodes by default (ADR-007) |
--array=0-99%20 | task_group | Task group with concurrency limit |
--dependency=afterok:123 | depends_on: [{ref: "123", condition: "success"}] | DAG edge |
--qos=X | preemption_class | QoS → priority mapping (configurable per site) |
--mail-user, --mail-type | Not supported | Warn, skip |
--mem=X | Not supported | Full-node scheduling; memory is not a constraint |
--cpus-per-task=N | Not supported | Full-node scheduling |
--uenv=X | environment.uenv: X | Lattice extension, not in Slurm |
--view=X | environment.view: X | Lattice extension |
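A sketch of how a compat parser might apply this table. Directive coverage is truncated and all field names here are assumptions; the real translation lives in the compatibility layer.

```python
import math
import re

def translate(script: str):
    """Map #SBATCH directives to a Lattice-style spec, warning (never
    failing) on unsupported options."""
    spec, warnings = {}, []
    ntasks = tasks_per_node = None
    for line in script.splitlines():
        m = re.match(r"#SBATCH\s+(--[\w-]+)(?:=(\S+))?", line)
        if not m:
            continue
        key, val = m.group(1), m.group(2)
        if key == "--nodes":
            spec["nodes"] = int(val)
        elif key == "--ntasks":
            ntasks = int(val)
        elif key == "--ntasks-per-node":
            tasks_per_node = int(val)
        elif key == "--time":
            spec["walltime"] = val
        elif key == "--partition":
            spec["vcluster"] = val
        elif key == "--account":
            spec["tenant"] = val
        elif key == "--mem" or key.startswith("--mail"):
            warnings.append(f"{key} ignored")   # warn, never fail
        # ... remaining directives follow the mapping table above
    # --nodes takes precedence; otherwise derive nodes = ceil(ntasks / per_node)
    if "nodes" not in spec and ntasks:
        spec["nodes"] = math.ceil(ntasks / (tasks_per_node or 1))
    return spec, warnings
```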
Unsupported Directives
Directives that have no Lattice equivalent are handled gracefully:
```
Warning: #SBATCH --mem=64G ignored (Lattice uses full-node scheduling, memory is not constrainable)
Warning: #SBATCH --mail-user=user@example.com ignored (use `lattice watch` for event notifications)
Submitted allocation 12345
```
The submission succeeds — unsupported directives produce warnings, not errors. This is critical for migration: existing scripts should not fail because of irrelevant Slurm options.
Conflicting Directives
| Conflict | Resolution |
|---|---|
--nodes=64 + --ntasks=128 with --ntasks-per-node=4 | --nodes takes precedence; ntasks-per-node used by launcher |
--exclusive + --mem=64G | --exclusive is default; --mem ignored with warning |
--partition not found | Error: vCluster "X" not found. Available: hpc-batch, ml-training, interactive |
Slurm Features Not Supported
These Slurm features have no Lattice equivalent and are not planned:
| Feature | Reason | Alternative |
|---|---|---|
Job steps (srun within sbatch) | Lattice uses tasks within allocations | lattice launch --alloc=<id> |
| Hetjob (heterogeneous job) | Not yet designed | Submit separate allocations with DAG dependencies |
Burst buffer (#DW) | DataWarp-specific | Use data.mounts with tier_hint: hot |
| GRES beyond GPU | Not needed (full-node scheduling) | Use constraints.features for non-GPU resources |
Accounting (sacctmgr) | Waldur handles accounting | lattice history or Waldur portal |
Reservations (scontrol create reservation) | Use sensitive claims for dedicated nodes | lattice admin reserve (future) |
Licenses/resources (--licenses=) | Not applicable | Use constraints.features |
Multi-cluster (--cluster=) | Use federation | lattice submit --site=X (if federation enabled) |
srun Within Allocations
Slurm users often use srun inside batch scripts to launch parallel tasks. In Lattice:
```shell
# Slurm pattern:
srun -n 256 ./my_mpi_program

# Lattice equivalent (inside a running allocation):

# Option 1: The entrypoint IS the parallel launch.
# In the submission script, use the appropriate launcher directly:
mpirun -np 256 ./my_mpi_program
# or:
torchrun --nproc_per_node=4 ./train.py

# Option 2: Use lattice launch from another terminal
lattice launch --alloc=12345 -n 256 ./my_mpi_program
```
The compatibility layer translates srun to lattice launch when the compat aliases are active.
Environment Variables
Slurm sets many environment variables in jobs. Lattice provides equivalent variables:
| Slurm Variable | Lattice Variable | Description |
|---|---|---|
SLURM_JOB_ID | LATTICE_ALLOC_ID | Allocation ID |
SLURM_JOB_NAME | LATTICE_JOB_NAME | Job name (from tags) |
SLURM_NODELIST | LATTICE_NODELIST | Comma-separated node list |
SLURM_NNODES | LATTICE_NNODES | Number of nodes |
SLURM_NPROCS | LATTICE_NPROCS | Number of tasks |
SLURM_ARRAY_TASK_ID | LATTICE_TASK_INDEX | Task group index |
SLURM_ARRAY_JOB_ID | LATTICE_TASK_GROUP_ID | Task group parent ID |
SLURM_SUBMIT_DIR | LATTICE_SUBMIT_DIR | Submission directory |
SLURM_JOBID | LATTICE_ALLOC_ID | Alias for compatibility |
For migration convenience, the compat layer can also set SLURM_* variables (configurable: compat.set_slurm_env=true). This is disabled by default to avoid confusion.
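A sketch of the variable translation including the optional SLURM_* shim. Names follow the table above; build_env itself is hypothetical.

```python
# LATTICE_* names to the SLURM_* aliases the compat shim can also set.
ENV_MAP = {
    "LATTICE_ALLOC_ID":   ["SLURM_JOB_ID", "SLURM_JOBID"],
    "LATTICE_NNODES":     ["SLURM_NNODES"],
    "LATTICE_NODELIST":   ["SLURM_NODELIST"],
    "LATTICE_TASK_INDEX": ["SLURM_ARRAY_TASK_ID"],
}

def build_env(values, set_slurm_env=False):
    """values is keyed by LATTICE_* names; the SLURM_* shim is opt-in
    (compat.set_slurm_env), disabled by default to avoid confusion."""
    env = dict(values)
    if set_slurm_env:
        for lattice_name, slurm_names in ENV_MAP.items():
            if lattice_name in env:
                for alias in slurm_names:
                    env[alias] = env[lattice_name]
    return env
```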
Partition-to-vCluster Mapping
Sites configure the mapping from Slurm partition names to Lattice vClusters:
```yaml
# lattice-compat.yaml
partition_mapping:
  normal: "hpc-batch"
  debug: "interactive"
  gpu: "ml-training"
  long: "hpc-batch"          # multiple partitions can map to one vCluster
  sensitive: "sensitive-secure"

qos_mapping:
  low: 1
  normal: 4
  high: 7
  urgent: 9
```
Unmapped partition names produce an error with a list of available vClusters.
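A sketch of the lookup and its error path (illustrative; the mapping values mirror the config example above):

```python
PARTITION_MAP = {"normal": "hpc-batch", "debug": "interactive", "gpu": "ml-training"}

def resolve_partition(name, mapping=PARTITION_MAP):
    """Return the vCluster for a Slurm partition name, or raise with the
    list of available vClusters when the partition is unmapped."""
    try:
        return mapping[name]
    except KeyError:
        available = ", ".join(sorted(set(mapping.values())))
        raise ValueError(
            f'vCluster mapping for partition "{name}" not found. '
            f'Available: {available}') from None
```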
Migration Checklist
For site administrators:
- Deploy Lattice control plane alongside Slurm
- Configure partition-to-vCluster mapping
- Configure QoS-to-preemption-class mapping
- Tune cost function weights using RM-Replay with production traces
- Test representative batch scripts via compat layer
- Validate accounting (Waldur) captures match Slurm sacct data
- Train users on lattice CLI basics
- Run dual-stack for 2-4 weeks
- Migrate remaining users, decommission Slurm
For users:
- Test existing scripts with lattice submit (compat mode parses #SBATCH)
- Review warnings for unsupported directives
- Replace srun in scripts with direct launcher commands (mpirun, torchrun)
- (Optional) Migrate to native lattice CLI syntax for new workflows
Cross-References
- api-design.md — Compatibility API command mapping
- cli-design.md — Native CLI design and compat aliases
- sessions.md — salloc equivalent
- dag-scheduling.md — DAG dependencies (replaces --dependency)
- mpi-process-management.md — MPI launch, PMI-2, srun replacement
Troubleshooting Guide
Allocation Stuck in Pending
Symptom: lattice status shows allocation in Pending for longer than expected.
Diagnosis:
```shell
# Check why the allocation isn't being scheduled
lattice status 12345 --verbose
```
| Verbose Output | Cause | Fix |
|---|---|---|
waiting for quota headroom | Tenant hard quota (max_nodes or max_concurrent_allocations) exceeded | Cancel other allocations or request quota increase |
no nodes matching constraints | No nodes with requested GPU type, features, or topology | Relax constraints (--topology=any), check lattice nodes --state=ready |
data staging in progress | Input data being pre-staged from warm/cold tier | Wait (check progress with lattice status 12345 --verbose), or submit without tier_hint: hot |
insufficient conformance group | Not enough nodes with matching conformance fingerprint for multi-node job | Reduce node count, or wait for OpenCHAMI to remediate drifted nodes |
all suitable nodes occupied | Resources are busy; allocation is queued normally | Wait; check queue depth with lattice status --state=pending |
soft quota penalty (low score) | GPU-hours budget nearly exhausted; allocation deprioritized | Request budget increase from tenant admin or Waldur portal |
Deeper investigation:
```shell
# Check scheduler cycle is running
lattice admin scheduler status --vcluster=hpc-batch

# Check if proposals are being rejected
lattice admin raft status

# View scheduling metrics
# (high proposal rejection rate may indicate race conditions or quota contention)
```
Scheduling Cycle Slow
Symptom: lattice_scheduling_cycle_duration_seconds p99 > 30s.
Diagnosis:
| Check | Command | What to Look For |
|---|---|---|
| Queue depth | lattice status --state=pending --count | > 500 pending allocations |
| Cost function time | Grafana: lattice_scheduling_cost_function_duration_seconds | Dominant component of cycle |
| Conformance group fragmentation | lattice nodes -o wide \| sort -k7 \| uniq -c | Many small groups |
| Topology solver | Grafana: cycle time breakdown | Multi-group spanning expensive |
Fixes:
| Cause | Fix |
|---|---|
| Too many pending allocations | Increase cycle interval to batch more proposals |
| Cost function slow | Check if custom metrics (f₅ data_readiness) are causing TSDB query delays |
| Conformance fragmented | Standardize firmware, or reduce w₉ for tolerant workloads |
| Topology solver | Reduce backfill depth, or allow topology: any for more jobs |
Node Stuck in Degraded/Down
Symptom: Node shows Degraded or Down in lattice nodes.
Diagnosis:
```shell
# Check node details
lattice nodes x1000c0s0b0n0

# Check heartbeat
# If heartbeat missing: node agent may be down or network partitioned
```
| State | Duration | Likely Cause | Fix |
|---|---|---|---|
| Degraded | < 2 min | Transient network blip | Wait; likely self-resolves |
| Degraded | > 5 min | Agent crash or network partition | SSH to node, check agent: systemctl status lattice-agent |
| Down | any | Agent not recovering | Check BMC via OpenCHAMI: manta node status x1000c0s0b0n0 |
| Down (BMC unreachable) | any | Hardware failure | Physical inspection required |
Recovery:
```shell
# If agent crashed, restart it
ssh x1000c0s0b0n0 systemctl restart lattice-agent

# If node needs reboot
lattice node disable x1000c0s0b0n0
# (coordinate with OpenCHAMI for reboot)
lattice node undrain x1000c0s0b0n0   # after reboot + health check
```
Raft Commit Latency High
Symptom: lattice_raft_commit_latency_seconds p99 > 1s.
Diagnosis:
| Check | What to Look For |
|---|---|
| Disk I/O on quorum members | WAL write latency. Quorum members need fast SSD. |
| Network between quorum members | Packet loss or high latency between quorum nodes |
| Leader overloaded | Too many proposals per second |
| Log compaction | Snapshot in progress (one-time spike, normal) |
Fixes:
| Cause | Fix |
|---|---|
| Slow disk | Move WAL to dedicated NVMe SSD |
| Network latency | Ensure quorum members are on low-latency network (same rack or switch) |
| Leader overload | Increase scheduling cycle interval to reduce proposal rate |
| Log too large | Reduce snapshot interval (more frequent snapshots = smaller log) |
Allocation Fails During Prologue
Symptom: Allocation moves from Running to Failed within seconds of starting.
Diagnosis:
```shell
lattice logs 12345
# Look for prologue errors:
# "uenv pull failed: hash mismatch"
# "mount failed: ENOSPC"
# "NFS mount timeout"
```
| Error | Cause | Fix |
|---|---|---|
| Hash mismatch | Corrupted image in cache or registry | lattice cache evict --image=... --node=... and retry |
| ENOSPC | Node-local cache full, eviction couldn’t free space | Check cache status: lattice cache status --node=.... Evict unused images manually. |
| NFS mount timeout | VAST unavailable or network issue | Check VAST health. Check Slingshot storage traffic class. |
| Image not found | uenv name/version doesn’t exist in registry | Verify with lattice cache status --node=... or check the uenv registry directly |
Preemption Not Working
Symptom: Higher-priority allocation waiting despite lower-priority allocations running on suitable nodes.
Diagnosis:
lattice status 12345 --verbose
# Check if preemption is enabled for this vCluster
lattice admin vcluster show hpc-batch
| Cause | Fix |
|---|---|
| Pending job’s priority class ≤ running jobs’ class | Preemption only works downward. Check priority classes. |
Running jobs are non-preemptible (checkpoint: none + high class) | Wait for them to complete |
| Running jobs are near completion (>90% walltime) | Scheduler avoids preempting near-completion jobs. Wait. |
| vCluster doesn’t allow preemption | Check vCluster config. Service vClusters only preempt borrowed nodes. |
Autoscaling Not Triggering
Symptom: Reactive allocation stays at min_nodes despite high metric value.
Diagnosis:
```shell
# Check current metric value
lattice top 12345 --metric=gpu_utilization

# Check scaling events
lattice status 12345 --verbose
```
| Cause | Fix |
|---|---|
| Metric below target | Scaling only triggers when metric > target for scale_up_window (2 min) |
| Cooldown period active | Recent scale event; wait for cooldown (3 min default) |
| TSDB query failing | Check lattice_autoscaling_metric_query_failures_total metric |
| Tenant quota exhausted | max_nodes reached; scale-up is a no-op |
| Metric name wrong | Verify metric exists in TSDB: lattice top 12345 --metric=<name> |
Sensitive Node Won’t Accept Claims
Symptom: Sensitive node claim rejected.
Diagnosis:
| Check | What to Look For |
|---|---|
lattice nodes <id> | Is node in Ready state? (Not Degraded, Down, Draining) |
| Conformance | Is node’s conformance fingerprint matching the sensitive baseline? |
| Pool size | Is sensitive_pool_size quota exhausted? |
| Previous wipe | Was the node properly wiped after last sensitive use? |
Fix:
```shell
# Check conformance
lattice nodes x1000c0s0b0n0 -o wide
# If drifted: coordinate with OpenCHAMI for remediation

# Check sensitive pool
lattice admin tenant show hospital-a --quotas
# If exhausted: release unused sensitive nodes or increase pool
```
Log Collection
When filing a bug report or escalating, collect:
```shell
# System overview
lattice admin raft status > diag/raft.txt
lattice nodes -o json > diag/nodes.json
lattice status --all -o json > diag/allocations.json

# Recent scheduler metrics (last hour)
lattice admin metrics dump --component=scheduler --duration=1h > diag/scheduler-metrics.json

# Specific node agent logs (if relevant)
ssh x1000c0s0b0n0 journalctl -u lattice-agent --since="1 hour ago" > diag/agent.log
```
Cross-References
- failure-modes.md — Expected failure patterns and recovery
- node-lifecycle.md — Node state transitions and timeouts
- preemption.md — Preemption policy and classes
- autoscaling.md — Scaling loop and error handling
- data-staging.md — Cache management and staging pipeline
- tuning-guide.md — Cost function tuning for performance issues
Architecture Decision Records
Template
Each ADR follows this format:
- Status: Proposed | Accepted | Superseded
- Context: What is the problem?
- Decision: What did we decide?
- Consequences: What are the trade-offs?
ADR-001: Raft for Quorum Consensus
Status: Accepted
Context: The scheduler needs a distributed control plane that avoids single-point-of-failure (Slurm’s slurmctld problem). We need strong consistency for node ownership and sensitive audit, but the system schedules tens-to-hundreds of large allocations, not millions of microservices.
Decision: Use Raft consensus (via openraft crate) for the quorum. 3-5 replicas. Only node ownership changes and sensitive audit events go through Raft. Everything else is eventually consistent.
Consequences:
- (+) No SPOF. Quorum tolerates minority failures.
- (+) Raft is well-understood, battle-tested, good Rust implementations exist.
- (+) Consistency latency (few ms per commit) is acceptable for our scheduling granularity.
- (-) Operational complexity of running a Raft cluster (leader election, log compaction, membership changes).
- (-) Write throughput limited by Raft commit latency. Not a problem at our scale.
ADR-002: Knapsack Scheduling with Composite Cost Function
Status: Accepted
Context: We need a scheduling algorithm that handles both HPC batch (topology-aware, fair-share) and cloud service (bin-packing, autoscale) workloads. Different vClusters need different optimization strategies.
Decision: Multi-dimensional knapsack formulation with a composite weighted cost function. Weights tunable per vCluster. Greedy solver with topology-aware backfill. Validated via RM-Replay simulator before production deployment.
Consequences:
- (+) Unified framework for all workload types (just change weights).
- (+) Cost function is extensible (add new factors without restructuring).
- (+) RM-Replay provides safe testing of configuration changes.
- (-) Weight tuning requires expertise and simulation. Not “plug and play.”
- (-) Greedy solver is not globally optimal. Acceptable for our scale.
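The weighted-sum shape of the cost function can be sketched in a few lines. This is an illustrative sketch, not the `lattice-scheduler` implementation: the struct names (`CostWeights`, `Candidate`) and the four factors shown are stand-ins for the larger, config-driven factor set.

```rust
/// Illustrative per-vCluster weights; the real factor set is larger and
/// configured per vCluster (hypothetical names).
struct CostWeights {
    fair_share: f64,
    topology: f64,
    data_readiness: f64,
    energy: f64,
}

/// A candidate placement scored by the greedy solver (factors normalized
/// to 0.0..=1.0, higher = better).
struct Candidate {
    fair_share: f64,     // higher = tenant is under-served
    topology: f64,       // higher = more compact placement
    data_readiness: f64, // fraction of declared data already on hot tier
    energy: f64,         // higher = cheaper energy window
}

/// Composite score: weighted sum of normalized factors. The greedy solver
/// picks the highest-scoring feasible candidate first; changing the weights
/// per vCluster changes the optimization strategy without new code.
fn score(w: &CostWeights, c: &Candidate) -> f64 {
    w.fair_share * c.fair_share
        + w.topology * c.topology
        + w.data_readiness * c.data_readiness
        + w.energy * c.energy
}
```

A batch-HPC vCluster might weight `topology` and `fair_share` heavily, while an inference vCluster weights packing-related factors; that per-vCluster tunability is the point of the decision above.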
ADR-003: uenv-First Software Delivery
Status: Accepted
Context: Users need reproducible software environments. Options: full containers (Docker/Sarus), uenv (SquashFS mount namespaces), or module systems.
Decision: uenv is the default software delivery mechanism. Sarus is used for OCI containers when isolation is needed (multi-tenant node sharing, third-party images, sensitive workloads requiring enhanced isolation). No module system.
Consequences:
- (+) Near-zero runtime overhead (mount namespace, no container isolation overhead).
- (+) Native GPU/Slingshot access without namespace workarounds.
- (+) MPI “just works” — no network namespace translation.
- (+) Proven at CSCS scale (Alps, 10,752 GH200 GPUs).
- (-) Users must use curated uenv stacks or build their own (Spack/Stackinator).
- (-) Weaker isolation than containers — fine for trusted HPC users, needs Sarus for untrusted workloads.
ADR-004: Two Strong Consistency Domains
Status: Accepted
Context: Strong consistency (Raft) has a performance cost. We need to minimize what goes through consensus while ensuring correctness for critical state.
Decision: Exactly two categories of state require strong consistency:
- Node ownership — which tenant/vCluster/allocation owns which nodes
- Sensitive audit log — all events related to sensitive node claims, data access, and isolation boundaries
Everything else (job queues, telemetry, quota accounting, session state) is eventually consistent.
Consequences:
- (+) Minimal Raft throughput requirements (node ownership changes are infrequent).
- (+) Sensitive compliance: audit trail is provably consistent and tamper-evident.
- (+) Job queue staleness is bounded and self-correcting (rejected proposals retry next cycle).
- (-) Eventual consistency means two vCluster schedulers might propose conflicting allocations. One gets rejected. This is a retry, not a bug.
- (-) Quota accounting can lag. Hard limits enforced at quorum (node ownership), soft limits eventually.
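The two-domain split can be made explicit in the type system. The sketch below is illustrative (the variant and field names are hypothetical, not the real `lattice-quorum` types): only the two categories named above route through Raft, and everything else is applied to in-memory state without a consensus round-trip.

```rust
/// Hypothetical classification of state updates by consistency domain.
enum StateUpdate {
    /// Which tenant/vCluster/allocation owns which nodes.
    NodeOwnership { allocation_id: u64, node_ids: Vec<u32> },
    /// Isolation-boundary events for regulated workloads.
    SensitiveAudit { user: String, action: String },
    /// Per-node heartbeat telemetry (eventually consistent).
    Telemetry { node_id: u32, gpu_util: f32 },
    /// vCluster queue bookkeeping (eventually consistent).
    QueueState { vcluster: String, pending: usize },
}

/// Exactly two categories require a Raft commit (this ADR); the rest never
/// block on consensus.
fn needs_raft_commit(update: &StateUpdate) -> bool {
    matches!(
        update,
        StateUpdate::NodeOwnership { .. } | StateUpdate::SensitiveAudit { .. }
    )
}
```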
ADR-005: Federation as Opt-In via Sovra
Status: Accepted
Context: Multi-site operation is desirable but adds significant complexity. Not all deployments need it. The trust model for cross-site operation is a hard problem.
Decision: Federation is a compile-time feature flag. When disabled, no Sovra dependency and no cross-site code paths. When enabled, Sovra provides the cryptographic trust layer. Each site retains full sovereignty — federation broker suggests, local scheduler decides.
Consequences:
- (+) Zero overhead when federation is not needed.
- (+) Sovra’s sovereign key model aligns with institutional requirements (each site controls its keys).
- (+) Revocable federation (revoke workspace → instant defederation).
- (-) Additional infrastructure to operate (Sovra instances, federation brokers).
- (-) Cross-site scheduling decisions are based on eventually consistent capacity data (may be stale).
ADR-006: Rust for Scheduler Core
Status: Accepted
Context: The scheduler is a long-lived, performance-critical, correctness-critical system. Options: Rust, Go, C++.
Decision: Rust for all performance-critical components (quorum, schedulers, node agent, API server, CLI, checkpoint broker). Go for infrastructure integration (OpenCHAMI, Sovra, federation broker). Python for user-facing SDK and tooling.
Consequences:
- (+) Memory safety without GC pauses (critical for scheduler latency).
- (+) Strong type system for modeling resource constraints (algebraic types for allocation states).
- (+) Excellent async/concurrency (tokio) for handling many concurrent node agent connections.
- (+) Single binary deployment for node agents (no runtime dependencies).
- (-) Steeper learning curve for contributors.
- (-) Slower initial development velocity vs. Go.
- (-) Ecosystem for HPC is smaller than C/C++ (but growing).
ADR-007: Full-Node Scheduling with Intra-Node Packing
Status: Accepted
Context: Scheduling granularity: full nodes, fractional nodes, or both?
Decision: The scheduler reasons about full nodes. The node agent handles intra-node packing (multiple containers/uenvs on a single node) for workloads that don’t need a full node (interactive sessions, small Jupyter notebooks). This is a two-level scheme: scheduler assigns nodes to vClusters, node agent packs work within allocated nodes.
Consequences:
- (+) Simplifies the scheduler (no cgroup negotiation between co-tenants).
- (+) Predictable performance for large jobs (no noisy neighbor at scheduler level).
- (+) Node agent can use simple bin-packing for small workloads.
- (-) Potential waste for small workloads that get a full node unnecessarily. Mitigated by Sarus containers with resource limits for the interactive vCluster, and by grouping small workloads on designated “shared” nodes.
ADR-008: Asynchronous Accounting via Waldur
Status: Accepted
Context: Lattice needs external accounting and billing but should not depend on an accounting system for core scheduling functionality. Waldur provides HPC-aware accounting, billing, and self-service portal capabilities.
Decision: Integrate with Waldur as an optional, feature-flagged accounting provider. Lattice pushes accounting events (allocation started/completed, resource usage) to Waldur asynchronously. Waldur can push quota updates back to Lattice. Waldur unavailability never blocks scheduling. Events are buffered in memory and persisted to disk on overflow, replayed on reconnection.
Consequences:
- (+) Clean separation of concerns: Lattice schedules, Waldur accounts.
- (+) Zero scheduling impact from accounting failures (events are buffered).
- (+) Waldur’s self-service portal gives tenant admins quota visibility without Lattice changes.
- (+) Feature-flagged: zero overhead when accounting is not needed.
- (-) Eventually consistent accounting data (events pushed at configurable interval, default 60s).
- (-) Additional external dependency to operate (Waldur instance, API token management).
- (-) Entity mapping (Tenant↔Customer, Project↔Project) must stay synchronized.
ADR-009: Two-Tier Quota Enforcement
Status: Accepted
Context: Quota enforcement must balance strictness (prevent over-allocation) with performance (don’t bottleneck scheduling on consensus). Some quotas are safety-critical (node counts), others are advisory (GPU-hours budgets).
Decision: Two-tier quota enforcement matching the two consistency domains (ADR-004):
- Hard quotas (quorum-enforced, strong consistency): `max_nodes`, `max_concurrent_allocations`, `sensitive_pool_size`. Checked during Raft proposal validation. Cannot be violated even momentarily.
- Soft quotas (scheduler-enforced, eventual consistency): `gpu_hours_budget`, `node_hours_budget`, `fair_share_target`, `burst_allowance`. Influence scheduling score but don’t hard-block. May temporarily overshoot during consistency window (~30s), self-correcting via fair-share scoring. When both GPU-hours and node-hours budgets are set, the worse utilization drives the penalty.
Consequences:
- (+) Hard quotas are provably enforced (Raft consensus guarantees).
- (+) Soft quotas don’t bottleneck scheduling (no consensus required for budget checks).
- (+) Consistency window for soft quotas is acceptable (scheduling cycle is 5-30s, budget tracking is for billing not safety).
- (+) Integrates cleanly with Waldur (ADR-008): Waldur updates quotas, Lattice enforces them.
- (-) Soft quotas can temporarily overshoot (by design). Requires clear documentation that GPU-hours tracking is approximate.
- (-) Two enforcement paths add complexity. Developers must know which tier a quota belongs to.
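The two tiers can be sketched as two functions with different return types: the hard tier rejects, the soft tier only scores. This is an illustrative sketch under assumed names (`TenantQuota`, `validate_hard`, `soft_penalty` are hypothetical, not the real crate API).

```rust
/// Hypothetical quota snapshot for one tenant.
struct TenantQuota {
    max_nodes: u32,        // hard: enforced at Raft proposal validation
    gpu_hours_budget: f64, // soft: influences scheduling score only
    gpu_hours_used: f64,
}

/// Hard tier: checked during Raft proposal validation. A violation rejects
/// the proposal outright, so it can never be committed, even momentarily.
fn validate_hard(q: &TenantQuota, owned: u32, requested: u32) -> Result<(), String> {
    if owned + requested > q.max_nodes {
        Err(format!("max_nodes exceeded: {} + {} > {}", owned, requested, q.max_nodes))
    } else {
        Ok(())
    }
}

/// Soft tier: never blocks. Over-budget tenants simply score worse, which
/// self-corrects via fair-share over subsequent scheduling cycles.
fn soft_penalty(q: &TenantQuota) -> f64 {
    let utilization = q.gpu_hours_used / q.gpu_hours_budget;
    (utilization - 1.0).max(0.0) // penalty applies only past the budget
}
```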
ADR-010: Native PMI-2 with Optional PMIx Sidecar
Status: Accepted
Context: Lattice replaces Slurm’s srun, which serves as both a process launcher (fan-out to nodes) and a PMI server (rank/key-value discovery for MPI). Without a PMI provider, multi-node MPI jobs fall back to SSH for process spawning (OpenMPI’s ORTE, MPICH’s Hydra). SSH between compute nodes is a security risk, conflicts with network-domain isolation, and is incompatible with the sensitive workload model. The system must support OpenMPI, MPICH, and Cray MPICH.
Three options were evaluated:
- Full PMIx server in Rust – PMIx v4/v5 is ~200+ attributes, enormous implementation surface, no existing Rust implementation. Rejected: too much scope, too much risk.
- Embed OpenPMIx library via FFI – Battle-tested, full compatibility. But adds a heavy C dependency (~100K LOC), complex FFI, and still requires custom cross-node transport via gRPC.
- Native PMI-2 wire protocol – ~8 text commands over Unix domain socket. Implementable in ~1000-1500 lines of Rust. All three target MPI implementations support PMI-2 natively. The only cross-node operation (kvsfence) maps cleanly to gRPC between node agents.
Decision: Implement a native PMI-2 server in the node agent as the default process management interface. The node agent provides a Unix domain socket per launch, sets PMI_FD/PMI_RANK/PMI_SIZE, and handles cross-node KV exchange (fence) via gRPC between node agents. Optionally, for workloads requiring full PMIx (dynamic spawn, tools API, event notification), support an OpenPMIx sidecar process managed by the node agent, behind the pmix feature flag.
Consequences:
- (+) No SSH between compute nodes. Eliminates an entire class of security and operational issues.
- (+) No external C dependencies for the default path. PMI-2 is simple enough to implement and test in pure Rust.
- (+) All three target MPI implementations (OpenMPI, MPICH, Cray MPICH) work with PMI-2 out of the box.
- (+) Cross-node fence reuses the existing node-agent gRPC infrastructure (management network, mTLS).
- (+) CXI credential management integrates naturally with existing VNI/network-domain lifecycle.
- (+) PMIx available as opt-in for the ~5% of workloads that need it, without burdening the default path.
- (-) PMI-2 does not support dynamic process spawning (`MPI_Comm_spawn`). Rare in HPC but used by some frameworks.
- (-) OpenMPI users must set `OMPI_MCA_pmix=pmi2` (or Lattice sets it automatically). Minor friction.
- (-) PMIx sidecar mode adds a C dependency (OpenPMIx) and a host callback shim (~200 LOC C). Only needed when feature-flagged.
- (-) Fence performance at extreme scale (>1000 nodes) requires tree-based reduction instead of star topology. Optimization deferred until needed.
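To make “~8 text commands over a Unix domain socket” concrete, here is a simplified sketch of parsing one PMI-style command. It is deliberately approximate: the real PMI-2 wire format adds a length prefix and differs in separator details, and the command/attribute names below are examples, not a spec.

```rust
use std::collections::HashMap;

/// Parse a simplified PMI-style command of the form
/// "cmd=<name>;key1=val1;key2=val2;" into (command, attributes).
/// Returns None on malformed input (missing '=' or missing cmd).
fn parse_cmd(line: &str) -> Option<(String, HashMap<String, String>)> {
    let mut attrs = HashMap::new();
    let mut cmd = None;
    for pair in line.trim_end_matches(';').split(';') {
        let (k, v) = pair.split_once('=')?;
        if k == "cmd" {
            cmd = Some(v.to_string());
        } else {
            attrs.insert(k.to_string(), v.to_string());
        }
    }
    cmd.map(|c| (c, attrs))
}
```

In the design above, a `kvs-put`/`kvs-fence` pair from a local MPI rank would be answered from the node agent’s KV store, with the fence barrier implemented as a gRPC exchange between the node agents participating in the launch.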
ADR-011: Observability Data Out-of-Raft
Status: Accepted
Context: The system generates significant observability data: per-node telemetry (CPU, GPU, network, I/O), allocation logs (stdout/stderr), and metrics time series. This data must be queryable by users (dashboards, debugging) and by the scheduler (cost function factors like energy cost and data readiness). The question is where to store it.
Options:
- Raft state machine — guarantees consistency but creates enormous write load (thousands of metric points per second across hundreds of nodes). Raft commit latency becomes the bottleneck for telemetry ingestion.
- External TSDB + S3 — eventually consistent but decouples observability throughput from scheduling throughput. Standard tooling (Grafana, PromQL) works out of the box.
- In-memory ring buffers only — fast but volatile; node agent restart loses history; no cross-node aggregation.
Decision: Observability data is stored entirely outside the Raft state machine. Metrics go to an external TSDB (VictoriaMetrics). Logs are dual-path: ring buffer in the node agent for live streaming, S3 for persistent storage. The scheduler queries the TSDB for cost function inputs. Only sensitive audit events about observability actions (e.g., “user X attached to allocation Y”) flow through Raft consensus (per ADR-004).
Consequences:
- (+) Raft throughput is reserved for what matters: node ownership and sensitive audit.
- (+) Standard observability tooling (Grafana, PromQL) works without custom integration.
- (+) Telemetry pipeline failures do not disrupt scheduling or allocation lifecycle.
- (+) TSDB handles retention, downsampling, and high-cardinality queries natively.
- (-) Metrics are eventually consistent (~30s lag). Scheduler cost function inputs may be slightly stale.
- (-) TSDB is an additional infrastructure dependency to operate.
- (-) Log persistence depends on S3 availability; brief gaps possible during S3 outages (ring buffer covers live access).
ADR-012: Allocation as Universal Work Unit
Status: Accepted
Context: The system must schedule both finite work (training runs, simulations, CI jobs) and infinite work (inference services, monitoring daemons, interactive notebooks). Slurm treats these as fundamentally different (jobs vs. “perpetual” jobs with workarounds). Kubernetes treats everything as a pod/deployment but lacks HPC scheduling semantics. We need a single abstraction that spans both worlds without losing scheduling precision.
Options:
- Two separate types (Job and Service) — clear semantics per type, but duplicates scheduling logic, quota enforcement, preemption policy, and API surface. Every feature must be implemented twice.
- Always bounded (Slurm model) — services require walltime workarounds (submit with max walltime, auto-resubmit). Clumsy and fragile.
- Always unbounded (K8s model) — batch jobs require explicit termination signals. Cannot express “run until completion” natively.
- Single type with lifecycle variants — one Allocation with lifecycle: Bounded | Unbounded | Reactive.
Decision: A single Allocation type is the universal work unit. The lifecycle field determines duration semantics: Bounded (has walltime, completes or is killed), Unbounded (runs until cancelled, auto-restarts on failure), Reactive (scales in response to metrics/load). All scheduling, quota, preemption, checkpoint, and telemetry policies operate on Allocations uniformly. Task Groups (Slurm job arrays) and DAGs (dependency graphs) compose Allocations.
Consequences:
- (+) Unified scheduling: one cost function, one knapsack solver, one preemption engine for all workload types.
- (+) Simpler API: users learn one submission model. Services and batch jobs differ only in lifecycle field.
- (+) Quota and fair-share accounting is uniform — no special cases for services vs. jobs.
- (+) DAG dependencies can mix bounded and unbounded allocations (e.g., training job → inference service).
- (-) Lifecycle variants add complexity to the state machine (Bounded has walltime enforcement; Unbounded has restart policy; Reactive has scaling triggers).
- (-) Users coming from Slurm must learn that “job” and “service” are the same thing with different lifecycle.
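The lifecycle variants sketch naturally as a Rust enum. This is an illustration of the shape described above, not the actual `lattice-common` type; the fields shown are assumptions.

```rust
/// The three lifecycle variants from this ADR (illustrative fields).
enum Lifecycle {
    /// Has a walltime; completes on its own or is killed at the limit.
    Bounded { walltime_secs: u64 },
    /// Runs until cancelled; auto-restarts on failure.
    Unbounded { max_restarts: Option<u32> },
    /// Scales replica count in response to metrics/load.
    Reactive { min_replicas: u32, max_replicas: u32 },
}

/// The universal work unit: batch jobs and services differ only in
/// the lifecycle field.
struct Allocation {
    id: u64,
    lifecycle: Lifecycle,
}

/// Only Bounded allocations are subject to walltime enforcement; every
/// other policy (quota, preemption, telemetry) treats all variants uniformly.
fn walltime_exceeded(a: &Allocation, elapsed_secs: u64) -> bool {
    match a.lifecycle {
        Lifecycle::Bounded { walltime_secs } => elapsed_secs > walltime_secs,
        _ => false,
    }
}
```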
ADR-013: Network Domains via Hardware VNIs
Status: Accepted
Context: Multi-tenant HPC requires network isolation between allocations. On Slingshot/Ultra Ethernet fabrics, the NIC supports Virtual Network Identifiers (VNIs) that provide hardware-enforced L3 isolation. Alternative approaches exist in software.
Options:
- Software-based isolation (Linux network namespaces, iptables) — can be bypassed by privileged processes, adds per-packet overhead, difficult to audit at scale, incompatible with RDMA.
- No network isolation — all allocations share L2/L3. Unacceptable for multi-tenant security and sensitive workloads.
- Full overlay network (Kubernetes CNI model) — adds encapsulation overhead, incompatible with Slingshot fabric semantics, destroys RDMA performance.
- Hardware VNI isolation — Slingshot NIC enforces isolation at line rate, zero software overhead, auditable via fabric manager.
Decision: Network isolation is enforced at the Slingshot hardware level via VNIs. Each network domain maps to a VNI allocated from a managed pool. Allocations in the same domain share a VNI and have L3 reachability. Allocations in different domains are hardware-isolated. VNI assignment is eventually consistent (node agents configure NICs based on quorum-reported domain membership). Sensitive allocations get unique per-allocation domains with encrypted RDMA (Ultra Ethernet).
Consequences:
- (+) Zero-overhead isolation — no per-packet software processing, RDMA performance preserved.
- (+) Hardware-enforced — cannot be bypassed by user processes, even with root inside a container.
- (+) Auditable via fabric manager — network domain membership is visible to operators.
- (+) Naturally integrates with CXI credential management for MPI (ADR-010).
- (-) Tied to Slingshot/Ultra Ethernet hardware. Non-Slingshot deployments need a software fallback.
- (-) VNI pool is finite (default: 3095). Exhaustion blocks new domain creation.
- (-) VNI configuration propagation to NICs adds latency to allocation startup (~50ms).
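The finite-pool consequence is easy to see in a minimal allocator sketch. The type and API below are hypothetical (the real pool, per the ADR, defaults to 3095 VNIs and is managed alongside the fabric manager).

```rust
use std::collections::BTreeSet;

/// Sketch of a VNI pool: VNIs are drawn from a finite range, and exhaustion
/// blocks new network-domain creation until a domain is torn down.
struct VniPool {
    free: BTreeSet<u16>,
}

impl VniPool {
    /// Valid VNIs in this sketch are 1..=size (0 treated as reserved).
    fn new(size: u16) -> Self {
        Self { free: (1..=size).collect() }
    }

    /// Allocate the lowest free VNI, or None when the pool is exhausted.
    fn allocate(&mut self) -> Option<u16> {
        let vni = *self.free.iter().next()?;
        self.free.remove(&vni);
        Some(vni)
    }

    /// Return a VNI to the pool when its network domain is torn down.
    fn release(&mut self, vni: u16) {
        self.free.insert(vni);
    }
}
```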
ADR-014: Conformance Fingerprinting for Configuration Drift Detection
Status: Accepted
Context: Multi-node GPU workloads (distributed training, MPI simulations) are sensitive to configuration heterogeneity. Nodes with different GPU driver versions, NIC firmware, or kernel versions can cause subtle correctness issues (NCCL version mismatches, libfabric ABI incompatibilities) or performance degradation. Slurm has no built-in mechanism to detect this; operators discover it via user bug reports.
Options:
- No tracking — silent failures; users debug configuration drift themselves.
- Exact node-by-node attribute matching — too strict; every firmware update requires simultaneously updating all nodes or scheduling breaks.
- Conformance fingerprint (hash of driver/firmware/kernel) — nodes with identical fingerprints are grouped into cohorts; scheduler places multi-node jobs on same-cohort nodes.
- Scheduler-driven remediation — scheduler triggers firmware updates on non-conforming nodes. Out of scope; OpenCHAMI handles infrastructure.
Decision: Each node agent computes a conformance fingerprint (SHA-256 of GPU driver version, NIC firmware version, BIOS version, kernel version) and reports it with heartbeats. The quorum groups nodes into conformance cohorts. The cost function factor f₉: conformance_fitness penalizes multi-node allocations that would span cohorts. Allocations can set require_conformance: true to hard-require same-cohort placement. Conformance drift on sensitive nodes triggers immediate drain (not remediation — that’s OpenCHAMI’s job).
Consequences:
- (+) Detects configuration drift before it causes user-visible failures.
- (+) Soft by default (penalty, not hard block) — avoids scheduling starvation during rolling updates.
- (+) Hard mode available for workloads that need it (`require_conformance`).
- (+) Sensitive nodes get stricter enforcement (drain on drift) for compliance.
- (-) Fingerprint granularity is coarse. Two nodes with different BIOS settings but same BIOS version have the same fingerprint.
- (-) Multi-node jobs with `require_conformance` may wait longer for same-cohort nodes.
- (-) Rolling firmware updates temporarily create many small cohorts, reducing scheduling flexibility.
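The cohort mechanism reduces to “same fingerprint ⇒ same cohort.” The sketch below uses std’s `DefaultHasher` as a dependency-free stand-in for the SHA-256 hash named in the decision; the struct and function names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// The node attributes that feed the conformance fingerprint (per this ADR).
#[derive(Hash)]
struct NodeConfig {
    gpu_driver: String,
    nic_firmware: String,
    bios: String,
    kernel: String,
}

/// Fingerprint sketch. The real implementation uses SHA-256; DefaultHasher
/// stands in here so the example needs no external crates. Identical configs
/// yield identical fingerprints, which defines a cohort.
fn fingerprint(cfg: &NodeConfig) -> u64 {
    let mut h = DefaultHasher::new();
    cfg.hash(&mut h);
    h.finish()
}

/// Multi-node placement check: all nodes in the set share one fingerprint.
fn same_cohort(nodes: &[NodeConfig]) -> bool {
    nodes.windows(2).all(|w| fingerprint(&w[0]) == fingerprint(&w[1]))
}
```

Note the coarseness trade-off from the consequences list: two nodes with the same version strings but different BIOS settings hash identically here too.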
ADR-015: Attach via nsenter
Status: Accepted
Context: Users need interactive terminal access to running allocations for debugging, monitoring, and interactive workflows (equivalent to Slurm’s srun --pty bash into a running job). The question is how to provide this without compromising isolation or consuming scheduling resources.
Options:
- Create a new “attach” allocation on the same node — goes through the scheduler queue; consumes quota; adds latency; overkill for a debugging session.
- SSH into the compute node — requires SSH key distribution between login and compute nodes; security risk; incompatible with network domain isolation; operationally fragile.
- nsenter from node agent — the node agent enters the allocation’s mount/PID namespace via Linux `nsenter`; a bidirectional gRPC stream provides the PTY. No new resource allocation, no SSH.
- Direct socket from user to container — requires host filesystem access; less secure; doesn’t work with uenv (no container to connect to).
Decision: Attach uses nsenter executed by the node agent. The user’s lattice attach <id> command opens a bidirectional gRPC stream to the API server, which forwards to the node agent hosting the allocation. The node agent spawns a shell inside the allocation’s namespace via nsenter. No new allocation is created, no quota is consumed, and no SSH is involved.
Consequences:
- (+) Instant attach — no scheduler queue, no resource allocation.
- (+) No SSH infrastructure needed on compute nodes.
- (+) Works identically for uenv and Sarus allocations (both use Linux namespaces).
- (+) Attach sessions are logged as observability events (sensitive: Raft-committed audit entry).
- (-) Requires the node agent to have `CAP_SYS_ADMIN` / sufficient privileges for `nsenter`.
- (-) Attach shares the allocation’s resource limits — a heavy debugging tool could impact the running workload.
- (-) If the node agent is down, attach is unavailable (no fallback).
ADR-016: Two-Tier API (Intent API + Compatibility Layer)
Status: Accepted
Context: Lattice must serve two audiences: (1) new users and AI agents who benefit from a declarative, intent-based API (“I need 64 GPU nodes for 2 hours with this data”), and (2) existing Slurm users who have years of scripts using sbatch, squeue, scancel. Supporting both without maintaining two scheduling engines requires a clear layering decision.
Options:
- Single imperative API (Slurm-style) — familiar to HPC users but locks the system into Slurm’s abstractions (partitions, job steps, GRES). Cannot express reactive scaling or data staging intent.
- Single declarative API (Intent-only) — clean design but forces all existing users to rewrite scripts immediately. Migration barrier too high.
- Dual engines — one for Intent, one for Slurm compat. Code duplication, inconsistent scheduling behavior, unmaintainable.
- Two-tier: Intent API as primary, Compatibility API as thin mapping — Slurm commands are translated to Intent API calls. One scheduling engine, one state machine, one set of semantics.
Decision: The Intent API is the primary and only scheduling interface. The Compatibility API (sbatch, squeue, scancel and their lattice submit, lattice status, lattice cancel equivalents) is a stateless translation layer that maps Slurm directives to Intent API fields. All scheduling decisions, state transitions, and quota enforcement happen through the Intent API path. The compat layer produces warnings for unsupported directives but never errors (graceful degradation for migration).
Consequences:
- (+) One scheduling engine, one code path, one set of tests.
- (+) Gradual migration: existing scripts work on day one via compat layer.
- (+) Intent API can evolve freely without Slurm compatibility constraints.
- (+) AI agents use the Intent API directly — no impedance mismatch.
- (-) Some Slurm features have no mapping (hetjob, burst buffer, GRES beyond GPU). Users get warnings.
- (-) Compat layer must be maintained and tested against Slurm script variations.
- (-) Users may stay on compat layer indefinitely, never adopting Intent API features.
ADR-017: Eventual Consistency for Job Queues
Status: Accepted
Context: When a user submits an allocation, how quickly must the system guarantee that the submission is durable and schedulable? Raft consensus provides strong guarantees but adds latency (few ms per commit) and throughput limits. Job queues see bursts (hundreds of submissions in seconds during class assignments or automated pipelines).
Options:
- Synchronous Raft commit on every submission — strong guarantee but adds 10-100ms per submission, bottlenecks the API under burst load, scheduler throughput limited by Raft commit latency.
- Eventually consistent with bounded staleness — submission is acknowledged immediately (stored in-memory queue), committed to Raft asynchronously on the next scheduling cycle. Staleness bounded by scheduling cycle time (~5-30s).
- Optimistic with no retry — submissions may be silently lost on leader failover. Unacceptable.
Decision: Job queue state is eventually consistent. Allocation submissions are acknowledged immediately by the API server and placed in the vCluster scheduler’s in-memory queue. The scheduler proposes allocations to the quorum on each scheduling cycle; the quorum validates and commits node ownership (strong consistency). If the API server fails between acknowledgment and the next scheduling cycle, the submission is lost — but the user receives an allocation ID and can query status, which will show “not found” (detectable failure, not silent). In practice, the window is <30s and API server failures are rare.
Consequences:
- (+) Submission API is fast (<5ms) regardless of Raft cluster health.
- (+) Burst submissions don’t bottleneck on consensus.
- (+) Scheduling cycle naturally batches proposals, reducing Raft commit count.
- (-) Submissions can be lost on API server crash (between ack and next cycle). Mitigated by: client retries on “not found” status, and API server persistence to disk (WAL) as future enhancement.
- (-) Two schedulers may independently queue the same submission if load-balanced. Deduplication by allocation ID at quorum level.
ADR-018: Scheduler-Coordinated Checkpointing
Status: Accepted
Context: Preemption requires evicting running allocations to free resources for higher-priority work. Killing allocations without warning wastes all computed progress. Checkpointing preserves progress but has cost: I/O bandwidth for writing state, compute time lost during checkpoint, and storage for checkpoint data. The question is who decides when to checkpoint.
Options:
- User-initiated checkpointing — user inserts checkpoint calls in their code. Does not solve the preemption problem (scheduler cannot wait for user to decide).
- Periodic automatic checkpointing (fixed interval) — simple but wasteful. Short intervals waste I/O on stable workloads; long intervals lose too much progress on preemption.
- Transparent checkpointing (DMTCP) without cost model — works for any application but causes I/O storms when many allocations checkpoint simultaneously. No way to prioritize which allocations to preempt.
- Scheduler-coordinated with cost function — scheduler evaluates checkpoint value vs. cost per allocation, decides when and which allocations to checkpoint for preemption.
Decision: Checkpointing is scheduler-coordinated. The cost function evaluates checkpoint_value = resource_freed × preemptability + backlog_relief vs. checkpoint_cost = write_time + compute_waste + storage_cost. The scheduler triggers checkpoints by sending CHECKPOINT_HINT to the node agent, which forwards to the application (via signal, shmem flag, or gRPC callback). Applications declare their checkpoint capability (signal, shmem, grpc, dmtcp, or none). Applications with none are either non-preemptible or killed without checkpoint. Backlog pressure increases checkpoint aggressiveness (more allocations waiting → more willing to preempt).
Consequences:
- (+) Checkpoint decisions are globally optimal (scheduler has full visibility of queue, resources, priorities).
- (+) Avoids I/O storms (scheduler staggers checkpoints across time and storage bandwidth).
- (+) Backlog-responsive: system becomes more aggressive about freeing resources when demand is high.
- (+) Applications retain control of checkpoint mechanics (signal handler, custom format).
- (-) Applications must implement checkpoint support to benefit. Unsupported applications are either non-preemptible or lose progress.
- (-) Cost function calibration requires tuning (write bandwidth, storage cost per GB).
- (-) Checkpoint hint is advisory — application may take too long, forcing a hard kill after timeout.
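The value-vs-cost comparison from the decision can be written out directly. The struct fields mirror the formula above; exact units and calibration are the tuning work the consequences list warns about, so treat this as a sketch rather than the `lattice-checkpoint` cost evaluator.

```rust
/// Inputs to the preemption decision (field names mirror the ADR formula;
/// units are illustrative and require calibration in practice).
struct CheckpointEval {
    resource_freed: f64, // node-equivalents released by preempting
    preemptability: f64, // 0.0..=1.0, from the allocation's priority class
    backlog_relief: f64, // queue pressure relieved by freeing these nodes
    write_time: f64,     // estimated seconds to write checkpoint state
    compute_waste: f64,  // compute-seconds lost during the checkpoint
    storage_cost: f64,   // cost of holding the checkpoint data
}

/// ADR-018: checkpoint when value exceeds cost. Backlog pressure raises the
/// value side, so a full queue makes the scheduler more willing to preempt.
fn should_checkpoint(e: &CheckpointEval) -> bool {
    let value = e.resource_freed * e.preemptability + e.backlog_relief;
    let cost = e.write_time + e.compute_waste + e.storage_cost;
    value > cost
}
```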
ADR-019: Eventually Consistent Node Capacity
Status: Accepted
Context: The scheduler needs two kinds of information about nodes: (1) ownership — which tenant/vCluster/allocation owns the node, and (2) capacity — current health, GPU utilization, temperature, available memory. Ownership must be strongly consistent (ADR-004) to prevent double-assignment. But capacity data changes frequently (every heartbeat, ~10s) and is used for scoring, not for correctness.
Options:
- All node updates through Raft — ownership and capacity in one consistent view. But heartbeats every 10s × hundreds of nodes = thousands of Raft writes per minute. Commit latency becomes the scheduling bottleneck.
- All node updates eventually consistent — fast but ownership conflicts are possible. Two schedulers could assign the same node simultaneously.
- Split: ownership via Raft, capacity via eventual consistency — ownership changes are rare (scheduling cycles) and go through Raft. Capacity updates are frequent (heartbeats) and propagated via gossip or direct reporting.
Decision: Node ownership (tenant, vCluster, allocation assignment) is Raft-committed (strong consistency). Node capacity (health, utilization, temperature, conformance fingerprint) is eventually consistent — node agents report to the quorum leader, which updates in-memory state without Raft commit. The scheduler reads the latest reported capacity when scoring. Stale capacity data may cause suboptimal placement but never incorrect ownership.
Consequences:
- (+) Heartbeats do not bottleneck Raft. Hundreds of nodes can report every 10s without consensus overhead.
- (+) Scheduling cycle time is decoupled from Raft commit latency for capacity reads.
- (+) Ownership consistency is preserved — double-assignment is impossible.
- (-) Capacity staleness can cause suboptimal decisions (e.g., scheduling on a node whose GPU just failed but hasn’t reported yet). Bounded by heartbeat interval.
- (-) Two levels of consistency require developers to know which fields are strong vs. eventual.
ADR-020: Sensitive Node Claims by User Identity
Status: Accepted
Context: Sensitive (regulated, high-security) workloads require provable isolation and audit trails that satisfy regulatory requirements (e.g., data protection laws, institutional compliance). The question is what identity is recorded as the “owner” of a sensitive node allocation: the tenant (organizational unit), a role, or the specific user.
Options:
- Tenant-owned — the organizational unit owns the nodes. Cannot prove which individual accessed which data. Insufficient for regulatory audit (“who accessed patient records?”).
- Role-based — a role (e.g., “researcher”) owns the nodes. Same problem: multiple users share a role; individual accountability is lost.
- User-owned (OIDC subject) — the authenticated user’s identity (from OIDC token) is recorded in the Raft-committed audit log as the owner. Every data access, attach session, and log retrieval is tied to a specific person.
Decision: Sensitive allocations are claimed by the authenticated user’s OIDC subject identifier, not by the tenant or a role. The quorum records the user identity in the Raft-committed audit log. All subsequent actions on the allocation (data access, attach, log retrieval) are logged with user identity. Nodes are wiped on release (OpenCHAMI secure erase) with wipe confirmation recorded in the audit log. Audit retention is 7 years.
Consequences:
- (+) Individual accountability: every action is tied to a specific authenticated person.
- (+) Regulatory defensibility: audit trail shows who claimed what, when, and what they did.
- (+) Wipe-on-release with Raft-committed confirmation provides provable data destruction.
- (+) 7-year retention satisfies most regulatory frameworks.
- (-) User identity must be available at claim time (requires OIDC authentication, no service accounts for sensitive claims).
- (-) Sensitive allocations cannot be transferred between users (the claim is to a specific identity).
- (-) Wipe-on-release adds latency to node return-to-pool (10-30 minutes for secure erase).
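The claim-by-identity and retention rules above can be sketched as a minimal Rust data model. Everything here (`AuditRecord`, `AuditAction`, `is_expired`) is an illustrative assumption, not the actual lattice-quorum types:

```rust
// Hypothetical sketch of a Raft-committed audit record for a sensitive
// allocation, keyed by the OIDC subject rather than a tenant or role.
// Type and field names are illustrative, not the real lattice-quorum API.

#[derive(Debug, Clone, PartialEq)]
enum AuditAction {
    Claim { node_ids: Vec<String> },
    Attach { session_id: String },
    WipeConfirmed { node_id: String }, // recorded after OpenCHAMI secure erase
}

#[derive(Debug, Clone)]
struct AuditRecord {
    oidc_subject: String, // the individual user identity from the OIDC token
    action: AuditAction,
    unix_ts: u64,
}

// Retention window: 7 years, per the ADR.
const RETENTION_SECS: u64 = 7 * 365 * 24 * 3600;

fn is_expired(record: &AuditRecord, now: u64) -> bool {
    now.saturating_sub(record.unix_ts) > RETENTION_SECS
}

fn main() {
    let claim = AuditRecord {
        oidc_subject: "sub-1234".into(),
        action: AuditAction::Claim { node_ids: vec!["nid001".into()] },
        unix_ts: 1_700_000_000,
    };
    // A freshly written record is well within the retention window.
    assert!(!is_expired(&claim, 1_700_000_100));
    // One second past 7 years, it no longer needs to be retained.
    assert!(is_expired(&claim, 1_700_000_000 + RETENTION_SECS + 1));
}
```

Because every `AuditRecord` carries the OIDC subject, the "who accessed patient records?" question reduces to a filter over the log rather than a cross-reference through tenant or role membership.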
ADR-021: Data Staging as Invisible Background Pre-stage
Status: Accepted
Context: Many HPC workloads require large datasets (TBs) that may reside on warm or cold storage tiers. If data is not on the hot tier (VAST NFS/S3) when the allocation starts, the first minutes of compute time are wasted on I/O. The question is when and how to move data to the hot tier.
Options:
- User-managed staging — user runs a separate staging job before the compute job. Shifts responsibility; users who forget waste compute time. Incompatible with multi-tenant fairness (staging time counted against user).
- Blocking inline staging — allocation starts, blocks on data transfer before running the entrypoint. User sees unpredictable startup latency. If staging fails, the allocation is stuck in a running-but-waiting state, consuming resources.
- Background pre-staging during queue wait — when an allocation is queued and declares data mounts with tier_hint: hot, the data mover begins warming data to the hot tier while the allocation waits in the queue. Queue wait time becomes productive.
- Post-allocation staging on compute nodes — wastes compute resources on I/O; saturates node-local network bandwidth.
Decision: Data staging runs as a background process during queue wait time. The allocation transitions through a Staging state where the data mover pre-stages declared data mounts from warm/cold to hot tier. The cost function factor f₅: data_readiness scores how ready an allocation’s data is: fully staged allocations score higher and are scheduled sooner. Allocations whose data is not yet ready can still be scheduled if resources are available (staging continues during prologue). Staging failure is non-fatal — the allocation starts with a warning, and the entrypoint may encounter I/O latency.
Consequences:
- (+) Queue wait time is no longer wasted — data moves while the allocation waits.
- (+) Users don’t need to manage staging manually; just declare data mounts.
- (+) Scheduler can prioritize data-ready allocations, improving overall throughput.
- (+) Non-blocking: staging failure degrades performance but doesn’t prevent execution.
- (-) Adds complexity to the allocation state machine (Staging state, data mover integration).
- (-) Hot tier must have capacity for pre-staged data. Over-staging wastes hot tier space.
- (-) Cost function tuning: the f₅ weight determines how much data readiness influences scheduling order.
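A simple linear form of the f₅ data-readiness factor might look like the following Rust sketch: the fraction of declared data-mount bytes already on the hot tier, scaled by the configurable weight. `DataMount`, `f5_data_readiness`, and the linear scaling are assumptions for illustration, not the actual lattice-scheduler cost function:

```rust
// Illustrative f5 (data_readiness) factor: fully staged allocations score
// highest; partially staged ones score proportionally and can still run.

struct DataMount {
    total_bytes: u64,
    staged_bytes: u64, // bytes already warmed to the hot tier
}

/// Returns a score in [0.0, weight].
fn f5_data_readiness(mounts: &[DataMount], weight: f64) -> f64 {
    let total: u64 = mounts.iter().map(|m| m.total_bytes).sum();
    if total == 0 {
        return weight; // nothing declared counts as fully ready
    }
    let staged: u64 = mounts
        .iter()
        .map(|m| m.staged_bytes.min(m.total_bytes))
        .sum();
    weight * (staged as f64 / total as f64)
}

fn main() {
    let mounts = vec![
        DataMount { total_bytes: 100, staged_bytes: 100 }, // fully staged
        DataMount { total_bytes: 100, staged_bytes: 0 },   // still on warm tier
    ];
    // Half the declared bytes are hot, so the score is half the weight.
    let score = f5_data_readiness(&mounts, 1.0);
    assert!((score - 0.5).abs() < 1e-9);
}
```

A factor shaped like this preserves the non-blocking property of the decision: an unstaged allocation scores lower but is never excluded, matching "staging failure is non-fatal" above.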
ADR-022: Three-Layer Telemetry Pipeline
Status: Accepted
Context: The system needs telemetry for three consumers: (1) operators (dashboards, alerts), (2) users (debugging, performance analysis), and (3) the scheduler (cost function inputs: GPU utilization, network congestion, energy cost). Each has different resolution, latency, and retention requirements. The pipeline must handle hundreds of nodes producing thousands of metric points per second.
Options:
- In-memory ring buffers only — fast, low overhead. But volatile: node agent restart loses history. No cross-node aggregation for dashboards. Insufficient for scheduler feedback (requires historical trends).
- Direct eBPF-to-S3 pipeline — durable but high latency. No live metrics for dashboards. Raw data too granular for efficient query.
- Stream all metrics to Raft state machine — consistent but bloats the state machine. Raft commit latency becomes the telemetry bottleneck. Fundamentally wrong abstraction.
- Three-layer: collect (eBPF) → aggregate (configurable resolution) → store (external TSDB) — each layer optimized for its purpose.
Decision: Telemetry follows a three-layer pipeline. Layer 1: eBPF programs (always-on, <0.3% overhead) collect kernel-level metrics at high resolution. Layer 2: the node agent aggregates at configurable resolution (production: 30s bicubic smoothing, debug: 1s raw, audit: access logs). Layer 3: aggregated metrics are pushed to an external TSDB (VictoriaMetrics) for storage, query, and alerting. The scheduler queries the TSDB for cost function inputs. Users query the TSDB via Grafana or the lattice top/lattice metrics commands.
Consequences:
- (+) Each layer is independently scalable and replaceable (swap TSDB, change eBPF programs, adjust resolution).
- (+) eBPF collection is always-on with negligible overhead — no sampling trade-offs.
- (+) Configurable resolution per use case: fine-grained for debugging, coarse for production.
- (+) Standard tooling (Grafana, PromQL, AlertManager) works without custom integration.
- (+) Telemetry pipeline failure does not affect scheduling (graceful degradation: stale cost function inputs).
- (-) Three layers add operational complexity (eBPF programs, agent aggregation config, TSDB deployment).
- (-) End-to-end latency from event to queryable metric is ~30s in production mode.
- (-) eBPF programs require kernel version compatibility and CAP_BPF on nodes.
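The per-use-case Layer 2 resolutions named in the decision (production 30s, debug 1s, audit event-driven access logs) can be sketched as a small Rust config type. `AggregationMode` and `resolution` are assumed names, not the real node-agent config schema:

```rust
// Sketch of the node agent's Layer 2 aggregation settings from ADR-022.
// Names are illustrative; the actual config schema may differ.

use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum AggregationMode {
    Production, // 30s smoothed windows pushed to the TSDB
    Debug,      // 1s raw samples for live troubleshooting
    Audit,      // access-log events, not time-bucketed
}

/// Fixed aggregation interval for a mode, or None for event-driven modes.
fn resolution(mode: AggregationMode) -> Option<Duration> {
    match mode {
        AggregationMode::Production => Some(Duration::from_secs(30)),
        AggregationMode::Debug => Some(Duration::from_secs(1)),
        AggregationMode::Audit => None, // emitted per access event
    }
}

fn main() {
    // Production resolution is also the floor on end-to-end latency from
    // kernel event to queryable metric, per the consequences above.
    assert_eq!(resolution(AggregationMode::Production), Some(Duration::from_secs(30)));
    assert_eq!(resolution(AggregationMode::Debug), Some(Duration::from_secs(1)));
    assert_eq!(resolution(AggregationMode::Audit), None);
}
```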
ADR-023: vCluster as Soft Isolation Boundary
Status: Accepted
Context: Different workload types need different scheduling policies: HPC batch needs backfill with topology packing, ML training needs fair-share with GPU affinity, services need bin-packing with autoscale, sensitive needs dedicated reservation. A single scheduler cannot optimize for all simultaneously. But hard partitioning wastes resources when one workload type is idle while another is starved.
Options:
- Hard partitioning (dedicated node pools per workload type) — simple isolation but guaranteed waste. If the ML training pool is 50% idle and HPC batch is oversubscribed, resources sit unused.
- Single global scheduler with workload-type heuristics — no waste but cannot apply fundamentally different policies (backfill vs. bin-pack) simultaneously. Policy conflicts create unpredictable behavior.
- Opaque vClusters (cannot see each other) — avoids conflicts but makes cross-vCluster fairness impossible. Borrowing is non-deterministic because the lending vCluster doesn’t know its own utilization relative to others.
- Soft vClusters with global visibility — each vCluster has its own scheduler and cost function weights, but all schedulers see the global node ownership state via the quorum. Borrowing is explicit and policy-driven.
Decision: vClusters are soft isolation boundaries. Each vCluster has an independent scheduler instance with its own cost function weights (ADR-002) and scheduling algorithm (backfill, bin-pack, reservation, FIFO). All schedulers read the same global state from the quorum. vClusters have base allocations (guaranteed node counts) and can borrow from other vClusters with explicit priority and duration. Borrowed nodes are returned when the lending vCluster needs them (preemption of borrowed allocations at lower priority). The quorum enforces that proposals from different vCluster schedulers don’t conflict (node ownership is Raft-committed).
Consequences:
- (+) Each workload type gets an optimized scheduler without one-size-fits-all compromises.
- (+) No waste: idle resources in one vCluster are available to others via borrowing.
- (+) Fair-share is globally visible: f₃ can compare a tenant’s usage across all vClusters.
- (+) Borrowing is explicit and reversible: lending vCluster retains priority over its base allocation.
- (-) Multiple schedulers proposing simultaneously can cause Raft proposal conflicts (one rejected, retried next cycle). Not a bug, but adds latency under contention.
- (-) Borrowing policy configuration is complex (priority levels, max borrow duration, return grace period).
- (-) Operators must understand that vClusters are not security boundaries — they are scheduling policy boundaries. Tenant isolation is provided by RBAC and network domains, not vClusters.
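The borrow-and-reclaim arithmetic behind the decision can be sketched with simple node counts per vCluster. `VCluster` and its methods are hypothetical names, not the actual lattice-scheduler API:

```rust
// Sketch of ADR-023 borrowing: a vCluster lends idle base capacity, and
// borrowed allocations run at lower priority so they can be preempted
// when the lender reclaims nodes. All names are illustrative.

#[derive(Debug)]
struct VCluster {
    base_nodes: u32, // guaranteed allocation
    used_nodes: u32, // currently running on its own base
    lent_nodes: u32, // currently lent to other vClusters
}

impl VCluster {
    /// Idle base capacity still available to lend.
    fn lendable(&self) -> u32 {
        self.base_nodes
            .saturating_sub(self.used_nodes)
            .saturating_sub(self.lent_nodes)
    }

    /// Borrowed nodes the lender must preempt and reclaim to satisfy
    /// new demand against its own base allocation.
    fn reclaim_needed(&self, extra_demand: u32) -> u32 {
        let free = self
            .base_nodes
            .saturating_sub(self.used_nodes + self.lent_nodes);
        extra_demand.saturating_sub(free).min(self.lent_nodes)
    }
}

fn main() {
    let ml = VCluster { base_nodes: 100, used_nodes: 40, lent_nodes: 20 };
    // 100 - 40 - 20 = 40 idle nodes remain lendable.
    assert_eq!(ml.lendable(), 40);
    // Demand for 50 more: 40 are free, so 10 borrowed nodes are preempted.
    assert_eq!(ml.reclaim_needed(50), 10);
    // Demand within free capacity triggers no preemption.
    assert_eq!(ml.reclaim_needed(30), 0);
}
```

In the real system the reclaim decision would be a Raft-committed proposal, since node ownership is global state; this sketch only captures the counting.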
External References
Core Infrastructure Projects
OpenCHAMI
- What: Open-source HPC system management platform (provisioning, boot, inventory)
- Repo: https://github.com/OpenCHAMI
- Docs: https://openchami.org
- Components we integrate with: SMD (State Management Daemon), BSS (Boot Script Service), Magellan (Redfish discovery), OPAAL (auth), Cloud-init
- Founded by: LANL, NERSC, CSCS, HPE, University of Bristol
- Language: Go
- Our integration: Infrastructure plane — Lattice queries SMD for node inventory, triggers BSS for boot image selection (e.g., sensitive hardened image), uses Magellan for hardware discovery
FirecREST
- What: RESTful API gateway for HPC systems
- Repo: https://github.com/eth-cscs/firecrest
- Docs: https://firecrest.readthedocs.io
- Our integration: Optional — lattice authenticates directly via hpc-auth. FirecREST is only needed for hybrid Slurm deployments where it serves as a passthrough compatibility gateway.
uenv
- What: User environment tool for mounting SquashFS software stacks
- Repo: https://github.com/eth-cscs/uenv
- Related: https://github.com/eth-cscs/squashfs-mount (setuid mount binary), https://github.com/eth-cscs/slurm-uenv-mount (Slurm SPANK plugin)
- Docs: https://docs.cscs.ch/software/uenv/using/
- Key properties: SquashFS images, mount namespace isolation (per-process-tree), setuid binary (not FUSE), Spack-built stacks via Stackinator, multiple mount points (/user-environment, /user-tools)
- Our integration: Software plane — node agent uses squashfs-mount to deliver uenv to allocations. We replace the Slurm SPANK plugin with native node agent integration.
Sarus
- What: OCI-compliant container runtime for HPC
- Repo: https://github.com/eth-cscs/sarus
- Key properties: Near-native performance, direct GPU/interconnect access via OCI hooks, no network namespace overhead for MPI
- Our integration: Software plane — used when full container isolation is needed (multi-tenant node sharing, third-party images, sensitive workloads with enhanced isolation)
Sovra
- What: Federated sovereign key management for critical infrastructure
- Repo: https://github.com/witlox/sovra
- Docs: https://witlox.github.io/sovra/
- Key properties: Peer-to-peer control planes, customer-controlled root keys, OPA-based policy, air-gap capable, cross-domain sharing
- Language: Go
- Our integration: Federation trust layer (optional, feature-gated). Provides cross-site authentication, sensitive data encryption key management, audit log signing.
Networking
Slingshot (HPE CXI)
- What: HPE’s HPC interconnect, dragonfly topology
- Key properties: Hardware traffic classes, VNIs for isolation, high-radix switches, RDMA
- Scheduler relevance: Topology-aware placement (minimize inter-group hops), VNI-based network domains, separate traffic classes for compute/management/telemetry
Ultra Ethernet Consortium (UEC)
- What: Open Ethernet-based networking stack for AI/HPC
- Spec: https://ultraethernet.org (1.0 released June 2025)
- Key properties: UET transport (native RDMA over Ethernet), packet spraying (adaptive multi-path), CSIG (in-band congestion signaling), built-in encryption, libfabric 2.0 API
- Relationship to Slingshot: ~75% of UET derives from Slingshot transport. Migration path is evolutionary, not revolutionary.
- Scheduler relevance: CSIG feeds into telemetry (congestion-aware scheduling), encryption simplifies sensitive compliance, libfabric abstraction enables fabric-agnostic scheduler
libfabric
- What: Fabric abstraction library (provider-based: CXI for Slingshot, EFA for AWS, verbs for InfiniBand, UET for Ultra Ethernet)
- Our integration: Network fabric abstraction. The scheduler and node agent interact with the network via libfabric, making the scheduler fabric-agnostic.
Storage
VAST Data Platform
- What: All-flash unified storage (NFS + S3 + block), DASE architecture
- Key properties: Multiprotocol (NFS + S3 native), RESTful API for everything, QoS per export, auto-indexing catalog, snapshots, DataSpace (global namespace with prefetch)
- Scheduler integration: QoS setting at job start, data locality queries via Catalog API, pre-staging via DataSpace prefetch, snapshots for reproducibility, audit logs for sensitive compliance
IBM Storage Scale (GPFS)
- What: Parallel file system with extensive management features
- Key properties: Placement policies, AFM (async data management), filesets with quotas, watch/callback API, transparent cloud tiering
- Scheduler integration: Alternative to VAST. Fileset-per-job for isolation, placement policies for workload-specific tuning, AFM for remote data staging.
Research Papers
CSCS Alps Architecture
- Martinasso, Klein, Schulthess. “Alps, a versatile research infrastructure.” CUG 2025. arXiv:2507.02404
- Alam, Gila, Klein, Martinasso, Schulthess. “Versatile software-defined HPC and cloud clusters on Alps supercomputer for diverse workflows.” IJHPCA 2023.
- Martinasso et al. “Resource Elasticity for Scientific Platforms on HPC Infrastructure.” Springer 2025.
Scheduler Simulation
- Martinasso, Gila, Bianco, Alam, McMurtrie, Schulthess. “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management.” SC18.
Multi-Objective Scheduling
- Simon, Nguyen, Halem. “Multiple Objective Scheduling of HPC Workloads Through Dynamic Prioritization.” Uses bounded fractional knapsack with dynamic priority scoring.
- Goponenko. “Objective-Driven Strategies for HPC Job Scheduling.” UCF 2024. Comprehensive metrics for scheduling quality, I/O-aware backfill.
Energy-Aware Federation
- “Power-Aware Scheduling for Multi-Center HPC Electricity Cost Optimization.” arXiv:2503.11011. GNN-based power prediction + multi-site scheduling, up to 18% energy cost reduction.
uenv Deployment
- Coles et al. “Deploying Alternative User Environments on Alps.” CUG 2023. Details squashfs-mount, Slurm SPANK plugin, Spack stack building.
ML on HPC
- CSCS. “Evolving HPC services to enable ML workloads on HPE Cray EX.” CUG 2025. arXiv:2507.01880. Container Engine, Environment Definition Files, gaps for ML users.