Lattice
A distributed workload scheduler for large-scale scientific computing, AI/ML training, inference services, and regulated workloads.
Lattice schedules both finite jobs (batch training, simulations) and infinite jobs (inference services, monitoring) on shared HPC infrastructure with topology-aware placement, federated multi-site operation, and a unified API for human users and autonomous agents.
Architecture at a Glance
| Plane | Components |
|---|---|
| User Plane | lattice-cli + lattice-api (OIDC via hpc-auth) |
| Software Plane | uenv (SquashFS) + Sarus (OCI) + Registry |
| Scheduling Plane | Raft Quorum + vCluster Schedulers (knapsack) |
| Data Plane | VAST (NFS/S3) tiered storage + data mover |
| Network Fabric | Slingshot / Ultra Ethernet (libfabric) |
| Node Plane | Node Agent + mount namespaces + eBPF telemetry |
| Infrastructure | OpenCHAMI (Redfish BMC, boot, inventory) |
Start with System Architecture for the full picture, or jump to API Design to see how users interact with the system.
Source Code
The project is organized as a Rust workspace with 9 crates:
| Crate | Purpose |
|---|---|
| lattice-common | Shared types, config, protobuf bindings |
| lattice-quorum | Raft consensus, global state machine, audit log |
| lattice-scheduler | vCluster schedulers, knapsack solver, cost function |
| lattice-api | gRPC + REST server, OIDC, RBAC, mTLS |
| lattice-checkpoint | Checkpoint broker, cost evaluator |
| lattice-node-agent | Per-node daemon, GPU discovery, eBPF telemetry |
| lattice-cli | CLI binary (submit, status, cancel, session, telemetry) |
| lattice-test-harness | Shared mocks, fixtures, builders |
| lattice-acceptance | BDD scenarios and property tests |
Plus a Python SDK, an RM-Replay simulator, and deployment configs in infra/.
Getting Started
Overview
Lattice is a distributed workload scheduler for HPC and AI infrastructure. It schedules both batch jobs (training runs, simulations) and long-running services (inference endpoints, monitoring) on shared GPU-accelerated clusters.
If you’re coming from Slurm, most concepts map directly — see the Slurm migration guide for a quick comparison.
Prerequisites
- A running Lattice cluster (ask your admin for the API endpoint)
- The lattice CLI installed on your workstation or login node
- Your tenant credentials (OIDC token or mTLS certificate)
Installing the CLI
# Determine architecture
ARCH=$(uname -m | sed 's/aarch64/arm64/')
# Download from GitHub Releases
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-${ARCH}.tar.gz" | tar xz
sudo mv lattice /usr/local/bin/
# Or build from source
cargo build --release -p lattice-cli
sudo cp target/release/lattice /usr/local/bin/
Configuration
Create ~/.config/lattice/config.yaml:
endpoint: "lattice-api.example.com:50051"
tenant: "my-team"
# Optional: default vCluster
vcluster: "gpu-batch"
Or use environment variables:
export LATTICE_ENDPOINT="lattice-api.example.com:50051"
export LATTICE_TENANT="my-team"
Your First Job
Submit a batch script
lattice submit train.sh
# Submitted allocation a1b2c3d4
Check status
lattice status
# ID NAME STATE NODES WALLTIME ELAPSED VCLUSTER
# a1b2c3d4 train.sh Running 4 24:00:00 00:12:34 gpu-batch
View logs
lattice logs a1b2c3d4
# [2026-03-05T10:00:12Z] Epoch 1/100, loss=2.341
# [2026-03-05T10:01:45Z] Epoch 2/100, loss=1.892
Cancel a job
lattice cancel a1b2c3d4
Next Steps
- Submitting Workloads — detailed submission options
- Interactive Sessions — attach a terminal to running jobs
- DAG Workflows — multi-step pipelines with dependencies
- Python SDK — programmatic access from notebooks and agents
Submitting Workloads
Basic Submission
# Run a script on 4 nodes for up to 24 hours
lattice submit --nodes=4 --walltime=24h train.sh
# With GPU constraints
lattice submit --nodes=8 --walltime=72h --constraint="gpu_type=GH200" -- torchrun train.py
# With a software environment (uenv)
lattice submit --nodes=2 --uenv=prgenv-gnu/24.11:v1 -- make -j run
Script Directives
Lattice parses #LATTICE directives from your script (and #SBATCH for compatibility):
#!/bin/bash
#LATTICE --nodes=64
#LATTICE --walltime=72h
#LATTICE --uenv=prgenv-gnu/24.11:v1
#LATTICE --vcluster=ml-training
#LATTICE --tenant=physics
#LATTICE --name=large-training-run
torchrun --nproc_per_node=4 train.py --data /scratch/dataset
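The directive scan above can be sketched roughly as follows. This is a hypothetical helper for illustration, not the actual lattice-cli parser: it reads `#LATTICE` (or `#SBATCH`) comment lines and collects `--key=value` options, treating bare flags as booleans.

```python
import re

def parse_directives(script: str) -> dict:
    """Collect --key[=value] options from #LATTICE and #SBATCH comment lines."""
    opts = {}
    for line in script.splitlines():
        m = re.match(r"#(?:LATTICE|SBATCH)\s+--([\w-]+)(?:=(.*))?$", line.strip())
        if m:
            key, value = m.groups()
            opts[key] = value if value is not None else True  # bare flags become True
    return opts

script = """#!/bin/bash
#LATTICE --nodes=64
#LATTICE --walltime=72h
#LATTICE --name=large-training-run
torchrun --nproc_per_node=4 train.py
"""
assert parse_directives(script) == {
    "nodes": "64", "walltime": "72h", "name": "large-training-run"}
```

Command lines like the `torchrun` invocation are ignored; only leading comment directives contribute options.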
Resource Constraints
# GPU type
lattice submit --constraint="gpu_type=GH200,gpu_count=4" script.sh
# Memory requirements
lattice submit --constraint="memory_gb>=512" script.sh
# Require unified memory (GH200/MI300A superchip)
lattice submit --constraint="require_unified_memory" script.sh
# Prefer same NUMA domain
lattice submit --constraint="prefer_same_numa" script.sh
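A constraint string like the ones above can be read as a comma-separated list of comparisons and boolean flags. The sketch below (hypothetical helper, assumed semantics) shows one way to split it into `(key, operator, value)` triples:

```python
def parse_constraints(expr: str):
    """Split a --constraint string into (key, op, value) triples.
    Bare names like require_unified_memory are treated as boolean flags."""
    triples = []
    for part in expr.split(","):
        part = part.strip()
        for op in (">=", "<=", "="):
            if op in part:
                key, value = part.split(op, 1)
                triples.append((key, op, value))
                break
        else:
            triples.append((part, "=", "true"))
    return triples

assert parse_constraints("gpu_type=GH200,gpu_count=4") == [
    ("gpu_type", "=", "GH200"), ("gpu_count", "=", "4")]
assert parse_constraints("memory_gb>=512") == [("memory_gb", ">=", "512")]
assert parse_constraints("require_unified_memory") == [
    ("require_unified_memory", "=", "true")]
```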
Task Groups (Job Arrays)
Submit multiple instances of the same job:
# 100 tasks, 20 running concurrently
lattice submit --task-group=0-99%20 sweep.sh
# Task index available as $LATTICE_TASK_INDEX
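The `0-99%20` spec breaks down into a task index range and a concurrency cap. A small sketch of that parsing (hypothetical helper name):

```python
def parse_task_group(spec: str):
    """Parse a START-END[%MAX] task-group spec into task indices and a concurrency cap."""
    rng, _, limit = spec.partition("%")
    start, _, end = rng.partition("-")
    indices = list(range(int(start), int(end) + 1))
    return indices, (int(limit) if limit else None)

indices, max_concurrent = parse_task_group("0-99%20")
assert len(indices) == 100 and indices[0] == 0 and indices[-1] == 99
assert max_concurrent == 20
```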
Dependencies
# Run after job succeeds
lattice submit --depends-on=a1b2c3d4:success postprocess.sh
# Run after job completes (success or failure)
lattice submit --depends-on=a1b2c3d4:any cleanup.sh
# Multiple dependencies
lattice submit --depends-on=job1:success,job2:success merge.sh
Data Staging
Lattice can pre-stage data to the hot tier before your job starts:
lattice submit --data-mount="s3://bucket/dataset:/data" --nodes=4 train.sh
The scheduler evaluates data readiness as part of the cost function — jobs with data already on the hot tier are prioritized.
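The docs do not spell out the cost function's exact form; as an illustrative toy model (both the weights and the linear shape are invented here), data readiness might enter placement priority like this:

```python
def placement_score(queue_wait_s: float, data_ready_fraction: float,
                    w_wait: float = 1.0, w_data: float = 600.0) -> float:
    """Toy priority score: longer queue waits and hotter data both raise it."""
    return w_wait * queue_wait_s + w_data * data_ready_fraction

# A job whose dataset is fully staged outranks an identical job with cold data.
staged = placement_score(queue_wait_s=600, data_ready_fraction=1.0)
cold = placement_score(queue_wait_s=600, data_ready_fraction=0.1)
assert staged > cold
```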
Lifecycle Types
Bounded (batch) — default
lattice submit --walltime=24h train.sh
Job runs until completion or walltime, then terminates.
Unbounded (service)
lattice submit --service --expose=8080 serve.sh
Runs indefinitely. Exposed ports are reachable via the network domain.
Reactive (autoscaling)
lattice submit --reactive --min-nodes=1 --max-nodes=8 \
--scale-metric=gpu_utilization --scale-target=0.8 serve.sh
Automatically scales between min and max nodes based on the target metric.
Preemption Classes
Higher preemption class = harder to preempt:
# Best-effort (preempted first)
lattice submit --preemption-class=0 experiment.sh
# Normal priority (default: 5)
lattice submit train.sh
# High priority
lattice submit --preemption-class=8 critical-training.sh
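Victim selection under this scheme can be pictured as sorting candidates by class, lowest first. This is only a sketch; the real scheduler presumably weighs other factors too:

```python
jobs = [
    {"id": "experiment", "preemption_class": 0},  # best-effort
    {"id": "train", "preemption_class": 5},       # default
    {"id": "critical", "preemption_class": 8},    # high priority
]
# Lower class is preempted first, so ascending order gives the victim queue.
victims = sorted(jobs, key=lambda j: j["preemption_class"])
assert [j["id"] for j in victims] == ["experiment", "train", "critical"]
```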
Checkpointing
If your application supports checkpointing, declare it:
# Signal-based (receives SIGUSR1 before preemption)
lattice submit --checkpoint=signal train.sh
# gRPC callback
lattice submit --checkpoint=grpc --checkpoint-port=9999 train.sh
# Shared memory flag
lattice submit --checkpoint=shmem train.sh
# Non-preemptible (no checkpoint, never preempted)
lattice submit --no-preempt train.sh
Slurm Compatibility
Existing Slurm scripts work with minimal changes:
# These are equivalent
sbatch --nodes=4 --time=24:00:00 --partition=gpu train.sh
lattice submit --nodes=4 --walltime=24h --vcluster=gpu train.sh
Supported #SBATCH directives are automatically translated. See Slurm Migration for details.
Output Formats
# Default: human-readable table
lattice status
# JSON (for scripting)
lattice status -o json
# YAML
lattice status -o yaml
# Wide (more columns)
lattice status -o wide
Interactive Sessions
Interactive sessions give you a terminal attached to allocated compute nodes — similar to salloc + srun --pty in Slurm.
Creating a Session
# Basic interactive session (1 node, 4 hours)
lattice session --walltime=4h
# With GPU and software environment
lattice session --nodes=1 --constraint="gpu_type=GH200" --uenv=prgenv-gnu/24.11:v1
# Specify vCluster
lattice session --vcluster=interactive --walltime=2h
The session enters the queue like any other allocation. Once scheduled, your terminal automatically attaches to the first node.
Attaching to Running Allocations
You can attach a terminal to any running allocation (not just sessions):
# Attach to a running job
lattice attach a1b2c3d4
# Attach to a specific node in a multi-node allocation
lattice attach a1b2c3d4 --node=nid001234
# Run a specific command instead of a shell
lattice attach a1b2c3d4 -- htop
Multiple Terminals
You can open multiple terminals to the same allocation:
# Terminal 1
lattice attach a1b2c3d4
# Terminal 2 (different shell window)
lattice attach a1b2c3d4
Session Lifecycle
- Pending — waiting in the queue for resources
- Running — terminal is attached, you’re working
- Disconnected — if you lose connection, the session keeps running (use tmux/screen inside for persistence)
- Completed — walltime expired or you exited
Tips
- Use tmux or screen inside your session for disconnect resilience
- Sessions respect the same preemption rules as batch jobs — use --preemption-class=7 for important interactive work
- If preempted, you’ll see checkpoint progress in your terminal before disconnection
- The --walltime flag is mandatory for sessions (prevents runaway resource usage)
DAG Workflows
DAGs (Directed Acyclic Graphs) let you define multi-step pipelines where allocations depend on each other.
YAML Definition
# workflow.yaml
name: training-pipeline
allocations:
- name: preprocess
entrypoint: "python preprocess.py"
nodes: 2
walltime: "2h"
- name: train
entrypoint: "torchrun train.py"
nodes: 64
walltime: "72h"
uenv: "prgenv-gnu/24.11:v1"
depends_on:
- preprocess: success
- name: evaluate
entrypoint: "python eval.py"
nodes: 1
walltime: "1h"
depends_on:
- train: success
- name: notify-failure
entrypoint: "python notify.py --status=failed"
nodes: 1
walltime: "10m"
depends_on:
- train: failure
Submitting a DAG
lattice dag submit workflow.yaml
# Submitted DAG d1e2f3g4 with 4 allocations
Dependency Conditions
| Condition | Meaning |
|---|---|
| success | Run after dependency completes successfully |
| failure | Run after dependency fails |
| any | Run after dependency completes (success or failure) |
| corresponding | For task groups: task N depends on task N of the parent |
Monitoring DAGs
# DAG status overview
lattice dag status d1e2f3g4
# Detailed graph view
lattice dag status d1e2f3g4 --graph
# Output:
# preprocess [Completed] → train [Running] → evaluate [Pending]
# ↘ notify-failure [Pending]
Cancelling a DAG
# Cancel all allocations in the DAG
lattice dag cancel d1e2f3g4
Cancellation cascades — downstream allocations that haven’t started are cancelled automatically.
Failure Propagation
- If a success dependency fails, downstream allocations are cancelled
- If a failure dependency succeeds, those downstream allocations are skipped
- any dependencies always run regardless of upstream outcome
Limits
- Maximum 1000 allocations per DAG (configurable by admin)
- Cycles are rejected at submission time
- Duplicate allocation names within a DAG are rejected
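The duplicate-name and cycle checks can be sketched with Kahn's topological sort. This is an illustrative model, not Lattice's actual validator, and it ignores dependency conditions (success/failure/any), treating `depends_on` as a plain list of parent names:

```python
from collections import defaultdict, deque

def validate_dag(allocations):
    """Submission-time checks: reject duplicate names, then reject cycles via Kahn's algorithm."""
    names = [a["name"] for a in allocations]
    if len(names) != len(set(names)):
        raise ValueError("duplicate allocation names")
    known = set(names)
    parents = {a["name"]: [p for p in a.get("depends_on", []) if p in known]
               for a in allocations}
    indegree = {n: len(parents[n]) for n in names}
    children = defaultdict(list)
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    ready = deque(n for n in names if indegree[n] == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if visited != len(names):
        raise ValueError("cycle detected")
    return True

pipeline = [
    {"name": "preprocess"},
    {"name": "train", "depends_on": ["preprocess"]},
    {"name": "evaluate", "depends_on": ["train"]},
    {"name": "notify-failure", "depends_on": ["train"]},
]
assert validate_dag(pipeline)
```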
Monitoring & Observability
Allocation Status
# Your allocations
lattice status
# Specific allocation
lattice status a1b2c3d4
# Filter by state
lattice status --state=running
lattice status --state=pending
# All tenant allocations (requires permissions)
lattice status --all
# Watch mode (refreshes every 5 seconds)
lattice status --watch
lattice watch a1b2c3d4
Logs
# View logs (from S3 persistent store)
lattice logs a1b2c3d4
# Live tail (streaming)
lattice logs a1b2c3d4 --follow
# Last N lines
lattice logs a1b2c3d4 --tail=100
Metrics
Query metrics for a running allocation:
# Snapshot of current metrics
lattice metrics a1b2c3d4
# Output:
# METRIC VALUE UNIT
# gpu_utilization 87.3 %
# gpu_memory_used 71.2 GB
# cpu_utilization 45.1 %
# memory_used 384.0 GB
# network_rx 12.4 GB/s
# network_tx 8.7 GB/s
Live metrics stream:
lattice metrics a1b2c3d4 --stream
Diagnostics
Combined view of network and storage health for an allocation:
lattice diagnostics a1b2c3d4
# Network diagnostics only
lattice diagnostics a1b2c3d4 --network
# Storage diagnostics only
lattice diagnostics a1b2c3d4 --storage
Cross-Allocation Comparison
Compare metrics between two allocations (useful for A/B experiments):
lattice compare a1b2c3d4 e5f6g7h8 --metric=gpu_utilization
Cluster Overview
# List all nodes
lattice nodes
# Filter by state
lattice nodes --state=ready
lattice nodes --state=draining
# Specific node details
lattice nodes nid001234
Python SDK
The Lattice Python SDK provides an async client for interacting with the REST API from notebooks, scripts, and autonomous agents.
Installation
pip install lattice-sdk
Quick Start
import asyncio
from lattice_sdk import LatticeClient, AllocationSpec
async def main():
async with LatticeClient("lattice-api.example.com", 8080) as client:
# Submit an allocation
alloc = await client.submit(AllocationSpec(
entrypoint="python train.py",
nodes=4,
walltime="24h",
tenant="ml-team",
))
print(f"Submitted: {alloc.id}")
# Check status
status = await client.status(alloc.id)
print(f"State: {status.state}")
# Wait for completion
async for event in client.watch(alloc.id):
print(f"State changed: {event.state}")
if event.state in ("Completed", "Failed", "Cancelled"):
break
asyncio.run(main())
Core Methods
Submission
# Basic submission
alloc = await client.submit(AllocationSpec(
entrypoint="torchrun train.py",
nodes=64,
walltime="72h",
uenv="prgenv-gnu/24.11:v1",
constraints={"gpu_type": "GH200"},
))
# Submit DAG
dag = await client.submit_dag("workflow.yaml")
Status & Listing
# Get allocation
alloc = await client.status(alloc_id)
# List allocations
allocs = await client.list_allocations(state="running")
# List nodes
nodes = await client.list_nodes(state="ready")
Monitoring
# Stream logs
async for line in client.stream_logs(alloc_id):
print(line.message)
# Query metrics
metrics = await client.query_metrics(alloc_id)
print(f"GPU util: {metrics.gpu_utilization}%")
# Stream metrics
async for snapshot in client.stream_metrics(alloc_id):
print(f"GPU: {snapshot.gpu_utilization}%")
# Watch state changes
async for event in client.watch(alloc_id):
print(f"State: {event.state}")
Management
# Cancel
await client.cancel(alloc_id)
# Checkpoint
await client.checkpoint(alloc_id)
Tenants & vClusters
tenants = await client.list_tenants()
vclusters = await client.list_vclusters()
Error Handling
from lattice_sdk import LatticeError, LatticeNotFoundError, LatticeAuthError
try:
alloc = await client.status("nonexistent-id")
except LatticeNotFoundError:
print("Allocation not found")
except LatticeAuthError:
print("Authentication failed")
except LatticeError as e:
print(f"API error ({e.status_code}): {e}")
Authentication
# Token-based (OIDC)
client = LatticeClient("api.example.com", 8080, token="eyJ...")
# Headers
client = LatticeClient("api.example.com", 8080, headers={"X-Tenant": "my-team"})
Slurm Migration
Command Mapping
| Slurm | Lattice | Notes |
|---|---|---|
| sbatch script.sh | lattice submit script.sh | #SBATCH directives are parsed |
| squeue | lattice status | |
| squeue -u $USER | lattice status | Default shows own jobs |
| scancel 12345 | lattice cancel 12345 | |
| salloc | lattice session | Interactive allocation |
| srun --pty bash | lattice attach <id> | Attach terminal |
| sinfo | lattice nodes | Cluster node overview |
| sacct | lattice status --all | Historical view |
Directive Mapping
| #SBATCH Directive | Lattice Equivalent | Notes |
|---|---|---|
| --nodes=N | --nodes=N | Exact match |
| --ntasks=N | — | Mapped to node count: ceil(N / tasks_per_node) |
| --ntasks-per-node=N | — | Passed as task config |
| --time=HH:MM:SS | --walltime=HH:MM:SS | Also accepts 24h, 30m shorthand |
| --partition=X | --vcluster=X | Configurable partition→vCluster mapping |
| --account=X | --tenant=X | Account→tenant mapping |
| --job-name=X | --name=X | |
| --output=file | — | Logs always go to persistent store; download path configurable |
| --error=file | — | Same as --output |
| --constraint=X | --constraint=X | Feature matching |
| --gres=gpu:N | --constraint="gpu_count=N" | |
| --qos=X | --preemption-class=N | Configurable QOS→class mapping |
| --array=0-99%20 | --task-group=0-99%20 | |
| --dependency=afterok:ID | --depends-on=ID:success | |
| --exclusive | Default | Lattice always allocates full nodes |
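Two of the mappings above involve arithmetic rather than renaming. A sketch of both conversions (hypothetical helper names; walltime units assumed to be h/m/s as shown in the table):

```python
import math

def parse_walltime(s: str) -> int:
    """Accept HH:MM:SS or shorthand like 24h / 30m; return seconds."""
    if ":" in s:
        h, m, sec = (int(x) for x in s.split(":"))
        return h * 3600 + m * 60 + sec
    return int(s[:-1]) * {"h": 3600, "m": 60, "s": 1}[s[-1]]

def ntasks_to_nodes(ntasks: int, tasks_per_node: int) -> int:
    """--ntasks=N maps to ceil(N / tasks_per_node) full nodes."""
    return math.ceil(ntasks / tasks_per_node)

assert parse_walltime("24:00:00") == parse_walltime("24h") == 86400
assert parse_walltime("30m") == 1800
assert ntasks_to_nodes(10, 4) == 3  # 10 tasks at 4 per node need 3 full nodes
```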
Environment Variables
When Slurm compatibility is enabled (compat.set_slurm_env: true), Lattice sets familiar environment variables inside allocations:
| Variable | Value |
|---|---|
| SLURM_JOB_ID | Allocation ID |
| SLURM_JOB_NAME | Allocation name |
| SLURM_NNODES | Number of allocated nodes |
| SLURM_NODELIST | Comma-separated node list |
| SLURM_NTASKS | Task count |
| SLURM_SUBMIT_DIR | Working directory at submission |
Lattice also sets its own LATTICE_* equivalents.
What’s Different
Full-Node Scheduling
Lattice always allocates full nodes (no sub-node sharing). This simplifies resource management and improves performance isolation. If you’re used to --ntasks=1 on a shared node, you’ll get the whole node.
No Partitions — vClusters
Slurm partitions map to Lattice vClusters, but vClusters are more flexible: each has its own scheduling policy (backfill, bin-pack, FIFO, reservation) and weight tuning.
Topology-Aware Placement
Lattice automatically packs multi-node jobs within the same Slingshot dragonfly group for optimal network performance. No manual --switches needed.
Data Staging
Lattice can pre-stage data during queue wait time. Add --data-mount="s3://bucket/data:/data" and the scheduler factors data locality into placement decisions.
Checkpointing
Unlike Slurm’s --requeue, Lattice coordinates checkpointing before preemption. Declare --checkpoint=signal and your job receives SIGUSR1 before being suspended.
Migration Steps
- Start with existing scripts — #SBATCH directives work out of the box
- Replace sbatch/squeue/scancel with lattice submit/status/cancel
- Gradually adopt native features — data staging, checkpointing, DAGs, uenv
- Tune scheduling weights — use the RM-Replay simulator for A/B comparison
Deployment & Administration
Architecture Overview
A Lattice deployment consists of:
- 3-5 quorum members — Raft consensus nodes running lattice-server
- N compute nodes — each running lattice-agent
- VictoriaMetrics (or compatible TSDB) — telemetry storage
- S3-compatible storage — checkpoint and log persistence
- VAST (optional) — data staging and QoS
Deployment Methods
Docker Compose (dev/test)
cd infra/docker
docker compose up -d
This starts a 3-node quorum with VictoriaMetrics. See infra/docker/docker-compose.yml.
Systemd (production)
Download binaries from GitHub Releases and install:
ARCH=$(uname -m | sed 's/aarch64/arm64/')
# Server (quorum members)
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-server-${ARCH}.tar.gz" | tar xz
sudo mv lattice-server /usr/local/bin/
sudo cp infra/systemd/lattice-server.service /etc/systemd/system/
sudo cp config/production.yaml /etc/lattice/config.yaml
sudo systemctl enable --now lattice-server
# Agent (compute nodes) — single binary per architecture, all GPU support included
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-agent-${ARCH}.tar.gz" | tar xz
sudo mv lattice-agent /usr/local/bin/
sudo cp infra/systemd/lattice-agent.service /etc/systemd/system/
sudo systemctl enable --now lattice-agent
Configuration
Example configs are in config/:
| File | Purpose |
|---|---|
| config/minimal.yaml | Single-node dev mode, no optional features |
| config/production.yaml | Full reference with all sections documented |
See the production config for every option with explanations.
Required Sections
- quorum — Raft node ID, peers, data directory
- api — gRPC and REST listen addresses
- storage — S3 endpoint, NFS paths
- telemetry — TSDB endpoint, aggregation mode
Optional Sections
- node_agent — heartbeat timing, grace periods
- network — VNI pool range for Slingshot
- checkpoint — checkpoint evaluation and timeout tuning
- scheduling — cycle interval, backfill depth
- accounting — Waldur integration (requires accounting feature)
- rate_limit — per-user API rate limiting
- federation — Sovra cross-site federation (requires federation feature)
- compat — Slurm compatibility settings
Authentication & Authorization
Overview
Lattice authenticates three types of callers:
| Caller | Auth method | Token source |
|---|---|---|
| Humans (CLI) | OIDC (PKCE flow) → RS256 JWT | IdP (Keycloak, Dex) |
| Agents (node agent) | mTLS (production) or Bearer token (dev) | SPIRE SVID / bootstrap certs / LATTICE_AGENT_TOKEN |
| Services (AI/MCP) | OIDC (client_credentials) → RS256 JWT | IdP service account |
Server OIDC Configuration
api:
oidc_issuer: "https://keycloak.example.com/realms/hpc" # IdP discovery URL
oidc_client_id: "lattice" # Expected `aud` claim
# oidc_hmac_secret: "dev-secret-only" # HMAC fallback (dev only)
| Config field | Env var | Purpose |
|---|---|---|
| api.oidc_issuer | — | OIDC provider URL. Enables JWKS (RS256/ES256) validation. |
| api.oidc_client_id | — | Expected aud claim. Returned by auth discovery endpoint. |
| api.oidc_hmac_secret | LATTICE_OIDC_HMAC_SECRET | Shared secret for HS256 validation (dev/testing/break-glass). |
Priority: JWKS (if oidc_issuer set) > HMAC (if secret set) > no auth (warning logged).
The auth discovery endpoint GET /api/v1/auth/discovery is public (no auth required) and returns {idp_url, client_id, issuer} so the CLI can bootstrap login.
Roles
Role derivation checks OIDC scopes first, then cross-system role claims (pact_role, lattice_role). First match wins.
| Role | OIDC scope | Cross-system claim | Permissions |
|---|---|---|---|
| SystemAdmin | admin or system:admin | pact-platform-admin or system-admin | Unrestricted — all operations |
| TenantAdmin | tenant:admin | tenant-admin | Manage own tenant’s allocations, vClusters, quotas. Drain nodes. Query audit. |
| Operator | operator | operator | Drain/undrain/disable/enable nodes. Cannot create tenants or manage federation. |
| ClaimingUser | sensitive:claim | — | User + claim/release sensitive nodes |
| ReadOnly | readonly | — | GET/LIST/WATCH only, no mutations |
| User | (default — any authenticated user) | — | Submit/cancel own allocations, view nodes, create sessions |
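The scopes-first, first-match-wins derivation described above can be modeled directly from the table. A sketch (ordering and claim names taken from the table; the real implementation may differ in detail):

```python
# Checked in order; first match wins.
SCOPE_ROLES = [
    ("admin", "SystemAdmin"), ("system:admin", "SystemAdmin"),
    ("tenant:admin", "TenantAdmin"), ("operator", "Operator"),
    ("sensitive:claim", "ClaimingUser"), ("readonly", "ReadOnly"),
]
CLAIM_ROLES = [
    ("pact-platform-admin", "SystemAdmin"), ("system-admin", "SystemAdmin"),
    ("tenant-admin", "TenantAdmin"), ("operator", "Operator"),
]

def derive_role(scopes: set, cross_claims: set) -> str:
    """OIDC scopes are checked before cross-system role claims."""
    for scope, role in SCOPE_ROLES:
        if scope in scopes:
            return role
    for claim, role in CLAIM_ROLES:
        if claim in cross_claims:
            return role
    return "User"  # default for any authenticated caller

assert derive_role({"tenant:admin"}, set()) == "TenantAdmin"
assert derive_role(set(), {"pact-platform-admin"}) == "SystemAdmin"
assert derive_role(set(), set()) == "User"
```

Note that a scope match short-circuits the claim check: a token carrying both `readonly` and `tenant-admin` resolves to ReadOnly.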
IdP Setup (Keycloak / Dex)
Configure your IdP to include the appropriate scopes in issued tokens:
Keycloak:
- Create client lattice with PKCE (Authorization Code) flow
- Create client scopes: admin, tenant:admin, operator, sensitive:claim, readonly
- Assign scopes to users/groups via role mappings
- For pact+lattice co-deployment: add pact_role as a custom claim in the token mapper
Dex:
staticClients:
- id: lattice
name: Lattice Scheduler
redirectURIs: ['http://localhost:8400/callback']
public: true # PKCE, no client secret
Dex passes through upstream IdP claims. Configure pact_role / scopes in the upstream IdP (LDAP groups, SAML attributes, etc.).
Agent Authentication
Node agents authenticate to lattice-server for registration and heartbeats.
Production (mTLS): Agent acquires identity via the cascade: SPIRE → SelfSigned CA → Bootstrap certs. The gRPC channel uses ClientTlsConfig with the acquired cert/key/CA. Server verifies the client certificate.
# Bootstrap cert path (used until SPIRE is available)
lattice-agent \
--quorum-endpoint=https://lattice-01:50051 \
--bootstrap-cert=/etc/lattice/tls/agent.crt \
--bootstrap-key=/etc/lattice/tls/agent.key \
--bootstrap-ca=/etc/lattice/tls/ca.crt \
...
Dev/testing (Bearer token): When no mTLS identity is available, the agent falls back to LATTICE_AGENT_TOKEN.
LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
--quorum-endpoint=http://lattice-01:50051 \
...
| Env var | Purpose |
|---|---|
| LATTICE_AGENT_TOKEN | Bearer token for agent→server auth (dev/testing/break-glass) |
| LATTICE_SPIRE_SOCKET | SPIRE agent socket path (default: /run/spire/agent.sock) |
| LATTICE_BOOTSTRAP_CERT | Bootstrap cert PEM path |
| LATTICE_BOOTSTRAP_KEY | Bootstrap key PEM path |
| LATTICE_BOOTSTRAP_CA | Bootstrap CA PEM path |
mTLS takes priority. Token auth is the fallback. In production, leave LATTICE_AGENT_TOKEN unset.
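The identity cascade and token fallback described above amount to a simple priority chain. A sketch (function and source names are illustrative, not the agent's actual API):

```python
def resolve_agent_identity(spire=None, self_signed=None, bootstrap=None, token=None):
    """Try mTLS sources in cascade order; the bearer token is the last resort."""
    for source, cred in (("spire", spire),
                         ("self_signed", self_signed),
                         ("bootstrap", bootstrap)):
        if cred:
            return ("mtls", source, cred)
    if token:
        return ("bearer", "env_token", token)
    raise RuntimeError("no agent identity available")

# Bootstrap certs win over a token that is also present.
assert resolve_agent_identity(bootstrap="agent.crt", token="dev-token")[1] == "bootstrap"
# With no certs at all, the agent falls back to the token.
assert resolve_agent_identity(token="dev-token")[0] == "bearer"
```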
Quorum Management
Initial Bootstrap
The first quorum member initializes the Raft cluster using the --bootstrap flag. This flag must only be passed once — on the very first startup of node 1. All subsequent restarts (including systemd restarts) omit it.
# First-ever start of node 1 — initializes the Raft cluster:
lattice-server --config /etc/lattice/server.yaml --bootstrap
# All subsequent restarts — no --bootstrap:
lattice-server --config /etc/lattice/server.yaml
# (or via systemd, which never passes --bootstrap)
Configure peers in each node’s config:
quorum:
node_id: 1
data_dir: /var/lib/lattice/raft
peers:
- id: 2
address: "lattice-02:9000"
- id: 3
address: "lattice-03:9000"
Nodes 2 and 3 never need --bootstrap — they join via Raft membership replication from the leader.
Raft Status
curl http://lattice-01:8080/api/v1/raft/status
Backup & Restore
# Create backup
curl -X POST http://lattice-01:8080/api/v1/admin/backup
# Verify backup integrity
curl http://lattice-01:8080/api/v1/admin/backup/verify
# Restore (requires restart)
curl -X POST http://lattice-01:8080/api/v1/admin/restore \
-d '{"path": "/var/lib/lattice/backups/backup-20260305T120000Z.tar.gz"}'
Node Management
Agent Registration
Agents register automatically on startup. Authentication uses mTLS (production) or Bearer token (dev/testing):
# Production: mTLS via bootstrap certs (SPIRE preferred when available)
lattice-agent \
--node-id=nid001234 \
--quorum-endpoint=https://lattice-01:50051 \
--bootstrap-cert=/etc/lattice/tls/agent.crt \
--bootstrap-key=/etc/lattice/tls/agent.key \
--bootstrap-ca=/etc/lattice/tls/ca.crt \
--gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512
# Dev/testing: Bearer token auth (no certs needed)
LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
--node-id=nid001234 \
--quorum-endpoint=http://lattice-01:50051 \
--gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512
The agent tries the identity cascade (SPIRE → SelfSigned → Bootstrap) first. If no mTLS identity is available, it falls back to LATTICE_AGENT_TOKEN.
Draining Nodes
The drain lifecycle is: Ready → Draining → Drained → Ready.
# Drain a node (existing jobs complete, no new jobs scheduled)
lattice admin drain nid001234 --reason="maintenance"
# If no active allocations, node goes directly to Drained.
# If allocations are running, node stays in Draining until they complete.
# The scheduler loop automatically transitions Draining → Drained.
# Undrain (only works from Drained state)
lattice admin undrain nid001234
Undrain only works when the node is in Drained state. If the node is still Draining (allocations running), wait for them to complete or cancel them first.
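The drain rules above form a small state machine. A sketch of those transitions (illustrative model, not the scheduler's actual code):

```python
def drain_transition(state: str, active_allocations: int) -> str:
    """Drain lifecycle: Ready -> Draining -> Drained, skipping Draining if the node is idle."""
    if state == "Ready":
        return "Drained" if active_allocations == 0 else "Draining"
    if state == "Draining" and active_allocations == 0:
        return "Drained"  # the scheduler loop performs this step automatically
    return state

def undrain(state: str) -> str:
    """Undrain is only valid from the Drained state."""
    if state != "Drained":
        raise ValueError("undrain only works from Drained")
    return "Ready"

assert drain_transition("Ready", active_allocations=2) == "Draining"
assert drain_transition("Ready", active_allocations=0) == "Drained"
assert drain_transition("Draining", active_allocations=0) == "Drained"
assert undrain("Drained") == "Ready"
```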
Node States
| State | Meaning |
|---|---|
| Ready | Available for scheduling |
| Draining | No new jobs; existing jobs continue |
| Down | Heartbeat lost beyond grace period |
| Degraded | Heartbeat late but within grace period |
| Claimed | Reserved for sensitive workload |
Tenant Management
# Create a tenant
lattice admin tenant create --name="physics" --max-nodes=100
# List tenants
lattice admin tenant list
# Update quota
lattice admin tenant update physics --max-nodes=200
TLS Configuration
Server TLS
api:
tls_cert: /etc/lattice/tls/server.crt
tls_key: /etc/lattice/tls/server.key
Mutual TLS (mTLS)
api:
tls_cert: /etc/lattice/tls/server.crt
tls_key: /etc/lattice/tls/server.key
tls_ca: /etc/lattice/tls/ca.crt # Require client certificates
Feature Flags
Compile-time features control optional integrations:
| Feature | Crate | Enables |
|---|---|---|
| oidc | lattice-api | JWT/OIDC token validation |
| accounting | lattice-api | Waldur billing integration |
| federation | lattice-api | Sovra cross-site federation |
| nvidia | lattice-node-agent | NVIDIA GPU discovery (nvml-wrapper) |
| rocm | lattice-node-agent | AMD GPU discovery (rocm-smi) |
| ebpf | lattice-node-agent | eBPF kernel telemetry (Linux only) |
Pre-built release binaries ship with all features enabled. GPU libraries are loaded at runtime — nodes without GPUs simply report no GPU hardware. To build from source:
# Server with all features
cargo build --release -p lattice-api --all-features
# Agent with all features
cargo build --release -p lattice-node-agent --all-features
Release Artifacts
| Artifact | Architecture | GPU Support |
|---|---|---|
| lattice-server-x86_64.tar.gz | x86_64 | n/a |
| lattice-server-arm64.tar.gz | arm64 | n/a |
| lattice-x86_64.tar.gz | x86_64 | n/a (CLI) |
| lattice-arm64.tar.gz | arm64 | n/a (CLI) |
| lattice-agent-x86_64.tar.gz | x86_64 | NVIDIA + AMD ROCm + eBPF |
| lattice-agent-arm64.tar.gz | arm64 | NVIDIA + AMD ROCm + eBPF |
| rm-replay-x86_64.tar.gz | x86_64 | n/a |
| rm-replay-arm64.tar.gz | arm64 | n/a |
GPU discovery is automatic at runtime. The agent detects available hardware and uses the appropriate provider:
| Hardware | Discovery Method | Runtime Dependency |
|---|---|---|
| NVIDIA (H100, A100, GH200) | nvml-wrapper (libnvidia-ml.so via dlopen) | NVIDIA driver installed |
| AMD (MI300X, MI250) | rocm-smi CLI | ROCm toolkit installed |
| CPU-only nodes | No GPU discovery runs | None |
GCP Test Cluster
For integration testing without production hardware:
# 1. Build Packer image (once, ~5 min)
cd infra/gcp/packer
packer build -var project_id=YOUR_PROJECT lattice-compute.pkr.hcl
# 2. Provision infrastructure (~2 min)
cd infra/gcp
terraform apply -var="project_id=YOUR_PROJECT" -var="use_packer_image=true"
# 3. Build + bundle binaries
cargo build --release --target x86_64-unknown-linux-gnu
./scripts/deploy/make-provision-bundle.sh target/x86_64-unknown-linux-gnu/release /tmp/lattice-provision.tar.gz
# 4. Deploy to nodes (SCP bundle + run install scripts)
# See scripts/deploy/install-quorum.sh and install-compute.sh
# 5. Run validation test matrix
./scripts/deploy/validate.sh http://QUORUM1_IP:8080 x1000c0s0b0n0,x1000c0s0b0n1
# 6. Teardown
cd infra/gcp && terraform destroy
The test cluster includes: 3 quorum nodes, 2 compute nodes (with podman + squashfs-tools), 1 OCI registry, 1 VictoriaMetrics. The validate.sh script runs 15 tests covering health, auth, submit, drain, restart, and validation.
Deploy scripts (scripts/deploy/install-*.sh) are reusable on-prem — no GCP-specific logic.
Cluster Monitoring & Observability
Prometheus Metrics
Lattice exposes Prometheus-compatible metrics at GET /metrics on the REST port (default 8080).
Key Metrics
| Metric | Type | Description |
|---|---|---|
| lattice_allocations_total | Counter | Total allocations by state |
| lattice_allocations_active | Gauge | Currently running allocations |
| lattice_scheduling_cycle_duration_seconds | Histogram | Scheduling cycle latency |
| lattice_scheduling_placements_total | Counter | Successful placements |
| lattice_scheduling_preemptions_total | Counter | Preemption events |
| lattice_raft_commit_latency_seconds | Histogram | Raft commit latency |
| lattice_raft_sensitive_audit_entries_total | Counter | Sensitive audit log entries |
| lattice_api_request_duration_seconds | Histogram | API request latency |
| lattice_api_requests_total | Counter | API requests by method and status |
| lattice_nodes_total | Gauge | Nodes by state |
| lattice_checkpoint_duration_seconds | Histogram | Checkpoint operation latency |
Scrape Configuration
# prometheus.yml
scrape_configs:
- job_name: 'lattice'
static_configs:
- targets:
- 'lattice-01:8080'
- 'lattice-02:8080'
- 'lattice-03:8080'
Grafana Dashboards
Pre-built dashboards are in infra/grafana/dashboards/:
- Cluster Overview — node states, allocation throughput, queue depth
- Scheduling Performance — cycle latency, placement rate, preemption rate
- Raft Health — commit latency, leader elections, log compaction
- Per-Tenant Usage — resource consumption, fair-share deficit
Import via Grafana UI or provision from infra/grafana/provisioning/.
Alerting Rules
Pre-configured alerting rules in infra/alerting/:
| Alert | Condition |
|---|---|
| LatticeRaftNoLeader | No Raft leader for > 30s |
| LatticeNodeDown | Node heartbeat lost for > 5m |
| LatticeSchedulingStalled | No placements for > 10m with pending jobs |
| LatticeHighPreemptionRate | > 10 preemptions/minute |
| LatticeCheckpointFailure | Checkpoint success rate < 90% |
| LatticeDiskSpaceLow | Raft data directory > 80% full |
TSDB Integration
Lattice pushes per-node telemetry to VictoriaMetrics (or any Prometheus-compatible remote write endpoint).
telemetry:
tsdb_endpoint: "http://victoriametrics:8428"
prod_interval_seconds: 30
Telemetry includes CPU, memory, GPU utilization, network I/O, and disk I/O per node.
Audit Log
Sensitive workload operations are recorded in the Raft-committed audit log:
# Query audit log
curl "http://lattice-01:8080/api/v1/audit?tenant=sensitive-team&from=2026-03-01"
Audit entries include: node claims/releases, allocation lifecycle events, and access log entries. Retention: 7 years (configurable).
Health Check
curl http://lattice-01:8080/healthz
# {"status": "ok"}
Used by Docker/Kubernetes health probes and load balancers.
Managing Sensitive Workloads
Sensitive workloads (financial, defense, regulated research) require strict isolation, auditing, and data handling. Lattice provides a dedicated scheduling mode for these workloads.
How It Works
- User claims nodes — not the scheduler. The user’s identity is recorded as the owner in the Raft audit log.
- Full isolation — claimed nodes run only the owner’s workloads. No sharing.
- Hardened OS — OpenCHAMI provisions a hardened boot image for claimed nodes.
- Encrypted storage — a dedicated encrypted pool is assigned. All access is logged.
- Signed software only — only vulnerability-scanned, signed uenv images are allowed.
- Wipe on release — when the claim ends, storage is crypto-erased and nodes are re-provisioned.
Submitting Sensitive Workloads
# Submit to the sensitive vCluster
lattice submit --vcluster=sensitive --nodes=4 --walltime=168h analysis.sh
The sensitive scheduler uses a reservation model (not backfill). Priority is fixed at the highest level; the only tiebreaker is conformance fitness.
Node Claiming
Sensitive allocations claim specific nodes. Once claimed:
- Nodes are exclusively owned by the claiming user
- The claim is Raft-committed with the user’s identity
- No other workloads (even from the same tenant) can run on claimed nodes
Audit Trail
Every sensitive operation is logged:
# Query sensitive audit entries
curl "http://lattice-01:8080/api/v1/audit?scope=sensitive"
Logged events:
- Node claim / release
- Allocation start / completion
- Data access (read/write operations)
- Software image loads
- Storage wipe confirmation
Retention: 7 years (per regulatory requirements).
Network Isolation
Sensitive allocations get a unique Slingshot VNI (network domain). Ingress and egress are denied except to the designated data gateway. With Ultra Ethernet, wire-level encryption is enabled.
Admin Responsibilities
- Provision hardened images via OpenCHAMI for sensitive nodes
- Maintain signed uenv registry — only approved images should be signed
- Monitor audit log — set up alerting for unexpected access patterns
- Test wipe procedures — verify crypto-erase completes on node release
- Designate sensitive-capable nodes — not all nodes need to support sensitive workloads
Configuration
No special server configuration is needed. The sensitive scheduler is a built-in vCluster type. Create a sensitive vCluster:
lattice admin vcluster create \
--name=sensitive \
--scheduler-type=sensitive-reservation \
--description="Regulated workloads with full isolation"
System Architecture
Overview
Lattice has a seven-layer architecture in which each layer has a clear responsibility and communicates with adjacent layers via defined interfaces.
┌─ User Plane ───────────────────────────────────────────────────┐
│ lattice-cli + lattice-api (OIDC via hpc-auth) │
│ ├── Job lifecycle (submit, monitor, cancel) │
│ ├── Interactive sessions (WebSocket terminal) │
│ ├── Data management (stage, browse, transfer) │
│ ├── uenv management (list, pull, test) │
│ ├── Observability (attach, logs, metrics, diagnostics) │
│ └── Sensitive: user-level node claim/release │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Software Plane ──────────┴────────────────────────────────────┐
│ Default: uenv (squashfs + mount namespace) │
│ Optional: OCI/Sarus (isolation, third-party images) │
│ Registry: JFrog/Nexus → S3 backing (VAST hot tier) │
│ Node-local NVMe image cache (optional) │
│ Sensitive: signed images only, vulnerability-scanned │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Scheduling Plane ────────┴────────────────────────────────────┐
│ Quorum (Raft, 3-5 replicas) │
│ Strong: (1) node ownership (2) sensitive audit log │
│ Eventual: job queues, telemetry, quotas │
│ │
│ vCluster Schedulers: │
│ ├── HPC: backfill + dragonfly group packing │
│ ├── Service: bin-pack + autoscale │
│ ├── Sensitive: user-claim reservation, dedicated nodes │
│ └── Interactive: FIFO, short-lived, node-sharing via Sarus │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Data Plane ──────────────┴────────────────────────────────────┐
│ Hot: VAST (NFS + S3, single flash tier) │
│ ├── Home dirs, scratch, active datasets (NFS) │
│ ├── Checkpoints, image cache, objects (S3) │
│ ├── Scheduler integration: QoS, pre-staging, snapshots │
│ └── Sensitive: encrypted view, audit-logged, dedicated pool │
│ Warm: Capacity store (S3-compat, cost-optimized) │
│ Cold: Tape archive (S3-compat, regulatory retention) │
│ Data mover: pre-stages during queue wait, policy-driven │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Network Fabric ──────────┴────────────────────────────────────┐
│ Slingshot (current) / Ultra Ethernet (future path) │
│ ├── libfabric abstraction for workload communication │
│ ├── VNI-based network domains (job isolation) │
│ ├── Traffic classes: compute | management | telemetry │
│ ├── CSIG for in-band congestion telemetry │
│ └── Sensitive: encrypted RDMA, dedicated VNI │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Node Plane ──────────────┴────────────────────────────────────┐
│ Node Agent (per node) │
│ ├── squashfs-mount (uenv delivery) │
│ ├── Sarus (OCI container runtime, when needed) │
│ ├── eBPF telemetry + CSIG tap │
│ ├── Node-local NVMe (optional): scratch + image cache │
│ ├── Conformance fingerprint (driver/firmware/kernel hash) │
│ └── Health reporting → OpenCHAMI SMD │
└───────────────────────────┬────────────────────────────────────┘
│
┌─ Infrastructure Plane ────┴────────────────────────────────────┐
│ OpenCHAMI │
│ ├── Magellan: Redfish BMC discovery & inventory │
│ ├── SMD: State Management Daemon (hardware lifecycle) │
│ ├── BSS: Boot Script Service (image selection per node) │
│ ├── OPAAL: Authentication & identity │
│ ├── Cloud-init: per-node config injection │
│ └── Manta CLI: admin tooling │
└────────────────────────────────────────────────────────────────┘
Component Interactions
Allocation Lifecycle
1. User/Agent → lattice-cli → lattice-api (Intent API or Compat API)
2. lattice-api validates request, resolves uenv, creates Allocation object
3. Allocation placed in vCluster scheduler's queue (eventually consistent)
4. vCluster scheduler runs scheduling cycle:
a. Scores pending allocations with cost function
b. Solves knapsack: maximize value subject to resource constraints
c. Proposes allocation → quorum
5. Quorum validates (node ownership, quotas, sensitive isolation)
6. Quorum commits: node ownership updated (strong consistency)
7. Quorum notifies node agents of new allocation
8. Node agents:
a. Pull uenv squashfs image (from cache or registry)
b. Mount via squashfs-mount
c. Start processes in mount namespace
d. Begin log capture (ring buffer + S3 persistence)
e. Accept attach sessions (if user connects)
f. Report health/telemetry
8.5. During execution, users can:
- Attach interactive terminal (nsenter into allocation namespace)
- Stream logs (live tail from ring buffer or historical from S3)
- Query metrics (lattice top → TSDB) or stream them (lattice watch → node agents)
- View diagnostics (network health, storage performance)
- Compare metrics across allocations (TSDB multi-query)
9. On completion: node agents report, quorum releases nodes
Preemption Flow
1. Higher-priority allocation arrives, needs nodes currently in use
2. Scheduler evaluates: which running allocations are cheapest to preempt?
→ checkpoint_efficiency score from cost function
3. Checkpoint broker sends CHECKPOINT_HINT to target allocation's node agents
4. Application checkpoints (or: timeout → forced preemption)
5. Nodes released, reassigned to higher-priority allocation
6. Preempted allocation re-queued, will resume from checkpoint when resources available
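The victim-selection step (2) can be sketched as a greedy sort over running allocations by their f₈ checkpoint_efficiency score. This is an illustrative sketch, not the lattice-scheduler implementation; the tuple shape and helper names are assumptions, and the real scheduler also weighs priority class.

```python
def checkpoint_efficiency(est_checkpoint_minutes: float) -> float:
    """f8 from the cost function: cheaper-to-checkpoint jobs score higher."""
    return 1.0 / (1.0 + est_checkpoint_minutes)

def pick_preemption_victims(running, nodes_needed):
    """Greedily pick victims that free enough nodes at the lowest preemption cost.

    `running` is a list of (alloc_id, node_count, est_checkpoint_minutes) tuples
    (illustrative shape, not the actual crate API).
    """
    # Cheapest-to-checkpoint allocations first
    ranked = sorted(running, key=lambda a: checkpoint_efficiency(a[2]), reverse=True)
    victims, freed = [], 0
    for alloc_id, node_count, _ in ranked:
        if freed >= nodes_needed:
            break
        victims.append(alloc_id)
        freed += node_count
    return victims
```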
Federation Flow (when enabled)
1. User at Site A submits allocation targeting Site B
2. Site A's federation broker signs request with Sovra token
3. Request arrives at Site B's federation broker
4. Site B verifies Sovra token, checks policy (OPA)
5. If accepted: allocation enters Site B's scheduling plane
6. Site B's local quorum manages the allocation entirely
7. Results/logs accessible to user at Site A via federation catalog
Topology Model
The scheduler maintains a model of the Slingshot dragonfly topology:
System
├── Group 0 (electrical group, ~hundreds of nodes)
│ ├── Switch 0
│ │ ├── Node 0..N
│ │ └── ...
│ └── Switch M
├── Group 1
│ └── ...
└── Group K
└── ...
Intra-group: electrical, low latency, high bandwidth
Inter-group: optical, higher latency, potential congestion
Scheduling rule: pack jobs into fewest groups possible. Jobs below group size → single group. Large jobs → minimize group span, prefer adjacent groups. Network-sensitive jobs (NCCL) get stricter placement constraints.
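The packing rule above can be sketched as a two-phase group search: try a single group first, then span the fewest groups possible. This assumes only a per-group free-node count; the real solver additionally honors GPU constraints, conformance groups, and power budgets.

```python
def pack_into_groups(free_by_group: dict, nodes_needed: int):
    """Pick the fewest dragonfly groups that can host `nodes_needed` nodes.

    Illustrative sketch: prefer a single group (the smallest that fits, to
    keep large groups free for big jobs); otherwise take the largest free
    groups first to minimize group span.
    """
    single = [g for g, free in free_by_group.items() if free >= nodes_needed]
    if single:
        return [min(single, key=lambda g: free_by_group[g])]
    chosen, remaining = [], nodes_needed
    for g, free in sorted(free_by_group.items(), key=lambda kv: -kv[1]):
        chosen.append(g)
        remaining -= free
        if remaining <= 0:
            return chosen
    return None  # cannot place with current free capacity
```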
State Machine
The quorum manages a replicated state machine with the following state:
GlobalState {
nodes: Map<NodeId, NodeState>, // ownership, health, capabilities
allocations: Map<AllocId, Allocation>, // all active allocations
tenants: Map<TenantId, TenantState>, // quotas, fair-share counters
vclusters: Map<VClusterId, VClusterConfig>, // scheduler configs
topology: TopologyModel, // dragonfly group structure
sensitive_audit: AppendOnlyLog<AuditEvent>, // strong consistency
}
NodeState {
owner: Option<(TenantId, VClusterId, AllocId)>,
health: NodeHealth,
capabilities: NodeCapabilities, // GPU type, memory, features
group: GroupId, // topology position
conformance_group: ConformanceGroupId, // fingerprint of driver/firmware/kernel
}
Transitions are proposed by vCluster schedulers and validated by the quorum before commit. Only node ownership changes and sensitive audit events require Raft consensus; everything else is eventually consistent.
Note: Observability data (logs, metrics, attach sessions, diagnostics) is NOT stored in the Raft state machine. This data lives in the TSDB, S3, and node agent memory. Only sensitive audit events about observability actions (e.g., “Dr. X attached to allocation Y”) flow through Raft consensus (per ADR-004).
API Design
Two-Tier API Model
Tier 1: Intent API (Agent-Native)
Agents and advanced users interact with the Intent API. They declare what they need; the scheduler resolves how.
Core Resources
Allocation — The universal work unit.
POST /v1/allocations Create allocation (or DAG of allocations)
GET /v1/allocations List allocations (filterable)
GET /v1/allocations/{id} Get allocation status
DELETE /v1/allocations/{id} Cancel allocation
PATCH /v1/allocations/{id} Update allocation (e.g., extend walltime, switch telemetry)
POST /v1/allocations/{id}/tasks Launch tasks within an existing allocation (srun equivalent)
POST /v1/allocations/{id}/checkpoint Request checkpoint
Observability — User-facing debugging and monitoring.
POST /v1/allocations/{id}/attach Attach interactive terminal (WebSocket upgrade)
GET /v1/allocations/{id}/logs Historical logs from S3
GET /v1/allocations/{id}/logs/stream Live log tail (SSE / gRPC stream)
GET /v1/allocations/{id}/metrics Query metrics snapshot from TSDB
GET /v1/allocations/{id}/metrics/stream Push-based live metrics stream
GET /v1/allocations/{id}/diagnostics Combined network + storage diagnostics
GET /v1/allocations/{id}/diagnostics/network Network-specific diagnostics
GET /v1/allocations/{id}/diagnostics/storage Storage-specific diagnostics
GET /v1/compare Cross-allocation metric comparison
DAGs — Workflow graph management.
POST /v1/dags Submit a DAG of allocations
GET /v1/dags List DAGs (filterable by tenant, user, state)
GET /v1/dags/{id} Get DAG status (overall state + per-allocation states)
GET /v1/dags/{id}/graph Get DAG structure (allocations + dependency edges)
DELETE /v1/dags/{id} Cancel all allocations in a DAG
Session — Interactive allocation with WebSocket terminal.
POST /v1/sessions Create interactive session
GET /v1/sessions/{id}/terminal WebSocket terminal endpoint
Nodes — Read-only view of cluster state.
GET /v1/nodes List nodes (filterable by vCluster, tenant, state)
GET /v1/nodes/{id} Get node details
Tenants / vClusters — Administrative.
GET /v1/tenants List tenants
GET /v1/vclusters List vClusters
GET /v1/vclusters/{id}/queue View vCluster queue
Accounting
GET /v1/accounting Query usage history
Allocation Request Schema
# Full Intent API allocation request
allocation:
# Identity
tenant: "ml-team"
project: "gpt-training"
vcluster: "ml-training" # optional: scheduler can infer from intent
tags: { experiment: "run-42" }
# What to run
intent: "train" # optional hint for scheduler
environment:
uenv: "prgenv-gnu/24.11:v1" # uenv name/version
view: "default" # uenv view to activate
# OR:
image: "registry.example.com/my-training:latest" # OCI image via Sarus
entrypoint: "torchrun --nproc_per_node=4 train.py"
# Resources
resources:
nodes: 64 # can be exact or range: { min: 32, max: 128 }
constraints:
gpu_type: "GH200"
features: ["nvme_scratch"]
topology: "tight" # scheduler hint: pack into fewest groups
# Lifecycle
lifecycle:
type: "bounded" # bounded | unbounded | reactive
walltime: "72h" # for bounded
preemption_class: 2 # 0 = lowest, higher = harder to preempt
# For reactive:
# scale_policy: { min: 4, max: 16, metric: "request_latency_p99", target: "100ms" }
# Data
data:
mounts:
- source: "s3://datasets/imagenet"
target: "/data/input"
access: "read-only"
tier_hint: "hot" # scheduler pre-stages if needed
defaults: true # auto-mount home, scratch, output dir
# Networking
connectivity:
network_domain: "ml-workspace" # shared domain for cross-allocation communication
expose: # for services
- name: "metrics"
port: 9090
# Dependencies (for DAG submissions)
depends_on:
- ref: "preprocess-job"
condition: "success" # success | failure | any | corresponding
# Checkpointing
checkpoint:
strategy: "auto" # auto | manual | none
# auto: scheduler decides based on cost function
# manual: application manages its own checkpointing
# none: non-checkpointable, treated as non-preemptible
# Telemetry
telemetry:
mode: "prod" # prod | debug | audit
DAG Submission
Submit multiple allocations as a workflow graph:
dag:
allocations:
- id: "stage-data"
entrypoint: "python stage.py"
resources: { nodes: 1 }
lifecycle: { type: "bounded", walltime: "2h" }
- id: "train"
entrypoint: "torchrun train.py"
resources: { nodes: 64, constraints: { topology: "tight" } }
lifecycle: { type: "bounded", walltime: "72h" }
depends_on: [{ ref: "stage-data", condition: "success" }]
- id: "evaluate"
entrypoint: "python eval.py"
resources: { nodes: 4 }
depends_on: [{ ref: "train", condition: "any" }]
DAG size limit: Maximum 1000 allocations per DAG (configurable). Submissions exceeding this limit are rejected at validation time. See dag-scheduling.md for details.
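The dependency gating semantics (root allocations enter the queue immediately; downstream allocations enter once their dependency conditions are satisfied) can be sketched as follows. This covers only the success/failure/any conditions (the `corresponding` condition is elided), and the function names are illustrative, not the lattice-scheduler API.

```python
TERMINAL = {"success", "failure"}

def condition_met(condition: str, dep_state) -> bool:
    """Evaluate one dependency edge (assumed semantics per dag-scheduling.md)."""
    if condition == "any":
        return dep_state in TERMINAL      # dependency finished either way
    return dep_state == condition          # "success" / "failure" match exactly

def ready_allocations(dag: dict, states: dict) -> list:
    """Allocations whose dependencies are all satisfied and which haven't started.

    `dag` maps alloc id -> list of (dep_id, condition); roots have no deps.
    `states` maps alloc id -> terminal state, or absent if not yet finished.
    """
    ready = []
    for alloc, deps in dag.items():
        if states.get(alloc) is not None:
            continue  # already queued, running, or finished
        if all(condition_met(cond, states.get(dep)) for dep, cond in deps):
            ready.append(alloc)
    return ready
```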
Task Groups (Job Arrays)
allocation:
type: "task_group"
template:
entrypoint: "python sweep.py --config=${INDEX}"
resources: { nodes: 1, constraints: { gpu_type: "GH200" } }
lifecycle: { type: "bounded", walltime: "4h" }
range: { start: 0, end: 99 }
concurrency: 20 # max simultaneous tasks
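The template semantics amount to `${INDEX}` substitution over the range; a minimal sketch is below. The helper name is illustrative, and the concurrency cap (enforced by the scheduler, not at expansion time) is not modeled.

```python
def expand_task_group(template: str, start: int, end: int):
    """Expand a task-group entrypoint template into per-task commands.

    Substitutes ${INDEX} for each index in the inclusive range.
    """
    return [template.replace("${INDEX}", str(i)) for i in range(start, end + 1)]
```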
Tier 2: Compatibility API (Slurm-like)
Translates familiar Slurm commands to Intent API calls. Implemented as CLI wrappers + lattice-api REST endpoints.
Command Mapping
| Slurm | Lattice CLI | Intent API |
|---|---|---|
sbatch script.sh | lattice submit script.sh | POST /v1/allocations |
sbatch --array=0-99%20 script.sh | lattice submit --task-group=0-99%20 script.sh | POST /v1/allocations (task_group) |
sbatch --dependency=afterok:123 script.sh | lattice submit --depends-on=123:success script.sh | POST /v1/allocations (depends_on) |
squeue | lattice status | GET /v1/allocations |
squeue -u $USER | lattice status --user=$USER | GET /v1/allocations?user= |
scancel 123 | lattice cancel 123 | DELETE /v1/allocations/123 |
salloc -N2 | lattice session --nodes=2 | POST /v1/sessions |
srun -n4 hostname | lattice launch --alloc=123 -n4 hostname | POST /v1/allocations/123/tasks |
sinfo | lattice nodes | GET /v1/nodes |
sacct | lattice history | GET /v1/accounting |
--constraint="gpu" | --constraint="gpu" | constraints.features |
--partition=debug | --vcluster=interactive | vcluster field |
--qos=high | --priority=high | preemption_class |
--uenv=prgenv-gnu/24.11:v1 | --uenv=prgenv-gnu/24.11:v1 | environment.uenv |
srun --jobid=123 --pty bash | lattice attach 123 | Attach RPC (bidir stream) |
cat slurm-123.out | lattice logs 123 | GET /v1/allocations/123/logs |
tail -f slurm-123.out | lattice logs 123 --follow | StreamLogs RPC |
sstat -j 123 | lattice top 123 | QueryMetrics RPC |
| (no equivalent) | lattice watch 123 | StreamMetrics RPC |
| (no equivalent) | lattice diag 123 | GetDiagnostics RPC |
| (no equivalent) | lattice compare 123 456 | CompareMetrics RPC |
Script Parsing
The compatibility layer parses #SBATCH directives from submission scripts and translates them to Intent API fields. Unknown directives produce warnings but are not fatal (graceful degradation).
#!/bin/bash
#SBATCH --nodes=64
#SBATCH --time=72:00:00
#SBATCH --gres=gpu:4
#SBATCH --constraint=GH200
#SBATCH --uenv=prgenv-gnu/24.11:v1
#SBATCH --view=default
#SBATCH --account=ml-team
#SBATCH --job-name=training-run
torchrun --nproc_per_node=4 train.py
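A minimal sketch of that parsing step, assuming a small directive-to-field mapping table. The directive subset and Intent API field names shown here are illustrative, not the full compatibility layer; note the graceful-degradation behavior for unknown directives.

```python
# Hypothetical subset of the directive -> Intent API field mapping
KNOWN = {
    "--nodes": ("resources", "nodes"),
    "--time": ("lifecycle", "walltime"),
    "--uenv": ("environment", "uenv"),
    "--account": (None, "tenant"),
    "--job-name": (None, "name"),
}

def parse_sbatch(script: str):
    """Translate #SBATCH directives into Intent API fields.

    Unknown directives are collected as warnings, not rejected.
    """
    intent, warnings = {}, []
    for line in script.splitlines():
        if not line.startswith("#SBATCH"):
            continue
        directive = line.split(None, 1)[1]
        key, _, value = directive.partition("=")
        if key in KNOWN:
            section, field = KNOWN[key]
            target = intent.setdefault(section, {}) if section else intent
            target[field] = value
        else:
            warnings.append(f"ignoring unknown directive {key}")
    return intent, warnings
```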
Wire Format
gRPC (protobuf) is the primary protocol. REST is provided via gRPC-gateway for browser/curl access.
Protobuf definitions in proto/ directory. See proto/README.md for schema details.
Proto Coverage
The protobuf definitions in proto/lattice/v1/allocations.proto currently cover:
| Service / Area | Proto Status | Notes |
|---|---|---|
| AllocationService (submit, get, list, cancel, update, watch, checkpoint) | Defined | Core allocation lifecycle |
| Observability RPCs (attach, logs, metrics, diagnostics, compare) | Defined | Part of AllocationService |
| DAG RPCs (get, list, cancel) | Defined | Part of AllocationService |
| NodeService (list, get, drain, undrain, disable, enable, health) | Defined | proto/lattice/v1/nodes.proto |
| AdminService (tenant CRUD, vCluster CRUD, Raft status, backup, audit, accounting) | Defined | proto/lattice/v1/admin.proto |
| Session RPCs (create, get, delete) | Defined | Part of AllocationService |
| Service Discovery (lookup, list) | Defined | Part of AdminService, admin.proto |
| LivenessProbeSpec | Defined | Part of AllocationSpec, allocations.proto |
All planned services have been implemented as RPCs within the existing three services (AllocationService, NodeService, AdminService). Both gRPC and REST endpoints are available for all operations.
Service Discovery Endpoints
| Method | Endpoint | Description |
|---|---|---|
| gRPC | AdminService.LookupService(name) | Returns endpoints for a named service (tenant-filtered) |
| gRPC | AdminService.ListServices() | Lists all registered service names (tenant-filtered) |
| REST | GET /api/v1/services | JSON list of registered service names |
| REST | GET /api/v1/services/{name} | JSON endpoints for a named service |
Tenant filtering: requests carrying an x-lattice-tenant header see only services belonging to that tenant. Without the header, all services are visible (admin mode).
Liveness Probe Schema
Allocations can include an optional liveness_probe in the submission spec:
message LivenessProbeSpec {
string probe_type = 1; // "tcp" or "http"
uint32 port = 2; // 1-65535
string path = 3; // HTTP path (e.g., "/healthz")
uint32 period_secs = 4; // default: 30
uint32 initial_delay_secs = 5;
uint32 failure_threshold = 6; // default: 3
uint32 timeout_secs = 7; // default: 5
}
When failure_threshold consecutive probes fail, the allocation is marked Failed. The reconciliation loop then requeues it (for Unbounded/Reactive allocations with appropriate requeue policy).
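The failure-threshold semantics can be sketched as a small state tracker: consecutive failures accumulate, any success resets the counter, and crossing the threshold marks the allocation Failed. This mirrors the behavior described above but is not the node-agent code.

```python
class LivenessTracker:
    """Track consecutive probe failures against failure_threshold (sketch)."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.failed = False

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; returns True once the allocation is Failed."""
        if probe_ok:
            self.consecutive_failures = 0   # any success resets the counter
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.failed = True          # allocation marked Failed
        return self.failed
```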
Client SDKs
| SDK | Protocol | Location |
|---|---|---|
Python (lattice-sdk) | REST (httpx) | sdk/python/ |
Rust (lattice-client) | gRPC (tonic) | crates/lattice-client/ |
The Rust SDK re-exports all proto types as lattice_client::proto — consumers do not need to depend on lattice-common directly.
Authentication
All API calls require OIDC bearer token. The lattice CLI handles the OIDC flow via hpc-auth (institutional IdP integration). The lattice-api server validates tokens against the configured OIDC provider.
Sensitive tenant tokens include additional claims for audit trail binding.
Scheduling Algorithm
Overview
Lattice uses a multi-dimensional knapsack formulation with a composite cost function, executed independently by each vCluster scheduler. The quorum provides global coordination.
The Knapsack Formulation
Resources (Knapsack Dimensions)
Each scheduling decision must respect multiple resource constraints simultaneously:
| Dimension | Unit | Source |
|---|---|---|
| Nodes | count | Quorum (available nodes owned by or borrowable by vCluster) |
| GPU-hours | nodes × walltime | Derived from allocation request |
| Topology span | group count | Topology model (dragonfly groups consumed) |
| Storage I/O bandwidth | GB/s | VAST API (current utilization + allocation estimate) |
| Power budget | kW | OpenCHAMI BMC telemetry (per-node power draw) |
Value (Cost Function)
Score(j) = Σ wᵢ · fᵢ(j)
Component Functions
f₁: priority_class(j) — Static priority tier (0-10). Sensitive claims are highest. Preemption only moves down tiers.
f₂: wait_time_factor(j) — Anti-starvation. Increases monotonically with time in queue.
f₂(j) = log(1 + wait_seconds / reference_wait)
reference_wait is tunable (default: 1 hour). Log prevents wait time from dominating all other factors.
f₃: fair_share_deficit(j) — How far the tenant is from their contracted share. See quota-enforcement.md for hard vs. soft quota semantics.
f₃(j) = max(0, target_share(tenant) - actual_usage(tenant)) / target_share(tenant)
Ranges from 0 (tenant at or above share) to 1 (tenant has used nothing). Tenants below their share get priority.
f₄: topology_fitness(j) — How well the job fits available topology. For intra-node GPU topology, see gpu-topology.md.
f₄(j) = 1.0 - (groups_needed(j) / max_groups_available)
Jobs that fit in a single group score highest. Penalty for spanning groups scales with group count.
f₅: data_readiness(j) — Is the job’s input data on hot tier?
f₅(j) = fraction_of_input_data_on_hot_tier(j)
If unknown (user didn’t specify data requirements), defaults to 0.5 (neutral).
f₆: backlog_pressure(t) — Global signal, not per-job. High when queue is deep.
f₆(t) = min(1.0, queued_gpu_hours / running_gpu_hours)
Capped at 1.0. Affects all jobs equally — it’s a system-level urgency signal.
f₇: energy_cost(j, t) — Time-varying electricity price at scheduling time.
f₇(j, t) = 1.0 - normalized_energy_price(t)
Jobs score higher when energy is cheap. In federated mode, extends to energy_cost(j, t, site).
f₈: checkpoint_efficiency(j) — How cheaply can this job be preempted?
f₈(j) = 1.0 / (1.0 + estimated_checkpoint_minutes(j))
Jobs with fast checkpointing are more attractive to schedule on borrowed/preemptible nodes.
f₉: conformance_fitness(j, candidates) — How well do the candidate nodes match each other’s configuration?
f₉(j, candidates) = largest_conformance_group_size(candidates) / j.requested_nodes
Scores 1.0 when all candidate nodes share the same conformance fingerprint, lower when the node set is heterogeneous. Critical for multi-node jobs where driver/firmware mismatches cause subtle performance degradation or correctness issues (e.g., NCCL hangs from mismatched NIC firmware).
The conformance fingerprint is a hash of: GPU driver version, NIC firmware version, BIOS/BMC firmware version, and kernel parameters. The node agent computes and reports this fingerprint alongside health data. Nodes with identical fingerprints belong to the same conformance group.
This factor is evaluated during node selection (step 2a in the solver), not during scoring. The solver prefers to select nodes from the largest available conformance group that satisfies the allocation’s constraints.
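Putting the component functions together, Score(j) can be sketched as below. The field names on `j` and the per-component normalizations (e.g. dividing priority_class by 10) are assumptions for illustration; f₉ is omitted because it is applied during node selection rather than scoring, and f₂ uses the default reference_wait of one hour.

```python
import math

def score(j: dict, w: dict, backlog_pressure: float, energy_price: float) -> float:
    """Composite cost function: Score(j) = sum_i w_i * f_i(j). Illustrative sketch."""
    f = {
        "priority":   j["priority_class"] / 10.0,                            # f1
        "wait_time":  math.log(1.0 + j["wait_seconds"] / 3600.0),            # f2
        "fair_share": max(0.0, j["target_share"] - j["actual_usage"])
                      / j["target_share"],                                   # f3
        "topology":   1.0 - j["groups_needed"] / j["max_groups_available"],  # f4
        "data_ready": j.get("hot_fraction", 0.5),                            # f5, neutral default
        "backlog":    min(1.0, backlog_pressure),                            # f6, system-wide
        "energy":     1.0 - energy_price,                                    # f7, normalized price
        "checkpoint": 1.0 / (1.0 + j["est_checkpoint_minutes"]),             # f8
    }
    return sum(w[k] * f[k] for k in w)
```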
See data-staging.md for details on how input data is pre-staged during queue wait to improve f₅ scores. See preemption.md for how preemption classes interact with f₁ priority scoring. See network-domains.md for the VNI assignment that enables topology-aware placement (f₄).
Weight Profiles
| Weight | HPC Batch | ML Training | Service | Sensitive | Interactive |
|---|---|---|---|---|---|
| w₁ (priority) | 0.15 | 0.10 | 0.15 | 0.90 | 0.10 |
| w₂ (wait_time) | 0.20 | 0.10 | 0.05 | 0.00 | 0.30 |
| w₃ (fair_share) | 0.20 | 0.10 | 0.10 | 0.00 | 0.10 |
| w₄ (topology) | 0.15 | 0.25 | 0.05 | 0.00 | 0.00 |
| w₅ (data_ready) | 0.10 | 0.15 | 0.10 | 0.00 | 0.05 |
| w₆ (backlog) | 0.05 | 0.05 | 0.05 | 0.00 | 0.15 |
| w₇ (energy) | 0.00 | 0.05 | 0.10 | 0.00 | 0.00 |
| w₈ (checkpoint) | 0.05 | 0.10 | 0.10 | 0.00 | 0.00 |
| w₉ (conformance) | 0.10 | 0.10 | 0.30 | 0.10 | 0.30 |
Sensitive scheduler is degenerate: priority dominates because node claims are non-negotiable (w₁=0.90). Conformance (w₉=0.10) acts as a tiebreaker among conformant nodes; non-conformant nodes are excluded entirely as a hard constraint at the solver level (step 2a), not via the weight system.
Note: The CostWeights::default() in crates/lattice-common/src/types.rs provides a “balanced HPC” baseline (w₁=0.20, w₂=0.20, w₃=0.20, w₄=0.15, w₅=0.10, w₆=0.05, w₇=0.00, w₈=0.00, w₉=0.10). This is not identical to any named profile in the table above — it is a general-purpose starting point. Each vCluster should have its weights tuned for its workload type, either manually or via RM-Replay simulation.
Solver
The multi-dimensional knapsack is NP-hard in general. For our scale (tens to hundreds of pending large allocations), a greedy heuristic with backfill is sufficient:
Algorithm: GreedyTopologyAwareBackfill
1. Sort pending allocations by Score(j) descending
2. For each allocation j in sorted order:
a. Find the smallest set of available nodes that satisfies:
- Node count >= j.requested_nodes
- All nodes in fewest possible dragonfly groups
- All nodes in same conformance group (prefer) or fewest groups (fallback)
- Constraints satisfied (GPU type, features, etc.)
- Power budget not exceeded
b. If nodes found: PROPOSE allocation to quorum
c. If not found: try backfill (can j fit in gaps left by higher-priority reservations?)
3. Collect quorum responses (commit or reject)
4. For rejected proposals: re-queue, will try next cycle
Scheduling cycle: every 5-30 seconds (configurable per vCluster)
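One pass of the algorithm above can be sketched as a sort-then-propose loop. The callback names (`find_nodes`, `propose`) are illustrative stand-ins for node selection (step 2a) and the quorum round-trip (step 3); backfill is elided.

```python
def scheduling_cycle(pending, score_fn, find_nodes, propose):
    """One GreedyTopologyAwareBackfill pass, as a sketch.

    `find_nodes(alloc)` returns a node set or None; `propose(alloc, nodes)`
    submits to the quorum and returns True on commit.
    """
    requeued = []
    for alloc in sorted(pending, key=score_fn, reverse=True):   # step 1
        nodes = find_nodes(alloc)                               # step 2a
        if nodes is None or not propose(alloc, nodes):          # steps 2b-3
            requeued.append(alloc)                              # step 4: retry next cycle
    return requeued
```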
DAG Dependencies
DAGs (directed acyclic graphs) are first-class workflow primitives. Individual allocations within a DAG are scored by the knapsack solver like any other allocation — the DAG structure controls when allocations enter the queue, not how they are scored. Root allocations enter immediately; downstream allocations enter when their dependency conditions are satisfied. See dag-scheduling.md for the full DAG lifecycle and dependency conditions.
Reactive Scaling
Reactive allocations (autoscaling services) start at min_nodes and scale based on metric thresholds. Scale-up and scale-down are proposed as node ownership changes through the quorum. The knapsack solver handles each scale proposal as a regular allocation change. See autoscaling.md for the scaling loop, metrics, and cooldown behavior.
Elastic Resource Sharing
Nodes can be “borrowed” across vClusters:
vCluster A: 200 dedicated nodes, currently using 150
→ 50 idle nodes advertised as "borrowable" to other vClusters
vCluster B: 100 dedicated nodes, needs 120 for a pending job
→ Borrows 20 nodes from vCluster A's idle pool
→ These borrowed nodes have a preemption penalty in the cost function
→ If vCluster A needs them back: checkpoint + reclaim
The quorum tracks ownership at two levels:
- Home vCluster: permanent assignment (based on tenant contracts)
- Current vCluster: who is actually using the node right now
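The two ownership levels can be sketched as a small ledger that tracks home versus current vCluster per node. This is borrow/reclaim accounting only; in the real system every transition is a node-ownership change committed through the Raft quorum, and reclaim is preceded by a checkpoint.

```python
class NodeLedger:
    """Track home vs. current vCluster per node (illustrative sketch)."""

    def __init__(self, home: dict):
        self.home = dict(home)        # node -> home vCluster (contractual)
        self.current = dict(home)     # node -> vCluster using it right now

    def borrowable(self, vcluster: str, idle: set) -> list:
        """Idle nodes whose home is `vcluster` and that are not already lent out."""
        return [n for n in idle
                if self.home[n] == vcluster and self.current[n] == vcluster]

    def borrow(self, node: str, borrower: str):
        self.current[node] = borrower          # quorum-committed in reality

    def reclaim(self, node: str):
        self.current[node] = self.home[node]   # checkpoint + reclaim
```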
Checkpoint Cost Model
See checkpoint-broker.md for the full checkpoint decision framework.
Summary: checkpoint when Value > Cost, where value includes recompute_saved + preemptability + backlog_relief, and cost includes write_time + compute_waste + storage_cost. Backlog pressure increases checkpoint aggressiveness.
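The decision rule can be written down directly. The scalar terms here are illustrative placeholders for the broker's actual estimates (see checkpoint-broker.md); the backlog_relief term is what grows with queue depth and makes the broker more aggressive under backlog pressure.

```python
def should_checkpoint(recompute_saved: float, preemptability: float,
                      backlog_relief: float, write_time: float,
                      compute_waste: float, storage_cost: float) -> bool:
    """Checkpoint when Value > Cost (terms as in the summary above)."""
    value = recompute_saved + preemptability + backlog_relief
    cost = write_time + compute_waste + storage_cost
    return value > cost
```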
Simulation and Tuning
Use RM-Replay (tools/rm-replay/) to test scheduling configurations:
- Capture production workload traces
- Configure weight profiles
- Replay through simulator
- Evaluate: utilization, wait times, QoS compliance, fairness
- Iterate on weights before deploying to production
Reference: Martinasso et al., “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management” (SC18).
CLI Design
Design Principle
The CLI is the primary user interface. It should feel natural to Slurm users while exposing Lattice’s richer capabilities. Commands follow a consistent lattice <verb> [resource] [flags] pattern. Output is human-readable by default, machine-parseable with --output=json.
Command Structure
lattice <command> [subcommand] [arguments] [flags]
Global Flags
| Flag | Short | Description |
|---|---|---|
--output | -o | Output format: table (default), json, yaml, wide |
--quiet | -q | Suppress non-essential output |
--verbose | -v | Verbose output (debug info) |
--tenant | -t | Override tenant (for multi-tenant users) |
--vcluster | | Override vCluster selection |
--config | | Config file path (default: ~/.config/lattice/config.yaml) |
--no-color | | Disable colored output |
Authentication Commands
Login (lattice login)
Authenticate with the lattice server. Uses hpc-auth for OIDC token acquisition with cascading flow selection.
# Login (auto-discovers IdP from lattice-api auth discovery endpoint)
lattice login
# Force device code flow (for SSH sessions without browser)
lattice login --flow device
# Force manual paste flow
lattice login --flow manual
# Login to a specific server
lattice login --server cluster.example.com
Token is cached per-server in ~/.config/lattice/tokens.json with 0600 permissions (lenient mode: warn and fix if wrong).
Logout (lattice logout)
Clear cached token and revoke at IdP (best-effort).
lattice logout
Unauthenticated Commands
These commands do not require a token (INV-A1):
- lattice login / lattice logout
- lattice --version
- lattice --help
- lattice completions <shell>
All other commands require authentication. If no valid token is cached, the CLI prints:
Not logged in. Run `lattice login` first.
Expired tokens are silently refreshed if a valid refresh token exists.
Core Commands
Submit (lattice submit)
Submit an allocation or batch script.
# Submit a script (Slurm-compatible directives parsed)
lattice submit script.sh
# Submit with inline arguments
lattice submit --nodes=64 --walltime=72h --uenv=prgenv-gnu/24.11:v1 -- torchrun train.py
# Submit a task group (job array)
lattice submit --task-group=0-99%20 script.sh
# Submit with dependencies
lattice submit --depends-on=12345:success script.sh
# Submit a DAG from YAML
lattice dag submit workflow.yaml
# Submit to a specific vCluster
lattice submit --vcluster=ml-training script.sh
Output: Allocation ID on success.
Submitted allocation 12345
Status (lattice status)
Query allocation status.
# List own allocations
lattice status
# Specific allocation
lattice status 12345
# Filter by state
lattice status --state=running
# All allocations (tenant admin)
lattice status --all
# Watch mode (refresh every 5s)
lattice status --watch
Default output (table):
ID NAME STATE NODES WALLTIME ELAPSED VCLUSTER
12345 training-run Running 64 72:00:00 14:23:01 ml-training
12346 eval-job Pending 4 02:00:00 — hpc-batch
12347 sweep Running 1×20 04:00:00 01:12:33 hpc-batch
Wide output (-o wide): Adds columns: tenant, project, uenv, GPU type, dragonfly groups.
Cancel (lattice cancel)
Cancel allocations.
# Cancel single
lattice cancel 12345
# Cancel multiple
lattice cancel 12345 12346 12347
# Cancel all own pending allocations
lattice cancel --state=pending --all-mine
# Cancel a DAG
lattice dag cancel dag-789
Session (lattice session)
Create an interactive session. See sessions.md for details.
# Basic session
lattice session --walltime=4h
# With resources
lattice session --nodes=2 --constraint=gpu_type:GH200 --walltime=8h
# With uenv
lattice session --uenv=prgenv-gnu/24.11:v1 --walltime=4h
Attach (lattice attach)
Attach a terminal to a running allocation. See observability.md.
lattice attach 12345
lattice attach 12345 --node=x1000c0s0b0n3
lattice attach 12345 --command="nvidia-smi -l 1"
Launch (lattice launch)
Run a task within an existing allocation (srun equivalent).
# Run on all nodes
lattice launch --alloc=12345 hostname
# Run on specific number of tasks
lattice launch --alloc=12345 -n 4 ./my_program
# Run interactively with PTY
lattice launch --alloc=12345 --pty bash
Logs (lattice logs)
View allocation logs. See observability.md.
lattice logs 12345
lattice logs 12345 --follow
lattice logs 12345 --stderr --node=x1000c0s0b0n3
lattice logs 12345 --tail=100
Top / Watch / Diag / Compare
Monitoring commands. See observability.md.
lattice top 12345 # Metrics snapshot
lattice top 12345 --per-gpu # Per-GPU breakdown
lattice watch 12345 # Live streaming metrics
lattice watch 12345 --alerts-only # Alerts only
lattice diag 12345 # Network + storage diagnostics
lattice compare 12345 12346 --metric=gpu_util # Cross-allocation comparison
Telemetry (lattice telemetry)
Switch telemetry mode.
lattice telemetry --alloc=12345 --mode=debug --duration=30m
Nodes (lattice nodes)
View cluster nodes (read-only).
# List all nodes
lattice nodes
# Filter by state
lattice nodes --state=ready
# Filter by vCluster
lattice nodes --vcluster=hpc-batch
# Specific node details
lattice nodes x1000c0s0b0n0
Output:
NODE STATE GPUS VCLUSTER TENANT GROUP CONFORMANCE
x1000c0s0b0n0 Ready 4×GH200 hpc-batch physics 3 a1b2c3
x1000c0s0b0n1 Ready 4×GH200 hpc-batch physics 3 a1b2c3
x1000c0s1b0n0 Draining 4×GH200 ml-training ml-team 7 a1b2c3
History (lattice history)
Query completed allocations (accounting data).
lattice history
lattice history --since=2026-03-01 --until=2026-03-02
lattice history --output=json
DAG Commands (lattice dag)
lattice dag submit workflow.yaml # Submit a DAG
lattice dag status dag-789 # DAG status with per-allocation states
lattice dag list # List DAGs
lattice dag cancel dag-789 # Cancel a DAG
Cache Commands (lattice cache)
lattice cache warm --image=prgenv-gnu/24.11:v1 --group=3
lattice cache status --node=x1000c0s0b0n0
lattice cache evict --image=prgenv-gnu/24.11:v1 --node=x1000c0s0b0n0
Admin Commands (lattice admin)
Administrative commands require system-admin role.
# Node management
lattice node drain x1000c0s0b0n0
lattice node drain x1000c0s0b0n0 --urgent
lattice node undrain x1000c0s0b0n0
lattice node disable x1000c0s0b0n0
lattice node enable x1000c0s0b0n0
# Tenant management
lattice admin tenant create --name=physics --max-nodes=200
lattice admin tenant set-quota --name=physics --max-nodes=250
# vCluster management
lattice admin vcluster create --name=hpc-batch --scheduler=hpc-backfill --tenant=physics
lattice admin vcluster set-weights --name=hpc-batch --priority=0.20 ...
# Configuration
lattice admin config get accounting.enabled
lattice admin config set accounting.enabled=true
# Raft status
lattice admin raft status
Output Formats
| Format | Flag | Use Case |
|---|---|---|
table | Default | Human-readable, aligned columns |
wide | -o wide | Extended columns |
json | -o json | Machine-parseable, scripting |
yaml | -o yaml | Machine-parseable, config integration |
All formats support piping and redirection. JSON output uses newline-delimited JSON for streaming commands (`logs --follow`, `watch`).
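Because streaming commands emit newline-delimited JSON, a client can parse one event per line without buffering the whole stream. A minimal sketch (the event fields shown are illustrative, not a confirmed schema):

```python
import json
from typing import Iterator

def ndjson_events(lines) -> Iterator[dict]:
    """Parse a newline-delimited JSON stream (e.g. `lattice logs --follow -o json`),
    skipping blank keep-alive lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulated stream; real input would come from the CLI's stdout.
stream = [
    '{"ts": "2026-03-01T10:00:00Z", "node": "x1000c0s0b0n0", "line": "step 100"}',
    '',
    '{"ts": "2026-03-01T10:00:01Z", "node": "x1000c0s0b0n0", "line": "step 101"}',
]
events = list(ndjson_events(stream))
print(len(events))  # → 2
```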
Error Messages
Errors are human-readable with actionable guidance:
Error: allocation rejected — tenant "physics" exceeds max_nodes quota
Current: 195 nodes in use
Requested: 10 additional nodes
Limit: 200 nodes
Hint: Cancel running allocations or request a quota increase from your tenant admin.
Error: no nodes available matching constraints
GPU type: GH200
Nodes requested: 64
Available: 42 (22 in use by your allocations, 136 by other tenants)
Hint: Reduce node count, use --topology=any, or wait for resources.
Shell Completion
Shell completion is generated for bash, zsh, and fish:
# Generate completion
lattice completions bash > /etc/bash_completion.d/lattice
lattice completions zsh > ~/.zfunc/_lattice
lattice completions fish > ~/.config/fish/completions/lattice.fish
Completions cover: subcommands, flag names, allocation IDs (from recent lattice status), node IDs, vCluster names, uenv names.
Configuration File
# ~/.config/lattice/config.yaml
api_url: "https://lattice.example.com:50051"
default_tenant: "physics"
default_vcluster: "hpc-batch"
default_uenv: "prgenv-gnu/24.11:v1"
output_format: "table"
color: true
Environment variables override config file: LATTICE_API_URL, LATTICE_TENANT, LATTICE_VCLUSTER.
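The override order can be sketched as a small resolver: environment variables win over the config file, which wins over built-in defaults. The `ENV_KEYS` mapping covers the variables documented here; everything else about the resolver is an assumption, not the actual CLI implementation:

```python
# Precedence (highest wins): environment variable > config file > default.
# Mapping of config keys to their documented environment overrides.
ENV_KEYS = {
    "api_url": "LATTICE_API_URL",
    "default_tenant": "LATTICE_TENANT",
    "default_vcluster": "LATTICE_VCLUSTER",
}

def resolve(key: str, file_cfg: dict, env: dict, default=None):
    """Resolve one config value according to the documented precedence."""
    env_var = ENV_KEYS.get(key)
    if env_var and env_var in env:
        return env[env_var]
    return file_cfg.get(key, default)

cfg = {"api_url": "https://lattice.example.com:50051", "default_tenant": "physics"}
print(resolve("default_tenant", cfg, {"LATTICE_TENANT": "ml-team"}))  # → ml-team
print(resolve("api_url", cfg, {}))  # → https://lattice.example.com:50051
```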
Slurm Compatibility Aliases
For sites migrating from Slurm, optional shell aliases:
# Source from lattice-provided script
source $(lattice compat-aliases)
# Provides:
# sbatch → lattice submit
# squeue → lattice status
# scancel → lattice cancel
# salloc → lattice session
# srun → lattice launch
# sinfo → lattice nodes
# sacct → lattice history
These aliases translate Slurm flags to Lattice flags where possible. See slurm-migration.md for details.
Cross-References
- api-design.md — API endpoints that CLI commands map to
- sessions.md — Interactive session lifecycle
- observability.md — Monitoring commands (top, watch, diag, compare)
- slurm-migration.md — Slurm command translation details
Telemetry Architecture
Design Principle
Collect at high resolution, aggregate at configurable resolution, transmit out-of-band.
Three-Layer Pipeline
Layer 1: Collection (eBPF, always-on)
eBPF programs JIT-compiled into kernel, attached to tracepoints and kprobes.
Kernel-level metrics:
- CPU: context switches, runqueue depth, scheduling latency histograms
- Network: per-flow bytes/packets, Slingshot CSIG congestion signals from packet headers
- Block I/O: latency histograms, throughput per device (NVMe scratch, network mounts)
- Memory: allocation/free rates, NUMA locality, page faults
GPU metrics (via NVML/DCGM hooks):
- SM occupancy, memory utilization, power draw
- PCIe/NVLink throughput
- ECC error counts (feeds into checkpoint cost model)
Collection overhead: ~0.3% on compute-bound workloads. eBPF programs run in kernel context, with no syscall overhead and no userspace daemon polling.
Data flows into per-CPU ring buffers (BPF_MAP_TYPE_RINGBUF), consumed by the node agent.
Layer 2: Aggregation (Node Agent, switchable)
The node agent reads ring buffers and aggregates based on the current mode.
Mode: prod (default)
- 30-second aggregation windows
- Statistical summaries: p50, p95, p99, mean, max, count
- Cubic spline interpolation for time-series smoothing (reduces storage, preserves trends)
- Transmitted on Slingshot telemetry traffic class (separate from compute traffic)
- Additional overhead: ~0.1%
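The prod-mode summaries above amount to a per-window reducer over raw samples. A minimal sketch; nearest-rank percentiles stand in for whatever estimator the node agent actually uses:

```python
from statistics import mean

def summarize_window(samples: list[float]) -> dict:
    """Aggregate one 30-second window into the summary statistics listed
    above (p50/p95/p99, mean, max, count)."""
    s = sorted(samples)

    def pct(p: float) -> float:
        # Nearest-rank percentile; an illustrative stand-in estimator.
        return s[min(len(s) - 1, int(p / 100 * len(s)))]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99),
            "mean": mean(s), "max": s[-1], "count": len(s)}

window = [float(v) for v in range(100)]  # e.g. 100 latency samples
print(summarize_window(window)["p95"])  # → 95.0
```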
Mode: debug (per-job or per-node, time-limited)
- 1-second or sub-second raw event streams
- Full per-flow network traces
- GPU kernel-level profiling (CUPTI integration)
- Stored to job-specific S3 path for user analysis
- Additional overhead: ~2-5% (acceptable for debugging)
- Auto-reverts to prod after configured duration (default: 30 minutes)
Mode: audit (sensitive vCluster)
- All file access events (open, read, write, close) with user identity
- All API calls logged with request/response metadata
- Network flow summaries (source, destination, bytes, duration)
- Signed with Sovra keys (if federation enabled) for tamper evidence
- Additional overhead: ~1%
- Retention: 7 years (cold tier, S3-compatible archive)
Layer 3: Storage and Query
Time-series store — recommended: VictoriaMetrics (single-node or cluster) for single-site deployments; Thanos on top of Prometheus for federated multi-site deployments that need a global query view across sites:
- Ingestion: all nodes stream aggregated metrics
- Auto-downsampling: raw → 1m → 5m → 1h → 1d
- Retention policy configurable per tenant/vCluster
Four materialized views (label-based access control):
| View | Audience | Content |
|---|---|---|
| Holistic | System admins | System-wide utilization, power, health, scheduling efficiency |
| Tenant | Tenant admins | Per-tenant resource usage, quota tracking, job statistics |
| vCluster | Scheduler | Metrics feeding into cost function (GPU util, I/O, congestion) |
| User | Allocation owners | Per-allocation metrics scoped by OIDC identity (via lattice-api) |
Query interface: PromQL-compatible API. Grafana dashboards for visualization.
Debug traces: Stored to s3://{tenant}/{project}/{job_id}/telemetry/ with short retention (7 days default, configurable).
Audit logs: Append-only, encrypted at rest, stored to dedicated audit storage with long retention. Queryable for compliance reporting.
Switching Telemetry Mode
Via Intent API:
PATCH /v1/allocations/{id}
{ "telemetry": { "mode": "debug", "duration": "30m" } }
Via CLI:
lattice telemetry --alloc=12345 --mode=debug --duration=30m
Switching is instant — the eBPF programs are always collecting at full resolution. Only the aggregation behavior changes.
User-Facing Telemetry Query
The telemetry pipeline serves admin dashboards and the scheduler cost function. The user-facing query layer adds scoped access so allocation owners can query their own metrics without admin intervention.
Query Path
User → lattice-api → PromQL (scoped by alloc/tenant/user) → TSDB → response
The lattice-api injects label filters to ensure users only see metrics for their own allocations. Tenant admins can query any allocation within their tenant.
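Label injection can be sketched as a rewrite over metric selectors. The label names (`alloc_id`, `tenant`) and the naive regex are illustrative only, not the actual lattice-api implementation (which would need a real PromQL parser to handle existing label sets):

```python
import re

# Illustrative list of user-visible metric prefixes from the catalog below.
_METRIC = r'\b(gpu_\w+|cpu_\w+|memory_\w+|network_\w+|io_\w+)\b'

def scope_query(promql: str, alloc_id: str, tenant: str) -> str:
    """Append scoping label matchers to every bare metric selector so a
    caller can only see their own allocation's series."""
    labels = f'{{alloc_id="{alloc_id}",tenant="{tenant}"}}'
    return re.sub(_METRIC, lambda m: m.group(1) + labels, promql)

print(scope_query("avg(gpu_utilization)", "12345", "physics"))
# → avg(gpu_utilization{alloc_id="12345",tenant="physics"})
```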
Scoping Rules
| Caller | Visible Scope |
|---|---|
| Allocation owner | Metrics for their own allocations |
| Tenant admin | Metrics for any allocation in their tenant |
| System admin | All metrics (holistic view) |
User Metrics Catalog
| Metric | Description | Available In |
|---|---|---|
gpu_utilization | SM occupancy per GPU | prod, debug, audit |
gpu_memory_used | GPU memory in use | prod, debug, audit |
gpu_power_draw | GPU power consumption | prod, debug, audit |
cpu_utilization | CPU usage per node | prod, debug, audit |
memory_used | System memory in use | prod, debug, audit |
network_tx_bytes | Network bytes sent per second | prod, debug, audit |
network_rx_bytes | Network bytes received per second | prod, debug, audit |
io_read_bytes | Storage read throughput | prod, debug, audit |
io_write_bytes | Storage write throughput | prod, debug, audit |
io_latency_p99 | Storage I/O latency (p99) | prod, debug, audit |
Telemetry Streaming
For use cases requiring push-based updates (e.g., lattice watch), the StreamMetrics RPC fans out to node agents running the target allocation and merges their streams.
Architecture
lattice-api receives StreamMetrics request
→ identifies nodes running allocation (from quorum state)
→ opens per-node metric streams to node agents
→ merges streams with allocation-scoped labels
→ returns unified server-streaming response to client
In prod mode, node agents emit aggregated snapshots every 30 seconds. In debug mode, raw events stream at 1-second intervals. The client receives the same resolution as the current telemetry mode — switching mode (via PATCH /v1/allocations/{id}) takes effect on active streams.
Alert Generation
Node agents evaluate threshold rules locally and inject MetricAlert events into the stream when:
- GPU utilization < 10% for > 60s (potential hang)
- GPU memory > 95% (OOM risk)
- Network error rate exceeds 0.1%
- I/O p99 latency exceeds 10ms
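The first rule (low GPU utilization sustained for over 60 s) can be sketched as a run-length check over aggregated samples; the exact evaluator in the node agent may differ:

```python
def low_util_alert(samples, threshold=10.0, hold_s=60, interval_s=30) -> bool:
    """True if utilization stayed below `threshold` for roughly `hold_s`
    seconds, given samples spaced `interval_s` apart (prod mode: 30 s)."""
    needed = hold_s // interval_s + 1  # consecutive samples spanning ~hold_s
    run = 0
    for v in samples:
        run = run + 1 if v < threshold else 0
        if run >= needed:
            return True
    return False

print(low_util_alert([85, 4, 3, 2, 90]))  # → True (3 low samples ≈ 60 s)
```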
Cross-Allocation Comparison
Users can compare metrics across multiple allocations (e.g., successive training runs) via the CompareMetrics RPC or GET /v1/compare.
TSDB Query
The lattice-api issues parallel PromQL queries for each allocation ID, scoped to the requesting user’s permissions. Results are aligned by relative time (see below).
Relative Time Alignment
Allocations may run at different wall-clock times. Comparison uses relative-to-start alignment: each allocation’s metric series is indexed from t=0 (the allocation’s started_at timestamp). This allows apples-to-apples comparison of metrics across runs that started hours or days apart.
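Relative-to-start alignment amounts to re-indexing each series by its own `started_at`. A minimal sketch:

```python
from datetime import datetime, timezone

def align_to_start(series, started_at: datetime):
    """Re-index (timestamp, value) pairs to seconds since allocation start,
    so runs that began at different wall-clock times can be overlaid."""
    t0 = started_at.timestamp()
    return [(ts.timestamp() - t0, v) for ts, v in series]

start = datetime(2026, 3, 1, 10, 0, tzinfo=timezone.utc)
series = [(datetime(2026, 3, 1, 10, 0, 30, tzinfo=timezone.utc), 91.0),
          (datetime(2026, 3, 1, 10, 1, 0, tzinfo=timezone.utc), 93.5)]
print(align_to_start(series, start))  # → [(30.0, 91.0), (60.0, 93.5)]
```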
Feedback to Scheduler
The telemetry system feeds key metrics back to the scheduling cost function:
| Metric | Cost Function Component | Effect |
|---|---|---|
| GPU utilization per job | Efficiency scoring | Low util → deprioritize for topology-premium placement |
| Network congestion (CSIG) | topology_fitness | Congested groups → avoid placing new jobs there |
| I/O throughput per job | data_readiness | High I/O demand → ensure storage QoS before scheduling |
| Node ECC errors | checkpoint cost model | Rising errors → increase checkpoint urgency |
| Power draw per node | energy_cost | Feeds into power budget constraint |
Telemetry Aggregation Topology
For large systems (10,000+ nodes), direct streaming to a central store creates an ingestion bottleneck. Use hierarchical aggregation:
Nodes (per-group) → Group Aggregator → Central Store
- Each Slingshot dragonfly group has a designated aggregator node.
- Group aggregators perform first-level aggregation (merging per-node summaries).
- The central store receives per-group aggregated streams.
- Debug mode bypasses group aggregation; that job's nodes stream directly to the central store.
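A group aggregator's first-level merge can be sketched for the simple statistics. This is illustrative only: count, max, and a count-weighted mean merge exactly from per-node summaries, while percentiles do not, which is why real aggregators merge histograms or sketches instead:

```python
def merge_summaries(per_node: list[dict]) -> dict:
    """Merge per-node window summaries into one per-group summary."""
    count = sum(s["count"] for s in per_node)
    return {
        "count": count,
        "max": max(s["max"] for s in per_node),
        # Count-weighted mean is exact; percentiles would not be.
        "mean": sum(s["mean"] * s["count"] for s in per_node) / count,
    }

a = {"count": 10, "max": 9.0, "mean": 5.0}
b = {"count": 30, "max": 7.0, "mean": 3.0}
print(merge_summaries([a, b]))  # → {'count': 40, 'max': 9.0, 'mean': 3.5}
```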
Scheduler Self-Monitoring
Internal metrics for monitoring Lattice’s own health. These metrics feed into canary criteria during rolling upgrades (cross-ref: upgrades.md) and are available on the holistic dashboard.
Scheduling Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_scheduling_cycle_duration_seconds | histogram | vcluster | Time to complete one scheduling cycle |
lattice_scheduling_queue_depth | gauge | vcluster | Number of pending allocations |
lattice_scheduling_proposals_total | counter | vcluster, result (accepted/rejected) | Proposals sent to quorum |
lattice_scheduling_cost_function_duration_seconds | histogram | vcluster | Time to evaluate the cost function for all candidates |
lattice_scheduling_backfill_jobs_total | counter | vcluster | Allocations placed via backfill |
Quorum Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_raft_leader | gauge | member_id | 1 if this member is leader, 0 if follower |
lattice_raft_commit_latency_seconds | histogram | member_id | Time from proposal to commit |
lattice_raft_log_entries | gauge | member_id | Number of entries in the Raft log |
lattice_raft_snapshot_duration_seconds | histogram | member_id | Time to create a Raft snapshot |
API Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_api_requests_total | counter | method, status | Total API requests |
lattice_api_request_duration_seconds | histogram | method | Request latency |
lattice_api_active_streams | gauge | stream_type (attach/logs/metrics) | Active streaming connections |
Node Agent Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_agent_heartbeat_latency_seconds | histogram | node_id | Heartbeat round-trip time |
lattice_agent_allocation_startup_seconds | histogram | node_id | Time from allocation assignment to process start (includes uenv pull/mount) |
lattice_agent_ebpf_overhead_percent | gauge | node_id | Measured eBPF collection overhead |
Accounting Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_accounting_events_buffered | gauge | — | Events in the in-memory accounting buffer |
lattice_accounting_events_dropped_total | counter | — | Events dropped due to buffer overflow |
Federation Broker Metrics
When federation is enabled, the federation broker exposes additional metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
lattice_federation_proposals_total | counter | peer, result (accepted/rejected/timeout) | Placement proposals sent to/from peers |
lattice_federation_proposal_latency_seconds | histogram | peer | Round-trip time for federation proposals |
lattice_federation_peer_status | gauge | peer | 1 = connected, 0 = unreachable |
lattice_federation_data_gravity_score | gauge | peer, dataset | Data gravity score for placement decisions (higher = more data at peer) |
These metrics are only active when federation.enabled = true. The federation broker exposes them on the same /metrics endpoint as other components (default port: 9105).
Alerting Rules
Example alerting rules (PromQL-compatible):
| Rule | Condition | Severity |
|---|---|---|
| Scheduling cycle slow | histogram_quantile(0.99, lattice_scheduling_cycle_duration_seconds) > 30 | warning |
| Queue depth high | lattice_scheduling_queue_depth > 100 for 5 minutes | warning |
| Raft commit slow | histogram_quantile(0.99, lattice_raft_commit_latency_seconds) > 5 | critical |
| Node heartbeat missing | time() - lattice_agent_last_heartbeat_timestamp > 60 | node degraded |
| API error rate spike | rate(lattice_api_requests_total{status=~"5.."}[5m]) / rate(lattice_api_requests_total[5m]) > 0.05 | warning |
| Accounting buffer filling | lattice_accounting_events_buffered > 8000 | warning |
| VNI pool exhaustion approaching | (lattice_network_vni_pool_total - lattice_network_vni_pool_available) / lattice_network_vni_pool_total > 0.90 | warning |
| Quota utilization high | lattice_quota_used_nodes / lattice_quota_max_nodes > 0.95 for 10 minutes | warning |
| Raft disk usage high | lattice_raft_disk_used_bytes / lattice_raft_disk_total_bytes > 0.80 | warning |
| Snapshot storage growth | rate(lattice_raft_snapshot_size_bytes[1h]) > 100e6 | info |
Dashboard Views
Three views matching the existing telemetry pattern:
| Dashboard | Audience | Key Panels |
|---|---|---|
| Holistic | System admins | All scheduler cycle times, quorum health, total queue depth, API throughput |
| Per-vCluster | Scheduler operators | vCluster-specific queue depth, cycle time, proposal accept rate, backfill rate |
| Per-quorum-member | Quorum operators | Raft log size, commit latency, leader status, snapshot timing |
Monitoring Deployment
Prometheus Scrape Configuration
All Lattice components expose metrics on a /metrics endpoint (Prometheus exposition format):
| Component | Default Metrics Port | Endpoint |
|---|---|---|
| Quorum members | 9100 | http://{quorum-host}:9100/metrics |
| API servers | 9101 | http://{api-host}:9101/metrics |
| vCluster schedulers | 9102 | http://{scheduler-host}:9102/metrics |
| Node agents | 9103 | http://{node-host}:9103/metrics |
| Checkpoint broker | 9104 | http://{checkpoint-host}:9104/metrics |
Example Prometheus scrape config:
scrape_configs:
- job_name: "lattice-quorum"
static_configs:
- targets: ["quorum-1:9100", "quorum-2:9100", "quorum-3:9100"]
- job_name: "lattice-api"
static_configs:
- targets: ["api-1:9101", "api-2:9101"]
- job_name: "lattice-scheduler"
static_configs:
- targets: ["scheduler-hpc:9102", "scheduler-ml:9102", "scheduler-interactive:9102"]
- job_name: "lattice-agents"
file_sd_configs:
- files: ["/etc/prometheus/lattice-agents.json"]
refresh_interval: 5m
# Node agents are numerous; use file-based service discovery
# populated from OpenCHAMI node inventory
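The target file referenced by `file_sd_configs` follows Prometheus's standard file-based service discovery format (a JSON array of target groups); the labels shown are illustrative:

```json
[
  {
    "targets": ["x1000c0s0b0n0:9103", "x1000c0s0b0n1:9103"],
    "labels": { "job": "lattice-agents", "dragonfly_group": "3" }
  }
]
```

An inventory sync job can regenerate this file from OpenCHAMI; Prometheus picks up changes on its own at the configured `refresh_interval`, with no reload required.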
Alert Routing
Alerts are routed via Alertmanager (or compatible system):
| Severity | Route | Response Time |
|---|---|---|
| Critical | PagerDuty / on-call | Immediate (< 15 min) |
| Warning | Slack #lattice-alerts | Business hours (< 4 hours) |
| Info | Slack #lattice-info | Best effort |
Example Alertmanager route:
route:
receiver: "slack-info"
routes:
- match: { severity: "critical" }
receiver: "pagerduty-oncall"
- match: { severity: "warning" }
receiver: "slack-alerts"
Grafana Dashboards
Pre-built dashboards cover the three views described above, plus per-node and user-allocation views. Dashboards are defined as JSON and version-controlled in infra/grafana/:
infra/grafana/
├── holistic.json # System-wide overview
├── per-vcluster.json # vCluster-specific scheduling
├── per-quorum-member.json # Raft health
├── per-node.json # Individual node health
└── user-allocation.json # User-facing allocation metrics
Each dashboard uses the standard Lattice metric names. Data source: Prometheus (or compatible TSDB).
TSDB Sizing
| Cluster Size | Metric Cardinality | Ingestion Rate | Storage (30-day retention) |
|---|---|---|---|
| 100 nodes | ~50,000 series | ~10k samples/s | ~50 GB |
| 1,000 nodes | ~500,000 series | ~100k samples/s | ~500 GB |
| 10,000 nodes | ~5,000,000 series | ~1M samples/s | ~5 TB |
For clusters > 1000 nodes, use a horizontally scalable TSDB (VictoriaMetrics cluster, Mimir, or Thanos) with the hierarchical aggregation described in the Telemetry Aggregation Topology section above.
User-Facing Observability & Debugging
Design Principle
Lattice already collects high-resolution telemetry (eBPF, TSDB, three aggregation modes) for operator and scheduler use. This document describes the user-facing surface that lets job owners debug, monitor, and profile their own allocations without admin intervention.
All observability data flows through existing pipelines — no new collection infrastructure is required. The user-facing layer adds scoped query access, streaming endpoints, and interactive attach.
Overview
┌─ User ───────────────────────────────────────────────────────┐
│ lattice attach / logs / top / watch / diag / compare │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌─────────── lattice-api (gRPC + REST) ───────────────┐ │
│ │ Attach ──────────────── bidir stream to node agent │ │
│ │ Logs ────────────────── ring buffer (live) + S3 │ │
│ │ Metrics ─────────────── PromQL query to TSDB │ │
│ │ StreamMetrics ───────── fan-out to node agents │ │
│ │ Diagnostics ─────────── TSDB + fabric telemetry │ │
│ │ Compare ─────────────── multi-alloc TSDB query │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ Node Agents S3 logs TSDB Slingshot CSIG │
└──────────────────────────────────────────────────────────────┘
| Capability | Data Source | Latency | CLI Command |
|---|---|---|---|
| Attach to running allocation | Node agent (nsenter) | Real-time | lattice attach <id> |
| Log streaming (live tail) | Node agent ring buffer | Sub-second | lattice logs <id> --follow |
| Historical logs | S3 | Seconds | lattice logs <id> |
| Live metrics (top) | TSDB | 30s (prod mode) | lattice top <id> |
| Live telemetry stream (watch) | Node agents (push) | 1-30s | lattice watch <id> |
| Diagnostics | TSDB + fabric telemetry | 30s | lattice diag <id> |
| Cross-allocation comparison | TSDB | Seconds | lattice compare <id1> <id2> |
| Application profiling | User tools (via tools_uenv) | N/A | User-driven |
Attach to Running Allocation
Architecture
The attach mechanism provides an interactive terminal session inside a running allocation’s execution environment. The node agent uses nsenter to enter the allocation’s mount and network namespaces — this is not a new allocation, just a terminal session in the existing one.
User → lattice-cli → lattice-api → gRPC bidir stream → node agent
│
nsenter into
mount/net ns
│
PTY ↔ shell
Terminal Protocol
The gRPC bidirectional stream carries:
- Client → Server: stdin bytes, terminal resize events, signals (SIGINT, SIGTSTP)
- Server → Client: stdout/stderr bytes, exit code on completion
The stream begins with an AttachStart message specifying the target node (for multi-node allocations) and command (default: user’s shell).
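A hypothetical sketch of the `AttachStart` message (field names and numbering are illustrative, not the actual lattice-common schema):

```protobuf
// Illustrative sketch only — not the actual lattice-common definition.
message AttachStart {
  string allocation_id = 1;
  string node_id = 2;    // optional: target node for multi-node allocations
  string command = 3;    // default: user's shell
  uint32 term_rows = 4;  // initial terminal dimensions for the PTY
  uint32 term_cols = 5;
}
```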
Authorization Model
| vCluster Type | Who Can Attach | Additional Constraints |
|---|---|---|
| HPC (backfill) | Allocation owner | — |
| Service (bin-pack) | Allocation owner | — |
| Interactive (FIFO) | Allocation owner | Already has session; attach is secondary terminal |
| Sensitive (reservation) | Claiming user only | Session recorded, audit trail, signed uenv only |
Sensitive Constraints
- Only the user who claimed the nodes (identity from Raft audit log) can attach
- All attach sessions are recorded (input + output) to the sensitive audit log
- Attach is only permitted when the allocation runs a signed uenv
- Session start/end events are Raft-committed audit entries
Attach During Node Crash
If the node hosting an attach session crashes or becomes unreachable:
- The gRPC bidirectional stream is dropped (connection reset).
- The API server detects the stream drop and sets `ended_at` on the `AttachSession` record.
- For sensitive allocations, the session end event is recorded in the audit log with reason `node_unreachable`.
- The client receives a stream error and can display: "connection to node lost — attach session ended".
Attach During Preemption
If the allocation is preempted while an attach session is active, the session is terminated gracefully. See sessions.md for the detailed preemption sequence. If the allocation is in Checkpointing state, new attach requests are rejected with: "allocation is being checkpointed — attach unavailable until rescheduled".
CLI Usage
# Attach to allocation (first node, user's shell)
lattice attach 12345
# Attach to a specific node
lattice attach 12345 --node=x1000c0s0b0n3
# Attach with a specific command
lattice attach 12345 --command="nvidia-smi -l 1"
Slurm Compatibility
| Slurm | Lattice |
|---|---|
srun --jobid=123 --pty bash | lattice attach 123 |
Log Streaming
Dual-Path Architecture
Logs use two paths to balance latency and durability:
1. Ring buffer (live tail): Each node agent maintains a per-allocation ring buffer (default 64 MB) of stdout/stderr. It supports low-latency streaming for `--follow` mode. Data is ephemeral — lost when the allocation ends or the buffer wraps.
2. S3 persistence: Node agents periodically flush log chunks to S3 for durable storage, available during and after allocation execution.
Process stdout/stderr
│
├──→ Ring buffer (node agent, 64 MB)
│ │
│ └──→ gRPC StreamLogs (live tail)
│
└──→ S3 flush (periodic, configurable interval)
│
└──→ REST GET /logs (historical)
Log Storage Layout
s3://{tenant}/{project}/{alloc_id}/logs/
├── stdout/{node_id}/{chunk_000..N}.log.zst
├── stderr/{node_id}/{chunk_000..N}.log.zst
└── metadata.json # timestamps, byte offsets, node list
Logs are compressed with zstd. The metadata file enables efficient range queries by time or byte offset.
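With per-chunk timestamps recorded in `metadata.json`, a range query only needs to fetch the chunks that overlap the requested window. A minimal sketch, assuming a hypothetical metadata schema:

```python
def chunks_in_range(metadata: dict, since: float, until: float) -> list[str]:
    """Select log chunks overlapping [since, until], using per-chunk
    start/end timestamps (metadata schema here is illustrative)."""
    return [c["key"] for c in metadata["chunks"]
            if c["end_ts"] >= since and c["start_ts"] <= until]

meta = {"chunks": [
    {"key": "stdout/x1000c0s0b0n0/chunk_000.log.zst", "start_ts": 0, "end_ts": 600},
    {"key": "stdout/x1000c0s0b0n0/chunk_001.log.zst", "start_ts": 600, "end_ts": 1200},
]}
print(chunks_in_range(meta, 700, 800))
# → ['stdout/x1000c0s0b0n0/chunk_001.log.zst']
```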
Streaming (Live Tail)
Via gRPC StreamLogs RPC (server-streaming). The client specifies:
- Allocation ID
- Stream filter: stdout, stderr, or both
- Node filter: specific node or all nodes
- Follow mode: whether to keep streaming as new output arrives
- Tail lines: number of lines from the ring buffer to replay on connect
Historical Log Access
Via REST GET /v1/allocations/{id}/logs:
- Query params: `stream` (stdout/stderr), `node`, `since`, `until`, `offset`, `limit`
- Returns paginated log entries from S3
- Available after allocation completion (subject to retention policy)
Sensitive Constraints
- Logs from sensitive allocations are encrypted at rest in the dedicated sensitive S3 pool
- All log access events are recorded in the sensitive audit log
- Log retention follows sensitive data retention policy (user-specified, minimum per regulation)
- Logs are only accessible to the claiming user and designated compliance reviewers
CLI Usage
# View logs (all nodes, both streams)
lattice logs 12345
# Follow mode (live tail)
lattice logs 12345 --follow
# Filter by stream and node
lattice logs 12345 --stderr --node=x1000c0s0b0n3
# Tail last 100 lines
lattice logs 12345 --tail=100
# Historical range
lattice logs 12345 --since="2026-03-01T10:00:00Z" --until="2026-03-01T11:00:00Z"
Slurm Compatibility
| Slurm | Lattice |
|---|---|
cat slurm-123.out | lattice logs 123 |
tail -f slurm-123.out | lattice logs 123 --follow |
User-Facing Live Metrics (lattice top)
Query Path
Metrics are served from the TSDB (not directly from node agents). The lattice-api translates user queries into PromQL, scoped to the requesting user’s allocations.
lattice top <id> → lattice-api → PromQL → TSDB → response
This reuses the existing telemetry pipeline. In prod mode, data has 30-second resolution. In debug mode (if switched), 1-second resolution.
Metrics Catalog
| Metric | Description | Unit |
|---|---|---|
gpu_utilization | SM occupancy per GPU | % |
gpu_memory_used | GPU memory in use | bytes |
gpu_power_draw | GPU power consumption | watts |
cpu_utilization | CPU usage per node | % |
memory_used | System memory in use | bytes |
network_tx_bytes | Network bytes sent | bytes/s |
network_rx_bytes | Network bytes received | bytes/s |
io_read_bytes | Storage read throughput | bytes/s |
io_write_bytes | Storage write throughput | bytes/s |
io_latency_p99 | Storage I/O latency (p99) | microseconds |
Display Modes
| Mode | Flag | Content |
|---|---|---|
| Summary (default) | — | Aggregated across all nodes: mean GPU%, total mem, total I/O |
| Per-node | --per-node | One row per node |
| Per-GPU | --per-gpu | One row per GPU across all nodes |
| Wide | --wide | All metrics in a wide table |
REST + gRPC Access
- REST:
GET /v1/allocations/{id}/metrics?mode=summary&duration=5m - gRPC:
QueryMetricsRPC withMetricsQueryRequest
CLI Usage
# Summary view (default)
lattice top 12345
# Per-node breakdown
lattice top 12345 --per-node
# Per-GPU breakdown
lattice top 12345 --per-gpu
# Wide format with all metrics
lattice top 12345 --wide
# Custom time window
lattice top 12345 --duration=1h
Live Telemetry Stream (lattice watch)
Push-Based Event Stream
Unlike lattice top (which queries TSDB), lattice watch opens a push-based stream from node agents for near-real-time events.
lattice watch <id> → lattice-api → fan-out → node agents
↑
stream merge
↑
per-node MetricsEvent streams
Relationship to Telemetry Modes
| Telemetry Mode | lattice top Resolution | lattice watch Resolution |
|---|---|---|
| prod | 30s (TSDB) | 30s (prod aggregation from node agent) |
| debug | 1s (TSDB) | 1s (raw events from node agent) |
| audit | 30s (TSDB) | 30s + access events |
Switching to debug mode (lattice telemetry --alloc=12345 --mode=debug) increases resolution for both top and watch.
Stream Content
Each MetricsEvent contains:
- Timestamp and node ID
- Current metric values (GPU, CPU, memory, network, I/O)
- Threshold alerts (if any metric exceeds configured bounds)
Alerts are generated by node agents when metrics cross thresholds:
- GPU utilization drops below 10% (potential hang)
- GPU memory utilization exceeds 95% (OOM risk)
- Network error rate exceeds threshold
- I/O latency spike detected
CLI Usage
# Watch all metrics (refreshing display)
lattice watch 12345
# Watch specific metrics
lattice watch 12345 --metrics=gpu_utilization,memory_used
# Watch with alerts only (suppress normal updates)
lattice watch 12345 --alerts-only
Diagnostics View
Network Diagnostics
Network health is critical for multi-node allocations. Diagnostics expose Slingshot-specific metrics that are otherwise invisible to users.
| Metric | Description | Source |
|---|---|---|
| CSIG congestion | In-band congestion signals per Slingshot group | eBPF CSIG tap |
| Group span | Number of dragonfly groups the allocation spans | Topology model |
| Inter-node bandwidth | Measured bandwidth between node pairs | eBPF network flow |
| NVLink throughput | GPU-to-GPU bandwidth (intra-node) | NVML |
Storage Diagnostics
| Metric | Description | Source |
|---|---|---|
| QoS floor vs actual | Configured storage QoS vs measured throughput | VAST API + eBPF I/O |
| Latency histogram | I/O latency distribution (p50/p95/p99) | eBPF block I/O |
| Mount health | Per-mount status (NFS, S3, scratch) | Node agent |
| IOPS | Read/write operations per second | eBPF block I/O |
Combined Diagnostics
lattice diag combines network and storage diagnostics into a single view with health indicators:
$ lattice diag 12345
Network:
Group span: 2 groups (g3, g7)
CSIG congestion: LOW (0.02 avg)
Inter-node BW: 190 GB/s avg (target: 200 GB/s) ✓
Storage:
/data/input (NFS): 12.5 GB/s read (QoS floor: 10 GB/s) ✓
/scratch (NVMe): 6.2 GB/s write, p99 latency: 45µs ✓
/home (NFS): 0.1 GB/s (idle) ✓
GPUs:
SM occupancy: 92% avg across 256 GPUs ✓
NVLink: 850 GB/s avg (of 900 GB/s) ✓
ECC errors: 0 ✓
CLI Usage
# Full diagnostics
lattice diag 12345
# Network only
lattice diag 12345 --network
# Storage only
lattice diag 12345 --storage
Cross-Allocation Comparison
TSDB Query
Compares metrics across multiple allocations by querying the same TSDB data used for lattice top. Useful for regression detection across training runs.
Time Alignment
Comparisons use relative-to-start time alignment: each allocation’s metrics are indexed from t=0 (allocation start), not wall clock time. This allows meaningful comparison of allocations that ran at different times.
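As a sketch, the alignment step amounts to re-indexing each sample against its allocation's start time (illustrative code, not the SDK):

```python
from datetime import datetime, timedelta

def align_relative(samples, alloc_start):
    """Re-index (timestamp, value) samples to seconds since allocation start."""
    return [((ts - alloc_start).total_seconds(), v) for ts, v in samples]

# Two runs that started at different wall-clock times...
start_a = datetime(2026, 3, 1, 9, 0)
start_b = datetime(2026, 3, 2, 14, 30)
run_a = [(start_a + timedelta(seconds=s), 0.90) for s in (0, 60, 120)]
run_b = [(start_b + timedelta(seconds=s), 0.85) for s in (0, 60, 120)]

# ...become directly comparable on a shared t=0 axis.
aligned_a = align_relative(run_a, start_a)
aligned_b = align_relative(run_b, start_b)
assert [t for t, _ in aligned_a] == [t for t, _ in aligned_b] == [0.0, 60.0, 120.0]
```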
CLI Usage
# Compare two allocations
lattice compare 12345 12346
# Compare specific metric
lattice compare 12345 12346 --metric=gpu_utilization
# JSON output for scripting
lattice compare 12345 12346 --output=json
REST Interface
GET /v1/compare?ids=12345,12346&metrics=gpu_utilization,io_write_bytes&align=relative
Application Profiling Integration
Scope
Lattice provides mechanisms for profiling, not profiler implementations. Users bring their own profiling tools, delivered via tools_uenv.
Profiler Delivery
Profiling tools are packaged as uenv images and mounted alongside the application uenv:
environment:
uenv: "prgenv-gnu/24.11:v1" # application stack
tools_uenv: "profiling/2024.1" # profilers: nsight, vtune, darshan, etc.
The tools_uenv mount provides profiler binaries without contaminating the application environment.
Usage Patterns
Batch profiling (non-interactive):
# Submit with profiling tools
lattice submit --uenv=prgenv-gnu/24.11:v1 --tools-uenv=profiling/2024.1 script.sh
# Script uses profiler internally (e.g., nsys profile ./train)
# Results written to output directory
Interactive profiling (attach-based):
# Attach and run profiler interactively
lattice attach 12345 --command="nsys profile --delay=60 -o /scratch/profile ./train"
Darshan / Score-P Integration Notes
- Darshan: LD_PRELOAD-based I/O profiling. No Lattice-specific integration needed; user loads Darshan from `tools_uenv` and sets `LD_PRELOAD`. Darshan logs written to scratch/output.
- Score-P: Instrumentation-based profiling. User compiles with Score-P wrappers from `tools_uenv`. Lattice provides no special support beyond tools delivery and `attach`.
Security Model
Authorization
All observability endpoints are scoped by OIDC token claims:
- Users can only query their own allocations (or allocations in their tenant, if tenant-admin)
- Token scopes: `allocations:read` (metrics, logs, diagnostics), `allocations:attach` (interactive attach)
- Sensitive allocations: only the claiming user (verified against Raft audit log)
Rate Limiting
All rate limits are per user (identified by OIDC subject claim). Tenant admins and system admins share the same limits unless overridden in system configuration.
| Endpoint | Rate Limit | Scope | Rationale |
|---|---|---|---|
| Attach | 5 concurrent sessions | Per user | Resource-intensive (PTY per session) |
| StreamLogs | 10 concurrent streams | Per user | Memory (ring buffer readers) |
| QueryMetrics | 60 req/min | Per user | TSDB query load |
| StreamMetrics | 5 concurrent streams | Per user | Node agent fan-out |
| Diagnostics | 30 req/min | Per user | TSDB + fabric query load |
| Compare | 10 req/min | Per user | Multi-alloc TSDB queries |
When rate limit is exceeded:
- Concurrent limits (Attach, StreamLogs, StreamMetrics): New request rejected with `429 Too Many Requests` and a message: "maximum concurrent sessions reached (5/5). Close an existing session to open a new one."
- Request-rate limits (QueryMetrics, Diagnostics, Compare): Request rejected with `429 Too Many Requests` and a `Retry-After` header indicating seconds until the next request is allowed.
- No queueing — rejected requests must be retried by the client.
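Since rejected requests are never queued, well-behaved clients back off using the `Retry-After` header. A client-side sketch (the `send` callable and the response triple are illustrative, not part of any Lattice SDK):

```python
import time

def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Retry a request-rate-limited call, honoring Retry-After on 429."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return body
        # The server says how many seconds until the next request is allowed.
        sleep(int(headers.get("Retry-After", "1")))
    raise RuntimeError(f"rate limited after {max_attempts} attempts")

# Simulated server: first call rejected, second succeeds.
responses = iter([(429, {"Retry-After": "2"}, None), (200, {}, "ok")])
waits = []
result = call_with_retry(lambda: next(responses), sleep=waits.append)
assert result == "ok" and waits == [2]
```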
Admin override: System admins can adjust per-user rate limits via configuration:
rate_limits:
attach_max_concurrent: 10 # override default of 5
query_metrics_per_minute: 120 # override default of 60
Data Sensitivity
| Data Type | Sensitivity | Handling |
|---|---|---|
| Metrics (GPU%, CPU%, I/O) | Low | Standard OIDC scoping |
| Logs (stdout/stderr) | Medium | May contain application data; encrypted at rest for sensitive |
| Attach (interactive terminal) | High | Session recorded for sensitive; PTY access = code execution |
| Diagnostics (network/storage) | Low | Infrastructure metrics, no application data |
| Profiling output | Medium | Written to user’s storage, no Lattice-managed persistence |
Security Architecture
Design Principle
Defense in depth with zero-trust internal communication. Every component authenticates to every other component. Trust boundaries are explicit and enforced by mTLS, RBAC, and network segmentation.
Trust Boundaries
User ──OIDC──→ lattice-api (direct, via hpc-auth) ──mTLS──→ quorum
│ │
│ mTLS │ mTLS
▼ ▼
node-agents ──namespace──→ workloads
│
│ mTLS/REST
▼
VAST / OpenCHAMI
Federation (optional):
quorum ──Sovra mTLS──→ federation-broker ──Sovra mTLS──→ remote quorum
STRIDE Threat Analysis
Spoofing
| Boundary | Attack | Mitigation |
|---|---|---|
| User → lattice-api | Stolen OIDC token | Short-lived tokens (5 min), token binding to client cert, MFA enforcement at IdP |
| Internal services | Rogue node agent | mTLS with site PKI (OpenCHAMI OPAAL-issued certificates). Node agents receive certs during boot via cloud-init. Cert CN must match node identity in quorum. |
| Federation | Rogue remote site | Sovra workspace-scoped certificates. Each site’s identity is cryptographically bound to its Sovra workspace. Revocable. |
Tampering
| Boundary | Attack | Mitigation |
|---|---|---|
| Quorum ↔ node agent | Fake heartbeat / state update | mTLS + message signing. Heartbeats include monotonic sequence number — replay detection. |
| uenv images | Compromised image | Image signing with site PKI (or Sovra PKI for federated images). Node agent verifies signature + hash before mount. Unsigned images rejected. |
| Raft log | Log manipulation | Raft log entries are chained (each entry references previous). Stored on local SSD with integrity checks. Snapshot checksums verified on restore. |
| API requests | Request modification in transit | TLS for all external connections. mTLS for all internal connections. |
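The chained-log mitigation above can be sketched as a hash chain: each entry embeds the digest of its predecessor, so altering any committed entry invalidates verification of every later entry. (Field names and the JSON encoding are illustrative; the actual Raft log format differs.)

```python
import hashlib, json

def chain(payloads):
    """Link log entries: each stores the SHA-256 of the previous entry."""
    prev, out = "0" * 64, []
    for payload in payloads:
        entry = {"prev": prev, "payload": payload}
        prev = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        out.append(entry)
    return out

def verify(entries):
    """Recompute the chain; any tampered entry breaks the link to its successor."""
    prev = "0" * 64
    for entry in entries:
        if entry["prev"] != prev:
            return False
        prev = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
    return True

log = chain(["add-node x1000c0s0b0n0", "claim sensitive", "release"])
assert verify(log)
log[1]["payload"] = "claim sensitive (forged)"   # tamper with a middle entry
assert not verify(log)
```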
Repudiation
| Boundary | Attack | Mitigation |
|---|---|---|
| Sensitive actions | User denies accessing sensitive data | Raft-committed audit log with user identity (from OIDC). Cryptographically signed entries (Sovra keys if available, otherwise site PKI). 7-year retention. Tamper-evident chain. |
| Allocation submission | User denies submitting allocation | All API requests logged with authenticated user identity. Audit trail in lattice-api access logs. |
| Node claims | Deny claiming sensitive nodes | Node claim is a Raft-committed operation with user identity. Cannot be repudiated. |
Information Disclosure
| Boundary | Attack | Mitigation |
|---|---|---|
| Node ↔ storage | Data exfiltration via network sniffing | Encrypted transport: NFS-over-TLS (VAST supports), S3 over HTTPS. Sensitive: encrypted at rest (VAST encrypted pool). |
| Cross-tenant | Side-channel via co-location | Full-node scheduling (ADR-007): no co-location of different tenants by default. Interactive vCluster uses Sarus containers with seccomp for intra-node isolation. |
| Telemetry | Metric leakage between tenants | Label-based access control on TSDB queries. lattice-api injects tenant/user scope filters. |
| Memory | Data remnants after allocation | Node agent zeroes GPU memory and clears scratch storage (NVMe or tmpfs) on allocation release. Sensitive: full node wipe via OpenCHAMI. |
| API responses | Enumeration of other tenants’ data | RBAC filtering on all list/query endpoints. Users see only their own allocations; tenant admins see their tenant. |
Denial of Service
| Boundary | Attack | Mitigation |
|---|---|---|
| User → API | API flooding | Rate limiting per tenant (token bucket). Admission control: reject requests that exceed tenant’s request quota. lattice-api provides rate limiting via Tower middleware. |
| Node → quorum | Heartbeat storm | Heartbeat coalescing: node agents batch heartbeats. Quorum-side rate limiting per node (max 1 heartbeat per interval). |
| Scheduling | Malicious allocation specs | Validation at API layer: max resource requests bounded, max array size bounded, DAG cycle detection. Reject before reaching scheduler. |
| Storage | Storage exhaustion | Per-tenant storage quotas enforced by VAST. Checkpoint storage bounded per allocation. |
Elevation of Privilege
| Boundary | Attack | Mitigation |
|---|---|---|
| User → scheduler | Escalate priority class | RBAC: priority class tied to tenant contract, not user request. Users cannot set priority above their tenant’s maximum. |
| Node agent → host | Container/namespace escape | Sarus: seccomp profile, no root in container, read-only rootfs. uenv: mount namespace only (no user namespace needed), processes run as submitting user. No setuid binaries in uenv images (enforced at build time). |
| Tenant admin → system admin | Escalate administrative scope | Distinct RBAC roles with no implicit promotion. System admin requires separate authentication (not derivable from tenant admin token). |
| Workload → network | Break out of network domain | Slingshot VNI enforcement at NIC level (hardware-enforced). Workloads can only communicate within their assigned network domain. |
Internal Service Authentication
All inter-component communication uses mTLS in production. Node agents acquire certificates via the identity cascade (SPIRE → SelfSigned CA → Bootstrap certs). When no mTLS identity is available (dev, testing, break-glass), agents fall back to Bearer token auth via LATTICE_AGENT_TOKEN.
| Component | Certificate Source | Rotation | Fallback |
|---|---|---|---|
| Quorum members | Pre-provisioned during deployment | Annual rotation, Raft membership change for re-keying | — |
| Node agents | Identity cascade: SPIRE SVID → SelfSigned (quorum CA) → Bootstrap certs | CertRotator at 2/3 lifetime | LATTICE_AGENT_TOKEN Bearer token |
| API servers | Pre-provisioned or OPAAL | Annual rotation | — |
| vCluster schedulers | Pre-provisioned or OPAAL | Annual rotation | — |
| Checkpoint broker | Pre-provisioned or OPAAL | Annual rotation | — |
Agent authentication priority:
- mTLS (production) — agent acquires a `WorkloadIdentity` via the identity cascade and configures the gRPC channel with `ClientTlsConfig`. Server verifies the client certificate. No Bearer token needed.
- Bearer token (dev/testing/break-glass) — when no mTLS identity is available, agent reads `LATTICE_AGENT_TOKEN` from the environment and injects it as `Authorization: Bearer <token>` on all gRPC calls. Server validates via HMAC or JWKS.
Both paths coexist — mTLS takes priority. The LATTICE_AGENT_TOKEN path should be disabled in production (env var unset).
Certificate CN format: {component}.{site}.lattice.internal (e.g., node-042.alps.lattice.internal).
CA trust chain: Site root CA → intermediate CA (OPAAL) → component certificates.
Secret Management
Sensitive values are never stored in configuration files:
| Secret | Storage | Access Pattern |
|---|---|---|
| Waldur API token | Secrets manager (HashiCorp Vault or equivalent) | Referenced by path: vault://lattice/waldur-token |
| VAST API credentials | Secrets manager | Referenced by path |
| TLS private keys | Local filesystem (mode 0600) or TPM | Loaded at startup |
| OIDC client secret | Secrets manager | Used by hpc-auth (CLI) or lattice-api (server-side validation) |
| Sovra workspace key | Sovra key store (HSM-backed) | Used by federation broker |
Configuration files reference secrets by path, never by value:
waldur:
token_secret_ref: "vault://lattice/waldur-token"
vast:
credentials_ref: "vault://lattice/vast-creds"
RBAC Model
Three base roles, plus a sensitive-specific role:
| Role | Scope | Permissions |
|---|---|---|
| user | Own allocations | Submit, cancel, query own allocations. View own metrics. Attach to own sessions. |
| tenant-admin | Tenant’s allocations | All user permissions for any allocation in tenant. Manage tenant quotas (within limits). View tenant-level metrics. |
| system-admin | All | All operations. Manage vClusters, nodes, tenants. View holistic metrics. |
| claiming-user | Claimed sensitive nodes | User role + claim/release sensitive nodes. Access sensitive storage pool. All actions audit-logged. |
Role assignment:
- `user`: derived from OIDC token (any authenticated user)
- `tenant-admin`: assigned per-tenant in quorum state, or via `tenant-admin` role claim
- `system-admin`: assigned via quorum configuration, or via `admin`/`system:admin` scope
- `claiming-user`: assigned per-tenant by tenant-admin (sensitive tenants only)
- `operator`: assigned via `operator` scope or role claim
Cross-system role mapping (pact+lattice co-deployment):
When pact delegates operations to lattice (e.g., drain, cordon), the pact admin’s token carries a pact_role claim instead of lattice scopes. Lattice recognizes these cross-system role claims:
| Token claim | Value | Lattice role |
|---|---|---|
| `pact_role` | `pact-platform-admin` | SystemAdmin |
| `pact_role` or `lattice_role` | `system-admin` | SystemAdmin |
| `pact_role` or `lattice_role` | `tenant-admin` | TenantAdmin |
| `pact_role` or `lattice_role` | `operator` | Operator |
Standard OIDC scopes take precedence over role claims. Both are checked by derive_role().
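A sketch of that precedence order (role and scope names taken from this section; the real `derive_role()` in lattice-api may differ in detail):

```python
# Scope names from this document; illustrative, not the actual implementation.
SCOPE_ROLES = {"admin": "SystemAdmin", "system:admin": "SystemAdmin", "operator": "Operator"}
CLAIM_ROLES = {
    "pact-platform-admin": "SystemAdmin",
    "system-admin": "SystemAdmin",
    "tenant-admin": "TenantAdmin",
    "operator": "Operator",
}

def derive_role(token: dict) -> str:
    # 1. Standard OIDC scopes take precedence over role claims.
    for scope in token.get("scope", "").split():
        if scope in SCOPE_ROLES:
            return SCOPE_ROLES[scope]
    # 2. Cross-system role claims (pact+lattice co-deployment).
    for claim in ("pact_role", "lattice_role"):
        if token.get(claim) in CLAIM_ROLES:
            return CLAIM_ROLES[token[claim]]
    # 3. Any authenticated user gets the base role.
    return "User"

assert derive_role({"pact_role": "pact-platform-admin"}) == "SystemAdmin"
assert derive_role({"scope": "admin", "pact_role": "tenant-admin"}) == "SystemAdmin"
assert derive_role({"sub": "alice"}) == "User"
```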
Network Security
| Traffic Class | Network | Isolation |
|---|---|---|
| Management (mTLS, heartbeats) | Slingshot management traffic class | Dedicated bandwidth reservation |
| Compute (MPI, NCCL) | Slingshot compute VNIs | Hardware-isolated per network domain |
| Storage (NFS, S3) | Slingshot storage traffic class | QoS-enforced bandwidth |
| Telemetry (metrics) | Slingshot telemetry traffic class | Separate from compute, low priority |
| User access (API, SSH) | Out-of-band Ethernet | Firewalled, rate-limited |
Slingshot traffic classes provide hardware-enforced isolation — compute traffic cannot starve management traffic and vice versa.
Certificate Rotation
Quorum Members
- Generate new certificate from site CA (same CN format)
- Deploy new cert + key to the target member’s TLS directory
- Perform Raft membership change: remove old member, add “new” member (same node, new cert)
- Verify: `lattice admin raft status` shows member healthy with new cert serial
- Repeat for each member (one at a time, maintaining majority)
Node Agents
Node agents receive certificates from OPAAL during boot. Rotation is automatic on reboot:
- Drain the node: `lattice node drain <id>`
- Reboot (or reimage) via OpenCHAMI
- Node boots with new OPAAL-issued certificate
- Undrain: `lattice node undrain <id>`
For batch rotation without reboot (if OPAAL supports renewal):
- Node agent requests new cert from OPAAL
- Node agent reloads TLS context (graceful, no connection drop)
- New cert active on next heartbeat
API Servers and Schedulers
- Generate new certificate from site CA
- Deploy new cert + key to the component’s TLS directory
- Restart the component (stateless — no data loss)
- Load balancer health check confirms the component is back
Federation (Sovra Certificates)
Sovra workspace keys are managed by the Sovra key rotation protocol. Lattice components use derived tokens, which are automatically refreshed. No Lattice-side action is required for routine Sovra key rotation.
For emergency revocation: revoke the Sovra shared workspace (see federation.md — Removing a Federation Peer).
Additional Security Considerations
OIDC Token Refresh for Long-Lived Streams
Long-lived gRPC streams (Attach, StreamLogs, StreamMetrics) may outlive the OIDC access token’s lifetime:
- Token validation at stream open. The API server validates the OIDC token when the stream is established.
- Periodic re-validation. For streams lasting longer than `token_revalidation_interval` (default: 5 minutes), the API server re-validates the token’s claims against the OIDC provider. If the token has been revoked or the user’s permissions have changed, the stream is terminated with an `UNAUTHENTICATED` error.
- Client responsibility. Clients should refresh their access token before it expires and present the new token on reconnection if the stream is terminated.
Anti-Replay for API Requests
API requests are protected against replay attacks:
- TLS as primary defense. All external API communication uses TLS, which provides replay protection at the transport layer.
- Request idempotency. Mutating operations (Submit, Cancel, Update) use client-generated `request_id` fields for idempotency. Duplicate `request_id` values within a time window are rejected.
- Raft proposal deduplication. The quorum deduplicates proposals using the proposing scheduler’s identity and a monotonic sequence number. Replayed proposals are ignored.
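The windowed `request_id` dedup can be sketched as follows (the window length and data structure are illustrative, not the actual implementation):

```python
import time

class IdempotencyWindow:
    """Reject duplicate request_ids seen within `window` seconds."""
    def __init__(self, window=300.0, clock=time.monotonic):
        self.window, self.clock, self.seen = window, clock, {}

    def admit(self, request_id):
        now = self.clock()
        # Evict entries that have aged out of the window.
        self.seen = {r: t for r, t in self.seen.items() if now - t < self.window}
        if request_id in self.seen:
            return False              # duplicate within the window: reject
        self.seen[request_id] = now
        return True

t = [0.0]
w = IdempotencyWindow(window=300.0, clock=lambda: t[0])
assert w.admit("req-1")               # first submission accepted
assert not w.admit("req-1")           # replayed within window: rejected
t[0] = 301.0
assert w.admit("req-1")               # window elapsed: treated as a new request
```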
RBAC for Node Management
Node management operations (drain, undrain, disable) require the system-admin role:
| Operation | Required Role | Notes |
|---|---|---|
| `ListNodes`, `GetNode` | user | Read-only, filtered by tenant scope |
| `DrainNode`, `UndrainNode` | system-admin | Affects scheduling across all tenants |
| `DisableNode` | system-admin | Removes node from scheduling entirely |
| Sensitive node claim | claiming-user | Sensitive-specific role within tenant |
Certificate CN vs NodeId Mapping
Node agent certificates use a deterministic CN format that maps to the node’s xname identity:
- Format: `{xname}.{site}.lattice.internal` (e.g., `x1000c0s0b0n0.alps.lattice.internal`)
- Validation: On each heartbeat, the quorum verifies that the certificate CN matches the node ID reported in the heartbeat payload. A mismatch triggers an `UNAUTHENTICATED` error and an alert.
- Prevents: A compromised node agent from impersonating a different node.
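The validation step reduces to a string comparison against the expected CN (a sketch; real certificate parsing and extraction of the CN are omitted):

```python
def cn_matches_node(cert_cn: str, node_id: str, site: str) -> bool:
    """Check that a cert CN of the form {xname}.{site}.lattice.internal
    matches the node ID reported in a heartbeat."""
    return cert_cn == f"{node_id}.{site}.lattice.internal"

assert cn_matches_node("x1000c0s0b0n0.alps.lattice.internal", "x1000c0s0b0n0", "alps")
# Mismatch: a compromised agent reporting a different node's ID is caught.
assert not cn_matches_node("x1000c0s0b0n0.alps.lattice.internal", "x1000c0s0b0n1", "alps")
```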
Sensitive Session Recording Storage
Attach session recordings for sensitive allocations are stored alongside the audit log:
- Path: `s3://sensitive-audit/{tenant}/{alloc_id}/sessions/{session_id}.recording`
- Format: Raw byte stream (input + output interleaved with timestamps), compressed with zstd
- Encryption: Encrypted at rest using the sensitive storage pool’s encryption keys
- Retention: 7 years (matching sensitive audit log retention)
- Access: Only the claiming user and tenant-admin (compliance reviewer) can access recordings via the audit query API
Audit Signing Key Persistence
The Ed25519 signing key for audit log entries is loaded from a persistent file configured via QuorumConfig.audit_signing_key_path. This ensures:
- Chain continuity: Archived audit entries (in S3) can be verified after quorum restart
- Non-repudiation: The same key signs all entries, forming a verifiable chain
- Key rotation: Replace the file and restart the quorum to rotate (old entries remain verifiable with the old public key)
- Dev mode: When `audit_signing_key_path` is not set, a random key is generated (suitable for testing only)
REST API Authentication
REST and gRPC endpoints require authentication when OIDC or HMAC is configured:
- Bearer token required in the `Authorization` header (validated on every request)
- Two validation modes: JWKS (production, via `oidc_issuer`) or HMAC-SHA256 (dev/testing, via `LATTICE_OIDC_HMAC_SECRET`)
- REST middleware validates asynchronously (supports JWKS network fetch on cache miss)
- gRPC interceptor validates synchronously using cached JWKS keys (pre-fetched at startup) or HMAC
- Rate limiting applied per user
- Public endpoints exempt: `/healthz`, `/api/v1/auth/discovery`
- OIDC discovery client disables HTTP redirects (JWKS cache poisoning prevention)
- Non-HTTPS issuer URLs produce a warning (MITM risk)
- Server logs a prominent warning on startup if no authentication is configured
Service Discovery Isolation
Service discovery endpoints (LookupService, ListServices) are tenant-filtered:
- `x-lattice-tenant` header constrains results to the requesting tenant’s services
- Without the header, all services are visible (admin/operator access)
- Prevents cross-tenant information disclosure of service topology
Session Security
Interactive sessions are tracked globally in Raft state:
- `CreateSession`/`DeleteSession` are Raft-committed operations
- Sensitive allocations: at most one concurrent session globally (INV-C2)
- Sessions survive API server restart (persisted in quorum state)
- Ownership verified: only the allocation’s user can create sessions
Cross-References
- sensitive-workloads.md — Sensitive-specific security requirements
- failure-modes.md — Security implications of failure scenarios
- upgrades.md — Certificate rotation during upgrades
- accounting.md — Waldur API token management
Deployment & Bootstrapping
Design Principle
Lattice deploys on bare metal managed by OpenCHAMI. The bootstrap sequence is deterministic: infrastructure first, then control plane, then compute nodes. Each step is idempotent and can be retried. The system can be fully rebuilt from configuration files and Raft snapshots.
Prerequisites
Before deploying Lattice:
| Dependency | Required | Notes |
|---|---|---|
| OpenCHAMI | Yes | Node inventory, BMC discovery, boot service, identity (OPAAL) |
| VAST (or compatible NFS+S3) | Yes | Hot tier storage, QoS API |
| OIDC Provider | Yes | User authentication (institutional IdP) |
| PKI / Certificate Authority | Yes | mTLS certificates for all components |
| Secrets Manager | Yes | API tokens, TLS keys (Vault or equivalent) |
| Time-series database | Yes | VictoriaMetrics, Mimir, or Thanos |
| Slingshot/UE fabric | Yes | Network with VNI support |
| Waldur | Optional | External accounting (feature-flagged) |
| Sovra | Optional | Federation trust (feature-flagged) |
Network Topology
Lattice runs on the high-speed network (HSN — Slingshot/Ultra Ethernet, 200G+). When co-deployed with PACT, the two systems use different networks for clean failure isolation (PACT ADR-017):
| System | Network | Ports | Traffic |
|---|---|---|---|
| PACT | Management (1G) | gRPC 9443, Raft 9444 | Admin ops, boot overlay, config, shell |
| Lattice | HSN (200G+) | gRPC 50051, Raft 9000, REST 8080 | Scheduling, heartbeats, telemetry, allocation lifecycle |
Node (dual-homed):
├── Management NIC (1G Ethernet)
│ └── pact-agent ←mTLS→ pact-journal:9443
│
├── HSN NIC (200G+ Slingshot/UE)
│ ├── lattice-node-agent ←mTLS→ lattice-quorum:50051
│ └── workload traffic (MPI, NCCL, storage data plane)
│
└── SPIRE agent socket (local, network-agnostic)
├── pact-agent obtains SVID → uses on management net
└── lattice-node-agent obtains SVID → uses on HSN
Configuration: Set bind_network: hsn in quorum and node-agent config (default). This resolves to the HSN interface at startup. In standalone mode without PACT, bind_network: any (default 0.0.0.0) is acceptable.
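A minimal config fragment illustrating both modes (the `bind_network` key as described here; the full schema lives in the deployment configs):

```yaml
# Co-deployed with PACT: bind Lattice services to the HSN interface.
bind_network: hsn        # resolved to the HSN interface at startup (default)

# Standalone, no PACT:
# bind_network: any      # binds 0.0.0.0
```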
Failure isolation: Management net down → PACT degraded, lattice unaffected. HSN down → lattice paused, PACT unaffected (admin access works). See specs/failure-modes.md for full matrix.
Bootstrap Sequence
Phase 1: Infrastructure (OpenCHAMI)
1. Deploy OpenCHAMI services:
- Magellan (BMC discovery)
- SMD (State Management Daemon)
- BSS (Boot Script Service)
- OPAAL (Authentication)
2. Discover nodes via Redfish BMC scan
3. Register node inventory in SMD
4. Prepare boot images:
- Standard compute image (Linux + node agent)
- Sensitive hardened image (minimal kernel, SELinux, no SSH)
5. Generate PKI:
- Site root CA
- Intermediate CA for OPAAL
- Pre-provision quorum member certificates
Phase 2: Control Plane
1. Deploy quorum members (3 or 5 nodes, dedicated hardware):
a. Install lattice-quorum binary
b. Configure Raft cluster membership
c. Load TLS certificates (pre-provisioned)
d. Initialize Raft cluster:
- First member bootstraps as single-node cluster
- Additional members join via Raft AddMember
e. Verify: Raft leader elected, all members healthy
2. Deploy API servers (2+ for redundancy):
a. Install lattice-api binary
b. Configure quorum endpoints, TLS, OIDC provider
c. Place behind load balancer
d. Health check: /healthz returns 200
3. Deploy vCluster schedulers:
a. One scheduler instance per vCluster type
b. Configure cost function weights (from config file or quorum)
c. Verify: scheduling cycle runs (empty, no nodes yet)
4. Deploy checkpoint broker:
a. Install lattice-checkpoint binary
b. Configure quorum and VAST API endpoints
Phase 3: Compute Nodes
1. Configure BSS with standard compute image + cloud-init template:
- cloud-init installs node agent binary
- cloud-init generates TLS certificate via OPAAL
- cloud-init configures quorum endpoint
2. Boot nodes (batch: groups of 50-100):
- PXE boot → BSS serves image → cloud-init runs → node agent starts
3. Node agent startup:
a. Generate TLS cert from OPAAL (if not pre-provisioned)
b. Discover local hardware (GPUs via NVML/ROCm-SMI, NVMe if present, NIC)
c. Compute conformance fingerprint
d. Register with quorum (first heartbeat)
e. Report capabilities and health
4. Quorum auto-discovers nodes from first heartbeat.
No manual node registration required.
5. Verify: `lattice node list` shows all nodes in Ready state.
Phase 4: Configuration
1. Create tenants:
lattice admin tenant create --name="physics" --max-nodes=200
2. Create vClusters:
lattice admin vcluster create --name="hpc-batch" \
--scheduler=hpc-backfill \
--tenant=physics \
--nodes=x1000c0s0b0n[0-199]
3. Configure cost function weights (or use defaults):
lattice admin vcluster set-weights --name="hpc-batch" \
--priority=0.20 --wait-time=0.25 --fair-share=0.25 ...
4. (Optional) Configure Waldur accounting:
lattice admin config set accounting.enabled=true
lattice admin config set accounting.waldur.api_url="https://..."
5. (Optional) Configure federation:
lattice admin federation add-peer --endpoint=... --workspace=...
6. Test: submit a test allocation.
Quorum Initialization
First-Time Bootstrap
The first quorum member initializes a new Raft cluster using the --bootstrap
flag. This flag must only be passed once — on the very first startup of
node 1. All subsequent restarts omit it; the persisted Raft state (WAL +
snapshots) is sufficient to rejoin.
# First-ever start of node 1:
lattice-server --config /etc/lattice/server.yaml --bootstrap
# All subsequent restarts (including systemd):
lattice-server --config /etc/lattice/server.yaml
This creates an empty Raft log and elects node 1 as leader.
Adding Members
Subsequent members join the existing cluster:
# On the leader (or any member):
lattice-quorum membership add --node-id=quorum-2 --addr=quorum-2:4001
# On the new member:
lattice-quorum --join=quorum-1:4001 \
--node-id=quorum-2 \
--listen=0.0.0.0:4001 \
--data-dir=/var/lib/lattice/raft
The new member syncs the Raft log from the leader and becomes a follower.
Initial State
A freshly bootstrapped quorum has:
- Empty node registry (populated when nodes boot)
- Empty tenant/vCluster configuration (created by admin)
- Empty sensitive audit log
- Default system configuration
Disaster Recovery
Raft Snapshot + WAL Recovery
The quorum periodically snapshots its state and writes a WAL (Write-Ahead Log):
/var/lib/lattice/raft/
├── snapshots/
│ ├── snap-000100.bin # Raft state at log index 100
│ └── snap-000200.bin # Raft state at log index 200
├── wal/
│ ├── wal-000200-000300 # Log entries 200-300
│ └── wal-000300-000400 # Log entries 300-400
└── metadata.json # Current term, voted_for, last_applied
Backup: Snapshots are replicated to S3 (configurable interval, default: hourly):
s3://lattice-backup/raft/snap-{timestamp}.bin
Recovery Procedure
If all quorum members are lost:
1. Provision new quorum hardware (3 or 5 nodes)
2. Retrieve latest snapshot from S3:
aws s3 cp s3://lattice-backup/raft/snap-latest.bin /var/lib/lattice/raft/
3. Bootstrap from snapshot:
lattice-quorum --recover-from=/var/lib/lattice/raft/snap-latest.bin \
--node-id=quorum-1 --bootstrap
4. Add remaining quorum members (join the recovered leader)
5. Node agents will reconnect automatically (they retry with backoff)
6. Verify state:
lattice admin raft status
lattice node list
Data loss window: From the last snapshot to the failure. With hourly snapshots, at most 1 hour of Raft commits could be lost. In practice, node ownership changes are infrequent (scheduling cycles), so data loss is minimal.
Partial Quorum Loss
If a minority of quorum members fail (1 of 3, or 2 of 5):
- The cluster continues operating (Raft majority maintained)
- Replace failed members via Raft membership change:
  lattice-quorum membership remove --node-id=quorum-2
  lattice-quorum membership add --node-id=quorum-2-new --addr=...
- New member syncs from leader automatically
- No data loss, no downtime
Non-Raft State Backup
The Raft snapshot captures quorum state (node ownership, tenants, sensitive audit). Other stateful components require separate backup strategies:
| Component | State Location | Backup Strategy |
|---|---|---|
| TSDB (metrics) | VictoriaMetrics / Thanos | TSDB-native snapshot + S3 replication |
| S3 logs | s3://{tenant}/{project}/{alloc_id}/logs/ | S3 bucket versioning + cross-region replication |
| Accounting WAL | /var/lib/lattice/accounting-wal | Include in node backup or replicate to S3 |
| Sensitive audit log | Raft state (primary) + S3 archive (cold) | Covered by Raft snapshot; S3 archive has its own retention |
| Grafana dashboards | infra/grafana/ (version-controlled) | Git repository |
Recommended schedule: Daily backup verification for TSDB snapshots. Accounting WAL backed up on the same schedule as Raft snapshots.
Quorum Hardware Replacement
When a quorum member’s hardware fails and must be replaced:
1. Remove the failed member from the Raft cluster:
   lattice-quorum membership remove --node-id=quorum-2
   The cluster continues operating with the remaining majority.
2. Provision new hardware:
   - Install the same OS and lattice-quorum binary
   - Generate a new TLS certificate from the site CA (same CN format)
   - Configure the same data directory path
3. Add the new member to the cluster:
   # On an existing member:
   lattice-quorum membership add --node-id=quorum-2-new --addr=new-host:4001
   # On the new hardware:
   lattice-quorum --join=quorum-1:4001 \
     --node-id=quorum-2-new \
     --listen=0.0.0.0:4001 \
     --data-dir=/var/lib/lattice/raft
4. Verify: The new member syncs the full Raft log from the leader. Check with `lattice admin raft status`.
5. Cleanup: Remove the old member’s data directory from failed hardware (if recoverable). Update monitoring/alerting to reference the new member.
Important: Replace one member at a time. Wait for the new member to fully sync before replacing another. For a 3-member quorum, never have more than 1 member down simultaneously.
Configuration Management
All configuration is stored in two places:
| Configuration | Storage | Update Mechanism |
|---|---|---|
| Raft cluster membership | Raft log | Membership change commands |
| Tenant/vCluster definitions | Raft state machine | API calls (Raft-committed) |
| Cost function weights | Raft state machine | Hot-reloadable via API |
| Component config (listen addr, TLS paths) | Local config files | Restart required |
| Node agent config | cloud-init template | Reboot to apply changes |
Config files are version-controlled alongside deployment manifests. Changes to Raft-stored configuration are applied via API and take effect immediately.
Capacity Planning
| Cluster Size | Quorum Members | API Servers | Scheduler Instances | Quorum Hardware |
|---|---|---|---|---|
| < 100 nodes | 3 | 2 | 1 per vCluster type | 4 CPU, 16 GB RAM, 100 GB SSD |
| 100-1000 nodes | 3 | 3 | 1 per vCluster type | 8 CPU, 32 GB RAM, 200 GB SSD |
| 1000-5000 nodes | 5 | 5 | 2 per vCluster type | 16 CPU, 64 GB RAM, 500 GB SSD |
| 5000+ nodes | 5 | 5+ (behind LB) | 2+ per vCluster type | 32 CPU, 128 GB RAM, 1 TB SSD |
Quorum hardware notes: Quorum members are latency-sensitive (Raft commits). Dedicated NVMe SSD for WAL. Not co-located with compute workloads. Prefer separate hardware or at minimum separate failure domains.
Backup Verification
Snapshots replicated to S3 should be verified periodically to ensure they are restorable:
```
# Verify the latest snapshot is readable and consistent
lattice admin backup verify --source=s3://lattice-backup/raft/snap-latest.bin

# Verify a specific snapshot
lattice admin backup verify --source=s3://lattice-backup/raft/snap-20260301T120000.bin
```
Verification checks:
- Snapshot file integrity (checksum match)
- Raft metadata consistency (term, index, membership)
- Deserialization of state machine (all entries parseable)
Recommended schedule: Weekly automated verification via cron or CI pipeline. Alert on failure.
Snapshot Retention Policy
Local snapshots are retained on quorum member disks:
- Keep the last 5 snapshots (default, configurable via `raft.snapshot_retention_count`)
- Older snapshots are deleted after a new snapshot is confirmed written
S3 snapshots follow a lifecycle policy:
- Keep all snapshots for 7 days (hourly granularity)
- After 7 days: keep one snapshot per day for 30 days
- After 30 days: keep one snapshot per week for 90 days
- After 90 days: delete (unless sensitive audit retention requires longer)
Configure via S3 lifecycle rules on the lattice-backup bucket.
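The tiered policy above can be sketched as a small retention decision. This is a hypothetical helper, not the shipped implementation; real thinning would run as a cleanup job keyed on snapshot timestamps:

```rust
// Sketch of the S3 retention tiers described above (hypothetical helper).
// Ages are in hours since snapshot creation.
#[derive(Debug, PartialEq)]
enum Retention {
    KeepAll,   // < 7 days: every hourly snapshot
    KeepDaily, // 7-30 days: one snapshot per day
    KeepWeekly, // 30-90 days: one snapshot per week
    Delete,    // > 90 days (absent longer audit retention)
}

fn retention_tier(age_hours: u64) -> Retention {
    match age_hours {
        h if h < 7 * 24 => Retention::KeepAll,
        h if h < 30 * 24 => Retention::KeepDaily,
        h if h < 90 * 24 => Retention::KeepWeekly,
        _ => Retention::Delete,
    }
}
```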
Component Log Management
Lattice components log to stdout/stderr by default, managed by the system’s init system (systemd journald or equivalent).
Recommended log rotation:
| Component | Log Volume | Rotation |
|---|---|---|
| Quorum members | Low (Raft events, membership changes) | journald default (rotate at 4 GB or 1 month) |
| API servers | Medium (request logs, access logs) | journald or file rotation (rotate at 1 GB, keep 7 files) |
| vCluster schedulers | Low-Medium (scheduling cycle logs) | journald default |
| Node agents | Low per-node (heartbeats, allocation lifecycle) | journald default |
| Checkpoint broker | Low (checkpoint decisions) | journald default |
For centralized log collection, configure journald to forward to a log aggregator (e.g., Loki, Elasticsearch) via systemd-journal-remote or a sidecar agent.
Structured logging: All components emit JSON-formatted logs with fields: timestamp, level, component, message, and context-specific fields (e.g., allocation_id, node_id).
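A log line might look like the following (illustrative values; `allocation_id` and `node_id` are the documented context fields):

```json
{
  "timestamp": "2026-03-01T12:00:00Z",
  "level": "info",
  "component": "lattice-node-agent",
  "message": "allocation started",
  "allocation_id": "alloc-1234",
  "node_id": "node-042"
}
```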
Test/Dev Deployment (GCP)
For integration testing without bare metal, use the GCP test infrastructure:
infra/gcp/
├── terraform/main.tf              # 3 quorum + 2 compute + registry + TSDB
└── packer/lattice-compute.pkr.hcl # Pre-baked image with podman + squashfs-tools
scripts/deploy/
├── make-provision-bundle.sh       # Single tarball: binaries + scripts + systemd units
├── install-quorum.sh              # Reusable, no GCP-specific logic
├── install-compute.sh             # Reusable, HMAC token generation
└── validate.sh                    # Structured test runner (15 tests)
Workflow:
1. `packer build` — create compute image (once)
2. `terraform apply` — provision VMs
3. `make-provision-bundle.sh` — package release
4. SCP bundle to nodes, run `install-quorum.sh` (node 1 with `--bootstrap`), then `install-compute.sh`
5. `validate.sh` — run test matrix
6. `terraform destroy` — manual cleanup
The deploy scripts are reusable on-prem — no GCP-specific logic in install-*.sh.
Cross-References
- system-architecture.md — Seven-layer architecture overview
- security.md — PKI, mTLS, certificate provisioning
- upgrades.md — Rolling upgrade procedure (after initial deployment)
- failure-modes.md — Component failure and recovery
- node-lifecycle.md — Node boot and registration
Failure Modes and Recovery
Design Principle
Fail-safe defaults. Running allocations survive component failures. Modeled after Slurm’s proven failure patterns, mapped to Lattice’s distributed architecture: requeue on node failure, state recovery on controller restart, running jobs unaffected by control plane restarts.
Component Failures
Quorum Member Loss
Detection: Raft heartbeat timeout (default: 500ms).
Recovery: Raft tolerates minority failure. A 3-member quorum tolerates 1 failure; a 5-member quorum tolerates 2. The remaining majority continues serving reads and commits. No scheduling disruption.
Action: Alert ops. Replace failed member via Raft membership change (add new → remove old). No data loss — Raft log is replicated.
Quorum Leader Loss
Detection: Raft follower timeout triggers leader election.
Recovery: New leader elected within seconds (typically 1-3s depending on election timeout configuration). In-flight proposals that were not committed are retried by the proposing vCluster scheduler on the next scheduling cycle.
Data loss risk: None. Uncommitted proposals are re-proposed. Committed state is durable.
Complete Quorum Loss
Detection: All quorum members unreachable. API server returns unavailable.
Recovery: Restore from most recent Raft snapshot + WAL replay (analogous to slurmctld --recover). The latest snapshot is stored on persistent storage (local SSD + replicated to S3). Recovery restores node ownership and sensitive audit state to the last committed entry.
Impact during outage: No new allocations can be scheduled (proposals cannot be committed). Running allocations continue — node agents operate autonomously. Node agents buffer heartbeats and replay on quorum recovery.
Node Agent Crash
Detection: Heartbeat timeout (default: 30s) followed by grace period (default: 60s). Total time to Down transition: ~90s. Analogous to Slurm’s SlurmdTimeout.
Recovery:
- Quorum marks node as `Degraded` after first missed heartbeat
- After grace period (default: 60s), node transitions to `Down`
- Allocations on the node are requeued (if `requeue` policy allows) or marked `Failed`
- Node agent restarts → loads persisted state from `/var/lib/lattice/agent-state.json` → reattaches to surviving workload processes (PID liveness check via `kill(pid, 0)`) → cleans up orphaned cgroups → re-registers with quorum → health check → re-enters scheduling pool
Workloads survive agent restart because the systemd unit uses KillMode=process (only the agent process is killed, not children in their own cgroup scopes).
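The reattach step can be sketched as follows. This is a simplified stand-in: the real agent uses `kill(pid, 0)`, while this sketch probes `/proc/<pid>` to stay libc-free, and `reattachable` is a hypothetical helper name:

```rust
use std::path::Path;

// Liveness stand-in for kill(pid, 0): on Linux, a live process has a /proc entry.
fn pid_alive(pid: u32) -> bool {
    Path::new(&format!("/proc/{pid}")).exists()
}

// On restart, keep only persisted state entries whose workload process still
// exists; anything else is treated as orphaned and its cgroup is cleaned up.
fn reattachable(persisted_pids: &[u32]) -> Vec<u32> {
    persisted_pids.iter().copied().filter(|&p| pid_alive(p)).collect()
}
```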
Sensitive nodes: Longer grace period (default: 5 minutes) to avoid false positives from transient issues. Sensitive allocations are never automatically requeued — operator intervention required.
Node Hardware Failure
Detection: Dual-path: heartbeat timeout (node agent) + OpenCHAMI Redfish BMC polling (out-of-band).
Recovery: Same as agent crash, but OpenCHAMI can detect hardware failures (PSU, memory ECC uncorrectable, GPU fallen off bus) before heartbeat timeout. BMC-detected failures trigger immediate Down transition, skipping the grace period.
vCluster Scheduler Crash
Detection: Health check failure (liveness probe).
Recovery: vCluster schedulers are stateless — they read pending allocations and node state from the quorum on each scheduling cycle. Restart from quorum state. No scheduling occurs for this vCluster during downtime, but running allocations continue unaffected (like slurmctld crash: running jobs are fine).
Data loss risk: None. Pending allocations are persisted in the quorum.
API Server Crash
Detection: Load balancer health check / liveness probe.
Recovery: API servers are stateless. Restart and resume serving. Multiple API server replicas behind a load balancer provide redundancy. Client retries with exponential backoff. No job loss.
Checkpoint Broker Crash
Detection: Health check failure.
Recovery: Pending checkpoint requests are lost (they were in-memory). On restart, the broker re-evaluates all running allocations against the checkpoint cost model. Allocations that should have been checkpointed will be identified on the next evaluation cycle.
Data loss risk: Minimal. At worst, one evaluation cycle’s worth of checkpoint decisions are delayed. No allocation data is lost.
Infrastructure Failures
Network Partition: Node ↔ Quorum
Detection: Heartbeat timeout on the quorum side; connection failure on the node side.
Recovery:
- Quorum side: nodes marked unreachable → `Degraded` → `Down` after grace period. Allocations requeued.
- Node side: node agent continues running allocations autonomously. Buffers heartbeats and state updates. When connectivity restores, replays buffered state to quorum.
- If partition heals before grace period: node returns to `Ready`, no allocation disruption.
Sensitive: Extended grace period (5 minutes). Network partitions are logged as audit events.
Network Partition: Quorum Split-Brain
Detection: Raft protocol prevents split-brain by design.
Recovery: The minority partition cannot achieve quorum and therefore cannot commit any proposals. The majority partition continues operating normally. When the partition heals, the minority members catch up via Raft log replication. No divergent state is possible.
Storage Unavailability (VAST Down)
Detection: Failed VAST API calls / NFS mount timeouts.
Impact:
- Data staging for new allocations pauses (cannot pre-stage input data)
- Running allocations with data already mounted continue (local NVMe cache, if present, persists)
- Checkpoint writes fail → broker pauses checkpoint scheduling
- New allocation proposals that require data staging are held in queue
Recovery: Automatic retry with backoff. Alert raised. Staging resumes when VAST recovers. On nodes with NVMe cache, locally cached data persists through storage outage.
OpenCHAMI Unavailable
Detection: Failed API calls to OpenCHAMI endpoints.
Impact:
- Node boot/reimaging blocked (cannot provision new nodes)
- Node wipe-on-release blocked (sensitive nodes held in quarantine state)
- Running allocations unaffected
- Scheduling of new allocations to already-booted nodes continues normally
Recovery: Operations that require OpenCHAMI are queued and retried. Alert raised.
Allocation-Level Failures
Prologue Failure (uenv Pull/Mount)
Detection: Node agent reports prologue error to quorum.
Recovery:
- Node drained for this allocation (other allocations on the node unaffected)
- Allocation retried on different nodes (analogous to Slurm PrologSlurmctld failure)
- Max retries configurable (default: 3)
- After max retries: allocation moves to `Failed` state, user notified
Common causes: Corrupted uenv image (hash mismatch), local cache full (if NVMe present), registry unavailable.
Application Crash
Detection: Node agent detects process exit with non-zero status.
Recovery:
- Allocation moves to `Failed` state
- Nodes released back to scheduling pool
- If allocation has `requeue: on_node_failure` or `requeue: always`: re-enter queue
- DAG dependencies evaluated (cross-ref: dag-scheduling.md)
Walltime Exceeded
Detection: Node agent timer.
Recovery:
- `SIGTERM` sent to all processes in the allocation
- Grace period (default: 30s) for clean shutdown
- `SIGKILL` if processes still running after grace period
- Nodes released
- Allocation marked as `Failed` with reason `walltime_exceeded`
Walltime Exceeded During Checkpoint
If an allocation’s walltime expires while a checkpoint is in progress:
- Walltime takes priority. The walltime timer is not extended to accommodate an in-progress checkpoint.
- `SIGTERM` is sent as normal. If the checkpoint completes within the SIGTERM grace period (default: 30s), the checkpoint is usable and the allocation is marked `Suspended` (can be resumed).
- If the checkpoint does not complete within the grace period, `SIGKILL` is sent. The incomplete checkpoint is discarded and the allocation is marked `Failed` with reason `walltime_exceeded`.
- The checkpoint broker tracks this race condition via the `lattice_checkpoint_walltime_conflict_total` counter metric.
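The outcome rules above reduce to a small decision function (hypothetical types, not the broker's API):

```rust
// Sketch of the walltime-vs-checkpoint race outcome (hypothetical types).
#[derive(Debug, PartialEq)]
enum Outcome {
    Suspended, // checkpoint finished within the SIGTERM grace period; resumable
    Failed,    // SIGKILL sent; incomplete checkpoint discarded
}

fn walltime_checkpoint_outcome(checkpoint_secs_remaining: u32, grace_secs: u32) -> Outcome {
    if checkpoint_secs_remaining <= grace_secs {
        Outcome::Suspended
    } else {
        Outcome::Failed
    }
}
```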
Recovery Matrix
| Failure | Detection | Recovery Action | Data Loss Risk |
|---|---|---|---|
| Quorum member loss | Raft heartbeat | Leader election, continue | None |
| Quorum leader loss | Raft timeout | New election (1-3s) | None (uncommitted retried) |
| Complete quorum loss | All members down | Snapshot + WAL recovery | None (last committed state) |
| Node agent crash | Heartbeat timeout (30s) + grace (60s) | Degrade → Down → requeue | Running allocation output since last checkpoint |
| Node hardware failure | BMC + heartbeat | Immediate Down → requeue | Running allocation output since last checkpoint |
| vCluster scheduler crash | Health check | Stateless restart | None |
| API server crash | Health check | Stateless restart | None |
| Checkpoint broker crash | Health check | Restart, re-evaluate | Delayed checkpoint decisions |
| Network partition (node) | Heartbeat timeout | Grace period → requeue | None if heals in time |
| Network partition (quorum) | Raft protocol | Minority stalls, majority continues | None |
| VAST down | API timeout | Queue staging, continue running | None |
| OpenCHAMI down | API timeout | Queue provisioning ops | None |
| Prologue failure | Agent report | Retry on different nodes | None |
| Application crash | Process exit | Release nodes, optional requeue | Application-dependent |
| Walltime exceeded | Agent timer | SIGTERM → SIGKILL → release | Unsaved work |
Allocation Requeue Policy
Configurable per allocation at submission time:
| Policy | Behavior |
|---|---|
| `never` | Allocation fails permanently on any node failure. Default for interactive sessions. |
| `on_node_failure` | Requeue only when the failure is node-side (hardware, agent crash, network partition). Default for batch allocations. |
| `always` | Requeue on any failure, including application crash. Use with caution — can cause infinite requeue loops for buggy applications. |
Max requeue count: Default 3. Configurable per allocation (max 100, validated at submission). After max requeues, allocation transitions to Failed regardless of policy. Requeue uses optimistic concurrency (expected_requeue_count) to prevent double-increment from concurrent reconcilers.
Requeue behavior: Requeued allocations retain their original submission time for fair-share and wait-time calculations (no queue-jumping penalty, no starvation). Just-requeued allocations are excluded from the pending set in the same scheduler cycle (TOCTOU prevention).
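The optimistic-concurrency guard might look like this (hypothetical struct and handler names; the real logic lives in the quorum state machine):

```rust
// Sketch of the expected_requeue_count compare-and-increment guard.
struct Allocation {
    requeue_count: u32,
    max_requeue: u32, // default 3, capped at 100 at submission
}

#[derive(Debug, PartialEq)]
enum RequeueResult {
    Requeued(u32), // new requeue_count
    StaleRequest,  // expected_requeue_count mismatch: a concurrent reconciler won
    MaxedOut,      // allocation transitions to Failed instead
}

fn try_requeue(alloc: &mut Allocation, expected_requeue_count: u32) -> RequeueResult {
    if alloc.requeue_count != expected_requeue_count {
        return RequeueResult::StaleRequest; // prevents double-increment
    }
    if alloc.requeue_count >= alloc.max_requeue {
        return RequeueResult::MaxedOut;
    }
    alloc.requeue_count += 1;
    RequeueResult::Requeued(alloc.requeue_count)
}
```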
Service Failure Detection (Liveness Probes)
For Unbounded and Reactive allocations with a liveness_probe configured:
- Node agent runs the probe periodically (TCP connect or HTTP GET)
- Consecutive failures tracked by ProbeManager (per-allocation counter)
- Threshold exceeded → allocation marked Failed by node agent
- Reconciler detects Failed service → requeues per policy (if not at max_requeue)
- Scheduler re-places the allocation on available nodes
Timeline: initial_delay (default 10s) → periodic probes (default 30s) → failure_threshold (default 3) → Failed → next scheduler cycle requeues.
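The failure-streak logic can be sketched as follows (simplified; the real ProbeManager also handles `initial_delay` and probe execution):

```rust
// Per-allocation consecutive-failure counter, as tracked by ProbeManager.
struct ProbeState {
    consecutive_failures: u32,
    failure_threshold: u32, // default 3
}

impl ProbeState {
    /// Record one probe result; returns true when the allocation
    /// should be marked Failed by the node agent.
    fn record(&mut self, probe_ok: bool) -> bool {
        if probe_ok {
            self.consecutive_failures = 0; // any success resets the streak
        } else {
            self.consecutive_failures += 1;
        }
        self.consecutive_failures >= self.failure_threshold
    }
}
```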
Service Registry Failure
If the service registry becomes inconsistent (e.g., allocation completes but endpoint not deregistered):
- Registry is part of the Raft state machine — same consistency guarantees as node ownership
- Endpoint registration/deregistration happens atomically in the `update_allocation_state()` handler
- Deregistration also occurs in the `requeue_allocation()` handler
- Empty service entries are cleaned up automatically
Cross-References
- scheduling-algorithm.md — f₈ checkpoint_efficiency affects preemption cost
- dag-scheduling.md — Failure propagation in DAG workflows
- sensitive-workloads.md — Sensitive-specific failure handling (longer grace periods, no auto-requeue)
- accounting.md — Accounting service failure buffering
- upgrades.md — Failure detection during canary rollouts
- sessions.md — Interactive session disconnect/reconnect during node failures
Upgrades and Rollouts
Design Principle
Zero-downtime upgrades. No running allocation is disrupted by an upgrade. Components are upgraded independently. Protocol backward compatibility ensures mixed-version operation during rolling upgrades.
Protocol Versioning
All gRPC services are versioned (`lattice.v1.*`):
- New fields are additive (backward compatible within a major version)
- Breaking changes require a new version (`lattice.v2.*`)
- During rolling upgrades, node agents and quorum members must support both version N and N-1
- Version negotiation on connection establishment: components advertise supported versions and use the highest common version
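Highest-common-version selection can be sketched as follows (illustrative; the real negotiation rides on connection establishment):

```rust
// Pick the highest protocol version supported by both sides, or None
// if there is no overlap (the connection would be refused).
fn negotiate(ours: &[u32], theirs: &[u32]) -> Option<u32> {
    ours.iter()
        .copied()
        .filter(|v| theirs.contains(v))
        .max()
}
```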
Upgrade Order
Components are upgraded in dependency order, from leaf to core:
1. Node agents (rolling, batched)
2. vCluster schedulers (rolling)
3. API servers (rolling)
4. Quorum members (Raft rolling membership change, one at a time)
This order ensures that core components (quorum) speak the old protocol until all clients (node agents, schedulers) are upgraded. The quorum is upgraded last because it’s the most critical and the hardest to roll back.
Node Agent Rolling Upgrade
Procedure
For each batch of nodes:
1. Drain: Stop scheduling new allocations to the node. Node enters `Draining` state. If no allocations are running, it transitions directly to `Drained`.
2. Wait: Running allocations complete naturally. The scheduler loop transitions the node from `Draining` to `Drained` once all allocations finish. For urgent upgrades: checkpoint running allocations and migrate (cross-ref: checkpoint-broker.md).
3. Upgrade: Replace the node agent binary while the node is `Drained`. Configuration is preserved.
4. Restart: Node agent starts, re-registers with quorum using the new protocol version.
5. Health check: Node passes health check (heartbeat, GPU detection, network test).
6. Undrain: Operator runs `undrain`. Node transitions from `Drained` to `Ready` and is available for scheduling.
Canary Strategy
- Upgrade 1-2 nodes first (canary set)
- Monitor canary nodes for the observation window (default: 15 minutes):
- Scheduling cycle latency within SLO (cross-ref: telemetry.md scheduler self-monitoring)
- No increase in allocation failures on canary nodes
- Heartbeat latency stable
- Node health check pass rate = 100%
- If canary passes: proceed with rolling batches (batch size configurable, default: 5% of nodes)
- If canary fails: stop rollout, revert canary nodes (see Rollback below)
Batch Sizing
| Cluster Size | Canary Size | Batch Size | Total Batches |
|---|---|---|---|
| < 50 nodes | 1 node | 5 nodes | ~10 |
| 50-500 nodes | 2 nodes | 25 nodes | ~20 |
| 500+ nodes | 5 nodes | 50 nodes | varies |
vCluster Scheduler Rolling Upgrade
Schedulers are stateless — they read state from the quorum each cycle:
- Stop scheduler instance
- Upgrade binary
- Restart
- Verify: scheduling cycle completes successfully, proposals accepted by quorum
During scheduler downtime, the affected vCluster pauses scheduling (no new allocations). Running allocations are unaffected. Multiple scheduler replicas (if deployed) provide continuity.
API Server Rolling Upgrade
API servers are stateless, behind a load balancer:
- Remove instance from load balancer
- Drain active connections (grace period: 30s)
- Upgrade binary
- Restart
- Health check passes → re-add to load balancer
Client impact: brief connection reset for long-lived streams (StreamMetrics, StreamLogs). Clients reconnect automatically.
Quorum Rolling Upgrade
The most sensitive upgrade. One member at a time, maintaining quorum majority throughout:
3-Member Quorum
- Upgrade follower A: remove from Raft group → upgrade → re-add
- Wait for follower A to catch up (Raft log sync)
- Upgrade follower B: remove → upgrade → re-add
- Wait for follower B to catch up
- Trigger leader transfer to an upgraded follower
- Upgrade old leader: remove → upgrade → re-add
Constraint: Never more than 1 member down simultaneously (2/3 majority required).
5-Member Quorum
Same procedure but can upgrade 2 followers in parallel (3/5 majority maintained):
- Upgrade followers A and B in parallel
- Wait for catch-up
- Upgrade followers C and D in parallel
- Wait for catch-up
- Leader transfer → upgrade old leader
Constraint: Never more than 2 members down simultaneously (3/5 majority required).
Quorum Upgrade Verification
After each member upgrade:
- Raft log replication is current (no lag)
- Commit latency within SLO (< 5s)
- Leader election succeeds if triggered
- All node ownership state is consistent
Canary Criteria
Metrics from scheduler self-monitoring (cross-ref: telemetry.md) that gate rollout progression:
| Metric | Threshold | Severity |
|---|---|---|
| `lattice_scheduling_cycle_duration_seconds` | p99 < 30s | Warning: pause rollout |
| `lattice_scheduling_proposals_total{result="rejected"}` | No increase > 10% | Warning: pause rollout |
| `lattice_agent_heartbeat_latency_seconds` | p99 < 5s | Warning: pause rollout |
| `lattice_raft_commit_latency_seconds` | p99 < 5s | Critical: stop rollout |
| `lattice_api_requests_total{status="5xx"}` | No increase > 5% | Warning: pause rollout |
| Allocation failure rate | No increase | Critical: stop rollout |
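A sketch of how these gates might be combined, with critical gates taking precedence (hypothetical function; thresholds drawn from the table above):

```rust
// Sketch of canary gate evaluation (illustrative; real thresholds come
// from quorum config and real values from telemetry queries).
#[derive(Debug, PartialEq)]
enum Gate {
    Proceed,
    Pause, // warning threshold breached: hold rollout for operator review
    Stop,  // critical threshold breached: stop rollout
}

fn evaluate_canary(raft_commit_p99_s: f64, cycle_p99_s: f64, alloc_failure_increase: bool) -> Gate {
    // Critical gates first: any breach stops the rollout outright.
    if raft_commit_p99_s >= 5.0 || alloc_failure_increase {
        return Gate::Stop;
    }
    // Warning gates pause the rollout.
    if cycle_p99_s >= 30.0 {
        return Gate::Pause;
    }
    Gate::Proceed
}
```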
Rollback
Node Agent Rollback
- Drain canary/failed nodes
- Replace binary with previous version
- Restart
- Verify old-version operation
- Protocol backward compatibility ensures the rolled-back agent works with the rest of the cluster
Scheduler/API Rollback
Stateless — replace binary and restart.
Quorum Rollback
- Remove new-version member from Raft group
- Add old-version member back
- Protocol backward compatibility ensures mixed-version operation during the transition
Rollback is always safe because N-1 protocol support is maintained throughout the upgrade window.
Configuration Hot-Reload
Not all changes require a binary upgrade. Configuration changes that can be hot-reloaded via quorum without restart:
| Change | Hot-Reloadable | Mechanism |
|---|---|---|
| Cost function weights | Yes | Quorum config update, schedulers pick up next cycle |
| vCluster policies | Yes | Quorum config update |
| Telemetry mode (prod/debug/audit) | Yes | API call to node agent |
| Tenant quotas | Yes | Quorum config update |
| Node drain/undrain | Yes | API call |
| Protocol version | No | Binary upgrade required |
| Raft cluster size | No | Membership change (safe, but not hot-reload) |
Cross-References
- telemetry.md — Scheduler self-monitoring metrics used for canary criteria
- failure-modes.md — Failure detection during upgrades
- security.md — Certificate rotation during upgrades
- checkpoint-broker.md — Checkpoint before drain for urgent upgrades
Testing Strategy
Design Principle
Scheduler correctness is non-negotiable. The testing strategy covers four levels: unit tests for individual functions, integration tests for component interactions, simulation tests for scheduling behavior, and chaos tests for fault tolerance. Every level must pass before a release.
Test Levels
┌─────────────────────────────────────────────────┐
│ Level 4: Chaos Tests (fault injection) │
│ Raft leader loss, network partitions, │
│ node failures, storage unavailability │
├─────────────────────────────────────────────────┤
│ Level 3: Simulation (RM-Replay) │
│ Production workload replay, weight tuning, │
│ fairness validation, SLO compliance │
├─────────────────────────────────────────────────┤
│ Level 2: Integration Tests │
│ Multi-component scenarios, API contracts, │
│ end-to-end allocation lifecycle │
├─────────────────────────────────────────────────┤
│ Level 1: Unit Tests │
│ Cost function, topology solver, state machine,│
│ protobuf serialization, error handling │
└─────────────────────────────────────────────────┘
Level 1: Unit Tests
In-module tests (`#[cfg(test)]`), run via `cargo test`.
Critical Paths
| Crate | What to Test | Example |
|---|---|---|
| `lattice-scheduler` | Cost function components (f₁-f₉) | Given inputs, verify score output |
| `lattice-scheduler` | Knapsack solver | Given nodes and allocations, verify placement |
| `lattice-scheduler` | Topology packing | Given groups and node count, verify group selection |
| `lattice-scheduler` | Conformance group selection | Given fingerprints, verify grouping |
| `lattice-quorum` | Raft proposal validation | Hard quota rejection, ownership conflict |
| `lattice-quorum` | State machine transitions | Node state changes, allocation lifecycle |
| `lattice-common` | Type serialization/deserialization | Protobuf round-trip for all types |
| `lattice-common` | Allocation state machine | Valid and invalid state transitions |
| `lattice-api` | Request validation | Reject invalid allocations (cycles in DAG, bad constraints) |
| `lattice-api` | SBATCH directive parsing | Translate Slurm directives to Intent API |
| `lattice-checkpoint` | Cost model evaluation | Given metrics, verify checkpoint decision |
| `lattice-cli` | Argument parsing | Flag combinations, error messages |
Property-Based Tests
Use proptest for property-based testing of the cost function and solver:
- Cost function monotonicity: Increasing wait time always increases f₂
- Fair share bounds: f₃ always in [0, 1]
- Solver validity: Every placement returned by the solver satisfies all constraints
- Topology packing: Solver never spans more groups than necessary
- State machine: No invalid state transitions accepted
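Two of these properties, sketched with stdlib-only stand-ins for the cost components (the real f₂/f₃ live in lattice-scheduler, and proptest would generate the inputs rather than this fixed sweep):

```rust
// Hypothetical saturating aging curve standing in for f2 (wait time).
fn f2_wait_time(wait_s: f64) -> f64 {
    1.0 - (-wait_s / 3600.0).exp()
}

// Hypothetical fair-share ratio standing in for f3, clamped to [0, 1].
fn f3_fair_share(used: f64, target: f64) -> f64 {
    (used / target.max(f64::MIN_POSITIVE)).min(1.0).max(0.0)
}

/// Monotonicity: more waiting never lowers f2. Bounds: f3 stays in [0, 1].
fn properties_hold(samples: u32) -> bool {
    (0..samples).all(|i| {
        let w = i as f64 * 60.0;
        f2_wait_time(w + 60.0) >= f2_wait_time(w)
            && (0.0..=1.0).contains(&f3_fair_share(i as f64, 10.0))
    })
}
```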
Level 2: Integration Tests
In tests/ directories, using real components with mock external dependencies.
Test Harness
A test harness that spins up:
- In-memory Raft cluster (3 members, using `openraft` test utilities)
- Mock node agents (report capabilities, respond to heartbeats)
- Mock VAST API (storage queries return configurable responses)
- Real scheduler instances
- Real API server (in-process)
Scenarios
| Scenario | What It Tests |
|---|---|
| Submit → Schedule → Complete | Full allocation lifecycle through all components |
| DAG submission | Multi-allocation workflow with dependency resolution |
| Preemption | Higher-priority allocation preempts lower-priority |
| Elastic borrowing | vCluster borrows and returns nodes |
| Quota rejection | Hard quota exceeded → proposal rejected |
| Sensitive claim | Node claim, audit logging, wipe on release |
| Session lifecycle | Session create → terminal → disconnect → cleanup |
| Rolling upgrade simulation | Mixed-version node agents, protocol negotiation |
| Conformance drift | Node fingerprint changes → scheduling impact |
| Reactive scaling | Metric threshold triggers scale-up/down |
API Contract Tests
For every API endpoint, test:
- Valid request → expected response
- Invalid request → appropriate error code and message
- Authorization: user sees own allocations only, tenant-admin sees tenant, system-admin sees all
- Rate limiting: exceeded rate → 429 with Retry-After header
Protobuf Compatibility
Test backward compatibility:
- Deserialize messages from previous version with new code (additive fields)
- Deserialize messages from new version with old code (unknown fields ignored)
Level 3: Simulation (RM-Replay)
Purpose
RM-Replay replays production workload traces through the scheduler to validate scheduling behavior without risking production. Essential for:
- Tuning cost function weights before deployment
- Validating fairness across tenants
- Regression testing after scheduler changes
Workflow
1. Capture: Record production workload traces
- Allocation submissions (arrival time, resources, constraints, tenant)
- Allocation completions (duration, exit status)
- Node inventory (capabilities, topology)
2. Configure: Set cost function weights and vCluster policies
3. Replay: Feed traces through lattice-scheduler in simulation mode
- No real nodes or quorum — mock environment
- Simulated time (runs in seconds, not hours)
- Deterministic (same trace + same weights = same result)
4. Evaluate: Measure scheduling outcomes
- Utilization: fraction of GPU-hours used
- Wait time: p50, p95, p99 queue wait per priority class
- Fairness: actual share vs. target share per tenant (Jain's fairness index)
- Backfill effectiveness: percentage of idle slots filled
- SLO compliance: percentage of allocations meeting target wait time
- Preemption rate: preemptions per hour
5. Iterate: Adjust weights, re-run, compare
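Jain's fairness index, used for the fairness measurement above, is (Σx)² / (n·Σx²) over per-tenant shares; a minimal implementation:

```rust
// Jain's fairness index: 1.0 means perfectly equal shares;
// it approaches 1/n as a single tenant dominates.
fn jain_index(shares: &[f64]) -> f64 {
    let n = shares.len() as f64;
    let sum: f64 = shares.iter().sum();
    let sum_sq: f64 = shares.iter().map(|x| x * x).sum();
    if sum_sq == 0.0 {
        return 1.0; // no usage at all: trivially fair
    }
    (sum * sum) / (n * sum_sq)
}
```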
Regression Suite
Maintain a library of representative workload traces:
| Trace | Description | Key Metric |
|---|---|---|
| `steady-state.trace` | Normal mixed workload (HPC + ML + services) | Utilization > 85% |
| `burst.trace` | Sudden spike in submissions | No starvation (p99 wait < 4h) |
| `unfair.trace` | One tenant submits heavily | Fair share deviation < 10% |
| `sensitive-claim.trace` | Sensitive claims interleaved with HPC | Sensitive wait = 0 (immediate) |
| `preemption-heavy.trace` | Many priority inversions | Checkpoint success rate > 95% |
| `empty-to-full.trace` | Cluster goes from idle to full | Ramp-up time, scheduling cycle latency |
Each trace has a pass/fail threshold for key metrics. CI runs the regression suite on every scheduler change.
Level 4: Chaos Tests
Fault injection tests that validate the failure modes documented in failure-modes.md.
Fault Injection Framework
Use a test harness that can inject faults at configurable times:
| Fault | Injection Method | Validates |
|---|---|---|
| Raft leader kill | Stop leader process | Leader election, in-flight proposal retry |
| Raft member kill | Stop follower process | Continued operation with minority loss |
| Network partition (node↔quorum) | Drop heartbeats | Degraded → Down transition, allocation requeue |
| Network partition (quorum split) | Partition Raft members | Minority stalls, majority continues |
| Node agent crash | Kill agent process | Heartbeat timeout, allocation requeue |
| Storage unavailability | Mock VAST returns errors | Staging pauses, running allocations continue |
| Checkpoint timeout | Application ignores checkpoint hint | Forced preemption after timeout |
| API server crash | Kill API server | Client retry, no state loss |
| Quorum snapshot corruption | Corrupt snapshot file | Recovery from previous valid snapshot |
Chaos Test Scenarios
| Scenario | Steps | Expected Outcome |
|---|---|---|
| Leader election under load | Submit 50 allocations, kill leader mid-cycle | New leader elected < 5s, no proposals lost, all allocations eventually scheduled |
| Node failure with requeue | Start 10 allocations, kill 2 node agents | Allocations requeued, rescheduled on healthy nodes, total delay < 2 min |
| Split-brain prevention | Partition 3-member quorum into 1+2 | Minority (1) cannot commit, majority (2) continues, no divergent state |
| Cascade failure | Kill 3 node agents simultaneously | Allocations on all 3 nodes requeued, scheduling continues for remaining nodes |
| Sensitive node failure | Kill sensitive node agent | Extended grace period, operator alert, no auto-requeue |
| Recovery from full quorum loss | Kill all quorum members, restore from snapshot | State restored, node agents reconnect, scheduling resumes |
Execution
Chaos tests run in CI on a dedicated stage (not on every commit):
- Nightly: full chaos suite
- On release branch: full chaos suite must pass
Performance Benchmarks
Scheduling Cycle Latency
| Benchmark | Configuration | Target |
|---|---|---|
| 100 pending allocations, 1000 nodes | HPC backfill | Cycle < 5s |
| 500 pending allocations, 5000 nodes | HPC backfill | Cycle < 15s |
| 1000 pending allocations, 10000 nodes | HPC backfill | Cycle < 30s |
| Raft commit (single proposal) | 3-member quorum | p99 < 50ms |
| Raft commit (single proposal) | 5-member quorum | p99 < 100ms |
Load Tests
| Test | Description | Target |
|---|---|---|
| API throughput | Concurrent submission requests | > 1000 req/s |
| Heartbeat load | 10000 node agents reporting | < 1% CPU on quorum |
| Log streaming | 100 concurrent log streams | < 5% CPU on API server |
CI Pipeline
On every commit:
- `cargo fmt --check`
- `cargo clippy --all-targets`
- `cargo test` (Level 1: unit tests)

On every PR:
- Level 1 + Level 2 (integration tests)
- Protobuf backward compatibility check

Nightly:
- Level 1 + Level 2 + Level 3 (RM-Replay regression) + Level 4 (chaos)
- Performance benchmarks (track regressions)

On release:
- All levels must pass
- Performance benchmarks must meet targets
Cross-References
- failure-modes.md — Failure scenarios validated by chaos tests
- scheduling-algorithm.md — Cost function tested by unit tests and RM-Replay
- upgrades.md — Rolling upgrade validated by integration tests
- conformance.md — Conformance behavior validated by integration tests
DAG Scheduling
Design Principle
DAGs are first-class workflow primitives. The scheduler resolves dependencies; users declare intent. Dependency semantics are Slurm-compatible (afterok, afternotok, afterany, aftercorr) to ease migration.
DAG Submission
A DAG is a set of allocation specs with dependency edges, submitted as a single unit via the Intent API:
```rust
DagSpec {
    allocations: Vec<AllocationSpec>, // each spec has an id and depends_on fields
}
```
Dependencies are expressed inline via each `AllocationSpec.depends_on` field (a list of `DependencySpec` with `ref_id` and `condition`), not as separate edge objects. This matches the protobuf definition in proto/lattice/v1/allocations.proto.
Dependency Conditions
Defined in crates/lattice-common/src/types.rs (DependencyCondition enum):
| Condition | Slurm Equivalent | Semantics |
|---|---|---|
| `Success` | `afterok` | Successor runs only if predecessor exits 0 |
| `Failure` | `afternotok` | Successor runs only if predecessor exits non-zero |
| `Any` | `afterany` | Successor runs regardless of predecessor’s exit status |
| `Corresponding` | `aftercorr` | Task group: array element N depends on predecessor’s element N |
| `Mutex` | `singleton` | Only one allocation with this mutex name runs at a time |
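As an illustration of these semantics, the condition check for a single edge might look like the sketch below. The `Terminal` and `Condition` types here are hypothetical stand-ins (the real `DependencyCondition` enum lives in `crates/lattice-common/src/types.rs`), and the `Corresponding`/`Mutex` cases are omitted since they depend on array-index and mutex state:

```rust
// Hypothetical simplification of the real types; not the crate's actual code.
#[derive(Clone, Copy)]
enum Terminal {
    Completed { exit_code: i32 },
    Failed,
    Cancelled,
}

#[derive(Clone, Copy)]
enum Condition {
    Success, // afterok
    Failure, // afternotok
    Any,     // afterany
}

/// Is this dependency edge satisfied by the predecessor's terminal state?
fn satisfied(cond: Condition, pred: Terminal) -> bool {
    match (cond, pred) {
        // Success: predecessor must have exited 0
        (Condition::Success, Terminal::Completed { exit_code }) => exit_code == 0,
        (Condition::Success, _) => false,
        // Failure: predecessor must have failed or exited non-zero
        (Condition::Failure, Terminal::Completed { exit_code }) => exit_code != 0,
        (Condition::Failure, Terminal::Failed) => true,
        (Condition::Failure, Terminal::Cancelled) => false,
        // Any: successor runs regardless of outcome
        (Condition::Any, _) => true,
    }
}
```

Note the sketch treats a cancelled predecessor as satisfying `Any` but not `Failure`; the exact treatment of cancellation is a policy choice.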
DAG Lifecycle
1. Submission and Validation
- User submits `DagSpec` via `POST /v1/dags` or `lattice dag submit`
- lattice-api validates the graph:
  - No cycles (topological sort must succeed)
  - All `depends_on.ref_id` values reference allocation IDs within the DAG
  - All allocation specs individually valid
- The DAG receives a unique `dag_id`
- Individual allocations receive `allocation_id` values and are tagged with `dag_id`
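The cycle check can be sketched with Kahn's algorithm: repeatedly release nodes whose dependencies are all accounted for; if any node is never released, the graph contains a cycle. A minimal sketch over a hypothetical adjacency map (the actual validation lives in lattice-api), assuming every allocation ID has an entry, possibly with an empty dependency list:

```rust
use std::collections::HashMap;

/// Returns true if the dependency graph is acyclic.
/// `deps` maps each allocation id to the ids it depends on.
fn is_acyclic(deps: &HashMap<&str, Vec<&str>>) -> bool {
    // in-degree = number of unresolved dependencies per allocation
    let mut indegree: HashMap<&str, usize> =
        deps.iter().map(|(id, d)| (*id, d.len())).collect();
    // reverse edges: predecessor -> successors
    let mut successors: HashMap<&str, Vec<&str>> = HashMap::new();
    for (id, d) in deps {
        for p in d {
            successors.entry(*p).or_default().push(*id);
        }
    }
    // root nodes: no incoming dependency edges
    let mut ready: Vec<&str> = indegree
        .iter()
        .filter(|(_, deg)| **deg == 0)
        .map(|(id, _)| *id)
        .collect();
    let mut visited = 0;
    while let Some(id) = ready.pop() {
        visited += 1;
        if let Some(succs) = successors.get(id) {
            for s in succs {
                let deg = indegree.get_mut(s).unwrap();
                *deg -= 1;
                if *deg == 0 {
                    ready.push(*s);
                }
            }
        }
    }
    // every node released exactly once <=> no cycle
    visited == deps.len()
}
```

This runs in O(V+E), matching the complexity claim for submission-time validation.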
2. Root Node Scheduling
- Allocations with no incoming dependency edges (root nodes) enter their vCluster scheduler queue immediately
- Root nodes are scored and scheduled like any other allocation
3. Dependency Resolution
- When an allocation completes (any terminal state), the system evaluates outgoing edges:
- For each outgoing edge, check if the condition is satisfied
- If all incoming edges to a successor are satisfied, the successor enters the scheduler queue
- Dependency resolution is eventually consistent (handled by lattice-api or a lightweight DAG controller, not the quorum)
4. DAG Completion
- DAG completes when all allocations reach a terminal state (Completed, Failed, or Cancelled)
- DAG state: `Running` while any allocation is pending or running, `Completed` when all are done, `Failed` if any required allocation failed without a catching edge
5. DAG Cancellation
- `DELETE /v1/dags/{id}` or `lattice dag cancel {id}`
- Cancels all pending and running allocations in the DAG
- Running allocations receive SIGTERM → grace period → SIGKILL (same as walltime exceeded)
Failure Propagation
Default: Success Dependencies
If allocation A fails and B depends on A via Success:
- B is cancelled (dependency can never be satisfied)
- B’s downstream dependencies are also evaluated (cascading cancellation)
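Cascading cancellation is a reachability walk over `Success` edges from the failed allocation. A simplified sketch over a hypothetical adjacency map (evaluation of `Failure` and `Any` edges along the way is omitted):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Collect every downstream allocation whose Success-dependency chain
/// can no longer be satisfied once `failed` has failed.
fn cascade_cancel(
    failed: &str,
    success_edges: &HashMap<&str, Vec<&str>>, // predecessor -> Success successors
) -> HashSet<String> {
    let mut cancelled = HashSet::new();
    let mut queue = VecDeque::from([failed]);
    while let Some(id) = queue.pop_front() {
        for succ in success_edges.get(id).into_iter().flatten() {
            // A Success dependency on a failed (or cancelled) predecessor
            // can never be satisfied, so the successor is cancelled too,
            // and its own successors are re-examined.
            if cancelled.insert(succ.to_string()) {
                queue.push_back(*succ);
            }
        }
    }
    cancelled
}
```

In the full controller, each cancelled allocation would also trigger evaluation of its outgoing `Failure`/`Any` edges, per the semantics above.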
Error Handling Paths
With Failure edges, users can build error-handling workflows:
train ──Success──→ evaluate ──Success──→ deploy
  │                    │
  └──Failure──→ notify_failure
                       │
                       └──Failure──→ notify_failure
- `notify_failure` runs only if `train` or `evaluate` fails
- `deploy` runs only if both `train` and `evaluate` succeed
Any Dependencies
With Any edges, successors run regardless:
run_experiment ──Any──→ cleanup
cleanup runs whether run_experiment succeeds or fails. Useful for teardown tasks.
Corresponding Dependencies (Task Groups)
For task groups (array jobs), Corresponding creates element-wise dependencies:
preprocess[0..N] ──Corresponding──→ train[0..N]
train[i] starts only when preprocess[i] completes successfully. Other array elements are independent.
State Tracking
DAG state is eventually consistent, following ADR-004:
- The quorum tracks individual allocation states (ownership, terminal states). It does not know about DAG structure.
- The DAG controller (runs within lattice-api) evaluates dependency edges when allocation state changes. It reads allocation states from the quorum and determines which successors to release into the scheduler queue.
- This separation keeps the quorum simple and avoids adding DAG-specific logic to the Raft state machine.
DAG Queries
| Endpoint | Description |
|---|---|
| `GET /v1/dags/{id}` | DAG status: overall state, per-allocation states |
| `GET /v1/dags/{id}/graph` | DAG structure: allocations and edges |
| `GET /v1/dags?tenant={id}` | List DAGs for a tenant |
| `DELETE /v1/dags/{id}` | Cancel DAG |
CLI equivalents: `lattice dag status`, `lattice dag list`, `lattice dag cancel`.
Edge Cases
Node Failure During DAG Execution
When a node fails while running a DAG allocation:
- The allocation follows its `requeue_policy` (see failure-modes.md)
- If requeued: the allocation re-enters the scheduler queue with its original priority. Downstream dependencies remain blocked until it completes.
- If failed: downstream `Success` dependencies are cancelled. `Failure` and `Any` edges are evaluated normally.
- DAG state remains `Running` as long as any allocation is pending or active.
Task Group with Corresponding Dependencies and Mixed Exit Codes
When a task group has Corresponding dependencies and individual elements exit with different codes:
- Each `Corresponding` edge is evaluated independently per array index
- `train[3]` failing does not affect `train[4]`’s dependency on `preprocess[4]`
- The downstream task group may have a mix of running, cancelled, and completed elements
- DAG completion waits for all evaluable elements to reach terminal states
Corresponding Dependencies with Mismatched Array Sizes
When two task groups have Corresponding dependencies but different array sizes (e.g., preprocess[0..9] → train[0..14]):
- Array indices that exist in both groups are matched normally: `train[i]` depends on `preprocess[i]` for `i` in `0..9`.
- Extra indices in the successor group (`train[10..14]`) have no matching predecessor element. These extra indices are treated as having their `Corresponding` dependency satisfied immediately — they enter the scheduler queue as if they were root nodes.
- This design avoids silent failures: users get all successor elements running, not just the matched subset.
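The matching rule reduces to a small function. A sketch (illustrative only, not the crate's API):

```rust
/// For a Corresponding edge between a predecessor group of size `pred_len`
/// and a successor group of size `succ_len`, return the predecessor array
/// index that successor element `i` must wait on, if one exists.
fn corresponding_predecessor(i: usize, pred_len: usize, succ_len: usize) -> Option<usize> {
    assert!(i < succ_len, "index must be within the successor group");
    if i < pred_len {
        Some(i) // matched: waits on predecessor element i
    } else {
        None // unmatched: dependency treated as satisfied immediately
    }
}
```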
Max DAG Size
DAGs are validated at submission time with a maximum allocation count (default: 1000 allocations per DAG). Submitting a DAG exceeding this limit returns an error:
Error: DAG exceeds maximum size (1234 allocations, limit: 1000)
Hint: Split the workflow into smaller DAGs or increase the limit via system configuration.
The limit is configurable via lattice admin config set scheduling.max_dag_size=2000. Cycle detection runs in O(V+E) and is not a bottleneck, but very large DAGs increase dependency resolution overhead in the DAG controller.
Cross-References
- api-design.md — DagSpec in protobuf definition
- scheduling-algorithm.md — DAG members are scored individually by the knapsack solver
- failure-modes.md — Allocation-level failure recovery interacts with DAG propagation
- types.rs — `Dependency`, `DependencyCondition` enum definitions
Preemption Policy
Design Principle
Preemption is a last resort for resource rebalancing. The scheduler prefers waiting, backfill, and elastic borrowing over preemption. When preemption is necessary, it targets allocations with the lowest preemption cost (fast checkpoint, low priority, short remaining runtime). Sensitive allocations are never preempted.
Preemption Classes
Each allocation has a preemption_class (0-10):
| Class | Meaning | Typical Use | Preemptible By |
|---|---|---|---|
| 0 | Best-effort | Scavenger jobs, testing | Any higher class |
| 1-3 | Low priority | Batch exploration, sweeps | Class 4+ |
| 4-6 | Normal | Production training, simulation | Class 7+ |
| 7-9 | High priority | Time-sensitive production | Class 10 only |
| 10 | Critical / Sensitive | Sensitive claims, emergency | Never preempted |
Rule: Preemption only moves down — a class-5 allocation can preempt class 0-4 allocations but never class 5+.
Enforcement: The preemption_class range (0-10) is validated at API admission. Values outside this range are rejected with a 400 Bad Request error before reaching the scheduler.
Tie-breaking within class: If multiple allocations have the same preemption class, the scheduler prefers to preempt the one with the lowest checkpoint cost (f₈).
Preemption Triggers
1. Higher-Priority Demand
A pending allocation with class N cannot be scheduled because all suitable nodes are occupied by lower-class allocations. The scheduler evaluates whether preempting one or more lower-class allocations would free enough resources.
2. Elastic Reclamation
A vCluster’s idle nodes were borrowed by another vCluster (elastic sharing). The home vCluster now needs them back. Borrowed nodes carry an implicit preemption risk — the checkpoint cost model (f₈) accounts for this.
3. Sensitive Node Claim
A sensitive user claims nodes that are currently occupied by non-sensitive allocations. Sensitive claims are class 10 (highest). The scheduler triggers immediate checkpoint + preemption of the occupying allocations.
4. Quota Enforcement
A tenant exceeds their hard quota due to a race condition (two concurrent proposals, first committed). The quorum rejects the second proposal — this is not preemption but rejection. Running allocations are never preempted for quota enforcement.
Preemption Decision Algorithm
PreemptionDecision(pending_job, candidates):
1. Filter candidates:
- Only allocations with preemption_class < pending_job.preemption_class
- Exclude sensitive allocations (never preempted)
- Exclude allocations in Checkpointing state (already being preempted)
2. Score each candidate by preemption cost:
preemption_cost(c) = checkpoint_time(c)
+ recompute_if_no_checkpoint(c)
+ remaining_walltime_value(c)
checkpoint_time(c):
If checkpoint == Auto: estimated_checkpoint_minutes from f₈
If checkpoint == Manual: assume application handles it, use configured timeout
If checkpoint == None: recompute_if_no_checkpoint applies
recompute_if_no_checkpoint(c):
time_since_last_checkpoint(c) × node_count(c) × gpu_per_node
(GPU-hours that would be lost)
remaining_walltime_value(c):
If c is near completion (>90% walltime used): high cost (let it finish)
If c just started (<10% walltime used): low cost (little invested)
3. Select victim set:
Greedy: pick candidates with lowest preemption_cost until enough nodes freed.
Constraint: freed nodes must satisfy pending_job's topology/conformance requirements.
4. If no valid victim set exists: pending_job stays queued (preemption not possible).
5. If valid victim set found: initiate preemption sequence.
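Steps 1-3 above can be sketched as a greedy pass over hypothetical candidate records; the topology/conformance constraint from step 3 is omitted here, and the field names are illustrative:

```rust
#[derive(Clone)]
struct Candidate {
    id: String,
    preemption_class: u8,
    sensitive: bool,
    checkpointing: bool,   // already being preempted
    nodes: u32,
    preemption_cost: f64,  // checkpoint_time + recompute + remaining-walltime value
}

/// Greedy victim selection: cheapest eligible candidates first, until
/// enough nodes are freed. Returns None if no valid victim set exists.
fn select_victims(
    pending_class: u8,
    nodes_needed: u32,
    candidates: &[Candidate],
) -> Option<Vec<String>> {
    // Step 1: filter — preemption only moves down, never sensitive,
    // never allocations already in Checkpointing state.
    let mut eligible: Vec<&Candidate> = candidates
        .iter()
        .filter(|c| c.preemption_class < pending_class)
        .filter(|c| !c.sensitive)
        .filter(|c| !c.checkpointing)
        .collect();
    // Step 2: score — lowest preemption cost first.
    eligible.sort_by(|a, b| a.preemption_cost.total_cmp(&b.preemption_cost));
    // Step 3: accumulate victims until enough nodes are freed.
    let mut freed = 0;
    let mut victims = Vec::new();
    for c in eligible {
        if freed >= nodes_needed {
            break;
        }
        freed += c.nodes;
        victims.push(c.id.clone());
    }
    if freed >= nodes_needed { Some(victims) } else { None }
}
```

Note a real implementation would also apply the max-victims-per-decision cap and prefer fewer, larger victims, per the multi-victim constraints below.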
Preemption Sequence
1. Scheduler identifies victim allocations
2. For each victim:
a. If checkpoint == Auto or Manual:
- Checkpoint broker sends CHECKPOINT_HINT to node agents
- Application checkpoints (signal, shmem, or gRPC callback)
- Timeout: checkpoint_timeout (default: 10 minutes)
b. If checkpoint == None:
- SIGTERM sent immediately
- Grace period (30s) → SIGKILL
3. When checkpoint completes (or timeout):
- Allocation transitions to Suspended state
- Nodes released to quorum (Raft commit)
4. Freed nodes assigned to pending allocation
5. Suspended allocations re-enter queue with:
- Original submission time preserved (no wait-time penalty)
- Resume-from-checkpoint flag set
- Preempted-count incremented
Checkpoint Timeout Handling
When a checkpointing allocation fails to complete within the timeout:
| Scenario | Action |
|---|---|
| Application responds but slow | Extend timeout by 50%, once |
| Application unresponsive | SIGTERM → grace period → SIGKILL. Mark as failed (not suspended). Requeue if policy allows. |
| gRPC callback: application requests deferral | Grant deferral up to max_deferral (default: 5 minutes). Then force. |
Multi-Victim Preemption
Sometimes freeing one allocation isn’t enough. The scheduler can preempt multiple allocations in a single decision:
Constraints:
- Maximum victims per decision: configurable (default: 3)
- All victims must have lower preemption class than the pending job
- Total preemption cost must be less than the pending job’s estimated value
- Scheduler prefers preempting fewer, larger allocations over many small ones
Ordering: Victims are preempted in parallel (all receive checkpoint hints simultaneously). The pending job starts once all victims have released their nodes.
Per-vCluster Preemption Policy
| vCluster Type | Preemption Allowed | Notes |
|---|---|---|
| HPC Batch | Yes | Class-based, checkpoint-aware |
| ML Training | Yes | Checkpoint cost heavily weighted (w₈=0.15) |
| Service | Yes (borrowed nodes only) | Services on home nodes are not preempted; borrowed nodes reclaimable |
| Sensitive | Never preempted | Class 10, no exceptions |
| Interactive | Yes | Short-lived, low cost to preempt |
Non-Preemptible Allocations
An allocation is effectively non-preemptible when:
- `checkpoint: None` AND `preemption_class >= 7` — high cost to preempt (all progress lost), high priority
- Sensitive allocations (always class 10)
- Allocations within 5 minutes of walltime completion (configurable: `near_completion_threshold`)
The scheduler avoids placing non-preemptible allocations on borrowed nodes, since those nodes may need to be reclaimed.
Preemption Metrics
| Metric | Type | Description |
|---|---|---|
| `lattice_preemptions_total` | counter | Labels: `vcluster`, `reason` (priority/reclaim/sensitive) |
| `lattice_preemption_checkpoint_duration_seconds` | histogram | Time from hint to checkpoint completion |
| `lattice_preemption_victim_requeue_total` | counter | Preempted allocations re-entering queue |
| `lattice_preemption_failed_checkpoint_total` | counter | Checkpoint timeouts during preemption |
Cross-References
- scheduling-algorithm.md — f₈ checkpoint_efficiency in cost function
- checkpoint-broker.md — Checkpoint cost model and application protocol
- failure-modes.md — Requeue policy for preempted allocations
- node-lifecycle.md — Node state transitions during preemption
- sensitive-workloads.md — Sensitive allocations never preempted
Checkpoint Broker
Purpose
The checkpoint broker coordinates between the scheduler’s resource management decisions and running applications’ checkpoint capabilities. It enables cost-aware preemption: the scheduler can reclaim resources from running jobs by triggering checkpoints, with the decision driven by an economic cost function.
Cost Model
When to Checkpoint
Should_checkpoint(j, t) = Value(j, t) > Cost(j, t)
Cost Components
Cost(j, t) = write_time(j) + compute_waste(j) + storage_cost(j)
write_time(j):
Estimated from: checkpoint_size(j) / storage_write_bandwidth
checkpoint_size(j) estimated from: GPU memory usage × node count
storage_write_bandwidth from: VAST API current throughput metrics
compute_waste(j):
GPU-seconds lost during checkpoint I/O
= write_time(j) × node_count(j) × gpu_per_node
storage_cost(j):
= checkpoint_size(j) × cost_per_GB_on_target_tier
Value Components
Value(j, t) = recompute_saved(j, t) + preemptability(j, t) + backlog_relief(t)
recompute_saved(j, t):
GPU-hours that would be lost if the job fails and restarts from scratch
= time_since_last_checkpoint(j) × node_count(j) × gpu_per_node
Weighted by failure_probability(j, t) which increases with:
- Job duration (longer jobs more likely to hit hardware issues)
- Node health signals (ECC errors, thermal warnings from BMC)
preemptability(j, t):
Value of being able to preempt this job if a higher-priority job arrives
= Σ (waiting_higher_priority_jobs × their urgency) × preemption_probability
High when higher-priority work is queued and this job sits on reclaimable nodes
backlog_relief(t):
= backlog_pressure(t) × estimated_queue_wait_reduction_if_nodes_freed
Global signal: how much would freeing these nodes help the overall queue?
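Putting the components together, the decision is a direct comparison of the two sums. A sketch in GPU-hour units, with a hypothetical snapshot struct standing in for the broker's inputs (VAST metrics, quorum state, queue state):

```rust
/// Inputs to the checkpoint decision for one running job.
/// Field names are illustrative, not the broker's actual types.
struct JobSnapshot {
    checkpoint_size_gb: f64,
    write_bw_gb_per_s: f64,           // from storage throughput metrics
    node_count: f64,
    gpu_per_node: f64,
    hours_since_last_checkpoint: f64,
    failure_probability: f64,         // 0.0..1.0, from node health + runtime
    storage_cost_per_gb: f64,         // normalized to GPU-hour equivalents
    preempt_value: f64,               // preemptability term, GPU-hour equivalent
    backlog_relief: f64,              // backlog term, GPU-hour equivalent
}

fn should_checkpoint(j: &JobSnapshot) -> bool {
    // Cost(j, t) = write_time + compute_waste + storage_cost
    let write_time_h = j.checkpoint_size_gb / j.write_bw_gb_per_s / 3600.0;
    let compute_waste = write_time_h * j.node_count * j.gpu_per_node;
    let storage_cost = j.checkpoint_size_gb * j.storage_cost_per_gb;
    let cost = write_time_h + compute_waste + storage_cost;

    // Value(j, t) = recompute_saved + preemptability + backlog_relief
    let recompute_saved = j.hours_since_last_checkpoint
        * j.node_count
        * j.gpu_per_node
        * j.failure_probability;
    let value = recompute_saved + j.preempt_value + j.backlog_relief;

    value > cost
}
```

With this shape, a storage outage (`write_bw_gb_per_s → 0`) drives `write_time` toward infinity and naturally suppresses checkpointing, as described under Storage Outage Behavior.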
Decision Dynamics
| Scenario | backlog | preempt demand | node health | Decision |
|---|---|---|---|---|
| Quiet system, healthy nodes | Low | Low | Good | Checkpoint infrequently (every 6h) |
| Deep queue, sensitive job waiting | High | High | Good | Checkpoint now, preempt |
| Node ECC errors increasing | Low | Low | Degrading | Checkpoint proactively, migrate |
| Large job nearing walltime | Low | Low | Good | Checkpoint for restart capability |
Application Protocol
Three Communication Modes
Applications opt into checkpoint coordination via one of three mechanisms:
1. Signal-based (legacy compatibility)
Node agent sends SIGUSR1 to the application's process group.
Application catches the signal, writes a checkpoint, and signals completion via a sentinel file.
Timeout: if no completion signal within checkpoint_timeout, assume non-checkpointable.
2. Shared memory flag (low-latency)
Node agent sets a flag in a shared memory region mapped at a well-known path.
Application polls the flag (or uses futex wait) and initiates checkpoint.
Completion: application clears the flag and sets a "done" flag.
Best for performance-sensitive applications that can't afford signal handler overhead.
3. gRPC callback (agent-aware applications)
Application registers a checkpoint endpoint with the node agent at startup.
Node agent calls the endpoint when checkpoint is requested.
Application responds with estimated completion time, then streams progress.
Most expressive: supports negotiation (application can request deferral).
Checkpoint Destinations
Checkpoints are written to a standard location:
s3://{tenant}/{project}/{allocation_id}/checkpoints/{checkpoint_id}/
Or, if NFS is preferred for POSIX-style checkpoint (e.g., MPI checkpoint/restart):
/scratch/{tenant}/{project}/{allocation_id}/checkpoints/{checkpoint_id}/
The checkpoint broker coordinates with the data plane to ensure bandwidth is available.
Non-Checkpointable Applications
If an application declares checkpoint: none or fails to respond to checkpoint hints:
- The allocation is marked as non-preemptible in the cost function
- It receives a penalty in the knapsack solver (ties up resources without flexibility)
- The scheduler avoids placing it on borrowed/elastic nodes
Fallback option: DMTCP (Distributed MultiThreaded Checkpointing) for transparent process-level checkpointing. Higher overhead, but works for unmodified applications.
Integration with Scheduler
The checkpoint broker runs as part of the scheduler plane, with access to:
- Running allocation state (from quorum)
- Node health telemetry (from eBPF/OpenCHAMI)
- Storage metrics (from VAST API)
- Queue state (from vCluster schedulers)
It evaluates the cost function continuously (every 30-60 seconds for each running allocation) and issues checkpoint hints when the threshold is crossed.
Storage Outage Behavior
When the checkpoint destination (VAST S3 or NFS) is unavailable:
- Detection: Checkpoint broker detects storage unavailability via failed write probes or VAST API health checks
- Immediate effect: All pending checkpoint requests are paused (not cancelled)
- Cost function adjustment: `storage_write_bandwidth` drops to 0, making `write_time(j)` infinite — the cost function naturally suppresses checkpoint decisions
- Running allocations: Continue running. They are effectively non-preemptible during the outage (no checkpoint possible)
- Preemption requests: If preemption is forced (e.g., sensitive claim), the victim receives SIGTERM without checkpoint. The allocation is marked `Failed` (not `Suspended`) since no checkpoint was written
- Recovery: When storage recovers, the broker re-evaluates all running allocations on the next cycle. Allocations with high `recompute_saved` value are prioritized for immediate checkpoint
- Alert: `lattice_checkpoint_storage_unavailable` gauge set to 1; critical alert fired
Edge Cases
Reactive Allocation Checkpointing
Reactive (autoscaling) allocations pose unique challenges for the checkpoint broker:
- Variable node count. The checkpoint size estimate (`GPU memory × node count`) changes as the allocation scales. The broker re-evaluates cost on each cycle using the current node count.
- Scale-down as implicit checkpoint trigger. When the scheduler decides to scale down a reactive allocation, it triggers a checkpoint on the nodes being released before removing them from the allocation. This ensures state is preserved.
- Recommendation: For reactive allocations with complex distributed state, use `checkpoint: manual` and implement application-level checkpoint coordination. The broker’s automatic checkpointing works best for static-size allocations where checkpoint size is predictable.
Walltime vs Checkpoint Race
When an allocation’s walltime expires while a checkpoint is in progress:
- Walltime takes priority. The walltime timer is not extended to accommodate the checkpoint.
- If the checkpoint completes before the SIGTERM grace period expires, the checkpoint is usable for restart.
- If the checkpoint is still in progress when SIGKILL is sent, the checkpoint is considered incomplete and is not used for restart. The allocation is marked `Failed` with reason `walltime_exceeded`.
- To avoid this race, schedule checkpoints proactively as walltime approaches (the `recompute_saved` value naturally increases near walltime expiration).
Cross-References
- scheduling-algorithm.md — f₈ checkpoint_efficiency in the cost function
- preemption.md — Preemption sequence and checkpoint timeout handling
- failure-modes.md — Checkpoint broker crash recovery
- telemetry.md — Node health signals (ECC errors) feeding into checkpoint urgency
- sensitive-workloads.md — Sensitive allocations and checkpoint constraints
- data-staging.md — Storage bandwidth sharing with checkpoint writes
Autoscaling
Design Principle
Simple, metric-driven scaling. No complex control theory. The scheduler adjusts node count within bounds based on a single metric threshold. Users set bounds, the scheduler respects them.
Reactive Lifecycle
Defined in crates/lattice-common/src/types.rs (LifecycleType::Reactive):
Reactive {
min_nodes: u32,
max_nodes: u32,
metric: String, // e.g., "gpu_utilization", "queue_depth", "request_rate"
target: String, // e.g., "0.80" (80% GPU utilization target)
}
Reactive allocations are unbounded in duration (like services) but have variable node count.
Scaling Loop
- Start: Allocation begins with `min_nodes`
- Evaluate: Every evaluation interval (default: 60s), the scheduler queries TSDB for the allocation’s metric
- Scale up: If metric > target for `scale_up_window` (default: 2 minutes):
  - Propose adding 1 node (conservative: avoid large jumps)
  - Quorum validates the node addition (ownership transfer)
  - Node agent starts processes on the new node
  - Repeat until metric ≤ target or `max_nodes` reached
- Scale down: If metric < target × `scale_down_threshold` (default: 0.5) for `scale_down_window` (default: 5 minutes):
  - Propose removing 1 node (least-loaded or most-recently-added)
  - Graceful drain: stop sending work to the node, wait for in-flight requests
  - Node released back to scheduling pool
  - Repeat until metric ≥ target × `scale_down_threshold` or `min_nodes` reached
- Cooldown: After any scale event, no further scaling for `cooldown_period` (default: 3 minutes)
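The decision step of the loop can be sketched as a pure function over elapsed-time counters. The parameter names are illustrative; the constants match the defaults above:

```rust
#[derive(Debug, PartialEq)]
enum ScaleAction {
    Up,
    Down,
    None,
}

/// One evaluation of the reactive scaling loop. `above_secs` / `below_secs`
/// track how long the metric has been above target, or below
/// target × scale_down_threshold; `since_last_event_secs` enforces cooldown.
fn evaluate(
    nodes: u32,
    min_nodes: u32,
    max_nodes: u32,
    above_secs: u64,
    below_secs: u64,
    since_last_event_secs: u64,
) -> ScaleAction {
    const SCALE_UP_WINDOW: u64 = 120;   // 2 minutes
    const SCALE_DOWN_WINDOW: u64 = 300; // 5 minutes
    const COOLDOWN: u64 = 180;          // 3 minutes
    if since_last_event_secs < COOLDOWN {
        return ScaleAction::None; // no scaling during cooldown
    }
    if above_secs >= SCALE_UP_WINDOW && nodes < max_nodes {
        return ScaleAction::Up; // add one node, conservatively
    }
    if below_secs >= SCALE_DOWN_WINDOW && nodes > min_nodes {
        return ScaleAction::Down; // remove one node after graceful drain
    }
    ScaleAction::None
}
```

The asymmetric windows and the cooldown give the loop hysteresis: short metric spikes and dips do not trigger scale events.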
Why Conservative Scaling
- Adding 1 node at a time prevents overshooting (workloads often have non-linear resource curves)
- Scale-down windows are longer than scale-up windows (scale down is more disruptive)
- Cooldown prevents oscillation from metric noise
Built-In Scaling Metrics
| Metric | Description | Source | Best For |
|---|---|---|---|
| `gpu_utilization` | Mean GPU SM occupancy across allocation | eBPF / NVML | ML inference services |
| `cpu_utilization` | Mean CPU usage across allocation | eBPF | CPU-bound services |
| `request_rate` | Inbound requests per second | eBPF (network flow tracking) | API/web services |
| `queue_depth` | Pending request queue length | Application-reported or eBPF | Batch-processing services |
Custom Metrics
Any metric available in TSDB can be used for scaling by specifying a label matcher:
lifecycle:
type: reactive
min_nodes: 2
max_nodes: 20
metric: "custom_metric{job='my-inference'}"
target: "100" # e.g., 100 pending requests
The scheduler queries TSDB with the label matcher scoped to the allocation’s nodes.
Configuration Defaults
| Parameter | Default | Configurable |
|---|---|---|
| `evaluation_interval` | 60s | Per allocation |
| `scale_up_window` | 2 minutes | Per allocation |
| `scale_down_window` | 5 minutes | Per allocation |
| `scale_down_threshold` | 0.5 (50% of target) | Per allocation |
| `cooldown_period` | 3 minutes | Per allocation |
Quota Interaction
Scale-up respects the tenant’s `max_nodes` hard quota (cross-ref: quota-enforcement.md):
- Before proposing a scale-up, the scheduler checks if the tenant has remaining node capacity
- If `max_nodes` would be exceeded: scale-up is a no-op, allocation continues at current size
- No error raised — the allocation operates within its current bounds
- If quota is later increased (e.g., via Waldur), scaling resumes automatically
Preemption Interaction
Borrowed nodes (from elastic resource sharing) are valid targets for reactive scaling, but they carry a preemption risk:
- Scaling onto borrowed nodes gives the allocation more capacity temporarily
- If the home vCluster reclaims the node: reactive allocation scales down gracefully
- Minimum guarantee: the `min_nodes` baseline always comes from the allocation’s home vCluster (not borrowed)
Error Handling
Metric Query Failure (TSDB Down)
If the scheduler cannot query TSDB for the scaling metric:
- First failure: skip this evaluation cycle, log warning
- Consecutive failures (3+): alert raised (`lattice_autoscaling_metric_query_failures_total`)
- No scaling decisions made while metric is unavailable — allocation stays at current size
- When TSDB recovers: normal evaluation resumes on next cycle
The allocation is never scaled blindly. No metric = no action.
Scale-Up Proposal Rejected
If the quorum rejects a scale-up proposal (e.g., race condition with another vCluster):
- Retry on next evaluation cycle (60s later)
- Maximum 3 consecutive retries for the same scale-up
- After 3 rejections: log warning, back off for 2 cooldown periods
- Scale-up resumes when conditions change (nodes become available)
Scale-Down During Borrowed Node Reclamation
If a borrowed node is reclaimed by the home vCluster while the reactive allocation is scaling down:
- The reclamation takes priority (home vCluster always wins)
- The reactive allocation loses the node immediately (graceful drain attempted, but not guaranteed)
- If this drops below `min_nodes`: scheduler attempts to acquire a replacement node from the home vCluster
- If no replacement available: allocation operates below `min_nodes` temporarily, alert raised
Metric Oscillation
If the metric oscillates around the target, causing repeated scale-up/scale-down:
- The cooldown period (default: 3 minutes) prevents rapid oscillation
- If scale events alternate for more than 5 cycles: alert raised suggesting the user adjust their target or increase cooldown
- No automatic target adjustment — the user must update the configuration
Preemption During Scale-Up
If a reactive allocation is scaling up while simultaneously being preempted (e.g., a higher-priority job arrives):
- The preemption takes priority — the checkpoint/preemption sequence begins
- Any in-flight scale-up proposals are cancelled (quorum rejects proposals for allocations in `Checkpointing` state)
- After preemption completes: the allocation is suspended with its last stable node count
- When resumed: scaling restarts from `min_nodes`, re-evaluating the metric from scratch
- The cooldown period applies after resume to prevent immediate re-scaling
If preemption and scale-up proposals race at the quorum:
- The quorum serializes all proposals — one wins, the other is rejected
- The rejected proposal is retried on the next scheduling cycle (if still applicable)
Cross-References
- scheduling-algorithm.md — Reactive allocations scored by the knapsack solver like any allocation
- quota-enforcement.md — Hard quota limits on scale-up
- telemetry.md — Metric sources for scaling decisions
- preemption.md — Borrowed node reclamation
- types.rs — `LifecycleType::Reactive` definition
Quota Enforcement
Design Principle
Two-tier enforcement matching the two consistency domains (ADR-004). Hard limits enforced at the quorum (strong consistency, cannot be violated). Soft limits enforced at the scheduler (eventual consistency, may temporarily overshoot, self-correcting).
Hard Quotas (Quorum-Enforced)
Hard quotas are checked during Raft proposal validation, before commit. A proposal that would violate a hard quota is rejected immediately.
| Quota | Scope | Enforcement |
|---|---|---|
| `max_nodes` | Per tenant | Quorum rejects allocation proposals that would exceed the tenant’s maximum concurrent node count |
| `max_concurrent_allocations` | Per tenant | Quorum rejects proposals that would exceed the tenant’s maximum number of running allocations |
| `sensitive_pool_size` | System-wide | Hard limit on the number of nodes that can be claimed for sensitive use |
Guarantees: These quotas cannot be violated, even momentarily. Two vCluster schedulers proposing conflicting allocations that together would exceed a hard quota: the first committed wins, the second is rejected and retried next cycle.
Error handling: Hard quota rejection returns a clear error to the user:
allocation rejected: tenant "physics" would exceed max_nodes quota (current: 195, requested: 10, limit: 200)
Soft Quotas (Scheduler-Level)
Soft quotas are tracked with eventual consistency. They influence scheduling decisions through the cost function but do not hard-block allocations.
GPU-Hours Budget
gpu_hours_budget: 100000 # per billing period (month)
gpu_hours_used: 87500 # eventually consistent counter
Behavior: The scheduler uses remaining budget as a penalty in the cost function. As budget depletes:
- 0-80% used: no penalty
- 80-100% used: increasing penalty (lower scheduling priority)
- Over 100% used: very low score (effective starvation for new allocations, but not hard rejection)
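The tiers above suggest a piecewise penalty curve. A sketch (the weight actually applied in the cost function is a scheduler parameter; the 0-to-1 shape here is illustrative):

```rust
/// Budget penalty as a function of utilization fraction (used / budget),
/// following the tiers described above.
fn budget_penalty(utilization: f64) -> f64 {
    if utilization <= 0.8 {
        0.0 // 0-80% used: no penalty
    } else if utilization <= 1.0 {
        (utilization - 0.8) / 0.2 // 80-100%: penalty ramps from 0 to 1
    } else {
        1.0 // over 100%: maximum penalty (soft starvation, not rejection)
    }
}
```

Because the penalty is continuous rather than a hard cutoff, a tenant approaching its budget is gradually deprioritized instead of abruptly blocked.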
Consistency window: Up to ~30 seconds of lag. Acceptable because: (a) scheduling cycle is 5-30s, (b) over-allocation is self-correcting via fair-share scoring, (c) GPU-hours tracking is for billing, not safety.
Fair Share Target
fair_share_target: 0.15 # tenant should get ~15% of system capacity
Behavior: Feeds into f₃ (fair_share_deficit) in the cost function. Tenants below their share get priority; tenants above are deprioritized. Not a hard ceiling — a tenant can use more than their share when resources are idle.
Burst Allowance
burst_allowance: 1.5 # allow up to 150% of fair share when resources idle
Behavior: Allows temporary over-allocation when the system has spare capacity. When demand increases and other tenants need their share, burst allocations are the first candidates for preemption (via checkpoint cost model).
Internal Budget Ledger
When Waldur is unavailable or not configured, the scheduler computes GPU-hours consumption internally from allocation records in the quorum. This replaces the previously empty budget_utilization map in the cost function.
Computation
Two metrics are tracked:
node_hours_used = Σ (end_time - started_at).hours × assigned_nodes.len()
gpu_hours_used = Σ (end_time - started_at).hours × Σ gpu_count_per_node
- For running allocations: `end_time = now`
- For completed/failed/cancelled: `end_time = completed_at`
- Only allocations within the configured `budget_period_days` (default: 90 days, rolling window) are included
- Node GPU count looked up from current hardware inventory; unknown nodes default to 1 GPU
- Node-hours is the universal metric (works for CPU-only and GPU nodes)
- When both `gpu_hours_budget` and `node_hours_budget` are set, the worse (higher) utilization fraction drives the budget penalty
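The two ledger metrics can be sketched as a fold over allocation records. The record shape here is hypothetical, and the `budget_period_days` window filter is omitted for brevity:

```rust
/// One allocation record as the internal ledger sees it (illustrative shape).
struct AllocRecord {
    started_at_h: f64,       // start time in hours (simplified timestamp)
    ended_at_h: Option<f64>, // None = still running
    node_gpus: Vec<u32>,     // GPU count per assigned node (unknown nodes: 1)
}

/// Returns (node_hours_used, gpu_hours_used) per the rules above.
fn ledger(records: &[AllocRecord], now_h: f64) -> (f64, f64) {
    let mut node_hours = 0.0;
    let mut gpu_hours = 0.0;
    for r in records {
        // Running allocations: end_time = now; otherwise completed_at.
        let end = r.ended_at_h.unwrap_or(now_h);
        let hours = (end - r.started_at_h).max(0.0);
        node_hours += hours * r.node_gpus.len() as f64;
        gpu_hours += hours * r.node_gpus.iter().map(|g| *g as f64).sum::<f64>();
    }
    (node_hours, gpu_hours)
}
```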
Budget Period
Configurable via `scheduling.budget_period_days` (default: 90). This is a rolling window, not a calendar-aligned reset. Calendar-aligned resets require Waldur to push new `gpu_hours_budget` values at period boundaries.
Waldur Override
When Waldur is available, its remaining_budget() response takes precedence over the internal ledger. When Waldur is unavailable (transient failure), the internal ledger provides fallback data so budget enforcement continues.
API Access
- gRPC: `GetTenantUsage` / `GetUserUsage` RPCs in AdminService
- REST: `GET /api/v1/tenants/{id}/usage?days=90` / `GET /api/v1/usage?user=alice&days=90`
- Rust SDK: `client.tenant_usage("physics", 90)` / `client.user_usage("alice", 90)`
- CLI: `lattice usage --tenant physics` / `lattice usage` (uses gRPC)
Exhausted Budget Behavior
GPU-Hours Budget Exhausted
- New allocations for this tenant receive a very low scheduling score (effective starvation, not hard rejection)
- Tenant admin notified via API event
- Running allocations continue to completion (no preemption for budget reasons)
- If Waldur integration enabled: Waldur can update the budget (cross-ref: accounting.md)
- Tenant admin can request budget increase through Waldur self-service portal
Max Nodes Exhausted
- Hard rejection at quorum — clear error returned to user
- User must wait for running allocations to complete or cancel existing allocations
- No waiting queue for hard-quota-blocked allocations (submit is rejected, user resubmits when capacity is available)
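The contrast between the two exhaustion modes (soft starvation for budgets, hard rejection for node caps) can be sketched as below. The names `admit`, `Admission`, and `score_multiplier` are illustrative, not the lattice-scheduler API.

```rust
struct TenantQuota {
    max_nodes: u32,        // hard quota, quorum-enforced
    gpu_hours_budget: f64, // soft quota, drives the cost function penalty
}

enum Admission {
    Accept { score_multiplier: f64 },
    Reject { reason: String },
}

fn admit(q: &TenantQuota, nodes_in_use: u32, requested_nodes: u32, gpu_hours_used: f64) -> Admission {
    // Hard quota: reject outright with a clear error.
    if nodes_in_use + requested_nodes > q.max_nodes {
        return Admission::Reject {
            reason: format!(
                "exceeds max_nodes quota ({} > {})",
                nodes_in_use + requested_nodes,
                q.max_nodes
            ),
        };
    }
    // Soft quota: never reject; drive the scheduling score toward zero
    // instead (effective starvation while the budget is exhausted).
    let score_multiplier = if gpu_hours_used >= q.gpu_hours_budget { 0.01 } else { 1.0 };
    Admission::Accept { score_multiplier }
}
```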
Quota Update Flow
Administrative Update
System admin updates tenant quotas via CLI or API:
# CLI (uses gRPC UpdateTenant RPC)
lattice admin tenant update physics \
--max-nodes 250 \
--max-concurrent-allocations 50 \
--gpu-hours-budget 150000 \
--node-hours-budget 500000
# Python SDK
await client.update_tenant("physics", {
"max_nodes": 250,
"max_concurrent_allocations": 50,
"gpu_hours_budget": 150000,
"node_hours_budget": 500000,
})
# REST
PUT /api/v1/tenants/{id}
{
"max_nodes": 250,
"max_concurrent_allocations": 50,
"gpu_hours_budget": 150000,
"node_hours_budget": 500000
}
Hard quota changes are Raft-committed (immediate effect). Soft quota changes propagate eventually.
Waldur-Driven Update
When Waldur integration is enabled, Waldur can push quota changes:
- Waldur determines budget exhaustion or contract change
- Waldur calls lattice-api: `PUT /api/v1/tenants/{id}` (authenticated with a Waldur service token)
- Hard quotas are committed via Raft; soft quotas are propagated to schedulers
- Reducing `max_nodes` below current usage does not preempt running allocations — it prevents new ones
Quota Reduction While Allocations Are Running
When a quota is reduced below current usage (e.g., Waldur reduces max_nodes from 200 to 100, but tenant is currently using 150):
Hard Quota Reduction
- Running allocations are not preempted. The reduced quota only blocks new allocations.
- Current usage (150) exceeds new limit (100): all new proposals for this tenant are rejected until usage drops below 100.
- The user receives a clear error on new submissions:

      allocation rejected: tenant "physics" exceeds max_nodes quota
      Current usage: 150 nodes
      New limit: 100 nodes
      Hint: Wait for running allocations to complete, or contact your tenant admin.

- As running allocations complete naturally, usage drops; once usage falls below the new limit, new allocations are accepted again.
Soft Quota Reduction
- Reduced `gpu_hours_budget`: the scheduling score penalty increases. Pending allocations get lower priority but are not rejected.
- Reduced `fair_share_target`: the tenant is deprioritized but can still schedule when resources are idle.
- No immediate impact on running allocations.
Pending Allocations
Allocations that are Pending (in the scheduler queue but not yet committed) when a hard quota is reduced:
- They are not retroactively cancelled.
- If proposed to quorum, the proposal is rejected due to the new quota.
- The scheduler will not re-propose them until quota headroom exists.
- The user sees the allocation stuck in the `Pending` state. `lattice status` shows the reason: `"waiting for quota headroom"`.
Sensitive Quota Considerations
Sensitive quotas are always hard quotas:
- `sensitive_pool_size` — system-wide hard limit, quorum-enforced
- Sensitive node claims always go through the quorum (strong consistency)
- No soft/eventual quota mechanisms for sensitive resources
- Idle sensitive nodes (claimed but unused) are not reclaimable — they remain allocated to the claiming user
Cross-ref: sensitive-workloads.md for the full sensitive workload model.
Cross-References
- scheduling-algorithm.md — f₃ fair_share_deficit uses soft quota targets
- accounting.md — Waldur quota feedback loop
- sensitive-workloads.md — Sensitive quotas are always hard
- autoscaling.md — Scale-up respects hard quota limits
GPU Topology
Design Principle
Vendor-neutral abstraction over GPU interconnect topologies. The scheduler reasons about “GPU domains” and “link bandwidth,” not vendor-specific terms. Node agents discover and report topology; the scheduler uses it for placement decisions.
Vendor Support
| Vendor | GPU Family | Interconnect | Topology Discovery | Metrics Collection |
|---|---|---|---|---|
| NVIDIA | H100, GH200, B200 | NVLink, NVSwitch | NVML (nvmlDeviceGetTopologyCommonAncestor) | NVML / DCGM |
| AMD | MI300X, MI300A | Infinity Fabric, xGMI | ROCm-SMI (rsmi_topo_get_link_type) | ROCm-SMI / rocm_smi_lib |
Additional vendors can be supported by implementing the topology discovery trait in the node agent.
Abstraction Model
struct GpuTopology {
    gpus: Vec<GpuDevice>,
    links: Vec<GpuLink>,
    nic_affinity: Map<GpuIndex, NicId>, // which NIC is closest to which GPU
}

struct GpuDevice {
    index: u32,
    vendor: GpuVendor,          // Nvidia | Amd
    model: String,              // "H100", "MI300X"
    memory_bytes: u64,
    compute_capability: String, // CUDA CC or GCN/CDNA arch
}

struct GpuLink {
    gpu_a: u32,
    gpu_b: u32,
    link_type: GpuLinkType, // NvLink | NvSwitch | InfinityFabric | Xgmi | Pcie
    bandwidth_gbps: f64,
}
The node agent populates this structure at startup using vendor-specific APIs and reports it alongside node capabilities and health data.
Link Types and Bandwidth
| Link Type | Typical Bandwidth | Latency | Notes |
|---|---|---|---|
| NVLink (H100) | 450 GB/s per link | ~1 μs | Direct GPU-to-GPU |
| NVSwitch (H100) | 900 GB/s all-to-all | ~1 μs | Full-bisection via switch |
| Infinity Fabric (MI300X) | 896 GB/s aggregate | ~1 μs | XGMI links between dies |
| PCIe Gen5 | 64 GB/s | ~2-5 μs | Fallback, cross-socket |
| PCIe Gen4 | 32 GB/s | ~2-5 μs | Older systems |
Actual bandwidth is discovered at runtime via vendor APIs, not hardcoded.
Intra-Node Scheduling Impact
ADR-007 defines “full-node scheduling with intra-node packing.” GPU topology informs the intra-node packing:
Multi-GPU Jobs Within a Node
For allocations requesting fewer GPUs than the node has, the node agent packs on GPUs with direct high-bandwidth links:
- Prefer GPUs connected via NVLink/NVSwitch/InfinityFabric (direct high-bandwidth)
- Avoid splitting across PCIe domains when high-bandwidth links are available
- For NCCL/RCCL workloads, contiguous GPU groups minimize communication overhead
Multi-Node Jobs
For allocations spanning multiple nodes:
- Prefer nodes where GPU-to-NIC affinity matches — GPUs closest to the NIC used for inter-node communication (Slingshot/Ultra Ethernet)
- NIC affinity reduces PCIe hops for inter-node traffic, improving MPI/NCCL allreduce performance
- Combined with f₄ (topology_fitness): inter-node placement minimizes dragonfly group span, intra-node placement maximizes link bandwidth
Selection Algorithm
For a k-GPU allocation on a node with n GPUs:
1. Build a graph of GPUs weighted by link bandwidth
2. Find the k-GPU subgraph with maximum minimum link bandwidth
3. If multiple subgraphs tie: prefer the one with best NIC affinity
4. Assign allocation to selected GPUs via cgroup/device isolation
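On typical node sizes (n ≤ 8 GPUs, so at most 256 subsets) the selection can be done by brute force. A sketch under that assumption, with the NIC-affinity tie-break of step 3 omitted for brevity; `bw` is a symmetric GPU-to-GPU bandwidth matrix:

```rust
// Weakest link inside a candidate subset (the quantity we maximize).
fn min_internal_bandwidth(subset: &[usize], bw: &[Vec<f64>]) -> f64 {
    let mut min_bw = f64::INFINITY;
    for (i, &a) in subset.iter().enumerate() {
        for &b in &subset[i + 1..] {
            min_bw = min_bw.min(bw[a][b]);
        }
    }
    min_bw
}

// Enumerate all k-GPU subsets via bitmask; keep the subset whose
// weakest internal link is strongest (max-min bandwidth).
fn select_gpus(n: usize, k: usize, bw: &[Vec<f64>]) -> Vec<usize> {
    let mut best: (f64, Vec<usize>) = (f64::NEG_INFINITY, vec![]);
    for mask in 0u32..(1u32 << n) {
        if mask.count_ones() as usize != k {
            continue;
        }
        let subset: Vec<usize> = (0..n).filter(|&i| mask & (1u32 << i) != 0).collect();
        let score = min_internal_bandwidth(&subset, bw);
        if score > best.0 {
            best = (score, subset);
        }
    }
    best.1
}
```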
MIG / GPU Partitioning
NVIDIA Multi-Instance GPU (MIG)
H100 can partition into up to 7 MIG instances, each with isolated memory, cache, and compute:
| MIG Profile | GPU Memory | SMs | Use Case |
|---|---|---|---|
| 1g.10gb | 10 GB | 1/7 | Interactive, notebooks |
| 2g.20gb | 20 GB | 2/7 | Small inference |
| 3g.40gb | 40 GB | 3/7 | Medium training |
| 4g.40gb | 40 GB | 4/7 | Medium training |
| 7g.80gb | 80 GB | 7/7 | Full GPU (no partitioning) |
MIG is relevant for interactive/small-job vClusters where intra-node packing is used. Each MIG instance is a separate schedulable GPU resource.
AMD
No equivalent partitioning exists as of the MI300 generation. MI300X allocations always get full GPU dies.
Scheduler Integration
- MIG instances are reported as individual `GpuDevice` entries with reduced `memory_bytes` and a `partitioned: true` flag
- The scheduler treats MIG instances like smaller GPUs — no special MIG logic in the knapsack solver
- MIG configuration is managed by the node agent, not the scheduler (reconfiguration requires idle GPU)
Integration with Cost Function
GPU topology extends f₄ (topology_fitness) to include intra-node topology quality:
f₄(j) = α · inter_node_fitness(j) + (1-α) · intra_node_fitness(j)
inter_node_fitness = 1.0 - (groups_needed / max_groups_available) // existing
intra_node_fitness = min_link_bandwidth(selected_gpus) / max_link_bandwidth(node)
α = 0.0 for single-node jobs (only intra-node topology matters)
α = 0.7 for multi-node jobs (inter-node dominates but intra-node still relevant)
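The blend above transcribes directly into code. This is a sketch, not the actual lattice-scheduler implementation; α is passed in per the vCluster profile:

```rust
// Existing inter-node term: fewer dragonfly groups spanned is better.
fn inter_node_fitness(groups_needed: f64, max_groups_available: f64) -> f64 {
    1.0 - groups_needed / max_groups_available
}

// Intra-node term: ratio of the selected GPUs' weakest link to the
// node's best link.
fn intra_node_fitness(min_link_bw_gbps: f64, max_link_bw_gbps: f64) -> f64 {
    min_link_bw_gbps / max_link_bw_gbps
}

// f4 = alpha * inter + (1 - alpha) * intra
fn topology_fitness(inter: f64, intra: f64, alpha: f64) -> f64 {
    alpha * inter + (1.0 - alpha) * intra
}
```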
The node agent reports GpuTopology alongside capabilities and health on every heartbeat (topology is static, but health/utilization changes).
Conformance Interaction
GPU driver version and firmware version are part of the conformance fingerprint (cross-ref: conformance.md). For multi-node GPU jobs, mismatched drivers cause NCCL/RCCL hangs. The conformance fitness factor (f₉) ensures nodes in a multi-GPU allocation share the same driver stack.
Cross-References
- scheduling-algorithm.md — f₄ topology_fitness, f₉ conformance_fitness
- conformance.md — GPU driver version in conformance fingerprint
- telemetry.md — GPU metrics collection (NVML/DCGM, ROCm-SMI)
Memory Topology
Design Principle
Vendor-neutral abstraction over CPU-memory-GPU memory topology. The scheduler reasons about “memory domains” and “interconnect bandwidth,” not vendor-specific terms like NUMA node IDs or NVLink-C2C. Node agents discover and report memory topology; the scheduler uses it for placement decisions and memory policy configuration.
This complements gpu-topology.md, which models GPU interconnects. Memory topology models the CPU-memory-GPU memory hierarchy: NUMA domains, unified memory architectures, and CXL-attached memory tiers.
Memory Domain Types
| Type | Hardware Example | Characteristics | Discovery |
|---|---|---|---|
| Discrete NUMA | Multi-socket Intel Xeon, AMD EPYC | Separate DRAM per socket, asymmetric access latencies | /sys/devices/system/node/ |
| Unified CPU-GPU | NVIDIA Grace Hopper GH200 | NVLink-C2C coherent, single address space across CPU and GPU | NVML + /sys/devices/system/node/ |
| APU / Unified Die | AMD MI300A | CPU + GPU on same package, shared HBM3 pool | ROCm-SMI + hwloc |
| CXL-Attached | CXL Type 3 memory expanders | Pooled or device-attached memory, higher latency than local DRAM | /sys/bus/cxl/ |
| Single-Socket | Single-socket servers | Trivial: one NUMA node, uniform access | /sys/devices/system/node/ |
Abstraction Model
struct MemoryTopology {
    domains: Vec<MemoryDomain>,
    interconnects: Vec<MemoryInterconnect>,
    total_capacity_bytes: u64,
}

struct MemoryDomain {
    id: u32,
    domain_type: MemoryDomainType, // Dram | Hbm | CxlAttached | Unified
    capacity_bytes: u64,
    numa_node: Option<u32>,  // Linux NUMA node ID, if applicable
    attached_cpus: Vec<u32>, // CPU IDs with local access
    attached_gpus: Vec<u32>, // GPU indices with local/coherent access
}

struct MemoryInterconnect {
    domain_a: u32,
    domain_b: u32,
    link_type: MemoryLinkType, // NumaLink | CxlSwitch | CoherentFabric
    bandwidth_gbps: f64,
    latency_ns: u64,
}
enum MemoryDomainType { Dram, Hbm, CxlAttached, Unified }
enum MemoryLinkType { NumaLink, CxlSwitch, CoherentFabric }
The node agent populates this structure at startup alongside GpuTopology and reports it with node capabilities and health data.
Interconnect Bandwidth and Latency
| Link Type | Typical Bandwidth | Typical Latency | Notes |
|---|---|---|---|
| Local DRAM access | 50-100 GB/s per channel | ~80 ns | Same-socket, same NUMA node |
| Remote NUMA (UPI/xGMI) | 20-40 GB/s | ~150-300 ns | Cross-socket, 1.5-3x local latency |
| NVLink-C2C (GH200) | 900 GB/s | ~100 ns | CPU-GPU coherent fabric |
| Infinity Fabric (MI300A) | 896 GB/s aggregate | ~100 ns | On-package CPU-GPU interconnect |
| CXL 2.0 (Type 3) | 32-64 GB/s | ~200-400 ns | Memory expander, higher latency |
| PCIe Gen5 (discrete GPU) | 64 GB/s | ~1-2 us | Non-coherent, requires explicit transfer |
Actual bandwidth and latency are discovered at runtime, not hardcoded.
Superchip Architectures
NVIDIA Grace Hopper (GH200)
Grace CPU + Hopper GPU connected via NVLink-C2C (900 GB/s bidirectional). The CPU and GPU share a single coherent address space — no explicit cudaMemcpy required for data movement.
┌────────────────────────────────────────────────────┐
│ GH200 Superchip │
│ │
│ ┌─────────────────┐ NVLink-C2C ┌─────────────┐ │
│ │ Grace CPU │◄──900 GB/s───►│ Hopper GPU │ │
│ │ 72 cores │ coherent │ 80 GB HBM3 │ │
│ │ 512 GB LPDDR5X │ │ │ │
│ └─────────────────┘ └─────────────┘ │
│ │
│ Single coherent address space (CPU + GPU) │
│ → Maps to one Unified MemoryDomain │
└────────────────────────────────────────────────────┘
Mapping to abstraction:
- One `MemoryDomain { type: Unified }` spanning CPU LPDDR5X + GPU HBM3
- `attached_cpus`: all Grace cores; `attached_gpus`: [Hopper GPU index]
- One `MemoryInterconnect { type: CoherentFabric, bandwidth: 900 }` between CPU and GPU sub-domains
AMD Instinct MI300A
APU with CDNA 3 GPU + Zen 4 CPU on the same package, sharing HBM3 memory pool. No discrete CPU DRAM — all memory is HBM3 accessible by both CPU and GPU.
┌──────────────────────────────────────────────────┐
│ MI300A Package │
│ │
│ ┌─────────────┐ Infinity ┌────────────────┐ │
│ │ Zen 4 CPU │ ◄──Fabric──► │ CDNA 3 GPU │ │
│ │ 24 cores │ 896 GB/s │ 6 XCDs │ │
│ └──────┬──────┘ └───────┬────────┘ │
│ │ │ │
│ └──────┐ ┌───────────┘ │
│ ▼ ▼ │
│ ┌─────────────────────┐ │
│ │ Shared HBM3 Pool │ │
│ │ 128 GB │ │
│ └─────────────────────┘ │
│ │
│ → Maps to one Unified MemoryDomain │
└──────────────────────────────────────────────────┘
Mapping to abstraction:
- One `MemoryDomain { type: Unified }` for the shared HBM3 pool
- `attached_cpus`: all Zen 4 cores; `attached_gpus`: [MI300A GPU index]
- The internal Infinity Fabric interconnect is not separately modeled (on-package, always present)
Discovery
The node agent discovers memory topology at startup using platform-specific sources:
| Source | What It Provides | Platform |
|---|---|---|
| `/sys/devices/system/node/` | NUMA node count, CPU-to-node mapping, memory per node | Linux (all) |
| `numactl --hardware` | NUMA distances (latency matrix between nodes) | Linux (all) |
| `hwloc` | Portable topology discovery, cache hierarchy, PCI locality | Linux (all) |
| NVML | GPU-to-NUMA affinity, NVLink-C2C detection (GH200) | NVIDIA GPUs |
| ROCm-SMI | GPU-to-NUMA affinity, MI300A detection | AMD GPUs |
| `/sys/bus/cxl/` | CXL device enumeration, memory regions, interleave config | CXL-capable systems |
Superchip Detection
GH200 and MI300A superchips are identified by GPU model string during GPU discovery (cross-ref: gpu-topology.md). When detected:
- The node agent queries the coherent memory size via vendor API (NVML for GH200, ROCm-SMI for MI300A)
- NUMA nodes associated with both the CPU and the GPU are merged into a single `Unified` domain
- The coherent interconnect bandwidth is reported as a `CoherentFabric` link
Discovery Fallback
If vendor APIs are unavailable (e.g., driver not loaded), the node agent falls back to hwloc for topology and reports Dram domains only. GPU memory domains are still reported via the GPU topology path but without coherent interconnect metadata.
Scheduling Impact
Extending f₄ (topology_fitness)
Memory topology extends the intra-node component of f₄ alongside GPU topology:
intra_node_fitness = β · gpu_link_fitness + (1-β) · memory_locality_fitness
memory_locality_fitness(j, selected_nodes) =
average over selected nodes of:
fraction of allocation's CPUs and GPUs in the same memory domain
β = 0.7 for GPU-heavy workloads (GPU interconnect dominates)
β = 0.3 for CPU-heavy workloads with GPU offload (memory locality dominates)
β = 0.5 default
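A sketch of the blend and of one plausible way to compute the locality term (the fraction of the allocation's CPUs and GPUs landing in a single memory domain). The domain-ID inputs are illustrative; the real scheduler works from `MemoryTopology`:

```rust
use std::collections::HashMap;

// Fraction of the allocation's CPUs and GPUs in the best single domain.
fn memory_locality_fitness(cpu_domains: &[u32], gpu_domains: &[u32]) -> f64 {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &d in cpu_domains.iter().chain(gpu_domains) {
        *counts.entry(d).or_insert(0) += 1;
    }
    let total = cpu_domains.len() + gpu_domains.len();
    if total == 0 {
        return 1.0; // nothing to place: trivially local
    }
    let best = counts.values().copied().max().unwrap_or(0);
    best as f64 / total as f64
}

// intra_node_fitness = beta * gpu_link + (1 - beta) * memory_locality
fn intra_node_fitness(gpu_link: f64, mem_locality: f64, beta: f64) -> f64 {
    beta * gpu_link + (1.0 - beta) * mem_locality
}
```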
Constraint Hints
Allocations can specify memory topology preferences:
| Constraint | Effect |
|---|---|
| `prefer_same_numa` | Soft: prefer placing all CPUs in a single NUMA domain |
| `require_unified_memory` | Hard: only schedule on nodes with `Unified` memory domains (GH200, MI300A) |
| `prefer_local_memory` | Soft: prefer NUMA-local memory allocation policy |
| `allow_cxl_memory` | Opt-in: allow scheduling on CXL-expanded memory capacity |
Hard constraints filter nodes before the knapsack solver runs. Soft constraints contribute to memory_locality_fitness.
Intra-Node CPU-GPU Co-location
On discrete NUMA systems (e.g., dual-socket with 4 GPUs per socket), the node agent co-locates an allocation’s CPU cores and GPUs within the same NUMA domain when possible:
For an allocation requesting k CPUs and g GPUs on a multi-NUMA node:
1. Identify NUMA domains that have both free CPUs and GPUs with local affinity
2. Prefer the domain where GPU-to-NIC affinity is best (for inter-node traffic)
3. Assign CPUs and GPUs from the same domain via cgroup/cpuset
4. If the allocation spans domains: prefer domains connected by highest-bandwidth link
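Steps 1-3 can be sketched as a single-domain selection pass; the spanning fallback of step 4 is left to the caller. `NumaDomain` and its fields are illustrative, not the lattice-node-agent types:

```rust
struct NumaDomain {
    id: u32,
    free_cpus: u32,
    free_gpus: u32,
    nic_local: bool, // true if the inter-node NIC hangs off this domain
}

/// Pick a NUMA domain that can host k CPUs and g GPUs, preferring
/// NIC-local domains for inter-node traffic. Returns None if no single
/// domain fits (the caller then spans domains, preferring the
/// highest-bandwidth link between them).
fn pick_domain(domains: &[NumaDomain], k: u32, g: u32) -> Option<u32> {
    domains
        .iter()
        .filter(|d| d.free_cpus >= k && d.free_gpus >= g)
        .max_by_key(|d| (d.nic_local, d.free_gpus)) // NIC affinity first
        .map(|d| d.id)
}
```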
Memory Mapping Policies
The node agent configures memory allocation policy at allocation start via numactl (or equivalent). This is transparent to the user unless they specify a preference.
| Policy | numactl Flag | When Used |
|---|---|---|
| Local | `--localalloc` | Default: allocate on the NUMA node where the thread runs |
| Interleave | `--interleave=all` | Large shared datasets that all threads access equally |
| Preferred | `--preferred=<node>` | Pin to a specific NUMA node (for known data locality) |
| Bind | `--membind=<nodes>` | Strict: only allocate from the specified nodes (sensitive isolation) |
On unified memory architectures (GH200, MI300A), NUMA policy has reduced impact since CPU and GPU share the same memory pool. The node agent skips numactl configuration for allocations on unified nodes unless the user explicitly requests a policy.
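One plausible way the node agent could translate the policy table into numactl flags is sketched below. The `MemPolicy` enum is illustrative; the flags themselves are standard numactl options:

```rust
enum MemPolicy {
    Local,
    Interleave,
    Preferred(u32),
    Bind(Vec<u32>),
}

// Build the numactl arguments that precede the user command.
fn numactl_args(policy: &MemPolicy) -> Vec<String> {
    match policy {
        MemPolicy::Local => vec!["--localalloc".into()],
        MemPolicy::Interleave => vec!["--interleave=all".into()],
        MemPolicy::Preferred(node) => vec![format!("--preferred={}", node)],
        MemPolicy::Bind(nodes) => {
            let list: Vec<String> = nodes.iter().map(|n| n.to_string()).collect();
            vec![format!("--membind={}", list.join(","))]
        }
    }
}
```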
Allocation-Level Override
Users can specify memory policy in the allocation request:
resources:
cpus: 24
gpus: 1
memory_gb: 128
constraints:
memory_policy: interleave # optional: local | interleave | preferred | bind
require_unified_memory: true # optional: only unified architectures
CXL Memory Tiers
CXL Type 3 memory expanders add a new capacity tier: higher latency than local DRAM but lower cost per GB. The scheduler treats CXL memory as a separate resource dimension.
Capacity Model
Node memory capacity:
local_dram_bytes: 512 GB (fast, NUMA-local)
cxl_memory_bytes: 2 TB (slower, CXL-attached)
total_bytes: 2.5 TB
Allocation can request:
memory_gb: 256 # scheduler satisfies from local DRAM
memory_gb: 1024 # scheduler must use CXL tier (exceeds local DRAM)
memory_gb: 1024
allow_cxl_memory: true # explicit opt-in for CXL tier
Scheduling Rules
- By default, allocations are placed using local DRAM capacity only
- If `allow_cxl_memory: true`, CXL capacity is included in available memory
- Allocations requesting more memory than local DRAM are placed on CXL-capable nodes only when the constraint is set
- CXL memory appears as a separate `CxlAttached` domain in `MemoryTopology`
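The placement rules reduce to a simple capacity check per node. A sketch with illustrative types:

```rust
struct NodeMemory {
    local_dram_bytes: u64,
    cxl_memory_bytes: u64,
}

/// Can this node satisfy the memory request under the rules above?
fn fits(node: &NodeMemory, requested_bytes: u64, allow_cxl: bool) -> bool {
    let available = if allow_cxl {
        node.local_dram_bytes + node.cxl_memory_bytes // opt-in: count the CXL tier
    } else {
        node.local_dram_bytes // default: local DRAM only
    };
    requested_bytes <= available
}
```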
Cross-References
- gpu-topology.md — GPU interconnect topology, NIC affinity, intra-node GPU selection
- telemetry.md — NUMA locality metrics collection (eBPF), memory utilization
- scheduling-algorithm.md — f₄ topology_fitness, knapsack solver, constraint handling
- node-lifecycle.md — Node agent startup, health reporting, capability discovery
- conformance.md — Hardware configuration fingerprint (includes memory architecture)
Performance Tuning Guide
Design Principle
Tuning Lattice is primarily about tuning the cost function weights per vCluster. The RM-Replay simulator is the primary tool: capture production traces, replay with different weights, measure outcomes, deploy with confidence.
Cost Function Sensitivity
Weight Impact Matrix
Each cost function weight controls a trade-off. Increasing one weight reduces the influence of others:
| Weight Increased | Positive Effect | Negative Effect | When to Increase |
|---|---|---|---|
| w₁ (priority) | High-priority jobs scheduled faster | Low-priority jobs starve longer | Many priority levels with strict SLAs |
| w₂ (wait_time) | Better anti-starvation, fairer wait distribution | May schedule low-value jobs before high-value ones | Long tail of wait times |
| w₃ (fair_share) | Tenants get closer to contracted share | May reduce overall utilization (leaving resources idle) | Multi-tenant with strict fairness requirements |
| w₄ (topology) | Better placement, higher network performance | May increase wait time (holding out for ideal placement) | Network-sensitive workloads (NCCL, MPI allreduce) |
| w₅ (data_readiness) | Less I/O stall at job start | May delay jobs whose data isn’t pre-staged | Large-dataset workloads |
| w₆ (backlog) | System responds to queue pressure | May destabilize scheduling when queue fluctuates | Bursty submission patterns |
| w₇ (energy) | Lower electricity costs | Jobs may wait for cheap-energy windows | Time-flexible workloads, sites with TOU pricing |
| w₈ (checkpoint) | More flexible resource rebalancing | Overhead from frequent checkpointing | Preemption-heavy environments |
| w₉ (conformance) | Fewer driver-mismatch issues | Fewer candidate nodes (smaller conformance groups) | Multi-node GPU workloads |
Common Trade-offs
Throughput vs. Fairness (w₃):
- Low w₃ (0.05): maximize utilization — schedule whatever fits, regardless of tenant share
- High w₃ (0.35): enforce fairness — tenants below their share get priority even if it means idle resources
Typical compromise: w₃ = 0.15-0.25
Wait Time vs. Topology (w₂ vs. w₄):
- High w₂, low w₄: schedule quickly in any topology — reduces wait but may hurt network performance
- Low w₂, high w₄: wait for good topology — increases wait but improves job runtime
Typical for HPC: w₂ = 0.25, w₄ = 0.15. Typical for ML training: w₂ = 0.10, w₄ = 0.30.
Utilization vs. Energy (w₇):
- w₇ = 0.00: schedule immediately regardless of energy cost (default for most sites)
- w₇ = 0.10-0.15: delay time-flexible jobs to cheap-energy windows
Only relevant for sites with significant time-of-use electricity pricing.
Using RM-Replay
Overview
RM-Replay replays production workload traces through the scheduler in simulation mode. No real resources are used. Simulation runs in seconds, not hours.
Reference: Martinasso et al., “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management” (SC18).
Step 1: Capture Traces
Record workload traces from production (or synthetic workloads):
# Enable trace capture (writes to S3)
lattice admin config set scheduler.trace_capture=true
lattice admin config set scheduler.trace_path="s3://lattice-traces/"
# Capture for a representative period (1 week recommended)
# Traces include:
# - Allocation submissions (arrival time, resources, constraints, tenant, priority)
# - Allocation completions (actual duration, exit status)
# - Node inventory (capabilities, topology, conformance groups)
Trace format is a timestamped event log (JSON lines):
{"ts": "2026-03-01T00:00:01Z", "type": "submit", "alloc": {"nodes": 64, "gpu_type": "GH200", "walltime": "72h", "tenant": "physics", "priority": 4}}
{"ts": "2026-03-01T00:00:05Z", "type": "complete", "alloc_id": "abc-123", "duration": "68h", "exit": 0}
Step 2: Configure Weights
Create weight profiles to compare:
# profiles/baseline.yaml (current production weights)
hpc-batch:
priority: 0.20
wait_time: 0.25
fair_share: 0.25
topology: 0.15
data_readiness: 0.10
backlog: 0.05
energy: 0.00
checkpoint: 0.00
conformance: 0.10
# profiles/fairness-boost.yaml (experiment: more fairness)
hpc-batch:
priority: 0.15
wait_time: 0.20
fair_share: 0.35 # increased
topology: 0.15
data_readiness: 0.10
backlog: 0.05
energy: 0.00
checkpoint: 0.00
conformance: 0.10
Step 3: Replay
# Replay with baseline weights
rm-replay --trace=traces/week-2026-03.jsonl \
--weights=profiles/baseline.yaml \
--nodes=inventory/alps.yaml \
--output=results/baseline/
# Replay with experimental weights
rm-replay --trace=traces/week-2026-03.jsonl \
--weights=profiles/fairness-boost.yaml \
--nodes=inventory/alps.yaml \
--output=results/fairness-boost/
Step 4: Evaluate
RM-Replay produces a summary report:
=== RM-Replay Results: fairness-boost ===
Utilization:
GPU-hours consumed: 1,234,567 / 1,500,000 available (82.3%)
↓ 2.1% vs baseline (84.4%)
Wait Time:
p50: 12 min (baseline: 10 min) ↑ 20%
p95: 2.1 hr (baseline: 2.5 hr) ↓ 16%
p99: 8.3 hr (baseline: 12.1 hr) ↓ 31%
Fairness (Jain's Index):
0.94 (baseline: 0.87) ↑ 8%
Tenant Share Deviation:
Max deviation: 3.2% (baseline: 8.7%) ↓ 63%
Backfill:
Backfill jobs: 342 (baseline: 367) ↓ 7%
Preemptions:
Total: 15 (baseline: 12) ↑ 25%
Step 5: Decide and Deploy
Compare results across profiles. When satisfied:
# Deploy new weights (hot-reloadable, no restart)
lattice admin vcluster set-weights --name=hpc-batch \
--priority=0.15 --wait-time=0.20 --fair-share=0.35 \
--topology=0.15 --data-readiness=0.10 --backlog=0.05 \
--energy=0.00 --checkpoint=0.00 --conformance=0.10
Weights take effect on the next scheduling cycle.
Scheduling Cycle Tuning
The scheduling cycle interval affects responsiveness vs. overhead:
| Interval | Effect | Recommended For |
|---|---|---|
| 5s | Fast scheduling, higher CPU on scheduler | Interactive vCluster, small clusters |
| 15s | Balanced | HPC batch, ML training |
| 30s | Lower overhead, slower response | Large clusters (5000+ nodes), service vCluster |
lattice admin vcluster set-config --name=hpc-batch --cycle-interval=15s
Backfill Tuning
Backfill depth controls how many future reservations the solver considers:
| Depth | Effect |
|---|---|
| 0 | No backfill (only first-fit) — simple but low utilization |
| 10 | Moderate backfill — good balance |
| 50 | Deep backfill — higher utilization but longer cycle time |
For most sites, depth 10-20 is optimal. Increase if utilization is below target.
Conformance Group Sizing
If conformance groups are too small (many distinct fingerprints), multi-node jobs have fewer candidate sets:
- Symptom: High wait times for multi-node jobs, f₉ scores consistently low
- Diagnosis: `lattice nodes -o wide` shows many distinct conformance hashes
- Fix: Coordinate with OpenCHAMI to standardize firmware versions. Prioritize GPU driver and NIC firmware alignment.
- Workaround: Reduce w₉ for tolerant workloads (services, interactive)
Cross-References
- scheduling-algorithm.md — Cost function definition, weight profiles
- testing-strategy.md — RM-Replay regression suite
- conformance.md — Conformance groups and drift
- telemetry.md — Scheduler self-monitoring metrics for observing tuning impact
Node Lifecycle
Design Principle
Nodes follow a formal state machine with well-defined transitions, timeouts, and operator actions. The node agent drives transitions locally; the quorum records ownership changes with strong consistency. Running allocations are never disrupted by state transitions unless the node is genuinely unhealthy.
State Machine
┌────────────────────────────────────────────┐
│ │
▼ │
┌─────────┐ boot ┌──────────┐ health ok ┌─────────┐ │
│ Unknown │────────→ │ Booting │──────────────→│ Ready │ │
└─────────┘ └──────────┘ └────┬────┘ │
▲ │ │ │
│ boot fail │ │
│ │ ┌────────────────┤ │
│ ▼ │ │ │
│ ┌──────────┐ │ drain cmd │ │
│ │ Failed │ │ │ │ │
│ └──────────┘ │ ▼ │ │
│ │ │ ┌──────────┐ │ remediated
│ wipe/reboot │ │ Draining │ │ │
│ │ │ └─────┬────┘ │ │
│ │ │ allocs done │ │
│ │ │ │ │ │
│ │ │ ▼ │ │
│ │ │ ┌──────────┐ │ │
│ │ │ │ Drained │ │ │
│ │ │ └─────┬────┘ │ │
│ │ │ undrain│ │ │
│ │ │ │ │ │
│ │ │ ▼ │ │
│ │ └──→ (Ready) ◄───┘ │
│ │ │
│ │ heartbeat miss ┌───────────┐│
│ │ ┌────────────────→│ Degraded ││
│ │ │ (Ready) └─────┬─────┘│
│ │ │ grace timeout│
│ │ │ │ │
│ │ │ ▼ │
│ └────┼──────────────────┌─────────┐ │
│ │ │ Down │ │
└──────────────────────────┼──────────────────└────┬────┘ │
│ reboot│ │
│ └──────┘
│
heartbeat resume
(within grace)
│
└──→ (Ready)
States
| State | Description | Schedulable | Allocations Run |
|---|---|---|---|
| `Unknown` | Node exists in inventory but has never reported | No | No |
| `Booting` | OpenCHAMI is booting/reimaging the node | No | No |
| `Ready` | Healthy, agent reporting, available for scheduling | Yes | Yes |
| `Degraded` | Heartbeat missed or minor issue detected | No (new) | Yes (existing) |
| `Down` | Confirmed failure, grace period expired | No | No (requeued) |
| `Draining` | Operator or scheduler requested drain; waiting for allocations to finish | No (new) | Yes (existing, draining) |
| `Drained` | All allocations completed/migrated after drain | No | No |
| `Failed` | Boot failure or unrecoverable hardware error | No | No |
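The lifecycle diagram above can be captured as a small transition table. A minimal sketch; the event names are illustrative, not the lattice-quorum API:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeState { Unknown, Booting, Ready, Degraded, Down, Draining, Drained, Failed }

#[derive(Clone, Copy)]
enum Event {
    Boot, BootFail, HealthOk, HeartbeatMiss, HeartbeatResume,
    GraceTimeout, DrainCmd, AllocsDone, Undrain, Reboot,
}

// One step of the state machine; unlisted (state, event) pairs are no-ops.
fn step(state: NodeState, event: Event) -> NodeState {
    use NodeState::*;
    match (state, event) {
        (Unknown, Event::Boot) | (Down, Event::Reboot) | (Failed, Event::Reboot) => Booting,
        (Booting, Event::HealthOk) => Ready,
        (Booting, Event::BootFail) => Failed,
        (Ready, Event::HeartbeatMiss) => Degraded,
        (Degraded, Event::HeartbeatResume) => Ready,
        (Degraded, Event::GraceTimeout) => Down,
        (Ready, Event::DrainCmd) => Draining,
        (Draining, Event::AllocsDone) => Drained,
        (Drained, Event::Undrain) => Ready,
        (s, _) => s,
    }
}
```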
Transitions
Ready → Degraded
Trigger: First missed heartbeat.
Timeout: `heartbeat_timeout` (default: 30s). If no heartbeat is received within this window, the quorum marks the node Degraded.
Effect: Node is removed from scheduling candidates for new allocations. Running allocations continue undisturbed. No user notification.
Sensitive override: Sensitive nodes use a longer degradation window (default: 2 minutes) to avoid false positives from transient network issues.
Degraded → Ready
Trigger: Heartbeat resumes within the grace period.
Effect: Node re-enters the scheduling pool. No allocation disruption occurred. Event logged but no alert.
Degraded → Down
Trigger: Grace period expired without heartbeat recovery.
Timeouts:
| Node Type | Grace Period | Rationale |
|---|---|---|
| Standard | 60s | Balance between fast recovery and false positive avoidance |
| Sensitive | 5 minutes | Sensitive allocations are high-value; avoid premature requeue |
| Borrowed | 30s | Borrowed nodes should be reclaimed quickly |
Effect:
- All allocations on the node are evaluated per their requeue policy (cross-ref: failure-modes.md)
- Node ownership released (Raft commit)
- Alert raised to operators
- OpenCHAMI notified for out-of-band investigation (Redfish BMC check)
Ready → Draining
Trigger: Explicit operator command (lattice node drain <id>) or scheduler-initiated (upgrade, conformance drift on sensitive node).
Effect:
- Node removed from scheduling candidates
- Running allocations continue until completion
- For urgent drains: scheduler may trigger checkpoint on running allocations (cross-ref: checkpoint-broker.md)
- No new allocations assigned
Draining → Drained
Trigger: All running allocations on the node have completed, been checkpointed, or been migrated.
Effect: Node is idle and safe for maintenance. Operator can upgrade, reboot, or reimage.
Drained → Ready
Trigger: Operator undrain (lattice node undrain <id>). Typically after maintenance.
Precondition: Node agent health check passes (heartbeat, GPU detection, network test, conformance fingerprint computed).
Effect: Node re-enters scheduling pool.
Any → Down (hardware failure)
Trigger: OpenCHAMI Redfish BMC detects critical hardware failure (PSU, uncorrectable ECC, GPU fallen off bus).
Effect: Immediate transition to Down, bypassing grace period. Same allocation handling as Degraded → Down.
Down → Booting
Trigger: Operator or automated remediation initiates reboot/reimage via OpenCHAMI.
Effect: Node enters Booting state. OpenCHAMI BSS serves the appropriate image.
Booting → Ready
Trigger: Node agent starts, passes health check, reports to quorum.
Health check: Heartbeat received, GPU count matches capabilities, NIC firmware detected, conformance fingerprint computed and reported.
Booting → Failed
Trigger: Boot timeout (default: 10 minutes) or repeated boot failures (3 consecutive).
Effect: Node marked Failed. Alert raised. Operator must investigate.
Sensitive Node Lifecycle Extensions
Sensitive nodes have additional constraints:
| Event | Standard Node | Sensitive Node |
|---|---|---|
| Claim | Scheduler assigns | User claims explicitly, Raft-committed |
| Degraded grace | 60s | 5 minutes |
| Down → requeue | Automatic | Operator intervention required |
| Release | Node returns to pool | Node must be wiped (OpenCHAMI secure erase) before returning |
| Conformance drift | Deprioritized | Immediate Draining, audit logged |
Sensitive Release Sequence
1. User releases sensitive allocation
2. Quorum releases node ownership (Raft commit, audit entry)
3. Node enters Draining (if other sensitive allocations) or proceeds to wipe
4. OpenCHAMI initiates secure wipe:
a. GPU memory clear
b. NVMe secure erase (if present)
c. RAM scrub
d. Reboot into clean image
5. Wipe confirmation reported to quorum (Raft commit, audit entry)
6. Node transitions to Ready and returns to general pool
Wipe Failure Handling
If the OpenCHAMI secure wipe fails or times out during sensitive node release:
- Timeout: Default wipe timeout is 30 minutes (configurable: `sensitive.wipe_timeout`). If the wipe does not complete within this window, the node enters a `Quarantine` state (treated as `Down` by the scheduler).
- Quarantine: Quarantined nodes are excluded from scheduling and flagged for operator intervention. They do not return to the general pool.
- Operator intervention: The operator investigates (BMC console, hardware diagnostics) and either:
  - Retries the wipe: `lattice admin node wipe <id> --force`
  - Replaces the node hardware
  - Marks the node as permanently failed: `lattice node disable <id>`
- Audit: Wipe failures are logged as critical audit events (Raft-committed for sensitive nodes). The audit entry records: node ID, wipe start time, failure reason, operator action.
- Alert: `lattice_sensitive_wipe_failure_total` counter incremented; critical alert fired.
Operator Commands
| Command | Effect | Confirmation Required |
|---|---|---|
| `lattice node drain <id>` | Start draining | No |
| `lattice node drain <id> --urgent` | Drain with checkpoint trigger | Yes (allocations will be checkpointed) |
| `lattice node undrain <id>` | Re-enable scheduling | No |
| `lattice node disable <id>` | Transition to Down immediately | Yes (allocations will be requeued/failed) |
| `lattice node enable <id>` | Re-enable a disabled node (Down → Ready) | No |
| `lattice node status <id>` | Show current state, allocations, health | No |
| `lattice node list --state=degraded` | List nodes in a specific state | No |
Heartbeat Protocol
Node agents send heartbeats to the quorum at a configurable interval:
| Parameter | Default | Description |
|---|---|---|
| `heartbeat_interval` | 10s | How often the agent sends a heartbeat |
| `heartbeat_timeout` | 30s | Quorum marks Degraded after this silence |
| `grace_period` | 60s | Degraded → Down after this additional silence |
| `sensitive_grace_period` | 5m | Extended grace for sensitive nodes |
Heartbeats include:
- Monotonic sequence number (replay detection)
- Node health summary (GPU count, temperature, ECC errors)
- Conformance fingerprint (if recomputed since last heartbeat)
- Running allocation count
Heartbeats are lightweight (~200 bytes) and sent over the management traffic class (cross-ref: security.md).
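The quorum-side liveness decision implied by the table above can be sketched as a pure function of heartbeat silence. This is an illustrative sketch, not the actual `lattice-quorum` code; the function and type names are assumptions.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum NodeState { Ready, Degraded, Down }

/// Decide a node's liveness state from the time since its last heartbeat,
/// using the defaults from the table above. `sensitive` selects the
/// extended grace period. (Illustrative only; names are assumptions.)
fn liveness(silence: Duration, sensitive: bool) -> NodeState {
    let timeout = Duration::from_secs(30); // heartbeat_timeout
    let grace = if sensitive {
        Duration::from_secs(5 * 60)        // sensitive_grace_period
    } else {
        Duration::from_secs(60)            // grace_period
    };
    if silence < timeout {
        NodeState::Ready
    } else if silence < timeout + grace {
        NodeState::Degraded
    } else {
        NodeState::Down
    }
}

fn main() {
    assert_eq!(liveness(Duration::from_secs(10), false), NodeState::Ready);
    assert_eq!(liveness(Duration::from_secs(45), false), NodeState::Degraded);
    assert_eq!(liveness(Duration::from_secs(95), false), NodeState::Down);
    // The same 95s of silence leaves a sensitive node only Degraded.
    assert_eq!(liveness(Duration::from_secs(95), true), NodeState::Degraded);
}
```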
Agent Restart and State Recovery
The node agent persists active allocation state to /var/lib/lattice/agent-state.json (configurable via --state-file). This enables workload survival across agent restarts.
On graceful shutdown (SIGTERM):
- Agent writes current allocation state (PIDs, cgroup paths, runtime type, mount points) to the state file
- Agent exits without killing workloads (systemd `KillMode=process`)
On startup:
- Agent reads the persisted state file
- For each allocation, checks whether the process is still alive (`kill(pid, 0)`)
- Alive processes are reattached; the agent resumes heartbeating their status
- Dead processes are treated as orphans: cgroup scopes are destroyed, mounts cleaned up
- Stray cgroup scopes under `workload.slice/alloc-*.scope` with no matching state entry are also cleaned up
- Agent re-registers with quorum and resumes normal operation
Crash recovery: If the agent crashes without writing the state file, the startup scan of cgroup scopes under workload.slice/ provides a fallback discovery mechanism for orphaned workloads.
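The reconciliation on startup reduces to classifying each workload as reattach, orphan, or stray scope. A minimal sketch, assuming a pre-collected view of persisted state and discovered cgroup scopes (the real agent reads these from the state file and `workload.slice/`; all names here are hypothetical):

```rust
use std::collections::HashSet;

/// Outcome of reconciling persisted agent state against what is actually
/// running after a restart. (Sketch; variant names are assumptions.)
#[derive(Debug, PartialEq)]
enum Recovery<'a> {
    Reattach(&'a str),    // process alive: resume heartbeating its status
    CleanOrphan(&'a str), // process dead: destroy cgroup scope, clean mounts
    StrayScope(&'a str),  // cgroup scope with no matching state entry
}

fn reconcile<'a>(
    persisted: &[(&'a str, bool)], // (alloc_id, pid still alive per kill(pid, 0))
    cgroup_scopes: &[&'a str],     // alloc ids found under workload.slice/
) -> Vec<Recovery<'a>> {
    let known: HashSet<&str> = persisted.iter().map(|&(id, _)| id).collect();
    let mut out = Vec::new();
    for &(id, alive) in persisted {
        out.push(if alive { Recovery::Reattach(id) } else { Recovery::CleanOrphan(id) });
    }
    // The cgroup scan doubles as crash-recovery fallback: scopes with no
    // state entry are orphaned workloads to clean up.
    for &scope in cgroup_scopes {
        if !known.contains(scope) {
            out.push(Recovery::StrayScope(scope));
        }
    }
    out
}

fn main() {
    let persisted = [("alloc-1", true), ("alloc-2", false)];
    let scopes = ["alloc-1", "alloc-2", "alloc-9"];
    let plan = reconcile(&persisted, &scopes);
    assert_eq!(plan, vec![
        Recovery::Reattach("alloc-1"),
        Recovery::CleanOrphan("alloc-2"),
        Recovery::StrayScope("alloc-9"),
    ]);
}
```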
Cross-References
- failure-modes.md — Allocation requeue on node failure
- conformance.md — Conformance drift triggers drain on sensitive nodes
- upgrades.md — Drain/undrain during rolling upgrades
- checkpoint-broker.md — Checkpoint on urgent drain
- sensitive-workloads.md — Sensitive node claim/release/wipe
- security.md — Heartbeat authentication (mTLS, sequence numbers)
Node Conformance & Configuration Drift
Problem
In large-scale HPC systems, nodes gradually drift from their intended configuration: firmware versions diverge, driver updates are applied unevenly, kernel parameters change. This configuration drift causes:
- Silent performance degradation. A 64-node NCCL training run where one node has a different NIC firmware version may see unexplained slowdowns or hangs.
- Correctness issues. Mismatched GPU driver versions can produce different numerical results.
- Compliance violations. Regulated workloads require provable consistency of the execution environment.
Design Principle
The scheduler does not manage node configuration — OpenCHAMI does. The scheduler only needs to know whether nodes are the same or different, and how strict the workload’s homogeneity requirements are. Detection is the node agent’s job. Remediation is OpenCHAMI’s job.
Conformance Fingerprint
Each node agent computes a conformance fingerprint: a hash of the node’s configuration-critical software and firmware versions.
Components included in the fingerprint:
- GPU driver version (e.g., NVIDIA 550.54.14)
- NIC firmware version (Slingshot/UE adapter firmware)
- BIOS/BMC firmware version (reported via Redfish/OpenCHAMI)
- Kernel version and boot parameters
- uenv base image hash (for sensitive: the hardened OS image)
The fingerprint is a content hash (SHA-256 of the sorted component list). Nodes with identical fingerprints belong to the same conformance group.
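The key property is that the component list is sorted before hashing, so reporting order cannot change the fingerprint. A dependency-free sketch of that property (the real implementation uses SHA-256; this sketch substitutes the std hasher so it runs without external crates):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Compute a conformance fingerprint from (component, version) pairs.
/// Sketch only: the real fingerprint is SHA-256 of the sorted component
/// list; the sort-then-hash structure shown here is the same.
fn fingerprint(components: &[(&str, &str)]) -> u64 {
    let mut sorted: Vec<String> = components
        .iter()
        .map(|(name, version)| format!("{name}={version}"))
        .collect();
    sorted.sort(); // order-independence: same components => same hash
    let mut h = DefaultHasher::new();
    sorted.join("\n").hash(&mut h);
    h.finish()
}

fn main() {
    let a = fingerprint(&[
        ("gpu_driver", "550.54.14"),
        ("nic_firmware", "2.1.0"),
        ("kernel", "6.5.0"),
    ]);
    // Same components in a different order: same conformance group.
    let b = fingerprint(&[
        ("kernel", "6.5.0"),
        ("gpu_driver", "550.54.14"),
        ("nic_firmware", "2.1.0"),
    ]);
    assert_eq!(a, b);
    // A driver bump puts the node in a different conformance group.
    let c = fingerprint(&[
        ("gpu_driver", "550.90.07"),
        ("nic_firmware", "2.1.0"),
        ("kernel", "6.5.0"),
    ]);
    assert_ne!(a, c);
}
```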
Reporting
The node agent reports the conformance fingerprint alongside its existing health data. This is eventually consistent — conformance group membership does not go through Raft (it’s derived from node agent reports, same as health status).
Exception: for sensitive nodes, conformance state changes are recorded in the Raft-committed audit log (per sensitive workload requirements).
Staleness
The node agent recomputes the fingerprint:
- On startup
- Periodically (default: every 6 hours)
- On explicit request from the scheduler (e.g., after OpenCHAMI remediation)
If a node hasn’t reported a fingerprint within the staleness window, the scheduler treats it as unknown conformance — equivalent to a unique conformance group of one.
Scheduling Integration
Cost Function (f₉)
See scheduling-algorithm.md for the full cost function. The conformance factor f₉ scores how homogeneous the candidate node set is:
f₉(j, candidates) = largest_conformance_group_size(candidates) / j.requested_nodes
- 1.0 → all candidate nodes share the same fingerprint
- 0.5 → half the nodes match, half differ
- Low values → highly heterogeneous set
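The f₉ formula above can be sketched directly: group candidates by fingerprint, take the largest group, divide by the requested node count. (Illustrative only; the solver's real data structures differ.)

```rust
use std::collections::HashMap;

/// Conformance factor f9: size of the largest conformance group among the
/// candidate nodes, divided by the requested node count. Candidates are
/// (node_id, fingerprint) pairs. (Sketch of the formula above.)
fn f9(candidates: &[(&str, &str)], requested_nodes: usize) -> f64 {
    if requested_nodes <= 1 {
        return 1.0; // single-node jobs: conformance is trivially satisfied
    }
    let mut groups: HashMap<&str, usize> = HashMap::new();
    for &(_, fp) in candidates {
        *groups.entry(fp).or_insert(0) += 1;
    }
    let largest = groups.values().copied().max().unwrap_or(0);
    largest as f64 / requested_nodes as f64
}

fn main() {
    // All four candidates share one fingerprint: perfectly homogeneous.
    let uniform = [("n1", "a"), ("n2", "a"), ("n3", "a"), ("n4", "a")];
    assert_eq!(f9(&uniform, 4), 1.0);
    // Two groups of two: at most half the requested set can match.
    let split = [("n1", "a"), ("n2", "a"), ("n3", "b"), ("n4", "b")];
    assert_eq!(f9(&split, 4), 0.5);
}
```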
Node Selection
During node selection (solver step 2a), the solver prefers nodes from the same conformance group:
- Among nodes satisfying constraints (GPU type, topology, etc.), group by conformance fingerprint
- Select the largest conformance group that can satisfy the node count
- If no single group is large enough, merge groups (with a scoring penalty via f₉)
- For single-node jobs, conformance is irrelevant (f₉ = 1.0 trivially)
Per-vCluster Policy
| vCluster Type | Conformance Behavior |
|---|---|
| HPC Batch | Soft preference (w₉=0.10). Prefers homogeneous sets but will mix if needed. |
| ML Training | Strong preference (w₉=0.25). Multi-node training is sensitive to driver mismatches. |
| Service | Weak preference (w₉=0.05). Services are usually single-node or tolerate heterogeneity. |
| Sensitive | Hard constraint at solver level (drifted nodes excluded before scoring). w₉=0.10 as tiebreaker among conformant nodes. |
| Interactive | Ignored (w₉=0.00). Short-lived, single-node, not sensitive to drift. |
Drift Response
When the scheduler detects that a node’s conformance fingerprint has changed (or diverged from the majority in its group):
- Continue running workloads. Existing allocations are not disrupted — the drift already happened, and disrupting would make things worse.
- Stop scheduling new work. The node is deprioritized for new allocations (it now belongs to a smaller conformance group, scoring lower on f₉).
- Signal OpenCHAMI. The scheduler (or node agent) notifies OpenCHAMI that the node has drifted, triggering remediation (firmware update, reboot into correct image, etc.).
- For sensitive nodes: additionally flag the drift in the audit log and set the node to `Draining` (transitioning to `Drained` once active allocations complete); no new sensitive claims until remediated and verified. After remediation, an operator undoes the drain (`Drained` → `Ready`).
The scheduler does not attempt to remediate drift itself. It only avoids scheduling on drifted nodes and signals the infrastructure layer to fix them.
OpenCHAMI Coordination
When the scheduler detects drift:
- Signal: The node agent (or scheduler) calls OpenCHAMI SMD to report the drift:

      PATCH /hsm/v2/State/Components/{xname}
      { "Flag": "Warning", "FlagMsg": "conformance_drift: expected=<hash_a>, actual=<hash_b>" }

- OpenCHAMI response: OpenCHAMI evaluates the drift against its remediation policy:
- Minor drift (kernel param change): schedule firmware update at next maintenance window
- Major drift (GPU driver version): schedule immediate reboot into correct image via BSS
- Critical drift (sensitive node): immediate remediation, operator notified
- Wait for remediation: The scheduler does not re-enable the node automatically. After OpenCHAMI remediates (reboot, firmware flash), the node agent:
- Recomputes conformance fingerprint on startup
- Reports new fingerprint to quorum
- If fingerprint matches expected baseline: node returns to Ready
- If still drifted: remains deprioritized, alert escalated
- Timeout: If a node remains drifted for longer than `drift_remediation_timeout` (default: 24 hours):
  - Alert escalated to critical
  - Node transitions to `Down` (removed from scheduling entirely)
  - Operator must investigate and manually undrain after the fix
- Sensitive nodes (stricter):
  - Drift triggers immediate `Draining` (no grace period for new claims)
  - Remediation timeout: 4 hours (shorter, due to regulatory risk)
  - After remediation: conformance re-verified AND admin approval required before accepting sensitive claims again
Relationship to Existing Concepts
- NodeHealth tracks whether the node is functional (Healthy/Degraded/Down/Draining). Conformance is orthogonal — a node can be Healthy but drifted.
- NodeCapabilities tracks what the node has (GPU type, memory). Conformance tracks whether the node’s software stack matches expectations.
- Topology (GroupId) tracks physical location. Conformance tracks software configuration. Both are inputs to node selection: pack by topology AND by conformance group.
Network Domains
Design Principle
Network domains provide L3 reachability between allocations that need to communicate. They map to Slingshot VNIs (Virtual Network Identifiers) which provide hardware-enforced network isolation. Domains are created on demand, scoped to tenants, and cleaned up automatically.
What is a Network Domain
A network domain is a named group of allocations that share network reachability:
# Two allocations sharing a domain:
allocation_a:
connectivity:
network_domain: "ml-workspace"
allocation_b:
connectivity:
network_domain: "ml-workspace"
Allocations in the same domain can communicate over the Slingshot fabric. Allocations in different domains (or with no domain) are network-isolated at the hardware level.
VNI Lifecycle
Allocation
1. User submits allocation with network_domain: "ml-workspace"
2. lattice-api checks if domain "ml-workspace" exists for this tenant:
a. If exists: allocation joins the existing domain
b. If not: create new domain, allocate VNI from pool
3. VNI assignment is stored in quorum state (eventually consistent)
4. Node agents configure Slingshot NIC with the VNI for the allocation's traffic
VNI Pool
VNIs are allocated from a configured pool:
network:
vni_pool_start: 1000
vni_pool_end: 4095
# Reserved VNIs:
# 1 = management
# 2 = telemetry
# 3-999 = reserved for future use
VNIs are allocated sequentially from the pool. When freed, they return to the available set.
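The allocation discipline described above (lowest available VNI first, freed VNIs returned to the set) can be sketched with a small pool type. This is an illustrative sketch; the real pool also persists assignments in quorum state, and the type name is an assumption.

```rust
use std::collections::BTreeSet;

/// Minimal VNI pool: sequential allocation from a configured range;
/// freed VNIs return to the available set. (Sketch only.)
struct VniPool {
    available: BTreeSet<u32>,
}

impl VniPool {
    fn new(start: u32, end: u32) -> Self {
        Self { available: (start..=end).collect() }
    }
    /// Lowest available VNI, or None when the pool is exhausted
    /// (which surfaces to users as the error shown below).
    fn allocate(&mut self) -> Option<u32> {
        let vni = self.available.iter().next().copied()?;
        self.available.remove(&vni);
        Some(vni)
    }
    fn release(&mut self, vni: u32) {
        self.available.insert(vni);
    }
}

fn main() {
    let mut pool = VniPool::new(1000, 1002); // tiny range for illustration
    assert_eq!(pool.allocate(), Some(1000));
    assert_eq!(pool.allocate(), Some(1001));
    assert_eq!(pool.allocate(), Some(1002));
    assert_eq!(pool.allocate(), None); // exhausted: new domains fail
    pool.release(1001);
    assert_eq!(pool.allocate(), Some(1001)); // freed VNI is reusable
}
```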
Release
1. Last allocation in the domain completes (or is cancelled)
2. Domain enters "draining" state for grace_period (default: 5 minutes)
- Allows brief gaps between allocations in a long-running workflow
3. After grace period with no new allocations: domain is released
4. VNI returns to the available pool
5. Domain name can be reused by the same tenant
The grace period prevents VNI churn in DAG workflows where allocations start and stop in sequence but share a domain.
DAG Domain Persistence
DAG workflows often have sequential stages that share a network domain but have gaps between stages (one allocation completes before the next starts). The grace period (default: 5 minutes) covers these gaps:
- If the next DAG stage starts within the grace period: it joins the existing domain (same VNI, no churn)
- If the gap exceeds the grace period: the domain is released and a new VNI is allocated when the next stage starts
- For long-running DAGs with predictable inter-stage gaps, increase the grace period per domain: `lattice admin network set-grace --domain=<name> --grace=15m`
- The grace period timer resets each time a new allocation joins the domain
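The release decision reduces to two conditions: the domain has no active allocations, and the grace timer (running since the last allocation left, reset on join) has expired. A minimal sketch, with hypothetical names:

```rust
use std::time::Duration;

/// Decide whether a draining domain should be released. The grace timer
/// runs from the moment the last allocation left and resets whenever a
/// new allocation joins. (Sketch; names are assumptions.)
fn should_release(
    active_allocations: usize,
    since_last_departure: Duration,
    grace: Duration,
) -> bool {
    active_allocations == 0 && since_last_departure >= grace
}

fn main() {
    let grace = Duration::from_secs(5 * 60); // default grace period
    // 3 minutes into a gap between DAG stages: keep the VNI.
    assert!(!should_release(0, Duration::from_secs(180), grace));
    // Gap exceeded the grace period: release the domain and its VNI.
    assert!(should_release(0, Duration::from_secs(360), grace));
    // An allocation is still running: never release.
    assert!(!should_release(1, Duration::from_secs(360), grace));
}
```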
Scoping Rules
| Rule | Enforcement |
|---|---|
| Domain names are scoped to a tenant | Two tenants can use the same domain name without conflict |
| Only allocations from the same tenant can share a domain | Cross-tenant domains are not allowed (isolation requirement) |
| Sensitive domains are per-allocation | Each sensitive allocation gets a unique domain (no sharing, even within tenant) |
| Domain names are user-chosen strings | No system-generated names; users pick meaningful names |
Capacity
| Parameter | Default | Notes |
|---|---|---|
| VNI pool size | 3095 (1000-4095) | Sufficient for typical HPC deployments |
| Max domains per tenant | 50 | Configurable per tenant |
| Max allocations per domain | Unlimited | Practical limit: node count |
VNI Exhaustion
If the VNI pool is exhausted:
- New domain creation fails with a clear error:

      Error: cannot create network domain — VNI pool exhausted (3095/3095 in use)
      Hint: Wait for running allocations to complete, or contact your system admin.

- Allocations without `network_domain` are unaffected (they don't need a VNI)
- Alert raised for operators
VNI Exhaustion Mid-DAG
If the VNI pool is exhausted while a DAG has pending allocations that require a new network domain:
- The allocation that needs the new domain enters `Pending` state with reason `vni_pool_exhausted`.
- The DAG stalls at this allocation; downstream dependencies remain blocked.
- Already-running DAG allocations with existing domains are unaffected.
- Mitigation: Use a shared network domain across DAG stages where possible. This avoids new VNI allocation for each stage and reduces pool pressure.
- Recovery: When other allocations complete and release VNIs, the pending allocation is re-evaluated on the next scheduling cycle.
Default Behavior
If an allocation does not specify network_domain:
- Single-node allocations: no VNI needed, no network isolation beyond the default
- Multi-node allocations: automatically assigned a domain named `alloc-{id}` (private to this allocation)
- Services with `expose` ports: automatically assigned a domain if not specified
Service Exposure
For allocations exposing service endpoints:
connectivity:
network_domain: "inference-cluster"
expose:
- name: "api"
port: 8080
protocol: "http"
Exposed ports are reachable from:
- Other allocations in the same network domain (always)
- The lattice-api REST gateway (for external access)
- Not directly reachable from outside the fabric (Slingshot is not routable from Ethernet)
Sensitive Network Domains
Sensitive allocations get strict network isolation:
connectivity:
network_domain: "sensitive-{user}-{alloc_id}" # auto-generated, unique
policy:
ingress: deny-all-except:
- same_domain # only processes in this allocation
- data_gateway # controlled data ingress
egress: deny-all-except:
- data_gateway # controlled data egress
- Each sensitive allocation gets its own domain (no sharing)
- Ingress/egress restricted to a data gateway endpoint
- With Ultra Ethernet: network-level encryption enabled for the VNI
- VNI released immediately on allocation completion (no grace period)
VNI Pool Expansion
To expand the VNI pool when approaching exhaustion:
1. Update the configuration to extend `vni_pool_end`:

       network:
         vni_pool_start: 1000
         vni_pool_end: 8191  # expanded from 4095

2. Restart the API server to pick up the new pool range. Existing domains and their VNI assignments are not affected.

3. Verify: the `lattice_network_vni_pool_total` metric should reflect the new pool size.
Note: The expanded range must not overlap with reserved VNIs (1-999) or VNIs used by other systems on the Slingshot fabric. Coordinate with network administrators before expanding.
Cross-References
- system-architecture.md — Network fabric layer, VNI-based isolation
- sensitive-workloads.md — Sensitive network isolation policy
- security.md — Network security, traffic classes
- api-design.md — Connectivity field in allocation request
MPI Process Management
Design Principle
Lattice must launch and manage multi-node MPI processes without relying on SSH between compute nodes. The node agent provides a Process Management Interface (PMI) so that MPI implementations (OpenMPI, MPICH, Cray MPICH) can perform rank discovery and key-value exchange through Lattice rather than through SSH or a Slurm-specific launcher.
Problem Statement
In Slurm, srun serves as both a process launcher (fan-out to nodes) and a PMI server (rank discovery, KV exchange). Lattice replaces srun with lattice launch / the LaunchTasks RPC, but the current implementation is a stub that does not:
- Fan out process launch to node agents
- Provide PMI wire-up so MPI ranks can discover each other
- Manage CXI credentials for Slingshot/Ultra Ethernet fabric access
Without this, users calling mpirun directly fall back to SSH for remote process spawning, which is:
- A security risk (SSH keys between compute nodes)
- Incompatible with network-domain-only L3 reachability
- Incompatible with the sensitive workload isolation model
- Operationally fragile (SSH host key management, authorized_keys distribution)
Supported MPI Implementations
| Implementation | PMI-2 Support | PMIx Support | Default Launcher | Notes |
|---|---|---|---|---|
| MPICH | Native (PMI-2 origin) | Via external PMIx | Hydra (SSH) | PMI-2 is the natural fit |
| OpenMPI | Yes (OMPI_MCA_pmix=pmi2) | Preferred (PRRTE) | ORTE/PRRTE (SSH) | PMI-2 fully functional |
| Cray MPICH | Native (via PALS) | Via PALS | PALS | PMI-2 without PALS works |
All three support PMI-2. PMIx is preferred by OpenMPI but not required.
Architecture
Two-Tier Design
┌─────────────────────────────────────────────────────────┐
│ Default: Native PMI-2 Server (built into node agent) │
│ Simple, no external dependencies, covers 95%+ of MPI │
│ workloads. ~8 wire commands over Unix domain socket. │
├─────────────────────────────────────────────────────────┤
│ Optional: OpenPMIx Sidecar (feature-flagged) │
│ Full PMIx v4/v5 support for workloads that require │
│ PMIx-specific features (spawn, tools API, events). │
│ Node agent manages OpenPMIx server lifecycle. │
└─────────────────────────────────────────────────────────┘
Launch Flow
User: lattice launch --alloc=123 -n 256 --tasks-per-node=4 ./my_mpi_app
│
▼
lattice-api (LaunchTasks RPC)
│
├─ Validates: allocation is Running, user owns it
├─ Computes rank layout: N nodes × tasks_per_node = total ranks
│ Rank assignment: node 0 gets ranks [0..3], node 1 gets [4..7], ...
├─ Generates launch_id, PMI job attributes (appnum, size, universe_size)
├─ Provisions CXI credentials if Slingshot fabric (see below)
│
▼ Fan-out: gRPC LaunchProcesses to each node agent in the allocation
Node Agent 0 Node Agent 1 Node Agent N-1
│ │ │
├─ Creates PMI-2 server ├─ Creates PMI-2 server ├─ ...
│ (Unix domain socket) │ (Unix domain socket) │
│ │ │
├─ Spawns local ranks ├─ Spawns local ranks │
│ rank 0: ./my_mpi_app │ rank 4: ./my_mpi_app │
│ rank 1: ./my_mpi_app │ rank 5: ./my_mpi_app │
│ rank 2: ./my_mpi_app │ rank 6: ./my_mpi_app │
│ rank 3: ./my_mpi_app │ rank 7: ./my_mpi_app │
│ │ │
│ Each rank inherits: │ │
│ - PMI_FD (socket fd) │ │
│ - PMI_RANK (global rank) │ │
│ - PMI_SIZE (world size) │ │
│ │ │
▼ ▼ ▼
MPI_Init() → PMI-2 fullinit → local KVS puts (libfabric endpoint addr)
│ │ │
▼ ─────────── kvsfence (cross-node KVS exchange via gRPC) ────────────
│ │ │
MPI_Init() completes MPI_Init() completes ...
│ │ │
(application runs) (application runs) ...
│ │ │
MPI_Finalize() → PMI-2 finalize
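The block rank layout used in the fan-out above (node 0 gets ranks [0..3], node 1 gets [4..7], and so on) can be sketched as a simple function. Illustrative only; the function name is an assumption.

```rust
/// Block rank layout: node i hosts global ranks
/// [i * tasks_per_node, (i + 1) * tasks_per_node).
/// Returns (first_rank, num_ranks) per node, matching the
/// first_rank / tasks_per_node fields of LaunchProcessesRequest.
fn rank_layout(nodes: u32, tasks_per_node: u32) -> Vec<(u32, u32)> {
    (0..nodes)
        .map(|i| (i * tasks_per_node, tasks_per_node))
        .collect()
}

fn main() {
    // 256 ranks over 64 nodes at 4 tasks per node.
    let layout = rank_layout(64, 4);
    assert_eq!(layout.len(), 64);
    assert_eq!(layout[0], (0, 4));    // node 0: ranks 0..3
    assert_eq!(layout[1], (4, 4));    // node 1: ranks 4..7
    assert_eq!(layout[63], (252, 4)); // last node: ranks 252..255
    let world_size: u32 = layout.iter().map(|(_, n)| n).sum();
    assert_eq!(world_size, 256);
}
```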
PMI-2 Wire Protocol
The PMI-2 wire protocol is text-based over a Unix domain socket. The node agent implements these commands:
| Command | Direction | Purpose |
|---|---|---|
| `fullinit` | rank → agent | Initialize PMI connection, receive rank/size/appnum |
| `job-getinfo` | rank → agent | Query job attributes (e.g., universe size) |
| `kvsput` | rank → agent | Store a key-value pair (e.g., libfabric endpoint address) |
| `kvsget` | rank → agent | Retrieve a key-value pair |
| `kvsfence` | rank → agent | Barrier + distribute all KV pairs across all ranks |
| `finalize` | rank → agent | Clean shutdown of PMI connection |
| `abort` | rank → agent | Signal abnormal termination |
| `spawn` | rank → agent | Dynamic process spawning (optional, rarely used) |
Cross-Node KVS Exchange (Fence)
The kvsfence operation is the only cross-node PMI operation. It requires all ranks across all nodes to synchronize and exchange accumulated KV pairs. This is implemented via gRPC between node agents:
kvsfence triggered on all nodes
│
▼
Phase 1: Local collection
Each node agent collects all kvsput entries from its local ranks.
Phase 2: Exchange (star topology via designated head node)
┌─────────────┐
│ Head Agent │ ◄──── gRPC PmiFence(local_kvs) ──── Agent 1
│ (rank 0's │ ◄──── gRPC PmiFence(local_kvs) ──── Agent 2
│ node) │ ◄──── gRPC PmiFence(local_kvs) ──── Agent N-1
│ │
│ Merges all │
│ KVS entries │
│ │
│ Broadcasts │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent 1
│ merged KVS │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent 2
│ │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent N-1
└─────────────┘
Phase 3: Local completion
Each node agent unblocks its local ranks' kvsfence.
Ranks can now kvsget any key from any node.
The head agent is the node agent hosting rank 0. For large jobs (>128 nodes), a tree-based reduction can be used instead of a star to reduce head-node pressure.
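The head agent's merge step in Phase 2 is a plain union of the per-node KVS snapshots: each rank writes keys unique to itself (e.g., its endpoint address), so no conflict resolution is needed. A minimal sketch of that merge, with hypothetical names:

```rust
use std::collections::HashMap;

/// Head-agent side of the star fence: merge the per-node KVS snapshots
/// received via PmiFence into one map, which is then broadcast back in
/// PmiFenceComplete. (Sketch; rank keys are unique, so a plain union
/// suffices.)
fn merge_fence(snapshots: Vec<HashMap<String, String>>) -> HashMap<String, String> {
    let mut merged = HashMap::new();
    for kvs in snapshots {
        merged.extend(kvs);
    }
    merged
}

fn main() {
    // Each agent contributes the libfabric endpoint addresses of its
    // local ranks (illustrative key/value formats).
    let agent0 = HashMap::from([("rank0-addr".to_string(), "fi://n0/0".to_string())]);
    let agent1 = HashMap::from([("rank4-addr".to_string(), "fi://n1/0".to_string())]);
    let merged = merge_fence(vec![agent0, agent1]);
    // After the broadcast, any rank can kvsget any other rank's address.
    assert_eq!(merged.len(), 2);
    assert_eq!(merged["rank0-addr"], "fi://n0/0");
    assert_eq!(merged["rank4-addr"], "fi://n1/0");
}
```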
Node Agent gRPC Extensions
New RPCs on the node agent service for MPI process management:
service NodeAgentService {
// Existing RPCs...
// Launch MPI ranks on this node (called by API server during fan-out)
rpc LaunchProcesses(LaunchProcessesRequest) returns (LaunchProcessesResponse);
// PMI fence exchange between node agents
rpc PmiFence(PmiFenceRequest) returns (PmiFenceResponse);
// PMI fence completion broadcast from head agent
rpc PmiFenceComplete(PmiFenceCompleteRequest) returns (PmiFenceCompleteResponse);
// Notify all local ranks to abort (e.g., one node failed)
rpc AbortProcesses(AbortProcessesRequest) returns (AbortProcessesResponse);
}
message LaunchProcessesRequest {
string launch_id = 1;
string allocation_id = 2;
string entrypoint = 3;
repeated string args = 4;
uint32 tasks_per_node = 5;
uint32 first_rank = 6; // global rank offset for this node
uint32 world_size = 7; // total ranks across all nodes
map<string, string> env = 8; // additional env vars
PmiMode pmi_mode = 9; // PMI2 (default) or PMIX
// CXI credentials for Slingshot fabric
optional CxiCredentials cxi_credentials = 10;
// Peer node agents for fence exchange
repeated PeerInfo peers = 11;
// Index of the head node (for fence coordination)
uint32 head_node_index = 12;
}
message PeerInfo {
string node_id = 1;
string grpc_address = 2; // node agent address (reachable via management network)
uint32 first_rank = 3;
uint32 num_ranks = 4;
}
enum PmiMode {
PMI2 = 0;
PMIX = 1;
}
message CxiCredentials {
uint32 vni = 1;
bytes auth_key = 2;
uint32 svc_id = 3;
}
PMI-2 Server Implementation
Each node agent runs a PMI-2 server per launch (one Unix socket per launch_id):
Node Agent
│
├─ LaunchProcesses received
│ ├─ Create Unix socket: /tmp/lattice-pmi-{launch_id}.sock
│ ├─ Start PMI-2 server task (tokio)
│ ├─ Fork/exec ranks with:
│ │ PMI_FD={fd} # inherited socket fd
│ │ PMI_RANK={rank} # global rank
│ │ PMI_SIZE={world_size} # world size
│ │ PMI_SPAWNED=0 # not dynamically spawned
│ │ LATTICE_LAUNCH_ID={launch_id}
│ │ LATTICE_ALLOC_ID={allocation_id}
│ │ LATTICE_NODELIST={comma-separated node list}
│ │ LATTICE_NNODES={node_count}
│ │ LATTICE_NPROCS={world_size}
│ │ # CXI env (if Slingshot):
│ │ FI_CXI_DEFAULT_VNI={vni}
│ │ FI_CXI_AUTH_KEY={key}
│ └─ Monitor all rank processes, report exit status
│
├─ PMI-2 server handles:
│ ├─ fullinit → return rank, size, appnum, debug flag
│ ├─ kvsput → store in local HashMap
│ ├─ kvsget → lookup local, or merged (post-fence)
│ ├─ kvsfence → collect local, trigger cross-node exchange, block until complete
│ ├─ finalize → mark rank done
│ └─ abort → signal all local ranks, notify head agent
│
└─ Cleanup on launch completion
├─ Remove Unix socket
├─ Report per-rank exit codes to API server
└─ Clean up CXI credentials
Environment Variables
Lattice sets these environment variables for MPI processes:
| Variable | Value | Purpose |
|---|---|---|
| `PMI_FD` | fd number | PMI-2 socket (inherited) |
| `PMI_RANK` | global rank | MPI rank |
| `PMI_SIZE` | world size | MPI world size |
| `PMI_SPAWNED` | 0 | Not dynamically spawned |
| `LATTICE_LAUNCH_ID` | UUID | Launch identifier |
| `LATTICE_ALLOC_ID` | UUID | Allocation identifier |
| `LATTICE_NODELIST` | comma-separated | All nodes in this launch |
| `LATTICE_NNODES` | integer | Node count |
| `LATTICE_NPROCS` | integer | Total rank count |
| `LATTICE_LOCAL_RANK` | 0..tasks_per_node-1 | Node-local rank |
| `LATTICE_LOCAL_SIZE` | tasks_per_node | Ranks on this node |
| `FI_CXI_DEFAULT_VNI` | VNI number | Slingshot VNI (if applicable) |
| `FI_CXI_AUTH_KEY` | hex string | CXI auth key (if applicable) |
| `FI_PROVIDER` | cxi or verbs | libfabric provider hint |
For Slurm compatibility (`compat.set_slurm_env=true`), Lattice also sets `SLURM_PROCID`, `SLURM_NPROCS`, `SLURM_LOCALID`, and `SLURM_NODELIST`.
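Building the per-rank environment from the table is mechanical; the sketch below shows a subset of the variables plus the Slurm-compat mirroring. Illustrative only; the function name and parameter set are assumptions.

```rust
use std::collections::HashMap;

/// Build a subset of the per-rank environment from the table above.
/// `slurm_compat` mirrors values under Slurm names
/// (compat.set_slurm_env=true). (Sketch; not the agent's real code.)
fn rank_env(rank: u32, world: u32, local: u32, slurm_compat: bool) -> HashMap<String, String> {
    let mut env = HashMap::from([
        ("PMI_RANK".to_string(), rank.to_string()),
        ("PMI_SIZE".to_string(), world.to_string()),
        ("PMI_SPAWNED".to_string(), "0".to_string()),
        ("LATTICE_LOCAL_RANK".to_string(), local.to_string()),
    ]);
    if slurm_compat {
        // Same values, Slurm names, so legacy scripts keep working.
        env.insert("SLURM_PROCID".to_string(), rank.to_string());
        env.insert("SLURM_NPROCS".to_string(), world.to_string());
    }
    env
}

fn main() {
    let env = rank_env(5, 256, 1, true);
    assert_eq!(env["PMI_RANK"], "5");
    assert_eq!(env["PMI_SIZE"], "256");
    assert_eq!(env["SLURM_PROCID"], env["PMI_RANK"]);
    let plain = rank_env(5, 256, 1, false);
    assert!(!plain.contains_key("SLURM_PROCID"));
}
```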
CXI Credential Management (Slingshot)
On Slingshot systems, MPI communication requires CXI (Cassini eXtended Interface) credentials tied to the allocation’s VNI. Without valid credentials, libfabric’s CXI provider refuses to open endpoints.
Credential Lifecycle
1. Allocation scheduled → network domain assigned → VNI allocated
2. LaunchTasks RPC → API server requests CXI credentials from fabric manager
- Input: VNI, allocation ID, node list
- Output: auth_key, svc_id (bound to VNI + node set)
3. Credentials included in LaunchProcessesRequest to each node agent
4. Node agent sets FI_CXI_DEFAULT_VNI and FI_CXI_AUTH_KEY for spawned ranks
5. On launch completion → API server revokes CXI credentials
Fabric Manager Integration
The Slingshot fabric manager provides a REST API for credential management:
| Operation | Endpoint | When |
|---|---|---|
| Create CXI service | POST /fabric/cxi/services | Launch start |
| Get auth key | GET /fabric/cxi/services/{id}/auth | Launch start |
| Revoke CXI service | DELETE /fabric/cxi/services/{id} | Launch end |
This is a new integration point, similar to the existing VAST API integration for storage.
Optional: OpenPMIx Sidecar (Feature-Flagged)
For workloads requiring full PMIx v4/v5 support (dynamic process spawning, PMIx tools API, event notification, PMIx groups), Lattice can run an OpenPMIx server as a managed sidecar process.
When to Use PMIx Mode
| Scenario | PMI-2 (default) | PMIx (optional) |
|---|---|---|
| Standard MPI (init, communication, finalize) | Yes | Yes |
| Multi-application launch (MPMD) | Limited | Yes |
| Dynamic process spawning (`MPI_Comm_spawn`) | No | Yes |
| PMIx tools API (debugger attach) | No | Yes |
| PMIx event notification | No | Yes |
| OpenMPI with PMIx-only features | No | Yes |
Architecture
Node Agent
│
├─ PmiMode::PMIX requested in LaunchProcessesRequest
│
├─ Spawns OpenPMIx server (pmix_server binary)
│ ├─ Configured via tmpdir/pmix-{launch_id}/
│ ├─ Node agent implements the PMIx "host" callback interface
│ │ via a small C shim library (libpmix-lattice-host.so)
│ │ that calls back to the node agent via Unix socket
│ ├─ Cross-node exchange: host callbacks route to node agent gRPC
│ └─ pmix_server provides Unix rendezvous socket for ranks
│
├─ Spawns ranks with:
│ PMIX_SERVER_URI={rendezvous_uri}
│ PMIX_NAMESPACE={launch_id}
│ PMIX_RANK={rank}
│ (instead of PMI_FD/PMI_RANK/PMI_SIZE)
│
└─ On completion: stops pmix_server, cleans up
Host Callback Shim
The OpenPMIx server requires the host (resource manager) to provide certain callbacks for cross-node operations. These are implemented via a small C shared library (libpmix-lattice-host.so) that:
- Is loaded by `pmix_server` at startup via `--host-lib` or `LD_PRELOAD`
- Implements: `pmix_server_fencenb_fn`, `pmix_server_dmodex_fn`, `pmix_server_spawn_fn`
pmix_server_fencenb_fn,pmix_server_dmodex_fn,pmix_server_spawn_fn - Each callback sends a request over a Unix socket to the node agent
- Node agent handles cross-node coordination via gRPC (same as PMI-2 fence)
This keeps the C code minimal (~200 lines) while leveraging the full OpenPMIx implementation.
Build and Deployment
# Cargo.toml (lattice-node-agent)
[features]
pmix = [] # enables PMIx sidecar support
When the `pmix` feature is enabled:
- The `pmix_server` binary must be installed on compute nodes (packaged separately or via uenv)
- `libpmix-lattice-host.so` is built from `infra/pmix-host/` and installed alongside the node agent
- The node agent detects `pmix_server` availability at startup and reports it as a node capability

When disabled: `PmiMode::PMIX` requests return an error with a clear message.
Integration with Existing Runtimes
uenv Runtime
PMI-2 socket and environment variables are available inside the mount namespace with no special handling (mount namespace does not isolate Unix sockets in the parent namespace).
Sarus Runtime
The PMI-2 Unix socket must be bind-mounted into the container:
sarus run --mount=type=bind,source=/tmp/lattice-pmi-{launch_id}.sock,destination=/tmp/lattice-pmi.sock ...
The --mpi flag in Sarus already handles MPI wire-up for Slurm; for Lattice, we configure Sarus to use the Lattice-provided PMI socket instead. This requires the Sarus MPI hook to be configured for PMI-2 mode rather than Slurm PMI mode.
DMTCP (Checkpoint/Restart)
DMTCP wraps the MPI process. The PMI-2 socket is outside the DMTCP checkpoint boundary. On restart, the node agent creates a new PMI-2 server and the restarted ranks re-initialize PMI. DMTCP’s MPI plugin handles reconnecting MPI communicators.
Failure Handling
Rank Failure
1. Rank exits with non-zero code (or is killed by signal)
2. Local node agent detects via process monitor
3. Node agent sends RankFailed notification to head agent
4. Head agent:
a. If allocation requeue policy = "on_any_failure": abort all ranks, requeue allocation
b. If MPI_ERRORS_RETURN semantics: notify remaining ranks via PMI-2 abort
c. Default: abort all ranks, report failure to API server
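The head agent's decision in step 4 is a three-way policy dispatch. A minimal sketch of that dispatch, assuming the policy is known as two booleans; variant and function names are hypothetical:

```rust
/// Head-agent action on a failed rank, following the three cases above.
/// (Sketch; variant names are assumptions.)
#[derive(Debug, PartialEq)]
enum FailureAction {
    AbortAndRequeue, // requeue policy "on_any_failure": abort all, requeue
    NotifyRanks,     // MPI_ERRORS_RETURN semantics: notify via PMI-2 abort
    AbortAndReport,  // default: abort all ranks, report to API server
}

fn on_rank_failure(requeue_on_any_failure: bool, errors_return: bool) -> FailureAction {
    if requeue_on_any_failure {
        FailureAction::AbortAndRequeue
    } else if errors_return {
        FailureAction::NotifyRanks
    } else {
        FailureAction::AbortAndReport
    }
}

fn main() {
    assert_eq!(on_rank_failure(true, false), FailureAction::AbortAndRequeue);
    assert_eq!(on_rank_failure(false, true), FailureAction::NotifyRanks);
    assert_eq!(on_rank_failure(false, false), FailureAction::AbortAndReport);
}
```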
Node Agent Failure
1. Node agent crashes or becomes unreachable
2. Head agent detects via gRPC timeout during fence (or heartbeat miss)
3. Head agent aborts the launch on all surviving nodes
4. API server handles allocation state transition (same as node failure)
Fence Timeout
1. kvsfence does not complete within timeout (default: 60s, configurable)
2. Head agent declares fence failure
3. All ranks aborted with PMI-2 abort message
4. Launch reported as failed with "PMI fence timeout" reason
User-Facing Changes
lattice launch (CLI)
# MPI launch (replaces srun -n 256 ./app)
lattice launch --alloc=123 -n 256 ./my_mpi_app
# With tasks-per-node control
lattice launch --alloc=123 --tasks-per-node=4 ./my_mpi_app
# Force PMIx mode (requires pmix feature on nodes)
lattice launch --alloc=123 -n 256 --pmi=pmix ./my_mpi_app
# Launch with environment variables
lattice launch --alloc=123 -n 256 --env OMP_NUM_THREADS=8 ./my_mpi_app
Submission Script
#!/bin/bash
#LATTICE nodes=64
#LATTICE walltime=2:00:00
#LATTICE vcluster=hpc-batch
#LATTICE network_domain=my-training-run
# No SSH, no mpirun, no srun needed.
# The entrypoint IS the MPI program; Lattice handles process launch and PMI.
lattice launch -n 256 --tasks-per-node=4 ./my_mpi_training
# Or for Slurm compatibility:
# srun -n 256 ./my_mpi_training (compat layer translates to lattice launch)
Direct mpirun (Escape Hatch)
Users who want to call mpirun directly can still do so. Lattice provides a Hydra-compatible launcher script (lattice-mpi-launcher) that uses the node agent gRPC instead of SSH:
# mpirun detects the Lattice launcher via:
# HYDRA_LAUNCHER=manual
# HYDRA_LAUNCHER_EXEC=lattice-mpi-launcher
# These are set automatically by the node agent when an allocation starts.
# So this "just works" inside an allocation:
mpirun -np 256 ./my_mpi_app
The lattice-mpi-launcher script:
- Receives the launch command from Hydra/ORTE
- Calls the local node agent's `LaunchProcesses` gRPC to spawn on the target node
- Returns the PID to the MPI launcher
This provides backward compatibility for scripts that use mpirun directly while still avoiding SSH.
Performance Considerations
| Operation | Latency | Bottleneck | Mitigation |
|---|---|---|---|
| Launch fan-out | ~100ms for 256 nodes | gRPC round-trips | Parallel fan-out from API server |
| PMI-2 fence (star) | ~10ms for <128 nodes | Head agent merge | Acceptable for typical HPC |
| PMI-2 fence (tree) | ~20ms for 1000+ nodes | Tree depth (log N) | Only needed at extreme scale |
| CXI credential provisioning | ~50ms | Fabric manager API | Cached for allocation lifetime |
MPI_Init typically takes 100-500ms. The Lattice PMI overhead is well within this budget.
Cross-References
- network-domains.md – VNI allocation, L3 reachability
- security.md – CXI credentials, network isolation
- slurm-migration.md – srun replacement
- node-lifecycle.md – Node agent process management
- failure-modes.md – Rank and node failure handling
- checkpoint-broker.md – DMTCP + MPI checkpoint interaction
- sessions.md – Interactive allocations with MPI launch
- ADR-010: Native PMI-2 with optional PMIx sidecar
Data Plane & Storage Architecture
Tiered Storage Model
┌─ Hot Tier (VAST-like) ─────────────────────────────────┐
│ Protocol: NFS + S3 (native multiprotocol) │
│ Use: active datasets, home dirs, checkpoints, scratch │
│ Performance: NVMe-speed, low-latency │
│ Scheduler integration: QoS per export, pre-staging │
│ Sensitive: encrypted pool, access-logged │
└────────────────────┬───────────────────────────────────┘
│ policy-driven data mover
┌────────────────────┴───────────────────────────────────┐
│ Warm Tier (capacity storage) │
│ Protocol: S3-compatible │
│ Use: completed outputs, older datasets, cold models │
│ Cost: significantly lower than hot │
└────────────────────┬───────────────────────────────────┘
│ archive policy
┌────────────────────┴───────────────────────────────────┐
│ Cold Tier (tape/object archive) │
│ Protocol: S3-compatible (Glacier-style retrieval) │
│ Use: regulatory retention, long-term archival │
│ Sensitive: 7+ year retention, immutable │
└────────────────────────────────────────────────────────┘
Protocol Standardization
Only two protocols for user-facing access:
- NFS: POSIX workloads, home directories, uenv images, legacy codes that expect a filesystem
- S3: Object access for checkpoints, datasets, model artifacts, any cloud-native tooling
No Lustre/GPFS client required. VAST delivers parallel-file-system performance via NFS.
Job Data Requirements
Explicit Declaration
Users who know their data needs can declare them:
data:
mounts:
- source: "s3://training-data/imagenet"
target: "/data/input"
tier_hint: "hot"
access: "read-only"
- source: "nfs://home/{user}"
target: "/home/{user}"
access: "read-write"
output: "s3://{tenant}/{project}/{allocation_id}/"
scratch_per_node: "500GB"
Sane Defaults (for users who don’t specify)
Every allocation automatically gets:
- Home directory: mounted via NFS from hot tier (`/home/{user}`)
- Node-local scratch: NVMe-backed ephemeral storage (`/scratch/local/`) if NVMe is available; tmpfs or network scratch otherwise
- Output directory: `s3://{tenant}/{project}/{allocation_id}/` (auto-created)
- Checkpoint directory: `s3://{tenant}/{project}/{allocation_id}/checkpoints/` (if checkpoint != none)
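A minimal sketch of how these defaults could be synthesized. The path templates come from the list above; the `default_mounts` helper and its field names are hypothetical, not part of the Lattice codebase.

```python
# Illustrative sketch: fill in the default data mounts for an allocation
# that declares no data section.

def default_mounts(tenant, project, alloc_id, user,
                   checkpoint="none", has_nvme=True) -> dict:
    base = f"s3://{tenant}/{project}/{alloc_id}/"
    mounts = {
        "home": f"/home/{user}",                                # NFS, hot tier
        "scratch": "/scratch/local/" if has_nvme else "/scratch/tmp/",
        "output": base,                                         # auto-created
    }
    if checkpoint != "none":
        mounts["checkpoint"] = base + "checkpoints/"
    return mounts
```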
Data Staging (Scheduler-Integrated)
The scheduler integrates with the storage API for intelligent data movement:
- Pre-staging during queue wait: When a job is queued and its data is on the warm/cold tier, the data mover begins warming it to the hot tier. Queue wait time becomes useful instead of idle.
- QoS allocation at job start: The scheduler calls the VAST API to set bandwidth guarantees for the job's NFS export. Prevents I/O-intensive jobs from starving latency-sensitive services.
- Checkpoint coordination: The checkpoint broker pre-allocates storage bandwidth windows to avoid I/O storms when many jobs checkpoint simultaneously.
VAST API Integration Points
| Operation | VAST API | When |
|---|---|---|
| Create export with QoS | POST /exports + QoS policy | Job starts |
| Query data locality | GET /catalog?path=… | Scheduling (data_readiness score) |
| Create snapshot | POST /snapshots | Job start (reproducibility) or checkpoint |
| Pre-stage from warm | POST /dataspace/prefetch | Job queued, data not on hot tier |
| Set bandwidth floor | PATCH /exports/{id}/qos | Job starts |
| Audit log query | GET /audit/logs?path=… | Compliance reporting |
Sensitive Storage Policy
vcluster: sensitive-secure
storage_policy:
encryption: aes-256-at-rest
pool: dedicated # separate VAST view/tenant
wipe_on_release: true # scrub after allocation ends
access_logging: full # every read/write logged
data_sovereignty: "ch" # data stays in Swiss jurisdiction
retention:
data: "as_specified_by_user"
audit_logs: "7_years"
tier_restriction: "hot_only" # no unencrypted copies on warm/cold
Log Storage
Allocation logs are persisted to S3 alongside output data. See observability.md for the log storage layout:
s3://{tenant}/{project}/{alloc_id}/logs/
├── stdout/{node_id}/{chunk_000..N}.log.zst
├── stderr/{node_id}/{chunk_000..N}.log.zst
└── metadata.json
Sensitive allocation logs are stored in the encrypted sensitive S3 pool with access logging enabled.
Node-Local Storage (Optional)
Nodes may have NVMe SSDs managed by the node agent. Local storage is not a hard requirement — nodes without NVMe operate with reduced performance but full functionality.
When NVMe is present:
- Scratch: ephemeral, wiped between allocations. For temp files, staging.
- Image cache: persistent across allocations. Caches uenv squashfs images and OCI layers.
  - LRU eviction policy
  - Cache hit avoids network pull from registry
  - Popular images stay warm automatically
When NVMe is absent:
- Scratch: falls back to tmpfs (RAM-backed) or a network-mounted scratch directory. Capacity is limited by available RAM or network storage quota.
- Image cache: no persistent local cache. Images are pulled from the registry on every allocation start (or served from a shared NFS cache if configured). Higher startup latency.
- Allocations requesting the `nvme_scratch` feature constraint will not be scheduled on these nodes.
The node agent detects local storage at startup and reports its availability as part of node capabilities (features: ["nvme_scratch"]).
Data Staging & Cache Lifecycle
Design Principle
Data staging is invisible to users. The scheduler pre-stages data during queue wait time, manages node-local caches with bounded eviction, and coordinates storage bandwidth to prevent I/O storms. Users declare data requirements; the system handles placement.
This document extends data-plane.md with operational details for staging, caching, and eviction.
Pre-Staging Pipeline
Trigger
When an allocation enters the `Pending` state and declares data mounts with `tier_hint: hot`:
- Scheduler queries the VAST API for data locality (`GET /catalog?path=...`)
- If the data is on the warm/cold tier: the scheduler issues a pre-stage request (`POST /dataspace/prefetch`)
- The allocation transitions to the `Staging` state (visible to the user via `lattice status`)
- When staging completes: the allocation is eligible for scheduling
Staging During Queue Wait
Pre-staging runs concurrently with queue waiting. If the allocation reaches the front of the scheduling queue before staging completes:
| Scenario | Action |
|---|---|
| Staging complete | Schedule immediately |
| Staging >80% complete | Schedule, accept brief I/O stall at start |
| Staging <80% complete | Hold in queue, f₅ (data_readiness) penalizes scheduling |
| Staging failed | Retry up to 3 times, then alert user and keep in queue |
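The table above reduces to a small decision function. `staging_decision` and its return labels are illustrative names; the 80% threshold and 3-retry limit are taken from the table.

```python
# Illustrative sketch of the scheduling decision when an allocation reaches
# the front of the queue before staging completes.

def staging_decision(state: str, progress: float, retries: int = 0) -> str:
    if state == "failed":
        return "retry" if retries < 3 else "alert_user_keep_queued"
    if progress >= 1.0:
        return "schedule_now"
    if progress > 0.80:
        return "schedule_accept_io_stall"   # brief I/O stall at start
    return "hold_in_queue"                  # f5 (data_readiness) penalizes scheduling
```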
Priority
Pre-stage requests are prioritized by:
- Estimated scheduling time (jobs closer to front of queue stage first)
- Data size (smaller datasets stage faster, unblock more jobs)
- Tenant fair share (tenants below their share get staging priority)
Bandwidth Coordination
The scheduler tracks aggregate staging bandwidth to avoid saturating the VAST system:
max_concurrent_staging_bandwidth = 0.3 × total_VAST_write_bandwidth
When the staging bandwidth limit is reached, additional staging requests are queued. This prevents staging from impacting running allocations’ I/O performance.
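The cap can be sketched as an admission controller. `StagingAdmission` is a hypothetical class; only the 0.3 factor comes from the formula above.

```python
# Illustrative sketch: admit a pre-stage request only while aggregate
# staging bandwidth stays under 30% of total VAST write bandwidth;
# requests beyond the cap are queued.

class StagingAdmission:
    def __init__(self, total_vast_write_bw_gbps: float):
        self.cap = 0.3 * total_vast_write_bw_gbps
        self.in_use = 0.0
        self.queued = []

    def request(self, bw_gbps: float) -> bool:
        """True if staging starts now; False if queued behind the cap."""
        if self.in_use + bw_gbps <= self.cap:
            self.in_use += bw_gbps
            return True
        self.queued.append(bw_gbps)
        return False

    def release(self, bw_gbps: float) -> None:
        self.in_use = max(0.0, self.in_use - bw_gbps)
```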
Node-Local Image Cache
Nodes with NVMe SSDs use a dedicated partition for image caching (uenv SquashFS and OCI layers). Local storage is optional — nodes without NVMe pull images directly from the registry on every allocation start, or use a shared NFS-based cache if configured. The scheduler accounts for this via the nvme_scratch feature: jobs that benefit from local caching can request it as a constraint.
Cache Layout
/var/cache/lattice/
├── uenv/ # SquashFS images
│ ├── prgenv-gnu_24.11_v1.squashfs
│ ├── pytorch_2.4_cuda12.squashfs
│ └── ...
├── oci/ # OCI container layers
│ ├── sha256:<hash>/
│ └── ...
└── metadata.json # Cache index: image → size, last_used, pin
Cache Parameters
| Parameter | Default | Description |
|---|---|---|
| `cache_partition_size` | 80% of NVMe (if present) | Reserved for image cache; ignored on nodes without NVMe |
| `cache_high_watermark` | 90% | Eviction starts when usage exceeds this |
| `cache_low_watermark` | 70% | Eviction stops when usage drops below this |
| `min_free_space` | 50 GB | Absolute minimum free space (overrides watermarks) |
Eviction Policy
LRU with pinning:
- When cache usage exceeds `cache_high_watermark`:
  - Evict least-recently-used images until usage drops below `cache_low_watermark`
- Never evict images marked as
stickyby admin (base OS images, common frameworks)
- Evict least-recently-used images until usage drops below
- Eviction order: LRU by last mount time, largest images first among equally-old entries
- If eviction cannot free enough space (all images pinned or sticky): alert raised, staging for new allocations pauses on this node
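The policy above can be modeled in a few lines. `eviction_order` and `evict_until` are illustrative names; in the real node agent the cache index lives in `metadata.json`.

```python
# Illustrative model of the eviction policy: LRU by last mount time,
# largest-first among equally old images, never evicting pinned or sticky
# entries.

def eviction_order(images: list) -> list:
    """Names of evictable images, in the order they would be evicted."""
    evictable = [i for i in images if not i.get("pinned") and not i.get("sticky")]
    # Oldest last_used first; among equal ages, largest size first.
    evictable.sort(key=lambda i: (i["last_used"], -i["size_gb"]))
    return [i["name"] for i in evictable]

def evict_until(images, usage_gb, low_watermark_gb):
    """Evict per eviction_order until usage drops below the low watermark."""
    freed, victims = 0.0, []
    by_name = {i["name"]: i for i in images}
    for name in eviction_order(images):
        if usage_gb - freed <= low_watermark_gb:
            break
        freed += by_name[name]["size_gb"]
        victims.append(name)
    return victims
```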
Cache-Full During Staging
If the node-local cache is full when a new allocation needs to pull an image:
- Check if eviction can free space → run eviction
- If eviction insufficient (all pinned): allocation’s prologue waits with backoff
- After 3 retries (5 minutes total): node marked as cache-full, scheduler avoids this node for allocations requiring uncached images
- Scheduler selects alternative nodes with cache space (or where the image is already cached)
Cache Warming
Administrators can pre-warm caches for anticipated workloads:
# Warm a uenv image on all nodes in a group
lattice cache warm --image=prgenv-gnu/24.11:v1 --group=3
# Warm on specific nodes
lattice cache warm --image=pytorch/2.4:cuda12 --nodes=x1000c0s0b0n0,x1000c0s0b0n1
Post-Reboot Cache Consistency
After a node reboot (nodes with NVMe only):
- Node agent reads `metadata.json` from the cache partition
- Validates each cached image (hash check against registry manifest)
- Images that fail validation are evicted
- Images that pass remain in cache (NVMe is persistent across reboots)
- Cache index rebuilt in ~seconds (metadata only, no full re-scan)
On nodes without NVMe, there is no persistent cache to recover — images are pulled fresh after reboot.
Allocation Data Lifecycle
Start (Prologue)
1. Node agent receives allocation assignment
2. Pull uenv image:
a. Check node-local cache → hit: mount directly
b. Cache miss: pull from registry → write to cache → mount
3. Mount data volumes:
a. NFS mounts (home, shared data): mount with VAST QoS policy
b. S3 mounts: FUSE or native S3 client
4. Create scratch directory: /scratch/local/{alloc_id}/ (NVMe) or /scratch/tmp/{alloc_id}/ (tmpfs/network)
5. Create output directory (S3): s3://{tenant}/{project}/{alloc_id}/
6. If checkpoint != none: create checkpoint directory
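Step 2's pull-through-cache logic can be sketched as follows. The cache and registry here are stand-in dicts for the node agent's internals, and `obtain_image` is a hypothetical name.

```python
# Illustrative sketch of the prologue image pull: check the node-local
# cache first, otherwise pull from the registry and write through to
# the cache before mounting.

def obtain_image(name: str, cache: dict, registry: dict) -> str:
    """Return the local path to mount, pulling through the cache on a miss."""
    if name in cache:
        return cache[name]                       # cache hit: mount directly
    if name not in registry:
        raise KeyError(f"image {name} not in registry")
    path = f"/var/cache/lattice/uenv/{name}.squashfs"
    cache[name] = path                           # write-through to cache
    return path
```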
During Execution
- NFS QoS maintained by VAST (bandwidth floor set at prologue)
- Scratch is node-local NVMe (if available) or tmpfs/network scratch
- Output is written to S3 (async, application-driven)
- Checkpoint broker coordinates checkpoint writes to avoid bandwidth storms
End (Epilogue)
1. Processes terminated (completed, failed, or killed)
2. Flush pending log chunks to S3
3. Unmount uenv image (stays in cache for future use)
4. Unmount NFS volumes
5. Clean scratch: rm -rf /scratch/local/{alloc_id}/
6. Release VAST QoS policy
7. Sensitive: trigger secure wipe sequence (cross-ref: node-lifecycle.md)
Data Retention
| Data Type | Location | Retention |
|---|---|---|
| uenv images | Node-local cache | Until evicted (LRU) |
| Logs | S3 | Configurable (default: 30 days) |
| Checkpoints | S3 | Configurable (default: 7 days after completion) |
| Output | S3 | User-managed (not auto-deleted) |
| Scratch | NVMe or tmpfs | Deleted at allocation end |
| Debug traces | S3 | Short (default: 7 days) |
| Sensitive audit logs | Cold tier (S3) | 7 years |
Storage Tier Migration
Data automatically migrates between tiers based on access patterns:
Hot (VAST NFS+S3) → Warm (capacity S3) → Cold (archive S3)
↑ pre-stage ↑ restore ↑ retrieve
| Trigger | Direction | Mechanism |
|---|---|---|
| Allocation queued with `tier_hint: hot` | Warm → Hot | Scheduler-initiated pre-stage |
| Data untouched for 30 days | Hot → Warm | VAST policy-driven (automatic) |
| Data untouched for 90 days | Warm → Cold | Storage policy (automatic) |
| User request or allocation references cold data | Cold → Warm/Hot | Explicit retrieval (may take hours) |
Sensitive exception: Sensitive data on hot tier stays on hot tier (no automatic migration). tier_restriction: hot_only prevents copies on shared warm/cold tiers.
Cross-References
- data-plane.md — Storage architecture, VAST API integration, protocol standardization
- scheduling-algorithm.md — f₅ data_readiness in cost function
- node-lifecycle.md — Sensitive node wipe sequence
- failure-modes.md — VAST unavailability handling
- sensitive-workloads.md — Sensitive storage policy
Federation Architecture
Design Principle
Federation is opt-in and sovereignty-first. The system is fully functional without it. When enabled, each site retains full control over its resources. The federation broker suggests, the local scheduler decides.
Feature Gate
Federation is compile-time optional via Rust feature flag:
# Cargo.toml (lattice-api)
[features]
default = []
federation = ["lattice-common/federation"]
When federation feature is disabled:
- No Sovra dependency
- No federation broker binary
- No cross-site API endpoints
- System operates as a standalone site
Trust Model: Sovra Integration
Sovra provides federated sovereign key management. Each site runs its own Sovra instance with its own root key.
Site A Sovra Instance Site B Sovra Instance
├── Site A Root Key (sovereign) ├── Site B Root Key (sovereign)
├── Workspace: "hpc-general" ├── Workspace: "hpc-general"
│ (shared federation key) │ (federated with Site A)
├── Workspace: "sensitive-ch" └── Policy: Site B OPA rules
│ (hospital CRK, delegated)
└── Policy: Site A OPA rules
Sovra Federation Protocol (peer-to-peer, no central authority)
Key Management Principles
- Site root keys never leave the site. All cross-site authentication uses derived keys from shared workspaces.
- Federation is revocable. Revoking a shared workspace invalidates all cross-site tokens. Instant defederation.
- Sensitive keys are tenant-controlled. The hospital (data owner) holds the Customer Root Key. The operating site holds a delegated key. If the relationship ends, the hospital retains access.
- Audit logs are cryptographically signed. Each site signs its audit entries with its own key. Cross-site audit trails are verifiable by any party in the trust chain.
Federation Components
Federation Broker
A Go service that runs alongside the scheduler (when federation feature is enabled).
Responsibilities:
- Advertises site capabilities to federated peers (available capacity, GPU types, energy prices, data locality)
- Receives federated allocation requests from peer sites
- Signs outbound requests with Sovra tokens
- Verifies inbound requests against Sovra trust chain + OPA policy
- Routes accepted requests into the local scheduling plane
Communication: gRPC over mTLS, with Sovra-signed metadata in request headers.
Federation Catalog
A read-mostly, eventually consistent shared catalog across federated sites:
| Content | Update Frequency | Consistency |
|---|---|---|
| Site capabilities (GPU types, node counts) | Hourly | Eventual |
| uenv image registry (cross-site name resolution) | On publish | Eventual |
| Dataset catalog (where data physically resides) | On change | Eventual |
| Tenant identity mapping (OIDC trust) | On federation setup | Strong (Sovra) |
| Energy prices per site | Every 15 minutes | Eventual |
Catalog Consistency and Staleness
The federation catalog is eventually consistent. Entries may be stale, missing, or outdated. The system must handle this gracefully:
Staleness bounds:
| Entry Type | Max Staleness | Effect of Stale Data |
|---|---|---|
| Site capabilities | 2 hours (hourly sync + margin) | May route job to site that no longer has capacity → remote rejection, retry locally |
| Energy prices | 30 minutes | May choose suboptimal site for energy cost → acceptable, not a correctness issue |
| Dataset catalog | Minutes (event-driven) | May not know data was moved → routing decision based on old location |
| uenv registry | Minutes (event-driven) | May reference image version not yet available at remote → prologue retry |
Handling completely stale entries:
If a peer site has not reported a catalog update within 2× the expected interval (e.g., no capability update in 2 hours):
- Federation broker marks the peer as `stale` in its local view
- Routing decisions deprioritize stale peers (not excluded, just scored lower)
- Alert raised: `lattice_federation_peer_stale{peer="site-b"}`
- If stale for more than 24 hours: peer marked `unreachable`, excluded from routing
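The staleness rules reduce to a single classification, sketched below with hypothetical names; the 2× interval and 24-hour cutoff come from the text.

```python
# Illustrative sketch: classify a federation peer by the age of its last
# catalog update.

def peer_state(hours_since_update: float, expected_interval_h: float = 1.0) -> str:
    if hours_since_update > 24.0:
        return "unreachable"   # excluded from routing entirely
    if hours_since_update > 2.0 * expected_interval_h:
        return "stale"         # deprioritized, not excluded
    return "healthy"
```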
Handling peer unavailability:
If a federated request fails (peer broker unreachable):
- First failure: retry with exponential backoff (1s, 2s, 4s, max 30s)
- After 3 retries: return failure to the user with explanation
- If `--site=auto`: fall back to local scheduling (no remote attempt)
- Peer marked as `degraded` in catalog; future requests deprioritize it
- Peer returns to `healthy` on next successful heartbeat/catalog sync
Cross-site uenv resolution:
uenv images are resolved via the federation catalog:
- User submits `--uenv=prgenv-gnu/24.11:v1` targeting Site B
- Federation broker checks whether Site B's catalog includes this image
- If present: proceed (Site B has the image or can pull it)
- If absent: warn user and proceed (Site B may pull from a shared registry)
- If pull fails at Site B: prologue failure, allocation retried or failed per policy
Job Routing Logic
The federation broker’s routing decision is advisory, not mandatory:
Input: Allocation request from remote site (or local user targeting remote)
Output: Recommendation (run locally, run at site X, reject)
Factors:
1. Data gravity: where does the input data physically reside?
→ Strong bias toward running where data is
2. Compute availability: does the target site have capacity?
→ Check advertised capacity (may be stale)
3. Energy cost: which site has cheaper power right now?
→ Time-varying electricity prices from catalog
4. Tenant authorization: is this user allowed at the target site?
→ OPA policy check via Sovra-delegated credentials
5. Data sovereignty: can the data legally transit to the target site?
→ Sensitive data: check jurisdiction constraints
Decision: route to site with best composite score, or reject if no site qualifies
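As an illustrative sketch, routing might filter on the hard constraints (authorization, sovereignty) and rank the remaining sites by a composite score. The weights below are invented; only the factors come from the list above.

```python
# Illustrative sketch of the advisory routing decision: hard constraints
# filter, soft factors score.

def route(sites: list):
    """Return the best site name, or None (reject) if no site qualifies."""
    def eligible(s):
        return s["authorized"] and s["sovereignty_ok"]

    def score(s):
        return (3.0 * s["data_local"]        # strong bias toward data gravity
                + 1.0 * s["capacity_frac"]   # advertised capacity (may be stale)
                - 0.5 * s["energy_price"])   # cheaper power scores higher

    candidates = [s for s in sites if eligible(s)]
    if not candidates:
        return None
    return max(candidates, key=score)["name"]
```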
Federated Allocation Flow
1. User at Site A submits: lattice submit --site=B train.sh
2. Site A lattice-api receives request, passes to federation broker
3. Federation broker:
a. Signs request with Sovra token (Site A workspace key)
b. Resolves target: Site B (explicit) or best-fit (if --site=auto)
c. Forwards to Site B's federation broker
4. Site B federation broker:
a. Verifies Sovra token (Site A is trusted peer)
b. Checks OPA policy (user authorized, resources available)
c. Injects allocation into Site B's scheduling plane
5. Site B local quorum manages allocation entirely
6. Status/logs available to user at Site A via federation catalog query
7. On completion: Site B reports results, Site A's user notified
Cross-Site Data Access
When a federated job runs at a remote site but needs data from the home site:
- Small data (<1 GB): Fetched on demand via S3 over WAN
- Medium data (1 GB - 1 TB): Pre-staged during queue wait via VAST DataSpace sync
- Large data (>1 TB): Strong recommendation to run job at data’s home site
- Sensitive data: Never transferred. Job must run at data’s home site. No exceptions.
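The size thresholds reduce to a small rule, sketched below with a hypothetical `data_transfer_plan` function; the 1 GB / 1 TB cut-offs and the sensitive-data rule come from the list above.

```python
# Illustrative sketch: pick a cross-site data-access strategy. Sensitive
# data short-circuits everything -- it is never transferred.

def data_transfer_plan(size_gb: float, sensitive: bool) -> str:
    if sensitive:
        return "run_at_home_site"            # never transferred, no exceptions
    if size_gb < 1:
        return "fetch_on_demand_s3_wan"
    if size_gb <= 1000:
        return "prestage_vast_dataspace"     # during queue wait
    return "recommend_run_at_home_site"
```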
Operational Considerations
Adding a Federation Peer
- Exchange Sovra workspace keys (out-of-band, verified by site admins)
- Configure federation broker with peer endpoint + workspace ID
- Define OPA policies for cross-site access
- Test with non-production allocations
- Enable in production
Removing a Federation Peer
- Revoke Sovra shared workspace
- All in-flight federated allocations continue to completion (or are cancelled by policy)
- Remove peer from federation broker config
- Immediate: no new federated requests accepted
Federation Requests During Leader Election
When the local Raft quorum is undergoing a leader election (typically 1-3 seconds):
- Inbound federated requests from peer sites receive a `503 Service Unavailable` with a `Retry-After: 5` header
- The federation broker does not queue inbound requests during election; the remote site's retry logic handles resubmission
- Outbound federated requests (local user targeting a remote site) are unaffected — routing and signing happen in the federation broker, not the quorum
- If the election takes longer than 10 seconds (unusual): the federation broker marks the local site as `degraded` in catalog updates to peers
Cross-References
- system-architecture.md — Control plane architecture
- security.md — Sovra trust model, mTLS
- sensitive-workloads.md — Sensitive data sovereignty
- failure-modes.md — Quorum leader loss recovery
Interactive Sessions
Design Principle
Interactive sessions are allocations with a terminal. They reuse the standard allocation lifecycle with additional terminal protocol handling. Sessions are not a separate concept — they are bounded or unbounded allocations with an attached PTY as the primary interaction mode.
Global session tracking (F20): Sessions are now tracked in GlobalState via Raft-committed CreateSession/DeleteSession commands. This enables:
- Global session limit enforcement: sensitive allocations limited to one concurrent session (INV-C2)
- Session survival across API server restarts
- Ownership verification at creation time (allocation must be Running, user must own it)
Session Creation
A session is created via POST /v1/sessions (or lattice session):
session:
tenant: "ml-team"
vcluster: "interactive" # typically the interactive FIFO vCluster
resources:
nodes: 1 # default: 1 node
constraints:
gpu_type: "GH200"
lifecycle:
type: "bounded"
walltime: "4h" # interactive sessions have walltime
environment:
uenv: "prgenv-gnu/24.11:v1"
Internally, the API server creates a standard Allocation with:
- `lifecycle.type = Bounded { walltime }`
- A flag indicating the terminal should auto-attach on scheduling
- Allocation state follows the normal lifecycle (Pending → Running → Completed)
Terminal Protocol
Connection Setup
1. Client connects: POST /v1/sessions → returns session_id + allocation_id
2. Allocation is scheduled (may wait in queue)
3. Once Running, client opens terminal: GET /v1/sessions/{id}/terminal (WebSocket upgrade)
4. WebSocket connection established to lattice-api
5. lattice-api opens gRPC bidirectional stream to the node agent
6. Node agent spawns PTY + user shell in allocation's mount/network namespace
Wire Protocol
The gRPC bidirectional stream carries framed messages:
Client → Server:
| Message Type | Content |
|---|---|
| `StdinData` | Raw bytes from client terminal |
| `Resize` | Terminal dimensions (rows, cols) |
| `Signal` | SIGINT, SIGTSTP, SIGHUP, SIGQUIT |
| `Keepalive` | Heartbeat (every 30s) |
Server → Client:
| Message Type | Content |
|---|---|
| `StdoutData` | Raw bytes from PTY (stdout + stderr merged) |
| `ExitCode` | Process exit code (terminal message) |
| `Error` | Error description (e.g., "allocation not running") |
Initial Terminal Size
The client sends a Resize message as the first message after connection. The node agent configures the PTY with these dimensions. If no Resize is sent, defaults to 80x24.
Signal Handling
| Signal | Client Action | Server Action |
|---|---|---|
| SIGINT (Ctrl+C) | Send Signal(SIGINT) | Node agent sends SIGINT to foreground process group |
| SIGTSTP (Ctrl+Z) | Send Signal(SIGTSTP) | Node agent sends SIGTSTP to foreground process group |
| SIGHUP | Connection close | Node agent sends SIGHUP to session process group |
| SIGQUIT (Ctrl+\) | Send Signal(SIGQUIT) | Node agent sends SIGQUIT to foreground process group |
| SIGWINCH | Send Resize(rows, cols) | Node agent calls ioctl(TIOCSWINSZ) on PTY |
Session Lifecycle
Active Session
While the terminal is connected:
- PTY output streams to client in real-time
- Client input streams to PTY stdin
- Keepalive every 30s to detect stale connections
- Session remains active as long as the WebSocket is open AND the shell process is alive
Disconnect and Reconnect
Client disconnect (network drop, laptop close):
- WebSocket closes (or keepalive timeout: 90s)
- Node agent sends SIGHUP to the session’s process group
- Default behavior: processes receive SIGHUP and exit
- If the user's shell ignores SIGHUP (e.g., `tmux`, `screen`):
  - Processes continue running in the background
  - User can reconnect: `lattice attach <alloc_id>`
  - Allocation walltime continues counting
Deliberate detach:
Users who want background sessions should use tmux or screen inside the session. Lattice does not implement a detach/reattach protocol — it delegates to proven tools.
Session Timeout
| Timeout | Default | Description |
|---|---|---|
| `idle_timeout` | 30 minutes | If no stdin for this duration, warn user. No auto-kill. |
| `walltime` | User-specified | Hard deadline. SIGTERM → SIGKILL → release. |
| `keepalive_timeout` | 90s | WebSocket keepalive. Missed → treat as disconnect. |
Idle warning: After idle_timeout, the terminal displays:
[lattice] Warning: session idle for 30 minutes. Walltime remaining: 3h 12m.
No automatic termination on idle — the user may be running a long computation.
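The three timers above act together as sketched below (hypothetical `session_events`; defaults from the table). Only walltime terminates the session; idle merely warns, and a missed keepalive follows the disconnect path.

```python
# Illustrative sketch of session timeout handling.

def session_events(idle_s, since_keepalive_s, elapsed_s, walltime_s,
                   idle_timeout_s=1800, keepalive_timeout_s=90) -> list:
    events = []
    if elapsed_s >= walltime_s:
        return ["sigterm_then_sigkill"]        # hard deadline wins
    if since_keepalive_s > keepalive_timeout_s:
        events.append("treat_as_disconnect")   # SIGHUP to process group
    if idle_s >= idle_timeout_s:
        events.append("warn_idle")             # warning only, no auto-kill
    return events
```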
Cleanup
When the session’s allocation reaches a terminal state (Completed, Failed, Cancelled):
- SIGTERM to all remaining processes
- Grace period (30s)
- SIGKILL
- Unmount uenv, release scratch, release nodes
- Session terminal sends
ExitCodeand closes WebSocket
Preemption During Active Session
When a session’s allocation is preempted while a terminal is connected:
- The checkpoint sequence begins (if `checkpoint != None`)
- The terminal remains connected during checkpointing; the user sees normal output
- When the checkpoint completes and the allocation transitions to `Suspended`:
  - Server sends a terminal message: `[lattice] Allocation preempted. Session suspended. Use 'lattice attach <id>' to reconnect after rescheduling.`
  - Server sends `ExitCode(-1)` and closes the stream
- When the allocation is rescheduled and resumes:
  - The user must manually reconnect: `lattice attach <id>`
  - The session starts a fresh shell (PTY state is not checkpointed)
  - Application state is restored from checkpoint (if the application supports it)
Multi-Node Sessions
For sessions requesting multiple nodes:
- The terminal connects to the first node (node 0)
- The user’s shell runs on node 0
- Other nodes are accessible via `ssh` (intra-allocation, uses the network domain)
- Or via `lattice attach <alloc_id> --node=<node_id>` (opens a second terminal to a specific node)
Concurrent Attach
| Scenario | Allowed | Notes |
|---|---|---|
| Same user, multiple terminals | Yes | Multiple attach sessions to the same allocation |
| Different users (non-sensitive) | No | Only the allocation owner can attach |
| Different users (sensitive) | No | Only the claiming user; one session at a time |
| Same user, different nodes | Yes | Each attach targets a specific node |
Slurm Compatibility
| Slurm | Lattice | Notes |
|---|---|---|
| `salloc -N2` | `lattice session --nodes=2` | Creates session allocation |
| `srun --jobid=123 --pty bash` | `lattice attach 123` | Attach to existing allocation |
| `salloc` then `srun` | `lattice session` then `lattice launch` | Session + task within allocation |
CLI Usage
# Create a session (waits for scheduling, then opens terminal)
lattice session --nodes=1 --walltime=4h --uenv=prgenv-gnu/24.11:v1
# Create with specific constraints
lattice session --nodes=2 --constraint=gpu_type:GH200 --walltime=8h
# Create in a specific vCluster
lattice session --vcluster=interactive --walltime=2h
# Attach to an existing session's allocation
lattice attach 12345
# Attach to a specific node
lattice attach 12345 --node=x1000c0s0b0n3
# Attach with a specific command (not the default shell)
lattice attach 12345 --command="nvidia-smi -l 1"
Cross-References
- observability.md — Attach architecture, authorization model, rate limiting
- api-design.md — Session API endpoints
- sensitive-workloads.md — Sensitive session constraints (one session, recording, signed uenv)
- cli-design.md — Full CLI command reference
Sensitive & Regulated Workload Design
Threat Model
Sensitive workloads on shared HPC infrastructure face regulatory requirements (Swiss FADP, EU GDPR, potentially HIPAA for international collaboration). The design must be defensible to an auditor.
What we must prove:
- Sensitive data was only accessible to authorized users during processing
- No other tenant’s workload ran on the same physical nodes simultaneously
- Data was encrypted at rest and in transit
- All access was logged with user identity and timestamp
- Data was destroyed when no longer needed
- Data did not leave the designated jurisdiction
Isolation Model: User Claims Node
Unlike other vClusters where the scheduler assigns nodes, sensitive nodes are claimed by a specific user:
Dr. X authenticates via OIDC (institutional IdP)
→ Requests 4 nodes via lattice CLI: lattice submit --sensitive
→ Quorum records: nodes N1-N4 owned by user:dr-x, tenant:hospital-a
→ Strong consistency: Raft commit before any workload starts
→ OpenCHAMI boots N1-N4 with hardened sensitive image (if not already)
→ All activity on N1-N4 audited under dr-x's identity
→ When released:
→ Quorum releases node ownership (Raft commit)
→ OpenCHAMI wipes node (memory scrub, storage secure erase if NVMe present)
→ Node returns to general pool only after wipe confirmation
No clever optimization on sensitive nodes. If Dr. X claims 4 nodes at 9am and runs nothing until 2pm, those nodes sit idle. The cost is real and should be visible to the tenant’s accounting. But there is no co-scheduling, no borrowing, no time-sharing.
Concurrent Sensitive Claims
If two users simultaneously attempt to claim overlapping nodes:
- First Raft commit wins. Node ownership is a strong consistency domain. The quorum serializes all claim requests via Raft.
- The second claim request receives an OwnershipConflict error with a message identifying which nodes are already claimed and by which user.
- The second user must select different nodes or wait for the first user to release.
- There is no queueing or waitlist for sensitive node claims — they are immediate or rejected.
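The first-commit-wins rule can be sketched as a serialized claim registry. This is an illustration only: names like ClaimRegistry are hypothetical, and in Lattice the serialization point is the Raft quorum, not an in-process map.

```python
class OwnershipConflict(Exception):
    """Raised when a claim overlaps nodes already owned by another user."""

class ClaimRegistry:
    def __init__(self):
        self._owner = {}  # node_id -> owning user

    def claim(self, user, nodes):
        # All-or-nothing: if any requested node is owned, the whole claim
        # is rejected immediately. No queueing, no waitlist.
        taken = {n: self._owner[n] for n in nodes if n in self._owner}
        if taken:
            raise OwnershipConflict(f"already claimed: {taken}")
        for n in nodes:
            self._owner[n] = user

    def release(self, user, nodes):
        for n in nodes:
            if self._owner.get(n) == user:
                del self._owner[n]
```

Note that a rejected claim commits nothing, so a later claim for the disjoint nodes succeeds once the conflict is resolved.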
OS Image
Sensitive nodes boot a hardened image via OpenCHAMI BSS:
- Minimal kernel, no unnecessary services
- Mandatory access control (SELinux/AppArmor enforcing)
- No SSH daemon (all access via API gateway)
- Encrypted swap (if any)
- Audit daemon (auditd) logging all syscalls to the audit subsystem
- Node agent with audit mode telemetry enabled by default
Software Delivery
Sensitive allocations use signed uenv images only:
```yaml
environment:
  uenv: "sensitive/validated-2024.1"   # curated, audited base stack
  sign_required: true                  # image signature verified before mount
  scan_required: true                  # CVE scan passed
  approved_bases_only: true            # can only use admin-approved base images
```
The uenv registry enforces:
- Image signing (with Sovra keys or site-specific PKI)
- Vulnerability scanning (integrated with JFrog/Nexus security scanning)
- Approved base image list (maintained by site security team)
- Audit log of all image pulls
Storage
Sensitive data lives in a dedicated storage pool:
```yaml
storage_policy:
  pool: "sensitive-encrypted"       # dedicated VAST view/tenant
  encryption: "aes-256-at-rest"     # VAST native encryption
  access_logging: "full"            # every read/write logged via VAST audit
  wipe_on_release: true             # VAST secure delete on allocation end
  data_sovereignty: "ch"            # data stays in Swiss jurisdiction
  retention:
    data: "user_specified"          # user declares retention period
    audit_logs: "7_years"           # regulatory minimum
  tier_restriction: "hot_only"      # no copies on shared warm/cold tiers
```
Network Isolation
Sensitive allocations get a dedicated Slingshot VNI:
```yaml
connectivity:
  network_domain: "sensitive-{user}-{alloc_id}"  # unique per allocation
  policy:
    ingress:
      deny-all-except:
        - same_domain      # only processes in this allocation
        - data_gateway     # controlled data ingress endpoint
    egress:
      deny-all-except:
        - data_gateway     # controlled data egress
```
With Ultra Ethernet: network-level encryption (UET built-in) provides an additional layer without performance penalty.
Audit Trail
What is logged (strong consistency via Raft):
- Node claim: user identity, timestamp, node IDs
- Node release: user identity, timestamp, wipe confirmation
- Allocation start/stop: what ran, which uenv image (with hash), which data paths
- Data access: every file open/read/write (from eBPF audit telemetry)
- API calls: every lattice-api call related to sensitive allocations
- Checkpoint events: when, where, what was written
- Attach sessions: user identity, start/end timestamps, target node, session recording reference
- Log access events: who accessed logs, when, which allocation
- Metrics queries: user identity, allocation queried, timestamp
Storage:
- Append-only log (no deletions, no modifications)
- Encrypted at rest (Sovra-managed keys if federation enabled, site PKI otherwise)
- 7-year retention on cold tier (S3-compatible, immutable storage)
- Cryptographically signed entries (tamper-evident)
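A minimal sketch of how append-only, signed entries become tamper-evident, assuming a hash chain in which each entry's digest covers the previous entry's digest. Here sha256 stands in for the real PKI/Sovra signatures.

```python
import hashlib
import json

class AuditLog:
    """Append-only log; any in-place edit breaks the digest chain."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._prev = self.GENESIS

    def append(self, event: dict) -> str:
        payload = json.dumps(event, sort_keys=True)
        digest = hashlib.sha256((self._prev + payload).encode()).hexdigest()
        self.entries.append({"event": event, "prev": self._prev, "digest": digest})
        self._prev = digest
        return digest

    def verify(self) -> bool:
        # Recompute the chain from the genesis value; a single modified,
        # deleted, or reordered entry fails verification.
        prev = self.GENESIS
        for e in self.entries:
            payload = json.dumps(e["event"], sort_keys=True)
            if e["prev"] != prev:
                return False
            if hashlib.sha256((prev + payload).encode()).hexdigest() != e["digest"]:
                return False
            prev = e["digest"]
        return True
```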
Query Interface
The audit log is queryable via a dedicated API endpoint and CLI:
API:

```
GET /v1/audit/logs?user=dr-x&since=2026-03-01&until=2026-03-15
GET /v1/audit/logs?allocation=12345
GET /v1/audit/logs?node=x1000c0s0b0n0&since=2026-03-01
GET /v1/audit/logs?data_path=s3://sensitive-data/subject-001/
```

CLI:

```shell
lattice audit query --user=dr-x --since=2026-03-01 --until=2026-03-15
lattice audit query --alloc=12345
lattice audit query --node=x1000c0s0b0n0 --since=2026-03-01 --output=json
```
Scoping:
| Caller | Visible Scope |
|---|---|
| Claiming user | Own audit events only |
| Tenant admin (compliance reviewer) | All audit events for their tenant |
| System admin | All audit events |
Indexing: Audit entries are indexed by:
- User ID (primary query dimension for compliance reporting)
- Allocation ID (all events for a specific allocation)
- Node ID (all events on a specific node)
- Timestamp (range queries, required for all queries)
- Event type (filter by: claim, release, data_access, attach, etc.)
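A toy illustration of filtering along these index dimensions. Field names are assumptions; a real query hits an indexed store, not a linear scan.

```python
def query(entries, *, user=None, allocation=None, node=None,
          since, until, event_type=None):
    # Timestamp range is mandatory, matching "required for all queries".
    out = []
    for e in entries:
        if not (since <= e["ts"] <= until):
            continue
        if user and e["user"] != user:
            continue
        if allocation and e["alloc"] != allocation:
            continue
        if node and e["node"] != node:
            continue
        if event_type and e["type"] != event_type:
            continue
        out.append(e)
    return out
```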
Performance targets:
| Query Scope | Expected Latency |
|---|---|
| Single allocation (any timeframe) | < 1s |
| Single user, 1-day range | < 2s |
| Single user, 30-day range | < 10s |
| Tenant-wide, 1-day range | < 30s |
Queries spanning more than 90 days may be served from cold tier (S3 archive) with higher latency (minutes).
Export: For regulatory submissions, audit logs can be exported as signed JSON bundles:
```shell
lattice audit export --user=dr-x --since=2026-01-01 --until=2026-06-30 --output=audit-report.json.sig
```
The export includes cryptographic signatures for tamper evidence.
Observability Constraints
Every user-facing observability feature has sensitive-specific restrictions. The principle: observability must not weaken the isolation model.
Attach
- Claiming user only. The user who claimed the nodes (identity verified against Raft audit log) is the only user permitted to attach. No delegation, no shared access.
- Session recording. All attach sessions are recorded (input + output bytes) and stored at s3://sensitive-audit/{tenant}/{alloc_id}/sessions/{session_id}.recording (zstd-compressed, encrypted at rest, 7-year retention). The session recording reference is a Raft-committed audit entry.
- Signed uenv only. Attach is only permitted when the allocation runs a signed, vulnerability-scanned uenv image. This prevents attaching to environments with unvetted tools.
- No concurrent attach from different sessions. One active attach session per allocation at a time (prevents accidental data exposure via shared terminal).
Logs
- Encrypted at rest. Logs from sensitive allocations are stored in the dedicated encrypted S3 pool (same as sensitive data).
- Access-logged. Every log access (live tail or historical) generates an audit entry with user identity and timestamp.
- Restricted access. Only the claiming user and designated compliance reviewers (via tenant admin role) can access logs.
- Retention follows data policy. Log retention matches the allocation’s sensitive data retention policy, not the default log retention.
Metrics
- Low sensitivity, still scoped. Metrics (GPU%, CPU%, I/O rates) do not contain sensitive data, but are still scoped to the claiming user. Tenant admins can view aggregated usage.
- No cross-tenant visibility. Even system admins see sensitive allocation metrics only in aggregate (holistic view), not per-allocation detail.
Diagnostics
- No cross-allocation comparison for sensitive. The CompareMetrics RPC rejects requests that include sensitive allocation IDs alongside non-sensitive ones. Comparison within a single sensitive tenant is permitted (same claiming user).
- Network diagnostics scoped. Network diagnostics for sensitive allocations only show the allocation’s own VNI traffic, not fabric-wide metrics.
Profiling
- Signed tools_uenv only. Profiling tools must be delivered via a signed, approved tools_uenv image. Users cannot load arbitrary profiler binaries.
- Profile output stays in sensitive pool. All profiling output is written to the encrypted sensitive storage pool and is subject to the same access logging and retention policies.
Federation Constraints
Sensitive data does not federate by default:
- Data stays at the designated site (data sovereignty)
- Compute can theoretically federate (run at remote site), but only if:
  - Remote site meets the same compliance requirements
  - Data does not transit (remote compute accesses data via encrypted API, not bulk transfer)
  - Both sites’ Sovra instances have a sensitive workspace with hospital CRK
- In practice: sensitive jobs run where the data is. Period.
Conformance Requirements
Sensitive nodes have strict conformance enforcement. Unlike general workloads where conformance is a soft preference, sensitive workloads treat configuration drift as a hard constraint:
- Pre-claim validation. Before a node can be claimed for sensitive use, the scheduler verifies its conformance fingerprint matches the expected baseline for the sensitive vCluster. Drifted nodes are rejected.
- Drift triggers drain. If a sensitive node’s conformance fingerprint changes during operation (e.g., a firmware update was missed), the node agent flags the drift. The scheduler will not assign new sensitive claims to the node until OpenCHAMI remediates it.
- Audit trail. Conformance state changes on sensitive nodes are recorded in the Raft-committed audit log (which firmware/driver versions were active during the allocation).
This is deliberately conservative: sensitive workloads do not tolerate the subtle failures that configuration drift can cause, and regulatory compliance requires provable consistency of the execution environment.
Scheduler Behavior
The sensitive vCluster scheduler is intentionally simple:
- Algorithm: Reservation-based (not knapsack). User claims nodes, scheduler validates and commits.
- No backfill. Sensitive nodes are not shared.
- No preemption. Sensitive allocations are never preempted.
- No elastic borrowing. Sensitive nodes cannot be borrowed by other vClusters.
- Fair-share: Not applicable (nodes are user-claimed, not queue-scheduled).
- Conformance: Hard constraint — only nodes matching the expected conformance baseline are eligible.
- Cost function weights: priority=0.90, conformance=0.10 (tiebreaker among conformant nodes; non-conformant nodes are excluded as a hard constraint at the solver level, not via the weight system), everything else near-zero.
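The hard-constraint behavior can be sketched as follows. This is an illustration of the reservation path with hypothetical names, not the actual solver: non-conformant nodes are filtered out before any scoring, so the weights only ever tiebreak among already-eligible nodes.

```python
def claim_nodes(requested_count, nodes, baseline):
    # Hard constraint: non-conformant or non-ready nodes are excluded
    # outright, before any cost-function scoring happens.
    conformant = [n for n in nodes
                  if n["fingerprint"] == baseline and n["state"] == "ready"]
    if len(conformant) < requested_count:
        raise RuntimeError("insufficient conformant nodes for sensitive claim")
    # Reservation-based: validate and commit the user's request as-is.
    # No knapsack, no backfill, no preemption, no elastic borrowing.
    return [n["id"] for n in conformant[:requested_count]]
```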
Accounting
Design Principle
Lattice schedules, Waldur accounts. Accounting is asynchronous and optional (feature-flagged like federation). Waldur unavailability never blocks scheduling.
What is Waldur
Waldur is a hybrid cloud orchestrator with HPC integration, accounting, billing, and self-service portal. It provides:
- Resource usage tracking and billing
- Project-level budget management
- Self-service quota requests
- Invoice generation
Integration is via Waldur’s REST API.
Integration Pattern
```
Lattice ──async push──→ Waldur  (accounting events)
Waldur  ──API call──→ Lattice   (quota updates)
```
Lattice pushes accounting events to Waldur asynchronously. Waldur can push quota updates back. The two systems are loosely coupled — neither depends on the other for core functionality.
Accounting Events
Events pushed from Lattice to Waldur:
| Event | Trigger | Payload |
|---|---|---|
allocation.started | Allocation enters Running state | tenant, project, user, resources (nodes, GPUs, GPU type), estimated duration |
allocation.completed | Allocation reaches terminal state | actual duration, GPU-hours consumed, exit status, storage bytes written |
allocation.checkpointed | Checkpoint written | checkpoint storage consumed, checkpoint duration |
node.claimed | Sensitive node claimed by a user | tenant, user, node IDs, claiming timestamp |
node.released | Sensitive node released | tenant, user, node IDs, release timestamp, wipe confirmation |
quota.updated | Waldur updates a tenant’s quota | new quota values (Waldur → Lattice direction) |
Events are timestamped and include the allocation ID for correlation.
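A sketch of event assembly, showing the timestamp and correlation key every event carries. Field names are assumptions drawn from the table; the real payload types live in lattice-common.

```python
import time

def make_event(kind, allocation_id, tenant, **payload):
    # Every event is timestamped and carries the allocation ID so Waldur
    # can correlate started/completed/checkpointed events for one allocation.
    return {"event": kind,
            "allocation_id": allocation_id,
            "tenant": tenant,
            "timestamp": time.time(),
            **payload}
```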
Entity Mapping
| Lattice Entity | Waldur Entity | Notes |
|---|---|---|
| Tenant | Customer | 1:1 mapping |
| Project (within tenant) | Project | 1:1 mapping |
| vCluster | Offering | Each vCluster type is a service offering |
| Allocation | Order | Each allocation is a resource order |
Waldur API Endpoints Used
| Direction | Endpoint | Purpose |
|---|---|---|
| Lattice → Waldur | POST /api/marketplace-orders/ | Report resource usage |
| Lattice → Waldur | POST /api/invoices/{id}/items/ | Add billing line items |
| Waldur → Lattice | GET /api/customers/{id}/quotas/ | Read project quotas |
| Waldur → Lattice | PUT /api/v1/tenants/{id} | Update tenant quotas in Lattice |
Authentication
Waldur API token is stored in a secrets manager (never in config files):
```yaml
waldur:
  token_secret_ref: "vault://lattice/waldur-token"
```
The token is loaded at startup and refreshed on rotation. Cross-ref: security.md for secret management.
Failure Handling
Waldur unavailability must never block scheduling:
- Buffer: Accounting events are buffered in a bounded in-memory queue (default: 10,000 events)
- Persist: If the buffer fills, overflow events are persisted to disk (WAL-style append log)
- Replay: On Waldur reconnection, buffered and persisted events are replayed in order
- Alert: If the disk buffer exceeds a threshold (default: 100,000 events), an alert is raised via scheduler self-monitoring (cross-ref: telemetry.md)
- Degrade gracefully: If both buffer and disk are full, events are dropped with a counter metric (lattice_accounting_events_dropped_total). Scheduling continues.
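The degradation path can be sketched as follows. This is illustrative: the real overflow store is an on-disk WAL, modeled here as a second in-memory queue.

```python
from collections import deque

class AccountingBuffer:
    def __init__(self, mem_capacity=10_000, disk_capacity=100_000):
        self.mem = deque()
        self.disk = deque()            # stands in for the on-disk append log
        self.mem_capacity = mem_capacity
        self.disk_capacity = disk_capacity
        self.dropped_total = 0         # lattice_accounting_events_dropped_total

    def push(self, event):
        if len(self.mem) < self.mem_capacity:
            self.mem.append(event)
        elif len(self.disk) < self.disk_capacity:
            self.disk.append(event)    # overflow persisted WAL-style
        else:
            self.dropped_total += 1    # degrade gracefully; scheduling continues

    def replay(self):
        # On Waldur reconnection, drain in original order: the in-memory
        # queue filled first, so it holds the oldest events.
        while self.mem:
            yield self.mem.popleft()
        while self.disk:
            yield self.disk.popleft()
```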
Operational Response to Buffer Overflow
When the accounting buffer fills and events are dropped:
- Detect: the lattice_accounting_events_dropped_total counter increments. Alert fires when > 0.
- Impact: Billing data is incomplete. GPU-hours and allocation events are missing from Waldur. This affects invoice accuracy but never affects scheduling.
- Respond:
  - Check Waldur availability (lattice admin accounting status)
  - If Waldur is down: wait for recovery. Buffered events will replay. Dropped events are lost.
  - If Waldur is up but slow: check push interval and batch size. Increase push_interval_seconds to allow larger batches.
- Recovery: Dropped events cannot be recovered from the accounting pipeline. However, the quorum has allocation state (start/end times, node assignments). An admin can reconstruct missing billing data from quorum logs with lattice admin accounting reconcile --since=2026-03-01 --until=2026-03-02. This command reads allocation history from the quorum and generates compensating events for Waldur.
- Prevention: Size the buffer for expected Waldur outage duration. Rule of thumb: buffer_size = events_per_minute × max_expected_outage_minutes. For a busy cluster (100 events/min) and a 2-hour outage target: buffer_size = 12000.
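The sizing rule as arithmetic:

```python
def buffer_size(events_per_minute, max_expected_outage_minutes):
    # One buffered slot per event expected during the worst-case outage.
    return events_per_minute * max_expected_outage_minutes
```

For the example above: 100 events/min over a 120-minute outage target gives 12,000 events.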
Quota Feedback Loop
Waldur can act as the budget authority, updating Lattice tenant quotas:
1. Waldur detects budget exhaustion (e.g., project spent its allocated compute hours)
2. Waldur calls lattice-api: PUT /api/v1/tenants/{id} with reduced limits
3. Lattice updates hard/soft quotas (cross-ref: quota-enforcement.md)
4. Effect: tenant’s new allocations are blocked (hard quota) or deprioritized (soft quota)
Conversely, when a tenant purchases more compute:
1. Waldur increases the tenant’s quota
2. Lattice picks up the new limits
3. Previously-starved allocations can now be scheduled
Sensitive Accounting
Sensitive allocations have additional accounting requirements:
- All accounting events include the claiming user’s identity (not just tenant)
- Idle node time (nodes claimed but no running allocation) is billable — Waldur receives node.claimed and node.released events
- Accounting events for sensitive allocations are also written to the Raft-committed audit log (cross-ref: sensitive-workloads.md)
- Waldur must retain sensitive billing records for 7 years (configured on the Waldur side)
Configuration
```yaml
accounting:
  enabled: true                      # feature flag, default: false
  provider: "waldur"
  waldur:
    api_url: "https://waldur.example.com/api/"
    token_secret_ref: "vault://lattice/waldur-token"
    push_interval_seconds: 60        # batch push interval
    buffer_size: 10000               # in-memory event buffer
    disk_buffer_path: "/var/lib/lattice/accounting-wal"
    disk_buffer_max_events: 100000
```
When accounting.enabled is false, no accounting code runs and no Waldur dependency exists (same pattern as federation).
Cross-References
- quota-enforcement.md — Waldur updates quotas, hard vs. soft semantics
- failure-modes.md — Accounting service failure buffering
- security.md — Waldur API token management
- sensitive-workloads.md — Sensitive billing and audit requirements
- telemetry.md — Accounting buffer metrics in scheduler self-monitoring
Slurm Migration
Design Principle
Migration from Slurm should be gradual and low-risk. Existing Slurm scripts should work with minimal changes via the compatibility layer. Users can adopt Lattice-native features incrementally. The goal is not perfect Slurm emulation — it’s a smooth on-ramp.
Migration Phases
Phase 1: Dual-Stack (Recommended Start)
Run Lattice alongside Slurm on a subset of nodes. Users can submit to either system. This provides:
- Side-by-side comparison of scheduling behavior
- Gradual user migration with rollback to Slurm
- Time to validate RM-Replay weight tuning
Phase 2: Compat-Mode Cutover
Move all nodes to Lattice. Users continue using sbatch/squeue via compatibility aliases. Slurm daemons are decommissioned.
Phase 3: Native Adoption
Users migrate scripts to native lattice CLI, adopting features not available in Slurm (reactive scaling, metric-driven autoscaling, DAG workflows, data staging hints).
Script Compatibility
Supported #SBATCH Directives
| Slurm Directive | Lattice Mapping | Notes |
|---|---|---|
--nodes=N | resources.nodes: N | Exact match |
--ntasks=N | Mapped to node count | nodes = ceil(N / tasks_per_node) |
--ntasks-per-node=N | Passed as task config | Used by launcher |
--time=HH:MM:SS | lifecycle.walltime | Exact match |
--partition=X | vcluster: X | Partition name → vCluster name mapping |
--account=X | tenant: X | Account → tenant mapping |
--job-name=X | tags.name: X | Stored as tag |
--output=file | Log path hint | Logs always go to S3; --output sets download path |
--error=file | Log path hint | Same as --output |
--constraint=X | constraints.features | Feature matching |
--gres=gpu:N | constraints.gpu_count | Mapped to GPU constraint |
--exclusive | Default behavior | Lattice schedules full nodes by default (ADR-007) |
--array=0-99%20 | task_group | Task group with concurrency limit |
--dependency=afterok:123 | depends_on: [{ref: "123", condition: "success"}] | DAG edge |
--qos=X | preemption_class | QoS → priority mapping (configurable per site) |
--mail-user, --mail-type | Not supported | Warn, skip |
--mem=X | Not supported | Full-node scheduling; memory is not a constraint |
--cpus-per-task=N | Not supported | Full-node scheduling |
--uenv=X | environment.uenv: X | Lattice extension, not in Slurm |
--view=X | environment.view: X | Lattice extension |
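A sketch of how a compat parser might apply this table. Directive coverage is truncated and all field names here are assumptions; the real translation lives in the compatibility layer.

```python
import math
import re

def translate(script: str):
    """Map #SBATCH directives to a Lattice-style spec, warning (never
    failing) on unsupported options."""
    spec, warnings = {}, []
    ntasks = tasks_per_node = None
    for line in script.splitlines():
        m = re.match(r"#SBATCH\s+(--[\w-]+)(?:=(\S+))?", line)
        if not m:
            continue
        key, val = m.group(1), m.group(2)
        if key == "--nodes":
            spec["nodes"] = int(val)
        elif key == "--ntasks":
            ntasks = int(val)
        elif key == "--ntasks-per-node":
            tasks_per_node = int(val)
        elif key == "--time":
            spec["walltime"] = val
        elif key == "--partition":
            spec["vcluster"] = val
        elif key == "--account":
            spec["tenant"] = val
        elif key == "--mem" or key.startswith("--mail"):
            warnings.append(f"{key} ignored")   # warn, never fail
        # ... remaining directives follow the mapping table above
    # --nodes takes precedence; otherwise derive nodes = ceil(ntasks / per_node)
    if "nodes" not in spec and ntasks:
        spec["nodes"] = math.ceil(ntasks / (tasks_per_node or 1))
    return spec, warnings
```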
Unsupported Directives
Directives that have no Lattice equivalent are handled gracefully:
```
Warning: #SBATCH --mem=64G ignored (Lattice uses full-node scheduling, memory is not constrainable)
Warning: #SBATCH --mail-user=user@example.com ignored (use `lattice watch` for event notifications)
Submitted allocation 12345
```
The submission succeeds — unsupported directives produce warnings, not errors. This is critical for migration: existing scripts should not fail because of irrelevant Slurm options.
Conflicting Directives
| Conflict | Resolution |
|---|---|
--nodes=64 + --ntasks=128 with --ntasks-per-node=4 | --nodes takes precedence; ntasks-per-node used by launcher |
--exclusive + --mem=64G | --exclusive is default; --mem ignored with warning |
--partition not found | Error: vCluster "X" not found. Available: hpc-batch, ml-training, interactive |
Slurm Features Not Supported
These Slurm features have no Lattice equivalent and are not planned:
| Feature | Reason | Alternative |
|---|---|---|
Job steps (srun within sbatch) | Lattice uses tasks within allocations | lattice launch --alloc=<id> |
| Hetjob (heterogeneous job) | Not yet designed | Submit separate allocations with DAG dependencies |
Burst buffer (#DW) | DataWarp-specific | Use data.mounts with tier_hint: hot |
| GRES beyond GPU | Not needed (full-node scheduling) | Use constraints.features for non-GPU resources |
Accounting (sacctmgr) | Waldur handles accounting | lattice history or Waldur portal |
Reservations (scontrol create reservation) | Use sensitive claims for dedicated nodes | lattice admin reserve (future) |
Licenses/resources (--licenses=) | Not applicable | Use constraints.features |
Multi-cluster (--cluster=) | Use federation | lattice submit --site=X (if federation enabled) |
srun Within Allocations
Slurm users often use srun inside batch scripts to launch parallel tasks. In Lattice:
```shell
# Slurm pattern:
srun -n 256 ./my_mpi_program

# Lattice equivalent (inside a running allocation):

# Option 1: The entrypoint IS the parallel launch.
# In the submission script, use the appropriate launcher directly:
mpirun -np 256 ./my_mpi_program
# or:
torchrun --nproc_per_node=4 ./train.py

# Option 2: Use lattice launch from another terminal
lattice launch --alloc=12345 -n 256 ./my_mpi_program
```
The compatibility layer translates srun to lattice launch when the compat aliases are active.
Environment Variables
Slurm sets many environment variables in jobs. Lattice provides equivalent variables:
| Slurm Variable | Lattice Variable | Description |
|---|---|---|
SLURM_JOB_ID | LATTICE_ALLOC_ID | Allocation ID |
SLURM_JOB_NAME | LATTICE_JOB_NAME | Job name (from tags) |
SLURM_NODELIST | LATTICE_NODELIST | Comma-separated node list |
SLURM_NNODES | LATTICE_NNODES | Number of nodes |
SLURM_NPROCS | LATTICE_NPROCS | Number of tasks |
SLURM_ARRAY_TASK_ID | LATTICE_TASK_INDEX | Task group index |
SLURM_ARRAY_JOB_ID | LATTICE_TASK_GROUP_ID | Task group parent ID |
SLURM_SUBMIT_DIR | LATTICE_SUBMIT_DIR | Submission directory |
SLURM_JOBID | LATTICE_ALLOC_ID | Alias for compatibility |
For migration convenience, the compat layer can also set SLURM_* variables (configurable: compat.set_slurm_env=true). This is disabled by default to avoid confusion.
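A sketch of the variable translation including the optional SLURM_* shim. Names follow the table above; build_env itself is hypothetical.

```python
# LATTICE_* names to the SLURM_* aliases the compat shim can also set.
ENV_MAP = {
    "LATTICE_ALLOC_ID":   ["SLURM_JOB_ID", "SLURM_JOBID"],
    "LATTICE_NNODES":     ["SLURM_NNODES"],
    "LATTICE_NODELIST":   ["SLURM_NODELIST"],
    "LATTICE_TASK_INDEX": ["SLURM_ARRAY_TASK_ID"],
}

def build_env(values, set_slurm_env=False):
    """values is keyed by LATTICE_* names; the SLURM_* shim is opt-in
    (compat.set_slurm_env), disabled by default to avoid confusion."""
    env = dict(values)
    if set_slurm_env:
        for lattice_name, slurm_names in ENV_MAP.items():
            if lattice_name in env:
                for alias in slurm_names:
                    env[alias] = env[lattice_name]
    return env
```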
Partition-to-vCluster Mapping
Sites configure the mapping from Slurm partition names to Lattice vClusters:
```yaml
# lattice-compat.yaml
partition_mapping:
  normal: "hpc-batch"
  debug: "interactive"
  gpu: "ml-training"
  long: "hpc-batch"          # multiple partitions can map to one vCluster
  sensitive: "sensitive-secure"

qos_mapping:
  low: 1
  normal: 4
  high: 7
  urgent: 9
```
Unmapped partition names produce an error with a list of available vClusters.
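A sketch of the lookup and its error path (illustrative; the mapping values mirror the config example above):

```python
PARTITION_MAP = {"normal": "hpc-batch", "debug": "interactive", "gpu": "ml-training"}

def resolve_partition(name, mapping=PARTITION_MAP):
    """Return the vCluster for a Slurm partition name, or raise with the
    list of available vClusters when the partition is unmapped."""
    try:
        return mapping[name]
    except KeyError:
        available = ", ".join(sorted(set(mapping.values())))
        raise ValueError(
            f'vCluster mapping for partition "{name}" not found. '
            f'Available: {available}') from None
```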
Migration Checklist
For site administrators:
- Deploy Lattice control plane alongside Slurm
- Configure partition-to-vCluster mapping
- Configure QoS-to-preemption-class mapping
- Tune cost function weights using RM-Replay with production traces
- Test representative batch scripts via compat layer
- Validate accounting (Waldur) captures match Slurm sacct data
- Train users on lattice CLI basics
- Run dual-stack for 2-4 weeks
- Migrate remaining users, decommission Slurm
For users:
- Test existing scripts with lattice submit (compat mode parses #SBATCH)
- Review warnings for unsupported directives
- Replace srun in scripts with direct launcher commands (mpirun, torchrun)
- (Optional) Migrate to native lattice CLI syntax for new workflows
Cross-References
- api-design.md — Compatibility API command mapping
- cli-design.md — Native CLI design and compat aliases
- sessions.md — salloc equivalent
- dag-scheduling.md — DAG dependencies (replaces --dependency)
- mpi-process-management.md — MPI launch, PMI-2, srun replacement
Troubleshooting Guide
Allocation Stuck in Pending
Symptom: lattice status shows allocation in Pending for longer than expected.
Diagnosis:
```shell
# Check why the allocation isn't being scheduled
lattice status 12345 --verbose
```
| Verbose Output | Cause | Fix |
|---|---|---|
waiting for quota headroom | Tenant hard quota (max_nodes or max_concurrent_allocations) exceeded | Cancel other allocations or request quota increase |
no nodes matching constraints | No nodes with requested GPU type, features, or topology | Relax constraints (--topology=any), check lattice nodes --state=ready |
data staging in progress | Input data being pre-staged from warm/cold tier | Wait (check progress with lattice status 12345 --verbose), or submit without tier_hint: hot |
insufficient conformance group | Not enough nodes with matching conformance fingerprint for multi-node job | Reduce node count, or wait for OpenCHAMI to remediate drifted nodes |
all suitable nodes occupied | Resources are busy; allocation is queued normally | Wait; check queue depth with lattice status --state=pending |
soft quota penalty (low score) | GPU-hours budget nearly exhausted; allocation deprioritized | Request budget increase from tenant admin or Waldur portal |
Deeper investigation:
```shell
# Check scheduler cycle is running
lattice admin scheduler status --vcluster=hpc-batch

# Check if proposals are being rejected
lattice admin raft status

# View scheduling metrics
# (high proposal rejection rate may indicate race conditions or quota contention)
```
Scheduling Cycle Slow
Symptom: lattice_scheduling_cycle_duration_seconds p99 > 30s.
Diagnosis:
| Check | Command | What to Look For |
|---|---|---|
| Queue depth | lattice status --state=pending --count | > 500 pending allocations |
| Cost function time | Grafana: lattice_scheduling_cost_function_duration_seconds | Dominant component of cycle |
| Conformance group fragmentation | lattice nodes -o wide \| sort -k7 \| uniq -c | Many small groups |
| Topology solver | Grafana: cycle time breakdown | Multi-group spanning expensive |
Fixes:
| Cause | Fix |
|---|---|
| Too many pending allocations | Increase cycle interval to batch more proposals |
| Cost function slow | Check if custom metrics (f₅ data_readiness) are causing TSDB query delays |
| Conformance fragmented | Standardize firmware, or reduce w₉ for tolerant workloads |
| Topology solver | Reduce backfill depth, or allow topology: any for more jobs |
Node Stuck in Degraded/Down
Symptom: Node shows Degraded or Down in lattice nodes.
Diagnosis:
```shell
# Check node details
lattice nodes x1000c0s0b0n0

# Check heartbeat
# If heartbeat missing: node agent may be down or network partitioned
```
| State | Duration | Likely Cause | Fix |
|---|---|---|---|
| Degraded | < 2 min | Transient network blip | Wait; likely self-resolves |
| Degraded | > 5 min | Agent crash or network partition | SSH to node, check agent: systemctl status lattice-agent |
| Down | any | Agent not recovering | Check BMC via OpenCHAMI: manta node status x1000c0s0b0n0 |
| Down (BMC unreachable) | any | Hardware failure | Physical inspection required |
Recovery:
```shell
# If agent crashed, restart it
ssh x1000c0s0b0n0 systemctl restart lattice-agent

# If node needs reboot
lattice node disable x1000c0s0b0n0
# (coordinate with OpenCHAMI for reboot)
lattice node undrain x1000c0s0b0n0   # after reboot + health check
```
Raft Commit Latency High
Symptom: lattice_raft_commit_latency_seconds p99 > 1s.
Diagnosis:
| Check | What to Look For |
|---|---|
| Disk I/O on quorum members | WAL write latency. Quorum members need fast SSD. |
| Network between quorum members | Packet loss or high latency between quorum nodes |
| Leader overloaded | Too many proposals per second |
| Log compaction | Snapshot in progress (one-time spike, normal) |
Fixes:
| Cause | Fix |
|---|---|
| Slow disk | Move WAL to dedicated NVMe SSD |
| Network latency | Ensure quorum members are on low-latency network (same rack or switch) |
| Leader overload | Increase scheduling cycle interval to reduce proposal rate |
| Log too large | Reduce snapshot interval (more frequent snapshots = smaller log) |
Allocation Fails During Prologue
Symptom: Allocation moves from Running to Failed within seconds of starting.
Diagnosis:
```shell
lattice logs 12345
# Look for prologue errors:
# "uenv pull failed: hash mismatch"
# "mount failed: ENOSPC"
# "NFS mount timeout"
```
| Error | Cause | Fix |
|---|---|---|
| Hash mismatch | Corrupted image in cache or registry | lattice cache evict --image=... --node=... and retry |
| ENOSPC | Node-local cache full, eviction couldn’t free space | Check cache status: lattice cache status --node=.... Evict unused images manually. |
| NFS mount timeout | VAST unavailable or network issue | Check VAST health. Check Slingshot storage traffic class. |
| Image not found | uenv name/version doesn’t exist in registry | Verify with lattice cache status --node=... or check the uenv registry directly |
Preemption Not Working
Symptom: Higher-priority allocation waiting despite lower-priority allocations running on suitable nodes.
Diagnosis:
lattice status 12345 --verbose
# Check if preemption is enabled for this vCluster
lattice admin vcluster show hpc-batch
| Cause | Fix |
|---|---|
| Pending job’s priority class ≤ running jobs’ class | Preemption only works downward. Check priority classes. |
Running jobs are non-preemptible (checkpoint: none + high class) | Wait for them to complete |
| Running jobs are near completion (>90% walltime) | Scheduler avoids preempting near-completion jobs. Wait. |
| vCluster doesn’t allow preemption | Check vCluster config. Service vClusters only preempt borrowed nodes. |
Autoscaling Not Triggering
Symptom: Reactive allocation stays at min_nodes despite high metric value.
Diagnosis:
```shell
# Check current metric value
lattice top 12345 --metric=gpu_utilization

# Check scaling events
lattice status 12345 --verbose
```
| Cause | Fix |
|---|---|
| Metric below target | Scaling only triggers when metric > target for scale_up_window (2 min) |
| Cooldown period active | Recent scale event; wait for cooldown (3 min default) |
| TSDB query failing | Check lattice_autoscaling_metric_query_failures_total metric |
| Tenant quota exhausted | max_nodes reached; scale-up is a no-op |
| Metric name wrong | Verify metric exists in TSDB: lattice top 12345 --metric=<name> |
Sensitive Node Won’t Accept Claims
Symptom: Sensitive node claim rejected.
Diagnosis:
| Check | What to Look For |
|---|---|
lattice nodes <id> | Is node in Ready state? (Not Degraded, Down, Draining) |
| Conformance | Is node’s conformance fingerprint matching the sensitive baseline? |
| Pool size | Is sensitive_pool_size quota exhausted? |
| Previous wipe | Was the node properly wiped after last sensitive use? |
Fix:
```shell
# Check conformance
lattice nodes x1000c0s0b0n0 -o wide
# If drifted: coordinate with OpenCHAMI for remediation

# Check sensitive pool
lattice admin tenant show hospital-a --quotas
# If exhausted: release unused sensitive nodes or increase pool
```
Log Collection
When filing a bug report or escalating, collect:
```shell
# System overview
lattice admin raft status > diag/raft.txt
lattice nodes -o json > diag/nodes.json
lattice status --all -o json > diag/allocations.json

# Recent scheduler metrics (last hour)
lattice admin metrics dump --component=scheduler --duration=1h > diag/scheduler-metrics.json

# Specific node agent logs (if relevant)
ssh x1000c0s0b0n0 journalctl -u lattice-agent --since="1 hour ago" > diag/agent.log
```
Cross-References
- failure-modes.md — Expected failure patterns and recovery
- node-lifecycle.md — Node state transitions and timeouts
- preemption.md — Preemption policy and classes
- autoscaling.md — Scaling loop and error handling
- data-staging.md — Cache management and staging pipeline
- tuning-guide.md — Cost function tuning for performance issues
Architecture Decision Records
Template
Each ADR follows this format:
- Status: Proposed | Accepted | Superseded
- Context: What is the problem?
- Decision: What did we decide?
- Consequences: What are the trade-offs?
ADR-001: Raft for Quorum Consensus
Status: Accepted
Context: The scheduler needs a distributed control plane that avoids single-point-of-failure (Slurm’s slurmctld problem). We need strong consistency for node ownership and sensitive audit, but the system schedules tens-to-hundreds of large allocations, not millions of microservices.
Decision: Use Raft consensus (via openraft crate) for the quorum. 3-5 replicas. Only node ownership changes and sensitive audit events go through Raft. Everything else is eventually consistent.
Consequences:
- (+) No SPOF. Quorum tolerates minority failures.
- (+) Raft is well-understood, battle-tested, good Rust implementations exist.
- (+) Consistency latency (few ms per commit) is acceptable for our scheduling granularity.
- (-) Operational complexity of running a Raft cluster (leader election, log compaction, membership changes).
- (-) Write throughput limited by Raft commit latency. Not a problem at our scale.
ADR-002: Knapsack Scheduling with Composite Cost Function
Status: Accepted
Context: We need a scheduling algorithm that handles both HPC batch (topology-aware, fair-share) and cloud service (bin-packing, autoscale) workloads. Different vClusters need different optimization strategies.
Decision: Multi-dimensional knapsack formulation with a composite weighted cost function. Weights tunable per vCluster. Greedy solver with topology-aware backfill. Validated via RM-Replay simulator before production deployment.
Consequences:
- (+) Unified framework for all workload types (just change weights).
- (+) Cost function is extensible (add new factors without restructuring).
- (+) RM-Replay provides safe testing of configuration changes.
- (-) Weight tuning requires expertise and simulation. Not “plug and play.”
- (-) Greedy solver is not globally optimal. Acceptable for our scale.
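The weighted-sum shape of the cost function can be sketched in a few lines. This is an illustrative sketch, not the `lattice-scheduler` implementation: the struct names (`CostWeights`, `Candidate`) and the four factors shown are stand-ins for the larger, config-driven factor set.

```rust
/// Illustrative per-vCluster weights; the real factor set is larger and
/// configured per vCluster (hypothetical names).
struct CostWeights {
    fair_share: f64,
    topology: f64,
    data_readiness: f64,
    energy: f64,
}

/// A candidate placement scored by the greedy solver (factors normalized
/// to 0.0..=1.0, higher = better).
struct Candidate {
    fair_share: f64,     // higher = tenant is under-served
    topology: f64,       // higher = more compact placement
    data_readiness: f64, // fraction of declared data already on hot tier
    energy: f64,         // higher = cheaper energy window
}

/// Composite score: weighted sum of normalized factors. The greedy solver
/// picks the highest-scoring feasible candidate first; changing the weights
/// per vCluster changes the optimization strategy without new code.
fn score(w: &CostWeights, c: &Candidate) -> f64 {
    w.fair_share * c.fair_share
        + w.topology * c.topology
        + w.data_readiness * c.data_readiness
        + w.energy * c.energy
}
```

A batch-HPC vCluster might weight `topology` and `fair_share` heavily, while an inference vCluster weights packing-related factors; that per-vCluster tunability is the point of the decision above.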
ADR-003: uenv-First Software Delivery
Status: Accepted
Context: Users need reproducible software environments. Options: full containers (Docker/Sarus), uenv (SquashFS mount namespaces), or module systems.
Decision: uenv is the default software delivery mechanism. Sarus is used for OCI containers when isolation is needed (multi-tenant node sharing, third-party images, sensitive workloads requiring enhanced isolation). No module system.
Consequences:
- (+) Near-zero runtime overhead (mount namespace, no container isolation overhead).
- (+) Native GPU/Slingshot access without namespace workarounds.
- (+) MPI “just works” — no network namespace translation.
- (+) Proven at CSCS scale (Alps, 10,752 GH200 GPUs).
- (-) Users must use curated uenv stacks or build their own (Spack/Stackinator).
- (-) Weaker isolation than containers — fine for trusted HPC users, needs Sarus for untrusted workloads.
ADR-004: Two Strong Consistency Domains
Status: Accepted
Context: Strong consistency (Raft) has a performance cost. We need to minimize what goes through consensus while ensuring correctness for critical state.
Decision: Exactly two categories of state require strong consistency:
- Node ownership — which tenant/vCluster/allocation owns which nodes
- Sensitive audit log — all events related to sensitive node claims, data access, and isolation boundaries
Everything else (job queues, telemetry, quota accounting, session state) is eventually consistent.
Consequences:
- (+) Minimal Raft throughput requirements (node ownership changes are infrequent).
- (+) Sensitive compliance: audit trail is provably consistent and tamper-evident.
- (+) Job queue staleness is bounded and self-correcting (rejected proposals retry next cycle).
- (-) Eventual consistency means two vCluster schedulers might propose conflicting allocations. One gets rejected. This is a retry, not a bug.
- (-) Quota accounting can lag. Hard limits enforced at quorum (node ownership), soft limits eventually.
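The two-domain split can be made explicit in the type system. The sketch below is illustrative (the variant and field names are hypothetical, not the real `lattice-quorum` types): only the two categories named above route through Raft, and everything else is applied to in-memory state without a consensus round-trip.

```rust
/// Hypothetical classification of state updates by consistency domain.
enum StateUpdate {
    /// Which tenant/vCluster/allocation owns which nodes.
    NodeOwnership { allocation_id: u64, node_ids: Vec<u32> },
    /// Isolation-boundary events for regulated workloads.
    SensitiveAudit { user: String, action: String },
    /// Per-node heartbeat telemetry (eventually consistent).
    Telemetry { node_id: u32, gpu_util: f32 },
    /// vCluster queue bookkeeping (eventually consistent).
    QueueState { vcluster: String, pending: usize },
}

/// Exactly two categories require a Raft commit (this ADR); the rest never
/// block on consensus.
fn needs_raft_commit(update: &StateUpdate) -> bool {
    matches!(
        update,
        StateUpdate::NodeOwnership { .. } | StateUpdate::SensitiveAudit { .. }
    )
}
```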
ADR-005: Federation as Opt-In via Sovra
Status: Accepted
Context: Multi-site operation is desirable but adds significant complexity. Not all deployments need it. The trust model for cross-site operation is a hard problem.
Decision: Federation is a compile-time feature flag. When disabled, no Sovra dependency and no cross-site code paths. When enabled, Sovra provides the cryptographic trust layer. Each site retains full sovereignty — federation broker suggests, local scheduler decides.
Consequences:
- (+) Zero overhead when federation is not needed.
- (+) Sovra’s sovereign key model aligns with institutional requirements (each site controls its keys).
- (+) Revocable federation (revoke workspace → instant defederation).
- (-) Additional infrastructure to operate (Sovra instances, federation brokers).
- (-) Cross-site scheduling decisions are based on eventually consistent capacity data (may be stale).
ADR-006: Rust for Scheduler Core
Status: Accepted
Context: The scheduler is a long-lived, performance-critical, correctness-critical system. Options: Rust, Go, C++.
Decision: Rust for all performance-critical components (quorum, schedulers, node agent, API server, CLI, checkpoint broker). Go for infrastructure integration (OpenCHAMI, Sovra, federation broker). Python for user-facing SDK and tooling.
Consequences:
- (+) Memory safety without GC pauses (critical for scheduler latency).
- (+) Strong type system for modeling resource constraints (algebraic types for allocation states).
- (+) Excellent async/concurrency (tokio) for handling many concurrent node agent connections.
- (+) Single binary deployment for node agents (no runtime dependencies).
- (-) Steeper learning curve for contributors.
- (-) Slower initial development velocity vs. Go.
- (-) Ecosystem for HPC is smaller than C/C++ (but growing).
ADR-007: Full-Node Scheduling with Intra-Node Packing
Status: Accepted
Context: Scheduling granularity: full nodes, fractional nodes, or both?
Decision: The scheduler reasons about full nodes. The node agent handles intra-node packing (multiple containers/uenvs on a single node) for workloads that don’t need a full node (interactive sessions, small Jupyter notebooks). This is a two-level scheme: scheduler assigns nodes to vClusters, node agent packs work within allocated nodes.
Consequences:
- (+) Simplifies the scheduler (no cgroup negotiation between co-tenants).
- (+) Predictable performance for large jobs (no noisy neighbor at scheduler level).
- (+) Node agent can use simple bin-packing for small workloads.
- (-) Potential waste for small workloads that get a full node unnecessarily. Mitigated by Sarus containers with resource limits for the interactive vCluster, and by grouping small workloads on designated “shared” nodes.
ADR-008: Asynchronous Accounting via Waldur
Status: Accepted
Context: Lattice needs external accounting and billing but should not depend on an accounting system for core scheduling functionality. Waldur provides HPC-aware accounting, billing, and self-service portal capabilities.
Decision: Integrate with Waldur as an optional, feature-flagged accounting provider. Lattice pushes accounting events (allocation started/completed, resource usage) to Waldur asynchronously. Waldur can push quota updates back to Lattice. Waldur unavailability never blocks scheduling. Events are buffered in memory and persisted to disk on overflow, replayed on reconnection.
Consequences:
- (+) Clean separation of concerns: Lattice schedules, Waldur accounts.
- (+) Zero scheduling impact from accounting failures (events are buffered).
- (+) Waldur’s self-service portal gives tenant admins quota visibility without Lattice changes.
- (+) Feature-flagged: zero overhead when accounting is not needed.
- (-) Eventually consistent accounting data (events pushed at configurable interval, default 60s).
- (-) Additional external dependency to operate (Waldur instance, API token management).
- (-) Entity mapping (Tenant↔Customer, Project↔Project) must stay synchronized.
ADR-009: Two-Tier Quota Enforcement
Status: Accepted
Context: Quota enforcement must balance strictness (prevent over-allocation) with performance (don’t bottleneck scheduling on consensus). Some quotas are safety-critical (node counts), others are advisory (GPU-hours budgets).
Decision: Two-tier quota enforcement matching the two consistency domains (ADR-004):
- Hard quotas (quorum-enforced, strong consistency): `max_nodes`, `max_concurrent_allocations`, `sensitive_pool_size`. Checked during Raft proposal validation. Cannot be violated even momentarily.
- Soft quotas (scheduler-enforced, eventual consistency): `gpu_hours_budget`, `node_hours_budget`, `fair_share_target`, `burst_allowance`. Influence scheduling score but don’t hard-block. May temporarily overshoot during consistency window (~30s), self-correcting via fair-share scoring. When both GPU-hours and node-hours budgets are set, the worse utilization drives the penalty.
Consequences:
- (+) Hard quotas are provably enforced (Raft consensus guarantees).
- (+) Soft quotas don’t bottleneck scheduling (no consensus required for budget checks).
- (+) Consistency window for soft quotas is acceptable (scheduling cycle is 5-30s, budget tracking is for billing not safety).
- (+) Integrates cleanly with Waldur (ADR-008): Waldur updates quotas, Lattice enforces them.
- (-) Soft quotas can temporarily overshoot (by design). Requires clear documentation that GPU-hours tracking is approximate.
- (-) Two enforcement paths add complexity. Developers must know which tier a quota belongs to.
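The two tiers can be sketched as two functions with different return types: the hard tier rejects, the soft tier only scores. This is an illustrative sketch under assumed names (`TenantQuota`, `validate_hard`, `soft_penalty` are hypothetical, not the real crate API).

```rust
/// Hypothetical quota snapshot for one tenant.
struct TenantQuota {
    max_nodes: u32,        // hard: enforced at Raft proposal validation
    gpu_hours_budget: f64, // soft: influences scheduling score only
    gpu_hours_used: f64,
}

/// Hard tier: checked during Raft proposal validation. A violation rejects
/// the proposal outright, so it can never be committed, even momentarily.
fn validate_hard(q: &TenantQuota, owned: u32, requested: u32) -> Result<(), String> {
    if owned + requested > q.max_nodes {
        Err(format!("max_nodes exceeded: {} + {} > {}", owned, requested, q.max_nodes))
    } else {
        Ok(())
    }
}

/// Soft tier: never blocks. Over-budget tenants simply score worse, which
/// self-corrects via fair-share over subsequent scheduling cycles.
fn soft_penalty(q: &TenantQuota) -> f64 {
    let utilization = q.gpu_hours_used / q.gpu_hours_budget;
    (utilization - 1.0).max(0.0) // penalty applies only past the budget
}
```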
ADR-010: Native PMI-2 with Optional PMIx Sidecar
Status: Accepted
Context: Lattice replaces Slurm’s srun, which serves as both a process launcher (fan-out to nodes) and a PMI server (rank/key-value discovery for MPI). Without a PMI provider, multi-node MPI jobs fall back to SSH for process spawning (OpenMPI’s ORTE, MPICH’s Hydra). SSH between compute nodes is a security risk, conflicts with network-domain isolation, and is incompatible with the sensitive workload model. The system must support OpenMPI, MPICH, and Cray MPICH.
Three options were evaluated:
- Full PMIx server in Rust – PMIx v4/v5 is ~200+ attributes, enormous implementation surface, no existing Rust implementation. Rejected: too much scope, too much risk.
- Embed OpenPMIx library via FFI – Battle-tested, full compatibility. But adds a heavy C dependency (~100K LOC), complex FFI, and still requires custom cross-node transport via gRPC.
- Native PMI-2 wire protocol – ~8 text commands over Unix domain socket. Implementable in ~1000-1500 lines of Rust. All three target MPI implementations support PMI-2 natively. The only cross-node operation (kvsfence) maps cleanly to gRPC between node agents.
Decision: Implement a native PMI-2 server in the node agent as the default process management interface. The node agent provides a Unix domain socket per launch, sets PMI_FD/PMI_RANK/PMI_SIZE, and handles cross-node KV exchange (fence) via gRPC between node agents. Optionally, for workloads requiring full PMIx (dynamic spawn, tools API, event notification), support an OpenPMIx sidecar process managed by the node agent, behind the pmix feature flag.
Consequences:
- (+) No SSH between compute nodes. Eliminates an entire class of security and operational issues.
- (+) No external C dependencies for the default path. PMI-2 is simple enough to implement and test in pure Rust.
- (+) All three target MPI implementations (OpenMPI, MPICH, Cray MPICH) work with PMI-2 out of the box.
- (+) Cross-node fence reuses the existing node-agent gRPC infrastructure (management network, mTLS).
- (+) CXI credential management integrates naturally with existing VNI/network-domain lifecycle.
- (+) PMIx available as opt-in for the ~5% of workloads that need it, without burdening the default path.
- (-) PMI-2 does not support dynamic process spawning (`MPI_Comm_spawn`). Rare in HPC but used by some frameworks.
- (-) OpenMPI users must set `OMPI_MCA_pmix=pmi2` (or Lattice sets it automatically). Minor friction.
- (-) PMIx sidecar mode adds a C dependency (OpenPMIx) and a host callback shim (~200 LOC C). Only needed when feature-flagged.
- (-) Fence performance at extreme scale (>1000 nodes) requires tree-based reduction instead of star topology. Optimization deferred until needed.
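To make “~8 text commands over a Unix domain socket” concrete, here is a simplified sketch of parsing one PMI-style command. It is deliberately approximate: the real PMI-2 wire format adds a length prefix and differs in separator details, and the command/attribute names below are examples, not a spec.

```rust
use std::collections::HashMap;

/// Parse a simplified PMI-style command of the form
/// "cmd=<name>;key1=val1;key2=val2;" into (command, attributes).
/// Returns None on malformed input (missing '=' or missing cmd).
fn parse_cmd(line: &str) -> Option<(String, HashMap<String, String>)> {
    let mut attrs = HashMap::new();
    let mut cmd = None;
    for pair in line.trim_end_matches(';').split(';') {
        let (k, v) = pair.split_once('=')?;
        if k == "cmd" {
            cmd = Some(v.to_string());
        } else {
            attrs.insert(k.to_string(), v.to_string());
        }
    }
    cmd.map(|c| (c, attrs))
}
```

In the design above, a `kvs-put`/`kvs-fence` pair from a local MPI rank would be answered from the node agent’s KV store, with the fence barrier implemented as a gRPC exchange between the node agents participating in the launch.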
ADR-011: Observability Data Out-of-Raft
Status: Accepted
Context: The system generates significant observability data: per-node telemetry (CPU, GPU, network, I/O), allocation logs (stdout/stderr), and metrics time series. This data must be queryable by users (dashboards, debugging) and by the scheduler (cost function factors like energy cost and data readiness). The question is where to store it.
Options:
- Raft state machine — guarantees consistency but creates enormous write load (thousands of metric points per second across hundreds of nodes). Raft commit latency becomes the bottleneck for telemetry ingestion.
- External TSDB + S3 — eventually consistent but decouples observability throughput from scheduling throughput. Standard tooling (Grafana, PromQL) works out of the box.
- In-memory ring buffers only — fast but volatile; node agent restart loses history; no cross-node aggregation.
Decision: Observability data is stored entirely outside the Raft state machine. Metrics go to an external TSDB (VictoriaMetrics). Logs are dual-path: ring buffer in the node agent for live streaming, S3 for persistent storage. The scheduler queries the TSDB for cost function inputs. Only sensitive audit events about observability actions (e.g., “user X attached to allocation Y”) flow through Raft consensus (per ADR-004).
Consequences:
- (+) Raft throughput is reserved for what matters: node ownership and sensitive audit.
- (+) Standard observability tooling (Grafana, PromQL) works without custom integration.
- (+) Telemetry pipeline failures do not disrupt scheduling or allocation lifecycle.
- (+) TSDB handles retention, downsampling, and high-cardinality queries natively.
- (-) Metrics are eventually consistent (~30s lag). Scheduler cost function inputs may be slightly stale.
- (-) TSDB is an additional infrastructure dependency to operate.
- (-) Log persistence depends on S3 availability; brief gaps possible during S3 outages (ring buffer covers live access).
ADR-012: Allocation as Universal Work Unit
Status: Accepted
Context: The system must schedule both finite work (training runs, simulations, CI jobs) and infinite work (inference services, monitoring daemons, interactive notebooks). Slurm treats these as fundamentally different (jobs vs. “perpetual” jobs with workarounds). Kubernetes treats everything as a pod/deployment but lacks HPC scheduling semantics. We need a single abstraction that spans both worlds without losing scheduling precision.
Options:
- Two separate types (Job and Service) — clear semantics per type, but duplicates scheduling logic, quota enforcement, preemption policy, and API surface. Every feature must be implemented twice.
- Always bounded (Slurm model) — services require walltime workarounds (submit with max walltime, auto-resubmit). Clumsy and fragile.
- Always unbounded (K8s model) — batch jobs require explicit termination signals. Cannot express “run until completion” natively.
- Single type with lifecycle variants — one Allocation with lifecycle: Bounded | Unbounded | Reactive.
Decision: A single Allocation type is the universal work unit. The lifecycle field determines duration semantics: Bounded (has walltime, completes or is killed), Unbounded (runs until cancelled, auto-restarts on failure), Reactive (scales in response to metrics/load). All scheduling, quota, preemption, checkpoint, and telemetry policies operate on Allocations uniformly. Task Groups (Slurm job arrays) and DAGs (dependency graphs) compose Allocations.
Consequences:
- (+) Unified scheduling: one cost function, one knapsack solver, one preemption engine for all workload types.
- (+) Simpler API: users learn one submission model. Services and batch jobs differ only in lifecycle field.
- (+) Quota and fair-share accounting is uniform — no special cases for services vs. jobs.
- (+) DAG dependencies can mix bounded and unbounded allocations (e.g., training job → inference service).
- (-) Lifecycle variants add complexity to the state machine (Bounded has walltime enforcement; Unbounded has restart policy; Reactive has scaling triggers).
- (-) Users coming from Slurm must learn that “job” and “service” are the same thing with different lifecycle.
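The lifecycle variants sketch naturally as a Rust enum. This is an illustration of the shape described above, not the actual `lattice-common` type; the fields shown are assumptions.

```rust
/// The three lifecycle variants from this ADR (illustrative fields).
enum Lifecycle {
    /// Has a walltime; completes on its own or is killed at the limit.
    Bounded { walltime_secs: u64 },
    /// Runs until cancelled; auto-restarts on failure.
    Unbounded { max_restarts: Option<u32> },
    /// Scales replica count in response to metrics/load.
    Reactive { min_replicas: u32, max_replicas: u32 },
}

/// The universal work unit: batch jobs and services differ only in
/// the lifecycle field.
struct Allocation {
    id: u64,
    lifecycle: Lifecycle,
}

/// Only Bounded allocations are subject to walltime enforcement; every
/// other policy (quota, preemption, telemetry) treats all variants uniformly.
fn walltime_exceeded(a: &Allocation, elapsed_secs: u64) -> bool {
    match a.lifecycle {
        Lifecycle::Bounded { walltime_secs } => elapsed_secs > walltime_secs,
        _ => false,
    }
}
```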
ADR-013: Network Domains via Hardware VNIs
Status: Accepted
Context: Multi-tenant HPC requires network isolation between allocations. On Slingshot/Ultra Ethernet fabrics, the NIC supports Virtual Network Identifiers (VNIs) that provide hardware-enforced L3 isolation. Alternative approaches exist in software.
Options:
- Software-based isolation (Linux network namespaces, iptables) — can be bypassed by privileged processes, adds per-packet overhead, difficult to audit at scale, incompatible with RDMA.
- No network isolation — all allocations share L2/L3. Unacceptable for multi-tenant security and sensitive workloads.
- Full overlay network (Kubernetes CNI model) — adds encapsulation overhead, incompatible with Slingshot fabric semantics, destroys RDMA performance.
- Hardware VNI isolation — Slingshot NIC enforces isolation at line rate, zero software overhead, auditable via fabric manager.
Decision: Network isolation is enforced at the Slingshot hardware level via VNIs. Each network domain maps to a VNI allocated from a managed pool. Allocations in the same domain share a VNI and have L3 reachability. Allocations in different domains are hardware-isolated. VNI assignment is eventually consistent (node agents configure NICs based on quorum-reported domain membership). Sensitive allocations get unique per-allocation domains with encrypted RDMA (Ultra Ethernet).
Consequences:
- (+) Zero-overhead isolation — no per-packet software processing, RDMA performance preserved.
- (+) Hardware-enforced — cannot be bypassed by user processes, even with root inside a container.
- (+) Auditable via fabric manager — network domain membership is visible to operators.
- (+) Naturally integrates with CXI credential management for MPI (ADR-010).
- (-) Tied to Slingshot/Ultra Ethernet hardware. Non-Slingshot deployments need a software fallback.
- (-) VNI pool is finite (default: 3095). Exhaustion blocks new domain creation.
- (-) VNI configuration propagation to NICs adds latency to allocation startup (~50ms).
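The finite-pool consequence is easy to see in a minimal allocator sketch. The type and API below are hypothetical (the real pool, per the ADR, defaults to 3095 VNIs and is managed alongside the fabric manager).

```rust
use std::collections::BTreeSet;

/// Sketch of a VNI pool: VNIs are drawn from a finite range, and exhaustion
/// blocks new network-domain creation until a domain is torn down.
struct VniPool {
    free: BTreeSet<u16>,
}

impl VniPool {
    /// Valid VNIs in this sketch are 1..=size (0 treated as reserved).
    fn new(size: u16) -> Self {
        Self { free: (1..=size).collect() }
    }

    /// Allocate the lowest free VNI, or None when the pool is exhausted.
    fn allocate(&mut self) -> Option<u16> {
        let vni = *self.free.iter().next()?;
        self.free.remove(&vni);
        Some(vni)
    }

    /// Return a VNI to the pool when its network domain is torn down.
    fn release(&mut self, vni: u16) {
        self.free.insert(vni);
    }
}
```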
ADR-014: Conformance Fingerprinting for Configuration Drift Detection
Status: Accepted
Context: Multi-node GPU workloads (distributed training, MPI simulations) are sensitive to configuration heterogeneity. Nodes with different GPU driver versions, NIC firmware, or kernel versions can cause subtle correctness issues (NCCL version mismatches, libfabric ABI incompatibilities) or performance degradation. Slurm has no built-in mechanism to detect this; operators discover it via user bug reports.
Options:
- No tracking — silent failures; users debug configuration drift themselves.
- Exact node-by-node attribute matching — too strict; every firmware update requires simultaneously updating all nodes or scheduling breaks.
- Conformance fingerprint (hash of driver/firmware/kernel) — nodes with identical fingerprints are grouped into cohorts; scheduler places multi-node jobs on same-cohort nodes.
- Scheduler-driven remediation — scheduler triggers firmware updates on non-conforming nodes. Out of scope; OpenCHAMI handles infrastructure.
Decision: Each node agent computes a conformance fingerprint (SHA-256 of GPU driver version, NIC firmware version, BIOS version, kernel version) and reports it with heartbeats. The quorum groups nodes into conformance cohorts. The cost function factor f₉: conformance_fitness penalizes multi-node allocations that would span cohorts. Allocations can set require_conformance: true to hard-require same-cohort placement. Conformance drift on sensitive nodes triggers immediate drain (not remediation — that’s OpenCHAMI’s job).
Consequences:
- (+) Detects configuration drift before it causes user-visible failures.
- (+) Soft by default (penalty, not hard block) — avoids scheduling starvation during rolling updates.
- (+) Hard mode available for workloads that need it (`require_conformance`).
- (+) Sensitive nodes get stricter enforcement (drain on drift) for compliance.
- (-) Fingerprint granularity is coarse. Two nodes with different BIOS settings but same BIOS version have the same fingerprint.
- (-) Multi-node jobs with `require_conformance` may wait longer for same-cohort nodes.
- (-) Rolling firmware updates temporarily create many small cohorts, reducing scheduling flexibility.
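The cohort mechanism reduces to “same fingerprint ⇒ same cohort.” The sketch below uses std’s `DefaultHasher` as a dependency-free stand-in for the SHA-256 hash named in the decision; the struct and function names are illustrative.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// The node attributes that feed the conformance fingerprint (per this ADR).
#[derive(Hash)]
struct NodeConfig {
    gpu_driver: String,
    nic_firmware: String,
    bios: String,
    kernel: String,
}

/// Fingerprint sketch. The real implementation uses SHA-256; DefaultHasher
/// stands in here so the example needs no external crates. Identical configs
/// yield identical fingerprints, which defines a cohort.
fn fingerprint(cfg: &NodeConfig) -> u64 {
    let mut h = DefaultHasher::new();
    cfg.hash(&mut h);
    h.finish()
}

/// Multi-node placement check: all nodes in the set share one fingerprint.
fn same_cohort(nodes: &[NodeConfig]) -> bool {
    nodes.windows(2).all(|w| fingerprint(&w[0]) == fingerprint(&w[1]))
}
```

Note the coarseness trade-off from the consequences list: two nodes with the same version strings but different BIOS settings hash identically here too.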
ADR-015: Attach via nsenter
Status: Accepted
Context: Users need interactive terminal access to running allocations for debugging, monitoring, and interactive workflows (equivalent to Slurm’s srun --pty bash into a running job). The question is how to provide this without compromising isolation or consuming scheduling resources.
Options:
- Create a new “attach” allocation on the same node — goes through the scheduler queue; consumes quota; adds latency; overkill for a debugging session.
- SSH into the compute node — requires SSH key distribution between login and compute nodes; security risk; incompatible with network domain isolation; operationally fragile.
- nsenter from node agent — the node agent enters the allocation’s mount/PID namespace via Linux `nsenter`; a bidirectional gRPC stream provides the PTY. No new resource allocation, no SSH.
- Direct socket from user to container — requires host filesystem access; less secure; doesn’t work with uenv (no container to connect to).
Decision: Attach uses nsenter executed by the node agent. The user’s lattice attach <id> command opens a bidirectional gRPC stream to the API server, which forwards to the node agent hosting the allocation. The node agent spawns a shell inside the allocation’s namespace via nsenter. No new allocation is created, no quota is consumed, and no SSH is involved.
Consequences:
- (+) Instant attach — no scheduler queue, no resource allocation.
- (+) No SSH infrastructure needed on compute nodes.
- (+) Works identically for uenv and Sarus allocations (both use Linux namespaces).
- (+) Attach sessions are logged as observability events (sensitive: Raft-committed audit entry).
- (-) Requires the node agent to have `CAP_SYS_ADMIN` / sufficient privileges for `nsenter`.
- (-) Attach shares the allocation’s resource limits — a heavy debugging tool could impact the running workload.
- (-) If the node agent is down, attach is unavailable (no fallback).
ADR-016: Two-Tier API (Intent API + Compatibility Layer)
Status: Accepted
Context: Lattice must serve two audiences: (1) new users and AI agents who benefit from a declarative, intent-based API (“I need 64 GPU nodes for 2 hours with this data”), and (2) existing Slurm users who have years of scripts using sbatch, squeue, scancel. Supporting both without maintaining two scheduling engines requires a clear layering decision.
Options:
- Single imperative API (Slurm-style) — familiar to HPC users but locks the system into Slurm’s abstractions (partitions, job steps, GRES). Cannot express reactive scaling or data staging intent.
- Single declarative API (Intent-only) — clean design but forces all existing users to rewrite scripts immediately. Migration barrier too high.
- Dual engines — one for Intent, one for Slurm compat. Code duplication, inconsistent scheduling behavior, unmaintainable.
- Two-tier: Intent API as primary, Compatibility API as thin mapping — Slurm commands are translated to Intent API calls. One scheduling engine, one state machine, one set of semantics.
Decision: The Intent API is the primary and only scheduling interface. The Compatibility API (sbatch, squeue, scancel and their lattice submit, lattice status, lattice cancel equivalents) is a stateless translation layer that maps Slurm directives to Intent API fields. All scheduling decisions, state transitions, and quota enforcement happen through the Intent API path. The compat layer produces warnings for unsupported directives but never errors (graceful degradation for migration).
Consequences:
- (+) One scheduling engine, one code path, one set of tests.
- (+) Gradual migration: existing scripts work on day one via compat layer.
- (+) Intent API can evolve freely without Slurm compatibility constraints.
- (+) AI agents use the Intent API directly — no impedance mismatch.
- (-) Some Slurm features have no mapping (hetjob, burst buffer, GRES beyond GPU). Users get warnings.
- (-) Compat layer must be maintained and tested against Slurm script variations.
- (-) Users may stay on compat layer indefinitely, never adopting Intent API features.
ADR-017: Eventual Consistency for Job Queues
Status: Accepted
Context: When a user submits an allocation, how quickly must the system guarantee that the submission is durable and schedulable? Raft consensus provides strong guarantees but adds latency (few ms per commit) and throughput limits. Job queues see bursts (hundreds of submissions in seconds during class assignments or automated pipelines).
Options:
- Synchronous Raft commit on every submission — strong guarantee but adds 10-100ms per submission, bottlenecks the API under burst load, scheduler throughput limited by Raft commit latency.
- Eventually consistent with bounded staleness — submission is acknowledged immediately (stored in-memory queue), committed to Raft asynchronously on the next scheduling cycle. Staleness bounded by scheduling cycle time (~5-30s).
- Optimistic with no retry — submissions may be silently lost on leader failover. Unacceptable.
Decision: Job queue state is eventually consistent. Allocation submissions are acknowledged immediately by the API server and placed in the vCluster scheduler’s in-memory queue. The scheduler proposes allocations to the quorum on each scheduling cycle; the quorum validates and commits node ownership (strong consistency). If the API server fails between acknowledgment and the next scheduling cycle, the submission is lost — but the user receives an allocation ID and can query status, which will show “not found” (detectable failure, not silent). In practice, the window is <30s and API server failures are rare.
Consequences:
- (+) Submission API is fast (<5ms) regardless of Raft cluster health.
- (+) Burst submissions don’t bottleneck on consensus.
- (+) Scheduling cycle naturally batches proposals, reducing Raft commit count.
- (-) Submissions can be lost on API server crash (between ack and next cycle). Mitigated by: client retries on “not found” status, and API server persistence to disk (WAL) as future enhancement.
- (-) Two schedulers may independently queue the same submission if load-balanced. Deduplication by allocation ID at quorum level.
ADR-018: Scheduler-Coordinated Checkpointing
Status: Accepted
Context: Preemption requires evicting running allocations to free resources for higher-priority work. Killing allocations without warning wastes all computed progress. Checkpointing preserves progress but has cost: I/O bandwidth for writing state, compute time lost during checkpoint, and storage for checkpoint data. The question is who decides when to checkpoint.
Options:
- User-initiated checkpointing — user inserts checkpoint calls in their code. Does not solve the preemption problem (scheduler cannot wait for user to decide).
- Periodic automatic checkpointing (fixed interval) — simple but wasteful. Short intervals waste I/O on stable workloads; long intervals lose too much progress on preemption.
- Transparent checkpointing (DMTCP) without cost model — works for any application but causes I/O storms when many allocations checkpoint simultaneously. No way to prioritize which allocations to preempt.
- Scheduler-coordinated with cost function — scheduler evaluates checkpoint value vs. cost per allocation, decides when and which allocations to checkpoint for preemption.
Decision: Checkpointing is scheduler-coordinated. The cost function evaluates checkpoint_value = resource_freed × preemptability + backlog_relief vs. checkpoint_cost = write_time + compute_waste + storage_cost. The scheduler triggers checkpoints by sending CHECKPOINT_HINT to the node agent, which forwards to the application (via signal, shmem flag, or gRPC callback). Applications declare their checkpoint capability (signal, shmem, grpc, dmtcp, or none). Applications with none are either non-preemptible or killed without checkpoint. Backlog pressure increases checkpoint aggressiveness (more allocations waiting → more willing to preempt).
Consequences:
- (+) Checkpoint decisions are globally optimal (scheduler has full visibility of queue, resources, priorities).
- (+) Avoids I/O storms (scheduler staggers checkpoints across time and storage bandwidth).
- (+) Backlog-responsive: system becomes more aggressive about freeing resources when demand is high.
- (+) Applications retain control of checkpoint mechanics (signal handler, custom format).
- (-) Applications must implement checkpoint support to benefit. Unsupported applications are either non-preemptible or lose progress.
- (-) Cost function calibration requires tuning (write bandwidth, storage cost per GB).
- (-) Checkpoint hint is advisory — application may take too long, forcing a hard kill after timeout.
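The value-vs-cost comparison from the decision can be written out directly. The struct fields mirror the formula above; exact units and calibration are the tuning work the consequences list warns about, so treat this as a sketch rather than the `lattice-checkpoint` cost evaluator.

```rust
/// Inputs to the preemption decision (field names mirror the ADR formula;
/// units are illustrative and require calibration in practice).
struct CheckpointEval {
    resource_freed: f64, // node-equivalents released by preempting
    preemptability: f64, // 0.0..=1.0, from the allocation's priority class
    backlog_relief: f64, // queue pressure relieved by freeing these nodes
    write_time: f64,     // estimated seconds to write checkpoint state
    compute_waste: f64,  // compute-seconds lost during the checkpoint
    storage_cost: f64,   // cost of holding the checkpoint data
}

/// ADR-018: checkpoint when value exceeds cost. Backlog pressure raises the
/// value side, so a full queue makes the scheduler more willing to preempt.
fn should_checkpoint(e: &CheckpointEval) -> bool {
    let value = e.resource_freed * e.preemptability + e.backlog_relief;
    let cost = e.write_time + e.compute_waste + e.storage_cost;
    value > cost
}
```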
ADR-019: Eventually Consistent Node Capacity
Status: Accepted
Context: The scheduler needs two kinds of information about nodes: (1) ownership — which tenant/vCluster/allocation owns the node, and (2) capacity — current health, GPU utilization, temperature, available memory. Ownership must be strongly consistent (ADR-004) to prevent double-assignment. But capacity data changes frequently (every heartbeat, ~10s) and is used for scoring, not for correctness.
Options:
- All node updates through Raft — ownership and capacity in one consistent view. But heartbeats every 10s × hundreds of nodes = thousands of Raft writes per minute. Commit latency becomes the scheduling bottleneck.
- All node updates eventually consistent — fast but ownership conflicts are possible. Two schedulers could assign the same node simultaneously.
- Split: ownership via Raft, capacity via eventual consistency — ownership changes are rare (scheduling cycles) and go through Raft. Capacity updates are frequent (heartbeats) and propagated via gossip or direct reporting.
Decision: Node ownership (tenant, vCluster, allocation assignment) is Raft-committed (strong consistency). Node capacity (health, utilization, temperature, conformance fingerprint) is eventually consistent — node agents report to the quorum leader, which updates in-memory state without Raft commit. The scheduler reads the latest reported capacity when scoring. Stale capacity data may cause suboptimal placement but never incorrect ownership.
Consequences:
- (+) Heartbeats do not bottleneck Raft. Hundreds of nodes can report every 10s without consensus overhead.
- (+) Scheduling cycle time is decoupled from Raft commit latency for capacity reads.
- (+) Ownership consistency is preserved — double-assignment is impossible.
- (-) Capacity staleness can cause suboptimal decisions (e.g., scheduling on a node whose GPU just failed but hasn’t reported yet). Bounded by heartbeat interval.
- (-) Two levels of consistency require developers to know which fields are strong vs. eventual.
ADR-020: Sensitive Node Claims by User Identity
Status: Accepted
Context: Sensitive (regulated, high-security) workloads require provable isolation and audit trails that satisfy regulatory requirements (e.g., data protection laws, institutional compliance). The question is what identity is recorded as the “owner” of a sensitive node allocation: the tenant (organizational unit), a role, or the specific user.
Options:
- Tenant-owned — the organizational unit owns the nodes. Cannot prove which individual accessed which data. Insufficient for regulatory audit (“who accessed patient records?”).
- Role-based — a role (e.g., “researcher”) owns the nodes. Same problem: multiple users share a role; individual accountability is lost.
- User-owned (OIDC subject) — the authenticated user’s identity (from OIDC token) is recorded in the Raft-committed audit log as the owner. Every data access, attach session, and log retrieval is tied to a specific person.
Decision: Sensitive allocations are claimed by the authenticated user’s OIDC subject identifier, not by the tenant or a role. The quorum records the user identity in the Raft-committed audit log. All subsequent actions on the allocation (data access, attach, log retrieval) are logged with user identity. Nodes are wiped on release (OpenCHAMI secure erase) with wipe confirmation recorded in the audit log. Audit retention is 7 years.
Consequences:
- (+) Individual accountability: every action is tied to a specific authenticated person.
- (+) Regulatory defensibility: audit trail shows who claimed what, when, and what they did.
- (+) Wipe-on-release with Raft-committed confirmation provides provable data destruction.
- (+) 7-year retention satisfies most regulatory frameworks.
- (-) User identity must be available at claim time (requires OIDC authentication, no service accounts for sensitive claims).
- (-) Sensitive allocations cannot be transferred between users (the claim is to a specific identity).
- (-) Wipe-on-release adds latency to node return-to-pool (10-30 minutes for secure erase).
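The claim-by-identity and retention rules above can be sketched as a minimal Rust data model. Everything here (`AuditRecord`, `AuditAction`, `is_expired`) is an illustrative assumption, not the actual lattice-quorum types:

```rust
// Hypothetical sketch of a Raft-committed audit record for a sensitive
// allocation, keyed by the OIDC subject rather than a tenant or role.
// Type and field names are illustrative, not the real lattice-quorum API.

#[derive(Debug, Clone, PartialEq)]
enum AuditAction {
    Claim { node_ids: Vec<String> },
    Attach { session_id: String },
    WipeConfirmed { node_id: String }, // recorded after OpenCHAMI secure erase
}

#[derive(Debug, Clone)]
struct AuditRecord {
    oidc_subject: String, // the individual user identity from the OIDC token
    action: AuditAction,
    unix_ts: u64,
}

// Retention window: 7 years, per the ADR.
const RETENTION_SECS: u64 = 7 * 365 * 24 * 3600;

fn is_expired(record: &AuditRecord, now: u64) -> bool {
    now.saturating_sub(record.unix_ts) > RETENTION_SECS
}

fn main() {
    let claim = AuditRecord {
        oidc_subject: "sub-1234".into(),
        action: AuditAction::Claim { node_ids: vec!["nid001".into()] },
        unix_ts: 1_700_000_000,
    };
    // A freshly written record is well within the retention window.
    assert!(!is_expired(&claim, 1_700_000_100));
    // One second past 7 years, it no longer needs to be retained.
    assert!(is_expired(&claim, 1_700_000_000 + RETENTION_SECS + 1));
}
```

Because every `AuditRecord` carries the OIDC subject, the "who accessed patient records?" question reduces to a filter over the log rather than a cross-reference through tenant or role membership.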
ADR-021: Data Staging as Invisible Background Pre-stage
Status: Accepted
Context: Many HPC workloads require large datasets (TBs) that may reside on warm or cold storage tiers. If data is not on the hot tier (VAST NFS/S3) when the allocation starts, the first minutes of compute time are wasted on I/O. The question is when and how to move data to the hot tier.
Options:
- User-managed staging — user runs a separate staging job before the compute job. Shifts responsibility; users who forget waste compute time. Incompatible with multi-tenant fairness (staging time counted against user).
- Blocking inline staging — allocation starts, blocks on data transfer before running the entrypoint. User sees unpredictable startup latency. If staging fails, the allocation is stuck in a running-but-waiting state, consuming resources.
- Background pre-staging during queue wait — when an allocation is queued and declares data mounts with tier_hint: hot, the data mover begins warming data to the hot tier while the allocation waits in the queue. Queue wait time becomes productive.
- Post-allocation staging on compute nodes — wastes compute resources on I/O; saturates node-local network bandwidth.
Decision: Data staging runs as a background process during queue wait time. The allocation transitions through a Staging state where the data mover pre-stages declared data mounts from warm/cold to hot tier. The cost function factor f₅: data_readiness scores how ready an allocation’s data is: fully staged allocations score higher and are scheduled sooner. Allocations whose data is not yet ready can still be scheduled if resources are available (staging continues during prologue). Staging failure is non-fatal — the allocation starts with a warning, and the entrypoint may encounter I/O latency.
Consequences:
- (+) Queue wait time is no longer wasted — data moves while the allocation waits.
- (+) Users don’t need to manage staging manually; just declare data mounts.
- (+) Scheduler can prioritize data-ready allocations, improving overall throughput.
- (+) Non-blocking: staging failure degrades performance but doesn’t prevent execution.
- (-) Adds complexity to the allocation state machine (Staging state, data mover integration).
- (-) Hot tier must have capacity for pre-staged data. Over-staging wastes hot tier space.
- (-) Cost function tuning: the f₅ weight determines how much data readiness influences scheduling order.
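A simple linear form of the f₅ data-readiness factor might look like the following Rust sketch: the fraction of declared data-mount bytes already on the hot tier, scaled by the configurable weight. `DataMount`, `f5_data_readiness`, and the linear scaling are assumptions for illustration, not the actual lattice-scheduler cost function:

```rust
// Illustrative f5 (data_readiness) factor: fully staged allocations score
// highest; partially staged ones score proportionally and can still run.

struct DataMount {
    total_bytes: u64,
    staged_bytes: u64, // bytes already warmed to the hot tier
}

/// Returns a score in [0.0, weight].
fn f5_data_readiness(mounts: &[DataMount], weight: f64) -> f64 {
    let total: u64 = mounts.iter().map(|m| m.total_bytes).sum();
    if total == 0 {
        return weight; // nothing declared counts as fully ready
    }
    let staged: u64 = mounts
        .iter()
        .map(|m| m.staged_bytes.min(m.total_bytes))
        .sum();
    weight * (staged as f64 / total as f64)
}

fn main() {
    let mounts = vec![
        DataMount { total_bytes: 100, staged_bytes: 100 }, // fully staged
        DataMount { total_bytes: 100, staged_bytes: 0 },   // still on warm tier
    ];
    // Half the declared bytes are hot, so the score is half the weight.
    let score = f5_data_readiness(&mounts, 1.0);
    assert!((score - 0.5).abs() < 1e-9);
}
```

A factor shaped like this preserves the non-blocking property of the decision: an unstaged allocation scores lower but is never excluded, matching "staging failure is non-fatal" above.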
ADR-022: Three-Layer Telemetry Pipeline
Status: Accepted
Context: The system needs telemetry for three consumers: (1) operators (dashboards, alerts), (2) users (debugging, performance analysis), and (3) the scheduler (cost function inputs: GPU utilization, network congestion, energy cost). Each has different resolution, latency, and retention requirements. The pipeline must handle hundreds of nodes producing thousands of metric points per second.
Options:
- In-memory ring buffers only — fast, low overhead. But volatile: node agent restart loses history. No cross-node aggregation for dashboards. Insufficient for scheduler feedback (requires historical trends).
- Direct eBPF-to-S3 pipeline — durable but high latency. No live metrics for dashboards. Raw data too granular for efficient query.
- Stream all metrics to Raft state machine — consistent but bloats the state machine. Raft commit latency becomes the telemetry bottleneck. Fundamentally wrong abstraction.
- Three-layer: collect (eBPF) → aggregate (configurable resolution) → store (external TSDB) — each layer optimized for its purpose.
Decision: Telemetry follows a three-layer pipeline. Layer 1: eBPF programs (always-on, <0.3% overhead) collect kernel-level metrics at high resolution. Layer 2: the node agent aggregates at configurable resolution (production: 30s bicubic smoothing, debug: 1s raw, audit: access logs). Layer 3: aggregated metrics are pushed to an external TSDB (VictoriaMetrics) for storage, query, and alerting. The scheduler queries the TSDB for cost function inputs. Users query the TSDB via Grafana or the lattice top/lattice metrics commands.
Consequences:
- (+) Each layer is independently scalable and replaceable (swap TSDB, change eBPF programs, adjust resolution).
- (+) eBPF collection is always-on with negligible overhead — no sampling trade-offs.
- (+) Configurable resolution per use case: fine-grained for debugging, coarse for production.
- (+) Standard tooling (Grafana, PromQL, AlertManager) works without custom integration.
- (+) Telemetry pipeline failure does not affect scheduling (graceful degradation: stale cost function inputs).
- (-) Three layers add operational complexity (eBPF programs, agent aggregation config, TSDB deployment).
- (-) End-to-end latency from event to queryable metric is ~30s in production mode.
- (-) eBPF programs require kernel version compatibility and CAP_BPF on nodes.
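The per-use-case Layer 2 resolutions named in the decision (production 30s, debug 1s, audit event-driven access logs) can be sketched as a small Rust config type. `AggregationMode` and `resolution` are assumed names, not the real node-agent config schema:

```rust
// Sketch of the node agent's Layer 2 aggregation settings from ADR-022.
// Names are illustrative; the actual config schema may differ.

use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum AggregationMode {
    Production, // 30s smoothed windows pushed to the TSDB
    Debug,      // 1s raw samples for live troubleshooting
    Audit,      // access-log events, not time-bucketed
}

/// Fixed aggregation interval for a mode, or None for event-driven modes.
fn resolution(mode: AggregationMode) -> Option<Duration> {
    match mode {
        AggregationMode::Production => Some(Duration::from_secs(30)),
        AggregationMode::Debug => Some(Duration::from_secs(1)),
        AggregationMode::Audit => None, // emitted per access event
    }
}

fn main() {
    // Production resolution is also the floor on end-to-end latency from
    // kernel event to queryable metric, per the consequences above.
    assert_eq!(resolution(AggregationMode::Production), Some(Duration::from_secs(30)));
    assert_eq!(resolution(AggregationMode::Debug), Some(Duration::from_secs(1)));
    assert_eq!(resolution(AggregationMode::Audit), None);
}
```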
ADR-023: vCluster as Soft Isolation Boundary
Status: Accepted
Context: Different workload types need different scheduling policies: HPC batch needs backfill with topology packing, ML training needs fair-share with GPU affinity, services need bin-packing with autoscale, sensitive needs dedicated reservation. A single scheduler cannot optimize for all simultaneously. But hard partitioning wastes resources when one workload type is idle while another is starved.
Options:
- Hard partitioning (dedicated node pools per workload type) — simple isolation but guaranteed waste. If the ML training pool is 50% idle and HPC batch is oversubscribed, resources sit unused.
- Single global scheduler with workload-type heuristics — no waste but cannot apply fundamentally different policies (backfill vs. bin-pack) simultaneously. Policy conflicts create unpredictable behavior.
- Opaque vClusters (cannot see each other) — avoids conflicts but makes cross-vCluster fairness impossible. Borrowing is non-deterministic because the lending vCluster doesn’t know its own utilization relative to others.
- Soft vClusters with global visibility — each vCluster has its own scheduler and cost function weights, but all schedulers see the global node ownership state via the quorum. Borrowing is explicit and policy-driven.
Decision: vClusters are soft isolation boundaries. Each vCluster has an independent scheduler instance with its own cost function weights (ADR-002) and scheduling algorithm (backfill, bin-pack, reservation, FIFO). All schedulers read the same global state from the quorum. vClusters have base allocations (guaranteed node counts) and can borrow from other vClusters with explicit priority and duration. Borrowed nodes are returned when the lending vCluster needs them (preemption of borrowed allocations at lower priority). The quorum enforces that proposals from different vCluster schedulers don’t conflict (node ownership is Raft-committed).
Consequences:
- (+) Each workload type gets an optimized scheduler without one-size-fits-all compromises.
- (+) No waste: idle resources in one vCluster are available to others via borrowing.
- (+) Fair-share is globally visible: f₃ can compare a tenant’s usage across all vClusters.
- (+) Borrowing is explicit and reversible: lending vCluster retains priority over its base allocation.
- (-) Multiple schedulers proposing simultaneously can cause Raft proposal conflicts (one rejected, retried next cycle). Not a bug, but adds latency under contention.
- (-) Borrowing policy configuration is complex (priority levels, max borrow duration, return grace period).
- (-) Operators must understand that vClusters are not security boundaries — they are scheduling policy boundaries. Tenant isolation is provided by RBAC and network domains, not vClusters.
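The borrow-and-reclaim arithmetic behind the decision can be sketched with simple node counts per vCluster. `VCluster` and its methods are hypothetical names, not the actual lattice-scheduler API:

```rust
// Sketch of ADR-023 borrowing: a vCluster lends idle base capacity, and
// borrowed allocations run at lower priority so they can be preempted
// when the lender reclaims nodes. All names are illustrative.

#[derive(Debug)]
struct VCluster {
    base_nodes: u32, // guaranteed allocation
    used_nodes: u32, // currently running on its own base
    lent_nodes: u32, // currently lent to other vClusters
}

impl VCluster {
    /// Idle base capacity still available to lend.
    fn lendable(&self) -> u32 {
        self.base_nodes
            .saturating_sub(self.used_nodes)
            .saturating_sub(self.lent_nodes)
    }

    /// Borrowed nodes the lender must preempt and reclaim to satisfy
    /// new demand against its own base allocation.
    fn reclaim_needed(&self, extra_demand: u32) -> u32 {
        let free = self
            .base_nodes
            .saturating_sub(self.used_nodes + self.lent_nodes);
        extra_demand.saturating_sub(free).min(self.lent_nodes)
    }
}

fn main() {
    let ml = VCluster { base_nodes: 100, used_nodes: 40, lent_nodes: 20 };
    // 100 - 40 - 20 = 40 idle nodes remain lendable.
    assert_eq!(ml.lendable(), 40);
    // Demand for 50 more: 40 are free, so 10 borrowed nodes are preempted.
    assert_eq!(ml.reclaim_needed(50), 10);
    // Demand within free capacity triggers no preemption.
    assert_eq!(ml.reclaim_needed(30), 0);
}
```

In the real system the reclaim decision would be a Raft-committed proposal, since node ownership is global state; this sketch only captures the counting.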
External References
Core Infrastructure Projects
OpenCHAMI
- What: Open-source HPC system management platform (provisioning, boot, inventory)
- Repo: https://github.com/OpenCHAMI
- Docs: https://openchami.org
- Components we integrate with: SMD (State Management Daemon), BSS (Boot Script Service), Magellan (Redfish discovery), OPAAL (auth), Cloud-init
- Founded by: LANL, NERSC, CSCS, HPE, University of Bristol
- Language: Go
- Our integration: Infrastructure plane — Lattice queries SMD for node inventory, triggers BSS for boot image selection (e.g., sensitive hardened image), uses Magellan for hardware discovery
FirecREST
- What: RESTful API gateway for HPC systems
- Repo: https://github.com/eth-cscs/firecrest
- Docs: https://firecrest.readthedocs.io
- Our integration: Optional — lattice authenticates directly via hpc-auth. FirecREST is only needed for hybrid Slurm deployments where it serves as a passthrough compatibility gateway.
uenv
- What: User environment tool for mounting SquashFS software stacks
- Repo: https://github.com/eth-cscs/uenv
- Related: https://github.com/eth-cscs/squashfs-mount (setuid mount binary), https://github.com/eth-cscs/slurm-uenv-mount (Slurm SPANK plugin)
- Docs: https://docs.cscs.ch/software/uenv/using/
- Key properties: SquashFS images, mount namespace isolation (per-process-tree), setuid binary (not FUSE), Spack-built stacks via Stackinator, multiple mount points (/user-environment, /user-tools)
- Our integration: Software plane — node agent uses squashfs-mount to deliver uenv to allocations. We replace the Slurm SPANK plugin with native node agent integration.
Sarus
- What: OCI-compliant container runtime for HPC
- Repo: https://github.com/eth-cscs/sarus
- Key properties: Near-native performance, direct GPU/interconnect access via OCI hooks, no network namespace overhead for MPI
- Our integration: Software plane — used when full container isolation is needed (multi-tenant node sharing, third-party images, sensitive workloads with enhanced isolation)
Sovra
- What: Federated sovereign key management for critical infrastructure
- Repo: https://github.com/witlox/sovra
- Docs: https://witlox.github.io/sovra/
- Key properties: Peer-to-peer control planes, customer-controlled root keys, OPA-based policy, air-gap capable, cross-domain sharing
- Language: Go
- Our integration: Federation trust layer (optional, feature-gated). Provides cross-site authentication, sensitive data encryption key management, audit log signing.
Networking
Slingshot (HPE CXI)
- What: HPE’s HPC interconnect, dragonfly topology
- Key properties: Hardware traffic classes, VNIs for isolation, high-radix switches, RDMA
- Scheduler relevance: Topology-aware placement (minimize inter-group hops), VNI-based network domains, separate traffic classes for compute/management/telemetry
Ultra Ethernet Consortium (UEC)
- What: Open Ethernet-based networking stack for AI/HPC
- Spec: https://ultraethernet.org (1.0 released June 2025)
- Key properties: UET transport (native RDMA over Ethernet), packet spraying (adaptive multi-path), CSIG (in-band congestion signaling), built-in encryption, libfabric 2.0 API
- Relationship to Slingshot: ~75% of UET derives from Slingshot transport. Migration path is evolutionary, not revolutionary.
- Scheduler relevance: CSIG feeds into telemetry (congestion-aware scheduling), encryption simplifies sensitive compliance, libfabric abstraction enables fabric-agnostic scheduler
libfabric
- What: Fabric abstraction library (provider-based: CXI for Slingshot, EFA for AWS, verbs for InfiniBand, UET for Ultra Ethernet)
- Our integration: Network fabric abstraction. The scheduler and node agent interact with the network via libfabric, making the scheduler fabric-agnostic.
Storage
VAST Data Platform
- What: All-flash unified storage (NFS + S3 + block), DASE architecture
- Key properties: Multiprotocol (NFS + S3 native), RESTful API for everything, QoS per export, auto-indexing catalog, snapshots, DataSpace (global namespace with prefetch)
- Scheduler integration: QoS setting at job start, data locality queries via Catalog API, pre-staging via DataSpace prefetch, snapshots for reproducibility, audit logs for sensitive compliance
IBM Storage Scale (GPFS)
- What: Parallel file system with extensive management features
- Key properties: Placement policies, AFM (async data management), filesets with quotas, watch/callback API, transparent cloud tiering
- Scheduler integration: Alternative to VAST. Fileset-per-job for isolation, placement policies for workload-specific tuning, AFM for remote data staging.
Research Papers
CSCS Alps Architecture
- Martinasso, Klein, Schulthess. “Alps, a versatile research infrastructure.” CUG 2025. arXiv:2507.02404
- Alam, Gila, Klein, Martinasso, Schulthess. “Versatile software-defined HPC and cloud clusters on Alps supercomputer for diverse workflows.” IJHPCA 2023.
- Martinasso et al. “Resource Elasticity for Scientific Platforms on HPC Infrastructure.” Springer 2025.
Scheduler Simulation
- Martinasso, Gila, Bianco, Alam, McMurtrie, Schulthess. “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management.” SC18.
Multi-Objective Scheduling
- Simon, Nguyen, Halem. “Multiple Objective Scheduling of HPC Workloads Through Dynamic Prioritization.” Uses bounded fractional knapsack with dynamic priority scoring.
- Goponenko. “Objective-Driven Strategies for HPC Job Scheduling.” UCF 2024. Comprehensive metrics for scheduling quality, I/O-aware backfill.
Energy-Aware Federation
- “Power-Aware Scheduling for Multi-Center HPC Electricity Cost Optimization.” arXiv:2503.11011. GNN-based power prediction + multi-site scheduling, up to 18% energy cost reduction.
uenv Deployment
- Coles et al. “Deploying Alternative User Environments on Alps.” CUG 2023. Details squashfs-mount, Slurm SPANK plugin, Spack stack building.
ML on HPC
- CSCS. “Evolving HPC services to enable ML workloads on HPE Cray EX.” CUG 2025. arXiv:2507.01880. Container Engine, Environment Definition Files, gaps for ML users.