
Lattice

A distributed workload scheduler for large-scale scientific computing, AI/ML training, inference services, and regulated workloads.

Lattice schedules both finite jobs (batch training, simulations) and infinite jobs (inference services, monitoring) on shared HPC infrastructure with topology-aware placement, federated multi-site operation, and a unified API for human users and autonomous agents.

Architecture at a Glance

User Plane         lattice-cli + lattice-api (OIDC via hpc-auth)
Software Plane     uenv (SquashFS) + Sarus (OCI) + Registry
Scheduling Plane   Raft Quorum + vCluster Schedulers (knapsack)
Data Plane         VAST (NFS/S3) tiered storage + data mover
Network Fabric     Slingshot / Ultra Ethernet (libfabric)
Node Plane         Node Agent + mount namespaces + eBPF telemetry
Infrastructure     OpenCHAMI (Redfish BMC, boot, inventory)

Start with System Architecture for the full picture, or jump to API Design to see how users interact with the system.

Source Code

The project is organized as a Rust workspace with 9 crates:

Crate                 Purpose
lattice-common        Shared types, config, protobuf bindings
lattice-quorum        Raft consensus, global state machine, audit log
lattice-scheduler     vCluster schedulers, knapsack solver, cost function
lattice-api           gRPC + REST server, OIDC, RBAC, mTLS
lattice-checkpoint    Checkpoint broker, cost evaluator
lattice-node-agent    Per-node daemon, GPU discovery, eBPF telemetry
lattice-cli           CLI binary (submit, status, cancel, session, telemetry)
lattice-test-harness  Shared mocks, fixtures, builders
lattice-acceptance    BDD scenarios and property tests

Plus a Python SDK, an RM-Replay simulator, and deployment configs in infra/.

Getting Started

Overview

Lattice is a distributed workload scheduler for HPC and AI infrastructure. It schedules both batch jobs (training runs, simulations) and long-running services (inference endpoints, monitoring) on shared GPU-accelerated clusters.

If you’re coming from Slurm, most concepts map directly — see the Slurm migration guide for a quick comparison.

Prerequisites

  • A running Lattice cluster (ask your admin for the API endpoint)
  • The lattice CLI installed on your workstation or login node
  • Your tenant credentials (OIDC token or mTLS certificate)

Installing the CLI

# Determine architecture
ARCH=$(uname -m | sed 's/aarch64/arm64/')

# Download from GitHub Releases
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-${ARCH}.tar.gz" | tar xz
sudo mv lattice /usr/local/bin/

# Or build from source
cargo build --release -p lattice-cli
sudo cp target/release/lattice /usr/local/bin/

Configuration

Create ~/.config/lattice/config.yaml:

endpoint: "lattice-api.example.com:50051"
tenant: "my-team"
# Optional: default vCluster
vcluster: "gpu-batch"

Or use environment variables:

export LATTICE_ENDPOINT="lattice-api.example.com:50051"
export LATTICE_TENANT="my-team"

Your First Job

Submit a batch script

lattice submit train.sh
# Submitted allocation a1b2c3d4

Check status

lattice status
# ID        NAME           STATE    NODES  WALLTIME   ELAPSED    VCLUSTER
# a1b2c3d4  train.sh       Running  4      24:00:00   00:12:34   gpu-batch

View logs

lattice logs a1b2c3d4
# [2026-03-05T10:00:12Z] Epoch 1/100, loss=2.341
# [2026-03-05T10:01:45Z] Epoch 2/100, loss=1.892

Cancel a job

lattice cancel a1b2c3d4

Next Steps

Submitting Workloads

Basic Submission

# Run a script on 4 nodes for up to 24 hours
lattice submit --nodes=4 --walltime=24h train.sh

# With GPU constraints
lattice submit --nodes=8 --walltime=72h --constraint="gpu_type=GH200" -- torchrun train.py

# With a software environment (uenv)
lattice submit --nodes=2 --uenv=prgenv-gnu/24.11:v1 -- make -j run

Script Directives

Lattice parses #LATTICE directives from your script (and #SBATCH for compatibility):

#!/bin/bash
#LATTICE --nodes=64
#LATTICE --walltime=72h
#LATTICE --uenv=prgenv-gnu/24.11:v1
#LATTICE --vcluster=ml-training
#LATTICE --tenant=physics
#LATTICE --name=large-training-run

torchrun --nproc_per_node=4 train.py --data /scratch/dataset

Resource Constraints

# GPU type
lattice submit --constraint="gpu_type=GH200,gpu_count=4" script.sh

# Memory requirements
lattice submit --constraint="memory_gb>=512" script.sh

# Require unified memory (GH200/MI300A superchip)
lattice submit --constraint="require_unified_memory" script.sh

# Prefer same NUMA domain
lattice submit --constraint="prefer_same_numa" script.sh

Task Groups (Job Arrays)

Submit multiple instances of the same job:

# 100 tasks, 20 running concurrently
lattice submit --task-group=0-99%20 sweep.sh

# Task index available as $LATTICE_TASK_INDEX
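The task index makes parameter sweeps easy to express. A minimal Python sketch of what sweep.sh might invoke — the grid values and the row-major decoding are illustrative; only the LATTICE_TASK_INDEX variable comes from Lattice:

```python
import os

# Hypothetical sweep grid — replace with your own parameters.
LEARNING_RATES = [1e-4, 3e-4, 1e-3, 3e-3]
BATCH_SIZES = [32, 64, 128, 256, 512]

def params_for_task(index: int) -> dict:
    """Decode a flat task index into grid coordinates (row-major)."""
    lr = LEARNING_RATES[index % len(LEARNING_RATES)]
    bs = BATCH_SIZES[(index // len(LEARNING_RATES)) % len(BATCH_SIZES)]
    return {"lr": lr, "batch_size": bs}

if __name__ == "__main__":
    # Lattice exports the task index into the allocation's environment.
    index = int(os.environ.get("LATTICE_TASK_INDEX", "0"))
    print(params_for_task(index))
```

With `--task-group=0-99%20`, the 100 tasks sweep 100 distinct grid points, 20 at a time.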

Dependencies

# Run after job succeeds
lattice submit --depends-on=a1b2c3d4:success postprocess.sh

# Run after job completes (success or failure)
lattice submit --depends-on=a1b2c3d4:any cleanup.sh

# Multiple dependencies
lattice submit --depends-on=job1:success,job2:success merge.sh

Data Staging

Lattice can pre-stage data to the hot tier before your job starts:

lattice submit --data-mount="s3://bucket/dataset:/data" --nodes=4 train.sh

The scheduler evaluates data readiness as part of the cost function — jobs with data already on the hot tier are prioritized.

Lifecycle Types

Bounded (batch) — default

lattice submit --walltime=24h train.sh

Job runs until completion or walltime, then terminates.

Unbounded (service)

lattice submit --service --expose=8080 serve.sh

Runs indefinitely. Exposed ports are reachable via the network domain.

Reactive (autoscaling)

lattice submit --reactive --min-nodes=1 --max-nodes=8 \
  --scale-metric=gpu_utilization --scale-target=0.8 serve.sh

Automatically scales between min and max nodes based on the target metric.
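A target-tracking controller of roughly this shape is one plausible way to derive the node count from the metric and target — the formula is an assumption for illustration, not Lattice's actual scaling policy:

```python
import math

def desired_nodes(current_nodes: int, metric: float, target: float,
                  min_nodes: int, max_nodes: int) -> int:
    """Target-tracking sketch: scale node count proportionally to
    metric/target, then clamp to the [min, max] range."""
    if metric <= 0:
        return min_nodes
    raw = math.ceil(current_nodes * metric / target)
    return max(min_nodes, min(max_nodes, raw))
```

At 4 nodes with GPU utilization 0.95 against a 0.8 target, this asks for 5 nodes; at the target exactly, it holds steady.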

Preemption Classes

Higher preemption class = harder to preempt:

# Best-effort (preempted first)
lattice submit --preemption-class=0 experiment.sh

# Normal priority (default: 5)
lattice submit train.sh

# High priority
lattice submit --preemption-class=8 critical-training.sh
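One way to picture victim selection — ascending class first, with shortest elapsed time as an illustrative tiebreaker (the real scheduler's cost function weighs more signals than this):

```python
def preemption_order(allocations: list[dict]) -> list[dict]:
    """Sort preemption candidates: lowest class first; among equals,
    the allocation with the least elapsed work is preempted first.
    Field names are illustrative."""
    return sorted(allocations,
                  key=lambda a: (a["preemption_class"], a["elapsed_s"]))
```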

Checkpointing

If your application supports checkpointing, declare it:

# Signal-based (receives SIGUSR1 before preemption)
lattice submit --checkpoint=signal train.sh

# gRPC callback
lattice submit --checkpoint=grpc --checkpoint-port=9999 train.sh

# Shared memory flag
lattice submit --checkpoint=shmem train.sh

# Non-preemptible (no checkpoint, never preempted)
lattice submit --no-preempt train.sh
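From the application's side, the signal-based protocol amounts to catching SIGUSR1 and writing state at the next safe point. A minimal Python sketch — the checkpoint path and state shape are placeholders; only the SIGUSR1 delivery is specified above:

```python
import json
import pathlib
import signal

CHECKPOINT_PATH = pathlib.Path("/scratch/ckpt/state.json")  # illustrative path

state = {"epoch": 0}          # stand-in for real training state
checkpoint_requested = False

def on_sigusr1(signum, frame):
    # Only set a flag here: an async signal handler is no place for file I/O.
    global checkpoint_requested
    checkpoint_requested = True

signal.signal(signal.SIGUSR1, on_sigusr1)

def maybe_checkpoint(path: pathlib.Path = CHECKPOINT_PATH) -> bool:
    """Call at a safe point each iteration; writes state if a signal arrived."""
    global checkpoint_requested
    if not checkpoint_requested:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state))
    checkpoint_requested = False
    return True
```

Call maybe_checkpoint() once per training step; when Lattice sends SIGUSR1 ahead of preemption, the next step persists state before the job is suspended.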

Slurm Compatibility

Existing Slurm scripts work with minimal changes:

# These are equivalent
sbatch --nodes=4 --time=24:00:00 --partition=gpu train.sh
lattice submit --nodes=4 --walltime=24h --vcluster=gpu train.sh

Supported #SBATCH directives are automatically translated. See Slurm Migration for details.

Output Formats

# Default: human-readable table
lattice status

# JSON (for scripting)
lattice status -o json

# YAML
lattice status -o yaml

# Wide (more columns)
lattice status -o wide

Interactive Sessions

Interactive sessions give you a terminal attached to allocated compute nodes — similar to salloc + srun --pty in Slurm.

Creating a Session

# Basic interactive session (1 node, 4 hours)
lattice session --walltime=4h

# With GPU and software environment
lattice session --nodes=1 --constraint="gpu_type=GH200" --uenv=prgenv-gnu/24.11:v1

# Specify vCluster
lattice session --vcluster=interactive --walltime=2h

The session enters the queue like any other allocation. Once scheduled, your terminal automatically attaches to the first node.

Attaching to Running Allocations

You can attach a terminal to any running allocation (not just sessions):

# Attach to a running job
lattice attach a1b2c3d4

# Attach to a specific node in a multi-node allocation
lattice attach a1b2c3d4 --node=nid001234

# Run a specific command instead of a shell
lattice attach a1b2c3d4 -- htop

Multiple Terminals

You can open multiple terminals to the same allocation:

# Terminal 1
lattice attach a1b2c3d4

# Terminal 2 (different shell window)
lattice attach a1b2c3d4

Session Lifecycle

  • Pending — waiting in the queue for resources
  • Running — terminal is attached, you’re working
  • Disconnected — if you lose connection, the session keeps running (use tmux/screen inside for persistence)
  • Completed — walltime expired or you exited

Tips

  • Use tmux or screen inside your session for disconnect resilience
  • Sessions respect the same preemption rules as batch jobs — use --preemption-class=7 for important interactive work
  • If preempted, you’ll see checkpoint progress in your terminal before disconnection
  • The --walltime flag is mandatory for sessions (prevents runaway resource usage)

DAG Workflows

DAGs (Directed Acyclic Graphs) let you define multi-step pipelines where allocations depend on each other.

YAML Definition

# workflow.yaml
name: training-pipeline
allocations:
  - name: preprocess
    entrypoint: "python preprocess.py"
    nodes: 2
    walltime: "2h"

  - name: train
    entrypoint: "torchrun train.py"
    nodes: 64
    walltime: "72h"
    uenv: "prgenv-gnu/24.11:v1"
    depends_on:
      - preprocess: success

  - name: evaluate
    entrypoint: "python eval.py"
    nodes: 1
    walltime: "1h"
    depends_on:
      - train: success

  - name: notify-failure
    entrypoint: "python notify.py --status=failed"
    nodes: 1
    walltime: "10m"
    depends_on:
      - train: failure

Submitting a DAG

lattice dag submit workflow.yaml
# Submitted DAG d1e2f3g4 with 4 allocations

Dependency Conditions

Condition      Meaning
success        Run after dependency completes successfully
failure        Run after dependency fails
any            Run after dependency completes (success or failure)
corresponding  For task groups: task N depends on task N of the parent

Monitoring DAGs

# DAG status overview
lattice dag status d1e2f3g4

# Detailed graph view
lattice dag status d1e2f3g4 --graph

# Output:
# preprocess [Completed] → train [Running] → evaluate [Pending]
#                                          ↘ notify-failure [Pending]

Cancelling a DAG

# Cancel all allocations in the DAG
lattice dag cancel d1e2f3g4

Cancellation cascades — downstream allocations that haven’t started are cancelled automatically.

Failure Propagation

  • If a success dependency fails, downstream allocations are cancelled
  • If a failure dependency succeeds, those downstream allocations are skipped
  • any dependencies always run regardless of upstream outcome

Limits

  • Maximum 1000 allocations per DAG (configurable by admin)
  • Cycles are rejected at submission time
  • Duplicate allocation names within a DAG are rejected
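Cycle rejection at submission time can be done with a standard topological sort (Kahn's algorithm). A sketch of such a check, assuming the DAG is reduced to a simple name → dependencies map (the function and shape are illustrative, not Lattice's internals):

```python
from collections import deque

def validate_dag(deps: dict[str, list[str]]) -> None:
    """Raise ValueError if `deps` (allocation name -> names it depends on)
    references unknown allocations or contains a cycle."""
    indegree = {name: 0 for name in deps}
    children: dict[str, list[str]] = {name: [] for name in deps}
    for node, parents in deps.items():
        for parent in parents:
            if parent not in deps:
                raise ValueError(f"unknown dependency: {parent}")
            indegree[node] += 1
            children[parent].append(node)
    # Kahn's algorithm: repeatedly remove nodes with no unmet dependencies.
    ready = deque(n for n, d in indegree.items() if d == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if visited != len(deps):
        raise ValueError("cycle detected")
```

The training-pipeline example above passes; any back-edge (e.g. preprocess depending on evaluate) leaves unvisited nodes and is rejected.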

Monitoring & Observability

Allocation Status

# Your allocations
lattice status

# Specific allocation
lattice status a1b2c3d4

# Filter by state
lattice status --state=running
lattice status --state=pending

# All tenant allocations (requires permissions)
lattice status --all

# Watch mode (refreshes every 5 seconds)
lattice status --watch
lattice watch a1b2c3d4

Logs

# View logs (from S3 persistent store)
lattice logs a1b2c3d4

# Live tail (streaming)
lattice logs a1b2c3d4 --follow

# Last N lines
lattice logs a1b2c3d4 --tail=100

Metrics

Query metrics for a running allocation:

# Snapshot of current metrics
lattice metrics a1b2c3d4

# Output:
# METRIC            VALUE    UNIT
# gpu_utilization   87.3     %
# gpu_memory_used   71.2     GB
# cpu_utilization   45.1     %
# memory_used       384.0    GB
# network_rx        12.4     GB/s
# network_tx        8.7      GB/s

Live metrics stream:

lattice metrics a1b2c3d4 --stream

Diagnostics

Combined view of network and storage health for an allocation:

lattice diagnostics a1b2c3d4

# Network diagnostics only
lattice diagnostics a1b2c3d4 --network

# Storage diagnostics only
lattice diagnostics a1b2c3d4 --storage

Cross-Allocation Comparison

Compare metrics between two allocations (useful for A/B experiments):

lattice compare a1b2c3d4 e5f6g7h8 --metric=gpu_utilization

Cluster Overview

# List all nodes
lattice nodes

# Filter by state
lattice nodes --state=ready
lattice nodes --state=draining

# Specific node details
lattice nodes nid001234

Python SDK

The Lattice Python SDK provides an async client for interacting with the REST API from notebooks, scripts, and autonomous agents.

Installation

pip install lattice-sdk

Quick Start

import asyncio
from lattice_sdk import LatticeClient, AllocationSpec

async def main():
    async with LatticeClient("lattice-api.example.com", 8080) as client:
        # Submit an allocation
        alloc = await client.submit(AllocationSpec(
            entrypoint="python train.py",
            nodes=4,
            walltime="24h",
            tenant="ml-team",
        ))
        print(f"Submitted: {alloc.id}")

        # Check status
        status = await client.status(alloc.id)
        print(f"State: {status.state}")

        # Wait for completion
        async for event in client.watch(alloc.id):
            print(f"State changed: {event.state}")
            if event.state in ("Completed", "Failed", "Cancelled"):
                break

asyncio.run(main())

Core Methods

Submission

# Basic submission
alloc = await client.submit(AllocationSpec(
    entrypoint="torchrun train.py",
    nodes=64,
    walltime="72h",
    uenv="prgenv-gnu/24.11:v1",
    constraints={"gpu_type": "GH200"},
))

# Submit DAG
dag = await client.submit_dag("workflow.yaml")

Status & Listing

# Get allocation
alloc = await client.status(alloc_id)

# List allocations
allocs = await client.list_allocations(state="running")

# List nodes
nodes = await client.list_nodes(state="ready")

Monitoring

# Stream logs
async for line in client.stream_logs(alloc_id):
    print(line.message)

# Query metrics
metrics = await client.query_metrics(alloc_id)
print(f"GPU util: {metrics.gpu_utilization}%")

# Stream metrics
async for snapshot in client.stream_metrics(alloc_id):
    print(f"GPU: {snapshot.gpu_utilization}%")

# Watch state changes
async for event in client.watch(alloc_id):
    print(f"State: {event.state}")

Management

# Cancel
await client.cancel(alloc_id)

# Checkpoint
await client.checkpoint(alloc_id)

Tenants & vClusters

tenants = await client.list_tenants()
vclusters = await client.list_vclusters()

Error Handling

from lattice_sdk import LatticeError, LatticeNotFoundError, LatticeAuthError

try:
    alloc = await client.status("nonexistent-id")
except LatticeNotFoundError:
    print("Allocation not found")
except LatticeAuthError:
    print("Authentication failed")
except LatticeError as e:
    print(f"API error ({e.status_code}): {e}")

Authentication

# Token-based (OIDC)
client = LatticeClient("api.example.com", 8080, token="eyJ...")

# Headers
client = LatticeClient("api.example.com", 8080, headers={"X-Tenant": "my-team"})

Slurm Migration

Command Mapping

Slurm             Lattice                   Notes
sbatch script.sh  lattice submit script.sh  #SBATCH directives are parsed
squeue            lattice status
squeue -u $USER   lattice status            Default shows own jobs
scancel 12345     lattice cancel 12345
salloc            lattice session           Interactive allocation
srun --pty bash   lattice attach <id>       Attach terminal
sinfo             lattice nodes             Cluster node overview
sacct             lattice status --all      Historical view

Directive Mapping

#SBATCH Directive        Lattice Equivalent          Notes
--nodes=N                --nodes=N                   Exact match
--ntasks=N               —                           Mapped to node count: ceil(N / tasks_per_node)
--ntasks-per-node=N      —                           Passed as task config
--time=HH:MM:SS          --walltime=HH:MM:SS         Also accepts 24h, 30m shorthand
--partition=X            --vcluster=X                Configurable partition→vCluster mapping
--account=X              --tenant=X                  Account→tenant mapping
--job-name=X             --name=X
--output=file            —                           Logs always go to persistent store; download path configurable
--error=file             —                           Same as --output
--constraint=X           --constraint=X              Feature matching
--gres=gpu:N             --constraint="gpu_count=N"
--qos=X                  --preemption-class=N        Configurable QOS→class mapping
--array=0-99%20          --task-group=0-99%20
--dependency=afterok:ID  --depends-on=ID:success
--exclusive              Default                     Lattice always allocates full nodes
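The two non-trivial translations — rounding --ntasks up to whole nodes, and normalizing walltime — can be sketched as follows (tasks_per_node, the regex, and the HH:MM:SS-only parser are simplifying assumptions; real Slurm --time accepts more forms):

```python
import math
import re

def ntasks_to_nodes(ntasks: int, tasks_per_node: int) -> int:
    """--ntasks=N has no direct equivalent; round up to whole nodes."""
    return math.ceil(ntasks / tasks_per_node)

def slurm_time_to_walltime(value: str) -> str:
    """Pass shorthand like 24h/30m through; normalize HH:MM:SS."""
    if re.fullmatch(r"\d+[hms]", value):
        return value
    h, m, s = (int(part) for part in value.split(":"))
    return f"{h:02d}:{m:02d}:{s:02d}"
```

So --ntasks=9 on 4-task nodes becomes --nodes=3, and --time=1:30:00 becomes --walltime=01:30:00.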

Environment Variables

When Slurm compatibility is enabled (compat.set_slurm_env: true), Lattice sets familiar environment variables inside allocations:

Variable          Value
SLURM_JOB_ID      Allocation ID
SLURM_JOB_NAME    Allocation name
SLURM_NNODES      Number of allocated nodes
SLURM_NODELIST    Comma-separated node list
SLURM_NTASKS      Task count
SLURM_SUBMIT_DIR  Working directory at submission

Lattice also sets its own LATTICE_* equivalents.

What’s Different

Full-Node Scheduling

Lattice always allocates full nodes (no sub-node sharing). This simplifies resource management and improves performance isolation. If you’re used to --ntasks=1 on a shared node, you’ll get the whole node.

No Partitions — vClusters

Slurm partitions map to Lattice vClusters, but vClusters are more flexible: each has its own scheduling policy (backfill, bin-pack, FIFO, reservation) and weight tuning.

Topology-Aware Placement

Lattice automatically packs multi-node jobs within the same Slingshot dragonfly group for optimal network performance. No manual --switches needed.

Data Staging

Lattice can pre-stage data during queue wait time. Add --data-mount="s3://bucket/data:/data" and the scheduler factors data locality into placement decisions.

Checkpointing

Unlike Slurm’s --requeue, Lattice coordinates checkpointing before preemption. Declare --checkpoint=signal and your job receives SIGUSR1 before being suspended.

Migration Steps

  1. Start with existing scripts — #SBATCH directives work out of the box
  2. Replace sbatch/squeue/scancel with lattice submit/status/cancel
  3. Gradually adopt native features — data staging, checkpointing, DAGs, uenv
  4. Tune scheduling weights — use the RM-Replay simulator for A/B comparison

Deployment & Administration

Architecture Overview

A Lattice deployment consists of:

  • 3-5 quorum members — Raft consensus nodes running lattice-server
  • N compute nodes — each running lattice-agent
  • VictoriaMetrics (or compatible TSDB) — telemetry storage
  • S3-compatible storage — checkpoint and log persistence
  • VAST (optional) — data staging and QoS

Deployment Methods

Docker Compose (dev/test)

cd infra/docker
docker compose up -d

This starts a 3-node quorum with VictoriaMetrics. See infra/docker/docker-compose.yml.

Systemd (production)

Download binaries from GitHub Releases and install:

ARCH=$(uname -m | sed 's/aarch64/arm64/')

# Server (quorum members)
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-server-${ARCH}.tar.gz" | tar xz
sudo mv lattice-server /usr/local/bin/
sudo cp infra/systemd/lattice-server.service /etc/systemd/system/
sudo cp config/production.yaml /etc/lattice/config.yaml
sudo systemctl enable --now lattice-server

# Agent (compute nodes) — single binary per architecture, all GPU support included
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-agent-${ARCH}.tar.gz" | tar xz
sudo mv lattice-agent /usr/local/bin/
sudo cp infra/systemd/lattice-agent.service /etc/systemd/system/
sudo systemctl enable --now lattice-agent

Configuration

Example configs are in config/:

File                    Purpose
config/minimal.yaml     Single-node dev mode, no optional features
config/production.yaml  Full reference with all sections documented

See the production config for every option with explanations.

Required Sections

  • quorum — Raft node ID, peers, data directory
  • api — gRPC and REST listen addresses
  • storage — S3 endpoint, NFS paths
  • telemetry — TSDB endpoint, aggregation mode

Optional Sections

  • node_agent — heartbeat timing, grace periods
  • network — VNI pool range for Slingshot
  • checkpoint — checkpoint evaluation and timeout tuning
  • scheduling — cycle interval, backfill depth
  • accounting — Waldur integration (requires accounting feature)
  • rate_limit — per-user API rate limiting
  • federation — Sovra cross-site federation (requires federation feature)
  • compat — Slurm compatibility settings

Authentication & Authorization

Overview

Lattice authenticates three types of callers:

Caller               Auth method                              Token source
Humans (CLI)         OIDC (PKCE flow) → RS256 JWT             IdP (Keycloak, Dex)
Agents (node agent)  mTLS (production) or Bearer token (dev)  SPIRE SVID / bootstrap certs / LATTICE_AGENT_TOKEN
Services (AI/MCP)    OIDC (client_credentials) → RS256 JWT    IdP service account

Server OIDC Configuration

api:
  oidc_issuer: "https://keycloak.example.com/realms/hpc"   # IdP discovery URL
  oidc_client_id: "lattice"                                 # Expected `aud` claim
  # oidc_hmac_secret: "dev-secret-only"                     # HMAC fallback (dev only)

Config field          Env var                   Purpose
api.oidc_issuer       —                         OIDC provider URL. Enables JWKS (RS256/ES256) validation.
api.oidc_client_id    —                         Expected aud claim. Returned by auth discovery endpoint.
api.oidc_hmac_secret  LATTICE_OIDC_HMAC_SECRET  Shared secret for HS256 validation (dev/testing/break-glass).

Priority: JWKS (if oidc_issuer set) > HMAC (if secret set) > no auth (warning logged).

The auth discovery endpoint GET /api/v1/auth/discovery is public (no auth required) and returns {idp_url, client_id, issuer} so the CLI can bootstrap login.

Roles

Role derivation checks OIDC scopes first, then cross-system role claims (pact_role, lattice_role). First match wins.

Role          OIDC scope                          Cross-system claim                   Permissions
SystemAdmin   admin or system:admin               pact-platform-admin or system-admin  Unrestricted — all operations
TenantAdmin   tenant:admin                        tenant-admin                         Manage own tenant’s allocations, vClusters, quotas. Drain nodes. Query audit.
Operator      operator                            operator                             Drain/undrain/disable/enable nodes. Cannot create tenants or manage federation.
ClaimingUser  sensitive:claim                     —                                    User + claim/release sensitive nodes
ReadOnly      readonly                            —                                    GET/LIST/WATCH only, no mutations
User          (default — any authenticated user)  —                                    Submit/cancel own allocations, view nodes, create sessions
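A sketch of the first-match derivation, with scope and claim names taken from the table above (the ordering within each list beyond "first match wins" is an assumption):

```python
# Checked in order; scopes take priority over cross-system claims.
SCOPE_ROLES = [
    ({"admin", "system:admin"}, "SystemAdmin"),
    ({"tenant:admin"}, "TenantAdmin"),
    ({"operator"}, "Operator"),
    ({"sensitive:claim"}, "ClaimingUser"),
    ({"readonly"}, "ReadOnly"),
]
CLAIM_ROLES = [
    ({"pact-platform-admin", "system-admin"}, "SystemAdmin"),
    ({"tenant-admin"}, "TenantAdmin"),
    ({"operator"}, "Operator"),
]

def derive_role(scopes: set[str], claims: set[str]) -> str:
    """First matching scope wins; claims are consulted only if no scope matched."""
    for wanted, role in SCOPE_ROLES:
        if scopes & wanted:
            return role
    for wanted, role in CLAIM_ROLES:
        if claims & wanted:
            return role
    return "User"  # default for any authenticated caller
```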

IdP Setup (Keycloak / Dex)

Configure your IdP to include the appropriate scopes in issued tokens:

Keycloak:

  1. Create client lattice with PKCE (Authorization Code) flow
  2. Create client scopes: admin, tenant:admin, operator, sensitive:claim, readonly
  3. Assign scopes to users/groups via role mappings
  4. For pact+lattice co-deployment: add pact_role as a custom claim in the token mapper

Dex:

staticClients:
  - id: lattice
    name: Lattice Scheduler
    redirectURIs: ['http://localhost:8400/callback']
    public: true   # PKCE, no client secret

Dex passes through upstream IdP claims. Configure pact_role / scopes in the upstream IdP (LDAP groups, SAML attributes, etc.).

Agent Authentication

Node agents authenticate to lattice-server for registration and heartbeats.

Production (mTLS): Agent acquires identity via the cascade: SPIRE → SelfSigned CA → Bootstrap certs. The gRPC channel uses ClientTlsConfig with the acquired cert/key/CA. Server verifies the client certificate.

# Bootstrap cert path (used until SPIRE is available)
lattice-agent \
  --quorum-endpoint=https://lattice-01:50051 \
  --bootstrap-cert=/etc/lattice/tls/agent.crt \
  --bootstrap-key=/etc/lattice/tls/agent.key \
  --bootstrap-ca=/etc/lattice/tls/ca.crt \
  ...

Dev/testing (Bearer token): When no mTLS identity is available, agent falls back to LATTICE_AGENT_TOKEN.

LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
  --quorum-endpoint=http://lattice-01:50051 \
  ...

Env var                 Purpose
LATTICE_AGENT_TOKEN     Bearer token for agent→server auth (dev/testing/break-glass)
LATTICE_SPIRE_SOCKET    SPIRE agent socket path (default: /run/spire/agent.sock)
LATTICE_BOOTSTRAP_CERT  Bootstrap cert PEM path
LATTICE_BOOTSTRAP_KEY   Bootstrap key PEM path
LATTICE_BOOTSTRAP_CA    Bootstrap CA PEM path

mTLS takes priority. Token auth is the fallback. In production, leave LATTICE_AGENT_TOKEN unset.

Quorum Management

Initial Bootstrap

The first quorum member initializes the Raft cluster using the --bootstrap flag. This flag must only be passed once — on the very first startup of node 1. All subsequent restarts (including systemd restarts) omit it.

# First-ever start of node 1 — initializes the Raft cluster:
lattice-server --config /etc/lattice/server.yaml --bootstrap

# All subsequent restarts — no --bootstrap:
lattice-server --config /etc/lattice/server.yaml
# (or via systemd, which never passes --bootstrap)

Configure peers in each node’s config:

quorum:
  node_id: 1
  data_dir: /var/lib/lattice/raft
  peers:
    - id: 2
      address: "lattice-02:9000"
    - id: 3
      address: "lattice-03:9000"

Nodes 2 and 3 never need --bootstrap — they join via Raft membership replication from the leader.

Raft Status

curl http://lattice-01:8080/api/v1/raft/status

Backup & Restore

# Create backup
curl -X POST http://lattice-01:8080/api/v1/admin/backup

# Verify backup integrity
curl http://lattice-01:8080/api/v1/admin/backup/verify

# Restore (requires restart)
curl -X POST http://lattice-01:8080/api/v1/admin/restore \
  -d '{"path": "/var/lib/lattice/backups/backup-20260305T120000Z.tar.gz"}'

Node Management

Agent Registration

Agents register automatically on startup. Authentication uses mTLS (production) or Bearer token (dev/testing):

# Production: mTLS via bootstrap certs (SPIRE preferred when available)
lattice-agent \
  --node-id=nid001234 \
  --quorum-endpoint=https://lattice-01:50051 \
  --bootstrap-cert=/etc/lattice/tls/agent.crt \
  --bootstrap-key=/etc/lattice/tls/agent.key \
  --bootstrap-ca=/etc/lattice/tls/ca.crt \
  --gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512

# Dev/testing: Bearer token auth (no certs needed)
LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
  --node-id=nid001234 \
  --quorum-endpoint=http://lattice-01:50051 \
  --gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512

The agent tries the identity cascade (SPIRE → SelfSigned → Bootstrap) first. If no mTLS identity is available, it falls back to LATTICE_AGENT_TOKEN.

Draining Nodes

The drain lifecycle is: Ready → Draining → Drained → Ready.

# Drain a node (existing jobs complete, no new jobs scheduled)
lattice admin drain nid001234 --reason="maintenance"

# If no active allocations, node goes directly to Drained.
# If allocations are running, node stays in Draining until they complete.
# The scheduler loop automatically transitions Draining → Drained.

# Undrain (only works from Drained state)
lattice admin undrain nid001234

Undrain only works when the node is in Drained state. If the node is still Draining (allocations running), wait for them to complete or cancel them first.
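The transitions above can be sketched as a small state machine (state names come from the docs; the function boundaries are illustrative):

```python
def drain(state: str, active_allocations: int) -> str:
    """Admin drain: Ready -> Drained if idle, else Ready -> Draining."""
    if state != "Ready":
        raise ValueError(f"cannot drain from {state}")
    return "Drained" if active_allocations == 0 else "Draining"

def scheduler_tick(state: str, active_allocations: int) -> str:
    """The scheduler loop moves Draining -> Drained once the last job ends."""
    if state == "Draining" and active_allocations == 0:
        return "Drained"
    return state

def undrain(state: str) -> str:
    """Undrain is only valid from Drained, returning the node to Ready."""
    if state != "Drained":
        raise ValueError(f"cannot undrain from {state}")
    return "Ready"
```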

Node States

State     Meaning
Ready     Available for scheduling
Draining  No new jobs; existing jobs continue
Drained   All jobs finished; eligible for undrain
Down      Heartbeat lost beyond grace period
Degraded  Heartbeat late but within grace period
Claimed   Reserved for sensitive workload

Tenant Management

# Create a tenant
lattice admin tenant create --name="physics" --max-nodes=100

# List tenants
lattice admin tenant list

# Update quota
lattice admin tenant update physics --max-nodes=200

TLS Configuration

Server TLS

api:
  tls_cert: /etc/lattice/tls/server.crt
  tls_key: /etc/lattice/tls/server.key

Mutual TLS (mTLS)

api:
  tls_cert: /etc/lattice/tls/server.crt
  tls_key: /etc/lattice/tls/server.key
  tls_ca: /etc/lattice/tls/ca.crt  # Require client certificates

Feature Flags

Compile-time features control optional integrations:

Feature     Crate               Enables
oidc        lattice-api         JWT/OIDC token validation
accounting  lattice-api         Waldur billing integration
federation  lattice-api         Sovra cross-site federation
nvidia      lattice-node-agent  NVIDIA GPU discovery (nvml-wrapper)
rocm        lattice-node-agent  AMD GPU discovery (rocm-smi)
ebpf        lattice-node-agent  eBPF kernel telemetry (Linux only)

Pre-built release binaries ship with all features enabled. GPU libraries are loaded at runtime — nodes without GPUs simply report no GPU hardware. To build from source:

# Server with all features
cargo build --release -p lattice-api --all-features

# Agent with all features
cargo build --release -p lattice-node-agent --all-features

Release Artifacts

Artifact                      Architecture  GPU Support
lattice-server-x86_64.tar.gz  x86_64        n/a
lattice-server-arm64.tar.gz   arm64         n/a
lattice-x86_64.tar.gz         x86_64        n/a (CLI)
lattice-arm64.tar.gz          arm64         n/a (CLI)
lattice-agent-x86_64.tar.gz   x86_64        NVIDIA + AMD ROCm + eBPF
lattice-agent-arm64.tar.gz    arm64         NVIDIA + AMD ROCm + eBPF
rm-replay-x86_64.tar.gz       x86_64        n/a
rm-replay-arm64.tar.gz        arm64         n/a

GPU discovery is automatic at runtime. The agent detects available hardware and uses the appropriate provider:

Hardware                    Discovery Method                           Runtime Dependency
NVIDIA (H100, A100, GH200)  nvml-wrapper (libnvidia-ml.so via dlopen)  NVIDIA driver installed
AMD (MI300X, MI250)         rocm-smi CLI                               ROCm toolkit installed
CPU-only nodes              No GPU discovery runs                      None

GCP Test Cluster

For integration testing without production hardware:

# 1. Build Packer image (once, ~5 min)
cd infra/gcp/packer
packer build -var project_id=YOUR_PROJECT lattice-compute.pkr.hcl

# 2. Provision infrastructure (~2 min)
cd infra/gcp
terraform apply -var="project_id=YOUR_PROJECT" -var="use_packer_image=true"

# 3. Build + bundle binaries
cargo build --release --target x86_64-unknown-linux-gnu
./scripts/deploy/make-provision-bundle.sh target/x86_64-unknown-linux-gnu/release /tmp/lattice-provision.tar.gz

# 4. Deploy to nodes (SCP bundle + run install scripts)
# See scripts/deploy/install-quorum.sh and install-compute.sh

# 5. Run validation test matrix
./scripts/deploy/validate.sh http://QUORUM1_IP:8080 x1000c0s0b0n0,x1000c0s0b0n1

# 6. Teardown
cd infra/gcp && terraform destroy

The test cluster includes: 3 quorum nodes, 2 compute nodes (with podman + squashfs-tools), 1 OCI registry, 1 VictoriaMetrics. The validate.sh script runs 15 tests covering health, auth, submit, drain, restart, and validation.

Deploy scripts (scripts/deploy/install-*.sh) are reusable on-prem — no GCP-specific logic.

Cluster Monitoring & Observability

Prometheus Metrics

Lattice exposes Prometheus-compatible metrics at GET /metrics on the REST port (default 8080).

Key Metrics

Metric                                      Type       Description
lattice_allocations_total                   Counter    Total allocations by state
lattice_allocations_active                  Gauge      Currently running allocations
lattice_scheduling_cycle_duration_seconds   Histogram  Scheduling cycle latency
lattice_scheduling_placements_total         Counter    Successful placements
lattice_scheduling_preemptions_total        Counter    Preemption events
lattice_raft_commit_latency_seconds         Histogram  Raft commit latency
lattice_raft_sensitive_audit_entries_total  Counter    Sensitive audit log entries
lattice_api_request_duration_seconds        Histogram  API request latency
lattice_api_requests_total                  Counter    API requests by method and status
lattice_nodes_total                         Gauge      Nodes by state
lattice_checkpoint_duration_seconds         Histogram  Checkpoint operation latency

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'lattice'
    static_configs:
      - targets:
        - 'lattice-01:8080'
        - 'lattice-02:8080'
        - 'lattice-03:8080'

Grafana Dashboards

Pre-built dashboards are in infra/grafana/dashboards/:

  • Cluster Overview — node states, allocation throughput, queue depth
  • Scheduling Performance — cycle latency, placement rate, preemption rate
  • Raft Health — commit latency, leader elections, log compaction
  • Per-Tenant Usage — resource consumption, fair-share deficit

Import via Grafana UI or provision from infra/grafana/provisioning/.

Alerting Rules

Pre-configured alerting rules in infra/alerting/:

Alert                      Condition
LatticeRaftNoLeader        No Raft leader for > 30s
LatticeNodeDown            Node heartbeat lost for > 5m
LatticeSchedulingStalled   No placements for > 10m with pending jobs
LatticeHighPreemptionRate  > 10 preemptions/minute
LatticeCheckpointFailure   Checkpoint success rate < 90%
LatticeDiskSpaceLow        Raft data directory > 80% full

TSDB Integration

Lattice pushes per-node telemetry to VictoriaMetrics (or any Prometheus-compatible remote write endpoint).

telemetry:
  tsdb_endpoint: "http://victoriametrics:8428"
  prod_interval_seconds: 30

Telemetry includes CPU, memory, GPU utilization, network I/O, and disk I/O per node.

Audit Log

Sensitive workload operations are recorded in the Raft-committed audit log:

# Query audit log
curl "http://lattice-01:8080/api/v1/audit?tenant=sensitive-team&from=2026-03-01"

Audit entries include: node claims/releases, allocation lifecycle events, and access log entries. Retention: 7 years (configurable).
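
The same filtering the ?tenant= and ?from= query parameters perform server-side can be sketched client-side. The entry shape below is a hypothetical simplification; the real audit schema may differ.

```python
from datetime import date

def filter_audit(entries, tenant=None, from_date=None):
    """Client-side filter mirroring the ?tenant=&from= query semantics."""
    out = []
    for e in entries:
        if tenant is not None and e.get("tenant") != tenant:
            continue
        if from_date is not None and date.fromisoformat(e["timestamp"][:10]) < from_date:
            continue
        out.append(e)
    return out

entries = [
    {"tenant": "sensitive-team", "timestamp": "2026-03-02T10:00:00Z", "event": "node_claim"},
    {"tenant": "ml-team",        "timestamp": "2026-02-20T09:00:00Z", "event": "alloc_start"},
]
print(filter_audit(entries, tenant="sensitive-team", from_date=date(2026, 3, 1)))
```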

Health Check

curl http://lattice-01:8080/healthz
# {"status": "ok"}

Used by Docker/Kubernetes health probes and load balancers.

Managing Sensitive Workloads

Sensitive workloads (financial, defense, regulated research) require strict isolation, auditing, and data handling. Lattice provides a dedicated scheduling mode for these workloads.

How It Works

  1. User claims nodes — not the scheduler. The user’s identity is recorded as the owner in the Raft audit log.
  2. Full isolation — claimed nodes run only the owner’s workloads. No sharing.
  3. Hardened OS — OpenCHAMI provisions a hardened boot image for claimed nodes.
  4. Encrypted storage — a dedicated encrypted pool is assigned. All access is logged.
  5. Signed software only — only vulnerability-scanned, signed uenv images are allowed.
  6. Wipe on release — when the claim ends, storage is crypto-erased and nodes are re-provisioned.

Submitting Sensitive Workloads

# Submit to the sensitive vCluster
lattice submit --vcluster=sensitive --nodes=4 --walltime=168h analysis.sh

The sensitive scheduler uses a reservation model (not backfill). Priority is fixed at the highest level; the only tiebreaker is conformance fitness.

Node Claiming

Sensitive allocations claim specific nodes. Once claimed:

  • Nodes are exclusively owned by the claiming user
  • The claim is Raft-committed with the user’s identity
  • No other workloads (even from the same tenant) can run on claimed nodes

Audit Trail

Every sensitive operation is logged:

# Query sensitive audit entries
curl "http://lattice-01:8080/api/v1/audit?scope=sensitive"

Logged events:

  • Node claim / release
  • Allocation start / completion
  • Data access (read/write operations)
  • Software image loads
  • Storage wipe confirmation

Retention: 7 years (per regulatory requirements).

Network Isolation

Sensitive allocations get a unique Slingshot VNI (network domain). Ingress and egress are denied except to the designated data gateway. With Ultra Ethernet, wire-level encryption is enabled.

Admin Responsibilities

  • Provision hardened images via OpenCHAMI for sensitive nodes
  • Maintain signed uenv registry — only approved images should be signed
  • Monitor audit log — set up alerting for unexpected access patterns
  • Test wipe procedures — verify crypto-erase completes on node release
  • Designate sensitive-capable nodes — not all nodes need to support sensitive workloads

Configuration

No special server configuration is needed. The sensitive scheduler is a built-in vCluster type. Create a sensitive vCluster:

lattice admin vcluster create \
  --name=sensitive \
  --scheduler-type=sensitive-reservation \
  --description="Regulated workloads with full isolation"

System Architecture

Overview

Lattice has a seven-layer architecture in which each layer has a clear responsibility and communicates with adjacent layers via defined interfaces.

┌─ User Plane ───────────────────────────────────────────────────┐
│  lattice-cli + lattice-api (OIDC via hpc-auth)                 │
│  ├── Job lifecycle (submit, monitor, cancel)                   │
│  ├── Interactive sessions (WebSocket terminal)                 │
│  ├── Data management (stage, browse, transfer)                 │
│  ├── uenv management (list, pull, test)                        │
│  ├── Observability (attach, logs, metrics, diagnostics)        │
│  └── Sensitive: user-level node claim/release                  │
└───────────────────────────┬────────────────────────────────────┘
                            │
┌─ Software Plane ──────────┴────────────────────────────────────┐
│  Default: uenv (squashfs + mount namespace)                    │
│  Optional: OCI/Sarus (isolation, third-party images)           │
│  Registry: JFrog/Nexus → S3 backing (VAST hot tier)            │
│  Node-local NVMe image cache (optional)                        │
│  Sensitive: signed images only, vulnerability-scanned          │
└───────────────────────────┬────────────────────────────────────┘
                            │
┌─ Scheduling Plane ────────┴────────────────────────────────────┐
│  Quorum (Raft, 3-5 replicas)                                   │
│  Strong: (1) node ownership  (2) sensitive audit log           │
│  Eventual: job queues, telemetry, quotas                       │
│                                                                │
│  vCluster Schedulers:                                          │
│  ├── HPC: backfill + dragonfly group packing                   │
│  ├── Service: bin-pack + autoscale                             │
│  ├── Sensitive: user-claim reservation, dedicated nodes        │
│  └── Interactive: FIFO, short-lived, node-sharing via Sarus    │
└───────────────────────────┬────────────────────────────────────┘
                            │
┌─ Data Plane ──────────────┴────────────────────────────────────┐
│  Hot:  VAST (NFS + S3, single flash tier)                      │
│    ├── Home dirs, scratch, active datasets (NFS)               │
│    ├── Checkpoints, image cache, objects (S3)                  │
│    ├── Scheduler integration: QoS, pre-staging, snapshots      │
│    └── Sensitive: encrypted view, audit-logged, dedicated pool │
│  Warm: Capacity store (S3-compat, cost-optimized)              │
│  Cold: Tape archive (S3-compat, regulatory retention)          │
│  Data mover: pre-stages during queue wait, policy-driven       │
└───────────────────────────┬────────────────────────────────────┘
                            │
┌─ Network Fabric ──────────┴────────────────────────────────────┐
│  Slingshot (current) / Ultra Ethernet (future path)            │
│  ├── libfabric abstraction for workload communication          │
│  ├── VNI-based network domains (job isolation)                 │
│  ├── Traffic classes: compute | management | telemetry         │
│  ├── CSIG for in-band congestion telemetry                     │
│  └── Sensitive: encrypted RDMA, dedicated VNI                  │
└───────────────────────────┬────────────────────────────────────┘
                            │
┌─ Node Plane ──────────────┴────────────────────────────────────┐
│  Node Agent (per node)                                         │
│  ├── squashfs-mount (uenv delivery)                            │
│  ├── Sarus (OCI container runtime, when needed)                │
│  ├── eBPF telemetry + CSIG tap                                 │
│  ├── Node-local NVMe (optional): scratch + image cache         │
│  ├── Conformance fingerprint (driver/firmware/kernel hash)     │
│  └── Health reporting → OpenCHAMI SMD                          │
└───────────────────────────┬────────────────────────────────────┘
                            │
┌─ Infrastructure Plane ────┴────────────────────────────────────┐
│  OpenCHAMI                                                     │
│  ├── Magellan: Redfish BMC discovery & inventory               │
│  ├── SMD: State Management Daemon (hardware lifecycle)         │
│  ├── BSS: Boot Script Service (image selection per node)       │
│  ├── OPAAL: Authentication & identity                          │
│  ├── Cloud-init: per-node config injection                     │
│  └── Manta CLI: admin tooling                                  │
└────────────────────────────────────────────────────────────────┘

Component Interactions

Allocation Lifecycle

1. User/Agent → lattice-cli → lattice-api (Intent API or Compat API)
2. lattice-api validates request, resolves uenv, creates Allocation object
3. Allocation placed in vCluster scheduler's queue (eventually consistent)
4. vCluster scheduler runs scheduling cycle:
   a. Scores pending allocations with cost function
   b. Solves knapsack: maximize value subject to resource constraints
   c. Proposes allocation → quorum
5. Quorum validates (node ownership, quotas, sensitive isolation)
6. Quorum commits: node ownership updated (strong consistency)
7. Quorum notifies node agents of new allocation
8. Node agents:
   a. Pull uenv squashfs image (from cache or registry)
   b. Mount via squashfs-mount
   c. Start processes in mount namespace
   d. Begin log capture (ring buffer + S3 persistence)
   e. Accept attach sessions (if user connects)
   f. Report health/telemetry
8.5. During execution, users can:
   - Attach interactive terminal (nsenter into allocation namespace)
   - Stream logs (live tail from ring buffer or historical from S3)
   - Query metrics (lattice top → TSDB) or stream them (lattice watch → node agents)
   - View diagnostics (network health, storage performance)
   - Compare metrics across allocations (TSDB multi-query)
9. On completion: node agents report, quorum releases nodes

Preemption Flow

1. Higher-priority allocation arrives, needs nodes currently in use
2. Scheduler evaluates: which running allocations are cheapest to preempt?
   → checkpoint_efficiency score from cost function
3. Checkpoint broker sends CHECKPOINT_HINT to target allocation's node agents
4. Application checkpoints (or: timeout → forced preemption)
5. Nodes released, reassigned to higher-priority allocation
6. Preempted allocation re-queued, will resume from checkpoint when resources available

Federation Flow (when enabled)

1. User at Site A submits allocation targeting Site B
2. Site A's federation broker signs request with Sovra token
3. Request arrives at Site B's federation broker
4. Site B verifies Sovra token, checks policy (OPA)
5. If accepted: allocation enters Site B's scheduling plane
6. Site B's local quorum manages the allocation entirely
7. Results/logs accessible to user at Site A via federation catalog

Topology Model

The scheduler maintains a model of the Slingshot dragonfly topology:

System
├── Group 0 (electrical group, ~hundreds of nodes)
│   ├── Switch 0
│   │   ├── Node 0..N
│   │   └── ...
│   └── Switch M
├── Group 1
│   └── ...
└── Group K
    └── ...

Intra-group: electrical, low latency, high bandwidth
Inter-group: optical, higher latency, potential congestion

Scheduling rule: pack jobs into fewest groups possible. Jobs below group size → single group. Large jobs → minimize group span, prefer adjacent groups. Network-sensitive jobs (NCCL) get stricter placement constraints.
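
The packing rule can be sketched as a greedy group selection. This is a simplification that only minimizes group count; the real scheduler also honors node constraints and power budgets.

```python
def pick_groups(free_by_group, nodes_needed):
    """Greedy dragonfly packing: satisfy the request with the fewest groups.

    free_by_group: {group_id: free_node_count}
    Returns a {group_id: nodes_taken} plan, or None if infeasible.
    """
    # Prefer the smallest single group that fits the whole job (best-fit).
    for gid, free in sorted(free_by_group.items(), key=lambda kv: kv[1]):
        if free >= nodes_needed:
            return {gid: nodes_needed}
    # Otherwise span groups, largest first, to minimize group count.
    plan, remaining = {}, nodes_needed
    for gid, free in sorted(free_by_group.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            plan[gid] = take
            remaining -= take
    return plan if remaining == 0 else None

print(pick_groups({0: 10, 1: 100, 2: 40}, 32))   # {2: 32} -- smallest group that fits
print(pick_groups({0: 10, 1: 100, 2: 40}, 130))  # spans groups 1 and 2
```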

State Machine

The quorum manages a replicated state machine with the following state:

GlobalState {
    nodes: Map<NodeId, NodeState>,        // ownership, health, capabilities
    allocations: Map<AllocId, Allocation>, // all active allocations
    tenants: Map<TenantId, TenantState>,  // quotas, fair-share counters
    vclusters: Map<VClusterId, VClusterConfig>, // scheduler configs
    topology: TopologyModel,              // dragonfly group structure
    sensitive_audit: AppendOnlyLog<AuditEvent>, // strong consistency
}

NodeState {
    owner: Option<(TenantId, VClusterId, AllocId)>,
    health: NodeHealth,
    capabilities: NodeCapabilities,  // GPU type, memory, features
    group: GroupId,                  // topology position
    conformance_group: ConformanceGroupId, // fingerprint of driver/firmware/kernel
}

Transitions are proposed by vCluster schedulers and validated by the quorum before commit. Only node ownership changes and sensitive audit events require Raft consensus; everything else is eventually consistent.

Note: Observability data (logs, metrics, attach sessions, diagnostics) is NOT stored in the Raft state machine. This data lives in the TSDB, S3, and node agent memory. Only sensitive audit events about observability actions (e.g., “Dr. X attached to allocation Y”) flow through Raft consensus (per ADR-004).

API Design

Two-Tier API Model

Tier 1: Intent API (Agent-Native)

Agents and advanced users interact with the Intent API. They declare what they need; the scheduler resolves how.

Core Resources

Allocation — The universal work unit.

POST   /v1/allocations              Create allocation (or DAG of allocations)
GET    /v1/allocations              List allocations (filterable)
GET    /v1/allocations/{id}         Get allocation status
DELETE /v1/allocations/{id}         Cancel allocation
PATCH  /v1/allocations/{id}         Update allocation (e.g., extend walltime, switch telemetry)
POST   /v1/allocations/{id}/tasks   Launch tasks within an existing allocation (srun equivalent)
POST   /v1/allocations/{id}/checkpoint  Request checkpoint

Observability — User-facing debugging and monitoring.

POST   /v1/allocations/{id}/attach           Attach interactive terminal (WebSocket upgrade)
GET    /v1/allocations/{id}/logs             Historical logs from S3
GET    /v1/allocations/{id}/logs/stream      Live log tail (SSE / gRPC stream)
GET    /v1/allocations/{id}/metrics          Query metrics snapshot from TSDB
GET    /v1/allocations/{id}/metrics/stream   Push-based live metrics stream
GET    /v1/allocations/{id}/diagnostics      Combined network + storage diagnostics
GET    /v1/allocations/{id}/diagnostics/network  Network-specific diagnostics
GET    /v1/allocations/{id}/diagnostics/storage  Storage-specific diagnostics
GET    /v1/compare                           Cross-allocation metric comparison

DAGs — Workflow graph management.

POST   /v1/dags                    Submit a DAG of allocations
GET    /v1/dags                    List DAGs (filterable by tenant, user, state)
GET    /v1/dags/{id}               Get DAG status (overall state + per-allocation states)
GET    /v1/dags/{id}/graph         Get DAG structure (allocations + dependency edges)
DELETE /v1/dags/{id}               Cancel all allocations in a DAG

Session — Interactive allocation with WebSocket terminal.

POST   /v1/sessions                 Create interactive session
GET    /v1/sessions/{id}/terminal   WebSocket terminal endpoint

Nodes — Read-only view of cluster state.

GET    /v1/nodes                    List nodes (filterable by vCluster, tenant, state)
GET    /v1/nodes/{id}               Get node details

Tenants / vClusters — Administrative.

GET    /v1/tenants                  List tenants
GET    /v1/vclusters                List vClusters
GET    /v1/vclusters/{id}/queue     View vCluster queue

Accounting

GET    /v1/accounting               Query usage history

Allocation Request Schema

# Full Intent API allocation request
allocation:
  # Identity
  tenant: "ml-team"
  project: "gpt-training"
  vcluster: "ml-training"           # optional: scheduler can infer from intent
  tags: { experiment: "run-42" }

  # What to run
  intent: "train"                    # optional hint for scheduler
  environment:
    uenv: "prgenv-gnu/24.11:v1"     # uenv name/version
    view: "default"                  # uenv view to activate
    # OR:
    image: "registry.example.com/my-training:latest"  # OCI image via Sarus
  entrypoint: "torchrun --nproc_per_node=4 train.py"

  # Resources
  resources:
    nodes: 64                        # can be exact or range: { min: 32, max: 128 }
    constraints:
      gpu_type: "GH200"
      features: ["nvme_scratch"]
      topology: "tight"              # scheduler hint: pack into fewest groups

  # Lifecycle
  lifecycle:
    type: "bounded"                  # bounded | unbounded | reactive
    walltime: "72h"                  # for bounded
    preemption_class: 2              # 0 = lowest, higher = harder to preempt
    # For reactive:
    # scale_policy: { min: 4, max: 16, metric: "request_latency_p99", target: "100ms" }

  # Data
  data:
    mounts:
      - source: "s3://datasets/imagenet"
        target: "/data/input"
        access: "read-only"
        tier_hint: "hot"             # scheduler pre-stages if needed
    defaults: true                   # auto-mount home, scratch, output dir

  # Networking
  connectivity:
    network_domain: "ml-workspace"   # shared domain for cross-allocation communication
    expose:                          # for services
      - name: "metrics"
        port: 9090

  # Dependencies (for DAG submissions)
  depends_on:
    - ref: "preprocess-job"
      condition: "success"           # success | failure | any | corresponding

  # Checkpointing
  checkpoint:
    strategy: "auto"                 # auto | manual | none
    # auto: scheduler decides based on cost function
    # manual: application manages its own checkpointing
    # none: non-checkpointable, treated as non-preemptible

  # Telemetry
  telemetry:
    mode: "prod"                     # prod | debug | audit

DAG Submission

Submit multiple allocations as a workflow graph:

dag:
  allocations:
    - id: "stage-data"
      entrypoint: "python stage.py"
      resources: { nodes: 1 }
      lifecycle: { type: "bounded", walltime: "2h" }

    - id: "train"
      entrypoint: "torchrun train.py"
      resources: { nodes: 64, constraints: { topology: "tight" } }
      lifecycle: { type: "bounded", walltime: "72h" }
      depends_on: [{ ref: "stage-data", condition: "success" }]

    - id: "evaluate"
      entrypoint: "python eval.py"
      resources: { nodes: 4 }
      depends_on: [{ ref: "train", condition: "any" }]

DAG size limit: Maximum 1000 allocations per DAG (configurable). Submissions exceeding this limit are rejected at validation time. See dag-scheduling.md for details.
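
The gating rule — DAG structure controls when allocations enter the queue — can be sketched as a readiness check. Condition names follow the schema above; "corresponding" is omitted because it needs task-group context.

```python
def ready_allocations(dag, states):
    """dag: {alloc_id: [(dep_id, condition), ...]}; states: {alloc_id: state}.

    Returns the pending allocations whose dependencies are all satisfied,
    i.e. the ones that may now enter the scheduler's queue.
    """
    def satisfied(dep, cond):
        s = states.get(dep, "pending")
        if cond == "success":
            return s == "succeeded"
        if cond == "failure":
            return s == "failed"
        if cond == "any":
            return s in ("succeeded", "failed")
        return False  # "corresponding" not modeled in this sketch

    return [
        a for a, deps in dag.items()
        if states.get(a, "pending") == "pending"
        and all(satisfied(d, c) for d, c in deps)
    ]

dag = {
    "stage-data": [],
    "train": [("stage-data", "success")],
    "evaluate": [("train", "any")],
}
print(ready_allocations(dag, {"stage-data": "succeeded"}))  # ['train']
```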

Task Groups (Job Arrays)

allocation:
  type: "task_group"
  template:
    entrypoint: "python sweep.py --config=${INDEX}"
    resources: { nodes: 1, constraints: { gpu_type: "GH200" } }
    lifecycle: { type: "bounded", walltime: "4h" }
  range: { start: 0, end: 99 }
  concurrency: 20                   # max simultaneous tasks
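
Expansion of a task group into per-index tasks can be sketched as follows; the ${INDEX} substitution matches the template above, while the scheduler enforces the concurrency cap at run time.

```python
import string

def expand_task_group(template, start, end):
    """Expand a task_group entrypoint template over its index range."""
    t = string.Template(template)
    return [t.substitute(INDEX=i) for i in range(start, end + 1)]

tasks = expand_task_group("python sweep.py --config=${INDEX}", 0, 99)
print(len(tasks))   # 100 tasks; at most `concurrency` run simultaneously
print(tasks[0])     # python sweep.py --config=0
```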

Tier 2: Compatibility API (Slurm-like)

Translates familiar Slurm commands to Intent API calls. Implemented as CLI wrappers + lattice-api REST endpoints.

Command Mapping

| Slurm | Lattice CLI | Intent API |
|---|---|---|
| sbatch script.sh | lattice submit script.sh | POST /v1/allocations |
| sbatch --array=0-99%20 script.sh | lattice submit --task-group=0-99%20 script.sh | POST /v1/allocations (task_group) |
| sbatch --dependency=afterok:123 script.sh | lattice submit --depends-on=123:success script.sh | POST /v1/allocations (depends_on) |
| squeue | lattice status | GET /v1/allocations |
| squeue -u $USER | lattice status --user=$USER | GET /v1/allocations?user= |
| scancel 123 | lattice cancel 123 | DELETE /v1/allocations/123 |
| salloc -N2 | lattice session --nodes=2 | POST /v1/sessions |
| srun -n4 hostname | lattice launch --alloc=123 -n4 hostname | POST /v1/allocations/123/tasks |
| sinfo | lattice nodes | GET /v1/nodes |
| sacct | lattice history | GET /v1/accounting |
| --constraint="gpu" | --constraint="gpu" | constraints.features |
| --partition=debug | --vcluster=interactive | vcluster field |
| --qos=high | --priority=high | preemption_class |
| --uenv=prgenv-gnu/24.11:v1 | --uenv=prgenv-gnu/24.11:v1 | environment.uenv |
| srun --jobid=123 --pty bash | lattice attach 123 | Attach RPC (bidir stream) |
| cat slurm-123.out | lattice logs 123 | GET /v1/allocations/123/logs |
| tail -f slurm-123.out | lattice logs 123 --follow | StreamLogs RPC |
| sstat -j 123 | lattice top 123 | QueryMetrics RPC |
| (no equivalent) | lattice watch 123 | StreamMetrics RPC |
| (no equivalent) | lattice diag 123 | GetDiagnostics RPC |
| (no equivalent) | lattice compare 123 456 | CompareMetrics RPC |

Script Parsing

The compatibility layer parses #SBATCH directives from submission scripts, translating them to Intent API fields. Unknown directives produce warnings but are not fatal (graceful degradation).

#!/bin/bash
#SBATCH --nodes=64
#SBATCH --time=72:00:00
#SBATCH --gres=gpu:4
#SBATCH --constraint=GH200
#SBATCH --uenv=prgenv-gnu/24.11:v1
#SBATCH --view=default
#SBATCH --account=ml-team
#SBATCH --job-name=training-run

torchrun --nproc_per_node=4 train.py
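
The directive translation can be sketched with a small mapping table. The mapping below covers only a few directives and the Intent field paths are illustrative; the real compatibility layer handles many more.

```python
import re

# Hypothetical subset of the directive -> Intent field mapping.
KNOWN = {
    "--nodes": ("resources.nodes", int),
    "--time": ("lifecycle.walltime", str),
    "--account": ("tenant", str),
    "--uenv": ("environment.uenv", str),
    "--job-name": ("tags.name", str),
}

def parse_sbatch(script):
    """Translate known #SBATCH directives; warn (not fail) on unknown ones."""
    fields, warnings = {}, []
    for line in script.splitlines():
        m = re.match(r"#SBATCH\s+(--[\w-]+)(?:=(\S+))?", line)
        if not m:
            continue
        flag, value = m.groups()
        if flag in KNOWN:
            path, conv = KNOWN[flag]
            fields[path] = conv(value)
        else:
            warnings.append(f"unknown directive {flag} (ignored)")
    return fields, warnings

script = "#!/bin/bash\n#SBATCH --nodes=64\n#SBATCH --gres=gpu:4\n"
fields, warnings = parse_sbatch(script)
print(fields)    # {'resources.nodes': 64}
print(warnings)  # ['unknown directive --gres (ignored)']
```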

Wire Format

gRPC (protobuf) is the primary protocol. REST is provided via gRPC-gateway for browser/curl access.

Protobuf definitions in proto/ directory. See proto/README.md for schema details.

Proto Coverage

The protobuf definitions in proto/lattice/v1/allocations.proto currently cover:

| Service / Area | Proto Status | Notes |
|---|---|---|
| AllocationService (submit, get, list, cancel, update, watch, checkpoint) | Defined | Core allocation lifecycle |
| Observability RPCs (attach, logs, metrics, diagnostics, compare) | Defined | Part of AllocationService |
| DAG RPCs (get, list, cancel) | Defined | Part of AllocationService |
| NodeService (list, get, drain, undrain, disable, enable, health) | Defined | proto/lattice/v1/nodes.proto |
| AdminService (tenant CRUD, vCluster CRUD, Raft status, backup, audit, accounting) | Defined | proto/lattice/v1/admin.proto |
| Session RPCs (create, get, delete) | Defined | Part of AllocationService |
| Service Discovery (lookup, list) | Defined | Part of AdminService, admin.proto |
| LivenessProbeSpec | Defined | Part of AllocationSpec, allocations.proto |

All planned services have been implemented as RPCs within the existing three services (AllocationService, NodeService, AdminService). Both gRPC and REST endpoints are available for all operations.

Service Discovery Endpoints

| Method | Endpoint | Description |
|---|---|---|
| gRPC | AdminService.LookupService(name) | Returns endpoints for a named service (tenant-filtered) |
| gRPC | AdminService.ListServices() | Lists all registered service names (tenant-filtered) |
| REST | GET /api/v1/services | JSON list of registered service names |
| REST | GET /api/v1/services/{name} | JSON endpoints for a named service |

Tenant filtering: requests with x-lattice-tenant header only see services belonging to their tenant. Without the header, all services are visible (admin mode).

Liveness Probe Schema

Allocations can include an optional liveness_probe in the submission spec:

message LivenessProbeSpec {
  string probe_type = 1;    // "tcp" or "http"
  uint32 port = 2;          // 1-65535
  string path = 3;          // HTTP path (e.g., "/healthz")
  uint32 period_secs = 4;   // default: 30
  uint32 initial_delay_secs = 5;
  uint32 failure_threshold = 6;  // default: 3
  uint32 timeout_secs = 7;      // default: 5
}

When failure_threshold consecutive probes fail, the allocation is marked Failed. The reconciliation loop then requeues it (for Unbounded/Reactive allocations with appropriate requeue policy).
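
The failure-threshold rule can be sketched as a fold over probe outcomes; the defaults match the schema above, and a single success resets the counter.

```python
def probe_state(outcomes, failure_threshold=3):
    """Mark the allocation Failed after `failure_threshold` consecutive failures.

    outcomes: iterable of booleans, one per probe period (True = probe passed).
    """
    consecutive = 0
    for ok in outcomes:
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= failure_threshold:
            return "Failed"
    return "Running"

print(probe_state([True, False, False, True, False]))  # Running: success resets the count
print(probe_state([True, False, False, False]))        # Failed: 3 consecutive failures
```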

Client SDKs

| SDK | Protocol | Location |
|---|---|---|
| Python (lattice-sdk) | REST (httpx) | sdk/python/ |
| Rust (lattice-client) | gRPC (tonic) | crates/lattice-client/ |

The Rust SDK re-exports all proto types as lattice_client::proto — consumers do not need to depend on lattice-common directly.

Authentication

All API calls require OIDC bearer token. The lattice CLI handles the OIDC flow via hpc-auth (institutional IdP integration). The lattice-api server validates tokens against the configured OIDC provider.

Sensitive tenant tokens include additional claims for audit trail binding.

Scheduling Algorithm

Overview

Lattice uses a multi-dimensional knapsack formulation with a composite cost function, executed independently by each vCluster scheduler. The quorum provides global coordination.

The Knapsack Formulation

Resources (Knapsack Dimensions)

Each scheduling decision must respect multiple resource constraints simultaneously:

| Dimension | Unit | Source |
|---|---|---|
| Nodes | count | Quorum (available nodes owned by or borrowable by vCluster) |
| GPU-hours | nodes × walltime | Derived from allocation request |
| Topology span | group count | Topology model (dragonfly groups consumed) |
| Storage I/O bandwidth | GB/s | VAST API (current utilization + allocation estimate) |
| Power budget | kW | OpenCHAMI BMC telemetry (per-node power draw) |

Value (Cost Function)

Score(j) = Σ wᵢ · fᵢ(j)

Component Functions

f₁: priority_class(j) — Static priority tier (0-10). Sensitive claims are highest. Preemption only moves down tiers.

f₂: wait_time_factor(j) — Anti-starvation. Increases monotonically with time in queue.

f₂(j) = log(1 + wait_seconds / reference_wait)

reference_wait is tunable (default: 1 hour). The logarithm prevents wait time from dominating all other factors.

f₃: fair_share_deficit(j) — How far the tenant is from their contracted share. See quota-enforcement.md for hard vs. soft quota semantics.

f₃(j) = max(0, target_share(tenant) - actual_usage(tenant)) / target_share(tenant)

Ranges from 0 (tenant at or above share) to 1 (tenant has used nothing). Tenants below their share get priority.

f₄: topology_fitness(j) — How well the job fits available topology. For intra-node GPU topology, see gpu-topology.md.

f₄(j) = 1.0 - (groups_needed(j) / max_groups_available)

Jobs that fit in a single group score highest. Penalty for spanning groups scales with group count.

f₅: data_readiness(j) — Is the job’s input data on hot tier?

f₅(j) = fraction_of_input_data_on_hot_tier(j)

If unknown (user didn’t specify data requirements), defaults to 0.5 (neutral).

f₆: backlog_pressure(t) — Global signal, not per-job. High when queue is deep.

f₆(t) = min(1.0, queued_gpu_hours / running_gpu_hours)

Capped at 1.0. Affects all jobs equally — it’s a system-level urgency signal.

f₇: energy_cost(j, t) — Time-varying electricity price at scheduling time.

f₇(j, t) = 1.0 - normalized_energy_price(t)

Jobs score higher when energy is cheap. In federated mode, extends to energy_cost(j, t, site).

f₈: checkpoint_efficiency(j) — How cheaply can this job be preempted?

f₈(j) = 1.0 / (1.0 + estimated_checkpoint_minutes(j))

Jobs with fast checkpointing are more attractive to schedule on borrowed/preemptible nodes.

f₉: conformance_fitness(j, candidates) — How well do the candidate nodes match each other’s configuration?

f₉(j, candidates) = largest_conformance_group_size(candidates) / j.requested_nodes

Scores 1.0 when all candidate nodes share the same conformance fingerprint, lower when the node set is heterogeneous. Critical for multi-node jobs where driver/firmware mismatches cause subtle performance degradation or correctness issues (e.g., NCCL hangs from mismatched NIC firmware).

The conformance fingerprint is a hash of: GPU driver version, NIC firmware version, BIOS/BMC firmware version, and kernel parameters. The node agent computes and reports this fingerprint alongside health data. Nodes with identical fingerprints belong to the same conformance group.

This factor is evaluated during node selection (step 2a in the solver), not during scoring. The solver prefers to select nodes from the largest available conformance group that satisfies the allocation’s constraints.
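
Fingerprinting and f₉ can be sketched as follows, assuming the four hash components listed above; the hash construction and truncation are illustrative.

```python
import hashlib
from collections import Counter

def fingerprint(gpu_driver, nic_fw, bios_fw, kernel_params):
    """Conformance fingerprint: hash of driver/firmware/kernel configuration."""
    blob = "|".join([gpu_driver, nic_fw, bios_fw, kernel_params])
    return hashlib.sha256(blob.encode()).hexdigest()[:16]

def conformance_fitness(candidate_fingerprints, requested_nodes):
    """f9 = size of the largest conformance group among candidates / requested nodes."""
    if not candidate_fingerprints:
        return 0.0
    largest = Counter(candidate_fingerprints).most_common(1)[0][1]
    return min(1.0, largest / requested_nodes)

# 3 conformant nodes plus one with mismatched NIC firmware.
fps = [fingerprint("550.54", "2.1.9", "1.8", "iommu=pt")] * 3 + \
      [fingerprint("550.54", "2.1.8", "1.8", "iommu=pt")]
print(conformance_fitness(fps, 4))  # 0.75
```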

See data-staging.md for details on how input data is pre-staged during queue wait to improve f₅ scores. See preemption.md for how preemption classes interact with f₁ priority scoring. See network-domains.md for the VNI assignment that enables topology-aware placement (f₄).

Weight Profiles

| Weight | HPC Batch | ML Training | Service | Sensitive | Interactive |
|---|---|---|---|---|---|
| w₁ (priority) | 0.15 | 0.10 | 0.15 | 0.90 | 0.10 |
| w₂ (wait_time) | 0.20 | 0.10 | 0.05 | 0.00 | 0.30 |
| w₃ (fair_share) | 0.20 | 0.10 | 0.10 | 0.00 | 0.10 |
| w₄ (topology) | 0.15 | 0.25 | 0.05 | 0.00 | 0.00 |
| w₅ (data_ready) | 0.10 | 0.15 | 0.10 | 0.00 | 0.05 |
| w₆ (backlog) | 0.05 | 0.05 | 0.05 | 0.00 | 0.15 |
| w₇ (energy) | 0.00 | 0.05 | 0.10 | 0.00 | 0.00 |
| w₈ (checkpoint) | 0.05 | 0.10 | 0.10 | 0.00 | 0.00 |
| w₉ (conformance) | 0.10 | 0.10 | 0.30 | 0.10 | 0.30 |

Sensitive scheduler is degenerate: priority dominates because node claims are non-negotiable (w₁=0.90). Conformance (w₉=0.10) acts as a tiebreaker among conformant nodes; non-conformant nodes are excluded entirely as a hard constraint at the solver level (step 2a), not via the weight system.

Note: The CostWeights::default() in crates/lattice-common/src/types.rs provides a “balanced HPC” baseline (w₁=0.20, w₂=0.20, w₃=0.20, w₄=0.15, w₅=0.10, w₆=0.05, w₇=0.00, w₈=0.00, w₉=0.10). This is not identical to any named profile in the table above — it is a general-purpose starting point. Each vCluster should have its weights tuned for its workload type, either manually or via RM-Replay simulation.
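
Putting the pieces together, Score(j) = Σ wᵢ · fᵢ(j) can be sketched using the "balanced HPC" baseline weights quoted above; the per-job factor values in the example are illustrative.

```python
import math

# Baseline weights from CostWeights::default() as quoted above.
BASELINE = {"priority": 0.20, "wait": 0.20, "fair_share": 0.20,
            "topology": 0.15, "data_ready": 0.10, "backlog": 0.05,
            "energy": 0.00, "checkpoint": 0.00, "conformance": 0.10}

def score(factors, weights=BASELINE):
    """Score(j) = sum of w_i * f_i(j); missing factors contribute 0."""
    return sum(weights[k] * factors.get(k, 0.0) for k in weights)

job = {
    "priority": 5 / 10,                 # f1: tier 5 of 10
    "wait": math.log(1 + 7200 / 3600),  # f2: 2h in queue, 1h reference_wait
    "fair_share": 0.4,                  # f3: tenant below contracted share
    "topology": 1.0 - (1 / 8),          # f4: fits in 1 of 8 available groups
    "data_ready": 0.5,                  # f5: unknown -> neutral
    "backlog": 0.3,                     # f6: system-level urgency signal
}
print(round(score(job), 3))  # 0.596
```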

Solver

The multi-dimensional knapsack is NP-hard in general. For our scale (tens to hundreds of pending large allocations), a greedy heuristic with backfill is sufficient:

Algorithm: GreedyTopologyAwareBackfill

1. Sort pending allocations by Score(j) descending
2. For each allocation j in sorted order:
   a. Find the smallest set of available nodes that satisfies:
      - Node count >= j.requested_nodes
      - All nodes in fewest possible dragonfly groups
      - All nodes in same conformance group (prefer) or fewest groups (fallback)
      - Constraints satisfied (GPU type, features, etc.)
      - Power budget not exceeded
   b. If nodes found: PROPOSE allocation to quorum
   c. If not found: try backfill (can j fit in gaps left by higher-priority reservations?)
3. Collect quorum responses (commit or reject)
4. For rejected proposals: re-queue, will try next cycle

Scheduling cycle: every 5-30 seconds (configurable per vCluster)
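
A runnable sketch of the greedy pass (steps 1–2) follows, assuming scores are precomputed and using only node count plus group packing as constraints; the real solver also checks power, features, conformance, and attempts backfill.

```python
def schedule_cycle(pending, free_by_group):
    """pending: [(alloc_id, score, nodes_needed)]; mutates free_by_group.

    Returns the list of (alloc_id, placement_plan) that would be PROPOSEd
    to the quorum this cycle.
    """
    proposals = []
    for alloc_id, _score, needed in sorted(pending, key=lambda p: -p[1]):
        plan, remaining = {}, needed
        # Fewest groups: take from the largest free groups first.
        for gid in sorted(free_by_group, key=lambda g: -free_by_group[g]):
            take = min(free_by_group[gid], remaining)
            if take:
                plan[gid] = take
                remaining -= take
            if remaining == 0:
                break
        if remaining == 0:
            for gid, take in plan.items():
                free_by_group[gid] -= take
            proposals.append((alloc_id, plan))
        # else: backfill attempt would go here; the job re-queues next cycle
    return proposals

free = {0: 64, 1: 64}
print(schedule_cycle([("a", 0.9, 64), ("b", 0.5, 100)], free))  # only "a" fits
```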

DAG Dependencies

DAGs (directed acyclic graphs) are first-class workflow primitives. Individual allocations within a DAG are scored by the knapsack solver like any other allocation — the DAG structure controls when allocations enter the queue, not how they are scored. Root allocations enter immediately; downstream allocations enter when their dependency conditions are satisfied. See dag-scheduling.md for the full DAG lifecycle and dependency conditions.

Reactive Scaling

Reactive allocations (autoscaling services) start at min_nodes and scale based on metric thresholds. Scale-up and scale-down are proposed as node ownership changes through the quorum. The knapsack solver handles each scale proposal as a regular allocation change. See autoscaling.md for the scaling loop, metrics, and cooldown behavior.

Elastic Resource Sharing

Nodes can be “borrowed” across vClusters:

vCluster A: 200 dedicated nodes, currently using 150
  → 50 idle nodes advertised as "borrowable" to other vClusters

vCluster B: 100 dedicated nodes, needs 120 for a pending job
  → Borrows 20 nodes from vCluster A's idle pool
  → These borrowed nodes have a preemption penalty in the cost function
  → If vCluster A needs them back: checkpoint + reclaim

The quorum tracks ownership at two levels:

  • Home vCluster: permanent assignment (based on tenant contracts)
  • Current vCluster: who is actually using the node right now

Checkpoint Cost Model

See checkpoint-broker.md for the full checkpoint decision framework.

Summary: checkpoint when Value > Cost, where value includes recompute_saved + preemptability + backlog_relief, and cost includes write_time + compute_waste + storage_cost. Backlog pressure increases checkpoint aggressiveness.
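
The decision rule can be sketched directly from the summary; the term names come from the text above, and the way backlog pressure scales value is an assumption for illustration.

```python
def should_checkpoint(recompute_saved, preemptability, backlog_relief,
                      write_time, compute_waste, storage_cost,
                      backlog_pressure=0.0):
    """Checkpoint when Value > Cost; deeper backlog makes checkpointing
    more aggressive by scaling up the value side."""
    value = (recompute_saved + preemptability + backlog_relief) * (1.0 + backlog_pressure)
    cost = write_time + compute_waste + storage_cost
    return value > cost

# Borderline case: a deep backlog tips the decision toward checkpointing.
print(should_checkpoint(3.0, 1.0, 1.0, 4.0, 1.0, 0.5))                        # False
print(should_checkpoint(3.0, 1.0, 1.0, 4.0, 1.0, 0.5, backlog_pressure=0.5))  # True
```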

Simulation and Tuning

Use RM-Replay (tools/rm-replay/) to test scheduling configurations:

  1. Capture production workload traces
  2. Configure weight profiles
  3. Replay through simulator
  4. Evaluate: utilization, wait times, QoS compliance, fairness
  5. Iterate on weights before deploying to production

Reference: Martinasso et al., “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management” (SC18).

CLI Design

Design Principle

The CLI is the primary user interface. It should feel natural to Slurm users while exposing Lattice’s richer capabilities. Commands follow a consistent lattice <verb> [resource] [flags] pattern. Output is human-readable by default, machine-parseable with --output=json.

Command Structure

lattice <command> [subcommand] [arguments] [flags]

Global Flags

| Flag | Short | Description |
|---|---|---|
| --output | -o | Output format: table (default), json, yaml, wide |
| --quiet | -q | Suppress non-essential output |
| --verbose | -v | Verbose output (debug info) |
| --tenant | -t | Override tenant (for multi-tenant users) |
| --vcluster | — | Override vCluster selection |
| --config | — | Config file path (default: ~/.config/lattice/config.yaml) |
| --no-color | — | Disable colored output |

Authentication Commands

Login (lattice login)

Authenticate with the lattice server. Uses hpc-auth for OIDC token acquisition with cascading flow selection.

# Login (auto-discovers IdP from lattice-api auth discovery endpoint)
lattice login

# Force device code flow (for SSH sessions without browser)
lattice login --flow device

# Force manual paste flow
lattice login --flow manual

# Login to a specific server
lattice login --server cluster.example.com

Token is cached per-server in ~/.config/lattice/tokens.json with 0600 permissions (lenient mode: warn and fix if wrong).
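
The lenient-mode permission check can be sketched as follows; the warning text is illustrative, but the behavior (warn and fix to 0600 rather than abort) matches the description above.

```python
import os
import stat
import tempfile

def ensure_token_perms(path):
    """Lenient mode: warn and chmod to 0600 if the token cache is too open."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o600:
        print(f"warning: {path} has mode {oct(mode)}, fixing to 0600")
        os.chmod(path, 0o600)

# Demonstrate on a throwaway file with overly broad permissions.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
os.chmod(path, 0o644)
ensure_token_perms(path)
print(oct(stat.S_IMODE(os.stat(path).st_mode)))  # 0o600
os.unlink(path)
```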

Logout (lattice logout)

Clear cached token and revoke at IdP (best-effort).

lattice logout

Unauthenticated Commands

These commands do not require a token (INV-A1):

  • lattice login / lattice logout
  • lattice --version
  • lattice --help
  • lattice completion <shell>

All other commands require authentication. If no valid token is cached, the CLI prints:

Not logged in. Run `lattice login` first.

Expired tokens are silently refreshed if a valid refresh token exists.

Core Commands

Submit (lattice submit)

Submit an allocation or batch script.

# Submit a script (Slurm-compatible directives parsed)
lattice submit script.sh

# Submit with inline arguments
lattice submit --nodes=64 --walltime=72h --uenv=prgenv-gnu/24.11:v1 -- torchrun train.py

# Submit a task group (job array)
lattice submit --task-group=0-99%20 script.sh

# Submit with dependencies
lattice submit --depends-on=12345:success script.sh

# Submit a DAG from YAML
lattice dag submit workflow.yaml

# Submit to a specific vCluster
lattice submit --vcluster=ml-training script.sh

Output: Allocation ID on success.

Submitted allocation 12345

Status (lattice status)

Query allocation status.

# List own allocations
lattice status

# Specific allocation
lattice status 12345

# Filter by state
lattice status --state=running

# All allocations (tenant admin)
lattice status --all

# Watch mode (refresh every 5s)
lattice status --watch

Default output (table):

ID      NAME           STATE    NODES  WALLTIME   ELAPSED   VCLUSTER
12345   training-run   Running  64     72:00:00   14:23:01  ml-training
12346   eval-job       Pending  4      02:00:00   —         hpc-batch
12347   sweep          Running  1×20   04:00:00   01:12:33  hpc-batch

Wide output (-o wide): Adds columns: tenant, project, uenv, GPU type, dragonfly groups.

Cancel (lattice cancel)

Cancel allocations.

# Cancel single
lattice cancel 12345

# Cancel multiple
lattice cancel 12345 12346 12347

# Cancel all own pending allocations
lattice cancel --state=pending --all-mine

# Cancel a DAG
lattice dag cancel dag-789

Session (lattice session)

Create an interactive session. See sessions.md for details.

# Basic session
lattice session --walltime=4h

# With resources
lattice session --nodes=2 --constraint=gpu_type:GH200 --walltime=8h

# With uenv
lattice session --uenv=prgenv-gnu/24.11:v1 --walltime=4h

Attach (lattice attach)

Attach a terminal to a running allocation. See observability.md.

lattice attach 12345
lattice attach 12345 --node=x1000c0s0b0n3
lattice attach 12345 --command="nvidia-smi -l 1"

Launch (lattice launch)

Run a task within an existing allocation (srun equivalent).

# Run on all nodes
lattice launch --alloc=12345 hostname

# Run on specific number of tasks
lattice launch --alloc=12345 -n 4 ./my_program

# Run interactively with PTY
lattice launch --alloc=12345 --pty bash

Logs (lattice logs)

View allocation logs. See observability.md.

lattice logs 12345
lattice logs 12345 --follow
lattice logs 12345 --stderr --node=x1000c0s0b0n3
lattice logs 12345 --tail=100

Top / Watch / Diag / Compare

Monitoring commands. See observability.md.

lattice top 12345                              # Metrics snapshot
lattice top 12345 --per-gpu                    # Per-GPU breakdown
lattice watch 12345                            # Live streaming metrics
lattice watch 12345 --alerts-only              # Alerts only
lattice diag 12345                             # Network + storage diagnostics
lattice compare 12345 12346 --metric=gpu_util  # Cross-allocation comparison

Telemetry (lattice telemetry)

Switch telemetry mode.

lattice telemetry --alloc=12345 --mode=debug --duration=30m

Nodes (lattice nodes)

View cluster nodes (read-only).

# List all nodes
lattice nodes

# Filter by state
lattice nodes --state=ready

# Filter by vCluster
lattice nodes --vcluster=hpc-batch

# Specific node details
lattice nodes x1000c0s0b0n0

Output:

NODE                STATE     GPUS     VCLUSTER     TENANT   GROUP  CONFORMANCE
x1000c0s0b0n0       Ready     4×GH200  hpc-batch    physics  3      a1b2c3
x1000c0s0b0n1       Ready     4×GH200  hpc-batch    physics  3      a1b2c3
x1000c0s1b0n0       Draining  4×GH200  ml-training  ml-team  7      a1b2c3

History (lattice history)

Query completed allocations (accounting data).

lattice history
lattice history --since=2026-03-01 --until=2026-03-02
lattice history --output=json

DAG Commands (lattice dag)

lattice dag submit workflow.yaml     # Submit a DAG
lattice dag status dag-789           # DAG status with per-allocation states
lattice dag list                     # List DAGs
lattice dag cancel dag-789           # Cancel a DAG

Cache Commands (lattice cache)

lattice cache warm --image=prgenv-gnu/24.11:v1 --group=3
lattice cache status --node=x1000c0s0b0n0
lattice cache evict --image=prgenv-gnu/24.11:v1 --node=x1000c0s0b0n0

Admin Commands (lattice admin)

Administrative commands require system-admin role.

# Node management
lattice node drain x1000c0s0b0n0
lattice node drain x1000c0s0b0n0 --urgent
lattice node undrain x1000c0s0b0n0
lattice node disable x1000c0s0b0n0
lattice node enable x1000c0s0b0n0

# Tenant management
lattice admin tenant create --name=physics --max-nodes=200
lattice admin tenant set-quota --name=physics --max-nodes=250

# vCluster management
lattice admin vcluster create --name=hpc-batch --scheduler=hpc-backfill --tenant=physics
lattice admin vcluster set-weights --name=hpc-batch --priority=0.20 ...

# Configuration
lattice admin config get accounting.enabled
lattice admin config set accounting.enabled=true

# Raft status
lattice admin raft status

Output Formats

Format  Flag     Use Case
table   Default  Human-readable, aligned columns
wide    -o wide  Extended columns
json    -o json  Machine-parseable, scripting
yaml    -o yaml  Machine-parseable, config integration

All formats support piping and redirection. JSON output uses newline-delimited JSON for streaming commands (logs --follow, watch).
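Newline-delimited JSON lets a consumer parse each event independently, without buffering the whole stream. The field names in this sketch are illustrative, not Lattice's actual JSON schema:

```python
# Parse a newline-delimited JSON stream one event at a time.
# Each line is a complete JSON object; no cross-line state is needed.
import io
import json

# Stand-in for the stdout of a streaming command (illustrative fields).
stream = io.StringIO(
    '{"id": 12345, "gpu_utilization": 0.92}\n'
    '{"id": 12345, "gpu_utilization": 0.88}\n'
)

values = [json.loads(line)["gpu_utilization"] for line in stream]
print(values)  # [0.92, 0.88]
```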

Error Messages

Errors are human-readable with actionable guidance:

Error: allocation rejected — tenant "physics" exceeds max_nodes quota
  Current: 195 nodes in use
  Requested: 10 additional nodes
  Limit: 200 nodes

  Hint: Cancel running allocations or request a quota increase from your tenant admin.

Error: no nodes available matching constraints
  GPU type: GH200
  Nodes requested: 64
  Available: 42 (22 in use by your allocations, 136 by other tenants)

  Hint: Reduce node count, use --topology=any, or wait for resources.

Shell Completion

Shell completion is generated for bash, zsh, and fish:

# Generate completion
lattice completion bash > /etc/bash_completion.d/lattice
lattice completion zsh > ~/.zfunc/_lattice
lattice completion fish > ~/.config/fish/completions/lattice.fish

Completions cover: subcommands, flag names, allocation IDs (from recent lattice status), node IDs, vCluster names, uenv names.

Configuration File

# ~/.config/lattice/config.yaml
api_url: "https://lattice.example.com:50051"
default_tenant: "physics"
default_vcluster: "hpc-batch"
default_uenv: "prgenv-gnu/24.11:v1"
output_format: "table"
color: true

Environment variables override config file: LATTICE_API_URL, LATTICE_TENANT, LATTICE_VCLUSTER.

Slurm Compatibility Aliases

For sites migrating from Slurm, optional shell aliases:

# Source from lattice-provided script
source $(lattice compat-aliases)

# Provides:
# sbatch → lattice submit
# squeue → lattice status
# scancel → lattice cancel
# salloc → lattice session
# srun → lattice launch
# sinfo → lattice nodes
# sacct → lattice history

These aliases translate Slurm flags to Lattice flags where possible. See slurm-migration.md for details.

Telemetry Architecture

Design Principle

Collect at high resolution, aggregate at configurable resolution, transmit out-of-band.

Three-Layer Pipeline

Layer 1: Collection (eBPF, always-on)

eBPF programs JIT-compiled into kernel, attached to tracepoints and kprobes.

Kernel-level metrics:

  • CPU: context switches, runqueue depth, scheduling latency histograms
  • Network: per-flow bytes/packets, Slingshot CSIG congestion signals from packet headers
  • Block I/O: latency histograms, throughput per device (NVMe scratch, network mounts)
  • Memory: allocation/free rates, NUMA locality, page faults

GPU metrics (via NVML/DCGM hooks):

  • SM occupancy, memory utilization, power draw
  • PCIe/NVLink throughput
  • ECC error counts (feeds into checkpoint cost model)

Runtime overhead: ~0.3% on compute-bound workloads. eBPF programs run in kernel context: no syscall overhead, no userspace daemon polling.

Data flows into per-CPU ring buffers (BPF_MAP_TYPE_RINGBUF), consumed by the node agent.

Layer 2: Aggregation (Node Agent, switchable)

The node agent reads ring buffers and aggregates based on the current mode.

Mode: prod (default)

  • 30-second aggregation windows
  • Statistical summaries: p50, p95, p99, mean, max, count
  • Cubic-spline interpolation for time-series smoothing (reduces storage, preserves trends)
  • Transmitted on Slingshot telemetry traffic class (separate from compute traffic)
  • Additional overhead: ~0.1%

Mode: debug (per-job or per-node, time-limited)

  • 1-second or sub-second raw event streams
  • Full per-flow network traces
  • GPU kernel-level profiling (CUPTI integration)
  • Stored to job-specific S3 path for user analysis
  • Additional overhead: ~2-5% (acceptable for debugging)
  • Auto-reverts to prod after configured duration (default: 30 minutes)

Mode: audit (sensitive vCluster)

  • All file access events (open, read, write, close) with user identity
  • All API calls logged with request/response metadata
  • Network flow summaries (source, destination, bytes, duration)
  • Signed with Sovra keys (if federation enabled) for tamper evidence
  • Additional overhead: ~1%
  • Retention: 7 years (cold tier, S3-compatible archive)

Layer 3: Storage and Query

Time-series store — recommended: VictoriaMetrics (single-node or cluster) for single-site deployments; Thanos on top of Prometheus for federated multi-site deployments that need a global query view across sites:

  • Ingestion: all nodes stream aggregated metrics
  • Auto-downsampling: raw → 1m → 5m → 1h → 1d
  • Retention policy configurable per tenant/vCluster

Four materialized views (label-based access control):

View      Audience           Content
Holistic  System admins      System-wide utilization, power, health, scheduling efficiency
Tenant    Tenant admins      Per-tenant resource usage, quota tracking, job statistics
vCluster  Scheduler          Metrics feeding into cost function (GPU util, I/O, congestion)
User      Allocation owners  Per-allocation metrics scoped by OIDC identity (via lattice-api)

Query interface: PromQL-compatible API. Grafana dashboards for visualization.

Debug traces: Stored to s3://{tenant}/{project}/{job_id}/telemetry/ with short retention (7 days default, configurable).

Audit logs: Append-only, encrypted at rest, stored to dedicated audit storage with long retention. Queryable for compliance reporting.

Switching Telemetry Mode

Via Intent API:

PATCH /v1/allocations/{id}
{ "telemetry": { "mode": "debug", "duration": "30m" } }

Via CLI:

lattice telemetry --alloc=12345 --mode=debug --duration=30m

Switching is instant — the eBPF programs are always collecting at full resolution. Only the aggregation behavior changes.

User-Facing Telemetry Query

The telemetry pipeline serves admin dashboards and the scheduler cost function. The user-facing query layer adds scoped access so allocation owners can query their own metrics without admin intervention.

Query Path

User → lattice-api → PromQL (scoped by alloc/tenant/user) → TSDB → response

The lattice-api injects label filters to ensure users only see metrics for their own allocations. Tenant admins can query any allocation within their tenant.
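A minimal sketch of that label injection, assuming the API rewrites a bare metric selector. A production implementation would rewrite the parsed PromQL AST so user-supplied matchers cannot widen the scope; this only illustrates the idea:

```python
# Server-side scoping sketch: wrap a metric name in label matchers the
# caller cannot remove. Label names (alloc_id, tenant) are assumptions.

def scope_query(metric: str, alloc_id: str, tenant: str) -> str:
    # Doubled braces escape literal { } in the f-string.
    return f'{metric}{{alloc_id="{alloc_id}",tenant="{tenant}"}}'

print(scope_query("gpu_utilization", "12345", "physics"))
# gpu_utilization{alloc_id="12345",tenant="physics"}
```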

Scoping Rules

Caller            Visible Scope
Allocation owner  Metrics for their own allocations
Tenant admin      Metrics for any allocation in their tenant
System admin      All metrics (holistic view)

User Metrics Catalog

Metric            Description                        Available In
gpu_utilization   SM occupancy per GPU               prod, debug, audit
gpu_memory_used   GPU memory in use                  prod, debug, audit
gpu_power_draw    GPU power consumption              prod, debug, audit
cpu_utilization   CPU usage per node                 prod, debug, audit
memory_used       System memory in use               prod, debug, audit
network_tx_bytes  Network bytes sent per second      prod, debug, audit
network_rx_bytes  Network bytes received per second  prod, debug, audit
io_read_bytes     Storage read throughput            prod, debug, audit
io_write_bytes    Storage write throughput           prod, debug, audit
io_latency_p99    Storage I/O latency (p99)          prod, debug, audit

Telemetry Streaming

For use cases requiring push-based updates (e.g., lattice watch), the StreamMetrics RPC fans out to node agents running the target allocation and merges their streams.

Architecture

lattice-api receives StreamMetrics request
    → identifies nodes running allocation (from quorum state)
    → opens per-node metric streams to node agents
    → merges streams with allocation-scoped labels
    → returns unified server-streaming response to client

In prod mode, node agents emit aggregated snapshots every 30 seconds. In debug mode, raw events stream at 1-second intervals. The client receives the same resolution as the current telemetry mode — switching mode (via PATCH /v1/allocations/{id}) takes effect on active streams.

Alert Generation

Node agents evaluate threshold rules locally and inject MetricAlert events into the stream when:

  • GPU utilization < 10% for > 60s (potential hang)
  • GPU memory > 95% (OOM risk)
  • Network error rate exceeds 0.1%
  • I/O p99 latency exceeds 10ms
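The rules above might be evaluated locally along these lines. The thresholds are the documented ones; the Sample structure and alert names are assumptions for illustration:

```python
# Local threshold evaluation sketch for the node agent's alert rules.
from dataclasses import dataclass

@dataclass
class Sample:
    gpu_util: float          # fraction, 0..1
    gpu_mem_frac: float      # fraction of GPU memory in use
    net_error_rate: float    # fraction of packets errored
    io_p99_ms: float         # p99 I/O latency, milliseconds
    low_util_seconds: float  # how long gpu_util has been below 10%

def alerts(s: Sample) -> list[str]:
    out = []
    if s.gpu_util < 0.10 and s.low_util_seconds > 60:
        out.append("gpu_low_util")   # potential hang
    if s.gpu_mem_frac > 0.95:
        out.append("gpu_mem_high")   # OOM risk
    if s.net_error_rate > 0.001:
        out.append("net_errors")     # error rate > 0.1%
    if s.io_p99_ms > 10.0:
        out.append("io_latency")     # p99 > 10ms
    return out

print(alerts(Sample(0.05, 0.97, 0.0, 2.0, 120.0)))
# ['gpu_low_util', 'gpu_mem_high']
```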

Cross-Allocation Comparison

Users can compare metrics across multiple allocations (e.g., successive training runs) via the CompareMetrics RPC or GET /v1/compare.

TSDB Query

The lattice-api issues parallel PromQL queries for each allocation ID, scoped to the requesting user’s permissions. Results are aligned by relative time (see below).

Relative Time Alignment

Allocations may run at different wall-clock times. Comparison uses relative-to-start alignment: each allocation’s metric series is indexed from t=0 (the allocation’s started_at timestamp). This allows apples-to-apples comparison of metrics across runs that started hours or days apart.
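Relative-to-start alignment is a simple re-indexing. This sketch assumes each series arrives as (wall_clock_timestamp, value) pairs:

```python
# Re-index a metric series from wall-clock time to seconds since the
# allocation's started_at, so runs from different days line up at t=0.

def align(series: list[tuple[float, float]],
          started_at: float) -> list[tuple[float, float]]:
    """series: [(wall_clock_ts, value)] -> [(seconds_since_start, value)]"""
    return [(ts - started_at, v) for ts, v in series]

run_a = align([(1000.0, 0.90), (1030.0, 0.92)], started_at=1000.0)
run_b = align([(5000.0, 0.70), (5030.0, 0.85)], started_at=5000.0)
# Both runs are now indexed from t=0 and directly comparable:
print(run_a)  # [(0.0, 0.9), (30.0, 0.92)]
print(run_b)  # [(0.0, 0.7), (30.0, 0.85)]
```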

Feedback to Scheduler

The telemetry system feeds key metrics back to the scheduling cost function:

Metric                     Cost Function Component  Effect
GPU utilization per job    Efficiency scoring       Low util → deprioritize for topology-premium placement
Network congestion (CSIG)  topology_fitness         Congested groups → avoid placing new jobs there
I/O throughput per job     data_readiness           High I/O demand → ensure storage QoS before scheduling
Node ECC errors            checkpoint cost model    Rising errors → increase checkpoint urgency
Power draw per node        energy_cost              Feeds into power budget constraint

Telemetry Aggregation Topology

For large systems (10,000+ nodes), direct streaming to a central store creates an ingestion bottleneck. Use hierarchical aggregation:

Nodes (per-group) → Group Aggregator → Central Store

Each Slingshot dragonfly group has a designated aggregator node.
Group aggregators perform first-level aggregation (merge per-node summaries).
Central store receives per-group aggregated streams.

In debug mode, the affected job's nodes bypass group aggregation and stream directly to the central store.
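First-level aggregation amounts to merging per-node statistical summaries into one per-group summary before forwarding. The summary fields here are illustrative:

```python
# Merge per-node summaries into a single per-group summary.
# Counts add, means combine weighted by count, maxima take the max.

def merge_group(node_summaries: list[dict]) -> dict:
    n = sum(s["count"] for s in node_summaries)
    return {
        "count": n,
        "mean": sum(s["mean"] * s["count"] for s in node_summaries) / n,
        "max": max(s["max"] for s in node_summaries),
    }

group3 = merge_group([
    {"count": 30, "mean": 0.90, "max": 0.99},
    {"count": 30, "mean": 0.80, "max": 0.95},
])
print(group3)  # count 60, mean ≈ 0.85, max 0.99
```

Percentiles do not merge this way; a real aggregator would carry mergeable sketches (e.g., histograms) rather than precomputed p99 values.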

Scheduler Self-Monitoring

Internal metrics for monitoring Lattice’s own health. These metrics feed into canary criteria during rolling upgrades (cross-ref: upgrades.md) and are available on the holistic dashboard.

Scheduling Metrics

Metric                                             Type       Labels                                Description
lattice_scheduling_cycle_duration_seconds          histogram  vcluster                              Time to complete one scheduling cycle
lattice_scheduling_queue_depth                     gauge      vcluster                              Number of pending allocations
lattice_scheduling_proposals_total                 counter    vcluster, result (accepted/rejected)  Proposals sent to quorum
lattice_scheduling_cost_function_duration_seconds  histogram  vcluster                              Time to evaluate the cost function for all candidates
lattice_scheduling_backfill_jobs_total             counter    vcluster                              Allocations placed via backfill

Quorum Metrics

Metric                                  Type       Labels     Description
lattice_raft_leader                     gauge      member_id  1 if this member is leader, 0 if follower
lattice_raft_commit_latency_seconds     histogram  member_id  Time from proposal to commit
lattice_raft_log_entries                gauge      member_id  Number of entries in the Raft log
lattice_raft_snapshot_duration_seconds  histogram  member_id  Time to create a Raft snapshot

API Metrics

Metric                                Type       Labels                             Description
lattice_api_requests_total            counter    method, status                     Total API requests
lattice_api_request_duration_seconds  histogram  method                             Request latency
lattice_api_active_streams            gauge      stream_type (attach/logs/metrics)  Active streaming connections

Node Agent Metrics

Metric                                    Type       Labels   Description
lattice_agent_heartbeat_latency_seconds   histogram  node_id  Heartbeat round-trip time
lattice_agent_allocation_startup_seconds  histogram  node_id  Time from allocation assignment to process start (includes uenv pull/mount)
lattice_agent_ebpf_overhead_percent       gauge      node_id  Measured eBPF collection overhead

Accounting Metrics

Metric                                   Type     Labels  Description
lattice_accounting_events_buffered       gauge    —       Events in the in-memory accounting buffer
lattice_accounting_events_dropped_total  counter  —       Events dropped due to buffer overflow

Federation Broker Metrics

When federation is enabled, the federation broker exposes additional metrics:

Metric                                       Type       Labels                                    Description
lattice_federation_proposals_total           counter    peer, result (accepted/rejected/timeout)  Placement proposals sent to/from peers
lattice_federation_proposal_latency_seconds  histogram  peer                                      Round-trip time for federation proposals
lattice_federation_peer_status               gauge      peer                                      1 = connected, 0 = unreachable
lattice_federation_data_gravity_score        gauge      peer, dataset                             Data gravity score for placement decisions (higher = more data at peer)

These metrics are only active when federation.enabled = true. The federation broker exposes them on the same /metrics endpoint as other components (default port: 9105).

Alerting Rules

Example alerting rules (PromQL-compatible):

Rule                             Condition                                                                                            Severity
Scheduling cycle slow            histogram_quantile(0.99, lattice_scheduling_cycle_duration_seconds) > 30                             warning
Queue depth high                 lattice_scheduling_queue_depth > 100 for 5 minutes                                                   warning
Raft commit slow                 histogram_quantile(0.99, lattice_raft_commit_latency_seconds) > 5                                    critical
Node heartbeat missing           time() - lattice_agent_last_heartbeat_timestamp > 60                                                 node degraded
API error rate spike             rate(lattice_api_requests_total{status=~"5.."}[5m]) / rate(lattice_api_requests_total[5m]) > 0.05    warning
Accounting buffer filling        lattice_accounting_events_buffered > 8000                                                            warning
VNI pool exhaustion approaching  (lattice_network_vni_pool_total - lattice_network_vni_pool_available) / lattice_network_vni_pool_total > 0.90  warning
Quota utilization high           lattice_quota_used_nodes / lattice_quota_max_nodes > 0.95 for 10 minutes                             warning
Raft disk usage high             lattice_raft_disk_used_bytes / lattice_raft_disk_total_bytes > 0.80                                  warning
Snapshot storage growth          rate(lattice_raft_snapshot_size_bytes[1h]) > 100e6                                                   info
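As a concrete example, the first rule might be written as a Prometheus alerting rule. Note that in real PromQL, histogram quantiles are computed over the `_bucket` series with `rate()`; the shorthand in the table elides this:

```yaml
# Illustrative Prometheus rule file; group/alert names are assumptions.
groups:
  - name: lattice-scheduler
    rules:
      - alert: SchedulingCycleSlow
        expr: histogram_quantile(0.99, rate(lattice_scheduling_cycle_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 scheduling cycle exceeds 30s on {{ $labels.vcluster }}"
```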

Dashboard Views

Three views matching the existing telemetry pattern:

Dashboard          Audience             Key Panels
Holistic           System admins        All scheduler cycle times, quorum health, total queue depth, API throughput
Per-vCluster       Scheduler operators  vCluster-specific queue depth, cycle time, proposal accept rate, backfill rate
Per-quorum-member  Quorum operators     Raft log size, commit latency, leader status, snapshot timing

Monitoring Deployment

Prometheus Scrape Configuration

All Lattice components expose metrics on a /metrics endpoint (Prometheus exposition format):

Component            Default Metrics Port  Endpoint
Quorum members       9100                  http://{quorum-host}:9100/metrics
API servers          9101                  http://{api-host}:9101/metrics
vCluster schedulers  9102                  http://{scheduler-host}:9102/metrics
Node agents          9103                  http://{node-host}:9103/metrics
Checkpoint broker    9104                  http://{checkpoint-host}:9104/metrics

Example Prometheus scrape config:

scrape_configs:
  - job_name: "lattice-quorum"
    static_configs:
      - targets: ["quorum-1:9100", "quorum-2:9100", "quorum-3:9100"]

  - job_name: "lattice-api"
    static_configs:
      - targets: ["api-1:9101", "api-2:9101"]

  - job_name: "lattice-scheduler"
    static_configs:
      - targets: ["scheduler-hpc:9102", "scheduler-ml:9102", "scheduler-interactive:9102"]

  - job_name: "lattice-agents"
    file_sd_configs:
      - files: ["/etc/prometheus/lattice-agents.json"]
        refresh_interval: 5m
    # Node agents are numerous; use file-based service discovery
    # populated from OpenCHAMI node inventory
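The file referenced by file_sd_configs uses Prometheus's standard file-based service discovery format. A generator fed from OpenCHAMI inventory might emit entries like this (hostnames and labels illustrative):

```json
[
  {
    "targets": ["x1000c0s0b0n0:9103", "x1000c0s0b0n1:9103"],
    "labels": { "vcluster": "hpc-batch", "dragonfly_group": "3" }
  },
  {
    "targets": ["x1000c0s1b0n0:9103"],
    "labels": { "vcluster": "ml-training", "dragonfly_group": "7" }
  }
]
```

Prometheus re-reads the file on the configured refresh_interval, so adding or draining nodes requires no Prometheus restart.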

Alert Routing

Alerts are routed via Alertmanager (or compatible system):

Severity  Route                  Response Time
Critical  PagerDuty / on-call    Immediate (< 15 min)
Warning   Slack #lattice-alerts  Business hours (< 4 hours)
Info      Slack #lattice-info    Best effort

Example Alertmanager route:

route:
  receiver: "slack-info"
  routes:
    - match: { severity: "critical" }
      receiver: "pagerduty-oncall"
    - match: { severity: "warning" }
      receiver: "slack-alerts"

Grafana Dashboards

Pre-built dashboards for the three views described above. Dashboards are defined as JSON and version-controlled in infra/grafana/:

infra/grafana/
├── holistic.json          # System-wide overview
├── per-vcluster.json      # vCluster-specific scheduling
├── per-quorum-member.json # Raft health
├── per-node.json          # Individual node health
└── user-allocation.json   # User-facing allocation metrics

Each dashboard uses the standard Lattice metric names. Data source: Prometheus (or compatible TSDB).

TSDB Sizing

Cluster Size  Metric Cardinality  Ingestion Rate   Storage (30-day retention)
100 nodes     ~50,000 series      ~10k samples/s   ~50 GB
1,000 nodes   ~500,000 series     ~100k samples/s  ~500 GB
10,000 nodes  ~5,000,000 series   ~1M samples/s    ~5 TB

For clusters > 1000 nodes, use a horizontally scalable TSDB (VictoriaMetrics cluster, Mimir, or Thanos) with the hierarchical aggregation described in the Telemetry Aggregation Topology section above.

User-Facing Observability & Debugging

Design Principle

Lattice already collects high-resolution telemetry (eBPF, TSDB, three aggregation modes) for operator and scheduler use. This document describes the user-facing surface that lets job owners debug, monitor, and profile their own allocations without admin intervention.

All observability data flows through existing pipelines — no new collection infrastructure is required. The user-facing layer adds scoped query access, streaming endpoints, and interactive attach.

Overview

┌─ User ───────────────────────────────────────────────────────┐
│  lattice attach / logs / top / watch / diag / compare        │
│         │           │          │         │         │         │
│         ▼           ▼          ▼         ▼         ▼         │
│    ┌─────────── lattice-api (gRPC + REST) ───────────────┐   │
│    │  Attach ──────────────── bidir stream to node agent │   │
│    │  Logs ────────────────── ring buffer (live) + S3    │   │
│    │  Metrics ─────────────── PromQL query to TSDB       │   │
│    │  StreamMetrics ───────── fan-out to node agents     │   │
│    │  Diagnostics ─────────── TSDB + fabric telemetry    │   │
│    │  Compare ─────────────── multi-alloc TSDB query     │   │
│    └─────────────────────────────────────────────────────┘   │
│         │           │          │         │                   │
│         ▼           ▼          ▼         ▼                   │
│    Node Agents    S3 logs     TSDB    Slingshot CSIG         │
└──────────────────────────────────────────────────────────────┘

Capability                     Data Source                  Latency          CLI Command
Attach to running allocation   Node agent (nsenter)         Real-time        lattice attach <id>
Log streaming (live tail)      Node agent ring buffer       Sub-second       lattice logs <id> --follow
Historical logs                S3                           Seconds          lattice logs <id>
Live metrics (top)             TSDB                         30s (prod mode)  lattice top <id>
Live telemetry stream (watch)  Node agents (push)           1-30s            lattice watch <id>
Diagnostics                    TSDB + fabric telemetry      30s              lattice diag <id>
Cross-allocation comparison    TSDB                         Seconds          lattice compare <id1> <id2>
Application profiling          User tools (via tools_uenv)  N/A              User-driven

Attach to Running Allocation

Architecture

The attach mechanism provides an interactive terminal session inside a running allocation’s execution environment. The node agent uses nsenter to enter the allocation’s mount and network namespaces — this is not a new allocation, just a terminal session in the existing one.

User → lattice-cli → lattice-api → gRPC bidir stream → node agent
                                                         │
                                                    nsenter into
                                                    mount/net ns
                                                         │
                                                    PTY ↔ shell

Terminal Protocol

The gRPC bidirectional stream carries:

  • Client → Server: stdin bytes, terminal resize events, signals (SIGINT, SIGTSTP)
  • Server → Client: stdout/stderr bytes, exit code on completion

The stream begins with an AttachStart message specifying the target node (for multi-node allocations) and command (default: user’s shell).

Authorization Model

vCluster Type            Who Can Attach      Additional Constraints
HPC (backfill)           Allocation owner    —
Service (bin-pack)       Allocation owner    —
Interactive (FIFO)       Allocation owner    Already has session; attach is secondary terminal
Sensitive (reservation)  Claiming user only  Session recorded, audit trail, signed uenv only

Sensitive Constraints

  • Only the user who claimed the nodes (identity from Raft audit log) can attach
  • All attach sessions are recorded (input + output) to the sensitive audit log
  • Attach is only permitted when the allocation runs a signed uenv
  • Session start/end events are Raft-committed audit entries

Attach During Node Crash

If the node hosting an attach session crashes or becomes unreachable:

  • The gRPC bidirectional stream is dropped (connection reset).
  • The API server detects the stream drop and sets ended_at on the AttachSession record.
  • For sensitive allocations, the session end event is recorded in the audit log with reason node_unreachable.
  • The client receives a stream error and can display: "connection to node lost — attach session ended".

Attach During Preemption

If the allocation is preempted while an attach session is active, the session is terminated gracefully. See sessions.md for the detailed preemption sequence. If the allocation is in Checkpointing state, new attach requests are rejected with: "allocation is being checkpointed — attach unavailable until rescheduled".

CLI Usage

# Attach to allocation (first node, user's shell)
lattice attach 12345

# Attach to a specific node
lattice attach 12345 --node=x1000c0s0b0n3

# Attach with a specific command
lattice attach 12345 --command="nvidia-smi -l 1"

Slurm Compatibility

Slurm                        Lattice
srun --jobid=123 --pty bash  lattice attach 123

Log Streaming

Dual-Path Architecture

Logs use two paths to balance latency and durability:

  1. Ring buffer (live tail): Each node agent maintains a per-allocation ring buffer (default 64 MB) of stdout/stderr. Supports low-latency streaming for --follow mode. Data is ephemeral — lost when the allocation ends or the buffer wraps.

  2. S3 persistence: Node agents periodically flush log chunks to S3 for durable storage. Available during and after allocation execution.

Process stdout/stderr
    │
    ├──→ Ring buffer (node agent, 64 MB)
    │         │
    │         └──→ gRPC StreamLogs (live tail)
    │
    └──→ S3 flush (periodic, configurable interval)
              │
              └──→ REST GET /logs (historical)

Log Storage Layout

s3://{tenant}/{project}/{alloc_id}/logs/
    ├── stdout/{node_id}/{chunk_000..N}.log.zst
    ├── stderr/{node_id}/{chunk_000..N}.log.zst
    └── metadata.json    # timestamps, byte offsets, node list

Logs are compressed with zstd. The metadata file enables efficient range queries by time or byte offset.
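The metadata file makes range queries cheap: the API can fetch only the chunks overlapping a requested byte range. A sketch, assuming per-chunk offset/length entries in metadata.json (the actual metadata schema is not specified here):

```python
# Select only the log chunks that overlap a requested byte range,
# using per-chunk (offset, length) entries from the metadata file.

def chunks_for_range(chunks: list[dict], start: int, end: int) -> list[str]:
    """chunks: [{"name": ..., "offset": first_byte, "length": n_bytes}]"""
    out = []
    for c in chunks:
        c_start, c_end = c["offset"], c["offset"] + c["length"]
        if c_start < end and c_end > start:  # half-open interval overlap
            out.append(c["name"])
    return out

meta = [
    {"name": "chunk_000.log.zst", "offset": 0,    "length": 1024},
    {"name": "chunk_001.log.zst", "offset": 1024, "length": 1024},
    {"name": "chunk_002.log.zst", "offset": 2048, "length": 1024},
]
print(chunks_for_range(meta, 1500, 2100))
# ['chunk_001.log.zst', 'chunk_002.log.zst']
```

Time-based queries (since/until) work the same way against the per-chunk timestamps.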

Streaming (Live Tail)

Via gRPC StreamLogs RPC (server-streaming). The client specifies:

  • Allocation ID
  • Stream filter: stdout, stderr, or both
  • Node filter: specific node or all nodes
  • Follow mode: whether to keep streaming as new output arrives
  • Tail lines: number of lines from the ring buffer to replay on connect

Historical Log Access

Via REST GET /v1/allocations/{id}/logs:

  • Query params: stream (stdout/stderr), node, since, until, offset, limit
  • Returns paginated log entries from S3
  • Available after allocation completion (subject to retention policy)

Sensitive Constraints

  • Logs from sensitive allocations are encrypted at rest in the dedicated sensitive S3 pool
  • All log access events are recorded in the sensitive audit log
  • Log retention follows sensitive data retention policy (user-specified, minimum per regulation)
  • Logs are only accessible to the claiming user and designated compliance reviewers

CLI Usage

# View logs (all nodes, both streams)
lattice logs 12345

# Follow mode (live tail)
lattice logs 12345 --follow

# Filter by stream and node
lattice logs 12345 --stderr --node=x1000c0s0b0n3

# Tail last 100 lines
lattice logs 12345 --tail=100

# Historical range
lattice logs 12345 --since="2026-03-01T10:00:00Z" --until="2026-03-01T11:00:00Z"

Slurm Compatibility

Slurm                  Lattice
cat slurm-123.out      lattice logs 123
tail -f slurm-123.out  lattice logs 123 --follow

User-Facing Live Metrics (lattice top)

Query Path

Metrics are served from the TSDB (not directly from node agents). The lattice-api translates user queries into PromQL, scoped to the requesting user’s allocations.

lattice top <id> → lattice-api → PromQL → TSDB → response

This reuses the existing telemetry pipeline. In prod mode, data has 30-second resolution. In debug mode (if switched), 1-second resolution.

Metrics Catalog

Metric            Description                Unit
gpu_utilization   SM occupancy per GPU       %
gpu_memory_used   GPU memory in use          bytes
gpu_power_draw    GPU power consumption      watts
cpu_utilization   CPU usage per node         %
memory_used       System memory in use       bytes
network_tx_bytes  Network bytes sent         bytes/s
network_rx_bytes  Network bytes received     bytes/s
io_read_bytes     Storage read throughput    bytes/s
io_write_bytes    Storage write throughput   bytes/s
io_latency_p99    Storage I/O latency (p99)  microseconds

Display Modes

Mode               Flag        Content
Summary (default)  —           Aggregated across all nodes: mean GPU%, total mem, total I/O
Per-node           --per-node  One row per node
Per-GPU            --per-gpu   One row per GPU across all nodes
Wide               --wide      All metrics in a wide table

REST + gRPC Access

  • REST: GET /v1/allocations/{id}/metrics?mode=summary&duration=5m
  • gRPC: QueryMetrics RPC with MetricsQueryRequest

CLI Usage

# Summary view (default)
lattice top 12345

# Per-node breakdown
lattice top 12345 --per-node

# Per-GPU breakdown
lattice top 12345 --per-gpu

# Wide format with all metrics
lattice top 12345 --wide

# Custom time window
lattice top 12345 --duration=1h

Live Telemetry Stream (lattice watch)

Push-Based Event Stream

Unlike lattice top (which queries TSDB), lattice watch opens a push-based stream from node agents for near-real-time events.

lattice watch <id> → lattice-api → fan-out → node agents
                          ↑
                     stream merge
                          ↑
             per-node MetricsEvent streams

Relationship to Telemetry Modes

Telemetry Mode  lattice top Resolution  lattice watch Resolution
prod            30s (TSDB)              30s (prod aggregation from node agent)
debug           1s (TSDB)               1s (raw events from node agent)
audit           30s (TSDB)              30s + access events

Switching to debug mode (lattice telemetry --alloc=12345 --mode=debug) increases resolution for both top and watch.

Stream Content

Each MetricsEvent contains:

  • Timestamp and node ID
  • Current metric values (GPU, CPU, memory, network, I/O)
  • Threshold alerts (if any metric exceeds configured bounds)

Alerts are generated by node agents when metrics cross thresholds:

  • GPU utilization drops below 10% (potential hang)
  • GPU memory utilization exceeds 95% (OOM risk)
  • Network error rate exceeds threshold
  • I/O latency spike detected

CLI Usage

# Watch all metrics (refreshing display)
lattice watch 12345

# Watch specific metrics
lattice watch 12345 --metrics=gpu_utilization,memory_used

# Watch with alerts only (suppress normal updates)
lattice watch 12345 --alerts-only

Diagnostics View

Network Diagnostics

Network health is critical for multi-node allocations. Diagnostics expose Slingshot-specific metrics that are otherwise invisible to users.

| Metric | Description | Source |
| --- | --- | --- |
| CSIG congestion | In-band congestion signals per Slingshot group | eBPF CSIG tap |
| Group span | Number of dragonfly groups the allocation spans | Topology model |
| Inter-node bandwidth | Measured bandwidth between node pairs | eBPF network flow |
| NVLink throughput | GPU-to-GPU bandwidth (intra-node) | NVML |

Storage Diagnostics

| Metric | Description | Source |
| --- | --- | --- |
| QoS floor vs actual | Configured storage QoS vs measured throughput | VAST API + eBPF I/O |
| Latency histogram | I/O latency distribution (p50/p95/p99) | eBPF block I/O |
| Mount health | Per-mount status (NFS, S3, scratch) | Node agent |
| IOPS | Read/write operations per second | eBPF block I/O |

Combined Diagnostics

lattice diag combines network and storage diagnostics into a single view with health indicators:

$ lattice diag 12345

Network:
  Group span:     2 groups (g3, g7)
  CSIG congestion: LOW (0.02 avg)
  Inter-node BW:  190 GB/s avg (target: 200 GB/s) ✓

Storage:
  /data/input (NFS):  12.5 GB/s read (QoS floor: 10 GB/s) ✓
  /scratch (NVMe):    6.2 GB/s write, p99 latency: 45µs ✓
  /home (NFS):        0.1 GB/s (idle) ✓

GPUs:
  SM occupancy:   92% avg across 256 GPUs ✓
  NVLink:         850 GB/s avg (of 900 GB/s) ✓
  ECC errors:     0 ✓

CLI Usage

# Full diagnostics
lattice diag 12345

# Network only
lattice diag 12345 --network

# Storage only
lattice diag 12345 --storage

Cross-Allocation Comparison

TSDB Query

lattice compare queries the same TSDB data used by lattice top to compare metrics across multiple allocations. This is useful for regression detection across training runs.

Time Alignment

Comparisons use relative-to-start time alignment: each allocation’s metrics are indexed from t=0 (allocation start), not wall clock time. This allows meaningful comparison of allocations that ran at different times.
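
The alignment rule can be sketched as rebasing each sample's timestamp against its allocation's start time. This is an illustrative sketch with hypothetical types, not the actual lattice-api implementation:

```rust
// Relative-to-start alignment: each (timestamp, value) sample is rebased
// so that t=0 is the allocation start, making series from allocations
// that ran at different wall-clock times directly comparable.
fn align_relative(start_ts: u64, samples: &[(u64, f64)]) -> Vec<(u64, f64)> {
    samples
        .iter()
        .map(|&(ts, value)| (ts - start_ts, value)) // t=0 is allocation start
        .collect()
}

fn main() {
    // Two runs started an hour apart; after alignment both begin at t=0.
    let run_a = align_relative(1000, &[(1000, 0.50), (1030, 0.92)]);
    let run_b = align_relative(4600, &[(4600, 0.48), (4630, 0.91)]);
    assert_eq!(run_a[0].0, 0);
    assert_eq!(run_b[0].0, 0);
    assert_eq!(run_b[1].0, 30); // 30s into the run, regardless of start time
}
```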

CLI Usage

# Compare two allocations
lattice compare 12345 12346

# Compare specific metric
lattice compare 12345 12346 --metric=gpu_utilization

# JSON output for scripting
lattice compare 12345 12346 --output=json

REST Interface

GET /v1/compare?ids=12345,12346&metrics=gpu_utilization,io_write_bytes&align=relative

Application Profiling Integration

Scope

Lattice provides mechanisms for profiling, not profiler implementations. Users bring their own profiling tools, delivered via tools_uenv.

Profiler Delivery

Profiling tools are packaged as uenv images and mounted alongside the application uenv:

environment:
  uenv: "prgenv-gnu/24.11:v1"        # application stack
  tools_uenv: "profiling/2024.1"     # profilers: nsight, vtune, darshan, etc.

The tools_uenv mount provides profiler binaries without contaminating the application environment.

Usage Patterns

Batch profiling (non-interactive):

# Submit with profiling tools
lattice submit --uenv=prgenv-gnu/24.11:v1 --tools-uenv=profiling/2024.1 script.sh
# Script uses profiler internally (e.g., nsys profile ./train)
# Results written to output directory

Interactive profiling (attach-based):

# Attach and run profiler interactively
lattice attach 12345 --command="nsys profile --delay=60 -o /scratch/profile ./train"

Darshan / Score-P Integration Notes

  • Darshan: LD_PRELOAD-based I/O profiling. No Lattice-specific integration needed; user loads Darshan from tools_uenv and sets LD_PRELOAD. Darshan logs written to scratch/output.
  • Score-P: Instrumentation-based profiling. User compiles with Score-P wrappers from tools_uenv. Lattice provides no special support beyond tools delivery and attach.

Security Model

Authorization

All observability endpoints are scoped by OIDC token claims:

  • Users can only query their own allocations (or allocations in their tenant, if tenant-admin)
  • Token scopes: allocations:read (metrics, logs, diagnostics), allocations:attach (interactive attach)
  • Sensitive allocations: only the claiming user (verified against Raft audit log)

Rate Limiting

All rate limits are per user (identified by OIDC subject claim). Tenant admins and system admins share the same limits unless overridden in system configuration.

| Endpoint | Rate Limit | Scope | Rationale |
| --- | --- | --- | --- |
| Attach | 5 concurrent sessions | Per user | Resource-intensive (PTY per session) |
| StreamLogs | 10 concurrent streams | Per user | Memory (ring buffer readers) |
| QueryMetrics | 60 req/min | Per user | TSDB query load |
| StreamMetrics | 5 concurrent streams | Per user | Node agent fan-out |
| Diagnostics | 30 req/min | Per user | TSDB + fabric query load |
| Compare | 10 req/min | Per user | Multi-alloc TSDB queries |

When rate limit is exceeded:

  • Concurrent limits (Attach, StreamLogs, StreamMetrics): New request rejected with 429 Too Many Requests and a message: "maximum concurrent sessions reached (5/5). Close an existing session to open a new one."
  • Request-rate limits (QueryMetrics, Diagnostics, Compare): Request rejected with 429 Too Many Requests and Retry-After header indicating seconds until the next request is allowed.
  • No queueing — rejected requests must be retried by the client.

Admin override: System admins can adjust per-user rate limits via configuration:

rate_limits:
  attach_max_concurrent: 10       # override default of 5
  query_metrics_per_minute: 120   # override default of 60

Data Sensitivity

| Data Type | Sensitivity | Handling |
| --- | --- | --- |
| Metrics (GPU%, CPU%, I/O) | Low | Standard OIDC scoping |
| Logs (stdout/stderr) | Medium | May contain application data; encrypted at rest for sensitive |
| Attach (interactive terminal) | High | Session recorded for sensitive; PTY access = code execution |
| Diagnostics (network/storage) | Low | Infrastructure metrics, no application data |
| Profiling output | Medium | Written to user's storage, no Lattice-managed persistence |

Security Architecture

Design Principle

Defense in depth with zero-trust internal communication. Every component authenticates to every other component. Trust boundaries are explicit and enforced by mTLS, RBAC, and network segmentation.

Trust Boundaries

User ──OIDC──→ lattice-api (direct, via hpc-auth) ──mTLS──→ quorum
                                        │                    │
                                        │ mTLS               │ mTLS
                                        ▼                    ▼
                                   node-agents ──namespace──→ workloads
                                        │
                                        │ mTLS/REST
                                        ▼
                                   VAST / OpenCHAMI

Federation (optional):
  quorum ──Sovra mTLS──→ federation-broker ──Sovra mTLS──→ remote quorum

STRIDE Threat Analysis

Spoofing

| Boundary | Attack | Mitigation |
| --- | --- | --- |
| User → lattice-api | Stolen OIDC token | Short-lived tokens (5 min), token binding to client cert, MFA enforcement at IdP |
| Internal services | Rogue node agent | mTLS with site PKI (OpenCHAMI OPAAL-issued certificates). Node agents receive certs during boot via cloud-init. Cert CN must match node identity in quorum. |
| Federation | Rogue remote site | Sovra workspace-scoped certificates. Each site's identity is cryptographically bound to its Sovra workspace. Revocable. |

Tampering

| Boundary | Attack | Mitigation |
| --- | --- | --- |
| Quorum ↔ node agent | Fake heartbeat / state update | mTLS + message signing. Heartbeats include a monotonic sequence number for replay detection. |
| uenv images | Compromised image | Image signing with site PKI (or Sovra PKI for federated images). Node agent verifies signature + hash before mount. Unsigned images rejected. |
| Raft log | Log manipulation | Raft log entries are chained (each entry references the previous). Stored on local SSD with integrity checks. Snapshot checksums verified on restore. |
| API requests | Request modification in transit | TLS for all external connections. mTLS for all internal connections. |
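
The heartbeat replay check can be sketched as a per-node high-water mark: the quorum drops any heartbeat whose sequence number is not strictly greater than the last accepted one. Illustrative only; real heartbeats are also mTLS-authenticated and signed:

```rust
use std::collections::HashMap;

// Sketch of monotonic-sequence replay detection for heartbeats.
struct ReplayDetector {
    last_seq: HashMap<String, u64>, // node id -> last accepted sequence
}

impl ReplayDetector {
    fn new() -> Self {
        Self { last_seq: HashMap::new() }
    }

    /// Accept a heartbeat only if its sequence strictly advances.
    fn accept(&mut self, node: &str, seq: u64) -> bool {
        match self.last_seq.get(node) {
            Some(&last) if seq <= last => false, // replayed or stale
            _ => {
                self.last_seq.insert(node.to_string(), seq);
                true
            }
        }
    }
}

fn main() {
    let mut det = ReplayDetector::new();
    assert!(det.accept("node-042", 1));
    assert!(det.accept("node-042", 2));
    assert!(!det.accept("node-042", 2)); // exact replay rejected
    assert!(!det.accept("node-042", 1)); // stale heartbeat rejected
    assert!(det.accept("node-043", 1)); // sequences are tracked per node
}
```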

Repudiation

| Boundary | Attack | Mitigation |
| --- | --- | --- |
| Sensitive actions | User denies accessing sensitive data | Raft-committed audit log with user identity (from OIDC). Cryptographically signed entries (Sovra keys if available, otherwise site PKI). 7-year retention. Tamper-evident chain. |
| Allocation submission | User denies submitting allocation | All API requests logged with authenticated user identity. Audit trail in lattice-api access logs. |
| Node claims | Deny claiming sensitive nodes | Node claim is a Raft-committed operation with user identity. Cannot be repudiated. |

Information Disclosure

| Boundary | Attack | Mitigation |
| --- | --- | --- |
| Node ↔ storage | Data exfiltration via network sniffing | Encrypted transport: NFS-over-TLS (supported by VAST), S3 over HTTPS. Sensitive: encrypted at rest (VAST encrypted pool). |
| Cross-tenant | Side-channel via co-location | Full-node scheduling (ADR-007): no co-location of different tenants by default. Interactive vCluster uses Sarus containers with seccomp for intra-node isolation. |
| Telemetry | Metric leakage between tenants | Label-based access control on TSDB queries. lattice-api injects tenant/user scope filters. |
| Memory | Data remnants after allocation | Node agent zeroes GPU memory and clears scratch storage (NVMe or tmpfs) on allocation release. Sensitive: full node wipe via OpenCHAMI. |
| API responses | Enumeration of other tenants' data | RBAC filtering on all list/query endpoints. Users see only their own allocations; tenant admins see their tenant. |

Denial of Service

| Boundary | Attack | Mitigation |
| --- | --- | --- |
| User → API | API flooding | Rate limiting per tenant (token bucket). Admission control: reject requests that exceed the tenant's request quota. lattice-api provides rate limiting via Tower middleware. |
| Node → quorum | Heartbeat storm | Heartbeat coalescing: node agents batch heartbeats. Quorum-side rate limiting per node (max 1 heartbeat per interval). |
| Scheduling | Malicious allocation specs | Validation at the API layer: max resource requests bounded, max array size bounded, DAG cycle detection. Rejected before reaching the scheduler. |
| Storage | Storage exhaustion | Per-tenant storage quotas enforced by VAST. Checkpoint storage bounded per allocation. |

Elevation of Privilege

| Boundary | Attack | Mitigation |
| --- | --- | --- |
| User → scheduler | Escalate priority class | RBAC: priority class tied to tenant contract, not user request. Users cannot set priority above their tenant's maximum. |
| Node agent → host | Container/namespace escape | Sarus: seccomp profile, no root in container, read-only rootfs. uenv: mount namespace only (no user namespace needed), processes run as the submitting user. No setuid binaries in uenv images (enforced at build time). |
| Tenant admin → system admin | Escalate administrative scope | Distinct RBAC roles with no implicit promotion. System admin requires separate authentication (not derivable from a tenant admin token). |
| Workload → network | Break out of network domain | Slingshot VNI enforcement at the NIC level (hardware-enforced). Workloads can only communicate within their assigned network domain. |

Internal Service Authentication

All inter-component communication uses mTLS in production. Node agents acquire certificates via the identity cascade (SPIRE → SelfSigned CA → Bootstrap certs). When no mTLS identity is available (dev, testing, break-glass), agents fall back to Bearer token auth via LATTICE_AGENT_TOKEN.

| Component | Certificate Source | Rotation | Fallback |
| --- | --- | --- | --- |
| Quorum members | Pre-provisioned during deployment | Annual rotation; Raft membership change for re-keying | |
| Node agents | Identity cascade: SPIRE SVID → SelfSigned (quorum CA) → Bootstrap certs | CertRotator at 2/3 lifetime | LATTICE_AGENT_TOKEN Bearer token |
| API servers | Pre-provisioned or OPAAL | Annual rotation | |
| vCluster schedulers | Pre-provisioned or OPAAL | Annual rotation | |
| Checkpoint broker | Pre-provisioned or OPAAL | Annual rotation | |

Agent authentication priority:

  1. mTLS (production) — agent acquires a WorkloadIdentity via the identity cascade and configures the gRPC channel with ClientTlsConfig. Server verifies the client certificate. No Bearer token needed.
  2. Bearer token (dev/testing/break-glass) — when no mTLS identity is available, agent reads LATTICE_AGENT_TOKEN from the environment and injects it as Authorization: Bearer <token> on all gRPC calls. Server validates via HMAC or JWKS.

Both paths coexist — mTLS takes priority. The LATTICE_AGENT_TOKEN path should be disabled in production (env var unset).

Certificate CN format: {component}.{site}.lattice.internal (e.g., node-042.alps.lattice.internal).

CA trust chain: Site root CA → intermediate CA (OPAAL) → component certificates.

Secret Management

Sensitive values are never stored in configuration files:

| Secret | Storage | Access Pattern |
| --- | --- | --- |
| Waldur API token | Secrets manager (HashiCorp Vault or equivalent) | Referenced by path: vault://lattice/waldur-token |
| VAST API credentials | Secrets manager | Referenced by path |
| TLS private keys | Local filesystem (mode 0600) or TPM | Loaded at startup |
| OIDC client secret | Secrets manager | Used by hpc-auth (CLI) or lattice-api (server-side validation) |
| Sovra workspace key | Sovra key store (HSM-backed) | Used by federation broker |

Configuration files reference secrets by path, never by value:

waldur:
  token_secret_ref: "vault://lattice/waldur-token"
vast:
  credentials_ref: "vault://lattice/vast-creds"

RBAC Model

Three base roles, plus a sensitive-specific role:

| Role | Scope | Permissions |
| --- | --- | --- |
| user | Own allocations | Submit, cancel, query own allocations. View own metrics. Attach to own sessions. |
| tenant-admin | Tenant's allocations | All user permissions for any allocation in the tenant. Manage tenant quotas (within limits). View tenant-level metrics. |
| system-admin | All | All operations. Manage vClusters, nodes, tenants. View holistic metrics. |
| claiming-user | Claimed sensitive nodes | User role + claim/release sensitive nodes. Access sensitive storage pool. All actions audit-logged. |

Role assignment:

  • user role derived from OIDC token (any authenticated user)
  • tenant-admin assigned per-tenant in quorum state, or via tenant-admin role claim
  • system-admin assigned via quorum configuration, or via admin/system:admin scope
  • claiming-user assigned per-tenant by tenant-admin (sensitive tenants only)
  • operator assigned via operator scope or role claim

Cross-system role mapping (pact+lattice co-deployment):

When pact delegates operations to lattice (e.g., drain, cordon), the pact admin’s token carries a pact_role claim instead of lattice scopes. Lattice recognizes these cross-system role claims:

| Token claim | Value | Lattice role |
| --- | --- | --- |
| pact_role | pact-platform-admin | SystemAdmin |
| pact_role or lattice_role | system-admin | SystemAdmin |
| pact_role or lattice_role | tenant-admin | TenantAdmin |
| pact_role or lattice_role | operator | Operator |

Standard OIDC scopes take precedence over role claims. Both are checked by derive_role().
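
The precedence rule can be sketched as follows. The claim and scope names come from the tables in this section, but the function body is an assumption for illustration; the actual derive_role() in lattice-api may differ:

```rust
// Sketch of role derivation: standard OIDC scopes are checked first,
// and only then cross-system role claims (pact+lattice co-deployment).
#[derive(Debug, PartialEq)]
enum Role {
    SystemAdmin,
    TenantAdmin,
    Operator,
    User,
}

fn derive_role(scopes: &[&str], role_claim: Option<&str>) -> Role {
    // 1. Standard OIDC scopes take precedence.
    if scopes.contains(&"admin") || scopes.contains(&"system:admin") {
        return Role::SystemAdmin;
    }
    if scopes.contains(&"operator") {
        return Role::Operator;
    }
    // 2. Fall back to pact_role / lattice_role claims.
    match role_claim {
        Some("pact-platform-admin") | Some("system-admin") => Role::SystemAdmin,
        Some("tenant-admin") => Role::TenantAdmin,
        Some("operator") => Role::Operator,
        _ => Role::User, // any authenticated user gets the base role
    }
}

fn main() {
    assert_eq!(derive_role(&["system:admin"], None), Role::SystemAdmin);
    assert_eq!(derive_role(&[], Some("pact-platform-admin")), Role::SystemAdmin);
    // A scope wins even when a weaker role claim is present.
    assert_eq!(derive_role(&["admin"], Some("tenant-admin")), Role::SystemAdmin);
    assert_eq!(derive_role(&[], None), Role::User);
}
```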

Network Security

| Traffic Class | Network | Isolation |
| --- | --- | --- |
| Management (mTLS, heartbeats) | Slingshot management traffic class | Dedicated bandwidth reservation |
| Compute (MPI, NCCL) | Slingshot compute VNIs | Hardware-isolated per network domain |
| Storage (NFS, S3) | Slingshot storage traffic class | QoS-enforced bandwidth |
| Telemetry (metrics) | Slingshot telemetry traffic class | Separate from compute, low priority |
| User access (API, SSH) | Out-of-band Ethernet | Firewalled, rate-limited |

Slingshot traffic classes provide hardware-enforced isolation — compute traffic cannot starve management traffic and vice versa.

Certificate Rotation

Quorum Members

  1. Generate new certificate from site CA (same CN format)
  2. Deploy new cert + key to the target member’s TLS directory
  3. Perform Raft membership change: remove old member, add “new” member (same node, new cert)
  4. Verify: lattice admin raft status shows member healthy with new cert serial
  5. Repeat for each member (one at a time, maintaining majority)

Node Agents

Node agents receive certificates from OPAAL during boot. Rotation is automatic on reboot:

  1. Drain the node: lattice node drain <id>
  2. Reboot (or reimage) via OpenCHAMI
  3. Node boots with new OPAAL-issued certificate
  4. Undrain: lattice node undrain <id>

For batch rotation without reboot (if OPAAL supports renewal):

  1. Node agent requests new cert from OPAAL
  2. Node agent reloads TLS context (graceful, no connection drop)
  3. New cert active on next heartbeat

API Servers and Schedulers

  1. Generate new certificate from site CA
  2. Deploy new cert + key to the component’s TLS directory
  3. Restart the component (stateless — no data loss)
  4. Load balancer health check confirms the component is back

Federation (Sovra Certificates)

Sovra workspace keys are managed by the Sovra key rotation protocol. Lattice components use derived tokens, which are automatically refreshed. No Lattice-side action is required for routine Sovra key rotation.

For emergency revocation: revoke the Sovra shared workspace (see federation.md — Removing a Federation Peer).

Additional Security Considerations

OIDC Token Refresh for Long-Lived Streams

Long-lived gRPC streams (Attach, StreamLogs, StreamMetrics) may outlive the OIDC access token’s lifetime:

  • Token validation at stream open. The API server validates the OIDC token when the stream is established.
  • Periodic re-validation. For streams lasting longer than token_revalidation_interval (default: 5 minutes), the API server re-validates the token’s claims against the OIDC provider. If the token has been revoked or the user’s permissions have changed, the stream is terminated with an UNAUTHENTICATED error.
  • Client responsibility. Clients should refresh their access token before it expires and present the new token on reconnection if the stream is terminated.

Anti-Replay for API Requests

API requests are protected against replay attacks:

  • TLS as primary defense. All external API communication uses TLS, which provides replay protection at the transport layer.
  • Request idempotency. Mutating operations (Submit, Cancel, Update) use client-generated request_id fields for idempotency. Duplicate request_id values within a time window are rejected.
  • Raft proposal deduplication. The quorum deduplicates proposals using the proposing scheduler’s identity and a monotonic sequence number. Replayed proposals are ignored.
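
The request_id idempotency check can be sketched as a deduplication window. The semantics (reject duplicates inside the window, treat an expired id as new) follow the bullet above; the window length and data structure are assumptions:

```rust
use std::collections::HashMap;

// Sketch of request_id idempotency: a duplicate id inside the window is
// rejected; the same id seen after the window expires counts as new.
struct IdempotencyWindow {
    window_secs: u64,
    seen: HashMap<String, u64>, // request_id -> first-seen timestamp
}

impl IdempotencyWindow {
    fn new(window_secs: u64) -> Self {
        Self { window_secs, seen: HashMap::new() }
    }

    fn admit(&mut self, request_id: &str, now: u64) -> bool {
        match self.seen.get(request_id) {
            Some(&first) if now - first < self.window_secs => false, // duplicate
            _ => {
                self.seen.insert(request_id.to_string(), now);
                true
            }
        }
    }
}

fn main() {
    let mut dedup = IdempotencyWindow::new(300); // 5-minute window (assumed)
    assert!(dedup.admit("req-abc", 100));
    assert!(!dedup.admit("req-abc", 150)); // replay inside the window
    assert!(dedup.admit("req-abc", 500)); // same id after window expiry
}
```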

RBAC for Node Management

Node management operations (drain, undrain, disable) require the system-admin role:

| Operation | Required Role | Notes |
| --- | --- | --- |
| ListNodes, GetNode | user | Read-only, filtered by tenant scope |
| DrainNode, UndrainNode | system-admin | Affects scheduling across all tenants |
| DisableNode | system-admin | Removes node from scheduling entirely |
| Sensitive node claim | claiming-user | Sensitive-specific role within tenant |

Certificate CN vs NodeId Mapping

Node agent certificates use a deterministic CN format that maps to the node’s xname identity:

  • Format: {xname}.{site}.lattice.internal (e.g., x1000c0s0b0n0.alps.lattice.internal)
  • Validation: On each heartbeat, the quorum verifies that the certificate CN matches the node ID reported in the heartbeat payload. A mismatch triggers an UNAUTHENTICATED error and an alert.
  • Prevents: A compromised node agent from impersonating a different node.
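
The CN check can be sketched as a suffix match plus an equality test. The {xname}.{site}.lattice.internal format is from this document; the parsing code itself is illustrative, not the quorum's actual implementation:

```rust
// Sketch of CN-vs-node-ID validation on heartbeat: the certificate CN
// must be the reported node ID followed by ".{site}.lattice.internal".
fn cn_matches_node(cn: &str, reported_node_id: &str, site: &str) -> bool {
    let expected_suffix = format!(".{}.lattice.internal", site);
    match cn.strip_suffix(&expected_suffix) {
        // The leading CN component must equal the node ID in the heartbeat.
        Some(xname) => xname == reported_node_id,
        None => false, // wrong site or malformed CN
    }
}

fn main() {
    let cn = "x1000c0s0b0n0.alps.lattice.internal";
    assert!(cn_matches_node(cn, "x1000c0s0b0n0", "alps"));
    // A compromised agent reporting a different node ID is caught.
    assert!(!cn_matches_node(cn, "x1000c0s0b0n1", "alps"));
    // A certificate issued for another site fails the suffix check.
    assert!(!cn_matches_node(cn, "x1000c0s0b0n0", "dome"));
}
```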

Sensitive Session Recording Storage

Attach session recordings for sensitive allocations are stored alongside the audit log:

  • Path: s3://sensitive-audit/{tenant}/{alloc_id}/sessions/{session_id}.recording
  • Format: Raw byte stream (input + output interleaved with timestamps), compressed with zstd
  • Encryption: Encrypted at rest using the sensitive storage pool’s encryption keys
  • Retention: 7 years (matching sensitive audit log retention)
  • Access: Only the claiming user and tenant-admin (compliance reviewer) can access recordings via the audit query API

Audit Signing Key Persistence

The Ed25519 signing key for audit log entries is loaded from a persistent file configured via QuorumConfig.audit_signing_key_path. This ensures:

  • Chain continuity: Archived audit entries (in S3) can be verified after quorum restart
  • Non-repudiation: The same key signs all entries, forming a verifiable chain
  • Key rotation: Replace the file and restart the quorum to rotate (old entries remain verifiable with the old public key)
  • Dev mode: When audit_signing_key_path is not set, a random key is generated (suitable for testing only)
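
The chain-continuity property can be illustrated with a toy hash chain: each entry commits to the previous entry's digest, so editing a committed entry changes every later link. Production Lattice signs entries with Ed25519; this sketch substitutes the standard library's (non-cryptographic) hasher purely to show the chaining structure:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy tamper-evident chain: digest(n) covers digest(n-1) and payload(n).
// NOT cryptographic; real audit entries are Ed25519-signed.
fn link(prev_digest: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev_digest.hash(&mut h); // commit to the previous entry
    payload.hash(&mut h);
    h.finish()
}

fn chain_digests(entries: &[&str]) -> Vec<u64> {
    let mut digests = Vec::new();
    let mut prev = 0u64; // genesis value
    for e in entries {
        prev = link(prev, e);
        digests.push(prev);
    }
    digests
}

fn main() {
    let original = chain_digests(&["claim node-042", "attach session s1"]);
    let tampered = chain_digests(&["claim node-042", "attach session s2"]);
    // The untouched first link matches; the edited second entry diverges,
    // and every digest after it would diverge too.
    assert_eq!(original[0], tampered[0]);
    assert_ne!(original[1], tampered[1]);
}
```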

REST API Authentication

REST and gRPC endpoints require authentication when OIDC or HMAC is configured:

  • Bearer token required in Authorization header (validated on every request)
  • Two validation modes: JWKS (production, via oidc_issuer) or HMAC-SHA256 (dev/testing, via LATTICE_OIDC_HMAC_SECRET)
  • REST middleware validates asynchronously (supports JWKS network fetch on cache miss)
  • gRPC interceptor validates synchronously using cached JWKS keys (pre-fetched at startup) or HMAC
  • Rate limiting applied per-user
  • Public endpoints exempt: /healthz, /api/v1/auth/discovery
  • OIDC discovery client disables HTTP redirects (JWKS cache poisoning prevention)
  • Non-HTTPS issuer URLs produce a warning (MITM risk)
  • Server logs a prominent warning on startup if no authentication is configured

Service Discovery Isolation

Service discovery endpoints (LookupService, ListServices) are tenant-filtered:

  • x-lattice-tenant header constrains results to the requesting tenant’s services
  • Without the header, all services are visible (admin/operator access)
  • Prevents cross-tenant information disclosure of service topology

Session Security

Interactive sessions are tracked globally in Raft state:

  • CreateSession / DeleteSession are Raft-committed operations
  • Sensitive allocations: at most one concurrent session globally (INV-C2)
  • Sessions survive API server restart (persisted in quorum state)
  • Ownership verified: only the allocation’s user can create sessions

Deployment & Bootstrapping

Design Principle

Lattice deploys on bare metal managed by OpenCHAMI. The bootstrap sequence is deterministic: infrastructure first, then control plane, then compute nodes. Each step is idempotent and can be retried. The system can be fully rebuilt from configuration files and Raft snapshots.

Prerequisites

Before deploying Lattice:

| Dependency | Required | Notes |
| --- | --- | --- |
| OpenCHAMI | Yes | Node inventory, BMC discovery, boot service, identity (OPAAL) |
| VAST (or compatible NFS+S3) | Yes | Hot tier storage, QoS API |
| OIDC Provider | Yes | User authentication (institutional IdP) |
| PKI / Certificate Authority | Yes | mTLS certificates for all components |
| Secrets Manager | Yes | API tokens, TLS keys (Vault or equivalent) |
| Time-series database | Yes | VictoriaMetrics, Mimir, or Thanos |
| Slingshot/UE fabric | Yes | Network with VNI support |
| Waldur | Optional | External accounting (feature-flagged) |
| Sovra | Optional | Federation trust (feature-flagged) |

Network Topology

Lattice runs on the high-speed network (HSN — Slingshot/Ultra Ethernet, 200G+). When co-deployed with PACT, the two systems use different networks for clean failure isolation (PACT ADR-017):

| System | Network | Ports | Traffic |
| --- | --- | --- | --- |
| PACT | Management (1G) | gRPC 9443, Raft 9444 | Admin ops, boot overlay, config, shell |
| Lattice | HSN (200G+) | gRPC 50051, Raft 9000, REST 8080 | Scheduling, heartbeats, telemetry, allocation lifecycle |

Node (dual-homed):
├── Management NIC (1G Ethernet)
│   └── pact-agent ←mTLS→ pact-journal:9443
│
├── HSN NIC (200G+ Slingshot/UE)
│   ├── lattice-node-agent ←mTLS→ lattice-quorum:50051
│   └── workload traffic (MPI, NCCL, storage data plane)
│
└── SPIRE agent socket (local, network-agnostic)
    ├── pact-agent obtains SVID → uses on management net
    └── lattice-node-agent obtains SVID → uses on HSN

Configuration: Set bind_network: hsn in quorum and node-agent config (default). This resolves to the HSN interface at startup. In standalone mode without PACT, bind_network: any (default 0.0.0.0) is acceptable.

Failure isolation: Management net down → PACT degraded, lattice unaffected. HSN down → lattice paused, PACT unaffected (admin access works). See specs/failure-modes.md for full matrix.

Bootstrap Sequence

Phase 1: Infrastructure (OpenCHAMI)

1. Deploy OpenCHAMI services:
   - Magellan (BMC discovery)
   - SMD (State Management Daemon)
   - BSS (Boot Script Service)
   - OPAAL (Authentication)
2. Discover nodes via Redfish BMC scan
3. Register node inventory in SMD
4. Prepare boot images:
   - Standard compute image (Linux + node agent)
   - Sensitive hardened image (minimal kernel, SELinux, no SSH)
5. Generate PKI:
   - Site root CA
   - Intermediate CA for OPAAL
   - Pre-provision quorum member certificates

Phase 2: Control Plane

1. Deploy quorum members (3 or 5 nodes, dedicated hardware):
   a. Install lattice-quorum binary
   b. Configure Raft cluster membership
   c. Load TLS certificates (pre-provisioned)
   d. Initialize Raft cluster:
      - First member bootstraps as single-node cluster
      - Additional members join via Raft AddMember
   e. Verify: Raft leader elected, all members healthy

2. Deploy API servers (2+ for redundancy):
   a. Install lattice-api binary
   b. Configure quorum endpoints, TLS, OIDC provider
   c. Place behind load balancer
   d. Health check: /healthz returns 200

3. Deploy vCluster schedulers:
   a. One scheduler instance per vCluster type
   b. Configure cost function weights (from config file or quorum)
   c. Verify: scheduling cycle runs (empty, no nodes yet)

4. Deploy checkpoint broker:
   a. Install lattice-checkpoint binary
   b. Configure quorum and VAST API endpoints

Phase 3: Compute Nodes

1. Configure BSS with standard compute image + cloud-init template:
   - cloud-init installs node agent binary
   - cloud-init generates TLS certificate via OPAAL
   - cloud-init configures quorum endpoint

2. Boot nodes (batch: groups of 50-100):
   - PXE boot → BSS serves image → cloud-init runs → node agent starts

3. Node agent startup:
   a. Generate TLS cert from OPAAL (if not pre-provisioned)
   b. Discover local hardware (GPUs via NVML/ROCm-SMI, NVMe if present, NIC)
   c. Compute conformance fingerprint
   d. Register with quorum (first heartbeat)
   e. Report capabilities and health

4. Quorum auto-discovers nodes from first heartbeat.
   No manual node registration required.

5. Verify: `lattice node list` shows all nodes in Ready state.

Phase 4: Configuration

1. Create tenants:
   lattice admin tenant create --name="physics" --max-nodes=200

2. Create vClusters:
   lattice admin vcluster create --name="hpc-batch" \
     --scheduler=hpc-backfill \
     --tenant=physics \
     --nodes=x1000c0s0b0n[0-199]

3. Configure cost function weights (or use defaults):
   lattice admin vcluster set-weights --name="hpc-batch" \
     --priority=0.20 --wait-time=0.25 --fair-share=0.25 ...

4. (Optional) Configure Waldur accounting:
   lattice admin config set accounting.enabled=true
   lattice admin config set accounting.waldur.api_url="https://..."

5. (Optional) Configure federation:
   lattice admin federation add-peer --endpoint=... --workspace=...

6. Test: submit a test allocation.

Quorum Initialization

First-Time Bootstrap

The first quorum member initializes a new Raft cluster using the --bootstrap flag. This flag must only be passed once — on the very first startup of node 1. All subsequent restarts omit it; the persisted Raft state (WAL + snapshots) is sufficient to rejoin.

# First-ever start of node 1:
lattice-server --config /etc/lattice/server.yaml --bootstrap

# All subsequent restarts (including systemd):
lattice-server --config /etc/lattice/server.yaml

This creates an empty Raft log and elects node 1 as leader.

Adding Members

Subsequent members join the existing cluster:

# On the leader (or any member):
lattice-quorum membership add --node-id=quorum-2 --addr=quorum-2:4001

# On the new member:
lattice-quorum --join=quorum-1:4001 \
  --node-id=quorum-2 \
  --listen=0.0.0.0:4001 \
  --data-dir=/var/lib/lattice/raft

The new member syncs the Raft log from the leader and becomes a follower.

Initial State

A freshly bootstrapped quorum has:

  • Empty node registry (populated when nodes boot)
  • Empty tenant/vCluster configuration (created by admin)
  • Empty sensitive audit log
  • Default system configuration

Disaster Recovery

Raft Snapshot + WAL Recovery

The quorum periodically snapshots its state and writes a WAL (Write-Ahead Log):

/var/lib/lattice/raft/
├── snapshots/
│   ├── snap-000100.bin     # Raft state at log index 100
│   └── snap-000200.bin     # Raft state at log index 200
├── wal/
│   ├── wal-000200-000300   # Log entries 200-300
│   └── wal-000300-000400   # Log entries 300-400
└── metadata.json           # Current term, voted_for, last_applied

Backup: Snapshots are replicated to S3 (configurable interval, default: hourly):

s3://lattice-backup/raft/snap-{timestamp}.bin

Recovery Procedure

If all quorum members are lost:

1. Provision new quorum hardware (3 or 5 nodes)
2. Retrieve latest snapshot from S3:
   aws s3 cp s3://lattice-backup/raft/snap-latest.bin /var/lib/lattice/raft/

3. Bootstrap from snapshot:
   lattice-quorum --recover-from=/var/lib/lattice/raft/snap-latest.bin \
     --node-id=quorum-1 --bootstrap

4. Add remaining quorum members (join the recovered leader)

5. Node agents will reconnect automatically (they retry with backoff)

6. Verify state:
   lattice admin raft status
   lattice node list

Data loss window: from the last snapshot to the failure. With hourly snapshots, at most one hour of Raft commits can be lost. In practice, quorum state changes such as node ownership updates happen only at scheduling cycles and are infrequent, so the practical loss is small.

Partial Quorum Loss

If a minority of quorum members fail (1 of 3, or 2 of 5):

  1. The cluster continues operating (Raft majority maintained)
  2. Replace failed members via Raft membership change:
    lattice-quorum membership remove --node-id=quorum-2
    lattice-quorum membership add --node-id=quorum-2-new --addr=...
    
  3. New member syncs from leader automatically
  4. No data loss, no downtime

Non-Raft State Backup

The Raft snapshot captures quorum state (node ownership, tenants, sensitive audit). Other stateful components require separate backup strategies:

| Component | State Location | Backup Strategy |
| --- | --- | --- |
| TSDB (metrics) | VictoriaMetrics / Thanos | TSDB-native snapshot + S3 replication |
| S3 logs | s3://{tenant}/{project}/{alloc_id}/logs/ | S3 bucket versioning + cross-region replication |
| Accounting WAL | /var/lib/lattice/accounting-wal | Include in node backup or replicate to S3 |
| Sensitive audit log | Raft state (primary) + S3 archive (cold) | Covered by Raft snapshot; S3 archive has its own retention |
| Grafana dashboards | infra/grafana/ (version-controlled) | Git repository |

Recommended schedule: Daily backup verification for TSDB snapshots. Accounting WAL backed up on the same schedule as Raft snapshots.

Quorum Hardware Replacement

When a quorum member’s hardware fails and must be replaced:

  1. Remove the failed member from the Raft cluster:

    lattice-quorum membership remove --node-id=quorum-2
    

    The cluster continues operating with the remaining majority.

  2. Provision new hardware:

    • Install the same OS and lattice-quorum binary
    • Generate a new TLS certificate from the site CA (same CN format)
    • Configure the same data directory path
  3. Add the new member to the cluster:

    # On an existing member:
    lattice-quorum membership add --node-id=quorum-2-new --addr=new-host:4001
    
    # On the new hardware:
    lattice-quorum --join=quorum-1:4001 \
      --node-id=quorum-2-new \
      --listen=0.0.0.0:4001 \
      --data-dir=/var/lib/lattice/raft
    
  4. Verify: The new member syncs the full Raft log from the leader. Check with lattice admin raft status.

  5. Cleanup: Remove old member’s data directory from failed hardware (if recoverable). Update monitoring/alerting to reference the new member.

Important: Replace one member at a time. Wait for the new member to fully sync before replacing another. For a 3-member quorum, never have more than 1 member down simultaneously.

Configuration Management

All configuration is stored in two places:

| Configuration | Storage | Update Mechanism |
| --- | --- | --- |
| Raft cluster membership | Raft log | Membership change commands |
| Tenant/vCluster definitions | Raft state machine | API calls (Raft-committed) |
| Cost function weights | Raft state machine | Hot-reloadable via API |
| Component config (listen addr, TLS paths) | Local config files | Restart required |
| Node agent config | cloud-init template | Reboot to apply changes |

Config files are version-controlled alongside deployment manifests. Changes to Raft-stored configuration are applied via API and take effect immediately.

Capacity Planning

| Cluster Size | Quorum Members | API Servers | Scheduler Instances | Quorum Hardware |
|---|---|---|---|---|
| < 100 nodes | 3 | 2 | 1 per vCluster type | 4 CPU, 16 GB RAM, 100 GB SSD |
| 100-1000 nodes | 3 | 3 | 1 per vCluster type | 8 CPU, 32 GB RAM, 200 GB SSD |
| 1000-5000 nodes | 5 | 5 | 2 per vCluster type | 16 CPU, 64 GB RAM, 500 GB SSD |
| 5000+ nodes | 5 | 5+ (behind LB) | 2+ per vCluster type | 32 CPU, 128 GB RAM, 1 TB SSD |

Quorum hardware notes: Quorum members are latency-sensitive (Raft commits). Dedicated NVMe SSD for WAL. Not co-located with compute workloads. Prefer separate hardware or at minimum separate failure domains.

Backup Verification

Snapshots replicated to S3 should be verified periodically to ensure they are restorable:

# Verify the latest snapshot is readable and consistent
lattice admin backup verify --source=s3://lattice-backup/raft/snap-latest.bin

# Verify a specific snapshot
lattice admin backup verify --source=s3://lattice-backup/raft/snap-20260301T120000.bin

Verification checks:

  • Snapshot file integrity (checksum match)
  • Raft metadata consistency (term, index, membership)
  • Deserialization of state machine (all entries parseable)

Recommended schedule: Weekly automated verification via cron or CI pipeline. Alert on failure.

Snapshot Retention Policy

Local snapshots are retained on quorum member disks:

  • Keep the last 5 snapshots (default, configurable: raft.snapshot_retention_count)
  • Older snapshots are deleted after a new snapshot is confirmed written

S3 snapshots follow a lifecycle policy:

  • Keep all snapshots for 7 days (hourly granularity)
  • After 7 days: keep one snapshot per day for 30 days
  • After 30 days: keep one snapshot per week for 90 days
  • After 90 days: delete (unless sensitive audit retention requires longer)

Configure via S3 lifecycle rules on the lattice-backup bucket.

Component Log Management

Lattice components log to stdout/stderr by default, managed by the system’s init system (systemd journald or equivalent).

Recommended log rotation:

| Component | Log Volume | Rotation |
|---|---|---|
| Quorum members | Low (Raft events, membership changes) | journald default (rotate at 4 GB or 1 month) |
| API servers | Medium (request logs, access logs) | journald or file rotation (rotate at 1 GB, keep 7 files) |
| vCluster schedulers | Low-Medium (scheduling cycle logs) | journald default |
| Node agents | Low per-node (heartbeats, allocation lifecycle) | journald default |
| Checkpoint broker | Low (checkpoint decisions) | journald default |

For centralized log collection, configure journald to forward to a log aggregator (e.g., Loki, Elasticsearch) via systemd-journal-remote or a sidecar agent.

Structured logging: All components emit JSON-formatted logs with fields: timestamp, level, component, message, and context-specific fields (e.g., allocation_id, node_id).

Test/Dev Deployment (GCP)

For integration testing without bare metal, use the GCP test infrastructure:

infra/gcp/
├── terraform/main.tf           # 3 quorum + 2 compute + registry + TSDB
└── packer/lattice-compute.pkr.hcl  # Pre-baked image with podman + squashfs-tools
scripts/deploy/
├── make-provision-bundle.sh    # Single tarball: binaries + scripts + systemd units
├── install-quorum.sh           # Reusable, no GCP-specific logic
├── install-compute.sh          # Reusable, HMAC token generation
└── validate.sh                 # Structured test runner (15 tests)

Workflow:

  1. packer build — create compute image (once)
  2. terraform apply — provision VMs
  3. make-provision-bundle.sh — package release
  4. scp the bundle to each node, then run install-quorum.sh (node 1 with --bootstrap) and install-compute.sh
  5. validate.sh — run test matrix
  6. terraform destroy — manual cleanup

The deploy scripts are reusable on-prem — no GCP-specific logic in install-*.sh.

Cross-References

Failure Modes and Recovery

Design Principle

Fail-safe defaults. Running allocations survive component failures. Modeled after Slurm’s proven failure patterns, mapped to Lattice’s distributed architecture: requeue on node failure, state recovery on controller restart, running jobs unaffected by control plane restarts.

Component Failures

Quorum Member Loss

Detection: Raft heartbeat timeout (default: 500ms).

Recovery: Raft tolerates minority failure. A 3-member quorum tolerates 1 failure; a 5-member quorum tolerates 2. The remaining majority continues serving reads and commits. No scheduling disruption.

Action: Alert ops. Replace the failed member via a Raft membership change (remove the failed member, then add its replacement — see Quorum Hardware Replacement). No data loss — the Raft log is replicated.

Quorum Leader Loss

Detection: Raft follower timeout triggers leader election.

Recovery: New leader elected within seconds (typically 1-3s depending on election timeout configuration). In-flight proposals that were not committed are retried by the proposing vCluster scheduler on the next scheduling cycle.

Data loss risk: None. Uncommitted proposals are re-proposed. Committed state is durable.

Complete Quorum Loss

Detection: All quorum members unreachable. API server returns unavailable.

Recovery: Restore from most recent Raft snapshot + WAL replay (analogous to slurmctld --recover). The latest snapshot is stored on persistent storage (local SSD + replicated to S3). Recovery restores node ownership and sensitive audit state to the last committed entry.

Impact during outage: No new allocations can be scheduled (proposals cannot be committed). Running allocations continue — node agents operate autonomously. Node agents buffer heartbeats and replay on quorum recovery.

Node Agent Crash

Detection: Heartbeat timeout (default: 30s) followed by grace period (default: 60s). Total time to Down transition: ~90s. Analogous to Slurm’s SlurmdTimeout.

Recovery:

  1. Quorum marks node as Degraded after first missed heartbeat
  2. After grace period (default: 60s), node transitions to Down
  3. Allocations on the node are requeued (if requeue policy allows) or marked Failed
  4. Node agent restarts → loads persisted state from /var/lib/lattice/agent-state.json → reattaches to surviving workload processes (PID liveness check via kill(pid, 0)) → cleans up orphaned cgroups → re-registers with quorum → health check → re-enters scheduling pool

Workloads survive agent restart because the systemd unit uses KillMode=process (only the agent process is killed, not children in their own cgroup scopes).
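For reference, a sketch of the relevant unit-file lines (the unit name and binary path are assumptions; only KillMode=process is load-bearing here):

```ini
[Service]
ExecStart=/usr/local/bin/lattice-node-agent
# Only the agent's own main process is signalled on stop/restart;
# workload processes live in separate cgroup scopes and keep running.
KillMode=process
Restart=on-failure
```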

Sensitive nodes: Longer grace period (default: 5 minutes) to avoid false positives from transient issues. Sensitive allocations are never automatically requeued — operator intervention required.

Node Hardware Failure

Detection: Dual-path: heartbeat timeout (node agent) + OpenCHAMI Redfish BMC polling (out-of-band).

Recovery: Same as agent crash, but OpenCHAMI can detect hardware failures (PSU, memory ECC uncorrectable, GPU fallen off bus) before heartbeat timeout. BMC-detected failures trigger immediate Down transition, skipping the grace period.

vCluster Scheduler Crash

Detection: Health check failure (liveness probe).

Recovery: vCluster schedulers are stateless — they read pending allocations and node state from the quorum on each scheduling cycle. Restart from quorum state. No scheduling occurs for this vCluster during downtime, but running allocations continue unaffected (like slurmctld crash: running jobs are fine).

Data loss risk: None. Pending allocations are persisted in the quorum.

API Server Crash

Detection: Load balancer health check / liveness probe.

Recovery: API servers are stateless. Restart and resume serving. Multiple API server replicas behind a load balancer provide redundancy. Client retries with exponential backoff. No job loss.

Checkpoint Broker Crash

Detection: Health check failure.

Recovery: Pending checkpoint requests are lost (they were in-memory). On restart, the broker re-evaluates all running allocations against the checkpoint cost model. Allocations that should have been checkpointed will be identified on the next evaluation cycle.

Data loss risk: Minimal. At worst, one evaluation cycle’s worth of checkpoint decisions are delayed. No allocation data is lost.

Infrastructure Failures

Network Partition: Node ↔ Quorum

Detection: Heartbeat timeout on the quorum side; connection failure on the node side.

Recovery:

  • Quorum side: nodes marked unreachable → Degraded → Down after grace period. Allocations requeued.
  • Node side: node agent continues running allocations autonomously. Buffers heartbeats and state updates. When connectivity restores, replays buffered state to quorum.
  • If partition heals before grace period: node returns to Ready, no allocation disruption.

Sensitive: Extended grace period (5 minutes). Network partitions are logged as audit events.

Network Partition: Quorum Split-Brain

Detection: Raft protocol prevents split-brain by design.

Recovery: The minority partition cannot achieve quorum and therefore cannot commit any proposals. The majority partition continues operating normally. When the partition heals, the minority members catch up via Raft log replication. No divergent state is possible.

Storage Unavailability (VAST Down)

Detection: Failed VAST API calls / NFS mount timeouts.

Impact:

  • Data staging for new allocations pauses (cannot pre-stage input data)
  • Running allocations with data already mounted continue (local NVMe cache, if present, persists)
  • Checkpoint writes fail → broker pauses checkpoint scheduling
  • New allocation proposals that require data staging are held in queue

Recovery: Automatic retry with backoff. Alert raised. Staging resumes when VAST recovers. On nodes with NVMe cache, locally cached data persists through storage outage.

OpenCHAMI Unavailable

Detection: Failed API calls to OpenCHAMI endpoints.

Impact:

  • Node boot/reimaging blocked (cannot provision new nodes)
  • Node wipe-on-release blocked (sensitive nodes held in quarantine state)
  • Running allocations unaffected
  • Scheduling of new allocations to already-booted nodes continues normally

Recovery: Operations that require OpenCHAMI are queued and retried. Alert raised.

Allocation-Level Failures

Prologue Failure (uenv Pull/Mount)

Detection: Node agent reports prologue error to quorum.

Recovery:

  1. Node drained for this allocation (other allocations on the node unaffected)
  2. Allocation retried on different nodes (analogous to Slurm PrologSlurmctld failure)
  3. Max retries configurable (default: 3)
  4. After max retries: allocation moves to Failed state, user notified

Common causes: Corrupted uenv image (hash mismatch), local cache full (if NVMe present), registry unavailable.

Application Crash

Detection: Node agent detects process exit with non-zero status.

Recovery:

  • Allocation moves to Failed state
  • Nodes released back to scheduling pool
  • If allocation has requeue: on_node_failure or requeue: always: re-enter queue
  • DAG dependencies evaluated (cross-ref: dag-scheduling.md)

Walltime Exceeded

Detection: Node agent timer.

Recovery:

  1. SIGTERM sent to all processes in the allocation
  2. Grace period (default: 30s) for clean shutdown
  3. SIGKILL if processes still running after grace period
  4. Nodes released
  5. Allocation marked as Failed with reason walltime_exceeded

Walltime Exceeded During Checkpoint

If an allocation’s walltime expires while a checkpoint is in progress:

  1. Walltime takes priority. The walltime timer is not extended to accommodate an in-progress checkpoint.
  2. SIGTERM is sent as normal. If the checkpoint completes within the SIGTERM grace period (default: 30s), the checkpoint is usable and the allocation is marked Suspended (can be resumed).
  3. If the checkpoint does not complete within the grace period, SIGKILL is sent. The incomplete checkpoint is discarded and the allocation is marked Failed with reason walltime_exceeded.
  4. The checkpoint broker tracks this race condition via the lattice_checkpoint_walltime_conflict_total counter metric.

Recovery Matrix

| Failure | Detection | Recovery Action | Data Loss Risk |
|---|---|---|---|
| Quorum member loss | Raft heartbeat | Leader election, continue | None |
| Quorum leader loss | Raft timeout | New election (1-3s) | None (uncommitted retried) |
| Complete quorum loss | All members down | Snapshot + WAL recovery | None (last committed state) |
| Node agent crash | Heartbeat timeout (30s) + grace (60s) | Degraded → Down → requeue | Running allocation output since last checkpoint |
| Node hardware failure | BMC + heartbeat | Immediate Down → requeue | Running allocation output since last checkpoint |
| vCluster scheduler crash | Health check | Stateless restart | None |
| API server crash | Health check | Stateless restart | None |
| Checkpoint broker crash | Health check | Restart, re-evaluate | Delayed checkpoint decisions |
| Network partition (node) | Heartbeat timeout | Grace period → requeue | None if heals in time |
| Network partition (quorum) | Raft protocol | Minority stalls, majority continues | None |
| VAST down | API timeout | Queue staging, continue running | None |
| OpenCHAMI down | API timeout | Queue provisioning ops | None |
| Prologue failure | Agent report | Retry on different nodes | None |
| Application crash | Process exit | Release nodes, optional requeue | Application-dependent |
| Walltime exceeded | Agent timer | SIGTERM → SIGKILL → release | Unsaved work |

Allocation Requeue Policy

Configurable per allocation at submission time:

| Policy | Behavior |
|---|---|
| never | Allocation fails permanently on any node failure. Default for interactive sessions. |
| on_node_failure | Requeue only when the failure is node-side (hardware, agent crash, network partition). Default for batch allocations. |
| always | Requeue on any failure including application crash. Use with caution — can cause infinite loops for buggy applications. |

Max requeue count: Default 3. Configurable per allocation (max 100, validated at submission). After max requeues, allocation transitions to Failed regardless of policy. Requeue uses optimistic concurrency (expected_requeue_count) to prevent double-increment from concurrent reconcilers.

Requeue behavior: Requeued allocations retain their original submission time for fair-share and wait-time calculations (no queue-jumping penalty, no starvation). Just-requeued allocations are excluded from the pending set in the same scheduler cycle (TOCTOU prevention).

Service Failure Detection (Liveness Probes)

For Unbounded and Reactive allocations with a liveness_probe configured:

  1. Node agent runs the probe periodically (TCP connect or HTTP GET)
  2. Consecutive failures tracked by ProbeManager (per-allocation counter)
  3. Threshold exceeded → allocation marked Failed by node agent
  4. Reconciler detects Failed service → requeues per policy (if not at max_requeue)
  5. Scheduler re-places the allocation on available nodes

Timeline: initial_delay (default 10s) → periodic probes (default 30s) → failure_threshold (default 3) → Failed → next scheduler cycle requeues.

Service Registry Failure

If the service registry becomes inconsistent (e.g., allocation completes but endpoint not deregistered):

  • Registry is part of Raft state machine — same consistency guarantees as node ownership
  • Endpoint registration/deregistration happens atomically in update_allocation_state() handler
  • Deregistration also occurs in requeue_allocation() handler
  • Empty service entries are cleaned up automatically

Cross-References

Upgrades and Rollouts

Design Principle

Zero-downtime upgrades. No running allocation is disrupted by an upgrade. Components are upgraded independently. Protocol backward compatibility ensures mixed-version operation during rolling upgrades.

Protocol Versioning

All gRPC services are versioned (lattice.v1.*):

  • New fields are additive (backward compatible within a major version)
  • Breaking changes require a new version (lattice.v2.*)
  • During rolling upgrades, node agents and quorum members must support both version N and N-1
  • Version negotiation on connection establishment: components advertise supported versions, use the highest common version

Upgrade Order

Components are upgraded in dependency order, from leaf to core:

1. Node agents (rolling, batched)
2. vCluster schedulers (rolling)
3. API servers (rolling)
4. Quorum members (Raft rolling membership change, one at a time)

This order ensures that core components (quorum) speak the old protocol until all clients (node agents, schedulers) are upgraded. The quorum is upgraded last because it’s the most critical and the hardest to roll back.

Node Agent Rolling Upgrade

Procedure

For each batch of nodes:

  1. Drain: Stop scheduling new allocations to the node. Node enters Draining state. If no allocations are running, it transitions directly to Drained.
  2. Wait: Running allocations complete naturally. The scheduler loop transitions the node from Draining to Drained once all allocations finish. For urgent upgrades: checkpoint running allocations and migrate (cross-ref: checkpoint-broker.md).
  3. Upgrade: Replace node agent binary while node is Drained. Configuration is preserved.
  4. Restart: Node agent starts, re-registers with quorum using new protocol version.
  5. Health check: Node passes health check (heartbeat, GPU detection, network test).
  6. Undrain: Operator runs undrain. Node transitions from Drained to Ready and is available for scheduling.

Canary Strategy

  1. Upgrade 1-2 nodes first (canary set)
  2. Monitor canary nodes for the observation window (default: 15 minutes):
    • Scheduling cycle latency within SLO (cross-ref: telemetry.md scheduler self-monitoring)
    • No increase in allocation failures on canary nodes
    • Heartbeat latency stable
    • Node health check pass rate = 100%
  3. If canary passes: proceed with rolling batches (batch size configurable, default: 5% of nodes)
  4. If canary fails: stop rollout, revert canary nodes (see Rollback below)

Batch Sizing

| Cluster Size | Canary Size | Batch Size | Total Batches |
|---|---|---|---|
| < 50 nodes | 1 node | 5 nodes | ~10 |
| 50-500 nodes | 2 nodes | 25 nodes | ~20 |
| 500+ nodes | 5 nodes | 50 nodes | varies |

vCluster Scheduler Rolling Upgrade

Schedulers are stateless — they read state from the quorum each cycle:

  1. Stop scheduler instance
  2. Upgrade binary
  3. Restart
  4. Verify: scheduling cycle completes successfully, proposals accepted by quorum

During scheduler downtime, the affected vCluster pauses scheduling (no new allocations). Running allocations are unaffected. Multiple scheduler replicas (if deployed) provide continuity.

API Server Rolling Upgrade

API servers are stateless, behind a load balancer:

  1. Remove instance from load balancer
  2. Drain active connections (grace period: 30s)
  3. Upgrade binary
  4. Restart
  5. Health check passes → re-add to load balancer

Client impact: brief connection reset for long-lived streams (StreamMetrics, StreamLogs). Clients reconnect automatically.

Quorum Rolling Upgrade

The most sensitive upgrade. One member at a time, maintaining quorum majority throughout:

3-Member Quorum

  1. Upgrade follower A: remove from Raft group → upgrade → re-add
  2. Wait for follower A to catch up (Raft log sync)
  3. Upgrade follower B: remove → upgrade → re-add
  4. Wait for follower B to catch up
  5. Trigger leader transfer to an upgraded follower
  6. Upgrade old leader: remove → upgrade → re-add

Constraint: Never more than 1 member down simultaneously (2/3 majority required).

5-Member Quorum

Same procedure but can upgrade 2 followers in parallel (3/5 majority maintained):

  1. Upgrade followers A and B in parallel
  2. Wait for catch-up
  3. Upgrade followers C and D in parallel
  4. Wait for catch-up
  5. Leader transfer → upgrade old leader

Constraint: Never more than 2 members down simultaneously (3/5 majority required).

Quorum Upgrade Verification

After each member upgrade:

  • Raft log replication is current (no lag)
  • Commit latency within SLO (< 5s)
  • Leader election succeeds if triggered
  • All node ownership state is consistent

Canary Criteria

Metrics from scheduler self-monitoring (cross-ref: telemetry.md) that gate rollout progression:

| Metric | Threshold | Severity |
|---|---|---|
| lattice_scheduling_cycle_duration_seconds | p99 < 30s | Warning: pause rollout |
| lattice_scheduling_proposals_total{result="rejected"} | No increase > 10% | Warning: pause rollout |
| lattice_agent_heartbeat_latency_seconds | p99 < 5s | Warning: pause rollout |
| lattice_raft_commit_latency_seconds | p99 < 5s | Critical: stop rollout |
| lattice_api_requests_total{status="5xx"} | No increase > 5% | Warning: pause rollout |
| Allocation failure rate | No increase | Critical: stop rollout |

Rollback

Node Agent Rollback

  1. Drain canary/failed nodes
  2. Replace binary with previous version
  3. Restart
  4. Verify old-version operation
  5. Protocol backward compatibility ensures the rolled-back agent works with the rest of the cluster

Scheduler/API Rollback

Stateless — replace binary and restart.

Quorum Rollback

  1. Remove new-version member from Raft group
  2. Add old-version member back
  3. Protocol backward compatibility ensures mixed-version operation during the transition

Rollback is always safe because N-1 protocol support is maintained throughout the upgrade window.

Configuration Hot-Reload

Not all changes require a binary upgrade. Configuration changes that can be hot-reloaded via quorum without restart:

| Change | Hot-Reloadable | Mechanism |
|---|---|---|
| Cost function weights | Yes | Quorum config update, schedulers pick up next cycle |
| vCluster policies | Yes | Quorum config update |
| Telemetry mode (prod/debug/audit) | Yes | API call to node agent |
| Tenant quotas | Yes | Quorum config update |
| Node drain/undrain | Yes | API call |
| Protocol version | No | Binary upgrade required |
| Raft cluster size | No | Membership change (safe, but not hot-reload) |

Cross-References

Testing Strategy

Design Principle

Scheduler correctness is non-negotiable. The testing strategy covers four levels: unit tests for individual functions, integration tests for component interactions, simulation tests for scheduling behavior, and chaos tests for fault tolerance. Every level must pass before a release.

Test Levels

┌─────────────────────────────────────────────────┐
│ Level 4: Chaos Tests (fault injection)          │
│   Raft leader loss, network partitions,         │
│   node failures, storage unavailability         │
├─────────────────────────────────────────────────┤
│ Level 3: Simulation (RM-Replay)                 │
│   Production workload replay, weight tuning,    │
│   fairness validation, SLO compliance           │
├─────────────────────────────────────────────────┤
│ Level 2: Integration Tests                      │
│   Multi-component scenarios, API contracts,     │
│   end-to-end allocation lifecycle               │
├─────────────────────────────────────────────────┤
│ Level 1: Unit Tests                             │
│   Cost function, topology solver, state machine,│
│   protobuf serialization, error handling        │
└─────────────────────────────────────────────────┘

Level 1: Unit Tests

In-module tests (#[cfg(test)]), run via cargo test.

Critical Paths

| Crate | What to Test | Example |
|---|---|---|
| lattice-scheduler | Cost function components (f₁-f₉) | Given inputs, verify score output |
| lattice-scheduler | Knapsack solver | Given nodes and allocations, verify placement |
| lattice-scheduler | Topology packing | Given groups and node count, verify group selection |
| lattice-scheduler | Conformance group selection | Given fingerprints, verify grouping |
| lattice-quorum | Raft proposal validation | Hard quota rejection, ownership conflict |
| lattice-quorum | State machine transitions | Node state changes, allocation lifecycle |
| lattice-common | Type serialization/deserialization | Protobuf round-trip for all types |
| lattice-common | Allocation state machine | Valid and invalid state transitions |
| lattice-api | Request validation | Reject invalid allocations (cycles in DAG, bad constraints) |
| lattice-api | SBATCH directive parsing | Translate Slurm directives to Intent API |
| lattice-checkpoint | Cost model evaluation | Given metrics, verify checkpoint decision |
| lattice-cli | Argument parsing | Flag combinations, error messages |

Property-Based Tests

Use proptest for property-based testing of the cost function and solver:

  • Cost function monotonicity: Increasing wait time always increases f₂
  • Fair share bounds: f₃ always in [0, 1]
  • Solver validity: Every placement returned by the solver satisfies all constraints
  • Topology packing: Solver never spans more groups than necessary
  • State machine: No invalid state transitions accepted

Level 2: Integration Tests

In tests/ directories, using real components with mock external dependencies.

Test Harness

A test harness that spins up:

  • In-memory Raft cluster (3 members, using openraft test utilities)
  • Mock node agents (report capabilities, respond to heartbeats)
  • Mock VAST API (storage queries return configurable responses)
  • Real scheduler instances
  • Real API server (in-process)

Scenarios

| Scenario | What It Tests |
|---|---|
| Submit → Schedule → Complete | Full allocation lifecycle through all components |
| DAG submission | Multi-allocation workflow with dependency resolution |
| Preemption | Higher-priority allocation preempts lower-priority |
| Elastic borrowing | vCluster borrows and returns nodes |
| Quota rejection | Hard quota exceeded → proposal rejected |
| Sensitive claim | Node claim, audit logging, wipe on release |
| Session lifecycle | Session create → terminal → disconnect → cleanup |
| Rolling upgrade simulation | Mixed-version node agents, protocol negotiation |
| Conformance drift | Node fingerprint changes → scheduling impact |
| Reactive scaling | Metric threshold triggers scale-up/down |

API Contract Tests

For every API endpoint, test:

  • Valid request → expected response
  • Invalid request → appropriate error code and message
  • Authorization: user sees own allocations only, tenant-admin sees tenant, system-admin sees all
  • Rate limiting: exceeded rate → 429 with Retry-After header

Protobuf Compatibility

Test backward compatibility:

  • Deserialize messages from previous version with new code (additive fields)
  • Deserialize messages from new version with old code (unknown fields ignored)

Level 3: Simulation (RM-Replay)

Purpose

RM-Replay replays production workload traces through the scheduler to validate scheduling behavior without risking production. Essential for:

  • Tuning cost function weights before deployment
  • Validating fairness across tenants
  • Regression testing after scheduler changes

Workflow

1. Capture: Record production workload traces
   - Allocation submissions (arrival time, resources, constraints, tenant)
   - Allocation completions (duration, exit status)
   - Node inventory (capabilities, topology)

2. Configure: Set cost function weights and vCluster policies

3. Replay: Feed traces through lattice-scheduler in simulation mode
   - No real nodes or quorum — mock environment
   - Simulated time (runs in seconds, not hours)
   - Deterministic (same trace + same weights = same result)

4. Evaluate: Measure scheduling outcomes
   - Utilization: fraction of GPU-hours used
   - Wait time: p50, p95, p99 queue wait per priority class
   - Fairness: actual share vs. target share per tenant (Jain's fairness index)
   - Backfill effectiveness: percentage of idle slots filled
   - SLO compliance: percentage of allocations meeting target wait time
   - Preemption rate: preemptions per hour

5. Iterate: Adjust weights, re-run, compare

Regression Suite

Maintain a library of representative workload traces:

| Trace | Description | Key Metric |
|---|---|---|
| steady-state.trace | Normal mixed workload (HPC + ML + services) | Utilization > 85% |
| burst.trace | Sudden spike in submissions | No starvation (p99 wait < 4h) |
| unfair.trace | One tenant submits heavily | Fair share deviation < 10% |
| sensitive-claim.trace | Sensitive claims interleaved with HPC | Sensitive wait = 0 (immediate) |
| preemption-heavy.trace | Many priority inversions | Checkpoint success rate > 95% |
| empty-to-full.trace | Cluster goes from idle to full | Ramp-up time, scheduling cycle latency |

Each trace has a pass/fail threshold for key metrics. CI runs the regression suite on every scheduler change.

Level 4: Chaos Tests

Fault injection tests that validate the failure modes documented in failure-modes.md.

Fault Injection Framework

Use a test harness that can inject faults at configurable times:

| Fault | Injection Method | Validates |
|---|---|---|
| Raft leader kill | Stop leader process | Leader election, in-flight proposal retry |
| Raft member kill | Stop follower process | Continued operation with minority loss |
| Network partition (node↔quorum) | Drop heartbeats | Degraded → Down transition, allocation requeue |
| Network partition (quorum split) | Partition Raft members | Minority stalls, majority continues |
| Node agent crash | Kill agent process | Heartbeat timeout, allocation requeue |
| Storage unavailability | Mock VAST returns errors | Staging pauses, running allocations continue |
| Checkpoint timeout | Application ignores checkpoint hint | Forced preemption after timeout |
| API server crash | Kill API server | Client retry, no state loss |
| Quorum snapshot corruption | Corrupt snapshot file | Recovery from previous valid snapshot |

Chaos Test Scenarios

| Scenario | Steps | Expected Outcome |
|---|---|---|
| Leader election under load | Submit 50 allocations, kill leader mid-cycle | New leader elected < 5s, no proposals lost, all allocations eventually scheduled |
| Node failure with requeue | Start 10 allocations, kill 2 node agents | Allocations requeued, rescheduled on healthy nodes, total delay < 2 min |
| Split-brain prevention | Partition 3-member quorum into 1+2 | Minority (1) cannot commit, majority (2) continues, no divergent state |
| Cascade failure | Kill 3 node agents simultaneously | Allocations on all 3 nodes requeued, scheduling continues for remaining nodes |
| Sensitive node failure | Kill sensitive node agent | Extended grace period, operator alert, no auto-requeue |
| Recovery from full quorum loss | Kill all quorum members, restore from snapshot | State restored, node agents reconnect, scheduling resumes |

Execution

Chaos tests run in CI on a dedicated stage (not on every commit):

  • Nightly: full chaos suite
  • On release branch: full chaos suite must pass

Performance Benchmarks

Scheduling Cycle Latency

| Benchmark | Configuration | Target |
|---|---|---|
| 100 pending allocations, 1000 nodes | HPC backfill | Cycle < 5s |
| 500 pending allocations, 5000 nodes | HPC backfill | Cycle < 15s |
| 1000 pending allocations, 10000 nodes | HPC backfill | Cycle < 30s |
| Raft commit (single proposal) | 3-member quorum | p99 < 50ms |
| Raft commit (single proposal) | 5-member quorum | p99 < 100ms |

Load Tests

| Test | Description | Target |
|---|---|---|
| API throughput | Concurrent submission requests | > 1000 req/s |
| Heartbeat load | 10000 node agents reporting | < 1% CPU on quorum |
| Log streaming | 100 concurrent log streams | < 5% CPU on API server |

CI Pipeline

On every commit:
  cargo fmt --check
  cargo clippy --all-targets
  cargo test (Level 1: unit tests)

On every PR:
  Level 1 + Level 2 (integration tests)
  Protobuf backward compatibility check

Nightly:
  Level 1 + Level 2 + Level 3 (RM-Replay regression) + Level 4 (chaos)
  Performance benchmarks (track regressions)

On release:
  All levels must pass
  Performance benchmarks must meet targets

Cross-References

DAG Scheduling

Design Principle

DAGs are first-class workflow primitives. The scheduler resolves dependencies; users declare intent. Dependency semantics are Slurm-compatible (afterok, afternotok, afterany, aftercorr) to ease migration.

DAG Submission

A DAG is a set of allocation specs with dependency edges, submitted as a single unit via the Intent API:

DagSpec {
    allocations: Vec<AllocationSpec>,  // each spec has an id and depends_on fields
}

Dependencies are expressed inline via each AllocationSpec.depends_on field (a list of DependencySpec with ref_id and condition), not as separate edge objects. This matches the protobuf definition in proto/lattice/v1/allocations.proto.

Dependency Conditions

Defined in crates/lattice-common/src/types.rs (DependencyCondition enum):

| Condition | Slurm Equivalent | Semantics |
|---|---|---|
| Success | afterok | Successor runs only if predecessor exits 0 |
| Failure | afternotok | Successor runs only if predecessor exits non-zero |
| Any | afterany | Successor runs regardless of predecessor's exit status |
| Corresponding | aftercorr | Task group: array element N depends on predecessor's element N |
| Mutex | singleton | Only one allocation with this mutex name runs at a time |
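The conditions above can be sketched as a Rust enum. This is an illustrative sketch, not the actual `DependencyCondition` definition in `crates/lattice-common/src/types.rs`; the `slurm_alias` and `satisfied_by_exit` helpers are assumptions for migration tooling and edge evaluation:

```rust
/// Illustrative sketch of the dependency conditions and their Slurm aliases.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DependencyCondition {
    Success,
    Failure,
    Any,
    Corresponding,
    Mutex,
}

impl DependencyCondition {
    /// Slurm-compatible name, to ease migration of existing job scripts.
    pub fn slurm_alias(self) -> &'static str {
        match self {
            DependencyCondition::Success => "afterok",
            DependencyCondition::Failure => "afternotok",
            DependencyCondition::Any => "afterany",
            DependencyCondition::Corresponding => "aftercorr",
            DependencyCondition::Mutex => "singleton",
        }
    }

    /// Whether an edge with this condition is satisfied by a predecessor
    /// that exited with the given status code.
    pub fn satisfied_by_exit(self, exit_code: i32) -> bool {
        match self {
            DependencyCondition::Success => exit_code == 0,
            DependencyCondition::Failure => exit_code != 0,
            DependencyCondition::Any => true,
            // Corresponding is evaluated per array index, but each matched
            // edge still requires success; Mutex is a scheduling constraint,
            // not an exit-status condition.
            DependencyCondition::Corresponding => exit_code == 0,
            DependencyCondition::Mutex => true,
        }
    }
}
```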

DAG Lifecycle

1. Submission and Validation

  • User submits DagSpec via POST /v1/dags or lattice dag submit
  • lattice-api validates the graph:
    • No cycles (topological sort must succeed)
    • All depends_on.ref_id values reference allocation IDs within the DAG
    • All allocation specs individually valid
  • DAG receives a unique dag_id
  • Individual allocations receive allocation_id values and are tagged with dag_id
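The validation steps can be sketched with Kahn's algorithm: check every `ref_id` against the DAG's own allocation IDs, then verify a topological sort visits every node. `AllocSpec` and `validate_dag` here are simplified stand-ins for the protobuf `AllocationSpec` and the lattice-api validator, not the real code:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Simplified stand-in for AllocationSpec: an id plus its depends_on ref_ids.
pub struct AllocSpec {
    pub id: String,
    pub depends_on: Vec<String>,
}

pub fn validate_dag(specs: &[AllocSpec]) -> Result<(), String> {
    let ids: HashSet<&str> = specs.iter().map(|s| s.id.as_str()).collect();

    // 1. Every ref_id must name an allocation inside this DAG.
    for s in specs {
        for dep in &s.depends_on {
            if !ids.contains(dep.as_str()) {
                return Err(format!("{}: unknown ref_id {}", s.id, dep));
            }
        }
    }

    // 2. Kahn's algorithm: repeatedly release nodes with in-degree 0.
    let mut indegree: HashMap<&str, usize> = specs
        .iter()
        .map(|s| (s.id.as_str(), s.depends_on.len()))
        .collect();
    let mut successors: HashMap<&str, Vec<&str>> = HashMap::new();
    for s in specs {
        for dep in &s.depends_on {
            successors.entry(dep.as_str()).or_default().push(s.id.as_str());
        }
    }
    let mut queue: VecDeque<&str> = indegree
        .iter()
        .filter(|(_, &d)| d == 0)
        .map(|(&id, _)| id)
        .collect();
    let mut visited = 0;
    while let Some(id) = queue.pop_front() {
        visited += 1;
        for &succ in successors.get(id).map(|v| v.as_slice()).unwrap_or(&[]) {
            let d = indegree.get_mut(succ).unwrap();
            *d -= 1;
            if *d == 0 {
                queue.push_back(succ);
            }
        }
    }
    // If the sort could not visit every allocation, a cycle exists.
    if visited == specs.len() { Ok(()) } else { Err("cycle detected".into()) }
}
```

The nodes the sort releases first (in-degree 0) are exactly the root nodes that enter the scheduler queue immediately.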

2. Root Node Scheduling

  • Allocations with no incoming dependency edges (root nodes) enter their vCluster scheduler queue immediately
  • Root nodes are scored and scheduled like any other allocation

3. Dependency Resolution

  • When an allocation completes (any terminal state), the system evaluates outgoing edges:
    • For each outgoing edge, check if the condition is satisfied
    • If all incoming edges to a successor are satisfied, the successor enters the scheduler queue
  • Dependency resolution is eventually consistent (handled by lattice-api or a lightweight DAG controller, not the quorum)

4. DAG Completion

  • DAG completes when all allocations reach a terminal state (Completed, Failed, or Cancelled)
  • DAG state: Running while any allocation is pending or running, Completed when all are done, Failed if any required allocation failed without a catching edge

5. DAG Cancellation

  • DELETE /v1/dags/{id} or lattice dag cancel {id}
  • Cancels all pending and running allocations in the DAG
  • Running allocations receive SIGTERM → grace period → SIGKILL (same as walltime exceeded)

Failure Propagation

Default: Success Dependencies

If allocation A fails and B depends on A via Success:

  • B is cancelled (dependency can never be satisfied)
  • B’s downstream dependencies are also evaluated (cascading cancellation)

Error Handling Paths

With Failure edges, users can build error-handling workflows:

  train ──Success──→ evaluate ──Success──→ deploy
    │                   │
    │ Failure           │ Failure
    └────→ notify_failure ←────┘

  • notify_failure runs only if train or evaluate fails
  • deploy runs only if both train and evaluate succeed

Any Dependencies

With Any edges, successors run regardless:

  run_experiment ──Any──→ cleanup

cleanup runs whether run_experiment succeeds or fails. Useful for teardown tasks.

Corresponding Dependencies (Task Groups)

For task groups (array jobs), Corresponding creates element-wise dependencies:

  preprocess[0..N] ──Corresponding──→ train[0..N]

train[i] starts only when preprocess[i] completes successfully. Other array elements are independent.

State Tracking

DAG state is eventually consistent, following ADR-004:

  • The quorum tracks individual allocation states (ownership, terminal states). It does not know about DAG structure.
  • The DAG controller (runs within lattice-api) evaluates dependency edges when allocation state changes. It reads allocation states from the quorum and determines which successors to release into the scheduler queue.
  • This separation keeps the quorum simple and avoids adding DAG-specific logic to the Raft state machine.

DAG Queries

| Endpoint | Description |
|---|---|
| GET /v1/dags/{id} | DAG status: overall state, per-allocation states |
| GET /v1/dags/{id}/graph | DAG structure: allocations and edges |
| GET /v1/dags?tenant={id} | List DAGs for a tenant |
| DELETE /v1/dags/{id} | Cancel DAG |

CLI equivalents: lattice dag status, lattice dag list, lattice dag cancel.

Edge Cases

Node Failure During DAG Execution

When a node fails while running a DAG allocation:

  1. The allocation follows its requeue_policy (see failure-modes.md)
  2. If requeued: the allocation re-enters the scheduler queue with its original priority. Downstream dependencies remain blocked until it completes.
  3. If failed: downstream Success dependencies are cancelled. Failure and Any edges are evaluated normally.
  4. DAG state remains Running as long as any allocation is pending or active.

Task Group with Corresponding Dependencies and Mixed Exit Codes

When a task group has Corresponding dependencies and individual elements exit with different codes:

  • Each Corresponding edge is evaluated independently per array index
  • train[3] failing does not affect train[4]’s dependency on preprocess[4]
  • The downstream task group may have a mix of running, cancelled, and completed elements
  • DAG completion waits for all evaluable elements to reach terminal states

Corresponding Dependencies with Mismatched Array Sizes

When two task groups have Corresponding dependencies but different array sizes (e.g., preprocess[0..9] → train[0..14]):

  • Array indices that exist in both groups are matched normally: train[i] depends on preprocess[i] for i in 0..9.
  • Extra indices in the successor group (train[10..14]) have no matching predecessor element. These extra indices are treated as having their Corresponding dependency satisfied immediately — they enter the scheduler queue as if they were root nodes.
  • This design avoids silent failures: users get all successor elements running, not just the matched subset.
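The matching rule can be sketched as a small function; `corresponding_edges` is an illustrative name, not part of the codebase:

```rust
/// For each successor array index, return the predecessor index it depends
/// on. `None` means the index has no matching predecessor element and is
/// treated as a root node (its Corresponding edge is satisfied immediately).
pub fn corresponding_edges(pred_len: usize, succ_len: usize) -> Vec<Option<usize>> {
    (0..succ_len)
        .map(|i| if i < pred_len { Some(i) } else { None })
        .collect()
}
```

For preprocess[0..9] → train[0..14], elements train[0..9] wait on their matching preprocess element while train[10..14] enter the queue immediately.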

Max DAG Size

DAGs are validated at submission time with a maximum allocation count (default: 1000 allocations per DAG). Submitting a DAG exceeding this limit returns an error:

Error: DAG exceeds maximum size (1234 allocations, limit: 1000)
Hint: Split the workflow into smaller DAGs or increase the limit via system configuration.

The limit is configurable via lattice admin config set scheduling.max_dag_size=2000. Cycle detection runs in O(V+E) and is not a bottleneck, but very large DAGs increase dependency resolution overhead in the DAG controller.

Cross-References

  • api-design.md — DagSpec in protobuf definition
  • scheduling-algorithm.md — DAG members are scored individually by the knapsack solver
  • failure-modes.md — Allocation-level failure recovery interacts with DAG propagation
  • types.rs — Dependency, DependencyCondition enum definitions

Preemption Policy

Design Principle

Preemption is a last resort for resource rebalancing. The scheduler prefers waiting, backfill, and elastic borrowing over preemption. When preemption is necessary, it targets allocations with the lowest preemption cost (fast checkpoint, low priority, short remaining runtime). Sensitive allocations are never preempted.

Preemption Classes

Each allocation has a preemption_class (0-10):

| Class | Meaning | Typical Use | Preemptible By |
|---|---|---|---|
| 0 | Best-effort | Scavenger jobs, testing | Any higher class |
| 1-3 | Low priority | Batch exploration, sweeps | Class 4+ |
| 4-6 | Normal | Production training, simulation | Class 7+ |
| 7-9 | High priority | Time-sensitive production | Class 10 only |
| 10 | Critical / Sensitive | Sensitive claims, emergency | Never preempted |

Rule: Preemption only moves down — a class-5 allocation can preempt class 0-4 allocations but never class 5+.

Enforcement: The preemption_class range (0-10) is validated at API admission. Values outside this range are rejected with a 400 Bad Request error before reaching the scheduler.

Tie-breaking within class: If multiple allocations have the same preemption class, the scheduler prefers to preempt the one with the lowest checkpoint cost (f₈).
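The down-only rule and the admission-time range check fit in a few lines; the function names are illustrative:

```rust
/// Admission-time check: preemption_class must be in 0-10, else 400.
pub fn validate_class(class: u8) -> Result<u8, &'static str> {
    if class <= 10 { Ok(class) } else { Err("preemption_class must be 0-10") }
}

/// Preemption only moves down: a pending job may preempt strictly lower
/// classes. Class 10 is never preemptible because no class exceeds 10.
pub fn can_preempt(pending_class: u8, victim_class: u8) -> bool {
    pending_class > victim_class
}
```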

Preemption Triggers

1. Higher-Priority Demand

A pending allocation with class N cannot be scheduled because all suitable nodes are occupied by lower-class allocations. The scheduler evaluates whether preempting one or more lower-class allocations would free enough resources.

2. Elastic Reclamation

A vCluster’s idle nodes were borrowed by another vCluster (elastic sharing). The home vCluster now needs them back. Borrowed nodes carry an implicit preemption risk — the checkpoint cost model (f₈) accounts for this.

3. Sensitive Node Claim

A sensitive user claims nodes that are currently occupied by non-sensitive allocations. Sensitive claims are class 10 (highest). The scheduler triggers immediate checkpoint + preemption of the occupying allocations.

4. Quota Enforcement

A tenant exceeds their hard quota due to a race condition (two concurrent proposals, first committed). The quorum rejects the second proposal — this is not preemption but rejection. Running allocations are never preempted for quota enforcement.

Preemption Decision Algorithm

PreemptionDecision(pending_job, candidates):

1. Filter candidates:
   - Only allocations with preemption_class < pending_job.preemption_class
   - Exclude sensitive allocations (never preempted)
   - Exclude allocations in Checkpointing state (already being preempted)

2. Score each candidate by preemption cost:
   preemption_cost(c) = checkpoint_time(c)
                       + recompute_if_no_checkpoint(c)
                       + remaining_walltime_value(c)

   checkpoint_time(c):
     If checkpoint == Auto: estimated_checkpoint_minutes from f₈
     If checkpoint == Manual: assume application handles it, use configured timeout
     If checkpoint == None: recompute_if_no_checkpoint applies

   recompute_if_no_checkpoint(c):
     time_since_last_checkpoint(c) × node_count(c) × gpu_per_node
     (GPU-hours that would be lost)

   remaining_walltime_value(c):
     If c is near completion (>90% walltime used): high cost (let it finish)
     If c just started (<10% walltime used): low cost (little invested)

3. Select victim set:
   Greedy: pick candidates with lowest preemption_cost until enough nodes freed.
   Constraint: freed nodes must satisfy pending_job's topology/conformance requirements.

4. If no valid victim set exists: pending_job stays queued (preemption not possible).

5. If valid victim set found: initiate preemption sequence.
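Steps 1-3 can be sketched as below. The topology/conformance constraint on the freed nodes (step 3) is omitted for brevity, and `Candidate` / `select_victims` with a pre-computed `preemption_cost` are illustrative simplifications of the cost terms above:

```rust
/// Illustrative candidate view: cost is the pre-computed sum of the
/// checkpoint, recompute, and remaining-walltime terms.
#[derive(Clone)]
pub struct Candidate {
    pub id: String,
    pub preemption_class: u8,
    pub sensitive: bool,
    pub checkpointing: bool,
    pub nodes: u32,
    pub preemption_cost: f64,
}

/// Returns the victim set, or None if preemption cannot free enough nodes
/// (the pending job then stays queued).
pub fn select_victims(
    pending_class: u8,
    nodes_needed: u32,
    candidates: &[Candidate],
) -> Option<Vec<String>> {
    // 1. Filter: strictly lower classes only; never sensitive allocations;
    //    never allocations already mid-checkpoint.
    let mut eligible: Vec<&Candidate> = candidates
        .iter()
        .filter(|c| c.preemption_class < pending_class && !c.sensitive && !c.checkpointing)
        .collect();

    // 2. Cheapest preemption cost first (costs assumed finite here).
    eligible.sort_by(|a, b| a.preemption_cost.partial_cmp(&b.preemption_cost).unwrap());

    // 3. Greedy: accumulate victims until enough nodes are freed.
    let mut freed = 0;
    let mut victims = Vec::new();
    for c in eligible {
        victims.push(c.id.clone());
        freed += c.nodes;
        if freed >= nodes_needed {
            return Some(victims);
        }
    }
    None
}
```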

Preemption Sequence

1. Scheduler identifies victim allocations
2. For each victim:
   a. If checkpoint == Auto or Manual:
      - Checkpoint broker sends CHECKPOINT_HINT to node agents
      - Application checkpoints (signal, shmem, or gRPC callback)
      - Timeout: checkpoint_timeout (default: 10 minutes)
   b. If checkpoint == None:
      - SIGTERM sent immediately
      - Grace period (30s) → SIGKILL
3. When checkpoint completes (or timeout):
   - Allocation transitions to Suspended state
   - Nodes released to quorum (Raft commit)
4. Freed nodes assigned to pending allocation
5. Suspended allocations re-enter queue with:
   - Original submission time preserved (no wait-time penalty)
   - Resume-from-checkpoint flag set
   - Preempted-count incremented

Checkpoint Timeout Handling

When a checkpointing allocation fails to complete within the timeout:

| Scenario | Action |
|---|---|
| Application responds but slow | Extend timeout by 50%, once |
| Application unresponsive | SIGTERM → grace period → SIGKILL. Mark as failed (not suspended). Requeue if policy allows. |
| gRPC callback: application requests deferral | Grant deferral up to max_deferral (default: 5 minutes). Then force. |

Multi-Victim Preemption

Sometimes freeing one allocation isn’t enough. The scheduler can preempt multiple allocations in a single decision:

Constraints:

  • Maximum victims per decision: configurable (default: 3)
  • All victims must have lower preemption class than the pending job
  • Total preemption cost must be less than the pending job’s estimated value
  • Scheduler prefers preempting fewer, larger allocations over many small ones

Ordering: Victims are preempted in parallel (all receive checkpoint hints simultaneously). The pending job starts once all victims have released their nodes.

Per-vCluster Preemption Policy

| vCluster Type | Preemption Allowed | Notes |
|---|---|---|
| HPC Batch | Yes | Class-based, checkpoint-aware |
| ML Training | Yes | Checkpoint cost heavily weighted (w₈=0.15) |
| Service | Yes (borrowed nodes only) | Services on home nodes are not preempted; borrowed nodes reclaimable |
| Sensitive | Never preempted | Class 10, no exceptions |
| Interactive | Yes | Short-lived, low cost to preempt |

Non-Preemptible Allocations

An allocation is effectively non-preemptible when:

  1. checkpoint: None AND preemption_class >= 7 — High cost to preempt (all progress lost), high priority
  2. Sensitive allocations (always class 10)
  3. Allocations within 5 minutes of walltime completion (configurable: near_completion_threshold)

The scheduler avoids placing non-preemptible allocations on borrowed nodes, since those nodes may need to be reclaimed.

Preemption Metrics

| Metric | Type | Description |
|---|---|---|
| lattice_preemptions_total | counter | Labels: vcluster, reason (priority/reclaim/sensitive) |
| lattice_preemption_checkpoint_duration_seconds | histogram | Time from hint to checkpoint completion |
| lattice_preemption_victim_requeue_total | counter | Preempted allocations re-entering queue |
| lattice_preemption_failed_checkpoint_total | counter | Checkpoint timeouts during preemption |

Checkpoint Broker

Purpose

The checkpoint broker coordinates between the scheduler’s resource management decisions and running applications’ checkpoint capabilities. It enables cost-aware preemption: the scheduler can reclaim resources from running jobs by triggering checkpoints, with the decision driven by an economic cost function.

Cost Model

When to Checkpoint

Should_checkpoint(j, t) = Value(j, t) > Cost(j, t)

Cost Components

Cost(j, t) = write_time(j) + compute_waste(j) + storage_cost(j)

write_time(j):
  Estimated from: checkpoint_size(j) / storage_write_bandwidth
  checkpoint_size(j) estimated from: GPU memory usage × node count
  storage_write_bandwidth from: VAST API current throughput metrics

compute_waste(j):
  GPU-seconds lost during checkpoint I/O
  = write_time(j) × node_count(j) × gpu_per_node

storage_cost(j):
  = checkpoint_size(j) × cost_per_GB_on_target_tier

Value Components

Value(j, t) = recompute_saved(j, t) + preemptability(j, t) + backlog_relief(t)

recompute_saved(j, t):
  GPU-hours that would be lost if the job fails and restarts from scratch
  = time_since_last_checkpoint(j) × node_count(j) × gpu_per_node
  Weighted by failure_probability(j, t) which increases with:
    - Job duration (longer jobs more likely to hit hardware issues)
    - Node health signals (ECC errors, thermal warnings from BMC)

preemptability(j, t):
  Value of being able to preempt this job if a higher-priority job arrives
  = Σ (waiting_higher_priority_jobs × their urgency) × preemption_probability
  High when higher-priority work is queued and this job sits on reclaimable nodes

backlog_relief(t):
  = backlog_pressure(t) × estimated_queue_wait_reduction_if_nodes_freed
  Global signal: how much would freeing these nodes help the overall queue?
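The cost/value comparison can be sketched with the components collapsed into plain numbers; `CheckpointInputs` and the unit choices (hours, GPU-hours) are assumptions for illustration. Note that a write bandwidth of zero drives the cost to infinity, which matches the storage-outage suppression described below:

```rust
/// Illustrative inputs to the Should_checkpoint decision for one job j at
/// time t. Units: hours and GPU-hours; all field names are assumptions.
pub struct CheckpointInputs {
    pub checkpoint_size_gb: f64,
    pub write_bandwidth_gb_per_s: f64,
    pub node_count: f64,
    pub gpu_per_node: f64,
    pub cost_per_gb: f64,
    pub hours_since_last_checkpoint: f64,
    pub failure_probability: f64,  // 0.0..=1.0
    pub preemptability_value: f64, // demand-weighted, GPU-hours
    pub backlog_relief: f64,       // GPU-hours
}

pub fn should_checkpoint(i: &CheckpointInputs) -> bool {
    // Cost(j, t) = write_time + compute_waste + storage_cost
    // A bandwidth of 0.0 yields an infinite write time, so the decision is
    // suppressed during a storage outage.
    let write_time_h = i.checkpoint_size_gb / i.write_bandwidth_gb_per_s / 3600.0;
    let compute_waste = write_time_h * i.node_count * i.gpu_per_node;
    let storage_cost = i.checkpoint_size_gb * i.cost_per_gb;
    let cost = write_time_h + compute_waste + storage_cost;

    // Value(j, t) = recompute_saved + preemptability + backlog_relief,
    // with recompute_saved weighted by the failure probability.
    let recompute_saved = i.hours_since_last_checkpoint
        * i.node_count
        * i.gpu_per_node
        * i.failure_probability;
    let value = recompute_saved + i.preemptability_value + i.backlog_relief;

    value > cost
}
```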

Decision Dynamics

| Scenario | backlog | preempt demand | node health | Decision |
|---|---|---|---|---|
| Quiet system, healthy nodes | Low | Low | Good | Checkpoint infrequently (every 6h) |
| Deep queue, sensitive job waiting | High | High | Good | Checkpoint now, preempt |
| Node ECC errors increasing | Low | Low | Degrading | Checkpoint proactively, migrate |
| Large job nearing walltime | Low | Low | Good | Checkpoint for restart capability |

Application Protocol

Three Communication Modes

Applications opt into checkpoint coordination via one of three mechanisms:

1. Signal-based (legacy compatibility)

Node agent sends SIGUSR1 to the application's process group.
Application catches the signal, writes a checkpoint, and signals completion by creating a sentinel file.
Timeout: if no completion signal within checkpoint_timeout, assume non-checkpointable.

2. Shared memory flag (low-latency)

Node agent sets a flag in a shared memory region mapped at a well-known path.
Application polls the flag (or uses futex wait) and initiates checkpoint.
Completion: application clears the flag and sets a "done" flag.
Best for performance-sensitive applications that can't afford signal handler overhead.

3. gRPC callback (agent-aware applications)

Application registers a checkpoint endpoint with the node agent at startup.
Node agent calls the endpoint when checkpoint is requested.
Application responds with estimated completion time, then streams progress.
Most expressive: supports negotiation (application can request deferral).

Checkpoint Destinations

Checkpoints are written to a standard location:

s3://{tenant}/{project}/{allocation_id}/checkpoints/{checkpoint_id}/

Or, if NFS is preferred for POSIX-style checkpoint (e.g., MPI checkpoint/restart):

/scratch/{tenant}/{project}/{allocation_id}/checkpoints/{checkpoint_id}/

The checkpoint broker coordinates with the data plane to ensure bandwidth is available.

Non-Checkpointable Applications

If an application declares checkpoint: none or fails to respond to checkpoint hints:

  • The allocation is marked as non-preemptible in the cost function
  • It receives a penalty in the knapsack solver (ties up resources without flexibility)
  • The scheduler avoids placing it on borrowed/elastic nodes

Fallback option: DMTCP (Distributed MultiThreaded Checkpointing) for transparent process-level checkpointing. Higher overhead, but works for unmodified applications.

Integration with Scheduler

The checkpoint broker runs as part of the scheduler plane, with access to:

  • Running allocation state (from quorum)
  • Node health telemetry (from eBPF/OpenCHAMI)
  • Storage metrics (from VAST API)
  • Queue state (from vCluster schedulers)

It evaluates the cost function continuously (every 30-60 seconds for each running allocation) and issues checkpoint hints when the threshold is crossed.

Storage Outage Behavior

When the checkpoint destination (VAST S3 or NFS) is unavailable:

  1. Detection: Checkpoint broker detects storage unavailability via failed write probes or VAST API health checks
  2. Immediate effect: All pending checkpoint requests are paused (not cancelled)
  3. Cost function adjustment: storage_write_bandwidth drops to 0, making write_time(j) infinite — the cost function naturally suppresses checkpoint decisions
  4. Running allocations: Continue running. They are effectively non-preemptible during the outage (no checkpoint possible)
  5. Preemption requests: If preemption is forced (e.g., sensitive claim), the victim receives SIGTERM without checkpoint. The allocation is marked Failed (not Suspended) since no checkpoint was written
  6. Recovery: When storage recovers, the broker re-evaluates all running allocations on the next cycle. Allocations with high recompute_saved value are prioritized for immediate checkpoint
  7. Alert: lattice_checkpoint_storage_unavailable gauge set to 1; critical alert fired

Edge Cases

Reactive Allocation Checkpointing

Reactive (autoscaling) allocations pose unique challenges for the checkpoint broker:

  • Variable node count. The checkpoint size estimate (GPU memory × node count) changes as the allocation scales. The broker re-evaluates cost on each cycle using the current node count.
  • Scale-down as implicit checkpoint trigger. When the scheduler decides to scale down a reactive allocation, it triggers a checkpoint on the nodes being released before removing them from the allocation. This ensures state is preserved.
  • Recommendation: For reactive allocations with complex distributed state, use checkpoint: manual and implement application-level checkpoint coordination. The broker’s automatic checkpointing works best for static-size allocations where checkpoint size is predictable.

Walltime vs Checkpoint Race

When an allocation’s walltime expires while a checkpoint is in progress:

  • Walltime takes priority. The walltime timer is not extended to accommodate the checkpoint.
  • If the checkpoint completes before the SIGTERM grace period expires, the checkpoint is usable for restart.
  • If the checkpoint is still in progress when SIGKILL is sent, the checkpoint is considered incomplete and is not used for restart. The allocation is marked Failed with reason walltime_exceeded.
  • To avoid this race, schedule checkpoints proactively as walltime approaches (the recompute_saved value naturally increases near walltime expiration).

Autoscaling

Design Principle

Simple, metric-driven scaling. No complex control theory. The scheduler adjusts node count within bounds based on a single metric threshold. Users set bounds, the scheduler respects them.

Reactive Lifecycle

Defined in crates/lattice-common/src/types.rs (LifecycleType::Reactive):

Reactive {
    min_nodes: u32,
    max_nodes: u32,
    metric: String,       // e.g., "gpu_utilization", "queue_depth", "request_rate"
    target: String,       // e.g., "0.80" (80% GPU utilization target)
}

Reactive allocations are unbounded in duration (like services) but have variable node count.

Scaling Loop

  1. Start: Allocation begins with min_nodes
  2. Evaluate: Every evaluation interval (default: 60s), the scheduler queries TSDB for the allocation’s metric
  3. Scale up: If metric > target for scale_up_window (default: 2 minutes):
    • Propose adding 1 node (conservative: avoid large jumps)
    • Quorum validates the node addition (ownership transfer)
    • Node agent starts processes on the new node
    • Repeat until metric ≤ target or max_nodes reached
  4. Scale down: If metric < target × scale_down_threshold (default: 0.5) for scale_down_window (default: 5 minutes):
    • Propose removing 1 node (least-loaded or most-recently-added)
    • Graceful drain: stop sending work to the node, wait for in-flight requests
    • Node released back to scheduling pool
    • Repeat until metric ≥ target × scale_down_threshold or min_nodes reached
  5. Cooldown: After any scale event, no further scaling for cooldown_period (default: 3 minutes)
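One pass of the loop can be sketched as a pure decision function. The window and cooldown checks are modeled as pre-computed booleans rather than live TSDB queries, and all names are illustrative:

```rust
#[derive(Debug, PartialEq)]
pub enum ScaleDecision { Up, Down, Hold }

/// Illustrative snapshot of a reactive allocation at evaluation time.
pub struct ReactiveState {
    pub current_nodes: u32,
    pub min_nodes: u32,
    pub max_nodes: u32,
    pub in_cooldown: bool,
    /// metric > target for the full scale_up_window (default: 2 minutes)
    pub above_target_for_up_window: bool,
    /// metric < target × 0.5 for the full scale_down_window (default: 5 minutes)
    pub below_threshold_for_down_window: bool,
}

pub fn evaluate(s: &ReactiveState) -> ScaleDecision {
    if s.in_cooldown {
        return ScaleDecision::Hold; // no scaling during cooldown_period
    }
    if s.above_target_for_up_window && s.current_nodes < s.max_nodes {
        return ScaleDecision::Up; // add exactly 1 node (conservative)
    }
    if s.below_threshold_for_down_window && s.current_nodes > s.min_nodes {
        return ScaleDecision::Down; // graceful drain, then remove 1 node
    }
    ScaleDecision::Hold
}
```

Because each pass moves at most one node, bounds and cooldown are enforced by construction rather than by clamping after the fact.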

Why Conservative Scaling

  • Adding 1 node at a time prevents overshooting (workloads often have non-linear resource curves)
  • Scale-down windows are longer than scale-up windows (scale down is more disruptive)
  • Cooldown prevents oscillation from metric noise

Built-In Scaling Metrics

| Metric | Description | Source | Best For |
|---|---|---|---|
| gpu_utilization | Mean GPU SM occupancy across allocation | eBPF / NVML | ML inference services |
| cpu_utilization | Mean CPU usage across allocation | eBPF | CPU-bound services |
| request_rate | Inbound requests per second | eBPF (network flow tracking) | API/web services |
| queue_depth | Pending request queue length | Application-reported or eBPF | Batch-processing services |

Custom Metrics

Any metric available in TSDB can be used for scaling by specifying a label matcher:

lifecycle:
  type: reactive
  min_nodes: 2
  max_nodes: 20
  metric: "custom_metric{job='my-inference'}"
  target: "100"  # e.g., 100 pending requests

The scheduler queries TSDB with the label matcher scoped to the allocation’s nodes.

Configuration Defaults

| Parameter | Default | Configurable |
|---|---|---|
| evaluation_interval | 60s | Per allocation |
| scale_up_window | 2 minutes | Per allocation |
| scale_down_window | 5 minutes | Per allocation |
| scale_down_threshold | 0.5 (50% of target) | Per allocation |
| cooldown_period | 3 minutes | Per allocation |

Quota Interaction

Scale-up respects the tenant’s max_nodes hard quota (cross-ref: quota-enforcement.md):

  • Before proposing a scale-up, the scheduler checks if the tenant has remaining node capacity
  • If max_nodes would be exceeded: scale-up is a no-op, allocation continues at current size
  • No error raised — the allocation operates within its current bounds
  • If quota is later increased (e.g., via Waldur), scaling resumes automatically

Preemption Interaction

Borrowed nodes (from elastic resource sharing) are valid targets for reactive scaling, but they carry a preemption risk:

  • Scaling onto borrowed nodes gives the allocation more capacity temporarily
  • If the home vCluster reclaims the node: reactive allocation scales down gracefully
  • Minimum guarantee: min_nodes always come from the allocation’s home vCluster (not borrowed)

Error Handling

Metric Query Failure (TSDB Down)

If the scheduler cannot query TSDB for the scaling metric:

  1. First failure: skip this evaluation cycle, log warning
  2. Consecutive failures (3+): alert raised (lattice_autoscaling_metric_query_failures_total)
  3. No scaling decisions made while metric is unavailable — allocation stays at current size
  4. When TSDB recovers: normal evaluation resumes on next cycle

The allocation is never scaled blindly. No metric = no action.

Scale-Up Proposal Rejected

If the quorum rejects a scale-up proposal (e.g., race condition with another vCluster):

  1. Retry on next evaluation cycle (60s later)
  2. Maximum 3 consecutive retries for the same scale-up
  3. After 3 rejections: log warning, back off for 2 cooldown periods
  4. Scale-up resumes when conditions change (nodes become available)

Scale-Down During Borrowed Node Reclamation

If a borrowed node is reclaimed by the home vCluster while the reactive allocation is scaling down:

  1. The reclamation takes priority (home vCluster always wins)
  2. The reactive allocation loses the node immediately (graceful drain attempted, but not guaranteed)
  3. If this drops below min_nodes: scheduler attempts to acquire a replacement node from the home vCluster
  4. If no replacement available: allocation operates below min_nodes temporarily, alert raised

Metric Oscillation

If the metric oscillates around the target, causing repeated scale-up/scale-down:

  • The cooldown period (default: 3 minutes) prevents rapid oscillation
  • If scale events alternate for more than 5 cycles: alert raised suggesting the user adjust their target or increase cooldown
  • No automatic target adjustment — the user must update the configuration

Preemption During Scale-Up

If a reactive allocation is scaling up while simultaneously being preempted (e.g., a higher-priority job arrives):

  1. The preemption takes priority — the checkpoint/preemption sequence begins
  2. Any in-flight scale-up proposals are cancelled (quorum rejects proposals for allocations in Checkpointing state)
  3. After preemption completes: the allocation is suspended with its last stable node count
  4. When resumed: scaling restarts from min_nodes, re-evaluating the metric from scratch
  5. The cooldown period applies after resume to prevent immediate re-scaling

If preemption and scale-up proposals race at the quorum:

  • The quorum serializes all proposals — one wins, the other is rejected
  • The rejected proposal is retried on the next scheduling cycle (if still applicable)

Quota Enforcement

Design Principle

Two-tier enforcement matching the two consistency domains (ADR-004). Hard limits enforced at the quorum (strong consistency, cannot be violated). Soft limits enforced at the scheduler (eventual consistency, may temporarily overshoot, self-correcting).

Hard Quotas (Quorum-Enforced)

Hard quotas are checked during Raft proposal validation, before commit. A proposal that would violate a hard quota is rejected immediately.

| Quota | Scope | Enforcement |
|---|---|---|
| max_nodes | Per tenant | Quorum rejects allocation proposals that would exceed the tenant's maximum concurrent node count |
| max_concurrent_allocations | Per tenant | Quorum rejects proposals that would exceed the tenant's maximum number of running allocations |
| sensitive_pool_size | System-wide | Hard limit on the number of nodes that can be claimed for sensitive use |

Guarantees: These quotas cannot be violated, even momentarily. Two vCluster schedulers proposing conflicting allocations that together would exceed a hard quota: the first committed wins, the second is rejected and retried next cycle.

Error handling: Hard quota rejection returns a clear error to the user:

allocation rejected: tenant "physics" would exceed max_nodes quota (current: 195, requested: 10, limit: 200)
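The admission check can be sketched as follows. Because Raft serializes proposals, each check sees the committed effect of every earlier proposal, which is what makes the hard guarantee hold. `TenantUsage`, `TenantQuota`, and `admit` are illustrative names, not the actual quorum code:

```rust
/// Illustrative committed-state view of one tenant.
pub struct TenantUsage {
    pub nodes_in_use: u32,
    pub running_allocations: u32,
}

pub struct TenantQuota {
    pub max_nodes: u32,
    pub max_concurrent_allocations: u32,
}

/// Hard-quota check run during Raft proposal validation, before commit.
pub fn admit(
    tenant: &str,
    usage: &TenantUsage,
    quota: &TenantQuota,
    requested_nodes: u32,
) -> Result<(), String> {
    if usage.nodes_in_use + requested_nodes > quota.max_nodes {
        return Err(format!(
            "allocation rejected: tenant \"{}\" would exceed max_nodes quota \
             (current: {}, requested: {}, limit: {})",
            tenant, usage.nodes_in_use, requested_nodes, quota.max_nodes
        ));
    }
    if usage.running_allocations + 1 > quota.max_concurrent_allocations {
        return Err(format!(
            "allocation rejected: tenant \"{}\" would exceed \
             max_concurrent_allocations quota (current: {}, limit: {})",
            tenant, usage.running_allocations, quota.max_concurrent_allocations
        ));
    }
    Ok(())
}
```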

Soft Quotas (Scheduler-Level)

Soft quotas are tracked with eventual consistency. They influence scheduling decisions through the cost function but do not hard-block allocations.

GPU-Hours Budget

gpu_hours_budget: 100000  # per billing period (month)
gpu_hours_used: 87500     # eventually consistent counter

Behavior: The scheduler uses remaining budget as a penalty in the cost function. As budget depletes:

  • 0-80% used: no penalty
  • 80-100% used: increasing penalty (lower scheduling priority)
  • 100% used: very low score (effective starvation for new allocations, but not hard rejection)

Consistency window: Up to ~30 seconds of lag. Acceptable because: (a) scheduling cycle is 5-30s, (b) over-allocation is self-correcting via fair-share scoring, (c) GPU-hours tracking is for billing, not safety.

Fair Share Target

fair_share_target: 0.15  # tenant should get ~15% of system capacity

Behavior: Feeds into f₃ (fair_share_deficit) in the cost function. Tenants below their share get priority; tenants above are deprioritized. Not a hard ceiling — a tenant can use more than their share when resources are idle.

Burst Allowance

burst_allowance: 1.5  # allow up to 150% of fair share when resources idle

Behavior: Allows temporary over-allocation when the system has spare capacity. When demand increases and other tenants need their share, burst allocations are the first candidates for preemption (via checkpoint cost model).

Internal Budget Ledger

When Waldur is unavailable or not configured, the scheduler computes GPU-hours consumption internally from allocation records in the quorum. This replaces the previously empty budget_utilization map in the cost function.

Computation

Two metrics are tracked:

node_hours_used = Σ (end_time - started_at).hours × assigned_nodes.len()
gpu_hours_used  = Σ (end_time - started_at).hours × Σ gpu_count_per_node
  • For running allocations: end_time = now
  • For completed/failed/cancelled: end_time = completed_at
  • Only allocations within the configured budget_period_days (default: 90 days, rolling window) are included
  • Node GPU count looked up from current hardware inventory; unknown nodes default to 1 GPU
  • Node-hours is the universal metric (works for CPU-only and GPU nodes)
  • When both gpu_hours_budget and node_hours_budget are set, the worse (higher) utilization fraction drives the budget penalty
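The ledger computation can be sketched as below. One detail is an assumption: usage is clipped to the rolling window, so a long-running allocation contributes only the hours that fall inside the window. Times are plain hours-since-epoch to keep the sketch dependency-free:

```rust
/// Illustrative allocation record; fields mirror the description above.
pub struct AllocRecord {
    pub started_at_h: f64,
    pub completed_at_h: Option<f64>, // None = still running (end_time = now)
    pub node_count: f64,
    pub gpus_per_node: f64, // unknown nodes default to 1 GPU
}

/// Returns (node_hours_used, gpu_hours_used) over the rolling window.
pub fn ledger(records: &[AllocRecord], now_h: f64, window_h: f64) -> (f64, f64) {
    let cutoff = now_h - window_h;
    let mut node_hours = 0.0;
    let mut gpu_hours = 0.0;
    for r in records {
        let end = r.completed_at_h.unwrap_or(now_h);
        if end < cutoff {
            continue; // allocation ended outside the budget window
        }
        // Clip to the window so old long-running jobs aren't over-counted
        // (an assumption about how "within the window" is interpreted).
        let start = r.started_at_h.max(cutoff);
        let hours = (end - start).max(0.0);
        node_hours += hours * r.node_count;
        gpu_hours += hours * r.node_count * r.gpus_per_node;
    }
    (node_hours, gpu_hours)
}
```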

Budget Period

Configurable via scheduling.budget_period_days (default: 90). This is a rolling window, not a calendar-aligned reset. Calendar-aligned resets require Waldur to push new gpu_hours_budget values at period boundaries.

Waldur Override

When Waldur is available, its remaining_budget() response takes precedence over the internal ledger. When Waldur is unavailable (transient failure), the internal ledger provides fallback data so budget enforcement continues.

API Access

  • gRPC: GetTenantUsage / GetUserUsage RPCs in AdminService
  • REST: GET /api/v1/tenants/{id}/usage?days=90 / GET /api/v1/usage?user=alice&days=90
  • Rust SDK: client.tenant_usage("physics", 90) / client.user_usage("alice", 90)
  • CLI: lattice usage --tenant physics / lattice usage (uses gRPC)

Exhausted Budget Behavior

GPU-Hours Budget Exhausted

  1. New allocations for this tenant receive a very low scheduling score (effective starvation, not hard rejection)
  2. Tenant admin notified via API event
  3. Running allocations continue to completion (no preemption for budget reasons)
  4. If Waldur integration enabled: Waldur can update the budget (cross-ref: accounting.md)
  5. Tenant admin can request budget increase through Waldur self-service portal

Max Nodes Exhausted

  1. Hard rejection at quorum — clear error returned to user
  2. User must wait for running allocations to complete or cancel existing allocations
  3. No waiting queue for hard-quota-blocked allocations (submit is rejected, user resubmits when capacity is available)

Quota Update Flow

Administrative Update

System admin updates tenant quotas via CLI or API:

# CLI (uses gRPC UpdateTenant RPC)
lattice admin tenant update physics \
  --max-nodes 250 \
  --max-concurrent-allocations 50 \
  --gpu-hours-budget 150000 \
  --node-hours-budget 500000
# Python SDK
await client.update_tenant("physics", {
    "max_nodes": 250,
    "max_concurrent_allocations": 50,
    "gpu_hours_budget": 150000,
    "node_hours_budget": 500000,
})
# REST
PUT /api/v1/tenants/{id}
{
  "max_nodes": 250,
  "max_concurrent_allocations": 50,
  "gpu_hours_budget": 150000,
  "node_hours_budget": 500000
}

Hard quota changes are Raft-committed (immediate effect). Soft quota changes propagate eventually.

Waldur-Driven Update

When Waldur integration is enabled, Waldur can push quota changes:

  1. Waldur determines budget exhaustion or contract change
  2. Waldur calls lattice-api: PUT /api/v1/tenants/{id} (authenticated with Waldur service token)
  3. Hard quotas committed via Raft; soft quotas propagated to schedulers
  4. Reducing max_nodes below current usage does not preempt running allocations — it prevents new ones

Quota Reduction While Allocations Are Running

When a quota is reduced below current usage (e.g., Waldur reduces max_nodes from 200 to 100, but tenant is currently using 150):

Hard Quota Reduction

  • Running allocations are not preempted. The reduced quota only blocks new allocations.
  • Current usage (150) exceeds new limit (100): all new proposals for this tenant are rejected until usage drops below 100.
  • The user receives a clear error on new submissions:
    allocation rejected: tenant "physics" exceeds max_nodes quota
      Current usage: 150 nodes
      New limit: 100 nodes
      Hint: Wait for running allocations to complete, or contact your tenant admin.
    
  • As running allocations complete naturally, usage drops. When usage < new limit: new allocations are accepted again.
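The hard-quota admission rule above can be sketched as a simple headroom check. This is an illustrative sketch, not the Lattice implementation; the names `TenantQuota` and `check_max_nodes` are hypothetical.

```python
# Hypothetical sketch of the hard-quota admission check described above.
from dataclasses import dataclass

@dataclass
class TenantQuota:
    max_nodes: int       # hard limit, Raft-committed
    nodes_in_use: int    # current usage from running allocations

def check_max_nodes(quota: TenantQuota, requested_nodes: int):
    """Reject a new proposal when it would exceed the (possibly reduced) limit."""
    if quota.nodes_in_use + requested_nodes > quota.max_nodes:
        return False, (
            f"allocation rejected: exceeds max_nodes quota "
            f"(usage {quota.nodes_in_use}, limit {quota.max_nodes})"
        )
    return True, "accepted"

# Quota reduced to 100 while 150 nodes are in use: every new proposal is
# rejected until running allocations complete and usage drops below the limit.
ok, msg = check_max_nodes(TenantQuota(max_nodes=100, nodes_in_use=150), 1)
```

Note that running allocations never appear in this check as candidates for preemption; the reduced limit only gates admission.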

Soft Quota Reduction

  • Reduced gpu_hours_budget: scheduling score penalty increases. Pending allocations get lower priority but are not rejected.
  • Reduced fair_share_target: tenant gets deprioritized but can still schedule when resources are idle.
  • No immediate impact on running allocations.

Pending Allocations

Allocations that are Pending (in the scheduler queue but not yet committed) when a hard quota is reduced:

  • They are not retroactively cancelled.
  • If proposed to quorum, the proposal is rejected due to the new quota.
  • The scheduler will not re-propose them until quota headroom exists.
  • User sees allocation stuck in Pending state. lattice status shows the reason: "waiting for quota headroom".

Sensitive Quota Considerations

Sensitive quotas are always hard quotas:

  • sensitive_pool_size — System-wide hard limit, quorum-enforced
  • Sensitive node claims always go through quorum (strong consistency)
  • No soft/eventual quota mechanisms for sensitive resources
  • Idle sensitive nodes (claimed but unused) are not reclaimable — they remain allocated to the claiming user

Cross-ref: sensitive-workloads.md for the full sensitive workload model.

Cross-References

GPU Topology

Design Principle

Vendor-neutral abstraction over GPU interconnect topologies. The scheduler reasons about “GPU domains” and “link bandwidth,” not vendor-specific terms. Node agents discover and report topology; the scheduler uses it for placement decisions.

Vendor Support

| Vendor | GPU Family | Interconnect | Topology Discovery | Metrics Collection |
|---|---|---|---|---|
| NVIDIA | H100, GH200, B200 | NVLink, NVSwitch | NVML (nvmlDeviceGetTopologyCommonAncestor) | NVML / DCGM |
| AMD | MI300X, MI300A | Infinity Fabric, xGMI | ROCm-SMI (rsmi_topo_get_link_type) | ROCm-SMI / rocm_smi_lib |

Additional vendors can be supported by implementing the topology discovery trait in the node agent.

Abstraction Model

GpuTopology {
    gpus: Vec<GpuDevice>,
    links: Vec<GpuLink>,
    nic_affinity: Map<GpuIndex, NicId>,  // which NIC is closest to which GPU
}

GpuDevice {
    index: u32,
    vendor: GpuVendor,          // Nvidia | Amd
    model: String,              // "H100", "MI300X"
    memory_bytes: u64,
    compute_capability: String, // CUDA CC or GCN/CDNA arch
}

GpuLink {
    gpu_a: u32,
    gpu_b: u32,
    link_type: GpuLinkType,     // NvLink | NvSwitch | InfinityFabric | Xgmi | Pcie
    bandwidth_gbps: f64,
}

The node agent populates this structure at startup using vendor-specific APIs and reports it alongside node capabilities and health data.

| Link Type | Typical Bandwidth | Latency | Notes |
|---|---|---|---|
| NVLink (H100) | 450 GB/s per direction | ~1 μs | Direct GPU-to-GPU |
| NVSwitch (H100) | 900 GB/s all-to-all | ~1 μs | Full-bisection via switch |
| Infinity Fabric (MI300X) | 896 GB/s aggregate | ~1 μs | xGMI links between dies |
| PCIe Gen5 | 64 GB/s | ~2-5 μs | Fallback, cross-socket |
| PCIe Gen4 | 32 GB/s | ~2-5 μs | Older systems |

Actual bandwidth is discovered at runtime via vendor APIs, not hardcoded.

Intra-Node Scheduling Impact

ADR-007 defines “full-node scheduling with intra-node packing.” GPU topology informs the intra-node packing:

Multi-GPU Jobs Within a Node

For allocations requesting fewer GPUs than the node has, the node agent packs them onto GPUs that share direct high-bandwidth links:

  1. Prefer GPUs connected via NVLink/NVSwitch/InfinityFabric (direct high-bandwidth)
  2. Avoid splitting across PCIe domains when high-bandwidth links are available
  3. For NCCL/RCCL workloads, contiguous GPU groups minimize communication overhead

Multi-Node Jobs

For allocations spanning multiple nodes:

  1. Prefer nodes where GPU-to-NIC affinity matches — GPUs closest to the NIC used for inter-node communication (Slingshot/Ultra Ethernet)
  2. NIC affinity reduces PCIe hops for inter-node traffic, improving MPI/NCCL allreduce performance
  3. Combined with f₄ (topology_fitness): inter-node placement minimizes dragonfly group span, intra-node placement maximizes link bandwidth

Selection Algorithm

For a k-GPU allocation on a node with n GPUs:
1. Build a graph of GPUs weighted by link bandwidth
2. Find the k-GPU subgraph with maximum minimum link bandwidth
3. If multiple subgraphs tie: prefer the one with best NIC affinity
4. Assign allocation to selected GPUs via cgroup/device isolation
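The steps above amount to a max-min bandwidth subset search. A brute-force version is cheap at node scale (typically 4-8 GPUs per node), so this illustrative sketch simply enumerates all k-subsets; the link bandwidths are example values, not discovered data.

```python
# Illustrative brute-force version of the selection algorithm above.
from itertools import combinations

def select_gpus(n, links, k):
    """Pick the k-GPU subset (k >= 2) whose weakest pairwise link is strongest.

    links: {(a, b): bandwidth_gbps} with a < b; absent pairs fall back to
    PCIe bandwidth (the lowest tier).
    """
    PCIE_GBPS = 64.0
    def bw(a, b):
        return links.get((min(a, b), max(a, b)), PCIE_GBPS)
    best, best_min = None, -1.0
    for subset in combinations(range(n), k):
        min_bw = min(bw(a, b) for a, b in combinations(subset, 2))
        if min_bw > best_min:
            best, best_min = subset, min_bw
    return best, best_min

# 4 GPUs: 0-1 and 2-3 are NVLink pairs; everything else goes over PCIe.
links = {(0, 1): 450.0, (2, 3): 450.0}
gpus, min_bw = select_gpus(4, links, 2)  # picks an NVLink-connected pair
```

A production implementation would break ties on NIC affinity (step 3); that input is omitted here for brevity.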

MIG / GPU Partitioning

NVIDIA Multi-Instance GPU (MIG)

H100 can partition into up to 7 MIG instances, each with isolated memory, cache, and compute:

| MIG Profile | GPU Memory | SMs | Use Case |
|---|---|---|---|
| 1g.10gb | 10 GB | 1/7 | Interactive, notebooks |
| 2g.20gb | 20 GB | 2/7 | Small inference |
| 3g.40gb | 40 GB | 3/7 | Medium training |
| 4g.40gb | 40 GB | 4/7 | Medium training |
| 7g.80gb | 80 GB | 7/7 | Full GPU (no partitioning) |

MIG is relevant for interactive/small-job vClusters where intra-node packing is used. Each MIG instance is a separate schedulable GPU resource.

AMD

AMD provides no equivalent partitioning as of the MI300 generation; MI300X allocations always receive full GPU dies.

Scheduler Integration

  • MIG instances are reported as individual GpuDevice entries with reduced memory_bytes and a partitioned: true flag
  • The scheduler treats MIG instances like smaller GPUs — no special MIG logic in the knapsack solver
  • MIG configuration is managed by the node agent, not the scheduler (reconfiguration requires idle GPU)

Integration with Cost Function

GPU topology extends f₄ (topology_fitness) to include intra-node topology quality:

f₄(j) = α · inter_node_fitness(j) + (1-α) · intra_node_fitness(j)

inter_node_fitness = 1.0 - (groups_needed / max_groups_available)  // existing
intra_node_fitness = min_link_bandwidth(selected_gpus) / max_link_bandwidth(node)

α = 1.0 for single-node jobs (intra-node only matters)
α = 0.7 for multi-node jobs (inter-node dominates but intra-node still relevant)
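The blend above can be computed directly; this minimal sketch follows the formulas in this section, with illustrative function names and example inputs.

```python
# Sketch of the extended f4 blend; names and inputs are illustrative.
def topology_fitness(inter_node, intra_node, multi_node):
    """f4 = alpha * inter_node_fitness + (1 - alpha) * intra_node_fitness."""
    alpha = 0.7 if multi_node else 1.0
    return alpha * inter_node + (1 - alpha) * intra_node

def inter_node_fitness(groups_needed, max_groups_available):
    return 1.0 - groups_needed / max_groups_available

def intra_node_fitness(min_link_bw, max_link_bw):
    return min_link_bw / max_link_bw

# Multi-node job spanning 2 of 16 dragonfly groups, with an intra-node GPU
# selection whose weakest link is 450 of a possible 900 GB/s:
f4 = topology_fitness(inter_node_fitness(2, 16), intra_node_fitness(450, 900), True)
# → 0.7 * 0.875 + 0.3 * 0.5 = 0.7625
```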

The node agent reports GpuTopology alongside capabilities and health on every heartbeat (topology is static, but health/utilization changes).

Conformance Interaction

GPU driver version and firmware version are part of the conformance fingerprint (cross-ref: conformance.md). For multi-node GPU jobs, mismatched drivers cause NCCL/RCCL hangs. The conformance fitness factor (f₉) ensures nodes in a multi-GPU allocation share the same driver stack.

Cross-References

Memory Topology

Design Principle

Vendor-neutral abstraction over CPU-memory-GPU memory topology. The scheduler reasons about “memory domains” and “interconnect bandwidth,” not vendor-specific terms like NUMA node IDs or NVLink-C2C. Node agents discover and report memory topology; the scheduler uses it for placement decisions and memory policy configuration.

This complements gpu-topology.md, which models GPU interconnects. Memory topology models the CPU-memory-GPU memory hierarchy: NUMA domains, unified memory architectures, and CXL-attached memory tiers.

Memory Domain Types

| Type | Hardware Example | Characteristics | Discovery |
|---|---|---|---|
| Discrete NUMA | Multi-socket Intel Xeon, AMD EPYC | Separate DRAM per socket, asymmetric access latencies | /sys/devices/system/node/ |
| Unified CPU-GPU | NVIDIA Grace Hopper GH200 | NVLink-C2C coherent, single address space across CPU and GPU | NVML + /sys/devices/system/node/ |
| APU / Unified Die | AMD MI300A | CPU + GPU on same package, shared HBM3 pool | ROCm-SMI + hwloc |
| CXL-Attached | CXL Type 3 memory expanders | Pooled or device-attached memory, higher latency than local DRAM | /sys/bus/cxl/ |
| Single-Socket | Single-socket servers | Trivial: one NUMA node, uniform access | /sys/devices/system/node/ |

Abstraction Model

MemoryTopology {
    domains: Vec<MemoryDomain>,
    interconnects: Vec<MemoryInterconnect>,
    total_capacity_bytes: u64,
}

MemoryDomain {
    id: u32,
    domain_type: MemoryDomainType,    // Dram | Hbm | CxlAttached | Unified
    capacity_bytes: u64,
    numa_node: Option<u32>,           // Linux NUMA node ID, if applicable
    attached_cpus: Vec<u32>,          // CPU IDs with local access
    attached_gpus: Vec<u32>,          // GPU indices with local/coherent access
}

MemoryInterconnect {
    domain_a: u32,
    domain_b: u32,
    link_type: MemoryLinkType,        // NumaLink | CxlSwitch | CoherentFabric
    bandwidth_gbps: f64,
    latency_ns: u64,
}

enum MemoryDomainType { Dram, Hbm, CxlAttached, Unified }
enum MemoryLinkType { NumaLink, CxlSwitch, CoherentFabric }

The node agent populates this structure at startup alongside GpuTopology and reports it with node capabilities and health data.

Interconnect Bandwidth and Latency

| Link Type | Typical Bandwidth | Typical Latency | Notes |
|---|---|---|---|
| Local DRAM access | 50-100 GB/s per channel | ~80 ns | Same-socket, same NUMA node |
| Remote NUMA (UPI/xGMI) | 20-40 GB/s | ~150-300 ns | Cross-socket, 1.5-3x local latency |
| NVLink-C2C (GH200) | 900 GB/s | ~100 ns | CPU-GPU coherent fabric |
| Infinity Fabric (MI300A) | 896 GB/s aggregate | ~100 ns | On-package CPU-GPU interconnect |
| CXL 2.0 (Type 3) | 32-64 GB/s | ~200-400 ns | Memory expander, higher latency |
| PCIe Gen5 (discrete GPU) | 64 GB/s | ~1-2 μs | Non-coherent, requires explicit transfer |

Actual bandwidth and latency are discovered at runtime, not hardcoded.

Superchip Architectures

NVIDIA Grace Hopper (GH200)

Grace CPU + Hopper GPU connected via NVLink-C2C (900 GB/s bidirectional). The CPU and GPU share a single coherent address space — no explicit cudaMemcpy required for data movement.

┌────────────────────────────────────────────────────┐
│                  GH200 Superchip                   │
│                                                    │
│  ┌─────────────────┐   NVLink-C2C  ┌─────────────┐ │
│  │  Grace CPU      │◄──900 GB/s───►│  Hopper GPU │ │
│  │  72 cores       │   coherent    │  80 GB HBM3 │ │
│  │  512 GB LPDDR5X │               │             │ │
│  └─────────────────┘               └─────────────┘ │
│                                                    │
│  Single coherent address space (CPU + GPU)         │
│  → Maps to one Unified MemoryDomain                │
└────────────────────────────────────────────────────┘

Mapping to abstraction:

  • One MemoryDomain { type: Unified } spanning CPU LPDDR5X + GPU HBM3
  • attached_cpus: all Grace cores; attached_gpus: [Hopper GPU index]
  • One MemoryInterconnect { type: CoherentFabric, bandwidth: 900 } between CPU and GPU sub-domains
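The GH200 mapping above can be shown as a populated instance of the abstraction. This is a sketch: the dataclass mirrors the `MemoryDomain` structure in this chapter, but the Python rendering of it is illustrative.

```python
# Illustrative construction of the Unified domain described for GH200.
from dataclasses import dataclass, field

@dataclass
class MemoryDomain:
    id: int
    domain_type: str                 # "Dram" | "Hbm" | "CxlAttached" | "Unified"
    capacity_bytes: int
    attached_cpus: list = field(default_factory=list)
    attached_gpus: list = field(default_factory=list)

GB = 1024 ** 3
gh200 = MemoryDomain(
    id=0,
    domain_type="Unified",
    capacity_bytes=(512 + 80) * GB,   # Grace LPDDR5X + Hopper HBM3, one address space
    attached_cpus=list(range(72)),    # all 72 Grace cores
    attached_gpus=[0],                # the single Hopper GPU
)
```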

AMD Instinct MI300A

APU with CDNA 3 GPU + Zen 4 CPU on the same package, sharing HBM3 memory pool. No discrete CPU DRAM — all memory is HBM3 accessible by both CPU and GPU.

┌──────────────────────────────────────────────────┐
│                  MI300A Package                  │
│                                                  │
│  ┌─────────────┐   Infinity   ┌────────────────┐ │
│  │  Zen 4 CPU  │ ◄──Fabric──► │  CDNA 3 GPU    │ │
│  │  24 cores   │   896 GB/s   │  6 XCDs        │ │
│  └──────┬──────┘              └───────┬────────┘ │
│         │                             │          │
│         └──────┐          ┌───────────┘          │
│                ▼          ▼                      │
│         ┌─────────────────────┐                  │
│         │  Shared HBM3 Pool   │                  │
│         │  128 GB             │                  │
│         └─────────────────────┘                  │
│                                                  │
│  → Maps to one Unified MemoryDomain              │
└──────────────────────────────────────────────────┘

Mapping to abstraction:

  • One MemoryDomain { type: Unified } for the shared HBM3 pool
  • attached_cpus: all Zen 4 cores; attached_gpus: [MI300A GPU index]
  • Internal Infinity Fabric interconnect is not separately modeled (on-package, always present)

Discovery

The node agent discovers memory topology at startup using platform-specific sources:

| Source | What It Provides | Platform |
|---|---|---|
| /sys/devices/system/node/ | NUMA node count, CPU-to-node mapping, memory per node | Linux (all) |
| numactl --hardware | NUMA distances (latency matrix between nodes) | Linux (all) |
| hwloc | Portable topology discovery, cache hierarchy, PCI locality | Linux (all) |
| NVML | GPU-to-NUMA affinity, NVLink-C2C detection (GH200) | NVIDIA GPUs |
| ROCm-SMI | GPU-to-NUMA affinity, MI300A detection | AMD GPUs |
| /sys/bus/cxl/ | CXL device enumeration, memory regions, interleave config | CXL-capable systems |

Superchip Detection

GH200 and MI300A superchips are identified by GPU model string during GPU discovery (cross-ref: gpu-topology.md). When detected:

  1. The node agent queries the coherent memory size via vendor API (NVML for GH200, ROCm-SMI for MI300A)
  2. NUMA nodes associated with both CPU and GPU are merged into a single Unified domain
  3. The coherent interconnect bandwidth is reported as a CoherentFabric link

Discovery Fallback

If vendor APIs are unavailable (e.g., driver not loaded), the node agent falls back to hwloc for topology and reports Dram domains only. GPU memory domains are still reported via the GPU topology path but without coherent interconnect metadata.

Scheduling Impact

Extending f₄ (topology_fitness)

Memory topology extends the intra-node component of f₄ alongside GPU topology:

intra_node_fitness = β · gpu_link_fitness + (1-β) · memory_locality_fitness

memory_locality_fitness(j, selected_nodes) =
    average over selected nodes of:
        fraction of allocation's CPUs and GPUs in the same memory domain

β = 0.7 for GPU-heavy workloads (GPU interconnect dominates)
β = 0.3 for CPU-heavy workloads with GPU offload (memory locality dominates)
β = 0.5 default
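The blended intra-node fitness above can be sketched directly. The β selection follows the stated defaults; the workload classification keys are illustrative assumptions.

```python
# Sketch of the blended intra-node fitness; names and inputs are illustrative.
def intra_node_fitness(gpu_link_fitness, memory_locality, workload="default"):
    """beta * gpu_link_fitness + (1 - beta) * memory_locality_fitness."""
    beta = {"gpu_heavy": 0.7, "cpu_heavy": 0.3}.get(workload, 0.5)
    return beta * gpu_link_fitness + (1 - beta) * memory_locality

def memory_locality_fitness(per_node_colocated_fractions):
    """Average, over selected nodes, of the fraction of the allocation's
    CPUs and GPUs that share a memory domain."""
    return sum(per_node_colocated_fractions) / len(per_node_colocated_fractions)

# Two nodes: fully co-located on one, half co-located on the other.
fitness = intra_node_fitness(0.8, memory_locality_fitness([1.0, 0.5]), "gpu_heavy")
# → 0.7 * 0.8 + 0.3 * 0.75 = 0.785
```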

Constraint Hints

Allocations can specify memory topology preferences:

| Constraint | Effect |
|---|---|
| prefer_same_numa | Soft: prefer placing all CPUs in a single NUMA domain |
| require_unified_memory | Hard: only schedule on nodes with Unified memory domains (GH200, MI300A) |
| prefer_local_memory | Soft: prefer NUMA-local memory allocation policy |
| allow_cxl_memory | Opt-in: allow scheduling on CXL-expanded memory capacity |

Hard constraints filter nodes before the knapsack solver runs. Soft constraints contribute to memory_locality_fitness.

Intra-Node CPU-GPU Co-location

On discrete NUMA systems (e.g., dual-socket with 4 GPUs per socket), the node agent co-locates an allocation’s CPU cores and GPUs within the same NUMA domain when possible:

For an allocation requesting k CPUs and g GPUs on a multi-NUMA node:
1. Identify NUMA domains that have both free CPUs and GPUs with local affinity
2. Prefer the domain where GPU-to-NIC affinity is best (for inter-node traffic)
3. Assign CPUs and GPUs from the same domain via cgroup/cpuset
4. If the allocation spans domains: prefer domains connected by highest-bandwidth link
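A single pass over those steps might look like the following sketch; the domain records and NIC-affinity scores are example inputs, not discovered topology.

```python
# Illustrative domain selection for CPU-GPU co-location (steps 1-3 above).
def pick_numa_domain(domains, cpus_needed, gpus_needed):
    """domains: list of dicts with free_cpus, free_gpus, nic_affinity
    (higher is better). Returns the best domain index, or None if no single
    domain can hold the allocation (step 4: it must span domains)."""
    candidates = [
        (d["nic_affinity"], i)
        for i, d in enumerate(domains)
        if d["free_cpus"] >= cpus_needed and d["free_gpus"] >= gpus_needed
    ]
    if not candidates:
        return None
    return max(candidates)[1]

domains = [
    {"free_cpus": 16, "free_gpus": 2, "nic_affinity": 0.9},
    {"free_cpus": 32, "free_gpus": 4, "nic_affinity": 0.6},
]
best = pick_numa_domain(domains, cpus_needed=8, gpus_needed=2)  # → domain 0
```

Both domains fit the request here, so NIC affinity breaks the tie in favor of domain 0.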

Memory Mapping Policies

The node agent configures memory allocation policy at allocation start via numactl (or equivalent). This is transparent to the user unless they specify a preference.

| Policy | numactl Flag | When Used |
|---|---|---|
| Local | --localalloc | Default: allocate on the NUMA node where the thread runs |
| Interleave | --interleave=all | Large shared datasets that all threads access equally |
| Preferred | --preferred=<node> | Pin to a specific NUMA node (for known data locality) |
| Bind | --membind=<nodes> | Strict: only allocate from specified nodes (sensitive isolation) |

On unified memory architectures (GH200, MI300A), NUMA policy has reduced impact since CPU and GPU share the same memory pool. The node agent skips numactl configuration for allocations on unified nodes unless the user explicitly requests a policy.

Allocation-Level Override

Users can specify memory policy in the allocation request:

resources:
  cpus: 24
  gpus: 1
  memory_gb: 128
constraints:
  memory_policy: interleave    # optional: local | interleave | preferred | bind
  require_unified_memory: true  # optional: only unified architectures

CXL Memory Tiers

CXL Type 3 memory expanders add a new capacity tier: higher latency than local DRAM but lower cost per GB. The scheduler treats CXL memory as a separate resource dimension.

Capacity Model

Node memory capacity:
  local_dram_bytes:  512 GB  (fast, NUMA-local)
  cxl_memory_bytes:  2 TB    (slower, CXL-attached)
  total_bytes:       2.5 TB

Allocation can request:
  memory_gb: 256              # scheduler satisfies from local DRAM
  memory_gb: 1024             # scheduler must use CXL tier (exceeds local DRAM)
  memory_gb: 1024
  allow_cxl_memory: true      # explicit opt-in for CXL tier

Scheduling Rules

  1. By default, allocations are placed using local DRAM capacity only
  2. If allow_cxl_memory: true, CXL capacity is included in available memory
  3. Allocations requesting more memory than local DRAM are only placed on CXL-capable nodes when the constraint is set
  4. CXL memory appears as a separate CxlAttached domain in MemoryTopology
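Rules 1-3 reduce to a capacity check against the right tier set. This sketch uses hypothetical field names for the node capacity record.

```python
# Sketch of the CXL placement rules above; field names are illustrative.
def memory_placeable(node, request_gb, allow_cxl=False):
    """Local DRAM by default; the CXL tier is counted only on explicit opt-in."""
    available = node["local_dram_gb"]
    if allow_cxl:
        available += node["cxl_memory_gb"]
    return request_gb <= available

node = {"local_dram_gb": 512, "cxl_memory_gb": 2048}
assert memory_placeable(node, 256)                    # fits in local DRAM
assert not memory_placeable(node, 1024)               # exceeds DRAM, no opt-in
assert memory_placeable(node, 1024, allow_cxl=True)   # CXL tier included
```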

Cross-References

  • gpu-topology.md — GPU interconnect topology, NIC affinity, intra-node GPU selection
  • telemetry.md — NUMA locality metrics collection (eBPF), memory utilization
  • scheduling-algorithm.md — f₄ topology_fitness, knapsack solver, constraint handling
  • node-lifecycle.md — Node agent startup, health reporting, capability discovery
  • conformance.md — Hardware configuration fingerprint (includes memory architecture)

Performance Tuning Guide

Design Principle

Tuning Lattice is primarily about tuning the cost function weights per vCluster. The RM-Replay simulator is the primary tool: capture production traces, replay with different weights, measure outcomes, deploy with confidence.

Cost Function Sensitivity

Weight Impact Matrix

Each cost function weight controls a trade-off. Increasing one weight reduces the influence of others:

| Weight Increased | Positive Effect | Negative Effect | When to Increase |
|---|---|---|---|
| w₁ (priority) | High-priority jobs scheduled faster | Low-priority jobs starve longer | Many priority levels with strict SLAs |
| w₂ (wait_time) | Better anti-starvation, fairer wait distribution | May schedule low-value jobs before high-value ones | Long tail of wait times |
| w₃ (fair_share) | Tenants get closer to contracted share | May reduce overall utilization (leaving resources idle) | Multi-tenant with strict fairness requirements |
| w₄ (topology) | Better placement, higher network performance | May increase wait time (holding out for ideal placement) | Network-sensitive workloads (NCCL, MPI allreduce) |
| w₅ (data_readiness) | Less I/O stall at job start | May delay jobs whose data isn't pre-staged | Large-dataset workloads |
| w₆ (backlog) | System responds to queue pressure | May destabilize scheduling when queue fluctuates | Bursty submission patterns |
| w₇ (energy) | Lower electricity costs | Jobs may wait for cheap-energy windows | Time-flexible workloads, sites with TOU pricing |
| w₈ (checkpoint) | More flexible resource rebalancing | Overhead from frequent checkpointing | Preemption-heavy environments |
| w₉ (conformance) | Fewer driver-mismatch issues | Fewer candidate nodes (smaller conformance groups) | Multi-node GPU workloads |

Common Trade-offs

Throughput vs. Fairness (w₃):

  • Low w₃ (0.05): maximize utilization — schedule whatever fits, regardless of tenant share
  • High w₃ (0.35): enforce fairness — tenants below their share get priority even if it means idle resources

Typical compromise: w₃ = 0.15-0.25

Wait Time vs. Topology (w₂ vs. w₄):

  • High w₂, low w₄: schedule quickly in any topology — reduces wait but may hurt network performance
  • Low w₂, high w₄: wait for good topology — increases wait but improves job runtime

Typical for HPC: w₂ = 0.25, w₄ = 0.15
Typical for ML training: w₂ = 0.10, w₄ = 0.30

Utilization vs. Energy (w₇):

  • w₇ = 0.00: schedule immediately regardless of energy cost (default for most sites)
  • w₇ = 0.10-0.15: delay time-flexible jobs to cheap-energy windows

Only relevant for sites with significant time-of-use electricity pricing.

Using RM-Replay

Overview

RM-Replay replays production workload traces through the scheduler in simulation mode. No real resources are used. Simulation runs in seconds, not hours.

Reference: Martinasso et al., “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management” (SC18).

Step 1: Capture Traces

Record workload traces from production (or synthetic workloads):

# Enable trace capture (writes to S3)
lattice admin config set scheduler.trace_capture=true
lattice admin config set scheduler.trace_path="s3://lattice-traces/"

# Capture for a representative period (1 week recommended)
# Traces include:
#   - Allocation submissions (arrival time, resources, constraints, tenant, priority)
#   - Allocation completions (actual duration, exit status)
#   - Node inventory (capabilities, topology, conformance groups)

Trace format is a timestamped event log (JSON lines):

{"ts": "2026-03-01T00:00:01Z", "type": "submit", "alloc": {"nodes": 64, "gpu_type": "GH200", "walltime": "72h", "tenant": "physics", "priority": 4}}
{"ts": "2026-03-01T00:00:05Z", "type": "complete", "alloc_id": "abc-123", "duration": "68h", "exit": 0}
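Since each trace event is one JSON object per line, reading a trace is straightforward. This minimal reader is a sketch (the helper name is illustrative); the two event lines are taken verbatim from the example above.

```python
# Minimal reader for the JSONL trace format shown above.
import json

def load_trace(lines):
    """Parse JSONL trace events, splitting submissions from completions."""
    submits, completes = [], []
    for line in lines:
        event = json.loads(line)
        (submits if event["type"] == "submit" else completes).append(event)
    return submits, completes

trace = [
    '{"ts": "2026-03-01T00:00:01Z", "type": "submit", "alloc": {"nodes": 64, "gpu_type": "GH200", "walltime": "72h", "tenant": "physics", "priority": 4}}',
    '{"ts": "2026-03-01T00:00:05Z", "type": "complete", "alloc_id": "abc-123", "duration": "68h", "exit": 0}',
]
submits, completes = load_trace(trace)  # → 1 submit, 1 completion
```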

Step 2: Configure Weights

Create weight profiles to compare:

# profiles/baseline.yaml (current production weights)
hpc-batch:
  priority: 0.20
  wait_time: 0.25
  fair_share: 0.25
  topology: 0.15
  data_readiness: 0.10
  backlog: 0.05
  energy: 0.00
  checkpoint: 0.00
  conformance: 0.10

# profiles/fairness-boost.yaml (experiment: more fairness)
hpc-batch:
  priority: 0.15
  wait_time: 0.20
  fair_share: 0.35        # increased
  topology: 0.15
  data_readiness: 0.10
  backlog: 0.05
  energy: 0.00
  checkpoint: 0.00
  conformance: 0.10

Step 3: Replay

# Replay with baseline weights
rm-replay --trace=traces/week-2026-03.jsonl \
          --weights=profiles/baseline.yaml \
          --nodes=inventory/alps.yaml \
          --output=results/baseline/

# Replay with experimental weights
rm-replay --trace=traces/week-2026-03.jsonl \
          --weights=profiles/fairness-boost.yaml \
          --nodes=inventory/alps.yaml \
          --output=results/fairness-boost/

Step 4: Evaluate

RM-Replay produces a summary report:

=== RM-Replay Results: fairness-boost ===

Utilization:
  GPU-hours consumed: 1,234,567 / 1,500,000 available (82.3%)
  ↓ 2.1% vs baseline (84.4%)

Wait Time:
  p50: 12 min  (baseline: 10 min)  ↑ 20%
  p95: 2.1 hr  (baseline: 2.5 hr)  ↓ 16%
  p99: 8.3 hr  (baseline: 12.1 hr) ↓ 31%

Fairness (Jain's Index):
  0.94 (baseline: 0.87)  ↑ 8%

Tenant Share Deviation:
  Max deviation: 3.2%  (baseline: 8.7%)  ↓ 63%

Backfill:
  Backfill jobs: 342 (baseline: 367)  ↓ 7%

Preemptions:
  Total: 15 (baseline: 12)  ↑ 25%

Step 5: Decide and Deploy

Compare results across profiles. When satisfied:

# Deploy new weights (hot-reloadable, no restart)
lattice admin vcluster set-weights --name=hpc-batch \
  --priority=0.15 --wait-time=0.20 --fair-share=0.35 \
  --topology=0.15 --data-readiness=0.10 --backlog=0.05 \
  --energy=0.00 --checkpoint=0.00 --conformance=0.10

Weights take effect on the next scheduling cycle.

Scheduling Cycle Tuning

The scheduling cycle interval affects responsiveness vs. overhead:

| Interval | Effect | Recommended For |
|---|---|---|
| 5s | Fast scheduling, higher CPU on scheduler | Interactive vCluster, small clusters |
| 15s | Balanced | HPC batch, ML training |
| 30s | Lower overhead, slower response | Large clusters (5000+ nodes), service vCluster |

lattice admin vcluster set-config --name=hpc-batch --cycle-interval=15s

Backfill Tuning

Backfill depth controls how many future reservations the solver considers:

| Depth | Effect |
|---|---|
| 0 | No backfill (only first-fit) — simple but low utilization |
| 10 | Moderate backfill — good balance |
| 50 | Deep backfill — higher utilization but longer cycle time |

For most sites, depth 10-20 is optimal. Increase if utilization is below target.

Conformance Group Sizing

If conformance groups are too small (many distinct fingerprints), multi-node jobs have fewer candidate sets:

  • Symptom: High wait times for multi-node jobs, f₉ scores consistently low
  • Diagnosis: lattice nodes -o wide shows many distinct conformance hashes
  • Fix: Coordinate with OpenCHAMI to standardize firmware versions. Prioritize GPU driver and NIC firmware alignment.
  • Workaround: Reduce w₉ for tolerant workloads (services, interactive)

Cross-References

Node Lifecycle

Design Principle

Nodes follow a formal state machine with well-defined transitions, timeouts, and operator actions. The node agent drives transitions locally; the quorum records ownership changes with strong consistency. Running allocations are never disrupted by state transitions unless the node is genuinely unhealthy.

State Machine

                    ┌────────────────────────────────────────────┐
                    │                                            │
                    ▼                                            │
  ┌─────────┐   boot   ┌──────────┐   health ok   ┌─────────┐    │
  │ Unknown │────────→ │ Booting  │──────────────→│  Ready  │    │
  └─────────┘          └──────────┘               └────┬────┘    │
       ▲                     │                         │         │
       │               boot fail                       │         │
       │                     │        ┌────────────────┤         │
       │                     ▼        │                │         │
       │               ┌──────────┐   │  drain cmd     │         │
       │               │  Failed  │   │       │        │         │
       │               └──────────┘   │       ▼        │         │
       │                     │        │  ┌──────────┐  │  remediated
       │               wipe/reboot    │  │ Draining │  │         │
       │                     │        │  └─────┬────┘  │         │
       │                     │        │   allocs done  │         │
       │                     │        │        │       │         │
       │                     │        │        ▼       │         │
       │                     │        │  ┌──────────┐  │         │
       │                     │        │  │ Drained  │  │         │
       │                     │        │  └─────┬────┘  │         │
       │                     │        │ undrain│       │         │
       │                     │        │        │       │         │
       │                     │        │        ▼       │         │
       │                     │        └──→ (Ready) ◄───┘         │
       │                     │                                   │
       │                     │    heartbeat miss    ┌───────────┐│
       │                     │    ┌────────────────→│ Degraded  ││
       │                     │    │   (Ready)       └─────┬─────┘│
       │                     │    │                 grace timeout│
       │                     │    │                       │      │
       │                     │    │                       ▼      │
       │                     └────┼──────────────────┌─────────┐ │
       │                          │                  │  Down   │ │
       └──────────────────────────┼──────────────────└────┬────┘ │
                                  │                 reboot│      │
                                  │                       └──────┘
                                  │
                         heartbeat resume
                          (within grace)
                                  │
                                  └──→ (Ready)

States

| State | Description | Schedulable | Allocations Run |
|---|---|---|---|
| Unknown | Node exists in inventory but has never reported | No | No |
| Booting | OpenCHAMI booting/reimaging the node | No | No |
| Ready | Healthy, agent reporting, available for scheduling | Yes | Yes |
| Degraded | Heartbeat missed or minor issue detected | No (new) | Yes (existing) |
| Down | Confirmed failure, grace period expired | No | No (requeued) |
| Draining | Operator or scheduler requested drain, waiting for allocations to finish | No (new) | Yes (existing, draining) |
| Drained | All allocations completed/migrated after drain | No | No |
| Failed | Boot failure or unrecoverable hardware error | No | No |

Transitions

Ready → Degraded

Trigger: First missed heartbeat.

Timeout: heartbeat_timeout (default: 30s). If no heartbeat is received within this window, the quorum marks the node Degraded.

Effect: Node is removed from scheduling candidates for new allocations. Running allocations continue undisturbed. No user notification.

Sensitive override: Sensitive nodes use a longer degradation window (default: 2 minutes) to avoid false positives from transient network issues.

Degraded → Ready

Trigger: Heartbeat resumes within the grace period.

Effect: Node re-enters the scheduling pool. No allocation disruption occurred. Event logged but no alert.

Degraded → Down

Trigger: Grace period expired without heartbeat recovery.

Timeouts:

| Node Type | Grace Period | Rationale |
|---|---|---|
| Standard | 60s | Balance between fast recovery and false positive avoidance |
| Sensitive | 5 minutes | Sensitive allocations are high-value; avoid premature requeue |
| Borrowed | 30s | Borrowed nodes should be reclaimed quickly |

Effect:

  1. All allocations on the node are evaluated per their requeue policy (cross-ref: failure-modes.md)
  2. Node ownership released (Raft commit)
  3. Alert raised to operators
  4. OpenCHAMI notified for out-of-band investigation (Redfish BMC check)

Ready → Draining

Trigger: Explicit operator command (lattice node drain <id>) or scheduler-initiated (upgrade, conformance drift on sensitive node).

Effect:

  1. Node removed from scheduling candidates
  2. Running allocations continue until completion
  3. For urgent drains: scheduler may trigger checkpoint on running allocations (cross-ref: checkpoint-broker.md)
  4. No new allocations assigned

Draining → Drained

Trigger: All running allocations on the node have completed, been checkpointed, or been migrated.

Effect: Node is idle and safe for maintenance. Operator can upgrade, reboot, or reimage.

Drained → Ready

Trigger: Operator undrain (lattice node undrain <id>). Typically after maintenance.

Precondition: Node agent health check passes (heartbeat, GPU detection, network test, conformance fingerprint computed).

Effect: Node re-enters scheduling pool.

Any → Down (hardware failure)

Trigger: OpenCHAMI Redfish BMC detects critical hardware failure (PSU, uncorrectable ECC, GPU fallen off bus).

Effect: Immediate transition to Down, bypassing grace period. Same allocation handling as Degraded → Down.

Down → Booting

Trigger: Operator or automated remediation initiates reboot/reimage via OpenCHAMI.

Effect: Node enters Booting state. OpenCHAMI BSS serves the appropriate image.

Booting → Ready

Trigger: Node agent starts, passes health check, reports to quorum.

Health check: Heartbeat received, GPU count matches capabilities, NIC firmware detected, conformance fingerprint computed and reported.

Booting → Failed

Trigger: Boot timeout (default: 10 minutes) or repeated boot failures (3 consecutive).

Effect: Node marked Failed. Alert raised. Operator must investigate.

Sensitive Node Lifecycle Extensions

Sensitive nodes have additional constraints:

| Event | Standard Node | Sensitive Node |
|---|---|---|
| Claim | Scheduler assigns | User claims explicitly, Raft-committed |
| Degraded grace | 60s | 5 minutes |
| Down → requeue | Automatic | Operator intervention required |
| Release | Node returns to pool | Node must be wiped (OpenCHAMI secure erase) before returning |
| Conformance drift | Deprioritized | Immediate Draining, audit logged |

Sensitive Release Sequence

1. User releases sensitive allocation
2. Quorum releases node ownership (Raft commit, audit entry)
3. Node enters Draining (if other sensitive allocations) or proceeds to wipe
4. OpenCHAMI initiates secure wipe:
   a. GPU memory clear
   b. NVMe secure erase (if present)
   c. RAM scrub
   d. Reboot into clean image
5. Wipe confirmation reported to quorum (Raft commit, audit entry)
6. Node transitions to Ready and returns to general pool

Wipe Failure Handling

If the OpenCHAMI secure wipe fails or times out during sensitive node release:

  1. Timeout: Default wipe timeout is 30 minutes (configurable: sensitive.wipe_timeout). If wipe does not complete within this window, the node enters a Quarantine state (treated as Down by the scheduler).
  2. Quarantine: Quarantined nodes are excluded from scheduling and flagged for operator intervention. They do not return to the general pool.
  3. Operator intervention: The operator investigates (BMC console, hardware diagnostics) and either:
    • Retries the wipe: lattice admin node wipe <id> --force
    • Replaces the node hardware
    • Marks the node as permanently failed: lattice node disable <id>
  4. Audit: Wipe failures are logged as critical audit events (Raft-committed for sensitive nodes). The audit entry records: node ID, wipe start time, failure reason, operator action.
  5. Alert: lattice_sensitive_wipe_failure_total counter incremented; critical alert fired.

Operator Commands

| Command | Effect | Confirmation Required |
|---------|--------|-----------------------|
| lattice node drain <id> | Start draining | No |
| lattice node drain <id> --urgent | Drain with checkpoint trigger | Yes (allocations will be checkpointed) |
| lattice node undrain <id> | Re-enable scheduling | No |
| lattice node disable <id> | Transition to Down immediately | Yes (allocations will be requeued/failed) |
| lattice node enable <id> | Re-enable a disabled node (Down → Ready) | No |
| lattice node status <id> | Show current state, allocations, health | No |
| lattice node list --state=degraded | List nodes in specific state | No |

Heartbeat Protocol

Node agents send heartbeats to the quorum at a configurable interval:

| Parameter | Default | Description |
|-----------|---------|-------------|
| heartbeat_interval | 10s | How often the agent sends a heartbeat |
| heartbeat_timeout | 30s | Quorum marks Degraded after this silence |
| grace_period | 60s | Degraded → Down after this additional silence |
| sensitive_grace_period | 5m | Extended grace for sensitive nodes |

Heartbeats include:

  • Monotonic sequence number (replay detection)
  • Node health summary (GPU count, temperature, ECC errors)
  • Conformance fingerprint (if recomputed since last heartbeat)
  • Running allocation count

Heartbeats are lightweight (~200 bytes) and sent over the management traffic class (cross-ref: security.md).
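
The timeout and grace thresholds compose into a simple liveness classification. A minimal sketch, assuming silence is measured from the last received heartbeat (the `classify` helper and its state names are illustrative, not the actual quorum code):

```python
from enum import Enum

class NodeState(Enum):
    READY = "ready"
    DEGRADED = "degraded"
    DOWN = "down"

def classify(silence_s: float, sensitive: bool = False) -> NodeState:
    """Classify a node from seconds of heartbeat silence, using the
    defaults above: timeout 30s, grace 60s (5 minutes for sensitive
    nodes, where the extended grace replaces the standard one)."""
    timeout = 30.0
    grace = 300.0 if sensitive else 60.0
    if silence_s < timeout:
        return NodeState.READY
    if silence_s < timeout + grace:
        return NodeState.DEGRADED
    return NodeState.DOWN
```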

Agent Restart and State Recovery

The node agent persists active allocation state to /var/lib/lattice/agent-state.json (configurable via --state-file). This enables workload survival across agent restarts.

On graceful shutdown (SIGTERM):

  1. Agent writes current allocation state (PIDs, cgroup paths, runtime type, mount points) to the state file
  2. Agent exits without killing workloads (systemd KillMode=process)

On startup:

  1. Agent reads the persisted state file
  2. For each allocation, checks if the process is still alive (kill(pid, 0))
  3. Alive processes are reattached — agent resumes heartbeating their status
  4. Dead processes are treated as orphans — cgroup scopes are destroyed, mounts cleaned up
  5. Stray cgroup scopes under workload.slice/alloc-*.scope with no matching state entry are also cleaned up
  6. Agent re-registers with quorum and resumes normal operation

Crash recovery: If the agent crashes without writing the state file, the startup scan of cgroup scopes under workload.slice/ provides a fallback discovery mechanism for orphaned workloads.
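
The startup scan can be sketched as follows, assuming a simplified state-file schema with just allocation IDs and PIDs (the real file also records cgroup paths, runtime type, and mount points):

```python
import json
import os

def pid_alive(pid: int) -> bool:
    """Check process existence with kill(pid, 0): signal 0 delivers
    nothing, it only validates that the PID exists."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user

def recover(state_path: str):
    """Partition persisted allocations into reattached (process still
    alive) and orphaned (process gone; clean up cgroups and mounts)."""
    with open(state_path) as f:
        state = json.load(f)
    reattached, orphaned = [], []
    for alloc in state["allocations"]:
        (reattached if pid_alive(alloc["pid"]) else orphaned).append(alloc["id"])
    return reattached, orphaned
```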

Cross-References

Node Conformance & Configuration Drift

Problem

In large-scale HPC systems, nodes gradually drift from their intended configuration: firmware versions diverge, driver updates are applied unevenly, kernel parameters change. This configuration drift causes:

  • Silent performance degradation. A 64-node NCCL training run where one node has a different NIC firmware version may see unexplained slowdowns or hangs.
  • Correctness issues. Mismatched GPU driver versions can produce different numerical results.
  • Compliance violations. Regulated workloads require provable consistency of the execution environment.

Design Principle

The scheduler does not manage node configuration — OpenCHAMI does. The scheduler only needs to know whether nodes are the same or different, and how strict the workload’s homogeneity requirements are. Detection is the node agent’s job. Remediation is OpenCHAMI’s job.

Conformance Fingerprint

Each node agent computes a conformance fingerprint: a hash of the node’s configuration-critical software and firmware versions.

Components included in the fingerprint:

  • GPU driver version (e.g., NVIDIA 550.54.14)
  • NIC firmware version (Slingshot/UE adapter firmware)
  • BIOS/BMC firmware version (reported via Redfish/OpenCHAMI)
  • Kernel version and boot parameters
  • uenv base image hash (for sensitive: the hardened OS image)

The fingerprint is a content hash (SHA-256 of the sorted component list). Nodes with identical fingerprints belong to the same conformance group.
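
A minimal sketch of the fingerprint computation, assuming components are reported as name/version string pairs (the component names and the `key=value` canonical form are illustrative):

```python
import hashlib

def conformance_fingerprint(components: dict[str, str]) -> str:
    """SHA-256 over the sorted component list. Sorting makes the hash
    independent of reporting order, so two nodes with identical
    versions always land in the same conformance group."""
    canonical = "\n".join(f"{k}={v}" for k, v in sorted(components.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()
```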

Reporting

The node agent reports the conformance fingerprint alongside its existing health data. This is eventually consistent — conformance group membership does not go through Raft (it’s derived from node agent reports, same as health status).

Exception: for sensitive nodes, conformance state changes are recorded in the Raft-committed audit log (per sensitive workload requirements).

Staleness

The node agent recomputes the fingerprint:

  • On startup
  • Periodically (default: every 6 hours)
  • On explicit request from the scheduler (e.g., after OpenCHAMI remediation)

If a node hasn’t reported a fingerprint within the staleness window, the scheduler treats it as unknown conformance — equivalent to a unique conformance group of one.

Scheduling Integration

Cost Function (f₉)

See scheduling-algorithm.md for the full cost function. The conformance factor f₉ scores how homogeneous the candidate node set is:

f₉(j, candidates) = largest_conformance_group_size(candidates) / j.requested_nodes
  • 1.0 → all candidate nodes share the same fingerprint
  • 0.5 → half the nodes match, half differ
  • Low values → highly heterogeneous set

Node Selection

During node selection (solver step 2a), the solver prefers nodes from the same conformance group:

  1. Among nodes satisfying constraints (GPU type, topology, etc.), group by conformance fingerprint
  2. Select the largest conformance group that can satisfy the node count
  3. If no single group is large enough, merge groups (with a scoring penalty via f₉)
  4. For single-node jobs, conformance is irrelevant (f₉ = 1.0 trivially)
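
The selection steps can be sketched as follows; `select_nodes`, its greedy largest-group-first merge order, and capping f₉ at 1.0 when the largest group exceeds the request are assumptions for illustration:

```python
from collections import defaultdict

def select_nodes(candidates: dict[str, str], requested: int):
    """Group candidate nodes (node id -> conformance fingerprint) by
    fingerprint, fill the request from the largest groups first, and
    score homogeneity: f9 = largest group size / requested nodes."""
    groups = defaultdict(list)
    for node, fp in candidates.items():
        groups[fp].append(node)
    ordered = sorted(groups.values(), key=len, reverse=True)
    selected = []
    for group in ordered:
        selected.extend(group)
        if len(selected) >= requested:
            break
    selected = selected[:requested]
    f9 = min(len(ordered[0]), requested) / requested  # 1.0 when homogeneous
    return selected, f9
```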

Per-vCluster Policy

| vCluster Type | Conformance Behavior |
|---------------|----------------------|
| HPC Batch | Soft preference (w₉=0.10). Prefers homogeneous sets but will mix if needed. |
| ML Training | Strong preference (w₉=0.25). Multi-node training is sensitive to driver mismatches. |
| Service | Weak preference (w₉=0.05). Services are usually single-node or tolerate heterogeneity. |
| Sensitive | Hard constraint at solver level (drifted nodes excluded before scoring). w₉=0.10 as tiebreaker among conformant nodes. |
| Interactive | Ignored (w₉=0.00). Short-lived, single-node, not sensitive to drift. |

Drift Response

When the scheduler detects that a node’s conformance fingerprint has changed (or diverged from the majority in its group):

  1. Continue running workloads. Existing allocations are not disrupted — the drift already happened, and disrupting would make things worse.
  2. Stop scheduling new work. The node is deprioritized for new allocations (it now belongs to a smaller conformance group, scoring lower on f₉).
  3. Signal OpenCHAMI. The scheduler (or node agent) notifies OpenCHAMI that the node has drifted, triggering remediation (firmware update, reboot into correct image, etc.).
  4. For sensitive nodes: additionally flag the drift in the audit log and set the node to Draining (transitioning to Drained once active allocations complete) — no new sensitive claims until remediated and verified. After remediation, an operator undoes the drain (Drained → Ready).

The scheduler does not attempt to remediate drift itself. It only avoids scheduling on drifted nodes and signals the infrastructure layer to fix them.

OpenCHAMI Coordination

When the scheduler detects drift:

  1. Signal: The node agent (or scheduler) calls OpenCHAMI SMD to report the drift:

    PATCH /hsm/v2/State/Components/{xname}
    { "Flag": "Warning", "FlagMsg": "conformance_drift: expected=<hash_a>, actual=<hash_b>" }
    
  2. OpenCHAMI response: OpenCHAMI evaluates the drift against its remediation policy:

    • Minor drift (kernel param change): schedule firmware update at next maintenance window
    • Major drift (GPU driver version): schedule immediate reboot into correct image via BSS
    • Critical drift (sensitive node): immediate remediation, operator notified
  3. Wait for remediation: The scheduler does not re-enable the node automatically. After OpenCHAMI remediates (reboot, firmware flash), the node agent:

    • Recomputes conformance fingerprint on startup
    • Reports new fingerprint to quorum
    • If fingerprint matches expected baseline: node returns to Ready
    • If still drifted: remains deprioritized, alert escalated
  4. Timeout: If a node remains drifted for longer than drift_remediation_timeout (default: 24 hours):

    • Alert escalated to critical
    • Node transitions to Down (removed from scheduling entirely)
    • Operator must investigate and manually undrain after fix
  5. Sensitive nodes (stricter):

    • Drift triggers immediate Draining (no grace period for new claims)
    • Remediation timeout: 4 hours (shorter, due to regulatory risk)
    • After remediation: conformance re-verified AND admin approval required before accepting sensitive claims again

Relationship to Existing Concepts

  • NodeHealth tracks whether the node is functional (Healthy/Degraded/Down/Draining). Conformance is orthogonal — a node can be Healthy but drifted.
  • NodeCapabilities tracks what the node has (GPU type, memory). Conformance tracks whether the node’s software stack matches expectations.
  • Topology (GroupId) tracks physical location. Conformance tracks software configuration. Both are inputs to node selection: pack by topology AND by conformance group.

Network Domains

Design Principle

Network domains provide L3 reachability between allocations that need to communicate. They map to Slingshot VNIs (Virtual Network Identifiers) which provide hardware-enforced network isolation. Domains are created on demand, scoped to tenants, and cleaned up automatically.

What is a Network Domain

A network domain is a named group of allocations that share network reachability:

# Two allocations sharing a domain:
allocation_a:
  connectivity:
    network_domain: "ml-workspace"

allocation_b:
  connectivity:
    network_domain: "ml-workspace"

Allocations in the same domain can communicate over the Slingshot fabric. Allocations in different domains (or with no domain) are network-isolated at the hardware level.

VNI Lifecycle

Allocation

1. User submits allocation with network_domain: "ml-workspace"
2. lattice-api checks if domain "ml-workspace" exists for this tenant:
   a. If exists: allocation joins the existing domain
   b. If not: create new domain, allocate VNI from pool
3. VNI assignment is stored in quorum state (eventually consistent)
4. Node agents configure Slingshot NIC with the VNI for the allocation's traffic

VNI Pool

VNIs are allocated from a configured pool:

network:
  vni_pool_start: 1000
  vni_pool_end: 4095
  # Reserved VNIs:
  # 1 = management
  # 2 = telemetry
  # 3-999 = reserved for future use

VNIs are allocated sequentially from the pool. When freed, they return to the available set.
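
A sketch of the pool's behavior; whether freed VNIs are reused before fresh sequential ones is an implementation detail assumed here:

```python
class VniPool:
    """Sequential VNI allocator over an inclusive [start, end] range.
    Freed VNIs return to an available set and are handed out again."""

    def __init__(self, start: int = 1000, end: int = 4095):
        self.start, self.end = start, end
        self.next_fresh = start
        self.freed: set[int] = set()

    def allocate(self) -> int:
        if self.freed:
            return self.freed.pop()  # reuse a returned VNI first (assumption)
        if self.next_fresh > self.end:
            raise RuntimeError("VNI pool exhausted")
        vni = self.next_fresh
        self.next_fresh += 1
        return vni

    def release(self, vni: int) -> None:
        self.freed.add(vni)
```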

Release

1. Last allocation in the domain completes (or is cancelled)
2. Domain enters "draining" state for grace_period (default: 5 minutes)
   - Allows brief gaps between allocations in a long-running workflow
3. After grace period with no new allocations: domain is released
4. VNI returns to the available pool
5. Domain name can be reused by the same tenant

The grace period prevents VNI churn in DAG workflows where allocations start and stop in sequence but share a domain.

DAG Domain Persistence

DAG workflows often have sequential stages that share a network domain but have gaps between stages (one allocation completes before the next starts). The grace period (default: 5 minutes) covers these gaps:

  • If the next DAG stage starts within the grace period: it joins the existing domain (same VNI, no churn)
  • If the gap exceeds the grace period: the domain is released and a new VNI is allocated when the next stage starts
  • For long-running DAGs with predictable inter-stage gaps, increase the grace period per-domain: lattice admin network set-grace --domain=<name> --grace=15m
  • The grace period timer resets each time a new allocation joins the domain

Scoping Rules

| Rule | Enforcement |
|------|-------------|
| Domain names are scoped to a tenant | Two tenants can use the same domain name without conflict |
| Only allocations from the same tenant can share a domain | Cross-tenant domains are not allowed (isolation requirement) |
| Sensitive domains are per-allocation | Each sensitive allocation gets a unique domain (no sharing, even within tenant) |
| Domain names are user-chosen strings | No system-generated names; users pick meaningful names |

Capacity

| Parameter | Default | Notes |
|-----------|---------|-------|
| VNI pool size | 3096 (1000-4095 inclusive) | Sufficient for typical HPC deployments |
| Max domains per tenant | 50 | Configurable per tenant |
| Max allocations per domain | Unlimited | Practical limit: node count |

VNI Exhaustion

If the VNI pool is exhausted:

  1. New domain creation fails with a clear error:
    Error: cannot create network domain — VNI pool exhausted (3096/3096 in use)
    Hint: Wait for running allocations to complete, or contact your system admin.
    
  2. Allocations without network_domain are unaffected (they don’t need a VNI)
  3. Allocations joining an existing domain are unaffected (domain already has a VNI)
  4. Alert raised for operators

VNI Exhaustion Mid-DAG

If the VNI pool is exhausted while a DAG has pending allocations that require a new network domain:

  • The allocation that needs the new domain enters Pending state with reason vni_pool_exhausted.
  • The DAG stalls at this allocation — downstream dependencies remain blocked.
  • Already-running DAG allocations with existing domains are unaffected.
  • Mitigation: Use a shared network domain across DAG stages where possible. This avoids new VNI allocation for each stage and reduces pool pressure.
  • Recovery: When other allocations complete and release VNIs, the pending allocation is re-evaluated on the next scheduling cycle.

Default Behavior

If an allocation does not specify network_domain:

  • Single-node allocations: no VNI needed, no network isolation beyond the default
  • Multi-node allocations: automatically assigned a domain named alloc-{id} (private to this allocation)
  • Services with expose ports: automatically assigned a domain if not specified

Service Exposure

For allocations exposing service endpoints:

connectivity:
  network_domain: "inference-cluster"
  expose:
    - name: "api"
      port: 8080
      protocol: "http"

Exposed ports are reachable from:

  1. Other allocations in the same network domain (always)
  2. The lattice-api REST gateway (for external access)

They are not directly reachable from outside the fabric (Slingshot is not routable from Ethernet).

Sensitive Network Domains

Sensitive allocations get strict network isolation:

connectivity:
  network_domain: "sensitive-{user}-{alloc_id}"  # auto-generated, unique
  policy:
    ingress:
      deny-all-except:
        - same_domain          # only processes in this allocation
        - data_gateway         # controlled data ingress
    egress:
      deny-all-except:
        - data_gateway         # controlled data egress

  • Each sensitive allocation gets its own domain (no sharing)
  • Ingress/egress restricted to a data gateway endpoint
  • With Ultra Ethernet: network-level encryption enabled for the VNI
  • VNI released immediately on allocation completion (no grace period)

VNI Pool Expansion

To expand the VNI pool when approaching exhaustion:

  1. Update the configuration to extend vni_pool_end:

    network:
      vni_pool_start: 1000
      vni_pool_end: 8191   # expanded from 4095
    
  2. Restart the API server to pick up the new pool range. Existing domains and their VNI assignments are not affected.

  3. Verify: The lattice_network_vni_pool_total metric should reflect the new pool size.

Note: The expanded range must not overlap with reserved VNIs (1-999) or VNIs used by other systems on the Slingshot fabric. Coordinate with network administrators before expanding.

Cross-References

MPI Process Management

Design Principle

Lattice must launch and manage multi-node MPI processes without relying on SSH between compute nodes. The node agent provides process management infrastructure (PMI) so that MPI implementations (OpenMPI, MPICH, Cray MPICH) can perform rank discovery and key-value exchange through Lattice rather than through SSH or a Slurm-specific launcher.

Problem Statement

In Slurm, srun serves as both a process launcher (fan-out to nodes) and a PMI server (rank discovery, KV exchange). Lattice replaces srun with lattice launch / the LaunchTasks RPC, but the current implementation is a stub that does not:

  1. Fan out process launch to node agents
  2. Provide PMI wire-up so MPI ranks can discover each other
  3. Manage CXI credentials for Slingshot/Ultra Ethernet fabric access

Without this, users calling mpirun directly fall back to SSH for remote process spawning, which is:

  • A security risk (SSH keys between compute nodes)
  • Incompatible with network-domain-only L3 reachability
  • Incompatible with the sensitive workload isolation model
  • Operationally fragile (SSH host key management, authorized_keys distribution)

Supported MPI Implementations

| Implementation | PMI-2 Support | PMIx Support | Default Launcher | Notes |
|----------------|---------------|--------------|------------------|-------|
| MPICH | Native (PMI-2 origin) | Via external PMIx | Hydra (SSH) | PMI-2 is the natural fit |
| OpenMPI | Yes (OMPI_MCA_pmix=pmi2) | Preferred (PRRTE) | ORTE/PRRTE (SSH) | PMI-2 fully functional |
| Cray MPICH | Native (via PALS) | Via PALS | PALS | PMI-2 without PALS works |
All three support PMI-2. PMIx is preferred by OpenMPI but not required.

Architecture

Two-Tier Design

┌─────────────────────────────────────────────────────────┐
│  Default: Native PMI-2 Server (built into node agent)   │
│  Simple, no external dependencies, covers 95%+ of MPI   │
│  workloads. ~8 wire commands over Unix domain socket.   │
├─────────────────────────────────────────────────────────┤
│  Optional: OpenPMIx Sidecar (feature-flagged)           │
│  Full PMIx v4/v5 support for workloads that require     │
│  PMIx-specific features (spawn, tools API, events).     │
│  Node agent manages OpenPMIx server lifecycle.          │
└─────────────────────────────────────────────────────────┘

Launch Flow

User: lattice launch --alloc=123 -n 256 --tasks-per-node=4 ./my_mpi_app

  │
  ▼
lattice-api (LaunchTasks RPC)
  │
  ├─ Validates: allocation is Running, user owns it
  ├─ Computes rank layout: N nodes × tasks_per_node = total ranks
  │   Rank assignment: node 0 gets ranks [0..3], node 1 gets [4..7], ...
  ├─ Generates launch_id, PMI job attributes (appnum, size, universe_size)
  ├─ Provisions CXI credentials if Slingshot fabric (see below)
  │
  ▼ Fan-out: gRPC LaunchProcesses to each node agent in the allocation

Node Agent 0                 Node Agent 1                 Node Agent N-1
  │                            │                            │
  ├─ Creates PMI-2 server      ├─ Creates PMI-2 server      ├─ ...
  │  (Unix domain socket)      │  (Unix domain socket)      │
  │                            │                            │
  ├─ Spawns local ranks        ├─ Spawns local ranks        │
  │  rank 0: ./my_mpi_app      │  rank 4: ./my_mpi_app      │
  │  rank 1: ./my_mpi_app      │  rank 5: ./my_mpi_app      │
  │  rank 2: ./my_mpi_app      │  rank 6: ./my_mpi_app      │
  │  rank 3: ./my_mpi_app      │  rank 7: ./my_mpi_app      │
  │                            │                            │
  │  Each rank inherits:       │                            │
  │  - PMI_FD (socket fd)      │                            │
  │  - PMI_RANK (global rank)  │                            │
  │  - PMI_SIZE (world size)   │                            │
  │                            │                            │
  ▼                            ▼                            ▼
  MPI_Init() → PMI-2 fullinit → local KVS puts (libfabric endpoint addr)
  │                            │                            │
  ▼ ─────────── kvsfence (cross-node KVS exchange via gRPC) ────────────
  │                            │                            │
  MPI_Init() completes         MPI_Init() completes         ...
  │                            │                            │
  (application runs)           (application runs)           ...
  │                            │                            │
  MPI_Finalize() → PMI-2 finalize
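
The block rank layout computed during fan-out (node i hosts a contiguous range of tasks_per_node global ranks) can be sketched as:

```python
def rank_layout(nodes: int, tasks_per_node: int):
    """Return (world_size, {node_index: [global ranks on that node]}).
    Node i gets ranks [i * tasks_per_node, (i + 1) * tasks_per_node)."""
    world_size = nodes * tasks_per_node
    layout = {
        node: list(range(node * tasks_per_node, (node + 1) * tasks_per_node))
        for node in range(nodes)
    }
    return world_size, layout
```

For the example above (`-n 256 --tasks-per-node=4`), node 0 gets ranks 0–3, node 1 gets 4–7, and so on across 64 nodes.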

PMI-2 Wire Protocol

The PMI-2 wire protocol is text-based over a Unix domain socket. The node agent implements these commands:

| Command | Direction | Purpose |
|---------|-----------|---------|
| fullinit | rank → agent | Initialize PMI connection, receive rank/size/appnum |
| job-getinfo | rank → agent | Query job attributes (e.g., universe size) |
| kvsput | rank → agent | Store a key-value pair (e.g., libfabric endpoint address) |
| kvsget | rank → agent | Retrieve a key-value pair |
| kvsfence | rank → agent | Barrier + distribute all KV pairs across all ranks |
| finalize | rank → agent | Clean shutdown of PMI connection |
| abort | rank → agent | Signal abnormal termination |
| spawn | rank → agent | Dynamic process spawning (optional, rarely used) |

Cross-Node KVS Exchange (Fence)

The kvsfence operation is the only cross-node PMI operation. It requires all ranks across all nodes to synchronize and exchange accumulated KV pairs. This is implemented via gRPC between node agents:

kvsfence triggered on all nodes
  │
  ▼
Phase 1: Local collection
  Each node agent collects all kvsput entries from its local ranks.

Phase 2: Exchange (star topology via designated head node)
  ┌─────────────┐
  │ Head Agent  │ ◄──── gRPC PmiFence(local_kvs) ──── Agent 1
  │ (rank 0's   │ ◄──── gRPC PmiFence(local_kvs) ──── Agent 2
  │  node)      │ ◄──── gRPC PmiFence(local_kvs) ──── Agent N-1
  │             │
  │ Merges all  │
  │ KVS entries │
  │             │
  │ Broadcasts  │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent 1
  │ merged KVS  │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent 2
  │             │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent N-1
  └─────────────┘

Phase 3: Local completion
  Each node agent unblocks its local ranks' kvsfence.
  Ranks can now kvsget any key from any node.

The head agent is the node agent hosting rank 0. For large jobs (>128 nodes), a tree-based reduction can be used instead of a star to reduce head-node pressure.
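
The head agent's role in phase 2 can be sketched as follows; `HeadFence` is illustrative, and the real exchange happens over gRPC between agents rather than in-process calls:

```python
class HeadFence:
    """Collect PmiFence payloads (one local KVS per node, including the
    head's own) and release the merged KVS once all have arrived."""

    def __init__(self, expected_nodes: int):
        self.expected = expected_nodes
        self.received: list[dict[str, str]] = []

    def on_pmi_fence(self, local_kvs: dict[str, str]):
        self.received.append(dict(local_kvs))
        if len(self.received) < self.expected:
            return None  # still waiting; ranks stay blocked in kvsfence
        merged: dict[str, str] = {}
        for kvs in self.received:
            merged.update(kvs)  # keys are rank-scoped, so no collisions expected
        return merged  # broadcast to peers via PmiFenceComplete
```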

Node Agent gRPC Extensions

New RPCs on the node agent service for MPI process management:

service NodeAgentService {
  // Existing RPCs...

  // Launch MPI ranks on this node (called by API server during fan-out)
  rpc LaunchProcesses(LaunchProcessesRequest) returns (LaunchProcessesResponse);

  // PMI fence exchange between node agents
  rpc PmiFence(PmiFenceRequest) returns (PmiFenceResponse);

  // PMI fence completion broadcast from head agent
  rpc PmiFenceComplete(PmiFenceCompleteRequest) returns (PmiFenceCompleteResponse);

  // Notify all local ranks to abort (e.g., one node failed)
  rpc AbortProcesses(AbortProcessesRequest) returns (AbortProcessesResponse);
}

message LaunchProcessesRequest {
  string launch_id = 1;
  string allocation_id = 2;
  string entrypoint = 3;
  repeated string args = 4;
  uint32 tasks_per_node = 5;
  uint32 first_rank = 6;        // global rank offset for this node
  uint32 world_size = 7;        // total ranks across all nodes
  map<string, string> env = 8;  // additional env vars
  PmiMode pmi_mode = 9;         // PMI2 (default) or PMIX
  // CXI credentials for Slingshot fabric
  optional CxiCredentials cxi_credentials = 10;
  // Peer node agents for fence exchange
  repeated PeerInfo peers = 11;
  // Index of the head node (for fence coordination)
  uint32 head_node_index = 12;
}

message PeerInfo {
  string node_id = 1;
  string grpc_address = 2;  // node agent address (reachable via management network)
  uint32 first_rank = 3;
  uint32 num_ranks = 4;
}

enum PmiMode {
  PMI2 = 0;
  PMIX = 1;
}

message CxiCredentials {
  uint32 vni = 1;
  bytes auth_key = 2;
  uint32 svc_id = 3;
}

PMI-2 Server Implementation

Each node agent runs a PMI-2 server per launch (one Unix socket per launch_id):

Node Agent
  │
  ├─ LaunchProcesses received
  │   ├─ Create Unix socket: /tmp/lattice-pmi-{launch_id}.sock
  │   ├─ Start PMI-2 server task (tokio)
  │   ├─ Fork/exec ranks with:
  │   │   PMI_FD={fd}           # inherited socket fd
  │   │   PMI_RANK={rank}       # global rank
  │   │   PMI_SIZE={world_size} # world size
  │   │   PMI_SPAWNED=0         # not dynamically spawned
  │   │   LATTICE_LAUNCH_ID={launch_id}
  │   │   LATTICE_ALLOC_ID={allocation_id}
  │   │   LATTICE_NODELIST={comma-separated node list}
  │   │   LATTICE_NNODES={node_count}
  │   │   LATTICE_NPROCS={world_size}
  │   │   # CXI env (if Slingshot):
  │   │   FI_CXI_DEFAULT_VNI={vni}
  │   │   FI_CXI_AUTH_KEY={key}
  │   └─ Monitor all rank processes, report exit status
  │
  ├─ PMI-2 server handles:
  │   ├─ fullinit → return rank, size, appnum, debug flag
  │   ├─ kvsput → store in local HashMap
  │   ├─ kvsget → lookup local, or merged (post-fence)
  │   ├─ kvsfence → collect local, trigger cross-node exchange, block until complete
  │   ├─ finalize → mark rank done
  │   └─ abort → signal all local ranks, notify head agent
  │
  └─ Cleanup on launch completion
      ├─ Remove Unix socket
      ├─ Report per-rank exit codes to API server
      └─ Clean up CXI credentials

Environment Variables

Lattice sets these environment variables for MPI processes:

| Variable | Value | Purpose |
|----------|-------|---------|
| PMI_FD | fd number | PMI-2 socket (inherited) |
| PMI_RANK | global rank | MPI rank |
| PMI_SIZE | world size | MPI world size |
| PMI_SPAWNED | 0 | Not dynamically spawned |
| LATTICE_LAUNCH_ID | UUID | Launch identifier |
| LATTICE_ALLOC_ID | UUID | Allocation identifier |
| LATTICE_NODELIST | comma-separated | All nodes in this launch |
| LATTICE_NNODES | integer | Node count |
| LATTICE_NPROCS | integer | Total rank count |
| LATTICE_LOCAL_RANK | 0..tasks_per_node-1 | Node-local rank |
| LATTICE_LOCAL_SIZE | tasks_per_node | Ranks on this node |
| FI_CXI_DEFAULT_VNI | VNI number | Slingshot VNI (if applicable) |
| FI_CXI_AUTH_KEY | hex string | CXI auth key (if applicable) |
| FI_PROVIDER | cxi or verbs | libfabric provider hint |
For Slurm compatibility (compat.set_slurm_env=true), also set SLURM_PROCID, SLURM_NPROCS, SLURM_LOCALID, SLURM_NODELIST.
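
Assembling the per-rank environment can be sketched as follows (a partial subset of the variables above; the `rank_env` helper is illustrative):

```python
def rank_env(rank: int, world_size: int, local_rank: int,
             tasks_per_node: int, slurm_compat: bool = False) -> dict[str, str]:
    """Build the PMI and Lattice environment for one rank. With
    slurm_compat (compat.set_slurm_env=true), mirror the values into
    the equivalent SLURM_* names."""
    env = {
        "PMI_RANK": str(rank),
        "PMI_SIZE": str(world_size),
        "PMI_SPAWNED": "0",
        "LATTICE_LOCAL_RANK": str(local_rank),
        "LATTICE_LOCAL_SIZE": str(tasks_per_node),
        "LATTICE_NPROCS": str(world_size),
    }
    if slurm_compat:
        env["SLURM_PROCID"] = str(rank)
        env["SLURM_NPROCS"] = str(world_size)
        env["SLURM_LOCALID"] = str(local_rank)
    return env
```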

CXI Credential Management (Slingshot)

On Slingshot systems, MPI communication requires CXI (Cassini eXtended Interface) credentials tied to the allocation’s VNI. Without valid credentials, libfabric’s CXI provider refuses to open endpoints.

Credential Lifecycle

1. Allocation scheduled → network domain assigned → VNI allocated
2. LaunchTasks RPC → API server requests CXI credentials from fabric manager
   - Input: VNI, allocation ID, node list
   - Output: auth_key, svc_id (bound to VNI + node set)
3. Credentials included in LaunchProcessesRequest to each node agent
4. Node agent sets FI_CXI_DEFAULT_VNI and FI_CXI_AUTH_KEY for spawned ranks
5. On launch completion → API server revokes CXI credentials

Fabric Manager Integration

The Slingshot fabric manager provides a REST API for credential management:

| Operation | Endpoint | When |
|-----------|----------|------|
| Create CXI service | POST /fabric/cxi/services | Launch start |
| Get auth key | GET /fabric/cxi/services/{id}/auth | Launch start |
| Revoke CXI service | DELETE /fabric/cxi/services/{id} | Launch end |

This is a new integration point, similar to the existing VAST API integration for storage.

Optional: OpenPMIx Sidecar (Feature-Flagged)

For workloads requiring full PMIx v4/v5 support (dynamic process spawning, PMIx tools API, event notification, PMIx groups), Lattice can run an OpenPMIx server as a managed sidecar process.

When to Use PMIx Mode

| Scenario | PMI-2 (default) | PMIx (optional) |
|----------|-----------------|-----------------|
| Standard MPI (init, communication, finalize) | Yes | Yes |
| Multi-application launch (MPMD) | Limited | Yes |
| Dynamic process spawning (MPI_Comm_spawn) | No | Yes |
| PMIx tools API (debugger attach) | No | Yes |
| PMIx event notification | No | Yes |
| OpenMPI with PMIx-only features | No | Yes |

Architecture

Node Agent
  │
  ├─ PmiMode::PMIX requested in LaunchProcessesRequest
  │
  ├─ Spawns OpenPMIx server (pmix_server binary)
  │   ├─ Configured via tmpdir/pmix-{launch_id}/
  │   ├─ Node agent implements the PMIx "host" callback interface
  │   │   via a small C shim library (libpmix-lattice-host.so)
  │   │   that calls back to the node agent via Unix socket
  │   ├─ Cross-node exchange: host callbacks route to node agent gRPC
  │   └─ pmix_server provides Unix rendezvous socket for ranks
  │
  ├─ Spawns ranks with:
  │   PMIX_SERVER_URI={rendezvous_uri}
  │   PMIX_NAMESPACE={launch_id}
  │   PMIX_RANK={rank}
  │   (instead of PMI_FD/PMI_RANK/PMI_SIZE)
  │
  └─ On completion: stops pmix_server, cleans up

Host Callback Shim

The OpenPMIx server requires the host (resource manager) to provide certain callbacks for cross-node operations. These are implemented via a small C shared library (libpmix-lattice-host.so) that:

  1. Is loaded by pmix_server at startup via --host-lib or LD_PRELOAD
  2. Implements: pmix_server_fencenb_fn, pmix_server_dmodex_fn, pmix_server_spawn_fn
  3. Each callback sends a request over a Unix socket to the node agent
  4. Node agent handles cross-node coordination via gRPC (same as PMI-2 fence)

This keeps the C code minimal (~200 lines) while leveraging the full OpenPMIx implementation.

Build and Deployment

# Cargo.toml (lattice-node-agent)
[features]
pmix = []  # enables PMIx sidecar support

When the pmix feature is enabled:

  • pmix_server binary must be installed on compute nodes (packaged separately or via uenv)
  • libpmix-lattice-host.so is built from infra/pmix-host/ and installed alongside the node agent
  • The node agent detects pmix_server availability at startup and reports it as a node capability

When disabled: PmiMode::PMIX requests return an error with a clear message.

Integration with Existing Runtimes

uenv Runtime

The PMI-2 socket and environment variables are available inside the mount namespace with no special handling (a mount namespace isolates mount points, not Unix domain sockets created in the parent namespace).

Sarus Runtime

The PMI-2 Unix socket must be bind-mounted into the container:

sarus run --mount=type=bind,source=/tmp/lattice-pmi-{launch_id}.sock,destination=/tmp/lattice-pmi.sock ...

The --mpi flag in Sarus already handles MPI wire-up for Slurm; for Lattice, we configure Sarus to use the Lattice-provided PMI socket instead. This requires the Sarus MPI hook to be configured for PMI-2 mode rather than Slurm PMI mode.

DMTCP (Checkpoint/Restart)

DMTCP wraps the MPI process. The PMI-2 socket is outside the DMTCP checkpoint boundary. On restart, the node agent creates a new PMI-2 server and the restarted ranks re-initialize PMI. DMTCP’s MPI plugin handles reconnecting MPI communicators.

Failure Handling

Rank Failure

1. Rank exits with non-zero code (or is killed by signal)
2. Local node agent detects via process monitor
3. Node agent sends RankFailed notification to head agent
4. Head agent:
   a. If allocation requeue policy = "on_any_failure": abort all ranks, requeue allocation
   b. If MPI_ERRORS_RETURN semantics: notify remaining ranks via PMI-2 abort
   c. Default: abort all ranks, report failure to API server

Node Agent Failure

1. Node agent crashes or becomes unreachable
2. Head agent detects via gRPC timeout during fence (or heartbeat miss)
3. Head agent aborts the launch on all surviving nodes
4. API server handles allocation state transition (same as node failure)

Fence Timeout

1. kvsfence does not complete within timeout (default: 60s, configurable)
2. Head agent declares fence failure
3. All ranks aborted with PMI-2 abort message
4. Launch reported as failed with "PMI fence timeout" reason

User-Facing Changes

lattice launch (CLI)

# MPI launch (replaces srun -n 256 ./app)
lattice launch --alloc=123 -n 256 ./my_mpi_app

# With tasks-per-node control
lattice launch --alloc=123 --tasks-per-node=4 ./my_mpi_app

# Force PMIx mode (requires pmix feature on nodes)
lattice launch --alloc=123 -n 256 --pmi=pmix ./my_mpi_app

# Launch with environment variables
lattice launch --alloc=123 -n 256 --env OMP_NUM_THREADS=8 ./my_mpi_app

Submission Script

#!/bin/bash
#LATTICE nodes=64
#LATTICE walltime=2:00:00
#LATTICE vcluster=hpc-batch
#LATTICE network_domain=my-training-run

# No SSH, no mpirun, no srun needed.
# The entrypoint IS the MPI program; Lattice handles process launch and PMI.
lattice launch -n 256 --tasks-per-node=4 ./my_mpi_training

# Or for Slurm compatibility:
# srun -n 256 ./my_mpi_training   (compat layer translates to lattice launch)

Direct mpirun (Escape Hatch)

Users who want to call mpirun directly can still do so. Lattice provides a Hydra-compatible launcher script (lattice-mpi-launcher) that uses the node agent gRPC instead of SSH:

# mpirun detects the Lattice launcher via:
#   HYDRA_LAUNCHER=manual
#   HYDRA_LAUNCHER_EXEC=lattice-mpi-launcher
# These are set automatically by the node agent when an allocation starts.

# So this "just works" inside an allocation:
mpirun -np 256 ./my_mpi_app

The lattice-mpi-launcher script:

  1. Receives the launch command from Hydra/ORTE
  2. Calls the local node agent’s LaunchProcesses gRPC to spawn on the target node
  3. Returns the PID to the MPI launcher

This provides backward compatibility for scripts that use mpirun directly while still avoiding SSH.

Performance Considerations

| Operation | Latency | Bottleneck | Mitigation |
|---|---|---|---|
| Launch fan-out | ~100ms for 256 nodes | gRPC round-trips | Parallel fan-out from API server |
| PMI-2 fence (star) | ~10ms for <128 nodes | Head agent merge | Acceptable for typical HPC |
| PMI-2 fence (tree) | ~20ms for 1000+ nodes | Tree depth (log N) | Only needed at extreme scale |
| CXI credential provisioning | ~50ms | Fabric manager API | Cached for allocation lifetime |

MPI_Init typically takes 100-500ms. The Lattice PMI overhead is well within this budget.

Cross-References

Data Plane & Storage Architecture

Tiered Storage Model

┌─ Hot Tier (VAST-like) ─────────────────────────────────┐
│  Protocol: NFS + S3 (native multiprotocol)             │
│  Use: active datasets, home dirs, checkpoints, scratch │
│  Performance: NVMe-speed, low-latency                  │
│  Scheduler integration: QoS per export, pre-staging    │
│  Sensitive: encrypted pool, access-logged              │
└────────────────────┬───────────────────────────────────┘
                     │ policy-driven data mover
┌────────────────────┴───────────────────────────────────┐
│  Warm Tier (capacity storage)                          │
│  Protocol: S3-compatible                               │
│  Use: completed outputs, older datasets, cold models   │
│  Cost: significantly lower than hot                    │
└────────────────────┬───────────────────────────────────┘
                     │ archive policy
┌────────────────────┴───────────────────────────────────┐
│  Cold Tier (tape/object archive)                       │
│  Protocol: S3-compatible (Glacier-style retrieval)     │
│  Use: regulatory retention, long-term archival         │
│  Sensitive: 7+ year retention, immutable               │
└────────────────────────────────────────────────────────┘

Protocol Standardization

Only two protocols for user-facing access:

  • NFS: POSIX workloads, home directories, uenv images, legacy codes that expect a filesystem
  • S3: Object access for checkpoints, datasets, model artifacts, any cloud-native tooling

No Lustre/GPFS client required. VAST delivers parallel-file-system performance via NFS.

Job Data Requirements

Explicit Declaration

Users who know their data needs can declare them:

data:
  mounts:
    - source: "s3://training-data/imagenet"
      target: "/data/input"
      tier_hint: "hot"
      access: "read-only"
    - source: "nfs://home/{user}"
      target: "/home/{user}"
      access: "read-write"
  output: "s3://{tenant}/{project}/{allocation_id}/"
  scratch_per_node: "500GB"

Sane Defaults (for users who don’t specify)

Every allocation automatically gets:

  • Home directory: mounted via NFS from hot tier (/home/{user})
  • Node-local scratch: NVMe-backed ephemeral storage (/scratch/local/) if NVMe is available; tmpfs or network scratch otherwise
  • Output directory: s3://{tenant}/{project}/{allocation_id}/ auto-created
  • Checkpoint directory: s3://{tenant}/{project}/{allocation_id}/checkpoints/ (if checkpoint != none)

Data Staging (Scheduler-Integrated)

The scheduler integrates with the storage API for intelligent data movement:

  1. Pre-staging during queue wait: When a job is queued and its data is on warm/cold tier, the data mover begins warming it to hot tier. Queue wait time becomes useful instead of idle.

  2. QoS allocation at job start: The scheduler calls the VAST API to set bandwidth guarantees for the job’s NFS export. Prevents I/O-intensive jobs from starving latency-sensitive services.

  3. Checkpoint coordination: The checkpoint broker pre-allocates storage bandwidth windows to avoid I/O storms when many jobs checkpoint simultaneously.

VAST API Integration Points

| Operation | VAST API | When |
|---|---|---|
| Create export with QoS | POST /exports + QoS policy | Job starts |
| Query data locality | GET /catalog?path=… | Scheduling (data_readiness score) |
| Create snapshot | POST /snapshots | Job start (reproducibility) or checkpoint |
| Pre-stage from warm | POST /dataspace/prefetch | Job queued, data not on hot tier |
| Set bandwidth floor | PATCH /exports/{id}/qos | Job starts |
| Audit log query | GET /audit/logs?path=… | Compliance reporting |

Sensitive Storage Policy

vcluster: sensitive-secure
  storage_policy:
    encryption: aes-256-at-rest
    pool: dedicated               # separate VAST view/tenant
    wipe_on_release: true         # scrub after allocation ends
    access_logging: full          # every read/write logged
    data_sovereignty: "ch"        # data stays in Swiss jurisdiction
    retention:
      data: "as_specified_by_user"
      audit_logs: "7_years"
      tier_restriction: "hot_only"  # no unencrypted copies on warm/cold

Log Storage

Allocation logs are persisted to S3 alongside output data. See observability.md for the log storage layout:

s3://{tenant}/{project}/{alloc_id}/logs/
    ├── stdout/{node_id}/{chunk_000..N}.log.zst
    ├── stderr/{node_id}/{chunk_000..N}.log.zst
    └── metadata.json

Sensitive allocation logs are stored in the encrypted sensitive S3 pool with access logging enabled.

Node-Local Storage (Optional)

Nodes may have NVMe SSDs managed by the node agent. Local storage is not a hard requirement — nodes without NVMe operate with reduced performance but full functionality.

When NVMe is present:

  • Scratch: ephemeral, wiped between allocations. For temp files, staging.
  • Image cache: persistent across allocations. Caches uenv squashfs images and OCI layers.
    • LRU eviction policy
    • Cache hit avoids network pull from registry
    • Popular images stay warm automatically

When NVMe is absent:

  • Scratch: falls back to tmpfs (RAM-backed) or a network-mounted scratch directory. Capacity is limited by available RAM or network storage quota.
  • Image cache: no persistent local cache. Images are pulled from the registry on every allocation start (or served from a shared NFS cache if configured). Higher startup latency.
  • Allocations requesting the nvme_scratch feature constraint will not be scheduled on these nodes.

The node agent detects local storage at startup and reports its availability as part of node capabilities (features: ["nvme_scratch"]).

Data Staging & Cache Lifecycle

Design Principle

Data staging is invisible to users. The scheduler pre-stages data during queue wait time, manages node-local caches with bounded eviction, and coordinates storage bandwidth to prevent I/O storms. Users declare data requirements; the system handles placement.

This document extends data-plane.md with operational details for staging, caching, and eviction.

Pre-Staging Pipeline

Trigger

When an allocation enters the Pending state and declares data mounts with tier_hint: hot:

  1. Scheduler queries VAST API for data locality (GET /catalog?path=...)
  2. If data is on warm/cold tier: scheduler issues pre-stage request (POST /dataspace/prefetch)
  3. Allocation transitions to Staging state (visible to user via lattice status)
  4. When staging completes: allocation is eligible for scheduling

Staging During Queue Wait

Pre-staging runs concurrently with queue waiting. If the allocation reaches the front of the scheduling queue before staging completes:

| Scenario | Action |
|---|---|
| Staging complete | Schedule immediately |
| Staging >80% complete | Schedule, accept brief I/O stall at start |
| Staging <80% complete | Hold in queue; f₅ (data_readiness) penalizes scheduling |
| Staging failed | Retry up to 3 times, then alert user and keep in queue |
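These scheduling-time rules can be sketched as a small decision function (a sketch under the thresholds above; the names are illustrative, not the scheduler's actual API):

```rust
/// What to do when an allocation reaches the front of the queue while
/// pre-staging is still in progress.
#[derive(Debug, PartialEq)]
pub enum StagingDecision {
    ScheduleNow,
    ScheduleWithIoStall, // accept a brief I/O stall at start
    HoldInQueue,         // f5 (data_readiness) penalizes scheduling
    RetryOrAlert,        // retry up to 3 times, then alert the user
}

/// `fraction_staged` is in [0.0, 1.0]; `failed` marks a failed staging attempt.
pub fn staging_decision(fraction_staged: f64, failed: bool) -> StagingDecision {
    if failed {
        return StagingDecision::RetryOrAlert;
    }
    if fraction_staged >= 1.0 {
        StagingDecision::ScheduleNow
    } else if fraction_staged > 0.8 {
        StagingDecision::ScheduleWithIoStall
    } else {
        StagingDecision::HoldInQueue
    }
}
```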

Priority

Pre-stage requests are prioritized by:

  1. Estimated scheduling time (jobs closer to front of queue stage first)
  2. Data size (smaller datasets stage faster, unblock more jobs)
  3. Tenant fair share (tenants below their share get staging priority)

Bandwidth Coordination

The scheduler tracks aggregate staging bandwidth to avoid saturating the VAST system:

max_concurrent_staging_bandwidth = 0.3 × total_VAST_write_bandwidth

When the staging bandwidth limit is reached, additional staging requests are queued. This prevents staging from impacting running allocations’ I/O performance.
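A minimal admission-control sketch for this cap (the 0.3 factor is from the formula above; the struct and field names are invented for illustration):

```rust
/// Admit a pre-stage request only while aggregate staging bandwidth stays
/// under 30% of total VAST write bandwidth; otherwise the request queues.
pub struct StagingLimiter {
    total_write_bw_gbps: f64,
    in_flight_gbps: f64,
}

impl StagingLimiter {
    pub fn new(total_write_bw_gbps: f64) -> Self {
        Self { total_write_bw_gbps, in_flight_gbps: 0.0 }
    }

    fn cap(&self) -> f64 {
        0.3 * self.total_write_bw_gbps
    }

    /// Try to admit a staging request; returns false if it must queue.
    pub fn try_admit(&mut self, request_gbps: f64) -> bool {
        if self.in_flight_gbps + request_gbps <= self.cap() {
            self.in_flight_gbps += request_gbps;
            true
        } else {
            false
        }
    }

    /// Release bandwidth when a staging transfer completes.
    pub fn release(&mut self, request_gbps: f64) {
        self.in_flight_gbps = (self.in_flight_gbps - request_gbps).max(0.0);
    }
}
```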

Node-Local Image Cache

Nodes with NVMe SSDs use a dedicated partition for image caching (uenv SquashFS and OCI layers). Local storage is optional — nodes without NVMe pull images directly from the registry on every allocation start, or use a shared NFS-based cache if configured. The scheduler accounts for this via the nvme_scratch feature: jobs that benefit from local caching can request it as a constraint.

Cache Layout

/var/cache/lattice/
├── uenv/                     # SquashFS images
│   ├── prgenv-gnu_24.11_v1.squashfs
│   ├── pytorch_2.4_cuda12.squashfs
│   └── ...
├── oci/                      # OCI container layers
│   ├── sha256:<hash>/
│   └── ...
└── metadata.json             # Cache index: image → size, last_used, pin

Cache Parameters

| Parameter | Default | Description |
|---|---|---|
| cache_partition_size | 80% of NVMe (if present) | Reserved for image cache; ignored on nodes without NVMe |
| cache_high_watermark | 90% | Eviction starts when usage exceeds this |
| cache_low_watermark | 70% | Eviction stops when usage drops below this |
| min_free_space | 50 GB | Absolute minimum free space (overrides watermarks) |

Eviction Policy

LRU with pinning:

  1. When cache usage exceeds cache_high_watermark:
    • Evict least-recently-used images until usage drops below cache_low_watermark
    • Never evict images currently mounted by running allocations (pinned)
    • Never evict images marked as sticky by admin (base OS images, common frameworks)
  2. Eviction order: LRU by last mount time, largest images first among equally-old entries
  3. If eviction cannot free enough space (all images pinned or sticky): alert raised, staging for new allocations pauses on this node
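A simplified model of the LRU-with-pinning selection above (real cache entries carry more metadata; these names are illustrative):

```rust
/// One entry in the node-local image cache.
#[derive(Clone)]
pub struct CachedImage {
    pub name: String,
    pub size_gb: u64,
    pub last_used: u64, // monotonic timestamp; smaller = older
    pub pinned: bool,   // mounted by a running allocation
    pub sticky: bool,   // admin-marked (base OS images, common frameworks)
}

/// Returns names of images to evict until `need_gb` is freed: LRU first,
/// largest first among equally old entries. Pinned/sticky images are never
/// selected, so the plan may free less than requested.
pub fn plan_eviction(images: &[CachedImage], need_gb: u64) -> Vec<String> {
    let mut candidates: Vec<&CachedImage> =
        images.iter().filter(|i| !i.pinned && !i.sticky).collect();
    candidates.sort_by(|a, b| {
        a.last_used.cmp(&b.last_used).then(b.size_gb.cmp(&a.size_gb))
    });
    let mut freed = 0u64;
    let mut evict = Vec::new();
    for img in candidates {
        if freed >= need_gb {
            break;
        }
        freed += img.size_gb;
        evict.push(img.name.clone());
    }
    evict
}
```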

Cache-Full During Staging

If the node-local cache is full when a new allocation needs to pull an image:

  1. Check if eviction can free space → run eviction
  2. If eviction insufficient (all pinned): allocation’s prologue waits with backoff
  3. After 3 retries (5 minutes total): node marked as cache-full, scheduler avoids this node for allocations requiring uncached images
  4. Scheduler selects alternative nodes with cache space (or where the image is already cached)

Cache Warming

Administrators can pre-warm caches for anticipated workloads:

# Warm a uenv image on all nodes in a group
lattice cache warm --image=prgenv-gnu/24.11:v1 --group=3

# Warm on specific nodes
lattice cache warm --image=pytorch/2.4:cuda12 --nodes=x1000c0s0b0n0,x1000c0s0b0n1

Post-Reboot Cache Consistency

After a node reboot (nodes with NVMe only):

  1. Node agent reads metadata.json from the cache partition
  2. Validates each cached image (hash check against registry manifest)
  3. Images that fail validation are evicted
  4. Images that pass remain in cache (NVMe is persistent across reboots)
  5. Cache index rebuilt in ~seconds (metadata only, no full re-scan)

On nodes without NVMe, there is no persistent cache to recover — images are pulled fresh after reboot.

Allocation Data Lifecycle

Start (Prologue)

1. Node agent receives allocation assignment
2. Pull uenv image:
   a. Check node-local cache → hit: mount directly
   b. Cache miss: pull from registry → write to cache → mount
3. Mount data volumes:
   a. NFS mounts (home, shared data): mount with VAST QoS policy
   b. S3 mounts: FUSE or native S3 client
4. Create scratch directory: /scratch/local/{alloc_id}/ (NVMe) or /scratch/tmp/{alloc_id}/ (tmpfs/network)
5. Create output directory (S3): s3://{tenant}/{project}/{alloc_id}/
6. If checkpoint != none: create checkpoint directory

During Execution

  • NFS QoS maintained by VAST (bandwidth floor set at prologue)
  • Scratch is node-local NVMe (if available) or tmpfs/network scratch
  • Output is written to S3 (async, application-driven)
  • Checkpoint broker coordinates checkpoint writes to avoid bandwidth storms

End (Epilogue)

1. Processes terminated (completed, failed, or killed)
2. Flush pending log chunks to S3
3. Unmount uenv image (stays in cache for future use)
4. Unmount NFS volumes
5. Clean scratch: rm -rf /scratch/local/{alloc_id}/
6. Release VAST QoS policy
7. Sensitive: trigger secure wipe sequence (cross-ref: node-lifecycle.md)

Data Retention

| Data Type | Location | Retention |
|---|---|---|
| uenv images | Node-local cache | Until evicted (LRU) |
| Logs | S3 | Configurable (default: 30 days) |
| Checkpoints | S3 | Configurable (default: 7 days after completion) |
| Output | S3 | User-managed (not auto-deleted) |
| Scratch | NVMe or tmpfs | Deleted at allocation end |
| Debug traces | S3 | Short (default: 7 days) |
| Sensitive audit logs | Cold tier (S3) | 7 years |

Storage Tier Migration

Data automatically migrates between tiers based on access patterns:

Hot (VAST NFS+S3) → Warm (capacity S3) → Cold (archive S3)
     ↑ pre-stage        ↑ restore            ↑ retrieve
| Trigger | Direction | Mechanism |
|---|---|---|
| Allocation queued with tier_hint: hot | Warm → Hot | Scheduler-initiated pre-stage |
| Data untouched for 30 days | Hot → Warm | VAST policy-driven (automatic) |
| Data untouched for 90 days | Warm → Cold | Storage policy (automatic) |
| User request or allocation references cold data | Cold → Warm/Hot | Explicit retrieval (may take hours) |

Sensitive exception: Sensitive data on hot tier stays on hot tier (no automatic migration). tier_restriction: hot_only prevents copies on shared warm/cold tiers.

Cross-References

Federation Architecture

Design Principle

Federation is opt-in and sovereignty-first. The system is fully functional without it. When enabled, each site retains full control over its resources. The federation broker suggests, the local scheduler decides.

Feature Gate

Federation is compile-time optional via Rust feature flag:

# Cargo.toml (lattice-api)
[features]
default = []
federation = ["lattice-common/federation"]

When federation feature is disabled:

  • No Sovra dependency
  • No federation broker binary
  • No cross-site API endpoints
  • System operates as a standalone site

Trust Model: Sovra Integration

Sovra provides federated sovereign key management. Each site runs its own Sovra instance with its own root key.

Site A Sovra Instance              Site B Sovra Instance
├── Site A Root Key (sovereign)    ├── Site B Root Key (sovereign)
├── Workspace: "hpc-general"       ├── Workspace: "hpc-general"
│   (shared federation key)        │   (federated with Site A)
├── Workspace: "sensitive-ch"      └── Policy: Site B OPA rules
│   (hospital CRK, delegated)
└── Policy: Site A OPA rules

Sovra Federation Protocol (peer-to-peer, no central authority)

Key Management Principles

  1. Site root keys never leave the site. All cross-site authentication uses derived keys from shared workspaces.
  2. Federation is revocable. Revoking a shared workspace invalidates all cross-site tokens. Instant defederation.
  3. Sensitive keys are tenant-controlled. The hospital (data owner) holds the Customer Root Key. The operating site holds a delegated key. If the relationship ends, the hospital retains access.
  4. Audit logs are cryptographically signed. Each site signs its audit entries with its own key. Cross-site audit trails are verifiable by any party in the trust chain.

Federation Components

Federation Broker

A Go service that runs alongside the scheduler (when federation feature is enabled).

Responsibilities:

  • Advertises site capabilities to federated peers (available capacity, GPU types, energy prices, data locality)
  • Receives federated allocation requests from peer sites
  • Signs outbound requests with Sovra tokens
  • Verifies inbound requests against Sovra trust chain + OPA policy
  • Routes accepted requests into the local scheduling plane

Communication: gRPC over mTLS, with Sovra-signed metadata in request headers.

Federation Catalog

A read-mostly, eventually consistent shared catalog across federated sites:

| Content | Update Frequency | Consistency |
|---|---|---|
| Site capabilities (GPU types, node counts) | Hourly | Eventual |
| uenv image registry (cross-site name resolution) | On publish | Eventual |
| Dataset catalog (where data physically resides) | On change | Eventual |
| Tenant identity mapping (OIDC trust) | On federation setup | Strong (Sovra) |
| Energy prices per site | Every 15 minutes | Eventual |

Catalog Consistency and Staleness

The federation catalog is eventually consistent. Entries may be stale, missing, or outdated. The system must handle this gracefully:

Staleness bounds:

| Entry Type | Max Staleness | Effect of Stale Data |
|---|---|---|
| Site capabilities | 2 hours (hourly sync + margin) | May route job to site that no longer has capacity → remote rejection, retry locally |
| Energy prices | 30 minutes | May choose suboptimal site for energy cost → acceptable, not a correctness issue |
| Dataset catalog | Minutes (event-driven) | May not know data was moved → routing decision based on old location |
| uenv registry | Minutes (event-driven) | May reference image version not yet available at remote → prologue retry |

Handling completely stale entries:

If a peer site has not reported a catalog update within 2× the expected interval (e.g., no capability update in 2 hours):

  1. Federation broker marks the peer as stale in its local view
  2. Routing decisions deprioritize stale peers (not excluded, just scored lower)
  3. Alert raised: lattice_federation_peer_stale{peer="site-b"}
  4. If stale for > 24 hours: peer marked unreachable, excluded from routing
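The two thresholds above (2× the expected reporting interval, then 24 hours) can be sketched as a pure state function (names are assumptions, not the federation broker's actual types):

```rust
/// Local view of a federation peer, derived from catalog update recency.
#[derive(Debug, PartialEq)]
pub enum PeerState {
    Healthy,
    Stale,       // deprioritized in routing, alert raised
    Unreachable, // excluded from routing entirely
}

/// Classify a peer given seconds since its last catalog update and its
/// expected reporting interval (e.g. 3600s for hourly capability sync).
pub fn peer_state(secs_since_update: u64, expected_interval_secs: u64) -> PeerState {
    const DAY_SECS: u64 = 24 * 3600;
    if secs_since_update > DAY_SECS {
        PeerState::Unreachable
    } else if secs_since_update > 2 * expected_interval_secs {
        PeerState::Stale
    } else {
        PeerState::Healthy
    }
}
```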

Handling peer unavailability:

If a federated request fails (peer broker unreachable):

  1. First failure: retry with exponential backoff (1s, 2s, 4s, max 30s)
  2. After 3 retries: return failure to the user with explanation
  3. If --site=auto: fall back to local scheduling (no remote attempt)
  4. Peer marked as degraded in catalog; future requests deprioritize it
  5. Peer returns to healthy on next successful heartbeat/catalog sync

Cross-site uenv resolution:

uenv images are resolved via the federation catalog:

  1. User submits --uenv=prgenv-gnu/24.11:v1 targeting Site B
  2. Federation broker checks if Site B’s catalog includes this image
  3. If present: proceed (Site B has the image or can pull it)
  4. If absent: warn user and proceed (Site B may pull from a shared registry)
  5. If pull fails at Site B: prologue failure, allocation retried or failed per policy

Job Routing Logic

The federation broker’s routing decision is advisory, not mandatory:

Input: Allocation request from remote site (or local user targeting remote)
Output: Recommendation (run locally, run at site X, reject)

Factors:
1. Data gravity: where does the input data physically reside?
   → Strong bias toward running where data is
2. Compute availability: does the target site have capacity?
   → Check advertised capacity (may be stale)
3. Energy cost: which site has cheaper power right now?
   → Time-varying electricity prices from catalog
4. Tenant authorization: is this user allowed at the target site?
   → OPA policy check via Sovra-delegated credentials
5. Data sovereignty: can the data legally transit to the target site?
   → Sensitive data: check jurisdiction constraints

Decision: route to site with best composite score, or reject if no site qualifies
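A toy composite scorer for these factors (the weights and 0..1 factor scales are invented for illustration; note that authorization and sovereignty are hard gates, not weighted terms):

```rust
/// Per-site inputs to the routing decision.
pub struct SiteFactors {
    pub data_locality: f64,    // 1.0 = all input data already at the site
    pub capacity: f64,         // advertised free capacity fraction (may be stale)
    pub energy_cheapness: f64, // 1.0 = cheapest known power right now
    pub authorized: bool,      // OPA policy check via Sovra-delegated credentials
    pub sovereignty_ok: bool,  // jurisdiction constraints for sensitive data
}

/// Returns None if the site is ineligible; otherwise a score in [0, 1].
/// The data-gravity weight dominates, matching the strong bias described above.
pub fn route_score(f: &SiteFactors) -> Option<f64> {
    if !f.authorized || !f.sovereignty_ok {
        return None; // hard constraints are not trade-offs
    }
    Some(0.6 * f.data_locality + 0.25 * f.capacity + 0.15 * f.energy_cheapness)
}
```

The broker would compute this for every candidate site and recommend the highest-scoring one, or reject if every site returns None.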

Federated Allocation Flow

1. User at Site A submits: lattice submit --site=B train.sh
2. Site A lattice-api receives request, passes to federation broker
3. Federation broker:
   a. Signs request with Sovra token (Site A workspace key)
   b. Resolves target: Site B (explicit) or best-fit (if --site=auto)
   c. Forwards to Site B's federation broker
4. Site B federation broker:
   a. Verifies Sovra token (Site A is trusted peer)
   b. Checks OPA policy (user authorized, resources available)
   c. Injects allocation into Site B's scheduling plane
5. Site B local quorum manages allocation entirely
6. Status/logs available to user at Site A via federation catalog query
7. On completion: Site B reports results, Site A's user notified

Cross-Site Data Access

When a federated job runs at a remote site but needs data from the home site:

  • Small data (<1 GB): Fetched on demand via S3 over WAN
  • Medium data (1 GB - 1 TB): Pre-staged during queue wait via VAST DataSpace sync
  • Large data (>1 TB): Strong recommendation to run job at data’s home site
  • Sensitive data: Never transferred. Job must run at data’s home site. No exceptions.
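These size bands reduce to a small classification function (a sketch using the thresholds above; the enum names are illustrative):

```rust
const GB: u64 = 1 << 30;
const TB: u64 = 1 << 40;

/// How input data reaches a federated job running away from its home site.
#[derive(Debug, PartialEq)]
pub enum DataPlan {
    FetchOnDemand,     // small: S3 over WAN
    PreStage,          // medium: VAST DataSpace sync during queue wait
    RecommendHomeSite, // large: run the job where the data lives
    MustRunAtHomeSite, // sensitive: never transferred, no exceptions
}

pub fn data_plan(size_bytes: u64, sensitive: bool) -> DataPlan {
    if sensitive {
        return DataPlan::MustRunAtHomeSite;
    }
    if size_bytes < GB {
        DataPlan::FetchOnDemand
    } else if size_bytes <= TB {
        DataPlan::PreStage
    } else {
        DataPlan::RecommendHomeSite
    }
}
```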

Operational Considerations

Adding a Federation Peer

  1. Exchange Sovra workspace keys (out-of-band, verified by site admins)
  2. Configure federation broker with peer endpoint + workspace ID
  3. Define OPA policies for cross-site access
  4. Test with non-production allocations
  5. Enable in production

Removing a Federation Peer

  1. Revoke Sovra shared workspace
  2. All in-flight federated allocations continue to completion (or are cancelled by policy)
  3. Remove peer from federation broker config
  4. Immediate: no new federated requests accepted

Federation Requests During Leader Election

When the local Raft quorum is undergoing a leader election (typically 1-3 seconds):

  • Inbound federated requests from peer sites receive a 503 Service Unavailable with a Retry-After: 5 header
  • The federation broker does not queue inbound requests during election — the remote site’s retry logic handles resubmission
  • Outbound federated requests (local user targeting a remote site) are unaffected — routing and signing happen in the federation broker, not the quorum
  • If the election takes longer than 10 seconds (unusual): the federation broker marks the local site as degraded in catalog updates to peers

Cross-References

Interactive Sessions

Design Principle

Interactive sessions are allocations with a terminal. They reuse the standard allocation lifecycle with additional terminal protocol handling. Sessions are not a separate concept — they are bounded or unbounded allocations with an attached PTY as the primary interaction mode.

Global session tracking (F20): Sessions are now tracked in GlobalState via Raft-committed CreateSession/DeleteSession commands. This enables:

  • Global session limit enforcement: sensitive allocations limited to one concurrent session (INV-C2)
  • Session survival across API server restarts
  • Ownership verification at creation time (allocation must be Running, user must own it)

Session Creation

A session is created via POST /v1/sessions (or lattice session):

session:
  tenant: "ml-team"
  vcluster: "interactive"         # typically the interactive FIFO vCluster
  resources:
    nodes: 1                      # default: 1 node
    constraints:
      gpu_type: "GH200"
  lifecycle:
    type: "bounded"
    walltime: "4h"                # interactive sessions have walltime
  environment:
    uenv: "prgenv-gnu/24.11:v1"

Internally, the API server creates a standard Allocation with:

  • lifecycle.type = Bounded { walltime }
  • A flag indicating terminal should auto-attach on scheduling
  • Allocation state follows the normal lifecycle (Pending → Running → Completed)

Terminal Protocol

Connection Setup

1. Client connects: POST /v1/sessions → returns session_id + allocation_id
2. Allocation is scheduled (may wait in queue)
3. Once Running, client opens terminal: GET /v1/sessions/{id}/terminal (WebSocket upgrade)
4. WebSocket connection established to lattice-api
5. lattice-api opens gRPC bidirectional stream to the node agent
6. Node agent spawns PTY + user shell in allocation's mount/network namespace

Wire Protocol

The gRPC bidirectional stream carries framed messages:

Client → Server:

| Message Type | Content |
|---|---|
| StdinData | Raw bytes from client terminal |
| Resize | Terminal dimensions (rows, cols) |
| Signal | SIGINT, SIGTSTP, SIGHUP, SIGQUIT |
| Keepalive | Heartbeat (every 30s) |

Server → Client:

| Message Type | Content |
|---|---|
| StdoutData | Raw bytes from PTY (stdout + stderr merged) |
| ExitCode | Process exit code (terminal message) |
| Error | Error description (e.g., “allocation not running”) |

Initial Terminal Size

The client sends a Resize message as the first message after connection. The node agent configures the PTY with these dimensions. If no Resize is sent, defaults to 80x24.
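The real messages are protobuf frames on the gRPC stream; as an illustrative model only, the client message set and the PTY-size default could look like:

```rust
/// Illustrative framing of the client-to-server terminal messages.
pub enum ClientMsg {
    StdinData(Vec<u8>),
    Resize { rows: u16, cols: u16 },
    Signal(i32),
    Keepalive,
}

/// PTY dimensions (rows, cols): use the first Resize if one arrived,
/// otherwise the documented default of 80 columns x 24 rows.
pub fn initial_pty_size(first_msg: Option<&ClientMsg>) -> (u16, u16) {
    match first_msg {
        Some(ClientMsg::Resize { rows, cols }) => (*rows, *cols),
        _ => (24, 80),
    }
}
```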

Signal Handling

| Signal | Client Action | Server Action |
|---|---|---|
| SIGINT (Ctrl+C) | Send Signal(SIGINT) | Node agent sends SIGINT to foreground process group |
| SIGTSTP (Ctrl+Z) | Send Signal(SIGTSTP) | Node agent sends SIGTSTP to foreground process group |
| SIGHUP | Connection close | Node agent sends SIGHUP to session process group |
| SIGQUIT (Ctrl+\) | Send Signal(SIGQUIT) | Node agent sends SIGQUIT to foreground process group |
| SIGWINCH | Send Resize(rows, cols) | Node agent calls ioctl(TIOCSWINSZ) on PTY |

Session Lifecycle

Active Session

While the terminal is connected:

  • PTY output streams to client in real-time
  • Client input streams to PTY stdin
  • Keepalive every 30s to detect stale connections
  • Session remains active as long as the WebSocket is open AND the shell process is alive

Disconnect and Reconnect

Client disconnect (network drop, laptop close):

  1. WebSocket closes (or keepalive timeout: 90s)
  2. Node agent sends SIGHUP to the session’s process group
  3. Default behavior: processes receive SIGHUP and exit
  4. If the user’s shell ignores SIGHUP (e.g., tmux, screen):
    • Processes continue running in the background
    • User can reconnect: lattice attach <alloc_id>
    • Allocation walltime continues counting

Deliberate detach:

Users who want background sessions should use tmux or screen inside the session. Lattice does not implement a detach/reattach protocol — it delegates to proven tools.

Session Timeout

| Timeout | Default | Description |
|---|---|---|
| idle_timeout | 30 minutes | If no stdin for this duration, warn user. No auto-kill. |
| walltime | User-specified | Hard deadline. SIGTERM → SIGKILL → release. |
| keepalive_timeout | 90s | WebSocket keepalive. Missed → treat as disconnect. |

Idle warning: After idle_timeout, the terminal displays:

[lattice] Warning: session idle for 30 minutes. Walltime remaining: 3h 12m.

No automatic termination on idle — the user may be running a long computation.

Cleanup

When the session’s allocation reaches a terminal state (Completed, Failed, Cancelled):

  1. SIGTERM to all remaining processes
  2. Grace period (30s)
  3. SIGKILL
  4. Unmount uenv, release scratch, release nodes
  5. Session terminal sends ExitCode and closes WebSocket

Preemption During Active Session

When a session’s allocation is preempted while a terminal is connected:

  1. The checkpoint sequence begins (if checkpoint != None)
  2. The terminal remains connected during checkpointing — user sees normal output
  3. When checkpoint completes and the allocation transitions to Suspended:
    • Server sends a terminal message: [lattice] Allocation preempted. Session suspended. Use 'lattice attach <id>' to reconnect after rescheduling.
    • Server sends ExitCode(-1) and closes the stream
  4. When the allocation is rescheduled and resumes:
    • The user must manually reconnect: lattice attach <id>
    • The session starts a fresh shell (PTY state is not checkpointed)
    • Application state is restored from checkpoint (if the application supports it)

Multi-Node Sessions

For sessions requesting multiple nodes:

  • The terminal connects to the first node (node 0)
  • The user’s shell runs on node 0
  • Other nodes are accessible via ssh (intra-allocation, uses the network domain)
  • Or via lattice attach <alloc_id> --node=<node_id> (opens a second terminal to a specific node)

Concurrent Attach

| Scenario | Allowed | Notes |
|---|---|---|
| Same user, multiple terminals | Yes | Multiple attach sessions to the same allocation |
| Different users (non-sensitive) | No | Only the allocation owner can attach |
| Different users (sensitive) | No | Only the claiming user; one session at a time |
| Same user, different nodes | Yes | Each attach targets a specific node |

Slurm Compatibility

| Slurm | Lattice | Notes |
|---|---|---|
| salloc -N2 | lattice session --nodes=2 | Creates session allocation |
| srun --jobid=123 --pty bash | lattice attach 123 | Attach to existing allocation |
| salloc then srun | lattice session then lattice launch | Session + task within allocation |

CLI Usage

# Create a session (waits for scheduling, then opens terminal)
lattice session --nodes=1 --walltime=4h --uenv=prgenv-gnu/24.11:v1

# Create with specific constraints
lattice session --nodes=2 --constraint=gpu_type:GH200 --walltime=8h

# Create in a specific vCluster
lattice session --vcluster=interactive --walltime=2h

# Attach to an existing session's allocation
lattice attach 12345

# Attach to a specific node
lattice attach 12345 --node=x1000c0s0b0n3

# Attach with a specific command (not the default shell)
lattice attach 12345 --command="nvidia-smi -l 1"

Cross-References

Sensitive & Regulated Workload Design

Threat Model

Sensitive workloads on shared HPC infrastructure face regulatory requirements (Swiss FADP, EU GDPR, potentially HIPAA for international collaboration). The design must be defensible to an auditor.

What we must prove:

  1. Sensitive data was only accessible to authorized users during processing
  2. No other tenant’s workload ran on the same physical nodes simultaneously
  3. Data was encrypted at rest and in transit
  4. All access was logged with user identity and timestamp
  5. Data was destroyed when no longer needed
  6. Data did not leave the designated jurisdiction

Isolation Model: User Claims Node

Unlike other vClusters where the scheduler assigns nodes, sensitive nodes are claimed by a specific user:

Dr. X authenticates via OIDC (institutional IdP)
  → Requests 4 nodes via lattice CLI: lattice submit --sensitive
  → Quorum records: nodes N1-N4 owned by user:dr-x, tenant:hospital-a
  → Strong consistency: Raft commit before any workload starts
  → OpenCHAMI boots N1-N4 with hardened sensitive image (if not already)
  → All activity on N1-N4 audited under dr-x's identity
  → When released:
    → Quorum releases node ownership (Raft commit)
    → OpenCHAMI wipes node (memory scrub, storage secure erase if NVMe present)
    → Node returns to general pool only after wipe confirmation

No clever optimization on sensitive nodes. If Dr. X claims 4 nodes at 9am and runs nothing until 2pm, those nodes sit idle. The cost is real and should be visible to the tenant’s accounting. But there is no co-scheduling, no borrowing, no time-sharing.

Concurrent Sensitive Claims

If two users simultaneously attempt to claim overlapping nodes:

  • First Raft commit wins. Node ownership is a strong consistency domain. The quorum serializes all claim requests via Raft.
  • The second claim request receives an OwnershipConflict error with a message identifying which nodes are already claimed and by which user.
  • The second user must select different nodes or wait for the first user to release.
  • There is no queueing or waitlist for sensitive node claims — they are immediate or rejected.
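The first-commit-wins rule above can be sketched as a pure apply step. This is an illustrative sketch with hypothetical names (`claim_nodes`, `OwnershipConflict.conflicts`), not the actual Raft state machine; the point is that because all claims are serialized through Raft, the apply function can check-and-set atomically and reject overlaps immediately:

```python
class OwnershipConflict(Exception):
    def __init__(self, conflicts):
        # conflicts: {node_id: owning_user} for every already-claimed node
        self.conflicts = conflicts
        super().__init__(f"nodes already claimed: {conflicts}")

def claim_nodes(ownership, user, nodes):
    """Apply a claim against the replicated ownership map.

    In Lattice this apply step runs inside the Raft state machine, so
    concurrent claims are totally ordered and exactly one can win.
    """
    conflicts = {n: ownership[n] for n in nodes if n in ownership}
    if conflicts:
        # Immediate rejection -- no queueing, no waitlist.
        raise OwnershipConflict(conflicts)
    for n in nodes:
        ownership[n] = user
    return sorted(nodes)
```

The caller sees either a committed claim or an `OwnershipConflict` naming the nodes and their current owner, matching the error behavior described above.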

OS Image

Sensitive nodes boot a hardened image via OpenCHAMI BSS:

  • Minimal kernel, no unnecessary services
  • Mandatory access control (SELinux/AppArmor enforcing)
  • No SSH daemon (all access via API gateway)
  • Encrypted swap (if any)
  • Audit daemon (auditd) logging all syscalls to audit subsystem
  • Node agent with audit mode telemetry enabled by default

Software Delivery

Sensitive allocations use signed uenv images only:

environment:
  uenv: "sensitive/validated-2024.1"  # curated, audited base stack
  sign_required: true                # image signature verified before mount
  scan_required: true                # CVE scan passed
  approved_bases_only: true          # can only use admin-approved base images

The uenv registry enforces:

  • Image signing (with Sovra keys or site-specific PKI)
  • Vulnerability scanning (integrated with JFrog/Nexus security scanning)
  • Approved base image list (maintained by site security team)
  • Audit log of all image pulls

Storage

Sensitive data lives in a dedicated storage pool:

storage_policy:
  pool: "sensitive-encrypted"          # dedicated VAST view/tenant
  encryption: "aes-256-at-rest"      # VAST native encryption
  access_logging: "full"             # every read/write logged via VAST audit
  wipe_on_release: true              # VAST secure delete on allocation end
  data_sovereignty: "ch"             # data stays in Swiss jurisdiction
  retention:
    data: "user_specified"           # user declares retention period
    audit_logs: "7_years"            # regulatory minimum
  tier_restriction: "hot_only"       # no copies on shared warm/cold tiers

Network Isolation

Sensitive allocations get a dedicated Slingshot VNI:

connectivity:
  network_domain: "sensitive-{user}-{alloc_id}"  # unique per allocation
  policy:
    ingress: deny-all-except:
      - same_domain                  # only processes in this allocation
      - data_gateway                 # controlled data ingress endpoint
    egress: deny-all-except:
      - data_gateway                 # controlled data egress

With Ultra Ethernet: network-level encryption (UET built-in) provides an additional layer without performance penalty.

Audit Trail

What is logged (strong consistency via Raft):

  • Node claim: user identity, timestamp, node IDs
  • Node release: user identity, timestamp, wipe confirmation
  • Allocation start/stop: what ran, which uenv image (with hash), which data paths
  • Data access: every file open/read/write (from eBPF audit telemetry)
  • API calls: every lattice-api call related to sensitive allocations
  • Checkpoint events: when, where, what was written
  • Attach sessions: user identity, start/end timestamps, target node, session recording reference
  • Log access events: who accessed logs, when, which allocation
  • Metrics queries: user identity, allocation queried, timestamp

Storage:

  • Append-only log (no deletions, no modifications)
  • Encrypted at rest (Sovra-managed keys if federation enabled, site PKI otherwise)
  • 7-year retention on cold tier (S3-compatible, immutable storage)
  • Cryptographically signed entries (tamper-evident)

Query Interface

The audit log is queryable via a dedicated API endpoint and CLI:

API:

GET /v1/audit/logs?user=dr-x&since=2026-03-01&until=2026-03-15
GET /v1/audit/logs?allocation=12345
GET /v1/audit/logs?node=x1000c0s0b0n0&since=2026-03-01
GET /v1/audit/logs?data_path=s3://sensitive-data/subject-001/

CLI:

lattice audit query --user=dr-x --since=2026-03-01 --until=2026-03-15
lattice audit query --alloc=12345
lattice audit query --node=x1000c0s0b0n0 --since=2026-03-01 --output=json

Scoping:

| Caller | Visible Scope |
|---|---|
| Claiming user | Own audit events only |
| Tenant admin (compliance reviewer) | All audit events for their tenant |
| System admin | All audit events |
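The scoping rules reduce to a single visibility predicate. A minimal sketch, assuming each audit entry carries `user` and `tenant` fields and the role names are as in the table (the function name and dict shape are illustrative, not the actual API):

```python
def visible(caller_role, caller_user, caller_tenant, entry):
    """Return True if an audit entry is visible to the caller."""
    if caller_role == "system_admin":
        return True                              # all events
    if caller_role == "tenant_admin":
        return entry["tenant"] == caller_tenant  # own tenant only
    # Claiming user: own events only.
    return entry["user"] == caller_user
```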

Indexing: Audit entries are indexed by:

  • User ID (primary query dimension for compliance reporting)
  • Allocation ID (all events for a specific allocation)
  • Node ID (all events on a specific node)
  • Timestamp (range queries, required for all queries)
  • Event type (filter by: claim, release, data_access, attach, etc.)

Performance targets:

| Query Scope | Expected Latency |
|---|---|
| Single allocation (any timeframe) | < 1s |
| Single user, 1-day range | < 2s |
| Single user, 30-day range | < 10s |
| Tenant-wide, 1-day range | < 30s |

Queries spanning more than 90 days may be served from cold tier (S3 archive) with higher latency (minutes).

Export: For regulatory submissions, audit logs can be exported as signed JSON bundles:

lattice audit export --user=dr-x --since=2026-01-01 --until=2026-06-30 --output=audit-report.json.sig

The export includes cryptographic signatures for tamper evidence.

Observability Constraints

Every user-facing observability feature has sensitive-specific restrictions. The principle: observability must not weaken the isolation model.

Attach

  • Claiming user only. The user who claimed the nodes (identity verified against Raft audit log) is the only user permitted to attach. No delegation, no shared access.
  • Session recording. All attach sessions are recorded (input + output bytes) and stored at s3://sensitive-audit/{tenant}/{alloc_id}/sessions/{session_id}.recording (zstd-compressed, encrypted at rest, 7-year retention). The session recording reference is a Raft-committed audit entry.
  • Signed uenv only. Attach is only permitted when the allocation runs a signed, vulnerability-scanned uenv image. This prevents attaching to environments with unvetted tools.
  • No concurrent attach from different sessions. One active attach session per allocation at a time (prevents accidental data exposure via shared terminal).

Logs

  • Encrypted at rest. Logs from sensitive allocations are stored in the dedicated encrypted S3 pool (same as sensitive data).
  • Access-logged. Every log access (live tail or historical) generates an audit entry with user identity and timestamp.
  • Restricted access. Only the claiming user and designated compliance reviewers (via tenant admin role) can access logs.
  • Retention follows data policy. Log retention matches the allocation’s sensitive data retention policy, not the default log retention.

Metrics

  • Low sensitivity, still scoped. Metrics (GPU%, CPU%, I/O rates) do not contain sensitive data, but are still scoped to the claiming user. Tenant admins can view aggregated usage.
  • No cross-tenant visibility. Even system admins see sensitive allocation metrics only in aggregate (holistic view), not per-allocation detail.

Diagnostics

  • No cross-allocation comparison for sensitive. The CompareMetrics RPC rejects requests that include sensitive allocation IDs alongside non-sensitive ones. Comparison within a single sensitive tenant is permitted (same claiming user).
  • Network diagnostics scoped. Network diagnostics for sensitive allocations only show the allocation’s own VNI traffic, not fabric-wide metrics.

Profiling

  • Signed tools_uenv only. Profiling tools must be delivered via a signed, approved tools_uenv image. Users cannot load arbitrary profiler binaries.
  • Profile output stays in sensitive pool. All profiling output is written to the encrypted sensitive storage pool and is subject to the same access logging and retention policies.

Federation Constraints

Sensitive data does not federate by default:

  • Data stays at the designated site (data sovereignty)
  • Compute can theoretically federate (run at remote site), but only if:
    • Remote site meets the same compliance requirements
    • Data does not transit (remote compute accesses data via encrypted API, not bulk transfer)
    • Both sites’ Sovra instances have a sensitive workspace with hospital CRK
  • In practice: sensitive jobs run where the data is. Period.

Conformance Requirements

Sensitive nodes have strict conformance enforcement. Unlike general workloads where conformance is a soft preference, sensitive workloads treat configuration drift as a hard constraint:

  • Pre-claim validation. Before a node can be claimed for sensitive use, the scheduler verifies its conformance fingerprint matches the expected baseline for the sensitive vCluster. Drifted nodes are rejected.
  • Drift triggers drain. If a sensitive node’s conformance fingerprint changes during operation (e.g., a firmware update was missed), the node agent flags the drift. The scheduler will not assign new sensitive claims to the node until OpenCHAMI remediates it.
  • Audit trail. Conformance state changes on sensitive nodes are recorded in the Raft-committed audit log (which firmware/driver versions were active during the allocation).

This is deliberately conservative: sensitive workloads do not tolerate the subtle failures that configuration drift can cause, and regulatory compliance requires provable consistency of the execution environment.

Scheduler Behavior

The sensitive vCluster scheduler is intentionally simple:

  • Algorithm: Reservation-based (not knapsack). User claims nodes, scheduler validates and commits.
  • No backfill. Sensitive nodes are not shared.
  • No preemption. Sensitive allocations are never preempted.
  • No elastic borrowing. Sensitive nodes cannot be borrowed by other vClusters.
  • Fair-share: Not applicable (nodes are user-claimed, not queue-scheduled).
  • Conformance: Hard constraint — only nodes matching the expected conformance baseline are eligible.
  • Cost function weights: priority=0.90, conformance=0.10 (tiebreaker among conformant nodes; non-conformant nodes are excluded as a hard constraint at the solver level, not via the weight system), everything else near-zero.
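The hard-constraint-plus-tiebreaker structure can be sketched as follows. This is an interpretation of the weights above, not the actual solver code: non-conformant nodes are filtered out before scoring (the hard constraint), and the weighted sum only ranks the survivors. The `conformance_score` field is an assumed per-node tiebreak value:

```python
def score_sensitive(node, expected_fingerprint, priority,
                    w_priority=0.90, w_conformance=0.10):
    """Score a candidate node for the sensitive vCluster.

    Returns None for non-conformant nodes (hard constraint: excluded
    at the solver level, not merely down-weighted).
    """
    if node["fingerprint"] != expected_fingerprint:
        return None  # not eligible at all
    # Weights only break ties among the remaining conformant nodes.
    return w_priority * priority + w_conformance * node["conformance_score"]
```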

Accounting

Design Principle

Lattice schedules, Waldur accounts. Accounting is asynchronous and optional (feature-flagged like federation). Waldur unavailability never blocks scheduling.

What is Waldur

Waldur is a hybrid cloud orchestrator with HPC integration, accounting, billing, and self-service portal. It provides:

  • Resource usage tracking and billing
  • Project-level budget management
  • Self-service quota requests
  • Invoice generation

Integration is via Waldur’s REST API.

Integration Pattern

Lattice ──async push──→ Waldur (accounting events)
Waldur ──API call──→ Lattice (quota updates)

Lattice pushes accounting events to Waldur asynchronously. Waldur can push quota updates back. The two systems are loosely coupled — neither depends on the other for core functionality.

Accounting Events

Events pushed from Lattice to Waldur:

| Event | Trigger | Payload |
|---|---|---|
| `allocation.started` | Allocation enters Running state | tenant, project, user, resources (nodes, GPUs, GPU type), estimated duration |
| `allocation.completed` | Allocation reaches terminal state | actual duration, GPU-hours consumed, exit status, storage bytes written |
| `allocation.checkpointed` | Checkpoint written | checkpoint storage consumed, checkpoint duration |
| `node.claimed` | Sensitive node claimed by a user | tenant, user, node IDs, claiming timestamp |
| `node.released` | Sensitive node released | tenant, user, node IDs, release timestamp, wipe confirmation |
| `quota.updated` | Waldur updates a tenant’s quota | new quota values (Waldur → Lattice direction) |

Events are timestamped and include the allocation ID for correlation.

Entity Mapping

| Lattice Entity | Waldur Entity | Notes |
|---|---|---|
| Tenant | Customer | 1:1 mapping |
| Project (within tenant) | Project | 1:1 mapping |
| vCluster | Offering | Each vCluster type is a service offering |
| Allocation | Order | Each allocation is a resource order |

Waldur API Endpoints Used

| Direction | Endpoint | Purpose |
|---|---|---|
| Lattice → Waldur | `POST /api/marketplace-orders/` | Report resource usage |
| Lattice → Waldur | `POST /api/invoices/{id}/items/` | Add billing line items |
| Waldur → Lattice | `GET /api/customers/{id}/quotas/` | Read project quotas |
| Waldur → Lattice | `PUT /api/v1/tenants/{id}` | Update tenant quotas in Lattice |

Authentication

Waldur API token is stored in a secrets manager (never in config files):

waldur:
  token_secret_ref: "vault://lattice/waldur-token"

The token is loaded at startup and refreshed on rotation. Cross-ref: security.md for secret management.

Failure Handling

Waldur unavailability must never block scheduling:

  1. Buffer: Accounting events are buffered in a bounded in-memory queue (default: 10,000 events)
  2. Persist: If the buffer fills, overflow events are persisted to disk (WAL-style append log)
  3. Replay: On Waldur reconnection, buffered and persisted events are replayed in order
  4. Alert: If the disk buffer exceeds a threshold (default: 100,000 events), an alert is raised via scheduler self-monitoring (cross-ref: telemetry.md)
  5. Degrade gracefully: If both buffer and disk are full, events are dropped with a counter metric (lattice_accounting_events_dropped_total). Scheduling continues.
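The degradation chain (buffer → disk WAL → counted drop) can be sketched in a few lines. This is a simplified model, not the actual implementation: the on-disk append log is stood in by a plain list, and the defaults mirror the values above (10,000 in memory, 100,000 on disk):

```python
from collections import deque

class AccountingBuffer:
    """Sketch of the buffer -> disk WAL -> drop degradation chain."""

    def __init__(self, mem_max=10_000, disk_max=100_000):
        self.mem = deque()
        self.disk = []            # stands in for the on-disk append log
        self.mem_max, self.disk_max = mem_max, disk_max
        self.dropped_total = 0    # lattice_accounting_events_dropped_total

    def push(self, event):
        if len(self.mem) < self.mem_max:
            self.mem.append(event)
        elif len(self.disk) < self.disk_max:
            self.disk.append(event)   # overflow persisted, replayed later
        else:
            self.dropped_total += 1   # scheduling continues regardless

    def replay(self):
        """On Waldur reconnection: drain in order, oldest (memory) first."""
        out = list(self.mem) + self.disk
        self.mem.clear()
        self.disk.clear()
        return out
```

Note that `push` never blocks and never raises: the worst case is an incremented drop counter, which is exactly the "Waldur unavailability never blocks scheduling" property.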

Operational Response to Buffer Overflow

When the accounting buffer fills and events are dropped:

  1. Detect: lattice_accounting_events_dropped_total counter increments. Alert fires when > 0.
  2. Impact: Billing data is incomplete. GPU-hours and allocation events are missing from Waldur. This affects invoice accuracy but never affects scheduling.
  3. Respond:
    • Check Waldur availability (lattice admin accounting status)
    • If Waldur is down: wait for recovery. Buffered events will replay. Dropped events are lost.
    • If Waldur is up but slow: check push interval and batch size. Increase push_interval_seconds to allow larger batches.
  4. Recovery: Dropped events cannot be recovered from the accounting pipeline. However, the quorum has allocation state (start/end times, node assignments). An admin can reconstruct missing billing data from quorum logs:
    lattice admin accounting reconcile --since=2026-03-01 --until=2026-03-02
    
    This command reads allocation history from the quorum and generates compensating events for Waldur.
  5. Prevention: Size the buffer for expected Waldur outage duration. Rule of thumb: buffer_size = events_per_minute × max_expected_outage_minutes. For a busy cluster (100 events/min) and 2-hour outage target: buffer_size = 12000.
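The sizing rule of thumb is a single multiplication; a sketch with the worked example from above:

```python
def buffer_size(events_per_minute, max_expected_outage_minutes):
    """Rule of thumb: size the buffer to ride out the longest planned outage."""
    return events_per_minute * max_expected_outage_minutes

# Busy cluster (100 events/min) with a 2-hour outage target:
buffer_size(100, 120)  # 12000
```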

Quota Feedback Loop

Waldur can act as the budget authority, updating Lattice tenant quotas:

  1. Waldur detects budget exhaustion (e.g., project spent its allocated compute hours)
  2. Waldur calls lattice-api: PUT /api/v1/tenants/{id} with reduced limits
  3. Lattice updates hard/soft quotas (cross-ref: quota-enforcement.md)
  4. Effect: tenant’s new allocations are blocked (hard quota) or deprioritized (soft quota)

Conversely, when a tenant purchases more compute:

  1. Waldur increases the tenant’s quota
  2. Lattice picks up the new limits
  3. Previously-starved allocations can now be scheduled

Sensitive Accounting

Sensitive allocations have additional accounting requirements:

  • All accounting events include the claiming user’s identity (not just tenant)
  • Idle node time (nodes claimed but no running allocation) is billable — Waldur receives node.claimed and node.released events
  • Accounting events for sensitive allocations are also written to the Raft-committed audit log (cross-ref: sensitive-workloads.md)
  • Waldur must retain sensitive billing records for 7 years (configured on the Waldur side)

Configuration

accounting:
  enabled: true                     # feature flag, default: false
  provider: "waldur"
  waldur:
    api_url: "https://waldur.example.com/api/"
    token_secret_ref: "vault://lattice/waldur-token"
    push_interval_seconds: 60       # batch push interval
    buffer_size: 10000              # in-memory event buffer
    disk_buffer_path: "/var/lib/lattice/accounting-wal"
    disk_buffer_max_events: 100000

When accounting.enabled is false, no accounting code runs and no Waldur dependency exists (same pattern as federation).

Cross-References

Slurm Migration

Design Principle

Migration from Slurm should be gradual and low-risk. Existing Slurm scripts should work with minimal changes via the compatibility layer. Users can adopt Lattice-native features incrementally. The goal is not perfect Slurm emulation — it’s a smooth on-ramp.

Migration Phases

Phase 1: Dual-Stack Operation

Run Lattice alongside Slurm on a subset of nodes. Users can submit to either system. This provides:

  • Side-by-side comparison of scheduling behavior
  • Gradual user migration with rollback to Slurm
  • Time to validate RM-Replay weight tuning

Phase 2: Compat-Mode Cutover

Move all nodes to Lattice. Users continue using sbatch/squeue via compatibility aliases. Slurm daemons are decommissioned.

Phase 3: Native Adoption

Users migrate scripts to native lattice CLI, adopting features not available in Slurm (reactive scaling, metric-driven autoscaling, DAG workflows, data staging hints).

Script Compatibility

Supported #SBATCH Directives

| Slurm Directive | Lattice Mapping | Notes |
|---|---|---|
| `--nodes=N` | `resources.nodes: N` | Exact match |
| `--ntasks=N` | Mapped to node count | nodes = ceil(N / tasks_per_node) |
| `--ntasks-per-node=N` | Passed as task config | Used by launcher |
| `--time=HH:MM:SS` | `lifecycle.walltime` | Exact match |
| `--partition=X` | `vcluster: X` | Partition name → vCluster name mapping |
| `--account=X` | `tenant: X` | Account → tenant mapping |
| `--job-name=X` | `tags.name: X` | Stored as tag |
| `--output=file` | Log path hint | Logs always go to S3; `--output` sets download path |
| `--error=file` | Log path hint | Same as `--output` |
| `--constraint=X` | `constraints.features` | Feature matching |
| `--gres=gpu:N` | `constraints.gpu_count` | Mapped to GPU constraint |
| `--exclusive` | Default behavior | Lattice schedules full nodes by default (ADR-007) |
| `--array=0-99%20` | `task_group` | Task group with concurrency limit |
| `--dependency=afterok:123` | `depends_on: [{ref: "123", condition: "success"}]` | DAG edge |
| `--qos=X` | `preemption_class` | QoS → priority mapping (configurable per site) |
| `--mail-user`, `--mail-type` | Not supported | Warn, skip |
| `--mem=X` | Not supported | Full-node scheduling; memory is not a constraint |
| `--cpus-per-task=N` | Not supported | Full-node scheduling |
| `--uenv=X` | `environment.uenv: X` | Lattice extension, not in Slurm |
| `--view=X` | `environment.view: X` | Lattice extension |
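The `--ntasks` mapping in the table is a ceiling division; a sketch (function name is illustrative):

```python
import math

def nodes_from_ntasks(ntasks, ntasks_per_node):
    """--ntasks mapping from the table: nodes = ceil(N / tasks_per_node)."""
    return math.ceil(ntasks / ntasks_per_node)

# 128 tasks at 4 per node fit exactly on 32 nodes;
# 129 tasks need a 33rd, partially filled node.
```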

Unsupported Directives

Directives that have no Lattice equivalent are handled gracefully:

Warning: #SBATCH --mem=64G ignored (Lattice uses full-node scheduling, memory is not constrainable)
Warning: #SBATCH --mail-user=user@example.com ignored (use `lattice watch` for event notifications)
Submitted allocation 12345

The submission succeeds — unsupported directives produce warnings, not errors. This is critical for migration: existing scripts should not fail because of irrelevant Slurm options.
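The warn-don't-fail policy can be sketched as a small filter in front of the directive mapper. This is an illustrative sketch (the dict and function names are hypothetical), showing only two of the unsupported directives from the table:

```python
# Unsupported #SBATCH directives and the reason shown in the warning.
UNSUPPORTED = {
    "--mem":       "Lattice uses full-node scheduling, memory is not constrainable",
    "--mail-user": "use `lattice watch` for event notifications",
}

def parse_directive(line, warnings):
    """Return the directive for further mapping, or None if skipped.

    Unsupported directives append a warning instead of raising, so
    existing Slurm scripts still submit successfully.
    """
    flag = line.removeprefix("#SBATCH ").split("=")[0]
    for prefix, reason in UNSUPPORTED.items():
        if flag == prefix:
            warnings.append(
                f"Warning: {line.removeprefix('#SBATCH ')} ignored ({reason})")
            return None          # skipped, submission continues
    return line                  # supported directives pass through to the mapper
```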

Conflicting Directives

| Conflict | Resolution |
|---|---|
| `--nodes=64` + `--ntasks=128` with `--ntasks-per-node=4` | `--nodes` takes precedence; ntasks-per-node used by launcher |
| `--exclusive` + `--mem=64G` | `--exclusive` is default; `--mem` ignored with warning |
| `--partition` not found | Error: vCluster "X" not found. Available: hpc-batch, ml-training, interactive |

Slurm Features Not Supported

These Slurm features have no Lattice equivalent and are not planned:

| Feature | Reason | Alternative |
|---|---|---|
| Job steps (`srun` within `sbatch`) | Lattice uses tasks within allocations | `lattice launch --alloc=<id>` |
| Hetjob (heterogeneous job) | Not yet designed | Submit separate allocations with DAG dependencies |
| Burst buffer (`#DW`) | DataWarp-specific | Use `data.mounts` with `tier_hint: hot` |
| GRES beyond GPU | Not needed (full-node scheduling) | Use `constraints.features` for non-GPU resources |
| Accounting (`sacctmgr`) | Waldur handles accounting | `lattice history` or Waldur portal |
| Reservations (`scontrol create reservation`) | Use sensitive claims for dedicated nodes | `lattice admin reserve` (future) |
| Licenses/resources (`--licenses=`) | Not applicable | Use `constraints.features` |
| Multi-cluster (`--cluster=`) | Use federation | `lattice submit --site=X` (if federation enabled) |

srun Within Allocations

Slurm users often use srun inside batch scripts to launch parallel tasks. In Lattice:

# Slurm pattern:
srun -n 256 ./my_mpi_program

# Lattice equivalent (inside a running allocation):
# Option 1: The entrypoint IS the parallel launch
# In the submission script, use the appropriate launcher directly:
mpirun -np 256 ./my_mpi_program
# or:
torchrun --nproc_per_node=4 ./train.py

# Option 2: Use lattice launch from another terminal
lattice launch --alloc=12345 -n 256 ./my_mpi_program

The compatibility layer translates srun to lattice launch when the compat aliases are active.

Environment Variables

Slurm sets many environment variables in jobs. Lattice provides equivalent variables:

| Slurm Variable | Lattice Variable | Description |
|---|---|---|
| `SLURM_JOB_ID` | `LATTICE_ALLOC_ID` | Allocation ID |
| `SLURM_JOB_NAME` | `LATTICE_JOB_NAME` | Job name (from tags) |
| `SLURM_NODELIST` | `LATTICE_NODELIST` | Comma-separated node list |
| `SLURM_NNODES` | `LATTICE_NNODES` | Number of nodes |
| `SLURM_NPROCS` | `LATTICE_NPROCS` | Number of tasks |
| `SLURM_ARRAY_TASK_ID` | `LATTICE_TASK_INDEX` | Task group index |
| `SLURM_ARRAY_JOB_ID` | `LATTICE_TASK_GROUP_ID` | Task group parent ID |
| `SLURM_SUBMIT_DIR` | `LATTICE_SUBMIT_DIR` | Submission directory |
| `SLURM_JOBID` | `LATTICE_ALLOC_ID` | Alias for compatibility |

For migration convenience, the compat layer can also set SLURM_* variables (configurable: compat.set_slurm_env=true). This is disabled by default to avoid confusion.

Partition-to-vCluster Mapping

Sites configure the mapping from Slurm partition names to Lattice vClusters:

# lattice-compat.yaml
partition_mapping:
  normal: "hpc-batch"
  debug: "interactive"
  gpu: "ml-training"
  long: "hpc-batch"        # multiple partitions can map to one vCluster
  sensitive: "sensitive-secure"
qos_mapping:
  low: 1
  normal: 4
  high: 7
  urgent: 9

Unmapped partition names produce an error with a list of available vClusters.
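A sketch of the lookup, including the error behavior for unmapped names (the function name is illustrative; the mapping mirrors the example config above, minus the sensitive entry):

```python
PARTITION_MAPPING = {
    "normal": "hpc-batch",
    "debug": "interactive",
    "gpu": "ml-training",
    "long": "hpc-batch",   # multiple partitions can map to one vCluster
}

def map_partition(name):
    """Resolve a Slurm partition name to a Lattice vCluster."""
    try:
        return PARTITION_MAPPING[name]
    except KeyError:
        # Unmapped partition: fail with the list of available vClusters.
        available = sorted(set(PARTITION_MAPPING.values()))
        raise ValueError(
            f'vCluster "{name}" not found. Available: {", ".join(available)}')
```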

Migration Checklist

For site administrators:

  • Deploy Lattice control plane alongside Slurm
  • Configure partition-to-vCluster mapping
  • Configure QoS-to-preemption-class mapping
  • Tune cost function weights using RM-Replay with production traces
  • Test representative batch scripts via compat layer
  • Validate accounting (Waldur) captures match Slurm sacct data
  • Train users on lattice CLI basics
  • Run dual-stack for 2-4 weeks
  • Migrate remaining users, decommission Slurm

For users:

  • Test existing scripts with lattice submit (compat mode parses #SBATCH)
  • Review warnings for unsupported directives
  • Replace srun in scripts with direct launcher commands (mpirun, torchrun)
  • (Optional) Migrate to native lattice CLI syntax for new workflows

Cross-References

Troubleshooting Guide

Allocation Stuck in Pending

Symptom: lattice status shows allocation in Pending for longer than expected.

Diagnosis:

# Check why the allocation isn't being scheduled
lattice status 12345 --verbose

| Verbose Output | Cause | Fix |
|---|---|---|
| waiting for quota headroom | Tenant hard quota (max_nodes or max_concurrent_allocations) exceeded | Cancel other allocations or request quota increase |
| no nodes matching constraints | No nodes with requested GPU type, features, or topology | Relax constraints (`--topology=any`), check `lattice nodes --state=ready` |
| data staging in progress | Input data being pre-staged from warm/cold tier | Wait (check progress with `lattice status 12345 --verbose`), or submit without `tier_hint: hot` |
| insufficient conformance group | Not enough nodes with matching conformance fingerprint for multi-node job | Reduce node count, or wait for OpenCHAMI to remediate drifted nodes |
| all suitable nodes occupied | Resources are busy; allocation is queued normally | Wait; check queue depth with `lattice status --state=pending` |
| soft quota penalty (low score) | GPU-hours budget nearly exhausted; allocation deprioritized | Request budget increase from tenant admin or Waldur portal |

Deeper investigation:

# Check scheduler cycle is running
lattice admin scheduler status --vcluster=hpc-batch

# Check if proposals are being rejected
lattice admin raft status

# View scheduling metrics
# (high proposal rejection rate may indicate race conditions or quota contention)

Scheduling Cycle Slow

Symptom: lattice_scheduling_cycle_duration_seconds p99 > 30s.

Diagnosis:

| Check | Command | What to Look For |
|---|---|---|
| Queue depth | `lattice status --state=pending --count` | > 500 pending allocations |
| Cost function time | Grafana: `lattice_scheduling_cost_function_duration_seconds` | Dominant component of cycle |
| Conformance group fragmentation | `lattice nodes -o wide \| sort -k7 \| uniq -c` | Many small groups |
| Topology solver | Grafana: cycle time breakdown | Multi-group spanning expensive |

Fixes:

| Cause | Fix |
|---|---|
| Too many pending allocations | Increase cycle interval to batch more proposals |
| Cost function slow | Check if custom metrics (f₅ data_readiness) are causing TSDB query delays |
| Conformance fragmented | Standardize firmware, or reduce w₉ for tolerant workloads |
| Topology solver | Reduce backfill depth, or allow `topology: any` for more jobs |

Node Stuck in Degraded/Down

Symptom: Node shows Degraded or Down in lattice nodes.

Diagnosis:

# Check node details
lattice nodes x1000c0s0b0n0

# Check heartbeat
# If heartbeat missing: node agent may be down or network partitioned
| State / Duration | Likely Cause | Fix |
|---|---|---|
| Degraded, < 2 min | Transient network blip | Wait; likely self-resolves |
| Degraded, > 5 min | Agent crash or network partition | SSH to node, check agent: `systemctl status lattice-agent` |
| Down | Agent not recovering | Check BMC via OpenCHAMI: `manta node status x1000c0s0b0n0` |
| Down, BMC unreachable | Hardware failure | Physical inspection required |

Recovery:

# If agent crashed, restart it
ssh x1000c0s0b0n0 systemctl restart lattice-agent

# If node needs reboot
lattice node disable x1000c0s0b0n0
# (coordinate with OpenCHAMI for reboot)
lattice node undrain x1000c0s0b0n0  # after reboot + health check

Raft Commit Latency High

Symptom: lattice_raft_commit_latency_seconds p99 > 1s.

Diagnosis:

| Check | What to Look For |
|---|---|
| Disk I/O on quorum members | WAL write latency. Quorum members need fast SSD. |
| Network between quorum members | Packet loss or high latency between quorum nodes |
| Leader overloaded | Too many proposals per second |
| Log compaction | Snapshot in progress (one-time spike, normal) |

Fixes:

| Cause | Fix |
|---|---|
| Slow disk | Move WAL to dedicated NVMe SSD |
| Network latency | Ensure quorum members are on a low-latency network (same rack or switch) |
| Leader overload | Increase scheduling cycle interval to reduce proposal rate |
| Log too large | Reduce snapshot interval (more frequent snapshots = smaller log) |

Allocation Fails During Prologue

Symptom: Allocation moves from Running to Failed within seconds of starting.

Diagnosis:

lattice logs 12345
# Look for prologue errors:
#   "uenv pull failed: hash mismatch"
#   "mount failed: ENOSPC"
#   "NFS mount timeout"
| Error | Cause | Fix |
|---|---|---|
| Hash mismatch | Corrupted image in cache or registry | `lattice cache evict --image=... --node=...` and retry |
| ENOSPC | Node-local cache full, eviction couldn’t free space | Check cache status: `lattice cache status --node=...`. Evict unused images manually. |
| NFS mount timeout | VAST unavailable or network issue | Check VAST health. Check Slingshot storage traffic class. |
| Image not found | uenv name/version doesn’t exist in registry | Verify with `lattice cache status --node=...` or check the uenv registry directly |

Preemption Not Working

Symptom: Higher-priority allocation waiting despite lower-priority allocations running on suitable nodes.

Diagnosis:

lattice status 12345 --verbose
# Check if preemption is enabled for this vCluster
lattice admin vcluster show hpc-batch
| Cause | Fix |
|---|---|
| Pending job’s priority class ≤ running jobs’ class | Preemption only works downward. Check priority classes. |
| Running jobs are non-preemptible (checkpoint: none + high class) | Wait for them to complete |
| Running jobs are near completion (>90% walltime) | Scheduler avoids preempting near-completion jobs. Wait. |
| vCluster doesn’t allow preemption | Check vCluster config. Service vClusters only preempt borrowed nodes. |

Autoscaling Not Triggering

Symptom: Reactive allocation stays at min_nodes despite high metric value.

Diagnosis:

# Check current metric value
lattice top 12345 --metric=gpu_utilization

# Check scaling events
lattice status 12345 --verbose
| Cause | Fix |
|---|---|
| Metric below target | Scaling only triggers when metric > target for scale_up_window (2 min) |
| Cooldown period active | Recent scale event; wait for cooldown (3 min default) |
| TSDB query failing | Check `lattice_autoscaling_metric_query_failures_total` metric |
| Tenant quota exhausted | max_nodes reached; scale-up is a no-op |
| Metric name wrong | Verify metric exists in TSDB: `lattice top 12345 --metric=<name>` |

Sensitive Node Won’t Accept Claims

Symptom: Sensitive node claim rejected.

Diagnosis:

| Check | What to Look For |
|---|---|
| `lattice nodes <id>` | Is the node in Ready state? (Not Degraded, Down, Draining) |
| Conformance | Does the node’s conformance fingerprint match the sensitive baseline? |
| Pool size | Is the sensitive_pool_size quota exhausted? |
| Previous wipe | Was the node properly wiped after its last sensitive use? |

Fix:

# Check conformance
lattice nodes x1000c0s0b0n0 -o wide
# If drifted: coordinate with OpenCHAMI for remediation

# Check sensitive pool
lattice admin tenant show hospital-a --quotas
# If exhausted: release unused sensitive nodes or increase pool

Log Collection

When filing a bug report or escalating, collect:

# System overview
lattice admin raft status > diag/raft.txt
lattice nodes -o json > diag/nodes.json
lattice status --all -o json > diag/allocations.json

# Recent scheduler metrics (last hour)
lattice admin metrics dump --component=scheduler --duration=1h > diag/scheduler-metrics.json

# Specific node agent logs (if relevant)
ssh x1000c0s0b0n0 journalctl -u lattice-agent --since="1 hour ago" > diag/agent.log

Cross-References

Architecture Decision Records

Template

Each ADR follows this format:

  • Status: Proposed | Accepted | Superseded
  • Context: What is the problem?
  • Decision: What did we decide?
  • Consequences: What are the trade-offs?

ADR-001: Raft for Quorum Consensus

Status: Accepted

Context: The scheduler needs a distributed control plane that avoids single-point-of-failure (Slurm’s slurmctld problem). We need strong consistency for node ownership and sensitive audit, but the system schedules tens-to-hundreds of large allocations, not millions of microservices.

Decision: Use Raft consensus (via openraft crate) for the quorum. 3-5 replicas. Only node ownership changes and sensitive audit events go through Raft. Everything else is eventually consistent.

Consequences:

  • (+) No SPOF. Quorum tolerates minority failures.
  • (+) Raft is well-understood, battle-tested, good Rust implementations exist.
  • (+) Consistency latency (few ms per commit) is acceptable for our scheduling granularity.
  • (-) Operational complexity of running a Raft cluster (leader election, log compaction, membership changes).
  • (-) Write throughput limited by Raft commit latency. Not a problem at our scale.

ADR-002: Knapsack Scheduling with Composite Cost Function

Status: Accepted

Context: We need a scheduling algorithm that handles both HPC batch (topology-aware, fair-share) and cloud service (bin-packing, autoscale) workloads. Different vClusters need different optimization strategies.

Decision: Multi-dimensional knapsack formulation with a composite weighted cost function. Weights tunable per vCluster. Greedy solver with topology-aware backfill. Validated via RM-Replay simulator before production deployment.

Consequences:

  • (+) Unified framework for all workload types (just change weights).
  • (+) Cost function is extensible (add new factors without restructuring).
  • (+) RM-Replay provides safe testing of configuration changes.
  • (-) Weight tuning requires expertise and simulation. Not “plug and play.”
  • (-) Greedy solver is not globally optimal. Acceptable for our scale.

ADR-003: uenv-First Software Delivery

Status: Accepted

Context: Users need reproducible software environments. Options: full containers (Docker/Sarus), uenv (SquashFS mount namespaces), or module systems.

Decision: uenv is the default software delivery mechanism. Sarus for OCI containers when isolation is needed (multi-tenant node sharing, third-party images, sensitive with enhanced isolation). No module system.

Consequences:

  • (+) Near-zero runtime overhead (mount namespace, no container isolation overhead).
  • (+) Native GPU/Slingshot access without namespace workarounds.
  • (+) MPI “just works” — no network namespace translation.
  • (+) Proven at CSCS scale (Alps, 10,752 GH200 GPUs).
  • (-) Users must use curated uenv stacks or build their own (Spack/Stackinator).
  • (-) Weaker isolation than containers — fine for trusted HPC users, needs Sarus for untrusted workloads.

ADR-004: Two Strong Consistency Domains

Status: Accepted

Context: Strong consistency (Raft) has a performance cost. We need to minimize what goes through consensus while ensuring correctness for critical state.

Decision: Exactly two categories of state require strong consistency:

  1. Node ownership — which tenant/vCluster/allocation owns which nodes
  2. Sensitive audit log — all events related to sensitive node claims, data access, and isolation boundaries

Everything else (job queues, telemetry, quota accounting, session state) is eventually consistent.

Consequences:

  • (+) Minimal Raft throughput requirements (node ownership changes are infrequent).
  • (+) Sensitive compliance: audit trail is provably consistent and tamper-evident.
  • (+) Job queue staleness is bounded and self-correcting (rejected proposals retry next cycle).
  • (-) Eventual consistency means two vCluster schedulers might propose conflicting allocations. One gets rejected. This is a retry, not a bug.
  • (-) Quota accounting can lag. Hard limits enforced at quorum (node ownership), soft limits eventually.
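The two-domain split can be pictured as a routing decision over state writes. This is a hypothetical sketch; the variant and field names are not Lattice's real types.

```rust
// Illustrative routing of state writes into the two consistency domains
// (ADR-004). Variant and field names are hypothetical.

enum StateWrite {
    // Strong consistency: must go through Raft.
    NodeOwnership { node_id: u64, owner: String },
    SensitiveAudit { event: String },
    // Eventual consistency: bypasses Raft.
    JobQueued { alloc_id: u64 },
    Telemetry { node_id: u64, gpu_util: f64 },
}

/// Exactly two categories of state require consensus; everything else is
/// applied to local state and propagated asynchronously.
fn requires_consensus(w: &StateWrite) -> bool {
    matches!(
        w,
        StateWrite::NodeOwnership { .. } | StateWrite::SensitiveAudit { .. }
    )
}

fn main() {
    let w = StateWrite::NodeOwnership { node_id: 7, owner: "tenant-a".into() };
    println!("needs raft: {}", requires_consensus(&w));
}
```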

ADR-005: Federation as Opt-In via Sovra

Status: Accepted

Context: Multi-site operation is desirable but adds significant complexity. Not all deployments need it. The trust model for cross-site operation is a hard problem.

Decision: Federation is a compile-time feature flag. When disabled, no Sovra dependency and no cross-site code paths. When enabled, Sovra provides the cryptographic trust layer. Each site retains full sovereignty — federation broker suggests, local scheduler decides.

Consequences:

  • (+) Zero overhead when federation is not needed.
  • (+) Sovra’s sovereign key model aligns with institutional requirements (each site controls its keys).
  • (+) Revocable federation (revoke workspace → instant defederation).
  • (-) Additional infrastructure to operate (Sovra instances, federation brokers).
  • (-) Cross-site scheduling decisions are based on eventually consistent capacity data (may be stale).

ADR-006: Rust for Scheduler Core

Status: Accepted

Context: The scheduler is a long-lived, performance-critical, correctness-critical system. Options: Rust, Go, C++.

Decision: Rust for all performance-critical components (quorum, schedulers, node agent, API server, CLI, checkpoint broker). Go for infrastructure integration (OpenCHAMI, Sovra, federation broker). Python for user-facing SDK and tooling.

Consequences:

  • (+) Memory safety without GC pauses (critical for scheduler latency).
  • (+) Strong type system for modeling resource constraints (algebraic types for allocation states).
  • (+) Excellent async/concurrency (tokio) for handling many concurrent node agent connections.
  • (+) Single binary deployment for node agents (no runtime dependencies).
  • (-) Steeper learning curve for contributors.
  • (-) Slower initial development velocity vs. Go.
  • (-) Ecosystem for HPC is smaller than C/C++ (but growing).

ADR-007: Full-Node Scheduling with Intra-Node Packing

Status: Accepted

Context: Scheduling granularity: full nodes, fractional nodes, or both?

Decision: The scheduler reasons about full nodes. The node agent handles intra-node packing (multiple containers/uenvs on a single node) for workloads that don’t need a full node (interactive sessions, small Jupyter notebooks). This is a two-level scheme: scheduler assigns nodes to vClusters, node agent packs work within allocated nodes.

Consequences:

  • (+) Simplifies the scheduler (no cgroup negotiation between co-tenants).
  • (+) Predictable performance for large jobs (no noisy neighbor at scheduler level).
  • (+) Node agent can use simple bin-packing for small workloads.
  • (-) Potential waste for small workloads that get a full node unnecessarily. Mitigated by Sarus containers with resource limits for the interactive vCluster, and by grouping small workloads on designated “shared” nodes.

ADR-008: Asynchronous Accounting via Waldur

Status: Accepted

Context: Lattice needs external accounting and billing but should not depend on an accounting system for core scheduling functionality. Waldur provides HPC-aware accounting, billing, and self-service portal capabilities.

Decision: Integrate with Waldur as an optional, feature-flagged accounting provider. Lattice pushes accounting events (allocation started/completed, resource usage) to Waldur asynchronously. Waldur can push quota updates back to Lattice. Waldur unavailability never blocks scheduling. Events are buffered in memory and persisted to disk on overflow, replayed on reconnection.

Consequences:

  • (+) Clean separation of concerns: Lattice schedules, Waldur accounts.
  • (+) Zero scheduling impact from accounting failures (events are buffered).
  • (+) Waldur’s self-service portal gives tenant admins quota visibility without Lattice changes.
  • (+) Feature-flagged: zero overhead when accounting is not needed.
  • (-) Eventually consistent accounting data (events pushed at configurable interval, default 60s).
  • (-) Additional external dependency to operate (Waldur instance, API token management).
  • (-) Entity mapping (Tenant↔Customer, Project↔Project) must stay synchronized.

ADR-009: Two-Tier Quota Enforcement

Status: Accepted

Context: Quota enforcement must balance strictness (prevent over-allocation) with performance (don’t bottleneck scheduling on consensus). Some quotas are safety-critical (node counts), others are advisory (GPU-hours budgets).

Decision: Two-tier quota enforcement matching the two consistency domains (ADR-004):

  1. Hard quotas (quorum-enforced, strong consistency): max_nodes, max_concurrent_allocations, sensitive_pool_size. Checked during Raft proposal validation. Cannot be violated even momentarily.
  2. Soft quotas (scheduler-enforced, eventual consistency): gpu_hours_budget, node_hours_budget, fair_share_target, burst_allowance. Influence scheduling score but don’t hard-block. May temporarily overshoot during consistency window (~30s), self-correcting via fair-share scoring. When both GPU-hours and node-hours budgets are set, the worse utilization drives the penalty.

Consequences:

  • (+) Hard quotas are provably enforced (Raft consensus guarantees).
  • (+) Soft quotas don’t bottleneck scheduling (no consensus required for budget checks).
  • (+) Consistency window for soft quotas is acceptable (scheduling cycle is 5-30s, budget tracking is for billing not safety).
  • (+) Integrates cleanly with Waldur (ADR-008): Waldur updates quotas, Lattice enforces them.
  • (-) Soft quotas can temporarily overshoot (by design). Requires clear documentation that GPU-hours tracking is approximate.
  • (-) Two enforcement paths add complexity. Developers must know which tier a quota belongs to.
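The two tiers can be sketched as two separate checks: a boolean gate on the Raft proposal path and a penalty score on the scheduling path. Struct and field names here are illustrative, not Lattice's schema; note how the "worse utilization drives the penalty" rule becomes a max over budget utilizations.

```rust
// Sketch of two-tier quota enforcement (ADR-009). Names are illustrative.

struct HardQuota { max_nodes: u32 }
struct SoftQuota { gpu_hours_budget: Option<f64>, node_hours_budget: Option<f64> }
struct Usage { nodes_in_use: u32, gpu_hours: f64, node_hours: f64 }

/// Hard tier: validated during Raft proposal; a violation rejects the
/// proposal outright.
fn hard_check(q: &HardQuota, u: &Usage, requested_nodes: u32) -> bool {
    u.nodes_in_use + requested_nodes <= q.max_nodes
}

/// Soft tier: returns a scheduling penalty in [0, 1]. When both budgets
/// are set, the worse (higher) utilization drives the penalty.
fn soft_penalty(q: &SoftQuota, u: &Usage) -> f64 {
    let gpu = q.gpu_hours_budget.map(|b| u.gpu_hours / b);
    let node = q.node_hours_budget.map(|b| u.node_hours / b);
    let worst = [gpu, node].into_iter().flatten().fold(0.0f64, f64::max);
    worst.min(1.0)
}

fn main() {
    let u = Usage { nodes_in_use: 4, gpu_hours: 50.0, node_hours: 90.0 };
    let q = SoftQuota { gpu_hours_budget: Some(100.0), node_hours_budget: Some(100.0) };
    println!("penalty={:.2}", soft_penalty(&q, &u)); // node-hours (90%) dominates
}
```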

ADR-010: Native PMI-2 with Optional PMIx Sidecar

Status: Accepted

Context: Lattice replaces Slurm’s srun, which serves as both a process launcher (fan-out to nodes) and a PMI server (rank/key-value discovery for MPI). Without a PMI provider, multi-node MPI jobs fall back to SSH for process spawning (OpenMPI’s ORTE, MPICH’s Hydra). SSH between compute nodes is a security risk, conflicts with network-domain isolation, and is incompatible with the sensitive workload model. The system must support OpenMPI, MPICH, and Cray MPICH.

Three options were evaluated:

  1. Full PMIx server in Rust – PMIx v4/v5 defines ~200+ attributes, an enormous implementation surface, and no Rust implementation exists. Rejected: too much scope, too much risk.
  2. Embed OpenPMIx library via FFI – Battle-tested, full compatibility. But adds a heavy C dependency (~100K LOC), complex FFI, and still requires custom cross-node transport via gRPC.
  3. Native PMI-2 wire protocol – ~8 text commands over Unix domain socket. Implementable in ~1000-1500 lines of Rust. All three target MPI implementations support PMI-2 natively. The only cross-node operation (kvsfence) maps cleanly to gRPC between node agents.

Decision: Implement a native PMI-2 server in the node agent as the default process management interface. The node agent provides a Unix domain socket per launch, sets PMI_FD/PMI_RANK/PMI_SIZE, and handles cross-node KV exchange (fence) via gRPC between node agents. Optionally, for workloads requiring full PMIx (dynamic spawn, tools API, event notification), support an OpenPMIx sidecar process managed by the node agent, behind the pmix feature flag.

Consequences:

  • (+) No SSH between compute nodes. Eliminates an entire class of security and operational issues.
  • (+) No external C dependencies for the default path. PMI-2 is simple enough to implement and test in pure Rust.
  • (+) All three target MPI implementations (OpenMPI, MPICH, Cray MPICH) work with PMI-2 out of the box.
  • (+) Cross-node fence reuses the existing node-agent gRPC infrastructure (management network, mTLS).
  • (+) CXI credential management integrates naturally with existing VNI/network-domain lifecycle.
  • (+) PMIx available as opt-in for the ~5% of workloads that need it, without burdening the default path.
  • (-) PMI-2 does not support dynamic process spawning (MPI_Comm_spawn). Rare in HPC but used by some frameworks.
  • (-) OpenMPI users must set OMPI_MCA_pmix=pmi2 (or Lattice sets it automatically). Minor friction.
  • (-) PMIx sidecar mode adds a C dependency (OpenPMIx) and a host callback shim (~200 LOC C). Only needed when feature-flagged.
  • (-) Fence performance at extreme scale (>1000 nodes) requires tree-based reduction instead of star topology. Optimization deferred until needed.
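The "~8 text commands" claim rests on PMI-2's simple key=value command format. The sketch below parses a command body of that general shape; the exact wire framing, command names, and attribute keys are simplified assumptions here, not a complete PMI-2 implementation.

```rust
// Minimal sketch of parsing a PMI-2-style text command (illustrative;
// the framing and key names are simplified assumptions, not the full
// PMI-2 wire protocol).

use std::collections::HashMap;

/// Parse a semicolon-delimited "key=value" command body such as
/// "cmd=init;rank=3;size=8" into a map.
fn parse_cmd(body: &str) -> HashMap<String, String> {
    body.split(';')
        .filter(|kv| !kv.is_empty())
        .filter_map(|kv| {
            let (k, v) = kv.split_once('=')?;
            Some((k.to_string(), v.to_string()))
        })
        .collect()
}

fn main() {
    let m = parse_cmd("cmd=init;rank=3;size=8");
    println!("cmd={:?} rank={:?}", m.get("cmd"), m.get("rank"));
}
```

Each command arrives on the per-launch Unix domain socket; only the fence/KV-exchange commands require the cross-node gRPC hop described in the decision.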

ADR-011: Observability Data Out-of-Raft

Status: Accepted

Context: The system generates significant observability data: per-node telemetry (CPU, GPU, network, I/O), allocation logs (stdout/stderr), and metrics time series. This data must be queryable by users (dashboards, debugging) and by the scheduler (cost function factors like energy cost and data readiness). The question is where to store it.

Options:

  1. Raft state machine — guarantees consistency but creates enormous write load (thousands of metric points per second across hundreds of nodes). Raft commit latency becomes the bottleneck for telemetry ingestion.
  2. External TSDB + S3 — eventually consistent but decouples observability throughput from scheduling throughput. Standard tooling (Grafana, PromQL) works out of the box.
  3. In-memory ring buffers only — fast but volatile; node agent restart loses history; no cross-node aggregation.

Decision: Observability data is stored entirely outside the Raft state machine. Metrics go to an external TSDB (VictoriaMetrics). Logs are dual-path: ring buffer in the node agent for live streaming, S3 for persistent storage. The scheduler queries the TSDB for cost function inputs. Only sensitive audit events about observability actions (e.g., “user X attached to allocation Y”) flow through Raft consensus (per ADR-004).

Consequences:

  • (+) Raft throughput is reserved for what matters: node ownership and sensitive audit.
  • (+) Standard observability tooling (Grafana, PromQL) works without custom integration.
  • (+) Telemetry pipeline failures do not disrupt scheduling or allocation lifecycle.
  • (+) TSDB handles retention, downsampling, and high-cardinality queries natively.
  • (-) Metrics are eventually consistent (~30s lag). Scheduler cost function inputs may be slightly stale.
  • (-) TSDB is an additional infrastructure dependency to operate.
  • (-) Log persistence depends on S3 availability; brief gaps possible during S3 outages (ring buffer covers live access).

ADR-012: Allocation as Universal Work Unit

Status: Accepted

Context: The system must schedule both finite work (training runs, simulations, CI jobs) and infinite work (inference services, monitoring daemons, interactive notebooks). Slurm treats these as fundamentally different (jobs vs. “perpetual” jobs with workarounds). Kubernetes treats everything as a pod/deployment but lacks HPC scheduling semantics. We need a single abstraction that spans both worlds without losing scheduling precision.

Options:

  1. Two separate types (Job and Service) — clear semantics per type, but duplicates scheduling logic, quota enforcement, preemption policy, and API surface. Every feature must be implemented twice.
  2. Always bounded (Slurm model) — services require walltime workarounds (submit with max walltime, auto-resubmit). Clumsy and fragile.
  3. Always unbounded (K8s model) — batch jobs require explicit termination signals. Cannot express “run until completion” natively.
  4. Single type with lifecycle variants — one Allocation with lifecycle: Bounded | Unbounded | Reactive.

Decision: A single Allocation type is the universal work unit. The lifecycle field determines duration semantics: Bounded (has walltime, completes or is killed), Unbounded (runs until cancelled, auto-restarts on failure), Reactive (scales in response to metrics/load). All scheduling, quota, preemption, checkpoint, and telemetry policies operate on Allocations uniformly. Task Groups (Slurm job arrays) and DAGs (dependency graphs) compose Allocations.

Consequences:

  • (+) Unified scheduling: one cost function, one knapsack solver, one preemption engine for all workload types.
  • (+) Simpler API: users learn one submission model. Services and batch jobs differ only in lifecycle field.
  • (+) Quota and fair-share accounting is uniform — no special cases for services vs. jobs.
  • (+) DAG dependencies can mix bounded and unbounded allocations (e.g., training job → inference service).
  • (-) Lifecycle variants add complexity to the state machine (Bounded has walltime enforcement; Unbounded has restart policy; Reactive has scaling triggers).
  • (-) Users coming from Slurm must learn that “job” and “service” are the same thing with different lifecycle.
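The single-type-with-lifecycle-variants shape maps naturally onto a Rust enum. This is a hypothetical sketch of the idea, not Lattice's actual schema; field names are illustrative.

```rust
// Sketch of a single Allocation type with lifecycle variants (ADR-012).
// Field names are illustrative, not Lattice's actual schema.

use std::time::Duration;

enum Lifecycle {
    /// Has a walltime; completes on its own or is killed at the limit.
    Bounded { walltime: Duration },
    /// Runs until cancelled; restarted on failure.
    Unbounded { max_restarts: Option<u32> },
    /// Scales replicas in response to metrics/load.
    Reactive { min_replicas: u32, max_replicas: u32 },
}

struct Allocation {
    id: u64,
    nodes: u32,
    lifecycle: Lifecycle,
}

/// One policy question, answered uniformly for every workload type.
fn restart_on_failure(a: &Allocation) -> bool {
    matches!(a.lifecycle, Lifecycle::Unbounded { .. } | Lifecycle::Reactive { .. })
}

fn main() {
    let svc = Allocation { id: 1, nodes: 2, lifecycle: Lifecycle::Unbounded { max_restarts: None } };
    println!("restart: {}", restart_on_failure(&svc));
}
```

Every downstream policy (quota, preemption, checkpointing) can take an `&Allocation` and match on the lifecycle only where semantics genuinely differ.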

ADR-013: Network Domains via Hardware VNIs

Status: Accepted

Context: Multi-tenant HPC requires network isolation between allocations. On Slingshot/Ultra Ethernet fabrics, the NIC supports Virtual Network Identifiers (VNIs) that provide hardware-enforced L3 isolation. Alternative approaches exist in software.

Options:

  1. Software-based isolation (Linux network namespaces, iptables) — can be bypassed by privileged processes, adds per-packet overhead, difficult to audit at scale, incompatible with RDMA.
  2. No network isolation — all allocations share L2/L3. Unacceptable for multi-tenant security and sensitive workloads.
  3. Full overlay network (Kubernetes CNI model) — adds encapsulation overhead, incompatible with Slingshot fabric semantics, destroys RDMA performance.
  4. Hardware VNI isolation — Slingshot NIC enforces isolation at line rate, zero software overhead, auditable via fabric manager.

Decision: Network isolation is enforced at the Slingshot hardware level via VNIs. Each network domain maps to a VNI allocated from a managed pool. Allocations in the same domain share a VNI and have L3 reachability. Allocations in different domains are hardware-isolated. VNI assignment is eventually consistent (node agents configure NICs based on quorum-reported domain membership). Sensitive allocations get unique per-allocation domains with encrypted RDMA (Ultra Ethernet).

Consequences:

  • (+) Zero-overhead isolation — no per-packet software processing, RDMA performance preserved.
  • (+) Hardware-enforced — cannot be bypassed by user processes, even with root inside a container.
  • (+) Auditable via fabric manager — network domain membership is visible to operators.
  • (+) Naturally integrates with CXI credential management for MPI (ADR-010).
  • (-) Tied to Slingshot/Ultra Ethernet hardware. Non-Slingshot deployments need a software fallback.
  • (-) VNI pool is finite (default: 3095). Exhaustion blocks new domain creation.
  • (-) VNI configuration propagation to NICs adds latency to allocation startup (~50ms).
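The finite-pool consequence can be made concrete with a small allocator sketch. The pool size of 3095 comes from the ADR's stated default; the API shape and the choice to start at VNI 1 are assumptions of this sketch.

```rust
// Sketch of a finite VNI pool allocator (ADR-013). API shape is
// hypothetical; starting at VNI 1 is an assumption of this sketch.

use std::collections::BTreeSet;

struct VniPool {
    free: BTreeSet<u16>,
}

impl VniPool {
    fn new(size: u16) -> Self {
        Self { free: (1..=size).collect() }
    }

    /// Allocate the lowest free VNI, or None when the pool is exhausted —
    /// in which case new network-domain creation blocks.
    fn allocate(&mut self) -> Option<u16> {
        let vni = *self.free.iter().next()?;
        self.free.remove(&vni);
        Some(vni)
    }

    /// Return a VNI to the pool when its network domain is destroyed.
    fn release(&mut self, vni: u16) {
        self.free.insert(vni);
    }
}

fn main() {
    let mut pool = VniPool::new(3095);
    let vni = pool.allocate();
    println!("allocated VNI {vni:?}");
}
```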

ADR-014: Conformance Fingerprinting for Configuration Drift Detection

Status: Accepted

Context: Multi-node GPU workloads (distributed training, MPI simulations) are sensitive to configuration heterogeneity. Nodes with different GPU driver versions, NIC firmware, or kernel versions can cause subtle correctness issues (NCCL version mismatches, libfabric ABI incompatibilities) or performance degradation. Slurm has no built-in mechanism to detect this; operators discover it via user bug reports.

Options:

  1. No tracking — silent failures; users debug configuration drift themselves.
  2. Exact node-by-node attribute matching — too strict; every firmware update requires simultaneously updating all nodes or scheduling breaks.
  3. Conformance fingerprint (hash of driver/firmware/kernel) — nodes with identical fingerprints are grouped into cohorts; scheduler places multi-node jobs on same-cohort nodes.
  4. Scheduler-driven remediation — scheduler triggers firmware updates on non-conforming nodes. Out of scope; OpenCHAMI handles infrastructure.

Decision: Each node agent computes a conformance fingerprint (SHA-256 of GPU driver version, NIC firmware version, BIOS version, kernel version) and reports it with heartbeats. The quorum groups nodes into conformance cohorts. The cost function factor f₉: conformance_fitness penalizes multi-node allocations that would span cohorts. Allocations can set require_conformance: true to hard-require same-cohort placement. Conformance drift on sensitive nodes triggers immediate drain (not remediation — that’s OpenCHAMI’s job).

Consequences:

  • (+) Detects configuration drift before it causes user-visible failures.
  • (+) Soft by default (penalty, not hard block) — avoids scheduling starvation during rolling updates.
  • (+) Hard mode available for workloads that need it (require_conformance).
  • (+) Sensitive nodes get stricter enforcement (drain on drift) for compliance.
  • (-) Fingerprint granularity is coarse. Two nodes with different BIOS settings but same BIOS version have the same fingerprint.
  • (-) Multi-node jobs with require_conformance may wait longer for same-cohort nodes.
  • (-) Rolling firmware updates temporarily create many small cohorts, reducing scheduling flexibility.
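Cohort grouping falls out of hashing the tracked attributes. The ADR specifies SHA-256; this dependency-free sketch substitutes std's `DefaultHasher` purely for illustration, and the version strings are made up.

```rust
// Sketch of conformance fingerprinting (ADR-014). The ADR uses SHA-256;
// this sketch uses std's DefaultHasher to stay dependency-free.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Hash)]
struct NodeConfig<'a> {
    gpu_driver: &'a str,
    nic_firmware: &'a str,
    bios: &'a str,
    kernel: &'a str,
}

/// Nodes with equal fingerprints form a conformance cohort; the scheduler
/// penalizes multi-node allocations that would span cohorts.
fn fingerprint(c: &NodeConfig) -> u64 {
    let mut h = DefaultHasher::new();
    c.hash(&mut h);
    h.finish()
}

fn main() {
    let node = NodeConfig { gpu_driver: "535.104", nic_firmware: "2.1", bios: "1.7", kernel: "6.1.0" };
    println!("cohort fingerprint: {:x}", fingerprint(&node));
}
```

Note the coarseness consequence above is visible here: only the hashed fields contribute, so any attribute outside the struct (e.g. BIOS settings vs. BIOS version) cannot split a cohort.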

ADR-015: Attach via nsenter

Status: Accepted

Context: Users need interactive terminal access to running allocations for debugging, monitoring, and interactive workflows (equivalent to Slurm’s srun --pty bash into a running job). The question is how to provide this without compromising isolation or consuming scheduling resources.

Options:

  1. Create a new “attach” allocation on the same node — goes through the scheduler queue; consumes quota; adds latency; overkill for a debugging session.
  2. SSH into the compute node — requires SSH key distribution between login and compute nodes; security risk; incompatible with network domain isolation; operationally fragile.
  3. nsenter from node agent — the node agent enters the allocation’s mount/PID namespace via Linux nsenter; bidirectional gRPC stream provides the PTY. No new resource allocation, no SSH.
  4. Direct socket from user to container — requires host filesystem access; less secure; doesn’t work with uenv (no container to connect to).

Decision: Attach uses nsenter executed by the node agent. The user’s lattice attach <id> command opens a bidirectional gRPC stream to the API server, which forwards to the node agent hosting the allocation. The node agent spawns a shell inside the allocation’s namespace via nsenter. No new allocation is created, no quota is consumed, and no SSH is involved.

Consequences:

  • (+) Instant attach — no scheduler queue, no resource allocation.
  • (+) No SSH infrastructure needed on compute nodes.
  • (+) Works identically for uenv and Sarus allocations (both use Linux namespaces).
  • (+) Attach sessions are logged as observability events (sensitive: Raft-committed audit entry).
  • (-) Requires the node agent to have CAP_SYS_ADMIN / sufficient privileges for nsenter.
  • (-) Attach shares the allocation’s resource limits — a heavy debugging tool could impact the running workload.
  • (-) If the node agent is down, attach is unavailable (no fallback).
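The node agent's side of the attach path reduces to constructing an `nsenter` invocation against the allocation's lead process. The sketch below builds the argv only (no spawning); the flags shown are standard util-linux `nsenter` options, but Lattice's actual invocation may differ.

```rust
// Sketch of constructing the nsenter invocation for attach (ADR-015).
// Builds argv only; the real node agent would spawn this with a PTY.

/// Build the argv for entering the target process's mount and PID
/// namespaces and spawning a shell inside the allocation.
fn attach_argv(target_pid: u32, shell: &str) -> Vec<String> {
    vec![
        "nsenter".to_string(),
        format!("--target={target_pid}"),
        "--mount".to_string(), // enter the allocation's mount namespace
        "--pid".to_string(),   // enter its PID namespace
        "--".to_string(),
        shell.to_string(),
    ]
}

fn main() {
    println!("{:?}", attach_argv(4242, "/bin/bash"));
}
```

Because both uenv and Sarus allocations are ordinary Linux namespaces, the same argv shape covers both, which is the "works identically" consequence above.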

ADR-016: Two-Tier API (Intent API + Compatibility Layer)

Status: Accepted

Context: Lattice must serve two audiences: (1) new users and AI agents who benefit from a declarative, intent-based API (“I need 64 GPU nodes for 2 hours with this data”), and (2) existing Slurm users who have years of scripts using sbatch, squeue, scancel. Supporting both without maintaining two scheduling engines requires a clear layering decision.

Options:

  1. Single imperative API (Slurm-style) — familiar to HPC users but locks the system into Slurm’s abstractions (partitions, job steps, GRES). Cannot express reactive scaling or data staging intent.
  2. Single declarative API (Intent-only) — clean design but forces all existing users to rewrite scripts immediately. Migration barrier too high.
  3. Dual engines — one for Intent, one for Slurm compat. Code duplication, inconsistent scheduling behavior, unmaintainable.
  4. Two-tier: Intent API as primary, Compatibility API as thin mapping — Slurm commands are translated to Intent API calls. One scheduling engine, one state machine, one set of semantics.

Decision: The Intent API is the primary and only scheduling interface. The Compatibility API (sbatch, squeue, scancel and their lattice submit, lattice status, lattice cancel equivalents) is a stateless translation layer that maps Slurm directives to Intent API fields. All scheduling decisions, state transitions, and quota enforcement happen through the Intent API path. The compat layer produces warnings for unsupported directives but never errors (graceful degradation for migration).

Consequences:

  • (+) One scheduling engine, one code path, one set of tests.
  • (+) Gradual migration: existing scripts work on day one via compat layer.
  • (+) Intent API can evolve freely without Slurm compatibility constraints.
  • (+) AI agents use the Intent API directly — no impedance mismatch.
  • (-) Some Slurm features have no mapping (hetjob, burst buffer, GRES beyond GPU). Users get warnings.
  • (-) Compat layer must be maintained and tested against Slurm script variations.
  • (-) Users may stay on compat layer indefinitely, never adopting Intent API features.
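The compat layer's warn-never-error behavior can be sketched as a directive translation table. The Intent field names and directive subset here are hypothetical illustrations, not the real mapping.

```rust
// Sketch of the compat layer's directive translation (ADR-016): map known
// #SBATCH directives to Intent fields, warn (never error) on the rest.
// Intent field names are hypothetical.

#[derive(Default)]
struct Intent {
    nodes: Option<u32>,
    walltime: Option<String>,
}

/// Returns the translated intent plus warnings for unsupported directives.
fn translate(directives: &[(&str, &str)]) -> (Intent, Vec<String>) {
    let mut intent = Intent::default();
    let mut warnings = Vec::new();
    for &(key, val) in directives {
        match key {
            "--nodes" => intent.nodes = val.parse().ok(),
            "--time" => intent.walltime = Some(val.to_string()),
            // Graceful degradation: unknown directives warn, never error.
            other => warnings.push(format!("unsupported directive {other}: ignored")),
        }
    }
    (intent, warnings)
}

fn main() {
    let (_, warns) = translate(&[("--nodes", "4"), ("--licenses", "foo")]);
    for w in warns {
        println!("warning: {w}");
    }
}
```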

ADR-017: Eventual Consistency for Job Queues

Status: Accepted

Context: When a user submits an allocation, how quickly must the system guarantee that the submission is durable and schedulable? Raft consensus provides strong guarantees but adds latency (few ms per commit) and throughput limits. Job queues see bursts (hundreds of submissions in seconds during class assignments or automated pipelines).

Options:

  1. Synchronous Raft commit on every submission — strong guarantee but adds 10-100ms per submission, bottlenecks the API under burst load, scheduler throughput limited by Raft commit latency.
  2. Eventually consistent with bounded staleness — submission is acknowledged immediately (stored in-memory queue), committed to Raft asynchronously on the next scheduling cycle. Staleness bounded by scheduling cycle time (~5-30s).
  3. Optimistic with no retry — submissions may be silently lost on leader failover. Unacceptable.

Decision: Job queue state is eventually consistent. Allocation submissions are acknowledged immediately by the API server and placed in the vCluster scheduler’s in-memory queue. The scheduler proposes allocations to the quorum on each scheduling cycle; the quorum validates and commits node ownership (strong consistency). If the API server fails between acknowledgment and the next scheduling cycle, the submission is lost — but the user receives an allocation ID and can query status, which will show “not found” (detectable failure, not silent). In practice, the window is <30s and API server failures are rare.

Consequences:

  • (+) Submission API is fast (<5ms) regardless of Raft cluster health.
  • (+) Burst submissions don’t bottleneck on consensus.
  • (+) Scheduling cycle naturally batches proposals, reducing Raft commit count.
  • (-) Submissions can be lost on API server crash (between ack and next cycle). Mitigated by: client retries on “not found” status, and API server persistence to disk (WAL) as future enhancement.
  • (-) Two schedulers may independently queue the same submission if load-balanced. Deduplication by allocation ID at quorum level.

ADR-018: Scheduler-Coordinated Checkpointing

Status: Accepted

Context: Preemption requires evicting running allocations to free resources for higher-priority work. Killing allocations without warning wastes all computed progress. Checkpointing preserves progress but has cost: I/O bandwidth for writing state, compute time lost during checkpoint, and storage for checkpoint data. The question is who decides when to checkpoint.

Options:

  1. User-initiated checkpointing — user inserts checkpoint calls in their code. Does not solve the preemption problem (scheduler cannot wait for user to decide).
  2. Periodic automatic checkpointing (fixed interval) — simple but wasteful. Short intervals waste I/O on stable workloads; long intervals lose too much progress on preemption.
  3. Transparent checkpointing (DMTCP) without cost model — works for any application but causes I/O storms when many allocations checkpoint simultaneously. No way to prioritize which allocations to preempt.
  4. Scheduler-coordinated with cost function — scheduler evaluates checkpoint value vs. cost per allocation, decides when and which allocations to checkpoint for preemption.

Decision: Checkpointing is scheduler-coordinated. The cost function evaluates checkpoint_value = resource_freed × preemptability + backlog_relief vs. checkpoint_cost = write_time + compute_waste + storage_cost. The scheduler triggers checkpoints by sending CHECKPOINT_HINT to the node agent, which forwards to the application (via signal, shmem flag, or gRPC callback). Applications declare their checkpoint capability (signal, shmem, grpc, dmtcp, or none). Applications with none are either non-preemptible or killed without checkpoint. Backlog pressure increases checkpoint aggressiveness (more allocations waiting → more willing to preempt).

Consequences:

  • (+) Checkpoint decisions are globally optimal (scheduler has full visibility of queue, resources, priorities).
  • (+) Avoids I/O storms (scheduler staggers checkpoints across time and storage bandwidth).
  • (+) Backlog-responsive: system becomes more aggressive about freeing resources when demand is high.
  • (+) Applications retain control of checkpoint mechanics (signal handler, custom format).
  • (-) Applications must implement checkpoint support to benefit. Unsupported applications are either non-preemptible or lose progress.
  • (-) Cost function calibration requires tuning (write bandwidth, storage cost per GB).
  • (-) Checkpoint hint is advisory — application may take too long, forcing a hard kill after timeout.
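The value-versus-cost comparison from the decision can be written down directly using the terms the ADR names. Units and scaling here are illustrative; the real cost function presumably normalizes and weights these terms.

```rust
// Sketch of the checkpoint decision from ADR-018, using the value and
// cost terms named in the decision. Units and scaling are illustrative.

struct CheckpointEval {
    resource_freed: f64,  // e.g. node-hours made available by preempting
    preemptability: f64,  // priority-derived factor in [0, 1]
    backlog_relief: f64,  // grows with queue pressure
    write_time: f64,      // cost of writing checkpoint state
    compute_waste: f64,   // compute time lost during the checkpoint
    storage_cost: f64,    // storage for checkpoint data
}

/// checkpoint_value = resource_freed × preemptability + backlog_relief
/// checkpoint_cost  = write_time + compute_waste + storage_cost
fn should_checkpoint(e: &CheckpointEval) -> bool {
    let value = e.resource_freed * e.preemptability + e.backlog_relief;
    let cost = e.write_time + e.compute_waste + e.storage_cost;
    value > cost
}

fn main() {
    let e = CheckpointEval {
        resource_freed: 10.0, preemptability: 0.8, backlog_relief: 2.0,
        write_time: 1.0, compute_waste: 2.0, storage_cost: 0.5,
    };
    println!("checkpoint: {}", should_checkpoint(&e));
}
```

A rising `backlog_relief` term is what makes the system "backlog-responsive": with enough waiting demand, even costly checkpoints become worthwhile.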

ADR-019: Eventually Consistent Node Capacity

Status: Accepted

Context: The scheduler needs two kinds of information about nodes: (1) ownership — which tenant/vCluster/allocation owns the node, and (2) capacity — current health, GPU utilization, temperature, available memory. Ownership must be strongly consistent (ADR-004) to prevent double-assignment. But capacity data changes frequently (every heartbeat, ~10s) and is used for scoring, not for correctness.

Options:

  1. All node updates through Raft — ownership and capacity in one consistent view. But heartbeats every 10s × hundreds of nodes = thousands of Raft writes per minute. Commit latency becomes the scheduling bottleneck.
  2. All node updates eventually consistent — fast but ownership conflicts are possible. Two schedulers could assign the same node simultaneously.
  3. Split: ownership via Raft, capacity via eventual consistency — ownership changes are rare (scheduling cycles) and go through Raft. Capacity updates are frequent (heartbeats) and propagated via gossip or direct reporting.

Decision: Node ownership (tenant, vCluster, allocation assignment) is Raft-committed (strong consistency). Node capacity (health, utilization, temperature, conformance fingerprint) is eventually consistent — node agents report to the quorum leader, which updates in-memory state without Raft commit. The scheduler reads the latest reported capacity when scoring. Stale capacity data may cause suboptimal placement but never incorrect ownership.

Consequences:

  • (+) Heartbeats do not bottleneck Raft. Hundreds of nodes can report every 10s without consensus overhead.
  • (+) Scheduling cycle time is decoupled from Raft commit latency for capacity reads.
  • (+) Ownership consistency is preserved — double-assignment is impossible.
  • (-) Capacity staleness can cause suboptimal decisions (e.g., scheduling on a node whose GPU just failed but hasn’t reported yet). Bounded by heartbeat interval.
  • (-) Two levels of consistency require developers to know which fields are strong vs. eventual.

ADR-020: Sensitive Node Claims by User Identity

Status: Accepted

Context: Sensitive (regulated, high-security) workloads require provable isolation and audit trails that satisfy regulatory requirements (e.g., data protection laws, institutional compliance). The question is what identity is recorded as the “owner” of a sensitive node allocation: the tenant (organizational unit), a role, or the specific user.

Options:

  1. Tenant-owned — the organizational unit owns the nodes. Cannot prove which individual accessed which data. Insufficient for regulatory audit (“who accessed patient records?”).
  2. Role-based — a role (e.g., “researcher”) owns the nodes. Same problem: multiple users share a role; individual accountability is lost.
  3. User-owned (OIDC subject) — the authenticated user’s identity (from OIDC token) is recorded in the Raft-committed audit log as the owner. Every data access, attach session, and log retrieval is tied to a specific person.

Decision: Sensitive allocations are claimed by the authenticated user’s OIDC subject identifier, not by the tenant or a role. The quorum records the user identity in the Raft-committed audit log. All subsequent actions on the allocation (data access, attach, log retrieval) are logged with user identity. Nodes are wiped on release (OpenCHAMI secure erase) with wipe confirmation recorded in the audit log. Audit retention is 7 years.

Consequences:

  • (+) Individual accountability: every action is tied to a specific authenticated person.
  • (+) Regulatory defensibility: audit trail shows who claimed what, when, and what they did.
  • (+) Wipe-on-release with Raft-committed confirmation provides provable data destruction.
  • (+) 7-year retention satisfies most regulatory frameworks.
  • (-) User identity must be available at claim time (requires OIDC authentication, no service accounts for sensitive claims).
  • (-) Sensitive allocations cannot be transferred between users (the claim is to a specific identity).
  • (-) Wipe-on-release adds latency to node return-to-pool (10-30 minutes for secure erase).

ADR-021: Data Staging as Invisible Background Pre-stage

Status: Accepted

Context: Many HPC workloads require large datasets (TBs) that may reside on warm or cold storage tiers. If data is not on the hot tier (VAST NFS/S3) when the allocation starts, the first minutes of compute time are wasted on I/O. The question is when and how to move data to the hot tier.

Options:

  1. User-managed staging — user runs a separate staging job before the compute job. Shifts responsibility; users who forget waste compute time. Incompatible with multi-tenant fairness (staging time counted against user).
  2. Blocking inline staging — allocation starts, blocks on data transfer before running the entrypoint. User sees unpredictable startup latency. If staging fails, the allocation is stuck in a running-but-waiting state, consuming resources.
  3. Background pre-staging during queue wait — when an allocation is queued and declares data mounts with tier_hint: hot, the data mover begins warming data to the hot tier while the allocation waits in the queue. Queue wait time becomes productive.
  4. Post-allocation staging on compute nodes — wastes compute resources on I/O; saturates node-local network bandwidth.

Decision: Data staging runs as a background process during queue wait time. The allocation transitions through a Staging state where the data mover pre-stages declared data mounts from warm/cold to hot tier. The cost function factor f₅: data_readiness scores how ready an allocation’s data is: fully staged allocations score higher and are scheduled sooner. Allocations whose data is not yet ready can still be scheduled if resources are available (staging continues during prologue). Staging failure is non-fatal — the allocation starts with a warning, and the entrypoint may encounter I/O latency.

Consequences:

  • (+) Queue wait time is no longer wasted — data moves while the allocation waits.
  • (+) Users don’t need to manage staging manually; just declare data mounts.
  • (+) Scheduler can prioritize data-ready allocations, improving overall throughput.
  • (+) Non-blocking: staging failure degrades performance but doesn’t prevent execution.
  • (-) Adds complexity to the allocation state machine (Staging state, data mover integration).
  • (-) Hot tier must have capacity for pre-staged data. Over-staging wastes hot tier space.
  • (-) Cost function tuning: f₅ weight determines how much data readiness influences scheduling order.

ADR-022: Three-Layer Telemetry Pipeline

Status: Accepted

Context: The system needs telemetry for three consumers: (1) operators (dashboards, alerts), (2) users (debugging, performance analysis), and (3) the scheduler (cost function inputs: GPU utilization, network congestion, energy cost). Each has different resolution, latency, and retention requirements. The pipeline must handle hundreds of nodes producing thousands of metric points per second.

Options:

  1. In-memory ring buffers only — fast, low overhead. But volatile: node agent restart loses history. No cross-node aggregation for dashboards. Insufficient for scheduler feedback (requires historical trends).
  2. Direct eBPF-to-S3 pipeline — durable but high latency. No live metrics for dashboards. Raw data too granular for efficient query.
  3. Stream all metrics to Raft state machine — consistent but bloats the state machine. Raft commit latency becomes the telemetry bottleneck. Fundamentally wrong abstraction.
  4. Three-layer: collect (eBPF) → aggregate (configurable resolution) → store (external TSDB) — each layer optimized for its purpose.

Decision: Telemetry follows a three-layer pipeline. Layer 1: eBPF programs (always-on, <0.3% overhead) collect kernel-level metrics at high resolution. Layer 2: the node agent aggregates at configurable resolution (production: 30s bicubic smoothing, debug: 1s raw, audit: access logs). Layer 3: aggregated metrics are pushed to an external TSDB (VictoriaMetrics) for storage, query, and alerting. The scheduler queries the TSDB for cost function inputs. Users query the TSDB via Grafana or the `lattice top` / `lattice metrics` commands.

Consequences:

  • (+) Each layer is independently scalable and replaceable (swap TSDB, change eBPF programs, adjust resolution).
  • (+) eBPF collection is always-on with negligible overhead — no sampling trade-offs.
  • (+) Configurable resolution per use case: fine-grained for debugging, coarse for production.
  • (+) Standard tooling (Grafana, PromQL, AlertManager) works without custom integration.
  • (+) Telemetry pipeline failure does not affect scheduling (graceful degradation: stale cost function inputs).
  • (-) Three layers add operational complexity (eBPF programs, agent aggregation config, TSDB deployment).
  • (-) End-to-end latency from event to queryable metric is ~30s in production mode.
  • (-) eBPF programs require kernel version compatibility and CAP_BPF on nodes.

ADR-023: vCluster as Soft Isolation Boundary

Status: Accepted

Context: Different workload types need different scheduling policies: HPC batch needs backfill with topology packing, ML training needs fair-share with GPU affinity, services need bin-packing with autoscale, and sensitive workloads need dedicated reservations. A single scheduler cannot optimize for all of these simultaneously, but hard partitioning wastes resources when one workload type is idle while another is starved.

Options:

  1. Hard partitioning (dedicated node pools per workload type) — simple isolation but guaranteed waste. If the ML training pool is 50% idle and HPC batch is oversubscribed, resources sit unused.
  2. Single global scheduler with workload-type heuristics — no waste but cannot apply fundamentally different policies (backfill vs. bin-pack) simultaneously. Policy conflicts create unpredictable behavior.
  3. Opaque vClusters (cannot see each other) — avoids conflicts but makes cross-vCluster fairness impossible. Borrowing is non-deterministic because the lending vCluster doesn’t know its own utilization relative to others.
  4. Soft vClusters with global visibility — each vCluster has its own scheduler and cost function weights, but all schedulers see the global node ownership state via the quorum. Borrowing is explicit and policy-driven.

Decision: vClusters are soft isolation boundaries. Each vCluster has an independent scheduler instance with its own cost function weights (ADR-002) and scheduling algorithm (backfill, bin-pack, reservation, FIFO). All schedulers read the same global state from the quorum. vClusters have base allocations (guaranteed node counts) and can borrow from other vClusters with explicit priority and duration. Borrowed nodes are returned when the lending vCluster needs them (preemption of borrowed allocations at lower priority). The quorum enforces that proposals from different vCluster schedulers don’t conflict (node ownership is Raft-committed).

Consequences:

  • (+) Each workload type gets an optimized scheduler without one-size-fits-all compromises.
  • (+) No waste: idle resources in one vCluster are available to others via borrowing.
  • (+) Fair-share is globally visible: f₃ can compare a tenant’s usage across all vClusters.
  • (+) Borrowing is explicit and reversible: lending vCluster retains priority over its base allocation.
  • (-) Multiple schedulers proposing simultaneously can cause Raft proposal conflicts (one rejected, retried next cycle). Not a bug, but adds latency under contention.
  • (-) Borrowing policy configuration is complex (priority levels, max borrow duration, return grace period).
  • (-) Operators must understand that vClusters are not security boundaries — they are scheduling policy boundaries. Tenant isolation is provided by RBAC and network domains, not vClusters.

External References

Core Infrastructure Projects

OpenCHAMI

  • What: Open-source HPC system management platform (provisioning, boot, inventory)
  • Repo: https://github.com/OpenCHAMI
  • Docs: https://openchami.org
  • Components we integrate with: SMD (State Management Daemon), BSS (Boot Script Service), Magellan (Redfish discovery), OPAAL (auth), Cloud-init
  • Founded by: LANL, NERSC, CSCS, HPE, University of Bristol
  • Language: Go
  • Our integration: Infrastructure plane — Lattice queries SMD for node inventory, triggers BSS for boot image selection (e.g., sensitive hardened image), uses Magellan for hardware discovery

FirecREST

  • What: RESTful API gateway for HPC systems
  • Repo: https://github.com/eth-cscs/firecrest
  • Docs: https://firecrest.readthedocs.io
  • Our integration: Optional — Lattice authenticates directly via hpc-auth. FirecREST is needed only for hybrid Slurm deployments, where it serves as a passthrough compatibility gateway.

uenv

  • What: User environment tool for mounting SquashFS software stacks
  • Repo: https://github.com/eth-cscs/uenv
  • Related: https://github.com/eth-cscs/squashfs-mount (setuid mount binary), https://github.com/eth-cscs/slurm-uenv-mount (Slurm SPANK plugin)
  • Docs: https://docs.cscs.ch/software/uenv/using/
  • Key properties: SquashFS images, mount namespace isolation (per-process-tree), setuid binary (not FUSE), Spack-built stacks via Stackinator, multiple mount points (/user-environment, /user-tools)
  • Our integration: Software plane — node agent uses squashfs-mount to deliver uenv to allocations. We replace the Slurm SPANK plugin with native node agent integration.

Sarus

  • What: OCI-compliant container runtime for HPC
  • Repo: https://github.com/eth-cscs/sarus
  • Key properties: Near-native performance, direct GPU/interconnect access via OCI hooks, no network namespace overhead for MPI
  • Our integration: Software plane — used when full container isolation is needed (multi-tenant node sharing, third-party images, sensitive workloads with enhanced isolation)

Sovra

  • What: Federated sovereign key management for critical infrastructure
  • Repo: https://github.com/witlox/sovra
  • Docs: https://witlox.github.io/sovra/
  • Key properties: Peer-to-peer control planes, customer-controlled root keys, OPA-based policy, air-gap capable, cross-domain sharing
  • Language: Go
  • Our integration: Federation trust layer (optional, feature-gated). Provides cross-site authentication, sensitive data encryption key management, audit log signing.

Networking

Slingshot (HPE CXI)

  • What: HPE’s HPC interconnect, dragonfly topology
  • Key properties: Hardware traffic classes, VNIs for isolation, high-radix switches, RDMA
  • Scheduler relevance: Topology-aware placement (minimize inter-group hops), VNI-based network domains, separate traffic classes for compute/management/telemetry

Ultra Ethernet Consortium (UEC)

  • What: Open Ethernet-based networking stack for AI/HPC
  • Spec: https://ultraethernet.org (1.0 released June 2025)
  • Key properties: UET transport (native RDMA over Ethernet), packet spraying (adaptive multi-path), CSIG (in-band congestion signaling), built-in encryption, libfabric 2.0 API
  • Relationship to Slingshot: ~75% of UET derives from Slingshot transport. Migration path is evolutionary, not revolutionary.
  • Scheduler relevance: CSIG feeds into telemetry (congestion-aware scheduling), encryption simplifies sensitive compliance, libfabric abstraction enables fabric-agnostic scheduler

libfabric

  • What: Fabric abstraction library (provider-based: CXI for Slingshot, EFA for AWS, verbs for InfiniBand, UET for Ultra Ethernet)
  • Our integration: Network fabric abstraction. The scheduler and node agent interact with the network via libfabric, making the scheduler fabric-agnostic.

Storage

VAST Data Platform

  • What: All-flash unified storage (NFS + S3 + block), DASE architecture
  • Key properties: Multiprotocol (NFS + S3 native), RESTful API for everything, QoS per export, auto-indexing catalog, snapshots, DataSpace (global namespace with prefetch)
  • Scheduler integration: QoS setting at job start, data locality queries via Catalog API, pre-staging via DataSpace prefetch, snapshots for reproducibility, audit logs for sensitive compliance

IBM Storage Scale (GPFS)

  • What: Parallel file system with extensive management features
  • Key properties: Placement policies, AFM (async data management), filesets with quotas, watch/callback API, transparent cloud tiering
  • Scheduler integration: Alternative to VAST. Fileset-per-job for isolation, placement policies for workload-specific tuning, AFM for remote data staging.

Research Papers

CSCS Alps Architecture

  • Martinasso, Klein, Schulthess. “Alps, a versatile research infrastructure.” CUG 2025. arXiv:2507.02404
  • Alam, Gila, Klein, Martinasso, Schulthess. “Versatile software-defined HPC and cloud clusters on Alps supercomputer for diverse workflows.” IJHPCA 2023.
  • Martinasso et al. “Resource Elasticity for Scientific Platforms on HPC Infrastructure.” Springer 2025.

Scheduler Simulation

  • Martinasso, Gila, Bianco, Alam, McMurtrie, Schulthess. “RM-Replay: A High-Fidelity Tuning, Optimization and Exploration Tool for Resource Management.” SC18. Repo

Multi-Objective Scheduling

  • Simon, Nguyen, Halem. “Multiple Objective Scheduling of HPC Workloads Through Dynamic Prioritization.” Uses bounded fractional knapsack with dynamic priority scoring.
  • Goponenko. “Objective-Driven Strategies for HPC Job Scheduling.” UCF 2024. Comprehensive metrics for scheduling quality, I/O-aware backfill.

Energy-Aware Federation

  • “Power-Aware Scheduling for Multi-Center HPC Electricity Cost Optimization.” arXiv:2503.11011. GNN-based power prediction + multi-site scheduling, up to 18% energy cost reduction.

uenv Deployment

  • Coles et al. “Deploying Alternative User Environments on Alps.” CUG 2023. Details squashfs-mount, Slurm SPANK plugin, Spack stack building.

ML on HPC

  • CSCS. “Evolving HPC services to enable ML workloads on HPE Cray EX.” CUG 2025. arXiv:2507.01880. Container Engine, Environment Definition Files, gaps for ML users.