Deployment & Administration

Architecture Overview

A Lattice deployment consists of:

  • 3-5 quorum members — Raft consensus nodes running lattice-server
  • N compute nodes — each running lattice-agent
  • VictoriaMetrics (or compatible TSDB) — telemetry storage
  • S3-compatible storage — checkpoint and log persistence
  • VAST (optional) — data staging and QoS

Deployment Methods

Docker Compose (dev/test)

cd infra/docker
docker compose up -d

This starts a 3-node quorum with VictoriaMetrics. See infra/docker/docker-compose.yml.

Systemd (production)

Download binaries from GitHub Releases and install:

ARCH=$(uname -m | sed 's/aarch64/arm64/')

# Server (quorum members)
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-server-${ARCH}.tar.gz" | tar xz
sudo mv lattice-server /usr/local/bin/
sudo cp infra/systemd/lattice-server.service /etc/systemd/system/
sudo cp config/production.yaml /etc/lattice/config.yaml
sudo systemctl enable --now lattice-server

# Agent (compute nodes) — single binary per architecture, all GPU support included
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-agent-${ARCH}.tar.gz" | tar xz
sudo mv lattice-agent /usr/local/bin/
sudo cp infra/systemd/lattice-agent.service /etc/systemd/system/
sudo systemctl enable --now lattice-agent
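
The ARCH line above only rewrites aarch64; every other machine string passes through unchanged, which can be checked locally:

```shell
# Reproduce the ARCH mapping from the install snippet for both common cases.
for m in x86_64 aarch64; do
  echo "$m -> $(echo "$m" | sed 's/aarch64/arm64/')"
done
# prints:
#   x86_64 -> x86_64
#   aarch64 -> arm64
```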

Configuration

Example configs are in config/:

| File | Purpose |
|---|---|
| config/minimal.yaml | Single-node dev mode, no optional features |
| config/production.yaml | Full reference with all sections documented |

See the production config for every option with explanations.

Required Sections

  • quorum — Raft node ID, peers, data directory
  • api — gRPC and REST listen addresses
  • storage — S3 endpoint, NFS paths
  • telemetry — TSDB endpoint, aggregation mode
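
A skeleton combining the four required sections might look like this (the quorum keys follow the bootstrap example later on this page; the key names under api, storage, and telemetry are illustrative placeholders, so check config/production.yaml for the real ones):

```yaml
# Sketch of the four required sections. Keys under api/storage/telemetry
# are placeholders invented for this sketch; see config/production.yaml.
quorum:
  node_id: 1
  data_dir: /var/lib/lattice/raft
  peers: []                                 # single-node dev; see Quorum Management
api:
  grpc_listen: "0.0.0.0:50051"              # placeholder key name
  rest_listen: "0.0.0.0:8080"               # placeholder key name
storage:
  s3_endpoint: "https://s3.example.com"     # placeholder key name
telemetry:
  tsdb_endpoint: "http://victoria:8428"     # placeholder key name
```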

Optional Sections

  • node_agent — heartbeat timing, grace periods
  • network — VNI pool range for Slingshot
  • checkpoint — checkpoint evaluation and timeout tuning
  • scheduling — cycle interval, backfill depth
  • accounting — Waldur integration (requires accounting feature)
  • rate_limit — per-user API rate limiting
  • federation — Sovra cross-site federation (requires federation feature)
  • compat — Slurm compatibility settings
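
As a hypothetical illustration, enabling two of the optional sections could look like the fragment below (the key names inside each section are invented for this sketch; the real ones are documented in config/production.yaml):

```yaml
node_agent:
  heartbeat_interval: 5s    # invented key name
  grace_period: 30s         # invented key name
rate_limit:
  per_user_rps: 10          # invented key name
```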

Authentication & Authorization

Overview

Lattice authenticates three types of callers:

| Caller | Auth method | Token source |
|---|---|---|
| Humans (CLI) | OIDC (PKCE flow) → RS256 JWT | IdP (Keycloak, Dex) |
| Agents (node agent) | mTLS (production) or Bearer token (dev) | SPIRE SVID / bootstrap certs / LATTICE_AGENT_TOKEN |
| Services (AI/MCP) | OIDC (client_credentials) → RS256 JWT | IdP service account |

Server OIDC Configuration

api:
  oidc_issuer: "https://keycloak.example.com/realms/hpc"   # IdP discovery URL
  oidc_client_id: "lattice"                                 # Expected `aud` claim
  # oidc_hmac_secret: "dev-secret-only"                     # HMAC fallback (dev only)

| Config field | Env var | Purpose |
|---|---|---|
| api.oidc_issuer | (none) | OIDC provider URL. Enables JWKS (RS256/ES256) validation. |
| api.oidc_client_id | (none) | Expected aud claim. Returned by auth discovery endpoint. |
| api.oidc_hmac_secret | LATTICE_OIDC_HMAC_SECRET | Shared secret for HS256 validation (dev/testing/break-glass). |

Priority: JWKS (if oidc_issuer set) > HMAC (if secret set) > no auth (warning logged).

The auth discovery endpoint GET /api/v1/auth/discovery is public (no auth required) and returns {idp_url, client_id, issuer} so the CLI can bootstrap login.
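
The response shape follows the description above; the sed one-liner below is a dependency-free stand-in for a proper JSON parser such as jq, shown only to illustrate what the CLI extracts:

```shell
# Example discovery response (shape per the description above) and a
# shell-only way to pull out one field; prefer jq in real scripts.
resp='{"idp_url":"https://keycloak.example.com/realms/hpc","client_id":"lattice","issuer":"https://keycloak.example.com/realms/hpc"}'
client_id=$(echo "$resp" | sed 's/.*"client_id":"\([^"]*\)".*/\1/')
echo "$client_id"   # prints: lattice
```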

Roles

Role derivation checks OIDC scopes first, then cross-system role claims (pact_role, lattice_role). First match wins.

| Role | OIDC scope | Cross-system claim | Permissions |
|---|---|---|---|
| SystemAdmin | admin or system:admin | pact-platform-admin or system-admin | Unrestricted — all operations |
| TenantAdmin | tenant:admin | tenant-admin | Manage own tenant’s allocations, vClusters, quotas. Drain nodes. Query audit. |
| Operator | operator | operator | Drain/undrain/disable/enable nodes. Cannot create tenants or manage federation. |
| ClaimingUser | sensitive:claim | (none) | User + claim/release sensitive nodes |
| ReadOnly | readonly | (none) | GET/LIST/WATCH only, no mutations |
| User | (default — any authenticated user) | (none) | Submit/cancel own allocations, view nodes, create sessions |
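
The first-match-wins derivation can be sketched as a small shell function (illustrative only, not the server's actual code; it checks scopes before cross-system claims, as described above, and defaults to User):

```shell
# derive_role "<space-separated scopes>" "<cross-system claim>"
# Scopes are checked first; first match wins; default is User.
derive_role() {
  scopes=" $1 "; claim="$2"
  case "$scopes" in
    *" admin "*|*" system:admin "*) echo SystemAdmin; return ;;
    *" tenant:admin "*)             echo TenantAdmin; return ;;
    *" operator "*)                 echo Operator; return ;;
    *" sensitive:claim "*)          echo ClaimingUser; return ;;
    *" readonly "*)                 echo ReadOnly; return ;;
  esac
  case "$claim" in
    pact-platform-admin|system-admin) echo SystemAdmin ;;
    tenant-admin)                     echo TenantAdmin ;;
    operator)                         echo Operator ;;
    *)                                echo User ;;
  esac
}
derive_role "openid readonly" ""      # prints: ReadOnly
derive_role "openid" "tenant-admin"   # prints: TenantAdmin
```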

IdP Setup (Keycloak / Dex)

Configure your IdP to include the appropriate scopes in issued tokens:

Keycloak:

  1. Create client lattice with PKCE (Authorization Code) flow
  2. Create client scopes: admin, tenant:admin, operator, sensitive:claim, readonly
  3. Assign scopes to users/groups via role mappings
  4. For pact+lattice co-deployment: add pact_role as a custom claim in the token mapper

Dex:

staticClients:
  - id: lattice
    name: Lattice Scheduler
    redirectURIs: ['http://localhost:8400/callback']
    public: true   # PKCE, no client secret

Dex passes through upstream IdP claims. Configure pact_role / scopes in the upstream IdP (LDAP groups, SAML attributes, etc.).

Agent Authentication

Node agents authenticate to lattice-server for registration and heartbeats.

Production (mTLS): Agent acquires identity via the cascade: SPIRE → SelfSigned CA → Bootstrap certs. The gRPC channel uses ClientTlsConfig with the acquired cert/key/CA. Server verifies the client certificate.

# Bootstrap cert path (used until SPIRE is available)
lattice-agent \
  --quorum-endpoint=https://lattice-01:50051 \
  --bootstrap-cert=/etc/lattice/tls/agent.crt \
  --bootstrap-key=/etc/lattice/tls/agent.key \
  --bootstrap-ca=/etc/lattice/tls/ca.crt \
  ...

Dev/testing (Bearer token): When no mTLS identity is available, agent falls back to LATTICE_AGENT_TOKEN.

LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
  --quorum-endpoint=http://lattice-01:50051 \
  ...

| Env var | Purpose |
|---|---|
| LATTICE_AGENT_TOKEN | Bearer token for agent→server auth (dev/testing/break-glass) |
| LATTICE_SPIRE_SOCKET | SPIRE agent socket path (default: /run/spire/agent.sock) |
| LATTICE_BOOTSTRAP_CERT | Bootstrap cert PEM path |
| LATTICE_BOOTSTRAP_KEY | Bootstrap key PEM path |
| LATTICE_BOOTSTRAP_CA | Bootstrap CA PEM path |

mTLS takes priority. Token auth is the fallback. In production, leave LATTICE_AGENT_TOKEN unset.
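
That ordering can be sketched as follows (a simplification: the agent's real cascade also includes a self-signed CA step between SPIRE and bootstrap certs, omitted here):

```shell
# Pick an agent auth method: SPIRE socket, then bootstrap certs, then token.
# (Sketch only; the real cascade also tries a self-signed CA.)
auth_method() {
  if [ -S "${LATTICE_SPIRE_SOCKET:-/run/spire/agent.sock}" ]; then
    echo mtls-spire
  elif [ -f "${LATTICE_BOOTSTRAP_CERT:-}" ] && [ -f "${LATTICE_BOOTSTRAP_KEY:-}" ]; then
    echo mtls-bootstrap
  elif [ -n "${LATTICE_AGENT_TOKEN:-}" ]; then
    echo bearer-token
  else
    echo none
  fi
}
```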

Quorum Management

Initial Bootstrap

The first quorum member initializes the Raft cluster using the --bootstrap flag. This flag must only be passed once — on the very first startup of node 1. All subsequent restarts (including systemd restarts) omit it.

# First-ever start of node 1 — initializes the Raft cluster:
lattice-server --config /etc/lattice/server.yaml --bootstrap

# All subsequent restarts — no --bootstrap:
lattice-server --config /etc/lattice/server.yaml
# (or via systemd, which never passes --bootstrap)

Configure peers in each node’s config:

quorum:
  node_id: 1
  data_dir: /var/lib/lattice/raft
  peers:
    - id: 2
      address: "lattice-02:9000"
    - id: 3
      address: "lattice-03:9000"

Nodes 2 and 3 never need --bootstrap — they join via Raft membership replication from the leader.
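
For symmetry, node 2's config uses the same shape with its own ID and the other two members as peers (hostnames follow the example above):

```yaml
quorum:
  node_id: 2
  data_dir: /var/lib/lattice/raft
  peers:
    - id: 1
      address: "lattice-01:9000"
    - id: 3
      address: "lattice-03:9000"
```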

Raft Status

curl http://lattice-01:8080/api/v1/raft/status

Backup & Restore

# Create backup
curl -X POST http://lattice-01:8080/api/v1/admin/backup

# Verify backup integrity
curl http://lattice-01:8080/api/v1/admin/backup/verify

# Restore (requires restart)
curl -X POST http://lattice-01:8080/api/v1/admin/restore \
  -d '{"path": "/var/lib/lattice/backups/backup-20260305T120000Z.tar.gz"}'
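
The timestamp in the restore path above can be reproduced with date; note this naming scheme is only inferred from the example filename, not a documented contract:

```shell
# Build a backup filename matching the backup-<UTC timestamp>.tar.gz pattern
# seen in the restore example (naming inferred, not guaranteed).
name="backup-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
echo "$name"
```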

Node Management

Agent Registration

Agents register automatically on startup. Authentication uses mTLS (production) or Bearer token (dev/testing):

# Production: mTLS via bootstrap certs (SPIRE preferred when available)
lattice-agent \
  --node-id=nid001234 \
  --quorum-endpoint=https://lattice-01:50051 \
  --bootstrap-cert=/etc/lattice/tls/agent.crt \
  --bootstrap-key=/etc/lattice/tls/agent.key \
  --bootstrap-ca=/etc/lattice/tls/ca.crt \
  --gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512

# Dev/testing: Bearer token auth (no certs needed)
LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
  --node-id=nid001234 \
  --quorum-endpoint=http://lattice-01:50051 \
  --gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512

The agent tries the identity cascade (SPIRE → SelfSigned → Bootstrap) first. If no mTLS identity is available, it falls back to LATTICE_AGENT_TOKEN.

Draining Nodes

The drain lifecycle is: Ready → Draining → Drained → Ready.

# Drain a node (existing jobs complete, no new jobs scheduled)
lattice admin drain nid001234 --reason="maintenance"

# If no active allocations, node goes directly to Drained.
# If allocations are running, node stays in Draining until they complete.
# The scheduler loop automatically transitions Draining → Drained.

# Undrain (only works from Drained state)
lattice admin undrain nid001234

Undrain only works when the node is in Drained state. If the node is still Draining (allocations running), wait for them to complete or cancel them first.
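
The transitions above can be condensed into a tiny state table (a sketch, not the scheduler's actual implementation):

```shell
# next_state <current> <event>; unknown combinations keep the current state.
next_state() {
  case "$1:$2" in
    Ready:drain)        echo Draining ;;
    Draining:jobs_done) echo Drained ;;   # the scheduler loop performs this one
    Drained:undrain)    echo Ready ;;
    *)                  echo "$1" ;;
  esac
}
next_state Ready drain   # prints: Draining
```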

Node States

| State | Meaning |
|---|---|
| Ready | Available for scheduling |
| Draining | No new jobs; existing jobs continue |
| Drained | All allocations finished; eligible for undrain |
| Down | Heartbeat lost beyond grace period |
| Degraded | Heartbeat late but within grace period |
| Claimed | Reserved for sensitive workload |

Tenant Management

# Create a tenant
lattice admin tenant create --name="physics" --max-nodes=100

# List tenants
lattice admin tenant list

# Update quota
lattice admin tenant update physics --max-nodes=200

TLS Configuration

Server TLS

api:
  tls_cert: /etc/lattice/tls/server.crt
  tls_key: /etc/lattice/tls/server.key

Mutual TLS (mTLS)

api:
  tls_cert: /etc/lattice/tls/server.crt
  tls_key: /etc/lattice/tls/server.key
  tls_ca: /etc/lattice/tls/ca.crt  # Require client certificates
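
For local testing of the mTLS path, a throwaway CA and client certificate can be generated with openssl (30-day validity; self-signed material like this is for dev only, never production):

```shell
# Generate a throwaway CA and a client cert for dev mTLS testing.
openssl req -x509 -newkey rsa:2048 -nodes -days 30 \
  -subj "/CN=lattice-dev-ca" -keyout ca.key -out ca.crt
openssl req -newkey rsa:2048 -nodes \
  -subj "/CN=lattice-agent" -keyout agent.key -out agent.csr
openssl x509 -req -in agent.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 30 -out agent.crt
```

Copy the resulting files to the paths used in the config examples (e.g. /etc/lattice/tls/).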

Feature Flags

Compile-time features control optional integrations:

| Feature | Crate | Enables |
|---|---|---|
| oidc | lattice-api | JWT/OIDC token validation |
| accounting | lattice-api | Waldur billing integration |
| federation | lattice-api | Sovra cross-site federation |
| nvidia | lattice-node-agent | NVIDIA GPU discovery (nvml-wrapper) |
| rocm | lattice-node-agent | AMD GPU discovery (rocm-smi) |
| ebpf | lattice-node-agent | eBPF kernel telemetry (Linux only) |

Pre-built release binaries ship with all features enabled. GPU libraries are loaded at runtime — nodes without GPUs simply report no GPU hardware. To build from source:

# Server with all features
cargo build --release -p lattice-api --all-features

# Agent with all features
cargo build --release -p lattice-node-agent --all-features

Release Artifacts

| Artifact | Architecture | GPU Support |
|---|---|---|
| lattice-server-x86_64.tar.gz | x86_64 | n/a |
| lattice-server-arm64.tar.gz | arm64 | n/a |
| lattice-x86_64.tar.gz | x86_64 | n/a (CLI) |
| lattice-arm64.tar.gz | arm64 | n/a (CLI) |
| lattice-agent-x86_64.tar.gz | x86_64 | NVIDIA + AMD ROCm + eBPF |
| lattice-agent-arm64.tar.gz | arm64 | NVIDIA + AMD ROCm + eBPF |
| rm-replay-x86_64.tar.gz | x86_64 | n/a |
| rm-replay-arm64.tar.gz | arm64 | n/a |

GPU discovery is automatic at runtime. The agent detects available hardware and uses the appropriate provider:

| Hardware | Discovery Method | Runtime Dependency |
|---|---|---|
| NVIDIA (H100, A100, GH200) | nvml-wrapper (libnvidia-ml.so via dlopen) | NVIDIA driver installed |
| AMD (MI300X, MI250) | rocm-smi CLI | ROCm toolkit installed |
| CPU-only nodes | No GPU discovery runs | None |
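
A quick host-side check that mirrors this detection order (purely illustrative; the agent's actual discovery lives in the nvidia/rocm providers, not this script):

```shell
# Guess which GPU discovery path would apply on this host.
detect_gpu() {
  if ldconfig -p 2>/dev/null | grep -q libnvidia-ml; then
    echo nvidia
  elif command -v rocm-smi >/dev/null 2>&1; then
    echo rocm
  else
    echo cpu-only
  fi
}
detect_gpu
```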

GCP Test Cluster

For integration testing without production hardware:

# 1. Build Packer image (once, ~5 min)
cd infra/gcp/packer
packer build -var project_id=YOUR_PROJECT lattice-compute.pkr.hcl

# 2. Provision infrastructure (~2 min)
cd infra/gcp
terraform apply -var="project_id=YOUR_PROJECT" -var="use_packer_image=true"

# 3. Build + bundle binaries
cargo build --release --target x86_64-unknown-linux-gnu
./scripts/deploy/make-provision-bundle.sh target/x86_64-unknown-linux-gnu/release /tmp/lattice-provision.tar.gz

# 4. Deploy to nodes (SCP bundle + run install scripts)
# See scripts/deploy/install-quorum.sh and install-compute.sh

# 5. Run validation test matrix
./scripts/deploy/validate.sh http://QUORUM1_IP:8080 x1000c0s0b0n0,x1000c0s0b0n1

# 6. Teardown
cd infra/gcp && terraform destroy

The test cluster includes: 3 quorum nodes, 2 compute nodes (with podman + squashfs-tools), 1 OCI registry, 1 VictoriaMetrics. The validate.sh script runs 15 tests covering health, auth, submit, drain, restart, and validation.

Deploy scripts (scripts/deploy/install-*.sh) are reusable on-prem — no GCP-specific logic.