Deployment & Administration
Architecture Overview
A Lattice deployment consists of:
- 3-5 quorum members — Raft consensus nodes running lattice-server
- N compute nodes — each running lattice-agent
- VictoriaMetrics (or compatible TSDB) — telemetry storage
- S3-compatible storage — checkpoint and log persistence
- VAST (optional) — data staging and QoS
Deployment Methods
Docker Compose (dev/test)
cd infra/docker
docker compose up -d
This starts a 3-node quorum with VictoriaMetrics. See infra/docker/docker-compose.yml.
Systemd (production)
Download binaries from GitHub Releases and install:
ARCH=$(uname -m | sed 's/aarch64/arm64/')
# Server (quorum members)
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-server-${ARCH}.tar.gz" | tar xz
sudo mv lattice-server /usr/local/bin/
sudo cp infra/systemd/lattice-server.service /etc/systemd/system/
sudo cp config/production.yaml /etc/lattice/config.yaml
sudo systemctl enable --now lattice-server
# Agent (compute nodes) — single binary per architecture, all GPU support included
curl -sSfL "https://github.com/witlox/lattice/releases/latest/download/lattice-agent-${ARCH}.tar.gz" | tar xz
sudo mv lattice-agent /usr/local/bin/
sudo cp infra/systemd/lattice-agent.service /etc/systemd/system/
sudo systemctl enable --now lattice-agent
Configuration
Example configs are in config/:
| File | Purpose |
|---|---|
| config/minimal.yaml | Single-node dev mode, no optional features |
| config/production.yaml | Full reference with all sections documented |
See the production config for every option with explanations.
Required Sections
- quorum — Raft node ID, peers, data directory
- api — gRPC and REST listen addresses
- storage — S3 endpoint, NFS paths
- telemetry — TSDB endpoint, aggregation mode
Optional Sections
- node_agent — heartbeat timing, grace periods
- network — VNI pool range for Slingshot
- checkpoint — checkpoint evaluation and timeout tuning
- scheduling — cycle interval, backfill depth
- accounting — Waldur integration (requires accounting feature)
- rate_limit — per-user API rate limiting
- federation — Sovra cross-site federation (requires federation feature)
- compat — Slurm compatibility settings
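A minimal sketch combining the four required sections is shown below. The section names come from this document, but the individual keys under api, storage, and telemetry are illustrative guesses — config/production.yaml remains the authoritative reference for actual field names.

```yaml
quorum:                 # keys match the bootstrap example later in this doc
  node_id: 1
  data_dir: /var/lib/lattice/raft
  peers: []
api:                    # listen-address keys below are hypothetical
  grpc_listen: "0.0.0.0:50051"
  rest_listen: "0.0.0.0:8080"
storage:                # hypothetical keys for the S3 endpoint and NFS path
  s3_endpoint: "https://s3.example.com"
  nfs_path: /mnt/lattice
telemetry:              # hypothetical key for the TSDB endpoint
  tsdb_endpoint: "http://victoria:8428"
```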
Authentication & Authorization
Overview
Lattice authenticates three types of callers:
| Caller | Auth method | Token source |
|---|---|---|
| Humans (CLI) | OIDC (PKCE flow) → RS256 JWT | IdP (Keycloak, Dex) |
| Agents (node agent) | mTLS (production) or Bearer token (dev) | SPIRE SVID / bootstrap certs / LATTICE_AGENT_TOKEN |
| Services (AI/MCP) | OIDC (client_credentials) → RS256 JWT | IdP service account |
Server OIDC Configuration
api:
oidc_issuer: "https://keycloak.example.com/realms/hpc" # IdP discovery URL
oidc_client_id: "lattice" # Expected `aud` claim
# oidc_hmac_secret: "dev-secret-only" # HMAC fallback (dev only)
| Config field | Env var | Purpose |
|---|---|---|
| api.oidc_issuer | — | OIDC provider URL. Enables JWKS (RS256/ES256) validation. |
| api.oidc_client_id | — | Expected aud claim. Returned by auth discovery endpoint. |
| api.oidc_hmac_secret | LATTICE_OIDC_HMAC_SECRET | Shared secret for HS256 validation (dev/testing/break-glass). |
Priority: JWKS (if oidc_issuer set) > HMAC (if secret set) > no auth (warning logged).
The auth discovery endpoint GET /api/v1/auth/discovery is public (no auth required) and returns {idp_url, client_id, issuer} so the CLI can bootstrap login.
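As a sketch of what that bootstrap looks like on the CLI side, the snippet below parses a discovery response for the IdP URL. The JSON payload is an illustrative example, not a captured response; in practice you would fetch it with curl as shown in the comment.

```shell
# In a real deployment, fetch the public discovery document with:
#   curl -s http://lattice-01:8080/api/v1/auth/discovery
# The payload below is a hypothetical example of that response:
resp='{"idp_url":"https://keycloak.example.com/realms/hpc","client_id":"lattice","issuer":"https://keycloak.example.com/realms/hpc"}'

# Extract idp_url with sed so the example stays dependency-free (no jq):
idp_url=$(printf '%s' "$resp" | sed -n 's/.*"idp_url":[[:space:]]*"\([^"]*\)".*/\1/p')
echo "$idp_url"
```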
Roles
Role derivation checks OIDC scopes first, then cross-system role claims (pact_role, lattice_role). First match wins.
| Role | OIDC scope | Cross-system claim | Permissions |
|---|---|---|---|
| SystemAdmin | admin or system:admin | pact-platform-admin or system-admin | Unrestricted — all operations |
| TenantAdmin | tenant:admin | tenant-admin | Manage own tenant’s allocations, vClusters, quotas. Drain nodes. Query audit. |
| Operator | operator | operator | Drain/undrain/disable/enable nodes. Cannot create tenants or manage federation. |
| ClaimingUser | sensitive:claim | — | User + claim/release sensitive nodes |
| ReadOnly | readonly | — | GET/LIST/WATCH only, no mutations |
| User | (default — any authenticated user) | — | Submit/cancel own allocations, view nodes, create sessions |
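The first-match rule above can be sketched as a lookup over space-separated scope and claim lists. This is a hedged illustration of the precedence order, not the server's actual (Rust) implementation:

```shell
# Sketch of first-match role derivation: OIDC scopes are checked first,
# then cross-system role claims; the first hit wins, default is User.
derive_role() {
  scopes=" $1 "
  claims=" $2 "
  case "$scopes" in
    *" admin "*|*" system:admin "*) echo SystemAdmin; return;;
    *" tenant:admin "*)             echo TenantAdmin; return;;
    *" operator "*)                 echo Operator; return;;
    *" sensitive:claim "*)          echo ClaimingUser; return;;
    *" readonly "*)                 echo ReadOnly; return;;
  esac
  case "$claims" in
    *" pact-platform-admin "*|*" system-admin "*) echo SystemAdmin; return;;
    *" tenant-admin "*) echo TenantAdmin; return;;
    *" operator "*)     echo Operator; return;;
  esac
  echo User   # default for any authenticated caller
}

derive_role "openid tenant:admin" ""
```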
IdP Setup (Keycloak / Dex)
Configure your IdP to include the appropriate scopes in issued tokens:
Keycloak:
- Create client lattice with PKCE (Authorization Code) flow
- Create client scopes: admin, tenant:admin, operator, sensitive:claim, readonly
- Assign scopes to users/groups via role mappings
- For pact+lattice co-deployment: add pact_role as a custom claim in the token mapper
Dex:
staticClients:
- id: lattice
name: Lattice Scheduler
redirectURIs: ['http://localhost:8400/callback']
public: true # PKCE, no client secret
Dex passes through upstream IdP claims. Configure pact_role / scopes in the upstream IdP (LDAP groups, SAML attributes, etc.).
Agent Authentication
Node agents authenticate to lattice-server for registration and heartbeats.
Production (mTLS): Agent acquires identity via the cascade: SPIRE → SelfSigned CA → Bootstrap certs. The gRPC channel uses ClientTlsConfig with the acquired cert/key/CA. Server verifies the client certificate.
# Bootstrap cert path (used until SPIRE is available)
lattice-agent \
--quorum-endpoint=https://lattice-01:50051 \
--bootstrap-cert=/etc/lattice/tls/agent.crt \
--bootstrap-key=/etc/lattice/tls/agent.key \
--bootstrap-ca=/etc/lattice/tls/ca.crt \
...
Dev/testing (Bearer token): When no mTLS identity is available, agent falls back to LATTICE_AGENT_TOKEN.
LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
--quorum-endpoint=http://lattice-01:50051 \
...
| Env var | Purpose |
|---|---|
| LATTICE_AGENT_TOKEN | Bearer token for agent→server auth (dev/testing/break-glass) |
| LATTICE_SPIRE_SOCKET | SPIRE agent socket path (default: /run/spire/agent.sock) |
| LATTICE_BOOTSTRAP_CERT | Bootstrap cert PEM path |
| LATTICE_BOOTSTRAP_KEY | Bootstrap key PEM path |
| LATTICE_BOOTSTRAP_CA | Bootstrap CA PEM path |
mTLS takes priority. Token auth is the fallback. In production, leave LATTICE_AGENT_TOKEN unset.
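The selection order can be sketched as the checks below. This is an illustration of the precedence, not the agent's real (Rust) logic — in particular the self-signed CA step of the cascade is omitted here:

```shell
# Sketch: mTLS identities are preferred (SPIRE socket, then bootstrap
# cert+key), with the bearer token as the dev/testing fallback.
pick_agent_auth() {
  if [ -S "${LATTICE_SPIRE_SOCKET:-/run/spire/agent.sock}" ]; then
    echo "mtls (SPIRE SVID)"
  elif [ -f "${LATTICE_BOOTSTRAP_CERT:-}" ] && [ -f "${LATTICE_BOOTSTRAP_KEY:-}" ]; then
    echo "mtls (bootstrap certs)"
  elif [ -n "${LATTICE_AGENT_TOKEN:-}" ]; then
    echo "bearer token"
  else
    echo "none"
  fi
}

LATTICE_AGENT_TOKEN="eyJ..."   # illustrative value; dev/testing only
pick_agent_auth
```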
Quorum Management
Initial Bootstrap
The first quorum member initializes the Raft cluster using the --bootstrap flag. This flag must only be passed once — on the very first startup of node 1. All subsequent restarts (including systemd restarts) omit it.
# First-ever start of node 1 — initializes the Raft cluster:
lattice-server --config /etc/lattice/server.yaml --bootstrap
# All subsequent restarts — no --bootstrap:
lattice-server --config /etc/lattice/server.yaml
# (or via systemd, which never passes --bootstrap)
Configure peers in each node’s config:
quorum:
node_id: 1
data_dir: /var/lib/lattice/raft
peers:
- id: 2
address: "lattice-02:9000"
- id: 3
address: "lattice-03:9000"
Nodes 2 and 3 never need --bootstrap — they join via Raft membership replication from the leader.
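For example, node 2's config mirrors the one above with its own node_id and the other two members as peers (hostnames follow the same illustrative naming):

```yaml
quorum:
  node_id: 2
  data_dir: /var/lib/lattice/raft
  peers:
    - id: 1
      address: "lattice-01:9000"
    - id: 3
      address: "lattice-03:9000"
```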
Raft Status
curl http://lattice-01:8080/api/v1/raft/status
Backup & Restore
# Create backup
curl -X POST http://lattice-01:8080/api/v1/admin/backup
# Verify backup integrity
curl http://lattice-01:8080/api/v1/admin/backup/verify
# Restore (requires restart)
curl -X POST http://lattice-01:8080/api/v1/admin/restore \
-d '{"path": "/var/lib/lattice/backups/backup-20260305T120000Z.tar.gz"}'
Node Management
Agent Registration
Agents register automatically on startup. Authentication uses mTLS (production) or Bearer token (dev/testing):
# Production: mTLS via bootstrap certs (SPIRE preferred when available)
lattice-agent \
--node-id=nid001234 \
--quorum-endpoint=https://lattice-01:50051 \
--bootstrap-cert=/etc/lattice/tls/agent.crt \
--bootstrap-key=/etc/lattice/tls/agent.key \
--bootstrap-ca=/etc/lattice/tls/ca.crt \
--gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512
# Dev/testing: Bearer token auth (no certs needed)
LATTICE_AGENT_TOKEN="eyJ..." lattice-agent \
--node-id=nid001234 \
--quorum-endpoint=http://lattice-01:50051 \
--gpu-count=4 --gpu-type=GH200 --cpu-cores=72 --memory-gb=512
The agent tries the identity cascade (SPIRE → SelfSigned → Bootstrap) first. If no mTLS identity is available, it falls back to LATTICE_AGENT_TOKEN.
Draining Nodes
The drain lifecycle is: Ready → Draining → Drained → Ready.
# Drain a node (existing jobs complete, no new jobs scheduled)
lattice admin drain nid001234 --reason="maintenance"
# If no active allocations, node goes directly to Drained.
# If allocations are running, node stays in Draining until they complete.
# The scheduler loop automatically transitions Draining → Drained.
# Undrain (only works from Drained state)
lattice admin undrain nid001234
Undrain only works when the node is in Drained state. If the node is still Draining (allocations running), wait for them to complete or cancel them first.
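A wait loop for that transition can be sketched as below. The node_state function is a stand-in stub — this document does not show the exact command for querying a single node's state, so replace it with whatever query mechanism your deployment exposes:

```shell
# Stub for illustration only -- substitute a real node-state query here.
node_state() { echo "Drained"; }

# Poll until the node leaves Draining, then it is safe to undrain.
wait_until_drained() {
  node="$1"
  tries=0
  while [ "$(node_state "$node")" != "Drained" ]; do
    tries=$((tries + 1))
    if [ "$tries" -ge 60 ]; then
      echo "timeout"
      return 1
    fi
    sleep 10   # allocations may take a while to finish
  done
  echo "drained"
}

wait_until_drained nid001234
```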
Node States
| State | Meaning |
|---|---|
| Ready | Available for scheduling |
| Draining | No new jobs; existing jobs continue |
| Down | Heartbeat lost beyond grace period |
| Degraded | Heartbeat late but within grace period |
| Claimed | Reserved for sensitive workload |
Tenant Management
# Create a tenant
lattice admin tenant create --name="physics" --max-nodes=100
# List tenants
lattice admin tenant list
# Update quota
lattice admin tenant update physics --max-nodes=200
TLS Configuration
Server TLS
api:
tls_cert: /etc/lattice/tls/server.crt
tls_key: /etc/lattice/tls/server.key
Mutual TLS (mTLS)
api:
tls_cert: /etc/lattice/tls/server.crt
tls_key: /etc/lattice/tls/server.key
tls_ca: /etc/lattice/tls/ca.crt # Require client certificates
Feature Flags
Compile-time features control optional integrations:
| Feature | Crate | Enables |
|---|---|---|
| oidc | lattice-api | JWT/OIDC token validation |
| accounting | lattice-api | Waldur billing integration |
| federation | lattice-api | Sovra cross-site federation |
| nvidia | lattice-node-agent | NVIDIA GPU discovery (nvml-wrapper) |
| rocm | lattice-node-agent | AMD GPU discovery (rocm-smi) |
| ebpf | lattice-node-agent | eBPF kernel telemetry (Linux only) |
Pre-built release binaries ship with all features enabled. GPU libraries are loaded at runtime — nodes without GPUs simply report no GPU hardware. To build from source:
# Server with all features
cargo build --release -p lattice-api --all-features
# Agent with all features
cargo build --release -p lattice-node-agent --all-features
Release Artifacts
| Artifact | Architecture | GPU Support |
|---|---|---|
| lattice-server-x86_64.tar.gz | x86_64 | n/a |
| lattice-server-arm64.tar.gz | arm64 | n/a |
| lattice-x86_64.tar.gz | x86_64 | n/a (CLI) |
| lattice-arm64.tar.gz | arm64 | n/a (CLI) |
| lattice-agent-x86_64.tar.gz | x86_64 | NVIDIA + AMD ROCm + eBPF |
| lattice-agent-arm64.tar.gz | arm64 | NVIDIA + AMD ROCm + eBPF |
| rm-replay-x86_64.tar.gz | x86_64 | n/a |
| rm-replay-arm64.tar.gz | arm64 | n/a |
GPU discovery is automatic at runtime. The agent detects available hardware and uses the appropriate provider:
| Hardware | Discovery Method | Runtime Dependency |
|---|---|---|
| NVIDIA (H100, A100, GH200) | nvml-wrapper (libnvidia-ml.so via dlopen) | NVIDIA driver installed |
| AMD (MI300X, MI250) | rocm-smi CLI | ROCm toolkit installed |
| CPU-only nodes | No GPU discovery runs | None |
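The detection order can be approximated with the probe below — a hedged sketch of how an operator might check what the agent will find on a node, not the agent's actual (Rust) discovery code:

```shell
# Probe for GPU runtime dependencies in the same order the table lists:
# NVIDIA driver library first, then the ROCm CLI, else CPU-only.
detect_gpu() {
  if ldconfig -p 2>/dev/null | grep -q libnvidia-ml.so; then
    echo "nvidia"
  elif command -v rocm-smi >/dev/null 2>&1; then
    echo "rocm"
  else
    echo "cpu-only"
  fi
}

detect_gpu
```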
GCP Test Cluster
For integration testing without production hardware:
# 1. Build Packer image (once, ~5 min)
cd infra/gcp/packer
packer build -var project_id=YOUR_PROJECT lattice-compute.pkr.hcl
# 2. Provision infrastructure (~2 min)
cd infra/gcp
terraform apply -var="project_id=YOUR_PROJECT" -var="use_packer_image=true"
# 3. Build + bundle binaries
cargo build --release --target x86_64-unknown-linux-gnu
./scripts/deploy/make-provision-bundle.sh target/x86_64-unknown-linux-gnu/release /tmp/lattice-provision.tar.gz
# 4. Deploy to nodes (SCP bundle + run install scripts)
# See scripts/deploy/install-quorum.sh and install-compute.sh
# 5. Run validation test matrix
./scripts/deploy/validate.sh http://QUORUM1_IP:8080 x1000c0s0b0n0,x1000c0s0b0n1
# 6. Teardown
cd infra/gcp && terraform destroy
The test cluster includes: 3 quorum nodes, 2 compute nodes (with podman + squashfs-tools), 1 OCI registry, 1 VictoriaMetrics. The validate.sh script runs 15 tests covering health, auth, submit, drain, restart, and validation.
Deploy scripts (scripts/deploy/install-*.sh) are reusable on-prem — no GCP-specific logic.