Lattice

A distributed workload scheduler for large-scale scientific computing, AI/ML training, inference services, and regulated workloads.

Lattice schedules both finite jobs (batch training, simulations) and infinite jobs (inference services, monitoring) on shared HPC infrastructure with topology-aware placement, federated multi-site operation, and a unified API for human users and autonomous agents.

Architecture at a Glance

User Plane         lattice-cli + lattice-api (OIDC via hpc-auth)
Software Plane     uenv (SquashFS) + Sarus (OCI) + Registry
Scheduling Plane   Raft Quorum + vCluster Schedulers (knapsack)
Data Plane         VAST (NFS/S3) tiered storage + data mover
Network Fabric     Slingshot / Ultra Ethernet (libfabric)
Node Plane         Node Agent + mount namespaces + eBPF telemetry
Infrastructure     OpenCHAMI (Redfish BMC, boot, inventory)
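
The Scheduling Plane above names knapsack-style vCluster schedulers. As a rough illustration of that general technique only, the sketch below runs a textbook 0/1 knapsack over pending jobs, treating the requested node count as the weight and a priority score as the value. It is not Lattice's solver: `JobRequest`, `select_jobs`, and the `priority` field are hypothetical names invented for this example, and a real scheduler would also have to weigh topology, cost functions, and preemption.

```rust
/// Hypothetical job request (not a Lattice type).
/// `nodes` acts as the knapsack weight, `priority` as the value.
#[derive(Debug, Clone, PartialEq)]
struct JobRequest {
    name: &'static str,
    nodes: usize,
    priority: u64,
}

/// Textbook 0/1 knapsack by dynamic programming.
/// Returns the indices of the selected jobs (ascending) and their total priority.
fn select_jobs(jobs: &[JobRequest], free_nodes: usize) -> (Vec<usize>, u64) {
    let n = jobs.len();
    // best[i][c] = highest total priority using only the first i jobs
    // with at most c nodes available.
    let mut best = vec![vec![0u64; free_nodes + 1]; n + 1];
    for i in 1..=n {
        let job = &jobs[i - 1];
        for c in 0..=free_nodes {
            // Option 1: skip this job.
            best[i][c] = best[i - 1][c];
            // Option 2: place this job if it fits and improves the total.
            if job.nodes <= c {
                let with_job = best[i - 1][c - job.nodes] + job.priority;
                if with_job > best[i][c] {
                    best[i][c] = with_job;
                }
            }
        }
    }
    // Walk the table backwards to recover which jobs were placed.
    let mut chosen = Vec::new();
    let mut c = free_nodes;
    for i in (1..=n).rev() {
        if best[i][c] != best[i - 1][c] {
            chosen.push(i - 1);
            c -= jobs[i - 1].nodes;
        }
    }
    chosen.reverse();
    (chosen, best[n][free_nodes])
}
```

With four made-up jobs, `train` (6 nodes, priority 10), `sim` (4, 7), `infer` (3, 6), and `monitor` (1, 2), and 8 free nodes, `select_jobs` picks `sim`, `infer`, and `monitor` for a total priority of 15, rather than the single largest job.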

Start with System Architecture for the full picture, or jump to API Design to see how users interact with the system.

Source Code

The project is organized as a Rust workspace with 9 crates:

Crate                  Purpose
lattice-common         Shared types, config, protobuf bindings
lattice-quorum         Raft consensus, global state machine, audit log
lattice-scheduler      vCluster schedulers, knapsack solver, cost function
lattice-api            gRPC + REST server, OIDC, RBAC, mTLS
lattice-checkpoint     Checkpoint broker, cost evaluator
lattice-node-agent     Per-node daemon, GPU discovery, eBPF telemetry
lattice-cli            CLI binary (submit, status, cancel, session, telemetry)
lattice-test-harness   Shared mocks, fixtures, builders
lattice-acceptance     BDD scenarios and property tests

The repository also includes a Python SDK, an RM-Replay simulator, and deployment configs in infra/.