Lattice

A distributed workload scheduler for large-scale scientific computing, AI/ML training, inference services, and regulated workloads.

Lattice schedules both finite jobs (batch training, simulations) and infinite jobs (inference services, monitoring) on shared HPC infrastructure with topology-aware placement, federated multi-site operation, and a unified API for human users and autonomous agents.

Architecture at a Glance

User Plane         lattice-cli + lattice-api (OIDC via hpc-auth)
Software Plane     uenv (SquashFS) + Sarus (OCI) + Registry
Scheduling Plane   Raft Quorum + vCluster Schedulers (knapsack)
Data Plane         VAST (NFS/S3) tiered storage + data mover
Network Fabric     Slingshot / Ultra Ethernet (libfabric)
Node Plane         Node Agent + mount namespaces + eBPF telemetry
Infrastructure     OpenCHAMI (Redfish BMC, boot, inventory)

Start with System Architecture for the full picture, or jump to API Design to see how users interact with the system.

Source Code

The project is organized as a Rust workspace with 9 crates:

Crate	Purpose
`lattice-common`	Shared types, config, protobuf bindings
`lattice-quorum`	Raft consensus, global state machine, audit log
`lattice-scheduler`	vCluster schedulers, knapsack solver, cost function
`lattice-api`	gRPC + REST server, OIDC, RBAC, mTLS
`lattice-checkpoint`	Checkpoint broker, cost evaluator
`lattice-node-agent`	Per-node daemon, GPU discovery, eBPF telemetry
`lattice-cli`	CLI binary (submit, status, cancel, session, telemetry)
`lattice-test-harness`	Shared mocks, fixtures, builders
`lattice-acceptance`	BDD scenarios and property tests

Plus a Python SDK, an RM-Replay simulator, and deployment configs in infra/.

Keyboard shortcuts

Lattice Documentation

Lattice

Architecture at a Glance

Source Code