API Design

Two-Tier API Model

Tier 1: Intent API (Agent-Native)

Agents and advanced users interact with the Intent API. They declare what they need; the scheduler resolves how.

Core Resources

Allocation — The universal work unit.

POST   /v1/allocations                  Create allocation (or DAG of allocations)
GET    /v1/allocations                  List allocations (filterable)
GET    /v1/allocations/{id}             Get allocation status
DELETE /v1/allocations/{id}             Cancel allocation
PATCH  /v1/allocations/{id}             Update allocation (e.g., extend walltime, switch telemetry)
POST   /v1/allocations/{id}/tasks       Launch tasks within an existing allocation (srun equivalent)
POST   /v1/allocations/{id}/checkpoint  Request checkpoint
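
A minimal sketch of how a client might construct these calls with Python's standard library. The host `https://lattice.example.com`, the helper name `build_request`, the token value, and the shape of the PATCH body are assumptions made for illustration; JSON bodies are assumed as well. The requests are built but never sent.

```python
import json
import urllib.request

BASE_URL = "https://lattice.example.com"  # hypothetical host, not from the docs


def build_request(method, path, token, body=None):
    """Build (but do not send) an authenticated Intent API request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# Create an allocation, extend its walltime, then cancel it.
create = build_request("POST", "/v1/allocations", "TOKEN", {
    "allocation": {
        "tenant": "ml-team",
        "entrypoint": "python train.py",
        "resources": {"nodes": 2},
        "lifecycle": {"type": "bounded", "walltime": "4h"},
    }
})
# The PATCH body shape below is an assumption, not a documented schema.
extend = build_request("PATCH", "/v1/allocations/123", "TOKEN",
                       {"lifecycle": {"walltime": "8h"}})
cancel = build_request("DELETE", "/v1/allocations/123", "TOKEN")
```

Passing any of these objects to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a server.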

Observability — User-facing debugging and monitoring.

POST   /v1/allocations/{id}/attach               Attach interactive terminal (WebSocket upgrade)
GET    /v1/allocations/{id}/logs                 Historical logs from S3
GET    /v1/allocations/{id}/logs/stream          Live log tail (SSE / gRPC stream)
GET    /v1/allocations/{id}/metrics              Query metrics snapshot from TSDB
GET    /v1/allocations/{id}/metrics/stream       Push-based live metrics stream
GET    /v1/allocations/{id}/diagnostics          Combined network + storage diagnostics
GET    /v1/allocations/{id}/diagnostics/network  Network-specific diagnostics
GET    /v1/allocations/{id}/diagnostics/storage  Storage-specific diagnostics
GET    /v1/compare                               Cross-allocation metric comparison
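
The live log tail is offered as a server-sent event (SSE) stream. The sketch below shows one way a client could extract the `data:` payloads from such a stream, following the generic SSE framing rules. It assumes each event carries log text in its data field; the actual payload format of `/logs/stream` is not specified here, and `iter_sse_data` is a hypothetical helper name.

```python
def iter_sse_data(lines):
    """Yield the data payload of each server-sent event from text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


sample = ["data: epoch 1 done\n", "\n", ": ping\n",
          "data: line a\n", "data: line b\n", "\n"]
events = list(iter_sse_data(sample))
```

Any iterable of decoded lines works as input, for example a response body wrapped in `io.TextIOWrapper`.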

DAGs — Workflow graph management.

POST   /v1/dags                    Submit a DAG of allocations
GET    /v1/dags                    List DAGs (filterable by tenant, user, state)
GET    /v1/dags/{id}               Get DAG status (overall state + per-allocation states)
GET    /v1/dags/{id}/graph         Get DAG structure (allocations + dependency edges)
DELETE /v1/dags/{id}               Cancel all allocations in a DAG

Session — Interactive allocation with WebSocket terminal.

POST   /v1/sessions                 Create interactive session
GET    /v1/sessions/{id}/terminal   WebSocket terminal endpoint

Nodes — Read-only view of cluster state.

GET    /v1/nodes                    List nodes (filterable by vCluster, tenant, state)
GET    /v1/nodes/{id}               Get node details

Tenants / vClusters — Administrative.

GET    /v1/tenants                  List tenants
GET    /v1/vclusters                List vClusters
GET    /v1/vclusters/{id}/queue     View vCluster queue

Accounting

GET    /v1/accounting               Query usage history

Allocation Request Schema

# Full Intent API allocation request
allocation:
  # Identity
  tenant: "ml-team"
  project: "gpt-training"
  vcluster: "ml-training"           # optional: scheduler can infer from intent
  tags: { experiment: "run-42" }

  # What to run
  intent: "train"                    # optional hint for scheduler
  environment:
    uenv: "prgenv-gnu/24.11:v1"     # uenv name/version
    view: "default"                  # uenv view to activate
    # OR:
    image: "registry.example.com/my-training:latest"  # OCI image via Sarus
  entrypoint: "torchrun --nproc_per_node=4 train.py"

  # Resources
  resources:
    nodes: 64                        # can be exact or range: { min: 32, max: 128 }
    constraints:
      gpu_type: "GH200"
      features: ["nvme_scratch"]
      topology: "tight"              # scheduler hint: pack into fewest groups

  # Lifecycle
  lifecycle:
    type: "bounded"                  # bounded | unbounded | reactive
    walltime: "72h"                  # for bounded
    preemption_class: 2              # 0 = lowest, higher = harder to preempt
    # For reactive:
    # scale_policy: { min: 4, max: 16, metric: "request_latency_p99", target: "100ms" }

  # Data
  data:
    mounts:
      - source: "s3://datasets/imagenet"
        target: "/data/input"
        access: "read-only"
        tier_hint: "hot"             # scheduler pre-stages if needed
    defaults: true                   # auto-mount home, scratch, output dir

  # Networking
  connectivity:
    network_domain: "ml-workspace"   # shared domain for cross-allocation communication
    expose:                          # for services
      - name: "metrics"
        port: 9090

  # Dependencies (for DAG submissions)
  depends_on:
    - ref: "preprocess-job"
      condition: "success"           # success | failure | any | corresponding

  # Checkpointing
  checkpoint:
    strategy: "auto"                 # auto | manual | none
    # auto: scheduler decides based on cost function
    # manual: application manages its own checkpointing
    # none: non-checkpointable, treated as non-preemptible

  # Telemetry
  telemetry:
    mode: "prod"                     # prod | debug | audit
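
As an illustration of the schema above, the following sketch checks a request (already loaded into a Python dict) against the enumerations listed in the comments, and normalises the two flexible fields, `nodes` and `walltime`. This is a client-side pre-check under stated assumptions, not the scheduler's validator: the function names are hypothetical, and the set of walltime suffixes (`s`, `m`, `h`, `d`) is assumed, since only values such as "72h" appear in the examples.

```python
import re

# Enumerations copied from the comments in the request schema.
ALLOWED = {
    ("lifecycle", "type"): {"bounded", "unbounded", "reactive"},
    ("checkpoint", "strategy"): {"auto", "manual", "none"},
    ("telemetry", "mode"): {"prod", "debug", "audit"},
}
DEPENDENCY_CONDITIONS = {"success", "failure", "any", "corresponding"}
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}  # assumed suffix set


def parse_walltime(text):
    """Convert a duration such as "72h" into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unrecognised walltime: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def node_range(nodes):
    """Normalise `nodes` (exact int or {min, max}) into a (min, max) tuple."""
    if isinstance(nodes, int):
        return nodes, nodes
    low, high = nodes["min"], nodes["max"]
    if low > high:
        raise ValueError("nodes.min must not exceed nodes.max")
    return low, high


def check_allocation(spec):
    """Return a list of human-readable problems found in an allocation spec."""
    problems = []
    for (section, field), allowed in ALLOWED.items():
        value = spec.get(section, {}).get(field)
        if value is not None and value not in allowed:
            problems.append(f"{section}.{field}: {value!r} is not one of {sorted(allowed)}")
    for dep in spec.get("depends_on", []):
        if dep.get("condition") not in DEPENDENCY_CONDITIONS:
            problems.append(f"depends_on[{dep.get('ref')}]: unknown condition")
    return problems
```

For example, `check_allocation({"lifecycle": {"type": "batch"}})` reports one problem, while a spec that uses only the documented values returns an empty list.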

DAG Submission

Submit multiple allocations as a workflow graph:

dag:
  allocations:
    - id: "stage-data"
      entrypoint: "python stage.py"
      resources: { nodes: 1 }
      lifecycle: { type: "bounded", walltime: "2h" }

    - id: "train"
      entrypoint: "torchrun train.py"
      resources: { nodes: 64, constraints: { topology: "tight" } }
      lifecycle: { type: "bounded", walltime: "72h" }
      depends_on: [{ ref: "stage-data", condition: "success" }]

    - id: "evaluate"
      entrypoint: "python eval.py"
      resources: { nodes: 4 }
      depends_on: [{ ref: "train", condition: "any" }]

DAG size limit: Maximum 1000 allocations per DAG (configurable). Submissions exceeding this limit are rejected at validation time. See dag-scheduling.md for details.
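
A client can catch obviously invalid graphs before submitting. The sketch below is one possible pre-check, assuming the DAG has been loaded into a list of dicts shaped like the example above: it enforces the documented default limit, rejects references to unknown ids, and detects cycles with the standard-library `graphlib` module (Python 3.9+). The function name `validate_dag` is hypothetical, and whether the server performs the same checks in the same order is not stated here.

```python
from graphlib import CycleError, TopologicalSorter

MAX_DAG_ALLOCATIONS = 1000  # documented default; configurable on the server


def validate_dag(allocations, limit=MAX_DAG_ALLOCATIONS):
    """Return allocation ids in dependency order, or raise ValueError."""
    if len(allocations) > limit:
        raise ValueError(f"DAG has {len(allocations)} allocations; limit is {limit}")
    ids = [alloc["id"] for alloc in allocations]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate allocation id")
    graph = {}
    for alloc in allocations:
        deps = [dep["ref"] for dep in alloc.get("depends_on", [])]
        unknown = [ref for ref in deps if ref not in ids]
        if unknown:
            raise ValueError(f"{alloc['id']} depends on unknown ids: {unknown}")
        graph[alloc["id"]] = deps  # node -> list of predecessors
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


dag = [
    {"id": "stage-data"},
    {"id": "train", "depends_on": [{"ref": "stage-data", "condition": "success"}]},
    {"id": "evaluate", "depends_on": [{"ref": "train", "condition": "any"}]},
]
order = validate_dag(dag)
```

For the three-stage example, `order` lists `stage-data` first, then `train`, then `evaluate`.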

Task Groups (Job Arrays)

allocation:
  type: "task_group"
  template:
    entrypoint: "python sweep.py --config=${INDEX}"
    resources: { nodes: 1, constraints: { gpu_type: "GH200" } }
    lifecycle: { type: "bounded", walltime: "4h" }
  range: { start: 0, end: 99 }
  concurrency: 20                   # max simultaneous tasks
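
To make the template semantics concrete, here is a sketch of how a task group could be expanded into individual tasks. It assumes `range.end` is inclusive (by analogy with `sbatch --array=0-99`) and that `${INDEX}` is substituted textually into the entrypoint; neither detail is confirmed here, and `expand_task_group` is a hypothetical helper, not part of the API.

```python
from string import Template


def expand_task_group(group):
    """Expand a task_group spec into one task dict per index."""
    template = group["template"]
    start, end = group["range"]["start"], group["range"]["end"]
    tasks = []
    for index in range(start, end + 1):  # end treated as inclusive (assumption)
        task = dict(template)  # shallow copy of the shared template
        task["entrypoint"] = Template(template["entrypoint"]).substitute(INDEX=index)
        tasks.append(task)
    return tasks


group = {
    "type": "task_group",
    "template": {
        "entrypoint": "python sweep.py --config=${INDEX}",
        "resources": {"nodes": 1},
    },
    "range": {"start": 0, "end": 99},
    "concurrency": 20,
}
tasks = expand_task_group(group)
```

Under these assumptions the example yields 100 tasks, of which at most 20 would run at the same time.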

Tier 2: Compatibility API (Slurm-like)

Translates familiar Slurm commands to Intent API calls. Implemented as CLI wrappers plus lattice-api REST endpoints.

Command Mapping

| Slurm | Lattice CLI | Intent API |
| --- | --- | --- |
| sbatch script.sh | lattice submit script.sh | POST /v1/allocations |
| sbatch --array=0-99%20 script.sh | lattice submit --task-group=0-99%20 script.sh | POST /v1/allocations (task_group) |
| sbatch --dependency=afterok:123 script.sh | lattice submit --depends-on=123:success script.sh | POST /v1/allocations (depends_on) |
| squeue | lattice status | GET /v1/allocations |
| squeue -u $USER | lattice status --user=$USER | GET /v1/allocations?user= |
| scancel 123 | lattice cancel 123 | DELETE /v1/allocations/123 |
| salloc -N2 | lattice session --nodes=2 | POST /v1/sessions |
| srun -n4 hostname | lattice launch --alloc=123 -n4 hostname | POST /v1/allocations/123/tasks |
| sinfo | lattice nodes | GET /v1/nodes |
| sacct | lattice history | GET /v1/accounting |
| --constraint="gpu" | --constraint="gpu" | constraints.features |
| --partition=debug | --vcluster=interactive | vcluster field |
| --qos=high | --priority=high | preemption_class |
| --uenv=prgenv-gnu/24.11:v1 | --uenv=prgenv-gnu/24.11:v1 | environment.uenv |
| srun --jobid=123 --pty bash | lattice attach 123 | Attach RPC (bidir stream) |
| cat slurm-123.out | lattice logs 123 | GET /v1/allocations/123/logs |
| tail -f slurm-123.out | lattice logs 123 --follow | StreamLogs RPC |
| sstat -j 123 | lattice top 123 | QueryMetrics RPC |
| (no equivalent) | lattice watch 123 | StreamMetrics RPC |
| (no equivalent) | lattice diag 123 | GetDiagnostics RPC |
| (no equivalent) | lattice compare 123 456 | CompareMetrics RPC |
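
Two rows of the mapping involve restructuring a flag value rather than renaming it. The sketch below shows how `--array=0-99%20` and `--dependency=afterok:123` might be turned into Intent API fields. Only the `afterok` to `success` pairing comes from the table; the other three pairings are assumptions based on the matching Slurm dependency types, and both function names are hypothetical.

```python
import re

# afterok -> success is taken from the mapping table; the rest are assumed.
DEPENDENCY_CONDITIONS = {
    "afterok": "success",
    "afternotok": "failure",
    "afterany": "any",
    "aftercorr": "corresponding",
}


def parse_array(spec):
    """Translate "0-99%20" into task-group range and concurrency fields."""
    match = re.fullmatch(r"(\d+)-(\d+)(?:%(\d+))?", spec)
    if match is None:
        raise ValueError(f"unsupported array spec: {spec!r}")
    start, end, limit = match.groups()
    group = {"type": "task_group", "range": {"start": int(start), "end": int(end)}}
    if limit is not None:
        group["concurrency"] = int(limit)
    return group


def parse_dependency(spec):
    """Translate "afterok:123[:124...]" into depends_on entries."""
    kind, _, ids = spec.partition(":")
    if kind not in DEPENDENCY_CONDITIONS:
        raise ValueError(f"unsupported dependency type: {kind!r}")
    condition = DEPENDENCY_CONDITIONS[kind]
    return [{"ref": job_id, "condition": condition}
            for job_id in ids.split(":") if job_id]
```

Slurm accepts richer array syntax (comma lists, step values); this sketch deliberately covers only the `start-end%limit` form shown in the table.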

Script Parsing

The compatibility layer parses #SBATCH directives from submission scripts and translates them to Intent API fields. Unknown directives produce a warning but are not fatal (graceful degradation).

#!/bin/bash
#SBATCH --nodes=64
#SBATCH --time=72:00:00
#SBATCH --gres=gpu:4
#SBATCH --constraint=GH200
#SBATCH --uenv=prgenv-gnu/24.11:v1
#SBATCH --view=default
#SBATCH --account=ml-team
#SBATCH --job-name=training-run

torchrun --nproc_per_node=4 train.py
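
The parsing behaviour described above can be sketched as follows. The directive table is deliberately partial: the `--uenv` target comes from the command mapping, while the other pairings are assumptions, and `--gres` is left out only to exercise the warning path (the real parser may well support it). The walltime is passed through unchanged; converting `72:00:00` to the schema's duration format is omitted. `parse_sbatch` is a hypothetical name.

```python
import re

# Partial mapping from directive name to (section, field) in the allocation
# spec. --uenv follows the command mapping; the other pairings are assumed.
DIRECTIVE_FIELDS = {
    "nodes": ("resources", "nodes"),
    "time": ("lifecycle", "walltime"),
    "uenv": ("environment", "uenv"),
    "view": ("environment", "view"),
}
SBATCH_LINE = re.compile(r"#SBATCH\s+--([A-Za-z0-9-]+)(?:[=\s]+(.*))?$")


def parse_sbatch(script):
    """Return (spec, warnings) for a submission script."""
    spec, warnings, commands = {}, [], []
    for raw in script.splitlines():
        line = raw.strip()
        match = SBATCH_LINE.match(line)
        if match is None:
            if line and not line.startswith("#"):
                commands.append(line)  # non-comment lines form the entrypoint
            continue
        name, value = match.group(1), (match.group(2) or "").strip()
        if name not in DIRECTIVE_FIELDS:
            # Graceful degradation: record a warning and keep going.
            warnings.append(f"ignoring unsupported directive --{name}")
            continue
        section, field = DIRECTIVE_FIELDS[name]
        spec.setdefault(section, {})[field] = int(value) if value.isdigit() else value
    spec["entrypoint"] = "\n".join(commands)
    return spec, warnings


SCRIPT = """#!/bin/bash
#SBATCH --nodes=64
#SBATCH --time=72:00:00
#SBATCH --gres=gpu:4
#SBATCH --uenv=prgenv-gnu/24.11:v1

torchrun --nproc_per_node=4 train.py
"""
spec, warnings = parse_sbatch(SCRIPT)
```

Here `spec` contains the nodes, walltime, uenv, and entrypoint, and `warnings` holds a single entry for `--gres`.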

Wire Format

gRPC (protobuf) is the primary protocol. REST is provided via gRPC-gateway for browser/curl access.

Protobuf definitions live in the proto/ directory. See proto/README.md for schema details.

Proto Coverage

The protobuf definitions under proto/lattice/v1/ (allocations.proto, nodes.proto, admin.proto) currently cover:

| Service / Area | Proto Status | Notes |
| --- | --- | --- |
| AllocationService (submit, get, list, cancel, update, watch, checkpoint) | Defined | Core allocation lifecycle |
| Observability RPCs (attach, logs, metrics, diagnostics, compare) | Defined | Part of AllocationService |
| DAG RPCs (get, list, cancel) | Defined | Part of AllocationService |
| NodeService (list, get, drain, undrain, disable, enable, health) | Defined | proto/lattice/v1/nodes.proto |
| AdminService (tenant CRUD, vCluster CRUD, Raft status, backup, audit, accounting) | Defined | proto/lattice/v1/admin.proto |
| Session RPCs (create, get, delete) | Defined | Part of AllocationService |
| Service Discovery (lookup, list) | Defined | Part of AdminService, admin.proto |
| LivenessProbeSpec | Defined | Part of AllocationSpec, allocations.proto |

All planned services have been implemented as RPCs within the existing three services (AllocationService, NodeService, AdminService). Both gRPC and REST endpoints are available for all operations.

Service Discovery Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| gRPC | AdminService.LookupService(name) | Returns endpoints for a named service (tenant-filtered) |
| gRPC | AdminService.ListServices() | Lists all registered service names (tenant-filtered) |
| REST | GET /api/v1/services | JSON list of registered service names |
| REST | GET /api/v1/services/{name} | JSON endpoints for a named service |

Tenant filtering: requests carrying the x-lattice-tenant header see only services that belong to that tenant. Requests without the header see all services (admin mode).
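
The filtering rule can be expressed in a few lines. This sketch assumes a hypothetical in-memory registry that maps each service name to its owning tenant; the real registry structure is not described here. Header names are lower-cased first because HTTP headers are case-insensitive.

```python
def visible_services(registry, headers):
    """Return the service names a request may see, sorted by name."""
    normalised = {key.lower(): value for key, value in headers.items()}
    tenant = normalised.get("x-lattice-tenant")
    if tenant is None:
        return sorted(registry)  # no header: admin mode, everything visible
    return sorted(name for name, owner in registry.items() if owner == tenant)


# Hypothetical registry: service name -> owning tenant.
registry = {"metrics": "ml-team", "inference": "ml-team", "billing": "finance"}
scoped = visible_services(registry, {"X-Lattice-Tenant": "ml-team"})
everything = visible_services(registry, {})
```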

Liveness Probe Schema

Allocations can include an optional liveness_probe in the submission spec:

message LivenessProbeSpec {
  string probe_type = 1;    // "tcp" or "http"
  uint32 port = 2;          // 1-65535
  string path = 3;          // HTTP path (e.g., "/healthz")
  uint32 period_secs = 4;   // default: 30
  uint32 initial_delay_secs = 5;
  uint32 failure_threshold = 6;  // default: 3
  uint32 timeout_secs = 7;      // default: 5
}

When failure_threshold consecutive probes fail, the allocation is marked Failed. The reconciliation loop then requeues it, provided it is an Unbounded or Reactive allocation with an appropriate requeue policy.
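
The consecutive-failure rule can be sketched as below. The dataclass mirrors the proto fields and their documented defaults (`initial_delay_secs` uses the proto3 zero default); `tcp_probe` and `FailureTracker` are hypothetical names, and the actual probing loop in the node agent may differ.

```python
import socket
from dataclasses import dataclass


@dataclass
class LivenessProbe:
    probe_type: str              # "tcp" or "http"
    port: int                    # 1-65535
    path: str = ""               # only used by HTTP probes
    period_secs: int = 30
    initial_delay_secs: int = 0
    failure_threshold: int = 3
    timeout_secs: int = 5


def tcp_probe(host, probe):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, probe.port), timeout=probe.timeout_secs):
            return True
    except OSError:
        return False


class FailureTracker:
    """Counts consecutive probe failures; a success resets the count."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.consecutive = 0

    def record(self, ok):
        """Record one probe result; return True once the threshold is reached."""
        self.consecutive = 0 if ok else self.consecutive + 1
        return self.consecutive >= self.threshold


probe = LivenessProbe(probe_type="tcp", port=9090)
tracker = FailureTracker(probe.failure_threshold)
results = [tracker.record(ok) for ok in (False, False, True, False, False, False)]
```

In the example, the success in the third position resets the counter, so only the final probe pushes the allocation over the threshold.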

Client SDKs

| SDK | Protocol | Location |
| --- | --- | --- |
| Python (lattice-sdk) | REST (httpx) | sdk/python/ |
| Rust (lattice-client) | gRPC (tonic) | crates/lattice-client/ |

The Rust SDK re-exports all proto types as lattice_client::proto — consumers do not need to depend on lattice-common directly.

Authentication

All API calls require an OIDC bearer token. The lattice CLI handles the OIDC flow via hpc-auth (institutional IdP integration). The lattice-api server validates tokens against the configured OIDC provider.

Sensitive tenant tokens include additional claims for audit trail binding.
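
A client that manages its own tokens may want to check expiry before making a call. The sketch below assumes the bearer token is a JWT (common for OIDC, but not confirmed here) and reads the standard `exp` claim. It does not verify the signature, so it is useful only as a local convenience check; validation remains the server's job. Both function names are hypothetical.

```python
import base64
import json
import time


def decode_claims(token):
    """Decode a JWT payload WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("token is not a three-part JWT")
    payload = parts[1]
    padded = payload + "=" * (-len(payload) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def is_expired(claims, now=None, leeway=30):
    """Return True if the token expires within `leeway` seconds of `now`."""
    now = time.time() if now is None else now
    return claims.get("exp", 0) <= now + leeway
```

When `is_expired` returns True, the client would rerun the OIDC flow to obtain a fresh token before calling the API.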