Submitting Workloads
Basic Submission
# Run a script on 4 nodes for up to 24 hours
lattice submit --nodes=4 --walltime=24h train.sh
# With GPU constraints
lattice submit --nodes=8 --walltime=72h --constraint="gpu_type=GH200" -- torchrun train.py
# With a software environment (uenv)
lattice submit --nodes=2 --uenv=prgenv-gnu/24.11:v1 -- make -j run
Script Directives
Lattice parses #LATTICE directives from your script (and #SBATCH for compatibility):
#!/bin/bash
#LATTICE --nodes=64
#LATTICE --walltime=72h
#LATTICE --uenv=prgenv-gnu/24.11:v1
#LATTICE --vcluster=ml-training
#LATTICE --tenant=physics
#LATTICE --name=large-training-run
torchrun --nproc_per_node=4 train.py --data /scratch/dataset
Resource Constraints
# GPU type
lattice submit --constraint="gpu_type=GH200,gpu_count=4" script.sh
# Memory requirements
lattice submit --constraint="memory_gb>=512" script.sh
# Require unified memory (e.g. GH200 superchip or MI300A APU)
lattice submit --constraint="require_unified_memory" script.sh
# Prefer same NUMA domain
lattice submit --constraint="prefer_same_numa" script.sh
Task Groups (Job Arrays)
Submit multiple instances of the same job:
# 100 tasks, 20 running concurrently
lattice submit --task-group=0-99%20 sweep.sh
# Task index available as $LATTICE_TASK_INDEX
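Inside each task, use the index to pick a configuration, seed, or data shard. A minimal sweep.sh sketch; the train.py flags and output paths are illustrative, not part of the Lattice interface:

#!/bin/bash
# sweep.sh: each array task runs with its own seed and output directory,
# derived from the task index Lattice exports
python train.py --seed "$LATTICE_TASK_INDEX" --output "results/run-${LATTICE_TASK_INDEX}"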
Dependencies
# Run after job succeeds
lattice submit --depends-on=a1b2c3d4:success postprocess.sh
# Run after job completes (success or failure)
lattice submit --depends-on=a1b2c3d4:any cleanup.sh
# Multiple dependencies
lattice submit --depends-on=job1:success,job2:success merge.sh
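Dependencies can be chained into simple pipelines. A sketch, assuming lattice submit prints the new job's ID (the a1b2c3d4-style ID shown above) to stdout; adjust the capture step to match how your installation reports it:

# Capture the training job's ID, then fan out dependent jobs
TRAIN_ID=$(lattice submit --nodes=4 --walltime=24h train.sh)
lattice submit --depends-on="${TRAIN_ID}:success" postprocess.sh
lattice submit --depends-on="${TRAIN_ID}:any" cleanup.sh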
Data Staging
Lattice can pre-stage data to the hot tier before your job starts:
lattice submit --data-mount="s3://bucket/dataset:/data" --nodes=4 train.sh
The scheduler evaluates data readiness as part of the cost function — jobs with data already on the hot tier are prioritized.
Lifecycle Types
Bounded (batch) — default
lattice submit --walltime=24h train.sh
The job runs until it completes or reaches its walltime limit, then terminates.
Unbounded (service)
lattice submit --service --expose=8080 serve.sh
The job runs indefinitely. Exposed ports are reachable via the network domain.
Reactive (autoscaling)
lattice submit --reactive --min-nodes=1 --max-nodes=8 \
--scale-metric=gpu_utilization --scale-target=0.8 serve.sh
Lattice scales the job between the minimum and maximum node counts, adding or removing nodes to keep the chosen metric near its target value.
Preemption Classes
Higher preemption class = harder to preempt:
# Best-effort (preempted first)
lattice submit --preemption-class=0 experiment.sh
# Normal priority (default: 5)
lattice submit train.sh
# High priority
lattice submit --preemption-class=8 critical-training.sh
Checkpointing
If your application supports checkpointing, declare how it should be notified before preemption (a handler sketch for the signal-based mode follows these examples):
# Signal-based (receives SIGUSR1 before preemption)
lattice submit --checkpoint=signal train.sh
# gRPC callback
lattice submit --checkpoint=grpc --checkpoint-port=9999 train.sh
# Shared memory flag
lattice submit --checkpoint=shmem train.sh
# Non-preemptible (no checkpoint, never preempted)
lattice submit --no-preempt train.sh
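With --checkpoint=signal, the job script (or the application itself) must handle SIGUSR1. A minimal bash sketch; the actual checkpoint writing inside train.py is your application's responsibility, and the paths are placeholders:

#!/bin/bash
#LATTICE --checkpoint=signal

# Forward the preemption warning to the training process, let it flush a
# checkpoint, then exit cleanly
checkpoint_and_exit() {
    echo "Preemption imminent, checkpointing"
    kill -USR1 "$TRAIN_PID"
    wait "$TRAIN_PID"
    exit 0
}
trap checkpoint_and_exit USR1

python train.py --checkpoint-dir /scratch/ckpt &
TRAIN_PID=$!
wait "$TRAIN_PID"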
Slurm Compatibility
Existing Slurm scripts work with minimal changes:
# These are equivalent
sbatch --nodes=4 --time=24:00:00 --partition=gpu train.sh
lattice submit --nodes=4 --walltime=24h --vcluster=gpu train.sh
Supported #SBATCH directives are automatically translated. See Slurm Migration for details.
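For example, a legacy script that already carries #SBATCH directives can be submitted unchanged; how --partition maps onto a vcluster depends on your site configuration:

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=24:00:00
#SBATCH --partition=gpu
#SBATCH --job-name=legacy-run
./simulation

# The #SBATCH directives above are translated on submission
lattice submit legacy-run.sh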
Output Formats
# Default: human-readable table
lattice status
# JSON (for scripting; see the jq example below)
lattice status -o json
# YAML
lattice status -o yaml
# Wide (more columns)
lattice status -o wide
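The JSON output is the most convenient form for scripting. A hedged sketch using jq; the field names (.state, .id) are assumptions, since this page does not document the schema, so inspect the real output first:

# Pretty-print the full output to discover the schema
lattice status -o json | jq .
# Then extract what you need, e.g. IDs of running jobs (field names assumed)
lattice status -o json | jq -r '.[] | select(.state == "running") | .id'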