
Slurm Migration

Design Principle

Migration from Slurm should be gradual and low-risk. Existing Slurm scripts should work with minimal changes via the compatibility layer. Users can adopt Lattice-native features incrementally. The goal is not perfect Slurm emulation — it’s a smooth on-ramp.

Migration Phases

Phase 1: Dual-Stack Operation

Run Lattice alongside Slurm on a subset of nodes. Users can submit to either system. This provides:

  • Side-by-side comparison of scheduling behavior
  • Gradual user migration with rollback to Slurm
  • Time to validate RM-Replay weight tuning

Phase 2: Compat-Mode Cutover

Move all nodes to Lattice. Users continue using sbatch/squeue via compatibility aliases. Slurm daemons are decommissioned.

Phase 3: Native Adoption

Users migrate scripts to native lattice CLI, adopting features not available in Slurm (reactive scaling, metric-driven autoscaling, DAG workflows, data staging hints).

Script Compatibility

Supported #SBATCH Directives

| Slurm Directive | Lattice Mapping | Notes |
|---|---|---|
| `--nodes=N` | `resources.nodes: N` | Exact match |
| `--ntasks=N` | Mapped to node count | `nodes = ceil(N / tasks_per_node)` |
| `--ntasks-per-node=N` | Passed as task config | Used by launcher |
| `--time=HH:MM:SS` | `lifecycle.walltime` | Exact match |
| `--partition=X` | `vcluster: X` | Partition name → vCluster name mapping |
| `--account=X` | `tenant: X` | Account → tenant mapping |
| `--job-name=X` | `tags.name: X` | Stored as tag |
| `--output=file` | Log path hint | Logs always go to S3; `--output` sets download path |
| `--error=file` | Log path hint | Same as `--output` |
| `--constraint=X` | `constraints.features` | Feature matching |
| `--gres=gpu:N` | `constraints.gpu_count` | Mapped to GPU constraint |
| `--exclusive` | Default behavior | Lattice schedules full nodes by default (ADR-007) |
| `--array=0-99%20` | `task_group` | Task group with concurrency limit |
| `--dependency=afterok:123` | `depends_on: [{ref: "123", condition: "success"}]` | DAG edge |
| `--qos=X` | `preemption_class` | QoS → priority mapping (configurable per site) |
| `--mail-user`, `--mail-type` | Not supported | Warn, skip |
| `--mem=X` | Not supported | Full-node scheduling; memory is not a constraint |
| `--cpus-per-task=N` | Not supported | Full-node scheduling |
| `--uenv=X` | `environment.uenv: X` | Lattice extension, not in Slurm |
| `--view=X` | `environment.view: X` | Lattice extension |

Unsupported Directives

Directives that have no Lattice equivalent are handled gracefully:

```text
Warning: #SBATCH --mem=64G ignored (Lattice uses full-node scheduling; memory is not constrainable)
Warning: #SBATCH --mail-user=user@example.com ignored (use `lattice watch` for event notifications)
Submitted allocation 12345
```

The submission succeeds — unsupported directives produce warnings, not errors. This is critical for migration: existing scripts should not fail because of irrelevant Slurm options.
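The translate-then-warn behavior can be sketched as a small directive translator. This is an illustration only: the directive-to-field mapping follows the table above, but `translate_directives`, the regex, and the handled subset are hypothetical, not the actual compat-layer code.

```python
import re

# Unsupported directives map to a warning message, never to an error.
UNSUPPORTED = {
    "--mem": "Lattice uses full-node scheduling; memory is not constrainable",
    "--mail-user": "use `lattice watch` for event notifications",
    "--mail-type": "use `lattice watch` for event notifications",
    "--cpus-per-task": "full-node scheduling",
}

def translate_directives(script: str) -> tuple[dict, list[str]]:
    """Translate #SBATCH directives into a Lattice spec dict plus warnings."""
    spec, warnings = {}, []
    for line in script.splitlines():
        m = re.match(r"#SBATCH\s+(--[\w-]+)(?:=(\S+))?", line.strip())
        if not m:
            continue
        key, val = m.group(1), m.group(2)
        if key == "--nodes":
            spec.setdefault("resources", {})["nodes"] = int(val)
        elif key == "--time":
            spec.setdefault("lifecycle", {})["walltime"] = val
        elif key == "--partition":
            spec["vcluster"] = val  # mapped to a vCluster by site config
        elif key == "--job-name":
            spec.setdefault("tags", {})["name"] = val
        elif key in UNSUPPORTED:
            # Warn, don't fail: existing scripts must keep working.
            warnings.append(f"{key}={val} ignored ({UNSUPPORTED[key]})")
        # ... remaining directives follow the table above
    return spec, warnings
```

The key property is that warnings and the spec travel together: submission proceeds with warnings attached, and only genuinely fatal problems (such as an unknown partition) abort.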

Conflicting Directives

| Conflict | Resolution |
|---|---|
| `--nodes=64` + `--ntasks=128` with `--ntasks-per-node=4` | `--nodes` takes precedence; `--ntasks-per-node` used by launcher |
| `--exclusive` + `--mem=64G` | `--exclusive` is the default; `--mem` ignored with a warning |
| `--partition` not found | Error: `vCluster "X" not found. Available: hpc-batch, ml-training, interactive` |

Slurm Features Not Supported

These Slurm features have no Lattice equivalent and are not planned:

| Feature | Reason | Alternative |
|---|---|---|
| Job steps (`srun` within `sbatch`) | Lattice uses tasks within allocations | `lattice launch --alloc=<id>` |
| Hetjob (heterogeneous job) | Not yet designed | Submit separate allocations with DAG dependencies |
| Burst buffer (`#DW`) | DataWarp-specific | Use `data.mounts` with `tier_hint: hot` |
| GRES beyond GPU | Not needed (full-node scheduling) | Use `constraints.features` for non-GPU resources |
| Accounting (`sacctmgr`) | Waldur handles accounting | `lattice history` or Waldur portal |
| Reservations (`scontrol create reservation`) | Use sensitive claims for dedicated nodes | `lattice admin reserve` (future) |
| Licenses (`--licenses=`) | Not applicable | Use `constraints.features` |
| Multi-cluster (`--cluster=`) | Use federation | `lattice submit --site=X` (if federation enabled) |

srun Within Allocations

Slurm users often use srun inside batch scripts to launch parallel tasks. In Lattice:

```sh
# Slurm pattern:
srun -n 256 ./my_mpi_program

# Lattice equivalent (inside a running allocation):

# Option 1: the entrypoint IS the parallel launch.
# In the submission script, use the appropriate launcher directly:
mpirun -np 256 ./my_mpi_program
# or:
torchrun --nproc_per_node=4 ./train.py

# Option 2: use lattice launch from another terminal
lattice launch --alloc=12345 -n 256 ./my_mpi_program
```

The compatibility layer translates srun to lattice launch when the compat aliases are active.
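Under the compat aliases, that translation is essentially an argv rewrite. A minimal sketch, assuming only the `-n` flag and pass-through of everything else (`srun_to_lattice` and the handled flag subset are illustrative, not the real compat shim):

```python
def srun_to_lattice(argv: list[str], alloc_id: str) -> list[str]:
    """Rewrite an srun command line as a lattice launch invocation.

    Only a tiny flag subset is handled here for illustration; a real
    shim would cover the broader srun option surface.
    """
    out = ["lattice", "launch", f"--alloc={alloc_id}"]
    it = iter(argv[1:])  # skip the leading "srun"
    for arg in it:
        if arg == "-n":
            out += ["-n", next(it)]  # task count passes through
        else:
            out.append(arg)  # program and its arguments
    return out
```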

Environment Variables

Slurm sets many environment variables in jobs. Lattice provides equivalent variables:

| Slurm Variable | Lattice Variable | Description |
|---|---|---|
| `SLURM_JOB_ID` | `LATTICE_ALLOC_ID` | Allocation ID |
| `SLURM_JOB_NAME` | `LATTICE_JOB_NAME` | Job name (from tags) |
| `SLURM_NODELIST` | `LATTICE_NODELIST` | Comma-separated node list |
| `SLURM_NNODES` | `LATTICE_NNODES` | Number of nodes |
| `SLURM_NPROCS` | `LATTICE_NPROCS` | Number of tasks |
| `SLURM_ARRAY_TASK_ID` | `LATTICE_TASK_INDEX` | Task group index |
| `SLURM_ARRAY_JOB_ID` | `LATTICE_TASK_GROUP_ID` | Task group parent ID |
| `SLURM_SUBMIT_DIR` | `LATTICE_SUBMIT_DIR` | Submission directory |
| `SLURM_JOBID` | `LATTICE_ALLOC_ID` | Alias for compatibility |

For migration convenience, the compat layer can also set SLURM_* variables (configurable: compat.set_slurm_env=true). This is disabled by default to avoid confusion.
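The optional mirror might look like this in the job-environment setup. The alias table follows the mapping above, but `build_job_env` is a hypothetical helper, not the actual implementation:

```python
# Lattice → Slurm names from the table above (SLURM_JOBID is a second alias).
SLURM_ALIASES = {
    "LATTICE_ALLOC_ID": ["SLURM_JOB_ID", "SLURM_JOBID"],
    "LATTICE_JOB_NAME": ["SLURM_JOB_NAME"],
    "LATTICE_NODELIST": ["SLURM_NODELIST"],
    "LATTICE_NNODES": ["SLURM_NNODES"],
    "LATTICE_NPROCS": ["SLURM_NPROCS"],
    "LATTICE_TASK_INDEX": ["SLURM_ARRAY_TASK_ID"],
    "LATTICE_TASK_GROUP_ID": ["SLURM_ARRAY_JOB_ID"],
    "LATTICE_SUBMIT_DIR": ["SLURM_SUBMIT_DIR"],
}

def build_job_env(lattice_env: dict[str, str],
                  set_slurm_env: bool = False) -> dict[str, str]:
    """Return the job environment; mirror SLURM_* names only when enabled."""
    env = dict(lattice_env)
    if set_slurm_env:  # compat.set_slurm_env=true; off by default
        for lat, slurm_names in SLURM_ALIASES.items():
            if lat in env:
                for name in slurm_names:
                    env[name] = env[lat]
    return env
```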

Partition-to-vCluster Mapping

Sites configure the mapping from Slurm partition names to Lattice vClusters:

```yaml
# lattice-compat.yaml
partition_mapping:
  normal: "hpc-batch"
  debug: "interactive"
  gpu: "ml-training"
  long: "hpc-batch"        # multiple partitions can map to one vCluster
  sensitive: "sensitive-secure"
qos_mapping:
  low: 1
  normal: 4
  high: 7
  urgent: 9
```

Unmapped partition names produce an error with a list of available vClusters.
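The lookup itself reduces to a dictionary check that fails loudly with the available names. A sketch assuming the config above has been loaded into a dict (`map_partition` is a hypothetical name):

```python
def map_partition(partition: str, mapping: dict[str, str]) -> str:
    """Map a Slurm partition name to a vCluster, or raise listing the options."""
    if partition not in mapping:
        # Deduplicate: several partitions may map to the same vCluster.
        vclusters = ", ".join(sorted(set(mapping.values())))
        raise ValueError(
            f'vCluster for partition "{partition}" not found. '
            f"Available: {vclusters}"
        )
    return mapping[partition]
```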

Migration Checklist

For site administrators:

  • Deploy Lattice control plane alongside Slurm
  • Configure partition-to-vCluster mapping
  • Configure QoS-to-preemption-class mapping
  • Tune cost function weights using RM-Replay with production traces
  • Test representative batch scripts via compat layer
  • Validate accounting (Waldur) captures match Slurm sacct data
  • Train users on lattice CLI basics
  • Run dual-stack for 2-4 weeks
  • Migrate remaining users, decommission Slurm

For users:

  • Test existing scripts with lattice submit (compat mode parses #SBATCH)
  • Review warnings for unsupported directives
  • Replace srun in scripts with direct launcher commands (mpirun, torchrun)
  • (Optional) Migrate to native lattice CLI syntax for new workflows

Cross-References