MPI Process Management
Design Principle
Lattice must launch and manage multi-node MPI processes without relying on SSH between compute nodes. The node agent provides process management infrastructure (PMI) so that MPI implementations (OpenMPI, MPICH, Cray MPICH) can perform rank discovery and key-value exchange through Lattice rather than through SSH or a Slurm-specific launcher.
Problem Statement
In Slurm, srun serves as both a process launcher (fan-out to nodes) and a PMI server (rank discovery, KV exchange). Lattice replaces srun with lattice launch / the LaunchTasks RPC, but the current implementation is a stub that does not:
- Fan out process launch to node agents
- Provide PMI wire-up so MPI ranks can discover each other
- Manage CXI credentials for Slingshot/Ultra Ethernet fabric access
Without this, users calling mpirun directly fall back to SSH for remote process spawning, which is:
- A security risk (SSH keys between compute nodes)
- Incompatible with network-domain-only L3 reachability
- Incompatible with the sensitive workload isolation model
- Operationally fragile (SSH host key management, authorized_keys distribution)
Supported MPI Implementations
| Implementation | PMI-2 Support | PMIx Support | Default Launcher | Notes |
|---|---|---|---|---|
| MPICH | Native (PMI-2 origin) | Via external PMIx | Hydra (SSH) | PMI-2 is the natural fit |
| OpenMPI | Yes (OMPI_MCA_pmix=pmi2) | Preferred (PRRTE) | ORTE/PRRTE (SSH) | PMI-2 fully functional |
| Cray MPICH | Native (via PALS) | Via PALS | PALS | PMI-2 without PALS works |
All three support PMI-2. PMIx is preferred by OpenMPI but not required.
Architecture
Two-Tier Design
┌─────────────────────────────────────────────────────────┐
│ Default: Native PMI-2 Server (built into node agent) │
│ Simple, no external dependencies, covers 95%+ of MPI │
│ workloads. ~8 wire commands over Unix domain socket. │
├─────────────────────────────────────────────────────────┤
│ Optional: OpenPMIx Sidecar (feature-flagged) │
│ Full PMIx v4/v5 support for workloads that require │
│ PMIx-specific features (spawn, tools API, events). │
│ Node agent manages OpenPMIx server lifecycle. │
└─────────────────────────────────────────────────────────┘
Launch Flow
User: lattice launch --alloc=123 -n 256 --tasks-per-node=4 ./my_mpi_app
│
▼
lattice-api (LaunchTasks RPC)
│
├─ Validates: allocation is Running, user owns it
├─ Computes rank layout: N nodes × tasks_per_node = total ranks
│ Rank assignment: node 0 gets ranks [0..3], node 1 gets [4..7], ...
├─ Generates launch_id, PMI job attributes (appnum, size, universe_size)
├─ Provisions CXI credentials if Slingshot fabric (see below)
│
▼ Fan-out: gRPC LaunchProcesses to each node agent in the allocation
Node Agent 0 Node Agent 1 Node Agent N-1
│ │ │
├─ Creates PMI-2 server ├─ Creates PMI-2 server ├─ ...
│ (Unix domain socket) │ (Unix domain socket) │
│ │ │
├─ Spawns local ranks ├─ Spawns local ranks │
│ rank 0: ./my_mpi_app │ rank 4: ./my_mpi_app │
│ rank 1: ./my_mpi_app │ rank 5: ./my_mpi_app │
│ rank 2: ./my_mpi_app │ rank 6: ./my_mpi_app │
│ rank 3: ./my_mpi_app │ rank 7: ./my_mpi_app │
│ │ │
│ Each rank inherits: │ │
│ - PMI_FD (socket fd) │ │
│ - PMI_RANK (global rank) │ │
│ - PMI_SIZE (world size) │ │
│ │ │
▼ ▼ ▼
MPI_Init() → PMI-2 fullinit → local KVS puts (libfabric endpoint addr)
│ │ │
▼ ─────────── kvsfence (cross-node KVS exchange via gRPC) ────────────
│ │ │
MPI_Init() completes MPI_Init() completes ...
│ │ │
(application runs) (application runs) ...
│ │ │
MPI_Finalize() → PMI-2 finalize
PMI-2 Wire Protocol
The PMI-2 wire protocol is text-based over a Unix domain socket. The node agent implements these commands:
| Command | Direction | Purpose |
|---|---|---|
fullinit | rank → agent | Initialize PMI connection, receive rank/size/appnum |
job-getinfo | rank → agent | Query job attributes (e.g., universe size) |
kvsput | rank → agent | Store a key-value pair (e.g., libfabric endpoint address) |
kvsget | rank → agent | Retrieve a key-value pair |
kvsfence | rank → agent | Barrier + distribute all KV pairs across all ranks |
finalize | rank → agent | Clean shutdown of PMI connection |
abort | rank → agent | Signal abnormal termination |
spawn | rank → agent | Dynamic process spawning (optional, rarely used) |
Cross-Node KVS Exchange (Fence)
The kvsfence operation is the only cross-node PMI operation. It requires all ranks across all nodes to synchronize and exchange accumulated KV pairs. This is implemented via gRPC between node agents:
kvsfence triggered on all nodes
│
▼
Phase 1: Local collection
Each node agent collects all kvsput entries from its local ranks.
Phase 2: Exchange (star topology via designated head node)
┌─────────────┐
│ Head Agent │ ◄──── gRPC PmiFence(local_kvs) ──── Agent 1
│ (rank 0's │ ◄──── gRPC PmiFence(local_kvs) ──── Agent 2
│ node) │ ◄──── gRPC PmiFence(local_kvs) ──── Agent N-1
│ │
│ Merges all │
│ KVS entries │
│ │
│ Broadcasts │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent 1
│ merged KVS │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent 2
│ │ ────► gRPC PmiFenceComplete(merged_kvs) ──► Agent N-1
└─────────────┘
Phase 3: Local completion
Each node agent unblocks its local ranks' kvsfence.
Ranks can now kvsget any key from any node.
The head agent is the node agent hosting rank 0. For large jobs (>128 nodes), a tree-based reduction can be used instead of a star to reduce head-node pressure.
Node Agent gRPC Extensions
New RPCs on the node agent service for MPI process management:
service NodeAgentService {
// Existing RPCs...
// Launch MPI ranks on this node (called by API server during fan-out)
rpc LaunchProcesses(LaunchProcessesRequest) returns (LaunchProcessesResponse);
// PMI fence exchange between node agents
rpc PmiFence(PmiFenceRequest) returns (PmiFenceResponse);
// PMI fence completion broadcast from head agent
rpc PmiFenceComplete(PmiFenceCompleteRequest) returns (PmiFenceCompleteResponse);
// Notify all local ranks to abort (e.g., one node failed)
rpc AbortProcesses(AbortProcessesRequest) returns (AbortProcessesResponse);
}
message LaunchProcessesRequest {
string launch_id = 1;
string allocation_id = 2;
string entrypoint = 3;
repeated string args = 4;
uint32 tasks_per_node = 5;
uint32 first_rank = 6; // global rank offset for this node
uint32 world_size = 7; // total ranks across all nodes
map<string, string> env = 8; // additional env vars
PmiMode pmi_mode = 9; // PMI2 (default) or PMIX
// CXI credentials for Slingshot fabric
optional CxiCredentials cxi_credentials = 10;
// Peer node agents for fence exchange
repeated PeerInfo peers = 11;
// Index of the head node (for fence coordination)
uint32 head_node_index = 12;
}
message PeerInfo {
string node_id = 1;
string grpc_address = 2; // node agent address (reachable via management network)
uint32 first_rank = 3;
uint32 num_ranks = 4;
}
enum PmiMode {
PMI2 = 0;
PMIX = 1;
}
message CxiCredentials {
uint32 vni = 1;
bytes auth_key = 2;
uint32 svc_id = 3;
}
PMI-2 Server Implementation
Each node agent runs a PMI-2 server per launch (one Unix socket per launch_id):
Node Agent
│
├─ LaunchProcesses received
│ ├─ Create Unix socket: /tmp/lattice-pmi-{launch_id}.sock
│ ├─ Start PMI-2 server task (tokio)
│ ├─ Fork/exec ranks with:
│ │ PMI_FD={fd} # inherited socket fd
│ │ PMI_RANK={rank} # global rank
│ │ PMI_SIZE={world_size} # world size
│ │ PMI_SPAWNED=0 # not dynamically spawned
│ │ LATTICE_LAUNCH_ID={launch_id}
│ │ LATTICE_ALLOC_ID={allocation_id}
│ │ LATTICE_NODELIST={comma-separated node list}
│ │ LATTICE_NNODES={node_count}
│ │ LATTICE_NPROCS={world_size}
│ │ # CXI env (if Slingshot):
│ │ FI_CXI_DEFAULT_VNI={vni}
│ │ FI_CXI_AUTH_KEY={key}
│ └─ Monitor all rank processes, report exit status
│
├─ PMI-2 server handles:
│ ├─ fullinit → return rank, size, appnum, debug flag
│ ├─ kvsput → store in local HashMap
│ ├─ kvsget → lookup local, or merged (post-fence)
│ ├─ kvsfence → collect local, trigger cross-node exchange, block until complete
│ ├─ finalize → mark rank done
│ └─ abort → signal all local ranks, notify head agent
│
└─ Cleanup on launch completion
├─ Remove Unix socket
├─ Report per-rank exit codes to API server
└─ Clean up CXI credentials
Environment Variables
Lattice sets these environment variables for MPI processes:
| Variable | Value | Purpose |
|---|---|---|
PMI_FD | fd number | PMI-2 socket (inherited) |
PMI_RANK | global rank | MPI rank |
PMI_SIZE | world size | MPI world size |
PMI_SPAWNED | 0 | Not dynamically spawned |
LATTICE_LAUNCH_ID | UUID | Launch identifier |
LATTICE_ALLOC_ID | UUID | Allocation identifier |
LATTICE_NODELIST | comma-separated | All nodes in this launch |
LATTICE_NNODES | integer | Node count |
LATTICE_NPROCS | integer | Total rank count |
LATTICE_LOCAL_RANK | 0..tasks_per_node-1 | Node-local rank |
LATTICE_LOCAL_SIZE | tasks_per_node | Ranks on this node |
FI_CXI_DEFAULT_VNI | VNI number | Slingshot VNI (if applicable) |
FI_CXI_AUTH_KEY | hex string | CXI auth key (if applicable) |
FI_PROVIDER | cxi or verbs | libfabric provider hint |
For Slurm compatibility (compat.set_slurm_env=true), also set SLURM_PROCID, SLURM_NPROCS, SLURM_LOCALID, SLURM_NODELIST.
CXI Credential Management (Slingshot)
On Slingshot systems, MPI communication requires CXI (Cassini eXtended Interface) credentials tied to the allocation’s VNI. Without valid credentials, libfabric’s CXI provider refuses to open endpoints.
Credential Lifecycle
1. Allocation scheduled → network domain assigned → VNI allocated
2. LaunchTasks RPC → API server requests CXI credentials from fabric manager
- Input: VNI, allocation ID, node list
- Output: auth_key, svc_id (bound to VNI + node set)
3. Credentials included in LaunchProcessesRequest to each node agent
4. Node agent sets FI_CXI_DEFAULT_VNI and FI_CXI_AUTH_KEY for spawned ranks
5. On launch completion → API server revokes CXI credentials
Fabric Manager Integration
The Slingshot fabric manager provides a REST API for credential management:
| Operation | Endpoint | When |
|---|---|---|
| Create CXI service | POST /fabric/cxi/services | Launch start |
| Get auth key | GET /fabric/cxi/services/{id}/auth | Launch start |
| Revoke CXI service | DELETE /fabric/cxi/services/{id} | Launch end |
This is a new integration point, similar to the existing VAST API integration for storage.
Optional: OpenPMIx Sidecar (Feature-Flagged)
For workloads requiring full PMIx v4/v5 support (dynamic process spawning, PMIx tools API, event notification, PMIx groups), Lattice can run an OpenPMIx server as a managed sidecar process.
When to Use PMIx Mode
| Scenario | PMI-2 (default) | PMIx (optional) |
|---|---|---|
| Standard MPI (init, communication, finalize) | Yes | Yes |
| Multi-application launch (MPMD) | Limited | Yes |
Dynamic process spawning (MPI_Comm_spawn) | No | Yes |
| PMIx tools API (debugger attach) | No | Yes |
| PMIx event notification | No | Yes |
| OpenMPI with PMIx-only features | No | Yes |
Architecture
Node Agent
│
├─ PmiMode::PMIX requested in LaunchProcessesRequest
│
├─ Spawns OpenPMIx server (pmix_server binary)
│ ├─ Configured via tmpdir/pmix-{launch_id}/
│ ├─ Node agent implements the PMIx "host" callback interface
│ │ via a small C shim library (libpmix-lattice-host.so)
│ │ that calls back to the node agent via Unix socket
│ ├─ Cross-node exchange: host callbacks route to node agent gRPC
│ └─ pmix_server provides Unix rendezvous socket for ranks
│
├─ Spawns ranks with:
│ PMIX_SERVER_URI={rendezvous_uri}
│ PMIX_NAMESPACE={launch_id}
│ PMIX_RANK={rank}
│ (instead of PMI_FD/PMI_RANK/PMI_SIZE)
│
└─ On completion: stops pmix_server, cleans up
Host Callback Shim
The OpenPMIx server requires the host (resource manager) to provide certain callbacks for cross-node operations. These are implemented via a small C shared library (libpmix-lattice-host.so) that:
- Is loaded by
pmix_serverat startup via--host-liborLD_PRELOAD - Implements:
pmix_server_fencenb_fn,pmix_server_dmodex_fn,pmix_server_spawn_fn - Each callback sends a request over a Unix socket to the node agent
- Node agent handles cross-node coordination via gRPC (same as PMI-2 fence)
This keeps the C code minimal (~200 lines) while leveraging the full OpenPMIx implementation.
Build and Deployment
# Cargo.toml (lattice-node-agent)
[features]
pmix = [] # enables PMIx sidecar support
When the pmix feature is enabled:
pmix_serverbinary must be installed on compute nodes (packaged separately or via uenv)libpmix-lattice-host.sois built frominfra/pmix-host/and installed alongside the node agent- The node agent detects
pmix_serveravailability at startup and reports it as a node capability
When disabled: PmiMode::PMIX requests return an error with a clear message.
Integration with Existing Runtimes
uenv Runtime
PMI-2 socket and environment variables are available inside the mount namespace with no special handling (mount namespace does not isolate Unix sockets in the parent namespace).
Sarus Runtime
The PMI-2 Unix socket must be bind-mounted into the container:
sarus run --mount=type=bind,source=/tmp/lattice-pmi-{launch_id}.sock,destination=/tmp/lattice-pmi.sock ...
The --mpi flag in Sarus already handles MPI wire-up for Slurm; for Lattice, we configure Sarus to use the Lattice-provided PMI socket instead. This requires the Sarus MPI hook to be configured for PMI-2 mode rather than Slurm PMI mode.
DMTCP (Checkpoint/Restart)
DMTCP wraps the MPI process. The PMI-2 socket is outside the DMTCP checkpoint boundary. On restart, the node agent creates a new PMI-2 server and the restarted ranks re-initialize PMI. DMTCP’s MPI plugin handles reconnecting MPI communicators.
Failure Handling
Rank Failure
1. Rank exits with non-zero code (or is killed by signal)
2. Local node agent detects via process monitor
3. Node agent sends RankFailed notification to head agent
4. Head agent:
a. If allocation requeue policy = "on_any_failure": abort all ranks, requeue allocation
b. If MPI_ERRORS_RETURN semantics: notify remaining ranks via PMI-2 abort
c. Default: abort all ranks, report failure to API server
Node Agent Failure
1. Node agent crashes or becomes unreachable
2. Head agent detects via gRPC timeout during fence (or heartbeat miss)
3. Head agent aborts the launch on all surviving nodes
4. API server handles allocation state transition (same as node failure)
Fence Timeout
1. kvsfence does not complete within timeout (default: 60s, configurable)
2. Head agent declares fence failure
3. All ranks aborted with PMI-2 abort message
4. Launch reported as failed with "PMI fence timeout" reason
User-Facing Changes
lattice launch (CLI)
# MPI launch (replaces srun -n 256 ./app)
lattice launch --alloc=123 -n 256 ./my_mpi_app
# With tasks-per-node control
lattice launch --alloc=123 --tasks-per-node=4 ./my_mpi_app
# Force PMIx mode (requires pmix feature on nodes)
lattice launch --alloc=123 -n 256 --pmi=pmix ./my_mpi_app
# Launch with environment variables
lattice launch --alloc=123 -n 256 --env OMP_NUM_THREADS=8 ./my_mpi_app
Submission Script
#!/bin/bash
#LATTICE nodes=64
#LATTICE walltime=2:00:00
#LATTICE vcluster=hpc-batch
#LATTICE network_domain=my-training-run
# No SSH, no mpirun, no srun needed.
# The entrypoint IS the MPI program; Lattice handles process launch and PMI.
lattice launch -n 256 --tasks-per-node=4 ./my_mpi_training
# Or for Slurm compatibility:
# srun -n 256 ./my_mpi_training (compat layer translates to lattice launch)
Direct mpirun (Escape Hatch)
Users who want to call mpirun directly can still do so. Lattice provides a Hydra-compatible launcher script (lattice-mpi-launcher) that uses the node agent gRPC instead of SSH:
# mpirun detects the Lattice launcher via:
# HYDRA_LAUNCHER=manual
# HYDRA_LAUNCHER_EXEC=lattice-mpi-launcher
# These are set automatically by the node agent when an allocation starts.
# So this "just works" inside an allocation:
mpirun -np 256 ./my_mpi_app
The lattice-mpi-launcher script:
- Receives the launch command from Hydra/ORTE
- Calls the local node agent’s
LaunchProcessesgRPC to spawn on the target node - Returns the PID to the MPI launcher
This provides backward compatibility for scripts that use mpirun directly while still avoiding SSH.
Performance Considerations
| Operation | Latency | Bottleneck | Mitigation |
|---|---|---|---|
| Launch fan-out | ~100ms for 256 nodes | gRPC round-trips | Parallel fan-out from API server |
| PMI-2 fence (star) | ~10ms for <128 nodes | Head agent merge | Acceptable for typical HPC |
| PMI-2 fence (tree) | ~20ms for 1000+ nodes | Tree depth (log N) | Only needed at extreme scale |
| CXI credential provisioning | ~50ms | Fabric manager API | Cached for allocation lifetime |
MPI_Init typically takes 100-500ms. The Lattice PMI overhead is well within this budget.
Cross-References
- network-domains.md – VNI allocation, L3 reachability
- security.md – CXI credentials, network isolation
- slurm-migration.md – srun replacement
- node-lifecycle.md – Node agent process management
- failure-modes.md – Rank and node failure handling
- checkpoint-broker.md – DMTCP + MPI checkpoint interaction
- sessions.md – Interactive allocations with MPI launch
- ADR-010: Native PMI-2 with optional PMIx sidecar