
pact

Promise-based configuration management and admin operations for HPC/AI infrastructure

pact manages post-boot runtime configuration on large-scale HPC/AI clusters and provides the sole admin operations interface to compute nodes — replacing both traditional config management tools and SSH.

On compute nodes, pact-agent is the init system: it supervises all services, manages configuration, and provides authenticated remote access via pact shell.

Key Features

  • No SSH needed — pact shell provides authenticated, audited, policy-scoped remote access
  • pact-agent as init — boots diskless nodes in <2s, supervises 4-7 services directly
  • Acknowledged drift — detected, measured, explicitly handled — never silently converged
  • Immutable audit log — every action recorded, any state reconstructible
  • Optimistic concurrency — apply first, commit within time window, rollback on expiry
  • 10,000+ node scale — streaming boot config, no per-node scrape targets

Architecture Overview

Admin Plane      pact CLI / pact shell / AI Agent (MCP)
Control Plane    pact-journal (Raft) + pact-policy (IAM/OPA) + Grafana/Loki
Node Plane       pact-agent (init + supervisor + observer + shell server)
Infrastructure   OpenCHAMI (boot) → pact (config + init) → lattice (scheduling)

Integrates with Lattice, OpenCHAMI, and Sovra.

Getting Started with pact

Install from Release

Download pre-built binaries from the latest release.

Platform binaries (journal, CLI, MCP)

# x86_64
curl -LO https://github.com/witlox/pact/releases/latest/download/pact-platform-x86_64.tar.gz
sudo tar xzf pact-platform-x86_64.tar.gz -C /usr/local/bin/

# aarch64
curl -LO https://github.com/witlox/pact/releases/latest/download/pact-platform-aarch64.tar.gz
sudo tar xzf pact-platform-aarch64.tar.gz -C /usr/local/bin/

This installs pact (CLI), pact-journal, and pact-mcp.

Agent binary

Choose the variant matching your hardware and supervisor model:

# x86_64 NVIDIA node with PactSupervisor (diskless HPC)
curl -LO https://github.com/witlox/pact/releases/latest/download/pact-agent-x86_64-nvidia-pact.tar.gz
sudo tar xzf pact-agent-x86_64-nvidia-pact.tar.gz -C /usr/local/bin/

# aarch64 NVIDIA node with PactSupervisor
curl -LO https://github.com/witlox/pact/releases/latest/download/pact-agent-aarch64-nvidia-pact.tar.gz
sudo tar xzf pact-agent-aarch64-nvidia-pact.tar.gz -C /usr/local/bin/

Available agent variants:

| Variant | Arch | GPU | Supervisor |
|---|---|---|---|
| pact-agent-x86_64-pact | x86_64 | (none) | PactSupervisor |
| pact-agent-x86_64-nvidia-pact | x86_64 | NVIDIA | PactSupervisor |
| pact-agent-x86_64-amd-pact | x86_64 | AMD | PactSupervisor |
| pact-agent-x86_64-systemd | x86_64 | (none) | systemd |
| pact-agent-x86_64-nvidia-systemd | x86_64 | NVIDIA | systemd |
| pact-agent-x86_64-amd-systemd | x86_64 | AMD | systemd |
| pact-agent-aarch64-pact | aarch64 | (none) | PactSupervisor |
| pact-agent-aarch64-nvidia-pact | aarch64 | NVIDIA | PactSupervisor |
| pact-agent-aarch64-systemd | aarch64 | (none) | systemd |
| pact-agent-aarch64-nvidia-systemd | aarch64 | NVIDIA | systemd |

All variants include eBPF and SPIRE support. See ARCHITECTURE.md for details.

Verify

pact --version
pact-agent --version
pact-journal --version

Build from Source (development)

Prerequisites

  • Rust toolchain: stable (1.85+). The repo pins the toolchain via rust-toolchain.toml.
  • protoc: Protocol Buffers compiler (required for building pact-common).
    • macOS: brew install protobuf
    • Ubuntu/Debian: apt install protobuf-compiler
  • just: Task runner. Install with cargo install just.
  • Docker: Required only for e2e tests and docker-compose deployment.
  • cargo-nextest (optional, recommended): Faster test runner. Install with cargo install cargo-nextest.
  • cargo-deny (optional): License and advisory checks. Install with cargo install cargo-deny.

Clone and build the entire workspace:

git clone https://github.com/witlox/pact.git
cd pact
cargo build --workspace

This produces four binaries in target/debug/:

  • pact – CLI tool
  • pact-agent – per-node daemon
  • pact-journal – distributed log server (Raft quorum member)
  • pact-mcp – AI agent tool-use interface

For optimized release builds:

just release
# Binaries in target/release/

Running a Dev Cluster

1. Start the journal

The journal is the central log. For development, a single-node journal is sufficient:

just run-journal

This runs pact-journal with config/minimal.toml, listening on localhost:9443.

2. Start an agent

In a second terminal:

just run-agent

This runs pact-agent with config/minimal.toml. The agent connects to the journal at localhost:9443 and starts the shell server on port 9445.

3. Use the CLI

In a third terminal, run commands against the local journal:

# Check node status
just cli status

# View configuration log (last 10 entries)
just cli log -n 10

# Commit current drift with a message
just cli commit -m "initial dev setup"

You can also run the CLI binary directly:

cargo run --package pact-cli -- status
cargo run --package pact-cli -- log -n 5

First CLI Commands

Check status

pact status                    # All nodes in default vCluster
pact status dev-node-001       # Specific node

View configuration log

pact log                       # Last 20 entries (default)
pact log -n 50                 # Last 50 entries
pact log --scope node:dev-001  # Filter by node

Show drift (declared vs actual)

pact diff                      # Current node
pact diff dev-node-001         # Specific node

Commit drift

pact commit -m "tuned hugepages for training workload"

Roll back

pact rollback 42               # Roll back to sequence number 42

Configuration Basics

pact uses TOML configuration files. There are two main configs:

Agent config (agent.toml)

Controls the per-node daemon. Key sections:

[agent]
node_id = "dev-node-001"
vcluster = "dev-sandbox"
enforcement_mode = "observe"    # "observe" | "enforce"

[agent.supervisor]
backend = "pact"                # "pact" (built-in) | "systemd" (fallback)

[agent.journal]
endpoints = ["localhost:9443"]
tls_enabled = false

[agent.observer]
ebpf_enabled = false
inotify_enabled = true
netlink_enabled = true

[agent.shell]
enabled = true
listen = "0.0.0.0:9445"
whitelist_mode = "learning"     # "strict" | "learning" | "bypass"

[agent.commit_window]
base_window_seconds = 900

See config/minimal.toml for a complete development config and config/production.toml for a production-ready example.

CLI config (~/.config/pact/cli.toml)

Controls the CLI tool. Created automatically on first use if missing:

endpoint = "http://localhost:9443"
default_vcluster = "dev-sandbox"
output_format = "text"          # "text" | "json"
timeout_seconds = 30

The CLI resolves settings with this precedence (highest to lowest):

  1. Command-line flags (--endpoint, --token, --vcluster)
  2. Environment variables (PACT_ENDPOINT, PACT_TOKEN, PACT_VCLUSTER)
  3. Config file (~/.config/pact/cli.toml)
  4. Defaults (http://localhost:9443)
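The chain amounts to first-non-empty-wins. A minimal sketch of the idea (illustrative shell, not pact's actual resolver; the function name is invented):

```shell
# Illustrative sketch of the precedence chain -- not pact's real code.
# Order: --endpoint flag > PACT_ENDPOINT env var > config-file value > default.
resolve_endpoint() {
  flag="$1"        # value from --endpoint, if given
  file_value="$2"  # value parsed from ~/.config/pact/cli.toml, if present
  for candidate in "$flag" "$PACT_ENDPOINT" "$file_value" "http://localhost:9443"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
}
```

For example, with nothing set, `resolve_endpoint "" ""` prints the built-in default; a flag value wins over everything else.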

Authentication

For development, no token is needed if the journal has policy disabled ([policy] enabled = false in config/minimal.toml).

For production, set your OIDC token:

# Via environment variable
export PACT_TOKEN="eyJhbGciOiJS..."

# Via token file
echo "eyJhbGciOiJS..." > ~/.config/pact/token

# Via CLI flag
pact --token "eyJhbGciOiJS..." status

Running Tests

just test            # Unit + integration tests (fast, no Docker needed)
just test-accept     # BDD acceptance tests (584 scenarios)
just test-e2e        # End-to-end tests (requires Docker)
just ci              # Full CI suite (fmt + clippy + tests + deny)

Docker Compose (Full Stack)

For a complete local environment with monitoring:

cd infra/docker
docker compose up -d

This starts:

  • 3-node journal quorum (ports 9443, 9543, 9643)
  • 1 agent (port 9445)
  • Prometheus (port 9090)
  • Grafana (port 3000, login: admin/admin)

Next Steps

CLI Reference

pact CLI is the primary interface for configuration management and admin operations. Every command is authenticated, authorized, and logged to the immutable journal.

Global Options

pact [OPTIONS] <COMMAND>

| Option | Description |
|---|---|
| --endpoint <URL> | Journal gRPC endpoint (overrides PACT_ENDPOINT and config file) |
| --token <TOKEN> | OIDC bearer token (overrides PACT_TOKEN and config file) |
| --vcluster <NAME> | Default vCluster scope (overrides PACT_VCLUSTER and config file) |
| --output <FORMAT> | Output format: text (default) or json |

Environment Variables

| Variable | Description | Default |
|---|---|---|
| PACT_ENDPOINT | Journal gRPC endpoint | http://localhost:9443 |
| PACT_TOKEN | OIDC bearer token | (none, reads from ~/.config/pact/token) |
| PACT_VCLUSTER | Default vCluster scope | (none) |
| PACT_OUTPUT | Output format (text or json) | text |
| RUST_LOG | Log level for debug output | warn |

Exit Codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error (connection failure, invalid arguments) |
| 2 | Authentication or authorization failure |
| 3 | Policy rejection (OPA denied the operation) |
| 4 | Conflict (concurrent modification detected) |
| 5 | Timeout (journal unreachable) |
| 6 | Command not whitelisted (exec/shell) |
| 10 | Rollback failed (active consumers hold the state) |
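In automation, these codes can drive retry or escalation logic. The helper below is an illustrative sketch (not part of the pact CLI) that maps the documented codes to messages:

```shell
# Illustrative helper for wrapping pact in scripts: map documented exit
# codes to human-readable outcomes. Not part of pact itself.
explain_exit() {
  case "$1" in
    0)  echo "success" ;;
    2)  echo "authentication or authorization failure" ;;
    3)  echo "policy rejection (OPA denied the operation)" ;;
    4)  echo "conflict (concurrent modification detected)" ;;
    5)  echo "timeout (journal unreachable)" ;;
    6)  echo "command not whitelisted" ;;
    10) echo "rollback failed (active consumers hold the state)" ;;
    *)  echo "general error" ;;
  esac
}
```

Typical use: run a command, then `explain_exit $?` to log a readable reason.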

Authentication Commands

These commands manage OIDC authentication. login and logout are exempt from the “every command requires a valid token” rule (Auth1).

pact login

Authenticate with the pact-journal server via OIDC.

pact login                          # Interactive (Auth Code + PKCE)
pact login --server https://j:9443  # Explicit server URL
pact login --device-code            # Headless (Device Code flow)
pact login --service-account        # Machine identity (Client Credentials)

| Option | Description |
|---|---|
| --server <URL> | Journal server URL (overrides config/env) |
| --device-code | Force Device Code flow for headless environments |
| --service-account | Use Client Credentials flow (requires PACT_CLIENT_ID and PACT_CLIENT_SECRET env vars) |

Flow selection: If no flag is given, the auth crate auto-discovers the IdP and selects the best available flow: Auth Code + PKCE → Device Code → Manual Paste.

Token cache: Tokens are stored at ~/.config/pact/auth/tokens-{server_hash}.json with mode 0600 (PAuth1: strict permission mode).

Roles: Not required (unauthenticated command).

pact logout

Clear the local token cache and revoke the session at the IdP (best-effort).

pact logout

Local cache is always cleared, even if IdP revocation fails (Auth4).

Roles: Not required (unauthenticated command).


Read Commands

These commands query state without modifying anything. Available to all roles including pact-viewer-{vcluster}.

pact status

Show node or vCluster state, drift, and capabilities.

pact status                          # All nodes in default vCluster
pact status node-042                 # Specific node
pact status --vcluster ml-training   # All nodes in a vCluster

| Option | Description |
|---|---|
| [node] | Node ID to query (optional, defaults to all nodes) |
| --vcluster <NAME> | vCluster scope |

pact log

Show configuration history from the immutable journal.

pact log                             # Last 20 entries
pact log -n 50                       # Last 50 entries
pact log --scope node:node-042       # Filter by node
pact log --scope vc:ml-training      # Filter by vCluster
pact log --scope global              # Global entries only

| Option | Description |
|---|---|
| -n <COUNT> | Number of entries to show (default: 20) |
| --scope <FILTER> | Scope filter: node:<id>, vc:<name>, or global |

pact diff

Show declared vs actual state differences (drift).

pact diff                            # Current node
pact diff node-042                   # Specific node
pact diff --committed node-042       # Show committed node deltas not yet promoted

| Option | Description |
|---|---|
| [node] | Node ID to diff (optional) |
| --committed | Show committed node deltas not yet promoted to overlay |

pact cap

Show node hardware capability report (CPU, GPU, memory, network).

pact cap                             # Local node
pact cap node-042                    # Remote node

| Option | Description |
|---|---|
| [node] | Node ID (optional, defaults to local) |

pact watch

Live event stream from the journal. Streams events in real time until interrupted.

pact watch                           # Default vCluster
pact watch --vcluster ml-training    # Specific vCluster

| Option | Description |
|---|---|
| --vcluster <NAME> | vCluster scope |

Press Ctrl-C to stop the stream.


Write Commands

These commands modify configuration state. Requires pact-ops-{vcluster} or pact-platform-admin role. On regulated vClusters, write commands trigger the two-person approval workflow.

pact commit

Commit current drift on the node as a configuration entry in the journal.

pact commit -m "tuned hugepages for ML training"
pact commit -m "added NFS mount for datasets"

| Option | Description |
|---|---|
| -m <MESSAGE> | Commit message (required) |

The commit is scoped to the current vCluster (from --vcluster, PACT_VCLUSTER, or config file). On regulated vClusters, this triggers approval workflow.

pact rollback

Roll back to a previous configuration state by sequence number.

pact rollback 42                     # Roll back to seq 42

| Option | Description |
|---|---|
| <seq> | Target sequence number to roll back to (required) |

Use pact log to find the sequence number you want to roll back to.

pact apply

Apply a declarative configuration spec from a TOML file.

pact apply overlay.toml              # Apply a spec file
pact apply /tmp/hugepages.toml       # Apply from absolute path

| Option | Description |
|---|---|
| <spec> | Path to TOML spec file (required) |

The spec file format matches the vCluster overlay format. See config/vcluster-examples/overlays.toml for the schema.
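As a rough illustration of what a spec might contain (the key names below are assumptions about the overlay schema; the authoritative format lives in config/vcluster-examples/overlays.toml):

```toml
# Illustrative only -- key names here are assumptions, not the canonical
# schema. Consult config/vcluster-examples/overlays.toml for the real format.
[overlay]
vcluster = "ml-training"

[overlay.sysctl]
"vm.nr_hugepages" = "2048"

[overlay.services.chronyd]
enabled = true
```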


Exec Commands

These commands execute operations on remote nodes. Requires pact-ops-{vcluster} or pact-platform-admin role. All executions are logged to the journal.

pact exec

Run a whitelisted command on a remote node. The command and its output are recorded in the immutable audit log.

pact exec node-042 -- nvidia-smi
pact exec node-042 -- dmesg -T | tail -20
pact exec node-042 -- cat /proc/meminfo

| Option | Description |
|---|---|
| <node> | Target node ID (required) |
| -- <command...> | Command and arguments (after --, required) |

Commands must be on the agent’s whitelist. Non-whitelisted commands return exit code 6.

pact shell

Open an interactive shell session on a remote node. This replaces SSH access.

pact shell node-042

| Option | Description |
|---|---|
| <node> | Target node ID (required) |

Inside the shell, commands are subject to the whitelist policy configured on the agent (whitelist_mode in agent config). The session is fully logged.

pact:node-042> dmesg | tail -5
pact:node-042> cat /etc/hostname
pact:node-042> exit

pact service

Manage services on a node.

pact service status

pact service status                  # All services
pact service status chronyd          # Specific service

pact service restart

pact service restart nvidia-persistenced

Restarts are subject to the commit window. If the commit window has expired, you must commit or extend first.

pact service logs

pact service logs lattice-node-agent

Streams the last 50 log lines for the service.


Diagnostic Commands

Structured diagnostic log retrieval from nodes. Replaces ad-hoc pact exec for common log retrieval tasks with a purpose-built command that enforces server-side filtering.

pact diag

Collect diagnostic logs from one or more nodes. Logs are retrieved directly from the agent, which reads local sources (dmesg via /dev/kmsg, syslog, service logs under /run/pact/logs/). Grep filtering and line limits are enforced on the agent side, so only matching data crosses the network.

pact diag node-042                              # All sources, last 200 lines
pact diag node-042 --lines 500                  # Last 500 lines per source
pact diag node-042 --source dmesg               # Only kernel messages
pact diag node-042 --service nvidia-persistenced # Logs for a specific service
pact diag node-042 --grep "ECC"                 # Server-side grep across all sources
pact diag --vcluster ml-training                # Fleet-wide: all nodes in vCluster
pact diag --vcluster ml-training --grep "ECC"   # Fleet-wide log grep

| Option | Description |
|---|---|
| [node] | Target node ID (required unless --vcluster is given) |
| --lines <N> | Number of lines per source (default: 200) |
| --source <SOURCE> | Log source filter: dmesg, syslog, or service (default: all) |
| --service <NAME> | Restrict to a specific service’s logs (implies --source service) |
| --grep <PATTERN> | Server-side grep pattern applied before streaming |
| --vcluster <NAME> | Fleet mode: query all nodes in the vCluster (fans out concurrently) |

In fleet mode (--vcluster), output lines are prefixed with [node_id]. Unreachable agents produce a warning and partial results are returned.

Roles: Requires pact-ops-{vcluster} or pact-platform-admin role (LOG1).

Design notes:

  • Grep and line limit are enforced on the agent, not the CLI (LOG2, LOG3).
  • Fleet fan-out: max 50 concurrent agent connections, 5s timeout per agent.
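For ad-hoc tooling outside pact diag, the same bounded fan-out pattern can be approximated in shell. The sketch below is illustrative only; `probe` is a stand-in for a real per-node call such as `timeout 5 pact diag <node> --grep ECC`:

```shell
# Illustrative bounded fan-out, mirroring what pact diag does internally:
# up to 50 concurrent workers, each output line prefixed with [node_id].
probe() {
  node="$1"
  # Stand-in for: timeout 5 pact diag "$node" --grep ECC
  echo "diag output for $node" | sed "s/^/[$node] /"
}
export -f probe  # bash-specific: make the function visible to subshells

printf '%s\n' node-001 node-002 node-003 |
  xargs -P 50 -I{} bash -c 'probe "$@"' _ {}
```

Output order is nondeterministic under concurrency, which is why the real command prefixes every line with its node ID.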

Admin Commands

These commands handle emergency operations and approval workflows.

pact emergency

Enter or exit emergency mode. Emergency mode relaxes policy constraints while maintaining the full audit trail. Use only for genuine emergencies.

pact emergency start

pact emergency start -r "GPU node unresponsive, need unrestricted diagnostics"

| Option | Description |
|---|---|
| -r <REASON> | Reason for entering emergency mode (required) |

Emergency mode extends the commit window to 4 hours (configurable via emergency_window_seconds) and relaxes whitelist restrictions.
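In agent-config terms this might look as follows (emergency_window_seconds is named in the docs, but its exact placement is an assumption; check config/production.toml for the authoritative layout):

```toml
# Illustrative fragment -- the section placement of emergency_window_seconds
# is an assumption; see config/production.toml.
[agent.commit_window]
base_window_seconds = 900          # normal window: 15 minutes
emergency_window_seconds = 14400   # emergency mode: 4 hours
```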

pact emergency end

pact emergency end                   # End your own emergency
pact emergency end --force           # Force-end another admin's emergency

| Option | Description |
|---|---|
| --force | Force-end another admin’s emergency session |

pact approve

Manage the two-person approval workflow for regulated vClusters.

pact approve list

pact approve list

Lists all pending approval requests across vClusters you have access to.

pact approve accept

pact approve accept ap-7f3a

| Option | Description |
|---|---|
| <id> | Approval ID (required) |

You cannot approve your own request. The approver must have pact-regulated-{vcluster} or pact-platform-admin role.

pact approve deny

pact approve deny ap-7f3a -m "change window not scheduled"

| Option | Description |
|---|---|
| <id> | Approval ID (required) |
| -m <MESSAGE> | Denial reason (required) |

pact extend

Extend the current commit window.

pact extend                          # Extend by 15 minutes (default)
pact extend 30                       # Extend by 30 minutes

| Option | Description |
|---|---|
| [mins] | Additional minutes (default: 15) |

Delta Promotion

pact promote

Export committed node deltas as overlay TOML. This aggregates per-node configuration changes into a vCluster-wide overlay spec that can be reviewed, edited, and applied with pact apply.

pact promote node-042                # Export deltas as TOML to stdout
pact promote node-042 --dry-run      # Preview without generating output
pact promote node-042 > changes.toml # Export to file, then: pact apply changes.toml

| Option | Description |
|---|---|
| <node> | Node ID whose committed deltas to export (required) |
| --dry-run | Show which deltas would be exported without generating TOML |

If other nodes in the vCluster have local changes on the same config keys, promote detects the conflict and requires explicit acknowledgment (overwrite or keep local). See failure-modes.md FM-8.

Requires pact-ops-{vcluster} or pact-platform-admin role.


Node Enrollment Commands

These commands manage node enrollment, assignment, and inventory. Requires pact-ops-{vcluster} or pact-platform-admin role.

pact node enroll

Register a node with a hardware identity.

pact node enroll compute-001 --mac aa:bb:cc:dd:ee:01
pact node enroll compute-002 --mac aa:bb:cc:dd:ee:02 --bmc-serial SN12345

| Option | Description |
|---|---|
| <node_id> | Node ID to enroll (required) |
| --mac <MAC> | Primary MAC address (required) |
| --bmc-serial <SERIAL> | BMC serial number (optional) |

pact node import

Batch-import nodes from OpenCHAMI SMD inventory. Discovers nodes via the SMD /hsm/v2/State/Components API and enrolls them with their hardware identity (MAC addresses from /hsm/v2/Inventory/EthernetInterfaces).

Requires PACT_OPENCHAMI_SMD_URL to be configured.

pact node import                        # Import all nodes from SMD
pact node import --group Compute        # Import only nodes with role "Compute"

| Option | Description |
|---|---|
| --group <ROLE> | Filter by SMD role (e.g., “Compute”, “Service”) |

Environment variables:

| Variable | Description |
|---|---|
| PACT_OPENCHAMI_SMD_URL | OpenCHAMI SMD base URL (required) |
| PACT_OPENCHAMI_TOKEN | OpenCHAMI auth token (optional) |

pact node decommission

Decommission an enrolled node.

pact node decommission compute-001
pact node decommission compute-001 --force

| Option | Description |
|---|---|
| <node_id> | Node ID to decommission (required) |
| --force | Force decommission even with active sessions |

pact node assign

Assign a node to a vCluster.

pact node assign compute-001 --vcluster ml-training

| Option | Description |
|---|---|
| <node_id> | Node ID (required) |
| --vcluster <NAME> | Target vCluster (required) |

pact node unassign

Unassign a node from its vCluster.

pact node unassign compute-001

| Option | Description |
|---|---|
| <node_id> | Node ID (required) |

pact node move

Move a node between vClusters.

pact node move compute-001 --to-vcluster dev-sandbox

| Option | Description |
|---|---|
| <node_id> | Node ID (required) |
| --to-vcluster <NAME> | Target vCluster (required) |

pact node list

List enrolled nodes with optional filters.

pact node list
pact node list --vcluster ml-training
pact node list --state active
pact node list --unassigned

| Option | Description |
|---|---|
| --state <STATE> | Filter by enrollment state (active, inactive, registered, revoked) |
| --vcluster <NAME> | Filter by vCluster |
| --unassigned | Show only unassigned nodes |

pact node inspect

Show detailed enrollment information for a node.

pact node inspect compute-001

| Option | Description |
|---|---|
| <node_id> | Node ID to inspect (required) |

Node Lifecycle Commands

These commands manage node state transitions via delegation to external systems. Requires pact-ops-{vcluster} or pact-platform-admin role. All operations are logged to the journal.

pact drain

Drain workloads from a node. Delegates to lattice to gracefully migrate running workloads before taking the node out of service.

pact drain node-042

| Option | Description |
|---|---|
| <node> | Target node ID (required) |

pact undrain

Cancel a drain operation, returning a draining node to Ready state.

pact undrain node-042

| Option | Description |
|---|---|
| <node> | Target node ID (required) |

pact cordon

Mark a node as unschedulable. Existing workloads continue running but no new workloads will be placed on the node.

pact cordon node-042

| Option | Description |
|---|---|
| <node> | Target node ID (required) |

pact uncordon

Remove a cordon from a node, making it schedulable again.

pact uncordon node-042

| Option | Description |
|---|---|
| <node> | Target node ID (required) |

pact reboot

Reboot a node via BMC. Delegates to the configured node management backend (CSM CAPMC or OpenCHAMI Redfish).

pact reboot node-042

| Option | Description |
|---|---|
| <node> | Target node ID (required) |

pact reimage

Re-image a node via the configured node management backend. CSM creates a BOS reboot session; OpenCHAMI triggers a Redfish power cycle (BSS serves the new image).

pact reimage node-042

| Option | Description |
|---|---|
| <node> | Target node ID (required) |

Group Commands

Manage vCluster groups and their policies.

pact group list

List all vCluster groups.

pact group list
pact group list --output json

pact group show

Show details for a specific group.

pact group show ml-training

| Option | Description |
|---|---|
| <group> | Group name (required) |

pact group set-policy

Update the policy for a group.

pact group set-policy ml-training --file policy.toml

| Option | Description |
|---|---|
| <group> | Group name (required) |
| --file <PATH> | Path to policy TOML file (required) |

Blacklist Commands

Manage drift detection exclusion patterns.

pact blacklist list

List current blacklist patterns for a node or vCluster.

pact blacklist list
pact blacklist list --vcluster ml-training

pact blacklist add

Add a pattern to the drift detection blacklist.

pact blacklist add "/var/cache/**"
pact blacklist add "/opt/scratch/**" --vcluster ml-training

| Option | Description |
|---|---|
| <pattern> | Glob pattern to exclude from drift detection (required) |
| --vcluster <NAME> | Apply to a specific vCluster (optional, defaults to node-local) |

pact blacklist remove

Remove a pattern from the drift detection blacklist.

pact blacklist remove "/var/cache/**"

| Option | Description |
|---|---|
| <pattern> | Glob pattern to remove (required) |

Supercharged Commands (pact + lattice)

These commands combine pact and lattice data into unified views. They require PACT_LATTICE_ENDPOINT to be configured (or --lattice-endpoint flag).

Note: Lattice commands (including node lifecycle delegation) are hidden from pact --help when PACT_LATTICE_ENDPOINT is not set. They are always compiled in and can be invoked directly — they return a clear “not configured” error. Set the environment variable to see them in help output.

pact jobs list

List running job allocations across nodes.

pact jobs list                           # All jobs in default vCluster
pact jobs list --node node-042           # Jobs on a specific node
pact jobs list --vcluster ml-training    # Jobs in a vCluster

| Option | Description |
|---|---|
| --node <NODE> | Filter by node ID |
| --vcluster <NAME> | Filter by vCluster |

pact jobs cancel

Cancel a stuck or runaway job allocation.

pact jobs cancel alloc-7f3a

| Option | Description |
|---|---|
| <id> | Allocation ID to cancel (required) |

Requires pact-ops-{vcluster} or pact-platform-admin role.

pact jobs inspect

Show detailed information about a job allocation, including resource requests, node placement, and liveness probe configuration (displayed after the Resources section when probes are configured).

pact jobs inspect alloc-7f3a

| Option | Description |
|---|---|
| <id> | Allocation ID to inspect (required) |

pact queue

Show the scheduling queue status from lattice.

pact queue                               # Default vCluster
pact queue --vcluster ml-training        # Specific vCluster

| Option | Description |
|---|---|
| --vcluster <NAME> | Filter by vCluster |

pact cluster

Show combined Raft cluster health for both pact-journal and lattice quorums.

pact cluster

Displays leader status, term, committed index, and member health for both the pact journal Raft group and the lattice Raft group.

pact audit

Show a unified audit trail combining pact journal events and lattice audit events.

pact audit                               # pact events only (default)
pact audit --source all                  # Combined pact + lattice events
pact audit --source lattice              # Lattice events only
pact audit -n 50                         # Last 50 entries

| Option | Description |
|---|---|
| --source <SOURCE> | Event source: pact (default), lattice, or all |
| -n <COUNT> | Number of entries to show (default: 20) |

pact accounting

Show resource usage accounting (GPU hours, CPU hours) aggregated from lattice.

pact accounting                          # Default vCluster
pact accounting --vcluster ml-training   # Specific vCluster

| Option | Description |
|---|---|
| --vcluster <NAME> | Filter by vCluster |

pact health

Combined system health check across pact and lattice components.

pact health

Reports health status for: pact-journal Raft quorum, pact-agent connectivity, lattice scheduler, lattice node-agents, OPA policy engine, telemetry pipeline, and Lattice Services (service/endpoint counts from the service registry).

pact services list

List services registered in the lattice service registry.

pact services list                          # All services
pact services list --vcluster ml-training   # Filter by vCluster

| Option | Description |
|---|---|
| --vcluster <NAME> | Filter by vCluster |

pact services lookup

Look up a specific service by name in the lattice service registry.

pact services lookup my-inference-api

| Option | Description |
|---|---|
| <name> | Service name to look up (required) |

Returns service details including registered endpoints, health status, and vCluster association.

pact dag

Manage DAG (directed acyclic graph) workflows in lattice.

pact dag list                            # List all DAGs
pact dag list --tenant ml-team           # Filter by tenant
pact dag list --state running            # Filter by state
pact dag inspect dag-7f3a                # Show DAG details and steps
pact dag cancel dag-7f3a                 # Cancel a running DAG

| Subcommand | Options | Description |
|---|---|---|
| list | --tenant, --state, -n | List DAG workflows |
| inspect | <id> | Show DAG details and allocation status |
| cancel | <id> | Cancel a DAG and its allocations |

pact budget

Query resource budget and usage tracking from lattice.

pact budget tenant ml-team               # Tenant GPU/node hours
pact budget tenant ml-team --days 30     # Last 30 days
pact budget user alice                   # User usage across all tenants

| Subcommand | Options | Description |
|---|---|---|
| tenant | <id>, --days | GPU hours, node hours, budget fractions for a tenant |
| user | <id>, --days | Usage breakdown by tenant for a user |

pact backup

Manage lattice Raft state backups. Requires pact-platform-admin role.

pact backup create /path/to/backup.bin   # Create a backup
pact backup verify /path/to/backup.bin   # Verify backup integrity
pact backup restore /path/to/backup.bin --confirm  # Restore from backup

| Subcommand | Options | Description |
|---|---|---|
| create | <path> | Snapshot lattice state to file (audit-logged) |
| verify | <path> | Check backup validity, show snapshot term/index |
| restore | <path>, --confirm | Restore lattice state (destructive, audit-logged) |

Note: restore requires the --confirm flag — it replaces the entire lattice scheduler state and cannot be undone.

pact nodes

Query lattice node inventory with hardware and ownership details.

pact nodes list                          # All nodes
pact nodes list --state draining         # Filter by state
pact nodes list --vcluster ml-training   # Filter by vCluster
pact nodes inspect node-042              # Full node details

| Subcommand | Options | Description |
|---|---|---|
| list | --state, --vcluster, -n | Tabular view: state, GPUs, cores, memory, vCluster |
| inspect | <node_id> | Full details: hardware, ownership, allocations, heartbeat |

Configuration File

The CLI reads its configuration from ~/.config/pact/cli.toml:

endpoint = "https://journal.example.com:9443"
default_vcluster = "ml-training"
output_format = "text"
timeout_seconds = 30
token_path = "~/.config/pact/token"

All fields are optional and have sensible defaults. See the Getting Started guide for the full precedence chain.

Admin Operations

This guide covers day-to-day operational workflows with pact. All operations described here are authenticated via OIDC, authorized via OPA policy, and recorded in the immutable journal.

Roles

| Role | Access | Typical Users |
|---|---|---|
| pact-platform-admin | Full system access | 2-3 people per site |
| pact-ops-{vcluster} | Day-to-day ops for a vCluster | Ops engineers |
| pact-viewer-{vcluster} | Read-only access | Monitoring teams, auditors |
| pact-regulated-{vcluster} | Ops with two-person approval | Sensitive workload admins |

Day-to-Day Operations

Check node status

# Overview of all nodes in your vCluster
pact status --vcluster ml-training

# Detailed status for a specific node
pact status node-042

View drift

Drift is the difference between declared state (in the journal) and actual state (on the node). pact uses blacklist-based detection – everything is monitored except explicitly excluded paths.

# See what has drifted on a node
pact diff node-042

# See committed deltas not yet promoted to the vCluster overlay
pact diff --committed node-042

Commit drift

When drift is intentional (e.g., you tuned a sysctl), commit it to make it the new declared state:

pact commit -m "tuned vm.nr_hugepages for training workload"

Commits happen within a time window (default 15 minutes). If the window expires, drift is flagged for review rather than silently discarded.

Roll back

If a configuration change caused problems, roll back to a known-good state:

# Find the sequence number to roll back to
pact log -n 20

# Roll back
pact rollback 42

Extend the commit window

If you need more time to finalize changes before committing:

pact extend          # +15 minutes (default)
pact extend 30       # +30 minutes

Apply a configuration spec

For bulk or repeatable changes, write a TOML spec and apply it:

pact apply config/vcluster-examples/overlays.toml

This updates the vCluster overlay in the journal. All nodes in the vCluster will converge to the new declared state.

Emergency Mode

Emergency mode is for situations where normal policy constraints would prevent necessary diagnostic or repair actions. It relaxes whitelist restrictions and extends the commit window, while maintaining the full audit trail.

When to use emergency mode

  • Node is degraded and you need unrestricted diagnostic access
  • A service is failing and you need to inspect or modify files outside the whitelist
  • You need to make urgent changes that would normally require approval

Entering emergency mode

pact emergency start -r "GPU ECC errors on node-042, need unrestricted diagnostics"

This:

  1. Records the emergency entry in the journal with your identity and reason
  2. Extends the commit window to 4 hours (configurable)
  3. Relaxes command whitelist restrictions on the node
  4. Sends a notification via Loki/Grafana alerting

Working in emergency mode

All commands are still logged. Emergency mode does not bypass authentication or audit – it only relaxes operational constraints.

pact shell node-042
pact:node-042> nvidia-smi -q -d ECC
pact:node-042> dmesg | grep -i error
pact:node-042> cat /var/log/pact-agent.log
pact:node-042> exit

Exiting emergency mode

pact emergency end

If another admin left an emergency session open, a platform admin can force-end it:

pact emergency end --force

Audit implications

Emergency mode entries are flagged in the journal and appear prominently in audit reports. For regulated vClusters (7-year retention), emergency entries include:

  • Who entered emergency mode and when
  • The stated reason
  • Every command executed during the session
  • Who ended emergency mode and when

Two-Person Approval Workflow

Regulated vClusters (those with two_person_approval = true) require a second admin to approve state-changing operations before they take effect.

Submitting a change

# Admin A commits a change on a regulated vCluster
pact commit -m "add audit-forwarder service to sensitive-compute"

Output:

Approval required (two-person policy on vcluster: sensitive-compute)
Pending approval: ap-7f3a (expires in 30 min)
Waiting for approval... (Ctrl-C to background)

Reviewing and approving

# Admin B lists pending approvals
pact approve list

# Review the change details, then approve
pact approve accept ap-7f3a

Denying a change

pact approve deny ap-7f3a -m "not scheduled in the change window"

Rules

  • You cannot approve your own request
  • Approvals expire after a configurable timeout (default 30 minutes)
  • Expired requests are automatically rolled back
  • Both the request and the approval/denial are recorded in the journal

Service Management

pact-agent supervises services on compute nodes. You can check status, restart services, and view logs remotely.

Check service status

pact service status                  # All services on local node
pact service status chronyd          # Specific service

Restart a service

pact service restart nvidia-persistenced

Service restarts are subject to the commit window. If the window has expired, extend it first:

pact extend
pact service restart nvidia-persistenced

View service logs

pact service logs lattice-node-agent

Streams the last 50 lines. For continuous streaming, use pact watch.

Remote Command Execution

pact replaces SSH for all admin access to compute nodes. Commands are executed via the agent’s gRPC exec endpoint.

Single command

pact exec node-042 -- nvidia-smi
pact exec node-042 -- cat /proc/meminfo
pact exec node-042 -- dmesg -T

In strict mode, commands must be on the agent’s whitelist; in other modes the whitelist is advisory. The whitelist mode is configured per agent:

| Mode | Behavior |
|------|----------|
| strict | Only explicitly whitelisted commands are allowed |
| learning | All commands are allowed but non-whitelisted ones are logged for review |
| bypass | All commands allowed (development only) |
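The decision the agent makes per command can be summarized with a small shell sketch. This is illustrative only, not the agent’s actual implementation:

```shell
# Illustrative whitelist decision logic — not the agent's code.
decide() {
    mode=$1 whitelisted=$2
    case "$mode" in
        strict)   [ "$whitelisted" = yes ] && echo "allow" || echo "deny" ;;
        learning) [ "$whitelisted" = yes ] && echo "allow" || echo "allow (logged for review)" ;;
        bypass)   echo "allow" ;;
    esac
}
decide strict no      # prints "deny"
decide learning no    # prints "allow (logged for review)"
decide bypass no      # prints "allow"
```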

Interactive shell

pact shell node-042

The shell provides a restricted environment on the node. Same whitelist rules apply.

Using the MCP Server

pact includes an MCP (Model Context Protocol) server for AI-assisted operations. The MCP server exposes 24 tools that mirror the CLI commands.

Starting the MCP server

PACT_ENDPOINT="http://localhost:9443" pact-mcp

The server communicates via JSON-RPC 2.0 over stdio. Connect it to Claude Code or any MCP-compatible AI agent.
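For example, a standard MCP tools/call request on stdin, shown here invoking the pact_status tool (argument names are an assumption mirroring the CLI flags):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "pact_status",
    "arguments": { "vcluster": "ml-training" }
  }
}
```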

Available tools

| Tool | Category | Description |
|------|----------|-------------|
| pact_status | Read | Query node/vCluster state |
| pact_diff | Read | Show declared vs actual differences |
| pact_log | Read | Query configuration history |
| pact_cap | Read | Show hardware capabilities |
| pact_service_status | Read | Query service status |
| pact_query_fleet | Read | Fleet-wide queries |
| pact_commit | Write | Commit drift |
| pact_rollback | Write | Roll back configuration |
| pact_apply | Write | Apply a config spec |
| pact_exec | Write | Run a remote command |
| pact_emergency | Admin | Emergency mode (restricted to human admins) |
| pact_jobs_list | Lattice | List running allocations |
| pact_queue_status | Lattice | Scheduling queue depth |
| pact_cluster_health | Lattice | Combined Raft cluster status |
| pact_system_health | Lattice | Combined system health check |
| pact_accounting | Lattice | Resource usage accounting |
| pact_undrain | Lattice | Cancel drain on a node |
| pact_dag_list | Lattice | List DAG workflows |
| pact_dag_inspect | Lattice | DAG details and step status |
| pact_budget | Lattice | Tenant or user budget/usage |
| pact_backup_create | Admin | Create lattice state backup |
| pact_backup_verify | Lattice | Verify backup integrity |
| pact_nodes_list | Lattice | List nodes with state |
| pact_node_inspect | Lattice | Node hardware/ownership details |

The MCP server connects to the journal (config operations), agent (exec/shell), and lattice (delegation). If any backend is unreachable, it falls back to stub responses. Destructive operations (dag cancel, backup restore) are excluded from MCP — use the CLI for those.

Environment variables

| Variable | Description | Default |
|----------|-------------|---------|
| PACT_ENDPOINT | Journal gRPC endpoint | http://localhost:9443 |
| PACT_AGENT_ENDPOINT | Agent gRPC endpoint | http://localhost:9445 |
| PACT_MCP_TOKEN | Bearer token for MCP→agent authentication | (none — warns if unset) |
| PACT_LATTICE_ENDPOINT | Lattice gRPC endpoint for delegation | (none — lattice tools disabled) |
| PACT_LATTICE_TOKEN | Bearer token for lattice API | (none) |

Deployment Guide

This guide covers deploying pact in production. pact has three deployable components:

  1. pact-journal – 3 or 5 node Raft quorum (management nodes)
  2. pact-agent – every compute node
  3. pact CLI – admin workstations

Deploy scripts in scripts/deploy/ automate the full deployment. They are cloud-agnostic and reusable on bare metal. For GCP-specific infrastructure (VMs, networking), see infra/gcp/.

OS Requirements

Release binaries are built on Rocky Linux 9 (glibc 2.34). Compatible distributions:

  • RHEL 9+ / Rocky 9+ / Alma 9+
  • Ubuntu 22.04+
  • Debian 12+ (bookworm)
  • SLES 15 SP4+

Debian 11 and Ubuntu 20.04 are not supported (glibc too old).

Release artifacts per architecture (x86_64 and aarch64):

  • pact-platform-{arch}.tar.gz — pact (CLI), pact-journal, pact-mcp
  • pact-agent-{arch}-pact.tar.gz — agent with PactSupervisor (PID 1 mode)
  • pact-agent-{arch}-systemd.tar.gz — agent with systemd backend

All agent variants include GPU support (NVIDIA + AMD) — no separate per-GPU builds.

Important: When building from source, use --features ebpf for the agent to enable eBPF-based state observation. Without this feature, eBPF probes are compiled out.

Node Name Resolution

Pact uses DNS-based agent discovery: pact exec <node-id> resolves the node ID to http://<node-id>:9445. Ensure node IDs are resolvable via DNS or /etc/hosts.
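The derivation is plain string construction followed by an ordinary DNS or /etc/hosts lookup:

```shell
# How an exec target is derived from a node ID (string construction only;
# the lookup itself is standard name resolution).
node_id="node-042"
agent_endpoint="http://${node_id}:9445"
echo "$agent_endpoint"   # prints "http://node-042:9445"
```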

Agent Auth Configuration

The agent shell server validates incoming OIDC tokens independently from the journal. Configure [agent.shell.auth] in the agent TOML:

[agent.shell.auth]
issuer = "https://your-idp/realm"
audience = "pact-cli"
jwks_url = "https://your-idp/realm/protocol/openid-connect/certs"
# Optional: HMAC secret for dev/test (production uses JWKS only)
# hmac_secret = "shared-secret"

Without this section, the agent fails closed: JWKS validation is required and no HMAC secret is accepted.

Prerequisites

Download release artifacts from GitHub releases. You can create a provisioning bundle for easy distribution:

# Download release artifacts
mkdir -p /tmp/pact-release
gh release download v2026.1.196 --dir /tmp/pact-release \
    --pattern "pact-platform-x86_64.tar.gz" \
    --pattern "pact-agent-x86_64-pact.tar.gz"

# Create provisioning bundle (includes scripts + systemd units)
scripts/deploy/make-provision-bundle.sh /tmp/pact-release /tmp/pact-provision.tar.gz

# Upload to all nodes (single file, no scp --recurse issues)
scp /tmp/pact-provision.tar.gz node:/tmp/
ssh node 'cd /tmp && tar xzf pact-provision.tar.gz'

Or manually:

  • Unpack binaries to /opt/pact/bin/ on all nodes
  • Copy infra/systemd/ to /opt/pact/systemd/ on all nodes
  • Copy scripts/deploy/ to /opt/pact/deploy/ on all nodes

Step 1: Create CA and distribute to all nodes

# On the first management node:
/opt/pact/deploy/setup-ca.sh /etc/pact/certs mgmt-1

# Then copy /etc/pact/ca/ to ALL other nodes (management + compute):
scp -r /etc/pact/ca/ mgmt-2:/etc/pact/ca/
scp -r /etc/pact/ca/ mgmt-3:/etc/pact/ca/
scp -r /etc/pact/ca/ compute-1:/etc/pact/ca/
# ...etc

Step 2: Install journal on management nodes

# Peer format: id=addr (id matches node-id argument)
PEERS="1=mgmt-1:9443,2=mgmt-2:9443,3=mgmt-3:9443"

# Node 1 — with --bootstrap to initialize the Raft cluster
/opt/pact/deploy/install-management.sh 1 mgmt-1 "$PEERS" --bootstrap

# Nodes 2 and 3 — without --bootstrap (join existing cluster)
/opt/pact/deploy/install-management.sh 2 mgmt-2 "$PEERS"
/opt/pact/deploy/install-management.sh 3 mgmt-3 "$PEERS"

# Wait ~10 seconds for Raft membership replication, then verify:
/opt/pact/deploy/bootstrap-quorum.sh mgmt-1:9443
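The id=addr pair format can be sanity-checked with a short shell loop before passing it to the install script (plain string handling, no pact binaries required):

```shell
# Split the PEERS string into id/addr pairs and print them.
PEERS="1=mgmt-1:9443,2=mgmt-2:9443,3=mgmt-3:9443"
IFS=','
for pair in $PEERS; do
    id=${pair%%=*}      # text before the first '='
    addr=${pair#*=}     # text after the first '='
    echo "node $id -> $addr"
done
unset IFS
```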

Step 3: Install agent on compute nodes

JOURNALS="mgmt-1:9443,mgmt-2:9443,mgmt-3:9443"

/opt/pact/deploy/install-compute.sh compute-1 ml-training "$JOURNALS"
/opt/pact/deploy/install-compute.sh compute-2 ml-training "$JOURNALS"

Step 4: Install monitoring (optional)

/opt/pact/deploy/install-monitoring.sh mgmt-1,mgmt-2,mgmt-3

Step 5: Validate

# Run test matrix (v1=pact-only, v2=systemd, v3=pact+lattice, v4=systemd+lattice)
/opt/pact/deploy/validate.sh v1 mgmt-1:9443 compute-1,compute-2

Manual Deployment (step-by-step)

Journal Quorum Setup

The journal is pact’s distributed immutable log, backed by a Raft consensus group. Deploy it on dedicated management nodes or co-located with lattice (see ADR-001).

Install the binary

Download the platform binaries for your architecture from the latest release:

curl -LO https://github.com/witlox/pact/releases/latest/download/pact-platform-x86_64.tar.gz
tar xzf pact-platform-x86_64.tar.gz -C /usr/local/bin/

This installs pact-journal, pact (CLI), and pact-mcp.

3-Node Quorum (Standard)

A 3-node quorum tolerates 1 node failure. Suitable for most deployments.

Create /etc/pact/journal.env on each journal node:

journal-1:

PACT_JOURNAL_NODE_ID=1
PACT_JOURNAL_LISTEN=0.0.0.0:9443
PACT_JOURNAL_DATA_DIR=/var/lib/pact/journal
PACT_JOURNAL_PEERS=1=journal-1:9443,2=journal-2:9443,3=journal-3:9443

journal-2:

PACT_JOURNAL_NODE_ID=2
PACT_JOURNAL_LISTEN=0.0.0.0:9443
PACT_JOURNAL_DATA_DIR=/var/lib/pact/journal
PACT_JOURNAL_PEERS=1=journal-1:9443,2=journal-2:9443,3=journal-3:9443

journal-3:

PACT_JOURNAL_NODE_ID=3
PACT_JOURNAL_LISTEN=0.0.0.0:9443
PACT_JOURNAL_DATA_DIR=/var/lib/pact/journal
PACT_JOURNAL_PEERS=1=journal-1:9443,2=journal-2:9443,3=journal-3:9443

Bootstrap: On the first deploy, run pact-journal --bootstrap on node 1 to initialize the Raft membership. The membership replicates to nodes 2 and 3 automatically within seconds. Do NOT use --bootstrap on subsequent restarts or on nodes 2/3.

Create /etc/pact/journal.toml (same on all nodes, node ID comes from env):

[journal]
listen_addr = "0.0.0.0:9443"
data_dir = "/var/lib/pact/journal"

[journal.raft]
members = [
    "1:journal-1.mgmt:9444",
    "2:journal-2.mgmt:9444",
    "3:journal-3.mgmt:9444"
]
snapshot_interval = 10000

[journal.streaming]
max_concurrent_boot_streams = 15000

[policy]
enabled = true

[policy.iam]
oidc_issuer = "https://auth.example.org/realms/hpc"
oidc_audience = "pact"

[policy.engine]
type = "opa"
opa_endpoint = "http://localhost:8181/v1/data/pact"

[telemetry]
log_level = "info"
log_format = "json"
prometheus_enabled = true
prometheus_listen = "0.0.0.0:9091"
loki_enabled = true
loki_endpoint = "http://loki.mgmt:3100/loki/api/v1/push"

5-Node Quorum (High Availability)

A 5-node quorum tolerates 2 node failures. Recommended for large deployments or when journal availability is critical (e.g., boot-time config streaming for thousands of nodes).

Configuration is identical to 3-node, with two additional members:

[journal.raft]
members = [
    "1:journal-1.mgmt:9444",
    "2:journal-2.mgmt:9444",
    "3:journal-3.mgmt:9444",
    "4:journal-4.mgmt:9444",
    "5:journal-5.mgmt:9444"
]

Co-Located with Lattice

If running on the same management nodes as lattice, use separate ports and data directories. pact is the incumbent: journal quorum starts before lattice.

| Component | Raft Port | gRPC Port | Data Dir |
|-----------|-----------|-----------|----------|
| pact-journal | 9444 | 9443 | /var/lib/pact/journal |
| lattice-server | 9000 | 50051 | /var/lib/lattice/raft |

Lattice bootstrap: Like pact-journal, lattice-server requires --bootstrap on first start of node 1 to initialize the Raft cluster. Subsequent restarts must NOT use --bootstrap. See lattice deployment guide for details.

Lattice agent: Deploy lattice-agent on compute nodes. It registers with the lattice scheduler via heartbeats. Pact’s supercharged commands (drain, cordon, uncordon) delegate to lattice — they require the node to be registered in both pact (enrollment) and lattice (agent heartbeat).

# Set delegation endpoint and auth token so pact CLI can reach lattice
export PACT_LATTICE_ENDPOINT=http://mgmt-1:50051
export PACT_LATTICE_TOKEN="$PACT_TOKEN"   # reuse the same OIDC token
pact drain compute-1    # → delegates to lattice (authenticated)

Important: PACT_LATTICE_TOKEN is required when lattice-server enforces auth. The pact admin’s OIDC token works if both systems share the same IdP and lattice recognizes pact_role claims. Set LATTICE_OIDC_HMAC_SECRET on lattice-server to the same value as PACT_OIDC_HMAC_SECRET for HMAC token compatibility.

Lattice agent auth: Set LATTICE_AGENT_TOKEN env var on compute nodes for the lattice-agent to authenticate to the server (machine identity via token fallback when mTLS/SPIRE is not configured).

Drain behavior: Drain with active allocations transitions to Draining. The node moves to Drained only after all allocations complete or are cancelled. Undrain only works from Drained state — cancel remaining allocations first if immediate undrain is needed.

Port Summary

| Port | Service | Protocol |
|------|---------|----------|
| 9443 | pact-journal gRPC | gRPC (config, streaming) |
| 9444 | pact-journal Raft | Raft consensus |
| 9445 | pact-agent shell/exec | gRPC |
| 9091 | pact-journal metrics | HTTP (Prometheus) |

Agent Installation

Install the binary

Download the agent variant for your architecture and supervisor mode from the latest release:

# PactSupervisor mode (PID 1 / diskless HPC) — includes NVIDIA + AMD GPU support
curl -LO https://github.com/witlox/pact/releases/latest/download/pact-agent-x86_64-pact.tar.gz
tar xzf pact-agent-x86_64-pact.tar.gz
sudo mv pact-agent /usr/local/bin/

# Or systemd mode (traditional service)
curl -LO https://github.com/witlox/pact/releases/latest/download/pact-agent-x86_64-systemd.tar.gz
tar xzf pact-agent-x86_64-systemd.tar.gz
sudo mv pact-agent /usr/local/bin/

For diskless nodes, include the pact-agent binary in the base SquashFS image provisioned by OpenCHAMI.

Create the config

Create /etc/pact/agent.toml:

[agent]
enforcement_mode = "enforce"

[agent.supervisor]
backend = "pact"

[agent.journal]
endpoints = ["journal-1.mgmt:9443", "journal-2.mgmt:9443", "journal-3.mgmt:9443"]
tls_enabled = true
tls_cert = "/etc/pact/agent.crt"
tls_key = "/etc/pact/agent.key"
tls_ca = "/etc/pact/ca.crt"

[agent.observer]
ebpf_enabled = true
inotify_enabled = true
netlink_enabled = true

[agent.shell]
enabled = true
listen = "0.0.0.0:9445"
whitelist_mode = "strict"

[agent.commit_window]
base_window_seconds = 900
drift_sensitivity = 2.0
emergency_window_seconds = 14400

[agent.blacklist]
patterns = ["/tmp/**", "/var/log/**", "/proc/**", "/sys/**", "/dev/**",
            "/run/user/**", "/run/pact/**", "/run/lattice/**"]
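The blacklist semantics (monitor everything except excluded paths) can be illustrated with ordinary shell globbing. Note this sketch uses simple prefix matches, not the agent’s recursive `**` matcher:

```shell
# Rough illustration of blacklist matching — not the agent's matcher.
is_monitored() {
    case "$1" in
        /tmp/*|/var/log/*|/proc/*|/sys/*|/dev/*|/run/user/*|/run/pact/*|/run/lattice/*)
            echo "excluded" ;;
        *)  echo "monitored" ;;
    esac
}
is_monitored /etc/sysctl.d/90-hugepages.conf   # prints "monitored"
is_monitored /var/log/pact-agent.log           # prints "excluded"
```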

Node identity

The agent’s node_id is typically set via environment variable or auto-detected from the hostname. For diskless nodes, OpenCHAMI sets the hostname during PXE boot.

Identity and mTLS Setup

pact uses mutual TLS for agent-to-journal communication. Identity is provisioned automatically – no manual certificate generation is needed for agents or journal nodes.

SPIRE (primary, when deployed)

When SPIRE is deployed at the site, pact uses it as the primary identity provider. Agents and journal nodes receive SPIFFE SVIDs (X.509 certificates) via SPIRE node attestation. SPIRE handles certificate rotation automatically.

Configure the SPIRE agent socket in the agent config:

[agent.identity]
provider = "spire"
spire_socket = "/run/spire/agent.sock"

Ephemeral CA (fallback, default)

When SPIRE is not available, the journal quorum generates an ephemeral CA at startup. Agents enroll via the CSR workflow – no manual certificate provisioning is required.

Enrollment workflow:

  1. Platform admin enrolls a node: pact enroll <node> --hardware-id <hw-id>
  2. The journal records the enrollment with the node’s expected hardware identity
  3. Agent boots and presents its hardware identity (TPM, SMBIOS UUID, or MAC-based)
  4. Agent generates a keypair and submits a CSR to the journal
  5. Journal validates the hardware identity against the enrollment record
  6. Journal signs the CSR with the ephemeral CA and returns the certificate
  7. Agent uses the signed certificate for mTLS from this point forward

The ephemeral CA is regenerated when the journal quorum restarts. Agents automatically re-enroll to obtain new certificates.

CA cert distribution

Agents need the CA certificate bundle to validate journal server certificates. For diskless nodes, include the CA cert in the base SquashFS image:

  • /etc/pact/ca.crt – CA certificate bundle (all nodes)

For SPIRE deployments, the SPIRE trust bundle replaces this file. For ephemeral CA deployments, the journal serves the CA cert during the enrollment handshake.

Identity mapping

In PactSupervisor mode, identity mapping (pact-nss) is automatic – the agent maps SPIFFE IDs or certificate CNs to local UIDs without manual NSS configuration.

OIDC Provider Configuration

pact authenticates admins via OIDC tokens. Configure your identity provider (Keycloak, Auth0, Okta, etc.) with the following:

Create a pact client

  • Client ID: pact
  • Client type: Public (CLI) or Confidential (MCP server)
  • Redirect URI: http://localhost:8400/callback (for CLI login flow)

Define roles

Create these roles in your OIDC provider and assign them to users:

| Role | Description |
|------|-------------|
| pact-platform-admin | Full system access (2-3 people per site) |
| pact-ops-{vcluster} | Day-to-day ops for a vCluster |
| pact-viewer-{vcluster} | Read-only access |
| pact-regulated-{vcluster} | Ops with two-person approval |
| pact-service-agent | Machine identity for agents (mTLS) |
| pact-service-ai | Machine identity for MCP server |

Configure the journal

Set the OIDC issuer and audience in the journal config:

[policy.iam]
oidc_issuer = "https://auth.example.org/realms/hpc"
oidc_audience = "pact"

systemd Service Management

Install systemd units

Copy the provided unit files:

cp infra/systemd/pact-journal.service /etc/systemd/system/
cp infra/systemd/pact-agent.service /etc/systemd/system/

Enable and start

Journal nodes:

systemctl daemon-reload
systemctl enable pact-journal
systemctl start pact-journal

Compute nodes:

systemctl daemon-reload
systemctl enable pact-agent
systemctl start pact-agent

Check status

systemctl status pact-journal
journalctl -u pact-journal -f

Environment files

The systemd units read environment variables from:

  • /etc/pact/journal.env (journal nodes)
  • /etc/pact/agent.env (compute nodes)

Docker Compose Deployment

For development, testing, or small deployments, use the provided Docker Compose configuration.

cd infra/docker
docker compose up -d

This starts:

| Service | Container | Ports |
|---------|-----------|-------|
| journal-1 | pact-journal-1 | 9443, 9091 |
| journal-2 | pact-journal-2 | 9543, 9191 |
| journal-3 | pact-journal-3 | 9643, 9291 |
| agent | pact-agent | 9445 |
| prometheus | pact-prometheus | 9090 |
| grafana | pact-grafana | 3000 |

Access Grafana at http://localhost:3000 (login: admin / admin).

Scaling

To run multiple agents in Docker:

docker compose up -d --scale agent=5

Monitoring with Grafana + Prometheus

Prometheus

pact-journal exposes Prometheus metrics on the metrics listen port (default 9091). The provided infra/docker/prometheus.yml scrapes all journal nodes:

scrape_configs:
  - job_name: "pact-journal"
    static_configs:
      - targets:
          - "journal-1:9091"
          - "journal-2:9091"
          - "journal-3:9091"

Loki

pact-journal streams events to Loki for structured log aggregation. Configure the Loki endpoint in journal config:

[telemetry]
loki_enabled = true
loki_endpoint = "http://loki.mgmt:3100/loki/api/v1/push"

Grafana dashboards

Import the dashboards from infra/grafana/dashboards/ into Grafana. These provide:

  • Journal quorum health (Raft leader, commit index, log size)
  • Node status overview (drift, services, capabilities)
  • Config change timeline
  • Emergency mode events
  • Approval workflow status

Alerting

Import the alerting rules from infra/alerting/rules.yml into Prometheus. Key alerts:

  • Raft leader election timeout
  • Journal node down
  • Agent disconnected
  • Emergency mode entered
  • Pending approvals nearing expiry
  • Excessive drift on node

OPA Policy Engine

pact uses OPA (Open Policy Agent) for authorization decisions. Deploy OPA as a sidecar on each journal node.

Install OPA

# Download OPA binary
curl -L -o /usr/local/bin/opa \
    https://openpolicyagent.org/downloads/v0.73.0/opa_linux_amd64_static
chmod +x /usr/local/bin/opa

Run OPA

opa run --server --addr localhost:8181 /etc/pact/policies/

Configure journal

[policy.engine]
type = "opa"
opa_endpoint = "http://localhost:8181/v1/data/pact"
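The endpoint path implies a Rego policy package named pact. A minimal sketch of such a policy follows; the input document shape (subject.roles, resource.vcluster) is an assumption, and your site’s rules will differ:

```rego
# Illustrative authorization sketch for the 'pact' package — adapt to your site.
package pact

default allow = false

# Platform admins may do anything.
allow {
    input.subject.roles[_] == "pact-platform-admin"
}

# vCluster ops may act within their own vCluster.
allow {
    input.subject.roles[_] == sprintf("pact-ops-%s", [input.resource.vcluster])
}
```

The journal then queries /v1/data/pact with the request context as input and enforces the resulting allow decision.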

Policy federation

If using Sovra for cross-site federation, policy templates are synchronized automatically:

[policy.federation]
sovra_endpoint = "https://sovra.mgmt:8443"
sync_interval_seconds = 300

Building an OpenCHAMI Image with pact

This guide covers building a diskless SquashFS compute node image with pact-agent as the init system (PID 1), SPIRE for workload identity, and OpenCHAMI for boot provisioning.

Overview

The boot chain on a diskless HPC node:

BMC/PXE → OpenCHAMI DHCP → iPXE → kernel + initramfs
  → mount SquashFS root (read-only)
  → pivot_root
  → pact-agent starts as PID 1
  → authenticates to journal (SPIRE or bootstrap cert)
  → streams vCluster config overlay
  → applies config (sysctl, modules, mounts, uenv)
  → starts services in dependency order
  → reports capabilities → node ready

What Goes in the Image vs What Gets Streamed

| In the SquashFS image (static) | Streamed at boot (dynamic) |
|--------------------------------|----------------------------|
| pact-agent binary | vCluster overlay (sysctl, modules, mounts) |
| SPIRE agent binary + config | Node-specific delta (per-node tunables) |
| Bootstrap CA cert (/etc/pact/ca.crt) | Service declarations (what to start) |
| Base OS packages (glibc, coreutils, etc.) | OPA policy bundles |
| GPU drivers (NVIDIA/AMD) | Identity (SVID via SPIRE or CSR) |
| Network drivers (cxi, i40e, etc.) | |
| pact agent config (/etc/pact/agent.toml) | |

The image is read-only. All runtime state goes to tmpfs (/run/pact/, /tmp/).

Prerequisites

  • OpenCHAMI deployed (SMD, BSS, DHCP, image server)
  • A build host with mksquashfs, debootstrap (or equivalent), and the pact release binaries
  • SPIRE server running on management nodes (optional but recommended)
  • pact-journal quorum running on management nodes

Step 1: Create the Base Root Filesystem

Start with a minimal Linux root. The exact method depends on your distro:

# Ubuntu/Debian
mkdir -p /tmp/pact-image/rootfs
sudo debootstrap --variant=minbase noble /tmp/pact-image/rootfs http://archive.ubuntu.com/ubuntu

# Or SUSE (for Cray/HPE systems)
# zypper --root /tmp/pact-image/rootfs install ...

# Or from an existing node image
# rsync -a /path/to/base-image/ /tmp/pact-image/rootfs/

Install essential packages in the chroot:

sudo chroot /tmp/pact-image/rootfs /bin/bash -c '
    apt-get update
    apt-get install -y --no-install-recommends \
        ca-certificates \
        iproute2 \
        kmod \
        procps \
        util-linux \
        chrony
'

Step 2: Install pact-agent

Download the agent binary matching your target hardware:

# Example: x86_64 with PactSupervisor (NVIDIA + AMD GPU support is built in)
curl -LO https://github.com/witlox/pact/releases/latest/download/pact-agent-x86_64-pact.tar.gz
sudo tar xzf pact-agent-x86_64-pact.tar.gz -C /tmp/pact-image/rootfs/usr/local/bin/

Step 3: Install SPIRE Agent

SPIRE provides workload identity (X.509 SVIDs) for mTLS between pact-agent and the journal. If SPIRE is not available, pact falls back to the bootstrap certificate + ephemeral CA workflow.

# Download SPIRE agent
SPIRE_VERSION=1.12.0
curl -LO https://github.com/spiffe/spire/releases/download/v${SPIRE_VERSION}/spire-${SPIRE_VERSION}-linux-amd64-musl.tar.gz
tar xzf spire-${SPIRE_VERSION}-linux-amd64-musl.tar.gz
sudo cp spire-${SPIRE_VERSION}/bin/spire-agent /tmp/pact-image/rootfs/usr/local/bin/

Create the SPIRE agent config:

sudo mkdir -p /tmp/pact-image/rootfs/etc/spire
sudo tee /tmp/pact-image/rootfs/etc/spire/agent.conf << 'EOF'
agent {
    data_dir = "/run/spire/agent"
    log_level = "INFO"
    server_address = "spire-server.mgmt"
    server_port = "8081"
    socket_path = "/run/spire/agent.sock"
    trust_domain = "example.org"

    # Node attestation via TPM or join token
    NodeAttestor "tpm_devid" {
        plugin_data {}
    }
}
EOF

For sites without TPM, use join token attestation instead:

# On the SPIRE server, create a join token for this node class:
#   spire-server token generate -spiffeID spiffe://example.org/pact-agent
# Then inject the token into the image or pass via kernel cmdline.

Step 4: Install GPU Drivers

For NVIDIA nodes:

# Install NVIDIA driver + persistenced (in chroot)
sudo chroot /tmp/pact-image/rootfs /bin/bash -c '
    # Install from your driver repo or CUDA toolkit
    apt-get install -y nvidia-driver-570 nvidia-utils-570
'

For AMD nodes:

sudo chroot /tmp/pact-image/rootfs /bin/bash -c '
    # Install ROCm driver
    apt-get install -y rocm-smi-lib
'

Step 5: Install Network Drivers

For Slingshot (Cray CXI) fabric:

# CXI drivers are typically provided by HPE/Cray as RPMs or DEBs
# Install cxi-driver, cxi-utils, libfabric-cxi
sudo chroot /tmp/pact-image/rootfs /bin/bash -c '
    dpkg -i /path/to/cxi-driver_*.deb
'

Step 6: Configure pact-agent

Create the agent config. The node_id and vcluster are set dynamically at boot via environment variables (OpenCHAMI sets the hostname):

sudo mkdir -p /tmp/pact-image/rootfs/etc/pact
sudo tee /tmp/pact-image/rootfs/etc/pact/agent.toml << 'EOF'
[agent]
# node_id auto-detected from hostname (set by OpenCHAMI DHCP)
enforcement_mode = "enforce"

[agent.supervisor]
backend = "pact"

[agent.journal]
endpoints = [
    "journal-1.mgmt:9443",
    "journal-2.mgmt:9443",
    "journal-3.mgmt:9443",
]
tls_enabled = true
tls_ca = "/etc/pact/ca.crt"

[agent.identity]
provider = "spire"
spire_socket = "/run/spire/agent.sock"

[agent.observer]
ebpf_enabled = true
inotify_enabled = true
netlink_enabled = true

[agent.shell]
enabled = true
listen = "0.0.0.0:9445"
whitelist_mode = "strict"

[agent.capability]
manifest_path = "/run/pact/capability.json"
socket_path = "/run/pact/capability.sock"
gpu_poll_interval_seconds = 30

[agent.commit_window]
base_window_seconds = 900
drift_sensitivity = 2.0
emergency_window_seconds = 14400

[agent.blacklist]
patterns = [
    "/tmp/**", "/var/log/**", "/proc/**", "/sys/**",
    "/dev/**", "/run/user/**", "/run/pact/**", "/run/lattice/**",
]
EOF

Step 7: Configure pact-agent as PID 1

Create an init wrapper that sets up minimal infrastructure before handing off to pact-agent. The SquashFS root is read-only, so we need tmpfs mounts:

sudo tee /tmp/pact-image/rootfs/init << 'INITEOF'
#!/bin/sh
# Minimal init for pact-agent as PID 1 on diskless nodes.
# Called directly by the kernel after pivot_root.

# Mount essential filesystems
mount -t proc proc /proc
mount -t sysfs sysfs /sys
mount -t devtmpfs devtmpfs /dev
mount -t tmpfs tmpfs /run
mount -t tmpfs tmpfs /tmp
mkdir -p /run/pact /run/spire/agent /run/lock /var/log

# Load essential modules
modprobe -a overlay tmpfs

# Set hostname from kernel cmdline (OpenCHAMI sets pact.nodeid=)
NODEID=$(sed -n 's/.*pact.nodeid=\([^ ]*\).*/\1/p' /proc/cmdline)
[ -n "$NODEID" ] && hostname "$NODEID"

# Start SPIRE agent in background (if available)
if [ -x /usr/local/bin/spire-agent ]; then
    /usr/local/bin/spire-agent run \
        -config /etc/spire/agent.conf \
        -logLevel INFO &
    # Give SPIRE a moment to create the socket
    sleep 1
fi

# Hand off to pact-agent
exec /usr/local/bin/pact-agent --config /etc/pact/agent.toml
INITEOF
sudo chmod +x /tmp/pact-image/rootfs/init
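The hostname extraction in the wrapper can be exercised in isolation (pure sed against a sample cmdline, no node required):

```shell
# Exercise the init wrapper's cmdline parsing with a sample kernel cmdline.
CMDLINE="BOOT_IMAGE=vmlinuz root=live:http://image-server/pact-node.squashfs init=/init pact.nodeid=compute-042 console=tty0"
NODEID=$(printf '%s\n' "$CMDLINE" | sed -n 's/.*pact.nodeid=\([^ ]*\).*/\1/p')
echo "$NODEID"   # prints "compute-042"
```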

Step 8: Install Bootstrap CA Certificate

For the initial boot before SPIRE is available, include the journal’s CA cert:

# Copy from a journal node or generate during journal setup
sudo cp /etc/pact/ca.crt /tmp/pact-image/rootfs/etc/pact/ca.crt

If using SPIRE exclusively, this cert is only needed for the first connection to obtain the SPIRE join token or for fallback when SPIRE is unavailable.

Step 9: Build the SquashFS Image

sudo mksquashfs /tmp/pact-image/rootfs /tmp/pact-image/pact-node.squashfs \
    -comp zstd \
    -Xcompression-level 19 \
    -noappend \
    -no-recovery \
    -processors $(nproc)

Typical image sizes:

  • Base + pact-agent + SPIRE: ~300 MB
  • With NVIDIA drivers: ~800 MB
  • With ROCm: ~600 MB

Step 10: Register with OpenCHAMI

Upload the image to OpenCHAMI’s image server and configure the boot parameters:

# Upload image to OpenCHAMI image server.
# Use your site's image management tooling to upload the SquashFS to the image server,
# e.g. scp, s3 upload, or your image registry workflow.

# Set boot parameters for a node group via BSS REST API
curl -X PUT https://bss.mgmt/boot/v1/bootparameters \
    -H "Content-Type: application/json" \
    -d '{
        "macs": [],
        "hosts": ["ml-training"],
        "params": "root=live:http://image-server/pact-ml-training-v1.squashfs init=/init pact.nodeid=${hostname} console=tty0",
        "kernel": "http://image-server/vmlinuz",
        "initrd": "http://image-server/initramfs.img"
    }'

The init=/init parameter tells the kernel to run our init wrapper. The pact.nodeid=${hostname} is expanded by OpenCHAMI’s DHCP/BSS.

Step 11: Pre-enroll Nodes

Before the first boot, register nodes in the journal:

# Enroll nodes with their hardware identity
pact node enroll compute-001 --mac aa:bb:cc:dd:ee:01
pact node enroll compute-002 --mac aa:bb:cc:dd:ee:02
# ... or batch import from SMD inventory:
pact node import --group ml-training

# Assign to vCluster
pact node assign compute-001 --vcluster ml-training
pact node assign compute-002 --vcluster ml-training

Step 12: Boot and Verify

Power on the nodes via OpenCHAMI/Redfish:

# Power on nodes via BMC/Redfish (use your BMC management tool: ipmitool, Redfish, etc.)
# Example with curl against OpenCHAMI SMD:
curl -X POST https://smd.mgmt/hsm/v2/State/Components/x1000c0s0b0n0/Actions/PowerCycle \
    -H "Content-Type: application/json" \
    -d '{"ResetType": "On"}'
# Or use pact's delegation command for enrolled nodes:
pact reboot compute-001

Monitor boot progress:

# Watch the journal for enrollment events
pact watch --vcluster ml-training

# Check node status (should appear within ~2 seconds of boot)
pact status --vcluster ml-training

# Verify capabilities
pact cap compute-001

# Check service status
pact service status compute-001

Updating the Image

To update the base image (new drivers, new pact-agent version):

  1. Build a new SquashFS image (steps 1-9)
  2. Upload to OpenCHAMI image server (using your site’s image management tooling)
  3. Update boot config via BSS REST API: curl -X PUT https://bss.mgmt/boot/v1/bootparameters -d '{"hosts":["ml-training"],"params":"root=live:http://image-server/pact-ml-training-v2.squashfs ..."}'
  4. Rolling reboot: pact drain compute-001 && pact reboot compute-001

Nodes pick up the new image on reboot. pact configuration (sysctl, mounts, services) is streamed from the journal — not baked into the image — so most config changes don’t require a new image.

Including Lattice (Supercharged Mode)

When deploying pact alongside lattice for workload scheduling, the compute node image includes both pact-agent and lattice-node-agent. pact supervises lattice-node-agent as a declared service — this is “supercharged mode” where both systems cooperate.

Additional binaries in the image

Add lattice-node-agent to the SquashFS image alongside pact-agent:

# Download lattice node agent
curl -LO https://github.com/witlox/lattice/releases/latest/download/lattice-node-agent-x86_64.tar.gz
sudo tar xzf lattice-node-agent-x86_64.tar.gz -C /tmp/pact-image/rootfs/usr/local/bin/

lattice-node-agent config

Create the lattice node agent config. The agent connects to the lattice scheduler quorum and reports node capabilities (read from pact’s capability manifest):

sudo mkdir -p /tmp/pact-image/rootfs/etc/lattice
sudo tee /tmp/pact-image/rootfs/etc/lattice/node-agent.toml << 'EOF'
[node_agent]
# lattice scheduler quorum endpoints (on HSN, not management network — ADR-017)
scheduler_endpoints = [
    "lattice-1.hsn:50051",
    "lattice-2.hsn:50051",
    "lattice-3.hsn:50051",
]

# pact capability manifest (lattice-node-agent reads this)
capability_manifest = "/run/pact/capability.json"
capability_socket = "/run/pact/capability.sock"

# Namespace handoff socket (pact creates namespaces, lattice uses them)
namespace_socket = "/run/pact/ns-handoff.sock"

# Mount refcounting (shared between pact and lattice)
mount_socket = "/run/pact/mount-refcount.sock"

[node_agent.identity]
# Uses the same SPIRE socket as pact for workload identity
spire_socket = "/run/spire/agent.sock"
EOF

Declare lattice-node-agent as a pact service

pact-agent supervises lattice-node-agent as a declared service. This is configured in the vCluster overlay (streamed at boot, not baked in the image).

Create the overlay spec:

# vcluster-overlay.toml — applied with: pact apply vcluster-overlay.toml
[vcluster.ml-training.services.lattice-node-agent]
binary = "/usr/local/bin/lattice-node-agent"
args = ["--config", "/etc/lattice/node-agent.toml"]
restart_policy = "always"
order = 50
depends_on = ["chronyd"]

[vcluster.ml-training.services.chronyd]
binary = "/usr/sbin/chronyd"
args = ["-d"]
restart_policy = "always"
order = 10

# For GPU nodes, add nvidia-persistenced
[vcluster.ml-training.services.nvidia-persistenced]
binary = "/usr/bin/nvidia-persistenced"
args = ["--no-persistence-mode"]
restart_policy = "on_failure"
order = 20

Boot sequence with lattice

Kernel → SquashFS root → pact-agent (PID 1)
  → auth to journal → stream vCluster config overlay
  → apply: kernel params, modules, mounts, uenv
  → start services in dependency order:
      1. chronyd (time sync)
      2. nvidia-persistenced (GPU, if declared)
      3. lattice-node-agent (workload scheduling)
  → pact writes CapabilityReport to /run/pact/capability.json
  → lattice-node-agent reads manifest, reports to scheduler
  → node ready for workloads

Supercharged CLI

With both systems running, operators get unified admin access:

# pact-native commands work as before
pact status --vcluster ml-training
pact exec compute-001 -- nvidia-smi
pact diag compute-001 --grep "ECC"

# Supercharged commands query both systems
pact jobs list --vcluster ml-training    # lattice allocations
pact health                              # pact + lattice health
pact drain compute-001                   # lattice drain + pact audit

Configure the lattice endpoint for supercharged commands. Note: the pact CLI connects to lattice’s HSN-facing gRPC port from the admin workstation (which must have HSN access or a management-to-HSN gateway):

export PACT_LATTICE_ENDPOINT=http://lattice-1.hsn:50051
export PACT_LATTICE_TOKEN=<lattice-auth-token>

Network separation

pact and lattice run on separate networks (ADR-017). pact uses the management network exclusively. Lattice runs entirely on the HSN — including agent↔scheduler communication, Raft consensus, and workload data.

| Traffic | Network | Port |
|---|---|---|
| pact agent ↔ journal | Management | 9443, 9444 |
| pact shell/exec/diag | Management | 9445 |
| pact journal metrics | Management | 9091 |
| lattice agent ↔ scheduler | HSN | 50051 |
| lattice Raft consensus | HSN | 9000 |
| Workload data (MPI, NCCL) | HSN | Application-defined |

pact never touches the HSN. Lattice never touches pact’s management ports. If the HSN goes down, pact continues operating (admin access, config management) while lattice pauses scheduling. If the management network goes down, pact agents use cached config while lattice is unaffected.

Troubleshooting

Node doesn’t appear after boot

# Check if the node enrolled
pact node list --vcluster ml-training

# Check journal logs for enrollment errors
pact audit --source pact -n 20

# If node is reachable via BMC console:
#   - Check /run/pact/ for agent logs
#   - Check if SPIRE socket exists: ls /run/spire/agent.sock
#   - Check if journal is reachable: curl -k https://journal-1.mgmt:9443/health

SPIRE agent fails to attest

# On the SPIRE server, check registration entries:
spire-server entry show

# Create a join token for manual attestation:
spire-server token generate -spiffeID spiffe://example.org/pact-agent/compute-001

# Pass the token to the node via kernel cmdline (update BSS):
#   curl -X PUT https://bss.mgmt/boot/v1/bootparameters \
#     -d '{"hosts":["compute-001"],"params":"... spire.join_token=<token>"}'

Agent falls back to bootstrap identity

This is normal on first boot or when SPIRE is unavailable. The agent will:

  1. Use the bootstrap CA cert for initial journal connection
  2. Submit a CSR to the journal
  3. Journal validates hardware identity and signs the cert
  4. Agent switches to the journal-signed cert

Once SPIRE becomes available, the agent rotates to SPIRE-managed mTLS automatically (identity cascade: SPIRE → journal-signed → bootstrap).
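The cascade amounts to picking the strongest credential currently available. A minimal sketch, assuming illustrative type and function names (not the agent’s real API):

```rust
/// Illustrative identity tiers, strongest first (sketch only).
#[derive(Debug, PartialEq)]
enum Identity {
    Spire,         // SPIRE-managed mTLS (preferred)
    JournalSigned, // journal-signed cert from the CSR flow
    Bootstrap,     // bootstrap CA cert (first boot / SPIRE unavailable)
}

/// Pick the strongest credential currently available.
fn select_identity(spire_available: bool, journal_cert_valid: bool) -> Identity {
    if spire_available {
        Identity::Spire
    } else if journal_cert_valid {
        Identity::JournalSigned
    } else {
        Identity::Bootstrap
    }
}
```

On first boot both flags are false, so the agent starts at Bootstrap and climbs the cascade as the CSR flow and SPIRE attestation complete.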

Troubleshooting

Agent Cannot Connect to Journal

Symptoms: Agent logs show connection errors. pact status returns exit code 5 (timeout).

Check 1: Network connectivity

# From the agent node, verify the journal port is reachable
nc -zv journal-1.mgmt 9443

Check 2: Journal is running

# On the journal node
systemctl status pact-journal
journalctl -u pact-journal --since "5 min ago"

Check 3: SPIRE identity (if using SPIRE)

# Verify the SPIRE agent socket is available
ls -la /run/spire/agent.sock

# Check SPIRE agent health
spire-agent healthcheck -socketPath /run/spire/agent.sock

# Verify the agent has a valid SVID
spire-agent api fetch x509 -socketPath /run/spire/agent.sock -write /tmp/
openssl x509 -in /tmp/svid.0.pem -noout -subject -issuer -dates

If the SPIRE socket is unavailable, check that the SPIRE agent is running on the node. If attestation fails, verify the node’s SPIRE registration entry exists and matches its hardware identity (TPM, SMBIOS UUID, or join token).

Check 4: Enrollment (if using ephemeral CA)

# Check if the node has been enrolled
pact enroll status <node>

# Check if the agent has a valid certificate
openssl x509 -in /etc/pact/agent.crt -noout -subject -dates 2>/dev/null || echo "No certificate"

Common enrollment failures:

  • Hardware identity mismatch: the node’s actual hardware identity (TPM/SMBIOS/MAC) does not match what was registered during pact enroll. Re-enroll with the correct hardware ID: pact enroll <node> --hardware-id <correct-hw-id>
  • CSR rejected: the journal could not validate the CSR. Check journal logs for the rejection reason.
  • Ephemeral CA rotated: if the journal quorum restarted, the CA was regenerated and all agents must re-enroll. Agents do this automatically on the next boot, but running agents need a restart: systemctl restart pact-agent

Check 5: Agent config

Verify that the endpoints list in agent.toml points to the correct journal addresses:

[agent.journal]
endpoints = ["journal-1.mgmt:9443", "journal-2.mgmt:9443", "journal-3.mgmt:9443"]
tls_enabled = true
tls_ca = "/etc/pact/ca.crt"

Check 6: Firewall

Ensure port 9443 (gRPC) and 9444 (Raft) are open between journal nodes, and port 9443 is open from compute nodes to journal nodes.


Raft Leader Election Issues

Symptoms: Journal logs show repeated election timeouts. No leader elected. CLI commands hang or return timeout errors.

Check 1: Quorum availability

A 3-node quorum needs at least 2 nodes. A 5-node quorum needs at least 3. Verify all journal nodes are running:

for host in journal-1.mgmt journal-2.mgmt journal-3.mgmt; do
    echo "$host: $(nc -zv $host 9443 2>&1)"
done
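The numbers above follow the usual Raft majority rule; as a sketch:

```rust
/// Minimum live nodes for an n-node Raft group to elect a leader:
/// a strict majority, i.e. n/2 + 1 with integer division.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}
// quorum(3) == 2, quorum(5) == 3; note a 4-node group still needs 3,
// which is why quorums are deployed with odd sizes.
```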

Check 2: Clock synchronization

Raft is sensitive to clock skew. Verify NTP/chrony is running on all journal nodes:

chronyc tracking

Check 3: Raft peer configuration

All nodes must have identical [journal.raft] members configuration. A mismatch causes election failures. Verify on each node:

grep -A5 "journal.raft" /etc/pact/journal.toml

Check 4: Data directory permissions

The journal data directory must be writable by the pact user:

ls -la /var/lib/pact/journal/

Check 5: Network partitions

Raft port 9444 must be reachable between all journal nodes. Unlike the gRPC port, this is peer-to-peer between journal nodes only:

nc -zv journal-2.mgmt 9444

Drift Detection False Positives

Symptoms: pact diff shows drift for files or paths that should not be monitored (logs, temp files, runtime state).

Fix: Add patterns to the blacklist

The blacklist excludes paths from drift detection. Edit the agent config:

[agent.blacklist]
patterns = [
    "/tmp/**",
    "/var/log/**",
    "/proc/**",
    "/sys/**",
    "/dev/**",
    "/run/user/**",
    "/run/pact/**",
    "/run/lattice/**",
    # Add your exclusions here:
    "/var/cache/**",
    "/home/*/.bash_history"
]

After updating the config, restart the agent:

systemctl restart pact-agent

Understanding the blacklist-first model: pact monitors everything by default and excludes via blacklist (see ADR-002). This is the opposite of most config management tools, which declare what to watch. The blacklist approach ensures nothing is missed, but it means you must explicitly exclude noisy paths.
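The decision reduces to “monitored unless excluded”. A minimal sketch, simplified to prefix matching for `/**` patterns (the real matcher presumably handles full globs such as `/home/*/.bash_history`):

```rust
/// Blacklist-first: a path is monitored unless an exclusion pattern matches.
/// Only the "/**" suffix form is handled here; mid-path globs would need a
/// real glob matcher. Note the bare prefix match would also catch e.g.
/// "/tmpfile" for pattern "/tmp/**" -- illustrative only.
fn is_monitored(path: &str, blacklist: &[&str]) -> bool {
    !blacklist.iter().any(|pat| match pat.strip_suffix("/**") {
        Some(prefix) => path.starts_with(prefix),
        None => path == *pat,
    })
}
```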


Shell Command Blocked by Whitelist

Symptoms: pact exec or pact shell returns exit code 6 with “command not whitelisted”.

Check 1: Current whitelist mode

grep whitelist_mode /etc/pact/agent.toml

| Mode | Behavior |
|---|---|
| strict | Only explicitly whitelisted commands allowed |
| learning | All commands allowed, non-whitelisted ones logged |
| bypass | All commands allowed (development only) |

Fix for development: Set whitelist_mode = "learning" or "bypass".

Fix for production: Add the command to the whitelist. The whitelist is managed via the vCluster overlay policy. Contact your platform admin to update it.

Workaround: If you need immediate unrestricted access, enter emergency mode:

pact emergency start -r "need to run diagnostics command XYZ"
# Run your command
pact exec node-042 -- your-command
pact emergency end

Emergency Mode Stuck

Symptoms: A node is in emergency mode but the admin who started it is unavailable. Other admins cannot make changes that conflict with the emergency session.

Fix: Force-end the emergency

A pact-platform-admin can force-end another admin’s emergency session:

pact emergency end --force

This records the force-end in the journal audit log, including who ended it and the original emergency reason.

If the CLI cannot reach the journal: If the journal itself is the problem (which is why emergency mode was started), you need to fix journal connectivity first. Check the Raft leader election section above.

Last resort: BMC console access provides unrestricted bash on the node, bypassing pact entirely. This is the out-of-band fallback when pact itself is not functioning.


Approval Workflow Issues

Approval request expired

Symptoms: A commit on a regulated vCluster was submitted but nobody approved it within the timeout (default 30 minutes). The change was rolled back.

Fix: Resubmit the change and coordinate with an approver in advance:

# Resubmit
pact commit -m "add audit-forwarder (re-submit after timeout)"

# Tell the approver to check immediately
# Approver runs:
pact approve list
pact approve accept ap-XXXX

Adjust timeout: If 30 minutes is too short for your workflow, update the vCluster policy:

[vcluster.sensitive-compute.policy]
approval_timeout_seconds = 3600   # 1 hour

Cannot approve own request

Symptoms: pact approve accept returns an authorization error when trying to approve your own request.

This is by design. Two-person approval requires a different admin to approve. The approver must have pact-regulated-{vcluster} or pact-platform-admin role.

No approvers available

If no other admin with the required role is available, a pact-platform-admin can approve any request. If no platform admin is available, the change must wait or be submitted through the emergency mode workflow (which has its own audit requirements).


Agent Reports Wrong Capabilities

Symptoms: pact cap shows incorrect GPU count, memory, or network capabilities.

Check 1: Capability manifest

The agent reads capabilities from a JSON manifest:

cat /run/pact/capability.json

Check 2: GPU detection

If GPU capabilities are wrong, check the GPU backend:

# For NVIDIA
nvidia-smi -L

# For AMD
rocm-smi --showproductname

Check 3: Poll interval

The agent polls GPU status periodically. Check the config:

[agent.capability]
gpu_poll_interval_seconds = 30

A recently failed GPU may not be reflected until the next poll.


Journal Data Directory Full

Symptoms: Journal logs show write errors. Raft cannot commit new entries.

Check disk usage:

df -h /var/lib/pact/journal/
du -sh /var/lib/pact/journal/*

Fix 1: Trigger a Raft snapshot

Snapshots compact the log. The snapshot interval is configured in the journal:

[journal.raft]
snapshot_interval = 10000   # Entries between snapshots

Reduce this value and restart to trigger more frequent compaction.

Fix 2: Expand storage

If the data directory is genuinely too small for your workload, expand the underlying volume.


Common Error Messages

| Message | Cause | Fix |
|---|---|---|
| No auth token found | Missing OIDC token | Set PACT_TOKEN or write to ~/.config/pact/token |
| No vCluster specified | Missing vCluster scope | Use --vcluster or set PACT_VCLUSTER |
| connection refused | Journal not running or wrong endpoint | Check journal status and endpoint config |
| certificate verify failed | TLS cert mismatch or ephemeral CA rotated | Restart agent to re-enroll, or verify CA bundle at /etc/pact/ca.crt |
| SPIRE socket unavailable | SPIRE agent not running | Start SPIRE agent or switch to ephemeral CA identity |
| enrollment: hardware mismatch | Node hardware ID does not match enrollment record | Re-enroll with correct --hardware-id |
| policy: denied | OPA rejected the operation | Check your role has the required permissions |
| approval required | Regulated vCluster | Another admin must approve (see workflow above) |
| commit window expired | Time window for changes has closed | Run pact extend or pact commit first |

System Architecture

See ../../ARCHITECTURE.md for the high-level overview. This document covers detailed design and data flows.

Design Requirements

  • R1: Eventual consistency with acknowledged drift
  • R2: Immutable configuration log
  • R3: Optimistic concurrency with commit windows
  • R4: Admin-native CLI + pact shell (replacing SSH)
  • R5: Streaming boot configuration (<2s for 10k nodes)
  • R6: Degradation-aware (partial HW failure → revised promises)
  • R7: vCluster-aware grouping
  • R8: IAM and policy enforcement (OIDC/RBAC/audit)
  • R9: Blacklist-based drift detection with learning mode
  • R10: Emergency mode (extended window + no rollback + full audit)
  • R11: Observe-first deployment
  • R12: Agentic API (MCP tool-use)
  • R13: Process supervision (pact as init, systemd fallback)
  • R14: No SSH (pact shell + pact exec)

Raft Deployment

pact-journal runs its own Raft group, independent from lattice’s quorum. Two deployment modes (see ADR-001):

  • Standalone: pact-journal on dedicated management nodes (3-5 nodes)
  • Co-located: pact-journal and lattice-server on the same management nodes, each with its own Raft group on separate ports

Pact is the incumbent in co-located mode — its quorum is already running when lattice starts. Lattice configures its peers to point to the same hostnames. No protocol-level coupling; co-location is a deployment decision.

Consistency Model

AP in CAP terms. Nodes use cached config and cached policy during partitions. Conflict resolution by timestamp ordering with admin-committed > auto-converge. A node that can’t reach the config server keeps running its workload.

During partitions, pact-agent falls back to cached VClusterPolicy for authorization (role bindings and whitelists only — complex OPA rules and two-person approval require connectivity). Degraded-mode decisions are logged locally and replayed to the journal when connectivity is restored.

Commit Window Formula

window_seconds = base_window / (1 + drift_magnitude * sensitivity)

Examples with default base_window=900s, drift_sensitivity=2.0:

| Drift | Example | Window | Rationale |
|---|---|---|---|
| Tiny (0.05) | Single sysctl | ~14 min | Low risk |
| Small (0.15) | Config file edit | ~12 min | Routine |
| Moderate (0.3) | Mount + service | ~9 min | Needs attention |
| Large (0.8) | Multiple categories | ~6 min | Significant deviation |

Higher drift_sensitivity (e.g. 5.0 for regulated vClusters) compresses windows more aggressively: the same large drift gets ~3 min instead of ~6.
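The table values fall out directly from the formula; a sketch (parameter names assumed to mirror the config keys):

```rust
/// Commit window length: larger drift and higher sensitivity both shrink
/// the window. Defaults: base_window = 900.0, sensitivity = 2.0.
fn window_seconds(base_window: f64, drift_magnitude: f64, sensitivity: f64) -> f64 {
    base_window / (1.0 + drift_magnitude * sensitivity)
}
// window_seconds(900.0, 0.05, 2.0) ~ 818s (~14 min, tiny drift)
// window_seconds(900.0, 0.8, 2.0)  ~ 346s (~6 min, large drift)
// window_seconds(900.0, 0.8, 5.0)  = 180s (~3 min, regulated vCluster)
```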

Emergency mode: pact emergency --reason "..." extends to 4h, suspends rollback.

Data Flows

Boot-Time (10,000 nodes)

PXE → SquashFS → pact-agent (PID 1)
  → mTLS auth → Phase 1 stream (vCluster overlay, ~200KB, any replica)
  → apply config → Phase 2 (node delta, <1KB)
  → start services → CapabilityReport → ready

Admin Change

pact exec / pact shell → command executed on node
  → state observer detects change → drift evaluator
  → commit window opens (proportional to drift)
  → admin commits (node delta) or window expires (rollback)
  → to codify fleet-wide: pact promote → pact apply (updates overlay)
  → journal records everything

Commit Lifecycle and Reboot Persistence

Manual changes (via exec/shell) that are committed become node-level state deltas in the journal. The journal maintains two layers of declared state:

vCluster overlay (shared)     e.g. "all ml-training nodes mount /scratch"
  + node deltas (per-node)    e.g. "node042 has extra sysctl from debugging"
  = effective declared state  (what the agent applies at boot)

On reboot, the agent streams both layers from the journal. Committed node deltas are reapplied automatically — manual changes survive reboots as long as they remain in the journal’s node state.
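The layering reduces to a map merge in which node deltas take precedence over the shared overlay. A sketch with illustrative keys and types (the real state machine stores structured deltas, not strings):

```rust
use std::collections::BTreeMap;

/// Effective declared state = vCluster overlay with node deltas applied on
/// top. A key present in both layers takes the node-delta value.
fn effective_state(
    overlay: &BTreeMap<String, String>,
    node_delta: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut state = overlay.clone();
    for (k, v) in node_delta {
        state.insert(k.clone(), v.clone()); // node delta wins
    }
    state
}
```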

However, accumulating ad-hoc node deltas is not desirable long-term. They represent drift that was accepted rather than codified. Over time, nodes with many committed deltas diverge from their vCluster peers, making fleet-wide reasoning harder.

The intended lifecycle for manual changes:

| Stage | State | Action |
|---|---|---|
| Detected | Drift | Observer flags divergence from declared state |
| Committed | Node delta | Admin commits change, recorded in journal |
| Promoted | vCluster overlay | pact apply updates the overlay to include the change |
| Expired | Cleaned up | pact rollback or superseded by overlay update |

Promotion path: when a committed manual change proves correct, the admin promotes it to the vCluster overlay:

  1. pact diff --committed <node> — review accumulated node deltas
  2. pact promote <node> --dry-run — preview the generated overlay TOML
  3. pact promote <node> > changes.toml — export, review/edit
  4. pact apply changes.toml — apply to the vCluster overlay

This updates the shared overlay and makes the node-level deltas redundant.

Expiry: node deltas with a ttl field expire automatically. Emergency-mode changes default to a TTL matching the emergency window. Changes without TTL persist until explicitly rolled back or superseded.

Hardware Degradation

GPU soft-fails → agent detects (NVML for NVIDIA, ROCm SMI for AMD, or eBPF)
  → CapabilityReport updated → scheduler adjusts eligibility
  → DriftDetected in journal → admin ack if policy requires

Integration Delegation

| Action | Owner | pact does |
|---|---|---|
| Reboot node | CSM or OpenCHAMI | pact reboot calls CAPMC (CSM) or SMD Redfish (OpenCHAMI) |
| Re-image node | CSM or OpenCHAMI | pact reimage calls BOS (CSM) or Redfish PowerCycle (OpenCHAMI) |
| Drain node | Lattice | pact drain calls lattice scheduler API |
| Cordon node | Lattice | pact cordon calls lattice scheduler API |
| Job status | Lattice | pact jobs calls lattice API |
| Config management | pact (native) | Direct implementation |
| Remote access | pact (native) | Shell server, exec endpoint |
| Service lifecycle | pact (native) | PactSupervisor or SystemdBackend |

Agent Design

Overview

pact-agent is the init system, configuration manager, process supervisor, and shell server for diskless HPC/AI compute nodes. It is PID 1 (or near-PID-1) and the only management process that starts from the base boot image.

Subsystems

Process Supervisor (src/supervisor/)

Two backends behind the ServiceManager trait:

PactSupervisor (default):

  • Direct process management via tokio::process::Command
  • cgroup v2 isolation: creates /sys/fs/cgroup/pact.slice/<service>/ per service
  • Memory limits, CPU quotas via cgroup controllers
  • Health checks: process alive + optional HTTP/TCP endpoint check
  • Restart with exponential backoff (configurable per service)
  • Dependency ordering from service declarations in vCluster overlay
  • Zombie reaping: pact-agent sets PR_SET_CHILD_SUBREAPER
  • stdout/stderr capture via pipes → pact log pipeline → Loki
  • Ordered shutdown: reverse dependency order, SIGTERM → grace period → SIGKILL

SystemdBackend (fallback):

  • Generates systemd unit files from vCluster service declarations
  • Start/stop/restart via D-Bus connection to systemd
  • Monitor via sd_notify protocol
  • Same ServiceManager trait — transparent to rest of pact-agent

Shell Server (src/shell/)

Replaces SSH. Listens on a gRPC endpoint (mTLS authenticated). Provides three RPC operations: exec (single command), shell (interactive session), and CollectDiag (structured diagnostic log retrieval).

pact exec (single command):

Client → ExecRequest{node_id, command, args} → pact-agent
  → authenticate (OIDC token verification)
  → authorize: call PolicyService.Evaluate() on policy node (full OPA/Rego)
      if policy service unreachable: fall back to cached VClusterPolicy
      (role_bindings + whitelist only; two-person approval denied)
  → whitelist check (command in allowed set?)
  → classify (read-only or state-changing?)
  → if state-changing: go through commit window model
  → execute via fork/exec in restricted environment
  → stream stdout/stderr back to client
  → log command + output to journal

For exec, pact-agent controls the full command — it receives a command + args, validates against the whitelist, and fork/execs directly. No shell interpretation.

pact shell (interactive session — restricted bash):

pact shell does not reimplement a shell. It spawns a restricted bash session inside a controlled environment. Reimplementing line editing, pipes, redirects, globbing, quoting, job control, and signal handling would be both enormous and a security liability (command parsing bugs = bypasses).

Client → ShellSessionRequest{node_id} → pact-agent
  → authenticate + authorize (same policy call as exec; shell requires
    higher privilege — if policy service unreachable, cached RBAC check)
  → open bidirectional gRPC stream
  → allocate PTY with restricted bash environment:
      - PATH restricted to whitelisted command directories
      - readonly PATH, ENV, BASH_ENV, SHELL (prevent escape)
      - custom PROMPT_COMMAND logs each command to pact audit
      - rbash or bash --restricted as base
      - mount namespace: hide sensitive paths if configured
      - cgroup: session-level resource limits
  → session start/end logged to journal
  → session ends: cleanup PTY, cgroup, log session summary

Restriction layers (defense in depth, not command parsing):

  1. PATH restriction: only whitelisted binaries are reachable. The agent builds a restricted PATH from the vCluster’s shell_whitelist, symlinking allowed commands into a session-specific directory (/run/pact/shell/<sid>/bin/). Bash in restricted mode (rbash) prevents changing PATH or running commands by absolute path.

  2. PROMPT_COMMAND audit: bash’s PROMPT_COMMAND hook runs before each prompt, logging the previous command ($(history 1)) to pact’s audit pipeline. This captures what was actually executed, not what pact thinks was executed.

  3. Mount namespace (optional): hide /root, /home, SSH keys, and other sensitive paths from the shell session.

  4. Seccomp/cgroup: session-level resource limits and optional syscall filtering.

  5. State change detection: the existing drift observer (eBPF + inotify + netlink) detects changes made during the session. These trigger commit windows as normal — the shell doesn’t need to pre-classify commands.

What pact exec does vs pact shell:

  • pact exec: pact controls the full command lifecycle (whitelist, classify, fork/exec). No shell involved. Suitable for automation and diagnostics.
  • pact shell: bash controls command execution. pact controls the environment (PATH, namespace, cgroup) and observes changes after the fact. Suitable for interactive debugging.

Learning mode: when a user tries to run a command not in PATH, bash returns “command not found”. The agent detects this (via audit log or PROMPT_COMMAND exit code) and suggests adding the command to the vCluster whitelist.

State Observer (src/observer/)

Three detection mechanisms:

  • eBPF probes (feature-gated ebpf, Linux-only):
    • System-level: mount, sethostname, sysctl writes, module load/unload
    • Extended: file permission changes, network namespace operations, cgroup modifications
    • No overlap with lattice eBPF: lattice traces workload-level events (job lifecycle, GPU allocation). pact traces system-level config changes. Probe attachment points are coordinated to avoid conflicts.
  • inotify: config file paths (derived from declared state + watch list)
  • netlink: interface state, address changes, mount events, routing

Observe-only mode for initial deployment (log everything, enforce nothing).

Cross-platform: On macOS (development), a MockObserver simulates drift events for local dev/test. Real observers only compile and run on Linux.

Drift Evaluator (src/drift/)

DriftVector across 7 dimensions (mounts, files, network, services, kernel, packages, gpu). Magnitude = weighted Euclidean norm with per-vCluster dimension weights.
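As a sketch, the magnitude computation (dimension ordering and weight values are illustrative; weights come from the vCluster policy):

```rust
/// Weighted Euclidean norm over the 7 drift dimensions
/// (mounts, files, network, services, kernel, packages, gpu).
fn drift_magnitude(dims: &[f64; 7], weights: &[f64; 7]) -> f64 {
    dims.iter()
        .zip(weights.iter())
        .map(|(d, w)| w * d * d)
        .sum::<f64>()
        .sqrt()
}
```

A regulated vCluster might weight the services or kernel dimensions higher, so the same raw change produces a larger magnitude and hence a shorter commit window.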

Commit Window Manager (src/commit/)

Optimistic concurrency. Active consumer check before rollback (don’t unmount filesystems with open handles). Emergency mode: extended window + suspended rollback.

Config Subscription (src/subscription/)

After boot, the agent subscribes to BootConfigService.SubscribeConfigUpdates() on the journal for live updates. This stream delivers:

  • vCluster overlay changes (e.g. pact apply updates the overlay)
  • Node-specific delta changes (e.g. promoted changes from pact promote)
  • Policy updates (refreshes cached VClusterPolicy for authorization)
  • Blacklist changes (updates drift detection exclusions)

This means overlay and policy changes propagate to running nodes without reboot. The agent applies overlay changes through the same path as boot-time config application. If the subscription stream is interrupted, the agent reconnects with from_sequence to resume from the last received update.
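The resume logic can be sketched as tracking the last applied sequence number (struct and field names here are assumptions, not the real subscription API):

```rust
/// Sketch of subscription resume: remember the last applied sequence and
/// reconnect with from_sequence = last + 1.
struct ConfigSubscription {
    last_seq: u64,
}

impl ConfigSubscription {
    /// Apply an update; returns false for duplicates redelivered after a
    /// reconnect, which are ignored.
    fn apply(&mut self, seq: u64) -> bool {
        if seq > self.last_seq {
            self.last_seq = seq;
            true
        } else {
            false
        }
    }

    /// Sequence number to pass as from_sequence when reconnecting.
    fn resume_from(&self) -> u64 {
        self.last_seq + 1
    }
}
```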

Capability Reporter (src/capability/)

Five hardware detection backends, each following the trait + Linux/Mock pattern:

  • GPU (GpuBackend): NVIDIA (nvidia-smi, feature nvidia) + AMD (rocm-smi, feature amd) + Mock
  • CPU (CpuBackend): /proc/cpuinfo + sysfs (arch, cores, freq, ISA features, NUMA, L3 cache)
  • Memory (MemoryBackend): /proc/meminfo + sysfs NUMA + dmidecode for type (DDR/HBM, 2s timeout)
  • Network (NetworkBackend): /sys/class/net/ enumeration, Slingshot (cxi driver), speed, link state
  • Storage (StorageBackend): /sys/block/ NVMe, /proc/mounts + statvfs (2s timeout), diskless detection

Reports to lattice scheduler via tmpfs manifest (/run/pact/capability.json) + unix socket (consumed by lattice-node-agent, which pact supervises as a child process).

Emergency Mode (src/emergency/)

pact emergency --reason "..." → extended window, no rollback, full audit logging. Must end with explicit commit or rollback. Stale emergency → alert + scheduling hold.

Cross-Platform Development

Three-tier strategy for macOS development:

  1. Feature-gate: #[cfg(target_os = "linux")] for cgroup v2, eBPF, netlink, inotify, PTY allocation. Stubs compile on macOS.
  2. Mock implementations: MockSupervisor, MockObserver, MockGpuBackend, MockCpuBackend, MockMemoryBackend, MockNetworkBackend, MockStorageBackend for local dev/test on macOS. Unit + integration tests run with mocks.
  3. Devcontainer: Linux container for integration + acceptance tests (BDD/cucumber). Real supervisor, real observers, real cgroups. CI runs in this environment.

Resource Budget

  • RSS: < 50 MB (including eBPF maps and supervisor overhead)
  • CPU steady state: < 0.5%
  • CPU during drift eval: < 2%
  • CPU during shell session: depends on commands executed

Journal Design

Overview

pact-journal is the distributed, append-only configuration log. It runs its own Raft group, independent from lattice’s quorum (see ADR-001 for deployment modes). Single source of truth for declared state.

Deployment Modes

Two modes for the Raft quorum (see ADR-001):

Standalone (default): pact-journal runs on dedicated management nodes. Fully independent from lattice infrastructure.

Co-located: pact-journal and lattice-server run on the same management nodes, each with its own Raft group on separate ports. Pact is the incumbent — its quorum is already running when lattice starts. Lattice configures its own quorum to use the same node hostnames.

In both modes:

  • Independent Raft groups (separate leader election, log, snapshots)
  • Separate data directories (/var/lib/pact/journal)
  • Separate ports (Raft: 9444, gRPC: 9443)
  • No protocol-level coupling

Raft State Machine

The journal’s state machine (JournalState) is pact-specific:

pub struct JournalState {
    /// All config entries, indexed by sequence number.
    pub entries: BTreeMap<EntrySeq, ConfigEntry>,
    /// Per-node current config state.
    pub node_states: HashMap<NodeId, ConfigState>,
    /// Per-vCluster active policy.
    pub policies: HashMap<VClusterId, VClusterPolicy>,
    /// Pre-computed boot overlays per vCluster.
    pub overlays: HashMap<VClusterId, BootOverlay>,
    /// Admin operation audit log.
    pub audit_log: Vec<AdminOperation>,
}

What goes through Raft (strong consistency)

  • Config commits and rollbacks
  • Policy updates
  • Emergency mode start/end
  • Admin operation records (exec, shell session start/end)

What does NOT go through Raft

  • Boot config streaming reads (served from any replica, including learners)
  • Drift detection events (written locally, forwarded to journal asynchronously)
  • Capability reports (sent to lattice scheduler, not journal)
  • Telemetry (Loki/Prometheus, not Raft)

Command Set

pub enum JournalCommand {
    /// Append a new config entry (commit, rollback, policy update, etc.)
    AppendEntry(ConfigEntry),
    /// Update a node's config state (committed, drifted, emergency, etc.)
    UpdateNodeState { node_id: NodeId, state: ConfigState },
    /// Set or update a vCluster policy.
    SetPolicy { vcluster_id: VClusterId, policy: VClusterPolicy },
    /// Store a pre-computed boot overlay for a vCluster.
    SetOverlay { vcluster_id: VClusterId, overlay: BootOverlay },
    /// Record an admin operation (exec log, shell session).
    RecordOperation(AdminOperation),
}

Log Structure

ConfigEntry: sequence, timestamp, entry_type, scope, author (OIDC identity), parent (chain for state reconstruction), state_delta, policy_ref, ttl, emergency_reason.

Entry types: Commit, Rollback, AutoConverge, DriftDetected, CapabilityChange, PolicyUpdate, BootConfig, EmergencyStart, EmergencyEnd, ExecLog, ShellSession, ServiceLifecycle.

Note: ExecLog and ShellSession are new entry types — every remote command and shell session is recorded in the same immutable log as configuration changes.
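
As an illustrative sketch, the entry shape described above might look like the following; the field names follow the list, while the concrete types and the trimmed EntryType variants are assumptions:

```rust
// Illustrative sketch only: field names follow the text above, the concrete
// types are assumptions, and most EntryType variants are elided.
#[derive(Debug)]
pub enum EntryType {
    Commit,
    Rollback,
    ExecLog,
    ShellSession,
    // ...plus AutoConverge, DriftDetected, PolicyUpdate, and the rest
}

#[derive(Debug)]
pub struct ConfigEntry {
    pub sequence: u64,
    pub timestamp: u64,           // e.g. UNIX epoch millis (assumed unit)
    pub entry_type: EntryType,
    pub scope: String,            // node or vCluster scope
    pub author: String,           // OIDC identity
    pub parent: Option<u64>,      // chain for state reconstruction
    // state_delta and policy_ref omitted here for brevity
    pub ttl: Option<u64>,         // seconds (assumed unit)
    pub emergency_reason: Option<String>,
}

fn main() {
    let entry = ConfigEntry {
        sequence: 4812,
        timestamp: 0,
        entry_type: EntryType::Commit,
        scope: "vcluster/ml-training".into(),
        author: "admin-a@org".into(),
        parent: Some(4811),
        ttl: None,
        emergency_reason: None,
    };
    println!("{:?}", entry);
}
```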

Streaming Boot Config

Two-phase protocol:

  • Phase 1: vCluster base overlay (pre-computed, compressed ~100-200 KB, served from any replica)
  • Phase 2: node-specific delta (<1 KB)

Phase 2 includes both pre-declared per-node config and any previously committed manual changes stored in node_states. This means admin changes committed via pact commit survive reboots — they are reapplied from the journal alongside the vCluster overlay.

Read replicas (non-voting Raft learners) for 100k+ boot storms. Boot config reads do not go through Raft consensus — they read from the local state machine snapshot. This is why boot storms do not block the Raft group.

Overlay Pre-Computation

Hybrid commit + on-demand strategy:

  • On commit: when a config commit or policy update affects a vCluster, the overlay is rebuilt and stored via SetOverlay through Raft. This ensures overlays are warm for the common case (steady-state boots after config changes).
  • On demand: if a boot request arrives for a vCluster with no cached overlay (e.g., first boot after journal restore, or new vCluster), the overlay is built on the fly, then stored for subsequent requests.
  • Overlays are compressed (zstd) and checksummed. Stale overlays are detected by comparing the overlay version against the latest config sequence for that vCluster.
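
The staleness rule above can be sketched in a few lines, assuming the stored overlay records the config sequence it was built at (`BootOverlay` and `is_stale` are illustrative names, not pact's actual types):

```rust
// Hypothetical sketch of the staleness check described above: an overlay is
// stale when its recorded version lags the latest config sequence for its
// vCluster, and must be rebuilt before serving.
pub struct BootOverlay {
    pub version: u64,        // config sequence the overlay was built at
    pub checksum: [u8; 32],  // integrity check (compressed payload elided)
}

/// Returns true when the overlay must be rebuilt before serving.
pub fn is_stale(overlay: &BootOverlay, latest_config_seq: u64) -> bool {
    overlay.version < latest_config_seq
}

fn main() {
    let warm = BootOverlay { version: 4812, checksum: [0; 32] };
    assert!(!is_stale(&warm, 4812)); // warm: built at the latest sequence
    assert!(is_stale(&warm, 4815)); // a newer commit touched this vCluster
}
```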

Storage

/var/lib/pact/journal/
  raft/
    vote.json                          # Persisted vote state
    committed.json                     # Last committed log ID
    wal/
      {index}.json                     # Per-entry WAL files
    snapshots/
      snap-{term}-{index}.json         # State snapshots (keep 3 most recent)

Telemetry

  • Config events → Loki (structured JSON with labels)
  • Server metrics → Prometheus (Raft health, stream throughput, event counts)

Backup

WAL + periodic snapshots + export to object storage (S3/NFS). Full state reconstruction from any snapshot + subsequent WAL.
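
The restore rule is: load the newest snapshot, then replay every WAL entry whose index is greater than the snapshot's last committed index. A minimal sketch, with illustrative types rather than pact's actual state machine:

```rust
// Minimal sketch of snapshot + WAL reconstruction: apply only WAL entries
// newer than the snapshot, in log order. Types are illustrative assumptions.
use std::collections::BTreeMap;

struct Snapshot { last_index: u64, state: BTreeMap<u64, String> }
struct WalEntry { index: u64, payload: String }

fn restore(snap: Snapshot, wal: &[WalEntry]) -> BTreeMap<u64, String> {
    let mut state = snap.state;
    for e in wal.iter().filter(|e| e.index > snap.last_index) {
        state.insert(e.index, e.payload.clone()); // apply in log order
    }
    state
}

fn main() {
    let snap = Snapshot {
        last_index: 2,
        state: BTreeMap::from([(1, "a".into()), (2, "b".into())]),
    };
    let wal = [
        WalEntry { index: 2, payload: "dup".into() }, // already in snapshot
        WalEntry { index: 3, payload: "c".into() },
    ];
    let state = restore(snap, &wal);
    assert_eq!(state.len(), 3); // entry 2 skipped, entry 3 applied
}
```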

CLI & Shell Design

Philosophy

pact CLI is both the remote admin interface (from workstations) and the local shell (on nodes, replacing bash for admin operations). Every operation is authenticated, authorized, and logged.

Command Reference

Configuration Management

| Command | Description |
|---------|-------------|
| pact status [--vcluster X] | Node/vCluster state, drift, capabilities |
| pact diff [node] | Declared vs actual state |
| pact diff --committed [node] | Show committed node deltas not yet in overlay |
| pact commit -m "msg" | Commit drift on current node (node-level delta) |
| pact rollback [seq] | Roll back to previous state |
| pact log [-n N] [--scope S] | Configuration history |
| pact apply <spec.toml> | Apply declarative config spec |
| pact promote <node> [--dry-run] | Export committed node deltas as overlay TOML |
| pact watch [--vcluster X] | Live event stream |
| pact extend [mins] | Extend commit window |
| pact emergency | Enter/exit emergency mode |

Admin Operations (replaces SSH)

| Command | Description |
|---------|-------------|
| pact exec <node> -- <cmd> | Run command on node (whitelisted) |
| pact shell <node> | Interactive shell session |
| pact service status <name> | Service status |
| pact service restart <name> | Restart service (commit window applies) |
| pact service logs <name> | Stream service logs |
| pact diag <node> | Structured diagnostic log collection (dmesg, syslog, services) |
| pact diag --vcluster X | Fleet-wide diagnostic log retrieval |
| pact cap [node] | Capability report |
| pact blacklist | Manage drift detection exclusions |

Delegation (calls external APIs)

| Command | Delegates to | Description |
|---------|--------------|-------------|
| pact reboot <node> | OpenCHAMI SMD | Reboot via Redfish BMC |
| pact reimage <node> | OpenCHAMI SMD | Re-image node |
| pact drain <node> | Lattice | Drain jobs from node |
| pact undrain <node> | Lattice | Cancel drain, return to Ready |
| pact cordon <node> | Lattice | Remove from scheduling |
| pact uncordon <node> | Lattice | Return to scheduling |

Group Management

| Command | Description |
|---------|-------------|
| pact group list | List vClusters and groups |
| pact group show <name> | Show vCluster config |
| pact group set-policy | Update vCluster policy |

Supercharged (pact + lattice)

These commands combine data from pact and lattice into unified views. Requires PACT_LATTICE_ENDPOINT to be configured. Lattice-only commands (delegation + supercharged) are hidden from --help when the endpoint is not set, keeping the CLI clean for sites without lattice.

| Command | Description |
|---------|-------------|
| pact jobs list [--node X] | List running allocations |
| pact jobs cancel <id> | Cancel a stuck job |
| pact jobs inspect <id> | Job details |
| pact queue [--vcluster X] | Scheduling queue status |
| pact cluster | Combined Raft cluster health |
| pact audit [--source all] | Unified audit trail (pact + lattice) |
| pact accounting [--vcluster X] | Resource usage (GPU/CPU hours) |
| pact health | Combined system health check |
| pact dag list [--tenant X] | List DAG workflows |
| pact dag inspect <id> | DAG details and step status |
| pact dag cancel <id> | Cancel a DAG workflow |
| pact budget tenant <id> | Tenant GPU/node hours budget |
| pact budget user <id> | User usage across tenants |
| pact backup create <path> | Backup lattice Raft state (admin) |
| pact backup verify <path> | Verify backup integrity (admin) |
| pact backup restore <path> | Restore from backup (admin, --confirm) |
| pact nodes list [--state X] | List lattice nodes with state |
| pact nodes inspect <id> | Node hardware/ownership details |

Example: Debug Session

# Admin notices GPU issues on node042
$ pact cap node042
  GPUs: 3x A100 (healthy), 1x A100 (DEGRADED - ECC errors)

# Check what's different from declared state
$ pact diff node042
  gpu[3]: declared=healthy actual=DEGRADED (ECC uncorrectable: 12)

# Run diagnostics remotely
$ pact exec node042 -- nvidia-smi -q -d ECC
  [full nvidia-smi output streamed back, logged to journal]

# Need interactive access
$ pact shell node042
pact:node042> dmesg | grep -i nvidia | tail -5
  [kernel messages about GPU errors]
pact:node042> cat /var/log/nvidia-persistenced.log
  [nvidia daemon logs]
pact:node042> exit

# Cordon the node while hardware team investigates
$ pact cordon node042
  Cordoned: node042 removed from lattice scheduling (via lattice API)

Example: Promoting Node Deltas to Overlay

# After debugging, admin added a sysctl and NFS mount on node042.
# These were committed as node deltas. Check what's accumulated:
$ pact diff --committed node042
  kernel: vm.nr_hugepages = 1024  (committed seq:4812, 3 days ago)
  mounts: /local-scratch type=nfs source=storage03:/scratch  (committed seq:4815, 2 days ago)

# Export as overlay TOML (dry-run to preview)
$ pact promote node042 --dry-run
  # Generated overlay fragment for vcluster: ml-training
  # From 2 committed node deltas on node042

  [vcluster.ml-training.sysctl]
  "vm.nr_hugepages" = "1024"

  [vcluster.ml-training.mounts]
  "/local-scratch" = { type = "nfs", source = "storage03:/scratch" }

# Looks right — export to file, review, then apply to the whole vCluster
$ pact promote node042 > /tmp/hugepages-and-scratch.toml
$ vi /tmp/hugepages-and-scratch.toml   # review/edit
$ pact apply /tmp/hugepages-and-scratch.toml
  Applied to vcluster ml-training (2 changes). Overlay updated.
  Node deltas on node042 superseded (seq:4812, seq:4815 now redundant).

# Verify: node042 should have no more unpromoted deltas
$ pact diff --committed node042
  (no committed node deltas)

The promote command maps StateDelta fields to overlay TOML sections:

| StateDelta field | Overlay TOML section |
|------------------|----------------------|
| kernel | [vcluster.<name>.sysctl] |
| mounts | [vcluster.<name>.mounts] |
| files | [vcluster.<name>.files] |
| services | [vcluster.<name>.services.<svc>] |
| network | [vcluster.<name>.network] |
| packages | [vcluster.<name>.packages] |

Deltas that can’t be cleanly mapped (e.g. GPU state changes) are emitted as comments with the raw delta for manual handling.

Two-Person Approval (Regulated vClusters)

For vClusters with two_person_approval = true, state-changing operations require a second admin to approve before execution.

CLI Commands

| Command | Description |
|---------|-------------|
| pact approve list | Show pending approval requests |
| pact approve <id> | Approve a pending request |
| pact approve deny <id> -m "reason" | Deny a pending request |

Flow

# Admin A: commit a change on a regulated vCluster
$ pact commit -m "add hugepages for training"
  Approval required (two-person policy on vcluster: sensitive-compute)
  Pending approval: ap-7f3a (expires in 30 min)
  Waiting for approval... (Ctrl-C to background)

# Admin B (separately): sees pending approvals
$ pact approve list
  ap-7f3a  sensitive-compute  "add hugepages for training"  by admin-a@org  12 min ago

$ pact approve ap-7f3a
  Approved. Commit applied on sensitive-compute.

Mechanism

  1. Admin A’s operation triggers PolicyService.Evaluate() → OPA returns ApprovalRequired { approval_type: "two_person", pending_approval_id: "ap-7f3a" }
  2. The request is stored in the journal as a pending operation (new entry type)
  3. Admin B queries pending approvals via journal, approves via PolicyService
  4. The journal stores the approval and executes the original operation
  5. If no approval within the timeout (default 30 min, configurable per vCluster), the request expires and the change is rolled back

Notifications

Pending approvals are emitted as Loki events with structured labels. Grafana alert rules can route these to Slack, PagerDuty, or email based on vCluster and severity.

BMC Console Access (on-node)

BMC console provides regular bash — not restricted bash, not pact shell. This is the out-of-band fallback for when pact-agent is unresponsive or when the admin needs unrestricted access (e.g. to debug pact-agent itself).

BMC access is controlled by BMC credentials (IPMI/Redfish), not by pact RBAC. Changes made via BMC are detected by the drift observer when pact-agent is running, and appear as unattributed drift (no OIDC identity).

[BMC console connects — regular bash]
root@node042:~# pact status
  Node: node042  State: COMMITTED  Supervisor: 5 services running
root@node042:~# systemctl status pact-agent
  [check agent health]
root@node042:~# nvidia-smi
  [unrestricted access, drift detected if state changes]

Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error |
| 2 | Authentication/authorization failure |
| 3 | Policy rejection |
| 4 | Conflict (concurrent modification) |
| 5 | Timeout (journal unreachable) |
| 6 | Command not whitelisted |
| 10 | Rollback failed (active consumers) |

Drift Detection Architecture

Overview

Drift detection is pact’s core mechanism for tracking configuration state divergence. It follows a blacklist-first approach (ADR-002): observe everything, exclude known noise.

Drift Vector

Seven dimensions tracked independently:

| Dimension | Source | Weight | Example |
|-----------|--------|--------|---------|
| kernel | sysctl changes | 2.0 | vm.swappiness modified |
| mounts | mount/unmount events | 1.0 | NFS share mounted |
| files | file create/modify/delete | 1.0 | /etc/ntp.conf changed |
| network | interface changes | 1.0 | eth0 link state change |
| services | process start/stop | 1.0 | nginx started |
| packages | package install/remove | 1.0 | CUDA toolkit updated |
| gpu | GPU state changes | 2.0 | GPU health degraded |

Magnitude: Weighted L2 norm of the drift vector. Kernel and GPU have 2x weight (higher impact on node behavior).
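
The magnitude calculation can be sketched directly from that description; the exact semantics of each per-dimension score are an assumption here:

```rust
// Sketch of the magnitude rule above: a weighted L2 norm over the drift
// dimensions, with kernel and gpu carrying 2x weight. How each dimension's
// raw score is derived is an assumption, not pact's documented behavior.
fn drift_magnitude(dims: &[(f64, f64)]) -> f64 {
    // dims: (weight, per-dimension drift score)
    dims.iter().map(|(w, d)| (w * d).powi(2)).sum::<f64>().sqrt()
}

fn main() {
    // One sysctl change (kernel, weight 2.0) and one new mount (weight 1.0):
    let dims = [(2.0, 1.0), (1.0, 1.0)];
    let mag = drift_magnitude(&dims);
    assert!((mag - 5.0_f64.sqrt()).abs() < 1e-9); // sqrt(2^2 + 1^2)
}
```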

Observer Pipeline

Observer → ObserverEvent → DriftEvaluator → CommitWindowManager
   │                            │                    │
   ├─ InotifyObserver (files)   ├─ blacklist filter   ├─ window = base / (1 + mag * sens)
   ├─ NetlinkObserver (network) ├─ category mapping   ├─ Idle → Open → Expired
   └─ EbpfObserver (kernel)     └─ magnitude calc     └─ emergency extends window

Blacklist Patterns

Default patterns (noise suppression):

/tmp/**
/var/log/**
/proc/**
/sys/**
/dev/**
/run/user/**

Pattern matching:

  • ** = recursive match (any depth)
  • /* = single path segment
  • Exact paths = literal match

Blacklist is dynamically updateable via config subscription from journal.
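
The three pattern forms can be sketched as a small matcher (this is an illustration of the rules listed above, not pact's actual implementation):

```rust
// Minimal sketch of the three pattern forms: `/**` (any depth), `/*` (one
// path segment), and exact literal match.
fn matches(pattern: &str, path: &str) -> bool {
    if let Some(prefix) = pattern.strip_suffix("/**") {
        // recursive: the prefix itself or anything beneath it
        path == prefix || path.starts_with(&format!("{prefix}/"))
    } else if let Some(prefix) = pattern.strip_suffix("/*") {
        // exactly one additional path segment
        path.strip_prefix(&format!("{prefix}/"))
            .map_or(false, |rest| !rest.is_empty() && !rest.contains('/'))
    } else {
        path == pattern // exact literal
    }
}

fn main() {
    assert!(matches("/tmp/**", "/tmp/scratch/x.dat"));
    assert!(matches("/run/user/*", "/run/user/1000"));
    assert!(!matches("/run/user/*", "/run/user/1000/bus")); // too deep
    assert!(matches("/etc/motd", "/etc/motd"));
}
```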

Commit Window

Formula: window_seconds = base_window / (1 + drift_magnitude * sensitivity)

| Drift | Sensitivity | Base | Window |
|-------|-------------|------|--------|
| 0.0 | - | 900s | Idle (no window) |
| 0.5 | 2.0 | 900s | 450s |
| 1.0 | 2.0 | 900s | 300s |
| 5.0 | 2.0 | 900s | 82s |

Minimum window: 60 seconds (clamped). Emergency mode: window extended to emergency_window_seconds (default 4 hours).
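
The formula and the 60-second clamp can be checked with a few lines of Rust (a sketch for verification; pact's actual implementation may differ):

```rust
// Sketch of the commit window formula above, including the 60s minimum.
fn commit_window(base: f64, magnitude: f64, sensitivity: f64) -> f64 {
    let w = base / (1.0 + magnitude * sensitivity);
    w.max(60.0) // minimum window: 60 seconds (clamped)
}

fn main() {
    assert_eq!(commit_window(900.0, 0.5, 2.0), 450.0);
    assert_eq!(commit_window(900.0, 1.0, 2.0), 300.0);
    assert_eq!(commit_window(900.0, 5.0, 2.0).round(), 82.0); // 900 / 11
    assert_eq!(commit_window(900.0, 100.0, 2.0), 60.0); // clamped
}
```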

Conflict Resolution (CR1-CR3)

On partition reconnect:

  1. Agent compares local state against journal entries
  2. Conflicting keys are registered in ConflictManager
  3. Grace period: admin resolves per-key (AcceptLocal | AcceptJournal)
  4. Auto-resolve: journal-wins after grace period expires
  5. All resolutions logged for audit

Homogeneity Check (ND3)

vCluster nodes should have identical config. Per-node deltas (node-scoped entries) indicate heterogeneity. check_homogeneity() reports nodes with per-node deltas that diverge from the vCluster overlay.

Shell Server Design

Overview

The pact shell server replaces SSH (ADR-007) as the sole admin interface to compute nodes. It provides authenticated, audited, policy-enforced command execution.

Two Execution Modes

1. Single Command (pact exec)

  • Fork/exec a single command directly (no shell interpretation)
  • Whitelisted commands only (37 defaults + vCluster policy additions)
  • Streaming output: stdout → stderr → exit code
  • Timeout enforcement (5 minutes default, 10MB output limit)

2. Interactive Shell (pact shell)

  • Allocates a PTY pair via openpty()
  • Spawns /bin/rbash (restricted bash)
  • Session-specific restricted PATH
  • PROMPT_COMMAND audit logging
  • Terminal resize support (TIOCSWINSZ)

gRPC Service

service ShellService {
  rpc Exec(ExecRequest) returns (stream ExecOutput);
  rpc Shell(stream ShellInput) returns (stream ShellOutput);
  rpc ListCommands(ListCommandsRequest) returns (ListCommandsResponse);
  rpc ExtendCommitWindow(ExtendWindowRequest) returns (ExtendWindowResponse);
}

Authentication Flow

gRPC metadata → extract Bearer token → validate JWT → extract Identity
                                         │
                                    HS256 (dev) or RS256/JWKS (prod)

Authorization Flow

Identity → whitelist check → platform admin bypass? → role check → classify
              │                      │                    │            │
              ├─ allowed?            ├─ S2: admin can     ├─ ops?      ├─ state-changing?
              │   no → learning      │   exec anything    │   yes      │   yes → commit window
              │         mode record  │                    ├─ viewer?   │   no → read-only
              └─ yes                 │                    │   read-    │
                                     │                    │   only     │
                                     │                    │   cmds     │
                                     │                    └─ deny      │

Default Whitelist (37 commands)

  • Diagnostic: nvidia-smi, rocm-smi, ps, top, htop, lspci, lsmod, lsblk, lscpu
  • Network: ip, ss, ping, traceroute, ethtool
  • File inspection: cat, head, tail, wc, ls, stat, file, md5sum, sha256sum, diff, grep
  • System: journalctl, sysctl, dmesg, uname, hostname, uptime, free, df, mount, echo
  • State-changing: systemctl, modprobe, umount, sysctl (write mode)

Command Argument Validation

Before executing any command via pact exec, the shell server runs validate_args() on the provided arguments. This blocks access to sensitive paths including /etc/shadow, /.ssh/, /root/, and CA key material. Path arguments are normalized (resolving .., symlinks, and percent-encoding) to prevent traversal attacks before the check is applied.
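
A minimal sketch of that two-step check, assuming a purely lexical normalization (real symlink resolution and percent-decoding are elided, and the deny list here is illustrative):

```rust
// Hypothetical sketch of argument validation: lexically normalize a path
// (resolving `.` and `..`), then test it against denied prefixes.
fn normalize(path: &str) -> String {
    let mut parts: Vec<&str> = Vec::new();
    for seg in path.split('/') {
        match seg {
            "" | "." => {}           // skip empty and current-dir segments
            ".." => { parts.pop(); } // step up, never above root
            s => parts.push(s),
        }
    }
    format!("/{}", parts.join("/"))
}

fn is_denied(path: &str) -> bool {
    const DENIED: &[&str] = &["/etc/shadow", "/root", "/.ssh"];
    let p = normalize(path);
    DENIED.iter().any(|d| p == *d || p.starts_with(&format!("{d}/")))
}

fn main() {
    assert!(is_denied("/var/log/../../etc/shadow")); // traversal caught
    assert!(is_denied("/root/.bashrc"));
    assert!(!is_denied("/etc/ntp.conf"));
}
```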

Session Security

| Control | Mechanism |
|---------|-----------|
| PATH restriction | Symlinks in /run/pact/shell/{sid}/bin/ |
| Shell restriction | /bin/rbash prevents PATH changes |
| Startup injection | BASH_ENV="", ENV="" |
| Home access | HOME=/tmp |
| Audit | PROMPT_COMMAND='history 1 >> /var/log/pact/shell.log' |
| Session limit | Configurable max concurrent sessions |
| Stale cleanup | Sessions in Closing state cleaned after timeout |

Emergency Mode Design

Purpose

Emergency mode provides an extended operational window when immediate human intervention is required on a node. It suspends automatic drift rollback while maintaining full audit logging (ADR-004).

Lifecycle

Normal → EmergencyStart(reason, admin) → Emergency Active → EmergencyEnd(admin) → Normal
                                              │
                                         Extended commit window (4h default)
                                         Auto-converge suspended
                                         All actions still logged
                                         Shell whitelist NOT expanded

Entry Conditions

| Actor | Can Enter? | Can Exit? |
|-------|------------|-----------|
| Human admin (ops/platform) | Yes | Own emergency or with --force |
| AI agent (pact-service-ai) | No (P8) | No (P8) |
| Service agent | No | No |

Configuration

[agent.commit_window]
emergency_window_seconds = 14400  # 4 hours (default)

Audit Trail

Both entry and exit are recorded as immutable journal entries:

EntryType::EmergencyStart { reason, admin_identity, timestamp }
EntryType::EmergencyEnd { admin_identity, timestamp }

Stale Emergency Detection

If an emergency exceeds its window without being resolved:

  • Alert generated for platform admins
  • Emergency remains active (no auto-exit)
  • Only platform admin can force-end

What Emergency Mode Does NOT Do

  • Does not expand the shell whitelist (security invariant)
  • Does not bypass RBAC authorization
  • Does not suppress audit logging
  • Does not allow untracked changes
  • Does not grant additional privileges

Policy Engine Design (ADR-003)

Architecture

Policy evaluation is co-located on journal nodes. The PolicyService gRPC service handles all policy operations:

CLI/Agent → PolicyService (journal) → RbacEngine (pact-policy) → Decision
                                   → OPA sidecar (optional, for Rego policies)

gRPC API

service PolicyService {
  rpc Evaluate(PolicyEvalRequest) returns (PolicyEvalResponse);
  rpc GetEffectivePolicy(GetPolicyRequest) returns (VClusterPolicy);
  rpc UpdatePolicy(UpdatePolicyRequest) returns (UpdatePolicyResponse);
  rpc ListPendingApprovals(ListApprovalsRequest) returns (ListApprovalsResponse);
  rpc DecideApproval(DecideApprovalRequest) returns (DecideApprovalResponse);
}

RBAC Decisions

| Decision | Meaning | Action |
|----------|---------|--------|
| Allow | Authorized | Proceed |
| Deny { reason } | Not authorized | Return error with reason |
| Defer | Requires approval | Create PendingApproval, return approval_id |

Two-Person Approval (P4)

For regulated vClusters (two_person_approval = true):

  1. Regulated role submits state-changing action
  2. PolicyService returns Defer with pending_approval_id
  3. Approval persisted through Raft (CreateApproval command)
  4. Second admin approves or rejects (DecideApproval command)
  5. Self-approval denied (requester != approver)
  6. Approvals expire after 24 hours
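
The two guard rules (no self-approval, expiry after a deadline) can be sketched as a pure decision function; `PendingApproval` and the error variants are illustrative names:

```rust
// Sketch of the approval guards above: the requester may not approve their
// own request, and requests past their TTL are rejected.
#[derive(Debug, PartialEq)]
enum ApprovalError { SelfApproval, Expired }

struct PendingApproval { requester: String, created_at: u64, ttl_secs: u64 }

fn decide(p: &PendingApproval, approver: &str, now: u64) -> Result<(), ApprovalError> {
    if approver == p.requester {
        return Err(ApprovalError::SelfApproval); // requester != approver
    }
    if now > p.created_at + p.ttl_secs {
        return Err(ApprovalError::Expired); // e.g. 24h TTL
    }
    Ok(())
}

fn main() {
    let p = PendingApproval { requester: "admin-a".into(), created_at: 0, ttl_secs: 86_400 };
    assert_eq!(decide(&p, "admin-a", 10), Err(ApprovalError::SelfApproval));
    assert_eq!(decide(&p, "admin-b", 90_000), Err(ApprovalError::Expired));
    assert!(decide(&p, "admin-b", 10).is_ok());
}
```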

VCluster Policy

[vcluster.ml-training]
drift_sensitivity = 2.0
base_commit_window_seconds = 900
emergency_window_seconds = 14400
regulated = false
two_person_approval = false
enforcement_mode = "observe"  # or "enforce"
supervisor_backend = "pact"    # or "systemd"
exec_whitelist = ["nvidia-smi", "dmesg"]
shell_whitelist = ["ls", "cat"]
emergency_allowed = true
audit_retention_days = 2555    # ~7 years for regulated

OPA Integration

  • OPA runs as a sidecar on journal nodes (port 8181)
  • Rego policies pushed via OPA REST API
  • Federation: policy templates synced via Sovra
  • Fallback: built-in RbacEngine if OPA unavailable

Security

STRIDE Threat Model — pact

This document applies the STRIDE threat modeling framework to pact’s architecture. Each section identifies threats by STRIDE category, maps them to components, assesses residual risk after existing mitigations, and recommends further hardening where gaps remain.

Scope: pact-agent, pact-journal (Raft quorum), pact-policy, pact CLI, pact MCP server, and the trust boundaries between them. External systems (OPA, Sovra, OpenCHAMI/Manta, lattice) are modeled as trust boundary crossings.

Data flow summary (for threat identification):

┌─────────────┐  mTLS   ┌──────────────────────────────┐   Raft   ┌───────────────┐
│ pact-agent  │◄───────►│       pact-journal           │◄────────►│ pact-journal  │
│ (compute    │         │  (leader / follower)         │          │ (other peers) │
│  node)      │         │  ┌──────────┐ ┌────────────┐ │          └───────────────┘
│             │         │  │ PolicySvc│ │ EnrollSvc  │ │
│ observer    │         │  └─────┬────┘ └────────────┘ │
│ supervisor  │         │        │ localhost           │
│ shell srv   │         │        ▼                     │
│ drift eval  │         │  ┌──────────┐                │
└──────┬──────┘         │  │   OPA    │                │
       │                │  └──────────┘                │
       │                │  ephemeral CA key (in memory)│
       │                │  revocation registry (Raft)  │
       │                └──────────────┬───────────────┘
       │                               │
       │  gRPC+OIDC     ┌─────────────┴────────────────┐
       │                │         pact CLI             │
       │                │  (admin workstation)         │
       │                └──────────────────────────────┘
       │                               │
       │  gRPC+OIDC     ┌──────────────────────────────┐
       └────────────────│       pact MCP server        │
                        │  (AI agent tool-use)         │
                        └──────────────────────────────┘

Trust boundaries

| ID | Boundary | Protection |
|----|----------|------------|
| TB1 | Agent ↔ Journal | mTLS (X.509, ephemeral intermediate CA generated at journal startup) |
| TB2 | CLI/MCP ↔ Journal | gRPC + OIDC Bearer JWT |
| TB3 | CLI/MCP ↔ Agent (shell/exec) | gRPC + OIDC Bearer JWT, routed through journal auth |
| TB4 | Journal ↔ OPA sidecar | localhost-only (127.0.0.1:8181), no auth |
| TB5 | Journal ↔ Vault | Removed — CA is ephemeral, generated at journal startup. No external CA dependency. |
| TB6 | Journal ↔ Sovra | mTLS, federation policy sync |
| TB7 | Agent ↔ OS (eBPF, cgroups, PTY) | Kernel privilege (CAP_SYS_ADMIN, CAP_BPF) |
| TB8 | Enrollment endpoint | Server-TLS only (unauthenticated gRPC) |

S — Spoofing

S-1: Rogue node enrollment

Target: Enrollment endpoint (TB8) — the only unauthenticated gRPC endpoint.

Threat: An attacker on the management network spoofs MAC + BMC serial to enroll a malicious node, obtaining a valid mTLS certificate.

Existing mitigations:

  • Enrollment registry gate: only pre-registered hardware identities are served (E1).
  • Once-Active rejection: after the real node enrolls, duplicates are rejected with ALREADY_ACTIVE until heartbeat timeout (E7).
  • CSR model: even if spoofed, the attacker gets a cert for their key — they cannot impersonate the real node’s existing connections (E4).
  • Rate limiting: configurable N enrollments/min (default 100).
  • Audit: all enrollment attempts (success/failure) logged + Loki alert on repeated failures.

Residual risk: Medium. MAC + BMC serial are not cryptographically strong. The spoofing window is narrow (between PXE boot and first enrollment, ~seconds) but exists. If an attacker has physical or BMC-level access to read identifiers, they can pre-stage the race.

Recommendations:

  1. Enable TPM attestation (tpm_endorsement_key_hash in enrollment request) for high-security deployments — closes the spoofing window entirely.
  2. Monitor NODE_ALREADY_ACTIVE rejections as a spoofing indicator. Alert on any occurrence outside of expected agent restarts.
  3. Consider network segmentation: enrollment endpoint accessible only from the PXE/boot VLAN, not the general management network.

S-2: OIDC token theft / replay

Target: CLI ↔ Journal (TB2), CLI ↔ Agent (TB3).

Threat: Stolen JWT used to impersonate an admin. JWTs are bearer tokens — whoever holds one is authenticated.

Existing mitigations:

  • Token cache files at 0600 permissions, strict mode rejects wrong perms (Auth5, PAuth1).
  • Refresh tokens never logged (Auth7).
  • Token expiry limits replay window.
  • RS256 + JWKS in production; HS256 only in dev.
  • Per-server token isolation (Auth6).

Residual risk: Medium. Bearer tokens are inherently stealable from memory, process environment, or network (if TLS is terminated incorrectly). Standard OIDC risk.

Recommendations:

  1. Short access token lifetime (5-15 min) to limit replay window.
  2. Consider token binding (DPoP) if the IdP supports it — binds tokens to a cryptographic key.
  3. Audit unusual token usage patterns: same token from different IPs, role escalation within a session.

S-3: Journal impersonation (rogue journal node)

Target: Agent ↔ Journal (TB1).

Threat: Attacker stands up a fake journal node to intercept agent connections, capture CSRs, or serve malicious overlays.

Existing mitigations:

  • Agent validates journal server certificate against CA bundle baked into SquashFS image.
  • mTLS: both sides validate. A fake journal without a valid server cert is rejected.
  • Raft membership changes require quorum agreement.

Residual risk: Low. Requires compromising the CA bundle in the boot image (OpenCHAMI supply chain) or the journal’s ephemeral intermediate CA key.

S-4: AI agent privilege escalation via MCP

Target: MCP server ↔ Journal (TB2).

Threat: AI agent (pact-service-ai) attempts to perform operations beyond its authorized scope — particularly entering emergency mode.

Existing mitigations:

  • P8: AI agents cannot enter/exit emergency mode (enforced by RBAC).
  • MCP server authenticates as pact-service-ai principal with limited write permissions.
  • All operations logged as author: service/ai-agent/<name>.

Residual risk: Low. Policy enforcement is sound. Risk increases if AI agent credentials are leaked or if OPA policies are misconfigured.


T — Tampering

T-1: Journal state corruption

Target: Raft state machine, WAL files.

Threat: Attacker with access to a journal node modifies WAL files, snapshots, or in-memory state to alter configuration history or policy.

Existing mitigations:

  • Raft consensus: writes require majority agreement (J7). Tampering a single node’s WAL is detectable — other replicas hold the correct state.
  • Immutability invariant (J2): entries are append-only, never modified.
  • Overlays are checksummed (J5).

Residual risk: Low, since tampering requires compromising a majority of journal nodes; it rises to High if an attacker does gain root on that majority.

Recommendations:

  1. Encrypt WAL at rest (dm-crypt or filesystem-level encryption).
  2. Signed Raft entries: journal leader signs each entry with its intermediate CA key. Followers and agents can verify entry provenance. This defends against a compromised minority node injecting entries.
  3. Integrity monitoring on /var/lib/pact/journal/ (file hashes, inotify alerts).

T-2: Overlay poisoning during boot stream

Target: Boot config streaming (Phase 1 + Phase 2).

Threat: Man-in-the-middle modifies overlay data in transit, causing nodes to boot with malicious configuration.

Existing mitigations:

  • mTLS protects the stream (TB1).
  • Overlay checksums (J5) detect corruption.

Residual risk: Low. Standard TLS MITM risk.

Recommendations:

  1. Agent should verify overlay checksum after receipt (defense in depth if TLS termination is misconfigured at a load balancer).

T-3: Shell command injection / whitelist bypass

Target: Shell server (pact exec, pact shell).

Threat: Attacker crafts input to escape the restricted shell environment or execute commands outside the whitelist.

Existing mitigations:

  • pact exec: no shell interpretation. Command + args are fork/exec’d directly. Whitelist is checked against the command binary name, not parsed from a string (S1).
  • pact shell: rbash prevents PATH changes, absolute path execution, output redirection (S3). PATH restricted to symlinks in session-specific directory. BASH_ENV="", ENV="" prevent startup injection.
  • Platform admin bypass is logged (S2, S4).
  • Mount namespace (optional) hides sensitive paths.

Residual risk: Medium. rbash has known escape techniques:

  • Exploiting allowed commands that can spawn subshells (e.g., vi, less, man, awk, find -exec). The default whitelist includes grep which is safe, but custom whitelist additions could introduce escapable binaries.
  • LD_PRELOAD or similar environment variables if not scrubbed.
  • Exploiting writable directories to place executables (if any exist in PATH scope).

Recommendations:

  1. Maintain a deny-list of known rbash-escapable binaries (vi, vim, less, man, more, awk, nawk, find, env, perl, python, ruby, lua, ed, ftp, gdb, git) and warn admins if they are added to a vCluster whitelist.
  2. Scrub dangerous environment variables beyond PATH: LD_PRELOAD, LD_LIBRARY_PATH, PYTHONPATH, PERL5LIB, etc.
  3. Consider seccomp profiles for shell sessions to restrict execve to the whitelist at the kernel level (defense in depth beyond rbash).
  4. Audit shell session transcripts for escape attempt patterns.
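
Recommendation 2 amounts to filtering the environment before spawning a session. A sketch, with the deny list mirroring the variables named above (a real implementation would more likely allow-list):

```rust
// Sketch of environment scrubbing: drop loader/interpreter-hijacking
// variables before spawning a shell session. The list is illustrative.
use std::collections::HashMap;

const DANGEROUS: &[&str] = &[
    "LD_PRELOAD", "LD_LIBRARY_PATH", "PYTHONPATH", "PERL5LIB",
    "BASH_ENV", "ENV",
];

fn scrub(env: HashMap<String, String>) -> HashMap<String, String> {
    env.into_iter()
        .filter(|(k, _)| !DANGEROUS.contains(&k.as_str()))
        .collect()
}

fn main() {
    let mut env = HashMap::new();
    env.insert("LD_PRELOAD".to_string(), "/tmp/evil.so".to_string());
    env.insert("TERM".to_string(), "xterm".to_string());
    let clean = scrub(env);
    assert!(!clean.contains_key("LD_PRELOAD"));
    assert!(clean.contains_key("TERM"));
}
```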

T-4: Tampering with drift detection (observer bypass)

Target: State observer (eBPF, inotify, netlink).

Threat: Attacker with root on a compute node disables or evades eBPF probes, inotify watches, or netlink monitoring to make changes invisible to pact.

Existing mitigations:

  • eBPF probes attached at kernel level (require CAP_BPF to detach).
  • Blacklist-based detection: only excluded paths are ignored (D1). Everything else is monitored.
  • Observer health: if an observer crashes or is killed, the agent should detect it.

Residual risk: High (if attacker has root). Root on a compute node can:

  • Detach eBPF programs (bpf() syscall).
  • Kill inotify watches.
  • Modify the agent process directly.

Recommendations:

  1. pact-agent should monitor its own observer health — restart crashed observers and alert if observers are repeatedly killed.
  2. Use eBPF program pinning in bpffs with restricted permissions.
  3. IMA (Integrity Measurement Architecture) on the agent binary to detect tampering.
  4. Periodic “heartbeat” from observers to the agent — silence is treated as compromise.
  5. Accept that root-on-node is a trust boundary: if the attacker has root, they are inside the trust perimeter. Focus on detection (anomalous capability reports, missing observer heartbeats) rather than prevention.

T-5: Policy tampering via OPA sidecar

Target: Journal ↔ OPA (TB4).

Threat: Attacker on a journal node modifies OPA policy bundles or intercepts localhost REST calls to alter authorization decisions.

Existing mitigations:

  • OPA runs on localhost only (not network-accessible).
  • Fall back to built-in RbacEngine if OPA unavailable.

Residual risk: Medium. No authentication on the OPA REST API (TB4 is localhost-only, no auth). A compromised process on the journal node can push arbitrary Rego policies via OPA’s management API.

Recommendations:

  1. OPA authentication token on the localhost endpoint (OPA supports bearer token auth).
  2. Read-only OPA data API — policy bundles loaded from signed files only, management API disabled.
  3. File integrity monitoring on OPA policy bundle directory.
  4. Alternatively, embed OPA as a library (opa-wasm or rego-rs) to eliminate the sidecar attack surface entirely.

R — Repudiation

R-1: Audit log gaps during partition

Target: Audit trail continuity (O3).

Threat: Actions taken during a network partition are not recorded in the immutable journal, allowing an operator to deny their actions.

Existing mitigations:

  • Local logging during partition: all degraded-mode decisions logged locally (A9).
  • Replay on reconnect: local logs replayed to journal for audit continuity.
  • Shell PROMPT_COMMAND captures every executed command (S4).
  • Emergency mode preserves full audit trail (ADR-004).

Residual risk: Low-Medium. If an agent is compromised during partition, local logs can be tampered with before replay. The replay mechanism trusts the agent’s local log integrity.

Recommendations:

  1. Sign local audit entries with the agent’s private key. On replay, the journal verifies signatures — tampered entries are flagged.
  2. Forward local audit entries to a secondary sink (syslog, Loki direct) in addition to journal replay, providing an independent record.
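
The tamper-evidence property behind recommendation 1 can be illustrated with a hash chain over local entries: each entry's digest covers the previous digest, so any in-place edit breaks verification on replay. A real implementation would sign with the agent's private key as recommended; this std-only sketch (with invented types) only demonstrates the chaining.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One local audit entry whose digest chains to the previous entry.
#[derive(Debug)]
struct ChainedEntry {
    payload: String,
    digest: u64,
}

/// Digest covering the previous entry's digest and this payload.
fn digest(prev: u64, payload: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

/// Append an entry, chaining it to the current tail of the log.
fn append(log: &mut Vec<ChainedEntry>, payload: &str) {
    let prev = log.last().map(|e| e.digest).unwrap_or(0);
    let d = digest(prev, payload);
    log.push(ChainedEntry { payload: payload.to_string(), digest: d });
}

/// Verify the chain on replay; returns the index of the first bad entry.
fn verify(log: &[ChainedEntry]) -> Result<(), usize> {
    let mut prev = 0u64;
    for (i, e) in log.iter().enumerate() {
        if digest(prev, &e.payload) != e.digest {
            return Err(i);
        }
        prev = e.digest;
    }
    Ok(())
}
```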

R-2: BMC console access is unaudited by pact

Target: Out-of-band access (PAuth4).

Threat: When pact-agent is down, the break-glass path is BMC/Redfish console. Actions taken via BMC are not captured by pact’s audit trail until the agent recovers and detects drift.

Existing mitigations:

  • Changes made via BMC are detected as “unattributed drift” on agent recovery (F6).
  • BMC consoles typically have their own audit log (Redfish event log).

Residual risk: Medium. There is a temporal gap where actions are unauditable by pact. Attribution depends on the BMC’s own logging (which is outside pact’s control).

Recommendations:

  1. Document that BMC audit logs must be preserved and correlated with pact drift events post-recovery.
  2. Consider forwarding BMC/Redfish event logs to the same Loki instance as pact events for unified audit.

R-3: Platform admin actions without second approval

Target: Platform admin bypass (P6).

Threat: Platform admin performs destructive actions without oversight. Since platform admin is always authorized and bypasses two-person approval, a single compromised platform-admin credential has unchecked power.

Existing mitigations:

  • All platform admin actions are logged (P6).
  • Only 2-3 people per site have this role.
  • Platform admin scope is visible in audit trail.

Residual risk: Medium. No preventive control — detection only. A compromised platform-admin account can do anything.

Recommendations:

  1. Consider requiring two-person approval for platform-admin on regulated vClusters (currently exempt).
  2. Time-bound platform-admin access: use short-lived privilege escalation (e.g., dynamic credentials from the IdP) rather than permanent role assignment.
  3. Anomaly detection on platform-admin activity: alert on unusual hours, unusual volume, unusual target vClusters.

I — Information Disclosure

I-1: Intermediate CA key exposure on journal nodes

Target: Journal intermediate CA signing key.

Threat: Compromise of a journal node exposes the ephemeral intermediate CA key, allowing the attacker to sign arbitrary agent certificates.

Existing mitigations:

  • Key is ephemeral — generated at journal startup, held in memory only, never persisted to disk. Exposure requires runtime memory access to a journal node.
  • Key is on 3-5 journal nodes (not 10,000 agents) — small blast radius.
  • Key is rotated on every journal restart — compromised keys have a limited validity window.
  • Agent private keys are NOT stored on journal nodes (E4).
  • Revoked cert serials tracked in Raft revocation registry, checked on every mTLS connection.

Residual risk: Medium. The ephemeral CA key is still sensitive, but the exposure risk is significantly lower than a persistent key: it exists only in memory, is rotated on restart, and cannot be extracted from disk or backups.

Recommendations:

  1. Store intermediate CA key in HSM or TPM on journal nodes for defense in depth.
  2. Periodic journal restart (e.g., rolling restart weekly) to force key rotation.
  3. Monitor for unexpected certificate issuances via the Raft audit trail.
  4. Consider short-lived intermediate CA certs (hours) with automatic renewal — limits the window even if the key is stolen.

I-2: Config overlay data in transit

Target: Boot config streams, config subscription updates.

Threat: Overlay data may contain sensitive configuration (credentials, API keys, mount credentials for shared filesystems).

Existing mitigations:

  • mTLS encrypts all agent ↔ journal traffic.
  • Config state never leaves the site (F1 federation invariant).

Residual risk: Low (assuming TLS is correctly configured). Medium if overlays contain embedded secrets rather than references to a secret store.

Recommendations:

  1. Document that overlays should reference secret stores (Vault, Kubernetes secrets) rather than embedding credentials directly.
  2. Scan overlay content for patterns matching secrets (API keys, passwords) and warn during pact apply.

I-3: Shell session output exposure

Target: Shell/exec output streaming.

Threat: Shell output (e.g., cat /etc/shadow, env) may contain sensitive data. This data flows over gRPC and is logged in the audit trail.

Existing mitigations:

  • mTLS/TLS encrypts output in transit.
  • Viewer role is read-only; sensitive commands require ops role.
  • Output size limit (10MB default).
  • Shell whitelist controls which commands are available.

Residual risk: Medium. An authorized ops user can intentionally exfiltrate data via shell/exec. The output is logged, making it auditable but not preventable.

Recommendations:

  1. Consider output redaction for known sensitive patterns (tokens, keys) in audit logs — store hash instead of plaintext for sensitive output.
  2. DLP-style alerting: flag exec/shell output containing high-entropy strings or known secret patterns.

I-4: Raft state at rest

Target: WAL files, snapshots on journal nodes.

Threat: Physical access or backup theft exposes all configuration history, policy state, enrollment records, and audit trail.

Existing mitigations:

  • No private key material in Raft state (E4).
  • Journal nodes are management infrastructure (typically physically secured).

Residual risk: Medium. Configuration data, policy rules, admin operation history, and enrollment records are sensitive operational data.

Recommendations:

  1. Encrypt WAL and snapshots at rest (dm-crypt, LUKS, or filesystem encryption).
  2. Encrypt backups before exporting to object storage.
  3. Access control on /var/lib/pact/journal/ — restrict to pact service user.

D — Denial of Service

D-1: Enrollment endpoint flooding

Target: Enrollment endpoint (TB8, unauthenticated).

Threat: Attacker floods the enrollment endpoint with fake enrollment requests, consuming journal CPU (CSR validation, registry lookups) and Raft writes (failed enrollment audit entries).

Existing mitigations:

  • Rate limiting: N enrollments/min (default 100).
  • Enrollment registry gate: unknown identities rejected immediately (minimal CPU).
  • Failed enrollments are logged but may not require Raft writes.

Residual risk: Low-Medium. Rate limiting helps, but a distributed attack from many IPs could still cause load. The registry lookup is fast (HashMap), but audit logging of failures adds overhead.

Recommendations:

  1. Implement connection-level rate limiting (per-IP) in addition to enrollment-level rate limiting.
  2. Consider moving failure audit to async/batch writes rather than per-request journal entries.
  3. Network segmentation: enrollment endpoint accessible only from the PXE/boot VLAN.
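
The per-IP limiting in recommendation 1 is commonly implemented as a token bucket per source address. A minimal std-only sketch, with illustrative capacity and refill rate (not pact defaults):

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::Instant;

/// Per-IP token bucket sketch for the enrollment endpoint.
struct PerIpLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: HashMap<IpAddr, (f64, Instant)>, // (tokens, last refill)
}

impl PerIpLimiter {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, refill_per_sec, buckets: HashMap::new() }
    }

    /// Returns true if the request from `ip` is admitted.
    fn allow(&mut self, ip: IpAddr, now: Instant) -> bool {
        let cap = self.capacity;
        let rate = self.refill_per_sec;
        let (tokens, last) = self.buckets.entry(ip).or_insert((cap, now));
        // Refill proportionally to elapsed time, capped at bucket capacity.
        let elapsed = now.duration_since(*last).as_secs_f64();
        *tokens = (*tokens + elapsed * rate).min(cap);
        *last = now;
        if *tokens >= 1.0 {
            *tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

In practice the map also needs eviction of idle entries, otherwise a distributed attack inflates it without bound.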

D-2: Boot storm amplification

Target: Journal boot config streaming.

Threat: Attacker triggers repeated reboots of large node groups, causing sustained boot storm load on the journal.

Existing mitigations:

  • Boot config reads do not go through Raft (served from local state, J8).
  • Read replicas (learners) absorb load.
  • max_concurrent_boot_streams limit (default 15,000).
  • Overlay caching prevents per-boot recomputation.

Residual risk: Low. The architecture handles 10,000+ concurrent boots by design (F11). Triggering reboots requires admin access (reboot is delegated to OpenCHAMI/Manta).

D-3: Shell session exhaustion

Target: Shell server on agent.

Threat: Attacker opens maximum concurrent shell sessions, exhausting PTY allocation and agent resources.

Existing mitigations:

  • Configurable max concurrent sessions.
  • Session-level cgroup resource limits.
  • Stale session cleanup (Closing state timeout).
  • Each session requires OIDC authentication + RBAC authorization.

Residual risk: Low. Requires valid credentials. Legitimate risk if an authorized account is compromised.

Recommendations:

  1. Per-identity session limits (not just per-node total).
  2. Alert on unusual session counts from a single identity.

D-4: Raft leader overload via write amplification

Target: Journal Raft leader.

Threat: Attacker (with valid credentials) submits high-volume write operations (rapid exec commands, frequent config changes) to overload the Raft leader.

Existing mitigations:

  • Operations require authentication + authorization.
  • Commit window model limits config change frequency.
  • Raft leader failover on crash (F8).

Residual risk: Low-Medium. An authorized ops user could submit rapid exec commands that each generate Raft entries (ExecLog). The per-command audit write could be a bottleneck under sustained load.

Recommendations:

  1. Rate limit write operations per identity per vCluster.
  2. Batch audit entries: buffer exec logs and flush periodically rather than one Raft write per exec.
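
The batching idea in recommendation 2 amounts to buffering entries and emitting one combined Raft proposal when the batch fills or an interval elapses. A sketch with invented thresholds and types:

```rust
use std::time::{Duration, Instant};

/// Buffer audit entries; flush one combined batch instead of one Raft
/// write per exec. Thresholds are illustrative, not pact defaults.
struct AuditBatcher {
    buf: Vec<String>,
    max_batch: usize,
    interval: Duration,
    last_flush: Instant,
}

impl AuditBatcher {
    fn new(max_batch: usize, interval: Duration, now: Instant) -> Self {
        Self { buf: Vec::new(), max_batch, interval, last_flush: now }
    }

    /// Queue an entry; returns Some(batch) when a flush is due.
    fn push(&mut self, entry: String, now: Instant) -> Option<Vec<String>> {
        self.buf.push(entry);
        let due = self.buf.len() >= self.max_batch
            || now.duration_since(self.last_flush) >= self.interval;
        if due {
            self.last_flush = now;
            Some(std::mem::take(&mut self.buf))
        } else {
            None
        }
    }
}
```

The trade-off is a small window of audit entries that exist only in memory on the leader; a crash-safe variant would stage the buffer on local disk first.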

E — Elevation of Privilege

E-1: Viewer role escalation via cached policy

Target: Degraded-mode RBAC (ADR-011).

Threat: During a network partition, a viewer-role user exploits cached policy to perform operations that would be denied by the full policy engine.

Existing mitigations:

  • Cached RBAC is conservative: viewers remain read-only in cached mode (P2, P7).
  • Two-person approval fails closed during partition.
  • Complex OPA rules fail closed during partition.
  • All degraded-mode decisions logged locally and replayed.

Residual risk: Low. The tiered fail-closed strategy (ADR-011) is well-designed. Risk exists only if the cached role bindings are stale (e.g., a recently-revoked user still appears in cache).

Recommendations:

  1. Short cache TTL for role bindings (e.g., 5 minutes) — agent drops cached authorization if it hasn’t refreshed within the TTL.
  2. On reconnect, replay degraded-mode decisions to the journal. If any decision would now be denied by the full policy engine, generate a retroactive alert.
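
The TTL check from recommendation 1, combined with the viewers-stay-read-only invariant, can be sketched as a single degraded-mode gate. The types and role-prefix convention here are illustrative, not pact's actual code:

```rust
use std::time::{Duration, Instant};

/// A cached role binding as the agent might hold it during a partition.
struct CachedBinding {
    role: String,
    refreshed_at: Instant,
}

/// Degraded-mode authorization sketch: honor the cache only within the
/// TTL, and keep viewers read-only regardless (P2/P7). Fails closed.
fn authorize_degraded(b: &CachedBinding, ttl: Duration, now: Instant, write_op: bool) -> bool {
    if now.duration_since(b.refreshed_at) > ttl {
        return false; // stale cache: drop authorization entirely
    }
    if write_op {
        // Only ops-style roles may write in cached mode.
        b.role.starts_with("pact-ops-")
    } else {
        true
    }
}
```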

E-2: Compromised supervised service escaping cgroup

Target: Process supervisor, cgroup isolation (TB7).

Threat: A service supervised by pact-agent (e.g., lattice-node-agent, a custom service) escapes its cgroup, gains access to agent resources, or interferes with other services.

Existing mitigations:

  • cgroup v2 isolation: per-service cgroup under pact.slice.
  • Memory limits, CPU quotas per service.
  • pact-agent sets PR_SET_CHILD_SUBREAPER.

Residual risk: Medium. cgroup escapes exist (kernel vulnerabilities). A compromised service with root inside its cgroup could potentially:

  • Manipulate the cgroup filesystem.
  • Signal the pact-agent process.
  • Access shared tmpfs (capability manifest, unix sockets).

Recommendations:

  1. Use user namespaces where possible to run services as non-root.
  2. Seccomp profiles per supervised service (restrict cgroup-related syscalls).
  3. Mount the cgroup filesystem read-only for supervised services.
  4. Consider running supervised services in mount namespaces to limit filesystem visibility.

E-3: Emergency mode abuse for extended privilege window

Target: Emergency mode lifecycle.

Threat: An authorized ops admin enters emergency mode not for a genuine emergency, but to extend the commit window and suppress auto-rollback, allowing persistent unauthorized changes.

Existing mitigations:

  • Emergency mode does NOT expand the whitelist (A10).
  • Emergency mode does NOT bypass RBAC (ADR-004).
  • Emergency mode does NOT suppress audit logging.
  • Stale emergency detection + alerting (F4).
  • Reason field is required and logged.

Residual risk: Low-Medium. Emergency mode legitimately extends the operational window. Abuse is detectable (reason field, duration, actions taken) but not preventable — it’s a trade-off for operational flexibility.

Recommendations:

  1. Alert on emergency mode entries with vague reasons.
  2. Require manager/peer approval for emergency mode on regulated vClusters (extend two-person approval to emergency entry).
  3. Post-incident review: all emergency sessions should be reviewed as part of operational practice.

E-4: OPA policy injection for privilege escalation

Target: OPA sidecar on journal nodes (TB4).

Threat: Attacker with access to a journal node pushes a Rego policy that grants elevated permissions (e.g., makes all roles equivalent to platform-admin).

Existing mitigations:

  • OPA is localhost-only.
  • Fallback to built-in RbacEngine if OPA is unavailable.
  • Federated policies come from Sovra (signing is addressed in the recommendations below).

Residual risk: High (if journal node is compromised). OPA management API has no authentication (TB4). A compromised process can push arbitrary policies.

Recommendations:

  1. Disable OPA management API. Load policies from signed bundle files only.
  2. Policy bundles signed by Sovra or a site-local signing key. OPA verifies signature before loading.
  3. Built-in RBAC invariants (P1-P8) should be enforced in pact-policy code as a floor — OPA can add restrictions but should not be able to relax core invariants. This makes the built-in RbacEngine the security baseline, not a fallback.

Summary: Risk Heat Map

| Threat | STRIDE | Severity | Residual Risk | Priority |
|---|---|---|---|---|
| S-1: Rogue node enrollment | Spoofing | High | Medium | Enable TPM attestation |
| S-2: OIDC token theft | Spoofing | High | Medium | Short token lifetime, DPoP |
| S-3: Journal impersonation | Spoofing | High | Low | Existing controls sufficient |
| S-4: AI privilege escalation | Spoofing | Medium | Low | Existing controls sufficient |
| T-1: Journal state corruption | Tampering | Critical | Low | WAL encryption, entry signing |
| T-2: Overlay MITM | Tampering | High | Low | Checksum verification at agent |
| T-3: Shell injection / rbash escape | Tampering | High | Medium | Deny-list escapable binaries, seccomp |
| T-4: Observer bypass (root) | Tampering | High | High | Detection-focused (accept trust boundary) |
| T-5: OPA policy tampering | Tampering | Critical | Medium | Disable mgmt API, signed bundles |
| R-1: Audit gaps during partition | Repudiation | Medium | Low-Medium | Sign local audit entries |
| R-2: BMC unaudited access | Repudiation | Medium | Medium | Correlate BMC logs |
| R-3: Platform admin unchecked | Repudiation | High | Medium | Time-bound escalation |
| I-1: Intermediate CA key exposure | Info Disclosure | High | Medium | Ephemeral key (memory-only), HSM for defense in depth |
| I-2: Config data in transit | Info Disclosure | Medium | Low | No embedded secrets in overlays |
| I-3: Shell output exposure | Info Disclosure | Medium | Medium | Output redaction in audit |
| I-4: Raft state at rest | Info Disclosure | Medium | Medium | Encryption at rest |
| D-1: Enrollment endpoint flood | DoS | Medium | Low-Medium | Network segmentation |
| D-2: Boot storm amplification | DoS | Medium | Low | By-design resilience |
| D-3: Shell session exhaustion | DoS | Low | Low | Per-identity limits |
| D-4: Raft write amplification | DoS | Medium | Low-Medium | Rate limiting, batch audit |
| E-1: Cached policy escalation | EoP | High | Low | Short cache TTL |
| E-2: cgroup escape | EoP | High | Medium | User namespaces, seccomp |
| E-3: Emergency mode abuse | EoP | Medium | Low-Medium | Peer approval for regulated |
| E-4: OPA policy injection | EoP | Critical | High | Signed bundles, invariant floor |

Implemented Security Controls

JWT Validation in Enrollment Service (F9/F10)

The enrollment service validates JWT tokens on authenticated endpoints, rejecting expired, malformed, or unsigned tokens before processing enrollment requests.

Per-IP Rate Limiting (F12)

Rate limiting on the enrollment endpoint is enforced per source IP address, preventing a single attacker from exhausting the global rate budget.

Self-Approval Prevention at Raft Layer (F24)

Two-person approval enforcement is checked at the Raft state machine level, not just the API layer. A user cannot approve their own pending request even by crafting direct Raft proposals.

Wildcard Bindings Excluded from Emergency Access (F20)

Wildcard role bindings (e.g., pact-ops-*) do not grant emergency mode access. Emergency mode requires an explicit, non-wildcard role binding for the target vCluster.

AI Exec Scoping (F16)

AI agent exec operations require the ai_exec_allowed policy field to be set to true on the target vCluster. This defaults to false, preventing AI agents from executing commands unless explicitly authorized by policy.

Top 5 Hardening Priorities

  1. Intermediate CA key protection (I-1): Key is already ephemeral (memory-only). HSM provides defense in depth. Periodic rolling restarts force key rotation.

  2. OPA sidecar hardening (T-5, E-4): Disable management API, signed policy bundles, enforce built-in RBAC as invariant floor. Currently, localhost access to OPA is equivalent to policy bypass.

  3. TPM attestation for enrollment (S-1): Close the hardware identity spoofing gap. Already designed as optional — make it a deployment recommendation for production.

  4. Shell environment hardening (T-3): Maintain deny-list of rbash-escapable binaries, scrub dangerous environment variables, add seccomp as defense in depth.

  5. Encryption at rest (I-4, T-1): WAL, snapshots, and backups should be encrypted. Configuration history and audit trails are sensitive operational data.

Security Architecture

Authentication

OIDC / JWT

  • All API calls authenticated via Bearer JWT tokens in gRPC metadata
  • Development: HS256 with shared secret
  • Production: RS256 with JWKS endpoint (auto-refreshed, 1hr cache)
  • Token claims: sub (principal), pact_role (authorization role)

Machine Identity (mTLS)

  • Agent-to-journal: mutual TLS with X.509 certificates
  • Certificate fields: tls_cert, tls_key, tls_ca in agent config
  • Journal validates client certificate against CA bundle
  • Agent validates journal certificate against CA bundle

Authorization (RBAC)

Role Model

| Role | Scope | Permissions |
|---|---|---|
| pact-platform-admin | Global | Full access, whitelist bypass (S2) |
| pact-ops-{vcluster} | Per-vCluster | Commit, rollback, exec, shell, service mgmt |
| pact-viewer-{vcluster} | Per-vCluster | Read-only: status, log, diff, read-only exec |
| pact-regulated-{vcluster} | Per-vCluster | Like ops, but requires two-person approval (P4) |
| pact-service-agent | Machine | Agent mTLS identity |
| pact-service-ai | Machine | MCP tools, no emergency mode (P8) |

Policy Invariants

  • P1: Identity required on all requests
  • P2: Viewers read-only
  • P3: Role scoped to correct vCluster
  • P4: Regulated roles require two-person approval
  • P6: Platform admin always authorized
  • P8: AI agents cannot enter/exit emergency mode

OPA Integration (ADR-003)

  • Rego policies co-located on journal nodes
  • Evaluated via localhost REST (http://localhost:8181/v1/data/pact/authz)
  • Federation: policy templates synced via Sovra
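
For illustration, a minimal Rego policy of the shape such templates might take. The package name matches the `/v1/data/pact/authz` path above, but the rule bodies and input fields (`input.identity.role`, `input.vcluster`, `input.action`) are invented for this sketch:

```rego
package pact.authz

# Deny by default; every rule below is an explicit grant.
default allow = false

# P6: platform admin is always authorized.
allow {
    input.identity.role == "pact-platform-admin"
}

# P3: ops role must match the request's vCluster.
allow {
    input.identity.role == sprintf("pact-ops-%s", [input.vcluster])
    write_actions[input.action]
}

# P2: viewers are read-only.
allow {
    input.identity.role == sprintf("pact-viewer-%s", [input.vcluster])
    read_actions[input.action]
}

write_actions = {"commit", "rollback", "exec", "shell"}
read_actions = {"status", "log", "diff"}
```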

Shell Security (ADR-007)

No SSH

  • pact shell replaces SSH for all admin access
  • BMC/Redfish console is the only out-of-band fallback

Whitelist Enforcement

  • Commands restricted via PATH symlinks (not command parsing)
  • Session-specific directory: /run/pact/shell/{session_id}/bin/
  • 37 default whitelisted commands (ps, top, nvidia-smi, etc.)
  • State-changing commands classified (systemctl, modprobe → true)
  • Learning mode captures denied commands for review
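
Because the whitelist is enforced via PATH symlinks rather than command parsing, a session setup reduces to a symlink plan: one link per whitelisted command in the session bin directory. A sketch of computing that plan (the directory layout follows the path above; the function is illustrative):

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

/// Map each whitelisted command to the symlink that should be created in
/// the session bin directory, pointing at the real binary. PATH for the
/// session then contains only this directory.
fn symlink_plan(session_id: &str, whitelist: &[(&str, &str)]) -> BTreeMap<PathBuf, PathBuf> {
    let bin_dir = PathBuf::from(format!("/run/pact/shell/{session_id}/bin"));
    whitelist
        .iter()
        .map(|(name, target)| (bin_dir.join(name), PathBuf::from(target)))
        .collect()
}
```

On Linux the plan would then be materialized with `std::os::unix::fs::symlink` and the directory removed on session close.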

PTY Isolation

  • Restricted bash (rbash) prevents PATH modification
  • BASH_ENV="", ENV="" prevent startup file injection
  • HOME=/tmp prevents home directory access
  • PROMPT_COMMAND logs every command for audit
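
The environment constraints above can be sketched by building the session environment explicitly rather than inheriting the agent's. Values follow the list above; the rbash path is an assumption:

```rust
use std::collections::HashMap;

/// Build the restricted PTY environment from scratch so nothing leaks
/// from the agent process into the session.
fn restricted_env(session_bin: &str) -> HashMap<String, String> {
    HashMap::from([
        ("PATH".to_string(), session_bin.to_string()),   // whitelist dir only
        ("BASH_ENV".to_string(), String::new()),         // no startup file injection
        ("ENV".to_string(), String::new()),
        ("HOME".to_string(), "/tmp".to_string()),        // no home directory access
        ("SHELL".to_string(), "/bin/rbash".to_string()), // assumed rbash location
    ])
}
```

Paired with `Command::env_clear()` before applying this map, the spawned shell sees exactly these variables and nothing else.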

Audit Trail

  • Every operation logged as a ConfigEntry in the journal
  • Immutable Raft-committed log
  • EntryTypes: ExecLog, ShellSession, ServiceLifecycle, EmergencyStart/End
  • Regulated vClusters: 7-year retention (audit_retention_days=2555)

Failure Modes and Recovery

Journal Quorum Failures

Single Node Failure

  • Impact: Quorum maintained (2/3 or 3/5 nodes)
  • Detection: Raft heartbeat timeout (1.5-3s)
  • Recovery: Automatic leader re-election, failed node rejoins on restart
  • Data: No data loss — committed entries are replicated

Quorum Loss (Majority Down)

  • Impact: No writes accepted, reads still served from local state
  • Detection: Raft cannot elect leader
  • Recovery: Restore majority of nodes, cluster auto-recovers
  • Agent behavior: Runs in disconnected mode, buffers events

Split Brain

  • Impact: Impossible by Raft design (majority required for writes)
  • Detection: Minority partition detects it’s not leader
  • Recovery: Automatic on network heal

Agent Failures

Agent Cannot Connect to Journal

  • Impact: No config updates, no audit logging
  • Detection: Connection timeout, subscription backoff
  • Recovery: Exponential backoff reconnect (1s base, 60s max, 100 attempts)
  • Behavior: Agent continues with cached config in observe-only mode

Agent Crash (pact as init)

  • Impact: All supervised services orphaned
  • Detection: systemd/PID 1 watchdog (if configured)
  • Recovery: Agent restart re-reads state, re-supervises services
  • Data: Capability report regenerated on boot

Drift Detection False Positive

  • Impact: Unnecessary commit window opened
  • Detection: Admin reviews drift via pact diff
  • Recovery: Add path to blacklist patterns, drift resets on commit

Network Failures

Agent-Journal Partition

  • Impact: Config subscription disconnected
  • Detection: gRPC stream error
  • Recovery: Reconnect with from_sequence (at-least-once delivery)
  • Conflict resolution: Journal-wins after grace period (ConflictManager)

Inter-Journal Partition

  • Impact: Raft replication paused for minority side
  • Detection: Raft log divergence
  • Recovery: Automatic reconciliation on heal, minority replays missed entries

Emergency Mode Failures

Emergency Mode Stuck

  • Impact: Extended commit window, reduced automation
  • Detection: Stale emergency detection (expiry without resolution)
  • Recovery: Platform admin force-end (pact emergency end --force)
  • Audit: All emergency actions logged regardless of mode

Emergency Mode Unauthorized Entry

  • Impact: Blocked by RBAC (P8: AI agents cannot enter)
  • Detection: PolicyService evaluation returns Deny
  • Recovery: Human admin must initiate emergency mode

Observability

Design: No agent-level Prometheus scraping

Three channels:

  1. Journal server metrics → Prometheus → Grafana (3-5 scrape targets)
  2. Config + admin events → Journal → Loki → Grafana (event stream)
  3. Agent process health → lattice-node-agent eBPF → existing Prometheus

Journal Metrics Endpoint

Each pact-journal server exposes a Prometheus metrics endpoint via axum (HTTP, default port 9091 — avoids conflict with Prometheus server default on 9090). Metrics include:

  • pact_raft_leader (gauge): 1 if this node is the Raft leader
  • pact_raft_term (gauge): current Raft term
  • pact_raft_log_entries (gauge): total log entries
  • pact_raft_replication_lag (gauge): entries behind leader, per follower
  • pact_journal_entries_total (counter): total config entries appended
  • pact_journal_boot_streams_active (gauge): concurrent boot config streams
  • pact_journal_boot_stream_duration_seconds (histogram): boot stream latency
  • pact_journal_overlay_builds_total (counter): overlay pre-computation events

Health check endpoint: GET /health returns 200 if Raft is healthy.
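
Since only the 3-5 journal nodes are scraped, the Prometheus side stays a small static list. A sketch of the corresponding scrape configuration (hostnames are invented, site-specific):

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: pact-journal
    static_configs:
      - targets:
          - journal-1.mgmt:9091   # 3-5 journal nodes, never per-node agents
          - journal-2.mgmt:9091
          - journal-3.mgmt:9091
```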

Grafana Dashboards

  • Fleet Configuration Health: drift heatmap, commit activity, boot performance
  • Admin Operations: exec/shell session frequency, command whitelist violations
  • Emergency Sessions: active, duration, stale alerts
  • Journal Health: Raft quorum, log growth, replication lag

Alerting

Critical: quorum loss, stale emergency
Warning: high drift rate, slow boot config, policy auth failures, GPU degradation

Agentic API (MCP Tool-Use)

MCP server wrapping pact gRPC API for Claude Code-style AI agent integration. Authenticates as pact-service-ai principal with scoped permissions.

Tools

  • pact_status: node/vCluster state query
  • pact_diff: drift details
  • pact_commit: commit pending changes
  • pact_apply: apply config spec
  • pact_rollback: revert to previous state
  • pact_log: query history
  • pact_exec: run diagnostic command on node
  • pact_cap: node hardware capability report
  • pact_query_fleet: fleet-wide health query
  • pact_emergency: start/end emergency (typically restricted to human admins)
  • pact_service_status: query service health across nodes

Security

  • Service principal with limited write permissions
  • Read operations broadly permitted
  • Write operations require explicit policy authorization
  • Emergency mode typically restricted to human admin principals
  • All operations logged as author: service/ai-agent/

Example: AI Agent Investigating GPU Failures

1. pact_query_fleet(vcluster="ml-training", capability_filter="gpu_health=degraded")
   → 3 nodes with degraded GPUs

2. pact_exec(node="node042", command="nvidia-smi -q -d ECC")
   → ECC error details

3. pact_log(scope="node042", entry_types=["capability_change"])
   → degradation history

4. pact_apply(scope="ml-training", config={...}, message="auto-remediation")
   → applied to all nodes, policy authorized

Supercharged Command Tools

Read-only cross-system views exposed as MCP tools, delegating to the lattice scheduler via DelegationConfig:

| MCP Tool | CLI Equivalent | Description |
|---|---|---|
| pact_jobs_list | pact jobs list | List running allocations with node/vCluster filters |
| pact_queue_status | pact queue | Queue depth and scheduling status per vCluster |
| pact_cluster_health | pact cluster | Combined pact journal + lattice Raft health |
| pact_system_health | pact health | Combined health check across pact and lattice |
| pact_accounting | pact accounting | Resource usage (CPU/GPU hours, storage) per tenant |
| pact_services_list | pact services list | List services from lattice service registry |
| pact_services_lookup | pact services lookup | Look up a specific service by name in lattice registry |

These tools require PACT_LATTICE_ENDPOINT (and optionally PACT_LATTICE_TOKEN) to be set. Without a lattice connection, they return descriptive error messages.

Write commands (pact jobs cancel) remain human-only unless explicitly authorized via policy. pact audit is useful but may expose sensitive data and should be scoped carefully.

Future MCP Tool Candidates

| MCP Tool | CLI Equivalent | Description |
|---|---|---|
| pact_diag | pact diag | Read-only fleet diagnostic log retrieval. Server-side grep + line limit. Natural fit for AI-driven incident triage — agent can collect logs across a vCluster, grep for error patterns, and correlate with capability/drift data without requiring exec privileges. |

Federation Model (via Sovra)

Principle

Configuration state is site-local. Policy templates are federated.

Policy Language

OPA/Rego (see ADR-003). Sovra uses OPA natively, so federated policy templates are authored in Rego and distributed without translation. This is the primary reason OPA was chosen over Cedar.

What federates

  • Rego policy templates (regulated workload requirements) via Sovra mTLS
  • Compliance reports (drift/audit summaries) via Sovra attestation
  • Policy attestation (cryptographic proof of policy conformance)

What stays site-local

  • Configuration state (mounts, services, sysctl values)
  • Drift events, admin sessions, commit history
  • Capability reports (only meaningful to local scheduler)
  • Shell/exec session logs
  • OPA data (role mappings, vCluster-specific policy overrides)

Federation Sync

pact-policy syncs Rego templates from Sovra on a configurable interval (default 300s). Templates are stored locally and loaded into OPA as bundles. Site-local data (role mappings, vCluster config) is pushed to OPA separately and never leaves the site.

Consistent with lattice

Lattice scheduler is site-local. Federation enables cross-site job submission. pact follows the same boundary: config is local, policy is federated.

Testing Strategy

Five Levels

Level 1: Unit Tests (in-module)

Located in #[cfg(test)] modules within source files. Test critical paths: config deserialization, state machines, drift computation, serialization roundtrips.

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn drift_magnitude_zero_when_no_drift() { ... }
}
```

Level 2: Integration Tests (crate-level)

Located in crates/*/tests/. Use builders from pact-test-harness for setup.

```rust
use pact_test_harness::fixtures::ConfigEntryBuilder;

#[tokio::test]
async fn config_entry_roundtrip() {
    let entry = ConfigEntryBuilder::new()
        .vcluster("ml-training")
        .author("admin@example.org")
        .build();
    // ...
}
```

Level 3: BDD Acceptance Tests (pact-acceptance)

584 scenarios across 32 feature files using the cucumber crate. Covers all bounded contexts: boot config streaming, drift detection → commit/rollback, shell session lifecycle, emergency mode, enrollment, RBAC, policy evaluation, overlay management, partition resilience, identity mapping, workload integration, node management delegation, and 22 cross-context integration scenarios.

Runs on all platforms — uses real domain logic (JournalState, DriftEvaluator, CommitWindowManager, PactSupervisor) but stubs OS-level interactions.

Level 4: E2E Container Tests (pact-e2e)

Integration tests using testcontainers with real services:

  • Raft cluster: 3-node in-process cluster (consensus, failover, replication)
  • OPA: Real Rego policy evaluation via OPA container
  • Keycloak: Real OAuth2/OIDC flows (discovery, credentials, password, refresh, JWKS)
  • Loki: Structured event forwarding
  • Prometheus: Metrics scraping
  • SPIRE: Identity cascade and SVID acquisition
  • Linux privileged: Real cgroup/namespace tests (requires root)
  • Full CLI E2E: All CLI commands against real journal + agent gRPC

Level 5: Fidelity & Adversary Sweeps

Automated quality assessment (not runtime tests):

  • Fidelity sweep: measures assertion depth per scenario (THOROUGH/MODERATE/SHALLOW/STUB)
  • Adversary sweep: systematic security review across attack surfaces
  • Results in specs/fidelity/ and specs/findings/

Cross-Platform Testing Strategy

Three tiers matching the development model:

  1. macOS (local dev): Unit tests + BDD acceptance + e2e containers (Docker). Feature-gated Linux-only code compiles as stubs. Mocks (MockGpuBackend, MockCpuBackend, etc.) simulate hardware detection for testing.

  2. CI (GitHub Actions): 4-stage pipeline: fmt/clippy/deny → feature checks → test/BDD/e2e/linux-privileged → coverage. Runs on Linux runners.

  3. Devcontainer (Linux): Full integration + acceptance + chaos tests. Real PactSupervisor with cgroup v2, real eBPF probes, real PTY allocation. BDD/cucumber scenarios run here. CI uses this for release gates.

Test Infrastructure

pact-test-harness

Shared crate (publish = false) providing:

  • Fixtures: ConfigEntryBuilder, ServiceDeclBuilder — fluent builders
  • Mocks: MockJournalClient, MockPolicyEngine, MockSupervisor, MockObserver, MockGpuBackend; each records calls via Arc<Mutex<Vec<MockCall>>>

Conventions

  • Unit tests: #[test] or #[tokio::test]
  • Slow tests: marked #[ignore], run with just test-all
  • Feature-gated tests: #[cfg(feature = "ebpf")] etc.
  • Linux-only tests: #[cfg(target_os = "linux")]
  • All mocks use async_trait and record calls for assertions
  • Property tests with proptest for type invariants
  • Coverage target: >= 80% per crate

CI Pipeline

On every commit:
  cargo fmt --check
  cargo clippy --all-targets
  cargo nextest run (Level 1 + Level 2)
  cargo deny check

On release:
  All levels must pass (including devcontainer Level 3 + Level 4)
  Coverage report → codecov

Running Tests

just test          # Fast: unit + integration (skips #[ignore])
just test-all      # Full: includes slow tests
just test-slow     # Only slow tests
just test-linux    # Linux-only tests (in devcontainer)
just test-accept   # BDD acceptance tests (in devcontainer)

ADR-001: Raft Quorum Deployment Modes

Status: Accepted (supersedes original ADR-001)

Context

pact-journal needs a consensus mechanism for its immutable configuration log. Lattice already runs a Raft quorum for node ownership and sensitive audit data.

Both systems use the same Raft foundation (raft-hpc-core, which wraps openraft with HPC-specific state machine abstractions) and target the same management infrastructure nodes (3-5 nodes in the management VLAN).

In the boot sequence, pact comes first: pact-agent is the init system on compute nodes, and pact-journal must be running before lattice-server starts. This means pact’s Raft quorum is established infrastructure by the time lattice needs consensus.

Decision

Support two deployment modes for pact-journal’s Raft quorum. In both modes, pact and lattice maintain independent Raft groups with separate state machines, separate log compaction, and separate snapshots. The groups never share consensus — only infrastructure.

Mode 1: Standalone (default)

Pact-journal runs its own Raft cluster on dedicated nodes.

pact-journal-1 ─┐
pact-journal-2 ─┤ pact Raft group
pact-journal-3 ─┘

lattice-server-1 ─┐
lattice-server-2 ─┤ lattice Raft group
lattice-server-3 ─┘
  • 6-10 quorum nodes total (3-5 per system)
  • Fully independent failure domains
  • Recommended for: large sites (>5k nodes), regulated environments, sites that require independent maintenance windows

Mode 2: Co-located

Pact-journal and lattice-server run on the same management nodes, each with its own Raft group. Pact-journal is the incumbent — it is already running when lattice starts. Lattice discovers pact’s quorum nodes and deploys alongside them.

mgmt-node-1: pact-journal (Raft group A, port 9444) + lattice-server (Raft group B, port 9000)
mgmt-node-2: pact-journal (Raft group A, port 9444) + lattice-server (Raft group B, port 9000)
mgmt-node-3: pact-journal (Raft group A, port 9444) + lattice-server (Raft group B, port 9000)
  • 3-5 quorum nodes total (shared between both systems)
  • Independent Raft groups on the same nodes (separate ports, state, logs)
  • Pact quorum is primary infrastructure; lattice joins existing nodes
  • Hardware failure takes out both systems on that node (acceptable: Raft tolerates minority failure, and both groups lose the same node simultaneously)
  • Recommended for: most sites, operational simplicity

How co-location works

Pact side (no changes needed):

  • pact-journal starts normally on management nodes
  • Exposes its quorum node addresses in its config and via a discovery endpoint
  • Listens on its own Raft port (default: 9444) and gRPC port (default: 9443)

Lattice side (configuration option):

  • Lattice config gains an optional pact_journal_endpoints field
  • When set, lattice-server deploys its Raft group on the same nodes as pact-journal
  • Lattice uses its own ports (default: Raft 9000, gRPC 50051, REST 8080)
  • Lattice’s quorum config (peers) points to the same hostnames as pact’s journal endpoints, but with lattice’s Raft port

Example lattice production config (co-located):

quorum:
  node_id: 1
  raft_listen_address: "0.0.0.0:9000"
  peers:
    - id: 2
      address: "mgmt-02:9000"    # same host as pact-journal-2
    - id: 3
      address: "mgmt-03:9000"    # same host as pact-journal-3

There is no protocol-level integration. Co-location is purely an infrastructure decision — two independent processes sharing the same physical/virtual nodes.

What is NOT shared

  • Raft consensus: each system has its own leader election, log, and state machine
  • State machine: pact’s JournalState and lattice’s GlobalState are independent
  • WAL/snapshots: separate data directories (/var/lib/pact/journal vs /var/lib/lattice/raft)
  • Ports: each system listens on its own ports
  • Failure recovery: each group recovers independently (a pact leader failover does not trigger a lattice leader failover)

Trade-offs

Standalone

  • (+) Independent failure domains — pact outage doesn’t affect lattice and vice versa
  • (+) Independent maintenance windows
  • (+) Simpler mental model (no shared infrastructure)
  • (-) More nodes to operate (6-10 vs 3-5)
  • (-) More infrastructure cost

Co-located

  • (+) Fewer nodes (3-5 vs 6-10)
  • (+) Single set of management nodes to monitor and maintain
  • (+) Natural fit: both systems target the same management infrastructure
  • (+) pact is already there (init system), lattice joins naturally
  • (-) Shared hardware failure domain (mitigated by Raft’s majority quorum)
  • (-) Shared maintenance windows (reboot affects both)
  • (-) Resource contention possible under heavy load (mitigated by low Raft I/O)

Consequences

  • pact-journal does not need to know about lattice at all — it just runs its Raft group
  • Lattice’s deployment guide documents co-located mode as an option
  • Monitoring should track both Raft groups independently regardless of deployment mode
  • The pact production config includes quorum node addresses that lattice can reference
  • No code changes needed for co-location — it’s a deployment/configuration decision

Revisit

If a future requirement demands cross-system transactions (e.g., atomic “commit config + drain node”), a shared Raft group with namespaced commands could be considered. Current design does not require this.

ADR-002: Blacklist-First Drift Detection with Observe-Only Bootstrap

Status: Accepted

Decision

Monitor all system state changes by default. Blacklist known-safe operational changes. Initial deployment in observe-only mode: detect and log everything, enforce nothing. Build empirical blacklist from real traffic before enabling enforcement.

Default blacklist: /tmp/, /var/log/, /proc/, /sys/, /dev/, /run/user/

Transition to enforcement per-vCluster:

enforcement_mode = "observe"  # then "enforce"

ADR-003: Policy Engine Choice — OPA/Rego

Status: Accepted

Context

pact-policy needs a policy evaluation engine for authorization decisions: who can commit, exec, shell, start emergency mode, etc. Two candidates evaluated:

  • OPA/Rego: mature, widely adopted, REST API, used by Sovra for federation
  • Cedar: newer (AWS), Rust-native, strongly typed, no REST overhead

Decision

OPA/Rego as a sidecar process. pact-policy calls OPA via REST on localhost.

Rationale

  1. Sovra compatibility: Sovra uses OPA. Federated policy templates need a shared language. Using OPA means pact and Sovra speak the same policy format — Rego templates authored once, federated across sites without translation.

  2. Sidecar model: OPA runs as a separate process alongside pact-journal or pact-policy. pact calls http://localhost:8181/v1/data/pact/<decision> via reqwest. This keeps pact’s Rust codebase free of policy language interpreters.

  3. Not on the hot path: Policy evaluation happens on admin operations (exec, commit, shell session start) — not on every boot config read or heartbeat. The REST overhead (~1ms localhost) is negligible.

  4. Operational maturity: OPA has established patterns for testing policies (opa test), debugging (opa eval), and distributing bundles (OPA bundles API).

Deployment

OPA runs on journal/policy nodes alongside pact-journal, not on compute nodes.

Policy evaluation flow for admin operations:

CLI → pact-agent (ExecRequest/ShellSessionRequest via gRPC)
  → agent calls PolicyService.Evaluate() on pact-policy node
  → pact-policy calls OPA via localhost REST
  → OPA evaluates Rego rules → decision returned to agent
  → agent enforces decision (proceed or deny)

The agent is the entry point for exec/shell (it owns the PTY and process execution), but delegates authorization to the policy service via gRPC. pact-policy is a library crate linked into the pact-journal binary — PolicyService runs in-process with the journal, not as a separate deployment.

pact-journal node:
  pact-journal (port 9443/9444)
  opa (port 8181, localhost only)
    - loads policy bundles from /etc/pact/policies/ or Sovra sync
    - data: pact pushes current state as OPA data

OPA lifecycle on journal nodes depends on the deployment model:

  • systemd deployments: OPA runs as a systemd service alongside pact-journal
  • pact-managed deployments: OPA is declared as a supervised service in the management node’s service declarations, managed by PactSupervisor like any other service
  • Container deployments: OPA runs as a sidecar container

In all cases, OPA is co-located with pact-journal/pact-policy on management nodes. Compute nodes do not run OPA — they enforce the authorization decisions received from the policy layer.

Policy Caching and Partition Resilience

pact-agent receives VClusterPolicy as part of the boot config overlay (Phase 1). This cached policy contains the data needed for local authorization decisions:

  • role_bindings: OIDC role → allowed actions mapping
  • exec_whitelist / shell_whitelist: allowed commands
  • regulated, two_person_approval: enforcement flags

Normal operation: agent calls PolicyService.Evaluate() on the policy node for full OPA/Rego evaluation (complex rules, cross-vCluster checks, approval workflows).

Degraded operation (policy service unreachable): agent falls back to cached VClusterPolicy for basic RBAC decisions:

  • Whitelist checks: allowed (command in cached whitelist + role has action)
  • Two-person approval: denied (cannot verify without policy service)
  • Complex Rego rules: denied (cannot evaluate without OPA)
  • Platform admin override: allowed (role cached, but logged as degraded)

This follows the same AP consistency model as config: nodes keep operating with cached state during partitions. The agent logs all degraded-mode authorization decisions and replays them to the journal when connectivity is restored.

Policy cache is refreshed on each successful PolicyService.Evaluate() call and on explicit policy update events streamed from the journal.

Trade-offs

  • (+) Sovra federation works natively — same Rego language
  • (+) Rich ecosystem (testing, debugging, bundle distribution)
  • (+) No Rust bindings to maintain — clean REST boundary
  • (-) Extra process to deploy and monitor on management nodes
  • (-) REST latency vs in-process Cedar evaluation (acceptable — not hot path)
  • (-) Rego learning curve for operators writing custom policies

Rego Policy Structure

pact/
  authz/
    exec.rego          # Who can exec on which vClusters
    commit.rego        # Commit authorization + two-person approval
    shell.rego         # Shell session authorization
    emergency.rego     # Emergency mode restrictions
    service.rego       # Service lifecycle authorization
  data/
    roles.json         # OIDC group → pact role mappings
    vclusters.json     # Per-vCluster policy overrides

ADR-004: Emergency Mode Preserves Audit Trail

Status: Accepted

Decision

Emergency mode: extended commit window (configurable, default 4h) + suspended auto-rollback + continuous logging + mandatory commit-or-rollback at session end.

Stale emergency (timer expires without --end): alert (Loki event + Grafana alert rule), scheduling hold (lattice cordon via API), no auto-rollback. Another admin with pact-ops-{vcluster} or pact-platform-admin role can force-end a stale emergency with pact emergency --end --force.

Audit trail is never interrupted, including during emergencies.

Shell Restrictions During Emergency

Emergency mode does not expand the pact shell whitelist (PATH restriction). The restricted bash environment remains the same as normal operation. If the admin needs binaries outside the whitelist, they have two options:

  1. Use pact exec for specific non-whitelisted commands (platform admins can bypass the exec whitelist, though still logged)
  2. Access the node via BMC console, which provides unrestricted bash

Emergency mode changes default to a TTL matching the emergency window duration. When the TTL expires, uncommitted changes are rolled back.

ADR-005: No Agent-Level Prometheus Metrics

Status: Accepted

Decision

No per-agent Prometheus scraping (would be 10k targets). Three channels instead:

  1. Journal server metrics → Prometheus → Grafana (3-5 targets)
  2. Config events → Journal → Loki → Grafana (event stream)
  3. Agent process health → lattice-node-agent eBPF → existing Prometheus

ADR-006: Pact-Agent as Init System with SystemD Fallback

Status: Accepted (amended 2026-03-17)

Context

Diskless HPC compute nodes run 5-9 services. systemd is designed for general-purpose systems with hundreds of units. We need to decide who manages process lifecycle.

Decision

pact-agent includes a built-in process supervisor (PactSupervisor) as the default. systemd is available as a fallback for conservative deployments.

Both backends implement the same ServiceManager trait; the vCluster config selects which backend is used.

Amendment (2026-03-17): Sub-context decomposition

Node management is decomposed into six bounded contexts, each with a strategy pattern providing PactSupervisor (default) and SystemdBackend (compat) implementations:

  1. Process Supervision — service lifecycle, background supervision loop, health checks, dependency ordering, restart policies
  2. Resource Isolation — cgroup v2 hierarchy, per-service scopes, OOM containment, namespace creation for lattice allocations
  3. Identity Mapping — OIDC→POSIX UID/GID translation for NFS (pact-nss module). Only active in PactSupervisor mode.
  4. Network Management — netlink interface configuration. Replaces wickedd/NetworkManager in PactSupervisor mode.
  5. Platform Bootstrap — boot phases (InitHardware → ConfigureNetwork → LoadIdentity → PullOverlay → StartServices → Ready), hardware watchdog, SPIRE integration, coldplug.
  6. Workload Integration — namespace handoff to lattice (unix socket, SCM_RIGHTS), mount refcounting, readiness gate. Contract defined in hpc-core.

Cross-cutting: audit events (hpc-audit AuditSink trait) emitted by all contexts.

Supervision loop

PactSupervisor includes a background supervision loop that:

  • Polls process status and triggers restarts per RestartPolicy
  • Uses adaptive interval: faster when idle (500ms, deeper eBPF inspections), slower when workloads active (2-5s, minimal overhead)
  • Is coupled to the hardware watchdog — each loop tick pets /dev/watchdog
  • If the loop hangs, the watchdog expires and BMC triggers hard reboot
  • Only runs in PactSupervisor mode. SystemdBackend delegates restarts to systemd natively.

Hardware watchdog

When pact-agent is PID 1 on BMC-equipped nodes, it pets /dev/watchdog. If no hardware watchdog is available, the node runs in systemd mode (pact is a regular service, not PID 1). The watchdog is the crash/hang recovery mechanism for PID 1 — there is no other process that can restart PID 1.

cgroup v2 enforcement

PactSupervisor creates a cgroup v2 hierarchy at boot:

  • pact.slice/ — pact-owned system services (infra, network, gpu, audit sub-slices)
  • workload.slice/ — lattice-owned workload allocations
  • pact-agent itself runs with OOMScoreAdj=-1000

Each supervised service gets a CgroupScope with configurable resource limits. On process death, all children in the scope are killed via cgroup.kill — no orphans.

Ownership boundary: exclusive write per slice, shared read for metrics, emergency override with audit trail for cross-slice intervention.

Real service sets (from HPE Cray EX compute nodes)

Derived from actual ps aux analysis. pact replaces: systemd, atomd (HPE ATOM), nomad + 17 executors, slurmd, munged, sssd, ldmsd, nrpe, hb_ref, rsyslogd, wickedd, udevd, haveged, DVS-IPC, agetty, bos.reporter.

ML training (GPU) — 7 services: chronyd, dbus-daemon, cxi_rh (×4 per NIC), nvidia-persistenced, nv-hostengine, rasdaemon, lattice-node-agent. Plus rpcbind/rpc.statd if NFS.

Regulated/sensitive — +2: auditd, audit-forwarder = 9 services.

Dev sandbox — 5: chronyd, dbus-daemon, cxi_rh, rasdaemon, lattice-node-agent.

Rationale

On a diskless node with a known, small, declared set of services, the process supervision requirements are simple:

  • Start N processes in dependency order
  • Monitor health, restart on failure with backoff (supervision loop)
  • Manage cgroup v2 isolation (memory/CPU limits per service)
  • Kill orphaned children on process death (cgroup.kill)
  • Hardware watchdog for agent hang detection
  • Adaptive polling to minimize workload disturbance
  • Ordered shutdown

Benefits of pact as init:

  • Every service lifecycle change is inherently a pact operation (logged, auditable)
  • No log ownership conflict between journald and pact’s log pipeline
  • Smaller base image (no systemd, no D-Bus if DCGM standalone, no logind)
  • Boot is faster (no unit parsing, no generator execution)
  • Single process to debug if something goes wrong
  • cgroup hierarchy owned by pact, shared contract with lattice via hpc-core
  • Namespace pre-creation and mount refcounting for lattice (“supercharged” mode)
  • Network configuration via netlink (no wickedd daemon)
  • Identity mapping for NFS (no SSSD)

systemd Fallback

Some deployments may prefer systemd for:

  • Existing operational tooling assumes systemd
  • Compliance requirements mandate specific init system
  • Third-party software requires systemd features (socket activation, etc.)
  • No hardware watchdog available

The fallback is selected per vCluster:

[agent.supervisor]
backend = "systemd"

In systemd mode, pact-agent:

  • Does NOT manage the hardware watchdog (systemd handles it)
  • Does NOT configure network interfaces (wickedd/NetworkManager handles it)
  • Does NOT write identity mapping .db files (SSSD handles it)
  • Does NOT create cgroup hierarchy (systemd manages it)
  • DOES pull overlays and manage config state (pact-specific)
  • DOES manage pact-specific services via generated systemd unit files

Trade-offs

  • PactSupervisor must handle edge cases: zombie reaping, OOM killer interaction, signal propagation, cgroup cleanup on crash, watchdog petting, adaptive polling
  • Software that expects systemd (rare in HPC compute context) needs adaptation
  • Two code paths to maintain (though the strategy pattern minimizes this)
  • Six sub-contexts increase architectural complexity but enable independent testing and clear ownership boundaries

References

  • specs/domain-model.md §2a-2f (sub-context decomposition)
  • specs/invariants.md PS1-PS3, RI1-RI6, IM1-IM7, NM1-NM2, PB1-PB5, WI1-WI6
  • specs/features/ (resource_isolation, identity_mapping, network_management, platform_bootstrap, workload_integration feature files)
  • specs/failure-modes.md F21-F36
  • ADR-015 (hpc-core shared contracts)
  • ADR-016 (identity mapping)

ADR-007: No SSH — Pact Shell Replaces Remote Access

Status: Accepted

Context

SSH on diskless HPC nodes creates untracked, unaudited, unrestricted root shells. Every SSH session is a potential source of unacknowledged configuration drift. The drift detection system exists largely because SSH enables uncontrolled changes.

Decision

pact shell and pact exec are the sole remote access mechanisms for compute nodes. SSH is not installed on compute node images. BMC/Redfish console (via OpenCHAMI) is the out-of-band fallback for pact-agent failures.

Design

pact exec (single command)

pact exec node042 -- nvidia-smi
pact exec node042 -- dmesg --since "5 minutes ago"
pact exec node042 -- cat /etc/resolv.conf
  • Authenticated via OIDC token
  • Authorized against caller’s role + target node’s vCluster
  • Command checked against whitelist (configurable per vCluster)
  • stdout/stderr streamed back to caller
  • Full command + output logged to journal

pact shell (interactive session)

pact shell node042
  • Opens a restricted bash session on the node (not a custom shell)
  • Authenticated + authorized (higher privilege than exec — separate permission)
  • Restriction via environment control, not command parsing:
    • rbash (restricted bash): prevents changing PATH, running /absolute/paths, redirecting output to files
    • PATH limited to whitelisted commands via session-specific directory
    • PROMPT_COMMAND hook logs each executed command to pact audit
    • Optional mount namespace hides sensitive paths
    • Session-level cgroup for resource limits
  • State changes detected by the existing drift observer (eBPF + inotify + netlink) and trigger commit windows — the shell doesn’t pre-classify commands
  • Session start/end recorded in journal

Why restricted bash, not a custom shell

Implementing a shell that interprets pipes, redirects, globbing, quoting, subshells, environment variables, job control, and signal handling is reimplementing bash — poorly. And parsing commands before execution to classify them is a security problem: $(evil), backticks, eval, and argument injection make pre-execution parsing unreliable.

Instead, pact controls what bash can reach (PATH, namespace, cgroup) and observes what happened (PROMPT_COMMAND audit, drift detection). Bash handles interactive shell semantics — it’s been doing that for 35 years.

Whitelist model

  • Implemented as PATH restriction: only whitelisted binaries are symlinked into the session’s bin directory
  • Default whitelist: common diagnostics (nvidia-smi, dmesg, lspci, ip, ss, cat, journalctl, mount, df, free, top, ps, lsmod, sysctl -a, etc.)
  • Learning mode: “command not found” errors are captured by the agent, which suggests adding the command to the vCluster whitelist
  • vCluster-scoped: regulated vClusters may have tighter whitelists
  • Platform admins: broader PATH (but still logged via PROMPT_COMMAND)

State-changing command detection

The agent does not classify commands before execution in shell mode. Instead, the existing drift observer (eBPF probes, inotify, netlink) detects actual state changes and triggers commit windows. This is the same mechanism used for any other source of drift — the shell session is not special.

For pact exec (single commands), the agent does classify commands upfront because it controls the full invocation (no shell interpretation involved).

Fallback

When pact-agent is unresponsive:

  1. Admin uses OpenCHAMI/Manta to access BMC console (Redfish)
  2. BMC console provides regular bash (unrestricted, not pact-managed)
  3. Admin diagnoses and restarts pact-agent if needed
  4. Changes made via BMC appear as unattributed drift once agent recovers
  5. If the node is unrecoverable, admin triggers re-image via OpenCHAMI

Trade-offs

  • Admins lose the flexibility of arbitrary SSH access
  • Whitelist maintenance is ongoing operational work (mitigated by learning mode)
  • Slightly higher latency than direct SSH for some operations
  • Requires trust in pact-agent reliability (mitigated by BMC fallback)
  • rbash restrictions can be bypassed by some binaries (e.g. vi, python, less with !cmd) — whitelisted commands must be audited for shell escape vectors

Security Benefits

  • All remote access is authenticated (OIDC) and authorized (RBAC)
  • Every command is logged with authenticated identity
  • State changes are tracked and require commitment
  • No unrestricted root shell — pact controls the environment via PATH, rbash, and optional mount namespace
  • Attack surface reduced: no sshd, no SSH key management, no SSH vulnerabilities

ADR-008: Node Enrollment, Domain Membership, and Certificate Lifecycle

Status: Accepted (amended 2026-03-17 — SPIRE as primary mTLS provider)

Context

pact-agent authenticates to pact-journal via mTLS. The current design (A-I2) assumes OpenCHAMI provisions per-node certificates into the SquashFS base image. This assumption breaks at scale and in multi-domain deployments:

  1. Shared image problem. Diskless nodes boot from a single SquashFS image. You cannot bake 1,000 unique certificates into one read-only image.

  2. Certificate rotation. Static certificates expire. There is no mechanism to renew them without re-imaging every node.

  3. Multi-domain assignment. A physical machine may be partitioned across multiple pact domains (each with its own journal quorum). A node must be enrollable in multiple domains, but active in only one at a time.

  4. Unauthorized enrollment. No mechanism exists to prevent a node from connecting to any journal it can reach. There is no enrollment registry or hardware identity verification.

  5. Boot storm. 10,000+ nodes booting simultaneously must not overwhelm the certificate authority.

Decision

Two-level membership model

Node lifecycle in pact has two independent axes:

Domain membership (enrollment): “this node is allowed to exist in this pact instance.” Controls certificate issuance, mTLS trust, and physical/security boundary.

vCluster assignment: “this node is currently part of this logical group.” Controls configuration overlay, policy, drift detection, and scheduling. vCluster assignment is optional — an enrolled node with no vCluster assignment is in maintenance mode.

These compose independently. A node can be:

  • Enrolled but unassigned (maintenance pool, spare, staging)
  • Enrolled and assigned to a vCluster (normal operation)
  • Moved between vClusters without re-enrollment
  • Enrolled in multiple domains, active in only one (shared hardware)

Certificate authority: self-generated ephemeral CA on journal nodes

Each pact-journal node generates an ephemeral intermediate CA key at startup (in memory only, never persisted to disk). This eliminates external CA dependencies from the boot and renewal paths entirely.

Rationale:

  • No external dependency for certificate operations — the journal is fully self-contained.
  • Ephemeral keys reduce exposure risk: key compromise requires runtime access to a journal node’s memory, and the key is rotated on every journal restart.
  • Certificate revocation is handled via a Raft revocation registry (replicated to all journal nodes), not an external CRL.
  • Journal nodes sign agent CSRs locally using the ephemeral CA key — a CPU-only operation.

Per-domain topology:

┌─ pact domain ─────────────────────────────────────────────┐
│                                                           │
│  pact-journal quorum (3-5 nodes)                          │
│    ├── each generates ephemeral CA key at startup         │
│    ├── CA cert distributed via enrollment responses       │
│    ├── revocation registry replicated via Raft            │
│    └── signs agent CSRs locally (CPU-only, no network)    │
│                                                           │
│  pact-agents (1000s)                                      │
│    ├── generate own keypair at boot (in RAM)              │
│    └── submit CSR to journal, receive signed cert         │
│                                                           │
│  OpenCHAMI/Manta (boot infra)                             │
│    └── boots nodes, no cert responsibility                │
│                                                           │
└───────────────────────────────────────────────────────────┘

CSR model: agent generates keypair, journal signs

Private keys never leave the agent. The agent generates its own keypair at boot, submits a Certificate Signing Request (CSR) to the journal, and receives a signed certificate. The journal signs using its intermediate CA key — a local CPU operation with no network dependency.

This design ensures:

  • No private keys in Raft state. Journal stores only enrollment records and signed certs (public data). Compromise of a journal node does not expose agent private keys.
  • No private keys on the wire. The enrollment endpoint serves signed certs, not key material. Even if the endpoint is spoofed, the attacker gets a cert for their own key — they cannot impersonate the real agent.
  • Boot storm safe. CSR signing is ~1ms CPU per cert. 10,000 concurrent CSRs are signed in ~10 seconds on a single core. No external traffic.

Enrollment registry

pact-journal maintains a node enrollment registry in Raft state. Each enrollment record contains:

  • Node identity (node_id)
  • Hardware identity (MAC addresses, BMC serial, optionally TPM endorsement key hash)
  • Domain membership state (Registered, Active, Inactive, Revoked)
  • vCluster assignment (optional, independent of enrollment state)
  • Signed certificate metadata (serial, expiry — no private key)

Nodes that are not in the enrollment registry cannot obtain certificates and cannot establish mTLS connections.

Enrollment state machine

                    enroll (platform-admin)
                 ┌──────────────────────►  Registered
                 │                        (enrollment record created,
                 │                         node hasn't booted yet)
                 │                              │
                 │                              │ node boots, sends CSR
                 │                              │ with matching hw identity
                 │                              ▼
                 │                           Active
                 │                         (CSR signed, mTLS up,
                 │                          streaming boot config)
                 │                              │
                 │                 ┌────────────┤
                 │                 │            │
                 │      subscription stream  admin: decommission
                 │      disconnects + grace     │
                 │      period expires          │
                 │                 │            │
                 │                 ▼            ▼
                 │             Inactive      Revoked
                 │           (node gone,    (cert serial added to
                 │            signed cert    Raft revocation registry,
                 │            may still      record removed, cannot
                 │            be valid)      re-enroll without
                 │                 │          new enrollment)
                 │                 │
                 │                 │ node boots again,
                 │                 │ sends new CSR
                 │                 ▼
                 │              Active
                 │           (new CSR signed,
                 │            new cert issued)

Transition constraints:

  • Registered → Active: only on first Enroll call with matching hardware identity.
  • Active → Active: rejected. Once Active, subsequent Enroll calls for the same hardware identity return ALREADY_ACTIVE. This prevents concurrent enrollment races. The real agent already has its cert; a second caller (spoofed or restarted) is rejected. If the agent genuinely restarts, it must wait for the heartbeat timeout (→ Inactive) before re-enrolling, or reuse its existing cert from RAM if still running.
  • Inactive → Active: on re-boot with matching hardware identity. New CSR, new cert.

Bootstrap: hardware identity + CSR, not tokens

The agent’s bootstrap credential is its hardware identity — MAC addresses and BMC serial read from SMBIOS/DMI tables at boot. No bootstrap token injection by Manta is required.

Boot enrollment flow:

  1. Admin pre-registers nodes: pact node enroll <node-id> --mac <mac> --bmc-serial <s>
  2. Journal stores enrollment record in Raft state.
  3. Node boots (via Manta, PXE, any mechanism). pact-agent starts.
  4. Agent reads its hardware identity from the system (MAC, SMBIOS).
  5. Agent generates an ephemeral keypair in memory.
  6. Agent calls EnrollmentService.Enroll(hardware_identity, csr) on the journal (server-TLS-only — the agent does not yet have a client cert).
  7. Journal matches hardware identity against enrollment registry.
  8. On match (Registered or Inactive): signs CSR with intermediate CA key, returns signed cert + current vCluster assignment (if any). Sets state to Active.
  9. On match (Active): rejects with ALREADY_ACTIVE. Prevents race conditions.
  10. On match (Revoked): rejects with NODE_REVOKED.
  11. On no match: rejects with NODE_NOT_ENROLLED.
  12. Agent builds mTLS channel using its private key + signed cert.
  13. If vCluster assigned: StreamBootConfig(vcluster_id) → normal boot.
  14. If no vCluster: maintenance mode (domain defaults only).
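
The journal-side decision in steps 7–11 reduces to a lookup against the enrollment registry. A minimal sketch, with illustrative names (the actual proto/service definitions may differ):

```rust
// Sketch of the journal's Enroll decision (steps 7-11 above), assuming an
// enrollment registry keyed by hardware identity. Names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EnrollmentState {
    Registered,
    Active,
    Inactive,
    Revoked,
}

#[derive(Debug, PartialEq)]
pub enum EnrollOutcome {
    // Sign the CSR with the intermediate CA key and transition to Active.
    SignAndActivate,
    // Reject: a live agent already holds a cert for this identity.
    AlreadyActive,
    // Reject: the node was decommissioned.
    NodeRevoked,
    // Reject: the hardware identity was never pre-registered.
    NodeNotEnrolled,
}

/// Decide an Enroll call from the registry lookup result
/// (None = no matching hardware identity).
pub fn decide_enroll(state: Option<EnrollmentState>) -> EnrollOutcome {
    match state {
        // First boot, or re-boot after heartbeat timeout: new CSR, new cert.
        Some(EnrollmentState::Registered) | Some(EnrollmentState::Inactive) => {
            EnrollOutcome::SignAndActivate
        }
        Some(EnrollmentState::Active) => EnrollOutcome::AlreadyActive,
        Some(EnrollmentState::Revoked) => EnrollOutcome::NodeRevoked,
        None => EnrollOutcome::NodeNotEnrolled,
    }
}
```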

Enrollment endpoint security

The enrollment endpoint is the ONLY unauthenticated gRPC endpoint on the journal. Its attack surface is mitigated by:

  1. Enrollment registry gate. Only hardware identities pre-registered by a platform-admin are served. Unknown identities are rejected immediately.

  2. Once-Active rejection. Once a node transitions to Active, further Enroll calls for the same hardware identity are rejected until the node becomes Inactive (heartbeat timeout). This narrows the spoofing window to the interval between PXE boot and the first successful enrollment (~seconds).

  3. CSR model. Even if an attacker spoofs hardware identity and wins the enrollment race, they get a cert for their own key. The real node will fail to enroll (ALREADY_ACTIVE) and alert — making the attack detectable. The attacker cannot impersonate the real node’s existing connections because they don’t have its private key.

  4. Rate limiting. The enrollment endpoint is rate-limited to N enrollments per minute (configurable, default 100). Brute-force identity guessing is impractical.

  5. Server-TLS-only. The enrollment endpoint requires TLS (server cert validated by agent against the domain’s CA bundle baked into the SquashFS image) but does not require a client cert.

  6. Audit logging. All enrollment attempts (success and failure) are logged to the journal audit trail and forwarded to Loki. Repeated failures for the same hardware identity trigger an alert.

  7. TPM attestation (optional). For high-security deployments, the Enroll request can include a TPM endorsement key hash or PCR quote, providing cryptographic hardware attestation that is not spoofable.

Heartbeat: subscription stream liveness

Node liveness is detected through the existing config subscription stream (BootConfigService.SubscribeConfigUpdates). When an agent is Active, it maintains a long-lived streaming connection to the journal. The journal tracks:

  • last_seen: timestamp of last message received on the subscription stream
  • Heartbeat timeout: configurable per domain (default 5 minutes)

When the subscription stream disconnects AND the heartbeat grace period expires without reconnection, the journal transitions the node from Active → Inactive. This is a Raft write (auditable).

No separate heartbeat RPC is needed — the subscription stream is already maintained by every active agent and its connection state is a natural liveness signal.
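
The liveness check the journal performs can be sketched as a pure predicate over per-node tracking state (field names are illustrative, not the actual journal schema):

```rust
use std::time::{Duration, Instant};

// Sketch of the journal-side liveness check. In the real system the
// Active -> Inactive transition is a Raft write; this only shows the
// condition that triggers it.
pub struct NodeLiveness {
    pub last_seen: Instant,    // last message on the subscription stream
    pub stream_connected: bool,
}

/// A node is marked Inactive only when the stream is down AND the grace
/// period has elapsed without reconnection.
pub fn should_mark_inactive(n: &NodeLiveness, now: Instant, timeout: Duration) -> bool {
    !n.stream_connected && now.duration_since(n.last_seen) >= timeout
}
```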

Local signing eliminates boot storm and renewal batching

No external service is on the boot path or the renewal path for individual agent certs:

Boot storm (T+0, 10,000 nodes simultaneously):
  Agent generates keypair + CSR
  Agent → Journal: Enroll(hardware_id, csr)
  Journal: match enrollment → sign CSR locally with ephemeral CA key
  ^^^^^^ CPU-only operation. ~1ms per signing. No network calls.
  Journal: return signed cert + vCluster assignment

  Agent → Journal: StreamBootConfig(mTLS)
  ^^^^^^ Already served from local state (existing design).

External traffic during boot storm: zero. External traffic during cert renewal: zero (agents send new CSR, journal signs locally).

Certificate revocation is handled entirely within the Raft revocation registry — revoked serials are replicated to all journal nodes via consensus.

Certificate lifecycle: 3-day default, agent-driven renewal

Certificate validity: 3 days (configurable per domain). Renewal at 2/3 lifetime (day 2).

Renewal is agent-driven:

  1. Agent generates new keypair.
  2. Agent calls EnrollmentService.RenewCert(node_id, current_cert_serial, new_csr) over existing mTLS channel.
  3. Journal validates: caller’s mTLS identity matches node_id, current_serial matches stored cert. Signs new CSR. Returns signed cert.
  4. Agent performs dual-channel rotation (see below).

No batch pre-fetching or sweep is needed. Journal signing is local and fast.

Dual-channel rotation (no operational disruption)

Agent maintains two gRPC channels: active and passive.

Day 0: Boot
  Agent generates keypair → CSR → Enroll → signed cert
  Builds active mTLS channel

Day 2: Renewal trigger (2/3 of 3 days)
  1. Agent generates new keypair + CSR
  2. Agent → Journal: RenewCert(node_id, current_serial, new_csr)
  3. Journal signs new CSR, returns signed cert
  4. Agent builds passive channel with new key + new cert
  5. Agent health-checks passive channel (ping journal)
  6. Atomic swap: passive → active, old active → drain
  7. Old channel completes in-flight RPCs, then closes

Day 3: Old cert expires (already swapped out)

If renewal fails (journal unreachable):
  Active channel continues until cert expires.
  Agent enters degraded mode (cached config, invariant A9).
  Keeps retrying. Journal recovery → new CSR signed → reconnect.

Shell sessions, exec operations, and boot config subscriptions are unaffected by rotation.
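
The atomic swap in step 6 can be sketched with a placeholder channel type in place of a real gRPC channel. The key property: an unhealthy passive channel never replaces the active one, and the old active channel is only drained, never hard-closed under in-flight RPCs:

```rust
// Sketch of the dual-channel swap. `Channel` stands in for a real mTLS
// gRPC channel; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct Channel {
    pub cert_serial: u64,
    pub healthy: bool, // result of the step-5 health check
}

pub struct ChannelPair {
    active: Channel,
    draining: Option<Channel>, // old active, completing in-flight RPCs
}

impl ChannelPair {
    pub fn new(active: Channel) -> Self {
        Self { active, draining: None }
    }

    /// Promote a health-checked passive channel; the old active drains.
    pub fn swap(&mut self, passive: Channel) -> Result<(), &'static str> {
        if !passive.healthy {
            // Renewal failed or journal unreachable: keep the current cert.
            return Err("passive channel failed health check");
        }
        self.draining = Some(std::mem::replace(&mut self.active, passive));
        Ok(())
    }

    pub fn active_serial(&self) -> u64 {
        self.active.cert_serial
    }

    pub fn draining_serial(&self) -> Option<u64> {
        self.draining.as_ref().map(|c| c.cert_serial)
    }
}
```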

Multi-domain enrollment (shared hardware)

A node may be enrolled in multiple pact domains simultaneously. This supports the use case of special hardware (e.g., a node with rare GPU configuration) that is swapped between domains.

Constraints:

  • A node can be Active in at most one domain at a time (enforced by physics — it boots from one Manta at a time).
  • Enrollment in multiple domains is a reservation, not an exclusive claim.
  • Each domain signs CSRs independently. The agent generates a new keypair per boot, so each domain’s cert uses a different key.
  • When a node disappears from domain A (heartbeat timeout → Inactive) and boots into domain B (→ Active), no cross-domain coordination is required.

Optional cross-domain visibility via Sovra: when a domain activates a node, it can publish a lightweight enrollment claim. Other domains see this and can warn if the same hardware is active elsewhere. This is advisory, not a hard lock. If Sovra is unavailable, domains operate independently.

vCluster assignment: independent of enrollment

vCluster assignment is a separate journal operation. An enrolled, active node can be:

  • Assigned to a vCluster → normal operation (streams overlay, applies policy)
  • Unassigned → maintenance mode (domain defaults only, no drift detection, not schedulable)
  • Moved between vClusters → unassign + assign (atomic journal operation)

The enrollment response includes the current vCluster assignment (if any), so the agent knows immediately after enrollment whether to stream a vCluster overlay or enter maintenance mode. No separate query is needed.

The certificate CN is pact-service-agent/{node_id}@{domain_id} — no vCluster in the cert. Moving between vClusters does not touch the cert.

Maintenance mode (active + unassigned)

An enrolled node with no vCluster assignment operates in maintenance mode under a domain-default configuration:

  • Services: pact-agent only. Time sync (chronyd/NTP) if configured in domain defaults. No lattice-node-agent, no workload services.
  • Policy: domain-level default policy. Platform-admin can exec/shell. No vCluster-scoped roles active.
  • Drift detection: disabled (no declared state to drift from).
  • Capability report: generated but marked vcluster: None. Node is not schedulable.
  • Shell/exec: available to platform-admin. Useful for diagnostics and pre-assignment hardware validation.

The domain-default configuration is a minimal VClusterPolicy with enforcement_mode: "observe", empty whitelists (platform-admin bypass only), and no regulated flags. It is stored in journal config and applied to all unassigned nodes.

Decommission safety

When decommissioning a node:

  1. If active shell sessions or exec operations exist on the node, the decommission command warns the admin and requires --force to proceed.
  2. On --force (or no active sessions): enrollment state → Revoked, cert serial added to Raft revocation registry, agent’s mTLS connection terminates.
  3. Active sessions are terminated. Session audit records are preserved.
  4. The node cannot re-enroll without a new pact node enroll command.

Batch enrollment

Batch enrollment (pact node enroll --batch nodes.csv) is not atomic. Each node is an independent Raft command. On partial failure:

  • Successfully enrolled nodes are in Registered state, ready for boot.
  • Failed enrollments are reported per-node in the batch response.
  • The admin can retry the batch — already-enrolled nodes return NODE_ALREADY_ENROLLED (idempotent for retry).

Trade-offs

  • (+) No external CA dependency — journal generates ephemeral CA key at startup
  • (+) No private keys in Raft state or on the wire — agent holds its own key in RAM
  • (+) No dependency on Manta/OpenCHAMI for cert management — pact owns its trust
  • (+) Multi-domain shared hardware without distributed locks
  • (+) Maintenance mode is a natural state, not an edge case
  • (+) Certificate rotation is invisible to operations (dual-channel swap)
  • (+) Enrollment registry provides inventory and prevents unauthorized nodes
  • (+) Boot storm safe: local signing is CPU-only (~1ms per cert, ~10s for 10,000)
  • (+) Self-contained: no external PKI required for certificate operations
  • (+) Ephemeral CA key reduces exposure risk (rotated on journal restart, memory-only)
  • (-) Enrollment is an additional admin step before first boot
  • (-) Journal intermediate CA key is sensitive (mitigated: ephemeral, memory-only, rotated on restart; same trust level as journal server TLS key; 3-5 nodes, not 10,000)
  • (-) Hardware identity (MAC + BMC serial) is not cryptographically strong without TPM (mitigated: sufficient for trusted datacenter environments; TPM optional; once-Active rejection limits spoofing window)
  • (-) No external CRL distribution — revocation is checked only by journal nodes via Raft revocation registry (mitigated: journal nodes are the only mTLS terminators)

Consequences

  • A-I2 (mTLS certificates provisioned by OpenCHAMI) is superseded. Certificate lifecycle is pact’s responsibility, using self-generated ephemeral CA keys on journal nodes.
  • Agent config no longer includes vcluster. vCluster assignment comes from the journal.
  • pact-journal gains an EnrollmentService gRPC endpoint with one unauthenticated RPC (Enroll) and authenticated RPCs for admin and renewal operations.
  • pact-journal nodes generate an ephemeral intermediate CA key at startup (in memory only, NOT stored in Raft or on disk).
  • pact-journal Raft state gains a revocation registry for revoked cert serials.
  • pact-cli gains pact node subcommands: enroll, decommission, assign, unassign, move, list, inspect.
  • pact-journal Raft state gains NodeEnrollment records (no key material).
  • New invariants E1-E10 for enrollment, cert lifecycle, and domain membership.
  • Node heartbeat detected via subscription stream liveness (default timeout: 5 minutes).

Amendment (2026-03-17): SPIRE as Primary mTLS Provider

Context for amendment

HPE Cray infrastructure uses SPIRE (SPIFFE Runtime Environment) for mTLS workload attestation. spire-agent runs on compute nodes. The original ADR-008 design assumed pact self-manages all mTLS certificates via an ephemeral intermediate CA. This creates unnecessary duplication with the existing SPIRE infrastructure.

Additionally, lattice-node-agent also needs mTLS (to lattice-quorum). Both systems managing their own certificate lifecycle independently is wasteful when SPIRE already provides this.

Amendment decision

SPIRE is the primary mTLS provider. ADR-008’s ephemeral CA self-signed model is the fallback when SPIRE is not deployed.

The identity acquisition is abstracted via hpc-identity crate (ADR-015) with an IdentityCascade that tries providers in order:

  1. SpireProvider — connect to SPIRE agent socket, obtain X.509 SVID. SPIRE handles rotation, attestation, and trust bundle management.
  2. SelfSignedProvider — ADR-008 model: agent generates keypair + CSR, journal signs with intermediate CA. Fallback when SPIRE is not deployed.
  3. StaticProvider — bootstrap identity from OpenCHAMI SquashFS image. Used for initial journal authentication before SPIRE or journal is reachable.
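
The first-success semantics of the cascade can be sketched as follows. The real hpc-identity crate defines IdentityProvider as an async trait returning a WorkloadIdentity; this synchronous miniature (with a String standing in for the PEM bundle) only illustrates the provider ordering:

```rust
// Simplified, synchronous sketch of the IdentityCascade. The real trait
// is async and returns a WorkloadIdentity; names here are illustrative.
pub trait IdentityProvider {
    fn name(&self) -> &'static str;
    fn try_obtain(&self) -> Option<String>; // PEM bundle in the real crate
}

pub struct IdentityCascade {
    providers: Vec<Box<dyn IdentityProvider>>,
}

impl IdentityCascade {
    pub fn new(providers: Vec<Box<dyn IdentityProvider>>) -> Self {
        Self { providers }
    }

    /// Try SPIRE, then self-signed, then bootstrap; first success wins.
    pub fn obtain(&self) -> Option<(&'static str, String)> {
        self.providers
            .iter()
            .find_map(|p| p.try_obtain().map(|id| (p.name(), id)))
    }
}

/// Example provider that is "not deployed" (always fails).
pub struct Unavailable(pub &'static str);
impl IdentityProvider for Unavailable {
    fn name(&self) -> &'static str { self.0 }
    fn try_obtain(&self) -> Option<String> { None }
}

/// Example provider that always succeeds.
pub struct Always(pub &'static str, pub &'static str);
impl IdentityProvider for Always {
    fn name(&self) -> &'static str { self.0 }
    fn try_obtain(&self) -> Option<String> { Some(self.1.to_string()) }
}
```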

What changes

| Component | Original ADR-008 | After amendment |
|---|---|---|
| Primary cert source | Ephemeral CA via journal | SPIRE SVID |
| Fallback cert source | N/A | Ephemeral CA via journal (ADR-008 model) |
| Bootstrap | Hardware identity + CSR | Same (unchanged) |
| Cert rotation | Agent-driven CSR renewal + dual-channel | SPIRE-managed rotation + dual-channel |
| External CA dependency | None (ephemeral CA) | None (SPIRE manages its own CA) |
| Lattice mTLS | Not addressed | Same IdentityCascade via hpc-identity |

What survives unchanged

  • Enrollment registry — hardware identity matching, enrollment states, admin enrollment
  • EnrollmentState machine — Registered/Active/Inactive/Revoked
  • Bootstrap identity — used for initial auth before any provider is available
  • Dual-channel rotation pattern — applicable to both SVID and self-signed rotation
  • Enrollment endpoint security — rate limiting, once-Active rejection, audit logging
  • Heartbeat via subscription stream — unchanged
  • Multi-domain enrollment — unchanged
  • Maintenance mode — unchanged

What is demoted to fallback

  • Ephemeral CA on journal nodes — only needed when SPIRE not deployed
  • Per-agent CSR signing by journal — only needed when SPIRE not deployed
  • Journal-side cert lifecycle management — SPIRE manages this when available

Boot sequence with SPIRE

T+0.0s  Kernel + initramfs → mount SquashFS root
T+0.1s  pact-agent starts (PID 1)
T+0.2s  IdentityCascade tries StaticProvider (bootstrap cert from SquashFS)
T+0.3s  Agent authenticates to journal using bootstrap identity
T+0.4s  Agent pulls vCluster overlay from journal
T+0.5s  Agent starts services (including any SPIRE-dependent services)
T+0.8s  IdentityCascade retries: SpireProvider detects SPIRE agent available
T+0.9s  Agent obtains SVID from SPIRE
T+1.0s  CertRotator performs dual-channel swap to SVID
T+1.0s  Bootstrap identity discarded (PB4)

If SPIRE agent is never available (standalone deployment): agent continues with bootstrap identity or SelfSignedProvider (journal-signed cert). All functionality works (PB5: no hard SPIRE dependency).

Implications for lattice

lattice-node-agent uses the same IdentityCascade from hpc-identity:

  • When SPIRE available: obtains SVID for lattice-quorum mTLS
  • When SPIRE not available: uses its own cert management (equivalent to ADR-008)
  • Both systems share the same IdentityProvider trait — no duplication

Revisit

  • If TPM attestation becomes available across the fleet, hardware identity verification can be strengthened from MAC+BMC to cryptographic attestation, closing the spoofing window entirely.
  • If the ephemeral CA model proves insufficient for cross-site trust, an external CA (Vault, step-ca, etc.) can be introduced as the CA key source without changing the enrollment or CSR signing model.
  • If SPIRE is adopted universally across all deployments, the SelfSignedProvider and ephemeral CA model can be removed entirely, simplifying the architecture.

ADR-009: Overlay Staleness Detection and On-Demand Rebuild

Status

Accepted

Context

Boot overlays are pre-computed compressed config bundles served to agents during boot. When a vCluster config is committed, the overlay must be rebuilt. If a node boots between the config commit and the overlay rebuild, it would receive stale configuration.

At scale (1000+ nodes booting simultaneously after a power event), overlay freshness is critical to boot correctness.

Decision

Use a hybrid proactive + reactive overlay rebuild strategy:

  1. Proactive: rebuild overlay on every config commit (covers steady-state).
  2. Reactive: detect staleness at serve time by comparing overlay version to latest config sequence. If stale, trigger on-demand rebuild before serving.

Staleness detection is cheap (version comparison). On-demand rebuild adds latency only for the first boot after a config change; subsequent boots use the cached rebuild.
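
The serve-time check reduces to a version comparison. A minimal sketch, with illustrative field names:

```rust
// Sketch of the serve-time staleness check: compare the config sequence
// the overlay was built from against the latest committed sequence.
// Field names are illustrative.
pub struct BootOverlay {
    pub config_seq: u64, // config sequence the overlay was built from
    pub data: Vec<u8>,   // compressed config bundle
}

/// Returns true if the overlay must be rebuilt before serving.
pub fn is_stale(overlay: &BootOverlay, latest_config_seq: u64) -> bool {
    overlay.config_seq < latest_config_seq
}
```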

Overlay consistency (J5)

BootOverlay.checksum is a deterministic hash of BootOverlay.data. The Raft state machine validates the checksum on SetOverlay — any mismatch is rejected. This ensures all replicas serve identical overlays.

Consequences

  • First boot after config commit may be slightly slower (~50-100ms rebuild).
  • All subsequent boots use cached overlay (no penalty).
  • No window where stale config can be served to a booting node.
  • Overlay rebuild is idempotent — concurrent rebuild requests produce the same result.

References

  • specs/failure-modes.md (F9: Stale overlay)
  • specs/invariants.md (J5: Overlay consistency)
  • specs/architecture/enforcement-map.md (J5 row)

ADR-010: Per-Node Delta TTL Bounds (15 minutes – 10 days)

Status

Accepted

Context

Nodes within a vCluster are expected to be homogeneous (A-H1). Per-node deltas are temporary exceptions — debugging overrides, hardware-specific workarounds, or staged rollouts. Without time bounds, deltas accumulate indefinitely, eroding homogeneity and causing scheduling correctness issues.

Decision

Enforce hard TTL bounds on per-node configuration deltas:

  • Minimum: 15 minutes — long enough for a debugging session, short enough to force a decision before the next shift.
  • Maximum: 10 days — carries over weekends with margin, forces periodic review, prevents forgotten deltas from silently accumulating.

TTL is validated at AppendEntry time in the Raft state machine. Values outside bounds are rejected with a descriptive error (ND1, ND2).
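
The bounds check is straightforward; a sketch of the validation the state machine performs (error wording is illustrative):

```rust
use std::time::Duration;

// Sketch of the ND1/ND2 bounds check performed at AppendEntry time.
pub const MIN_DELTA_TTL: Duration = Duration::from_secs(15 * 60);        // 15 minutes
pub const MAX_DELTA_TTL: Duration = Duration::from_secs(10 * 24 * 3600); // 10 days

pub fn validate_delta_ttl(ttl: Duration) -> Result<(), String> {
    if ttl < MIN_DELTA_TTL {
        return Err(format!("delta TTL {:?} below 15-minute minimum (ND1)", ttl));
    }
    if ttl > MAX_DELTA_TTL {
        return Err(format!("delta TTL {:?} above 10-day maximum (ND2)", ttl));
    }
    Ok(())
}
```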

Homogeneity warning (ND3)

The system warns (does not enforce) when per-node deltas cause nodes within a vCluster to diverge from the overlay. Heterogeneity is surfaced in pact status and pact diff output. Operators decide whether to promote (make vCluster-wide) or revert.

Consequences

  • Deltas cannot persist indefinitely — operators must commit, promote, or let them expire.
  • Scheduling correctness is protected: lattice can assume vCluster homogeneity within the TTL window.
  • Promote workflow (F14) must handle conflicts when target nodes have local changes on the same keys.
  • Warning-only for heterogeneity avoids blocking legitimate node-specific config (e.g., GPU firmware workarounds).

References

  • specs/invariants.md (ND1, ND2, ND3)
  • specs/assumptions.md (A-Q3, A-H1)
  • specs/failure-modes.md (F14: Promote conflicts)

ADR-011: Degraded-Mode Policy Evaluation

Status

Accepted

Context

When the agent loses connectivity to the journal (network partition, journal overload), it cannot perform full OPA policy evaluation. The system must decide which operations to allow and which to deny during the partition.

The tension is between availability (let operators work on nodes during outages) and security (don’t allow unauthorized operations just because the policy engine is unreachable).

Decision

Adopt a tiered fail-closed strategy based on operation complexity:

| Operation Type | Degraded Behavior | Rationale |
|---|---|---|
| Whitelist commands (exec) | Allowed — cached whitelist honored | Low risk, needed for diagnostics |
| Basic RBAC checks | Allowed — cached role bindings honored | Enables routine ops during partition |
| Platform admin operations | Allowed — cached role check (logged) | Admin must be able to act in emergencies |
| Two-person approval | Denied — fail-closed | Cannot verify second approver identity |
| Complex OPA rules | Denied — fail-closed | Cannot evaluate external policy state |

All degraded-mode authorization decisions are logged locally. On reconnect, local logs are replayed to the journal for audit continuity.
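
The tiered decision is a total function over operation categories. A sketch mirroring the table above (category names are illustrative):

```rust
// Sketch of the tiered fail-closed decision for degraded mode. Operation
// categories mirror the decision table; names are illustrative.
#[derive(Debug, PartialEq)]
pub enum DegradedDecision {
    Allow,      // honored from cached state, logged locally
    FailClosed, // denied until the journal is reachable again
}

pub enum Operation {
    WhitelistExec,
    BasicRbac,
    PlatformAdmin,
    TwoPersonApproval,
    ComplexOpaRule,
}

pub fn decide_degraded(op: &Operation) -> DegradedDecision {
    match op {
        // Answerable from the agent's cached policy snapshot.
        Operation::WhitelistExec | Operation::BasicRbac | Operation::PlatformAdmin => {
            DegradedDecision::Allow
        }
        // Needs a second identity or external policy state: fail closed.
        Operation::TwoPersonApproval | Operation::ComplexOpaRule => {
            DegradedDecision::FailClosed
        }
    }
}
```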

Consequences

  • Operators can run diagnostics and basic ops during partitions.
  • Regulated operations (two-person approval) are blocked — operators must wait for connectivity or use emergency mode (which has its own audit trail).
  • No silent privilege escalation: complex rules default to deny, not allow.
  • Audit trail has no gaps: local logging bridges the partition.

Alternatives Considered

  • Fail-open for everything: rejected — security risk for regulated vClusters.
  • Fail-closed for everything: rejected — operators locked out during outages, which is when they most need access.
  • Cache full OPA state locally: rejected — OPA rule evaluation may depend on external data sources (Sovra, compliance databases) that are also unreachable.

References

  • specs/invariants.md (P7: Degraded mode restrictions)
  • specs/failure-modes.md (F2: PolicyService unreachable)
  • specs/assumptions.md (A-C3: Cached policy sufficiency)

ADR-012: Merge Conflict Grace Period with Journal-Wins Fallback

Status

Accepted

Context

During network partitions, agents may accumulate local state changes (admin reconfigurations via pact shell, emergency mode, manual intervention). On reconnect, these local changes may conflict with journal state on the same config keys.

The system must balance correctness (don’t silently overwrite admin work) with availability (the node must eventually converge to declared state).

Decision

Implement a three-phase conflict resolution protocol:

Phase 1: Feed-back (CR1)

On reconnect, the agent reports unpromoted local drift to the journal BEFORE accepting the journal’s current state. This ensures no local changes are lost.

Phase 2: Pause and flag (CR2)

If local changes conflict with journal state on the same keys, the agent pauses convergence for those keys and flags a merge conflict. Non-conflicting keys sync normally. The node remains operational but not fully converged.

Phase 3: Grace period with fallback (CR3)

Admin has a grace period (default: commit window duration) to resolve conflicts via pact diff and pact commit. If unresolved within the grace period, the system falls back to journal-wins: the journal’s declared state overwrites local changes. All overwritten values are logged for audit.

Admin notification (CR5)

If an admin has an active CLI session when their changes are overwritten (by grace period timeout or by a concurrent promote), they are notified in-session.

Promote integration (CR4)

When promoting node-level changes to a vCluster overlay, conflicting keys on target nodes require explicit acknowledgment: accept (keep local as per-node delta) or overwrite (apply promoted value).
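
After the phase-1 feed-back, each key is classified independently. A sketch of the per-key decision, with config values simplified to strings:

```rust
// Sketch of the per-key reconnect classification (phases 2-3). The agent
// has already fed local drift back to the journal (phase 1, CR1); this
// only decides what happens to each key afterwards.
#[derive(Debug, PartialEq)]
pub enum KeySync {
    ConvergeToJournal,      // no local change, or values already agree
    PauseAndFlag,           // conflict within the grace period (CR2)
    JournalWinsLogOldValue, // grace period expired (CR3): overwrite + audit
}

pub fn classify_key(
    local: Option<&str>, // unpromoted local value, if any
    journal: &str,       // journal's declared value
    grace_expired: bool,
) -> KeySync {
    match local {
        None => KeySync::ConvergeToJournal,
        Some(l) if l == journal => KeySync::ConvergeToJournal,
        Some(_) if !grace_expired => KeySync::PauseAndFlag,
        Some(_) => KeySync::JournalWinsLogOldValue,
    }
}
```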

Consequences

  • No silent data loss: local changes are always fed back and logged before any overwrite.
  • Availability preserved: grace period timeout ensures convergence eventually happens even without admin intervention.
  • Admin agency: operators get time to review and decide, not just informed after the fact.
  • Complexity: agent must track per-key conflict state and grace period timers.

Alternatives Considered

  • Immediate journal-wins: rejected — silently discards admin work done during partition, violates trust in the audit trail.
  • Require manual resolution always: rejected — node never converges if admin is unavailable (vacation, off-hours).
  • Last-write-wins by timestamp: rejected — clock skew between agent and journal makes this unreliable; admin-committed changes should take precedence over auto-converge.

References

  • specs/invariants.md (CR1–CR5)
  • specs/failure-modes.md (F13: Merge conflict on reconnect, F14: Promote conflicts)
  • specs/assumptions.md (A-Q2: Partition conflict replay, A-C2: Timestamp ordering)

ADR-013: Two-Person Approval as Stateful Raft Entries

Status

Accepted

Context

Regulated vClusters (e.g., sensitive-data, compliance-governed workloads) require two-person approval for state-changing operations. This must work within pact’s Raft-based journal architecture and produce an immutable audit trail.

Decision

Model two-person approval as a state machine stored in Raft:

Pending → Approved (by different identity)
Pending → Rejected (by any authorized identity)
Pending → Expired (by timeout)

Key rules

  1. Distinct identities (PAuth5): the approver’s token identity must differ from the requester’s. Same-identity approval is rejected regardless of token freshness.

  2. Configurable timeout (P5): pending requests expire after a configurable timeout (default 30 minutes). Expired requests cannot be approved — the requester must re-submit. This prevents stale approvals from being rubber-stamped days later.

  3. Raft persistence: PendingApproval records are written to Raft via CreateApproval and decided via DecideApproval. This gives them the same durability and replication guarantees as config entries.

  4. Fail-closed during partition (P7, F1): two-person approval requires Raft writes. During quorum loss or policy unreachability, approval requests are denied — not deferred.
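
The DecideApproval transition rules can be sketched as a pure function (identities and timeout evaluation are simplified; the real implementation works on PendingApproval Raft records):

```rust
// Sketch of the DecideApproval transitions (PAuth5 distinct identities,
// P5 timeout). Identities are plain strings; timeout is pre-evaluated.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ApprovalStatus {
    Pending,
    Approved,
    Rejected,
    Expired,
}

pub fn decide_approval(
    status: ApprovalStatus,
    requester: &str,
    approver: &str,
    approve: bool,
    timed_out: bool,
) -> Result<ApprovalStatus, &'static str> {
    if status != ApprovalStatus::Pending {
        return Err("approval already decided");
    }
    if timed_out {
        // Expired requests cannot be approved; requester must re-submit.
        return Ok(ApprovalStatus::Expired);
    }
    if approve && approver == requester {
        // Distinct-identity rule (PAuth5): no self-approval.
        return Err("approver identity matches requester");
    }
    Ok(if approve { ApprovalStatus::Approved } else { ApprovalStatus::Rejected })
}
```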

Scope

Two-person approval applies when VClusterPolicy.two_person_approval = true. Operations covered: commit, rollback, exec (on regulated vClusters), emergency mode start/end.

Consequences

  • Full audit trail: every approval request, decision, and timeout is a Raft entry visible in pact log.
  • No approval possible during partitions — operators must use emergency mode (ADR-004) which has its own audit trail.
  • Timeout prevents stale approvals but requires requester to be present for re-submission.
  • Distinct-identity check prevents a single compromised credential from self-approving.

Alternatives Considered

  • External approval system (Slack bot, PagerDuty): rejected — adds external dependency to the critical path; pact should be self-contained for core operations.
  • Deferred approval during partition: rejected — cannot verify second identity without journal; deferred approval could be replayed after the security context has changed.

References

  • specs/invariants.md (P4, P5, PAuth5)
  • specs/failure-modes.md (F1: Journal quorum loss)
  • specs/domain-model.md (PendingApproval entity, ApprovalStatus enum)

ADR-014: Optimistic Concurrency with Commit Windows

Status

Accepted

Context

HPC clusters require low-latency configuration changes. Traditional config management (Puppet/Ansible) applies changes synchronously — the operator waits for convergence before proceeding. For pact, this would block interactive workflows on nodes that should feel “local.”

At the same time, unapplied changes must not drift indefinitely. There must be a mechanism that forces resolution — either commit the change or roll it back.

Decision

Use optimistic concurrency with time-bounded commit windows:

Apply immediately

Configuration changes take effect on the node immediately. The operator does not wait for consensus or convergence. This gives pact shell the responsiveness of being “on the box.”

Commit window opens

When drift is detected (a change has been applied but not committed to the journal), a commit window opens. The window duration is:

window_seconds = base_window / (1 + drift_magnitude * sensitivity)
  • base_window: default 900 seconds (15 minutes)
  • drift_magnitude: weighted L2 norm of the drift vector
  • sensitivity: default 2.0

Larger drift = shorter window. This creates urgency proportional to how much the node has deviated from declared state.
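
The formula transcribes directly; with the defaults, zero drift yields the full 15-minute window and a drift magnitude of 1.0 shrinks it to 5 minutes:

```rust
// Direct transcription of the commit-window formula above.
pub fn commit_window_seconds(base_window: f64, drift_magnitude: f64, sensitivity: f64) -> f64 {
    base_window / (1.0 + drift_magnitude * sensitivity)
}
```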

Auto-rollback on expiry (A4)

If the commit window expires without an explicit pact commit, the system automatically rolls back to declared state.

Exception: emergency mode (ADR-004) suspends auto-rollback. The emergency window (default 4 hours) replaces the commit window.

Rollback safety (F5)

Before rolling back, the system checks for active consumers (open file handles, running processes using the affected resources). If consumers are active, the rollback fails and the node remains drifted — the admin must resolve manually.

Consequences

  • Interactive config changes feel instant — no waiting for Raft round-trip.
  • Time pressure prevents drift accumulation: either commit or lose the change.
  • Larger changes get shorter windows, preventing large unreviewed drift.
  • Emergency mode provides an escape hatch for extended debugging sessions.
  • Active consumer check prevents data loss from premature rollback.

Alternatives Considered

  • Synchronous apply-after-commit: rejected — too slow for interactive HPC admin workflows; would require SSH-like latency which pact replaces.
  • No auto-rollback (manual only): rejected — drift accumulates silently; operators forget to commit; nodes diverge from declared state.
  • Fixed window duration: rejected — a 1-line sysctl change and a 50-mount reconfiguration shouldn’t have the same urgency.

References

  • specs/invariants.md (A3: Commit window formula, A4: Auto-rollback)
  • specs/failure-modes.md (F5: Rollback with active consumers)
  • CLAUDE.md (“Optimistic concurrency — changes apply immediately, commit within time window”)

ADR-015: hpc-core Shared Contracts (hpc-node, hpc-audit, hpc-identity)

Status: Accepted

Context

pact and lattice are independent systems that benefit from co-deployment. When pact is the init system and lattice manages workloads, lattice gains capabilities (“supercharged”): cgroup pre-creation, namespace handoff, mount refcounting, unified audit, shared mTLS. But lattice must also work standalone (on systemd-managed nodes without pact).

Both systems need to agree on:

  1. cgroup slice layout and ownership boundaries
  2. Namespace FD passing protocol
  3. Mount point conventions
  4. Audit event format for SIEM integration
  5. Workload identity (mTLS certificate) management

The existing hpc-core workspace (../hpc-core) contains three trait-based crates: raft-hpc-core, hpc-scheduler-core, hpc-auth. These define shared contracts that pact and lattice implement independently. The pattern works: traits + types, no implementations, no runtime coupling.

Decision

Add three new crates to hpc-core, following the same trait-based pattern.

hpc-node

Shared contracts for node-level resource management.

| Contract | Purpose | Implements |
|---|---|---|
| CgroupManager trait | Hierarchy creation, scope management, metrics | pact-agent (direct cgroup v2), lattice-node-agent (standalone) |
| NamespaceProvider trait | Create namespaces for allocations | pact-agent |
| NamespaceConsumer trait | Request namespaces (with self-service fallback) | lattice-node-agent |
| MountManager trait | Refcounted mounts, lazy unmount, reconstruction | pact-agent, lattice-node-agent (standalone) |
| ReadinessGate trait | Boot readiness signaling | pact-agent (provider), lattice-node-agent (consumer) |
| SliceOwner enum | Pact / Workload ownership | Compile-time contract |
| slices constants | Well-known cgroup paths | Compile-time contract |
| Well-known paths | Socket paths, mount bases | Compile-time contract |

Key design:

  • CgroupManager does NOT enforce ownership — that’s the implementer’s responsibility. The trait provides slice_owner() as a query, not a guard.
  • Namespace handoff uses unix socket at HANDOFF_SOCKET_PATH with SCM_RIGHTS.
  • Mount conventions define base paths but not mount implementation.
  • ReadinessGate has both sync (is_ready()) and async (wait_ready()) methods.

hpc-audit

Shared audit event types and sink trait. Loose coupling, high coherence.

| Contract | Purpose | Implements |
|---|---|---|
| AuditEvent type | Universal event format (who, what, when, where, outcome) | All components emit |
| AuditSink trait | Destination interface (emit() + flush()) | pact-journal (append), pact-agent (buffer+forward), lattice-quorum, file writer, SIEM forwarder |
| CompliancePolicy type | Retention rules, required audit points | pact-policy, lattice-policy |
| Action constants | Well-known action strings | Compile-time contract |

Key design:

  • AuditSink::emit() must not block. Buffer internally.
  • Each system owns its audit log (pact → journal, lattice → quorum).
  • Shared format enables a single AuditForwarder for SIEM integration.
  • AuditSource enum distinguishes which system emitted an event.
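
The "emit() must not block" rule implies a buffering sink. A minimal sketch under that rule, with an illustrative event layout (the real AuditEvent carries the full who/what/when/where/outcome fields):

```rust
// Sketch of a non-blocking AuditSink: emit() only appends to an internal
// buffer; flush() drains to the destination. Field layout is illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct AuditEvent {
    pub who: String,
    pub action: String,
    pub outcome: String,
}

pub trait AuditSink {
    fn emit(&mut self, event: AuditEvent); // must not block
    fn flush(&mut self) -> usize;          // returns number of events flushed
}

pub struct BufferedSink {
    buffer: Vec<AuditEvent>,
    pub delivered: Vec<AuditEvent>, // stands in for journal/Loki delivery
}

impl BufferedSink {
    pub fn new() -> Self {
        Self { buffer: Vec::new(), delivered: Vec::new() }
    }
}

impl AuditSink for BufferedSink {
    fn emit(&mut self, event: AuditEvent) {
        self.buffer.push(event); // append only, never blocks on I/O
    }
    fn flush(&mut self) -> usize {
        let n = self.buffer.len();
        self.delivered.append(&mut self.buffer);
        n
    }
}
```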

hpc-identity

Workload identity abstraction. SPIRE/self-signed/bootstrap behind a trait.

| Contract | Purpose | Implements |
|---|---|---|
| IdentityProvider trait | Obtain workload identity from any source | SpireProvider, SelfSignedProvider, StaticProvider |
| CertRotator trait | Certificate rotation (dual-channel swap) | pact-agent, lattice-node-agent |
| IdentityCascade | Try providers in order (SPIRE → self-signed → bootstrap) | pact-agent, lattice-node-agent |
| WorkloadIdentity type | Source-agnostic cert + key + trust bundle | Used by all mTLS consumers |
| IdentitySource enum | Spire / SelfSigned / Bootstrap | Audit provenance |
| Provider configs | SpireConfig, SelfSignedConfig, BootstrapConfig | Configurable per deployment |

Key design:

  • IdentityCascade is a struct, not a trait — it composes IdentityProvider impls.
  • Provider implementations live in the consuming crates (pact-agent, lattice-node-agent), not in hpc-identity. The crate only defines the contract.
  • WorkloadIdentity contains PEM data (not parsed certs) for maximum interoperability.
  • CertRotator::rotate() contract: must not interrupt in-flight operations.
  • Partially supersedes ADR-008 cert management (see ADR-008 amendment).
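The "struct, not a trait" point can be made concrete with a sketch. Type and trait names come from the table above; the error type, method signatures, and the `StaticProvider` internals are assumptions (the real trait is likely async):

```rust
/// Sketch: WorkloadIdentity carries PEM data, per the ADR.
pub struct WorkloadIdentity {
    pub cert_pem: String,
    pub key_pem: String,
    pub trust_bundle_pem: String,
    pub source: IdentitySource,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum IdentitySource { Spire, SelfSigned, Bootstrap }

pub trait IdentityProvider {
    fn obtain(&self) -> Result<WorkloadIdentity, String>;
}

/// A struct, not a trait: it composes IdentityProvider impls and tries
/// them in order (SPIRE → self-signed → bootstrap).
pub struct IdentityCascade {
    providers: Vec<Box<dyn IdentityProvider>>,
}

impl IdentityCascade {
    pub fn new(providers: Vec<Box<dyn IdentityProvider>>) -> Self {
        Self { providers }
    }
    pub fn obtain(&self) -> Result<WorkloadIdentity, String> {
        let mut last_err = "no providers configured".to_string();
        for p in &self.providers {
            match p.obtain() {
                Ok(id) => return Ok(id),
                Err(e) => last_err = e, // fall through to the next provider
            }
        }
        Err(last_err)
    }
}

/// Hypothetical fallback provider returning pre-provisioned PEM material.
pub struct StaticProvider {
    pub identity_pem: (String, String, String), // cert, key, trust bundle
}

impl IdentityProvider for StaticProvider {
    fn obtain(&self) -> Result<WorkloadIdentity, String> {
        Ok(WorkloadIdentity {
            cert_pem: self.identity_pem.0.clone(),
            key_pem: self.identity_pem.1.clone(),
            trust_bundle_pem: self.identity_pem.2.clone(),
            source: IdentitySource::Bootstrap,
        })
    }
}
```

Because the cascade is a plain struct over boxed providers, pact-agent and lattice-node-agent can assemble different provider orders without hpc-identity knowing about either.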

Rationale

Why hpc-core, not pact-specific crates?

  • Lattice must work independently of pact (A-Int6). If contracts were pact-specific, lattice would depend on a pact crate — creating coupling.
  • hpc-core is the established pattern: trait-based, no implementation, no runtime coupling.
  • Both systems implement the same traits, ensuring convention agreement without coordination.

Why three crates, not one?

  • Different change reasons: cgroup layout (hpc-node) changes rarely, audit format (hpc-audit) changes with compliance requirements, identity providers (hpc-identity) change with infrastructure evolution (SPIRE adoption).
  • Minimal dependencies: hpc-audit needs only serde + chrono. hpc-node needs only serde. hpc-identity needs async-trait + chrono + thiserror. No reason to force all consumers to take all dependencies.

Why not extend hpc-auth?

  • hpc-auth is about OAuth2/OIDC token management (user authentication at the CLI level).
  • hpc-identity is about workload mTLS identity (machine authentication between services).
  • Different domains despite both being “identity.” Different consumers (CLI vs agent).

Trade-offs

  • (+) Clear trait-based contracts — both systems implement independently
  • (+) Lattice gains capabilities when pact is present, works alone when not
  • (+) Shared audit format for unified SIEM
  • (+) SPIRE integration shared between pact and lattice
  • (+) Well-known paths and conventions prevent configuration drift
  • (-) Three more crates to maintain in hpc-core
  • (-) Trait design must be stable — breaking changes affect both pact and lattice
  • (-) Contract validation is only at integration test level (no compile-time guarantee that both sides interpret traits the same way)

Consequences

  • hpc-core workspace gains three new members: crates/node/, crates/audit/, crates/identity/
  • pact-agent depends on hpc-node, hpc-audit, hpc-identity (compile-time)
  • lattice-node-agent depends on hpc-node, hpc-audit, hpc-identity (compile-time)
  • No runtime dependency between pact and lattice (only shared contracts)
  • CI for hpc-core gains three new pipelines (ci-node.yml, ci-audit.yml, ci-identity.yml)
  • Version scheme follows existing hpc-core pattern (year.major.commitcount)

References

  • specs/domain-model.md §2b, §2f, Cross-cutting: Audit, hpc-identity
  • specs/architecture/interfaces/hpc-node.md, hpc-audit.md, hpc-identity.md
  • specs/invariants.md RI1, WI1-WI6, O3, PB4-PB5
  • ADR-008 (amended: SPIRE primary, self-signed fallback)

ADR-016: Identity Mapping — OIDC-to-POSIX UID/GID Shim for NFS

Status: Accepted

Context

pact uses OIDC for all authentication (ADR-008, hpc-auth). OIDC works natively with S3 storage. However, NFS uses POSIX UID/GID for file ownership and access control. On compute nodes where pact is init (no SSSD), a mapping layer is needed to translate OIDC subjects to POSIX identities for NFS compatibility.

This is explicitly a bypass shim, not a core identity system. It exists solely because NFS cannot authenticate via OIDC. When storage migrates to pure S3 or NFSv4 with string identifiers, this subsystem becomes unnecessary.

Decision

UidMap in pact-journal

The journal stores a UidMap — a table of OIDC subject → POSIX UID/GID mappings. Each entry is Raft-committed and immutable within a federation membership.

Two assignment models (configurable per vCluster):

  • On-demand (default): unknown OIDC subject authenticates → pact-policy checks IdP → assigns UID from org’s precursor range → Raft-commits → propagates to agents.
  • Pre-provisioned (regulated): admin pre-provisions all users. Unknown subjects rejected. Required for sensitive vClusters.

Federation deconfliction via computed precursor ranges

Each Sovra-federated org gets:

  • An org_index (sequential, Raft-committed on federation join: 0=local, 1, 2, …)
  • A computed UID precursor: base_uid + org_index * stride (default: base_uid=10000, stride=10000)
  • A computed GID precursor: base_gid + org_index * stride (same formula, same stride)
  • UID assignment is sequential within the precursor range (precursor to precursor + stride - 1)

Collision is impossible by construction (sequential org_index, non-overlapping ranges). Stride is a site-wide configurable default (adjustable before assignments start).
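The precursor formula above can be written out directly — a minimal sketch using the defaults stated in this ADR (function names are illustrative):

```rust
/// Computed UID precursor: base_uid + org_index * stride.
/// Defaults per the ADR: base_uid = 10_000, stride = 10_000.
fn uid_precursor(base_uid: u32, org_index: u32, stride: u32) -> u32 {
    base_uid + org_index * stride
}

/// Inclusive UID range for an org: precursor ..= precursor + stride - 1.
fn uid_range(base_uid: u32, org_index: u32, stride: u32) -> (u32, u32) {
    let p = uid_precursor(base_uid, org_index, stride);
    (p, p + stride - 1)
}
```

With the defaults, org 0 (local) owns 10000-19999, org 1 owns 20000-29999, and so on — adjacent ranges can never overlap because org_index is sequential.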

On federation departure: all UidEntries for that org are GC’d from the journal, org_index becomes reclaimable. NFS files owned by departed org’s UIDs become orphaned.

pact-nss: NSS module via libnss crate

A separate crate (pact-nss, cdylib) using the libnss 0.9.0 Rust crate provides a Linux NSS module (libnss_pact.so.2) that resolves UID/GID lookups from local files:

  • pact-agent writes /run/pact/passwd.db and /run/pact/group.db to tmpfs at boot and on journal subscription updates
  • NSS module reads from these files via mmap — zero network calls, ~1μs per lookup
  • /etc/nsswitch.conf: passwd: files pact / group: files pact
  • Full supplementary group resolution (getgrouplist)

The NSS module is read-only. It never writes, never makes network calls, never blocks.

Activation conditions

Identity mapping is only active when ALL of:

  • SupervisorBackend = Pact (not systemd — SSSD handles it in systemd mode)
  • NFS storage is in use on the node

When inactive, no .db files are written, no NSS module is loaded.
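The activation check is a simple conjunction; a hedged sketch (the enum and function names are assumptions, not the actual pact-agent API):

```rust
/// Sketch of the activation conditions: identity mapping runs only when
/// pact-agent is the supervisor AND the node uses NFS storage.
#[derive(PartialEq)]
enum SupervisorBackend { Pact, Systemd }

fn identity_mapping_active(backend: SupervisorBackend, nfs_in_use: bool) -> bool {
    // In systemd mode SSSD owns identity; without NFS no POSIX mapping is needed.
    backend == SupervisorBackend::Pact && nfs_in_use
}
```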

Rationale

Why not OIDC claims with POSIX attributes?

  • IdP may not be under pact’s control (federated environments)
  • UidMap must be populated before any user authenticates (NFS files exist at boot)
  • OIDC tokens only arrive on authentication, but UIDs are needed always

Why journal as authority, not IdP sync?

  • The mapping must be consistent across all nodes (same UID everywhere)
  • Raft-committed entries are immutable and auditable
  • Journal subscription pushes updates to agents with sub-second latency
  • IdP sync is an optional optimization, not the primary mechanism

Why computed precursor ranges, not configured ranges?

  • Computed = deterministic from org_index, no admin configuration per range
  • Sequential org_index = no overlap by construction
  • Reclaimable on federation departure (GC org’s entries, index reusable)
  • Simpler than hash-based mapping (no collision risk in small UID space)

Why a separate pact-nss crate?

  • libnss 0.9.0 is LGPL-3.0. Dynamic linking (cdylib) satisfies LGPL.
  • NSS module must be a shared library loaded by glibc — cannot be part of pact-agent binary.
  • Minimal dependencies (libc, lazy_static, paste).

Trade-offs

  • (+) OIDC-native design — no SSSD, no LDAP on compute nodes
  • (+) Consistent UIDs across all nodes (Raft-committed)
  • (+) Federation deconfliction by construction (no collisions possible)
  • (+) NSS module is pure read-only mmap — no performance impact
  • (+) Explicitly a shim — removable when NFS is replaced by S3
  • (+) Reclaimable ranges on federation departure
  • (-) Additional crate (pact-nss) and shared library to deploy
  • (-) UidMap adds entries to journal Raft state
  • (-) Stride change after assignments requires UID remapping (operational pain)
  • (-) NFS files from departed orgs become orphaned (admin responsibility)
  • (-) Sub-second propagation lag for new UIDs (A-Id3, F32)

Consequences

  • pact-journal Raft state gains UidMap and OrgIndex entries
  • pact-agent gains identity/ submodule for UidMap management
  • New crate: pact-nss (cdylib, LGPL-3.0 compatible)
  • SquashFS images must include libnss_pact.so.2 and nsswitch.conf entry
  • IdP sync (SCIM/LDAP) is an optional journal-side optimization, not required
  • Boot Phase 3 (LoadIdentity) loads UidMap before Phase 5 (StartServices)
  • Services with non-root users wait for UidMap resolution (IM7)

References

  • specs/domain-model.md §2c (Identity Mapping context)
  • specs/invariants.md IM1-IM7
  • specs/assumptions.md A-Id1 through A-Id6
  • specs/features/identity_mapping.feature (17 scenarios)
  • specs/failure-modes.md F24 (range exhaustion), F25 (NSS .db corruption), F32 (propagation lag)
  • libnss crate: https://lib.rs/crates/libnss (0.9.0, Feb 2025)

ADR-017: Network Topology — Management Network for Pact, HSN for Lattice

Status: Accepted

Context

HPC infrastructure has two distinct networks:

  • Management network (1G Ethernet): OpenCHAMI, BMC/IPMI, PXE boot, admin access. Always available. Low bandwidth, high reliability.
  • High-speed network (Slingshot/Ultra Ethernet, 200G+): workload traffic, MPI/NCCL, storage data plane. High bandwidth, low latency. Requires cxi_rh (Slingshot resource handler) to be running.

Both pact and lattice need mTLS-authenticated gRPC communication. The question: which network carries which traffic?

Decision

Pact traffic runs entirely on the management network. Lattice traffic runs on the high-speed network (HSN). SPIRE provides network-agnostic identity to both.

Pact on management network

| Traffic | Direction | Size | Frequency |
|---|---|---|---|
| Enrollment (CSR + cert) | Agent → Journal | ~5 KB | Once per boot |
| Boot overlay streaming | Journal → Agent | 100-200 KB (zstd) | Once per boot |
| Node delta | Journal → Agent | <1 KB | Once per boot |
| Config subscription | Journal → Agent | Events (bytes) | Occasional |
| Heartbeat (stream keepalive) | Agent ↔ Journal | Bytes | Continuous |
| Exec/shell (interactive) | CLI → Agent | Variable | On demand |
| Audit events | Agent → Journal | ~1 KB each | Per operation |
| Journal Raft consensus | Journal ↔ Journal | Config entries | On writes |

Journal listens on management network:

  • gRPC: port 9443
  • Raft: port 9444

Lattice on HSN

| Traffic | Direction | Size | Frequency |
|---|---|---|---|
| Quorum Raft consensus | Quorum ↔ Quorum | State machine ops | On writes |
| Node-agent heartbeat + status | Agent → Quorum | Telemetry | 30s intervals |
| Allocation lifecycle | Quorum → Agent | Commands | Per allocation |
| Checkpoint coordination | Agent ↔ Quorum | Signals | On checkpoint |
| Capability reports | Agent → Quorum | ~2 KB | On change |

Quorum listens on HSN:

  • gRPC: port 50051
  • Raft: port 9000

SPIRE bridges both networks

Node (management + HSN interfaces)
├── /run/spire/agent.sock  ← local unix socket, no network
│   ├── pact-agent obtains SVID → uses on management net (journal mTLS)
│   └── lattice-node-agent obtains SVID → uses on HSN (quorum mTLS)
│
├── Management NIC (1G)
│   └── pact-agent ←mTLS→ pact-journal:9443
│
└── HSN NIC (200G+, via cxi_rh)
    └── lattice-node-agent ←mTLS→ lattice-quorum:50051

X.509 certificates authenticate identity (SPIFFE ID or CN), not network interfaces. The same SVID works on both networks. SPIRE agent is node-local — no network dependency for identity acquisition.

Boot ordering enforces this

T+0.0s  PXE boot via management net (OpenCHAMI)
T+0.1s  pact-agent starts as PID 1
T+0.2s  pact-agent gets SVID from SPIRE (local socket — no network)
T+0.3s  pact-agent connects to journal on management net (mTLS)
T+0.4s  pact pulls overlay, configures management interface (netlink)
T+0.5s  pact starts cxi_rh → HSN interface comes up
T+0.7s  pact starts lattice-node-agent (supervised service)
T+0.8s  lattice-node-agent gets SVID from SPIRE (local socket)
T+0.9s  lattice-node-agent connects to quorum on HSN (mTLS)
T+1.0s  Node fully operational on both networks

Management network MUST be available before HSN — it’s the PXE boot network. HSN comes up only after pact starts cxi_rh (a supervised service, Phase 5). Therefore pact cannot use HSN for its own communication — it’s not available during early boot.

Co-located mode

When journal and quorum share physical nodes:

Co-located node:
├── Management NIC (1G):
│   ├── pact-journal gRPC :9443
│   └── pact-journal Raft :9444
│
├── HSN NIC (200G+):
│   ├── lattice-quorum gRPC :50051
│   └── lattice-quorum Raft :9000
│
└── SPIRE agent socket (shared)

Each system listens on its own network. No port conflicts. Both use SPIRE SVIDs — same trust domain, different network interfaces.
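In configuration terms, the co-located split might look like the fragment below. The format and addresses are hypothetical — only the ports and the interface roles come from this ADR:

```toml
# Hypothetical node configuration for co-located mode.
# Ports are from the ADR; config schema and IPs are illustrative only.

[pact.journal]
grpc_bind = "10.1.0.5:9443"    # management NIC
raft_bind = "10.1.0.5:9444"    # management NIC

[lattice.quorum]
grpc_bind = "10.250.0.5:50051" # HSN NIC
raft_bind = "10.250.0.5:9000"  # HSN NIC

[spire]
agent_socket = "/run/spire/agent.sock" # shared, node-local
```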

Rationale

Why management net for pact (not HSN)?

  1. Bootstrap ordering: HSN is not available during early boot. Pact must connect to the journal to get the overlay that configures HSN.

  2. Failure isolation: management net down → pact uses cached config (A9), lattice continues on HSN. HSN down → lattice pauses, pact continues managing nodes. Clean failure boundaries.

  3. Security boundary: admin operations (shell, exec) should traverse the management network, not the workload network.

  4. Bandwidth is sufficient: 10,000 nodes × 200 KB overlay = 2 GB. With 3-5 journal servers on 1G management NICs = 3-5 Gbps aggregate. Zstd-compressed overlays (~100 KB actual) = ~1 GB total = 2-3 seconds. Within the boot time target (A8: <2s with warm journal).
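The bandwidth arithmetic in point 4 can be checked with a small helper (a sketch; the numbers are from this ADR, the function is illustrative):

```rust
/// Worked version of the boot-overlay streaming estimate:
/// total bits to push divided by aggregate journal-server bandwidth.
fn streaming_seconds(nodes: u64, overlay_kb: u64, servers: u64, nic_gbps: u64) -> f64 {
    let total_bits = nodes * overlay_kb * 1_000 * 8;          // KB → bits
    let aggregate_bps = servers * nic_gbps * 1_000_000_000;   // Gbps → bps
    total_bits as f64 / aggregate_bps as f64
}
```

With 10,000 nodes, ~100 KB compressed overlays, and 3-5 journal servers on 1G NICs, this lands in the 2-3 second range the ADR cites.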

Why HSN for lattice (not management)?

  1. Bandwidth: telemetry from 10,000 nodes at 30s intervals, plus allocation lifecycle events, would saturate 1G management net.

  2. Latency: Raft consensus and scheduler decisions need low latency. Slingshot provides sub-microsecond latency vs milliseconds on 1G Ethernet.

  3. Consistency: workload traffic (MPI, NCCL, storage) already runs on HSN. Lattice managing workloads on the same network is natural.

Failure isolation matrix

| Network down | Pact | Lattice | Workloads |
|---|---|---|---|
| Management only | Journal unreachable. Agents use cached config (A9). Shell/exec unavailable. | Unaffected. | Running workloads continue. |
| HSN only | Unaffected. Admin access works. | Quorum unreachable. No new scheduling. | MPI/NCCL fails. Running jobs may checkpoint. |
| Both | BMC console only (F6). | Everything down. | Everything down. |
| Neither | Normal operation. | Normal operation. | Normal operation. |

Trade-offs

  • (+) Clean failure isolation — each system survives the other’s network failure
  • (+) No HSN dependency for pact — simpler boot sequence, fewer failure modes
  • (+) Admin operations on management net — standard HPC security practice
  • (+) SPIRE bridges both networks cleanly — same identity, different interfaces
  • (+) Co-located mode works naturally — different ports on different NICs
  • (-) Boot overlay streaming limited by 1G management net bandwidth (mitigated: zstd compression, 3-5 journal servers, overlays are small)
  • (-) Two networks to monitor for full-system health
  • (-) If management net is unreliable, pact operations are degraded even though HSN is fine (mitigated: cached config, A9)

Consequences

  • pact-journal configuration binds to management network interface
  • lattice-quorum configuration binds to HSN interface
  • pact-agent config specifies journal endpoints on management IP
  • lattice-node-agent config specifies quorum endpoints on HSN IP
  • SPIRE trust domain covers both networks (certs are interface-agnostic)
  • Monitoring must cover both networks for complete system health visibility
  • Network configuration in vCluster overlays must specify both interfaces
  • For scale beyond ~50,000 nodes, boot overlay streaming may need to move to HSN or use a multicast/CDN approach on management net

References

  • ADR-001: Raft quorum deployment modes (standalone/co-located)
  • ADR-006: Pact as init (boot ordering, service supervision)
  • ADR-008: Node enrollment (management net for enrollment, HSN for post-boot)
  • ADR-015: hpc-core shared contracts (network-agnostic identity)
  • specs/invariants.md: R3 (quorum ports), A8 (boot time target), A9 (cached config)
  • specs/failure-modes.md: F3 (partition), F28 (network config failure)