System Architecture

See ../../ARCHITECTURE.md for the high-level overview. This document covers detailed design and data flows.

Design Requirements

R1: Eventual consistency with acknowledged drift
R2: Immutable configuration log
R3: Optimistic concurrency with commit windows
R4: Admin-native CLI + pact shell (replacing SSH)
R5: Streaming boot configuration (<2s for 10k nodes)
R6: Degradation-aware (partial HW failure → revised promises)
R7: vCluster-aware grouping
R8: IAM and policy enforcement (OIDC/RBAC/audit)
R9: Blacklist-based drift detection with learning mode
R10: Emergency mode (extended window + no rollback + full audit)
R11: Observe-first deployment
R12: Agentic API (MCP tool-use)
R13: Process supervision (pact as init, systemd fallback)
R14: No SSH (pact shell + pact exec)

Raft Deployment

pact-journal runs its own Raft group, independent from lattice’s quorum. Two deployment modes (see ADR-001):

Standalone: pact-journal on dedicated management nodes (3-5 nodes)
Co-located: pact-journal and lattice-server on the same management nodes, each with its own Raft group on separate ports

Pact is the incumbent in co-located mode — its quorum is already running when lattice starts. Lattice configures its peers to point to the same hostnames. No protocol-level coupling; co-location is a deployment decision.

Consistency Model

The system is AP in CAP terms: nodes use cached config and cached policy during partitions. Conflicts are resolved by timestamp ordering, with admin-committed changes taking precedence over auto-converged ones. A node that can’t reach the config server keeps running its workload.
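The resolution rule can be sketched as follows. This is a minimal illustration, not pact's implementation: it assumes an admin-committed change always beats an auto-converged one and that timestamps only decide between changes of the same origin. The `Change` type, its fields, and `resolve` are hypothetical names.

```python
from dataclasses import dataclass

# Assumed precedence ranks: a higher rank wins regardless of timestamp.
PRECEDENCE = {"auto-converge": 0, "admin-committed": 1}


@dataclass(frozen=True)
class Change:
    key: str          # config key the change touches
    value: str
    origin: str       # "admin-committed" or "auto-converge"
    timestamp: float  # seconds since epoch


def resolve(a: Change, b: Change) -> Change:
    """Pick the winner between two conflicting changes to the same key."""
    # Compare by origin rank first, then by timestamp (later wins).
    return max(a, b, key=lambda c: (PRECEDENCE[c.origin], c.timestamp))


older_admin = Change("vm.swappiness", "10", "admin-committed", 100.0)
newer_auto = Change("vm.swappiness", "60", "auto-converge", 200.0)
winner = resolve(older_admin, newer_auto)  # the admin change, despite being older
```

If the real rule lets a newer auto-converged change override an older admin commit, the sort key would need to put the timestamp first.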

During partitions, pact-agent falls back to cached VClusterPolicy for authorization (role bindings and whitelists only — complex OPA rules and two-person approval require connectivity). Degraded-mode decisions are logged locally and replayed to the journal when connectivity is restored.
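The log-then-replay behaviour can be sketched as a local buffer that is drained in order once the journal is reachable again. The class name, the dictionary shape of a decision, and the `send` callback are assumptions for illustration; the source does not describe the agent's actual storage or wire format.

```python
from collections import deque
from typing import Callable


class DegradedDecisionLog:
    """Buffers authorization decisions made while the journal is unreachable."""

    def __init__(self) -> None:
        self._pending = deque()

    def record(self, decision: dict) -> None:
        # Append in arrival order so replay preserves the original sequence.
        self._pending.append(decision)

    def replay(self, send: Callable[[dict], bool]) -> int:
        """Send buffered decisions oldest-first; stop at the first failure."""
        sent = 0
        while self._pending:
            if not send(self._pending[0]):
                break  # still partitioned: keep the rest for the next attempt
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


log = DegradedDecisionLog()
log.record({"user": "alice", "action": "exec", "allowed": True})
log.record({"user": "bob", "action": "shell", "allowed": False})

journal = []  # stand-in for the remote journal
replayed = log.replay(lambda d: journal.append(d) is None)
```

An entry is removed from the buffer only after `send` reports success, so a partition that returns mid-replay loses nothing.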

Commit Window Formula

window_seconds = base_window / (1 + drift_magnitude * sensitivity)

Examples with the defaults base_window=900s and drift_sensitivity=2.0 (the sensitivity term in the formula):

Drift	Example	Window	Rationale
Tiny (0.05)	Single sysctl	~14 min	Low risk
Small (0.15)	Config file edit	~12 min	Routine
Moderate (0.3)	Mount + service	~9 min	Needs attention
Large (0.8)	Multiple categories	~6 min	Significant deviation

Higher drift_sensitivity (e.g. 5.0 for regulated vClusters) compresses windows more aggressively: the same large drift gets ~3 min instead of ~6.
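The table values can be reproduced directly from the formula. The sketch below is a plain transcription; the function name and the input validation are additions for illustration.

```python
def commit_window_seconds(drift_magnitude: float,
                          base_window: float = 900.0,
                          sensitivity: float = 2.0) -> float:
    """window_seconds = base_window / (1 + drift_magnitude * sensitivity)"""
    if drift_magnitude < 0 or sensitivity < 0:
        raise ValueError("drift_magnitude and sensitivity must be non-negative")
    return base_window / (1 + drift_magnitude * sensitivity)


for drift in (0.05, 0.15, 0.3, 0.8):
    minutes = commit_window_seconds(drift) / 60
    print(f"drift={drift}: {minutes:.1f} min")
# drift=0.05: 13.6 min
# drift=0.15: 11.5 min
# drift=0.3: 9.4 min
# drift=0.8: 5.8 min

# Regulated vCluster example: sensitivity 5.0 with large drift gives 3 min.
regulated_minutes = commit_window_seconds(0.8, sensitivity=5.0) / 60
```

The window shrinks as drift grows, so the largest deviations get the least time before a commit-or-rollback decision is forced.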

Emergency mode: pact emergency --reason "..." extends the commit window to 4h and suspends rollback.

Data Flows

Boot-Time (10,000 nodes)

PXE → SquashFS → pact-agent (PID 1)
  → mTLS auth → Phase 1 stream (vCluster overlay, ~200KB, any replica)
  → apply config → Phase 2 (node delta, <1KB)
  → start services → CapabilityReport → ready

Admin Change

pact exec / pact shell → command executed on node
  → state observer detects change → drift evaluator
  → commit window opens (shorter for larger drift)
  → admin commits (node delta) or window expires (rollback)
  → to codify fleet-wide: pact promote → pact apply (updates overlay)
  → journal records everything
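The commit-or-rollback step of this flow reduces to a deadline check. The sketch below covers only the normal (non-emergency) path; the function name and the outcome strings are hypothetical.

```python
from typing import Optional


def window_outcome(opened_at: float,
                   window_seconds: float,
                   committed_at: Optional[float] = None) -> str:
    """Return what happens to a detected change once its window is settled."""
    deadline = opened_at + window_seconds
    if committed_at is not None and committed_at <= deadline:
        return "node-delta"  # admin committed in time: recorded as a node delta
    return "rollback"        # no commit before the deadline: change is reverted


# A moderate drift (about a 562 s window) committed after 5 minutes is kept.
kept = window_outcome(opened_at=0.0, window_seconds=562.5, committed_at=300.0)
# The same drift with no commit is rolled back when the window expires.
reverted = window_outcome(opened_at=0.0, window_seconds=562.5)
```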

Commit Lifecycle and Reboot Persistence

Manual changes (via exec/shell) that are committed become node-level state deltas in the journal. The journal maintains two layers of declared state:

vCluster overlay (shared)     e.g. "all ml-training nodes mount /scratch"
  + node deltas (per-node)    e.g. "node042 has extra sysctl from debugging"
  = effective declared state  (what the agent applies at boot)

On reboot, the agent streams both layers from the journal. Committed node deltas are reapplied automatically — manual changes survive reboots as long as they remain in the journal’s node state.
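The layering can be sketched as a keyed merge. This assumes a flat key/value representation in which a node delta overrides the overlay on a key collision; the key names and values are made up for illustration, and the journal's real schema is not described here.

```python
def effective_state(overlay: dict, node_delta: dict) -> dict:
    """Combine the shared vCluster overlay with one node's committed deltas."""
    merged = dict(overlay)     # copy so the shared overlay is never mutated
    merged.update(node_delta)  # assumed: node-level entries win on conflict
    return merged


ml_training_overlay = {"mount:/scratch": "enabled"}
node042_delta = {"sysctl:kernel.perf_event_paranoid": "1"}  # left over from debugging

state = effective_state(ml_training_overlay, node042_delta)
# state holds both the shared mount and node042's extra sysctl
```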

However, accumulating ad-hoc node deltas is not desirable long-term. They represent drift that was accepted rather than codified. Over time, nodes with many committed deltas diverge from their vCluster peers, making fleet-wide reasoning harder.

The intended lifecycle for manual changes:

Stage	State	Action
Detected	Drift	Observer flags divergence from declared state
Committed	Node delta	Admin commits change, recorded in journal
Promoted	vCluster overlay	`pact apply` updates the overlay to include the change
Expired	Cleaned up	`pact rollback` or superseded by overlay update

Promotion path: when a committed manual change proves correct, the admin promotes it to the vCluster overlay:

pact diff --committed <node> — review accumulated node deltas
pact promote <node> --dry-run — preview the generated overlay TOML
pact promote <node> > changes.toml — export, review/edit
pact apply changes.toml — apply to the vCluster overlay

This updates the shared overlay and makes the node-level deltas redundant.

Expiry: node deltas with a ttl field expire automatically. Emergency-mode changes default to a TTL matching the emergency window. Changes without TTL persist until explicitly rolled back or superseded.
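The expiry rule can be sketched as a filter over committed deltas. The `NodeDelta` type and its field names other than `ttl` are hypothetical, and expressing the TTL in seconds is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

EMERGENCY_WINDOW_SECONDS = 4 * 60 * 60  # the 4h emergency window


@dataclass
class NodeDelta:
    key: str
    value: str
    committed_at: float          # seconds since epoch
    ttl: Optional[float] = None  # seconds; None means no automatic expiry


def is_expired(delta: NodeDelta, now: float) -> bool:
    # Deltas without a TTL persist until rolled back or superseded.
    return delta.ttl is not None and now >= delta.committed_at + delta.ttl


def active_deltas(deltas: List[NodeDelta], now: float) -> List[NodeDelta]:
    """Keep only the deltas that should still be applied at time `now`."""
    return [d for d in deltas if not is_expired(d, now)]


deltas = [
    NodeDelta("sysctl:a", "1", committed_at=0.0),  # no TTL
    NodeDelta("sysctl:b", "1", committed_at=0.0, ttl=EMERGENCY_WINDOW_SECONDS),
]
still_active = active_deltas(deltas, now=5 * 60 * 60)  # five hours later
```

Five hours after commit, only the delta without a TTL remains; the emergency-mode delta has aged out with its window.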

Hardware Degradation

GPU soft-fails → agent detects (NVML for NVIDIA, ROCm SMI for AMD, or eBPF)
  → CapabilityReport updated → scheduler adjusts eligibility
  → DriftDetected in journal → admin ack if policy requires

Integration Delegation

Action	Owner	pact does
Reboot node	CSM or OpenCHAMI	`pact reboot` calls CAPMC (CSM) or SMD Redfish (OpenCHAMI)
Re-image node	CSM or OpenCHAMI	`pact reimage` calls BOS (CSM) or Redfish PowerCycle (OpenCHAMI)
Drain node	Lattice	`pact drain` calls lattice scheduler API
Cordon node	Lattice	`pact cordon` calls lattice scheduler API
Job status	Lattice	`pact jobs` calls lattice API
Config management	pact (native)	Direct implementation
Remote access	pact (native)	Shell server, exec endpoint
Service lifecycle	pact (native)	PactSupervisor or SystemdBackend
