System Architecture

See ../../ARCHITECTURE.md for the high-level overview. This document covers detailed design and data flows.

Design Requirements

  • R1: Eventual consistency with acknowledged drift
  • R2: Immutable configuration log
  • R3: Optimistic concurrency with commit windows
  • R4: Admin-native CLI + pact shell (replacing SSH)
  • R5: Streaming boot configuration (<2s for 10k nodes)
  • R6: Degradation-aware (partial HW failure → revised promises)
  • R7: vCluster-aware grouping
  • R8: IAM and policy enforcement (OIDC/RBAC/audit)
  • R9: Blacklist-based drift detection with learning mode
  • R10: Emergency mode (extended window + no rollback + full audit)
  • R11: Observe-first deployment
  • R12: Agentic API (MCP tool-use)
  • R13: Process supervision (pact as init, systemd fallback)
  • R14: No SSH (pact shell + pact exec)

Raft Deployment

pact-journal runs its own Raft group, independent from lattice’s quorum. Two deployment modes (see ADR-001):

  • Standalone: pact-journal on dedicated management nodes (3-5 nodes)
  • Co-located: pact-journal and lattice-server on the same management nodes, each with its own Raft group on separate ports

Pact is the incumbent in co-located mode — its quorum is already running when lattice starts. Lattice configures its peers to point to the same hostnames. No protocol-level coupling; co-location is a deployment decision.

Consistency Model

pact is AP in CAP terms: it favors availability over consistency. During partitions, nodes use cached config and cached policy. Conflicts are resolved by timestamp ordering, with admin-committed changes taking precedence over auto-converged ones. A node that can’t reach the config server keeps running its workload.
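The precedence rule can be illustrated with a small sketch. It assumes one particular reading of "admin-committed > auto-converge": an admin-committed entry always beats an auto-converged one, and timestamps only order entries of the same origin. The `Entry` and `Origin` types and their field names are hypothetical and are not taken from pact's data model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Origin(IntEnum):
    # Higher value wins: admin-committed outranks auto-converge (assumed reading).
    AUTO_CONVERGE = 0
    ADMIN_COMMITTED = 1


@dataclass(frozen=True)
class Entry:
    key: str
    value: str
    origin: Origin
    timestamp: float  # seconds since epoch


def resolve(a: Entry, b: Entry) -> Entry:
    """Pick the winning entry when two entries target the same key."""
    if a.key != b.key:
        raise ValueError("entries refer to different keys")
    # Compare origin class first, then timestamp (later wins).
    return max(a, b, key=lambda e: (e.origin, e.timestamp))
```

Under this reading, an older admin commit still wins over a newer auto-converged value; if the real rule uses origin only as a tie-breaker, the tuple order in `resolve` would be reversed.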

During partitions, pact-agent falls back to cached VClusterPolicy for authorization (role bindings and whitelists only — complex OPA rules and two-person approval require connectivity). Degraded-mode decisions are logged locally and replayed to the journal when connectivity is restored.
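The degraded-mode path might look roughly like the following sketch. `CachedPolicy`, `DegradedAuthorizer`, and all field names are hypothetical stand-ins for the cached VClusterPolicy, and denying requests that need OPA rules or two-person approval is an assumption about how "require connectivity" is enforced.

```python
from dataclasses import dataclass, field


@dataclass
class CachedPolicy:
    # Hypothetical shape: role -> users, and role -> whitelisted commands.
    role_bindings: dict[str, set[str]]
    whitelists: dict[str, set[str]]


@dataclass
class DegradedAuthorizer:
    policy: CachedPolicy
    pending: list[dict] = field(default_factory=list)  # local decision log

    def authorize(self, user: str, command: str,
                  needs_online_check: bool = False) -> bool:
        if needs_online_check:
            # OPA rules and two-person approval need connectivity:
            # this sketch denies them while partitioned (assumption).
            allowed = False
        else:
            allowed = any(
                user in users
                and command in self.policy.whitelists.get(role, set())
                for role, users in self.policy.role_bindings.items()
            )
        # Every degraded-mode decision is logged locally for later replay.
        self.pending.append(
            {"user": user, "command": command, "allowed": allowed}
        )
        return allowed

    def replay(self, send) -> int:
        """Send logged decisions to the journal once connectivity returns."""
        sent = 0
        while self.pending:
            send(self.pending[0])  # if send raises, the entry stays queued
            self.pending.pop(0)
            sent += 1
        return sent
```

Entries are removed from the local log only after `send` returns, so a failed replay can be retried without losing decisions.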

Commit Window Formula

window_seconds = base_window / (1 + drift_magnitude * sensitivity)

Examples with default base_window=900s, drift_sensitivity=2.0:

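| Drift          | Example             | Window  | Rationale             |
|----------------|---------------------|---------|-----------------------|
| Tiny (0.05)    | Single sysctl       | ~14 min | Low risk              |
| Small (0.15)   | Config file edit    | ~12 min | Routine               |
| Moderate (0.3) | Mount + service     | ~9 min  | Needs attention       |
| Large (0.8)    | Multiple categories | ~6 min  | Significant deviation |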

Higher drift_sensitivity (e.g. 5.0 for regulated vClusters) compresses windows more aggressively: the same large drift gets ~3 min instead of ~6.
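The formula and the example rows can be reproduced with a few lines of Python. The function name and the non-negative input check are illustrative assumptions; only the formula and the default values come from this section.

```python
def commit_window_seconds(drift_magnitude: float,
                          base_window: float = 900.0,
                          sensitivity: float = 2.0) -> float:
    """window_seconds = base_window / (1 + drift_magnitude * sensitivity)"""
    if drift_magnitude < 0 or sensitivity < 0:
        # Assumed constraint: negative inputs would lengthen the window.
        raise ValueError("drift_magnitude and sensitivity must be non-negative")
    return base_window / (1 + drift_magnitude * sensitivity)


for drift in (0.05, 0.15, 0.3, 0.8):
    minutes = commit_window_seconds(drift) / 60
    print(f"drift={drift:<4} window={minutes:.1f} min")
# drift=0.05 window=13.6 min
# drift=0.15 window=11.5 min
# drift=0.3  window=9.4 min
# drift=0.8  window=5.8 min
```

With `sensitivity=5.0`, a drift of 0.8 gives 900 / (1 + 4.0) = 180 s, which matches the ~3 min figure for regulated vClusters.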

Emergency mode: pact emergency --reason "..." extends the commit window to 4h and suspends rollback.

Data Flows

Boot-Time (10,000 nodes)

PXE → SquashFS → pact-agent (PID 1)
  → mTLS auth → Phase 1 stream (vCluster overlay, ~200KB, any replica)
  → apply config → Phase 2 (node delta, <1KB)
  → start services → CapabilityReport → ready

Admin Change

pact exec / pact shell → command executed on node
  → state observer detects change → drift evaluator
  → commit window opens (proportional to drift)
  → admin commits (node delta) or window expires (rollback)
  → to codify fleet-wide: pact promote → pact apply (updates overlay)
  → journal records everything

Commit Lifecycle and Reboot Persistence

Manual changes (via exec/shell) that are committed become node-level state deltas in the journal. The journal maintains two layers of declared state:

vCluster overlay (shared)     e.g. "all ml-training nodes mount /scratch"
  + node deltas (per-node)    e.g. "node042 has extra sysctl from debugging"
  = effective declared state  (what the agent applies at boot)

On reboot, the agent streams both layers from the journal. Committed node deltas are reapplied automatically — manual changes survive reboots as long as they remain in the journal’s node state.
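As a minimal sketch of the layering, the effective declared state can be modeled as the overlay with the node delta merged on top. The nested-dict representation and the rule that a node delta overrides the overlay on conflicting keys are assumptions made for illustration; the journal's actual schema is not shown here.

```python
def effective_state(overlay: dict, node_delta: dict) -> dict:
    """Apply the overlay first, then the node delta; nested tables merge key by key."""
    merged = dict(overlay)
    for key, value in node_delta.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested tables instead of replacing them wholesale.
            merged[key] = effective_state(merged[key], value)
        else:
            # Assumed precedence: the per-node value wins on conflict.
            merged[key] = value
    return merged


# Hypothetical data mirroring the example above.
overlay = {"mounts": {"/scratch": "lustre"}, "sysctl": {"vm.swappiness": "10"}}
node042 = {"sysctl": {"kernel.perf_event_paranoid": "1"}}
state = effective_state(overlay, node042)
```

Here `state` keeps the shared `/scratch` mount and carries both sysctl entries, while `overlay` itself is left unmodified.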

However, accumulating ad-hoc node deltas is not desirable long-term. They represent drift that was accepted rather than codified. Over time, nodes with many committed deltas diverge from their vCluster peers, making fleet-wide reasoning harder.

The intended lifecycle for manual changes:

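| Stage     | State           | Action                                                |
|-----------|-----------------|-------------------------------------------------------|
| Detected  | Drift           | Observer flags divergence from declared state         |
| Committed | Node delta      | Admin commits change, recorded in journal             |
| Promoted  | vCluster overlay | pact apply updates the overlay to include the change |
| Expired   | Cleaned up      | pact rollback or superseded by overlay update         |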

Promotion path: when a committed manual change proves correct, the admin promotes it to the vCluster overlay:

  1. pact diff --committed <node> — review accumulated node deltas
  2. pact promote <node> --dry-run — preview the generated overlay TOML
  3. pact promote <node> > changes.toml — export, review/edit
  4. pact apply changes.toml — apply to the vCluster overlay

This updates the shared overlay and makes the node-level deltas redundant.

Expiry: node deltas with a ttl field expire automatically. Emergency-mode changes default to a TTL matching the emergency window. Changes without a TTL persist until explicitly rolled back or superseded.
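The expiry rule can be sketched as a filter over committed deltas. The `NodeDelta` type, its field names, and the use of seconds for `ttl` are hypothetical; the 4h emergency value is taken from the emergency window described earlier.

```python
from dataclasses import dataclass
from typing import Optional

EMERGENCY_WINDOW_SECONDS = 4 * 60 * 60  # 4h emergency window


@dataclass(frozen=True)
class NodeDelta:
    name: str
    committed_at: float          # seconds since epoch
    ttl: Optional[float] = None  # seconds; None means no automatic expiry


def live_deltas(deltas: list[NodeDelta], now: float) -> list[NodeDelta]:
    """Keep deltas that have no TTL or whose TTL has not yet elapsed."""
    return [
        d for d in deltas
        if d.ttl is None or now < d.committed_at + d.ttl
    ]
```

A delta committed during an emergency would carry `ttl=EMERGENCY_WINDOW_SECONDS` and drop out of the result once four hours have passed, while a delta with `ttl=None` stays until it is rolled back or superseded.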

Hardware Degradation

GPU soft-fails → agent detects (NVML for NVIDIA, ROCm SMI for AMD, or eBPF)
  → CapabilityReport updated → scheduler adjusts eligibility
  → DriftDetected in journal → admin ack if policy requires

Integration Delegation

| Action            | Owner            | pact does                                                         |
|-------------------|------------------|-------------------------------------------------------------------|
| Reboot node       | CSM or OpenCHAMI | pact reboot calls CAPMC (CSM) or SMD Redfish (OpenCHAMI)          |
| Re-image node     | CSM or OpenCHAMI | pact reimage calls BOS (CSM) or Redfish PowerCycle (OpenCHAMI)    |
| Drain node        | Lattice          | pact drain calls lattice scheduler API                            |
| Cordon node       | Lattice          | pact cordon calls lattice scheduler API                           |
| Job status        | Lattice          | pact jobs calls lattice API                                       |
| Config management | pact (native)    | Direct implementation                                             |
| Remote access     | pact (native)    | Shell server, exec endpoint                                       |
| Service lifecycle | pact (native)    | PactSupervisor or SystemdBackend                                  |
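The delegated rows of the table can be read as a lookup from (command, owner) to the external interface that pact calls. The sketch below encodes only what the table states; the `DELEGATION` mapping, the lowercase owner keys, and the `delegate_target` helper are hypothetical and do not reflect pact's internal routing.

```python
# (pact command, owning system) -> external interface named in the table.
DELEGATION = {
    ("reboot", "csm"): "CAPMC",
    ("reboot", "openchami"): "SMD Redfish",
    ("reimage", "csm"): "BOS",
    ("reimage", "openchami"): "Redfish PowerCycle",
    ("drain", "lattice"): "lattice scheduler API",
    ("cordon", "lattice"): "lattice scheduler API",
    ("jobs", "lattice"): "lattice API",
}


def delegate_target(command: str, owner: str) -> str:
    """Return the external interface for a delegated command."""
    try:
        return DELEGATION[(command, owner.lower())]
    except KeyError:
        raise ValueError(
            f"no delegation for {command!r} with owner {owner!r}"
        ) from None
```

Native actions (config management, remote access, service lifecycle) are deliberately absent from the mapping because pact implements them directly.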