Deployment Guide

This guide covers deploying pact in production. pact consists of three components that need to be deployed:

  1. pact-journal – a 3- or 5-node Raft quorum (management nodes)
  2. pact-agent – one instance on every compute node
  3. pact CLI – admin workstations

Deploy scripts in scripts/deploy/ automate the full deployment. They are cloud-agnostic and reusable on bare metal. For GCP-specific infrastructure (VMs, networking), see infra/gcp/.

OS Requirements

Release binaries are built on Rocky Linux 9 (glibc 2.34). Compatible distributions:

  • RHEL 9+ / Rocky 9+ / Alma 9+
  • Ubuntu 22.04+
  • Debian 12+ (bookworm)
  • SLES 15 SP4+

Debian 11 and Ubuntu 20.04 are not supported (glibc too old).
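A quick pre-flight check that a target host meets the glibc floor (a sketch: `getconf GNU_LIBC_VERSION` only exists on glibc-based systems, and `sort -V` assumes GNU coreutils):

```shell
# Succeeds when version $1 >= version $2 (GNU sort -V does the comparison).
version_ge() {
    printf '%s\n%s\n' "$2" "$1" | sort -V -C
}

# getconf prints e.g. "glibc 2.34"; the release binaries need >= 2.34.
installed="$(getconf GNU_LIBC_VERSION | awk '{print $2}')"
if version_ge "$installed" 2.34; then
    echo "glibc $installed: supported"
else
    echo "glibc $installed: too old for the release binaries"
fi
```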

Release artifacts per architecture (x86_64 and aarch64):

  • pact-platform-{arch}.tar.gz — pact (CLI), pact-journal, pact-mcp
  • pact-agent-{arch}-pact.tar.gz — agent with PactSupervisor (PID 1 mode)
  • pact-agent-{arch}-systemd.tar.gz — agent with systemd backend

All agent variants include GPU support (NVIDIA + AMD) — no separate per-GPU builds.

Important: When building from source, use --features ebpf for the agent to enable eBPF-based state observation. Without this feature, eBPF probes are compiled out.

Node Name Resolution

Pact uses DNS-based agent discovery: pact exec <node-id> resolves the node ID to http://<node-id>:9445. Ensure node IDs are resolvable via DNS or /etc/hosts.
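A resolution check before relying on pact exec (node names below are placeholders for your own node IDs):

```shell
# Fail fast if any node ID is missing from both DNS and /etc/hosts.
for node in compute-1 compute-2; do
    if getent hosts "$node" > /dev/null; then
        echo "$node resolves (pact exec will target http://$node:9445)"
    else
        echo "ERROR: $node does not resolve; add a DNS or /etc/hosts entry" >&2
    fi
done
```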

Agent Auth Configuration

The agent shell server validates incoming OIDC tokens independently from the journal. Configure [agent.shell.auth] in the agent TOML:

[agent.shell.auth]
issuer = "https://your-idp/realm"
audience = "pact-cli"
jwks_url = "https://your-idp/realm/protocol/openid-connect/certs"
# Optional: HMAC secret for dev/test (production uses JWKS only)
# hmac_secret = "shared-secret"

Without this section, the agent defaults to fail-closed behavior: JWKS validation is required and no HMAC secret is accepted.

Prerequisites

Download release artifacts from GitHub releases. You can create a provisioning bundle for easy distribution:

# Download release artifacts
mkdir -p /tmp/pact-release
gh release download v2026.1.196 --dir /tmp/pact-release \
    --pattern "pact-platform-x86_64.tar.gz" \
    --pattern "pact-agent-x86_64-pact.tar.gz"

# Create provisioning bundle (includes scripts + systemd units)
scripts/deploy/make-provision-bundle.sh /tmp/pact-release /tmp/pact-provision.tar.gz

# Upload to all nodes (single file, no scp --recurse issues)
scp /tmp/pact-provision.tar.gz node:/tmp/
ssh node 'cd /tmp && tar xzf pact-provision.tar.gz'

Or manually:

  • Unpack binaries to /opt/pact/bin/ on all nodes
  • Copy infra/systemd/ to /opt/pact/systemd/ on all nodes
  • Copy scripts/deploy/ to /opt/pact/deploy/ on all nodes

Step 1: Create CA and distribute to all nodes

# On the first management node:
/opt/pact/deploy/setup-ca.sh /etc/pact/certs mgmt-1

# Then copy /etc/pact/ca/ to ALL other nodes (management + compute):
scp -r /etc/pact/ca/ mgmt-2:/etc/pact/ca/
scp -r /etc/pact/ca/ mgmt-3:/etc/pact/ca/
scp -r /etc/pact/ca/ compute-1:/etc/pact/ca/
# ...etc

Step 2: Install journal on management nodes

# Peer format: id=addr (id matches node-id argument)
PEERS="1=mgmt-1:9443,2=mgmt-2:9443,3=mgmt-3:9443"

# Node 1 — with --bootstrap to initialize the Raft cluster
/opt/pact/deploy/install-management.sh 1 mgmt-1 "$PEERS" --bootstrap

# Nodes 2 and 3 — without --bootstrap (join existing cluster)
/opt/pact/deploy/install-management.sh 2 mgmt-2 "$PEERS"
/opt/pact/deploy/install-management.sh 3 mgmt-3 "$PEERS"

# Wait ~10 seconds for Raft membership replication, then verify:
/opt/pact/deploy/bootstrap-quorum.sh mgmt-1:9443
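The id=addr peer string can be sanity-checked locally before running the installer. This sketch only shows how the entries decompose; the install script's actual parsing may differ:

```shell
PEERS="1=mgmt-1:9443,2=mgmt-2:9443,3=mgmt-3:9443"

# Split the comma-separated list into id=addr pairs and print each member.
echo "$PEERS" | tr ',' '\n' | while IFS='=' read -r id addr; do
    echo "raft node $id -> $addr"
done
```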

Step 3: Install agent on compute nodes

JOURNALS="mgmt-1:9443,mgmt-2:9443,mgmt-3:9443"

/opt/pact/deploy/install-compute.sh compute-1 ml-training "$JOURNALS"
/opt/pact/deploy/install-compute.sh compute-2 ml-training "$JOURNALS"

Step 4: Install monitoring (optional)

/opt/pact/deploy/install-monitoring.sh mgmt-1,mgmt-2,mgmt-3

Step 5: Validate

# Run test matrix (v1=pact-only, v2=systemd, v3=pact+lattice, v4=systemd+lattice)
/opt/pact/deploy/validate.sh v1 mgmt-1:9443 compute-1,compute-2

Manual Deployment (step-by-step)

Journal Quorum Setup

The journal is pact’s distributed immutable log, backed by a Raft consensus group. Deploy it on dedicated management nodes or co-located with lattice (see ADR-001).

Install the binary

Download the platform binaries for your architecture from the latest release:

curl -LO https://github.com/witlox/pact/releases/latest/download/pact-platform-x86_64.tar.gz
tar xzf pact-platform-x86_64.tar.gz -C /usr/local/bin/

This installs pact-journal, pact (CLI), and pact-mcp.

3-Node Quorum (Standard)

A 3-node quorum tolerates 1 node failure. Suitable for most deployments.
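The failure tolerance follows from Raft's majority rule: n nodes need floor(n/2) + 1 votes to commit, so floor((n - 1)/2) nodes can fail:

```shell
# Raft quorum arithmetic (integer division):
#   majority = n/2 + 1, tolerated failures = (n - 1)/2
for n in 3 5; do
    echo "n=$n majority=$(( n / 2 + 1 )) tolerates=$(( (n - 1) / 2 ))"
done
# → n=3 majority=2 tolerates=1
# → n=5 majority=3 tolerates=2
```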

Create /etc/pact/journal.env on each journal node:

journal-1:

PACT_JOURNAL_NODE_ID=1
PACT_JOURNAL_LISTEN=0.0.0.0:9443
PACT_JOURNAL_DATA_DIR=/var/lib/pact/journal
PACT_JOURNAL_PEERS=1=journal-1:9443,2=journal-2:9443,3=journal-3:9443

journal-2:

PACT_JOURNAL_NODE_ID=2
PACT_JOURNAL_LISTEN=0.0.0.0:9443
PACT_JOURNAL_DATA_DIR=/var/lib/pact/journal
PACT_JOURNAL_PEERS=1=journal-1:9443,2=journal-2:9443,3=journal-3:9443

journal-3:

PACT_JOURNAL_NODE_ID=3
PACT_JOURNAL_LISTEN=0.0.0.0:9443
PACT_JOURNAL_DATA_DIR=/var/lib/pact/journal
PACT_JOURNAL_PEERS=1=journal-1:9443,2=journal-2:9443,3=journal-3:9443

Bootstrap: On the first deploy, run pact-journal --bootstrap on node 1 to initialize the Raft membership. The membership replicates to nodes 2 and 3 automatically within seconds. Do NOT use --bootstrap on subsequent restarts or on nodes 2/3.

Create /etc/pact/journal.toml (same on all nodes, node ID comes from env):

[journal]
listen_addr = "0.0.0.0:9443"
data_dir = "/var/lib/pact/journal"

[journal.raft]
members = [
    "1:journal-1.mgmt:9444",
    "2:journal-2.mgmt:9444",
    "3:journal-3.mgmt:9444"
]
snapshot_interval = 10000

[journal.streaming]
max_concurrent_boot_streams = 15000

[policy]
enabled = true

[policy.iam]
oidc_issuer = "https://auth.example.org/realms/hpc"
oidc_audience = "pact"

[policy.engine]
type = "opa"
opa_endpoint = "http://localhost:8181/v1/data/pact"

[telemetry]
log_level = "info"
log_format = "json"
prometheus_enabled = true
prometheus_listen = "0.0.0.0:9091"
loki_enabled = true
loki_endpoint = "http://loki.mgmt:3100/loki/api/v1/push"

5-Node Quorum (High Availability)

A 5-node quorum tolerates 2 node failures. Recommended for large deployments or when journal availability is critical (e.g., boot-time config streaming for thousands of nodes).

Configuration is identical to 3-node, with two additional members:

[journal.raft]
members = [
    "1:journal-1.mgmt:9444",
    "2:journal-2.mgmt:9444",
    "3:journal-3.mgmt:9444",
    "4:journal-4.mgmt:9444",
    "5:journal-5.mgmt:9444"
]

Co-Located with Lattice

If running on the same management nodes as lattice, use separate ports and data directories. pact is the incumbent: journal quorum starts before lattice.

| Component | Raft Port | gRPC Port | Data Dir |
|---|---|---|---|
| pact-journal | 9444 | 9443 | /var/lib/pact/journal |
| lattice-server | 9000 | 50051 | /var/lib/lattice/raft |

Lattice bootstrap: Like pact-journal, lattice-server requires --bootstrap on first start of node 1 to initialize the Raft cluster. Subsequent restarts must NOT use --bootstrap. See lattice deployment guide for details.

Lattice agent: Deploy lattice-agent on compute nodes. It registers with the lattice scheduler via heartbeats. Pact’s supercharged commands (drain, cordon, uncordon) delegate to lattice — they require the node to be registered in both pact (enrollment) and lattice (agent heartbeat).

# Set delegation endpoint and auth token so pact CLI can reach lattice
export PACT_LATTICE_ENDPOINT=http://mgmt-1:50051
export PACT_LATTICE_TOKEN="$PACT_TOKEN"   # reuse the same OIDC token
pact drain compute-1    # → delegates to lattice (authenticated)

Important: PACT_LATTICE_TOKEN is required when lattice-server enforces auth. The pact admin’s OIDC token works if both systems share the same IdP and lattice recognizes pact_role claims. Set LATTICE_OIDC_HMAC_SECRET on lattice-server to the same value as PACT_OIDC_HMAC_SECRET for HMAC token compatibility.

Lattice agent auth: Set LATTICE_AGENT_TOKEN env var on compute nodes for the lattice-agent to authenticate to the server (machine identity via token fallback when mTLS/SPIRE is not configured).

Drain behavior: Drain with active allocations transitions to Draining. The node moves to Drained only after all allocations complete or are cancelled. Undrain only works from Drained state — cancel remaining allocations first if immediate undrain is needed.

Port Summary

| Port | Service | Protocol |
|---|---|---|
| 9443 | pact-journal gRPC | gRPC (config, streaming) |
| 9444 | pact-journal Raft | Raft consensus |
| 9445 | pact-agent shell/exec | gRPC |
| 9091 | pact-journal metrics | HTTP (Prometheus) |

Agent Installation

Install the binary

Download the agent variant for your architecture and supervisor mode from the latest release:

# PactSupervisor mode (PID 1 / diskless HPC) — includes NVIDIA + AMD GPU support
curl -LO https://github.com/witlox/pact/releases/latest/download/pact-agent-x86_64-pact.tar.gz
tar xzf pact-agent-x86_64-pact.tar.gz
sudo mv pact-agent /usr/local/bin/

# Or systemd mode (traditional service)
curl -LO https://github.com/witlox/pact/releases/latest/download/pact-agent-x86_64-systemd.tar.gz
tar xzf pact-agent-x86_64-systemd.tar.gz
sudo mv pact-agent /usr/local/bin/

For diskless nodes, include the pact-agent binary in the base SquashFS image provisioned by OpenCHAMI.

Create the config

Create /etc/pact/agent.toml:

[agent]
enforcement_mode = "enforce"

[agent.supervisor]
backend = "pact"

[agent.journal]
endpoints = ["journal-1.mgmt:9443", "journal-2.mgmt:9443", "journal-3.mgmt:9443"]
tls_enabled = true
tls_cert = "/etc/pact/agent.crt"
tls_key = "/etc/pact/agent.key"
tls_ca = "/etc/pact/ca.crt"

[agent.observer]
ebpf_enabled = true
inotify_enabled = true
netlink_enabled = true

[agent.shell]
enabled = true
listen = "0.0.0.0:9445"
whitelist_mode = "strict"

[agent.commit_window]
base_window_seconds = 900
drift_sensitivity = 2.0
emergency_window_seconds = 14400

[agent.blacklist]
patterns = ["/tmp/**", "/var/log/**", "/proc/**", "/sys/**", "/dev/**",
            "/run/user/**", "/run/pact/**", "/run/lattice/**"]

Node identity

The agent’s node_id is typically set via environment variable or auto-detected from the hostname. For diskless nodes, OpenCHAMI sets the hostname during PXE boot.
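A sketch of the hostname fallback described above. PACT_NODE_ID is an assumed variable name, modeled on the PACT_JOURNAL_* convention used elsewhere in this guide; check the agent documentation for the actual name:

```shell
# Use an explicit node ID when provided, otherwise the short hostname
# (which OpenCHAMI sets during PXE boot on diskless nodes).
node_id="${PACT_NODE_ID:-$(hostname -s)}"
echo "agent node_id: $node_id"
```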

Identity and mTLS Setup

pact uses mutual TLS for agent-to-journal communication. Identity is provisioned automatically – no manual certificate generation is needed for agents or journal nodes.

SPIRE (primary, when deployed)

When SPIRE is deployed at the site, pact uses it as the primary identity provider. Agents and journal nodes receive SPIFFE SVIDs (X.509 certificates) via SPIRE node attestation. SPIRE handles certificate rotation automatically.

Configure the SPIRE agent socket in the agent config:

[agent.identity]
provider = "spire"
spire_socket = "/run/spire/agent.sock"

Ephemeral CA (fallback, default)

When SPIRE is not available, the journal quorum generates an ephemeral CA at startup. Agents enroll via the CSR workflow – no manual certificate provisioning is required.

Enrollment workflow:

  1. Platform admin enrolls a node: pact enroll <node> --hardware-id <hw-id>
  2. The journal records the enrollment with the node’s expected hardware identity
  3. Agent boots and presents its hardware identity (TPM, SMBIOS UUID, or MAC-based)
  4. Agent generates a keypair and submits a CSR to the journal
  5. Journal validates the hardware identity against the enrollment record
  6. Journal signs the CSR with the ephemeral CA and returns the certificate
  7. Agent uses the signed certificate for mTLS from this point forward
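Steps 4-6 are a standard CSR exchange. The agent performs this internally; the openssl equivalent below only illustrates what is generated and submitted (the CN value is an assumption for illustration):

```shell
# Step 4 equivalent: generate a keypair and CSR
# (illustrative only -- pact does this itself, without openssl).
openssl req -new -newkey rsa:2048 -nodes \
    -keyout /tmp/agent.key -out /tmp/agent.csr \
    -subj "/CN=compute-1" 2>/dev/null

# Inspect the subject the journal would validate in step 5.
openssl req -in /tmp/agent.csr -noout -subject
```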

The ephemeral CA is regenerated when the journal quorum restarts. Agents automatically re-enroll to obtain new certificates.

CA cert distribution

Agents need the CA certificate bundle to validate journal server certificates. For diskless nodes, include the CA cert in the base SquashFS image:

  • /etc/pact/ca.crt – CA certificate bundle (all nodes)

For SPIRE deployments, the SPIRE trust bundle replaces this file. For ephemeral CA deployments, the journal serves the CA cert during the enrollment handshake.

Identity mapping

In PactSupervisor mode, identity mapping (pact-nss) is automatic – the agent maps SPIFFE IDs or certificate CNs to local UIDs without manual NSS configuration.

OIDC Provider Configuration

pact authenticates admins via OIDC tokens. Configure your identity provider (Keycloak, Auth0, Okta, etc.) with the following:

Create a pact client

  • Client ID: pact
  • Client type: Public (CLI) or Confidential (MCP server)
  • Redirect URI: http://localhost:8400/callback (for CLI login flow)

Define roles

Create these roles in your OIDC provider and assign them to users:

| Role | Description |
|---|---|
| pact-platform-admin | Full system access (2-3 people per site) |
| pact-ops-{vcluster} | Day-to-day ops for a vCluster |
| pact-viewer-{vcluster} | Read-only access |
| pact-regulated-{vcluster} | Ops with two-person approval |
| pact-service-agent | Machine identity for agents (mTLS) |
| pact-service-ai | Machine identity for MCP server |

Configure the journal

Set the OIDC issuer and audience in the journal config:

[policy.iam]
oidc_issuer = "https://auth.example.org/realms/hpc"
oidc_audience = "pact"

systemd Service Management

Install systemd units

Copy the provided unit files:

cp infra/systemd/pact-journal.service /etc/systemd/system/
cp infra/systemd/pact-agent.service /etc/systemd/system/

Enable and start

Journal nodes:

systemctl daemon-reload
systemctl enable pact-journal
systemctl start pact-journal

Compute nodes:

systemctl daemon-reload
systemctl enable pact-agent
systemctl start pact-agent

Check status

systemctl status pact-journal
journalctl -u pact-journal -f

Environment files

The systemd units read environment variables from:

  • /etc/pact/journal.env (journal nodes)
  • /etc/pact/agent.env (compute nodes)
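The units presumably load these files via systemd's EnvironmentFile= directive, along these lines (a sketch with an assumed ExecStart path; check the shipped unit files in infra/systemd/ for the authoritative contents):

```ini
# /etc/systemd/system/pact-journal.service (relevant excerpt, assumed layout)
[Service]
EnvironmentFile=/etc/pact/journal.env
ExecStart=/opt/pact/bin/pact-journal
```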

Docker Compose Deployment

For development, testing, or small deployments, use the provided Docker Compose configuration.

cd infra/docker
docker compose up -d

This starts:

| Service | Container | Ports |
|---|---|---|
| journal-1 | pact-journal-1 | 9443, 9091 |
| journal-2 | pact-journal-2 | 9543, 9191 |
| journal-3 | pact-journal-3 | 9643, 9291 |
| agent | pact-agent | 9445 |
| prometheus | pact-prometheus | 9090 |
| grafana | pact-grafana | 3000 |
Access Grafana at http://localhost:3000 (login: admin / admin).

Scaling

To run multiple agents in Docker:

docker compose up -d --scale agent=5

Monitoring with Grafana + Prometheus

Prometheus

pact-journal exposes Prometheus metrics on the metrics listen port (default 9091). The provided infra/docker/prometheus.yml scrapes all journal nodes:

scrape_configs:
  - job_name: "pact-journal"
    static_configs:
      - targets:
          - "journal-1:9091"
          - "journal-2:9091"
          - "journal-3:9091"

Loki

pact-journal streams events to Loki for structured log aggregation. Configure the Loki endpoint in journal config:

[telemetry]
loki_enabled = true
loki_endpoint = "http://loki.mgmt:3100/loki/api/v1/push"

Grafana dashboards

Import the dashboards from infra/grafana/dashboards/ into Grafana. These provide:

  • Journal quorum health (Raft leader, commit index, log size)
  • Node status overview (drift, services, capabilities)
  • Config change timeline
  • Emergency mode events
  • Approval workflow status

Alerting

Import the alerting rules from infra/alerting/rules.yml into Prometheus. Key alerts:

  • Raft leader election timeout
  • Journal node down
  • Agent disconnected
  • Emergency mode entered
  • Pending approvals nearing expiry
  • Excessive drift on node

OPA Policy Engine

pact uses OPA (Open Policy Agent) for authorization decisions. Deploy OPA as a sidecar on each journal node.

Install OPA

# Download OPA binary
curl -L -o /usr/local/bin/opa \
    https://openpolicyagent.org/downloads/v0.73.0/opa_linux_amd64_static
chmod +x /usr/local/bin/opa

Run OPA

opa run --server --addr localhost:8181 /etc/pact/policies/

Configure journal

[policy.engine]
type = "opa"
opa_endpoint = "http://localhost:8181/v1/data/pact"

Policy federation

If using Sovra for cross-site federation, policy templates are synchronized automatically:

[policy.federation]
sovra_endpoint = "https://sovra.mgmt:8443"
sync_interval_seconds = 300