Admin Operations

This guide covers day-to-day operational workflows with pact. All operations described here are authenticated via OIDC, authorized via OPA policy, and recorded in the immutable journal.

Roles

| Role | Access | Typical Users |
|------|--------|---------------|
| pact-platform-admin | Full system access | 2-3 people per site |
| pact-ops-{vcluster} | Day-to-day ops for a vCluster | Ops engineers |
| pact-viewer-{vcluster} | Read-only access | Monitoring teams, auditors |
| pact-regulated-{vcluster} | Ops with two-person approval | Sensitive workload admins |

Day-to-Day Operations

Check node status

# Overview of all nodes in your vCluster
pact status --vcluster ml-training

# Detailed status for a specific node
pact status node-042

View drift

Drift is the difference between declared state (in the journal) and actual state (on the node). pact uses blacklist-based detection – everything is monitored except explicitly excluded paths.

# See what has drifted on a node
pact diff node-042

# See committed deltas not yet promoted to the vCluster overlay
pact diff --committed node-042

Commit drift

When drift is intentional (e.g., you tuned a sysctl), commit it to make it the new declared state:

pact commit -m "tuned vm.nr_hugepages for training workload"

Drift must be committed within the commit window (15 minutes by default). If the window expires, uncommitted drift is flagged for review rather than silently discarded.
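To make the window arithmetic concrete, here is a minimal sketch of how much time is left in a window. The function `window_remaining_min` is hypothetical and not part of pact, which tracks the window itself. The sketch assumes the window start is known as an epoch timestamp.

```shell
# Hypothetical helper, not part of pact: minutes left in a commit window.
# Arguments: window start (epoch seconds), current time (epoch seconds),
# window length in minutes (defaults to the documented 15).
# The body runs in a subshell so its variables do not leak to the caller.
window_remaining_min() (
    start=$1
    now=$2
    window_min=${3:-15}
    remaining=$(( (start + window_min * 60 - now) / 60 ))
    # A negative result means the window has already expired.
    if [ "$remaining" -lt 0 ]; then
        remaining=0
    fi
    echo "$remaining"
)
```

For example, `window_remaining_min 1000 1300` prints `10`: five minutes of a 15-minute window have been used. Passing a larger third argument models an extended window, assuming an extension simply adds to the original length.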

Roll back

If a configuration change caused problems, roll back to a known-good state:

# Find the sequence number to roll back to
pact log -n 20

# Roll back
pact rollback 42
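
The two steps above can be combined into a small guard that rejects a mistyped target before anything is sent to pact. This is a sketch only: `safe_rollback` is a hypothetical wrapper, and it assumes sequence numbers are non-negative integers, as the example `42` suggests.

```shell
# Hypothetical wrapper: refuse to call `pact rollback` unless the argument
# looks like a journal sequence number (assumed: a non-negative integer).
safe_rollback() {
    seq=$1
    case "$seq" in
        ''|*[!0-9]*)
            echo "usage: safe_rollback <sequence-number>" >&2
            return 2
            ;;
    esac
    # Show recent history first so the target can be confirmed by eye.
    pact log -n 20
    pact rollback "$seq"
}
```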

Extend the commit window

If you need more time to finalize changes before committing:

pact extend          # +15 minutes (default)
pact extend 30       # +30 minutes

Apply a configuration spec

For bulk or repeatable changes, write a TOML spec and apply it:

pact apply config/vcluster-examples/overlays.toml

This updates the vCluster overlay in the journal. All nodes in the vCluster will converge to the new declared state.

Emergency Mode

Emergency mode is for situations where normal policy constraints would prevent necessary diagnostic or repair actions. It relaxes whitelist restrictions and extends the commit window, while maintaining the full audit trail.

When to use emergency mode

  • Node is degraded and you need unrestricted diagnostic access
  • A service is failing and you need to inspect or modify files outside the whitelist
  • You need to make urgent changes that would normally require approval

Entering emergency mode

pact emergency start -r "GPU ECC errors on node-042, need unrestricted diagnostics"

This:

  1. Records the emergency entry in the journal with your identity and reason
  2. Extends the commit window to 4 hours (configurable)
  3. Relaxes command whitelist restrictions on the node
  4. Sends a notification via Loki/Grafana alerting

Working in emergency mode

All commands are still logged. Emergency mode does not bypass authentication or audit – it only relaxes operational constraints.

pact shell node-042
pact:node-042> nvidia-smi -q -d ECC
pact:node-042> dmesg | grep -i error
pact:node-042> cat /var/log/pact-agent.log
pact:node-042> exit

Exiting emergency mode

pact emergency end

If another admin left an emergency session open, a platform admin can force-end it:

pact emergency end --force
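
Because an open emergency session stays open until someone ends it, a scripted diagnostic can pair the start and end calls. The sketch below uses only the commands shown in this guide. The wrapper `with_emergency` is hypothetical, and it assumes the relaxed whitelist also applies to `pact exec`, which this guide only demonstrates with `pact shell`.

```shell
# Hypothetical wrapper: run one diagnostic inside an emergency session and
# always close the session afterwards, even if the diagnostic fails.
# Usage: with_emergency "<reason>" <node> <command...>
with_emergency() {
    reason=$1
    node=$2
    shift 2
    # Do not run the diagnostic if the session could not be opened.
    pact emergency start -r "$reason" || return 1
    status=0
    pact exec "$node" -- "$@" || status=$?
    pact emergency end
    # Report the diagnostic's exit status, not that of `emergency end`.
    return "$status"
}
```

A call such as `with_emergency "GPU ECC errors on node-042" node-042 nvidia-smi -q -d ECC` then leaves a matched start and end entry in the journal.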

Audit implications

Emergency mode entries are flagged in the journal and appear prominently in audit reports. For regulated vClusters (7-year retention), emergency entries include:

  • Who entered emergency mode and when
  • The stated reason
  • Every command executed during the session
  • Who ended emergency mode and when

Two-Person Approval Workflow

Regulated vClusters (those with two_person_approval = true) require a second admin to approve state-changing operations before they take effect.

Submitting a change

# Admin A commits a change on a regulated vCluster
pact commit -m "add audit-forwarder service to sensitive-compute"

Output:

Approval required (two-person policy on vcluster: sensitive-compute)
Pending approval: ap-7f3a (expires in 30 min)
Waiting for approval... (Ctrl-C to background)

Reviewing and approving

# Admin B lists pending approvals
pact approve list

# Review the change details, then approve
pact approve accept ap-7f3a

Denying a change

pact approve deny ap-7f3a -m "not scheduled in the change window"

Rules

  • You cannot approve your own request
  • Approvals expire after a configurable timeout (default 30 minutes)
  • Expired requests are automatically rolled back
  • Both the request and the approval/denial are recorded in the journal
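
The first three rules can be read as a small decision function. The sketch below is an illustrative model, not pact's implementation: the function name and its outputs are hypothetical, and treating a request as expired exactly at the timeout is an assumption.

```shell
# Illustrative model of the rules above, not pact's implementation.
# Arguments: requester, approver, request age in minutes,
# timeout in minutes (defaults to the documented 30).
# Prints "approve", "reject-self", or "expired".
approval_outcome() (
    requester=$1
    approver=$2
    age_min=$3
    timeout_min=${4:-30}
    if [ "$age_min" -ge "$timeout_min" ]; then
        echo "expired"        # the pending change is rolled back
    elif [ "$requester" = "$approver" ]; then
        echo "reject-self"    # you cannot approve your own request
    else
        echo "approve"
    fi
)
```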

Service Management

pact-agent supervises services on compute nodes. You can check status, restart services, and view logs remotely.

Check service status

pact service status                  # All services on local node
pact service status chronyd          # Specific service

Restart a service

pact service restart nvidia-persistenced

Service restarts are subject to the commit window. If the window has expired, extend it first:

pact extend
pact service restart nvidia-persistenced

View service logs

pact service logs lattice-node-agent

Shows the last 50 lines. For continuous streaming, use pact watch.

Remote Command Execution

pact replaces SSH for all admin access to compute nodes. Commands are executed via the agent’s gRPC exec endpoint.

Single command

pact exec node-042 -- nvidia-smi
pact exec node-042 -- cat /proc/meminfo
pact exec node-042 -- dmesg -T
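
When the same check is needed on several nodes, the single-command form can be looped. This is a minimal sketch: `exec_on_nodes` is a hypothetical helper that issues one `pact exec` call per node, and it assumes `pact exec` returns a non-zero exit status when the remote command fails.

```shell
# Hypothetical helper: run the same command on several nodes, one
# `pact exec` call per node, and report which nodes failed.
# Usage: exec_on_nodes "node-041 node-042" -- nvidia-smi
exec_on_nodes() {
    nodes=$1
    shift
    # Accept an optional "--" separator, mirroring `pact exec`.
    if [ "${1:-}" = "--" ]; then
        shift
    fi
    failed=""
    for node in $nodes; do
        echo "== $node =="
        pact exec "$node" -- "$@" || failed="$failed $node"
    done
    if [ -n "$failed" ]; then
        echo "failed on:$failed" >&2
        return 1
    fi
}
```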

Commands must be on the agent’s whitelist. The whitelist mode is configured per-agent:

| Mode | Behavior |
|------|----------|
| strict | Only explicitly whitelisted commands are allowed |
| learning | All commands are allowed but non-whitelisted ones are logged for review |
| bypass | All commands allowed (development only) |
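
The difference between the three modes can be sketched as a decision function. This is an illustrative model only, not the agent's implementation. The function name, its output strings, and matching on the bare command name are all assumptions.

```shell
# Illustrative model of the three modes, not the agent's implementation.
# Arguments: mode, command name, space-separated whitelist.
# Prints "allow", "allow+log", or "deny".
whitelist_decision() (
    mode=$1
    cmd=$2
    whitelist=$3
    listed=no
    for entry in $whitelist; do
        if [ "$entry" = "$cmd" ]; then
            listed=yes
        fi
    done
    case "$mode" in
        bypass)
            echo "allow"
            ;;
        strict)
            if [ "$listed" = yes ]; then echo "allow"; else echo "deny"; fi
            ;;
        learning)
            if [ "$listed" = yes ]; then echo "allow"; else echo "allow+log"; fi
            ;;
        *)
            echo "unknown mode: $mode" >&2
            exit 2
            ;;
    esac
)
```

With a whitelist of `"nvidia-smi dmesg"`, the command `rm` yields `deny` in strict mode and `allow+log` in learning mode.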

Interactive shell

pact shell node-042

The shell provides a restricted environment on the node. The same whitelist rules apply.

Using the MCP Server

pact includes an MCP (Model Context Protocol) server for AI-assisted operations. The MCP server exposes 24 tools that mirror the CLI commands.

Starting the MCP server

PACT_ENDPOINT="http://localhost:9443" pact-mcp

The server communicates via JSON-RPC 2.0 over stdio. Connect it to Claude Code or any MCP-compatible AI agent.
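
For a quick smoke test without an AI client, the JSON-RPC messages can be built by hand. The sketch below only constructs the standard MCP handshake and a `tools/list` request. The client name and the protocol version string are placeholders, and whether pact-mcp accepts that version is an assumption.

```shell
# Build MCP JSON-RPC 2.0 messages, one per line. The message shapes follow
# the MCP specification; the client name and version are placeholders.
# Each request builder takes the JSON-RPC request id as its argument.
mcp_initialize() {
    printf '{"jsonrpc":"2.0","id":%s,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"example-client","version":"0.1.0"}}}\n' "$1"
}

# Notification sent after the server answers `initialize` (no id).
mcp_initialized() {
    printf '{"jsonrpc":"2.0","method":"notifications/initialized"}\n'
}

mcp_tools_list() {
    printf '{"jsonrpc":"2.0","id":%s,"method":"tools/list"}\n' "$1"
}
```

Piping `mcp_initialize 1`, `mcp_initialized`, and `mcp_tools_list 2` in that order into `pact-mcp` should return the tool catalogue, assuming the server reads newline-delimited messages on stdin as the MCP stdio transport describes.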

Available tools

| Tool | Category | Description |
|------|----------|-------------|
| pact_status | Read | Query node/vCluster state |
| pact_diff | Read | Show declared vs actual differences |
| pact_log | Read | Query configuration history |
| pact_cap | Read | Show hardware capabilities |
| pact_service_status | Read | Query service status |
| pact_query_fleet | Read | Fleet-wide queries |
| pact_commit | Write | Commit drift |
| pact_rollback | Write | Roll back configuration |
| pact_apply | Write | Apply a config spec |
| pact_exec | Write | Run a remote command |
| pact_emergency | Admin | Emergency mode (restricted to human admins) |
| pact_jobs_list | Lattice | List running allocations |
| pact_queue_status | Lattice | Scheduling queue depth |
| pact_cluster_health | Lattice | Combined Raft cluster status |
| pact_system_health | Lattice | Combined system health check |
| pact_accounting | Lattice | Resource usage accounting |
| pact_undrain | Lattice | Cancel drain on a node |
| pact_dag_list | Lattice | List DAG workflows |
| pact_dag_inspect | Lattice | DAG details and step status |
| pact_budget | Lattice | Tenant or user budget/usage |
| pact_backup_create | Admin | Create lattice state backup |
| pact_backup_verify | Lattice | Verify backup integrity |
| pact_nodes_list | Lattice | List nodes with state |
| pact_node_inspect | Lattice | Node hardware/ownership details |

The MCP server connects to the journal (config operations), agent (exec/shell), and lattice (delegation). If any backend is unreachable, it falls back to stub responses. Destructive operations (dag cancel, backup restore) are excluded from MCP — use the CLI for those.

Environment variables

| Variable | Description | Default |
|----------|-------------|---------|
| PACT_ENDPOINT | Journal gRPC endpoint | http://localhost:9443 |
| PACT_AGENT_ENDPOINT | Agent gRPC endpoint | http://localhost:9445 |
| PACT_MCP_TOKEN | Bearer token for MCP→agent authentication | (none; warns if unset) |
| PACT_LATTICE_ENDPOINT | Lattice gRPC endpoint for delegation | (none; lattice tools disabled) |
| PACT_LATTICE_TOKEN | Bearer token for lattice API | (none) |
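
A launcher script can make these defaults explicit before starting the server. The following is a sketch under stated assumptions: `pact_mcp_env` is a hypothetical helper, and its messages are its own, not output produced by pact-mcp.

```shell
# Hypothetical helper: apply the documented defaults, note which optional
# variables are unset, then print the effective endpoints.
pact_mcp_env() {
    # ":=" assigns the default only when the variable is unset or empty.
    : "${PACT_ENDPOINT:=http://localhost:9443}"
    : "${PACT_AGENT_ENDPOINT:=http://localhost:9445}"
    export PACT_ENDPOINT PACT_AGENT_ENDPOINT
    if [ -z "${PACT_MCP_TOKEN:-}" ]; then
        echo "warning: PACT_MCP_TOKEN is unset" >&2
    fi
    if [ -z "${PACT_LATTICE_ENDPOINT:-}" ]; then
        echo "note: PACT_LATTICE_ENDPOINT is unset; lattice tools are disabled" >&2
    fi
    echo "journal=$PACT_ENDPOINT agent=$PACT_AGENT_ENDPOINT"
}
```

Calling `pact_mcp_env` immediately before `pact-mcp` in a wrapper script ensures the server starts with a known configuration.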