Troubleshooting

Agent Cannot Connect to Journal

Symptoms: Agent logs show connection errors. pact status returns exit code 5 (timeout).

Check 1: Network connectivity

# From the agent node, verify the journal port is reachable
nc -zv journal-1.mgmt 9443

Check 2: Journal is running

# On the journal node
systemctl status pact-journal
journalctl -u pact-journal --since "5 min ago"

Check 3: SPIRE identity (if using SPIRE)

# Verify the SPIRE agent socket is available
ls -la /run/spire/agent.sock

# Check SPIRE agent health
spire-agent healthcheck -socketPath /run/spire/agent.sock

# Verify the agent has a valid SVID
spire-agent api fetch x509 -socketPath /run/spire/agent.sock -write /tmp/
openssl x509 -in /tmp/svid.0.pem -noout -subject -issuer -dates

If the SPIRE socket is unavailable, check that the SPIRE agent is running on the node. If attestation fails, verify the node’s SPIRE registration entry exists and matches its hardware identity (TPM, SMBIOS UUID, or join token).

Check 4: Enrollment (if using ephemeral CA)

# Check if the node has been enrolled
pact enroll status <node>

# Check if the agent has a valid certificate
openssl x509 -in /etc/pact/agent.crt -noout -subject -dates 2>/dev/null || echo "No certificate"
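
The dates printed above have to be compared with the current time by eye. As a minimal sketch, assuming `openssl` is installed on the node, the helper below (the name `cert_valid_for` is hypothetical, not part of pact) uses `openssl x509 -checkend` to turn that comparison into an exit code:

```shell
# Succeeds if the certificate at $1 stays valid for at least $2 more seconds.
# `openssl x509 -checkend N` exits non-zero when the certificate expires
# within N seconds; a missing or unreadable file also yields a failure.
cert_valid_for() {
    openssl x509 -in "$1" -noout -checkend "$2" >/dev/null 2>&1
}

# Example (the 24-hour window is an arbitrary choice):
#   cert_valid_for /etc/pact/agent.crt 86400 \
#       || echo "agent certificate is missing or expires within 24 hours"
```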

Common enrollment failures:

  • Hardware identity mismatch: the node’s actual hardware identity (TPM/SMBIOS/MAC) does not match what was registered during pact enroll. Re-enroll with the correct hardware ID: pact enroll <node> --hardware-id <correct-hw-id>
  • CSR rejected: the journal could not validate the CSR. Check journal logs for the rejection reason.
  • Ephemeral CA rotated: if the journal quorum restarted, the CA was regenerated and all agents must re-enroll. Agents do this automatically on the next boot, but running agents need a restart: systemctl restart pact-agent

Check 5: Agent config

Verify that the endpoints setting in agent.toml lists the correct journal addresses:

[agent.journal]
endpoints = ["journal-1.mgmt:9443", "journal-2.mgmt:9443", "journal-3.mgmt:9443"]
tls_enabled = true
tls_ca = "/etc/pact/ca.crt"
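
To test the configured endpoints rather than a hand-typed hostname, the addresses can be read straight from the config file. This is a rough sketch, not a TOML parser: it assumes the endpoints array sits on a single line as in the example above, and the function names are hypothetical.

```shell
# Print each "host:port" entry of the endpoints array, one per line.
# Assumes the array is written on one line, as in the example config.
list_endpoints() {
    grep -E '^[[:space:]]*endpoints[[:space:]]*=' "$1" \
        | grep -oE '"[^"]+"' \
        | tr -d '"'
}

# Probe every configured endpoint with nc (3-second timeout) and report it.
check_endpoints() {
    list_endpoints "$1" | while IFS= read -r ep; do
        host=${ep%:*}     # text before the last colon
        port=${ep##*:}    # text after the last colon
        if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
            echo "OK    $ep"
        else
            echo "FAIL  $ep"
        fi
    done
}

# Example:
#   check_endpoints /etc/pact/agent.toml
```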

Check 6: Firewall

Ensure ports 9443 (gRPC) and 9444 (Raft) are open between journal nodes, and that port 9443 is open from compute nodes to journal nodes.


Raft Leader Election Issues

Symptoms: Journal logs show repeated election timeouts. No leader elected. CLI commands hang or return timeout errors.

Check 1: Quorum availability

A 3-node quorum needs at least 2 nodes. A 5-node quorum needs at least 3. Verify all journal nodes are running:

for host in journal-1.mgmt journal-2.mgmt journal-3.mgmt; do
    echo "$host: $(nc -zv "$host" 9443 2>&1)"
done
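
The quorum sizes quoted above (2 of 3, 3 of 5) follow the usual majority rule, floor(N/2) + 1. A small sketch of that arithmetic, with hypothetical helper names, in case you want to script the comparison against the number of nodes that answered:

```shell
# Majority needed for a cluster of N voting members: floor(N/2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if $2 reachable nodes out of $1 total still form a majority.
has_quorum() {
    [ "$2" -ge "$(quorum_size "$1")" ]
}

# Example: 3 journal nodes, 1 reachable
#   has_quorum 3 1 || echo "quorum lost: no leader can be elected"
```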

Check 2: Clock synchronization

Raft is sensitive to clock skew. Verify NTP/chrony is running on all journal nodes:

chronyc tracking

Check 3: Raft peer configuration

All nodes must have an identical members list under [journal.raft]. A mismatch causes election failures. Verify on each node:

grep -A5 "journal.raft" /etc/pact/journal.toml
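
Comparing the grep output by eye across several nodes is error-prone. One possible approach, assuming you have copied each node's journal.toml to a local file (for example over SSH), is to extract the section and compare it textually. The function names are hypothetical, and the comparison is literal: differences in whitespace or comments count as mismatches.

```shell
# Print the [journal.raft] section of the TOML file $1, from its header
# up to (but not including) the next section header.
raft_section() {
    awk '/^\[journal\.raft\]/ { p = 1; print; next }
         /^\[/                { p = 0 }
         p' "$1"
}

# Succeeds if both files contain a textually identical [journal.raft] section.
same_raft_config() {
    [ "$(raft_section "$1")" = "$(raft_section "$2")" ]
}

# Example, with copies fetched from two journal nodes:
#   same_raft_config journal-1.toml journal-2.toml || echo "Raft config differs"
```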

Check 4: Data directory permissions

The journal data directory must be writable by the pact user:

ls -la /var/lib/pact/journal/

Check 5: Network partitions

Raft port 9444 must be reachable between all journal nodes. Unlike the gRPC port, this is peer-to-peer between journal nodes only:

nc -zv journal-2.mgmt 9444

Drift Detection False Positives

Symptoms: pact diff shows drift for files or paths that should not be monitored (logs, temp files, runtime state).

Fix: Add patterns to the blacklist

The blacklist excludes paths from drift detection. Edit the agent config:

[agent.blacklist]
patterns = [
    "/tmp/**",
    "/var/log/**",
    "/proc/**",
    "/sys/**",
    "/dev/**",
    "/run/user/**",
    "/run/pact/**",
    "/run/lattice/**",
    # Add your exclusions here:
    "/var/cache/**",
    "/home/*/.bash_history"
]

After updating the config, restart the agent:

systemctl restart pact-agent

Understanding the blacklist-first model: pact monitors everything by default and excludes via blacklist (see ADR-002). This is the opposite of most config management tools, which declare what to watch. The blacklist approach ensures nothing is missed, but it means you need to explicitly exclude noisy paths.


Shell Command Blocked by Whitelist

Symptoms: pact exec or pact shell returns exit code 6 with “command not whitelisted”.

Check 1: Current whitelist mode

grep whitelist_mode /etc/pact/agent.toml

| Mode | Behavior |
|---|---|
| strict | Only explicitly whitelisted commands allowed |
| learning | All commands allowed, non-whitelisted ones logged |
| bypass | All commands allowed (development only) |
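
As a small illustration of the mode table, the sketch below reads the configured mode and prints what it implies. It assumes the setting is written as `whitelist_mode = "<mode>"` on its own line; the function names are hypothetical.

```shell
# Print the value of whitelist_mode from the config file $1 (first match only).
whitelist_mode() {
    sed -nE 's/^[[:space:]]*whitelist_mode[[:space:]]*=[[:space:]]*"([^"]*)".*/\1/p' "$1" \
        | head -n 1
}

# Translate a mode name into the behavior listed in the table.
describe_mode() {
    case "$1" in
        strict)   echo "only explicitly whitelisted commands are allowed" ;;
        learning) echo "all commands are allowed, non-whitelisted ones are logged" ;;
        bypass)   echo "all commands are allowed (development only)" ;;
        *)        echo "unknown mode: $1" ;;
    esac
}

# Example:
#   describe_mode "$(whitelist_mode /etc/pact/agent.toml)"
```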

Fix for development: Set whitelist_mode = "learning" or "bypass".

Fix for production: Add the command to the whitelist. The whitelist is managed via the vCluster overlay policy. Contact your platform admin to update it.

Workaround: If you need immediate unrestricted access, enter emergency mode:

pact emergency start -r "need to run diagnostics command XYZ"
# Run your command
pact exec node-042 -- your-command
pact emergency end
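
If the command in the middle fails, it is easy to forget the final step and leave the node in emergency mode. A hedged sketch of a wrapper that always attempts to end the session, using only the three pact commands shown above (the wrapper name `with_emergency` is hypothetical):

```shell
# Usage: with_emergency "<reason>" <node> <command> [args...]
# Starts emergency mode, runs the command on the node, then always tries to
# end emergency mode. Returns the exit status of the command itself.
with_emergency() {
    reason=$1
    node=$2
    shift 2
    status=0

    pact emergency start -r "$reason" || return 1
    pact exec "$node" -- "$@" || status=$?
    pact emergency end || echo "warning: emergency mode may still be active" >&2

    return "$status"
}

# Example:
#   with_emergency "need to run diagnostics command XYZ" node-042 your-command
```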

Emergency Mode Stuck

Symptoms: A node is in emergency mode but the admin who started it is unavailable. Other admins cannot make changes that conflict with the emergency session.

Fix: Force-end the emergency

A pact-platform-admin can force-end another admin’s emergency session:

pact emergency end --force

This records the force-end in the journal audit log, including who ended it and the original emergency reason.

If the CLI cannot reach the journal: If the journal itself is the problem (which may be why emergency mode was started in the first place), fix journal connectivity first. See the Raft Leader Election Issues section above.

Last resort: BMC console access provides unrestricted bash on the node, bypassing pact entirely. This is the out-of-band fallback when pact itself is not functioning.


Approval Workflow Issues

Approval request expired

Symptoms: A commit on a regulated vCluster was submitted but nobody approved it within the timeout (default 30 minutes). The change was rolled back.

Fix: Resubmit the change and coordinate with an approver in advance:

# Resubmit
pact commit -m "add audit-forwarder (re-submit after timeout)"

# Tell the approver to check immediately
# Approver runs:
pact approve list
pact approve accept ap-XXXX

Adjust timeout: If 30 minutes is too short for your workflow, update the vCluster policy:

[vcluster.sensitive-compute.policy]
approval_timeout_seconds = 3600   # 1 hour

Cannot approve own request

Symptoms: pact approve accept returns an authorization error when trying to approve your own request.

This is by design. Two-person approval requires a different admin to approve. The approver must have the pact-regulated-{vcluster} or pact-platform-admin role.

No approvers available

If no other admin with the required role is available, a pact-platform-admin can approve any request. If no platform admin is available, the change must wait or be submitted through the emergency mode workflow (which has its own audit requirements).


Agent Reports Wrong Capabilities

Symptoms: pact cap shows incorrect GPU count, memory, or network capabilities.

Check 1: Capability manifest

The agent reads capabilities from a JSON manifest:

cat /run/pact/capability.json

Check 2: GPU detection

If GPU capabilities are wrong, check the GPU backend:

# For NVIDIA
nvidia-smi -L

# For AMD
rocm-smi --showproductname
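
For NVIDIA nodes, the number of devices the driver sees can be counted from the listing and compared with the GPU count that pact cap reports. A minimal sketch, relying on `nvidia-smi -L` printing one line per device that starts with `GPU <index>:` (the helper name is hypothetical):

```shell
# Count the devices in `nvidia-smi -L` output read from stdin.
# grep -c exits non-zero on zero matches, so `|| true` keeps the status clean.
count_nvidia_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Example:
#   nvidia-smi -L | count_nvidia_gpus
```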

Check 3: Poll interval

The agent polls GPU status periodically. Check the config:

[agent.capability]
gpu_poll_interval_seconds = 30

A recently failed GPU may not be reflected until the next poll.


Journal Data Directory Full

Symptoms: Journal logs show write errors. Raft cannot commit new entries.

Check disk usage:

df -h /var/lib/pact/journal/
du -sh /var/lib/pact/journal/*
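
To catch a filling volume before Raft writes start failing, the df output can be reduced to a single number and compared with a limit. This is a sketch only: the helper names and the 90% limit are assumptions, not pact defaults.

```shell
# Print the used-space percentage (integer, no % sign) of the filesystem
# that holds path $1. `df -P` forces one POSIX-format line per filesystem.
disk_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when usage $1 is at or above limit $2 (both integers).
over_threshold() {
    [ "$1" -ge "$2" ]
}

# Example (90 is an arbitrary limit):
#   pct=$(disk_use_percent /var/lib/pact/journal/)
#   over_threshold "$pct" 90 && echo "journal volume is ${pct}% full"
```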

Fix 1: Trigger a Raft snapshot

Snapshots compact the log. The snapshot interval is configured in the journal:

[journal.raft]
snapshot_interval = 10000   # Entries between snapshots

Reduce this value and restart the journal service to make compaction more frequent.

Fix 2: Expand storage

If the data directory is genuinely too small for your workload, expand the underlying volume.


Common Error Messages

| Message | Cause | Fix |
|---|---|---|
| No auth token found | Missing OIDC token | Set PACT_TOKEN or write to ~/.config/pact/token |
| No vCluster specified | Missing vCluster scope | Use --vcluster or set PACT_VCLUSTER |
| connection refused | Journal not running or wrong endpoint | Check journal status and endpoint config |
| certificate verify failed | TLS cert mismatch or ephemeral CA rotated | Restart agent to re-enroll, or verify CA bundle at /etc/pact/ca.crt |
| SPIRE socket unavailable | SPIRE agent not running | Start SPIRE agent or switch to ephemeral CA identity |
| enrollment: hardware mismatch | Node hardware ID does not match enrollment record | Re-enroll with correct --hardware-id |
| policy: denied | OPA rejected the operation | Check your role has the required permissions |
| approval required | Regulated vCluster | Another admin must approve (see workflow above) |
| commit window expired | Time window for changes has closed | Run pact extend or pact commit first |