- Impact: Quorum maintained (2/3 or 3/5 nodes)
- Detection: Raft heartbeat timeout (1.5-3s)
- Recovery: Automatic leader re-election, failed node rejoins on restart
- Data: No data loss; committed entries are replicated
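
The 2/3 and 3/5 figures above follow from Raft's majority rule. A minimal sketch of that arithmetic, where the function names are illustrative and not part of pact:

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail while a majority remains."""
    return cluster_size - quorum(cluster_size)


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True if the healthy nodes can still elect a leader and commit writes."""
    return healthy_nodes >= quorum(cluster_size)
```

Under this rule a 3-node cluster tolerates one failed node and a 5-node cluster tolerates two, which is why losing a single node leaves quorum intact.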
- Impact: No writes accepted, reads still served from local state
- Detection: Raft cannot elect leader
- Recovery: Restore majority of nodes, cluster auto-recovers
- Agent behavior: Runs in disconnected mode, buffers events
- Impact: Impossible by Raft design (majority required for writes)
- Detection: Minority partition detects it's not the leader
- Recovery: Automatic on network heal
- Impact: No config updates, no audit logging
- Detection: Connection timeout, subscription backoff
- Recovery: Exponential backoff reconnect (1s base, 60s max, 100 attempts)
- Behavior: Agent continues with cached config in observe-only mode
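
The reconnect schedule above (1s base, 60s max, 100 attempts) could look roughly like the sketch below. The doubling factor, the absence of jitter, and the `connect` callable are assumptions made for illustration; the source only states the base, cap, and attempt count.

```python
import time
from typing import Callable, Iterator

BASE_DELAY_S = 1.0   # from the recovery policy above
MAX_DELAY_S = 60.0   # from the recovery policy above
MAX_ATTEMPTS = 100   # from the recovery policy above


def backoff_delays(
    base: float = BASE_DELAY_S,
    cap: float = MAX_DELAY_S,
    attempts: int = MAX_ATTEMPTS,
) -> Iterator[float]:
    """Yield one delay per attempt, doubling each time (assumed) up to the cap."""
    for attempt in range(attempts):
        yield min(cap, base * 2 ** attempt)


def reconnect(
    connect: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call the hypothetical `connect` until it succeeds or attempts run out."""
    for delay in backoff_delays():
        if connect():
            return True
        sleep(delay)  # wait before the next attempt
    return False
```

With these assumptions the delays run 1, 2, 4, 8, 16, 32 seconds and then stay at 60 seconds for the remaining attempts. Passing `sleep` as a parameter keeps the loop testable without real waiting.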
- Impact: All supervised services orphaned
- Detection: systemd/PID 1 watchdog (if configured)
- Recovery: Agent restart re-reads state, re-supervises services
- Data: Capability report regenerated on boot
- Impact: Unnecessary commit window opened
- Detection: Admin reviews drift via `pact diff`
- Recovery: Add path to blacklist patterns, drift resets on commit
- Impact: Config subscription disconnected
- Detection: gRPC stream error
- Recovery: Reconnect with `from_sequence` (at-least-once delivery)
- Conflict resolution: Journal-wins after grace period (ConflictManager)
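
At-least-once delivery means an entry may arrive more than once after a reconnect, so the receiver has to tolerate duplicates. The sketch below shows one way a subscriber might track its position and skip repeats. The class, its methods, and the assumption that `from_sequence` is inclusive are all hypothetical, not taken from the pact API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConfigSubscriber:
    """Tracks the highest applied sequence number (hypothetical helper)."""

    last_applied: int = 0
    applied: List[str] = field(default_factory=list)

    def resume_from(self) -> int:
        """Value to send as from_sequence on reconnect (assumed inclusive)."""
        return self.last_applied + 1

    def handle(self, sequence: int, payload: str) -> bool:
        """Apply an entry once; return False for an already-applied duplicate."""
        if sequence <= self.last_applied:
            return False  # redelivered entry, safe to ignore
        self.applied.append(payload)
        self.last_applied = sequence
        return True
```

If the stream breaks after entry 3 and the server replays entries 3 and 4, the subscriber drops the repeated entry 3 and applies only entry 4.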
- Impact: Raft replication paused for minority side
- Detection: Raft log divergence
- Recovery: Automatic reconciliation on heal, minority replays missed entries
- Impact: Extended commit window, reduced automation
- Detection: Stale emergency detection (expiry without resolution)
- Recovery: Platform admin force-end (`pact emergency end --force`)
- Audit: All emergency actions logged regardless of mode
- Impact: Blocked by RBAC (P8: AI agents cannot enter)
- Detection: PolicyService evaluation returns Deny
- Recovery: Human admin must initiate emergency mode