Observability
Design: No agent-level Prometheus scraping
Three channels:
- Journal server metrics → Prometheus → Grafana (3-5 scrape targets)
- Config + admin events → Journal → Loki → Grafana (event stream)
- Agent process health → lattice-node-agent eBPF → existing Prometheus
Journal Metrics Endpoint
Each pact-journal server exposes a Prometheus metrics endpoint via axum (HTTP, default port 9091, chosen to avoid a conflict with the Prometheus server default of 9090). Metrics include:
- pact_raft_leader (gauge): 1 if this node is the Raft leader
- pact_raft_term (gauge): current Raft term
- pact_raft_log_entries (gauge): total log entries
- pact_raft_replication_lag (gauge): entries behind leader, per follower
- pact_journal_entries_total (counter): total config entries appended
- pact_journal_boot_streams_active (gauge): concurrent boot config streams
- pact_journal_boot_stream_duration_seconds (histogram): boot stream latency
- pact_journal_overlay_builds_total (counter): overlay pre-computation events
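To make the scrape output concrete, here is a minimal sketch of how a few of these metrics could be rendered in the Prometheus text exposition format. It uses only the Rust standard library. The JournalMetrics struct and the render_metrics and write_metric functions are hypothetical names for illustration, not part of pact-journal, and a real server would more likely rely on a metrics crate than on hand-written formatting.

```rust
use std::fmt::Write;

/// Hypothetical snapshot of a few journal metrics (illustrative only).
pub struct JournalMetrics {
    pub is_leader: bool,
    pub raft_term: u64,
    pub raft_log_entries: u64,
    pub journal_entries_total: u64,
}

/// Append one metric with its HELP and TYPE lines to the output buffer.
fn write_metric(out: &mut String, name: &str, kind: &str, help: &str, value: u64) {
    // Writing into a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {name} {help}");
    let _ = writeln!(out, "# TYPE {name} {kind}");
    let _ = writeln!(out, "{name} {value}");
}

/// Render the snapshot as Prometheus text exposition format.
pub fn render_metrics(m: &JournalMetrics) -> String {
    let mut out = String::new();
    write_metric(
        &mut out,
        "pact_raft_leader",
        "gauge",
        "1 if this node is the Raft leader",
        m.is_leader as u64,
    );
    write_metric(&mut out, "pact_raft_term", "gauge", "Current Raft term", m.raft_term);
    write_metric(
        &mut out,
        "pact_raft_log_entries",
        "gauge",
        "Total log entries",
        m.raft_log_entries,
    );
    write_metric(
        &mut out,
        "pact_journal_entries_total",
        "counter",
        "Total config entries appended",
        m.journal_entries_total,
    );
    out
}
```

The string returned by render_metrics would be the response body of the metrics endpoint. The per-follower replication lag and the boot stream histogram are left out here because they need labels and buckets, which this sketch does not model.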
Health check endpoint: GET /health returns 200 if Raft is healthy.
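The health decision could be sketched as follows. The text above does not define "healthy", so this example assumes it means that a leader is known and a majority of voters is reachable. The 503 status for the unhealthy case is likewise an assumption, and RaftHealth and health_status are hypothetical names.

```rust
/// Hypothetical view of Raft state used for the health decision.
pub struct RaftHealth {
    pub has_leader: bool,
    pub reachable_voters: usize,
    pub total_voters: usize,
}

/// Return the HTTP status code for GET /health.
/// Assumption: healthy means a leader is known and a majority is reachable.
pub fn health_status(h: &RaftHealth) -> u16 {
    // A majority of an N-voter cluster is floor(N / 2) + 1.
    let quorum = h.total_voters / 2 + 1;
    if h.has_leader && h.reachable_voters >= quorum {
        200
    } else {
        // Assumed failure status; the design only specifies the 200 case.
        503
    }
}
```

In an axum handler, the returned code would be converted into the response status. Keeping the decision in a plain function makes it testable without starting a server.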
Grafana Dashboards
- Fleet Configuration Health: drift heatmap, commit activity, boot performance
- Admin Operations: exec/shell session frequency, command whitelist violations
- Emergency Sessions: active, duration, stale alerts
- Journal Health: Raft quorum, log growth, replication lag
Alerting
- Critical: quorum loss, stale emergency
- Warning: high drift rate, slow boot config, policy auth failures, GPU degradation
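The severity split above can be encoded as a small lookup, for instance to route alerts to different notification channels. This is only a sketch: the Alert and Severity enums are hypothetical names, and the thresholds that trigger each alert are not specified here and are not modelled.

```rust
/// Alert severities named in the design.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Critical,
    Warning,
}

/// Alert conditions listed in the design (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Alert {
    QuorumLoss,
    StaleEmergency,
    HighDriftRate,
    SlowBootConfig,
    PolicyAuthFailures,
    GpuDegradation,
}

impl Alert {
    /// Map each alert condition to the severity assigned in the design.
    pub fn severity(self) -> Severity {
        match self {
            Alert::QuorumLoss | Alert::StaleEmergency => Severity::Critical,
            Alert::HighDriftRate
            | Alert::SlowBootConfig
            | Alert::PolicyAuthFailures
            | Alert::GpuDegradation => Severity::Warning,
        }
    }
}
```

Because the match is exhaustive, adding a new alert variant forces a decision about its severity at compile time.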