Cluster Monitoring & Observability

Prometheus Metrics

Lattice exposes Prometheus-compatible metrics at GET /metrics on the REST port (default 8080).

Key Metrics

Metric	Type	Description
`lattice_allocations_total`	Counter	Total allocations by state
`lattice_allocations_active`	Gauge	Currently running allocations
`lattice_scheduling_cycle_duration_seconds`	Histogram	Scheduling cycle latency
`lattice_scheduling_placements_total`	Counter	Successful placements
`lattice_scheduling_preemptions_total`	Counter	Preemption events
`lattice_raft_commit_latency_seconds`	Histogram	Raft commit latency
`lattice_raft_sensitive_audit_entries_total`	Counter	Sensitive audit log entries
`lattice_api_request_duration_seconds`	Histogram	API request latency
`lattice_api_requests_total`	Counter	API requests by method and status
`lattice_nodes_total`	Gauge	Nodes by state
`lattice_checkpoint_duration_seconds`	Histogram	Checkpoint operation latency

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'lattice'
    static_configs:
      - targets:
        - 'lattice-01:8080'
        - 'lattice-02:8080'
        - 'lattice-03:8080'

Grafana Dashboards

Pre-built dashboards are in infra/grafana/dashboards/:

Cluster Overview — node states, allocation throughput, queue depth
Scheduling Performance — cycle latency, placement rate, preemption rate
Raft Health — commit latency, leader elections, log compaction
Per-Tenant Usage — resource consumption, fair-share deficit

Import via Grafana UI or provision from infra/grafana/provisioning/.

Alerting Rules

Pre-configured alerting rules in infra/alerting/:

Alert	Condition
`LatticeRaftNoLeader`	No Raft leader for > 30s
`LatticeNodeDown`	Node heartbeat lost for > 5m
`LatticeSchedulingStalled`	No placements for > 10m with pending jobs
`LatticeHighPreemptionRate`	> 10 preemptions/minute
`LatticeCheckpointFailure`	Checkpoint success rate < 90%
`LatticeDiskSpaceLow`	Raft data directory > 80% full

TSDB Integration

Lattice pushes per-node telemetry to VictoriaMetrics (or any Prometheus-compatible remote write endpoint).

telemetry:
  tsdb_endpoint: "http://victoriametrics:8428"
  prod_interval_seconds: 30

Telemetry includes CPU, memory, GPU utilization, network I/O, and disk I/O per node.

Audit Log

Sensitive workload operations are recorded in the Raft-committed audit log:

# Query audit log
curl "http://lattice-01:8080/api/v1/audit?tenant=sensitive-team&from=2026-03-01"

Audit entries include: node claims/releases, allocation lifecycle events, and access log entries. Retention: 7 years (configurable).

Health Check

curl http://lattice-01:8080/healthz
# {"status": "ok"}

Used by Docker/Kubernetes health probes and load balancers.

Keyboard shortcuts

Lattice Documentation