Cluster Monitoring & Observability

Prometheus Metrics

Lattice exposes Prometheus-compatible metrics at GET /metrics on the REST port (default 8080).

Key Metrics

Metric                                      Type       Description
lattice_allocations_total                   Counter    Total allocations by state
lattice_allocations_active                  Gauge      Currently running allocations
lattice_scheduling_cycle_duration_seconds   Histogram  Scheduling cycle latency
lattice_scheduling_placements_total         Counter    Successful placements
lattice_scheduling_preemptions_total        Counter    Preemption events
lattice_raft_commit_latency_seconds         Histogram  Raft commit latency
lattice_raft_sensitive_audit_entries_total  Counter    Sensitive audit log entries
lattice_api_request_duration_seconds        Histogram  API request latency
lattice_api_requests_total                  Counter    API requests by method and status
lattice_nodes_total                         Gauge      Nodes by state
lattice_checkpoint_duration_seconds         Histogram  Checkpoint operation latency
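
For a quick check without a full Prometheus setup, the endpoint can be read directly. The following is a minimal Python sketch, not part of Lattice itself. It assumes the standard Prometheus text exposition format, without timestamps on sample lines. The label name "state" and the values in SAMPLE are hypothetical; the real label names are not listed here.

```python
from urllib.request import urlopen


def parse_samples(text, metric):
    """Return {label_string: value} for one metric in Prometheus text format.

    Simplification: assumes sample lines carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name, _, labels = name_and_labels.partition("{")
        if name == metric:
            samples[labels.rstrip("}")] = float(value)
    return samples


def fetch_metrics(base_url="http://lattice-01:8080"):
    """Fetch the raw exposition text from GET /metrics on the REST port."""
    with urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


# Hypothetical sample; the "state" label and its values are assumptions.
SAMPLE = """\
# TYPE lattice_nodes_total gauge
lattice_nodes_total{state="ready"} 12
lattice_nodes_total{state="draining"} 1
# TYPE lattice_allocations_active gauge
lattice_allocations_active 37
"""

nodes_by_state = parse_samples(SAMPLE, "lattice_nodes_total")
active = parse_samples(SAMPLE, "lattice_allocations_active")
```

Against a live cluster, parse_samples(fetch_metrics(), "lattice_nodes_total") would return the same shape. Unlabeled metrics are stored under an empty label string.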

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'lattice'
    static_configs:
      - targets:
        - 'lattice-01:8080'
        - 'lattice-02:8080'
        - 'lattice-03:8080'

Grafana Dashboards

Pre-built dashboards are in infra/grafana/dashboards/:

  • Cluster Overview — node states, allocation throughput, queue depth
  • Scheduling Performance — cycle latency, placement rate, preemption rate
  • Raft Health — commit latency, leader elections, log compaction
  • Per-Tenant Usage — resource consumption, fair-share deficit

Import them through the Grafana UI, or provision them from infra/grafana/provisioning/.

Alerting Rules

Pre-configured alerting rules are in infra/alerting/:

Alert                      Condition
LatticeRaftNoLeader        No Raft leader for > 30s
LatticeNodeDown            Node heartbeat lost for > 5m
LatticeSchedulingStalled   No placements for > 10m with pending jobs
LatticeHighPreemptionRate  > 10 preemptions/minute
LatticeCheckpointFailure   Checkpoint success rate < 90%
LatticeDiskSpaceLow        Raft data directory > 80% full
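
To make a rate-based condition such as LatticeHighPreemptionRate concrete, the sketch below derives a per-minute rate from raw samples of the lattice_scheduling_preemptions_total counter. It is an illustration only: the shipped rules in infra/alerting/ are not reproduced here, and the function names and the simplified counter-reset handling are assumptions.

```python
def per_minute_rate(samples):
    """Average per-minute increase of a counter.

    `samples` is a time-ordered list of (unix_seconds, counter_value) pairs.
    Simplification: only the first and last samples are compared.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return 0.0
    # A counter that went down was reset (e.g. process restart),
    # so count the increase from zero instead.
    increase = v1 - v0 if v1 >= v0 else v1
    return increase * 60.0 / elapsed


def high_preemption_rate(samples, threshold=10.0):
    """True when the rate exceeds the threshold of 10 preemptions/minute."""
    return per_minute_rate(samples) > threshold


# 25 preemptions over 120 seconds is 12.5 per minute, above the threshold.
firing = high_preemption_rate([(0, 100), (60, 104), (120, 125)])
# 10 preemptions over 120 seconds is 5 per minute, below the threshold.
quiet = high_preemption_rate([(0, 100), (120, 110)])
```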

TSDB Integration

Lattice pushes per-node telemetry to VictoriaMetrics (or any Prometheus-compatible remote write endpoint).

telemetry:
  tsdb_endpoint: "http://victoriametrics:8428"
  prod_interval_seconds: 30

Telemetry includes CPU, memory, GPU utilization, network I/O, and disk I/O per node.

Audit Log

Sensitive workload operations are recorded in the Raft-committed audit log:

# Query audit log
curl "http://lattice-01:8080/api/v1/audit?tenant=sensitive-team&from=2026-03-01"

Audit entries cover node claims and releases, allocation lifecycle events, and access log entries. Entries are retained for 7 years (configurable).
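
The same query can be issued from a script. This Python sketch builds the URL from the two query parameters used in the curl example (tenant and from). The helper names are hypothetical, and the response is assumed to be JSON, which is not stated here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def audit_url(base_url, tenant, from_date):
    """Build the audit query URL with properly escaped parameters."""
    query = urlencode({"tenant": tenant, "from": from_date})
    return f"{base_url}/api/v1/audit?{query}"


def fetch_audit(base_url, tenant, from_date):
    """Fetch audit entries. Assumes the endpoint returns a JSON body."""
    with urlopen(audit_url(base_url, tenant, from_date), timeout=10) as resp:
        return json.load(resp)


url = audit_url("http://lattice-01:8080", "sensitive-team", "2026-03-01")
```

Using urlencode rather than string concatenation keeps tenant names containing spaces or other reserved characters from corrupting the query string.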

Health Check

curl http://lattice-01:8080/healthz
# {"status": "ok"}

This endpoint is used by Docker/Kubernetes health probes and load balancers.
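
Where a custom probe is needed (for example in a deployment script), the check can be sketched in Python as follows. The function names are hypothetical. Treating any body other than {"status": "ok"} as unhealthy is an assumption, since other status values are not documented here.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def is_healthy_body(body):
    """True only when the body is a JSON object with status "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not valid JSON, or valid JSON that is not an object.
        return False


def probe(base_url="http://lattice-01:8080", timeout=2.0):
    """Return True when GET /healthz answers 200 with a healthy body."""
    try:
        with urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200 and is_healthy_body(resp.read())
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False
```

A short timeout keeps an unresponsive node from stalling the caller, which matters when the probe runs in a loop over several nodes.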