
Telemetry Architecture

Design Principle

Collect at high resolution, aggregate at configurable resolution, transmit out-of-band.

Three-Layer Pipeline

Layer 1: Collection (eBPF, always-on)

eBPF programs are JIT-compiled by the kernel and attached to tracepoints and kprobes.

Kernel-level metrics:

  • CPU: context switches, runqueue depth, scheduling latency histograms
  • Network: per-flow bytes/packets, Slingshot CSIG congestion signals from packet headers
  • Block I/O: latency histograms, throughput per device (NVMe scratch, network mounts)
  • Memory: allocation/free rates, NUMA locality, page faults

GPU metrics (via NVML/DCGM hooks):

  • SM occupancy, memory utilization, power draw
  • PCIe/NVLink throughput
  • ECC error counts (feeds into checkpoint cost model)

Collection overhead: ~0.3% on compute-bound workloads. eBPF programs run in kernel context, so there is no syscall overhead and no userspace daemon polling.

Data flows into per-CPU ring buffers (BPF_MAP_TYPE_RINGBUF), consumed by the node agent.

Layer 2: Aggregation (Node Agent, switchable)

The node agent reads ring buffers and aggregates based on the current mode.

Mode: prod (default)

  • 30-second aggregation windows
  • Statistical summaries: p50, p95, p99, mean, max, count
  • Cubic-spline interpolation for time-series smoothing (reduces storage, preserves trends)
  • Transmitted on Slingshot telemetry traffic class (separate from compute traffic)
  • Additional overhead: ~0.1%
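The prod-mode summarization amounts to a windowed reduction over each 30-second batch of raw samples. A minimal sketch, assuming nearest-rank percentiles; `summarize` is a hypothetical illustration, not the actual node-agent code:

```python
import statistics

def summarize(samples: list[float]) -> dict:
    """Reduce one 30-second window of raw samples to the prod-mode summary.

    The keys mirror the statistical summaries listed above
    (p50/p95/p99/mean/max/count).
    """
    if not samples:
        return {"count": 0}
    ordered = sorted(samples)

    def pct(p: float) -> float:
        # Nearest-rank percentile over the sorted window.
        idx = min(len(ordered) - 1, int(p * len(ordered)))
        return ordered[idx]

    return {
        "p50": pct(0.50),
        "p95": pct(0.95),
        "p99": pct(0.99),
        "mean": statistics.fmean(ordered),
        "max": ordered[-1],
        "count": len(ordered),
    }
```

Because only these fixed summaries leave the node, the per-window cost is bounded regardless of how many raw events the eBPF layer produced.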

Mode: debug (per-job or per-node, time-limited)

  • 1-second or sub-second raw event streams
  • Full per-flow network traces
  • GPU kernel-level profiling (CUPTI integration)
  • Stored to job-specific S3 path for user analysis
  • Additional overhead: ~2-5% (acceptable for debugging)
  • Auto-reverts to prod after configured duration (default: 30 minutes)

Mode: audit (sensitive vCluster)

  • All file access events (open, read, write, close) with user identity
  • All API calls logged with request/response metadata
  • Network flow summaries (source, destination, bytes, duration)
  • Signed with Sovra keys (if federation enabled) for tamper evidence
  • Additional overhead: ~1%
  • Retention: 7 years (cold tier, S3-compatible archive)

Layer 3: Storage and Query

Time-series store. Recommended: VictoriaMetrics (single-node or cluster) for single-site deployments, or Thanos on top of Prometheus for federated multi-site deployments that need a global query view across sites:

  • Ingestion: all nodes stream aggregated metrics
  • Auto-downsampling: raw → 1m → 5m → 1h → 1d
  • Retention policy configurable per tenant/vCluster
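The raw → 1m → 5m → 1h → 1d rollup chain is repeated re-bucketing of samples at a coarser step. A minimal sketch (a real downsampler would also keep min/max and counts per bucket; `downsample` is an illustrative name):

```python
def downsample(points: list[tuple[float, float]],
               step: float) -> list[tuple[float, float]]:
    """Average (timestamp, value) points into buckets of `step` seconds.

    Returns one point per non-empty bucket, stamped at the bucket start.
    Applying this repeatedly with larger steps yields the rollup chain.
    """
    buckets: dict[float, list[float]] = {}
    for ts, val in points:
        buckets.setdefault(ts - ts % step, []).append(val)
    return [(start, sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())]
```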

Four materialized views (label-based access control):

| View | Audience | Content |
|---|---|---|
| Holistic | System admins | System-wide utilization, power, health, scheduling efficiency |
| Tenant | Tenant admins | Per-tenant resource usage, quota tracking, job statistics |
| vCluster | Scheduler | Metrics feeding into cost function (GPU util, I/O, congestion) |
| User | Allocation owners | Per-allocation metrics scoped by OIDC identity (via lattice-api) |

Query interface: PromQL-compatible API. Grafana dashboards for visualization.

Debug traces: Stored to s3://{tenant}/{project}/{job_id}/telemetry/ with short retention (7 days default, configurable).

Audit logs: Append-only, encrypted at rest, stored to dedicated audit storage with long retention. Queryable for compliance reporting.

Switching Telemetry Mode

Via Intent API:

PATCH /v1/allocations/{id}
{ "telemetry": { "mode": "debug", "duration": "30m" } }

Via CLI:

lattice telemetry --alloc=12345 --mode=debug --duration=30m

Switching is instant — the eBPF programs are always collecting at full resolution. Only the aggregation behavior changes.

User-Facing Telemetry Query

The telemetry pipeline serves admin dashboards and the scheduler cost function. The user-facing query layer adds scoped access so allocation owners can query their own metrics without admin intervention.

Query Path

User → lattice-api → PromQL (scoped by alloc/tenant/user) → TSDB → response

The lattice-api injects label filters to ensure users only see metrics for their own allocations. Tenant admins can query any allocation within their tenant.
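The label-injection step can be sketched as follows. This is naive string manipulation for illustration only (a production implementation would rewrite the PromQL AST, and the label names `tenant`/`alloc_id` are assumptions):

```python
def scope_query(promql: str, caller: dict) -> str:
    """Append enforced label matchers to a bare metric selector.

    `caller` carries the identity lattice-api resolved from the OIDC
    token. Allocation owners are pinned to their own allocation IDs,
    tenant admins to their tenant, system admins are unrestricted.
    (Naive: assumes `promql` is a single selector with no matchers.)
    """
    if caller["role"] == "system_admin":
        return promql                       # holistic view: unrestricted
    if caller["role"] == "tenant_admin":
        matchers = f'tenant="{caller["tenant"]}"'
    else:                                   # allocation owner
        ids = "|".join(caller["alloc_ids"])
        matchers = f'alloc_id=~"{ids}"'
    return f"{promql}{{{matchers}}}"
```

Enforcing scope server-side, before the query reaches the TSDB, means the store itself never needs to understand Lattice identities.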

Scoping Rules

| Caller | Visible Scope |
|---|---|
| Allocation owner | Metrics for their own allocations |
| Tenant admin | Metrics for any allocation in their tenant |
| System admin | All metrics (holistic view) |

User Metrics Catalog

| Metric | Description | Available In |
|---|---|---|
| gpu_utilization | SM occupancy per GPU | prod, debug, audit |
| gpu_memory_used | GPU memory in use | prod, debug, audit |
| gpu_power_draw | GPU power consumption | prod, debug, audit |
| cpu_utilization | CPU usage per node | prod, debug, audit |
| memory_used | System memory in use | prod, debug, audit |
| network_tx_bytes | Network bytes sent per second | prod, debug, audit |
| network_rx_bytes | Network bytes received per second | prod, debug, audit |
| io_read_bytes | Storage read throughput | prod, debug, audit |
| io_write_bytes | Storage write throughput | prod, debug, audit |
| io_latency_p99 | Storage I/O latency (p99) | prod, debug, audit |

Telemetry Streaming

For use cases requiring push-based updates (e.g., lattice watch), the StreamMetrics RPC fans out to node agents running the target allocation and merges their streams.

Architecture

lattice-api receives StreamMetrics request
    → identifies nodes running allocation (from quorum state)
    → opens per-node metric streams to node agents
    → merges streams with allocation-scoped labels
    → returns unified server-streaming response to client

In prod mode, node agents emit aggregated snapshots every 30 seconds. In debug mode, raw events stream at 1-second intervals. The client receives the same resolution as the current telemetry mode — switching mode (via PATCH /v1/allocations/{id}) takes effect on active streams.
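Merging the per-node streams reduces to a timestamp-ordered k-way merge. A minimal sketch, assuming each per-node stream is already ordered and each event is a dict with a `ts` key (the event shape and `alloc_id` label are illustrative):

```python
import heapq
from typing import Iterator

def merge_streams(node_streams: list[Iterator[dict]],
                  alloc_id: str) -> Iterator[dict]:
    """Merge per-node metric streams into one timestamp-ordered stream.

    heapq.merge performs the k-way merge lazily, so the unified stream
    stays push-based; the allocation label is stamped onto every event
    before it reaches the client.
    """
    for event in heapq.merge(*node_streams, key=lambda e: e["ts"]):
        yield {**event, "alloc_id": alloc_id}
```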

Alert Generation

Node agents evaluate threshold rules locally and inject MetricAlert events into the stream when:

  • GPU utilization < 10% for > 60s (potential hang)
  • GPU memory > 95% (OOM risk)
  • Network error rate exceeds 0.1%
  • I/O p99 latency exceeds 10ms

Cross-Allocation Comparison

Users can compare metrics across multiple allocations (e.g., successive training runs) via the CompareMetrics RPC or GET /v1/compare.

TSDB Query

The lattice-api issues parallel PromQL queries for each allocation ID, scoped to the requesting user’s permissions. Results are aligned by relative time (see below).

Relative Time Alignment

Allocations may run at different wall-clock times. Comparison uses relative-to-start alignment: each allocation’s metric series is indexed from t=0 (the allocation’s started_at timestamp). This allows apples-to-apples comparison of metrics across runs that started hours or days apart.
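Relative-to-start alignment reduces to subtracting each allocation's started_at from its sample timestamps. A minimal sketch:

```python
def align_to_start(series: list[tuple[float, float]],
                   started_at: float) -> list[tuple[float, float]]:
    """Re-index an allocation's (timestamp, value) series so t=0 is its start.

    Two runs that started hours or days apart can then be compared
    point-by-point on the same relative axis.
    """
    return [(ts - started_at, val) for ts, val in series]
```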

Feedback to Scheduler

The telemetry system feeds key metrics back to the scheduling cost function:

| Metric | Cost Function Component | Effect |
|---|---|---|
| GPU utilization per job | Efficiency scoring | Low util → deprioritize for topology-premium placement |
| Network congestion (CSIG) | topology_fitness | Congested groups → avoid placing new jobs there |
| I/O throughput per job | data_readiness | High I/O demand → ensure storage QoS before scheduling |
| Node ECC errors | checkpoint cost model | Rising errors → increase checkpoint urgency |
| Power draw per node | energy_cost | Feeds into power budget constraint |

Telemetry Aggregation Topology

For large systems (10,000+ nodes), direct streaming to a central store creates an ingestion bottleneck. Use hierarchical aggregation:

Nodes (per-group) → Group Aggregator → Central Store

Each Slingshot dragonfly group has a designated aggregator node.
Group aggregators perform first-level aggregation (merge per-node summaries).
Central store receives per-group aggregated streams.

In debug mode, the job's nodes bypass group aggregation and stream directly to the central store.
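First-level aggregation at the group aggregator works because per-node summaries are histogram buckets and counters rather than pre-computed percentiles, so they add cleanly. A sketch of the merge step, with an illustrative summary layout (upper-bound → count buckets plus `count` and `sum`):

```python
from collections import Counter

def merge_node_summaries(summaries: list[dict]) -> dict:
    """Merge per-node latency summaries into one per-group summary.

    Buckets and counters are additive, which is why percentiles are
    computed centrally (from merged buckets) rather than merged
    directly; averaging per-node p99s would give a wrong answer.
    """
    merged: Counter = Counter()
    total, acc = 0, 0.0
    for s in summaries:
        merged.update(s["buckets"])
        total += s["count"]
        acc += s["sum"]
    return {"buckets": dict(merged), "count": total, "sum": acc}
```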

Scheduler Self-Monitoring

Internal metrics for monitoring Lattice’s own health. These metrics feed into canary criteria during rolling upgrades (cross-ref: upgrades.md) and are available on the holistic dashboard.

Scheduling Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| lattice_scheduling_cycle_duration_seconds | histogram | vcluster | Time to complete one scheduling cycle |
| lattice_scheduling_queue_depth | gauge | vcluster | Number of pending allocations |
| lattice_scheduling_proposals_total | counter | vcluster, result (accepted/rejected) | Proposals sent to quorum |
| lattice_scheduling_cost_function_duration_seconds | histogram | vcluster | Time to evaluate the cost function for all candidates |
| lattice_scheduling_backfill_jobs_total | counter | vcluster | Allocations placed via backfill |

Quorum Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| lattice_raft_leader | gauge | member_id | 1 if this member is leader, 0 if follower |
| lattice_raft_commit_latency_seconds | histogram | member_id | Time from proposal to commit |
| lattice_raft_log_entries | gauge | member_id | Number of entries in the Raft log |
| lattice_raft_snapshot_duration_seconds | histogram | member_id | Time to create a Raft snapshot |

API Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| lattice_api_requests_total | counter | method, status | Total API requests |
| lattice_api_request_duration_seconds | histogram | method | Request latency |
| lattice_api_active_streams | gauge | stream_type (attach/logs/metrics) | Active streaming connections |

Node Agent Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| lattice_agent_heartbeat_latency_seconds | histogram | node_id | Heartbeat round-trip time |
| lattice_agent_allocation_startup_seconds | histogram | node_id | Time from allocation assignment to process start (includes uenv pull/mount) |
| lattice_agent_ebpf_overhead_percent | gauge | node_id | Measured eBPF collection overhead |

Accounting Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| lattice_accounting_events_buffered | gauge | (none) | Events in the in-memory accounting buffer |
| lattice_accounting_events_dropped_total | counter | (none) | Events dropped due to buffer overflow |

Federation Broker Metrics

When federation is enabled, the federation broker exposes additional metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| lattice_federation_proposals_total | counter | peer, result (accepted/rejected/timeout) | Placement proposals sent to/from peers |
| lattice_federation_proposal_latency_seconds | histogram | peer | Round-trip time for federation proposals |
| lattice_federation_peer_status | gauge | peer | 1 = connected, 0 = unreachable |
| lattice_federation_data_gravity_score | gauge | peer, dataset | Data gravity score for placement decisions (higher = more data at peer) |

These metrics are only active when federation.enabled = true. The federation broker exposes them on the same /metrics endpoint as other components (default port: 9105).

Alerting Rules

Example alerting rules (PromQL-compatible):

| Rule | Condition | Severity |
|---|---|---|
| Scheduling cycle slow | histogram_quantile(0.99, rate(lattice_scheduling_cycle_duration_seconds_bucket[5m])) > 30 | warning |
| Queue depth high | lattice_scheduling_queue_depth > 100 for 5 minutes | warning |
| Raft commit slow | histogram_quantile(0.99, rate(lattice_raft_commit_latency_seconds_bucket[5m])) > 5 | critical |
| Node heartbeat missing | time() - lattice_agent_last_heartbeat_timestamp > 60 | node degraded |
| API error rate spike | rate(lattice_api_requests_total{status=~"5.."}[5m]) / rate(lattice_api_requests_total[5m]) > 0.05 | warning |
| Accounting buffer filling | lattice_accounting_events_buffered > 8000 | warning |
| VNI pool exhaustion approaching | (lattice_network_vni_pool_total - lattice_network_vni_pool_available) / lattice_network_vni_pool_total > 0.90 | warning |
| Quota utilization high | lattice_quota_used_nodes / lattice_quota_max_nodes > 0.95 for 10 minutes | warning |
| Raft disk usage high | lattice_raft_disk_used_bytes / lattice_raft_disk_total_bytes > 0.80 | warning |
| Snapshot storage growth | rate(lattice_raft_snapshot_size_bytes[1h]) > 100e6 | info |
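The first rule in the table, expressed as a Prometheus alerting-rule file. Note that quantiles over classic histograms are computed from the `_bucket` series over a rate window; the group name, `for` duration, and annotation text are illustrative:

```yaml
groups:
  - name: lattice-scheduling
    rules:
      - alert: SchedulingCycleSlow
        # p99 scheduling-cycle duration over the last 5 minutes
        expr: >
          histogram_quantile(0.99,
            rate(lattice_scheduling_cycle_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Scheduling cycle p99 above 30s for {{ $labels.vcluster }}"
```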

Dashboard Views

Three views matching the existing telemetry pattern:

| Dashboard | Audience | Key Panels |
|---|---|---|
| Holistic | System admins | All scheduler cycle times, quorum health, total queue depth, API throughput |
| Per-vCluster | Scheduler operators | vCluster-specific queue depth, cycle time, proposal accept rate, backfill rate |
| Per-quorum-member | Quorum operators | Raft log size, commit latency, leader status, snapshot timing |

Monitoring Deployment

Prometheus Scrape Configuration

All Lattice components expose metrics on a /metrics endpoint (Prometheus exposition format):

| Component | Default Metrics Port | Endpoint |
|---|---|---|
| Quorum members | 9100 | http://{quorum-host}:9100/metrics |
| API servers | 9101 | http://{api-host}:9101/metrics |
| vCluster schedulers | 9102 | http://{scheduler-host}:9102/metrics |
| Node agents | 9103 | http://{node-host}:9103/metrics |
| Checkpoint broker | 9104 | http://{checkpoint-host}:9104/metrics |

Example Prometheus scrape config:

scrape_configs:
  - job_name: "lattice-quorum"
    static_configs:
      - targets: ["quorum-1:9100", "quorum-2:9100", "quorum-3:9100"]

  - job_name: "lattice-api"
    static_configs:
      - targets: ["api-1:9101", "api-2:9101"]

  - job_name: "lattice-scheduler"
    static_configs:
      - targets: ["scheduler-hpc:9102", "scheduler-ml:9102", "scheduler-interactive:9102"]

  - job_name: "lattice-agents"
    file_sd_configs:
      - files: ["/etc/prometheus/lattice-agents.json"]
        refresh_interval: 5m
    # Node agents are numerous; use file-based service discovery
    # populated from OpenCHAMI node inventory
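The target file read by `file_sd_configs` follows Prometheus's file-based service-discovery JSON format. The hostnames and labels below are illustrative of what an export from the OpenCHAMI node inventory might produce:

```json
[
  {
    "targets": ["nid001001:9103", "nid001002:9103"],
    "labels": { "group": "dragonfly-g0", "role": "compute" }
  }
]
```

Prometheus re-reads this file on the configured `refresh_interval`, so nodes added or drained in the inventory are picked up without a Prometheus restart.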

Alert Routing

Alerts are routed via Alertmanager (or compatible system):

| Severity | Route | Response Time |
|---|---|---|
| Critical | PagerDuty / on-call | Immediate (< 15 min) |
| Warning | Slack #lattice-alerts | Business hours (< 4 hours) |
| Info | Slack #lattice-info | Best effort |

Example Alertmanager route:

route:
  receiver: "slack-info"
  routes:
    - match: { severity: "critical" }
      receiver: "pagerduty-oncall"
    - match: { severity: "warning" }
      receiver: "slack-alerts"

Grafana Dashboards

Pre-built dashboards for the three views described above. Dashboards are defined as JSON and version-controlled in infra/grafana/:

infra/grafana/
├── holistic.json          # System-wide overview
├── per-vcluster.json      # vCluster-specific scheduling
├── per-quorum-member.json # Raft health
├── per-node.json          # Individual node health
└── user-allocation.json   # User-facing allocation metrics

Each dashboard uses the standard Lattice metric names. Data source: Prometheus (or compatible TSDB).

TSDB Sizing

| Cluster Size | Metric Cardinality | Ingestion Rate | Storage (30-day retention) |
|---|---|---|---|
| 100 nodes | ~50,000 series | ~10k samples/s | ~50 GB |
| 1,000 nodes | ~500,000 series | ~100k samples/s | ~500 GB |
| 10,000 nodes | ~5,000,000 series | ~1M samples/s | ~5 TB |
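The storage column follows directly from the ingestion rate, the retention window, and compressed sample size; ~2 bytes per sample is an assumed figure typical of Gorilla-style TSDB compression:

```python
def tsdb_storage_bytes(samples_per_s: float, retention_days: int = 30,
                       bytes_per_sample: float = 2.0) -> float:
    """Estimate on-disk size: ingestion rate x retention x sample size."""
    return samples_per_s * retention_days * 86_400 * bytes_per_sample

# 100-node row: 10k samples/s over 30 days at ~2 B/sample ≈ 52 GB,
# consistent with the ~50 GB figure in the table.
gb = tsdb_storage_bytes(10_000) / 1e9
```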

For clusters > 1000 nodes, use a horizontally scalable TSDB (VictoriaMetrics cluster, Mimir, or Thanos) with the hierarchical aggregation described in the Telemetry Aggregation Topology section above.