Telemetry Architecture

Design Principle

Collect at high resolution, aggregate at configurable resolution, transmit out-of-band.

Three-Layer Pipeline

Layer 1: Collection (eBPF, always-on)

eBPF programs JIT-compiled into kernel, attached to tracepoints and kprobes.

Kernel-level metrics:

CPU: context switches, runqueue depth, scheduling latency histograms
Network: per-flow bytes/packets, Slingshot CSIG congestion signals from packet headers
Block I/O: latency histograms, throughput per device (NVMe scratch, network mounts)
Memory: allocation/free rates, NUMA locality, page faults

GPU metrics (via NVML/DCGM hooks):

SM occupancy, memory utilization, power draw
PCIe/NVLink throughput
ECC error counts (feeds into checkpoint cost model)

Storage overhead: ~0.3% on compute-bound workloads. eBPF programs run in kernel context, no syscall overhead, no userspace daemon polling.

Data flows into per-CPU ring buffers (BPF_MAP_TYPE_RINGBUF), consumed by the node agent.

Layer 2: Aggregation (Node Agent, switchable)

The node agent reads ring buffers and aggregates based on the current mode.

Mode: prod (default)

30-second aggregation windows
Statistical summaries: p50, p95, p99, mean, max, count
Bicubic interpolation for time-series smoothing (reduces storage, preserves trends)
Transmitted on Slingshot telemetry traffic class (separate from compute traffic)
Additional overhead: ~0.1%

Mode: debug (per-job or per-node, time-limited)

1-second or sub-second raw event streams
Full per-flow network traces
GPU kernel-level profiling (CUPTI integration)
Stored to job-specific S3 path for user analysis
Additional overhead: ~2-5% (acceptable for debugging)
Auto-reverts to prod after configured duration (default: 30 minutes)

Mode: audit (sensitive vCluster)

All file access events (open, read, write, close) with user identity
All API calls logged with request/response metadata
Network flow summaries (source, destination, bytes, duration)
Signed with Sovra keys (if federation enabled) for tamper evidence
Additional overhead: ~1%
Retention: 7 years (cold tier, S3-compatible archive)

Time-series store — recommended: VictoriaMetrics (single-node or cluster) for single-site deployments; Thanos on top of Prometheus for federated multi-site deployments that need a global query view across sites:

Ingestion: all nodes stream aggregated metrics
Auto-downsampling: raw → 1m → 5m → 1h → 1d
Retention policy configurable per tenant/vCluster

Three materialized views (label-based access control):

View	Audience	Content
Holistic	System admins	System-wide utilization, power, health, scheduling efficiency
Tenant	Tenant admins	Per-tenant resource usage, quota tracking, job statistics
vCluster	Scheduler	Metrics feeding into cost function (GPU util, I/O, congestion)
User	Allocation owners	Per-allocation metrics scoped by OIDC identity (via lattice-api)

Query interface: PromQL-compatible API. Grafana dashboards for visualization.

Debug traces: Stored to s3://{tenant}/{project}/{job_id}/telemetry/ with short retention (7 days default, configurable).

Audit logs: Append-only, encrypted at rest, stored to dedicated audit storage with long retention. Queryable for compliance reporting.

Switching Telemetry Mode

Via Intent API:

PATCH /v1/allocations/{id}
{ "telemetry": { "mode": "debug", "duration": "30m" } }

Via CLI:

lattice telemetry --alloc=12345 --mode=debug --duration=30m

Switching is instant — the eBPF programs are always collecting at full resolution. Only the aggregation behavior changes.

User-Facing Telemetry Query

The telemetry pipeline serves admin dashboards and the scheduler cost function. The user-facing query layer adds scoped access so allocation owners can query their own metrics without admin intervention.

Query Path

User → lattice-api → PromQL (scoped by alloc/tenant/user) → TSDB → response

The lattice-api injects label filters to ensure users only see metrics for their own allocations. Tenant admins can query any allocation within their tenant.

Scoping Rules

Caller	Visible Scope
Allocation owner	Metrics for their own allocations
Tenant admin	Metrics for any allocation in their tenant
System admin	All metrics (holistic view)

User Metrics Catalog

Metric	Description	Available In
`gpu_utilization`	SM occupancy per GPU	prod, debug, audit
`gpu_memory_used`	GPU memory in use	prod, debug, audit
`gpu_power_draw`	GPU power consumption	prod, debug, audit
`cpu_utilization`	CPU usage per node	prod, debug, audit
`memory_used`	System memory in use	prod, debug, audit
`network_tx_bytes`	Network bytes sent per second	prod, debug, audit
`network_rx_bytes`	Network bytes received per second	prod, debug, audit
`io_read_bytes`	Storage read throughput	prod, debug, audit
`io_write_bytes`	Storage write throughput	prod, debug, audit
`io_latency_p99`	Storage I/O latency (p99)	prod, debug, audit

Telemetry Streaming

For use cases requiring push-based updates (e.g., lattice watch), the StreamMetrics RPC fans out to node agents running the target allocation and merges their streams.

Architecture

lattice-api receives StreamMetrics request
    → identifies nodes running allocation (from quorum state)
    → opens per-node metric streams to node agents
    → merges streams with allocation-scoped labels
    → returns unified server-streaming response to client

In prod mode, node agents emit aggregated snapshots every 30 seconds. In debug mode, raw events stream at 1-second intervals. The client receives the same resolution as the current telemetry mode — switching mode (via PATCH /v1/allocations/{id}) takes effect on active streams.

Alert Generation

Node agents evaluate threshold rules locally and inject MetricAlert events into the stream when:

GPU utilization < 10% for > 60s (potential hang)
GPU memory > 95% (OOM risk)
Network error rate exceeds 0.1%
I/O p99 latency exceeds 10ms

Cross-Allocation Comparison

Users can compare metrics across multiple allocations (e.g., successive training runs) via the CompareMetrics RPC or GET /v1/compare.

TSDB Query

The lattice-api issues parallel PromQL queries for each allocation ID, scoped to the requesting user’s permissions. Results are aligned by relative time (see below).

Relative Time Alignment

Allocations may run at different wall-clock times. Comparison uses relative-to-start alignment: each allocation’s metric series is indexed from t=0 (the allocation’s started_at timestamp). This allows apples-to-apples comparison of metrics across runs that started hours or days apart.

Feedback to Scheduler

The telemetry system feeds key metrics back to the scheduling cost function:

Metric	Cost Function Component	Effect
GPU utilization per job	Efficiency scoring	Low util → deprioritize for topology-premium placement
Network congestion (CSIG)	topology_fitness	Congested groups → avoid placing new jobs there
I/O throughput per job	data_readiness	High I/O demand → ensure storage QoS before scheduling
Node ECC errors	checkpoint cost model	Rising errors → increase checkpoint urgency
Power draw per node	energy_cost	Feeds into power budget constraint

Telemetry Aggregation Topology

For large systems (10,000+ nodes), direct streaming to a central store creates an ingestion bottleneck. Use hierarchical aggregation:

Nodes (per-group) → Group Aggregator → Central Store

Each Slingshot dragonfly group has a designated aggregator node.
Group aggregators perform first-level aggregation (merge per-node summaries).
Central store receives per-group aggregated streams.

In debug mode: bypasses group aggregation, streams directly for that job's nodes.

Scheduler Self-Monitoring

Internal metrics for monitoring Lattice’s own health. These metrics feed into canary criteria during rolling upgrades (cross-ref: upgrades.md) and are available on the holistic dashboard.

Scheduling Metrics

Metric	Type	Labels	Description
`lattice_scheduling_cycle_duration_seconds`	histogram	`vcluster`	Time to complete one scheduling cycle
`lattice_scheduling_queue_depth`	gauge	`vcluster`	Number of pending allocations
`lattice_scheduling_proposals_total`	counter	`vcluster`, `result` (accepted/rejected)	Proposals sent to quorum
`lattice_scheduling_cost_function_duration_seconds`	histogram	`vcluster`	Time to evaluate the cost function for all candidates
`lattice_scheduling_backfill_jobs_total`	counter	`vcluster`	Allocations placed via backfill

Quorum Metrics

Metric	Type	Labels	Description
`lattice_raft_leader`	gauge	`member_id`	1 if this member is leader, 0 if follower
`lattice_raft_commit_latency_seconds`	histogram	`member_id`	Time from proposal to commit
`lattice_raft_log_entries`	gauge	`member_id`	Number of entries in the Raft log
`lattice_raft_snapshot_duration_seconds`	histogram	`member_id`	Time to create a Raft snapshot

API Metrics

Metric	Type	Labels	Description
`lattice_api_requests_total`	counter	`method`, `status`	Total API requests
`lattice_api_request_duration_seconds`	histogram	`method`	Request latency
`lattice_api_active_streams`	gauge	`stream_type` (attach/logs/metrics)	Active streaming connections

Node Agent Metrics

Metric	Type	Labels	Description
`lattice_agent_heartbeat_latency_seconds`	histogram	`node_id`	Heartbeat round-trip time
`lattice_agent_allocation_startup_seconds`	histogram	`node_id`	Time from allocation assignment to process start (includes uenv pull/mount)
`lattice_agent_ebpf_overhead_percent`	gauge	`node_id`	Measured eBPF collection overhead

Accounting Metrics

Metric	Type	Labels	Description
`lattice_accounting_events_buffered`	gauge	—	Events in the in-memory accounting buffer
`lattice_accounting_events_dropped_total`	counter	—	Events dropped due to buffer overflow

Federation Broker Metrics

When federation is enabled, the federation broker exposes additional metrics:

Metric	Type	Labels	Description
`lattice_federation_proposals_total`	counter	`peer`, `result` (accepted/rejected/timeout)	Placement proposals sent to/from peers
`lattice_federation_proposal_latency_seconds`	histogram	`peer`	Round-trip time for federation proposals
`lattice_federation_peer_status`	gauge	`peer`	1 = connected, 0 = unreachable
`lattice_federation_data_gravity_score`	gauge	`peer`, `dataset`	Data gravity score for placement decisions (higher = more data at peer)

These metrics are only active when federation.enabled = true. The federation broker exposes them on the same /metrics endpoint as other components (default port: 9105).

Alerting Rules

Example alerting rules (PromQL-compatible):

Rule	Condition	Severity
Scheduling cycle slow	`histogram_quantile(0.99, lattice_scheduling_cycle_duration_seconds) > 30`	warning
Queue depth high	`lattice_scheduling_queue_depth > 100` for 5 minutes	warning
Raft commit slow	`histogram_quantile(0.99, lattice_raft_commit_latency_seconds) > 5`	critical
Node heartbeat missing	`time() - lattice_agent_last_heartbeat_timestamp > 60`	node degraded
API error rate spike	`rate(lattice_api_requests_total{status=~"5.."}[5m]) / rate(lattice_api_requests_total[5m]) > 0.05`	warning
Accounting buffer filling	`lattice_accounting_events_buffered > 8000`	warning
VNI pool exhaustion approaching	`(lattice_network_vni_pool_total - lattice_network_vni_pool_available) / lattice_network_vni_pool_total > 0.90`	warning
Quota utilization high	`lattice_quota_used_nodes / lattice_quota_max_nodes > 0.95` for 10 minutes	warning
Raft disk usage high	`lattice_raft_disk_used_bytes / lattice_raft_disk_total_bytes > 0.80`	warning
Snapshot storage growth	`rate(lattice_raft_snapshot_size_bytes[1h]) > 100e6`	info

Dashboard Views

Three views matching the existing telemetry pattern:

Dashboard	Audience	Key Panels
Holistic	System admins	All scheduler cycle times, quorum health, total queue depth, API throughput
Per-vCluster	Scheduler operators	vCluster-specific queue depth, cycle time, proposal accept rate, backfill rate
Per-quorum-member	Quorum operators	Raft log size, commit latency, leader status, snapshot timing

Monitoring Deployment

Prometheus Scrape Configuration

All Lattice components expose metrics on a /metrics endpoint (Prometheus exposition format):

Component	Default Metrics Port	Endpoint
Quorum members	9100	`http://{quorum-host}:9100/metrics`
API servers	9101	`http://{api-host}:9101/metrics`
vCluster schedulers	9102	`http://{scheduler-host}:9102/metrics`
Node agents	9103	`http://{node-host}:9103/metrics`
Checkpoint broker	9104	`http://{checkpoint-host}:9104/metrics`

Example Prometheus scrape config:

scrape_configs:
  - job_name: "lattice-quorum"
    static_configs:
      - targets: ["quorum-1:9100", "quorum-2:9100", "quorum-3:9100"]

  - job_name: "lattice-api"
    static_configs:
      - targets: ["api-1:9101", "api-2:9101"]

  - job_name: "lattice-scheduler"
    static_configs:
      - targets: ["scheduler-hpc:9102", "scheduler-ml:9102", "scheduler-interactive:9102"]

  - job_name: "lattice-agents"
    file_sd_configs:
      - files: ["/etc/prometheus/lattice-agents.json"]
        refresh_interval: 5m
    # Node agents are numerous; use file-based service discovery
    # populated from OpenCHAMI node inventory

Alert Routing

Alerts are routed via Alertmanager (or compatible system):

Severity	Route	Response Time
Critical	PagerDuty / on-call	Immediate (< 15 min)
Warning	Slack #lattice-alerts	Business hours (< 4 hours)
Info	Slack #lattice-info	Best effort

Example Alertmanager route:

route:
  receiver: "slack-info"
  routes:
    - match: { severity: "critical" }
      receiver: "pagerduty-oncall"
    - match: { severity: "warning" }
      receiver: "slack-alerts"

Grafana Dashboards

Pre-built dashboards for the three views described above. Dashboards are defined as JSON and version-controlled in infra/grafana/:

infra/grafana/
├── holistic.json          # System-wide overview
├── per-vcluster.json      # vCluster-specific scheduling
├── per-quorum-member.json # Raft health
├── per-node.json          # Individual node health
└── user-allocation.json   # User-facing allocation metrics

Each dashboard uses the standard Lattice metric names. Data source: Prometheus (or compatible TSDB).

TSDB Sizing

Cluster Size	Metric Cardinality	Ingestion Rate	Storage (30-day retention)
100 nodes	~50,000 series	~10k samples/s	~50 GB
1,000 nodes	~500,000 series	~100k samples/s	~500 GB
10,000 nodes	~5,000,000 series	~1M samples/s	~5 TB

For clusters > 1000 nodes, use a horizontally scalable TSDB (VictoriaMetrics cluster, Mimir, or Thanos) with the hierarchical aggregation described in the Telemetry Aggregation Topology section above.

Lattice Documentation