Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

User-Facing Observability & Debugging

Design Principle

Lattice already collects high-resolution telemetry (eBPF, TSDB, three aggregation modes) for operator and scheduler use. This document describes the user-facing surface that lets job owners debug, monitor, and profile their own allocations without admin intervention.

All observability data flows through existing pipelines — no new collection infrastructure is required. The user-facing layer adds scoped query access, streaming endpoints, and interactive attach.

Overview

┌─ User ───────────────────────────────────────────────────────┐
│  lattice attach / logs / top / watch / diag / compare        │
│         │           │          │         │         │         │
│         ▼           ▼          ▼         ▼         ▼         │
│    ┌─────────── lattice-api (gRPC + REST) ───────────────┐   │
│    │  Attach ──────────────── bidir stream to node agent │   │
│    │  Logs ────────────────── ring buffer (live) + S3    │   │
│    │  Metrics ─────────────── PromQL query to TSDB       │   │
│    │  StreamMetrics ───────── fan-out to node agents     │   │
│    │  Diagnostics ─────────── TSDB + fabric telemetry    │   │
│    │  Compare ─────────────── multi-alloc TSDB query     │   │
│    └─────────────────────────────────────────────────────┘   │
│         │           │          │         │                   │
│         ▼           ▼          ▼         ▼                   │
│    Node Agents    S3 logs     TSDB    Slingshot CSIG         │
└──────────────────────────────────────────────────────────────┘
CapabilityData SourceLatencyCLI Command
Attach to running allocationNode agent (nsenter)Real-timelattice attach <id>
Log streaming (live tail)Node agent ring bufferSub-secondlattice logs <id> --follow
Historical logsS3Secondslattice logs <id>
Live metrics (top)TSDB30s (prod mode)lattice top <id>
Live telemetry stream (watch)Node agents (push)1-30slattice watch <id>
DiagnosticsTSDB + fabric telemetry30slattice diag <id>
Cross-allocation comparisonTSDBSecondslattice compare <id1> <id2>
Application profilingUser tools (via tools_uenv)N/AUser-driven

Attach to Running Allocation

Architecture

The attach mechanism provides an interactive terminal session inside a running allocation’s execution environment. The node agent uses nsenter to enter the allocation’s mount and network namespaces — this is not a new allocation, just a terminal session in the existing one.

User → lattice-cli → lattice-api → gRPC bidir stream → node agent
                                                         │
                                                    nsenter into
                                                    mount/net ns
                                                         │
                                                    PTY ↔ shell

Terminal Protocol

The gRPC bidirectional stream carries:

  • Client → Server: stdin bytes, terminal resize events, signals (SIGINT, SIGTSTP)
  • Server → Client: stdout/stderr bytes, exit code on completion

The stream begins with an AttachStart message specifying the target node (for multi-node allocations) and command (default: user’s shell).

Authorization Model

vCluster TypeWho Can AttachAdditional Constraints
HPC (backfill)Allocation owner
Service (bin-pack)Allocation owner
Interactive (FIFO)Allocation ownerAlready has session; attach is secondary terminal
Sensitive (reservation)Claiming user onlySession recorded, audit trail, signed uenv only

Sensitive Constraints

  • Only the user who claimed the nodes (identity from Raft audit log) can attach
  • All attach sessions are recorded (input + output) to the sensitive audit log
  • Attach is only permitted when the allocation runs a signed uenv
  • Session start/end events are Raft-committed audit entries

Attach During Node Crash

If the node hosting an attach session crashes or becomes unreachable:

  • The gRPC bidirectional stream is dropped (connection reset).
  • The API server detects the stream drop and sets ended_at on the AttachSession record.
  • For sensitive allocations, the session end event is recorded in the audit log with reason node_unreachable.
  • The client receives a stream error and can display: "connection to node lost — attach session ended".

Attach During Preemption

If the allocation is preempted while an attach session is active, the session is terminated gracefully. See sessions.md for the detailed preemption sequence. If the allocation is in Checkpointing state, new attach requests are rejected with: "allocation is being checkpointed — attach unavailable until rescheduled".

CLI Usage

# Attach to allocation (first node, user's shell)
lattice attach 12345

# Attach to a specific node
lattice attach 12345 --node=x1000c0s0b0n3

# Attach with a specific command
lattice attach 12345 --command="nvidia-smi -l 1"

Slurm Compatibility

SlurmLattice
srun --jobid=123 --pty bashlattice attach 123

Log Streaming

Dual-Path Architecture

Logs use two paths to balance latency and durability:

  1. Ring buffer (live tail): Each node agent maintains a per-allocation ring buffer (default 64 MB) of stdout/stderr. Supports low-latency streaming for --follow mode. Data is ephemeral — lost when the allocation ends or the buffer wraps.

  2. S3 persistence: Node agents periodically flush log chunks to S3 for durable storage. Available during and after allocation execution.

Process stdout/stderr
    │
    ├──→ Ring buffer (node agent, 64 MB)
    │         │
    │         └──→ gRPC StreamLogs (live tail)
    │
    └──→ S3 flush (periodic, configurable interval)
              │
              └──→ REST GET /logs (historical)

Log Storage Layout

s3://{tenant}/{project}/{alloc_id}/logs/
    ├── stdout/{node_id}/{chunk_000..N}.log.zst
    ├── stderr/{node_id}/{chunk_000..N}.log.zst
    └── metadata.json    # timestamps, byte offsets, node list

Logs are compressed with zstd. The metadata file enables efficient range queries by time or byte offset.

Streaming (Live Tail)

Via gRPC StreamLogs RPC (server-streaming). The client specifies:

  • Allocation ID
  • Stream filter: stdout, stderr, or both
  • Node filter: specific node or all nodes
  • Follow mode: whether to keep streaming as new output arrives
  • Tail lines: number of lines from the ring buffer to replay on connect

Historical Log Access

Via REST GET /v1/allocations/{id}/logs:

  • Query params: stream (stdout/stderr), node, since, until, offset, limit
  • Returns paginated log entries from S3
  • Available after allocation completion (subject to retention policy)

Sensitive Constraints

  • Logs from sensitive allocations are encrypted at rest in the dedicated sensitive S3 pool
  • All log access events are recorded in the sensitive audit log
  • Log retention follows sensitive data retention policy (user-specified, minimum per regulation)
  • Logs are only accessible to the claiming user and designated compliance reviewers

CLI Usage

# View logs (all nodes, both streams)
lattice logs 12345

# Follow mode (live tail)
lattice logs 12345 --follow

# Filter by stream and node
lattice logs 12345 --stderr --node=x1000c0s0b0n3

# Tail last 100 lines
lattice logs 12345 --tail=100

# Historical range
lattice logs 12345 --since="2026-03-01T10:00:00Z" --until="2026-03-01T11:00:00Z"

Slurm Compatibility

SlurmLattice
cat slurm-123.outlattice logs 123
tail -f slurm-123.outlattice logs 123 --follow

User-Facing Live Metrics (lattice top)

Query Path

Metrics are served from the TSDB (not directly from node agents). The lattice-api translates user queries into PromQL, scoped to the requesting user’s allocations.

lattice top <id> → lattice-api → PromQL → TSDB → response

This reuses the existing telemetry pipeline. In prod mode, data has 30-second resolution. In debug mode (if switched), 1-second resolution.

Metrics Catalog

MetricDescriptionUnit
gpu_utilizationSM occupancy per GPU%
gpu_memory_usedGPU memory in usebytes
gpu_power_drawGPU power consumptionwatts
cpu_utilizationCPU usage per node%
memory_usedSystem memory in usebytes
network_tx_bytesNetwork bytes sentbytes/s
network_rx_bytesNetwork bytes receivedbytes/s
io_read_bytesStorage read throughputbytes/s
io_write_bytesStorage write throughputbytes/s
io_latency_p99Storage I/O latency (p99)microseconds

Display Modes

ModeFlagContent
Summary (default)Aggregated across all nodes: mean GPU%, total mem, total I/O
Per-node--per-nodeOne row per node
Per-GPU--per-gpuOne row per GPU across all nodes
Wide--wideAll metrics in a wide table

REST + gRPC Access

  • REST: GET /v1/allocations/{id}/metrics?mode=summary&duration=5m
  • gRPC: QueryMetrics RPC with MetricsQueryRequest

CLI Usage

# Summary view (default)
lattice top 12345

# Per-node breakdown
lattice top 12345 --per-node

# Per-GPU breakdown
lattice top 12345 --per-gpu

# Wide format with all metrics
lattice top 12345 --wide

# Custom time window
lattice top 12345 --duration=1h

Live Telemetry Stream (lattice watch)

Push-Based Event Stream

Unlike lattice top (which queries TSDB), lattice watch opens a push-based stream from node agents for near-real-time events.

lattice watch <id> → lattice-api → fan-out → node agents
                          ↑
                     stream merge
                          ↑
             per-node MetricsEvent streams

Relationship to Telemetry Modes

Telemetry Modelattice top Resolutionlattice watch Resolution
prod30s (TSDB)30s (prod aggregation from node agent)
debug1s (TSDB)1s (raw events from node agent)
audit30s (TSDB)30s + access events

Switching to debug mode (lattice telemetry --alloc=12345 --mode=debug) increases resolution for both top and watch.

Stream Content

Each MetricsEvent contains:

  • Timestamp and node ID
  • Current metric values (GPU, CPU, memory, network, I/O)
  • Threshold alerts (if any metric exceeds configured bounds)

Alerts are generated by node agents when metrics cross thresholds:

  • GPU utilization drops below 10% (potential hang)
  • GPU memory utilization exceeds 95% (OOM risk)
  • Network error rate exceeds threshold
  • I/O latency spike detected

CLI Usage

# Watch all metrics (refreshing display)
lattice watch 12345

# Watch specific metrics
lattice watch 12345 --metrics=gpu_utilization,memory_used

# Watch with alerts only (suppress normal updates)
lattice watch 12345 --alerts-only

Diagnostics View

Network Diagnostics

Network health is critical for multi-node allocations. Diagnostics expose Slingshot-specific metrics that are otherwise invisible to users.

MetricDescriptionSource
CSIG congestionIn-band congestion signals per Slingshot groupeBPF CSIG tap
Group spanNumber of dragonfly groups the allocation spansTopology model
Inter-node bandwidthMeasured bandwidth between node pairseBPF network flow
NVLink throughputGPU-to-GPU bandwidth (intra-node)NVML

Storage Diagnostics

MetricDescriptionSource
QoS floor vs actualConfigured storage QoS vs measured throughputVAST API + eBPF I/O
Latency histogramI/O latency distribution (p50/p95/p99)eBPF block I/O
Mount healthPer-mount status (NFS, S3, scratch)Node agent
IOPSRead/write operations per secondeBPF block I/O

Combined Diagnostics

lattice diag combines network and storage diagnostics into a single view with health indicators:

$ lattice diag 12345

Network:
  Group span:     2 groups (g3, g7)
  CSIG congestion: LOW (0.02 avg)
  Inter-node BW:  190 GB/s avg (target: 200 GB/s) ✓

Storage:
  /data/input (NFS):  12.5 GB/s read (QoS floor: 10 GB/s) ✓
  /scratch (NVMe):    6.2 GB/s write, p99 latency: 45µs ✓
  /home (NFS):        0.1 GB/s (idle) ✓

GPUs:
  SM occupancy:   92% avg across 256 GPUs ✓
  NVLink:         850 GB/s avg (of 900 GB/s) ✓
  ECC errors:     0 ✓

CLI Usage

# Full diagnostics
lattice diag 12345

# Network only
lattice diag 12345 --network

# Storage only
lattice diag 12345 --storage

Cross-Allocation Comparison

TSDB Query

Compares metrics across multiple allocations by querying the same TSDB data used for lattice top. Useful for regression detection across training runs.

Time Alignment

Comparisons use relative-to-start time alignment: each allocation’s metrics are indexed from t=0 (allocation start), not wall clock time. This allows meaningful comparison of allocations that ran at different times.

CLI Usage

# Compare two allocations
lattice compare 12345 12346

# Compare specific metric
lattice compare 12345 12346 --metric=gpu_utilization

# JSON output for scripting
lattice compare 12345 12346 --output=json

REST Interface

GET /v1/compare?ids=12345,12346&metrics=gpu_utilization,io_write_bytes&align=relative

Application Profiling Integration

Scope

Lattice provides mechanisms for profiling, not profiler implementations. Users bring their own profiling tools, delivered via tools_uenv.

Profiler Delivery

Profiling tools are packaged as uenv images and mounted alongside the application uenv:

environment:
  uenv: "prgenv-gnu/24.11:v1"        # application stack
  tools_uenv: "profiling/2024.1"     # profilers: nsight, vtune, darshan, etc.

The tools_uenv mount provides profiler binaries without contaminating the application environment.

Usage Patterns

Batch profiling (non-interactive):

# Submit with profiling tools
lattice submit --uenv=prgenv-gnu/24.11:v1 --tools-uenv=profiling/2024.1 script.sh
# Script uses profiler internally (e.g., nsys profile ./train)
# Results written to output directory

Interactive profiling (attach-based):

# Attach and run profiler interactively
lattice attach 12345 --command="nsys profile --delay=60 -o /scratch/profile ./train"

Darshan / Score-P Integration Notes

  • Darshan: LD_PRELOAD-based I/O profiling. No Lattice-specific integration needed; user loads Darshan from tools_uenv and sets LD_PRELOAD. Darshan logs written to scratch/output.
  • Score-P: Instrumentation-based profiling. User compiles with Score-P wrappers from tools_uenv. Lattice provides no special support beyond tools delivery and attach.

Security Model

Authorization

All observability endpoints are scoped by OIDC token claims:

  • Users can only query their own allocations (or allocations in their tenant, if tenant-admin)
  • Token scopes: allocations:read (metrics, logs, diagnostics), allocations:attach (interactive attach)
  • Sensitive allocations: only the claiming user (verified against Raft audit log)

Rate Limiting

All rate limits are per user (identified by OIDC subject claim). Tenant admins and system admins share the same limits unless overridden in system configuration.

EndpointRate LimitScopeRationale
Attach5 concurrent sessionsPer userResource-intensive (PTY per session)
StreamLogs10 concurrent streamsPer userMemory (ring buffer readers)
QueryMetrics60 req/minPer userTSDB query load
StreamMetrics5 concurrent streamsPer userNode agent fan-out
Diagnostics30 req/minPer userTSDB + fabric query load
Compare10 req/minPer userMulti-alloc TSDB queries

When rate limit is exceeded:

  • Concurrent limits (Attach, StreamLogs, StreamMetrics): New request rejected with 429 Too Many Requests and a message: "maximum concurrent sessions reached (5/5). Close an existing session to open a new one."
  • Request-rate limits (QueryMetrics, Diagnostics, Compare): Request rejected with 429 Too Many Requests and Retry-After header indicating seconds until the next request is allowed.
  • No queueing — rejected requests must be retried by the client.

Admin override: System admins can adjust per-user rate limits via configuration:

rate_limits:
  attach_max_concurrent: 10       # override default of 5
  query_metrics_per_minute: 120   # override default of 60

Data Sensitivity

Data TypeSensitivityHandling
Metrics (GPU%, CPU%, I/O)LowStandard OIDC scoping
Logs (stdout/stderr)MediumMay contain application data; encrypted at rest for sensitive
Attach (interactive terminal)HighSession recorded for sensitive; PTY access = code execution
Diagnostics (network/storage)LowInfrastructure metrics, no application data
Profiling outputMediumWritten to user’s storage, no Lattice-managed persistence