ADR-015: Observability Contract

Status: Accepted
Date: 2026-04-17
Context: A-ADV-7 (observability)

Decision

OpenTelemetry-native observability with tenant-aware metric scoping.

Metrics (Prometheus-compatible, via OpenTelemetry)

Context      Key metrics
Log          delta_append_latency, raft_commit_latency, shard_count, shard_size, compaction_duration, election_count
Chunk        write_latency, read_latency, dedup_hit_rate, gc_chunks_collected, repair_count, pool_utilization
Composition  create_latency, delete_count, multipart_in_progress, refcount_operations
View         materialization_lag_ms, staleness_violation_count, rebuild_progress, pin_count
Gateway      request_latency (p50/p99/p999), requests_per_sec, error_rate, active_connections
Client       fuse_latency, transport_type, cache_hit_rate, prefetch_effectiveness
Key Mgr      derive_latency, rotation_in_progress, kms_reachability, cache_hit_rate
Control      tenant_count, namespace_count, quota_utilization, federation_sync_lag
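
The Gateway row reports request_latency at p50, p99, and p999. As a rough illustration of what those three figures mean, the sketch below computes them from raw samples using the nearest-rank method. This is an assumption for illustration only: the ADR does not say how quantiles are derived, and a Prometheus-compatible pipeline would more typically estimate them from histogram buckets. The names nearest_rank and latency_summary are hypothetical. Python is used only for brevity.

```python
def nearest_rank(samples, per_mille):
    """Return the nearest-rank quantile; per_mille=999 means p999."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= per_mille <= 1000:
        raise ValueError("per_mille must be in 1..1000")
    ordered = sorted(samples)
    # Integer ceiling division avoids floating-point rounding at p999.
    rank = -(-len(ordered) * per_mille // 1000)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize latency samples at the three quantiles named in the table."""
    return {
        "p50": nearest_rank(samples_ms, 500),
        "p99": nearest_rank(samples_ms, 990),
        "p999": nearest_rank(samples_ms, 999),
    }
```

With fewer than 1000 samples, p999 under this method is simply the largest sample, which is one reason tail quantiles need a large observation window to be meaningful.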

Zero-trust metric scoping

  • Cluster admin sees: aggregated metrics, per-node metrics, system health. Per-tenant metrics are anonymized (tenant_id replaced with opaque hash) unless cluster admin has approved access for that tenant.
  • Tenant admin sees: their own tenant’s metrics via tenant audit export.
  • No metric exposes: file names, directory structure, data content, or access patterns attributable to a specific tenant (without approval).
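
The anonymization rule for cluster admins can be sketched as a label-rewriting step. This is a minimal illustration, not the project's implementation: it assumes the "opaque hash" is a keyed HMAC-SHA256 truncated to 16 hex characters, and the names SCOPING_KEY, opaque_tenant_id, and scope_labels_for_cluster_admin are hypothetical. Python is used only for brevity.

```python
import hashlib
import hmac

# Hypothetical per-cluster secret; a real system would load it from a secret store.
SCOPING_KEY = b"example-cluster-scoping-key"


def opaque_tenant_id(tenant_id: str, key: bytes = SCOPING_KEY) -> str:
    """Return a stable stand-in for a tenant ID that does not reveal it."""
    digest = hmac.new(key, tenant_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "t-" + digest[:16]


def scope_labels_for_cluster_admin(labels: dict, approved_tenants: set) -> dict:
    """Copy metric labels, hashing tenant_id unless access was approved."""
    scoped = dict(labels)  # never mutate the caller's labels
    tenant = scoped.get("tenant_id")
    if tenant is not None and tenant not in approved_tenants:
        scoped["tenant_id"] = opaque_tenant_id(tenant)
    return scoped
```

A keyed hash is chosen here over a plain one so that someone who can guess likely tenant names cannot confirm them by hashing candidates without the key. The hash stays stable across scrapes, so per-tenant series remain distinguishable without being attributable.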

Distributed tracing

  • Every write/read path carries a trace ID (OpenTelemetry context propagation)
  • Traces span: client → gateway → composition → log → chunk → view
  • Tenant-scoped traces are visible only to the tenant admin
  • Cluster admin sees system-level spans (no tenant content in span attributes)
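
To make the propagation path concrete, the sketch below passes one trace ID across the six hops listed above. It assumes the W3C Trace Context traceparent header format (version 00, 32-hex trace ID, 16-hex span ID, 2-hex flags), which OpenTelemetry propagators commonly use; the ADR itself does not name the wire format. The helper names are hypothetical, and real services would rely on the OpenTelemetry SDK rather than hand-built headers.

```python
import re
import secrets

# traceparent: version "00", trace ID, parent span ID, trace flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random trace ID and root span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the trace ID and flags, but mint a new span ID for the next hop."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def propagate(hops):
    """Return (hop, header) pairs; the first hop originates the trace."""
    header = new_traceparent()
    seen = []
    for hop in hops:
        seen.append((hop, header))
        header = child_traceparent(header)
    return seen
```

Calling propagate(["client", "gateway", "composition", "log", "chunk", "view"]) yields six headers that share one trace ID and differ only in span ID, which is what lets a backend stitch the spans into a single end-to-end trace.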

Structured logging

  • JSON structured logs, one line per event
  • Log levels: ERROR, WARN, INFO, DEBUG, TRACE
  • Tenant-identifying fields are present but content fields are encrypted
  • Logs ship to the same audit/observability pipeline
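
The logging rules above might look like the following formatter, shown as a sketch under several assumptions. The field names (level, event, tenant_id, content) are hypothetical, TRACE is registered as a custom level because Python's logging module has none, and placeholder_seal is a stand-in that is NOT encryption: it only marks where a real cipher call would go, since the ADR does not specify the encryption scheme.

```python
import json
import logging

TRACE = 5  # custom level below DEBUG (10); stdlib logging defines no TRACE
logging.addLevelName(TRACE, "TRACE")
logging.addLevelName(logging.WARNING, "WARN")  # match the ADR's level names


def placeholder_seal(value: str) -> str:
    """Stand-in only, NOT encryption; a real system would call its cipher here."""
    return "sealed:" + str(len(value))


class JsonLineFormatter(logging.Formatter):
    """Emit one JSON object per event, sealing every content field."""

    def __init__(self, seal_field):
        super().__init__()
        self._seal = seal_field

    def format(self, record):
        content = getattr(record, "content", {})
        event = {
            "level": record.levelname,
            "event": record.getMessage(),
            "tenant_id": getattr(record, "tenant_id", None),  # left readable
            "content": {k: self._seal(str(v)) for k, v in content.items()},
        }
        # json.dumps escapes embedded newlines, so each event stays on one line.
        return json.dumps(event, sort_keys=True)
```

A caller would attach the formatter to a handler and pass fields through the extra argument, for example log.warning("chunk repaired", extra={"tenant_id": "acme", "content": {"path": "/a/b"}}).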

Consequences

  • OpenTelemetry SDK in both Rust and Go codebases
  • Metric cardinality must be bounded (no unbounded label values)
  • Tracing overhead is roughly 1-2% on the data path (acceptable for production)
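
The cardinality bound in the list above can be enforced at the point where label values are recorded. The sketch below shows one possible guard, assuming that values beyond a fixed cap are collapsed into a single overflow bucket. The class name BoundedLabel, the cap, and the "other" bucket are all hypothetical choices rather than anything the ADR prescribes.

```python
from collections import Counter


class BoundedLabel:
    """Cap the number of distinct values a metric label may take."""

    def __init__(self, max_values: int, overflow: str = "other"):
        if max_values < 1:
            raise ValueError("max_values must be at least 1")
        self._max = max_values
        self._overflow = overflow
        self._seen = set()

    def admit(self, value: str) -> str:
        """Return the value if it fits under the cap, else the overflow bucket."""
        if value in self._seen:
            return value
        if len(self._seen) < self._max:
            self._seen.add(value)
            return value
        return self._overflow


def count_by_label(values, guard):
    """Count events per label value after applying the guard."""
    return Counter(guard.admit(v) for v in values)
```

With a cap of 2, the values a, b, c, a, d produce series for a, b, and other only, so the number of time series stays at most max_values + 1 regardless of input.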