Sovra

Telemetry Guide

This guide covers Sovra’s observability stack, including distributed tracing, metrics, and logging.

Overview

Sovra provides telemetry through three signals: distributed tracing, metrics, and structured logging.

Privacy-First Design

Sovra’s telemetry is designed with privacy as a core principle:

NEVER include in traces/logs:

  • Request/response bodies
  • User identifiers (emails, IDs)
  • Tokens, API keys, passwords
  • Certificates, private keys
  • IP addresses, hostnames
  • Query parameters
  • Most HTTP headers

Only sanitized, non-identifying data is included in telemetry.

Distributed Tracing

Configuration

# sovra.yaml
telemetry:
  enabled: true
  endpoint: otel-collector.monitoring.svc:4318
  service_name: sovra-control-plane
  service_version: 1.0.0
  sample_rate: 0.1  # 10% of requests

Environment Variables

Variable                         Description              Default
SOVRA_TELEMETRY_ENABLED          Enable tracing           false
SOVRA_TELEMETRY_ENDPOINT         OTLP collector endpoint  -
SOVRA_TELEMETRY_SERVICE_NAME     Service name in traces   -
SOVRA_TELEMETRY_SERVICE_VERSION  Service version          -
SOVRA_TELEMETRY_SAMPLE_RATE      Sampling rate (0.0-1.0)  0.1

Trace Propagation

Sovra propagates trace context using the W3C Trace Context headers, traceparent and tracestate.

This lets spans emitted by different services be joined into a single distributed trace.

Safe Attributes

Only the following attributes are included in traces:

// HTTP middleware attributes (sanitized)
semconv.HTTPMethod("POST")
semconv.HTTPRoute("/v1/workspace/{workspace}/encrypt")
semconv.HTTPStatusCode(200)

// Database attributes (no queries)
semconv.DBSystem("postgresql")
semconv.DBOperation("SELECT")

// Service-level span attributes (via telemetry.NewSafeAttributes)
attribute.String("operation", "encrypt")
attribute.String("result", "success")
attribute.Int64("duration_ms", 42)

Service-level spans use the telemetry.NewSafeAttributes() builder to ensure only approved attributes are attached:

ctx, span := otel.Tracer("sovra").Start(ctx, "workspace.encrypt")
defer span.End()
// ... operation logic ...
span.SetAttributes(telemetry.NewSafeAttributes().
    Operation("encrypt").
    Result("success").
    Build()...)

On error, spans record the error and set the result attribute:

span.RecordError(err)
span.SetAttributes(telemetry.NewSafeAttributes().
    Operation("encrypt").
    Result("error").
    Build()...)

Instrumented Operations

The following services have operation-level spans:

Service     Spans
Workspace   workspace.create, workspace.encrypt, workspace.decrypt, workspace.rotate_dek, workspace.archive, workspace.delete
Policy      policy.create, policy.update, policy.delete, policy.evaluate
Federation  federation.init, federation.establish, federation.revoke, federation.rotate_certificate
Identity    identity.create_admin, identity.bootstrap_admin, identity.enroll_admin
Edge        edge.register, edge.sync_policies, edge.sync_keys

Example Trace

Trace: 4b3a9c2d1e5f4a8b9c2d1e5f4a8b9c2d
├── api-gateway: POST /v1/workspace/{workspace}/encrypt (42ms)
│   ├── workspace.encrypt (35ms)
│   │   ├── policy.evaluate (5ms)
│   │   │   └── opa: query (3ms)
│   │   ├── vault: transit/encrypt (25ms)
│   │   └── audit: record-event (2ms)
│   │       └── postgresql: INSERT (1ms)

Metrics

Prometheus Endpoints

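The control plane (API gateway) exposes Prometheus metrics at /metrics; edge nodes expose theirs through Vault's metrics API:

# Control plane metrics
curl http://api-gateway:9090/metrics

# Edge node metrics (URL quoted so the shell does not interpret "?")
curl "http://vault:8200/v1/sys/metrics?format=prometheus"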

Core Metrics

API Gateway

# Request rate
sovra_api_requests_total{method="POST",path="/v1/workspace/{workspace}/encrypt",status="200"}

# Request latency (histogram)
sovra_api_request_duration_seconds_bucket{method="POST",path="/v1/workspace/{workspace}/encrypt",le="0.1"}

# Active connections
sovra_api_active_connections

# Errors
sovra_api_errors_total{type="policy_violation"}

Policy (via API Gateway)

# Policy evaluations
sovra_policy_evaluations_total{workspace="cancer-research",result="allow"}

# Evaluation latency
sovra_policy_evaluation_duration_seconds

# Cache performance
sovra_policy_cache_hits_total
sovra_policy_cache_misses_total

Audit (via API Gateway)

# Audit events
sovra_audit_events_total{type="workspace.access",org="eth-zurich"}

# Write latency
sovra_audit_write_duration_seconds

# Lag (for async writes)
sovra_audit_lag_seconds

Federation

# Active federations
sovra_federation_connections{partner="partner-university",status="healthy"}

# Federation requests
sovra_federation_requests_total{partner="partner-university",operation="sync"}

# Federation errors
sovra_federation_errors_total{partner="partner-university",type="timeout"}

Edge Nodes

# Vault status
vault_core_unsealed

# Vault memory
vault_runtime_alloc_bytes

# Edge agent heartbeat
sovra_edge_heartbeat_seconds

# Certificate expiry
sovra_edge_cert_expiry_seconds

Grafana Dashboards

Pre-built dashboards are available in infrastructure/kubernetes/monitoring/dashboards/.

Import Dashboards

# Copy dashboards to Grafana
kubectl cp infrastructure/kubernetes/monitoring/dashboards/ \
  monitoring/grafana-xxx:/var/lib/grafana/dashboards/

# Or use ConfigMap
kubectl apply -f infrastructure/kubernetes/monitoring/grafana-dashboards.yaml

Logging

Structured Logging

All Sovra services use structured JSON logging:

{
  "timestamp": "2026-01-30T14:30:00.123Z",
  "level": "info",
  "service": "api-gateway",
  "trace_id": "4b3a9c2d1e5f4a8b",
  "span_id": "9c2d1e5f",
  "message": "request completed",
  "method": "POST",
  "path": "/v1/workspace/{workspace}/encrypt",
  "status": 200,
  "duration_ms": 42
}

Log Levels

Level  Description         Use Case
error  Error conditions    Failures requiring attention
warn   Warning conditions  Potential issues
info   Normal operations   Request completion, state changes
debug  Debug information   Development troubleshooting

Configure via:

log_level: info
log_format: json

Log Aggregation

Loki Setup

# promtail-config.yaml
scrape_configs:
  - job_name: sovra
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - sovra
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
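    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
      - labels:
          # trace_id is deliberately not a label: it is unbounded and would
          # create one stream per trace. Filter on it at query time with
          # "| json | trace_id=..." instead.
          level:
          service: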

Log Queries

# All errors
{namespace="sovra"} |= "error" | json

# Specific service errors
{namespace="sovra", app="api-gateway"} |= "error"

# By trace ID
{namespace="sovra"} | json | trace_id="4b3a9c2d1e5f4a8b"

# Failed authentication
{namespace="sovra", app="api-gateway"} | json | status=401

# Slow requests (>500ms)
{namespace="sovra"} | json | duration_ms > 500

OpenTelemetry Collector

Deployment

# otel-collector.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
          grpc:
            endpoint: 0.0.0.0:4317
    
    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      
      # Remove sensitive attributes if any slip through
      attributes:
        actions:
          - key: user.email
            action: delete
          - key: user.id
            action: delete
          - key: http.client_ip
            action: delete
    
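    exporters:
      # Note: recent Collector releases no longer ship the jaeger exporter.
      # On those versions, use an otlp exporter pointed at Jaeger's OTLP
      # endpoint instead.
      jaeger:
        endpoint: jaeger-collector.monitoring:14250
        tls:
          insecure: true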
      
      prometheus:
        endpoint: 0.0.0.0:8889
    
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch, attributes]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]

Accessing Traces

# Port-forward Jaeger UI
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686

# Open http://localhost:16686

Alerting

Prometheus Alerts

# prometheus-alerts.yaml
groups:
  - name: sovra-telemetry
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: sum(rate(sovra_api_errors_total[5m])) / sum(rate(sovra_api_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate (>5%)"
      
      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(sovra_api_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 500ms"
      
      # Missing traces
      - alert: TracingDown
        expr: up{job="otel-collector"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "OpenTelemetry collector is down"

Performance Impact

Resource Overhead

Component             CPU Impact  Memory Impact  Network
Tracing (10% sample)  ~1%         ~10MB          ~100KB/s
Metrics               ~0.5%       ~5MB           ~10KB/s
Logging               ~1%         ~20MB          ~50KB/s

Tuning

For high-throughput deployments:

telemetry:
  sample_rate: 0.01  # 1% sampling

For debugging:

telemetry:
  sample_rate: 1.0   # 100% sampling (temporary)
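
To show what a sample rate means in practice, here is a Go sketch of a deterministic, trace-ID-based sampling decision. It follows the general approach of trace-ID ratio samplers; whether Sovra samples exactly this way is an assumption, and shouldSample is a hypothetical helper.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample derives the decision from the trace ID alone, so every
// service that sees the same trace ID and rate reaches the same answer.
func shouldSample(traceID [16]byte, rate float64) bool {
	if rate >= 1 {
		return true
	}
	if !(rate > 0) { // also rejects NaN
		return false
	}
	// Treat the low 8 bytes as a 63-bit number and compare it with rate * 2^63.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	bound := uint64(rate * (1 << 63))
	return x < bound
}

func main() {
	var low, high [16]byte // low stays all zeros
	for i := range high {
		high[i] = 0xff
	}
	fmt.Println(shouldSample(low, 0.1), shouldSample(high, 0.1))
}
```

With uniformly distributed trace IDs, a rate of 0.01 keeps roughly 1 trace in 100, so 10,000 requests per second would yield on the order of 100 sampled traces per second.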

Troubleshooting

No Traces Appearing

# Check collector is running
kubectl get pods -n monitoring -l app=otel-collector

# Check collector logs
kubectl logs -n monitoring -l app=otel-collector

# Check service connectivity (any HTTP response, even an error status,
# shows the collector is reachable)
kubectl run -it --rm debug --image=curlimages/curl -- \
  curl -v http://otel-collector.monitoring:4318/v1/traces

Missing Metrics

# Forward the metrics port (leave this running)
kubectl port-forward -n sovra svc/api-gateway 9090:9090

# In a second terminal, check the metrics endpoint
curl http://localhost:9090/metrics

# Check Prometheus targets
# Access Prometheus UI -> Status -> Targets

High Cardinality

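Avoid high-cardinality label values such as user IDs, trace IDs, or raw request paths. Each distinct label combination creates a new time series, so prefer bounded values like the route template /v1/workspace/{workspace}/encrypt.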
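
One way to keep the path label bounded is to map each concrete request path onto a known route template before recording a metric. The Go sketch below assumes a hypothetical routeTemplates list and routeLabel helper; in practice the HTTP router usually provides the matched template directly.

```go
package main

import (
	"fmt"
	"strings"
)

// routeTemplates is a hypothetical list of known routes.
var routeTemplates = []string{
	"/v1/workspace/{workspace}/encrypt",
	"/v1/workspace/{workspace}/decrypt",
}

// routeLabel returns the matching template, or "other" for unknown paths,
// so the label can only take len(routeTemplates)+1 distinct values.
func routeLabel(path string) string {
	segs := strings.Split(strings.Trim(path, "/"), "/")
	for _, tmpl := range routeTemplates {
		tsegs := strings.Split(strings.Trim(tmpl, "/"), "/")
		if len(tsegs) != len(segs) {
			continue
		}
		match := true
		for i, t := range tsegs {
			isParam := strings.HasPrefix(t, "{") && strings.HasSuffix(t, "}")
			if isParam {
				if segs[i] == "" {
					match = false
					break
				}
				continue
			}
			if t != segs[i] {
				match = false
				break
			}
		}
		if match {
			return tmpl
		}
	}
	return "other"
}

func main() {
	fmt.Println(routeLabel("/v1/workspace/cancer-research/encrypt"))
	fmt.Println(routeLabel("/not/a/known/route"))
}
```

Collapsing unknown paths into a single "other" value also stops scanners and typos from creating new time series.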

Privacy Compliance

GDPR Considerations

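Sovra telemetry is designed to be GDPR-compliant: it excludes identifying data at the source (see Privacy-First Design) and keeps telemetry only for a bounded retention period.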

Data Retention

# Prometheus retention
prometheus:
  retention: 15d

# Loki retention
loki:
  retention_period: 30d

# Jaeger retention
jaeger:
  storage:
    es:
      max-span-age: 7d