Monitoring Guide

Overview

Monitor Sovra with Prometheus for metrics and alerting, Grafana for dashboards, and Loki for log aggregation.

Architecture

┌─────────────────────────────────────────┐
│ Sovra API Gateway (/metrics)            │
│   Unified service: workspace,           │
│   federation, policy, audit, edge, CRK  │
└──────────────┬──────────────────────────┘
               │ scrape
┌──────────────▼──────────────────────────┐
│ Prometheus                              │
│ ├─ Storage (15d retention)              │
│ └─ Alertmanager                         │
└──────────────┬──────────────────────────┘
               │ query
┌──────────────▼──────────────────────────┐
│ Grafana                                 │
│ ├─ Dashboards                           │
│ └─ Alerts                               │
└─────────────────────────────────────────┘

Quick Setup

# Deploy monitoring stack
kubectl apply -k infrastructure/kubernetes/monitoring/

# Wait for pods
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus \
  -n monitoring \
  --timeout=300s

# Access Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Default credentials: admin/admin (change the password after first login)

Prometheus Configuration

# prometheus-config.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-gateway'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - sovra
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api-gateway
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  - job_name: 'vault'
    static_configs:
      - targets:
          - 'vault-0.vault:8200'
          - 'vault-1.vault:8200'
          - 'vault-2.vault:8200'
    metrics_path: '/v1/sys/metrics'
    params:
      format: ['prometheus']

  - job_name: 'postgresql'
    static_configs:
      - targets:
          - 'postgres-exporter:9187'
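
The `__address__` rule is the least obvious part of this config. Prometheus joins the values of `source_labels` with `;`, matches the fully anchored regex against the joined string, and writes the replacement into `target_label`; when the regex does not match, the label is left unchanged. The regex above therefore expects input of the form `address;port`. The sketch below mimics that behaviour with `grep` and `sed` so the rewrite can be tried without a cluster. The function name `rewrite_address` and the sample addresses are illustrative assumptions, not part of Sovra or Prometheus.

```shell
# Mimic the __address__ relabel rule outside Prometheus.
# POSIX ERE has no non-capturing groups, so the optional port is group 2
# and the annotation port is group 3 (the Prometheus regex uses $1 and $2).
rewrite_address() {
  address="$1"   # value of __address__, e.g. "10.0.0.5:8080"
  port="$2"      # value of the prometheus.io/port annotation, may be empty
  joined="${address};${port}"
  if printf '%s\n' "$joined" | grep -Eq '^[^:]+(:[0-9]+)?;[0-9]+$'; then
    printf '%s\n' "$joined" | sed -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/'
  else
    # No match: a replace action leaves the target label untouched.
    printf '%s\n' "$address"
  fi
}

rewrite_address "10.0.0.5:8080" "9090"   # prints 10.0.0.5:9090
rewrite_address "10.0.0.5" "9090"        # prints 10.0.0.5:9090
rewrite_address "10.0.0.5:8080" ""       # prints 10.0.0.5:8080
```

The last call shows why a pod without a port annotation keeps its discovered address.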

Key Metrics

Control Plane Metrics

The API gateway exposes HTTP-level and domain-level metrics. HTTP middleware automatically records request counts, latency, and in-flight requests. Domain metric collectors are wired into each service and record operation-level counters, durations, and gauges.

# API Gateway HTTP (subsystem = serviceName, e.g. "api_gateway")
sovra_api_gateway_http_requests_total{method,path,status}
sovra_api_gateway_http_request_duration_seconds{method,path}
sovra_api_gateway_http_active_requests
sovra_api_gateway_errors_total{type}
sovra_api_gateway_auth_attempts_total{method,result}

# Policy Engine
sovra_policy_evaluations_total{result}          # allow/deny per evaluation
sovra_policy_evaluation_duration_seconds         # evaluation latency histogram
sovra_policy_cache_hits_total
sovra_policy_cache_misses_total

# Key Lifecycle (workspace operations)
sovra_keys_operations_total{operation,result}    # create/encrypt/decrypt/rotate
sovra_keys_operation_duration_seconds{operation}
sovra_keys_active_total{type}
sovra_keys_rotation_age_seconds{workspace_hash}

# Audit Service
sovra_audit_events_total{event_type}
sovra_audit_write_duration_seconds
sovra_audit_sync_lag_seconds
sovra_audit_queue_depth

# Federation
sovra_federation_connections_active{status}
sovra_federation_requests_total{operation,result}  # init/establish/revoke/rotate
sovra_federation_sync_duration_seconds
sovra_federation_errors_total{type}
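
Each labelled metric above is exported as one series per label combination, so a raw scrape of `/metrics` shows many lines per metric name. As a rough sketch, the helper below sums all samples of one metric from Prometheus text-format input on stdin, which is handy for a quick sanity check before any dashboard exists. The function name `sum_metric`, the label values, and the sample numbers are made up for illustration; the helper also assumes the endpoint emits no timestamps after the sample value.

```shell
# Sum every sample of one metric from Prometheus text-format input on stdin.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                                  # skip HELP/TYPE comments
    {
      i = index($1, "{")                           # start of the label set, if any
      metric = (i > 0) ? substr($1, 1, i - 1) : $1
      if (metric == name) total += $NF             # last field is the sample value
    }
    END { print total + 0 }
  '
}

# Illustrative input only; real values come from the gateway /metrics endpoint.
sum_metric sovra_api_gateway_http_requests_total <<'EOF'
# TYPE sovra_api_gateway_http_requests_total counter
sovra_api_gateway_http_requests_total{method="GET",path="/health",status="200"} 120
sovra_api_gateway_http_requests_total{method="POST",path="/keys",status="201"} 30
sovra_api_gateway_http_active_requests 4
EOF
# prints 150
```

Inside Prometheus the equivalent is `sum(sovra_api_gateway_http_requests_total)`.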

Edge Node Metrics

# Vault
vault_core_unsealed
vault_runtime_alloc_bytes
vault_runtime_num_goroutines
vault_core_leadership_setup_failed
vault_core_leadership_lost

# Edge Agent (domain metrics from edge service)
sovra_edge_heartbeat_age_seconds{node_id}
sovra_edge_cert_expiry_seconds{node_id}
sovra_edge_sync_duration_seconds{sync_type}
sovra_edge_nodes_total{status}                   # register/unregister
sovra_edge_operations_total{operation,result}     # register/unregister/sync_policies/sync_keys

Grafana Dashboards

Import Pre-Built Dashboards

# Import Sovra overview dashboard
kubectl apply -f infrastructure/kubernetes/monitoring/dashboards/sovra-overview.json

# Import edge node dashboard
kubectl apply -f infrastructure/kubernetes/monitoring/dashboards/edge-nodes.json

# Import federation dashboard
kubectl apply -f infrastructure/kubernetes/monitoring/dashboards/federation.json

Dashboard Panels

Sovra Overview Dashboard:

Request rate (req/s)
Request latency (p50, p95, p99)
Error rate
Active workspaces
Federation health
Audit event rate

Edge Nodes Dashboard:

Vault seal status
Heartbeat lag
Memory usage
Disk usage
Certificate expiry
Operation latency

Federation Dashboard:

Active federations
Cross-org requests
Federation errors
Workspace activity
Audit sync lag

Alerts

Critical Alerts

# prometheus-alerts.yaml
groups:
  - name: sovra-critical
    rules:
      - alert: SovraDown
        expr: up{job="api-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Sovra service {{ $labels.job }} is down"
          
      - alert: VaultSealed
        expr: vault_core_unsealed == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Vault {{ $labels.instance }} is sealed"
          
      - alert: DatabaseDown
        expr: up{job="postgresql"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
          
      - alert: FederationDown
        expr: sovra_federation_connections_active{status="healthy"} < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No healthy federation connections"

Warning Alerts

# prometheus-alerts.yaml (continued; this group also belongs under groups:)
  - name: sovra-warning
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(sovra_api_gateway_http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API latency is high (p95 > 500ms)"
          
      - alert: HighErrorRate
        expr: sum(rate(sovra_api_gateway_errors_total[5m])) / sum(rate(sovra_api_gateway_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate is high (>5%)"
          
      - alert: CertificateExpiringSoon
        expr: sovra_edge_cert_expiry_seconds < 604800  # 7 days
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate on edge node {{ $labels.node_id }} expires in < 7 days"
          
      - alert: AuditLagHigh
        expr: sovra_audit_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Audit log lag is high (> 5 minutes)"

Alertmanager Configuration

# alertmanager-config.yaml
global:
  resolve_timeout: 5m
  slack_api_url: '<slack-webhook-url>'

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#sovra-alerts'
        title: 'Sovra Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'

  - name: 'slack'
    slack_configs:
      - channel: '#sovra-warnings'
        title: 'Sovra Warning'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
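
Syntax errors in rule or routing files are easier to catch before they are applied. A minimal sketch, assuming the files are saved under the names used in this guide and that `promtool` (shipped with Prometheus) and `amtool` (shipped with Alertmanager) are installed locally; the wrapper name `check_monitoring_config` is hypothetical. Replace the `<...>` placeholders first, since a placeholder webhook URL is likely to fail validation.

```shell
# Validate alerting rules and the Alertmanager config before applying them.
check_monitoring_config() {
  rules_file="${1:-prometheus-alerts.yaml}"
  am_file="${2:-alertmanager-config.yaml}"
  # Fail early with a clear message if either tool is missing.
  for tool in promtool amtool; do
    command -v "$tool" >/dev/null 2>&1 || {
      echo "$tool not found in PATH" >&2
      return 127
    }
  done
  # Both checks must pass for the function to succeed.
  promtool check rules "$rules_file" && amtool check-config "$am_file"
}
```

Run it as `check_monitoring_config prometheus-alerts.yaml alertmanager-config.yaml`; a non-zero exit status means at least one file failed validation.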

Log Aggregation

Loki Setup

# Deploy Loki
kubectl apply -f infrastructure/kubernetes/monitoring/loki/

# Deploy Promtail
kubectl apply -f infrastructure/kubernetes/monitoring/promtail/

Query Examples

# All errors
{app="sovra"} |= "error" | json

# API gateway errors
{app="api-gateway"} |= "error"

# Audit events
{app="api-gateway"} | json | event_type="workspace.access"

# Failed authentication
{app="api-gateway"} | json | status_code="401"

Performance Monitoring

Query Performance

# API endpoint latency
# (mean request duration in seconds over the last 5 minutes)
rate(sovra_api_gateway_http_request_duration_seconds_sum[5m])
/
rate(sovra_api_gateway_http_request_duration_seconds_count[5m])

Resource Usage

# Memory usage
container_memory_usage_bytes{namespace="sovra"}

# CPU usage
rate(container_cpu_usage_seconds_total{namespace="sovra"}[5m])

# Disk usage (used fraction of each persistent volume claim)
kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"sovra-.*"}
/
kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"sovra-.*"}

Troubleshooting

High CPU Usage

# Identify which service
kubectl top pods -n sovra

# Check metrics
# High CPU usually means a high request rate (sovra_api_gateway_http_requests_total)
# or expensive policy evaluation (sovra_policy_evaluation_duration_seconds)

High Memory Usage

# Check for memory leaks
# A container_memory_usage_bytes trend that keeps rising under steady traffic suggests a leak

# Restart if needed
kubectl rollout restart deployment/api-gateway -n sovra

Missing Metrics

# Check service discovery (ServiceMonitors exist only when the Prometheus Operator is in use)
kubectl get servicemonitors -n monitoring

# Check Prometheus targets
# Access Prometheus UI -> Status -> Targets

# Check pod annotations
kubectl get pod -n sovra -o yaml | grep prometheus
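
The `grep` above prints matching lines without saying which pod they belong to. A sketch of a per-pod view using `kubectl` custom columns, assuming the `prometheus.io/scrape` and `prometheus.io/port` annotations that the scrape config in this guide keys on; the function name `list_scrape_annotations` is hypothetical.

```shell
# List each pod with its Prometheus scrape annotations.
# kubectl prints <none> where an annotation is missing.
list_scrape_annotations() {
  namespace="${1:-sovra}"
  kubectl get pods -n "$namespace" \
    -o custom-columns='NAME:.metadata.name,SCRAPE:.metadata.annotations.prometheus\.io/scrape,PORT:.metadata.annotations.prometheus\.io/port'
}
```

Call `list_scrape_annotations` for the default `sovra` namespace, or pass another namespace as the first argument. A pod with `<none>` in the SCRAPE column is dropped by the `keep` rule in the `api-gateway` job.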