Monitoring Guide
Overview
This guide covers monitoring Sovra with Prometheus (metrics and alert rules), Alertmanager (notification routing), Grafana (dashboards), and Loki (log aggregation).
Architecture
┌─────────────────────────────────────────┐
│  Sovra API Gateway (/metrics)           │
│  Unified service: workspace,            │
│  federation, policy, audit, edge, CRK   │
└──────────────┬──────────────────────────┘
               │ scrape
┌──────────────▼──────────────────────────┐
│  Prometheus                             │
│  ├─ Storage (15d retention)             │
│  └─ Alertmanager                        │
└──────────────┬──────────────────────────┘
               │ query
┌──────────────▼──────────────────────────┐
│  Grafana                                │
│  ├─ Dashboards                          │
│  └─ Alerts                              │
└─────────────────────────────────────────┘
Quick Setup
# Deploy monitoring stack
kubectl apply -k infrastructure/kubernetes/monitoring/
# Wait for pods
kubectl wait --for=condition=ready pod \
-l app.kubernetes.io/name=prometheus \
-n monitoring \
--timeout=300s
# Access Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Default credentials: admin/admin
Prometheus Configuration
# prometheus-config.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'api-gateway'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - sovra
    relabel_configs:
      # Keep only pods labelled app=api-gateway
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api-gateway
      # Keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Replace the port in __address__ with the prometheus.io/port annotation.
      # The regex matches "<address>;<port>", so both labels must be listed.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
  - job_name: 'vault'
    static_configs:
      - targets:
          - 'vault-0.vault:8200'
          - 'vault-1.vault:8200'
          - 'vault-2.vault:8200'
    metrics_path: '/v1/sys/metrics'
    params:
      format: ['prometheus']
  - job_name: 'postgresql'
    static_configs:
      - targets:
          - 'postgres-exporter:9187'
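The address relabel rule in the api-gateway job can be hard to read at a glance. The following is a minimal Python sketch of what that regex does to the pod address and the prometheus.io/port annotation once Prometheus joins them with ";". The function name `rewrite_address` and the sample addresses are illustrative assumptions, not part of Sovra or Prometheus.

```python
import re

# Same pattern as the relabel rule. Prometheus anchors relabel regexes at
# both ends, which fullmatch() reproduces here.
ADDRESS_PORT = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Mimic the `replace` action (hypothetical helper for illustration).

    Source label values are joined with ';' before the regex is applied.
    """
    joined = f"{address};{port_annotation}"
    match = ADDRESS_PORT.fullmatch(joined)
    if match is None:
        # No match: the target label keeps its current value.
        return address
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.0.5:8080", "9090"))  # 10.0.0.5:9090
print(rewrite_address("10.0.0.5", "9090"))       # 10.0.0.5:9090
print(rewrite_address("10.0.0.5:8080", ""))      # 10.0.0.5:8080 (annotation missing)
```

The last call shows why a pod without the port annotation is still scraped, but on whatever port service discovery found.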
Key Metrics
Control Plane Metrics
The API gateway exposes HTTP-level and domain-level metrics. HTTP middleware automatically records request counts, latency, and active connections. Domain metric collectors are wired into each service and record operation-level counters.
# API Gateway HTTP (subsystem = serviceName, e.g. "api_gateway")
sovra_api_gateway_http_requests_total{method,path,status}
sovra_api_gateway_http_request_duration_seconds{method,path}
sovra_api_gateway_http_active_requests
sovra_api_gateway_errors_total{type}
sovra_api_gateway_auth_attempts_total{method,result}
# Policy Engine
sovra_policy_evaluations_total{result} # allow/deny per evaluation
sovra_policy_evaluation_duration_seconds # evaluation latency histogram
sovra_policy_cache_hits_total
sovra_policy_cache_misses_total
# Key Lifecycle (workspace operations)
sovra_keys_operations_total{operation,result} # create/encrypt/decrypt/rotate
sovra_keys_operation_duration_seconds{operation}
sovra_keys_active_total{type}
sovra_keys_rotation_age_seconds{workspace_hash}
# Audit Service
sovra_audit_events_total{event_type}
sovra_audit_write_duration_seconds
sovra_audit_sync_lag_seconds
sovra_audit_queue_depth
# Federation
sovra_federation_connections_active{status}
sovra_federation_requests_total{operation,result} # init/establish/revoke/rotate
sovra_federation_sync_duration_seconds
sovra_federation_errors_total{type}
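The metric names above follow the usual Prometheus pattern of namespace, subsystem, and name joined by underscores. As a rough sketch of that convention, assuming the namespace is `sovra` and that a hyphenated service name such as `api-gateway` is normalized to `api_gateway` (the normalization step and the helper name `fq_metric_name` are assumptions for illustration):

```python
def fq_metric_name(namespace: str, subsystem: str, name: str) -> str:
    """Join the non-empty parts with underscores (hypothetical helper).

    Hyphens are replaced with underscores because '-' is not a valid
    character in a Prometheus metric name.
    """
    parts = [part.replace("-", "_") for part in (namespace, subsystem, name) if part]
    return "_".join(parts)


print(fq_metric_name("sovra", "api-gateway", "http_requests_total"))
print(fq_metric_name("sovra", "policy", "evaluations_total"))
```

This makes it easier to guess where a metric originates: the second segment of the name identifies the service or domain that recorded it.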
Edge Node Metrics
# Vault
vault_core_unsealed
vault_runtime_alloc_bytes
vault_runtime_num_goroutines
vault_core_leadership_setup_failed
vault_core_leadership_lost
# Edge Agent (domain metrics from edge service)
sovra_edge_heartbeat_age_seconds{node_id}
sovra_edge_cert_expiry_seconds{node_id}
sovra_edge_sync_duration_seconds{sync_type}
sovra_edge_nodes_total{status} # register/unregister
sovra_edge_operations_total{operation,result} # register/unregister/sync_policies/sync_keys
Grafana Dashboards
Import Pre-Built Dashboards
# Import Sovra overview dashboard
kubectl apply -f infrastructure/kubernetes/monitoring/dashboards/sovra-overview.json
# Import edge node dashboard
kubectl apply -f infrastructure/kubernetes/monitoring/dashboards/edge-nodes.json
# Import federation dashboard
kubectl apply -f infrastructure/kubernetes/monitoring/dashboards/federation.json
Dashboard Panels
Sovra Overview Dashboard:
- Request rate (req/s)
- Request latency (p50, p95, p99)
- Error rate
- Active workspaces
- Federation health
- Audit event rate
Edge Nodes Dashboard:
- Vault seal status
- Heartbeat lag
- Memory usage
- Disk usage
- Certificate expiry
- Operation latency
Federation Dashboard:
- Active federations
- Cross-org requests
- Federation errors
- Workspace activity
- Audit sync lag
Alerts
Critical Alerts
# prometheus-alerts.yaml
groups:
  - name: sovra-critical
    rules:
      - alert: SovraDown
        # Job name must match the scrape job defined in prometheus-config.yaml
        expr: up{job="api-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Sovra API gateway is down"
      - alert: VaultSealed
        expr: vault_core_unsealed == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Vault is sealed"
      - alert: DatabaseDown
        expr: up{job="postgresql"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down"
      - alert: FederationDown
        expr: sovra_federation_connections_active{status="healthy"} < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Federation is down (no healthy connections)"
Warning Alerts
  - name: sovra-warning
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(sovra_api_gateway_http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API latency is high (p95 > 500ms)"
      - alert: HighErrorRate
        # Ratio of errors to total requests, so the 0.05 threshold means 5%
        expr: sum(rate(sovra_api_gateway_errors_total[5m])) / sum(rate(sovra_api_gateway_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate is high (>5%)"
      - alert: CertificateExpiringSoon
        expr: sovra_edge_cert_expiry_seconds < 604800 # 7 days
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate expires in < 7 days"
      - alert: AuditLagHigh
        expr: sovra_audit_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Audit log lag is high (> 5 minutes)"
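The HighLatency rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified Python model of that estimate, useful for sanity-checking thresholds against bucket boundaries. The bucket bounds and counts are made-up sample data, not values from Sovra.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified model: buckets must be sorted by upper bound and end with
    a +Inf bucket, and the lowest bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical per-second bucket rates for request duration (seconds).
sample = [(0.1, 50), (0.25, 80), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
print(p95, p95 > 0.5)
```

Because the estimate is interpolated, its precision depends on how close the bucket boundaries are to the alert threshold; a 0.5 s threshold is most reliable when 0.5 is itself a bucket boundary.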
Alertmanager Configuration
# alertmanager-config.yaml
global:
  resolve_timeout: 5m
  slack_api_url: '<slack-webhook-url>'
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#sovra-alerts'
        title: 'Sovra Alert'
        text: ''
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'slack'
    slack_configs:
      - channel: '#sovra-warnings'
        title: 'Sovra Warning'
        text: ''
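To reason about which receiver an alert reaches, it can help to model the routing tree. The following Python sketch approximates Alertmanager's documented behavior for the route above: child routes are tried in order, a matching child stops the search unless `continue` is set, and the parent receiver is used only when no child matches. The `Route` class and `receivers_for` function are illustrative assumptions, not Alertmanager code.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """Minimal stand-in for an Alertmanager route (illustrative only)."""
    receiver: str
    match: dict = field(default_factory=dict)
    continue_: bool = False
    routes: list = field(default_factory=list)


def receivers_for(route, labels):
    """Return the receivers notified for an alert with the given labels."""
    if any(labels.get(key) != value for key, value in route.match.items()):
        return []
    matched = []
    for child in route.routes:
        hit = receivers_for(child, labels)
        matched.extend(hit)
        if hit and not child.continue_:
            break
    # Fall back to this route's own receiver only if no child matched.
    return matched or [route.receiver]


ROOT = Route("default", routes=[
    Route("pagerduty", {"severity": "critical"}, continue_=True),
    Route("slack", {"severity": "warning"}),
])

print(receivers_for(ROOT, {"severity": "critical"}))  # ['pagerduty']
print(receivers_for(ROOT, {"severity": "warning"}))   # ['slack']
print(receivers_for(ROOT, {"severity": "info"}))      # ['default']
```

Under this model, `continue: true` on the critical route has no visible effect, because no later sibling route matches critical alerts; critical alerts reach PagerDuty only.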
Log Aggregation
Loki Setup
# Deploy Loki
kubectl apply -f infrastructure/kubernetes/monitoring/loki/
# Deploy Promtail
kubectl apply -f infrastructure/kubernetes/monitoring/promtail/
Query Examples
# All errors
{app="sovra"} |= "error" | json
# API gateway errors
{app="api-gateway"} |= "error"
# Audit events
{app="api-gateway"} | json | event_type="workspace.access"
# Failed authentication
{app="api-gateway"} | json | status_code="401"
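The `| json | field="value"` queries above parse each log line as JSON and then keep only lines whose extracted field equals the given string. A rough Python equivalent, handy for testing a filter against exported log lines, might look like this. The sample log lines and the helper name `filter_json_logs` are invented for illustration; only the field names `status_code` and `event_type` come from the queries above.

```python
import json


def filter_json_logs(lines, **expected):
    """Keep JSON log records whose fields equal the expected string values.

    Values are compared as strings because LogQL's json stage extracts
    every field as a string label.
    """
    kept = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Non-JSON lines have no extracted fields, so they never match.
            continue
        if not isinstance(record, dict):
            continue
        if all(str(record.get(key)) == value for key, value in expected.items()):
            kept.append(record)
    return kept


# Hypothetical log lines.
logs = [
    '{"status_code": 401, "path": "/v1/workspaces"}',
    '{"status_code": 200, "path": "/v1/workspaces"}',
    '{"event_type": "workspace.access", "status_code": 200}',
    'plain text line that is not JSON',
]

print(len(filter_json_logs(logs, status_code="401")))
print(len(filter_json_logs(logs, event_type="workspace.access")))
```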
Performance Monitoring
Query Performance
# Average API request latency (seconds) over the last 5 minutes
rate(sovra_api_gateway_http_request_duration_seconds_sum[5m])
/
rate(sovra_api_gateway_http_request_duration_seconds_count[5m])
Resource Usage
# Memory usage (pods run in the sovra namespace)
container_memory_usage_bytes{namespace="sovra"}
# CPU usage
rate(container_cpu_usage_seconds_total{namespace="sovra"}[5m])
# Disk usage (fraction of PVC capacity used)
kubelet_volume_stats_used_bytes{namespace="sovra"}
/
kubelet_volume_stats_capacity_bytes{namespace="sovra"}
Troubleshooting
High CPU Usage
# Identify which service
kubectl top pods -n sovra
# Check metrics
# High CPU usually means high request rate or complex policy evaluation
High Memory Usage
# Check for memory leaks
# Look at container_memory_usage_bytes trend
# Restart if needed
kubectl rollout restart deployment/api-gateway -n sovra
Missing Metrics
# Check service discovery (ServiceMonitors exist only if the Prometheus Operator is installed)
kubectl get servicemonitors -n monitoring
# Check Prometheus targets
# Access Prometheus UI -> Status -> Targets
# Check pod annotations
kubectl get pod -n sovra -o yaml | grep prometheus
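When a pod is missing from the Prometheus targets page, the cause is often a label or annotation that the relabel rules in the api-gateway scrape job filter on. The sketch below checks a pod's metadata against those rules. It assumes the pod is supplied as a dictionary in the shape produced by `kubectl get pod <name> -n sovra -o json`; the function name `scrape_problems` and the sample pod are illustrative.

```python
def scrape_problems(pod):
    """List reasons the api-gateway scrape job would skip or mis-scrape a pod."""
    metadata = pod.get("metadata") or {}
    labels = metadata.get("labels") or {}
    annotations = metadata.get("annotations") or {}

    problems = []
    if labels.get("app") != "api-gateway":
        problems.append("label app is not 'api-gateway' (pod is dropped)")
    if annotations.get("prometheus.io/scrape") != "true":
        problems.append("annotation prometheus.io/scrape is not 'true' (pod is dropped)")
    if not annotations.get("prometheus.io/port", "").isdigit():
        problems.append("annotation prometheus.io/port is missing or not numeric "
                        "(the discovered container port is used instead)")
    return problems


# Hypothetical pod metadata with the scrape annotation missing.
pod = {
    "metadata": {
        "name": "api-gateway-abc123",
        "labels": {"app": "api-gateway"},
        "annotations": {"prometheus.io/port": "8080"},
    }
}

for problem in scrape_problems(pod):
    print(problem)
```

An empty result means the pod satisfies the relabel rules, so the next place to look is the target's scrape error on the Prometheus targets page.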