Sovra

Operations Guide

Overview

This guide covers operational aspects of running Sovra in production.

Topics

Setup & Initialization

Monitoring & Observability

Maintenance

Troubleshooting

Monitoring Quick Start

# Deploy monitoring stack
kubectl apply -k infrastructure/kubernetes/monitoring/

# Access Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Access Prometheus
kubectl port-forward -n monitoring svc/prometheus 9090:9090
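Once both port-forwards are running, you can confirm that the stack is serving by calling each tool's own health endpoint. Prometheus exposes `/-/ready` and Grafana exposes `/api/health`. The sketch below assumes the local ports used above (3000 and 9090). The function name `check_monitoring` is hypothetical.

```shell
# Hypothetical helper: checks the monitoring stack through the local port-forwards.
# Assumes Grafana on localhost:3000 and Prometheus on localhost:9090 (as forwarded above).
check_monitoring() {
  local failed=0 url
  for url in \
    "http://localhost:9090/-/ready" \
    "http://localhost:3000/api/health"; do
    # -s: silent, -f: HTTP status >= 400 counts as failure, -o: discard the body
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}
```

Typical use is `check_monitoring || echo "monitoring stack not ready"`. Checking both endpoints before returning means one run reports every unhealthy component, not just the first.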

Daily Operations

Health Checks

# Control plane health (via API)
curl -s https://control.example.com/health | jq .

# Readiness / liveness probes
curl -s https://control.example.com/ready
curl -s https://control.example.com/live
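After a deploy or restart, `/ready` may fail briefly before the control plane settles. The minimal retry loop below wraps the readiness probe. It assumes only that `/ready` returns a 2xx status when the service is ready. The function name and the default attempt and delay values are illustrative.

```shell
# Hypothetical helper: polls a readiness URL until it succeeds or attempts run out.
# Usage: wait_for_ready URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_ready() {
  local url=$1 attempts=${2:-30} delay=${3:-5} i
  for (( i = 1; i <= attempts; i++ )); do
    # -f makes curl exit non-zero on HTTP status >= 400, so a 503 counts as "not ready"
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Skip the sleep after the final failed attempt
    (( i < attempts )) && sleep "$delay"
  done
  echo "not ready after $attempts attempt(s): $url" >&2
  return 1
}
```

For example, `wait_for_ready https://control.example.com/ready 12 5` waits for roughly a minute before giving up.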

Audit Review

# Review recent audit events via API
sovra --cert admin.crt --key admin.key activity list --limit 50

# Filter by workspace
sovra --cert admin.crt --key admin.key activity list --workspace-id <id>
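Every `sovra` command in this guide repeats the same `--cert`/`--key` pair. A small shell wrapper keeps those flags in one place. It passes only the flags shown above. The wrapper name and the `SOVRA_CERT`/`SOVRA_KEY` variables are illustrative and are not part of the CLI.

```shell
# Hypothetical wrapper: prepends the admin client certificate and key to any sovra subcommand.
# SOVRA_CERT / SOVRA_KEY are local shell variables for this sketch, not CLI settings.
sovra_admin() {
  sovra --cert "${SOVRA_CERT:-admin.crt}" --key "${SOVRA_KEY:-admin.key}" "$@"
}

# Examples:
#   sovra_admin activity list --limit 50
#   sovra_admin activity list --workspace-id <id>
```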

Certificate Management

# Renew admin certificate
sovra --cert admin.crt --key admin.key identity admin renew-cert

# Rotate edge node certificates
./scripts/rotate-certificates.sh --namespace sovra-edge
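Renewal is easier to schedule when you know how long the current certificate has left. OpenSSL's `x509 -checkend N` exits 0 if the certificate will still be valid N seconds from now and 1 otherwise, so a day-based check takes one line. The function name and the 14-day threshold in the example are assumptions.

```shell
# Hypothetical helper: succeeds if CERT_FILE will still be valid DAYS days from now.
# Usage: cert_valid_for DAYS CERT_FILE
cert_valid_for() {
  local days=$1 cert=$2
  # -checkend takes seconds; exit status 1 means "expires within the window".
  # A missing or unreadable file also returns non-zero, which errs toward renewing.
  openssl x509 -checkend $(( days * 86400 )) -noout -in "$cert" >/dev/null
}

# Example: renew only when fewer than 14 days remain
#   cert_valid_for 14 admin.crt || sovra --cert admin.crt --key admin.key identity admin renew-cert
```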

Backup Verification

# Verify Vault snapshot
vault operator raft snapshot inspect /backup/vault-latest.snap

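# Verify database backup (pg_restore --list reads only custom, tar, or
# directory-format archives such as pg_dump -Fc output, not plain-SQL dumps)
pg_restore --list /backup/sovra-latest.sql > /dev/null && echo "DB backup valid"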

# List application backups
sovra --cert admin.crt --key admin.key backup list
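The checks above confirm that a backup is readable, but not that it is recent. A freshness check on the file's modification time catches a backup job that has silently stopped. This sketch assumes the backup files at the paths above are rewritten on each run, or are symlinks to the newest file. The function name and the 26-hour threshold are illustrative.

```shell
# Hypothetical helper: succeeds if FILE exists and was modified within MAX_HOURS.
# Usage: backup_is_fresh FILE MAX_HOURS
backup_is_fresh() {
  local file=$1 max_hours=$2
  # -L follows a "latest" symlink to its target; -maxdepth 0 avoids descending
  # into directories; find prints the path only if mtime is under N minutes ago.
  [ -n "$(find -L "$file" -maxdepth 0 -mmin "-$(( max_hours * 60 ))" 2>/dev/null)" ]
}

# Example:
#   for f in /backup/vault-latest.snap /backup/sovra-latest.sql; do
#     backup_is_fresh "$f" 26 || echo "STALE: $f"
#   done
```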

Security Review

# Review access logs via activity endpoint
sovra --cert admin.crt --key admin.key activity list --limit 100

# Review compliance report
sovra --cert admin.crt --key admin.key compliance report generate \
  --period 7d --format json
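Keeping each weekly compliance report makes it possible to compare results over time. The sketch below assumes that `compliance report generate ... --format json` writes the report to stdout, which this guide does not confirm. The function name and output directory are hypothetical.

```shell
# Hypothetical helper: saves a dated compliance report and refuses to keep an empty one.
# Assumes the sovra CLI prints the JSON report on stdout.
# Usage: archive_compliance_report OUT_DIR
archive_compliance_report() {
  local out_dir=$1 out tmp
  out="$out_dir/compliance-$(date +%Y-%m-%d).json"
  tmp="$out.tmp"
  mkdir -p "$out_dir" || return 1
  # Write to a temp file first so a failed run never clobbers an earlier report
  if ! sovra --cert admin.crt --key admin.key compliance report generate \
      --period 7d --format json > "$tmp" || [ ! -s "$tmp" ]; then
    rm -f "$tmp"
    echo "compliance report generation failed or produced no output" >&2
    return 1
  fi
  mv "$tmp" "$out"
  echo "$out"
}
```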

Capacity Planning

# Review resource usage
kubectl top nodes
kubectl top pods -n sovra

# Check current database size (compare with earlier readings to track growth)
psql -U sovra -d sovra -c "SELECT pg_size_pretty(pg_database_size('sovra'));"

# Count audit events from the last 30 days
psql -U sovra -d sovra -c "SELECT COUNT(*) FROM audit_events WHERE created_at > now() - interval '30 days';"
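Two size readings taken some days apart give a rough linear estimate of when the database volume will fill. `psql -U sovra -d sovra -At -c "SELECT pg_database_size('sovra');"` prints the raw size in bytes (`-A` unaligned, `-t` tuples only). The linear-growth model and the function below are a back-of-the-envelope sketch, not a forecast.

```shell
# Hypothetical helper: days until LIMIT_BYTES is reached, assuming linear growth.
# Usage: days_until_full CURRENT_BYTES PREVIOUS_BYTES DAYS_BETWEEN LIMIT_BYTES
days_until_full() {
  local current=$1 previous=$2 days=$3 limit=$4 growth
  if (( days <= 0 )); then
    echo "DAYS_BETWEEN must be positive" >&2
    return 1
  fi
  if (( current >= limit )); then
    echo 0            # already at or past the limit
    return 0
  fi
  # Bash arithmetic is integer-only, so fractional bytes/day and days are truncated
  growth=$(( (current - previous) / days ))
  if (( growth <= 0 )); then
    echo "no growth"  # flat or shrinking: no projected fill date
    return 0
  fi
  echo $(( (limit - current) / growth ))
}
```

For example, growth from 50 GiB to 60 GiB over 10 days is 1 GiB/day, which leaves 40 days of headroom on a 100 GiB volume.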

Security Patching

# Check for updates
go list -m -u all

# Static security analysis of the source code
gosec ./...

# Check dependencies against known CVEs
govulncheck ./...
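For CI or a periodic job, the module-update report can be bundled with two scanners. `gosec` checks the source for insecure patterns. `govulncheck`, the Go team's tool, matches dependencies against the Go vulnerability database. The sketch assumes it runs from the module root with both tools on `PATH`. The function name is hypothetical.

```shell
# Hypothetical helper: dependency update report plus security scans for a Go module.
# Returns non-zero if either scanner reports findings or is not installed.
security_scan() {
  local rc=0 tool
  echo "== Modules with newer versions available =="
  # .Update is set only when -u finds a newer version; blank lines are filtered out
  go list -m -u -f '{{if .Update}}{{.Path}} {{.Version}} -> {{.Update.Version}}{{end}}' all \
    | grep -v '^$'
  for tool in gosec govulncheck; do
    echo "== $tool =="
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool not installed" >&2
      rc=1
    elif ! "$tool" ./...; then
      # Both tools exit non-zero when they report findings
      rc=1
    fi
  done
  return "$rc"
}
```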

Key Metrics to Monitor

Control Plane (subsystem = "api_gateway"):
├── sovra_api_gateway_http_requests_total{method,path,status} (counter)
├── sovra_api_gateway_http_request_duration_seconds{method,path} (histogram)
├── sovra_api_gateway_http_active_requests (gauge)
├── sovra_api_gateway_auth_attempts_total{method,result} (counter)
├── sovra_api_gateway_errors_total{type} (counter)
└── sovra_api_gateway_info{version,go_version} (gauge)

Edge Nodes (Vault built-in):
├── vault_core_unsealed (gauge)
├── vault_runtime_alloc_bytes (gauge)
└── vault_runtime_num_goroutines (gauge)
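In Prometheus, these metrics are usually read as ratios and quantiles. Two examples are the 5xx error ratio `sum(rate(sovra_api_gateway_http_requests_total{status=~"5.."}[5m])) / sum(rate(sovra_api_gateway_http_requests_total[5m]))` and p95 latency `histogram_quantile(0.95, sum by (le) (rate(sovra_api_gateway_http_request_duration_seconds_bucket[5m])))`.

When Prometheus itself is unavailable, you can compute a cumulative error ratio from a raw scrape of the control plane's metrics in text exposition format. The sketch assumes that the `status` label holds the numeric HTTP code and that samples carry no timestamps. The metrics path is not given in this guide, so the function reads from stdin.

```shell
# Hypothetical helper: 5xx share of all requests since process start, computed from
# Prometheus text-format input on stdin. Counters are cumulative, so this is a
# lifetime ratio, not a recent rate.
# Assumes: numeric `status` label, no sample timestamps (value is the last field).
error_ratio() {
  awk '
    $0 ~ /^sovra_api_gateway_http_requests_total[{]/ {
      total += $NF
      if (index($0, "status=\"5") > 0) errors += $NF
    }
    END {
      if (total == 0) { print "0.0000"; exit }
      printf "%.4f\n", errors / total
    }'
}
```

If the control plane serves metrics at `/metrics` (an assumption), usage would be `curl -s https://control.example.com/metrics | error_ratio`.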

Automation

Scheduled Tasks

# Certificate rotation (daily at 02:00)
0 2 * * * /path/to/scripts/rotate-certificates.sh --namespace sovra-edge

# Vault backup (daily at 03:00)
0 3 * * * /path/to/scripts/backup-vault.sh --snapshot --retain 14

# Health check every 5 minutes via the readiness probe; only failures produce output
*/5 * * * * curl -sf --max-time 10 https://control.example.com/ready > /dev/null || echo "ALERT: control plane not ready"
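If a job is still running when cron starts it again, for example during a slow certificate rotation, two copies can overlap. A minimal guard uses `mkdir` as an atomic lock. The function name and lock path are hypothetical. On Linux, `flock(1)` from util-linux is a sturdier alternative, because the kernel releases its lock automatically if the process dies.

```shell
# Hypothetical helper: runs a command only if no other run holds LOCK_DIR.
# mkdir is atomic, so exactly one concurrent caller can create the directory.
# Usage: run_exclusive LOCK_DIR COMMAND [ARGS...]
run_exclusive() {
  local lock=$1 rc
  shift
  if ! mkdir "$lock" 2>/dev/null; then
    echo "skipped: $lock is held by another run" >&2
    return 1
  fi
  "$@"
  rc=$?
  # Release the lock; a crash before this line leaves a stale lock to remove by hand
  rmdir "$lock"
  return "$rc"
}
```

With this function sourced into the job's shell, the rotation entry would run `run_exclusive /tmp/sovra-rotate.lock /path/to/scripts/rotate-certificates.sh --namespace sovra-edge`, and the wrapper passes through the script's own exit status.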