Disaster Recovery
Overview
This guide covers backup, recovery, and recovery-testing procedures for the Sovra control plane, edge nodes, and federation links.
Recovery Time and Recovery Point Objectives
| Component | RTO (recovery time objective) | RPO (recovery point objective) |
|---|---|---|
| Control Plane | 1 hour | 15 minutes |
| Edge Nodes | 30 minutes | 5 minutes |
| Federation | 2 hours | 1 hour |
Backup Strategy
Control Plane Backups
PostgreSQL Database:
# Automated daily backups (in a crontab, % must be escaped as \%)
0 3 * * * pg_dump -U sovra sovra > /backup/sovra-$(date +\%Y\%m\%d).sql
# Weekly full backup (all databases and roles)
0 3 * * 0 pg_dumpall -U postgres > /backup/sovra-full-$(date +\%Y\%m\%d).sql
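The cron entries above redirect straight into the final file name, so a dump that fails partway leaves a truncated file that looks like a valid backup. One way to avoid this is to write to a temporary file and rename it only on success. The following is a minimal sketch; `atomic_backup` is a hypothetical helper, not part of Sovra or PostgreSQL.

```shell
#!/bin/bash
# atomic_backup DEST CMD [ARGS...]
# Hypothetical helper: runs CMD, captures its stdout in a temp file next to
# DEST, and renames the temp file to DEST only if CMD exits successfully.
atomic_backup() {
  local dest="$1"; shift
  local tmp
  tmp="$(mktemp "${dest}.XXXXXX")" || return 1
  if "$@" > "$tmp"; then
    mv -f "$tmp" "$dest"
  else
    local status=$?
    rm -f "$tmp"          # discard the partial output
    return "$status"
  fi
}

# Example (assumes pg_dump is on PATH and /backup exists):
# atomic_backup "/backup/sovra-$(date +%Y%m%d).sql" pg_dump -U sovra sovra
```

Because the rename happens within one directory, readers of `/backup` see either the previous complete file or the new complete file, never a partial one.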
Kubernetes Resources:
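# Daily backup (note: "kubectl get all" covers only a subset of resource types;
# ConfigMaps, Secrets, PVCs, and custom resources are not included)
0 4 * * * kubectl get all -A -o yaml > /backup/k8s-$(date +\%Y\%m\%d).yaml
# Secrets backup (encrypted)
0 4 * * * kubectl get secrets -A -o yaml | gpg --encrypt --recipient backup@example.com > /backup/secrets-$(date +\%Y\%m\%d).yaml.gpg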
Vault Snapshots:
# Automated snapshots (% escaped as \% for crontab)
0 2 * * * vault operator raft snapshot save /backup/vault-$(date +\%Y\%m\%d).snap
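The backup jobs write dated file names, while the recovery procedures read names such as `/backup/sovra-latest.sql`. This page does not say how the `-latest` names are maintained. A minimal sketch, assuming they are symlinks to the newest dated file; `link_latest` is a hypothetical helper.

```shell
#!/bin/bash
# link_latest DIR PREFIX EXT
# Hypothetical helper: points DIR/PREFIX-latest.EXT at the newest
# DIR/PREFIX-YYYYMMDD.EXT file. Dated names sort chronologically, so the
# last glob match is the most recent backup.
link_latest() {
  local dir="$1" prefix="$2" ext="$3"
  local newest="" f
  for f in "$dir/$prefix"-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]."$ext"; do
    if [ -e "$f" ]; then
      newest="$f"
    fi
  done
  if [ -z "$newest" ]; then
    echo "no dated backups found for $prefix in $dir" >&2
    return 1
  fi
  # Relative link, so the backup directory can be moved or mounted elsewhere.
  ln -sfn "$(basename "$newest")" "$dir/$prefix-latest.$ext"
}

# Example: run after each backup job completes.
# link_latest /backup sovra sql
# link_latest /backup vault snap
```

The eight-digit pattern keeps `sovra-full-YYYYMMDD.sql` from being picked up as the latest single-database dump.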
Application-Level Backups
In addition to infrastructure backups, Sovra provides encrypted application-level backups that capture organization data (workspaces, federations, policies). These backups are encrypted at rest with the organization's key-encryption key (KEK) through Vault transit, and both creating and restoring a backup require a CRK co-signature.
# Create an encrypted application backup
sovra --cert admin.crt --key admin.key backup create \
  --crk-signature <base64-signature>
# List available backups
sovra --cert admin.crt --key admin.key backup list
# Restore (same org or clean instance only)
sovra --cert admin.crt --key admin.key backup restore <backup-id> \
  --crk-signature <base64-signature>
Restore restrictions: A backup can only be restored to the same organization or a clean instance with no existing organizations. Cross-organization restore is rejected to prevent data leakage.
Edge Node Backups
# Vault snapshot
vault operator raft snapshot save /backup/edge-vault-$(date +%Y%m%d).snap
# Configuration backup
tar czf /backup/edge-config-$(date +%Y%m%d).tar.gz /etc/sovra /etc/vault
Recovery Procedures
Scenario 1: Control Plane Database Failure
Impact: Complete control plane outage
RTO: 1 hour
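RPO: 15 minutes (target; the daily pg_dump schedule alone bounds data loss at up to 24 hours, so meeting this target requires more frequent backups or continuous WAL archiving)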
Recovery Steps:
# 1. Deploy new PostgreSQL instance
terraform apply -target=module.postgresql
# 2. Restore from backup
psql -U sovra sovra < /backup/sovra-latest.sql
# 3. Verify data integrity
psql -U sovra sovra -c "SELECT COUNT(*) FROM organizations;"
psql -U sovra sovra -c "SELECT COUNT(*) FROM workspaces;"
# 4. Restart control plane services
kubectl rollout restart deployment -n sovra
# 5. Verify health
sovra health check
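After a restart, services can take a while to report healthy, so a single health check may fail even when recovery is on track. A minimal sketch of polling with a deadline follows. `wait_until` is a hypothetical helper, and the example assumes `sovra health check` exits non-zero while the control plane is unhealthy, which this page does not state.

```shell
#!/bin/bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Hypothetical helper: re-runs CMD until it succeeds or TIMEOUT_SECONDS
# have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  local timeout="$1" interval="$2"; shift 2
  local deadline=$(( SECONDS + timeout ))   # SECONDS is a bash built-in timer
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Example (assumption: a non-zero exit code means "not healthy yet"):
# wait_until 600 10 sovra health check || echo "control plane still unhealthy" >&2
```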
Scenario 2: Control Plane Total Loss
Impact: Complete infrastructure loss
RTO: 2 hours
RPO: Last backup
Recovery Steps:
# 1. Deploy infrastructure
cd infrastructure/terraform/
terraform init
terraform apply
# 2. Restore PostgreSQL
psql -U sovra sovra < /backup/sovra-latest.sql
# 3. Restore Kubernetes resources
kubectl apply -f /backup/k8s-latest.yaml
# 4. Restore secrets
gpg --decrypt /backup/secrets-latest.yaml.gpg | kubectl apply -f -
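# 5. Initialize Vault
vault operator init -recovery-shares=5 -recovery-threshold=3
# 6. Restore Vault data (-force is needed when the new cluster's keys differ from
# those in the snapshot; afterwards, use the original cluster's unseal or recovery keys)
vault operator raft snapshot restore -force /backup/vault-latest.snap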
# 7. Verify all services
sovra health check --all
Scenario 3: Edge Node Failure
Impact: Single edge node unavailable
RTO: 30 minutes
RPO: 5 minutes
Recovery Steps:
# 1. Deploy replacement edge node
terraform apply -target=module.edge-node-1
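# 2. Restore Vault snapshot (the replacement Vault must already be initialized and
# unsealed; -force is needed because its keys differ from those in the snapshot)
vault operator raft snapshot restore -force /backup/edge-vault-latest.snap
# 3. Unseal Vault with the original node's unseal keys
vault operator unseal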
# 4. Re-register with control plane
sovra edge-node register --node-id edge-1 ...
# 5. Verify health
sovra edge-node status edge-1
Scenario 4: Federation Link Failure
Impact: Cannot share data with partner
RTO: 2 hours
RPO: 1 hour
Recovery Steps:
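# 1. Check connectivity (-k skips TLS verification, so this tests reachability only;
# rerun without -k to surface certificate errors)
curl -k https://partner-sovra.example.org/health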
# 2. Regenerate federation certificate
sovra federation cert-renew --partner org-b
# 3. Exchange with partner (secure channel)
# Transfer new certificate to partner
# 4. Re-establish federation
sovra federation establish --partner org-b
# 5. Verify shared workspaces
sovra workspace list
Disaster Recovery Testing
Monthly DR Test
# 1. Snapshot production database
pg_dump -U sovra sovra > /backup/dr-test/sovra-snapshot.sql
# 2. Snapshot Vault
vault operator raft snapshot save /backup/dr-test/vault-snapshot.snap
# 3. Create application-level backup
sovra --cert admin.crt --key admin.key backup create \
  --crk-signature <base64-signature>
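# 4. Deploy a DR environment and restore
# (confirm that psql and VAULT_ADDR target the DR environment, not production,
# and that the sovra_dr database already exists)
psql -U sovra sovra_dr < /backup/dr-test/sovra-snapshot.sql
vault operator raft snapshot restore -force /backup/dr-test/vault-snapshot.snap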
# 5. Verify health
curl -s https://dr-instance.example.com/health | jq .
# 6. Tear down DR environment
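A DR test is more useful when it records how long the restore took and compares that to the RTO in the table at the top of this page. A minimal sketch of that comparison follows; `within_rto` is a hypothetical helper that takes epoch-second timestamps.

```shell
#!/bin/bash
# within_rto START_EPOCH END_EPOCH RTO_MINUTES
# Hypothetical helper: prints the elapsed time of a drill and returns 0 if
# it finished within the RTO, 1 otherwise.
within_rto() {
  local start="$1" end="$2" rto_minutes="$3"
  local elapsed=$(( end - start ))
  local limit=$(( rto_minutes * 60 ))
  echo "elapsed=${elapsed}s limit=${limit}s"
  [ "$elapsed" -le "$limit" ]
}

# Example: time a drill against the 1-hour control plane RTO.
# start=$(date +%s)
# ... run the restore steps ...
# within_rto "$start" "$(date +%s)" 60 || echo "RTO exceeded" >&2
```

For instance, a drill that runs from second 1000 to second 4000 took 3000 seconds, which is inside the 3600-second limit of a 60-minute RTO.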
Backup Verification
# Weekly backup verification
0 5 * * 1 /usr/local/bin/verify-backups.sh
# verify-backups.sh
#!/bin/bash
set -e
echo "Verifying PostgreSQL backup..."
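# The backup is a plain-text dump, which pg_restore cannot read;
# check for the completion marker that pg_dump writes at the end instead.
tail -n 10 /backup/sovra-latest.sql | grep -q "PostgreSQL database dump complete"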
echo "Verifying Vault snapshot..."
vault operator raft snapshot inspect /backup/vault-latest.snap
echo "Verifying Kubernetes backup..."
kubectl apply --dry-run=client -f /backup/k8s-latest.yaml
echo "All backups verified successfully"
Monitoring & Alerts
# Backup monitoring
- alert: BackupFailed
  expr: backup_success == 0
  for: 1h
  annotations:
    summary: "Backup failed for "
- alert: BackupOld
  expr: time() - backup_timestamp_seconds > 86400
  annotations:
    summary: "Backup is older than 24 hours"
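The alert rules rely on the `backup_success` and `backup_timestamp_seconds` metrics, but this page does not show how they are produced. One possible approach is for each backup job to write them in the Prometheus text format to a file read by node_exporter's textfile collector. That collector, the `write_backup_metrics` helper, the `backup` label name, and the example path are all assumptions rather than details from the Sovra documentation.

```shell
#!/bin/bash
# write_backup_metrics NAME STATUS OUTFILE
# Hypothetical helper: writes backup_success (1 or 0) for backup NAME and
# backup_timestamp_seconds for the last successful run to OUTFILE.
# The "backup" label name is an assumption, not taken from the Sovra docs.
write_backup_metrics() {
  local name="$1" status="$2" outfile="$3"
  local ts_line tmp
  if [ "$status" -eq 1 ]; then
    ts_line="backup_timestamp_seconds{backup=\"${name}\"} $(date +%s)"
  else
    # Keep the timestamp of the last successful run, if one was recorded.
    ts_line="$(grep '^backup_timestamp_seconds' "$outfile" 2>/dev/null || true)"
  fi
  tmp="$(mktemp "${outfile}.XXXXXX")" || return 1
  {
    echo "backup_success{backup=\"${name}\"} ${status}"
    if [ -n "$ts_line" ]; then
      echo "$ts_line"
    fi
  } > "$tmp"
  chmod 644 "$tmp"
  mv -f "$tmp" "$outfile"   # atomic rename, so a scrape never reads a partial file
}

# Example (the directory is a hypothetical textfile collector path):
# if pg_dump -U sovra sovra > "/backup/sovra-$(date +%Y%m%d).sql"; then
#   write_backup_metrics postgres 1 /var/lib/node_exporter/backup_postgres.prom
# else
#   write_backup_metrics postgres 0 /var/lib/node_exporter/backup_postgres.prom
# fi
```

Preserving the previous timestamp on failure keeps the age-based alert meaningful: a failed run does not reset the clock.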