ADR-016: Backup and Disaster Recovery

Status: Accepted
Date: 2026-04-17
Context: A-ADV-8 (backup and DR)

Decision

Federation is the primary DR mechanism. External backup is additive and optional.

Site-level DR via federation

  • Federated-async replication to a secondary site is the primary DR story
  • RPO: bounded by async replication lag (seconds to minutes)
  • RTO: secondary site is warm (has replicated data + tenant config); switchover requires KMS connectivity and control plane reconfiguration
  • Data replication is ciphertext-only (no key material in replication stream)
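
The RPO and RTO bullets above can be read as two separate checks on the secondary site. The following is a minimal sketch, not the project's implementation; `SecondarySiteStatus` and its field names are hypothetical and only mirror the conditions listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecondarySiteStatus:
    """Hypothetical health snapshot of the warm secondary site."""
    replication_lag_seconds: float    # age of the newest delta not yet applied
    tenant_kms_reachable: bool        # switchover requires KMS connectivity
    control_plane_reconfigured: bool  # switchover requires control plane reconfiguration


def estimated_rpo_seconds(status: SecondarySiteStatus) -> float:
    """Data written within the current replication lag is what a failover could lose."""
    return status.replication_lag_seconds


def can_serve_traffic(status: SecondarySiteStatus) -> bool:
    """The secondary can take over only once both switchover conditions hold."""
    return status.tenant_kms_reachable and status.control_plane_reconfigured
```

In this sketch the lag value bounds the RPO, while the two boolean conditions are what gate the RTO: replicated data alone does not make the secondary usable.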

What is replicated

| Component | Replicated? | Mechanism |
|---|---|---|
| Chunk data (ciphertext) | Yes | Async replication to peer site |
| Log deltas | Yes | Async replication of committed deltas |
| Control plane config | Yes | Federation config sync |
| Tenant KMS config | No | Same tenant KMS serves both sites |
| System master keys | No | Per-site system key manager |
| Audit log | Yes | Per-tenant audit shard replicated |
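
The same information can be expressed as data, which makes the "no key material in the replication stream" rule checkable. This is an illustrative sketch only; `ReplicationRule`, `REPLICATION_RULES`, and `KEY_COMPONENTS` are hypothetical names, and the rows simply restate the table above.

```python
from typing import List, NamedTuple


class ReplicationRule(NamedTuple):
    component: str
    replicated: bool
    mechanism: str


# Rows restate the ADR table; nothing here is read from a real system.
REPLICATION_RULES = [
    ReplicationRule("Chunk data (ciphertext)", True, "Async replication to peer site"),
    ReplicationRule("Log deltas", True, "Async replication of committed deltas"),
    ReplicationRule("Control plane config", True, "Federation config sync"),
    ReplicationRule("Tenant KMS config", False, "Same tenant KMS serves both sites"),
    ReplicationRule("System master keys", False, "Per-site system key manager"),
    ReplicationRule("Audit log", True, "Per-tenant audit shard replicated"),
]

# Assumption: these are the key-related components that must stay local.
KEY_COMPONENTS = {"Tenant KMS config", "System master keys"}


def replicated_components(rules: List[ReplicationRule]) -> List[str]:
    """Names of the components that travel to the peer site."""
    return [rule.component for rule in rules if rule.replicated]


def key_components_stay_local(rules: List[ReplicationRule]) -> bool:
    """True if no key-related component is marked as replicated."""
    return not any(rule.replicated for rule in rules if rule.component in KEY_COMPONENTS)
```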

External backup (optional, additive)

  • Cluster admin can configure external backup targets (S3-compatible store)
  • Backup contains: encrypted chunk data + log snapshots + control plane state
  • Backup is encrypted at rest with the system key, so it contains no plaintext
  • This meets the HIPAA requirement that backups be encrypted
  • Backup frequency: configurable (hourly/daily snapshots of control plane, continuous for chunk data)
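
A backup target configuration along these lines might look as follows. This is a sketch under stated assumptions: the ADR does not define a configuration schema, so `ExternalBackupConfig`, its field names, and the example endpoint are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# The ADR names hourly and daily control plane snapshots as the options.
ALLOWED_SNAPSHOT_INTERVALS = ("hourly", "daily")


@dataclass(frozen=True)
class ExternalBackupConfig:
    """Hypothetical settings for one S3-compatible backup target."""
    endpoint_url: str                              # S3-compatible store endpoint
    bucket: str                                    # destination bucket
    control_plane_snapshot_interval: str = "daily"
    continuous_chunk_backup: bool = True           # chunk data is backed up continuously

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means the config is usable."""
        problems = []
        if not self.endpoint_url:
            problems.append("endpoint_url must not be empty")
        if not self.bucket:
            problems.append("bucket must not be empty")
        if self.control_plane_snapshot_interval not in ALLOWED_SNAPSHOT_INTERVALS:
            problems.append(
                "control_plane_snapshot_interval must be one of "
                + ", ".join(ALLOWED_SNAPSHOT_INTERVALS)
            )
        return problems


# Example with a placeholder endpoint (not a real service).
example = ExternalBackupConfig(
    endpoint_url="https://s3.example.internal",
    bucket="dr-backups",
    control_plane_snapshot_interval="hourly",
)
```

Returning a list of problems rather than raising on the first one lets an admin tool report every misconfiguration in a single pass.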

Recovery scenarios

| Scenario | Recovery path | RPO | RTO |
|---|---|---|---|
| Single node loss | Raft re-election + EC repair | 0 | Seconds to minutes |
| Multiple node loss | Raft reconfiguration + EC repair | 0 | Minutes |
| Full site loss | Failover to federated peer | Replication lag | Minutes to hours |
| Site loss, no federation | Restore from external backup | Backup lag | Hours |
| Tenant KMS loss | Unrecoverable (I-K11) | N/A | N/A |
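
The scenario table can be sketched as a lookup, which is roughly how a runbook tool might select a recovery path. The names `Scenario`, `RecoveryPlan`, `recovery_plan`, and `site_loss_scenario` are hypothetical; the values restate the table and are not measured figures.

```python
from enum import Enum
from typing import NamedTuple


class Scenario(Enum):
    SINGLE_NODE_LOSS = "single node loss"
    MULTIPLE_NODE_LOSS = "multiple node loss"
    FULL_SITE_LOSS = "full site loss"
    SITE_LOSS_NO_FEDERATION = "site loss, no federation"
    TENANT_KMS_LOSS = "tenant KMS loss"


class RecoveryPlan(NamedTuple):
    path: str
    rpo: str
    rto: str


# Values restate the ADR table.
RECOVERY_PLANS = {
    Scenario.SINGLE_NODE_LOSS: RecoveryPlan("Raft re-election + EC repair", "0", "Seconds to minutes"),
    Scenario.MULTIPLE_NODE_LOSS: RecoveryPlan("Raft reconfiguration + EC repair", "0", "Minutes"),
    Scenario.FULL_SITE_LOSS: RecoveryPlan("Failover to federated peer", "Replication lag", "Minutes to hours"),
    Scenario.SITE_LOSS_NO_FEDERATION: RecoveryPlan("Restore from external backup", "Backup lag", "Hours"),
    Scenario.TENANT_KMS_LOSS: RecoveryPlan("Unrecoverable (I-K11)", "N/A", "N/A"),
}


def recovery_plan(scenario: Scenario) -> RecoveryPlan:
    """Look up the documented recovery path for a scenario."""
    return RECOVERY_PLANS[scenario]


def site_loss_scenario(federation_configured: bool) -> Scenario:
    """A site loss maps to one of two rows, depending on whether a federated peer exists."""
    if federation_configured:
        return Scenario.FULL_SITE_LOSS
    return Scenario.SITE_LOSS_NO_FEDERATION
```

For example, `recovery_plan(site_loss_scenario(federation_configured=False)).path` yields the external backup path, the only row where recovery depends on the optional backup.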

Consequences

  • Federation is the recommended (and primary) DR strategy
  • External backup is for defense-in-depth, not primary recovery
  • RTO for site failover depends on control plane reconfiguration speed
  • The system key manager is per-site: after failover, the secondary site uses its own system key manager, with different master keys. Tenant data remains accessible because the tenant KMS is shared across sites