Backup, Recovery & Business Continuity
A backup you have never restored is a hope, not a backup. Data will be deleted, corrupted, ransomed, or lost to a regional outage. When that happens, the only thing that matters is whether you can get the data and the service back fast enough, with little enough lost. Plan for recovery on purpose, set targets, and prove it works.
Resilience (Designing for Failure) keeps the system running through faults. Backup and disaster recovery (DR) are about getting it back after the worst has happened: data loss, corruption, a ransomware event, or a whole region going down. Two numbers frame everything. The recovery point objective (RPO) is how much data you can afford to lose, measured in time. The recovery time objective (RTO) is how long recovery may take. Set them per dataset, and design backups and DR to meet them.
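As a rough illustration of how the two targets become checks you can run, the sketch below compares per-dataset targets against the age of the newest good backup and a measured restore time. The names `RecoveryTargets`, `data_at_risk`, `meets_rpo`, and `meets_rto` are hypothetical, and the timestamps are made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RecoveryTargets:
    """Hypothetical per-dataset targets; values here are illustrative only."""
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable time to restore


def data_at_risk(last_good_backup: datetime, now: datetime) -> timedelta:
    """Writes made since the last good backup would be lost if failure hit at `now`."""
    return now - last_good_backup


def meets_rpo(targets: RecoveryTargets, last_good_backup: datetime, now: datetime) -> bool:
    return data_at_risk(last_good_backup, now) <= targets.rpo


def meets_rto(targets: RecoveryTargets, measured_restore: timedelta) -> bool:
    return measured_restore <= targets.rto


targets = RecoveryTargets(rpo=timedelta(minutes=15), rto=timedelta(hours=2))
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary example time

print(meets_rpo(targets, now - timedelta(minutes=10), now))  # True
print(meets_rpo(targets, now - timedelta(hours=1), now))     # False
print(meets_rto(targets, timedelta(minutes=90)))             # True
```

The point of the sketch is that both targets are durations, so a backup schedule or a restore drill can be compared against them directly rather than judged by feel.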
For a regulated platform holding irreplaceable customer, KYC, and AML records, this is both an operational need and a certification requirement (see Marketplace & Certification Readiness). Recovery must follow the same rules as everything else: regulated records are retained and restored, never quietly lost, and backups of sensitive data are protected as strongly as the live data.
Back up with intent
- Always: Define an RPO and RTO for each important dataset, and design backups and recovery to actually meet them, not whatever the defaults happen to give.
- Do: Automate regular backups of data and configuration, keep several generations, and store copies isolated from the live environment (separate account or region) so one compromise cannot take both.
- Do: Protect backups like production: encrypted, access-controlled, and, for regulated data, retained for the required period (see Data Retention & Erasure).
- Consider: Immutable, write-once backups so ransomware or a bad actor cannot encrypt or delete your recovery point.
- Never: Treat a backup as proven before you have successfully restored it in a test. An untested backup may not be recoverable at all.
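The rules in this list can be expressed as a simple policy check. The following is a minimal sketch, assuming a hypothetical `BackupPolicy` record that describes where a backup lives and how it is protected; none of these names come from a real cloud SDK, and the default of three generations is an assumption, not a stated requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    account: str
    region: str


@dataclass(frozen=True)
class BackupPolicy:
    """Hypothetical description of one dataset's backup setup."""
    location: Location
    encrypted: bool
    immutable: bool
    generations: int
    restore_tested: bool


def policy_findings(production: Location, policy: BackupPolicy,
                    min_generations: int = 3) -> list[str]:
    """Return findings; an empty list means no rule in the sketch was broken."""
    findings = []
    # Isolation: a copy in the same account AND region shares production's fate.
    if policy.location == production:
        findings.append("backup is not isolated from production")
    if not policy.encrypted:
        findings.append("backup is not encrypted")
    if policy.generations < min_generations:
        findings.append("too few backup generations kept")
    if not policy.restore_tested:
        findings.append("backup has never been restored in a test")
    if not policy.immutable:
        findings.append("consider immutable (write-once) storage")
    return findings


prod = Location(account="prod-account", region="region-a")
weak = BackupPolicy(prod, encrypted=True, immutable=False,
                    generations=1, restore_tested=False)
strong = BackupPolicy(Location("backup-account", "region-b"), encrypted=True,
                      immutable=True, generations=7, restore_tested=True)

print(policy_findings(prod, weak))
print(policy_findings(prod, strong))  # []
```

A check like this could run in CI against whatever backup configuration the platform actually declares, so a missing isolation or encryption setting is caught before an incident rather than during one.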
Be able to recover
- Do: Keep a disaster-recovery plan for major loss (region outage, corruption, ransomware): what is recovered, in what order, by whom, and how.
- Do: Test restores and DR regularly. Actually rehearse recovery; do not just assume it works. Fix what the rehearsal reveals (ties to Incident Readiness).
- Do: Provision infrastructure as code so a lost environment can be rebuilt the same way each time, not reconstructed from memory (see Infrastructure as Code).
- Consider: Multi-region or cross-region redundancy for the most critical services, within data-residency limits (see Azure & Cloud Platform).
- Always: Restore regulated records (KYC, AML, audit, SARs) intact and keep them within their required residency and retention. Recovery must not become a quiet data loss.
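A restore rehearsal can be partly automated. The sketch below assumes a hypothetical `restore` callable that writes the recovered data into a scratch directory and returns its path; the drill then verifies a known checksum and compares the elapsed time with the RTO. A real drill would also cover bringing the service back up, which this example leaves out.

```python
import hashlib
import tempfile
import time
from datetime import timedelta
from pathlib import Path
from typing import Callable


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large dumps do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_drill(restore: Callable[[Path], Path],
                  expected_sha256: str, rto: timedelta) -> dict:
    """Restore into a scratch directory, then check integrity and timing."""
    with tempfile.TemporaryDirectory() as scratch:
        started = time.monotonic()
        restored = restore(Path(scratch))  # hypothetical restore step
        elapsed = timedelta(seconds=time.monotonic() - started)
        intact = sha256_of(restored) == expected_sha256
    within_rto = elapsed <= rto
    return {"intact": intact, "elapsed": elapsed,
            "within_rto": within_rto, "passed": intact and within_rto}


# Stand-in for a real restore: writes a small file so the sketch is runnable.
PAYLOAD = b"example backup payload"


def fake_restore(target_dir: Path) -> Path:
    restored = target_dir / "customers.dump"
    restored.write_bytes(PAYLOAD)
    return restored


result = restore_drill(fake_restore, hashlib.sha256(PAYLOAD).hexdigest(),
                       rto=timedelta(hours=2))
print(result["passed"])  # True
```

Recording `elapsed` from each drill gives a measured restore time to set against the RTO, and a checksum mismatch is exactly the kind of finding the rehearsal exists to surface.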
// nightly dump written to the same storage account as production
// never restored; no RPO/RTO defined
If that account is compromised, deleted, or ransomed, production and its backup are lost together. Nobody knows whether the dump can even be restored. This is a backup in name only.
// automated backups -> separate region/account, immutable, encrypted
// RPO 15m / RTO 2h defined per dataset
// quarterly restore drill verifies recovery and timings
Backups survive a compromise of production, cannot be tampered with, meet defined targets, and are proven by regular restore drills. Recovery is something we know works, not something we hope works.
Self-review checklist
- Ask: What is the RPO/RTO for this data, and do our backups and DR actually meet them?
- Ask: Has a restore from these backups actually been tested recently?
- Ask: Could one event (region outage, ransomware, bad delete) take out both production and its backups?
- Ask: On recovery, are regulated records restored intact, in-region, and within retention?