Chapter 11: Backup & Restore Basics
Why This Chapter Exists
Backups are useful only if restore is tested and repeatable. This chapter uses CloudNativePG as a real stateful target: PostgreSQL backed by PVCs.
Data Plane Choice
CloudNativePG setup in this repo:
- operator: flux/infrastructure/data/cnpg-operator
- clusters: flux/infrastructure/data/cnpg-clusters (develop, staging, production)
- each environment has a dedicated Cluster + ScheduledBackup
Backup Credential Model
Before SOPS integration, bootstrap credentials are created by Terraform:
- secret name: cnpg-backup-s3
- namespaces: develop, staging, production
- keys: ACCESS_KEY_ID, ACCESS_SECRET_KEY, BUCKET (+ optional ENDPOINT, REGION)
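A missing or empty key in this secret tends to surface only when a backup or restore actually runs, so it can help to check the secret shape up front. The following is a minimal sketch, assuming the secret's data map has been fetched as JSON (for example with kubectl get secret cnpg-backup-s3 -o json). The key names come from the list above; the helper name missing_backup_secret_keys and the example values are hypothetical.

```python
import base64

# Required key names are taken from the credential model above.
# ENDPOINT and REGION are optional there, so they are not checked.
REQUIRED_KEYS = ("ACCESS_KEY_ID", "ACCESS_SECRET_KEY", "BUCKET")


def missing_backup_secret_keys(secret_data: dict[str, str]) -> list[str]:
    """Return required keys that are absent or decode to an empty value.

    `secret_data` is the `data` map of a Kubernetes Secret, where every
    value is base64-encoded. (Hypothetical helper, not part of the repo.)
    """
    missing = []
    for key in REQUIRED_KEYS:
        encoded = secret_data.get(key, "")
        if not base64.b64decode(encoded).strip():
            missing.append(key)
    return missing


# Example with made-up values: ACCESS_SECRET_KEY is deliberately left out.
example = {
    "ACCESS_KEY_ID": base64.b64encode(b"example-id").decode(),
    "BUCKET": base64.b64encode(b"example-bucket").decode(),
}
print(missing_backup_secret_keys(example))  # ['ACCESS_SECRET_KEY']
```

An empty result only means the expected keys are populated. It does not prove the credentials can write to or read from the bucket; that is what the restore drill is for.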
Terraform source:
Incident Hook
A backup job reports success, but a real restore attempt fails under pressure. Objects are present in storage, yet the restored data is unusable because of a permission or schema mismatch. The service stays degraded because the existence of a backup was mistaken for proof of recoverability. This chapter turns backup from a checkbox into a validated recovery capability.
What AI Would Propose (Brave Junior)
- “Backup job is green, so recovery is guaranteed.”
- “Skip restore drill; it takes too long.”
- “Restore in production directly when incident starts.”
Why this sounds reasonable:
- avoids extra drill time
- keeps release pipeline short
Why This Is Dangerous
- backup success does not guarantee restore correctness.
- untested restore paths fail exactly when recovery time matters most.
- production-first restore attempts can amplify incident impact.
Guardrails That Stop It
- No backup without tested restore path.
- Backup target credentials must be secret-managed (SOPS path next).
- Recovery drills must run in non-production first.
- Evidence is required: backup status + restore validation query.
Safe Workflow (Step-by-Step)
- Verify scheduled backups and retention status.
- Trigger one controlled manual backup.
- Restore into isolated non-production target.
- Run restore verification checklist:
  - schema accessible
  - representative data query passes
  - app-level smoke checks succeed
- Record evidence and update recovery notes before considering production readiness.
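The manual backup step in the workflow above can be sketched as a small manifest builder plus a status check. This is an illustration under assumptions, not the lab's actual procedure: the resource shape follows the CloudNativePG postgresql.cnpg.io/v1 Backup kind, the cluster name example-cluster and backup name manual-drill-001 are hypothetical, and the "completed" phase value should be confirmed against the operator version installed in your cluster.

```python
import json


def manual_backup_manifest(cluster_name: str, namespace: str, backup_name: str) -> dict:
    """Build an on-demand CloudNativePG Backup resource for one cluster."""
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "Backup",
        "metadata": {"name": backup_name, "namespace": namespace},
        "spec": {"cluster": {"name": cluster_name}},
    }


def backup_completed(backup: dict) -> bool:
    """True when the Backup object reports the phase assumed to mean success.

    `backup` is the JSON of a Backup resource, e.g. from
    `kubectl get backup <name> -o json`. The "completed" value is an
    assumption to verify against the installed operator version.
    """
    return backup.get("status", {}).get("phase") == "completed"


# Hypothetical names; "develop" is one of the namespaces listed in this chapter.
manifest = manual_backup_manifest("example-cluster", "develop", "manual-drill-001")

# kubectl accepts JSON, so this output can be piped to `kubectl apply -f -`.
print(json.dumps(manifest, indent=2))
```

A freshly built manifest has no status, so backup_completed(manifest) is False until the operator has processed it. Even a completed phase is only the first half of the evidence; the restore validation still has to pass.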
Restore Verification Checklist (Required)
Restore is considered valid only if all checks pass:
- expected database objects and schemas exist, and expected migrations are present
- representative read query and write query both succeed
- application health checks pass against restored data source
- permissions/roles required by app are present
- row-count or key business record spot-check matches backup expectations
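The "all checks pass" rule above can be sketched as a small checklist runner. This is a minimal illustration, assuming each check is wrapped in a function that returns True or False; the names CheckResult, run_checklist, and restore_is_valid are hypothetical, and the lambdas below are stand-ins for real queries and health probes against the restored database.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_checklist(checks: dict[str, Callable[[], bool]]) -> list[CheckResult]:
    """Run every check, recording failures instead of stopping at the first."""
    results = []
    for name, check in checks.items():
        try:
            results.append(CheckResult(name, bool(check())))
        except Exception as exc:  # a check that raises counts as a failure
            results.append(CheckResult(name, False, repr(exc)))
    return results


def restore_is_valid(results: list[CheckResult]) -> bool:
    """Valid only if at least one check ran and every check passed."""
    return bool(results) and all(r.passed for r in results)


def _roles_missing() -> bool:
    # Stand-in for a role lookup that fails on the restored database.
    raise PermissionError("role app_user does not exist")


results = run_checklist({
    "schema and migrations present": lambda: True,
    "read and write queries succeed": lambda: True,
    "app health checks pass": lambda: True,
    "required roles present": _roles_missing,
    "row-count spot check matches": lambda: True,
})
for r in results:
    print("PASS" if r.passed else "FAIL", r.name, r.detail)
print("restore valid:", restore_is_valid(results))
```

Running every check rather than stopping at the first failure keeps the full picture in the drill evidence. An empty checklist is treated as invalid so that a misconfigured drill cannot pass by running nothing.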
Bad Restore Example (Why Backup Success Is Not Enough)
Observed failure pattern:
- backup artifact exists and restore command exits successfully
- restored DB lacks required role grants or is schema-incompatible with the app
- app starts but fails at runtime with authorization/schema errors
Lesson:
- “restore completed” is not recovery proof without data and app-level validation.
Lab Files
- lab.md
- runbook.md
- quiz.md
Done When
- learner can verify scheduled backups are running
- learner can execute one manual backup
- learner can perform restore simulation and validate recovered data