Core Track: guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Artifacts

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Chapter 11: Backup & Restore Basics

Why This Chapter Exists

Backups are useful only if restore is tested and repeatable. This chapter uses CloudNativePG as a real stateful target, running PostgreSQL on PVC-backed storage.

Data Plane Choice

CloudNativePG setup in this repo:

Backup Credential Model

Before SOPS integration, bootstrap credentials are created by Terraform:

  • secret name: cnpg-backup-s3
  • namespaces: develop, staging, production
  • keys: ACCESS_KEY_ID, ACCESS_SECRET_KEY, BUCKET (+ optional ENDPOINT, REGION)
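A quick way to confirm the bootstrap credentials landed as described is to check each namespace for the secret and its required keys. The sketch below assumes `kubectl` access to the cluster; the function name `check_backup_secret` is illustrative and not part of the repo. It only tests whether each key is present and never prints the secret values.

```shell
#!/bin/sh
# Sketch: confirm the backup secret carries its required keys in every namespace.
# Secret name, namespaces, and key names come from the list above;
# check_backup_secret is a hypothetical helper name.

SECRET_NAME="cnpg-backup-s3"
NAMESPACES="develop staging production"
REQUIRED_KEYS="ACCESS_KEY_ID ACCESS_SECRET_KEY BUCKET"

check_backup_secret() {
  missing=0
  for ns in $NAMESPACES; do
    for key in $REQUIRED_KEYS; do
      # jsonpath prints the base64 value, or nothing when the key or secret is absent.
      value=$(kubectl -n "$ns" get secret "$SECRET_NAME" \
        -o "jsonpath={.data.$key}" 2>/dev/null)
      if [ -z "$value" ]; then
        echo "MISSING $ns/$SECRET_NAME key=$key"
        missing=$((missing + 1))
      fi
    done
  done
  if [ "$missing" -eq 0 ]; then
    echo "OK: $SECRET_NAME is complete in all namespaces"
    return 0
  fi
  return 1
}
```

Calling `check_backup_secret` returns non-zero as soon as any required key is missing, so it can gate a drill script. The optional ENDPOINT and REGION keys are deliberately not checked.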

Terraform source:

Incident Hook

A backup job reports success, but a real restore attempt fails under pressure. Objects are present in storage, yet the restored data is unusable because of a permission or schema mismatch. Service stays degraded because the existence of a backup was mistaken for proof of recoverability. This chapter turns backup from a checkbox into a validated recovery capability.

What AI Would Propose (Brave Junior)

  • “Backup job is green, so recovery is guaranteed.”
  • “Skip restore drill; it takes too long.”
  • “Restore in production directly when incident starts.”

Why this sounds reasonable:

  • avoids extra drill time
  • keeps release pipeline short

Why This Is Dangerous

  • Backup success does not guarantee restore correctness.
  • Untested restore paths fail exactly when recovery time matters most.
  • Production-first restore attempts can amplify incident impact.

Guardrails That Stop It

  • No backup without a tested restore path.
  • Backup target credentials must be secret-managed (SOPS path next).
  • Recovery drills must run in non-production first.
  • Evidence is required: backup status + restore validation query.

Safe Workflow (Step-by-Step)

  1. Verify scheduled backups and retention status.
  2. Trigger one controlled manual backup.
  3. Restore into an isolated non-production target.
  4. Run restore verification checklist:
    • schema accessible
    • representative data query passes
    • app-level smoke checks succeed
  5. Record evidence and update recovery notes before considering production readiness.
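Steps 1 to 3 can be sketched with plain `kubectl` and two small manifests. This is a minimal sketch, not the repo's lab script: it assumes the source cluster is `app-postgres` in `develop` (as the lab prerequisites state) and that the cluster already has an object-store backup configuration. The restore cluster name `app-postgres-restore`, the `1Gi` storage size, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of workflow steps 1-3. Function names, RESTORE_CLUSTER, and the
# storage size are illustrative assumptions; adjust them to your environment.

NS="develop"
CLUSTER="app-postgres"
RESTORE_CLUSTER="app-postgres-restore"   # hypothetical name for the drill target

# Step 1: list backup schedules and existing backups.
show_backup_status() {
  kubectl -n "$NS" get scheduledbackups.postgresql.cnpg.io
  kubectl -n "$NS" get backups.postgresql.cnpg.io
}

# Step 2: render an on-demand Backup manifest ($1 = backup name).
manual_backup_manifest() {
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: $1
  namespace: $NS
spec:
  cluster:
    name: $CLUSTER
EOF
}

# Print the phase of a Backup object ($1 = backup name), e.g. "completed".
backup_phase() {
  kubectl -n "$NS" get backups.postgresql.cnpg.io "$1" \
    -o 'jsonpath={.status.phase}'
}

# Step 3: render a separate Cluster bootstrapped from that backup ($1 = backup name).
restore_cluster_manifest() {
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: $RESTORE_CLUSTER
  namespace: $NS
spec:
  instances: 1
  storage:
    size: 1Gi   # assumption: must be large enough for the restored data
  bootstrap:
    recovery:
      backup:
        name: $1
EOF
}
```

A drill would then run `manual_backup_manifest drill-1 | kubectl apply -f -`, wait until `backup_phase drill-1` prints `completed`, and apply `restore_cluster_manifest drill-1` the same way. Rendering the manifests separately from applying them keeps each one reviewable before anything touches the cluster.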

Restore Verification Checklist (Required)

Restore is considered valid only if all checks pass:

  • database object/schema exists and expected migrations are present
  • representative read query and write query both succeed
  • application health checks pass against the restored data source
  • permissions/roles required by the app are present
  • row-count or key business record spot-check matches backup expectations
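The "all checks pass" rule can be enforced mechanically with a small check runner. The sketch below is an assumption about how such a runner might look, not the lab's actual tooling: `check` and `verdict` are hypothetical helpers, and the commented `psql` lines use a hypothetical `RESTORE_DSN` variable and hypothetical table names.

```shell
#!/bin/sh
# Sketch: run named checks and declare the restore valid only if none failed.

failures=0

# check NAME COMMAND [ARGS...]: any non-zero exit counts as a failure.
check() {
  check_name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $check_name"
  else
    echo "FAIL  $check_name"
    failures=$((failures + 1))
  fi
}

# Print the overall result; returns non-zero if any check failed.
verdict() {
  if [ "$failures" -eq 0 ]; then
    echo "RESTORE VALID"
  else
    echo "RESTORE INVALID ($failures failed)"
    return 1
  fi
}

# Example wiring (hypothetical DSN and table names):
#   check "migrations present" psql "$RESTORE_DSN" -v ON_ERROR_STOP=1 \
#     -c 'SELECT 1 FROM schema_migrations LIMIT 1'
#   check "representative read" psql "$RESTORE_DSN" -v ON_ERROR_STOP=1 \
#     -c 'SELECT count(*) FROM orders'
#   verdict
```

Because every check is just a command, the same runner can wrap SQL queries, an application health endpoint probe, or a row-count comparison, and its output doubles as drill evidence.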

Bad Restore Example (Why Backup Success Is Not Enough)

Observed failure pattern:

  • backup artifact exists and restore command exits successfully
  • restored DB lacks required role grants or is schema-incompatible with the app
  • app starts but fails at runtime with authorization/schema errors

Lesson:

  • “restore completed” is not recovery proof without data and app-level validation.
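The missing-grant failure above can be caught before the application starts by asking PostgreSQL directly. This sketch uses the built-in `has_table_privilege` function through `psql`; the `RESTORE_DSN` variable, the helper name `has_privilege`, and the role and table names in the usage line are hypothetical.

```shell
#!/bin/sh
# Sketch: verify that a role holds a privilege on a table in the restored database.
# RESTORE_DSN is a hypothetical connection string for the restored cluster.

# has_privilege ROLE TABLE PRIVILEGE
#   returns 0 if granted, 1 if not granted, 2 if the query itself failed
#   (for example because the role or table does not exist after restore).
has_privilege() {
  result=$(psql "$RESTORE_DSN" -At -v ON_ERROR_STOP=1 \
    -c "SELECT has_table_privilege('$1', '$2', '$3')") || return 2
  [ "$result" = "t" ]
}
```

For example, `has_privilege app_user public.orders SELECT` would fail a drill early if the restored database lost the grant. Treating a query error as a distinct exit code separates "grant missing" from "role or table missing", which point to different restore defects.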

Lab Files

  • lab.md
  • runbook.md
  • quiz.md

Done When

  • learner can verify scheduled backups are running
  • learner can execute one manual backup
  • learner can perform restore simulation and validate recovered data

Lab: CloudNativePG Backup and Restore Simulation

  • verify CNPG cluster and scheduled backup
  • trigger one on-demand backup
  • perform restore simulation into a separate cluster

Prerequisites

  • CNPG operator is ready
  • app-postgres exists in develop
  • secret cnpg-backup-s3 exists in …

Quiz: Chapter 11 (Backup & Restore Basics)

  1. Which CNPG resource defines periodic backup schedule?
  2. Which secret name is used for object-store backup credentials in this repo?
  3. What is the safest environment for routine restore simulations?
  4. Which statement is …

Runbook: Backup and Restore (CNPG)

  • confirm backup health
  • execute manual backup
  • run restore simulation safely

Scope

  • primary target: develop or staging
  • production restore only under incident protocol

Step 1: Backup Health Check

kubectl -n <env> get …