Chapter 11: Backup & Restore Basics
Incident Hook
A backup job reports success, but a real restore attempt fails under pressure. Objects are present in storage, yet the restored data is unusable due to a permission/schema mismatch. Service stays degraded because backup existence was mistaken for proof of recoverability. This chapter turns backup from a checkbox into a validated recovery capability.
Observed Symptoms
What the team sees first:
- the backup job is green
- restore artifacts exist
- the restored service still cannot function correctly
That mismatch is the lesson. Backup presence is not the same thing as recovery proof.
Why This Chapter Exists
Backups are useful only if restore is tested and repeatable. This chapter uses CloudNativePG as a real stateful target, with PVC-backed PostgreSQL.
Data Plane Choice
CloudNativePG setup:
- operator managed by Flux
- dedicated clusters for `develop`, `staging`, and `production`
- each environment has its own `Cluster` plus `ScheduledBackup` objects
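A minimal sketch of what one environment's pair of objects could look like. All names, the storage size, and the schedule below are illustrative assumptions, not the SafeOps values; note that CloudNativePG's `ScheduledBackup` uses a six-field cron expression (seconds first).

```yaml
# Hypothetical sketch — names and values are illustrative, not the SafeOps config.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
  namespace: develop
spec:
  instances: 2
  storage:
    size: 5Gi          # PVC-backed storage
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: app-db-nightly
  namespace: develop
spec:
  schedule: "0 0 2 * * *"   # six-field cron: 02:00 daily (seconds field first)
  backupOwnerReference: self
  cluster:
    name: app-db
```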
Backup Credential Model
Before SOPS integration, bootstrap credentials are created by Terraform:
- secret name: `cnpg-backup-s3`
- namespaces: `develop`, `staging`, `production`
- keys: `ACCESS_KEY_ID`, `ACCESS_SECRET_KEY`, `BUCKET` (+ optional `ENDPOINT`, `REGION`)
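A hedged sketch of how a `Cluster` could consume those keys: CloudNativePG's `barmanObjectStore` takes each credential as a secret name/key pair, so the keys above map directly. The bucket path and endpoint placeholders here are illustrative.

```yaml
# Fragment of a Cluster spec — illustrative values, not the SafeOps manifest.
spec:
  backup:
    barmanObjectStore:
      destinationPath: s3://<BUCKET>/develop   # bucket from the BUCKET key
      endpointURL: https://<ENDPOINT>          # only when the optional ENDPOINT key is set
      s3Credentials:
        accessKeyId:
          name: cnpg-backup-s3
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-backup-s3
          key: ACCESS_SECRET_KEY
    retentionPolicy: "14d"   # retention window is an assumption
```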
The lesson snapshots below show the SafeOps cluster baseline and the bootstrap path that creates the initial backup credentials.
Confusion Phase
The team now has two competing stories:
- the backup system worked because artifacts exist
- the recovery path failed because the restored data is not operationally usable
If those are not separated clearly, teams declare success too early.
What AI Would Propose (Brave Junior)
- “Backup job is green, so recovery is guaranteed.”
- “Skip restore drill; it takes too long.”
- “Restore in production directly when incident starts.”
Why this sounds reasonable:
- avoids extra drill time
- keeps release pipeline short
Why This Is Dangerous
- Backup success does not guarantee restore correctness.
- Untested restore paths fail exactly when recovery time matters most.
- Production-first restore attempts can amplify incident impact.
Investigation
Treat restore validation as the real test, not the backup status line.
Safe investigation sequence:
- confirm the backup artifact and retention status
- restore into isolated non-production target
- verify schema, permissions, representative reads, and writes
- prove the application can actually use the restored data
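The second step — restoring into an isolated non-production target — can be sketched as a fresh CNPG cluster that bootstraps from the existing object store instead of initializing empty. Names and the bucket path below are illustrative assumptions.

```yaml
# Hypothetical restore drill — a fresh cluster that recovers from the backup
# object store; it never touches the original production cluster.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-drill
  namespace: develop
spec:
  instances: 1
  storage:
    size: 5Gi
  bootstrap:
    recovery:
      source: app-db-source        # points at the externalClusters entry below
  externalClusters:
    - name: app-db-source
      barmanObjectStore:
        destinationPath: s3://<BUCKET>/develop
        s3Credentials:
          accessKeyId:
            name: cnpg-backup-s3
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-backup-s3
            key: ACCESS_SECRET_KEY
```

Because the drill cluster is a separate object in a non-production namespace, schema, permission, and application-level checks can run against it without any risk to the live data path.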
Containment
Containment means restoring confidence before touching production:
- keep production recovery decisions gated on restore evidence
- fix permission, schema, or bootstrap gaps in non-production first
- document the validated restore path
- promote the recovery procedure only when it is repeatable
Guardrails That Stop It
- No backup without tested restore path.
- Backup target credentials must be secret-managed (SOPS path next).
- Recovery drills must run in non-production first.
- Evidence is required: backup status + restore validation query.
Investigation Snapshots
Here is the CloudNativePG cluster baseline used in the SafeOps system for scheduled backups and restore drills.
CloudNativePG cluster baseline
- flux/infrastructure/data/cnpg-clusters/develop/cluster.yaml
- flux/infrastructure/data/cnpg-clusters/develop/kustomization.yaml
- flux/infrastructure/data/cnpg-clusters/develop/scheduled-backup.yaml
- flux/infrastructure/data/cnpg-clusters/production/cluster.yaml
- flux/infrastructure/data/cnpg-clusters/production/kustomization.yaml
- flux/infrastructure/data/cnpg-clusters/production/postgres-app-secret.yaml
- flux/infrastructure/data/cnpg-clusters/production/scheduled-backup.yaml
- flux/infrastructure/data/cnpg-clusters/staging/cluster.yaml
- flux/infrastructure/data/cnpg-clusters/staging/kustomization.yaml
- flux/infrastructure/data/cnpg-clusters/staging/scheduled-backup.yaml
Here is the Terraform bootstrap used in the SafeOps system to create the initial backup credential secret before the encrypted path takes over.
Backup credential bootstrap
provider "hcloud" {
token = var.hcloud_token
}
locals {
# Control plane — always one pool.
control_plane_nodepools = [
{
name = "cp"
server_type = var.control_plane_server_type
location = var.location
labels = ["project=sre", "managed-by=terraform"]
taints = []
count = var.control_plane_count
disable_ipv6 = true
},
]
# Static workers — used when autoscaling is OFF.
# When autoscaling is ON the workers pool moves to autoscaler_nodepools.
static_agent_pools = var.autoscaling_enabled ? [] : [
{
name = "workers"
server_type = var.workers_server_type
location = var.location
labels = ["role=workers", "project=sre", "managed-by=terraform"]
taints = []
count = var.workers_count
disable_ipv6 = true
},
]
# Autoscaler pool — used when autoscaling is ON.
autoscaler_nodepools = var.autoscaling_enabled ? [
{
name = "workers"
server_type = var.workers_server_type
location = var.location
min_nodes = var.autoscaling_min_nodes
max_nodes = var.autoscaling_max_nodes
labels = { "role" = "workers", "project" = "sre", "managed-by" = "terraform" }
},
] : []
# Kured options — only populated when enabled.
kured_options = var.kured_enabled ? {
"reboot-days" = var.kured_reboot_days
"start-time" = var.kured_start_time
"end-time" = var.kured_end_time
} : {}
# etcd S3 backup — reuses the R2/S3 credentials already wired through load-env.sh.
# k3s expects a bare hostname (no https:// prefix).
etcd_s3_endpoint = var.backup_s3_endpoint != "" ? replace(var.backup_s3_endpoint, "https://", "") : ""
etcd_s3_backup = local.etcd_s3_endpoint != "" ? {
"etcd-s3-endpoint" = local.etcd_s3_endpoint
"etcd-s3-access-key" = var.backup_s3_access_key_id
"etcd-s3-secret-key" = var.backup_s3_secret_access_key
"etcd-s3-bucket" = var.backup_s3_bucket
"etcd-s3-folder" = "${var.cluster_name}/etcd-snapshots"
"etcd-s3-region" = var.backup_s3_region
} : {}
}
module "kube_hetzner" {
# kube-hetzner v2.19.0
source = "git::https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner.git?ref=a52d120bfb9f67d6c1d01add5d202609543df3ab"
providers = {
hcloud = hcloud
}
# Core
hcloud_token = var.hcloud_token
cluster_name = var.cluster_name
ssh_public_key = var.ssh_public_key
ssh_private_key = var.ssh_private_key
# Node pools
control_plane_nodepools = local.control_plane_nodepools
agent_nodepools = local.static_agent_pools
autoscaler_nodepools = local.autoscaler_nodepools
allow_scheduling_on_control_plane = var.allow_scheduling_on_control_plane
# Load balancer
load_balancer_type = var.load_balancer_type
load_balancer_location = var.location
load_balancer_disable_ipv6 = true
# Ingress
ingress_controller = var.ingress_controller
traefik_redirect_to_https = var.traefik_redirect_to_https
traefik_autoscaling = var.traefik_autoscaling
# K3s versioning
initial_k3s_channel = var.k3s_channel
install_k3s_version = var.k3s_version
automatically_upgrade_k3s = var.auto_upgrade_k3s
automatically_upgrade_os = var.auto_upgrade_os
# cert-manager is managed by Flux, not kube-hetzner
enable_cert_manager = false
# Kured
kured_options = local.kured_options
# etcd backup to S3/R2
etcd_s3_backup = local.etcd_s3_backup
}
locals {
kubeconfig_path = pathexpand("${path.module}/kubeconfig.yaml")
# Render pullSecret only when a token is provided.
flux_pull_secret_yaml = var.flux_git_token != "" ? " pullSecret: flux-system\n" : ""
flux_git_secret_enabled = var.flux_git_token != ""
sops_age_secret_enabled = var.sops_age_key != ""
backup_s3_secret_enabled = nonsensitive(
var.backup_s3_access_key_id != "" &&
var.backup_s3_secret_access_key != "" &&
var.backup_s3_bucket != ""
)
}
resource "local_sensitive_file" "kubeconfig" {
content = module.kube_hetzner.kubeconfig
filename = local.kubeconfig_path
file_permission = "0600"
}
provider "helm" {
kubernetes {
host = module.kube_hetzner.kubeconfig_data.host
client_certificate = module.kube_hetzner.kubeconfig_data.client_certificate
client_key = module.kube_hetzner.kubeconfig_data.client_key
cluster_ca_certificate = module.kube_hetzner.kubeconfig_data.cluster_ca_certificate
}
}
provider "kubernetes" {
host = module.kube_hetzner.kubeconfig_data.host
client_certificate = module.kube_hetzner.kubeconfig_data.client_certificate
client_key = module.kube_hetzner.kubeconfig_data.client_key
cluster_ca_certificate = module.kube_hetzner.kubeconfig_data.cluster_ca_certificate
}
resource "kubernetes_namespace" "bootstrap" {
for_each = toset([
"flux-system",
"develop",
"staging",
"production",
"observability",
])
metadata {
name = each.value
labels = {
"managed-by" = "terraform"
}
}
depends_on = [local_sensitive_file.kubeconfig]
lifecycle {
ignore_changes = [
metadata[0].labels,
metadata[0].annotations,
]
}
}
# (opening of this resource was lost in the snapshot; the header and metadata
#  below are a reconstruction — a Flux cluster-settings ConfigMap is assumed)
resource "kubernetes_config_map" "cluster_settings" {
metadata {
name = "cluster-settings"
namespace = "flux-system"
}
data = {
cloudflare_proxied = "enabled"
cluster_name = var.cluster_name
}
depends_on = [kubernetes_namespace.bootstrap]
}
# (resource header lost in the snapshot; reconstructed from the
#  kubernetes_secret.flux_git_credentials reference used later in this file)
resource "kubernetes_secret" "flux_git_credentials" {
metadata {
name = "flux-system"
namespace = "flux-system"
}
type = "Opaque"
data = {
username = "git"
password = var.flux_git_token
}
depends_on = [kubernetes_namespace.bootstrap]
}
resource "null_resource" "flux_operator_install" {
depends_on = [kubernetes_namespace.bootstrap]
triggers = {
kubeconfig_path = local.kubeconfig_path
}
provisioner "local-exec" {
when = create
interpreter = ["/bin/bash", "-c"]
command = "kubectl --kubeconfig=\"${local.kubeconfig_path}\" apply -f https://github.com/controlplaneio-fluxcd/flux-operator/releases/latest/download/install.yaml"
}
}
resource "null_resource" "flux_instance" {
depends_on = [
null_resource.flux_operator_install,
kubernetes_secret.flux_git_credentials,
]
triggers = {
kubeconfig_path = local.kubeconfig_path
repo_url = var.flux_git_repository_url
repo_branch = var.flux_git_repository_branch
repo_path = var.flux_kustomization_path
flux_version = var.flux_version
provider = "generic"
}
provisioner "local-exec" {
when = create
interpreter = ["/bin/bash", "-c"]
command = <<-EOC
cat <<EOF | kubectl --kubeconfig="${local.kubeconfig_path}" apply -f -
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
name: flux
namespace: flux-system
spec:
distribution:
version: "${var.flux_version}"
registry: ghcr.io/fluxcd
components:
- source-controller
- kustomize-controller
- helm-controller
- notification-controller
- image-reflector-controller
- image-automation-controller
cluster:
type: kubernetes
sync:
kind: GitRepository
url: "${var.flux_git_repository_url}"
ref: "refs/heads/${var.flux_git_repository_branch}"
provider: generic
path: "${var.flux_kustomization_path}"
${local.flux_pull_secret_yaml}
EOF
EOC
}
provisioner "local-exec" {
when = destroy
on_failure = continue
interpreter = ["/bin/bash", "-c"]
command = "kubectl --kubeconfig=\"${self.triggers.kubeconfig_path}\" delete fluxinstance flux -n flux-system --ignore-not-found=true --wait=false --timeout=30s 2>/dev/null || true"
}
}
resource "null_resource" "flux_pre_destroy" {
depends_on = [
local_sensitive_file.kubeconfig,
kubernetes_namespace.bootstrap,
null_resource.flux_instance,
]
triggers = {
kubeconfig_path = local.kubeconfig_path
namespaces = "flux-system,develop,staging,production,observability"
}
provisioner "local-exec" {
when = destroy
on_failure = continue
interpreter = ["/bin/bash", "-c"]
command = "\"${path.module}/../scripts/flux-pre-destroy.sh\" \"${self.triggers.kubeconfig_path}\" \"${self.triggers.namespaces}\""
}
}
# (resource header lost in the snapshot; resource name and for_each source are
#  assumed — one image pull secret per app namespace)
resource "kubernetes_secret" "ghcr_credentials" {
for_each = toset(["develop", "staging", "production"])
metadata {
name = "ghcr-credentials-docker"
namespace = each.key
}
type = "kubernetes.io/dockerconfigjson"
data = {
".dockerconfigjson" = jsonencode({
auths = {
"ghcr.io" = {
username = var.ghcr_username
password = var.ghcr_token
auth = base64encode("${var.ghcr_username}:${var.ghcr_token}")
}
}
})
}
depends_on = [kubernetes_namespace.bootstrap]
}
# (resource header lost in the snapshot; resource name is assumed)
resource "kubernetes_secret" "github_image_automation" {
metadata {
name = "github-image-automation"
namespace = "flux-system"
}
type = "Opaque"
data = {
username = "git"
password = var.flux_git_token
}
depends_on = [kubernetes_namespace.bootstrap]
}
# (resource header lost in the snapshot; resource name is assumed — guarded in
#  practice by local.sops_age_secret_enabled)
resource "kubernetes_secret" "sops_age" {
metadata {
name = "sops-age"
namespace = "flux-system"
}
data = {
"age.agekey" = var.sops_age_key
}
type = "Opaque"
depends_on = [kubernetes_namespace.bootstrap]
}
# (resource header lost in the snapshot; resource name and for_each source are
#  assumed — one backup credential secret per environment, guarded in practice
#  by local.backup_s3_secret_enabled)
resource "kubernetes_secret" "cnpg_backup_s3" {
for_each = toset(["develop", "staging", "production"])
metadata {
name = "cnpg-backup-s3"
namespace = each.key
}
type = "Opaque"
data = merge(
{
ACCESS_KEY_ID = var.backup_s3_access_key_id
ACCESS_SECRET_KEY = var.backup_s3_secret_access_key
BUCKET = var.backup_s3_bucket
},
var.backup_s3_endpoint != "" ? { ENDPOINT = var.backup_s3_endpoint } : {},
var.backup_s3_region != "" ? { REGION = var.backup_s3_region } : {},
)
depends_on = [kubernetes_namespace.bootstrap]
}
System Context
This chapter gives stateful recovery the same evidence standard as the rest of the course.
It connects directly to:
- Chapter 10 observability, which proves whether restore actually recovered service
- Chapter 14 operations, where backup claims must survive real incident pressure
- Chapter 17 data migrations, where rollback safety depends on knowing recovery is real
Safe Workflow (Step-by-Step)
- Verify scheduled backups and retention status.
- Trigger one controlled manual backup.
- Restore into isolated non-production target.
- Run restore verification checklist:
- schema accessible
- representative data query passes
- app-level smoke checks succeed
- Record evidence and update recovery notes before considering production readiness.
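Step 2's controlled manual backup can be expressed as a one-off CNPG `Backup` object (a sketch; names are illustrative assumptions). Its completion can then be confirmed from the object's reported status rather than from the job's exit code alone.

```yaml
# Hypothetical one-off backup request — names are illustrative.
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: app-db-manual-drill
  namespace: develop
spec:
  cluster:
    name: app-db
# After applying, inspect the object's status (e.g. with
#   kubectl get backup app-db-manual-drill -n develop -o yaml)
# and record the reported phase as part of the evidence trail.
```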
Restore Verification Checklist (Required)
Restore is considered valid only if all checks pass:
- database object/schema exists and expected migrations are present
- representative read query and write query both succeed
- application health checks pass against restored data source
- permissions/roles required by app are present
- row-count or key business record spot-check matches backup expectations
Bad Restore Example (Why Backup Success Is Not Enough)
Observed failure pattern:
- backup artifact exists and restore command exits successfully
- restored DB misses required role grants or schema compatibility
- app starts but fails at runtime with authorization/schema errors
Lesson:
- “restore completed” is not recovery proof without data and app-level validation.
Lab Files
- lab.md
- runbook.md
- quiz.md
Done When
- learner can verify scheduled backups are running
- learner can execute one manual backup
- learner can perform restore simulation and validate recovered data