Advanced Track: do this after finishing Chapters 01-14.

Estimated Time

  • Reading: 30-40 min
  • Lab: 60-90 min
  • Quiz: 15-20 min

Prerequisites

  • Core track (Chapters 01-14) completed.
  • GitOps promotion and observability workflows available.

Source Code References

  • cnpg-clusters/

What You Will Produce

A go/no-go evidence package: rollout results, remediation notes, and explicit rollback conditions.

Chapter 17: Rollback and Data Migrations (Advanced)

Incident Hook

A release ships application code and a schema migration in one step. The migration drops or renames a column that the previous app version still uses. The new deployment fails health checks; rolling back the application image succeeds, but the old app can no longer read its data. The incident drags on because an “app rollback” alone cannot restore service.

Observed Symptoms

What the team sees first:

  • the new release fails, but the old app version also cannot recover cleanly
  • image rollback appears to work while data compatibility does not
  • responders discover too late that the rollback window already closed

The real failure is coupling application rollout to irreversible schema change.
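
The failure mode can be reproduced in miniature. The sketch below is an illustration only: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the `users` table, the `email` and `email_address` columns, and both queries are hypothetical names, not taken from the course repository. It assumes SQLite 3.25 or newer, which is needed for `RENAME COLUMN`.

```python
import sqlite3

# Hypothetical queries issued by the two application versions.
OLD_APP_QUERY = "SELECT id, email FROM users"          # previous app version
NEW_APP_QUERY = "SELECT id, email_address FROM users"  # new app version


def can_run(conn: sqlite3.Connection, query: str) -> bool:
    """Return True if the query executes against the current schema."""
    try:
        conn.execute(query).fetchall()
        return True
    except sqlite3.OperationalError:
        # Raised by SQLite when a referenced column no longer exists.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

old_app_before = can_run(conn, OLD_APP_QUERY)

# Destructive migration shipped in the same release as the new app image.
conn.execute("ALTER TABLE users RENAME COLUMN email TO email_address")

old_app_after = can_run(conn, OLD_APP_QUERY)
new_app_after = can_run(conn, NEW_APP_QUERY)

print(old_app_before, old_app_after, new_app_after)
```

After the rename, rolling the image back to the previous version does not help: the old query targets a column that no longer exists, so only a schema-level recovery restores service.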

Confusion Phase

Application rollback usually feels familiar. Database rollback usually does not.

The real questions are:

  • is the current problem application behavior, schema compatibility, or both?
  • which rollback paths are still genuinely safe?

Why This Chapter Exists

Application rollback is easy only when database state is compatible. Most production rollback failures happen at the boundary between application version and schema version.

This chapter defines a safe migration discipline:

  • backward-compatible schema first
  • application rollout second
  • destructive schema changes last
  • explicit rollback windows and feature flag gates
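
As a minimal sketch of the first step in that discipline, the example below applies an additive (expand) change and shows that both the old and the new read paths keep working. It again uses in-memory SQLite as a stand-in for PostgreSQL, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")

# Expand phase: additive only. The old column is left untouched.
conn.execute("ALTER TABLE users ADD COLUMN email_address TEXT")

# Backfill the new column; the WHERE clause makes a re-run harmless.
conn.execute(
    "UPDATE users SET email_address = email WHERE email_address IS NULL"
)

# Both application versions can read during the rollback window.
old_rows = conn.execute("SELECT id, email FROM users").fetchall()
new_rows = conn.execute("SELECT id, email_address FROM users").fetchall()

print(old_rows)
print(new_rows)
```

In a real rollout the new code path would typically also write to both columns while the window is open, so that a rollback to the previous version finds current data. That detail is left out of the sketch.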

Learning Objectives

By the end of this chapter, learners can:

  • explain the expand/contract migration strategy
  • design a rollback-safe deploy sequence for app + schema
  • execute a migration incident drill with evidence capture
  • define break-glass rules for failed migrations

Course Implementation Scope

  • this chapter runs migration workflow drills on CNPG/PostgreSQL targets
  • application behavior gating is demonstrated with feature-flag simulation
  • the same rollout and rollback sequence applies directly to database-backed login/user flows

What AI Would Propose (Brave Junior)

  • “Apply migration and deploy together in one PR.”
  • “If deploy fails, just rollback image tag.”
  • “Skip feature flags to reduce complexity.”

Why this sounds reasonable:

  • fewer moving parts in one release
  • fast visible progress

Why This Is Dangerous

  • coupling schema and application changes leaves no safe rollback path
  • destructive changes remove the safety window
  • a partial rollout can leave mixed-version traffic running against an incompatible schema

Investigation

Treat compatibility as the primary evidence path.

Safe investigation sequence:

  1. verify what schema change already landed
  2. confirm whether the previous app version can still operate safely
  3. inspect feature-flag state and rollout order
  4. decide whether to revert behavior, hold the schema, or both
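
Step 2 of the sequence can be turned into a mechanical check. The sketch below assumes the team keeps a list of the columns each app version reads; that list, the function name, and the example data are all hypothetical. For a PostgreSQL target, the live column set could be gathered from `information_schema.columns`; here it is passed in as plain data.

```python
from typing import Dict, Set


def missing_columns(
    required: Dict[str, Set[str]],
    live: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Per table, list columns the previous app version needs but the live schema lacks."""
    gaps: Dict[str, Set[str]] = {}
    for table, columns in required.items():
        absent = columns - live.get(table, set())
        if absent:
            gaps[table] = absent
    return gaps


# Hypothetical inputs for illustration.
required_by_previous_app = {"users": {"id", "email"}}
live_after_rename = {"users": {"id", "email_address"}}
live_after_expand = {"users": {"id", "email", "email_address"}}

print(missing_columns(required_by_previous_app, live_after_rename))
print(missing_columns(required_by_previous_app, live_after_expand))
```

An empty result means the previous app version can still find every column it reads, which is a necessary condition for image rollback to be safe, though not a sufficient one.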

Containment

Containment protects the rollback window:

  1. freeze destructive migration steps
  2. revert application behavior before touching data cleanup
  3. keep the additive schema in place while compatibility is restored
  4. move to the contract phase only after explicit go/no-go approval is given

Guardrails That Stop It

  • expand/contract strategy only
  • migration scripts must be idempotent and reviewed
  • app rollout uses feature flags for behavior gating
  • rollback plan includes data compatibility checks
  • destructive DDL only after verification window and explicit approval
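
One way to back the last guardrail with automation is a review-time check that flags destructive statements in a migration file. The sketch below is deliberately naive: the pattern list is a hypothetical starting point, not an exhaustive one, and splitting on semicolons will misread SQL that contains semicolons inside string literals or function bodies. It supports human review and does not replace it.

```python
import re
from typing import List

# Hypothetical, non-exhaustive patterns for destructive DDL.
DESTRUCTIVE_PATTERNS = [
    r"\bDROP\s+(TABLE|COLUMN|INDEX|SCHEMA)\b",
    r"\bRENAME\s+(COLUMN|TO)\b",
    r"\bTRUNCATE\b",
    r"\bALTER\s+COLUMN\s+\S+\s+(SET\s+DATA\s+)?TYPE\b",
]


def destructive_statements(sql: str) -> List[str]:
    """Return the statements in a migration script that match a destructive pattern."""
    hits = []
    for statement in sql.split(";"):
        text = " ".join(statement.split())  # collapse whitespace and newlines
        if text and any(
            re.search(pattern, text, re.IGNORECASE)
            for pattern in DESTRUCTIVE_PATTERNS
        ):
            hits.append(text)
    return hits


expand_sql = """
ALTER TABLE users ADD COLUMN IF NOT EXISTS email_address TEXT;
UPDATE users SET email_address = email WHERE email_address IS NULL;
"""
contract_sql = "ALTER TABLE users DROP COLUMN email;"

print(destructive_statements(expand_sql))
print(destructive_statements(contract_sql))
```

The expand script also illustrates the idempotency guardrail: `ADD COLUMN IF NOT EXISTS` and the `WHERE ... IS NULL` backfill can both be re-run without changing the result.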

Expand/Contract Visual Flow

Expand (additive schema) --> Deploy app (compat mode, flag OFF)
           |                                |
           +------ rollback-safe window ----+
                            |
                            v
                 Enable flag gradually
                            |
                            v
                 Contract (destructive cleanup)

Investigation Snapshots

Here is the CloudNativePG cluster baseline used in the SafeOps system. This is the data platform contract that schema changes must respect during the rollback window.

CloudNativePG cluster baseline

  • flux/infrastructure/data/cnpg-clusters/develop/cluster.yaml
  • flux/infrastructure/data/cnpg-clusters/develop/kustomization.yaml
  • flux/infrastructure/data/cnpg-clusters/develop/scheduled-backup.yaml
  • flux/infrastructure/data/cnpg-clusters/production/cluster.yaml
  • flux/infrastructure/data/cnpg-clusters/production/kustomization.yaml
  • flux/infrastructure/data/cnpg-clusters/production/postgres-app-secret.yaml
  • flux/infrastructure/data/cnpg-clusters/production/scheduled-backup.yaml
  • flux/infrastructure/data/cnpg-clusters/staging/cluster.yaml
  • flux/infrastructure/data/cnpg-clusters/staging/kustomization.yaml
  • flux/infrastructure/data/cnpg-clusters/staging/scheduled-backup.yaml

System Context

This chapter combines the course’s hardest dependencies: app rollout, data state, and rollback evidence.

It depends on:

  • Chapter 02: reviewed change execution
  • Chapter 11: validated recovery paths
  • Chapter 14: incident discipline when rollback is no longer a single-command decision

Safe Workflow (Step-by-Step)

  1. Create the expand migration (additive only).
  2. Deploy the migration job and verify schema compatibility.
  3. Deploy the app with the new code path behind a feature flag (flag off).
  4. Enable the flag gradually and monitor SLO/error budget.
  5. Keep the rollback window open until the confidence threshold is met.
  6. Run the contract migration only after explicit approval.
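
Step 4 depends on a flag that can be enabled for a growing share of traffic. The sketch below shows one common approach, a deterministic hash bucket per user. It is an assumption about how such a flag could work, not the course's feature-flag simulation, and the flag name `use_email_address` is hypothetical.

```python
import hashlib


def flag_enabled(
    user_id: str,
    rollout_percent: int,
    flag_name: str = "use_email_address",  # hypothetical flag name
) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


users = [f"user-{i}" for i in range(200)]
enabled_at_10 = {u for u in users if flag_enabled(u, 10)}
enabled_at_50 = {u for u in users if flag_enabled(u, 50)}

# Raising the percentage only adds users; nobody flips back and forth.
print(enabled_at_10 <= enabled_at_50)
```

Because a user's bucket never changes, setting the percentage back to 0 reverts application behavior for everyone without touching the schema, which is what keeps the rollback window usable.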

Rollback Window Rules

  • no destructive migration inside initial rollout window
  • rollback window must have clear end criteria (time + stability metrics)
  • if error budget/SLO degrades, freeze contract phase and revert app behavior first
  • contract step requires explicit go/no-go approval with backup evidence attached
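
These rules can be written down as an explicit gate so that the go/no-go decision leaves evidence behind. The sketch below is a hypothetical shape for that gate: the field names and the 72-hour and 50% thresholds are illustrative assumptions, and each team has to define its own end criteria.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WindowEvidence:
    hours_since_full_rollout: float
    error_budget_remaining: float  # fraction of the budget left, 0.0 to 1.0
    backup_verified: bool
    approved_by: Optional[str]     # who gave explicit approval, if anyone


def contract_blockers(
    evidence: WindowEvidence,
    min_hours: float = 72.0,   # hypothetical time criterion
    min_budget: float = 0.5,   # hypothetical stability criterion
) -> List[str]:
    """Return the reasons the contract step is a no-go; an empty list means go."""
    blockers = []
    if evidence.hours_since_full_rollout < min_hours:
        blockers.append("rollback window has not met its time criterion")
    if evidence.error_budget_remaining < min_budget:
        blockers.append("error budget degraded: freeze contract, revert app behavior first")
    if not evidence.backup_verified:
        blockers.append("no verified backup evidence attached")
    if not evidence.approved_by:
        blockers.append("no explicit go/no-go approval recorded")
    return blockers


ready = WindowEvidence(96.0, 0.9, True, "release-manager")
too_early = WindowEvidence(4.0, 0.2, False, None)

print(contract_blockers(ready))
print(contract_blockers(too_early))
```

Returning the list of reasons rather than a bare boolean means the output can be pasted straight into the go/no-go evidence package.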

Lab Files

  • lab.md
  • runbook-rollback-migrations.md
  • quiz.md

Done When

  • learner can run migration drill with rollback-safe sequence
  • learner can distinguish app rollback vs data rollback limits
  • learner can define no-go conditions before destructive migration

Scope Note

Rollback-safe database change design cannot be fully explained in a single chapter. This lesson is only an entry point into the subject so the learner can start with the right mental model: compatibility windows, expand/contract sequencing, recovery evidence, and rollback limits.

This topic deserves a dedicated course of its own. SafeOps Academy will have a separate course focused entirely on production database operations, migration safety, recovery strategy, and rollback-aware change design.

Hands-On Materials

Labs, quizzes, and runbooks are available to course members.

  • Lab: Rollback-Safe Migration Drill (Advanced)
  • Quiz: Chapter 17 (Rollback and Data Migrations)
  • Rollback & Data Migrations Scorecard (Template)
  • Runbook: Rollback and Migration Operations (Advanced)