Advanced Track: Do this after finishing Chapters 01-14.

Estimated Time

  • Reading: 30-40 min
  • Lab: 60-90 min
  • Quiz: 15-20 min

Prerequisites

  • Core track (Chapters 01-14) completed.
  • GitOps promotion and observability workflows available.

Source Code References

  • cnpg-clusters/

What You Will Produce

A go/no-go evidence package: rollout results, remediation notes, and explicit rollback conditions.

Investigation

Treat a stateful failure as a dual-layer incident: both the application code and the data may be in an inconsistent state.

Safe investigation sequence:

  1. Verify Migration Status: Identify which migration step failed and what changes were partially applied.
  2. Check App Compatibility: Confirm whether the current application version can function with the current database schema.
  3. Audit Data Integrity: Scan the affected tables for corruption or missing records.
  4. Identify Restore Point: Find the exact timestamp or backup ID from just before the migration began.
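
As a minimal sketch of steps 1 and 4, the Python below picks out the first failed migration step and the latest backup completed before the migration began. The record types, status labels, and field names are hypothetical; in practice this data would come from your migration tool's history and your backup catalog.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class MigrationStep:
    name: str
    status: str  # hypothetical labels: "applied", "failed", "pending"


@dataclass
class Backup:
    backup_id: str
    completed_at: datetime


def first_failed_step(steps: List[MigrationStep]) -> Optional[MigrationStep]:
    """Return the first step marked "failed", or None if no step failed."""
    for step in steps:
        if step.status == "failed":
            return step
    return None


def latest_backup_before(
    backups: List[Backup], migration_start: datetime
) -> Optional[Backup]:
    """Return the most recent backup completed strictly before the migration began."""
    candidates = [b for b in backups if b.completed_at < migration_start]
    return max(candidates, key=lambda b: b.completed_at, default=None)
```

Every step listed before the failed one was applied, so those are the partial changes to account for. The strict "before" comparison is a deliberately conservative choice: a backup that finished during the migration may already contain partial changes.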

Containment

Containment means stopping the corruption and deciding on the restore path.

Containment steps:

  1. Pause Traffic: Stop the application or the migration job to prevent further data corruption.
  2. Evaluate Revert vs. Restore: Decide whether the migration can be safely undone via SQL or whether a full point-in-time recovery (PITR) restore is required.
  3. Execute Code Rollback: Revert the application to the previous version in Git.
  4. Perform Data Recovery: Restore the database to the pre-migration checkpoint.
  5. Verify Both: Confirm that the application version and the database schema are compatible and that both function correctly.
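
The revert-versus-restore decision in step 2 can be written down as an explicit rule so it is not improvised during the incident. The sketch below is one possible policy, not a prescribed one; the `MigrationAssessment` fields and the returned labels are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MigrationAssessment:
    has_down_migration: bool      # a tested reverse SQL script exists
    destructive_change: bool      # e.g. a dropped column or overwritten rows
    integrity_check_passed: bool  # outcome of the data integrity audit


def choose_recovery_path(assessment: MigrationAssessment) -> str:
    """Return "revert" for a SQL down-migration, otherwise "restore" (PITR)."""
    # Data that was destroyed or corrupted cannot be recovered by reverse SQL.
    if assessment.destructive_change or not assessment.integrity_check_passed:
        return "restore"
    if assessment.has_down_migration:
        return "revert"
    # No safe way to undo the change in place, so fall back to a restore.
    return "restore"
```

The rule defaults to a restore whenever there is doubt, trading a longer recovery for a known-good data state.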

The goal is “atomic recovery,” where code and state return to a consistent baseline.
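
One way to make the "consistent baseline" checkable is a compatibility table that maps each application version to the schema versions it supports. The version strings, schema numbers, and function name below are hypothetical and only illustrate the shape of the check.

```python
from typing import Dict, Set

# Hypothetical compatibility table: application version -> supported schema versions.
COMPATIBLE_SCHEMAS: Dict[str, Set[int]] = {
    "v1.4.0": {41, 42},
    "v1.5.0": {42, 43},
}


def is_consistent_baseline(
    app_version: str,
    schema_version: int,
    compat: Dict[str, Set[int]] = COMPATIBLE_SCHEMAS,
) -> bool:
    """Return True if the deployed app version supports the live schema version."""
    # Unknown app versions are treated as incompatible rather than assumed safe.
    return schema_version in compat.get(app_version, set())
```

After rolling back to "v1.4.0", for example, recovery would count as complete only if the restored database reports schema 41 or 42.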


Pause and Predict: What automated guardrail would have prevented this incident entirely?