Investigation
Treat stateful failures as a dual-layer incident: code and data.
Safe investigation sequence:
- Verify Migration Status: Identify which migration step failed and what changes were partially applied.
- Check App Compatibility: Confirm whether the currently deployed application version can function with the database schema as it stands now.
- Audit Data Integrity: Scan the affected tables for corruption or missing records.
- Identify Restore Point: Find the exact timestamp or backup ID from just before the migration began.
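The first and last steps of this sequence can be sketched in code. The example below is a minimal illustration, not a definitive implementation: the `schema_migrations` table, its `version` and `status` columns, the `'applied'` status value, and the backup list format are all assumptions made for the example. Real migration tools and backup systems use their own bookkeeping.

```python
import sqlite3
from datetime import datetime
from typing import List, Optional, Tuple

# Assumption: migrations are tracked in a hypothetical table
# schema_migrations(version TEXT, status TEXT). Adapt to your tool's schema.


def unfinished_migrations(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Return (version, status) for every migration not marked 'applied'."""
    return conn.execute(
        "SELECT version, status FROM schema_migrations "
        "WHERE status <> 'applied' ORDER BY version"
    ).fetchall()


def find_restore_point(
    backups: List[Tuple[str, datetime]],
    migration_started_at: datetime,
) -> Optional[str]:
    """Return the ID of the newest backup completed strictly before the
    migration began, or None if no such backup exists."""
    earlier = [(done, backup_id) for backup_id, done in backups
               if done < migration_started_at]
    return max(earlier)[1] if earlier else None
```

If `find_restore_point` returns None, no usable pre-migration backup was found, which is itself a finding to escalate before containment starts.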
Containment
Containment means stopping the corruption and deciding on the restore path.
Containment steps:
- Pause Traffic: Stop the application or the migration job to prevent further data corruption.
- Evaluate Revert vs. Restore: Decide whether the migration can be safely undone with SQL or whether a full point-in-time recovery (PITR) restore is required.
- Execute Code Rollback: Redeploy the previous application version. Reverting the commit in Git alone does not change what is running.
- Perform Data Recovery: Restore the database to the pre-migration checkpoint.
- Verify Both: Confirm the app and data are synchronized and functional.
The goal is “atomic recovery,” where code and state return to a consistent baseline.
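The revert-versus-restore decision and the final verification can be written down as explicit checks. The sketch below is one possible encoding, under stated assumptions: the three criteria in `MigrationState`, the `RecoveryPath` names, and the idea of comparing schema version strings are hypothetical choices for illustration, and a real incident may need more inputs.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryPath(Enum):
    SQL_REVERT = "sql_revert"      # undo the migration with reverse SQL
    PITR_RESTORE = "pitr_restore"  # restore to the pre-migration checkpoint


@dataclass(frozen=True)
class MigrationState:
    # Assumed decision inputs; adjust to what your investigation produced.
    has_down_migration: bool          # a tested reverse script exists
    destructive_change_applied: bool  # e.g. a column or table was dropped
    corruption_detected: bool         # audit found damaged or missing rows


def choose_recovery_path(state: MigrationState) -> RecoveryPath:
    """Prefer a SQL revert only when it cannot lose or hide data;
    otherwise fall back to a point-in-time restore."""
    if (state.has_down_migration
            and not state.destructive_change_applied
            and not state.corruption_detected):
        return RecoveryPath.SQL_REVERT
    return RecoveryPath.PITR_RESTORE


def is_consistent(app_expected_schema: str, db_schema_version: str) -> bool:
    """'Verify Both': the deployed app and the database agree on the schema."""
    return app_expected_schema == db_schema_version
```

Defaulting to the restore path whenever any condition is in doubt keeps the decision conservative: a SQL revert is faster, but it cannot bring back rows that were already damaged or dropped.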
Pause and Predict: What automated guardrail would have prevented this incident entirely?