Runbook: Rollback and Migration Operations (Advanced)
Purpose
Operate application + schema releases with explicit rollback safety and minimal blast radius.
Scope
This runbook covers:
- migration classification and sequencing
- rollback execution order
- incident handling for migration-related failures
- destructive migration approval gates
Migration Types
- Expand (safe/additive):
  - add nullable columns
  - add new tables/indexes
  - keep old schema path valid
- Contract (destructive):
  - drop/rename columns
  - remove legacy constraints/paths
  - only after stable compatibility window
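The classification above can be roughly automated as a pre-review check. The sketch below is a heuristic, not part of any migration tool: the function name classify_migration and the keyword list (drop, rename) are assumptions taken from the bullets above. It errs on the side of flagging a file as contract, so a human should still confirm the result.

```shell
# Hypothetical helper: classify a SQL migration file as "expand" or "contract".
# Assumption: any whole-word DROP or RENAME marks the file as destructive.
classify_migration() {
  local file="$1"
  if [ ! -r "$file" ]; then
    echo "unreadable: $file" >&2
    return 2
  fi
  # -i: case-insensitive, -w: whole words only (so "drop_count" does not match)
  if grep -Eiqw 'drop|rename' "$file"; then
    echo "contract"
  else
    echo "expand"
  fi
}
```

A call such as classify_migration path/to/migration.sql prints one word, which can be recorded in the pre-deploy checklist.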
Pre-Deploy Checklist
- Migration classified (expand or contract).
- Rollback window defined with owner and duration.
- Backup/restore evidence is fresh.
- Feature flag plan exists for new code path.
- Monitoring and alert thresholds are confirmed.
Rollout Sequence (Mandatory)
1. Expand migration.
2. Application deploy with flag OFF.
3. Controlled flag enable.
4. Observe stability window.
5. Contract migration (approval required).
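One way to make the ordering and the approval gate hard to skip is a small wrapper that walks the steps in sequence. This is a minimal sketch: every step is a placeholder echo to be replaced with real tooling, and the function name run_rollout and the variable CONTRACT_APPROVED_BY are hypothetical.

```shell
# Sketch only: walks the mandatory sequence and blocks the contract step
# unless an approver is named in CONTRACT_APPROVED_BY (hypothetical variable).
run_rollout() {
  local step
  for step in \
    expand-migration \
    deploy-app-flag-off \
    enable-flag-controlled \
    observe-stability-window \
    contract-migration
  do
    if [ "$step" = "contract-migration" ] && [ -z "${CONTRACT_APPROVED_BY:-}" ]; then
      echo "BLOCKED: $step requires approval (set CONTRACT_APPROVED_BY)" >&2
      return 1
    fi
    # Placeholder: call the real migration/deploy/flag command for this step.
    echo "STEP: $step"
  done
}
```

Without an approver the function stops after the stability window and returns non-zero, which leaves the expanded schema in place.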
Rollback Order
If an incident occurs after expand + app deploy:
- disable feature flag (fastest mitigation)
- rollback application version if needed
- keep expanded schema intact
- investigate before any schema reversal
If a destructive migration has already been applied:
- treat as high-severity incident
- invoke restore/data recovery protocol
- communicate RTO/RPO impact immediately
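For the application rollback step in the first case, a Kubernetes Deployment can be reverted with kubectl rollout undo. The sketch below assumes the app runs as a Deployment in the develop namespace used elsewhere in this runbook; the function name rollback_app, the deployment name, and the 120s timeout are illustrative assumptions. It defaults to a dry run that only prints the commands. Disabling the feature flag is tool-specific and is not shown.

```shell
# Sketch: print (default) or execute the app rollback for one Deployment.
# It never touches the schema, matching "keep expanded schema intact".
rollback_app() {
  local deployment="$1"
  local mode="${2:-dry-run}"
  local ns="develop"
  if [ "$mode" = "execute" ]; then
    # Revert to the previous revision, then wait for the rollout to settle.
    kubectl -n "$ns" rollout undo "deployment/$deployment" &&
      kubectl -n "$ns" rollout status "deployment/$deployment" --timeout=120s
  else
    echo "kubectl -n $ns rollout undo deployment/$deployment"
    echo "kubectl -n $ns rollout status deployment/$deployment --timeout=120s"
  fi
}
```

Running rollback_app my-api prints the two commands for review; rollback_app my-api execute runs them.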
Commands / Evidence
kubectl -n develop get pods
kubectl -n develop get events --sort-by=.lastTimestamp | tail -n 30
Add your migration tool commands and SQL evidence to the incident timeline.
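To keep the evidence attachable to the timeline, the two commands above can be captured into a timestamped file. This is a sketch under assumptions: capture_evidence and EVIDENCE_DIR are hypothetical names, and the file layout is only a suggestion.

```shell
# Sketch: snapshot pod state and recent events into one timestamped file.
# EVIDENCE_DIR is a hypothetical location; point it at your incident folder.
capture_evidence() {
  local dir="${EVIDENCE_DIR:-./evidence}"
  local out
  mkdir -p "$dir" || return 1
  out="$dir/k8s-$(date -u +%Y%m%dT%H%M%SZ).txt"
  {
    echo "## pods"
    kubectl -n develop get pods
    echo "## events (last 30)"
    kubectl -n develop get events --sort-by=.lastTimestamp | tail -n 30
  } > "$out" 2>&1
  # Print the path so it can be pasted into the incident timeline.
  echo "$out"
}
```

The UTC timestamp in the file name makes it easier to order snapshots taken before and after each mitigation step.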
Break-Glass Rules
Allowed only with:
- incident owner approval
- explicit risk acceptance
- documented rollback/recovery path
- post-incident follow-up task
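The four conditions can be checked mechanically before any break-glass action runs. The sketch below is one possible gate; the BG_* variable names are hypothetical and would normally be filled from the incident ticket.

```shell
# Sketch: refuse to proceed unless all four break-glass conditions are recorded.
# Variable names are hypothetical; map them to your ticketing fields.
break_glass_check() {
  local missing=""
  [ -n "${BG_OWNER_APPROVAL:-}" ]  || missing="$missing incident-owner-approval"
  [ -n "${BG_RISK_ACCEPTANCE:-}" ] || missing="$missing risk-acceptance"
  [ -n "${BG_RECOVERY_PATH:-}" ]   || missing="$missing rollback/recovery-path"
  [ -n "${BG_FOLLOWUP_TASK:-}" ]   || missing="$missing follow-up-task"
  if [ -n "$missing" ]; then
    echo "break-glass denied, missing:$missing" >&2
    return 1
  fi
  echo "break-glass allowed"
}
```

Chaining it in front of the risky command (break_glass_check && ...) means a missing field stops the action and names what is absent.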
Failure Modes
- Mixed-version incompatibility:
  - symptom: old pods fail against new schema
  - action: disable flag + rollback app, preserve expand schema
- Long-running lock/contention migration:
  - symptom: API latency spikes/timeouts
  - action: stop rollout, reduce scope, schedule maintenance window
- Data integrity regression:
  - symptom: missing/corrupted values after migration
  - action: incident protocol + restore/repair workflow