Incident Hook
A team rebuilds “the same” code for production during incident pressure. The binary differs from staging due to dependency drift and build-time variance. Rollback is confusing because the promoted artifact is not the one that was tested.
Result: Time is lost proving artifact lineage instead of restoring service.
Observed Symptoms
What the team sees first:
- Production is running a digest different from the one validated in staging.
- The Git history sounds correct, but the artifact identity does not match.
- Rollback discussion turns into a trust discussion.
The incident is not only about the symptom; it is about losing artifact certainty at the worst possible moment.
Confusion Phase
The team now has multiple candidates for “the right image”:
- The last known-good production image.
- The staging image that was supposed to be promoted.
- The rebuilt production image that actually deployed.
That ambiguity is what immutable promotion is supposed to prevent.
Deployment Model
Our platform defines a strict three-tier deployment model:
- Develop: Deploys
develop-*images automatically on push to thedevelopbranch. - Staging: Deploys
staging-*images automatically on push to themainbranch. - Production: Deploys
production-*images created by explicit promotion (retagging) from Staging.
What AI Would Propose (Brave Junior):
- “Just rebuild from main and deploy to production now.”
- “Use mutable
latesttag for speed.”
Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?