The Incident: The Rebuild Trap

Incident Hook

A team rebuilds “the same” code for production during incident pressure. The binary differs from staging due to dependency drift and build-time variance. Rollback is confusing because the promoted artifact is not the one that was tested.

Result: Time is lost proving artifact lineage instead of restoring service.

Observed Symptoms

What the team sees first:

Production is running a digest different from the one validated in staging.
The Git history sounds correct, but the artifact identity does not match.
Rollback discussion turns into a trust discussion.

The incident is not only about the symptom; it is about losing artifact certainty at the worst possible moment.

Confusion Phase

The team now has multiple candidates for “the right image”:

The last known-good production image.
The staging image that was supposed to be promoted.
The rebuilt production image that actually deployed.

That ambiguity is what immutable promotion is supposed to prevent.

Deployment Model

Our platform defines a strict three-tier deployment model:

Develop: Deploys develop-* images automatically on push to the develop branch.
Staging: Deploys staging-* images automatically on push to the main branch.
Production: Deploys production-* images created by explicit promotion (retagging) from Staging.

What AI Would Propose (Brave Junior):

“Just rebuild from main and deploy to production now.”
“Use mutable latest tag for speed.”

Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?

Estimated Time

Prerequisites

Source Code References

What You Will Produce

Incident Hook

Observed Symptoms

Confusion Phase

Deployment Model