Incident Hook
A deployment reaches 100% of production traffic instantly. A hidden bug in the new version causes intermittent crashes, but it is only detected after 30 minutes of user reports. By then, the entire user base is affected, and the team is forced into a high-pressure, manual rollback.
Result: The failure blast radius is 100% of your users because the deployment was an “all-or-nothing” event.
Observed Symptoms
What the team sees first:
- Error rates spike across all users immediately after the release.
- Latency increases globally, affecting every request path.
- The rollback takes several minutes to complete, during which the system is fully degraded.
The incident is caused by uncontrolled traffic shifting.
Progressive Delivery Model
To reduce the blast radius of new releases, we move from “Big-Bang” to “Progressive” delivery:
- Canary Deployment: We route a small percentage of traffic (e.g., 5%) to the new version first.
- Automated Analysis: We use real-time metrics (latency, error rate) to analyze the health of the canary.
- Step-by-Step Shift: If the metrics are healthy, we increase the traffic percentage incrementally (5% -> 10% -> 50% -> 100%).
- Automated Rollback: If the metrics degrade at any step, the system automatically reverts traffic to the old version.
What AI Would Propose (Brave Junior):
- “Deploy 100% to production immediately to see if it works.”
- “Skip canary for small bug fixes.”
- “Monitor the dashboard manually and rollback if things look ‘weird’.”
Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?