Advanced Track: complete this chapter after finishing Chapters 01-14.

Estimated Time

  • Reading: 30-40 min
  • Lab: 60-90 min
  • Quiz: 15-20 min

Prerequisites

  • Core track (Chapters 01-14) completed.
  • GitOps promotion and observability workflows set up and working.

Source Code References

  • canary.example.yaml

What You Will Produce

A go/no-go evidence package: rollout results, remediation notes, and explicit rollback conditions.
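"Explicit rollback conditions" means rules a machine can check, not judgment calls. The minimal Python sketch below shows one way to record them next to the rollout results and derive the go/no-go decision. The class names, metric names, and threshold values (1% error rate, 500 ms p99 latency) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RollbackCondition:
    """One explicit rollback rule (hypothetical structure)."""
    metric: str         # e.g. "error_rate" or "p99_latency_ms"
    max_allowed: float  # roll back if the observed value exceeds this


@dataclass
class EvidencePackage:
    """Go/no-go record: observed rollout results plus the rules they are judged against."""
    rollout_results: dict[str, float]
    rollback_conditions: list[RollbackCondition]
    remediation_notes: list[str] = field(default_factory=list)

    def violations(self) -> list[str]:
        """Metrics that breach their condition; a missing metric also counts as a breach."""
        breached = []
        for cond in self.rollback_conditions:
            observed = self.rollout_results.get(cond.metric)
            if observed is None or observed > cond.max_allowed:
                breached.append(cond.metric)
        return breached

    def decision(self) -> str:
        return "go" if not self.violations() else "no-go"


# Usage with illustrative numbers: error rate is fine, latency is not.
package = EvidencePackage(
    rollout_results={"error_rate": 0.004, "p99_latency_ms": 620.0},
    rollback_conditions=[
        RollbackCondition("error_rate", 0.01),
        RollbackCondition("p99_latency_ms", 500.0),
    ],
)
print(package.decision(), package.violations())  # no-go ['p99_latency_ms']
```

Treating a missing metric as a breach is a deliberate choice in this sketch: if the evidence was never collected, the package should not produce a "go".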

Incident Hook

A deployment reaches 100% of production traffic instantly. A hidden bug in the new version causes intermittent crashes, but it is detected only after 30 minutes of user reports. By then, the entire user base is affected, and the team is forced into a high-pressure, manual rollback.

Result: The failure blast radius is 100% of your users because the deployment was an “all-or-nothing” event.

Observed Symptoms

What the team sees first:

  • Error rates spike across all users immediately after the release.
  • Latency increases globally, affecting every request path.
  • The rollback takes several minutes to complete, during which the system is fully degraded.

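The incident is caused by uncontrolled traffic shifting: all production traffic moved to the new version at once, with no checkpoint where the failure could have been caught early.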

Progressive Delivery Model

To reduce the blast radius of new releases, we move from “Big-Bang” to “Progressive” delivery:

  1. Canary Deployment: We route a small percentage of traffic (e.g., 5%) to the new version first.
  2. Automated Analysis: We use real-time metrics (latency, error rate) to analyze the health of the canary.
  3. Step-by-Step Shift: If the metrics are healthy, we increase the traffic percentage incrementally (5% -> 10% -> 50% -> 100%).
  4. Automated Rollback: If the metrics degrade at any step, the system automatically reverts traffic to the old version.
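The four steps above can be sketched as a small control loop. This is a minimal Python illustration, not the API of any rollout controller: `run_canary`, `set_canary_weight`, `read_canary_metrics`, and the thresholds are hypothetical names and values. In practice this logic usually lives in a progressive delivery controller rather than hand-written code; the sketch only makes the control flow explicit.

```python
from typing import Callable

# Traffic schedule from the steps above (percent of requests sent to the canary).
CANARY_STEPS = [5, 10, 50, 100]

# Illustrative thresholds; real values should come from your SLOs.
MAX_ERROR_RATE = 0.01          # 1% of requests failing
MAX_P99_LATENCY_MS = 500.0


def is_healthy(metrics: dict[str, float]) -> bool:
    """Automated analysis: the canary passes only if every metric is within its threshold."""
    return (
        metrics["error_rate"] <= MAX_ERROR_RATE
        and metrics["p99_latency_ms"] <= MAX_P99_LATENCY_MS
    )


def run_canary(
    set_canary_weight: Callable[[int], None],
    read_canary_metrics: Callable[[int], dict[str, float]],
) -> str:
    """Shift traffic step by step and roll back automatically on the first failed analysis."""
    for weight in CANARY_STEPS:
        set_canary_weight(weight)
        # A real controller would wait here (e.g. several minutes) so the
        # metrics reflect traffic at this weight before judging it.
        if not is_healthy(read_canary_metrics(weight)):
            set_canary_weight(0)  # revert: all traffic back to the old version
            return "rolled_back"
    return "promoted"


# Usage with simulated dependencies: the intermittent crash shows up at 10%.
applied_weights: list[int] = []


def simulated_metrics(weight: int) -> dict[str, float]:
    error_rate = 0.03 if weight >= 10 else 0.002
    return {"error_rate": error_rate, "p99_latency_ms": 180.0}


status = run_canary(applied_weights.append, simulated_metrics)
print(status, applied_weights)  # rolled_back [5, 10, 0]
```

Rolling back on the first failed check, instead of waiting for later steps, is what caps the blast radius: in the simulated run, no more than 10% of traffic ever reached the faulty version.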

What AI Would Propose (Brave Junior):

  • “Deploy 100% to production immediately to see if it works.”
  • “Skip canary for small bug fixes.”
  • “Monitor the dashboard manually and roll back if things look ‘weird’.”

Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?