Core Exercises (Required)
- Incident Simulation: Use the Chaos Monkey (Chapter 12) to trigger a Sev1 failure. Assign roles within your team (or simulate them) and follow the Safe Workflow.
- Timeline Generation: Using your simulation, create a 10-point timeline that includes at least three metric signals, two logs, and four operator actions.
- Draft a Postmortem: Use the template in sre/docs/postmortem-template.md to document your simulation. Focus on the “5 Whys” analysis.
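The composition rule in the Timeline Generation exercise can be checked mechanically. The following is a minimal sketch, not part of the course tooling: the `TimelineEvent` type, the `kind` labels, and `check_timeline` are hypothetical names introduced for illustration, and it assumes "10-point" means at least ten entries.

```python
from collections import Counter
from dataclasses import dataclass

# Minimum counts taken from the Timeline Generation exercise.
REQUIRED_COUNTS = {"metric": 3, "log": 2, "action": 4}
# Assumption: "10-point" is treated as a minimum, not an exact count.
MIN_EVENTS = 10


@dataclass
class TimelineEvent:
    timestamp: str  # e.g. an ISO 8601 string; the format is an assumption
    kind: str       # "metric", "log", or "action" (hypothetical labels)
    summary: str


def check_timeline(events):
    """Return a list of gaps; an empty list means the minimums are met."""
    problems = []
    if len(events) < MIN_EVENTS:
        problems.append(f"need at least {MIN_EVENTS} events, found {len(events)}")
    counts = Counter(event.kind for event in events)
    for kind, minimum in REQUIRED_COUNTS.items():
        if counts[kind] < minimum:
            problems.append(
                f"need at least {minimum} {kind} entries, found {counts[kind]}"
            )
    return problems
```

Running `check_timeline` on a draft timeline before writing the postmortem shows which kind of evidence is still missing, for example operator actions that were taken but never recorded.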
Postmortem Quality Bar
A postmortem is successful only if it includes:
- Evidence-backed Timeline: All major events are tied to a metric, log, or trace.
- Causal Analysis: Identifies why the system allowed the failure, not just who triggered it.
- Hardening Actions: Specific tasks with an owner, a due date, and a validation method.
- Blameless Tone: Focuses on technical and process improvements.
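The Hardening Actions criterion lends itself to a simple completeness check. The sketch below is illustrative only: `HardeningAction` and `missing_fields` are hypothetical names, not part of the postmortem template, and the field layout is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class HardeningAction:
    task: str
    owner: Optional[str] = None
    due: Optional[date] = None
    validation: Optional[str] = None  # how the fix will be verified


def missing_fields(action):
    """Return the names of required fields that are unset or empty."""
    required = {
        "owner": action.owner,
        "due": action.due,
        "validation": action.validation,
    }
    return [name for name, value in required.items() if not value]
```

An action such as `HardeningAction("Add alert on queue depth")` would report all three fields as missing, which signals that it is not yet specific enough to meet the quality bar.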
Challenge Exercise (Optional)
Full Tabletop Incident Simulation: Assign all four roles and work the incident through severity declaration, investigation, containment, and resolution. Then produce a complete blameless postmortem document.
Done When
You have completed this chapter and the Core Track when:
- You can run a full incident lifecycle with assigned roles and severity levels.
- You have produced a complete, high-quality blameless postmortem.
- You understand the difference between technical mitigation and organizational coordination.
- You can define and verify technical hardening actions from incident evidence.
Knowledge Check
Before finishing this chapter, complete the Quiz to verify your understanding of the guardrail principles.