Investigation
Treat drills as controlled experiments, not as a spectacle.
Safe investigation sequence:
- Define the Drill: Choose one failure type (e.g., pod termination) and one target service.
- Confirm Controls: Verify the kill switch, namespace scope, and time window before starting.
- Capture Telemetry: Record metrics, traces, and logs for the target service throughout the injection.
- Compare Response: Compare the actual response path with your documented runbook.
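The first two steps of the sequence above can be sketched as a small pre-flight check. This is a minimal illustration, not a feature of any particular chaos tool: `DrillPlan`, `preflight_problems`, the field names, and the 30-minute limit are all hypothetical and assumed here only to make the idea concrete.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DrillPlan:
    """Hypothetical drill definition: one failure type, one target service."""
    failure_type: str            # e.g. "pod-termination"
    target_service: str          # the single service under test
    namespace: str               # namespace the injection is scoped to
    kill_switch_verified: bool   # the abort path has been tested beforehand
    start: datetime              # when the injection is scheduled to begin
    duration: timedelta          # how long the injection may run
    max_duration: timedelta = timedelta(minutes=30)  # assumed policy limit


def preflight_problems(plan: DrillPlan, allowed_namespaces: set[str]) -> list[str]:
    """Return the reasons the drill must not start; an empty list means go."""
    problems = []
    if not plan.kill_switch_verified:
        problems.append("kill switch has not been verified")
    if plan.namespace not in allowed_namespaces:
        problems.append(f"namespace {plan.namespace!r} is outside the approved scope")
    if plan.duration <= timedelta(0) or plan.duration > plan.max_duration:
        problems.append("time window is missing or exceeds the allowed maximum")
    return problems


# Usage: a drill scoped to an approved namespace with a verified kill switch.
plan = DrillPlan(
    failure_type="pod-termination",
    target_service="checkout",
    namespace="staging",
    kill_switch_verified=True,
    start=datetime.now(timezone.utc),
    duration=timedelta(minutes=10),
)
print(preflight_problems(plan, {"staging"}))  # prints []
```

Returning a list of problems rather than a single boolean keeps the check useful as a record: whatever blocked the drill can be copied into the drill notes and compared against the runbook later.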
Containment
Containment is part of the drill, not something that happens after it.
Containment steps:
- Stop the Monkey: Use the kill switch if the blast radius or impact becomes unclear.
- Execute Mitigation: Follow the practiced mitigation steps to restore service.
- Verify Recovery: Confirm service health using the evidence from Chapter 10.
- Harden the System: End every drill by identifying one technical action to reduce the impact of that failure in the future.
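The "Stop the Monkey" rule above can be expressed as an explicit decision function, so the abort condition is agreed on before the drill rather than debated during it. The sketch below is an assumption-laden illustration: `Observation`, `should_stop`, and the error-rate threshold are hypothetical names and values, and a real drill would feed them from its own telemetry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """One hypothetical telemetry sample taken during the injection."""
    affected_services: set[str]   # services currently showing elevated errors
    error_rate: Optional[float]   # target's error rate; None if telemetry is missing


def should_stop(obs: Observation, target_service: str, max_error_rate: float) -> bool:
    """Return True when the drill should be aborted via the kill switch."""
    # Missing telemetry means the impact is unknown, so stop.
    if obs.error_rate is None:
        return True
    # Any affected service other than the target means the blast radius grew.
    if obs.affected_services - {target_service}:
        return True
    # Impact on the target itself exceeds the agreed threshold.
    return obs.error_rate > max_error_rate


# Usage: errors have spread to a second service, so the drill should stop.
sample = Observation(affected_services={"checkout", "payments"}, error_rate=0.02)
print(should_stop(sample, target_service="checkout", max_error_rate=0.05))  # prints True
```

Treating missing telemetry as a stop condition reflects the wording of the step: the kill switch is for an unclear impact, not only for a confirmed bad one.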
The objective is “learn from failure,” not just “survive the noise.”
Pause and Predict: What automated guardrail would have prevented this incident entirely?