Core Exercises (Required)
- Deterministic Failure: Use `kubectl delete pod` to terminate a single backend pod. Observe the HPA and PDB in action and identify the latency spike in Grafana.
- Chaos Monkey Run: Manually trigger the `chaos-monkey` CronJob. Use `kubectl get pods -n develop -w` to see the pods being randomly terminated.
- Evidence Collection: Find a trace in Uptrace that covers a request that was interrupted by a pod termination. Identify the corresponding log evidence.
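The first two exercises come down to a handful of kubectl invocations. The Python sketch below only builds and prints them by default. The pod name and Job name are placeholders, and it assumes that the backend pods and the chaos-monkey CronJob both live in the develop namespace, which may not match your cluster.

```python
import shlex
import subprocess
from typing import List

NAMESPACE = "develop"      # assumption: drill targets and the CronJob live here
CRONJOB = "chaos-monkey"   # CronJob name used in this chapter


def delete_pod_cmd(pod_name: str) -> List[str]:
    """Deterministic failure: terminate one named backend pod."""
    return ["kubectl", "delete", "pod", pod_name, "-n", NAMESPACE]


def trigger_chaos_cmd(job_name: str) -> List[str]:
    """Chaos Monkey run: create a one-off Job from the CronJob template."""
    return ["kubectl", "create", "job", job_name,
            f"--from=cronjob/{CRONJOB}", "-n", NAMESPACE]


def watch_pods_cmd() -> List[str]:
    """Watch pods being terminated and replaced (blocks until interrupted)."""
    return ["kubectl", "get", "pods", "-n", NAMESPACE, "-w"]


def run(cmd: List[str], dry_run: bool = True) -> str:
    """Print the command line; execute it only when dry_run is False."""
    line = shlex.join(cmd)
    print(line)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line


if __name__ == "__main__":
    # Placeholder names: replace them with a real pod and a unique Job name.
    run(delete_pod_cmd("backend-REPLACE-ME"))
    run(trigger_chaos_cmd("chaos-monkey-manual-1"))
    run(watch_pods_cmd())
```

Keeping `dry_run=True` as the default lets you review the exact command before anything is deleted. Pass `dry_run=False` once the target is confirmed.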
Game Day Scorecard
Every drill must end with a scorecard:
- What was the target? (e.g., Backend Pods)
- Was the failure detected? (Alert/Metric signal)
- Did the system automatically recover? (HPA/Deployment controller)
- What is our hardening action? (e.g., Add a PDB, tune HPA, improve graceful shutdown)
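The four questions above map naturally onto a small record. The following is a minimal sketch of one way to capture a scorecard in Python. The class and field names are illustrative assumptions, not a format this chapter prescribes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GameDayScorecard:
    """One scorecard per drill; field names are hypothetical."""
    target: str                      # What was the target?
    detected_by: Optional[str]       # Alert/metric signal, or None if undetected
    auto_recovered: bool             # Did the system recover without a human?
    hardening_actions: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A scorecard counts only if it names a target and at least one action."""
        return bool(self.target.strip()) and len(self.hardening_actions) >= 1

    def summary(self) -> str:
        """Render the four answers as plain text for the drill notes."""
        lines = [
            f"Target: {self.target}",
            f"Detected: {self.detected_by or 'NO'}",
            f"Auto-recovered: {'yes' if self.auto_recovered else 'no'}",
            "Hardening actions: " + (", ".join(self.hardening_actions) or "NONE"),
        ]
        return "\n".join(lines)


if __name__ == "__main__":
    card = GameDayScorecard(
        target="Backend Pods",
        detected_by="Grafana latency spike",
        auto_recovered=True,
        hardening_actions=["Improve graceful shutdown"],
    )
    print(card.summary())
```

Treating a missing hardening action as "incomplete" keeps a drill from being recorded as finished when nothing was learned from it.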
Handoff to Chapter 13 (AI Guardian)
The Chaos Monkey produces structured events in its CronJob output. In Chapter 13, our SRE Guardian will consume these events and categorize them as:
- Expected controlled disruption (Chaos).
- Unexpected collateral impact.
- High-priority incident requiring human escalation.
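To make the three categories concrete, here is a hedged Python sketch of how such a classification might look. The event fields (`source`, `namespace`, `recovered`) and the decision rules are assumptions made for illustration. The real event schema and the Guardian's logic are defined in Chapter 13.

```python
from enum import Enum


class Category(Enum):
    EXPECTED_CHAOS = "Expected controlled disruption (Chaos)"
    COLLATERAL_IMPACT = "Unexpected collateral impact"
    ESCALATE = "High-priority incident requiring human escalation"


def categorize(event: dict) -> Category:
    """Classify one event dict; all keys used here are hypothetical."""
    # Anything that did not recover on its own goes to a human first.
    if not event.get("recovered", False):
        return Category.ESCALATE
    # Recovered, caused by the Chaos Monkey, and inside the allowed namespace.
    in_scope = (
        event.get("source") == "chaos-monkey"
        and event.get("namespace") == "develop"
    )
    if in_scope:
        return Category.EXPECTED_CHAOS
    # Recovered, but outside the drill's declared scope.
    return Category.COLLATERAL_IMPACT


if __name__ == "__main__":
    sample = {"source": "chaos-monkey", "namespace": "develop", "recovered": True}
    print(categorize(sample).value)
```

Checking recovery before scope is a deliberate choice in this sketch: an unrecovered failure is escalated even when the Chaos Monkey caused it.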
Challenge Exercise (Optional)
Custom Chaos Drill Runbook: Design a complete chaos drill runbook for a failure mode not covered in this chapter. Include scope definition, kill switch mechanism, evidence capture plan, and success/failure criteria.
Done When
You have completed this chapter when:
- You have successfully run at least two controlled failure drills with evidence.
- You have successfully enabled and disabled the Chaos Monkey.
- You have captured one “Game Day” scorecard with at least one hardening action.
- You understand why chaos is restricted to the `develop` namespace.
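Enabling and disabling the Chaos Monkey, and keeping it inside one namespace, can be illustrated together. The sketch below assumes the CronJob is paused through the standard `spec.suspend` field. This chapter's own kill switch may work differently, and the helper name is hypothetical.

```python
import json
from typing import List

ALLOWED_NAMESPACES = {"develop"}  # chaos is restricted to this namespace


def suspend_patch_cmd(suspended: bool,
                      namespace: str = "develop",
                      cronjob: str = "chaos-monkey") -> List[str]:
    """Build a kubectl patch that pauses (True) or resumes (False) the CronJob."""
    # Guardrail: refuse to build a command for any other namespace.
    if namespace not in ALLOWED_NAMESPACES:
        raise ValueError(f"chaos is not allowed in namespace {namespace!r}")
    patch = json.dumps({"spec": {"suspend": suspended}})
    return ["kubectl", "patch", "cronjob", cronjob, "-n", namespace, "-p", patch]


if __name__ == "__main__":
    print(suspend_patch_cmd(True))   # disable scheduled runs
    print(suspend_patch_cmd(False))  # enable them again
```

Suspending a CronJob stops new scheduled runs but does not cancel a Job that is already running, so check for active Jobs before treating the drill as stopped.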
Knowledge Check
Before finishing this chapter, complete the Quiz to verify your understanding of the guardrail principles.