Chapter 11: Controlled Chaos
Why This Chapter Exists
Production resilience is not proven in calm conditions. This chapter validates behavior under controlled failures with explicit blast-radius limits.
Scope
Failure classes in this chapter:
- crash loop (/panic)
- elevated 5xx (/status/500)
- random pod termination (Chaos Monkey)
Implementation focus:
- deterministic drills first
- Chaos Monkey in develop with kill switch and strict target allowlist
Chaos Monkey (MVP)
Flux path:
Safety controls:
- namespace scope: develop only (RBAC Role in develop)
- target scope: app=frontend or app=backend
- schedule: every 15 minutes
- window: UTC 10-16
- kill switch: spec.suspend: true on CronJob (default)
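The safety controls above can be read as a single gate that every termination must pass. The following is a minimal Python sketch of that gate, not the chapter's actual Chaos Monkey code. The names `Target` and `may_terminate` are hypothetical, and the window is assumed to be half-open (10:00 inclusive to 16:00 exclusive, UTC), which the source does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Values taken from the safety controls listed above.
ALLOWED_NAMESPACE = "develop"
ALLOWED_APP_LABELS = frozenset({"frontend", "backend"})
WINDOW_START_HOUR_UTC = 10
WINDOW_END_HOUR_UTC = 16  # assumption: end hour is exclusive


@dataclass(frozen=True)
class Target:
    """Hypothetical description of a candidate pod."""
    namespace: str
    app_label: str  # value of the pod's `app` label


def may_terminate(target: Target, now: datetime, suspended: bool) -> bool:
    """Return True only if every safety control permits a termination."""
    if now.tzinfo is None:
        # A naive datetime would be interpreted as local time; refuse it.
        raise ValueError("now must be timezone-aware")
    if suspended:  # kill switch: spec.suspend is true
        return False
    if target.namespace != ALLOWED_NAMESPACE:
        return False
    if target.app_label not in ALLOWED_APP_LABELS:
        return False
    hour = now.astimezone(timezone.utc).hour
    return WINDOW_START_HOUR_UTC <= hour < WINDOW_END_HOUR_UTC
```

Every check fails closed: a target is eligible only when all four controls agree, so a misconfigured label or an out-of-window run results in no action rather than an unintended kill.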
Guardrails
- Never run uncontrolled chaos in staging/production.
- One failure injection per run.
- Evidence-first triage: metrics -> traces -> logs.
- Every drill must end with recovery verification and a hardening action.
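Three of these guardrails can be checked mechanically against a drill record after a run. The sketch below is illustrative only: `DrillRecord` and `guardrail_violations` are hypothetical names, and the record fields are assumptions rather than the format used by the lab's scorecard. The first guardrail (no uncontrolled chaos in staging/production) is an environment policy and is not covered here.

```python
from dataclasses import dataclass

# Evidence-first triage order from the guardrails above.
TRIAGE_ORDER = ["metrics", "traces", "logs"]


@dataclass
class DrillRecord:
    """Hypothetical summary of one drill run."""
    injections: list[str]       # failures injected, e.g. ["/panic"]
    evidence_order: list[str]   # order in which evidence was consulted
    recovery_verified: bool
    hardening_action: str = ""


def guardrail_violations(record: DrillRecord) -> list[str]:
    """Return a human-readable list of guardrail violations (empty if none)."""
    violations = []
    if len(record.injections) != 1:
        violations.append("exactly one failure injection per run is required")
    if record.evidence_order != TRIAGE_ORDER:
        violations.append("triage must follow metrics -> traces -> logs")
    if not record.recovery_verified:
        violations.append("recovery was not verified")
    if not record.hardening_action.strip():
        violations.append("no hardening action recorded")
    return violations
```

A record with one injection, evidence consulted in the stated order, verified recovery, and a non-empty hardening action yields an empty list.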
Lab Files
- lab.md
- runbook-game-day.md
- scorecard.md
- quiz.md
Handoff to Chapter 12 (AI Guardian)
Chaos Monkey emits structured log events in CronJob output. In Chapter 12, Guardian watchers consume these events and classify:
- expected controlled disruption
- unexpected collateral impact
- escalation-required incident
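To make the handoff concrete, here is a rough Python sketch of how a watcher might map one structured log line to the three classes above. The source does not define the event schema or the classification rules, so the JSON fields (`recovered`, `collateral_pods`) and the decision order are assumptions for illustration; a real Guardian watcher would likely correlate these events with other signals.

```python
import json

EXPECTED = "expected controlled disruption"
COLLATERAL = "unexpected collateral impact"
ESCALATE = "escalation-required incident"


def classify(line: str) -> str:
    """Classify one JSON log line emitted by a chaos run.

    Assumed (hypothetical) fields:
      recovered        bool, whether the target returned to a healthy state
      collateral_pods  list of pods affected outside the intended target
    """
    event = json.loads(line)
    if not event.get("recovered", False):
        # Missing or false recovery is treated as the most severe case.
        return ESCALATE
    if event.get("collateral_pods"):
        return COLLATERAL
    return EXPECTED
```

The function defaults to escalation when the recovery field is absent, so an incomplete event is never silently treated as expected.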
Done When
- learner runs at least two controlled failure drills with evidence
- learner enables/disables Chaos Monkey safely
- learner captures one game-day scorecard with action items