Runbook: Controlled Chaos Game Day
Purpose
Run controlled failure injection with strict safety boundaries and evidence-based response.
Roles
- Incident Commander: owns decision flow
- Driver: executes injection commands
- Observer: records timeline and evidence
Preflight (Required)
- Confirm environment is
develop. - Confirm rollback path is known.
- Confirm monitoring/tracing access is available.
- Confirm Chaos Monkey is
suspend: truebefore and after run.
Timeline Template
- T0: baseline metrics snapshot
- T+2m: inject failure
- T+5m: detect symptom
- T+10m: isolate via traces/logs
- T+15m: mitigate/recover
- T+25m: verify stability
- T+35m: write scorecard + actions
Injection Options
- Deterministic:
GET /status/500GET /panic
- Monkey:
- one pod deletion via
chaos-monkeyjob
Decision Classes
- Class A: low impact, recover automatically
- Class B: moderate impact, manual mitigation needed
- Class C: customer-impact pattern, trigger rollback/incident protocol
Exit Criteria
- service recovered to baseline
- no active critical alerts for drill scenario
- evidence package complete (metrics + traces + logs)
- at least one hardening issue created
Post-Run Deliverables
- filled
scorecard.md - one short blameless summary
- one backlog item with owner and due date