Guardrails That Stop It
- Develop Scope: Never run uncontrolled chaos in `staging` or `production`.
- Single Injection: Only one failure mode is allowed per run to maintain causal clarity.
- Evidence-First Triage: All drills must follow the `metrics -> traces -> logs` investigation path.
- Hardening Action: Every drill must result in at least one system hardening task.
STOP: Kill Switch First
Before starting any chaos exercise, you must confirm your ability to stop it.
- Kill Switch: `spec.suspend: true` on the CronJob (the default state).
- Time Window: Chaos is only allowed during UTC `10-16` on business days.
- RBAC Limit: The chaos job only has `delete` permissions on Pods in the `develop` namespace.
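The three controls above can be checked before every drill with a short preflight script. The following is a minimal sketch, not the chapter's own tooling. It assumes the CronJob is named `chaos-monkey`, that it runs under a service account also called `chaos-monkey` (a hypothetical name), that `staging` and `production` are namespace names, and that "UTC 10-16" means 10:00 up to but not including 16:00.

```shell
#!/usr/bin/env bash
# Preflight sketch. NAMESPACE, CRONJOB and SERVICE_ACCOUNT are assumptions;
# override them through the environment to match your cluster.
NAMESPACE="${NAMESPACE:-develop}"
CRONJOB="${CRONJOB:-chaos-monkey}"
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-chaos-monkey}"  # hypothetical name

# Succeeds only if the CronJob is currently suspended (kill switch engaged).
kill_switch_engaged() {
  local suspended
  suspended="$(kubectl get cronjob "$CRONJOB" -n "$NAMESPACE" \
    -o jsonpath='{.spec.suspend}')"
  [ "$suspended" = "true" ]
}

# Succeeds only on Monday-Friday between 10:00 and 15:59 UTC.
# $1 = hour (00-23), $2 = ISO weekday (1 = Monday .. 7 = Sunday).
in_chaos_window() {
  # 10# forces base 10 so that "08" and "09" are not parsed as octal.
  local hour=$((10#$1)) weekday=$((10#$2))
  [ "$weekday" -ge 1 ] && [ "$weekday" -le 5 ] &&
    [ "$hour" -ge 10 ] && [ "$hour" -lt 16 ]
}

# Succeeds only if the service account may delete Pods in $NAMESPACE
# and may NOT delete Pods in staging or production.
# Note: --as requires that the caller is allowed to impersonate.
rbac_is_scoped() {
  local sa="system:serviceaccount:${NAMESPACE}:${SERVICE_ACCOUNT}" ns
  kubectl auth can-i delete pods -n "$NAMESPACE" --as "$sa" --quiet || return 1
  for ns in staging production; do
    if kubectl auth can-i delete pods -n "$ns" --as "$sa" --quiet; then
      return 1
    fi
  done
  return 0
}

# Runs all three checks against the live cluster and the current UTC time.
preflight() {
  kill_switch_engaged || { echo "kill switch is not engaged" >&2; return 1; }
  in_chaos_window "$(date -u +%H)" "$(date -u +%u)" ||
    { echo "outside the allowed chaos window" >&2; return 1; }
  rbac_is_scoped || { echo "RBAC scope is wider than expected" >&2; return 1; }
  echo "preflight ok"
}
```

Sourcing the file and calling `preflight` prints `preflight ok` only when all three controls hold; any failure names the control that blocked the drill.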
Safe Workflow (Step-by-Step)
- Confirm Controls: Verify the kill switch and target namespace before starting.
- Deterministic Drill: Start with a single, manual pod deletion to test alerts.
- Trigger Chaos Monkey: Enable the `chaos-monkey` CronJob for a bounded time window.
- Triage & Mitigate: Use Chapter 10's evidence path to identify and resolve the failure.
- Verify Recovery: Confirm that the service has returned to its healthy baseline.
- Re-suspend: Ensure the chaos job is returned to `spec.suspend: true`.
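The enable, wait, and re-suspend steps of this workflow can be wrapped so that the job is always returned to its suspended state. The sketch below is one possible shape, under the assumptions that the CronJob is named `chaos-monkey` in the `develop` namespace and that the window length is passed in seconds; the function names are illustrative and not part of any existing tool.

```shell
#!/usr/bin/env bash
# Bounded chaos window sketch. NAMESPACE and CRONJOB are assumptions.
NAMESPACE="${NAMESPACE:-develop}"
CRONJOB="${CRONJOB:-chaos-monkey}"

# Flip the kill switch: "true" suspends the CronJob, "false" enables it.
set_suspend() {
  case "$1" in
    true|false) ;;
    *) echo "set_suspend expects true or false" >&2; return 1 ;;
  esac
  kubectl patch cronjob "$CRONJOB" -n "$NAMESPACE" --type merge \
    -p "{\"spec\":{\"suspend\":$1}}"
}

# Deterministic drill: delete one named Pod by hand before enabling the job.
manual_drill() {
  kubectl delete pod "$1" -n "$NAMESPACE"
}

# Enable the CronJob for $1 seconds, re-suspend it, then verify the switch.
run_bounded_chaos() {
  local seconds="$1"
  # Develop Scope guardrail: refuse any other namespace.
  [ "$NAMESPACE" = "develop" ] ||
    { echo "refusing to run in namespace $NAMESPACE" >&2; return 1; }
  # Best effort: also re-suspend if the script is interrupted.
  trap 'set_suspend true' INT TERM
  set_suspend false || { trap - INT TERM; return 1; }
  sleep "$seconds"
  set_suspend true
  trap - INT TERM
  # Confirm the kill switch is back in its default state.
  [ "$(kubectl get cronjob "$CRONJOB" -n "$NAMESPACE" \
    -o jsonpath='{.spec.suspend}')" = "true" ]
}
```

For example, `run_bounded_chaos 900` would open a 15-minute window. Recovery can then be confirmed with a command such as `kubectl rollout status deployment/<name> -n develop`, where the Deployment name depends on the service under test.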
This builds on: Backup and restore (Chapter 11). Chaos validates that recovery actually works.
This enables: AI-assisted SRE (Chapter 13). guardian uses chaos evidence for incident routing.