Incident Hook
An untested failure mode appears for the first time during a high-stakes on-call shift. Responders improvise under stress, the mitigation order is unclear, and evidence collection is fragmented. Recovery takes far longer than expected because the behavior of the system was never rehearsed under failure.
Result: Uncertainty expands the blast radius and recovery time because failure response was not a practiced discipline.
Observed Symptoms
What the team sees first:
- A failure mode appears with no documented or practiced response path.
- Responders collect evidence late, missing the initial causal signals.
- Communication is chaotic, as multiple engineers try different fixes in parallel.
The missing ingredient is rehearsal under controlled scope.
Chaos Monkey (MVP)
To prevent this, we introduce the Chaos Monkey, a controlled tool that deliberately injects failure into the system so the response can be rehearsed.
Our baseline deployment:
- Scope: Restricted by RBAC to the `develop` namespace only.
- Targets: `app=frontend` or `app=backend` pods.
- Behavior: Randomly terminates selected pods at a fixed interval.
- Kill Switch: `spec.suspend: true` on the CronJob (the default state).
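The baseline above could be expressed as a CronJob roughly like the following. This is a minimal sketch, not the actual `cronjob.yaml` from the repository: the resource name `chaos-monkey`, the ten-minute schedule, the container image, and the shell one-liner are all assumptions made for illustration. Only the `develop` namespace, the `app=frontend` / `app=backend` targets, and `spec.suspend: true` come from the text.

```yaml
# Hypothetical sketch of a pod-killing CronJob; names, schedule, and image are assumed.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chaos-monkey                # assumed name
  namespace: develop                # scope from the baseline
spec:
  suspend: true                     # kill switch: no Jobs are created while this is true
  schedule: "*/10 * * * *"          # assumed fixed interval (every 10 minutes)
  concurrencyPolicy: Forbid         # never run two chaos Jobs at the same time
  jobTemplate:
    spec:
      backoffLimit: 0               # do not retry a failed chaos run
      template:
        spec:
          serviceAccountName: chaos-monkey        # assumed; must be bound to a Role in develop
          restartPolicy: Never
          containers:
            - name: kill-pod
              # Placeholder image: any image that provides kubectl, sh, and shuf.
              image: example.com/tools/kubectl:1.30
              command:
                - /bin/sh
                - -c
                - |
                  # List matching pods, pick one at random, and delete it.
                  target=$(kubectl get pods -n develop \
                    -l 'app in (frontend,backend)' -o name | shuf -n 1)
                  if [ -n "$target" ]; then
                    kubectl delete -n develop "$target"
                  fi
```

With `suspend: true` as the default, the schedule exists but never fires until someone deliberately flips the flag. In a GitOps setup that change would normally be made in the repository rather than by patching the live object, since a manual patch is likely to be reverted on the next reconciliation.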
Develop chaos pack
- `flux/infrastructure/chaos/develop/cronjob.yaml`
- `flux/infrastructure/chaos/develop/kustomization.yaml`
- `flux/infrastructure/chaos/develop/role.yaml`
- `flux/infrastructure/chaos/develop/rolebinding.yaml`
- `flux/infrastructure/chaos/develop/serviceaccount.yaml`
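The contents of these files are not shown here. As an illustration of how the RBAC scope restriction might look, the sketch below pairs a namespaced Role with a RoleBinding. The name `chaos-monkey` and the exact verb list are assumptions, not the repository's actual manifests.

```yaml
# Hypothetical sketch of the RBAC pieces; resource names and verbs are assumed.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-monkey                # assumed name
  namespace: develop                # a Role only grants access inside its own namespace
rules:
  - apiGroups: [""]                 # core API group, where pods live
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-monkey                # assumed name
  namespace: develop
subjects:
  - kind: ServiceAccount
    name: chaos-monkey              # assumed; the account the CronJob's pods run as
    namespace: develop
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-monkey
```

Because this is a Role rather than a ClusterRole, the service account cannot delete pods outside `develop`. RBAC rules cannot filter by label, so the restriction to `app=frontend` and `app=backend` pods has to be enforced by the selector the job itself uses.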
What AI Would Propose (Brave Junior):
- “Run chaos in staging/production to test realism quickly.”
- “Inject multiple failures at once for faster coverage.”
- “Leave chaos schedule always enabled.”
Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?