The Incident: Chaos Improv | SafeOps Academy

Incident Hook

An untested failure mode appears for the first time during a high-stakes on-call shift. Responders improvise under stress, the mitigation order is unclear, and evidence collection is fragmented. Recovery takes far longer than expected because the team never rehearsed how the system behaves under failure.

Result: Uncertainty widens the blast radius and lengthens recovery because failure response was never a practiced discipline.

Observed Symptoms

What the team sees first:

A failure mode appears with no documented or practiced response path.
Responders collect evidence late, missing the initial causal signals.
Communication is chaotic, as multiple engineers try different fixes in parallel.

The missing ingredient is rehearsal under controlled scope.

Chaos Monkey (MVP)

To prevent this, we introduce the Chaos Monkey, a controlled tool that deliberately injects failure into the system.

Our baseline deployment:

Scope: Restricted by RBAC to the develop namespace only.
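Targets: Pods labeled app=frontend or app=backend.
Behavior: Randomly terminates selected pods at a fixed interval.
Kill Switch: spec.suspend: true on the CronJob. The pack ships suspended (Kubernetes itself defaults spec.suspend to false), so chaos runs only after someone explicitly enables it.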
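
The baseline above could be expressed as a CronJob along these lines. This is a minimal sketch, not the contents of the course's cronjob.yaml: the name chaos-monkey, the five-minute schedule, the container image, and the shell pipeline are all assumptions made for illustration.

```yaml
# Hypothetical sketch; name, schedule, image, and command are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chaos-monkey            # assumed name
  namespace: develop
spec:
  suspend: true                 # kill switch: no new Jobs are created while true
  schedule: "*/5 * * * *"       # assumed fixed interval (every 5 minutes)
  concurrencyPolicy: Forbid     # never run two chaos Jobs at once
  jobTemplate:
    spec:
      backoffLimit: 0           # do not retry a failed chaos run
      template:
        spec:
          serviceAccountName: chaos-monkey   # assumed; bound to a namespaced Role
          restartPolicy: Never
          containers:
            - name: kill-one-pod
              # Hypothetical image; it must provide sh, kubectl, shuf, and xargs.
              image: registry.example.com/kubectl-tools:stable
              command:
                - /bin/sh
                - -c
                - |
                  # Pick one frontend or backend pod at random and delete it.
                  kubectl get pods -n develop -l 'app in (frontend,backend)' -o name \
                    | shuf -n 1 \
                    | xargs -r kubectl delete -n develop
```

In this sketch, setting spec.suspend to false enables the schedule and setting it back to true stops new runs. Suspending a CronJob does not terminate a Job that has already started, so a responder may still need to delete an in-flight Job by hand.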

Develop chaos pack

flux/infrastructure/chaos/develop/cronjob.yaml
flux/infrastructure/chaos/develop/kustomization.yaml
flux/infrastructure/chaos/develop/role.yaml
flux/infrastructure/chaos/develop/rolebinding.yaml
flux/infrastructure/chaos/develop/serviceaccount.yaml
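
The file contents are not reproduced in this lesson. As a rough illustration of how the RBAC scope could be enforced, the sketch below shows one plausible shape for the ServiceAccount, Role, RoleBinding, and Kustomization. The resource name chaos-monkey and the exact verb list are assumptions, and the documents are joined in one block only for brevity.

```yaml
# Hypothetical sketch; resource names and verbs are assumptions.
# serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-monkey
  namespace: develop
---
# role.yaml: a namespaced Role cannot grant access outside develop
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-monkey
  namespace: develop
rules:
  - apiGroups: [""]             # core API group, where pods live
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
---
# rolebinding.yaml: ties the Role to the ServiceAccount used by the CronJob
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-monkey
  namespace: develop
subjects:
  - kind: ServiceAccount
    name: chaos-monkey
    namespace: develop
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-monkey
---
# kustomization.yaml: bundles the pack so it can be applied as one unit
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: develop
resources:
  - serviceaccount.yaml
  - role.yaml
  - rolebinding.yaml
  - cronjob.yaml
```

Using a Role rather than a ClusterRole is the design choice that matters here: even if the pod selector or the script were misconfigured, the API server would reject deletions in any namespace other than develop.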

What AI Would Propose (Brave Junior):

“Run chaos in staging/production to test realism quickly.”
“Inject multiple failures at once for faster coverage.”
“Leave chaos schedule always enabled.”

Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?
