Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Artifacts

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Chapter 12: Controlled Chaos

Why This Chapter Exists

Production resilience is not proven in calm conditions. This chapter validates behavior under controlled failures with explicit blast-radius limits.

Incident Hook

An untested failure mode appears during on-call and responders improvise under stress. Mitigation order is unclear, evidence is fragmented, and rollback confidence is low. The outage lasts longer because recovery behavior was never rehearsed. Controlled chaos drills make failure response predictable before real incidents.

Scope

Failure classes in this chapter:

  • crash loop (/panic)
  • elevated 5xx (/status/500)
  • random pod termination (Chaos Monkey)

Implementation focus:

  • deterministic drills first
  • Chaos Monkey in develop with kill switch and strict target allowlist

What AI Would Propose (Brave Junior)

  • “Run chaos in staging/production to test realism quickly.”
  • “Inject multiple failures at once for faster coverage.”
  • “Leave chaos schedule always enabled.”

Why this sounds reasonable:

  • faster validation in fewer runs
  • appears comprehensive

Why This Is Dangerous

  • Running chaos in uncontrolled environments widens the blast radius beyond the drill's intent.
  • Injecting multiple failures at once destroys causal clarity during triage.
  • Always-on chaos with no time window creates operational noise and alert fatigue.

Chaos Monkey (MVP)

Flux path:

Safety controls:

  • namespace scope: develop only (RBAC Role in develop)
  • target scope: app=frontend or app=backend
  • schedule: every 15 minutes
  • window: UTC 10-16
  • kill switch: spec.suspend: true on CronJob (default)
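The safety controls above can be read as a single gate that must pass before any pod is terminated. The following is a minimal sketch of that gate, not the chapter's actual Chaos Monkey script. The function names are hypothetical, and treating the window as 10:00 inclusive to 16:00 exclusive is an assumption, since the source only says "UTC 10-16".

```python
from datetime import datetime, timezone

# Values taken from the safety controls listed above.
ALLOWED_NAMESPACE = "develop"
ALLOWED_APPS = {"frontend", "backend"}
WINDOW_START_HOUR_UTC = 10  # inclusive
WINDOW_END_HOUR_UTC = 16    # exclusive (assumption: the source only says "UTC 10-16")


def in_window(now: datetime) -> bool:
    """Return True when a timezone-aware `now` falls inside the UTC drill window."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    hour = now.astimezone(timezone.utc).hour
    return WINDOW_START_HOUR_UTC <= hour < WINDOW_END_HOUR_UTC


def may_terminate(namespace: str, labels: dict, suspended: bool, now: datetime) -> bool:
    """Apply every safety control; any single failure blocks the termination."""
    if suspended:                              # kill switch wins over everything else
        return False
    if namespace != ALLOWED_NAMESPACE:         # namespace scope: develop only
        return False
    if labels.get("app") not in ALLOWED_APPS:  # strict target allowlist
        return False
    return in_window(now)                      # bounded time window
```

Checking the kill switch first keeps the default answer "no": a suspended CronJob never reaches the scope or window checks.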

STOP: Kill Switch First

Critical rules:

  • chaos is allowed only in develop
  • run only in a bounded time window
  • produce evidence for every drill
  • if scope or time is unclear, keep spec.suspend: true
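As a rough illustration of the kill switch, the sketch below builds a kubectl patch command that sets spec.suspend on a CronJob. The CronJob name chaos-monkey and the helper name are assumptions, since the source does not name the resource. Because the chapter deploys Chaos Monkey through Flux, a manual patch may be reverted at the next reconciliation, so a Git change is likely the durable way to flip the switch.

```python
import json


def kill_switch_command(cronjob: str, suspend: bool = True, namespace: str = "develop") -> list:
    """Build (but do not run) a kubectl command that sets spec.suspend on a CronJob.

    `cronjob` is a hypothetical resource name; the source does not state it.
    """
    if namespace != "develop":
        # Critical rule from this chapter: chaos is allowed only in develop.
        raise ValueError("chaos is allowed only in develop")
    patch = json.dumps({"spec": {"suspend": suspend}})
    return ["kubectl", "-n", namespace, "patch", "cronjob", cronjob,
            "--type", "merge", "-p", patch]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(" ".join(kill_switch_command("chaos-monkey")))
```

Returning the argument list instead of executing it keeps the sketch side-effect free; a caller could pass the list to subprocess.run after reviewing it.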

Guardrails That Stop It

  • Never run uncontrolled chaos in staging/production.
  • One failure injection per run.
  • Evidence-first triage: metrics -> traces -> logs.
  • Every drill must end with recovery verification and a hardening action.
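One way to make these guardrails checkable is to validate each drill record before it is accepted as evidence. This is a minimal sketch under assumptions: the DrillRecord fields and the guardrail_violations helper are hypothetical and are not part of the chapter's lab files.

```python
from dataclasses import dataclass, field


@dataclass
class DrillRecord:
    """Hypothetical record of one drill; field names are illustrative only."""
    environment: str
    injections: list = field(default_factory=list)  # e.g. ["/panic"]
    recovery_verified: bool = False
    hardening_action: str = ""


def guardrail_violations(drill: DrillRecord) -> list:
    """Return a list of guardrail violations; an empty list means the drill passes."""
    problems = []
    if drill.environment != "develop":
        problems.append("environment must be develop")
    if len(drill.injections) != 1:
        problems.append("exactly one failure injection per run")
    if not drill.recovery_verified:
        problems.append("recovery not verified")
    if not drill.hardening_action.strip():
        problems.append("no hardening action recorded")
    return problems
```

A record such as DrillRecord("develop", ["/panic"], True, "tune alert threshold") passes, while a run that combines /panic and /status/500 is rejected because it breaks the one-injection rule.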

Safe Workflow (Step-by-Step)

  1. Confirm kill switch state and target scope before starting any drill.
  2. Run one deterministic failure scenario in develop.
  3. Triage with metrics -> traces -> logs and execute documented mitigation.
  4. Verify recovery and capture scorecard evidence.
  5. Re-engage the kill switch (spec.suspend: true) when schedule-based chaos is not actively required.

Lab Files

  • lab.md
  • runbook-game-day.md
  • scorecard.md
  • quiz.md

Handoff to Chapter 13 (AI Guardian)

Chaos Monkey emits structured log events in its CronJob output. In Chapter 13, Guardian watchers consume these events and classify each one as:

  • expected controlled disruption
  • unexpected collateral impact
  • escalation-required incident
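To make the three classes concrete, here is a deliberately simple sketch of a classifier. Everything about the event shape is an assumption: the source does not show the log schema, so the JSON fields namespace and app are hypothetical, and the rules only mirror this chapter's scope limits, not the Chapter 13 Guardian logic.

```python
import json

EXPECTED = "expected controlled disruption"
COLLATERAL = "unexpected collateral impact"
ESCALATE = "escalation-required incident"

ALLOWED_APPS = {"frontend", "backend"}


def classify(line: str) -> str:
    """Classify one JSON log line using a hypothetical event schema."""
    event = json.loads(line)
    if event.get("namespace") != "develop":
        # Anything outside develop breaks the chapter's namespace scope.
        return ESCALATE
    if event.get("app") not in ALLOWED_APPS:
        # Inside develop, but outside the target allowlist.
        return COLLATERAL
    return EXPECTED
```

A real watcher would likely also weigh recovery signals and alert state; this sketch only shows how scope limits can drive the first-pass label.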

Done When

  • learner runs at least two controlled failure drills with evidence
  • learner enables/disables Chaos Monkey safely
  • learner captures one game-day scorecard with action items

Game Day Scorecard (Template)

  • Date:
  • Environment:
  • Scenario:
  • Driver:
  • Incident Commander:
  • Observer:

Detection

  • First symptom timestamp:
  • Detection signal:
  • MTTD (minutes):

Triage

  • Representative trace id:
  • Correlated log …
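The MTTD (minutes) field in the scorecard can be computed rather than estimated. The sketch below assumes MTTD for a single drill is the time from failure injection to detection; the injection timestamp and the helper name are assumptions, since the visible part of the template lists only the first symptom timestamp.

```python
from datetime import datetime


def mttd_minutes(injected_at: str, detected_at: str) -> float:
    """Minutes between injection and detection, both given as ISO 8601 strings."""
    start = datetime.fromisoformat(injected_at)
    detected = datetime.fromisoformat(detected_at)
    if detected < start:
        raise ValueError("detection cannot precede injection")
    return (detected - start).total_seconds() / 60


if __name__ == "__main__":
    # Injection at 10:15:00 UTC, detection at 10:18:30 UTC.
    print(mttd_minutes("2024-01-01T10:15:00+00:00", "2024-01-01T10:18:30+00:00"))
```

Recording both timestamps with an explicit UTC offset avoids ambiguity when the driver and observer work in different time zones.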

Lab: Controlled Chaos with Safety Guardrails

  • confirm detection
  • run incident workflow
  • verify recovery

Prerequisites

    kubectl -n flux-system get kustomization chaos-monkey-develop
    kubectl -n develop get deploy frontend backend
    kubectl -n observability get …

Quiz: Chapter 12 (Controlled Chaos)

  1. Which CronJob field is the primary kill switch for Chaos Monkey?
  2. In this repo, what target app labels are allowed for monkey pod deletion?
  3. Which incident flow is required before mitigation decisions?
  4. Which statement is …

Runbook: Controlled Chaos Game Day

Roles

  • Incident Commander: owns decision flow
  • Driver: executes injection commands
  • Observer: records timeline and evidence

Preflight (Required)

  • Confirm environment is develop.
  • Confirm rollback path is known.
  • Confirm …