Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • cronjob.yaml
  • develop/

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Guardrails That Stop It

  • Develop Scope: Run chaos drills only in the develop namespace. Never run uncontrolled chaos in staging or production.
  • Single Injection: Only one failure mode is allowed per run to maintain causal clarity.
  • Evidence-First Triage: All drills must follow the metrics -> traces -> logs investigation path.
  • Hardening Action: Every drill must result in at least one system hardening task.
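
The guardrails above can be expressed as a pre-flight check on a drill plan. The following is a minimal sketch, not code from the course repository: DrillPlan, guardrail_violations, and drill_is_closed are hypothetical names introduced for illustration, and the failure-mode strings are placeholders.

```python
from dataclasses import dataclass, field

ALLOWED_NAMESPACE = "develop"  # assumption: drills may only target this namespace


@dataclass
class DrillPlan:
    """Hypothetical description of one chaos drill (illustrative only)."""
    namespace: str
    failure_modes: list[str]
    hardening_tasks: list[str] = field(default_factory=list)


def guardrail_violations(plan: DrillPlan) -> list[str]:
    """Return the guardrails this plan violates; an empty list means it may run."""
    problems = []
    if plan.namespace != ALLOWED_NAMESPACE:
        problems.append(
            f"scope: namespace {plan.namespace!r} is not {ALLOWED_NAMESPACE!r}"
        )
    if len(plan.failure_modes) != 1:
        problems.append(
            f"single injection: expected 1 failure mode, got {len(plan.failure_modes)}"
        )
    return problems


def drill_is_closed(plan: DrillPlan) -> bool:
    """A drill counts as finished only once it has produced a hardening task."""
    return len(plan.hardening_tasks) >= 1
```

A plan that targets develop with exactly one failure mode yields no violations, but it is not considered closed until at least one hardening task has been recorded against it.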

STOP: Kill Switch First

Before starting any chaos exercise, you must confirm your ability to stop it.

  • Kill Switch: Set spec.suspend: true on the CronJob to stop new runs from being scheduled. This is the job's default state.
  • Time Window: Chaos is allowed only between 10:00 and 16:00 UTC on business days.
  • RBAC Limit: The chaos job only has delete permissions on Pods in the develop namespace.
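
The kill switch and time window can be checked before any drill starts. The sketch below makes two assumptions that the text does not state: the window is treated as 10:00 inclusive to 16:00 exclusive, and "business days" is taken to mean Monday through Friday with no holiday calendar. The function names are hypothetical, and the CronJob is represented as a plain dict such as one loaded from cronjob.yaml.

```python
from datetime import datetime, timezone

WINDOW_START_HOUR = 10  # inclusive, UTC
WINDOW_END_HOUR = 16    # exclusive, UTC (assumption: the window closes at 16:00)


def chaos_window_open(now: datetime) -> bool:
    """True if `now` falls inside the allowed chaos window."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    now_utc = now.astimezone(timezone.utc)
    is_business_day = now_utc.weekday() < 5  # Monday=0 ... Friday=4
    return is_business_day and WINDOW_START_HOUR <= now_utc.hour < WINDOW_END_HOUR


def kill_switch_engaged(cronjob: dict) -> bool:
    """True if the CronJob manifest (as a dict) has spec.suspend set to true.

    Kubernetes treats a missing spec.suspend as false, so a missing key
    is reported here as "not engaged".
    """
    return cronjob.get("spec", {}).get("suspend", False) is True
```

For example, chaos_window_open(datetime.now(timezone.utc)) answers whether a drill may start right now, and kill_switch_engaged confirms the job is still suspended before anyone enables it.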

Safe Workflow (Step-by-Step)

  1. Confirm Controls: Verify the kill switch and target namespace before starting.
  2. Deterministic Drill: Start with a single, manual pod deletion to test alerts.
  3. Trigger Chaos Monkey: Enable the chaos-monkey CronJob for a bounded time window.
  4. Triage & Mitigate: Use Chapter 10’s evidence path to identify and resolve the failure.
  5. Verify Recovery: Confirm that the service has returned to its healthy baseline.
  6. Re-suspend: Ensure the chaos job is returned to spec.suspend: true.
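
Steps 3 and 6 can be tied together so that the job is always re-suspended, even if the drill is interrupted. This is a sketch under stated assumptions: the CronJob is named chaos-monkey and lives in the develop namespace, kubectl is on the PATH and already pointed at the right cluster, and the helper names are hypothetical. The run and sleep parameters exist only so the flow can be exercised without a cluster.

```python
import json
import subprocess
import time

NAMESPACE = "develop"     # assumption: target namespace from the guardrails
CRONJOB = "chaos-monkey"  # assumption: CronJob name as used in the workflow


def suspend_patch_command(suspend: bool) -> list[str]:
    """Build the kubectl argv that sets spec.suspend on the CronJob."""
    patch = json.dumps({"spec": {"suspend": suspend}})
    return [
        "kubectl", "patch", "cronjob", CRONJOB,
        "-n", NAMESPACE, "--type", "merge", "-p", patch,
    ]


def run_bounded_chaos(minutes: float, run=subprocess.run, sleep=time.sleep) -> None:
    """Enable the CronJob for a bounded period, then always re-suspend it."""
    run(suspend_patch_command(False), check=True)
    try:
        sleep(minutes * 60)
    finally:
        # Runs on normal exit, on exceptions, and on Ctrl+C alike.
        run(suspend_patch_command(True), check=True)
```

Placing the re-suspend call in a finally block means the kill switch is restored on every exit path. Re-suspending does not stop Jobs the CronJob has already started, so step 5 (verify recovery) still has to be done by hand.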

This builds on: Backup and restore (Chapter 11). Chaos validates that recovery actually works.

This enables: AI-assisted SRE (Chapter 13). guardian uses chaos evidence for incident routing.