Workflow & Kill Switch | SafeOps Academy

Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

Reading: 20-25 min
Lab: 45-60 min
Quiz: 10-15 min

Prerequisites

Previous chapter completed: Chapter 11: Backup & Restore Basics.
Working access to the target `develop` namespace and its tooling.

Source Code References

cronjob.yaml (Members)
develop/ (Members)

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Guardrails That Stop It

Develop Scope: Chaos runs only in the `develop` namespace. Never run uncontrolled chaos in staging or production.
Single Injection: Only one failure mode is allowed per run to maintain causal clarity.
Evidence-First Triage: All drills must follow the metrics -> traces -> logs investigation path.
Hardening Action: Every drill must result in at least one system hardening task.
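The four guardrails above can be read as a close-out checklist for each drill. The following is a minimal sketch of such a check, not part of the course code: the `DrillRecord` class, its field names, and the `guardrail_violations` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field

ALLOWED_NAMESPACE = "develop"
REQUIRED_TRIAGE_ORDER = ["metrics", "traces", "logs"]


@dataclass
class DrillRecord:
    """Hypothetical record of one chaos drill (illustrative only)."""
    namespace: str
    failure_modes: list[str]
    triage_order: list[str]
    hardening_tasks: list[str] = field(default_factory=list)


def guardrail_violations(record: DrillRecord) -> list[str]:
    """Return a list of guardrail violations; an empty list means the drill passes."""
    problems = []
    if record.namespace != ALLOWED_NAMESPACE:
        problems.append(f"namespace must be {ALLOWED_NAMESPACE!r}, got {record.namespace!r}")
    if len(record.failure_modes) != 1:
        problems.append("exactly one failure mode is allowed per run")
    if record.triage_order != REQUIRED_TRIAGE_ORDER:
        problems.append("triage must follow metrics -> traces -> logs")
    if not record.hardening_tasks:
        problems.append("at least one hardening task is required")
    return problems


record = DrillRecord(
    namespace="develop",
    failure_modes=["pod-delete"],
    triage_order=["metrics", "traces", "logs"],
    hardening_tasks=["add readiness probe"],
)
print(guardrail_violations(record))  # []
```

Returning a list of messages rather than a single boolean keeps every violated guardrail visible at once, which is convenient when the record is reviewed after the drill.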

STOP: Kill Switch First

Before starting any chaos exercise, you must confirm your ability to stop it.

Kill Switch: `spec.suspend: true` on the CronJob, which is its default state.
Time Window: Chaos is allowed only between 10:00 and 16:00 UTC on business days.
RBAC Limit: The chaos job has delete permission only on Pods in the `develop` namespace.
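The three controls above can be verified before a run with a small pre-flight check. The sketch below works on manifests already loaded as plain dictionaries and makes several assumptions that the chapter does not state: the window is treated as 10:00 inclusive to 16:00 exclusive, "business days" means Monday to Friday with holidays ignored, and all function names are hypothetical.

```python
from datetime import datetime, timezone

WINDOW_START_HOUR = 10  # inclusive, UTC
WINDOW_END_HOUR = 16    # exclusive, UTC (assumption about how "10-16" is bounded)


def in_chaos_window(now: datetime) -> bool:
    """True if `now` falls on Monday-Friday between 10:00 and 16:00 UTC."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    now_utc = now.astimezone(timezone.utc)
    is_business_day = now_utc.weekday() < 5  # Monday=0 ... Friday=4
    return is_business_day and WINDOW_START_HOUR <= now_utc.hour < WINDOW_END_HOUR


def kill_switch_engaged(cronjob: dict) -> bool:
    """True if the CronJob manifest has spec.suspend set to true."""
    return cronjob.get("spec", {}).get("suspend") is True


def role_limited_to_pod_delete(role: dict) -> bool:
    """True if the Role lives in `develop` and every rule only deletes Pods."""
    if role.get("metadata", {}).get("namespace") != "develop":
        return False
    rules = role.get("rules", [])
    return bool(rules) and all(
        rule.get("resources") == ["pods"] and rule.get("verbs") == ["delete"]
        for rule in rules
    )


# Illustrative manifests; the field values are assumptions, not the course files.
cronjob = {"metadata": {"name": "chaos-monkey", "namespace": "develop"},
           "spec": {"suspend": True}}
role = {"metadata": {"namespace": "develop"},
        "rules": [{"apiGroups": [""], "resources": ["pods"], "verbs": ["delete"]}]}

print(kill_switch_engaged(cronjob))      # True
print(role_limited_to_pod_delete(role))  # True
```

If any of the three checks returns `False`, the safe choice is not to start the exercise.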

Safe Workflow (Step-by-Step)

1. Confirm Controls: Verify the kill switch and target namespace before starting.
2. Deterministic Drill: Start with a single, manual pod deletion to test alerts.
3. Trigger Chaos Monkey: Enable the `chaos-monkey` CronJob for a bounded time window.
4. Triage & Mitigate: Use Chapter 10’s evidence path to identify and resolve the failure.
5. Verify Recovery: Confirm that the service has returned to its healthy baseline.
6. Re-suspend: Ensure the chaos job is returned to `spec.suspend: true`.
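The workflow starts and ends with the CronJob suspended, and one way to make the final re-suspend hard to forget is to tie it to the end of the exercise itself. The sketch below shows that idea with a context manager. It is an illustration under stated assumptions: an in-memory dictionary stands in for the CronJob, and `chaos_enabled` is a hypothetical helper. Against a real cluster, the two assignments to `suspend` would be replaced by patches to the CronJob.

```python
from contextlib import contextmanager


@contextmanager
def chaos_enabled(cronjob: dict):
    """Unsuspend the CronJob for the duration of the block, then always re-suspend."""
    if cronjob["spec"].get("suspend") is not True:
        # Confirm Controls: refuse to start unless the kill switch is engaged.
        raise RuntimeError("kill switch not engaged; refusing to start")
    cronjob["spec"]["suspend"] = False  # Trigger Chaos Monkey
    try:
        yield cronjob
    finally:
        cronjob["spec"]["suspend"] = True  # Re-suspend, even if triage raised an error


# Stand-in for the chaos-monkey CronJob in the develop namespace.
job = {"metadata": {"name": "chaos-monkey", "namespace": "develop"},
       "spec": {"suspend": True}}

with chaos_enabled(job):
    # Triage, mitigation, and recovery verification would happen here.
    print(job["spec"]["suspend"])  # False

print(job["spec"]["suspend"])  # True
```

The `finally` clause is the design point: the job returns to its suspended default whether the exercise finishes cleanly or is interrupted by an error.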

This builds on: Backup and restore (Chapter 11). Chaos validates that recovery actually works.
This enables: AI-assisted SRE (Chapter 13). guardian uses chaos evidence for incident routing.