Investigation & Containment | SafeOps Academy

Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

Reading: 20-25 min
Lab: 45-60 min
Quiz: 10-15 min

Prerequisites

Previous chapter completed: Chapter 11: Backup & Restore Basics.
Working access to target `develop` namespace and tooling.

Source Code References

cronjob.yaml Members
develop/ Members

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Investigation

Treat drills as controlled experiments, not as a spectacle.

Safe investigation sequence:

1. Define the Drill: Choose one failure type (e.g., pod termination) and one target service.
2. Confirm Controls: Verify the kill switch, namespace scope, and time window before starting.
3. Capture Telemetry: Ensure you are recording metrics, traces, and logs during the injection.
4. Compare Response: Compare the actual response path with your documented runbook.
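
The first three steps can be expressed as a pre-flight check that runs before any injection. The sketch below is illustrative only: `DrillPlan`, `preflight`, the field names, and the `checkout` service are hypothetical, and the allowed namespace is assumed to be the `develop` namespace from the prerequisites. It is not the course's tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillPlan:
    """Hypothetical description of one drill; field names are illustrative."""
    failure_type: str          # e.g. "pod-termination"
    target_service: str        # exactly one service per drill
    namespace: str             # where injection is allowed to happen
    kill_switch_tested: bool   # someone confirmed the kill switch works
    window_start: datetime
    window_end: datetime
    telemetry: frozenset = frozenset()  # signals confirmed to be recording


ALLOWED_NAMESPACES = {"develop"}                      # assumption: lab scope
REQUIRED_TELEMETRY = {"metrics", "traces", "logs"}


def preflight(plan: DrillPlan, now: datetime) -> list[str]:
    """Return the reasons the drill must not start; an empty list means go."""
    problems = []
    if not plan.failure_type or not plan.target_service:
        problems.append("drill needs one failure type and one target service")
    if plan.namespace not in ALLOWED_NAMESPACES:
        problems.append(f"namespace {plan.namespace!r} is out of scope")
    if not plan.kill_switch_tested:
        problems.append("kill switch has not been verified")
    if not (plan.window_start <= now < plan.window_end):
        problems.append("current time is outside the agreed window")
    missing = REQUIRED_TELEMETRY - set(plan.telemetry)
    if missing:
        problems.append("telemetry not recording: " + ", ".join(sorted(missing)))
    return problems


# Usage: a plan that satisfies every control yields no problems.
start = datetime(2024, 1, 1, 10, 0)
plan = DrillPlan(
    failure_type="pod-termination",
    target_service="checkout",  # hypothetical service name
    namespace="develop",
    kill_switch_tested=True,
    window_start=start,
    window_end=start + timedelta(hours=1),
    telemetry=frozenset({"metrics", "traces", "logs"}),
)
print(preflight(plan, now=start + timedelta(minutes=5)))
```

Returning a list of reasons rather than a single boolean keeps the check usable as drill evidence: every blocked start records why it was blocked.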

Containment

Containment is part of the drill itself, not an afterthought: practice stopping and recovering as deliberately as you practice the injection.

Containment steps:

1. Stop the Monkey: Use the kill switch if the blast radius or impact becomes unclear.
2. Execute Mitigation: Follow the practiced mitigation steps to restore service.
3. Verify Recovery: Confirm service health using the evidence from Chapter 10.
4. Harden the System: End every drill by identifying one technical action to reduce the impact of that failure in the future.
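
The stop and verify decisions above are easier to apply under pressure when written down as explicit rules. The following is a minimal sketch under stated assumptions: `Observation`, `should_stop`, and `recovered` are hypothetical names, and the 5% error-rate ceiling, the 0.01 tolerance, and the three-sample rule are placeholder values rather than recommendations from this chapter.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical snapshot taken while the failure is being injected."""
    affected_namespaces: set  # namespaces currently showing impact
    error_rate: float         # fraction of failed requests, 0.0 to 1.0
    telemetry_ok: bool        # False when impact can no longer be measured


def should_stop(obs, allowed_namespace="develop", max_error_rate=0.05):
    """Return True when the kill switch should be used."""
    if not obs.telemetry_ok:
        return True  # impact is unclear: stop rather than guess
    if obs.affected_namespaces - {allowed_namespace}:
        return True  # blast radius left the agreed scope
    return obs.error_rate > max_error_rate


def recovered(error_rates, baseline, tolerance=0.01, consecutive=3):
    """True once the last `consecutive` samples sit within tolerance of baseline."""
    recent = error_rates[-consecutive:]
    return len(recent) == consecutive and all(
        abs(rate - baseline) <= tolerance for rate in recent
    )


# Usage: impact spreading outside the drill namespace triggers a stop.
spread = Observation({"develop", "staging"}, error_rate=0.01, telemetry_ok=True)
print(should_stop(spread))  # → True
```

Requiring several consecutive healthy samples in `recovered` guards against declaring recovery on a single good reading while the service is still flapping.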

The objective is “learn from failure,” not just “survive the noise.”

Pause and Predict: What automated guardrail would have prevented this incident entirely?