Core Track Guardrails-first chapter in core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • cronjob.yaml Members
  • develop/ Members

Sign in to view source code.

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Investigation

Treat drills as controlled experiments, not as a spectacle.

Safe investigation sequence:

  1. Define the Drill: Choose one failure type (e.g., pod termination) and one target service.
  2. Confirm Controls: Verify the kill switch, namespace scope, and time window before starting.
  3. Capture Telemetry: Ensure you are recording metrics, traces, and logs during the injection.
  4. Compare Response: Compare the actual response path with your documented runbook.

Containment

Containment is an integral part of the drill itself.

Containment steps:

  1. Stop the Monkey: Use the kill switch if the blast radius or impact becomes unclear.
  2. Execute Mitigation: Follow the practiced mitigation steps to restore service.
  3. Verify Recovery: Confirm service health using the evidence from Chapter 10.
  4. Harden the System: End every drill by identifying one technical action to reduce the impact of that failure in the future.

The objective is “learn from failure,” not just “survive the noise.”


Pause and Predict: What automated guardrail would have prevented this incident entirely?