Investigation & Containment | SafeOps Academy

Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

Reading: 20-25 min
Lab: 45-60 min
Quiz: 10-15 min

Prerequisites

Previous chapter completed: Chapter 07: Security Context & Pod Hardening.
Working access to the target `develop` namespace and the required tooling.

Source Code References

deployment.yaml
develop/
resourcequota.yaml


What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Investigation

Start with scheduler behavior and events, not guesswork.

Safe investigation sequence:

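Inspect Pod Events: Look for OOMKilled and Evicted signals in `kubectl describe pod` output, and check CPU throttling in metrics, since throttling does not generate a pod event.
Confirm QoS Class: Check the QoS class (Guaranteed, Burstable, or BestEffort) of the affected workloads, since it influences which pods are evicted first under node pressure.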
Compare Behavior: Compare the requests and limits against the real, observed behavior in Grafana or `kubectl top`.
Identify Scope: Distinguish between a single noisy pod and broader, node-level pressure.
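
The investigation sequence above can be sketched as a small triage script. This is a minimal illustration, not the course's own tooling: it assumes pod objects shaped like the `items` returned by `kubectl get pods -n develop -o json`, and the function names `triage_pod` and `classify_scope`, as well as the `node_threshold` cut-off, are hypothetical choices made for this example.

```python
from collections import defaultdict


def triage_pod(pod: dict) -> dict:
    """Collect OOMKilled/Evicted signals and the QoS class from one pod object."""
    status = pod.get("status") or {}
    signals = []
    # An evicted pod reports the reason at pod level.
    if status.get("reason") == "Evicted":
        signals.append("Evicted")
    for cs in status.get("containerStatuses") or []:
        # OOMKilled is a container termination reason; after a restart it
        # is found in lastState rather than state, so check both.
        reasons = {
            ((cs.get(key) or {}).get("terminated") or {}).get("reason")
            for key in ("state", "lastState")
        }
        if "OOMKilled" in reasons:
            signals.append("OOMKilled")
    return {
        "name": pod["metadata"]["name"],
        "node": (pod.get("spec") or {}).get("nodeName"),
        "qos": status.get("qosClass", "unknown"),
        "signals": sorted(set(signals)),
    }


def classify_scope(pods: list, node_threshold: int = 2) -> dict:
    """Label each affected node "single-pod" or "node-level".

    node_threshold is a hypothetical cut-off chosen for this sketch:
    a node with at least that many affected pods counts as node-level.
    """
    affected = defaultdict(list)
    for pod in pods:
        summary = triage_pod(pod)
        if summary["signals"]:
            affected[summary["node"]].append(summary["name"])
    return {
        node: "node-level" if len(names) >= node_threshold else "single-pod"
        for node, names in affected.items()
    }
```

A node labelled "node-level" suggests looking at node pressure before tuning any single workload; a "single-pod" label points at that pod's own requests and limits. CPU throttling is not covered here because it has to come from metrics rather than from the pod object.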

Containment

Containment is about restoring predictability to the cluster’s resource management.

Containment steps:

Keep Definitions Explicit: Do not remove limits to “unblock” an OOMKilled pod; an unbounded pod can push the memory pressure onto the whole node.
Tune from Evidence: Adjust requests and limits based on the actual peak usage, not panic.
Verify Quota Enforcement: Confirm that the ResourceQuota and LimitRange in the affected namespace are active, so that its workloads cannot starve neighboring namespaces.
Test Before Promotion: Re-run the failure scenario in a lower environment before promoting the new sizing.
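
As one way to make "Tune from Evidence" concrete, the sketch below derives a memory request and limit from observed usage samples. The policy is an assumption for illustration only, not a rule from this chapter: the request is set to the 95th percentile (nearest-rank) and the limit to the observed peak plus a configurable headroom. The function name `suggest_memory_sizing` and the 20% default are hypothetical.

```python
def suggest_memory_sizing(samples_mib: list, headroom_percent: int = 20) -> dict:
    """Suggest a memory request and limit in MiB from integer usage samples.

    Assumed policy for this sketch:
      request = 95th percentile of the samples (nearest-rank method)
      limit   = observed peak plus headroom_percent, rounded up
    """
    if not samples_mib:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_mib)
    # Nearest-rank percentile: rank = ceil(0.95 * n), using integer math.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    peak = ordered[-1]
    # Integer ceiling of peak * (1 + headroom_percent / 100).
    limit = (peak * (100 + headroom_percent) + 99) // 100
    return {"request_mib": p95, "limit_mib": limit}
```

The samples would come from the same observations used during investigation, for example values exported from Grafana. The limit stays explicit and above the observed peak, which keeps the sizing change consistent with the first containment step.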

Pause and Predict: What automated guardrail would have prevented this incident entirely?