Investigation
Start with scheduler behavior and events, not guesswork.
Safe investigation sequence:
- Inspect Pod Events: Look for OOMKilled, Throttling, and Evicted signals.
- Confirm QoS Class: Check the QoS class of the affected workloads.
- Compare Behavior: Compare the requests and limits against the real, observed behavior in Grafana or kubectl top.
- Identify Scope: Distinguish between a single noisy pod and broader, node-level pressure.
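The QoS class in the sequence above is derived from each container's requests and limits, so it can be reasoned about offline before touching the cluster. The following is a minimal sketch of the classification rules (Guaranteed, Burstable, BestEffort). The function name qos_class and the dictionary layout are illustrative assumptions, not a Kubernetes API, and quantities are compared as plain strings, so "1" and "1000m" would be treated as different here.

```python
from typing import Dict, List

# Hypothetical input shape: one dict per container, e.g.
# {"requests": {"cpu": "100m"}, "limits": {"cpu": "200m", "memory": "256Mi"}}
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False      # True once any cpu/memory request or limit is seen
    guaranteed = True    # Stays True only if every container qualifies

    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            req = requests.get(resource)
            lim = limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit on both resources. An omitted request
            # defaults to the limit, so only an explicit mismatch disqualifies.
            # Assumption: values are compared verbatim, not parsed as quantities.
            if lim is None or (req is not None and req != lim):
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

A pod with only a CPU request, for example qos_class([{"requests": {"cpu": "100m"}}]), falls into the Burstable class, which matters during investigation because BestEffort and Burstable pods are evicted before Guaranteed ones under node pressure.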
Containment
Containment is about restoring predictability to the cluster’s resource management.
Containment steps:
- Keep Definitions Explicit: Do not remove limits to “unblock” an OOM pod.
- Tune from Evidence: Adjust requests and limits based on the actual peak usage, not panic.
- Verify Quota Enforcement: Ensure that ResourceQuota and LimitRange are protecting neighboring namespaces.
- Test Before Promotion: Re-run the failure scenario in a lower environment before promoting the new sizing.
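One way to make "tune from evidence" concrete is to derive sizing from usage samples exported from Grafana or collected with kubectl top. The sketch below is only an illustration: the function names, the 95th-percentile request, and the 1.25 headroom factor over the observed peak are assumptions chosen for the example, not recommendations from this guide, and should be replaced with values your team agrees on.

```python
import math
from typing import List, Tuple


def percentile(samples: List[float], pct: float) -> float:
    """Return the nearest-rank percentile of the samples."""
    if not samples:
        raise ValueError("at least one usage sample is required")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_memory_sizing(
    samples_mib: List[float],
    request_pct: float = 95.0,      # assumed: request covers typical usage
    limit_headroom: float = 1.25,   # assumed: limit sits above the observed peak
) -> Tuple[int, int]:
    """Suggest (request, limit) in MiB from observed memory usage samples."""
    request = math.ceil(percentile(samples_mib, request_pct))
    limit = math.ceil(max(samples_mib) * limit_headroom)
    # Never suggest a limit below the request.
    return request, max(request, limit)


if __name__ == "__main__":
    observed = [100, 120, 110, 130, 400, 125, 115, 105, 118, 122]
    request, limit = suggest_memory_sizing(observed)
    print(f"suggested request: {request}Mi, suggested limit: {limit}Mi")
```

Keeping the limit tied to the observed peak rather than removing it preserves the explicit definition called for above, while the single spike in the sample data shows why a short observation window can distort the result.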
Pause and Predict: What automated guardrail would have prevented this incident entirely?