Incident Hook
One service starts consuming memory aggressively during a traffic peak. Neighbor workloads on the same node are evicted while incident responders chase symptoms. The original service restarts repeatedly with OOMKilled, but the blast radius already spread to healthy services.
Result: Without resource guardrails, on-call loses control of prioritization and recovery as the cluster’s stability degrades.
Observed Symptoms
What the team sees first:
- Repeated OOMKilled events on one workload.
- Evictions or instability in neighboring workloads.
- Pressure appears at the node level, not only inside one pod.
Unbounded resource use becomes a platform incident, not just an application bug.
Confusion Phase
Under pressure, teams often reach for scale-up or bigger nodes first. That can multiply the problem instead of isolating it. The real questions are:
- Is the issue workload sizing, node pressure, or quota enforcement?
- Which workloads are being sacrificed because resource classes (QoS) were never defined cleanly?
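If quota enforcement is in question, namespace-level guardrails are worth checking before anything else. Below is a minimal sketch of a LimitRange that supplies defaults for containers that omit requests or limits, so nothing silently lands in BestEffort; the name, namespace, and values are illustrative, not a sizing recommendation:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults   # hypothetical name
  namespace: team-a          # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```

With this in place, a pod deployed without any resources stanza still gets requests and limits injected at admission time, which changes its QoS class and its eviction behavior.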
Requests/Limits -> QoS -> Expected Behavior
Kubernetes uses your resource definitions to assign each pod a Quality of Service (QoS) class, which determines its eviction priority under node pressure.
| Requests/Limits Pattern | QoS Class | Expected Behavior Under Pressure |
|---|---|---|
| no requests, no limits | BestEffort | First candidate for eviction; highly unstable. |
| requests set, limits optional/mixed | Burstable | Moderate resilience; can be evicted under node pressure. |
| requests == limits for CPU and memory in every container | Guaranteed | Evicted last; most predictable behavior under pressure. |
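The patterns in the table map directly onto pod specs. A minimal sketch of a container spec that lands in the Guaranteed class, because requests equal limits for both CPU and memory; the pod name, image, and values are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api   # hypothetical workload name
spec:
  containers:
    - name: app
      image: registry.example.com/payments-api:1.4.2  # illustrative image
      resources:
        requests:       # requests == limits for every container => Guaranteed
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: 500m
          memory: 512Mi
```

Dropping the limits block while keeping requests would demote this pod to Burstable; removing the resources stanza entirely makes it BestEffort, the first eviction candidate. The assigned class can be confirmed with `kubectl get pod payments-api -o jsonpath='{.status.qosClass}'`.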
What AI Would Propose (Brave Junior):
- “Remove limits so pods stop restarting.”
- “Scale replicas first; tune resources later.”
- “Ignore QoS classes and just increase node size.”
Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?