Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • deployment.yaml
  • develop/
  • resourcequota.yaml

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Incident Hook

One service starts consuming memory aggressively during a traffic peak. Neighboring workloads on the same node are evicted while incident responders chase symptoms. The original service restarts repeatedly with OOMKilled, but by then the blast radius has already spread to healthy services.

Result: Without resource guardrails, on-call loses control of prioritization and recovery as the cluster’s stability degrades.

Observed Symptoms

What the team sees first:

  • Repeated OOMKilled events on one workload.
  • Evictions or instability in neighboring workloads.
  • Pressure appears at the node level, not only inside one pod.

Unbounded resource use becomes a platform incident, not just an application bug.

Confusion Phase

Under pressure, teams often react by scaling up replicas or moving to bigger nodes first. That can multiply the problem instead of isolating it. The real questions are:

  • Is the issue workload sizing, node pressure, or quota enforcement?
  • Which workloads are being sacrificed because resource classes (QoS) were never defined cleanly?
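To make the "quota enforcement" branch of the first question concrete, here is a minimal sketch of a namespace quota. It is not the course's resourcequota.yaml: the object name, the `develop` namespace (assumed from the develop/ reference), and every number are illustrative assumptions.

```yaml
# Hypothetical ResourceQuota; name, namespace, and values are assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: develop
spec:
  hard:
    # Sum of container requests allowed across the namespace.
    requests.cpu: "4"
    requests.memory: 8Gi
    # Sum of container limits allowed across the namespace.
    limits.cpu: "8"
    limits.memory: 16Gi
    # Maximum number of pods in the namespace.
    pods: "20"
```

A quota like this is enforced at admission, so a quota problem usually shows up as pods that are never created (a Forbidden "exceeded quota" error, often visible in ReplicaSet events), not as OOMKilled restarts or evictions. Once a quota tracks CPU or memory requests and limits, pods that omit those values are rejected unless a LimitRange supplies defaults.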

Requests/Limits -> QoS -> Expected Behavior

Kubernetes uses the requests and limits you define to assign each pod a Quality of Service (QoS) class. That class influences how the pod is treated when the node runs short of resources.

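Requests/Limits Pattern                                  | QoS Class  | Expected Behavior Under Pressure
---------------------------------------------------------|------------|-----------------------------------------------------------
no requests, no limits                                   | BestEffort | First candidate for eviction; highly unstable.
requests set; limits missing, higher, or mixed           | Burstable  | Moderate resilience; can be evicted under node pressure.
requests == limits for CPU and memory in every container | Guaranteed | Strongest protection; last candidate for eviction.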
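The patterns in the table above can be sketched as container resource stanzas. The pod names, image, and values below are illustrative assumptions, not taken from the course's deployment.yaml.

```yaml
# Hypothetical pods for illustration; names, image, and sizes are assumptions.
# Requests equal limits for both CPU and memory: Guaranteed.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "250m"
          memory: "256Mi"
---
# Requests set, memory limit higher than the request, no CPU limit: Burstable.
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable-demo
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          memory: "256Mi"
```

Omitting the resources block entirely would yield BestEffort. Kubernetes records the class it assigned in the pod's status.qosClass field, so the result can be checked on a running pod instead of inferred from the manifest.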

What AI Would Propose (Brave Junior):

  • “Remove limits so pods stop restarting.”
  • “Scale replicas first; tune resources later.”
  • “Ignore QoS classes and just increase node size.”

Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?