# Chapter 08: Resource Management & QoS

## Why This Chapter Exists
Unbounded workloads create noisy-neighbor incidents and unpredictable recovery. This chapter enforces resource discipline:
- requests/limits per container
- namespace quotas
- predictable QoS behavior under pressure
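As a starting point, here is what per-container requests and limits look like in a Deployment. This is an illustrative sketch; the name, image, and values are placeholders, not this repo's actual configuration:

```yaml
# Illustrative Deployment fragment: every container declares explicit
# CPU/memory requests (for scheduling) and limits (for enforcement).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
        - name: backend
          image: example/backend:1.0   # placeholder image
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
```

The scheduler places pods based on `requests`; the kubelet and kernel enforce `limits` (CPU throttling, memory OOM kills).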
## Incident Hook
One service starts consuming memory aggressively during a traffic peak.
Neighbor workloads on the same node are evicted while incident responders chase symptoms.
The original service restarts repeatedly with `OOMKilled`, but the blast radius has already spread.
Without resource guardrails, on-call loses control of prioritization and recovery.
## What AI Would Propose (Brave Junior)
- “Remove limits so pods stop restarting.”
- “Scale replicas first; tune resources later.”
- “Ignore QoS classes and just increase node size.”
Why this sounds reasonable:
- fast visible mitigation
- avoids immediate manifest edits
## Why This Is Dangerous
- Removing limits can starve other workloads and increase cluster instability.
- Scaling without the right requests/limits multiplies bad scheduling behavior.
- Ignoring QoS leads to unpredictable evictions under pressure.
## Guardrails That Stop It
- Every workload must define CPU/memory requests and limits.
- Namespaces must enforce `LimitRange` and `ResourceQuota`.
- OOM and throttling analysis must happen before scaling decisions.
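The namespace guardrails above can be sketched as a `LimitRange` plus a `ResourceQuota`. Names, namespace, and values below are examples, not this repo's actual manifests:

```yaml
# LimitRange: per-container defaults applied when a workload omits
# requests or limits, so nothing lands in the namespace unbounded.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits      # hypothetical name
  namespace: develop
spec:
  limits:
    - type: Container
      defaultRequest:       # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:              # applied when a container omits limits
        cpu: 500m
        memory: 256Mi
---
# ResourceQuota: a hard ceiling on the namespace's aggregate footprint.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota       # hypothetical name
  namespace: develop
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```

Together they enforce discipline at two levels: the `LimitRange` shapes individual containers, while the `ResourceQuota` caps the namespace as a whole.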
## Requests/Limits -> QoS -> Expected Behavior
| Requests/Limits Pattern | QoS Class | Expected Behavior Under Pressure |
|---|---|---|
| no requests, no limits | BestEffort | first candidate for eviction; highly unstable |
| requests set, limits optional/mixed | Burstable | moderate resilience; can be evicted under node pressure |
| requests == limits for CPU/memory | Guaranteed | strongest scheduling/eviction priority |
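The QoS class is derived entirely from the `resources` block, so the same workload can land in any row of the table above. A sketch with hypothetical pod names and a placeholder image:

```yaml
# Guaranteed: requests == limits for CPU and memory in every container.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
    - name: app
      image: nginx:1.25     # placeholder image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 250m
          memory: 256Mi
---
# Burstable: requests set, limits higher than requests.
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi
---
# BestEffort: no requests, no limits -- first candidate for eviction.
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
    - name: app
      image: nginx:1.25
```

You can confirm the assigned class with `kubectl get pod qos-guaranteed -o jsonpath='{.status.qosClass}'`.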
## Repo Mapping
Platform repository references:
- Backend resource config
- Frontend resource config
- Develop quotas/limits
- Staging quotas/limits
- Production quotas/limits
- Flux infrastructure wiring
- Flux apps wiring
## Current Implementation (This Repo)
- Backend and frontend define CPU/memory/ephemeral-storage requests+limits.
- `develop`, `staging`, and `production` have `LimitRange` and `ResourceQuota` via Flux.
- Apps depend on resource-management Kustomizations before reconcile.
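The "apps depend on resource-management" wiring can be sketched with Flux's `dependsOn` field. This is an assumption-laden example: the Kustomization names and paths are illustrative, not this repo's actual Flux layout:

```yaml
# Apps reconcile only after the resource-management layer (LimitRange,
# ResourceQuota) has been applied and is healthy.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps                        # hypothetical name
  namespace: flux-system
spec:
  dependsOn:
    - name: resource-management     # hypothetical Kustomization holding quotas/limits
  interval: 10m
  path: ./apps                      # illustrative path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

Ordering matters here: if quotas landed after the apps, workloads could briefly run without guardrails.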
## Safe Workflow (Step-by-Step)
1. Confirm each container has explicit CPU/memory requests and limits.
2. Validate the namespace `LimitRange` and `ResourceQuota` before rollout.
3. Reproduce load and observe:
   - pod events (`OOMKilled`, throttling), e.g. via `kubectl describe pod`
   - QoS class (`kubectl get pod -o jsonpath='{.status.qosClass}'`)
   - node pressure signals (`kubectl describe node` conditions)
4. Tune requests/limits based on evidence, then re-test.
5. Promote adjustments environment by environment and keep the same guardrails in place.
## Lab Scenarios (Must Cover)
- OOM scenario:
  - trigger memory pressure in one workload
  - verify `OOMKilled` evidence and tune limits/requests
- Node pressure scenario:
  - simulate broader contention
  - observe eviction/QoS behavior differences across workloads
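One way to trigger the OOM scenario is a pod whose workload allocates more memory than its limit. The image and arguments below are assumptions (any memory-hungry process works); this is a sketch, not the lab's actual manifest:

```yaml
# Pod that tries to allocate ~300M inside a 128Mi limit, so the kernel
# OOM-kills the container and the pod records an OOMKilled state.
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: stress
      image: polinux/stress   # assumed stress-test image
      command: ["stress"]
      args: ["--vm", "1", "--vm-bytes", "300M", "--vm-hang", "0"]
      resources:
        requests:
          memory: 128Mi
        limits:
          memory: 128Mi       # below what stress tries to allocate
```

After the pod fails, `kubectl describe pod oom-demo` should show the container's last state as `OOMKilled`, which is the evidence the lab asks you to collect before tuning.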
## Lab Files
- `lab.md`
- `quiz.md`
## Done When
- learner can explain Burstable vs Guaranteed vs BestEffort with real manifests
- learner can verify quota/limitrange enforcement in cluster
- learner can diagnose OOM/resource pressure from pod events and metrics