Core Track: guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Artifacts

What You Will Produce

A reproducible lab result, a completed quiz, and evidence of incident-safe operation.

Chapter 08: Resource Management & QoS

Why This Chapter Exists

Unbounded workloads create noisy-neighbor incidents and unpredictable recovery. This chapter enforces resource discipline:

  • requests/limits per container
  • namespace quotas
  • predictable QoS behavior under pressure
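
The per-container piece of that discipline looks like the following sketch (the pod name, image, and values are illustrative placeholders, not taken from this repo):

```yaml
# Illustrative container spec with explicit requests and limits.
# Name, image, and values are placeholders, not repo values.
apiVersion: v1
kind: Pod
metadata:
  name: example-backend
spec:
  containers:
    - name: app
      image: example/backend:1.0   # placeholder image
      resources:
        requests:                  # what the scheduler reserves
          cpu: 250m
          memory: 256Mi
        limits:                    # hard ceiling at runtime
          cpu: 500m
          memory: 512Mi
```

Requests drive scheduling decisions; limits cap runtime usage. CPU overage is throttled, while memory overage gets the container OOM-killed.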

Incident Hook

One service starts consuming memory aggressively during traffic peak. Neighbor workloads on the same node are evicted while incident responders chase symptoms. The original service restarts repeatedly with OOMKilled, but the blast radius already spread. Without resource guardrails, on-call loses control of prioritization and recovery.

What AI Would Propose (Brave Junior)

  • “Remove limits so pods stop restarting.”
  • “Scale replicas first; tune resources later.”
  • “Ignore QoS classes and just increase node size.”

Why this sounds reasonable:

  • fast visible mitigation
  • avoids immediate manifest edits

Why This Is Dangerous

  • Removing limits can starve other workloads and destabilize the cluster.
  • Scaling without correct requests/limits multiplies bad scheduling behavior.
  • Ignoring QoS classes leads to unpredictable evictions under node pressure.

Guardrails That Stop It

  • Every workload must define CPU/memory requests and limits.
  • Namespaces must enforce LimitRange and ResourceQuota.
  • OOM and throttling analysis must happen before scaling decisions.
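
A minimal sketch of the namespace-level guardrails, assuming illustrative values rather than this repo's actual ones:

```yaml
# Hypothetical namespace guardrails; all values are examples only.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: develop
spec:
  limits:
    - type: Container
      default:              # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:       # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: develop
spec:
  hard:                     # caps on the namespace-wide totals
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
```

The LimitRange backstops individual containers with defaults and bounds; the ResourceQuota caps the namespace's aggregate consumption so one team cannot exhaust shared capacity.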

Requests/Limits -> QoS -> Expected Behavior

| Requests/Limits Pattern | QoS Class | Expected Behavior Under Pressure |
| --- | --- | --- |
| no requests, no limits | BestEffort | first candidate for eviction; highly unstable |
| requests set, limits optional/mixed | Burstable | moderate resilience; can be evicted under node pressure |
| requests == limits for CPU and memory | Guaranteed | strongest scheduling/eviction priority |
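
For example, a pod lands in the Guaranteed class when every container sets requests equal to limits for both CPU and memory (values here are illustrative):

```yaml
# Guaranteed QoS: requests == limits for CPU and memory
# in every container. Name, image, and values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
    - name: app
      image: example/app:1.0   # placeholder image
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:                # identical to requests -> Guaranteed
          cpu: 500m
          memory: 512Mi
```

Dropping the limits while keeping the requests would demote this pod to Burstable; dropping both would leave it BestEffort, the first eviction candidate.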

Repo Mapping

Platform repository references:

Current Implementation (This Repo)

  • Backend and frontend define CPU/memory/ephemeral-storage requests+limits.
  • develop, staging, production have LimitRange and ResourceQuota via Flux.
  • Apps depend on resource-management Kustomizations before reconcile.

Safe Workflow (Step-by-Step)

  1. Confirm each container has explicit CPU/memory requests and limits.
  2. Validate namespace LimitRange and ResourceQuota before rollout.
  3. Reproduce load and observe:
    • pod events (OOMKilled, throttling)
    • QoS class
    • node pressure signals
  4. Tune requests/limits based on evidence, then re-test.
  5. Promote adjustments environment by environment and keep the same guardrails in place.
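
Step 4 can be expressed as a small overlay patch instead of an edit to the base manifest, which keeps the evidence-based tuning reviewable per environment (the deployment and container names here are hypothetical):

```yaml
# Hypothetical strategic-merge patch raising memory after observing
# OOMKilled events; deployment/container names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              memory: 384Mi   # raised toward observed steady-state usage
            limits:
              memory: 768Mi   # headroom above the observed peak
```

Promoting the same patch environment by environment (step 5) keeps the quota and LimitRange guardrails unchanged while the workload's numbers converge on measured reality.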

Lab Scenarios (Must Cover)

  1. OOM scenario:
     • trigger memory pressure in one workload
     • verify OOMKilled evidence and tune limits/requests
  2. Node pressure scenario:
     • simulate broader contention
     • observe eviction/QoS behavior differences across workloads
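
One way to trigger the OOM scenario in a controlled way is a throwaway pod that allocates more memory than its limit allows. This is a sketch, not one of the lab files; the stress image and sizes are assumptions:

```yaml
# Disposable pod that deliberately exceeds its memory limit.
# polinux/stress is an assumed image choice; any stress tool works.
# Expect an OOMKilled status in the pod's events and last state.
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  restartPolicy: Never
  containers:
    - name: stress
      image: polinux/stress
      command: ["stress"]
      args: ["--vm", "1", "--vm-bytes", "256M", "--vm-hang", "0"]
      resources:
        requests:
          memory: 64Mi
        limits:
          memory: 128Mi    # allocation above aims past this ceiling
```

Because the container asks for 256M against a 128Mi limit, the kernel OOM-kills it, producing exactly the evidence the first lab scenario asks you to collect.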

Lab Files

  • lab.md
  • quiz.md

Done When

  • learner can explain Burstable vs Guaranteed vs BestEffort with real manifests
  • learner can verify quota/limitrange enforcement in cluster
  • learner can diagnose OOM/resource pressure from pod events and metrics

Lab: Requests, Limits, QoS, and OOM Analysis

Goals:

  • verify requests/limits are present
  • verify namespace quota and default limits
  • trigger controlled memory pressure and analyze behavior

Prerequisites:

  • Flux healthy
  • develop namespace workloads running
  • kubectl -n flux-system …

Quiz: Chapter 08 (Resource Management & QoS)

  1. What QoS class do pods usually get when requests and limits are both set but not equal?
  2. What Kubernetes object enforces namespace-wide total resource caps?
  3. What Kubernetes object provides default/min/max resource values …