# Chapter 08: Resource Management & QoS
## Incident Hook
One service starts consuming memory aggressively during a traffic peak.
Neighbor workloads on the same node are evicted while incident responders chase symptoms.
The original service restarts repeatedly with `OOMKilled`, but the blast radius has already spread.
Without resource guardrails, on-call loses control of prioritization and recovery.
## Observed Symptoms
What the team sees first:
- repeated `OOMKilled` events on one workload
- evictions or instability in neighboring workloads
- pressure appears at node level, not only inside one pod
That is why unbounded resource use becomes a platform incident, not just an app bug.
## Confusion Phase
Under pressure, teams often reach for more replicas or bigger nodes first. That can multiply the problem instead of isolating it.
The real question is:
- is the issue workload sizing, node pressure, or quota enforcement?
- and which workloads are being sacrificed because resource classes were never defined cleanly?
## Why This Chapter Exists
Unbounded workloads create noisy-neighbor incidents and unpredictable recovery. This chapter enforces resource discipline:
- requests/limits per container
- namespace quotas
- predictable QoS behavior under pressure
## What AI Would Propose (Brave Junior)
- “Remove limits so pods stop restarting.”
- “Scale replicas first; tune resources later.”
- “Ignore QoS classes and just increase node size.”
Why this sounds reasonable:
- fast visible mitigation
- avoids immediate manifest edits
## Why This Is Dangerous
- Removing limits can starve other workloads and increase cluster instability.
- Scaling without right-sized requests/limits multiplies bad scheduling behavior.
- Ignoring QoS classes leads to unpredictable evictions under pressure.
## Investigation
Start with scheduler behavior, not guesswork.
Safe investigation sequence:
- inspect pod events for `OOMKilled`, throttling, and eviction signals
- confirm QoS class for the affected workloads
- compare requests and limits against real observed behavior
- distinguish one noisy pod from broader node-level pressure
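The QoS check above can be read straight off the pod object: the computed class is recorded under `status.qosClass`. A trimmed sketch of what the pod status section looks like (field values are illustrative):

```yaml
# trimmed output of: kubectl get pod <pod-name> -o yaml
status:
  phase: Running
  qosClass: Burstable   # BestEffort | Burstable | Guaranteed
```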
## Containment
Containment restores predictability:
- keep requests and limits explicit
- tune them from evidence, not panic
- verify quota and limit-range enforcement still protect neighbors
- re-run the scenario before promoting the new sizing
## Guardrails That Stop It
- Every workload must define CPU/memory requests and limits.
- Namespaces must enforce `LimitRange` and `ResourceQuota`.
- OOM and throttling analysis must happen before scaling decisions.
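A minimal sketch of the two enforcement objects. The namespace name and all values here are illustrative assumptions, not the repo's actual settings:

```yaml
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults   # hypothetical name
  namespace: develop
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits requests
        cpu: 10m
        memory: 32Mi
      default:               # applied when a container omits limits
        cpu: 100m
        memory: 128Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota        # hypothetical name
  namespace: develop
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "20"
```

The `LimitRange` catches containers that forget requests/limits; the `ResourceQuota` caps the namespace total so one team cannot absorb the whole node.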
## Requests/Limits -> QoS -> Expected Behavior
| Requests/Limits Pattern | QoS Class | Expected Behavior Under Pressure |
|---|---|---|
| no requests, no limits | BestEffort | first candidate for eviction; highly unstable |
| requests set, limits optional/mixed | Burstable | moderate resilience; can be evicted under node pressure |
| requests == limits for CPU/memory | Guaranteed | strongest scheduling/eviction priority |
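As a quick illustration, here are container `resources` stanzas that map to each class (the numbers are illustrative, not sizing recommendations):

```yaml
# BestEffort: no requests or limits at all
resources: {}
---
# Burstable: requests set, limits higher (or only partially set)
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
---
# Guaranteed: requests == limits for both CPU and memory
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 250m
    memory: 256Mi
```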
## Investigation Snapshots
Here is the backend deployment resource block used in the SafeOps system. It shows the requests and limits that turn resource discipline into scheduler behavior.
Backend resource block:

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  labels:
    app: backend
    app.kubernetes.io/name: backend
    app.kubernetes.io/component: api
spec:
  replicas: 1
  revisionHistoryLimit: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
        app.kubernetes.io/name: backend
        app.kubernetes.io/component: api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      imagePullSecrets:
        - name: ghcr-credentials-docker
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: backend
          image: ghcr.io/ldbl/backend:latest
          imagePullPolicy: IfNotPresent
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 10001
            runAsGroup: 10001
            capabilities:
              drop:
                - ALL
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          env:
            - name: PORT
              value: "8080"
            - name: NAMESPACE
              value: "${NAMESPACE}"
            - name: ENVIRONMENT
              value: "${ENVIRONMENT}"
            - name: LOG_LEVEL
              value: "${LOG_LEVEL}"
            - name: SERVICE_NAME
              value: "backend"
            - name: SERVICE_VERSION
              value: "v1.0.0"
            - name: DEPLOYMENT_ENVIRONMENT
              value: "${ENVIRONMENT}"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "k8s.cluster.name=${cluster_name}"
            - name: UPTRACE_DSN
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: uptrace-dsn
            - name: OTEL_EXPORTER_OTLP_HEADERS
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: uptrace-headers
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: jwt-secret
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: app-postgres-app
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-postgres-app
                  key: password
            - name: POSTGRES_HOST
              value: app-postgres-rw
            - name: POSTGRES_DB
              value: app
          livenessProbe:
            httpGet:
              path: /livez
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 30
          resources:
            requests:
              cpu: 10m
              memory: 32Mi
              ephemeral-storage: 64Mi
            limits:
              cpu: 100m
              memory: 128Mi
              ephemeral-storage: 128Mi
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /home/app/.cache
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir:
            sizeLimit: 10Mi
```
Here is the develop namespace quota and limit baseline used for rehearsal.
Develop quota and limit baseline:
- flux/infrastructure/resource-management/develop/kustomization.yaml
- flux/infrastructure/resource-management/develop/limitrange.yaml
- flux/infrastructure/resource-management/develop/resourcequota.yaml
## System Context
This chapter makes scaling and availability meaningful instead of cosmetic.
It feeds directly into:
- Chapter 09 HPA behavior, which depends on sensible resource targets
- Chapter 10 observability, where node and pod pressure become evidence
- Chapter 14 on-call operations, where responders need predictable prioritization under load
## Expected Baseline
- Every container defines CPU, memory, and ephemeral-storage requests and limits.
- Each environment (`develop`, `staging`, `production`) has a `LimitRange` and a `ResourceQuota`.
- Apps depend on resource-management Kustomizations before reconcile.
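The last point can be expressed with Flux's `dependsOn` ordering. A sketch, with Kustomization names and paths assumed for illustration:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps                      # hypothetical app-layer Kustomization
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./flux/apps/develop
  prune: true
  dependsOn:
    - name: resource-management   # quotas and limit ranges reconcile first
```

With this in place, workloads can never land in a namespace before its guardrails exist.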
## Safe Workflow (Step-by-Step)
- Confirm each container has explicit CPU/memory requests and limits.
- Validate namespace `LimitRange` and `ResourceQuota` before rollout.
- Reproduce load and observe:
  - pod events (`OOMKilled`, throttling)
  - QoS class
  - node pressure signals
- Tune requests/limits based on evidence, then re-test.
- Promote adjustments environment by environment and keep the same guardrails in place.
## Lab Scenarios (Must Cover)
- OOM scenario:
  - trigger memory pressure in one workload
  - verify `OOMKilled` evidence and tune limits/requests
- Node pressure scenario:
  - simulate broader contention
  - observe eviction/QoS behavior differences across workloads
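For the OOM scenario, a disposable pod whose workload deliberately allocates more than its memory limit is enough to produce `OOMKilled` events. A sketch; the stress image and sizes are assumptions, not part of the SafeOps system:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  restartPolicy: Never
  containers:
    - name: stress
      image: polinux/stress        # assumed memory-stress image; any memory hog works
      resources:
        requests:
          memory: 32Mi
        limits:
          memory: 64Mi             # the allocation below deliberately exceeds this
      command: ["stress"]
      args: ["--vm", "1", "--vm-bytes", "128M", "--vm-hang", "1"]
```

After it is killed, the evidence shows up in `kubectl describe pod oom-demo` as a terminated container with reason `OOMKilled`.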
## Lab Files
- lab.md
- quiz.md
## Done When
- learner can explain Burstable vs Guaranteed vs BestEffort with real manifests
- learner can verify quota/limitrange enforcement in cluster
- learner can diagnose OOM/resource pressure from pod events and metrics