# Chapter 08: Resource Management & QoS
## Incident Hook
One service starts consuming memory aggressively during a traffic peak.
Neighbor workloads on the same node are evicted while incident responders chase symptoms.
The original service restarts repeatedly with `OOMKilled`, but the blast radius has already spread.
Without resource guardrails, on-call loses control of prioritization and recovery.
## Observed Symptoms
What the team sees first:
- repeated `OOMKilled` events on one workload
- evictions or instability in neighboring workloads
- pressure appears at node level, not only inside one pod
That is why unbounded resource use becomes a platform incident, not just an app bug.
## Confusion Phase
Under pressure, teams often reach for more replicas or bigger nodes first. That can multiply the problem instead of isolating it.
The real question is:
- is the issue workload sizing, node pressure, or quota enforcement?
- and which workloads are being sacrificed because resource classes were never defined cleanly?
## Why This Chapter Exists
Unbounded workloads create noisy-neighbor incidents and unpredictable recovery. This chapter enforces resource discipline:
- requests/limits per container
- namespace quotas
- predictable QoS behavior under pressure
## What AI Would Propose (Brave Junior)
- “Remove limits so pods stop restarting.”
- “Scale replicas first; tune resources later.”
- “Ignore QoS classes and just increase node size.”
Why this sounds reasonable:
- fast visible mitigation
- avoids immediate manifest edits
## Why This Is Dangerous
- Removing limits can starve other workloads and increase cluster instability.
- Scaling without right-sized requests/limits multiplies bad scheduling behavior.
- Ignoring QoS classes leads to unpredictable evictions under pressure.
## Investigation
Start with scheduler behavior, not guesswork.
Safe investigation sequence:
- inspect pod events for `OOMKilled`, throttling, and eviction signals
- confirm QoS class for the affected workloads
- compare requests and limits against real observed behavior
- distinguish one noisy pod from broader node-level pressure
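The QoS check above can be read straight off the pod object: the computed class is recorded under `status.qosClass`. A trimmed sketch of what the pod status section looks like (field values are illustrative):

```yaml
# trimmed output of: kubectl get pod <pod-name> -o yaml
status:
  phase: Running
  qosClass: Burstable   # BestEffort | Burstable | Guaranteed
```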
## Containment
Containment restores predictability:
- keep requests and limits explicit
- tune them from evidence, not panic
- verify quota and limit-range enforcement still protect neighbors
- re-run the scenario before promoting the new sizing
## Guardrails That Stop It
- Every workload must define CPU/memory requests and limits.
- Namespaces must enforce `LimitRange` and `ResourceQuota`.
- OOM and throttling analysis must happen before scaling decisions.
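A minimal sketch of the two enforcement objects. The namespace name and all values here are illustrative assumptions, not the repo's actual settings:

```yaml
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults   # hypothetical name
  namespace: develop
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits requests
        cpu: 10m
        memory: 32Mi
      default:               # applied when a container omits limits
        cpu: 100m
        memory: 128Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota        # hypothetical name
  namespace: develop
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "20"
```

The `LimitRange` catches containers that forget requests/limits; the `ResourceQuota` caps the namespace total so one team cannot absorb the whole node.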
## Requests/Limits -> QoS -> Expected Behavior
| Requests/Limits Pattern | QoS Class | Expected Behavior Under Pressure |
|---|---|---|
| no requests, no limits | BestEffort | first candidate for eviction; highly unstable |
| requests set, limits optional/mixed | Burstable | moderate resilience; can be evicted under node pressure |
| requests == limits for CPU/memory | Guaranteed | strongest scheduling/eviction priority |
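As a quick illustration, here are container `resources` stanzas that map to each class (the numbers are illustrative, not sizing recommendations):

```yaml
# BestEffort: no requests or limits at all
resources: {}
---
# Burstable: requests set, limits higher (or only partially set)
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
---
# Guaranteed: requests == limits for both CPU and memory
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 250m
    memory: 256Mi
```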
## Investigation Snapshots
Here is the backend deployment resource block used in the SafeOps system. It shows the requests and limits that turn resource discipline into scheduler behavior.
Backend resource block:

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  labels:
    app: backend
    app.kubernetes.io/name: backend
    app.kubernetes.io/component: api
spec:
  replicas: 1
  revisionHistoryLimit: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
        app.kubernetes.io/name: backend
        app.kubernetes.io/component: api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      imagePullSecrets:
        - name: ghcr-credentials-docker
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: backend
          image: ghcr.io/ldbl/backend:latest
          imagePullPolicy: IfNotPresent
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 10001
            runAsGroup: 10001
            capabilities:
              drop:
                - ALL
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          env:
            - name: PORT
              value: "8080"
            - name: NAMESPACE
              value: "${NAMESPACE}"
            - name: ENVIRONMENT
              value: "${ENVIRONMENT}"
            - name: LOG_LEVEL
              value: "${LOG_LEVEL}"
            - name: SERVICE_NAME
              value: "backend"
            - name: SERVICE_VERSION
              value: "v1.0.0"
            - name: DEPLOYMENT_ENVIRONMENT
              value: "${ENVIRONMENT}"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "k8s.cluster.name=${cluster_name}"
            - name: UPTRACE_DSN
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: uptrace-dsn
            - name: OTEL_EXPORTER_OTLP_HEADERS
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: uptrace-headers
            - name: JWT_SECRET
              valueFrom:
                secretKeyRef:
                  name: backend-secrets
                  key: jwt-secret
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: app-postgres-app
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-postgres-app
                  key: password
            - name: POSTGRES_HOST
              value: app-postgres-rw
            - name: POSTGRES_DB
              value: app
          livenessProbe:
            httpGet:
              path: /livez
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 30
          resources:
            requests:
              cpu: 10m
              memory: 32Mi
              ephemeral-storage: 64Mi
            limits:
              cpu: 100m
              memory: 128Mi
              ephemeral-storage: 128Mi
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /home/app/.cache
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir:
            sizeLimit: 10Mi
```
Here is the develop namespace quota and limit baseline used for rehearsal.
Develop quota and limit baseline:
- flux/infrastructure/resource-management/develop/kustomization.yaml
- flux/infrastructure/resource-management/develop/limitrange.yaml
- flux/infrastructure/resource-management/develop/resourcequota.yaml
## System Context
This chapter makes scaling and availability meaningful instead of cosmetic.
It feeds directly into:
- Chapter 09 HPA behavior, which depends on sensible resource targets
- Chapter 10 observability, where node and pod pressure become evidence
- Chapter 14 on-call operations, where responders need predictable prioritization under load
## Expected Baseline
- Every container defines CPU, memory, and ephemeral-storage requests and limits.
- Each environment (`develop`, `staging`, `production`) has a `LimitRange` and a `ResourceQuota`.
- Apps depend on resource-management Kustomizations before reconcile.
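The last point can be expressed with Flux's `dependsOn` ordering. A sketch, with Kustomization names and paths assumed for illustration:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps                      # hypothetical app-layer Kustomization
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./flux/apps/develop
  prune: true
  dependsOn:
    - name: resource-management   # quotas and limit ranges reconcile first
```

With this in place, workloads can never land in a namespace before its guardrails exist.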
## Safe Workflow (Step-by-Step)
- Confirm each container has explicit CPU/memory requests and limits.
- Validate namespace `LimitRange` and `ResourceQuota` before rollout.
- Reproduce load and observe:
  - pod events (`OOMKilled`, throttling)
  - QoS class
  - node pressure signals
- Tune requests/limits based on evidence, then re-test.
- Promote adjustments environment by environment and keep the same guardrails in place.
## Lab Scenarios (Must Cover)
- OOM scenario:
  - trigger memory pressure in one workload
  - verify `OOMKilled` evidence and tune limits/requests
- Node pressure scenario:
  - simulate broader contention
  - observe eviction/QoS behavior differences across workloads
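For the OOM scenario, a disposable pod whose workload deliberately allocates more than its memory limit is enough to produce `OOMKilled` events. A sketch; the stress image and sizes are assumptions, not part of the SafeOps system:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  restartPolicy: Never
  containers:
    - name: stress
      image: polinux/stress        # assumed memory-stress image; any memory hog works
      resources:
        requests:
          memory: 32Mi
        limits:
          memory: 64Mi             # the allocation below deliberately exceeds this
      command: ["stress"]
      args: ["--vm", "1", "--vm-bytes", "128M", "--vm-hang", "1"]
```

After it is killed, the evidence shows up in `kubectl describe pod oom-demo` as a terminated container with reason `OOMKilled`.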
## Lab Files
- lab.md
- quiz.md
## Done When
- learner can explain Burstable vs Guaranteed vs BestEffort with real manifests
- learner can verify quota/limitrange enforcement in cluster
- learner can diagnose OOM/resource pressure from pod events and metrics