Core Track. A guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • deployment.yaml
  • develop/


What You Will Produce

A reproducible lab result, a passing quiz, and evidence that the workload behaves predictably under resource pressure.

Chapter 08: Resource Management & QoS

Incident Hook

One service starts consuming memory aggressively during a traffic peak. Neighbor workloads on the same node are evicted while incident responders chase symptoms. The original service restarts repeatedly with OOMKilled, but the blast radius has already spread. Without resource guardrails, on-call loses control of prioritization and recovery.

Observed Symptoms

What the team sees first:

  • repeated OOMKilled events on one workload
  • evictions or instability in neighboring workloads
  • pressure appears at node level, not only inside one pod

That is why unbounded resource use becomes a platform incident, not just an app bug.

Confusion Phase

Under pressure, teams often react by scaling out or adding bigger nodes first. That can multiply the problem instead of isolating it.

The real question is:

  • is the issue workload sizing, node pressure, or quota enforcement?
  • which workloads are being sacrificed because resource classes were never defined cleanly?

Why This Chapter Exists

Unbounded workloads create noisy-neighbor incidents and unpredictable recovery. This chapter enforces resource discipline:

  • requests/limits per container
  • namespace quotas
  • predictable QoS behavior under pressure

What AI Would Propose (Brave Junior)

  • “Remove limits so pods stop restarting.”
  • “Scale replicas first; tune resources later.”
  • “Ignore QoS classes and just increase node size.”

Why this sounds reasonable:

  • fast visible mitigation
  • avoids immediate manifest edits

Why This Is Dangerous

  • Removing limits can starve other workloads and increase cluster instability.
  • Scaling without right-sized requests and limits multiplies bad scheduling behavior.
  • Ignoring QoS leads to unpredictable evictions under pressure.

Investigation

Start with scheduler behavior, not guesswork.

Safe investigation sequence:

  1. inspect pod events for OOMKilled, throttling, and eviction signals
  2. confirm QoS class for the affected workloads
  3. compare requests and limits against real observed behavior
  4. distinguish one noisy pod from broader node-level pressure
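Steps 1 and 2 of the sequence above read directly out of pod status. A trimmed sketch of what an OOMKilled pod looks like (the field names are standard Kubernetes pod status; the container name matches the backend manifest below, everything else is illustrative):

```yaml
# Trimmed output of: kubectl get pod <pod-name> -o yaml
status:
  qosClass: Burstable            # step 2: confirm the QoS class
  containerStatuses:
  - name: backend
    restartCount: 7              # repeated restarts are part of the evidence
    lastState:
      terminated:
        reason: OOMKilled        # step 1: the kill reason, not just CrashLoopBackOff
        exitCode: 137            # 128 + SIGKILL (9)
```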

Containment

Containment restores predictability:

  1. keep requests and limits explicit
  2. tune them from evidence, not panic
  3. verify quota and limit-range enforcement still protect neighbors
  4. re-run the scenario before promoting the new sizing

Guardrails That Stop It

  • Every workload must define CPU/memory requests and limits.
  • Namespaces must enforce LimitRange and ResourceQuota.
  • OOM and throttling analysis must happen before scaling decisions.

Requests/Limits -> QoS -> Expected Behavior

| Requests/Limits Pattern | QoS Class | Expected Behavior Under Pressure |
| --- | --- | --- |
| no requests, no limits | BestEffort | first candidate for eviction; highly unstable |
| requests set, limits optional/mixed | Burstable | moderate resilience; can be evicted under node pressure |
| requests == limits for CPU/memory | Guaranteed | strongest scheduling/eviction priority |
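The table maps directly onto the resources block of a container spec. Two minimal container sketches (names, images, and sizes are illustrative) show how the same workload lands in different QoS classes:

```yaml
# Burstable: requests set, limits higher -- can burst, can be evicted under node pressure
- name: api-burstable
  image: example/api:1.0          # illustrative image
  resources:
    requests:
      cpu: 10m
      memory: 32Mi
    limits:
      cpu: 100m
      memory: 128Mi

# Guaranteed: requests == limits for both CPU and memory in every container of the pod
- name: api-guaranteed
  image: example/api:1.0
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 100m
      memory: 128Mi
```

Note that a pod is Guaranteed only when every container sets requests equal to limits for both CPU and memory; one Burstable container makes the whole pod Burstable.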

Investigation Snapshots

Here is the backend deployment resource block used in the SafeOps system. It shows the requests and limits that turn resource discipline into scheduler behavior.

Backend resource block

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  labels:
    app: backend
    app.kubernetes.io/name: backend
    app.kubernetes.io/component: api
spec:
  replicas: 1
  revisionHistoryLimit: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
        app.kubernetes.io/name: backend
        app.kubernetes.io/component: api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      imagePullSecrets:
      - name: ghcr-credentials-docker
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: backend
        image: ghcr.io/ldbl/backend:latest
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 10001
          runAsGroup: 10001
          capabilities:
            drop:
              - ALL
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        env:
        - name: PORT
          value: "8080"
        - name: NAMESPACE
          value: "${NAMESPACE}"
        - name: ENVIRONMENT
          value: "${ENVIRONMENT}"
        - name: LOG_LEVEL
          value: "${LOG_LEVEL}"
        - name: SERVICE_NAME
          value: "backend"
        - name: SERVICE_VERSION
          value: "v1.0.0"
        - name: DEPLOYMENT_ENVIRONMENT
          value: "${ENVIRONMENT}"
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "k8s.cluster.name=${cluster_name}"
        - name: UPTRACE_DSN
          valueFrom:
            secretKeyRef:
              name: backend-secrets
              key: uptrace-dsn
        - name: OTEL_EXPORTER_OTLP_HEADERS
          valueFrom:
            secretKeyRef:
              name: backend-secrets
              key: uptrace-headers
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: backend-secrets
              key: jwt-secret
        - name: POSTGRES_USER
          valueFrom:
            secretKeyRef:
              name: app-postgres-app
              key: username
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: app-postgres-app
              key: password
        - name: POSTGRES_HOST
          value: app-postgres-rw
        - name: POSTGRES_DB
          value: app
        livenessProbe:
          httpGet:
            path: /livez
            port: http
          initialDelaySeconds: 15
          periodSeconds: 20
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /readyz
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30
        resources:
          requests:
            cpu: 10m
            memory: 32Mi
            ephemeral-storage: 64Mi
          limits:
            cpu: 100m
            memory: 128Mi
            ephemeral-storage: 128Mi
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /home/app/.cache
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir:
          sizeLimit: 10Mi

Here is the develop namespace quota and limit baseline used for rehearsal.

Develop quota and limit baseline

  • flux/infrastructure/resource-management/develop/kustomization.yaml
  • flux/infrastructure/resource-management/develop/limitrange.yaml
  • flux/infrastructure/resource-management/develop/resourcequota.yaml
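The actual files are member-only, but a minimal sketch of what a develop-tier LimitRange and ResourceQuota pair typically contains (all names and values here are illustrative assumptions, not the real baseline):

```yaml
# limitrange.yaml (illustrative): per-container defaults and ceilings
apiVersion: v1
kind: LimitRange
metadata:
  name: develop-limits
  namespace: develop
spec:
  limits:
  - type: Container
    defaultRequest:            # applied when a container omits requests
      cpu: 10m
      memory: 32Mi
    default:                   # applied when a container omits limits
      cpu: 100m
      memory: 128Mi
    max:                       # hard per-container ceiling
      cpu: "1"
      memory: 512Mi
---
# resourcequota.yaml (illustrative): namespace-wide budget
apiVersion: v1
kind: ResourceQuota
metadata:
  name: develop-quota
  namespace: develop
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "20"
```

The LimitRange keeps any single container honest; the ResourceQuota caps what the namespace as a whole can request, which is what protects neighbors during an incident.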

System Context

This chapter makes scaling and availability meaningful instead of cosmetic.

It feeds directly into:

  • Chapter 09 HPA behavior, which depends on sensible resource targets
  • Chapter 10 observability, where node and pod pressure become evidence
  • Chapter 14 on-call operations, where responders need predictable prioritization under load

Expected Baseline

  • Every container defines CPU, memory, and ephemeral-storage requests and limits.
  • Each environment (develop, staging, production) has LimitRange and ResourceQuota.
  • Apps depend on resource-management Kustomizations before reconcile.
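The dependency in the last bullet is expressed with Flux's dependsOn. A sketch under assumed names (the Kustomization names and paths are illustrative, not the repository's real ones):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps-develop                    # assumed name
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/develop                  # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
  - name: resource-management-develop   # quotas and limits reconcile before the apps
```

This ordering guarantees the LimitRange and ResourceQuota exist before any app workload is admitted into the namespace.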

Safe Workflow (Step-by-Step)

  1. Confirm each container has explicit CPU/memory requests and limits.
  2. Validate namespace LimitRange and ResourceQuota before rollout.
  3. Reproduce load and observe:
    • pod events (OOMKilled, throttling)
    • QoS class
    • node pressure signals
  4. Tune requests/limits based on evidence, then re-test.
  5. Promote adjustments environment by environment and keep the same guardrails in place.

Lab Scenarios (Must Cover)

  1. OOM scenario:
  • trigger memory pressure in one workload
  • verify OOMKilled evidence and tune limits/requests
  2. Node pressure scenario:
  • simulate broader contention
  • observe eviction/QoS behavior differences across workloads
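For the OOM scenario, one hedged way to trigger memory pressure is a throwaway pod whose memory limit sits below what it tries to allocate (the image and sizes are assumptions; adapt them to the lab environment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
  namespace: develop
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: polinux/stress              # assumed community stress image
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "200M", "--vm-hang", "1"]
    resources:
      requests:
        memory: 64Mi
      limits:
        memory: 128Mi                  # allocating above this triggers OOMKilled
```

After the pod terminates, the evidence from the Investigation section (reason: OOMKilled, exitCode: 137) should be visible in its status.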

Lab Files

  • lab.md
  • quiz.md

Done When

  • learner can explain Burstable vs Guaranteed vs BestEffort with real manifests
  • learner can verify quota/limitrange enforcement in cluster
  • learner can diagnose OOM/resource pressure from pod events and metrics

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Lab: Requests, Limits, QoS, and OOM Analysis
  • Quiz: Chapter 08 (Resource Management & QoS)