Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Artifacts

What You Will Produce

A reproducible lab result, a completed quiz verification, and evidence of incident-safe operating practice.

Chapter 09: Availability Engineering (HPA + PDB)

Why This Chapter Exists

Replicas alone do not guarantee availability during disruption. This chapter combines:

  • HPA for load-based scaling
  • PDB for controlled voluntary disruptions
  • rollout/drain awareness

Incident Hook

A routine node drain starts during moderate traffic. One critical service has minReplicas=1 and a restrictive PDB. Drain stalls, rollout queue backs up, and recovery coordination becomes manual. Availability fails because scaling/disruption settings were not engineered together.

What AI Would Propose (Brave Junior)

  • “Set minReplicas: 1 everywhere to save resources.”
  • “Disable PDB for this maintenance window.”
  • “Drain first and investigate later if disruption appears.”

Why this sounds reasonable:

  • less capacity cost
  • fewer blockers during maintenance

Why This Is Dangerous

  • A low replica baseline removes failure tolerance during rollout/drain events.
  • Bypassing a PDB can create avoidable downtime.
  • Draining without preflight checks turns planned maintenance into an incident.

Guardrails That Stop It

  • Staging/production critical services start from 2 replicas.
  • Each service has HPA bounds (minReplicas, maxReplicas) and resource targets.
  • Each service has a PDB to prevent unsafe disruption.
  • A node drain or rollout is never executed without checking PDB/HPA state.
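The PDB guardrail above can be sketched as a policy/v1 manifest. This is an illustrative example, not taken from the repo: the name `backend-pdb` and the `app: backend` selector are assumptions.

```yaml
# Sketch of a policy/v1 PodDisruptionBudget for a critical service.
# With a 2-replica baseline, minAvailable: 1 permits exactly one
# voluntary disruption at a time, so a drain proceeds pod by pod.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb        # illustrative name
spec:
  minAvailable: 1          # at least one pod must stay up during voluntary disruption
  selector:
    matchLabels:
      app: backend         # assumed label; must match the Deployment's pod labels
```

Note that `minAvailable` equal to the replica count would make every voluntary disruption forbidden, which is the "restrictive PDB" failure mode from the incident hook.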

Anti-Pattern: minReplicas=1 for Critical Services

For critical services in staging/production, minReplicas=1 is a reliability regression:

  • no failure tolerance during rollout/drain/restart
  • single disruption can remove all healthy capacity
  • incident response starts from outage state instead of degraded state
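The fix is to raise the baseline so one disruption leaves capacity rather than an outage. A minimal Deployment fragment, showing only the relevant field (the rest of the spec is omitted):

```yaml
# Fragment of an apps/v1 Deployment spec (illustrative).
spec:
  replicas: 2   # baseline of 2: one pod can be disrupted while one stays healthy;
                # HPA minReplicas should match or exceed this value
```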

Repo Mapping

Platform repository references:

Current Implementation (This Repo)

  • HPA (autoscaling/v2) added for backend and frontend in all three environments.
  • PDB (policy/v1) added for backend and frontend in all three environments.
  • staging/production baseline replicas are 2 for backend and frontend.
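An autoscaling/v2 HPA matching this pattern might look like the sketch below. The target name, maxReplicas, and utilization number are illustrative assumptions, not values read from the repo:

```yaml
# Sketch of an autoscaling/v2 HorizontalPodAutoscaler for the backend.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2           # matches the staging/production baseline of 2
  maxReplicas: 6           # assumed upper bound; size to real capacity limits
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target; requires the Metrics API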

Drain Day Scenario

Run one controlled node drain rehearsal and compare outcomes:

  • with healthy HPA + PDB: drain proceeds with bounded disruption
  • with restrictive/incorrect PDB: drain blocks (expected protective behavior)
  • with low replica baseline: higher user-facing risk and slower recovery

The objective is to confirm behavior before a real maintenance event.
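One possible shape for the rehearsal, assuming a target node name in `$NODE` and cluster access; these are standard kubectl commands, but the exact sequence is a sketch, not a runbook:

```shell
# Controlled drain rehearsal: cordon first, check PDBs, then drain with a timeout.
kubectl cordon "$NODE"            # stop new pods from scheduling onto the node
kubectl get pdb -A                # check ALLOWED DISRUPTIONS before draining
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=5m                    # bounded: a stalled drain surfaces as a timeout, not a hang
kubectl uncordon "$NODE"          # restore the node after the rehearsal
```

A drain that blocks on a PDB is the protective behavior working as designed; the rehearsal exists to see it happen before a real maintenance window.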

Safe Workflow (Step-by-Step)

  1. Preflight check HPA status and bounds (minReplicas, maxReplicas, utilization targets).
  2. Confirm PDB allowed disruptions before rollout or node drain.
  3. Simulate/execute one disruption action and observe scaling + availability behavior.
  4. If constraints conflict (for example drain blocked), adjust configuration safely, not by disabling guardrails.
  5. Record expected behavior for future maintenance runbooks.
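Steps 1 and 2 above can be covered with read-only commands. The namespace and HPA name here are illustrative assumptions:

```shell
# Preflight: inspect scaling bounds and disruption budget before acting.
kubectl get hpa -n production                    # TARGETS, MINPODS, MAXPODS, REPLICAS columns
kubectl describe hpa backend-hpa -n production   # current metrics, conditions, scaling events
kubectl get pdb -n production                    # ALLOWED DISRUPTIONS should be >= 1 before a drain
```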

Lab Files

  • lab.md
  • quiz.md

Done When

  • learner can verify HPA target/bounds and current scaling state
  • learner can verify PDB allowed disruptions before node drain
  • learner can explain interaction: HPA, PDB, rollout, and drain

Lab: HPA + PDB + Node Drain Readiness

Success criteria:

  • HPA exists and can scale within safe bounds
  • PDB constrains voluntary disruptions
  • drain simulation is evaluated through PDB/HPA signals first

Prerequisites:

  • Metrics API available (kubectl top works)
  • backend/frontend …

Quiz: Chapter 09 (Availability Engineering)

  1. What does a PodDisruptionBudget control?
  2. Which signal must be checked before node drain?
  3. If Allowed disruptions = 0 for a critical service, what is the correct action?
  4. Which statement is correct? A) PDB affects all pod …