Chapter 09: Availability Engineering (HPA + PDB)
Incident Hook
A routine node drain starts during moderate traffic.
One critical service has minReplicas=1 and a restrictive PDB.
Drain stalls, rollout queue backs up, and recovery coordination becomes manual.
Availability fails because scaling/disruption settings were not engineered together.
Observed Symptoms
What the team sees first:
- the node drain does not proceed cleanly
- one critical service has no spare healthy capacity
- operators must choose between breaking guardrails or accepting downtime
The incident is caused by settings that look reasonable in isolation but fail together during disruption.
Confusion Phase
HPA and PDB each look like availability features on their own. That is why teams misconfigure them separately.
The real question is:
- is the service under-provisioned
- or is the disruption policy blocking precisely because it is protecting a fragile baseline
Why This Chapter Exists
Replicas alone do not guarantee availability during disruption. This chapter combines:
- HPA for load-based scaling
- PDB for controlled voluntary disruptions
- rollout/drain awareness
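As a concrete sketch of that combination (names and thresholds here are illustrative, not taken from the SafeOps repo), an HPA and PDB engineered together might look like:

```yaml
# Illustrative backend HPA: the floor of 2 replicas means a single
# voluntary disruption never removes all healthy capacity.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
# Illustrative backend PDB: allows one pod at a time to be evicted
# during a drain, because the HPA baseline guarantees a spare replica.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: backend
```

The key coupling: `minReplicas` must exceed `minAvailable`. If the two are equal, allowed disruptions drop to zero at baseline and every voluntary drain stalls.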
What AI Would Propose (Brave Junior)
- “Set `minReplicas: 1` everywhere to save resources.”
- “Disable the PDB for this maintenance window.”
- “Drain first and investigate later if disruption appears.”
Why this sounds reasonable:
- less capacity cost
- fewer blockers during maintenance
Why This Is Dangerous
- low replica baseline removes failure tolerance during rollout/drain events.
- PDB bypass can create avoidable downtime.
- drain without preflight checks turns planned maintenance into an incident.
Investigation
Treat maintenance state as evidence, not inconvenience.
Safe investigation sequence:
- inspect current replica count and HPA bounds
- confirm PDB allowed disruptions before the drain
- compare the planned disruption with the service’s actual tolerance
- decide whether configuration needs correction before maintenance continues
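The arithmetic behind that comparison is simple. A minimal sketch (function names are ours, not a Kubernetes API) of how a PDB-style budget determines whether a one-pod eviction can proceed:

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Mirrors the idea behind the PDB status field disruptionsAllowed:
    how many pods may be voluntarily evicted right now."""
    return max(0, healthy_pods - min_available)


def drain_can_evict(healthy_pods: int, min_available: int) -> bool:
    """A node drain can evict a pod only while the budget is positive."""
    return allowed_disruptions(healthy_pods, min_available) > 0


# Fragile baseline from the incident: one replica, minAvailable=1.
print(allowed_disruptions(1, 1))  # 0 -> the drain stalls
# Healthy baseline: two replicas, minAvailable=1.
print(allowed_disruptions(2, 1))  # 1 -> one eviction at a time
```

This is why the investigation compares planned disruption against actual tolerance: a budget of zero is the PDB working as designed, protecting a baseline that is too thin.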
Containment
Containment protects availability first:
- pause or slow the disruption if spare capacity is insufficient
- adjust configuration safely instead of disabling the PDB blindly
- verify the service returns to a healthy multi-replica baseline
- rerun the maintenance step only when allowed disruptions are clear again
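For instance, instead of deleting the PDB, a safer containment move is to raise the replica floor so the existing budget becomes satisfiable. A hypothetical kustomize-style patch (resource names are illustrative):

```yaml
# Illustrative patch: raise the HPA floor to 2 so the PDB's
# minAvailable: 1 leaves one eviction's worth of headroom.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend
spec:
  minReplicas: 2
```

Once the second replica is healthy, allowed disruptions rise above zero and the stalled drain can proceed without any guardrail being disabled.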
Guardrails That Stop It
- staging/production start from 2 replicas for critical services.
- each service has HPA bounds (`minReplicas`, `maxReplicas`) and resource targets.
- each service has a PDB to prevent unsafe disruption.
- node drain or rollout is never executed without checking PDB/HPA state.
Anti-Pattern: minReplicas=1 for Critical Services
For critical services in staging/production, minReplicas=1 is a reliability regression:
- no failure tolerance during rollout/drain/restart
- single disruption can remove all healthy capacity
- incident response starts from outage state instead of degraded state
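A minimal illustration of the trap (values invented for this example): this pair can never allow a voluntary eviction, so every drain blocks until an operator intervenes.

```yaml
# Anti-pattern: minReplicas equals the PDB's minAvailable, so
# disruptionsAllowed is permanently 0 at baseline load.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: critical-svc
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: critical-svc
  minReplicas: 1
  maxReplicas: 4
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-svc
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: critical-svc
```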
Investigation Snapshots
Here is the backend develop overlay used in the SafeOps system. It contains the HPA and PDB objects that keep maintenance and load spikes from turning into downtime.
Backend HPA and PDB layout
- flux/apps/backend/develop/hpa.yaml
- flux/apps/backend/develop/image-automation.yaml
- flux/apps/backend/develop/image-policy.yaml
- flux/apps/backend/develop/kustomization.yaml
- flux/apps/backend/develop/patches/feature-flags.yaml
- flux/apps/backend/develop/pdb.yaml
Here is the frontend develop overlay that follows the same availability contract.
Frontend HPA and PDB layout
- flux/apps/frontend/overlays/develop/hpa.yaml
- flux/apps/frontend/overlays/develop/image-automation.yaml
- flux/apps/frontend/overlays/develop/image-policy.yaml
- flux/apps/frontend/overlays/develop/kustomization.yaml
- flux/apps/frontend/overlays/develop/namespace.yaml
- flux/apps/frontend/overlays/develop/patches/deployment.yaml
- flux/apps/frontend/overlays/develop/patches/ingress.yaml
- flux/apps/frontend/overlays/develop/pdb.yaml
System Context
This chapter turns scaling and disruption into one reliability contract.
It builds on:
- Chapter 08 resource discipline, which gives HPA meaningful inputs
- Chapter 10 observability, which proves whether disruption stayed bounded
- Chapter 14 operations, where planned maintenance should not become incident improvisation
Expected Baseline
- HPA (`autoscaling/v2`) configured for backend and frontend in all environments.
- PDB (`policy/v1`) configured for backend and frontend in all environments.
- Staging/production baseline replicas are 2+ for both services.
Drain Day Scenario
Run one controlled node drain rehearsal and compare outcomes:
- with healthy HPA + PDB: drain proceeds with bounded disruption
- with restrictive/incorrect PDB: drain blocks (expected protective behavior)
- with low replica baseline: higher user-facing risk and slower recovery
The objective is to confirm behavior before a real maintenance event.
Safe Workflow (Step-by-Step)
- Preflight check HPA status and bounds (`minReplicas`, `maxReplicas`, utilization targets).
- Confirm PDB allowed disruptions before rollout or node drain.
- Simulate/execute one disruption action and observe scaling + availability behavior.
- If constraints conflict (for example drain blocked), adjust configuration safely, not by disabling guardrails.
- Record expected behavior for future maintenance runbooks.
Lab Files
- lab.md
- quiz.md
Done When
- learner can verify HPA target/bounds and current scaling state
- learner can verify PDB allowed disruptions before node drain
- learner can explain interaction: HPA, PDB, rollout, and drain