Chapter 09: Availability Engineering (HPA + PDB)
Why This Chapter Exists
Replicas alone do not guarantee availability during disruption. This chapter combines:
- HPA for load-based scaling
- PDB for controlled voluntary disruptions
- rollout/drain awareness
Incident Hook
A routine node drain starts during moderate traffic.
One critical service has minReplicas=1 and a restrictive PDB.
Drain stalls, rollout queue backs up, and recovery coordination becomes manual.
Availability fails because scaling/disruption settings were not engineered together.
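The failing combination from the incident can be reproduced with a configuration like the following sketch (resource names and labels are illustrative assumptions, not values from the repo):

```yaml
# Anti-pattern: a single replica paired with a PDB that requires one healthy pod.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-svc        # illustrative name
spec:
  minAvailable: 1           # with replicas: 1, allowed disruptions is permanently 0
  selector:
    matchLabels:
      app: critical-svc
```

With only one replica, the eviction API can never satisfy `minAvailable: 1`, so every eviction request is rejected and `kubectl drain` retries indefinitely. This is exactly the stalled-drain behavior in the incident above.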
What AI Would Propose (Brave Junior)
- “Set `minReplicas: 1` everywhere to save resources.”
- “Disable PDB for this maintenance window.”
- “Drain first and investigate later if disruption appears.”
Why this sounds reasonable:
- less capacity cost
- fewer blockers during maintenance
Why This Is Dangerous
- low replica baseline removes failure tolerance during rollout/drain events.
- PDB bypass can create avoidable downtime.
- drain without preflight checks turns planned maintenance into an incident.
Guardrails That Stop It
- staging/production start from 2 replicas for critical services.
- each service has HPA bounds (`minReplicas`, `maxReplicas`) and resource targets.
- each service has PDB to prevent unsafe disruption.
- node drain or rollout is never executed without checking PDB/HPA state.
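A minimal sketch of the guardrail pair for one service. The names, bounds, and utilization target here are illustrative assumptions, not the repo's actual values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend             # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2            # critical services keep failure tolerance
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend
spec:
  maxUnavailable: 1         # a drain may evict at most one pod at a time
  selector:
    matchLabels:
      app: backend
```

With `minReplicas: 2` and `maxUnavailable: 1`, a drain proceeds one eviction at a time while at least one pod stays available, which is the bounded-disruption behavior the guardrails aim for.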
Anti-Pattern: minReplicas=1 for Critical Services
For critical services in staging/production, minReplicas=1 is a reliability regression:
- no failure tolerance during rollout/drain/restart
- single disruption can remove all healthy capacity
- incident response starts from outage state instead of degraded state
Repo Mapping
Platform repository references:
Current Implementation (This Repo)
- HPA (`autoscaling/v2`) added for backend and frontend in all three environments.
- PDB (`policy/v1`) added for backend and frontend in all three environments.
- staging/production baseline replicas are 2 for backend and frontend.
Drain Day Scenario
Run one controlled node drain rehearsal and compare outcomes:
- with healthy HPA + PDB: drain proceeds with bounded disruption
- with restrictive/incorrect PDB: drain blocks (expected protective behavior)
- with low replica baseline: higher user-facing risk and slower recovery
The objective is to confirm behavior before a real maintenance event.
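One way to run the rehearsal, sketched as kubectl steps (node and namespace names are placeholders; the timeout value is an assumption, not a repo setting):

```shell
# Mark the node unschedulable without evicting anything yet.
kubectl cordon <node-name>

# Check what the PDBs currently allow before draining.
kubectl get pdb -n <namespace>

# Drain with a timeout so a blocked PDB surfaces as an error
# instead of an indefinite stall.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --timeout=5m

# Afterwards, confirm workloads rescheduled, then return the node to service.
kubectl get pods -n <namespace> -o wide
kubectl uncordon <node-name>
```

A drain that times out against a restrictive PDB is the protective behavior described above; record it as an expected outcome rather than treating it as a failure.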
Safe Workflow (Step-by-Step)
- Preflight check HPA status and bounds (`minReplicas`, `maxReplicas`, utilization targets).
- Confirm PDB allowed disruptions before rollout or node drain.
- Simulate/execute one disruption action and observe scaling + availability behavior.
- If constraints conflict (for example drain blocked), adjust configuration safely, not by disabling guardrails.
- Record expected behavior for future maintenance runbooks.
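The preflight steps above can be sketched as read-only checks (the namespace and the `backend` name are placeholders):

```shell
# HPA: current/desired replicas, bounds, and utilization targets.
kubectl get hpa -n <namespace>
kubectl describe hpa backend -n <namespace>

# PDB: the ALLOWED DISRUPTIONS column must be >= 1 before a drain.
kubectl get pdb -n <namespace>

# Rollout state: an in-progress rollout plus a drain compounds disruption.
kubectl rollout status deployment/backend -n <namespace>
```

Because these commands only read state, they are safe to run as a standing preflight before any maintenance action.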
Lab Files
- lab.md
- quiz.md
Done When
- learner can verify HPA target/bounds and current scaling state
- learner can verify PDB allowed disruptions before node drain
- learner can explain interaction: HPA, PDB, rollout, and drain