Chapter 09: Availability Engineering (HPA + PDB)
Incident Hook
A routine node drain starts during moderate traffic.
One critical service has minReplicas=1 and a restrictive PDB.
Drain stalls, rollout queue backs up, and recovery coordination becomes manual.
Availability fails because scaling/disruption settings were not engineered together.
Observed Symptoms
What the team sees first:
- the node drain does not proceed cleanly
- one critical service has no spare healthy capacity
- operators must choose between breaking guardrails or accepting downtime
The incident is caused by settings that look reasonable in isolation but fail together during disruption.
Confusion Phase
HPA and PDB each look like availability features on their own. That is why teams misconfigure them separately.
The real question is:
- is the service under-provisioned
- or is the disruption policy blocking precisely because it is protecting a fragile baseline
Why This Chapter Exists
Replicas alone do not guarantee availability during disruption. This chapter combines:
- HPA for load-based scaling
- PDB for controlled voluntary disruptions
- rollout/drain awareness
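As a concrete sketch of that combination (names and thresholds here are illustrative, not taken from the SafeOps repo), an HPA and PDB engineered together might look like:

```yaml
# Illustrative backend HPA: the floor of 2 replicas means a single
# voluntary disruption never removes all healthy capacity.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
# Illustrative backend PDB: allows one pod at a time to be evicted
# during a drain, because the HPA baseline guarantees a spare replica.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: backend
```

The key coupling: `minReplicas` must exceed `minAvailable`. If the two are equal, allowed disruptions drop to zero at baseline and every voluntary drain stalls.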
What AI Would Propose (Brave Junior)
- “Set `minReplicas: 1` everywhere to save resources.”
- “Disable the PDB for this maintenance window.”
- “Drain first and investigate later if disruption appears.”
Why this sounds reasonable:
- less capacity cost
- fewer blockers during maintenance
Why This Is Dangerous
- low replica baseline removes failure tolerance during rollout/drain events.
- PDB bypass can create avoidable downtime.
- drain without preflight checks turns planned maintenance into an incident.
Investigation
Treat maintenance state as evidence, not inconvenience.
Safe investigation sequence:
- inspect current replica count and HPA bounds
- confirm PDB allowed disruptions before the drain
- compare the planned disruption with the service’s actual tolerance
- decide whether configuration needs correction before maintenance continues
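The arithmetic behind that comparison is simple. A minimal sketch (function names are ours, not a Kubernetes API) of how a PDB-style budget determines whether a one-pod eviction can proceed:

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Mirrors the idea behind the PDB status field disruptionsAllowed:
    how many pods may be voluntarily evicted right now."""
    return max(0, healthy_pods - min_available)


def drain_can_evict(healthy_pods: int, min_available: int) -> bool:
    """A node drain can evict a pod only while the budget is positive."""
    return allowed_disruptions(healthy_pods, min_available) > 0


# Fragile baseline from the incident: one replica, minAvailable=1.
print(allowed_disruptions(1, 1))  # 0 -> the drain stalls
# Healthy baseline: two replicas, minAvailable=1.
print(allowed_disruptions(2, 1))  # 1 -> one eviction at a time
```

This is why the investigation compares planned disruption against actual tolerance: a budget of zero is the PDB working as designed, protecting a baseline that is too thin.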
Containment
Containment protects availability first:
- pause or slow the disruption if spare capacity is insufficient
- adjust configuration safely instead of disabling the PDB blindly
- verify the service returns to a healthy multi-replica baseline
- rerun the maintenance step only when allowed disruptions are clear again
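For instance, instead of deleting the PDB, a safer containment move is to raise the replica floor so the existing budget becomes satisfiable. A hypothetical kustomize-style patch (resource names are illustrative):

```yaml
# Illustrative patch: raise the HPA floor to 2 so the PDB's
# minAvailable: 1 leaves one eviction's worth of headroom.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend
spec:
  minReplicas: 2
```

Once the second replica is healthy, allowed disruptions rise above zero and the stalled drain can proceed without any guardrail being disabled.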
Guardrails That Stop It
- staging/production start from 2 replicas for critical services.
- each service has HPA bounds (`minReplicas`, `maxReplicas`) and resource targets.
- each service has a PDB to prevent unsafe disruption.
- node drain or rollout is never executed without checking PDB/HPA state.
Anti-Pattern: minReplicas=1 for Critical Services
For critical services in staging/production, minReplicas=1 is a reliability regression:
- no failure tolerance during rollout/drain/restart
- single disruption can remove all healthy capacity
- incident response starts from outage state instead of degraded state
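A minimal illustration of the trap (values invented for this example): this pair can never allow a voluntary eviction, so every drain blocks until an operator intervenes.

```yaml
# Anti-pattern: minReplicas equals the PDB's minAvailable, so
# disruptionsAllowed is permanently 0 at baseline load.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: critical-svc
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: critical-svc
  minReplicas: 1
  maxReplicas: 4
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-svc
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: critical-svc
```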
Investigation Snapshots
Here is the backend develop overlay used in the SafeOps system. It contains the HPA and PDB objects that keep maintenance and load spikes from turning into downtime.
Backend HPA and PDB layout
- flux/apps/backend/develop/hpa.yaml
- flux/apps/backend/develop/image-automation.yaml
- flux/apps/backend/develop/image-policy.yaml
- flux/apps/backend/develop/kustomization.yaml
- flux/apps/backend/develop/patches/feature-flags.yaml
- flux/apps/backend/develop/pdb.yaml
Here is the frontend develop overlay that follows the same availability contract.
Frontend HPA and PDB layout
- flux/apps/frontend/overlays/develop/hpa.yaml
- flux/apps/frontend/overlays/develop/image-automation.yaml
- flux/apps/frontend/overlays/develop/image-policy.yaml
- flux/apps/frontend/overlays/develop/kustomization.yaml
- flux/apps/frontend/overlays/develop/namespace.yaml
- flux/apps/frontend/overlays/develop/patches/deployment.yaml
- flux/apps/frontend/overlays/develop/patches/ingress.yaml
- flux/apps/frontend/overlays/develop/pdb.yaml
System Context
This chapter turns scaling and disruption into one reliability contract.
It builds on:
- Chapter 08 resource discipline, which gives HPA meaningful inputs
- Chapter 10 observability, which proves whether disruption stayed bounded
- Chapter 14 operations, where planned maintenance should not become incident improvisation
Expected Baseline
- HPA (`autoscaling/v2`) configured for backend and frontend in all environments.
- PDB (`policy/v1`) configured for backend and frontend in all environments.
- Staging/production baseline replicas are 2+ for both services.
Drain Day Scenario
Run one controlled node drain rehearsal and compare outcomes:
- with healthy HPA + PDB: drain proceeds with bounded disruption
- with restrictive/incorrect PDB: drain blocks (expected protective behavior)
- with low replica baseline: higher user-facing risk and slower recovery
The objective is to confirm behavior before a real maintenance event.
Safe Workflow (Step-by-Step)
- Preflight check HPA status and bounds (`minReplicas`, `maxReplicas`, utilization targets).
- Confirm PDB allowed disruptions before rollout or node drain.
- Simulate/execute one disruption action and observe scaling + availability behavior.
- If constraints conflict (for example drain blocked), adjust configuration safely, not by disabling guardrails.
- Record expected behavior for future maintenance runbooks.
Lab Files
- lab.md
- quiz.md
Done When
- learner can verify HPA target/bounds and current scaling state
- learner can verify PDB allowed disruptions before node drain
- learner can explain interaction: HPA, PDB, rollout, and drain