Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites


What You Will Produce

A reproducible lab result, a completed quiz, and evidence that the service can survive a planned disruption safely.

Chapter 09: Availability Engineering (HPA + PDB)

Incident Hook

A routine node drain starts during moderate traffic. One critical service has minReplicas=1 and a restrictive PDB. The drain stalls, the rollout queue backs up, and recovery coordination becomes manual. Availability fails because scaling and disruption settings were not engineered together.
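A minimal sketch of the kind of pairing that fails together; the names and numbers below are illustrative, not taken from the SafeOps repo:

```yaml
# Hypothetical fragile pairing: each object looks reasonable in isolation.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: critical-api          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: critical-api
  minReplicas: 1              # only one pod at baseline
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-api
spec:
  minAvailable: 1             # with 1 replica, allowed disruptions = 0
  selector:
    matchLabels:
      app: critical-api
```

With one replica and minAvailable: 1, the PDB's allowed disruptions is permanently 0, so every voluntary eviction, including a node drain, blocks.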

Observed Symptoms

What the team sees first:

  • the node drain does not proceed cleanly
  • one critical service has no spare healthy capacity
  • operators must choose between breaking guardrails or accepting downtime

The incident is caused by settings that looked reasonable in isolation but fail together during disruption.

Confusion Phase

HPA and PDB each look like availability features on their own. That is why teams misconfigure them separately.

The real question is:

  • Is the service under-provisioned?
  • Or is the disruption policy blocking precisely because it is protecting a fragile baseline?

Why This Chapter Exists

Replicas alone do not guarantee availability during disruption. This chapter combines:

  • HPA for load-based scaling
  • PDB for controlled voluntary disruptions
  • rollout/drain awareness

What AI Would Propose (Brave Junior)

  • “Set minReplicas: 1 everywhere to save resources.”
  • “Disable PDB for this maintenance window.”
  • “Drain first and investigate later if disruption appears.”

Why this sounds reasonable:

  • less capacity cost
  • fewer blockers during maintenance

Why This Is Dangerous

  • A low replica baseline removes failure tolerance during rollout and drain events.
  • Bypassing the PDB can create avoidable downtime.
  • Draining without preflight checks turns planned maintenance into an incident.

Investigation

Treat maintenance state as evidence, not inconvenience.

Safe investigation sequence:

  1. inspect current replica count and HPA bounds
  2. confirm PDB allowed disruptions before the drain
  3. compare the planned disruption with the service’s actual tolerance
  4. decide whether configuration needs correction before maintenance continues
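The sequence above maps to a handful of read-only commands. This is a sketch against a live cluster; the namespace and workload names are assumptions:

```shell
# Hypothetical names; substitute your namespace and workloads.
# 1. Current replica count and HPA bounds
kubectl -n develop get hpa critical-api
kubectl -n develop get deploy critical-api -o jsonpath='{.status.readyReplicas}'

# 2. PDB allowed disruptions before the drain
kubectl -n develop get pdb critical-api -o jsonpath='{.status.disruptionsAllowed}'

# 3. Full disruption picture: current healthy vs. desired healthy
kubectl -n develop describe pdb critical-api
```

If disruptionsAllowed is already 0 before the drain starts, the maintenance plan conflicts with the service's actual tolerance and step 4 applies.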

Containment

Containment protects availability first:

  1. pause or slow the disruption if spare capacity is insufficient
  2. adjust configuration safely instead of disabling the PDB blindly
  3. verify the service returns to a healthy multi-replica baseline
  4. rerun the maintenance step only when allowed disruptions are clear again
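Adjusting configuration safely usually means raising the replica floor rather than removing the guardrail. A sketch with assumed names:

```shell
# Raise the floor instead of deleting the PDB (illustrative names).
kubectl -n develop patch hpa critical-api --type merge \
  -p '{"spec":{"minReplicas":2}}'

# Wait until the new pod is Ready and the PDB allows a disruption again
kubectl -n develop rollout status deploy/critical-api
kubectl -n develop get pdb critical-api -o jsonpath='{.status.disruptionsAllowed}'
```

Once disruptionsAllowed is at least 1, the paused maintenance step can be rerun.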

Guardrails That Stop It

  • Staging and production start from at least 2 replicas for critical services.
  • Each service has HPA bounds (minReplicas, maxReplicas) and resource targets.
  • Each service has a PDB to prevent unsafe disruption.
  • Node drains and rollouts are never executed without checking PDB/HPA state.
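As a sketch of what those guardrails look like in manifest form (names and thresholds are illustrative, not copied from the SafeOps overlays):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend              # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2             # failure tolerance even at idle
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend
spec:
  maxUnavailable: 1          # a drain may evict one pod at a time
  selector:
    matchLabels:
      app: backend
```

With two ready replicas and maxUnavailable: 1, a drain can always evict one pod while the other keeps serving.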

Anti-Pattern: minReplicas=1 for Critical Services

For critical services in staging/production, minReplicas=1 is a reliability regression:

  • no failure tolerance during rollout/drain/restart
  • single disruption can remove all healthy capacity
  • incident response starts from outage state instead of degraded state

Investigation Snapshots

Here is the backend develop overlay used in the SafeOps system. It contains the HPA and PDB objects that keep maintenance and load spikes from turning into downtime.

Backend HPA and PDB layout

  • flux/apps/backend/develop/hpa.yaml
  • flux/apps/backend/develop/image-automation.yaml
  • flux/apps/backend/develop/image-policy.yaml
  • flux/apps/backend/develop/kustomization.yaml
  • flux/apps/backend/develop/patches/feature-flags.yaml
  • flux/apps/backend/develop/pdb.yaml

Here is the frontend develop overlay that follows the same availability contract.

Frontend HPA and PDB layout

  • flux/apps/frontend/overlays/develop/hpa.yaml
  • flux/apps/frontend/overlays/develop/image-automation.yaml
  • flux/apps/frontend/overlays/develop/image-policy.yaml
  • flux/apps/frontend/overlays/develop/kustomization.yaml
  • flux/apps/frontend/overlays/develop/namespace.yaml
  • flux/apps/frontend/overlays/develop/patches/deployment.yaml
  • flux/apps/frontend/overlays/develop/patches/ingress.yaml
  • flux/apps/frontend/overlays/develop/pdb.yaml

System Context

This chapter turns scaling and disruption into one reliability contract.

It connects to:

  • Chapter 08 resource discipline, which gives HPA meaningful inputs
  • Chapter 10 observability, which proves whether disruption stayed bounded
  • Chapter 14 operations, where planned maintenance should not become incident improvisation

Expected Baseline

  • HPA (autoscaling/v2) configured for backend and frontend in all environments.
  • PDB (policy/v1) configured for backend and frontend in all environments.
  • Staging/production baseline replicas are 2+ for both services.

Drain Day Scenario

Run one controlled node drain rehearsal and compare outcomes:

  • with healthy HPA + PDB: drain proceeds with bounded disruption
  • with restrictive/incorrect PDB: drain blocks (expected protective behavior)
  • with low replica baseline: higher user-facing risk and slower recovery

The objective is to confirm behavior before a real maintenance event.
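A controlled rehearsal can be as small as cordoning and draining one non-production node. The node name here is hypothetical:

```shell
# Controlled drain rehearsal on a non-production node (hypothetical name).
kubectl cordon worker-2
kubectl drain worker-2 --ignore-daemonsets --delete-emptydir-data --timeout=120s

# If the PDB blocks, drain reports an eviction error such as
# "Cannot evict pod ... violates PodDisruptionBudget" - that is the
# protective behavior this scenario is designed to surface.

kubectl uncordon worker-2
```

Record which of the three outcomes you observed so the real maintenance event holds no surprises.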

Safe Workflow (Step-by-Step)

  1. Preflight check HPA status and bounds (minReplicas, maxReplicas, utilization targets).
  2. Confirm PDB allowed disruptions before rollout or node drain.
  3. Simulate/execute one disruption action and observe scaling + availability behavior.
  4. If constraints conflict (for example drain blocked), adjust configuration safely, not by disabling guardrails.
  5. Record expected behavior for future maintenance runbooks.
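Steps 1-2 can be captured in a small preflight gate that refuses to proceed when the PDB would block eviction. This is a sketch with assumed namespace and service names:

```shell
#!/usr/bin/env sh
# Preflight gate (assumed names): abort maintenance when the PDB
# would block eviction of the critical service.
ns=develop
allowed=$(kubectl -n "$ns" get pdb critical-api \
  -o jsonpath='{.status.disruptionsAllowed}')
if [ "${allowed:-0}" -lt 1 ]; then
  echo "Preflight failed: PDB allows ${allowed:-0} disruptions; fix capacity first." >&2
  exit 1
fi
echo "Preflight passed: $allowed disruption(s) allowed."
```

Running this before every drain or rollout turns step 4's "adjust configuration safely" into a forced checkpoint rather than an incident-time decision.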

Lab Files

  • lab.md
  • quiz.md

Done When

  • the learner can verify HPA targets/bounds and the current scaling state
  • the learner can verify PDB allowed disruptions before a node drain
  • the learner can explain the interaction between HPA, PDB, rollout, and drain

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Lab: HPA + PDB + Node Drain Readiness
  • Quiz: Chapter 09 (Availability Engineering)