Incident Hook
A routine node drain starts during moderate traffic. One critical service runs with minReplicas=1 on its autoscaler and a restrictive PodDisruptionBudget (PDB). The drain stalls, the rollout queue backs up, and recovery coordination becomes a manual process. Availability fails because scaling and disruption settings were not engineered to work together.
Result: Settings that looked reasonable in isolation fail together during a disruption, turning planned maintenance into an incident.
Observed Symptoms
What the team sees first:
- The node drain does not proceed cleanly.
- One critical service has no spare healthy capacity.
- Operators must choose between breaking guardrails or accepting downtime.
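These symptoms can usually be confirmed in a few read-only commands before anyone touches guardrails. A minimal diagnostic sketch (the `checkout-pdb` name and `prod` namespace are hypothetical placeholders):

```shell
# List every PDB and its currently allowed disruptions; a blocked drain
# typically corresponds to ALLOWED DISRUPTIONS = 0 on some budget.
kubectl get pdb --all-namespaces

# Compare each HPA's replica floor with the actual replica count.
kubectl get hpa --all-namespaces

# Inspect the specific budget that is refusing evictions.
kubectl describe pdb checkout-pdb -n prod
```

If a budget shows zero allowed disruptions while its service sits at its autoscaler floor, the drain is blocked by configuration, not by a failing component.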
No single component has failed; the incident is caused by misaligned availability settings.
Confusion Phase
The Horizontal Pod Autoscaler (HPA) and the PDB each look like availability features on their own, which is why teams often configure them separately. The real question is:
- Is the service under-provisioned?
- Or is the disruption policy (PDB) blocking the drain precisely because it is trying to protect a fragile, single-replica baseline?
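The deadlock can be reproduced with a pair of manifests like the following sketch (service name, labels, and metric targets are hypothetical). The HPA allows the service to shrink to one replica; the PDB requires one pod available. With a single pod running, the budget's allowed disruptions is 1 - 1 = 0, so the drain's eviction request is refused indefinitely:

```yaml
# HPA floor of 1: the service can legally run with a single pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa        # hypothetical service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
# PDB requiring 1 pod available: at 1 running replica, allowed
# disruptions = 1 - 1 = 0, so every eviction during a drain is rejected.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: checkout
```

Each object is individually defensible; together they guarantee that planned maintenance cannot proceed without either scaling up or violating the budget.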
Anti-Pattern: minReplicas=1 for Critical Services
In staging and production, setting minReplicas=1 for critical services is a reliability regression:
- Zero failure tolerance during rollout, drain, or restart events.
- A single disruption can remove all healthy capacity from the service.
- Incident response starts from an outage state instead of a degraded state.
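One way to realign the two settings, sketched with the same hypothetical names as above: raise the autoscaler floor to an N+1 baseline and express the budget as a tolerable loss rather than a fixed floor. With two or more replicas and `maxUnavailable: 1`, a drain can always evict one pod while a healthy pod keeps serving:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 2        # at least one spare replica at all times
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  maxUnavailable: 1     # with >= 2 replicas, one pod is always evictable
  selector:
    matchLabels:
      app: checkout
```

The point is not these specific numbers but the invariant: the HPA floor must exceed the capacity the PDB is allowed to protect, or drains will deadlock.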
What AI Would Propose (Brave Junior):
- “Set minReplicas: 1 everywhere to save resources.”
- “Disable PDB for this maintenance window.”
- “Drain first and investigate later if disruption appears.”
Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?