Incident Hook
A routine node drain starts during moderate traffic. One critical service runs with minReplicas=1 on its autoscaler and a restrictive PodDisruptionBudget (PDB). The drain stalls, the rollout queue backs up, and recovery coordination becomes a manual process. Availability fails because scaling and disruption settings were not engineered to work together.
Result: Settings that looked reasonable in isolation fail together during a disruption, turning planned maintenance into an incident.
Observed Symptoms
What the team sees first:
- The node drain does not proceed cleanly.
- One critical service has no spare healthy capacity.
- Operators must choose between breaking guardrails or accepting downtime.
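These symptoms can usually be confirmed in a few read-only commands before anyone touches guardrails. A minimal diagnostic sketch (the `checkout-pdb` name and `prod` namespace are hypothetical placeholders):

```shell
# List every PDB and its currently allowed disruptions; a blocked drain
# typically corresponds to ALLOWED DISRUPTIONS = 0 on some budget.
kubectl get pdb --all-namespaces

# Compare each HPA's replica floor with the actual replica count.
kubectl get hpa --all-namespaces

# Inspect the specific budget that is refusing evictions.
kubectl describe pdb checkout-pdb -n prod
```

If a budget shows zero allowed disruptions while its service sits at its autoscaler floor, the drain is blocked by configuration, not by a failing component.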
No single component has failed; the incident is caused by misaligned availability settings.
Confusion Phase
The Horizontal Pod Autoscaler (HPA) and the PDB each look like availability features on their own, which is why teams often configure them separately. The real question is:
- Is the service under-provisioned?
- Or is the disruption policy (PDB) blocking the drain precisely because it is trying to protect a fragile, single-replica baseline?
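The deadlock can be reproduced with a pair of manifests like the following sketch (service name, labels, and metric targets are hypothetical). The HPA allows the service to shrink to one replica; the PDB requires one pod available. With a single pod running, the budget's allowed disruptions is 1 - 1 = 0, so the drain's eviction request is refused indefinitely:

```yaml
# HPA floor of 1: the service can legally run with a single pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa        # hypothetical service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
# PDB requiring 1 pod available: at 1 running replica, allowed
# disruptions = 1 - 1 = 0, so every eviction during a drain is rejected.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: checkout
```

Each object is individually defensible; together they guarantee that planned maintenance cannot proceed without either scaling up or violating the budget.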
Anti-Pattern: minReplicas=1 for Critical Services
In staging and production, setting minReplicas=1 for critical services is a reliability regression:
- Zero failure tolerance during rollout, drain, or restart events.
- A single disruption can remove all healthy capacity from the service.
- Incident response starts from an outage state instead of a degraded state.
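One way to realign the two settings, sketched with the same hypothetical names as above: raise the autoscaler floor to an N+1 baseline and express the budget as a tolerable loss rather than a fixed floor. With two or more replicas and `maxUnavailable: 1`, a drain can always evict one pod while a healthy pod keeps serving:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 2        # at least one spare replica at all times
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  maxUnavailable: 1     # with >= 2 replicas, one pod is always evictable
  selector:
    matchLabels:
      app: checkout
```

The point is not these specific numbers but the invariant: the HPA floor must exceed the capacity the PDB is allowed to protect, or drains will deadlock.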
What AI Would Propose (Brave Junior):
- “Set minReplicas: 1 everywhere to save resources.”
- “Disable PDB for this maintenance window.”
- “Drain first and investigate later if disruption appears.”
Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?