Investigation
Treat maintenance state and blocking events as evidence, not as an inconvenience.
Safe investigation sequence:
- Inspect current state: Check the current replica count and current HPA scaling status.
- Confirm PDB allowance: Check kubectl get pdb to see how many “allowed disruptions” remain.
- Compare settings: Compare the planned disruption (e.g., draining a node) with the service’s actual tolerance.
- Identify the conflict: Determine if the PDB is blocking the drain because the service is already at its minimum replica count.
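The conflict in the last step comes down to simple arithmetic. The sketch below is a simplified model, not the Kubernetes controller logic: it assumes a PDB that uses minAvailable as an absolute count and that every current replica is healthy. The function names and the example numbers are hypothetical.

```python
def allowed_disruptions(current_healthy: int, min_available: int) -> int:
    """Voluntary evictions a minAvailable-style PDB would permit right now.

    Simplified model: healthy pods above the PDB floor, never below zero.
    """
    return max(0, current_healthy - min_available)


def drain_conflict(current_replicas: int, hpa_min_replicas: int, min_available: int) -> bool:
    """True when the service sits at its HPA floor and the PDB permits no evictions."""
    at_floor = current_replicas <= hpa_min_replicas
    return at_floor and allowed_disruptions(current_replicas, min_available) == 0


# Hypothetical service: HPA minReplicas=2, PDB minAvailable=2, 2 healthy pods.
print(allowed_disruptions(current_healthy=2, min_available=2))
print(drain_conflict(current_replicas=2, hpa_min_replicas=2, min_available=2))
```

With these hypothetical values the HPA floor equals the PDB floor, so the service has no headroom and a node drain that needs to evict one of its pods would stall.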
Containment
Containment is about protecting availability while resolving maintenance blockers.
Containment steps:
- Pause the disruption: Stop the node drain if it is stalling and impacting other workloads.
- Adjust safely: Increase minReplicas or relax the PDB safely (after peer review) rather than disabling the guardrail blindly.
- Verify health: Ensure the service returns to a healthy multi-replica baseline before resuming maintenance.
- Re-run correctly: Resume the maintenance step only after the PDB reports allowed disruptions greater than zero.
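The resume condition in the final two steps can be expressed as a small gate. This is a minimal sketch, assuming the PDB has been exported with kubectl get pdb <name> -o json and that the team has agreed on a healthy baseline replica count. The sample document, the PDB name example-pdb, the baseline value, and the function name safe_to_resume are all hypothetical; the status fields currentHealthy and disruptionsAllowed are the ones a PodDisruptionBudget reports.

```python
import json

# Hypothetical, trimmed PDB document in the shape of `kubectl get pdb <name> -o json`.
SAMPLE_PDB_JSON = """
{
  "metadata": {"name": "example-pdb"},
  "status": {
    "currentHealthy": 3,
    "desiredHealthy": 2,
    "disruptionsAllowed": 1,
    "expectedPods": 3
  }
}
"""


def safe_to_resume(pdb: dict, baseline_replicas: int) -> bool:
    """Gate for resuming maintenance.

    Requires the service to be back at its agreed healthy baseline and the
    PDB to report at least one allowed disruption. Missing fields count as 0.
    """
    status = pdb.get("status", {})
    healthy = status.get("currentHealthy", 0)
    allowed = status.get("disruptionsAllowed", 0)
    return healthy >= baseline_replicas and allowed > 0


pdb = json.loads(SAMPLE_PDB_JSON)
print(safe_to_resume(pdb, baseline_replicas=3))
```

Treating missing status fields as zero makes the gate fail closed: if the PDB status cannot be read, maintenance does not resume.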
Pause and Predict: What automated guardrail would have prevented this incident entirely?