Investigation & Containment | SafeOps Academy

Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

Reading: 20-25 min
Lab: 45-60 min
Quiz: 10-15 min

Prerequisites

Previous chapter completed: Chapter 08: Resource Management & QoS.
Working access to the target `develop` namespace and its tooling.

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Investigation

Treat maintenance state and blocking events as evidence, not as an inconvenience.

Safe investigation sequence:

Inspect current state: Check the current replica count and current HPA scaling status.
Confirm PDB allowance: Run `kubectl get pdb` to see how many allowed disruptions remain.
Compare settings: Compare the planned disruption (e.g., draining a node) with the service’s actual tolerance.
Identify the conflict: Determine whether the PDB is blocking the drain because the service is already at its minimum replica count.
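
The sequence above can be sketched as a small classification step. This is a minimal illustration, not part of the course tooling: the function name `diagnose_pdb_block` and its return labels are hypothetical. It assumes the PDB status has already been parsed into a dict (for example from `kubectl get pdb <name> -n develop -o json`), using the `disruptionsAllowed`, `currentHealthy`, and `desiredHealthy` status fields, and that the current and minimum replica counts were read from the HPA.

```python
def diagnose_pdb_block(pdb_status: dict, current_replicas: int, min_replicas: int) -> str:
    """Classify why a drain may be blocked (hypothetical helper for illustration).

    pdb_status: the `.status` object of a PodDisruptionBudget, parsed from JSON.
    current_replicas / min_replicas: values read from the HPA for the same service.
    """
    allowed = pdb_status.get("disruptionsAllowed", 0)
    healthy = pdb_status.get("currentHealthy", 0)
    desired = pdb_status.get("desiredHealthy", 0)

    # Any positive allowance means the PDB itself is not blocking an eviction.
    if allowed > 0:
        return "not-blocked"

    # The service sits at its replica floor and has no headroom above the
    # number of healthy pods the PDB requires.
    if current_replicas <= min_replicas and healthy <= desired:
        return "blocked-at-minimum"

    # Replicas exist above the floor, but some of them are not healthy.
    if healthy < current_replicas:
        return "blocked-unhealthy-pods"

    return "blocked-other"


# Example: two replicas, HPA floor of two, PDB requires two healthy pods.
status = {"disruptionsAllowed": 0, "currentHealthy": 2, "desiredHealthy": 2}
print(diagnose_pdb_block(status, current_replicas=2, min_replicas=2))
```

Keeping the diagnosis separate from any action reflects the evidence-first approach of this section: the result only describes the conflict and changes nothing in the cluster.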

Containment

Containment is about protecting availability while resolving maintenance blockers.

Containment steps:

Pause the disruption: Stop the node drain if it is stalling and impacting other workloads.
Adjust safely: Increase minReplicas or relax the PDB only after peer review, rather than disabling the guardrail blindly.
Verify health: Ensure the service returns to a healthy multi-replica baseline before resuming maintenance.
Re-run correctly: Resume the maintenance step only after the PDB reports more than zero allowed disruptions.
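
The resume decision in the steps above can be expressed as an explicit gate. The sketch below is an assumption-laden illustration: `min_replicas_for_one_disruption` and `safe_to_resume` are hypothetical names, the PDB is assumed to use an absolute integer `minAvailable` (not a percentage), and the inputs are assumed to have been read from the cluster beforehand.

```python
def min_replicas_for_one_disruption(min_available: int) -> int:
    """Smallest healthy replica count that leaves one allowed disruption.

    Assumes the PDB uses an integer minAvailable: the allowance is the
    number of healthy pods minus minAvailable, so one spare pod is needed.
    """
    return min_available + 1


def safe_to_resume(pdb_status: dict, ready_replicas: int, baseline_replicas: int) -> bool:
    """Return True only when maintenance may resume (hypothetical gate).

    pdb_status: the `.status` object of the PodDisruptionBudget.
    ready_replicas: pods currently ready for the service.
    baseline_replicas: the agreed healthy replica count for the service.
    """
    allowed = pdb_status.get("disruptionsAllowed", 0)
    multi_replica = baseline_replicas >= 2          # a single replica is not a safe baseline
    back_to_baseline = ready_replicas >= baseline_replicas
    return allowed > 0 and multi_replica and back_to_baseline


# Example: a PDB with minAvailable of 2 needs at least 3 healthy replicas.
print(min_replicas_for_one_disruption(2))
print(safe_to_resume({"disruptionsAllowed": 1}, ready_replicas=3, baseline_replicas=3))
```

A gate like this can run before each drain retry, so that maintenance resumes based on observed state instead of on elapsed time.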

Pause and Predict: What automated guardrail would have prevented this incident entirely?