Investigation & Containment | SafeOps Academy

Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

Reading: 20-25 min
Lab: 45-60 min
Quiz: 10-15 min

Prerequisites

Previous chapter completed: Chapter 10: Observability (Metrics, Logs, Traces).
Working access to the target `develop` namespace and its tooling.

Source Code References

cnpg-clusters/ Members
main.tf Members

What You Will Produce

A reproducible lab result, a passing quiz, and evidence that the recovery path can be operated safely during an incident.

Investigation

Treat restore validation as the real test of a backup, not the status line that reports the backup as completed.

Safe investigation sequence:

Confirm Artifact: Verify the backup artifact exists and matches the retention policy.
Restore Locally: Restore into an isolated, non-production target (e.g., a restore-test namespace).
Verify Data: Prove the application can actually use the restored data (schema, permissions, sample reads/writes).
Identify Gaps: If the restore or the verification fails, pinpoint why (e.g., missing role grants or bootstrap scripts).
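
The sequence above can be sketched as an ordered list of checks that stops at the first failure, so the failing step is reported as the gap. This is a minimal illustration, not the course's tooling: every name here (`StepResult`, `run_sequence`, `first_gap`, `example_steps`) is hypothetical, the 7-day retention window is an assumed value, and the checks are stubs standing in for real cluster and database queries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional, Tuple

# A check returns (passed, human-readable detail). Hypothetical shape.
Check = Callable[[], Tuple[bool, str]]


@dataclass
class StepResult:
    name: str
    passed: bool
    detail: str


def artifact_within_retention(completed_at: datetime,
                              retention: timedelta,
                              now: datetime) -> bool:
    """True when the newest backup is no older than the retention window."""
    return now - completed_at <= retention


def run_sequence(steps: List[Tuple[str, Check]]) -> List[StepResult]:
    """Run checks in order and stop at the first failure."""
    results: List[StepResult] = []
    for name, check in steps:
        passed, detail = check()
        results.append(StepResult(name, passed, detail))
        if not passed:
            break  # later steps are meaningless until this gap is fixed
    return results


def first_gap(results: List[StepResult]) -> Optional[StepResult]:
    """Return the failing step, or None when every step passed."""
    for result in results:
        if not result.passed:
            return result
    return None


def example_steps(now: datetime) -> List[Tuple[str, Check]]:
    """Stubbed checks; real ones would query the cluster and the database."""
    completed_at = now - timedelta(days=2)  # assumed backup age
    retention = timedelta(days=7)           # assumed retention policy
    return [
        ("confirm_artifact",
         lambda: (artifact_within_retention(completed_at, retention, now),
                  "newest backup is 2 days old")),
        ("restore_isolated",
         lambda: (True, "restored into the restore-test namespace")),
        ("verify_data",
         lambda: (False, "application role lacks grants on restored tables")),
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    gap = first_gap(run_sequence(example_steps(now)))
    print("no gap found" if gap is None else f"gap at {gap.name}: {gap.detail}")
```

Keeping each result, rather than only a pass/fail flag, leaves a record that can serve as restore evidence later.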

Containment

Containment is about restoring confidence in the recovery path before touching production.

Containment steps:

Gate Decisions: Gate every production recovery decision on evidence of a successful non-production restore.
Fix Gaps: Address permission, schema, or bootstrap gaps in the lower environment first.
Document the Path: Write down the validated restore steps once they are repeatable.
Promote Procedure: Only promote the recovery procedure once it is proven reliable.
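
One way to make "proven reliable" concrete is to require a streak of recent, fully verified non-production restores before the procedure is promoted. The sketch below assumes a threshold of three consecutive successes; that number, the `RestoreEvidence` record, and the `procedure_is_promotable` function are illustrative assumptions, not part of the course material.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RestoreEvidence:
    environment: str    # e.g. "restore-test"; hypothetical label
    restore_ok: bool    # the restore itself completed
    app_verified: bool  # the application could use the restored data


def procedure_is_promotable(history: List[RestoreEvidence],
                            required_consecutive: int = 3) -> bool:
    """Allow promotion only after enough consecutive verified restores.

    `history` is ordered oldest to newest. Production runs are ignored,
    because the gate must rest on non-production evidence only.
    """
    streak = 0
    for run in reversed(history):
        if run.environment == "production":
            continue
        if run.restore_ok and run.app_verified:
            streak += 1
        else:
            break  # the most recent failure resets confidence
    return streak >= required_consecutive


if __name__ == "__main__":
    ok = RestoreEvidence("restore-test", True, True)
    unverified = RestoreEvidence("restore-test", True, False)
    print(procedure_is_promotable([unverified, ok, ok, ok]))
    print(procedure_is_promotable([ok, ok, unverified]))
```

A restore that completes but fails application verification counts as a failure here, which matches the point that the restored data must be usable, not merely present.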

Pause and Predict: What automated guardrail would have prevented this incident entirely?