The Incident: Schrödinger's Backup

Incident Hook

A backup job reports success, but a real restore attempt fails under pressure. The objects are present in storage, yet the restored data is unusable because of a permission or schema mismatch. The service stays degraded because the existence of a backup was mistaken for proof of recoverability.

Result: The team realizes too late that a “green” backup status does not equal a functional recovery plan.

Observed Symptoms

What the team sees first:

The backup job is “green” (successful).
Restore artifacts (files) exist in S3.
The restored service still cannot function correctly (e.g., the app cannot connect, or data is missing).

Lesson: The presence of a backup is not proof of recovery.

Confusion Phase

The team now has two competing stories:

The backup system worked because artifacts exist.
The recovery path failed because the data is not operationally usable.

If these two stories are not clearly separated, teams declare success too early and then fail during a real crisis.
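
One way to keep the two stories apart is to record them as separate signals and to treat recovery as proven only when the restore-side signals hold. The following is a minimal sketch, not part of any backup tool: the `RecoveryEvidence` class, its field names, and the two helper functions are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryEvidence:
    """Hypothetical record of what has actually been verified."""
    backup_job_green: bool    # the backup job reported success
    artifact_present: bool    # restore artifacts exist in storage
    restore_exit_ok: bool     # the restore command exited successfully
    app_validation_ok: bool   # the app worked against the restored data


def backup_story(e: RecoveryEvidence) -> bool:
    """Story 1: the backup system worked because artifacts exist."""
    return e.backup_job_green and e.artifact_present


def recovery_story(e: RecoveryEvidence) -> bool:
    """Story 2: the restored data is operationally usable."""
    return e.restore_exit_ok and e.app_validation_ok


# The incident described here: everything looks fine until the app is exercised.
incident = RecoveryEvidence(
    backup_job_green=True,
    artifact_present=True,
    restore_exit_ok=True,
    app_validation_ok=False,
)
print(backup_story(incident), recovery_story(incident))  # prints: True False
```

Keeping the two functions separate makes it harder to report "backup worked" as if it answered the question "can we recover?".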

Bad Restore Example

Observed failure pattern:

Backup artifact exists and the restore command exits successfully.
The restored DB is missing required role grants or is incompatible with the schema the app expects.
App starts but fails at runtime with authorization or schema errors.

Lesson: “Restore completed” is not recovery proof without data and app-level validation.
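
The failure pattern above suggests what a post-restore validation step could check: role grants, schema compatibility, and data presence. The sketch below is illustrative only. It assumes the state of the restored database has already been collected into a plain dictionary (in a real drill, that would come from querying the restored instance); the function name `validate_restore`, the dictionary keys, and the sample values such as `app_user` and `orders` are all hypothetical.

```python
def validate_restore(restored: dict, expected: dict) -> list[str]:
    """Return a list of problems; an empty list means validation passed."""
    problems = []

    # Role grants: every (role, privilege, table) the app relies on must exist.
    missing = sorted(set(expected["grants"]) - set(restored.get("grants", ())))
    for role, privilege, table in missing:
        problems.append(f"missing grant: {privilege} on {table} for {role}")

    # Schema compatibility: the restored version must match what the app expects.
    if restored.get("schema_version") != expected["schema_version"]:
        problems.append(
            f"schema version {restored.get('schema_version')!r}, "
            f"expected {expected['schema_version']!r}"
        )

    # Data presence: required tables must exist and must not be empty.
    for table in expected["non_empty_tables"]:
        if restored.get("row_counts", {}).get(table, 0) == 0:
            problems.append(f"table {table} is missing or empty")

    return problems


# Hypothetical expectations and a restore that "completed" but lost its grants
# and came back on an older schema version.
expected = {
    "grants": [("app_user", "SELECT", "orders")],
    "schema_version": 42,
    "non_empty_tables": ["orders"],
}
restored = {"grants": [], "schema_version": 41, "row_counts": {"orders": 1200}}

for problem in validate_restore(restored, expected):
    print(problem)
```

In this example the restore reports two problems (a missing grant and a schema version mismatch) even though the data itself is present, which is the situation where "restore completed" would otherwise be read as success.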

What AI Would Propose (Brave Junior):

“Backup job is green, so recovery is guaranteed.”
“Skip the restore drill; it takes too long.”
“Restore in production directly when the incident starts.”

Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?

