Lab: Full Incident Lifecycle (24/7 SRE)

Goal

Run one full lifecycle simulation:

detect
triage
mitigate
recover
postmortem
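
As an illustration, the five phases can be modelled as an ordered sequence that cannot be skipped or reversed. This is a minimal sketch only; `Phase` and `advance` are hypothetical names and not part of any lab tooling.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Lifecycle phases in the order the lab walks through them."""
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RECOVER = 4
    POSTMORTEM = 5


def advance(current: Phase, target: Phase) -> Phase:
    """Move to the next phase; refuse to skip ahead or go backwards."""
    if target != current + 1:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

For example, `advance(Phase.DETECT, Phase.TRIAGE)` succeeds, while jumping from triage straight to recover raises `ValueError`.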

Scenario Input

Use one recent controlled scenario (Chapter 11 or 12 is recommended):

backend crash/panic pattern, or
elevated 5xx with recurring incidents

Step 1: Incident Declaration

Define:

severity (SEV-1/SEV-2/SEV-3)
blast radius
IC and responder roles
comms channel and update cadence
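
The declaration fields above could be captured in a small record so that nothing is left blank. The sketch below is an assumption about how a team might structure it; the class name, field names, and the sample values in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Dict, List, Optional


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


@dataclass
class IncidentDeclaration:
    severity: Severity
    blast_radius: str               # who or what is affected
    incident_commander: str         # the IC
    responders: Dict[str, str]      # role -> person
    comms_channel: str
    update_cadence: timedelta
    declared_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def missing_fields(self) -> List[str]:
        """Return the names of declaration fields that are still empty."""
        missing = []
        if not self.blast_radius.strip():
            missing.append("blast_radius")
        if not self.incident_commander.strip():
            missing.append("incident_commander")
        if not self.responders:
            missing.append("responders")
        if not self.comms_channel.strip():
            missing.append("comms_channel")
        if self.update_cadence <= timedelta(0):
            missing.append("update_cadence")
        return missing

    def next_update_due(self, last_update: Optional[datetime] = None) -> datetime:
        """When the next status update is owed on the comms channel."""
        return (last_update or self.declared_at) + self.update_cadence
```

A declaration such as `IncidentDeclaration(Severity.SEV2, "checkout API", "alice", {"ops": "bob"}, "#inc-lab", timedelta(minutes=15))` is complete when `missing_fields()` returns an empty list.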

Step 2: Evidence Collection

Capture:

symptom metrics
representative trace(s)
correlated log evidence
guardian incident ID (if available)
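
A simple evidence bundle can make gaps visible before any mitigation is chosen. This is a sketch under assumptions: the `Evidence` class and its field names are hypothetical, and the guardian incident id is treated as optional because the list above marks it "if available".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Evidence:
    symptom_metrics: Dict[str, float] = field(default_factory=dict)  # metric name -> observed value
    trace_ids: List[str] = field(default_factory=list)               # representative traces
    log_excerpts: List[str] = field(default_factory=list)            # correlated log lines
    guardian_incident_id: Optional[str] = None                       # optional

    def gaps(self) -> List[str]:
        """List the required evidence types that have not been captured yet."""
        gaps = []
        if not self.symptom_metrics:
            gaps.append("symptom metrics")
        if not self.trace_ids:
            gaps.append("representative trace")
        if not self.log_excerpts:
            gaps.append("correlated log evidence")
        return gaps
```

An empty `gaps()` result means all three required evidence types are present; a missing guardian incident id is never reported as a gap.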

Step 3: Mitigation Decision

Choose one:

rollback
config change
scale/traffic control
observe-only with timebox

Record why this action is the safest option given the evidence collected in Step 2.
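
The decision and its rationale could be recorded together, with a check that an observe-only choice always carries a timebox. The names below (`Mitigation`, `MitigationDecision`, `problems`) are hypothetical, and this is only one possible shape for the record.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import List, Optional


class Mitigation(Enum):
    ROLLBACK = "rollback"
    CONFIG_CHANGE = "config change"
    TRAFFIC_CONTROL = "scale/traffic control"
    OBSERVE_ONLY = "observe-only"


@dataclass
class MitigationDecision:
    action: Mitigation
    rationale: str                        # why this action is safest
    timebox: Optional[timedelta] = None   # required for observe-only

    def problems(self) -> List[str]:
        """Return reasons this decision record is incomplete."""
        issues = []
        if not self.rationale.strip():
            issues.append("rationale is missing")
        if self.action is Mitigation.OBSERVE_ONLY and not self.timebox:
            issues.append("observe-only requires a timebox")
        return issues
```

For instance, `MitigationDecision(Mitigation.OBSERVE_ONLY, "errors already trending down")` reports that a timebox is required, while the same decision with `timebox=timedelta(minutes=20)` reports no problems.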

Step 4: Recovery Verification

Confirm:

metric recovery
trace duration and error-rate normalization
no repeating critical logs for the same fingerprint
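
The three confirmations above can be expressed as explicit checks over post-mitigation samples. This is a hedged sketch: the function name, its parameters, and the default thresholds (1% error rate, 500 ms p95) are illustrative assumptions and should be replaced with the service's own baselines.

```python
from typing import Dict, Sequence


def verify_recovery(
    error_rates: Sequence[float],           # post-mitigation error-rate samples (0.0 to 1.0)
    p95_ms: Sequence[float],                # post-mitigation p95 trace durations in ms
    post_fix_fingerprints: Sequence[str],   # fingerprints of critical logs seen after the fix
    incident_fingerprint: str,              # fingerprint that characterised the incident
    *,
    max_error_rate: float = 0.01,           # assumed threshold
    max_p95_ms: float = 500.0,              # assumed threshold
) -> Dict[str, bool]:
    """Return one pass/fail result per recovery check.

    An empty sample list counts as a failure: no data is not proof of recovery.
    """
    return {
        "metrics_recovered": bool(error_rates) and max(error_rates) <= max_error_rate,
        "traces_normalized": bool(p95_ms) and max(p95_ms) <= max_p95_ms,
        "fingerprint_quiet": incident_fingerprint not in post_fix_fingerprints,
    }
```

Recovery is verified only when every value in the returned dictionary is `True`, for example via `all(verify_recovery(...).values())`.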

Step 5: Postmortem

Complete postmortem-template.md:

timeline
root/contributing factors
what worked, what failed
action items

Hard Stop Conditions

Stop and fix the gap before continuing if any of these is true:

mitigation applied without evidence
no assigned owner for critical action item
incident closed without recovery verification
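
These conditions lend themselves to an automated check against the incident record. The sketch below assumes a hypothetical record shape (`IncidentRecord`, `ActionItem`); the real record format used in the lab may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    critical: bool
    owner: Optional[str] = None


@dataclass
class IncidentRecord:
    evidence_count: int          # number of evidence items captured in Step 2
    mitigation_applied: bool
    recovery_verified: bool
    closed: bool
    action_items: List[ActionItem] = field(default_factory=list)


def hard_stops(rec: IncidentRecord) -> List[str]:
    """Return every hard stop condition the record currently violates."""
    stops = []
    if rec.mitigation_applied and rec.evidence_count == 0:
        stops.append("mitigation applied without evidence")
    if any(item.critical and not item.owner for item in rec.action_items):
        stops.append("no assigned owner for critical action item")
    if rec.closed and not rec.recovery_verified:
        stops.append("incident closed without recovery verification")
    return stops
```

An empty list means no hard stop applies; any non-empty result names the conditions that need attention.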

Done When

complete incident record exists
postmortem is blameless and actionable
at least one prevention action is accepted into backlog