Lab: Full Incident Lifecycle (24/7 SRE)

Goal

Run one full lifecycle simulation:

  • detect
  • triage
  • mitigate
  • recover
  • postmortem

Scenario Input

Use one recent controlled scenario, preferably from Chapter 11 or 12:

  • backend crash/panic pattern, or
  • elevated 5xx with recurring incidents

Step 1: Incident Declaration

Define:

  • severity (SEV-1/SEV-2/SEV-3)
  • blast radius
  • IC and responder roles
  • comms channel and update cadence
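
The declaration fields above can be captured in a small record so that nothing is left blank before work starts. This is a minimal sketch, not a prescribed format: `IncidentDeclaration`, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


@dataclass
class IncidentDeclaration:
    """Hypothetical Step 1 record; field names are illustrative only."""
    severity: Severity
    blast_radius: str          # which services or users are affected
    incident_commander: str    # the IC
    responders: dict[str, str] = field(default_factory=dict)  # role -> person
    comms_channel: str = ""
    update_cadence_minutes: int = 0

    def missing_fields(self) -> list[str]:
        """Return the names of declaration fields that are still empty."""
        missing = []
        if not self.blast_radius.strip():
            missing.append("blast_radius")
        if not self.incident_commander.strip():
            missing.append("incident_commander")
        if not self.responders:
            missing.append("responders")
        if not self.comms_channel.strip():
            missing.append("comms_channel")
        if self.update_cadence_minutes <= 0:
            missing.append("update_cadence_minutes")
        return missing


# Example values below are invented for the lab.
declaration = IncidentDeclaration(
    severity=Severity.SEV2,
    blast_radius="checkout API, all regions",
    incident_commander="alice",
    responders={"ops": "bob", "comms": "carol"},
    comms_channel="#inc-lab",
    update_cadence_minutes=30,
)
print(declaration.missing_fields())  # prints []
```

An empty list means the declaration is complete; any returned name is a field to fill in before moving to Step 2.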

Step 2: Evidence Collection

Capture:

  • symptom metrics
  • representative trace(s)
  • correlated log evidence
  • guardian incident ID (if available)
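
One way to keep the evidence together is a single bundle that reports what is still missing. The sketch below assumes a hypothetical `Evidence` class; the metric name, trace ID, and field names are invented, and the guardian incident ID is treated as optional, matching the list above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical Step 2 evidence bundle; names are illustrative only."""
    symptom_metrics: dict[str, float] = field(default_factory=dict)
    trace_ids: list[str] = field(default_factory=list)
    log_fingerprints: list[str] = field(default_factory=list)
    guardian_incident_id: Optional[str] = None  # optional, "if available"

    def gaps(self) -> list[str]:
        """List the required evidence types that have not been captured yet."""
        required = {
            "symptom_metrics": self.symptom_metrics,
            "trace_ids": self.trace_ids,
            "log_fingerprints": self.log_fingerprints,
        }
        return [name for name, value in required.items() if not value]


# Example: metrics and one trace captured, correlated logs still missing.
evidence = Evidence(
    symptom_metrics={"http_5xx_ratio": 0.12},
    trace_ids=["trace-abc"],
)
print(evidence.gaps())  # prints ['log_fingerprints']
```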

Step 3: Mitigation Decision

Choose one:

  • rollback
  • config change
  • scale/traffic control
  • observe-only with timebox

Record why this action is the safest option given the evidence from Step 2.
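
The decision and its rationale can be recorded together and checked for the two easy mistakes: a missing rationale and an observe-only choice with no timebox. This is a sketch under assumptions; `MitigationDecision`, the action identifiers, and the example rationale are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Identifiers mirror the four options listed above; the spellings are assumed.
ALLOWED_ACTIONS = {
    "rollback",
    "config_change",
    "scale_traffic_control",
    "observe_only",
}


@dataclass
class MitigationDecision:
    """Hypothetical Step 3 record."""
    action: str
    rationale: str
    timebox_minutes: Optional[int] = None  # only required for observe_only

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the record is usable."""
        problems = []
        if self.action not in ALLOWED_ACTIONS:
            problems.append(f"unknown action: {self.action}")
        if not self.rationale.strip():
            problems.append("rationale is required")
        if self.action == "observe_only" and not self.timebox_minutes:
            problems.append("observe_only needs a timebox")
        return problems


decision = MitigationDecision(
    action="rollback",
    rationale="panic onset matches the last deploy; rollback is tested and reversible",
)
print(decision.validate())  # prints []
```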

Step 4: Recovery Verification

Confirm:

  • metric recovery
  • trace durations and error rates back to normal
  • no repeating critical logs for the same fingerprint
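
The checks above can be expressed as two small predicates. The sketch below is illustrative only: the function names, the 0.02 threshold, the three-sample window, and the fingerprint strings are assumptions, not values from the course. The same `metric_recovered` check could be applied to a trace duration series such as p95 latency.

```python
from datetime import datetime, timedelta


def metric_recovered(samples: list[float], threshold: float,
                     min_samples: int = 3) -> bool:
    """True if the last `min_samples` samples are all at or below `threshold`."""
    if len(samples) < min_samples:
        return False
    return all(value <= threshold for value in samples[-min_samples:])


def fingerprint_repeats(log_events: list[tuple[datetime, str]],
                        fingerprint: str,
                        mitigated_at: datetime) -> int:
    """Count critical log events with `fingerprint` seen after mitigation."""
    return sum(
        1 for ts, fp in log_events
        if fp == fingerprint and ts > mitigated_at
    )


# Invented example data: error ratio samples and (timestamp, fingerprint) pairs.
mitigated_at = datetime(2024, 1, 1, 12, 0)
error_ratio = [0.12, 0.09, 0.01, 0.00, 0.01]
events = [
    (mitigated_at - timedelta(minutes=5), "panic:nil-deref"),
    (mitigated_at + timedelta(minutes=2), "timeout:db"),
]

recovered = (
    metric_recovered(error_ratio, threshold=0.02)
    and fingerprint_repeats(events, "panic:nil-deref", mitigated_at) == 0
)
print(recovered)  # prints True
```

Requiring several consecutive good samples avoids declaring recovery on a single quiet data point.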

Step 5: Postmortem

Complete postmortem-template.md:

  • timeline
  • root cause and contributing factors
  • what worked, what failed
  • action items

Hard Stop Conditions

  • mitigation applied without evidence
  • no assigned owner for critical action item
  • incident closed without recovery verification
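
The three conditions above lend themselves to a mechanical check against the incident record. This is a sketch under assumptions: `IncidentRecord`, `ActionItem`, and their fields are hypothetical and would need to be mapped onto whatever record the lab actually produces.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    title: str
    critical: bool = False
    owner: str = ""


@dataclass
class IncidentRecord:
    """Hypothetical summary of the lab run; field names are illustrative."""
    mitigation_applied: bool = False
    evidence_count: int = 0          # number of evidence items from Step 2
    recovery_verified: bool = False
    closed: bool = False
    action_items: list[ActionItem] = field(default_factory=list)


def hard_stops(record: IncidentRecord) -> list[str]:
    """Return every hard stop condition that the record currently triggers."""
    stops = []
    if record.mitigation_applied and record.evidence_count == 0:
        stops.append("mitigation applied without evidence")
    if any(item.critical and not item.owner.strip()
           for item in record.action_items):
        stops.append("no assigned owner for critical action item")
    if record.closed and not record.recovery_verified:
        stops.append("incident closed without recovery verification")
    return stops


# Example: closed too early, and a critical item has no owner.
record = IncidentRecord(
    mitigation_applied=True,
    evidence_count=3,
    closed=True,
    action_items=[ActionItem("add panic alert", critical=True)],
)
for stop in hard_stops(record):
    print(stop)
```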

Done When

  • a complete incident record exists
  • the postmortem is blameless and actionable
  • at least one prevention action is accepted into the backlog