Chapter 13: 24/7 Production SRE

Why This Chapter Exists

Tooling is not enough without operational discipline. This chapter defines how teams run incidents, reduce recurrence, and harden systems continuously.

Scope

  • on-call operating model
  • incident lifecycle and severity policy
  • recurring-problem management
  • blameless postmortem workflow
  • AI boundary policy in production

Core Principles

  1. Evidence first:
     • metrics + traces + logs before high-risk actions
  2. Blameless response:
     • focus on system conditions and guardrail gaps, not individuals
  3. Controlled escalation:
     • severity-based comms and ownership
  4. AI boundary:
     • AI can classify and recommend
     • humans own decisions and execution
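
The evidence-first and AI-boundary principles above can be read as a gate in front of any high-risk action. The following is a minimal sketch under stated assumptions, not a prescribed implementation: `ProposedAction`, `may_execute`, and the convention that the string "ai" marks a non-human actor are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Evidence types named in the principles above.
REQUIRED_EVIDENCE = {"metrics", "traces", "logs"}
# Assumption: non-human actors are identified by this label.
AI_ACTORS = {"ai"}


@dataclass
class ProposedAction:
    description: str
    high_risk: bool
    proposed_by: str                      # AI may classify and recommend
    evidence: set = field(default_factory=set)
    approved_by: Optional[str] = None     # a human must own the decision


def may_execute(action):
    """Return (allowed, reason) for a proposed action."""
    missing = REQUIRED_EVIDENCE - action.evidence
    if action.high_risk and missing:
        return False, "missing evidence: " + ", ".join(sorted(missing))
    if action.approved_by is None or action.approved_by in AI_ACTORS:
        return False, "needs human approval"
    return True, "ok"


if __name__ == "__main__":
    action = ProposedAction(
        description="restart primary database",
        high_risk=True,
        proposed_by="ai",
        evidence={"metrics", "logs"},
    )
    print(may_execute(action))   # blocked: traces not yet collected
    action.evidence.add("traces")
    action.approved_by = "alice"
    print(may_execute(action))   # allowed: evidence complete, human approved
```

In this sketch the AI can populate `proposed_by`, but it can never satisfy `approved_by`, which keeps the decision with a human.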

Operating Model

  • Incident Commander (IC)
  • Primary Responder
  • Communications Owner
  • Scribe

Lab Files

  • lab.md
  • runbook-oncall.md
  • postmortem-template.md
  • quiz.md

Done When

  • learner can run a full incident timeline with roles and severity
  • learner can produce a complete blameless postmortem
  • learner can define hardening actions with owner and due date
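
The last criterion, hardening actions with an owner and a due date, is easy to check mechanically. A minimal sketch, assuming a hypothetical `HardeningAction` record and `incomplete_actions` helper (neither comes from the lab files):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class HardeningAction:
    title: str
    owner: Optional[str] = None   # person or team accountable for the action
    due: Optional[date] = None    # target completion date


def incomplete_actions(actions):
    """Return titles of actions that lack an owner or a due date."""
    return [a.title for a in actions if not a.owner or a.due is None]


if __name__ == "__main__":
    actions = [
        HardeningAction("Add alert on elevated 5xx rate", "team-backend", date(2030, 1, 15)),
        HardeningAction("Document rollback procedure", owner="alice"),
    ]
    print(incomplete_actions(actions))
```

A postmortem could be treated as complete only when this list comes back empty.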

Blameless Postmortem Template

Incident Metadata

  • Incident ID:
  • Date/Time (UTC):
  • Severity:
  • Services affected:
  • Incident Commander:

Summary

  • What happened:
  • Customer impact:
  • Duration:

Timeline (UTC)

  • Detection:
  • Triage: …

Lab: Full Incident Lifecycle (24/7 SRE)

Lifecycle phases: detect, triage, mitigate, recover, postmortem

Scenario Input

Use one recent controlled scenario (recommended from Chapter 11/12):

  • backend crash/panic pattern, or
  • elevated 5xx with recurring incidents

Step 1: Incident …
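
The lifecycle named in this lab (detect, triage, mitigate, recover, postmortem) can be modelled as an ordered timeline. The sketch below is one possible way a scribe's notes could be kept in phase order; `IncidentTimeline` and its rule that an entry may only stay in the current phase or advance to the next one are assumptions made for illustration, and real incidents may need to move backwards.

```python
from datetime import datetime, timezone

PHASES = ("detect", "triage", "mitigate", "recover", "postmortem")


class IncidentTimeline:
    """Ordered log of (timestamp, phase, note) entries for one incident."""

    def __init__(self):
        self.entries = []

    @property
    def current_phase(self):
        return self.entries[-1][1] if self.entries else None

    def record(self, phase, note, at=None):
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        current = self.current_phase
        position = -1 if current is None else PHASES.index(current)
        target = PHASES.index(phase)
        # Simplifying assumption: stay in the current phase or advance by one.
        # With no entries yet, only "detect" is accepted.
        if target not in (max(position, 0), position + 1):
            raise ValueError(f"cannot move from {current} to {phase}")
        # Timestamps are stored in UTC, matching the postmortem template.
        self.entries.append((at or datetime.now(timezone.utc), phase, note))


if __name__ == "__main__":
    timeline = IncidentTimeline()
    timeline.record("detect", "alert fired on elevated 5xx")
    timeline.record("triage", "severity assigned, roles confirmed")
    timeline.record("mitigate", "rolled back last deploy")
    for at, phase, note in timeline.entries:
        print(at.isoformat(), phase, note)
```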

Quiz: Chapter 13 (24/7 Production SRE)

  1. Which statement is correct?
     A) Decide mitigation first, collect evidence later.
     B) Collect evidence first, then choose mitigation.
     C) Wait for AI confidence to reach 100%.
  2. Name the minimum evidence set before high-risk …

Runbook: On-Call Incident Operations

Severity Matrix

  • SEV-1: active customer outage or high data-risk
  • SEV-2: major degradation with customer impact
  • SEV-3: limited/contained issue, no major customer impact

Standard …
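
The three severity definitions in this runbook (SEV-1, SEV-2, SEV-3) can be expressed as a small classification function. This is a sketch only: the boolean inputs and the `classify` name are hypothetical, and treating SEV-3 as the fallback when neither higher level applies is an assumption, not something the runbook states.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


def classify(customer_outage, high_data_risk, major_degradation, customer_impact):
    """Map observed conditions to a severity, checking the highest level first."""
    # SEV-1: active customer outage or high data-risk
    if customer_outage or high_data_risk:
        return Severity.SEV1
    # SEV-2: major degradation with customer impact
    if major_degradation and customer_impact:
        return Severity.SEV2
    # SEV-3: limited/contained issue (assumed fallback)
    return Severity.SEV3


if __name__ == "__main__":
    severity = classify(
        customer_outage=False,
        high_data_risk=False,
        major_degradation=True,
        customer_impact=True,
    )
    print(severity.value)
```

Checking SEV-1 first means an incident that meets several definitions is always assigned the highest applicable severity.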