Chapter 14: 24/7 Production SRE

Why This Chapter Exists

Tooling is not enough without operational discipline. This chapter defines how teams run incidents, reduce recurrence, and harden systems continuously.

Scope

on-call operating model
incident lifecycle and severity policy
recurring-problem management
blameless postmortem workflow
AI boundary policy in production

Incident Hook

A Sev1 alert fires outside business hours. Responders join quickly, but roles are unclear, status updates are inconsistent, and uncoordinated actions race each other. Time is lost coordinating people instead of restoring service. This chapter defines the operating discipline that keeps incidents controlled.

What AI Would Propose (Brave Junior)

“Skip incident command and jump straight to fixes.”
“The postmortem can wait; just close the ticket after recovery.”
“Let AI choose remediation automatically if confidence is high.”

Why this sounds reasonable:

faster short-term response
less process overhead in the moment

Why This Is Dangerous

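Without defined incident roles, responders duplicate work and miss critical steps.
Without a severity policy, communication and escalation are inconsistent.
Without postmortem discipline, the same incidents are likely to recur.
Remediation chosen automatically on model confidence alone can act on a wrong diagnosis with no human accountable for the change.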

Guardrails That Stop It

Severity-based incident model with an explicit owner per role.
Evidence-first actions: metrics + traces + logs before high-risk changes.
Mandatory blameless postmortem with owners and due dates.
AI may classify and recommend, but execution remains human-owned.

Severity Matrix (Sev0-Sev3)

Severity	Typical Impact	Response Target	Escalation
Sev0	Critical business-wide outage or data risk	immediate incident command	page all required responders + leadership
Sev1	Major user-facing degradation	rapid coordinated response	page core service owners + incident manager
Sev2	Partial degradation with workaround	planned urgent response	notify owning team + on-call
Sev3	Low-impact defect/noise	normal backlog/working hours	track and trend for recurrence
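
One way to keep the matrix usable by tooling is to encode it as data. The sketch below is a minimal illustration, not a prescribed implementation: the names `Severity`, `SeverityPolicy`, `SEVERITY_MATRIX`, and `requires_paging` are hypothetical, and the policy strings are copied from the table above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means higher severity, matching Sev0-Sev3."""
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class SeverityPolicy:
    typical_impact: str
    response_target: str
    escalation: str


# Hypothetical encoding of the severity matrix; wording taken from the table.
SEVERITY_MATRIX = {
    Severity.SEV0: SeverityPolicy(
        "Critical business-wide outage or data risk",
        "immediate incident command",
        "page all required responders + leadership",
    ),
    Severity.SEV1: SeverityPolicy(
        "Major user-facing degradation",
        "rapid coordinated response",
        "page core service owners + incident manager",
    ),
    Severity.SEV2: SeverityPolicy(
        "Partial degradation with workaround",
        "planned urgent response",
        "notify owning team + on-call",
    ),
    Severity.SEV3: SeverityPolicy(
        "Low-impact defect/noise",
        "normal backlog/working hours",
        "track and trend for recurrence",
    ),
}


def requires_paging(severity: Severity) -> bool:
    """Per the matrix, Sev0 and Sev1 page people; Sev2 notifies; Sev3 is tracked."""
    return severity <= Severity.SEV1
```

Keeping the policy in one lookup table means an alert router and a status page can read the same source instead of each hard-coding its own copy.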

Communication Escalation Rules

Declare severity at incident start and update it if impact changes.
Assign communications owner for status cadence and stakeholder updates.
For Sev0/Sev1, include clear next update time in every message.
Close incident comms only after recovery evidence is confirmed.
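
The "next update time" rule is easy to forget under pressure, so it can help to check it mechanically before a message goes out. The following is a minimal sketch under stated assumptions: `StatusUpdate` and `validate_update` are hypothetical names, and severity is represented as an integer 0-3.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StatusUpdate:
    severity: int                               # 0..3, matching Sev0-Sev3
    summary: str
    sent_at: datetime
    next_update_at: Optional[datetime] = None


def validate_update(update: StatusUpdate) -> list[str]:
    """Return a list of problems; an empty list means the update may be sent."""
    problems = []
    if not update.summary.strip():
        problems.append("summary is empty")
    if update.severity <= 1:
        # Sev0/Sev1 messages must tell stakeholders when to expect the next one.
        if update.next_update_at is None:
            problems.append("Sev0/Sev1 updates must state the next update time")
        elif update.next_update_at <= update.sent_at:
            problems.append("next update time must be after the send time")
    return problems


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    draft = StatusUpdate(1, "Checkout error rate elevated; rollback in progress", now)
    print(validate_update(draft))               # flags the missing next update time
    draft.next_update_at = now + timedelta(minutes=30)
    print(validate_update(draft))               # no problems reported
```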

Core Principles

Evidence first:

metrics + traces + logs before high-risk actions

Blameless response:

focus on system conditions and guardrail gaps, not individuals

Controlled escalation:

severity-based comms and ownership

AI boundary:

AI can classify and recommend
humans own decisions and execution
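
The evidence-first and AI-boundary principles can be combined into a single pre-execution gate. This sketch assumes a hypothetical `ProposedAction` record and `may_execute` check; the three evidence types come from the principles above, while every other name is illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_EVIDENCE = frozenset({"metrics", "traces", "logs"})


@dataclass
class ProposedAction:
    description: str
    high_risk: bool
    proposed_by: str                         # may be an AI assistant or a person
    evidence: set[str] = field(default_factory=set)
    approved_by: Optional[str] = None        # must name a human responder


def may_execute(action: ProposedAction, human_responders: set[str]) -> tuple[bool, str]:
    """Allow execution only with human approval and, for high-risk actions, full evidence."""
    # AI boundary: anyone may propose, but a human must own the decision.
    if action.approved_by not in human_responders:
        return False, "needs approval from a human responder"
    # Evidence first: high-risk actions need metrics, traces, and logs attached.
    if action.high_risk:
        missing = REQUIRED_EVIDENCE - action.evidence
        if missing:
            return False, "missing evidence: " + ", ".join(sorted(missing))
    return True, "ok"
```

The gate deliberately ignores who proposed the action: an AI recommendation and a human suggestion go through the same approval and evidence checks.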

Operating Model

Incident Commander (IC)
Primary Responder
Communications Owner
Scribe
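
A small check can confirm that every role has an owner before work starts. This is a sketch only: the `ROLES` keys and the `unfilled_roles` helper are hypothetical names, and whether one person may hold more than one role is left to team policy, so the check does not enforce it.

```python
# Hypothetical role keys mirroring the four roles listed above.
ROLES = ("incident_commander", "primary_responder", "communications_owner", "scribe")


def unfilled_roles(assignments: dict[str, str]) -> list[str]:
    """Return the roles that have no named owner yet, in the order listed above."""
    return [role for role in ROLES if not assignments.get(role, "").strip()]


if __name__ == "__main__":
    print(unfilled_roles({"incident_commander": "alice", "scribe": "bob"}))
```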

Safe Workflow (Step-by-Step)

Declare severity and assign IC, responder, comms owner, and scribe.
Build shared timeline from correlated evidence (metrics/traces/logs).
Execute lowest-risk mitigation first and communicate status on fixed cadence.
Resolve incident, confirm service objectives recovered, and close timeline.
Publish blameless postmortem with follow-ups (owner + due date + validation method).
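
The five steps above can be sketched as an ordered set of phases that an incident moves through one at a time, with each transition recorded on the timeline. The `Phase` and `Incident` names are assumptions made for illustration; a real incident tool would likely carry more state.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Phase(Enum):
    DECLARED = 1               # severity declared, roles assigned
    TIMELINE_BUILT = 2         # shared timeline from correlated evidence
    MITIGATING = 3             # lowest-risk mitigation first, fixed-cadence comms
    RESOLVED = 4               # service objectives confirmed recovered
    POSTMORTEM_PUBLISHED = 5   # blameless postmortem with follow-ups


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DECLARED
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def log(self, note: str) -> None:
        """Append a timestamped entry, as the scribe would."""
        self.timeline.append((datetime.now(timezone.utc), note))

    def advance(self, note: str) -> Phase:
        """Move to the next phase in order; phases cannot be skipped."""
        if self.phase is Phase.POSTMORTEM_PUBLISHED:
            raise ValueError("incident workflow is already complete")
        self.phase = Phase(self.phase.value + 1)
        self.log(f"{self.phase.name}: {note}")
        return self.phase
```

Because `advance` only ever moves one step forward, an incident cannot be marked resolved without first passing through the timeline and mitigation phases.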

Postmortem Quality Bar

Every postmortem must include:

timeline with evidence, not assumptions
5 whys or equivalent root/contributing factor analysis
specific follow-up actions with owner and due date
validation method showing risk reduction after fixes
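
The quality bar lends itself to an automated completeness check before a postmortem is published. The sketch below assumes a hypothetical in-memory shape (`TimelineEntry`, `FollowUp`, `Postmortem`) and a `quality_gaps` helper; it can only check that fields are present, not judge the quality of the analysis.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TimelineEntry:
    event: str
    evidence: str              # reference to a metric, trace, or log; empty if assumed


@dataclass
class FollowUp:
    action: str
    owner: str
    due: Optional[date]
    validation: str            # how risk reduction will be shown after the fix


@dataclass
class Postmortem:
    timeline: list[TimelineEntry]
    contributing_factors: list[str]
    follow_ups: list[FollowUp]


def quality_gaps(pm: Postmortem) -> list[str]:
    """Return every unmet item from the quality bar; an empty list means it passes."""
    gaps = []
    if not pm.timeline:
        gaps.append("timeline is empty")
    for entry in pm.timeline:
        if not entry.evidence.strip():
            gaps.append(f"timeline entry without evidence: {entry.event}")
    if not pm.contributing_factors:
        gaps.append("no root/contributing factor analysis")
    if not pm.follow_ups:
        gaps.append("no follow-up actions")
    for item in pm.follow_ups:
        if not item.owner.strip():
            gaps.append(f"follow-up without owner: {item.action}")
        if item.due is None:
            gaps.append(f"follow-up without due date: {item.action}")
        if not item.validation.strip():
            gaps.append(f"follow-up without validation method: {item.action}")
    return gaps
```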

Lab Files

lab.md
runbook-oncall.md
postmortem-template.md
quiz.md

Done When

learner can run a full incident timeline with roles and severity
learner can produce a complete blameless postmortem
learner can define hardening actions with owner and due date

Estimated Time

Prerequisites

Artifacts

What You Will Produce