Chapter 14: 24/7 Production SRE
Why This Chapter Exists
Tooling is not enough without operational discipline. This chapter defines how teams run incidents, reduce recurrence, and harden systems continuously.
Scope
- on-call operating model
- incident lifecycle and severity policy
- recurring-problem management
- blameless postmortem workflow
- AI boundary policy in production
Incident Hook
A Sev1 alert starts outside business hours. Teams join quickly but roles are unclear, updates are inconsistent, and actions race each other. Time is lost coordinating people instead of restoring service. This chapter defines the operating discipline that keeps incidents controlled.
What AI Would Propose (Brave Junior)
- “Skip incident command and jump straight to fixes.”
- “Postmortem can wait; just close ticket after recovery.”
- “Let AI choose remediation automatically if confidence is high.”
Why this sounds reasonable:
- faster short-term response
- less process overhead in the moment
Why This Is Dangerous
- Without defined roles, responders duplicate work and miss critical steps.
- Without a severity policy, communication and escalation are inconsistent.
- Without postmortem discipline, the same incidents tend to recur.
Guardrails That Stop It
- severity-based incident model with explicit owner per role.
- evidence-first actions: metrics + traces + logs before high-risk changes.
- mandatory blameless postmortem with owners and due dates.
- AI may classify/recommend, but execution remains human-owned.
Severity Matrix (Sev0-Sev3)
| Severity | Typical Impact | Response Target | Escalation |
|---|---|---|---|
| Sev0 | Critical business-wide outage or data risk | immediate incident command | page all required responders + leadership |
| Sev1 | Major user-facing degradation | rapid coordinated response | page core service owners + incident manager |
| Sev2 | Partial degradation with workaround | planned urgent response | notify owning team + on-call |
| Sev3 | Low-impact defect/noise | normal backlog/working hours | track and trend for recurrence |
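Keeping the matrix as data lets paging and notification tooling read the same policy that responders do. The following is a minimal Python sketch under stated assumptions, not a prescribed implementation: the names `Severity`, `SeverityPolicy`, `SEVERITY_POLICY`, and `policy_for` are hypothetical, and the `pages_responders` property is an assumption that simply checks whether the Escalation column begins with "page".

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class SeverityPolicy:
    impact: str
    response_target: str
    escalation: str

    @property
    def pages_responders(self) -> bool:
        # Assumption: a row pages people only if its escalation text says "page".
        return self.escalation.startswith("page")


# Rows copied verbatim from the severity matrix.
SEVERITY_POLICY = {
    Severity.SEV0: SeverityPolicy(
        "Critical business-wide outage or data risk",
        "immediate incident command",
        "page all required responders + leadership",
    ),
    Severity.SEV1: SeverityPolicy(
        "Major user-facing degradation",
        "rapid coordinated response",
        "page core service owners + incident manager",
    ),
    Severity.SEV2: SeverityPolicy(
        "Partial degradation with workaround",
        "planned urgent response",
        "notify owning team + on-call",
    ),
    Severity.SEV3: SeverityPolicy(
        "Low-impact defect/noise",
        "normal backlog/working hours",
        "track and trend for recurrence",
    ),
}


def policy_for(severity: Severity) -> SeverityPolicy:
    """Look up the policy row for a declared severity."""
    return SEVERITY_POLICY[severity]
```

A declared `Severity.SEV1` then resolves to the Sev1 row, so the escalation text shown to responders and the paging decision come from one source.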
Communication Escalation Rules
- Declare severity at incident start and update it if impact changes.
- Assign a communications owner responsible for status cadence and stakeholder updates.
- For Sev0/Sev1, include a clear next-update time in every message.
- Close incident comms only after recovery evidence is confirmed.
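The next-update rule is easy to forget under pressure, so it can be enforced where messages are built. Below is a small Python sketch, assuming a hypothetical helper `format_status_update`; the message layout and the use of UTC timestamps are illustrative choices, not part of the policy above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Severities whose messages must always carry a next-update time.
NEXT_UPDATE_REQUIRED = {"Sev0", "Sev1"}


def format_status_update(
    severity: str,
    summary: str,
    sent_at: datetime,
    next_update_in: Optional[timedelta] = None,
) -> str:
    """Build a status message; refuse Sev0/Sev1 messages without a next-update time."""
    if severity in NEXT_UPDATE_REQUIRED and next_update_in is None:
        raise ValueError(f"{severity} updates must state the next update time")
    lines = [f"[{severity}] {sent_at:%Y-%m-%d %H:%M} UTC: {summary}"]
    if next_update_in is not None:
        next_at = sent_at + next_update_in
        lines.append(f"Next update by {next_at:%H:%M} UTC")
    return "\n".join(lines)
```

The communications owner chooses the interval; the helper only guarantees that a Sev0/Sev1 message cannot be sent without one.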
Core Principles
- Evidence first:
  - metrics + traces + logs before high-risk actions
- Blameless response:
  - focus on system conditions and guardrail gaps, not individuals
- Controlled escalation:
  - severity-based comms and ownership
- AI boundary:
  - AI can classify and recommend
  - humans own decisions and execution
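The AI boundary can be expressed as a gate in code: a recommendation carries a confidence score, but that score never authorizes execution. This is a hedged Python sketch; `Recommendation`, `ApprovalRequired`, and `execute` are hypothetical names, and the `run` callback stands in for whatever remediation tooling a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Recommendation:
    action: str
    confidence: float  # model-reported; deliberately not used to authorize execution


class ApprovalRequired(Exception):
    """Raised when an action is attempted without a named human approver."""


def execute(
    rec: Recommendation,
    approved_by: Optional[str],
    run: Callable[[str], None],
) -> str:
    """Run a recommended action only after a named human has approved it."""
    if not approved_by:
        # High confidence is not a substitute for human ownership.
        raise ApprovalRequired(f"'{rec.action}' needs a human approver")
    run(rec.action)
    return f"{rec.action} executed, approved by {approved_by}"
```

Because `confidence` is never consulted in `execute`, the "if confidence is high" shortcut has no code path to take.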
Operating Model
- Incident Commander (IC)
- Primary Responder
- Communications Owner
- Scribe
Safe Workflow (Step-by-Step)
1. Declare severity and assign IC, responder, comms owner, and scribe.
2. Build shared timeline from correlated evidence (metrics/traces/logs).
3. Execute lowest-risk mitigation first and communicate status on fixed cadence.
4. Resolve incident, confirm service objectives recovered, and close timeline.
5. Publish blameless postmortem with follow-ups (owner + due date + validation method).
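The workflow above can be sketched as a small incident record that refuses to skip steps. This is an illustrative Python sketch, not a real incident tool: `Incident`, `TimelineEntry`, and `REQUIRED_ROLES` are hypothetical names, and the role keys are assumptions that mirror the Operating Model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Assumed role keys, mirroring the four roles in the Operating Model.
REQUIRED_ROLES = (
    "incident_commander",
    "primary_responder",
    "communications_owner",
    "scribe",
)


@dataclass
class TimelineEntry:
    at: datetime
    note: str
    evidence: str  # reference to a metric, trace, or log


@dataclass
class Incident:
    severity: str
    roles: Dict[str, str]
    timeline: List[TimelineEntry] = field(default_factory=list)
    closed: bool = False

    def __post_init__(self) -> None:
        # Declaring an incident requires an explicit owner for every role.
        missing = [r for r in REQUIRED_ROLES if not self.roles.get(r)]
        if missing:
            raise ValueError(f"unassigned roles: {', '.join(missing)}")

    def record(self, note: str, evidence: str, at: Optional[datetime] = None) -> None:
        # The timeline holds evidence-backed entries only, not assumptions.
        if not evidence:
            raise ValueError("timeline entries need supporting evidence")
        self.timeline.append(TimelineEntry(at or datetime.now(timezone.utc), note, evidence))

    def close(self, objectives_recovered: bool) -> None:
        # Closing requires confirmation that service objectives have recovered.
        if not objectives_recovered:
            raise ValueError("cannot close before service objectives are confirmed recovered")
        self.closed = True
```

Putting the checks in the record itself means an unowned role, an unsupported timeline entry, or a premature close fails loudly rather than slipping through during the incident.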
Postmortem Quality Bar
Every postmortem must include:
- timeline with evidence, not assumptions
- 5 whys or equivalent root/contributing factor analysis
- specific follow-up actions with owner and due date
- validation method showing risk reduction after fixes
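The quality bar lends itself to a mechanical check before a postmortem is published. The sketch below is one possible shape in Python; `Postmortem`, `FollowUp`, and `quality_gaps` are hypothetical names, and the fields are assumptions that map one-to-one onto the four requirements above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FollowUp:
    action: str
    owner: str = ""
    due_date: str = ""  # assumed ISO date string, e.g. "2024-03-01"
    validation_method: str = ""


@dataclass
class Postmortem:
    timeline_evidence: List[str] = field(default_factory=list)
    contributing_factors: List[str] = field(default_factory=list)
    follow_ups: List[FollowUp] = field(default_factory=list)


def quality_gaps(pm: Postmortem) -> List[str]:
    """Return a list of unmet quality-bar items; an empty list means the bar is met."""
    gaps = []
    if not pm.timeline_evidence:
        gaps.append("timeline has no supporting evidence")
    if not pm.contributing_factors:
        gaps.append("no root/contributing factor analysis")
    if not pm.follow_ups:
        gaps.append("no follow-up actions")
    for fu in pm.follow_ups:
        missing = [
            name
            for name, value in (
                ("owner", fu.owner),
                ("due date", fu.due_date),
                ("validation method", fu.validation_method),
            )
            if not value
        ]
        if missing:
            gaps.append(f"follow-up {fu.action!r} is missing: {', '.join(missing)}")
    return gaps
```

A check like this verifies only that each element is present; whether the analysis is sound still needs human review.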
Lab Files
- lab.md
- runbook-oncall.md
- postmortem-template.md
- quiz.md
Done When
- learner can run a full incident timeline with roles and severity
- learner can produce a complete blameless postmortem
- learner can define hardening actions with owner and due date