Chapter 14: 24/7 Production SRE
Incident Hook
A Sev1 alert starts outside business hours. Teams join quickly but roles are unclear, updates are inconsistent, and actions race each other. Time is lost coordinating people instead of restoring service. This chapter defines the operating discipline that keeps incidents controlled.
Observed Symptoms
What the team sees first:
- responders join, but ownership is unclear
- communication cadence is inconsistent
- multiple actions start before a shared evidence picture exists
The incident is already harder than it needs to be before any technical fix lands.
Why This Chapter Exists
Tooling is not enough without operational discipline. This chapter defines how teams run incidents, reduce recurrence, and harden systems continuously.
Scope
- on-call operating model
- incident lifecycle and severity policy
- recurring-problem management
- blameless postmortem workflow
- AI boundary policy in production
Confusion Phase
Without explicit roles, every fast action feels helpful. That is exactly how response quality degrades.
The real question is:
- who owns command, communications, evidence, and execution
- and how to keep urgency from turning into parallel guesswork
What AI Would Propose (Brave Junior)
- “Skip incident command and jump straight to fixes.”
- “Postmortem can wait; just close the ticket after recovery.”
- “Let AI choose remediation automatically if confidence is high.”
Why this sounds reasonable:
- faster short-term response
- less process overhead in the moment
Why This Is Dangerous
- Without defined roles, responders duplicate work and miss critical steps.
- Without a severity policy, communication and escalation are inconsistent.
- Without postmortem discipline, the same incidents tend to recur.
Investigation
Treat coordination gaps as part of the incident, not background noise.
Safe investigation sequence:
- declare severity and assign roles immediately
- build one shared timeline from metrics, traces, logs, and operator actions
- separate confirmed evidence from assumptions
- keep mitigation choices tied to the shared picture, not to individual urgency
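The sequence above can be sketched as a small data structure. This is a minimal illustration, not a prescribed tool: `TimelineEntry`, `build_timeline`, `split_evidence`, the `confirmed` flag, and the sample events are all hypothetical names and data chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime      # when the event happened (UTC)
    source: str       # "metrics", "traces", "logs", or "operator"
    note: str         # what was observed or done
    confirmed: bool   # True = backed by evidence, False = assumption


def build_timeline(*streams):
    """Merge entries from every source into one list ordered by time."""
    merged = [entry for stream in streams for entry in stream]
    return sorted(merged, key=lambda entry: entry.at)


def split_evidence(timeline):
    """Separate confirmed evidence from assumptions, preserving time order."""
    confirmed = [e for e in timeline if e.confirmed]
    assumptions = [e for e in timeline if not e.confirmed]
    return confirmed, assumptions


def _t(minute):
    # Hypothetical timestamps, for illustration only
    return datetime(2024, 1, 1, 2, minute, tzinfo=timezone.utc)


metrics = [TimelineEntry(_t(1), "metrics", "error rate rises on checkout", True)]
logs = [TimelineEntry(_t(3), "logs", "connection pool exhausted", True)]
operator = [
    TimelineEntry(_t(2), "operator", "suspect recent deploy", False),
    TimelineEntry(_t(5), "operator", "rolled back deploy", True),
]

timeline = build_timeline(metrics, logs, operator)
confirmed, assumptions = split_evidence(timeline)
```

Keeping operator actions in the same timeline as telemetry makes it visible when an action was taken on an assumption rather than on confirmed evidence.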
Containment
Containment in 24/7 operations is organizational as well as technical:
- establish incident command and update cadence
- execute the lowest-risk mitigation that matches the evidence
- confirm recovery before winding down communication
- open follow-up work while the timeline is still fresh
Guardrails That Stop It
- severity-based incident model with explicit owner per role.
- evidence-first actions: metrics + traces + logs before high-risk changes.
- mandatory blameless postmortem with owners and due dates.
- AI may classify/recommend, but execution remains human-owned.
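The AI boundary in the last guardrail can be expressed as an approval gate. The sketch below is one possible shape, assuming a hypothetical `Recommendation` record and `execute` helper; none of these names come from a real library. The point it illustrates is that model confidence is recorded but never used to bypass human approval.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Recommendation:
    action: str        # remediation proposed by the AI
    confidence: float  # model-reported score, informational only


class ApprovalRequired(Exception):
    """Raised when a remediation is attempted without a named human approver."""


def execute(rec: Recommendation, approved_by: Optional[str], run: Callable[[str], object]):
    """Run the remediation only if a named human has approved it."""
    # Deliberately no branch on rec.confidence: high confidence is not approval.
    if not approved_by:
        raise ApprovalRequired(f"'{rec.action}' needs a human approver")
    return run(rec.action)


rec = Recommendation("restart checkout workers", confidence=0.99)
executed = []   # stands in for the real remediation runner
blocked = False

try:
    execute(rec, None, executed.append)  # no approver: must be refused
except ApprovalRequired:
    blocked = True

execute(rec, "primary-responder", executed.append)  # human-approved: runs
```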
Severity Matrix (Sev0-Sev3)
| Severity | Typical Impact | Response Target | Escalation |
|---|---|---|---|
| Sev0 | Critical business-wide outage or data risk | immediate incident command | page all required responders + leadership |
| Sev1 | Major user-facing degradation | rapid coordinated response | page core service owners + incident manager |
| Sev2 | Partial degradation with workaround | planned urgent response | notify owning team + on-call |
| Sev3 | Low-impact defect/noise | normal backlog/working hours | track and trend for recurrence |
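Encoding the matrix as data keeps escalation consistent across responders. The following is a minimal sketch, assuming the escalation column maps to three actions (`page`, `notify`, `track`); `Severity`, `EscalationRule`, `RULES`, and `escalate` are hypothetical names, and the target labels are copied from the table rather than from any real paging system.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class EscalationRule:
    action: str     # "page", "notify", or "track"
    targets: tuple  # who or what receives the escalation


# One rule per row of the severity matrix
RULES = {
    Severity.SEV0: EscalationRule("page", ("required responders", "leadership")),
    Severity.SEV1: EscalationRule("page", ("core service owners", "incident manager")),
    Severity.SEV2: EscalationRule("notify", ("owning team", "on-call")),
    Severity.SEV3: EscalationRule("track", ("recurrence trend",)),
}


def escalate(severity: Severity) -> str:
    """Return a human-readable escalation instruction for the given severity."""
    rule = RULES[severity]
    return f"{rule.action}: " + ", ".join(rule.targets)


sev1_instruction = escalate(Severity.SEV1)
```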
Communication Escalation Rules
- Declare severity at incident start and update it if impact changes.
- Assign communications owner for status cadence and stakeholder updates.
- For Sev0/Sev1, include clear next update time in every message.
- Close incident comms only after recovery evidence is confirmed.
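The Sev0/Sev1 rule about stating the next update time is easy to check mechanically before a message goes out. This sketch assumes a hypothetical `StatusUpdate` record and `validate_update` helper; the 30-minute interval in the example is illustrative, not a cadence this chapter prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class StatusUpdate:
    severity: str                              # "Sev0" .. "Sev3"
    summary: str
    sent_at: datetime
    next_update_at: Optional[datetime] = None  # required for Sev0/Sev1


def validate_update(update: StatusUpdate) -> list:
    """Return a list of problems; an empty list means the update can be sent."""
    problems = []
    if update.severity in ("Sev0", "Sev1"):
        if update.next_update_at is None:
            problems.append("Sev0/Sev1 updates must state the next update time")
        elif update.next_update_at <= update.sent_at:
            problems.append("next update time must be after the send time")
    return problems


sent = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)  # hypothetical timestamp
summary = "Checkout errors elevated; rollback in progress"

incomplete = StatusUpdate("Sev1", summary, sent)
complete = StatusUpdate("Sev1", summary, sent, next_update_at=sent + timedelta(minutes=30))
low_sev = StatusUpdate("Sev3", "Noisy alert on batch job", sent)
```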
Core Principles
- Evidence first:
  - metrics + traces + logs before high-risk actions
- Blameless response:
  - focus on system conditions and guardrail gaps, not individuals
- Controlled escalation:
  - severity-based comms and ownership
- AI boundary:
  - AI can classify and recommend
  - humans own decisions and execution
System Context
This chapter operationalizes the whole course under real pressure.
It depends on:
- Chapter 10 for evidence-first investigation
- Chapter 12 for rehearsed response under failure
- Chapter 13 for structured incident input that still stays human-owned
Operating Model
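- Incident Commander (IC): owns command and mitigation decisions
- Primary Responder: owns execution of the chosen mitigation
- Communications Owner: owns status cadence and stakeholder updates
- Scribe: owns the shared timeline and evidence record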
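A quick check that every role has a named owner can run at incident declaration. The helper below is a minimal sketch; `REQUIRED_ROLES`, `missing_roles`, and the responder names are hypothetical.

```python
# The four roles of the operating model, as dictionary keys (naming is an assumption)
REQUIRED_ROLES = (
    "incident_commander",
    "primary_responder",
    "communications_owner",
    "scribe",
)


def missing_roles(assignments: dict) -> list:
    """Return the required roles that have no named owner."""
    return [
        role
        for role in REQUIRED_ROLES
        if not assignments.get(role, "").strip()
    ]


# Example: comms owner was never assigned and scribe was left blank
assignments = {
    "incident_commander": "alex",
    "primary_responder": "sam",
    "scribe": "",
}
gaps = missing_roles(assignments)
```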
Safe Workflow (Step-by-Step)
1. Declare severity and assign IC, responder, comms owner, and scribe.
2. Build shared timeline from correlated evidence (metrics/traces/logs).
3. Execute lowest-risk mitigation first and communicate status on fixed cadence.
4. Resolve incident, confirm service objectives recovered, and close timeline.
5. Publish blameless postmortem with follow-ups (owner + due date + validation method).
Postmortem Quality Bar
Every postmortem must include:
- timeline with evidence, not assumptions
- 5 whys or equivalent root/contributing factor analysis
- specific follow-up actions with owner and due date
- validation method showing risk reduction after fixes
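The quality bar can be applied as a checklist before a postmortem is published. The sketch below assumes hypothetical `FollowUp` and `Postmortem` records and a `quality_gaps` helper; it only checks that each required element is present, not whether its content is good.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class FollowUp:
    action: str
    owner: str
    due: Optional[date]
    validation: str  # how risk reduction will be demonstrated


@dataclass
class Postmortem:
    timeline: List[str]   # evidence-backed timeline entries
    factor_analysis: str  # 5 whys or equivalent
    follow_ups: List[FollowUp] = field(default_factory=list)


def quality_gaps(pm: Postmortem) -> list:
    """Return the quality-bar items that are missing; empty means ready to publish."""
    gaps = []
    if not pm.timeline:
        gaps.append("timeline with evidence is missing")
    if not pm.factor_analysis.strip():
        gaps.append("root/contributing factor analysis is missing")
    if not pm.follow_ups:
        gaps.append("no follow-up actions")
    for i, item in enumerate(pm.follow_ups, start=1):
        if not item.owner.strip():
            gaps.append(f"follow-up {i} has no owner")
        if item.due is None:
            gaps.append(f"follow-up {i} has no due date")
        if not item.validation.strip():
            gaps.append(f"follow-up {i} has no validation method")
    return gaps


# Hypothetical draft: the follow-up has an owner and validation but no due date
draft = Postmortem(
    timeline=["02:01 error rate rises (metrics)", "02:05 deploy rolled back (operator)"],
    factor_analysis="Pool size was not load-tested after the config change.",
    follow_ups=[FollowUp("add pool saturation alert", "sam", None, "alert fires in game day")],
)
draft_gaps = quality_gaps(draft)
```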
Lab Files
- lab.md
- runbook-oncall.md
- postmortem-template.md
- quiz.md
Done When
- learner can run a full incident timeline with roles and severity
- learner can produce a complete blameless postmortem
- learner can define hardening actions with owner and due date