Core Track: guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

Primary chapter content only.

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Chapter 14: 24/7 Production SRE

Incident Hook

A Sev1 alert starts outside business hours. Teams join quickly but roles are unclear, updates are inconsistent, and actions race each other. Time is lost coordinating people instead of restoring service. This chapter defines the operating discipline that keeps incidents controlled.

Observed Symptoms

What the team sees first:

  • responders join, but ownership is unclear
  • communication cadence is inconsistent
  • multiple actions start before a shared evidence picture exists

The incident is already harder than it needs to be before any technical fix lands.

Why This Chapter Exists

Tooling is not enough without operational discipline. This chapter defines how teams run incidents, reduce recurrence, and harden systems continuously.

Scope

  • on-call operating model
  • incident lifecycle and severity policy
  • recurring-problem management
  • blameless postmortem workflow
  • AI boundary policy in production

Confusion Phase

Without explicit roles, every fast action feels helpful. That is exactly how response quality degrades.

The real question is:

  • who owns command, communications, evidence, and execution
  • and how to keep urgency from turning into parallel guesswork

What AI Would Propose (Brave Junior)

  • “Skip incident command and jump straight to fixes.”
  • “Postmortem can wait; just close ticket after recovery.”
  • “Let AI choose remediation automatically if confidence is high.”

Why this sounds reasonable:

  • faster short-term response
  • less process overhead in the moment

Why This Is Dangerous

  • without a role model, responders duplicate work and miss critical steps.
  • without severity policy, communication and escalation are inconsistent.
  • without postmortem discipline, the same incidents recur.

Investigation

Treat coordination gaps as part of the incident, not background noise.

Safe investigation sequence:

  1. declare severity and assign roles immediately
  2. build one shared timeline from metrics, traces, logs, and operator actions
  3. separate confirmed evidence from assumptions
  4. keep mitigation choices tied to the shared picture, not to individual urgency

Containment

Containment in 24/7 operations is organizational as well as technical:

  1. establish incident command and update cadence
  2. execute the lowest-risk mitigation that matches the evidence
  3. confirm recovery before winding down communication
  4. open follow-up work while the timeline is still fresh

Guardrails That Stop It

  • severity-based incident model with explicit owner per role.
  • evidence-first actions: metrics + traces + logs before high-risk changes.
  • mandatory blameless postmortem with owners and due dates.
  • AI may classify/recommend, but execution remains human-owned.
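The last guardrail can be expressed as an approval gate. The sketch below assumes a hypothetical `Recommendation` record and an `authorize` function; neither comes from the chapter or from any real library. Model confidence is deliberately ignored, so a high score can never bypass the human approver.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    action: str
    confidence: float        # model-reported confidence, 0.0 to 1.0
    proposed_by: str = "ai"


class ApprovalRequired(Exception):
    """Raised when an action is attempted without a named human approver."""


def authorize(rec: Recommendation, approved_by: Optional[str] = None) -> str:
    # Confidence is not consulted here on purpose: a high score
    # never substitutes for a human decision.
    if not approved_by:
        raise ApprovalRequired(f"'{rec.action}' needs a human approver")
    # This only returns an audit string; actually running the action
    # is left to the human-owned runbook.
    return f"{rec.action}: approved by {approved_by}"


rec = Recommendation(action="roll back last deploy", confidence=0.97)

try:
    authorize(rec)
    blocked = False
except ApprovalRequired:
    blocked = True

result = authorize(rec, approved_by="incident-commander")
```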

Severity Matrix (Sev0-Sev3)

Severity | Typical Impact | Response Target | Escalation
Sev0 | Critical business-wide outage or data risk | immediate incident command | page all required responders + leadership
Sev1 | Major user-facing degradation | rapid coordinated response | page core service owners + incident manager
Sev2 | Partial degradation with workaround | planned urgent response | notify owning team + on-call
Sev3 | Low-impact defect/noise | normal backlog/working hours | track and trend for recurrence
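One way to keep the matrix above from drifting is to store it as data that tooling can look up. The sketch below is an assumption about how that might be encoded. `Severity`, `Policy`, `POLICY`, and `policy_for` are hypothetical names, and the `pages_people` flag is derived only from the matrix wording ("page" for Sev0/Sev1, "notify" and "track" for Sev2/Sev3).

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class Policy:
    response_target: str
    escalation: str
    pages_people: bool   # derived from the matrix wording: "page" vs "notify"/"track"


POLICY = {
    Severity.SEV0: Policy("immediate incident command",
                          "page all required responders + leadership", True),
    Severity.SEV1: Policy("rapid coordinated response",
                          "page core service owners + incident manager", True),
    Severity.SEV2: Policy("planned urgent response",
                          "notify owning team + on-call", False),
    Severity.SEV3: Policy("normal backlog/working hours",
                          "track and trend for recurrence", False),
}


def policy_for(severity: Severity) -> Policy:
    """Look up the response policy for a declared severity."""
    return POLICY[severity]
```

For example, `policy_for(Severity.SEV1).escalation` returns the Sev1 escalation text, so alert routing and status templates can read from the same source as the written policy.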

Communication Escalation Rules

  • Declare severity at incident start and update it if impact changes.
  • Assign communications owner for status cadence and stakeholder updates.
  • For Sev0/Sev1, include clear next update time in every message.
  • Close incident comms only after recovery evidence is confirmed.
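The "next update time in every message" rule is easy to enforce with a message template. The following is a sketch under stated assumptions: the cadence values in `CADENCE_MINUTES` are placeholders (the chapter does not prescribe intervals), and `status_update` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cadence in minutes; the chapter does not prescribe intervals.
CADENCE_MINUTES = {"Sev0": 15, "Sev1": 30}


def status_update(severity: str, summary: str, now: datetime) -> str:
    """Format a status message; Sev0/Sev1 always carry a next-update time."""
    lines = [f"[{severity}] {summary}"]
    cadence = CADENCE_MINUTES.get(severity)
    if cadence is not None:
        next_at = now + timedelta(minutes=cadence)
        lines.append(f"Next update: {next_at:%H:%M} UTC")
    return "\n".join(lines)


now = datetime(2024, 1, 1, 2, 10, tzinfo=timezone.utc)
sev1_message = status_update("Sev1", "Checkout degraded; mitigation in progress", now)
sev2_message = status_update("Sev2", "Search slow; workaround available", now)
```

Generating the next-update line from the declared severity means the communications owner cannot forget it under pressure, and a severity change automatically changes the cadence.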

Core Principles

  1. Evidence first:
     • metrics + traces + logs before high-risk actions
  2. Blameless response:
     • focus on system conditions and guardrail gaps, not individuals
  3. Controlled escalation:
     • severity-based comms and ownership
  4. AI boundary:
     • AI can classify and recommend
     • humans own decisions and execution

System Context

This chapter operationalizes the whole course under real pressure.

It depends on:

  • Chapter 10 for evidence-first investigation
  • Chapter 12 for rehearsed response under failure
  • Chapter 13 for structured incident input that still stays human-owned

Operating Model

  • Incident Commander (IC)
  • Primary Responder
  • Communications Owner
  • Scribe

Safe Workflow (Step-by-Step)

  1. Declare severity and assign IC, responder, comms owner, and scribe.
  2. Build shared timeline from correlated evidence (metrics/traces/logs).
  3. Execute lowest-risk mitigation first and communicate status on fixed cadence.
  4. Resolve incident, confirm service objectives recovered, and close timeline.
  5. Publish blameless postmortem with follow-ups (owner + due date + validation method).
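The five steps above behave like a small state machine with two hard gates: no incident starts without all four roles assigned, and none is resolved until service objectives are confirmed recovered. The sketch below illustrates that idea. `Phase`, `Incident`, the role keys, and the `objectives_recovered` flag are hypothetical names chosen for this example, and the code does not require the four roles to be held by different people because the chapter does not say so.

```python
from enum import Enum


class Phase(Enum):
    DECLARED = 1              # severity declared, roles assigned
    TIMELINE = 2              # shared timeline built from evidence
    MITIGATING = 3            # lowest-risk mitigation in progress
    RESOLVED = 4              # objectives confirmed, timeline closed
    POSTMORTEM_PUBLISHED = 5


REQUIRED_ROLES = {"ic", "responder", "comms_owner", "scribe"}


class Incident:
    def __init__(self, severity, roles):
        assigned = {role for role, owner in roles.items() if owner}
        missing = REQUIRED_ROLES - assigned
        if missing:
            raise ValueError(f"unassigned roles: {sorted(missing)}")
        self.severity = severity
        self.roles = dict(roles)
        self.phase = Phase.DECLARED

    def advance(self, objectives_recovered=False):
        """Move to the next phase; resolving requires confirmed recovery."""
        if self.phase is Phase.POSTMORTEM_PUBLISHED:
            raise ValueError("incident lifecycle already complete")
        next_phase = Phase(self.phase.value + 1)
        if next_phase is Phase.RESOLVED and not objectives_recovered:
            raise ValueError("confirm service objectives recovered before resolving")
        self.phase = next_phase
        return self.phase


incident = Incident(
    "Sev1",
    {"ic": "alex", "responder": "sam", "comms_owner": "kim", "scribe": "lee"},
)
incident.advance()   # DECLARED -> TIMELINE
incident.advance()   # TIMELINE -> MITIGATING
```

Because phases can only advance one step at a time, skipping the timeline or publishing a postmortem for an unresolved incident is not representable.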

Postmortem Quality Bar

Every postmortem must include:

  • timeline with evidence, not assumptions
  • 5 whys or equivalent root/contributing factor analysis
  • specific follow-up actions with owner and due date
  • validation method showing risk reduction after fixes
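The quality bar above can be partly checked by machine before a postmortem is published. The sketch below is one possible shape. `Postmortem`, `FollowUp`, and `quality_gaps` are hypothetical names, not fields of the course's postmortem-template.md. It only verifies that each required element is present; whether the timeline rests on evidence rather than assumptions still needs human review.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FollowUp:
    action: str
    owner: str = ""
    due_date: str = ""      # ISO date string, e.g. "2024-02-01"
    validation: str = ""    # how risk reduction will be demonstrated


@dataclass
class Postmortem:
    timeline: List[str] = field(default_factory=list)
    factor_analysis: str = ""   # 5 whys or equivalent
    follow_ups: List[FollowUp] = field(default_factory=list)


def quality_gaps(pm: Postmortem) -> List[str]:
    """Return a list of missing elements; an empty list means the bar is met."""
    gaps = []
    if not pm.timeline:
        gaps.append("timeline is empty")
    if not pm.factor_analysis.strip():
        gaps.append("missing root/contributing factor analysis")
    if not pm.follow_ups:
        gaps.append("no follow-up actions")
    for item in pm.follow_ups:
        for name in ("owner", "due_date", "validation"):
            if not getattr(item, name).strip():
                gaps.append(f"follow-up '{item.action}' missing {name}")
    return gaps


# Invented draft, for illustration only.
draft = Postmortem(
    timeline=["02:01 config change rolled out", "02:03 error rate rises"],
    factor_analysis="",
    follow_ups=[FollowUp(action="add canary stage", owner="sam")],
)
gaps = quality_gaps(draft)
```

Here `gaps` lists the missing factor analysis plus the follow-up's missing due date and validation method, which gives the reviewer a concrete checklist instead of a general "needs work".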

Lab Files

  • lab.md
  • runbook-oncall.md
  • postmortem-template.md
  • quiz.md

Done When

  • learner can run a full incident timeline with roles and severity
  • learner can produce a complete blameless postmortem
  • learner can define hardening actions with owner and due date

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Blameless Postmortem Template
  • Lab: Full Incident Lifecycle (24/7 SRE)
  • Quiz: Chapter 14 (24/7 Production SRE)
  • Runbook: On-Call Incident Operations