Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Artifacts

What You Will Produce

A reproducible lab result, a passing quiz, and evidence of incident-safe operating practice.

Chapter 14: 24/7 Production SRE

Why This Chapter Exists

Tooling is not enough without operational discipline. This chapter defines how teams run incidents, reduce recurrence, and harden systems continuously.

Scope

  • on-call operating model
  • incident lifecycle and severity policy
  • recurring-problem management
  • blameless postmortem workflow
  • AI boundary policy in production

Incident Hook

A Sev1 alert starts outside business hours. Teams join quickly but roles are unclear, updates are inconsistent, and actions race each other. Time is lost coordinating people instead of restoring service. This chapter defines the operating discipline that keeps incidents controlled.

What AI Would Propose (Brave Junior)

  • “Skip incident command and jump straight to fixes.”
  • “Postmortem can wait; just close ticket after recovery.”
  • “Let AI choose remediation automatically if confidence is high.”

Why this sounds reasonable:

  • faster short-term response
  • less process overhead in the moment

Why This Is Dangerous

  • Without a defined role model, responders duplicate work and miss critical steps.
  • Without a severity policy, communication and escalation are inconsistent.
  • Without postmortem discipline, the same incidents recur.

Guardrails That Stop It

  • severity-based incident model with explicit owner per role.
  • evidence-first actions: metrics + traces + logs before high-risk changes.
  • mandatory blameless postmortem with owners and due dates.
  • AI may classify/recommend, but execution remains human-owned.
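The last guardrail can be sketched as a simple gate: the model may label and suggest, but every execution path requires explicit human sign-off. All names here (`Recommendation`, `execute`, `approved_by`) are illustrative, not from any specific tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Recommendation:
    """An AI-produced suggestion; it carries no authority to act."""
    action: str
    confidence: float
    approved_by: Optional[str] = None  # set only by a human reviewer

def execute(rec: Recommendation) -> str:
    # Guardrail: execution is human-owned regardless of model confidence.
    if rec.approved_by is None:
        raise PermissionError(f"{rec.action!r} needs human approval before execution")
    return f"executing {rec.action!r} (approved by {rec.approved_by})"

rec = Recommendation(action="rollback deploy 1234", confidence=0.99)
try:
    execute(rec)          # blocked: high confidence alone is not approval
except PermissionError as e:
    print(e)
rec.approved_by = "ic-on-call"
print(execute(rec))       # allowed only after explicit human sign-off
```

The point of the gate is that confidence never substitutes for ownership: the approval field is set by a person, never by the model.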

Severity Matrix (Sev0-Sev3)

| Severity | Typical Impact | Response Target | Escalation |
|----------|----------------|-----------------|------------|
| Sev0 | Critical business-wide outage or data risk | Immediate incident command | Page all required responders + leadership |
| Sev1 | Major user-facing degradation | Rapid coordinated response | Page core service owners + incident manager |
| Sev2 | Partial degradation with workaround | Planned urgent response | Notify owning team + on-call |
| Sev3 | Low-impact defect/noise | Normal backlog/working hours | Track and trend for recurrence |
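One way to keep the matrix above from drifting between docs and tooling is to encode it as data with a single lookup. This is a minimal sketch; the dict shape and function name are assumptions, not a prescribed schema.

```python
# Severity matrix encoded as data so tooling and humans read the same policy.
SEVERITY_MATRIX = {
    "Sev0": {
        "impact": "critical business-wide outage or data risk",
        "response": "immediate incident command",
        "escalation": "page all required responders + leadership",
    },
    "Sev1": {
        "impact": "major user-facing degradation",
        "response": "rapid coordinated response",
        "escalation": "page core service owners + incident manager",
    },
    "Sev2": {
        "impact": "partial degradation with workaround",
        "response": "planned urgent response",
        "escalation": "notify owning team + on-call",
    },
    "Sev3": {
        "impact": "low-impact defect/noise",
        "response": "normal backlog/working hours",
        "escalation": "track and trend for recurrence",
    },
}

def escalation_for(severity: str) -> str:
    """Return who gets paged/notified; unknown severities fail loudly."""
    try:
        return SEVERITY_MATRIX[severity]["escalation"]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None

print(escalation_for("Sev1"))  # page core service owners + incident manager
```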

Communication Escalation Rules

  • Declare severity at incident start and update it if impact changes.
  • Assign communications owner for status cadence and stakeholder updates.
  • For Sev0/Sev1, include clear next update time in every message.
  • Close incident comms only after recovery evidence is confirmed.
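The "clear next update time" rule can be sketched as a helper that stamps every Sev0/Sev1 status message with its cadence. The intervals here are illustrative defaults, not policy from this chapter.

```python
from datetime import datetime, timedelta, timezone

# Illustrative update cadences per severity; tune these to your comms policy.
UPDATE_INTERVAL = {
    "Sev0": timedelta(minutes=15),
    "Sev1": timedelta(minutes=30),
}

def status_update(severity: str, message: str, now: datetime) -> str:
    """Format a status line; Sev0/Sev1 messages must carry a next-update time."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return f"[{severity}] {message}"
    next_update = (now + interval).strftime("%H:%M UTC")
    return f"[{severity}] {message} | next update by {next_update}"

now = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
print(status_update("Sev1", "mitigation in progress", now))
# [Sev1] mitigation in progress | next update by 03:30 UTC
```

Building the next-update time into the formatter means responders cannot forget it under pressure; the message template enforces the rule.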

Core Principles

  1. Evidence first:
     • metrics + traces + logs before high-risk actions
  2. Blameless response:
     • focus on system conditions and guardrail gaps, not individuals
  3. Controlled escalation:
     • severity-based comms and ownership
  4. AI boundary:
     • AI can classify and recommend
     • humans own decisions and execution

Operating Model

  • Incident Commander (IC): owns coordination and final decisions
  • Primary Responder: executes diagnosis and mitigation
  • Communications Owner: drives status cadence and stakeholder updates
  • Scribe: maintains the evidence timeline

Safe Workflow (Step-by-Step)

  1. Declare severity and assign IC, responder, comms owner, and scribe.
  2. Build shared timeline from correlated evidence (metrics/traces/logs).
  3. Execute lowest-risk mitigation first and communicate status on fixed cadence.
  4. Resolve incident, confirm service objectives recovered, and close timeline.
  5. Publish blameless postmortem with follow-ups (owner + due date + validation method).
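The five steps above can be sketched as a small state machine that refuses to skip phases, for example closing an incident before evidence is gathered. The phase names mirror the workflow; the class and method names are illustrative.

```python
# Ordered incident phases following the workflow above.
PHASES = ["declared", "evidence", "mitigating", "resolved", "postmortem_published"]

class Incident:
    def __init__(self, incident_id: str):
        self.incident_id = incident_id
        self.phase = PHASES[0]

    def advance(self, to_phase: str) -> None:
        """Move exactly one phase forward; skipping steps violates the guardrail."""
        current = PHASES.index(self.phase)
        target = PHASES.index(to_phase)
        if target != current + 1:
            raise RuntimeError(
                f"cannot jump from {self.phase!r} to {to_phase!r}; phases are sequential"
            )
        self.phase = to_phase

inc = Incident("INC-042")
inc.advance("evidence")       # timeline built from metrics/traces/logs
inc.advance("mitigating")     # lowest-risk mitigation first
try:
    inc.advance("postmortem_published")  # blocked: incident not resolved yet
except RuntimeError as e:
    print(e)
```

Encoding the lifecycle this way makes the "Brave Junior" shortcut (jump straight to fixes, skip the postmortem) a hard error rather than a habit.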

Postmortem Quality Bar

Every postmortem must include:

  • timeline with evidence, not assumptions
  • 5 whys or equivalent root/contributing factor analysis
  • specific follow-up actions with owner and due date
  • validation method showing risk reduction after fixes
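The quality bar above can be sketched as a completeness check run before a postmortem is published. The section keys and dict shape are assumptions for illustration; they follow the four bullets, not a defined file format.

```python
# Required sections from the quality bar above; keys are illustrative.
REQUIRED_SECTIONS = {
    "timeline",             # with evidence, not assumptions
    "root_cause_analysis",  # 5 whys or equivalent
    "follow_ups",           # each needs owner + due date
    "validation_method",    # how risk reduction is proven after fixes
}

def missing_sections(postmortem: dict) -> set:
    """Return which quality-bar sections are absent or empty."""
    return {s for s in REQUIRED_SECTIONS if not postmortem.get(s)}

def follow_ups_complete(postmortem: dict) -> bool:
    """Every follow-up action must name an owner and a due date."""
    return all(
        item.get("owner") and item.get("due_date")
        for item in postmortem.get("follow_ups", [])
    )

draft = {
    "timeline": ["03:02 alert fired", "03:05 IC assigned"],
    "root_cause_analysis": "5 whys attached",
    "follow_ups": [{"action": "add rate limit", "owner": "team-api"}],  # no due date
}
print(missing_sections(draft))     # {'validation_method'}
print(follow_ups_complete(draft))  # False
```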

Lab Files

  • lab.md
  • runbook-oncall.md
  • postmortem-template.md
  • quiz.md

Done When

  • learner can run a full incident timeline with roles and severity
  • learner can produce a complete blameless postmortem
  • learner can define hardening actions with owner and due date

Blameless Postmortem Template

Incident Metadata
  • Incident ID:
  • Date/Time (UTC):
  • Severity:
  • Services affected:
  • Incident Commander:

Summary
  • What happened:
  • Customer impact:
  • Duration:

Timeline (UTC)
  • Detection:
  • Triage:
  • …

Lab: Full Incident Lifecycle (24/7 SRE)

Phases: detect → triage → mitigate → recover → postmortem

Scenario Input

Use one recent controlled scenario (recommended from Chapter 12/13):

  • backend crash/panic pattern, or
  • elevated 5xx with recurring incidents

Step 1: Incident …

Quiz: Chapter 14 (24/7 Production SRE)

Which statement is correct?

  A) Decide mitigation first, collect evidence later.
  B) Collect evidence first, then choose mitigation.
  C) Wait for AI confidence to reach 100%.

Name the minimum evidence set before high-risk …

Runbook: On-Call Incident Operations

Severity Matrix

  • SEV-1: active customer outage or high data-risk
  • SEV-2: major degradation with customer impact
  • SEV-3: limited/contained issue, no major customer impact

Standard …