Guardrails That Stop It
- Owner-Per-Role: Every incident must have an assigned owner for Command, Comms, and Execution.
- Evidence-First: Metrics, traces, and logs must be captured before any high-risk production change.
- Mandatory Postmortems: All Sev0 and Sev1 incidents require a blameless postmortem within 48 hours.
- AI Boundary Policy: AI tools can analyze and recommend, but humans must own the final decision and execution.
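The four guardrails above can be expressed as automated checks. The following is a minimal sketch, not a prescribed implementation: the `Incident` and `ProposedChange` types, the role keys, and the `guardrail_violations` helper are all hypothetical names, and the 48-hour window is assumed to start at resolution, which the guardrail itself does not specify.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Hypothetical role keys mirroring Command, Comms, and Execution.
REQUIRED_ROLES = ("command", "comms", "execution")
POSTMORTEM_SEVERITIES = {"sev0", "sev1"}
# Assumption: the 48-hour clock starts when the incident is resolved.
POSTMORTEM_WINDOW = timedelta(hours=48)


@dataclass
class ProposedChange:
    high_risk: bool
    # None means no human has signed off, e.g. an AI recommendation only.
    human_approver: Optional[str] = None


@dataclass
class Incident:
    severity: str
    owners: Dict[str, str] = field(default_factory=dict)
    evidence_captured: bool = False
    resolved_at: Optional[datetime] = None
    postmortem_done: bool = False


def guardrail_violations(
    incident: Incident, now: datetime, change: Optional[ProposedChange] = None
) -> List[str]:
    """Return a human-readable list of guardrails the incident currently breaks."""
    violations = []
    # Owner-Per-Role: every required role needs a named owner.
    for role in REQUIRED_ROLES:
        if not incident.owners.get(role):
            violations.append(f"missing owner: {role}")
    if change is not None:
        # Evidence-First: no high-risk change before evidence is captured.
        if change.high_risk and not incident.evidence_captured:
            violations.append("evidence not captured before high-risk change")
        # AI Boundary Policy: a human must own the decision.
        if not change.human_approver:
            violations.append("change has no human approver")
    # Mandatory Postmortems: Sev0/Sev1 reviews are due within the window.
    overdue = (
        incident.severity in POSTMORTEM_SEVERITIES
        and incident.resolved_at is not None
        and not incident.postmortem_done
        and now - incident.resolved_at > POSTMORTEM_WINDOW
    )
    if overdue:
        violations.append("postmortem overdue")
    return violations
```

A check like this could run before each production change is applied; an empty list means no guardrail is currently violated.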
Core SRE Principles
- Evidence Over Urgency: Act based on confirmed signals (Chapter 10), not on panic.
- Blameless Response: Focus on system gaps and guardrail failures, not individual mistakes.
- Controlled Escalation: Follow the severity-based communication and ownership model.
Operating Model (The Incident Team)
- Incident Commander (IC): Strategist. Owns decision-making and resource allocation.
- Primary Responder: Surgeon. Owns the technical execution and verification.
- Communications Lead: Voice. Owns stakeholder updates and status pages.
- Scribe: Memory. Owns the timeline and evidence logging.
Safe Workflow (Step-by-Step)
- Detect & Declare: Use Chapter 10 signals or Chapter 13 Guardian alerts to detect a failure. Declare severity.
- Assign Roles: Identify the IC, Primary Responder, Comms Lead, and Scribe.
- Build Timeline: Record every key metric change and operator command in a shared log.
- Mitigate: Execute the lowest-risk fix first. Communicate status on a fixed cadence.
- Resolve: Confirm recovery via metrics. Record the time of resolution.
- Postmortem: Conduct a blameless review and assign hardening actions.
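The workflow above can be sketched as a small state machine that refuses to skip steps and writes every transition to the timeline. This is an illustrative sketch under assumptions: `Phase`, `IncidentLog`, `record`, and `advance` are hypothetical names, and "Build Timeline" is modeled as an append-only log that runs through every phase rather than as a phase of its own.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional, Tuple


class Phase(Enum):
    DECLARED = 1        # Detect & Declare
    ROLES_ASSIGNED = 2  # Assign Roles
    MITIGATING = 3      # Mitigate
    RESOLVED = 4        # Resolve
    REVIEWED = 5        # Postmortem


@dataclass
class IncidentLog:
    severity: str
    phase: Phase = Phase.DECLARED
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    resolved_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        # The declaration itself is the first timeline entry.
        self.record(f"DECLARED: severity={self.severity}")

    def record(self, note: str, at: Optional[datetime] = None) -> None:
        """Append a metric change or operator command to the shared log."""
        self.timeline.append((at or datetime.now(timezone.utc), note))

    def advance(self, to: Phase, note: str, at: Optional[datetime] = None) -> None:
        """Move to the next phase; skipping or repeating a phase is an error."""
        if to.value != self.phase.value + 1:
            raise ValueError(f"cannot move from {self.phase.name} to {to.name}")
        when = at or datetime.now(timezone.utc)
        self.phase = to
        if to is Phase.RESOLVED:
            self.resolved_at = when  # Resolve: record the time of resolution
        self.record(f"{to.name}: {note}", when)
```

For example, calling `advance(Phase.RESOLVED, ...)` directly after roles are assigned raises `ValueError`, because mitigation has not been logged yet. Enforcing the order in code is one way to keep the timeline complete enough for the postmortem.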
This builds on: AI-assisted guardian (Chapter 13). On-call uses the guardian for triage and enrichment.
This enables: Capstone. All core guardrails are now operational.