Investigation
Treat coordination and communication gaps as part of the incident, not as background noise.
Safe investigation sequence:
- Declare Severity: Explicitly name the severity and assign core roles immediately.
- Build the Timeline: Create a shared, real-time timeline from metrics, logs, and operator actions.
- Separate Evidence: Clearly distinguish between confirmed facts and assumptions or hypotheses.
- Audit Coordination: Check whether any parallel work is happening without the Incident Commander’s knowledge.
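The timeline and evidence steps above can be sketched as a small data structure. This is a minimal illustration, not a prescribed tool: the names `Kind`, `Entry`, and `Timeline` are hypothetical, and it assumes every entry starts as a hypothesis until someone explicitly confirms it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class Kind(Enum):
    FACT = "fact"              # confirmed by a metric, log line, or operator action
    HYPOTHESIS = "hypothesis"  # plausible, but not yet verified


@dataclass(order=True)
class Entry:
    at: datetime                                  # only field used for ordering
    source: str = field(compare=False)            # e.g. "metrics", "logs", "operator"
    text: str = field(compare=False)
    kind: Kind = field(compare=False, default=Kind.HYPOTHESIS)


class Timeline:
    """Shared incident timeline (hypothetical helper for illustration)."""

    def __init__(self):
        self._entries = []

    def add(self, at, source, text, kind=Kind.HYPOTHESIS):
        entry = Entry(at, source, text, kind)
        self._entries.append(entry)
        return entry

    def confirm(self, entry):
        # Promote a hypothesis to a fact once evidence backs it up.
        entry.kind = Kind.FACT

    def ordered(self, kind=None):
        # Chronological view, optionally filtered to facts or hypotheses.
        return [e for e in sorted(self._entries) if kind is None or e.kind == kind]


# Example usage with made-up events.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
timeline = Timeline()
guess = timeline.add(start + timedelta(minutes=5), "operator", "Deploy may have caused the errors")
timeline.add(start, "metrics", "5xx rate crossed the alert threshold", Kind.FACT)
facts = timeline.ordered(Kind.FACT)
```

Keeping the fact/hypothesis flag on each entry means the Incident Commander can filter the timeline to confirmed evidence before choosing a mitigation.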
Containment
Containment in SRE is both organizational and technical.
Containment steps:
- Establish Command: Assign an Incident Commander (IC) to manage the people and the strategy.
- Set Update Cadence: Commit to a fixed communication interval based on the severity level.
- Apply the Lowest-Risk Mitigation: Execute the simplest action that the available evidence supports.
- Confirm Recovery: Don’t wind down the “War Room” until metrics confirm service recovery.
- Open Follow-ups: Capture immediate action items while the context is still fresh.
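Two of the steps above, the update cadence and the recovery check, lend themselves to a short sketch. Everything here is an assumption for illustration: the severity labels, the intervals in `UPDATE_INTERVAL`, and the rule that recovery needs several consecutive healthy samples should be replaced by your own incident policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cadence table; real intervals come from your own incident policy.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=1),
}


def next_update_due(severity, last_update):
    """Return when the next stakeholder update is owed for this severity."""
    return last_update + UPDATE_INTERVAL[severity]


def recovery_confirmed(error_rates, threshold, required_healthy):
    """True only if the most recent `required_healthy` samples are all at or
    below `threshold`. A single good reading is not treated as recovery."""
    if required_healthy <= 0 or len(error_rates) < required_healthy:
        return False
    return all(rate <= threshold for rate in error_rates[-required_healthy:])


# Example usage with made-up numbers.
last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
due = next_update_due("SEV1", last)
safe_to_wind_down = recovery_confirmed([0.20, 0.01, 0.01, 0.005], threshold=0.01, required_healthy=3)
```

Requiring a run of healthy samples rather than one reading is a deliberate choice: it guards against closing the incident during a brief dip in errors.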
Pause and Predict: What automated guardrail would have prevented this incident entirely?