Investigation
Treat observability as a drill-down path, not a bag of disconnected tools.
Safe investigation sequence:
- Detect Symptom: Start from the metric symptom (latency or error spike).
- Pivot to Traces: Use traces to isolate the exact failing path.
- Correlate Logs: Search logs for the trace_id from the failing trace.
- Identify Cause: Act only after at least two signals support the same explanation.
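The last two steps can be sketched in a few lines of Python. This is a minimal illustration, not a real tool: `LogRecord`, `logs_for_trace`, and `supported_cause` are hypothetical names, and the sample records and the "payments-db" cause are made up for the example. It filters logs by the failing trace's ID, then accepts a cause only when at least two signals agree on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    # Hypothetical log shape; real log schemas will differ.
    trace_id: str
    service: str
    message: str


def logs_for_trace(records, trace_id):
    """Return only the log records that carry the failing trace's ID."""
    return [r for r in records if r.trace_id == trace_id]


def supported_cause(signals, minimum=2):
    """Return a suspected cause only if at least `minimum` signals name it.

    `signals` maps a signal name ("metrics", "traces", "logs") to the cause
    that signal points at, or None if that signal is inconclusive.
    """
    votes = {}
    for signal, cause in signals.items():
        if cause is not None:
            votes.setdefault(cause, set()).add(signal)
    agreed = [cause for cause, names in votes.items() if len(names) >= minimum]
    # Exactly one explanation must reach the threshold; otherwise keep digging.
    return agreed[0] if len(agreed) == 1 else None


# Illustrative data only.
records = [
    LogRecord("abc123", "checkout", "timeout calling payments-db"),
    LogRecord("zzz999", "search", "ok"),
]
matching = logs_for_trace(records, "abc123")
cause = supported_cause(
    {"metrics": None, "traces": "payments-db", "logs": "payments-db"}
)
undecided = supported_cause({"metrics": "cache", "traces": "payments-db", "logs": None})
```

Here `cause` is accepted because traces and logs agree, while `undecided` stays `None` because no explanation has two supporting signals.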
Containment
Containment follows the evidence you’ve gathered.
Containment steps:
- Stabilize Route: Stabilize the failing dependency or route identified by traces.
- Verify Clearing: Confirm that the symptom clears in Grafana metrics.
- Confirm Baseline: Ensure that both logs and traces return to their expected behavior.
- Record Path: Document the exact signal path (metric, trace, log) that led to the diagnosis, so the next responder can reach a trustworthy answer just as quickly.
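The "Verify Clearing" step can be made explicit with a small check. The sketch below assumes the latency samples have already been read off the dashboard or exported by hand; it does not call any Grafana API. The function name `symptom_cleared`, the 10% tolerance, and the three-sample window are illustrative assumptions, not recommended values.

```python
def symptom_cleared(samples, baseline, tolerance=0.10, required_consecutive=3):
    """Return True if the most recent samples are all back near baseline.

    `samples` is a chronological list of metric values (for example, p95
    latency in milliseconds). The symptom counts as cleared only when the
    last `required_consecutive` values are at or below
    baseline * (1 + tolerance).
    """
    if len(samples) < required_consecutive:
        # Not enough data after the mitigation to judge.
        return False
    limit = baseline * (1 + tolerance)
    return all(value <= limit for value in samples[-required_consecutive:])


# Illustrative numbers only: a spike that settles back toward a 200 ms baseline.
recovering = symptom_cleared([900, 450, 210, 205, 198], baseline=200)
too_early = symptom_cleared([900, 450, 210], baseline=200)
```

Requiring several consecutive samples guards against declaring success on a single good data point while the symptom is still flapping.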
The goal is “diagnose first, then act,” rather than “guess and restart.”
Pause and Predict: What automated guardrail would have prevented this incident entirely?