Investigation
Treat the SRE Guardian itself as a guarded incident pipeline.
Safe investigation sequence:
- Inspect Raw Signals: Review the raw Kubernetes events and metrics entering the Guardian.
- Verify Sanitization: Confirm that secrets and tokens are redacted and that the context budget is enforced before any data reaches LLM analysis.
- Confirm Deduplication: Ensure that the Guardian correctly collapsed multiple related alerts into a single incident record.
- Review Proposed Actions: Check whether the AI-suggested actions are useful and stay within the “no-mutation” boundary.
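The sanitization and deduplication checks above can be spot-checked with a small script. The following is a minimal sketch, not the Guardian's actual implementation: the regex patterns, the `MAX_CONTEXT_CHARS` budget, the event field names, and the helpers `sanitize`, `dedup_key`, and `collapse` are all assumptions made for illustration.

```python
import re

# Hypothetical redaction patterns; a real deployment would maintain its own list.
# The bearer pattern runs first so "token: Bearer abc" is fully redacted.
SECRET_PATTERNS = [
    re.compile(r"(?i)\bbearer\s+[a-z0-9._\-]+"),
    re.compile(r"(?i)\b(password|token|secret|api[_-]?key)\s*[=:]\s*\S+"),
]
MAX_CONTEXT_CHARS = 2000  # assumed context budget, in characters


def sanitize(text, budget=MAX_CONTEXT_CHARS):
    """Redact secret-looking substrings, then truncate to the budget."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    # Truncate only after redaction so a secret is never cut in half and missed.
    return text[:budget]


def dedup_key(event):
    """Assumed grouping key: same object and same reason means same incident."""
    return (event["namespace"], event["kind"], event["name"], event["reason"])


def collapse(events):
    """Collapse related events into one incident record per dedup key."""
    incidents = {}
    for event in events:
        key = dedup_key(event)
        incident = incidents.setdefault(key, {"key": key, "count": 0, "messages": []})
        incident["count"] += 1
        incident["messages"].append(sanitize(event["message"]))
    return list(incidents.values())


if __name__ == "__main__":
    sample = [
        {"namespace": "prod", "kind": "Pod", "name": "api-0",
         "reason": "BackOff", "message": "restart failed password=hunter2"},
        {"namespace": "prod", "kind": "Pod", "name": "api-0",
         "reason": "BackOff", "message": "restart failed again"},
    ]
    for incident in collapse(sample):
        print(incident["count"], incident["messages"])
```

Comparing the output of a script like this against what the Guardian actually recorded is one way to confirm that related alerts were merged and that no raw secret reached the analysis step.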
Containment
Containment keeps the Guardian helpful but safely bounded.
Containment steps:
- Preserve Human Approval: Do not allow any remediation step to execute without explicit human sign-off.
- Reduce Noise: Tune deduplication and escalation rules to prevent alert fatigue.
- Block Unsafe Context: Regularly audit the sanitization logic to prevent secret leakage.
- Treat Low Confidence as Review: Handle incidents with low AI confidence as high-priority human review items rather than automation failures.
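The approval and low-confidence rules above can be expressed as a small routing function. This is a hedged sketch under stated assumptions: the `ProposedAction` type, the `CONFIDENCE_THRESHOLD` value, the read-only verb allow-list, and the returned status strings are hypothetical and not taken from the Guardian itself. Note that nothing in the sketch executes a command; it only decides where a proposal goes.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff for "low confidence"
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}  # assumed allow-list


@dataclass
class ProposedAction:
    command: str                       # e.g. "kubectl get pods -n prod"
    confidence: float                  # model-reported confidence, 0.0 to 1.0
    approved_by: Optional[str] = None  # set only after explicit human sign-off


def is_read_only(action):
    """Naive check: a kubectl command whose verb is on the allow-list."""
    parts = action.command.split()
    return len(parts) >= 2 and parts[0] == "kubectl" and parts[1] in READ_ONLY_VERBS


def route(action):
    """Decide how a proposal is handled; this function never runs the command."""
    if action.confidence < CONFIDENCE_THRESHOLD:
        # Low confidence is a review item, not an automation failure.
        return "human_review_high_priority"
    if is_read_only(action):
        return "suggest"
    if action.approved_by is None:
        # Mutating step without sign-off stays blocked.
        return "awaiting_approval"
    return "approved_for_human_execution"


if __name__ == "__main__":
    print(route(ProposedAction("kubectl delete pod api-0", 0.95)))
    print(route(ProposedAction("kubectl get pods", 0.30)))
```

Checking confidence first means that even a harmless-looking suggestion is escalated when the model is unsure, which matches the rule that low confidence is handled as review rather than silently dropped.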
Pause and Predict: What automated guardrail would have prevented this incident entirely?