Incident Hook
Multiple warning signals fire after a controlled chaos drill or a real failure. On-call receives fragmented alerts with no clear priority or incident ownership. Manual triage burns time on duplicate noise while real impact grows.
Result: Tooling is not enough if it does not provide a single, normalized, and actionable picture of the incident.
Observed Symptoms
What the team sees first:
- Many alerts are technically true but operationally fragmented.
- Responders cannot tell if they are seeing one incident or many.
The “Alert Storm” (Raw signals):
[10:00:01] ⚠️ Prometheus: BackendHighLatency (develop)
[10:00:05] ⚠️ Flux: Kustomization/apps-develop is stalled
[10:00:10] ⚠️ K8s: Pod/backend-x7y2 is in CrashLoopBackOff
[10:00:15] ⚠️ Prometheus: BackendHighErrorRate (develop)
# ❌ Question: Is this 4 problems? No, it's 1 deployment mistake causing noise.
The problem is not a lack of detection; it is a lack of normalization.
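To make "normalization" concrete, here is a minimal sketch that folds the raw signals above into a single incident. The grouping rule (signals arriving within 60 seconds of the previous one belong to the same incident) is an assumption chosen for illustration, not the Guardian's actual correlation logic. The names `normalize` and `Incident` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Matches lines shaped like: "[10:00:01] <icon> Source: message"
LINE_RE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]\s+\S+\s+(\w+):\s+(.*)")


@dataclass
class Incident:
    started: datetime
    last_seen: datetime
    signals: list = field(default_factory=list)  # (source, message) pairs


def parse(line):
    """Return (timestamp, source, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Time-of-day only; a sketch that ignores logs crossing midnight.
    return datetime.strptime(m.group(1), "%H:%M:%S"), m.group(2), m.group(3)


def normalize(lines, window=timedelta(seconds=60)):
    """Group signals into incidents using an assumed time-window rule."""
    incidents = []
    for ts, source, message in sorted(filter(None, map(parse, lines))):
        if incidents and ts - incidents[-1].last_seen <= window:
            current = incidents[-1]  # close enough in time: same incident
        else:
            current = Incident(started=ts, last_seen=ts)
            incidents.append(current)
        current.last_seen = ts
        current.signals.append((source, message))
    return incidents


RAW_ALERTS = [
    "[10:00:01] ⚠️ Prometheus: BackendHighLatency (develop)",
    "[10:00:05] ⚠️ Flux: Kustomization/apps-develop is stalled",
    "[10:00:10] ⚠️ K8s: Pod/backend-x7y2 is in CrashLoopBackOff",
    "[10:00:15] ⚠️ Prometheus: BackendHighErrorRate (develop)",
]

incidents = normalize(RAW_ALERTS)
print(len(incidents), "incident(s),", len(incidents[0].signals), "signals")
```

A real correlator would likely also key on environment, namespace, or workload labels. Time proximity alone can merge unrelated failures that happen to overlap.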
Confusion Phase
At this point, “let the AI fix it” starts sounding attractive. That is the trap. The real questions are:
- How to reduce noise without giving the model unsafe write authority?
- How to keep useful context while redacting secrets and staying within budgets?
Guardian Contract (Inputs / Outputs / No-Go)
We define a strict contract for how the SRE Guardian interacts with our cluster:
Inputs:
- Kubernetes warning events and Flux conditions.
- Metrics snapshots (error rate, latency, etc.).
- Bounded pod logs with sensitive fields redacted.
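The "bounded, redacted logs" input can be sketched as follows. The size limits and regex patterns are illustrative assumptions and are far from exhaustive (for example, they do not cover JSON-quoted keys). The function names are hypothetical. The sketch redacts before truncating so that a cut cannot split a private key block and leak part of it.

```python
import re

MAX_LINES = 50    # assumed budget, tune per provider and cost limits
MAX_CHARS = 4000  # assumed budget

# key=value or key: value pairs whose key name suggests a secret
KEY_VALUE_RE = re.compile(
    r"(?i)([\w.-]*(?:password|passwd|token|secret|api[_-]?key)[\w.-]*)"
    r"(\s*[=:]\s*)\S+"
)
# HTTP-style bearer credentials
BEARER_RE = re.compile(r"(?i)\b(bearer)(\s+)[A-Za-z0-9._~+/=-]+")
# PEM private key blocks, possibly spanning many lines
PRIVATE_KEY_RE = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)


def redact(text):
    """Replace values that look like secrets with a placeholder."""
    text = PRIVATE_KEY_RE.sub("[REDACTED PRIVATE KEY]", text)
    text = KEY_VALUE_RE.sub(r"\1\2[REDACTED]", text)
    return BEARER_RE.sub(r"\1\2[REDACTED]", text)


def bounded_redacted_logs(raw, max_lines=MAX_LINES, max_chars=MAX_CHARS):
    """Redact the full text first, then keep only the most recent tail."""
    tail = redact(raw).splitlines()[-max_lines:]
    return "\n".join(tail)[-max_chars:]


print(bounded_redacted_logs("starting\nDB_PASSWORD=hunter2\nready", max_lines=2))
```

Pattern-based redaction is a safety net rather than a guarantee, so it works best alongside keeping secrets out of logs in the first place.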
Outputs:
- Structured incident summary and hypotheses.
- Proposed next runbook actions.
- Daily and weekly cluster health reports.
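One possible shape for the structured output is sketched below. The field names, the 0.0 to 1.0 confidence scale, and the example values are assumptions for illustration. The source only specifies that the output contains a summary, hypotheses, and proposed runbook actions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Hypothesis:
    statement: str
    confidence: float  # assumed scale: 0.0 (guess) to 1.0 (certain)


@dataclass
class IncidentSummary:
    title: str
    environment: str
    signals: list[str]
    hypotheses: list[Hypothesis] = field(default_factory=list)
    proposed_actions: list[str] = field(default_factory=list)  # text only, never executed
    requires_human_ack: bool = True


summary = IncidentSummary(
    title="Backend degraded in develop",
    environment="develop",
    signals=["BackendHighLatency", "BackendHighErrorRate"],
    hypotheses=[Hypothesis("A deployment mistake is causing all signals", 0.7)],
    proposed_actions=["Review the most recent deployment to develop"],
)

print(json.dumps(asdict(summary), indent=2))
```

Keeping proposed actions as plain text for a human to read, rather than as executable commands, lines up with the No-Go rules that follow.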
Not Allowed (No-Go):
- No direct mutation: kubectl apply, patch, or delete from AI output is forbidden.
- No raw secrets: Sending tokens, private keys, or passwords to an LLM provider is a security violation.
- No autonomous closing: Incidents cannot be resolved without human acknowledgment.
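Two of these rules can be enforced in code rather than by convention. The sketch below is a hypothetical guard, not the Guardian's actual implementation. It flags any AI-proposed kubectl command containing a mutating verb, and it refuses to resolve an incident that no human has acknowledged. The check is deliberately conservative: it scans every token, so a resource that happens to be named `delete` would also be flagged.

```python
import shlex
from dataclasses import dataclass
from typing import Optional

MUTATING_VERBS = {"apply", "patch", "delete"}  # the verbs named in the contract


def violates_no_go(command: str) -> bool:
    """Return True if an AI-proposed command looks like a kubectl mutation."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input is treated as unsafe
    if not tokens or tokens[0] != "kubectl":
        return False
    return any(token in MUTATING_VERBS for token in tokens[1:])


@dataclass
class TrackedIncident:
    incident_id: str
    status: str = "open"
    acknowledged_by: Optional[str] = None  # set by a human, never by the model


def resolve(incident: TrackedIncident) -> None:
    """Close an incident only after a human has acknowledged it."""
    if incident.acknowledged_by is None:
        raise PermissionError("incident requires human acknowledgment")
    incident.status = "resolved"


print(violates_no_go("kubectl delete pod backend-x7y2"))  # mutating verb
print(violates_no_go("kubectl get pods -n develop"))      # read-only
```

A stricter variant would allow only a short list of read-only verbs and reject everything else. That also covers mutating verbs the contract does not name explicitly.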
What AI Would Propose (Brave Junior):
- “Auto-remediate incidents directly from AI output.”
- “Send full raw logs and secrets to the LLM for better context.”
- “Resolve low-confidence incidents automatically to reduce the queue.”
Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?