The Incident: The Alert Storm

Incident Hook

Multiple warning signals fire after a controlled chaos drill or a real failure. On-call receives fragmented alerts with no clear priority or incident ownership. Manual triage burns time on duplicate noise while real impact grows.

Result: Tooling is not enough if it does not provide a single, normalized, and actionable picture of the incident.

Observed Symptoms

What the team sees first:

Many alerts are technically true but operationally fragmented.
Responders cannot tell if they are seeing one incident or many.

The “Alert Storm” (Raw signals):

[10:00:01] ⚠️ Prometheus: BackendHighLatency (develop)
[10:00:05] ⚠️ Flux: Kustomization/apps-develop is stalled
[10:00:10] ⚠️ K8s: Pod/backend-x7y2 is in CrashLoopBackOff
[10:00:15] ⚠️ Prometheus: BackendHighErrorRate (develop)
# ❌ Question: Are these 4 separate problems? No, it is 1 deployment mistake producing 4 signals.

The problem is not a lack of detection; it is a lack of normalization.

Confusion Phase

At this point, “let the AI fix it” starts sounding attractive. That is the trap. The real questions are:

How do we reduce noise without giving the model unsafe write authority?
How do we keep useful context while redacting secrets and staying within budgets?

Guardian Contract (Inputs / Outputs / No-Go)

We define a strict contract for how the SRE Guardian interacts with our cluster:

Inputs:

Kubernetes warning events and Flux conditions.
Metrics snapshots (error rate, latency, etc.).
Bounded pod logs with sensitive fields redacted.
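The "bounded" and "redacted" requirements on pod logs could look roughly like the sketch below. The regular expressions, the `[REDACTED]` placeholder, and the line and character limits are assumptions made for illustration. They are not a vetted secret-detection list and will miss formats such as quoted JSON keys. Redaction runs before truncation so that a cut cannot strip the marker that identifies a secret.

```python
import re

# Illustrative patterns only; a real deployment needs a reviewed list.
_SENSITIVE_PAIR = re.compile(
    # key=value or key: value where the key name looks sensitive
    r"(?i)([\w.-]*(?:password|passwd|token|secret|api[_-]?key)[\w.-]*)(\s*[=:]\s*)\S+"
)
_BEARER = re.compile(r"(?i)\b(bearer)(\s+)[A-Za-z0-9._~+/=-]+")
_PEM_BLOCK = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)


def redact(text):
    """Replace values that look like secrets with a fixed placeholder."""
    text = _PEM_BLOCK.sub("[REDACTED PRIVATE KEY]", text)
    text = _SENSITIVE_PAIR.sub(r"\1\2[REDACTED]", text)
    text = _BEARER.sub(r"\1\2[REDACTED]", text)
    return text


def bounded_redacted_logs(raw, max_lines=50, max_chars=4000):
    """Redact the full log first, then keep only the most recent lines
    and cap the total size."""
    tail = redact(raw).splitlines()[-max_lines:]
    return "\n".join(tail)[-max_chars:]


sample = "connecting to db\nDB_PASSWORD=hunter2 retrying\nAuthorization: Bearer abc123"
print(bounded_redacted_logs(sample, max_lines=2))
```

Keeping the tail rather than the head is a design choice: for a crashing pod, the last lines before exit usually carry the most relevant context.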

Outputs:

Structured incident summary and hypotheses.
Proposed next runbook actions.
Daily and weekly cluster health reports.
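One way to keep these outputs structured is to give the summary a fixed shape instead of free text. The sketch below is an assumed schema, not the Guardian's real one. The field names, the `confidence` range, and the example values (including the 0.8 score) are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Hypothesis:
    statement: str
    confidence: float  # assumed scale: 0.0 to 1.0, as reported by the model

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


@dataclass
class IncidentSummary:
    title: str
    scope: str
    signal_count: int
    hypotheses: list = field(default_factory=list)
    # Runbook steps described in words, for a human to review and run.
    proposed_actions: list = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


# Example values are illustrative, loosely based on the sample alert storm.
summary = IncidentSummary(
    title="Backend degraded after deployment",
    scope="develop",
    signal_count=4,
    hypotheses=[Hypothesis("A recent deployment broke the backend pod", 0.8)],
    proposed_actions=[
        "Inspect the stalled Kustomization",
        "Review logs of the crashing backend pod",
    ],
)
print(summary.to_json())
```

The schema has no field for executable commands. Proposed actions stay as descriptions, which keeps the output on the "propose" side of the contract.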

Not Allowed (No-Go):

No direct mutation: kubectl apply, patch, or delete from AI output is forbidden.
No raw secrets: Sending tokens, private keys, or passwords to an LLM provider is a security violation.
No autonomous closing: Incidents cannot be resolved without human acknowledgment.
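The No-Go rules can be enforced in code rather than left to prompt wording. The following is a minimal sketch under stated assumptions. `MUTATING_VERBS` lists the three verbs named above plus a few other state-changing kubectl verbs and is not exhaustive. `is_mutating`, `resolve_incident`, and `HumanAckRequired` are hypothetical names. The check errs toward blocking: any mutating verb anywhere in a kubectl command is flagged, and a command that cannot be parsed is flagged too.

```python
import shlex

# kubectl verbs that change cluster state. Illustrative, not exhaustive.
MUTATING_VERBS = {"apply", "patch", "delete", "create", "replace", "edit", "scale"}


def is_mutating(command):
    """Return True if `command` looks like a state-changing kubectl call.

    Conservative on purpose: a mutating verb in any position counts, so a
    harmless command may occasionally be blocked, but flags placed before
    the verb cannot hide it.
    """
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input is treated as unsafe
    if not tokens or tokens[0] != "kubectl":
        return False
    return any(tok in MUTATING_VERBS for tok in tokens[1:])


class HumanAckRequired(Exception):
    """Raised when an incident would be closed without a human sign-off."""


def resolve_incident(incident, acknowledged_by=None):
    """Return a resolved copy of `incident`, but only with a named human."""
    if not acknowledged_by:
        raise HumanAckRequired("a human must acknowledge before resolving")
    return {**incident, "status": "resolved", "acknowledged_by": acknowledged_by}


print(is_mutating("kubectl -n develop delete pod backend-x7y2"))
print(is_mutating("kubectl get pods -n develop"))
```

A deny-list like this is the weaker of two options. An allow-list of read-only verbs would be stricter, at the cost of rejecting anything new until it is reviewed.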

What AI Would Propose (Brave Junior):

“Auto-remediate incidents directly from AI output.”
“Send full raw logs and secrets to the LLM for better context.”
“Resolve low-confidence incidents automatically to reduce the queue.”

Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?

