Core Track: the guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • clusterrole.yaml
  • deployment.yaml

What You Will Produce

A reproducible lab result, a passing quiz, and evidence that you can operate safely during an incident.

Incident Hook

Multiple warning signals fire after a controlled chaos drill or a real failure. On-call receives fragmented alerts with no clear priority or incident ownership. Manual triage burns time on duplicate noise while real impact grows.

Result: Tooling is not enough if it does not provide a single, normalized, and actionable picture of the incident.

Observed Symptoms

What the team sees first:

  • Many alerts are technically true but operationally fragmented.
  • Responders cannot tell if they are seeing one incident or many.

The “Alert Storm” (Raw signals):

[10:00:01] ⚠️ Prometheus: BackendHighLatency (develop)
[10:00:05] ⚠️ Flux: Kustomization/apps-develop is stalled
[10:00:10] ⚠️ K8s: Pod/backend-x7y2 is in CrashLoopBackOff
[10:00:15] ⚠️ Prometheus: BackendHighErrorRate (develop)
# ❌ Question: Is this 4 problems? No, it's 1 deployment mistake causing noise.

The problem is not a lack of detection; it is a lack of normalization.

Confusion Phase

At this point, “let the AI fix it” starts sounding attractive. That is the trap. The real questions are:

  • How do we reduce noise without giving the model unsafe write authority?
  • How do we keep useful context while redacting secrets and budgets?

Guardian Contract (Inputs / Outputs / No-Go)

We define a strict contract for how the SRE Guardian interacts with our cluster:

Inputs:

  • Kubernetes warning events and Flux conditions.
  • Metrics snapshots (error rate, latency, etc.).
  • Bounded pod logs with sensitive fields redacted.

Outputs:

  • Structured incident summary and hypotheses.
  • Proposed next runbook actions.
  • Daily and weekly cluster health reports.

Not Allowed (No-Go):

  • No direct mutation: kubectl apply, patch, or delete from AI output is forbidden.
  • No raw secrets: Sending tokens, private keys, or passwords to an LLM provider is a security violation.
  • No autonomous closing: Incidents cannot be resolved without human acknowledgment.

What AI Would Propose (Brave Junior):

  • “Auto-remediate incidents directly from AI output.”
  • “Send full raw logs and secrets to the LLM for better context.”
  • “Resolve low-confidence incidents automatically to reduce the queue.”

Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?