Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • clusterrole.yaml (members only)
  • deployment.yaml (members only)

What You Will Produce

A reproducible lab result, quiz verification, and evidence of incident-safe operation.

Guardrails That Stop It

  • Propose vs. Mutate: AI proposes remediation; humans must approve and execute it.
  • Mandatory Redaction: Secrets and tokens must be redacted before any LLM call is made.
  • Context Budgets: Enforced limits on context size (e.g., 16 KB for alerts) to prevent cost overruns and noise.
  • Rate and Cost Limits: Default caps on LLM calls per hour to keep spend bounded.
  • Read-Only RBAC: The Guardian uses a read-oriented ClusterRole; it cannot mutate workloads.
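
The redaction and context-budget guardrails can be illustrated with a minimal Python sketch. It is not the Guardian's actual implementation: the regex patterns, the function names (redact, apply_budget, prepare_context), and the choice to keep the tail of oversized context are all assumptions made for illustration. Only the 16 KB figure comes from the list above.

```python
import re

# Hypothetical patterns; a real deployment would maintain a broader, tested set.
SECRET_PATTERNS = [
    # HTTP bearer tokens, e.g. "Authorization: Bearer abc.def"
    re.compile(r"(bearer)(\s+)[A-Za-z0-9._~+/=-]+", re.IGNORECASE),
    # key=value or key: value pairs whose key suggests a credential
    re.compile(r"(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+",
               re.IGNORECASE),
]

ALERT_CONTEXT_BUDGET_BYTES = 16 * 1024  # the 16 KB alert budget named in the list


def redact(text: str) -> str:
    """Replace credential-looking values with a fixed marker."""
    for pattern in SECRET_PATTERNS:
        # Keep the key and separator (groups 1 and 2), drop the value.
        text = pattern.sub(r"\1\2[REDACTED]", text)
    return text


def apply_budget(text: str, max_bytes: int = ALERT_CONTEXT_BUDGET_BYTES) -> str:
    """Trim text to at most max_bytes of UTF-8.

    Assumption: the newest content (the tail) is the most useful, as with logs.
    """
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character split by the cut.
    return data[-max_bytes:].decode("utf-8", errors="ignore")


def prepare_context(raw: str) -> str:
    """Redact first, then budget, so truncation never cuts a secret in half."""
    return apply_budget(redact(raw))
```

Redacting before truncating is deliberate in this sketch: if the budget were applied first, a cut could leave a partial secret that no longer matches any pattern.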

Detection and Analysis Pipeline

The SRE Guardian works in four distinct stages:

  1. Detect: Picks up signals from real-time events, Flux stalled conditions, and periodic scanners (Pods, PVCs, Certs, etc.).
  2. Analyze: Collects pod state, logs, and metrics, then sanitizes and budgets the context.
  3. Decide: Creates or updates an incident record in SQLite, applies deduplication, and attaches hypotheses.
  4. Notify: Sends structured Slack alerts and exposes results via API and CLI.
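
The Decide stage's create-or-update behavior with deduplication might look roughly like the following sketch, which uses Python's standard sqlite3 module. The table layout, the fingerprint fields (kind, namespace, name, reason), and the function names are hypothetical; the Guardian's real schema is not shown in this chapter.

```python
import hashlib
import sqlite3
import time

# Hypothetical schema: one row per distinct failure, keyed by a fingerprint.
SCHEMA = """
CREATE TABLE IF NOT EXISTS incidents (
    fingerprint TEXT PRIMARY KEY,
    kind        TEXT NOT NULL,
    namespace   TEXT NOT NULL,
    name        TEXT NOT NULL,
    reason      TEXT NOT NULL,
    occurrences INTEGER NOT NULL,
    first_seen  REAL NOT NULL,
    last_seen   REAL NOT NULL
)
"""


def fingerprint(kind: str, namespace: str, name: str, reason: str) -> str:
    """Stable identity for 'the same failure on the same object'."""
    key = "|".join((kind, namespace, name, reason))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def record_incident(conn, kind, namespace, name, reason, now=None) -> int:
    """Insert a new incident or bump an existing one; return its occurrence count."""
    now = time.time() if now is None else now
    fp = fingerprint(kind, namespace, name, reason)
    row = conn.execute(
        "SELECT occurrences FROM incidents WHERE fingerprint = ?", (fp,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO incidents VALUES (?, ?, ?, ?, ?, 1, ?, ?)",
            (fp, kind, namespace, name, reason, now, now),
        )
        occurrences = 1
    else:
        # Duplicate signal: update the existing record instead of adding a row.
        occurrences = row[0] + 1
        conn.execute(
            "UPDATE incidents SET occurrences = ?, last_seen = ? WHERE fingerprint = ?",
            (occurrences, now, fp),
        )
    conn.commit()
    return occurrences
```

For example, after conn = sqlite3.connect(":memory:") and conn.execute(SCHEMA), recording the same CrashLoopBackOff on the same Pod twice yields counts 1 and then 2 while the table still holds a single row.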

Escalation Model

To prevent alert storms, we use a tiered escalation model:

  • Fresh: First occurrence, base cooldown of 30 minutes.
  • Recurring: Second occurrence within 6 hours; sends an escalation notice.
  • Persistent: Third or later occurrence; emits a hardening alert with exponential backoff.
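
As a rough illustration of the tiers above, the sketch below classifies an incident by how often it has occurred and computes a cooldown. The 30-minute base cooldown and the 6-hour window come from the list. The doubling schedule, the 24-hour cap, the reuse of the 6-hour window for the Persistent tier, and all function names are assumptions.

```python
from datetime import datetime, timedelta

BASE_COOLDOWN = timedelta(minutes=30)   # from the tier list
RECURRENCE_WINDOW = timedelta(hours=6)  # from the tier list
MAX_COOLDOWN = timedelta(hours=24)      # assumption: cap on the backoff


def occurrences_in_window(timestamps, now: datetime) -> int:
    """Count occurrences that fall inside the recurrence window ending at now."""
    return sum(1 for t in timestamps if now - t <= RECURRENCE_WINDOW)


def classify(occurrences: int) -> str:
    """Map an occurrence count to a tier name."""
    if occurrences < 1:
        raise ValueError("an incident has at least one occurrence")
    if occurrences == 1:
        return "fresh"
    if occurrences == 2:
        return "recurring"
    return "persistent"


def cooldown(occurrences: int) -> timedelta:
    """Cooldown before the next notification for this incident."""
    if classify(occurrences) != "persistent":
        return BASE_COOLDOWN
    # Assumption: double the base cooldown for each occurrence from the third on.
    doublings = occurrences - 2
    return min(BASE_COOLDOWN * (2 ** doublings), MAX_COOLDOWN)
```

Under these assumptions the third occurrence waits 1 hour, the fourth 2 hours, and the cooldown stops growing once it reaches the cap.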

Safe Workflow (Step-by-Step)

  1. Ingest Signal: The Guardian detects a failure and creates a normalized incident record.
  2. Sanitize Context: Sensitive data is redacted before analysis.
  3. Generate Hypotheses: AI provides ranked likely causes and confidence scores.
  4. Human Approval: An operator reviews the hypotheses and suggested actions.
  5. Execute Action: The human performs the fix and records the decision.
  6. Verify & Resolve: Confirm the fix works and resolve the incident.
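
The six steps above can be read as a small state machine in which the approval and execution steps are reserved for humans. The sketch below is one way to enforce that ordering and leave an audit trail; the State names, the Incident class, and the is_human flag are hypothetical and stand in for whatever identity checks a real system would use.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    INGESTED = "ingested"
    SANITIZED = "sanitized"
    HYPOTHESIZED = "hypothesized"
    APPROVED = "approved"
    EXECUTED = "executed"
    RESOLVED = "resolved"


# Each state may only advance to the next step of the workflow.
NEXT = {
    State.INGESTED: State.SANITIZED,
    State.SANITIZED: State.HYPOTHESIZED,
    State.HYPOTHESIZED: State.APPROVED,
    State.APPROVED: State.EXECUTED,
    State.EXECUTED: State.RESOLVED,
}

# Propose vs. Mutate: approving and executing are never done by the AI.
HUMAN_ONLY = {State.APPROVED, State.EXECUTED}


@dataclass
class Incident:
    id: str
    state: State = State.INGESTED
    audit: list = field(default_factory=list)

    def advance(self, to: State, actor: str, is_human: bool, note: str = "") -> None:
        """Move to the next step, rejecting skipped steps and non-human approvals."""
        if NEXT.get(self.state) is not to:
            raise ValueError(f"cannot move from {self.state.value} to {to.value}")
        if to in HUMAN_ONLY and not is_human:
            raise PermissionError(f"{to.value} requires a human operator")
        self.state = to
        # Record who made the decision and why.
        self.audit.append((to.value, actor, note))
```

In this sketch an automated actor can carry an incident as far as HYPOTHESIZED, but any attempt to approve or execute without is_human=True raises PermissionError and leaves the state unchanged.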

This builds on: Controlled chaos (Chapter 12). The Guardian routes incidents discovered by drills.

This enables: 24/7 Production SRE (Chapter 14). The Guardian operates within the on-call model.