Guardrails That Stop It
- Propose vs. Mutate: AI proposes remediation; humans must approve and execute it.
- Mandatory Redaction: No LLM call can occur without first redacting secrets and tokens.
- Context Budgets: Enforced limits on context size (e.g., 16 KB for alerts) to prevent cost overruns and noisy prompts.
- Rate and Cost Limits: Default caps on LLM calls per hour to ensure economic safety.
- Read-Only RBAC: The Guardian uses a read-oriented ClusterRole; it cannot mutate workloads.
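The redaction and context-budget guardrails above can be sketched as a small pre-flight step that every LLM call must pass through. This is a minimal illustration, not the tool's actual implementation: the secret patterns, helper names, and 16 KB limit application are assumptions (the limit itself comes from the guardrail above).

```python
import re

# Patterns for obvious secrets; illustrative, not exhaustive.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|token|secret|api[_-]?key)\s*[:=]\s*\S+"),
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped
]

ALERT_CONTEXT_BUDGET = 16 * 1024  # 16 KB budget for alert context


def redact(text: str) -> str:
    """Replace anything matching a secret pattern before it leaves the cluster."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def budget(text: str, limit: int = ALERT_CONTEXT_BUDGET) -> str:
    """Truncate context to the byte budget, marking the cut explicitly."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return text
    return data[:limit].decode("utf-8", errors="ignore") + "\n[TRUNCATED]"


def prepare_context(raw: str) -> str:
    # Redaction always runs first: no LLM call sees unsanitized input.
    return budget(redact(raw))
```

Ordering matters here: redacting before truncating guarantees a secret can never survive simply because it sat past the cut-off point.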
Detection and Analysis Pipeline
The SRE Guardian works in four distinct stages:
- Detect: From real-time events, Flux stalled conditions, and periodic scanners (Pods, PVCs, Certs, etc.).
- Analyze: Collects pod state, logs, and metrics, then sanitizes and budgets the context.
- Decide: Creates/updates an incident record in SQLite, applies deduplication, and attaches Hypotheses.
- Notify: Sends structured Slack alerts and exposes results via API and CLI.
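The Decide stage's deduplicated SQLite record can be sketched as an upsert keyed by an incident fingerprint. The schema, fingerprint format, and function names below are assumptions for illustration, not the Guardian's actual data model:

```python
import sqlite3
import time


def open_db(path=":memory:"):
    """Create the (assumed) incidents table if it does not exist."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS incidents (
        fingerprint TEXT PRIMARY KEY,
        occurrences INTEGER NOT NULL,
        first_seen  REAL NOT NULL,
        last_seen   REAL NOT NULL,
        hypotheses  TEXT)""")
    return db


def record_incident(db, namespace, kind, reason, hypotheses=""):
    """Upsert an incident: new signals insert a row, repeats bump the counter.

    Returns the occurrence count after recording.
    """
    fp = f"{namespace}/{kind}/{reason}"  # assumed dedup key
    now = time.time()
    db.execute("""INSERT INTO incidents VALUES (?, 1, ?, ?, ?)
        ON CONFLICT(fingerprint) DO UPDATE SET
            occurrences = occurrences + 1,
            last_seen   = excluded.last_seen,
            hypotheses  = excluded.hypotheses""",
                (fp, now, now, hypotheses))
    db.commit()
    row = db.execute("SELECT occurrences FROM incidents WHERE fingerprint = ?",
                     (fp,)).fetchone()
    return row[0]
```

Keying on a stable fingerprint rather than on each raw event is what lets the same crash-looping pod produce one incident with a rising counter instead of a fresh alert per restart.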
Escalation Model
To prevent alert storms, we use a tiered escalation model:
- Fresh: First occurrence, base cooldown of 30 minutes.
- Recurring: Second occurrence within 6 hours; sends an escalation notice.
- Persistent: Third+ occurrence; emits a hardening alert with exponential backoff.
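The tiers above can be sketched as two small functions: one maps the occurrence count to a tier, the other derives a cooldown with exponential backoff for persistent incidents. The 30-minute base comes from the tiers above; the doubling schedule and the 24-hour cap are assumptions for illustration:

```python
BASE_COOLDOWN_MIN = 30        # base cooldown from the "Fresh" tier
RECURRENCE_WINDOW_MIN = 6 * 60  # window that makes a repeat "Recurring"
MAX_COOLDOWN_MIN = 24 * 60    # assumed cap on the backoff


def tier(occurrences: int) -> str:
    """Map occurrence count to the escalation tier."""
    if occurrences <= 1:
        return "fresh"
    if occurrences == 2:
        return "recurring"
    return "persistent"


def cooldown_minutes(occurrences: int) -> int:
    """Cooldown before the next notification for this incident."""
    if occurrences <= 2:
        return BASE_COOLDOWN_MIN
    # Third+ occurrence: double the cooldown each time, up to the cap.
    return min(BASE_COOLDOWN_MIN * 2 ** (occurrences - 2), MAX_COOLDOWN_MIN)
```

So a third occurrence waits 60 minutes, a fourth 120, and a chronically flapping incident settles at the cap instead of paging every half hour.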
Safe Workflow (Step-by-Step)
- Ingest Signal: The Guardian detects a failure and creates a normalized incident record.
- Sanitize Context: Sensitive data is redacted before analysis.
- Generate Hypotheses: AI provides ranked likely causes and confidence scores.
- Human Approval: An operator reviews the hypotheses and suggested actions.
- Execute Action: The human performs the fix and records the decision.
- Verify & Resolve: Confirm the fix works and resolve the incident.
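The workflow above is effectively a small state machine whose one privileged transition (into execution) demands a named human approver. A minimal sketch, with state names and the `Incident` class assumed for illustration:

```python
from enum import Enum


class State(Enum):
    DETECTED = "detected"
    ANALYZED = "analyzed"
    AWAITING_APPROVAL = "awaiting_approval"
    EXECUTING = "executing"
    RESOLVED = "resolved"


# Legal forward transitions; anything else is rejected.
TRANSITIONS = {
    State.DETECTED: {State.ANALYZED},
    State.ANALYZED: {State.AWAITING_APPROVAL},
    State.AWAITING_APPROVAL: {State.EXECUTING},
    State.EXECUTING: {State.RESOLVED},
    State.RESOLVED: set(),
}


class Incident:
    def __init__(self):
        self.state = State.DETECTED
        self.approver = None

    def advance(self, to, approver=None):
        if to not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {to}")
        if to is State.EXECUTING:
            # Propose vs. Mutate: execution is gated on a recorded human decision.
            if not approver:
                raise PermissionError("human approval required before execution")
            self.approver = approver
        self.state = to
```

Encoding the approval as a required argument on the transition (rather than a flag checked later) makes "AI proposes, human executes" structurally impossible to skip.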
This builds on: Controlled chaos (Chapter 12) — the guardian routes incidents discovered by drills.
This enables: 24/7 Production SRE (Chapter 14) — the guardian operates within the on-call model.