Guardrails That Stop It
- Propose vs. Mutate: AI proposes remediation; humans must approve and execute it.
- Mandatory Redaction: No LLM call can occur without first redacting secrets and tokens.
- Context Budgets: Enforced limits on context size (e.g., 16 KB for alerts) to prevent cost overruns and noisy prompts.
- Rate and Cost Limits: Default caps on LLM calls per hour to ensure economic safety.
- Read-Only RBAC: The Guardian uses a read-oriented ClusterRole; it cannot mutate workloads.
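The redaction and context-budget guardrails above can be sketched as a small pre-flight step that every LLM call must pass through. This is a minimal illustration, not the tool's actual implementation: the secret patterns, helper names, and 16 KB limit application are assumptions (the limit itself comes from the guardrail above).

```python
import re

# Patterns for obvious secrets; illustrative, not exhaustive.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|token|secret|api[_-]?key)\s*[:=]\s*\S+"),
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped
]

ALERT_CONTEXT_BUDGET = 16 * 1024  # 16 KB budget for alert context


def redact(text: str) -> str:
    """Replace anything matching a secret pattern before it leaves the cluster."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def budget(text: str, limit: int = ALERT_CONTEXT_BUDGET) -> str:
    """Truncate context to the byte budget, marking the cut explicitly."""
    data = text.encode("utf-8")
    if len(data) <= limit:
        return text
    return data[:limit].decode("utf-8", errors="ignore") + "\n[TRUNCATED]"


def prepare_context(raw: str) -> str:
    # Redaction always runs first: no LLM call sees unsanitized input.
    return budget(redact(raw))
```

Ordering matters here: redacting before truncating guarantees a secret can never survive simply because it sat past the cut-off point.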
Detection and Analysis Pipeline
The SRE Guardian works in four distinct stages:
- Detect: From real-time events, Flux stalled conditions, and periodic scanners (Pods, PVCs, Certs, etc.).
- Analyze: Collects pod state, logs, and metrics, then sanitizes and budgets the context.
- Decide: Creates/updates an incident record in SQLite, applies deduplication, and attaches Hypotheses.
- Notify: Sends structured Slack alerts and exposes results via API and CLI.
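The Decide stage's deduplicated SQLite record can be sketched as an upsert keyed by an incident fingerprint. The schema, fingerprint format, and function names below are assumptions for illustration, not the Guardian's actual data model:

```python
import sqlite3
import time


def open_db(path=":memory:"):
    """Create the (assumed) incidents table if it does not exist."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS incidents (
        fingerprint TEXT PRIMARY KEY,
        occurrences INTEGER NOT NULL,
        first_seen  REAL NOT NULL,
        last_seen   REAL NOT NULL,
        hypotheses  TEXT)""")
    return db


def record_incident(db, namespace, kind, reason, hypotheses=""):
    """Upsert an incident: new signals insert a row, repeats bump the counter.

    Returns the occurrence count after recording.
    """
    fp = f"{namespace}/{kind}/{reason}"  # assumed dedup key
    now = time.time()
    db.execute("""INSERT INTO incidents VALUES (?, 1, ?, ?, ?)
        ON CONFLICT(fingerprint) DO UPDATE SET
            occurrences = occurrences + 1,
            last_seen   = excluded.last_seen,
            hypotheses  = excluded.hypotheses""",
                (fp, now, now, hypotheses))
    db.commit()
    row = db.execute("SELECT occurrences FROM incidents WHERE fingerprint = ?",
                     (fp,)).fetchone()
    return row[0]
```

Keying on a stable fingerprint rather than on each raw event is what lets the same crash-looping pod produce one incident with a rising counter instead of a fresh alert per restart.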
Escalation Model
To prevent alert storms, we use a tiered escalation model:
- Fresh: First occurrence, base cooldown of 30 minutes.
- Recurring: Second occurrence within 6 hours; sends an escalation notice.
- Persistent: Third+ occurrence; emits a hardening alert with exponential backoff.
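The tiers above can be sketched as two small functions: one maps the occurrence count to a tier, the other derives a cooldown with exponential backoff for persistent incidents. The 30-minute base comes from the tiers above; the doubling schedule and the 24-hour cap are assumptions for illustration:

```python
BASE_COOLDOWN_MIN = 30        # base cooldown from the "Fresh" tier
RECURRENCE_WINDOW_MIN = 6 * 60  # window that makes a repeat "Recurring"
MAX_COOLDOWN_MIN = 24 * 60    # assumed cap on the backoff


def tier(occurrences: int) -> str:
    """Map occurrence count to the escalation tier."""
    if occurrences <= 1:
        return "fresh"
    if occurrences == 2:
        return "recurring"
    return "persistent"


def cooldown_minutes(occurrences: int) -> int:
    """Cooldown before the next notification for this incident."""
    if occurrences <= 2:
        return BASE_COOLDOWN_MIN
    # Third+ occurrence: double the cooldown each time, up to the cap.
    return min(BASE_COOLDOWN_MIN * 2 ** (occurrences - 2), MAX_COOLDOWN_MIN)
```

So a third occurrence waits 60 minutes, a fourth 120, and a chronically flapping incident settles at the cap instead of paging every half hour.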
Safe Workflow (Step-by-Step)
- Ingest Signal: The Guardian detects a failure and creates a normalized incident record.
- Sanitize Context: Sensitive data is redacted before analysis.
- Generate Hypotheses: AI provides ranked likely causes and confidence scores.
- Human Approval: An operator reviews the hypotheses and suggested actions.
- Execute Action: The human performs the fix and records the decision.
- Verify & Resolve: Confirm the fix works and resolve the incident.
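The workflow above is effectively a small state machine whose one privileged transition (into execution) demands a named human approver. A minimal sketch, with state names and the `Incident` class assumed for illustration:

```python
from enum import Enum


class State(Enum):
    DETECTED = "detected"
    ANALYZED = "analyzed"
    AWAITING_APPROVAL = "awaiting_approval"
    EXECUTING = "executing"
    RESOLVED = "resolved"


# Legal forward transitions; anything else is rejected.
TRANSITIONS = {
    State.DETECTED: {State.ANALYZED},
    State.ANALYZED: {State.AWAITING_APPROVAL},
    State.AWAITING_APPROVAL: {State.EXECUTING},
    State.EXECUTING: {State.RESOLVED},
    State.RESOLVED: set(),
}


class Incident:
    def __init__(self):
        self.state = State.DETECTED
        self.approver = None

    def advance(self, to, approver=None):
        if to not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {to}")
        if to is State.EXECUTING:
            # Propose vs. Mutate: execution is gated on a recorded human decision.
            if not approver:
                raise PermissionError("human approval required before execution")
            self.approver = approver
        self.state = to
```

Encoding the approval as a required argument on the transition (rather than a flag checked later) makes "AI proposes, human executes" structurally impossible to skip.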
This builds on: Controlled chaos (Chapter 12) — the guardian routes incidents discovered by drills.
This enables: 24/7 Production SRE (Chapter 14) — the guardian operates within the on-call model.