Chapter 13: AI-Assisted SRE Guardian
Why This Chapter Exists
Chaos testing and alerting generate noise unless incidents are normalized and prioritized. This chapter defines an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without autonomous production changes.
Incident Hook
Multiple warning signals fire after a controlled chaos drill. On-call receives fragmented alerts with no clear priority or incident ownership. Manual triage burns time on duplicate noise while real impact grows. The guardian workflow turns these raw signals into structured, actionable incident context.
What AI Would Propose (Brave Junior)
- “Auto-remediate incidents directly from AI output.”
- “Send full raw logs/secrets to LLM for better context.”
- “Resolve low-confidence incidents automatically to reduce queue.”
Why this sounds reasonable:
- Reduces immediate on-call load
- Looks faster and more autonomous
Why This Is Dangerous
- Autonomous write-back can apply unsafe changes at runtime.
- Unsanitized context leaks secrets and violates policy and compliance requirements.
- Low-confidence automation hides risk instead of reducing it.
Implementation Scope
This chapter uses a standalone guardian service pattern integrated with the platform:
- Kubernetes event handlers for warnings and Flux conditions
- scanner loops for pods, PVCs, certificates, and endpoints
- structured LLM analysis with strict JSON schema output
- incident lifecycle storage (SQLite recommended)
- confidence-based human escalation
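The moving parts above can be sketched as a minimal guardian skeleton. This is a hypothetical shape, not the real service's API; class names, method names, and the threshold value are illustrative:

```python
# Hypothetical guardian skeleton: signals from Kubernetes event
# handlers and scanner loops are normalized, analyzed, persisted,
# and flagged for human review when confidence is low.
class Guardian:
    CONFIDENCE_THRESHOLD = 0.7  # illustrative: below this, force human review

    def __init__(self, analyzer, store, notifier):
        self.analyzer = analyzer    # LLM-backed, returns a structured dict
        self.store = store          # incident lifecycle storage (e.g., SQLite)
        self.notifier = notifier    # Slack / OpsGenie webhooks

    def handle(self, signal: dict) -> dict:
        incident = self.store.upsert(signal)        # dedupe + persist
        analysis = self.analyzer.analyze(incident)  # strict-JSON analysis
        incident.update(analysis)
        if analysis["confidence"] < self.CONFIDENCE_THRESHOLD:
            incident["needs_human_review"] = True
        self.notifier.send(incident)                # propose only, never act
        return incident
```

The key property is that `handle` ends in a notification, never in a mutation of cluster state.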
Guardian Contract (Inputs / Outputs / Not Allowed)
Inputs:
- Kubernetes warning events and Flux conditions
- metrics snapshots (error-rate, latency, saturation indicators)
- bounded log context with sensitive fields redacted
Outputs:
- structured incident summary
- ranked likely causes
- proposed next runbook actions with confidence score
Not allowed:
- direct workload mutation (`kubectl apply/delete/patch`) from AI output
- sending raw secrets/tokens/private keys to LLM providers
- automatic incident resolve/close without human acknowledgement
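The contract's structured output can be enforced with a strict parser so a malformed LLM response is rejected instead of acted on. A minimal sketch, assuming illustrative field names:

```python
import json

# Illustrative strict schema for guardian LLM output: any response
# with missing or extra keys, wrong types, or an out-of-range
# confidence is rejected before it reaches the incident record.
REQUIRED = {
    "summary": str,
    "likely_causes": list,
    "next_actions": list,
    "confidence": float,
}

def parse_analysis(raw: str) -> dict:
    data = json.loads(raw)
    if set(data) != set(REQUIRED):
        raise ValueError(f"unexpected keys: {sorted(set(data) ^ set(REQUIRED))}")
    for key, typ in REQUIRED.items():
        if not isinstance(data[key], typ):
            raise TypeError(f"{key} must be {typ.__name__}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return data
```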
Guardian Responsibilities
- Detect:
- Kubernetes Warning events
- Flux stalled conditions
- periodic scanner findings
- Analyze:
- collect structured context
- sanitize sensitive data
- enforce context budget
- call LLM for structured root-cause hypotheses
- Decide:
- create/update incident record
- deduplicate repeated noise
- escalate recurring/persistent incidents
- Notify:
- send structured alert
- expose incident APIs for ack/resolve
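The Decide step's deduplication can be sketched as fingerprint-based counting, where repeated signals update one incident and a recurrence threshold trips escalation. Field names and the threshold are assumptions:

```python
import hashlib

# Sketch of noise deduplication: signals sharing a fingerprint
# update one incident record instead of creating duplicates, and
# an escalation flag trips once the incident recurs enough.
ESCALATE_AFTER = 3  # illustrative recurrence threshold

def fingerprint(signal: dict) -> str:
    key = f"{signal['namespace']}/{signal['kind']}/{signal['reason']}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

def record(incidents: dict, signal: dict) -> dict:
    fp = fingerprint(signal)
    inc = incidents.setdefault(fp, {"count": 0, "escalated": False})
    inc["count"] += 1
    if inc["count"] >= ESCALATE_AFTER:
        inc["escalated"] = True
    return inc
```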
Guardrails That Stop It
- AI proposes; humans approve remediation.
- No autonomous write-back to production workloads.
- Confidence below threshold requires explicit human review.
- Secret/token redaction is mandatory before LLM calls.
- Rate and cost limits are mandatory.
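The mandatory redaction guardrail might look like this. The patterns are illustrative only; a real deployment needs a vetted, tested rule set:

```python
import re

# Sketch of mandatory redaction before any LLM call: strip
# key=value credentials, PEM private keys, and JWT-shaped tokens.
PATTERNS = [
    re.compile(r"(?i)(password|token|secret|api[_-]?key)\s*[:=]\s*\S+"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?"
               r"-----END [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped
]

def sanitize(text: str) -> str:
    for pat in PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text
```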
Safe Workflow (Step-by-Step)
- Ingest incident signals (events, metrics, logs) and normalize into one record.
- Sanitize context and enforce token/context budgets before any LLM call.
- Generate structured recommendations (summary, likely cause, next actions) only.
- Route the recommendation through the human approval gate.
- Execute selected runbook step and record decision/evidence in incident lifecycle.
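Step 2's context budget can be sketched as newest-first trimming under a crude 4-characters-per-token heuristic; a real implementation would use the provider's tokenizer:

```python
# Sketch of a context budget: keep the newest log lines that fit
# within a rough token budget before the LLM call. 4 chars/token
# is a crude heuristic, not a real tokenizer.
MAX_TOKENS = 2000  # illustrative budget

def trim_context(lines: list[str], max_tokens: int = MAX_TOKENS) -> list[str]:
    kept, used = [], 0
    for line in reversed(lines):       # newest lines first
        cost = max(1, len(line) // 4)
        if used + cost > max_tokens:
            break
        kept.append(line)
        used += cost
    return list(reversed(kept))        # restore chronological order
```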
Approval Gates
- AI suggests.
- Human selects allowed action.
- Runbook step executes in controlled scope.
- Post-action evidence is reviewed before next step.
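The gate itself can be reduced to a whitelist plus an acknowledgement check; the action names here are illustrative, not a real runbook catalog:

```python
# Sketch of the approval gate: only pre-approved runbook actions can
# be selected, and nothing executes without human acknowledgement.
ALLOWED_ACTIONS = {"restart-deployment", "scale-hpa", "rollback-helmrelease"}

def approve(proposed: str, human_ack: bool) -> str:
    if proposed not in ALLOWED_ACTIONS:
        raise PermissionError(f"action not in runbook whitelist: {proposed}")
    if not human_ack:
        raise PermissionError("human acknowledgement required before execution")
    return proposed  # safe to hand to the controlled runbook executor
```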
Guardian Deployment Architecture
The guardian runs as `k8s-ai-monitor`, a single-replica deployment in the `observability` namespace.
Deployment Specification
- Image: `ghcr.io/ldbl/k8s-ai-monitor:latest`
- Replicas: 1 (singleton; only one instance should process events)
- Namespace: `observability`
Environment Configuration
| Variable | Purpose |
|---|---|
| `CLUSTER_NAME` | Identifies the cluster in incident reports |
| `WATCH_NAMESPACES` | Namespaces to monitor (e.g., `production`) |
| `NON_PROD_NAMESPACES` | Non-production namespaces (reduced alert severity) |
| `EXCLUDE_NAMESPACES` | Namespaces to ignore |
LLM Integration
- `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` injected from a Kubernetes Secret
- Rate and cost limits enforced in guardian config
- Context sanitization runs before every LLM call
Prometheus Connection
- Internal service URL for metrics queries
- Used for context enrichment (error rates, latency percentiles during incidents)
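Enrichment via the Prometheus HTTP API can be sketched as building an instant-query URL for an error-rate expression. The service URL and metric name are assumptions; the real guardian would GET this URL and parse the JSON response:

```python
from urllib.parse import urlencode

# Sketch of context enrichment: build a Prometheus instant-query URL
# for the 5xx error rate in a namespace. The service URL and metric
# name (http_requests_total) are assumptions for illustration.
PROM_URL = "http://prometheus.observability.svc:9090"

def error_rate_query(namespace: str) -> str:
    expr = (
        f'sum(rate(http_requests_total{{namespace="{namespace}",code=~"5.."}}[5m]))'
        f' / sum(rate(http_requests_total{{namespace="{namespace}"}}[5m]))'
    )
    return f"{PROM_URL}/api/v1/query?{urlencode({'query': expr})}"
```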
Storage
- SQLite persistence via a 2Gi PVC mounted at `/data`
- Stores incident records, analysis history, and LLM usage tracking
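A possible shape for the incident lifecycle schema; column names are illustrative, and the guardian would open a file under `/data` instead of an in-memory database:

```python
import sqlite3

# Sketch of the incident lifecycle store: incidents plus LLM usage
# tracking. Column names are illustrative; ":memory:" is for the
# sketch, the guardian would use a path on the /data PVC.
def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS incidents (
            id          INTEGER PRIMARY KEY,
            fingerprint TEXT UNIQUE,
            summary     TEXT,
            confidence  REAL,
            status      TEXT DEFAULT 'open',  -- open/acknowledged/resolved
            created_at  TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS llm_usage (
            id          INTEGER PRIMARY KEY,
            incident_id INTEGER REFERENCES incidents(id),
            tokens      INTEGER,
            cost_usd    REAL
        );
    """)
    return conn
```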
RBAC (ClusterRole)
Read-only access to:
- Pods, Events, Deployments, HPAs
- CNPG clusters
- Flux Kustomizations and HelmReleases
No write permissions. Guardian observes; humans act.
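A read-only ClusterRole sketch matching the list above; the role name is an assumption, and every rule is deliberately limited to `get`, `list`, `watch`:

```yaml
# Sketch only: read-only ClusterRole for the guardian. No create,
# update, patch, or delete verbs anywhere, so it cannot mutate state.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-ai-monitor-readonly   # assumption
rules:
  - apiGroups: [""]
    resources: [pods, events]
    verbs: [get, list, watch]
  - apiGroups: [apps]
    resources: [deployments]
    verbs: [get, list, watch]
  - apiGroups: [autoscaling]
    resources: [horizontalpodautoscalers]
    verbs: [get, list, watch]
  - apiGroups: [postgresql.cnpg.io]
    resources: [clusters]
    verbs: [get, list, watch]
  - apiGroups: [kustomize.toolkit.fluxcd.io, helm.toolkit.fluxcd.io]
    resources: [kustomizations, helmreleases]
    verbs: [get, list, watch]
```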
Endpoints
| Endpoint | Purpose |
|---|---|
| `/healthz` | Health check |
| `/state` | Current guardian state and scan results |
| `/reports` | Incident reports and analysis |
Integrations
- Slack: Webhook URL for incident notifications
- OpsGenie: Webhook URL for on-call escalation
Resource Limits
| Resource | Request | Limit |
|---|---|---|
| CPU | 100m | 1000m |
| Memory | 256Mi | 1Gi |
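The deployment details from the sections above can be consolidated into one manifest sketch; the Secret, PVC, and label names are assumptions:

```yaml
# Sketch only: consolidates the image, replica, namespace, env,
# storage, and resource settings above. Secret and PVC names are
# assumptions, not the platform's actual manifest values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-ai-monitor
  namespace: observability
spec:
  replicas: 1   # singleton: only one instance should process events
  selector:
    matchLabels:
      app: k8s-ai-monitor
  template:
    metadata:
      labels:
        app: k8s-ai-monitor
    spec:
      containers:
        - name: k8s-ai-monitor
          image: ghcr.io/ldbl/k8s-ai-monitor:latest
          env:
            - name: CLUSTER_NAME
              value: prod-cluster            # assumption
            - name: WATCH_NAMESPACES
              value: production
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-llm   # assumption
                  key: api-key               # assumption
          resources:
            requests: { cpu: 100m, memory: 256Mi }
            limits: { cpu: 1000m, memory: 1Gi }
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: k8s-ai-monitor-data   # assumption (2Gi PVC)
```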
Integration Map
- Chapter 12: Controlled Chaos as primary incident signal source.
- Chapter 10: Observability for evidence correlation.
- Chapter 14: 24/7 Production SRE for on-call lifecycle integration.
- Platform GitOps manifests for runtime context.
Lab Files
- `lab.md`
- `runbook-guardian.md`
- `quiz.md`
Done When
- guardian captures at least one controlled chaos scenario
- incident is persisted with structured analysis and confidence
- on-call can acknowledge and resolve incident via API
- one escalation scenario is demonstrated (recurring or persistent)