Chapter 12: AI-Assisted SRE Guardian
Why This Chapter Exists
Chaos testing and alerting generate noise unless incidents are normalized and prioritized. This chapter defines an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without autonomous production changes.
Implementation Scope
This chapter builds a standalone guardian service that integrates with the platform:
- Kubernetes event handlers for warnings and Flux conditions
- scanner loops for pods, PVCs, certificates, and endpoints
- structured LLM analysis with strict JSON schema output
- incident lifecycle storage (SQLite recommended)
- confidence-based human escalation
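The "strict JSON schema output" item above is the load-bearing piece: model output must be validated before it touches the incident store. A minimal sketch, assuming hypothetical field names (`root_cause_hypothesis`, `confidence`, `proposed_actions`, `evidence`) that are illustrative rather than mandated by this chapter:

```python
import json

# Hypothetical analysis schema; field names are illustrative, not
# mandated by the chapter.
REQUIRED_FIELDS = {
    "root_cause_hypothesis": str,
    "confidence": float,        # 0.0 .. 1.0
    "proposed_actions": list,   # steps proposed for human approval
    "evidence": list,           # references to the logs/events used
}

def parse_analysis(raw: str) -> dict:
    """Parse and validate an LLM response against the strict schema.

    Rejects anything that is not valid JSON or has missing/ill-typed
    fields, so malformed model output never reaches the incident store.
    """
    data = json.loads(raw)  # raises ValueError on non-JSON output
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"bad type for field: {field}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return data
```

Failing closed here is the design choice: a response the guardian cannot parse is treated as no analysis at all, not as a low-confidence one.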
Guardian Responsibilities
- Detect:
  - Kubernetes Warning events
  - Flux stalled conditions
  - periodic scanner findings
- Analyze:
  - collect structured context
  - sanitize sensitive data
  - enforce context budget
  - call LLM for structured root-cause hypotheses
- Decide:
  - create/update incident record
  - deduplicate repeated noise
  - escalate recurring/persistent incidents
- Notify:
  - send structured alert
  - expose incident APIs for ack/resolve
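The Decide stage above hinges on deduplication and recurrence-based escalation. One way to sketch it, assuming a fingerprint over namespace/kind/reason and an escalation threshold that this chapter does not fix:

```python
import hashlib
import sqlite3
import time

# Threshold is an assumption for illustration; tune per environment.
ESCALATE_AFTER = 3  # occurrences before a recurring incident escalates

def fingerprint(namespace: str, kind: str, reason: str) -> str:
    """Stable key so repeated noise maps onto one incident record."""
    return hashlib.sha256(f"{namespace}/{kind}/{reason}".encode()).hexdigest()

def record_incident(db: sqlite3.Connection, fp: str) -> str:
    """Create or update an incident; return 'new', 'dedup', or 'escalate'."""
    db.execute(
        """CREATE TABLE IF NOT EXISTS incidents (
               fingerprint TEXT PRIMARY KEY,
               count INTEGER NOT NULL,
               last_seen REAL NOT NULL,
               escalated INTEGER NOT NULL DEFAULT 0)"""
    )
    row = db.execute(
        "SELECT count, escalated FROM incidents WHERE fingerprint = ?", (fp,)
    ).fetchone()
    now = time.time()
    if row is None:
        db.execute(
            "INSERT INTO incidents (fingerprint, count, last_seen) VALUES (?, 1, ?)",
            (fp, now),
        )
        return "new"
    count = row[0] + 1
    escalate = int(count >= ESCALATE_AFTER and not row[1])
    db.execute(
        "UPDATE incidents SET count = ?, last_seen = ?, "
        "escalated = escalated OR ? WHERE fingerprint = ?",
        (count, now, escalate, fp),
    )
    return "escalate" if escalate else "dedup"
```

Each repeat of the same fingerprint updates one row instead of creating a new incident, and the `escalated` flag ensures a recurring incident escalates once rather than on every subsequent hit.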
Guardrails
- AI proposes; humans approve remediation.
- No autonomous write-back to production workloads.
- Confidence below threshold requires explicit human review.
- Secret/token redaction is mandatory before LLM calls.
- Rate and cost limits are mandatory.
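Two of the guardrails above translate directly into code: mandatory redaction before any LLM call, and a confidence gate for human review. A minimal sketch; the regex patterns and the 0.7 threshold are assumptions and would need extending for the token formats actually present in your environment:

```python
import re

# Illustrative patterns only; real deployments need patterns matched to
# their own secret and token formats.
REDACTIONS = [
    (re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"), r"\1[REDACTED]"),
    (re.compile(r"(?i)(password|token|secret)(\"?\s*[:=]\s*\"?)[^\s\",]+"),
     r"\1\2[REDACTED]"),
    # Rough JWT shape: three base64url segments joined by dots.
    (re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
     "[REDACTED-JWT]"),
]

CONFIDENCE_THRESHOLD = 0.7  # assumed value; tune per environment

def sanitize(text: str) -> str:
    """Scrub likely secrets from context before it reaches the LLM."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

def needs_human_review(confidence: float) -> bool:
    """Below-threshold analyses route to a human; they are never acted on."""
    return confidence < CONFIDENCE_THRESHOLD
```

Redaction runs on the collected context, not on the model's output: the goal is that secrets never leave the cluster in the first place.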
Integration Map
- Chapter 11: Controlled Chaos as primary incident signal source.
- Chapter 09: Observability for evidence correlation.
- Chapter 13: 24/7 Production SRE for on-call lifecycle integration.
- Platform GitOps manifests for runtime context.
Lab Files
- lab.md
- runbook-guardian.md
- quiz.md
Done When
- guardian captures at least one controlled chaos scenario
- incident is persisted with structured analysis and confidence
- on-call can acknowledge and resolve incident via API
- one escalation scenario is demonstrated (recurring or persistent)
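The acknowledge/resolve criterion above implies a small lifecycle state machine behind the incident API. A sketch under the assumption of three states (`open`, `acknowledged`, `resolved`) and an ack-before-resolve rule; the chapter does not fix these names:

```python
# Assumed lifecycle: open -> acknowledged -> resolved, with no skipping.
VALID_TRANSITIONS = {
    "open": {"acknowledged"},
    "acknowledged": {"resolved"},
    "resolved": set(),  # terminal state
}

def transition(incident: dict, new_state: str) -> dict:
    """Apply a lifecycle transition, rejecting illegal jumps
    (e.g. resolving an incident that was never acknowledged)."""
    current = incident["state"]
    if new_state not in VALID_TRANSITIONS[current]:
        raise ValueError(f"cannot move {current} -> {new_state}")
    return {**incident, "state": new_state}
```

Keeping transitions in a pure function makes the API handlers thin and the lifecycle rules easy to test in isolation.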