Chapter 12: AI-Assisted SRE Guardian

Why This Chapter Exists

Chaos testing and alerting generate noise unless incidents are normalized and prioritized. This chapter defines an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without autonomous production changes.

Implementation Scope

This chapter uses a standalone guardian service pattern integrated with the platform:

  • Kubernetes event handlers for warnings and Flux conditions
  • scanner loops for pods, PVCs, certificates, and endpoints
  • structured LLM analysis with strict JSON schema output
  • incident lifecycle storage (SQLite recommended)
  • confidence-based human escalation
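The incident lifecycle storage can be sketched with stdlib SQLite. The schema below (table and column names, the `open`/`acknowledged`/`resolved` states) is a hypothetical illustration, not the guardian's actual layout; it shows the core idea of upserting on a stable fingerprint so repeats bump a counter instead of creating new rows.

```python
# Minimal sketch of incident lifecycle storage on SQLite.
# Schema and field names are assumptions for illustration only.
import sqlite3
import time

DDL = """
CREATE TABLE IF NOT EXISTS incidents (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    fingerprint TEXT NOT NULL UNIQUE,
    status      TEXT NOT NULL DEFAULT 'open',  -- open | acknowledged | resolved
    confidence  REAL,
    analysis    TEXT,                          -- structured LLM output as JSON
    count       INTEGER NOT NULL DEFAULT 1,
    first_seen  REAL NOT NULL,
    last_seen   REAL NOT NULL
)
"""

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(DDL)
    return db

def upsert_incident(db, fingerprint, confidence=None, analysis=None):
    """Create a new incident, or bump count/last_seen for a repeat."""
    now = time.time()
    db.execute(
        """INSERT INTO incidents
               (fingerprint, confidence, analysis, first_seen, last_seen)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT(fingerprint) DO UPDATE SET
               count = count + 1, last_seen = excluded.last_seen""",
        (fingerprint, confidence, analysis, now, now),
    )
    db.commit()

db = open_store()
upsert_incident(db, "default/pod/web-0/CrashLoopBackOff", 0.8, "{}")
upsert_incident(db, "default/pod/web-0/CrashLoopBackOff")
count, = db.execute(
    "SELECT count FROM incidents WHERE fingerprint = ?",
    ("default/pod/web-0/CrashLoopBackOff",)
).fetchone()
print(count)  # → 2
```

SQLite's `ON CONFLICT ... DO UPDATE` keeps dedup atomic without a read-modify-write race, which is one reason a file-backed store beats in-memory state here.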

Guardian Responsibilities

  1. Detect:
  • Kubernetes Warning events
  • Flux stalled conditions
  • periodic scanner findings
  2. Analyze:
  • collect structured context
  • sanitize sensitive data
  • enforce context budget
  • call LLM for structured root-cause hypotheses
  3. Decide:
  • create/update incident record
  • deduplicate repeated noise
  • escalate recurring/persistent incidents
  4. Notify:
  • send structured alert
  • expose incident APIs for ack/resolve

Guardrails

  • AI proposes; humans approve remediation.
  • No autonomous write-back to production workloads.
  • Confidence below threshold requires explicit human review.
  • Secret/token redaction is mandatory before LLM calls.
  • Rate and cost limits are mandatory.
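Two of these guardrails, redaction before LLM calls and the confidence gate, can be sketched together. The regex patterns and the 0.7 threshold below are illustrative assumptions, not the guardian's actual rules; production redaction would need a broader pattern set.

```python
# Sketch of mandatory secret redaction and a confidence gate.
# Patterns and threshold are assumptions for illustration.
import re

CONFIDENCE_THRESHOLD = 0.7  # below this, require explicit human review (assumed)

REDACTIONS = [
    (re.compile(r"(?i)bearer\s+[a-z0-9._-]+"), "Bearer [REDACTED]"),
    (re.compile(r"(?i)(password|token|secret)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
]

def sanitize(text):
    """Strip secrets/tokens from context before it leaves the cluster."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

def needs_human_review(confidence):
    """Gate: low-confidence analyses must not drive automated actions."""
    return confidence < CONFIDENCE_THRESHOLD

ctx = sanitize("env: password=hunter2 auth: Bearer eyJabc.def")
print(ctx)                       # secret values replaced with [REDACTED]
print(needs_human_review(0.55))  # → True
```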

Integration Map

Lab Files

  • lab.md
  • runbook-guardian.md
  • quiz.md

Done When

  • guardian captures at least one controlled chaos scenario
  • incident is persisted with structured analysis and confidence
  • on-call can acknowledge and resolve incident via API
  • one escalation scenario is demonstrated (recurring or persistent)
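The ack/resolve checklist item implies a small lifecycle behind the incident API. A minimal sketch, assuming a simple open → acknowledged → resolved state machine (the actual guardian may allow more states or transitions):

```python
# Sketch of the incident lifecycle behind the ack/resolve API.
# States and allowed transitions are assumptions for illustration.
ALLOWED = {
    ("open", "acknowledge"): "acknowledged",
    ("open", "resolve"): "resolved",
    ("acknowledged", "resolve"): "resolved",
}

def transition(status, action):
    """Apply an on-call action; reject invalid lifecycle transitions."""
    try:
        return ALLOWED[(status, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an incident in state {status!r}")

print(transition("open", "acknowledge"))      # → acknowledged
print(transition("acknowledged", "resolve"))  # → resolved
```

Rejecting transitions like resolving an already-resolved incident keeps the API idempotency story explicit for on-call tooling.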

Lab: Guardian on Top of Controlled Chaos

  • detect from cluster events/scanner
  • structured AI analysis
  • incident persistence and lifecycle actions

Prerequisites:

  • Chapter 11 chaos flow available in develop
  • k8s-ai-monitor image built and deployed in playground cluster
  • …

Quiz: Chapter 12 (AI-Assisted SRE Guardian)

  1. Which backend enables the full incident lifecycle in the guardian?
     A) configmap  B) sqlite  C) in-memory only
  2. Which output format is required from the LLM for reliable automation boundaries?
  3. What should happen when confidence is low?
  …

Runbook: AI Guardian Operations

Runtime Checks

  • Health: curl -s http://localhost:8080/healthz
  • Recent incidents: curl -s http://localhost:8080/incidents | jq
  • LLM usage and rate: curl -s "http://localhost:8080/llm-usage?hours=24" | jq
  • Incident …