Chapter 12: AI-Assisted SRE Guardian

Why This Chapter Exists

Chaos testing and alerting generate noise unless incidents are normalized and prioritized. This chapter defines an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without autonomous production changes.

Implementation Scope

This chapter uses a standalone guardian service pattern integrated with the platform:

  • Kubernetes event handlers for warnings and Flux conditions
  • scanner loops for pods, PVCs, certificates, and endpoints
  • structured LLM analysis with strict JSON schema output
  • incident lifecycle storage (SQLite recommended)
  • confidence-based human escalation
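The incident lifecycle storage can be sketched with stdlib SQLite. The schema below (table and column names, the `open`/`acknowledged`/`resolved` states) is a hypothetical illustration, not the guardian's actual layout; it shows the core idea of upserting on a stable fingerprint so repeats bump a counter instead of creating new rows.

```python
# Minimal sketch of incident lifecycle storage on SQLite.
# Schema and field names are assumptions for illustration only.
import sqlite3
import time

DDL = """
CREATE TABLE IF NOT EXISTS incidents (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    fingerprint TEXT NOT NULL UNIQUE,
    status      TEXT NOT NULL DEFAULT 'open',  -- open | acknowledged | resolved
    confidence  REAL,
    analysis    TEXT,                          -- structured LLM output as JSON
    count       INTEGER NOT NULL DEFAULT 1,
    first_seen  REAL NOT NULL,
    last_seen   REAL NOT NULL
)
"""

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(DDL)
    return db

def upsert_incident(db, fingerprint, confidence=None, analysis=None):
    """Create a new incident, or bump count/last_seen for a repeat."""
    now = time.time()
    db.execute(
        """INSERT INTO incidents
               (fingerprint, confidence, analysis, first_seen, last_seen)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT(fingerprint) DO UPDATE SET
               count = count + 1, last_seen = excluded.last_seen""",
        (fingerprint, confidence, analysis, now, now),
    )
    db.commit()

db = open_store()
upsert_incident(db, "default/pod/web-0/CrashLoopBackOff", 0.8, "{}")
upsert_incident(db, "default/pod/web-0/CrashLoopBackOff")
count, = db.execute(
    "SELECT count FROM incidents WHERE fingerprint = ?",
    ("default/pod/web-0/CrashLoopBackOff",)
).fetchone()
print(count)  # → 2
```

SQLite's `ON CONFLICT ... DO UPDATE` keeps dedup atomic without a read-modify-write race, which is one reason a file-backed store beats in-memory state here.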

Guardian Responsibilities

  1. Detect:
  • Kubernetes Warning events
  • Flux stalled conditions
  • periodic scanner findings
  2. Analyze:
  • collect structured context
  • sanitize sensitive data
  • enforce context budget
  • call LLM for structured root-cause hypotheses
  3. Decide:
  • create/update incident record
  • deduplicate repeated noise
  • escalate recurring/persistent incidents
  4. Notify:
  • send structured alert
  • expose incident APIs for ack/resolve

Guardrails

  • AI proposes; humans approve remediation.
  • No autonomous write-back to production workloads.
  • Confidence below threshold requires explicit human review.
  • Secret/token redaction is mandatory before LLM calls.
  • Rate and cost limits are mandatory.
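Two of these guardrails, redaction before LLM calls and the confidence gate, can be sketched together. The regex patterns and the 0.7 threshold below are illustrative assumptions, not the guardian's actual rules; production redaction would need a broader pattern set.

```python
# Sketch of mandatory secret redaction and a confidence gate.
# Patterns and threshold are assumptions for illustration.
import re

CONFIDENCE_THRESHOLD = 0.7  # below this, require explicit human review (assumed)

REDACTIONS = [
    (re.compile(r"(?i)bearer\s+[a-z0-9._-]+"), "Bearer [REDACTED]"),
    (re.compile(r"(?i)(password|token|secret)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
]

def sanitize(text):
    """Strip secrets/tokens from context before it leaves the cluster."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

def needs_human_review(confidence):
    """Gate: low-confidence analyses must not drive automated actions."""
    return confidence < CONFIDENCE_THRESHOLD

ctx = sanitize("env: password=hunter2 auth: Bearer eyJabc.def")
print(ctx)                       # secret values replaced with [REDACTED]
print(needs_human_review(0.55))  # → True
```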

Integration Map

Lab Files

  • lab.md
  • runbook-guardian.md
  • quiz.md

Done When

  • guardian captures at least one controlled chaos scenario
  • incident is persisted with structured analysis and confidence
  • on-call can acknowledge and resolve incident via API
  • one escalation scenario is demonstrated (recurring or persistent)
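The ack/resolve checklist item implies a small lifecycle behind the incident API. A minimal sketch, assuming a simple open → acknowledged → resolved state machine (the actual guardian may allow more states or transitions):

```python
# Sketch of the incident lifecycle behind the ack/resolve API.
# States and allowed transitions are assumptions for illustration.
ALLOWED = {
    ("open", "acknowledge"): "acknowledged",
    ("open", "resolve"): "resolved",
    ("acknowledged", "resolve"): "resolved",
}

def transition(status, action):
    """Apply an on-call action; reject invalid lifecycle transitions."""
    try:
        return ALLOWED[(status, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an incident in state {status!r}")

print(transition("open", "acknowledge"))      # → acknowledged
print(transition("acknowledged", "resolve"))  # → resolved
```

Rejecting transitions like resolving an already-resolved incident keeps the API idempotency story explicit for on-call tooling.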

Lab: Guardian on Top of Controlled Chaos

  • detect from cluster events/scanner
  • structured AI analysis
  • incident persistence and lifecycle actions

Prerequisites:

  • Chapter 11 chaos flow available in develop
  • k8s-ai-monitor image built and deployed in playground cluster
  • …

Quiz: Chapter 12 (AI-Assisted SRE Guardian)

  1. Which backend enables the full incident lifecycle in the guardian?
     A) configmap  B) sqlite  C) in-memory only
  2. Which output format is required from the LLM for reliable automation boundaries?
  3. What should happen when confidence is low?
  …

Runbook: AI Guardian Operations

Runtime Checks

  • Health: curl -s http://localhost:8080/healthz
  • Recent incidents: curl -s http://localhost:8080/incidents | jq
  • LLM usage and rate: curl -s "http://localhost:8080/llm-usage?hours=24" | jq
  • Incident …