Core Track: guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Artifacts

What You Will Produce

A reproducible lab result, a passing quiz, and incident-safe operating evidence.

Chapter 13: AI-Assisted SRE Guardian

Why This Chapter Exists

Chaos testing and alerting generate noise unless incidents are normalized and prioritized. This chapter defines an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without autonomous production changes.

Incident Hook

Multiple warning signals fire after a controlled chaos drill. On-call receives fragmented alerts with no clear priority or incident ownership. Manual triage burns time on duplicate noise while real impact grows. Guardian workflow turns raw signals into structured, actionable incident context.

What AI Would Propose (Brave Junior)

  • “Auto-remediate incidents directly from AI output.”
  • “Send full raw logs/secrets to LLM for better context.”
  • “Resolve low-confidence incidents automatically to reduce queue.”

Why this sounds reasonable:

  • reduces immediate on-call load
  • looks faster and more autonomous

Why This Is Dangerous

  • autonomous write-back can apply unsafe changes at runtime.
  • unsanitized context leaks secrets and violates policy/compliance.
  • low-confidence automation hides risk instead of reducing it.

Implementation Scope

This chapter uses a standalone guardian service pattern integrated with the platform:

  • Kubernetes event handlers for warnings and Flux conditions
  • scanner loops for pods, PVCs, certificates, and endpoints
  • structured LLM analysis with strict JSON schema output
  • incident lifecycle storage (SQLite recommended)
  • confidence-based human escalation
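The "strict JSON schema output" requirement can be enforced mechanically: any LLM reply that is not valid JSON with the expected fields is rejected before it enters the incident record. A minimal sketch; the field names here are illustrative assumptions, not the k8s-ai-monitor wire format:

```python
import json

# Hypothetical shape of the guardian's structured analysis output.
# Field names are illustrative, not the actual k8s-ai-monitor schema.
REQUIRED_FIELDS = {
    "summary": str,
    "likely_causes": list,   # ranked, most likely first
    "next_actions": list,    # proposed runbook steps, never executed directly
    "confidence": float,     # 0.0-1.0, gates human escalation
}

def parse_analysis(raw: str) -> dict:
    """Reject any LLM reply that is not strict JSON with the expected fields."""
    data = json.loads(raw)  # raises ValueError on prose or markdown replies
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence out of range")
    return data
```

Rejecting non-conforming replies (rather than best-effort parsing) is what keeps the automation boundary enforceable: prose answers simply never become incidents.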

Guardian Contract (Inputs / Outputs / Not Allowed)

Inputs:

  • Kubernetes warning events and Flux conditions
  • metrics snapshots (error-rate, latency, saturation indicators)
  • bounded log context with sensitive fields redacted

Outputs:

  • structured incident summary
  • ranked likely causes
  • proposed next runbook actions with confidence score

Not allowed:

  • direct workload mutation (kubectl apply/delete/patch) from AI output
  • sending raw secrets/tokens/private keys to LLM providers
  • automatic incident resolve/close without human acknowledgement
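The "not allowed" boundary can be checked in code before any proposed action is even shown to an operator. A sketch with an illustrative (deliberately incomplete) deny-list of mutating kubectl verbs; a real guardian would apply a reviewed policy, not three string checks:

```python
# Illustrative guard: proposed next actions are free text, so deny anything
# that looks like a direct workload mutation. Verb list is an assumption,
# not an exhaustive policy.
FORBIDDEN_VERBS = ("apply", "delete", "patch", "scale", "edit", "rollout")

def is_action_allowed(action: str) -> bool:
    lowered = action.lower()
    if "kubectl" in lowered and any(verb in lowered for verb in FORBIDDEN_VERBS):
        return False  # mutation proposals are surfaced for humans, never queued
    return True
```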

Guardian Responsibilities

  1. Detect:
     • Kubernetes Warning events
     • Flux stalled conditions
     • periodic scanner findings
  2. Analyze:
     • collect structured context
     • sanitize sensitive data
     • enforce context budget
     • call LLM for structured root-cause hypotheses
  3. Decide:
     • create/update incident record
     • deduplicate repeated noise
     • escalate recurring/persistent incidents
  4. Notify:
     • send structured alert
     • expose incident APIs for ack/resolve
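The "deduplicate repeated noise" step in Decide can be as simple as fingerprinting signals by (namespace, resource, reason) and counting occurrences against an open incident. A sketch; the 15-minute window and ID format are assumptions:

```python
import hashlib
import time

# Assumed dedup window; repeated signals inside it update one incident
# instead of creating alert noise.
DEDUP_WINDOW_SECONDS = 15 * 60

def fingerprint(namespace: str, resource: str, reason: str) -> str:
    key = f"{namespace}/{resource}/{reason}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

class IncidentStore:
    def __init__(self):
        self._open = {}  # fingerprint -> (incident_id, last_seen, count)

    def ingest(self, namespace, resource, reason, now=None):
        now = time.time() if now is None else now
        fp = fingerprint(namespace, resource, reason)
        if fp in self._open and now - self._open[fp][1] < DEDUP_WINDOW_SECONDS:
            iid, _, count = self._open[fp]
            self._open[fp] = (iid, now, count + 1)
            return iid, False  # existing incident, occurrence counted
        iid = f"INC-{fp}"
        self._open[fp] = (iid, now, 1)
        return iid, True  # new incident created
```

The occurrence count is also what drives escalation: a fingerprint that keeps re-firing past the window is a recurring incident, not fresh noise.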

Guardrails That Stop It

  • AI proposes; humans approve remediation.
  • No autonomous write-back to production workloads.
  • Confidence below threshold requires explicit human review.
  • Secret/token redaction is mandatory before LLM calls.
  • Rate and cost limits are mandatory.
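The confidence guardrail reduces to one routing decision. A sketch; the 0.7 threshold is an illustrative value, not a guardian default:

```python
# Assumed threshold; tune per environment and track false-escalation rate.
AUTO_NOTIFY_THRESHOLD = 0.7

def route_recommendation(confidence: float) -> str:
    """The AI never acts; this only decides how loudly to ask a human."""
    if confidence >= AUTO_NOTIFY_THRESHOLD:
        return "notify-oncall-with-proposed-runbook-step"
    return "flag-for-explicit-human-review"
```

Note that both branches end at a human: high confidence changes the notification path, never the approval requirement.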

Safe Workflow (Step-by-Step)

  1. Ingest incident signals (events, metrics, logs) and normalize into one record.
  2. Sanitize context and enforce token/context budgets before any LLM call.
  3. Generate structured recommendations (summary, likely cause, next actions) only.
  4. Route recommendation through human approval gate.
  5. Execute selected runbook step and record decision/evidence in incident lifecycle.
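Step 2 (sanitize and budget before any LLM call) can be sketched as below. The regex patterns and 8,000-character budget are illustrative assumptions; a production guardian should use a maintained secret scanner and a real token counter, not three regexes and a character slice:

```python
import re

# Illustrative redaction patterns run before every LLM call.
REDACTIONS = [
    (re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
     r"\1=[REDACTED]"),
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?"
                r"-----END [A-Z ]*PRIVATE KEY-----"),
     "[REDACTED PRIVATE KEY]"),
    (re.compile(r"eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}"),
     "[REDACTED JWT]"),
]
MAX_CONTEXT_CHARS = 8000  # crude budget; real guardians count tokens

def sanitize(context: str) -> str:
    for pattern, replacement in REDACTIONS:
        context = pattern.sub(replacement, context)
    return context[:MAX_CONTEXT_CHARS]
```

Order matters: redact first, truncate second, so a secret is never split across the budget boundary and leaked half-intact.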

Approval Gates

  1. AI suggests.
  2. Human selects allowed action.
  3. Runbook step executes in controlled scope.
  4. Post-action evidence is reviewed before next step.

Guardian Deployment Architecture

The guardian runs as k8s-ai-monitor, a single-replica deployment in the observability namespace.

Deployment Specification

  • Image: ghcr.io/ldbl/k8s-ai-monitor:latest
  • Replicas: 1 (singleton — only one instance should process events)
  • Namespace: observability

Environment Configuration

| Variable | Purpose |
| --- | --- |
| CLUSTER_NAME | Identifies the cluster in incident reports |
| WATCH_NAMESPACES | Namespaces to monitor (e.g., production) |
| NON_PROD_NAMESPACES | Non-production namespaces (reduced alert severity) |
| EXCLUDE_NAMESPACES | Namespaces to ignore |
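Putting the deployment spec and environment configuration together, a manifest fragment might look like the following. This is an illustrative sketch only: selector, labels, probes, and resource blocks are omitted, and the values are placeholders, not the actual playground configuration:

```yaml
# Illustrative fragment; values are placeholders, full spec omitted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-ai-monitor
  namespace: observability
spec:
  replicas: 1            # singleton: only one instance processes events
  template:
    spec:
      containers:
        - name: k8s-ai-monitor
          image: ghcr.io/ldbl/k8s-ai-monitor:latest
          env:
            - name: CLUSTER_NAME
              value: playground        # placeholder
            - name: WATCH_NAMESPACES
              value: production        # placeholder
            - name: EXCLUDE_NAMESPACES
              value: kube-system       # placeholder
```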

LLM Integration

  • OPENAI_API_KEY or ANTHROPIC_API_KEY injected from Kubernetes Secret
  • Rate and cost limits enforced in guardian config
  • Context sanitization runs before every LLM call
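A rate-and-cost limit can be sketched as a sliding-window call budget plus a daily spend cap; the limits below are illustrative, not k8s-ai-monitor's actual configuration:

```python
import time

# Assumed limits; tune to provider pricing and incident volume.
class LlmBudget:
    def __init__(self, max_calls_per_hour=30, max_usd_per_day=5.0):
        self.max_calls = max_calls_per_hour
        self.max_usd = max_usd_per_day
        self.calls = []       # timestamps of recent calls
        self.spent_usd = 0.0

    def allow_call(self, now=None) -> bool:
        now = time.time() if now is None else now
        self.calls = [t for t in self.calls if now - t < 3600]
        if len(self.calls) >= self.max_calls or self.spent_usd >= self.max_usd:
            return False  # degraded mode: heuristic-only analysis, no LLM
        self.calls.append(now)
        return True

    def record_cost(self, usd: float):
        self.spent_usd += usd
```

When the budget is exhausted the guardian keeps detecting and persisting incidents; only the LLM enrichment step degrades.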

Prometheus Connection

  • Internal service URL for metrics queries
  • Used for context enrichment (error rates, latency percentiles during incidents)
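Enrichment queries typically hit the Prometheus instant-query endpoint (`GET /api/v1/query`). A sketch of query construction; the in-cluster service URL and the metric/label names are assumptions about the environment:

```python
from urllib.parse import urlencode

# Assumed in-cluster service URL for the Prometheus deployment.
PROMETHEUS_URL = "http://prometheus.observability.svc:9090"

def instant_query_url(promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"

# Error-rate enrichment for an incident in "production"
# (metric and label names are illustrative):
ERROR_RATE = 'sum(rate(http_requests_total{namespace="production",code=~"5.."}[5m]))'
```

The guardian attaches the query result (not the raw dashboard) to the incident record, so the LLM sees a bounded numeric snapshot rather than an open-ended metrics feed.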

Storage

  • SQLite persistence via 2Gi PVC mounted at /data
  • Stores incident records, analysis history, and LLM usage tracking
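A minimal incident-lifecycle table in SQLite might look like the sketch below. Column names and the open → acknowledged → resolved states are illustrative, not the actual k8s-ai-monitor schema:

```python
import sqlite3

# Illustrative schema; not the actual k8s-ai-monitor tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS incidents (
    id TEXT PRIMARY KEY,
    summary TEXT NOT NULL,
    confidence REAL NOT NULL,
    status TEXT NOT NULL DEFAULT 'open',   -- open -> acknowledged -> resolved
    acked_by TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db

def acknowledge(db, incident_id: str, operator: str) -> bool:
    # Humans move incidents forward; the guardian never resolves on its own.
    cur = db.execute(
        "UPDATE incidents SET status='acknowledged', acked_by=? "
        "WHERE id=? AND status='open'",
        (operator, incident_id),
    )
    return cur.rowcount == 1
```

The `WHERE ... status='open'` guard makes acknowledgement idempotent and records exactly one accountable operator per transition.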

RBAC (ClusterRole)

Read-only access to:

  • Pods, Events, Deployments, HPAs
  • CNPG clusters
  • Flux Kustomizations and HelmReleases

No write permissions. Guardian observes; humans act.
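A read-only ClusterRole matching this scope might look like the fragment below. The API groups for Flux and CNPG are the upstream defaults; the role name is illustrative, and the verbs deliberately exclude create/update/patch/delete:

```yaml
# Illustrative read-only ClusterRole; name is a placeholder.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-ai-monitor-readonly
rules:
  - apiGroups: ["", "apps", "autoscaling"]
    resources: ["pods", "events", "deployments", "horizontalpodautoscalers"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["postgresql.cnpg.io"]
    resources: ["clusters"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["kustomize.toolkit.fluxcd.io", "helm.toolkit.fluxcd.io"]
    resources: ["kustomizations", "helmreleases"]
    verbs: ["get", "list", "watch"]
```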

Endpoints

| Endpoint | Purpose |
| --- | --- |
| /healthz | Health check |
| /state | Current guardian state and scan results |
| /reports | Incident reports and analysis |

Integrations

  • Slack: Webhook URL for incident notifications
  • OpsGenie: Webhook URL for on-call escalation

Resource Limits

| Resource | Request | Limit |
| --- | --- | --- |
| CPU | 100m | 1000m |
| Memory | 256Mi | 1Gi |

Integration Map

Lab Files

  • lab.md
  • runbook-guardian.md
  • quiz.md

Done When

  • guardian captures at least one controlled chaos scenario
  • incident is persisted with structured analysis and confidence
  • on-call can acknowledge and resolve incident via API
  • one escalation scenario is demonstrated (recurring or persistent)

Lab: Guardian on Top of Controlled Chaos

The lab covers:

  • detection from cluster events and scanner loops
  • structured AI analysis
  • incident persistence and lifecycle actions

Prerequisites:

  • Chapter 12 chaos flow available in develop
  • k8s-ai-monitor image built and deployed in playground cluster

…

Quiz: Chapter 13 (AI-Assisted SRE Guardian)

  1. Which backend enables full incident lifecycle in guardian?
     A) configmap
     B) sqlite
     C) in-memory only
  2. Which output format is required from the LLM for reliable automation boundaries?
  3. What should happen when confidence is low?

…

Runbook: AI Guardian Operations

Runtime Checks

  • Health: curl -s http://localhost:8080/healthz
  • Recent incidents: curl -s http://localhost:8080/incidents | jq
  • LLM usage and rate: curl -s "http://localhost:8080/llm-usage?hours=24" | jq

Incident …