Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • flux/


What You Will Produce

A reproducible lab result, quiz verification, and incident-safe operating evidence.

Chapter 13: AI-Assisted SRE Guardian

Incident Hook

Multiple warning signals fire after a controlled chaos drill. On-call receives fragmented alerts with no clear priority or incident ownership. Manual triage burns time on duplicate noise while real impact grows. Guardian workflow turns raw signals into structured, actionable incident context.

Observed Symptoms

What the team sees first:

  • many alerts are technically true but operationally fragmented
  • responders cannot tell whether they are seeing one incident or many
  • signal volume starts competing with actual investigation time

The problem is not lack of detection. It is lack of normalization.

Confusion Phase

This is the point where “let the AI fix it” starts sounding attractive. That is the trap.

The real questions are:

  • how to reduce noise without giving the model unsafe write authority
  • how to keep useful context while still redacting secrets and respecting context budgets

Why This Chapter Exists

Chaos testing and alerting generate noise unless incidents are normalized and prioritized. This chapter defines an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without autonomous production changes.

What AI Would Propose (Brave Junior)

  • “Auto-remediate incidents directly from AI output.”
  • “Send full raw logs and secrets to the LLM for better context.”
  • “Resolve low-confidence incidents automatically to reduce the queue.”

Why this sounds reasonable:

  • reduces immediate on-call load
  • looks faster and more autonomous

Why This Is Dangerous

  • autonomous write-back can apply unsafe changes at runtime.
  • unsanitized context leaks secrets and violates policy or compliance.
  • low-confidence automation hides uncertainty instead of reducing it.

Investigation

Treat the guardian itself as a guarded incident pipeline.

Safe investigation sequence:

  1. inspect the raw signals entering the guardian
  2. verify sanitization, context budgets, and confidence handling
  3. confirm deduplication collapsed duplicates into one incident record
  4. review whether the proposed actions are useful without crossing the no-mutation boundary

Containment

Containment keeps the guardian helpful but bounded:

  1. preserve human approval for all remediation
  2. reduce noise through deduplication and escalation rules
  3. block unsafe context before LLM analysis
  4. treat low-confidence output as a review queue, not an automation success

Implementation Scope

This chapter uses a standalone guardian service pattern integrated with the platform:

  • Kopf handlers watch Kubernetes warning events and Flux conditions.
  • periodic scanners watch pods, PVCs, certificates, endpoints, backups, and critical service chains.
  • Prometheus is the baseline signal source; the guardian enriches with structured pod logs, optional Uptrace trace context, and optional configured log-backend search.
  • incident lifecycle, suppressions, deduplication state, and LLM usage are stored in SQLite.
  • human operators can work through an HTTP API, a CLI, and an MCP server.
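The Kopf handler pattern above can be sketched as a filter over raw Kubernetes Event objects. This is a minimal illustration, not the guardian's shipped code; the kind allowlist and the `enqueue_signal` hand-off are assumptions:

```python
# Illustrative event filter for a Kopf-style handler. The allowlist of
# watched kinds is an assumption for this sketch, not the real config.
ALERTABLE_KINDS = {"Pod", "Kustomization", "HelmRelease"}

def is_alertable(body: dict) -> bool:
    """Keep only Warning-type events for objects the guardian watches."""
    ref = body.get("involvedObject", {})
    return body.get("type") == "Warning" and ref.get("kind") in ALERTABLE_KINDS

# Inside the operator this filter would be registered with Kopf, roughly:
#
#   @kopf.on.event("", "v1", "events")
#   def on_cluster_event(body, logger, **_):
#       if is_alertable(body):
#           enqueue_signal(body)   # hypothetical hand-off to the analysis pipeline
```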

Guardian Contract (Inputs / Outputs / Not Allowed)

Inputs:

  • Kubernetes warning events and Flux conditions
  • metrics snapshots for error rate, latency, saturation, and node pressure
  • bounded pod log context with sensitive fields redacted
  • optional Uptrace trace context and optional configured log-backend search when available

Outputs:

  • structured incident summary
  • ranked likely causes with confidence scores
  • proposed next runbook actions
  • daily and weekly cluster health reports

Not allowed:

  • direct workload mutation (kubectl apply, patch, delete) from AI output
  • sending raw secrets, tokens, or private keys to LLM providers
  • automatic incident resolve or close without human acknowledgement

Detection and Analysis Pipeline

The guardian works in four stages:

  1. Detect from four sources: real-time warning events, Flux stalled conditions, periodic scanners, and scheduled daily reports.
  2. Analyze by collecting pod state, recent logs, events, owner chain, metrics, and optional traces, then sanitizing and budgeting the context.
  3. Decide by creating or updating an incident record in SQLite, applying deduplication, and attaching confidence-ranked hypotheses.
  4. Notify by sending structured Slack alerts and exposing the result over API, CLI, and MCP.

Before every LLM call, the guardian redacts connection strings, bearer tokens, AWS keys, JWTs, key-value secrets, and PEM private key blocks. Container env vars and command arguments are never sent.
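A redaction pass over that list of secret shapes might look like the sketch below. The patterns are illustrative approximations, not the guardian's actual rule set:

```python
import re

# Illustrative redaction rules for the secret shapes named above.
# Order matters: structural patterns run before generic key=value matching.
REDACTIONS = [
    (re.compile(r"(?:postgres|mysql|redis|amqp)://\S+"), "[REDACTED_DSN]"),
    (re.compile(r"Bearer\s+[A-Za-z0-9\-._~+/]+=*"), "Bearer [REDACTED]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"eyJ[\w-]+\.[\w-]+\.[\w-]+"), "[REDACTED_JWT]"),
    (re.compile(r"(?i)(password|secret|token|api[_-]?key)\s*[=:]\s*\S+"),
     r"\1=[REDACTED]"),
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?"
                r"-----END [A-Z ]*PRIVATE KEY-----", re.S), "[REDACTED_PEM]"),
]

def sanitize(text: str) -> str:
    """Apply every redaction rule before any context reaches an LLM call."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Keeping env vars and command arguments out entirely, rather than redacting them, is the stronger guarantee: nothing can leak from context that was never collected.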

Scanner Coverage

| Scanner | Main Purpose | Default Cadence | Notes |
| --- | --- | --- | --- |
| Pod | CrashLoopBackOff, OOMKilled, ImagePullBackOff, Error, Evicted | 30 min | Diagnostic plugins add issue-specific context |
| PVC + storage verify | Disk pressure and backup integrity checks | 1 hour + daily verify | Warn at 80%, critical at 90% |
| Certificate | cert-manager expiry and renewal health | 1 hour | 14-day warning threshold |
| Endpoint + critical endpoint | Real user path and deep service-chain probing | 5 min / 60 sec | Probes Traefik path, backend services, and upstream dependencies |
| Backup | CNPG, CronJob-based, and Percona backup health | 1 hour | Missing CRDs auto-skip without failing the scanner |
| Flux condition watch | Stalled Kustomizations and HelmReleases | event-driven | Captures GitOps failure as incident input |

Diagnostic plugins specialize the evidence that is attached to each incident. OOM, crash, image pull, scheduling, mount, unhealthy, and eviction cases each collect targeted context before analysis.

Deduplication and Escalation

Four layers prevent alert storms:

  • in-flight state blocks concurrent duplicates.
  • SQLite cooldown windows stop repeated re-alerting for the same state.
  • context hashes prevent re-analyzing identical evidence.
  • escalation logic separates fresh, recurring, and persistent incidents.

Escalation model:

  • Fresh: first occurrence, base cooldown of 30 minutes by default.
  • Recurring: second occurrence within a 6-hour window, send escalation notice.
  • Persistent: third or later occurrence with age above 1 hour, emit hardening alert with exponential backoff capped at 12 hours.

Non-production namespaces use 2x longer cooldowns. LLM analysis there is intentionally reduced to the critical-endpoint path so low-value noise does not consume budget.
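The tier thresholds above can be expressed directly in code. The constants are the documented defaults; the doubling-per-occurrence backoff shape is an assumption for this sketch:

```python
BASE_COOLDOWN_S = 30 * 60     # fresh: 30-minute base cooldown
RECUR_WINDOW_S = 6 * 3600     # recurring: 2nd occurrence within 6 hours
PERSIST_AGE_S = 3600          # persistent: 3rd+ occurrence older than 1 hour
BACKOFF_CAP_S = 12 * 3600     # exponential backoff capped at 12 hours

def escalation_tier(occurrences: int, age_s: float) -> str:
    """Classify an incident per the chapter's escalation model."""
    if occurrences >= 3 and age_s > PERSIST_AGE_S:
        return "persistent"
    if occurrences >= 2 and age_s <= RECUR_WINDOW_S:
        return "recurring"
    return "fresh"

def next_cooldown_s(occurrences: int, non_prod: bool = False) -> int:
    """Backoff from the 30-minute base; non-prod namespaces wait twice as long."""
    backoff = min(BASE_COOLDOWN_S * 2 ** max(occurrences - 1, 0), BACKOFF_CAP_S)
    return backoff * 2 if non_prod else backoff
```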

Guardrails That Stop It

  • AI proposes; humans approve remediation.
  • No autonomous write-back to workloads.
  • Confidence below threshold requires explicit human review.
  • Secret and token redaction is mandatory before any LLM call.
  • Context budgets are enforced: 16KB for alerts, 80KB for reports.
  • Rate and cost limits are enforced: default 20 LLM calls per hour.
  • RBAC allows observation plus limited Kubernetes Event writes only; it still cannot mutate workloads or apply manifests.

Safe Workflow (Step-by-Step)

  1. Ingest incident signals and normalize them into one incident record.
  2. Collect evidence, sanitize it, and enforce context and rate limits.
  3. Generate structured recommendations only: root_cause, confidence, hypotheses[], suggested_actions[].
  4. Route the recommendation through a human approval gate.
  5. Execute the selected runbook step and record the decision with evidence.
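The recommendation shape named in step 3 can be modeled as a small schema. Field names mirror the workflow; the class itself and the 0.7 review threshold are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Hypothesis:
    cause: str
    confidence: float                 # 0.0-1.0, as ranked by the analysis stage

@dataclass
class Recommendation:
    """Structured output only: the guardian never executes these itself."""
    root_cause: str
    confidence: float
    hypotheses: List[Hypothesis] = field(default_factory=list)
    suggested_actions: List[str] = field(default_factory=list)
    approved_by: Optional[str] = None     # set only by the human approval gate

    def requires_review(self, threshold: float = 0.7) -> bool:
        # 0.7 is an assumed threshold; the chapter only requires that one exists
        return self.confidence < threshold
```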

Approval Gates

  1. AI suggests.
  2. Human selects an allowed action.
  3. The runbook step executes in controlled scope.
  4. Post-action evidence is reviewed before the next step.

Guardian Deployment Architecture

The guardian runs as k8s-ai-monitor, a singleton Kopf operator in the observability namespace.

Baseline deployment facts:

  • image: ghcr.io/ldbl/k8s-ai-monitor:main
  • entrypoint: kopf run --standalone --all-namespaces src/handlers/__init__.py
  • persistence: 2Gi PVC mounted at /data for incidents, suppressions, dedup state, reports, and LLM usage
  • resources: 10m/64Mi requests and 100m/256Mi limits
  • watch model: configurable namespace allowlist plus explicit non-prod and exclude namespace sets
  • provider config: Anthropic or OpenAI, with Prometheus as the baseline metrics source

RBAC is read-oriented. It can read pods, logs, events, workloads, namespaces, nodes, services, endpoints, PVCs, cert-manager resources, CNPG and Percona backup objects, and Flux objects. It can also write Kubernetes Event objects so it can leave breadcrumbs. It does not have workload-mutation verbs.

If you enable Traefik IngressRoute critical-endpoint scanning, include read access to traefik.io ingressroutes in the ClusterRole.
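An illustrative ClusterRole rule for that addition might look like the fragment below; the surrounding manifest layout is an assumption:

```yaml
# Illustrative ClusterRole rule (read-only, matching the guardian's RBAC posture)
- apiGroups: ["traefik.io"]
  resources: ["ingressroutes"]
  verbs: ["get", "list", "watch"]
```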

Operator Surfaces

The guardian gives operators three safe surfaces:

  • HTTP API for health, incidents, reports, suppressions, and LLM-usage visibility
  • CLI for acknowledge, resolve, suppress, investigate, log search, trace search, metrics, and audits
  • MCP server for cluster_health, investigate, audit, search_logs, search_traces, query_metrics, list_incidents, and get_incident

Detailed operational commands live in runbook-guardian.md. The lesson focuses on the guardrails and the incident flow, not memorizing endpoint tables.

Integrations

The guardian is wired around the current SafeOps baseline:

  • Slack webhooks for production and non-production notifications
  • Prometheus for metrics queries and alert inputs
  • structured pod logs as the baseline evidence source
  • optional log backend search when a backend is configured
  • Uptrace for distributed tracing context

Integration Snapshot

Here is the platform GitOps layout used in the SafeOps system to give the guardian runtime context without granting write authority.

Guardian runtime context

  • flux/README.md
  • flux/apps/backend/base/deployment.yaml
  • flux/apps/backend/base/kustomization.yaml
  • flux/apps/backend/base/service.yaml
  • flux/apps/backend/base/servicemonitor.yaml
  • flux/apps/backend/develop/hpa.yaml
  • flux/apps/backend/develop/image-automation.yaml
  • flux/apps/backend/develop/image-policy.yaml
  • flux/apps/backend/develop/kustomization.yaml
  • flux/apps/backend/develop/patches/feature-flags.yaml
  • flux/apps/backend/develop/pdb.yaml
  • flux/apps/backend/production/hpa.yaml
  • flux/apps/backend/production/image-automation.yaml
  • flux/apps/backend/production/image-policy.yaml
  • flux/apps/backend/production/kustomization.yaml
  • flux/apps/backend/production/pdb.yaml
  • flux/apps/backend/staging/hpa.yaml
  • flux/apps/backend/staging/image-automation.yaml
  • flux/apps/backend/staging/image-policy.yaml
  • flux/apps/backend/staging/kustomization.yaml
  • flux/apps/backend/staging/pdb.yaml
  • flux/apps/frontend/base/deployment.yaml
  • flux/apps/frontend/base/ingress.yaml
  • flux/apps/frontend/base/kustomization.yaml
  • flux/apps/frontend/base/service.yaml
  • flux/apps/frontend/overlays/develop/hpa.yaml
  • flux/apps/frontend/overlays/develop/image-automation.yaml
  • flux/apps/frontend/overlays/develop/image-policy.yaml
  • flux/apps/frontend/overlays/develop/kustomization.yaml
  • flux/apps/frontend/overlays/develop/namespace.yaml
  • flux/apps/frontend/overlays/develop/patches/deployment.yaml
  • flux/apps/frontend/overlays/develop/patches/ingress.yaml
  • flux/apps/frontend/overlays/develop/pdb.yaml
  • flux/apps/frontend/overlays/production/hpa.yaml
  • flux/apps/frontend/overlays/production/image-automation.yaml
  • flux/apps/frontend/overlays/production/image-policy.yaml
  • flux/apps/frontend/overlays/production/kustomization.yaml
  • flux/apps/frontend/overlays/production/namespace.yaml
  • flux/apps/frontend/overlays/production/patches/deployment.yaml
  • flux/apps/frontend/overlays/production/patches/ingress.yaml
  • flux/apps/frontend/overlays/production/pdb.yaml
  • flux/apps/frontend/overlays/staging/hpa.yaml
  • flux/apps/frontend/overlays/staging/image-automation.yaml
  • flux/apps/frontend/overlays/staging/image-policy.yaml
  • flux/apps/frontend/overlays/staging/kustomization.yaml
  • flux/apps/frontend/overlays/staging/namespace.yaml
  • flux/apps/frontend/overlays/staging/patches/deployment.yaml
  • flux/apps/frontend/overlays/staging/patches/ingress.yaml
  • flux/apps/frontend/overlays/staging/pdb.yaml
  • flux/bootstrap/apps/.gitkeep

System Context

This chapter connects detection, evidence, and human operations into one incident loop.

It builds directly on:

  • Chapter 10 observability, which provides the raw evidence
  • Chapter 12 chaos drills, which generate controlled noisy scenarios
  • Chapter 14 production SRE, where guardian output must fit human-owned incident management

Lab Files

  • lab.md
  • runbook-guardian.md
  • quiz.md

Done When

  • guardian captures at least one controlled chaos scenario
  • incident is persisted with structured analysis and confidence
  • on-call can acknowledge and resolve the incident via API or CLI
  • one escalation scenario is demonstrated (recurring or persistent)
  • daily report is generated and accessible via API
  • suppression rules can be created and verified

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Lab: Guardian on Top of Controlled Chaos
  • Quiz: Chapter 13 (AI-Assisted SRE Guardian)
  • Runbook: AI Guardian Operations
