Chapter 13: AI-Assisted SRE Guardian
Incident Hook
Multiple warning signals fire after a controlled chaos drill. On-call receives fragmented alerts with no clear priority or incident ownership. Manual triage burns time on duplicate noise while real impact grows. The guardian workflow turns raw signals into structured, actionable incident context.
Observed Symptoms
What the team sees first:
- many alerts are technically true but operationally fragmented
- responders cannot tell whether they are seeing one incident or many
- signal volume starts competing with actual investigation time
The problem is not lack of detection. It is lack of normalization.
Confusion Phase
This is the point where “let the AI fix it” starts sounding attractive. That is the trap.
The real question is:
- how to reduce noise without giving the model unsafe write authority
- and how to keep useful context while still redacting secrets and budgets
Why This Chapter Exists
Chaos testing and alerting generate noise unless incidents are normalized and prioritized. This chapter defines an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without autonomous production changes.
What AI Would Propose (Brave Junior)
- “Auto-remediate incidents directly from AI output.”
- “Send full raw logs and secrets to the LLM for better context.”
- “Resolve low-confidence incidents automatically to reduce the queue.”
Why this sounds reasonable:
- reduces immediate on-call load
- looks faster and more autonomous
Why This Is Dangerous
- autonomous write-back can apply unsafe changes at runtime.
- unsanitized context leaks secrets and violates policy or compliance.
- low-confidence automation hides uncertainty instead of reducing it.
Investigation
Treat the guardian itself as a guarded incident pipeline.
Safe investigation sequence:
- inspect the raw signals entering the guardian
- verify sanitization, context budgets, and confidence handling
- confirm deduplication collapsed duplicates into one incident record
- review whether the proposed actions are useful without crossing the no-mutation boundary
Containment
Containment keeps the guardian helpful but bounded:
- preserve human approval for all remediation
- reduce noise through deduplication and escalation rules
- block unsafe context before LLM analysis
- treat low-confidence output as a review queue, not an automation success
Implementation Scope
This chapter uses a standalone guardian service pattern integrated with the platform:
- Kopf handlers watch Kubernetes warning events and Flux conditions.
- periodic scanners watch pods, PVCs, certificates, endpoints, backups, and critical service chains.
- Prometheus is the baseline signal source; the guardian enriches with structured pod logs, optional Uptrace trace context, and optional configured log-backend search.
- incident lifecycle, suppressions, deduplication state, and LLM usage are stored in SQLite.
- human operators can work through an HTTP API, a CLI, and an MCP server.
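The persistence side of this scope can be pictured as a small SQLite schema. The table and column names below are illustrative assumptions, not the actual k8s-ai-monitor schema:

```python
import sqlite3

# Hypothetical schema sketch covering incident lifecycle, suppressions,
# and LLM usage; names are illustrative, not the real guardian's schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS incidents (
    id INTEGER PRIMARY KEY,
    fingerprint TEXT NOT NULL,        -- dedup key for the incident
    namespace TEXT NOT NULL,
    summary TEXT,
    confidence REAL,                  -- 0.0..1.0 from LLM analysis
    status TEXT DEFAULT 'open',       -- open / acknowledged / resolved
    first_seen REAL,
    last_seen REAL,
    occurrences INTEGER DEFAULT 1
);
CREATE TABLE IF NOT EXISTS suppressions (
    pattern TEXT PRIMARY KEY,
    expires_at REAL
);
CREATE TABLE IF NOT EXISTS llm_usage (
    called_at REAL,
    tokens INTEGER,
    purpose TEXT
);
"""

def open_db(path: str = "/data/guardian.db") -> sqlite3.Connection:
    """Open the guardian state database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping all mutable state in one file under the PVC mount is what makes the singleton operator restartable without losing dedup or suppression history.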
Guardian Contract (Inputs / Outputs / Not Allowed)
Inputs:
- Kubernetes warning events and Flux conditions
- metrics snapshots for error rate, latency, saturation, and node pressure
- bounded pod log context with sensitive fields redacted
- optional Uptrace trace context and optional configured log-backend search when available
Outputs:
- structured incident summary
- ranked likely causes with confidence scores
- proposed next runbook actions
- daily and weekly cluster health reports
Not allowed:
- direct workload mutation (`kubectl apply`, `kubectl patch`, `kubectl delete`) from AI output
- sending raw secrets, tokens, or private keys to LLM providers
- automatic incident resolve or close without human acknowledgement
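The output side of this contract can be sketched as a structured payload. Class and field names here are illustrative assumptions; the real guardian's schema may differ:

```python
import json
from dataclasses import dataclass, field, asdict

# Illustrative output shape mirroring the contract: a structured summary,
# ranked causes with confidence, and proposed next actions.
@dataclass
class Hypothesis:
    cause: str
    confidence: float  # 0.0..1.0, compared against the human-review threshold

@dataclass
class IncidentAnalysis:
    summary: str
    hypotheses: list[Hypothesis] = field(default_factory=list)
    suggested_actions: list[str] = field(default_factory=list)

analysis = IncidentAnalysis(
    summary="backend pods OOMKilled after chaos drill memory spike",
    hypotheses=[
        Hypothesis("memory limit too low for new workload", 0.8),
        Hypothesis("memory leak introduced in latest image", 0.4),
    ],
    suggested_actions=["review container memory limits", "diff last image rollout"],
)
# Hypotheses stay ranked by confidence; the guardian proposes, humans approve.
print(json.dumps(asdict(analysis), indent=2))
```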
Detection and Analysis Pipeline
The guardian works in four stages:
- Detect from four sources: real-time warning events, Flux stalled conditions, periodic scanners, and scheduled daily reports.
- Analyze by collecting pod state, recent logs, events, owner chain, metrics, and optional traces, then sanitizing and budgeting the context.
- Decide by creating or updating an incident record in SQLite, applying deduplication, and attaching confidence-ranked hypotheses.
- Notify by sending structured Slack alerts and exposing the result over API, CLI, and MCP.
Before every LLM call, the guardian redacts connection strings, bearer tokens, AWS keys, JWTs, key-value secrets, and PEM private key blocks. Container env vars and command arguments are never sent.
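A minimal sketch of that redaction pass, using stdlib regexes. The patterns below are simplified illustrations of the categories listed above, not the guardian's actual rules:

```python
import re

# Illustrative redaction patterns; the real guardian's rules are richer.
REDACTIONS = [
    (re.compile(r"(?i)bearer\s+[a-z0-9._\-]+"), "Bearer [REDACTED]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"eyJ[\w-]+\.[\w-]+\.[\w-]+"), "[REDACTED_JWT]"),
    (re.compile(r"(?i)\b(password|token|secret|api_key)\s*[=:]\s*\S+"),
     r"\1=[REDACTED]"),
    (re.compile(r"[a-z+]+://[^\s:@/]+:[^\s@]+@"), "scheme://[REDACTED]@"),
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
                re.S), "[REDACTED_PRIVATE_KEY]"),
]

def sanitize(text: str) -> str:
    """Redact secrets before any text reaches an LLM provider."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Note that redaction runs on everything that leaves the cluster; container env vars and command arguments are simply never collected in the first place, which is cheaper and safer than trying to scrub them.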
Scanner Coverage
| Scanner | Main Purpose | Default Cadence | Notes |
|---|---|---|---|
| Pod | CrashLoopBackOff, OOMKilled, ImagePullBackOff, Error, Evicted | 30 min | Diagnostic plugins add issue-specific context |
| PVC + storage verify | Disk pressure and backup integrity checks | 1 hour + daily verify | Warn at 80%, critical at 90% |
| Certificate | cert-manager expiry and renewal health | 1 hour | 14-day warning threshold |
| Endpoint + critical endpoint | Real user path and deep service-chain probing | 5 min / 60 sec | Probes Traefik path, backend services, and upstream dependencies |
| Backup | CNPG, CronJob-based, and Percona backup health | 1 hour | Missing CRDs auto-skip without failing the scanner |
| Flux condition watch | Stalled Kustomizations and HelmReleases | event-driven | Captures GitOps failure as incident input |
Diagnostic plugins specialize the evidence that is attached to each incident. OOM, crash, image pull, scheduling, mount, unhealthy, and eviction cases each collect targeted context before analysis.
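The plugin dispatch can be sketched as a registry keyed by failure reason. Plugin names, signatures, and fields here are assumptions for illustration:

```python
from typing import Callable

# Hypothetical plugin registry; the real guardian's plugin API may differ.
DIAGNOSTIC_PLUGINS: dict[str, Callable[[dict], dict]] = {}

def diagnostic(reason: str):
    """Register an evidence collector for one pod failure reason."""
    def wrap(fn):
        DIAGNOSTIC_PLUGINS[reason] = fn
        return fn
    return wrap

@diagnostic("OOMKilled")
def collect_oom(pod: dict) -> dict:
    # OOM-specific evidence: configured limit vs last observed usage.
    return {"memory_limit": pod.get("memory_limit"),
            "last_usage": pod.get("last_usage")}

@diagnostic("ImagePullBackOff")
def collect_image_pull(pod: dict) -> dict:
    # Image-pull evidence: which image failed and why.
    return {"image": pod.get("image"), "pull_error": pod.get("pull_error")}

def collect_evidence(reason: str, pod: dict) -> dict:
    """Dispatch to the specialized plugin, with a generic fallback."""
    plugin = DIAGNOSTIC_PLUGINS.get(reason)
    return plugin(pod) if plugin else {"note": f"no specialized plugin for {reason}"}
```

The payoff is that each incident arrives at analysis with targeted evidence instead of a generic dump, which keeps context small and relevant.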
Deduplication and Escalation
Four layers prevent alert storms:
- in-flight state blocks concurrent duplicates.
- SQLite cooldown windows stop repeated re-alerting for the same state.
- context hashes prevent re-analyzing identical evidence.
- escalation logic separates fresh, recurring, and persistent incidents.
Escalation model:
- Fresh: first occurrence, base cooldown of 30 minutes by default.
- Recurring: second occurrence within a 6-hour window, send escalation notice.
- Persistent: third or later occurrence with age above 1 hour, emit hardening alert with exponential backoff capped at 12 hours.
Non-production namespaces use 2x longer cooldowns. LLM analysis there is intentionally reduced to the critical-endpoint path so low-value noise does not consume budget.
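The escalation model above can be sketched in a few lines. The constants mirror the documented defaults; the function names and exact classification rules are illustrative:

```python
# Documented defaults from the escalation model above.
BASE_COOLDOWN_S = 30 * 60          # fresh: 30-minute base cooldown
RECURRING_WINDOW_S = 6 * 3600      # recurring: 2nd hit inside 6 hours
PERSISTENT_MIN_AGE_S = 3600        # persistent: 3rd+ hit, older than 1 hour
MAX_COOLDOWN_S = 12 * 3600         # exponential backoff cap

def classify(occurrences: int, age_s: float, gap_s: float) -> str:
    """Classify an incident as fresh, recurring, or persistent."""
    if occurrences <= 1:
        return "fresh"
    if occurrences >= 3 and age_s > PERSISTENT_MIN_AGE_S:
        return "persistent"
    if gap_s <= RECURRING_WINDOW_S:
        return "recurring"
    return "fresh"  # fell outside the window: treat as a new episode

def cooldown(occurrences: int, non_prod: bool = False) -> int:
    # Exponential backoff: 30m, 1h, 2h, ... capped at 12h.
    secs = min(BASE_COOLDOWN_S * 2 ** max(occurrences - 1, 0), MAX_COOLDOWN_S)
    return secs * 2 if non_prod else secs  # non-prod uses 2x longer cooldowns
```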
Guardrails That Stop It
- AI proposes; humans approve remediation.
- No autonomous write-back to workloads.
- Confidence below threshold requires explicit human review.
- Secret and token redaction is mandatory before any LLM call.
- Context budgets are enforced: 16KB for alerts, 80KB for reports.
- Rate and cost limits are enforced: default 20 LLM calls per hour.
- RBAC allows observation plus limited Kubernetes `Event` writes only; it still cannot mutate workloads or apply manifests.
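The budget and rate-limit guardrails can be sketched with the documented defaults (16KB alert context, 80KB report context, 20 LLM calls per hour). The enforcement details of the real guardian may differ:

```python
import time
from collections import deque

# Documented defaults; enforcement logic below is an illustrative sketch.
ALERT_BUDGET = 16 * 1024
REPORT_BUDGET = 80 * 1024
MAX_CALLS_PER_HOUR = 20

_call_times = deque()  # timestamps of recent LLM calls

def trim_context(text: str, kind: str = "alert") -> str:
    """Enforce the context budget, keeping the most recent evidence."""
    budget = REPORT_BUDGET if kind == "report" else ALERT_BUDGET
    return text[-budget:] if len(text) > budget else text

def may_call_llm(now=None) -> bool:
    """Sliding-window rate limit: allow at most 20 calls per hour."""
    now = time.time() if now is None else now
    while _call_times and now - _call_times[0] > 3600:
        _call_times.popleft()
    if len(_call_times) >= MAX_CALLS_PER_HOUR:
        return False
    _call_times.append(now)
    return True
```

When the limiter rejects a call, the incident is still recorded and notified; only the LLM enrichment is skipped, so safety limits never hide detections.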
Safe Workflow (Step-by-Step)
- Ingest incident signals and normalize them into one incident record.
- Collect evidence, sanitize it, and enforce context and rate limits.
- Generate structured recommendations only: `root_cause`, `confidence`, `hypotheses[]`, `suggested_actions[]`.
- Route the recommendation through a human approval gate.
- Execute the selected runbook step and record the decision with evidence.
Approval Gates
- AI suggests.
- Human selects an allowed action.
- The runbook step executes in controlled scope.
- Post-action evidence is reviewed before the next step.
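The gate sequence above can be sketched as a small state machine. The state names, allowed-action set, and methods are illustrative assumptions, not the guardian's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical allowlist of runbook actions a human may select.
ALLOWED_ACTIONS = {"restart_pod", "scale_deployment", "open_runbook"}

@dataclass
class Remediation:
    incident_id: int
    suggested: list          # actions the AI proposed
    state: str = "suggested"
    decisions: list = field(default_factory=list)

    def approve(self, operator: str, action: str) -> None:
        # AI suggested it, a human picks it, and only allowed actions pass.
        if action not in ALLOWED_ACTIONS or action not in self.suggested:
            raise ValueError(f"action {action!r} not approved for execution")
        self.state = "approved"
        self.decisions.append(f"{operator} approved {action}")

    def record_execution(self, evidence: str) -> None:
        # Execution without an explicit human approval is blocked.
        if self.state != "approved":
            raise RuntimeError("execution without human approval is blocked")
        self.state = "executed"
        self.decisions.append(f"evidence: {evidence}")
```

The decision log doubles as the audit trail: every step records who approved what, and what evidence came back afterwards.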
Guardian Deployment Architecture
The guardian runs as k8s-ai-monitor, a singleton Kopf operator in the observability namespace.
Baseline deployment facts:
- image: `ghcr.io/ldbl/k8s-ai-monitor:main`
- entrypoint: `kopf run --standalone --all-namespaces src/handlers/__init__.py`
- persistence: 2Gi PVC mounted at `/data` for incidents, suppressions, dedup state, reports, and LLM usage
- resources: `10m`/`64Mi` requests and `100m`/`256Mi` limits
- watch model: configurable namespace allowlist plus explicit non-prod and exclude namespace sets
- provider config: Anthropic or OpenAI, with Prometheus as the baseline metrics source
RBAC is read-oriented. It can read pods, logs, events, workloads, namespaces, nodes, services, endpoints, PVCs, cert-manager resources, CNPG and Percona backup objects, and Flux objects. It can also write Kubernetes Event objects so it can leave breadcrumbs. It does not have workload-mutation verbs.
If you enable Traefik IngressRoute critical-endpoint scanning, include read access to traefik.io ingressroutes in the ClusterRole.
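A sketch of the extra ClusterRole rule that grants that read access; the exact placement in your RBAC manifests may differ:

```yaml
# Illustrative ClusterRole rule fragment for Traefik critical-endpoint scanning.
- apiGroups: ["traefik.io"]
  resources: ["ingressroutes"]
  verbs: ["get", "list", "watch"]
```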
Operator Surfaces
The guardian gives operators three safe surfaces:
- HTTP API for health, incidents, reports, suppressions, and LLM-usage visibility
- CLI for acknowledge, resolve, suppress, investigate, log search, trace search, metrics, and audits
- MCP server for `cluster_health`, `investigate`, `audit`, `search_logs`, `search_traces`, `query_metrics`, `list_incidents`, and `get_incident`
Detailed operational commands live in runbook-guardian.md. The lesson focuses on the guardrails and the incident flow, not memorizing endpoint tables.
Integrations
The guardian is wired around the current SafeOps baseline:
- Slack webhooks for production and non-production notifications
- Prometheus for metrics queries and alert inputs
- structured pod logs as the baseline evidence source
- optional log backend search when a backend is configured
- Uptrace for distributed tracing context
Integration Map
- Chapter 12: Controlled Chaos as the primary incident signal source.
- Chapter 10: Observability for metrics, structured logs, and trace correlation.
- Chapter 14: 24/7 Production SRE for on-call lifecycle and escalation discipline.
- Members receive separate read-only repository access, but the lesson itself embeds the implementation snapshots needed for this chapter.
Integration Snapshot
Here is the platform GitOps layout used in the SafeOps system to give the guardian runtime context without granting write authority.
Guardian runtime context
```
flux/README.md
flux/apps/backend/base/deployment.yaml
flux/apps/backend/base/kustomization.yaml
flux/apps/backend/base/service.yaml
flux/apps/backend/base/servicemonitor.yaml
flux/apps/backend/develop/hpa.yaml
flux/apps/backend/develop/image-automation.yaml
flux/apps/backend/develop/image-policy.yaml
flux/apps/backend/develop/kustomization.yaml
flux/apps/backend/develop/patches/feature-flags.yaml
flux/apps/backend/develop/pdb.yaml
flux/apps/backend/production/hpa.yaml
flux/apps/backend/production/image-automation.yaml
flux/apps/backend/production/image-policy.yaml
flux/apps/backend/production/kustomization.yaml
flux/apps/backend/production/pdb.yaml
flux/apps/backend/staging/hpa.yaml
flux/apps/backend/staging/image-automation.yaml
flux/apps/backend/staging/image-policy.yaml
flux/apps/backend/staging/kustomization.yaml
flux/apps/backend/staging/pdb.yaml
flux/apps/frontend/base/deployment.yaml
flux/apps/frontend/base/ingress.yaml
flux/apps/frontend/base/kustomization.yaml
flux/apps/frontend/base/service.yaml
flux/apps/frontend/overlays/develop/hpa.yaml
flux/apps/frontend/overlays/develop/image-automation.yaml
flux/apps/frontend/overlays/develop/image-policy.yaml
flux/apps/frontend/overlays/develop/kustomization.yaml
flux/apps/frontend/overlays/develop/namespace.yaml
flux/apps/frontend/overlays/develop/patches/deployment.yaml
flux/apps/frontend/overlays/develop/patches/ingress.yaml
flux/apps/frontend/overlays/develop/pdb.yaml
flux/apps/frontend/overlays/production/hpa.yaml
flux/apps/frontend/overlays/production/image-automation.yaml
flux/apps/frontend/overlays/production/image-policy.yaml
flux/apps/frontend/overlays/production/kustomization.yaml
flux/apps/frontend/overlays/production/namespace.yaml
flux/apps/frontend/overlays/production/patches/deployment.yaml
flux/apps/frontend/overlays/production/patches/ingress.yaml
flux/apps/frontend/overlays/production/pdb.yaml
flux/apps/frontend/overlays/staging/hpa.yaml
flux/apps/frontend/overlays/staging/image-automation.yaml
flux/apps/frontend/overlays/staging/image-policy.yaml
flux/apps/frontend/overlays/staging/kustomization.yaml
flux/apps/frontend/overlays/staging/namespace.yaml
flux/apps/frontend/overlays/staging/patches/deployment.yaml
flux/apps/frontend/overlays/staging/patches/ingress.yaml
flux/apps/frontend/overlays/staging/pdb.yaml
flux/bootstrap/apps/.gitkeep
```
System Context
This chapter connects detection, evidence, and human operations into one incident loop.
It builds directly on:
- Chapter 10 observability, which provides the raw evidence
- Chapter 12 chaos drills, which generate controlled noisy scenarios
- Chapter 14 production SRE, where guardian output must fit human-owned incident management
Lab Files
- lab.md
- runbook-guardian.md
- quiz.md
Done When
- guardian captures at least one controlled chaos scenario
- incident is persisted with structured analysis and confidence
- on-call can acknowledge and resolve the incident via API or CLI
- one escalation scenario is demonstrated (recurring or persistent)
- daily report is generated and accessible via API
- suppression rules can be created and verified