Chapter 13: AI-Assisted SRE Guardian
Incident Hook
Multiple warning signals fire after a controlled chaos drill. On-call receives fragmented alerts with no clear priority or incident ownership. Manual triage burns time on duplicate noise while real impact grows. The guardian workflow turns raw signals into structured, actionable incident context.
Observed Symptoms
What the team sees first:
- many alerts are technically true but operationally fragmented
- responders cannot tell whether they are seeing one incident or many
- signal volume starts competing with actual investigation time
The problem is not lack of detection. It is lack of normalization.
Confusion Phase
This is the point where “let the AI fix it” starts sounding attractive. That is the trap.
The real question is:
- how to reduce noise without giving the model unsafe write authority
- and how to keep useful context while still redacting secrets and budgets
Why This Chapter Exists
Chaos testing and alerting generate noise unless incidents are normalized and prioritized. This chapter defines an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without autonomous production changes.
What AI Would Propose (Brave Junior)
- “Auto-remediate incidents directly from AI output.”
- “Send full raw logs and secrets to the LLM for better context.”
- “Resolve low-confidence incidents automatically to reduce the queue.”
Why this sounds reasonable:
- reduces immediate on-call load
- looks faster and more autonomous
Why This Is Dangerous
- autonomous write-back can apply unsafe changes at runtime.
- unsanitized context leaks secrets and violates policy or compliance.
- low-confidence automation hides uncertainty instead of reducing it.
Investigation
Treat the guardian itself as a guarded incident pipeline.
Safe investigation sequence:
- inspect the raw signals entering the guardian
- verify sanitization, context budgets, and confidence handling
- confirm deduplication collapsed duplicates into one incident record
- review whether the proposed actions are useful without crossing the no-mutation boundary
Containment
Containment keeps the guardian helpful but bounded:
- preserve human approval for all remediation
- reduce noise through deduplication and escalation rules
- block unsafe context before LLM analysis
- treat low-confidence output as a review queue, not an automation success
Implementation Scope
This chapter uses a standalone guardian service pattern integrated with the platform:
- Kopf handlers watch Kubernetes warning events and Flux conditions.
- periodic scanners watch pods, PVCs, certificates, endpoints, backups, and critical service chains.
- Prometheus is the baseline signal source; the guardian enriches with structured pod logs, optional Uptrace trace context, and optional configured log-backend search.
- incident lifecycle, suppressions, deduplication state, and LLM usage are stored in SQLite.
- human operators can work through an HTTP API, a CLI, and an MCP server.
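The persistence side of this scope can be pictured as a small SQLite schema. The table and column names below are illustrative assumptions, not the actual k8s-ai-monitor schema:

```python
import sqlite3

# Hypothetical schema sketch covering incident lifecycle, suppressions,
# and LLM usage; names are illustrative, not the real guardian's schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS incidents (
    id INTEGER PRIMARY KEY,
    fingerprint TEXT NOT NULL,        -- dedup key for the incident
    namespace TEXT NOT NULL,
    summary TEXT,
    confidence REAL,                  -- 0.0..1.0 from LLM analysis
    status TEXT DEFAULT 'open',       -- open / acknowledged / resolved
    first_seen REAL,
    last_seen REAL,
    occurrences INTEGER DEFAULT 1
);
CREATE TABLE IF NOT EXISTS suppressions (
    pattern TEXT PRIMARY KEY,
    expires_at REAL
);
CREATE TABLE IF NOT EXISTS llm_usage (
    called_at REAL,
    tokens INTEGER,
    purpose TEXT
);
"""

def open_db(path: str = "/data/guardian.db") -> sqlite3.Connection:
    """Open the guardian state database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping all mutable state in one file under the PVC mount is what makes the singleton operator restartable without losing dedup or suppression history.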
Guardian Contract (Inputs / Outputs / Not Allowed)
Inputs:
- Kubernetes warning events and Flux conditions
- metrics snapshots for error rate, latency, saturation, and node pressure
- bounded pod log context with sensitive fields redacted
- optional Uptrace trace context and optional configured log-backend search when available
Outputs:
- structured incident summary
- ranked likely causes with confidence scores
- proposed next runbook actions
- daily and weekly cluster health reports
Not allowed:
- direct workload mutation (`kubectl apply`, `kubectl patch`, `kubectl delete`) from AI output
- sending raw secrets, tokens, or private keys to LLM providers
- automatic incident resolve or close without human acknowledgement
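The output side of this contract can be sketched as a structured payload. Class and field names here are illustrative assumptions; the real guardian's schema may differ:

```python
import json
from dataclasses import dataclass, field, asdict

# Illustrative output shape mirroring the contract: a structured summary,
# ranked causes with confidence, and proposed next actions.
@dataclass
class Hypothesis:
    cause: str
    confidence: float  # 0.0..1.0, compared against the human-review threshold

@dataclass
class IncidentAnalysis:
    summary: str
    hypotheses: list[Hypothesis] = field(default_factory=list)
    suggested_actions: list[str] = field(default_factory=list)

analysis = IncidentAnalysis(
    summary="backend pods OOMKilled after chaos drill memory spike",
    hypotheses=[
        Hypothesis("memory limit too low for new workload", 0.8),
        Hypothesis("memory leak introduced in latest image", 0.4),
    ],
    suggested_actions=["review container memory limits", "diff last image rollout"],
)
# Hypotheses stay ranked by confidence; the guardian proposes, humans approve.
print(json.dumps(asdict(analysis), indent=2))
```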
Detection and Analysis Pipeline
The guardian works in four stages:
- Detect from four sources: real-time warning events, Flux stalled conditions, periodic scanners, and scheduled daily reports.
- Analyze by collecting pod state, recent logs, events, owner chain, metrics, and optional traces, then sanitizing and budgeting the context.
- Decide by creating or updating an incident record in SQLite, applying deduplication, and attaching confidence-ranked hypotheses.
- Notify by sending structured Slack alerts and exposing the result over API, CLI, and MCP.
Before every LLM call, the guardian redacts connection strings, bearer tokens, AWS keys, JWTs, key-value secrets, and PEM private key blocks. Container env vars and command arguments are never sent.
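A minimal sketch of that redaction pass, using stdlib regexes. The patterns below are simplified illustrations of the categories listed above, not the guardian's actual rules:

```python
import re

# Illustrative redaction patterns; the real guardian's rules are richer.
REDACTIONS = [
    (re.compile(r"(?i)bearer\s+[a-z0-9._\-]+"), "Bearer [REDACTED]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"eyJ[\w-]+\.[\w-]+\.[\w-]+"), "[REDACTED_JWT]"),
    (re.compile(r"(?i)\b(password|token|secret|api_key)\s*[=:]\s*\S+"),
     r"\1=[REDACTED]"),
    (re.compile(r"[a-z+]+://[^\s:@/]+:[^\s@]+@"), "scheme://[REDACTED]@"),
    (re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
                re.S), "[REDACTED_PRIVATE_KEY]"),
]

def sanitize(text: str) -> str:
    """Redact secrets before any text reaches an LLM provider."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Note that redaction runs on everything that leaves the cluster; container env vars and command arguments are simply never collected in the first place, which is cheaper and safer than trying to scrub them.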
Scanner Coverage
| Scanner | Main Purpose | Default Cadence | Notes |
|---|---|---|---|
| Pod | CrashLoopBackOff, OOMKilled, ImagePullBackOff, Error, Evicted | 30 min | Diagnostic plugins add issue-specific context |
| PVC + storage verify | Disk pressure and backup integrity checks | 1 hour + daily verify | Warn at 80%, critical at 90% |
| Certificate | cert-manager expiry and renewal health | 1 hour | 14-day warning threshold |
| Endpoint + critical endpoint | Real user path and deep service-chain probing | 5 min / 60 sec | Probes Traefik path, backend services, and upstream dependencies |
| Backup | CNPG, CronJob-based, and Percona backup health | 1 hour | Missing CRDs auto-skip without failing the scanner |
| Flux condition watch | Stalled Kustomizations and HelmReleases | event-driven | Captures GitOps failure as incident input |
Diagnostic plugins specialize the evidence that is attached to each incident. OOM, crash, image pull, scheduling, mount, unhealthy, and eviction cases each collect targeted context before analysis.
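The plugin dispatch can be sketched as a registry keyed by failure reason. Plugin names, signatures, and fields here are assumptions for illustration:

```python
from typing import Callable

# Hypothetical plugin registry; the real guardian's plugin API may differ.
DIAGNOSTIC_PLUGINS: dict[str, Callable[[dict], dict]] = {}

def diagnostic(reason: str):
    """Register an evidence collector for one pod failure reason."""
    def wrap(fn):
        DIAGNOSTIC_PLUGINS[reason] = fn
        return fn
    return wrap

@diagnostic("OOMKilled")
def collect_oom(pod: dict) -> dict:
    # OOM-specific evidence: configured limit vs last observed usage.
    return {"memory_limit": pod.get("memory_limit"),
            "last_usage": pod.get("last_usage")}

@diagnostic("ImagePullBackOff")
def collect_image_pull(pod: dict) -> dict:
    # Image-pull evidence: which image failed and why.
    return {"image": pod.get("image"), "pull_error": pod.get("pull_error")}

def collect_evidence(reason: str, pod: dict) -> dict:
    """Dispatch to the specialized plugin, with a generic fallback."""
    plugin = DIAGNOSTIC_PLUGINS.get(reason)
    return plugin(pod) if plugin else {"note": f"no specialized plugin for {reason}"}
```

The payoff is that each incident arrives at analysis with targeted evidence instead of a generic dump, which keeps context small and relevant.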
Deduplication and Escalation
Four layers prevent alert storms:
- in-flight state blocks concurrent duplicates.
- SQLite cooldown windows stop repeated re-alerting for the same state.
- context hashes prevent re-analyzing identical evidence.
- escalation logic separates fresh, recurring, and persistent incidents.
Escalation model:
- Fresh: first occurrence, base cooldown of 30 minutes by default.
- Recurring: second occurrence within a 6-hour window, send escalation notice.
- Persistent: third or later occurrence with age above 1 hour, emit hardening alert with exponential backoff capped at 12 hours.
Non-production namespaces use 2x longer cooldowns. LLM analysis there is intentionally reduced to the critical-endpoint path so low-value noise does not consume budget.
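The escalation model above can be sketched in a few lines. The constants mirror the documented defaults; the function names and exact classification rules are illustrative:

```python
# Documented defaults from the escalation model above.
BASE_COOLDOWN_S = 30 * 60          # fresh: 30-minute base cooldown
RECURRING_WINDOW_S = 6 * 3600      # recurring: 2nd hit inside 6 hours
PERSISTENT_MIN_AGE_S = 3600        # persistent: 3rd+ hit, older than 1 hour
MAX_COOLDOWN_S = 12 * 3600         # exponential backoff cap

def classify(occurrences: int, age_s: float, gap_s: float) -> str:
    """Classify an incident as fresh, recurring, or persistent."""
    if occurrences <= 1:
        return "fresh"
    if occurrences >= 3 and age_s > PERSISTENT_MIN_AGE_S:
        return "persistent"
    if gap_s <= RECURRING_WINDOW_S:
        return "recurring"
    return "fresh"  # fell outside the window: treat as a new episode

def cooldown(occurrences: int, non_prod: bool = False) -> int:
    # Exponential backoff: 30m, 1h, 2h, ... capped at 12h.
    secs = min(BASE_COOLDOWN_S * 2 ** max(occurrences - 1, 0), MAX_COOLDOWN_S)
    return secs * 2 if non_prod else secs  # non-prod uses 2x longer cooldowns
```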
Guardrails That Stop It
- AI proposes; humans approve remediation.
- No autonomous write-back to workloads.
- Confidence below threshold requires explicit human review.
- Secret and token redaction is mandatory before any LLM call.
- Context budgets are enforced: 16KB for alerts, 80KB for reports.
- Rate and cost limits are enforced: default 20 LLM calls per hour.
- RBAC allows observation plus limited Kubernetes `Event` writes only; it still cannot mutate workloads or apply manifests.
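The budget and rate-limit guardrails can be sketched with the documented defaults (16KB alert context, 80KB report context, 20 LLM calls per hour). The enforcement details of the real guardian may differ:

```python
import time
from collections import deque

# Documented defaults; enforcement logic below is an illustrative sketch.
ALERT_BUDGET = 16 * 1024
REPORT_BUDGET = 80 * 1024
MAX_CALLS_PER_HOUR = 20

_call_times = deque()  # timestamps of recent LLM calls

def trim_context(text: str, kind: str = "alert") -> str:
    """Enforce the context budget, keeping the most recent evidence."""
    budget = REPORT_BUDGET if kind == "report" else ALERT_BUDGET
    return text[-budget:] if len(text) > budget else text

def may_call_llm(now=None) -> bool:
    """Sliding-window rate limit: allow at most 20 calls per hour."""
    now = time.time() if now is None else now
    while _call_times and now - _call_times[0] > 3600:
        _call_times.popleft()
    if len(_call_times) >= MAX_CALLS_PER_HOUR:
        return False
    _call_times.append(now)
    return True
```

When the limiter rejects a call, the incident is still recorded and notified; only the LLM enrichment is skipped, so safety limits never hide detections.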
Safe Workflow (Step-by-Step)
- Ingest incident signals and normalize them into one incident record.
- Collect evidence, sanitize it, and enforce context and rate limits.
- Generate structured recommendations only: `root_cause`, `confidence`, `hypotheses[]`, `suggested_actions[]`.
- Route the recommendation through a human approval gate.
- Execute the selected runbook step and record the decision with evidence.
Approval Gates
- AI suggests.
- Human selects an allowed action.
- The runbook step executes in controlled scope.
- Post-action evidence is reviewed before the next step.
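The gate sequence above can be sketched as a small state machine. The state names, allowed-action set, and methods are illustrative assumptions, not the guardian's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical allowlist of runbook actions a human may select.
ALLOWED_ACTIONS = {"restart_pod", "scale_deployment", "open_runbook"}

@dataclass
class Remediation:
    incident_id: int
    suggested: list          # actions the AI proposed
    state: str = "suggested"
    decisions: list = field(default_factory=list)

    def approve(self, operator: str, action: str) -> None:
        # AI suggested it, a human picks it, and only allowed actions pass.
        if action not in ALLOWED_ACTIONS or action not in self.suggested:
            raise ValueError(f"action {action!r} not approved for execution")
        self.state = "approved"
        self.decisions.append(f"{operator} approved {action}")

    def record_execution(self, evidence: str) -> None:
        # Execution without an explicit human approval is blocked.
        if self.state != "approved":
            raise RuntimeError("execution without human approval is blocked")
        self.state = "executed"
        self.decisions.append(f"evidence: {evidence}")
```

The decision log doubles as the audit trail: every step records who approved what, and what evidence came back afterwards.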
Guardian Deployment Architecture
The guardian runs as k8s-ai-monitor, a singleton Kopf operator in the observability namespace.
Baseline deployment facts:
- image: `ghcr.io/ldbl/k8s-ai-monitor:main`
- entrypoint: `kopf run --standalone --all-namespaces src/handlers/__init__.py`
- persistence: 2Gi PVC mounted at `/data` for incidents, suppressions, dedup state, reports, and LLM usage
- resources: `10m`/`64Mi` requests and `100m`/`256Mi` limits
- watch model: configurable namespace allowlist plus explicit non-prod and exclude namespace sets
- provider config: Anthropic or OpenAI, with Prometheus as the baseline metrics source
RBAC is read-oriented. It can read pods, logs, events, workloads, namespaces, nodes, services, endpoints, PVCs, cert-manager resources, CNPG and Percona backup objects, and Flux objects. It can also write Kubernetes Event objects so it can leave breadcrumbs. It does not have workload-mutation verbs.
If you enable Traefik IngressRoute critical-endpoint scanning, include read access to traefik.io ingressroutes in the ClusterRole.
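A sketch of the extra ClusterRole rule that grants that read access; the exact placement in your RBAC manifests may differ:

```yaml
# Illustrative ClusterRole rule fragment for Traefik critical-endpoint scanning.
- apiGroups: ["traefik.io"]
  resources: ["ingressroutes"]
  verbs: ["get", "list", "watch"]
```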
Operator Surfaces
The guardian gives operators three safe surfaces:
- HTTP API for health, incidents, reports, suppressions, and LLM-usage visibility
- CLI for acknowledge, resolve, suppress, investigate, log search, trace search, metrics, and audits
- MCP server for `cluster_health`, `investigate`, `audit`, `search_logs`, `search_traces`, `query_metrics`, `list_incidents`, and `get_incident`
Detailed operational commands live in runbook-guardian.md. The lesson focuses on the guardrails and the incident flow, not memorizing endpoint tables.
Integrations
The guardian is wired around the current SafeOps baseline:
- Slack webhooks for production and non-production notifications
- Prometheus for metrics queries and alert inputs
- structured pod logs as the baseline evidence source
- optional log backend search when a backend is configured
- Uptrace for distributed tracing context
Integration Map
- Chapter 12: Controlled Chaos as the primary incident signal source.
- Chapter 10: Observability for metrics, structured logs, and trace correlation.
- Chapter 14: 24/7 Production SRE for on-call lifecycle and escalation discipline.
- Members receive separate read-only repository access, but the lesson itself embeds the implementation snapshots needed for this chapter.
Integration Snapshot
Here is the platform GitOps layout used in the SafeOps system to give the guardian runtime context without granting write authority.
Guardian runtime context
```
flux/README.md
flux/apps/backend/base/deployment.yaml
flux/apps/backend/base/kustomization.yaml
flux/apps/backend/base/service.yaml
flux/apps/backend/base/servicemonitor.yaml
flux/apps/backend/develop/hpa.yaml
flux/apps/backend/develop/image-automation.yaml
flux/apps/backend/develop/image-policy.yaml
flux/apps/backend/develop/kustomization.yaml
flux/apps/backend/develop/patches/feature-flags.yaml
flux/apps/backend/develop/pdb.yaml
flux/apps/backend/production/hpa.yaml
flux/apps/backend/production/image-automation.yaml
flux/apps/backend/production/image-policy.yaml
flux/apps/backend/production/kustomization.yaml
flux/apps/backend/production/pdb.yaml
flux/apps/backend/staging/hpa.yaml
flux/apps/backend/staging/image-automation.yaml
flux/apps/backend/staging/image-policy.yaml
flux/apps/backend/staging/kustomization.yaml
flux/apps/backend/staging/pdb.yaml
flux/apps/frontend/base/deployment.yaml
flux/apps/frontend/base/ingress.yaml
flux/apps/frontend/base/kustomization.yaml
flux/apps/frontend/base/service.yaml
flux/apps/frontend/overlays/develop/hpa.yaml
flux/apps/frontend/overlays/develop/image-automation.yaml
flux/apps/frontend/overlays/develop/image-policy.yaml
flux/apps/frontend/overlays/develop/kustomization.yaml
flux/apps/frontend/overlays/develop/namespace.yaml
flux/apps/frontend/overlays/develop/patches/deployment.yaml
flux/apps/frontend/overlays/develop/patches/ingress.yaml
flux/apps/frontend/overlays/develop/pdb.yaml
flux/apps/frontend/overlays/production/hpa.yaml
flux/apps/frontend/overlays/production/image-automation.yaml
flux/apps/frontend/overlays/production/image-policy.yaml
flux/apps/frontend/overlays/production/kustomization.yaml
flux/apps/frontend/overlays/production/namespace.yaml
flux/apps/frontend/overlays/production/patches/deployment.yaml
flux/apps/frontend/overlays/production/patches/ingress.yaml
flux/apps/frontend/overlays/production/pdb.yaml
flux/apps/frontend/overlays/staging/hpa.yaml
flux/apps/frontend/overlays/staging/image-automation.yaml
flux/apps/frontend/overlays/staging/image-policy.yaml
flux/apps/frontend/overlays/staging/kustomization.yaml
flux/apps/frontend/overlays/staging/namespace.yaml
flux/apps/frontend/overlays/staging/patches/deployment.yaml
flux/apps/frontend/overlays/staging/patches/ingress.yaml
flux/apps/frontend/overlays/staging/pdb.yaml
flux/bootstrap/apps/.gitkeep
```
System Context
This chapter connects detection, evidence, and human operations into one incident loop.
It builds directly on:
- Chapter 10 observability, which provides the raw evidence
- Chapter 12 chaos drills, which generate controlled noisy scenarios
- Chapter 14 production SRE, where guardian output must fit human-owned incident management
Lab Files
- lab.md
- runbook-guardian.md
- quiz.md
Done When
- guardian captures at least one controlled chaos scenario
- incident is persisted with structured analysis and confidence
- on-call can acknowledge and resolve the incident via API or CLI
- one escalation scenario is demonstrated (recurring or persistent)
- daily report is generated and accessible via API
- suppression rules can be created and verified