Runbook: AI Guardian Operations

Purpose

Operate the guardian as a safe incident triage layer, not an auto-remediation engine.

Runtime Checks

Health:

curl -s http://localhost:8080/healthz

Recent incidents:

curl -s http://localhost:8080/incidents | jq

LLM usage and rate over the last 24 hours:

curl -s "http://localhost:8080/llm-usage?hours=24" | jq
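
The health check above can be wrapped so that scripts get a one-word verdict. This is a minimal sketch, not the guardian's own tooling: the function names are hypothetical, and it assumes /healthz answers HTTP 200 when healthy, which this runbook does not state.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status code from /healthz to a verdict.
# Assumption: 200 means healthy; the runbook does not define the contract.
classify_health() {
  case "$1" in
    200)    echo "healthy" ;;
    000|"") echo "unreachable" ;;  # curl reports 000 when no HTTP response arrived
    *)      echo "unhealthy" ;;
  esac
}

# Hypothetical wrapper: probe the guardian and print the verdict.
check_guardian_health() {
  local base_url="${1:-http://localhost:8080}"
  local code
  # -o /dev/null discards the body; -w prints only the status code.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$base_url/healthz")
  classify_health "$code"
}
```

Usage: run check_guardian_health with no arguments for the local instance, or pass a base URL as the first argument.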

Incident Handling Workflow

Confirm the symptom in platform observability.
Open the guardian incident detail.
Validate confidence and evidence:

If confidence is low, require a manual deep-dive.
If confidence is high and the evidence is strong, apply the runbook action.

Acknowledge (ack) the incident once ownership is clear.
Resolve only after recovery has been verified.
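
The confidence and evidence check in the workflow above can be sketched as a small decision function. The numeric thresholds and the "strong" evidence label are assumptions for illustration; this runbook does not define them. Cases the runbook does not cover (for example, high confidence with weak evidence) fall back to manual review here as a conservative default.

```shell
#!/usr/bin/env bash
# Hypothetical cut-offs; tune to whatever the guardian actually reports.
LOW_CONFIDENCE_MAX="${LOW_CONFIDENCE_MAX:-0.5}"
HIGH_CONFIDENCE_MIN="${HIGH_CONFIDENCE_MIN:-0.8}"

# triage_action CONFIDENCE EVIDENCE
#   CONFIDENCE: number in [0,1]
#   EVIDENCE:   "strong" or any other label
triage_action() {
  # awk handles the floating-point comparison that plain sh cannot.
  awk -v c="$1" -v e="$2" \
      -v lo="$LOW_CONFIDENCE_MAX" -v hi="$HIGH_CONFIDENCE_MIN" 'BEGIN {
    if (c + 0 < lo + 0)
      print "manual-deep-dive"
    else if (c + 0 >= hi + 0 && e == "strong")
      print "apply-runbook-action"
    else
      print "manual-review"   # not specified by the runbook; assumed default
  }'
}
```

Usage: triage_action 0.9 strong prints apply-runbook-action, while triage_action 0.3 strong prints manual-deep-dive.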

Escalation Logic

Recurring incidents should raise urgency.
Persistent incidents should trigger a hardening task with an owner and a due date.
Do not close an incident without verified mitigation.
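
One way to make the escalation rules above mechanical is sketched below. The recurrence thresholds are hypothetical, and treating "persistent" as a high recurrence count is an assumption; adjust both to the team's own definitions.

```shell
#!/usr/bin/env bash
# Hypothetical thresholds: occurrences of the same incident within a review window.
RECURRING_MIN="${RECURRING_MIN:-2}"
PERSISTENT_MIN="${PERSISTENT_MIN:-5}"

# urgency_for RECURRENCE_COUNT
urgency_for() {
  if [ "$1" -ge "$PERSISTENT_MIN" ]; then
    echo "raise-urgency+hardening-task"   # task still needs an owner and due date
  elif [ "$1" -ge "$RECURRING_MIN" ]; then
    echo "raise-urgency"
  else
    echo "normal"
  fi
}

# can_close MITIGATION_VERIFIED
# Succeeds only when mitigation has been verified ("true").
can_close() {
  [ "$1" = "true" ]
}
```

Usage: urgency_for 3 prints raise-urgency; can_close false returns a non-zero status, so a closing script can refuse to proceed.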

Security & Compliance Checks

Ensure the sanitizer policy is active.
Verify that incident payloads contain no plaintext secrets.
Rotate API tokens on a regular schedule.
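
The plaintext-secret check above can be partly automated with a pattern scan. This is a heuristic sketch only: the function name is hypothetical, the three patterns (AWS access key IDs, PEM private key headers, bearer tokens) are common examples rather than the sanitizer's actual rule set, and a clean scan does not prove a payload is free of secrets.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read text on stdin and print lines that look like secrets.
# Exit status follows grep: 0 if something suspicious was found, 1 if not.
scan_for_secrets() {
  grep -E -n \
    -e 'AKIA[0-9A-Z]{16}' \
    -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' \
    -e 'Bearer [A-Za-z0-9._~+/-]{20,}'
}
```

Usage: curl -s http://localhost:8080/incidents | scan_for_secrets prints any matching lines with their line numbers; no output and exit status 1 means none of the patterns matched.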

Failure Modes

LLM unavailable:

Continue with raw context and manual triage.
Do not let the outage block incident response.

SQLite unavailable:

Fall back to configmap mode if needed for continuity.
Restore SQLite to regain full incident lifecycle features.

Alert storm:

Tune dedup/cooldown thresholds.
Temporarily reduce scanner frequency.
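
To show what a cooldown threshold controls during an alert storm, here is a minimal sketch of a cooldown check. The function name and the 300-second default are hypothetical; the guardian's real dedup logic and settings are not described in this runbook.

```shell
#!/usr/bin/env bash
# should_emit LAST_ALERT_EPOCH NOW_EPOCH [COOLDOWN_SECONDS]
# Succeeds when enough time has passed since the last alert for the same key.
should_emit() {
  local last="$1" now="$2" cooldown="${3:-300}"
  [ $((now - last)) -ge "$cooldown" ]
}
```

Usage: should_emit 1000 1100 300 fails, so the duplicate is suppressed; should_emit 1000 1400 300 succeeds. Raising the cooldown suppresses more repeats at the cost of slower re-notification.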