Runbook: AI Guardian Operations
Purpose
Operate the guardian as a safe incident triage layer, not an auto-remediation engine.
Runtime Checks
- Health:
  curl -s http://localhost:8080/healthz
- Recent incidents:
  curl -s http://localhost:8080/incidents | jq
- LLM usage and rate:
  curl -s "http://localhost:8080/llm-usage?hours=24" | jq
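The three checks above can be wrapped in one script so that a failing endpoint is reported instead of passing silently. This is a minimal sketch, not part of the guardian itself: the names GUARDIAN_URL, guardian_get, and runtime_checks are hypothetical, and the 5-second timeout is an assumed value. It calls only the endpoints listed above.

```shell
# Base URL of the guardian; the default matches the commands above.
GUARDIAN_URL="${GUARDIAN_URL:-http://localhost:8080}"

# Fetch one guardian endpoint (hypothetical helper).
# --fail makes curl exit non-zero on HTTP 4xx/5xx responses;
# --max-time bounds a hung connection (5 seconds is an assumed value).
guardian_get() {
  curl --silent --show-error --fail --max-time 5 "${GUARDIAN_URL}$1"
}

# Run the checks in order and stop at the first failure.
runtime_checks() {
  guardian_get /healthz >/dev/null || { echo "healthz: FAILED"; return 1; }
  echo "healthz: ok"
  guardian_get /incidents >/dev/null || { echo "incidents: FAILED"; return 1; }
  echo "incidents: ok"
  guardian_get "/llm-usage?hours=24" >/dev/null || { echo "llm-usage: FAILED"; return 1; }
  echo "llm-usage: ok"
}
```

Calling runtime_checks prints one status line per endpoint and returns non-zero as soon as one of them fails, which makes it usable from cron or a CI job.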
Incident Handling Workflow
- Confirm the symptom in platform observability.
- Open the guardian incident detail.
- Validate confidence and evidence:
  - If confidence is low, require a manual deep-dive.
  - If confidence is high and the evidence is strong, apply the runbook action.
- Ack the incident once ownership is clear.
- Resolve only after recovery has been verified.
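The confidence and evidence gate in this workflow can be written down as a small decision function. The sketch below assumes the operator has already read the incident as low or high confidence and weak or strong evidence; those labels and the function name triage_decision are hypothetical, since this runbook does not specify how the guardian encodes them. Combinations the workflow does not mention, such as high confidence with weak evidence, default to manual review here as a cautious assumption.

```shell
# Usage: triage_decision <confidence> <evidence>
# Only "high" confidence with "strong" evidence permits the runbook action.
# Every other combination, including unknown values, falls through to a
# manual deep-dive (an assumed, conservative default).
triage_decision() {
  case "$1:$2" in
    high:strong) echo "apply-runbook-action" ;;
    *)           echo "manual-deep-dive" ;;
  esac
}
```

For example, triage_decision high strong prints apply-runbook-action, while triage_decision low strong prints manual-deep-dive.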
Escalation Logic
- Raise urgency when an incident recurs.
- When an incident persists, open a hardening task with an owner and a due date.
- Do not close an incident without verified mitigation.
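The escalation rules above leave "recurring" and "persistent" undefined, so a team needs to pick concrete thresholds. The sketch below shows one possible encoding. The thresholds (3 occurrences, 24 hours open), the variable names, and the function escalation_action are all assumptions for illustration and do not come from the guardian.

```shell
# Hypothetical thresholds; tune them to your own incident history.
RECUR_THRESHOLD="${RECUR_THRESHOLD:-3}"   # occurrences before urgency is raised
PERSIST_HOURS="${PERSIST_HOURS:-24}"      # hours open before hardening is required

# Usage: escalation_action <occurrences> <hours_open>
# Persistence is checked first because it implies the stronger action.
escalation_action() {
  if [ "$2" -ge "$PERSIST_HOURS" ]; then
    echo "open-hardening-task"
  elif [ "$1" -ge "$RECUR_THRESHOLD" ]; then
    echo "raise-urgency"
  else
    echo "no-escalation"
  fi
}
```

With the default thresholds, escalation_action 3 2 prints raise-urgency and escalation_action 1 24 prints open-hardening-task. The owner and due date for the hardening task still have to be assigned by a person.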
Security & Compliance Checks
- Ensure the sanitizer policy is active.
- Verify that incident payloads contain no plaintext secrets.
- Rotate API tokens on a regular schedule.
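The plaintext-secret check can be partly automated with a pattern scan over incident payloads. The following is a rough sketch under stated assumptions: the function name scan_for_secrets is hypothetical, and the three patterns (AWS access key IDs, PEM private key headers, and key/value pairs named password, passwd, secret, or api key) are illustrative heuristics, not a complete detector. It reports line numbers only, so the secret itself is not echoed to the terminal, and it does not replace the sanitizer policy.

```shell
# Read text on stdin and report the line numbers that look like they
# contain a secret. Returns 0 when nothing matches, 1 otherwise.
scan_for_secrets() {
  # grep exits 1 when nothing matches; "|| true" keeps that from aborting
  # callers that run with "set -e" or "pipefail".
  hits=$(grep -E -n -i \
      -e 'AKIA[0-9A-Z]{16}' \
      -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' \
      -e '(password|passwd|secret|api[_-]?key)"?[[:space:]]*[:=][[:space:]]*"?[^"[:space:]]+' \
    | cut -d: -f1) || true
  if [ -n "$hits" ]; then
    echo "possible secret on line(s): $(printf '%s' "$hits" | tr '\n' ',')"
    return 1
  fi
  return 0
}
```

One way to use it is to pipe the incident list through it, for example curl -s http://localhost:8080/incidents | scan_for_secrets. Expect false positives and false negatives; treat a hit as a prompt for manual review.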
Failure Modes
- LLM unavailable:
  - Continue with raw context and manual triage.
  - Do not let the outage block incident response.
- SQLite unavailable:
  - Fall back to configmap mode if needed for continuity.
  - Restore SQLite to regain full incident lifecycle features.
- Alert storm:
  - Tune dedup/cooldown thresholds.
  - Reduce scanner frequency temporarily.
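Before changing dedup/cooldown thresholds during an alert storm, it can help to replay recent events against a candidate cooldown and see how many alerts would survive. The sketch below is an offline illustration only. It assumes an input format of one event per line, "<epoch_seconds> <key>", sorted by time; that format and the function name apply_cooldown are hypothetical, and the guardian's actual dedup logic is not described in this runbook.

```shell
# Usage: apply_cooldown <cooldown_seconds> < events.txt
# Prints only the events that would still raise an alert: the first event
# for each key, and any later event that arrives at least <cooldown_seconds>
# after the last alert emitted for that key.
apply_cooldown() {
  awk -v cd="$1" '
    !($2 in last) || $1 - last[$2] >= cd {
      print
      last[$2] = $1   # remember when this key last alerted
    }
  '
}
```

Running the same event file through several cooldown values and comparing the line counts gives a rough picture of how much noise each setting removes.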