Lab: Guardian on Top of Controlled Chaos
Goal
Validate guardian flow end-to-end for one controlled incident:
- detect from cluster events/scanner
- structured AI analysis
- incident persistence and lifecycle actions
Prerequisites
- Chapter 12 chaos flow available in develop
- k8s-ai-monitor image built and deployed in playground cluster
- STATE_BACKEND=sqlite
- API token configured for write endpoints
Step 1: Verify Guardian Deployment
Confirm the guardian service is running:
kubectl get deploy k8s-ai-monitor -n observability
Expected:
- 1/1 replicas ready
Check recent guardian logs:
kubectl logs -n observability deploy/k8s-ai-monitor --tail=20
Expected:
- startup messages indicating watchers and scanners are active
- no crash loops or fatal errors
Verify health endpoint:
kubectl exec -n observability deploy/k8s-ai-monitor -- wget -qO- http://localhost:8080/healthz
Expected:
- health response body printed (for example a JSON status); wget exits non-zero if the endpoint returns an HTTP error
Step 2: Trigger Controlled Failure
Use one scenario:
- backend /status/500 burst, or
- backend /panic, or
- one manual Chaos Monkey run.
Capture start timestamp.
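One way to capture the start timestamp is sketched below. It assumes a POSIX shell with `date`; the variable name `START_TS` is illustrative and not defined anywhere in the lab.

```shell
# Record the chaos start time in UTC, RFC 3339 format (e.g. 2024-01-01T12:00:00Z).
# START_TS is an illustrative variable name, not something the lab defines.
START_TS="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "chaos started at ${START_TS}"
```

Because the value is RFC 3339, it can later be passed to `kubectl logs --since-time="$START_TS"` to limit log output to the window after the trigger.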
Step 3: Verify Detection
Check guardian logs:
kubectl -n observability logs deploy/k8s-ai-monitor --since=15m
Expected:
- warning/scan detection
- state key creation
- analysis call entry
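The three expectations above can be checked mechanically instead of by eye. The helper below is a minimal sketch: `check_log_markers` is a hypothetical function name, and the patterns you pass to it are placeholders, since the exact log strings depend on your guardian build.

```shell
# Read guardian logs on stdin and report which expected markers are missing.
# The patterns passed as arguments are placeholders: replace them with the
# strings your guardian build actually logs.
check_log_markers() {
  logs="$(cat)"
  missing=0
  for pattern in "$@"; do
    if printf '%s\n' "$logs" | grep -Eiq -- "$pattern"; then
      echo "found:   $pattern"
    else
      echo "MISSING: $pattern"
      missing=1
    fi
  done
  # Non-zero exit status if any marker was not found.
  return "$missing"
}
```

A possible invocation, with guessed patterns: `kubectl -n observability logs deploy/k8s-ai-monitor --since=15m | check_log_markers 'warning|scan' 'state key' 'analysis'`.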
Step 4: Verify Incident Record
In a separate terminal, forward the guardian port and leave it running:
kubectl -n observability port-forward deploy/k8s-ai-monitor 8080:8080
Then list incidents:
curl -s http://localhost:8080/incidents | jq
Expected:
- active incident present
- occurrence_count >= 1
Step 5: Validate Structured Analysis
Fetch one incident, replacing <id> with an id from the Step 4 output:
curl -s http://localhost:8080/incidents/<id> | jq
Expected fields:
- root_cause
- confidence
- hypotheses[]
- suggested_actions[]
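A quick presence check for the expected fields can be scripted. This is a rough sketch: `check_incident_fields` is a hypothetical helper, and it only does a textual match on key names rather than parsing the JSON, so treat a pass as a first filter and still read the payload.

```shell
# Read one incident's JSON on stdin and check that each expected key appears.
# This is a textual check for `"key":`, not a JSON parse, so it only proves the
# key name is present somewhere in the payload.
check_incident_fields() {
  body="$(cat)"
  rc=0
  for field in root_cause confidence hypotheses suggested_actions; do
    if ! printf '%s' "$body" | grep -Eq "\"$field\"[[:space:]]*:"; then
      echo "missing field: $field" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

For example: `curl -s http://localhost:8080/incidents/<id> | check_incident_fields`. The function prints nothing and exits 0 when all four keys are found.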
Step 6: Incident Lifecycle Actions
curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/ack
curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/resolve
Expected:
- status transitions to acknowledged, then resolved
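The two POST calls can be wrapped so that a rejected request fails loudly instead of silently. The sketch below assumes the endpoints and header shown in this step; `incident_action`, `GUARDIAN_URL`, and `GUARDIAN_TOKEN` are illustrative names, not part of the guardian itself.

```shell
# POST a lifecycle action for one incident.
# GUARDIAN_URL and GUARDIAN_TOKEN are illustrative variable names.
incident_action() {
  id="$1"
  action="$2"
  case "$action" in
    ack|resolve) ;;
    *) echo "unknown action: $action" >&2; return 2 ;;
  esac
  # -f makes curl exit non-zero on HTTP 4xx/5xx, so a rejected call is visible.
  curl -sS -f -X POST \
    -H "X-Internal-Token: ${GUARDIAN_TOKEN:?set GUARDIAN_TOKEN first}" \
    "${GUARDIAN_URL:-http://localhost:8080}/incidents/${id}/${action}"
}
```

With the token exported, `incident_action <id> ack && incident_action <id> resolve` runs both transitions in order and stops if the first one fails. Re-fetch the incident afterwards to confirm the status.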
Step 7: Cost/Usage Check
curl -s "http://localhost:8080/llm-usage?hours=24" | jq
Confirm:
- calls are rate-limited
- usage and cost visible for audit
Hard Stop Conditions
- guardian attempts autonomous remediation
- raw secrets/tokens visible in incident context
- no deduplication: repeated identical events produce an alert storm
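The secrets condition above can be screened with a simple pattern scan over the incident payload. This is only a sketch: `scan_for_secrets` is a hypothetical helper, and its patterns are illustrative examples rather than a complete list, so a clean result does not replace a manual review.

```shell
# Read incident JSON (or any text) on stdin and flag lines that look like
# credentials. The patterns are illustrative examples, not an exhaustive list.
scan_for_secrets() {
  if grep -Eni \
      -e 'Bearer [A-Za-z0-9._-]{16,}' \
      -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' \
      -e '(password|passwd|secret|api[_-]?key|token)"?[[:space:]]*[:=][[:space:]]*"?[^"[:space:],]{8,}'; then
    # grep printed the matching lines (with line numbers) to stdout.
    echo "possible secret material found" >&2
    return 1
  fi
  return 0
}
```

For example: `curl -s http://localhost:8080/incidents/<id> | scan_for_secrets`. Expect occasional false positives, such as long redaction placeholders, and review any flagged line by hand.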
Done When
- one chaos incident is fully tracked by guardian
- analysis is structured and actionable
- lifecycle actions are auditable