Lab: Guardian on Top of Controlled Chaos

Goal

Validate the guardian flow end to end for one controlled incident:

detection from cluster events or the scanner
structured AI analysis
incident persistence and lifecycle actions

Prerequisites

Chapter 12 chaos flow available in develop
k8s-ai-monitor image built and deployed in playground cluster
STATE_BACKEND=sqlite
API token configured for write endpoints

Step 1: Verify Guardian Deployment

Confirm the guardian service is running:

kubectl get deploy k8s-ai-monitor -n observability

Expected:

1/1 replicas ready

Check recent guardian logs:

kubectl logs -n observability deploy/k8s-ai-monitor --tail=20

Expected:

startup messages indicating watchers and scanners are active
no crash loops or fatal errors

Verify health endpoint:

kubectl exec -n observability deploy/k8s-ai-monitor -- wget -qO- http://localhost:8080/healthz

Expected:

health check response (200 OK or JSON status)

Step 2: Trigger Controlled Failure

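Trigger exactly one of these scenarios:

a burst of requests to backend /status/500
a request to backend /panic
one manual Chaos Monkey run

Record the start timestamp so you can correlate the failure with guardian logs and incident records.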
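One way to capture the start timestamp is sketched below. The variable name START_TS and the file name guardian-lab-start are assumptions made for this example and are not part of the lab tooling.

```shell
# Record the chaos start time in UTC, RFC 3339 format.
START_TS="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Persist it so a second terminal or a later step can read it back.
# The file name is hypothetical; any writable path works.
printf '%s\n' "$START_TS" > "${TMPDIR:-/tmp}/guardian-lab-start"

echo "chaos started at $START_TS"
```

The saved value can later be passed to kubectl logs through its --since-time flag, which accepts RFC 3339 timestamps.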

Step 3: Verify Detection

Check guardian logs:

kubectl -n observability logs deploy/k8s-ai-monitor --since=15m

Expected:

warning/scan detection
state key creation
analysis call entry
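To make the three expected entries easier to spot, the log stream can be filtered. The sketch below is only an illustration: the function name filter_guardian_events and the keywords it matches are assumptions, since the guardian's actual log wording is not specified here.

```shell
# Keep only log lines that look like detection, state-key, or analysis events.
# The keywords are guesses; adjust them to the guardian's real log format.
filter_guardian_events() {
  grep -Ei 'warning|scan|state key|analy[sz]'
}

# Example usage against the live deployment (not executed by this block):
#   kubectl -n observability logs deploy/k8s-ai-monitor --since=15m \
#     | filter_guardian_events
```

If the filter prints nothing, check the unfiltered logs before concluding that detection failed, because the keywords may simply not match.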

Step 4: Verify Incident Record

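Start a port-forward in one terminal and leave it running; the remaining steps use it as well:

kubectl -n observability port-forward deploy/k8s-ai-monitor 8080:8080

In a second terminal, list the incidents:

curl -s http://localhost:8080/incidents | jq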

Expected:

active incident present
occurrence_count >= 1
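The two expectations above can be checked mechanically with jq. This is a minimal sketch under stated assumptions: the endpoint is assumed to return a JSON array of incident objects, and each object is assumed to carry a status field whose value is "active" for open incidents. Only occurrence_count comes from this lab; has_active_incident is a hypothetical helper name.

```shell
# Exit 0 if stdin (assumed: a JSON array of incidents) contains at least one
# incident with status "active" (assumed field and value) and
# occurrence_count >= 1. Exit non-zero otherwise.
has_active_incident() {
  jq -e 'any(.[]; .status == "active" and (.occurrence_count // 0) >= 1)' > /dev/null
}

# Example usage with the port-forward running (not executed by this block):
#   curl -s http://localhost:8080/incidents | has_active_incident \
#     && echo "active incident found"
```

If the real response wraps the list in an object or uses a different status value, adjust the jq expression accordingly.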

Step 5: Validate Structured Analysis

Replace <id> with the identifier of the incident listed in Step 4:

curl -s http://localhost:8080/incidents/<id> | jq

Expected fields:

root_cause
confidence
hypotheses[]
suggested_actions[]
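A jq check for the expected fields might look like the following. It assumes the four fields sit at the top level of the incident object; if the guardian nests them (for example under an analysis key), the paths need adjusting. The helper name check_analysis_fields is hypothetical.

```shell
# Exit 0 if the incident JSON on stdin has root_cause and confidence,
# and hypotheses and suggested_actions are both arrays.
# Assumes all four fields are top-level keys of the incident object.
check_analysis_fields() {
  jq -e '
    has("root_cause") and has("confidence")
    and (.hypotheses | type == "array")
    and (.suggested_actions | type == "array")
  ' > /dev/null
}

# Example usage (not executed by this block):
#   curl -s http://localhost:8080/incidents/<id> | check_analysis_fields \
#     || echo "analysis is missing expected fields"
```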

Step 6: Incident Lifecycle Actions

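Acknowledge the incident, then resolve it. Replace <token> with the API token configured for write endpoints:

curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/ack
curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/resolve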

Expected:

status transitions to acknowledged, then resolved

Step 7: Cost/Usage Check

curl -s "http://localhost:8080/llm-usage?hours=24" | jq

Confirm:

LLM calls are rate-limited
usage and cost are visible for audit

Hard Stop Conditions

Stop the lab and investigate if any of the following occurs:

the guardian attempts autonomous remediation
raw secrets or tokens are visible in incident context
repeated identical events are not deduplicated and cause an alert storm
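The secrets condition can be spot-checked by scanning incident JSON for token-shaped strings. The sketch below is illustrative only: scan_for_secrets is a hypothetical helper, and its patterns (bearer tokens, JWT-like strings, AWS-style access key IDs, PEM private key headers) are a small, non-exhaustive sample. A clean result does not prove that no secret leaked.

```shell
# Print lines from stdin that contain token-shaped strings.
# Exit status is 0 when at least one suspicious line is found.
# The patterns are examples, not a complete secret detector.
scan_for_secrets() {
  grep -En 'Bearer [A-Za-z0-9._-]{20,}|eyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}|AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----'
}

# Example usage (not executed by this block):
#   if curl -s http://localhost:8080/incidents/<id> | scan_for_secrets; then
#     echo "HARD STOP: possible secret in incident context"
#   fi
```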

Done When

one chaos incident is fully tracked by guardian
analysis is structured and actionable
lifecycle actions are auditable
