Core Track: guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Lab: Guardian on Top of Controlled Chaos

Goal

Validate the guardian flow end to end for one controlled incident:

  • detection from cluster events or the scanner
  • structured AI analysis
  • incident persistence and lifecycle actions

Prerequisites

  • Chapter 12 chaos flow available in develop
  • k8s-ai-monitor image built and deployed in playground cluster
  • STATE_BACKEND=sqlite
  • API token configured for write endpoints

Step 1: Verify Guardian Deployment

Confirm the guardian service is running:

kubectl get deploy k8s-ai-monitor -n observability

Expected:

  • 1/1 replicas ready
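
For a scripted version of this check, the same comparison can be made on the JSON form of the command output. The sketch below assumes the output of kubectl get deploy k8s-ai-monitor -n observability -o json is available as a string; deployment_ready is a hypothetical helper name, while spec.replicas and status.readyReplicas are standard Deployment fields.

```python
import json


def deployment_ready(deploy_json: str) -> bool:
    """Return True when every desired replica reports ready."""
    doc = json.loads(deploy_json)
    # spec.replicas defaults to 1 when it is not set explicitly.
    desired = doc.get("spec", {}).get("replicas", 1)
    # status.readyReplicas is omitted while no replica is ready.
    ready = doc.get("status", {}).get("readyReplicas", 0)
    return desired > 0 and ready == desired


if __name__ == "__main__":
    sample = '{"spec": {"replicas": 1}, "status": {"readyReplicas": 1}}'
    print(deployment_ready(sample))  # True
```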

Check recent guardian logs:

kubectl logs -n observability deploy/k8s-ai-monitor --tail=20

Expected:

  • startup messages indicating watchers and scanners are active
  • no crash loops or fatal errors

Verify health endpoint:

kubectl exec -n observability deploy/k8s-ai-monitor -- wget -qO- http://localhost:8080/healthz

Expected:

  • a health response body (for example a JSON status); wget exits non-zero if the endpoint does not return a success status

Step 2: Trigger Controlled Failure

Use one scenario:

  • backend /status/500 burst, or
  • backend /panic, or
  • one manual Chaos Monkey run.

Capture the start timestamp (UTC is easiest) so the detection logs in Step 3 can be matched against it.
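
One way to make the /status/500 burst repeatable is to script it and record the start time in the same run. This is a minimal sketch, not the lab's official tooling: run_burst and http_status are hypothetical helpers, and the backend URL in the usage comment is an assumption that must be replaced with your own backend address.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Dict, List


def http_status(url: str) -> int:
    """Return the HTTP status code, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on error statuses; the code is still what we want.
        return err.code


def run_burst(fetch: Callable[[str], int], url: str, count: int = 20) -> Dict[str, object]:
    """Call fetch(url) `count` times and record the UTC start time."""
    started_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    statuses: List[int] = [fetch(url) for _ in range(count)]
    return {"started_at": started_at, "statuses": statuses}


# Usage (assumed URL, adjust to your backend):
#   result = run_burst(http_status, "http://localhost:8000/status/500")
#   print(result["started_at"], result["statuses"])
```

Passing the fetch function in as a parameter keeps the burst logic easy to dry-run with a stub before pointing it at the real backend.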

Step 3: Verify Detection

Check guardian logs:

kubectl -n observability logs deploy/k8s-ai-monitor --since=15m

Expected:

  • warning/scan detection
  • state key creation
  • analysis call entry
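
If you save the log output to a file, the three expected signals can be checked mechanically. The sketch below is an illustration only: the regular expressions are guesses at the guardian's log wording, not its real format, and missing_markers is a hypothetical helper. Adjust the patterns after looking at a real log sample.

```python
import re
from typing import Dict, List

# Hypothetical patterns: tune these to the guardian's actual log messages.
EXPECTED_MARKERS: Dict[str, str] = {
    "detection": r"warning|scan",
    "state_key": r"state[ _-]?key",
    "analysis": r"analy[sz]",
}


def missing_markers(log_text: str, markers: Dict[str, str] = EXPECTED_MARKERS) -> List[str]:
    """Return the marker names that never appear in the log text."""
    return [
        name
        for name, pattern in markers.items()
        if not re.search(pattern, log_text, flags=re.IGNORECASE)
    ]


if __name__ == "__main__":
    sample = "WARN scan detected pod restart\ncreated state key ns/pod\n"
    print(missing_markers(sample))  # ['analysis']
```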

Step 4: Verify Incident Record

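Forward the guardian port in a separate terminal (the command blocks until interrupted):

kubectl -n observability port-forward deploy/k8s-ai-monitor 8080:8080

Then list incidents from your working terminal:

curl -s http://localhost:8080/incidents | jq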

Expected:

  • active incident present
  • occurrence_count >= 1
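
The two expectations above can also be asserted in code. This sketch assumes that /incidents returns a JSON array of objects and that each object carries a status field with the value "active" for open incidents; only occurrence_count is named in this lab, so treat the rest of the shape as an assumption. active_incidents is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def active_incidents(incidents_json: str) -> List[Dict[str, Any]]:
    """Return incidents that are active and have been seen at least once."""
    incidents = json.loads(incidents_json)
    return [
        item
        for item in incidents
        # Assumed field: "status"; documented field: "occurrence_count".
        if item.get("status") == "active" and item.get("occurrence_count", 0) >= 1
    ]


if __name__ == "__main__":
    sample = '[{"id": "a", "status": "active", "occurrence_count": 2}]'
    print(len(active_incidents(sample)))  # 1
```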

Step 5: Validate Structured Analysis

Fetch the incident by the ID returned in Step 4:

curl -s http://localhost:8080/incidents/<id> | jq

Expected fields:

  • root_cause
  • confidence
  • hypotheses[]
  • suggested_actions[]
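
A small validator can confirm that the analysis is structured rather than free text. The field names below come from the expected-fields list; the types, and the assumption that the fields sit at the top level of the incident object, are guesses made for this sketch. analysis_problems is a hypothetical helper.

```python
from typing import Any, Dict, List

# Field names are from the lab; the accepted types are assumptions.
REQUIRED_FIELDS = {
    "root_cause": str,
    "confidence": (int, float, str),  # numeric score or a label such as "high"
    "hypotheses": list,
    "suggested_actions": list,
}


def analysis_problems(incident: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in incident:
            problems.append(f"missing: {field}")
        elif not isinstance(incident[field], expected_type):
            problems.append(f"wrong type: {field}")
    return problems
```

If the real response nests these fields under another key, pass that nested object to the function instead of the whole incident.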

Step 6: Incident Lifecycle Actions

curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/ack
curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/resolve

Expected:

  • the incident status changes to acknowledged after the first call, then to resolved after the second
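
The same two calls can be prepared from a script, which makes it harder to send a write request without the token header. This is a sketch under the assumption that the endpoints behave exactly as the curl commands above suggest; lifecycle_request is a hypothetical helper, and it only builds the request so that it can be inspected before sending.

```python
import urllib.request


def lifecycle_request(base_url: str, incident_id: str, action: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST for an ack or resolve action."""
    if action not in ("ack", "resolve"):
        raise ValueError(f"unsupported action: {action}")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/incidents/{incident_id}/{action}",
        headers={"X-Internal-Token": token},
        method="POST",
    )


# Usage, with the port-forward from Step 4 still running:
#   req = lifecycle_request("http://localhost:8080", "<id>", "ack", "<token>")
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(resp.status)
```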

Step 7: Cost/Usage Check

curl -s "http://localhost:8080/llm-usage?hours=24" | jq

Confirm:

  • LLM calls are rate-limited
  • usage and cost are visible for audit
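
To check the rate limit with numbers rather than by eye, one option is a sliding-window count over call times. The response shape of /llm-usage is not described in this lab, so this sketch assumes you can extract one timestamp (in seconds) per LLM call; max_calls_in_window is a hypothetical helper, and the limit you compare against is whatever your deployment configures.

```python
from typing import List


def max_calls_in_window(call_times: List[float], window_seconds: float) -> int:
    """Return the largest number of calls inside any window of the given length."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    times = sorted(call_times)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Drop calls from the left until the window spans less than window_seconds.
        while t - times[start] >= window_seconds:
            start += 1
        best = max(best, end - start + 1)
    return best


if __name__ == "__main__":
    # Five calls; at most three fall inside any 60-second window.
    print(max_calls_in_window([0, 10, 20, 70, 75], 60))  # 3
```

If the result is above the configured limit for the same window length, the rate limiter is not holding.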

Hard Stop Conditions

Stop the lab and investigate if any of the following occurs:

  • the guardian attempts autonomous remediation
  • raw secrets or tokens are visible in the incident context
  • repeated identical events are not deduplicated and cause an alert storm
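
The secrets condition can be partly screened by scanning the incident JSON from Step 5 for secret-like strings. The patterns below are heuristics chosen for this sketch, not a list used by the guardian, and find_secret_like is a hypothetical helper. A clean result does not prove the context is safe, so a manual review is still needed.

```python
import re
from typing import List

# Heuristic patterns only; extend them for the credentials your cluster uses.
SECRET_PATTERNS = {
    "bearer_token": re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]{20,}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # Matches key=value and JSON-style "key": "value" pairs with long values.
    "inline_credential": re.compile(
        r"(?i)\b(password|secret|token|api[_-]?key)\b[\"']?\s*[=:]\s*[\"']?[^\s\"',}]{8,}"
    ),
}


def find_secret_like(text: str) -> List[str]:
    """Return the names of all patterns that match somewhere in the text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
```

Feed it the raw output of the Step 5 curl command; any non-empty result is a reason to stop and inspect the incident context by hand.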

Done When

  • one chaos incident is fully tracked by guardian
  • analysis is structured and actionable
  • lifecycle actions are auditable