Production-Grade Kubernetes with Guardrails & AI-Assisted SRE

This page is the detailed study plan. Use /course/ for the course overview and learner-facing homepage. Use this page when you want the exact order, difficulty, time estimate, and prerequisites for every chapter.

Core Track (14 Chapters)

AI Changes Two Things at Once · Beginner · ~2h

correlated blast radius from bundling unrelated changes
AI as a brave junior: fast, useful, but unsafe without guardrails
context checks, plan-before-apply, and one-change-per-PR discipline
Prerequisites: Basic Kubernetes familiarity (pods, deployments, services)

Infrastructure as Code (IaC) · Beginner · ~3h

Terraform module structure, state, locking, and drift control
safe plan -> review -> apply workflow and deny-by-default destroy discipline
local kind path and cloud path as separate validation layers
Prerequisites: Chapter 01; basic CLI/terminal skills

Secrets Management (SOPS) · Intermediate · ~2.5h

encrypted secrets with SOPS + age
Flux decryption flow and secret rotation discipline
leak response mindset: revert is not remediation
Prerequisites: Chapter 02; understanding of encryption basics

GitOps & Version Promotion · Intermediate · ~3h

Flux reconciliation model and environment overlays
immutable promotion without rebuild across develop, staging, and production
rollback by Git evidence, not by ad-hoc rebuilds
Prerequisites: Chapters 02-03; Git branching familiarity

CI/CD & Developer Guardrails · Intermediate · ~2.5h

local hooks, CI validation, approval gates, and AI-assisted review
plan/apply separation and shared validation as the final pre-merge contract
layered guardrails: workstation -> CI -> review -> cluster
Prerequisites: Chapter 04; Git workflow familiarity

Network Policies (Production Isolation) · Intermediate · ~2.5h

default deny baseline and explicit allow rules
ingress, DNS, and egress openings as deliberate policy decisions
blocked-traffic triage instead of emergency allow-all shortcuts
Prerequisites: Chapter 04; basic TCP/IP networking
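To make the default-deny baseline concrete, here is a minimal Python sketch that checks whether a NetworkPolicy manifest, already parsed into a dict, is a namespace-wide default deny. The helper name is_default_deny, the namespace "demo", and both sample policies are illustrative assumptions and are not taken from the course material.

```python
def is_default_deny(policy: dict) -> bool:
    """Return True if the manifest is a namespace-wide default-deny NetworkPolicy."""
    if policy.get("kind") != "NetworkPolicy":
        return False
    spec = policy.get("spec", {})
    # An empty podSelector selects every pod in the namespace.
    selects_all_pods = spec.get("podSelector") == {}
    # Both directions must be listed; a direction that is not listed
    # is not restricted by this policy.
    covers_both = {"Ingress", "Egress"} <= set(spec.get("policyTypes", []))
    # No ingress/egress rules means nothing is allowed for the listed types.
    no_rules = not spec.get("ingress") and not spec.get("egress")
    return selects_all_pods and covers_both and no_rules


# Hypothetical sample manifests (names and namespace are assumptions).
DEFAULT_DENY = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "default-deny-all", "namespace": "demo"},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
}

ALLOW_DNS = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-dns-egress", "namespace": "demo"},
    "spec": {
        "podSelector": {},
        "policyTypes": ["Egress"],
        "egress": [{"ports": [{"protocol": "UDP", "port": 53}]}],
    },
}

if __name__ == "__main__":
    print(is_default_deny(DEFAULT_DENY))  # True
    print(is_default_deny(ALLOW_DNS))     # False
```

The second policy is the kind of explicit opening that sits on top of the baseline: it is an allow rule, so the check correctly refuses to count it as the default deny.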

Security Context & Pod Hardening · Intermediate · ~2.5h

non-root, read-only filesystem, dropped capabilities, and seccomp baseline
break/fix understanding of permission failures without privilege shortcuts
golden manifest baseline versus insecure diff
Prerequisites: Chapter 06; Linux permissions model
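As a rough illustration of the hardening baseline listed above, the following Python sketch reports which of the four settings a container spec is missing. The function name hardening_gaps and the sample container are assumptions made for this example. The sketch only inspects the container-level securityContext, although runAsNonRoot and seccompProfile can also be set at pod level.

```python
def hardening_gaps(container: dict) -> list[str]:
    """List baseline hardening settings missing from a container spec."""
    sc = container.get("securityContext", {})
    gaps = []
    if sc.get("runAsNonRoot") is not True:
        gaps.append("runAsNonRoot is not true")
    if sc.get("readOnlyRootFilesystem") is not True:
        gaps.append("readOnlyRootFilesystem is not true")
    # Dropping ALL removes every Linux capability from the container.
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        gaps.append("capabilities.drop does not include ALL")
    # RuntimeDefault applies the container runtime's default seccomp profile.
    if sc.get("seccompProfile", {}).get("type") not in ("RuntimeDefault", "Localhost"):
        gaps.append("seccompProfile.type is not RuntimeDefault or Localhost")
    return gaps


# Hypothetical hardened container (name is an assumption).
HARDENED = {
    "name": "app",
    "securityContext": {
        "runAsNonRoot": True,
        "readOnlyRootFilesystem": True,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    },
}

if __name__ == "__main__":
    print(hardening_gaps(HARDENED))         # []
    print(hardening_gaps({"name": "app"}))  # four findings
```

A check like this supports the "golden manifest versus insecure diff" idea: the list of gaps is the diff, stated in words.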

Resource Management & QoS · Intermediate · ~2.5h

requests, limits, quotas, and QoS classes under pressure
OOMKilled behavior versus node-pressure behavior
scaling only after resource evidence is understood
Prerequisites: Chapter 04; understanding of CPU/memory concepts
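The relationship between requests, limits, and QoS classes can be sketched in a few lines of Python. This is a simplified model of the Kubernetes classification rules, not the kubelet's implementation: the function name qos_class is an assumption, only cpu and memory are considered, and quantities are compared as plain strings (so "500m" and "0.5" would be treated as different).

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort (simplified)."""
    any_set = False
    guaranteed = True
    for container in containers:
        res = container.get("resources", {})
        limits = res.get("limits", {})
        # A missing request defaults to the limit when a limit is set.
        requests = {**limits, **res.get("requests", {})}
        for name in ("cpu", "memory"):
            if name in requests or name in limits:
                any_set = True
            # Guaranteed needs a limit for each resource, equal to its request.
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pinned = {"resources": {"requests": {"cpu": "500m", "memory": "256Mi"},
                            "limits": {"cpu": "500m", "memory": "256Mi"}}}
    partial = {"resources": {"requests": {"cpu": "100m"}}}
    print(qos_class([pinned]))   # Guaranteed
    print(qos_class([partial]))  # Burstable
    print(qos_class([{}]))       # BestEffort
```

Knowing the class before an incident helps separate the two failure modes in this chapter: a container that exceeds its own memory limit is OOMKilled regardless of class, while under node pressure BestEffort pods are generally among the first candidates for eviction.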

Availability Engineering (HPA + PDB) · Intermediate · ~3h

HPA bounds, PDB discipline, and rollout/drain coordination
anti-patterns like minReplicas=1 for critical services
planned disruption as an engineering scenario, not an improvisation
Prerequisites: Chapter 08
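The interaction between HPA bounds and a PDB can be checked mechanically. The Python sketch below flags two situations: the minReplicas=1 anti-pattern named above, and a PDB whose minAvailable leaves no room for voluntary disruption at minimum scale. The function name availability_findings is an assumption, and the sketch only handles an integer minAvailable, not percentages or maxUnavailable.

```python
def availability_findings(min_replicas: int, pdb_min_available: int) -> list[str]:
    """Flag HPA/PDB combinations that undermine planned disruptions."""
    findings = []
    if min_replicas < 2:
        findings.append(
            "HPA minReplicas < 2: a single restart or drain removes all capacity"
        )
    # At minimum scale the budget allows (min_replicas - minAvailable) evictions.
    if pdb_min_available >= min_replicas:
        findings.append(
            "PDB minAvailable >= HPA minReplicas: no eviction is allowed at "
            "minimum scale, so node drains can block"
        )
    return findings


if __name__ == "__main__":
    print(availability_findings(min_replicas=1, pdb_min_available=1))  # two findings
    print(availability_findings(min_replicas=3, pdb_min_available=2))  # []
```

Running a check like this before a node drain turns the disruption into a reviewed scenario: either the budget allows at least one eviction at minimum scale, or the drain is known in advance to stall.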

Observability (Metrics, Logs, Traces) · Intermediate · ~3h

metrics -> traces -> logs as the preferred incident flow
structured log correlation and trace propagation across frontend and backend
evidence-first investigation with Prometheus, Uptrace, and guardian alert routing
Prerequisites: Chapters 04, 08-09; familiarity with Prometheus concepts

Backup & Restore Basics · Intermediate · ~2.5h

backup success versus restore success
CloudNativePG backup and restore validation workflow
restore verification checklist as the actual safety bar
Prerequisites: Chapter 04; PostgreSQL basics

Controlled Chaos · Intermediate · ~3h

deterministic drills first, bounded chaos second
kill switch, time window, and evidence capture for every drill
failure rehearsal as preparation for real incidents
Prerequisites: Chapters 06-11

AI-Assisted SRE Guardian · Intermediate · ~2.5h

incident normalization, deduplication, and guarded escalation with k8s-ai-monitor
LLM boundaries: sanitize, budget, human approval, no workload mutation
API, CLI, and MCP surfaces for safe investigation support
Prerequisites: Chapters 10 and 12
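To give a feel for what incident normalization and deduplication mean in practice, here is a small Python sketch. It does not describe how k8s-ai-monitor works internally: the event shape, the functions fingerprint and deduplicate, and the pod-name heuristic are all assumptions made for illustration.

```python
import re

# Heuristic (assumption): Deployment pods are named
# <workload>-<replicaset-hash>-<5-char-suffix>; strip the last two parts.
POD_SUFFIX = re.compile(r"-[a-z0-9]{5,10}-[a-z0-9]{5}$")


def fingerprint(event: dict) -> tuple[str, str, str]:
    """Normalize an event to (namespace, workload, reason)."""
    workload = POD_SUFFIX.sub("", event["pod"])
    return (event["namespace"], workload, event["reason"])


def deduplicate(events: list[dict]) -> list[dict]:
    """Collapse events with the same fingerprint into one incident with a count."""
    incidents: dict[tuple[str, str, str], dict] = {}
    for event in events:
        namespace, workload, reason = key = fingerprint(event)
        incident = incidents.setdefault(
            key,
            {"namespace": namespace, "workload": workload, "reason": reason, "count": 0},
        )
        incident["count"] += 1
    return list(incidents.values())


if __name__ == "__main__":
    # Hypothetical events; names are made up for the example.
    events = [
        {"namespace": "shop", "pod": "api-7d9f8b6c5d-x2k4q", "reason": "OOMKilled"},
        {"namespace": "shop", "pod": "api-7d9f8b6c5d-p9m3z", "reason": "OOMKilled"},
        {"namespace": "shop", "pod": "api-7d9f8b6c5d-x2k4q", "reason": "CrashLoopBackOff"},
    ]
    for incident in deduplicate(events):
        print(incident)
```

Collapsing per-pod noise into per-workload incidents before anything reaches an LLM or a human is one way to keep escalation guarded and the token budget bounded. The sketch only reads and groups events; it never mutates a workload.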

24/7 Production SRE · Intermediate · ~3h

on-call operating model, severity handling, and escalation discipline
postmortems and recurring-problem hardening workflow
keeping AI inside human-owned production boundaries
Prerequisites: Chapters 01-13

Advanced Track

Supply Chain Security (Advanced) · Advanced · ~3h

SBOM generation, image signing, and attestation evidence
admission-time verification before deployment
audit-first rollout toward enforceable supply-chain trust
Prerequisites: Core track complete

Admission Policy Guardrails (Advanced) · Advanced · ~3h

Kyverno policy packs for risky manifest prevention
audit -> enforce rollout model for cluster-side rules
exceptions and break-glass discipline with expiry and evidence
Prerequisites: Chapter 15

Advanced Module: Progressive Delivery (Canary with Traefik + Flagger) · Advanced · ~3.5h

weighted rollout progression with concrete abort criteria
Prometheus-driven canary analysis and rollback evidence
bounded blast radius through ingress-level traffic control
Prerequisites: Core track complete

Rollback and Data Migrations (Advanced) · Advanced · ~2.5h

expand/contract migration model and rollback windows
feature-flag-assisted release compatibility
destructive migration approval gates and recovery planning
Prerequisites: Core track complete

Reference Appendices

Local Development Environment · Reference · ~45 min

Terraform-managed kind cluster for fast validation
local kubeconfig, optional local registry, and local Flux bootstrap
when to use local verification versus provider-realistic cloud validation
Prerequisites: Chapter 02 recommended

DNS and TLS Automation · Reference · ~45 min

external-dns + cert-manager as the SafeOps edge automation baseline
Cloudflare DNS-01, ClusterIssuers, and Traefik ingress TLS wiring
DNS and certificate failures as edge incidents, not application incidents
Prerequisites: Chapters 04 and 10 recommended