Production-Grade Kubernetes with Guardrails & AI-Assisted SRE
This page is the detailed study plan. Use /course/ for the course overview and learner-facing homepage. Use this page when you want the exact order, difficulty, time estimate, and prerequisites for every chapter.
Core Track (14 Chapters)
- AI Changes Two Things at Once
Beginner · ~2h
- correlated blast radius from bundling unrelated changes
- AI as a brave junior: fast, useful, but unsafe without guardrails
- context checks, plan-before-apply, and one-change-per-PR discipline
- Prerequisites: Basic Kubernetes familiarity (pods, deployments, services)
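The one-change-per-PR discipline in this chapter can be checked mechanically before review. The following is a minimal Python sketch, assuming a hypothetical repository layout in which the first path segment names a component (for example `infra/` or `apps/`). The function names and the layout are illustrative assumptions, not part of the course tooling.

```python
from pathlib import PurePosixPath


def touched_components(changed_files):
    """Return the set of top-level directories touched by a change set."""
    components = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Files at the repository root are grouped under "<root>".
        components.add(parts[0] if len(parts) > 1 else "<root>")
    return components


def is_single_change(changed_files):
    """True when the change set stays inside one component (hypothetical rule)."""
    return len(touched_components(changed_files)) <= 1


if __name__ == "__main__":
    pr_files = ["infra/main.tf", "apps/backend/deployment.yaml"]
    print(sorted(touched_components(pr_files)))
    print(is_single_change(pr_files))
```

A check like this could run as a local hook or a CI step. A PR that spans several components is flagged for splitting, which keeps the blast radius of any single merge small.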
- Infrastructure as Code (IaC)
Beginner · ~3h
- Terraform module structure, state, locking, and drift control
- safe `plan -> review -> apply` workflow and deny-by-default destroy discipline
- local `kind` path and cloud path as separate validation layers
- Prerequisites: Chapter 01; basic CLI/terminal skills
- Secrets Management (SOPS)
Intermediate · ~2.5h
- encrypted secrets with SOPS + age
- Flux decryption flow and secret rotation discipline
- leak response mindset: revert is not remediation
- Prerequisites: Chapter 02; understanding of encryption basics
- GitOps & Version Promotion
Intermediate · ~3h
- Flux reconciliation model and environment overlays
- immutable promotion without rebuild across develop, staging, and production
- rollback by Git evidence, not by ad-hoc rebuilds
- Prerequisites: Chapters 02-03; Git branching familiarity
- CI/CD & Developer Guardrails
Intermediate · ~2.5h
- local hooks, CI validation, approval gates, and AI-assisted review
- plan/apply separation and shared validation as the final pre-merge contract
- layered guardrails: workstation -> CI -> review -> cluster
- Prerequisites: Chapter 04; Git workflow familiarity
- Network Policies (Production Isolation)
Intermediate · ~2.5h
- default deny baseline and explicit allow rules
- ingress, DNS, and egress openings as deliberate policy decisions
- blocked-traffic triage instead of emergency allow-all shortcuts
- Prerequisites: Chapter 04; basic TCP/IP networking
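The default-deny baseline and one deliberate opening can be sketched as two NetworkPolicy manifests. The Python below builds them as plain dictionaries and prints JSON, which `kubectl apply -f` accepts alongside YAML. The policy names and the `demo` namespace are assumptions for illustration. The DNS rule is intentionally simple: it allows port 53 to any destination, which a real policy would likely narrow.

```python
import json


def default_deny(namespace):
    """NetworkPolicy that selects every pod and lists no allow rules."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Both directions are isolated; with no rules, nothing is allowed.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def allow_dns_egress(namespace):
    """Explicit opening: let every pod send DNS queries on port 53."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ]
                }
            ],
        },
    }


if __name__ == "__main__":
    for policy in (default_deny("demo"), allow_dns_egress("demo")):
        print(json.dumps(policy, indent=2))
```

Because NetworkPolicies are additive, the second policy opens only DNS while everything else stays blocked by the first.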
- Security Context & Pod Hardening
Intermediate · ~2.5h
- non-root, read-only filesystem, dropped capabilities, and seccomp baseline
- break/fix understanding of permission failures without privilege shortcuts
- golden manifest baseline versus insecure diff
- Prerequisites: Chapter 06; Linux permissions model
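The hardening baseline listed above (non-root, read-only root filesystem, dropped capabilities, seccomp) can be expressed as a small manifest check. This is a hedged sketch: `hardening_violations` is a hypothetical helper, and it inspects only the container-level `securityContext`. In a real manifest, `runAsNonRoot` and `seccompProfile` may also be set at pod level.

```python
def hardening_violations(container):
    """Return a list of baseline violations for one container spec (dict)."""
    sc = container.get("securityContext") or {}
    violations = []

    if sc.get("runAsNonRoot") is not True:
        violations.append("runAsNonRoot must be true")

    if sc.get("readOnlyRootFilesystem") is not True:
        violations.append("readOnlyRootFilesystem must be true")

    dropped = (sc.get("capabilities") or {}).get("drop") or []
    if "ALL" not in dropped:
        violations.append("capabilities.drop must include ALL")

    seccomp_type = (sc.get("seccompProfile") or {}).get("type")
    if seccomp_type not in ("RuntimeDefault", "Localhost"):
        violations.append("seccompProfile.type must be RuntimeDefault or Localhost")

    return violations


if __name__ == "__main__":
    golden = {
        "name": "backend",
        "securityContext": {
            "runAsNonRoot": True,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"]},
            "seccompProfile": {"type": "RuntimeDefault"},
        },
    }
    insecure = {"name": "backend", "securityContext": {"privileged": True}}
    print(hardening_violations(golden))
    print(hardening_violations(insecure))
```

Comparing the output for a golden manifest and an insecure one gives the same kind of diff the chapter describes, without granting extra privileges to make a failure go away.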
- Resource Management & QoS
Intermediate · ~2.5h
- requests, limits, quotas, and QoS classes under pressure
- OOMKilled behavior versus node-pressure behavior
- scaling only after resource evidence is understood
- Prerequisites: Chapter 04; understanding of CPU/memory concepts
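How requests and limits map to QoS classes follows the documented Kubernetes rules. A pod is Guaranteed when every container has CPU and memory limits with equal requests. It is BestEffort when no container sets any CPU or memory request or limit. Otherwise it is Burstable. The sketch below is a simplification: it compares quantities as raw strings, so `"500m"` and `"0.5"` would be treated as different values, and the input shape is an assumption.

```python
def qos_class(containers):
    """Classify a pod from its containers' resource settings.

    containers: list of dicts with optional "requests" and "limits" maps,
    each keyed by "cpu" and/or "memory".
    """
    any_set = False
    guaranteed = True

    for container in containers:
        requests = container.get("requests") or {}
        limits = container.get("limits") or {}
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pod = [{"requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"memory": "256Mi"}}]
    print(qos_class(pod))
```

The class matters under node pressure because BestEffort pods are the first eviction candidates and Guaranteed pods the last.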
- Availability Engineering (HPA + PDB)
Intermediate · ~3h
- HPA bounds, PDB discipline, and rollout/drain coordination
- anti-patterns like `minReplicas=1` for critical services
- planned disruption as an engineering scenario, not an improvisation
- Prerequisites: Chapter 08
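The interaction between HPA bounds and a PDB can be worked through numerically. The Horizontal Pod Autoscaler's documented core calculation is `ceil(currentReplicas * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance (0.1 by default) and clamped to the configured bounds. The sketch below applies that formula plus a simplified PDB budget. The function names are illustrative, and real HPA behavior also includes stabilization windows and readiness handling that are omitted here.

```python
import math


def desired_replicas(current, metric, target, min_replicas, max_replicas,
                     tolerance=0.1):
    """HPA core formula with tolerance and min/max clamping."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: no scaling
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


def allowed_disruptions(healthy_pods, min_available):
    """Voluntary evictions a PDB with minAvailable permits (simplified)."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    # CPU at 90% against a 60% target with 4 replicas.
    print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
    # A single replica protected by minAvailable=1 cannot be drained.
    print(allowed_disruptions(healthy_pods=1, min_available=1))
```

The second call illustrates the `minReplicas=1` anti-pattern: when the HPA scales down to one pod and the PDB requires one available, a node drain has zero disruptions to spend.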
- Observability (Metrics, Logs, Traces)
Intermediate · ~3h
- metrics -> traces -> logs as the preferred incident flow
- structured log correlation and trace propagation across frontend and backend
- evidence-first investigation with Prometheus, Uptrace, and guardian alert routing
- Prerequisites: Chapters 04, 08-09; familiarity with Prometheus concepts
- Backup & Restore Basics
Intermediate · ~2.5h
- backup success versus restore success
- CloudNativePG backup and restore validation workflow
- restore verification checklist as the actual safety bar
- Prerequisites: Chapter 04; PostgreSQL basics
- Controlled Chaos
Intermediate · ~3h
- deterministic drills first, bounded chaos second
- kill switch, time window, and evidence capture for every drill
- failure rehearsal as preparation for real incidents
- Prerequisites: Chapters 06-11
- AI-Assisted SRE Guardian
Intermediate · ~2.5h
- incident normalization, deduplication, and guarded escalation with `k8s-ai-monitor`
- LLM boundaries: sanitize, budget, human approval, no workload mutation
- API, CLI, and MCP surfaces for safe investigation support
- Prerequisites: Chapters 10 and 12
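Incident normalization and deduplication can be illustrated with a fingerprint over the fields that identify an incident. This is a generic sketch, not the `k8s-ai-monitor` implementation: the incident schema (`namespace`, `workload`, `reason`) and both function names are assumptions made for the example.

```python
import hashlib

# Hypothetical identifying fields; a real guardian may use a different schema.
IDENTITY_FIELDS = ("namespace", "workload", "reason")


def fingerprint(incident):
    """Stable short key built from normalized identifying fields."""
    normalized = "|".join(
        str(incident.get(field, "")).strip().lower() for field in IDENTITY_FIELDS
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def deduplicate(incidents):
    """Collapse repeated incidents into one record with an occurrence count."""
    merged = {}
    for incident in incidents:
        key = fingerprint(incident)
        if key in merged:
            merged[key]["count"] += 1
        else:
            merged[key] = {**incident, "count": 1}
    return list(merged.values())


if __name__ == "__main__":
    events = [
        {"namespace": "prod", "workload": "backend", "reason": "OOMKilled"},
        {"namespace": "prod", "workload": "Backend ", "reason": "oomkilled"},
        {"namespace": "prod", "workload": "frontend", "reason": "CrashLoopBackOff"},
    ]
    for record in deduplicate(events):
        print(record["workload"].strip(), record["reason"], record["count"])
```

Deduplicating before any LLM call is one way to respect a budget boundary, since repeated events then cost one escalation rather than many.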
- 24/7 Production SRE
Intermediate · ~3h
- on-call operating model, severity handling, and escalation discipline
- postmortems and recurring-problem hardening workflow
- keeping AI inside human-owned production boundaries
- Prerequisites: Chapters 01-13
Advanced Track
- Supply Chain Security (Advanced)
Advanced · ~3h
- SBOM generation, image signing, and attestation evidence
- admission-time verification before deployment
- audit-first rollout toward enforceable supply-chain trust
- Prerequisites: Core track complete
- Admission Policy Guardrails (Advanced)
Advanced · ~3h
- Kyverno policy packs for risky manifest prevention
- audit -> enforce rollout model for cluster-side rules
- exceptions and break-glass discipline with expiry and evidence
- Prerequisites: Chapter 15
- weighted rollout progression with concrete abort criteria
- Prometheus-driven canary analysis and rollback evidence
- bounded blast radius through ingress-level traffic control
- Prerequisites: Core track complete
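Weighted rollout progression with concrete abort criteria can be sketched as a small decision function. The weight steps and the thresholds below (1% error rate, 500 ms p95 latency) are placeholder assumptions, not values taken from the course. In practice the metrics would come from Prometheus queries rather than function arguments.

```python
def next_canary_weight(current_weight, error_rate, p95_latency_ms,
                       steps=(10, 25, 50, 100),
                       max_error_rate=0.01, max_p95_ms=500.0):
    """Return the next canary traffic weight in percent, or 0 to abort.

    All thresholds and steps are illustrative defaults.
    """
    # Abort criteria are checked first: any breach sends traffic back to stable.
    if error_rate > max_error_rate or p95_latency_ms > max_p95_ms:
        return 0

    # Otherwise advance to the first step above the current weight.
    for step in steps:
        if step > current_weight:
            return step
    return steps[-1]  # already fully promoted


if __name__ == "__main__":
    print(next_canary_weight(10, error_rate=0.002, p95_latency_ms=180))
    print(next_canary_weight(25, error_rate=0.04, p95_latency_ms=180))
```

Keeping the abort check ahead of the promotion logic means a breached threshold can never be outvoted by a pending step, which is what bounds the blast radius.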
- Rollback and Data Migrations (Advanced)
Advanced · ~2.5h
- expand/contract migration model and rollback windows
- feature-flag-assisted release compatibility
- destructive migration approval gates and recovery planning
- Prerequisites: Core track complete
Reference Appendices
- Local Development Environment
Reference · ~45 min
- Terraform-managed `kind` cluster for fast validation
- local kubeconfig, optional local registry, and local Flux bootstrap
- when to use local verification versus provider-realistic cloud validation
- Prerequisites: Chapter 02 recommended
- DNS and TLS Automation
Reference · ~45 min
- `external-dns` + `cert-manager` as the SafeOps edge automation baseline
- Cloudflare DNS-01, ClusterIssuers, and Traefik ingress TLS wiring
- DNS and certificate failures as edge incidents, not application incidents
- Prerequisites: Chapters 04 and 10 recommended