Production-Grade Kubernetes with Guardrails & AI-Assisted SRE

This page is the detailed study plan. Use /course/ for the course overview and learner-facing homepage. Use this page when you want the exact order, difficulty, time estimate, and prerequisites for every chapter.

Core Track (14 Chapters)

  1. AI Changes Two Things at Once · Beginner · ~2h
  • correlated blast radius from bundling unrelated changes
  • AI as a brave junior: fast, useful, but unsafe without guardrails
  • context checks, plan-before-apply, and one-change-per-PR discipline
  • Prerequisites: Basic Kubernetes familiarity (pods, deployments, services)
  2. Infrastructure as Code (IaC) · Beginner · ~3h
  • Terraform module structure, state, locking, and drift control
  • safe plan -> review -> apply workflow and deny-by-default destroy discipline
  • local kind path and cloud path as separate validation layers
  • Prerequisites: Chapter 01; basic CLI/terminal skills
  3. Secrets Management (SOPS) · Intermediate · ~2.5h
  • encrypted secrets with SOPS + age
  • Flux decryption flow and secret rotation discipline
  • leak response mindset: revert is not remediation
  • Prerequisites: Chapter 02; understanding of encryption basics
  4. GitOps & Version Promotion · Intermediate · ~3h
  • Flux reconciliation model and environment overlays
  • immutable promotion without rebuild across develop, staging, and production
  • rollback by Git evidence, not by ad-hoc rebuilds
  • Prerequisites: Chapters 02-03; Git branching familiarity
  5. CI/CD & Developer Guardrails · Intermediate · ~2.5h
  • local hooks, CI validation, approval gates, and AI-assisted review
  • plan/apply separation and shared validation as the final pre-merge contract
  • layered guardrails: workstation -> CI -> review -> cluster
  • Prerequisites: Chapter 04; Git workflow familiarity
  6. Network Policies (Production Isolation) · Intermediate · ~2.5h
  • default deny baseline and explicit allow rules
  • ingress, DNS, and egress openings as deliberate policy decisions
  • blocked-traffic triage instead of emergency allow-all shortcuts
  • Prerequisites: Chapter 04; basic TCP/IP networking
  7. Security Context & Pod Hardening · Intermediate · ~2.5h
  • non-root, read-only filesystem, dropped capabilities, and seccomp baseline
  • break/fix understanding of permission failures without privilege shortcuts
  • golden manifest baseline versus insecure diff
  • Prerequisites: Chapter 06; Linux permissions model
  8. Resource Management & QoS · Intermediate · ~2.5h
  • requests, limits, quotas, and QoS classes under pressure
  • OOMKilled behavior versus node-pressure behavior
  • scaling only after resource evidence is understood
  • Prerequisites: Chapter 04; understanding of CPU/memory concepts
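The QoS classes named in this chapter follow from how requests and limits are set on each container. A minimal Python sketch of that classification, for orientation only: the function name `qos_class` and the dictionary layout are illustrative assumptions, not course material, and quantities are compared as literal strings (so "500m" and "0.5" would not match here, although Kubernetes treats them as equal).

```python
from typing import Dict, List

# Hypothetical shape for one container: {"requests": {...}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Return the QoS class a pod would get from its containers' resources."""
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # A missing request defaults to the limit when the limit is set.
        requests = {**limits, **c.get("requests", {})}
        if requests or limits:
            any_set = True
        # Guaranteed needs CPU and memory limits, with requests equal to them.
        for res in ("cpu", "memory"):
            if res not in limits or requests.get(res) != limits[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{}]))
    print(qos_class([{"limits": {"cpu": "500m", "memory": "256Mi"}}]))
    print(qos_class([{"requests": {"cpu": "100m"}}]))
```

Under node pressure, BestEffort pods are typically evicted first and Guaranteed pods last, which is why the class matters before any scaling decision.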
  9. Availability Engineering (HPA + PDB) · Intermediate · ~3h
  • HPA bounds, PDB discipline, and rollout/drain coordination
  • anti-patterns like minReplicas=1 for critical services
  • planned disruption as an engineering scenario, not an improvisation
  • Prerequisites: Chapter 08
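To make "HPA bounds" concrete, the sketch below applies the documented HPA scaling rule, desired = ceil(current × currentMetric / targetMetric), then clamps the result to the configured bounds. The function name and parameters are illustrative assumptions; the 0.1 tolerance mirrors the Kubernetes default, and real controllers also apply stabilization windows and readiness handling that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA replica calculation, clamped to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: the controller skips scaling.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Load drops sharply: the lower bound decides how far scale-down goes.
    print(desired_replicas(4, 6, 60, min_replicas=1, max_replicas=10))
    print(desired_replicas(4, 6, 60, min_replicas=2, max_replicas=10))
```

With `min_replicas=1` the quiet-period result is a single pod, which leaves a PodDisruptionBudget nothing to protect during a drain; that is the anti-pattern the chapter calls out for critical services.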
  10. Observability (Metrics, Logs, Traces) · Intermediate · ~3h
  • metrics -> traces -> logs as the preferred incident flow
  • structured log correlation and trace propagation across frontend and backend
  • evidence-first investigation with Prometheus, Uptrace, and guardian alert routing
  • Prerequisites: Chapters 04, 08-09; familiarity with Prometheus concepts
  11. Backup & Restore Basics · Intermediate · ~2.5h
  • backup success versus restore success
  • CloudNativePG backup and restore validation workflow
  • restore verification checklist as the actual safety bar
  • Prerequisites: Chapter 04; PostgreSQL basics
  12. Controlled Chaos · Intermediate · ~3h
  • deterministic drills first, bounded chaos second
  • kill switch, time window, and evidence capture for every drill
  • failure rehearsal as preparation for real incidents
  • Prerequisites: Chapters 06-11
  13. AI-Assisted SRE Guardian · Intermediate · ~2.5h
  • incident normalization, deduplication, and guarded escalation with k8s-ai-monitor
  • LLM boundaries: sanitize, budget, human approval, no workload mutation
  • API, CLI, and MCP surfaces for safe investigation support
  • Prerequisites: Chapters 10 and 12
  14. 24/7 Production SRE · Intermediate · ~3h
  • on-call operating model, severity handling, and escalation discipline
  • postmortems and recurring-problem hardening workflow
  • keeping AI inside human-owned production boundaries
  • Prerequisites: Chapters 01-13

Advanced Track

  15. Supply Chain Security (Advanced) · Advanced · ~3h
  • SBOM generation, image signing, and attestation evidence
  • admission-time verification before deployment
  • audit-first rollout toward enforceable supply-chain trust
  • Prerequisites: Core track complete
  16. Admission Policy Guardrails (Advanced) · Advanced · ~3h
  • Kyverno policy packs for risky manifest prevention
  • audit -> enforce rollout model for cluster-side rules
  • exceptions and break-glass discipline with expiry and evidence
  • Prerequisites: Chapter 15
  17. Advanced Module: Progressive Delivery (Canary with Traefik + Flagger) · Advanced · ~3.5h
  • weighted rollout progression with concrete abort criteria
  • Prometheus-driven canary analysis and rollback evidence
  • bounded blast radius through ingress-level traffic control
  • Prerequisites: Core track complete
  18. Rollback and Data Migrations (Advanced) · Advanced · ~2.5h
  • expand/contract migration model and rollback windows
  • feature-flag-assisted release compatibility
  • destructive migration approval gates and recovery planning
  • Prerequisites: Core track complete

Reference Appendices

  1. Local Development Environment Reference · ~45 min
  • Terraform-managed kind cluster for fast validation
  • local kubeconfig, optional local registry, and local Flux bootstrap
  • when to use local verification versus provider-realistic cloud validation
  • Prerequisites: Chapter 02 recommended
  1. DNS and TLS Automation Reference · ~45 min
  • external-dns + cert-manager as the SafeOps edge automation baseline
  • Cloudflare DNS-01, ClusterIssuers, and Traefik ingress TLS wiring
  • DNS and certificate failures as edge incidents, not application incidents
  • Prerequisites: Chapters 04 and 10 recommended