SafeOps Academy
Production-Grade Kubernetes with Guardrails and AI-Assisted SRE
A practical course for shipping faster without increasing production risk. Every chapter is built around a real failure mode, a safe operating path, and hands-on labs.
Watch the Intro
Five minutes that set the foundation: the SafeOps mental model, the pledge, and a preview of every chapter in the course.
Core Track
Foundation-first path for platform, CI/CD, GitOps, observability, reliability, and on-call operations.
Chapter 01: AI Changes Two Things at Once
Incident-first guardrails mindset: how AI speed increases blast radius and how to contain it.
Chapter 02: Infrastructure as Code (IaC)
Safe Terraform workflow with plan/review/apply, drift checks, and rollback discipline.
Chapter 03: Secrets Management (SOPS)
SOPS + age end-to-end: encrypt before commit, decrypt in-cluster with Flux.
Chapter 04: GitOps & Version Promotion
Promotion without rebuild: immutable artifact flow develop -> staging -> production.
Chapter 05: CI/CD & Developer Guardrails
Layered guardrails: pre-commit hooks, CI pipelines, approval gates, and AI-assisted review.
Chapter 06: Network Policies (Production Isolation)
Default-deny isolation and controlled allow rules with blocked-traffic debugging.
Chapter 07: Security Context & Pod Hardening
Pod hardening baseline: non-root, read-only root filesystem, dropped capabilities, seccomp.
Chapter 08: Resource Management & QoS
Requests/limits, QoS, quotas, and OOM behavior under load.
Chapter 09: Availability Engineering (HPA + PDB)
HPA + PDB rollout safety, disruption control, and drain readiness checks.
Chapter 10: Observability (Metrics, Logs, Traces)
Correlating metrics, logs, and traces for evidence-driven incident triage.
Chapter 11: Backup & Restore Basics
CloudNativePG (CNPG) backup/restore drills and a validation-focused recovery workflow.
Chapter 12: Controlled Chaos
Deterministic failure drills with bounded blast radius and recovery evidence.
Chapter 13: AI-Assisted SRE Guardian
AI-assisted incident analysis with strict boundaries and explicit human approval gates.
Chapter 14: 24/7 Production SRE
On-call lifecycle, incident command, and blameless postmortem operations.
Advanced Track
Coming soon — SafeOps Advanced: Production AI, Under Control.
Policy, supply-chain trust, progressive delivery, and rollback/data migration safety patterns.
Chapter 15: Supply Chain Security (Advanced)
SBOM, signing, and deploy-time verification policies for trusted runtime artifacts.
Chapter 16: Admission Policy Guardrails (Advanced)
Kyverno-based admission enforcement to block risky manifests at the cluster edge.
Chapter 17: Rollback and Data Migrations (Advanced)
Rollback-safe schema change strategy with feature flags and compatibility windows.
Advanced Module: Progressive Delivery (Canary with Traefik + Flagger)
Traefik + Flagger progressive delivery: weighted canary with measurable abort criteria.