SafeOps Academy

Production-Grade Kubernetes with Guardrails and AI-Assisted SRE

A practical course for shipping faster without increasing production risk. Every chapter is built around a real failure mode, a safe operating path, and hands-on labs.

Core track: 14 chapters
Advanced track: 4 modules
Format: chapter + lab + runbook + quiz
Labs & quizzes: members only

Watch the Intro

Five minutes that set the foundation: the SafeOps mental model, the pledge, and a preview of every chapter in the course.

Core Track

Foundation-first path for platform, CI/CD, GitOps, observability, reliability, and on-call operations.

Chapter 01: AI Changes Two Things at Once

Incident-first guardrails mindset: how AI speed increases blast radius and how to contain it.

Chapter 02: Infrastructure as Code (IaC)

Safe Terraform workflow with plan/review/apply, drift checks, and rollback discipline.

Chapter 03: Secrets Management (SOPS)

SOPS + age end-to-end: encrypt before commit, decrypt in-cluster with Flux.

Chapter 04: GitOps & Version Promotion

Promotion without rebuild: immutable artifact flow develop -> staging -> production.

Chapter 05: CI/CD & Developer Guardrails

Layered guardrails: pre-commit hooks, CI pipelines, approval gates, and AI-assisted review.

Chapter 06: Network Policies (Production Isolation)

Default-deny isolation and controlled allow rules with blocked-traffic debugging.

Chapter 07: Security Context & Pod Hardening

Pod hardening baseline: non-root, read-only root filesystem, dropped capabilities, seccomp.

Chapter 08: Resource Management & QoS

Requests/limits, QoS, quotas, and OOM behavior under load.

Chapter 09: Availability Engineering (HPA + PDB)

HPA + PDB rollout safety, disruption control, and drain readiness checks.

Chapter 10: Observability (Metrics, Logs, Traces)

Correlating metrics, logs, and traces for evidence-driven incident triage.

Chapter 11: Backup & Restore Basics

CloudNativePG (CNPG) backup/restore drills and a validation-focused recovery workflow.

Chapter 12: Controlled Chaos

Deterministic failure drills with bounded blast radius and recovery evidence.

Chapter 13: AI-Assisted SRE Guardian

AI-assisted incident analysis with explicit human approval gates.

Chapter 14: 24/7 Production SRE

On-call lifecycle, incident command, and blameless postmortem operations.

Advanced Track

Coming soon — SafeOps Advanced: Production AI, Under Control.

Policy, supply-chain trust, progressive delivery, and rollback/data migration safety patterns.