Guardrails-First Course Materials

What You Get

This course is organized as a practical production path:

  • each chapter focuses on one common incident/failure mode
  • each chapter defines a safe operational workflow
  • each chapter includes hands-on materials (lab.md and quiz.md; many also include runbooks)

Recommended order:

  1. Start from Chapter 01 and go sequentially through the core track.
  2. Run the lab before moving to the next chapter.
  3. Use the quiz at the end of each chapter to validate understanding.
  4. Move to advanced modules only after finishing the core path.

Tracks

Core track:

  • Chapters 01-13 (platform fundamentals, GitOps, security, observability, reliability, on-call)

Advanced track:

  • Chapter 14: Supply Chain Security
  • Chapter 15: Admission Policy Guardrails
  • Chapter 16: Rollback and Data Migrations
  • Module: Linkerd + Progressive Delivery (Canary / A-B)

Chapter-by-Chapter Implementation Coverage (sre -> course)

| Chapter/Module | Platform Status in sre | Evidence in Platform Repo |
| --- | --- | --- |
| Chapter 01 Intro Guardrails | Implemented | scripts/guard-kube-context.sh, scripts/guard-terraform-plan.sh, .pre-commit-config.yaml |
| Chapter 02 IaC | Implemented | infra/terraform/hcloud_cluster/, infra/terraform/kind_cluster/ |
| Chapter 03 Secrets | Implemented | .sops.yaml, flux/secrets/**, scripts/sops-setup.sh |
| Chapter 04 GitOps | Implemented | flux/bootstrap/infrastructure/image-automation/, flux/apps/**/image-policy.yaml |
| Chapter 05 Network Policies | Implemented | flux/infrastructure/network-policies/** |
| Chapter 06 Security Context | Implemented | flux/apps/backend/base/deployment.yaml, flux/apps/frontend/base/deployment.yaml |
| Chapter 07 Resource Management | Implemented | flux/infrastructure/resource-management/** |
| Chapter 08 Availability (HPA+PDB) | Implemented | flux/apps/backend/*/{hpa,pdb}.yaml, flux/apps/frontend/overlays/*/{hpa,pdb}.yaml |
| Chapter 09 Observability | Implemented | flux/infrastructure/observability/**, backend/frontend telemetry instrumentation |
| Chapter 10 Backup & Restore | Implemented | flux/infrastructure/data/cnpg-operator/, flux/infrastructure/data/cnpg-clusters/** |
| Chapter 11 Controlled Chaos | Implemented | flux/infrastructure/chaos/develop/** |
| Chapter 12 AI Guardian | Partially implemented (course-ready + integration contract) | chapter design and runbook ready; guardian service manifests are external integration target |
| Chapter 13 24/7 Production SRE | Implemented (operational content + alerts baseline) | chapter-13 runbooks/postmortem + observability alert rules |
| Chapter 14 Supply Chain Security | Scaffolded in platform, fully documented in course | flux/infrastructure/policy/kyverno/, policy/packs/chapter-14-supply-chain/ |
| Chapter 15 Admission Policy Guardrails | Scaffolded in platform, fully documented in course | flux/infrastructure/policy/kyverno/, policy/packs/chapter-15-admission-guardrails/ |
| Chapter 16 Rollback & Data Migrations | Course workflow ready (simulation), implementation follows app DB evolution | CNPG platform baseline + chapter lab/runbook for rollout/rollback sequence |
| Module Linkerd Progressive Delivery | Controllers/manifests present; sample pack opt-in | flux/infrastructure/progressive-delivery/{linkerd,flagger,develop}/ |

Advanced-track policy packs intentionally ship in safe scaffold mode first (the chapters follow an Audit-first workflow) and move to enforced runtime policy only as rollout evidence matures.

References

Chapter 01: AI Changes Two Things at Once

  • a backend image tag bump for develop
  • an ingress manifest change intended for staging

The change looks harmless in review because each diff is small. In practice, the combined blast radius is larger: routing breaks while …

Chapter 02: Infrastructure as Code (IaC)

  • repeatability
  • reviewability
  • rollback paths
  • controlled blast radius

This chapter introduces a guardrails-first Terraform workflow for Kubernetes platforms.

Learning Objectives: By the end of this chapter, learners can: …

Chapter 03: Secrets Management (SOPS)

  • secrets are encrypted before commit
  • Flux decrypts in-cluster with sops-age
  • key material is never committed

The Incident Hook: A teammate commits a plaintext API key to fix a failing deploy quickly. The key is exposed in …

Chapter 04: GitOps & Version Promotion

  • develop deploys develop-* images
  • staging deploys staging-* images
  • production deploys production-* images from explicit promotion

The Incident Hook: A team rebuilds “the same” code for production during …
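
The environment-prefixed tag convention in this entry can be checked mechanically. The following is a minimal sketch, not the course's own tooling: the function name `tag_allowed` and the version suffix `1.4.2` are hypothetical, and only the three environment names and the `<environment>-*` prefix rule come from the chapter summary.

```python
# Minimal sketch (hypothetical helper, not part of the platform repo):
# an image tag is accepted for an environment only if it carries that
# environment's prefix, e.g. "staging-..." for staging.
ENVIRONMENTS = ("develop", "staging", "production")


def tag_allowed(environment: str, image_tag: str) -> bool:
    """Return True when image_tag follows the '<environment>-*' convention."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return image_tag.startswith(f"{environment}-")


# The tag suffix "1.4.2" is an illustrative assumption.
print(tag_allowed("staging", "staging-1.4.2"))      # True
print(tag_allowed("production", "develop-1.4.2"))   # False
```

A check like this only validates the naming convention. It does not prove that a production-* tag came from an explicit promotion rather than a rebuild.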

Chapter 05: Network Policies (Production Isolation)

  • default deny
  • explicit allow rules
  • DNS and ingress paths opened intentionally

The Incident Hook: A debug pod in develop reaches internal services it should never touch. No exploit sophistication is needed, only open …

Chapter 06: Security Context & Pod Hardening

  • non-root execution
  • read-only root filesystem where possible
  • dropped Linux capabilities
  • runtime-default seccomp

The Incident Hook: A container compromise lands shell access inside a pod. If the pod runs with broad …
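
As a rough illustration of the four hardening settings listed in this entry, the sketch below inspects a container spec that has already been parsed into a Python dict. The function name `hardening_gaps` is hypothetical. Two further assumptions: "dropped Linux capabilities" is read as dropping `ALL`, and only the container-level `securityContext` is checked, so settings inherited from the pod level would be reported as gaps.

```python
# Minimal sketch (hypothetical helper): report which of the four hardening
# settings are missing from a container's securityContext.
def hardening_gaps(container: dict) -> list[str]:
    sc = container.get("securityContext") or {}
    gaps = []
    if sc.get("runAsNonRoot") is not True:
        gaps.append("runAsNonRoot is not true")
    if sc.get("readOnlyRootFilesystem") is not True:
        gaps.append("readOnlyRootFilesystem is not true")
    # Assumption: "dropped capabilities" means the drop list contains ALL.
    if "ALL" not in ((sc.get("capabilities") or {}).get("drop") or []):
        gaps.append("capabilities.drop does not include ALL")
    if (sc.get("seccompProfile") or {}).get("type") != "RuntimeDefault":
        gaps.append("seccompProfile.type is not RuntimeDefault")
    return gaps


hardened = {
    "name": "backend",
    "securityContext": {
        "runAsNonRoot": True,
        "readOnlyRootFilesystem": True,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    },
}
print(hardening_gaps(hardened))             # []
print(hardening_gaps({"name": "debug"}))    # lists all four gaps
```

Because the summary says "read-only root filesystem where possible", a real check would probably need an allowlist for workloads that legitimately write to the root filesystem.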

Chapter 07: Resource Management & QoS

  • requests/limits per container
  • namespace quotas
  • predictable QoS behavior under pressure

Guardrails: Every workload must define CPU/memory requests and limits. Namespaces must enforce LimitRange and ResourceQuota. OOM and …
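
The first guardrail in this entry (every workload defines CPU/memory requests and limits) could be verified along these lines. This is a sketch under assumptions: the function name `missing_resources` is hypothetical, and the input is a list of container specs already parsed into dicts, not a live cluster query.

```python
# Minimal sketch (hypothetical helper): find containers that lack CPU or
# memory entries under resources.requests or resources.limits.
REQUIRED = ("cpu", "memory")


def missing_resources(containers: list[dict]) -> dict[str, list[str]]:
    problems = {}
    for container in containers:
        resources = container.get("resources") or {}
        missing = [
            f"{section}.{name}"
            for section in ("requests", "limits")
            for name in REQUIRED
            if name not in (resources.get(section) or {})
        ]
        if missing:
            problems[container.get("name", "<unnamed>")] = missing
    return problems


containers = [
    {
        "name": "backend",
        "resources": {
            "requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "500m", "memory": "256Mi"},
        },
    },
    {"name": "sidecar", "resources": {"requests": {"cpu": "50m"}}},
]
print(missing_resources(containers))
# {'sidecar': ['requests.memory', 'limits.cpu', 'limits.memory']}
```

The resource values shown are illustrative. Namespace-level LimitRange and ResourceQuota enforcement is a separate concern that this sketch does not cover.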

Chapter 08: Availability Engineering (HPA + PDB)

  • HPA for load-based scaling
  • PDB for controlled voluntary disruptions
  • rollout/drain awareness

Guardrails:

  • staging/production start from 2 replicas for critical services.
  • each service has HPA bounds (minReplicas, …

Chapter 09: Observability (Metrics, Logs, Traces)

  • metrics for symptom detection
  • traces for path analysis
  • logs for evidence

Scope Decision (MVP): No in-cluster OpenTelemetry Collector in this phase. Frontend and backend export telemetry directly to Uptrace. Target …

Chapter 10: Backup & Restore Basics

Data Plane Choice: CloudNativePG setup in this repo:

  • operator: flux/infrastructure/data/cnpg-operator
  • clusters: flux/infrastructure/data/cnpg-clusters (develop, staging, production)
  • each environment has dedicated Cluster …

Chapter 11: Controlled Chaos

Scope: Failure classes in this chapter:

  • crash loop (/panic)
  • elevated 5xx (/status/500)
  • random pod termination (Chaos Monkey)

Implementation focus:

  • deterministic drills first
  • Chaos Monkey in develop with kill switch and …

Chapter 12: AI-Assisted SRE Guardian

Implementation Scope: This chapter uses a standalone guardian service pattern integrated with the platform:

  • Kubernetes event handlers for warnings and Flux conditions
  • scanner loops for pods, PVCs, certificates, and …

Chapter 13: 24/7 Production SRE

Scope:

  • on-call operating model
  • incident lifecycle and severity policy
  • recurring-problem management
  • blameless postmortem workflow
  • AI boundary policy in production

Core Principles:

  • Evidence first: metrics + traces + logs …

Chapter 14: Supply Chain Security (Advanced)

The supply-chain baseline in this course is:

  • immutable artifact identity (digest or immutable tag)
  • SBOM generation
  • image signing and attestation
  • cluster-side verification before admission

Learning Objectives: By the end …

Chapter 15: Admission Policy Guardrails (Advanced)

This chapter focuses on policy-as-code guardrails that block risky workloads even when upstream checks fail.

Learning Objectives: By the end of this chapter, learners can:

  • explain why cluster-side policy is mandatory in …

Chapter 16: Rollback and Data Migrations (Advanced)

This chapter defines a safe migration discipline:

  • backward-compatible schema first
  • application rollout second
  • destructive schema changes last
  • explicit rollback windows and feature flag gates

Learning Objectives: By the …

Advanced Module: Linkerd + Progressive Delivery (Canary / A-B)

  • Linkerd mTLS by default
  • canary rollout with measurable abort criteria
  • A/B routing with explicit experiment boundaries

The Incident Hook: A full rollout passes smoke checks but fails under real production traffic mix. …

Intro: AI as a Very Well-Read Junior Engineer

It is about using AI in DevOps / SysOps / SRE without increasing risk or blast radius.

The Mental Model: AI is the most well-read junior engineer you will ever work with:

  • Knows tooling, flags, YAML, Terraform, Helm.
  • Works …

Production-Grade Kubernetes with Guardrails & AI-Assisted SRE

  • build and operate a production-grade Kubernetes platform
  • promote versions safely across environments
  • enforce security and isolation guardrails
  • manage resource behavior under pressure
  • implement backup/restore practices …