Guardrails-First Course Materials

This course teaches production-grade Kubernetes and SRE practice through incidents, guardrails, and repeatable workflows.

The goal is not to memorize tools. The goal is to learn how to keep systems safe when pressure, ambiguity, and AI-assisted speed all show up at the same time.

Who This Is For

  • platform engineers moving from “it works” to “it survives mistakes”
  • DevOps engineers who want stronger operating discipline, not more tooling hype
  • SREs who want concrete labs, guardrails, and incident-shaped lessons

How the Course Works

Each chapter is built around one production failure pattern:

  • what broke
  • why the shortcut looked reasonable
  • how the investigation becomes confusing
  • which guardrail restores a safe operating path

Every core lesson includes:

  • a written incident walkthrough
  • a hands-on lab
  • a quiz to confirm the operating rule
  • runbooks or scorecards where the topic needs them

The course teaches more than how to operate Kubernetes around applications. It also shows what a production-ready Kubernetes application should look like, so that rollout safety, observability, GitOps reconciliation, and incident response work correctly in the first place.

The course uses the SafeOps reference applications as concrete examples:

  • ldbl/backend, a small production-shaped Go API with health probes, metrics, tracing hooks, chaos endpoints, and OpenAPI/Swagger support
  • ldbl/frontend, a Vue-based frontend with container hardening, runtime config injection, and Kubernetes deployment packaging

Many of the application patterns used throughout those reference apps are inspired by Podinfo by Stefan Prodan, including:

  • readiness and liveness probes
  • graceful shutdown on interrupt signals
  • config and secret reload patterns
  • Prometheus and OpenTelemetry instrumentation
  • structured logging
  • 12-factor configuration
  • fault injection for safe drills
  • packaging and install paths with Timoni, Helm, and Kustomize
  • end-to-end validation with Kind and Helm
  • multi-arch images, signing, SBOMs, provenance, and CVE scanning
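
As one illustration of the probe pattern in the list above, here is a minimal Go sketch of separate liveness and readiness endpoints. It is not taken from ldbl/backend or Podinfo: the `/healthz` and `/readyz` paths, the `draining` flag, and the helper names are assumptions made for this example. The idea is that readiness starts failing as soon as shutdown begins, while liveness keeps passing so the pod is drained rather than restarted.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// draining is a hypothetical flag that a shutdown handler would set
// when the process receives an interrupt signal.
var draining atomic.Bool

// probeStatus maps the draining state to the status a readiness probe sees.
func probeStatus(isDraining bool) int {
	if isDraining {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// newMux wires the two probe endpoints (paths are assumptions for this sketch).
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: the process is up and can answer HTTP at all.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: the process is willing to receive new traffic right now.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(probeStatus(draining.Load()))
	})
	return mux
}

// probe issues an in-memory GET against the handler and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newMux()
	fmt.Println(probe(mux, "/readyz")) // 200 while serving normally
	draining.Store(true)                // shutdown has started
	fmt.Println(probe(mux, "/readyz"))  // 503 so traffic is routed away
	fmt.Println(probe(mux, "/healthz")) // 200 so the pod is not restarted
}
```

A real service would set the flag from a signal handler and then call `http.Server.Shutdown` with a timeout. The sketch leaves that out so the program exits on its own.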

Video assets are optional. The written lesson remains the primary source of truth, and the video should make the same lesson easier to absorb, not replace the material.

Recommended path:

  1. Start with Intro: AI as a Very Well-Read Junior Engineer.
  2. Go through Chapters 01-14 in order.
  3. Run the lab before moving to the next chapter.
  4. Use the quiz to confirm the main guardrail rule before continuing.
  5. Move to the advanced modules only after the core path feels operationally natural.

Tracks

Core track:

  • Chapters 01-14 covering platform foundations, GitOps, CI/CD, security, observability, reliability, and on-call discipline

Advanced track:

  • Chapter 15: Supply Chain Security
  • Chapter 16: Admission Policy Guardrails
  • Chapter 17: Rollback and Data Migrations
  • Module: Progressive Delivery (Canary with Traefik + Flagger)

Reference appendices:

  • Appendix: Local Development Environment
  • Appendix: DNS and TLS Automation

References

Chapter 01: AI Changes Two Things at Once

a backend image tag bump for develop; an ingress manifest change intended for staging. The pull request looks harmless because each diff is small. The incident begins because the change boundary is not. Routing breaks …
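
The change-boundary idea in this teaser can be sketched as a pre-merge check that counts how many environments a pull request touches. This is only an illustration: the `clusters/<env>/...` directory layout, the file names, and the function name are hypothetical and do not come from the course repository.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// environmentsTouched returns the distinct environments a change set touches,
// assuming a hypothetical repository layout of "clusters/<env>/...".
func environmentsTouched(paths []string) []string {
	seen := map[string]bool{}
	for _, p := range paths {
		parts := strings.Split(p, "/")
		if len(parts) >= 3 && parts[0] == "clusters" {
			seen[parts[1]] = true
		}
	}
	envs := make([]string, 0, len(seen))
	for e := range seen {
		envs = append(envs, e)
	}
	sort.Strings(envs) // stable order for messages and tests
	return envs
}

func main() {
	// Hypothetical changed files: each diff is small, but the boundary is wide.
	changed := []string{
		"clusters/develop/backend/kustomization.yaml",
		"clusters/staging/ingress/ingress.yaml",
	}
	if envs := environmentsTouched(changed); len(envs) > 1 {
		fmt.Printf("change spans %d environments %v: split the pull request\n", len(envs), envs)
	}
}
```

A check like this says nothing about whether each diff is correct. It only flags that unrelated changes are riding in the same review.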

Chapter 02: Infrastructure as Code (IaC)

Observed Symptoms. What the team sees first: one apply job holds the lock while another waits or retries; the later apply changes resources nobody expected to touch; a fresh plan no longer matches the reviewed plan artifact …

Chapter 03: Secrets Management (SOPS)

Observed Symptoms. What the team sees first: the deploy is unblocked, but the secret is visible in the diff; the same value may now exist in PR tooling, CI logs, and local clones; responders cannot tell immediately how many …

Chapter 04: GitOps & Version Promotion

Observed Symptoms. What the team sees first: production is running a digest different from the one validated in staging; the Git history sounds correct, but the artifact identity does not match; rollback discussion turns …
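
The first symptom above, a production digest that differs from the one validated in staging, can be checked mechanically. The sketch below compares the digest part of two image references. The registry host, tags, and shortened digests are placeholders, and for simplicity the helper accepts only `sha256:` digests.

```go
package main

import (
	"fmt"
	"strings"
)

// digestOf extracts the "sha256:..." part of an image reference such as
// "repo:tag@sha256:...". It reports false when the reference is not pinned.
func digestOf(ref string) (string, bool) {
	_, digest, found := strings.Cut(ref, "@")
	if !found || !strings.HasPrefix(digest, "sha256:") {
		return "", false
	}
	return digest, true
}

// sameArtifact is true only when both references are pinned to the same digest.
// Tags are ignored on purpose: a tag can move, a digest cannot.
func sameArtifact(a, b string) bool {
	da, okA := digestOf(a)
	db, okB := digestOf(b)
	return okA && okB && da == db
}

func main() {
	// Placeholder references; real digests are 64 hex characters.
	staging := "registry.example.com/ldbl/backend:1.4.2@sha256:aaaa"
	prod := "registry.example.com/ldbl/backend:1.4.2@sha256:bbbb"
	fmt.Println(sameArtifact(staging, prod))    // false: same tag, different artifact
	fmt.Println(sameArtifact(staging, staging)) // true
}
```

Treating an unpinned reference as a mismatch is a deliberate choice in this sketch: if either side has no digest, the promotion cannot be verified.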

Chapter 05: CI/CD & Developer Guardrails

Observed Symptoms. What the team sees first: there is no normal PR discussion for the change; no approved plan artifact exists for the infrastructure mutation; responders must reconstruct intent after the change already …

Chapter 06: Network Policies (Production Isolation)

Observed Symptoms. What the team sees first: a pod in develop can connect to services outside its intended boundary; responders cannot answer quickly what is reachable and what is not; containment feels manual because the …

Chapter 07: Security Context & Pod Hardening

Observed Symptoms. What the team sees first: a shell exists inside a compromised container; the pod may be able to write broadly or escalate privileges; responders need to know whether the workload is hardened or soft by …

Chapter 08: Resource Management & QoS

Observed Symptoms. What the team sees first: repeated OOMKilled events on one workload; evictions or instability in neighboring workloads; pressure appears at node level, not only inside one pod. That is why unbounded …
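
Since this chapter is about QoS, a rough Go sketch of how Kubernetes assigns a pod's QoS class from container requests and limits may help. It is a simplification: the `Resources` type is hypothetical, quantities are compared as plain strings (so `1` and `1000m` would not match), and init containers and pod-level details are ignored.

```go
package main

import "fmt"

// Resources holds one container's CPU and memory settings as raw strings.
// An empty string means the field is not set. Hypothetical type for this sketch.
type Resources struct {
	CPURequest, CPULimit string
	MemRequest, MemLimit string
}

// matches reports whether a resource counts toward Guaranteed: the limit is
// set and the request is either unset (it defaults to the limit) or equal.
func matches(request, limit string) bool {
	return limit != "" && (request == "" || request == limit)
}

// qosClass approximates the Kubernetes QoS rules for a pod's containers.
func qosClass(containers []Resources) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		if c.CPURequest != "" || c.CPULimit != "" || c.MemRequest != "" || c.MemLimit != "" {
			anySet = true
		}
		if !matches(c.CPURequest, c.CPULimit) || !matches(c.MemRequest, c.MemLimit) {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort" // nothing set: first candidate for eviction
	case guaranteed:
		return "Guaranteed" // every container fully bounded
	default:
		return "Burstable"
	}
}

func main() {
	fmt.Println(qosClass([]Resources{{}}))                               // BestEffort
	fmt.Println(qosClass([]Resources{{"100m", "100m", "128Mi", "128Mi"}})) // Guaranteed
	fmt.Println(qosClass([]Resources{{"100m", "", "128Mi", "256Mi"}}))     // Burstable
}
```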

Chapter 09: Availability Engineering (HPA + PDB)

Observed Symptoms. What the team sees first: the node drain does not proceed cleanly; one critical service has no spare healthy capacity; operators must choose between breaking guardrails or accepting downtime. The incident …
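
The stalled drain described above comes down to simple arithmetic. For a PodDisruptionBudget expressed with `minAvailable`, voluntary evictions are allowed only while the number of healthy pods exceeds that floor. The sketch below shows the calculation. The function name and sample numbers are illustrative, and it ignores percentage values and `maxUnavailable`.

```go
package main

import "fmt"

// disruptionsAllowed returns how many pods may be voluntarily evicted right
// now under a minAvailable budget. It never goes below zero.
func disruptionsAllowed(currentHealthy, minAvailable int) int {
	if allowed := currentHealthy - minAvailable; allowed > 0 {
		return allowed
	}
	return 0
}

func main() {
	// Two healthy replicas with minAvailable=2: no spare capacity, drain blocks.
	fmt.Println(disruptionsAllowed(2, 2)) // 0
	// A third healthy replica gives the drain one eviction to work with.
	fmt.Println(disruptionsAllowed(3, 2)) // 1
	// An unhealthy replica makes things worse, not negative.
	fmt.Println(disruptionsAllowed(1, 2)) // 0
}
```

When the result is zero, the safe fix is to add healthy capacity, not to loosen the budget in the middle of the drain.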

Chapter 10: Observability (Metrics, Logs, Traces)

Observed Symptoms. What the team sees first: metrics clearly show a user-facing problem; logs contain noise but not a clean causal path; responders are tempted to restart pods before they understand the failing route. The …

Chapter 11: Backup & Restore Basics

Observed Symptoms. What the team sees first: the backup job is green; restore artifacts exist; the restored service still cannot function correctly. That mismatch is the lesson. Backup presence is not the same thing as …

Chapter 12: Controlled Chaos

Observed Symptoms. What the team sees first: a failure mode appears with no practiced response path; responders collect evidence late and inconsistently; uncertainty, not only outage duration, expands the blast radius. The …

Chapter 13: AI-Assisted SRE Guardian

Observed Symptoms. What the team sees first: many alerts are technically true but operationally fragmented; responders cannot tell whether they are seeing one incident or many; signal volume starts competing with actual …

Chapter 14: 24/7 Production SRE

Observed Symptoms. What the team sees first: responders join, but ownership is unclear; communication cadence is inconsistent; multiple actions start before a shared evidence picture exists. The incident is already harder …

Chapter 15: Supply Chain Security (Advanced)

Observed Symptoms. What the team sees first: the workload is running, but artifact provenance is unclear; tags look familiar while signer and SBOM evidence do not line up; responders must investigate trust before they can …

Chapter 16: Admission Policy Guardrails (Advanced)

Observed Symptoms. What the team sees first: risky workload settings reach the cluster boundary; upstream checks were skipped or insufficient; operators feel pressure to disable policy instead of fixing the manifest. The …

Chapter 17: Rollback and Data Migrations (Advanced)

Observed Symptoms. What the team sees first: the new release fails, but the old app version also cannot recover cleanly; image rollback appears to work while data compatibility does not; responders discover too late that …

Advanced Module: Progressive Delivery (Canary with Traefik + Flagger)

Observed Symptoms. What the team sees first: smoke checks pass, but live user traffic behaves differently; bad signals appear only after the rollout reaches meaningful load; rollback starts late because failure detection …

Appendix: DNS and TLS Automation

Why This Appendix Exists. The main course keeps early chapters focused on platform safety and GitOps. This appendix explains the edge automation layer used by the SafeOps platform: external-dns manages DNS records from …

Appendix: Local Development Environment

Why This Appendix Exists. The main course teaches the production path first. This appendix shows the fastest safe feedback loop for local experimentation: a Terraform-managed kind cluster; generated kubeconfig and context …

Intro: AI as a Very Well-Read Junior Engineer

It is about using AI in DevOps / SysOps / SRE without increasing risk or blast radius. The Mental Model. AI is the most well-read junior engineer you will ever work with: Knows tooling, flags, YAML, Terraform, Helm. Works …

Production-Grade Kubernetes with Guardrails & AI-Assisted SRE

Core Track (14 Chapters). AI Changes Two Things at Once (Beginner · ~2h): correlated blast radius from bundling unrelated changes; AI as a brave junior: fast, useful, but unsafe without guardrails; context checks, …