Chapter 01: AI Changes Two Things at Once

Incident Hook

A fast “AI-assisted” hotfix bundles two unrelated changes in one push:

a backend image tag bump for develop
an ingress manifest change intended for staging

The pull request looks harmless because each diff is small. The incident begins because the change boundary is not. Routing breaks while backend behavior changes at the same time, and the team loses a clean rollback path before the investigation even starts.

Observed Symptoms

What the team sees first:

frontend requests start returning 502 Bad Gateway
a backend rollout is still in progress in develop
the pull request contains both an image change and an ingress edit

At this point the system does not tell you which change is guilty. It only tells you that two unrelated layers are now noisy at the same time.

Confusion Phase

The incident now has two plausible stories:

the new backend image introduced a real regression
the ingress change sent traffic to the wrong place

That ambiguity is the real failure pattern. Rollback is no longer obvious because the team has to investigate both paths before it can safely change the environment again.

What AI Would Propose (Brave Junior)

“Update image and ingress together to save one pipeline run.”
“Apply quickly to unblock the demo.”
“Skip context checks; it is just develop.”

Why it sounds reasonable:

fewer PRs
faster merge
faster “visible progress”

Why This Is Dangerous

Missing context: target cluster/namespace is often assumed, not verified.
Hidden coupling: app rollout + ingress mutation creates correlated failure modes.
Production risk pattern: the same habit, applied to production, turns a small mixed change into a high-blast-radius incident.

Investigation

The first job is not to guess. It is to separate routing evidence from application evidence.

Safe investigation sequence:

inspect the ingress in develop
verify the host and backend target match the intended environment
check backend pod health and logs directly
decide whether the outage is routing-only, app-only, or genuinely mixed

In this incident, the ingress host is the strongest signal. It was changed for the wrong environment, and that explains the edge failure faster than backend rollout noise does.
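The sequence above can be sketched as a few small shell helpers. This is a minimal sketch under stated assumptions, not part of the course scripts: the function names are illustrative, the resource names and hosts passed to them are hypothetical, and the classification rule (host match for routing, ready replicas equal to desired replicas for the application) is a deliberate simplification.

```shell
#!/usr/bin/env bash
# Sketch only: function names, resource names, and hosts are hypothetical.

# Print the first host rule configured on an ingress.
ingress_host() {
  local ns="$1" name="$2"
  kubectl -n "${ns}" get ingress "${name}" \
    -o jsonpath='{.spec.rules[0].host}'
}

# Print "<ready>/<desired>" replicas for a deployment.
# readyReplicas is omitted when it is zero, so "/2" means no pod is ready.
deployment_ready() {
  local ns="$1" name="$2"
  kubectl -n "${ns}" get deployment "${name}" \
    -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
}

# Classify the outage from routing evidence and application evidence.
classify_outage() {
  local actual_host="$1" expected_host="$2" ready="$3"
  local have="${ready%%/*}" want="${ready##*/}"
  local routing_ok=yes app_ok=yes

  [ "${actual_host}" = "${expected_host}" ] || routing_ok=no
  if [ -z "${have}" ] || [ "${have}" != "${want}" ]; then
    app_ok=no
  fi

  case "${routing_ok}/${app_ok}" in
    no/no)  echo "mixed" ;;
    no/yes) echo "routing-only" ;;
    yes/no) echo "app-only" ;;
    *)      echo "no-fault-found" ;;
  esac
}
```

With a hypothetical ingress named frontend and deployment named backend, the call could look like this:

classify_outage "$(ingress_host develop frontend)" develop.example.com "$(deployment_ready develop backend)"

Pod logs still need a human read; the helper only narrows which layer to look at first.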

Containment

Containment is narrow on purpose:

revert the ingress change only
let the GitOps path reconcile it back to the correct host
confirm routing is healthy again
evaluate the backend image separately after traffic is stable

The goal is to restore one clean rollback path. Do not “fix everything at once” during the incident.
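One way to keep the revert narrow is to confirm that the commit being reverted touches only ingress manifests before reverting it. The following is a minimal sketch, assuming the ingress manifests live under a hypothetical flux/ingress/ prefix; the function name and the path are illustrative and are not taken from the course repository.

```shell
#!/usr/bin/env bash
# Sketch only: the helper name and the path prefix are hypothetical.

# Revert a commit only if every file it touches sits under the given prefix.
revert_if_scoped() {
  local sha="$1" prefix="$2" files

  files="$(git diff-tree --no-commit-id --name-only -r "${sha}")" || return 1
  if [ -z "${files}" ]; then
    echo "[contain] no files found for commit ${sha}" >&2
    return 1
  fi

  # The loop runs in a subshell, so its exit status carries the verdict.
  printf '%s\n' "${files}" | while IFS= read -r path; do
    case "${path}" in
      "${prefix}"*) ;;
      *)
        echo "[contain] refusing revert: ${path} is outside ${prefix}" >&2
        exit 1
        ;;
    esac
  done || return 1

  git revert --no-edit "${sha}"
}
```

If the commit also touches the backend image, the helper refuses, which is the signal to split the change by hand instead of reverting both layers together. After the revert is pushed, Flux reconciles on its own schedule; a reconcile can usually be requested sooner with flux reconcile kustomization <name> --with-source.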

Guardrails That Stop It

Context guard before any Kubernetes write:
- scripts/guard-kube-context.sh --context <ctx> --namespace <ns>
Plan-before-apply guard for Terraform:
- scripts/guard-terraform-plan.sh plan ...
- scripts/guard-terraform-plan.sh apply ...
Single-change policy:
- one PR for image/promotion
- separate PR for networking/ingress
Git pre-hooks for repository hygiene:
- scripts/pre-commit-master-check.sh blocks direct work against protected branches
- scripts/prevent-amend-after-push.sh blocks amending already-pushed commits
- scripts/flux-kustomize-validate.sh blocks broken Flux Kustomize renders before commit
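The single-change policy can also be checked mechanically before review. The sketch below is an illustration, not one of the course scripts: it classifies changed paths by filename patterns that are assumptions about the repository layout, ignores paths it does not recognize, and fails when more than one change type appears in the same change set.

```shell
#!/usr/bin/env bash
# Sketch only: the filename patterns below are assumed, not the course layout.

# Read changed paths on stdin; print the distinct change types found.
change_types() {
  local path
  while IFS= read -r path; do
    case "${path}" in
      *ingress*|*networkpolicy*)    echo "networking" ;;
      *deployment*|*kustomization*) echo "workload" ;;
      *.tf)                         echo "terraform" ;;
      *) ;;  # unrecognized paths (docs, tests) do not count as a change type
    esac
  done | sort -u
}

# Fail when a change set mixes more than one change type.
assert_single_change() {
  local types count
  types="$(change_types)"
  count="$(printf '%s\n' "${types}" | grep -c . || true)"
  if [ "${count}" -gt 1 ]; then
    echo "[guard-pr] mixed change types: $(printf '%s' "${types}" | tr '\n' ' ')" >&2
    return 1
  fi
  echo "[guard-pr] OK single change type: ${types:-none}"
}
```

A possible invocation, assuming the base branch is origin/main:

git diff --name-only origin/main...HEAD | assert_single_change

Pattern matching on filenames is crude. It is meant to prompt a reviewer to ask why two layers moved together, not to replace the review.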

Investigation Snapshots

Here is the Kubernetes context guard used in the SafeOps system to stop writes to the wrong cluster or namespace before the change even starts.

Kubernetes context guard

#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
usage: scripts/guard-kube-context.sh --context <name> --namespace <name> [--kubeconfig <path>]

Verifies kubectl is pointing to the expected cluster context and that the namespace exists in it.
Fails fast with actionable output if context/namespace checks do not pass.

Examples:
  scripts/guard-kube-context.sh --context sre-control-plane --namespace develop
  scripts/guard-kube-context.sh --context sre-control-plane --namespace production --kubeconfig ./kubeconfig.yaml
EOF
}

EXPECTED_CONTEXT=""
EXPECTED_NAMESPACE=""
KUBECONFIG_PATH=""

while [[ $# -gt 0 ]]; do
  case "$1" in
    --context)
      EXPECTED_CONTEXT="${2:-}"
      shift 2
      ;;
    --namespace)
      EXPECTED_NAMESPACE="${2:-}"
      shift 2
      ;;
    --kubeconfig)
      KUBECONFIG_PATH="${2:-}"
      shift 2
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "[guard-kube] unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

if [[ -z "${EXPECTED_CONTEXT}" || -z "${EXPECTED_NAMESPACE}" ]]; then
  echo "[guard-kube] --context and --namespace are required" >&2
  usage >&2
  exit 2
fi

if ! command -v kubectl >/dev/null 2>&1; then
  echo "[guard-kube] kubectl not found in PATH" >&2
  exit 1
fi

if [[ -n "${KUBECONFIG_PATH}" ]]; then
  export KUBECONFIG="${KUBECONFIG_PATH}"
fi

CURRENT_CONTEXT="$(kubectl config current-context 2>/dev/null || true)"
if [[ -z "${CURRENT_CONTEXT}" ]]; then
  echo "[guard-kube] no current kubectl context is set" >&2
  exit 1
fi

if [[ "${CURRENT_CONTEXT}" != "${EXPECTED_CONTEXT}" ]]; then
  echo "[guard-kube] context mismatch" >&2
  echo "  expected: ${EXPECTED_CONTEXT}" >&2
  echo "  actual:   ${CURRENT_CONTEXT}" >&2
  exit 1
fi

if ! kubectl get namespace "${EXPECTED_NAMESPACE}" >/dev/null 2>&1; then
  echo "[guard-kube] namespace '${EXPECTED_NAMESPACE}' not found in context '${CURRENT_CONTEXT}'" >&2
  exit 1
fi

echo "[guard-kube] OK context=${CURRENT_CONTEXT} namespace=${EXPECTED_NAMESPACE}"

Here is the Terraform guard used in the same system to force a reviewed plan artifact before any apply.

Plan-before-apply guard


#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
usage:
  scripts/guard-terraform-plan.sh plan  --dir <path> [--out <planfile>]
  scripts/guard-terraform-plan.sh apply --dir <path> [--out <planfile>] [--max-age-minutes <n>]

Guardrail wrapper for Terraform plan/apply.
- `plan` creates a planfile and metadata marker.
- `apply` refuses to run unless a fresh planfile + metadata marker exist.

Examples:
  scripts/guard-terraform-plan.sh plan --dir infra/terraform/hcloud_cluster --out tfplan
  scripts/guard-terraform-plan.sh apply --dir infra/terraform/hcloud_cluster --out tfplan --max-age-minutes 60
EOF
}

if [[ $# -lt 1 ]]; then
  usage >&2
  exit 2
fi

if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
  usage
  exit 0
fi

MODE="$1"
shift

WORKDIR=""
PLAN_FILE="tfplan"
MAX_AGE_MINUTES="120"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --dir)
      WORKDIR="${2:-}"
      shift 2
      ;;
    --out)
      PLAN_FILE="${2:-}"
      shift 2
      ;;
    --max-age-minutes)
      MAX_AGE_MINUTES="${2:-}"
      shift 2
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "[guard-tf] unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

if [[ -z "${WORKDIR}" ]]; then
  echo "[guard-tf] --dir is required" >&2
  usage >&2
  exit 2
fi

if ! command -v terraform >/dev/null 2>&1; then
  echo "[guard-tf] terraform not found in PATH" >&2
  exit 1
fi

if ! [[ -d "${WORKDIR}" ]]; then
  echo "[guard-tf] directory not found: ${WORKDIR}" >&2
  exit 1
fi

PLAN_PATH="${WORKDIR}/${PLAN_FILE}"
META_PATH="${PLAN_PATH}.meta"

case "${MODE}" in
  plan)
    terraform -chdir="${WORKDIR}" init -input=false
    terraform -chdir="${WORKDIR}" plan -input=false -lock-timeout=5m -out "${PLAN_FILE}"
    {
      echo "created_at_epoch=$(date +%s)"
      echo "workdir=${WORKDIR}"
      echo "plan_file=${PLAN_FILE}"
    } > "${META_PATH}"
    echo "[guard-tf] plan created: ${PLAN_PATH}"
    echo "[guard-tf] metadata created: ${META_PATH}"
    ;;
  apply)
    if [[ ! -f "${PLAN_PATH}" ]]; then
      echo "[guard-tf] missing plan file: ${PLAN_PATH}" >&2
      echo "[guard-tf] run: scripts/guard-terraform-plan.sh plan --dir ${WORKDIR} --out ${PLAN_FILE}" >&2
      exit 1
    fi
    if [[ ! -f "${META_PATH}" ]]; then
      echo "[guard-tf] missing plan metadata: ${META_PATH}" >&2
      echo "[guard-tf] refusing apply without plan marker" >&2
      exit 1
    fi

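    # Read only the timestamp from the marker instead of sourcing the file,
    # so a tampered or corrupt marker cannot run arbitrary shell code.
    created_at_epoch="$(awk -F= '$1 == "created_at_epoch" { print $2; exit }' "${META_PATH}")"
    if [[ ! "${created_at_epoch}" =~ ^[0-9]+$ ]]; then
      echo "[guard-tf] invalid plan metadata: ${META_PATH}" >&2
      exit 1
    fi
    NOW_EPOCH="$(date +%s)"
    AGE_SECONDS="$((NOW_EPOCH - created_at_epoch))"
    AGE_MINUTES="$((AGE_SECONDS / 60))"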

    if (( AGE_MINUTES > MAX_AGE_MINUTES )); then
      echo "[guard-tf] plan is too old (${AGE_MINUTES}m > ${MAX_AGE_MINUTES}m)" >&2
      echo "[guard-tf] re-run plan before apply" >&2
      exit 1
    fi

    terraform -chdir="${WORKDIR}" apply -input=false "${PLAN_FILE}"
    echo "[guard-tf] apply completed using ${PLAN_PATH}"
    ;;
  *)
    echo "[guard-tf] unknown mode: ${MODE}" >&2
    usage >&2
    exit 2
    ;;
esac

System Context

This chapter establishes the operating rule for the rest of the course: keep change boundaries narrow enough that investigation and rollback stay obvious.

It also introduces another course-wide assumption: the platform is only half of the story, and the application itself must be built with Kubernetes-native operational contracts such as probes, graceful shutdown, structured telemetry, safe packaging, and signed delivery artifacts.

Those contracts are demonstrated through the course reference applications, especially ldbl/backend and ldbl/frontend, with several implementation patterns borrowed from podinfo.

The same discipline appears again in later chapters:

Chapter 02 keeps Terraform execution inside one reviewed plan path
Chapter 04 separates promotion from rebuild and ad-hoc environment edits
Chapter 05 layers workstation, CI, review, and cluster guardrails around the same idea

If learners do not internalize this rule here, the later guardrails will feel procedural instead of necessary.

Local Git Guardrails (Pre-Hooks)

Install and verify local hooks before running labs:

make install-hooks
pre-commit run --all-files

These hooks enforce branch and history discipline before CI starts, so risky workflow mistakes are caught early on the workstation. For GitOps manifest changes under flux/**, they also enforce local Kustomize render validity.
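To make the idea concrete, here is a minimal sketch of what a protected-branch hook can look like. It is an illustration only and does not reproduce scripts/pre-commit-master-check.sh; the function names and the protected branch names main and master are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: not the course hook; protected branch names are assumed.

# Return 0 when the given branch name is protected.
is_protected_branch() {
  case "$1" in
    main|master) return 0 ;;
    *) return 1 ;;
  esac
}

# Block a commit when HEAD points at a protected branch.
check_current_branch() {
  local branch
  # symbolic-ref fails on a detached HEAD; treat that as "not protected".
  branch="$(git symbolic-ref --short -q HEAD || true)"
  if [ -n "${branch}" ] && is_protected_branch "${branch}"; then
    echo "[hook] direct commits to '${branch}' are blocked; use a feature branch" >&2
    return 1
  fi
  return 0
}
```

A hook script would call check_current_branch and exit with its status, so the commit is rejected before it exists rather than cleaned up afterwards.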

Safe Workflow (Step-by-Step)

Verify context and namespace.
Produce plan/diff first (Terraform or GitOps diff).
Review for correlated changes before merge or apply.
Apply one change type at a time.
Verify health and routing separately.
Keep rollback commands prepared before merge/apply.
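For the diff step on the Kubernetes side, kubectl diff reports its result through the exit code: 0 means no differences, 1 means differences were found, and anything greater means the command itself failed. The sketch below wraps that convention so a found diff is not mistaken for an error. The function name and the directory path in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the helper name and any directory passed to it are hypothetical.

# Diff a Kustomize directory against the live cluster and report the outcome.
preview_changes() {
  local dir="$1" status=0
  kubectl diff -k "${dir}" || status=$?
  case "${status}" in
    0) echo "[preview] no changes" ;;
    1) echo "[preview] changes found; review before merge" ;;
    *)
      echo "[preview] kubectl diff failed (exit ${status})" >&2
      return "${status}"
      ;;
  esac
}
```

Run the context guard first, then the preview, for example:

scripts/guard-kube-context.sh --context sre-control-plane --namespace develop && preview_changes flux/apps/develop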

Demo Commands

A. Kubernetes context/namespace guard

# Expected success example
scripts/guard-kube-context.sh \
  --context sre-control-plane \
  --namespace develop

Expected output:

[guard-kube] OK context=sre-control-plane namespace=develop

Failure example (wrong namespace):

scripts/guard-kube-context.sh \
  --context sre-control-plane \
  --namespace does-not-exist

Expected output:

[guard-kube] namespace 'does-not-exist' not found in context 'sre-control-plane'

B. Terraform plan-before-apply guard

# Create plan + metadata marker
scripts/guard-terraform-plan.sh plan \
  --dir infra/terraform/hcloud_cluster \
  --out tfplan

# Apply only from a fresh, reviewed planfile
scripts/guard-terraform-plan.sh apply \
  --dir infra/terraform/hcloud_cluster \
  --out tfplan \
  --max-age-minutes 60

If the plan file or its metadata marker is missing, or the plan is older than the allowed age, apply is blocked with an explicit error.

Rollback Checklist

If Kubernetes deploy changed:
- kubectl -n <ns> rollout undo deployment/<name>
If ingress changed:
- revert ingress commit in Git and let Flux reconcile
If Terraform apply changed infra:
- create a new reviewed plan and apply rollback change
Verify:
- /healthz on backend
- ingress route with Host header
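The two verification checks can be scripted so that routing is confirmed through the ingress rather than assumed. This is a minimal sketch: the function names are illustrative, and the ingress address 203.0.113.10 and host develop.example.com used in the example are placeholders, not values from the course environment.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, addresses, and hosts are hypothetical.

# Print the HTTP status code returned when a given Host header is sent.
route_status() {
  local address="$1" host="$2" path="${3:-/healthz}"
  curl -sS -o /dev/null --max-time 10 \
    -w '%{http_code}' \
    -H "Host: ${host}" \
    "http://${address}${path}"
}

# Succeed only when the route answers with HTTP 200.
verify_route() {
  local code
  code="$(route_status "$@")" || code="000"
  if [ "${code}" != "200" ]; then
    echo "[verify] ${2}${3:-/healthz} returned ${code}" >&2
    return 1
  fi
  echo "[verify] OK ${2}${3:-/healthz} returned 200"
}
```

Example call against the ingress address:

verify_route 203.0.113.10 develop.example.com /healthz

Sending the Host header to the ingress address exercises the same routing rule that broke in this incident, which a direct pod health check would bypass.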

Exercises

Split a mixed PR into two PRs:
- PR1: image tag update only
- PR2: ingress update only
Intentionally run guard-terraform-plan.sh apply without a planfile and capture the failure output.

Done When

Student can explain why “small but mixed” changes are high risk.
Student can demonstrate both guard scripts before any apply action.
Student can separate investigation, containment, and rollback into distinct decisions.

Estimated Time

Prerequisites

Source Code References

What You Will Produce