Chapter 01: AI Changes Two Things at Once
Incident Hook
A fast “AI-assisted” hotfix bundles two unrelated changes in one push:
- a backend image tag bump for develop
- an ingress manifest change intended for staging
The pull request looks harmless because each diff is small. The incident begins because the change boundary is not. Routing breaks while backend behavior changes at the same time, and the team loses a clean rollback path before the investigation even starts.
Observed Symptoms
What the team sees first:
- frontend requests start returning 502 Bad Gateway
- a backend rollout is still in progress in develop
develop - the pull request contains both an image change and an ingress edit
At this point the system does not tell you which change is guilty. It only tells you that two unrelated layers are now noisy at the same time.
Confusion Phase
The incident now has two plausible stories:
- the new backend image introduced a real regression
- the ingress change sent traffic to the wrong place
That ambiguity is the real failure pattern. Rollback is no longer obvious because the team has to investigate both paths before touching production again.
What AI Would Propose (Brave Junior)
- “Update image and ingress together to save one pipeline run.”
- “Apply quickly to unblock the demo.”
- “Skip context checks; it is just develop.”
Why it sounds reasonable:
- fewer PRs
- faster merge
- faster “visible progress”
Why This Is Dangerous
- Missing context: target cluster/namespace is often assumed, not verified.
- Hidden coupling: app rollout + ingress mutation creates correlated failure modes.
- Production risk pattern: the same habit, repeated against production, turns a small diff into a high-blast-radius incident.
Investigation
The first job is not to guess. It is to separate routing evidence from application evidence.
Safe investigation sequence:
- inspect the ingress in develop
- verify the host and backend target match the intended environment
- check backend pod health and logs directly
- decide whether the outage is routing-only, app-only, or genuinely mixed
In this incident, the ingress host is the strongest signal. It was changed for the wrong environment, and that explains the edge failure faster than backend rollout noise does.
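The investigation sequence above can be sketched as a small triage helper. This is a minimal illustration, not part of the course scripts: the function names and the category labels are assumptions, and the kubectl commands in the comments use placeholder resource names.

```shell
#!/usr/bin/env bash
# Triage sketch: keep routing evidence and application evidence separate.
# Function names and labels are illustrative assumptions.

# Compare the host found on the ingress with the host expected for the environment.
#   $1: expected host, $2: host actually configured on the ingress
check_ingress_host() {
  if [ "$2" = "$1" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

# Classify the outage from two independent observations.
#   $1: routing result ("match" or "mismatch")
#   $2: backend health ("healthy" or "unhealthy")
classify_outage() {
  case "$1:$2" in
    mismatch:healthy)   echo "routing-only" ;;
    match:unhealthy)    echo "app-only" ;;
    mismatch:unhealthy) echo "mixed" ;;
    match:healthy)      echo "no-fault-found" ;;
    *)                  echo "unknown" ;;
  esac
}

# Gathering the evidence (requires cluster access; <name> values are placeholders):
#   actual_host="$(kubectl -n develop get ingress <name> -o jsonpath='{.spec.rules[0].host}')"
#   kubectl -n develop get pods
#   kubectl -n develop logs deployment/<name> --tail=100
#   classify_outage "$(check_ingress_host "<expected-host>" "$actual_host")" healthy
```

Keeping the two observations as separate inputs forces the routing question and the application question to be answered independently before anyone decides what to revert.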
Containment
Containment is narrow on purpose:
- revert the ingress change only
- let the GitOps path reconcile it back to the correct host
- confirm routing is healthy again
- evaluate the backend image separately after traffic is stable
The goal is to restore one clean rollback path. Do not “fix everything at once” during the incident.
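If the ingress edit and the image bump landed in the same commit, a plain revert of that commit would undo both. One possible way to keep containment narrow is to restore only the ingress file to its state before the commit. The sketch below is an assumption about how that could look: the commit id, manifest path, and Flux Kustomization name are placeholders, and the helper prints commands by default instead of executing them.

```shell
#!/usr/bin/env bash
# Containment sketch: undo only the ingress file from a mixed commit.
# The commit id, path, and Kustomization name are placeholders (assumptions).

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

#   $1: commit that introduced the bad ingress change
#   $2: path of the ingress manifest in the GitOps repository
#   $3: Flux Kustomization that owns the manifest
revert_ingress_only() {
  # Restore just this file to its pre-commit state; the image bump is left alone.
  run git restore --source="$1^" --staged --worktree -- "$2"
  run git commit -m "Revert ingress change from $1"
  # After the revert is merged through the normal review path, ask Flux to
  # fetch the source and reconcile now instead of waiting for the next interval.
  run flux reconcile kustomization "$3" --with-source
}

# Example (dry run): revert_ingress_only <sha> <path-to-ingress.yaml> <kustomization>
```

Running it in dry-run mode first gives the team a reviewable list of the exact commands before anything touches the repository.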
Guardrails That Stop It
- Context guard before any Kubernetes write:
  scripts/guard-kube-context.sh --context <ctx> --namespace <ns>
- Plan-before-apply guard for Terraform:
  scripts/guard-terraform-plan.sh plan ...
  scripts/guard-terraform-plan.sh apply ...
- Single-change policy:
  - one PR for image/promotion
  - separate PR for networking/ingress
- Git pre-hooks for repository hygiene:
  - scripts/pre-commit-master-check.sh blocks direct work against protected branches
  - scripts/prevent-amend-after-push.sh blocks amending already-pushed commits
  - scripts/flux-kustomize-validate.sh blocks broken Flux Kustomize renders before commit
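The single-change policy can also be checked mechanically. The following is a hypothetical sketch, not one of the course scripts: it assumes that ingress manifests can be recognized by the word "ingress" in their path, which may not hold in every repository layout.

```shell
#!/usr/bin/env bash
# Single-change policy sketch (hypothetical helper, path convention assumed).
# Reads changed file paths on stdin, one per line.
# Returns 0 if the change stays in one category, 1 if it mixes
# ingress files with anything else.
check_single_change() {
  has_ingress=0
  has_other=0
  while IFS= read -r path; do
    case "$path" in
      "")        ;;                  # ignore blank lines
      *ingress*) has_ingress=1 ;;    # assumed naming convention for ingress manifests
      *)         has_other=1 ;;
    esac
  done
  if [ "$has_ingress" = 1 ] && [ "$has_other" = 1 ]; then
    echo "[single-change] ingress and non-ingress files in one change; split the PR" >&2
    return 1
  fi
  return 0
}

# Example (the base branch name is an assumption):
#   git diff --name-only origin/main...HEAD | check_single_change
```

A check like this is deliberately coarse. It cannot judge intent, but it does flag the exact shape of change that caused this incident.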
Investigation Snapshots
Here is the Kubernetes context guard used in the SafeOps system to stop writes to the wrong cluster or namespace before the change even starts.
Kubernetes context guard
#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
usage: scripts/guard-kube-context.sh --context <name> --namespace <name> [--kubeconfig <path>]

Verifies kubectl is pointing to the expected cluster context and namespace.
Fails fast with actionable output if context/namespace checks do not pass.

Examples:
  scripts/guard-kube-context.sh --context sre-control-plane --namespace develop
  scripts/guard-kube-context.sh --context sre-control-plane --namespace production --kubeconfig ./kubeconfig.yaml
EOF
}

EXPECTED_CONTEXT=""
EXPECTED_NAMESPACE=""
KUBECONFIG_PATH=""

while [[ $# -gt 0 ]]; do
  case "$1" in
    --context)
      EXPECTED_CONTEXT="${2:-}"
      shift 2
      ;;
    --namespace)
      EXPECTED_NAMESPACE="${2:-}"
      shift 2
      ;;
    --kubeconfig)
      KUBECONFIG_PATH="${2:-}"
      shift 2
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "[guard-kube] unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

if [[ -z "${EXPECTED_CONTEXT}" || -z "${EXPECTED_NAMESPACE}" ]]; then
  echo "[guard-kube] --context and --namespace are required" >&2
  usage >&2
  exit 2
fi

if ! command -v kubectl >/dev/null 2>&1; then
  echo "[guard-kube] kubectl not found in PATH" >&2
  exit 1
fi

if [[ -n "${KUBECONFIG_PATH}" ]]; then
  export KUBECONFIG="${KUBECONFIG_PATH}"
fi

CURRENT_CONTEXT="$(kubectl config current-context 2>/dev/null || true)"
if [[ -z "${CURRENT_CONTEXT}" ]]; then
  echo "[guard-kube] no current kubectl context is set" >&2
  exit 1
fi

if [[ "${CURRENT_CONTEXT}" != "${EXPECTED_CONTEXT}" ]]; then
  echo "[guard-kube] context mismatch" >&2
  echo " expected: ${EXPECTED_CONTEXT}" >&2
  echo " actual: ${CURRENT_CONTEXT}" >&2
  exit 1
fi

if ! kubectl get namespace "${EXPECTED_NAMESPACE}" >/dev/null 2>&1; then
  echo "[guard-kube] namespace '${EXPECTED_NAMESPACE}' not found in context '${CURRENT_CONTEXT}'" >&2
  exit 1
fi

echo "[guard-kube] OK context=${CURRENT_CONTEXT} namespace=${EXPECTED_NAMESPACE}"
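A guard only helps if the write cannot happen without it. One way to couple the two is a small wrapper that runs a command only after the guard passes. This is a sketch under assumptions: `with_kube_guard` and the `GUARD` variable are hypothetical names, and the wrapper is not part of the course scripts.

```shell
#!/usr/bin/env bash
# Wrapper sketch: run a command only if the context guard succeeds.
# GUARD and with_kube_guard are hypothetical names introduced for illustration.
GUARD="${GUARD:-scripts/guard-kube-context.sh}"

#   $1: expected context, $2: expected namespace, remaining args: command to run
with_kube_guard() {
  if [ "$#" -lt 3 ]; then
    echo "usage: with_kube_guard <context> <namespace> <command...>" >&2
    return 2
  fi
  ctx="$1"
  ns="$2"
  shift 2
  if ! "$GUARD" --context "$ctx" --namespace "$ns"; then
    echo "[with-kube-guard] guard failed; refusing to run: $*" >&2
    return 1
  fi
  "$@"
}

# Example (the manifest path is a placeholder):
#   with_kube_guard sre-control-plane develop \
#     kubectl --context sre-control-plane -n develop apply -k <path>
```

Passing `--context` and `-n` explicitly to kubectl in the wrapped command means the write targets the same cluster and namespace the guard just verified, even if the current context changes in between.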
Here is the Terraform guard used in the same system to force a reviewed plan artifact before any apply.
Plan-before-apply guard
#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
usage:
  scripts/guard-terraform-plan.sh plan --dir <path> [--out <planfile>]
  scripts/guard-terraform-plan.sh apply --dir <path> [--out <planfile>] [--max-age-minutes <n>]

Guardrail wrapper for Terraform plan/apply.
- `plan` creates a planfile and metadata marker.
- `apply` refuses to run unless a fresh planfile + metadata marker exist.

Examples:
  scripts/guard-terraform-plan.sh plan --dir infra/terraform/hcloud_cluster --out tfplan
  scripts/guard-terraform-plan.sh apply --dir infra/terraform/hcloud_cluster --out tfplan --max-age-minutes 60
EOF
}

if [[ $# -lt 1 ]]; then
  usage >&2
  exit 2
fi

# Help was requested explicitly: print usage to stdout and exit successfully.
if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
  usage
  exit 0
fi

MODE="$1"
shift

WORKDIR=""
PLAN_FILE="tfplan"
MAX_AGE_MINUTES="120"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --dir)
      WORKDIR="${2:-}"
      shift 2
      ;;
    --out)
      PLAN_FILE="${2:-}"
      shift 2
      ;;
    --max-age-minutes)
      MAX_AGE_MINUTES="${2:-}"
      shift 2
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "[guard-tf] unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

if [[ -z "${WORKDIR}" ]]; then
  echo "[guard-tf] --dir is required" >&2
  usage >&2
  exit 2
fi

if ! command -v terraform >/dev/null 2>&1; then
  echo "[guard-tf] terraform not found in PATH" >&2
  exit 1
fi

if ! [[ -d "${WORKDIR}" ]]; then
  echo "[guard-tf] directory not found: ${WORKDIR}" >&2
  exit 1
fi

# The planfile is written relative to WORKDIR because terraform runs with -chdir.
PLAN_PATH="${WORKDIR}/${PLAN_FILE}"
META_PATH="${PLAN_PATH}.meta"

case "${MODE}" in
  plan)
    terraform -chdir="${WORKDIR}" init -input=false
    terraform -chdir="${WORKDIR}" plan -input=false -lock-timeout=5m -out "${PLAN_FILE}"
    # Record when the plan was created so apply can enforce a maximum age.
    {
      echo "created_at_epoch=$(date +%s)"
      echo "workdir=${WORKDIR}"
      echo "plan_file=${PLAN_FILE}"
    } > "${META_PATH}"
    echo "[guard-tf] plan created: ${PLAN_PATH}"
    echo "[guard-tf] metadata created: ${META_PATH}"
    ;;
  apply)
    if [[ ! -f "${PLAN_PATH}" ]]; then
      echo "[guard-tf] missing plan file: ${PLAN_PATH}" >&2
      echo "[guard-tf] run: scripts/guard-terraform-plan.sh plan --dir ${WORKDIR} --out ${PLAN_FILE}" >&2
      exit 1
    fi
    if [[ ! -f "${META_PATH}" ]]; then
      echo "[guard-tf] missing plan metadata: ${META_PATH}" >&2
      echo "[guard-tf] refusing apply without plan marker" >&2
      exit 1
    fi
    # Load created_at_epoch (and the other keys) written by the plan step.
    # shellcheck disable=SC1090
    source "${META_PATH}"
    NOW_EPOCH="$(date +%s)"
    AGE_SECONDS="$((NOW_EPOCH - created_at_epoch))"
    AGE_MINUTES="$((AGE_SECONDS / 60))"
    if (( AGE_MINUTES > MAX_AGE_MINUTES )); then
      echo "[guard-tf] plan is too old (${AGE_MINUTES}m > ${MAX_AGE_MINUTES}m)" >&2
      echo "[guard-tf] re-run plan before apply" >&2
      exit 1
    fi
    terraform -chdir="${WORKDIR}" apply -input=false "${PLAN_FILE}"
    echo "[guard-tf] apply completed using ${PLAN_PATH}"
    ;;
  *)
    echo "[guard-tf] unknown mode: ${MODE}" >&2
    usage >&2
    exit 2
    ;;
esac
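The freshness check in the apply branch uses integer minutes and a strict "greater than" comparison, so the boundary is worth working through once. The helper below isolates that arithmetic for illustration. The name `plan_is_fresh` is hypothetical, and the epoch values in the comments are made-up inputs.

```shell
#!/usr/bin/env bash
# Worked example of the plan-age arithmetic (hypothetical helper).
# Mirrors the guard: age in whole minutes, blocked only when age > max.
#   $1: created_at_epoch, $2: now epoch, $3: max age in minutes
plan_is_fresh() {
  age_minutes=$(( ($2 - $1) / 60 ))
  [ "$age_minutes" -le "$3" ]
}

# With a 60-minute limit and a plan created at epoch 1000:
#   now=4600 -> 3600s -> 60 minutes -> still accepted (60 > 60 is false)
#   now=4659 -> 3659s -> 60 minutes after integer division -> still accepted
#   now=4660 -> 3660s -> 61 minutes -> rejected
```

In practice this means a plan can be up to 59 seconds older than the configured limit before it is rejected, which is usually acceptable for a guard whose purpose is to stop hours-old plans.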
System Context
This chapter establishes the operating rule for the rest of the course: keep change boundaries narrow enough that investigation and rollback stay obvious.
It also introduces another course-wide assumption: the platform is only half of the story, and the application itself must be built with Kubernetes-native operational contracts such as probes, graceful shutdown, structured telemetry, safe packaging, and signed delivery artifacts.
Those contracts are demonstrated through the course reference applications, especially ldbl/backend and ldbl/frontend, with several implementation patterns borrowed from podinfo.
The same discipline appears again in later chapters:
- Chapter 02 keeps Terraform execution inside one reviewed plan path
- Chapter 04 separates promotion from rebuild and ad-hoc environment edits
- Chapter 05 layers workstation, CI, review, and cluster guardrails around the same idea
If learners do not internalize this rule here, the later guardrails will feel procedural instead of necessary.
Local Git Guardrails (Pre-Hooks)
Install and verify local hooks before running labs:
make install-hooks
pre-commit run --all-files
These hooks enforce branch and history discipline before CI starts, so risky workflow mistakes are caught early on the workstation.
For GitOps manifest changes under flux/**, they also enforce local Kustomize render validity.
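To make the branch discipline concrete, here is a minimal sketch of what a protected-branch check could look like. It is not the course's `pre-commit-master-check.sh`; the function names and the list of protected branches are assumptions.

```shell
#!/usr/bin/env bash
# Pre-commit sketch: refuse commits made directly on a protected branch.
# Function names and the protected branch list are assumptions.

is_protected_branch() {
  case "$1" in
    main|master) return 0 ;;
    *)           return 1 ;;
  esac
}

#   $1 (optional): branch name; defaults to the currently checked-out branch
block_protected_branch_commit() {
  branch="${1:-$(git rev-parse --abbrev-ref HEAD)}"
  if is_protected_branch "$branch"; then
    echo "[pre-commit] direct commits to '$branch' are blocked; use a feature branch" >&2
    return 1
  fi
}

# In a hook, a non-zero return aborts the commit:
#   block_protected_branch_commit || exit 1
```

Because the check runs on the workstation, the mistake is caught before a push, which is cheaper than a rejected CI run or a force-push cleanup.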
Safe Workflow (Step-by-Step)
- Verify context and namespace.
- Produce plan/diff first (Terraform or GitOps diff).
- Review for correlated changes before merge or apply.
- Apply one change type at a time.
- Verify health and routing separately.
- Keep rollback commands prepared before merge/apply.
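For the diff step, one option on the Kubernetes side is `kubectl diff`, whose exit status needs care in scripts: 0 means no differences, 1 means differences were found, and anything greater means kubectl or the diff program failed. The sketch below shows one way to interpret it. The function name and labels are assumptions, and the context, namespace, and path are placeholders.

```shell
#!/usr/bin/env bash
# Diff-step sketch: turn the exit status of `kubectl diff` into a decision.
# interpret_diff_status and its labels are hypothetical.
interpret_diff_status() {
  case "$1" in
    0) echo "no-changes" ;;
    1) echo "changes-to-review" ;;
    *) echo "diff-failed" ;;
  esac
}

# Usage in a script running under `set -e` (capture the status without aborting):
#   status=0
#   kubectl --context <ctx> -n <ns> diff -k <path> || status=$?
#   interpret_diff_status "$status"
```

Treating status 1 as "review needed" instead of "error" keeps the workflow honest: a diff with changes is the expected outcome before an apply, while a failed diff should stop the process.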
Demo Commands
A. Kubernetes context/namespace guard
# Expected success example
scripts/guard-kube-context.sh \
--context sre-control-plane \
--namespace develop
Expected output:
[guard-kube] OK context=sre-control-plane namespace=develop
Failure example (wrong namespace):
scripts/guard-kube-context.sh \
--context sre-control-plane \
--namespace does-not-exist
Expected output:
[guard-kube] namespace 'does-not-exist' not found in context 'sre-control-plane'
B. Terraform plan-before-apply guard
# Create plan + metadata marker
scripts/guard-terraform-plan.sh plan \
--dir infra/terraform/hcloud_cluster \
--out tfplan
# Apply only from a fresh, reviewed planfile
scripts/guard-terraform-plan.sh apply \
--dir infra/terraform/hcloud_cluster \
--out tfplan \
--max-age-minutes 60
If the plan marker is missing or stale, apply is blocked with an explicit error.
Rollback Checklist
- If Kubernetes deploy changed:
  kubectl -n <ns> rollout undo deployment/<name>
- If ingress changed:
  - revert ingress commit in Git and let Flux reconcile
- If Terraform apply changed infra:
  - create a new reviewed plan and apply rollback change
- Verify:
  - /healthz on backend
  - ingress route with Host header
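The verification step can be sketched with curl. This is an illustration under assumptions: the function names are hypothetical, and the ingress address, host, and port values are placeholders that depend on the environment.

```shell
#!/usr/bin/env bash
# Verification sketch: check the route through the ingress with an explicit Host header.
# Function names are hypothetical; addresses and hosts are placeholders.

# Interpret an HTTP status code seen at the edge.
classify_http_status() {
  case "$1" in
    2??|3??)     echo "ok" ;;
    502|503|504) echo "gateway-error" ;;
    *)           echo "unexpected" ;;
  esac
}

# Print only the HTTP status code for a request sent to the ingress address.
#   $1: ingress address (IP or hostname), $2: Host header value, $3: request path
probe_route() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 5 -H "Host: $2" "http://$1$3"
}

# Ingress route:
#   classify_http_status "$(probe_route <ingress-address> <host> /)"
# Backend health, bypassing the ingress (in a second terminal):
#   kubectl -n <ns> port-forward deployment/<name> 8080:<container-port>
#   curl -sS http://localhost:8080/healthz
```

Checking the backend through a port-forward and the route through the ingress keeps the two signals independent, which matches the separation used during the investigation.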
Exercises
- Split a mixed PR into two PRs:
  - PR1: image tag update only
  - PR2: ingress update only
- Intentionally run guard-terraform-plan.sh apply without a planfile and capture the failure output.
Done When
- Student can explain why “small but mixed” changes are high risk.
- Student can demonstrate both guard scripts before any apply action.
- Student can separate investigation, containment, and rollback into distinct decisions.