Chapter 02: Infrastructure as Code (IaC)
Incident Hook
Two engineers run infrastructure changes close together during incident pressure.
One apply acquires the state lock; the second run retries and later applies a stale plan.
Result: partial drift plus unexpected replacement in unrelated resources.
Recovery takes longer because no one can prove which plan produced the final state.
Observed Symptoms
What the team sees first:
- one apply job holds the lock while another waits or retries
- the later apply changes resources nobody expected to touch
- a fresh plan no longer matches the reviewed plan artifact
The warning sign is not only contention. It is contention plus uncertainty about execution intent.
Confusion Phase
Remote state locking looks like it should have protected the workflow. That is what makes this failure deceptive.
The team now has to answer two different questions:
- did Terraform behave incorrectly?
- or did the workflow allow an old plan to survive long enough to become dangerous?
Why This Chapter Exists
In production, infrastructure mistakes are expensive and fast-moving. IaC is not only about automation speed. It is about:
- repeatability
- reviewability
- rollback paths
- controlled blast radius
This chapter introduces a guardrails-first Terraform workflow for Kubernetes platforms.
Learning Objectives
By the end of this chapter, learners can:
- explain module boundaries and Terraform folder structure in the course platform
- run a safe plan -> review -> apply workflow
- explain why remote state and locking are non-negotiable in team environments
- detect drift and decide whether to reconcile or rollback
- execute safe destroy practices with explicit scope checks
State Failure Story (Lock Contention)
Typical failure chain:
- pipeline A acquires state lock and applies.
- pipeline B waits, then retries from outdated assumptions.
- B applies stale plan after lock release.
Blast radius:
- unintended resource replacement
- drift hidden by unrelated changes
- rollback uncertainty because state changed twice in short window
What AI Would Propose (Brave Junior)
- “Run terraform apply directly, we already know the desired change.”
- “If lock fails, retry until it succeeds.”
- “Destroy and recreate is faster than careful rollback.”
Why this sounds reasonable:
- looks faster in the moment
- fewer review steps
- immediate visible progress
Why This Is Dangerous
- apply without reviewed plan removes the last safe checkpoint.
- stale plan + concurrent runs create hard-to-debug infra divergence.
- destroy shortcuts can expand blast radius across dependencies.
Investigation
Start by treating state and plan history as evidence, not memory.
Safe investigation sequence:
- identify every plan and apply job that touched the same environment
- compare the reviewed plan artifact with a fresh plan against current state
- confirm whether the later apply ran from assumptions older than the current state
- trace the workflow gap that allowed stale intent to remain executable
The root cause here is usually workflow design, not Terraform syntax.
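One way to treat plan history as evidence is to record a checksum of each reviewed plan artifact and compare it with the file the apply job actually received. The sketch below is only an illustration under stated assumptions: `plan_fingerprint` and `same_plan` are hypothetical helper names, not part of the course scripts, and it assumes `sha256sum` or `shasum` is installed. It proves that two artifacts are byte-identical; it does not compare a reviewed plan with a freshly generated one, which still requires reading the diff.

```shell
# Hypothetical helpers (not part of the course scripts): fingerprint a plan
# artifact so reviewers and the apply job can prove they handled the same file.

# Print the SHA-256 digest of a file, using whichever tool is installed.
plan_fingerprint() (
  file="$1"
  [ -f "$file" ] || { echo "missing file: $file" >&2; exit 1; }
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$file" | cut -d' ' -f1
  else
    shasum -a 256 "$file" | cut -d' ' -f1
  fi
)

# Succeed only when both artifacts exist and have identical digests.
same_plan() (
  reviewed="$(plan_fingerprint "$1")" || exit 1
  received="$(plan_fingerprint "$2")" || exit 1
  [ "$reviewed" = "$received" ]
)
```

A pipeline could store the digest next to the review record and call `same_plan reviewed/tfplan ./tfplan` before apply; both paths here are placeholders.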
Containment
Containment starts by stopping overlap:
- pause concurrent applies for that environment
- generate a fresh plan from current state
- review only the corrective diff
- apply once from the fresh reviewed plan
Only after state is trustworthy again should the team tune concurrency, approvals, or destroy policy.
Guardrails That Stop It
- mandatory plan -> review -> apply sequence, never direct apply.
- remote state locking is required for team workflows.
- CI/apply pipeline concurrency must be 1 per environment.
- destroy is deny-by-default outside develop, unless break-glass record is approved.
- every destructive action must include recreate/rollback evidence first.
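The deny-by-default destroy rule can be expressed as a small gate that a wrapper script calls before any destroy. This is a minimal sketch, not the course's implementation: the function name `destroy_allowed` and the `approved=true` marker line in the record file are assumptions made for illustration.

```shell
# Hypothetical policy check: destroy is allowed in develop, and elsewhere only
# when a break-glass record file containing "approved=true" is supplied.
destroy_allowed() (
  environment="$1"
  record_file="${2:-}"
  if [ "$environment" = "develop" ]; then
    exit 0
  fi
  if [ -n "$record_file" ] && [ -f "$record_file" ] \
    && grep -q '^approved=true$' "$record_file"; then
    exit 0
  fi
  echo "[guard-tf] destroy denied for environment: $environment" >&2
  exit 1
)
```

The gate fails closed: a missing record, an unreadable path, or any value other than `approved=true` all result in denial.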
Break-Glass Minimum Record (Destroy Outside Develop)
If destroy is required outside develop, the record must include:
- incident/ticket reference
- exact scope (workspace/resource/module)
- expected impact and rollback/recreate plan
- approver identity and time window
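A record like this is easier to enforce when a script can check it. The following sketch assumes the record is stored as `key=value` lines; that storage format and the field names (`ticket`, `scope`, `impact`, `rollback_plan`, `approver`, `window`) are assumptions that mirror the list above, not a format defined by the course.

```shell
# Hypothetical validator for a break-glass record stored as key=value lines.
# Prints the name of every required field that is absent or empty.
missing_record_fields() (
  record_file="$1"
  for key in ticket scope impact rollback_plan approver window; do
    # A field counts as present only if it has a non-empty value.
    grep -q "^${key}=..*" "$record_file" 2>/dev/null || printf '%s\n' "$key"
  done
)

# Succeed only when the record exists and no required field is missing.
record_is_complete() (
  [ -f "$1" ] && [ -z "$(missing_record_fields "$1")" ]
)
```

Printing the missing field names, rather than only failing, gives the requester a concrete list to fix before asking for approval again.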
Investigation Snapshots
Here is the plan/apply guard used in the SafeOps system. This is where “always run plan before apply” becomes executable policy instead of team etiquette.
Terraform plan/apply guard
#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
usage:
  scripts/guard-terraform-plan.sh plan --dir <path> [--out <planfile>]
  scripts/guard-terraform-plan.sh apply --dir <path> [--out <planfile>] [--max-age-minutes <n>]

Guardrail wrapper for Terraform plan/apply.
- `plan` creates a planfile and metadata marker.
- `apply` refuses to run unless a fresh planfile + metadata marker exist.

Examples:
  scripts/guard-terraform-plan.sh plan --dir infra/terraform/hcloud_cluster --out tfplan
  scripts/guard-terraform-plan.sh apply --dir infra/terraform/hcloud_cluster --out tfplan --max-age-minutes 60
EOF
}

if [[ $# -lt 1 ]]; then
  usage >&2
  exit 2
fi

if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
  usage >&2
  exit 0
fi

MODE="$1"
shift

WORKDIR=""
PLAN_FILE="tfplan"
MAX_AGE_MINUTES="120"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --dir)
      WORKDIR="${2:-}"
      shift 2
      ;;
    --out)
      PLAN_FILE="${2:-}"
      shift 2
      ;;
    --max-age-minutes)
      MAX_AGE_MINUTES="${2:-}"
      shift 2
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "[guard-tf] unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

if [[ -z "${WORKDIR}" ]]; then
  echo "[guard-tf] --dir is required" >&2
  usage >&2
  exit 2
fi

if ! command -v terraform >/dev/null 2>&1; then
  echo "[guard-tf] terraform not found in PATH" >&2
  exit 1
fi

if ! [[ -d "${WORKDIR}" ]]; then
  echo "[guard-tf] directory not found: ${WORKDIR}" >&2
  exit 1
fi

PLAN_PATH="${WORKDIR}/${PLAN_FILE}"
META_PATH="${PLAN_PATH}.meta"

case "${MODE}" in
  plan)
    terraform -chdir="${WORKDIR}" init -input=false
    terraform -chdir="${WORKDIR}" plan -input=false -lock-timeout=5m -out "${PLAN_FILE}"
    {
      echo "created_at_epoch=$(date +%s)"
      echo "workdir=${WORKDIR}"
      echo "plan_file=${PLAN_FILE}"
    } > "${META_PATH}"
    echo "[guard-tf] plan created: ${PLAN_PATH}"
    echo "[guard-tf] metadata created: ${META_PATH}"
    ;;
  apply)
    if [[ ! -f "${PLAN_PATH}" ]]; then
      echo "[guard-tf] missing plan file: ${PLAN_PATH}" >&2
      echo "[guard-tf] run: scripts/guard-terraform-plan.sh plan --dir ${WORKDIR} --out ${PLAN_FILE}" >&2
      exit 1
    fi
    if [[ ! -f "${META_PATH}" ]]; then
      echo "[guard-tf] missing plan metadata: ${META_PATH}" >&2
      echo "[guard-tf] refusing apply without plan marker" >&2
      exit 1
    fi
    # shellcheck disable=SC1090
    source "${META_PATH}"
    NOW_EPOCH="$(date +%s)"
    AGE_SECONDS="$((NOW_EPOCH - created_at_epoch))"
    AGE_MINUTES="$((AGE_SECONDS / 60))"
    if (( AGE_MINUTES > MAX_AGE_MINUTES )); then
      echo "[guard-tf] plan is too old (${AGE_MINUTES}m > ${MAX_AGE_MINUTES}m)" >&2
      echo "[guard-tf] re-run plan before apply" >&2
      exit 1
    fi
    terraform -chdir="${WORKDIR}" apply -input=false "${PLAN_FILE}"
    echo "[guard-tf] apply completed using ${PLAN_PATH}"
    ;;
  *)
    echo "[guard-tf] unknown mode: ${MODE}" >&2
    usage >&2
    exit 2
    ;;
esac
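The guard above loads its metadata with `source`, which executes the file as shell code; a `workdir` value containing a space, for example, would be interpreted as a command. A possible hardening, sketched here under assumptions, is to read the marker as plain data. The helper names `meta_value` and `plan_is_fresh` are hypothetical and are not part of the SafeOps script; the freshness rule matches the script's own check (reject only when age exceeds the limit).

```shell
# Hypothetical alternative to `source "${META_PATH}"`: read one key from the
# metadata file as plain data, so its contents are never executed.
meta_value() (
  key="$1"
  meta_file="$2"
  # Take the first matching line and strip the "key=" prefix.
  line="$(grep "^${key}=" "$meta_file" | head -n 1)"
  [ -n "$line" ] || exit 1
  printf '%s\n' "${line#*=}"
)

# Succeed when the plan is at most max_age_minutes old.
plan_is_fresh() (
  meta_file="$1"
  max_age_minutes="$2"
  now_epoch="${3:-$(date +%s)}"   # third argument exists only to ease testing
  created="$(meta_value created_at_epoch "$meta_file")" || exit 1
  # Reject empty or non-numeric timestamps instead of doing arithmetic on them.
  case "$created" in
    ''|*[!0-9]*) echo "[guard-tf] invalid created_at_epoch" >&2; exit 1 ;;
  esac
  age_minutes=$(( ($now_epoch - $created) / 60 ))
  [ "$age_minutes" -le "$max_age_minutes" ]
)
```

With helpers like these, the apply branch could call `plan_is_fresh "${META_PATH}" "${MAX_AGE_MINUTES}"` and fail closed when the marker is malformed.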
Here is the local validation baseline used before Terraform changes leave the workstation.
IaC hook baseline
default_install_hook_types:
  - pre-commit
  - pre-push
  - pre-merge-commit
  - prepare-commit-msg
repos:
  - repo: local
    hooks:
      - id: master-branch-check
        name: Protected branch guard
        entry: scripts/pre-commit-master-check.sh
        language: script
        always_run: true
        pass_filenames: false
        stages: [pre-commit, pre-push, pre-merge-commit]
        args:
          - --protected=master
          - --protected=main
      - id: prevent-amend-after-push
        name: Prevent amending pushed commits
        entry: scripts/prevent-amend-after-push.sh
        language: script
        always_run: true
        pass_filenames: false
        stages: [prepare-commit-msg]
  - repo: local
    hooks:
      - id: flux-kustomize-validate
        name: Flux kustomize validate
        entry: scripts/flux-kustomize-validate.sh
        language: script
        files: ^flux/.*\.ya?ml$
        pass_filenames: true
        require_serial: true
        stages: [pre-commit]
      - id: terraform-fmt
        name: Terraform format check
        entry: terraform fmt -recursive -diff -check
        language: system
        files: \.tf$
        pass_filenames: false
        stages: [pre-commit]
      - id: terraform-validate
        name: Terraform validate
        entry: scripts/terraform-validate.sh
        language: script
        files: \.(tf|tfvars)$
        pass_filenames: false
        require_serial: true
        stages: [pre-commit]
      - id: terraform-security
        name: Terraform security scan
        entry: scripts/terraform-security.sh
        language: script
        files: \.(tf|tfvars)$
        pass_filenames: false
        require_serial: true
        stages: [pre-commit]
  - repo: local
    hooks:
      - id: no-secrets
        name: Block sensitive files
        entry: scripts/block-secrets.sh
        language: script
        files: (kubeconfig|\.key$|\.pem$|credentials|\.env$)
        stages: [pre-commit]
  - repo: https://github.com/koalaman/shellcheck-precommit
    rev: v0.10.0
    hooks:
      - id: shellcheck
        files: \.sh$
        args: [--severity=warning]
        stages: [pre-commit]
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1
    hooks:
      - id: yamllint
        files: \.ya?ml$
        args: [-d, relaxed]
        stages: [pre-commit]
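The no-secrets hook above selects files by a filename pattern and hands them to scripts/block-secrets.sh, which this chapter does not show. As a rough idea of what a filename-based guard can look like, here is a sketch that reuses the same pattern; the function names `is_sensitive_path` and `block_sensitive_files` are hypothetical, and the real script may work differently.

```shell
# Hypothetical sketch of a filename-based secret guard; the real
# scripts/block-secrets.sh is not shown in this chapter.

# Succeed when a path matches the sensitive-file pattern from the hook config.
is_sensitive_path() (
  printf '%s\n' "$1" | grep -Eq '(kubeconfig|\.key$|\.pem$|credentials|\.env$)'
)

# Report every sensitive path among the arguments; fail if any were found.
block_sensitive_files() (
  status=0
  for path in "$@"; do
    if is_sensitive_path "$path"; then
      echo "[no-secrets] refusing to commit: $path" >&2
      status=1
    fi
  done
  exit "$status"
)
```

Note that the pattern only looks at names: a file such as `.env.example` passes, and a secret pasted into `main.tf` is not detected by this kind of check.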
System Context
This chapter gives the rest of the course a trustworthy infrastructure baseline.
Later chapters depend on this discipline:
- Chapter 04 needs artifact promotion to land on stable infrastructure state
- Chapter 05 turns this workflow into CI and approval policy
- Chapter 17 depends on the same reviewed execution path when data changes become riskier
Core Concepts
- Terraform structure and modules
- root configuration should stay thin and readable
- provider/module versions must be pinned
- reusable logic belongs in modules, not copy/paste blocks
- Remote state and locking
- shared state enables team collaboration
- locking prevents concurrent apply corruption
- backend config is part of production reliability
- IAM and RBAC principles
- least privilege by default
- separate read/plan/apply responsibilities
- no broad credentials for automation or AI tooling
- Drift detection
- drift = actual infra != declared infra
- detect drift before making unrelated changes
- never hide drift by batching many changes together
- Safe destroy
- destroy is valid, but only with explicit scope
- always verify workspace, targets, and dependency impact
- create a rollback/recreate plan before destructive actions
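Drift detection can be scripted with the `-detailed-exitcode` flag of `terraform plan`, which exits 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below wraps that behavior; the function names `classify_plan_exit` and `drift_check` and the status labels are assumptions for illustration. Exit code 2 alone cannot distinguish real drift from uncommitted configuration changes, so the check is most meaningful on a clean checkout of the reviewed branch.

```shell
# Map `terraform plan -detailed-exitcode` exit codes to a readable status.
# Documented meanings: 0 = no changes, 1 = error, 2 = changes present.
classify_plan_exit() (
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "changes-detected" ;;
    *) echo "error" ;;
  esac
)

# Hypothetical wrapper: plan against a directory and print the status.
# Requires terraform in PATH and an already initialized working directory.
drift_check() (
  workdir="$1"
  rc=0
  terraform -chdir="$workdir" plan -input=false -lock-timeout=5m \
    -detailed-exitcode >/dev/null || rc=$?
  classify_plan_exit "$rc"
)
```

A scheduled job could run `drift_check infra/terraform/hcloud_cluster` and alert on anything other than `in-sync`, so drift is seen before unrelated changes are planned on top of it.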
Safe Workflow (Step-by-Step)
- Read this chapter, lab.md, and the review checklist.
- Install and run local hooks: make install-hooks && pre-commit run --all-files.
- Generate a plan artifact and perform peer review.
- Apply only from the reviewed/fresh plan artifact.
- Run drift check and confirm expected state after apply.
- Complete quiz.md and record operational evidence.
Pre-Commit Guardrails for IaC
Before Terraform changes are committed, hooks enforce:
- terraform fmt -recursive -diff -check
- scripts/terraform-validate.sh
- scripts/terraform-security.sh
- scripts/flux-kustomize-validate.sh (for any flux/** manifest changes in the same PR)
These checks reduce noisy reviews and block unsafe IaC changes before they reach CI/apply workflows.
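The contents of scripts/terraform-validate.sh are not shown here. As a sketch of what such a hook might do, the example below finds every directory that contains `.tf` files and validates each one without touching remote state, using `terraform init -backend=false` followed by `terraform validate`. The function names `tf_dirs` and `validate_all` are hypothetical, and the real script may be organized differently.

```shell
# Hypothetical sketch of a validate hook; not the course's actual script.

# List unique directories containing .tf files, skipping provider/module caches.
tf_dirs() (
  root="${1:-.}"
  find "$root" -type f -name '*.tf' ! -path '*/.terraform/*' \
    -exec dirname {} \; | sort -u
)

# Validate each directory; keep going after a failure, then report overall status.
# Requires terraform in PATH.
validate_all() (
  tf_dirs "${1:-.}" | {
    status=0
    while IFS= read -r dir; do
      echo "[tf-validate] $dir"
      terraform -chdir="$dir" init -backend=false -input=false >/dev/null </dev/null \
        && terraform -chdir="$dir" validate </dev/null || status=1
    done
    exit "$status"
  }
)
```

Skipping the backend keeps the hook usable on a workstation without state credentials, which fits the least-privilege split between validation and apply.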
CI Concurrency Guardrail (Example)
Use one apply lane per environment:
concurrency:
  group: terraform-${{ github.workflow }}-${{ inputs.environment }}
  cancel-in-progress: false
This prevents overlapping apply jobs from mutating shared state concurrently.
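The same "one lane per environment" idea can be rehearsed outside CI. The sketch below is a single-machine illustration only and is not a substitute for remote state locking or CI concurrency groups: it relies on `mkdir` being atomic, so exactly one caller can create the lock directory. `LOCK_ROOT`, `acquire_apply_lane`, and `release_apply_lane` are hypothetical names.

```shell
# Hypothetical local illustration of "one apply lane per environment".
# mkdir is atomic: exactly one caller can create the lock directory.
LOCK_ROOT="${LOCK_ROOT:-${TMPDIR:-/tmp}/tf-apply-locks}"

# Succeed if this caller now owns the lane; fail if someone else holds it.
acquire_apply_lane() (
  environment="$1"
  mkdir -p "$LOCK_ROOT" || exit 1
  if mkdir "$LOCK_ROOT/$environment" 2>/dev/null; then
    exit 0
  fi
  echo "[guard-tf] apply already running for: $environment" >&2
  exit 1
)

# Free the lane so the next apply can start.
release_apply_lane() (
  rmdir "$LOCK_ROOT/$1"
)
```

Unlike a blind retry loop, a second caller fails immediately and visibly, which gives the operator a reason to re-plan from current state before trying again.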
Anti-Patterns to Avoid
- Running terraform apply without reviewed plan.
- Applying from stale plan output.
- Sharing one credential set across all environments.
- Using destroy in ambiguous context.
Done When
- learner can explain and demonstrate plan -> review -> apply under lock discipline
- learner can identify drift and choose reconcile vs rollback path
- learner can state clear no-go conditions for destroy actions
Alternative: Local Development with Kind
The primary workflow uses Hetzner Cloud for production-like infrastructure. For local testing, CI environments, or learning without cloud costs, a Kind (Kubernetes in Docker) cluster provides the same Flux bootstrap path.
Kind Cluster Setup
The Kind cluster creates a 3-node topology:
- 1 control-plane node
- 2 worker nodes
Local Registry Mirror
The Kind setup includes a local registry at localhost:5001 for image caching and local development builds.
Port Mappings
- 30080 -> 8080 (HTTP)
- 30443 -> 8443 (HTTPS)
Flux Operator Auto-Bootstrap
The Kind cluster uses the same Flux bootstrap path as Hetzner:
- Flux Operator installs and reconciles the repository
- Same namespace structure: develop, staging, production, observability
- Same Kustomization overlays and HelmRelease definitions
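A quick way to confirm that a bootstrapped cluster has the expected namespace structure is to compare the expected names with what the cluster reports. This is a minimal sketch; the function name `missing_namespaces` is hypothetical, and the namespace list is taken from the bullet above.

```shell
# Print every expected namespace that is absent from the given list.
# Argument: newline-separated namespace names as reported by the cluster.
missing_namespaces() (
  actual="$1"
  for ns in develop staging production observability; do
    # -x matches whole lines only, so "develop-old" does not count as "develop".
    printf '%s\n' "$actual" | grep -qx "$ns" || printf '%s\n' "$ns"
  done
)
```

Against a live cluster it could be fed with `missing_namespaces "$(kubectl get namespaces -o name | sed 's|^namespace/||')"`; empty output means all four namespaces exist.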
When to Use Kind
- Local testing before pushing to CI
- CI pipeline integration tests
- Learning the course without cloud costs
- Validating Flux manifests against a real cluster
Kind Limitations
- No real DNS resolution (use /etc/hosts or nip.io)
- No real TLS certificates (self-signed only)
- No Hetzner-specific features (CCM, CSI, load balancers)
- Not suitable for performance testing or production simulation
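For the DNS limitation, nip.io hostnames embed the target IPv4 address in the name itself, so no /etc/hosts edit is needed. The helper below is a small sketch; `nip_host` is a hypothetical name, and the default address of 127.0.0.1 assumes the Kind ports are published on the local machine.

```shell
# Hypothetical helper: build a nip.io hostname that resolves to the given
# IPv4 address (defaults to the local machine).
nip_host() (
  name="$1"
  ip="${2:-127.0.0.1}"
  printf '%s.%s.nip.io\n' "$name" "$ip"
)

nip_host app   # prints app.127.0.0.1.nip.io
```

The resulting name can be used as an ingress host in local overlays; it still requires outbound DNS access to resolve.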
SafeOps Snapshot
Here is the local Kind cluster baseline used in the SafeOps system for low-cost rehearsal and CI-friendly testing.
Kind cluster layout
- infra/terraform/kind_cluster/.gitignore
- infra/terraform/kind_cluster/README.md
- infra/terraform/kind_cluster/UPGRADE.md
- infra/terraform/kind_cluster/main.tf
- infra/terraform/kind_cluster/scripts/merge-kubeconfig.sh
- infra/terraform/kind_cluster/templates/git-repository.yaml.tpl
- infra/terraform/kind_cluster/templates/kustomization.yaml.tpl
- infra/terraform/kind_cluster/values/components.yaml
- infra/terraform/kind_cluster/variables.tf
Next Chapter
Continue with Chapter 03 (Secrets Management with SOPS).