Chapter 02: Infrastructure as Code (IaC)

Incident Hook

Two engineers run infrastructure changes close together under incident pressure. One apply acquires the state lock; the second run retries and later applies a stale plan. The result is partial drift plus unexpected replacement of unrelated resources. Recovery takes longer because no one can prove which plan produced the final state.

Observed Symptoms

What the team sees first:

one apply job holds the lock while another waits or retries
the later apply changes resources nobody expected to touch
a fresh plan no longer matches the reviewed plan artifact

The warning sign is not only contention. It is contention plus uncertainty about execution intent.

Confusion Phase

Remote state locking looks like it should have protected the workflow. That is what makes this failure deceptive.

The team now has to answer two different questions:

did Terraform behave incorrectly?
or did the workflow allow an old plan to survive long enough to become dangerous?

Why This Chapter Exists

In production, infrastructure mistakes are expensive and fast-moving. IaC is not only about automation speed. It is about:

repeatability
reviewability
rollback paths
controlled blast radius

This chapter introduces a guardrails-first Terraform workflow for Kubernetes platforms.

Learning Objectives

By the end of this chapter, learners can:

explain module boundaries and Terraform folder structure in the course platform
run a safe plan -> review -> apply workflow
explain why remote state and locking are non-negotiable in team environments
detect drift and decide whether to reconcile or roll back
execute safe destroy practices with explicit scope checks

State Failure Story (Lock Contention)

Typical failure chain:

pipeline A acquires state lock and applies.
pipeline B waits, then retries from outdated assumptions.
B applies stale plan after lock release.

Blast radius:

unintended resource replacement
drift hidden by unrelated changes
rollback uncertainty because state changed twice in short window

What AI Would Propose (Brave Junior)

“Run terraform apply directly, we already know the desired change.”
“If lock fails, retry until it succeeds.”
“Destroy and recreate is faster than careful rollback.”

Why this sounds reasonable:

looks faster in the moment
fewer review steps
immediate visible progress

Why This Is Dangerous

apply without reviewed plan removes the last safe checkpoint.
stale plan + concurrent runs create hard-to-debug infra divergence.
destroy shortcuts can expand blast radius across dependencies.

Investigation

Start by treating state and plan history as evidence, not memory.

Safe investigation sequence:

identify every plan and apply job that touched the same environment
compare the reviewed plan artifact with a fresh plan against current state
confirm whether the later apply ran from assumptions older than the current state
trace the workflow gap that allowed stale intent to remain executable
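
The comparison step in the sequence above can be sketched in shell. This is a minimal sketch, not the course's tooling: `plan_actions` and `plans_match` are hypothetical helper names, and both inputs are assumed to be text renderings produced with `terraform show -no-color <planfile>`. It compares only the per-resource action headers, so attribute-level differences go undetected.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Extract per-resource action headers from a rendered plan, for example:
#   "  # null_resource.a must be replaced"
# Input is assumed to come from: terraform show -no-color <planfile>
plan_actions() {
  local rendered="$1"
  # grep exits 1 when nothing matches; an empty plan is not an error here.
  { grep -E '^[[:space:]]*# .+ (will be|must be) ' "${rendered}" || true; } \
    | sed -E 's/^[[:space:]]+//' \
    | sort
}

# Return 0 when both renderings list the same resources with the same actions.
plans_match() {
  local reviewed="$1" fresh="$2"
  [[ "$(plan_actions "${reviewed}")" == "$(plan_actions "${fresh}")" ]]
}
```

A mismatch does not say which plan is right. It only shows that the reviewed intent and the current state have diverged, which is the evidence the investigation needs.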

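The root cause here is usually workflow design, not Terraform syntax. Terraform normally rejects a saved plan file once the state has changed after the plan was created, so when stale intent does reach apply, check whether the retry re-planned without review or ran apply without a saved plan file.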

Containment

Containment starts by stopping overlap:

pause concurrent applies for that environment
generate a fresh plan from current state
review only the corrective diff
apply once from the fresh reviewed plan

Only after state is trustworthy again should the team tune concurrency, approvals, or destroy policy.

Guardrails That Stop It

mandatory plan -> review -> apply sequence, never direct apply.
remote state locking is required for team workflows.
CI/apply pipeline concurrency must be 1 per environment.
destroy is deny-by-default outside develop, unless break-glass record is approved.
every destructive action must include recreate/rollback evidence first.

Break-Glass Minimum Record (Destroy Outside Develop)

If destroy is required outside develop, the record must include:

incident/ticket reference
exact scope (workspace/resource/module)
expected impact and rollback/recreate plan
approver identity and time window
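
The deny-by-default rule and the minimum record can be sketched as a pre-destroy check. This is a sketch under stated assumptions: the record is assumed to be a `key=value` text file, and the key names and the helpers `record_is_complete` and `destroy_permitted` are hypothetical. It checks completeness only; whether the approval is genuine still has to be verified elsewhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical record format: one key=value pair per line.
REQUIRED_KEYS=(ticket scope impact rollback_plan approver window)

# Return 0 only when every required key is present with a non-empty value.
record_is_complete() {
  local record="$1" key
  [[ -f "${record}" ]] || return 1
  for key in "${REQUIRED_KEYS[@]}"; do
    grep -Eq "^${key}=.+" "${record}" || return 1
  done
}

# Deny by default: destroy outside develop needs a complete record.
destroy_permitted() {
  local environment="$1" record="${2:-}"
  if [[ "${environment}" == "develop" ]]; then
    return 0
  fi
  [[ -n "${record}" ]] && record_is_complete "${record}"
}
```

A destroy wrapper could call `destroy_permitted "$ENVIRONMENT" "$RECORD_FILE"` first and stop on a non-zero exit.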

Investigation Snapshots

Here is the plan/apply guard used in the SafeOps system. This is where “always run plan before apply” becomes executable policy instead of team etiquette.

Terraform plan/apply guard

#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
usage:
  scripts/guard-terraform-plan.sh plan  --dir <path> [--out <planfile>]
  scripts/guard-terraform-plan.sh apply --dir <path> [--out <planfile>] [--max-age-minutes <n>]

Guardrail wrapper for Terraform plan/apply.
- `plan` creates a planfile and metadata marker.
- `apply` refuses to run unless a fresh planfile + metadata marker exist.

Examples:
  scripts/guard-terraform-plan.sh plan --dir infra/terraform/hcloud_cluster --out tfplan
  scripts/guard-terraform-plan.sh apply --dir infra/terraform/hcloud_cluster --out tfplan --max-age-minutes 60
EOF
}

if [[ $# -lt 1 ]]; then
  usage >&2
  exit 2
fi

if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
  usage
  exit 0
fi

MODE="$1"
shift

WORKDIR=""
PLAN_FILE="tfplan"
MAX_AGE_MINUTES="120"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --dir)
      WORKDIR="${2:-}"
      shift 2
      ;;
    --out)
      PLAN_FILE="${2:-}"
      shift 2
      ;;
    --max-age-minutes)
      MAX_AGE_MINUTES="${2:-}"
      shift 2
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "[guard-tf] unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

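if [[ -z "${WORKDIR}" ]]; then
  echo "[guard-tf] --dir is required" >&2
  usage >&2
  exit 2
fi

# Reject non-numeric values early; the age comparison below needs an integer.
if ! [[ "${MAX_AGE_MINUTES}" =~ ^[0-9]+$ ]]; then
  echo "[guard-tf] --max-age-minutes must be a non-negative integer: ${MAX_AGE_MINUTES}" >&2
  exit 2
fi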

if ! command -v terraform >/dev/null 2>&1; then
  echo "[guard-tf] terraform not found in PATH" >&2
  exit 1
fi

if ! [[ -d "${WORKDIR}" ]]; then
  echo "[guard-tf] directory not found: ${WORKDIR}" >&2
  exit 1
fi

PLAN_PATH="${WORKDIR}/${PLAN_FILE}"
META_PATH="${PLAN_PATH}.meta"

case "${MODE}" in
  plan)
    terraform -chdir="${WORKDIR}" init -input=false
    terraform -chdir="${WORKDIR}" plan -input=false -lock-timeout=5m -out "${PLAN_FILE}"
    {
      echo "created_at_epoch=$(date +%s)"
      echo "workdir=${WORKDIR}"
      echo "plan_file=${PLAN_FILE}"
    } > "${META_PATH}"
    echo "[guard-tf] plan created: ${PLAN_PATH}"
    echo "[guard-tf] metadata created: ${META_PATH}"
    ;;
  apply)
    if [[ ! -f "${PLAN_PATH}" ]]; then
      echo "[guard-tf] missing plan file: ${PLAN_PATH}" >&2
      echo "[guard-tf] run: scripts/guard-terraform-plan.sh plan --dir ${WORKDIR} --out ${PLAN_FILE}" >&2
      exit 1
    fi
    if [[ ! -f "${META_PATH}" ]]; then
      echo "[guard-tf] missing plan metadata: ${META_PATH}" >&2
      echo "[guard-tf] refusing apply without plan marker" >&2
      exit 1
    fi

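    # Parse only the timestamp instead of sourcing the file, so the metadata
    # cannot inject shell code or overwrite script variables.
    created_at_epoch="$(sed -n 's/^created_at_epoch=\([0-9][0-9]*\)$/\1/p' "${META_PATH}" | head -n 1)"
    if [[ -z "${created_at_epoch}" ]]; then
      echo "[guard-tf] invalid plan metadata: ${META_PATH}" >&2
      exit 1
    fi
    NOW_EPOCH="$(date +%s)"
    AGE_SECONDS="$((NOW_EPOCH - created_at_epoch))"
    AGE_MINUTES="$((AGE_SECONDS / 60))"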

    if (( AGE_MINUTES > MAX_AGE_MINUTES )); then
      echo "[guard-tf] plan is too old (${AGE_MINUTES}m > ${MAX_AGE_MINUTES}m)" >&2
      echo "[guard-tf] re-run plan before apply" >&2
      exit 1
    fi

    terraform -chdir="${WORKDIR}" apply -input=false "${PLAN_FILE}"
    echo "[guard-tf] apply completed using ${PLAN_PATH}"
    ;;
  *)
    echo "[guard-tf] unknown mode: ${MODE}" >&2
    usage >&2
    exit 2
    ;;
esac
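
The metadata marker in the guard script records when a plan was created, not which plan it was. One possible extension, sketched here and not part of the script above, is to store a checksum of the plan file so that apply can refuse a plan file that was replaced after review. The key name `plan_sha256` and both function names are hypothetical, and `sha256sum` from GNU coreutils is assumed to be available.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Append the plan file's SHA-256 to its metadata marker (hypothetical key name).
record_plan_checksum() {
  local plan_path="$1" meta_path="$2"
  echo "plan_sha256=$(sha256sum "${plan_path}" | cut -d' ' -f1)" >> "${meta_path}"
}

# Return 0 only when the plan file still matches the most recently recorded checksum.
verify_plan_checksum() {
  local plan_path="$1" meta_path="$2" recorded actual
  recorded="$(sed -n 's/^plan_sha256=\([0-9a-f]\{64\}\)$/\1/p' "${meta_path}" | tail -n 1)"
  [[ -n "${recorded}" ]] || return 1
  actual="$(sha256sum "${plan_path}" | cut -d' ' -f1)"
  [[ "${recorded}" == "${actual}" ]]
}
```

In the guard script, `record_plan_checksum` would fit right after the metadata file is written in `plan` mode, and `verify_plan_checksum` right before `terraform apply` in `apply` mode.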

Here is the local validation baseline used before Terraform changes leave the workstation.

IaC hook baseline

default_install_hook_types:
  - pre-commit
  - pre-push
  - pre-merge-commit
  - prepare-commit-msg

repos:
  - repo: local
    hooks:
      - id: master-branch-check
        name: Protected branch guard
        entry: scripts/pre-commit-master-check.sh
        language: script
        always_run: true
        pass_filenames: false
        stages: [pre-commit, pre-push, pre-merge-commit]
        args:
          - --protected=master
          - --protected=main

      - id: prevent-amend-after-push
        name: Prevent amending pushed commits
        entry: scripts/prevent-amend-after-push.sh
        language: script
        always_run: true
        pass_filenames: false
        stages: [prepare-commit-msg]

  - repo: local
    hooks:
      - id: flux-kustomize-validate
        name: Flux kustomize validate
        entry: scripts/flux-kustomize-validate.sh
        language: script
        files: ^flux/.*\.ya?ml$
        pass_filenames: true
        require_serial: true
        stages: [pre-commit]

      - id: terraform-fmt
        name: Terraform format check
        entry: terraform fmt -recursive -diff -check
        language: system
        files: \.tf$
        pass_filenames: false
        stages: [pre-commit]

      - id: terraform-validate
        name: Terraform validate
        entry: scripts/terraform-validate.sh
        language: script
        files: \.(tf|tfvars)$
        pass_filenames: false
        require_serial: true
        stages: [pre-commit]

      - id: terraform-security
        name: Terraform security scan
        entry: scripts/terraform-security.sh
        language: script
        files: \.(tf|tfvars)$
        pass_filenames: false
        require_serial: true
        stages: [pre-commit]

  - repo: local
    hooks:
      - id: no-secrets
        name: Block sensitive files
        entry: scripts/block-secrets.sh
        language: script
        files: (kubeconfig|\.key$|\.pem$|credentials|\.env$)
        stages: [pre-commit]

  - repo: https://github.com/koalaman/shellcheck-precommit
    rev: v0.10.0
    hooks:
      - id: shellcheck
        files: \.sh$
        args: [--severity=warning]
        stages: [pre-commit]

  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1
    hooks:
      - id: yamllint
        files: \.ya?ml$
        args: [-d, relaxed]
        stages: [pre-commit]
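
To see which paths the `no-secrets` hook would select, the `files:` pattern from the configuration above can be tried directly in shell. This is a sketch: `is_sensitive_path` and `block_sensitive` are hypothetical names, and since the body of `scripts/block-secrets.sh` is not shown in this chapter, the blocking behavior here is only an assumption about what such a script might do.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Same pattern as the hook's `files:` filter in the pre-commit configuration.
SENSITIVE_PATTERN='(kubeconfig|\.key$|\.pem$|credentials|\.env$)'

# Return 0 when the path would be selected by the no-secrets hook filter.
is_sensitive_path() {
  [[ "$1" =~ ${SENSITIVE_PATTERN} ]]
}

# Print each sensitive path to stderr and fail if any was found.
# Assumed behavior only; the real script is not shown in this chapter.
block_sensitive() {
  local path found=0
  for path in "$@"; do
    if is_sensitive_path "${path}"; then
      echo "blocked: ${path}" >&2
      found=1
    fi
  done
  (( found == 0 ))
}
```

Note that the pattern matches any path containing `kubeconfig` or `credentials`, so a file such as `docs/credentials.md` is selected as well.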

System Context

This chapter gives the rest of the course a trustworthy infrastructure baseline.

Later chapters depend on this discipline:

Chapter 04 needs artifact promotion to land on stable infrastructure state
Chapter 05 turns this workflow into CI and approval policy
Chapter 17 depends on the same reviewed execution path when data changes become riskier

Core Concepts

Terraform structure and modules

root configuration should stay thin and readable
provider/module versions must be pinned
reusable logic belongs in modules, not copy/paste blocks

Remote state and locking

shared state enables team collaboration
locking prevents concurrent apply corruption
backend config is part of production reliability

IAM and RBAC principles

least privilege by default
separate read/plan/apply responsibilities
no broad credentials for automation or AI tooling

Drift detection

drift = actual infra != declared infra
detect drift before making unrelated changes
never hide drift by batching many changes together
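
A drift check can be sketched with `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error, and 2 when changes are present. The function names and status labels below are hypothetical. The sketch assumes the working directory holds the already-reviewed configuration, so any reported change means real infrastructure differs from what is declared.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map the exit code of `terraform plan -detailed-exitcode` to a label:
#   0 = no changes, 1 = error, 2 = changes present.
drift_status() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift-or-pending-changes" ;;
    *) echo "error" ;;
  esac
}

# Run a plan and report the status. Nothing is applied.
check_drift() {
  local workdir="$1" code=0
  terraform -chdir="${workdir}" plan -input=false -lock-timeout=5m \
    -detailed-exitcode >/dev/null || code=$?
  drift_status "${code}"
}
```

Running `check_drift infra/terraform/hcloud_cluster` before starting an unrelated change keeps drift visible as its own finding instead of folding it into the next diff.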

Safe destroy

destroy is valid, but only with explicit scope
always verify workspace, targets, and dependency impact
create a rollback/recreate plan before destructive actions
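
The scope check can be sketched as a small gate in front of a destroy plan. This is a sketch, assuming Terraform workspaces are used to separate environments; `assert_destroy_scope`, `plan_destroy`, and the plan file name `destroy.tfplan` are hypothetical. It only produces a reviewable destroy plan and destroys nothing.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Refuse to continue unless the operator-stated workspace matches the active one.
# Both values are passed in so the check itself has no side effects.
assert_destroy_scope() {
  local expected="$1" active="$2"
  if [[ -z "${expected}" || "${expected}" != "${active}" ]]; then
    echo "scope mismatch: expected '${expected}', active '${active}'" >&2
    return 1
  fi
}

# Create a reviewable destroy plan; nothing is destroyed at this step.
plan_destroy() {
  local workdir="$1" expected="$2" active
  active="$(terraform -chdir="${workdir}" workspace show)"
  assert_destroy_scope "${expected}" "${active}"
  terraform -chdir="${workdir}" plan -destroy -input=false -out destroy.tfplan
}
```

Making the operator type the expected workspace turns "I think I am in develop" into a checked statement, and the saved plan gives reviewers the exact list of resources that would be removed.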

Safe Workflow (Step-by-Step)

Read this chapter, lab.md, and the review checklist.
Install and run local hooks: make install-hooks && pre-commit run --all-files.
Generate a plan artifact and perform peer review.
Apply only from the reviewed/fresh plan artifact.
Run drift check and confirm expected state after apply.
Complete quiz.md and record operational evidence.

Pre-Commit Guardrails for IaC

Before Terraform changes are committed, hooks enforce:

terraform fmt -recursive -diff -check
scripts/terraform-validate.sh
scripts/terraform-security.sh
scripts/flux-kustomize-validate.sh (for any flux/** manifest changes in the same PR)

These checks reduce noisy reviews and block unsafe IaC changes before they reach CI/apply workflows.

CI Concurrency Guardrail (Example)

Use one apply lane per environment:

concurrency:
  group: terraform-${{ github.workflow }}-${{ inputs.environment }}
  cancel-in-progress: false

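This prevents overlapping apply jobs from mutating shared state concurrently. With cancel-in-progress: false, a running apply is allowed to finish, and a later run in the same group waits instead of interrupting it.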
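
Where applies run on a single shared host rather than in GitHub Actions, the same one-lane idea can be approximated with an atomic `mkdir` lock. This is a sketch only: `LANE_ROOT`, `acquire_lane`, and `release_lane` are hypothetical names, the lock is local to one machine, and it does not replace remote state locking.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical lock location; override by setting LANE_ROOT.
LANE_ROOT="${LANE_ROOT:-${TMPDIR:-/tmp}/terraform-lanes}"

# Try to take the single apply lane for an environment.
# mkdir is atomic, so only one caller can create the lock directory.
acquire_lane() {
  local environment="$1"
  mkdir -p "${LANE_ROOT}"
  mkdir "${LANE_ROOT}/${environment}.lock" 2>/dev/null
}

# Give the lane back once the apply has finished.
release_lane() {
  local environment="$1"
  rmdir "${LANE_ROOT}/${environment}.lock"
}
```

The sketch fails fast instead of retrying, which matches the guardrail: a caller that cannot get the lane should stop and re-plan rather than wait with an old plan in hand.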

Anti-Patterns to Avoid

Running terraform apply without reviewed plan.
Applying from stale plan output.
Sharing one credential set across all environments.
Using destroy in ambiguous context.

Done When

learner can explain and demonstrate plan -> review -> apply under lock discipline
learner can identify drift and choose reconcile vs rollback path
learner can state clear no-go conditions for destroy actions

Alternative: Local Development with Kind

The primary workflow uses Hetzner Cloud for production-like infrastructure. For local testing, CI environments, or learning without cloud costs, a Kind (Kubernetes in Docker) cluster provides the same Flux bootstrap path.

Kind Cluster Setup

The Kind cluster creates a 3-node topology:

1 control-plane node
2 worker nodes
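
Once the cluster is up, the expected topology can be checked from `kubectl get nodes --no-headers` output. This is a sketch: `topology_ok` is a hypothetical helper, and it assumes the third column holds the node roles, with `control-plane` for the control-plane node and `<none>` for workers.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reads `kubectl get nodes --no-headers` output on stdin.
# Succeeds only for exactly 1 control-plane node and 2 other nodes.
topology_ok() {
  awk '
    $3 ~ /control-plane/  { cp++ }
    $3 !~ /control-plane/ { workers++ }
    END { exit !(cp == 1 && workers == 2) }
  '
}
```

Usage would look like `kubectl get nodes --no-headers | topology_ok`, with a non-zero exit meaning the cluster does not match the expected layout.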

Local Registry Mirror

The Kind setup includes a local registry at localhost:5001 for image caching and local development builds.

Port Mappings

30080 -> 8080 (HTTP)
30443 -> 8443 (HTTPS)

Flux Operator Auto-Bootstrap

The Kind cluster uses the same Flux bootstrap path as Hetzner:

Flux Operator installs and reconciles the repository
Same namespace structure: develop, staging, production, observability
Same Kustomization overlays and HelmRelease definitions

When to Use Kind

Local testing before pushing to CI
CI pipeline integration tests
Learning the course without cloud costs
Validating Flux manifests against a real cluster

Kind Limitations

No real DNS resolution (use /etc/hosts or nip.io)
No real TLS certificates (self-signed only)
No Hetzner-specific features (CCM, CSI, load balancers)
Not suitable for performance testing or production simulation
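
For the DNS limitation, nip.io hostnames embed the target IP address, so `app.127.0.0.1.nip.io` resolves to `127.0.0.1` without editing /etc/hosts. A minimal sketch of a helper that builds such names follows; `nip_host` is a hypothetical name, and the approach assumes the machine can reach public DNS.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a hostname that nip.io resolves to the embedded IPv4 address,
# for example "app.127.0.0.1.nip.io" resolves to 127.0.0.1.
nip_host() {
  local name="$1" ip="${2:-127.0.0.1}"
  if ! [[ "${ip}" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
    echo "not an IPv4 address: ${ip}" >&2
    return 1
  fi
  echo "${name}.${ip}.nip.io"
}
```

The resulting name can be used as an ingress host in local overlays. On machines without outbound DNS, /etc/hosts entries remain the fallback.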

SafeOps Snapshot

Here is the local Kind cluster baseline used in the SafeOps system for low-cost rehearsal and CI-friendly testing.

Kind cluster layout

infra/terraform/kind_cluster/.gitignore
infra/terraform/kind_cluster/README.md
infra/terraform/kind_cluster/UPGRADE.md
infra/terraform/kind_cluster/main.tf
infra/terraform/kind_cluster/scripts/merge-kubeconfig.sh
infra/terraform/kind_cluster/templates/git-repository.yaml.tpl
infra/terraform/kind_cluster/templates/kustomization.yaml.tpl
infra/terraform/kind_cluster/values/components.yaml
infra/terraform/kind_cluster/variables.tf

Next Chapter

Continue with Chapter 03 (Secrets Management with SOPS).
