Core Track: guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • .pre-commit-config.yaml
  • guard-terraform-plan.sh
  • kind_cluster/

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Chapter 02: Infrastructure as Code (IaC)

Incident Hook

Two engineers run infrastructure changes close together under incident pressure. One apply acquires the lock; the second run retries and later applies a stale plan. The result is partial drift plus unexpected replacement of unrelated resources. Recovery takes longer because no one can prove which plan produced the final state.

Observed Symptoms

What the team sees first:

  • one apply job holds the lock while another waits or retries
  • the later apply changes resources nobody expected to touch
  • a fresh plan no longer matches the reviewed plan artifact

The warning sign is not only contention. It is contention plus uncertainty about execution intent.

Confusion Phase

Remote state locking looks like it should have protected the workflow. That is what makes this failure deceptive.

The team now has to answer two different questions:

  • Did Terraform behave incorrectly?
  • Or did the workflow allow an old plan to survive long enough to become dangerous?

Why This Chapter Exists

In production, infrastructure mistakes are expensive and fast-moving. IaC is not only about automation speed. It is about:

  • repeatability
  • reviewability
  • rollback paths
  • controlled blast radius

This chapter introduces a guardrails-first Terraform workflow for Kubernetes platforms.

Learning Objectives

By the end of this chapter, learners can:

  • explain module boundaries and Terraform folder structure in the course platform
  • run a safe plan -> review -> apply workflow
  • explain why remote state and locking are non-negotiable in team environments
  • detect drift and decide whether to reconcile or rollback
  • execute safe destroy practices with explicit scope checks

State Failure Story (Lock Contention)

Typical failure chain:

  1. pipeline A acquires state lock and applies.
  2. pipeline B waits, then retries from outdated assumptions.
  3. B applies stale plan after lock release.

Blast radius:

  • unintended resource replacement
  • drift hidden by unrelated changes
  • rollback uncertainty because state changed twice in short window

What AI Would Propose (Brave Junior)

  • “Run terraform apply directly, we already know the desired change.”
  • “If lock fails, retry until it succeeds.”
  • “Destroy and recreate is faster than careful rollback.”

Why this sounds reasonable:

  • looks faster in the moment
  • fewer review steps
  • immediate visible progress

Why This Is Dangerous

  • apply without reviewed plan removes the last safe checkpoint.
  • stale plan + concurrent runs create hard-to-debug infra divergence.
  • destroy shortcuts can expand blast radius across dependencies.

Investigation

Start by treating state and plan history as evidence, not memory.

Safe investigation sequence:

  1. identify every plan and apply job that touched the same environment
  2. compare the reviewed plan artifact with a fresh plan against current state
  3. confirm whether the later apply ran from assumptions older than the current state
  4. trace the workflow gap that allowed stale intent to remain executable
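
Step 2 of the sequence can be sketched as a small helper. This is a minimal sketch, not part of the SafeOps scripts: the function names `render_plan` and `plans_match` are hypothetical, and comparing rendered plan text is only one possible way to compare plans. It assumes the reviewed plan was rendered to a text file (for example at review time) with `terraform show -no-color`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: render a saved plan file as plain text.
# Requires terraform on PATH; nothing calls it until you do.
render_plan() {
  local workdir="$1" plan_file="$2"
  terraform -chdir="${workdir}" show -no-color "${plan_file}"
}

# Hypothetical helper: compare two rendered plans.
# Prints a unified diff and returns 1 when they differ.
plans_match() {
  local reviewed_txt="$1" fresh_txt="$2"
  if diff -u "${reviewed_txt}" "${fresh_txt}"; then
    echo "[plan-compare] fresh plan matches the reviewed plan"
    return 0
  fi
  echo "[plan-compare] fresh plan differs from the reviewed plan" >&2
  return 1
}
```

A possible use is `render_plan infra/terraform/hcloud_cluster tfplan > fresh.txt` followed by `plans_match reviewed.txt fresh.txt`. A non-empty diff suggests that state or configuration moved after the review, which is the evidence the investigation is looking for.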

The root cause here is usually workflow design, not Terraform syntax.

Containment

Containment starts by stopping overlap:

  1. pause concurrent applies for that environment
  2. generate a fresh plan from current state
  3. review only the corrective diff
  4. apply once from the fresh reviewed plan

Only after state is trustworthy again should the team tune concurrency, approvals, or destroy policy.

Guardrails That Stop It

  • mandatory plan -> review -> apply sequence, never direct apply.
  • remote state locking is required for team workflows.
  • CI/apply pipeline concurrency must be 1 per environment.
  • destroy is deny-by-default outside develop, unless a break-glass record is approved.
  • every destructive action must include recreate/rollback evidence first.

Break-Glass Minimum Record (Destroy Outside Develop)

If destroy is required outside develop, the record must include:

  • incident/ticket reference
  • exact scope (workspace/resource/module)
  • expected impact and rollback/recreate plan
  • approver identity and time window
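
The record requirements above could be enforced mechanically rather than by convention. The sketch below assumes a hypothetical record format of one `key=value` pair per line; the key names, the file format, and the functions `validate_break_glass_record` and `destroy_allowed` are illustrative and do not come from the course repository. It only checks that the fields are present. Whether the record was actually approved still has to be verified by a person or by the ticketing system.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical record format: one key=value per line.
REQUIRED_KEYS=(ticket scope impact rollback_plan approver window_start window_end)

# Return 0 only if every required key is present with a non-empty value.
validate_break_glass_record() {
  local record="$1" key missing=0
  if [[ ! -f "${record}" ]]; then
    echo "[break-glass] record not found: ${record}" >&2
    return 1
  fi
  for key in "${REQUIRED_KEYS[@]}"; do
    if ! grep -Eq "^${key}=.+" "${record}"; then
      echo "[break-glass] missing field: ${key}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Deny-by-default gate: develop is allowed, anything else needs a complete record.
destroy_allowed() {
  local environment="$1" record="${2:-}"
  if [[ "${environment}" == "develop" ]]; then
    return 0
  fi
  if [[ -z "${record}" ]]; then
    echo "[break-glass] destroy denied for ${environment}: no record supplied" >&2
    return 1
  fi
  validate_break_glass_record "${record}"
}
```

A wrapper around destroy could call `destroy_allowed production path/to/record` and refuse to continue on a non-zero exit status.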

Investigation Snapshots

Here is the plan/apply guard used in the SafeOps system. This is where “always run plan before apply” becomes executable policy instead of team etiquette.

Terraform plan/apply guard

#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
usage:
  scripts/guard-terraform-plan.sh plan  --dir <path> [--out <planfile>]
  scripts/guard-terraform-plan.sh apply --dir <path> [--out <planfile>] [--max-age-minutes <n>]

Guardrail wrapper for Terraform plan/apply.
- `plan` creates a planfile and metadata marker.
- `apply` refuses to run unless a fresh planfile + metadata marker exist.

Examples:
  scripts/guard-terraform-plan.sh plan --dir infra/terraform/hcloud_cluster --out tfplan
  scripts/guard-terraform-plan.sh apply --dir infra/terraform/hcloud_cluster --out tfplan --max-age-minutes 60
EOF
}

if [[ $# -lt 1 ]]; then
  usage >&2
  exit 2
fi

if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
  # Requested help is normal output, so print it to stdout.
  usage
  exit 0
fi

MODE="$1"
shift

WORKDIR=""
PLAN_FILE="tfplan"
MAX_AGE_MINUTES="120"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --dir)
      WORKDIR="${2:-}"
      shift 2
      ;;
    --out)
      PLAN_FILE="${2:-}"
      shift 2
      ;;
    --max-age-minutes)
      MAX_AGE_MINUTES="${2:-}"
      shift 2
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      echo "[guard-tf] unknown argument: $1" >&2
      usage >&2
      exit 2
      ;;
  esac
done

if [[ -z "${WORKDIR}" ]]; then
  echo "[guard-tf] --dir is required" >&2
  usage >&2
  exit 2
fi

if ! command -v terraform >/dev/null 2>&1; then
  echo "[guard-tf] terraform not found in PATH" >&2
  exit 1
fi

if ! [[ -d "${WORKDIR}" ]]; then
  echo "[guard-tf] directory not found: ${WORKDIR}" >&2
  exit 1
fi

PLAN_PATH="${WORKDIR}/${PLAN_FILE}"
META_PATH="${PLAN_PATH}.meta"

case "${MODE}" in
  plan)
    terraform -chdir="${WORKDIR}" init -input=false
    terraform -chdir="${WORKDIR}" plan -input=false -lock-timeout=5m -out "${PLAN_FILE}"
    {
      echo "created_at_epoch=$(date +%s)"
      echo "workdir=${WORKDIR}"
      echo "plan_file=${PLAN_FILE}"
    } > "${META_PATH}"
    echo "[guard-tf] plan created: ${PLAN_PATH}"
    echo "[guard-tf] metadata created: ${META_PATH}"
    ;;
  apply)
    if [[ ! -f "${PLAN_PATH}" ]]; then
      echo "[guard-tf] missing plan file: ${PLAN_PATH}" >&2
      echo "[guard-tf] run: scripts/guard-terraform-plan.sh plan --dir ${WORKDIR} --out ${PLAN_FILE}" >&2
      exit 1
    fi
    if [[ ! -f "${META_PATH}" ]]; then
      echo "[guard-tf] missing plan metadata: ${META_PATH}" >&2
      echo "[guard-tf] refusing apply without plan marker" >&2
      exit 1
    fi

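    # Read only the timestamp from the marker instead of sourcing the file,
    # so a modified marker cannot run arbitrary shell code.
    created_at_epoch="$(sed -n 's/^created_at_epoch=\([0-9][0-9]*\)$/\1/p' "${META_PATH}" | head -n 1)"
    if [[ -z "${created_at_epoch}" ]]; then
      echo "[guard-tf] invalid plan metadata: ${META_PATH}" >&2
      exit 1
    fi
    NOW_EPOCH="$(date +%s)"
    AGE_SECONDS="$((NOW_EPOCH - created_at_epoch))"
    AGE_MINUTES="$((AGE_SECONDS / 60))"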

    if (( AGE_MINUTES > MAX_AGE_MINUTES )); then
      echo "[guard-tf] plan is too old (${AGE_MINUTES}m > ${MAX_AGE_MINUTES}m)" >&2
      echo "[guard-tf] re-run plan before apply" >&2
      exit 1
    fi

    terraform -chdir="${WORKDIR}" apply -input=false "${PLAN_FILE}"
    echo "[guard-tf] apply completed using ${PLAN_PATH}"
    ;;
  *)
    echo "[guard-tf] unknown mode: ${MODE}" >&2
    usage >&2
    exit 2
    ;;
esac
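
The guard above checks the age of the plan, not its content. One possible extension, shown here as a sketch under stated assumptions, is to record a checksum of the plan file in the metadata marker so that the apply step can show it is using the exact file that was reviewed. The `plan_sha256` key and the three functions are hypothetical additions, not part of the SafeOps script, and the sketch assumes GNU coreutils `sha256sum` is available.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the SHA-256 of a file (assumes GNU coreutils sha256sum).
plan_sha256() {
  sha256sum "$1" | awk '{print $1}'
}

# Append the plan checksum to the metadata marker.
record_plan_checksum() {
  local plan_path="$1" meta_path="$2"
  echo "plan_sha256=$(plan_sha256 "${plan_path}")" >> "${meta_path}"
}

# Return 0 only if the plan file still matches the recorded checksum.
verify_plan_checksum() {
  local plan_path="$1" meta_path="$2" expected actual
  expected="$(sed -n 's/^plan_sha256=//p' "${meta_path}" | tail -n 1)"
  if [[ -z "${expected}" ]]; then
    echo "[guard-tf] no plan checksum in ${meta_path}" >&2
    return 1
  fi
  actual="$(plan_sha256 "${plan_path}")"
  if [[ "${actual}" != "${expected}" ]]; then
    echo "[guard-tf] plan file changed since it was recorded" >&2
    return 1
  fi
}
```

In this sketch, `record_plan_checksum` would run at the end of the `plan` branch and `verify_plan_checksum` just before `terraform apply`. Logging the checksum alongside the apply job would also help answer the question of which plan produced the final state.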

Here is the local validation baseline used before Terraform changes leave the workstation.

IaC hook baseline

default_install_hook_types:
  - pre-commit
  - pre-push
  - pre-merge-commit
  - prepare-commit-msg

repos:
  - repo: local
    hooks:
      - id: master-branch-check
        name: Protected branch guard
        entry: scripts/pre-commit-master-check.sh
        language: script
        always_run: true
        pass_filenames: false
        stages: [pre-commit, pre-push, pre-merge-commit]
        args:
          - --protected=master
          - --protected=main

      - id: prevent-amend-after-push
        name: Prevent amending pushed commits
        entry: scripts/prevent-amend-after-push.sh
        language: script
        always_run: true
        pass_filenames: false
        stages: [prepare-commit-msg]

  - repo: local
    hooks:
      - id: flux-kustomize-validate
        name: Flux kustomize validate
        entry: scripts/flux-kustomize-validate.sh
        language: script
        files: ^flux/.*\.ya?ml$
        pass_filenames: true
        require_serial: true
        stages: [pre-commit]

      - id: terraform-fmt
        name: Terraform format check
        entry: terraform fmt -recursive -diff -check
        language: system
        files: \.tf$
        pass_filenames: false
        stages: [pre-commit]

      - id: terraform-validate
        name: Terraform validate
        entry: scripts/terraform-validate.sh
        language: script
        files: \.(tf|tfvars)$
        pass_filenames: false
        require_serial: true
        stages: [pre-commit]

      - id: terraform-security
        name: Terraform security scan
        entry: scripts/terraform-security.sh
        language: script
        files: \.(tf|tfvars)$
        pass_filenames: false
        require_serial: true
        stages: [pre-commit]

  - repo: local
    hooks:
      - id: no-secrets
        name: Block sensitive files
        entry: scripts/block-secrets.sh
        language: script
        files: (kubeconfig|\.key$|\.pem$|credentials|\.env$)
        stages: [pre-commit]

  - repo: https://github.com/koalaman/shellcheck-precommit
    rev: v0.10.0
    hooks:
      - id: shellcheck
        files: \.sh$
        args: [--severity=warning]
        stages: [pre-commit]

  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1
    hooks:
      - id: yamllint
        files: \.ya?ml$
        args: [-d, relaxed]
        stages: [pre-commit]
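
The configuration references `scripts/block-secrets.sh` without showing it. As an illustration only, a hook of this kind could look like the sketch below. The function names and the behavior are assumptions, not the course's actual script; the pattern is copied from the `files:` filter of the `no-secrets` hook, and the sketch relies on pre-commit passing the matching filenames as arguments.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Same pattern as the `files:` filter of the no-secrets hook.
SENSITIVE_PATTERN='(kubeconfig|\.key$|\.pem$|credentials|\.env$)'

# Return 0 if the path looks like a sensitive file.
is_sensitive() {
  [[ "$1" =~ ${SENSITIVE_PATTERN} ]]
}

# Hypothetical hook body: report every sensitive path, then fail if any was found.
block_sensitive_files() {
  local path blocked=0
  for path in "$@"; do
    if is_sensitive "${path}"; then
      echo "[no-secrets] refusing to commit: ${path}" >&2
      blocked=1
    fi
  done
  return "${blocked}"
}
```

A script built this way would end with `block_sensitive_files "$@"`, so that any matching file produces a non-zero exit status and pre-commit aborts the commit.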

System Context

This chapter gives the rest of the course a trustworthy infrastructure baseline.

Later chapters depend on this discipline:

  • Chapter 04 needs artifact promotion to land on stable infrastructure state
  • Chapter 05 turns this workflow into CI and approval policy
  • Chapter 17 depends on the same reviewed execution path when data changes become riskier

Core Concepts

  1. Terraform structure and modules
     • root configuration should stay thin and readable
     • provider/module versions must be pinned
     • reusable logic belongs in modules, not copy/paste blocks
  2. Remote state and locking
     • shared state enables team collaboration
     • locking prevents concurrent apply corruption
     • backend config is part of production reliability
  3. IAM and RBAC principles
     • least privilege by default
     • separate read/plan/apply responsibilities
     • no broad credentials for automation or AI tooling
  4. Drift detection
     • drift = actual infra != declared infra
     • detect drift before making unrelated changes
     • never hide drift by batching many changes together
  5. Safe destroy
     • destroy is valid, but only with explicit scope
     • always verify workspace, targets, and dependency impact
     • create a rollback/recreate plan before destructive actions
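
Drift detection can be scripted with `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 1 on error, and 2 when the plan contains changes. The sketch below is a minimal illustration; `classify_plan_exit` and `check_drift` are hypothetical names. Exit code 2 means "changes present", so it indicates drift only when the configuration itself has not changed since the last apply.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map the exit code of `terraform plan -detailed-exitcode` to a label.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;    # plan succeeded, no changes
    2) echo "changes" ;;  # plan succeeded, changes present
    *) echo "error" ;;    # plan failed
  esac
}

# Hypothetical drift check; requires terraform on PATH and an initialized directory.
check_drift() {
  local workdir="$1" code=0
  terraform -chdir="${workdir}" plan -input=false -lock-timeout=5m -detailed-exitcode >/dev/null || code=$?
  classify_plan_exit "${code}"
}
```

Running `check_drift` on an unchanged configuration before starting unrelated work gives a simple signal: anything other than `clean` deserves a look before new changes are layered on top.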

Safe Workflow (Step-by-Step)

  1. Read this chapter, lab.md, and the review checklist.
  2. Install and run local hooks: make install-hooks && pre-commit run --all-files.
  3. Generate a plan artifact and perform peer review.
  4. Apply only from the reviewed/fresh plan artifact.
  5. Run drift check and confirm expected state after apply.
  6. Complete quiz.md and record operational evidence.
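
Steps 2 to 5 can be sketched as two separate entry points, so that peer review sits between them as a human step rather than a line in a script. This is an illustrative sketch: `run`, `plan_step`, `apply_step`, and the `DRY_RUN` variable are hypothetical, while the commands they wrap are the ones named in this chapter. By default the functions only print what they would run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print commands instead of running them unless DRY_RUN=0.
run() {
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 2 and 3: local hooks, then a plan artifact for peer review.
plan_step() {
  local workdir="$1" plan_file="${2:-tfplan}"
  run pre-commit run --all-files
  run scripts/guard-terraform-plan.sh plan --dir "${workdir}" --out "${plan_file}"
}

# Steps 4 and 5: apply the reviewed plan, then check for remaining changes.
apply_step() {
  local workdir="$1" plan_file="${2:-tfplan}"
  run scripts/guard-terraform-plan.sh apply --dir "${workdir}" --out "${plan_file}" --max-age-minutes 60
  run terraform -chdir="${workdir}" plan -input=false -detailed-exitcode
}
```

For example, `plan_step infra/terraform/hcloud_cluster` prints the planning commands, and `DRY_RUN=0 apply_step infra/terraform/hcloud_cluster` would execute the apply and the follow-up check once the plan has been reviewed.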

Pre-Commit Guardrails for IaC

Before Terraform changes are committed, hooks enforce:

  • terraform fmt -recursive -diff -check
  • scripts/terraform-validate.sh
  • scripts/terraform-security.sh
  • scripts/flux-kustomize-validate.sh (for any flux/** manifest changes in the same PR)

These checks reduce noisy reviews and block unsafe IaC changes before they reach CI/apply workflows.
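
The contents of `scripts/terraform-validate.sh` are not shown in this chapter. As a rough idea of what such a script might do, the sketch below finds every directory that contains `.tf` files and validates each one without touching the remote backend. The function names are hypothetical and the real script may work differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# List unique directories under a root that contain at least one .tf file,
# skipping downloaded modules and providers under .terraform/.
tf_dirs() {
  local root="$1"
  find "${root}" -type f -name '*.tf' ! -path '*/.terraform/*' -exec dirname {} \; | sort -u
}

# Hypothetical validate loop; requires terraform on PATH.
validate_all() {
  local root="$1" dir
  while IFS= read -r dir; do
    # -backend=false keeps validation local: no state access, no lock.
    terraform -chdir="${dir}" init -backend=false -input=false >/dev/null
    terraform -chdir="${dir}" validate
  done < <(tf_dirs "${root}")
}
```

Initializing with `-backend=false` is a deliberate choice in this sketch: a pre-commit check should not need credentials for shared state.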

CI Concurrency Guardrail (Example)

Use one apply lane per environment:

concurrency:
  group: terraform-${{ github.workflow }}-${{ inputs.environment }}
  cancel-in-progress: false

This prevents overlapping apply jobs from mutating shared state concurrently.
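
The same one-lane idea can be approximated on a single shared runner or workstation with a lock file. The sketch below is an assumption-laden illustration, not a replacement for remote state locking or CI concurrency groups: it uses `flock` from util-linux, a hypothetical `with_env_lock` wrapper, and a hypothetical `LOCK_DIR` variable. Unlike the CI setting above, it fails fast instead of queueing; replacing `-n` with `-w <seconds>` would make it wait.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run a command while holding an exclusive, per-environment lock file.
# Fails fast (exit 75) if another run already holds the lock.
with_env_lock() {
  local environment="$1"
  shift
  local lock_file="${LOCK_DIR:-/tmp}/terraform-apply-${environment}.lock"
  (
    if ! flock -n 9; then
      echo "[apply-lane] another apply holds ${lock_file}" >&2
      exit 75
    fi
    "$@"
  ) 9>"${lock_file}"
}
```

A possible call is `with_env_lock develop scripts/guard-terraform-plan.sh apply --dir infra/terraform/hcloud_cluster --out tfplan`. The lock only serializes runs on one machine, so it complements the backend lock rather than replacing it.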

Anti-Patterns to Avoid

  • Running terraform apply without reviewed plan.
  • Applying from stale plan output.
  • Sharing one credential set across all environments.
  • Using destroy in ambiguous context.

Done When

  • learner can explain and demonstrate plan -> review -> apply under lock discipline
  • learner can identify drift and choose reconcile vs rollback path
  • learner can state clear no-go conditions for destroy actions

Alternative: Local Development with Kind

The primary workflow uses Hetzner Cloud for production-like infrastructure. For local testing, CI environments, or learning without cloud costs, a Kind (Kubernetes in Docker) cluster provides the same Flux bootstrap path.

Kind Cluster Setup

The Kind cluster creates a 3-node topology:

  • 1 control-plane node
  • 2 worker nodes

Local Registry Mirror

The Kind setup includes a local registry at localhost:5001 for image caching and local development builds.

Port Mappings

  • 30080 -> 8080 (HTTP)
  • 30443 -> 8443 (HTTPS)
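
For orientation, the topology and port mappings described so far correspond roughly to the Kind configuration printed by the sketch below. The course provisions the cluster through Terraform in `infra/terraform/kind_cluster/`, so this is only an illustration. It assumes that the mappings read as node port to host port (30080 on the node, 8080 on the host) and that they are attached to the control-plane node. The function name `render_kind_config` is hypothetical, and the local registry is left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print a Kind cluster config: 1 control-plane node, 2 workers.
# Assumption: the listed mappings mean node port -> host port.
render_kind_config() {
  cat <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30080
        hostPort: 8080
        protocol: TCP
      - containerPort: 30443
        hostPort: 8443
        protocol: TCP
  - role: worker
  - role: worker
EOF
}
```

Outside the Terraform path, the output could be saved with `render_kind_config > kind-config.yaml` and passed to `kind create cluster --config kind-config.yaml`.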

Flux Operator Auto-Bootstrap

The Kind cluster uses the same Flux bootstrap path as Hetzner:

  • Flux Operator installs and reconciles the repository
  • Same namespace structure: develop, staging, production, observability
  • Same Kustomization overlays and HelmRelease definitions

When to Use Kind

  • Local testing before pushing to CI
  • CI pipeline integration tests
  • Learning the course without cloud costs
  • Validating Flux manifests against a real cluster

Kind Limitations

  • No real DNS resolution (use /etc/hosts or nip.io)
  • No real TLS certificates (self-signed only)
  • No Hetzner-specific features (CCM, CSI, load balancers)
  • Not suitable for performance testing or production simulation

SafeOps Snapshot

Here is the local Kind cluster baseline used in the SafeOps system for low-cost rehearsal and CI-friendly testing.

Kind cluster layout

  • infra/terraform/kind_cluster/.gitignore
  • infra/terraform/kind_cluster/README.md
  • infra/terraform/kind_cluster/UPGRADE.md
  • infra/terraform/kind_cluster/main.tf
  • infra/terraform/kind_cluster/scripts/merge-kubeconfig.sh
  • infra/terraform/kind_cluster/templates/git-repository.yaml.tpl
  • infra/terraform/kind_cluster/templates/kustomization.yaml.tpl
  • infra/terraform/kind_cluster/values/components.yaml
  • infra/terraform/kind_cluster/variables.tf

Next Chapter

Continue with Chapter 03 (Secrets Management with SOPS).

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Chapter 02 Quiz: Infrastructure as Code (IaC)
  • Drift Detection Playbook (Chapter 02)
  • Lab: Safe Terraform Workflow for Production-Like Kubernetes
  • Terraform Plan Review Checklist (Guardrails-First)