Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • .pre-commit-config.yaml
  • main.tf

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Incident Hook

Two engineers run infrastructure changes close together under incident pressure. One apply acquires the state lock, while the second run retries and later applies a stale plan.

Result: Partial drift plus unexpected replacement of unrelated resources. Recovery takes longer because no one can prove which plan produced the final state.

Observed Symptoms

What the team sees first:

  • One apply job holds the lock while another waits or retries.
  • The later apply changes resources nobody expected to touch.

The “Stale Plan” Warning Sign:

# terraform apply output
# module.db.hcloud_server.database must be replaced
-/+ resource "hcloud_server" "database" {
      ~ name = "db-prod" -> "db-staging" # ❌ WRONG: Stale plan applying staging values to prod
      + image = "ubuntu-22.04"
      - delete_protection = true -> null # ❌ DANGER: Protection removed
    }

Plan: 1 to add, 0 to change, 1 to destroy.

The warning sign is not only contention; it is contention plus uncertainty about execution intent.
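
One way to surface this warning sign before anyone applies is to scan the machine-readable plan for destructive actions. The sketch below is a minimal illustration, assuming the plan has been exported with `terraform show -json` and that only the `resource_changes[].address` and `resource_changes[].change.actions` fields are needed. The sample data and the `module.web.hcloud_server.app` address are hypothetical and exist only to make the example runnable.

```python
from __future__ import annotations

import json


def find_destructive_changes(plan_json: str) -> list[tuple[str, str]]:
    """Return (address, kind) for every change that deletes an existing resource."""
    plan = json.loads(plan_json)
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            # A replacement lists both "delete" and "create";
            # a plain destroy lists only "delete".
            kind = "replace" if "create" in actions else "destroy"
            findings.append((rc["address"], kind))
    return findings


# Hypothetical sample, trimmed to the fields this sketch reads.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "module.db.hcloud_server.database",
         "change": {"actions": ["delete", "create"]}},
        {"address": "module.web.hcloud_server.app",  # hypothetical resource
         "change": {"actions": ["update"]}},
    ]
})

if __name__ == "__main__":
    for address, kind in find_destructive_changes(SAMPLE_PLAN):
        print(f"{kind}: {address}")
```

A check like this does not prove a plan is stale. It only flags that the plan destroys something, which is the moment to ask whether the plan still reflects current intent.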

Confusion Phase

Remote state locking looks like it should have protected the workflow. That is what makes this failure deceptive. The team now has to answer:

  • Did Terraform behave incorrectly?
  • Or did the workflow allow an old plan to survive long enough to become dangerous?

State Failure Story (Lock Contention)

Typical failure chain:

  1. Pipeline A acquires the state lock and applies.
  2. Pipeline B waits, then retries from outdated assumptions.
  3. Pipeline B applies a stale plan after the lock is released.

Blast radius:

  • Unintended resource replacement.
  • Drift hidden by unrelated changes.
  • Rollback uncertainty because state changed twice in a short window.
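
The failure chain above can be replayed with a toy model. This is a sketch under stated assumptions, not a description of Terraform internals: `State`, `Plan`, `make_plan`, and `apply_plan` are hypothetical names, and the `serial` counter only loosely mirrors the serial number Terraform keeps in its state file. The point it illustrates is that a lock orders the two writes but says nothing about whether the second plan still matches the state it is written to.

```python
from dataclasses import dataclass


@dataclass
class State:
    serial: int = 0          # bumped on every successful write
    name: str = "db-prod"


@dataclass(frozen=True)
class Plan:
    base_serial: int         # serial of the state the plan was computed from
    new_name: str


class StalePlanError(Exception):
    """Raised when a plan was computed against an older state."""


def make_plan(state: State, new_name: str) -> Plan:
    return Plan(base_serial=state.serial, new_name=new_name)


def apply_plan(state: State, plan: Plan, check_stale: bool = True) -> None:
    # The guard: refuse any plan whose base state is no longer current.
    if check_stale and plan.base_serial != state.serial:
        raise StalePlanError(
            f"plan based on serial {plan.base_serial}, state is at {state.serial}"
        )
    state.name = plan.new_name
    state.serial += 1


if __name__ == "__main__":
    state = State()
    plan_b = make_plan(state, "db-staging")           # Pipeline B plans early
    apply_plan(state, make_plan(state, "db-prod-v2"))  # Pipeline A applies first
    try:
        apply_plan(state, plan_b)                      # Pipeline B, after the lock frees
    except StalePlanError as err:
        print("rejected:", err)
```

With `check_stale=False` the second apply goes through and overwrites Pipeline A's result, which is the double state change described in the blast radius.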

What AI Would Propose (Brave Junior)

  • “Run terraform apply directly, we already know the desired change.”
  • “If lock fails, retry until it succeeds.”
  • “Destroy and recreate is faster than careful rollback.”

Why This Is Dangerous

  • Running apply without a reviewed plan removes the last safe checkpoint.
  • Stale plans combined with concurrent runs create hard-to-debug infrastructure divergence.
  • Destroy shortcuts can expand the blast radius across dependencies.
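
As a contrast to the proposals above, here is a minimal sketch of a wrapper that keeps the review checkpoint and refuses to loop on failure. It assumes the standard `terraform plan -out=...` and `terraform apply <planfile>` flow with the `-input=false` and `-lock-timeout` options. The plan file name `tfplan`, the 60-second timeout, the `guarded_apply` function, and the `approve` callback are illustrative assumptions, not part of this chapter's lab code.

```python
import subprocess
from typing import Callable, Sequence

PLAN_FILE = "tfplan"  # hypothetical file name
PLAN_CMD = ["terraform", "plan", "-input=false", "-lock-timeout=60s", f"-out={PLAN_FILE}"]
APPLY_CMD = ["terraform", "apply", "-input=false", "-lock-timeout=60s", PLAN_FILE]

Runner = Callable[[Sequence[str]], int]


def real_runner(argv: Sequence[str]) -> int:
    """Run a command and return its exit code (requires terraform on PATH)."""
    return subprocess.run(list(argv)).returncode


def guarded_apply(run: Runner, approve: Callable[[], bool]) -> str:
    """Plan, wait for an explicit approval, then apply exactly that saved plan."""
    if run(PLAN_CMD) != 0:
        return "plan-failed"
    if not approve():
        # No approval means no apply: the reviewed plan is the checkpoint.
        return "not-approved"
    if run(APPLY_CMD) != 0:
        # Deliberately no retry loop. A failed apply (for example, the lock
        # was still held after the timeout) sends the change back to a
        # fresh plan and a fresh review.
        return "apply-failed-replan-required"
    return "applied"
```

A pipeline would call `guarded_apply(real_runner, approve)` with whatever approval mechanism it already has. The design choice worth noticing is that waiting for the lock is bounded by `-lock-timeout`, and anything beyond that bound ends the run instead of reusing the old plan.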

Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?