Checkpoint B: Your Runtime Safety Net

You have completed Chapters 01-09. Before moving into observability and operations, pause and look at the runtime constraints you have built.

This page is a consolidation — not new material. Its purpose is to show how the four hardening chapters (06-09) work together to contain failure.


What You Have Built

Your workloads now run inside a layered safety net. Each layer limits blast radius in a different direction:

                ┌─────────────────────────────────────┐
                │   Chapter 06: Network Policies      │
                │   Who can talk to whom              │
                ├─────────────────────────────────────┤
                │   Chapter 07: Security Context      │
                │   How pods run (non-root, read-only)│
                ├─────────────────────────────────────┤
                │   Chapter 08: Resource Management   │
                │   How much CPU and memory           │
                ├─────────────────────────────────────┤
                │   Chapter 09: Availability          │
                │   How many replicas, when           │
                └─────────────────────────────────────┘
                  ▲
                  │ Workloads deployed via Chapters 01-05

How The Layers Interact

The layers are not independent — they compose:

  • A pod hardened by security context (Ch 07) runs within the network boundaries set by network policies (Ch 06)
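  • Resource requests and limits (Ch 08) determine each pod's QoS class, which influences which pods are evicted first under node pressure; requests are also the baseline HPA uses to compute utilization, while PDBs (Ch 09) limit only voluntary evictions such as node drains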
  • A NetworkPolicy (Ch 06) that blocks egress to the metrics endpoint breaks observability — which is why Chapter 10 comes next
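
As a rough illustration of how these layers meet in a single manifest, the sketch below shows one Deployment carrying the Chapter 07 security context and the Chapter 08 resources, with a pod label that Chapter 06 policies could select. The names (orders-api, shop), the image, the UID, and the resource values are hypothetical and are not taken from the earlier chapters.

```yaml
# Hypothetical workload: names, image, UID, and values are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: shop
spec:
  replicas: 2                      # Ch 09: more than one replica
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api            # Ch 06: NetworkPolicies select pods by label
    spec:
      securityContext:
        runAsNonRoot: true         # Ch 07: refuse to start as UID 0
        runAsUser: 10001           # assumed non-root UID
      containers:
        - name: app
          image: registry.example.com/orders-api:1.0.0
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
          resources:               # Ch 08
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 250m
              memory: 256Mi
```

Because requests equal limits for CPU and memory on the only container, this pod would be placed in the Guaranteed QoS class; setting requests below the limits would make it Burstable.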

The Guardrails You Have in Place

Guardrail                             Source      What It Contains
------------------------------------  ----------  ------------------------------------------------
Default-deny NetworkPolicy            Chapter 06  Lateral movement between compromised pods
runAsNonRoot: true                    Chapter 07  Root-level exploits from a compromised container
readOnlyRootFilesystem: true          Chapter 07  Persistent filesystem tampering
Resource requests and limits          Chapter 08  Noisy-neighbor starvation and OOM cascades
minReplicas: 2 for critical services  Chapter 09  Single-node failure becoming an outage
PDB maxUnavailable                    Chapter 09  Node drains causing service downtime
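
For reference, three of these guardrails might look like the manifests below. This is a minimal sketch, not the chapters' exact configuration: the namespace shop, the hypothetical Deployment named orders-api, and the replica and utilization numbers are assumptions chosen for illustration.

```yaml
# Ch 06: deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: shop                  # assumed namespace
spec:
  podSelector: {}                  # empty selector matches all pods
  policyTypes:
    - Ingress
    - Egress
---
# Ch 09: never scale a critical service below two replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
  namespace: shop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api               # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 6                   # assumed ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, assumed
---
# Ch 09: allow at most one pod to be voluntarily evicted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api
  namespace: shop
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: orders-api
```

A default-deny policy that covers Egress also blocks DNS lookups until an explicit allow rule is added, so it is normally paired with per-workload allow policies.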

Self-Check

You are ready for Chapter 10 when you can answer the following without looking:

  • Why is default-deny safer than allow-all, even though it creates more up-front work?
  • What is the standard pattern for writable /tmp when readOnlyRootFilesystem: true is enforced?
  • What evidence do you need before raising a pod’s memory limit?
  • Why is minReplicas: 1 a reliability regression, even for a deployment that “never fails”?
  • If a PDB blocks a node drain, what does that tell you about the deployment’s replica count?

If any of these are unclear, revisit the relevant chapter before moving forward.


What Comes Next

You can now deliver code safely (Ch 01-05) and your workloads have runtime constraints (Ch 06-09). The next block (Chapters 10-14) shifts to what happens when something goes wrong:

  • Chapter 10 — Observability: how you see what is happening
  • Chapter 11 — Backup and Restore: how you recover from data loss
  • Chapter 12 — Controlled Chaos: how you rehearse failure
  • Chapter 13 — AI-Assisted SRE Guardian: how you route incidents safely
  • Chapter 14 — 24/7 Production SRE: how humans coordinate response

These five chapters turn your running system into an operable, observable, recoverable production platform.