Checkpoint B: Your Runtime Safety Net
You have completed Chapters 01-09. Before moving into observability and operations, pause and look at the runtime constraints you have built.
This page is a consolidation — not new material. Its purpose is to show how the four hardening chapters (06-09) work together to contain failure.
What You Have Built
Your workloads now run inside a layered safety net. Each layer limits blast radius in a different direction:
┌─────────────────────────────────────┐
│ Chapter 06: Network Policies │
│ Who can talk to whom │
├─────────────────────────────────────┤
│ Chapter 07: Security Context │
│ How pods run (non-root, read-only)│
├─────────────────────────────────────┤
│ Chapter 08: Resource Management │
│ How much CPU and memory │
├─────────────────────────────────────┤
│ Chapter 09: Availability │
│ How many replicas, when │
└─────────────────────────────────────┘
▲
│ Workloads deployed via Chapters 01-05
How The Layers Interact
The layers are not independent — they compose:
- A pod hardened by security context (Ch 07) runs within the network boundaries set by network policies (Ch 06)
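- Resource requests and limits (Ch 08) determine a pod's QoS class, which affects which pods are evicted first under node pressure. Requests are also the baseline the HPA (Ch 09) measures utilization against, while the PDB (Ch 09) caps how many replicas a voluntary disruption such as a node drain can remove at once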
- A `NetworkPolicy` (Ch 06) that blocks egress to the metrics endpoint breaks observability, which is why Chapter 10 comes next
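
The last interaction above can be sketched as a policy. This is a minimal illustration, not the manifest from Chapter 06: the namespace `app`, the target namespace `monitoring`, and port `9090` are all assumptions chosen for the example. NetworkPolicies are additive, so a rule like this opens one path on top of a default-deny policy without weakening the rest of it.

```yaml
# Hypothetical names throughout: namespace "app", namespace "monitoring", port 9090.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metrics-egress
  namespace: app
spec:
  podSelector: {}            # applies to every pod in the "app" namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              # Label set automatically on namespaces by recent Kubernetes versions
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9090         # assumed metrics endpoint port
```

If the monitoring namespace enforces its own default-deny policy, it would likely need a matching ingress rule as well.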
The Guardrails You Have in Place
| Guardrail | Source | What It Contains |
|---|---|---|
| Default-deny NetworkPolicy | Chapter 06 | Lateral movement between compromised pods |
| `runAsNonRoot: true` | Chapter 07 | Root-level exploits from a compromised container |
| `readOnlyRootFilesystem: true` | Chapter 07 | Persistent filesystem tampering |
| Resource requests and limits | Chapter 08 | Noisy-neighbor starvation and OOM cascades |
| `minReplicas: 2` for critical services | Chapter 09 | Single-node failure becoming an outage |
| PDB `maxUnavailable` | Chapter 09 | Node drains causing service downtime |
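
As a reminder of where each guardrail lives, the table can be condensed into one set of manifests. This is a hedged sketch rather than the manifests from Chapters 06-09: the names `app` and `web`, the image `example.com/web:1.0`, and every numeric value are hypothetical placeholders.

```yaml
# Sketch only: names, image, and numbers are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy              # Ch 06: default-deny for the namespace
metadata:
  name: default-deny
  namespace: app
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: app
spec:                            # replicas omitted: the HPA below owns the count
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: example.com/web:1.0   # must define a non-root user to start
          securityContext:             # Ch 07
            runAsNonRoot: true
            readOnlyRootFilesystem: true
          resources:                   # Ch 08: requests below limits gives Burstable QoS
            requests: {cpu: 100m, memory: 128Mi}
            limits: {cpu: 500m, memory: 256Mi}
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler    # Ch 09: never scale below two replicas
metadata:
  name: web
  namespace: app
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: web}
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: {type: Utilization, averageUtilization: 70}
---
apiVersion: policy/v1
kind: PodDisruptionBudget        # Ch 09: a drain may take down one replica at a time
metadata:
  name: web
  namespace: app
spec:
  maxUnavailable: 1
  selector:
    matchLabels: {app: web}
```

Note that a default-deny policy covering egress also blocks DNS lookups until a rule allows them, so a real namespace would carry additional allow rules alongside it.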
Self-Check
You are ready for Chapter 10 when you can answer without looking:
- Why is default-deny safer than allow-all, even though it creates more up-front work?
- What is the standard pattern for a writable `/tmp` when `readOnlyRootFilesystem: true` is enforced?
- What evidence do you need before raising a pod’s memory limit?
- Why is `minReplicas: 1` a reliability regression, even for a deployment that “never fails”?
- If a PDB blocks a node drain, what does that tell you about the deployment’s replica count?
If any of these are unclear, revisit the relevant chapter before moving forward.
What Comes Next
You can now deliver code safely (Ch 01-05) and your workloads have runtime constraints (Ch 06-09). The next block (Chapters 10-14) shifts to what happens when something goes wrong:
- Chapter 10 — Observability: how you see what is happening
- Chapter 11 — Backup and Restore: how you recover from data loss
- Chapter 12 — Controlled Chaos: how you rehearse failure
- Chapter 13 — AI-Assisted SRE Guardian: how you route incidents safely
- Chapter 14 — 24/7 Production SRE: how humans coordinate response
These five chapters turn your running system into an operable, observable, recoverable production platform.