Chapter 06: Network Policies (Production Isolation)
Incident Hook
A debug pod in develop reaches internal services it should never touch.
No exploit sophistication is needed, only open east-west traffic.
When the incident starts, responders cannot quickly prove or limit the blast radius.
Network policies turn this into an auditable allowlist model.
Observed Symptoms
What the team sees first:
- a pod in develop can connect to services outside its intended boundary
- responders cannot answer quickly what is reachable and what is not
- containment feels manual because the network has no default-deny baseline
The issue is not only one bad connection. It is the absence of a trustworthy traffic model.
Confusion Phase
Without policies, every connectivity question becomes investigative work.
The team now has to discover:
- which paths are legitimately required
- which paths are accidental exposure
- how to contain the pod without breaking the namespace blindly
Why This Chapter Exists
Without network isolation, one compromised pod can move laterally across environments. This chapter introduces a safe baseline:
- default deny
- explicit allow rules
- DNS and ingress paths opened intentionally
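As a reference point, a namespace-wide default-deny policy can be sketched as below. This is a minimal sketch: the policy name is illustrative, and the namespace it applies to is whichever one you deploy it into.

```yaml
# Hypothetical baseline: deny all ingress and egress for every pod in the
# namespace. Each required path is then re-opened with an explicit allow rule.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all   # name is illustrative
spec:
  podSelector: {}          # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Because `podSelector` is empty, every pod in the namespace is selected, and listing both policy types with no allow rules denies all traffic in both directions.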
What AI Would Propose (Brave Junior)
- “Skip policies for now to avoid breaking traffic.”
- “We can secure networking later after release.”
Why this sounds reasonable:
- avoids immediate traffic risk
- seems faster during release pressure
Why This Is Dangerous
- Flat networking means high lateral-movement risk.
- Production and non-production boundaries become weak.
- Incidents are harder to contain under pressure.
Investigation
Start with the boundary, not with ad-hoc firewall guesses.
Safe investigation sequence:
- list the source pod, target service, namespace, and port involved
- prove what traffic is currently open
- define the minimum required paths: DNS, ingress, and exact egress needs
- test one allow rule at a time against the default-deny baseline
Containment
Containment narrows traffic fast:
- apply namespace default deny
- add back DNS first
- add ingress path second
- allow only the exact egress the workload truly needs
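The "DNS first" step can be sketched as an egress allow rule. This is an assumption-laden sketch: the `kube-system` namespace label and the `k8s-app: kube-dns` pod label follow common upstream defaults and may differ in your cluster.

```yaml
# Hypothetical DNS allow rule: permit egress from all pods in the namespace
# to the cluster DNS pods on port 53. Label keys are common defaults
# (kubernetes.io/metadata.name is auto-set on namespaces in recent Kubernetes)
# and should be verified against your cluster.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Putting `namespaceSelector` and `podSelector` in the same `to` entry combines them with AND, so only the DNS pods in `kube-system` are reachable, not the whole namespace.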
The goal is not “network works somehow.” The goal is “network is explainable.”
Guardrails That Stop It
- Start from default deny in target namespace.
- Add minimum allow rules one by one with verification.
- Keep policy changes isolated from application changes.
- Keep rollback manifest ready before applying restrictive policies.
Common AI Trap
AI often suggests broad allow rules to “get traffic working”:
- 0.0.0.0/0 egress
- namespace-wide allow-all policy
- temporary wildcard selectors
Do not apply these shortcuts. Fix exact source/destination/path requirements instead.
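Instead of an allow-all shortcut, write the narrowest rule that covers the real requirement. The following is a sketch only: the `app: backend` and `app: postgres` labels and port 5432 are assumptions to be matched to your workload.

```yaml
# Sketch of a narrow egress rule replacing an allow-all shortcut:
# backend pods may reach only postgres pods on TCP 5432 in the same
# namespace. Labels and port are assumptions, not taken from the source.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-egress-postgres
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
```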
Investigation Snapshots
Here is the backend allow policy used in the SafeOps system to permit only the ingress path the workload actually needs.
Backend allow policy
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-backend-ingress
spec:
  podSelector:
    matchExpressions:
      - key: app
        operator: In
        values: [backend, backend-primary]
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchExpressions:
              - key: app
                operator: In
                values: [frontend, frontend-primary]
      ports:
        - protocol: TCP
          port: 8080
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: traefik
      ports:
        - protocol: TCP
          port: 8080
```
Here is the baseline policy pack that gets promoted across environments.
Network policy baseline
- flux/infrastructure/network-policies/base/allow-backend-egress-https.yaml
- flux/infrastructure/network-policies/base/allow-backend-egress-postgres.yaml
- flux/infrastructure/network-policies/base/allow-backend-ingress.yaml
- flux/infrastructure/network-policies/base/allow-backend-metrics-from-observability.yaml
- flux/infrastructure/network-policies/base/allow-dns-egress.yaml
- flux/infrastructure/network-policies/base/allow-frontend-egress-backend.yaml
- flux/infrastructure/network-policies/base/allow-frontend-ingress.yaml
- flux/infrastructure/network-policies/base/allow-postgres-egress-apiserver.yaml
- flux/infrastructure/network-policies/base/allow-postgres-egress-https.yaml
- flux/infrastructure/network-policies/base/allow-postgres-ingress-backend.yaml
- flux/infrastructure/network-policies/base/allow-postgres-ingress-cnpg-operator.yaml
- flux/infrastructure/network-policies/base/default-deny-all.yaml
- flux/infrastructure/network-policies/base/kustomization.yaml
System Context
This chapter creates the runtime isolation that later lessons rely on.
It connects directly to:
- Chapter 07, where workload hardening limits what an attacker can do after shell access
- Chapter 10, where incident response depends on clean service boundaries
- Chapter 12, where drills should fail inside bounded scope instead of spreading silently
Safe Workflow (Step-by-Step)
- Start from a namespace default-deny policy in develop.
- Add minimal allow rules in order:
  - DNS first
  - ingress path second
  - required egress last
- Test each allow rule before adding the next one.
- Run blocked-traffic triage for failures:
- DNS resolution
- namespace/pod labels
- egress target and policy selector match
- Reject “allow all” shortcuts even for temporary fixes; patch specific policy instead.
- Promote policy changes environment by environment with evidence.
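The baseline pack above is wired together with a kustomization so the whole set can be promoted environment by environment. This is a sketch assuming a subset of the base files; the full resource list mirrors the baseline file listing earlier in the chapter.

```yaml
# Hypothetical kustomization bundling the policy baseline for promotion.
# The resources shown are a subset; extend the list to cover every policy
# file in the base directory.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - default-deny-all.yaml
  - allow-dns-egress.yaml
  - allow-frontend-ingress.yaml
  - allow-backend-ingress.yaml
```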
Blocked Traffic Triage Playbook
When traffic is blocked:
- Check DNS resolution from source pod.
- Confirm source and destination labels match policy selectors.
- Verify namespace labels used by
namespaceSelector. - Validate port/protocol correctness in policy rules.
- Confirm egress destination (service vs IP) matches allowed targets.
- Re-test with one rule change at a time and capture evidence.
Lab Files
- lab.md
- quiz.md
Done When
- learner can apply default deny without losing control of the environment
- learner can allow only required DNS + ingress traffic
- learner can debug and explain blocked traffic with evidence