Advanced Module: Progressive Delivery (Canary with Traefik + Flagger)
Why This Module Exists
Safe delivery is more than “deploy or roll back”. This module adds ingress-driven guardrails for progressive rollout:
- weighted canary rollout with measurable abort criteria
- Traefik ingress-level traffic splitting
- Flagger automated analysis and rollback
Incident Hook
A full rollout passes smoke checks but fails under the real production traffic mix. Error rate and latency spike after the deploy, and rollback starts late because detection is manual. The team needs controlled traffic progression with automatic safety checks.
What AI Would Propose (Brave Junior)
- “Ship 100% now; we can roll back if needed.”
- “Canary is too slow for this fix.”
- “Use ad-hoc routing rules without SLO checks.”
Why this sounds reasonable:
- fastest short-term path
- fewer moving parts in one deploy
Why This Is Dangerous
- blast radius is immediate and broad
- no objective stop conditions during rollout
- unchecked rollout can hide impact until it is too late
Guardrails That Stop It
- traffic progression in controlled steps (for example 10% -> 20% -> 30% -> 40% -> 50%)
- abort on SLO violation (error rate, latency, success rate)
- Traefik ingress metrics provide request-level visibility
- rollback path tested before canary start
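The stepped progression above can be sketched in a few lines of Python. This is only an illustration: the function name `canary_weights` is hypothetical, and its two parameters are assumed to mirror the `stepWeight` and `maxWeight` fields of a Flagger canary analysis.

```python
def canary_weights(step_weight: int, max_weight: int) -> list[int]:
    """Return the canary traffic percentages visited during a rollout.

    Hypothetical helper: step_weight and max_weight are assumed to play
    the same role as Flagger's stepWeight and maxWeight.
    """
    if step_weight <= 0:
        raise ValueError("step_weight must be positive")
    if not 0 < max_weight <= 100:
        raise ValueError("max_weight must be in the range 1-100")
    # Each step adds step_weight percent until the cap would be exceeded.
    return list(range(step_weight, max_weight + 1, step_weight))


if __name__ == "__main__":
    print(canary_weights(10, 50))  # [10, 20, 30, 40, 50]
```

A smaller step gives more checkpoints at which analysis can stop the rollout, at the cost of a longer rollout.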
Abort Criteria (Concrete Defaults)
Example production-safe defaults:
- error rate increase >= 2x baseline over 5 minutes -> abort
- p95 latency increase >= 30% over baseline -> abort
- success rate below 99% for guarded endpoint -> abort
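As a rough sketch, the three defaults above could be evaluated like this. The `Window` class and `abort_reasons` function are hypothetical names, and the handling of a zero baseline error rate (any canary errors count as a breach) is an assumption, not something the module specifies. The measurements are assumed to have been aggregated already over the relevant window, for example 5 minutes for the error rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Aggregated measurements for one observation window (hypothetical)."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds
    success_rate: float    # percent of successful requests, 0 to 100


def abort_reasons(baseline: Window, canary: Window) -> list[str]:
    """Return the list of breached criteria; an empty list means continue."""
    reasons = []
    # Assumption: with a zero baseline, any canary error counts as a breach.
    if canary.error_rate > 0 and canary.error_rate >= 2 * baseline.error_rate:
        reasons.append("error rate >= 2x baseline")
    if canary.p95_latency_ms >= 1.3 * baseline.p95_latency_ms:
        reasons.append("p95 latency >= 30% over baseline")
    if canary.success_rate < 99.0:
        reasons.append("success rate below 99%")
    return reasons


if __name__ == "__main__":
    baseline = Window(error_rate=0.01, p95_latency_ms=200, success_rate=99.9)
    degraded = Window(error_rate=0.03, p95_latency_ms=300, success_rate=98.5)
    for reason in abort_reasons(baseline, degraded):
        print("ABORT:", reason)
```

Returning every breached criterion, rather than a single boolean, keeps the abort reason available as evidence for the postmortem.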
Module Scope
- Traefik ingress baseline (IngressRoute, TraefikService, dashboard).
- Canary rollout flow (Flagger + Traefik weighted routing).
- Evidence capture for rollout decision and postmortem.
Safe Workflow (Step-by-Step)
- Verify Traefik ingress and Flagger controller are healthy.
- Start canary with bounded traffic progression and explicit blast limits.
- Monitor abort criteria continuously (error rate, latency, success rate).
- Auto-abort/rollback immediately on SLO threshold breach.
- Record decision with evidence from Flagger events and Prometheus metrics.
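The workflow above can be simulated as a small loop to make the control flow concrete. This is a sketch, not how Flagger is implemented: `run_canary` and the `check` callback are hypothetical, and in a real setup the controller performs these steps against Prometheus metrics.

```python
from typing import Callable


def run_canary(weights: list[int],
               check: Callable[[int], list[str]]) -> dict:
    """Walk through traffic weights and stop at the first breach.

    check(weight) is a hypothetical callback returning the list of
    breached abort criteria at that weight (empty list means healthy).
    """
    history = []
    for weight in weights:
        breaches = check(weight)
        # Keep every step as evidence for the decision record.
        history.append({"weight": weight, "breaches": breaches})
        if breaches:
            # Abort immediately: all traffic returns to the stable version.
            return {"decision": "rollback", "final_weight": 0,
                    "history": history}
    return {"decision": "promote", "final_weight": 100, "history": history}


if __name__ == "__main__":
    # Simulated canary that degrades once it receives 30% of traffic.
    result = run_canary(
        [10, 20, 30, 40, 50],
        lambda w: ["p95 latency"] if w >= 30 else [],
    )
    print(result["decision"], "after", len(result["history"]), "steps")
```

The returned `history` doubles as a minimal decision record: it shows which weight was reached and which criterion triggered the abort.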
Max Blast Limits
Apply hard caps during progressive delivery:
- max canary traffic before full confidence: 25-50% (context dependent)
- max experiment duration without decision: progressDeadlineSeconds (for example 600 seconds)
- if no measurable improvement, revert to stable baseline
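The caps above can be expressed as a single decision function. This is a simplified sketch: `next_step` is a hypothetical helper, and it treats the deadline as a plain wall-clock budget for the experiment, which may differ from how `progressDeadlineSeconds` is enforced by the controller.

```python
def next_step(current_weight: int, step_weight: int, max_weight: int,
              elapsed_seconds: float,
              deadline_seconds: float = 600) -> tuple[str, int]:
    """Decide the next action and canary weight under hard blast limits."""
    # Time cap: no decision within the budget means revert to stable.
    if elapsed_seconds >= deadline_seconds:
        return ("rollback", 0)
    # Traffic cap: never exceed max_weight before full confidence.
    if current_weight >= max_weight:
        return ("hold", current_weight)
    return ("advance", min(current_weight + step_weight, max_weight))


if __name__ == "__main__":
    print(next_step(20, 10, 50, elapsed_seconds=120))  # ('advance', 30)
    print(next_step(50, 10, 50, elapsed_seconds=120))  # ('hold', 50)
    print(next_step(30, 10, 50, elapsed_seconds=600))  # ('rollback', 0)
```

Checking the deadline first means a stalled experiment is reverted even if it has not yet reached the traffic cap.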
Source Code References
- flux/infrastructure/progressive-delivery/flagger
- flux/infrastructure/progressive-delivery/develop
- flux/bootstrap/flux-system/infrastructure.yaml (Flagger enabled, develop canary pack opt-in)
Files
- lab.md
- runbook-progressive-delivery.md
- quiz.md
- scorecard.md
Done When
- learner can run a canary with automated abort criteria
- learner can explain how Traefik weighted routing limits blast radius
- learner can capture evidence from Flagger events and Prometheus metrics