Advanced Track: do this after finishing Chapters 01-14.

Estimated Time

  • Reading: 30-40 min
  • Lab: 60-90 min
  • Quiz: 15-20 min

Prerequisites

  • Core track (Chapters 01-14) completed.
  • GitOps promotion and observability workflows available.

Source Code References

  • develop/
  • flagger/
  • infrastructure.yaml

What You Will Produce

A go/no-go evidence package: rollout results, remediation notes, and explicit rollback conditions.

Advanced Module: Progressive Delivery (Canary with Traefik + Flagger)

Why This Module Exists

Safe delivery is not only “deploy or rollback”. This module adds ingress-driven progressive rollout guardrails:

  • weighted canary rollout with measurable abort criteria
  • Traefik ingress-level traffic splitting
  • Flagger automated analysis and rollback

Incident Hook

A full rollout passes smoke checks but fails under the real production traffic mix. Error rate and latency spike after the deploy, and rollback starts late because detection is manual. The team needs controlled traffic progression with automatic safety checks.

What AI Would Propose (Brave Junior)

  • “Ship 100% now; we can roll back if needed.”
  • “Canary is too slow for this fix.”
  • “Use ad-hoc routing rules without SLO checks.”

Why this sounds reasonable:

  • fastest short-term path
  • fewer moving parts in one deploy

Why This Is Dangerous

  • blast radius is immediate and broad
  • no objective stop conditions during rollout
  • unchecked rollout can hide impact until it is too late

Guardrails That Stop It

  • traffic progression in controlled steps (for example 10% -> 20% -> 30% -> 40% -> 50%)
  • abort on SLO violation (error rate, latency, success rate)
  • Traefik ingress metrics provide request-level visibility
  • rollback path tested before canary start
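
The stepped progression above can be sketched as a small schedule calculation. This is an illustrative sketch only: `canary_weights` and `stable_share` are hypothetical helpers, and the parameter names merely mirror the `stepWeight` and `maxWeight` settings of a Flagger canary analysis. It assumes the remaining traffic always goes to the stable version.

```python
def canary_weights(step_weight: int, max_weight: int) -> list[int]:
    """Return the canary traffic percentages a stepped rollout would visit."""
    if step_weight <= 0 or max_weight <= 0:
        raise ValueError("step_weight and max_weight must be positive")
    if max_weight > 100:
        raise ValueError("max_weight cannot exceed 100")
    # Each step adds step_weight percent until max_weight is reached.
    return list(range(step_weight, max_weight + 1, step_weight))


def stable_share(canary_weight: int) -> int:
    """Assumption: whatever the canary does not receive goes to stable."""
    return 100 - canary_weight


# The example progression from the guardrail list: 10% steps up to 50%.
for weight in canary_weights(step_weight=10, max_weight=50):
    print(f"canary {weight:>3}% | stable {stable_share(weight):>3}%")
```

At the first step only 10% of requests can hit a bad build, which is the blast-radius limit the guardrail is after.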

Abort Criteria (Concrete Defaults)

Example production-safe defaults:

  • error rate at or above 2x baseline, sustained over 5 minutes -> abort
  • p95 latency increase >= 30% over baseline -> abort
  • success rate below 99% for guarded endpoint -> abort
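
A minimal sketch of how these three defaults could be evaluated in code. `Snapshot` and `abort_reasons` are hypothetical names, not part of Flagger or Traefik. The sketch assumes the caller supplies values already aggregated over the observation window (for example 5 minutes), and it skips the ratio check when the baseline error rate is zero, since "2x zero" would abort on any error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    error_rate: float       # fraction of failed requests, e.g. 0.002
    p95_latency_ms: float   # 95th percentile latency in milliseconds
    success_rate: float     # percent of successful requests, e.g. 99.8


def abort_reasons(baseline: Snapshot, canary: Snapshot) -> list[str]:
    """Return the abort criteria the canary violates (empty list = continue)."""
    reasons = []
    # Criterion 1: error rate at or above 2x baseline.
    # Assumption: with a zero baseline the ratio is meaningless, so skip it
    # and rely on the success-rate floor below.
    if baseline.error_rate > 0 and canary.error_rate >= 2 * baseline.error_rate:
        reasons.append("error rate >= 2x baseline")
    # Criterion 2: p95 latency 30% or more above baseline.
    if canary.p95_latency_ms >= 1.30 * baseline.p95_latency_ms:
        reasons.append("p95 latency >= 30% over baseline")
    # Criterion 3: absolute success-rate floor of 99%.
    if canary.success_rate < 99.0:
        reasons.append("success rate below 99%")
    return reasons


baseline = Snapshot(error_rate=0.002, p95_latency_ms=200.0, success_rate=99.8)
degraded = Snapshot(error_rate=0.005, p95_latency_ms=280.0, success_rate=98.5)
print(abort_reasons(baseline, degraded))
```

Returning the list of violated criteria, rather than a bare boolean, keeps the reason for an abort available for the evidence package.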

Module Scope

  1. Traefik ingress baseline (IngressRoute, TraefikService, dashboard).
  2. Canary rollout flow (Flagger + Traefik weighted routing).
  3. Evidence capture for rollout decision and postmortem.

Safe Workflow (Step-by-Step)

  1. Verify Traefik ingress and Flagger controller are healthy.
  2. Start canary with bounded traffic progression and explicit blast limits.
  3. Monitor abort criteria continuously (error rate, latency, success rate).
  4. Auto-abort/rollback immediately on SLO threshold breach.
  5. Record decision with evidence from Flagger events and Prometheus metrics.
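
The workflow above can be read as a simple control loop. The sketch below simulates it in plain Python to make the ordering explicit. It does not call Flagger or Traefik; `run_canary`, the `check` callback, and the event strings are all hypothetical stand-ins for what the controller and its event log would provide.

```python
from typing import Callable


def run_canary(
    weights: list[int],
    check: Callable[[int], list[str]],
    controllers_healthy: bool,
) -> tuple[str, list[str]]:
    """Simulate the five workflow steps and return (status, recorded events)."""
    events: list[str] = []
    # Step 1: do not start unless ingress and controller are healthy.
    if not controllers_healthy:
        events.append("precheck failed; canary not started")
        return "not_started", events
    # Step 2: advance through the bounded traffic progression.
    for weight in weights:
        events.append(f"advance canary weight to {weight}%")
        # Step 3: evaluate abort criteria at every step.
        reasons = check(weight)
        if reasons:
            # Step 4: abort immediately and return all traffic to stable.
            events.append(f"abort at {weight}%: " + "; ".join(reasons))
            events.append("route 100% of traffic to stable")
            return "rolled_back", events
    events.append("analysis passed; promote canary")
    # Step 5: the caller keeps `events` as decision evidence.
    return "promoted", events


def fake_check(weight: int) -> list[str]:
    """Hypothetical metric check that starts failing at 30% traffic."""
    return ["p95 latency over threshold"] if weight >= 30 else []


status, events = run_canary([10, 20, 30, 40, 50], fake_check, controllers_healthy=True)
print(status)
for line in events:
    print(" -", line)
```

In this simulated run the rollout never goes past 30%, and the recorded events show both where it stopped and why.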

Max Blast Limits

Apply hard caps during progressive delivery:

  • max canary traffic before full confidence: 25-50% (context dependent)
  • max experiment duration without decision: progressDeadlineSeconds (for example 600 seconds)
  • if no measurable improvement, revert to stable baseline
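
One way to enforce these caps is to validate a rollout plan before starting it. The sketch below is a hypothetical pre-flight check, not a Flagger feature. The 50% cap and 600-second limit are the example values from the list above. Treating the deadline as a cap on the whole analysis window (steps times interval) is a simplification made for this sketch and may not match how Flagger itself interprets `progressDeadlineSeconds`.

```python
import math


def plan_violations(
    step_weight: int,
    max_weight: int,
    interval_seconds: int,
    weight_cap: int = 50,
    deadline_seconds: int = 600,
) -> list[str]:
    """Return the blast limits a rollout plan would break (empty list = OK)."""
    if step_weight <= 0:
        raise ValueError("step_weight must be positive")
    violations = []
    # Cap 1: the canary must never receive more traffic than the agreed cap.
    if max_weight > weight_cap:
        violations.append(
            f"max canary weight {max_weight}% exceeds cap {weight_cap}%"
        )
    # Cap 2: the plan must reach a decision within the deadline.
    # Assumption: one analysis interval per weight step.
    steps = math.ceil(max_weight / step_weight)
    duration = steps * interval_seconds
    if duration > deadline_seconds:
        violations.append(
            f"planned duration {duration}s exceeds deadline {deadline_seconds}s"
        )
    return violations


print(plan_violations(step_weight=10, max_weight=50, interval_seconds=60))
print(plan_violations(step_weight=10, max_weight=80, interval_seconds=120))
```

The first plan takes 5 steps of 60 seconds and stays within both caps. The second breaks both, so it would be rejected before any traffic is shifted.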

Files

  • lab.md
  • runbook-progressive-delivery.md
  • quiz.md
  • scorecard.md

Done When

  • learner can run a canary rollout with automated abort criteria
  • learner can explain how Traefik weighted routing limits blast radius
  • learner can capture evidence from Flagger events and Prometheus metrics

Hands-On Materials

Labs, quizzes, and runbooks are available to course members.

  • Lab: Canary Rollout with Traefik + Flagger (Advanced)
  • Progressive Delivery Scorecard (Template)
  • Quiz: Advanced Module (Progressive Delivery with Traefik + Flagger)
  • Runbook: Progressive Delivery Operations (Advanced)