Advanced Track: Do this after finishing Chapters 01-14.

Estimated Time

  • Reading: 30-40 min
  • Lab: 60-90 min
  • Quiz: 15-20 min

Prerequisites

  • Core track (Chapters 01-14) completed.
  • GitOps promotion and observability workflows available.

Source Code References

  • develop/
  • flagger/

What You Will Produce

A go/no-go evidence package: rollout results, remediation notes, and explicit rollback conditions.

Advanced Module: Progressive Delivery (Canary with Traefik + Flagger)

Incident Hook

A full rollout passes smoke checks but fails under the real production traffic mix. Error rate and latency spike after the deploy, and rollback starts late because detection is manual. The team needs controlled traffic progression with automatic safety checks.

Observed Symptoms

What the team sees first:

  • smoke checks pass, but live user traffic behaves differently
  • bad signals appear only after the rollout reaches meaningful load
  • rollback starts late because failure detection depends on humans noticing trends

The problem is not only release speed. It is uncontrolled exposure before evidence catches up.

Confusion Phase

A rollout can look healthy at 1% and risky at 30%. That is why full release confidence cannot come from one early success signal.

The real questions are:

  • how much traffic is safe to expose next
  • which metric should stop the rollout before humans start arguing

Why This Module Exists

Safe delivery is not only “deploy or rollback”. This module adds ingress-driven progressive rollout guardrails:

  • weighted canary rollout with measurable abort criteria
  • Traefik ingress-level traffic splitting
  • Flagger automated analysis and rollback

What AI Would Propose (Brave Junior)

  • “Ship 100% now; we can rollback if needed.”
  • “Canary is too slow for this fix.”
  • “Use ad-hoc routing rules without SLO checks.”

Why this sounds reasonable:

  • fastest short-term path
  • fewer moving parts in one deploy

Why This Is Dangerous

  • blast radius is immediate and broad
  • no objective stop conditions during rollout
  • unchecked rollout can hide impact until it is too late

Investigation

Treat canary progression as an evidence ladder.

Safe investigation sequence:

  1. verify the current traffic weight and abort thresholds
  2. compare error rate, latency, and success rate against baseline
  3. inspect Flagger events and Prometheus evidence together
  4. decide whether to progress, hold, or abort based on thresholds, not confidence alone
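
Step 4 can be sketched as a small decision function. This is a minimal illustration, not Flagger's implementation: `Action`, `decide`, and `max_failed_checks` are hypothetical names, and using a cumulative failed-check counter to separate "hold" from "abort" is an assumption made for this example.

```python
from enum import Enum


class Action(Enum):
    PROGRESS = "progress"
    HOLD = "hold"
    ABORT = "abort"


def decide(check_passed: bool, failed_checks: int,
           max_failed_checks: int = 5) -> tuple[Action, int]:
    """Return the next action and the updated failed-check count.

    check_passed: True when every metric stayed inside its threshold
    during the latest analysis interval.
    failed_checks: number of failed intervals seen so far in this rollout.
    """
    if check_passed:
        # Evidence supports the next traffic step; the counter is kept as-is.
        return Action.PROGRESS, failed_checks
    failed_checks += 1
    if failed_checks >= max_failed_checks:
        # Too many failed intervals: stop and route traffic back to stable.
        return Action.ABORT, failed_checks
    # A failed interval below the limit: keep the current weight, do not advance.
    return Action.HOLD, failed_checks


print(decide(True, 0))   # (<Action.PROGRESS: 'progress'>, 0)
print(decide(False, 0))  # (<Action.HOLD: 'hold'>, 1)
print(decide(False, 4))  # (<Action.ABORT: 'abort'>, 5)
```

Setting `max_failed_checks=1` in this sketch removes the hold state, so the first breached interval aborts immediately.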

Containment

Containment must already be wired before the canary starts:

  1. keep blast limits explicit
  2. auto-abort on threshold breach
  3. confirm traffic returned to stable baseline after rollback
  4. record the evidence that justified the abort or promotion decision

Guardrails That Stop It

  • traffic progression in controlled steps (for example 10% -> 20% -> 30% -> 40% -> 50%)
  • abort on SLO violation (error rate, latency, success rate)
  • Traefik ingress metrics provide request-level visibility
  • rollback path tested before canary start
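
The stepped progression in the first guardrail can be sketched as a schedule of canary weights. This is an illustration only: `canary_weights` is a hypothetical helper, and its parameters are named after Flagger's `stepWeight` and `maxWeight` analysis settings on the assumption that the rollout is configured that way.

```python
def canary_weights(step_weight: int, max_weight: int) -> list[int]:
    """Return the canary traffic percentages visited before promotion.

    The schedule never exceeds max_weight, even when max_weight is not
    a multiple of step_weight.
    """
    if not 0 < step_weight <= max_weight <= 100:
        raise ValueError("expected 0 < step_weight <= max_weight <= 100")
    return list(range(step_weight, max_weight + 1, step_weight))


print(canary_weights(10, 50))  # [10, 20, 30, 40, 50]
print(canary_weights(10, 25))  # [10, 20]
```

Each entry is a point where the abort criteria get another chance to stop the rollout before more users are exposed.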

Abort Criteria (Concrete Defaults)

Example production-safe defaults:

  • error rate >= 2x baseline over a 5-minute window -> abort
  • p95 latency >= 30% above baseline -> abort
  • success rate below 99% for the guarded endpoint -> abort
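
The three defaults above can be expressed as one check that returns every breached criterion. This is a minimal sketch under stated assumptions: `WindowMetrics` and `abort_reasons` are hypothetical names, the metrics are assumed to be already aggregated over the observation window (for example 5 minutes), and the first rule is read as "canary error rate is at least twice the baseline rate". The sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    """Metrics aggregated over one observation window (hypothetical shape)."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds
    success_rate: float    # fraction of successful requests, 0.0 to 1.0


def abort_reasons(baseline: WindowMetrics, canary: WindowMetrics) -> list[str]:
    """Return the breached abort criteria; an empty list means no breach."""
    reasons = []
    # Guard against a zero baseline: 0 >= 2 * 0 would otherwise always abort.
    if canary.error_rate > 0 and canary.error_rate >= 2 * baseline.error_rate:
        reasons.append("error rate >= 2x baseline")
    if canary.p95_latency_ms >= 1.30 * baseline.p95_latency_ms:
        reasons.append("p95 latency >= 30% above baseline")
    if canary.success_rate < 0.99:
        reasons.append("success rate below 99%")
    return reasons


# Made-up sample values, for illustration only.
baseline = WindowMetrics(error_rate=0.004, p95_latency_ms=200.0, success_rate=0.996)
healthy = WindowMetrics(error_rate=0.005, p95_latency_ms=210.0, success_rate=0.995)
degraded = WindowMetrics(error_rate=0.020, p95_latency_ms=300.0, success_rate=0.980)

print(abort_reasons(baseline, healthy))   # []
print(abort_reasons(baseline, degraded))  # all three criteria
```

Returning the reasons instead of a bare boolean keeps the abort decision explainable in the evidence package.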

Module Scope

  1. Traefik ingress baseline (IngressRoute, TraefikService, dashboard).
  2. Canary rollout flow (Flagger + Traefik weighted routing).
  3. Evidence capture for rollout decision and postmortem.

Safe Workflow (Step-by-Step)

  1. Verify Traefik ingress and Flagger controller are healthy.
  2. Start canary with bounded traffic progression and explicit blast limits.
  3. Monitor abort criteria continuously (error rate, latency, success rate).
  4. Auto-abort/rollback immediately on SLO threshold breach.
  5. Record decision with evidence from Flagger events and Prometheus metrics.
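
Step 5 can be sketched as a small record builder for the go/no-go evidence package. This is a hedged illustration: `build_evidence` and every field name are hypothetical, not a format defined by Flagger or by the course materials, and the event text is assumed to be copied by hand from the Flagger events.

```python
import json
from datetime import datetime, timezone


def build_evidence(decision, canary_weight, breached_criteria,
                   flagger_events, rollback_conditions, recorded_at=None):
    """Return a JSON evidence record for a promote or abort decision."""
    if decision not in ("promote", "abort"):
        raise ValueError("decision must be 'promote' or 'abort'")
    if decision == "promote" and breached_criteria:
        # A promotion must not be recorded while an abort criterion is breached.
        raise ValueError("cannot promote while abort criteria are breached")
    record = {
        "decision": decision,
        "canary_weight_percent": canary_weight,
        "breached_criteria": list(breached_criteria),
        "flagger_events": list(flagger_events),
        "rollback_conditions": list(rollback_conditions),
        # Default to the current UTC time when no timestamp is supplied.
        "recorded_at": recorded_at or datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder inputs, for illustration only.
print(build_evidence(
    decision="abort",
    canary_weight=30,
    breached_criteria=["success rate below 99%"],
    flagger_events=["<event text copied from the Flagger events>"],
    rollback_conditions=["error rate >= 2x baseline", "success rate below 99%"],
))
```

Rejecting a "promote" record that lists breached criteria keeps the written decision consistent with the thresholds that were agreed before the canary started.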

Max Blast Limits

Apply hard caps during progressive delivery:

  • max canary traffic before full confidence: 25-50% (context dependent)
  • max experiment duration without decision: progressDeadlineSeconds (for example 600 seconds)
  • if no measurable improvement, revert to stable baseline

Investigation Snapshots

Here is the Flagger installation layout used in the SafeOps system to make traffic progression and rollback measurable.

Flagger installation layout

  • flux/infrastructure/progressive-delivery/flagger/kustomization.yaml
  • flux/infrastructure/progressive-delivery/flagger/namespace.yaml
  • flux/infrastructure/progressive-delivery/flagger/release.yaml
  • flux/infrastructure/progressive-delivery/flagger/repository.yaml

Here is the develop canary pack used for bounded experiments before wider rollout.

Develop canary pack

  • flux/infrastructure/progressive-delivery/develop/canary-backend.yaml
  • flux/infrastructure/progressive-delivery/develop/canary-frontend.yaml
  • flux/infrastructure/progressive-delivery/develop/ingressroutes.yaml
  • flux/infrastructure/progressive-delivery/develop/kustomization.yaml
  • flux/infrastructure/progressive-delivery/develop/metric-templates.yaml
  • flux/infrastructure/progressive-delivery/develop/namespace.yaml

System Context

This module takes the earlier release discipline and applies it during live traffic.

It builds on:

  • Chapter 04 immutable promotion
  • Chapter 09 availability behavior during disruption
  • Chapter 10 evidence-first decisions from metrics

Files

  • lab.md
  • runbook-progressive-delivery.md
  • quiz.md
  • scorecard.md

Done When

  • learner can run canary with automated abort criteria
  • learner can explain how Traefik weighted routing limits blast radius
  • learner can capture evidence from Flagger events and Prometheus metrics

Hands-On Materials

Labs, quizzes, and runbooks are available to course members.

  • Lab: Canary Rollout with Traefik + Flagger (Advanced)
  • Progressive Delivery Scorecard (Template)
  • Quiz: Advanced Module (Progressive Delivery with Traefik + Flagger)
  • Runbook: Progressive Delivery Operations (Advanced)