Investigation & Containment

Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

Reading: 20-25 min
Lab: 45-60 min
Quiz: 10-15 min

Prerequisites

Course intro completed: AI as a Very Well-Read Junior Engineer.
Working `kubectl` and `git` setup.

Source Code References

guard-kube-context.sh (Members)
guard-terraform-plan.sh (Members)

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Investigation

The first job is not to guess. It is to separate routing evidence from application evidence.

Safe investigation sequence:

Inspect the Ingress in develop: Run `kubectl get ing -n develop` to see the current configuration.
Verify Host and Backend Target: Confirm that both match the intended environment (develop vs staging).
Check Backend Pod Health: Read the backend logs directly with `kubectl logs -l app=backend -n develop`.
Decide the Outage Type: Is it routing-only (Ingress), app-only (image version), or genuinely mixed?

In this incident, the ingress host is the strongest signal. It was changed for the wrong environment, which explains the edge failure faster than the backend rollout noise does.
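
As a minimal sketch of the investigation sequence above, the following shell functions gather the two pieces of evidence and name the outage type. The function names, the idea of passing in an Ingress name, and the readiness-based health check are illustrative assumptions, not part of the lab scripts; only the `develop` namespace and the `app=backend` selector come from the steps above.

```shell
# Sketch only: function names and the Ingress name argument are hypothetical.

# Print the host of the first rule of an Ingress.
ingress_host() {
  # $1 = Ingress name (assumed), $2 = namespace
  kubectl get ingress "$1" -n "$2" -o jsonpath='{.spec.rules[0].host}'
}

# Print "true" when at least one backend container reports ready and none
# reports not ready; print "false" otherwise.
backend_healthy() {
  # $1 = namespace
  statuses="$(kubectl get pods -l app=backend -n "$1" \
    -o jsonpath='{.items[*].status.containerStatuses[*].ready}')"
  case "$statuses" in
    ""|*false*) echo "false" ;;
    *) echo "true" ;;
  esac
}

# Separate routing evidence from application evidence.
classify_outage() {
  # $1 = expected host, $2 = actual host, $3 = backend healthy ("true"/"false")
  if [ "$1" != "$2" ] && [ "$3" != "true" ]; then
    echo "mixed"
  elif [ "$1" != "$2" ]; then
    echo "routing-only"
  elif [ "$3" != "true" ]; then
    echo "app-only"
  else
    echo "no-fault-found"
  fi
}
```

With a hypothetical Ingress named `web` and an expected host of `develop.example.com`, calling `classify_outage develop.example.com "$(ingress_host web develop)" "$(backend_healthy develop)"` would name the outage type from live cluster state. Readiness is used here only as a rough stand-in for health; reading the logs remains the more direct check.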

Containment

Containment is deliberately narrow: change as little as possible so that stability returns quickly.

Containment steps:

Revert the Ingress change only: Leave the backend image untouched for now so the rollback adds no new noise.
Reconcile: Let the GitOps path (Flux) reconcile the manifest back to the correct host.
Confirm Routing: Verify that traffic is flowing again and the 502 Bad Gateway is gone.
Evaluate Separately: Assess the backend image update on its own, and only after traffic is stable.
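
The containment steps above might look roughly like the following, assuming the manifests live in a Git repository that Flux watches. The function names, the known-good commit, the manifest path, and the Kustomization name are all hypothetical placeholders; substitute the values from your own lab environment.

```shell
# Sketch only: commit, manifest path, and Kustomization name are hypothetical.

# Restore just the Ingress manifest from a known-good commit and push it,
# leaving the backend image change in place.
revert_ingress_only() {
  # $1 = known-good commit, $2 = path to the Ingress manifest in the repo
  git checkout "$1" -- "$2" &&
    git commit -m "Revert ingress host to known-good state" -- "$2" &&
    git push
}

# Ask Flux to fetch the latest source and apply it now instead of waiting
# for the next sync interval.
reconcile_now() {
  # $1 = Flux Kustomization name (assumed to be in the flux-system namespace)
  flux reconcile kustomization "$1" --with-source
}

# Print the HTTP status code the edge returns for a URL ("000" if unreachable).
edge_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Succeed only when the status code shows traffic flowing (2xx or 3xx).
routing_restored() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}
```

Restoring a single file from a known-good commit, rather than reverting a whole commit, keeps the rollback scoped to the Ingress even if the same commit also touched the backend image. A check such as `routing_restored "$(edge_status "$url")"` then confirms the 502 is gone before anyone looks at the backend update.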

The goal is to restore one clean rollback path. Do not “fix everything at once” during the incident.

Pause and Predict: What automated guardrail would have prevented this incident entirely?