Investigation
The first job is not to guess. It is to separate routing evidence from application evidence.
Safe investigation sequence:
- Inspect the Ingress in `develop`: Use `kubectl get ing -n develop` to verify the current configuration.
- Verify Host and Backend Target: Ensure they match the intended environment (`develop` vs `staging`).
- Check Backend Pod Health: Check logs directly using `kubectl logs -l app=backend -n develop`.
- Decide the Outage Type: Is it routing-only (Ingress), app-only (image version), or genuinely mixed?
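
The sequence above can be sketched as a small shell helper that gathers routing and application evidence separately and only then names the outage type. This is a minimal sketch, not the incident's actual tooling: the function names are hypothetical, the expected host is something you supply, and "any container not ready" is an assumed, deliberately crude health signal that a rollout in progress can trip briefly.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the readiness heuristic are assumptions.

# Print every host configured on Ingress objects in a namespace.
ingress_hosts() {
  kubectl get ingress -n "$1" -o jsonpath='{.items[*].spec.rules[*].host}'
}

# Print the ready flag (true/false) of every backend container in a namespace.
backend_ready_flags() {
  kubectl get pods -n "$1" -l app=backend \
    -o jsonpath='{.items[*].status.containerStatuses[*].ready}'
}

# classify_outage <actual_hosts> <expected_host> <ready_flags>
# Prints one of: routing-only, app-only, mixed, none.
classify_outage() {
  local hosts="$1" expected="$2" flags="$3"
  local routing_ok=no app_ok=yes

  # Routing is fine only if the expected host is among the configured hosts.
  case " $hosts " in
    *" $expected "*) routing_ok=yes ;;
  esac

  # The app is unhealthy if no containers were found or any is not ready.
  if [ -z "$flags" ]; then
    app_ok=no
  fi
  case " $flags " in
    *" false "*) app_ok=no ;;
  esac

  if [ "$routing_ok" = no ] && [ "$app_ok" = no ]; then
    echo mixed
  elif [ "$routing_ok" = no ]; then
    echo routing-only
  elif [ "$app_ok" = no ]; then
    echo app-only
  else
    echo none
  fi
}
```

A call might look like `classify_outage "$(ingress_hosts develop)" app.develop.example.com "$(backend_ready_flags develop)"`, where `app.develop.example.com` is a placeholder hostname. Keeping the two evidence sources in separate functions mirrors the point of the investigation: the classification never mixes routing signals with application signals.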
In this incident, the ingress host is the strongest signal. It was changed for the wrong environment, which explains the edge failure faster than the backend rollout noise does.
Containment
Containment is narrow on purpose to restore stability as quickly as possible.
Containment steps:
- Revert the Ingress change only: Keep the backend image as is for a moment to avoid adding more noise.
- Reconcile: Let the GitOps path (Flux) reconcile the manifest back to the correct host.
- Confirm Routing: Verify that traffic is flowing again and the `502 Bad Gateway` is gone.
- Evaluate Separately: Only after traffic is stable, evaluate the backend image update separately.
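
The containment steps above might be scripted roughly as follows. Treat this as a hedged sketch under stated assumptions: the Ingress change is assumed to live in its own Git commit (so reverting it leaves the backend image untouched), the commit SHA, Flux Kustomization name, and URL are placeholders you pass in, and `contain`, `http_status`, and `routing_restored` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and arguments are illustrative assumptions.

# http_status <url>: print only the HTTP status code returned for a URL.
# curl prints 000 when no HTTP response was received at all.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# routing_restored <status>: fail on gateway-level errors or no response,
# succeed on anything else (even a 404 proves the edge reached a backend).
routing_restored() {
  case "$1" in
    000|502|503|504) return 1 ;;
    *) return 0 ;;
  esac
}

# contain <ingress_commit_sha> <flux_kustomization> <url>
# Reverts only the Ingress commit, lets Flux reconcile, then checks the edge.
contain() {
  git revert --no-edit "$1" || return 1
  git push || return 1
  flux reconcile kustomization "$2" --with-source || return 1
  routing_restored "$(http_status "$3")"
}
```

The backend image is deliberately absent from `contain`: the script touches one change and verifies one symptom, which keeps a single clean rollback path.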
The goal is to restore one clean rollback path. Do not “fix everything at once” during the incident.
Pause and Predict: What automated guardrail would have prevented this incident entirely?