Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • backend-alerts.yaml
  • servicemonitor.yaml

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Incident Hook

Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps between pods and log streams blindly.

Result: Time is lost because responders cannot see the causal path across service boundaries.

Observed Symptoms

What the team sees first:

  • Metrics clearly show a user-facing problem (latency/error spikes).
  • Logs contain noise but not a clean causal path.

The “Wall of Noise” (Unstructured Logs):

# How logs look WITHOUT correlation:
2026-04-12 10:05:01 ERROR failed to process request
2026-04-12 10:05:02 INFO  user login success
2026-04-12 10:05:03 ERROR database timeout
# ❌ Problem: Which error belongs to which user request? We can't tell.

The incident is not a lack of telemetry volume; it is a lack of correlation.
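
To make the contrast concrete, here is a minimal Python sketch of the same three events once every record carries a trace_id. The JSON field names and the trace_id values ("trace-a", "trace-b") are hypothetical and invented for illustration, as is the assumption that both errors belong to one request; none of this is taken from the lab services.

```python
import json
from collections import defaultdict

# Hypothetical structured log lines. The trace_id values are invented, and
# assigning both errors to "trace-a" is an assumption made for illustration.
RAW_LOGS = [
    '{"ts": "2026-04-12T10:05:01Z", "level": "ERROR", "msg": "failed to process request", "trace_id": "trace-a"}',
    '{"ts": "2026-04-12T10:05:02Z", "level": "INFO", "msg": "user login success", "trace_id": "trace-b"}',
    '{"ts": "2026-04-12T10:05:03Z", "level": "ERROR", "msg": "database timeout", "trace_id": "trace-a"}',
]


def group_by_trace(lines):
    """Return {trace_id: [records]} keeping input order within each trace."""
    grouped = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        grouped[record["trace_id"]].append(record)
    return dict(grouped)


def failing_traces(grouped):
    """Return the trace_ids that contain at least one ERROR record."""
    return [
        trace_id
        for trace_id, records in grouped.items()
        if any(r["level"] == "ERROR" for r in records)
    ]


if __name__ == "__main__":
    grouped = group_by_trace(RAW_LOGS)
    for trace_id in failing_traces(grouped):
        # Print every message that belongs to a failing request, in order.
        print(trace_id, [r["msg"] for r in grouped[trace_id]])
```

With a shared identifier, "which error belongs to which request?" becomes a simple group-by instead of guesswork across log streams.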

Confusion Phase

Every signal says something different at first glance. The real questions are:

  • Which request path is failing?
  • Do the symptom, trace path, and log evidence all point to the same cause?

Application Operational Contract

Observability is not something we add after the application is “done.” It is part of the operational contract required for safe Kubernetes operations. A production-ready app must provide:

  • Readiness & Liveness Probes for health checks.
  • Graceful Shutdown to prevent request drops during rollout.
  • Prometheus Metrics for symptom detection.
  • OpenTelemetry Traces with cross-boundary propagation.
  • Structured Logs with trace_id and span_id.
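
The last two items of the contract can be sketched together. The Python below uses only the standard library to show the mechanics: it reads a W3C Trace Context traceparent header, keeps the incoming trace_id while creating a new span_id, and emits JSON logs that carry both. In practice an OpenTelemetry SDK would normally handle this; the helper names (parse_traceparent, start_span, outgoing_traceparent, JsonFormatter) are hypothetical and not part of any library or of the lab code.

```python
import json
import logging
import re
import secrets

# W3C Trace Context, version 00: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if missing or malformed."""
    match = TRACEPARENT_RE.match(header or "")
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return trace_id, parent_span_id, flags


def start_span(incoming_header=None):
    """Continue the caller's trace if one was propagated, else start a new one."""
    parsed = parse_traceparent(incoming_header)
    return {
        "trace_id": parsed[0] if parsed else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),  # every hop gets its own span_id
        "flags": parsed[2] if parsed else "01",
    }


def outgoing_traceparent(span):
    """Build the header to attach to downstream calls made within this span."""
    return f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes trace_id and span_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        })


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("backend")
    log.addHandler(handler)

    # Example header value; a real one arrives on the incoming request.
    span = start_span("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    log.error("database timeout", extra={"trace_id": span["trace_id"], "span_id": span["span_id"]})
    print("send downstream:", outgoing_traceparent(span))
```

The point of the design is that trace_id stays constant across service boundaries while span_id changes per hop, so a log line can be tied to both the overall request and the exact step that emitted it.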

What AI Would Propose (Brave Junior):

  • “Check logs only and restart pods quickly.”
  • “Turn sampling up everywhere permanently.”
  • “Skip propagation; metrics are enough.”

Pause and Predict: Before reading the investigation, write down your top 3 hypotheses. What would you check first?