Chapter 09: Observability (Metrics, Logs, Traces)

Why This Chapter Exists

Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:

metrics for symptom detection
traces for path analysis
logs for evidence

Scope Decision (MVP)

No in-cluster OpenTelemetry Collector in this phase.
Frontend and backend export telemetry directly to Uptrace.
Target investigation path: frontend -> backend today, extended to the database once a DB layer is introduced.
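
A minimal sketch of what direct export without a collector might look like at the configuration level, assuming the services use OpenTelemetry SDKs that honor the standard `OTEL_EXPORTER_OTLP_*` environment variables. The default endpoint URL, the `uptrace-dsn` header name, and the helper name `otlp_env` are assumptions for illustration, not values taken from this repo; verify them against the Uptrace project settings.

```python
def otlp_env(service_name: str, dsn: str,
             endpoint: str = "https://otlp.uptrace.dev") -> dict:
    """Build OTLP exporter settings for direct export (no in-cluster collector).

    The variable names are the standard OpenTelemetry SDK ones. The endpoint
    default and the `uptrace-dsn` header are assumptions to verify.
    """
    if not dsn:
        # Fail fast instead of silently exporting nothing.
        raise ValueError("telemetry DSN is empty; inject it from a secret")
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        # The header value carries the credential, so never log or commit it.
        "OTEL_EXPORTER_OTLP_HEADERS": f"uptrace-dsn={dsn}",
    }
```

In practice the `dsn` argument would come from a secret injected into the process environment at deploy time, so the same settings apply to both frontend and backend without the credential appearing in the repository.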

References:

The Incident Hook

Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps between pods and logs blindly. With baseline observability in place, the on-call engineer can narrow down the cause in minutes.

Guardrails

No telemetry credentials in plaintext Git.
No logs-only debugging; always pivot through traces.
Tie every rollback decision to evidence: metrics + traces + logs.
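
To illustrate pivoting from a trace to its logs, the sketch below filters structured log lines by `trace_id`. It assumes the backend writes one JSON object per line with `trace_id`, `level`, and `msg` fields; that log shape, the function name `logs_for_trace`, and the sample records are hypothetical and may not match the real services.

```python
import json
from typing import Iterable, Iterator


def logs_for_trace(lines: Iterable[str], trace_id: str) -> Iterator[dict]:
    """Yield parsed log records whose trace_id matches the given trace."""
    wanted = trace_id.lower()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        if str(record.get("trace_id", "")).lower() == wanted:
            yield record


# Made-up sample data, for illustration only.
SAMPLE_LOGS = [
    '{"level": "info", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "msg": "request started"}',
    '{"level": "error", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "msg": "upstream timeout"}',
    '{"level": "error", "trace_id": "00000000000000000000000000000001", "msg": "unrelated failure"}',
    "plain text line without structure",
]
```

Starting from the trace ID of a slow or failed trace, `logs_for_trace(SAMPLE_LOGS, "4bf92f3577b34da6a3ce929d0e0e4736")` returns only the two records for that request, so the error log is read in the context of the trace rather than on its own.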

Repo Mapping

Service and platform references:

Lab Files

lab.md
runbook-incident-debug.md
sli-slo.md
quiz.md

Done When

learner can trigger and find one end-to-end trace from frontend to backend
learner can match backend error log by trace_id
learner can run the incident workflow: metrics -> traces -> logs -> action
learner can explain backend availability SLI/SLO and validate burn-rate alerts
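
As a worked example for the burn-rate item, the sketch below computes an error-budget burn rate for an availability SLI and applies a two-window paging check. The SLO target, window choice, and the 14.4 threshold are assumptions borrowed from common multi-window alerting practice, not values defined by this chapter; the actual targets belong in sli-slo.md.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return the burn rate: observed error ratio / allowed error ratio.

    A value of 1.0 means the error budget is being spent exactly at the
    rate that would exhaust it at the end of the SLO period.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic, so no budget is consumed
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window: float, short_window: float,
                threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    The short window confirms the problem is still ongoing, which keeps
    the alert from firing long after a brief spike has ended.
    """
    return long_window >= threshold and short_window >= threshold
```

For instance, 30 failed requests out of 1000 against an assumed 99% availability target gives `burn_rate(30, 1000, 0.99)` of roughly 3, meaning the budget is being consumed about three times faster than the SLO allows.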
