Chapter 09: Observability (Metrics, Logs, Traces)
Why This Chapter Exists
Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:
- metrics for symptom detection
- traces for path analysis
- logs for evidence
Scope Decision (MVP)
- No in-cluster OpenTelemetry Collector in this phase.
- Frontend and backend export telemetry directly to Uptrace.
- Target investigation path: frontend -> backend now, extended to -> database when the DB layer is introduced.
The Incident Hook
Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps between pods and logs blindly. With baseline observability, the on-call engineer can narrow down the cause in minutes.
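The pivot in this scenario, from a failing request to the log line that explains it, can be sketched with in-memory records. This is an illustration only: the `Span` and `LogLine` shapes, the field names, and the sample values are hypothetical and are not taken from the repo or from Uptrace.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str
    service: str
    status_code: int
    duration_ms: float


@dataclass(frozen=True)
class LogLine:
    trace_id: str
    level: str
    message: str


def failing_trace_ids(spans: list[Span], service: str) -> list[str]:
    """Return trace IDs that have a 5xx span in `service`, in first-seen order."""
    seen: dict[str, None] = {}
    for span in spans:
        if span.service == service and span.status_code >= 500:
            seen.setdefault(span.trace_id, None)
    return list(seen)


def logs_for_trace(logs: list[LogLine], trace_id: str) -> list[LogLine]:
    """Return error-level log lines that carry the given trace ID."""
    return [line for line in logs if line.trace_id == trace_id and line.level == "error"]


# Hypothetical sample data standing in for what a tracing backend
# and a log store would return during an incident.
SPANS = [
    Span("aaa1", "frontend", 200, 35.0),
    Span("aaa1", "backend", 200, 20.0),
    Span("bbb2", "frontend", 502, 910.0),
    Span("bbb2", "backend", 500, 880.0),
]
LOGS = [
    LogLine("aaa1", "info", "request ok"),
    LogLine("bbb2", "error", "panic recovered in handler"),
]

if __name__ == "__main__":
    for trace_id in failing_trace_ids(SPANS, "backend"):
        for line in logs_for_trace(LOGS, trace_id):
            print(trace_id, line.message)
```

The point of the sketch is the join key: once every span and log line carries the same `trace_id`, moving from symptom to evidence is a lookup rather than a search across pods.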
Guardrails
- No telemetry credentials in plaintext Git.
- No logs-only debugging; always pivot through traces.
- Keep rollback decision tied to evidence: metrics + traces + logs.
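One way to honor the first guardrail is to read the telemetry credential from the environment at startup and fail fast when it is missing. The sketch below assumes the credential is supplied in a `UPTRACE_DSN` variable (for example, injected from a Kubernetes Secret) and uses the standard OpenTelemetry `OTEL_SERVICE_NAME` variable; the function names and the exact variable used by this repo are assumptions.

```python
import os


class MissingTelemetryConfig(RuntimeError):
    """Raised when required telemetry settings are absent from the environment."""


def load_exporter_settings(env=None):
    """Read exporter settings from the environment instead of from files in Git.

    `env` defaults to os.environ; passing a plain dict makes the function testable.
    """
    env = os.environ if env is None else env
    dsn = env.get("UPTRACE_DSN", "").strip()  # assumed variable name
    if not dsn:
        raise MissingTelemetryConfig(
            "UPTRACE_DSN is not set; inject it from a secret, do not commit it"
        )
    return {
        "dsn": dsn,
        # "unknown_service" is the OpenTelemetry default when no name is configured.
        "service_name": env.get("OTEL_SERVICE_NAME", "unknown_service"),
    }


def redact(dsn):
    """Return a placeholder that is safe to print in startup logs."""
    return f"<redacted: {len(dsn)} chars>"


if __name__ == "__main__":
    try:
        settings = load_exporter_settings()
    except MissingTelemetryConfig as exc:
        raise SystemExit(str(exc))
    print("telemetry enabled for", settings["service_name"], redact(settings["dsn"]))
```

Failing at startup, rather than silently running without telemetry, keeps a misconfigured deployment from reaching production with no signals.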
Repo Mapping
Service and platform references:
- Frontend telemetry init
- Frontend manual spans store
- Frontend chaos view instrumentation
- Backend telemetry package
- Backend trace/log correlation and panic endpoint
- Frontend deployment manifest
- Backend deployment manifest
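To make the idea of backend trace/log correlation concrete, here is a minimal stdlib-only sketch, not the repo's implementation. It extracts the trace ID from a W3C `traceparent` header and stamps it on a JSON error log. An OpenTelemetry SDK would normally take the ID from the active span context instead; the function names and log field names here are hypothetical, and only the four-field `traceparent` form is handled.

```python
import json
import re

# W3C Trace Context `traceparent`: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def trace_id_from_traceparent(header):
    """Return the 32-hex trace ID, or None if the header is missing or invalid."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id


def error_log_line(message, traceparent):
    """Build a JSON error log line that carries the request's trace ID."""
    record = {
        "level": "error",
        "msg": message,
        "trace_id": trace_id_from_traceparent(traceparent),
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(error_log_line("panic recovered in handler", header))
```

With the trace ID present as a structured field, an error log can be found by exact match on `trace_id` instead of by timestamp guessing.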
Lab Files
- lab.md
- runbook-incident-debug.md
- sli-slo.md
- quiz.md
Done When
- learner can trigger and find one end-to-end trace from frontend to backend
- learner can match backend error log by trace_id
- learner can run incident workflow metrics -> traces -> logs -> action
- learner can explain backend availability SLI/SLO and validate burn-rate alerts
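For the burn-rate item, the arithmetic can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). A common multi-window rule pages only when both a long and a short window exceed the threshold. The 99.9% target, the 14.4 threshold, and the window sizes used here are illustrative assumptions, not values taken from sli-slo.md.

```python
def burn_rate(total, failed, slo_target):
    """Return how fast the error budget is being consumed.

    1.0 means the budget would last exactly the SLO period;
    higher values mean it runs out proportionally sooner.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total <= 0:
        return 0.0  # no traffic, nothing burned
    error_ratio = failed / total
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn faster than `threshold`.

    Each window is a (total_requests, failed_requests) tuple. The short
    window keeps the alert from firing after the problem has already stopped.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    slo = 0.999  # assumed availability target
    one_hour = (100_000, 2_000)  # hypothetical request counts
    five_min = (10_000, 300)
    print("page on-call:", should_page(one_hour, five_min, slo))
```

Validating an alert then amounts to injecting a known failure rate (for example, through the backend panic endpoint) and checking that the computed burn rate crosses the threshold when expected.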