Chapter 09: Observability (Metrics, Logs, Traces)

Why This Chapter Exists

Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:

  • metrics for symptom detection
  • traces for path analysis
  • logs for evidence

Scope Decision (MVP)

  • No in-cluster OpenTelemetry Collector in this phase.
  • Frontend and backend export telemetry directly to Uptrace.
  • Target investigation path: frontend -> backend today, extended to the database once a DB layer is introduced.

References:

The Incident Hook

Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps blindly between pods and logs. With baseline observability, on-call narrows down the cause in minutes.

Guardrails

  • No telemetry credentials in plaintext Git.
  • No logs-only debugging; always pivot through traces.
  • Keep rollback decision tied to evidence: metrics + traces + logs.
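
The first guardrail can be illustrated with a minimal sketch that builds exporter settings from the process environment only, so no credential lives in Git. OTEL_SERVICE_NAME and OTEL_EXPORTER_OTLP_HEADERS are standard OpenTelemetry SDK variables; the UPTRACE_DSN variable name, the uptrace-dsn header key, and the "backend" default are assumptions here, not taken from this chapter's lab files.

```python
import os


def uptrace_otlp_settings(env=os.environ):
    """Build OTLP exporter settings from environment variables only.

    Assumption: the DSN is injected at deploy time (for example from a
    Kubernetes Secret) under the name UPTRACE_DSN.
    """
    dsn = env.get("UPTRACE_DSN", "").strip()
    if not dsn:
        # Fail fast instead of silently running without telemetry.
        raise RuntimeError("UPTRACE_DSN is not set")
    return {
        # "backend" is a placeholder default service name.
        "OTEL_SERVICE_NAME": env.get("OTEL_SERVICE_NAME", "backend"),
        # Assumed header key for passing the DSN to Uptrace over OTLP.
        "OTEL_EXPORTER_OTLP_HEADERS": f"uptrace-dsn={dsn}",
    }
```

Failing at startup when the variable is missing keeps a misconfigured pod from running "blind", which would otherwise only surface during an incident.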

Repo Mapping

Service and platform references:

Lab Files

  • lab.md
  • runbook-incident-debug.md
  • sli-slo.md
  • quiz.md

Done When

  • learner can trigger and find one end-to-end trace from frontend to backend
  • learner can match a backend error log to its trace by trace_id
  • learner can run the incident workflow: metrics -> traces -> logs -> action
  • learner can explain the backend availability SLI/SLO and validate burn-rate alerts
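
Matching a backend error log by trace_id, as listed above, could look roughly like the following sketch. It assumes the backend writes one JSON object per log line with "level" and "trace_id" fields; the field names and the logs_for_trace helper are hypothetical and may differ from the real log schema.

```python
import json


def logs_for_trace(lines, trace_id, level="error"):
    """Return parsed log records that belong to one trace at a given level."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip plain-text lines mixed into the stream
        if not isinstance(record, dict):
            continue  # skip valid JSON that is not an object
        same_trace = record.get("trace_id") == trace_id
        same_level = str(record.get("level", "")).lower() == level
        if same_trace and same_level:
            matches.append(record)
    return matches
```

In practice the same filter is usually applied in the log backend's query language; the point is that the trace_id copied from the trace view is the only join key needed.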

Lab: Baseline Observability with Uptrace (No In-Cluster Collector)

  • frontend creates spans for user actions
  • backend receives trace context and emits correlated logs
  • Uptrace shows trace chain and related service signals
  • Prometheus alert path is connected to the same incident workflow …
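
To make "backend receives trace context and emits correlated logs" concrete, here is a minimal sketch that assumes W3C Trace Context propagation (the traceparent header, which OpenTelemetry uses by default). The parse_traceparent and correlated_log helpers are hypothetical; a real service would normally rely on the OpenTelemetry SDK's propagator rather than parsing the header by hand.

```python
import json
import re

# version-traceid-parentid-flags, lowercase hex, as defined for version 00.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def parse_traceparent(header):
    """Parse a traceparent header value; return None if it is invalid."""
    m = _TRACEPARENT.fullmatch(header.strip())
    if m is None:
        return None
    version, trace_id, span_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def correlated_log(level, msg, ctx):
    """Render one JSON log line that carries the trace identifiers."""
    return json.dumps(
        {"level": level, "msg": msg,
         "trace_id": ctx["trace_id"], "span_id": ctx["span_id"]}
    )
```

A log line produced this way can be found from the trace view by searching for the same trace_id.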

Quiz: Chapter 09 (Observability)

In this MVP, where is telemetry exported from?

  A) only in-cluster OTel collector
  B) directly from frontend/backend to Uptrace
  C) only backend exports telemetry

What header set is required for end-to-end context …

Runbook: Incident Debug (Metrics -> Traces -> Logs)

  • elevated latency and/or sporadic 5xx

This runbook is optimized for the current MVP setup:

  • direct export to Uptrace from frontend/backend
  • no in-cluster OTel collector

Inputs

  • environment (develop, staging, or production) …

SLI/SLO Spec: Chapter 09 Baseline

  • backend HTTP API

Environment scope: develop, staging, production

Indicators (SLIs)

Availability SLI

Definition: ratio of successful requests (non-5xx) to total requests.
PromQL: 1 - ( …
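
The availability definition above, and the burn rate that alerts are built on, can be checked with plain arithmetic. The sketch below is not the chapter's PromQL; it works on request counts for a single window, and the 99.5% target used in the usage note is a hypothetical value, since the actual SLO target is not shown here.

```python
def availability(total, errors_5xx):
    """Ratio of successful (non-5xx) requests to total requests."""
    if total == 0:
        return 1.0  # no traffic: treat the window as fully available
    return 1.0 - errors_5xx / total


def burn_rate(total, errors_5xx, slo_target):
    """How fast the error budget is consumed relative to the allowed rate.

    1.0 means the budget is spent exactly at the pace the SLO permits;
    values above 1.0 mean it would run out before the SLO period ends.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    observed_error_rate = 1.0 - availability(total, errors_5xx)
    return observed_error_rate / error_budget
```

For example, with a hypothetical 99.5% target, 10 failed requests out of 1000 give an error rate of 1% against a 0.5% budget, so the burn rate is about 2.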