Chapter 10: Observability (Metrics, Logs, Traces)
Why This Chapter Exists
Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:
- metrics for symptom detection
- traces for path analysis
- logs for evidence
Scope Decision
- OTEL Collector DaemonSet is active for log shipping (filelog → Uptrace OTLP endpoint).
- Frontend and backend export traces directly to Uptrace (no collector proxy for traces).
- Logs are collected from /var/log/pods/ by the Collector and shipped to Uptrace for correlation with traces.
- Target investigation path: frontend -> backend now, -> database when the DB layer is introduced.
Incident Hook
Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps blindly between pods and logs. With this baseline observability in place, on-call narrows the cause in minutes.
What AI Would Propose (Brave Junior)
- “Check logs only and restart pods quickly.”
- “Turn up sampling to 100% everywhere permanently.”
- “Skip trace propagation; we can still debug from metrics.”
Why this sounds reasonable:
- fastest path to immediate action
- fewer telemetry configuration steps
Why This Is Dangerous
- Logs-only debugging misses the causal path across services.
- Uncontrolled sampling creates high cost and noise without better decisions.
- Missing propagation breaks correlation and slows incident resolution.
Guardrails That Stop It
- No telemetry credentials in plaintext Git.
- No logs-only debugging; always pivot through traces.
- Keep rollback decision tied to evidence: metrics + traces + logs.
3 Signals, 1 Incident Exercise
For one controlled incident, capture all three:
- Metrics symptom (for example latency/error-rate spike).
- Trace path showing failing route/span chain.
- Log evidence with a matching trace_id.
This exercise is successful only when all three artifacts point to the same causal path.
Repo Mapping
Service and platform references:
- Frontend telemetry init
- Frontend manual spans store
- Frontend chaos view instrumentation
- Backend telemetry package
- Backend trace/log correlation and panic endpoint
- Frontend deployment manifest
- Backend deployment manifest
Safe Workflow (Step-by-Step)
- Start from symptom in metrics (latency/error-rate/request-rate anomaly).
- Pivot to traces and isolate affected route/service path.
- Correlate with backend/frontend logs using trace_id.
- Decide action only after evidence from at least two signals.
- Validate recovery in metrics and confirm trace/log behavior returned to baseline.
Definition of Done: Evidence, Not Assumptions
Incident triage is complete only when the responder can explain:
- what failed
- where it failed
- why it failed
using correlated evidence (metrics + traces + logs), not guesses.
The Third Signal: Structured Logging
Metrics tell you something is wrong. Traces show you the path. Logs explain why.
Why Logging Matters Alongside Metrics and Traces
Metrics detect symptoms (error rate spike), traces isolate the path (which service, which endpoint), but logs provide the evidence: error messages, stack traces, request payloads, and decision context.
Without structured logs, the last mile of root cause analysis depends on guesswork.
Structured vs Unstructured Logging
Unstructured (plaintext):
2024-01-15 10:23:45 ERROR failed to process request for user 123
Structured (JSON):
{"time":"2024-01-15T10:23:45Z","level":"error","msg":"failed to process request","user_id":123,"trace_id":"abc123","span_id":"def456","error":"connection timeout"}
Structured logging enables:
- machine-parseable queries (filter by level, user, trace_id)
- correlation with traces via the trace_id field
- aggregation and alerting on specific error patterns
- consistent field names across services
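The JSON example above can be produced with a small formatter. The following is a minimal sketch using Python's stdlib logging; the field names mirror the example, and the trace_id/span_id values are hard-coded for illustration (in a real service they would come from the active span context).

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Fields passed via `extra=` are attached to the record; copy known ones.
        for key in ("user_id", "trace_id", "span_id", "error"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("backend")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Illustrative values only; real code would read them from the span context.
logger.error(
    "failed to process request",
    extra={"user_id": 123, "trace_id": "abc123", "span_id": "def456",
           "error": "connection timeout"},
)
```

The same pattern applies in any language: one JSON object per line, stable field names, trace context attached to every record.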
Log Levels and Operational Meaning
| Level | Use | Operational Response |
|---|---|---|
| debug | Development diagnostics | Disabled in production |
| info | Normal operations (request served, job completed) | No action needed |
| warn | Degraded but functional (retry succeeded, fallback used) | Monitor for escalation |
| error | Failed operation requiring attention | Investigate, may page on-call |
| fatal | Process cannot continue | Immediate response, process restart |
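The "disabled in production" row from the table reduces to a logger-level threshold set at startup. A sketch with stdlib logging, assuming an APP_ENV environment variable (the variable name is illustrative, not from this repo):

```python
import logging
import os


def configure_logging() -> logging.Logger:
    """Drop debug noise in production; keep it in development."""
    env = os.getenv("APP_ENV", "production")  # assumed env var name
    level = logging.DEBUG if env == "development" else logging.INFO
    logger = logging.getLogger("backend")
    logger.setLevel(level)
    return logger


logger = configure_logging()
logger.debug("cache miss details")  # suppressed unless APP_ENV=development
logger.info("request served")       # emitted in both environments
```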
Centralized Logging Architecture
Production logging follows a pipeline:
Application → stdout/stderr → OTEL Collector (DaemonSet) → Uptrace OTLP endpoint
Uptrace as single pane of glass for traces + logs:
- Already in use for traces — zero new vendor, zero new UI
- ClickHouse backend = fast queries on high-cardinality data (namespace, pod, container, trace_id)
- Native trace_id ↔ log correlation in one UI — click from log to trace and back
- No in-cluster log storage to manage (ClickHouse runs inside Uptrace Cloud)
- Metrics remain in Prometheus/Grafana — each tool does what it does best
OTEL Collector DaemonSet configuration (conceptual):
receivers:
filelog:
include: [/var/log/pods/**/*.log]
operators:
- type: container # extracts k8s metadata from file path
id: container-parser # namespace, pod name, container name
k8sevents: # optional: collect Kubernetes Events
namespaces: [develop, observability]
processors:
batch:
send_batch_size: 1024
timeout: 5s
exporters:
otlp/uptrace:
endpoint: https://api.uptrace.dev:4317
headers:
uptrace-dsn: "${UPTRACE_DSN}"
service:
pipelines:
logs:
receivers: [filelog, k8sevents]
processors: [batch]
exporters: [otlp/uptrace]
Key points:
- filelog receiver reads container logs from the node filesystem — no sidecar needed
- Kubernetes metadata (namespace, pod, container) is extracted automatically from log file paths
- k8sevents receiver (optional) captures cluster events like pod restarts, OOMKills, and scheduling failures
- batch processor reduces export overhead
- otlp exporter sends to Uptrace using the same DSN already configured for traces
Log-Based Alerting Patterns
Complement metric-based alerts with log-based rules:
- Alert on error-level log rate exceeding a threshold
- Alert on specific error patterns (OOM, connection refused, auth failure)
- Alert on absence of expected log entries (heartbeat logs missing)
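Each of these rules reduces to counting matching records over an evaluation window. A hypothetical evaluator sketch — the alert names, threshold, and match patterns are illustrative assumptions, not from any real alerting product:

```python
from collections import Counter


def evaluate_log_alerts(entries, error_rate_threshold=10, expect_heartbeat=True):
    """Return the alert names firing for one evaluation window.

    `entries` is an iterable of parsed structured-log dicts for the window.
    """
    alerts = []
    levels = Counter(e.get("level") for e in entries)

    # Rule 1: error-level log rate above threshold.
    if levels["error"] > error_rate_threshold:
        alerts.append("high_error_log_rate")

    # Rule 2: specific known failure patterns in the error field.
    patterns = ("OOM", "connection refused", "auth failure")
    if any(p in e.get("error", "") for e in entries for p in patterns):
        alerts.append("known_failure_pattern")

    # Rule 3: absence of expected entries (heartbeat missing).
    if expect_heartbeat and not any(e.get("msg") == "heartbeat" for e in entries):
        alerts.append("heartbeat_missing")

    return alerts
```

In practice these rules would live in the log backend's alerting layer rather than application code; the sketch only shows the logic each rule encodes.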
Correlation: trace_id Enables Full Drill-Down
The key to unified observability is the trace_id field in logs:
Metric alert → Find affected traces → Filter logs by trace_id → Read error context
When the backend emits structured logs with trace_id, every log line becomes part of the trace story. This enables the metrics → traces → logs drill-down workflow from the incident hook.
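The final pivot of that drill-down, filtering logs by trace_id, is a plain equality filter over structured log lines. A sketch, assuming the one-JSON-object-per-line format from the example earlier in this chapter:

```python
import json


def logs_for_trace(log_lines, trace_id):
    """Return parsed log entries belonging to one trace."""
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines that cannot be parsed
        if entry.get("trace_id") == trace_id:
            entries.append(entry)
    return entries
```

This is what the log backend does behind a trace_id filter in its UI; unstructured lines simply drop out of the result, which is exactly why structured logging is a prerequisite for the drill-down.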
OpenTelemetry Integration
The OTEL Collector DaemonSet runs one pod per node, reads container logs from /var/log/pods/, and ships them to Uptrace via OTLP. This is the same Collector pattern used for traces — extending it to logs requires adding the filelog receiver and a logs pipeline.
Resource attributes added automatically:
- k8s.namespace.name, k8s.pod.name, k8s.container.name (from file path)
- k8s.node.name (from DaemonSet scheduling)
These attributes enable filtering in Uptrace by namespace, pod, or container — the same dimensions used in Prometheus queries and Grafana dashboards.
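The container operator derives those attributes from the log path itself: kubelet writes pod logs under /var/log/pods/&lt;namespace&gt;_&lt;pod&gt;_&lt;uid&gt;/&lt;container&gt;/&lt;restart&gt;.log, so extraction is path parsing. A simplified sketch of that parsing (the real operator handles more cases):

```python
from pathlib import PurePosixPath


def k8s_attrs_from_path(log_path: str) -> dict:
    """Extract k8s resource attributes from a kubelet pod log path.

    Assumes the layout /var/log/pods/<ns>_<pod>_<uid>/<container>/<n>.log.
    """
    parts = PurePosixPath(log_path).parts
    # parts: ('/', 'var', 'log', 'pods', '<ns>_<pod>_<uid>', '<container>', '<n>.log')
    namespace, pod, _uid = parts[4].split("_", 2)
    return {
        "k8s.namespace.name": namespace,
        "k8s.pod.name": pod,
        "k8s.container.name": parts[5],
    }
```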
Lab Files
- lab.md
- runbook-incident-debug.md
- sli-slo.md
- quiz.md
Done When
- learner can trigger and find one end-to-end trace from frontend to backend
- learner can match a backend error log by trace_id
- learner can run the incident workflow metrics -> traces -> logs -> action
- learner can explain the backend availability SLI/SLO and validate burn-rate alerts