Chapter 10: Observability (Metrics, Logs, Traces)
Incident Hook
Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps blindly between pods and log streams. With baseline observability in place, the on-call engineer narrows the cause in minutes.
Observed Symptoms
What the team sees first:
- metrics clearly show a user-facing problem
- logs contain noise but not a clean causal path
- responders are tempted to restart pods before they understand the failing route
The problem is not a lack of telemetry volume. It is a lack of correlation.
Confusion Phase
Every signal says something different at first glance. That is normal.
The real question is:
- which request path is failing
- and whether the symptom, trace path, and log evidence all point to the same cause
Why This Chapter Exists
Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:
- metrics for symptom detection
- traces for path analysis
- logs for evidence
Scope Decision
This chapter teaches the current platform baseline, not every possible observability add-on.
- Metrics baseline: Prometheus + Grafana dashboards + PrometheusRule alerts.
- Trace baseline: frontend and backend export traces directly to Uptrace.
- Log baseline: backend emits structured logs with `trace_id` and `span_id`, and the lab workflow proves correlation with `kubectl logs` first.
- Alert routing baseline: Prometheus rules detect symptoms, then `k8s-ai-monitor` enriches and routes actionable alerts. Alertmanager is intentionally disabled in this stack.
- Centralized log shipping: optional extension. You can later wire Uptrace Logs, Vector, or another cloud backend, but this chapter does not depend on a specific log vendor.
- Target investigation path: `frontend -> backend` now, `-> database` when the DB-backed path is introduced.
The implementation snapshots later in this chapter show the exact ServiceMonitor, alert rules, and guardian deployment used in the current SafeOps baseline.
Application Operational Contract
This course does not treat observability as something added after the application is already “done.” It treats observability as part of the application contract required for safe Kubernetes operations.
A production-ready application in this course is expected to provide:
- readiness and liveness probes
- graceful shutdown on interrupt signals
- config and secret reload patterns where runtime updates matter
- Prometheus metrics for symptom detection
- OpenTelemetry traces with propagation across request boundaries
- structured logs that carry stable fields such as `trace_id` and `span_id`
- 12-factor configuration so behavior is explicit and environment-scoped
- safe packaging for Kubernetes delivery through manifests, Helm, Kustomize, or Timoni
- testable install paths and end-to-end validation in cluster-like environments
- signed images, SBOMs, provenance, and vulnerability scanning for supply chain evidence
The reference implementations for this course are ldbl/backend and ldbl/frontend.
Several of these patterns are inspired by podinfo, but the course expectation is broader:
an application should be observable, operable, recoverable, and safe to promote inside Kubernetes, not just runnable.
What AI Would Propose (Brave Junior)
- “Check logs only and restart pods quickly.”
- “Turn sampling up everywhere permanently.”
- “Skip propagation; metrics are enough.”
Why this sounds reasonable:
- fastest path to immediate action
- fewer telemetry configuration steps
Why This Is Dangerous
- logs-only debugging misses the causal path across services.
- uncontrolled sampling raises cost and noise without better decisions.
- missing propagation breaks correlation and slows incident resolution.
- acting on one signal alone increases the chance of the wrong rollback or restart.
Investigation
Treat observability as a drill-down path, not a bag of tools.
Safe investigation sequence:
- start from the metric symptom
- pivot to traces to isolate the failing path
- correlate logs by `trace_id`
- act only after at least two signals support the same explanation
Containment
Containment follows evidence:
- stabilize the failing dependency or route identified by traces
- verify the symptom clears in metrics
- confirm logs and traces return to expected baseline behavior
- record the exact signal path that made the diagnosis fast enough to trust
Guardrails That Stop It
- No telemetry credentials in plaintext Git.
- No debugging based on logs-only; always pivot through traces.
- Keep rollback decisions tied to evidence: metrics + traces + logs.
- Alert routing stays evidence-first: Prometheus detects, `k8s-ai-monitor` enriches, humans decide.
3 Signals, 1 Incident Exercise
For one controlled incident, capture all three:
- Metrics symptom (for example latency or error-rate spike).
- Trace path showing the failing route and span chain.
- Log evidence with matching `trace_id`.
This exercise is successful only when all three artifacts point to the same causal path.
Investigation Snapshots
Here is the ServiceMonitor used in the SafeOps system to turn backend metrics into Prometheus evidence.
Backend ServiceMonitor
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: backend
labels:
app: backend
spec:
selector:
matchLabels:
app: backend
endpoints:
- port: http
path: /metrics
interval: 30s
scrapeTimeout: 10s
Here are the alert rules that convert symptoms into evidence-first detection.
Backend alert rules
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: backend-alerts
namespace: observability
labels:
prometheus: kube-prometheus-stack
role: alert-rules
spec:
groups:
- name: backend.rules
interval: 30s
rules:
# High error rate alert
- alert: BackendHighErrorRate
expr: |
(
sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
/ clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
) > 0.05
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend service has high error rate"
description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# Critical error rate alert
- alert: BackendCriticalErrorRate
expr: |
(
sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
/ clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
) > 0.10
for: 2m
labels:
severity: critical
component: backend
annotations:
summary: "Backend service has critical error rate"
description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 10%)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# High latency alert (p95)
- alert: BackendHighLatency
expr: |
histogram_quantile(0.95,
sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
) > 1
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend service has high latency"
description: "Backend p95 latency is {{ $value }}s (threshold: 1s)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# Service down alert
- alert: BackendServiceDown
expr: up{job="backend"} == 0
for: 1m
labels:
severity: critical
component: backend
annotations:
summary: "Backend service is down"
description: "Backend service {{ $labels.instance }} is down"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# High memory usage
- alert: BackendHighMemoryUsage
expr: |
(
process_resident_memory_bytes{job="backend"}
/
1024 / 1024 / 1024
) > 0.8
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend service has high memory usage"
description: "Backend memory usage is {{ $value }}GB (threshold: 0.8GB)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# Too many goroutines
- alert: BackendHighGoroutines
expr: go_goroutines{job="backend"} > 10000
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend has too many goroutines"
description: "Backend has {{ $value }} goroutines (threshold: 10000)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# Pod restarts
- alert: BackendPodRestarting
expr: |
rate(kube_pod_container_status_restarts_total{
namespace=~"develop|staging|production",
pod=~"backend-.*"
}[15m]) > 0
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend pod is restarting frequently"
description: "Backend pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
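The error-rate rules above divide the 5xx rate by the total rate and guard against a zero denominator with `clamp_min`. The same arithmetic as a small Go sketch, showing when the warning and critical thresholds fire:

```go
package main

import "fmt"

// clampMin mirrors PromQL's clamp_min: it raises the denominator to a
// tiny floor so an idle service (zero traffic) yields ratio 0, not NaN.
func clampMin(v, min float64) float64 {
	if v < min {
		return min
	}
	return v
}

// errorRatio reproduces the alert expression:
// sum(rate(5xx)) / clamp_min(sum(rate(total)), 1e-9)
func errorRatio(errRate, totalRate float64) float64 {
	return errRate / clampMin(totalRate, 1e-9)
}

func main() {
	fmt.Println(errorRatio(6, 100) > 0.05)  // 6% errors -> warning fires: true
	fmt.Println(errorRatio(12, 100) > 0.10) // 12% errors -> critical fires: true
	fmt.Println(errorRatio(0, 0))           // idle service -> 0, no alert
}
```

Without the `clamp_min` guard, a service receiving no traffic would produce `0/0`, and the alert expression would evaluate to NaN instead of staying quiet.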
Here is the guardian deployment used to enrich and route actionable alerts.
k8s-ai-monitor deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: k8s-ai-monitor
labels:
app.kubernetes.io/name: k8s-ai-monitor
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: k8s-ai-monitor
template:
metadata:
labels:
app.kubernetes.io/name: k8s-ai-monitor
spec:
serviceAccountName: k8s-ai-monitor
securityContext:
fsGroup: 1000
imagePullSecrets:
- name: ghcr-credentials-docker
terminationGracePeriodSeconds: 30
containers:
- name: k8s-ai-monitor
image: ghcr.io/ldbl/k8s-ai-monitor:main # {"$imagepolicy": "observability:k8s-ai-monitor:tag"}
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8080
env:
- name: CLUSTER_NAME
value: safeops
- name: WATCH_NAMESPACES
value: production
- name: NON_PROD_NAMESPACES
value: develop,staging
- name: EXCLUDE_NAMESPACES
value: kube-system,kube-public,kube-node-lease,flux-system
- name: LOG_LEVEL
value: INFO
- name: LLM_PROVIDER
value: openai
- name: PROMETHEUS_URL
value: http://kube-prometheus-stack-prometheus.observability.svc.cluster.local:9090
- name: SQLITE_PATH
value: /data/k8s-ai-monitor.db
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: openai-api-key
optional: true
- name: ANTHROPIC_API_KEY
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: anthropic-api-key
optional: true
- name: SLACK_WEBHOOK_URL
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: slack-webhook-url
optional: true
- name: SLACK_WEBHOOK_URL_NONPROD
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: slack-webhook-url-nonprod
optional: true
- name: INTERNAL_TOKEN
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: internal-token
optional: true
- name: ELASTICSEARCH_URL
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: elasticsearch-url
optional: true
- name: ELASTICSEARCH_USER
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: elasticsearch-user
optional: true
- name: ELASTICSEARCH_PASSWORD
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: elasticsearch-password
optional: true
- name: SCANNER_CRITICAL_ENDPOINT_ENABLED
value: "true"
- name: ENDPOINT_INGRESS_SERVICE
value: traefik.traefik.svc.cluster.local
- name: SCANNER_BACKUP_ENABLED
value: "true"
volumeMounts:
- name: data
mountPath: /data
readinessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 10
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 20
periodSeconds: 20
resources:
requests:
cpu: 10m
memory: 64Mi
limits:
cpu: 100m
memory: 256Mi
volumes:
- name: data
persistentVolumeClaim:
claimName: k8s-ai-monitor-data
System Context
This chapter gives the rest of the course an evidence-first investigation path.
It becomes essential in:
- Chapter 12, where drills must be explained, not just survived
- Chapter 13, where guardian summaries depend on good source signals
- Chapter 14, where on-call actions should be justified by correlated evidence
Current Operating Model
Metrics: Fast Symptom Detection
Prometheus scrapes backend metrics from /metrics and evaluates alert rules.
Grafana provides the operator view for request rate, latency, error rate, saturation, and SLO burn.
Metrics answer the first question: is something wrong right now? They do not tell you the full causal path.
Traces: Request Path Isolation
Frontend and backend export traces directly to Uptrace. Propagation headers connect browser actions to backend work, so one request becomes one visible path.
Traces answer the second question: where is the request failing or slowing down?
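Propagation between frontend and backend typically rides on the W3C `traceparent` header. In the real services the OpenTelemetry SDK presumably handles this; the sketch below only illustrates how `trace_id` and `span_id` are carried inside that header.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header of the form
// "version-traceid-spanid-flags" and returns the trace and span IDs.
// Teaching sketch only; the OpenTelemetry SDK normally does this.
func parseTraceparent(h string) (traceID, spanID string, ok bool) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false
	}
	return parts[1], parts[2], true
}

func main() {
	tid, sid, ok := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(ok, tid, sid)
	// → true 4bf92f3577b34da6a3ce929d0e0e4736 00f067aa0ba902b7
}
```

The same `trace_id` that crosses the wire in this header is what the backend must stamp into its logs, which is exactly the correlation invariant the next section relies on.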
Logs: Evidence and Error Context
The backend emits structured logs with stable fields such as `time`, `level`, `msg`, `trace_id`, and `span_id`.
That makes `kubectl logs` useful immediately, even before you add a centralized log backend.
The important invariant is not the vendor. The important invariant is that logs carry the same `trace_id` you saw in the trace.
Alert Routing: Prometheus Detects, Guardian Routes
Prometheus rules detect the symptom.
`k8s-ai-monitor` consumes Prometheus metrics plus Kubernetes context and routes actionable alerts to webhook targets.
This stack does not rely on Alertmanager for final delivery. The guardian is the routing and enrichment layer.
Centralized Logging Is an Extension, Not the Lesson
Collector-based shipping, Vector, or a cloud log backend may be added later. Those are implementation choices.
This chapter focuses on the operator skill that survives every backend choice:
- detect in metrics
- isolate in traces
- prove in logs
Safe Workflow (Step-by-Step)
- Start from symptom in metrics: latency, error rate, request rate, or saturation anomaly.
- Pivot to traces and isolate the affected route and span chain.
- Correlate with backend or frontend logs using `trace_id`.
- Validate whether the detection path also produced the expected alert or guardian incident.
- Decide action only after evidence from at least two signals.
- Validate recovery in metrics and confirm trace/log behavior returned to baseline.
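The correlation step above can be done with plain `grep`, or with a small filter like this sketch that reads `kubectl logs` output on stdin. The `trace_id` field name is assumed to match the backend's JSON logs; the file name in the usage comment is hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// matchesTrace reports whether a JSON log line carries the given trace_id.
// Malformed lines are skipped rather than treated as errors.
func matchesTrace(line, traceID string) bool {
	var entry map[string]any
	if err := json.Unmarshal([]byte(line), &entry); err != nil {
		return false
	}
	return entry["trace_id"] == traceID
}

func main() {
	// Usage sketch: kubectl logs deploy/backend | go run filter.go <trace_id>
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: filter <trace_id>")
		os.Exit(1)
	}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if matchesTrace(sc.Text(), os.Args[1]) {
			fmt.Println(sc.Text())
		}
	}
}
```

Skipping unparseable lines matters in practice: pods often interleave non-JSON startup output with structured logs, and the filter should not abort on it.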
Definition of Done: Evidence, Not Assumptions
Incident triage is complete only when the responder can explain:
- what failed
- where it failed
- why it failed
using correlated evidence (metrics + traces + logs), not guesses.
Lab Files
- `lab.md`
- `runbook-incident-debug.md`
- `sli-slo.md`
- `quiz.md`
Done When
- learner can trigger and find one end-to-end trace from frontend to backend
- learner can match one backend log entry by `trace_id`
- learner can explain why the current alert path goes through `k8s-ai-monitor`
- learner can run the incident workflow `metrics -> traces -> logs -> action`
- learner can explain backend availability SLI/SLO and validate burn-rate alerts