Core Track: guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Artifacts

What You Will Produce

A reproducible lab result, quiz verification, and evidence of incident-safe operation.

Chapter 10: Observability (Metrics, Logs, Traces)

Why This Chapter Exists

Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:

  • metrics for symptom detection
  • traces for path analysis
  • logs for evidence

Scope Decision

  • OTEL Collector DaemonSet is active for log shipping (filelog → Uptrace OTLP endpoint).
  • Frontend and backend export traces directly to Uptrace (no collector proxy for traces); a setup sketch follows this list.
  • Logs are collected from /var/log/pods/ by the Collector and shipped to Uptrace for correlation with traces.
  • Target investigation path: frontend -> backend today, extending to the database once the DB layer is introduced.
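
To make the trace side concrete, here is a minimal sketch of direct OTLP export to Uptrace. It assumes a Python service instrumented with the OpenTelemetry SDK; the repo's actual frontend/backend setup may differ, and the service name is illustrative. The endpoint and DSN header mirror the collector configuration shown later in this chapter.

import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Direct export: spans go from the service straight to Uptrace, no Collector proxy in between.
exporter = OTLPSpanExporter(
    endpoint="https://api.uptrace.dev:4317",
    headers={"uptrace-dsn": os.environ["UPTRACE_DSN"]},  # DSN comes from the environment, never from Git
)

provider = TracerProvider(resource=Resource.create({"service.name": "backend"}))  # illustrative name
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)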

References:

Incident Hook

Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps between pods and logs blindly. With baseline observability in place, the on-call engineer narrows the cause to a specific service path within minutes.

What AI Would Propose (Brave Junior)

  • “Check logs only and restart pods quickly.”
  • “Turn up sampling to 100% everywhere permanently.”
  • “Skip trace propagation; we can still debug from metrics.”

Why this sounds reasonable:

  • fastest path to immediate action
  • fewer telemetry configuration steps

Why This Is Dangerous

  • logs-only debugging misses causal path across services.
  • uncontrolled sampling can create high cost/noise without better decisions.
  • missing propagation breaks correlation and slows incident resolution.

Guardrails That Stop It

  • No telemetry credentials in plaintext Git.
  • No logs-only debugging; always pivot through traces.
  • Keep rollback decision tied to evidence: metrics + traces + logs.

3 Signals, 1 Incident Exercise

For one controlled incident, capture all three:

  1. Metric symptom (for example, a latency or error-rate spike).
  2. Trace path showing failing route/span chain.
  3. Log evidence with matching trace_id.

This exercise is successful only when all three artifacts point to the same causal path.

Repo Mapping

Service and platform references:

Safe Workflow (Step-by-Step)

  1. Start from the symptom in metrics (latency, error-rate, or request-rate anomaly), as in the example query after this list.
  2. Pivot to traces and isolate affected route/service path.
  3. Correlate with backend/frontend logs using trace_id.
  4. Decide action only after evidence from at least two signals.
  5. Validate recovery in metrics and confirm trace/log behavior returned to baseline.
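
A hedged example of the step-1 symptom query in PromQL: the metric and label names below are illustrative, so substitute whatever the backend actually exposes.

# Share of 5xx responses over the last 5 minutes (error-rate symptom)
sum(rate(http_server_requests_total{job="backend", status=~"5.."}[5m]))
  /
sum(rate(http_server_requests_total{job="backend"}[5m]))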

Definition of Done: Evidence, Not Assumptions

Incident triage is complete only when the responder can explain:

  • what failed
  • where it failed
  • why it failed

using correlated evidence (metrics + traces + logs), not guesses.

The Third Signal: Structured Logging

Metrics tell you something is wrong. Traces show you the path. Logs explain why.

Why Logging Matters Alongside Metrics and Traces

Metrics detect symptoms (error rate spike), traces isolate the path (which service, which endpoint), but logs provide the evidence: error messages, stack traces, request payloads, and decision context.

Without structured logs, the last mile of root cause analysis depends on guesswork.

Structured vs Unstructured Logging

Unstructured (plaintext):

2024-01-15 10:23:45 ERROR failed to process request for user 123

Structured (JSON):

{"time":"2024-01-15T10:23:45Z","level":"error","msg":"failed to process request","user_id":123,"trace_id":"abc123","span_id":"def456","error":"connection timeout"}

Structured logging enables:

  • machine-parseable queries (filter by level, user, trace_id)
  • correlation with traces via the trace_id field (see the sketch after this list)
  • aggregation and alerting on specific error patterns
  • consistent field names across services
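
As one possible implementation, here is a minimal sketch of a JSON log formatter that injects the active trace context. It assumes a Python service using the standard logging module and the OpenTelemetry API; field names follow the example above, and the setup is illustrative rather than the repo's actual logging code.

import json
import logging

from opentelemetry import trace


class JsonFormatter(logging.Formatter):
    """Render each log record as single-line JSON with trace correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        ctx = trace.get_current_span().get_span_context()
        payload = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            # Correlation fields are only meaningful when a span is active.
            "trace_id": format(ctx.trace_id, "032x") if ctx.is_valid else None,
            "span_id": format(ctx.span_id, "016x") if ctx.is_valid else None,
        }
        if record.exc_info:
            payload["error"] = self.formatException(record.exc_info)
        return json.dumps(payload)


# Write to stdout/stderr so the node-level Collector can pick the lines up.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])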

Log Levels and Operational Meaning

Level | Use                                                      | Operational Response
debug | Development diagnostics                                  | Disabled in production
info  | Normal operations (request served, job completed)        | No action needed
warn  | Degraded but functional (retry succeeded, fallback used) | Monitor for escalation
error | Failed operation requiring attention                     | Investigate, may page on-call
fatal | Process cannot continue                                  | Immediate response, process restart

Centralized Logging Architecture

Production logging follows a pipeline:

Application → stdout/stderr → OTEL Collector (DaemonSet) → Uptrace OTLP endpoint

Uptrace as single pane of glass for traces + logs:

  • Already in use for traces — zero new vendor, zero new UI
  • ClickHouse backend = fast queries on high-cardinality data (namespace, pod, container, trace_id)
  • Native trace_id ↔ log correlation in one UI — click from log to trace and back
  • No in-cluster log storage to manage (ClickHouse runs inside Uptrace Cloud)
  • Metrics remain in Prometheus/Grafana — each tool does what it does best

OTEL Collector DaemonSet configuration (conceptual):

receivers:
  filelog:
    include: [/var/log/pods/**/*.log]
    operators:
      - type: container        # extracts k8s metadata from file path
        id: container-parser   # namespace, pod name, container name
  k8s_events:                  # optional: collect Kubernetes Events
    namespaces: [develop, observability]

processors:
  batch:
    send_batch_size: 1024
    timeout: 5s

exporters:
  otlp/uptrace:
    endpoint: https://api.uptrace.dev:4317
    headers:
      uptrace-dsn: "${UPTRACE_DSN}"

service:
  pipelines:
    logs:
      receivers: [filelog, k8s_events]
      processors: [batch]
      exporters: [otlp/uptrace]

Key points:

  • filelog receiver reads container logs from the node filesystem — no sidecar needed
  • Kubernetes metadata (namespace, pod, container) is extracted automatically from log file paths
  • k8s_events receiver (optional) captures cluster events like pod restarts, OOMKills, and scheduling failures
  • batch processor reduces export overhead
  • otlp exporter sends to Uptrace using the same DSN already configured for traces
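
To keep the DSN out of plaintext Git (per the guardrails above), the DaemonSet can inject UPTRACE_DSN from a Kubernetes Secret. A minimal sketch of the relevant container-spec excerpt; the Secret name and key are illustrative:

# excerpt from the Collector DaemonSet container spec
env:
  - name: UPTRACE_DSN
    valueFrom:
      secretKeyRef:
        name: uptrace-credentials   # illustrative Secret name
        key: dsn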

Log-Based Alerting Patterns

Complement metric-based alerts with log-based rules:

  • Alert on the error log rate exceeding a threshold (see the sketch after this list)
  • Alert on specific error patterns (OOM, connection refused, auth failure)
  • Alert on absence of expected log entries (heartbeat logs missing)
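
One possible shape for the first pattern, sketched as a Prometheus alerting rule. It assumes error-level logs are already exposed as a counter metric (for example via a log-to-metric step); the metric name backend_error_logs_total and the threshold are hypothetical.

groups:
  - name: log-based-alerts
    rules:
      - alert: BackendErrorLogRateHigh
        # backend_error_logs_total is a hypothetical counter derived from error-level logs
        expr: sum(rate(backend_error_logs_total{namespace="develop"}[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Backend error log rate above threshold"
          description: "Pivot to traces, then filter logs by trace_id before acting."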

Correlation: trace_id Enables Full Drill-Down

The key to unified observability is the trace_id field in logs:

Metric alert → Find affected traces → Filter logs by trace_id → Read error context

When the backend emits structured logs with trace_id, every log line becomes part of the trace story. This enables the metrics → traces → logs drill-down workflow from the incident hook.

OpenTelemetry Integration

The OTEL Collector DaemonSet runs one pod per node, reads container logs from /var/log/pods/, and ships them to Uptrace via OTLP. This reuses the same OTLP export path to Uptrace that the frontend and backend already use for traces; extending the Collector to logs means adding the filelog receiver and a logs pipeline.

Resource attributes added automatically:

  • k8s.namespace.name, k8s.pod.name, k8s.container.name (from file path)
  • k8s.node.name (from DaemonSet scheduling)

These attributes enable filtering in Uptrace by namespace, pod, or container — the same dimensions used in Prometheus queries and Grafana dashboards.

Lab Files

  • lab.md
  • runbook-incident-debug.md
  • sli-slo.md
  • quiz.md

Done When

  • learner can trigger and find one end-to-end trace from frontend to backend
  • learner can match backend error log by trace_id
  • learner can run incident workflow metrics -> traces -> logs -> action
  • learner can explain backend availability SLI/SLO and validate burn-rate alerts

Lab: Baseline Observability with Uptrace

  • frontend creates spans for user actions
  • backend receives trace context and emits correlated logs
  • Uptrace shows trace chain and related service signals
  • Prometheus alert path is connected to the same incident workflow

…

Quiz: Chapter 10 (Observability)

In this MVP, where is telemetry exported from?

  A) only in-cluster OTel collector
  B) directly from frontend/backend to Uptrace
  C) only backend exports telemetry

What header set is required for end-to-end context …

Runbook: Incident Debug (Metrics -> Traces -> Logs)

Symptom: elevated latency and/or sporadic 5xx.

This runbook is optimized for the current setup:

  • traces: direct export to Uptrace from frontend/backend
  • logs: OTEL Collector DaemonSet ships container logs to Uptrace (filelog → OTLP)

…

SLI/SLO Spec: Chapter 10 Baseline

Service: backend HTTP API
Environment scope: develop, staging, production

Indicators (SLIs)

Availability SLI
Definition: ratio of successful requests (non-5xx) to total requests.
PromQL: 1 - ( …