Core Track: the guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Artifacts

What You Will Produce

A reproducible lab result, quiz verification, and evidence of incident-safe operation.

Runbook: Incident Debug (Metrics -> Traces -> Logs)

Purpose

Provide one repeatable on-call path for the most common symptom:

  • elevated latency and/or sporadic 5xx

This runbook is optimized for the current setup:

  • traces: direct export to Uptrace from frontend/backend
  • logs: OTEL Collector DaemonSet ships container logs to Uptrace (filelog → OTLP)

Inputs

  • environment (develop, staging, or production)
  • incident window (UTC time range)
  • primary route/symptom if known

Step 1: Confirm Symptom (Metrics First)

Check service-level symptoms:

  • request rate anomaly
  • p95/p99 latency increase
  • 5xx error-rate increase

Decision:

  • if there is no metric deviation, treat it as a likely client-side or local issue and continue with scoped tracing
  • if deviation exists, continue to traces

PromQL shortcuts:

# Error rate over 5m (clamp_min keeps the denominator non-zero at zero traffic)
sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
/ clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)

# Latency p95 (5m)
histogram_quantile(0.95,
  sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
)
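The same shortcuts can be run without the UI via the Prometheus HTTP API; a minimal sketch, where `PROM_URL` is an assumption you should point at your own Prometheus service:

```shell
# Assumption: adjust PROM_URL to wherever Prometheus is reachable from your shell.
PROM_URL="${PROM_URL:-http://prometheus.monitoring.svc:9090}"

# Same 5xx error-rate expression as above, as a single query string.
ERROR_RATE='sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m])) / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)'

# --data-urlencode handles the special characters in the PromQL expression.
curl -sG --max-time 5 "$PROM_URL/api/v1/query" \
  --data-urlencode "query=$ERROR_RATE" \
  || echo "Prometheus not reachable from this shell"
```

The instant-query endpoint returns the current value; add `start`/`end`/`step` against `/api/v1/query_range` to inspect the whole incident window.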

Step 2: Pivot to Traces

In Uptrace, filter by:

  • service.name = "backend" (and frontend when needed)
  • time range around the spike
  • status/error indicators

Find one representative failing or slow trace and capture:

  • trace_id
  • top slow span
  • endpoint/route attributes

Step 3: Correlate Logs by trace_id

Kubernetes log check:

kubectl -n <env> logs deploy/backend --since=30m | rg "<trace_id>"

For crash scenario (/panic):

kubectl -n <env> logs deploy/backend --since=30m | rg "panic|trace_id"

Expected:

  • one or more backend entries with the same trace_id
  • clear error context (panic, timeout, dependency issue, etc.)
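Note that `kubectl logs deploy/backend` reads only one pod the Deployment resolves to. A sketch that sweeps every backend pod for the trace_id, assuming the pods carry an `app=backend` label (adjust the selector to your manifests; `grep -F` stands in for `rg -F`):

```shell
ENV="staging"                                    # target namespace
TRACE_ID="4bf92f3577b34da6a3ce929d0e0e4736"      # illustrative id; use the one from Step 2

# Iterate all backend pods, not just the one kubectl picks for the Deployment.
for pod in $(kubectl -n "$ENV" get pods -l app=backend -o name 2>/dev/null); do
  echo "--- $pod"
  kubectl -n "$ENV" logs "$pod" --since=30m | grep -F "$TRACE_ID" || true
done
```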

Step 3.1: Check Logs in Uptrace (via OTEL Collector Pipeline)

As an alternative to kubectl logs, use the OTEL Collector log pipeline in Uptrace:

  1. Open Uptrace UI → Logs section
  2. Filter by k8s.namespace.name = <env> and time range
  3. Search for the trace_id from Step 2
  4. Click through to the associated trace for full context

This is especially useful when pods have restarted (losing local logs) or when correlating logs across multiple pods/nodes.
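If logs never appear in Uptrace at all, check the collector pipeline itself before debugging further. A sketch, assuming the DaemonSet is named `otel-collector` in an `observability` namespace (both names are assumptions; adjust to your deployment):

```shell
NS="observability"          # assumption: namespace of the collector DaemonSet
DS="otel-collector"         # assumption: DaemonSet name

# Every node should report a ready collector pod (DESIRED == READY).
kubectl -n "$NS" get daemonset "$DS" 2>/dev/null || echo "DaemonSet $DS not found in $NS"

# Export failures (e.g. refused OTLP connections) surface in the collector's own logs.
kubectl -n "$NS" logs "daemonset/$DS" --since=10m 2>/dev/null | grep -iE "error|refused|retry" | tail -20
```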

Step 4: Classify and Decide

Class A: isolated or low impact

  • monitor + create follow-up issue

Class B: recurring but controlled impact

  • apply low-risk mitigation and monitor

Class C: active customer impact

  • execute rollback/fix path per service runbook
  • communicate incident status update immediately
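For Class C, the rollback path in this setup is typically a Deployment rollout undo; a minimal sketch (deployment name `backend` as used throughout this runbook; confirm your service's own runbook before executing in production):

```shell
ENV="production"    # incident namespace

# Inspect revision history first so you know what "previous" actually is.
kubectl -n "$ENV" rollout history deploy/backend

# Roll back to the previous ReplicaSet and block until it converges.
kubectl -n "$ENV" rollout undo deploy/backend
kubectl -n "$ENV" rollout status deploy/backend --timeout=120s \
  || echo "rollout did not converge within 120s; escalate"
```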

Step 4.1: Validate Alert Context

Check whether one of these alerts is active for the same window:

  • BackendHighErrorRate / BackendCriticalErrorRate
  • BackendHighLatency
  • BackendSLOErrorBudgetBurnCritical / BackendSLOErrorBudgetBurnWarning

If no matching alert exists but traces/logs confirm impact:

  • classify as detection gap
  • open follow-up issue for alert tuning (threshold/window/labels)
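Active alerts for the window can also be pulled from the Alertmanager v2 API; a sketch, where `AM_URL` is an assumed placeholder for your Alertmanager service:

```shell
AM_URL="${AM_URL:-http://alertmanager.monitoring.svc:9093}"

# Regex matcher covering the backend alerts listed above.
FILTER='alertname=~"Backend(HighErrorRate|CriticalErrorRate|HighLatency|SLOErrorBudgetBurnCritical|SLOErrorBudgetBurnWarning)"'

curl -sG --max-time 5 "$AM_URL/api/v2/alerts" \
  --data-urlencode "filter=$FILTER" \
  --data-urlencode "active=true" \
  || echo "Alertmanager not reachable from this shell"
```

An empty array with confirmed trace/log impact is the detection-gap case described above.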

Step 5: Verify Recovery

After mitigation:

  • confirm latency and error metrics recover
  • confirm new traces return to normal duration/status
  • confirm no repeating error logs for same pattern
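The "no repeating error logs" check can be done from a terminal; a sketch, using `panic` as a placeholder for whatever error signature Step 3 surfaced:

```shell
ENV="staging"
PATTERN="panic"      # placeholder: the error signature found in Step 3

# Count fresh occurrences since mitigation; expect this to trend to zero.
COUNT="$(kubectl -n "$ENV" logs deploy/backend --since=15m 2>/dev/null | grep -c "$PATTERN")"
echo "occurrences of '$PATTERN' in the last 15m: $COUNT"
```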

Frontend -> Backend Crash Drill

  1. Trigger panic from frontend Chaos page.
  2. Capture returned trace_id.
  3. In Uptrace, open trace and verify frontend + backend spans share the same trace.
  4. In backend logs, filter by that trace_id.
  5. Confirm panic event and restart behavior are visible.
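Step 2 of the drill can be scripted: the trace_id is the second dash-separated field of a W3C `traceparent` value. A sketch with an illustrative header (how the Chaos page surfaces the traceparent is an assumption; adapt to what your frontend actually returns):

```shell
# Illustrative traceparent: version-trace_id-parent_id-flags
TRACEPARENT="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

# trace_id is the 32-hex-char second field.
TRACE_ID="$(printf '%s' "$TRACEPARENT" | cut -d- -f2)"
echo "trace_id=$TRACE_ID"

# Reuse it directly in the Step 3 log filter:
# kubectl -n <env> logs deploy/backend --since=30m | rg "$TRACE_ID"
```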

Database Leg (Optional Extension)

This runbook baseline covers the frontend -> backend path. In environments with database-backed request paths, extend the same investigation flow:

  • require DB child span under backend request span
  • require DB error/latency evidence before rollback decision
  • include slow-query fingerprint in incident notes

Evidence Template

  • Environment:
  • Time window:
  • Symptom metric(s):
  • trace_id:
  • Correlated log evidence:
  • Impact class (A/B/C):
  • Action taken:
  • Verification result:
  • Alert observed (name + state):