Lab: Baseline Observability with Uptrace (No In-Cluster Collector)
Goal
Validate that telemetry is operational and correlated:
- frontend creates spans for user actions
- backend receives trace context and emits correlated logs
- Uptrace shows trace chain and related service signals
- Prometheus alert path is connected to the same incident workflow
Prerequisites
- frontend and backend are deployed in one environment (recommended: develop)
- Uptrace DSN is configured in secrets and injected into workloads
- Flux reconciliation is healthy
Quick checks:
kubectl -n flux-system get kustomizations
kubectl -n develop get deploy frontend backend
kubectl -n develop get secret backend-secrets
kubectl -n observability get prometheusrule backend-alerts backend-slo-rules
Step 1: Verify Runtime Telemetry Config
kubectl -n develop get deploy frontend -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="VITE_UPTRACE_DSN")].name}{"\n"}'
kubectl -n develop get deploy backend -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="UPTRACE_DSN")].name}{"\n"}'
Expected:
- frontend has VITE_UPTRACE_DSN
- backend has UPTRACE_DSN
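The jsonpath queries above print the variable name when it is set and an empty line when it is not, so a blank result is easy to miss. A minimal sketch of a wrapper that makes the result explicit follows; the function name check_env_name is hypothetical and not part of the lab tooling.

```shell
# Report whether a jsonpath query returned the expected env var name.
# $1 = expected variable name
# $2 = captured output of the kubectl jsonpath query (empty if the var is absent)
check_env_name() {
  if [ "$2" = "$1" ]; then
    echo "ok: $1 is present"
  else
    echo "missing: $1"
    return 1
  fi
}

# Usage against the cluster (assumes the same namespace and deployment as above):
#   check_env_name UPTRACE_DSN "$(kubectl -n develop get deploy backend \
#     -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="UPTRACE_DSN")].name}')"
```

The check only confirms that the variable is declared on the first container. It does not confirm that the referenced secret holds a valid DSN.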
Step 2: Generate Trace via Frontend
- Open frontend UI.
- Go to Chaos page.
- Trigger one action:
  - delay or status action for non-destructive check
  - panic action for crash/correlation drill
Expected:
- frontend creates manual span (for example ui.chaos.trigger_panic)
- backend receives request with propagated trace context
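If propagation uses the W3C Trace Context traceparent header (an assumption here, based on the header names listed under Failure Scenarios), the trace ID is the second dash-separated field of the header value. The sketch below pulls it out of a value copied from the browser network tab; the function name is hypothetical and the sample value is the example from the W3C specification, not output from this lab.

```shell
# Extract the trace ID from a W3C traceparent header value.
# Format: version-traceid-parentid-flags
#         (2 hex)-(32 hex)-(16 hex)-(2 hex)
trace_id_from_traceparent() {
  # Validate the overall shape first, then print the second field.
  # Prints nothing if the value is not a well-formed traceparent.
  printf '%s\n' "$1" \
    | grep -E '^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$' \
    | cut -d- -f2
}

# Example value from the W3C Trace Context specification.
trace_id_from_traceparent "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

The printed ID is the value to search for in Uptrace and in the backend logs in the next steps.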
Step 3: Verify in Uptrace
In Uptrace, find the trace from the recent action and confirm:
- frontend span exists
- backend HTTP span is a child in the same trace
- status/error details are visible on backend span
Step 4: Verify Correlated Backend Logs
Get recent backend logs:
kubectl -n develop logs deploy/backend --tail=200
Expected:
- request/error logs contain trace_id
- for panic flow, log contains panic termination message with same trace_id
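To match logs against a specific trace, the log output can be filtered by the trace ID taken from Uptrace. This is a minimal sketch that assumes the ID appears verbatim somewhere in each relevant log line; the function name and the sample log lines are hypothetical, and the real backend log format may differ.

```shell
# Print only the log lines that mention the given trace ID.
# Reads log lines on stdin; $1 is the trace ID copied from Uptrace.
logs_for_trace() {
  # -F treats the ID as a fixed string, not a regular expression.
  grep -F -- "$1"
}

# Hypothetical sample lines, for illustration only.
printf '%s\n' \
  'level=info trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg="request started"' \
  'level=info trace_id=00000000000000000000000000000001 msg="unrelated request"' \
  'level=error trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg="panic recovered"' \
  | logs_for_trace 4bf92f3577b34da6a3ce929d0e0e4736

# Usage against the cluster:
#   kubectl -n develop logs deploy/backend --tail=200 | logs_for_trace <trace-id>
```

An empty result with a known-good trace ID points at the "Logs exist but no trace_id" failure scenario rather than at missing traffic.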
Step 5: Capture Evidence
For lab completion, attach:
- one Uptrace trace screenshot/id
- one backend log snippet with matching trace_id
- one alert snapshot (BackendHighLatency or one SLO burn-rate alert state)
- one short conclusion (root cause + next action)
Step 6: Optional Alert Drill (Recommended)
- Trigger /status/500 repeatedly from Chaos page for 5-10 minutes.
- In Prometheus Alerts UI, verify one error-rate alert enters pending or firing.
- Pivot to Uptrace trace + backend log evidence before deciding action.
Hard Stop Conditions
- telemetry secrets missing or plaintext in Git
- no trace context propagation (orphan backend spans only)
- on-call action chosen without evidence from at least two signals
Failure Scenarios
- No traces in Uptrace
  - verify DSN wiring in frontend/backend env
  - verify app can reach Uptrace endpoint
- Backend spans exist but not linked to frontend spans
  - verify propagation headers are allowed by CORS (traceparent, tracestate, baggage)
  - verify frontend instrumentation is enabled
- Logs exist but no trace_id
  - verify request logging path and panic handler logging
  - verify request executed via instrumented routes
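The CORS check for unlinked spans can be done by inspecting a preflight response. The sketch below reads raw response headers on stdin and reports whether each propagation header is listed in Access-Control-Allow-Headers. The function name and the example URLs are hypothetical, and the check is deliberately simple: it does not treat a wildcard (*) value as a match.

```shell
# Check that a CORS preflight response allows the trace propagation headers.
# Reads raw HTTP response headers on stdin; returns non-zero if any is missing.
check_cors_propagation() {
  # Keep only the Access-Control-Allow-Headers line(s), lower-cased, CR stripped.
  allowed=$(tr -d '\r' | tr 'A-Z' 'a-z' | grep '^access-control-allow-headers:' || true)
  status=0
  for header in traceparent tracestate baggage; do
    case "$allowed" in
      *"$header"*) echo "allowed: $header" ;;
      *) echo "MISSING: $header"; status=1 ;;
    esac
  done
  return "$status"
}

# Usage (both URLs are placeholders for the lab's frontend origin and backend):
#   curl -s -o /dev/null -D - -X OPTIONS \
#     -H "Origin: https://frontend.example" \
#     -H "Access-Control-Request-Method: GET" \
#     -H "Access-Control-Request-Headers: traceparent,tracestate,baggage" \
#     "https://backend.example/" | check_cors_propagation
```

A MISSING line for traceparent is enough to explain orphan backend spans, since the browser will refuse to send the header cross-origin.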
Done When
- learner can produce one correlated incident sample (trace + log by trace_id)
- learner can explain the chosen action based on evidence
- learner can identify whether issue is config, propagation, or runtime behavior
- learner can identify at least one matching alert for the same symptom