Chapter 10: Observability (Metrics, Logs, Traces)
Incident Hook
Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps blindly between pods and log streams. With baseline observability in place, the on-call engineer narrows the cause in minutes.
Observed Symptoms
What the team sees first:
- metrics clearly show a user-facing problem
- logs contain noise but not a clean causal path
- responders are tempted to restart pods before they understand the failing route
The problem is not a lack of telemetry volume. It is a lack of correlation.
Confusion Phase
Every signal says something different at first glance. That is normal.
The real question is:
- which request path is failing
- and whether the symptom, trace path, and log evidence all point to the same cause
Why This Chapter Exists
Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:
- metrics for symptom detection
- traces for path analysis
- logs for evidence
Scope Decision
This chapter teaches the current platform baseline, not every possible observability add-on.
- Metrics baseline: Prometheus + Grafana dashboards + PrometheusRule alerts.
- Trace baseline: frontend and backend export traces directly to Uptrace.
- Log baseline: backend emits structured logs with `trace_id` and `span_id`, and the lab workflow proves correlation with `kubectl logs` first.
- Alert routing baseline: Prometheus rules detect symptoms, then `k8s-ai-monitor` enriches and routes actionable alerts. Alertmanager is intentionally disabled in this stack.
- Centralized log shipping: optional extension. You can later wire Uptrace Logs, Vector, or another cloud backend, but this chapter does not depend on a specific log vendor.
- Target investigation path: `frontend -> backend` now, `-> database` when the DB-backed path is introduced.
The implementation snapshots later in this chapter show the exact ServiceMonitor, alert rules, and guardian deployment used in the current SafeOps baseline.
Application Operational Contract
This course does not treat observability as something added after the application is already “done.” It treats observability as part of the application contract required for safe Kubernetes operations.
A production-ready application in this course is expected to provide:
- readiness and liveness probes
- graceful shutdown on interrupt signals
- config and secret reload patterns where runtime updates matter
- Prometheus metrics for symptom detection
- OpenTelemetry traces with propagation across request boundaries
- structured logs that carry stable fields such as `trace_id` and `span_id`
- 12-factor configuration so behavior is explicit and environment-scoped
- safe packaging for Kubernetes delivery through manifests, Helm, Kustomize, or Timoni
- testable install paths and end-to-end validation in cluster-like environments
- signed images, SBOMs, provenance, and vulnerability scanning for supply chain evidence
The reference implementations for this course are ldbl/backend and ldbl/frontend.
Several of these patterns are inspired by podinfo, but the course expectation is broader:
an application should be observable, operable, recoverable, and safe to promote inside Kubernetes, not just runnable.
What AI Would Propose (Brave Junior)
- “Check logs only and restart pods quickly.”
- “Turn sampling up everywhere permanently.”
- “Skip propagation; metrics are enough.”
Why this sounds reasonable:
- fastest path to immediate action
- fewer telemetry configuration steps
Why This Is Dangerous
- logs-only debugging misses the causal path across services.
- uncontrolled sampling raises cost and noise without better decisions.
- missing propagation breaks correlation and slows incident resolution.
- acting on one signal alone increases the chance of the wrong rollback or restart.
Investigation
Treat observability as a drill-down path, not a bag of tools.
Safe investigation sequence:
- start from the metric symptom
- pivot to traces to isolate the failing path
- correlate logs by `trace_id`
- act only after at least two signals support the same explanation
Containment
Containment follows evidence:
- stabilize the failing dependency or route identified by traces
- verify the symptom clears in metrics
- confirm logs and traces return to expected baseline behavior
- record the exact signal path that made the diagnosis fast enough to trust
Guardrails That Stop It
- No telemetry credentials in plaintext Git.
- No debugging based on logs-only; always pivot through traces.
- Keep rollback decisions tied to evidence: metrics + traces + logs.
- Alert routing stays evidence-first: Prometheus detects, `k8s-ai-monitor` enriches, humans decide.
3 Signals, 1 Incident Exercise
For one controlled incident, capture all three:
- Metrics symptom (for example latency or error-rate spike).
- Trace path showing the failing route and span chain.
- Log evidence with matching `trace_id`.
This exercise is successful only when all three artifacts point to the same causal path.
Investigation Snapshots
Here is the ServiceMonitor used in the SafeOps system to turn backend metrics into Prometheus evidence.
Backend ServiceMonitor
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: backend
labels:
app: backend
spec:
selector:
matchLabels:
app: backend
endpoints:
- port: http
path: /metrics
interval: 30s
scrapeTimeout: 10s
Here are the alert rules that convert symptoms into evidence-first detection.
Backend alert rules
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: backend-alerts
namespace: observability
labels:
prometheus: kube-prometheus-stack
role: alert-rules
spec:
groups:
- name: backend.rules
interval: 30s
rules:
# High error rate alert
- alert: BackendHighErrorRate
expr: |
(
sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
/ clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
) > 0.05
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend service has high error rate"
description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# Critical error rate alert
- alert: BackendCriticalErrorRate
expr: |
(
sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
/ clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
) > 0.10
for: 2m
labels:
severity: critical
component: backend
annotations:
summary: "Backend service has critical error rate"
description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 10%)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# High latency alert (p95)
- alert: BackendHighLatency
expr: |
histogram_quantile(0.95,
sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
) > 1
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend service has high latency"
description: "Backend p95 latency is {{ $value }}s (threshold: 1s)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# Service down alert
- alert: BackendServiceDown
expr: up{job="backend"} == 0
for: 1m
labels:
severity: critical
component: backend
annotations:
summary: "Backend service is down"
description: "Backend service {{ $labels.instance }} is down"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# High memory usage
- alert: BackendHighMemoryUsage
expr: |
(
process_resident_memory_bytes{job="backend"}
/
1024 / 1024 / 1024
) > 0.8
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend service has high memory usage"
description: "Backend memory usage is {{ $value }}GB (threshold: 0.8GB)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# Too many goroutines
- alert: BackendHighGoroutines
expr: go_goroutines{job="backend"} > 10000
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend has too many goroutines"
description: "Backend has {{ $value }} goroutines (threshold: 10000)"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
# Pod restarts
- alert: BackendPodRestarting
expr: |
rate(kube_pod_container_status_restarts_total{
namespace=~"develop|staging|production",
pod=~"backend-.*"
}[15m]) > 0
for: 5m
labels:
severity: warning
component: backend
annotations:
summary: "Backend pod is restarting frequently"
description: "Backend pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting"
runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
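The error-rate rules above divide the 5xx rate by the total rate and guard against a zero denominator with `clamp_min`. The same arithmetic as a small Go sketch, showing when the warning and critical thresholds fire:

```go
package main

import "fmt"

// clampMin mirrors PromQL's clamp_min: it raises the denominator to a
// tiny floor so an idle service (zero traffic) yields ratio 0, not NaN.
func clampMin(v, min float64) float64 {
	if v < min {
		return min
	}
	return v
}

// errorRatio reproduces the alert expression:
// sum(rate(5xx)) / clamp_min(sum(rate(total)), 1e-9)
func errorRatio(errRate, totalRate float64) float64 {
	return errRate / clampMin(totalRate, 1e-9)
}

func main() {
	fmt.Println(errorRatio(6, 100) > 0.05)  // 6% errors -> warning fires: true
	fmt.Println(errorRatio(12, 100) > 0.10) // 12% errors -> critical fires: true
	fmt.Println(errorRatio(0, 0))           // idle service -> 0, no alert
}
```

Without the `clamp_min` guard, a service receiving no traffic would produce `0/0`, and the alert expression would evaluate to NaN instead of staying quiet.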
Here is the guardian deployment used to enrich and route actionable alerts.
k8s-ai-monitor deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: k8s-ai-monitor
labels:
app.kubernetes.io/name: k8s-ai-monitor
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: k8s-ai-monitor
template:
metadata:
labels:
app.kubernetes.io/name: k8s-ai-monitor
spec:
serviceAccountName: k8s-ai-monitor
securityContext:
fsGroup: 1000
imagePullSecrets:
- name: ghcr-credentials-docker
terminationGracePeriodSeconds: 30
containers:
- name: k8s-ai-monitor
image: ghcr.io/ldbl/k8s-ai-monitor:main # {"$imagepolicy": "observability:k8s-ai-monitor:tag"}
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8080
env:
- name: CLUSTER_NAME
value: safeops
- name: WATCH_NAMESPACES
value: production
- name: NON_PROD_NAMESPACES
value: develop,staging
- name: EXCLUDE_NAMESPACES
value: kube-system,kube-public,kube-node-lease,flux-system
- name: LOG_LEVEL
value: INFO
- name: LLM_PROVIDER
value: openai
- name: PROMETHEUS_URL
value: http://kube-prometheus-stack-prometheus.observability.svc.cluster.local:9090
- name: SQLITE_PATH
value: /data/k8s-ai-monitor.db
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: openai-api-key
optional: true
- name: ANTHROPIC_API_KEY
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: anthropic-api-key
optional: true
- name: SLACK_WEBHOOK_URL
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: slack-webhook-url
optional: true
- name: SLACK_WEBHOOK_URL_NONPROD
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: slack-webhook-url-nonprod
optional: true
- name: INTERNAL_TOKEN
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: internal-token
optional: true
- name: ELASTICSEARCH_URL
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: elasticsearch-url
optional: true
- name: ELASTICSEARCH_USER
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: elasticsearch-user
optional: true
- name: ELASTICSEARCH_PASSWORD
valueFrom:
secretKeyRef:
name: k8s-ai-monitor-secrets
key: elasticsearch-password
optional: true
- name: SCANNER_CRITICAL_ENDPOINT_ENABLED
value: "true"
- name: ENDPOINT_INGRESS_SERVICE
value: traefik.traefik.svc.cluster.local
- name: SCANNER_BACKUP_ENABLED
value: "true"
volumeMounts:
- name: data
mountPath: /data
readinessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 10
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 20
periodSeconds: 20
resources:
requests:
cpu: 10m
memory: 64Mi
limits:
cpu: 100m
memory: 256Mi
volumes:
- name: data
persistentVolumeClaim:
claimName: k8s-ai-monitor-data
System Context
This chapter gives the rest of the course an evidence-first investigation path.
It becomes essential in:
- Chapter 12, where drills must be explained, not just survived
- Chapter 13, where guardian summaries depend on good source signals
- Chapter 14, where on-call actions should be justified by correlated evidence
Current Operating Model
Metrics: Fast Symptom Detection
Prometheus scrapes backend metrics from /metrics and evaluates alert rules.
Grafana provides the operator view for request rate, latency, error rate, saturation, and SLO burn.
Metrics answer the first question: is something wrong right now? They do not tell you the full causal path.
Traces: Request Path Isolation
Frontend and backend export traces directly to Uptrace. Propagation headers connect browser actions to backend work, so one request becomes one visible path.
Traces answer the second question: where is the request failing or slowing down?
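Propagation between frontend and backend typically rides on the W3C `traceparent` header. In the real services the OpenTelemetry SDK presumably handles this; the sketch below only illustrates how `trace_id` and `span_id` are carried inside that header.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header of the form
// "version-traceid-spanid-flags" and returns the trace and span IDs.
// Teaching sketch only; the OpenTelemetry SDK normally does this.
func parseTraceparent(h string) (traceID, spanID string, ok bool) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false
	}
	return parts[1], parts[2], true
}

func main() {
	tid, sid, ok := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(ok, tid, sid)
	// → true 4bf92f3577b34da6a3ce929d0e0e4736 00f067aa0ba902b7
}
```

The same `trace_id` that crosses the wire in this header is what the backend must stamp into its logs, which is exactly the correlation invariant the next section relies on.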
Logs: Evidence and Error Context
The backend emits structured logs with stable fields such as `time`, `level`, `msg`, `trace_id`, and `span_id`.
That makes `kubectl logs` useful immediately, even before you add a centralized log backend.
The important invariant is not the vendor. The important invariant is that logs carry the same `trace_id` you saw in the trace.
Alert Routing: Prometheus Detects, Guardian Routes
Prometheus rules detect the symptom.
`k8s-ai-monitor` consumes Prometheus metrics plus Kubernetes context and routes actionable alerts to webhook targets.
This stack does not rely on Alertmanager for final delivery. The guardian is the routing and enrichment layer.
Centralized Logging Is an Extension, Not the Lesson
Collector-based shipping, Vector, or a cloud log backend may be added later. Those are implementation choices.
This chapter focuses on the operator skill that survives every backend choice:
- detect in metrics
- isolate in traces
- prove in logs
Safe Workflow (Step-by-Step)
- Start from symptom in metrics: latency, error rate, request rate, or saturation anomaly.
- Pivot to traces and isolate the affected route and span chain.
- Correlate with backend or frontend logs using `trace_id`.
- Validate whether the detection path also produced the expected alert or guardian incident.
- Decide action only after evidence from at least two signals.
- Validate recovery in metrics and confirm trace/log behavior returned to baseline.
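The correlation step above can be done with plain `grep`, or with a small filter like this sketch that reads `kubectl logs` output on stdin. The `trace_id` field name is assumed to match the backend's JSON logs; the file name in the usage comment is hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// matchesTrace reports whether a JSON log line carries the given trace_id.
// Malformed lines are skipped rather than treated as errors.
func matchesTrace(line, traceID string) bool {
	var entry map[string]any
	if err := json.Unmarshal([]byte(line), &entry); err != nil {
		return false
	}
	return entry["trace_id"] == traceID
}

func main() {
	// Usage sketch: kubectl logs deploy/backend | go run filter.go <trace_id>
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: filter <trace_id>")
		os.Exit(1)
	}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if matchesTrace(sc.Text(), os.Args[1]) {
			fmt.Println(sc.Text())
		}
	}
}
```

Skipping unparseable lines matters in practice: pods often interleave non-JSON startup output with structured logs, and the filter should not abort on it.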
Definition of Done: Evidence, Not Assumptions
Incident triage is complete only when the responder can explain:
- what failed
- where it failed
- why it failed
using correlated evidence (metrics + traces + logs), not guesses.
Lab Files
- `lab.md`
- `runbook-incident-debug.md`
- `sli-slo.md`
- `quiz.md`
Done When
- learner can trigger and find one end-to-end trace from frontend to backend
- learner can match one backend log entry by `trace_id`
- learner can explain why the current alert path goes through `k8s-ai-monitor`
- learner can run the incident workflow `metrics -> traces -> logs -> action`
- learner can explain backend availability SLI/SLO and validate burn-rate alerts