Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • backend-alerts.yaml
  • deployment.yaml
  • servicemonitor.yaml

What You Will Produce

A reproducible lab result, quiz verification, and incident-safe operating evidence.

Chapter 10: Observability (Metrics, Logs, Traces)

Incident Hook

Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps blindly between pods and log streams. With baseline observability, on-call narrows the cause in minutes.

Observed Symptoms

What the team sees first:

  • metrics clearly show a user-facing problem
  • logs contain noise but not a clean causal path
  • responders are tempted to restart pods before they understand the failing route

The problem is not a lack of telemetry volume. It is a lack of correlation.

Confusion Phase

Every signal says something different at first glance. That is normal.

The real question is:

  • which request path is failing
  • and whether the symptom, trace path, and log evidence all point to the same cause

Why This Chapter Exists

Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:

  • metrics for symptom detection
  • traces for path analysis
  • logs for evidence

Scope Decision

This chapter teaches the current platform baseline, not every possible observability add-on.

  • Metrics baseline: Prometheus + Grafana dashboards + PrometheusRule alerts.
  • Trace baseline: frontend and backend export traces directly to Uptrace.
  • Log baseline: backend emits structured logs with trace_id and span_id, and the lab workflow proves correlation with kubectl logs first.
  • Alert routing baseline: Prometheus rules detect symptoms, then k8s-ai-monitor enriches and routes actionable alerts. Alertmanager is intentionally disabled in this stack.
  • Centralized log shipping: optional extension. You can later wire Uptrace Logs, Vector, or another cloud backend, but this chapter does not depend on a specific log vendor.
  • Target investigation path: frontend -> backend today, extending to the database once the DB-backed path is introduced.

The implementation snapshots later in this chapter show the exact ServiceMonitor, alert rules, and guardian deployment used in the current SafeOps baseline.

Application Operational Contract

This course does not treat observability as something added after the application is already “done.” It treats observability as part of the application contract required for safe Kubernetes operations.

A production-ready application in this course is expected to provide:

  • readiness and liveness probes
  • graceful shutdown on interrupt signals
  • config and secret reload patterns where runtime updates matter
  • Prometheus metrics for symptom detection
  • OpenTelemetry traces with propagation across request boundaries
  • structured logs that carry stable fields such as trace_id and span_id
  • 12-factor configuration so behavior is explicit and environment-scoped
  • safe packaging for Kubernetes delivery through manifests, Helm, Kustomize, or Timoni
  • testable install paths and end-to-end validation in cluster-like environments
  • signed images, SBOMs, provenance, and vulnerability scanning for supply chain evidence

The reference implementations for this course are ldbl/backend and ldbl/frontend. Several of these patterns are inspired by podinfo, but the course expectation is broader: an application should be observable, operable, recoverable, and safe to promote inside Kubernetes, not just runnable.

What AI Would Propose (Brave Junior)

  • “Check logs only and restart pods quickly.”
  • “Turn sampling up everywhere permanently.”
  • “Skip propagation; metrics are enough.”

Why this sounds reasonable:

  • fastest path to immediate action
  • fewer telemetry configuration steps

Why This Is Dangerous

  • logs-only debugging misses the causal path across services.
  • uncontrolled sampling raises cost and noise without better decisions.
  • missing propagation breaks correlation and slows incident resolution.
  • acting on one signal alone increases the chance of the wrong rollback or restart.

Investigation

Treat observability as a drill-down path, not a bag of tools.

Safe investigation sequence:

  1. start from the metric symptom
  2. pivot to traces to isolate the failing path
  3. correlate logs by trace_id
  4. act only after at least two signals support the same explanation
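Step 3 of the sequence is often just a filter over kubectl logs output, but the idea can be shown as a small Go sketch. The function and log lines below are hypothetical; they only assume the chapter's contract that each JSON log line carries a trace_id field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// filterByTraceID mimics step 3: given raw structured log output
// (one JSON object per line), keep only the entries whose trace_id
// matches the trace isolated in step 2.
func filterByTraceID(logs, traceID string) []string {
	var matched []string
	for _, line := range strings.Split(strings.TrimSpace(logs), "\n") {
		var entry map[string]any
		if err := json.Unmarshal([]byte(line), &entry); err != nil {
			continue // skip lines that are not structured JSON
		}
		if entry["trace_id"] == traceID {
			matched = append(matched, line)
		}
	}
	return matched
}

func main() {
	// Hypothetical log lines; field names match the chapter's contract.
	logs := `{"level":"info","msg":"request start","trace_id":"abc123","span_id":"s1"}
{"level":"error","msg":"upstream timeout","trace_id":"abc123","span_id":"s2"}
{"level":"info","msg":"unrelated request","trace_id":"zzz999","span_id":"s3"}`

	for _, line := range filterByTraceID(logs, "abc123") {
		fmt.Println(line)
	}
}
```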

Containment

Containment follows evidence:

  1. stabilize the failing dependency or route identified by traces
  2. verify the symptom clears in metrics
  3. confirm logs and traces return to expected baseline behavior
  4. record the exact signal path that made the diagnosis fast enough to trust

Guardrails That Stop It

  • No telemetry credentials in plaintext Git.
  • No logs-only debugging; always pivot through traces.
  • Keep rollback decisions tied to evidence: metrics + traces + logs.
  • Alert routing stays evidence-first: Prometheus detects, k8s-ai-monitor enriches, humans decide.

3 Signals, 1 Incident Exercise

For one controlled incident, capture all three:

  1. Metrics symptom (for example latency or error-rate spike).
  2. Trace path showing the failing route and span chain.
  3. Log evidence with matching trace_id.

This exercise is successful only when all three artifacts point to the same causal path.

Investigation Snapshots

Here is the ServiceMonitor used in the SafeOps system to turn backend metrics into Prometheus evidence.

Backend ServiceMonitor

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend
  labels:
    app: backend
spec:
  selector:
    matchLabels:
      app: backend
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

Here are the alert rules that convert symptoms into evidence-first detection.

Backend alert rules

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backend-alerts
  namespace: observability
  labels:
    prometheus: kube-prometheus-stack
    role: alert-rules
spec:
  groups:
    - name: backend.rules
      interval: 30s
      rules:
        # High error rate alert
        - alert: BackendHighErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.05
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Critical error rate alert
        - alert: BackendCriticalErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.10
          for: 2m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service has critical error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 10%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # High latency alert (p95)
        - alert: BackendHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high latency"
            description: "Backend p95 latency is {{ $value }}s (threshold: 1s)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Service down alert
        - alert: BackendServiceDown
          expr: up{job="backend"} == 0
          for: 1m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service is down"
            description: "Backend service {{ $labels.instance }} is down"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # High memory usage
        - alert: BackendHighMemoryUsage
          expr: |
            (
              process_resident_memory_bytes{job="backend"}
              /
              1024 / 1024 / 1024
            ) > 0.8
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high memory usage"
            description: "Backend memory usage is {{ $value }}GB (threshold: 0.8GB)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Too many goroutines
        - alert: BackendHighGoroutines
          expr: go_goroutines{job="backend"} > 10000
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend has too many goroutines"
            description: "Backend has {{ $value }} goroutines (threshold: 10000)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Pod restarts
        - alert: BackendPodRestarting
          expr: |
            rate(kube_pod_container_status_restarts_total{
              namespace=~"develop|staging|production",
              pod=~"backend-.*"
            }[15m]) > 0
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend pod is restarting frequently"
            description: "Backend pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

Here is the guardian deployment used to enrich and route actionable alerts.

k8s-ai-monitor deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-ai-monitor
  labels:
    app.kubernetes.io/name: k8s-ai-monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: k8s-ai-monitor
  template:
    metadata:
      labels:
        app.kubernetes.io/name: k8s-ai-monitor
    spec:
      serviceAccountName: k8s-ai-monitor
      securityContext:
        fsGroup: 1000
      imagePullSecrets:
        - name: ghcr-credentials-docker
      terminationGracePeriodSeconds: 30
      containers:
        - name: k8s-ai-monitor
          image: ghcr.io/ldbl/k8s-ai-monitor:main # {"$imagepolicy": "observability:k8s-ai-monitor:tag"}
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          env:
            - name: CLUSTER_NAME
              value: safeops
            - name: WATCH_NAMESPACES
              value: production
            - name: NON_PROD_NAMESPACES
              value: develop,staging
            - name: EXCLUDE_NAMESPACES
              value: kube-system,kube-public,kube-node-lease,flux-system
            - name: LOG_LEVEL
              value: INFO
            - name: LLM_PROVIDER
              value: openai
            - name: PROMETHEUS_URL
              value: http://kube-prometheus-stack-prometheus.observability.svc.cluster.local:9090
            - name: SQLITE_PATH
              value: /data/k8s-ai-monitor.db
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: openai-api-key
                  optional: true
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: anthropic-api-key
                  optional: true
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: slack-webhook-url
                  optional: true
            - name: SLACK_WEBHOOK_URL_NONPROD
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: slack-webhook-url-nonprod
                  optional: true
            - name: INTERNAL_TOKEN
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: internal-token
                  optional: true
            - name: ELASTICSEARCH_URL
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-url
                  optional: true
            - name: ELASTICSEARCH_USER
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-user
                  optional: true
            - name: ELASTICSEARCH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-password
                  optional: true
            - name: SCANNER_CRITICAL_ENDPOINT_ENABLED
              value: "true"
            - name: ENDPOINT_INGRESS_SERVICE
              value: traefik.traefik.svc.cluster.local
            - name: SCANNER_BACKUP_ENABLED
              value: "true"
          volumeMounts:
            - name: data
              mountPath: /data
          readinessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 20
            periodSeconds: 20
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 256Mi
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: k8s-ai-monitor-data

System Context

This chapter gives the rest of the course an evidence-first investigation path.

It becomes essential in:

  • Chapter 12, where drills must be explained, not just survived
  • Chapter 13, where guardian summaries depend on good source signals
  • Chapter 14, where on-call actions should be justified by correlated evidence

Current Operating Model

Metrics: Fast Symptom Detection

Prometheus scrapes backend metrics from /metrics and evaluates alert rules. Grafana provides the operator view for request rate, latency, error rate, saturation, and SLO burn.

Metrics answer the first question: is something wrong right now? They do not tell you the full causal path.
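To make the symptom arithmetic concrete, here is a minimal Go sketch mirroring the shape of the BackendHighErrorRate expression shown earlier: 5xx rate divided by total rate, with a clamped denominator so an idle service never divides by zero. The rates are made-up inputs, not real scrape data.

```go
package main

import (
	"fmt"
	"math"
)

// errorRatio mirrors the PromQL expression's shape:
//   sum(rate(5xx)) / clamp_min(sum(rate(total)), 1e-9)
func errorRatio(rate5xx, rateTotal float64) float64 {
	return rate5xx / math.Max(rateTotal, 1e-9)
}

func main() {
	// Hypothetical rates in requests/second over the 5m window.
	ratio := errorRatio(6, 100)
	fmt.Printf("error ratio: %.2f, warning fires: %v\n", ratio, ratio > 0.05)
}
```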

Traces: Request Path Isolation

Frontend and backend export traces directly to Uptrace. Propagation headers connect browser actions to backend work, so one request becomes one visible path.

Traces answer the second question: where is the request failing or slowing down?

Logs: Evidence and Error Context

The backend emits structured logs with stable fields such as time, level, msg, trace_id, and span_id. That makes kubectl logs useful immediately, even before you add a centralized log backend.

The important invariant is not the vendor. The important invariant is that logs carry the same trace_id you saw in the trace.

Alert Routing: Prometheus Detects, Guardian Routes

Prometheus rules detect the symptom. k8s-ai-monitor consumes Prometheus metrics plus Kubernetes context and routes actionable alerts to webhook targets.

This stack does not rely on Alertmanager for final delivery. The guardian is the routing and enrichment layer.

Centralized Logging Is an Extension, Not the Lesson

Collector-based shipping, Vector, or a cloud log backend may be added later. Those are implementation choices.

This chapter focuses on the operator skill that survives every backend choice:

  • detect in metrics
  • isolate in traces
  • prove in logs

Safe Workflow (Step-by-Step)

  1. Start from symptom in metrics: latency, error rate, request rate, or saturation anomaly.
  2. Pivot to traces and isolate the affected route and span chain.
  3. Correlate with backend or frontend logs using trace_id.
  4. Validate whether the detection path also produced the expected alert or guardian incident.
  5. Decide action only after evidence from at least two signals.
  6. Validate recovery in metrics and confirm trace/log behavior returned to baseline.

Definition of Done: Evidence, Not Assumptions

Incident triage is complete only when the responder can explain:

  • what failed
  • where it failed
  • why it failed

using correlated evidence (metrics + traces + logs), not guesses.

Lab Files

  • lab.md
  • runbook-incident-debug.md
  • sli-slo.md
  • quiz.md

Done When

  • learner can trigger and find one end-to-end trace from frontend to backend
  • learner can match one backend log entry by trace_id
  • learner can explain why the current alert path goes through k8s-ai-monitor
  • learner can run the incident workflow: metrics -> traces -> logs -> action
  • learner can explain backend availability SLI/SLO and validate burn-rate alerts

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Lab: Baseline Observability with Uptrace
  • Quiz: Chapter 10 (Observability)
  • Runbook: Incident Debug (Metrics -> Traces -> Logs)
  • SLI/SLO Spec: Chapter 10 Baseline
