Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • backend-alerts.yaml
  • servicemonitor.yaml

What You Will Produce

A reproducible lab result, a passed quiz, and evidence of incident-safe operation.

Guardrails That Stop It

  • No Plaintext Credentials: Telemetry credentials (e.g., Uptrace/Grafana) are never in plaintext.
  • Evidence-First Action: No debugging based on logs only; always pivot through traces.
  • Correlated Rollbacks: Rollback decisions must be tied to combined evidence (metrics + traces + logs).
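
The "No Plaintext Credentials" guardrail can also be checked mechanically before a manifest ships. A minimal sketch, assuming a simple keyword scan (the pattern list, function name, and sample manifest are illustrative, not part of the chapter's tooling):

```python
import re

# Illustrative keyword pattern only; a real scanner would cover many more cases.
CREDENTIAL_PATTERN = re.compile(
    r"(?i)(password|passwd|api[_-]?key|token)\s*[:=]\s*\S+"
)

def find_plaintext_credentials(text: str) -> list[str]:
    """Return the lines of a config/manifest that look like inline credentials."""
    hits = []
    for line in text.splitlines():
        if CREDENTIAL_PATTERN.search(line):
            hits.append(line.strip())
    return hits

# Hypothetical config fragment: the first line uses indirection, the second inlines a secret.
manifest = """\
uptrace_dsn: ${SECRET_UPTRACE_DSN}   # indirection via a Secret, passes the scan
grafana_password: "hunter2"          # inline plaintext, caught by the scan
"""

print(find_plaintext_credentials(manifest))
```

Such a scan makes the guardrail enforceable in CI rather than relying on review discipline alone.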

Current Operating Model

1. Metrics: Fast Symptom Detection

Prometheus scrapes backend metrics and evaluates alert rules. Grafana provides the operator view (rate, latency, error rate). Metrics answer: “Is something wrong right now?”

Backend ServiceMonitor

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend
  labels:
    app: backend
spec:
  selector:
    matchLabels:
      app: backend
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
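
One constraint worth internalizing from the manifest above: Prometheus requires a scrape to time out before the next scrape starts, so scrapeTimeout must not exceed interval. A quick sanity check in plain code (the helper names are illustrative, not part of the chapter's tooling):

```python
def duration_to_seconds(d: str) -> float:
    """Parse a simple Prometheus-style duration such as '30s', '2m', '1h', '500ms'."""
    units = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
    for suffix in sorted(units, key=len, reverse=True):  # try 'ms' before 's'/'m'
        if d.endswith(suffix):
            return float(d[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unsupported duration: {d}")

def endpoint_is_sane(interval: str, scrape_timeout: str) -> bool:
    """A scrape must be allowed to finish before the next one begins."""
    return duration_to_seconds(scrape_timeout) <= duration_to_seconds(interval)

# Values from the backend ServiceMonitor above: interval 30s, scrapeTimeout 10s.
print(endpoint_is_sane("30s", "10s"))
```

An inverted pair (say, a 30s timeout against a 10s interval) would be rejected by Prometheus at config load.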

2. Traces: Request Path Isolation

Frontend and backend export traces directly to Uptrace. Propagation headers connect browser actions to backend work. Traces answer: “Where is the request failing or slowing down?”
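
In the common W3C Trace Context setup, the propagation header that ties a browser action to backend work is a single traceparent value. A minimal parser as a sketch of what the header carries (real services should let an OpenTelemetry SDK handle this, not hand-roll it):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields.

    Format: version-trace_id-parent_span_id-trace_flags
    """
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,   # the value you will search backend logs for
        "span_id": span_id,
        "sampled": flags == "01",
    }

# Example header using the W3C specification's sample IDs.
ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])
```

The trace_id extracted here is the same identifier that appears in Uptrace and, per the next section, in the backend's structured logs.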

3. Logs: Evidence and Error Context

Backend emits structured logs with trace_id and span_id. This makes kubectl logs useful immediately. Logs answer: “What exactly happened at that point?”
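
Trace-correlated structured logging can be as simple as one JSON object per line. A sketch of the idea (field names are illustrative; match whatever your backend's logger actually emits):

```python
import json
import sys
import time

def log_event(level: str, message: str, trace_id: str, span_id: str, **fields) -> dict:
    """Emit one JSON log line carrying the active trace context."""
    record = {
        "ts": time.time(),
        "level": level,
        "msg": message,
        "trace_id": trace_id,   # makes `kubectl logs ... | grep <trace_id>` useful
        "span_id": span_id,
        **fields,
    }
    sys.stdout.write(json.dumps(record) + "\n")
    return record

# Hypothetical error from a checkout request, tagged with its trace context.
rec = log_event("error", "payment gateway timeout",
                trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
                span_id="00f067aa0ba902b7",
                route="/api/checkout", status=504)
```

Because every line carries trace_id and span_id, a single grep recovers the full log history of one request.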

4. Alert Routing: Prometheus Detects, Guardian Routes

Prometheus rules detect symptoms. k8s-ai-monitor enriches them with context and routes actionable alerts. We do not rely on Alertmanager for final delivery.

Backend alert rules

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backend-alerts
  namespace: observability
  labels:
    prometheus: kube-prometheus-stack
    role: alert-rules
spec:
  groups:
    - name: backend.rules
      interval: 30s
      rules:
        # High error rate alert
        - alert: BackendHighErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.05
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Critical error rate alert
        - alert: BackendCriticalErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.10
          for: 2m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service has critical error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 10%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # High latency alert (p95)
        - alert: BackendHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high latency"
            description: "Backend p95 latency is {{ $value }}s (threshold: 1s)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Service down alert
        - alert: BackendServiceDown
          expr: up{job="backend"} == 0
          for: 1m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service is down"
            description: "Backend service {{ $labels.instance }} is down"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # High memory usage
        - alert: BackendHighMemoryUsage
          expr: |
            (
              process_resident_memory_bytes{job="backend"}
              /
              1024 / 1024 / 1024
            ) > 0.8
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high memory usage"
            description: "Backend memory usage is {{ $value }}GB (threshold: 0.8GB)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Too many goroutines
        - alert: BackendHighGoroutines
          expr: go_goroutines{job="backend"} > 10000
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend has too many goroutines"
            description: "Backend has {{ $value }} goroutines (threshold: 10000)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Pod restarts
        - alert: BackendPodRestarting
          expr: |
            rate(kube_pod_container_status_restarts_total{
              namespace=~"develop|staging|production",
              pod=~"backend-.*"
            }[15m]) > 0
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend pod is restarting frequently"
            description: "Backend pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
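
The clamp_min in the error-rate expressions above guards the division against a zero denominator when the service receives no traffic. The same arithmetic in plain code, as a sketch of the guard's effect (not a substitute for the PromQL):

```python
def error_ratio(error_rate: float, total_rate: float, floor: float = 1e-9) -> float:
    """Mirror of the alert expression: errors / clamp_min(total, 1e-9)."""
    return error_rate / max(total_rate, floor)

# Normal traffic: 2 errors/s out of 50 req/s is a 4% error rate,
# below the 5% warning threshold.
print(error_ratio(2.0, 50.0))

# No traffic at all: the floor keeps the division defined instead of
# producing NaN/Inf and a flapping or silent alert.
print(error_ratio(0.0, 0.0))
```

Without the floor, a quiet period would make the expression undefined, and the alert could neither fire nor resolve cleanly.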

Safe Workflow (Step-by-Step)

  1. Start from Metrics: Identify the symptom (latency/error spike).
  2. Pivot to Traces: Isolate the affected route and span chain.
  3. Correlate with Logs: Search backend logs using the specific trace_id.
  4. Identify Cause: Confirm the causal path.
  5. Act: Execute the lowest-risk mitigation based on the evidence.
  6. Verify Recovery: Confirm the metrics return to baseline.
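
Step 3 of the workflow, correlating logs with a trace, amounts to filtering the backend's JSON log stream on the trace_id taken from the trace view. A minimal sketch (the log shape and sample lines are illustrative):

```python
import json

def logs_for_trace(log_lines, trace_id):
    """Yield the parsed log records belonging to one request's trace."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON noise in the stream
        if record.get("trace_id") == trace_id:
            yield record

# Hypothetical output of `kubectl logs` for the backend pod.
stream = [
    '{"level":"info","msg":"request start","trace_id":"abc123"}',
    '{"level":"error","msg":"db timeout","trace_id":"abc123"}',
    '{"level":"info","msg":"unrelated request","trace_id":"def456"}',
    "plain text line from a sidecar",
]

for rec in logs_for_trace(stream, "abc123"):
    print(rec["level"], rec["msg"])
```

This is the evidence-first pivot in miniature: the trace names the request, and the filter recovers exactly the log context for that request and nothing else.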

This builds on availability engineering (Chapter 09): metrics detect when scaling fails. This enables backup and restore (Chapter 11): observability validates recovery success.