Guardrails That Stop It
- No Plaintext Credentials: Telemetry credentials (e.g., Uptrace, Grafana) are never stored in plaintext.
- Evidence-First Action: Never debug from logs alone; always pivot through traces.
- Correlated Rollbacks: Rollback decisions must be tied to combined evidence (metrics + traces + logs).
Current Operating Model
1. Metrics: Fast Symptom Detection
Prometheus scrapes backend metrics and evaluates alert rules. Grafana provides the operator view (request rate, latency, error rate). Metrics answer: “Is something wrong right now?”
Backend ServiceMonitor
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend
  labels:
    app: backend
spec:
  selector:
    matchLabels:
      app: backend
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
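With those metrics scraped, the same conditions the alert rules evaluate can be previewed ad hoc in a Grafana panel. A minimal sketch of an error-rate query, assuming the `app_http_requests_total` counter used by the alert rules:

```promql
# 5xx responses as a share of all backend requests over the last 5 minutes
sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
  / sum(rate(app_http_requests_total{job="backend"}[5m]))
```

Graphed over time, this is the baseline operators compare against after a mitigation.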
2. Traces: Request Path Isolation
Frontend and backend export traces directly to Uptrace. Propagation headers connect browser actions to backend work. Traces answer: “Where is the request failing or slowing down?”
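Export configuration can live entirely in the Deployment spec via the standard OTLP environment variables. A sketch (the endpoint, Secret name, and key are illustrative assumptions, not the project's actual values); note the DSN header is pulled from a Secret, per the no-plaintext-credentials guardrail:

```yaml
# Backend Deployment fragment (illustrative names)
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "https://otlp.uptrace.dev:4318"  # assumed OTLP/HTTP endpoint
  - name: OTEL_EXPORTER_OTLP_HEADERS       # e.g. "uptrace-dsn=<dsn>"
    valueFrom:
      secretKeyRef:
        name: uptrace-credentials          # hypothetical Secret
        key: otlp-headers
```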
3. Logs: Evidence and Error Context
Backend emits structured logs with trace_id and span_id. This makes kubectl logs useful immediately. Logs answer: “What exactly happened at that point?”
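Because every record carries trace_id, correlating logs with a trace is a pure filter over JSON lines. A minimal sketch (the log lines and field names are illustrative, assuming one JSON object per line as `kubectl logs` would emit):

```python
import json

# Hypothetical structured backend log output: one JSON object per line,
# each carrying the trace_id/span_id injected by the logging middleware.
raw_logs = """\
{"level":"info","msg":"request started","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
{"level":"error","msg":"db timeout","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"b7ad6b7169203331"}
{"level":"info","msg":"request started","trace_id":"0af7651916cd43dd8448eb211c80319c","span_id":"b9c7c989f97918e1"}
"""

def logs_for_trace(lines: str, trace_id: str) -> list:
    """Return parsed records matching one trace_id — the structured
    equivalent of `kubectl logs deploy/backend | grep <trace_id>`."""
    return [rec for line in lines.splitlines()
            if (rec := json.loads(line)).get("trace_id") == trace_id]

matches = logs_for_trace(raw_logs, "4bf92f3577b34da6a3ce929d0e0e4736")
print([m["msg"] for m in matches])  # → ['request started', 'db timeout']
```

The trace_id found in Uptrace drops straight into this filter, which is what makes the logs useful immediately.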
4. Alert Routing: Prometheus Detects, Guardian Routes
Prometheus rules detect symptoms. k8s-ai-monitor enriches them with context and routes actionable alerts. We do not rely on Alertmanager for final delivery.
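One way to realize this split is to make k8s-ai-monitor the only Alertmanager receiver, so Alertmanager is merely a transport and the monitor owns enrichment and final delivery. A sketch (the service name, port, and path are assumptions):

```yaml
# alertmanager.yaml fragment — receiver URL is illustrative
route:
  receiver: k8s-ai-monitor
receivers:
  - name: k8s-ai-monitor
    webhook_configs:
      - url: http://k8s-ai-monitor.observability.svc:8080/alerts
        send_resolved: true
```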
Backend alert rules
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backend-alerts
  namespace: observability
  labels:
    prometheus: kube-prometheus-stack
    role: alert-rules
spec:
  groups:
    - name: backend.rules
      interval: 30s
      rules:
        # High error rate alert
        - alert: BackendHighErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.05
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # Critical error rate alert
        - alert: BackendCriticalErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.10
          for: 2m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service has critical error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 10%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # High latency alert (p95)
        - alert: BackendHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high latency"
            description: "Backend p95 latency is {{ $value }}s (threshold: 1s)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # Service down alert
        - alert: BackendServiceDown
          expr: up{job="backend"} == 0
          for: 1m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service is down"
            description: "Backend service {{ $labels.instance }} is down"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # High memory usage
        - alert: BackendHighMemoryUsage
          expr: |
            (
              process_resident_memory_bytes{job="backend"}
              / 1024 / 1024 / 1024
            ) > 0.8
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high memory usage"
            description: "Backend memory usage is {{ $value }}GB (threshold: 0.8GB)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # Too many goroutines
        - alert: BackendHighGoroutines
          expr: go_goroutines{job="backend"} > 10000
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend has too many goroutines"
            description: "Backend has {{ $value }} goroutines (threshold: 10000)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # Pod restarts
        - alert: BackendPodRestarting
          expr: |
            rate(kube_pod_container_status_restarts_total{
              namespace=~"develop|staging|production",
              pod=~"backend-.*"
            }[15m]) > 0
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend pod is restarting frequently"
            description: "Backend pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
Safe Workflow (Step-by-Step)
- Start from Metrics: Identify the symptom (latency/error spike).
- Pivot to Traces: Isolate the affected route and span chain.
- Correlate with Logs: Search backend logs using the specific trace_id.
- Identify Cause: Confirm the causal path.
- Act: Execute the lowest-risk mitigation based on the evidence.
- Verify Recovery: Confirm the metrics return to baseline.
This builds on: Availability engineering (Chapter 09) — metrics detect when scaling fails.
This enables: Backup and restore (Chapter 11) — observability validates recovery success.