SLI/SLO Spec: Chapter 09 Baseline
Scope
Service in scope: backend HTTP API
Environment scope: develop, staging, production
Indicators (SLIs)
- Availability SLI
  - Definition: ratio of successful requests (non-5xx) to total requests.
  - PromQL:
      1 - (
        sum(rate(app_http_requests_total{job="backend",status=~"5.."}[30m]))
        /
        clamp_min(sum(rate(app_http_requests_total{job="backend"}[30m])), 1e-9)
      )
- Latency SLI (p95)
  - Definition: p95 backend request duration over a rolling 5-minute window.
  - PromQL:
      histogram_quantile(0.95,
        sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
      )
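To make the two queries concrete, here is a minimal Python sketch of the arithmetic they perform. It is not the Prometheus implementation. The function names and sample bucket values are hypothetical. The quantile estimate follows the linear interpolation Prometheus documents for classic histograms, simplified here to assume non-negative bucket bounds.

```python
import math


def availability(error_rate: float, total_rate: float, floor: float = 1e-9) -> float:
    """Return 1 - errors/total, clamping the denominator like clamp_min(..., 1e-9)."""
    return 1.0 - error_rate / max(total_rate, floor)


def quantile_from_buckets(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan  # not enough data to estimate anything

    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-second rates: 0.5 errors/s out of 100 requests/s.
sli = availability(0.5, 100.0)

# Hypothetical cumulative bucket counts for le = 0.1s, 0.5s, 1s, +Inf.
p95 = quantile_from_buckets(0.95, [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)])
```

With zero traffic the clamped denominator keeps the availability expression at 1 rather than producing a division by zero. This matters because an idle service should not register as a breach. The p95 value is only as precise as the bucket boundaries, so a bucket edge at or near the 1s target makes the latency objective easier to evaluate.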
Objectives (SLOs)
- Availability SLO
  - Target: 99.5% over 30 days.
  - Error budget: 0.5%.
- Latency objective (operational target)
  - Target: p95 < 1s on 5-minute windows.
  - Used for warning/critical operational alerts.
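As a worked example of what the 0.5% error budget means in practice, the sketch below converts the availability target into an allowance of failed requests and into an equivalent amount of full-outage time. The function names and the request volume are hypothetical. The time figure assumes uniform traffic, because the SLI itself is request-based.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.005 for a 99.5% target."""
    return 1.0 - slo_target


def allowed_failed_requests(slo_target: float, total_requests: int) -> float:
    """Failed requests the budget tolerates for a given request volume."""
    return error_budget_fraction(slo_target) * total_requests


def allowed_bad_minutes(slo_target: float, window_days: int) -> float:
    """Full-outage minutes equivalent to the budget, assuming uniform traffic."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


# 99.5% over 30 days: about 216 minutes (3.6 hours) of total outage,
# or about 5,000 failed requests per million served (hypothetical volume).
minutes = allowed_bad_minutes(0.995, 30)
failures = allowed_failed_requests(0.995, 1_000_000)
```

Expressing the budget both ways helps when discussing incidents. Partial degradations are easier to reason about in failed requests, while full outages are easier to reason about in minutes.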
Alert Strategy
- Immediate symptom alerts
  - BackendCriticalErrorRate
  - BackendHighLatency
  - BackendServiceDown
- Budget consumption alerts (burn-rate)
  - BackendSLOErrorBudgetBurnCritical: fast burn on 5m and 1h windows (14.4x budget).
  - BackendSLOErrorBudgetBurnWarning: sustained burn on 30m and 1h windows (6x budget).
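The burn-rate multipliers above can be read as "how many times faster than sustainable the budget is being spent". The following sketch shows that arithmetic for the 99.5% target. The function names are hypothetical. Requiring both windows to exceed the threshold is an assumption about how the rules combine their windows, not something this spec states.

```python
def burn_rate(error_ratio: float, slo_target: float = 0.995) -> float:
    """Observed error ratio divided by the budgeted error ratio (1 - SLO)."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent if this burn rate is sustained."""
    return window_days * 24 / rate


def should_fire(short_ratio: float, long_ratio: float, threshold: float,
                slo_target: float = 0.995) -> bool:
    """Assumed rule shape: both the short and long window must exceed the threshold."""
    return (burn_rate(short_ratio, slo_target) >= threshold
            and burn_rate(long_ratio, slo_target) >= threshold)


# 14.4x corresponds to an error ratio of about 7.2% and exhausts the
# 30-day budget in about 50 hours; 6x corresponds to about 3% and 5 days.
critical_hours = hours_to_exhaustion(14.4)
warning_hours = hours_to_exhaustion(6)

# Hypothetical readings: the 5m window spikes but the 1h window stays calm,
# so a rule requiring both windows would not fire.
fires = should_fire(short_ratio=0.08, long_ratio=0.01, threshold=14.4)
```

Pairing a short window with a longer one is a common way to keep fast-burn alerts responsive. The alert fires only when the problem is both current and sustained, and it resets quickly once the error ratio recovers.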
Guardrails
- Do not page on a single-point spike alone; require cross-signal evidence first.
- For customer-impact decisions, require: metrics symptom + one representative trace + correlated log line.
- Every alert route must include runbook: runbook-incident-debug.md.