Why SLI/SLO
A threshold alert like "CPU > 80%" is almost always noise:
- 80% CPU is fine on a batch worker, a problem on a user-facing service
- It does not reflect what the user feels
- The noise (flapping, false positives) wears down on-call
The Google SRE approach (the book "Site Reliability Engineering"):
- Define an SLI, a Service Level Indicator. A metric close to the user: percent of successful requests, p99 latency.
- Set an SLO, a Service Level Objective. A target over a period: "99.9% of requests succeed over 30 days".
- Compute the error budget: `1 - SLO`. For 99.9% that is 0.1% of allowed downtime, about 43 minutes a month.
- Spend the budget on incidents, risky releases, and experiments.
- An alert fires when the budget burn rate crosses a threshold.
This shifts the conversation from "something broke" to "how much room do we have left to be broken".
SLI vs SLO vs SLA
| Term | What |
|---|---|
| SLI | Indicator, the metric itself (availability rate, latency p99) |
| SLO | Objective, the target value (99.9%) |
| SLA | Agreement, a contract with the user (with a penalty) |
| Error budget | 1 - SLO. The allowed percentage and duration of failures |
An SLO is internal (an engineering tool); an SLA is external (a legal commitment, with refunds when it is broken).
The SLO is usually stricter than the SLA, which leaves a margin before the contract is breached. If the SLA is 99.5%, set the SLO to 99.9%.
Good SLIs
They should:
- Reflect user experience correctly (if the SLI is green, the user is happy)
- Be aggregatable (you can compute a percentage over a period)
- Be measured stably (no flapping on trivia)
Good:
- Availability: `successful_requests / total_requests` (status < 500 / total)
- Latency: `p99 < 200ms` (% of requests faster than the threshold)
- Throughput: `actual_qps / target_qps`
- Correctness: `correct_results / total_results` (for batch jobs)
- Freshness: `data_age_p99 < 5min` (for pipelines)
Bad:
- CPU%: does not reflect user experience
- Avg latency: hides the tail (p99 = 5s, avg = 100ms, the user is angry, the metric is "green")
- "A user complained": not aggregatable
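To see how an average can hide the tail, here is a minimal sketch on made-up latencies. The sample values and the helper names `percentile` and `fraction_under` are illustrative assumptions, not part of any monitoring library.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def fraction_under(values, threshold):
    """Share of samples strictly faster than the threshold (a ratio-style latency SLI)."""
    return sum(1 for v in values if v < threshold) / len(values)

# Hypothetical sample: 98 fast requests at 50 ms and 2 slow ones at 5 s.
latencies_ms = [50] * 98 + [5000] * 2

avg = sum(latencies_ms) / len(latencies_ms)      # looks healthy against a 200 ms target
p99 = percentile(latencies_ms, 99)               # exposes the slow tail
good_ratio = fraction_under(latencies_ms, 200)   # aggregatable "good requests" ratio

print(f"avg={avg} ms, p99={p99} ms, under 200 ms: {good_ratio:.0%}")
```

The ratio form (`good_ratio`) is the one that aggregates cleanly over a window, which is why latency SLIs are usually expressed as "share of requests under a threshold".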
Window: rolling vs calendar
Rolling 30d: "over the last 30 days, 99.9% success". It is recomputed at every moment. The SRE standard.
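Calendar month: "October at 99.9%". Simple for the business, but it behaves badly at month boundaries: the budget resets on the 1st, so an incident on the last day of the month is forgotten the next day.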
Use rolling.
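The month-boundary problem can be illustrated with a small sketch. The traffic numbers and the incident on October 31 are hypothetical, and `success_ratio` is a helper invented for this example.

```python
def success_ratio(days):
    """days: list of (total_requests, failed_requests) tuples, one per day."""
    total = sum(t for t, _ in days)
    failed = sum(f for _, f in days)
    return 1 - failed / total

# Hypothetical traffic: 100,000 requests a day; an incident on Oct 31 fails 5,000 of them.
october = [(100_000, 0)] * 30 + [(100_000, 5_000)]
november_first = [(100_000, 0)]

# Calendar window on Nov 1: only November counts, so the incident has vanished.
calendar_nov_1 = success_ratio(november_first)

# Rolling window on Nov 1: the last 30 days, which still include Oct 31.
rolling_nov_1 = success_ratio(october[-29:] + november_first)

print(f"calendar: {calendar_nov_1:.4%}, rolling 30d: {rolling_nov_1:.4%}")
```

Under these assumptions the calendar view shows a perfect month one day after the incident, while the rolling view still reports the SLO as missed.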
Error budget: how to compute it
SLO = 99.9% over 30 days.
Error budget = 1 - 0.999 = 0.001 = 0.1% of all requests may fail.
If 100K req/day × 30d = 3M req, then 3000 failures are allowed.
After 15 days with 1500 failures, that is 50% of the budget consumed. Over the next 15 days you have 1500 more. If you are already at 2900, that is 97% consumed, with only 100 left for 15 days.
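Burn rate = share of the budget consumed / share of the window elapsed. A burn rate of 1 uses up exactly the whole budget by the end of the window. If the budget is being spent faster than that, alert.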
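The arithmetic above can be checked with a short sketch. The function names are assumptions made for this example; the numbers are the ones from the text (99.9% SLO, 100K requests a day, 30 days).

```python
def allowed_failures(slo, total_requests):
    """Error budget expressed as a number of requests that may fail in the window."""
    return (1 - slo) * total_requests

def budget_consumed(failed, slo, total_requests):
    """Share of the error budget already spent (1.0 means fully spent)."""
    return failed / allowed_failures(slo, total_requests)

def burn_rate(consumed, elapsed_days, window_days=30):
    """Share of budget consumed divided by share of window elapsed.

    1.0 means the budget runs out exactly at the end of the window.
    """
    return consumed / (elapsed_days / window_days)

total = 100_000 * 30  # requests over the 30-day window
for failed in (1500, 2900):
    consumed = budget_consumed(failed, 0.999, total)
    rate = burn_rate(consumed, elapsed_days=15)
    print(f"{failed} failures after 15 days: {consumed:.0%} consumed, burn rate {rate:.2f}")
```

A burn rate near 1 after 15 days means the service is exactly on pace; a value near 2 means the budget will be gone well before the window ends.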
As allowed downtime over a 30-day window:
- 99.9% over 30d = 43.2 minutes of allowed downtime
- 99.99% = 4.3 minutes
- 99.999% = 26 seconds (needs hot-standby and multi-region)
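These figures follow directly from the window length. A minimal sketch, assuming a 30-day window and treating the whole budget as full downtime (`allowed_downtime` is a name chosen for this example):

```python
from datetime import timedelta

def allowed_downtime(slo_percent, window=timedelta(days=30)):
    """Total downtime the error budget permits over the window."""
    return window * (1 - slo_percent / 100)

for slo in (99.9, 99.99, 99.999):
    seconds = allowed_downtime(slo).total_seconds()
    print(f"{slo}% -> {seconds / 60:.1f} minutes ({seconds:.0f} s)")
```

Each extra nine divides the allowance by ten, which is why 99.999% leaves no room for a human to react.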
Multi-window burn rate alerting
The old approach: alert on "error rate > 1% for 5m". The problems:
- flapping on short spikes
- it does not distinguish "slightly degraded" from "burning the budget within an hour"
Multi-window burn rate (Google SRE Workbook, ch. 5):
    groups:
      - name: slo
        rules:
          # Burn rate over 5m and 1h at once
          - alert: ErrorBudgetBurnFast
            expr: |
              (
                sum(rate(http_requests_total{status=~"5.."}[5m]))
                / sum(rate(http_requests_total[5m]))
              ) > (14.4 * 0.001)
              and
              (
                sum(rate(http_requests_total{status=~"5.."}[1h]))
                / sum(rate(http_requests_total[1h]))
              ) > (14.4 * 0.001)
            for: 2m
            labels: {severity: critical}
            annotations:
              summary: "Burning the budget at 14.4x; in an hour we lose 2% (of the 30-day budget)"
          - alert: ErrorBudgetBurnSlow
            expr: |
              (
                sum(rate(http_requests_total{status=~"5.."}[1h]))
                / sum(rate(http_requests_total[1h]))
              ) > (3 * 0.001)
              and
              (
                sum(rate(http_requests_total{status=~"5.."}[6h]))
                / sum(rate(http_requests_total[6h]))
              ) > (3 * 0.001)
            for: 15m
            labels: {severity: warning}

The trick of two windows:
- Short window (5m): react fast
- Long window (1h): filter out flaps
The multipliers (14.4 for fast, 3 for slow) are chosen to page early: at 14.4× the 30-day budget is gone in about two days, at 3× in ten days.
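The two-window condition can be sketched outside PromQL as a plain function. This is only an illustration of the logic in the alert expressions; `should_page` and its parameters are names assumed for this example, and the error ratios would in practice come from the monitoring system.

```python
def should_page(short_window_error_ratio, long_window_error_ratio,
                slo=0.999, burn_rate=14.4):
    """Fire only when BOTH windows exceed burn_rate times the error budget."""
    threshold = burn_rate * (1 - slo)  # 14.4 * 0.001 = 1.44% errors
    return (short_window_error_ratio > threshold
            and long_window_error_ratio > threshold)

# A short spike: 5% errors over 5m, but the 1h ratio is still low. No page.
print(should_page(0.05, 0.001))

# A sustained burn: both windows are above 1.44%. Page.
print(should_page(0.05, 0.02))

# Already recovered: the 1h ratio is still high, but the last 5m are clean. No page.
print(should_page(0.001, 0.02))
```

Requiring both windows is what suppresses pages for brief spikes and lets the alert clear soon after the errors stop.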
Burn-rate cheatsheet
For an SLO of 99.9% (0.1% budget):
| Burn rate | Time to full burn | When to page |
|---|---|---|
| 14.4× | 2.1 days | 2-min page (fast) |
| 6× | 5 days | 15-min page (medium) |
| 3× | 10 days | 1h page (slow) |
| 1× | 30 days (planned) | no page |
Source: Google SRE Workbook, table 5-2.
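The "time to full burn" column is simply the window length divided by the burn rate. A small sketch, assuming a 30-day window (the function names are chosen for this example):

```python
def days_to_exhaustion(burn_rate, window_days=30):
    """Days until the whole budget is gone if the burn rate stays constant."""
    return window_days / burn_rate

def budget_spent(burn_rate, alert_window_hours, window_days=30):
    """Share of the window's budget consumed during one alert window at this burn rate."""
    return burn_rate * alert_window_hours / (window_days * 24)

for rate in (14.4, 6, 3, 1):
    print(f"{rate}x -> {days_to_exhaustion(rate):.1f} days")

# One hour at 14.4x spends about 2% of the 30-day budget.
print(f"{budget_spent(14.4, 1):.1%}")
```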
Error budget policy
Codify what to do when the budget runs out. Example:
    Error Budget Policy v1.4
    If over a rolling 30d:
    - Budget < 0%: code freeze. Bugfix releases only.
      SRE and dev split priorities 50/50 on reliability work.
    - Budget < 25%: rollout restricted (slow rollout).
      Canary releases are required.
    - Budget > 25%: normal velocity.
    Budget resets only with elapsed time.
    We do not "forgive" incidents retroactively.
Without a policy, an SLO is just a slide in a dashboard. With a policy it is real governance: the dev team sees the real cost of bad releases.
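A policy like the example above is easy to encode so that a deploy pipeline can consult it. The sketch below is one possible encoding, not a standard; `release_policy` is a hypothetical name, and treating exactly 25% as "normal" is an assumption, since the example policy does not say which side that boundary falls on.

```python
def release_policy(budget_remaining):
    """Map the unspent share of the 30-day error budget to a release mode.

    budget_remaining is a fraction: 1.0 is untouched, negative is overspent.
    """
    if budget_remaining < 0:
        return "code freeze"          # bugfix releases only
    if budget_remaining < 0.25:
        return "restricted rollout"   # slow rollout, canary required
    return "normal velocity"          # assumption: exactly 25% counts as normal

for remaining in (-0.05, 0.10, 0.60):
    print(f"{remaining:+.0%} budget left -> {release_policy(remaining)}")
```

A CI job could call such a function with the current budget from the SLO recording rules and block or slow down a rollout accordingly.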
SLOs for different systems
| System | SLI | SLO |
|---|---|---|
| Web API | success rate, p99 latency | 99.9% / p99 < 200ms |
| Async queue | processed rate | 99.99% (queue HA) |
| Batch ETL | freshness, correctness | freshness < 1h, correctness 100% |
| Cache | latency p99 (not hit rate, which is not user-facing) | p99 < 50ms |
| Search | relevance score, latency | latency p95 < 1s, relevance > 0.7 |
A cache hit rate is an internal metric, not an SLI. An SLI is what the user sees (latency).
Prometheus recording rules for SLO
    groups:
      - name: slo_recording
        interval: 30s
        rules:
          # Per-service availability
          - record: slo:request_availability:ratio_rate5m
            expr: |
              sum by (service)(rate(http_requests_total{status!~"5.."}[5m]))
              /
              sum by (service)(rate(http_requests_total[5m]))
          # Per-service latency SLI
          - record: slo:request_latency:ratio_rate5m
            expr: |
              sum by (service)(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
              /
              sum by (service)(rate(http_request_duration_seconds_count[5m]))
Burn-rate alerts should reference recording rules like these (one per window) instead of raw queries. That is cheaper to evaluate and more readable.
Tools
- Sloth (open source): generates SLO, alerting, and recording rules from a YAML spec
- OpenSLO: an SLO-spec standard, supported by Sloth/Nobl9
- Pyrra: a UI for SLO-as-code in Kubernetes
- Grafana SLO (paid): managed SLO in Grafana Cloud
Sloth example:
    apiVersion: sloth.slok.dev/v1
    kind: PrometheusServiceLevel
    metadata:
      name: api-slos  # placeholder; a Kubernetes object needs a name
    spec:
      service: api
      slos:
        - name: availability
          objective: 99.9
          sli:
            events:
              # The Kubernetes CRD uses camelCase keys
              errorQuery: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
              totalQuery: sum(rate(http_requests_total[{{.window}}]))
          alerting:
            name: ApiAvailability  # placeholder alert name
            pageAlert: {labels: {severity: critical}}
            ticketAlert: {labels: {severity: warning}}

Sloth generates recording and alerting rules automatically with the right multi-window burn rate.
When things go wrong
- The SLO was missed, but the budget is positive again: that is the rolling 30d window at work. An incident from 30 days ago just rolled out of the window. Normal.
- Budget = 100% all the time: the SLO is too weak. Tighten it.
- Budget-burn alerting does not fire on a real outage: the burn-rate multiplier is too high, or the SLI does not cover the affected requests. For example, the SLI measures the error rate while the outage shows up as latency.
- Cardinality explosion in SLO recording: `sum by (user_id)` made a million series. See cardinality-explosion.
- Teams disagree about an SLO: normal. Agree through iterations, not once and forever.
- "If we track p99, we need p999": no. p99 already covers 99% of requests. p999 is mostly noise and is ML territory, not SLO alerting.
Anti-patterns
- An SLO without a budget policy: engineering theater
- A 100% SLO: impossible. The error budget is 0, so a single failed request breaks it and no deploy is ever safe to ship
- An SLI on an internal metric (CPU, memory): does not reflect the user
- An alert on any error: you need a `for:` or multi-window burn rate
- Multiple overlapping SLOs: choose one user-facing SLO