SLI / SLO / error budget: SRE metrics without the noise

Why SLI/SLO

A threshold alert like "CPU > 80%" is almost always noise:

80% CPU is fine on a batch worker, a problem on a user-facing service
It does not reflect what the user feels
The noise (flapping, false positives) wears down on-call

The Google SRE approach (the book "Site Reliability Engineering"):

Define an SLI, a Service Level Indicator. A metric close to the user: percent of successful requests, p99 latency.
Set an SLO, a Service Level Objective. A target over a period: "99.9% of requests succeed over 30 days".
Compute the error budget: 1 - SLO. For 99.9% that is 0.1% of allowed downtime = 43 minutes a month.
Spend the budget on incidents, risky releases, and experiments.
An alert fires when the budget burn rate crosses a threshold.

This shifts the conversation from "something broke" to "how much room do we have left to be broken".

SLI vs SLO vs SLA

Term	What
SLI	Indicator, the metric itself (availability rate, latency p99)
SLO	Objective, the target value (99.9%)
SLA	Agreement, a contract with the user (with a penalty)
Error budget	1 - SLO. The allowed percentage and duration of failures

An SLO is internal (an engineering tool), an SLA is external (legal, refunds money).

Usually the SLO is stricter than the SLA, which leaves a margin before penalties apply. If SLA=99.5%, set SLO=99.9%.

Good SLIs

They should:

Reflect user experience correctly (if the SLI is green, the user is happy)
Be aggregatable (you can compute a percentage over a period)
Be measured stably (no flapping on trivia)

Good:

Availability: successful_requests / total_requests (status < 500 / total)
Latency: p99 < 200ms (% requests faster than threshold)
Throughput: actual_qps / target_qps
Correctness: correct_results / total_results (for batch jobs)
Freshness: data_age_p99 < 5min (for pipelines)

Bad:

CPU%: does not reflect user experience
Avg latency: hides the tail (p99 = 5s, avg = 100ms, the user is angry, the metric is "green")
"A user complained": not aggregatable
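To make the good/bad contrast concrete, here is a minimal Python sketch that computes an availability SLI and a latency SLI from a list of requests, and shows how the average hides the tail. The `Request` type, the helper names, and the sample data are all hypothetical, invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # latency in milliseconds


def availability_sli(requests):
    """Share of requests that did not fail with a 5xx status."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    """Share of requests served faster than the threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample: 980 fast requests (one of them a 500) and 20 very slow ones.
sample = (
    [Request(200, 50.0)] * 979
    + [Request(500, 50.0)]
    + [Request(200, 5000.0)] * 20
)
latencies = [r.latency_ms for r in sample]
mean_latency = sum(latencies) / len(latencies)
p99_latency = percentile(latencies, 99)
```

On this sample the mean latency is 149 ms, which looks healthy, while p99 is 5 s and only 98% of requests beat the 200 ms threshold. The ratio form (good events / total events) is what makes both SLIs aggregatable over a period.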

Window: rolling vs calendar

Rolling 30d: "over the last 30 days, 99.9% success". It is recomputed at every moment. The SRE standard.

Calendar month: "October at 99.9%". Simple for the business, but it behaves badly at month boundaries: the budget resets on the 1st, so an outage on the last day of the month is forgotten a day later.

Use rolling.
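A rolling window can be sketched in a few lines of Python. This is an illustration under assumptions, not a production implementation: the input is a hypothetical list of per-day `(good, total)` request counts, and `rolling_sli` is a made-up helper name.

```python
from collections import deque


def rolling_sli(daily_counts, window_days=30):
    """Yield the success ratio over the trailing window, one value per day.

    daily_counts: iterable of (good_requests, total_requests) per day.
    """
    window = deque()
    good_sum = 0
    total_sum = 0
    for good, total in daily_counts:
        window.append((good, total))
        good_sum += good
        total_sum += total
        if len(window) > window_days:
            # The oldest day rolls off the window.
            old_good, old_total = window.popleft()
            good_sum -= old_good
            total_sum -= old_total
        yield good_sum / total_sum if total_sum else 1.0


# Hypothetical traffic: 100K requests a day for 40 days, 6000 failures on day index 5.
days = [(100_000, 100_000)] * 40
days[5] = (94_000, 100_000)
series = list(rolling_sli(days))
```

The incident on day 5 keeps the SLI at 99.8% for as long as it is inside the window (up to day 34) and disappears on day 35, regardless of where the month boundary falls.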

Error budget: how to compute it

SLO = 99.9% over 30 days.

Error budget = 1 - 0.999 = 0.001 = 0.1% of all requests may fail.

If 100K req/day × 30d = 3M req, then 3000 failures are allowed.

After 15 days with 1500 failures, that is 50% of the budget consumed. Over the next 15 days you have 1500 more. If you are already at 2900, that is 97% consumed, with only 100 left for 15 days.

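Burn rate = share of the budget consumed / share of the window elapsed. A burn rate of 1 spends the budget exactly by the end of the window. If the budget is spent faster than that, alert.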
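The arithmetic above can be checked with a short Python sketch. It assumes that both consumption and elapsed time are expressed as fractions of the 30-day window, so a burn rate of 1.0 means "on pace to spend exactly the whole budget"; `budget_status` and its parameters are hypothetical names.

```python
def budget_status(slo, total_requests, failed_so_far, days_elapsed, window_days=30):
    """Return (allowed_failures, fraction_consumed, burn_rate) for a request-based SLO.

    total_requests is the expected request volume for the whole window.
    """
    allowed = (1 - slo) * total_requests    # e.g. 0.001 * 3M = 3000 failures
    consumed = failed_so_far / allowed      # share of the budget already spent
    elapsed = days_elapsed / window_days    # share of the window already passed
    burn_rate = consumed / elapsed          # 1.0 means linear spending
    return allowed, consumed, burn_rate


# The two cases from the text: 1500 and 2900 failures after 15 of 30 days.
on_pace = budget_status(0.999, 3_000_000, 1500, 15)
too_fast = budget_status(0.999, 3_000_000, 2900, 15)
```

For 1500 failures the burn rate is 1.0 (linear); for 2900 failures about 97% of the budget is gone and the burn rate is roughly 1.93, almost twice the sustainable pace.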

As allowed downtime:

99.9% over 30d = 43.2 minutes of allowed downtime
99.99% = 4.3 minutes
99.999% = 26 seconds (needs hot-standby and multi-region)
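These figures follow directly from the window length. A minimal sketch, assuming a 30-day window and a time-based reading of the SLO (the function name is hypothetical):

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Minutes of full downtime that fit into the error budget of the window."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1 - slo) * window_minutes


three_nines = allowed_downtime_minutes(0.999)
four_nines = allowed_downtime_minutes(0.9999)
five_nines_seconds = allowed_downtime_minutes(0.99999) * 60
```

Each extra nine divides the allowance by ten: about 43.2 minutes, 4.32 minutes, and just under 26 seconds.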

Multi-window burn rate alerting

The old approach: alert on "error rate > 1% for 5m". The problems:

flapping on short spikes
it does not tell "slightly slow" from "burning the budget within an hour"

Multi-window burn rate (Google SRE Workbook, ch. 5):

yaml

groups:
  - name: slo
    rules:
      # Fast burn: the error ratio exceeds 14.4x the budget over 5m and 1h at once
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "Burning the budget at 14.4x; in an hour we lose 2% (of the 30-day budget)"
      # Slow burn: the error ratio exceeds 3x the budget over 1h and 6h at once
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) /
            sum(rate(http_requests_total[1h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[6h])) /
            sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
        for: 15m
        labels: {severity: warning}

The trick of two windows:

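Long window (1h): confirms the burn is sustained, which filters out flaps
Short window (5m): confirms the burn is still happening, so the alert clears soon after recovery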

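The multipliers (14.4 for fast, 3 for slow) follow from how much budget may be spent within the long window: 14.4 means 2% of the 30-day budget in 1 hour (0.02 × 720 h / 1 h). At a sustained 14.4× the whole budget is gone in about 2 days; at 3×, in 10 days.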
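The two-window condition and the origin of the 14.4 multiplier can be sketched in Python. This only mirrors the logic of the alert expressions; the function names are hypothetical and the error ratios would normally come from Prometheus rather than being passed in by hand.

```python
def burn_rate_factor(budget_fraction, alert_window_hours, slo_window_days=30):
    """Burn rate that spends budget_fraction of the budget within alert_window_hours."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


def should_alert(short_error_ratio, long_error_ratio, slo, factor):
    """Fire only when both windows exceed factor x the error budget."""
    threshold = factor * (1 - slo)
    return short_error_ratio > threshold and long_error_ratio > threshold


# 2% of a 30-day budget in 1 hour gives the fast-burn multiplier.
fast_factor = burn_rate_factor(0.02, 1)

sustained = should_alert(0.020, 0.016, slo=0.999, factor=14.4)   # both windows hot
short_spike = should_alert(0.050, 0.005, slo=0.999, factor=14.4) # 5m hot, 1h calm
recovered = should_alert(0.001, 0.020, slo=0.999, factor=14.4)   # 1h hot, 5m calm
```

Only the sustained case fires: a short spike does not move the long window enough, and a recovered service no longer trips the short window.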

Burn-rate cheatsheet

For an SLO of 99.9% (0.1% budget):

Burn rate	Time to full burn	When to page
14.4×	2.1 days	2-min page (fast)
6×	5 days	15-min page (medium)
3×	10 days	1h page (slow)
1×	30 days (planned)	no page

Source: Google SRE Workbook, table 5-2.
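The "time to full burn" column is simply the window length divided by the burn rate. A tiny sketch, assuming a 30-day window (the function name is hypothetical):

```python
def days_to_exhaustion(burn_rate, window_days=30):
    """Days until the whole error budget is spent at a constant burn rate."""
    return window_days / burn_rate


cheatsheet = {rate: days_to_exhaustion(rate) for rate in (14.4, 6, 3, 1)}
```

The same function works for any SLO target, because the burn rate is already normalised to the budget.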

Error budget policy

Codify what to do when the budget runs out. Example:

Error Budget Policy v1.4

If, over a rolling 30d:

 - Budget < 0%: code freeze. Bugfix releases only.
   SRE and dev split priorities 50/50 on reliability work.
 - Budget < 25%: rollout restricted (slow rollout).
   Canary releases are required.
 - Budget ≥ 25%: normal velocity.

The budget recovers only with elapsed time.
We do not "forgive" incidents retroactively.

Without a policy, an SLO is just a chart on a dashboard. With a policy it is real governance: the dev team sees the real cost of bad releases.
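A policy like the example above is easy to encode so that a deploy pipeline can consult it. The sketch below is one possible reading, not a standard: the function name and the returned labels are hypothetical, and treating a budget of exactly 25% as "normal" is an assumption.

```python
def release_policy(budget_remaining):
    """Map the remaining error budget (a fraction, may be negative) to a release mode."""
    if budget_remaining < 0:
        return "freeze"      # bugfix releases only
    if budget_remaining < 0.25:
        return "restricted"  # slow rollout, canary required
    return "normal"          # normal velocity


modes = [release_policy(b) for b in (-0.10, 0.10, 0.25, 0.80)]
```

Keeping the thresholds in one function makes the policy reviewable and versionable alongside the SLO definitions.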

SLOs for different systems

System	SLI	SLO
Web API	success rate, p99 latency	99.9% / p99 < 200ms
Async queue	processed rate	99.99% (queue HA)
Batch ETL	freshness, correctness	freshness < 1h, correctness 100%
Cache	latency p99 (not hit rate, which is not user-facing)	p99 < 50ms
Search	relevance score, latency	latency p95 < 1s, relevance > 0.7

A cache hit rate is an internal metric, not an SLI. An SLI is what the user sees (latency).

Prometheus recording rules for SLO

yaml

groups:
  - name: slo_recording
    interval: 30s
    rules:
      # Per-service availability: non-5xx requests / all requests
      - record: slo:request_availability:ratio_rate5m
        expr: |
          sum by (service)(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum by (service)(rate(http_requests_total[5m]))
      # Per-service latency SLI: requests within 200ms / all requests
      - record: slo:request_latency:ratio_rate5m
        expr: |
          sum by (service)(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
          /
          sum by (service)(rate(http_request_duration_seconds_count[5m]))

Burn-rate alerting uses these recording rules instead of raw queries. Cheaper and more readable.
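The latency rule works because Prometheus histogram buckets are cumulative: the `le="0.2"` bucket already counts every request that took 0.2 s or less. A small Python sketch of the same ratio, using hypothetical bucket counts and a made-up helper name:

```python
import math


def latency_sli_from_buckets(buckets, threshold_s):
    """Share of requests at or below threshold_s.

    buckets maps each upper bound ("le") to its cumulative count;
    the math.inf bucket holds the total number of requests.
    """
    if threshold_s not in buckets:
        # The threshold has to match an existing bucket boundary.
        raise KeyError(f"no bucket with le={threshold_s}")
    total = buckets[math.inf]
    return buckets[threshold_s] / total if total else 1.0


# Hypothetical cumulative counts for one service.
buckets = {0.1: 900, 0.2: 985, 0.5: 998, math.inf: 1000}
sli = latency_sli_from_buckets(buckets, 0.2)
```

This also shows a practical constraint: the SLO threshold must coincide with a bucket boundary that the service actually exports.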

Tools

Sloth (open source): generates SLO, alerting, and recording rules from a YAML spec
OpenSLO: an SLO-spec standard, supported by Sloth/Nobl9
Pyrra: a UI for SLO-as-code in Kubernetes
Grafana SLO (paid): managed SLO in Grafana Cloud

Sloth example:

yaml

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-slo  # placeholder resource name
spec:
  service: api
  slos:
    - name: availability
      objective: 99.9
      sli:
        events:
          # The Kubernetes CRD uses camelCase keys
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
      alerting:
        name: ApiHighErrorRate  # placeholder alert name
        pageAlert: {labels: {severity: critical}}
        ticketAlert: {labels: {severity: warning}}

Sloth generates recording and alerting rules automatically with the right multi-window burn rate.

When things go wrong

The SLO was missed recently, yet the budget is positive again: an effect of the rolling 30d window. An incident from 30 days ago has just rolled off. Normal.
Budget = 100% all the time: the SLO is too loose. Tighten it.
Budget-burn alerting does not fire on a real outage: the burn-rate multiplier is too high, or the SLI does not reflect the affected requests. For example, the SLI measures error rate while the outage shows up as latency.
Cardinality explosion in SLO recording: sum by (user_id) made a million series. See cardinality-explosion.
Teams disagree about an SLO: normal. Agree through iterations, not once and forever.
"If we track p99, we need p999": no. p99 covers 99% of requests. p999 is noise and ML territory.

Anti-patterns

An SLO without a budget policy: engineering theater
A 100% SLO: impossible. The error budget is 0, so any deploy breaks it
An SLI on an internal metric (CPU, memory): does not reflect the user
An alert on any error: you need a for: or multi-window burn rate
Multiple overlapping SLOs: choose one user-facing SLO

