linuxlab.io
Copyright © 2026 LinuxLab. All rights reserved.

kb/observability ── Observability & monitoring ── intermediate

SLI / SLO / error budget: SRE metrics without the noise

SLI is a user-facing metric (availability, p99 latency). SLO is a target over a period (99.9% over 30d). Error budget = 1-SLO, spent on incidents and releases. Multi-window burn-rate alerting replaces threshold alerts, with less noise.

aka: slo, sli, error-budget, burn-rate, multi-window-burn-rate

Why SLI/SLO

A threshold alert like "CPU > 80%" is almost always noise:

  • 80% CPU is fine on a batch worker, a problem on a user-facing service
  • It does not reflect what the user feels
  • The noise (flapping, false positives) wears down on-call

The Google SRE approach (the book "Site Reliability Engineering"):

  1. Define an SLI, a Service Level Indicator. A metric close to the user: percent of successful requests, p99 latency.
  2. Set an SLO, a Service Level Objective. A target over a period: "99.9% of requests succeed over 30 days".
  3. Compute the error budget: 1 - SLO. For 99.9% that is 0.1% of allowed downtime = 43 minutes a month.
  4. Spend the budget on incidents, risky releases, and experiments.
  5. An alert fires when the budget burn rate crosses a threshold.

This shifts the conversation from "something broke" to "how much room do we have left to be broken".

SLI vs SLO vs SLA

Term | What
SLI | Indicator, the metric itself (availability rate, latency p99)
SLO | Objective, the target value (99.9%)
SLA | Agreement, a contract with the user (with a penalty)
Error budget | 1 - SLO. The allowed percentage and duration of failures

An SLO is internal (an engineering tool), an SLA is external (legal, refunds money).

Usually the SLA is looser than the SLO, so the SLA number is the lower one. The SLO is stricter to keep a margin: if SLA=99.5%, set SLO=99.9%.

Good SLIs

They should:

  • Reflect user experience correctly (if the SLI is green, the user is happy)
  • Be aggregatable (you can compute a percentage over a period)
  • Be measured stably (no flapping on trivia)

Good:

  • Availability: successful_requests / total_requests (status < 500 / total)
  • Latency: p99 < 200ms (% requests faster than threshold)
  • Throughput: actual_qps / target_qps
  • Correctness: correct_results / total_results (for batch jobs)
  • Freshness: data_age_p99 < 5min (for pipelines)

Bad:

  • CPU%: does not reflect user experience
  • Avg latency: hides the tail (p99 = 5s, avg = 100ms, the user is angry, the metric is "green")
  • "A user complained": not aggregatable
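
A minimal sketch of why the average hides the tail. The sample is hypothetical (980 requests at 50 ms, 20 at 5 s), and the nearest-rank percentile is one common definition, not the only one.

```python
import math

# Hypothetical sample: 980 fast requests (50 ms) and 20 slow ones (5 s).
latencies_s = [0.05] * 980 + [5.0] * 20

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

avg = sum(latencies_s) / len(latencies_s)
p99 = percentile(latencies_s, 99)
# Ratio-style latency SLI: share of requests faster than the 200 ms threshold.
good_ratio = sum(1 for v in latencies_s if v < 0.2) / len(latencies_s)

print(f"avg={avg:.3f}s p99={p99:.1f}s under_200ms={good_ratio:.1%}")
```

The average stays under 200 ms and looks "green", while p99 sits at 5 s. The ratio form (share of requests under the threshold) is the one that aggregates cleanly over a period.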

Window: rolling vs calendar

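Rolling 30d: "over the last 30 days, 99.9% success". The window slides continuously, so every evaluation covers the most recent 30 days. The SRE standard.

Calendar month: "October at 99.9%". Simple for the business, but it behaves badly at month boundaries: the budget resets on the 1st, so an incident on the last day of the month stops counting a day later.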

Use rolling.
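
A small sketch of the boundary effect, under simplifying assumptions: constant traffic of 100K requests per day, 30-day "months", and one hypothetical incident with 4,000 failures on day 30. All names here are illustrative.

```python
DAYS_PER_WINDOW = 30
REQUESTS_PER_DAY = 100_000

# Hypothetical failure counts for 60 days; one incident on day 30 (index 29).
failures = [0] * 60
failures[29] = 4_000

def availability(failed_by_day):
    """Success ratio over the given days, assuming constant daily traffic."""
    total = REQUESTS_PER_DAY * len(failed_by_day)
    return 1 - sum(failed_by_day) / total

def rolling(day_index):
    """Availability over the 30 days ending at day_index (inclusive)."""
    start = max(0, day_index - DAYS_PER_WINDOW + 1)
    return availability(failures[start:day_index + 1])

def calendar(day_index):
    """Availability since the start of the current 30-day 'month'."""
    start = (day_index // DAYS_PER_WINDOW) * DAYS_PER_WINDOW
    return availability(failures[start:day_index + 1])

# Day 31 (index 30) is the first day of the second "month".
print(f"calendar={calendar(30):.4%} rolling={rolling(30):.4%}")
```

One day after the incident the calendar view is back at 100%, while the rolling view still shows the SLO of 99.9% as violated. The incident leaves the rolling window only after a full 30 days.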

Error budget: how to compute it

SLO = 99.9% over 30 days.

Error budget = 1 - 0.999 = 0.001 = 0.1% of all requests may fail.

If 100K req/day × 30d = 3M req, then 3000 failures are allowed.

After 15 days with 1500 failures, that is 50% of the budget consumed. Over the next 15 days you have 1500 more. If you are already at 2900, that is 97% consumed, with only 100 left for 15 days.

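Burn rate = (share of the budget consumed) / (share of the window elapsed). A burn rate of 1 spends the budget exactly at the end of the window. Anything above 1 spends it faster than linear and exhausts it early, and that is what you alert on. In the example above, 1500 failures after 15 days is a burn rate of 1, and 2900 failures is about 1.9.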
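
The arithmetic above as a sketch, assuming constant traffic (100K req/day) so that the request count scales with elapsed time. The function name and its parameters are illustrative.

```python
def budget_report(slo, window_days, requests_per_day, days_elapsed, failures_so_far):
    """Return (allowed failures, share of budget consumed, average burn rate so far)."""
    allowed = (1 - slo) * requests_per_day * window_days  # failures the SLO tolerates
    consumed = failures_so_far / allowed                   # share of the budget spent
    elapsed = days_elapsed / window_days                   # share of the window elapsed
    return allowed, consumed, consumed / elapsed

for failures in (1500, 2900):
    allowed, consumed, burn = budget_report(0.999, 30, 100_000, 15, failures)
    print(f"failures={failures} allowed={allowed:.0f} "
          f"consumed={consumed:.0%} burn_rate={burn:.2f}")
```

A burn rate of 1 means the budget lasts exactly the window. The second case burns almost twice as fast, so the budget would run out well before day 30.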

As allowed downtime (full outage, 30-day window):

  • 99.9% over 30d = 43.2 minutes of allowed downtime
  • 99.99% = 4.3 minutes
  • 99.999% = 26 seconds (needs hot-standby and multi-region)
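
The same figures can be derived rather than memorised. A sketch, assuming "downtime" means a full outage and a 30-day window:

```python
from datetime import timedelta

def allowed_downtime(slo_percent, window_days=30):
    """Full-outage time the error budget permits over the window."""
    budget_fraction = (100 - slo_percent) / 100
    return timedelta(days=window_days) * budget_fraction

for slo in (99.9, 99.99, 99.999):
    minutes = allowed_downtime(slo).total_seconds() / 60
    print(f"{slo}% over 30d -> {minutes:.2f} min")
```

Each extra nine divides the allowance by ten, which is why 99.999% leaves less time than a human needs to acknowledge a page.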

Multi-window burn rate alerting

The old approach: alert on "error rate > 1% for 5m". The problems:

  • flapping on short spikes
  • it does not tell "slightly slow" from "burning the budget within an hour"

Multi-window burn rate (Google SRE Workbook, ch. 5):

yaml
groups:
  - name: slo
    rules:
      # Burn rate over 5m and 1h at once
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "Burning the budget at 14.4x; in an hour we lose 2% (of the 30-day budget)"
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h])) /
            sum(rate(http_requests_total[1h]))
          ) > (3 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[6h])) /
            sum(rate(http_requests_total[6h]))
          ) > (3 * 0.001)
        for: 15m
        labels: {severity: warning}

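The trick of two windows:

  • Long window (1h): shows that a meaningful share of the budget is really gone, and filters out short spikes
  • Short window (5m): confirms the burn is still happening, so the alert clears soon after recovery

The multipliers (14.4 for fast, 3 for slow) follow from the budget. At 14.4× the whole 30-day budget is gone in about 2 days (2% of it per hour); at 3× it is gone in 10 days. The fast alert pages before a large part of the budget is lost, the slow one catches a steady leak.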
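
A sketch of the logic the alert expressions encode, for a 99.9% SLO over 30 days. The function names are illustrative; the inputs are error ratios as PromQL would return them for each window.

```python
def fires(short_ratio, long_ratio, multiplier, slo=0.999):
    """Multi-window condition: both windows must exceed multiplier * error budget."""
    threshold = multiplier * (1 - slo)
    return short_ratio > threshold and long_ratio > threshold

def budget_spent(multiplier, window_hours, period_hours=30 * 24):
    """Share of the whole budget consumed if this burn rate holds for window_hours."""
    return multiplier * window_hours / period_hours

print(fires(0.05, 0.002, 14.4))  # short spike only: long window stays under threshold
print(fires(0.05, 0.02, 14.4))   # sustained burn: both windows over threshold
print(fires(0.0, 0.02, 14.4))    # recovered: short window is clean again
print(f"{budget_spent(14.4, 1):.1%} of the budget per hour at 14.4x")
```

Only the sustained case fires. The spike is rejected by the long window, and the recovered case is cleared by the short one.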

Burn-rate cheatsheet

For an SLO of 99.9% (0.1% budget):

Burn rate | Time to full burn | When to page
14.4× | 2.1 days | 2-min page (fast)
6× | 5 days | 15-min page (medium)
3× | 10 days | 1h page (slow)
1× | 30 days (planned) | no page

Source: Google SRE Workbook, table 5-2.
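
The "time to full burn" column is just the window divided by the burn rate. A sketch for a 30-day window:

```python
def days_to_exhaustion(burn_rate, window_days=30):
    """Days until the whole budget is gone if the burn rate stays constant."""
    return window_days / burn_rate

for rate in (14.4, 6, 3, 1):
    print(f"{rate}x -> {days_to_exhaustion(rate):.1f} days")
```

The result does not depend on the SLO value itself: a burn rate is already normalised by the budget.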

Error budget policy

Codify what to do when the budget runs out. Example:

Error Budget Policy v1.4
If over a rolling 30d:
 - Budget < 0%: code freeze. Bugfix releases only.
   SRE and dev split priorities 50/50 on reliability work.
 - Budget < 25%: rollout restricted (slow rollout).
   Canary releases are required.
 - Budget > 25%: normal velocity.
Budget resets only with elapsed time.
We do not "forgive" incidents retroactively.

Without a policy, an SLO is just a slide in a dashboard. With a policy it is real governance: the dev team sees the real cost of bad releases.
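
A policy like this is easy to make machine-checkable, for example as a gate in a deploy pipeline. A sketch with hypothetical function names. The example policy does not say what happens at exactly 25%, so treating it as "normal" is an assumption.

```python
def budget_remaining(availability, slo):
    """Share of the error budget left: 1.0 = untouched, 0 = exhausted, below 0 = overspent."""
    return 1 - (1 - availability) / (1 - slo)

def policy_action(remaining):
    """Map the remaining budget to the tiers of the example policy."""
    if remaining < 0:
        return "freeze"      # bugfix releases only
    if remaining < 0.25:
        return "restricted"  # slow rollout, canary required
    return "normal"          # assumption: exactly 25% counts as normal

for availability in (0.9995, 0.9991, 0.998):
    remaining = budget_remaining(availability, slo=0.999)
    print(f"availability={availability} remaining={remaining:.0%} -> {policy_action(remaining)}")
```

The input is the rolling 30d availability, so the gate loosens only as old incidents leave the window, never by manual reset.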

SLOs for different systems

System | SLI | SLO
Web API | success rate, p99 latency | 99.9% / p99 < 200ms
Async queue | processed rate | 99.99% (queue HA)
Batch ETL | freshness, correctness | freshness < 1h, correctness 100%
Cache | latency (not hit rate: it is not user-facing) | latency p99 < 50ms
Search | relevance score, latency | latency p95 < 1s, relevance > 0.7

A cache hit rate is an internal metric, not an SLI. An SLI is what the user sees (latency).

Prometheus recording rules for SLO

yaml
groups:
  - name: slo_recording
    interval: 30s
    rules:
      # Per-service availability
      - record: slo:request_availability:ratio_rate5m
        expr: |
          sum by (service)(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum by (service)(rate(http_requests_total[5m]))
      # Per-service latency SLI
      - record: slo:request_latency:ratio_rate5m
        expr: |
          sum by (service)(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
          /
          sum by (service)(rate(http_request_duration_seconds_count[5m]))

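Burn-rate alerts can reference these recorded series instead of repeating the raw queries. That is cheaper to evaluate and more readable. The availability series is a success ratio, so the error ratio used in the alert expressions is 1 minus it.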
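
What the latency rule computes, as a sketch outside Prometheus. Histogram buckets are cumulative, and the +Inf bucket equals the total count. The bucket values below are hypothetical.

```python
def latency_sli(cumulative_buckets, threshold_le):
    """Share of requests at or below threshold_le, from cumulative bucket counts.

    cumulative_buckets maps each upper bound (le) to the count of requests
    at or below it; the float('inf') entry is the total request count.
    """
    total = cumulative_buckets[float("inf")]
    return cumulative_buckets[threshold_le] / total

# Hypothetical bucket counts for one service over one window.
buckets = {0.05: 700, 0.1: 900, 0.2: 985, 0.5: 998, float("inf"): 1000}

print(f"{latency_sli(buckets, 0.2):.1%} of requests finished within 200 ms")
```

The threshold has to coincide with an existing bucket boundary (here le="0.2"). If the SLO threshold falls between buckets, the histogram cannot give the exact ratio.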

Tools

  • Sloth (open source): generates SLO, alerting, and recording rules from a YAML spec
  • OpenSLO: an SLO-spec standard, supported by Sloth/Nobl9
  • Pyrra: a UI for SLO-as-code in Kubernetes
  • Grafana SLO (paid): managed SLO in Grafana Cloud

Sloth example:

yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: api-slo  # illustrative name
spec:
  service: api
  slos:
    - name: availability
      objective: 99.9
      sli:
        events:
          # The Kubernetes CRD form uses camelCase keys
          errorQuery: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total[{{.window}}]))
      alerting:
        name: ApiHighErrorRate  # illustrative alert name
        pageAlert: {labels: {severity: critical}}
        ticketAlert: {labels: {severity: warning}}

Sloth generates recording and alerting rules automatically with the right multi-window burn rate.

When things go wrong

  • The SLO was missed recently, but the budget is positive again: an effect of the 30d window. An incident from 30 days ago just rolled off. Normal.
  • Budget remaining = 100% all the time: the SLO is too weak. Tighten it.
  • Budget-burn alerting does not fire on a real outage: the burn-rate multiplier is too high, or the SLI does not reflect the affected requests. For example, the SLI counts errors while the outage shows up as latency.
  • Cardinality explosion in SLO recording: sum by (user_id) made a million series. See cardinality-explosion.
  • Teams disagree about an SLO: normal. Agree through iterations, not once and forever.
  • "If we track p99, we need p999": no. p99 covers 99% of the user experience. p999 is noise and ML territory.

Anti-patterns

  • An SLO without a budget policy: engineering theater
  • A 100% SLO: impossible. The error budget is 0, so any deploy breaks it
  • An SLI on an internal metric (CPU, memory): does not reflect the user
  • An alert on any error: you need a for: or multi-window burn rate
  • Multiple overlapping SLOs: choose one user-facing SLO

§ commands

bash
promtool query instant http://prom:9090 'slo:request_availability:ratio_rate5m'

The current availability SLI: 1.0 = all succeeded, 0.999 = 0.1% errors

bash
sloth generate -i slo.yaml -o rules.yaml

Sloth: spec → ready recording and alerting rules for Prometheus

bash
curl -s 'http://prom:9090/api/v1/query?query=1-slo:request_availability:ratio_rate30d' | jq

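The 30d error ratio. Divide it by 1 - SLO (0.001 for 99.9%) to get the share of the budget spent. Assumes a slo:request_availability:ratio_rate30d recording rule exists alongside the 5m one.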

bash
amtool alert query alertname=ErrorBudgetBurnFast

Current fast-burn alerts on the error budget

bash
promtool check rules slo-rules.yaml

Validate SLO recording and alerting rules: syntax, expr references

bash
kubectl apply -f slo-spec.yaml -n slo  # SLO manifest for Pyrra

SLO-as-code: the Pyrra operator turns the manifest into recording rules in Prometheus

§ see also

  • metric-types. Metric types: counter, gauge, histogram, summary
    Four metric types: counter (up only), gauge (any value), histogram (buckets for p99), summary (quantile in the client). Native histogram (Prom 2.40+) uses sparse buckets and is gentler on memory. Exemplars link a metric to a trace_id.
  • opentelemetry. OpenTelemetry: signals, OTLP, Collector pipeline
    OpenTelemetry is the CNCF standard for metrics, traces, and logs in one SDK. The OTLP protocol runs over gRPC or HTTP. The Collector receives, filters, and routes to Prom/Tempo/Loki/Jaeger. Auto-instrumentation needs no code change.
  • metrics-vs-logs-vs-traces. Metrics vs logs vs traces: the three pillars of observability
    Metrics are aggregated numbers over time, cheap, for alerts. Logs are discrete events with context, for root-cause. Traces are request flow across services, for distributed debug. Structure beats volume.