Copyright © 2026 LinuxLab. All rights reserved.

kb/observability ── Observability & monitoring ── intermediate

Metric types: counter, gauge, histogram, summary

Four metric types: counter (up only), gauge (any value), histogram (buckets for p99), summary (quantile in the client). Native histogram (Prom 2.40+) uses sparse buckets and is gentler on memory. Exemplars link a metric to a trace_id.

aka: counter-gauge-histogram, prometheus-metric-types, histogram-vs-summary, native-histogram

Why different types exist

A metric is not just a number. Its semantics decide how you aggregate it.

  • http_requests_total rose by 1200 over a minute, so rate = 20 req/s. A counter is monotonic, so only the difference between two measurements carries meaning: it gives the pace.
  • memory_usage_bytes = 2.3 GB right now. This is the value itself, and a gauge can swing up and down. Averaging it over an hour gives a meaningful mean.
  • request_duration_seconds is a distribution, so you need percentiles (p50, p99). The mean alone is misleading: it hides tail latency.

Each type is a contract between the application and the query system (PromQL/MetricsQL). Using rate() on a gauge gives garbage. Using avg() on a histogram loses the meaning.
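
To make the point about tail latency concrete, here is a minimal sketch with made-up numbers. The sample data and the helper name nearest_rank_percentile are assumptions for illustration; they are not part of any Prometheus client.

```python
import statistics

def nearest_rank_percentile(values, percent):
    """Return the given percentile (integer 1..100) by the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, which avoids float rounding surprises.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical sample: 98 fast requests and 2 very slow ones, in seconds.
latencies = [0.05] * 98 + [3.0] * 2

mean_latency = statistics.fmean(latencies)
p50_latency = nearest_rank_percentile(latencies, 50)
p99_latency = nearest_rank_percentile(latencies, 99)

print(f"mean={mean_latency:.3f}s p50={p50_latency}s p99={p99_latency}s")
```

The mean lands near 0.1 s and looks healthy, while p99 shows that some users wait 3 s.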

Counter, monotonically increasing

Semantics: how many times an event happened. Up only, with a reset on process restart.

http_requests_total{method="GET",status="200"} 12345

In PromQL, never use the raw counter. Always go through rate() or increase():

rate(http_requests_total[5m])           # average req/s over 5 min
increase(http_requests_total[1h])       # how much it grew over an hour
sum by (status)(rate(http_requests_total[5m]))  # by status

rate() automatically handles a counter reset (on restart).
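
The reset handling can be sketched in a few lines. This is a simplification under stated assumptions: samples are ordered by time, any drop is treated as a restart from zero, and the boundary extrapolation that the real rate() performs is left out. The function names are hypothetical.

```python
def increase_with_resets(samples):
    """Total increase across ordered counter samples, treating any drop as a reset."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter went down: assume the process restarted at zero,
            # so everything counted since the restart is new.
            total += current
    return total

def per_second_rate(samples, window_seconds):
    """Average per-second rate over the window (no extrapolation)."""
    return increase_with_resets(samples) / window_seconds

# Hypothetical scrape values; the process restarts between 220 and 30.
scraped = [100, 160, 220, 30, 90]
print(increase_with_resets(scraped), per_second_rate(scraped, 60))
```

Without the reset branch, the drop from 220 to 30 would show up as a large negative increase.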

Examples of counters in real systems:

  • process_cpu_seconds_total, total CPU time consumed by the process, in seconds
  • node_network_receive_bytes_total, bytes received
  • kafka_consumer_messages_consumed_total

Gauge, current value

Semantics: the value right now. It can rise and fall.

node_memory_MemAvailable_bytes 4521234432
goroutines_active 142
queue_depth{queue="orders"} 87

In PromQL, use it directly:

node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
avg_over_time(queue_depth[10m])
max(queue_depth) by (queue)

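Do not use rate() on a gauge: rate(temperature[5m]) is meaningless, because rate() treats every drop as a counter reset. For the change or slope of a gauge, use delta() or deriv().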
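
A short sketch of why counter logic misreads a gauge. The queue_depth samples are invented, and counter_style_increase is a hypothetical stand-in for the reset handling that rate() and increase() apply.

```python
from statistics import fmean

def counter_style_increase(samples):
    """Counter logic: any drop is treated as a reset to zero."""
    total = 0
    for previous, current in zip(samples, samples[1:]):
        total += current - previous if current >= previous else current
    return total

# Hypothetical gauge samples: the queue grows and drains.
queue_depth = [80, 95, 60, 70, 40]

average_depth = fmean(queue_depth)               # comparable to avg_over_time()
peak_depth = max(queue_depth)                    # comparable to max_over_time()
net_change = queue_depth[-1] - queue_depth[0]    # comparable to delta(), no extrapolation
misread = counter_style_increase(queue_depth)    # what counter logic would report

print(average_depth, peak_depth, net_change, misread)
```

The queue actually shrank by 40, yet counter logic reports an increase of 125 because each drain looks like a restart.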

Histogram, a distribution with buckets

Semantics: how many events fell into each bucket by value.

http_request_duration_seconds_bucket{le="0.1"}  4500
http_request_duration_seconds_bucket{le="0.25"} 4800
http_request_duration_seconds_bucket{le="0.5"}  4920
http_request_duration_seconds_bucket{le="1"}    4980
http_request_duration_seconds_bucket{le="+Inf"} 5000
http_request_duration_seconds_sum               350.5
http_request_duration_seconds_count             5000

Each bucket is a counter, and the buckets are cumulative: le="0.5" counts every event that took <=0.5 sec, including those already counted under le="0.1" and le="0.25".
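
The cumulative layout can be sketched as a tiny class. ClassicHistogram is a hypothetical name, and real client libraries typically store per-bucket counts and accumulate them at export time; this version keeps the cumulative counts directly so they match the exposition above.

```python
import math

class ClassicHistogram:
    """Toy histogram that keeps cumulative bucket counts, plus sum and count."""

    def __init__(self, upper_bounds):
        # The +Inf bucket is always present and ends up equal to the total count.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Cumulative semantics: bump every bucket whose bound is >= value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

histogram = ClassicHistogram([0.1, 0.25, 0.5, 1])
for seconds in (0.05, 0.2, 0.3, 0.7, 2.0):
    histogram.observe(seconds)

for bound, count in zip(histogram.upper_bounds, histogram.bucket_counts):
    print(f'le="{bound}" {count}')
```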

The percentile is computed at query time:

histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

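It is aggregatable across instances: sum the bucket rates first, for example sum by (le, status), and then apply histogram_quantile(). The le label must survive the aggregation. This is unlike summary.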
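
Here is a rough sketch of what happens at query time: bucket counts from several instances are added per le, and the quantile is then estimated by linear interpolation inside the bucket that contains the target rank. It works on raw cumulative counts rather than rates, rejects q outside (0, 1], and assumes the largest bound is +Inf. The second instance's numbers are invented.

```python
import math

def sum_buckets(per_instance):
    """Add cumulative bucket counts from several instances, keyed by upper bound."""
    merged = {}
    for buckets in per_instance:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged

def histogram_quantile(q, buckets):
    """Estimate quantile q from {upper_bound: cumulative_count}."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    if len(bounds) < 2:
        return math.nan
    total = buckets[bounds[-1]]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    index = next(i for i, le in enumerate(bounds) if buckets[le] >= rank)
    upper = bounds[index]
    if math.isinf(upper):
        return bounds[-2]  # capped at the largest finite bound
    lower = bounds[index - 1] if index > 0 else 0.0
    below = buckets[bounds[index - 1]] if index > 0 else 0
    return lower + (upper - lower) * (rank - below) / (buckets[upper] - below)

instance_a = {0.1: 4500, 0.25: 4800, 0.5: 4920, 1.0: 4980, math.inf: 5000}
instance_b = {0.1: 100, 0.25: 300, 0.5: 700, 1.0: 900, math.inf: 1000}  # hypothetical

merged = sum_buckets([instance_a, instance_b])
print(histogram_quantile(0.99, instance_a), histogram_quantile(0.99, merged))
```

For the merged data the p99 rank falls into the +Inf bucket, so the estimate is capped at 1 s. That is a hint that the buckets are too coarse at the top end.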

Choose buckets carefully. Client-library defaults (in the Go client, 11 buckets from 5 ms to 10 s) suit HTTP, but not DB queries (which need microseconds) or batch jobs (which run for minutes). Too many buckets lead to cardinality explosion (cardinality-explosion).

Summary, pre-computed quantiles

The client computes the quantiles itself (p50, p95, p99) and exports them:

http_request_duration_seconds{quantile="0.5"}  0.012
http_request_duration_seconds{quantile="0.95"} 0.087
http_request_duration_seconds{quantile="0.99"} 0.241
http_request_duration_seconds_sum              350.5
http_request_duration_seconds_count            5000

Problems with summary:

  • Not aggregatable: you cannot average or sum quantiles across instances, because the result is mathematically wrong. A summary quantile only makes sense per instance.
  • Expensive on CPU and memory, since the client maintains a streaming quantile estimator.
  • Quantiles are fixed in the instrumentation code, so you cannot compute p99.9 if the client does not export it.
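
The aggregation problem is easy to demonstrate with invented data: one busy, fast instance and one quiet, slow instance. The helper name and the latency values are assumptions for illustration.

```python
def nearest_rank_percentile(values, percent):
    """Return the given percentile (integer 1..100) by the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical latencies in seconds.
instance_a = [0.01] * 990 + [0.02] * 10   # 1000 requests, fast
instance_b = [0.5] * 90 + [2.0] * 10      # 100 requests, slow

p99_a = nearest_rank_percentile(instance_a, 99)
p99_b = nearest_rank_percentile(instance_b, 99)

averaged_p99 = (p99_a + p99_b) / 2                                # what averaging summaries gives
true_p99 = nearest_rank_percentile(instance_a + instance_b, 99)   # computed over all requests

print(p99_a, p99_b, averaged_p99, true_p99)
```

Averaging the two per-instance quantiles gives about 1 s, while the p99 over all 1100 requests is 0.5 s. The average ignores how many requests each instance served.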

Use summary only when p99 is cheap to compute in the client and aggregation is not needed. In most cases, histogram is better.

Histogram vs Summary

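| Criterion                         | Histogram       | Summary           |
|-----------------------------------|-----------------|-------------------|
| Where it is computed              | server (PromQL) | client            |
| Aggregatable across instances     | yes             | no                |
| Quantile precision                | bucket-bounded  | exact-ish         |
| Quantile changeable after the fact| yes (new query) | no                |
| Memory in client                  | low             | medium            |
| Cardinality                       | bucket × labels | quantile × labels |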

Rule: use histogram. Summary is only for legacy.

Native histogram (Prom 2.40+)

The problem with a classic histogram is fixed buckets, either many (cardinality) or few (poor precision).

A native histogram (a.k.a. sparse histogram) builds buckets on the fly, with a logarithmic scale and sparse encoding:

metric_native_histogram{} {schema:1, count:5000, sum:350.5,
                          positive_buckets: ...sparse...}

  • One time series per label set (instead of one per bucket)
  • Precision of about 1-3% at any quantile
  • 100x less storage than a classic histogram with the same precision
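
The logarithmic scale can be sketched as follows, assuming the standard exponential layout in which the growth factor between bucket boundaries is 2^(2^-schema) and bucket i covers (base^(i-1), base^i]. The function names are hypothetical, and only positive values are handled; zero and negative observations use separate buckets that are not modeled here.

```python
import math

def bucket_base(schema):
    """Growth factor between consecutive bucket boundaries."""
    return 2.0 ** (2.0 ** -schema)

def bucket_index(value, schema):
    """Index of the bucket (base^(i-1), base^i] that contains a positive value."""
    if value <= 0:
        raise ValueError("this sketch only covers positive observations")
    return math.ceil(math.log2(value) * 2 ** schema)

def bucket_bounds(index, schema):
    """Lower (exclusive) and upper (inclusive) boundary of a bucket."""
    base = bucket_base(schema)
    return base ** (index - 1), base ** index

# schema 1 as in the sample above: boundaries grow by sqrt(2).
index = bucket_index(0.42, 1)
lower, upper = bucket_bounds(index, 1)
print(index, lower, upper)

# Higher schemas give narrower buckets, and therefore tighter quantile estimates.
for schema in (0, 3, 5, 8):
    print(schema, f"{(bucket_base(schema) - 1) * 100:.2f}% relative bucket width")
```

Only buckets that actually receive observations need to be stored, which is where the sparse encoding comes from.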

It requires:

  • A client SDK with support (version numbers refer to the Prometheus client library, not the language: Go client >=1.16, Python client >=0.18, Java client >=1.0)
  • Prometheus 2.40+ with --enable-feature=native-histograms
  • Grafana 10+ for visualization

Usable in production as of Prom 2.50+ (released in 2024), although the 2.x line still ships it behind the feature flag above. It should become the default.

Exemplars, a bridge to traces

An exemplar is a concrete sample attached to a bucket:

http_request_duration_seconds_bucket{le="0.5"} 4920 # {trace_id="abc123"} 0.42 1683456789.123

"One of the 4920 requests in this bucket had trace_id=abc123." In Grafana you can click a point on the graph and jump into the trace (tracing-basics).

Support:

  • Prometheus 2.26+ (requires the OpenMetrics format)
  • SDK: Go, Java, Python, .NET
  • Grafana 8+
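
To show how the exemplar line above breaks down, here is a minimal parser sketch. It assumes the simple shape from the example (sample value, then "#", then exemplar labels, value, and optional timestamp) and does not handle escaped quotes or a timestamp on the sample itself. The function name parse_exemplar_line is hypothetical.

```python
import re

_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'\s+#\s+\{(?P<ex_labels>[^}]*)\}'
    r'\s+(?P<ex_value>\S+)'
    r'(?:\s+(?P<ex_ts>\S+))?\s*$'
)
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')

def parse_exemplar_line(line):
    """Split a sample line with an exemplar into its parts, or return None."""
    match = _LINE.match(line)
    if match is None:
        return None  # no exemplar on this line, or a shape this sketch does not cover
    timestamp = match.group("ex_ts")
    return {
        "name": match.group("name"),
        "labels": dict(_LABEL.findall(match.group("labels") or "")),
        "value": float(match.group("value")),
        "exemplar_labels": dict(_LABEL.findall(match.group("ex_labels"))),
        "exemplar_value": float(match.group("ex_value")),
        "exemplar_timestamp": float(timestamp) if timestamp is not None else None,
    }

line = ('http_request_duration_seconds_bucket{le="0.5"} 4920 '
        '# {trace_id="abc123"} 0.42 1683456789.123')
parsed = parse_exemplar_line(line)
print(parsed["exemplar_labels"]["trace_id"], parsed["exemplar_value"])
```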

OpenMetrics, the formal spec

OpenMetrics is a CNCF standard (RFC-style) for metrics that extends the Prometheus exposition format:

  • UTF-8, not ASCII
  • A # UNIT line
  • Exemplars are formalized
  • Protobuf serialization (optional)

As of 2025, most SDKs serve OpenMetrics automatically when the scraper asks for it. Prometheus and the OTel Collector (opentelemetry) both understand it.

When things go wrong

  • rate(my_metric[5m]) returns 0: the value did not change inside the window. Either no events occurred, or the metric is really a gauge sitting at a constant value. PromQL does not check metric types, so confirm the # TYPE line and keep the _total suffix for real counters.
  • p99 latency jumps around: too few samples in the window (rate over [5m] for 0.1 req/s is 30 events, which is noisy). Widen the window to [30m].
  • histogram_quantile() returns NaN: there are no observations in the window, so every bucket rate is 0. Values above the largest finite bucket do not produce NaN; the result is capped at that bucket's upper bound. Check _count > 0 over the same window.
  • Cardinality explosion: you added endpoint=/api/v1/user/123, so every user_id creates a new series. Refactor to endpoint=/api/v1/user/:id.
  • Summary quantile is inaccurate after a restart: the client estimator resets. This is a property of summary, not a bug.
  • Unquoted le values: label values in the exposition format are always quoted strings, so write le="0.1" and le="0.5". An unquoted le=0.1 breaks parsing.
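
For the cardinality case in the list above, one possible fix is to normalize the path before using it as a label value. This is a sketch, assuming IDs are purely numeric path segments; normalize_endpoint is a hypothetical helper, and most web frameworks can supply the route template directly, which is preferable when available.

```python
import re

# A path segment made only of digits, followed by "/" or the end of the path.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")

def normalize_endpoint(path):
    """Replace numeric path segments with a fixed placeholder."""
    return _NUMERIC_SEGMENT.sub("/:id", path)

raw_paths = [f"/api/v1/user/{user_id}" for user_id in range(1000)]

raw_series = len(set(raw_paths))
normalized_series = len({normalize_endpoint(path) for path in raw_paths})

print(raw_series, normalized_series)
print(normalize_endpoint("/api/v1/user/123/orders/456"))
```

The version segment v1 is left alone because it is not purely numeric.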

§ commands

bash
curl -s localhost:9090/metrics | grep -E 'http_requests_total|^# (TYPE|HELP) http_requests'

Look at the raw export: HELP, TYPE, and the counter values in the text exposition format

bash
promtool query instant http://localhost:9090 'rate(http_requests_total[5m])'

Compute the request rate over a 5-min window from the CLI without opening Grafana

bash
promtool query instant http://localhost:9090 'histogram_quantile(0.99, sum by (le)(rate(http_request_duration_seconds_bucket[5m])))'

p99 latency aggregated across instances, with the correct sum by (le) BEFORE quantile

bash
curl -s 'localhost:9090/api/v1/query?query=up' | jq '.data.result[].value'

All 'up' gauge values: 1 means the scrape works, 0 means the target is down
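
The same response can be processed in a script. The sketch below parses a hard-coded sample shaped like an instant-vector reply from /api/v1/query; the targets in it are invented, and down_instances is a hypothetical helper. Note that sample values arrive as strings.

```python
import json

# Hypothetical response body for query=up (two targets, one of them down).
SAMPLE_RESPONSE = """
{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
   "value": [1683456789.123, "1"]},
  {"metric": {"__name__": "up", "job": "node", "instance": "localhost:9100"},
   "value": [1683456789.123, "0"]}
]}}
"""

def down_instances(response_text):
    """Return the instance label of every series whose value is 0."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, raw_value = series["value"]  # value is [timestamp, "string"]
        if float(raw_value) == 0.0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return down

print(down_instances(SAMPLE_RESPONSE))
```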

bash
promtool check rules /etc/prometheus/recording.yml

Validate the syntax of the recording rules file. promtool does not verify that each function suits the metric type, so review that by hand

bash
prometheus --enable-feature=native-histograms,exemplar-storage

Enable native histograms and storage for exemplars in Prometheus 2.40+

§ see also

  • opentelemetry ── OpenTelemetry: signals, OTLP, Collector pipeline
    OpenTelemetry is the CNCF standard for metrics, traces, and logs in one SDK. The OTLP protocol runs over gRPC or HTTP. The Collector receives, filters, and routes to Prom/Tempo/Loki/Jaeger. Auto-instrumentation needs no code change.
  • sli-slo-error-budget ── SLI / SLO / error budget: SRE metrics without the noise
    SLI is a user-facing metric (availability, p99 latency). SLO is a target over a period (99.9% over 30d). Error budget = 1-SLO, spent on incidents and releases. Multi-window burn-rate alerting replaces threshold alerts, with less noise.
  • metrics-vs-logs-vs-traces ── Metrics vs logs vs traces: the three pillars of observability
    Metrics are aggregated numbers over time, cheap, for alerts. Logs are discrete events with context, for root-cause. Traces are request flow across services, for distributed debug. Structure beats volume.