linuxlab.io
Copyright © 2026 LinuxLab. All rights reserved.

kb/observability ── Observability & monitoring ── beginner

Metrics vs logs vs traces: the three pillars of observability

Metrics are aggregated numbers over time: cheap to store and the basis for alerts. Logs are discrete events with context, for root-cause analysis. Traces follow a request across services, for distributed debugging. Structure beats volume.

aka: observability-pillars, three-pillars, logs-metrics-traces, metrics-logs-traces

Why three pillars

A server went down at 3:42 in the morning. What do you have?

  • A metric tells you: "error rate jumped from 0.1% to 47% at 03:41". Cheap, in-memory, easy to alert on. But it will not tell you "why".
  • A log tells you: ERROR connection refused at db.example.com:5432 pool exhausted. A specific event with context. But you have 50 GB of logs per hour, and grep is dead.
  • A trace shows you: POST /checkout (4.2s) -> cart-service (12ms) -> payment-service (4.1s) -> stripe-api (timeout). You see exactly where the time was lost.

Each pillar answers its own question. With only one, the blind spots are large. In the literature this triad is called the three pillars of observability; the related acronym MELT (metrics, events, logs, traces) adds events as a fourth signal.

Comparison

| Property | Metrics | Logs | Traces |
|---|---|---|---|
| Data type | numeric time series | semi-structured events | tree spans |
| Cardinality cost | critical (see below) | medium | low |
| Storage cost | cheap (~$0.01/series/mo) | expensive ($0.50/GB) | medium |
| Latency to alert | seconds | minutes | minutes |
| Root-cause depth | shallow | deep | medium |
| Sampling | none (aggregates) | rarely | head/tail usually |
| Tools | Prometheus, Victoria | Loki, Elastic, Splunk | Jaeger, Tempo |

When to pick which

Metrics are for everything you need to alert on and graph: request rate, p99 latency, CPU, memory, queue depth. Their cardinality is bounded (metric name x label-value combinations), aggregates are cheap, and a year of retention is affordable.

They are not for: "find a specific request", "see the stack trace of an error", "understand the sequence of calls".

Logs are for error context and audit. When a metric triggers, you go to the logs to see "what exactly happened". Structured JSON logs ({"level":"error","trace_id":"abc","msg":"..."}) are better than plain text: you can filter by fields without a regex.
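A minimal sketch of emitting such structured logs with Python's standard logging module. The JsonFormatter class, the build_logger helper, and the "checkout" logger name are illustrative assumptions, not part of any particular logging library:

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            # trace_id arrives through `extra=`; None if the caller did not pass it.
            "trace_id": getattr(record, "trace_id", None),
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Hypothetical helper: a logger whose only handler writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger()
log.error("checkout failed", extra={"trace_id": "abc"})
```

Each call produces one self-describing line, so a log backend can filter on level or trace_id as fields rather than matching substrings.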

They are not for: "the overall trend" (computing it from logs is on the order of 100x slower than reading a metric) or broad alerts such as latency.

Traces are for distributed systems where a request goes through 5+ services. You see the full path, parent-child calls, and where the latency is. On top of that, trace_id links logs across services (correlation).
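To make "where the latency is" concrete, here is a small sketch that models a trace as a tree of spans and finds the span with the most self time (its own duration minus its children's). The Span class, the sibling layout of cart-service and payment-service, and the 4000 ms stripe-api duration are assumptions for illustration; real tracers also handle overlapping children, which this ignores:

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def self_time_ms(span: Span) -> float:
    """Time spent in the span itself, assuming children run one after another."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def walk(span: Span):
    """Yield every span in the tree, parents before children."""
    yield span
    for child in span.children:
        yield from walk(child)


# Hypothetical trace; the stripe-api duration is an assumed value.
trace = Span("POST /checkout", 4200, [
    Span("cart-service", 12),
    Span("payment-service", 4100, [Span("stripe-api", 4000)]),
])

slowest = max(walk(trace), key=self_time_ms)
print(slowest.name, self_time_ms(slowest))
```

Ranking by self time rather than total duration avoids blaming the root span, which is always the longest simply because it contains everything else.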

They are not for: a monolith (overkill) or the internal logic of functions (use [[pyroscope-continuous-profiling|profiling]]).

Structured logging instead of grep

The old approach:

2024-05-03 14:22:31 ERROR Something failed for user 12345

To find errors for user=12345: grep "user 12345" *.log | grep ERROR. Slow and imprecise.

A structured log:

json
{"ts":"2024-05-03T14:22:31Z","level":"error","user_id":"12345",
 "trace_id":"7f3a","msg":"checkout failed","err":"connection refused"}

A query in [[loki-grafana-logging|Loki]] LogQL:

{service="checkout"} | json | level="error" | user_id="12345"

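Precise and fast: the label selector {service="checkout"} narrows the search through the index, and the json stage filters on parsed fields instead of a regex.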
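The same field-based filtering can be sketched offline in Python, which shows why structured lines are easier to query than plain text. The filter_logs function and the second and third sample lines are hypothetical; this illustrates the idea and does not describe how Loki evaluates LogQL:

```python
import json

RAW_LINES = [
    '{"ts":"2024-05-03T14:22:31Z","level":"error","user_id":"12345",'
    '"trace_id":"7f3a","msg":"checkout failed","err":"connection refused"}',
    # Hypothetical second entry: same user, different level.
    '{"ts":"2024-05-03T14:22:32Z","level":"info","user_id":"12345",'
    '"trace_id":"7f3b","msg":"cart loaded"}',
    # Legacy plain-text line: cannot be filtered by field.
    "2024-05-03 14:22:33 ERROR Something failed for user 12345",
]


def filter_logs(lines, **wanted):
    """Yield parsed entries whose fields equal every key=value in `wanted`."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(entry, dict) and all(entry.get(k) == v for k, v in wanted.items()):
            yield entry


matches = list(filter_logs(RAW_LINES, level="error", user_id="12345"))
print([m["trace_id"] for m in matches])
```

The plain-text line is silently skipped, which is the practical cost of mixing formats: anything unstructured falls back to substring search.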

Cost trade-off

In a large system:

  • Metrics: 100K series x 1 sample/15s = ~$200/mo in Prom
  • Logs: 1 TB/day of log volume = ~$5K/mo in Loki, ~$30K in Splunk
  • Traces: 10K traces/sec with every trace kept (no sampling) = OOM. That is why you use sampling (head 1%, tail 100% on errors).
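The sampling policy in the last bullet can be sketched as a single keep/drop decision. This is a simplification under stated assumptions: keep_trace is a hypothetical function, the decision is made after the trace has finished (which is what tail-based sampling requires), and a real head sampler would decide before the outcome is known. The traffic mix is simulated:

```python
import random


def keep_trace(has_error: bool, head_rate: float = 0.01, rng=random.random) -> bool:
    """Keep every trace that contains an error, plus a random `head_rate` share of the rest."""
    if has_error:
        return True
    return rng() < head_rate


random.seed(42)
# Simulated traffic: 100,000 traces, every 1,000th one contains an error.
kept_errors = kept_ok = 0
for i in range(100_000):
    has_error = i % 1000 == 0
    if keep_trace(has_error):
        if has_error:
            kept_errors += 1
        else:
            kept_ok += 1

print(kept_errors, kept_ok)
```

All 100 error traces survive, while only about 1% of the healthy ones are stored, so storage shrinks by roughly two orders of magnitude without losing the traces you are most likely to need.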

Cardinality kills metrics fast: add a user_id label for 1M users and every metric that carries it can grow to 1M times as many series, enough to push Prometheus into OOM (see cardinality-explosion). So high-cardinality fields (user_id, request_id) go into logs and traces, not into metrics.
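The arithmetic behind the cardinality warning is a product of label-value counts. A short sketch, where series_count, samples_per_day, and the label sets are hypothetical and the result is an upper bound (not every combination necessarily occurs):

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_values.values())


def samples_per_day(series: int, scrape_interval_s: int = 15) -> int:
    """Samples ingested per day when every series is scraped once per interval."""
    return series * (86_400 // scrape_interval_s)


# Hypothetical label sets for a single request-counter metric.
bounded = {"service": 20, "method": 5, "status": 10}
with_user = {**bounded, "user_id": 1_000_000}

print(series_count(bounded))      # 1000
print(series_count(with_user))    # 1000000000
print(samples_per_day(100_000))   # 576000000
```

One extra label turns a thousand series into a billion, which is why the label set of a metric deserves the same review as a database schema.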

Correlation through trace_id

The lifecycle of a request:

  1. The frontend generates trace_id=abc123
  2. It passes it to the backend through the [[http2-internals|HTTP traceparent header]] (W3C Trace Context standard)
  3. The backend writes to logs: {"trace_id":"abc123",...}
  4. The backend creates spans through the [[opentelemetry|OpenTelemetry SDK]]
  5. An alert on the latency_p99 metric fires
  6. You open the dashboard, see the slow trace, see the slow span, then open the logs for that span

This is what "the pillars work together" means in practice: one investigation path, not three separate tools. Without trace_id in the logs, you have to stitch events together by timestamp by hand.
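Steps 1 to 3 of the lifecycle can be sketched with the standard library alone. The traceparent layout used here (version 00, 32-hex trace-id, 16-hex parent-id, 2-hex flags) follows W3C Trace Context; the function names are hypothetical, and in production the OpenTelemetry SDK would normally generate and propagate the header for you. The short trace_id=abc123 in the steps is an abbreviation for readability:

```python
import json
import re
import secrets

# version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags (W3C Trace Context).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a fresh header value, as the edge of the system would."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled) or None if the header is malformed."""
    m = TRACEPARENT_RE.match(header.strip())
    if m is None:
        return None
    trace_id, span_id, flags = m.groups()
    # All-zero IDs are invalid.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


# Backend side: read the incoming header and put trace_id into the log line.
incoming = new_traceparent()
parsed = parse_traceparent(incoming)
if parsed is not None:
    trace_id, parent_span_id, _sampled = parsed
    print(json.dumps({"trace_id": trace_id, "span_id": parent_span_id, "msg": "request received"}))
```

Because the same trace_id ends up in both the span and the log line, the jump from a slow span to its logs becomes a lookup by ID instead of a timestamp search.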

When things go wrong

  • Metrics say "all OK", but users complain: the alerts are on the average, not on the [[metric-types|p99 histogram]]. Average latency = 100ms, p99 = 5s.
  • Cardinality explosion in Prometheus after labeling by user_id: Prom goes OOM, and there are no metrics at all. Remove the label, see cardinality-explosion.
  • Logs without trace_id: you cannot link events from 3 services. Add trace_id to the logger context right at the edge.
  • Trace sampling at 1% hides rare errors: switch to tail-based sampling and keep 100% of the traces in which a span errors.
  • Logs grow 10x after a deploy: someone added log.info("entered function") to a hot path. Remove it or raise the threshold.
  • An alert on up == 0 did not fire: Prometheus itself was down, or the target dropped out of service discovery, so there was no up series left to evaluate. A metric about metrics is a blind spot. You need a deadman alert (see alerting-rules-alertmanager).
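The first failure mode, an average that hides the tail, is easy to reproduce with a few lines of arithmetic. The latency sample below is hypothetical and chosen to land near the figures quoted above; the percentile function uses the nearest-rank method on raw samples, whereas Prometheus estimates quantiles from histogram buckets:

```python
import math
from statistics import mean


def percentile(values, q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 985 fast requests at 25 ms, 15 slow ones at 5000 ms.
latencies_ms = [25] * 985 + [5000] * 15

print(round(mean(latencies_ms), 1))   # 99.6
print(percentile(latencies_ms, 50))   # 25
print(percentile(latencies_ms, 99))   # 5000
```

An alert on the mean sees about 100 ms and stays quiet, while 1.5% of users wait five seconds; alerting on p99 catches exactly that group.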

Evolution: OpenTelemetry unifies

It used to be 3 different SDKs (Prometheus client, Jaeger client, log lib), 3 different pipelines, 3 formats.

[[opentelemetry|OpenTelemetry]] combines all three signals into one SDK and the OTLP protocol. One collector takes in metrics, logs, and traces, and routes them to Prom/Loki/Tempo. It reduces coupling and simplifies migration between backends.

§ commands

bash
curl -s 'localhost:9090/api/v1/query?query=up' | jq .

The simplest metric, 'up', is the availability of all scrape targets in Prometheus

bash
logcli query '{service="api"} |= "error"' --limit 50

Loki LogQL: find the last 50 errors in the logs of the api service

bash
curl -s tempo:3200/api/traces/abc123 | jq '.batches[].scopeSpans[].spans[] | {name, durationNanos: ((.endTimeUnixNano | tonumber) - (.startTimeUnixNano | tonumber))}'

Tempo trace by ID: view the duration of each span in the trace

bash
journalctl -u myapp -o json | jq 'select(.PRIORITY != null and (.PRIORITY | tonumber) <= 3)'

Structured journal logs (priority<=3 = error/critical), filtered through jq

bash
promtool check rules /etc/prometheus/rules/*.yml

Validate Prometheus alerting/recording rules before a reload

bash
echo '{"trace_id":"abc","span_id":"123","msg":"test"}' | jq .

A minimal structured log with trace correlation: always log trace_id

§ see also

  • OpenTelemetry: signals, OTLP, Collector pipeline (opentelemetry). OpenTelemetry is the CNCF standard for metrics, traces, and logs in one SDK. The OTLP protocol runs over gRPC or HTTP. The Collector receives, filters, and routes to Prom/Tempo/Loki/Jaeger. Auto-instrumentation needs no code change.
  • Loki: label-based logs, LogQL, Promtail/Vector pipeline (loki-grafana-logging). Loki is log aggregation with a label-based index, not full-text like Elastic. Cheap on S3 storage. Promtail/Vector are the agents. LogQL resembles PromQL: filter, parse, aggregate. Cardinality is the enemy.
  • Metric types: counter, gauge, histogram, summary (metric-types). Four metric types: counter (up only), gauge (any value), histogram (buckets for p99), summary (quantile in the client). Native histogram (Prom 2.40+) uses sparse buckets and is gentler on memory. Exemplars link a metric to a trace_id.