Metrics vs logs vs traces: the three pillars of observability

Why three pillars

A server went down at 3:42 in the morning. What do you have?

A metric tells you: "error rate jumped from 0.1% to 47% at 03:41". Cheap, in-memory, easy to alert on. But it will not tell you "why".
A log tells you: ERROR connection refused at db.example.com:5432 pool exhausted. A specific event with context. But you have 50 GB of logs per hour, and grep is dead.
A trace shows you: POST /checkout (4.2s) -> cart-service (12ms) -> payment-service (4.1s) -> stripe-api (timeout). You see exactly where the time was lost.

Each pillar answers its own question. With only one, the blind spots are large. In the literature this triad is called the three pillars of observability; the related acronym MELT (metrics, events, logs, traces) adds events as a fourth signal.

Comparison

Property	Metrics	Logs	Traces
Data type	numeric time series	semi-structured events	trees of spans
Cardinality cost	critical (see below)	medium	low
Storage cost	cheap (~$0.01/series/mo)	expensive ($0.50/GB)	medium
Latency to alert	seconds	minutes	minutes
Root-cause depth	shallow	deep	medium
Sampling	none (aggregates)	rarely	head/tail usually
Tools	Prometheus, VictoriaMetrics	Loki, Elastic, Splunk	Jaeger, Tempo

When to pick which

Metrics are for everything you need to alert on and graph: request rate, p99 latency, CPU, memory, queue depth. Their cardinality is bounded (metric name x label value combinations), aggregation is cheap, and a year of retention is affordable.

They are not for: "find a specific request", "see the stack trace of an error", "understand the sequence of calls".

Logs are for error context and audit. When a metric triggers, you go to the logs to see "what exactly happened". Structured JSON logs ({"level":"error","trace_id":"abc","msg":"..."}) are better than plain text: you can filter by fields without a regex.

They are not for: "the overall trend" (aggregating raw logs is roughly 100x slower than querying metrics) or broad alerts such as latency.

Traces are for distributed systems where a request goes through 5+ services. You see the full path, parent-child calls, and where the latency is. On top of that, trace_id links logs across services (correlation).

They are not for: a monolith (overkill) or the internal logic of functions (use [[pyroscope-continuous-profiling|profiling]]).
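
As a rough illustration of "where the latency is", the sketch below models the checkout trace from the introduction as a tree of spans and computes each span's self time (its duration minus the time spent in its children). The `Span` class, the helper names, and the 4000 ms duration for stripe-api are assumptions made for this example; they are not part of any tracing library, and the calculation assumes child spans run sequentially.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """A hypothetical, minimal span: name, duration, and child spans."""
    name: str
    duration_ms: float
    children: list[Span] = field(default_factory=list)


def self_time_ms(span: Span) -> float:
    """Time spent in the span itself, excluding its (sequential) children."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def walk(span: Span):
    """Yield every span in the tree, parents before children."""
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_by_self_time(root: Span) -> Span:
    """Return the span in which the most time was actually spent."""
    return max(walk(root), key=self_time_ms)


# The trace from the introduction; stripe-api's 4000 ms is an assumed value.
trace = Span("POST /checkout", 4200, [
    Span("cart-service", 12),
    Span("payment-service", 4100, [Span("stripe-api", 4000)]),
])

culprit = slowest_by_self_time(trace)
print(culprit.name, self_time_ms(culprit))
```

Ranking by self time rather than total duration keeps the root span from always "winning": POST /checkout took 4.2 s in total, but almost all of it was spent waiting on a child.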

Structured logging instead of grep

The old approach:

2024-05-03 14:22:31 ERROR Something failed for user 12345

To find errors for user=12345: grep "user 12345" *.log | grep ERROR. Slow, because it scans every file, and imprecise, because it also matches user 123456.

A structured log:

{"ts":"2024-05-03T14:22:31Z","level":"error","user_id":"12345",
 "trace_id":"7f3a","msg":"checkout failed","err":"connection refused"}

A query in [[loki-grafana-logging|Loki]] LogQL:

{service="checkout"} | json | level="error" | user_id="12345"

Precise and fast: Loki uses the indexed service label to select the streams, then filters the parsed JSON fields at query time.
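
To make the contrast with grep concrete, here is a minimal sketch of emitting logs in that shape with Python's standard `logging` module and then filtering them by field. The `JsonFormatter` class, the `CONTEXT_FIELDS` tuple, and the `matching` helper are hypothetical names invented for the example; a real service would more likely use an existing JSON logging library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Assumed set of context fields, copied from `extra=` when present.
    CONTEXT_FIELDS = ("user_id", "trace_id", "err")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def matching(lines, **fields):
    """Keep lines whose parsed JSON has exactly the given field values."""
    return [
        line for line in lines
        if all(json.loads(line).get(k) == v for k, v in fields.items())
    ]


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout-example")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

log.error("checkout failed",
          extra={"user_id": "12345", "trace_id": "7f3a", "err": "connection refused"})
log.error("checkout failed", extra={"user_id": "123456", "trace_id": "9b21"})
log.info("checkout ok", extra={"user_id": "12345", "trace_id": "c0de"})

lines = stream.getvalue().splitlines()
hits = matching(lines, level="error", user_id="12345")
print(hits)
```

The field comparison is exact, so user 123456 is not returned, which is the case a substring search gets wrong.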

Cost trade-off

In a large system:

Metrics: 100K series x 1 sample/15s = ~$200/mo in Prom
Logs: 1 TB/day of log volume = ~$5K/mo in Loki, ~$30K in Splunk
Traces: 10K traces/sec with every trace kept (100% sampling) = OOM. That is why you sample: head-based at 1%, plus tail-based at 100% for traces with errors.
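
The sampling policy in the last item can be sketched as two decisions: a head decision made from the trace_id alone, and a tail decision made once all spans are known. This is only an illustration under assumptions; the function names, the `"error"` status string, and the hash-based bucketing are choices made for the example, not the behavior of any particular collector.

```python
import hashlib
from typing import Iterable

HEAD_RATE = 0.01  # keep 1% of traces regardless of outcome


def head_sampled(trace_id: str, rate: float = HEAD_RATE) -> bool:
    """Deterministic head decision: hash the trace_id into [0, 1)."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def keep_trace(trace_id: str, span_statuses: Iterable[str]) -> bool:
    """Tail decision, made after the trace completes: always keep errors."""
    if any(status == "error" for status in span_statuses):
        return True
    return head_sampled(trace_id)


# A failed checkout is kept no matter what the head decision would have been.
print(keep_trace("abc123", ["ok", "ok", "error"]))
```

Hashing the trace_id instead of calling a random generator means every service that sees the same trace_id reaches the same head decision, so traces are kept or dropped as a whole.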

Cardinality kills metrics fast: add a user_id label for 1M users and Prometheus goes OOM (cardinality-explosion). So high-cardinality fields (user_id, request_id) go into logs and traces, not into metrics.
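
The arithmetic behind this is a product of label value counts. The sketch below uses made-up label counts (chosen so the baseline matches the 100K series above) to show what one user_id label does; the result is a worst-case upper bound, since not every combination necessarily occurs.

```python
from math import prod
from typing import Mapping


def series_count(label_values: Mapping[str, int]) -> int:
    """Worst-case number of time series for one metric name."""
    return prod(label_values.values())


# Hypothetical label cardinalities for a single request metric.
base = {"service": 20, "endpoint": 50, "status": 5, "instance": 20}
with_user = {**base, "user_id": 1_000_000}

print(series_count(base))       # 100000
print(series_count(with_user))  # 100000000000
```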

Correlation through trace_id

The lifecycle of a request:

The frontend generates trace_id=abc123
It passes it to the backend in the [[http2-internals|HTTP traceparent header]] (W3C Trace Context standard)
The backend writes to logs: {"trace_id":"abc123",...}
The backend creates spans through the [[opentelemetry|OpenTelemetry SDK]]
An alert on the latency_p99 metric fires
You open the dashboard, see the slow trace, see the slow span, then open the logs for that span

This is what "the pillars work together" means in practice: one investigation path, not three separate tools. Without trace_id in the logs, you have to stitch events together by timestamp by hand.
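
For reference, a traceparent value has four dash-separated fields: version, a 32-hex-digit trace id, a 16-hex-digit parent span id, and flags (so `abc123` above is shorthand for a longer id). The sketch below builds and parses such a header by hand and attaches the trace id to a log line. The function names are hypothetical, only version `00` is handled, and in a real service the OpenTelemetry SDK would normally do this propagation for you.

```python
import json
import re
import secrets
from typing import Optional

# version 00 - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a header value for a fresh trace, as the frontend would."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header's fields, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero ids are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Backend side: read the incoming header and put trace_id into the log line.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
log_line = json.dumps({"level": "error", "trace_id": ctx["trace_id"],
                       "msg": "checkout failed"})
print(log_line)
```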

When things go wrong

Metrics say "all OK", but users complain: the alerts are on the average, not on the [[metric-types|p99 histogram]]. Average latency = 100ms, p99 = 5s.
Cardinality explosion in Prometheus after labeling by user_id: Prom goes OOM, and there are no metrics at all. Remove the label, see cardinality-explosion.
Logs without trace_id: you cannot link events from 3 services. Add trace_id to the logger context right at the edge.
Trace sampling at 1% hides rare errors: switch to tail-based sampling and keep 100% of traces in which any span errors.
Logs grow 10x after a deploy: someone added log.info("entered function") to a hot path. Remove it or raise the threshold.
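An alert on up == 0 did not fire: Prometheus itself was down or could not evaluate its rules, so nothing was left to raise the alert. Monitoring that depends on its own pipeline is a blind spot. You need a deadman alert: an always-firing alert whose absence is detected by an external system (alerting-rules-alertmanager).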
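
The first failure mode in this list, averages hiding the tail, is easy to reproduce with a few lines of arithmetic. The latency values and the 500 ms threshold below are invented so that the numbers land close to the 100 ms average and 5 s p99 mentioned there, and `percentile` is a plain nearest-rank calculation over raw samples, not how Prometheus estimates quantiles from histogram buckets.

```python
import math
from statistics import mean


def percentile(values, q):
    """Nearest-rank percentile of raw samples, q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# 1000 requests: 988 fast ones and 12 that hit a 5 s timeout (made-up data).
latencies_ms = [40] * 988 + [5000] * 12

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

THRESHOLD_MS = 500  # hypothetical alert threshold
print(f"avg={avg} ms, alert={avg > THRESHOLD_MS}")
print(f"p99={p99} ms, alert={p99 > THRESHOLD_MS}")
```

An alert on the average stays silent here, while the same threshold on p99 fires, even though more than 1% of users waited five seconds.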

Evolution: OpenTelemetry unifies

It used to be 3 different SDKs (Prometheus client, Jaeger client, log lib), 3 different pipelines, 3 formats.

[[opentelemetry|OpenTelemetry]] combines all three signals into one SDK and the OTLP protocol. One collector takes in metrics, logs, and traces, and routes them to Prom/Loki/Tempo. It reduces coupling and simplifies migration between backends.
