Why three pillars
A server went down at 3:42 in the morning. What do you have?
- A metric tells you: "error rate jumped from 0.1% to 47% at 03:41". Cheap, in-memory, easy to alert on. But it will not tell you "why".
- A log tells you: `ERROR connection refused at db.example.com:5432 pool exhausted`. A specific event with context. But you have 50 GB of logs per hour, and `grep` is dead.
- A trace shows you: `POST /checkout (4.2s) -> cart-service (12ms) -> payment-service (4.1s) -> stripe-api (timeout)`. You see exactly where the time was lost.
Each pillar answers its own question. With only one, the blind spots are large. In the literature these three signals are called the pillars of observability; the related acronym MELT (metrics, events, logs, traces) adds events as a fourth signal.
Comparison
| Property | Metrics | Logs | Traces |
|---|---|---|---|
| Data type | numeric time series | semi-structured events | tree of spans |
| Cardinality cost | critical (see below) | medium | low |
| Storage cost | cheap (~$0.01/series/mo) | expensive ($0.50/GB) | medium |
| Latency to alert | seconds | minutes | minutes |
| Root-cause depth | shallow | deep | medium |
| Sampling | none (aggregates) | rarely | head/tail usually |
| Tools | Prometheus, VictoriaMetrics | Loki, Elastic, Splunk | Jaeger, Tempo |
When to pick which
Metrics are for everything you need to alert on and graph: request rate, latency p99, CPU, memory, queue depth. Their cardinality is bounded (metric name x label-value combinations), aggregates are cheap, and a year of retention is affordable.
They are not for: "find a specific request", "see the stack trace of an error", "understand the sequence of calls".
Logs are for error context and audit. When a metric alert fires, you go to the logs to see "what exactly happened". Structured JSON logs (`{"level":"error","trace_id":"abc","msg":"..."}`) are better than plain text: you can filter by fields without a regex.
They are not for: "the overall trend" (100x slower than metrics) or broad alerts (latency).
Traces are for distributed systems where a request goes through 5+ services. You see the full path, parent-child calls, and where the latency is. On top of that, trace_id links logs across services (correlation).
They are not for: a monolith (overkill) or the internal logic of functions (use [[pyroscope-continuous-profiling|profiling]]).
Structured logging instead of grep
The old approach:
2024-05-03 14:22:31 ERROR Something failed for user 12345
To find errors for user=12345: grep "user 12345" *.log | grep ERROR.
Slow and imprecise.
A structured log:
{"ts":"2024-05-03T14:22:31Z","level":"error","user_id":"12345","trace_id":"7f3a","msg":"checkout failed","err":"connection refused"}
A query in [[loki-grafana-logging|Loki]] LogQL:
{service="checkout"} | json | level="error" | user_id="12345"
Precise and fast: the indexed `service` label narrows the search to one stream, and the parsed JSON fields filter the rest.
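To show where such a line comes from, here is a minimal sketch of a JSON formatter built on Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the list of context fields are hypothetical names chosen for this illustration; a real service would more likely use an existing structured-logging library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    # Hypothetical context fields copied from `extra=` into the output.
    CONTEXT_FIELDS = ("user_id", "trace_id", "err")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "ts": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.error(
    "checkout failed",
    extra={"user_id": "12345", "trace_id": "7f3a", "err": "connection refused"},
)
line = stream.getvalue().strip()
print(line)
```

Because every field is a JSON key, the `| json | level="error" | user_id="12345"` filters can match on values directly instead of on substrings of free text.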
Cost trade-off
In a large system:
- Metrics: 100K series x 1 sample/15s = ~$200/mo in Prom
- Logs: 1 TB/day of log volume = ~$5K/mo in Loki, ~$30K in Splunk
- Traces: 10K traces/sec with full sampling = OOM. That is why you use sampling (head 1%, tail 100% on errors)
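The sampling policy in the last bullet can be sketched as a keep/drop decision. This is an illustration under assumptions, not a production sampler: `Span`, `head_sampled`, and `keep_trace` are hypothetical names, and a real tail sampler also has to buffer spans until the trace is complete.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float
    error: bool = False


def head_sampled(trace_id: str, rate: float = 0.01) -> bool:
    """Deterministic head decision: hash the trace ID into [0, 1) and compare with the rate."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def keep_trace(trace_id: str, spans: list[Span], rate: float = 0.01) -> bool:
    """Tail decision on a finished trace: always keep errors, otherwise apply the head rate."""
    if any(span.error for span in spans):
        return True
    return head_sampled(trace_id, rate)


healthy = [Span("POST /checkout", 40.0), Span("cart-service", 12.0)]
failing = [Span("POST /checkout", 4200.0), Span("stripe-api", 4100.0, error=True)]

kept_healthy = sum(keep_trace(f"trace-{i}", healthy) for i in range(100_000))
print("healthy traces kept:", kept_healthy, "of 100000")
print("failing trace kept:", keep_trace("trace-err", failing))
```

Hashing the trace ID instead of drawing a random number means every service reaches the same decision for the same trace, so sampled traces are not left with missing spans.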
Cardinality kills metrics fast: add a user_id label
for 1M users and Prometheus goes OOM (cardinality-explosion).
So high-cardinality fields (user_id, request_id) go into logs and
traces, not into metrics.
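The arithmetic behind that warning is a simple product. In the sketch below, `series_count` is a hypothetical helper and the label counts (5 methods, 6 status codes, 50 endpoints) are made-up numbers for illustration; the result is an upper bound, since not every label combination occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label set for a single request-counter metric.
base = {"method": 5, "status": 6, "endpoint": 50}
with_user = {**base, "user_id": 1_000_000}

print(series_count(base))       # 1500
print(series_count(with_user))  # 1500000000
```

One extra label multiplies the series count by its number of distinct values, which is why a single unbounded label is enough to exhaust memory.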
Correlation through trace_id
The lifecycle of a request:
- The frontend generates `trace_id=abc123`
- It passes it to the backend through the [[http2-internals|HTTP `traceparent` header]] (W3C standard)
- The backend writes to logs: `{"trace_id":"abc123",...}`
- The backend creates spans through the [[opentelemetry|OpenTelemetry SDK]]
- An alert on the latency_p99 metric fires
- You open the dashboard, see the slow trace, see the slow span, then open the logs for that span
This is "pillars work together" in practice, not three separate tools. Without
trace_id in the logs, you have to stitch events together by timestamp by hand.
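The propagation step can be sketched with the W3C `traceparent` format, `version-traceid-parentid-flags`, where the trace ID is 32 lowercase hex characters and the parent ID is 16 (so the short `abc123` above is shorthand). The helpers `new_traceparent` and `parse_traceparent` are hypothetical and handle only version `00`; in practice the OpenTelemetry SDK does this for you.

```python
import json
import re
import secrets

# Version 00 of the W3C traceparent header: 00-<trace-id>-<parent-id>-<flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Create a header value at the edge (what the frontend would send)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Return the trace context as a dict, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Backend side: read the incoming header and attach trace_id to the log line.
incoming = new_traceparent()
ctx = parse_traceparent(incoming)
log_line = json.dumps({"level": "error", "trace_id": ctx["trace_id"], "msg": "checkout failed"})
print(log_line)
```

With the same `trace_id` in the header, the spans, and the log line, a log query filtered by that ID returns the events of one request across all services.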
When things go wrong
- Metrics say "all OK", but users complain: the alerts are on the average, not on the [[metric-types|p99 histogram]]. Average latency = 100ms, p99 = 5s.
- Cardinality explosion in Prometheus after labeling by `user_id`: Prom goes OOM, and there are no metrics at all. Remove the label, see cardinality-explosion.
- Logs without trace_id: you cannot link events from 3 services. Add `trace_id` to the logger context right at the edge.
- Trace sampling at 1% hides rare errors: switch to tail-based sampling and keep 100% of the traces in which a span errors.
- Logs grow 10x after a deploy: someone added `log.info("entered function")` to a hot path. Remove it or raise the log-level threshold.
- An alert on `up == 0` did not fire: Prometheus itself was down, so nothing evaluated the rule. Monitoring that depends on the monitor is a blind spot. You need a deadman alert (alerting-rules-alertmanager).
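The first item in this list, the average hiding the tail, is easy to reproduce with numbers. The sketch below uses a made-up latency sample (985 fast requests and 15 slow ones) and a hypothetical nearest-rank `percentile` helper; Prometheus itself estimates quantiles from histogram buckets rather than from raw samples.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98.5% of requests take 25 ms, 1.5% take 5 s.
latencies_ms = [25.0] * 985 + [5000.0] * 15

mean_ms = statistics.fmean(latencies_ms)
p99_ms = percentile(latencies_ms, 99)
print(f"mean is about {mean_ms:.0f} ms, p99 is {p99_ms:.0f} ms")
```

An alert on the mean sees roughly 100 ms and stays quiet, while 15 out of every 1000 users wait five seconds, which is exactly the situation an alert on p99 would catch.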
Evolution: OpenTelemetry unifies
It used to be 3 different SDKs (Prometheus client, Jaeger client, log lib), 3 different pipelines, 3 formats.
[[opentelemetry|OpenTelemetry]] combines all three signals into one SDK and the OTLP protocol. One collector takes in metrics, logs, and traces, and routes them to Prom/Loki/Tempo. It reduces coupling and simplifies migration between backends.