Why Loki
Elasticsearch does full-text search over every field of a log. Every word is indexed. The cost:
- 1 TB/day of logs = ~30 GB/day of RAM on the heap (3-5x source size)
- $30K/month for an ES cluster vs $5K/month for S3-Loki
- Indexing latency becomes a real pain above 100K events/s
Loki (Grafana Labs, 2018) flipped the approach:
- Only labels are indexed (as in [[prometheus-basics|Prometheus]])
- The log payload itself is stored as compressed chunks on S3-compatible storage
- A search means "pick streams by labels" plus "grep through chunks"
- Cheap, with almost unlimited scale
Trade-off: full-text search is slower, but 95% of queries in
observability look like {service=X, level=error} |= "timeout", which Loki
handles quickly.
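The "pick streams by labels, then grep through chunks" idea can be sketched in a few lines. This is a toy in-memory model, not Loki's implementation: the `STREAMS` layout and the helper names `select_streams` and `grep` are made up for illustration.

```python
# Toy model of the read path: a small label index plus brute-force line filtering.
# STREAMS, select_streams and grep are illustrative names, not Loki APIs.

STREAMS = {
    # label set (the only thing that is "indexed") -> log lines (the "chunks")
    frozenset({("service", "api"), ("level", "error")}): [
        "upstream timeout after 30s",
        "connection refused by db-1",
    ],
    frozenset({("service", "api"), ("level", "info")}): [
        "GET /health 200",
    ],
    frozenset({("service", "worker"), ("level", "error")}): [
        "job 17 timeout",
    ],
}


def select_streams(matchers: dict[str, str]) -> list[frozenset]:
    """Step 1: pick streams whose labels contain every matcher (the index lookup)."""
    wanted = set(matchers.items())
    return [labels for labels in STREAMS if wanted <= labels]


def grep(matchers: dict[str, str], needle: str) -> list[str]:
    """Step 2: scan only the selected streams' lines for a substring."""
    return [
        line
        for labels in select_streams(matchers)
        for line in STREAMS[labels]
        if needle in line
    ]


# Equivalent in spirit to: {service="api", level="error"} |= "timeout"
print(grep({"service": "api", "level": "error"}, "timeout"))
```

The narrower the label selector, the fewer lines step 2 has to scan, which is why a selective stream selector matters more in Loki than the text filter itself.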
Architecture
┌─────────┐    push    ┌──────────┐   write    ┌─────────┐
│ Promtail│ ─────────► │   Loki   │ ─────────► │   S3    │
│ (agent) │            │ (server) │            │ (chunks)│
└─────────┘            └──────────┘            └─────────┘
┌─────────┐    push       ▲    ▲
│ Vector  │ ──────────────┘    │
└─────────┘                    │
     ▲                         │
     │ tail files /            │ read
     │ journald / docker       │
┌────┴────┐              ┌─────┴────┐
│ /var/log│              │ Grafana  │
└─────────┘              └──────────┘
                         LogQL query
Loki components in a cluster:
- distributor receives the push and does the hashing
- ingester buffers chunks in RAM and flushes to S3 every 10-30 min
- querier reads chunks from S3 plus the ingester for recent data
- query-frontend splits large queries and caches
- compactor merges indexes and enforces retention
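The distributor's hashing step can be illustrated with a simplified sketch. It assumes a plain modulo over a fixed list of ingesters; a real Loki cluster uses a hash ring with tokens and a replication factor, so treat this only as a picture of why one stream always lands on the same ingester. The instance names are hypothetical.

```python
import hashlib

INGESTERS = ["ingester-0", "ingester-1", "ingester-2"]  # hypothetical instance names


def stream_key(tenant: str, labels: dict[str, str]) -> str:
    """Canonical key: tenant plus labels sorted by name, so label order never matters."""
    body = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{tenant}/{{{body}}}"


def pick_ingester(tenant: str, labels: dict[str, str]) -> str:
    """Map a stream to one ingester by hashing its key (simplified: plain modulo)."""
    digest = hashlib.sha256(stream_key(tenant, labels).encode()).digest()
    return INGESTERS[int.from_bytes(digest[:8], "big") % len(INGESTERS)]


a = pick_ingester("team-a", {"service": "api", "env": "prod"})
b = pick_ingester("team-a", {"env": "prod", "service": "api"})
print(a, a == b)  # the same stream always maps to the same ingester
```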
Stream and labels
A stream in Loki is a unique combination of labels:
{service="api", env="prod", host="node-12", level="info"}
Each stream is stored as its own series of chunks on S3. Inside a stream, log lines are ordered by time.
Like [[metric-types|Prometheus series]], cardinality is the product of the unique values of each label. More than 10K active streams in one tenant cause degradation. Dangerous labels:
- request_id (millions): never
- user_id: never, write it into the payload
- pod_name (k8s): can be thousands, OK with retention
- host: tens to hundreds, OK
- level: 4-5 values, ideal
Rule: labels have low cardinality, everything else goes into the log line.
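The product rule is easy to check with a quick calculation. The label counts below are invented for illustration; the result is an upper bound, since not every combination of values occurs in practice.

```python
from math import prod


def max_streams(label_values: dict[str, int]) -> int:
    """Upper bound on stream count: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label cardinalities for one tenant.
safe = {"service": 10, "env": 3, "host": 40, "level": 5}
risky = dict(safe, request_id=1_000_000)

print(max_streams(safe))   # 6000
print(max_streams(risky))  # 6000000000
```

One high-cardinality label multiplies everything else, which is why a single request_id label is enough to blow past any stream limit.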
LogQL, the query language
PromQL-like, but on logs.
Stream selector plus line filter:
{service="api", env="prod"} |= "error"
{service="api"} |~ "timeout|refused" # regex
{service="api"} != "healthcheck" # exclude
{service=~"api.*"} | json | level="error" # parse JSON
Operators:
| Op | What it does |
|---|---|
| \|= | line contains substring |
| != | line does not contain substring |
| \|~ | line matches regex |
| !~ | line does not match regex |
Parsers (after |):
- json: parses JSON, fields become accessible as level, user_id, etc.
- logfmt: for key=value logs
- regexp: | regexp "(?P<status>\d+)"
- pattern: | pattern "<_> [<level>] <msg>"
- unpack: for entries packed by the Promtail pack stage
Metrics from logs (Loki as a time series):
rate({service="api"} |= "error" [5m]) # matching log lines per second
sum by (status)(count_over_time({service="api"} | json [1m]))
This is cheap to set up: Loki computes it on the fly from the chunks, without an index. It is used in alerting when a metric is missing ([[alerting-rules-alertmanager]]).
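What count_over_time and rate compute over a log range can be mimicked on a plain list of timestamps. This is a sketch of the arithmetic only, with invented sample data; it assumes the window covers (now - window, now] and ignores how Loki steps through a query range.

```python
def count_over_time(timestamps: list[float], now: float, window_s: float) -> int:
    """Number of matching log lines with a timestamp in (now - window_s, now]."""
    return sum(1 for t in timestamps if now - window_s < t <= now)


def rate(timestamps: list[float], now: float, window_s: float) -> float:
    """Matching log lines per second over the window."""
    return count_over_time(timestamps, now, window_s) / window_s


# Timestamps (seconds) of lines that matched |= "error"; sample data.
error_times = [10.0, 70.0, 130.0, 250.0, 290.0, 299.0]

print(count_over_time(error_times, now=300.0, window_s=60.0))  # 3
print(rate(error_times, now=300.0, window_s=300.0))            # 0.02
```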
Promtail, the Loki-native agent
It discovers log files, parses them, adds labels, and pushes:
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: stream
    pipeline_stages:
      - cri: {}
      - json:
          expressions: {level: level, msg: msg, trace_id: trace_id}
      - labels:
          level:
      - structured_metadata:
          trace_id:
pipeline_stages transform the log line. structured_metadata
(Loki 2.9+) holds fields that carry no cardinality cost (trace_id, request_id):
they are searchable but not indexed as labels. This solves the problem of
high-cardinality identifiers.
Vector, the alternative agent
Vector (Datadog, open source) has a more capable pipeline:
[sources.in]
type = "kubernetes_logs"
[transforms.parse]
type = "remap"
inputs = ["in"]
source = '''
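# Merge parsed JSON fields into the event so the kubernetes.* metadata survives;
# if .message is not a JSON object, the event is left as it is.
. = merge(., object(parse_json(.message) ?? {}) ?? {})
.level = downcase(string(.level) ?? "unknown")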
'''
[sinks.loki]
type = "loki"
inputs = ["parse"]
endpoint = "http://loki:3100"
encoding.codec = "json" # the loki sink needs an explicit encoding
labels = {service = "{{ kubernetes.container_name }}", level = "{{ level }}"}
remove_label_fields = true
What Vector adds:
- Multi-sink: Loki plus S3 plus Kafka at the same time
- VRL (Vector Remap Language), a JS-like language for parsing
- Backpressure handling: a disk-buffered queue
- Sampling and filtering before sending
Use Vector when the pipeline is complex or you need several backends.
Retention and storage
Loki cost is almost 100% S3 storage. The math:
- 1 GB/day of logs → compressed ~100-200 MB chunks
- 90d retention → ~15 GB on S3 → $0.35/month (S3 standard)
- Index is ~5% of chunks: $0.02/month
Total: less than $1/month for ~100 GB of logs. Compare with Datadog ($1.27/GB/month).
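The arithmetic above can be reproduced with a small calculator. It assumes an S3 Standard price of $0.023 per GB-month (list prices vary by region and tier) and a compression ratio of 0.15, the midpoint of the 100-200 MB range; request and compute costs are ignored.

```python
S3_STANDARD_USD_PER_GB_MONTH = 0.023  # assumed list price; varies by region and tier


def monthly_storage_cost(raw_gb_per_day: float, retention_days: int,
                         compression_ratio: float = 0.15,
                         index_fraction: float = 0.05,
                         price: float = S3_STANDARD_USD_PER_GB_MONTH) -> dict[str, float]:
    """Steady-state S3 footprint and monthly cost for chunks plus index."""
    chunks_gb = raw_gb_per_day * compression_ratio * retention_days
    index_gb = chunks_gb * index_fraction
    return {
        "chunks_gb": chunks_gb,
        "chunks_usd": chunks_gb * price,
        "index_usd": index_gb * price,
        "total_usd": (chunks_gb + index_gb) * price,
    }


cost = monthly_storage_cost(raw_gb_per_day=1, retention_days=90)
print({k: round(v, 3) for k, v in cost.items()})
```

With these inputs the footprint comes out slightly under the rounded ~15 GB used above, and the total stays well below $1/month.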
Retention config:
limits_config:
  retention_period: 90d
compactor:
  retention_enabled: true
  retention_delete_delay: 2h
Sizing rules of thumb
- 1 TB/day ingest = 3 ingesters + 2 queriers + S3
- Ingester RAM ≈ chunks_in_flight × 1.5 MB
- Compactor: 1-2 vCPU, lightly loaded
- Index lookup in the querier is fast, the bottleneck is usually chunk-fetch
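The ingester RAM rule of thumb can be turned into a rough estimate. The sketch assumes one open chunk per active stream and that each stream is replicated to several ingesters; both are simplifications, and the stream count, replication factor, and ingester count below are example inputs, not recommendations.

```python
def chunks_per_ingester(active_streams: int, replication_factor: int, ingesters: int) -> float:
    """Assumption: one open chunk per stream, copies spread evenly across ingesters."""
    return active_streams * replication_factor / ingesters


def ingester_ram_gb(chunks_in_flight: float, mb_per_chunk: float = 1.5) -> float:
    """Rule of thumb from the list above: RAM ≈ chunks_in_flight × 1.5 MB."""
    return chunks_in_flight * mb_per_chunk / 1024


chunks = chunks_per_ingester(active_streams=9_000, replication_factor=3, ingesters=3)
print(f"{chunks:.0f} chunks in flight -> ~{ingester_ram_gb(chunks):.1f} GB RAM per ingester")
```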
Loki vs Elastic vs ClickHouse
| Criterion | Loki | Elastic | ClickHouse |
|---|---|---|---|
| Index | label-only | full-text | columnar |
| Storage | S3 (cheap) | local SSD | local/S3 |
| Cost @ 1TB/day | ~$5K/month | ~$30K/month | ~$10K/month |
| Full-text speed | medium | very fast | fast (with skip-index) |
| Aggregations | LogQL metrics | aggregations API | SQL |
| Multi-tenancy | yes | via index | via DB |
ClickHouse-based options such as SigNoz, and object-storage search engines such as Quickwit (built on Tantivy, not ClickHouse), are a compromise: cheaper than Elastic, faster than Loki on full-text. They are growing in popularity.
When things go wrong
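- Cardinality explosion: tens of thousands of streams. The ingester metric loki_ingester_memory_streams shows active streams (loki-canary only checks that pushed logs come back in queries). Remove dynamic labels. See [[cardinality-explosion]].
- Logs are not arriving: check the Promtail logs (journalctl -u promtail) for auth failures, network errors, or a full disk in /tmp.
- "too many outstanding requests": the query frontend rate-limited you. Narrow the time range and add a label selector.
- Oversized lines rejected: a log line is larger than max_line_size (default 256 KB). Truncate in the agent or raise the limit. Note that "entry too far behind" is a different rejection: the entry's timestamp is older than the window Loki accepts for that stream.
- A search returns 0 results even though logs exist: wrong tenant header (X-Scope-OrgID), or the label selector does not match. Check with a broad selector such as {job=~".+"} (__path__ is an agent-internal label and is not stored in Loki).
- Loki OOM on the ingester: chunks_per_user_per_target exceeded. Reduce the flush interval so chunks leave memory sooner, or shorten how long chunks are retained in memory after a flush.
- Promtail falls behind the logs: disk I/O on read, or k8s pod logs rotated before they were read. Use Vector with a persistent buffer.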
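When a search comes back empty, a quick way to rule out the tenant header is to build the query request by hand and inspect it. The sketch below only constructs the request for Loki's `GET /loki/api/v1/query_range` endpoint and sends nothing; the base URL, tenant name, and helper name `build_query_request` are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_query_request(base_url: str, tenant: str, logql: str,
                        start_ns: int, end_ns: int, limit: int = 100) -> Request:
    """Prepare a range query; X-Scope-OrgID selects the tenant in multi-tenant mode."""
    params = urlencode({"query": logql, "start": start_ns, "end": end_ns, "limit": limit})
    return Request(
        f"{base_url}/loki/api/v1/query_range?{params}",
        headers={"X-Scope-OrgID": tenant},
    )


req = build_query_request(
    "http://loki:3100",          # placeholder address
    tenant="team-a",             # placeholder tenant ID
    logql='{service="api"} |= "timeout"',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_300_000_000_000,
)
print(req.full_url)
print(req.header_items())
# To actually send it: urllib.request.urlopen(req) against a reachable Loki.
```

Comparing the tenant value here with the one the agent pushes under usually settles whether the data is missing or just filed under another tenant.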