Loki: label-based logs, LogQL, Promtail/Vector pipeline

Why Loki

Elasticsearch does full-text search over every field of a log. Every word is indexed. The cost:

1 TB/day of logs ≈ ~30 GB of heap RAM, with indexed data at 3-5x the source size
$30K/month for an ES cluster vs $5K/month for S3-Loki
Indexing latency is real pain above 100K events/s

Loki (Grafana Labs, 2018) flipped the approach:

Only labels are indexed (as in [[prometheus-basics|Prometheus]])
The log payload itself is stored as compressed chunks on S3-compatible storage
A search means "pick streams by labels" plus "grep through chunks"
Cheap, with almost unlimited scale

Trade-off: full-text search is slower, but 95% of queries in observability look like {service=X, level=error} |= "timeout", which Loki handles quickly.

Architecture

┌─────────┐  push      ┌──────────┐  write     ┌─────────┐
│ Promtail│ ─────────► │  Loki    │ ─────────► │   S3    │
│ (agent) │            │ (server) │            │ (chunks)│
└─────────┘            └──────────┘            └─────────┘
┌─────────┐  push           ▲                       ▲
│  Vector │ ────────────────┘                       │
└─────────┘                                         │
     ▲                                              │
     │ tail files / journald / docker         read  │
     │                                              │
┌────┴────┐                                  ┌──────┴───┐
│ /var/log│                              ◄── │  Grafana │
└─────────┘                                  └──────────┘
                                               LogQL query

Loki components in a cluster:

distributor receives the push and does the hashing
ingester buffers chunks in RAM and flushes to S3 every 10-30 min
querier reads chunks from S3 plus the ingester for recent data
query-frontend splits large queries and caches
compactor merges indexes and enforces retention
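
The distributor's hashing step can be pictured with a toy sketch. This is not Loki's actual ring (which uses tokens and replication); the helper names `stream_key` and `pick_ingester` and the ingester names are hypothetical, chosen only to show that the same label set always lands on the same ingester.

```python
import hashlib


def stream_key(tenant: str, labels: dict[str, str]) -> str:
    """Canonical key: tenant plus labels sorted by name, so label order does not matter."""
    pairs = ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{tenant}/{{{pairs}}}"


def pick_ingester(key: str, ingesters: list[str]) -> str:
    """Toy placement: hash the key and take it modulo the ingester count.

    A real ring uses tokens and replication; this only demonstrates determinism.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return ingesters[int.from_bytes(digest[:8], "big") % len(ingesters)]


if __name__ == "__main__":
    ingesters = ["ingester-0", "ingester-1", "ingester-2"]  # hypothetical names
    key = stream_key("tenant-a", {"service": "api", "env": "prod"})
    print(key, "->", pick_ingester(key, ingesters))
```

Sorting the labels before hashing is the important part: two agents that send the same labels in a different order must still hit the same stream.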

Stream and labels

A stream in Loki is a unique combination of labels:

{service="api", env="prod", host="node-12", level="info"}

Each stream is stored as its own sequence of compressed chunks on S3. Inside a stream, log lines are ordered by time.

Like [[metric-types|Prometheus series]], the number of streams is bounded by the product of the unique values of each label. More than 10K active streams in one tenant causes degradation. Label candidates, from dangerous to safe:

request_id (millions), never
user_id, never, write it into the payload
pod_name (k8s), can be thousands, OK with retention
host, tens to hundreds, OK
level, 4-5 values, ideal

Rule: labels have low cardinality, everything else goes into the log line.
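
The product rule is easy to check with a few lines of arithmetic. A minimal sketch: the per-label value counts are made-up illustrations, and the 10K budget is the degradation threshold quoted above.

```python
from math import prod

STREAM_BUDGET = 10_000  # per-tenant threshold quoted in this note


def max_streams(label_values: dict[str, int]) -> int:
    """Upper bound on stream count: product of distinct values per label."""
    return prod(label_values.values())


def over_budget(label_values: dict[str, int], budget: int = STREAM_BUDGET) -> bool:
    """True when the worst-case stream count exceeds the budget."""
    return max_streams(label_values) > budget


if __name__ == "__main__":
    safe = {"service": 20, "env": 3, "level": 5}  # illustrative counts
    risky = {**safe, "user_id": 100_000}          # one high-cardinality label
    print(max_streams(safe), over_budget(safe))    # 300 False
    print(max_streams(risky), over_budget(risky))  # 30000000 True
```

The bound is a worst case (not every combination occurs in practice), but a single identifier-like label is enough to blow through it.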

LogQL, the query language

PromQL-like, but on logs.

Stream selector plus line filter:

{service="api", env="prod"} |= "error"
{service="api"} |~ "timeout|refused"          # regex
{service="api"} != "healthcheck"              # exclude
{service=~"api.*"} | json | level="error"     # parse JSON

Operators:

Op	What it does
`|=`	line contains substring
`!=`	line does not contain substring
`|~`	line matches regex
`!~`	line does not match regex

Parsers (after |):

json parses JSON, fields become accessible as level, user_id, etc
logfmt, for key=value logs
regexp, | regexp `(?P<status>\d+)` (backticks avoid double-escaping the backslash)
pattern, | pattern "<_> [<level>] <msg>"
unpack, for entries packed by the Promtail pack stage

Metrics from logs (Loki as a time series):

rate({service="api"} |= "error" [5m])              # matching log lines per second
sum by (status)(count_over_time({service="api"} | json [1m]))

Loki computes these at query time by scanning the selected chunks, so no extra index is needed and the cost scales with the volume scanned. This is useful for alerting when no metric exists (see alerting-rules-alertmanager).
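
To run such queries outside Grafana, the HTTP API can be called directly. A minimal sketch, assuming the `/loki/api/v1/query_range` endpoint and the `http://loki:3100` address used elsewhere in this note; the helper names are hypothetical, and `rate_per_second` only restates what `rate(... [5m])` means (matching lines divided by the range length).

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, logql: str, start_ns: int, end_ns: int,
                    limit: int = 100) -> str:
    """Build a query_range URL; timestamps are Unix epoch nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


def rate_per_second(matching_lines: int, range_seconds: float) -> float:
    """What rate() reports: matching log lines per second over the range."""
    return matching_lines / range_seconds


if __name__ == "__main__":
    end_ns = 1_700_000_300 * 10**9   # arbitrary example timestamps
    start_ns = end_ns - 300 * 10**9  # 5 minutes earlier
    print(query_range_url("http://loki:3100", '{service="api"} |= "error"',
                          start_ns, end_ns))
    print(rate_per_second(150, 300))  # 150 matches in 5 minutes
```

In a multi-tenant setup the request would also need the X-Scope-OrgID header, otherwise the query may come back empty.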

Promtail, the Loki-native agent

It discovers log files, parses them, adds labels, and pushes:

yaml
scrape_configs:
  # Static file tailing: every *.log under /var/log gets job="varlogs"
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
  # Docker service discovery: lines are read through the Docker API
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Container names arrive as "/name"; strip the leading slash
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: stream
    pipeline_stages:
      # cri stage dropped: it targets CRI-format log files, not Docker API output
      - json:
          expressions: {level: level, msg: msg, trace_id: trace_id}
      - labels:
          level:
      - structured_metadata:
          trace_id:

pipeline_stages transform the log line before it is pushed. structured_metadata (Loki 2.9+) attaches fields such as trace_id or request_id to each entry without creating new streams: they are searchable but not indexed as labels. This solves the problem of high-cardinality identifiers.
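
What the agent ends up sending can be sketched as a push payload. This is a rough illustration, assuming the JSON body format of `/loki/api/v1/push` where structured metadata travels as an optional third element of each entry; `build_push_payload` is a hypothetical helper, not part of Promtail.

```python
import json


def build_push_payload(raw_line: str, static_labels: dict[str, str], ts_ns: int) -> dict:
    """Mimic the json -> labels -> structured_metadata stages for one line."""
    try:
        fields = json.loads(raw_line)
    except json.JSONDecodeError:
        fields = {}
    if not isinstance(fields, dict):
        fields = {}  # JSON arrays or scalars carry no named fields

    labels = dict(static_labels)
    if "level" in fields:
        labels["level"] = str(fields["level"])  # low cardinality: becomes a label

    metadata = {}
    if "trace_id" in fields:
        metadata["trace_id"] = str(fields["trace_id"])  # high cardinality: stays off the index

    entry = [str(ts_ns), raw_line]  # timestamp is a string of epoch nanoseconds
    if metadata:
        entry.append(metadata)
    return {"streams": [{"stream": labels, "values": [entry]}]}


if __name__ == "__main__":
    line = '{"level": "error", "msg": "upstream timeout", "trace_id": "abc123"}'
    payload = build_push_payload(line, {"service": "api"}, 1_700_000_000 * 10**9)
    print(json.dumps(payload, indent=2))
```

The stream identity here is only {service, level}; the trace_id rides along with the entry and never multiplies the stream count.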

Vector, the alternative agent

Vector (Datadog, open source) has a more capable pipeline:

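toml
[sources.in]
type = "kubernetes_logs"

[transforms.parse]
type = "remap"
inputs = ["in"]
source = '''
# Merge parsed JSON into the event so the kubernetes.* metadata survives;
# lines that are not JSON objects pass through unchanged.
parsed, err = parse_json(.message)
if err == null && is_object(parsed) {
  . = merge(., object!(parsed))
}
.level = downcase(string(.level) ?? "unknown")
'''

[sinks.loki]
type = "loki"
inputs = ["parse"]
endpoint = "http://loki:3100"
# The loki sink needs an explicit codec for the log line
encoding.codec = "json"
labels = {service = "{{ kubernetes.container_name }}", level = "{{ level }}"}
remove_label_fields = true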

Vector can:

Multi-sink: Loki plus S3 plus Kafka at the same time
VRL (Vector Remap Language), a JS-like language for parsing
Backpressure handling: a disk-buffered queue
Sampling and filtering before sending

Use Vector when the pipeline is complex or you need several backends.

Retention and storage

Loki cost is almost 100% S3 storage. The math:

1 GB/day of logs → compressed ~100-200 MB chunks
90d retention → ~15 GB on S3 → $0.35/month (S3 standard)
Index is ~5% of chunks: $0.02/month

Total: less than $1/month for ~100 GB of logs. Compare with Datadog ($1.27/GB/month).
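
The arithmetic above can be reproduced in a few lines. A minimal sketch: the $0.023 per GB-month price is an assumption back-solved from the $0.35 for ~15 GB figure, and `storage_cost` is a hypothetical helper.

```python
# Assumption: S3 Standard at about $0.023 per GB-month (implied by $0.35 for ~15 GB).
S3_USD_PER_GB_MONTH = 0.023


def storage_cost(raw_gb_per_day: float, retention_days: int,
                 compression_ratio: float, index_fraction: float = 0.05) -> dict[str, float]:
    """Steady-state monthly S3 bill once retention_days of logs have accumulated."""
    chunks_gb = raw_gb_per_day * retention_days * compression_ratio
    index_gb = chunks_gb * index_fraction
    return {
        "chunks_gb": chunks_gb,
        "chunks_usd": chunks_gb * S3_USD_PER_GB_MONTH,
        "index_usd": index_gb * S3_USD_PER_GB_MONTH,
    }


if __name__ == "__main__":
    # 100-200 MB of chunks per raw GB corresponds to a ratio of 0.10-0.20
    for ratio in (0.10, 0.20):
        cost = storage_cost(raw_gb_per_day=1, retention_days=90, compression_ratio=ratio)
        total = cost["chunks_usd"] + cost["index_usd"]
        print(f"ratio={ratio:.2f}: {cost['chunks_gb']:.1f} GB, ${total:.2f}/month")
```

Both ends of the compression range stay well under $1/month; request and transfer charges are not modeled here.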

Retention config:

yaml
limits_config:
  retention_period: 90d
compactor:
  retention_enabled: true
  retention_delete_delay: 2h

Sizing rules of thumb

1 TB/day ingest = 3 ingester + 2 querier + S3
Ingester RAM ≈ chunks_in_flight × 1.5 MB
Compactor: 1-2 vCPU, lightly loaded
Index lookup in the querier is fast, the bottleneck is usually chunk-fetch
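
The RAM rule of thumb turns into a quick estimate. A rough sketch under stated assumptions: one open chunk per active stream per replica, an even spread across ingesters, and a replication factor of 3; both helper names are hypothetical.

```python
MB_PER_CHUNK = 1.5  # rule-of-thumb figure from the list above


def chunks_per_ingester(active_streams: int, ingesters: int,
                        replication_factor: int = 3) -> float:
    """Assumption: one open chunk per stream per replica, spread evenly."""
    return active_streams * replication_factor / ingesters


def ingester_ram_mb(chunks_in_flight: float) -> float:
    """Ingester RAM ≈ chunks_in_flight × 1.5 MB."""
    return chunks_in_flight * MB_PER_CHUNK


if __name__ == "__main__":
    chunks = chunks_per_ingester(active_streams=10_000, ingesters=3)
    print(f"{chunks:.0f} chunks -> {ingester_ram_mb(chunks) / 1024:.1f} GiB per ingester")
```

Treat the result as a floor for capacity planning rather than a limit: query load and WAL replay need headroom on top.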

Loki vs Elastic vs ClickHouse

Criterion	Loki	Elastic	ClickHouse
Index	label-only	full-text	columnar
Storage	S3 (cheap)	local SSD	local/S3
Cost @ 1TB/day	~$5K/month	~$30K/month	~$10K/month
Full-text speed	medium	very fast	fast (with skip-index)
Aggregations	LogQL metrics	aggregations API	SQL
Multi-tenancy	yes	via index	via DB

ClickHouse-based options such as SigNoz, and object-storage search engines such as Quickwit, are a compromise: cheaper than Elastic, faster than Loki on full-text. They are growing in popularity.

When things go wrong

Cardinality explosion: tens of thousands of streams. Check the loki_ingester_memory_streams metric, or run logcli series with --analyze-labels to find the offending label, then remove dynamic labels. See cardinality-explosion.
Logs are not arriving: check Promtail logs (journalctl -u promtail) for auth failures, network errors, or a full disk under /tmp.
"too many outstanding requests": the query frontend rate-limited you. Narrow the time range or add a label selector.
"max entry size exceeded": a log line is larger than max_line_size (default 256 KB). Truncate in the agent or raise the limit.
"entry too far behind": the line's timestamp is older than what the stream still accepts out of order. Fix clock skew or delayed shipping in the agent.
A search returns 0 even though logs exist: wrong tenant header (X-Scope-OrgID), or the label selector does not match. Check with a broad selector such as {job=~".+"}.
Loki OOM on the ingester: too many chunks held in memory. Flush sooner (shorter chunk_idle_period or max_chunk_age) or reduce the number of active streams.
Promtail falls behind the logs: disk I/O on read, or k8s pod logs rotated before they were read. Use Vector with a persistent buffer.
