linuxlab.io
Copyright © 2026 LinuxLab. All rights reserved.

kb/observability ── Observability & monitoring ── intermediate

Loki: label-based logs, LogQL, Promtail/Vector pipeline

Loki is log aggregation with a label-based index, not full-text like Elastic. Cheap on S3 storage. Promtail/Vector are the agents. LogQL resembles PromQL: filter, parse, aggregate. Cardinality is the enemy.

aka: loki, logql, promtail, vector-loki, grafana-loki

Why Loki

Elasticsearch does full-text search over every field of a log. Every word is indexed. The cost:

  • 1 TB/day of logs needs roughly 30 GB of heap for indexing, and the stored index can reach 3-5x the source size
  • $30K/month for an ES cluster vs $5K/month for S3-Loki
  • Indexing latency is real pain above 100K events/s

Loki (Grafana Labs, 2018) flipped the approach:

  • Only labels are indexed (as in Prometheus)
  • The log payload itself is stored as compressed chunks on S3-compatible storage
  • A search means "pick streams by labels" plus "grep through chunks"
  • Cheap, with almost unlimited scale

Trade-off: full-text search is slower, but 95% of queries in observability look like {service=X, level=error} |= "timeout", which Loki handles quickly.

Architecture

┌─────────┐  push      ┌──────────┐  write     ┌─────────┐
│ Promtail│ ─────────► │  Loki    │ ─────────► │   S3    │
│ (agent) │            │ (server) │ ◄───────── │ (chunks)│
└─────────┘            └──────────┘  read      └─────────┘
┌─────────┐  push         ▲    ▲
│  Vector │ ──────────────┘    │ LogQL query
└─────────┘                    │
     ▲                    ┌────┴─────┐
     │ tail files /       │  Grafana │
     │ journald / docker  └──────────┘
┌────┴────┐
│ /var/log│
└─────────┘

Loki components in a cluster:

  • distributor receives the push and does the hashing
  • ingester buffers chunks in RAM and flushes to S3 every 10-30 min
  • querier reads chunks from S3 plus the ingester for recent data
  • query-frontend splits large queries and caches
  • compactor merges indexes and enforces retention
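
The distributor's hashing step can be pictured as a consistent-hash ring: a stream's label set hashes to a point on the ring, and the ingester owning the next token receives it. The sketch below is a simplified illustration under stated assumptions, not Loki's actual ring code. The hash function, token count, and ingester names are all hypothetical, and replication to several ingesters is left out.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a 32-bit point on the ring (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


def build_ring(ingesters, tokens_per_ingester=64):
    """Return a sorted list of (token, ingester) pairs."""
    ring = []
    for name in ingesters:
        for i in range(tokens_per_ingester):
            ring.append((_hash(f"{name}-{i}"), name))
    ring.sort()
    return ring


def stream_key(labels: dict) -> str:
    """Canonical string for a label set: sorted so key order does not matter."""
    return ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))


def pick_ingester(ring, labels: dict) -> str:
    """Find the first token at or after the stream's hash, wrapping around."""
    point = _hash(stream_key(labels))
    idx = bisect.bisect_left(ring, (point, ""))
    return ring[idx % len(ring)][1]


ring = build_ring(["ingester-0", "ingester-1", "ingester-2"])
owner = pick_ingester(ring, {"service": "api", "env": "prod"})
print(owner)
```

Because the key is built from the sorted label set, the same stream always lands on the same ingester, which is what lets that ingester build one chunk per stream in memory.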

Stream and labels

A stream in Loki is a unique combination of labels:

{service="api", env="prod", host="node-12", level="info"}

Each stream is stored as its own sequence of compressed chunk objects on S3. Inside a stream, log lines are ordered by time.
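
To make the stream definition concrete, here is a minimal sketch that groups log entries by their full label set. The entries and timestamps are invented for illustration, and a real ingester does far more (chunking, compression, flushing).

```python
from collections import defaultdict

# Hypothetical incoming entries: (labels, timestamp in seconds, log line).
entries = [
    ({"service": "api", "level": "info"}, 3, "GET /health 200"),
    ({"service": "api", "level": "error"}, 1, "upstream timeout"),
    ({"level": "info", "service": "api"}, 2, "GET /users 200"),
]

streams = defaultdict(list)
for labels, ts, line in entries:
    # The stream identity is the full label set; key order is irrelevant.
    key = tuple(sorted(labels.items()))
    streams[key].append((ts, line))

for key in streams:
    streams[key].sort()  # lines inside a stream are kept in time order

print(len(streams))  # 2 distinct label sets, so 2 streams
```

Changing a single label value (here level) creates a new stream, which is why every extra label multiplies the stream count.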

As with Prometheus series, cardinality is bounded by the product of the number of unique values of each label. More than 10K active streams in one tenant tends to cause degradation. Labels, from dangerous to safe:

  • request_id (millions), never
  • user_id, never, write it into the payload
  • pod_name (k8s), can be thousands, OK with retention
  • host, tens to hundreds, OK
  • level, 4-5 values, ideal

Rule: labels have low cardinality, everything else goes into the log line.
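
The product rule is easy to check with a few lines of arithmetic. The value counts below are assumed figures for illustration, not measurements.

```python
from math import prod

# Assumed number of distinct values per label (illustrative figures only).
label_values = {"service": 20, "env": 3, "host": 50, "level": 5}


def max_streams(values_per_label: dict) -> int:
    """Upper bound on streams: the product of distinct values per label."""
    return prod(values_per_label.values())


print(max_streams(label_values))                           # 15000
print(max_streams({**label_values, "user_id": 100_000}))   # 1500000000
```

The real stream count is usually lower than this bound because not every combination occurs, but adding one high-cardinality label moves the bound by orders of magnitude.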

LogQL, the query language

PromQL-like, but on logs.

Stream selector plus line filter:

{service="api", env="prod"} |= "error"
{service="api"} |~ "timeout|refused"          # regex
{service="api"} != "healthcheck"              # exclude
{service=~"api.*"} | json | level="error"     # parse JSON

Operators:

Op    What it does
|=    line contains substring
!=    line does not contain substring
|~    line matches regex
!~    line does not match regex
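
As a rough mental model, the four line filters behave like substring and regex tests applied to each line. The sketch below mimics them in Python. It is an approximation, since Loki uses Go's RE2 regex syntax, which differs from Python's re in some details; the sample lines are invented.

```python
import re


def line_filter(lines, op, needle):
    """Apply one LogQL-style line filter to an iterable of log lines."""
    checks = {
        "|=": lambda line: needle in line,
        "!=": lambda line: needle not in line,
        "|~": lambda line: re.search(needle, line) is not None,
        "!~": lambda line: re.search(needle, line) is None,
    }
    return [line for line in lines if checks[op](line)]


logs = ["GET /healthcheck 200", "upstream timeout", "connection refused", "GET /users 200"]
print(line_filter(logs, "|~", "timeout|refused"))
print(line_filter(logs, "!=", "healthcheck"))
```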

Parsers (after |):

  • json parses JSON, fields become accessible as level, user_id, etc
  • logfmt, for key=value logs
  • regexp, | regexp `(?P<status>\d+)` (backticks avoid double-escaping the backslash)
  • pattern, | pattern "<_> [<level>] <msg>"
  • unpack, for entries packed by the Promtail pack stage
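
What the parsers do can be sketched as functions that turn one log line into a dictionary of extracted labels. These are simplified stand-ins, not Loki's implementations: the real json parser also flattens nested objects, and the real logfmt parser handles quoted values. The sample lines are hypothetical.

```python
import json
import re


def parse_json_line(line: str) -> dict:
    """Rough analogue of `| json`: top-level keys become extracted labels."""
    return {k: str(v) for k, v in json.loads(line).items()}


def parse_regexp_line(line: str, expr: str) -> dict:
    """Rough analogue of `| regexp`: named groups become extracted labels."""
    match = re.search(expr, line)
    return match.groupdict() if match else {}


def parse_logfmt_line(line: str) -> dict:
    """Rough analogue of `| logfmt` for simple unquoted key=value pairs."""
    return dict(pair.split("=", 1) for pair in line.split() if "=" in pair)


print(parse_json_line('{"level": "error", "status": 502}'))
print(parse_regexp_line("GET /users 404", r"(?P<status>\d+)"))
print(parse_logfmt_line("level=warn status=429 msg=throttled"))
```

Extracted labels exist only for the duration of the query, so they can be high-cardinality without affecting the stream index.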

Metrics from logs (Loki as a time series):

rate({service="api"} |= "error" [5m])              # matching log lines per second
sum by (status)(count_over_time({service="api"} | json [1m]))

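No separate metric store is needed: Loki computes these at query time by scanning the matching chunks, so the cost grows with the volume of logs selected. This is useful for alerting when no dedicated metric exists (see alerting-rules-alertmanager).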
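
The two range functions used above reduce to simple counting. A minimal sketch, assuming a list of timestamps for lines that already passed the filter (the timestamps are made up):

```python
def count_over_time(timestamps, now, window_s):
    """Number of log lines with a timestamp in the window (now - window_s, now]."""
    return sum(1 for ts in timestamps if now - window_s < ts <= now)


def rate(timestamps, now, window_s):
    """Log lines per second over the window: the count divided by the window length."""
    return count_over_time(timestamps, now, window_s) / window_s


# Hypothetical timestamps (seconds) of lines that matched |= "error".
error_ts = [10, 50, 120, 250, 290, 299]

print(count_over_time(error_ts, now=300, window_s=60))   # 3
print(rate(error_ts, now=300, window_s=300))             # 0.02
```

Loki evaluates this at every step of the query range, which is why a wide range over a busy stream means reading many chunks.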

Promtail, the Loki-native agent

It discovers log files, parses them, adds labels, and pushes:

yaml
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: stream
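    pipeline_stages:
      # docker_sd_configs reads lines through the Docker API, so no cri/docker
      # unwrapping stage is needed before parsing the application JSON.
      - json:
          expressions: {level: level, msg: msg, trace_id: trace_id}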
      - labels:
          level:
      - structured_metadata:
          trace_id:

pipeline_stages transform the log line. structured_metadata (Loki 2.9+) holds fields such as trace_id or request_id without adding to stream cardinality: they are searchable but not indexed as labels. This solves the problem of high-cardinality identifiers.
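
On the wire, structured metadata travels as an optional third element of each entry in the push API body. The sketch below only builds such a body for POST /loki/api/v1/push and does not send it; the helper name, label values, and trace ID are hypothetical.

```python
import json
import time


def build_push_payload(labels, line, metadata, ts_ns=None):
    """Build a JSON body for POST /loki/api/v1/push with structured metadata."""
    ts_ns = ts_ns if ts_ns is not None else time.time_ns()
    return {
        "streams": [
            {
                "stream": labels,                          # indexed, low cardinality
                "values": [[str(ts_ns), line, metadata]],  # metadata rides with the entry
            }
        ]
    }


payload = build_push_payload(
    labels={"service": "api", "level": "error"},
    line='{"msg": "upstream timeout"}',
    metadata={"trace_id": "0af7651916cd43dd"},  # hypothetical trace ID
    ts_ns=1700000000000000000,
)
print(json.dumps(payload, indent=2))
```

The trace ID never appears in the stream labels, so it creates no new stream, yet a query such as {service="api"} | trace_id="0af7651916cd43dd" can still filter on it.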

Vector, the alternative agent

Vector (Datadog, open source) has a more capable pipeline:

toml
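[sources.in]
type = "kubernetes_logs"
[transforms.parse]
type = "remap"
inputs = ["in"]
source = '''
# Merge parsed JSON into the event so the kubernetes.* metadata is kept;
# lines that are not JSON objects pass through unchanged.
parsed = parse_json(.message) ?? {}
if is_object(parsed) { . = merge(., object!(parsed)) }
.level = downcase(string(.level) ?? "unknown")
'''
[sinks.loki]
type = "loki"
inputs = ["parse"]
endpoint = "http://loki:3100"
encoding.codec = "json"  # the loki sink requires an encoding
labels = {service = "{{ kubernetes.container_name }}", level = "{{ level }}"}
remove_label_fields = true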

Vector offers:

  • Multi-sink: Loki plus S3 plus Kafka at the same time
  • VRL (Vector Remap Language), a JS-like language for parsing
  • Backpressure handling: a disk-buffered queue
  • Sampling and filtering before sending

Use Vector when the pipeline is complex or you need several backends.

Retention and storage

At rest, Loki's cost is almost entirely S3 storage. The math:

  • 1 GB/day of logs → compressed ~100-200 MB chunks
  • 90d retention → ~15 GB on S3 → $0.35/month (S3 standard)
  • Index is ~5% of chunks: $0.02/month

Total: less than $1/month for ~100 GB of logs. Compare with Datadog ($1.27/GB/month).
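
The arithmetic above can be reproduced with a small estimator. This is a back-of-the-envelope sketch: the S3 price, the 0.15 compression ratio (the midpoint of the 100-200 MB range), and the 5% index share are assumptions, and request and transfer charges are ignored.

```python
S3_PRICE_PER_GB_MONTH = 0.023  # assumption: S3 Standard list price, varies by region and tier


def monthly_storage_cost(raw_gb_per_day, retention_days,
                         compression_ratio=0.15, index_fraction=0.05):
    """Estimate steady-state S3 cost for chunks plus index, in dollars per month."""
    chunks_gb = raw_gb_per_day * compression_ratio * retention_days
    index_gb = chunks_gb * index_fraction
    return {
        "chunks_gb": chunks_gb,
        "chunks_usd": chunks_gb * S3_PRICE_PER_GB_MONTH,
        "index_usd": index_gb * S3_PRICE_PER_GB_MONTH,
    }


estimate = monthly_storage_cost(raw_gb_per_day=1, retention_days=90)
print({k: round(v, 2) for k, v in estimate.items()})
```

With these inputs the estimate lands close to the figures quoted above, at roughly 13.5 GB of chunks and well under $1/month in total.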

Retention config:

yaml
limits_config:
  retention_period: 90d
compactor:
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: s3   # Loki 3.x requires this when retention is enabled

Sizing rules of thumb

  • 1 TB/day ingest = 3 ingesters + 2 queriers + S3
  • Ingester RAM ≈ chunks_in_flight × 1.5 MB
  • Compactor: 1-2 vCPU, lightly loaded
  • Index lookup in the querier is fast, the bottleneck is usually chunk-fetch
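
The ingester RAM rule of thumb translates directly into a helper. This is a sketch under the assumption that chunks spread evenly across ingesters; replication and per-stream overhead are ignored, and the load figures are hypothetical.

```python
CHUNK_MB = 1.5  # per-chunk figure from the rule of thumb above


def ingester_ram_gib(chunks_in_flight: int, ingesters: int = 1) -> float:
    """Approximate RAM per ingester in GiB, assuming chunks spread evenly."""
    total_mb = chunks_in_flight * CHUNK_MB
    return total_mb / ingesters / 1024


# Hypothetical load: 30,000 open chunks spread over 3 ingesters.
print(round(ingester_ram_gib(30_000, ingesters=3), 1))  # 14.6
```

Since each active stream holds at least one open chunk, this ties ingester memory back to stream cardinality.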

Loki vs Elastic vs ClickHouse

Criterion          Loki           Elastic           ClickHouse
Index              label-only     full-text         columnar
Storage            S3 (cheap)     local SSD         local/S3
Cost @ 1TB/day     ~$5K/month     ~$30K/month       ~$10K/month
Full-text speed    medium         very fast         fast (with skip-index)
Aggregations       LogQL metrics  aggregations API  SQL
Multi-tenancy      yes            via index         via DB

ClickHouse-based options such as SigNoz, and object-storage search engines such as Quickwit, are a compromise: cheaper than Elastic, faster than Loki on full-text. They are growing in popularity.

When things go wrong

  • Cardinality explosion, tens of thousands of streams. The loki_ingester_memory_streams metric shows active streams, and logcli series '{}' --analyze-labels shows which labels drive them. Remove dynamic labels. See cardinality-explosion.
  • Logs are not arriving, check Promtail logs (journalctl -u promtail): auth failure, network, disk full in /tmp.
  • "too many outstanding requests", the query frontend's per-tenant queue is full. Narrow the time range, add a label selector.
  • "entry too far behind", the entry's timestamp is older than the out-of-order window for its stream (one hour by default). Fix the clock or the timestamp parsing in the agent.
  • Line too long, a log line is larger than max_line_size (default 256 KB). Truncate in the agent or raise the limit.
  • A search returns 0 even though logs exist, wrong tenant header (X-Scope-OrgID), or the label selector does not match. Check with a broad selector such as {job=~".+"}.
  • Loki OOM on the ingester, too many chunks are held in memory at once. Reduce the number of active streams, flush sooner (lower chunk_idle_period or max_chunk_age), or add ingesters.
  • Promtail falls behind the logs, disk-IO on read; k8s pod logs rotated. Use Vector with a persistent buffer.

§ commands

bash
logcli query '{service="api"} |= "error"' --limit 50 --since 1h

Loki CLI, the 50 most recent error lines from the last hour for service=api

bash
logcli query 'sum by (level)(count_over_time({service="api"} | json [1m]))' --limit 5

Metrics from logs, count by level over a minute, as in Prometheus

bash
curl -s 'http://loki:3100/loki/api/v1/labels' | jq

All labels in the system, the first step in diagnosing cardinality

bash
curl -s 'http://loki:3100/loki/api/v1/label/service/values' | jq

All values of the 'service' label; the number of values per label shows which label drives the stream count

bash
promtail -config.file=/etc/promtail/config.yaml -log.level=debug

Promtail in debug mode, you can see which files it tails and where it pushes

bash
journalctl -u promtail -n 100 -f

Live logs of Promtail itself, a typical source of ingestion errors

bash
vector validate /etc/vector/vector.toml

Validate the Vector config before startup, in CI

§ see also

  • journalctl: systemd journal (cmd-journalctl). journalctl reads the binary journal written by systemd-journald. It is the central log for the system: kernel, systemd services, syslog, all through one interface.
  • auditd: syscall and file audit (auditd). auditd writes kernel events to /var/log/audit/audit.log: file watches (-w), syscall rules (-a), execs. Use ausearch to search, aureport for reports. This is the basis of compliance (PCI-DSS, HIPAA, FZ-152).
  • Metrics vs logs vs traces: the three pillars of observability (metrics-vs-logs-vs-traces). Metrics are aggregated numbers over time, cheap, for alerts. Logs are discrete events with context, for root-cause. Traces are request flow across services, for distributed debug. Structure beats volume.