Why service discovery
A static config works for 5 hosts. For 500 it does not. In Kubernetes the endpoints change constantly (rollouts, autoscaling). You need a way to learn automatically which targets to scrape.
The answer is service discovery (SD): Prometheus says "give me all pods/services with these labels", the SD mechanism returns a list of endpoints, and Prom scrapes them.
Around 30 SD mechanisms are supported, including kubernetes, consul, dns, ec2, azure, gce, file_sd, and http_sd. The most common are k8s and consul.
Discovery → relabel → scrape
┌──────────────┐
│ SD mechanism │  returns targets with meta-labels
│ (k8s, etc)   │  __meta_kubernetes_pod_name, etc
└──────┬───────┘
       │ raw targets with __meta_* labels
       ▼
┌──────────────┐
│ relabel_     │  filter + transform labels
│ configs      │  action: keep/drop/replace/labelmap
└──────┬───────┘
       │ final targets
       ▼
┌──────────────┐
│ scrape       │  HTTP GET /metrics
└──────┬───────┘
       │ raw metrics
       ▼
┌──────────────┐
│ metric_      │  drop bad metrics, rewrite names
│ relabel_     │
│ configs      │
└──────┬───────┘
       │
       ▼
      TSDB
Critical insight: __meta_* labels (like every label that starts with __) are dropped after target relabeling. If
you want them in the TSDB, copy them into a regular label with an explicit replace or labelmap action.
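A minimal sketch of such an explicit copy, keeping the container name as a regular label. The target label name container is an assumption chosen for illustration; replace is the default action and is spelled out only for clarity.

```yaml
relabel_configs:
  # __meta_kubernetes_pod_container_name would be dropped after relabeling;
  # copying it into a regular label keeps it on every scraped series.
  - source_labels: [__meta_kubernetes_pod_container_name]
    action: replace
    target_label: container   # label name chosen for illustration
```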
Kubernetes SD
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod  # pod | service | endpoints | endpointslice | node | ingress
    relabel_configs:
      # Only pods with the annotation prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Take the port from the annotation prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Path from the annotation prometheus.io/path (default /metrics)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: '(.+)'
      # All pod labels → target labels, with the __meta_kubernetes_pod_label_ prefix stripped
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Convenience labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
Result: every pod with the prometheus.io/scrape=true annotation is
scraped. All its k8s labels are copied into metric labels.
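For illustration, a pod that this job would pick up might look like the sketch below. The pod name, image, and port are hypothetical. The prometheus.io/* annotations are only a convention: they mean something here solely because the relabel rules above read them.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                 # hypothetical name
  namespace: default
  labels:
    app: demo-app                # becomes the target label app via labelmap
  annotations:
    prometheus.io/scrape: "true" # matched by the keep rule
    prometheus.io/port: "8080"   # rewritten into __address__
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: app
      image: example.com/demo-app:1.0   # hypothetical image
      ports:
        - containerPort: 8080
```

Annotation values must be strings, hence the quotes around true and 8080.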
Roles in kubernetes_sd
| Role | What it returns | When |
|---|---|---|
| node | Kubernetes nodes (kubelet) | host metrics, kubelet |
| pod | every pod | application metrics |
| service | k8s Service objects | blackbox probes to services |
| endpoints | Endpoints objects (legacy) | instead of service when you scrape the pods behind it, e.g. kube-state-metrics |
| endpointslice | EndpointSlice objects (modern) | k8s 1.21+, scales better |
| ingress | Ingress objects | blackbox probes to ingresses |
Modern setup: the endpointslice role instead of endpoints (better
performance on large clusters).
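A sketch of a job using that role, assuming a Service named kube-state-metrics exists in the cluster. The job name and Service name are assumptions; the service meta-labels are attached only when the slice belongs to a Service.

```yaml
scrape_configs:
  - job_name: kube-state-metrics
    kubernetes_sd_configs:
      - role: endpointslice
    relabel_configs:
      # Keep only the endpoints that back the (assumed) kube-state-metrics Service
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: 'kube-state-metrics'
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
```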
Consul SD
scrape_configs:
  - job_name: consul
    consul_sd_configs:
      - server: consul.example.com:8500
        tags: ['prometheus']  # only services with the tag
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_tags]
        target_label: tags
Consul is popular in non-k8s stacks (Nomad, classic VMs). A service registers itself in Consul, and Prom learns about it through SD.
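Tags can also drive filtering during relabeling. In the sketch below the tag name canary is hypothetical. It relies on __meta_consul_tags being the tag list joined by the tag separator (a comma by default) with a separator at both ends, so a tag can be matched regardless of its position.

```yaml
relabel_configs:
  # __meta_consul_tags looks like ",prometheus,canary,", so the commas
  # on both sides guard against partial tag matches.
  - source_labels: [__meta_consul_tags]
    regex: '.*,canary,.*'
    action: drop
```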
file_sd: static with granularity
When there is no k8s or Consul, but you have a script that knows who to scrape:
scrape_configs:
  - job_name: file-discovery
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.json']
        refresh_interval: 30s
The file:
[
{"targets": ["host1:9100", "host2:9100"],
"labels": {"env": "prod", "team": "infra"}},
{"targets": ["dbhost:9187"],
"labels": {"env": "prod", "team": "db"}}
]
An external tool (Ansible, Terraform, Chef) generates the JSON. Prom watches the files and picks up changes as they land; the 30s refresh_interval is a fallback re-read. Flexible and simple.
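file_sd also reads YAML target files, which can be easier to template. The same targets as above, as a sketch: the file name is hypothetical, and the files pattern in the job would need to match *.yml instead of *.json.

```yaml
# /etc/prometheus/targets/nodes.yml (hypothetical file name)
- targets: ['host1:9100', 'host2:9100']
  labels:
    env: prod
    team: infra
- targets: ['dbhost:9187']
  labels:
    env: prod
    team: db
```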
relabel actions
| Action | What it does |
|---|---|
| replace | writes regex.replace(source, replacement) into target_label |
| keep | drops the target unless source matches the regex |
| drop | drops the target if source matches the regex |
| keepequal | keeps the target only if the joined source_labels value equals the value of target_label |
| dropequal | drops the target if the joined source_labels value equals the value of target_label |
| hashmod | target_label = hash(source) % modulus (for sharding) |
| labelmap | copies every label whose name matches the regex to the name given by replacement |
| labeldrop | removes labels whose names match the regex |
| labelkeep | keeps only labels whose names match the regex |
| lowercase / uppercase | writes the lower- or upper-cased source value into target_label |
keep and drop are the most common for filtering. replace and
labelmap are for shaping labels.
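The equality actions are the least obvious ones in the table, so here is a hedged sketch. It assumes a Prometheus release recent enough to support keepequal and the pod role from the Kubernetes example. The rule keeps only the target whose container port equals the port named in the prometheus.io/port annotation.

```yaml
relabel_configs:
  # Compare the annotation value (source) with the container port (target_label).
  # Targets for any other declared port are dropped.
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
    target_label: __meta_kubernetes_pod_container_port_number
    action: keepequal
```

Unlike replace, keepequal does not write to target_label; it only reads it for the comparison.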
Sharding with hashmod
Three Proms scrape 1000 targets, split evenly:
relabel_configs:
  - source_labels: [__address__]
    modulus: 3
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: '0'  # this Prom, shard 0
    action: keep
Each Prom holds about 330 targets. Federation aggregates upward.
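The aggregation side can be sketched as a federation job on a top-level Prometheus. The shard hostnames and the job: recording-rule prefix are assumptions for illustration; /federate, match[], and honor_labels are the standard federation settings.

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true          # keep the job/instance labels set by the shards
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # only pre-aggregated series (assumed naming scheme)
    static_configs:
      - targets:
          - prom-shard-0:9090   # hypothetical shard addresses
          - prom-shard-1:9090
          - prom-shard-2:9090
```

Pulling only aggregated series keeps the top-level instance small; federating every raw series would undo the benefit of sharding.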
metric_relabel_configs: after scrape
Applies to already scraped metrics, before they are written to the TSDB.
scrape_configs:
  - job_name: ...
    metric_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'go_gc_pauses_seconds_bucket'
        action: drop
      # Drop a specific label with user_id (cardinality)
      - regex: 'user_id'
        action: labeldrop
      # Rewrite metric name
      - source_labels: [__name__]
        regex: 'old_metric_name'
        replacement: 'new_metric_name'
        target_label: __name__
Used to fight cardinality explosions from ill-behaved exporters. It is better to fix the problem in the exporter code, but sometimes you have no access to it.
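When an exporter emits far more than you need, an allowlist is often simpler than a growing list of drop rules. A sketch, with hypothetical metric names:

```yaml
metric_relabel_configs:
  # Keep only the listed metric families; every other sample
  # scraped by this job is discarded before it reaches the TSDB.
  - source_labels: [__name__]
    regex: 'http_requests_total|http_request_duration_seconds_(bucket|sum|count)'
    action: keep
```

The regex is anchored at both ends, so each alternative has to match the whole metric name.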
Best practices
- Filter at the SD stage, not the metric stage: keep/drop in relabel_configs is cheaper than metric_relabel_configs, and it puts less load on the target.
- Convenient labels (namespace, pod, service): stable names across all jobs. Map them explicitly; __meta_kubernetes_* labels are dropped after relabeling, so queries cannot use them.
- Do not copy every pod label with labelmap blindly. k8s attaches controller-revision-hash, pod-template-hash, and so on. That is cardinality. Whitelist with a regex in labelmap:
    - action: labelmap
      regex: __meta_kubernetes_pod_label_(app|version|component)
- CI-test your relabel: promtool check config plus targeted dry-runs through promtool (limited).
kube-state-metrics + node-exporter
The standard k8s monitoring stack:
- node-exporter on every node → node_* metrics
- kube-state-metrics, a single instance → kube_* metrics about the state of k8s objects
- cAdvisor in the kubelet → container metrics
- app metrics through annotation discovery
All through k8s SD with different relabel configs.
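As one example of those per-component relabel configs, here is a sketch of a node-exporter job built on the node role. It assumes node-exporter listens on port 9100 on each node's own address (for example through the host network). The node role points __address__ at the kubelet port by default, so the rule swaps the port.

```yaml
scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Replace the kubelet port with the (assumed) node-exporter port 9100
      - source_labels: [__address__]
        regex: '([^:]+)(?::\d+)?'
        replacement: '$1:9100'
        target_label: __address__
      - source_labels: [__meta_kubernetes_node_name]
        target_label: node
```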
When things go wrong
- No targets in /targets: a relabel keep is too strict and nothing is left. Remove one rule at a time and check the UI.
- Targets exist, but scrape errors with "401 Unauthorized": a kubelet scrape needs a ServiceAccount and RBAC, or bearer_token_file: /var/run/secrets/.../token.
- Cardinality explosion after a rollout: labelmap copied pod-template-hash. Whitelist the labels.
- Targets are duplicated: the same endpoint appears in several roles. Deduplicate: one role plus the right selector.
- Slow SD reload (5+ minutes): a k8s API rate limit. Tune refresh_interval where the SD mechanism has one, or use the endpointslice role instead of endpoints.
- __address__ has the wrong port: the pod role generates one target per declared container port, so the port you get may not be the metrics port. Override it from an annotation with replace.
- Stale targets after a k8s namespace delete: their series can still show up in queries for up to --query.lookback-delta (default 5m). This is normal.