linuxlab.io
Footer
linuxlab-TutorialsPricingAboutPrivacy & cookies
Copyright © 2026 LinuxLab. All rights reserved.

kb/observability ── Observability & monitoring ── advanced

Continuous profiling: Pyroscope, eBPF, flame graphs in production

Continuous profiling runs an always-on CPU and memory profiler in production, typically through eBPF, at about 1-2% overhead. Flame graphs show the hot path. The main tools are Pyroscope (Grafana), Parca, and Polar Signals. It replaces ad-hoc perf for production debugging.

aka: pyroscope, continuous-profiling, flame-graphs, ebpf-profiler, parca

Why continuous profiling

Classic profiling, Brendan Gregg-style: SSH to the host, run perf record, then perf script | stackcollapse-perf.pl | flamegraph.pl. The hassle:

  • Reactive (after the incident, usually too late)
  • One host, one process
  • A snapshot, not trends
  • Needs root, sudo, debug symbols on disk
  • Overhead during the measurement, around 5-10% CPU

Continuous profiling: an always-on profiler on every node, sampling at about 100 Hz with 1-2% CPU overhead and sending profiles to central storage. The idea goes back to Google's 2010 "Google-Wide Profiling" paper; open-source tools such as Pyroscope and Parca appeared around 2020-2021.

What you get:

  • Always-on flame graphs: open Grafana and see where CPU goes right now, or two weeks ago
  • A diff between versions: a flame graph before and after a release to spot a regression
  • Correlation with metrics: an alert "p99 latency is rising" → you immediately see the stack behind it
  • Root cause in minutes, not days

In 2025 it is a production tool at Cloudflare, Polar Signals, Coinbase, and Pinterest. The most visible runtime regressions are found through profiling.

How an eBPF profiler works

The old perf: perf_event_open, a kernel ring buffer, and a sample on each CPU triggered by a PMU counter overflow or a timer. The stack walk goes through frame pointers or DWARF.

An eBPF profiler ([[ebpf-basics|eBPF]]) attaches an eBPF program to a perf event (sample at 100 Hz):

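// Sketch only: the `counts` map and the global u64 arrays `kstack` and `ustack`
// are assumed to be declared elsewhere (a real profiler uses per-CPU storage).
SEC("perf_event")
int profile(struct bpf_perf_event_data *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;   // upper 32 bits: tgid, the user-space PID
    bpf_get_stack(ctx, kstack, sizeof(kstack), 0);                 // kernel stack
    bpf_get_stack(ctx, ustack, sizeof(ustack), BPF_F_USER_STACK);  // user stack
    // aggregate in BPF map: bump the per-PID sample counter
    u64 *count = bpf_map_lookup_elem(&counts, &pid);
    if (count)
        __sync_fetch_and_add(count, 1);
    return 0;
}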

Advantages:

  • Frame-pointer-free stack walking (DWARF unwinding in eBPF, in recent kernels)
  • Per-cgroup filtering: profiles per Kubernetes pod without knowing the PID
  • On-host symbolization: the user-space agent reads /proc/PID/maps, finds the ELF, and extracts symbols
  • CO-RE ([[bpf-co-re|BPF Compile Once Run Everywhere]]): one binary across different kernels
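
The on-host symbolization step can be sketched in user space. This is a minimal Python sketch, not how any particular agent implements it: it assumes the agent already holds a raw instruction pointer from a sampled stack, and it only maps that address to an ELF path and file offset. The helper names `parse_maps` and `resolve` and the sample maps text are hypothetical; the line layout follows the documented /proc/PID/maps format. Looking up the symbol name inside the ELF is left out.

```python
from typing import NamedTuple, Optional

class Mapping(NamedTuple):
    start: int    # first address of the mapping
    end: int      # one past the last address
    offset: int   # offset of the mapping inside the backing file
    path: str     # backing file (ELF binary or shared library)

def parse_maps(text: str) -> list[Mapping]:
    """Parse /proc/<pid>/maps content, keeping executable file-backed mappings."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(maxsplit=5)
        if len(fields) < 6:              # anonymous mapping: no pathname column
            continue
        addr, perms, offset, _dev, _inode, path = fields
        path = path.strip()
        if "x" not in perms or not path.startswith("/"):
            continue                     # skip data pages and [stack], [heap], ...
        start, end = (int(part, 16) for part in addr.split("-"))
        mappings.append(Mapping(start, end, int(offset, 16), path))
    return mappings

def resolve(mappings: list[Mapping], addr: int) -> Optional[tuple[str, int]]:
    """Map a sampled address to (file path, offset in that file), or None."""
    for m in mappings:
        if m.start <= addr < m.end:
            return m.path, addr - m.start + m.offset
    return None

# Hypothetical maps content for illustration.
SAMPLE_MAPS = """\
55d0c8a00000-55d0c8a02000 r--p 00000000 fd:01 1234 /usr/bin/myapp
55d0c8a02000-55d0c8a21000 r-xp 00002000 fd:01 1234 /usr/bin/myapp
7f1e4c000000-7f1e4c021000 rw-p 00000000 00:00 0
7f1e4d228000-7f1e4d3bd000 r-xp 00028000 fd:01 5678 /usr/lib/x86_64-linux-gnu/libc.so.6
7ffd3a5f0000-7ffd3a611000 rw-p 00000000 00:00 0 [stack]
"""

if __name__ == "__main__":
    maps = parse_maps(SAMPLE_MAPS)
    print(resolve(maps, 0x55d0c8a02010))
```

An address that falls outside every executable file-backed mapping returns None, which is one way a frame ends up unresolved.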

Flame graphs: how to read them

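Each rectangle is a function. Width = share of samples, which for a CPU profile means share of CPU time. The Y-axis is the call stack: the classic layout puts the parent at the bottom and children on top, while Grafana and Pyroscope draw it inverted (an icicle graph, root at the top), as in the sketch below.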

┌─────────────────────────────────────────────┐
│                  main()                     │
├─────────────────┬───────────────────────────┤
│  serve_request  │      gc_collect           │
├──────────┬──────┼─────────────┬─────────────┤
│  parse   │ db   │  sweep_old  │ allocate    │
├──┬───────┼──────┼─────────────┼─────────────┤
│j │ regex │ scan │             │             │
└──┴───────┴──────┴─────────────┴─────────────┘

Reading it: if the regex box spans a quarter of the total width, regex takes 25% of the whole process's CPU. A wide, flat plateau means a hot leaf. Tall and narrow means a deep call stack with no plateau, which is usually fine.

A flame graph is sampled, not traced, so the exact numbers are imprecise (roughly ±5%, less with more samples), but the proportions hold.
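
The width rule and the sampling caveat can be made concrete with a small Python sketch. It assumes folded-stack input (one `frame;frame;... count` line per unique stack, the flamegraph.pl format). The sample data and the function names `frame_shares` and `margin` are hypothetical. The margin treats samples as independent draws, which real profiles only approximate, so read it as a rough guide rather than a guarantee.

```python
import math
from collections import Counter

# Hypothetical folded stacks: "root;...;leaf <sample count>".
FOLDED = """\
main;serve_request;parse;json 5
main;serve_request;parse;regex 25
main;serve_request;db;scan 15
main;gc_collect;sweep_old 30
main;gc_collect;allocate 25
"""

def frame_shares(folded: str) -> tuple[dict[str, float], int]:
    """Return ({function: share of all samples}, total samples).

    The share is the box width in the flame graph: samples where the
    function appears anywhere on the stack, divided by all samples.
    Frames are keyed by name only, a simplification; real flame graphs
    key each box by its full call path.
    """
    on_stack = Counter()
    total = 0
    for line in folded.splitlines():
        stack, count = line.rsplit(" ", 1)
        n = int(count)
        total += n
        for fn in set(stack.split(";")):  # set(): count recursive frames once
            on_stack[fn] += n
    return {fn: c / total for fn, c in on_stack.items()}, total

def margin(share: float, samples: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sampled share (binomial model)."""
    return z * math.sqrt(share * (1.0 - share) / samples)

if __name__ == "__main__":
    shares, total = frame_shares(FOLDED)
    for fn in ("main", "serve_request", "regex"):
        print(f"{fn}: {shares[fn]:.0%} of {total} samples, "
              f"+/- {margin(shares[fn], total):.1%}")
```

Under this model, a 25% share measured from only 100 samples carries a margin of roughly 8 percentage points; with 6,000 samples (one busy core for a minute at 100 Hz) it drops to about 1 point.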

Pyroscope architecture

┌───────────────┐   pull profiles    ┌──────────────┐
│  Pyroscope    │ ─────────────────► │   target     │
│  server       │                    │ (with eBPF   │
│ ┌───────────┐ │                    │  profiler    │
│ │ TSDB-like │ │                    │  agent)      │
│ │ profiles  │ │                    └──────────────┘
│ │  store    │ │                    ┌──────────────┐
│ └───────────┘ │ ◄───── push ────── │   target     │
│ ┌───────────┐ │                    │ (with SDK    │
│ │ Querier   │ │ ◄─── Grafana       │  push agent) │
│ └───────────┘ │       UI           └──────────────┘
└───────────────┘

Storage is similar to [[prometheus-basics|Prometheus TSDB]]: a profile per (service, instance) per timestamp. A query returns a flame graph over a range filtered by labels.
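
The storage and query model can be illustrated with a toy in-memory store. This is a sketch of the idea only, not Pyroscope's actual storage engine: the `Profile` class, the `query` function, and the sample data are hypothetical, and stacks are kept as folded strings for brevity.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Profile:
    timestamp: int   # Unix seconds when the profile was taken
    labels: dict     # e.g. {"service": "checkout", "instance": "pod-a"}
    stacks: dict     # folded stack -> sample count

def query(store, matchers, start, end):
    """Merge all profiles in [start, end) whose labels match every matcher."""
    merged = Counter()
    for p in store:
        in_range = start <= p.timestamp < end
        matches = all(p.labels.get(k) == v for k, v in matchers.items())
        if in_range and matches:
            merged.update(p.stacks)  # Counter.update adds counts per stack
    return merged

# Hypothetical store: one profile per (service, instance) per timestamp.
STORE = [
    Profile(100, {"service": "checkout", "instance": "pod-a"}, {"main;parse": 30, "main;db": 70}),
    Profile(100, {"service": "checkout", "instance": "pod-b"}, {"main;parse": 50, "main;db": 50}),
    Profile(100, {"service": "cart", "instance": "pod-c"}, {"main;render": 100}),
    Profile(200, {"service": "checkout", "instance": "pod-a"}, {"main;parse": 10}),
]

if __name__ == "__main__":
    print(dict(query(STORE, {"service": "checkout"}, 0, 300)))
```

The merged counts are exactly what a flame graph renderer needs: one sample count per unique stack over the selected range and label set.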

Two ingest modes:

  • Pull eBPF agent (DaemonSet): a single agent profiles everything on the host with no SDK. Best for k8s.
  • Push SDK: runtime-specific (Java, Go, Python). More accurate for managed runtimes (a pure eBPF profiler cannot symbolize JIT-compiled Java frames on its own).

Labels: every profile is tagged

As in Prometheus, labels define the series:

service.name=checkout
pod=checkout-7f8b9c-q2lx9
namespace=prod
region=us-east-1
version=1.4.2

You can filter in the UI: "a flame graph for version=1.4.2 vs 1.4.1", "a diff between region=us-east vs us-west". That is the power: slice by a label to find an anomaly.
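
A diff between two label slices boils down to comparing each stack's share of samples. The Python sketch below is one way to compute it, assuming each side has already been merged into a folded-stack-to-count mapping; the function names and the sample counts for the two versions are hypothetical. Shares are used instead of raw counts so that slices with different sample totals stay comparable.

```python
def shares(stacks: dict[str, int]) -> dict[str, float]:
    """Convert raw sample counts into each stack's share of the total."""
    total = sum(stacks.values())
    return {stack: n / total for stack, n in stacks.items()}

def diff(before: dict[str, int], after: dict[str, int]) -> list[tuple[str, float]]:
    """Return (stack, change in share) pairs, largest increase first."""
    b, a = shares(before), shares(after)
    deltas = {s: a.get(s, 0.0) - b.get(s, 0.0) for s in b.keys() | a.keys()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)

# Hypothetical merged profiles for two releases.
V1_4_1 = {"main;parse;regex": 20, "main;db;scan": 60, "main;gc": 20}
V1_4_2 = {"main;parse;regex": 90, "main;db;scan": 80, "main;gc": 30}

if __name__ == "__main__":
    for stack, delta in diff(V1_4_1, V1_4_2):
        print(f"{delta:+.0%}  {stack}")
```

In this made-up data the regex stack grows from 20% to 45% of samples, so it sorts to the top as the likely regression.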

The cardinality rules are the same: a pod label with tens of thousands of values is fine as long as retention is bounded; a request_id label, never.
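
A quick way to see why request_id is off limits is to count the distinct label sets a given choice of labels produces. The Python sketch below uses hypothetical data (3 pods, 1,000 requests each) and a hypothetical helper `series_count`.

```python
def series_count(label_sets, keep):
    """Count distinct series when only the labels in `keep` define a series."""
    return len({
        tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        for labels in label_sets
    })

# Hypothetical ingest: 3 pods, 1000 requests each.
SAMPLES = [
    {"service": "checkout", "pod": f"checkout-{p}", "request_id": f"req-{p}-{r}"}
    for p in range(3)
    for r in range(1000)
]

if __name__ == "__main__":
    for keep in ({"service"}, {"service", "pod"}, {"service", "pod", "request_id"}):
        print(sorted(keep), series_count(SAMPLES, keep))
```

The pod label multiplies series by the number of pods, which is bounded; request_id creates one series per request, so the count grows with traffic.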

Profile types

Pyroscope/Parca collect several types:

| Type | What it shows |
|---|---|
| process_cpu | on-CPU sampling, where CPU time goes |
| goroutine (Go) | live goroutines, leak detection |
| inuse_space (Go heap) | currently allocated memory |
| alloc_space (Go heap) | total allocated since start |
| block (Go) | blocking on chan/mutex |
| mutex (Go) | contention |
| wall_clock | wall-clock time (including sleep and IO wait), shows what blocks |
| lock_contention (Java) | JFR-based |

For Java this is automatic through [[opentelemetry|OpenTelemetry profiling]] plus JFR (Java Flight Recorder).

Pyroscope vs Parca vs Polar Signals

| Tool | Backed by | Storage | Pitch |
|---|---|---|---|
| Pyroscope | Grafana Labs (acquired 2023) | own, TSDB-like | OSS, integrated with Grafana |
| Parca | Polar Signals | own | OSS, Parquet-on-S3 |
| Polar Signals Cloud | hosted | proprietary | paid |
| gProfiler | Granulate (acquired by Intel) | own | automatic, ARM/x86 |

Pyroscope is the pick for self-hosting with a Grafana stack. Parca is for when you want parquet-friendly queries through DuckDB/SQL.

Continuous profiling vs traditional perf

| Property | perf record | continuous profiler |
|---|---|---|
| When | ad-hoc (after the incident) | always-on |
| Overhead | 5-10% while recording | 1-2% constant |
| Storage | local file (perf.data) | central, with retention |
| Symbolization | locally | on-agent + central |
| Multi-host | no | yes, per-pod |
| Diff between dates | manual, through folded stacks | UI feature |
| Stack accuracy | DWARF + perf | eBPF + DWARF/frame-pointers |

CPU profiling vs memory

For a CPU bottleneck, use the process_cpu type: on-CPU stacks sampled at 100 Hz show the hot path.

For a memory leak, use inuse_space (the Go heap profiler) to see what keeps growing after GC. alloc_space and Java JFR Allocation* events show where garbage is allocated.

Memory profiling overhead is higher (around 3-5%), so it is usually not left always-on in production. Turn it on on demand or when memory.high triggers ([[cgroups-v2-deep|cgroup PSI]]).
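
One way to detect that memory.high has triggered is to watch the `high` counter in the cgroup v2 `memory.events` file, which increases each time the cgroup exceeds its memory.high boundary. The Python sketch below compares two snapshots of that file; the function names and the sample snapshots are hypothetical, and how the profiler is actually switched on is left to the caller.

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.events content: one '<name> <count>' pair per line."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events

def should_profile_memory(previous: str, current: str) -> bool:
    """True if the cgroup crossed memory.high between the two snapshots."""
    before = parse_memory_events(previous).get("high", 0)
    after = parse_memory_events(current).get("high", 0)
    return after > before

# Hypothetical snapshots of /sys/fs/cgroup/<group>/memory.events.
SNAPSHOT_1 = "low 0\nhigh 3\nmax 0\noom 0\noom_kill 0\n"
SNAPSHOT_2 = "low 0\nhigh 7\nmax 0\noom 0\noom_kill 0\n"

if __name__ == "__main__":
    if should_profile_memory(SNAPSHOT_1, SNAPSHOT_2):
        print("memory.high crossed: enable memory profiling for this workload")
```

Comparing snapshots rather than checking for a non-zero value matters because the counter is cumulative for the lifetime of the cgroup.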

Profile-Guided Optimization (PGO)

Go 1.21+, Rust nightly: the compiler takes a production profile as a hint and optimizes hot code (better inlining, branch prediction).

Workflow:

  1. Continuous profiling in production → you collect default.pgo
  2. Put it in the repo next to main.go
  3. go build picks up the PGO profile
  4. A 2-15% speedup on a typical Go service

Pyroscope has a pyroscope-pgo exporter for this.

When things go wrong

  • Stack shows "[unknown]": no debug symbols, or frame pointers are missing (gcc omits them by default when optimizing, as with -fomit-frame-pointer). Build with -fno-omit-frame-pointer, or enable DWARF unwinding in the profiler.
  • Java stack frames are empty: JIT compilation. You need a JFR-based profiler instead of pure eBPF.
  • High cardinality: every pod is a new series. Limit retention, or drop the pod label and keep the deployment.
  • Profiles do not arrive: the eBPF agent needs CAP_SYS_ADMIN, or on kernel 5.8+ CAP_BPF plus CAP_PERFMON (see [[bpf-co-re|BPF CO-RE]]). On k8s, use securityContext.privileged or the matching capabilities.
  • The delta between versions looks odd: different [[cgroups-v2-deep|cgroup PSI throttling]] changed execution time, not the code. Control the test environment.
  • Async/await stacks are broken: this affects Go goroutines and Rust async. The Pyroscope Go SDK plus eBPF handle this; Python asyncio is covered only partially.

When not to use it

  • A small service < 10K rps: the 1-2% overhead is invisible, but there is also nothing to find. Ad-hoc pprof is enough.
  • Compliance forbids exporting stack traces from production (PII can end up in captured strings): do not enable features that capture string values alongside samples.
  • Embedded or IoT: no resources for an eBPF agent.

§ commands

bash
pyroscope agent --target-pid=$(pgrep -f myapp) --tag=service=myapp

Pyroscope eBPF agent on a specific PID, a quick check

bash
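curl -sG 'http://pyroscope:4040/render' --data-urlencode 'query=process_cpu{service="checkout"}' --data-urlencode 'from=now-1h' --data-urlencode 'until=now' --data-urlencode 'format=collapsed' | head -20

Raw folded stacks (the flamegraph.pl format) for the query 'CPU usage of checkout'. The query is passed with -G and --data-urlencode so that curl does not treat the braces as URL globbing.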

bash
go tool pprof -http=:8080 'http://pyroscope:4040/render?query=process_cpu{...}&format=pprof'

Open a Pyroscope profile in `go tool pprof` locally

bash
perf record -F 99 -g -p $(pgrep myapp) -- sleep 30

Classic perf to compare with the eBPF-based one, 99Hz, 30 seconds

bash
parca-agent --node=$NODE --kubernetes --metadata-external-labels=cluster=prod

Parca agent on a k8s node, DaemonSet-style deploy

bash
go test -bench=. -cpuprofile=cpu.prof . && go tool pprof -http=: cpu.prof

On-demand pprof in Go, locally for benchmarks, not for prod. The -cpuprofile flag works on one package at a time, so it is run on . rather than ./...

bash
OTEL_EXPORTER_OTLP_PROFILES_ENDPOINT=http://collector:4318 java -javaagent:opentelemetry-javaagent.jar -jar app.jar

OTel Java agent with profiling enabled (experimental in OTel 1.30+)

§ see also

  • [[bpf-co-re|BPF CO-RE: Compile Once Run Everywhere]]: CO-RE means one compiled eBPF object runs on different kernels thanks to BTF (BPF Type Format). vmlinux.h is a dump of kernel structures. libbpf rewrites offsets at runtime. It replaces BCC, and you no longer need LLVM in production.
  • [[cgroups-v2-deep|cgroups v2: unified hierarchy, PSI, eBPF control]]: cgroups v2 uses one tree instead of separate per-controller hierarchies. Clean semantics, new fields (memory.high, io.cost). PSI shows resource pressure. eBPF can manage resources. Default in RHEL 9, Ubuntu 22+.
  • [[opentelemetry|OpenTelemetry: signals, OTLP, Collector pipeline]]: OpenTelemetry is the CNCF standard for metrics, traces, and logs in one SDK. The OTLP protocol runs over gRPC or HTTP. The Collector receives, filters, and routes to Prom/Tempo/Loki/Jaeger. Auto-instrumentation needs no code change.
  • [[metrics-vs-logs-vs-traces|Metrics vs logs vs traces: the three pillars of observability]]: Metrics are aggregated numbers over time, cheap, for alerts. Logs are discrete events with context, for root-cause. Traces are request flow across services, for distributed debug. Structure beats volume.