Continuous profiling: Pyroscope, eBPF, flame graphs in production

Why continuous profiling

Classic profiling, Brendan Gregg-style: SSH to the host, perf record, then perf script | stackcollapse-perf.pl | flamegraph.pl. The hassle:

Reactive (after the incident, usually too late)
One host, one process
A snapshot, not trends
Needs root, sudo, debug symbols on disk
Overhead during the measurement, around 5-10% CPU

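Continuous profiling: an always-on profiler on every node, sampling at about 100 Hz with 1-2% CPU overhead and sending profiles to central storage. The idea goes back to Google's 2010 paper "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers"; open-source tools such as Pyroscope appeared around 2020.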

What you get:

Always-on flame graphs: open Grafana and see where CPU goes right now, or two weeks ago
A diff between versions: a flame graph before and after a release to spot a regression
Correlation with metrics: an alert "p99 latency is rising" → you immediately see the stack behind it
Root cause in minutes, not days

In 2025 it is a production tool at Cloudflare, Polar Signals, Coinbase, and Pinterest. The most visible runtime regressions are found through profiling.

How an eBPF profiler works

The old perf: perf_event_open, a kernel ring buffer, and a sample on each CPU driven by a PMU counter overflow or a timer interrupt. The stack walk goes through frame pointers or DWARF.

An eBPF profiler ([[ebpf-basics|eBPF]]) attaches an eBPF program to a perf event (sample at 100 Hz):

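// Simplified sketch: the counts map and the kstack/ustack buffers are assumed to be declared elsewhere.
int profile(struct bpf_perf_event_data *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;   // upper 32 bits hold the process ID (tgid)
    u64 *count = bpf_map_lookup_elem(&counts, &pid);
    bpf_get_stack(ctx, kstack, sizeof(kstack), 0);                 // kernel stack
    bpf_get_stack(ctx, ustack, sizeof(ustack), BPF_F_USER_STACK);  // user stack
    // aggregate in BPF map
    if (count)
        __sync_fetch_and_add(count, 1);
    return 0;
}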

Advantages:

Stack walking without frame pointers (the profiler performs DWARF-based unwinding inside eBPF, which needs a recent kernel)
Per-cgroup filtering: profiles per Kubernetes pod without knowing the PID
On-host symbolization: the user-space agent reads /proc/PID/maps, finds the ELF, and extracts symbols
CO-RE ([[bpf-co-re|BPF Compile Once Run Everywhere]]): one binary across different kernels
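
The "aggregate in BPF map" step has a user-space counterpart: the agent drains the map and turns the counted stacks into a profile. A minimal Python sketch of that side, assuming samples arrive as tuples of already-symbolized frame names, root first. The fold_samples helper and the sample data are hypothetical and not part of any profiler's API.

```python
from collections import Counter

def fold_samples(samples):
    """Count identical stacks and render them in the 'folded' text format
    (frames joined by ';', root first, then a space and the sample count)."""
    counts = Counter(tuple(stack) for stack in samples)
    # Sort for stable output; each line is "root;...;leaf count".
    return [f"{';'.join(stack)} {n}" for stack, n in sorted(counts.items())]

# Hypothetical samples, root frame first.
samples = [
    ("main", "serve_request", "parse", "regex"),
    ("main", "serve_request", "parse", "regex"),
    ("main", "serve_request", "db", "scan"),
    ("main", "gc_collect", "sweep_old"),
]
folded = fold_samples(samples)
for line in folded:
    print(line)
```

The folded format is the same text form that flamegraph.pl consumes, which is why it is a convenient intermediate representation.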

Flame graphs: how to read them

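Each rectangle is a function. Width = share of CPU time. The Y-axis is the call stack. Classic flame graphs put the root at the bottom with children on top; the sketch below uses the inverted (icicle) layout, with the root at the top, as Grafana and Pyroscope render it.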

┌─────────────────────────────────────────────┐
│                  main()                     │
├─────────────────┬───────────────────────────┤
│  serve_request  │      gc_collect           │
├──────────┬──────┼─────────────┬─────────────┤
│  parse   │ db   │  sweep_old  │ allocate    │
├──┬───────┼──────┼─────────────┼─────────────┤
│j │ regex │ scan │             │             │
└──┴───────┴──────┴─────────────┴─────────────┘
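
Read it as, for example, "regex takes 25% of the whole process's CPU" when the regex box spans a quarter of the total width (the ASCII sketch is not to scale). A wide, flat plateau means a hot leaf function. A tall, narrow tower means a deep call stack with no plateau, which is usually fine.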

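Width can be computed directly from folded stacks: a frame's width is the number of samples whose stack contains it, divided by all samples. A small sketch with made-up counts, chosen so that regex comes out at 25%; the numbers and the frame name json are illustrative, not measurements.

```python
from collections import defaultdict

def frame_shares(folded):
    """Return {function: share of all samples whose stack includes it}."""
    total = 0
    per_frame = defaultdict(int)
    for stack, count in folded.items():
        total += count
        # set() so a recursive function is counted once per sample
        for frame in set(stack.split(";")):
            per_frame[frame] += count
    return {frame: n / total for frame, n in per_frame.items()}

# Hypothetical folded stacks: "root;...;leaf" -> sample count.
folded = {
    "main;serve_request;parse;json": 5,
    "main;serve_request;parse;regex": 25,
    "main;serve_request;db;scan": 10,
    "main;gc_collect;sweep_old": 30,
    "main;gc_collect;allocate": 30,
}
shares = frame_shares(folded)
print(f"regex: {shares['regex']:.0%}, gc_collect: {shares['gc_collect']:.0%}")
```

Here main covers every sample, so it spans the full width, while regex is a leaf: its whole width is time spent in the function itself.
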
A flame graph is sampled, not traced, so the exact numbers are imprecise (±5%), but the proportions hold.
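
How large the sampling error is depends on how many samples the graph is built from. A back-of-the-envelope sketch, assuming samples are independent draws (a simplification) and using the standard error of a proportion, sqrt(p(1-p)/n):

```python
import math

def share_std_error(share, n_samples):
    """Standard error of an observed share, treating samples as independent draws."""
    return math.sqrt(share * (1.0 - share) / n_samples)

# One busy core sampled at 100 Hz yields 100 samples per second.
for seconds in (1, 10, 60):
    n = 100 * seconds
    se = share_std_error(0.25, n)
    print(f"{seconds:>2}s ({n} samples): 25% +/- {se * 100:.1f} points")
```

With one second of samples the error on a 25% frame is a little over 4 percentage points; a minute of samples brings it below one point, which is why longer query ranges give steadier graphs.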

Pyroscope architecture

┌───────────────┐   pull profiles    ┌──────────────┐
│  Pyroscope    │ ─────────────────► │   target     │
│  server       │                    │ (with eBPF   │
│ ┌───────────┐ │                    │  profiler    │
│ │ TSDB-like │ │                    │  agent)      │
│ │ profiles  │ │                    └──────────────┘
│ │  store    │ │                    ┌──────────────┐
│ └───────────┘ │ ◄───── push ────── │   target     │
│ ┌───────────┐ │                    │ (with SDK    │
│ │ Querier   │ │ ◄─── Grafana       │  push agent) │
│ └───────────┘ │       UI           └──────────────┘
└───────────────┘

Storage is similar to [[prometheus-basics|Prometheus TSDB]]: a profile per (service, instance) per timestamp. A query returns a flame graph over a range filtered by labels.
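
A rough mental model of that query path, not Pyroscope's actual storage engine: each stored profile carries labels, a timestamp, and stack counts, and a query selects profiles by label and time range and sums the counts. All names below are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class StoredProfile:
    labels: dict      # e.g. {"service": "checkout", "instance": "pod-a"}
    timestamp: int    # seconds since epoch
    stacks: dict      # folded stack -> sample count

def query(store, matchers, start, end):
    """Merge all profiles whose labels match and whose timestamp is in [start, end)."""
    merged = Counter()
    for p in store:
        in_range = start <= p.timestamp < end
        matches = all(p.labels.get(k) == v for k, v in matchers.items())
        if in_range and matches:
            merged.update(p.stacks)
    return dict(merged)

store = [
    StoredProfile({"service": "checkout", "instance": "pod-a"}, 100, {"main;parse": 7}),
    StoredProfile({"service": "checkout", "instance": "pod-b"}, 110, {"main;parse": 3, "main;db": 5}),
    StoredProfile({"service": "cart", "instance": "pod-c"}, 105, {"main;render": 9}),
    StoredProfile({"service": "checkout", "instance": "pod-a"}, 500, {"main;parse": 100}),
]
result = query(store, {"service": "checkout"}, start=0, end=200)
print(result)
```

The merged counts are what the UI turns into a single flame graph for the selected range.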

Two ingest modes:

Pull eBPF agent (DaemonSet): a single agent profiles everything on the host with no SDK. Best for k8s.
Push SDK: runtime-specific (Java, Go, Python). More accurate for managed runtimes (a pure eBPF profiler cannot symbolize JIT-compiled Java frames on its own).

Labels: every profile is tagged

As in Prometheus, labels define the series:

service.name=checkout
pod=checkout-7f8b9c-q2lx9
namespace=prod
region=us-east-1
version=1.4.2

You can filter in the UI: "a flame graph for version=1.4.2 vs 1.4.1", "a diff between region=us-east vs us-west". That is the power: slice by a label to find an anomaly.
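
A diff view boils down to comparing normalized shares, since two versions rarely have the same number of samples. A sketch with invented numbers; the diff_profiles helper and the data are illustrative only.

```python
def diff_profiles(before, after):
    """Return {stack: change in share of total samples}, largest regression first."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    deltas = {}
    for stack in set(before) | set(after):
        share_before = before.get(stack, 0) / total_before
        share_after = after.get(stack, 0) / total_after
        deltas[stack] = share_after - share_before
    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical folded-stack counts for two releases.
v141 = {"main;parse;regex": 20, "main;db;scan": 30, "main;gc_collect": 50}
v142 = {"main;parse;regex": 90, "main;db;scan": 45, "main;gc_collect": 65}
delta = diff_profiles(v141, v142)
for stack, change in delta.items():
    print(f"{change * 100:+6.1f} points  {stack}")
```

Note that the db and gc_collect stacks collected more raw samples in the newer version but shrank as a share; comparing raw counts instead of shares would have flagged them wrongly.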

The cardinality rules are the same as in Prometheus: a pod label with tens of thousands of values is fine as long as retention is bounded; a request_id label is never acceptable.
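
The reasoning is the same arithmetic as for metrics: the number of series is bounded by the product of distinct values per label. A sketch with hypothetical value counts:

```python
import math

def series_upper_bound(distinct_values):
    """Upper bound on series count: product of distinct values per label."""
    return math.prod(distinct_values.values())

# Hypothetical numbers of distinct values seen during the retention window.
labels = {"service": 50, "region": 4, "version": 3}
base = series_upper_bound(labels)
with_request_id = series_upper_bound({**labels, "request_id": 10_000_000})
print(f"without request_id: {base} series")
print(f"with request_id:    {with_request_id} series")
```

The bound is loose for labels that depend on each other (a pod belongs to exactly one service), but it shows why a single unbounded label dominates everything else.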

Profile types

Pyroscope/Parca collect several types:

Type	What it shows
`process_cpu`	on-CPU sampling, where CPU time goes
`goroutine` (Go)	live goroutines, leak detection
`inuse_space` (Go heap)	currently alloc'd memory
`alloc_space` (Go heap)	total allocated since start
`block` (Go)	blocking on chan/mutex
`mutex` (Go)	contention
`wall_clock`	wall-time (including sleep, IO wait), what blocks
`lock_contention` (Java)	JFR-based

For Java this is automatic through [[opentelemetry|OpenTelemetry profiling]] plus JFR (Java Flight Recorder).

Pyroscope vs Parca vs Polar Signals

Tool	Vendor	Storage	Pitch
Pyroscope	Grafana Labs (acquired 2023)	own TSDB-like	OSS, integrated with Grafana
Parca	Polar Signals	own	OSS, Parquet-on-S3
Polar Signals Cloud	Polar Signals (hosted)	proprietary	paid
GProfiler	Granulate (acquired by Intel)	own	automatic ARM/x86

Pyroscope is the pick for self-hosting with a Grafana stack. Parca is for when you want parquet-friendly queries through DuckDB/SQL.

Continuous profiling vs traditional perf

Property	perf record	continuous profiler
When	ad-hoc (after the incident)	always-on
Overhead	5-10% while recording	1-2% constant
Storage	local file (perf.data)	central, retention
Symbolization	locally	on-agent + central
Multi-host	no	yes, per-pod
Diff between dates	manual, via folded stacks	UI feature
Stack accuracy	DWARF + perf	eBPF + DWARF/frame-pointers

CPU profiling vs memory

For a CPU bottleneck, use the process_cpu type: on-CPU stacks sampled at 100 Hz reveal the hot path.

For a memory leak, use inuse_space (the Go heap profiler) to see what keeps growing after GC; use alloc_space or Java JFR Allocation* events to see where garbage is allocated.
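
Spotting what keeps growing after GC amounts to comparing live-heap snapshots taken some time apart. A sketch, assuming each snapshot maps an allocation site to bytes still in use; the data and the growing_sites helper are hypothetical.

```python
def growing_sites(snapshots, min_growth_bytes):
    """Return allocation sites whose live bytes rose in every consecutive
    snapshot and grew by at least min_growth_bytes overall."""
    suspects = {}
    for site in snapshots[-1]:
        series = [snap.get(site, 0) for snap in snapshots]
        monotonic = all(b > a for a, b in zip(series, series[1:]))
        growth = series[-1] - series[0]
        if monotonic and growth >= min_growth_bytes:
            suspects[site] = growth
    return suspects

# Hypothetical live-heap snapshots (bytes in use per allocation site), oldest first.
snapshots = [
    {"cache.Put": 10_000_000, "http.readRequest": 4_000_000},
    {"cache.Put": 25_000_000, "http.readRequest": 3_500_000},
    {"cache.Put": 60_000_000, "http.readRequest": 4_200_000},
]
suspects = growing_sites(snapshots, min_growth_bytes=1_000_000)
print(suspects)
```

A site that fluctuates around a steady level, like http.readRequest here, is normal churn; only steady growth across snapshots points at a leak.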

Memory profiling overhead is higher (around 3-5%), so it is usually not always-on in production. Turn it on on demand, or when memory.high triggers ([[cgroups-v2-deep|cgroup PSI]]).
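
One way to implement such a trigger is to watch the high counter in the cgroup v2 memory.events file, which increments when the cgroup is throttled for exceeding memory.high. A sketch that parses the file contents and decides whether to start a memory profile. The should_profile helper is hypothetical, and the sample strings stand in for two reads of the real file under /sys/fs/cgroup.

```python
def parse_memory_events(text):
    """Parse cgroup v2 memory.events content ('key value' per line) into a dict."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            events[key] = int(value)
    return events

def should_profile(previous, current):
    """Start memory profiling when the 'high' counter has increased."""
    return current.get("high", 0) > previous.get("high", 0)

# Sample contents standing in for two reads of memory.events a few seconds apart.
before = parse_memory_events("low 0\nhigh 3\nmax 0\noom 0\noom_kill 0\n")
after = parse_memory_events("low 0\nhigh 7\nmax 0\noom 0\noom_kill 0\n")
print(should_profile(before, after))
```

Comparing two reads rather than checking for a non-zero value matters because the counter is cumulative over the cgroup's lifetime.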

Profile-Guided Optimization (PGO)

Go 1.21+ and Rust (rustc's LLVM-based PGO): the compiler takes a production profile as a hint and optimizes hot code (better inlining, devirtualization, branch layout).

Workflow:

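Continuous profiling in production → you collect a CPU profile and save it as default.pgo
Put it in the repo next to main.go
go build picks up default.pgo from the main package directory automatically (-pgo=auto is the default since Go 1.21)
Expect roughly a 2-14% speedup on a typical Go service (the range the Go documentation reports)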

Pyroscope has a pyroscope-pgo exporter for this.

When things go wrong

Stack shows "[unknown]": no debug symbols, or frame pointers are missing (gcc defaults to -fomit-frame-pointer). Build with -fno-omit-frame-pointer, or enable DWARF unwinding in the profiler.
Java stack frames are empty: JIT compilation. You need a JFR-based profiler instead of pure eBPF.
High cardinality: every pod is a new series. Limit retention, or drop the pod label and keep the deployment.
Profiles do not arrive: the eBPF agent needs CAP_SYS_ADMIN on older kernels, or CAP_BPF plus CAP_PERFMON on Linux 5.8+ ([[bpf-co-re|BPF CO-RE]]). On k8s, use securityContext.privileged or the matching capabilities.
The delta between versions looks odd: different cgroup CPU throttling or resource pressure ([[cgroups-v2-deep|cgroup PSI]]) changed execution time, not the code. Control the test environment.
Async/await stacks are broken (Go goroutines, Rust async): the Pyroscope Go SDK plus eBPF handle this; Python asyncio is only partially supported.

When not to use it

A small service below 10K rps: the 1-2% overhead is invisible, but there is also nothing to find. Ad-hoc pprof is enough.
Compliance forbids collecting stack data from production (PII can leak through captured strings): do not use profilers that sample string values.
Embedded or IoT: no resources for an eBPF agent.
