Why continuous profiling
Classic profiling, Brendan Gregg-style: SSH to the host, run perf record, then
perf script | stackcollapse-perf.pl | flamegraph.pl. The hassle:
- Reactive (after the incident, usually too late)
- One host, one process
- A snapshot, not trends
- Needs root, sudo, debug symbols on disk
- Overhead during the measurement, around 5-10% CPU
Continuous profiling: an always-on profiler on every node, sampling at about 100 Hz with 1-2% CPU overhead and sending profiles to central storage. Google described the approach in its 2010 "Google-Wide Profiling" paper; open-source tools such as Pyroscope appeared around 2020.
What you get:
- Always-on flame graphs: open Grafana and see where CPU goes right now, or two weeks ago
- A diff between versions: a flame graph before and after a release to spot a regression
- Correlation with metrics: an alert "p99 latency is rising" → you immediately see the stack behind it
- Root cause in minutes, not days
In 2025 it is a production tool at Cloudflare, Polar Signals, Coinbase, and Pinterest. The most visible runtime regressions are found through profiling.
How an eBPF profiler works
The old perf: perf_event_open and a kernel ring buffer, with a sample taken on each CPU on a PMU counter overflow or a timer interrupt. The stack is walked through frame pointers or DWARF.
An eBPF profiler ([[ebpf-basics|eBPF]]) attaches an eBPF program to a perf event (sample at 100 Hz):
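// Sketch only: includes, map definitions, and the kstack/ustack buffers of
// STACK_BYTES each (kept in a per-CPU map, too large for the BPF stack) are omitted.
int profile(struct bpf_perf_event_data *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 *count = bpf_map_lookup_elem(&counts, &pid);
    bpf_get_stack(ctx, kstack, STACK_BYTES, 0);                // kernel stack
    bpf_get_stack(ctx, ustack, STACK_BYTES, BPF_F_USER_STACK); // user stack
    if (count)
        __sync_fetch_and_add(count, 1); // aggregate in BPF map
    return 0;
}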
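The "aggregate in BPF map" step amounts to counting identical (pid, stack) pairs. A minimal user-space sketch of that idea in Python, not BPF code, assuming samples arrive as already-symbolized frame lists; the sample data and the `aggregate` and `to_folded` helpers are hypothetical:

```python
from collections import Counter

def aggregate(samples):
    """Count identical (pid, stack) pairs, as a hash map keyed by pid and stack would."""
    counts = Counter()
    for pid, frames in samples:
        # Frames are listed root first; a tuple makes the stack hashable.
        counts[(pid, tuple(frames))] += 1
    return counts

def to_folded(counts):
    """Render counts in the folded format: 'frame;frame;frame count'."""
    return sorted(f"{';'.join(stack)} {n}" for (_pid, stack), n in counts.items())

# Hypothetical samples: four timer ticks for pid 42.
samples = [
    (42, ["main", "serve_request", "parse", "regex"]),
    (42, ["main", "serve_request", "parse", "regex"]),
    (42, ["main", "serve_request", "db", "scan"]),
    (42, ["main", "gc_collect", "sweep_old"]),
]
folded = to_folded(aggregate(samples))
```

Counting in the kernel means only the aggregated counts cross into user space, not every individual sample.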
Advantages:
- Frame-pointer-free stack walking (DWARF unwinding in eBPF, in recent kernels)
- Per-cgroup filtering: profiles per Kubernetes pod without knowing the PID
- On-host symbolization: the agent reads /proc/PID/maps, finds the ELF, and extracts symbols
- CO-RE ([[bpf-co-re|BPF Compile Once Run Everywhere]]): one binary across different kernels
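The first step of on-host symbolization can be sketched as follows: map a sampled instruction address to a file and an offset using the /proc/PID/maps format. This is a simplified illustration, not any profiler's actual code; `parse_maps`, `resolve`, and the sample mapping text are hypothetical, and the later step of looking the offset up in the ELF symbol table is left out.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps content into (start, end, file_offset, path) tuples."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue  # anonymous mapping: no backing file to symbolize against
        addr_range, perms, offset, _dev, _inode, path = fields
        if "x" not in perms:
            continue  # only executable mappings hold code
        start, end = (int(x, 16) for x in addr_range.split("-"))
        mappings.append((start, end, int(offset, 16), path))
    return mappings

def resolve(mappings, addr):
    """Translate a sampled address to (path, offset within the file), or None."""
    for start, end, file_offset, path in mappings:
        if start <= addr < end:
            return path, addr - start + file_offset
    return None

# Hypothetical maps content for one process.
SAMPLE_MAPS = """\
55d0c2a00000-55d0c2a01000 r--p 00000000 fd:01 1234 /usr/bin/checkout
55d0c2a01000-55d0c2a05000 r-xp 00001000 fd:01 1234 /usr/bin/checkout
7ffd1c1e0000-7ffd1c201000 rw-p 00000000 00:00 0 [stack]
7f2b10000000-7f2b10021000 rw-p 00000000 00:00 0
"""
hit = resolve(parse_maps(SAMPLE_MAPS), 0x55D0C2A01234)
```

Addresses that fall outside every executable mapping resolve to None, which is one way a frame ends up unsymbolized.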
Flame graphs: how to read them
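Each rectangle is a function. Width = share of CPU time. The Y-axis is the call stack: in a classic flame graph the parent is at the bottom and children on top; the sketch below is drawn inverted (icicle-style), with main() at the top.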
┌─────────────────────────────────────────────┐
│ main() │
├─────────────────┬───────────────────────────┤
│ serve_request │ gc_collect │
├──────────┬──────┼─────────────┬─────────────┤
│ parse │ db │ sweep_old │ allocate │
├──┬───────┼──────┼─────────────┼─────────────┤
│j │ regex │ scan │ │ │
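└──┴───────┴──────┴─────────────┴─────────────┘
Read a box by its width: if regex spans a quarter of the total width, "regex
takes 25% of the whole process's CPU" (the ASCII sketch is not to scale). A wide,
flat plateau means a hot leaf. Tall and narrow means a deep call stack with no
plateau, which is usually fine.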
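Box widths and plateaus can be computed directly from folded stacks. A minimal sketch, assuming input in the "frame;frame count" folded format; `frame_shares` and the sample counts are hypothetical and chosen so that regex is the leaf in 25% of samples:

```python
from collections import Counter

def frame_shares(folded):
    """Return (total, self) sample shares per function from folded stacks."""
    total, self_time = Counter(), Counter()
    n = 0
    for line in folded:
        stack, count = line.rsplit(" ", 1)
        frames, count = stack.split(";"), int(count)
        n += count
        for name in set(frames):        # set(): count recursive frames once per sample
            total[name] += count        # box width: function is somewhere on the stack
        self_time[frames[-1]] += count  # plateau: function is the leaf, actually on CPU
    return ({f: c / n for f, c in total.items()},
            {f: c / n for f, c in self_time.items()})

# Hypothetical samples for the functions in the sketch above.
folded = [
    "main;serve_request;parse;json 5",
    "main;serve_request;parse;regex 25",
    "main;serve_request;db;scan 15",
    "main;gc_collect;sweep_old 30",
    "main;gc_collect;allocate 25",
]
total, self_share = frame_shares(folded)
```

Here main has a total share of 1.0 but no self share at all: it is wide because everything runs under it, not because it burns CPU itself.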
A flame graph is sampled, not traced, so the exact numbers are imprecise (±5%), but the proportions hold.
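As a rough guide to that imprecision, a function's share can be treated as a binomial proportion. This assumes independent samples, which is an idealization, since real samples are correlated in time. With $n$ samples and an observed share $\hat{p}$:

$$
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
$$

For example, 10 seconds of one busy core at 100 Hz gives $n = 1000$. At $\hat{p} = 0.25$, $\mathrm{SE} = \sqrt{0.25 \cdot 0.75 / 1000} \approx 0.014$, or about ±2.7 percentage points at 95% confidence. A minute of samples ($n = 6000$) narrows that to roughly ±1.1 points.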
Pyroscope architecture
┌───────────────┐ pull profiles ┌──────────────┐
│ Pyroscope │ ─────────────────► │ target │
│ server │ │ (with eBPF │
│ ┌───────────┐ │ │ profiler │
│ │ TSDB-like │ │ │ agent) │
│ │ profiles │ │ └──────────────┘
│ │ store │ │ ┌──────────────┐
│ └───────────┘ │ ◄───── push ────── │ target │
│ ┌───────────┐ │ │ (with SDK │
│ │ Querier │ │ ◄─── Grafana │ push agent) │
│ └───────────┘ │ UI └──────────────┘
└───────────────┘
Storage is similar to [[prometheus-basics|Prometheus TSDB]]: a profile per (service, instance) per timestamp. A query returns a flame graph over a range filtered by labels.
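The query model can be illustrated with a toy in-memory store. This is a sketch of the idea only and says nothing about how Pyroscope actually lays out data; `ProfileStore` and its methods are hypothetical names:

```python
from collections import Counter

class ProfileStore:
    """Toy in-memory store: one profile per (label set, timestamp)."""

    def __init__(self):
        self._rows = []  # (timestamp, labels dict, Counter of stack -> samples)

    def ingest(self, timestamp, labels, stacks):
        self._rows.append((timestamp, dict(labels), Counter(stacks)))

    def query(self, start, end, **matchers):
        """Merge all profiles in [start, end) whose labels equal every matcher."""
        merged = Counter()
        for ts, labels, stacks in self._rows:
            in_range = start <= ts < end
            if in_range and all(labels.get(k) == v for k, v in matchers.items()):
                merged.update(stacks)  # sums sample counts per stack
        return merged

store = ProfileStore()
store.ingest(100, {"service": "checkout", "instance": "a"}, {"main;parse": 3, "main;db": 1})
store.ingest(110, {"service": "checkout", "instance": "b"}, {"main;parse": 2})
store.ingest(110, {"service": "cart", "instance": "c"}, {"main;render": 9})
store.ingest(500, {"service": "checkout", "instance": "a"}, {"main;parse": 7})

result = store.query(100, 200, service="checkout")
```

The merged counter is what a flame graph is rendered from: one set of stack counts covering the whole range and every matching instance.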
Two ingest modes:
- Pull eBPF agent (DaemonSet): a single agent profiles everything on the host with no SDK. Best for k8s.
- Push SDK: runtime-specific (Java, Go, Python). More accurate for managed runtimes (a plain eBPF stack walk cannot resolve Java JIT frames).
Labels: every profile is tagged
As in Prometheus, labels define the series:
service.name=checkout
pod=checkout-7f8b9c-q2lx9
namespace=prod
region=us-east-1
version=1.4.2
You can filter in the UI: "a flame graph for version=1.4.2 vs 1.4.1",
"a diff between region=us-east vs us-west". That is the power: slice
by a label to find an anomaly.
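A version-to-version diff boils down to comparing per-stack CPU shares between two label selections. A minimal sketch, assuming each profile is a mapping from folded stack to sample count; `diff_profiles` and the numbers are hypothetical:

```python
def diff_profiles(before, after):
    """Per-stack change in CPU share (after minus before), largest regression first."""
    n_before, n_after = sum(before.values()), sum(after.values())
    delta = {}
    for stack in set(before) | set(after):
        share_before = before.get(stack, 0) / n_before
        share_after = after.get(stack, 0) / n_after
        delta[stack] = share_after - share_before
    return sorted(delta.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample counts per folded stack for two releases.
v141 = {"main;serve;parse": 20, "main;serve;db": 60, "main;gc": 20}
v142 = {"main;serve;parse": 90, "main;serve;db": 60, "main;gc": 50}

ranked = diff_profiles(v141, v142)
```

The sketch compares shares rather than raw counts because two query windows rarely contain the same number of samples. The catch shows up in the db stack: its raw count is unchanged, yet its share drops because other stacks grew.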
The cardinality rules are the same: a pod label with tens of thousands of
values is fine as long as retention is bounded; a request_id label, never.
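The arithmetic behind that rule: the number of distinct series is bounded by the product of per-label value counts. A back-of-the-envelope sketch with hypothetical fleet numbers:

```python
from math import prod

def series_count(label_cardinalities):
    """Upper bound on distinct series: the product of per-label value counts."""
    return prod(label_cardinalities.values())

# Hypothetical fleet: values are the number of distinct values per label.
bounded = {"service": 50, "region": 4, "version": 3}
with_request_id = {**bounded, "request_id": 1_000_000}

bounded_series = series_count(bounded)
unbounded_series = series_count(with_request_id)
```

This is an upper bound, since not every combination occurs in practice. A per-request label still multiplies the bound by the number of requests, which only ever grows.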
Profile types
Pyroscope/Parca collect several types:
| Type | What it shows |
|---|---|
| process_cpu | on-CPU sampling, where CPU time goes |
| goroutine (Go) | live goroutines, leak detection |
| inuse_space (Go heap) | currently alloc'd memory |
| alloc_space (Go heap) | total allocated since start |
| block (Go) | blocking on chan/mutex |
| mutex (Go) | contention |
| wall_clock | wall-time (including sleep, IO wait), what blocks |
| lock_contention (Java) | JFR-based |
For Java this is automatic through [[opentelemetry|OpenTelemetry profiling]] plus JFR (Java Flight Recorder).
Pyroscope vs Parca vs Polar Signals
| Tool | Vendor / hosting | Storage | Pitch |
|---|---|---|---|
| Pyroscope | Grafana Labs (acq 2023) | own TSDB-like | OSS, integrated with Grafana |
| Parca | Polar Signals | own | OSS, Parquet-on-S3 |
| Polar Signals Cloud | hosted | proprietary | $$ paid |
| GProfiler | Granulate (acq Intel) | own | automatic ARM/x86 |
Pyroscope is the pick for self-hosting with a Grafana stack. Parca is
for when you want parquet-friendly queries through DuckDB/SQL.
Continuous profiling vs traditional perf
| Property | perf record | continuous profiler |
|---|---|---|
| When | ad-hoc (after the incident) | always-on |
| Overhead | 5-10% while recording | 1-2% constant |
| Storage | local file (perf.data) | central, retention |
| Symbolization | locally | on-agent + central |
| Multi-host | no | yes, per-pod |
| Diff between dates | manual through folded | UI feature |
| Stack accuracy | DWARF + perf | eBPF + DWARF/frame-pointers |
CPU profiling vs memory
For a CPU bottleneck, use the process_cpu type: sampled at 100 Hz, it shows
the hot path. For a memory leak, use the Go heap profiles or Java JFR
Allocation* events. alloc_space shows where garbage is allocated;
inuse_space shows what keeps growing after GC.
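The difference between the two heap views can be shown with a toy event trace. This is a sketch of the accounting only: the real Go heap profiler samples allocations rather than tracing every one, and `heap_profiles`, the call-site names, and the events are hypothetical.

```python
from collections import Counter

def heap_profiles(events):
    """Build alloc_space and inuse_space per call site from (op, site, bytes) events."""
    alloc_space, inuse_space = Counter(), Counter()
    for op, site, size in events:
        if op == "alloc":
            alloc_space[site] += size  # cumulative: never decreases
            inuse_space[site] += size  # live bytes grow on allocation
        elif op == "free":
            inuse_space[site] -= size  # and shrink when the GC reclaims them
    return alloc_space, inuse_space

# Hypothetical trace: parse_request churns garbage, session_cache never frees.
events = [
    ("alloc", "parse_request", 4096), ("free", "parse_request", 4096),
    ("alloc", "parse_request", 4096), ("free", "parse_request", 4096),
    ("alloc", "session_cache", 1024),
    ("alloc", "session_cache", 1024),
]
alloc_space, inuse_space = heap_profiles(events)
```

parse_request dominates alloc_space (GC pressure) but holds nothing live, while session_cache is small in alloc_space yet is the only site still growing in inuse_space.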
Memory profiling overhead is higher (around 3-5%), so it is usually not left
always-on in production. Turn it on on demand, or when memory.high is breached
([[cgroups-v2-deep|cgroup PSI]]).
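One way to detect that trigger is to poll the cgroup v2 memory.events file and watch its "high" counter. A minimal sketch, assuming the file content has already been read (typically from under /sys/fs/cgroup/); the polling approach and both function names are hypothetical, not a feature of any profiler:

```python
def parse_memory_events(text):
    """Parse cgroup v2 memory.events ('key value' per line) into a dict of ints."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            events[key] = int(value)
    return events

def should_profile(previous, current):
    """True when the 'high' counter grew, i.e. memory.high was breached since the last poll."""
    return current.get("high", 0) > previous.get("high", 0)

# Hypothetical file contents from two consecutive polls.
first_poll = parse_memory_events("low 0\nhigh 3\nmax 0\noom 0\noom_kill 0\n")
second_poll = parse_memory_events("low 0\nhigh 5\nmax 0\noom 0\noom_kill 0\n")

enable = should_profile(first_poll, second_poll)
```

Comparing consecutive polls matters because the counter is cumulative: a non-zero value alone only says the limit was hit at some point in the past.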
Profile-Guided Optimization (PGO)
Go 1.21+, Rust nightly: the compiler takes a production profile as a hint and optimizes hot code (better inlining, branch prediction).
Workflow:
- Continuous profiling in production → you collect default.pgo
- Put it in the repo next to main.go
- go build picks up the PGO profile
- A 2-15% speedup on a typical Go service
Pyroscope has a pyroscope-pgo exporter for this.
When things go wrong
- Stack shows "[unknown]": no debug symbols, or frame pointers are missing (gcc defaults to -fomit-frame-pointer). Build with -fno-omit-frame-pointer, or enable DWARF unwinding in the profiler.
- Java stack frames are empty: JIT compilation. You need a JFR-based profiler instead of pure eBPF.
- High cardinality: every pod is a new series. Limit retention, or drop the pod label and keep the deployment.
- Profiles do not arrive: the eBPF agent needs CAP_SYS_ADMIN and CAP_BPF (bpf-co-re). On k8s, use securityContext.privileged or the matching capabilities.
- The delta between versions looks odd: different [[cgroups-v2-deep|cgroup PSI throttling]] changed execution time, not the code. Control the test environment.
- Async/await stacks are broken: Go goroutines, Rust async. The Pyroscope Go SDK plus eBPF handle this; Python asyncio, partially.
When not to use it
- A small service < 10K rps: the 1-2% overhead is invisible, but there is also nothing to find. Ad-hoc pprof is enough.
- Compliance forbids reading stack traces from production data (PII in strings): do not use sampling string captures.
- Embedded or IoT: no resources for an eBPF agent.