Observability: perf, eBPF, metrics, logs

Questions about how to see what the system is doing right now. perf, strace, eBPF, metrics, logs, and traces are tools at different levels. In senior interviews the expectation is that you know which tool fits which situation and what each one costs. Junior questions about basic `top`/`htop`/`journalctl` also live here.

6 questions · ~25 min read

#strace-vs-ebpf

senior · often

What is the difference between strace and eBPF? When do you use each one?

What to answer

`strace` uses ptrace. It stops the process on every syscall, copies the registers into userspace, then hands control back. The overhead is huge (a 10x to 100x slowdown on syscall-heavy workloads), but it works everywhere. eBPF attaches to tracepoints and kprobes in the kernel and runs its bytecode right there, without a context switch into a tracer process. The overhead is small (typically a few percent), but it needs kernel 4.x or newer and BPF capabilities. In production, reach for eBPF (bpftrace, bcc). On a dev machine, where you just want to understand quickly what one command does, reach for strace.
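
A back-of-the-envelope model makes the ptrace cost concrete. This is a minimal sketch, not a measurement: it assumes the traced process is stopped on both entry to and exit from each syscall, and the 10 microsecond cost per stop is an illustrative assumption, as is the function name `traced_slowdown`.

```python
def traced_slowdown(syscalls_per_sec: float,
                    stop_cost_us: float = 10.0,
                    stops_per_syscall: int = 2) -> float:
    """Estimate how many times slower a process runs under a ptrace tracer.

    stop_cost_us is an assumed, illustrative cost of one ptrace stop
    (switch to the tracer, inspect registers, switch back).
    """
    # Extra wall-clock seconds the tracer adds to one second of normal work.
    extra_seconds = syscalls_per_sec * stops_per_syscall * stop_cost_us / 1_000_000
    return 1.0 + extra_seconds


# The slowdown grows linearly with the syscall rate.
for rate in (1_000, 100_000, 1_000_000):
    print(f"{rate:>9} syscalls/s -> {traced_slowdown(rate):.2f}x")
```

The point of the model is the shape, not the numbers: the penalty is paid per syscall, so a quiet CLI tool barely notices strace while a busy server pays for every one of its calls.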

What they want to hear

A senior candidate should:
- explain the ptrace overhead: the tracee stops twice per syscall (on entry and on exit), and each stop costs a context switch to the tracer and back
- point out that strace on a multi-threaded, high-RPS process funnels every thread's syscalls through one single-threaded tracer, which serializes them and wrecks the timings
- say that the eBPF verifier checks that a program terminates and is memory safe before it is loaded, which is why the kernel allows BPF programs to run in production
- name Brendan Gregg's bpftrace one-liners as the canonical set
- mention that perf, ftrace, and the BPF Compiler Collection (BCC) all belong to the Linux tracing family: different entry points onto one shared infrastructure

Pitfalls

✗ Running strace on a production process with thousands of syscalls per second. The process will grind to a halt.
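✗ Assuming eBPF can do everything strace does. It cannot. A BPF tracing program observes syscalls; it is not the mechanism for blocking them (that is what seccomp filters are for).
✗ Not knowing that strace traces only the process or thread it attached to by default. You need `-f` to follow its other threads and child processes.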

Follow-up

? What does `strace -c` do, and why is it the first thing you reach for?
? How does the BPF verifier guarantee that a program terminates?
? How does a kprobe differ from a tracepoint, and which one is more stable?

Deeper in the knowledge base

#perf-flame-graph

senior · sometimes

A service is eating CPU. How do you pin down the function to blame?

What to answer

`perf record -F 99 -p <pid> -g -- sleep 30` samples stack traces 99 times a second for 30 seconds. `perf report` shows the top functions by CPU time. For a visual, use a flame graph (Brendan Gregg). `perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg` produces an SVG where the width of a block is its share of CPU and the height is stack depth. The culprit is the widest flat block near the top.
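
To show what the collapse step in that pipeline does conceptually, here is a minimal sketch that folds sampled stacks into `root;...;leaf count` lines and computes each function's share of on-CPU samples. The sample data and the function names `fold_stacks` and `self_time_share` are hypothetical; the real `stackcollapse-perf.pl` parses `perf script` output, which this sketch does not attempt.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled stacks into 'root;...;leaf' -> sample count."""
    return Counter(";".join(stack) for stack in samples if stack)


def self_time_share(samples):
    """Fraction of samples in which each function was the leaf (on-CPU) frame."""
    leaves = Counter(stack[-1] for stack in samples if stack)
    total = sum(leaves.values())
    return {func: count / total for func, count in leaves.items()}


# Hypothetical samples: root frame first, leaf frame last.
samples = [
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "render"],
]

folded = fold_stacks(samples)
shares = self_time_share(samples)

# One folded line per unique stack, most frequent first.
for stack, count in folded.most_common():
    print(stack, count)
```

In a flame graph built from these samples, `parse_json` would be the wide block sitting on the top edge, because it is the leaf frame in three of the four samples.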

What they want to hear

A senior candidate should:
- explain sampling versus tracing: the perf sampler takes a snapshot of the stack 99 times a second, it does not trace every call
- name `-g` for collecting the call graph (without it you see only the top frame)
- say that readable symbols need debug symbols (`-debuginfo` packages on RHEL, `-dbgsym` on Debian)
- name the flame graph as the standard visualization and read it correctly: the x-axis is the sample population, sorted alphabetically rather than by time, and the y-axis is stack depth, bottom to top
- mention continuous profiling (Pyroscope, Parca) as the production evolution of running perf by hand

Pitfalls

✗ Running perf without `-g` and never figuring out who actually eats the CPU.
✗ Not installing debug symbols. The flame graph will be full of `[unknown]`.
✗ Sampling at 1 Hz or at 999 Hz: too little data in the first case, heavy overhead in the second.

Follow-up

? How does `perf record -e cycles` differ from `-e cpu-clock`?
? What is an off-CPU flame graph, and when do you need it?
? How do you read an inverted flame graph (an icicle graph)?

Deeper in the knowledge base

#use-vs-red-method

intermediate · often

What are the USE and RED methods? When do you apply each one?

What to answer

USE (Brendan Gregg) is for host resources: Utilization (percent busy), Saturation (queueing beyond capacity), and Errors. You apply it to CPU, RAM, disks, and the network. RED (Tom Wilkie) is for services: Rate (RPS), Errors (the error fraction), and Duration (latency). USE answers what is overloaded right now. RED answers whether the service is healthy from the client's point of view. In production you need both: USE for capacity planning, RED for SLOs.
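
The three RED numbers can be computed from raw request records. This is a minimal sketch under stated assumptions: the `Request` record and `red_summary` function are hypothetical, only 5xx responses count as errors, and Duration is reported as a nearest-rank percentile. A real service would usually derive these from histogram metrics instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(requests, window_seconds, percentile=99):
    """Return Rate, Errors, and Duration for one observation window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    rate_rps = len(requests) / window_seconds
    # Assumption: only 5xx responses count as errors.
    error_ratio = sum(r.status >= 500 for r in requests) / len(requests)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: the smallest value with at least
    # `percentile` percent of the samples at or below it.
    rank = max(1, math.ceil(percentile * len(durations) / 100))
    return {
        "rate_rps": rate_rps,
        "error_ratio": error_ratio,
        "duration_ms": durations[rank - 1],
    }


# Hypothetical 10-second window: 98 fast successes and 2 slow failures.
window = [Request(20.0, 200)] * 98 + [Request(900.0, 500)] * 2
summary = red_summary(window, window_seconds=10)
print(summary)
```

The mean latency of this window is 37.6 ms, which looks healthy; the 99th percentile is 900 ms, which is what the affected clients actually see. That is why Duration is normally tracked as a high percentile.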

What they want to hear

A candidate should:
- separate the two levels of abstraction: USE is about hardware and the OS, RED is about application metrics
- give an example: high CPU utilization (USE) is not a problem by itself if the service's RED latency is fine
- mention the four golden signals from Google SRE (latency, traffic, errors, saturation) as another set that overlaps with both
- say that RED maps cleanly onto an SLO (Service Level Objective): the error budget comes from E, the latency target from D
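
The link from the E in RED to an error budget is plain arithmetic. A minimal sketch, assuming a 99.9 percent availability SLO over a 30-day window; the function names and the request counts are illustrative only.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# 99.9 percent over 30 days leaves about 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))

# Hypothetical month: 1M requests, 250 failures, so about 75 percent of the budget is left.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```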

Pitfalls

✗ Watching only USE and ignoring latency. You can sleep through a service degradation while utilization stays low.
✗ Watching only RED and not knowing that CPU at 90 percent can warn you of an overloaded node before clients notice.
✗ Confusing saturation and utilization. Utilization can be 100 percent with saturation at 0 (the CPU is busy but there is no queue).

Follow-up

? What does saturation mean for the CPU? And for a disk?
? How do the golden signals differ from USE plus RED?
? How do you build an SLO and an error budget from RED metrics?

Deeper in the knowledge base

#logs-vs-metrics-vs-traces

intermediate · often

What is the difference between logs, metrics, and traces? When do you use each one?

What to answer

Metrics are aggregated numbers (counter, gauge, histogram) with low cardinality. They are cheap, good for alerting and trends. Logs are event text with context. They are expensive, good for the post-mortem question of what exactly happened. Traces are the cause-and-effect chain of one request across services. They are the most expensive, good for distributed debugging. In production you need all three, but in different proportions: metrics on every scrape, logs sampled, traces at 1 to 10 percent.

What they want to hear

A senior candidate should:
- name cardinality as the key budget: a metric with a user_id label breaks Prometheus, a log line with a user_id does not
- say that OpenTelemetry unifies all three (each used to need a separate stack: Prometheus, Fluentd, Jaeger)
- mention that every request needs a trace_id that flows through ALL components, which requires propagation in every HTTP client and worker
- explain sampling in traces: head-based (the decision is made at the entry point) versus tail-based (the decision is made after completion, with the errors already visible)
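
The difference between the two sampling strategies can be sketched in a few lines. This is an illustration under assumptions, not how any particular tracing backend implements it: `head_sample` and `tail_sample` are hypothetical names, spans are plain dicts with an `error` flag, and the head decision hashes the trace_id so that every service reaches the same verdict without coordination.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at the entry point, from the trace_id alone."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(trace_id: str, spans: list, rate: float) -> bool:
    """Tail-based: decide after the trace completes, so errors are visible."""
    if any(span.get("error") for span in spans):
        return True  # always keep traces that contain a failed span
    return head_sample(trace_id, rate)


# Hypothetical completed trace with one failed span.
failed_trace = [{"name": "GET /checkout", "error": False},
                {"name": "charge_card", "error": True}]

print(head_sample("trace-123", 0.05))                 # may drop the failing trace
print(tail_sample("trace-123", failed_trace, 0.05))   # keeps it because of the error
```

The trade-off shows in the signatures: the head decision needs only the trace_id and is therefore cheap, while the tail decision needs all spans of the trace buffered somewhere until the request finishes.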

Pitfalls

✗ Putting user_id in a metric label. You get a cardinality explosion.
✗ Assuming logs replace metrics. That does not scale on cost.
✗ Failing to pass trace_id between services. The traces break apart.

Follow-up

? What are exemplars in Prometheus, and why do you need them?
? How does head-based sampling differ from tail-based sampling?
? Why does Loki store logs more cheaply than Elasticsearch?

Deeper in the knowledge base

Metrics vs logs vs traces: the three pillars of observability
[[cardinality-explosion]]
[[tracing-basics]]
OpenTelemetry: signals, OTLP, Collector pipeline

#journalctl-filters

junior · sometimes

How do you find a specific error in journald from the last hour?

What to answer

`journalctl --since "1 hour ago" -p err` shows every entry with priority err or higher from the last hour. `-u nginx.service` filters by unit. `_PID=1234` filters by a specific PID. `-f` gives a live tail. `-o json-pretty` gives structured output that is easy to parse through jq. The main advantage of journalctl over `tail /var/log/syslog` is structured fields and filtering without chains of grep.
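
The structured fields are easiest to appreciate when you consume them from a script. A minimal sketch, assuming the line-delimited `-o json` output format (one JSON object per entry); the function name `entries_at_or_above`, the sample entries, and the choice to treat a missing `PRIORITY` as 6 (info) are illustrative assumptions.

```python
import json


def entries_at_or_above(lines, max_priority=3):
    """Yield (unit, message) for entries whose syslog priority is <= max_priority."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # journald exports PRIORITY as a string; lower numbers are more severe.
        # Assumption: entries without PRIORITY are treated as 6 (info).
        priority = int(entry.get("PRIORITY", 6))
        if priority <= max_priority:
            yield entry.get("_SYSTEMD_UNIT", "?"), entry.get("MESSAGE", "")


# Hypothetical entries in the shape journalctl -o json emits.
sample = [
    '{"PRIORITY": "6", "_SYSTEMD_UNIT": "nginx.service", "MESSAGE": "worker started"}',
    '{"PRIORITY": "3", "_SYSTEMD_UNIT": "nginx.service", "MESSAGE": "upstream timed out"}',
    '{"PRIORITY": "4", "_SYSTEMD_UNIT": "cron.service", "MESSAGE": "job took too long"}',
]

for unit, message in entries_at_or_above(sample):
    print(unit, message)
```

In real use the `lines` argument could be `sys.stdin`, with `journalctl --since "1 hour ago" -o json` piped into the script. Filtering with `-p err` on the journalctl side is still cheaper; the sketch only shows that the fields arrive as data rather than as text to grep.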

What they want to hear

A candidate should:
- name the priority levels (0=emerg, 3=err, 4=warning, 7=debug), the syslog severities from RFC 5424
- separate time filters (`--since`, `--until`, `--since today`) from unit filters (`-u`, `--user-unit`)
- say that journald keeps logs across reboots with `Storage=persistent` in `/etc/systemd/journald.conf` (the default `auto` persists only if `/var/log/journal` exists); otherwise they live in RAM under `/run/log/journal` and are lost on reboot
- mention `journalctl --vacuum-size=1G` for cleanup
- name `_TRANSPORT`, `_SYSTEMD_UNIT`, `_PID`, `_UID` as examples of structured fields you can filter on

Pitfalls

✗ Writing `grep ERROR | journalctl`. The order is wrong, and you lose the structured fields.
✗ Not setting up persistent storage. After a reboot the logs are gone.
✗ Not knowing that `journalctl -k` is the kernel ring buffer (dmesg), a separate channel.

Follow-up

? How does `journalctl -k` differ from `dmesg`?
? How do you set up forwarding from journald to syslog or Loki?
? What do you do when the journal grows to 50G and fills the disk?

Deeper in the knowledge base

journalctl: systemd journal
[[cmd-dmesg]]
systemd: the init system and service manager

#cardinality-explosion

senior · sometimes

What is a cardinality explosion in Prometheus and how do you fight it?

What to answer

Each unique combination of (metric_name, labels) is a separate time series. One high-cardinality label (user_id, trace_id, a path with an ID) multiplies the series count by thousands. 10M series is roughly 30GB of RAM on Prometheus. The fix: drop the label through `metric_relabel_configs`, normalize URL paths in code (`/api/users/:id` instead of `/api/users/12345`), or move it to logs and traces (where high cardinality is fine).
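
Path normalization in code can be as small as one regular expression applied before the path is used as a label value. A minimal sketch, assuming IDs appear as purely numeric or UUID path segments; the `normalize_path` helper is hypothetical, and when your web framework exposes the matched route template, using that is the more robust choice.

```python
import re

# A path segment that is all digits or a canonical UUID, followed by "/" or end of path.
_ID_SEGMENT = re.compile(
    r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)"
)


def normalize_path(path: str) -> str:
    """Replace ID-like segments with ':id' so the label keeps a bounded value set."""
    return _ID_SEGMENT.sub("/:id", path)


print(normalize_path("/api/users/12345"))
print(normalize_path("/api/users/12345/orders/7"))
print(normalize_path("/api/v2/health"))
```

Segments such as `v2` are left alone because they are not purely numeric, so version prefixes do not collapse into `:id`.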

What they want to hear

A senior candidate should:
- name the rule of thumb of about 3 KB per series for Prometheus
- explain why cardinality is multiplicative: 2 methods x 5 statuses x 100 endpoints x 1000 users is 1M series, not a sum
- mention `topk(20, count by (__name__)({...}))` as the first diagnostic query when Prometheus hits OOM
- mention VictoriaMetrics and Thanos as scaling answers, but not a replacement for discipline with labels
- say that Loki structured_metadata (Loki 3.0+) is the right place for high-cardinality identifiers
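
The multiplication and the memory rule of thumb are easy to check with a few lines of arithmetic. A minimal sketch using the figures from this card; the 3 KB per series value is the rule of thumb quoted above, not a measured constant, and the helper names are illustrative.

```python
from math import prod

BYTES_PER_SERIES = 3 * 1024  # rule of thumb from this card, not a measurement


def series_count(label_cardinalities: dict) -> int:
    """Worst-case series for one metric: the product of per-label value counts."""
    return prod(label_cardinalities.values())


def estimated_ram_gib(series: int) -> float:
    """Rough Prometheus memory estimate for a given number of active series."""
    return series * BYTES_PER_SERIES / 2**30


labels = {"method": 2, "status": 5, "endpoint": 100, "user_id": 1000}
with_user = series_count(labels)
without_user = series_count({k: v for k, v in labels.items() if k != "user_id"})

print(with_user, without_user)
print(round(estimated_ram_gib(10_000_000), 1), "GiB for 10M series")
```

Dropping the single `user_id` label takes the worst case from a million series down to a thousand, which is the whole argument for fixing labels before buying a bigger server.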

Pitfalls

✗ Adding request_id to a metric label because 'it is handy to filter on'.
✗ Assuming VictoriaMetrics solves cardinality. It buys headroom, but it does not fix the cause.
✗ Skipping route normalization. Every ID in a URL becomes a new label value, and therefore a new series.

Follow-up

? Which PromQL query shows the series count per metric?
? What does `histogram_quantile` do, and why are histograms cardinality-friendly?
? How does Loki structured_metadata differ from regular labels?

Deeper in the knowledge base

[[cardinality-explosion]]
[[prometheus-basics]]
Loki: label-based logs, LogQL, Promtail/Vector pipeline
