
Linux observability: Prometheus, OpenTelemetry, eBPF profiling

Linux observability is the third pillar of reliability after testing and code review: metrics with Prometheus, distributed traces with OpenTelemetry, logs with Loki/Vector, and continuous profiling with eBPF/Pyroscope. The focus is practical rather than theoretical: real pipelines and the problems of production deployments, including cardinality explosion, defining SLOs and error budgets, and alerting without noise.

7 articles in this category

§ Articles

pyroscope-continuous-profiling
Continuous profiling: Pyroscope, eBPF, flame graphs in production
Continuous profiling is an always-on CPU/memory profiler that runs in production through eBPF, with 1-2% overhead. Flame graphs show the hot path. Tools include Pyroscope (Grafana), Parca, and Polar Signals. It replaces ad-hoc perf sessions for production debugging.

loki-grafana-logging
Loki: label-based logs, LogQL, Promtail/Vector pipeline
Loki is log aggregation with a label-based index rather than a full-text index like Elasticsearch. Storage on S3 is cheap. Promtail and Vector are the agents. LogQL resembles PromQL: filter, parse, aggregate. Label cardinality is the enemy.

metric-types
Metric types: counter, gauge, histogram, summary
There are four metric types: counter (only goes up), gauge (any value), histogram (buckets, used to estimate p99), and summary (quantiles computed in the client). Native histograms (Prometheus 2.40+) use sparse buckets and are gentler on memory. Exemplars link a metric to a trace_id.

metrics-vs-logs-vs-traces
Metrics vs logs vs traces: the three pillars of observability
Metrics are aggregated numbers over time: cheap and suited to alerts. Logs are discrete events with context, used for root-cause analysis. Traces follow a request across services, used for debugging distributed systems. Structure beats volume.

opentelemetry
OpenTelemetry: signals, OTLP, Collector pipeline
OpenTelemetry is the CNCF standard for metrics, traces, and logs in one SDK. The OTLP protocol runs over gRPC or HTTP. The Collector receives, filters, and routes data to Prometheus, Tempo, Loki, or Jaeger. Auto-instrumentation needs no code change.

service-discovery-prometheus
Service discovery in Prometheus: k8s, Consul, file_sd, relabel
Prometheus discovers targets through the k8s API, Consul, or file_sd (target lists kept in files). relabel_configs runs before the scrape and filters targets and rewrites labels. metric_relabel_configs runs after the scrape and drops unwanted metrics. Without relabeling, cardinality from k8s labels explodes.

sli-slo-error-budget
SLI / SLO / error budget: SRE metrics without the noise
An SLI is a user-facing metric (availability, p99 latency). An SLO is a target over a period (99.9% over 30d). The error budget is 1 - SLO and is spent on incidents and releases. Multi-window burn-rate alerting replaces static threshold alerts and produces less noise.
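
The error budget arithmetic from the sli-slo-error-budget entry can be sketched in a few lines of Python. This is a minimal illustration, assuming the 99.9% over 30d SLO mentioned above. The function names, the two-window check, and the 14.4 threshold (the burn rate that spends 2% of a 30-day budget in one hour) are assumptions chosen for the example, not something taken from the article itself.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed 'bad' minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(long_ratio: float, short_ratio: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if BOTH windows burn above the threshold.

    long_ratio / short_ratio are observed error ratios over a long window
    (for example 1h) and a short window (for example 5m). The window sizes
    and the default threshold are assumptions for this sketch.
    """
    return (burn_rate(long_ratio, slo) >= threshold
            and burn_rate(short_ratio, slo) >= threshold)


if __name__ == "__main__":
    slo = 0.999
    print(round(error_budget_minutes(slo, 30), 1))  # 43.2
    print(round(burn_rate(0.02, slo), 1))           # 20.0
    # Long window is burning fast, but the short window has recovered:
    print(should_page(0.02, 0.0005, slo))           # False
```

Requiring both windows to exceed the threshold is what cuts the noise: the long window shows that the burn is significant, and the short window shows that it is still happening.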
