linuxlab.io
Copyright © 2026 LinuxLab. All rights reserved.

kb/observability ── Observability & monitoring ── intermediate

OpenTelemetry: signals, OTLP, Collector pipeline

OpenTelemetry is the CNCF standard for metrics, traces, and logs in one SDK. The OTLP protocol runs over gRPC or HTTP. The Collector receives, filters, and routes to Prom/Tempo/Loki/Jaeger. Auto-instrumentation needs no code change.

aka: otel, opentelemetry-sdk, otlp, otel-collector

Why OpenTelemetry

Before OTel, every vendor had its own SDK:

  • Prometheus client for metrics
  • Jaeger / Zipkin client for traces
  • Datadog APM agent for everything, but with vendor lock-in
  • structured logger for logs

Four SDKs, four pipelines, four formats. Changing the backend meant rewriting all the instrumentation in your code.

OpenTelemetry (CNCF, 2019, a merger of OpenCensus and OpenTracing) gives you:

  • One SDK for all three signals (metrics, traces, logs)
  • One protocol, OTLP (OpenTelemetry Protocol)
  • One Collector for transformation and routing
  • Vendor-neutral code: it does not know where data goes, whether to Prom, Datadog, or cloud monitoring. You switch backends through config.

In 2025 OTel is the de facto standard for new projects. The old Prometheus client still works (the OTel Collector can scrape its /metrics endpoint), but new code is written with OTel.

The three signals

Traces (tracing-basics) show the request flow through services. Spans carry parent-child links, and context propagates between services in the W3C Trace Context [[http2-internals|traceparent header]].
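To make context propagation concrete, here is a minimal sketch of reading and re-emitting a traceparent header using only the standard library. It assumes the W3C layout `version-traceid-parentid-flags` and accepts only the four-field form; the function names `parse_traceparent` and `child_traceparent` are hypothetical, not part of any OTel SDK.

```python
import re
import secrets
from typing import Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or int(trace_id, 16) == 0 or int(parent_id, 16) == 0:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent: dict) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex chars
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
```

The trace ID survives every hop while the span ID changes, which is what lets a backend stitch spans from different services into one trace. In practice the SDK's propagators do this for you.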

Metrics are counters, gauges, and histograms. The semantics match [[metric-types|Prometheus]], but they go through the OTel SDK.

Logs are structured events with automatic correlation: trace_id and span_id are baked into the log record.
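The log correlation described above can be sketched with the standard logging module. This is an illustration of the idea, not the OTel logging API: `current_span` is a hypothetical stand-in for the SDK's active span context, and the IDs are example values.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-in for the SDK's active span context.
current_span = contextvars.ContextVar("current_span", default=None)


class TraceContextFilter(logging.Filter):
    """Stamp every record with the IDs of the span that is active right now."""

    def filter(self, record):
        ctx = current_span.get()
        record.trace_id = ctx["trace_id"] if ctx else ""
        record.span_id = ctx["span_id"] if ctx else ""
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


def build_logger(stream):
    logger = logging.getLogger("checkout-demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_span.set({"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                  "span_id": "00f067aa0ba902b7"})
log.info("payment authorized")
record = json.loads(buffer.getvalue())
```

With the IDs on every log line, a backend can jump from a log entry to the trace it belongs to without any manual tagging in application code.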

All three travel through one SDK over a single OTLP channel. That cuts coupling and keeps the data coherent.

Architecture

┌──────────────┐     OTLP gRPC :4317       ┌─────────────┐
│     App      │ ──────────────────────►   │  Collector  │
│ ┌──────────┐ │     OTLP HTTP :4318       │             │
│ │ OTel SDK │ │                           │ ┌─────────┐ │
│ │ ┌──────┐ │ │                           │ │receivers│ │
│ │ │tracer│ │ │                           │ ├─────────┤ │
│ │ │meter │ │ │                           │ │processor│ │
│ │ │logger│ │ │                           │ ├─────────┤ │
│ │ └──────┘ │ │                           │ │exporters│ │
│ └──────────┘ │                           │ └────┬────┘ │
└──────────────┘                           └──────┼──────┘
                                                  │
                                ┌─────────┬───────┼────────┬────────┐
                                ▼         ▼       ▼        ▼        ▼
                            Prometheus  Tempo   Loki    Jaeger   Datadog

OTLP, the protocol

OTLP is a single wire format. It has two transports:

| Transport | Port (default) | When |
|---|---|---|
| gRPC | 4317 | server-to-server, internal, low-latency |
| HTTP/protobuf | 4318 | through a proxy, browser, restrictive networks |

The payload is Protocol Buffers. Structure:

ResourceSpans
  ├── Resource (service.name, host.name, k8s.pod.name)
  └── ScopeSpans
      ├── InstrumentationScope (library name + version)
      └── Span[]
          ├── trace_id, span_id, parent_span_id
          ├── name, start_time_unix_nano, end_time_unix_nano
          ├── attributes (key-value)
          ├── events[]
          ├── links[]
          └── status (UNSET / OK / ERROR)

Metrics work the same way (ResourceMetrics → ScopeMetrics → Metric) and so do logs (ResourceLogs → ScopeLogs → LogRecord).
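The nesting is easier to see as data. The sketch below builds one span in the JSON encoding of OTLP, where IDs are hex strings and 64-bit timestamps are strings. The scope name `demo.instrumentation`, the route attribute, and the helper names are illustrative assumptions.

```python
import json
import secrets
import time


def attr(key, value):
    """One OTLP/JSON attribute with a string value."""
    return {"key": key, "value": {"stringValue": str(value)}}


def build_resource_spans(service_name, span_name, duration_ms):
    """Return a ResourceSpans payload holding a single finished span."""
    end = time.time_ns()
    start = end - duration_ms * 1_000_000
    span = {
        "traceId": secrets.token_hex(16),   # 32 hex chars
        "spanId": secrets.token_hex(8),     # 16 hex chars
        "name": span_name,
        "kind": 2,                          # SPAN_KIND_SERVER
        "startTimeUnixNano": str(start),
        "endTimeUnixNano": str(end),
        "attributes": [attr("http.route", "/checkout")],
        "status": {"code": 1},              # STATUS_CODE_OK
    }
    return {
        "resourceSpans": [{
            "resource": {"attributes": [attr("service.name", service_name)]},
            "scopeSpans": [{
                "scope": {"name": "demo.instrumentation", "version": "0.1.0"},
                "spans": [span],
            }],
        }]
    }


payload = build_resource_spans("checkout", "GET /checkout", duration_ms=12)
body = json.dumps(payload)
```

A body like this could be POSTed to the Collector's HTTP receiver at /v1/traces with `Content-Type: application/json`; the protobuf encoding carries the same structure in binary form.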

Advantages over the Prom format:

  • Binary, 3-5x more compact
  • Push-based delivery over [[grpc-basics|gRPC]] (no HTTP polling; each export is a unary call, not a long-lived stream)
  • One format for all three signals

SDK: auto vs manual instrumentation

With auto-instrumentation, an agent patches libraries at runtime, with no code changes:

  • Java: -javaagent:opentelemetry-javaagent.jar patches JDBC, Servlet, the Kafka client, gRPC, around 120 libraries
  • Python: opentelemetry-instrument python app.py patches requests, Flask, Django, psycopg2, redis-py
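  • Node.js: node --require @opentelemetry/auto-instrumentations-node/register app.js
  • Go: compiles to a native binary with no runtime hook for patching libraries, so you wire in instrumentation by hand (an eBPF-based agent is in progress)
  • .NET: a CLR profiler enabled through env vars, including OTEL_DOTNET_AUTO_HOME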

You get traces for HTTP/DB/Kafka without a single line of code. From there you can add manual spans for business logic.
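How an agent can add spans without touching application code comes down to wrapping library functions at load time. The sketch below shows that mechanism in miniature. `fake_http` is a hypothetical stand-in for a real client library, and `recorded_spans` stands in for an exporter; real agents are far more careful about context and error handling.

```python
import functools
import time
import types

recorded_spans = []  # stand-in for an exporter


def instrument(module, func_name, span_name):
    """Replace module.func_name with a wrapper that records a span per call."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = "OK"
        try:
            return original(*args, **kwargs)
        except Exception:
            status = "ERROR"
            raise
        finally:
            recorded_spans.append({
                "name": span_name,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "status": status,
            })

    setattr(module, func_name, wrapper)


# Hypothetical "library" that application code calls directly.
fake_http = types.SimpleNamespace(get=lambda url: f"200 OK from {url}")

# The agent patches the library once at startup...
instrument(fake_http, "get", "HTTP GET")

# ...and application code keeps calling it exactly as before.
response = fake_http.get("http://inventory.internal/items")
```

The caller sees the same return value and the same exceptions; the only difference is that a span is recorded as a side effect.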

Manual instrumentation uses an explicit API:

python
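import time

from fastapi import FastAPI  # assumption: the original only shows @app.get
from opentelemetry import trace, metrics

app = FastAPI()
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("requests")
duration_histogram = meter.create_histogram("request_duration_ms", unit="ms")

@app.get("/checkout")
def checkout(user_id: str):
    start = time.monotonic()
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", user_id)  # fine on a span, too high-cardinality for a metric
        request_counter.add(1, {"endpoint": "/checkout"})
        # ... business logic
    duration_histogram.record((time.monotonic() - start) * 1000, {"endpoint": "/checkout"})
    return {"status": "ok"}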

OTel Collector

A standalone service. Deploy it on every node (DaemonSet) or as a per-cluster gateway.

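The config has three component sections (receivers, processors, exporters) plus a service section that wires them into pipelines. The prometheus receiver, the tail_sampling processor, and the prometheusremotewrite and loki exporters ship in the contrib distribution (otelcol-contrib), not in the core otelcol binary.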

yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: app
          static_configs:
            - targets: [app:8080]
processors:
  batch:                    # batches before exporting
    send_batch_size: 8192
    timeout: 200ms
  memory_limiter:           # backpressure
    check_interval: 1s      # must be set; how often memory usage is measured
    limit_mib: 512
  tail_sampling:            # sample by trace condition
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-1pct
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
exporters:
  prometheusremotewrite:
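    endpoint: http://victoriametrics:8428/api/v1/write  # 8428 is the single-node default; cluster vminsert listens on 8480 under /insert/<accountID>/prometheus/api/v1/write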
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:                     # deprecated in recent contrib releases; Loki 3.x also ingests OTLP directly
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

A pipeline is an acyclic graph. One collector can serve all three signals independently.
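A common config mistake is listing a component in a pipeline without defining it in its section. The check below is a rough sketch of that one rule, run against a config that has already been parsed into a dict (for example by a YAML loader). It is not a substitute for the Collector's own validation, and `validate_pipelines` is a hypothetical helper.

```python
def validate_pipelines(config):
    """Return human-readable problems found in service.pipelines."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for name, pipeline in pipelines.items():
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section) or {}
            for component in pipeline.get(section, []):
                if component not in defined:
                    problems.append(
                        f"{name}: '{component}' is not defined under {section}"
                    )
        # Every pipeline needs a source and a sink; processors are optional.
        for required in ("receivers", "exporters"):
            if not pipeline.get(required):
                problems.append(f"{name}: no {required} listed")
    return problems


config = {
    "receivers": {"otlp": {}},
    "processors": {"memory_limiter": {}, "batch": {}},
    "exporters": {"otlp/tempo": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                # tail_sampling is used here but never defined above
                "processors": ["memory_limiter", "tail_sampling", "batch"],
                "exporters": ["otlp/tempo"],
            },
            "logs": {"receivers": ["otlp"], "processors": ["batch"], "exporters": []},
        }
    },
}

problems = validate_pipelines(config)
```

Processor order inside a pipeline matters because data flows through them left to right, which is why memory_limiter sits first in each pipeline of the config above.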

Resource: describing the source

Resource holds process and host attributes shared by every signal:

service.name=checkout
service.version=1.4.2
service.instance.id=checkout-7f8b9c
k8s.namespace.name=prod
k8s.pod.name=checkout-7f8b9c-q2lx9
host.name=node-12.us-east
cloud.provider=aws
cloud.region=us-east-1

Set it through env vars:

OTEL_SERVICE_NAME=checkout
OTEL_RESOURCE_ATTRIBUTES=service.version=1.4.2,deployment.environment=prod
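How the two variables combine can be sketched as follows. The sketch assumes that OTEL_SERVICE_NAME takes precedence over a service.name entry in OTEL_RESOURCE_ATTRIBUTES and that values may be percent-encoded; `resource_from_env` is a hypothetical helper, and real SDKs merge in further detectors (host, process, cloud).

```python
from urllib.parse import unquote


def resource_from_env(environ):
    """Build resource attributes from the two standard env vars."""
    attrs = {}
    for pair in environ.get("OTEL_RESOURCE_ATTRIBUTES", "").split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue  # skip empty or malformed entries
        attrs[key.strip()] = unquote(value.strip())
    service_name = environ.get("OTEL_SERVICE_NAME")
    if service_name:
        attrs["service.name"] = service_name  # wins over OTEL_RESOURCE_ATTRIBUTES
    # Fallback when nothing names the service (SDKs may append the process name).
    attrs.setdefault("service.name", "unknown_service")
    return attrs


env = {
    "OTEL_SERVICE_NAME": "checkout",
    "OTEL_RESOURCE_ATTRIBUTES": "service.version=1.4.2,deployment.environment=prod",
}
resource = resource_from_env(env)
```

Passing `os.environ` instead of the literal dict reads the real process environment.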

In k8s it is auto-injected through the [[opentelemetry-operator-k8s|OTel Operator]] (a sidecar/auto-instrumentation CRD).

OTel vs Prometheus client

| Aspect | Prom client | OTel SDK |
|---|---|---|
| Signals | metrics only | metrics + traces + logs |
| Transport | HTTP pull (/metrics) | OTLP push |
| Vendor neutrality | Prom-only | any backend |
| Auto-instrumentation | minimal | full |
| Adoption | widest | growing fast |
| Wire format | text/OpenMetrics | protobuf |

You can combine them: the OTel SDK for traces and logs plus the Prom client for metrics. Or the OTel SDK for everything, with the Collector exporting metrics in Prom format.

Sampling: head vs tail

Keeping 100% of traces is not feasible: 10K req/s × 5 spans × 5KB is about 250 MB/s. You need sampling.
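The arithmetic is worth spelling out, since it also shows what a sampling ratio buys. This sketch uses the figures from the paragraph above with decimal units (1 KB = 1000 bytes); the 1% keep ratio is an illustrative assumption.

```python
def trace_volume_bytes_per_s(req_per_s, spans_per_req, bytes_per_span, keep_ratio=1.0):
    """Span bytes produced per second after sampling."""
    return req_per_s * spans_per_req * bytes_per_span * keep_ratio


full = trace_volume_bytes_per_s(10_000, 5, 5_000)
sampled = trace_volume_bytes_per_s(10_000, 5, 5_000, keep_ratio=0.01)

print(f"unsampled: {full / 1e6:.0f} MB/s, {full * 86_400 / 1e12:.1f} TB/day")
print(f"1% kept:   {sampled / 1e6:.1f} MB/s")
```

At 250 MB/s the unsampled stream adds up to about 21.6 TB per day, which is why some form of sampling is needed at this scale.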

  • Head-based: the sample-or-drop decision happens at the start of the trace (at the edge), and every downstream span honors it. Simple and predictable. The downside is that the decision is made before the outcome is known, so error traces are dropped at the same rate as healthy ones.
  • Tail-based: collect the whole trace in the Collector, then decide keep or drop from trace properties (status, latency, attributes). You see every error, but the Collector holds all spans in memory for 5 to 30s.

Tail-based is preferable. Use the Tail Sampling Processor in the Collector.
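The difference between the two strategies shows up clearly in code. The sketch below is a simplified model, not the SDK's or the Collector's actual algorithm: `head_sample` decides from the trace ID alone, while `tail_sample` looks at the finished spans and mirrors the three policy types (status code, latency, probabilistic). The span dict layout is a hypothetical one chosen for the example.

```python
def head_sample(trace_id_hex, ratio):
    """Head sampling: decide from the trace ID, before any span has finished.

    Deterministic, so every service seeing the same trace ID agrees.
    """
    low_64_bits = int(trace_id_hex[16:], 16)
    return low_64_bits < ratio * 2**64


def tail_sample(spans, latency_threshold_ms=1000, baseline_ratio=0.01):
    """Tail sampling: decide once the whole trace has been collected.

    Returns the name of the policy that kept the trace, or None to drop it.
    """
    if any(span["status"] == "ERROR" for span in spans):
        return "errors"
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration > latency_threshold_ms:
        return "slow"
    if head_sample(spans[0]["trace_id"], baseline_ratio):
        return "probabilistic"
    return None


trace_id = "f" * 32  # an ID that a 1% head sampler would drop
failed = [{"trace_id": trace_id, "status": "ERROR", "start_ms": 0, "end_ms": 40}]
slow = [{"trace_id": trace_id, "status": "OK", "start_ms": 0, "end_ms": 1500}]
healthy = [{"trace_id": trace_id, "status": "OK", "start_ms": 0, "end_ms": 40}]

head_decision = head_sample(trace_id, 0.01)
tail_decisions = [tail_sample(t) for t in (failed, slow, healthy)]
```

For this trace ID the head sampler drops the trace whether or not it failed, while the tail sampler keeps the failed and slow variants and drops only the healthy one. The price is that all spans of a trace must be buffered until the decision is made.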

When things go wrong

  • OTLP/gRPC connection refused: the Collector is not running, or it is on a different port. The default is 4317. Check the firewall.
  • A trace is missing even though there was an error: head sampling at 1% dropped it. Use tail sampling with a status_code: ERROR policy.
  • Collector OOM: the memory_limiter processor is missing, or the limit is above RAM. Add a limit below 80% of the container memory.
  • Cardinality explosion: metric attributes carry a user-id or request-id. See cardinality-explosion.
  • Auto-instrumentation broke the app: usually a bytecode-patching conflict between the Java agent and another library or agent. Bump the OTel agent version, or disable a specific instrumentation: OTEL_INSTRUMENTATION_<name>_ENABLED=false.
  • Span attributes go missing: attributes were set after end(), which the SDK ignores, or the span exceeded the attribute count limit (128 by default). Set them before .end().
  • service.name = "unknown_service" in Tempo: the OTEL_SERVICE_NAME env var is missing, so the Resource falls back to the SDK default.
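For the cardinality item in the list above, a quick estimate shows why one attribute can do so much damage. The upper bound on time series for a metric is the product of the distinct values of each attribute. The attribute names and counts below are illustrative assumptions.

```python
import math


def series_count(attribute_cardinalities):
    """Upper bound on time series for one metric."""
    return math.prod(attribute_cardinalities.values())


safe = {"endpoint": 20, "method": 4, "status_class": 5}
risky = {**safe, "user.id": 100_000}

safe_series = series_count(safe)
risky_series = series_count(risky)
```

Here a single user.id attribute turns 400 series into 40 million. Identifiers like that belong on spans and log records, not on metric attributes.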

OTel vs Datadog/New Relic

Vendor APMs (Datadog, New Relic, Splunk) give you an all-in-one with a UI and ML features. But there is vendor lock-in, and replacing one means rewriting.

With OTel you write instrumentation once and send it to Datadog over OTLP (the Datadog Agent accepts OTLP, and the Collector has a Datadog exporter). A year later you point the exporter at Tempo/Loki/Mimir without touching the code.

Cost-wise: OTel plus a self-hosted stack (Tempo, Loki, VictoriaMetrics) can be 5 to 10x cheaper than Datadog at 100 GB/day or more, but it takes ops investment.

§ commands

bash
opentelemetry-instrument --service_name=myapp python app.py

Auto-instrumentation for Python: HTTP, DB, and Kafka traces with no code changes

bash
java -javaagent:opentelemetry-javaagent.jar -Dotel.service.name=myapp -jar app.jar

Java auto-agent: patches JDBC, Servlet, gRPC, and around 120 libraries at startup

bash
otelcol --config=/etc/otelcol/config.yaml

Start the Collector. The logs show config errors and pipeline problems

bash
curl -X POST -H 'Content-Type: application/x-protobuf' --data-binary @trace.pb http://collector:4318/v1/traces

Manual OTLP HTTP push, for debugging or CI tests

bash
OTEL_TRACES_EXPORTER=console opentelemetry-instrument python app.py

Debug mode: traces go to stdout instead of being sent, so you see the span structure

bash
otelcol validate --config=/etc/otelcol/config.yaml

Check the Collector config without starting it, for CI

bash
curl -s localhost:13133/  # health check

OTel Collector health endpoint (needs the health_check extension enabled; 13133 is its default port): 200 when the Collector is ready, 503 when it is not

§ see also

  • metric-types
    Metric types: counter, gauge, histogram, summary
    Four metric types: counter (up only), gauge (any value), histogram (buckets for p99), summary (quantile in the client). Native histogram (Prom 2.40+) uses sparse buckets and is gentler on memory. Exemplars link a metric to a trace_id.
  • grpc-basics
    gRPC: HTTP/2 + Protobuf RPC Framework
    gRPC = HTTP/2 + Protocol Buffers + code generation. Four RPC types: unary (like REST), server-stream, client-stream, bidirectional. Strong typing, binary wire format, multi-language support. grpcurl is curl for gRPC.
  • http2-internals
    HTTP/2: Binary Framing, HPACK, Stream Multiplexing
    HTTP/2 is binary multiplexing over a single TCP connection. HPACK compresses headers through an indexed dictionary. Streams are independent. Server push is deprecated. On a loss-prone link, HoL blocking is a real problem, solved by QUIC.
  • metrics-vs-logs-vs-traces
    Metrics vs logs vs traces: the three pillars of observability
    Metrics are aggregated numbers over time, cheap, for alerts. Logs are discrete events with context, for root-cause. Traces are request flow across services, for distributed debug. Structure beats volume.