OpenTelemetry: signals, OTLP, Collector pipeline

Why OpenTelemetry

Before OTel, every vendor had its own SDK:

Prometheus client for metrics
Jaeger / Zipkin client for traces
Datadog APM agent for everything, but with vendor lock-in
structured logger for logs

Four SDKs, four pipelines, four formats. Changing the backend meant rewriting all the instrumentation in your code.

OpenTelemetry (CNCF, 2019, a merger of OpenCensus and OpenTracing) gives you:

One SDK for all three signals (metrics, traces, logs)
One protocol, OTLP (OpenTelemetry Protocol)
One Collector for transformation and routing
Vendor-neutral code: it does not know where data goes, whether to Prom, Datadog, or cloud monitoring. You switch backends through config.

In 2025 OTel is the de facto standard for new projects. The old Prometheus client still works (the Collector's prometheus receiver can scrape it), but new code is usually written with OTel.

The three signals

Traces (tracing-basics) are the request flow through services. Spans carry parent-child links, and context propagation goes through the [[http2-internals|traceparent header]].
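Context propagation is easier to picture with the header itself in hand. The sketch below parses a W3C `traceparent` value and builds the one a service would send downstream. The names `TraceParent`, `parse_traceparent`, and `child_traceparent` are illustrative, not part of any OTel API, and only the four-field layout is handled.

```python
import re
from typing import NamedTuple, Optional

# Layout: version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff is forbidden, and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent, new_span_id: str) -> str:
    """Header sent downstream: same trace ID, the caller's own span ID."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

The trace ID stays constant across services while the span ID changes at every hop, which is what lets a backend rebuild the parent-child tree.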

Metrics are counters, gauges, and histograms. The semantics match [[metric-types|Prometheus]], but they go through the OTel SDK.

Logs are structured events with automatic correlation: trace_id and span_id are baked into the log record.
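As a rough illustration of what "baked into the log record" means, here is a stdlib-only sketch. The two context variables are a hypothetical stand-in for the span context that the real SDK tracks itself, and `TraceContextFilter`, `JsonFormatter`, and `build_logger` are made-up names.

```python
import contextvars
import json
import logging

# Hypothetical stand-in for the SDK's active span context.
current_trace_id = contextvars.ContextVar("current_trace_id", default="")
current_span_id = contextvars.ContextVar("current_span_id", default="")


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render the record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


def build_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())  # runs before the formatter
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With the IDs on every line, a log backend can jump from a log entry straight to the trace that produced it.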

All three travel through one SDK and one protocol, OTLP. They share the same Resource attributes and trace context, which cuts coupling between pipelines and keeps the data coherent across signals.

Architecture

┌──────────────┐     OTLP gRPC :4317       ┌─────────────┐
│     App      │ ──────────────────────►   │  Collector  │
│ ┌──────────┐ │     OTLP HTTP :4318       │             │
│ │ OTel SDK │ │                           │ ┌─────────┐ │
│ │ ┌──────┐ │ │                           │ │receivers│ │
│ │ │tracer│ │ │                           │ ├─────────┤ │
│ │ │meter │ │ │                           │ │processor│ │
│ │ │logger│ │ │                           │ ├─────────┤ │
│ │ └──────┘ │ │                           │ │exporters│ │
│ └──────────┘ │                           │ └────┬────┘ │
└──────────────┘                           └──────┼──────┘
                                                  │
                                ┌─────────┬───────┼────────┬────────┐
                                ▼         ▼       ▼        ▼        ▼
                            Prometheus  Tempo   Loki    Jaeger   Datadog

OTLP, the protocol

OTLP is a single wire format. It has two transports:

| Transport | Port (default) | When |
| --- | --- | --- |
| gRPC | 4317 | server-to-server, internal, low-latency |
| HTTP/protobuf | 4318 | through a proxy, browser, restrictive networks |
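For the HTTP transport, each signal is posted to its own path under the base endpoint. A small sketch of that mapping, assuming the default per-signal paths; the helper names `otlp_http_urls` and `otlp_grpc_target` are illustrative only:

```python
from typing import Dict

# Default OTLP/HTTP paths, appended to the base endpoint for each signal.
_SIGNAL_PATHS = {"traces": "/v1/traces", "metrics": "/v1/metrics", "logs": "/v1/logs"}


def otlp_http_urls(base_endpoint: str = "http://localhost:4318") -> Dict[str, str]:
    """Per-signal URLs for OTLP over HTTP."""
    base = base_endpoint.rstrip("/")
    return {signal: base + path for signal, path in _SIGNAL_PATHS.items()}


def otlp_grpc_target(host: str = "localhost", port: int = 4317) -> str:
    """gRPC needs only host:port; the signal is selected by the RPC service."""
    return f"{host}:{port}"
```

gRPC needs no path at all, which is why the two transports listen on separate ports.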

The payload is Protocol Buffers. Structure:

ResourceSpans
  ├── Resource (service.name, host.name, k8s.pod.name)
  └── ScopeSpans
      ├── InstrumentationScope (library name + version)
      └── Span[]
          ├── trace_id, span_id, parent_span_id
          ├── name, start_time_unix_nano, end_time_unix_nano
          ├── attributes (key-value)
          ├── events[]
          ├── links[]
          └── status (UNSET / OK / ERROR)

Metrics work the same way (ResourceMetrics → ScopeMetrics → Metric) and so do logs (ResourceLogs → ScopeLogs → LogRecord).
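The nesting above is easier to see as data. The following sketch builds the same structure in OTLP's JSON encoding, which mirrors the protobuf messages. The scope name, the `http.route` attribute, and the helper names `attr` and `build_resource_spans` are invented for illustration, and nothing is sent over the network.

```python
import json
import secrets
import time


def attr(key: str, value: str) -> dict:
    """OTLP encodes each attribute as a key plus a typed value wrapper."""
    return {"key": key, "value": {"stringValue": value}}


def build_resource_spans(service_name: str, span_name: str, duration_ms: int) -> dict:
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    span = {
        "traceId": secrets.token_hex(16),    # 16 bytes, hex-encoded in OTLP/JSON
        "spanId": secrets.token_hex(8),      # 8 bytes, hex-encoded
        "name": span_name,
        "kind": 2,                           # SPAN_KIND_SERVER
        "startTimeUnixNano": str(start_ns),  # 64-bit integers travel as strings in JSON
        "endTimeUnixNano": str(end_ns),
        "attributes": [attr("http.route", "/checkout")],
        "status": {"code": 1},               # STATUS_CODE_OK
    }
    return {
        "resourceSpans": [{
            "resource": {"attributes": [attr("service.name", service_name)]},
            "scopeSpans": [{
                "scope": {"name": "manual-example", "version": "0.1.0"},
                "spans": [span],
            }],
        }]
    }


if __name__ == "__main__":
    print(json.dumps(build_resource_spans("checkout", "GET /checkout", 12), indent=2))
```

The Resource appears once per batch rather than once per span, which is part of why the format stays compact.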

Advantages over the Prometheus text format:

Binary and more compact (3-5x is a rough estimate that depends on the payload)
Push over [[grpc-basics|gRPC]] instead of HTTP polling
One format for all three signals

SDK: auto vs manual instrumentation

With auto-instrumentation, an agent patches libraries at runtime, with no code changes:

Java: -javaagent:opentelemetry-javaagent.jar patches JDBC, Servlet, the Kafka client, gRPC, about 120 libraries in total
Python: opentelemetry-instrument python app.py patches requests, Flask, Django, psycopg2, redis-py
Node.js: --require @opentelemetry/auto-instrumentations-node/register
Go: compiles to a native binary with no runtime agent hook, so you instrument by hand (an eBPF-based agent is in progress)
.NET: the OpenTelemetry .NET automatic instrumentation, located through the OTEL_DOTNET_AUTO_HOME env var

You get traces for HTTP/DB/Kafka without a single line of code. From there you can add manual spans for business logic.

Manual instrumentation uses an explicit API:

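python

import time

from fastapi import FastAPI  # assumed framework; the original shows only @app.get
from opentelemetry import trace, metrics

app = FastAPI()
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

request_counter = meter.create_counter("requests")
duration_histogram = meter.create_histogram("request_duration_ms", unit="ms")

@app.get("/checkout")
def checkout(user_id: str):
    start = time.perf_counter()
    with tracer.start_as_current_span("checkout") as span:
        # user.id is fine on a span, but too high-cardinality for a metric attribute
        span.set_attribute("user.id", user_id)
        request_counter.add(1, {"endpoint": "/checkout"})
        # ... business logic
    elapsed_ms = (time.perf_counter() - start) * 1000
    duration_histogram.record(elapsed_ms, {"endpoint": "/checkout"})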

OTel Collector

A standalone service. Deploy it on every node (DaemonSet) or as a per-cluster gateway.

The config has three component sections (receivers, processors, exporters) plus a service section that wires them into pipelines:

yaml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: app
          static_configs:
            - targets: [app:8080]

processors:
  batch:                    # batches before exporting
    send_batch_size: 8192
    timeout: 200ms
  memory_limiter:           # backpressure; keep it first in every pipeline
    check_interval: 1s      # required; the Collector rejects a zero interval
    limit_mib: 512
  tail_sampling:            # sample by trace condition
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-1pct
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

exporters:
  prometheusremotewrite:
    # single-node VictoriaMetrics listens on 8428; cluster vminsert uses
    # :8480/insert/<accountID>/prometheus/api/v1/write instead
    endpoint: http://victoriametrics:8428/api/v1/write
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:                     # deprecated in recent contrib releases; check your Collector version
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

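Each pipeline is a directed acyclic graph: receivers feed the processors in the listed order, and the last processor feeds the exporters. One Collector can serve all three signals independently.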
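A common mistake is a pipeline that references a component never defined in the top-level sections. The sketch below checks that wiring on a config already loaded into a dict (for example with a YAML parser). `validate_pipelines` is a hypothetical helper, not a Collector API, and it covers only a few of the checks the Collector performs at startup.

```python
from typing import List


def validate_pipelines(config: dict) -> List[str]:
    """Return a list of problems; an empty list means the wiring is consistent."""
    problems: List[str] = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        # Every referenced component must exist in its top-level section.
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind) or {}
            for component in pipeline.get(kind, []):
                if component not in defined:
                    problems.append(
                        f"{pipeline_name}: {kind[:-1]} '{component}' is not defined"
                    )
        # Processors run in the listed order, so the limiter belongs first.
        processors = pipeline.get("processors", [])
        if "memory_limiter" in processors and processors[0] != "memory_limiter":
            problems.append(f"{pipeline_name}: memory_limiter should be the first processor")
        if not pipeline.get("receivers") or not pipeline.get("exporters"):
            problems.append(f"{pipeline_name}: needs at least one receiver and one exporter")
    return problems
```

Running a check like this in CI catches wiring mistakes before a rollout; the Collector's own startup validation remains the authority.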

Resource: describing the source

Resource holds process and host attributes shared by every signal:

service.name=checkout
service.version=1.4.2
service.instance.id=checkout-7f8b9c
k8s.namespace.name=prod
k8s.pod.name=checkout-7f8b9c-q2lx9
host.name=node-12.us-east
cloud.provider=aws
cloud.region=us-east-1

Set it through env vars:

OTEL_SERVICE_NAME=checkout
OTEL_RESOURCE_ATTRIBUTES=service.version=1.4.2,deployment.environment=prod
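To make the precedence between the two variables concrete, here is a simplified sketch of how an SDK might turn them into Resource attributes. `resource_from_env` is an illustrative helper, not the SDK's own function, and real SDKs also merge in detected attributes.

```python
from typing import Dict, Mapping
from urllib.parse import unquote


def resource_from_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Build Resource attributes from the two OTEL_* variables."""
    attributes: Dict[str, str] = {}
    raw = env.get("OTEL_RESOURCE_ATTRIBUTES", "")
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue  # skip empty or malformed entries
        attributes[key.strip()] = unquote(value.strip())  # values may be percent-encoded
    # OTEL_SERVICE_NAME wins over service.name from the attribute list.
    service_name = env.get("OTEL_SERVICE_NAME")
    if service_name:
        attributes["service.name"] = service_name
    # Fallback used when nothing sets the name (some SDKs append the process name).
    attributes.setdefault("service.name", "unknown_service")
    return attributes
```

Calling it with `os.environ` would show what a backend sees for the process; an empty environment yields `unknown_service`.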

In k8s it is auto-injected through the [[opentelemetry-operator-k8s|OTel Operator]] (a sidecar/auto-instrumentation CRD).

OTel vs Prometheus client

| Aspect | Prom client | OTel SDK |
| --- | --- | --- |
| Signals | metrics only | metrics+traces+logs |
| Transport | HTTP pull (`/metrics`) | OTLP push |
| Vendor neutrality | Prom-only | any backend |
| Auto-instrumentation | minimal | broad (agents for Java, Python, Node.js, .NET) |
| Adoption | widest | growing fast |
| Wire format | text/OpenMetrics | protobuf |

You can combine them: the OTel SDK for traces and logs plus the Prom client for metrics. Or the OTel SDK for everything, with the Collector exporting metrics in Prom format.

Sampling: head vs tail

Keeping 100% of traces is not feasible: 10K req/s × 5 spans × 5KB is about 250 MB/s. You need sampling.
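The arithmetic behind that figure, as a small sketch. It uses decimal units (1 MB = 1000 KB), and the function name `trace_volume_mb_per_s` is made up for the example.

```python
def trace_volume_mb_per_s(
    requests_per_s: float,
    spans_per_request: float,
    span_kb: float,
    sample_rate: float = 1.0,
) -> float:
    """Ingest volume in MB/s for a given traffic shape and sampling rate."""
    return requests_per_s * spans_per_request * span_kb * sample_rate / 1000


full = trace_volume_mb_per_s(10_000, 5, 5)        # 250 MB/s unsampled
per_day_tb = full * 86_400 / 1_000_000            # about 21.6 TB per day
sampled = trace_volume_mb_per_s(10_000, 5, 5, sample_rate=0.01)  # 2.5 MB/s at 1%
```

At roughly 21.6 TB per day unsampled, storage and network cost dominate; a 1% sample brings that down to about 216 GB per day.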

Head-based: the sample-or-drop decision happens at the start of the trace (at the edge), and every downstream span honors it. Simple and predictable. The downside is that the decision is made before the outcome is known, so error traces are dropped at the same rate as everything else.
Tail-based: collect the whole trace in the Collector, then decide keep or drop from trace properties (status, latency, attributes). You see every error, but the Collector holds all spans in memory for 5 to 30s, and all spans of a trace must reach the same Collector instance.

Tail-based is preferable when you cannot afford to lose error or slow traces. Use the Tail Sampling Processor in the Collector.
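The difference between the two decisions can be sketched in a few lines. This is a simplified model, not the Collector's implementation: `Span`, `head_sample`, and `tail_keep` are illustrative names, and the latency check looks at the longest single span rather than the full trace duration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: str       # 32 lowercase hex characters
    duration_ms: float
    is_error: bool


def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head decision: depends only on the trace ID, so every service agrees."""
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


def tail_keep(
    spans: List[Span],
    latency_threshold_ms: float = 1000,
    baseline_ratio: float = 0.01,
) -> bool:
    """Tail decision: made once all spans of the trace have been collected."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True                                   # keep every error trace
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True                                   # keep slow traces
    return head_sample(spans[0].trace_id, baseline_ratio)  # 1% of the rest
```

The head sampler never sees `is_error` or `duration_ms`, which is why it cannot favor the traces you most want to keep.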

When things go wrong

OTLP/gRPC connection refused: the Collector is not running, or it is on a different port. The default is 4317. Check the firewall.
A trace is missing even though there was an error: head sampling at 1% dropped it. Use tail sampling with a status_code: ERROR policy.
Collector OOM: the memory_limiter processor is missing, is not first in the pipeline, or its limit is above the available RAM. Set the limit below 80% of the container memory.
Cardinality explosion: metric attributes carry a user-id or request-id. See cardinality-explosion.
Auto-instrumentation broke the app: usually the Java agent and a bytecode-patch conflict. Bump the OTel agent version, or disable a specific instrumentation: OTEL_INSTRUMENTATION_<name>_ENABLED=false.
Span attributes go missing: most often the attributes were set after end(), and an ended span ignores further changes. Set them before end().
service.name = "unknown_service" in Tempo: the Resource is not configured. Set the OTEL_SERVICE_NAME env var.
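For the Collector OOM case, the 80% rule can be turned into numbers with a short helper. This is a sketch: `memory_limiter_settings` is a hypothetical function, and the spike value assumes the processor's documented default of 20% of the limit.

```python
def memory_limiter_settings(container_memory_mib: int, headroom_percent: int = 80) -> dict:
    """Suggest memory_limiter values that leave headroom below the container limit."""
    limit_mib = container_memory_mib * headroom_percent // 100
    return {
        "check_interval": "1s",
        "limit_mib": limit_mib,
        "spike_limit_mib": limit_mib // 5,  # assumed default: 20% of limit_mib
    }
```

For a container with 1000 MiB this suggests a limit of 800 MiB, leaving the rest for the Go runtime and short bursts.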

OTel vs Datadog/New Relic

Vendor APMs (Datadog, New Relic, Splunk) give you an all-in-one with a UI and ML features. But there is vendor lock-in, and replacing one means rewriting.

With OTel you write instrumentation once and send it to Datadog through the Collector's Datadog exporter or the Datadog Agent's OTLP ingest. A year later you point the exporter at Tempo/Loki/Mimir without touching the code.

Cost-wise: OTel plus self-hosted (Tempo, Loki, VictoriaMetrics) can be 5 to 10x cheaper than Datadog at 100 GB/day or more (a rough estimate that depends on retention and pricing tier), but it takes ops investment.
