OpenTelemetry in Go — Professional Level¶
Table of Contents¶
- Introduction
- The SDK Architecture, End to End
- TracerProvider Lifecycle and the Provider Registry
- Inside the BatchSpanProcessor
- Exporter Internals: OTLP/gRPC vs OTLP/HTTP
- The Sampler Interface and Custom Samplers
- The Metrics SDK: Readers, Aggregations, Temporality
- The OpenTelemetry Collector
- Graceful Shutdown, Draining, and Signal Handling
- Performance Profile and Allocation Behavior
- A Fully Wired Production Setup
- Edge Cases the Source Reveals
- Operational Playbook
- Summary
Introduction¶
The professional level treats OpenTelemetry-Go not as a set of calls but as a set of cooperating subsystems with lifecycles, buffering semantics, and back-pressure behavior you must understand to run it reliably. The SDK is small and readable; most production incidents come from misunderstanding the processor and exporter lifecycle — dropped spans under load, lost telemetry on shutdown, a Collector pipeline that silently discards data.
This file is for engineers who own the observability platform: who configure the Collector, set sampling and redaction policy, tune the batch processor, and guarantee that the last spans of a deploy actually arrive. After reading you will:
- Know the SDK's component graph and how data flows through it
- Reason about the BatchSpanProcessor queue, batching, and drop behavior precisely
- Understand the OTLP exporters' transport, retry, and timeout semantics
- Implement a custom Sampler correctly
- Understand metric readers, aggregation temporality, and why it matters for Prometheus
- Place and configure a Collector and know what each pipeline stage does
- Implement shutdown that flushes the final telemetry without hanging
OpenTelemetry is conceptually simple — produce spans and metrics, ship them out — but the buffering and lifecycle details govern whether you actually see the telemetry when it matters most: under load and during deploys.
The SDK Architecture, End to End¶
The trace SDK is a pipeline of well-defined components:
otel.Tracer(name) // API: returns a Tracer from the global provider
│ .Start(ctx, name)
▼
TracerProvider // sdk/trace: holds resource, samplers, processors
│ consults Sampler at span start (record? export?)
▼
Span (recording) // accumulates attributes/events until End()
│ .End()
▼
SpanProcessor.OnEnd(span) // BatchSpanProcessor enqueues; SimpleSpanProcessor exports inline
▼
SpanExporter.ExportSpans() // OTLP/stdout: serialize + transmit
▼
Collector / backend
Three facts a professional internalizes:
- The sampler runs at Start, not End. The decision to record-and-export is made when the span begins, based on parent context and the configured Sampler. A non-recording span is nearly free: attributes set on it are dropped.
- OnEnd is the hand-off to the processor. With BatchSpanProcessor, OnEnd is a non-blocking enqueue. With SimpleSpanProcessor, OnEnd calls the exporter synchronously, which is why it is unsuitable for production.
- The exporter is the only network-touching component. Everything upstream is in-process. The exporter owns timeouts, retries, and connection lifecycle.
The metrics SDK mirrors this with a different vocabulary: MeterProvider → Reader (Periodic or manual) → Exporter, with aggregation happening in the reader.
TracerProvider Lifecycle and the Provider Registry¶
Construction is configuration¶
sdktrace.NewTracerProvider(opts...) is pure setup: it stores the resource, the sampler, and the list of span processors. The provider itself starts no export goroutines; those belong to the processors (a BatchSpanProcessor starts its background flush goroutine when it is created).
Registration vs explicit passing¶
There are two ways to make tracers find your provider:
- Global registry: otel.SetTracerProvider(tp). Then otel.Tracer(name) anywhere returns a tracer backed by tp. Convenient; the default style. The catch: until you call it, otel.Tracer returns a no-op that silently drops everything.
- Explicit: pass tp (or a Tracer) into the code that needs it. More testable, no global state, but more plumbing.
Libraries should use the global so they work regardless of how the app wires up. Applications may use either; mixing the two inconsistently is a classic "half my spans vanished" bug.
One provider, many tracers¶
A provider yields many Tracer instances keyed by instrumentation name (and optional version/schema). The name identifies who produced the span (go.opentelemetry.io/contrib/.../otelhttp, github.com/me/mypkg). Backends use it to attribute spans to instrumentation scopes. Use a stable, package-level name — typically the import path.
Shutdown is terminal¶
tp.Shutdown(ctx) flushes and stops all processors, then the provider is dead: further spans go nowhere. Call it once at process end; repeating the call is harmless, but a second call has nothing left to flush. ForceFlush(ctx) flushes without shutting down, for mid-life flushes (batch jobs, pre-deploy drains).
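The terminal nature of Shutdown can be illustrated with a small stand-alone model. This is a sketch, not the SDK's code: the provider type and its methods below are hypothetical and use only the standard library. It shows the two properties described above, namely that a repeated Shutdown is harmless and that spans ending after shutdown are discarded, while a mid-life ForceFlush leaves the provider usable.

```go
package main

import (
	"fmt"
	"sync"
)

// provider is a hypothetical stand-in for a tracer provider; it is not the SDK type.
type provider struct {
	mu       sync.Mutex
	dead     bool
	pending  []string // finished spans waiting to be exported
	exported []string // spans that have left the process
}

// OnEnd accepts a finished span unless the provider has been shut down.
func (p *provider) OnEnd(span string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.dead {
		return false // spans ending after Shutdown go nowhere
	}
	p.pending = append(p.pending, span)
	return true
}

// flushLocked moves pending spans to exported; callers must hold p.mu.
func (p *provider) flushLocked() {
	p.exported = append(p.exported, p.pending...)
	p.pending = nil
}

// ForceFlush exports what is pending and leaves the provider usable.
func (p *provider) ForceFlush() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.flushLocked()
}

// Shutdown flushes once and marks the provider dead; later calls do nothing.
func (p *provider) Shutdown() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.dead {
		return
	}
	p.flushLocked()
	p.dead = true
}

func main() {
	p := &provider{}
	p.OnEnd("request-1")
	p.ForceFlush() // mid-life flush, provider still alive
	p.OnEnd("request-2")
	p.Shutdown()
	p.Shutdown() // harmless repeat
	fmt.Println(len(p.exported), p.OnEnd("late-span")) // prints "2 false"
}
```

The late span is rejected rather than queued, which is why the ordering of server drain and provider shutdown matters.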
Inside the BatchSpanProcessor¶
This is the component whose behavior under load you must understand.
The data structures¶
The BatchSpanProcessor holds:
- a bounded queue (MaxQueueSize, default 2048) of finished spans,
- a batch assembled from the queue up to MaxExportBatchSize (default 512),
- a timer (BatchTimeout, default 5s) that forces a flush even when the batch is not full,
- an export timeout (ExportTimeout, default 30s) bounding each ExportSpans call.
The flow¶
1. OnEnd(span) enqueues the finished span without blocking. If the queue is full, the span is dropped and a dropped-spans counter increments. This is the back-pressure design: protect the application, sacrifice telemetry.
2. A background goroutine drains the queue into batches and calls exporter.ExportSpans(ctx, batch) either when a batch fills or the timer fires.
3. Export errors are logged (via the OTel error handler) and the batch is dropped. BatchSpanProcessor does not itself retry; retry lives in the OTLP exporter (below).
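The flow above can be sketched with a buffered channel. This is a minimal model under stated assumptions, not the processor's actual source: dropQueue, Enqueue, and DrainBatch are hypothetical names, and spans are plain strings. It shows the key design choice, a producer that never blocks and counts what it could not enqueue.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// dropQueue models a bounded queue whose producers never block (hypothetical type).
type dropQueue struct {
	ch      chan string
	dropped atomic.Uint64
}

func newDropQueue(size int) *dropQueue {
	return &dropQueue{ch: make(chan string, size)}
}

// Enqueue returns false and counts a drop when the queue is full.
func (q *dropQueue) Enqueue(span string) bool {
	select {
	case q.ch <- span:
		return true
	default:
		q.dropped.Add(1) // protect the caller, sacrifice the span
		return false
	}
}

// DrainBatch removes up to max spans without blocking.
func (q *dropQueue) DrainBatch(max int) []string {
	batch := make([]string, 0, max)
	for len(batch) < max {
		select {
		case s := <-q.ch:
			batch = append(batch, s)
		default:
			return batch // queue is empty, ship a partial batch
		}
	}
	return batch
}

func main() {
	q := newDropQueue(4)
	for i := 0; i < 6; i++ {
		q.Enqueue(fmt.Sprintf("span-%d", i))
	}
	fmt.Println("dropped:", q.dropped.Load())   // dropped: 2
	fmt.Println("batch:", len(q.DrainBatch(3))) // batch: 3
}
```

In the model, the request path pays only for a channel send attempt; everything slow happens in whoever calls DrainBatch.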
Tuning implications¶
- Bursty high-QPS service: raise MaxQueueSize so transient spikes do not overflow, and possibly MaxExportBatchSize for throughput.
- Latency-sensitive freshness (you want spans in the UI fast): lower BatchTimeout.
- Flaky collector: the export timeout and the exporter's retry govern resilience; the queue size governs how long you can buffer through an outage before dropping.
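The last point is simple arithmetic, sketched below. The function name outageBudget is hypothetical, and the estimate assumes the queue starts empty, exports are fully stalled, and sampled spans arrive at a steady rate; real traffic is burstier, so treat the result as an upper bound.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// outageBudget estimates how long exports can stall before a queue of the
// given size overflows, assuming a steady rate of sampled spans.
func outageBudget(queueSize int, sampledSpansPerSecond float64) time.Duration {
	if sampledSpansPerSecond <= 0 {
		return time.Duration(math.MaxInt64) // nothing arrives, so nothing overflows
	}
	seconds := float64(queueSize) / sampledSpansPerSecond
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	fmt.Println(outageBudget(2048, 512)) // 4s
	fmt.Println(outageBudget(4096, 512)) // 8s
}
```

Doubling the queue doubles the budget, at the cost of holding that many more finished spans in memory.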
The drop you must monitor¶
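Dropped spans are silent unless you watch for them. Register the OTel error handler to see export failures, and watch the processor's dropped-span count through the SDK's debug logging or self-metrics, because queue overflow is counted rather than reported as an error. A service that "lost traces during the incident" usually overflowed its batch queue precisely when traffic spiked, the worst time to be blind.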
Exporter Internals: OTLP/gRPC vs OTLP/HTTP¶
The OTLP exporters are where the network lives.
Common semantics¶
Both otlptracegrpc and otlptracehttp:
- Serialize spans to the OTLP protobuf schema.
- Apply a per-export timeout (WithTimeout).
- Implement retry with backoff for retryable errors (WithRetry): transient network failures and UNAVAILABLE/429-style responses are retried with exponential backoff and a max elapsed time; non-retryable errors (malformed request) are not.
- Support compression (gzip) to cut egress.
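The retry behavior in the list above can be sketched without the exporter. This is a simplified model, not the exporters' implementation: backoffSchedule and retryableHTTPStatus are hypothetical helpers, the intervals in main are illustrative, and the random jitter that real backoff implementations usually add is left out so the schedule is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// retryableHTTPStatus reports whether an OTLP/HTTP response code is treated
// as transient (throttling or a temporarily unavailable server).
func retryableHTTPStatus(code int) bool {
	switch code {
	case 429, 502, 503, 504:
		return true
	}
	return false
}

// backoffSchedule lists the waits between attempts: doubling from initial,
// capped at maxInterval, stopping before the total wait exceeds maxElapsed.
func backoffSchedule(initial, maxInterval, maxElapsed time.Duration) []time.Duration {
	if initial <= 0 || maxInterval <= 0 {
		return nil
	}
	var waits []time.Duration
	var elapsed time.Duration
	for wait := initial; elapsed+wait <= maxElapsed; {
		waits = append(waits, wait)
		elapsed += wait
		wait *= 2
		if wait > maxInterval {
			wait = maxInterval
		}
	}
	return waits
}

func main() {
	fmt.Println(backoffSchedule(5*time.Second, 30*time.Second, time.Minute)) // [5s 10s 20s]
	fmt.Println(retryableHTTPStatus(503), retryableHTTPStatus(400))         // true false
}
```

The max elapsed time is what bounds how long one batch can occupy the export goroutine while the queue behind it keeps filling.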
gRPC vs HTTP¶
| | OTLP/gRPC (otlptracegrpc) | OTLP/HTTP (otlptracehttp) |
|---|---|---|
| Default port | 4317 | 4318 |
| Transport | HTTP/2 + protobuf | HTTP/1.1 + protobuf (or JSON) |
| Streaming/multiplexing | Yes (HTTP/2) | No |
| Proxy/LB friendliness | Needs gRPC-aware infra | Works through any HTTP proxy |
| Typical choice | Service-to-Collector inside the mesh | Edge/constrained environments, simple proxies |
gRPC is the default for in-cluster service→Collector hops; HTTP is the escape hatch where gRPC is inconvenient. The two are functionally equivalent for the data and differ only in operational fit.
TLS and auth¶
Production exporters use WithTLSCredentials and inject auth (headers, mTLS) toward the Collector. WithInsecure is dev-only. The Collector, not the app, typically holds the credentials for the backend, which is another reason to export to a Collector rather than directly to a vendor.
The Sampler Interface and Custom Samplers¶
Sampler is a small interface:
type Sampler interface {
	ShouldSample(parameters SamplingParameters) SamplingResult
	Description() string
}
ShouldSample is called at span Start with the parent context, trace ID, span name, kind, and attributes. It returns a decision: Drop, RecordOnly, or RecordAndSample, plus optional trace-state changes.
Built-in samplers¶
- AlwaysSample()/NeverSample(): the extremes.
- TraceIDRatioBased(fraction): deterministic ratio keyed on the trace ID, so the decision is consistent wherever the same trace ID appears.
- ParentBased(root, opts...): honor the parent's sampled flag if there is a parent; otherwise apply root. The production default: ParentBased(TraceIDRatioBased(r)).
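Why a ratio keyed on the trace ID is consistent across services can be seen in a small model. This is a sketch in the spirit of the built-in samplers, not their source: traceID, ratioSampled, and parentBased are hypothetical names, and the choice of which bytes of the ID to compare is an assumption made for illustration. The point is that the decision is a pure function of the ID, so every service that sees the same ID reaches the same answer.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// traceID is a 16-byte identifier, as in W3C Trace Context.
type traceID [16]byte

// ratioSampled keeps a trace when a number derived from its ID falls below a
// bound proportional to fraction. No randomness is involved at decision time.
func ratioSampled(id traceID, fraction float64) bool {
	if fraction >= 1 {
		return true
	}
	if fraction <= 0 {
		return false
	}
	bound := uint64(fraction * (1 << 63))
	x := binary.BigEndian.Uint64(id[8:16]) >> 1 // 63-bit value taken from the ID
	return x < bound
}

// parentBased honors an existing parent's decision and consults the ratio
// only for root spans.
func parentBased(hasParent, parentSampled bool, id traceID, fraction float64) bool {
	if hasParent {
		return parentSampled
	}
	return ratioSampled(id, fraction)
}

func main() {
	var low, high traceID // low is all zeros
	for i := range high {
		high[i] = 0xFF
	}
	fmt.Println(ratioSampled(low, 0.5), ratioSampled(high, 0.5)) // true false
	fmt.Println(parentBased(true, true, high, 0.5))              // true: parent wins
}
```

Because child spans follow the parent flag, a trace is either kept whole or dropped whole, which avoids partial traces in the backend.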
Writing a custom sampler¶
You might want "drop /health entirely, keep error-prone routes at 100%, and sample everything else at 5%." Implement ShouldSample and inspect the parameters:
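// routeSampler drops health checks and defers every other span to fallback.
type routeSampler struct{ fallback sdktrace.Sampler }

func (s routeSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	if strings.HasPrefix(p.Name, "GET /health") {
		return sdktrace.SamplingResult{Decision: sdktrace.Drop}
	}
	return s.fallback.ShouldSample(p)
}

// Description is the second method the Sampler interface requires.
func (s routeSampler) Description() string { return "routeSampler" }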
Two cautions: (1) keep ShouldSample cheap — it runs on every span start; (2) attribute-based decisions only see attributes available at start, not ones added later. For decisions that need the outcome (error, latency), you need tail sampling in the Collector, not an SDK sampler — the SDK cannot see the future.
The Metrics SDK: Readers, Aggregations, Temporality¶
The metrics SDK has subtleties that bite during Prometheus integration.
Readers¶
A Reader collects aggregated metrics from the SDK:
- PeriodicReader pulls on a timer and pushes to an exporter (OTLP push model).
- The Prometheus exporter is itself a reader that is collected on scrape (pull model).
Aggregation¶
Each instrument has a default aggregation: counters → sum, histograms → explicit-bucket histogram, gauges → last value. Views override aggregation, bucket boundaries, and attribute sets.
Temporality — the gotcha¶
Aggregation temporality is cumulative (running total since process start) or delta (change since last collection).
- Prometheus expects cumulative counters (it computes rates itself). The Prometheus exporter uses cumulative temporality.
- Some push backends prefer delta. The OTLP exporter's temporality is configurable; the wrong choice produces counters that look reset every cycle or rates that double-count.
Getting temporality wrong is a common, confusing metrics bug — "my request rate graph is sawtoothed" or "my counter keeps resetting." Match temporality to the backend: cumulative for Prometheus, delta only where the backend asks for it.
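The difference is easiest to see on one counter collected four times. The sketch below is plain arithmetic, not SDK code; toDelta and toCumulative are hypothetical helpers. The same request counter is shown as a cumulative series and as the delta series a delta-temporality exporter would send.

```go
package main

import "fmt"

// toDelta converts cumulative readings into the change since the previous
// collection (the first delta is measured from zero).
func toDelta(cumulative []int64) []int64 {
	deltas := make([]int64, len(cumulative))
	var prev int64
	for i, v := range cumulative {
		deltas[i] = v - prev
		prev = v
	}
	return deltas
}

// toCumulative rebuilds the running total from a delta series.
func toCumulative(deltas []int64) []int64 {
	totals := make([]int64, len(deltas))
	var sum int64
	for i, d := range deltas {
		sum += d
		totals[i] = sum
	}
	return totals
}

func main() {
	cumulative := []int64{5, 12, 12, 20}
	deltas := toDelta(cumulative)
	fmt.Println(deltas)               // [5 7 0 8]
	fmt.Println(toCumulative(deltas)) // [5 12 12 20]
}
```

A backend that expects cumulative values but receives 5, 7, 0, 8 sees a counter that appears to fall and restart, which is the "keeps resetting" symptom described above.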
Cardinality at the SDK¶
Views are the SDK-level cardinality control: drop an attribute (Stream{AttributeFilter: ...}) so a high-cardinality dimension never becomes thousands of series. This is the cheapest place to fix cardinality — before the data is even exported.
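The effect of dropping one attribute can be counted directly. The sketch below is a model of series identity, not the SDK's aggregation code: attrs, seriesKey, countSeries, and samplePoints are hypothetical, and the attribute names http.route and user.id are chosen for illustration. A series is identified by its full attribute set, so removing a high-cardinality key collapses many series into few.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

type attrs map[string]string

// seriesKey builds a canonical identity from the attributes that survive the
// deny list; two points with the same key aggregate into one series.
func seriesKey(a attrs, deny map[string]bool) string {
	keys := make([]string, 0, len(a))
	for k := range a {
		if !deny[k] {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + a[k] + ";")
	}
	return b.String()
}

// countSeries reports how many distinct series the points produce.
func countSeries(points []attrs, deny map[string]bool) int {
	seen := map[string]struct{}{}
	for _, p := range points {
		seen[seriesKey(p, deny)] = struct{}{}
	}
	return len(seen)
}

// samplePoints generates one measurement per (route, user) pair.
func samplePoints(routes, users int) []attrs {
	var pts []attrs
	for r := 0; r < routes; r++ {
		for u := 0; u < users; u++ {
			pts = append(pts, attrs{
				"http.route": "/route" + strconv.Itoa(r),
				"user.id":    strconv.Itoa(u),
			})
		}
	}
	return pts
}

func main() {
	pts := samplePoints(3, 1000)
	fmt.Println(countSeries(pts, nil))                              // 3000
	fmt.Println(countSeries(pts, map[string]bool{"user.id": true})) // 3
}
```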
The OpenTelemetry Collector¶
The Collector is a standalone process that sits between your services and your backends. It is the keystone of a serious deployment.
Why it exists¶
- Decoupling / portability. Services export OTLP to the Collector; the Collector fans out to Jaeger, Tempo, Prometheus, a vendor. Switching backends is a Collector config change, not a fleet redeploy.
- Central processing. Sampling (including tail sampling), attribute redaction (PII), batching, filtering, and enrichment happen once, centrally, instead of in every service.
- Buffering and resilience. The Collector batches and retries toward backends, absorbing backend outages.
- Protocol translation. Receive OTLP; emit Jaeger, Prometheus, Zipkin, vendor formats.
Pipeline shape¶
A Collector config is receivers → processors → exporters, grouped into pipelines per signal:
receivers:
  otlp:
    protocols: { grpc: {}, http: {} }
processors:
  memory_limiter: {}   # protect the Collector from OOM
  tail_sampling:       # keep errors + slow + sample the rest
    policies: [ ... ]
  attributes:          # redact / hash PII, drop high-cardinality keys
    actions: [ ... ]
  batch: {}            # batch toward backends
exporters:
  otlp/tempo: { endpoint: tempo:4317 }
  prometheus: { endpoint: 0.0.0.0:9464 }
service:
  pipelines:
    traces:  { receivers: [otlp], processors: [memory_limiter, tail_sampling, batch], exporters: [otlp/tempo] }
    metrics: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [prometheus] }
Agent vs gateway¶
A common topology: a Collector agent as a sidecar/daemonset close to each service (cheap local OTLP, fast hand-off), forwarding to a Collector gateway cluster that does the expensive tail sampling and fan-out. The agent keeps per-service export cheap; the gateway centralizes policy.
The memory_limiter processor and the tail_sampling decision window are the two things to size carefully — the Collector is a real service with real capacity limits, and an under-provisioned gateway drops data exactly under load.
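A rough sizing estimate for the decision window follows from the fact that a tail sampler must hold every trace until its decision is made. The sketch below is back-of-the-envelope arithmetic only: tracesInFlight and bufferedBytes are hypothetical helpers, and the spans-per-trace and bytes-per-span figures are assumptions to replace with measurements from your own traffic.

```go
package main

import (
	"fmt"
	"time"
)

// tracesInFlight estimates how many traces are buffered at once when every
// trace waits decisionWait before the sampling decision is taken.
func tracesInFlight(newTracesPerSecond float64, decisionWait time.Duration) int {
	return int(newTracesPerSecond * decisionWait.Seconds())
}

// bufferedBytes gives a rough memory figure for those traces.
func bufferedBytes(traces, spansPerTrace, bytesPerSpan int) int {
	return traces * spansPerTrace * bytesPerSpan
}

func main() {
	inFlight := tracesInFlight(2000, 30*time.Second)
	fmt.Println(inFlight)                          // 60000
	fmt.Println(bufferedBytes(inFlight, 20, 1024)) // 1228800000
}
```

At 2000 new traces per second and a 30-second window, the gateway holds about 60,000 traces, roughly 1.2 GB under these assumptions. The tail_sampling processor's trace limit (num_traces) and the memory_limiter threshold both need to sit above that working set, or traces are evicted before a decision is made.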
Graceful Shutdown, Draining, and Signal Handling¶
The most-lost telemetry is the telemetry of the final requests before a deploy. Correct shutdown ordering prevents that.
func main() {
	ctx := context.Background()
	shutdownTracer, err := initTracer(ctx) // returns tp.Shutdown
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	shutdownMeter, err := initMeter(ctx) // returns mp.Shutdown
	if err != nil {
		log.Fatalf("init meter: %v", err)
	}

	srv := &http.Server{Addr: ":8080", Handler: instrumentedHandler()}
	go func() {
		// ErrServerClosed is the normal result of srv.Shutdown below.
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Printf("http server: %v", err)
		}
	}()

	// Wait for SIGTERM/SIGINT.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
	<-stop

	// 1. Stop accepting, drain in-flight requests (their spans complete).
	drainCtx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	_ = srv.Shutdown(drainCtx)

	// 2. THEN flush telemetry, bounded so a dead collector can't hang us.
	flushCtx, cancel2 := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel2()
	_ = shutdownTracer(flushCtx)
	_ = shutdownMeter(flushCtx)
}
The ordering rules:
1. Drain the server first so in-flight requests finish and their spans are enqueued.
2. Then shut down the providers so the final batch flushes.
3. Bound every shutdown context so an unreachable collector cannot wedge the process (which, in Kubernetes, leads to SIGKILL and more lost telemetry).
Reverse this order — shut down providers before draining — and you lose exactly the spans you most want during a rollout: the ones that show whether the new version is healthy.
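The third rule, bounding every shutdown step, can be exercised in isolation. The sketch below uses only the standard library; flushWithTimeout, shutdownInOrder, hungExporter, and healthyExporter are hypothetical names standing in for the provider shutdown functions. It shows that a flush against an unreachable collector returns with a deadline error instead of hanging, provided the flush honors its context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// flushWithTimeout runs one shutdown step under its own deadline so a hung
// exporter cannot block process exit.
func flushWithTimeout(timeout time.Duration, step func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return step(ctx)
}

// shutdownInOrder runs the steps sequentially and joins their errors.
func shutdownInOrder(timeout time.Duration, steps ...func(context.Context) error) error {
	var errs []error
	for _, step := range steps {
		errs = append(errs, flushWithTimeout(timeout, step))
	}
	return errors.Join(errs...)
}

// hungExporter simulates a flush that waits on an unreachable collector.
func hungExporter(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// healthyExporter simulates a flush that completes immediately.
func healthyExporter(ctx context.Context) error { return nil }

// timedOut reports whether err contains a deadline expiry.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	start := time.Now()
	err := shutdownInOrder(50*time.Millisecond, healthyExporter, hungExporter)
	fmt.Println(timedOut(err), time.Since(start) < time.Second) // true true
}
```

Giving each step its own deadline means a slow first step cannot consume the budget of the second; the total must still fit inside the orchestrator's termination grace period.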
Performance Profile and Allocation Behavior¶
Where the cost actually is:
- A non-recording span (dropped by the sampler) is cheap — no attribute storage, no enqueue. Sampling reduces cost before allocation.
- A recording span allocates for its attribute slice, events, and the span object. High-frequency span creation in a hot loop shows up in allocation profiles.
- BatchSpanProcessor enqueue is a channel/ring-buffer op: cheap and non-blocking.
- Export is amortized across a batch and happens off the request path.
- Metric recording (Add/Record) is lock-light, but the attribute set passed on each call is hashed to find the aggregation series. Passing a fresh []attribute.KeyValue on each call costs an allocation; precompute attribute sets where hot.
Practical guidance: sample to cut the bulk of span cost; reuse attribute sets in hot paths; never use WithSyncer in production; and profile with the same tools as any Go code (pprof) to confirm instrumentation is not a hot spot. See optimize.md for benchmarks and concrete reductions.
A Fully Wired Production Setup¶
A complete, realistic initialization — traces and metrics, OTLP to a Collector, proper resource, sampling, propagator, bounded shutdown.
func initObservability(ctx context.Context) (shutdown func(context.Context) error, err error) {
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceName("checkout"),
			semconv.ServiceVersion(buildVersion),
			semconv.DeploymentEnvironment(env),
		),
		resource.WithHost(), resource.WithProcess(),
	)
	if err != nil {
		return nil, err
	}

	// Traces
	traceExp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(collectorAddr),
		otlptracegrpc.WithTLSCredentials(creds),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(traceExp,
			sdktrace.WithMaxQueueSize(4096),
			sdktrace.WithBatchTimeout(2*time.Second),
		),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05))),
	)

	// Metrics
	metricExp, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint(collectorAddr),
		otlpmetricgrpc.WithTLSCredentials(creds),
	)
	if err != nil {
		_ = tp.Shutdown(ctx) // do not leak the batch processor's goroutine
		return nil, err
	}
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExp)),
		sdkmetric.WithResource(res),
	)

	otel.SetTracerProvider(tp)
	otel.SetMeterProvider(mp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))
	otel.SetErrorHandler(otel.ErrorHandlerFunc(func(e error) {
		log.Printf("otel error: %v", e) // surface export failures and other SDK errors
	}))

	return func(ctx context.Context) error {
		return errors.Join(tp.Shutdown(ctx), mp.Shutdown(ctx))
	}, nil
}
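Note the SetErrorHandler: it routes export failures and other SDK errors into your own logging, where you can alert on them, instead of leaving them in the SDK's default stderr output.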
Edge Cases the Source Reveals¶
- No-op until registered. otel.Tracer/otel.Meter return no-op implementations until SetTracerProvider/SetMeterProvider run. Code "works" and emits nothing.
- Sampler sees start-time attributes only. Attributes added after Start cannot influence the sampling decision; outcome-based selection must be tail sampling.
- OnEnd after Shutdown is a no-op. Spans ending after provider shutdown are dropped; ordering of server-drain vs provider-shutdown matters.
- Batch queue overflow is silent by default. The processor only counts the drops, so watch that count and register an error handler for export failures.
- gRPC exporter against an HTTP port (4318) fails with confusing transport errors; match protocol to port (4317 gRPC / 4318 HTTP).
- Temporality mismatch with the backend yields reset-looking counters; cumulative for Prometheus.
- Resource detectors can block. resource.WithHost()/cloud detectors may do I/O; in constrained environments give them a context with a timeout.
- context.WithoutCancel is needed to keep a span in a goroutine that must outlive a cancelled request context.
- Double span.End() is ignored but signals confused ownership; end once.
These are not facts to memorize but pointers to reach for the (small, readable) SDK source when behavior surprises you.
Operational Playbook¶
| Scenario | Recipe |
|---|---|
| Spans not arriving | Verify SetTracerProvider called; exporter endpoint/port (4317/4318); Shutdown on exit; check error handler logs. |
| Traces break across services | Set TraceContext propagator on both sides; use otelhttp/otelgrpc for transport/handler. |
| Lost spans under load | Raise MaxQueueSize; monitor dropped-span count; ensure WithBatcher not WithSyncer. |
| Lost spans on deploy | Drain server before provider Shutdown; bound shutdown context. |
| Metric series explosion | Add a View dropping the high-cardinality attribute; redact in Collector too. |
| Prometheus counters look reset | Set cumulative temporality; use the Prometheus exporter or correct OTLP temporality. |
| Need errors/slow traces kept | Tail sampling in the Collector, not an SDK sampler. |
| PII in telemetry | Redact via Collector attributes/transform processor; don't record it in the first place. |
| Switch backend | Change the Collector exporter pipeline; services untouched. |
| Collector OOM | Add memory_limiter; size the gateway; cap tail-sampling decision window. |
| Verify overhead | Benchmark hot paths with/without instrumentation; profile with pprof. |
Summary¶
At the professional level, OpenTelemetry-Go is a set of subsystems with explicit lifecycles and buffering semantics. Spans are sampled at start, accumulate while recording, hand off at OnEnd to a processor, and leave the process only through an exporter. The BatchSpanProcessor is the component to understand: a bounded queue that drops under overflow to protect the application, a batch assembled by size or timer, exported off the request path. The OTLP exporters own timeouts, retries, compression, and TLS, and differ only operationally between gRPC (4317, in-mesh) and HTTP (4318, proxy-friendly). The metrics SDK adds readers, aggregations, and the temporality gotcha that must match the backend. The Collector is the keystone — receivers → processors → exporters — centralizing tail sampling, redaction, and fan-out, and making the application vendor-neutral.
The two operational details that most often determine whether you actually have telemetry when you need it are batch-queue sizing (so you do not drop spans under load) and shutdown ordering (drain the server, then flush the providers, with bounded contexts, so you do not lose the final, most-important spans of a deploy). Master the lifecycle and the buffering, register an error handler so failures are visible, and OpenTelemetry becomes a dependable platform rather than a demo that quietly loses data at the worst moment.