OpenTelemetry in Go — Find the Bug
Each snippet contains a real-world bug in OpenTelemetry-Go instrumentation. OTel produces traces, metrics, and logs through a TracerProvider/MeterProvider you wire up once, an exporter (OTLP/stdout), and a propagator for cross-service context. Most bugs come from broken context threading, missing lifecycle calls, or cardinality/cost mistakes. Find the bug, explain it, fix it.
Bug 1 — Child span passed the wrong context
func handler(ctx context.Context) {
ctx, span := tracer.Start(ctx, "handler")
defer span.End()
doWork(context.Background()) // child should be under "handler"
}
func doWork(ctx context.Context) {
_, span := tracer.Start(ctx, "do-work")
defer span.End()
}
Bug: doWork is called with context.Background(), not the ctx returned by Start. The parent link lives in context.Context, so do-work becomes a separate root trace with its own trace ID instead of a child of handler.
Fix: thread the ctx returned by Start into the call, i.e. doWork(ctx) instead of doWork(context.Background()).
This is the single most common OTel bug. Shadowing ctx on the Start line makes passing the right one automatic.
Bug 2 — Provider never registered
func main() {
exp, _ := stdouttrace.New()
tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
defer tp.Shutdown(context.Background())
// ... business code uses otel.Tracer("x").Start(...)
}
Bug: otel.SetTracerProvider(tp) is never called. The global otel.Tracer(...) therefore returns a no-op tracer that silently drops every span. The program runs fine and emits nothing.
Fix: register the provider globally by calling otel.SetTracerProvider(tp) right after it is constructed.
If you prefer not to use the global, pass tp.Tracer(...) explicitly everywhere — but then otel.Tracer must not be used.
Bug 3 — No Shutdown, last spans lost
func main() {
tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
otel.SetTracerProvider(tp)
runServerOnce() // creates spans, then returns
// main returns here
}
Bug: No tp.Shutdown(ctx). With WithBatcher, spans are buffered and flushed on a timer or at shutdown. When main returns before the batch timer fires, the buffered spans are never exported.
Fix:
defer func() {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
_ = tp.Shutdown(ctx)
}()
Bound the context so a dead collector can't hang shutdown forever.
Bug 4 — RecordError without SetStatus
ctx, span := tracer.Start(ctx, "charge")
defer span.End()
if err := charge(ctx); err != nil {
span.RecordError(err)
return err
}
Bug: RecordError adds an exception event but does not change the span's status. The status stays Unset, which most UIs render the same as a success (green). Alerting and dashboards that key on span status treat this failed operation as a success.
Fix: also set the status by calling span.SetStatus(codes.Error, err.Error()) next to RecordError.
Bug 5 — Propagator never set, so traces don't cross services
func main() {
otel.SetTracerProvider(tp)
// missing: otel.SetTextMapPropagator(...)
client := http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}
_, _ = client.Get("http://downstream/api")
}
Bug: otelhttp.NewTransport injects trace headers using the global propagator, but the default global propagator is a no-op. Without SetTextMapPropagator, no traceparent header is written, and the downstream service starts a brand-new trace.
Fix:
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{}, propagation.Baggage{},
))
Set it on both the caller and the callee.
Bug 6 — WithSyncer in production
tp := sdktrace.NewTracerProvider(
sdktrace.WithSyncer(otlpExporter), // exports on the request path!
)
Bug: WithSyncer installs a SimpleSpanProcessor that exports each span synchronously when it ends — a network round-trip to the collector on every span.End(), on the request hot path. Under load, request latency is now coupled to collector latency, and a slow collector slows every request.
Fix: use the batch processor in production, i.e. sdktrace.WithBatcher(otlpExporter) in place of WithSyncer.
Reserve WithSyncer for tests and the stdouttrace exporter.
Bug 7 — High-cardinality attribute on a metric
var reqs, _ = meter.Int64Counter("http.requests") // package level, so var rather than :=
func handler(w http.ResponseWriter, r *http.Request) {
reqs.Add(r.Context(), 1,
metric.WithAttributes(attribute.String("user.id", userID(r))))
}
Bug: user.id is unbounded. Every unique user creates a new metric time series. With millions of users, the metrics backend OOMs and the bill explodes — a classic cardinality incident.
Fix: remove the unbounded dimension from the metric. Keep low-cardinality dimensions only; put the user ID on the span instead:
reqs.Add(r.Context(), 1,
metric.WithAttributes(attribute.String("route", "/checkout")))
// user.id belongs on the span (one record per trace), not the metric.
Defense in depth: an SDK View with an AttributeFilter can drop user.id centrally.
Bug 8 — Goroutine uses a cancelled request context
func handler(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
go func() {
_, span := tracer.Start(ctx, "async-job") // ctx cancelled when handler returns
defer span.End()
slowWork(ctx) // aborted early
}()
w.Write([]byte("accepted"))
}
Bug: r.Context() is cancelled when the handler returns. The background goroutine's ctx is dead almost immediately, so slowWork is aborted and async-job is truncated.
Fix: keep the span/trace link but drop cancellation with context.WithoutCancel (Go 1.21+):
bgCtx := context.WithoutCancel(r.Context())
go func() {
_, span := tracer.Start(bgCtx, "async-job")
defer span.End()
slowWork(bgCtx)
}()
Bug 9 — Mismatched OTLP transport and port
exp, _ := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("collector:4318"), // HTTP port!
otlptracegrpc.WithInsecure(),
)
Bug: otlptracegrpc speaks OTLP/gRPC, whose default port is 4317. Port 4318 is OTLP/HTTP. Pointing the gRPC exporter at the HTTP port yields opaque connection/transport errors and no spans.
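Fix: match exporter to port. Either keep otlptracegrpc and point it at collector:4317, or switch to otlptracehttp for collector:4318.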
Bug 10 — Span per loop iteration in a hot path
ctx, span := tracer.Start(ctx, "process-batch")
defer span.End()
for _, item := range millionItems {
_, s := tracer.Start(ctx, "process-item") // a million spans
process(ctx, item)
s.End()
}
Bug: Creating a span per element of a million-item loop floods the backend with a million spans per batch, adds per-iteration allocation, and makes the trace unreadable.
Fix: span the batch, not each item; record per-item facts as attributes/events or counts:
ctx, span := tracer.Start(ctx, "process-batch")
defer span.End()
span.SetAttributes(attribute.Int("batch.size", len(millionItems)))
for _, item := range millionItems {
process(ctx, item)
}
If per-item visibility is essential, sample or use a metric (a counter), not a span each.
Bug 11 — Provider shut down before the server drains
Bug: Ordering is reversed: tp.Shutdown runs before srv.Shutdown. The provider is shut down (and becomes terminal) before in-flight requests finish. Spans from the final requests, exactly the ones that show whether the new deploy is healthy, end after the provider is dead and are dropped.
Fix: drain the server first, then flush the providers:
<-sigterm
srv.Shutdown(drainCtx) // let in-flight requests finish; their spans enqueue
tp.Shutdown(flushCtx) // THEN flush the final batch
Bound both contexts with timeouts.
Bug 12 — Library imports the SDK directly
// inside a reusable library package
import sdktrace "go.opentelemetry.io/otel/sdk/trace"
func init() {
tp := sdktrace.NewTracerProvider(/* ... */)
otel.SetTracerProvider(tp) // library hijacks the global provider!
}
Bug: A library configures the SDK and overwrites the application's global TracerProvider in its init. Now the app's carefully configured exporter/sampler is replaced by the library's defaults — or whichever init runs last wins. Libraries must not own the SDK.
Fix: libraries depend on the API only and never set the global provider:
import "go.opentelemetry.io/otel"
var tracer = otel.Tracer("github.com/me/mylib")
// no SDK, no SetTracerProvider — the application wires that up.
Bug 13 — Dynamic value baked into the span name
Bug: The user ID is interpolated into the span name (for example "GET /users/42"), making span names unbounded (one per user). Backends group by span name; this fragments the data so you can never aggregate "GET /users/:id" latency, and it inflates index cardinality.
Fix: use a low-cardinality route template as the name and put the ID in an attribute:
ctx, span := tracer.Start(ctx, "GET /users/:id")
defer span.End()
span.SetAttributes(attribute.Int("user.id", userID))
Bug 14 — Metric temporality wrong for Prometheus
// Pushing OTLP metrics with DELTA temporality to a Prometheus-backed store
exp, _ := otlpmetricgrpc.New(ctx,
otlpmetricgrpc.WithTemporalitySelector(deltaSelector),
)
Bug: Prometheus expects cumulative counters (it computes rates itself). With delta temporality, each export reports only the change since the last collection, so the Prometheus-facing store sees counters that appear to reset every cycle — rates are wrong/sawtoothed.
Fix: use cumulative temporality for Prometheus-style backends (the default, or the Prometheus exporter which is inherently cumulative):
// Use exporters/prometheus, or cumulative temporality on the OTLP exporter.
exp, _ := prometheus.New()
mp := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exp))
Bug 15 — WithInsecure shipped to production
exp, _ := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("otel-collector.prod:4317"),
otlptracegrpc.WithInsecure(), // plaintext in production
)
Bug: WithInsecure() disables TLS. In production this sends telemetry — which may contain attributes derived from request data — in plaintext over the network, and lets any party impersonate the collector endpoint.
Fix: use TLS (and auth) toward the collector:
exp, _ := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("otel-collector.prod:4317"),
otlptracegrpc.WithTLSCredentials(credentials.NewClientTLSFromCert(pool, "")),
)
WithInsecure is for local dev only.
Bug 16 — PII recorded in a span attribute
span.SetAttributes(
attribute.String("user.email", req.Email),
attribute.String("request.body", string(rawBody)),
attribute.String("auth.token", token),
)
Bug: Email, raw request body, and auth token are PII/secrets. Attributes are exported and stored in the telemetry backend, often with broad access. This is a compliance and security leak.
Fix: never record secrets/PII; redact or hash, and record only what you need:
span.SetAttributes(
attribute.String("user.id", hash(req.Email)), // pseudonymous
attribute.Int("request.body_bytes", len(rawBody)),
)
Add a Collector attributes/redaction processor as a backstop across all services.
Bug 17 — Mixing a global and an explicit provider
otel.SetTracerProvider(globalTP) // global
router := chi.NewRouter()
router.Use(otelhttp.NewMiddleware("svc")) // uses the GLOBAL provider
svc := NewService(WithTracerProvider(otherTP)) // explicit, different pipeline
Bug: Middleware spans go to globalTP; service spans go to otherTP. The two pipelines have different exporters/samplers, so a single request's spans are split across backends — "half my spans are missing."
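Fix: pick one strategy. Either everything uses the global provider (construct the service without WithTracerProvider(otherTP)), or thread the explicit provider into the middleware too, for example with otelhttp.WithTracerProvider. Don't mix.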
Bug 18 — Sampler expected to see the error
// Custom sampler that tries to keep error traces:
func (s sampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
for _, kv := range p.Attributes {
if kv.Key == "error" && kv.Value.AsBool() {
return sdktrace.SamplingResult{Decision: sdktrace.RecordAndSample}
}
}
return sdktrace.SamplingResult{Decision: sdktrace.Drop}
}
Bug: ShouldSample runs at span start, before the request executes. The error attribute is set later, during/after the work — so it is not present in p.Attributes at sampling time. The sampler can never see it; error traces are dropped.
Fix: outcome-based selection (keep errors, keep slow) must be tail sampling in the Collector, which sees the completed trace. The SDK sampler can only use start-time information:
processors:
tail_sampling:
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
Bug 19 — Fresh attribute slice allocated on every metric call
func record(ctx context.Context, route, method, status string) {
latency.Record(ctx, elapsed,
metric.WithAttributes(
attribute.String("route", route),
attribute.String("method", method),
attribute.String("status", status),
)) // new []KeyValue every call, in a hot path
}
Bug: In a hot path, building a fresh []attribute.KeyValue (and the WithAttributes option) on every call allocates and forces an attribute-set hash each time. Under high QPS this shows up in allocation profiles.
Fix: precompute the attribute set / option for the common combinations:
var okGetAttrs = metric.WithAttributeSet(attribute.NewSet(
attribute.String("route", "/checkout"),
attribute.String("method", "GET"),
attribute.String("status", "ok"),
))
latency.Record(ctx, elapsed, okGetAttrs)
Bug 20 — No error handler, drops are invisible
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exp, sdktrace.WithMaxQueueSize(512)),
)
otel.SetTracerProvider(tp)
// no otel.SetErrorHandler(...)
Bug: Under a traffic spike, the 512-span queue overflows and the BatchSpanProcessor drops spans without reporting anything to the caller. Export failures go only to the global error handler, and the default handler merely writes them to stderr. Unless something collects and alerts on those errors, you have no signal that telemetry is being lost, and you discover it only when a trace is missing during an incident.
Fix: register an error handler and size the queue for your QPS:
otel.SetErrorHandler(otel.ErrorHandlerFunc(func(e error) {
log.Printf("otel error: %v", e) // surface drops/export failures
}))
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exp, sdktrace.WithMaxQueueSize(4096)),
)
Monitor dropped-span counts so overflow is visible before an incident.
Bug 21 — Confusing OTel with runtime/trace
// Goal: "find why this service is GC-thrashing and blocking on a mutex"
ctx, span := otel.Tracer("svc").Start(ctx, "investigate")
defer span.End()
// ... expecting span data to show scheduler/GC behavior
Bug: OpenTelemetry spans describe application/request operations across services. They do not capture scheduler events, GC pauses, goroutine blocking, or syscalls. The engineer is using the wrong tool for an in-process performance question.
Fix: for in-process scheduler/runtime analysis use runtime/trace + go tool trace (or a CPU profile):
f, _ := os.Create("trace.out")
trace.Start(f) // runtime/trace, not OTel
defer trace.Stop()
// reproduce load, then: go tool trace trace.out
OTel tells you which service/span is slow; runtime/trace tells you why, inside the process. See 03-runtime-trace-application-tracing.
Bug 22 — Unbounded baggage propagated everywhere
member, _ := baggage.NewMember("debug.payload", string(largeJSONBlob))
bag, _ := baggage.New(member)
ctx = baggage.ContextWithBaggage(ctx, bag)
// every downstream HTTP/gRPC call now carries this blob in a header
Bug: Baggage is propagated as a header to every downstream service for the rest of the request. A large blob inflates every outbound request, risks exceeding header size limits (causing 431/dropped requests), and may leak data across service boundaries.
Fix: keep baggage tiny and non-sensitive — small identifiers only:
member, _ := baggage.NewMember("tenant.id", "acme")
bag, _ := baggage.New(member)
ctx = baggage.ContextWithBaggage(ctx, bag)
Large or sensitive data belongs on a span attribute (local to the producing service), not in propagated baggage.
Summary
OpenTelemetry instrumentation looks like a few Start/End/Record calls, but the bugs cluster into three habits:
- Broken context and lifecycle. Forgetting to thread the ctx from Start (orphaned spans), forgetting SetTracerProvider (no-op tracer), forgetting Shutdown or getting shutdown ordering wrong (lost final spans), missing the propagator (broken cross-service traces), or using a cancelled request context in a goroutine. Always thread ctx, register the provider, set the propagator, and drain-then-flush with bounded timeouts.
- Cost and cardinality mistakes. High-cardinality attributes on metrics, dynamic values in span names, a span per loop iteration, fresh attribute slices in hot paths, and unbounded baggage. Keep metric dimensions bounded, span names static, and baggage tiny, and push unbounded values onto spans, not metric dimensions.
- Wrong tool / wrong layer. WithSyncer in production, mismatched OTLP port, delta temporality for Prometheus, libraries owning the SDK, expecting the SDK sampler to see outcomes (use Collector tail sampling), WithInsecure/PII in production, no error handler, and confusing OTel with runtime/trace. Use WithBatcher, match transport to port, keep the SDK in main, do outcome-based sampling in the Collector, secure and redact, and register an error handler. Reach for runtime/trace for in-process questions and OTel for cross-service ones.
Treat instrumentation as a side effect that must never change behavior, never block the request path, and never grow your bill linearly with success — and the rest of OpenTelemetry becomes dependable.