Tracing Roadmap

"A trace is the story of one request — told by every service that touched it."

This roadmap is about distributed tracing as it appears inside the code — instrumenting spans, propagating context, choosing what's a span vs a log, and avoiding the "trace everything" trap. It's the third pillar of observability alongside Logging and Metrics.

Looking for the system-design angle (collector topology, storage backends, Jaeger / Tempo / Honeycomb architecture)? See Backend → Distributed Tracing.

This section is the language-level discipline — what your code emits, how spans are linked, how context crosses an await or a thread boundary.


Why a Dedicated Roadmap

Distributed tracing is the diagnostic tool that changes how you read code:

  • Logs tell you what each service said.
  • Metrics tell you the aggregate shape.
  • Traces tell you the actual path one request took, end-to-end, including the slow span you didn't suspect.

The hard parts aren't the SDK — they're context propagation across async boundaries, deciding what's worth a span, and sampling without losing the interesting traces.

| Roadmap | Question it answers |
| --- | --- |
| Logging | What did each service say? |
| Metrics | What does it look like at aggregate? |
| Tracing (this) | What path did this one request take? |
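
To make the "context propagation across async boundaries" point concrete, here is a minimal sketch using only the Python standard library. The `current_trace_id` variable is a hypothetical stand-in for the "current span" slot a tracing SDK keeps; no real SDK is involved. It shows the two cases this roadmap cares about: an asyncio task, which receives a copy of the caller's context automatically, and a thread-pool worker, where the sketch snapshots the context and passes it along explicitly.

```python
import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the "current span" slot a tracing SDK maintains.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def read_trace_id():
    """Return whatever trace ID is visible in the calling context."""
    return current_trace_id.get()


async def child_task():
    # Implicit propagation: asyncio copies the creator's context into a new
    # task, so the value set by the parent is visible here.
    return read_trace_id()


async def handle_request(trace_id):
    current_trace_id.set(trace_id)
    via_task = await asyncio.create_task(child_task())

    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Explicit propagation: snapshot the context, then run the work
        # inside that snapshot on the worker thread.
        ctx = contextvars.copy_context()
        via_thread = await loop.run_in_executor(pool, ctx.run, read_trace_id)
    return via_task, via_thread


if __name__ == "__main__":
    print(asyncio.run(handle_request("abc123")))
```

If `read_trace_id` were handed to `run_in_executor` without `ctx.run`, a standard CPython build would typically run it in a worker thread with an empty context and return `None`. That is the shape of the "context disappears" bug, though the exact default depends on the interpreter version and build.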

Sections

| # | Topic | Focus |
| --- | --- | --- |
| 01 | Trace, Span, Context | The data model; parent/child, root span, span attributes/events |
| 02 | OpenTelemetry SDK | The cross-language standard; tracer, span, propagator, exporter |
| 03 | Context Propagation | W3C Trace Context (traceparent header), B3, baggage; how a trace stays whole across services |
| 04 | Propagation Across Async / Threads | Why a context disappears across await or a goroutine; explicit vs implicit propagation |
| 05 | Manual Instrumentation | When to start a span, naming conventions, attributes vs events |
| 06 | Auto-Instrumentation | Java agents, Python opentelemetry-instrument, what they catch and miss |
| 07 | Sampling | Head-based vs tail-based sampling; rate-based vs always-on for errors |
| 08 | Span Events vs Logs | When a log should be an event on a span; the unifying trend |
| 09 | Linking Spans to Logs & Metrics | Trace ID in logs, exemplars on metrics, correlated views |
| 10 | Debugging With Traces | Reading a trace, identifying slow spans, missing spans, broken propagation |
| 11 | Cost & Overhead | Per-span allocation cost, exporter batch tuning, tail-sampling complexity |
| 12 | Anti-patterns | "Trace every function," missing parent links, log-spam-as-spans, leaking PII into attributes |
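
As a preview of topic 03, the sketch below shows how the W3C `traceparent` header keeps a trace whole across services. The format is `version-traceid-parentid-flags` in lowercase hex. The names `SpanContext`, `parse_traceparent`, and `child_traceparent` are illustrative and not taken from any SDK, and the parser is deliberately simplified: it accepts only the four-field layout and ignores `tracestate`. Real code would normally use an OpenTelemetry propagator instead.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    """Illustrative container for the fields carried by traceparent."""
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse an incoming header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Build the header for an outgoing call: same trace, new span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(incoming)
    print(ctx)
    print(child_traceparent(ctx))
```

The outgoing header reuses the trace ID and the sampling decision while replacing the span ID, which is what lets the next service attach its spans to the same trace under the correct parent.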

Languages

Examples are given in Go (go.opentelemetry.io/otel), Java (OpenTelemetry SDK, Java agent), Python (opentelemetry-sdk, automatic instrumentation), Node (@opentelemetry/*), and Rust (tracing + tracing-opentelemetry).


Status

Structure defined; content pending.


References

  • Distributed Tracing in Practice — Parker, Spoonhower, Mace, Sigelman, Isaacs
  • Mastering Distributed Tracing — Yuri Shkuro (the Jaeger author)
  • OpenTelemetry Specification — opentelemetry.io
  • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure — Sigelman et al. (the Google paper that started it all)

Project Context

Part of the Senior Project — a personal effort to consolidate the essential knowledge of software engineering in one place.