Tracing Roadmap
"A trace is the story of one request — told by every service that touched it."
This roadmap is about distributed tracing as it appears inside the code — instrumenting spans, propagating context, choosing what's a span vs a log, and avoiding the "trace everything" trap. It's the third pillar of observability alongside Logging and Metrics.
Looking for the system-design angle (collector topology, storage backends, Jaeger / Tempo / Honeycomb architecture)? See Backend → Distributed Tracing.
This section is the language-level discipline: what your code emits, how spans are linked, and how context crosses an await or a thread boundary.
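As a rough illustration of what "crossing an await or a thread boundary" means, the sketch below uses only Python's standard `contextvars`, `asyncio`, and `threading` modules. The `current_trace_id` variable is a hypothetical stand-in for the "current span" slot a tracing SDK would manage; it is not an OpenTelemetry API.

```python
import asyncio
import contextvars
import threading

# Hypothetical stand-in for the "current span" slot of a tracing SDK.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


async def child_task():
    # A coroutine awaited directly shares the caller's context, and a task
    # created with asyncio.create_task runs in a copy of it, so the value
    # set by the caller is visible here in both cases.
    return current_trace_id.get()


def read_in_thread(results, key):
    # Record whatever trace ID this thread can see.
    results[key] = current_trace_id.get()


async def handle_request():
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    seen = {}
    seen["await"] = await child_task()
    seen["task"] = await asyncio.create_task(child_task())

    # Implicit propagation: on standard CPython builds a plain thread starts
    # with an empty context, so the trace ID is typically lost here.
    t = threading.Thread(target=read_in_thread, args=(seen, "bare_thread"))
    t.start()
    t.join()

    # Explicit propagation: capture the context and run the target inside it.
    ctx = contextvars.copy_context()
    t = threading.Thread(
        target=ctx.run, args=(read_in_thread, seen, "thread_with_ctx")
    )
    t.start()
    t.join()
    return seen


result = asyncio.run(handle_request())
```

The `bare_thread` entry is the "missing parent" case: the work still runs, but nothing ties it back to the request unless the context is handed over explicitly.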
Why a Dedicated Roadmap
Distributed tracing is the diagnostic tool that changes how you read code:
- Logs tell you what each service said.
- Metrics tell you the aggregate shape.
- Traces tell you the actual path one request took, end-to-end, including the slow span you didn't suspect.
The hard parts aren't the SDK — they're context propagation across async boundaries, deciding what's worth a span, and sampling without losing the interesting traces.
| Roadmap | Question it answers |
|---|---|
| Logging | What did each service say? |
| Metrics | What does it look like at aggregate? |
| Tracing (this) | What path did this one request take? |
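To make "sampling without losing the interesting traces" concrete, here is a minimal sketch contrasting a head decision (made from the trace ID alone, before anything is known about the request) with a tail decision (made once the trace has finished). The `FinishedTrace` type, the `slow_ms` threshold, and both function names are hypothetical and chosen for illustration; real SDK samplers differ in detail.

```python
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    trace_id: str        # 32 lowercase hex characters
    duration_ms: float
    has_error: bool


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at the start of a trace, using only the trace ID.

    Treats the low 64 bits of the ID as a uniform number and keeps the
    trace if it falls below ratio * 2**64. Because the decision depends
    only on the ID, every service given the same ID and ratio agrees.
    """
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound


def tail_sample(trace: FinishedTrace, ratio: float, slow_ms: float = 1000.0) -> bool:
    """Decide after the trace has finished, when its outcome is known."""
    # Always keep the interesting traces: errors and slow requests.
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    # Otherwise fall back to the same ratio rule as head sampling.
    return head_sample(trace.trace_id, ratio)
```

The trade-off is that the tail rule needs the whole trace buffered somewhere before it can decide, which is where most of the operational complexity comes from.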
Sections
| # | Topic | Focus |
|---|---|---|
| 01 | Trace, Span, Context | The data model; parent/child, root span, span attributes/events |
| 02 | OpenTelemetry SDK | The cross-language standard; tracer, span, propagator, exporter |
| 03 | Context Propagation | W3C Trace Context (traceparent header), B3, baggage; how a trace stays whole across services |
| 04 | Propagation Across Async / Threads | Why a context disappears across await or a goroutine; explicit vs implicit propagation |
| 05 | Manual Instrumentation | When to start a span, naming conventions, attributes vs events |
| 06 | Auto-Instrumentation | Java agents, Python opentelemetry-instrument, what they catch and miss |
| 07 | Sampling | Head-based vs tail-based sampling; rate-based vs always-on for errors |
| 08 | Span Events vs Logs | When a log should be an event on a span; the unifying trend |
| 09 | Linking Spans to Logs & Metrics | Trace ID in logs, exemplars on metrics, correlated views |
| 10 | Debugging With Traces | Reading a trace, identifying slow spans, missing spans, broken propagation |
| 11 | Cost & Overhead | Per-span allocation cost, exporter batch tuning, tail-sampling complexity |
| 12 | Anti-patterns | "Trace every function," missing parent links, log-spam-as-spans, leaking PII into attributes |
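As a preview of topic 03, the sketch below parses and formats a W3C `traceparent` header using only the standard library. It is a simplification that assumes version `00` only and ignores `tracestate`. The `SpanContext` type and both function names are hypothetical; in practice an SDK propagator handles this for you.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str   # 16 bytes as 32 hex characters
    span_id: str    # 8 bytes as 16 hex characters
    sampled: bool   # bit 0 of the trace-flags field


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the remote span context, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext) -> str:
    """Build the header value to send on an outgoing request."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"
```

A service that receives this header starts its own span with `trace_id` carried over and `span_id` recorded as the parent, then sends a new header downstream with its own span ID in the third field. That hand-off is what keeps a trace whole across services.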
Languages
Examples in Go (go.opentelemetry.io/otel), Java (OpenTelemetry SDK, Java agent), Python (opentelemetry-sdk, automatic instrumentation), Node (@opentelemetry/*), Rust (tracing + tracing-opentelemetry).
Status
⏳ Structure defined; content pending.
References
- Distributed Tracing in Practice — Parker, Spoonhower, Mace, Sigelman
- Mastering Distributed Tracing — Yuri Shkuro (the Jaeger author)
- OpenTelemetry Specification — opentelemetry.io
- Dapper, a Large-Scale Distributed Systems Tracing Infrastructure — Sigelman et al. (the Google paper that started it all)
Project Context
Part of the Senior Project — a personal effort to consolidate the essential knowledge of software engineering in one place.