Telemetry Cost & Sampling Strategy Roadmap

"If your observability bill is bigger than the bill for the system it observes, you have stopped measuring the system and started measuring your budget."

This roadmap is about controlling the volume and cost of logs, metrics, and traces while preserving the fidelity you actually need. It is the discipline that keeps observability from quietly becoming the most expensive line item you own — and from failing at the only moment it has to work: the 3 a.m. question, when the one trace you needed was the one you sampled away.

Looking for what to emit (the four metric types, span design, structured logs)? See the sibling pillars: Metrics, Tracing, Logging.

Looking for the strategy of observability as a whole (what to instrument, SLOs, the build-vs-buy of a stack)? See Observability Engineering.

This section is the economics-and-throughput discipline — how much of what you emit you actually keep, where you decide that, and how you stay statistically honest after throwing data away.


Why a Dedicated Roadmap

"Log everything, trace everything, measure everything" is the default, and it is financially and operationally unsustainable the moment traffic and service count grow. Telemetry volume grows super-linearly: more traffic × more services × more spans-per-request × more labels-per-metric. A senior engineer knows the cost driver is different for each signal, and that the cure differs accordingly.

| Signal | Dominant cost driver | The killer pattern | The lever |
|---|---|---|---|
| Metrics | Cardinality (one time series per unique label set) | a user_id label → one series per user | label allow-lists; move identity to exemplars/traces |
| Logs | Volume (bytes ingested × retention) | DEBUG logs left on in prod at full traffic | level control, field pruning, retention tiers |
| Traces | Volume × span count | 100% sampling of a 40-span request at 50k rps | head + tail sampling in the collector |
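
To make the table's killer patterns concrete, here is a back-of-the-envelope sketch in Python. Only the 40-span request and the 50k rps figure come from the table; the label names, their cardinalities, and the helper functions `series_count` and `spans_per_second` are hypothetical, chosen purely for illustration.

```python
# Rough arithmetic for the "killer patterns" in the table above.
# Label names and cardinalities are hypothetical illustration values.

def series_count(label_cardinalities):
    """Worst-case time series for one metric: the product of all label cardinalities."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total


def spans_per_second(requests_per_second, spans_per_request, sample_rate=1.0):
    """Spans emitted per second after applying a trace sample rate."""
    return requests_per_second * spans_per_request * sample_rate


# A bounded label set stays small...
base_labels = {"service": 20, "endpoint": 30, "status_class": 5}
print(series_count(base_labels))   # 3000

# ...until an unbounded identity label is added (assuming 100k active users).
with_user_id = dict(base_labels, user_id=100_000)
print(series_count(with_user_id))  # 300000000

# 100% sampling of a 40-span request at 50k rps is two million spans per second.
print(spans_per_second(50_000, 40))
```

The point of the sketch is the shape of the growth, not the exact figures: one unbounded label multiplies every other label's cardinality, which is why the lever for metrics is an allow-list rather than a sample rate.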

The central choice this roadmap turns on is head vs tail sampling for traces:

| | Head-based | Tail-based |
|---|---|---|
| Decides | at trace start, before anything happened | after the whole trace is seen |
| Knows if the trace is interesting? | No, it's blind | Yes, it saw the error / the latency |
| Cost / complexity | Cheap, stateless, per-service | Expensive: the collector must buffer every trace |
| Keeps all errors & slow traces? | No (only by luck) | Yes (that's the whole point) |
| Good for | uniform cost cap, huge fleets | "keep the ones that matter" |
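
The difference between the two columns can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular SDK or collector implements it: `head_keep`, `tail_keep`, the `Span` type, the SHA-256 bucketing, the 1000 ms slow threshold, and the 1% baseline rate are all hypothetical, and the slowest span is used as a stand-in for trace latency.

```python
import hashlib
from dataclasses import dataclass


def head_keep(trace_id: str, sample_rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before any span exists.

    Hashing the ID (an assumption of this sketch) makes the decision
    deterministic, so every service that sees the same trace_id agrees.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


@dataclass
class Span:
    duration_ms: float
    is_error: bool = False


def tail_keep(trace_id: str, spans: list, slow_ms: float = 1000.0,
              baseline_rate: float = 0.01) -> bool:
    """Tail-based: decide only after the whole trace has been buffered."""
    if any(span.is_error for span in spans):
        return True  # keep every trace that contains an error
    if max((span.duration_ms for span in spans), default=0.0) >= slow_ms:
        return True  # keep every slow trace
    return head_keep(trace_id, baseline_rate)  # sample the uninteresting rest


# The head sampler never looks at the spans; the tail sampler needs all of them.
failing_trace = [Span(12.0), Span(30.0, is_error=True)]
print(tail_keep("trace-123", failing_trace))
```

Note where the cost goes: `head_keep` needs nothing but the ID, while `tail_keep` can only run once every span of the trace has arrived and been held in memory somewhere, which is the buffering cost listed in the table.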

Sections

| # | Topic | Focus |
|---|---|---|
| 01 | The Cost Problem | Super-linear growth; observability bill > compute bill; cost vs fidelity vs 3 a.m. questions |
| 02 | Cost Drivers per Signal | Cardinality (metrics), volume (logs), volume × spans (traces), wide events |
| 03 | Cardinality, the Metric Killer | The user_id explosion, worked numbers, allow-lists, dropping labels |
| 04 | Head-Based Sampling | Probabilistic, rate-limiting; cheap, stateless, blind to interestingness |
| 05 | Tail-Based Sampling | Buffer-then-decide; keep all errors + slow traces + a sample of the rest |
| 06 | Consistent / Deterministic Sampling | Same trace_id decision across services; propagating the decision; OTel samplers |
| 07 | Statistical Correctness | Adjusted counts / upsampling (×1/sample_rate) so sampled metrics stay accurate |
| 08 | Reducing Cost Without Losing Signal | Aggregation at source, exemplars, log levels, field pruning, span filtering |
| 09 | Retention & Downsampling Tiers | Hot/warm/cold, recording rules, Prometheus downsampling, metric rollups |
| 10 | The OTel Collector as Control Point | tail_sampling, probabilistic_sampler, filter, attributes, memory_limiter; gateway vs agent |
| 11 | The Fidelity Floor | What you keep at 100% and never sample: errors, audit/security, SLO signals, billing |
| 12 | Org Cost Strategy | Budgets, chargeback/showback, cardinality governance, policy-as-code, spend anomaly alerts, vendor pricing traps |
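
The adjusted-count idea named in section 07 (×1/sample_rate) can be sketched as follows. This is a minimal illustration, assuming each kept span carries the rate it was sampled at; the `KeptSpan` type, `adjusted_count`, `estimate_totals`, and the example rates are hypothetical names and values, not a specific library's API.

```python
from dataclasses import dataclass


@dataclass
class KeptSpan:
    sample_rate: float      # probability with which this span was kept (0 < rate <= 1)
    is_error: bool = False


def adjusted_count(span: KeptSpan) -> float:
    """Each kept span stands in for 1/sample_rate original spans."""
    return 1.0 / span.sample_rate


def estimate_totals(kept: list) -> tuple:
    """Estimate the original (pre-sampling) total and error counts."""
    total = sum(adjusted_count(span) for span in kept)
    errors = sum(adjusted_count(span) for span in kept if span.is_error)
    return total, errors


# Hypothetical mix: errors kept at 100%, successful traffic kept at 1%.
kept = [KeptSpan(1.0, is_error=True) for _ in range(5)]
kept += [KeptSpan(0.01) for _ in range(99)]

naive_error_rate = 5 / len(kept)   # counts kept spans only, so errors look inflated
total, errors = estimate_totals(kept)
adjusted_error_rate = errors / total

print(f"naive: {naive_error_rate:.2%}, adjusted: {adjusted_error_rate:.3%}")
```

Without the weighting, a policy that keeps every error and only a sliver of healthy traffic makes the system look far less healthy than it is; the adjusted count is what lets a biased sample still yield an unbiased estimate.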

Languages

Configuration-first (the cost-control point is the OpenTelemetry Collector, language-agnostic YAML), with SDK-side sampler examples in Go, Java, Python, Node, and Rust for head sampling and propagating the sampling decision.


Tools

The OpenTelemetry Collector (probabilistic_sampler, tail_sampling, filter, attributes, batch, memory_limiter processors), Prometheus (recording rules; Prometheus itself has no built-in downsampling), Grafana Mimir / Thanos / VictoriaMetrics (retention tiers, downsampling, rollups), and vendor backends (Honeycomb, Datadog, Grafana Cloud) whose pricing models drive the trade-offs. The observability-stack, monitoring-alerting, and caching-strategies skills are directly relevant.


Status

Content complete: junior · middle · senior · professional · interview · tasks.


References

  • Observability Engineering — Majors, Fong-Jones, Miranda (cardinality, wide events, the cost argument)
  • Honeycomb — Sampling docs & "Dynamic Sampling" (the canonical head-vs-tail and dynamic-sampling treatment)
  • OpenTelemetry — Sampling & the Collector tail_sampling/probabilistic_sampler processors
  • Google — "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" (the origin of trace sampling and adjusted counts)
  • Prometheus — Recording rules & downsampling; Grafana Mimir / Thanos compactor docs

Project Context

Part of the Senior Project — a personal effort to consolidate the essential knowledge of software engineering in one place.