Telemetry Cost & Sampling Strategy Roadmap
"If your observability bill is bigger than the bill for the system it observes, you have stopped measuring the system and started measuring your budget."
This roadmap is about controlling the volume and cost of logs, metrics, and traces while preserving the fidelity you actually need. It is the discipline that keeps observability from quietly becoming the most expensive line item you own — and from failing at the only moment it has to work: the 3 a.m. question, when the one trace you needed was the one you sampled away.
Looking for what to emit (the four metric types, span design, structured logs)? See the sibling pillars: Metrics, Tracing, Logging.
Looking for the strategy of observability as a whole (what to instrument, SLOs, the build-vs-buy of a stack)? See Observability Engineering.
This section is the economics-and-throughput discipline — how much of what you emit you actually keep, where you decide that, and how you stay statistically honest after throwing data away.
Why a Dedicated Roadmap
"Log everything, trace everything, measure everything" is the default, and it becomes financially and operationally unsustainable as soon as traffic and service count grow. Telemetry volume grows super-linearly because its drivers multiply rather than add: traffic × services × spans per request × labels per metric. A senior engineer knows that the dominant cost driver differs for each signal, and that the cure differs accordingly.
| Signal | Dominant cost driver | The killer pattern | The lever |
|---|---|---|---|
| Metrics | Cardinality (one time series per unique label set) | a user_id label → one series per user | label allow-lists; move identity to exemplars/traces |
| Logs | Volume (bytes ingested × retention) | DEBUG logs left on in prod at full traffic | level control, field pruning, retention tiers |
| Traces | Volume × span count | 100% sampling of a 40-span request at 50k rps | head + tail sampling in the collector |
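The two multiplicative rows in the table above can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing tool: the label names and their distinct-value counts are hypothetical, and `metric_series` computes a worst-case upper bound (every label combination actually occurring), which real workloads rarely reach.

```python
def metric_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count for one metric: product of distinct values per label."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total


def spans_per_second(requests_per_second: int, spans_per_request: int, sample_rate: float) -> float:
    """Spans leaving the fleet each second at a given trace sample rate."""
    return requests_per_second * spans_per_request * sample_rate


# Hypothetical label sets for a single HTTP request counter.
bounded = {"method": 5, "route": 40, "status_class": 5}
with_user_id = {**bounded, "user_id": 100_000}  # assumed user count

print(metric_series(bounded))        # 1000
print(metric_series(with_user_id))   # 100000000

# The trace row from the table: a 40-span request at 50k rps.
print(spans_per_second(50_000, 40, 1.0))   # 2 million spans/s at 100% sampling
print(spans_per_second(50_000, 40, 0.01))  # about 20 thousand spans/s at 1%
```

Adding one unbounded label multiplies the series count by the number of distinct values it takes, which is why the lever for metrics is the label set, while the lever for traces is the sample rate.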
The central choice this roadmap turns on is head vs tail sampling for traces:
| | Head-based | Tail-based |
|---|---|---|
| Decides | at trace start, before anything happened | after the whole trace is seen |
| Knows if the trace is interesting? | No — it's blind | Yes — saw the error / the latency |
| Cost / complexity | Cheap, stateless, per-service | Expensive — collector must buffer every trace |
| Keeps all errors & slow traces? | No (only by luck) | Yes (that's the whole point) |
| Good for | uniform cost cap, huge fleets | "keep the ones that matter" |
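The difference between the two columns above comes down to what information the decision function is allowed to see. Below is a minimal sketch under stated assumptions: `head_keep`, `tail_keep`, and `Span` are hypothetical names, and hashing the trace ID with SHA-256 is an illustrative choice, not the algorithm used by any particular OpenTelemetry sampler.

```python
import hashlib
from dataclasses import dataclass


def head_keep(trace_id: str, sample_rate: float) -> bool:
    """Head decision: a pure function of trace_id, made before any span exists.

    Hashing the trace_id (instead of calling random()) means every service
    that sees the same trace_id reaches the same keep/drop decision.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)     # keep the lowest fraction of buckets


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def tail_keep(spans: list[Span], slow_ms: float, baseline_rate: float) -> bool:
    """Tail decision: made only after all spans of the trace have been buffered."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True                               # keep every error trace
    if max(span.duration_ms for span in spans) >= slow_ms:
        return True                               # keep every slow trace
    return head_keep(spans[0].trace_id, baseline_rate)  # sample the boring rest


trace = [Span("4bf92f35", 12.0, False), Span("4bf92f35", 830.0, False)]
print(tail_keep(trace, slow_ms=500.0, baseline_rate=0.0))  # True: one span was slow
```

`head_keep` needs only the trace ID, so it can run statelessly in every service. `tail_keep` needs the full span list, which is where the buffering cost in the table comes from.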
Sections
| # | Topic | Focus |
|---|---|---|
| 01 | The Cost Problem | Super-linear growth; observability bill > compute bill; cost vs fidelity vs 3 a.m. questions |
| 02 | Cost Drivers per Signal | Cardinality (metrics), volume (logs), volume × spans (traces), wide events |
| 03 | Cardinality, the Metric Killer | The user_id explosion, worked numbers, allow-lists, dropping labels |
| 04 | Head-Based Sampling | Probabilistic, rate-limiting; cheap, stateless, blind to interestingness |
| 05 | Tail-Based Sampling | Buffer-then-decide; keep all errors + slow traces + a sample of the rest |
| 06 | Consistent / Deterministic Sampling | Same trace_id decision across services; propagating the decision; OTel samplers |
| 07 | Statistical Correctness | Adjusted counts / upsampling (×1/sample_rate) so sampled metrics stay accurate |
| 08 | Reducing Cost Without Losing Signal | Aggregation at source, exemplars, log levels, field pruning, span filtering |
| 09 | Retention & Downsampling Tiers | Hot/warm/cold, recording rules, downsampling in long-term storage (e.g. the Thanos compactor), metric rollups |
| 10 | The OTel Collector as Control Point | tail_sampling, probabilistic_sampler, filter, attributes, memory_limiter; gateway vs agent |
| 11 | The Fidelity Floor | What you keep at 100% — errors, audit/security, SLO signals, billing — and never sample |
| 12 | Org Cost Strategy | Budgets, chargeback/showback, cardinality governance, policy-as-code, spend anomaly alerts, vendor pricing traps |
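Topic 07 in the table above (adjusted counts) is the easiest of these to get subtly wrong, so a small worked example may help. This is a minimal sketch assuming each kept trace records the probability with which it was kept; `KeptTrace` and the 100% / 1% policy are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeptTrace:
    is_error: bool
    sample_rate: float  # probability with which this trace was kept, in (0, 1]


def adjusted_count(trace: KeptTrace) -> float:
    """Each kept trace stands in for 1 / sample_rate original traces."""
    return 1.0 / trace.sample_rate


def estimated_totals(kept: list[KeptTrace]) -> tuple[float, float]:
    """Return (estimated total requests, estimated error requests)."""
    total = sum(adjusted_count(t) for t in kept)
    errors = sum(adjusted_count(t) for t in kept if t.is_error)
    return total, errors


# Hypothetical policy: errors kept at 100%, successes kept at 1%.
kept = [KeptTrace(True, 1.0)] * 50 + [KeptTrace(False, 0.01)] * 95
total, errors = estimated_totals(kept)

naive_error_rate = 50 / len(kept)        # counts stored traces only: roughly 34%
estimated_error_rate = errors / total    # weights by 1/sample_rate: roughly 0.5%
print(naive_error_rate, estimated_error_rate)
```

Counting stored traces directly overstates the error rate by a wide margin here, because errors were deliberately over-retained. Weighting each trace by the inverse of its sample rate recovers an estimate of the original population.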
Languages
Configuration-first (the cost-control point is the OpenTelemetry Collector, language-agnostic YAML), with SDK-side sampler examples in Go, Java, Python, Node, and Rust for head sampling and propagating the sampling decision.
Tools
The OpenTelemetry Collector (probabilistic_sampler, tail_sampling, filter, attributes, batch, memory_limiter processors), Prometheus (recording rules), Grafana Mimir / Thanos / VictoriaMetrics (retention tiers, downsampling, rollups), and vendor backends (Honeycomb, Datadog, Grafana Cloud) whose pricing models drive the trade-offs. The observability-stack, monitoring-alerting, and caching-strategies skills are directly relevant.
Status
✅ Content complete — junior · middle · senior · professional · interview · tasks.
References
- Observability Engineering — Majors, Fong-Jones, Miranda (cardinality, wide events, the cost argument)
- Honeycomb — Sampling docs & "Dynamic Sampling" (the canonical head-vs-tail and dynamic-sampling treatment)
- OpenTelemetry — Sampling & the Collector tail_sampling / probabilistic_sampler processors
- Google — "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" (the origin of trace sampling and adjusted counts)
- Prometheus — Recording rules & downsampling; Grafana Mimir / Thanos compactor docs
Project Context
Part of the Senior Project — a personal effort to consolidate the essential knowledge of software engineering in one place.