Observability Engineering — Professional Level¶
Topic: Observability Engineering Roadmap Focus: Observability as an organisational discipline and platform. The OpenTelemetry pipeline at scale — agent vs gateway Collector topology, where tail sampling lives, multi-tenancy and PII governance. Backend architecture and build-vs-buy (Prometheus/Grafana/Tempo/Loki vs Honeycomb vs Datadog). The observability maturity model and cultural adoption (observability-driven development). SLO practice across many teams, cost governance, on-call and incident response enabled by observability, and the Goodhart traps that ruin observability metrics.
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Core Concepts
- The Observability Platform — A Reference Architecture
- Collector Topology: Agent vs Gateway
- Backends and Build-vs-Buy
- The Observability Maturity Model
- Cultural Adoption & Observability-Driven Development
- SLO Practice at Org Scale
- Cost Governance
- On-Call & Incident Response
- Goodhart Pitfalls on Observability Metrics
- Code Examples
- Use Cases
- Mental Models
- Best Practices
- Edge Cases & Pitfalls
- Common Mistakes
- Tricky Points
- Test Yourself
- Cheat Sheet
- Summary
- Further Reading
- Related Topics
- Diagrams & Visual Aids
Introduction¶
Focus: How do you make a whole organisation observable — the pipeline, the backends, the budget, and the culture — not just one well-instrumented service?
The senior level was about designing observability into a service. This level is about the platform and the organisation: dozens of teams, polyglot services, millions of dollars of telemetry spend, an on-call rotation that lives or dies by whether the data answers questions at 3 a.m. The technical core is the same OpenTelemetry pipeline — but now you must decide its topology (agent vs gateway Collectors, where tail sampling runs, how multi-tenancy and PII redaction work), its backends (the build-vs-buy question that determines your cost and capability ceiling), and its economics (telemetry is frequently a top-three infrastructure line item — cross-ref telemetry-cost).
But the platform is only half the job, and the easier half. The harder half is adoption and culture. An observability platform that nobody instruments richly, or that engineers don't reach for during incidents, is shelfware. The professional discipline includes the maturity model (where is the org on the path from "we have logs" to "we debug unknown-unknowns in minutes"), observability-driven development (instrumentation written with the feature, not bolted on after an outage), SLO practice that scales to many teams without becoming bureaucratic theatre, and the Goodhart traps — the ways well-intentioned observability metrics get gamed into uselessness ("100% of services have a dashboard" while none of them answer real questions).
This page is written for the engineer who owns, or influences, observability across an organisation: platform/SRE leads, staff engineers setting standards, anyone deciding build-vs-buy or defending the telemetry budget. We connect to the observability-stack and monitoring-alerting skills for tooling and alert-design depth, and to engineering-metrics-and-dora for the SLO and reliability-metric foundations.
🎓 Why this matters at the professional level: At this scale, the failure modes are organisational and economic, not technical. The platform that works for one team falls over at fifty; the SLO practice that's crisp for one service becomes theatre across the org; the telemetry bill that's a rounding error becomes a budget crisis. Getting observability right at the org level is a leadership problem with a technical substrate.
Prerequisites¶
- Required: senior.md (wide-event design, cardinality economics, span/data-model design, the debugging loop, SLO/alerting fundamentals, sampling and fidelity).
- Required: middle.md (OTel spec/SDK/Collector, propagation, correlation chain).
- Required: Operating experience with a real backend (Prometheus/Grafana or a SaaS) and a Collector deployment.
- Helpful: ../telemetry-cost-and-sampling-strategy/ for the cost/sampling deep dive.
- Helpful: ../../quality-engineering/engineering-metrics-and-dora/ for SLO/error-budget and reliability-metric practice, and the Goodhart discussion there.
Glossary¶
| Term | Definition |
|---|---|
| Agent Collector | A Collector running next to the workload (sidecar or per-node DaemonSet); first-hop collection. |
| Gateway Collector | A centralised Collector tier (a scalable deployment) that does heavy processing — tail sampling, routing, redaction. |
| Tail sampling | A keep-decision made after a whole trace is seen; requires all spans of a trace to land on the same Collector instance. |
| Span metrics | Metrics (RED) derived from spans by the Collector's spanmetrics connector — metrics without separate instrumentation. |
| OTTL | OpenTelemetry Transformation Language — the Collector's expression language for transforming/redacting telemetry. |
| Build-vs-buy | The decision between self-hosting an OSS stack and paying a SaaS vendor. |
| Maturity model | A staged description of an org's observability capability, from reactive monitoring to proactive observability. |
| ODD | Observability-Driven Development — writing instrumentation as part of the feature, validated in prod. |
| Error budget policy | The pre-agreed consequence of burning the budget (e.g. freeze features, prioritise reliability). |
| Cardinality budget | A per-team/per-service cap on label/attribute cardinality to control cost. |
| Goodhart's Law | "When a measure becomes a target, it ceases to be a good measure." |
| MTTD / MTTR | Mean Time To Detect / Mean Time To Resolve — incident-response outcome metrics. |
| Single pane of glass | One place to pivot across all signals; an aspiration often abused as a procurement slogan. |
Core Concepts¶
1. Observability is a platform, and platforms have customers¶
Your customers are the engineering teams. The platform's job is to make rich instrumentation easy (sane defaults, shared SDK config, auto-instrumentation, golden dashboards) and the wrong thing hard (cardinality guardrails, PII redaction by default). Treat it as a product: paved roads, self-service, documentation, and a feedback loop — not a mandate.
2. The Collector tier is where org-level policy lives¶
In a one-service world the Collector is a convenience. At org scale it is the policy enforcement point: tail sampling, PII redaction, attribute normalisation, cost controls, backend routing, and multi-tenancy all live in the gateway tier — centrally, without redeploying every service. Owning the Collector topology is owning observability governance.
3. Build-vs-buy is a capability and cost decision, not just a price¶
Self-hosting OSS (Prometheus/Grafana/Tempo/Loki) gives control and avoids per-GB SaaS pricing, but you pay in operational toil and you inherit the TSDB cardinality ceiling. SaaS (Honeycomb, Datadog) buys you the wide-event query model or a turnkey platform, but the bill scales with telemetry volume and can become the dominant cost. There is no universally right answer — only a right answer for your scale, team, and cardinality needs.
4. Adoption beats architecture¶
A perfect pipeline feeding a perfect backend is worthless if teams emit thin spans and never query during incidents. The maturity model and ODD culture are not soft extras — they're the difference between an observability platform and an observability capability.
5. SLOs scale only with discipline¶
One service's SLO is easy. Fifty teams' SLOs become theatre unless you standardise (what's an SLI, how to set targets, the error-budget policy) without centralising (each team owns its SLOs). The goal is a shared language, not a central committee approving every threshold.
6. Cost is a first-class design constraint¶
Telemetry volume tends to grow faster than traffic, because every added service multiplies the spans and attributes emitted per request. Without governance (sampling, cardinality budgets, retention tiers, attribution back to teams) the bill becomes a crisis, and the reflexive fix (turn down sampling) blinds you exactly when you scale. Cost governance is an observability-design problem, not just a finance one (cross-ref telemetry-cost).
7. Observability metrics get Goodharted¶
The moment "every service must have a dashboard" or "MTTR must drop 20%" becomes a target, teams optimise the metric, not the capability. Professional practice means choosing observability KPIs that resist gaming and watching for the gaming when it happens.
The Observability Platform — A Reference Architecture¶
A platform is a pipeline with policy, governance, and a product surface around it.
┌─ SERVICES (polyglot: Go, Python, Java, Node) ─┐
│  OTel SDK / auto-instrumentation              │   instrument ONCE (vendor-neutral)
└───────────────────────┬───────────────────────┘
                        │ OTLP
                        ▼
┌─ AGENT COLLECTOR (per-node DaemonSet / sidecar) ─┐
│  receive · enrich (k8s metadata) · batch         │   first hop, near the workload
└───────────────────────┬──────────────────────────┘
                        │ OTLP (load-balancing exporter, trace-aware)
                        ▼
┌─ GATEWAY COLLECTOR TIER (scaled deployment) ─────────────────────┐
│  TAIL SAMPLING · PII redaction (OTTL) · attribute normalisation  │   ORG POLICY
│  span→metrics connector · cost controls · multi-tenant routing   │   lives here
└──────────┬───────────────────┬───────────────────┬───────────────┘
           ▼                   ▼                   ▼
    TRACES backend      METRICS backend      LOGS backend
  (Tempo / Honeycomb) (Prometheus/Mimir)        (Loki)
└─────────── single query/pivot layer (Grafana / SaaS UI) ──────────┘
          dashboards · SLOs · alerting · ad-hoc exploration
Key properties this buys you:
- Instrument once, govern centrally. Apps speak OTLP; everything downstream — sampling, redaction, backend choice — changes in the Collector tier without app redeploys.
- PII never leaves un-redacted. A redaction stage in the gateway is the chokepoint where you guarantee emails/card numbers/tokens are scrubbed before export, especially to a third-party SaaS.
- Cost lives in one place. Sampling and cardinality controls are centralised, so you can tune the bill without touching fifty repos.
- Backends are swappable. Re-point an exporter; run two backends in parallel during a migration.
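The redaction chokepoint can be pictured as a pure function over span attributes. The sketch below is illustrative Java, not Collector code: in the real pipeline this logic runs as OTTL statements in the gateway. The class name `Redactor` and the method `redact` are hypothetical, and the two attribute keys are simply the ones this page uses in its examples.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of a gateway redaction stage; not a Collector API.
final class Redactor {
    // Matches a `token` query parameter up to the next `&`.
    private static final Pattern TOKEN = Pattern.compile("token=[^&]+");

    // Returns a redacted copy; the caller's map is left untouched.
    static Map<String, String> redact(Map<String, String> attributes) {
        Map<String, String> out = new HashMap<>(attributes);
        out.remove("app.user.email"); // drop the attribute entirely
        out.computeIfPresent("http.url",
            (key, url) -> TOKEN.matcher(url).replaceAll("token=REDACTED"));
        return out;
    }
}
```

Because every exported span passes through one function like this, a service that forgets to scrub its own attributes still cannot leak them downstream.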
Collector Topology: Agent vs Gateway¶
The single most important platform-architecture decision. The two roles are usually both present in a layered topology.
| | Agent Collector | Gateway Collector |
|---|---|---|
| Runs as | Sidecar or per-node DaemonSet | Centralised, horizontally-scaled deployment |
| Proximity | Next to the workload | A shared tier |
| Does | Receive, add host/k8s metadata, batch, ship | Tail sampling, redaction, normalisation, routing, span-metrics |
| Why | Offload the app, capture node context | Centralise expensive/global processing & policy |
| Scaling | One per node/pod | Scale out by load |
The tail-sampling constraint that shapes the whole topology¶
Tail sampling must see all spans of a trace to decide whether to keep it, so all spans of a given trace must reach the same gateway instance. You cannot naïvely round-robin spans across gateway replicas, or each replica sees a fragment and the decision is wrong. The solution is a trace-aware load-balancing exporter that routes by trace_id. In the simplest layout it runs on the agent tier and sends straight to the sampling gateways; at higher volume it moves into a two-layer gateway (a routing layer that shards by trace_id, then a sampling layer). This constraint is the reason gateway Collectors exist as a distinct tier and why tail sampling is a platform concern, not an app one.
agents ──(load-balancing exporter, hash by trace_id)──► gateway replica N
                                                        (sees the WHOLE trace)
                                                        → correct tail decision
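A minimal sketch of the routing rule, assuming a fixed list of gateway replicas. `TraceRouter` and both method names are hypothetical, and plain modulo hashing is used only to keep the example short; the real load-balancing exporter has to cope with replicas joining and leaving, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of trace-aware routing; not the exporter's implementation.
final class TraceRouter {
    private final List<String> replicas;
    private int next = 0; // cursor used only by the round-robin counter-example

    TraceRouter(List<String> replicas) {
        this.replicas = new ArrayList<>(replicas);
    }

    // Every span carrying the same trace_id maps to the same replica.
    String routeByTraceId(String traceId) {
        int bucket = Math.floorMod(traceId.hashCode(), replicas.size());
        return replicas.get(bucket);
    }

    // Counter-example: consecutive spans land on different replicas, whatever their trace.
    String routeRoundRobin() {
        String replica = replicas.get(next);
        next = (next + 1) % replicas.size();
        return replica;
    }
}
```

With `routeByTraceId`, the replica that receives the first span of a trace also receives the last, so its tail sampler judges the whole trace. With `routeRoundRobin`, two spans of one trace sent back to back go to two different replicas.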
Where each policy lives¶
- Agent: host/k8s resource enrichment, batching, initial OTLP receive. Cheap, local, per-node.
- Gateway: tail sampling (needs whole traces), PII redaction (one chokepoint), attribute normalisation (consistent names org-wide), the spanmetrics connector (RED from spans), multi-tenant routing, and exporter fan-out.
For a small org, a single gateway tier suffices. As trace volume grows, split routing from sampling. This topology is what lets a 200-service org change sampling or add a backend with zero app deploys.
Backends and Build-vs-Buy¶
The backend determines your capability ceiling (can you group by arbitrary high-cardinality fields?) and a large fraction of your cost.
The OSS / self-host stack¶
| Tool | Signal | Notes |
|---|---|---|
| Prometheus (or Mimir/Thanos for scale/HA) | Metrics | The TSDB standard. Cardinality is the cost/scaling ceiling: every label-value combination is its own series, so high-cardinality labels (user IDs, request IDs) are off-limits. |
| Grafana | Dashboards / query layer | The pane-of-glass over Tempo/Loki/Prometheus; exemplar support links metrics→traces. |
| Tempo | Traces | Cheap object-store-backed tracing; pairs with tail sampling in the Collector. |
| Loki | Logs | Label-indexed log store; correlate via trace_id. |
| Pyroscope / Parca | Profiles | Continuous profiling; span-aware linking. |
Strengths: control, no per-GB vendor pricing, data stays in-house, no lock-in. Costs: real operational toil (you run, scale, and upgrade it), and you inherit the TSDB cardinality ceiling — slicing to one customer needs the trace/log side, not Prometheus.
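The cardinality ceiling is easiest to see as arithmetic: in the worst case, one metric produces a series for every combination of its label values. The sketch below assumes that worst case. `CardinalityBudget`, its methods, and the label counts in the usage note are hypothetical.

```java
import java.util.Map;

// Hypothetical cardinality-budget check; label names and counts are illustrative.
final class CardinalityBudget {
    // Worst-case series for one metric: the product of distinct values per label.
    static long worstCaseSeries(Map<String, Long> distinctValuesPerLabel) {
        long series = 1;
        for (long distinct : distinctValuesPerLabel.values()) {
            series = Math.multiplyExact(series, distinct); // throws on overflow instead of wrapping
        }
        return series;
    }

    static boolean withinBudget(Map<String, Long> distinctValuesPerLabel, long budget) {
        return worstCaseSeries(distinctValuesPerLabel) <= budget;
    }
}
```

With 40 routes, 5 status classes and 3 plans the worst case is 600 series. Adding an account label with 50,000 values raises it to 30,000,000, which is why per-customer slicing belongs on the trace/log side.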
The SaaS options¶
| Vendor | Model | Notes |
|---|---|---|
| Honeycomb | Wide-event / columnar | Built around the arbitrarily-wide event; group by any high-cardinality field, BubbleUp outlier analysis. The native expression of the senior-level model. |
| Datadog | Turnkey all-in-one | Broad, polished, fast to adopt; pricing scales with hosts/volume/custom metrics and can dominate the bill. |
| Grafana Cloud | Managed OSS stack | Hosted Prometheus/Tempo/Loki — OSS model without the ops. |
| New Relic / others | All-in-one | Similar trade-offs to Datadog. |
The decision framework¶
| Choose build (OSS) when | Choose buy (SaaS) when |
|---|---|
| You have platform/SRE capacity to operate it | Engineering time is scarcer than money |
| Data residency / sovereignty is mandatory | You want capability now, not in a quarter |
| Volume is huge and per-GB SaaS pricing is prohibitive | The wide-event query model (Honeycomb) is the goal and you won't build a column store |
| You need deep customisation | You'd rather not run a TSDB at scale |
The pragmatic reality: most orgs end up hybrid — OTel as the universal instrumentation layer (so the choice is reversible), Prometheus/Grafana for cheap, high-value SLO metrics, and a SaaS (often Honeycomb-style) for high-cardinality trace exploration. Because everything speaks OTLP through the Collector, build-vs-buy becomes a per-signal, reversible decision rather than a one-way door. OTel is the hedge that keeps the question open.
The Observability Maturity Model¶
A staged view of where an org actually is. Useful for honest assessment and for sequencing investment.
| Level | Name | Characteristic | The telling question |
|---|---|---|---|
| 0 | Reactive | Logs grepped on boxes; alerts on host metrics | "What happened?" answered by SSH-ing in |
| 1 | Monitoring | Dashboards + threshold alerts for known failures | "Is X broken?" — but only X you predicted |
| 2 | Structured | Structured logs, RED metrics, some tracing; signals exist but are siloed | "We have logs, metrics, traces" (uncorrelated) |
| 3 | Correlated | Shared trace_id across signals; exemplars; you pivot signal→signal | "From this metric spike I can reach the trace and logs" |
| 4 | Observable | Wide events, high-cardinality ad-hoc queries; SLOs; debug unknown-unknowns with no deploy | "We cornered a novel failure in minutes by slicing on a dimension we never alerted on" |
| 5 | Proactive | Observability drives development; experiments in prod; cost-governed; observability is a paved road | "Instrumentation ships with the feature; we explore prod routinely, not just during incidents" |
Use it diagnostically, not as a vanity ladder. The jump that matters most is 2 → 3 → 4: from "we have the signals" to "they're correlated" to "we can ask new questions." Many orgs stall at level 2 with three expensive, disconnected pipelines and call it observability.
Cultural Adoption & Observability-Driven Development¶
Tooling is necessary and insufficient. The cultural shifts:
- Observability-Driven Development (ODD). Instrumentation is part of the feature's definition of done, written with the code — "what wide-event attributes would I need to debug this in prod?" — not bolted on after the first outage. Reviewers ask for it like they ask for tests.
- Debug in production, routinely. Normalise exploring prod telemetry as a daily habit, not a panic move. Engineers who only open the observability tool during incidents never build the fluency to use it well during incidents.
- Watch the ones who instrument richly. The teams that resolve incidents fast are the ones whose spans are wide and whose attributes match their domain. Make their patterns the paved road.
- Paved roads over mandates. Ship shared SDK config, auto-instrumentation, golden dashboards, and SLO templates so the easy path is the rich-instrumentation path. Mandates without ergonomics produce checkbox compliance (level-2 theatre).
- Blameless incident culture. Observability and blamelessness reinforce each other: people share what the data shows when they won't be punished for what it reveals (cross-ref post-mortem-analysis).
The adoption test is behavioural, not architectural: do engineers reach for the observability tool to answer questions, including outside incidents? If yes, you have a capability. If they only look at dashboards someone else built, you have shelfware.
SLO Practice at Org Scale¶
One service's SLO is a senior topic; fifty teams' SLOs is a professional one. The failure mode is bureaucracy or theatre. (Mechanics: engineering-metrics-and-dora.)
- Standardise the language, decentralise the targets. Define org-wide what an SLI is (user-perceived, request-level, split success/error), how to express it as a query over wide events, and what an error-budget policy is. Let each team own its own SLO numbers and own its budget. Centralising target-setting creates a committee bottleneck; decentralising the definition creates incomparable nonsense.
- SLOs on user journeys, not just services. "Checkout succeeds" is a journey crossing many services; per-service SLOs can all be green while the journey is broken. Define journey-level SLOs for the things users actually do.
- The error-budget policy must have teeth. Pre-agree the consequence of burning the budget (freeze features, shift to reliability work). An SLO with no consequence is a vanity metric.
- Alert on multi-window, multi-burn-rate budget consumption, not raw thresholds: page on a fast burn that threatens the budget, ticket on a slow burn. This is the org-scale answer to alert fatigue (see monitoring-alerting).
- Beware SLO inflation. Teams set 99.99% because it sounds responsible, then chronically miss it and the SLO loses all meaning. Set the target at the level users actually need; an SLO you reliably meet and that reflects user pain is worth more than an aspirational one you ignore.
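The burn-rate rule in the list above can be sketched in a few lines. Burn rate is the observed error ratio divided by the ratio the SLO allows, so a rate of 1.0 spends the budget exactly over the SLO period. `BurnRate` and its methods are hypothetical, and the threshold in the usage note is a commonly cited value for a 30-day SLO rather than anything this page prescribes.

```java
// Hypothetical burn-rate helper for multi-window alerting.
final class BurnRate {
    // 1.0 means the error budget is being spent exactly as fast as the SLO allows.
    static double burnRate(double errorRatio, double sloTarget) {
        return errorRatio / (1.0 - sloTarget);
    }

    // Page only when BOTH the long and the short window burn fast, so a blip
    // that has already recovered in the short window does not wake anyone.
    static boolean shouldPage(double longWindowErrorRatio, double shortWindowErrorRatio,
                              double sloTarget, double threshold) {
        return burnRate(longWindowErrorRatio, sloTarget) >= threshold
            && burnRate(shortWindowErrorRatio, sloTarget) >= threshold;
    }
}
```

For a 99.9% target, a 2% error ratio is a burn rate of about 20. A threshold of 14.4 over a one-hour window paired with a five-minute window is a frequent choice, because 14.4 sustained for one hour spends 2% of a 30-day budget; lower thresholds over longer windows would open a ticket instead of paging.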
Cost Governance¶
Telemetry is frequently a top-three infrastructure cost and grows faster than traffic (more services × more spans × more attributes). Without governance, the reflex when the bill spikes is to slash sampling — which blinds you precisely as you scale. (Deep dive: telemetry-cost.) The governance levers:
| Lever | What it does | Where |
|---|---|---|
| Tail sampling | Keep all errors/slow/key-tenants, sample the boring majority | Gateway Collector |
| Cardinality budgets | Cap per-team metric label cardinality; alert/drop on breach | Collector + policy |
| Attribute pruning | Drop attributes that are never queried (pure cost) | Collector (OTTL) |
| Retention tiers | Hot/cheap-cold tiers; keep raw events briefly, aggregates longer | Backend config |
| Cost attribution | Tag telemetry by team/service so spend is showback/chargeback | Resource attributes |
| Metrics vs traces split | Metrics are cheap aggregates (no sampling); spend the sampling budget on traces | Pipeline design |
The professional stance: make cost visible and attributable, so the team generating the spend sees the spend. The most powerful single move is keeping metrics unsampled and cheap (they power SLOs) while aggressively tail-sampling traces — full-fidelity "something's wrong" with sampled "here's why." And resist the panic-sampling reflex: the right cut is dropping boring data (unqueried attributes, successful-fast-path traces), not blanket-reducing fidelity on the failures you'll need.
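As a rough illustration of "drop boring data, keep the failures", the keep-decision can be written as an ordered set of rules. This is a sketch in Java of the idea only, not the Collector's tail-sampling implementation. `TailPolicy`, its parameters, and the 1000 ms and enterprise-plan rules are hypothetical choices for the example.

```java
// Hypothetical keep-decision for one completed trace; not the Collector's implementation.
final class TailPolicy {
    static boolean keep(String traceId, boolean hasError, long durationMs,
                        String customerPlan, int baselinePercent) {
        if (hasError) return true;                          // always keep failures
        if (durationMs >= 1000) return true;                // always keep slow traces
        if ("enterprise".equals(customerPlan)) return true; // always keep key tenants
        // Deterministic baseline: the same trace_id always gets the same verdict.
        return Math.floorMod(traceId.hashCode(), 100) < baselinePercent;
    }
}
```

Cutting cost then means lowering `baselinePercent`, which only thins the fast, successful majority; the first three rules are untouched, so fidelity on failures survives the budget cut.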
On-Call & Incident Response¶
Observability's payoff is realised in the incident. (Cross-ref post-mortem-analysis; see the monitoring-alerting skill for alert design.)
- Good observability shrinks MTTR by collapsing the investigation loop. An on-call engineer who can hypothesise → query wide events → narrow → confirm resolves in minutes; one who must add-log-and-redeploy is stuck for hours. The platform's ROI is mostly here.
- Alerts must be actionable. Every page should mean "a human must act now" and link to the relevant SLO and a starting query/dashboard. Non-actionable pages are alert fatigue, and fatigue causes missed real incidents — the most dangerous failure mode of all.
- Runbooks point into the telemetry. A good runbook is "open this SLO, run this group by, look for this pattern", not prose. Observability turns runbooks from rituals into queries.
- The incident timeline is reconstructed from telemetry. Wide events with build/deploy/flag attributes let the post-mortem answer "what changed at 14:02?" precisely. Capturing build.version and deployment on events is what makes "the deploy did it" a query rather than a guess.
- Feedback loop: every incident either confirms the instrumentation was sufficient or names the missing attribute. Add it. Over time, ODD + this loop drive the org up the maturity model.
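To make "the deploy did it" concrete, here is an in-memory stand-in for a group-by over wide events. A real backend would run this as a query; the sketch only shows the shape of the question. `DeployQuery`, the `error` attribute, and the event layout (a flat string map) are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for "group by build.version" over wide events.
final class DeployQuery {
    // Returns the error ratio per build version, sorted by version string.
    static Map<String, Double> errorRatioByBuild(List<Map<String, String>> events) {
        Map<String, int[]> counts = new TreeMap<>(); // value = {errors, total}
        for (Map<String, String> event : events) {
            String build = event.getOrDefault("build.version", "unknown");
            int[] c = counts.computeIfAbsent(build, k -> new int[2]);
            if ("true".equals(event.get("error"))) c[0]++;
            c[1]++;
        }
        Map<String, Double> ratios = new TreeMap<>();
        counts.forEach((build, c) -> ratios.put(build, (double) c[0] / c[1]));
        return ratios;
    }
}
```

If one build version shows a markedly higher error ratio than its predecessor over the incident window, the deploy becomes the leading hypothesis without anyone adding a log line.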
Goodhart Pitfalls on Observability Metrics¶
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
Observability KPIs are unusually easy to game because the capability (answering unknown-unknowns) is hard to measure, so proxies get targeted instead — and proxies get optimised. (The DORA/SPACE discussion in engineering-metrics-and-dora covers the general phenomenon.)
| Targeted metric | How it gets gamed | The real thing it was a proxy for |
|---|---|---|
| "% of services with a dashboard" | Auto-generate empty dashboards nobody uses | Can you actually answer questions? |
| "Instrumentation coverage %" | Thin auto-spans everywhere, zero business attributes | Wide, query-useful events |
| MTTR target | Resolve = "page acked"; or split one incident into many small "fast" ones | Genuinely faster diagnosis |
| Alert count down | Silence/raise thresholds until nothing fires (and real incidents are missed) | Less noise, not less signal |
| SLO compliance % | Set lax SLOs; define "valid request" to exclude the failures | User-perceived reliability |
| # of custom metrics | Emit thousands of unqueried metrics (and blow the cost) | Useful, queried signals |
Defences: measure outcomes, not activity (time-to-corner-a-novel-incident beats dashboard count); use a basket of metrics so gaming one shows up in another; pair every quantitative KPI with qualitative signal (do engineers say the tooling helps?); and never tie observability metrics directly to individual performance reviews — that guarantees gaming. The honest top-level question resists Goodharting because it's expensive to fake: the last three novel incidents — were they cornered by querying, with no new instrumentation?
Code Examples¶
Gateway Collector — org policy: tail sampling, PII redaction, span→metrics, multi-backend¶
# gateway-collector.yaml (the org POLICY enforcement point)
receivers:
  otlp: { protocols: { grpc: { endpoint: 0.0.0.0:4317 } } }
processors:
  # 1) Normalise + redact PII before anything leaves the building (one chokepoint).
  transform/redact:
    trace_statements:
      - context: span
        statements:
          - delete_key(attributes, "app.user.email")
          - replace_pattern(attributes["http.url"], "token=[^&]+", "token=REDACTED")
  # 2) Tail sampling: keep what you'll actually need; requires whole traces (see topology).
  tail_sampling:
    decision_wait: 12s
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 1000 } }
      - { name: key-tenant, type: string_attribute,
          string_attribute: { key: app.customer.plan, values: [enterprise] } }
      - { name: rest, type: probabilistic, probabilistic: { sampling_percentage: 3 } }
  batch: {}
connectors:
  # 3) Derive RED metrics from spans: full-fidelity metrics WITHOUT separate instrumentation.
  spanmetrics:
    histogram: { explicit: { buckets: [50ms, 100ms, 300ms, 1s, 3s] } }
    dimensions: [ { name: http.route }, { name: app.customer.plan } ]
  # Hands redacted, still-unsampled spans on to the sampling pipeline.
  forward/sampling: {}
exporters:
  otlp/tempo: { endpoint: tempo:4317, tls: { insecure: true } }        # sampled traces
  otlphttp/hny:                                                        # wide-event explore
    endpoint: https://api.honeycomb.io
    headers: { x-honeycomb-team: "${env:HONEYCOMB_API_KEY}" }          # env var name is a placeholder
  prometheusremotewrite: { endpoint: http://mimir:9009/api/v1/push }   # UNSAMPLED metrics
service:
  pipelines:
    # Every span is redacted and counted BEFORE sampling, so the metrics stay unsampled.
    traces/ingest:
      receivers: [otlp]
      processors: [transform/redact]
      exporters: [spanmetrics, forward/sampling]
    traces/sampled:
      receivers: [forward/sampling]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo, otlphttp/hny]
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [prometheusremotewrite]   # metrics are NOT tail-sampled
This one file is org-level governance: PII can't leak, fidelity is allocated, RED metrics are free and unsampled, and three backends are fed without a single app redeploy.
Agent Collector — trace-aware routing to the gateway (the tail-sampling prerequisite)¶
# agent-collector.yaml (per-node DaemonSet)
receivers:
  otlp: { protocols: { grpc: { endpoint: 0.0.0.0:4317 } } }
processors:
  k8sattributes: {}   # enrich with pod/namespace/node; node context lives near the node
  batch: {}
exporters:
  # Route ALL spans of a trace to the SAME gateway replica, hashing by trace_id,
  # so tail sampling downstream sees whole traces.
  loadbalancing:
    routing_key: traceID
    protocol: { otlp: { tls: { insecure: true } } }
    resolver: { dns: { hostname: otel-gateway-headless } }
service:
  pipelines:
    traces: { receivers: [otlp], processors: [k8sattributes, batch], exporters: [loadbalancing] }
Java — shared SDK config as a paved road (auto-instrument + standard resource attributes)¶
// Started via the OpenTelemetry Java agent; this config is the org "paved road".
//   -javaagent:opentelemetry-javaagent.jar
//   OTEL_SERVICE_NAME=checkout
//   OTEL_RESOURCE_ATTRIBUTES=service.version=4.2.1,deployment.environment=prod,team=payments
//   OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-agent:4317
//   OTEL_EXPORTER_OTLP_PROTOCOL=grpc            // 4317 is the gRPC receiver; recent agents default to http/protobuf
//   OTEL_TRACES_SAMPLER=parentbased_always_on   // SAMPLE AT THE TAIL in the gateway, not here
//
// Teams add only business attributes; the platform owns the rest.
import io.opentelemetry.api.trace.Span;

final class CheckoutHandler {
    // Checkout is the application's own domain type (not shown here).
    void onCheckout(Checkout c) {
        Span span = Span.current();                            // auto-created by the agent
        span.setAttribute("app.account.id", c.accountId());    // identity (high-card: fine on a span)
        span.setAttribute("app.customer.plan", c.plan());      // business dimension to group by
        span.setAttribute("app.payment.provider", c.provider());
    }
}
The platform ships the agent, the resource attributes (including team for cost attribution), and the "sample at the tail, not the head" decision. Teams write only the domain attributes — the paved road that makes rich instrumentation the easy path.
Use Cases¶
| Situation | Professional move |
|---|---|
| 200 polyglot services, inconsistent telemetry | OTel paved road: shared agent config, golden dashboards, standard resource attributes |
| Tail sampling needed across replicas | Trace-aware load-balancing exporter (hash by trace_id) → sharded gateway tier |
| PII must never reach a SaaS backend | Redaction stage in the gateway as the single chokepoint |
| Telemetry bill is a top-3 cost | Tail-sample traces, keep metrics unsampled, cardinality budgets, cost attribution by team |
| Build-vs-buy under debate | OTel as the hedge; per-signal, reversible choice (OSS metrics + SaaS wide-event explore) |
| SLO practice has become theatre | Standardise the SLI language, decentralise targets, error-budget policy with teeth, journey-level SLOs |
| Org stuck at "we have three pillars" | Push 2→3→4: shared trace_id, exemplars, then wide-event ad-hoc querying |
| "100% have dashboards" but incidents still slow | Goodhart trap — measure time-to-corner-a-novel-incident instead |
Mental Models¶
1. The platform is a product; teams are its customers¶
Make rich instrumentation the easy path and the wrong thing hard. Paved roads, self-service, defaults — not mandates. Adoption is measured by behaviour, not coverage percentages.
2. The gateway Collector is the policy plane¶
Tail sampling, redaction, normalisation, cost control, routing — all centralised in the gateway, changeable without app redeploys. Owning the topology is owning governance.
3. OTel is the reversibility hedge¶
Because everything speaks OTLP through the Collector, build-vs-buy is a per-signal, reversible decision instead of a one-way door. Instrument once; keep the backend question open.
4. Metrics cheap and full-fidelity; traces sampled and rich¶
The economic core of cost governance: counters count everything cheaply (and power SLOs), while per-event traces are tail-sampled for fidelity-where-it-matters. "Something's wrong" at full fidelity; "here's why" sampled.
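A short worked example shows why SLOs have to come from the unsampled side. If tail sampling keeps every error but only a fraction of successes, the error ratio computed from the kept traces is inflated. `SamplingBias` and its method names are hypothetical, and the numbers in the note are illustrative.

```java
// Hypothetical arithmetic only: error ratio as seen through tail-sampled traces.
final class SamplingBias {
    // Ratio in the kept set when all errors and a fraction of successes are kept.
    static double apparentErrorRatio(long total, long errors, double successKeepFraction) {
        double keptSuccesses = (total - errors) * successKeepFraction;
        return errors / (errors + keptSuccesses);
    }

    static double trueErrorRatio(long total, long errors) {
        return (double) errors / total;
    }
}
```

With 100,000 requests, 1,000 errors and a 3% baseline, the kept set holds 1,000 errors and about 2,970 successes. The traces suggest an error rate of roughly 25% while the true figure is 1%, so any SLO computed from them would be badly wrong.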
5. Maturity is correlation, then ad-hoc query¶
The leap that matters is 2→3→4: from siloed signals, to correlated signals, to answering unknown-unknowns. Three expensive disconnected pipelines is not observability.
6. Every observability metric will be gamed¶
If it's a target, it'll be optimised — usually away from the capability it proxied. Measure outcomes, use a basket, pair with qualitative signal, and never tie to individual reviews.
Best Practices¶
- Layer the Collector: agents for local enrichment/batching, a gateway tier for tail sampling, redaction, normalisation, span-metrics, and routing.
- Route by trace_id (load-balancing exporter) so tail sampling sees whole traces; this constraint shapes the whole topology.
- Redact PII at one chokepoint in the gateway, especially before any SaaS export.
- Keep metrics unsampled and cheap; tail-sample traces for fidelity-where-it-matters; SLOs come from the unsampled metrics.
- Use OTel as the reversibility hedge — make build-vs-buy a per-signal, reversible choice.
- Ship paved roads: shared SDK/agent config, golden dashboards, SLO templates, standard resource attributes (incl. team for cost attribution).
- Standardise the SLI language, decentralise targets, define journey-level SLOs, and give the error-budget policy real teeth.
- Govern cost with cardinality budgets, attribute pruning, retention tiers, and showback — make spend visible to the team that creates it.
- Make every alert actionable and link it to an SLO + a starting query; ruthlessly cut non-actionable pages.
- Measure observability by outcomes (time-to-corner-a-novel-incident), resist Goodhart, never tie to individual performance.
Edge Cases & Pitfalls¶
- Sharded gateways breaking tail sampling. Round-robining spans across replicas gives each a fragment → wrong decisions. Route by trace_id.
- Tail-sampler memory pressure and long-lived traces. The gateway buffers every in-flight trace until decision_wait expires, so memory scales with volume × wait, and a 30-minute trace outlives any sensible window and is judged on a fragment. Cap durations; size the tier; tune decision_wait.
- PII leaking to a SaaS because redaction was per-service and one service skipped it. Centralise redaction in the gateway.
- SLO theatre: aspirational 99.99% targets chronically missed and ignored, or "valid request" defined to exclude the failures. Set targets to real user need.
- Per-service SLOs all green, the user journey broken. Add journey-level SLOs.
- Cost panic → blanket sampling cut that blinds you as you scale. Cut boring data, not failure fidelity; keep metrics unsampled.
- Cardinality bombs from resource attributes (`host.name` per pod) inflating metric series. These attributes are fine on events; guard against promoting them to metric labels.
- Goodharted KPIs: dashboards/coverage/MTTR targets gamed into meaninglessness. Measure outcomes; basket of metrics; qualitative signal.
- "Single pane of glass" as a procurement slogan: buying one tool doesn't create correlation; correlation is shared `trace_id` + exemplars + wiring.
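To make the cardinality pitfall concrete, here is a rough worst-case series estimate. It is a sketch under stated assumptions: the label names, the counts, and the `series_upper_bound` helper are all hypothetical, and backends only store label combinations that actually occur, so the product is an upper bound rather than a prediction.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict) -> int:
    """Worst case: one time series per combination of label values."""
    return prod(label_cardinalities.values())


# Hypothetical request-count metric for one service.
base_labels = {"route": 20, "method": 4, "status_class": 5}
# The same metric after a per-pod resource attribute is promoted to a label.
with_pod_label = {**base_labels, "host.name": 300}

before = series_upper_bound(base_labels)
after = series_upper_bound(with_pod_label)

print(f"without host.name: {before:,} series")
print(f"with host.name:    {after:,} series ({after // before}x)")
```

Under these made-up numbers one extra label multiplies the worst case by 300. Pod churn makes it worse over a retention window, because every replaced pod contributes new label values.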
Common Mistakes¶
- Treating observability as a tooling purchase. A SaaS contract doesn't make you observable; rich instrumentation + correlation + culture do.
- No gateway tier, so tail sampling/redaction/cost control are scattered into apps and can't change without redeploys.
- Round-robin span routing that silently corrupts tail-sampling decisions.
- Sampling metrics (then computing SLOs from them and getting the error rate wrong) — metrics should be unsampled.
- Mandating coverage without ergonomics — checkbox compliance, thin spans, level-2 theatre.
- SLOs with no error-budget policy — vanity numbers with no consequence.
- Slashing sampling when the bill spikes instead of dropping unqueried/boring data — going blind at scale.
- Targeting observability KPIs (dashboard count, MTTR) and getting them gamed; tying them to individual reviews.
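A short worked example shows why an SLI computed from sampled data goes wrong, for instance when span-metrics are generated after the tail sampler. The traffic numbers and keep-rates below are invented for illustration; the keep-rates mimic a sampler that retains every failure and a small share of successes.

```python
def error_rate(errors: float, total: float) -> float:
    """Fraction of requests that failed."""
    return errors / total


# Hypothetical hour of traffic.
total_requests = 1_000_000
failed_requests = 1_000            # true error rate: 0.1%
ok_requests = total_requests - failed_requests

# Assumed tail-sampling keep rates: all failures, 3% of successes.
KEEP_ERRORS = 1.0
KEEP_OK = 0.03

kept_errors = failed_requests * KEEP_ERRORS
kept_ok = ok_requests * KEEP_OK

true_rate = error_rate(failed_requests, total_requests)
sampled_rate = error_rate(kept_errors, kept_errors + kept_ok)

print(f"from unsampled metrics: {true_rate:.3%}")
print(f"from sampled traces:    {sampled_rate:.3%}")
```

Under these assumptions the sampled view reports an error rate of roughly 3.2%, more than thirty times the real 0.1%. The bias can in principle be corrected if every kept span carries its sampling weight, but computing the SLI from unsampled metrics avoids the problem altogether.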
Tricky Points¶
- Tail sampling forces topology. Because the decision needs the whole trace, you must route by `trace_id` to a sticky gateway. This is why a gateway tier exists, and why you can't just scale Collectors statelessly for traces.
- Metrics and traces want opposite sampling. Metrics: unsampled (cheap aggregates, power SLOs). Traces: sampled (per-event, expensive). Designing them the same way breaks either cost or SLO accuracy.
- OTel makes build-vs-buy reversible and per-signal — the strategic value of OTel is as much about keeping the question open as about instrumentation.
- The maturity leap is correlation, not coverage. You can have 100% of services emitting all three pillars and still be at level 2 if they share no identity.
- Cost governance trades off against visibility. The cheapest pipeline is the one that records nothing; the value is in keeping the right fidelity, not in minimising volume.
- Goodhart hits observability harder than most domains because the real capability is hard to measure, so easy proxies get targeted and gamed.
- PII redaction belongs in the gateway, not the app — one chokepoint you can audit, versus N services any one of which can leak.
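The single-chokepoint idea can be sketched as one function that every span's attributes pass through before export. This is illustrative Python, not Collector configuration: the denylisted keys, the email pattern, and the `redact_attributes` helper are assumptions, and a real gateway would express the same policy in its processor configuration.

```python
import re

# Hypothetical policy: keys to drop outright, plus a pattern to mask inside values.
DENYLISTED_KEYS = {"user.email", "card.number", "http.request.header.authorization"}
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_attributes(attributes: dict) -> dict:
    """Return a copy with denylisted keys removed and emails masked in string values."""
    clean = {}
    for key, value in attributes.items():
        if key in DENYLISTED_KEYS:
            continue  # drop the attribute entirely
        if isinstance(value, str):
            value = EMAIL_PATTERN.sub("[REDACTED_EMAIL]", value)
        clean[key] = value
    return clean


span_attributes = {
    "http.route": "/checkout",
    "user.email": "ada@example.com",
    "error.message": "receipt to ada@example.com bounced",
    "http.response.status_code": 502,
}

print(redact_attributes(span_attributes))
```

Because the policy lives in one place, auditing it means reading one denylist and one pattern rather than reviewing every service's instrumentation.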
Test Yourself¶
- Draw the agent→gateway Collector topology. What lives where, and why does tail sampling force a `trace_id`-aware load balancer?
- Why must metrics be unsampled while traces are tail-sampled? What breaks if you sample metrics?
- Give the build-vs-buy decision framework. How does OTel change the nature of the decision?
- Place an org at a maturity level from a one-line description, and name the next leap. Why is 2→3 the hard one?
- What is Observability-Driven Development, and what's the behavioural test for whether adoption succeeded?
- How do you run SLO practice across fifty teams without bureaucracy or theatre? Why journey-level SLOs?
- List five cost-governance levers and where each lives. Why is "slash sampling" the wrong reflex?
- Give three observability KPIs and exactly how each gets Goodharted. What defends against it?
- Design the gateway processing pipeline for a payments platform: redaction, tail sampling, span-metrics, multi-backend. Justify the ordering.
Cheat Sheet¶
┌──────────────── OBSERVABILITY ENGINEERING — PROFESSIONAL CHEAT SHEET ────────────────────────┐
│ │
│ PLATFORM = pipeline + policy + product. teams are your customers; paved roads > mandates. │
│ │
│ COLLECTOR TOPOLOGY │
│ AGENT (per-node): receive · k8s-enrich · batch · route-by-trace_id │
│ GATEWAY (scaled): TAIL SAMPLING · PII redaction · normalise · span→metrics · routing │
│ tail sampling needs WHOLE traces → load-balance by trace_id to a STICKY gateway replica │
│ │
│ BACKENDS / BUILD-vs-BUY │
│ OSS: Prometheus/Mimir(metrics) Grafana Tempo(traces) Loki(logs) — control, TSDB card ceiling│
│ SaaS: Honeycomb (wide-event/high-card) · Datadog (turnkey, volume-priced) │
│ OTel = the HEDGE → per-signal, REVERSIBLE choice. most orgs go hybrid. │
│ │
│ MATURITY: 0 reactive →1 monitor →2 structured(siloed) →3 CORRELATED →4 OBSERVABLE →5 proactive│
│ the leap that matters = 2→3→4 (correlation, then ad-hoc query on unknown-unknowns) │
│ │
│ COST GOVERNANCE │
│ metrics UNSAMPLED+cheap (power SLOs) │ traces TAIL-SAMPLED (fidelity where it matters) │
│ cardinality budgets · attribute pruning · retention tiers · showback by team │
│ bill spikes → drop BORING data, NOT failure fidelity │
│ │
│ SLO@SCALE: standardise the LANGUAGE, decentralise TARGETS; journey-level; budget policy w/teeth│
│ GOODHART: measure OUTCOMES (time-to-corner-a-novel-incident), basket of metrics, never reviews │
└─────────────────────────────────────────────────────────────────────────────────────────────────┘
Summary¶
- Observability at org scale is a platform with policy and a product surface. Teams are the customers; the platform's job is to make rich instrumentation the easy path (paved roads) and the wrong thing hard (cardinality/PII guardrails).
- The Collector topology is the policy plane. Agents enrich and batch near the workload; a gateway tier does tail sampling, PII redaction, normalisation, span-metrics, and routing, all centrally and without app redeploys. Tail sampling needs whole traces, which forces `trace_id`-aware routing to a sticky gateway replica.
- Backends set the capability ceiling and a big share of cost. OSS (Prometheus/Grafana/Tempo/Loki) gives control but imposes a TSDB cardinality ceiling; SaaS (Honeycomb's wide-event model, Datadog's turnkey breadth) buys capability for money. OTel is the hedge that makes build-vs-buy a per-signal, reversible choice; most orgs run hybrid.
- The maturity model runs reactive → monitoring → structured → correlated → observable → proactive; the leap that matters is correlation then ad-hoc querying of unknown-unknowns. Many orgs stall at three siloed pillars.
- Adoption is cultural: Observability-Driven Development, routine prod debugging, blameless post-mortems, paved roads over mandates. The test is behavioural — do engineers query the tool to answer questions, including outside incidents?
- SLOs scale by standardising the language and decentralising targets, defining journey-level SLOs, and giving the error-budget policy teeth.
- Cost governance keeps metrics unsampled and cheap (SLOs) while tail-sampling traces for fidelity-where-it-matters; cut boring data, not failure fidelity; attribute spend to teams.
- Observability metrics get Goodharted because the real capability is hard to measure — measure outcomes, use a basket, pair with qualitative signal, and never tie to individual performance.
Further Reading¶
- Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda (O'Reilly) — wide events, maturity, the cultural argument.
- Site Reliability Engineering & The Site Reliability Workbook — Google — SLOs, error-budget policy, multi-burn-rate alerting, on-call.
- OpenTelemetry Collector docs — https://opentelemetry.io/docs/collector/ — receivers/processors/exporters, the load-balancing exporter, tail-sampling processor, OTTL.
- "Scaling the OpenTelemetry Collector" and tail-sampling architecture write-ups: agent/gateway topology and the `trace_id` routing constraint.
- Implementing Service Level Objectives, Alex Hidalgo (O'Reilly): SLO practice at scale.
- The `observability-stack` and `monitoring-alerting` skills: backend selection and humane alert design.
- Quality Engineering → Engineering Metrics & DORA: DORA/SPACE and the Goodhart discussion.
Related Topics¶
- Down a level: senior.md — wide-event design, cardinality economics, span/data-model design, the debugging loop, SLO/alerting fundamentals.
- Foundations: middle.md · junior.md.
- Interview prep: interview.md. Practice: tasks.md.
Sibling diagnostic topics:
- Telemetry Cost & Sampling Strategy — the cost/sampling deep dive.
- Tracing · Metrics · Logging · Continuous Profiling — the signals the platform carries.
- Dynamic Instrumentation & eBPF — observing without code changes.
- Crash Reporting · Post-Mortem Analysis — incident follow-through.
Cross-roadmap links:
- Quality Engineering → Engineering Metrics & DORA — SLOs/error budgets and Goodhart.
- Quality Engineering → Testing → Testing in Production — observability as its prerequisite.
Diagrams & Visual Aids¶
Agent → Gateway topology (and the tail-sampling constraint)¶
services ─OTLP─► AGENT (per node)              GATEWAY TIER (scaled)
                 k8s-enrich · batch          ┌──────────────────────────────┐
                 route by trace_id ─────────►│ replica that owns this trace │
                 (load-balancing exporter)   │ tail sample · redact PII     │
                                             │ normalise · span→metrics     │
                                             └──────┬──────────┬────────────┘
                                                    ▼          ▼
                                            traces (sampled)   metrics (UNSAMPLED)
                                            → Tempo/Honeycomb  → Prometheus/Mimir
✗ round-robin spans → each replica sees a fragment → WRONG tail decision
✓ hash by trace_id → one replica sees the whole trace → CORRECT decision
Build-vs-buy with OTel as the hedge¶
┌──────── OTel SDK (instrument ONCE, vendor-neutral) ────────┐
│                     OTLP via Collector                     │
▼                                                            ▼
OSS (build): Prometheus/Grafana/Tempo/Loki    SaaS (buy): Honeycomb / Datadog
control · TSDB card ceiling · ops toil        capability now · volume-priced
└──────── reversible, PER-SIGNAL choice (most orgs go hybrid) ──────────┘
The maturity ladder¶
0 reactive ─► 1 monitoring ─► 2 structured(SILOED) ─► 3 CORRELATED ─► 4 OBSERVABLE ─► 5 proactive
grep boxes    threshold       3 pillars, no           shared trace_id wide-event      ODD, prod
              alerts          shared identity         + exemplars     ad-hoc query    exploration
                              └──── the leap that matters: 2 → 3 → 4 ────┘
Cost governance: fidelity allocation¶
metrics ───────────────────────────────────► UNSAMPLED · cheap aggregates · power SLOs
traces  ─► [gateway tail sampler] ─► keep 100%: errors · slow · key tenants
                                     keep   3%: boring successful majority
+ cardinality budgets · attribute pruning · retention tiers · showback by team
bill spikes? → drop BORING data, never failure fidelity.
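The fidelity allocation in the diagram can be sketched as a decision function applied once a trace is complete. The latency threshold, tenant names, and `Trace` fields are hypothetical; the 3% baseline matches the diagram, and hashing the trace ID is an assumed way to make the baseline decision deterministic, so that re-evaluating a trace gives the same answer.

```python
import hashlib
from dataclasses import dataclass

SLOW_THRESHOLD_MS = 2_000            # assumed latency cut-off
KEY_TENANTS = {"tenant-enterprise"}  # hypothetical tenant IDs
BASELINE_KEEP_RATE = 0.03            # share of the boring successful majority


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float
    tenant: str


def baseline_bucket(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def keep(trace: Trace) -> bool:
    """Keep every interesting trace; keep a fixed share of the rest."""
    if trace.has_error:
        return True
    if trace.duration_ms >= SLOW_THRESHOLD_MS:
        return True
    if trace.tenant in KEY_TENANTS:
        return True
    return baseline_bucket(trace.trace_id) < BASELINE_KEEP_RATE


boring = [Trace(f"t-{i}", False, 120.0, "tenant-free") for i in range(10_000)]
kept_share = sum(keep(t) for t in boring) / len(boring)

print(keep(Trace("t-err", True, 80.0, "tenant-free")))  # errors are always kept
print(f"boring traces kept: {kept_share:.1%}")
```

The order of the checks does not change the outcome here, since any matching rule keeps the trace; what matters is that failures, slow requests, and key tenants never fall through to the probabilistic baseline.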