Metrics Roadmap
"What you can't measure, you can't improve — but what you measure badly, you'll optimise for the wrong thing."
This roadmap is about emitting numeric signals from a running program — counters, gauges, histograms, and the discipline of choosing what to measure. It's the second of the "three pillars of observability" alongside Logging and Tracing.
Looking for system-level observability (Prometheus stack, Grafana, alert design)? See Backend → Observability.
Looking for profiling (CPU / memory / lock contention)? See Quality Engineering → Performance → Profiling.
This section is the language-level discipline — what your code emits, not what your monitoring stack ingests.
Why a Dedicated Roadmap
Every team's first metric is "request count" — and most stop there. A senior engineer knows:
- Counters can only go up (apart from a reset to zero on restart), so they are the wrong tool for "current connections"; that is a gauge
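- Summaries pre-compute percentiles in the client and cannot be aggregated across instances; histograms store bucket counts that can be summed, at the cost of one series per bucket and precision limited by bucket width. Choose deliberately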
- High-cardinality labels (user ID, request ID) create one time series per distinct value and can overwhelm your TSDB, but the same data as a log or trace attribute is fine
- The metric you wish you had at 3 a.m. is the one you didn't think mattered at design time
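The first and third points above can be made concrete with a small sketch. It is a toy model written with the standard library only, not the API of any real client: the `Counter` and `Gauge` classes, the `series_count` helper, and the metric names are all hypothetical, chosen for illustration.

```python
class Counter:
    """Monotonic count of events; the value never decreases while the process lives."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount


class Gauge:
    """Point-in-time value that may rise or fall, e.g. open connections."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


def series_count(label_values: dict[str, list[str]]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(set(values))
    return total


requests_total = Counter()   # hypothetical metric: requests served so far
open_connections = Gauge()   # hypothetical metric: connections open right now

for _ in range(3):           # three clients connect and send one request each
    requests_total.inc()
    open_connections.inc()
open_connections.dec()       # one client disconnects; only the gauge can reflect it

# Bounded label values keep the series count small.
low = series_count({"method": ["GET", "POST"], "status": ["200", "404", "500"]})

# A per-user label multiplies the series count by the number of users.
high = series_count({"method": ["GET", "POST"],
                     "user_id": [str(i) for i in range(10_000)]})

print(requests_total.value, open_connections.value, low, high)
```

With two methods and three status codes the metric is bounded at 6 series. Swapping the status label for a `user_id` label with 10,000 users raises the bound to 20,000 series for the same single metric.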
| Roadmap | Question it answers |
|---|---|
| Logging | What happened, in human-readable form? |
| Tracing | What's the path of one request through the system? |
| Metrics (this) | What's the aggregate behaviour over time? |
Sections
| # | Topic | Focus |
|---|---|---|
| 01 | What Counts as a Metric | Numeric, time-series, aggregatable; metric vs log vs trace |
| 02 | Counter, Gauge, Histogram, Summary | The four base types, what each is for, what they cost |
| 03 | Labels & Cardinality | Why user-ID labels kill your TSDB; high vs low cardinality dimensions |
| 04 | Naming & Units | _seconds, _bytes, _total; Prometheus conventions; OpenMetrics |
| 05 | Pull vs Push | Prometheus scrape vs StatsD push; trade-offs and failure modes |
| 06 | The Four Golden Signals | Latency, traffic, errors, saturation (Google SRE book) |
| 07 | RED & USE | Request-rate / Error-rate / Duration; Utilisation / Saturation / Errors |
| 08 | Percentiles & Histograms | p50/p95/p99, why averages lie, HDR histograms, bucket sizing |
| 09 | OpenTelemetry Metrics SDK | Cross-language standard; instruments, views, exporters |
| 10 | Language-Specific Clients | prometheus_client (Python), Micrometer (JVM), prometheus/client_golang (Go), metrics (Rust) |
| 11 | Cost & Sampling | Metric cardinality explosions, sampling strategies, dropping vs aggregating at source |
| 12 | Anti-patterns | "Just emit everything," metric-driven design, gauge-as-counter, untyped strings |
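As a preview of topics 02 and 08, the sketch below shows why bucketed histograms aggregate across instances and why an average can mislead. It is a simplified model, assuming fixed upper bounds and "less than or equal" bucket semantics. The `Histogram` class, its methods, the bucket bounds, and the latency values are all hypothetical, not taken from any specific client library.

```python
import bisect
import math


class Histogram:
    """Fixed-bucket histogram: counts observations per upper bound, plus sum and count."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # final slot is the +Inf bucket
        self.total = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Index of the first bound >= value, i.e. "less than or equal" semantics.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def merge(self, other: "Histogram") -> "Histogram":
        """Histograms with identical bounds aggregate by adding bucket counts."""
        if self.bounds != other.bounds:
            raise ValueError("bucket bounds must match to merge")
        merged = Histogram(self.bounds)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        merged.total = self.total + other.total
        merged.count = self.count + other.count
        return merged

    def percentile_upper_bound(self, percent: int) -> float:
        """Upper bound of the bucket that contains the given percentile."""
        rank = -(-percent * self.count // 100)  # integer ceiling of percent% of count
        seen = 0
        for bound, n in zip(self.bounds + [math.inf], self.counts):
            seen += n
            if seen >= rank:
                return bound
        return math.inf


bounds = [0.05, 0.1, 0.5, 1.0, 5.0]  # seconds; illustrative bucket layout
instance_a = Histogram(bounds)
instance_b = Histogram(bounds)

for _ in range(95):
    instance_a.observe(0.02)  # fast requests
for _ in range(5):
    instance_b.observe(2.0)   # slow requests

merged = instance_a.merge(instance_b)
mean = merged.total / merged.count
p50 = merged.percentile_upper_bound(50)
p99 = merged.percentile_upper_bound(99)

print(f"mean={mean:.3f}s p50<={p50}s p99<={p99}s")
```

The mean comes out near 0.119 s, a latency that no request in this data set experienced. The bucket view reports that half the requests finished within 0.05 s while the 99th percentile lies in the bucket bounded by 5.0 s. The answer is only as precise as the bucket layout, which is why bucket sizing gets its own discussion.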
Languages
Examples in Go (prometheus/client_golang, expvar), Java (Micrometer, JFR-as-metrics), Python (prometheus_client, OpenTelemetry SDK), Rust (metrics, prometheus), and Node (prom-client, OpenTelemetry).
Status
⏳ Structure defined; content pending.
References
- Site Reliability Engineering — Google SRE Book (Four Golden Signals chapter)
- Observability Engineering — Majors, Fong-Jones, Miranda (the cardinality argument)
- Histograms with Prometheus: A Tale of Woe — Björn Rabenstein
- USE Method — Brendan Gregg
Project Context
Part of the Senior Project — a personal effort to consolidate the essential knowledge of software engineering in one place.