Metrics Roadmap

"What you can't measure, you can't improve — but what you measure badly, you'll optimise for the wrong thing."

This roadmap is about emitting numeric signals from a running program — counters, gauges, histograms, and the discipline of choosing what to measure. It's the second of the "three pillars of observability" alongside Logging and Tracing.

Looking for system-level observability (Prometheus stack, Grafana, alert design)? See Backend → Observability.

Looking for profiling (CPU / memory / lock contention)? See Quality Engineering → Performance → Profiling.

This section is the language-level discipline — what your code emits, not what your monitoring stack ingests.

Why a Dedicated Roadmap

Every team's first metric is "request count" — and most stop there. A senior engineer knows:

Counters can only go up — wrong tool for "current connections"
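Histograms stored as buckets cost a fixed number of time series no matter how many observations arrive, and the buckets can be summed across instances; percentiles pre-computed in the client (summaries) are cheap to query but cannot be aggregated afterwards. Choose deliberately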
High-cardinality labels (user ID, request ID) create one time series per distinct value and overwhelm your TSDB; the same data as a log or trace attribute is fine
The metric you wish you had at 3 a.m. is the one you didn't think mattered at design time
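To make the counter, gauge, and bucket distinctions above concrete, here is a minimal standard-library sketch. The classes `Counter`, `Gauge`, and `Histogram` and the metric names are hypothetical illustrations, not the API of any real client library; real clients add labels, thread safety, and exposition formats.

```python
import bisect


class Counter:
    """Monotonic: the value only ever increases."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that may go up or down."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Fixed upper bounds: storage is one count per bucket, not per observation."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


# Hypothetical metric names, chosen only for this example.
requests_total = Counter()
open_connections = Gauge()
latency_seconds = Histogram([0.1, 0.5, 1.0])

for duration in (0.05, 0.3, 0.3, 2.0):
    open_connections.inc()   # connection opened
    requests_total.inc()     # one more request, ever
    latency_seconds.observe(duration)
    open_connections.dec()   # connection closed
```

After the loop the counter holds the lifetime total, the gauge is back at zero, and the histogram has kept four integers however many requests were observed. That is why "current connections" belongs in a gauge and why bucket boundaries have to be chosen up front.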

Roadmap	Question it answers
Logging	What happened, in human-readable form?
Tracing	What's the path of one request through the system?
Metrics (this)	What's the aggregate behaviour over time?

Sections

#	Topic	Focus
01	What Counts as a Metric	Numeric, time-series, aggregatable; metric vs log vs trace
02	Counter, Gauge, Histogram, Summary	The four base types, what each is for, what they cost
03	Labels & Cardinality	Why user-ID labels kill your TSDB; high vs low cardinality dimensions
04	Naming & Units	`_seconds`, `_bytes`, `_total`; Prometheus conventions; OpenMetrics
05	Pull vs Push	Prometheus scrape vs StatsD push; trade-offs and failure modes
06	The Four Golden Signals	Latency, traffic, errors, saturation (Google SRE book)
07	RED & USE	Request-rate / Error-rate / Duration; Utilisation / Saturation / Errors
08	Percentiles & Histograms	p50/p95/p99, why averages lie, HDR histograms, bucket sizing
09	OpenTelemetry Metrics SDK	Cross-language standard; instruments, views, exporters
10	Language-Specific Clients	`prometheus_client` (Python), Micrometer (JVM), `prometheus/client_golang` (Go), `metrics` (Rust)
11	Cost & Sampling	Metric cardinality explosions, sampling strategies, dropping vs aggregating at source
12	Anti-patterns	"Just emit everything," metric-driven design, gauge-as-counter, untyped strings
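As a small preview of the "why averages lie" point in topic 08, the sketch below compares the mean with nearest-rank percentiles on a made-up latency sample. The `percentile` helper and the sample values are assumptions for illustration only; production systems usually estimate percentiles from histogram buckets rather than from raw samples.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up data: 95 fast requests (0.1 s) and 5 slow ones (5 s).
latencies = [0.1] * 95 + [5.0] * 5

mean = statistics.fmean(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean={mean:.3f}s p50={p50}s p99={p99}s")
```

The mean lands between the two groups and describes no request that actually happened, while p50 and p99 report the typical and the tail experience separately.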

Languages

Examples in Go (prometheus/client_golang, expvar), Java (Micrometer, JFR-as-metrics), Python (prometheus_client, OpenTelemetry SDK), Rust (metrics, prometheus), and Node (prom-client, OpenTelemetry).

Status

⏳ Structure defined; content pending.

References

Site Reliability Engineering — Google SRE Book (Four Golden Signals chapter)
Observability Engineering — Majors, Fong-Jones, Miranda (the cardinality argument)
Histograms with Prometheus: A Tale of Woe — Björn Rabenstein
USE Method — Brendan Gregg

Project Context

Part of the Senior Project — a personal effort to consolidate the essential knowledge of software engineering in one place.