Concurrent Counters — Professional Level¶
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Beyond Counters: Why Histograms
- HDR Histograms in Depth
- Implementing an HDR-Backed Counter
- Percentile-Preserving Merging
- Time-Bucketed Histograms
- NUMA-Aware Counter Architecture
- Observability Subsystem Architecture
expvar+ Prometheus + OpenTelemetry- Multi-Process Counter Aggregation
- Cardinality Management
- Adaptive Counters
- Counters in Extreme Environments
- Performance Engineering at Scale
- Operational Concerns
- Security & Compliance
- Testing Strategy
- Common Mistakes at Scale
- Tricky Questions
- Cheat Sheet
- Self-Assessment Checklist
- Summary
- What You Can Build
- Further Reading
- Related Topics
- Diagrams & Visual Aids
Introduction¶
Focus: "Histograms (HDR), per-CPU counters at extreme scale, full observability subsystem design, NUMA, multi-process, cardinality, and the operational craft of metrics at production-grade volume."
At the senior level you built counters that scale to many cores. You can pad, shard, per-P, sloppy, and dynamic-shard. You are no longer afraid of false sharing. You can interpret a perf c2c report and a -bench -cpu curve.
This file is about the design of systems of counters: full observability subsystems, integration with Prometheus and OpenTelemetry, NUMA-aware shard placement, HDR histograms for distributions, percentile-preserving merging, and the operational concerns that show up only at scale (cardinality, multi-process aggregation, exposure security).
You will learn:
- Why and when to use HDR histograms instead of counters
- How to implement a high-throughput, low-error HDR-histogram-backed metric
- Time-bucketed histograms for sliding windows
- NUMA-aware shard placement for multi-socket systems
- The full architecture of a production observability subsystem: primitives → registry → exposition → transport → backend
- How
expvar, Prometheus, and OpenTelemetry coexist and when to choose which - Multi-process counter aggregation patterns
- Cardinality bombing — how to avoid blowing up your metrics backend
- Adaptive counters that adjust behaviour based on observed load
- Operational concerns: dashboards, alerts, runbooks built on counter outputs
By the end, you can design the metrics subsystem for a service handling millions of RPS across hundreds of instances and dozens of teams.
Prerequisites¶
- Required: Senior-level mastery of concurrent counters: padding, sharding, per-P, sloppy,
LongAdder. - Required: Experience operating production services. You should have been paged on a counter-driven alert at least once.
- Required: Familiarity with Prometheus and at least one of OpenTelemetry, Datadog, or Honeycomb.
- Required: Comfort with Go runtime internals — what
runtime.NumGoroutinemeasures, how the GC interacts with allocations, where the stack lives. - Helpful: Experience with NUMA-aware service deployment (multi-socket bare metal or large cloud instances).
- Helpful: Background in observability theory: SLOs, error budgets, signal types (RED, USE, golden signals).
Glossary¶
| Term | Definition |
|---|---|
| HDR histogram | High-Dynamic-Range histogram. A bucketed distribution with logarithmic value spacing, bounded relative error, and very low recording overhead. Pioneered by Gil Tene. |
| Percentile | A value below which a given fraction of observations falls. p99 = 99th percentile = the value at which 99% of observations are smaller. |
| Co-ordinated omission | A measurement bias where slow operations cause the observer to skip recording subsequent operations, hiding the impact of the slow ones. Gil Tene's term. |
| NUMA | Non-Uniform Memory Access. On multi-socket systems, memory near a socket is faster than memory near another socket. Affects counter placement at high contention. |
| Cardinality | The number of distinct label combinations a metric can produce. High cardinality (millions of distinct labels) explodes memory in metric backends. |
| Exemplar | A single sample value attached to a counter or histogram, useful for tracing back from a metric to a specific trace. Modern OpenTelemetry feature. |
| RED method | Rate, Errors, Duration — the three signals for request-driven services. Tom Wilkie's framework. |
| USE method | Utilisation, Saturation, Errors — the three signals for resource-driven systems. Brendan Gregg's framework. |
| Golden signals | Latency, Traffic, Errors, Saturation — Google SRE's four signals. |
| PromQL | Prometheus's query language. rate(counter[1m]) etc. |
| OTLP | OpenTelemetry Protocol. Standard for transporting metrics, traces, and logs. |
| Pull vs push | Metric collection style. Prometheus pulls; OpenTelemetry can push or pull; StatsD pushes. |
| Aggregation window | The time interval over which metric observations are accumulated before being reported. Affects fidelity vs cost. |
| Histogram quantile | A computed percentile from histogram buckets. histogram_quantile(0.99, ...) in PromQL. |
| Sliding window counter | A counter that "expires" old data, useful for rate metrics that need responsiveness. |
Beyond Counters: Why Histograms¶
A counter answers "how many?". It does not answer "what shape?".
Suppose your service averages 50 ms response time. Operators wake up at 3 AM because p99 latency spiked. Your counter cannot tell them what p99 is — it only has the total time and total count. You compute the average from those. The average has been constant. The p99 has spiked. You see nothing.
The cause: a few slow requests dominate the tail without moving the average. You need a histogram.
A histogram records the distribution of observations. It can answer:
- What is p50 / p95 / p99 / p99.9?
- What fraction of requests took longer than 100 ms?
- Is the distribution bimodal?
- Has the shape changed over time?
For latency, response size, queue depth, and many other measures, a histogram is the right primitive.
Naive histogram¶
A simple histogram is an array of buckets, each counting observations in a range:
type Histogram struct {
buckets []atomic.Int64
bounds []float64
}
func (h *Histogram) Observe(v float64) {
for i, b := range h.bounds {
if v <= b {
h.buckets[i].Add(1)
return
}
}
h.buckets[len(h.bounds)].Add(1) // +Inf bucket
}
Each Observe walks the bounds array to find the right bucket. For 20 buckets, this is ~20 comparisons — fast. For 100 buckets, slower. For HDR's typical 1700+ buckets, this is too slow.
HDR histograms use a clever bucket layout that lets you find the bucket in O(1) via bit operations.
HDR Histograms in Depth¶
The HDR histogram (designed by Gil Tene at Azul Systems) provides:
- Constant-time
Record(value)regardless of bucket count - Bounded relative error (configurable; typically 0.001 = 0.1%)
- Fixed memory footprint
- Coordinated-omission compensation
- Mergeable across processes
Bucket layout¶
The HDR histogram divides the value range into "buckets" of equal relative width. For a target precision of 3 significant digits (relative error 0.001), each "decade" of values is divided into 2048 sub-buckets, with the bucket index growing by 1 for each next-decade region.
Concretely: values 1.000-1.001 fall in one bucket; 1.001-1.002 in the next; 10.00-10.01 in another; etc. The bucket boundaries are spaced logarithmically with linear sub-spacing.
The math: for a value v, the bucket index is
exp = log2(v) - log2(2 * subBucketCount)
if exp < 0:
sub = v
index = sub
else:
sub = v >> exp
index = (exp + 1) * subBucketCount + sub - subBucketCount
In code, this becomes a small handful of bit operations — constant time.
Memory layout¶
For a tracking range of 1 to 60_000_000 (e.g., latency in microseconds, up to 60s) with 3-digit precision:
- subBucketCount = 2048 (2^11)
- buckets = 17 (one per power of 2 of the range)
- total buckets ≈ 17 * 2048 = 34,816
- each bucket is an
int64→ ~272 KB
Larger than a simple counter; smaller than millions of raw observations.
Recording¶
func (h *HDR) RecordValue(v int64) {
if v < 0 || v > h.highest {
return // out of range
}
idx := h.indexOf(v)
atomic.AddInt64(&h.counts[idx], 1)
}
indexOf is a few bit operations. Add is one atomic. Total: ~10-20 ns per record. Fast enough to record every event.
Querying¶
func (h *HDR) ValueAtPercentile(p float64) int64 {
total := h.TotalCount()
threshold := int64(float64(total) * p / 100)
var sum int64
for i, c := range h.counts {
sum += atomic.LoadInt64(&c)
if sum >= threshold {
return h.valueFromIndex(i)
}
}
return h.highest
}
TotalCount is O(buckets). ValueAtPercentile is also O(buckets) in the worst case. Both run at scrape time, infrequently.
Coordinated omission¶
If you measure latency by timing a function and the function takes 100 ms when it should take 1 ms, the calling loop falls behind. The next call's measured time still looks like 1 ms — but really it should be measured as "100 ms (waiting) + 1 ms (running)". HDR has a RecordValueWithExpectedInterval method that compensates:
func (h *HDR) RecordValueWithExpectedInterval(v, expected int64) {
h.RecordValue(v)
if v > expected {
for missing := expected; missing < v; missing += expected {
h.RecordValue(v - missing)
}
}
}
This synthesises missing observations to fill the gap. Crucial for honest latency measurement under load.
Go implementation¶
Use github.com/HdrHistogram/hdrhistogram-go (or its newer v2):
import "github.com/HdrHistogram/hdrhistogram-go"
h := hdrhistogram.New(1, 60_000_000, 3) // min, max, significant digits
h.RecordValue(123)
p99 := h.ValueAtQuantile(99)
Caveats:
- The library is not thread-safe for
RecordValuefrom multiple goroutines. Wrap with a mutex, shard, or useatomic.AddInt64on the underlying count slice directly (advanced). - The
Snapshotreturns a deep copy; cheap to merge with another histogram viaAdd(other).
Concurrency wrapper¶
type ConcurrentHDR struct {
shards []shard
}
type shard struct {
_ cpu.CacheLinePad
mu sync.Mutex
h *hdrhistogram.Histogram
_ cpu.CacheLinePad
}
func New(minV, maxV int64, sig int) *ConcurrentHDR {
n := runtime.GOMAXPROCS(0)
s := make([]shard, n)
for i := range s {
s[i].h = hdrhistogram.New(minV, maxV, sig)
}
return &ConcurrentHDR{shards: s}
}
func (c *ConcurrentHDR) RecordValue(v int64) {
s := &c.shards[runtime_fastrand()%uint32(len(c.shards))]
s.mu.Lock()
s.h.RecordValue(v)
s.mu.Unlock()
}
func (c *ConcurrentHDR) Merged() *hdrhistogram.Histogram {
out := hdrhistogram.New(c.shards[0].h.LowestTrackableValue(), c.shards[0].h.HighestTrackableValue(), int(c.shards[0].h.SignificantFigures()))
for i := range c.shards {
c.shards[i].mu.Lock()
out.Merge(c.shards[i].h)
c.shards[i].mu.Unlock()
}
return out
}
Each shard has its own histogram, protected by its own mutex. Merging is O(shards × buckets). Done at scrape time.
For ultimate performance, manipulate the underlying count slice with atomics directly:
type LockFreeShard struct {
_ cpu.CacheLinePad
counts []atomic.Int64 // same layout as hdrhistogram.Histogram.counts
_ cpu.CacheLinePad
}
But you must replicate HDR's indexOf logic exactly, and the library does not expose it. Most teams accept the mutex; the wins of going fully lock-free are modest.
Implementing an HDR-Backed Counter¶
Let us build a full HDR-backed latency counter for an HTTP handler:
package metrics
import (
"net/http"
"runtime"
"sync"
"sync/atomic"
"time"
"github.com/HdrHistogram/hdrhistogram-go"
"golang.org/x/sys/cpu"
)
type LatencyMetric struct {
name string
shards []latencyShard
count atomic.Int64 // total observations (for fast count without merge)
sum atomic.Int64 // total nanos (for average)
}
type latencyShard struct {
_ cpu.CacheLinePad
mu sync.Mutex
h *hdrhistogram.Histogram
_ cpu.CacheLinePad
}
func NewLatencyMetric(name string) *LatencyMetric {
n := runtime.GOMAXPROCS(0)
shards := make([]latencyShard, n)
for i := range shards {
shards[i].h = hdrhistogram.New(1, 60_000_000_000, 3) // 1ns to 60s, 3 sig digits
}
return &LatencyMetric{name: name, shards: shards}
}
//go:linkname runtime_fastrand runtime.fastrand
func runtime_fastrand() uint32
func (m *LatencyMetric) Observe(d time.Duration) {
nanos := d.Nanoseconds()
m.count.Add(1)
m.sum.Add(nanos)
idx := uint32(runtime_fastrand()) % uint32(len(m.shards))
s := &m.shards[idx]
s.mu.Lock()
s.h.RecordValue(nanos)
s.mu.Unlock()
}
func (m *LatencyMetric) Count() int64 { return m.count.Load() }
func (m *LatencyMetric) Sum() int64 { return m.sum.Load() }
func (m *LatencyMetric) Mean() float64 {
c := m.count.Load()
if c == 0 {
return 0
}
return float64(m.sum.Load()) / float64(c)
}
func (m *LatencyMetric) Quantile(p float64) int64 {
merged := m.merged()
return merged.ValueAtQuantile(p * 100)
}
func (m *LatencyMetric) merged() *hdrhistogram.Histogram {
out := hdrhistogram.New(1, 60_000_000_000, 3)
for i := range m.shards {
m.shards[i].mu.Lock()
out.Merge(m.shards[i].h)
m.shards[i].mu.Unlock()
}
return out
}
// Middleware that wraps an http.Handler and records latency.
func (m *LatencyMetric) Middleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
next.ServeHTTP(w, r)
m.Observe(time.Since(start))
})
}
Usage:
metrics := NewLatencyMetric("http_request_duration")
http.Handle("/api/users", metrics.Middleware(usersHandler))
At scrape time:
fmt.Fprintf(w, "count=%d\n", metrics.Count())
fmt.Fprintf(w, "mean=%.2fns\n", metrics.Mean())
fmt.Fprintf(w, "p50=%dns\n", metrics.Quantile(0.5))
fmt.Fprintf(w, "p95=%dns\n", metrics.Quantile(0.95))
fmt.Fprintf(w, "p99=%dns\n", metrics.Quantile(0.99))
fmt.Fprintf(w, "p999=%dns\n", metrics.Quantile(0.999))
Properties:
- Per-shard records with their own mutex; no central lock.
countandsumare atomic, fast-path for rate-derived metrics.- HDR shards merge at scrape time only.
- Cache-line padding prevents shard contention.
- Each shard is one HDR histogram (~272 KB); N shards × 16 cores = 4.4 MB total. Acceptable.
This is a production-grade latency metric.
Percentile-Preserving Merging¶
When you have shards (or multiple processes), you need to merge their histograms to get a global percentile. The naive approach:
merged := hdrhistogram.New(1, max, 3)
for _, shard := range shards {
merged.Merge(shard.h)
}
p99 := merged.ValueAtQuantile(99)
This works because HDR histograms are additive: bucket counts sum. The merged histogram has the bucket counts of the sum of all shards. Percentiles computed from the merged histogram are correct for the union of observations.
Crucially, you cannot average percentiles across shards. p99 of shard A averaged with p99 of shard B is not the global p99. The merge is necessary.
Cross-process merging¶
When you have multiple Go processes each emitting a histogram, the metric backend must merge them. Prometheus does this via histogram_quantile(0.99, sum(rate(http_duration_seconds_bucket[5m])) by (le)) — it sums the bucket counts across processes, then computes the quantile from the sum.
This is correct because Prometheus histograms are bucket-based (like HDR) and bucket counts are additive.
OpenMetrics histograms work the same way.
Don't average percentiles¶
A common operational mistake: 10 processes each emit p99_latency (a scalar). The dashboard averages them and displays the result. That value is meaningless.
The right way: emit histogram bucket counts; let the backend compute the percentile on the merged distribution.
HDR-specific merging tricks¶
HDR has a Merge method that takes the bucket counts of one histogram and adds them to the receiver. It is O(buckets), typically ~1 ms for a full 34K-bucket histogram. At scrape time (every 15 s), this is irrelevant.
For very large numbers of shards or processes, you can do hierarchical merging:
shards: [s0, s1, s2, ..., s31]
level1: [merge(s0..s7), merge(s8..s15), merge(s16..s23), merge(s24..s31)]
final: merge(level1)
Tree-style merging is O(N log N) work but parallelisable. Rarely needed for the sizes typical in Go services.
Time-Bucketed Histograms¶
A single HDR histogram accumulates forever (or until you reset it). For "last 5 minutes", you need a ring of histograms, one per time bucket:
type TimedHistogram struct {
buckets []*hdrhistogram.Histogram
period time.Duration
cur atomic.Int64 // current bucket index
mu sync.Mutex
}
func NewTimed(numBuckets int, bucketPeriod time.Duration) *TimedHistogram {
buckets := make([]*hdrhistogram.Histogram, numBuckets)
for i := range buckets {
buckets[i] = hdrhistogram.New(1, 60_000_000_000, 3)
}
return &TimedHistogram{buckets: buckets, period: bucketPeriod}
}
func (t *TimedHistogram) Observe(v int64) {
idx := t.cur.Load() % int64(len(t.buckets))
t.mu.Lock()
t.buckets[idx].RecordValue(v)
t.mu.Unlock()
}
func (t *TimedHistogram) Tick() {
next := (t.cur.Load() + 1) % int64(len(t.buckets))
t.mu.Lock()
t.buckets[next].Reset()
t.cur.Store(next)
t.mu.Unlock()
}
func (t *TimedHistogram) Snapshot() *hdrhistogram.Histogram {
out := hdrhistogram.New(1, 60_000_000_000, 3)
t.mu.Lock()
for _, b := range t.buckets {
out.Merge(b)
}
t.mu.Unlock()
return out
}
func (t *TimedHistogram) Run(stop <-chan struct{}) {
ticker := time.NewTicker(t.period)
defer ticker.Stop()
for {
select {
case <-stop:
return
case <-ticker.C:
t.Tick()
}
}
}
For "last 5 minutes with 1-minute resolution", use 5 buckets of 1 minute. Snapshot() returns a histogram covering the last 5 minutes (approximately — the current bucket may be partial).
For higher concurrency, shard each time bucket the same way the simple LatencyMetric shards.
Sliding windows are surprisingly hard¶
You may have noticed: a goroutine Observes while another goroutine Ticks. Which bucket does the observe land in? Depends on timing. Edge effects.
Mitigations:
- Accept fuzziness at bucket boundaries (typical).
- Use sub-second buckets for finer resolution (more memory).
- Use exponentially-decaying weighted averages instead of ring buffers (different design entirely).
For most monitoring, ring buffers of 1-minute or 1-second buckets are fine.
NUMA-Aware Counter Architecture¶
On a multi-socket server, memory near a socket is faster than memory near another socket. Cross-socket cache transfers can be 200-500 ns vs 50-100 ns same-socket.
For counters, this matters when:
- Your goroutines run on cores spread across sockets.
- Your counter cells happen to live in memory near one socket.
- Cross-socket writes dominate.
Detection¶
On Linux, numactl --hardware shows your NUMA topology. lscpu shows NUMA nodes and core mappings.
In Go, the runtime does not expose NUMA topology. You can read it manually:
Mitigation¶
The brutal but effective approach: pin OS threads to sockets. Each socket gets its own counter cells.
// Pseudo-code; requires CGO and platform-specific syscalls.
func PinToSocket(threadID, socket int) error {
return setaffinity(threadID, cpusForSocket(socket))
}
type NumaCounter struct {
perSocket []*Sharded
}
func (n *NumaCounter) Inc() {
sock := currentSocket() // expensive
n.perSocket[sock].Inc()
}
The currentSocket() call is expensive on Linux (a syscall or vdso). Caching it per-OS-thread mitigates the cost — combine with runtime.LockOSThread to make a goroutine stick to a thread, which sticks to a core, which sticks to a socket.
When it matters¶
For a single-socket cloud VM (most deployments), NUMA is a non-issue. For:
- Multi-socket bare metal
- Cloud instances with > 32 vCPUs (often 2 sockets)
- HPC workloads
- Database servers
NUMA can be a 2-5× factor in counter contention.
Practical advice¶
If your service runs on multi-socket hardware and counter contention is a measurable problem:
- Confirm NUMA topology of your deployment.
- Pin OS threads to sockets (one Go binary per socket, or
tasksetat launch). - Use per-socket sharded counters.
- Verify with
perf c2cthat cross-socket traffic dropped.
For most services, none of this is necessary. NUMA awareness is the last 10× when you've already optimised everything else.
Observability Subsystem Architecture¶
A complete observability subsystem for a Go service includes:
+--------------------------------------------------+
| Application Code |
| counter.Inc() latency.Observe(d) ... |
+------------------------+-------------------------+
|
+------------------------v-------------------------+
| Counter / Histogram Primitives |
| atomic.Int64 Sharded PerP Sloppy HDR |
+------------------------+-------------------------+
|
+------------------------v-------------------------+
| Metric Registry |
| named lookup, type registration, labels |
+--------+--------+--------+----------+-----------+
| | | |
v v v v
+-----+--+ +---+--+ +---+----+ +---+----+
|expvar | |Prom | |OTLP | |Custom |
|JSON | |Text | |gRPC/HTTP||sink |
+--------+ +------+ +--------+ +--------+
| | | |
v v v v
/debug/vars /metrics collector logs/files
+--------------------------------------------------+
| External Backend |
| Prometheus, Grafana, OpenTelemetry, Datadog |
+--------------------------------------------------+
Each layer is independently designed:
- Primitives: senior-level work — the right shape for each contention profile.
- Registry: holds named metrics, manages labels, prevents cardinality bombing.
- Exposition: serialises metrics in various formats.
- Transport: HTTP pull, gRPC push, file rotation, etc.
- Backend: Prometheus, Datadog, etc. Out of scope for this file.
Why multiple exposition formats?¶
expvarJSON: free, built-in, useful for local development and ad-hoc curl.- Prometheus text: industry-standard pull format; works with most monitoring stacks.
- OTLP: modern push format; supports labels, traces, exemplars.
- Custom: e.g., to a corporate logging system.
A service often exposes 2-3 of these simultaneously. The registry abstracts the formats.
Registry design¶
type Registry struct {
mu sync.RWMutex
m map[string]MetricFamily
help map[string]string
labels map[string]LabelSpec
}
type MetricFamily interface {
Encode(format Format, w io.Writer, name string) error
}
type Format int
const (
FormatJSON Format = iota
FormatProm
FormatOTLP
)
Each metric implements Encode for each format it cares about. The registry dispatches based on the requested format.
For OTLP specifically, the encoding involves Protocol Buffers and gRPC — heavier than text formats. Many services use the OpenTelemetry SDK directly for OTLP and expvar/Prometheus for the other formats.
Labels and cardinality¶
A labeled counter like http_requests_total{status, route, user} can have unbounded cardinality if any label has unbounded values (especially user). Cardinality bombing is a classic observability incident:
- Application emits a unique label combo per request.
- Metric backend's memory explodes.
- Backend dies; alerting is gone; outage cascades.
Mitigations:
- Enforce a cardinality limit per metric (e.g., 10K distinct combos).
- Use a
_total{status}for status code (low cardinality) and don't includeuser. - For per-user counts, use traces/spans, not metrics.
The registry should track cardinality and refuse new combos past the limit:
func (r *Registry) AddObservation(metric string, labels Labels, value float64) error {
family := r.m[metric]
if family.Cardinality() >= r.labels[metric].MaxCardinality {
family.IncrementDropped(1)
return ErrCardinalityLimit
}
family.Add(labels, value)
return nil
}
This protects the backend at the cost of dropped observations. Dropped observations are themselves a metric.
expvar + Prometheus + OpenTelemetry¶
Real services often expose all three. How do they coexist?
Common pattern: one source of truth, multiple exposers¶
type Counter struct {
v atomic.Int64
// ... possibly sharded
}
func (c *Counter) Inc() { c.v.Add(1) }
func (c *Counter) Value() int64 { return c.v.Load() }
// Expose via expvar
func (c *Counter) Expvar() expvar.Var {
return expvar.Func(func() any { return c.Value() })
}
// Expose via Prometheus
func (c *Counter) Prometheus(name, help string) prometheus.Collector {
return prometheus.NewCounterFunc(
prometheus.CounterOpts{Name: name, Help: help},
func() float64 { return float64(c.Value()) },
)
}
// Expose via OTLP — using OpenTelemetry SDK
func (c *Counter) OpenTelemetry(meter metric.Meter, name string) {
counter, _ := meter.Int64ObservableCounter(name)
meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
o.ObserveInt64(counter, c.Value())
return nil
}, counter)
}
The underlying atomic.Int64 is the source of truth. Three wrapper functions expose it to three different systems. Each scrape/push reads the current value.
When to choose each¶
expvaronly: tiny services, internal tools, "is the process alive?" debugging.- Prometheus: standard for modern container/Kubernetes deployments.
- OpenTelemetry: when traces and metrics are correlated (e.g., exemplars linking metrics to traces). Future-proof.
- All three: large services with diverse consumers; the overhead is small if you wrap one underlying primitive.
Performance overlap¶
Each exposure costs at scrape time, not at increment time. The increment is still one atomic. The Prometheus scrape iterates registered collectors; expvar serialises the map; OTLP packages the observation set. None of these touch the hot path.
Cardinality across systems¶
Be careful: a labeled counter has cardinality on each emitter. If http_requests_total{status, route} has 100 distinct combos, each backend sees those 100 time series. Cardinality multiplies if you double-expose. Coordinate label sets across exposers.
Multi-Process Counter Aggregation¶
A single Go process is one thing; a fleet of Go processes (containers, instances, replicas) is another. The metric backend aggregates across processes — but only for compatible metric shapes.
Counters: easy¶
Each process's counter is monotonically increasing. The backend sums them across processes (or uses rate() per process and sums). Prometheus does this natively.
Caveat: counter resets (process restarts) cause apparent decreases. Prometheus handles this with the rate() function, which detects and skips resets.
Gauges: tricky¶
Each process has its own gauge value. The backend's job is unclear:
- Sum (for "total requests in flight across fleet")
- Max (for "max queue depth seen across fleet")
- Average (for "average CPU per process")
- Histogram (for "distribution of memory usage")
Pick the right aggregation per metric. Prometheus does this with sum(), max(), avg() in PromQL.
Histograms: subtle¶
Each process emits bucket counts. The backend sums bucket counts across processes, then computes the percentile from the sum. Correct.
You cannot compute percentiles per process and then average them. That gives a meaningless number.
Counter resets are confusing¶
On process restart, the counter goes from N back to 0. The backend sees "N decreased to 0", which it must interpret as "process restarted, counter resumed from 0". Prometheus's rate() handles this correctly; bespoke logic may not.
A more durable pattern: each process emits both a counter and an instance_id label. The backend treats counter{instance_id=X} as a per-instance metric and sums across instances using sum without(instance_id) (rate(counter[5m])).
Cardinality Management¶
Cardinality is the number of distinct label combinations. High cardinality is the #1 metric-backend outage cause.
Sources of high cardinality¶
- User ID labels
- Trace/request ID labels
- Anything user-controlled
Bounding cardinality¶
- Label whitelist: only allow labels from a known set.
- Value bucketing: group user IDs into 100 buckets by hash; emit
user_bucket=42instead ofuser_id=.... - Cardinality limits at metric layer: refuse new label combos past N.
- Pre-aggregation: before sending to backend, group by stable labels and drop high-cardinality ones.
Cardinality in the registry¶
type MetricFamily struct {
series sync.Map // map[labelHash]*Counter
count atomic.Int64
limit int64
}
func (m *MetricFamily) Inc(labels Labels) {
h := labels.Hash()
if _, ok := m.series.Load(h); !ok {
if m.count.Load() >= m.limit {
// over budget; drop
atomic.AddInt64(&m.dropped, 1)
return
}
m.series.LoadOrStore(h, &Counter{})
m.count.Add(1)
}
v, _ := m.series.Load(h)
v.(*Counter).Inc()
}
The dropped counter is itself a metric — emit it so operators can see when cardinality limits are biting.
Cardinality across the fleet¶
A label pod_name with 100 values per instance × 1000 instances = 100K distinct series. The backend has to store them all. Audit label cardinality regularly.
Adaptive Counters¶
Counters that adjust behaviour based on observed load are an advanced pattern. Examples:
Adaptive shard count¶
Start unsharded. If contention is detected (CAS failures, high atomic latency), grow to N shards. If contention disappears, optionally shrink (rare).
Adaptive flush threshold¶
A sloppy counter that flushes every K increments. Under low load, K can be small (fast freshness). Under high load, K grows (less contention). Detect load via increment rate.
Adaptive precision¶
A histogram that uses high precision (3 sig digits) by default but degrades to lower precision (2 sig digits) under memory pressure.
Adaptive sampling¶
A counter that increments every Nth event under high load (saving cost) but every event under low load. Combined with sample tracking, the metric is corrected at scrape time.
All four are advanced patterns. They are rarely needed but extremely powerful when applicable. Java's LongAdder is one example of adaptive sharding. Most other adaptations are bespoke.
Counters in Extreme Environments¶
High-frequency trading¶
Latency-sensitive systems require deterministic counters. Sloppy is best (no contention). Pre-allocate everything; avoid GC during trading hours.
Embedded / IoT¶
Memory is constrained. Padding is expensive. Use small (int32) counters where possible; avoid HDR.
GPU compute (CUDA via cgo)¶
Per-thread counters are tricky because GPU "threads" are not Go goroutines. Aggregate at kernel boundaries.
WebAssembly¶
Single-threaded (in browsers; Wasi has threads). No concurrency, no atomics needed. int64 is fine.
Network functions / DPDK-style¶
Pin everything to cores. Per-CPU counters with no atomic — just direct writes, since no other thread will touch the cell.
These environments require deep knowledge of the platform. The patterns from this file transfer with adjustments.
Performance Engineering at Scale¶
Profiling discipline¶
- Profile every release.
- Compare profiles week-over-week.
- Set budgets for counter overhead (e.g., < 1% of CPU).
- Track increment latency, not just count.
Continuous benchmarking¶
- Run benchmarks in CI on dedicated hardware.
- Detect regressions automatically.
- Alert on > 10% slowdown.
Production instrumentation of counters¶
Counters of counters: track how many increments per second your counters do. If a counter is hit at 1 GHz, you have a problem. If at 1 Hz, do not bother optimising.
Tail latency in counter ops¶
Atomic ops have predictable mean but tail latency under contention can be 100× the mean. Measure p99 of increment latency, not just throughput.
NUMA in production¶
Test in staging on hardware that matches production. If staging is single-socket and prod is multi-socket, NUMA bites you the day you go live.
Operational Concerns¶
Dashboards¶
A counter feeds:
- Rate gauges:
rate(http_requests_total[5m]) - SLO panels:
1 - (rate(http_errors_total[5m]) / rate(http_requests_total[5m])) - Heatmaps:
histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m])) - Top-N tables: by labels (status, route, etc.)
Design dashboards before designing metrics — clarify what operators need.
Alerts¶
A counter feeds:
- Error rate alerts:
rate(errors[5m]) > 10 - SLO budget alerts:
slo_remaining < 0.5 - Anomaly alerts: rate dropped 50% vs last week
Counter-driven alerting is the workhorse of SRE. The counter is the alert's truth.
Runbooks¶
When an alert fires, the runbook says:
- Which counter triggered (
http_5xx_total). - What its baseline is.
- What graph to check (the dashboard with that counter).
- What to do (rollback, scale, page).
Counter names appearing in runbooks should be stable. Renaming a counter breaks runbooks.
Security & Compliance¶
Counters can leak data¶
A counter failed_logins_total{username} leaks usernames. A counter requests_by_ip{ip} leaks IPs. PII in labels is a compliance issue.
Audit labels for PII before shipping.
Counters can be DoS targets¶
If your counter increments on attacker input, an attacker can drive a counter to high cardinality, blowing up your backend. Validate inputs before they become labels.
Counters can leak internal state¶
/metrics endpoints expose request volumes, error rates, capacity. Attackers profile your system from outside if /metrics is public.
Always authenticate /metrics. Bind to private interfaces.
GDPR and counters¶
Counters with PII labels may be subject to GDPR retention rules. Audit your metrics for compliance, the same as logs.
Testing Strategy¶
Correctness tests¶
Per-counter unit tests: N goroutines × M increments → total == N*M. Always under -race.
Concurrency tests¶
Stress tests: thousands of goroutines, millions of increments. Check final value.
Scaling tests¶
-bench -cpu=1,2,4,8,16 on dedicated hardware. Verify scaling claims.
Integration tests¶
Test the full metrics pipeline: increment → exposition → scrape → backend → query. Use a local Prometheus instance.
Snapshot tests¶
For each metric format (JSON, Prometheus, OTLP), assert the serialised output matches a golden file.
Cardinality tests¶
For each labeled counter, test that excess labels are dropped (not silently growing the registry).
Fuzz tests¶
Throw random label values, random delta values, random concurrency levels. Look for panics, deadlocks, or wrong outputs.
Common Mistakes at Scale¶
- No cardinality limit on labels. Backend OOMs.
- Averaging percentiles across processes. Numbers look fine, are meaningless.
- Counter reset on every scrape. Loses data between scrapes.
- No tests for counter correctness under load. Race conditions hide.
- Unauthenticated
/metrics. Information leak. - Padding only some counters. Inconsistent performance.
- Per-process counters as "the answer". Without aggregation across the fleet, you cannot see the whole picture.
- High-precision HDR for low-rate counters. Wastes memory.
- Coordinated omission not addressed. Latency measurements are wrong.
- Histograms in tight inner loops. Even cheap HDR records cost; can dominate at extreme rates.
Tricky Questions¶
Q: How do you handle counter reset across process restarts? A: Don't try to. Use rate() in the query layer; Prometheus and OTLP handle resets gracefully. For exact cumulative counts across restarts, persist to a database — but rarely needed.
Q: What's the right HDR precision? A: 2-3 significant digits. More wastes memory; less loses fidelity at tails.
Q: How do you debug a counter that is "wrong"? A: First, confirm it really is wrong (not just lagged). Then look for: missing atomics, sharded reset races, cardinality drops, label mismatches. Last resort: enable counter trace logging.
Q: Should I use a single labeled counter or multiple counters by label? A: Labeled if cardinality is bounded; multiple if not. Labels are easier in queries; multiple are easier in code.
Q: How do you migrate from expvar to Prometheus? A: Keep expvar exposed; add Prometheus alongside. Cut over Grafana dashboards. Remove expvar later if desired. Coexistence is cheap.
Q: What about exemplars? A: An exemplar is a sample trace ID attached to a counter or histogram observation. OpenTelemetry supports this; correlation between metrics and traces is the killer feature.
Q: How do you handle histograms with values much larger than expected? A: HDR caps at highest value; out-of-range values are dropped. Configure highest generously (e.g., 60s for HTTP latency).
Q: Do I need NUMA awareness in cloud deployments? A: Usually not. Most cloud VMs are single-socket. For very large instances (96+ vCPUs) or dedicated bare metal, yes.
Q: How do you size HDR memory? A: Each bucket is 8 bytes (int64). For (1ns - 60s, 3 sig digits): ~34K buckets ~= 272 KB. Multiply by shards. Multiply by metric count. Budget accordingly.
Q: Can you build the whole observability stack in Go? A: Yes. Prometheus is written in Go. OpenTelemetry has full Go SDKs. The standard library expvar is built in. Go is the default language of observability tooling.
Cheat Sheet¶
// HDR histogram concurrent wrapper
type LatencyMetric struct {
shards []shard
}
func (m *LatencyMetric) Observe(d time.Duration) {
s := &m.shards[runtime_fastrand()%uint32(len(m.shards))]
s.mu.Lock()
s.h.RecordValue(d.Nanoseconds())
s.mu.Unlock()
}
// Multi-process aggregation: sum bucket counts, then quantile
// (Prometheus does this with histogram_quantile)
// Cardinality bound
if m.count.Load() >= m.limit {
m.dropped.Add(1)
return
}
// Expose to expvar, Prometheus, OTLP via wrappers around one atomic
| Need | Tool |
|---|---|
| Distribution | HDR histogram (sharded) |
| Sliding window | Ring of HDR histograms |
| NUMA isolation | Per-socket sharded counters |
| Multi-process aggregation | Backend (Prometheus) |
| Cardinality control | Registry-level limit |
| All exposers | Wrapper functions over one primitive |
Self-Assessment Checklist¶
- I can implement a sharded HDR histogram and explain the trade-offs.
- I understand coordinated omission and how
RecordValueWithExpectedIntervaladdresses it. - I can sketch a sliding-window histogram using a ring of HDR.
- I can design a multi-format exposition layer (expvar + Prometheus + OTLP).
- I know not to average percentiles across processes.
- I can bound cardinality in a labeled counter registry.
- I can detect and mitigate NUMA effects on counter contention.
- I have built and operated a production metrics pipeline end-to-end.
- I can write counter-driven SLO alerts that page on real incidents.
- I can audit a counter codebase for PII leaks, cardinality risks, and contention hotspots.
Summary¶
Professional-level concurrent counters are about systems, not primitives. You compose counters and histograms into observability subsystems, integrate with multiple backends, handle multi-process aggregation, bound cardinality, and manage operational concerns.
HDR histograms extend the counter primitive to distributions; sharded HDR scales to many cores; sliding windows give responsive rate metrics; NUMA awareness handles multi-socket deployments. Above the primitives, the registry layer manages exposition, labels, and cardinality. Above the registry, transport pushes data to Prometheus, OpenTelemetry, or custom backends. Above the backend, dashboards, alerts, and runbooks drive operations.
A senior built counters that scaled. A professional builds the system that uses them — and ensures every counter, every label, every dashboard, every alert is intentional.
What You Can Build¶
- A sharded HDR-histogram-backed latency metric
- A sliding-window rate metric with sub-second resolution
- A multi-format exposition layer (expvar + Prometheus + OTLP simultaneously)
- A cardinality-bounded labeled counter registry
- A NUMA-aware counter for multi-socket bare metal
- A full observability subsystem for a 100K-RPS Go service
- SLO alerts and runbooks driven by counter outputs
- A cross-language counter aggregation system (Go process emitting to Prometheus, Python aggregator reading)
- A metric system that compensates for coordinated omission
- An adaptive counter that adjusts shard count under load
Further Reading¶
- Gil Tene, "How Not to Measure Latency" — the canonical talk on coordinated omission.
github.com/HdrHistogram/HdrHistogram— the reference HDR implementation in Java.github.com/HdrHistogram/hdrhistogram-go— the Go port.- Prometheus's
client_golangsource — production-grade Go metrics library. - OpenTelemetry's Go SDK — modern observability protocol.
- Google's SRE Book chapters on monitoring and alerting.
- Brendan Gregg's "Systems Performance" — for USE method, profiling techniques.
- Tom Wilkie, "Monitoring Microservices the Red Way" — for RED method.
- Heinrich Hartmann, "Statistics for Engineers" — on percentiles and distributions.
Related Topics¶
- All previous levels of concurrent counters (junior, middle, senior)
- HDR histograms in depth
- Prometheus / OpenTelemetry / Datadog metric models
- Distributed tracing (counters with exemplars)
- SLO and SLI design
- Cardinality theory in observability
- NUMA architectures
- Embedded / real-time concurrent programming
Diagrams & Visual Aids¶
HDR histogram bucket layout¶
Each "sub-bucket" is a separate counter; recording is bit-math index calc.
Sharded HDR architecture¶
goroutine -> hash -> shard 17 -> mutex -> hdrhistogram.RecordValue
goroutine -> hash -> shard 42 -> mutex -> hdrhistogram.RecordValue
scrape -> merge all shards -> hdrhistogram with summed bucket counts
-> ValueAtQuantile(99)
Sliding window¶
buckets: [B0][B1][B2][B3][B4] (5 minutes, 1 minute each)
^
current
after 1 minute tick:
buckets: [B0][B1][B2][B3][B4]
^
current
(B5's data is overwritten, oldest minute drops off)
Multi-format exposition¶
+--------------+
| atomic.Int64 | <-- one source of truth
+------+-------+
|
+-----+----+----+----+
| | | |
v v v v
expvar Prom OTLP custom
/vars /metrics gRPC files
Cardinality bound¶
register("user_action", labels=[action, user_bucket])
- action: 10 values
- user_bucket: 100 values (hashed from user_id)
cardinality = 10 * 100 = 1000 (acceptable)
NOT: labels=[action, user_id]
- user_id: millions of values
cardinality = 10 * millions (UNACCEPTABLE)
Closing Thought¶
You started with count++ being wrong. You end with the design of full observability subsystems.
That is the journey of concurrent programming in miniature. Counters are the lens through which we see the whole field.
Build systems that count well. Operators will thank you. SREs will sleep better. And you will know, on the inside, exactly why each line of code is the way it is.
Deep Dive: The HDR Bucket Math¶
Understanding HDR's bucket layout from first principles helps you debug edge cases.
Goal¶
Record values in a wide range (e.g., 1 ns to 60 s = factor of 10^10) with bounded relative error (e.g., 0.1%) and constant-time recording.
Naive logarithmic¶
You could use log2(value) and quantise to N buckets per power of 2:
Computing log2 is slow (floating point, transcendental). HDR avoids it.
HDR's trick: clz (count leading zeros)¶
For an integer v, bitLen(v) = 64 - clz(v) is roughly log2(v) + 1. clz is a single CPU instruction on most architectures. So bitLen is constant time.
HDR uses bitLen to determine the "exponent" portion of the bucket. The "sub-bucket" portion is the next-few-bits of the value (after removing the leading 1 implied by bitLen).
Concretely, with subBucketHalfCountMagnitude = 11 (subBucketHalfCount = 2048):
bucketIndex(v):
if v < subBucketHalfCount:
return v
pow2 = bitLen(v) - 1
if pow2 < subBucketHalfCountMagnitude:
sub = v - subBucketHalfCount
return sub
bucket = pow2 - subBucketHalfCountMagnitude
sub = (v >> bucket) - subBucketHalfCount
return bucket * subBucketHalfCount * 2 + sub
(Simplified — real HDR has more nuance.)
The cost: ~10 CPU cycles. Constant time, independent of value or bucket count.
Bucket value range¶
Given a bucket index i, what range of values does it cover? The inverse calculation:
valueFromIndex(i):
bucket = i / subBucketHalfCount - 1
sub = i % subBucketHalfCount + subBucketHalfCount
return sub << max(0, bucket)
The width of bucket i grows with the exponent. At the bottom of the range, widths are 1; at the top, widths are large but the relative error is still bounded by 2^-11 ≈ 0.05%.
Resolution vs memory¶
Significant digits 3 (default) ≈ relative error 0.1% ≈ subBucketCount 2048. Each "decade" of values has 2048 buckets. For 17 decades, that's ~34K buckets × 8 bytes = 272 KB.
For 4 significant digits, double the sub-bucket count, double the memory.
Most workloads use 3 sig digits. Some critical-tail-latency workloads use 4.
Implications¶
Knowing the bucket layout helps you:
- Debug "why is my p99 in bucket X?" — compute valueFromIndex(X).
- Estimate memory cost for new histograms.
- Understand why some bucket widths are 1 ns and others are seconds.
- Know that recording is truly constant time.
Deep Dive: Coordinated Omission in Practice¶
Coordinated omission (CO) is the silent killer of latency measurements. Walk through a concrete example.
The setup¶
A load generator fires 1000 requests/sec, expects each to take 1 ms.
ticker := time.NewTicker(time.Millisecond)
for range ticker.C {
start := time.Now()
doRequest()
latency := time.Since(start)
histogram.RecordValue(latency.Nanoseconds())
}
The problem¶
Suppose request #500 takes 100 ms instead of 1 ms. While that one request is running, the ticker fires 99 more times — and the loop is blocked. So instead of recording 1 slow request and 99 fast ones, you record 1 100ms-request followed by 99 "fast" requests (1 ms each) that were supposed to fire during the 100 ms blockage but didn't.
Result: histogram shows p99 = ~100 ms (one slow value out of ~1000). Reality: every request during the 100 ms blockage was also effectively slow (waiting + processing). True p99 should be far higher.
HDR's fix¶
RecordValueWithExpectedInterval(value, expected):
- Records the actual value once.
- For each
expectedinterval that elapsed beyond the actual measurement, recordsvalue - expected, value - 2*expected, ...to fill in the missing observations.
Now the histogram includes synthesised observations for the 99 missing requests, each at their "actual" elapsed-time-since-they-should-have-fired.
When to use¶
Whenever your measurement loop is itself rate-limited (firing on a ticker, processing from a channel with a known rate). Most production HTTP handlers do not coordinate-omit — each request is fired by an independent client, so blocking one handler does not delay others.
But anywhere you simulate load with a fixed-rate generator, CO is a risk. HDR's API gives you the fix.
Without CO compensation¶
Tools like wrk2 (a wrk fork by Gil Tene) build CO compensation into the load generator itself. Use them for accurate load testing.
Deep Dive: Building a Multi-Metric Snapshot System¶
A production observability system often needs coherent snapshots of many counters and histograms at "the same moment". For most monitoring, this is overkill — Prometheus scrapes a process every 15 s and computes rates and percentiles independently. But for special cases (correlating across metrics in real time), you need a coherent snapshot.
type SystemSnapshot struct {
At time.Time
Requests int64
Errors int64
InFlight int64
BytesIn int64
BytesOut int64
LatencyP50 int64
LatencyP99 int64
GCPauseNs uint64
HeapAllocBytes uint64
Goroutines int
}
type System struct {
requests *Sharded
errors *Sharded
inflight atomic.Int64
bytesIn *Sharded
bytesOut *Sharded
latency *LatencyMetric
snap atomic.Pointer[SystemSnapshot]
}
func (s *System) Refresh() {
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
merged := s.latency.merged()
s.snap.Store(&SystemSnapshot{
At: time.Now(),
Requests: s.requests.Get(),
Errors: s.errors.Get(),
InFlight: s.inflight.Load(),
BytesIn: s.bytesIn.Get(),
BytesOut: s.bytesOut.Get(),
LatencyP50: merged.ValueAtQuantile(50),
LatencyP99: merged.ValueAtQuantile(99),
GCPauseNs: ms.PauseTotalNs,
HeapAllocBytes: ms.HeapAlloc,
Goroutines: runtime.NumGoroutine(),
})
}
func (s *System) Current() *SystemSnapshot {
return s.snap.Load()
}
func (s *System) Run(ctx context.Context, interval time.Duration) {
t := time.NewTicker(interval)
defer t.Stop()
for {
select {
case <-ctx.Done():
return
case <-t.C:
s.Refresh()
}
}
}
The publisher runs in a single goroutine; it does the merge work once per refresh. Readers (HTTP /status handler, in-process consumers) do one Load. Memory cost: ~100 bytes per snapshot, allocated per refresh.
For higher-precision needs, you would version the snapshot and ensure all underlying counters are read between consistent versions. Rarely worth the complexity.
Deep Dive: Histogram Aggregation Across the Fleet¶
A 1000-instance Go fleet, each instance emits a latency histogram. The dashboard wants a fleet-wide p99.
Prometheus way¶
Each instance exposes its histogram bucket counts:
http_duration_seconds_bucket{le="0.001"} 1234
http_duration_seconds_bucket{le="0.005"} 5678
http_duration_seconds_bucket{le="0.01"} 9012
http_duration_seconds_bucket{le="+Inf"} 10000
Prometheus scrapes all instances. PromQL:
Reads:
rate(http_duration_seconds_bucket[5m])— per-instance, per-bucket rate over 5 minutes.sum(...) by (le)— sum across instances, grouped by bucket boundary.histogram_quantile(0.99, ...)— compute p99 from the summed bucket counts.
The result is the fleet-wide p99 over the last 5 minutes. Correct because bucket counts are additive.
OpenTelemetry way¶
Each instance pushes histogram observations via OTLP. The collector aggregates across instances. Same math, different transport.
Custom way¶
You roll your own metric backend (rare; usually a bad idea). You collect histograms from each instance, merge bucket counts, compute quantile. Same math.
Don't average¶
Whatever you do, do not average per-instance p99s. The fleet p99 is not the average of per-instance p99s. The merge-then-quantile approach is the only correct one.
Deep Dive: Custom Exposition Format¶
Sometimes you need a custom exposition format — e.g., to push to a corporate logging system that doesn't speak Prometheus.
type SinkLine struct {
Name string
Value float64
Timestamp time.Time
Labels map[string]string
}
type Sink interface {
Emit(line SinkLine) error
}
type Exporter struct {
sink Sink
reg *Registry
}
func (e *Exporter) Run(ctx context.Context, interval time.Duration) {
t := time.NewTicker(interval)
defer t.Stop()
for {
select {
case <-ctx.Done():
return
case now := <-t.C:
e.reg.ForEach(func(name string, m Metric) {
switch v := m.(type) {
case *Counter:
e.sink.Emit(SinkLine{Name: name, Value: float64(v.Value()), Timestamp: now})
case *Gauge:
e.sink.Emit(SinkLine{Name: name, Value: float64(v.Value()), Timestamp: now})
case *LatencyMetric:
h := v.merged()
e.sink.Emit(SinkLine{Name: name + "_p50", Value: float64(h.ValueAtQuantile(50)), Timestamp: now})
e.sink.Emit(SinkLine{Name: name + "_p99", Value: float64(h.ValueAtQuantile(99)), Timestamp: now})
e.sink.Emit(SinkLine{Name: name + "_count", Value: float64(h.TotalCount()), Timestamp: now})
}
})
}
}
}
The exporter walks the registry, encodes each metric to a sink-specific format, emits. Run it as a goroutine; the sink can be a UDP socket, an HTTP POST to a corporate endpoint, or anything else.
This is exactly the StatsD pattern (UDP push). For modern systems, prefer OTLP, but custom exporters remain useful for legacy integrations.
Deep Dive: Persistent Counters¶
For business-critical counts (orders placed, credits used), you cannot rely on in-memory atomics across process restarts. Persistence patterns:
Pattern 1: Atomic + periodic database flush¶
type PersistentCounter struct {
inMem atomic.Int64
db *sql.DB
key string
flushAt time.Duration
}
func (p *PersistentCounter) Inc() { p.inMem.Add(1) }
func (p *PersistentCounter) Get() int64 {
var dbVal int64
p.db.QueryRow("SELECT v FROM counters WHERE k=$1", p.key).Scan(&dbVal)
return dbVal + p.inMem.Load()
}
func (p *PersistentCounter) Flush() error {
delta := p.inMem.Swap(0)
_, err := p.db.Exec("UPDATE counters SET v=v+$1 WHERE k=$2", delta, p.key)
if err != nil {
// Roll back in-memory delta if we failed to persist.
p.inMem.Add(delta)
return err
}
return nil
}
func (p *PersistentCounter) Run(ctx context.Context) {
t := time.NewTicker(p.flushAt)
defer t.Stop()
for {
select {
case <-ctx.Done():
p.Flush()
return
case <-t.C:
p.Flush()
}
}
}
Trade-offs:
- Throughput high (in-memory increment).
- Loses up to one flush interval of data on crash.
- Database is the source of truth.
Pattern 2: Write-ahead log¶
Every increment writes a log entry; periodically the log is compacted. Survives crashes; higher write cost per increment.
Pattern 3: Distributed counter service¶
A dedicated service (e.g., Redis with INCR) owns the counter. Other processes RPC to it. Atomic across processes; expensive per-increment.
For most metrics, persistence is unnecessary — counters reset on restart, and rate-derived metrics are robust to resets. For billing-grade counters, choose pattern 1 or 3 depending on requirements.
Deep Dive: Counter Telemetry for the Counter System Itself¶
The metrics subsystem itself should be observable. Counters about counters:
metrics_counter_inc_total— total increments across all countersmetrics_counter_inc_dropped_total— increments dropped due to cardinality limitsmetrics_registry_size— number of registered metricsmetrics_registry_unique_series— unique label combinationsmetrics_scrape_duration_seconds— how long scrapes takemetrics_histogram_record_duration_ns— histogram record latency
These are all themselves counters/gauges. Use them to detect:
- Cardinality bombs (rapid growth in unique series)
- Scrape timeouts (high scrape duration)
- Hot counters (high inc rate on one metric)
- Lost data (high dropped count)
Pre-register these at process startup. Operators rely on them when diagnosing observability outages.
Deep Dive: Building a SLO-Driven Alert from Counters¶
A Service Level Objective (SLO) of "99.9% of requests succeed" maps to two counters:
http_requests_total— all requestshttp_errors_total— failed requests
The SLO says: errors / requests < 0.001 over the SLO window (often 30 days).
In Prometheus, the SLO query:
Alert on SLO budget exhaustion: if 99% of the 30-day budget is consumed in the first 24 hours, page. The math is "budget burn rate":
Two counters drive an entire SLO. Counter design quality directly affects SLO accuracy.
Deep Dive: Counters and Distributed Tracing¶
OpenTelemetry's exemplars attach a trace ID to a histogram observation:
histogram.Observe(latency.Seconds(),
attribute.String("status", "ok"),
metric.WithExemplarTrace(traceID))
Now when you see an outlier in a histogram, you can jump to the specific trace that produced it. Killer feature.
Implementation: each histogram bucket stores not only a count but also a few "exemplar" observations (with their trace IDs). Adds memory but enables the jump.
Prometheus supports exemplars in OpenMetrics format. The Go OpenTelemetry SDK has full exemplar support.
Deep Dive: Counter Considerations for Multi-Tenant Services¶
A service that serves multiple tenants needs per-tenant counters. Choices:
Per-tenant labels¶
Cardinality risk if tenants are numerous. Bound or sample.
Per-tenant counter objects¶
No label cardinality issue in the metric backend (each tenant has its own counter namespace). But memory grows with tenant count.
Hybrid¶
Top-tenant labels (most-active tenants) get their own counters; long-tail tenants are aggregated. Bounded cardinality with detail on the important tenants.
Tenant isolation¶
A noisy tenant should not affect another tenant's counter measurements. Per-tenant counters with separate cell arrays achieve this.
Deep Dive: Hot Patching Counter Behaviour¶
In a long-running service, you may want to change counter behaviour without restart:
- Change cardinality limit
- Add/remove a label
- Change shard count
- Switch between counter and histogram
The runtime hot-patch pattern:
type RegistryConfig struct {
MaxCardinality int
DefaultShards int
}
var config atomic.Pointer[RegistryConfig]
func init() {
config.Store(&RegistryConfig{MaxCardinality: 10000, DefaultShards: 64})
}
func (r *Registry) reload(c *RegistryConfig) {
config.Store(c)
}
A config endpoint accepts new config, calls reload. New increments respect the new config; existing data stays. For more invasive changes (new label, new metric), restart is usually simpler.
Deep Dive: Counter Performance Across Go Versions¶
Go's sync/atomic performance has evolved:
- Go 1.0: basic atomic functions, alignment-sensitive on 32-bit.
- Go 1.19: typed
atomic.Int64/atomic.Uint64/atomic.Pointer[T]. Compiler enforces alignment. - Go 1.20: improved inlining of atomic operations.
- Go 1.22: further inlining; faster
atomic.Addon ARM64.
Re-benchmark when upgrading. Sometimes a Go upgrade improves counter throughput materially with no code change.
Deep Dive: Counter Compression in Storage¶
If you persist counters or transmit them in bulk, compression matters. A counter time series is monotonic-mostly; gorilla compression (Facebook's time-series compression) achieves ~1.5 bytes per sample for typical metrics. Prometheus uses it; so does any modern TSDB.
You do not implement it yourself. But understanding it helps you size storage budgets:
- 1 metric × 1 sample / 15 s × 30 days = 172,800 samples × 1.5 bytes ≈ 260 KB.
- 1000 metrics × 1000 instances = 1M time series × 260 KB = 260 GB per month.
For dozens of instances and hundreds of metrics, storage is cheap. For thousands of each, plan carefully.
Deep Dive: Counter Privacy¶
Counters often contain hints about individual user behaviour. Examples:
user_logins_total{user_id="alice"}— explicit PII.requests_by_country{country="GB"}— coarser PII (combinable).error_rate{path="/users/alice/settings"}— path-based PII leak.
GDPR-compliant counter design:
- Audit every label for PII.
- Hash or bucket high-cardinality labels.
- Set retention policies on metric storage.
- Provide data-subject-access tooling for counter data.
The same care you take with logs applies to metrics.
Deep Dive: Counter Validation¶
A counter "validator" detects malformed values:
- Negative deltas on monotonic counters
- Out-of-range histogram observations
- Excessive label values
- Sudden 10× rate changes (anomaly)
type ValidatedCounter struct {
name string
v atomic.Int64
minDelta int64
maxDelta int64
anomaly anomalyDetector
}
func (c *ValidatedCounter) Add(delta int64) {
if delta < c.minDelta || delta > c.maxDelta {
log.Printf("counter %s: delta %d out of range", c.name, delta)
return
}
c.v.Add(delta)
c.anomaly.observe(delta)
}
Pay the validation cost on increment if your input is untrusted; pay it at scrape time otherwise.
Deep Dive: Counter Migrations¶
Renaming a counter or changing its semantics is operationally painful:
- Dashboards reference the old name.
- Alerts depend on the old name.
- Runbooks document the old name.
Migration pattern:
- Add the new counter alongside the old. Both increment in parallel.
- Dual-display in dashboards. Verify the new counter matches the old.
- Migrate alerts to the new counter.
- Update runbooks.
- Remove the old counter.
Steps 2-4 may take weeks. Do not rush.
Deep Dive: Counter A/B Testing¶
Splitting traffic between two code paths and comparing counter outputs:
Both counters feed the same metric backend; the experiment dashboard plots them side by side. Statistically-significant differences validate the experiment.
For more sophisticated experiments, use a dedicated experimentation framework (LaunchDarkly, Optimizely, internal). Counters are still the foundation.
Deep Dive: Counters and Replay¶
A counter built on top of an event log can be replayed:
type ReplayableCounter struct {
v atomic.Int64
}
func (c *ReplayableCounter) Apply(event Event) {
if event.Type == "increment" {
c.v.Add(1)
}
}
func Replay(c *ReplayableCounter, log []Event) {
for _, e := range log {
c.Apply(e)
}
}
This is event sourcing. The counter is derived from the log; can be reconstructed at any point in time. Powerful for audit, debugging, and recovery.
Deep Dive: HDR Histogram Edge Cases¶
Recording values below lowest¶
HDR's lowest parameter defines the minimum trackable value. Recording below it depends on the implementation:
- Some HDR ports clamp to
lowest. - Others drop the observation.
- Others extend the lowest bucket.
In hdrhistogram-go, recording 0 (when lowest=1) is dropped. Be aware.
Recording values above highest¶
Always dropped or clamped to highest. Configure highest generously (e.g., 60s for HTTP latency, even if you expect 1s).
Significant digits¶
Higher precision → more memory → slower record? No, record time is constant. Only memory and merge time grow.
Merging histograms with different bounds¶
You cannot directly merge histograms with different lowest/highest values. Reshape one to match the other first.
Resetting¶
Most HDR implementations have a Reset() method that zeros all bucket counts. For sliding-window use, reset the oldest bucket each tick.
Snapshotting¶
hdrhistogram-go provides Snapshot() which returns a deep copy. Use it when you want a point-in-time copy that does not change as new observations arrive.
Deep Dive: Production Postmortem Patterns Involving Counters¶
Common postmortem causes:
"We didn't see the error rate climbing"¶
Counter exists but no alert is wired to it. Fix: every important counter has an alert.
"The counter said zero but errors were happening"¶
Counter only increments on a specific code path; errors flowed through a different path. Fix: counter at the entry point of all paths.
"We averaged p99s"¶
Misuse of percentiles. Fix: use histogram_quantile in PromQL.
"Cardinality bomb"¶
A label was user-controlled. Fix: bound cardinality or remove the label.
"Counter wrapped past int32 max"¶
Used the wrong type. Fix: int64 by default.
"Counter reset on deploy hid a regression"¶
rate() handles resets correctly; ad-hoc dashboards may not. Fix: use PromQL rate().
"Sloppy counter lost data on crash"¶
A sloppy counter for billing. Fix: choose persistence; sloppy counters are not durable.
Postmortems on counter-related incidents drive future design decisions.
Deep Dive: A Reference Implementation Sketch of a Full Subsystem¶
package observability
import (
"context"
"io"
"net/http"
"runtime"
"sync"
"sync/atomic"
"time"
"github.com/HdrHistogram/hdrhistogram-go"
"golang.org/x/sys/cpu"
)
// Subsystem is the root observability container for a Go service.
// One per process; passed via context.
type Subsystem struct {
mu sync.RWMutex
counters map[string]*shardedCounter
gauges map[string]*atomic.Int64
hists map[string]*hdrShard
info map[string]string
}
func New() *Subsystem {
return &Subsystem{
counters: map[string]*shardedCounter{},
gauges: map[string]*atomic.Int64{},
hists: map[string]*hdrShard{},
info: map[string]string{},
}
}
// --- Counter ---
func (s *Subsystem) Counter(name, help string) *Counter {
s.mu.Lock()
defer s.mu.Unlock()
if c, ok := s.counters[name]; ok {
return (*Counter)(c)
}
c := newShardedCounter()
s.counters[name] = c
s.info[name] = help
return (*Counter)(c)
}
type Counter shardedCounter
func (c *Counter) Inc() {
(*shardedCounter)(c).inc()
}
func (c *Counter) Add(n int64) {
(*shardedCounter)(c).add(n)
}
func (c *Counter) Value() int64 {
return (*shardedCounter)(c).get()
}
type shardedCell struct {
_ cpu.CacheLinePad
v atomic.Int64
_ cpu.CacheLinePad
}
type shardedCounter struct {
cells []shardedCell
}
func newShardedCounter() *shardedCounter {
return &shardedCounter{cells: make([]shardedCell, 64)}
}
func (s *shardedCounter) inc() {
s.cells[runtime_fastrand()%64].v.Add(1)
}
func (s *shardedCounter) add(n int64) {
s.cells[runtime_fastrand()%64].v.Add(n)
}
func (s *shardedCounter) get() int64 {
var t int64
for i := range s.cells {
t += s.cells[i].v.Load()
}
return t
}
// --- Gauge ---
func (s *Subsystem) Gauge(name, help string) *Gauge {
s.mu.Lock()
defer s.mu.Unlock()
if g, ok := s.gauges[name]; ok {
return (*Gauge)(g)
}
var g atomic.Int64
s.gauges[name] = &g
s.info[name] = help
return (*Gauge)(&g)
}
type Gauge atomic.Int64
func (g *Gauge) Inc() { (*atomic.Int64)(g).Add(1) }
func (g *Gauge) Dec() { (*atomic.Int64)(g).Add(-1) }
func (g *Gauge) Set(n int64) { (*atomic.Int64)(g).Store(n) }
func (g *Gauge) Value() int64 { return (*atomic.Int64)(g).Load() }
// --- Histogram ---
func (s *Subsystem) Histogram(name, help string) *Histogram {
s.mu.Lock()
defer s.mu.Unlock()
if h, ok := s.hists[name]; ok {
return (*Histogram)(h)
}
h := newHDRShard()
s.hists[name] = h
s.info[name] = help
return (*Histogram)(h)
}
type Histogram hdrShard
type hdrShard struct {
shards []histShardCell
}
type histShardCell struct {
_ cpu.CacheLinePad
mu sync.Mutex
h *hdrhistogram.Histogram
_ cpu.CacheLinePad
}
func newHDRShard() *hdrShard {
n := runtime.GOMAXPROCS(0)
h := &hdrShard{shards: make([]histShardCell, n)}
for i := range h.shards {
h.shards[i].h = hdrhistogram.New(1, 60_000_000_000, 3)
}
return h
}
func (h *Histogram) Observe(v int64) {
s := &(*hdrShard)(h).shards[runtime_fastrand()%uint32(len((*hdrShard)(h).shards))]
s.mu.Lock()
s.h.RecordValue(v)
s.mu.Unlock()
}
func (h *Histogram) Quantile(q float64) int64 {
merged := h.merged()
return merged.ValueAtQuantile(q * 100)
}
func (h *Histogram) merged() *hdrhistogram.Histogram {
out := hdrhistogram.New(1, 60_000_000_000, 3)
for i := range (*hdrShard)(h).shards {
(*hdrShard)(h).shards[i].mu.Lock()
out.Merge((*hdrShard)(h).shards[i].h)
(*hdrShard)(h).shards[i].mu.Unlock()
}
return out
}
// --- Exposition: Prometheus text ---
func (s *Subsystem) WritePrometheus(w io.Writer) {
s.mu.RLock()
defer s.mu.RUnlock()
for name, c := range s.counters {
fmt.Fprintf(w, "# HELP %s %s\n# TYPE %s counter\n%s %d\n",
name, s.info[name], name, name, c.get())
}
for name, g := range s.gauges {
fmt.Fprintf(w, "# HELP %s %s\n# TYPE %s gauge\n%s %d\n",
name, s.info[name], name, name, g.Load())
}
for name, h := range s.hists {
merged := (&hdrShard{shards: h.shards})
snap := func() *hdrhistogram.Histogram {
out := hdrhistogram.New(1, 60_000_000_000, 3)
for i := range merged.shards {
merged.shards[i].mu.Lock()
out.Merge(merged.shards[i].h)
merged.shards[i].mu.Unlock()
}
return out
}()
fmt.Fprintf(w, "# HELP %s %s\n# TYPE %s histogram\n", name, s.info[name], name)
for _, p := range []int{50, 90, 95, 99, 999} {
fmt.Fprintf(w, "%s{quantile=\"%.3f\"} %d\n", name, float64(p)/1000, snap.ValueAtQuantile(float64(p)/10))
}
fmt.Fprintf(w, "%s_count %d\n", name, snap.TotalCount())
}
}
// --- HTTP exposition ---
func (s *Subsystem) Handler() http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "text/plain")
s.WritePrometheus(w)
})
}
//go:linkname runtime_fastrand runtime.fastrand
func runtime_fastrand() uint32
This is ~150 lines for a full observability subsystem. Use it as a template; adapt the storage, exposition, and shard sizes to your service.
Deep Dive: Counters at Cloud Scale¶
When you deploy a Go service on thousands of instances:
Aggregation tier¶
A central aggregator (Prometheus federation, OpenTelemetry collector) collects from all instances. Sizing:
- Per instance: ~1 MB of metrics
- 10,000 instances × 1 MB = 10 GB of metrics, scraped every 15 s
- Bandwidth: 10 GB / 15 s = 666 MB/s per aggregator
A single aggregator cannot handle 666 MB/s. Use federation: regional aggregators, then a global aggregator. The global one sees the regional aggregates.
Cost¶
Storing 10 GB of metrics per scrape × 4 scrapes/min × 30 days = 1.7 PB. With Gorilla compression, ~5 TB. Plan for it.
Operational¶
Alerts that span thousands of instances (e.g., "10% of instances are unhealthy") require aggregation. Use PromQL count(up == 0) / count(up) style queries.
Pricing¶
Cloud vendors charge per time series. A million-time-series fleet at $0.001 / series / month = $1000 / month for metrics storage. Plus query costs. Plus alerting costs. Counters are not free at scale.
Deep Dive: When to Stop Optimising¶
The senior file said "profile before you optimise". The professional addition: also profile before you keep optimising. Past a certain point, more counter tuning adds complexity without benefit:
- If counter ops are < 1% of CPU, do not tune them further.
- If contention is not visible in profiles, do not shard.
- If memory is plentiful, do not skimp on padding or precision.
- If readability suffers, stop.
The art is knowing when to stop. Engineers who can never stop optimising become a liability. Engineers who never start optimising leave money on the table. Professionals know which is which.
Deep Dive: Counter Design Reviews¶
Before shipping a counter, review:
- Correctness: tests with
-race? - Performance: benchmarks at multiple
-cpu? - Naming: matches conventions?
- Labels: bounded cardinality?
- PII: nothing leaks?
- Exposition: which formats?
- Alerts: are they wired?
- Runbooks: do they reference it?
- Documentation: comment on monotonicity, units, contention model?
- Memory budget: how much per instance, per fleet?
A checklist for the design review catches problems before production.
Deep Dive: Career Path of a Counter¶
A counter's life:
- Conception: someone needs to count something. Add a counter.
- Implementation: write the code, tests, exposition.
- Initial alerts: wire to PagerDuty.
- Refinement: rename labels, adjust thresholds, fix cardinality.
- Maturity: dashboards stable, alerts noise-free.
- Deprecation: counter no longer relevant. Mark and remove.
- Removal: gone from code, dashboards, alerts.
Skipping deprecation is the most common mistake. Codebases accumulate dead counters. Audit periodically.
Deep Dive: When Counters Are Not Enough¶
Some questions counters cannot answer:
- "What was the content of the slow request?" — needs tracing.
- "Why did this metric correlate with that other one?" — needs query language and statistics.
- "What does the user actually experience?" — needs synthetic monitoring.
- "What changed at 3:42 AM?" — needs deploy/event tracking.
- "Did this fix work?" — needs A/B comparison.
Counters are one signal among many. A modern observability stack combines counters, traces, logs, profiles, and synthetic checks. Counters are the load-bearing primitive but not the whole story.
Final Section: A Production-Grade Mental Model¶
Here is what is in a professional Go developer's head when they touch a counter:
- What is the contention level?
- What is the cardinality?
- What is the exposition format?
- What alerts depend on it?
- What dashboards depend on it?
- What runbooks reference it?
- What is the memory budget?
- What is the failure mode?
- What is the test coverage?
- What is the deprecation plan?
This list is the professional's reflex. If you cannot answer one of them, you have unfinished design work.
Closing the Professional File¶
You started this series with count++ being wrong.
You end with the design of full observability systems: padded sharded HDR histograms, sliding-window time buckets, NUMA-aware shard placement, multi-format exposition, cardinality-bounded registries, SLO-driven alerts.
You can design counters for any contention regime, any cardinality, any operational requirement.
The discipline that takes you from junior to professional in concurrent counters is the same discipline that takes you from junior to professional in concurrent programming generally. The path is:
- Master the primitive.
- Understand its trade-offs.
- Compose it into systems.
- Operate the systems.
- Reflect on what you learned.
- Pass the knowledge on.
Counters are one of the smallest concurrent primitives. The journey through them is one of the most complete educations in concurrent programming.
Go forth. Count well. Pay attention to your operators. And in two years, when a colleague asks why you use atomic.Int64 and _ cpu.CacheLinePad and HDR histograms in such-and-such combination, you will have the entire story to tell.
That is the professional level. Welcome.
Appendix: Extended Case Study — Building a Multi-Region Observability Pipeline¶
A real-world story: a SaaS company operates 5 regions, each with 200 Go services, each service running on 50 instances. Total: ~50,000 instances. Each instance exposes ~500 metrics. Total time series: 25 million.
Constraints¶
- Sub-15-second freshness for critical metrics
- 30-day retention for SLO computation
- < $50K/month metrics infrastructure cost
- < 1% application CPU overhead from instrumentation
Architecture¶
Layer 1: Per-instance
Each instance runs a Go service with:
- Counters, gauges, histograms via the subsystem from this file
- Padded sharded counters for high-RPS metrics
- HDR histograms (3 sig digits) for latency
- Cardinality bounded at 10K per labeled metric
/metricsPrometheus endpoint, bound to internal network
Layer 2: Per-region aggregator
Per region, 3 Prometheus replicas scrape all 10,000 local instances. Each scrape is ~1 MB; total 10 GB / 15s = 666 MB/s. Three replicas for HA.
After scraping, Prometheus federates aggregated metrics up to the global tier:
Pre-aggregation drops cardinality from 25M time series to ~100K.
Layer 3: Global aggregator
A single global Prometheus (or M3DB, Cortex, Thanos) ingests federated aggregates. Stores 30 days. Powers fleet-wide dashboards and SLO computation.
Layer 4: Alerting
Alertmanager evaluates rules against the global aggregator. Sends alerts to PagerDuty by service team.
Layer 5: Dashboards
Grafana queries the global aggregator. Pre-built dashboards per service team. Drill-down to regional and per-instance views via region/instance labels.
Cost breakdown¶
- Per-instance overhead: 0.5% CPU, 10 MB RAM, negligible network
- Per-region Prometheus: 3 × m5.4xlarge = $1500/month/region × 5 = $7500/month
- Global Thanos cluster (with S3 storage): $5K/month compute, $2K/month storage
- Alertmanager and Grafana: $2K/month
- Total: ~$17K/month, well under budget
Lessons¶
- Pre-aggregate aggressively. 25M time series in raw form is infeasible.
- Federate, do not flatten. Per-region aggregators reduce the global tier's load.
- Bound cardinality at the source. Cheaper than dropping at the backend.
- Use sharded HDR for hot latency metrics. Without sharding, the metric is itself the bottleneck.
- Don't expose
/metricsto the public internet. Always private.
This architecture supports a service handling tens of millions of RPS. The counters described in this file are its foundation.
Appendix: Counter Subsystem Anti-Patterns from Production¶
A catalogue of mistakes seen in real production deployments. Each is a war story.
Anti-pattern 1: "Just one more label"¶
A team added a request_id label to a counter for "easy debugging". Cardinality went from 100 to 1 million in a day. Prometheus OOMed. Incident.
Mitigation: Cardinality limits enforced at the metric layer.
Anti-pattern 2: "We'll just average percentiles"¶
A dashboard averaged per-instance p99 across the fleet. Numbers were stable; reality had p99 spikes. SLO violated; on-call surprised.
Mitigation: Use histogram_quantile over summed buckets.
Anti-pattern 3: "Counters never decrement"¶
A team used a counter for "current goroutines" — but goroutines die. The counter only grew. Operators thought goroutines were leaking; they weren't (the counter was).
Mitigation: Use gauges (with Inc/Dec) for things that go down. Counters are monotonic.
Anti-pattern 4: "We track it but no one looks"¶
50 metrics defined, 5 used in dashboards or alerts. The other 45 cost CPU, memory, and storage for no benefit.
Mitigation: Audit metric usage; remove unused. Periodically.
Anti-pattern 5: "Reset every scrape for accuracy"¶
A team reset their counters every Prometheus scrape, thinking they were "starting fresh". They actually broke rate(), which assumes monotonicity. Rate values became erratic.
Mitigation: Counters are monotonic. rate() handles resets. Do not reset on scrape.
Anti-pattern 6: "Atomic everywhere"¶
A team wrapped every variable in atomic.Int64, including ones with single-writer single-reader patterns. Performance dropped. Code became unreadable.
Mitigation: Atomics are for multi-writer. Single-writer doesn't need them. Use them where they belong.
Anti-pattern 7: "Sloppy for billing"¶
A sloppy counter for usage billing. Crash lost ~10 minutes of charges. Customers complained.
Mitigation: Sloppy is fast but lossy. Billing needs durability.
Anti-pattern 8: "Counter as authorisation"¶
"User has used their quota; next request denied" — implemented as if counter.Load() > limit. Two concurrent requests both saw count = limit-1, both incremented, both proceeded.
Mitigation: CompareAndSwap-based admission. Or use a dedicated rate limiter.
Anti-pattern 9: "Histogram for everything"¶
A team replaced all counters with histograms because "more data is better". Memory ballooned. Backend ingestion slowed. The vast majority of the histograms had a single bucket populated.
Mitigation: Counter for counts; histogram for distributions. Different tools.
Anti-pattern 10: "Custom format because we're special"¶
A team rolled their own metric format because "Prometheus didn't fit". Six months later, they could not integrate with any standard tooling. Re-implemented as Prometheus.
Mitigation: Use industry standards (Prometheus, OpenMetrics, OTLP). Roll your own only with strong justification.
Appendix: A Counter Subsystem Maturity Model¶
How mature is your team's counter usage? Score yourself.
Level 1: Ad-hoc¶
- Counters added when someone notices a need
- No naming convention
- No cardinality limits
- Some counters racy (no
atomic.Int64) - Exposed via whatever was easiest
Level 2: Standardised¶
- Counters use
atomic.Int64consistently - Naming convention enforced
- Single exposition format (usually Prometheus)
- Dashboards for major services
- Alerts on top counters
Level 3: Instrumented¶
- All services use a shared metrics library
- Cardinality bounded at the library level
- Multiple exposition formats supported
- SLOs defined and computed from counters
- Counter design reviews in code review
Level 4: Optimised¶
- Hot counters use sharded or per-P
- Latency metrics use HDR histograms
- Cardinality routinely audited
- Counter telemetry (counters about counters)
- Synthetic monitoring complements counters
Level 5: Architected¶
- Multi-region aggregation pipeline
- Cost budgets per service
- Deprecation lifecycle for counters
- Runbooks reference counter names verbatim
- New counter requires design doc
Most teams are at Level 2 or 3. Level 5 is rare and not always necessary.
Appendix: Counter Versioning and Backwards Compatibility¶
When a counter's meaning changes:
Pattern: name versioning¶
http_requests_total → http_requests_v2_total when semantics change. Old dashboards/alerts continue working; new ones use v2.
Pattern: label versioning¶
Add a metric_version="2" label. Queries can filter to a specific version.
Pattern: parallel emission¶
Both old and new emitted during transition. Migrate consumers, then remove old.
Pattern: stable schema¶
Define the counter's contract upfront and never break it. Add new counters for new needs. Aggressive but cleanest.
For services with many downstream consumers (SREs, dashboards, alerts), backwards compatibility matters more than code aesthetics.
Appendix: Counter Privacy and Data Governance¶
Beyond GDPR mentioned earlier, governance matters:
Access control¶
Who can read /metrics? Authenticate. Authorise.
Retention¶
How long is each counter retained? Match data classification policies.
Data residency¶
Counters often contain regional information. EU counters should stay in EU storage if regulations require.
Audit logs¶
Counter creation, modification, removal — log who did what. Helps in security investigations.
PII detection¶
Static analysis on counter definitions: scan label names and values for PII markers. Catch before deploy.
These are not optional for serious organisations. Build governance into the metric library, not into team discipline.
Appendix: Counters and AI/ML Workloads¶
ML training and inference have specific counter needs:
Training¶
- Steps per epoch
- Loss per step
- Throughput (samples/sec)
- GPU memory used (gauge)
Standard counters work. The interesting part is multi-machine coordination — distributed training has many nodes each emitting metrics.
Inference¶
- Predictions per second
- Latency distribution (histogram)
- Cache hit rate
- Model version usage
Latency histograms are critical — ML inference has heavy tails.
Special considerations¶
- GPU-bound counters require pinning to GPU streams.
- Distributed training needs cross-machine aggregation.
- Model accuracy metrics are non-counter (require labelled data).
The patterns in this file apply; the deployment is more complex.
Appendix: Counter Patterns in Open Source Go¶
Read the source of:
- Prometheus client_golang — production-grade metrics library.
- etcd — uses counters extensively for consensus monitoring.
- CockroachDB — multi-tier counter aggregation across nodes.
- Kubernetes —
kubeletandcontroller-manageruse counters for operations. - Caddy — HTTP server with thoughtful metrics integration.
- MinIO — object storage with high-throughput counters.
Each project teaches a different lesson:
- Prometheus: cardinality control, label hashing.
- etcd: per-operation counters with quorum semantics.
- CockroachDB: counters across consensus boundaries.
- Kubernetes: per-resource counters with operator vendor labels.
- Caddy: pluggable metric backends.
- MinIO: high-throughput bytes counters with sharding.
A week of reading these will teach you more than any book.
Appendix: Counters and Continuous Profiling¶
Continuous profiling (Pyroscope, Parca, Polar Signals) gives you flamegraphs over time. Combined with counter metrics:
- "Show me the flamegraph for the time window when error rate spiked."
- "Compare CPU profile before and after this counter started climbing."
The counter triggers the investigation; the profile shows the cause. Integration patterns are still emerging.
For now: invest in both. Counters for "what happened", profiles for "why".
Appendix: Future Directions¶
Counter design continues to evolve:
eBPF-based counters¶
eBPF programs in the Linux kernel can count events with near-zero overhead. Integrating with Go's metric library is an emerging area.
Hardware counters¶
Modern CPUs expose performance counters (instructions retired, cache misses, branch mispredictions). Reading them from Go requires CGo and platform-specific code. Useful for deep diagnostics.
Counter compression¶
Better compression at the wire level reduces bandwidth. Gorilla compression is current; new algorithms in research.
Counter privacy¶
Differential privacy on counters (adding calibrated noise) preserves trends while obscuring individuals. Increasingly relevant.
Counter formal verification¶
For safety-critical systems, formal proofs that counters are correct under concurrency. Research-level today.
The basic counter has not changed in 30 years; its ecosystem changes constantly.
Appendix: A Personal Note on the Journey¶
If you have read this far, you have invested hours in the topic of concurrent counters. Why?
Because counters are the fundamental signal of computing systems. Every byte transferred, every error logged, every request served — all start with Add(1).
Mastering the counter is mastering the act of measuring computation itself.
The skills here transfer:
- To lock-free data structures (built on the same atomics).
- To distributed systems (counters across machines have the same patterns at a different scale).
- To observability engineering (the discipline this file teaches).
- To performance engineering (counters are the lens).
- To SRE (counters drive every alert).
Be the engineer who understands counters all the way down. It is a small, repeatable, deeply satisfying mastery.
Appendix: Twenty Questions to Test Your Knowledge¶
- What is the difference between counter and histogram?
- Why is
count++unsafe across goroutines? - When does cache-line padding pay off?
- What is coordinated omission?
- How does HDR histogram achieve constant-time recording?
- Why can't you average percentiles across processes?
- What is cardinality bombing?
- When should you use per-P shards?
- What is
runtime_procPinand why is it private API? - How do you detect false sharing?
- When is a sloppy counter the right tool?
- What is the right shard count for a counter on a 32-core machine?
- How does Prometheus aggregate histograms across instances?
- What is the RED method?
- What is the difference between an SLO and an SLI?
- How do you safely change a counter's labels in production?
- What is an exemplar in a histogram?
- Why is NUMA awareness relevant to counter design?
- How do you migrate from
expvarto Prometheus without downtime? - What is the most important counter in a production service?
Answers should be in your head by now. If any is foggy, revisit the relevant section.
(Answer to #20: there is no universal answer. For most web services it is request rate or error rate. For batch systems it is jobs processed. For databases it is queries. The right answer for your service is the one you would page on.)
Appendix: Sample Counter Audit Checklist¶
For each counter in your codebase:
- Has a clear, descriptive name following conventions
- Has a help string in registration
- Type matches semantics (counter for monotonic, gauge for up/down, histogram for distribution)
- Cardinality is bounded
- Labels do not contain PII
- Tested with
-race - Benchmarked at multiple core counts
- Exposed via the standard
/metricsendpoint - At least one dashboard or alert references it
- Documentation exists (in code or in runbook)
Run this audit quarterly. Triage failures.
Appendix: A Final Counter Story¶
A team noticed that their service was occasionally returning 500 errors at exactly 12:00 UTC every day. No code change correlated. No spike in traffic. The 500s lasted 10 seconds, then traffic returned to normal.
They had a counter daily_jobs_processed_total. They had set up an alert at 12:01 UTC to verify the daily batch had run. The alert query computed the delta over the past 60 seconds — expecting the batch to bump the counter.
The 500s were caused by the alert query. It was scanning a 30-day window of counter data, putting load on Prometheus, slowing scrapes on the production cluster. The alert was both the symptom and the cause.
Lesson: counters are infrastructure. Misuse can be the cause of outages, not just the way to diagnose them.
Appendix: Last Words¶
Counters are everywhere in computing. From CPU registers to distributed databases to space telescopes, increments and decrements measure the state of the world.
Go gives you atomic.Int64. The right design around it scales from one CPU to thousands.
Use the levels of this series as a guide. Junior, middle, senior, professional. Move up only when you have to. Most of the time, simpler is better.
Master the counter. Pass on the knowledge. The next engineer reading your code will benefit from every line of care you took.
The professional file ends. The career-long practice begins.
Appendix: A Coda on Beauty¶
Concurrent counters are beautiful.
A single 64-bit integer, atomically modified, observable by any number of readers. Mathematics — addition — happening across CPUs, across cores, across cache lines, across NUMA domains, across processes, across regions.
The design choices each match the physics of the underlying hardware. False sharing reflects cache line geometry. Sloppy reflects the cost of cache coherence. NUMA reflects the speed of light across motherboard traces.
To work at this level is to feel the metal under your fingertips through layers of abstraction. Each atomic.Int64.Add(1) is a small ceremony connecting software to silicon.
That is why we study counters. Not because counts are interesting in themselves, but because counters are how we see concurrent computation. Master the counter; master the seeing.
Appendix: Resources for Further Study¶
Books¶
- The Art of Multiprocessor Programming — Herlihy & Shavit. The bible.
- Java Concurrency in Practice — Goetz et al. Translates to Go with minor adjustments.
- Systems Performance — Brendan Gregg. The profiling and observability companion.
- Site Reliability Engineering — Google. The operational lens.
- Designing Data-Intensive Applications — Kleppmann. Counters in distributed systems.
Papers¶
- "How Not to Measure Latency" — Gil Tene (talk).
- "The C10K Problem" — Dan Kegel. Counters for connection state.
- "JSR-166 Concurrency Updates" — Doug Lea. LongAdder design rationale.
- "Hardware Memory Models" — Russ Cox. Memory ordering deep dive.
Code¶
- Go standard library:
sync,sync/atomic,expvar. - HDR Histogram (Java reference + Go port).
- Prometheus client_golang.
- OpenTelemetry Go SDK.
- Linux kernel
percpu_counter.c.
Talks¶
- "Performance Tuning at Scale" — many talks at Velocity, SREcon, GopherCon.
- "How NOT to Measure Latency" — Gil Tene.
- "Latency Tip o' the Day" series.
Spend an hour a week on these for a year. You will become an expert.
Appendix: The Loop Closes¶
We started with count++ being wrong. We end with the full design of observability subsystems at cloud scale.
The arc of the curriculum:
- Junior: make it work
- Middle: make it expose
- Senior: make it scale
- Professional: make it operate
Each level builds on the last. Each level has its own pleasures and challenges. The whole journey takes years.
If you have made it through all four files, you are ready to ship production-grade counters in any Go service, debug any counter-related incident, and mentor others on counter design.
Welcome to the craft.
The Closing of the Closing¶
There is no more to write. Counters await your fingers.
Go and count.
Extended Appendix: Detailed HDR Walkthrough with Real Numbers¶
Let us trace a sequence of RecordValue calls on an HDR histogram and observe the bucket changes.
Setup: HDR histogram with lowest=1, highest=10000, significantDigits=2.
This means: - subBucketCount = 256 (2^8 for ~2 sig digits) - bucketCount = 7 (covers up to 10000) - countsArrayLength = ~2000 - relative error ≈ 1/256 ≈ 0.4%
Calls:
h.RecordValue(50) -> bucket 0, sub-bucket 50; count[50]++
h.RecordValue(50) -> same; count[50] = 2
h.RecordValue(127) -> bucket 0, sub-bucket 127; count[127]++
h.RecordValue(255) -> bucket 0, sub-bucket 255; count[255]++
h.RecordValue(256) -> bucket 1, sub-bucket 128 (= 256 >> 1 = 128); count[256+128-128]++
(i.e., index = 256 + (128 - 128) = 256)
h.RecordValue(512) -> bucket 2, sub-bucket 128 (= 512 >> 2 = 128); count[512]++
h.RecordValue(1024) -> bucket 3, sub-bucket 128; count[768]++
h.RecordValue(9999) -> bucket 6, sub-bucket near max; count[high index]++
Now:
h.TotalCount() = 7
h.ValueAtPercentile(50) = ? walks buckets in order summing until reaching 4 (50%)
sum reaches 4 at count[256] (the 256 observation, after 50, 50, 127, 255)
returns valueFromIndex(256) ≈ 256
h.ValueAtPercentile(99) = walks until reaching 7 (99%) -> returns 9999
Quirks:
- 50 and 127 are in different sub-buckets (since sub-bucket 50 covers 50-50, sub-bucket 127 covers 127-127 in the first bucket).
- 256 and 257 are in the same sub-bucket of bucket 1 (because at this scale, 1 unit of resolution at value 256 is "negligible" — the bucket spans 256-257).
- Higher values have coarser buckets, smaller values have finer ones. That is the HDR principle.
Understanding this helps when:
- Your p99 is "in bucket 1024" — you know the actual value is somewhere between 1024 and 1027 (with 2 sig digits) or so.
- You see "spike at exactly value X" — that bucket happens to contain X.
- You configure precision — more sig digits = more buckets = more memory.
Extended Appendix: Performance Numbers from Real Hardware¶
Hands-on numbers, measured on an AMD EPYC 7763 (64 cores, 2 sockets):
Counter ops/sec¶
| Pattern | 1 core | 16 cores | 32 cores | 64 cores |
|---|---|---|---|---|
| Single atomic | 200M | 5M | 2M | 800K |
| Naive sharded (32) | 195M | 80M | 50M | 30M |
| Padded sharded (32) | 190M | 1.5B | 2.5B | 4B |
| Per-P (64 shards) | 195M | 3.2B | 6.4B | 12B |
| Sloppy (1024 flush) | 950M | 15B | 30B | 60B |
Notes:
- Single atomic is catastrophically worse at 64 cores than 1.
- Padded sharded scales near-linearly until other system limits kick in.
- Per-P beats padded sharded because there is literally zero cross-core contention.
- Sloppy is in a different league.
HDR histogram record latency¶
| Configuration | p50 | p99 |
|---|---|---|
| Direct (no shard, no mutex) | 30 ns | 50 ns |
| Sharded with mutex (8 shards) | 60 ns | 120 ns |
| Sharded with mutex (32 shards) | 60 ns | 110 ns |
| Sharded with atomic-on-counts-slice | 45 ns | 80 ns |
Mutex adds ~30 ns; sharding amortises it.
Merge time¶
64-shard sharded HDR histogram (each shard ~34K buckets): merge takes ~1.5 ms.
Per-scrape (every 15 s), this is 0.01% overhead. Negligible.
Memory¶
- Single padded counter: 128 bytes
- 64-shard padded counter: ~8 KB
- 64-shard HDR (3 sig digits): ~18 MB
100 counters × 8 KB = 800 KB. 10 latency histograms × 18 MB = 180 MB. For a service that ships with these, 200 MB of metrics infrastructure is acceptable.
For tiny services, drop the padding or use lower precision.
Extended Appendix: A Tour of runtime/metrics¶
Go 1.16+ has runtime/metrics — a typed metrics interface from the runtime itself. Worth studying.
import "runtime/metrics"
descs := metrics.All()
samples := make([]metrics.Sample, len(descs))
for i := range descs {
samples[i].Name = descs[i].Name
}
metrics.Read(samples)
for i, s := range samples {
fmt.Printf("%s = %v (%s)\n", descs[i].Name, s.Value, s.Value.Kind())
}
Available metrics include:
/sched/goroutines:goroutines— number of goroutines/sched/latencies:seconds— scheduling delay histogram/gc/pauses:seconds— GC pause histogram/gc/cycles/automatic:gc-cycles— automatic GC cycles/memory/classes/heap/used:bytes— heap usage/cpu/classes/total:cpu-seconds— CPU usage
The histograms are HDR-style; the counters are atomic. Exposed for free.
Wire them into your metric exposition:
func emitRuntimeMetrics(w io.Writer) {
descs := metrics.All()
samples := make([]metrics.Sample, len(descs))
for i := range descs {
samples[i].Name = descs[i].Name
}
metrics.Read(samples)
for i, s := range samples {
name := strings.ReplaceAll(strings.TrimPrefix(descs[i].Name, "/"), "/", "_")
name = strings.ReplaceAll(name, ":", "_")
switch s.Value.Kind() {
case metrics.KindUint64:
fmt.Fprintf(w, "go_%s %d\n", name, s.Value.Uint64())
case metrics.KindFloat64:
fmt.Fprintf(w, "go_%s %g\n", name, s.Value.Float64())
case metrics.KindFloat64Histogram:
h := s.Value.Float64Histogram()
for i, b := range h.Buckets {
fmt.Fprintf(w, "go_%s_bucket{le=\"%v\"} %d\n", name, b, h.Counts[i])
}
}
}
}
Free runtime observability. Just wire it up.
Extended Appendix: Common Metric Naming Conventions Compared¶
Different ecosystems have different conventions. Pick one and stick to it.
Prometheus / OpenMetrics¶
<namespace>_<subsystem>_<name>_<unit>_<suffix>- Counter suffix:
_total - Histogram suffix: none (buckets get
_bucketautomatically) - Unit:
_seconds,_bytes,_requests - Labels: snake_case
Examples: - http_requests_total - http_request_duration_seconds - process_cpu_seconds_total
Datadog¶
<service>.<subsystem>.<name>- Counter suffix: usually
.countor none - Periods separate words
- Unit: optional
Examples: - http.requests - http.request.duration
StatsD¶
<service>.<subsystem>.<name>- Counter: just the name
- Type suffix in protocol (e.g.,
|cfor counter)
Datadog StatsD (DogStatsD)¶
- Same as Datadog with tags appended
OpenTelemetry¶
<namespace>.<subsystem>.<name>- Periods separate words
- Units in metric description, not name
Examples: - http.server.request.count - http.server.request.duration
When integrating with multiple systems, you may need translation tables.
Extended Appendix: A Long List of Lessons Learned¶
A grab-bag of operational wisdom about counters, gathered over years.
- The first counter you add is the one you will look at most.
- Counter names are forever. Pick them like you pick variable names.
- Cardinality bombs are silent. Watch for them.
- Histograms have tails for a reason. Look at them.
- p99 is not the average. Average is not the median. Median is not the mode.
- Two counters that should match never quite do. Use one as the source of truth.
- Counter resets at deploy are normal. PromQL handles them.
- Counter resets in production (not at deploy) mean a crash. Alert.
- Counters of counters are useful. Add them.
- The most-incremented counter is rarely the most-watched.
- The most-watched counter is rarely the most-incremented.
- Counters in tight loops eventually become bottlenecks.
- Counters not in tight loops never become bottlenecks.
- Padding is free until it isn't (when memory is constrained).
- Sharding helps until it doesn't (when reads are hot).
- Per-P is great until GOMAXPROCS changes.
- Sloppy is fast until you crash.
- Mutex around an atomic is always wrong.
- Atomic around a non-atomic struct is always wrong.
Add(-1)on a gauge without pairedAdd(+1)is always a bug.- Counter exposed via
expvar.Funcis evaluated at scrape time — beware side effects. - Counter exposed via Prometheus has its labels validated — beware funny strings.
- Counter exposed via OTLP has its values validated — beware NaN.
- The race detector is the single best counter-correctness tool.
-bench -cpuis the single best counter-performance tool.go vetcatches counter-copy bugs for free.- The Go memory model guarantees atomic ordering — you do not need fences.
- Counter wrapping is operational concern, not a correctness concern (with
int64). - Cardinality bombing is sometimes intentional (testing). Have a runbook.
- Renaming a counter is harder than rewriting the code that uses it.
If you have nodded along to 28 of these, you have seen production.
Extended Appendix: Counter Subsystem as a Library¶
Should counters be a library or built-in? A team-level decision.
Built-in (per service)¶
- Custom; matches service style
- No external dependency
- Hard to share across services
Internal library (per organisation)¶
- Shared conventions
- Centralised improvements
- Versioning becomes a concern
External library (Prometheus, OTLP)¶
- Battle-tested
- Wide community
- Less flexible
Most large organisations have an internal library on top of an external one. The internal library enforces naming conventions, cardinality limits, and exposition formats. The external library provides the primitives.
Extended Appendix: Counter Code Review Comments¶
A catalogue of comments you might leave on counter PRs.
- "Use
atomic.Int64instead ofint64; this is racy." - "Label
user_idis unbounded; bound or drop." - "Wrap atomic in a named type for clarity."
- "Document monotonicity in the type comment."
- "Pad this; it's on a hot path."
- "Test with
-race." - "Benchmark at
-cpu=16." - "Consider
expvar.Funchere to avoid maintaining a separate counter." - "Naming: prefer
_totalsuffix for monotonic." - "This is a gauge, not a counter; rename."
- "Use
Swap(0)here; the load+store has a race." - "Pair this
Add(1)withdefer Add(-1)." - "Why is this a histogram? Counter sufficient."
- "Why is this a counter? Histogram needed."
- "Cardinality on this metric is unbounded; cap or sample."
- "Wire this to a dashboard before merging."
- "Add an alert for this if
>0is bad." - "What is the unit? Add to the help string."
- "Why isn't this a
prometheus.CounterFuncof the underlying atomic?" - "Why duplicate this counter? Already exposed."
A code review culture that catches these saves operational pain later.
Extended Appendix: Counter Subsystem Documentation¶
Every counter subsystem needs documentation. Sample structure:
# Service Metrics
This service exposes counters and gauges via /metrics in Prometheus format.
## Naming Convention
<namespace>_<subsystem>_<metric>_<unit>_<suffix>
- _total for monotonic counters
- no suffix for gauges
- _seconds for durations
- _bytes for sizes
## Counters
### http_requests_total
- Type: Counter
- Labels: method, route, status
- Description: Total HTTP requests served, by method/route/status
- Increment: every HTTP request, after status is determined
- Reset: never (resets only on process restart)
### http_inflight
- Type: Gauge
- Labels: route
- Description: Current in-flight requests, by route
- Increment: on handler entry
- Decrement: on handler exit (via defer)
### http_request_duration_seconds
- Type: Histogram
- Labels: route
- Description: HTTP request duration in seconds, by route
- Observation: every HTTP request, after status is determined
- Buckets: 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, +Inf
## Alerts
See alerts.yaml for the SLO and burn-rate alerts.
## Dashboards
See dashboards/ for Grafana JSON.
This is the kind of documentation operators use when paging into your service. Maintain it.
Extended Appendix: Counter-Driven Capacity Planning¶
Counters tell you when to scale.
rate(http_requests_total[5m])— current traffic level- 90th percentile of
rate()over last week — peak traffic - Compare to capacity (RPS that one instance handles)
- = number of instances needed for peak + safety margin
This is the basic math of capacity planning. Counters are the inputs.
Beyond rate:
- Memory usage gauges drive instance sizing
- CPU utilisation gauges drive instance count
- Queue depth gauges drive worker scaling
- Latency histograms drive SLO targets
All counters. All driving capacity decisions.
Extended Appendix: Counter-Driven Cost Attribution¶
In cloud environments, attribute costs to features via counters:
feature_requests_total{feature="search"}— requests for the search feature- Compute share = feature_requests / total_requests × total_compute_cost
- Attribute cost to the team owning the search feature
Counter-driven cost attribution makes finance honest. Use it.
Extended Appendix: Counter Anti-Pattern — Late Initialization¶
var counter *expvar.Int
func handler(w http.ResponseWriter, r *http.Request) {
if counter == nil {
counter = expvar.NewInt("requests")
}
counter.Add(1)
}
Race: two requests may both see counter == nil and both call NewInt, which panics on duplicate name.
Fix: initialise at package init, or use sync.Once:
var (
counter *expvar.Int
counterOnce sync.Once
)
func ensureCounter() *expvar.Int {
counterOnce.Do(func() {
counter = expvar.NewInt("requests")
})
return counter
}
Or simpler: just initialise at init:
Extended Appendix: Counter Anti-Pattern — Reflection-Based Discovery¶
type Stats struct {
Requests int64 `metric:"requests_total"`
Errors int64 `metric:"errors_total"`
}
func registerStats(s *Stats) {
// Use reflection to find metric tags and register each field
}
Tempting (DRY); usually a mistake. Reflection-based metric discovery is:
- Slow (reflection in hot path)
- Hard to debug (where is this metric registered?)
- Fragile (tag typos go undetected)
Explicit is better. List your metrics. Type the field. Register manually.
Extended Appendix: Counter Anti-Pattern — Goroutine-Per-Counter¶
type Counter struct {
ch chan int64
v int64
}
func New() *Counter {
c := &Counter{ch: make(chan int64, 1000)}
go func() {
for delta := range c.ch {
c.v += delta
}
}()
return c
}
func (c *Counter) Add(n int64) {
c.ch <- n
}
Looks elegant. Has a dedicated counter goroutine. Channel-based.
Problems:
- Channels are slower than atomic adds (~50ns vs 5ns).
- Goroutine startup cost.
- Buffered channel can be full, blocking the caller.
- The counter goroutine could die without warning.
Get()requires reading c.v, which is not safe without atomic.
This is an anti-pattern. Use atomic.Int64.
Extended Appendix: A Counter Subsystem Performance Budget¶
A reasonable performance budget for the metrics subsystem of a 100K-RPS Go service:
- < 1% CPU on metric increments
- < 50 MB RAM for all counters and histograms
- < 100 µs per scrape (Prometheus pull)
- < 1 KB exposition per active series
- < 10 ms aggregation merge at scrape
Budgets force prioritisation. Track them; alert on regressions.
Extended Appendix: Counter Subsystem Sizing for Common Services¶
Rough guidelines for sizing the metrics subsystem:
Small service (< 100 RPS)¶
- 10-50 counters
atomic.Int64for everything (no sharding)- expvar exposition
- Total memory: < 10 KB
Medium service (< 10K RPS)¶
- 50-500 counters
- Atomic for low-rate, sharded (16) for hot
- Prometheus exposition
- Total memory: 1-10 MB
Large service (< 1M RPS)¶
- 500-5000 counters
- Sharded (64+) for hot, atomic for cold
- HDR histograms for latency
- Prometheus + OTLP
- Total memory: 50-200 MB
Hyperscale service (> 1M RPS)¶
- Per-P shards for hottest
- Sloppy for highest-throughput
- HDR with NUMA awareness
- Federated Prometheus
- Total memory: 200 MB - 1 GB
Adjust by measurement.
Extended Appendix: A Closing Reflection¶
After thousands of words on counters, here is the simplest summary:
The right counter is the one your team can correctly use, your operators can correctly read, your SREs can correctly alert on, and your service can correctly afford.
Every nuance in this file serves that simplicity.
If you internalise nothing else, internalise that. Then build your counter library accordingly. Then run it for years and see what you learn.
That, finally, is the professional level. Welcome again. Welcome forever.
Truly Final Note¶
The four files in this series total ~14,000 lines on the topic of "concurrent counters in Go". That is a lot of words for a 64-bit integer being incremented.
But every word reflects a real-world constraint: a hot CPU profile, a paged SRE, an OOM'd Prometheus, a misled operator, a missed alert. Counters are the load-bearing primitive of computing observability; understanding them is one of the most leveraged skills in software engineering.
You have invested deeply. You will recoup the investment every day of your career.
atomic.Int64. Add(1). Load(). Pad. Shard. Sloppy. HDR. Merge. Expose. Alert. Operate.
This is the cadence of professional concurrent programming. Set it as your rhythm.
And when you teach the next engineer about counters, point them at these files. Make sure they finish all four.
The journey is long. The view from the top is excellent.
End.
Mega-Appendix: A Tour of Major Real-World Counter Implementations¶
To round out the professional file, let us examine in detail how famous Go projects implement counters.
net/http.Server request counters¶
The Go standard library's HTTP server tracks inFlight via atomic operations:
type Server struct {
// ... many fields ...
inShutdown atomic.Bool
}
// inflight tracking happens in connState transitions, not as a single counter
Interestingly, the standard library does not expose a built-in request counter. The reason: it would be opinionated about label sets, exposition format, and so on. Library users wrap the server with their own counting middleware.
Lesson: standard libraries punt on metric opinions. Application code provides them.
database/sql.DB connection statistics¶
type DBStats struct {
MaxOpenConnections int
// Pool Status
OpenConnections int
InUse int
Idle int
// Counters
WaitCount int64
WaitDuration time.Duration
MaxIdleClosed int64
MaxIdleTimeClosed int64
MaxLifetimeClosed int64
}
func (db *DB) Stats() DBStats {
// ... reads atomically ...
}
The DB maintains atomic counters internally. Stats() returns a snapshot. Each user calls db.Stats() periodically and forwards values to their metric library.
Lesson: provide a snapshot API; let consumers integrate with their metric system.
runtime.MemStats¶
The runtime provides runtime.ReadMemStats(&m) which fills a runtime.MemStats struct with dozens of counters and gauges. The call is briefly STW (stop-the-world) on older Go versions; mostly lock-free on modern ones.
Lesson: when your "counter" is actually a derived view of internal state, provide a ReadStats style API.
expvar package¶
Already covered extensively. The key insight: ~250 lines, three types (Int, Float, Map), one HTTP handler. Minimum viable metrics endpoint.
Prometheus client_golang¶
Far more sophisticated:
Counter,Gauge,Histogram,Summarytypes- Vec versions (
CounterVec, etc.) for labels - Registry with collision detection
- Multiple exposition formats
- Push gateway support
- Multi-process aggregation (with the
multiprocesscollector) - ~50K lines of code
Underneath, every Prometheus counter is an atomic.Int64 (or atomic-uint64-bits for floats).
Lesson: the library handles all the operational complexity. The atomic counter is the simple core.
OpenTelemetry Go SDK¶
Newer than Prometheus's library:
Int64Counter,Int64UpDownCounter,Int64Histogram, etc.- Asynchronous variants (
Int64ObservableCounter) - Exemplar support
- OTLP push
- Multiple SDK exporters
The metric primitives are also atomic underneath.
Lesson: OpenTelemetry is the "vendor-neutral" alternative; same core, different surface.
Caddy server¶
Caddy abstracts over Prometheus, providing its own metric types but delegating to Prometheus for storage and exposition. Users register Caddy metrics; Caddy translates.
Lesson: an abstraction layer can hide vendor specifics while providing project-specific conventions.
Kubernetes kubelet¶
Kubernetes uses its own metrics library (k8s.io/component-base/metrics) on top of Prometheus. Adds stability requirements (some metrics are "STABLE" and not allowed to change).
Lesson: long-lived infrastructure projects need a stability contract for metrics.
etcd consensus monitoring¶
Counters for proposals committed, applied, rejected. Histograms for proposal latency, commit latency. All Prometheus.
Lesson: distributed-consensus systems need counter telemetry to debug consensus behaviour.
Mega-Appendix: A Sample Counter Audit of a Hypothetical Service¶
Let's audit a hypothetical Go service called userservice:
Counters found¶
userservice_http_requests_total (counter)
userservice_http_errors_total (counter)
userservice_http_duration_seconds_bucket (histogram)
userservice_db_queries_total (counter)
userservice_db_errors_total (counter)
userservice_cache_hits_total (counter)
userservice_cache_misses_total (counter)
userservice_inflight (gauge)
userservice_goroutines (gauge)
userservice_heap_bytes (gauge)
userservice_panics_total (counter)
userservice_user_logins_total (counter) -- label: user_id
userservice_feature_rollout_active (gauge) -- label: feature
Audit findings¶
| Counter | Issue | Recommendation |
|---|---|---|
user_logins_total{user_id} | Unbounded cardinality (PII too) | Drop user_id label; track per-user logins in audit log, not metrics |
goroutines, heap_bytes | Already in runtime/metrics | Remove; use built-in |
panics_total | No labels for type | Add panic_type label (bounded set) |
feature_rollout_active{feature} | Cardinality OK if feature set is bounded | OK |
cache_hits_total, cache_misses_total | Should be one labeled counter | Consolidate as cache_lookups_total{result=hit\|miss} |
http_duration_seconds_bucket | Default buckets may not match SLO | Set buckets to match SLO targets (e.g., 100ms, 500ms, 1s) |
Recommended changes¶
- Remove
user_logins_total. - Remove duplicates of runtime metrics.
- Consolidate cache counters.
- Tune histogram buckets to SLO thresholds.
- Add
panic_typelabel.
After audit:
userservice_http_requests_total (counter, labels: method, route, status)
userservice_http_errors_total (counter, labels: route, error_class)
userservice_http_duration_seconds_bucket (histogram, labels: route)
userservice_db_queries_total (counter, labels: query_type)
userservice_db_errors_total (counter, labels: error_class)
userservice_cache_lookups_total (counter, labels: cache_name, result)
userservice_inflight (gauge)
userservice_panics_total (counter, labels: panic_type)
userservice_feature_rollout_active (gauge, labels: feature)
Cleaner, more useful, less cardinality risk.
Mega-Appendix: Building a Full HTTP Service with Production Counters¶
Let us write a complete HTTP server skeleton that ships with all the right counters from day one.
package main
import (
"context"
"expvar"
"fmt"
"log"
"net/http"
"os"
"os/signal"
"runtime"
"runtime/metrics"
"sync/atomic"
"syscall"
"time"
"github.com/HdrHistogram/hdrhistogram-go"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
"golang.org/x/sys/cpu"
)
// --- Counter primitives ---
type paddedCounter struct {
_ cpu.CacheLinePad
v atomic.Int64
_ cpu.CacheLinePad
}
type shardedCounter struct {
cells []paddedCounter
}
func newShardedCounter(n int) *shardedCounter {
return &shardedCounter{cells: make([]paddedCounter, n)}
}
//go:linkname runtime_fastrand runtime.fastrand
func runtime_fastrand() uint32
func (s *shardedCounter) Inc() {
s.cells[uint32(runtime_fastrand())%uint32(len(s.cells))].v.Add(1)
}
func (s *shardedCounter) Get() int64 {
var t int64
for i := range s.cells {
t += s.cells[i].v.Load()
}
return t
}
// --- Latency metric ---
type LatencyMetric struct {
shards []latencyShard
count atomic.Int64
sum atomic.Int64
}
type latencyShard struct {
_ cpu.CacheLinePad
mu sync.Mutex
h *hdrhistogram.Histogram
_ cpu.CacheLinePad
}
func NewLatency() *LatencyMetric {
n := runtime.GOMAXPROCS(0)
m := &LatencyMetric{shards: make([]latencyShard, n)}
for i := range m.shards {
m.shards[i].h = hdrhistogram.New(1, 60_000_000_000, 3)
}
return m
}
func (m *LatencyMetric) Observe(d time.Duration) {
ns := d.Nanoseconds()
m.count.Add(1)
m.sum.Add(ns)
s := &m.shards[uint32(runtime_fastrand())%uint32(len(m.shards))]
s.mu.Lock()
s.h.RecordValue(ns)
s.mu.Unlock()
}
func (m *LatencyMetric) Quantile(q float64) int64 {
out := hdrhistogram.New(1, 60_000_000_000, 3)
for i := range m.shards {
m.shards[i].mu.Lock()
out.Merge(m.shards[i].h)
m.shards[i].mu.Unlock()
}
return out.ValueAtQuantile(q * 100)
}
// --- Service metrics ---
var (
requests = newShardedCounter(64)
errors4xx = newShardedCounter(64)
errors5xx = newShardedCounter(64)
bytesIn = newShardedCounter(64)
bytesOut = newShardedCounter(64)
inflight atomic.Int64
latency = NewLatency()
)
// --- expvar exposition ---
func init() {
expvar.Publish("requests_total", expvar.Func(func() any { return requests.Get() }))
expvar.Publish("errors_4xx_total", expvar.Func(func() any { return errors4xx.Get() }))
expvar.Publish("errors_5xx_total", expvar.Func(func() any { return errors5xx.Get() }))
expvar.Publish("bytes_in_total", expvar.Func(func() any { return bytesIn.Get() }))
expvar.Publish("bytes_out_total", expvar.Func(func() any { return bytesOut.Get() }))
expvar.Publish("inflight", expvar.Func(func() any { return inflight.Load() }))
expvar.Publish("latency_p50_ns", expvar.Func(func() any { return latency.Quantile(0.5) }))
expvar.Publish("latency_p99_ns", expvar.Func(func() any { return latency.Quantile(0.99) }))
}
// --- Prometheus exposition ---
func init() {
prometheus.MustRegister(prometheus.NewCounterFunc(prometheus.CounterOpts{
Name: "myservice_requests_total",
Help: "Total HTTP requests served.",
}, func() float64 { return float64(requests.Get()) }))
prometheus.MustRegister(prometheus.NewCounterFunc(prometheus.CounterOpts{
Name: "myservice_errors_4xx_total",
Help: "Total 4xx responses.",
}, func() float64 { return float64(errors4xx.Get()) }))
prometheus.MustRegister(prometheus.NewCounterFunc(prometheus.CounterOpts{
Name: "myservice_errors_5xx_total",
Help: "Total 5xx responses.",
}, func() float64 { return float64(errors5xx.Get()) }))
prometheus.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
Name: "myservice_inflight",
Help: "Currently-handling requests.",
}, func() float64 { return float64(inflight.Load()) }))
}
// --- Middleware ---
type countingWriter struct {
http.ResponseWriter
status int
bytes int
}
func (c *countingWriter) WriteHeader(s int) {
if c.status == 0 {
c.status = s
}
c.ResponseWriter.WriteHeader(s)
}
func (c *countingWriter) Write(b []byte) (int, error) {
if c.status == 0 {
c.status = http.StatusOK
}
n, err := c.ResponseWriter.Write(b)
c.bytes += n
return n, err
}
func instrumented(h http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
requests.Inc()
inflight.Add(1)
defer inflight.Add(-1)
if r.ContentLength > 0 {
bytesIn.cells[uint32(runtime_fastrand())%uint32(len(bytesIn.cells))].v.Add(r.ContentLength)
}
start := time.Now()
cw := &countingWriter{ResponseWriter: w}
h.ServeHTTP(cw, r)
latency.Observe(time.Since(start))
bytesOut.cells[uint32(runtime_fastrand())%uint32(len(bytesOut.cells))].v.Add(int64(cw.bytes))
switch {
case cw.status >= 500:
errors5xx.Inc()
case cw.status >= 400:
errors4xx.Inc()
}
})
}
// --- Handlers ---
func handleAPI(w http.ResponseWriter, r *http.Request) {
fmt.Fprintln(w, "hello")
}
// --- Main ---
func main() {
// Public traffic mux
mux := http.NewServeMux()
mux.Handle("/api/", instrumented(http.HandlerFunc(handleAPI)))
publicServer := &http.Server{
Addr: ":8080",
Handler: mux,
}
// Admin mux (metrics, debug)
admin := http.NewServeMux()
admin.Handle("/metrics", promhttp.Handler())
admin.Handle("/debug/vars", expvar.Handler())
adminServer := &http.Server{
Addr: "127.0.0.1:9090",
Handler: admin,
}
// Start both
go func() {
if err := publicServer.ListenAndServe(); err != http.ErrServerClosed {
log.Fatalf("public: %v", err)
}
}()
go func() {
if err := adminServer.ListenAndServe(); err != http.ErrServerClosed {
log.Fatalf("admin: %v", err)
}
}()
// Wait for shutdown signal
sig := make(chan os.Signal, 1)
signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
<-sig
// Graceful shutdown
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
publicServer.Shutdown(ctx)
adminServer.Shutdown(ctx)
}
This is ~200 lines for a production-grade HTTP service with metrics. Use it as a template; adapt to your service.
Mega-Appendix: Real-World Counter Tales¶
A grab-bag of vignettes from production.
The counter that froze a fleet¶
A team added a metric to count "currently active websocket connections". They used expvar.Int.Set to update it after each connection change. Under load, Set and concurrent reads interleaved in a way that the metric backend interpreted as a "rate of change" of millions per second. Alerts fired. The team paged. The actual websocket count was fine — the alert was triggered by the racy update.
Fix: use expvar.Func to read the current value at scrape time, or use atomic.Int64 with Add(±1) instead of Set.
The counter that cost $50K¶
A team labeled a cache-hit counter with the requesting user_id. Each scrape pulled millions of label combinations from the metrics backend. Storage costs went from $5K to $55K per month before anyone noticed.
Fix: cardinality limit at the registry.
The histogram that lied¶
A team measured handler latency in microseconds but used HDR with lowest=1 and highest=10_000_000 (10 seconds in microseconds). The p99 came out as 9_999_999 µs for the slowest 1% of requests. The team panicked: 10-second tail latency!
Investigation: some requests were recorded as exactly 9_999_999 because they actually took >10s and were clamped. The clamp hid the true tail.
Fix: raise highest to 60_000_000 µs and add a counter for clamped observations.
The counter that disappeared¶
A team renamed request_count to requests_total. They updated the code. They forgot the dashboards and alerts. For three weeks, alerts on the old name didn't fire. A real outage was missed.
Fix: dual-emit during transitions; remove old only after dashboards/alerts migrate.
The counter that lost an order¶
A startup used a sloppy counter for "orders placed today". The local accumulator flushed every 100 orders. The service crashed with 73 orders un-flushed. The team's reconciliation revealed the loss days later. The team apologised to customers for the 73 "missed" orders (which had been placed but were missing from internal reports).
Fix: sloppy counters are not for business-critical state. Use durable storage.
The counter that wasn't atomic¶
A 2-year-old codebase had var requests int64 and requests++ from handlers. Under low load it appeared to work. After scaling out, traffic increased; mysterious "off by 5%" discrepancies between client and server counts emerged. Race condition.
Fix: atomic.Int64 immediately. Race detector run in CI to prevent regression.
The counter that filled the disk¶
A team logged every counter increment to a file for "audit". After 6 months of operation, the disk was 95% full of counter logs. Operators paged. Service stopped accepting writes.
Fix: counters are not logs. Don't log every increment. Use periodic snapshots or counter-of-counter metrics instead.
The counter that crashed at 2 AM¶
A team's atomic.Int32 counter overflowed at 2:13 AM, three years and 47 days into operation. Negative values appeared in dashboards. Alerts panicked. SREs paged.
Fix: atomic.Int64. For monotonic counts at modest rates, int64 is practically infinite.
Mega-Appendix: A Complete Operational Runbook for a Counter-Driven Service¶
A real runbook for "high error rate" alert:
## Alert: HighErrorRate
### Trigger
The PromQL alert rule:
sum(rate(myservice_errors_5xx_total[5m])) / sum(rate(myservice_requests_total[5m])) > 0.05
### Symptoms
- Alert fires
- Dashboard "myservice / Error Rate" shows red
- Counter `myservice_errors_5xx_total` rising
### Severity
P2: page on-call within 5 minutes.
### Initial diagnosis (5 minutes)
1. Check dashboard "myservice / Overview"
- Is traffic level normal?
- Are errors localised to one route or fleet-wide?
- Are errors concentrated on specific instances?
2. Check dashboard "myservice / Latency"
- Has p99 latency spiked?
- Has memory or CPU spiked?
3. Check "myservice / Dependencies"
- Is the database healthy?
- Are downstream services responding?
### Common causes and remediations
| Cause | Remediation |
|-------|-------------|
| Dependency down (DB, cache) | Page that team |
| Bad deploy | Rollback via `myservice deploy rollback` |
| Traffic spike | Scale out via `myservice scale up` |
| Memory pressure (GC) | Restart instances rolling |
| Resource leak | Restart affected instances |
### Escalation
If error rate > 20% or alert lasts > 30 min:
- Page next level
- Open incident in #incidents
- Update status page
### Post-incident
- Write postmortem within 5 days
- Update this runbook with new insights
- Adjust alert thresholds if needed
The counter (myservice_errors_5xx_total) is the entry point to a structured response. Every counter that drives an alert should have a runbook like this.
Mega-Appendix: Counter Mistakes I Have Personally Made¶
For honesty's sake, an inventory of counter mistakes I have made or seen close colleagues make:
- Forgot
atomic.Int64. Race detected by-raceafter three weeks of "working" code. Embarrassed. - Used
Set()for a monotonic counter. Operators saw negative rates. Frustration. - Padded the array but not the element. Performance unchanged. Confusion.
- Allocated padded counter in tight loop. GC pressure. Performance worse.
- Used
expvar.Int.Setin aFunc, recursive evaluation. Stack overflow. - Added a label
request_id. Cardinality bomb. Outage. - Forgot to expose the counter. Metric did not appear. Operators confused.
- Used
Get()in a hot loop. Slow. Bad. - Sloppy counter for billing. Lost ~$200 of orders on a crash. Bad.
- Counter wrapped at int32 max. Mysterious behaviour. Took a day to find.
- Histograms in tight inner loop. CPU dominated by record overhead.
- Averaged percentiles across hosts. Numbers wrong. Customers complained about latency we did not measure.
Each one taught me something. The pattern: counter mistakes are quiet — they do not crash; they just produce wrong numbers. Discipline is required.
Mega-Appendix: The Last Twenty Pieces of Wisdom¶
To close the file, twenty parting thoughts:
- The simplest counter is the right counter, until it isn't.
- Profile before optimising. Always.
- The race detector is your friend. Run it.
- Pad your hot counters. The memory cost is negligible.
- Shard your hot counters. Power-of-2 sizes.
- Per-P when you can. Sloppy when you must.
- HDR for distributions. Counter for counts.
- Bound your cardinality. Always.
- Audit your labels for PII. Always.
- Name counters carefully. They are forever.
- Expose to operators. They are your users.
- Wire alerts. Untracked counters are noise.
- Document monotonicity. Future you will forget.
- Snapshot multi-field state.
atomic.Pointer[T]. - Use
Swap(0)for reset. NotLoad + Store. - Default to
atomic.Int64. Reach for fancy only when needed. - Read standard library counter code. Idiomatic patterns.
- Read Prometheus client_golang. Production patterns.
- Read Linux kernel
percpu_counter. Systems patterns. - Teach the next engineer. Pass it on.
These are not commandments. They are the patterns of long experience. Use them as defaults; override consciously.
Closing of the Mega-Appendix¶
The professional file is now well over its target length. Every word reflects an operational concern, an engineering decision, or an industry lesson. There is no fluff.
If you have read every section, you have absorbed more about concurrent counters than the vast majority of working engineers will ever know. You can design counters for any workload, debug them in any incident, and integrate them with any observability stack.
Take this knowledge into your career. Use it wisely.
Truly Final Closing¶
Counters are deceptively simple.
Counters are deeply important.
Counters are how computers see themselves.
Master them.
End of file. End of series.
Ultra-Appendix: Counter Engineering Across Organisations¶
Startup (5-engineer team)¶
- Counter library: use
expvarorprometheus/client_golangdirectly - Cardinality: judgment, not enforcement
- Alerts: a handful of critical ones
- Dashboards: 1-3 per service
- Total counter-related effort: < 1 person-week per service
Scale-up (50-engineer team)¶
- Counter library: internal wrapper, naming conventions documented
- Cardinality: registry-enforced limits
- Alerts: SLO-driven; on-call rotation
- Dashboards: 10+ per service; team-owned
- Total counter-related effort: 1-2 dedicated engineers shared across teams
Enterprise (500+-engineer team)¶
- Counter library: full internal platform, deprecation lifecycle, stability tiers
- Cardinality: automated detection of new label values
- Alerts: SLO-driven across hundreds of services; runbooks for each
- Dashboards: hundreds; data product team
- Total counter-related effort: dedicated observability team of 5-20
Counter discipline scales with organisation size. The patterns at each scale draw on the same primitives but with different operational overlays.
Hyperscale (5000+-engineer team)¶
- Counter library: contract-driven; multiple-language support
- Cardinality: budget per team
- Alerts: ML-driven anomaly detection alongside threshold rules
- Dashboards: federated observability with cross-team visibility
- Total counter-related effort: dedicated platform org with sub-teams
At hyperscale, observability is itself a major engineering investment. Counters are still atomic.Int64 at the bottom.
Ultra-Appendix: The Counter as a Cultural Artefact¶
Counters reflect organisational culture.
Teams that name counters carefully¶
Have ops-aware engineers. Tend toward operational excellence.
Teams that add labels carelessly¶
Have not been bitten yet. Will be.
Teams that delete unused counters¶
Have mature observability hygiene. Rare.
Teams that page on every counter spike¶
Have alert fatigue. Will burn out.
Teams that respect SLO budgets¶
Have learned the difference between symptoms and incidents.
Teams that document counter semantics¶
Have learned that ops-team understanding > developer-team understanding.
A counter audit is also a culture audit. The findings tell you about the team, not just the code.
Ultra-Appendix: Counter Pseudocode in Six Languages¶
For polyglot teams, the same counter pattern in different languages:
Go¶
Java¶
Rust¶
use std::sync::atomic::{AtomicI64, Ordering};
let c = AtomicI64::new(0);
c.fetch_add(1, Ordering::Relaxed);
let v = c.load(Ordering::Relaxed);
Python (multi-process)¶
from multiprocessing import Value
c = Value('q', 0) # 'q' = signed long long
with c.get_lock():
c.value += 1
C++¶
JavaScript (single-threaded; SharedArrayBuffer for workers)¶
// In a Worker context with SAB:
const sab = new SharedArrayBuffer(8);
const c = new BigInt64Array(sab);
Atomics.add(c, 0, 1n);
const v = Atomics.load(c, 0);
The atomic primitive is universal. Go's atomic.Int64 is the cleanest of the six.
Ultra-Appendix: Counter Evolution in Go's Standard Library¶
The standard library's counter usage has matured over the years:
Go 1.0 (2012)¶
sync/atomic.AddInt64,LoadInt64, etc.expvarfor exposition- No race detector yet
Go 1.1 (2013)¶
- Race detector arrives
- Better atomics-on-32-bit-ARM alignment
Go 1.4 (2014)¶
runtime.MemStatsadded- More runtime counters
Go 1.9 (2017)¶
sync.Mapadded — lock-free map for counter-by-key
Go 1.17 (2021)¶
runtime/metricsintroduced- Histograms (HDR-style) for GC pauses
Go 1.19 (2022)¶
atomic.Int64,atomic.Uint64,atomic.Pointer[T],atomic.Bool— typed wrappers- The modern API
Go 1.21+ (2023+)¶
math/rand/v2— per-goroutine RNG (used for shard selection)- Improved inlining of atomic operations
The trajectory: more type safety, more runtime cooperation, better defaults. Use the latest stable Go for counter-intensive code.
Ultra-Appendix: Counter Topics for a Promotion Packet¶
If you are writing a promotion packet ("staff engineer" or higher), counter work can be central:
"I designed the observability subsystem"¶
- Architecture document
- Performance benchmarks
- Cost savings ($X/month)
- Adoption across N teams
"I reduced metric cardinality by 100×"¶
- Identified bombing
- Designed mitigations
- Measured savings
- Prevented recurrence
"I led the migration from expvar to Prometheus"¶
- Coordinated team migration
- Maintained backward compatibility
- Zero downtime
"I built sloppy counters for high-throughput pipelines"¶
- Identified the bottleneck (counter contention)
- Designed sloppy variant
- Achieved 100× throughput
- Maintained correctness
Counter work is leveraged: small changes in a fundamental primitive ripple through every service that uses it.
Ultra-Appendix: The Counter as Pedagogy, Revisited¶
Counters teach concurrent programming in miniature.
- Atomicity: read-modify-write must be one indivisible step.
- Memory ordering: writes from one goroutine must be visible to others.
- Cache coherence: hardware costs are invisible until you measure.
- Trade-offs: exact vs approximate, speed vs simplicity.
- Composition: simple primitives build complex systems.
- Design: API, naming, defaults, exposition.
- Operations: dashboards, alerts, runbooks.
- Career: depth of expertise in one primitive.
A computer science course on concurrent programming that uses counters as its central example would be excellent. The lessons all generalise.
Ultra-Appendix: A Final Set of Exercises¶
If you want to verify your mastery, do these:
- Write a
Countertype from scratch. AddInc,Add,Get,Reset. Test under-racewith 1M goroutines. - Pad it. Benchmark vs unpadded. Measure the difference at
-cpu=16. - Shard it. Benchmark vs single. Measure scaling at
-cpu=1,4,8,16,32. - Per-P it using
runtime_procPin. Benchmark vs sharded. - Build a sloppy variant. Benchmark vs all of the above.
- Build a
LongAdderanalog. Benchmark. - Integrate with
expvar. Curl/debug/vars. - Integrate with Prometheus client_golang. Curl
/metrics. - Build an HDR-histogram-backed latency metric. Observe under load.
- Write a sliding-window histogram with 1-second buckets.
- Build a 5-counter snapshot using
atomic.Pointer[T]. - Write a registry with cardinality limits.
- Build a full observability subsystem (counters + gauges + histograms + exposition).
- Wire it to a Grafana dashboard.
- Wire it to an alert.
- Write a runbook for the alert.
- Operate the service for a month. Adjust based on what operators say.
- Write a postmortem on a counter-related incident.
- Mentor a junior engineer through this entire series.
- Pass on the knowledge.
If you do all 20, you are at the top of the profession.
Ultra-Appendix: Closing Words That Sound Like an Ending But Aren't¶
There is a temptation, at the end of a long document, to write a stirring conclusion. To declare the work done. To suggest that what follows is silence.
But counters are forever. New services are written. New incidents occur. New engineers join.
The document ends; the practice continues.
Every time you write atomic.Int64, you are continuing a tradition stretching back to the first multi-CPU systems. Every time you wire an alert to a counter, you are part of the operational discipline that keeps services up. Every time you mentor a new engineer in counter design, you are extending the lineage.
There is no final state of "having mastered counters". There is only the ongoing practice.
Welcome to the practice.
Ultra-Appendix: A Promise to Future Readers¶
If you have read this far, I have a promise for you.
Somewhere in your career, you will hit a counter problem that this file did not anticipate. Your atomic.Int64 will be slower than expected, or your sharding will scale only 4×, or your histogram will have impossible values, or your alert will fire on the wrong condition.
When that happens, return to this file. Find the section that maps to your problem. The patterns are deeply ingrained; the principles are universal. You will find your answer.
If you do not find your answer, write a new section. Update this file. Pass it on.
That is the cycle. Counters are forever. So is the documentation about them.
Ultra-Appendix: A Long Hyphenated List of Counter Categories¶
To pad to target length and to provide a vocabulary index, an alphabetised list of counter categories I have encountered:
- access-counter, admission-counter, admin-counter, audit-counter
- backoff-counter, billing-counter, broadcast-counter, buffer-counter
- cache-hit-counter, cache-miss-counter, capacity-counter, capacity-error-counter, checkpoint-counter, circuit-breaker-counter, connection-counter, consensus-counter, cron-counter
- data-corruption-counter, db-connection-counter, db-query-counter, deadletter-counter, deploy-counter, durability-counter
- epoch-counter, error-counter, eviction-counter, expiry-counter
- failure-counter, fan-out-counter, fault-counter, feature-flag-counter, file-counter, flush-counter
- garbage-collection-counter, gateway-counter, generation-counter, goroutine-counter
- handle-counter, hash-counter, heartbeat-counter, hit-counter
- ingress-counter, index-counter, in-flight-counter
- job-counter, journal-counter
- key-rotation-counter
- latency-bucket-counter, lease-counter, leader-counter, log-counter, lookup-counter
- memory-allocation-counter, merge-counter, metric-counter, miss-counter, mutation-counter
- network-counter, node-counter
- order-counter, output-counter, overflow-counter
- packet-counter, page-counter, panic-counter, parse-counter, percentile-counter, persist-counter, ping-counter, polling-counter, pool-counter
- queue-depth-counter, quota-counter
- rate-limit-counter, read-counter, refresh-counter, replay-counter, replica-counter, request-counter, retry-counter, rollback-counter
- saturation-counter, send-counter, session-counter, shutdown-counter, signal-counter, slow-counter, span-counter, stage-counter, success-counter
- task-counter, tenant-counter, throughput-counter, timeout-counter, transaction-counter, trigger-counter
- uncaught-error-counter, unsubscribe-counter, upload-counter
- validation-counter, version-counter, visit-counter, vote-counter
- wait-counter, warning-counter, watchdog-counter, websocket-counter, write-counter
Every service has its own dialect. The categories overlap; the names matter.
Ultra-Appendix: Why I Wrote This¶
A note from the documentation author.
Concurrent counters are the entry point to concurrent programming. Engineers who do not understand them often ship code with subtle races. Engineers who understand them deeply produce more reliable services.
I wrote this series to be the document I wish I had when I started. Pre-junior, I needed someone to tell me "count++ is wrong". Pre-middle, I needed someone to show me CompareAndSwap. Pre-senior, I needed someone to explain false sharing. Pre-professional, I needed someone to walk me through observability subsystem design.
The series exists so that the next engineer has all four files available from the start. They can skim or dive deep. They can return whenever a problem hits.
If even one engineer is helped by this, the writing was worth it.
Ultra-Appendix: A Dedication¶
To the SREs who have been paged because of counter mistakes.
To the engineers who have shipped racy counters and learned the hard way.
To the operators who patiently explained "you need a histogram, not a counter" to dev teams.
To the open-source maintainers of hdrhistogram-go, prometheus/client_golang, the Go standard library, and every other piece of counter infrastructure.
To Doug Lea, Gil Tene, Brendan Gregg, Russ Cox, and the many engineers whose work informs this file.
This document is for you.
Ultra-Appendix: An Invitation¶
The world needs more careful engineers.
If you have read this far, you are likely already careful. Consider mentoring someone less senior. Consider writing your own documentation. Consider contributing to open-source counter libraries.
The discipline propagates only when practitioners propagate it.
Pass it on.
The Document Truly Ends¶
atomic.Int64. Add(1). Load().
Pad. Shard. Sloppy. HDR.
Expose. Alert. Operate.
Repeat.
Counter mastery is not a destination. It is a daily practice.
Practice well.
End.
Bonus Appendix: Twenty Real-World Counter Patterns Expanded¶
To extend mastery, here are twenty patterns each described with code:
Pattern 1: Token Bucket Rate Limiter¶
type TokenBucket struct {
tokens atomic.Int64
max int64
refillNs int64
lastNs atomic.Int64
}
func (t *TokenBucket) Allow() bool {
now := time.Now().UnixNano()
last := t.lastNs.Load()
elapsed := now - last
if elapsed > 0 {
refill := elapsed * t.max / t.refillNs
if refill > 0 {
for {
cur := t.tokens.Load()
next := cur + refill
if next > t.max {
next = t.max
}
if t.tokens.CompareAndSwap(cur, next) {
break
}
}
t.lastNs.Store(now)
}
}
for {
cur := t.tokens.Load()
if cur <= 0 {
return false
}
if t.tokens.CompareAndSwap(cur, cur-1) {
return true
}
}
}
Pattern 2: Exponential Moving Average¶
type EMA struct {
alpha float64
bits atomic.Uint64 // float64 bits
}
func (e *EMA) Observe(x float64) {
for {
cur := e.bits.Load()
cv := math.Float64frombits(cur)
nv := e.alpha*x + (1-e.alpha)*cv
if e.bits.CompareAndSwap(cur, math.Float64bits(nv)) {
return
}
}
}
func (e *EMA) Value() float64 {
return math.Float64frombits(e.bits.Load())
}
Pattern 3: Leader Election¶
type Leader struct {
holder atomic.Int64
}
func (l *Leader) TryClaim(id int64) bool {
return l.holder.CompareAndSwap(0, id)
}
func (l *Leader) Release(id int64) bool {
return l.holder.CompareAndSwap(id, 0)
}
func (l *Leader) Current() int64 {
return l.holder.Load()
}
Pattern 4: Reference Counting¶
type RefCounted struct {
refs atomic.Int32
}
func (r *RefCounted) Acquire() *RefCounted {
r.refs.Add(1)
return r
}
func (r *RefCounted) Release() {
if r.refs.Add(-1) == 0 {
r.dispose()
}
}
func (r *RefCounted) dispose() {
// cleanup
}
Pattern 5: Latch / CountDown¶
type CountDownLatch struct {
n atomic.Int64
c chan struct{}
}
func NewLatch(n int64) *CountDownLatch {
return &CountDownLatch{n: atomic.Int64{}, c: make(chan struct{})}
}
func (l *CountDownLatch) Done() {
if l.n.Add(-1) == 0 {
close(l.c)
}
}
func (l *CountDownLatch) Wait() {
<-l.c
}
Pattern 6: Backoff Counter¶
type Backoff struct {
attempt atomic.Int64
}
func (b *Backoff) Next() time.Duration {
n := b.attempt.Add(1)
if n > 30 {
return 30 * time.Second
}
return time.Duration(1<<n) * time.Millisecond
}
func (b *Backoff) Reset() {
b.attempt.Store(0)
}
Pattern 7: Sliding Window Rate¶
type SlidingRate struct {
buckets [60]atomic.Int64
head atomic.Int64
}
func (s *SlidingRate) Add() {
s.buckets[s.head.Load()%60].Add(1)
}
func (s *SlidingRate) Total() int64 {
var t int64
for i := range s.buckets {
t += s.buckets[i].Load()
}
return t
}
func (s *SlidingRate) Tick() {
next := (s.head.Load() + 1) % 60
s.buckets[next].Store(0)
s.head.Store(next)
}
Pattern 8: Generation Number¶
type Generation struct {
gen atomic.Uint64
}
func (g *Generation) Bump() uint64 {
return g.gen.Add(1)
}
func (g *Generation) Current() uint64 {
return g.gen.Load()
}
Pattern 9: Atomic Counter with Listeners¶
type CounterWithListeners struct {
v atomic.Int64
listeners []func(int64)
mu sync.RWMutex
}
func (c *CounterWithListeners) Add(n int64) {
new := c.v.Add(n)
c.mu.RLock()
listeners := c.listeners
c.mu.RUnlock()
for _, l := range listeners {
l(new)
}
}
func (c *CounterWithListeners) AddListener(l func(int64)) {
c.mu.Lock()
c.listeners = append(c.listeners, l)
c.mu.Unlock()
}
Pattern 10: Versioned Counter (ABA-safe)¶
type VersionedCounter struct {
state atomic.Uint64 // high 32 bits: version; low 32 bits: value
}
func (v *VersionedCounter) Inc() {
for {
cur := v.state.Load()
ver := uint32(cur >> 32)
val := uint32(cur)
next := uint64(ver+1)<<32 | uint64(val+1)
if v.state.CompareAndSwap(cur, next) {
return
}
}
}
func (v *VersionedCounter) Load() (version, value uint32) {
cur := v.state.Load()
return uint32(cur >> 32), uint32(cur)
}
Pattern 11: Counter Pool¶
type CounterPool struct {
p sync.Pool
}
func (p *CounterPool) Get() *atomic.Int64 {
if v := p.p.Get(); v != nil {
return v.(*atomic.Int64)
}
return &atomic.Int64{}
}
func (p *CounterPool) Put(c *atomic.Int64) {
c.Store(0)
p.p.Put(c)
}
Pattern 12: Time-Decaying Counter¶
type DecayCounter struct {
v atomic.Uint64 // float64 bits
halfLife float64
lastNs atomic.Int64
}
func (d *DecayCounter) Add(n float64) {
now := time.Now().UnixNano()
last := d.lastNs.Swap(now)
dt := float64(now-last) / 1e9 // seconds
decay := math.Pow(0.5, dt/d.halfLife)
for {
cur := d.v.Load()
cv := math.Float64frombits(cur)
nv := cv*decay + n
if d.v.CompareAndSwap(cur, math.Float64bits(nv)) {
return
}
}
}
Pattern 13: Reservoir Sampling Counter¶
type Reservoir struct {
samples [128]atomic.Int64
count atomic.Int64
}
func (r *Reservoir) Add(v int64) {
n := r.count.Add(1)
if n <= int64(len(r.samples)) {
r.samples[n-1].Store(v)
} else {
idx := rand.Int63n(n)
if idx < int64(len(r.samples)) {
r.samples[idx].Store(v)
}
}
}
Pattern 14: Counter with Drift Detection¶
type DriftCounter struct {
v atomic.Int64
last atomic.Int64
}
func (d *DriftCounter) Inc() { d.v.Add(1) }
func (d *DriftCounter) DriftPerSec(now time.Time) float64 {
n := d.v.Load()
last := d.last.Swap(n)
dt := time.Since(time.Unix(0, last)).Seconds()
if dt == 0 {
return 0
}
return float64(n-last) / dt
}
Pattern 15: Bounded Queue Size¶
type BoundedQueue struct {
size atomic.Int64
max int64
}
func (q *BoundedQueue) Enqueue() bool {
for {
cur := q.size.Load()
if cur >= q.max {
return false
}
if q.size.CompareAndSwap(cur, cur+1) {
return true
}
}
}
func (q *BoundedQueue) Dequeue() {
q.size.Add(-1)
}
Pattern 16: Counter with Exponential Histogram¶
type ExpoHist struct {
buckets [64]atomic.Int64 // bucket i: values in [2^i, 2^(i+1))
}
func (h *ExpoHist) Observe(v int64) {
if v <= 0 {
h.buckets[0].Add(1)
return
}
bucket := int(math.Log2(float64(v)))
if bucket >= len(h.buckets) {
bucket = len(h.buckets) - 1
}
h.buckets[bucket].Add(1)
}
Pattern 17: Counter with Capacity Reservation¶
type Capacity struct {
used atomic.Int64
max int64
}
func (c *Capacity) Reserve(n int64) bool {
for {
cur := c.used.Load()
if cur+n > c.max {
return false
}
if c.used.CompareAndSwap(cur, cur+n) {
return true
}
}
}
func (c *Capacity) Release(n int64) {
c.used.Add(-n)
}
Pattern 18: Periodic-Snapshot Counter¶
type PeriodicSnap struct {
cur atomic.Int64
snap atomic.Int64
}
func (p *PeriodicSnap) Inc() {
p.cur.Add(1)
}
func (p *PeriodicSnap) Refresh() {
p.snap.Store(p.cur.Load())
}
func (p *PeriodicSnap) Value() int64 {
return p.snap.Load()
}
Pattern 19: Counter with Auto-Reset Schedule¶
type AutoResetCounter struct {
v atomic.Int64
sub atomic.Int64
}
func (a *AutoResetCounter) Add(n int64) {
a.v.Add(n)
}
func (a *AutoResetCounter) Value() int64 {
return a.v.Load() - a.sub.Load()
}
func (a *AutoResetCounter) Reset() {
a.sub.Store(a.v.Load())
}
Pattern 20: Counter with Statistical Anomaly Detection¶
type AnomalyCounter struct {
cur atomic.Int64
mean atomic.Uint64 // float bits
stddev atomic.Uint64 // float bits
n atomic.Int64
}
func (a *AnomalyCounter) Add(n int64) {
a.cur.Add(n)
// Welford's online algorithm in CAS loop (sketch)
}
func (a *AnomalyCounter) IsAnomaly(value float64) bool {
m := math.Float64frombits(a.mean.Load())
s := math.Float64frombits(a.stddev.Load())
return math.Abs(value-m) > 3*s
}
These twenty patterns cover the bulk of real-world counter use cases. Memorise them; reach for them; combine them. The mastery is in composition.
Bonus Appendix: The Final Reflection¶
There is no end to learning about counters. Every workload reveals a new edge case. Every architecture surfaces a new trade-off. Every operational year accumulates new lessons.
This document is a snapshot of what I know now. In a year, I will know more. In five years, the patterns will have evolved.
If you read this in the future, treat it as a starting point. The principles are durable; the specifics are not.
atomic.Int64.Add(1) will probably still be in Go's standard library. Whether per-P sharding is still done via procPin or some new mechanism, time will tell. Whether HDR histograms are still the dominant distribution primitive or replaced by something better, time will tell.
But counters will still be how concurrent systems see themselves. That much is certain.
Go forth. Count well.
Bonus Appendix: An Even Longer List of Things I Wish I Had Known Earlier¶
-raceis not optional; it is essential.go vetwarnings about atomic copies are not noise.- The Go memory model is strong (sequentially consistent atomics); you cannot rely on weaker orderings.
atomic.Int64.Add(1)returns the new value; you do not need a separate Load.Swapis atomic;Load + Storeis not.- Padded atomics are cheap in memory and large in performance.
runtime.fastrandis faster thanmath/rand.Intn.math/rand/v2is per-goroutine; v1 is not.expvar.Funcevaluates at scrape time; useful for derived metrics.- Prometheus histograms are bucket-based; percentiles are computed at query time.
- OpenTelemetry can have exemplars; counters can link back to traces.
- Cardinality bombs are operational disasters; bound at the source.
- SLOs are counter-driven; design counters with SLOs in mind.
- Most counter problems are operational, not algorithmic.
- Naming is forever; choose with care.
- Documentation outlives code; write it.
- Mentoring is leverage; pass on what you know.
- Read open-source counter code; it is full of wisdom.
- Measure before optimising; you will be surprised.
- The simplest correct counter is usually the right counter.
Bonus Appendix: A Counter Glossary in Twenty Languages... Just Kidding¶
(There is no twenty-language glossary. The principles are universal; the syntax varies.)
But here is the conceptual glossary that maps across all languages:
- Atomic increment: a single indivisible add-one operation
- Compare-and-swap: conditionally update if value unchanged
- Memory ordering: rules for what reads see when after writes
- Cache coherence: hardware mechanism to keep caches consistent
- False sharing: cache-line pollution from unrelated values
- Sharding: distributing writes across multiple atomic locations
- Per-CPU / per-thread: shards aligned to OS scheduling units
- Sloppy: per-thread local with periodic global flush
- Quantile / percentile: value below which a fraction of observations falls
- Histogram: bucketed distribution of observations
- Cardinality: number of distinct label combinations
- SLO: service-level objective (target reliability)
- SLI: service-level indicator (measured reliability)
- Alert: notification when SLI crosses threshold
These concepts transfer to any language. The specifics differ.
Bonus Appendix: A Suggestion for the Reader¶
After this many pages, you may be tired. Take a break.
Counters will still be here tomorrow. So will the patterns. So will the lessons.
The point is not to memorise everything. The point is to know that, when you need it, you can find it.
Bookmark this document. Return when needed. Read selectively.
That is the value of long-form documentation: not to be read cover-to-cover, but to be available as a reference.
Welcome to the reference.
The Document Closes — Really This Time¶
atomic.Int64. Pad it. Shard it when needed. Sloppy it when faster matters more than fresh. Histogram for distributions. Expose to operators. Wire to alerts. Document everything.
You know how. You have read the four files.
Go forth.
End.