Steady-State — Professional¶
Table of Contents¶
- Introduction
- War story 1 — The six-day OOM
- War story 2 — The queue creeping at 0.1 percent per hour
- War story 3 — The pool that fragmented under failover
- War story 4 — The deploy window that grew tail latency
- War story 5 — The FD leak that took twenty-eight days to surface
- GOMEMLIMIT and cgroup integration
- Background GC pacing
- runtime/metrics pipelines
- Per-shard steady-state separation
- The runbook
- Closing notes
Introduction¶
This page is built from incidents. The lessons here are not theoretical; each section names a real failure mode, walks through what we saw on the dashboards, what we got wrong on first diagnosis, and what the fix looked like. The names of services and companies are abstracted; the engineering is not. If you have not yet been on call for a Go service in production, read this page anyway — the shape of the diagnosis is the same whether you are debugging a six-replica side-project or a six-thousand-replica payment processor.
The recurring theme is time. Production hides its bugs in time. A unit test runs for one second; a load test for an hour; production for years. Each order of magnitude reveals a new class of failure. Steady-state engineering is the discipline of designing for the largest order of magnitude from the start — because once your service is in production, the time-revealed failures will find you anyway.
War story 1 — The six-day OOM¶
The system¶
A media-streaming service receives webhooks from an upstream CDN for every video segment uploaded by every customer. Each webhook is small (a few kilobytes). The service decodes the JSON, looks up the customer, applies access rules, and writes a row to a Postgres table. Throughput: about three thousand webhooks per second at peak, average two thousand.
The incident¶
The service is deployed Tuesday afternoon. By Friday it is humming along, dashboards green. On Monday morning, around 6 a.m., a single pod (out of twelve) is killed by the OOM-killer. By 10 a.m. three more have been killed. By noon the rest are following a clear pattern: every pod's resident set is growing at about thirty megabytes per hour, and they crash when they cross the four-gigabyte container limit.
First diagnosis (wrong)¶
The team's first guess is the JSON decoder. The webhook payload contains a customer-supplied URL field; one engineer suggests an unbounded string is being kept in a cache somewhere. Heap snapshots are taken at 11 a.m. and noon (one hour apart). go tool pprof -base 11.pb.gz 12.pb.gz shows the diff.
The diff is unhelpful at first. The top consumers are database/sql connection-related allocations and encoding/json buffer reuse — both expected, neither growing in a way that explains the leak.
Second diagnosis (closer)¶
The on-call lead pulls a goroutine profile. Goroutine count is supposedly stable, but the profile shows about ten thousand goroutines parked in the same select statement, which waits on a receive from pubSub.Events and on ctx.Done().
This is in a per-customer "subscriber" goroutine — one is spawned the first time a customer ID appears in the day. The ctx is the request context. When the request finishes, the goroutine is supposed to exit because ctx.Done() is closed.
But pubSub.Events is a buffered channel shared across all subscribers. When a customer's subscriber is parked on the receive, and the request context is cancelled, the select should pick the Done branch. It does. Most of the time.
Sometimes, however, an event arrives just before the cancel, and the goroutine reads it, calls process(ev), and only then returns to the select. At that point the request handler has long since exited — but the subscriber is still in process(ev). Each event holds references to the customer record, the event payload, and a pgx connection. The goroutine's per-G stack is also still allocated.
In normal operation, this is fine: process(ev) finishes in milliseconds, the goroutine returns to select, finds Done is closed, and exits. But under sustained traffic, a slow downstream creates back-pressure inside process(ev). The goroutine sits there for seconds. During those seconds, new requests for the same customer create new subscriber goroutines, because the lookup map's "is there a subscriber?" check returns false (the subscriber map has already been cleaned up by the request-finishing path).
Now there are two subscribers for one customer, both still alive. Over six days, this accumulates. At 2 KiB or more per subscriber goroutine stack (the runtime's minimum stack size), plus pinned per-customer state, you get the thirty-megabytes-per-hour drift.
The fix¶
Two changes, neither flashy:
- Subscribers are no longer per-request. They live for the lifetime of the process. A worker pool of fifty consumers drains the shared event channel. Per-customer state moved into a bounded LRU map.
- The select was changed to drain the channel after Done:
for {
	select {
	case ev := <-pubSub.Events:
		process(ev)
	case <-ctx.Done():
		// drain any in-flight to avoid orphaning
		for {
			select {
			case ev := <-pubSub.Events:
				process(ev)
			default:
				return
			}
		}
	}
}
But this was a defensive layer; the structural fix was decoupling subscriber lifetime from request lifetime.
Lessons¶
- Per-request goroutines are a steady-state hazard. Any time the goroutine count is "proportional to traffic in flight," and not "proportional to fixed pool size," you are vulnerable.
- Heap diffs are not enough. The leak was not in the heap; it was in the goroutine stacks and the closed-over state. runtime.NumGoroutine told the truer story.
- Six days is short. Some leaks take two weeks. Some take a month. The longer the window, the larger the multiplier on small per-second drift.
War story 2 — The queue creeping at 0.1 percent per hour¶
The system¶
A search-indexing pipeline consumes documents from Kafka, runs a CPU-heavy enrichment step, and writes to an inverted index. The consumer has a worker pool of sixteen goroutines; the queue between Kafka and the worker pool is a buffered channel of capacity 256.
The incident¶
The queue depth metric is normally near zero (workers keep up easily). Two weeks after a deploy, the queue depth has settled at an average of three. After three weeks, four. After four weeks, six. By six weeks it is twelve. The average is creeping.
There is no alert; the absolute number is far below capacity. But the on-call lead notices the trend on a dashboard and starts asking questions.
Diagnosis¶
This is not a leak. It is capacity creep: the arrival rate is closing in on the workers' maximum throughput. The cause turns out to be a slow regression in the enrichment step: a new feature added one extra database lookup per document, which raised the median processing time from twenty milliseconds to twenty-two. With sixteen workers at 22 ms per document, the steady-state throughput is 16 / 0.022 ≈ 727 documents per second. Arrival rate had grown to about 720 per second through six weeks of organic traffic growth. The gap is closing.
The pinch point¶
By Little's Law, the average queue depth is arrival_rate * average_wait_time. The average wait time grew because the workers were getting closer to saturation. The depth was a signal, not the cause.
The fix¶
Three changes:
- Cache the new database lookup. Median processing time dropped back to twenty milliseconds; throughput restored to eight hundred per second.
- Add an alert on queue depth slope, not just absolute depth. A queue creeping upward, even slowly, is a steady-state violation.
- Add the Little's Law calculation as an SLI: expected_capacity = workers / median_processing_time; alert if arrival_rate > 0.7 * expected_capacity.
Lessons¶
- Capacity creep is harder than memory leaks because the symptoms look like noise.
- Always alert on slope, not just level. A flat dashboard line at "depth = 12" looks fine until you realise it was at "depth = 3" a month ago.
- Little's Law is the steady-state engineer's slide rule. Memorise it: L = λW.
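Alerting on slope means fitting a line through recent samples and paging when it tilts. In a Prometheus-style system a built-in function would normally do this; the sketch below shows the arithmetic itself as an ordinary least-squares fit, with a function name chosen for the example.

```go
package main

// slope fits a least-squares line through the points (xs[i], ys[i]) and
// returns its gradient in units of y per unit of x. It returns 0 when
// there are fewer than two points, the slices differ in length, or all
// x values are identical.
func slope(xs, ys []float64) float64 {
	if len(xs) < 2 || len(xs) != len(ys) {
		return 0
	}
	n := float64(len(xs))
	var sumX, sumY, sumXY, sumXX float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
		sumXY += xs[i] * ys[i]
		sumXX += xs[i] * xs[i]
	}
	denom := n*sumXX - sumX*sumX
	if denom == 0 {
		return 0
	}
	return (n*sumXY - sumX*sumY) / denom
}
```

Fed the depths from this story (3, 4, 6, 12 at weeks 2, 3, 4, 6), the fit comes out at a little over two queue slots per week: small in absolute terms, but clearly not zero, and that is the signal a level-based alert misses.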
War story 3 — The pool that fragmented under failover¶
The system¶
A user-profile microservice talks to a Postgres primary via pgx. The pool is configured MaxOpenConns = 25, MaxIdleConns = 25, ConnMaxLifetime = 30 minutes. Steady-state behaviour: 25 long-lived connections, both in pg_stat_activity and in the Go pool.
The incident¶
Postgres performs a planned failover. The primary is demoted, the standby is promoted. The Go service should reconnect and resume.
It mostly does. But three minutes after the failover, latency on the user-profile service has gone from 5 ms median to 60 ms median. The connection pool stats say InUse = 25, Idle = 0, WaitCount = 3000+. The service is queueing.
Diagnosis¶
After the failover, every existing connection in the pool is dead (the old primary refused them). Each call to db.Conn() triggers the driver's "ping" code, which discovers the connection is dead, throws it away, and dials a new one. Dial-on-demand is slow (TLS handshake plus Postgres startup); each new dial takes about a hundred milliseconds.
With twenty-five connections all needing to be replaced and a continuous load of two hundred RPS, the math is grim: twenty-five replaces times one hundred milliseconds equals two and a half seconds during which the pool is essentially empty. Two hundred RPS times two and a half seconds equals five hundred queued requests. The pool never catches up.
But that's not all. As the dials happen, the pool's internal "free list" is repeatedly reshuffled. New connections come in at the front; in-use ones go to the back. Under burst, a connection used twice in succession may end up scheduled twice on the same Postgres backend, while another backend is idle — the pool is fragmented.
The fix¶
Three changes, applied as a unit:
- Lower ConnMaxLifetime to fifteen minutes. Connections rotate more often, so by the time a failover happens, fewer are old and stale. This is the cheapest mitigation.
- Add a circuit-breaker around db.QueryContext. When the breaker opens, fail fast instead of queueing. Saturation is exposed to the caller rather than absorbed by the pool.
- Pre-warm the pool after dial failure. A small helper sees the first dial failure, opens N parallel dials to refill, and signals the pool to drop the dead connections.
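// After a connection error indicating loss of the primary,
// kick off n parallel dials so the pool fills in one round-trip
// time, not n round-trip times. n should not exceed MaxOpenConns;
// any excess callers simply wait out the timeout.
func prewarm(db *sql.DB, n int) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	conns := make(chan *sql.Conn, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// db.Conn hands out a pooled connection or dials a new one.
			conn, err := db.Conn(ctx)
			if err != nil {
				return
			}
			// Hold the connection until every dial has finished; closing
			// it immediately would let the next goroutine reuse it
			// instead of dialing its own.
			conns <- conn
		}()
	}
	wg.Wait()
	close(conns)
	// Closing a *sql.Conn returns it to the idle pool.
	for conn := range conns {
		conn.Close()
	}
}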
Lessons¶
- Connection pools are not as steady as they look. A pool that has hummed for weeks can collapse in three minutes when its peers cycle.
- ConnMaxLifetime is a steady-state lever. Lower values cost slightly more dial CPU but limit blast radius.
- Pre-warm is cheap insurance. It costs almost nothing at startup and saves you on the only day you really need it.
War story 4 — The deploy window that grew tail latency¶
The system¶
An ad-serving service running on Kubernetes, sixty replicas, ten-percent rolling deploys (so six replicas at a time). Steady-state behaviour: thirty-millisecond p99, twenty-millisecond median.
The incident¶
A new feature ships. Functionally it works. p99 latency on the dashboard shows a clear, periodic spike every time a rolling deploy happens — about every three days. The spike lasts the duration of the deploy (about ten minutes), tops out at one hundred and fifty milliseconds, then returns to baseline.
The team initially attributes this to "warm-up cost." It would be acceptable — except this is a paid traffic system, and a 5x p99 for ten minutes every three days costs measurable revenue.
Diagnosis¶
The issue is the new pod's cold start. The feature added a per-pod cache that takes about three minutes to warm to its steady-state hit rate. During those three minutes, the cache misses cause a cascade of database lookups; each lookup is fifty milliseconds; the pod's queue depth climbs; tail latency follows.
But it gets worse. The load balancer sends new traffic to the new pod immediately on ready. The pod was marked ready as soon as its HTTP listener accepted a connection — long before the cache was warm.
The fix¶
Two changes:
- Readiness gates on cache warm. The readiness probe returns 503 until the cache has at least fifty percent of its steady-state size, measured by entry count. This means the rolling deploy is slower, but new pods never serve traffic when cold.
- Pre-warm on startup. The pod reads a snapshot of the most recent cache state from a shared object store and inflates the cache before signalling ready. Steady-state warm time drops from three minutes to thirty seconds.
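The readiness gate from the first change can be sketched as an http.Handler that answers 503 until the cache reaches a configured fraction of its steady-state entry count. The WarmGate type and its method names are assumptions for the example; how the entry count is fed in depends on the cache in use.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// WarmGate reports ready only once the cache holds at least minEntries.
type WarmGate struct {
	minEntries int64
	entries    atomic.Int64
}

// NewWarmGate sets the threshold to fraction (for example 0.5) of the
// expected steady-state entry count.
func NewWarmGate(steadyStateEntries int64, fraction float64) *WarmGate {
	return &WarmGate{minEntries: int64(float64(steadyStateEntries) * fraction)}
}

// SetEntries records the current cache size; call it from the cache's
// insert path or from a periodic sampler.
func (g *WarmGate) SetEntries(n int64) { g.entries.Store(n) }

// Ready reports whether the cache is warm enough to take traffic.
func (g *WarmGate) Ready() bool { return g.entries.Load() >= g.minEntries }

// ServeHTTP implements the readiness probe: 503 while cold, 200 once warm.
func (g *WarmGate) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if !g.Ready() {
		http.Error(w, "cache warming", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```

The gate would be mounted on whatever path the Kubernetes readiness probe targets. Liveness should stay on a separate handler, otherwise a slow warm-up can be mistaken for a dead pod and restarted in a loop.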
Lessons¶
- Steady-state has a startup transient. The deploy window is a recurring excursion from steady-state, and every excursion is a chance for a tail-latency spike.
- Readiness must mean "ready to perform at steady-state quality," not "ready to accept TCP connections."
- Cache warming is a steady-state feature, not a one-time hack. Plan for it.
War story 5 — The FD leak that took twenty-eight days to surface¶
The system¶
A file-transcoding service spawns subprocesses for each transcoding job. The Go service manages the subprocess lifecycle, captures stdout and stderr, and writes results to S3.
The incident¶
Twenty-eight days after a deploy, a single pod reports "too many open files" and crashes. The other forty-three pods follow over the next forty-eight hours.
Diagnosis¶
The service was opening pipes to subprocesses via exec.Cmd.StdoutPipe and exec.Cmd.StderrPipe. The job code read from both pipes, processed the output, and waited for the subprocess to exit. Standard pattern.
But: in an error path, one of the pipes was not drained before the subprocess was killed. The Go runtime kept the pipe's file descriptor alive until the goroutine reading from it noticed the EOF — which, with no draining, never came.
Each occurrence of the error path leaked one FD. The error path ran about every ten minutes (about once per two hundred jobs). At one leak per ten minutes, you leak 144 per day. The pod's RLIMIT_NOFILE was 4096. Twenty-eight days times 144 is 4032. The math is exact.
The fix¶
cmd := exec.Command(...)
stdout, err := cmd.StdoutPipe()
if err != nil { ... }
stderr, err := cmd.StderrPipe()
if err != nil { ... }

// Always drain both pipes, even if we are about to kill the process.
done := make(chan struct{}, 2)
go func() { io.Copy(io.Discard, stdout); done <- struct{}{} }()
go func() { io.Copy(io.Discard, stderr); done <- struct{}{} }()

if err := cmd.Start(); err != nil { ... }

// On error path:
cmd.Process.Kill()
// Both readers must reach EOF before Wait, because Wait closes the pipes.
<-done
<-done
cmd.Wait()
Lessons¶
- FD leaks are slow. With RLIMIT_NOFILE = 4096 and a leak rate of one per ten minutes, the failure surfaces after twenty-eight days. With RLIMIT_NOFILE = 1024, it surfaces in seven days; with 65535, in over a year.
- Always alert on FD count, not just on FD exhaustion. A pod whose FD count has slope > 0 over twenty-four hours is leaking.
- Subprocess pipes are a common offender. Drain everything; assume nothing.
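Exporting the FD count as a metric takes only a few lines on Linux. The sketch below reads /proc/self/fd and also reproduces the time-to-exhaustion arithmetic from this story; both function names are invented for the example, and the /proc approach does not apply on other operating systems.

```go
package main

import (
	"math"
	"os"
)

// countOpenFDs returns the number of descriptors open in this process by
// listing /proc/self/fd (Linux only). Listing the directory briefly holds
// one descriptor itself, so the result is typically one higher than the
// count before the call.
func countOpenFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

// daysUntilExhaustion estimates how long a steady leak takes to consume
// the descriptors that remain under limit.
func daysUntilExhaustion(limit, inUse int, leaksPerDay float64) float64 {
	if leaksPerDay <= 0 {
		return math.Inf(1)
	}
	return float64(limit-inUse) / leaksPerDay
}
```

Publishing countOpenFDs as a gauge every minute or so gives the slope alert something to work with, and daysUntilExhaustion turns that slope into a number that is easier to reason about on a dashboard than a raw count.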
GOMEMLIMIT and cgroup integration¶
In production, the right GOMEMLIMIT is derived from the cgroup, not hard-coded. Hard-coded values rot: the container's memory limit changes (a Kubernetes resize, a node migration), but the binary's GOMEMLIMIT stays fixed. Either you set it too low and waste memory, or you set it too high and OOM.
The cgroup paths¶
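On cgroup v2 (almost all modern systems): /sys/fs/cgroup/memory.max
On cgroup v1 (older systems): /sys/fs/cgroup/memory/memory.limit_in_bytes
Each file contains a byte count. When no limit is set, the v2 file contains the string max and the v1 file contains a very large sentinel value.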
A working implementation¶
package memlimit

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

const fraction = 0.9

func SetFromCgroup() (int64, error) {
	if n, err := readV2(); err == nil {
		return setLimit(n)
	}
	if n, err := readV1(); err == nil {
		return setLimit(n)
	}
	return 0, fmt.Errorf("no cgroup memory limit found")
}

func readV2() (int64, error) {
	b, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return 0, err
	}
	s := strings.TrimSpace(string(b))
	if s == "max" {
		return 0, fmt.Errorf("unlimited")
	}
	return strconv.ParseInt(s, 10, 64)
}

func readV1() (int64, error) {
	b, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	n, err := strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
	if err != nil {
		return 0, err
	}
	// cgroup v1 sentinel: practically unlimited
	if n > (1 << 62) {
		return 0, fmt.Errorf("unlimited")
	}
	return n, nil
}

func setLimit(cgroupMax int64) (int64, error) {
	limit := int64(float64(cgroupMax) * fraction)
	debug.SetMemoryLimit(limit)
	return limit, nil
}
Call this from init(), or at the very top of main(), so the limit is in place before the service starts allocating in earnest. The KimMachineGun/automemlimit library implements the same idea and is widely used in production Go services.
Why ninety percent?¶
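Three sources of memory count against the cgroup but fall outside what the Go runtime's memory limit covers:
- cgo allocations (libpq, librdkafka, libcurl, etc.).
- Mappings of the binary itself and any files the process memory-maps.
- Memory the kernel holds on the process's behalf, such as page tables for mmaps and socket buffers.
Goroutine stacks and runtime-internal structures (runtime.MemStats.OtherSys and similar) are managed by the runtime and are already counted by the limit. The limit is also soft: the runtime may overshoot it briefly rather than stall the program. A ten-percent buffer absorbs all of this for typical services.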
When to override¶
If your service is unusual (very cgo-heavy, or holding large memory-mapped files), tune the fraction. For an OpenSearch client written in cgo, eighty percent may be safer; for a pure-Go service with little off-heap memory, ninety-five percent leaves more memory for the heap. Use measurements, not guesses.
Background GC pacing¶
GOMEMLIMIT plus GOGC give you two knobs that interact non-trivially. A useful mental model:
- Below the memory limit, GC is triggered by GOGC (the heap growth percentage; at the default of 100 the heap may double between cycles).
- As you approach the limit, GC is triggered increasingly by the limit itself.
- Near the limit, GC runs continuously and CPU spikes.
In production this means: if you set GOMEMLIMIT too low, you trade a fixed memory headroom for a wildly variable CPU bill. Watch runtime/metrics:
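- GC CPU fraction: the share of CPU time spent in GC, computed as the change in /cpu/classes/gc/total:cpu-seconds divided by the change in /cpu/classes/total:cpu-seconds over the same interval (Go 1.20 and later). Above five percent is a warning; above ten percent is a sign you should raise the limit.
- /gc/cycles/total:gc-cycles: total completed GC cycles. Sudden acceleration is a sign of memory pressure.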
A working pattern: set GOMEMLIMIT to ninety percent of the cgroup limit, set GOGC=100, and watch the GC CPU fraction. If it stays under five percent at peak load, you are in good shape. If it climbs over ten percent at peak, either raise GOGC (less aggressive GC) or raise the cgroup memory.
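The GC CPU fraction can be derived from two cumulative runtime/metrics counters that, to the best of our knowledge, are available from Go 1.20 onward: /cpu/classes/gc/total:cpu-seconds and /cpu/classes/total:cpu-seconds. The sketch below takes two readings and divides the deltas; the cpuSample type and the function names are assumptions for the example, and the runtime documents these counters as estimates.

```go
package main

import "runtime/metrics"

const (
	gcCPUMetric    = "/cpu/classes/gc/total:cpu-seconds"
	totalCPUMetric = "/cpu/classes/total:cpu-seconds"
)

// cpuSample holds one reading of the cumulative CPU-seconds counters.
type cpuSample struct {
	gc    float64
	total float64
}

// readCPU reads both counters. ok is false when the running Go version
// does not export them.
func readCPU() (s cpuSample, ok bool) {
	samples := []metrics.Sample{{Name: gcCPUMetric}, {Name: totalCPUMetric}}
	metrics.Read(samples)
	for _, m := range samples {
		if m.Value.Kind() != metrics.KindFloat64 {
			return cpuSample{}, false
		}
	}
	return cpuSample{
		gc:    samples[0].Value.Float64(),
		total: samples[1].Value.Float64(),
	}, true
}

// gcFraction returns the share of CPU time spent in GC between two
// readings, as a value between 0 and 1.
func gcFraction(prev, cur cpuSample) float64 {
	dTotal := cur.total - prev.total
	if dTotal <= 0 {
		return 0
	}
	return (cur.gc - prev.gc) / dTotal
}
```

Calling readCPU on a ticker and feeding consecutive readings to gcFraction yields the number the thresholds above refer to: 0.05 for the warning and 0.10 for the point where raising the limit is worth considering.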
runtime/metrics pipelines¶
The runtime exposes a stable, structured metrics interface via runtime/metrics. In production, you want this feeding into your monitoring system — Prometheus, Datadog, OpenTelemetry — as a first-class set of dimensions.
Polling implementation¶
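package rtmetrics

import (
	"context"
	"math"
	"runtime/metrics"
	"time"
)

// Exporter polls runtime/metrics every Interval and forwards each value to Send.
type Exporter struct {
	Interval time.Duration
	Send     func(name string, value float64)
}

var watched = []string{
	"/memory/classes/heap/objects:bytes",
	"/memory/classes/heap/free:bytes",
	"/memory/classes/heap/released:bytes",
	"/memory/classes/total:bytes",
	"/gc/cycles/total:gc-cycles",
	// Cumulative CPU-seconds (Go 1.20+); the GC CPU fraction is the
	// ratio of their rates of change.
	"/cpu/classes/gc/total:cpu-seconds",
	"/cpu/classes/total:cpu-seconds",
	"/sched/goroutines:goroutines",
	"/sync/mutex/wait/total:seconds",
	"/gc/pauses:seconds",
}

func (e *Exporter) Run(ctx context.Context) {
	samples := make([]metrics.Sample, len(watched))
	for i, name := range watched {
		samples[i].Name = name
	}
	ticker := time.NewTicker(e.Interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			metrics.Read(samples)
			for _, s := range samples {
				switch s.Value.Kind() {
				case metrics.KindUint64:
					e.Send(s.Name, float64(s.Value.Uint64()))
				case metrics.KindFloat64:
					e.Send(s.Name, s.Value.Float64())
				case metrics.KindFloat64Histogram:
					// Histograms are cumulative since process start.
					h := s.Value.Float64Histogram()
					e.Send(s.Name+"/mean", mean(h))
					e.Send(s.Name+"/p99", quantile(h, 0.99))
				case metrics.KindBad:
					// Metric not supported by this Go version; skip it.
				}
			}
		}
	}
}

// bounds returns finite lower and upper bounds for bucket i. The first and
// last boundaries may be -Inf and +Inf, so an infinite side is replaced by
// the finite one.
func bounds(h *metrics.Float64Histogram, i int) (lo, hi float64) {
	lo, hi = h.Buckets[i], h.Buckets[i+1]
	if math.IsInf(lo, -1) {
		lo = hi
	}
	if math.IsInf(hi, 1) {
		hi = lo
	}
	if math.IsInf(lo, 0) {
		return 0, 0 // single bucket spanning -Inf..+Inf
	}
	return lo, hi
}

// mean approximates the histogram mean using bucket midpoints.
func mean(h *metrics.Float64Histogram) float64 {
	var total, count float64
	for i, c := range h.Counts {
		if c == 0 {
			continue
		}
		lo, hi := bounds(h, i)
		total += (lo + hi) / 2 * float64(c)
		count += float64(c)
	}
	if count == 0 {
		return 0
	}
	return total / count
}

// quantile returns the upper bound of the bucket containing quantile q.
func quantile(h *metrics.Float64Histogram, q float64) float64 {
	var total, count float64
	for _, c := range h.Counts {
		total += float64(c)
	}
	if total == 0 {
		return 0
	}
	target := total * q
	for i, c := range h.Counts {
		count += float64(c)
		if count >= target {
			_, hi := bounds(h, i)
			return hi
		}
	}
	_, hi := bounds(h, len(h.Counts)-1)
	return hi
}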
What to alert on¶
Roughly, in priority order:
- /memory/classes/total:bytes slope over twenty-four hours.
- /sched/goroutines:goroutines absolute value, with a configurable cap.
- GC CPU fraction (change in /cpu/classes/gc/total:cpu-seconds over change in /cpu/classes/total:cpu-seconds) exceeding ten percent.
- /gc/pauses:seconds p99 exceeding ten milliseconds.
- /sync/mutex/wait/total:seconds rate-of-change exceeding a threshold.
Per-shard steady-state separation¶
A multi-tenant service that shares all resources across all tenants has a single steady-state. One bad tenant pushes the whole service out of equilibrium. Per-shard separation breaks that coupling.
Concrete pattern¶
A queue per shard, a worker pool per shard, a semaphore per shard. Tenants are mapped to shards by a consistent hash. Each shard's resources are bounded independently. The pathological tenant is contained inside one shard; the other N-1 shards are unaffected.
type Shard struct {
	Queue     chan Job
	Workers   *WorkerPool
	Semaphore *semaphore.Weighted
	Stats     ShardStats
}

type Service struct {
	Shards []*Shard
}

func (s *Service) Dispatch(j Job) {
	idx := hashShard(j.Key) % len(s.Shards)
	s.Shards[idx].Submit(j)
}
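The hashShard helper is not shown above. One possible shape, assuming string tenant keys, uses FNV-1a from the standard library; the shardIndex name and signature are invented for this sketch. Note that this is plain modulo hashing: it is stable for a fixed shard count, but changing the count remaps most tenants, which a true consistent-hash ring would avoid.

```go
package main

import "hash/fnv"

// shardIndex maps a tenant key to a shard in [0, shards). The mapping is
// deterministic, so a given tenant always lands on the same shard as long
// as the shard count does not change.
func shardIndex(key string, shards int) int {
	if shards <= 0 {
		return 0
	}
	h := fnv.New64a()
	h.Write([]byte(key)) // hash.Hash.Write never returns an error
	return int(h.Sum64() % uint64(shards))
}
```

Because the result depends only on the key and the shard count, a pathological tenant can be located by recomputing its index offline, which helps when deciding which shard to isolate during an incident.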
Trade-offs¶
Per-shard separation costs (a) more resources per node (each shard has its own pool), and (b) the loss of load smoothing across shards. It buys (a) blast-radius containment, (b) per-shard observability, and (c) the ability to roll-restart one shard at a time.
For services with hostile tenant patterns (sudden bursts from one tenant, asymmetric workloads), the trade is worth it. For uniform-load services, a single shared pool is fine.
The runbook¶
When the on-call alert fires "service is drifting out of steady-state," the runbook should already be open.
Step 1: classify the drift¶
- Heap up? Goroutines up? FDs up? Latency up? Each maps to a different first diagnosis.
- Look at the timeline: when did the slope start? Correlate with deploys, traffic changes, upstream changes.
Step 2: capture diagnostics¶
- curl /debug/pprof/heap > heap.pb.gz
- curl /debug/pprof/goroutine?debug=2 > goroutines.txt
- ls /proc/$PID/fd | wc -l (or equivalent metric)
- Snapshot runtime/metrics to a file.
Step 3: mitigate¶
- Roll back the most recent deploy if the drift started after it.
- Restart the most affected pod to buy time.
- If multi-shard, isolate the worst shard.
Step 4: diagnose¶
- Compare two heap snapshots from before and during the drift: go tool pprof -base before.pb.gz now.pb.gz.
- Compare goroutine profiles likewise.
- Identify the top growing consumer.
Step 5: fix¶
- Apply the smallest possible fix.
- Add an alert that would have caught this earlier.
- Add a test that exercises the fix.
Step 6: write the post-mortem¶
- The incident is not over until the post-mortem is written. The lessons here are not theoretical — they exist because somebody wrote them.
War story 6 — The reverse-proxy cache that ate every host¶
The system¶
A reverse proxy in front of a federation of HTTP backends. It maintains a small per-host connection cache and a small per-host TLS session cache. Total backends: about forty. Expected behaviour: about eighty TCP connections (two per backend on average) and forty cached TLS sessions.
The incident¶
Six weeks after deploy, one pod's memory has climbed from 600 MiB to 2.2 GiB. The pod has not been killed yet but is heading for OOM by the weekend.
Diagnosis¶
A heap snapshot shows the top growing consumer is crypto/tls.(*Conn).readHandshake. The TLS session cache, which was supposed to hold forty entries (one per backend), holds 4,200.
The cache key was meant to be host:port. But the upstream resolves to several IPs per hostname (multi-AZ deployment). When the proxy talks to "backend-a.example.com:443", it might end up on any of three IPs. In the code, the TLS cache key was the resolved IP, not the hostname. New IP, new cache entry.
Multiply: 40 hostnames × ~3 IPs × DNS-driven churn (each IP changes every few hours as the upstream's load balancer adjusts). Over six weeks, the cache had seen 4,200 distinct IP:port pairs.
The fix¶
Key the cache by hostname, not IP. Small code change; the cache stabilised at forty entries within minutes of the deploy.
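Keying by hostname fixes the growth, and a hard bound on entry count is a cheap second line of defence against the next key-space surprise. The sketch below is a minimal LRU built on container/list; the BoundedCache type is an assumption for illustration, is not safe for concurrent use without a mutex, and is not the proxy's actual cache.

```go
package main

import "container/list"

// entry is what each list element stores.
type entry struct {
	key   string
	value any
}

// BoundedCache is a small LRU cache that never holds more than max
// entries. It is not safe for concurrent use.
type BoundedCache struct {
	max   int
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func NewBoundedCache(max int) *BoundedCache {
	return &BoundedCache{max: max, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the cached value and marks it as recently used.
func (c *BoundedCache) Get(key string) (any, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put inserts or updates a value, evicting the least recently used entry
// when the cache would otherwise exceed max.
func (c *BoundedCache) Put(key string, value any) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.max {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Len reports the current number of entries.
func (c *BoundedCache) Len() int { return c.order.Len() }
```

For TLS client sessions specifically, the standard library's tls.NewLRUClientSessionCache is bounded by construction, so a hand-rolled cache is only needed for state the standard cache does not cover.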
Lessons¶
- Cache keys must reflect the lifetime of the cached object. An IP that changes hourly cannot be the key for an object that should live for the lifetime of the process.
- Caches are leaks in disguise. A cache with an unbounded key space is a memory leak with extra steps.
- TLS sessions are not free. Each one is a few kilobytes; thousands of them add up.
War story 7 — The graceful-shutdown that wasn't¶
The system¶
A background job processor. On SIGTERM, it stops accepting new jobs, waits for in-flight to finish, then exits. The wait is bounded by a thirty-second deadline.
The incident¶
Deploys are working but each pod shows about thirty seconds of error-rate spike at the end of its lifetime. Investigation reveals the pod is hitting SIGKILL from Kubernetes after thirty seconds — exactly the grace period.
Diagnosis¶
The graceful shutdown sequence:
1. SIGTERM received.
2. Stop pulling from Kafka.
3. Wait for in-flight jobs.
4. Close database connections.
5. Exit.
The bug: step 2 stopped pulling, but the consumer's poll loop had a 100 ms idle wait that ran a select on ctx.Done. If the loop was inside processBatch when the signal fired, it kept running until the batch finished, then returned to the select. A slow batch (downstream Postgres locked, an outlier) could keep the pod processing for thirty seconds — exactly when Kubernetes' grace period expired.
The pod was killed mid-batch. The half-committed transactions were rolled back. The error rate spiked. The next pod re-processed the same messages.
The fix¶
Two changes:
- Lower the per-batch deadline. Each batch is now budgeted at five seconds, not "until done." If a batch exceeds the budget, it is aborted; messages are re-delivered (their Kafka offsets are not committed).
- Raise the Kubernetes terminationGracePeriodSeconds from 30 to 60 to give the pod headroom.
Lessons¶
- Steady-state must include shutdown. A pod that drifts toward "won't shut down in time" is a steady-state violation, even if memory and goroutines look fine.
- Per-iteration deadlines are not the same as cumulative deadlines. A loop that runs each iteration for "up to one minute" can take a long time to finish if many iterations are pending.
War story 8 — The deadline-less internal call¶
The system¶
A monolith broken into many internal microservices. Service A calls service B, B calls C, C calls D, D calls a database. Most calls were tagged with context.WithTimeout, but one wasn't.
The incident¶
p99 latency for service A climbs from 80 ms to 4 seconds over twelve hours. p50 is unchanged. The pattern: A's tail follows D's database tail; D's database is having a bad day (a long-running query).
Diagnosis¶
A's call to B had a 500 ms timeout. B's call to C had a 400 ms timeout. C's call to D had a 300 ms timeout. D's call to the database had no timeout.
When the database was slow, D waited. C's 300 ms timer expired and C returned an error; but in D, the database call was still running until the database returned (or D was OOM'd). The goroutine inside D was leaking — slowly, one per slow-query event.
The fix¶
Add an explicit timeout to every external call, no exceptions. Better yet, audit the codebase with a static analyser that flags db.QueryContext(context.Background(), ...) or any call where the context is not derived from the request.
Lessons¶
- Deadline propagation is everyone's job. A single un-deadlined call corrupts the whole pipeline.
- Static analysis is cheap. A grep-able rule "never use context.Background() in handlers" catches this class of bug.
- Goroutine drift is a symptom, not the disease. The disease was the missing deadline; the symptom was the goroutines waiting forever.
Production tooling stack¶
A list of the tools we have found indispensable for steady-state engineering.
Build-time¶
- go vet with custom analysers for context propagation and timer usage.
- golangci-lint with errcheck, bodyclose, noctx, and rowserrcheck enabled.
- govulncheck for known vulnerabilities (security overlaps with steady-state).
Runtime telemetry¶
- runtime/metrics for Go-specific signals.
- prometheus/client_golang or opentelemetry-go for exporting.
- net/http/pprof on a localhost listener for ad-hoc diagnostics.
Dashboards¶
- Grafana or Datadog with one dashboard per service.
- A common service-health dashboard pinned across all services.
Alerting¶
- Alertmanager or PagerDuty for routing.
- Slope alerts using deriv() or rate() over six-hour windows.
- Each alert linked to a runbook in a shared wiki.
Chaos¶
- testcontainers-go for spinning up dependencies in tests.
- A custom chaos harness (per the senior page) that runs nightly.
Profiling¶
- pprof for heap, goroutine, allocation, mutex, block.
- go tool trace for execution traces, very useful for scheduling and GC inspection.
- bcc (Linux) for kernel-level latency at the syscall boundary.
A diagnostic decision tree¶
When the alert "service is drifting" fires, the diagnosis flow:
Alert fires.
├─ Was there a deploy in the last 6 hours?
│ ├─ Yes: suspect the deploy. Roll back; observe.
│ └─ No: continue.
├─ Which resource is drifting?
│ ├─ Heap: take two pprof heap snapshots; diff.
│ ├─ Goroutines: take goroutine?debug=1; identify the stuck stack.
│ ├─ FDs: ls /proc/$PID/fd; identify the leaking type.
│ ├─ Latency p99: pull traces from the slow tail.
│ └─ Pool wait: check the resource saturation; size up or fix slow consumer.
├─ Is the leak localised to one shard, one tenant?
│ ├─ Yes: investigate the per-shard / per-tenant pathology.
│ └─ No: global change; suspect a code change or upstream change.
└─ Mitigation: restart, roll back, or scale up. Then diagnose root cause.
The flow is intentionally short. Diagnosis should take minutes, not hours. The chaos harness, the dashboards, and the runbooks are designed to make this flow fast.
What changes at scale¶
One pod, ten requests per second¶
Steady-state engineering is overkill. A leak rate of one MB per minute fills about 2 GB in 33 hours. If you deploy daily, every restart resets the clock and you never see the leak.
Ten pods, hundred requests per second¶
Steady-state becomes worth doing but the cost-benefit is moderate. Most teams adopt the basics (bounded queues, sized pools) and skip the advanced patterns.
Hundred pods, ten thousand requests per second¶
Steady-state becomes mission-critical. Every pod has the basics; the fleet has chaos harnesses and saturation dashboards. The team has a runbook per alert.
Thousand pods, hundred thousand requests per second¶
Steady-state is institutional infrastructure. Dedicated SRE-style support, custom platform tools, shared libraries with team-wide defaults. The "steady-state library" is itself maintained as a first-class internal product.
Ten thousand pods, million requests per second¶
Now steady-state is at the platform level, not the service level. Each service inherits steady-state guarantees from a shared framework. Individual teams cannot regress; the framework prevents it (via mandatory metrics, mandatory pool sizes derived from a registry, mandatory chaos tests in CI).
The pattern: as scale grows, steady-state moves from "developer discipline" to "platform guarantee." Both are valid; the right one depends on where you are.
Steady-state and team culture¶
The technical patterns above are necessary but not sufficient. The team culture matters as much.
Cultures that support steady-state¶
- Engineers carry the pager for their own services.
- Incidents are reviewed in blameless post-mortems.
- Time is allocated for reliability work, not just feature work.
- Dashboards are reviewed regularly, not only during incidents.
- Junior engineers shadow on-call.
Cultures that undermine it¶
- The pager rotation is "always someone else's problem."
- Incidents are blamed on individuals.
- Reliability work is "nice-to-have," features are "must-have."
- Dashboards exist but no one looks at them between incidents.
- Junior engineers never see production.
You can write the best steady-state code in the world; in the wrong culture, it will rot. The technical and cultural sides have to be paired.
A retrospective on five years of production Go¶
A senior engineer reflecting on five years of Go in production might say:
- The hardest bug is the one that takes six days to surface. Make leak detectors aggressive enough to find them in CI.
- The easiest bug is the obvious one (missing defer, unbounded channel). Tools catch these; codify the tools.
- The most surprising bug is always at a boundary: pod-to-pod, service-to-service, process-to-process. Boundaries are where the assumptions diverge.
- The cheapest win is consistency: every service uses the same defaults for GOMEMLIMIT, MaxOpenConns, queue capacity. New engineers learn one set of patterns and apply them everywhere.
- The most expensive lesson is that steady-state is a property, not a single feature. It emerges from a thousand small decisions made consistently over years.
If you absorb only one thing from this page: every long-lived resource needs a name, a bound, a metric, and an alert. That principle covers ninety-five percent of what steady-state engineering requires. The remaining five percent is the war stories — and now you have read them.
Closing notes¶
The pattern across these stories is unromantic: steady-state engineering is prevention. The heroes of the war stories above are the engineers who, after one painful incident, wrote the alert that would have caught it earlier — and then the alert never fires again, because the next engineer knows where the trap is and avoids it.
A service that has been in production for a year without a single steady-state incident is not a service that got lucky. It is a service whose dashboards, alerts, and defaults reflect the accumulated lessons of every team that came before. Steady-state is institutional memory expressed in YAML, in runtime/debug.SetMemoryLimit, in MaxOpenConns, in ConnMaxLifetime, in chaos harnesses that run every night and silently catch the next class of regression before any human sees it.
The mark of a senior production engineer is not that they know all the failure modes — nobody does. It is that they know how to find a new one quickly, fix it, and turn the lesson into a permanent guardrail. Every section of this page is one such guardrail, paid for in incidents.
Patterns from the post-mortems¶
Across the eight war stories above, several patterns recur. A few are extracted here as standalone lessons:
Pattern — Lifetime coupling¶
A surprising number of leaks are lifetime coupling problems: resource X is supposed to live for the lifetime of Y, but Y exits while X is still in flight. The result is X surviving alone, holding the references it had at the moment of Y's exit.
Examples from the stories:
- Subscriber goroutines (war story 1) outlived the requests that created them.
- TLS sessions (war story 6) outlived the connection cache entries.
- Subprocess pipes (war story 5) outlived the process that should have drained them.
The fix in every case: make the lifetime explicit and bounded. Either tie X to Y's lifetime via context.Context, or give X its own bounded lifetime (TTL, idle GC).
Pattern — Implicit assumption that small numbers are fine¶
Production exposes small numbers to time. "A few hundred milliseconds extra per call" becomes "a multi-minute incident over six weeks." Small constants multiplied by large times produce significant drift.
Examples:
- 22 ms processing time instead of 20 ms (war story 2) compounds to capacity creep over weeks.
- 100 ms TLS handshake (war story 3) becomes 2.5 seconds of pool emptiness during failover.
- One FD leak per ten minutes (war story 5) becomes 4032 FDs over twenty-eight days.
The fix: dimensional analysis. Take any small number and multiply by your service's expected lifetime. If the product is concerning, the number is not small.
Pattern — The hidden cumulative¶
Several stories featured a metric that looked small at any instant but accumulated over time:
- Connection cache entries (war story 6) grew one per DNS-driven IP change.
- Tenant map entries (war story 7's lesson) grew one per session ID.
- In-progress transactions (find-bug story 12) grew one per failed handler.
The fix: every metric that is "per X" must have its X bounded. Every cumulative counter needs an eviction policy. Every map needs a size cap or TTL.
Pattern — The deploy boundary as state reset¶
Several stories were only manageable because deploys happened often enough to reset the drifting state. This works until it doesn't (a holiday freeze, a long incident response, a critical dependency that requires testing before deploy).
The fix: do not rely on deploys to reset state. Engineer the service to be steady-state-correct without periodic restarts.
A more complete inventory of failure modes¶
Beyond the eight stories above, here is a fuller catalogue, organised by symptom:
Memory grows monotonically¶
- Goroutine leak (with stack and closures).
- Cache without bound.
- Map without eviction.
- Slice appended without truncation.
- Channel buffer accumulating items.
- sync.Pool.Put missing on some path.
- Long-lived structure holding references to short-lived data (via string slicing, slice-of-slice, struct embedding).
- Finalizer queue growing because finalizers run slowly.
Goroutines grow monotonically¶
- Per-request go func() with no exit signal.
- Goroutine waiting on a channel that is never sent to.
- Goroutine waiting on a context that is never cancelled.
- Goroutine in an infinite loop with no escape.
- Goroutine spawned by a library that doesn't expose its lifecycle.
File descriptors grow monotonically¶
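- Body.Close() missing on http.Response.
- rows.Close() missing on sql.Rows.
- os.File.Close() missing on error paths.
- Subprocess pipes not drained.
- net.Listener.Accept() connections not closed on error.
- inotify instances created and not closed.
- TCP connections accumulating in CLOSE_WAIT (the peer has closed but the local side never called Close; sockets in TIME_WAIT consume ephemeral ports, not FDs).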
p99 latency rises slowly¶
- Deadline drift (missing or growing timeouts).
- Cache miss rate rising (cache eviction policy not matching access pattern).
- Mutex contention rising (lock not sharded).
- Goroutine scheduling latency rising (CPU saturated).
- Allocator pressure rising (GC pause growing).
- Database query plan changing (statistics drift).
Throughput drops slowly¶
- Worker pool churn (workers restarting frequently).
- Connection pool saturation (WaitCount rising).
- Per-tenant semaphore saturation.
- Rate limiter triggering more often (downstream slower).
- Background tasks consuming foreground CPU.
Error rate rises slowly¶
- Pool exhaustion (ErrPoolFull, "too many open files", etc.).
- Downstream timeouts rising.
- Disk full (logs, traces, ephemeral storage).
- TLS certificates near expiry.
For each, the diagnosis follows the decision tree above; the fix is covered at the middle or senior level of this series, depending on the layer.
The "graceful degradation curve"¶
A mature steady-state service has a degradation curve: as load rises, latency, throughput, and error rate move along a known shape. Three phases:
Phase 1 — Linear scaling¶
Load doubles, throughput doubles, latency unchanged. Resources are not saturated. This is the regime in which most testing happens.
Phase 2 — Sub-linear scaling¶
Load doubles, throughput grows but less than doubles, latency grows. One resource is saturating. Could be CPU, memory, database, connection pool. In Phase 2, p99 grows non-linearly because queue depth is rising.
Phase 3 — Saturation¶
Load doubles, throughput is flat or declining, latency is high, errors are rising. The bottleneck resource is fully saturated; queues are at their bounds; shedding has begun.
A well-engineered service has explicit thresholds for each phase. Capacity planning aims for sustained Phase 1 with bursts into Phase 2. Phase 3 is reserved for emergencies, and the system's behaviour in Phase 3 is predictable: it sheds gracefully, returns 503s, drops some work, but does not crash.
Without explicit engineering, services skip from Phase 2 directly to crash. With it, they ride into Phase 3 and recover when load drops.
What "production-ready" means in this discipline¶
A checklist for declaring a service production-ready:
- GOMEMLIMIT set from cgroup at startup.
- Every queue is bounded and has a metric.
- Every goroutine spawn site has an explicit exit condition.
- Every external call has a context deadline.
- Every connection pool has explicit sizing with documented rationale.
- Every cache has TTL, size limit, or both.
- Saturation dashboard with at least the five core metrics.
- Slope-based alert on each saturation metric.
- Runbook for each alert, linking to the dashboard.
- Chaos harness passes in CI.
- Long-runner staging pod has been clean for at least two weeks.
- Graceful shutdown tested under load.
- Deploy is monitored for one hour with a clean dashboard before marking done.
- On-call rotation has at least three engineers familiar with the service.
- The post-mortem template is in place for when something goes wrong.
A service that ticks all fifteen is unlikely to surprise you in production. A service that ticks fewer is correspondingly more fragile. The cost of each item is small relative to the cost of an incident; build the muscle of completing the checklist by default.
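The first checklist item can be sketched as follows, assuming a cgroup v2 container where the hard limit is exposed at /sys/fs/cgroup/memory.max. The helper names and the 90 percent headroom figure are assumptions for illustration; cgroup v1 uses a different path and is not handled here.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// parseCgroupMax parses the contents of a cgroup v2 memory.max file.
// It returns ok=false when the cgroup is unlimited ("max") or the
// contents are not a positive integer.
func parseCgroupMax(contents string) (limit int64, ok bool) {
	s := strings.TrimSpace(contents)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, false
	}
	return n, true
}

// softLimit leaves headroom below the hard limit for non-heap memory
// (stacks, cgo, kernel buffers charged to the cgroup).
func softLimit(hard, percent int64) int64 {
	return hard / 100 * percent
}

func main() {
	// Respect an explicit GOMEMLIMIT; the runtime has already applied it.
	if os.Getenv("GOMEMLIMIT") != "" {
		return
	}
	data, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		return // not a cgroup v2 environment; keep the runtime default
	}
	if hard, ok := parseCgroupMax(string(data)); ok {
		limit := softLimit(hard, 90) // 90 is an assumed headroom figure
		debug.SetMemoryLimit(limit)
		fmt.Println("memory limit set to", limit)
	}
}
```

Keeping the parsing separate from the file read makes the logic testable on a laptop, where the cgroup file usually does not exist.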
Closing reflection¶
Steady-state engineering, at its core, is humility about time. Software written for a five-second test will fail in five days. Software written for a five-day test will fail in five months. The discipline is acknowledging that you cannot test for what you have not run, and building defences that catch what you missed.
The defences are pedestrian: bounded queues, sized pools, slope-based alerts, runbooks, chaos harnesses. None is glamorous; each prevents an incident. In aggregate, they let a small team operate a large service without burning out.
If you read this page once, you have the vocabulary. If you read it twice, you have the patterns. To internalise them, run a service in production for a year, and read this page again every time something drifts. The lessons become permanent only when you have lived a few of them.
War story 9 — The metric exporter that DoS'd itself¶
The system¶
A medium-traffic API service that updates about a hundred Prometheus metrics per request. Each metric has labels for status code, method, route, and user agent.
The incident¶
Over the course of three weeks, the service's metric exporter slowly stops responding. The Prometheus scrape endpoint takes longer and longer to return. Eventually it takes more than the scrape timeout (10 seconds), and the service's metrics go "missing" from the monitoring dashboard. The on-call has zero visibility into the service for hours at a time.
Diagnosis¶
The metric registry grows by one entry per unique combination of (status, method, route, user_agent). The user agent label was the kicker: it included the full UA string of every client. Over three weeks, the service had seen hundreds of thousands of distinct UA strings (bots, scanners, mobile clients with version-pinned UAs).
The registry grew to over a million entries. Each entry was a few hundred bytes. The scrape endpoint had to serialise the whole registry on each scrape: a few hundred megabytes of text, plus the CPU cost of gzip-compressing it. The serialisation took longer than the timeout, so the scrape failed, so the metrics appeared missing.
The fix¶
Label sanitisation. The user_agent label was replaced with a bucket: chrome, firefox, safari, bot, unknown. Cardinality dropped from hundreds of thousands to five. Scrape latency returned to milliseconds.
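A sketch of what such a bucketing function might look like. The substring rules below are illustrative assumptions, not the ones the team shipped; the property that matters is that the function can only ever return one of five values.

```go
package main

import (
	"fmt"
	"strings"
)

// bucketUserAgent maps an arbitrary User-Agent string onto a fixed label
// set: chrome, firefox, safari, bot, unknown. Order matters: Chrome's UA
// string also contains "Safari/", so chrome is checked before safari.
func bucketUserAgent(ua string) string {
	s := strings.ToLower(ua)
	switch {
	case s == "":
		return "unknown"
	case strings.Contains(s, "bot"),
		strings.Contains(s, "crawler"),
		strings.Contains(s, "spider"):
		return "bot"
	case strings.Contains(s, "firefox/"):
		return "firefox"
	case strings.Contains(s, "chrome/"), strings.Contains(s, "chromium/"):
		return "chrome"
	case strings.Contains(s, "safari/"):
		return "safari"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(bucketUserAgent("Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0")) // firefox
	fmt.Println(bucketUserAgent("curl/8.4.0"))                                                            // unknown
}
```

The raw User-Agent string is still worth keeping, but in logs or traces, where each distinct value costs one line rather than one permanent time series.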
Lessons¶
- Cardinality is a steady-state property. A label with unbounded distinct values is a leak.
- The metric system is part of your service. Failures in observability are failures in the service.
- Bucket aggressively. When in doubt, pick a small set of categorical values and force every label into one of them.
War story 10 — The log rotation race¶
The system¶
A service that uses lumberjack for log rotation. Configured to rotate logs at 100 MiB or 24 hours, keeping seven rotated files.
The incident¶
The pod is OOM-killed unexpectedly. Memory grew over twelve hours from 600 MiB to 4 GiB and crossed the cgroup limit. The dashboards showed no obvious leak: heap was small, goroutines stable.
Diagnosis¶
The leaked memory was off-heap: file buffers held by the kernel. The lumberjack rotation logic had a race: under specific timing, the old log file's os.File was not closed before the new one was opened. Each leaked file descriptor kept the old file open, and with it the file's contents in the kernel's page cache.
Over twelve hours of high log volume, dozens of FDs leaked, each holding up to 100 MiB (one rotated file) of cached log data in kernel buffers. The Go runtime saw no leak; the kernel did.
The fix¶
Upgrade lumberjack to a version that fixed the race. Add an FD count alert (which would have caught this earlier).
Lessons¶
- Not all memory leaks are in Go. Off-heap memory (file buffers, mmap, cgo) can grow even when the Go heap is healthy.
- The kernel cache is not free. A leaked open file holds its cache; many leaked files hold many caches.
- FD count is a leading indicator for some leaks that the heap will not show.
War story 11 — The "Friday morning" cron¶
The system¶
A monolith with a weekly cron job that compacts old database tables. Runs every Friday at 6 AM. Takes about thirty minutes.
The incident¶
For about an hour every Friday morning, the service's p99 latency rises sharply. The team initially blames the cron job. They split the cron into a separate process; the latency spike persists.
Diagnosis¶
The cron job ran in a separate process, but on the same database. The compaction held locks on tables; the live service tried to query them and waited. The locks lasted only milliseconds each but were frequent enough to add up over the compaction window.
The fix¶
Move the compaction to a read replica. The primary stayed available; locks no longer affected live traffic.
Lessons¶
- Shared resources are shared steady-state. Splitting your service does not split the database.
- Locks are a steady-state hazard. A short lock taken frequently costs as much as a long lock taken once.
- Read replicas are a backpressure tool. Routing heavy reads or maintenance to replicas keeps the primary in steady-state.
War story 12 — The unbounded retry storm¶
The system¶
A microservice mesh: A calls B calls C. All three retry on failure with exponential backoff: three retries, starting at 100 ms.
The incident¶
Service C has a brief outage (90 seconds). When it recovers, the entire mesh is in a retry storm. Service C is hit with five times the normal traffic because B's clients are retrying. Service B is hit with five times the traffic because A's clients are retrying. C is overloaded and starts timing out, which triggers more retries, which keeps C overloaded.
The outage of C lasted 90 seconds; the recovery to steady-state took twenty minutes.
Diagnosis¶
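The retry chain compounds: at each hop, the number of attempts multiplies the effective load. Three attempts at each of three layers is 3 × 3 × 3 = 27 attempts at the deepest call for a single original request; counting the initial attempt plus three retries at each layer, the worst case is 4 × 4 × 4 = 64.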
The fix¶
Three changes:
- Retry budgets: each service has a cap on retries per minute, beyond which it gives up and returns the original error.
- Token-bucket rate limit at each service ingress, prioritising fresh requests over retries.
- Circuit breaker that opens on sustained failure rate, refusing to call downstream until the failure rate drops.
After the fix, the same 90-second outage of C now recovers in 100 seconds, not 20 minutes.
Lessons¶
- Retries are multiplicative across hops. The product of retry counts is the effective amplification.
- Retry budgets are a steady-state primitive. Without them, retry storms are inevitable.
- Circuit breakers belong at every hop, not just the outermost one.
War story 13 — The well-meaning gauge¶
The system¶
A service exporting a per-request gauge: "this request is currently being processed by goroutine X." Useful for debugging.
The incident¶
A junior engineer adds the gauge with the goroutine ID as a label. Over time, the metric registry grows because each goroutine has a unique ID and labels are never removed.
The metric exporter eventually crosses the scrape timeout, alerts fire, and the team discovers a hundreds-of-megabytes registry.
The fix¶
Remove the goroutine ID label. The intent of the metric (debugging) was served better by a trace than by a metric. Convert to a trace span, drop the metric.
Lessons¶
- Metrics are not for one-off debugging. They are for trends. Use traces or logs for one-off identification.
- Code review should flag high-cardinality labels. Once in production, they are hard to remove.
- Junior mistakes are senior responsibility. A well-meaning gauge that no one noticed during review is a process failure.
Patterns across all stories¶
A consolidated view of the patterns from all thirteen war stories:
| Story | Pattern | Fix |
|---|---|---|
| 1. Six-day OOM | Per-request goroutines outlive request | Lifetime decoupling, pool with shared queue |
| 2. Queue creep | Capacity creep from regression | Slope alerts, Little's Law SLI |
| 3. Pool fragment | Connection pool collapses on failover | Pre-warm, shorter ConnMaxLifetime |
| 4. Deploy spike | Cold pod cache miss | Readiness gate on cache warm |
| 5. FD leak | Subprocess pipes not drained | Drain on every path |
| 6. TLS cache | Cache key explosion via IP rotation | Key by hostname, not IP |
| 7. Shutdown delay | Per-batch deadline unbounded | Per-batch budget |
| 8. Deadline-less call | Missing timeout cascades | Static analysis, audit |
| 9. Metric DoS | Label cardinality explosion | Bucket labels |
| 10. Log rotation | FD race in third-party library | FD count alert, upgrade |
| 11. Cron lock | Shared database, weekly lock | Read replica for heavy work |
| 12. Retry storm | Multiplicative retries across hops | Retry budgets, circuit breakers |
| 13. Junior metric | High-cardinality gauge | Code review, traces for debugging |
The right column is the catalogue of fixes that, applied consistently, produce a steady-state service.
Building the runbook from incidents¶
Each story above became a runbook entry. The template:
# Alert: [name]
**Symptom:** What does this look like on the dashboard?
**Severity:** Page / Warn / Info
**Likely cause:** Based on past incidents.
**Diagnostic steps:**
1. Check [metric/dashboard].
2. If [pattern], suspect [cause].
3. Otherwise, [next diagnostic].
**Mitigations:**
- Quick: [restart / scale / roll back]
- Lasting: [code change / config change]
**Post-incident:**
- Add alert if missing.
- Update this runbook.
- Add chaos test if applicable.
Each runbook is one or two pages. The team's full set is maybe twenty runbooks for a mature service. Curated, indexed, kept up to date — they become institutional memory.
A note on humility¶
After working through many steady-state incidents, an honest senior engineer admits: most incidents are not exotic. They are recurrences of patterns we have seen before. The unique incident is rare; the variation on a known pattern is the norm.
This is good news. It means steady-state engineering is largely about building defences against known patterns, not about anticipating the unknown. The unknown will surface; when it does, it joins the catalogue.
The pattern catalogue grows slowly. By year five of operating a service, your team's catalogue is probably stable. Most new incidents have a known shape. The on-call diagnosis is fast.
By year ten, the catalogue may be shared across services, codified in shared libraries, taught to new hires in onboarding. The cost of an incident has been amortised across the entire engineering organisation.
That is what mature steady-state engineering looks like: not zero incidents, but incidents that are fast to diagnose, fast to mitigate, and easy to prevent next time.