Steady-State — Professional

Table of Contents

  1. Introduction
  2. War story 1 — The six-day OOM
  3. War story 2 — The queue creeping at 0.1 percent per hour
  4. War story 3 — The pool that fragmented under failover
  5. War story 4 — The deploy window that grew tail latency
  6. War story 5 — The FD leak that took twenty-eight days to surface
  7. GOMEMLIMIT and cgroup integration
  8. Background GC pacing
  9. runtime/metrics pipelines
  10. Per-shard steady-state separation
  11. The runbook
  12. Closing notes

Introduction

This page is built from incidents. The lessons here are not theoretical; each section names a real failure mode, walks through what we saw on the dashboards, what we got wrong on first diagnosis, and what the fix looked like. The names of services and companies are abstracted; the engineering is not. If you have not yet been on call for a Go service in production, read this page anyway — the shape of the diagnosis is the same whether you are debugging a six-replica side-project or a six-thousand-replica payment processor.

The recurring theme is time. Production hides its bugs in time. A unit test runs for one second; a load test for an hour; production for years. Each order of magnitude reveals a new class of failure. Steady-state engineering is the discipline of designing for the largest order of magnitude from the start — because once your service is in production, the time-revealed failures will find you anyway.


War story 1 — The six-day OOM

The system

A media-streaming service receives webhooks from an upstream CDN for every video segment uploaded by every customer. Each webhook is small (a few kilobytes). The service decodes the JSON, looks up the customer, applies access rules, and writes a row to a Postgres table. Throughput: about three thousand webhooks per second at peak, average two thousand.

The incident

The service is deployed Tuesday afternoon. By Friday it is humming along, dashboards green. On Monday morning, around 6 a.m., a single pod (out of twelve) is killed by the OOM-killer. By 10 a.m. three more have been killed. By noon the rest are following a clear pattern: every pod's resident set is growing at about thirty megabytes per hour, and they crash when they cross the four-gigabyte container limit.

First diagnosis (wrong)

The team's first guess is the JSON decoder. The webhook payload contains a customer-supplied URL field; one engineer suggests an unbounded string is being kept in a cache somewhere. Heap snapshots are taken at 11 a.m. and noon (one hour apart). go tool pprof -base 11.pb.gz 12.pb.gz shows the diff.

The diff is unhelpful at first. The top consumers are database/sql connection-related allocations and encoding/json buffer reuse — both expected, neither growing in a way that explains the leak.

Second diagnosis (closer)

The on-call lead pulls a goroutine profile. Goroutine count is supposedly stable, but the profile shows about ten thousand goroutines parked on the same line:

select {
case ev := <-pubSub.Events:
    process(ev)
case <-ctx.Done():
    return
}

This is in a per-customer "subscriber" goroutine — one is spawned the first time a customer ID appears in the day. The ctx is the request context. When the request finishes, the goroutine is supposed to exit because ctx.Done() is closed.

But pubSub.Events is a buffered channel shared across all subscribers. When a customer's subscriber is parked on the receive, and the request context is cancelled, the select should pick the Done branch. It does. Most of the time.

Sometimes, however, an event arrives just before the cancel, and the goroutine reads it, calls process(ev), and only then returns to the select. At that point the request handler has long since exited — but the subscriber is still in process(ev). Each event holds references to the customer record, the event payload, and a pgx connection. The goroutine's per-G stack is also still allocated.

In normal operation, this is fine: process(ev) finishes in milliseconds, the goroutine returns to select, finds Done is closed, and exits. But under sustained traffic, a slow downstream creates back-pressure inside process(ev). The goroutine sits there for seconds. During those seconds, new requests for the same customer create new subscriber goroutines, because the lookup map's "is there a subscriber?" check returns false (the subscriber map has already been cleaned up by the request-finishing path).

Now there are two subscribers for one customer, both still alive. Over six days, this accumulates. At roughly 2 KiB of stack per subscriber goroutine (the runtime's minimum stack size), plus pinned per-customer state, you get the thirty-megabytes-per-hour drift.

The fix

Two changes, neither flashy:

  1. Subscribers are no longer per-request. They live for the lifetime of the process. A worker pool of fifty consumers drains the shared event channel. Per-customer state moved into a bounded LRU map.
  2. The select was changed to drain the channel after Done:
for {
    select {
    case ev := <-pubSub.Events:
        process(ev)
    case <-ctx.Done():
        // drain any in-flight to avoid orphaning
        for {
            select {
            case ev := <-pubSub.Events:
                process(ev)
            default:
                return
            }
        }
    }
}

But this was a defensive layer; the structural fix was decoupling subscriber lifetime from request lifetime.
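The structural fix can be sketched roughly as follows. This is a minimal illustration under assumptions, not the service's actual code: Event, startConsumers and runDemo are hypothetical names, and the handler is a stand-in for process(ev). The point it shows is that the goroutine count is fixed by the pool size, not by the number of requests in flight.

```go
package main

import (
    "context"
    "fmt"
    "sync"
    "sync/atomic"
)

// Event is a hypothetical stand-in for the pub/sub payload.
type Event struct{ CustomerID string }

// startConsumers launches exactly n workers that drain events until ctx is
// cancelled or the channel is closed. The goroutine count is bounded by n,
// independent of request traffic.
func startConsumers(ctx context.Context, n int, events <-chan Event, handle func(Event)) *sync.WaitGroup {
    wg := &sync.WaitGroup{}
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for {
                select {
                case ev, ok := <-events:
                    if !ok {
                        return // channel closed: process is shutting down
                    }
                    handle(ev)
                case <-ctx.Done():
                    return
                }
            }
        }()
    }
    return wg
}

// runDemo pushes total events through a pool of the given size and returns
// how many were handled.
func runDemo(workers, total int) int64 {
    events := make(chan Event, 16)
    var handled atomic.Int64
    wg := startConsumers(context.Background(), workers, events, func(Event) { handled.Add(1) })
    for i := 0; i < total; i++ {
        events <- Event{CustomerID: fmt.Sprintf("c%d", i%7)}
    }
    close(events)
    wg.Wait()
    return handled.Load()
}

func main() {
    fmt.Println("handled:", runDemo(50, 1000))
}
```

Here the context belongs to the process, not to any request, so a cancelled request can no longer orphan a consumer.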

Lessons

  • Per-request goroutines are a steady-state hazard. Any time the goroutine count is "proportional to traffic in flight," and not "proportional to fixed pool size," you are vulnerable.
  • Heap diffs are not enough. The leak was not in the heap — it was in the goroutine stacks and the closed-over state. runtime.NumGoroutine told the truer story.
  • Six days is short. Some leaks take two weeks. Some take a month. The longer the window, the larger the multiplier on small per-second drift.

War story 2 — The queue creeping at 0.1 percent per hour

The system

A search-indexing pipeline consumes documents from Kafka, runs a CPU-heavy enrichment step, and writes to an inverted index. The consumer has a worker pool of sixteen goroutines; the queue between Kafka and the worker pool is a buffered channel of capacity 256.

The incident

The queue depth metric is normally near zero (workers keep up easily). Two weeks after a deploy, the queue depth has settled at an average of three. After three weeks, four. After four weeks, six. By six weeks it is twelve. The average is creeping.

There is no alert; the absolute number is far below capacity. But the on-call lead notices the trend on a dashboard and starts asking questions.

Diagnosis

This is not a leak. It is a capacity creep: the arrival rate is closing in on the workers' maximum throughput. Two things moved at once. A new feature added one extra database lookup per document, which raised the median processing time from twenty milliseconds to twenty-two. With sixteen workers at 22 ms per document, the steady-state throughput ceiling is 16 / 0.022 ≈ 727 documents per second, down from 800. Meanwhile, six weeks of organic traffic growth had pushed the arrival rate to about 720 per second. The gap is closing.

The pinch point

By Little's Law, the average queue depth is arrival_rate * average_wait_time. The average wait time grew because the workers were getting closer to saturation. The depth was a signal, not the cause.

The fix

Three changes:

  1. Cache the new database lookup. Median processing time dropped back to twenty milliseconds; throughput restored to eight hundred per second.
  2. Add an alert on queue depth slope, not just absolute depth. A queue creeping upward, even slowly, is a steady-state violation.
  3. Add the capacity calculation as an SLI: expected_capacity = workers / median_processing_time; alert if arrival_rate > 0.7 * expected_capacity.

Lessons

  • Capacity creep is harder than memory leaks because the symptoms look like noise.
  • Always alert on slope, not just level. A flat dashboard line at "depth = 12" looks fine until you realise it was at "depth = 3" a month ago.
  • Little's Law is the steady-state engineer's slide rule. Memorise it: L = λW.
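A slope alert needs a slope. One simple way to get it, sketched here as an assumption and not as the team's actual alerting rule, is a least-squares fit over recent samples. The function name slope is hypothetical; the sample points are the queue depths quoted in the incident.

```go
package main

import "fmt"

// slope returns the least-squares slope of ys against xs. It returns 0 when
// there are fewer than two points, the lengths differ, or xs has no spread.
func slope(xs, ys []float64) float64 {
    if len(xs) < 2 || len(xs) != len(ys) {
        return 0
    }
    n := float64(len(xs))
    var sx, sy, sxx, sxy float64
    for i := range xs {
        sx += xs[i]
        sy += ys[i]
        sxx += xs[i] * xs[i]
        sxy += xs[i] * ys[i]
    }
    den := n*sxx - sx*sx
    if den == 0 {
        return 0
    }
    return (n*sxy - sx*sy) / den
}

func main() {
    // Average queue depth at weeks 2, 3, 4 and 6 after the deploy.
    weeks := []float64{2, 3, 4, 6}
    depth := []float64{3, 4, 6, 12}
    fmt.Printf("depth grows by about %.2f per week\n", slope(weeks, depth))
}
```

A positive slope sustained over a long window is the signal; the absolute level can stay far below any capacity alert the whole time.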

War story 3 — The pool that fragmented under failover

The system

A user-profile microservice talks to a Postgres primary via pgx. The pool is configured MaxOpenConns = 25, MaxIdleConns = 25, ConnMaxLifetime = 30 minutes. Steady-state behaviour: 25 long-lived connections, both in pg_stat_activity and in the Go pool.

The incident

Postgres performs a planned failover. The primary is demoted, the standby is promoted. The Go service should reconnect and resume.

It mostly does. But three minutes after the failover, latency on the user-profile service has gone from 5 ms median to 60 ms median. The connection pool stats say InUse = 25, Idle = 0, WaitCount = 3000+. The service is queueing.

Diagnosis

After the failover, every existing connection in the pool is dead (the old primary refused them). Each call to db.Conn() triggers the driver's "ping" code, which discovers the connection is dead, throws it away, and dials a new one. Dial-on-demand is slow (TLS handshake plus Postgres startup); each new dial takes about a hundred milliseconds.

With twenty-five connections all needing to be replaced and a continuous load of two hundred RPS, the math is grim: twenty-five replaces times one hundred milliseconds equals two and a half seconds during which the pool is essentially empty. Two hundred RPS times two and a half seconds equals five hundred queued requests. The pool never catches up.

But that's not all. As the dials happen, the pool's internal "free list" is repeatedly reshuffled. New connections come in at the front; in-use ones go to the back. Under burst, a connection used twice in succession may end up scheduled twice on the same Postgres backend, while another backend is idle — the pool is fragmented.

The fix

Three changes, applied as a unit:

  1. Lower ConnMaxLifetime to fifteen minutes. Connections rotate more often, so by the time a failover happens, fewer are old and stale. This is the cheapest mitigation.
  2. Add a circuit-breaker around db.QueryContext. When the breaker opens, fail fast instead of queueing. Saturation is exposed to the caller rather than absorbed by the pool.
  3. Pre-warm the pool after dial failure. A small helper sees the first dial failure, opens N parallel dials to refill, and signals the pool to drop the dead connections.
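// After a connection error indicating loss of the primary,
// kick off N parallel dials so the pool fills in one round-trip
// time, not N round-trip times.
// Imports needed: context, database/sql, sync, time.
func prewarm(db *sql.DB, n int) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    conns := make(chan *sql.Conn, n)
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            if conn, err := db.Conn(ctx); err == nil {
                conns <- conn
            }
        }()
    }
    wg.Wait()
    close(conns)

    // Release only after every dial has finished. Closing a connection
    // early returns it to the idle pool, where another goroutine could
    // reuse it instead of dialing its own.
    for conn := range conns {
        conn.Close()
    }
}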

Lessons

  • Connection pools are not as steady as they look. A pool that has hummed for weeks can collapse in three minutes when its peers cycle.
  • ConnMaxLifetime is a steady-state lever. Lower values cost slightly more dial CPU but limit blast radius.
  • Pre-warm is cheap insurance. It costs almost nothing at startup and saves you on the only day you really need it.

War story 4 — The deploy window that grew tail latency

The system

An ad-serving service running on Kubernetes, sixty replicas, ten-percent rolling deploys (so six replicas at a time). Steady-state behaviour: thirty-millisecond p99, twenty-millisecond median.

The incident

A new feature ships. Functionally it works. p99 latency on the dashboard shows a clear, periodic spike every time a rolling deploy happens — about every three days. The spike lasts the duration of the deploy (about ten minutes), tops out at one hundred and fifty milliseconds, then returns to baseline.

The team initially attributes this to "warm-up cost." It would be acceptable — except this is a paid traffic system, and a 5x p99 for ten minutes every three days costs measurable revenue.

Diagnosis

The issue is the new pod's cold start. The feature added a per-pod cache that takes about three minutes to warm to its steady-state hit rate. During those three minutes, the cache misses cause a cascade of database lookups; each lookup is fifty milliseconds; the pod's queue depth climbs; tail latency follows.

But it gets worse. The load balancer sends new traffic to the new pod immediately on ready. The pod was marked ready as soon as its HTTP listener accepted a connection — long before the cache was warm.

The fix

Two changes:

  1. Readiness gates on cache warm. The readiness probe returns 503 until the cache has at least fifty percent of its steady-state size, measured by entry count. This means the rolling deploy is slower, but new pods never serve traffic when cold.
  2. Pre-warm on startup. The pod reads a snapshot of the most recent cache state from a shared object store and inflates the cache before signalling ready. Steady-state warm time drops from three minutes to thirty seconds.
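The readiness gate in the first change could be sketched as follows. This is an illustration under assumptions: warmGate, its fields, the /readyz path and the target of 500 entries are hypothetical, and the real service would update the entry count from its cache.

```go
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "sync/atomic"
)

// warmGate serves a readiness probe that fails until the cache has reached
// a target entry count.
type warmGate struct {
    entries atomic.Int64 // current cache entry count, updated by the cache
    target  int64        // entry count at which the pod counts as warm
}

func (g *warmGate) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
    if g.entries.Load() < g.target {
        http.Error(w, "cache warming", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// probe returns the status code the readiness endpoint would serve now.
func probe(g *warmGate) int {
    rec := httptest.NewRecorder()
    g.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
    return rec.Code
}

func main() {
    g := &warmGate{target: 500} // e.g. half of a 1000-entry steady state
    fmt.Println("cold:", probe(g))
    g.entries.Store(500)
    fmt.Println("warm:", probe(g))
}
```

The handler would be registered on whatever path the Kubernetes readinessProbe is configured to hit; the load balancer then holds traffic back until the gate opens.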

Lessons

  • Steady-state has a startup transient. The deploy window is a recurring excursion from steady-state, and every excursion is a chance for a tail-latency spike.
  • Readiness must mean "ready to perform at steady-state quality," not "ready to accept TCP connections."
  • Cache warming is a steady-state feature, not a one-time hack. Plan for it.

War story 5 — The FD leak that took twenty-eight days to surface

The system

A file-transcoding service spawns subprocesses for each transcoding job. The Go service manages the subprocess lifecycle, captures stdout and stderr, and writes results to S3.

The incident

Twenty-eight days after a deploy, a single pod reports "too many open files" and crashes. The other forty-three pods follow over the next forty-eight hours.

Diagnosis

The service was opening pipes to subprocesses via exec.Cmd.StdoutPipe and exec.Cmd.StderrPipe. The job code read from both pipes, processed the output, and waited for the subprocess to exit. Standard pattern.

But: in an error path, one of the pipes was not drained before the subprocess was killed. The Go runtime kept the pipe's file descriptor alive until the goroutine reading from it noticed the EOF — which, with no draining, never came.

Each occurrence of the error path leaked one FD. The error path ran about every ten minutes (about once per two hundred jobs). At one leak per ten minutes, you leak 144 per day. The pod's RLIMIT_NOFILE was 4096. Twenty-eight days times 144 is 4032; add the descriptors the process holds in normal operation and you are at the limit.

The fix

cmd := exec.Command(...)
stdout, _ := cmd.StdoutPipe()
stderr, _ := cmd.StderrPipe()

// Always drain both pipes, even if we are about to kill the process.
done := make(chan struct{}, 2)
go func() { io.Copy(io.Discard, stdout); done <- struct{}{} }()
go func() { io.Copy(io.Discard, stderr); done <- struct{}{} }()

if err := cmd.Start(); err != nil { ... }

// On error path:
cmd.Process.Kill()
<-done
<-done
cmd.Wait()

Lessons

  • FD leaks are slow. With RLIMIT_NOFILE = 4096 and a leak rate of one per ten minutes, the failure surfaces after twenty-eight days. With RLIMIT_NOFILE = 1024, it surfaces in seven days; with 65535, in over a year.
  • Always alert on FD count, not just on FD exhaustion. A pod whose FD count has slope > 0 over twenty-four hours is leaking.
  • Subprocess pipes are a common offender. Drain everything; assume nothing.

GOMEMLIMIT and cgroup integration

In production, the right GOMEMLIMIT is derived from the cgroup, not hard-coded. Hard-coded values rot: the container's memory limit changes (a Kubernetes resize, a node migration), but the binary's GOMEMLIMIT stays fixed. Either you set it too low and waste memory, or you set it too high and OOM.

The cgroup paths

On cgroup v2 (almost all modern systems):

/sys/fs/cgroup/memory.max

On cgroup v1 (older systems):

/sys/fs/cgroup/memory/memory.limit_in_bytes

On cgroup v2 the file contains either a number of bytes or the string max (no limit). On cgroup v1 it always contains a number; an unlimited cgroup reports a very large sentinel value.

A working implementation

package memlimit

import (
    "fmt"
    "os"
    "runtime/debug"
    "strconv"
    "strings"
)

const fraction = 0.9

func SetFromCgroup() (int64, error) {
    if n, err := readV2(); err == nil {
        return setLimit(n)
    }
    if n, err := readV1(); err == nil {
        return setLimit(n)
    }
    return 0, fmt.Errorf("no cgroup memory limit found")
}

func readV2() (int64, error) {
    b, err := os.ReadFile("/sys/fs/cgroup/memory.max")
    if err != nil {
        return 0, err
    }
    s := strings.TrimSpace(string(b))
    if s == "max" {
        return 0, fmt.Errorf("unlimited")
    }
    return strconv.ParseInt(s, 10, 64)
}

func readV1() (int64, error) {
    b, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    if err != nil {
        return 0, err
    }
    n, err := strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
    if err != nil {
        return 0, err
    }
    // cgroup v1 sentinel: practically unlimited
    if n > (1 << 62) {
        return 0, fmt.Errorf("unlimited")
    }
    return n, nil
}

func setLimit(cgroupMax int64) (int64, error) {
    limit := int64(float64(cgroupMax) * fraction)
    debug.SetMemoryLimit(limit)
    return limit, nil
}

Call this from init(), or as the first statement of main(), so the limit is in place before the service starts allocating in earnest. The KimMachineGun/automemlimit library does the same job and is widely used in production Go services.

Why ninety percent?

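The memory limit covers everything the Go runtime manages, which includes goroutine stacks and the runtime's own bookkeeping. What it does not cover is memory that is charged to the cgroup from outside the runtime. Three sources matter in practice:

  1. cgo allocations (libpq, librdkafka, libcurl, etc.).
  2. Mappings the runtime does not manage: the binary itself and any mmap'd files.
  3. Kernel memory held on the process's behalf: page tables, socket buffers.

A ten-percent buffer absorbs all three for typical services.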

When to override

If your service is unusual (very cgo-heavy, or holding large mmap'd files), tune the fraction. For an OpenSearch client written in cgo, eighty percent may be safer; for a pure-Go service, ninety-five percent leaves more memory for the heap. Use measurements, not guesses.


Background GC pacing

GOMEMLIMIT plus GOGC give you two knobs that interact non-trivially. A useful mental model:

  • Below the memory limit, GC is triggered by GOGC (the percentage the heap may grow over the live heap before the next cycle; at 100 the heap may double).
  • As you approach the limit, GC is triggered increasingly by the limit itself.
  • Near the limit, GC runs continuously and CPU spikes.

In production this means: if you set GOMEMLIMIT too low, you trade a fixed memory headroom for a wildly variable CPU bill. Watch runtime/metrics:

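  • GC CPU fraction: the share of available CPU spent in GC. runtime/metrics has no single percentage metric for this; derive it from the rate of /cpu/classes/gc/total:cpu-seconds divided by the rate of /cpu/classes/total:cpu-seconds (Go 1.20 and later). Above five percent is a warning; above ten percent is a sign you should raise the limit.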
  • /gc/cycles/total:gc-cycles — total cycles. Sudden acceleration is a sign of memory pressure.

A working pattern: set GOMEMLIMIT to ninety percent of cgroup, set GOGC=100, watch the GC CPU fraction (/cpu/classes/gc/total:cpu-seconds over /cpu/classes/total:cpu-seconds). If it stays under five percent at peak load, you are in good shape. If it climbs over ten percent at peak, either raise GOGC (less aggressive GC) or raise the cgroup memory.
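One way to watch GC CPU cost from inside the process is to sample the runtime's cumulative CPU-seconds counters twice and divide the deltas. The sketch below assumes Go 1.20 or later, where /cpu/classes/gc/total:cpu-seconds and /cpu/classes/total:cpu-seconds are available; cpuSample, readCPU and gcFraction are hypothetical names. The runtime documents these counters as estimates, so treat the result as a trend signal.

```go
package main

import (
    "fmt"
    "runtime"
    "runtime/metrics"
)

const (
    gcCPU    = "/cpu/classes/gc/total:cpu-seconds"
    totalCPU = "/cpu/classes/total:cpu-seconds"
)

// cpuSample holds the two cumulative counters read at one instant.
type cpuSample struct{ gc, total float64 }

// readCPU reads both counters; ok is false if this Go version does not
// expose them.
func readCPU() (s cpuSample, ok bool) {
    samples := []metrics.Sample{{Name: gcCPU}, {Name: totalCPU}}
    metrics.Read(samples)
    for _, m := range samples {
        if m.Value.Kind() != metrics.KindFloat64 {
            return cpuSample{}, false
        }
    }
    return cpuSample{gc: samples[0].Value.Float64(), total: samples[1].Value.Float64()}, true
}

// gcFraction returns the share of available CPU time spent in GC between
// two samples, or 0 if no CPU time elapsed.
func gcFraction(prev, cur cpuSample) float64 {
    d := cur.total - prev.total
    if d <= 0 {
        return 0
    }
    return (cur.gc - prev.gc) / d
}

func main() {
    prev, ok := readCPU()
    if !ok {
        fmt.Println("CPU class metrics unavailable on this Go version")
        return
    }
    // Allocate some garbage so the collector has work to do.
    var sink [][]byte
    for i := 0; i < 2000; i++ {
        sink = append(sink, make([]byte, 64<<10))
        if len(sink) == 64 {
            sink = sink[:0]
        }
    }
    runtime.GC()
    cur, _ := readCPU()
    fmt.Printf("GC CPU fraction over the interval: %.4f\n", gcFraction(prev, cur))
}
```

In a real exporter the two samples would be one polling interval apart, and the five and ten percent thresholds above would apply to the resulting fraction.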


runtime/metrics pipelines

The runtime exposes a stable, structured metrics interface via runtime/metrics. In production, you want this feeding into your monitoring system — Prometheus, Datadog, OpenTelemetry — as a first-class set of dimensions.

Polling implementation

package rtmetrics

import (
    "context"
    "math"
    "runtime/metrics"
    "time"
)

type Exporter struct {
    Interval time.Duration
    Send     func(name string, value float64)
}

var watched = []string{
    "/memory/classes/heap/objects:bytes",
    "/memory/classes/heap/free:bytes",
    "/memory/classes/heap/released:bytes",
    "/memory/classes/total:bytes",
    "/gc/cycles/total:gc-cycles",
    // runtime/metrics has no single GC-CPU-percent metric. Export the two
    // cumulative counters (Go 1.20+) and divide their rates downstream.
    "/cpu/classes/gc/total:cpu-seconds",
    "/cpu/classes/total:cpu-seconds",
    "/sched/goroutines:goroutines",
    "/sync/mutex/wait/total:seconds",
    "/gc/pauses:seconds",
}

func (e *Exporter) Run(ctx context.Context) {
    samples := make([]metrics.Sample, len(watched))
    for i, name := range watched {
        samples[i].Name = name
    }

    ticker := time.NewTicker(e.Interval)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            metrics.Read(samples)
            for _, s := range samples {
                // Names unknown to this Go version come back as KindBad
                // and are skipped.
                switch s.Value.Kind() {
                case metrics.KindUint64:
                    e.Send(s.Name, float64(s.Value.Uint64()))
                case metrics.KindFloat64:
                    e.Send(s.Name, s.Value.Float64())
                case metrics.KindFloat64Histogram:
                    h := s.Value.Float64Histogram()
                    e.Send(s.Name+"/mean", mean(h))
                    e.Send(s.Name+"/p99", quantile(h, 0.99))
                }
            }
        }
    }
}

// mean approximates the histogram mean from bucket midpoints.
func mean(h *metrics.Float64Histogram) float64 {
    var total, count float64
    for i, c := range h.Counts {
        if c == 0 {
            continue
        }
        lo, hi := h.Buckets[i], h.Buckets[i+1]
        // The outermost boundaries may be -Inf or +Inf; fall back to the
        // finite edge so the midpoint stays finite.
        if math.IsInf(lo, -1) {
            lo = hi
        }
        if math.IsInf(hi, 1) {
            hi = lo
        }
        total += (lo + hi) / 2 * float64(c)
        count += float64(c)
    }
    if count == 0 {
        return 0
    }
    return total / count
}

// quantile returns the upper bound of the bucket containing quantile q.
func quantile(h *metrics.Float64Histogram, q float64) float64 {
    var total, count float64
    for _, c := range h.Counts {
        total += float64(c)
    }
    if total == 0 {
        return 0
    }
    target := total * q
    for i, c := range h.Counts {
        count += float64(c)
        if count >= target {
            // Use the lower bound when the upper bound is +Inf.
            if hi := h.Buckets[i+1]; !math.IsInf(hi, 1) {
                return hi
            }
            return h.Buckets[i]
        }
    }
    return h.Buckets[len(h.Buckets)-1]
}

What to alert on

Roughly, in priority order:

  1. /memory/classes/total:bytes slope over twenty-four hours.
  2. /sched/goroutines:goroutines absolute value, with a configurable cap.
  3. GC CPU fraction (/cpu/classes/gc/total:cpu-seconds over /cpu/classes/total:cpu-seconds) exceeding ten percent.
  4. /gc/pauses:seconds p99 exceeding ten milliseconds.
  5. /sync/mutex/wait/total:seconds rate-of-change exceeding a threshold.

Per-shard steady-state separation

A multi-tenant service that shares all resources across all tenants has a single steady-state. One bad tenant pushes the whole service out of equilibrium. Per-shard separation breaks that coupling.

Concrete pattern

A queue per shard, a worker pool per shard, a semaphore per shard. Tenants are mapped to shards by a stable hash of the tenant key (a consistent hash if the shard count can change at runtime). Each shard's resources are bounded independently. The pathological tenant is contained inside one shard; the other N-1 shards are unaffected.

type Shard struct {
    Queue     chan Job
    Workers   *WorkerPool
    Semaphore *semaphore.Weighted
    Stats     ShardStats
}

type Service struct {
    Shards []*Shard
}

func (s *Service) Dispatch(j Job) {
    idx := hashShard(j.Key) % len(s.Shards)
    s.Shards[idx].Submit(j)
}

Trade-offs

Per-shard separation costs (a) more resources per node (each shard has its own pool), and (b) the loss of load smoothing across shards. It buys (a) blast-radius containment, (b) per-shard observability, and (c) the ability to roll-restart one shard at a time.

For services with hostile tenant patterns (sudden bursts from one tenant, asymmetric workloads), the trade is worth it. For uniform-load services, a single shared pool is fine.


The runbook

When the on-call alert fires "service is drifting out of steady-state," the runbook should already be open.

Step 1: classify the drift

  • Heap up? Goroutines up? FDs up? Latency up? Each maps to a different first diagnosis.
  • Look at the timeline: when did the slope start? Correlate with deploys, traffic changes, upstream changes.

Step 2: capture diagnostics

  • curl -s "$PPROF_URL/debug/pprof/heap" > heap.pb.gz (where $PPROF_URL is a placeholder for the pod's pprof listener)
  • curl -s "$PPROF_URL/debug/pprof/goroutine?debug=2" > goroutines.txt
  • ls /proc/$PID/fd | wc -l (or equivalent metric)
  • Snapshot runtime/metrics to a file.

Step 3: mitigate

  • Roll back the most recent deploy if the drift started after it.
  • Restart the most affected pod to buy time.
  • If multi-shard, isolate the worst shard.

Step 4: diagnose

  • Compare two heap snapshots from before and during the drift: go tool pprof -base before.pb.gz now.pb.gz.
  • Compare goroutine profiles likewise.
  • Identify the top growing consumer.

Step 5: fix

  • Apply the smallest possible fix.
  • Add an alert that would have caught this earlier.
  • Add a test that exercises the fix.

Step 6: write the post-mortem

  • The incident is not over until the post-mortem is written. The lessons here are not theoretical — they exist because somebody wrote them.

War story 6 — The reverse-proxy cache that ate every host

The system

A reverse proxy in front of a federation of HTTP backends. It maintains a small per-host connection cache and a small per-host TLS session cache. Total backends: about forty. Expected behaviour: about eighty TCP connections (two per backend on average) and forty cached TLS sessions.

The incident

Six weeks after deploy, one pod's memory has climbed from 600 MiB to 2.2 GiB. The pod has not been killed yet but is heading for OOM by the weekend.

Diagnosis

A heap snapshot shows the top growing consumer is crypto/tls.(*Conn).readHandshake. The TLS session cache, which was supposed to hold forty entries (one per backend), holds 4,200.

The cache key was supposed to be host:port. But the upstream resolves to several IPs per hostname (multi-AZ deployment). When the proxy talks to "backend-a.example.com:443", it might end up on any of three IPs. The TLS cache key was actually the resolved IP, not the hostname. New IP, new cache entry.

Multiply: 40 hostnames × ~3 IPs × DNS-driven churn (each IP is replaced roughly once a day as the upstream's load balancer adjusts). Over six weeks, that is about 35 rounds of replacement across ~120 addresses, and the cache had seen 4,200 distinct IP:port pairs.

The fix

Key the cache by hostname, not IP. Small code change; the cache stabilised at forty entries within minutes of the deploy.
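The difference between the two keying schemes can be shown in a few lines. This is a sketch, not the proxy's code: sessionKey and distinctKeys are hypothetical helpers, and the addresses are documentation-range examples.

```go
package main

import (
    "fmt"
    "net"
)

// sessionKey builds the cache key from the logical hostname and the port of
// the dialed address, ignoring which IP the name happened to resolve to.
func sessionKey(hostname, dialedAddr string) (string, error) {
    _, port, err := net.SplitHostPort(dialedAddr)
    if err != nil {
        return "", err
    }
    return net.JoinHostPort(hostname, port), nil
}

// distinctKeys counts how many cache entries a sequence of dials creates
// under each keying scheme.
func distinctKeys(hostname string, dialed []string, byHostname bool) int {
    seen := map[string]struct{}{}
    for _, addr := range dialed {
        key := addr
        if byHostname {
            k, err := sessionKey(hostname, addr)
            if err != nil {
                continue // skip malformed addresses in this demo
            }
            key = k
        }
        seen[key] = struct{}{}
    }
    return len(seen)
}

func main() {
    dialed := []string{"192.0.2.10:443", "192.0.2.11:443", "198.51.100.7:443", "192.0.2.10:443"}
    fmt.Println("keyed by IP:      ", distinctKeys("backend-a.example.com", dialed, false))
    fmt.Println("keyed by hostname:", distinctKeys("backend-a.example.com", dialed, true))
}
```

Keying by hostname bounds the key space by the number of backends. As a second line of defence, a capacity-bounded cache such as the one returned by crypto/tls's NewLRUClientSessionCache keeps any future keying mistake from becoming unbounded growth.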

Lessons

  • Cache keys must reflect the lifetime of the cached object. An IP that changes hourly cannot be the key for an object that should live for the lifetime of the process.
  • Caches are leaks in disguise. A cache with an unbounded key space is a memory leak with extra steps.
  • TLS sessions are not free. Each one is a few kilobytes; thousands of them add up.

War story 7 — The graceful-shutdown that wasn't

The system

A background job processor. On SIGTERM, it stops accepting new jobs, waits for in-flight to finish, then exits. The wait is bounded by a thirty-second deadline.

The incident

Deploys are working but each pod shows about thirty seconds of error-rate spike at the end of its lifetime. Investigation reveals the pod is hitting SIGKILL from Kubernetes after thirty seconds — exactly the grace period.

Diagnosis

The graceful shutdown sequence:

  1. SIGTERM received.
  2. Stop pulling from Kafka.
  3. Wait for in-flight jobs.
  4. Close database connections.
  5. Exit.

The bug: step 2 stopped pulling, but the consumer's poll loop had a 100 ms idle wait that ran a select on ctx.Done. If the loop was inside processBatch when the signal fired, it kept running until the batch finished, then returned to the select. A slow batch (downstream Postgres locked, an outlier) could keep the pod processing for thirty seconds — exactly when Kubernetes' grace period expired.

The pod was killed mid-batch. The half-committed transactions were rolled back. The error rate spiked. The next pod re-processed the same messages.

The fix

Two changes:

  1. Add a per-batch deadline. Each batch is now budgeted at five seconds, not "until done." If a batch exceeds the budget, it is aborted; messages are re-delivered (their Kafka offsets are not committed).
  2. Raise the Kubernetes terminationGracePeriodSeconds from 30 to 60 to give the pod headroom.

Lessons

  • Steady-state must include shutdown. A pod that drifts toward "won't shut down in time" is a steady-state violation, even if memory and goroutines look fine.
  • Per-iteration deadlines are not the same as cumulative deadlines. A loop that runs each iteration for "up to one minute" can take a long time to finish if many iterations are pending.

War story 8 — The deadline-less internal call

The system

A monolith broken into many internal microservices. Service A calls service B, B calls C, C calls D, D calls a database. Most calls were tagged with context.WithTimeout, but one wasn't.

The incident

p99 latency for service A climbs from 80 ms to 4 seconds over twelve hours. p50 is unchanged. The pattern: A's tail follows D's database tail; D's database is having a bad day (a long-running query).

Diagnosis

A's call to B had a 500 ms timeout. B's call to C had a 400 ms timeout. C's call to D had a 300 ms timeout. D's call to the database had no timeout.

When the database was slow, D waited. C's 300 ms timer expired and C returned an error; but in D, the database call was still running until the database returned (or D was OOM'd). The goroutine inside D was leaking — slowly, one per slow-query event.

The fix

Add an explicit timeout to every external call, no exceptions. Better yet, audit the codebase with a static analyser that flags db.QueryContext(context.Background(), ...) or any call where the context is not derived from the request.

Lessons

  • Deadline propagation is everyone's job. A single un-deadlined call corrupts the whole pipeline.
  • Static analysis is cheap. A grep-able rule "never use context.Background() in handlers" catches this class of bug.
  • Goroutine drift is a symptom, not the disease. The disease was the missing deadline; the symptom was the goroutines waiting forever.

Production tooling stack

A list of the tools we have found indispensable for steady-state engineering.

Build-time

  • go vet with custom analysers for context propagation and timer usage.
  • golangci-lint with errcheck, bodyclose, noctx, and rowserrcheck enabled.
  • govulncheck for known vulnerabilities (security overlaps with steady-state).

Runtime telemetry

  • runtime/metrics for Go-specific signals.
  • prometheus/client_golang or opentelemetry-go for exporting.
  • net/http/pprof on a localhost listener for ad-hoc diagnostics.

Dashboards

  • Grafana or Datadog with one dashboard per service.
  • A common service-health dashboard pinned across all services.

Alerting

  • Alertmanager or PagerDuty for routing.
  • Slope alerts using deriv() or rate() over six-hour windows.
  • Each alert linked to a runbook in a shared wiki.

Chaos

  • testcontainers-go for spinning up dependencies in tests.
  • A custom chaos harness (per the senior page) that runs nightly.

Profiling

  • pprof for heap, goroutine, allocation, mutex, block.
  • go tool trace for execution traces, very useful for scheduling and GC inspection.
  • bcc (Linux) for kernel-level latency at the syscall boundary.

A diagnostic decision tree

When the alert "service is drifting" fires, the diagnosis flow:

Alert fires.
├─ Was there a deploy in the last 6 hours?
│  ├─ Yes: suspect the deploy. Roll back; observe.
│  └─ No:  continue.
├─ Which resource is drifting?
│  ├─ Heap: take two pprof heap snapshots; diff.
│  ├─ Goroutines: take goroutine?debug=1; identify the stuck stack.
│  ├─ FDs: ls /proc/$PID/fd; identify the leaking type.
│  ├─ Latency p99: pull traces from the slow tail.
│  └─ Pool wait: check the resource saturation; size up or fix slow consumer.
├─ Is the leak localised to one shard, one tenant?
│  ├─ Yes: investigate the per-shard / per-tenant pathology.
│  └─ No:  global change; suspect a code change or upstream change.
└─ Mitigation: restart, roll back, or scale up. Then diagnose root cause.

The flow is intentionally short. Diagnosis should take minutes, not hours. The chaos harness, the dashboards, and the runbooks are designed to make this flow fast.


What changes at scale

One pod, ten requests per second

Steady-state engineering is overkill. A leak rate of one MB per minute reaches a two-gigabyte limit at about hour 33. Deploy daily and every restart resets the clock; you never see the leak.

Ten pods, hundred requests per second

Steady-state becomes worth doing but the cost-benefit is moderate. Most teams adopt the basics (bounded queues, sized pools) and skip the advanced patterns.

Hundred pods, ten thousand requests per second

Steady-state becomes mission-critical. Every pod has the basics; the fleet has chaos harnesses and saturation dashboards. The team has a runbook per alert.

Thousand pods, hundred thousand requests per second

Steady-state is institutional infrastructure. Dedicated SRE-style support, custom platform tools, shared libraries with team-wide defaults. The "steady-state library" is itself maintained as a first-class internal product.

Ten thousand pods, million requests per second

Now steady-state is at the platform level, not the service level. Each service inherits steady-state guarantees from a shared framework. Individual teams cannot regress; the framework prevents it (via mandatory metrics, mandatory pool sizes derived from a registry, mandatory chaos tests in CI).

The pattern: as scale grows, steady-state moves from "developer discipline" to "platform guarantee." Both are valid; the right one depends on where you are.


Steady-state and team culture

The technical patterns above are necessary but not sufficient. The team culture matters as much.

Cultures that support steady-state

  • Engineers carry the pager for their own services.
  • Incidents are reviewed in blameless post-mortems.
  • Time is allocated for reliability work, not just feature work.
  • Dashboards are reviewed regularly, not only during incidents.
  • Junior engineers shadow on-call.

Cultures that undermine it

  • The pager rotation is "always someone else's problem."
  • Incidents are blamed on individuals.
  • Reliability work is "nice-to-have," features are "must-have."
  • Dashboards exist but no one looks at them between incidents.
  • Junior engineers never see production.

You can write the best steady-state code in the world; in the wrong culture, it will rot. The technical and cultural sides have to be paired.


A retrospective on five years of production Go

A senior engineer reflecting on five years of Go in production might say:

  • The hardest bug is the one that takes six days to surface. Make leak detectors aggressive enough to find it in CI.
  • The easiest bug is the obvious one (missing defer, unbounded channel). Tools catch these; codify the tools.
  • The most surprising bug is always at a boundary: pod-to-pod, service-to-service, process-to-process. Boundaries are where the assumptions diverge.
  • The cheapest win is consistency: every service uses the same defaults for GOMEMLIMIT, MaxOpenConns, queue capacity. New engineers learn one set of patterns and apply them everywhere.
  • The most expensive lesson is that steady-state is a property, not a single feature. It emerges from a thousand small decisions made consistently over years.

If you absorb only one thing from this page: every long-lived resource needs a name, a bound, a metric, and an alert. That principle covers ninety-five percent of what steady-state engineering requires. The remaining five percent is the war stories — and now you have read them.


Closing notes

The pattern across these stories is unromantic: steady-state engineering is prevention. The heroes of the war stories above are the engineers who, after one painful incident, wrote the alert that would have caught it earlier — and then the alert never fires again, because the next engineer knows where the trap is and avoids it.

A service that has been in production for a year without a single steady-state incident is not a service that got lucky. It is a service whose dashboards, alerts, and defaults reflect the accumulated lessons of every team that came before. Steady-state is institutional memory expressed in YAML, in runtime/debug.SetMemoryLimit, in MaxOpenConns, in ConnMaxLifetime, in chaos harnesses that run every night and silently catch the next class of regression before any human sees it.

The mark of a senior production engineer is not that they know all the failure modes — nobody does. It is that they know how to find a new one quickly, fix it, and turn the lesson into a permanent guardrail. Every section of this page is one such guardrail, paid for in incidents.


Patterns from the post-mortems

Across the eight war stories above, several patterns recur. A few extracted as standalone lessons:

Pattern — Lifetime coupling

A surprising number of leaks are lifetime coupling problems: resource X is supposed to live for the lifetime of Y, but Y exits while X is still in flight. The result is X surviving alone, holding the references it had at the moment of Y's exit.

Examples from the stories:

  • Subscriber goroutines (war story 1) outlived the requests that created them.
  • TLS sessions (war story 6) outlived the connection cache entries.
  • Subprocess pipes (war story 5) outlived the process that should have drained them.

The fix in every case: make the lifetime explicit and bounded. Either tie X to Y's lifetime via context.Context, or give X its own bounded lifetime (TTL, idle GC).

Pattern — Implicit assumption that small numbers are fine

Production exposes small numbers to time. "A few hundred milliseconds extra per call" becomes "a multi-minute incident over six weeks." Small constants multiplied by large times produce significant drift.

Examples:

  • 22 ms processing time instead of 20 ms (war story 2) compounds to capacity creep over weeks.
  • 100 ms TLS handshake (war story 3) becomes 2.5 seconds of pool emptiness during failover.
  • One FD leak per ten minutes (war story 5) becomes 4032 FDs over twenty-eight days.

The fix: dimensional analysis. Take any small number and multiply by your service's expected lifetime. If the product is concerning, the number is not small.

Pattern — The hidden cumulative

Several stories featured a metric that looked small instantaneously but cumulated over time:

  • Connection cache entries (war story 6) grew one per DNS-driven IP change.
  • Tenant map entries (war story 7's lesson) grew one per session ID.
  • In-progress transactions (find-bug story 12) grew one per failed handler.

The fix: every metric that is "per X" must have its X bounded. Every cumulative counter needs an eviction policy. Every map needs a size cap or TTL.

Pattern — The deploy boundary as state reset

Several stories were only manageable because deploys happened often enough to reset the drifting state. This works until it doesn't (a holiday freeze, a long incident response, a critical dependency that requires testing before deploy).

The fix: do not rely on deploys to reset state. Engineer the service to be steady-state-correct without periodic restarts.


A more complete inventory of failure modes

Beyond the eight stories above, here is a fuller catalogue, organised by symptom:

Memory grows monotonically

  • Goroutine leak (with stack and closures).
  • Cache without bound.
  • Map without eviction.
  • Slice appended without truncation.
  • Channel buffer accumulating items.
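  • Hand-rolled object pool or free list with a release missing on some path (a missed sync.Pool.Put on its own only raises the allocation rate, because the GC still reclaims the object).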
  • Long-lived structure holding references to short-lived data (via string slicing, slice-of-slice, struct embedding).
  • Finalizer queue growing because finalizers run slowly.

Goroutines grow monotonically

  • Per-request go func() with no exit signal.
  • Goroutine waiting on a channel that is never sent to.
  • Goroutine waiting on a context that is never cancelled.
  • Goroutine in an infinite loop with no escape.
  • Goroutine spawned by a library that doesn't expose its lifecycle.

File descriptors grow monotonically

  • Body.Close() missing on http.Response.
  • rows.Close() missing on sql.Rows.
  • os.File.Close() missing on error paths.
  • Subprocess pipes not drained.
  • net.Listener.Accept() connections not closed on error.
  • inotify instances created and not closed.
  • TCP connections accumulating in CLOSE_WAIT (the peer has closed; the application never did).

p99 latency rises slowly

  • Deadline drift (missing or growing timeouts).
  • Cache miss rate rising (cache eviction policy not matching access pattern).
  • Mutex contention rising (lock not sharded).
  • Goroutine scheduling latency rising (CPU saturated).
  • Allocator pressure rising (GC cycles more frequent, more time spent in mark assist).
  • Database query plan changing (statistics drift).

Throughput drops slowly

  • Worker pool churn (workers restarting frequently).
  • Connection pool saturation (WaitCount rising).
  • Per-tenant semaphore saturation.
  • Rate limiter triggering more often (downstream slower).
  • Background tasks consuming foreground CPU.

Error rate rises slowly

  • Pool exhaustion (ErrPoolFull, "too many open files", etc.).
  • Downstream timeouts rising.
  • Disk full (logs, traces, ephemeral storage).
  • TLS certificates near expiry.

For each, the diagnosis follows the decision tree above; the fix is covered on the middle or senior page, depending on the layer.


The "graceful degradation curve"

A mature steady-state service has a degradation curve: as load rises, latency, throughput, and error rate move along a known shape. Three phases:

Phase 1 — Linear scaling

Load doubles, throughput doubles, latency unchanged. Resources are not saturated. This is the regime in which most testing happens.

Phase 2 — Sub-linear scaling

Load doubles, throughput grows but less than doubles, latency grows. One resource is saturating. Could be CPU, memory, database, connection pool. In Phase 2, p99 grows non-linearly because queue depth is rising.

Phase 3 — Saturation

Load doubles, throughput is flat or declining, latency is high, errors are rising. The bottleneck resource is fully saturated; queues are at their bounds; shedding has begun.

A well-engineered service has explicit thresholds for each phase. Capacity planning aims for sustained Phase 1 with bursts into Phase 2. Phase 3 is reserved for emergencies, and the system's behaviour in Phase 3 is predictable: it sheds gracefully, returns 503s, drops some work, but does not crash.

Without explicit engineering, services skip from Phase 2 directly to crash. With it, they ride into Phase 3 and recover when load drops.
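
A minimal sketch of the Phase 3 behaviour: a bounded queue that rejects work instead of blocking or growing. Queue, Submit, and ErrOverloaded are hypothetical names; in an HTTP handler the error would typically be translated into a 503.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOverloaded signals that the queue is at its bound and the
// request was shed.
var ErrOverloaded = errors.New("queue full, shedding load")

// Queue is a bounded work queue that rejects instead of blocking.
type Queue struct {
	jobs chan func()
}

func NewQueue(capacity int) *Queue {
	return &Queue{jobs: make(chan func(), capacity)}
}

// Submit enqueues job, or returns ErrOverloaded when the queue is full.
func (q *Queue) Submit(job func()) error {
	select {
	case q.jobs <- job:
		return nil
	default:
		return ErrOverloaded
	}
}

// Depth is the saturation signal to export as a gauge.
func (q *Queue) Depth() int { return len(q.jobs) }

// Drain runs up to n queued jobs and returns how many it ran.
func (q *Queue) Drain(n int) int {
	ran := 0
	for ran < n {
		select {
		case job := <-q.jobs:
			job()
			ran++
		default:
			return ran
		}
	}
	return ran
}

func main() {
	q := NewQueue(2)
	accepted, shed := 0, 0
	for i := 0; i < 5; i++ {
		if err := q.Submit(func() {}); errors.Is(err, ErrOverloaded) {
			shed++
		} else {
			accepted++
		}
	}
	fmt.Printf("accepted=%d shed=%d depth=%d\n", accepted, shed, q.Depth())

	// Load drops: workers catch up and the queue accepts work again.
	q.Drain(2)
	fmt.Println("after drain, submit error:", q.Submit(func() {}))
}
```

Memory is bounded by the queue capacity, so overload costs rejected requests rather than a crash, and the service returns to Phase 1 as soon as the workers drain the backlog.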


What "production-ready" means in this discipline

A checklist for declaring a service production-ready:

  1. GOMEMLIMIT set from cgroup at startup.
  2. Every queue is bounded and has a metric.
  3. Every goroutine spawn site has an explicit exit condition.
  4. Every external call has a context deadline.
  5. Every connection pool has explicit sizing with documented rationale.
  6. Every cache has TTL, size limit, or both.
  7. Saturation dashboard with at least the five core metrics.
  8. Slope-based alert on each saturation metric.
  9. Runbook for each alert, linking to the dashboard.
  10. Chaos harness passes in CI.
  11. Long-runner staging pod has been clean for at least two weeks.
  12. Graceful shutdown tested under load.
  13. Deploy is monitored for one hour with a clean dashboard before marking done.
  14. On-call rotation has at least three engineers familiar with the service.
  15. The post-mortem template is in place for when something goes wrong.

A service that ticks all fifteen is unlikely to surprise you in production. A service that ticks fewer is correspondingly more fragile. The cost of each item is small relative to the cost of an incident; build the muscle of completing the checklist by default.


Closing reflection

Steady-state engineering, at its core, is humility about time. Software written for a five-second test will fail in five days. Software written for a five-day test will fail in five months. The discipline is acknowledging that you cannot test for what you have not run, and building defences that catch what you missed.

The defences are pedestrian: bounded queues, sized pools, slope-based alerts, runbooks, chaos harnesses. None is glamorous; each prevents an incident. In aggregate, they let a small team operate a large service without burning out.

If you read this page once, you have the vocabulary. If you read it twice, you have the patterns. To internalise them, run a service in production for a year, and read this page again every time something drifts. The lessons become permanent only when you have lived a few of them.


War story 9 — The metric exporter that DoS'd itself

The system

A medium-traffic API service that updates about a hundred Prometheus metrics per request. Each metric has labels for status code, method, route, and user agent.

The incident

Over the course of three weeks, the service's metric exporter slowly stops responding. The Prometheus scrape endpoint takes longer and longer to return. Eventually a scrape takes longer than the scrape timeout (10 seconds), and the service's metrics go missing from the monitoring dashboard. The on-call has zero visibility into the service for hours at a time.

Diagnosis

The metric registry grows by one entry per unique combination of (status, method, route, user_agent). The user agent label was the kicker: it included the full UA string of every client. Over three weeks, the service had seen hundreds of thousands of distinct UA strings (bots, scanners, mobile clients with version-pinned UAs).

The registry grew to over a million entries. Each entry was a few hundred bytes. The scrape endpoint had to serialise the whole registry on each scrape: a few hundred megabytes of text, plus the CPU cost of gzip-compressing it. The serialisation took longer than the timeout, so the scrape failed, so the metrics appeared missing.

The fix

Label sanitisation. The user_agent label was replaced with a bucket: chrome, firefox, safari, bot, unknown. Cardinality dropped from hundreds of thousands to five. Scrape latency returned to milliseconds.
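
A sketch of what such a bucketing function might look like. The substring rules below are an assumption for illustration, not the rules the service actually used; real User-Agent parsing has many more cases (Edge and other Chromium-based browsers land in the chrome bucket here).

```go
package main

import (
	"fmt"
	"strings"
)

// bucketUserAgent collapses a raw User-Agent header into one of five
// label values, so the label's cardinality is bounded by construction.
// Order matters: Chrome's UA string also contains "Safari", so chrome
// is checked before safari.
func bucketUserAgent(ua string) string {
	s := strings.ToLower(ua)
	switch {
	case s == "":
		return "unknown"
	case strings.Contains(s, "bot"), strings.Contains(s, "crawler"), strings.Contains(s, "spider"):
		return "bot"
	case strings.Contains(s, "firefox"):
		return "firefox"
	case strings.Contains(s, "chrome"), strings.Contains(s, "chromium"):
		return "chrome"
	case strings.Contains(s, "safari"):
		return "safari"
	default:
		return "unknown"
	}
}

func main() {
	for _, ua := range []string{
		"Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
		"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
		"curl/8.4.0",
	} {
		fmt.Printf("%-8s <- %s\n", bucketUserAgent(ua), ua)
	}
}
```

The important property is the default branch: anything unrecognised maps to unknown, so no client can mint a new label value.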

Lessons

  • Cardinality is a steady-state property. A label with unbounded distinct values is a leak.
  • The metric system is part of your service. Failures in observability are failures in the service.
  • Bucket aggressively. When in doubt, pick a small set of categorical values and force every label into one of them.

War story 10 — The log rotation race

The system

A service that uses lumberjack for log rotation. Configured to rotate logs at 100 MiB or 24 hours, keeping seven rotated files.

The incident

The pod is OOM-killed unexpectedly. Memory grew over twelve hours from 600 MiB to 4 GiB and crossed the cgroup limit. The dashboards showed no obvious leak: heap was small, goroutines stable.

Diagnosis

The leaked memory was off-heap: file buffers held by the kernel. The lumberjack rotation logic had a race: under specific timing, the old log file's os.File was not closed before the new one was opened. Each leaked file descriptor kept the old file open, and the kernel kept that file's contents in its page cache.

Over twelve hours of high log volume, dozens of FDs leaked, each holding up to 100 MiB (the rotation size) of cached log data in kernel buffers. The Go runtime saw no leak; the kernel did.

The fix

Upgrade lumberjack to a version that fixed the race. Add an FD count alert (which would have caught this earlier).

Lessons

  • Not all memory leaks are in Go. Off-heap memory (file buffers, mmap, cgo) can grow even when the Go heap is healthy.
  • The kernel cache is not free. A leaked open file holds its cache; many leaked files hold many caches.
  • FD count is a leading indicator for some leaks that the heap will not show.

War story 11 — The "Friday morning" cron

The system

A monolith with a weekly cron job that compacts old database tables. Runs every Friday at 6 AM. Takes about thirty minutes.

The incident

For about an hour every Friday morning, the service's p99 latency rises sharply. The team initially blames the cron job. They split the cron into a separate process; the latency spike persists.

Diagnosis

The cron job ran in a separate process, but on the same database. The compaction held locks on tables; the live service tried to query them and waited. The locks lasted only milliseconds each but were frequent enough to add up over the compaction window.

The fix

Move the compaction to a read replica. The primary stayed available; locks no longer affected live traffic.

Lessons

  • Shared resources are shared steady-state. Splitting your service does not split the database.
  • Locks are a steady-state hazard. A short lock taken frequently costs as much as a long lock taken once.
  • Read replicas are a backpressure tool. Routing heavy reads or maintenance to replicas keeps the primary in steady-state.

War story 12 — The unbounded retry storm

The system

A microservice mesh: A calls B calls C. All three retry on failure with exponential backoff: three retries, starting at 100 ms.

The incident

Service C has a brief outage (90 seconds). When it recovers, the entire mesh is in a retry storm. Service C is hit with five times the normal traffic because B is retrying its failed calls. Service B is hit with five times the normal traffic because A is retrying in turn. C is overloaded and starts timing out, which triggers more retries, which keeps C overloaded.

The outage of C lasted 90 seconds; the recovery to steady-state took twenty minutes.

Diagnosis

The retry chain compounds: at each hop, the retry count multiplies the effective load. Counting retries alone, three times three times three is twenty-seven for the deepest call; counting the first attempt at each hop as well, the worst case is four times four times four, or sixty-four.
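
The worst-case amplification can be computed directly. This sketch assumes every hop makes one initial attempt plus its configured retries and that every attempt fails; worstCaseAttempts is a hypothetical helper name.

```go
package main

import "fmt"

// worstCaseAttempts returns the maximum number of attempts that reach
// the deepest callee for one original request, given the retry count
// configured at each calling hop. Each hop makes 1 + retries attempts.
func worstCaseAttempts(retriesPerHop []int) int {
	total := 1
	for _, r := range retriesPerHop {
		total *= 1 + r
	}
	return total
}

func main() {
	// A -> B -> C: two calling hops, three retries each.
	fmt.Println("attempts reaching C:", worstCaseAttempts([]int{3, 3}))

	// If C also retries against its own dependency, that is a third hop.
	fmt.Println("attempts reaching C's dependency:", worstCaseAttempts([]int{3, 3, 3}))
}
```

Real traffic rarely hits the worst case, because some attempts succeed, but the bound shows why adding one more retrying hop is never a small change.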

The fix

Three changes:

  1. Retry budgets: each service has a cap on retries per minute, beyond which it gives up and returns the original error.
  2. Token-bucket rate limit at each service ingress, prioritising fresh requests over retries.
  3. Circuit breaker that opens on sustained failure rate, refusing to call downstream until the failure rate drops.

After the fix, the same 90-second outage of C now recovers in 100 seconds, not 20 minutes.

Lessons

  • Retries are multiplicative across hops. The product of retry counts is the effective amplification.
  • Retry budgets are a steady-state primitive. Without them, retry storms are inevitable.
  • Circuit breakers belong at every hop, not just the outermost one.

War story 13 — The well-meaning gauge

The system

A service exporting a per-request gauge: "this request is currently being processed by goroutine X." Useful for debugging.

The incident

A junior engineer adds the gauge with the goroutine ID as a label. Over time, the metric registry grows because each goroutine has a unique ID and labels are never removed.

The metric exporter eventually crosses the scrape timeout, alerts fire, and the team discovers a hundreds-of-megabytes registry.

The fix

Remove the goroutine ID label. The intent of the metric (debugging) was served better by a trace than by a metric. Convert to a trace span, drop the metric.

Lessons

  • Metrics are not for one-off debugging. They are for trends. Use traces or logs for one-off identification.
  • Code review should flag high-cardinality labels. Once in production, they are hard to remove.
  • Junior mistakes are senior responsibility. A well-meaning gauge that no one noticed during review is a process failure.

Patterns across all stories

A consolidated view of the patterns from all thirteen war stories:

| Story                 | Pattern                                | Fix                                         |
|-----------------------|----------------------------------------|---------------------------------------------|
| 1. Six-day OOM        | Per-request goroutines outlive request | Lifetime decoupling, pool with shared queue |
| 2. Queue creep        | Capacity creep from regression         | Slope alerts, Little's Law SLI              |
| 3. Pool fragment      | Connection pool collapses on failover  | Pre-warm, shorter ConnMaxLifetime           |
| 4. Deploy spike       | Cold pod cache miss                    | Readiness gate on cache warm                |
| 5. FD leak            | Subprocess pipes not drained           | Drain on every path                         |
| 6. TLS cache          | Cache key explosion via IP rotation    | Key by hostname, not IP                     |
| 7. Shutdown delay     | Per-batch deadline unbounded           | Per-batch budget                            |
| 8. Deadline-less call | Missing timeout cascades               | Static analysis, audit                      |
| 9. Metric DoS         | Label cardinality explosion            | Bucket labels                               |
| 10. Log rotation      | FD race in third-party library         | FD count alert, upgrade                     |
| 11. Cron lock         | Shared database, weekly lock           | Read replica for heavy work                 |
| 12. Retry storm       | Multiplicative retries across hops     | Retry budgets, circuit breakers             |
| 13. Junior metric     | High-cardinality gauge                 | Code review, traces for debugging           |

The right column is the catalogue of fixes that, applied consistently, produce a steady-state service.


Building the runbook from incidents

Each story above became a runbook entry. The template:

# Alert: [name]
**Symptom:** What does this look like on the dashboard?
**Severity:** Page / Warn / Info
**Likely cause:** Based on past incidents.
**Diagnostic steps:**
  1. Check [metric/dashboard].
  2. If [pattern], suspect [cause].
  3. Otherwise, [next diagnostic].
**Mitigations:**
  - Quick: [restart / scale / roll back]
  - Lasting: [code change / config change]
**Post-incident:**
  - Add alert if missing.
  - Update this runbook.
  - Add chaos test if applicable.

Each runbook is one or two pages. The team's full set is maybe twenty runbooks for a mature service. Curated, indexed, kept up to date — they become institutional memory.


A note on humility

After working through many steady-state incidents, an honest senior engineer admits: most incidents are not exotic. They are recurrences of patterns we have seen before. The unique incident is rare; the variation on a known pattern is the norm.

This is good news. It means steady-state engineering is largely about building defences against known patterns, not about anticipating the unknown. The unknown will surface; when it does, it joins the catalogue.

The pattern catalogue grows slowly. By year five of operating a service, your team's catalogue is probably stable. Most new incidents have a known shape. The on-call diagnosis is fast.

By year ten, the catalogue may be shared across services, codified in shared libraries, taught to new hires in onboarding. The cost of an incident has been amortised across the entire engineering organisation.

That is what mature steady-state engineering looks like: not zero incidents, but incidents that are fast to diagnose, fast to mitigate, and easy to prevent next time.