Soak Testing & Leak Hunting
Run a Go service under steady, unremarkable load for 12–24 hours and watch it die slowly. The bugs that hide in a 5-minute test (a goroutine blocked forever, a map that never evicts, an `*http.Response` body nobody closed) only show up as a curve that won't flatten. Plant them, then hunt them with pprof until the line goes flat.
| Tier | Load-testing (meta-skill) |
| Primary domain | Endurance testing & resource-leak diagnosis |
| Skills exercised | Go runtime & GC, net/http/pprof, heap/goroutine profiling, runtime/metrics, GOMEMLIMIT, fd accounting (lsof), RSS vs heap reasoning, latency-drift detection |
| Interview sections | 1 (Go runtime), 17 (performance), 18 (observability) |
| Est. effort | 3–5 focused days (plus wall-clock soak time) |
1. Context
You own a Go HTTP service that passes every test, sails through a 5-minute load run, and ships. Three days later, on-call gets paged: the pod RSS has climbed from 180 MB to 2.1 GB, the kubelet OOM-kills it, it restarts, and the clock resets. Nobody can reproduce it locally because nobody runs it for three days locally.
This is the class of bug a soak test exists to catch: slow drift. Memory that grows without bound, goroutines that accumulate, file descriptors that leak, GC pauses that creep up, p99 that drifts from 8 ms to 40 ms over six hours. None of it is visible in a short run. All of it is fatal in production.
Your job is to build a soak harness, run a Go service under sustained load for many hours, and deliberately plant the canonical Go leak sources — then find each one with the runtime's own tools. You will produce flat curves and the profiles that prove they're flat, not vibes.
2. Goals / Non-goals
Goals
- Stand up a soak harness: a Go service under steady load with RSS, heap, goroutine count, open-fd count, and GC-pause trend recorded every few seconds for ≥ 12 hours.
- Plant each canonical Go leak (goroutine, fd, unbounded map, unstopped ticker) and detect it from the metric curve, then localize it with pprof.
- Use heap-diff profiling (`pprof -base`) to point at the exact allocation site of a growing map.
- Distinguish a real leak from legitimate steady-state cache growth, and from RSS that reflects heap fragmentation or memory not yet returned to the OS rather than live memory.
- Demonstrate that a fix flattens the curve over a fresh long run.
Non-goals
- Peak-throughput or breakpoint testing: that's 04-capacity-and-breakpoint-testing.
- Injecting infrastructure faults: that's 02-chaos-and-fault-injection.
- Native/cgo leaks (malloc arenas, C libraries). Stay in pure Go; `lsof` for fds is the one OS-level tool you'll need.
- Hunting CPU regressions: this lab is about resources that don't come back, not cycles.
3. Functional requirements
- A target service (`cmd/svc`): a real-ish Go HTTP API (a few endpoints that allocate, call a downstream, touch a cache, open files/sockets) with `net/http/pprof` mounted on a separate admin port and a `/metrics` endpoint.
- A leak registry: each leak is a build/runtime flag (`-leak=goroutine|fd|map|ticker|none`) so a single binary can be run clean or with exactly one planted leak. Leaks must be plausible code, not strawmen.
- A load driver (`cmd/drive`) that holds a fixed, moderate request rate (open-model, not closed-loop) for the full soak duration with a deterministic request mix.
- A sampler (`cmd/sample` or a sidecar script) that scrapes `runtime/metrics`, `/debug/pprof/`, and `lsof` on an interval and appends to a CSV/Parquet time series, plus captures a heap and goroutine profile every N minutes for later diffing.
- A report step that turns the time series into the curves of §5 and the pprof diffs of §10.
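The leak registry above could be wired as a validated flag. This is a minimal sketch, not the lab's required design: the type `leakKind`, the helper `parseLeak`, and the flag-set name `svc` are hypothetical; only the `-leak` flag and its five values come from the requirements.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// leakKind is the single planted leak selected for this run (hypothetical type).
type leakKind string

// validLeaks mirrors the values listed for the -leak flag.
var validLeaks = map[leakKind]bool{
	"none": true, "goroutine": true, "fd": true, "map": true, "ticker": true,
}

// String and Set make *leakKind satisfy flag.Value.
func (k *leakKind) String() string { return string(*k) }

func (k *leakKind) Set(s string) error {
	if !validLeaks[leakKind(s)] {
		return fmt.Errorf("unknown leak %q (want none|goroutine|fd|map|ticker)", s)
	}
	*k = leakKind(s)
	return nil
}

// parseLeak parses args (without the program name) and returns the chosen
// leak, defaulting to "none" so the binary runs clean unless told otherwise.
func parseLeak(args []string) (leakKind, error) {
	leak := leakKind("none")
	fs := flag.NewFlagSet("svc", flag.ContinueOnError)
	fs.Var(&leak, "leak", "planted leak: none|goroutine|fd|map|ticker")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return leak, nil
}

func main() {
	leak, err := parseLeak(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("planted leak:", leak)
}
```

Rejecting unknown values at startup keeps a typo such as `-leak=gorutine` from silently producing a "clean" run that gets mistaken for a fixed one.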
4. Load & data profile
- Duration: one clean baseline run ≥ 12 h (24 h preferred); each leak-hunt run long enough for the curve to be unambiguous — typically 1–4 h, or accelerated by raising request rate so a slow leak surfaces in minutes (state the acceleration factor).
- Load: steady moderate rate — e.g. 200–500 req/s — deliberately below the service's ceiling so any drift is a leak, not saturation. Open-model: the driver sends at a fixed rate regardless of how fast the service responds.
- Request mix: deterministic given a seed; e.g. 70% reads (cache hit/miss mix), 20% writes, 10% a downstream-call endpoint that opens a socket.
- Cache realism: the service has a legitimate bounded cache that warms over the first ~20 min then plateaus — present specifically so you must tell true steady-state growth apart from a leak.
- Sampling cadence: resource metrics every 5–10 s; full heap + goroutine profile every 5 min (cheap, and diffable).
5. Non-functional requirements / SLOs
The whole lab is "make these curves flat." Targets for the clean service over an N-hour run (N ≥ 12):
| Metric | Source | Target over N hours |
|---|---|---|
| Goroutine count | runtime/metrics /sched/goroutines:goroutines | Flat — bounded, returns to baseline ±5% between load waves; no monotonic rise |
| Live heap (HeapAlloc / /memory/classes/heap/objects:bytes) | runtime/metrics | Sawtooth around a flat mean; mean drift < 5% over the run |
| Process RSS | /proc/<pid>/status VmRSS | Plateaus after warm-up; no sustained upward slope; gap vs live heap explained (frag / not-returned-to-OS) |
| Open file descriptors | lsof -p <pid> count / /proc/<pid>/fd | Flat — bounded by pool sizes; no monotonic climb |
| GC pause (p99 pause_ns) | runtime/metrics /gc/pauses:seconds histogram | Stable; p99 pause < 5 ms and not trending up as the run continues |
| GC frequency / CPU fraction | /gc/cycles/total:gc-cycles, /cpu/classes/gc/total:cpu-seconds | Steady cycles-per-minute; GC CPU fraction flat |
| Request p99 latency | service histogram | No drift — p99 at hour 12 within 10% of p99 at hour 1 |
The number that matters isn't the absolute value — it's the slope. A flat line over 12 hours is the deliverable; a rising line is the bug.
6. Architecture constraints & guidance
- Pure Go target: a `net/http` service with `net/http/pprof` imported and mounted on an admin-only port (never the public one). Pin the Go version, because GC behavior changes across releases.
- Read metrics from `runtime/metrics` rather than the older `runtime.MemStats` fields wherever a `runtime/metrics` equivalent exists (`runtime.ReadMemStats` also stops the world to collect its snapshot). `runtime/metrics` is the stable, histogram-aware API: goroutine count, GC pause distribution, heap class breakdown, GC CPU.
- `GOMEMLIMIT` is part of the experiment surface, not just config: you'll run with it set and unset and observe the GC's response near the limit.
- Run the service in a container with a memory limit (e.g. `--memory=512m`) so OOM is a real outcome you can trigger and observe, not a hypothetical.
- Keep the load driver on a separate host/process from the service so the driver's own allocations never contaminate the service's profiles.
- Store every profile (`heap.<t>.pb.gz`, `goroutine.<t>.txt`) with a timestamp so any two can be diffed after the fact.
7. Data model (the soak time series)

```text
sample (every 5–10s):
  ts                int64    // unix ms
  goroutines        int
  heap_alloc_bytes  uint64   // live objects
  heap_sys_bytes    uint64   // reserved from OS (heap)
  rss_bytes         uint64   // VmRSS from /proc
  open_fds          int      // count of /proc/<pid>/fd
  gc_pause_p99_ns   uint64   // from /gc/pauses histogram
  gc_cycles         uint64   // monotonic counter
  next_gc_bytes     uint64   // GC trigger target (vs GOMEMLIMIT)
  req_p99_ms        float64  // service-reported
profiles (every 5 min):
  heap.<ts>.pb.gz       // go tool pprof -base heap.<t0> heap.<t1>
  goroutine.<ts>.txt    // ?debug=2 full stacks for blocked-goroutine grouping
```

The diff between two heap profiles (`-base`) and between two goroutine dumps is where localization happens; the scalar time series is where detection happens.
8. Interface contract
- `GET /healthz` → `200`
- `GET /api/...` → the workload endpoints (read / write / downstream-call)
- Admin port:
  - `GET /debug/pprof/heap` → heap profile (`?gc=1` to force GC first for a clean live-heap reading)
  - `GET /debug/pprof/goroutine?debug=2` → full goroutine stacks
  - `GET /debug/pprof/allocs` → allocation profile
  - `GET /metrics` → Prometheus exposition incl. mirrored `runtime/metrics`
- Flags: `-leak={none|goroutine|fd|map|ticker}`, `-rate`, `-cache-size`, `-gomemlimit` (or env `GOMEMLIMIT`), `-accel` (rate multiplier).
9. Key technical challenges
- Leak vs steady-state cache growth. A warming bounded cache also makes the heap line go up, for a while. You must show it plateaus (and prove the eviction works) versus a true leak that never plateaus. The tell is the second derivative over a long enough window, plus a heap diff that keeps pointing at the same growing object.
- RSS vs live heap. RSS can stay high while the live heap shrinks: freed memory may be `MADV_DONTNEED`'d lazily, and heap fragmentation reserves spans the allocator won't release. A rising RSS with a flat live heap is not a Go leak; it's the runtime not returning memory to the OS yet, and you must say so rather than chase a phantom.
- Goroutine leaks are silent. A goroutine blocked on a channel send/receive or a never-cancelled `context` consumes a few KB of stack and never shows in heap-allocation profiles as "live" the way you'd expect. You find it in the goroutine profile, grouped by blocking stack, with a count that only rises.
- fd leaks are invisible to pprof. An unclosed `resp.Body` or `*sql.Rows` leaks a socket/fd that the Go heap tools can't see at all; only `lsof` / `/proc/<pid>/fd` reveals it, and it ends in `EMFILE: too many open files`.
- Latency drift has many causes. Drifting p99 might be GC pressure from a growing heap, lock contention that worsens as a map grows, or the OS reclaiming pages. You must attribute it, not just observe it.
10. Experiments to run (break it / tune it)
For each: name the curve that detects it, the tool that localizes it, and the before/after once fixed.
- Goroutine leak: forgotten context cancel. An endpoint spawns a worker that selects on a channel and a `ctx.Done()` that's never triggered (caller forgot `defer cancel()`). Measure: `/sched/goroutines` climbing linearly with request count. Localize: `goroutine?debug=2`, group by blocking stack; thousands are parked at the same `select` (or `chan receive`) frame. Fix: propagate cancellation; show the count returns to baseline between waves.
- Heap-diff a growing map. A request-keyed map (e.g. per-request entry cached "for later") is written but never deleted. Measure: live heap mean drifting up, never plateauing. Localize: `go tool pprof -base heap.t0 heap.t1` → the growth concentrates at one `map` assignment line. Fix: bound it (LRU/TTL); prove the diff is now empty and the mean is flat.
- fd leak: unclosed response body. The downstream-call endpoint does `http.Get` but never `resp.Body.Close()`, so the connection/fd leaks (and the keep-alive pool can't reuse it). Measure: `lsof -p <pid> | wc -l` climbing; eventually `EMFILE`. Localize: confirm with `lsof` showing accumulating sockets in `CLOSE_WAIT`/`ESTABLISHED`; correlate with the endpoint. Fix: `defer resp.Body.Close()` (+ drain); show fd count flat and pool reuse.
- Unstopped `time.Ticker`. A handler creates `time.NewTicker` per request without `Stop()`. Measure: heap creep from the retained tickers, plus goroutine count when the handler also leaves a goroutine receiving on `ticker.C`. Localize: the goroutine dump shows those receivers parked on the same stack, and the heap diff shows growth under `time.NewTicker`. Fix: `defer ticker.Stop()` and end the receiving goroutine; flat curves. Pin the Go version for this one: since Go 1.23 an unreferenced ticker is garbage-collected even without `Stop()`, so the plant needs a live reference (such as that goroutine) to leak.
- GC pause trend under a slowly growing heap. Run leak #2 and watch the `/gc/pauses` p99 and GC frequency as the live heap grows. Measure: pause p99 and GC CPU fraction trending up as heap size rises. Show: the pause regression is a symptom of the leak, and disappears when the heap is bounded.
- `GOMEMLIMIT` on a near-OOM service. Take a service whose heap grows toward the container's 512 MB limit. Run (a) no `GOMEMLIMIT` → GC stays lazy, RSS marches to the cgroup limit, kernel OOM-kills it; (b) `GOMEMLIMIT=450MiB` → GC gets aggressive near the soft limit, `next_gc` clamps, the process survives but burns more GC CPU (and can enter a GC death-spiral if truly out of room). Measure: `next_gc_bytes` vs limit, GC CPU fraction, survival vs OOM-kill. Explain the trade-off.
- Prove the fix flattens RSS over a long run. After fixing leak #2, do a fresh ≥ 12 h run. Deliver: the RSS and live-heap curves side by side, both plateaued, with the heap-diff between hour 1 and hour 12 essentially empty.
- Cache vs leak discrimination (control). Run with the legitimate bounded cache and no planted leak. Show the heap rises during the ~20 min warm-up then plateaus, eviction counter is non-zero, and the hour-1→hour-12 heap diff is empty — the curve that looks like a leak for 20 minutes but isn't.
11. Milestones
- Target service + `net/http/pprof` + `/metrics` (mirroring `runtime/metrics`); load driver holding a steady rate; sampler writing the time series.
- First clean baseline run; produce the §5 curves; confirm they're flat (or find an accidental leak, which is common and a good sign the harness works).
- Plant + detect + localize leaks #1–#4; one findings entry per leak with the detecting curve and the pprof/`lsof` evidence.
- GC + `GOMEMLIMIT` experiments (#5, #6); the near-OOM trade-off write-up.
- Fix-and-flatten long run (#7) and the cache-vs-leak control (#8); final report.
12. Acceptance criteria (definition of done)
- A ≥ 12 h clean run with flat goroutine, fd, and live-heap curves — plotted, with the RSS-vs-heap gap explained.
- Each of the four planted leaks: detected from a named metric curve and localized to the responsible code via pprof (`-base` heap diff or grouped goroutine dump) or `lsof`.
- The growing-map leak localized by a `pprof -base` diff that names the allocation line; after the fix, the same diff is empty.
- The fd leak confirmed by `lsof` showing accumulating sockets and an `EMFILE`; after the fix, fd count is flat.
- `GOMEMLIMIT` experiment showing the on/off difference near the container limit (survival vs OOM-kill, `next_gc` clamping, GC-CPU cost) with numbers.
- A documented method for telling steady-state cache growth from a leak, with the control run as evidence.
- Every curve and profile reproducible from a committed command + seed.
13. Stretch goals
- Continuous profiling (Pyroscope/Parca): keep heap + goroutine profiles for the whole soak and scrub back to the moment a curve bent.
- `sync.Pool` misuse: show a pool that retains oversized buffers inflating RSS, and the per-GC-cycle pool drain that complicates "live heap" reasoning.
- Finalizer pitfalls: plant a `runtime.SetFinalizer` cycle/ordering bug that delays reclamation; observe the heap-class effect.
- `MADV_DONTNEED` vs `MADV_FREE`: compare `GODEBUG=madvdontneed=1` against `GODEBUG=madvdontneed=0` and show the RSS-return-to-OS difference under the same workload (on Linux, Go 1.16 and later default to `MADV_DONTNEED`, so the `=0` run is the one that changes behavior).
- Automated leak gate: a CI job that runs a short accelerated soak and fails the build if the goroutine or heap slope exceeds a threshold (linear regression on the time series).
14. Evaluation rubric
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Leak detection | Spots a rising curve and names the leak class | Builds the harness so any of the four classes is caught automatically; quantifies the slope |
| Localization | Uses pprof to find the goroutine/allocation | Reads a -base heap diff and grouped goroutine dump fluently; pinpoints the line and explains the retention path |
| RSS vs heap | Knows RSS ≠ live heap | Explains fragmentation + lazy return-to-OS, proves which one a given gap is, and won't chase a phantom |
| GC understanding | Reads GC pause/frequency from runtime/metrics | Predicts the pause trend from heap growth; reasons about GOMEMLIMIT/next_gc near a hard limit and the death-spiral risk |
| fd discipline | Knows unclosed bodies leak fds | Catches it via lsof, ties it to keep-alive pool reuse, and prevents the whole class |
| Cache vs leak | Suspects growth might be a cache | Proves plateau vs unbounded growth with a control and second-derivative reasoning |
| Communication | Curves + a findings note | Could defend "this line is flat, here's why, here's the profile" to a staff panel |
15. References
- Go docs: `runtime/metrics`, `net/http/pprof`, `runtime/pprof`; the `GOMEMLIMIT` / soft-memory-limit design doc and `GODEBUG=gctrace=1`.
- `go tool pprof`: heap profiling and `-base` differential profiling; `?gc=1` and `?debug=2` query params.
- The Go Memory Model and the runtime GC pacer notes (for pause/`next_gc` reasoning).
- `lsof` / `/proc/<pid>/fd` and `/proc/<pid>/status` (VmRSS) for OS-level fd and RSS accounting.
- Continuous-profiling: Pyroscope / Parca / Google-Wide Profiling paper.
- See also: `Interview Question/01-golang-language-and-runtime/` (goroutines, GC, scheduler, memory) and `Interview Question/18-observability/` (profiling, metrics, drift detection).