Go Memory Management & Zero-Allocation
Take one hot request handler in a high-RPS Go service and drive it toward zero allocations per request — then learn exactly what the garbage collector costs you when you can't. Find where memory comes from, bound the GC's CPU and pause budget, and prove the p99 improvement across a fleet.
| Tier | Load-testing (perf craft) |
| Primary domain | Go runtime / memory & GC |
| Skills exercised | Escape analysis, sync.Pool, GOGC/GOMEMLIMIT, tricolor GC + write barriers + assist, struct packing, off-heap/arenas/mmap, runtime/metrics, pprof -alloc_objects, gctrace |
| Interview sections | 1 (Go language & runtime), 17 (performance engineering), 2 (concurrency) |
| Est. effort | 3–5 focused days |
1. Context
You own the hottest endpoint in a Go service: a request handler that looks up an item in an in-memory index and serializes a response. At ~5k RPS per pod it's fine. At 60k RPS across the fleet, the p99 latency develops a sawtooth — a spike every few hundred milliseconds — and CPU sits 25% higher than the handler's own work explains. The flame graph blames runtime.mallocgc, runtime.gcBgMarkWorker, and runtime.gcAssistAlloc. Translation: you are allocating too much, the GC is running constantly, and request goroutines are being conscripted to help it.
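Separately, the team that runs the in-memory index has the opposite problem. Their pod holds a 40 GB live working set at a trickle of requests, and every GC cycle has to mark the whole heap. Go's mark phase runs concurrently with the application, so the stall is unlikely to be one long stop-the-world pause; the more likely mechanism is a multi-second mark that competes for CPU, plus assists and the short stop-the-world phases around it. Together they delay requests long enough to trip the readiness probe, and the pod gets killed during what should be a quiet period.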
Your job is to characterize where memory comes from in Go, drive the hot path's allocations toward zero, bound the GC's cost on both a churn-heavy and a huge-heap workload, and produce numbers — allocs/op, GC pause p99, GC CPU% — that you can defend. You will produce evidence, not vibes about "GC pressure."
2. Goals / Non-goals
Goals
- Account for every allocation on the hot path and explain why each one escapes to the heap (go build -gcflags=-m), then eliminate the ones you can.
- Drive a representative hot handler to ≤ 1 alloc/op (ideally 0) and show the GC-CPU and p99 effect of getting there.
- Bound GC behavior on two opposite workloads: a huge live heap (pause duration, GOMEMLIMIT near the ceiling) and a high-churn small heap (GC CPU%, assist, sync.Pool win).
- Explain the tricolor mark-sweep GC, write barriers, and assist well enough to predict what a given workload will do to pause and CPU before you measure it.
- Use sync.Pool correctly, and demonstrate at least one case where it loses.

Non-goals
- Rewriting the service in another language or going full off-heap as the default. Off-heap/arenas are a measured experiment, not the recommendation.
- Tuning the allocator's size classes or patching the runtime. You tune behavior through GOGC, GOMEMLIMIT, allocation shape, and pooling, not runtime forks.
- CPU-bound algorithmic optimization (that's the profiling-guided lab). Here the enemy is memory traffic and the GC, not the math.
3. Functional requirements
- A hot handler (cmd/svc) serves GET /lookup?key=…: find a record in an in-memory index and write a JSON (and a binary) response. This is the path you drive to zero allocations.
- An index loader (cmd/loadindex) builds the in-memory working set from generated data, sized by a flag so the same binary runs Stage 0 (MB) and Stage 1/3 (tens of GB).
- The handler exposes two serialization paths, switchable by flag: a naive one (encoding/json into fresh buffers, fmt, string concatenation) and an optimized one (pooled buffers, []byte APIs, no interface boxing, strconv.AppendInt-style append). State which is active in every result.
- An alloc-accounting endpoint (GET /debug/allocs) reads runtime/metrics and reports live heap, total bytes allocated, GC cycle count, GC CPU fraction, and last/p99 pause, so the harness can read GC state without parsing logs.
- A knob surface: GOGC, GOMEMLIMIT, pool on/off, serialization path, and index size are all configurable per run and recorded in the findings.
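A minimal sketch of how the alloc-accounting endpoint could sample runtime/metrics. The gcSnapshot type and readGCSnapshot function are hypothetical names, not part of the lab's code; the three metric names are standard runtime/metrics keys, and a real endpoint would add the pause histogram and GC CPU counters alongside them.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// gcSnapshot is a hypothetical subset of the /debug/allocs payload.
type gcSnapshot struct {
	TotalAllocBytes uint64 // cumulative bytes allocated on the heap
	HeapObjectBytes uint64 // bytes currently occupied by heap objects
	GCCycles        uint64 // completed GC cycles
}

// sink keeps the demo allocation reachable so the compiler cannot elide it.
var sink []byte

// u64 returns the sample value, or 0 if this Go version lacks the metric.
func u64(v metrics.Value) uint64 {
	if v.Kind() == metrics.KindUint64 {
		return v.Uint64()
	}
	return 0
}

// readGCSnapshot reads all counters in a single metrics.Read call.
func readGCSnapshot() gcSnapshot {
	samples := []metrics.Sample{
		{Name: "/gc/heap/allocs:bytes"},
		{Name: "/memory/classes/heap/objects:bytes"},
		{Name: "/gc/cycles/total:gc-cycles"},
	}
	metrics.Read(samples)
	return gcSnapshot{
		TotalAllocBytes: u64(samples[0].Value),
		HeapObjectBytes: u64(samples[1].Value),
		GCCycles:        u64(samples[2].Value),
	}
}

func main() {
	before := readGCSnapshot()
	sink = make([]byte, 1<<20) // allocate 1 MiB so the counter visibly moves
	after := readGCSnapshot()
	fmt.Printf("allocated during window: %d bytes, gc cycles so far: %d\n",
		after.TotalAllocBytes-before.TotalAllocBytes, after.GCCycles)
}
```

Reading the counters as deltas between two snapshots, rather than as absolute values, is what lets the harness compute allocs/sec for a measured window.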
4. Load & data profile
- Index dataset: cmd/gen produces records keyed by uint64 id with a payload (a struct of ~8 fields + a small []byte). Sizes: 64 MB (Stage 0), 40 GB live (Stage 1/3). Deterministic given a seed.
- Record shape matters: the struct is deliberately mis-packed in the baseline (a bool between two int64s, etc.) so field reordering is a real, measurable win. Report unsafe.Sizeof before/after.
- Key distribution: lookups are Zipfian (s≈1.1) over the id space, so a hot subset dominates. This keeps the resident working set small even when the live heap is huge, which is exactly what stresses the GC's scan cost vs. the cache.
- String/interface bait: the naive path forces interface{} boxing (e.g. logging any, fmt.Sprintf) and string(b) / []byte(s) round-trips, so you can show the copy and the escape they cause.
- Traffic model: open-model load (fixed send rate, not closed-loop), so GC-induced stalls show up as a latency signal (coordinated-omission-correct) rather than silently throttling the offered rate.
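The Zipfian key stream described above can be sketched with the standard library's math/rand.NewZipf. The newKeyGen helper, the seed, and the id-space size are illustrative assumptions; only the exponent s≈1.1 comes from the profile.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newKeyGen returns a deterministic Zipfian key source over [0, n).
// Hypothetical helper: the lab's generator may differ.
func newKeyGen(seed int64, n uint64) func() uint64 {
	r := rand.New(rand.NewSource(seed))
	z := rand.NewZipf(r, 1.1, 1, n-1) // s=1.1, v=1, values in [0, n-1]
	return z.Uint64
}

func main() {
	const n = 1_000_000
	const draws = 100_000
	next := newKeyGen(42, n)
	hot := 0
	for i := 0; i < draws; i++ {
		if next() < 100 { // count hits on the 100 lowest (hottest) ids
			hot++
		}
	}
	fmt.Printf("%.1f%% of lookups hit the 100 hottest keys\n", 100*float64(hot)/draws)
}
```

Because the source is seeded, two runs with the same seed replay the same key sequence, which keeps before/after comparisons honest.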
5. Non-functional requirements / SLOs
Measured on the optimized path unless stated; baseline is the control.
| Metric | Target |
|---|---|
| Hot-handler allocations | ≤ 1 alloc/op (0 is the goal) on the optimized path, proven by testing.B -benchmem and a steady-state pprof -alloc_objects |
| GC CPU fraction at Stage 2 | < 5% of CPU (runtime/metrics /cpu/classes/gc/total:cpu-seconds over /cpu/classes/total:cpu-seconds, available from Go 1.20); baseline will be much higher, so report the delta |
| GC pause p99 (STW) at Stage 1 (40 GB live) | < 5 ms with tuned GOGC/GOMEMLIMIT; report what the untuned default does |
| Request p99 inflation attributable to GC | Isolate it: p99(GC on) − p99(GOGC=off for a window). Drive the GC-attributable component < 2 ms at Stage 2 |
| GOMEMLIMIT behavior near the ceiling | At 90% of limit, throughput degrades gracefully (GC works harder) and does not enter a death spiral or OOM; proven by measurement |
| Reproducibility | Every alloc/op and pause number comes with the exact command, GOGC/GOMEMLIMIT, and commit |
The point is not to hit "zero" as a trophy — it's to know which allocations you removed, what each one bought in GC CPU and p99, and when removing the next one stops being worth the code complexity.
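As a quick in-process stand-in for -benchmem, the alloc/op target can be checked with testing.AllocsPerRun. This is a sketch under assumptions: the rec type, naiveJSON, appendJSON, and measure are hypothetical, and the two-field JSON shape is invented for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// rec is a hypothetical two-field record used only for this comparison.
type rec struct {
	ID   uint64
	Hits int64
}

// sink keeps results reachable so the compiler cannot elide the work.
var sink []byte

// naiveJSON formats through fmt, which boxes arguments and builds a string
// that is then copied into a fresh []byte.
func naiveJSON(r rec) []byte {
	return []byte(fmt.Sprintf(`{"id":%d,"hits":%d}`, r.ID, r.Hits))
}

// appendJSON appends into a caller-owned buffer using strconv.Append*.
func appendJSON(dst []byte, r rec) []byte {
	dst = append(dst, `{"id":`...)
	dst = strconv.AppendUint(dst, r.ID, 10)
	dst = append(dst, `,"hits":`...)
	dst = strconv.AppendInt(dst, r.Hits, 10)
	return append(dst, '}')
}

// measure returns average allocations per call for each path.
func measure() (naive, opt float64) {
	r := rec{ID: 1234567, Hits: 89}
	buf := make([]byte, 0, 128) // reused across runs; large enough to never grow
	naive = testing.AllocsPerRun(1000, func() { sink = naiveJSON(r) })
	opt = testing.AllocsPerRun(1000, func() { buf = appendJSON(buf[:0], r) })
	return naive, opt
}

func main() {
	naive, opt := measure()
	fmt.Printf("naive: %.0f allocs/op, append path: %.0f allocs/op\n", naive, opt)
}
```

A microbenchmark like this only proves the serializer in isolation; the steady-state pprof profile is still needed to catch allocations that appear only under the real handler.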
6. Architecture constraints & guidance
- Pin the toolchain. Go's GC, escape analysis, and GOMEMLIMIT semantics change between releases, so record go version in every result. Use a recent Go (GOMEMLIMIT arrived in Go 1.19; ≥ 1.21 is a sensible floor). The arena experiment needs GOEXPERIMENT=arenas and may be gated or removed, so treat it as a probe, not a load-bearing dependency.
- Measure allocations three ways and reconcile them: testing.B with -benchmem (per-op, microbench), pprof -alloc_objects / -alloc_space (where in the code), and the runtime/metrics counter /gc/heap/allocs:bytes (whole-process rate under load). If they disagree, you're measuring the wrong scope.
- Turn the GC off to attribute its cost. Run a short window with GOGC=off at fixed load, with GOMEMLIMIT unset or well above the heap the window can reach (a memory limit still triggers collections even when GOGC=off). The p99 delta vs. GC-on is the GC-attributable tail. Don't ship GOGC=off: it's a measurement tool.
- sync.Pool is per-P and emptied by the GC. Since Go 1.13 an idle pooled object survives one cycle in a victim cache and is dropped on the next, so a pool never holds memory for long. Pool buffers/large transient objects, never things with a finalizer-like lifecycle; always reset before Put; never retain a pointer into a pooled buffer after Put. Show the race detector / a deliberate use-after-Put bug as a cautionary result.
- Profiling has overhead. The -alloc_* profiles are sampled (runtime.MemProfileRate, one sample per 512 KiB allocated by default); gctrace is cheap; runtime/trace is heavy. Use the cheapest tool that answers the question and state which was on during each measured run.
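The pooling rules above (reset before Put, no retained alias) can be sketched as follows. The pool stores *[]byte rather than []byte so that Put does not allocate when boxing the slice header into an interface. The render function and its callback signature are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// bufPool stores *[]byte so Get/Put move a pointer, not a slice header copy.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, 512)
		return &b
	},
}

// render builds a response in a pooled buffer and passes it to write.
// write must not retain the slice: the buffer is recycled when render returns.
func render(id uint64, write func([]byte)) {
	bp := bufPool.Get().(*[]byte)
	b := (*bp)[:0] // start empty, keep capacity
	b = append(b, "id="...)
	b = strconv.AppendUint(b, id, 10)
	write(b)
	*bp = b[:0] // reset before Put and keep any capacity gained by growth
	bufPool.Put(bp)
}

func main() {
	render(42, func(p []byte) { fmt.Printf("%s\n", p) })
}
```

If the callback needs the bytes after render returns, it has to copy them; keeping the slice is exactly the use-after-Put bug the cautionary experiment is meant to show.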
7. Data model & hot-path shape
record (baseline, mis-packed): record (packed):
type Rec struct { type Rec struct {
A int64 A, B, C int64
Hot bool // padding hole Payload []byte // 24B header
B int64 Hot bool // tail, 1B
Active bool // padding hole Active bool
C int64 } // 8B-aligned, smaller
Payload []byte
} // report unsafe.Sizeof both
in-memory index:
map[uint64]*Rec (Stage 1/3: tens of GB live — this is what the GC must scan)
hot path (optimized):
bp := bufPool.Get().(*[]byte)                 // pool holds *[]byte so Put does not allocate
out := appendJSON((*bp)[:0], rec)             // []byte API, no interface boxing
w.Write(out); *bp = out[:0]; bufPool.Put(bp)  // reset-before-Put, no retained alias
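The two layouts above can be compared directly with unsafe.Sizeof. In this sketch the type names recBaseline and recPacked and the sizes helper are hypothetical; the field order follows the layout shown. On a 64-bit platform the baseline should come out at 64 bytes and the packed form at 56, because each isolated bool drags 7 bytes of padding behind it.

```go
package main

import (
	"fmt"
	"unsafe"
)

// recBaseline mirrors the mis-packed layout: each bool sits between int64s.
type recBaseline struct {
	A       int64
	Hot     bool // followed by padding up to the next 8-byte boundary
	B       int64
	Active  bool // followed by padding again
	C       int64
	Payload []byte
}

// recPacked groups the wide fields first and the bools at the tail.
type recPacked struct {
	A, B, C int64
	Payload []byte
	Hot     bool
	Active  bool // both bools share one padded tail word
}

// sizes reports the in-memory size of each layout on the current platform.
func sizes() (baseline, packed uintptr) {
	return unsafe.Sizeof(recBaseline{}), unsafe.Sizeof(recPacked{})
}

func main() {
	b, p := sizes()
	fmt.Printf("baseline=%dB packed=%dB saved=%dB per record\n", b, p, b-p)
}
```

The per-record saving is small; it matters here because it is multiplied by every record in the 40 GB index.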
Alternatives to the pointer-heavy map are a flat []Rec or off-heap layouts. Comparing them is an experiment (§10).
8. Interface contract
- GET /lookup?key=N&fmt=json|bin → the record; the path under test.
- GET /debug/allocs → { live_heap, total_alloc, gc_cycles, gc_cpu_fraction, pause_p99_ms, next_gc } from runtime/metrics.
- GET /debug/pprof/{heap,allocs,...} → standard net/http/pprof.
- GET /metrics → Prometheus: request p50/p99/p999, GC CPU fraction, pause histogram, live heap, allocs/sec.
- Run knobs via env/flags: GOGC, GOMEMLIMIT, -pool=on|off, -path=naive|opt, -index-size, -seed.
9. Key technical challenges
- Reading escape analysis honestly. -gcflags=-m is noisy and lies by omission (inlining changes everything). You must tie each "escapes to heap" line to an actual allocation in the profile, not just trust the compiler's chatter.
- Pooling that helps vs. pooling that hurts. sync.Pool wins on high-churn transient buffers and loses on small/cheap objects (its own bookkeeping plus the refills after the GC empties the pool cost more than the saved alloc). Find both regimes.
- The GC death spiral. Set GOMEMLIMIT too tight for the live set and the GC runs back-to-back trying to stay under the limit, starving request work. The runtime's GC CPU limiter caps the collector at roughly 50% of CPU and then lets the heap overshoot the soft limit, so the symptom is collapsing throughput and a breached limit, not necessarily an immediate OOM. Reproduce it on purpose, then show the limit that degrades gracefully instead.
- Pointer-heavy heaps and scan cost. A 40 GB heap of map[uint64]*Rec makes the mark phase expensive (every pointer is a scan edge); the same data as a flat slice or off-heap blob marks far faster. The win isn't fewer allocs; it's less for the GC to trace.
- Attributing the tail. Some of your p99 is GC pause, some is assist (request goroutines doing mark work), some is unrelated. Separating them needs gctrace + runtime/trace, not guesswork.
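To tie an "escapes to heap" line to a real allocation, a pointer-returning constructor can be measured against a value-returning one. This is a minimal sketch: the point type and both constructors are hypothetical, and //go:noinline is used deliberately so inlining cannot hide the escape.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ X, Y int64 }

// Package-level sinks keep results reachable across calls.
var (
	ptrSink *point
	valSink point
)

// newPointPtr returns a pointer to a local, so the value must outlive the
// frame and the compiler places it on the heap.
//
//go:noinline
func newPointPtr(x int64) *point { return &point{X: x, Y: x} }

// newPointVal returns by value, so the result is copied to the caller and
// nothing needs a heap allocation.
//
//go:noinline
func newPointVal(x int64) point { return point{X: x, Y: x} }

// allocsPerCall measures average heap allocations for each constructor.
func allocsPerCall() (ptr, val float64) {
	ptr = testing.AllocsPerRun(1000, func() { ptrSink = newPointPtr(7) })
	val = testing.AllocsPerRun(1000, func() { valSink = newPointVal(7) })
	return ptr, val
}

func main() {
	ptr, val := allocsPerCall()
	fmt.Printf("pointer return: %.0f allocs/call, value return: %.0f allocs/call\n", ptr, val)
}
```

Building the same file with go build -gcflags=-m should report the composite literal in newPointPtr as escaping, which is the compiler line this measurement corroborates.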
Stages (0 simple → 1 big data → 2 high RPS → 3 both)
Build Stage 0 correct first — it's the control every later number is measured against. Then push each axis alone, then both. The two axes fail differently: big-heap stresses GC pause/scan, high-RPS stresses allocation churn/assist.
| Stage | Live heap | Rate / concurrency | Bottleneck this stage exposes | Pass criterion |
|---|---|---|---|---|
| 0 · Simple | 64 MB | ~500 RPS, 8 conns | Baseline correctness: what the handler actually allocs/op and how often the GC runs at rest | Optimized path proven ≤ 1 alloc/op (-benchmem); GC runs predictably; baseline alloc/op recorded as the control |
| 1 · Big data | 40 GB live | ~500 RPS (low) | GC pause & scan cost. A whole-heap mark is long and competes with requests for CPU; default GOGC=100 lets the heap grow toward ~2× the live set before collecting; GOMEMLIMIT must cap RSS | Tuned GC pause p99 < 5 ms; GOMEMLIMIT holds RSS without a death spiral; report untuned-default pause for contrast |
| 2 · High RPS | 64 MB (small) | 60k+ RPS, 512 conns | Allocation churn → GC CPU% and p99 sawtooth. Assist conscription; the sync.Pool win is largest here | GC CPU < 5%, GC-attributable p99 < 2 ms; quantified sync.Pool win (allocs/op, GC CPU, p99 before/after) |
| 3 · Both | 40 GB live | 60k+ RPS | Worst case. Huge heap and high churn: long marks overlap heavy assist; hot Zipfian keys keep churn high while the cold tail bloats scan cost | Full SLOs hold simultaneously: p99 under a stated SLO (e.g. < 25 ms) with GC pause p99 < 5 ms and GC CPU < 8% — defended with gctrace+trace evidence |
Stage 1 and Stage 2 can be passed independently and still fail Stage 3, because long marks (heap) and heavy assist (churn) compound: a request that hits an assist during a long mark eats both. Stage 3 is the production boss.
10. Experiments to run (break it / tune it)
Record before/after allocs/op, GC CPU%, pause p99, and request p99 for each:
1. Escape-analysis sweep. Run go build -gcflags='-m -m' on the hot path; for the top 5 escaping allocations, fix each (avoid pointer-return, use a []byte API, drop interface{}) and record the alloc/op drop per fix. Some won't move the needle; note which.
2. Interface boxing & string↔[]byte. Replace fmt.Sprintf / any logging and string(b) / []byte(s) round-trips with []byte / strconv.Append* paths. Measure allocs/op and bytes/op.
3. sync.Pool on/off, both regimes. Pool the response buffer (big transient) vs. pool a tiny struct (cheap). Show the buffer pool wins and the tiny pool loses, with numbers.
4. Pool footguns. Add a deliberate use-after-Put and a missing Reset; show the corruption / data leak and the race detector output. Then fix.
5. GOGC sweep. GOGC=50, 100, 200, 400, off at fixed Stage-2 load. Plot GC CPU% vs. live-heap-ceiling vs. p99. Find the knee.
6. GOMEMLIMIT near the ceiling. Stage 1: set GOMEMLIMIT to 110%, 105%, 101% of the live set. Find where it goes from "GC works harder" to death spiral (GC CPU pinned at the runtime limiter's ceiling of roughly 50%, throughput collapsing, heap overshooting the limit). Report the safe margin.
7. Struct packing. Reorder fields per §7; report unsafe.Sizeof and the total-heap / GC-scan-time delta at Stage 1.
8. Pointer-heavy vs. flat vs. off-heap. Hold the 40 GB set as map[uint64]*Rec, as a flat []Rec + index, and off-heap via mmap/arena. Measure mark time and pause for each: the scan-cost story.
9. Attribute the GC tail. At fixed Stage-2 load, run a window with GOGC=off; p99(on) − p99(off) is the GC-attributable tail. Corroborate with the GODEBUG=gctrace=1 pause sum and a runtime/trace assist view.
10. String interning. Intern the repeated string fields (or use a []byte dictionary). Measure live-heap reduction and any lookup-CPU cost it adds.
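For the GC-off measurement window, the programmatic equivalent of GOGC=off is debug.SetGCPercent(-1). The sketch below wraps it so the previous setting is always restored; withGCOff and gcPercent are hypothetical helper names. Note that a memory limit set through GOMEMLIMIT or debug.SetMemoryLimit still applies while the percentage is off.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// withGCOff runs fn with percentage-based GC disabled, then restores the
// previous setting even if fn panics.
func withGCOff(fn func()) {
	prev := debug.SetGCPercent(-1)
	defer debug.SetGCPercent(prev)
	fn()
}

// gcPercent reads the current setting. The runtime exposes no getter, so this
// sets a value and immediately puts the old one back.
func gcPercent() int {
	p := debug.SetGCPercent(-1)
	debug.SetGCPercent(p)
	return p
}

func main() {
	fmt.Println("before window:", gcPercent())
	withGCOff(func() {
		// Drive the fixed-rate load here and record p99 for the window.
		fmt.Println("inside window:", gcPercent())
	})
	fmt.Println("after window:", gcPercent())
}
```

Keep the window short: with collection off, the heap grows by the full allocation rate until the window ends.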
11. Milestones
- cmd/svc + cmd/gen + /debug/allocs up; Stage 0 baseline allocs/op and a Grafana board for p99 / GC CPU / pause / live heap.
- Escape-analysis pass + optimized path; hit ≤ 1 alloc/op on the microbench (experiments 1–2).
- sync.Pool (and its footguns) under Stage-2 load; the churn/CPU/p99 win (experiments 3–4).
- Stage 1 huge-heap run: GOGC/GOMEMLIMIT tuning, pause p99, the death-spiral reproduction and the safe margin (experiments 5–6).
- Layout + Stage 3: struct packing, pointer-heavy vs. flat vs. off-heap, tail attribution; findings note (experiments 7–10).
12. Acceptance criteria (definition of done)
- Optimized hot path proven ≤ 1 alloc/op by -benchmem and a steady-state pprof -alloc_objects (no hidden allocs under load).
- Stage 2: GC CPU < 5% and GC-attributable p99 < 2 ms, with the GOGC=off-window attribution shown.
- Stage 1: GC pause p99 < 5 ms at 40 GB live under a tuned GOGC/GOMEMLIMIT, with the untuned-default pause reported for contrast.
- GOMEMLIMIT death spiral reproduced on purpose and the graceful-degrade margin stated, with the gctrace evidence for both.
- sync.Pool win and a documented loss case, both with numbers; plus the use-after-Put footgun shown and fixed.
- Stage 3: full SLOs hold simultaneously under a stated request-p99 SLO, defended with gctrace + runtime/trace.
- Every number reproducible from a committed command + GOGC/GOMEMLIMIT + commit.
13. Stretch goals
- GC pacer model. From gctrace numbers, predict the next cycle's trigger heap and CPU; compare your prediction to the runtime's actual pacing.
- Generational illusion. Go's GC isn't generational. Show why a high-churn short-lived-object workload that a generational GC would love still costs Go, and how sync.Pool partly fills that gap.
- Region/arena allocation under GOEXPERIMENT=arenas: bulk-free a per-request arena and measure the GC-scan savings vs. the safety cost.
- madvise tuning. Compare GODEBUG=madvdontneed=0 (MADV_FREE) vs. the default (MADV_DONTNEED on Linux since Go 1.16) on RSS return to the OS after a load spike subsides.
- Container reality. Run with a cgroup memory limit and GOMEMLIMIT set from it; show the OOM-kill you avoid vs. leaving GOMEMLIMIT unset.
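A starting point for the pacer-model stretch goal is the heap-target rule stated in the Go GC guide: target heap = live heap + (live heap + GC roots) × GOGC / 100. The sketch below encodes just that arithmetic. The heapGoal name is hypothetical, and the real pacer also accounts for GOMEMLIMIT and triggers a cycle before the goal is reached, so treat the result as a first-order prediction to compare against gctrace.

```go
package main

import (
	"fmt"
	"math"
)

// heapGoal applies the documented GOGC rule:
// goal = live + (live + roots) * GOGC / 100.
// A negative gogc models GOGC=off, i.e. no percentage-based goal.
func heapGoal(liveBytes, rootBytes uint64, gogc int) uint64 {
	if gogc < 0 {
		return math.MaxUint64
	}
	return liveBytes + (liveBytes+rootBytes)*uint64(gogc)/100
}

func main() {
	const GiB = 1 << 30
	for _, g := range []int{50, 100, 200} {
		goal := heapGoal(40*GiB, 0, g)
		fmt.Printf("GOGC=%d, live=40 GiB -> goal %.0f GiB\n", g, float64(goal)/GiB)
	}
}
```

With a 40 GiB live set and roots ignored, GOGC=100 puts the goal at 80 GiB, which is why the huge-heap stage needs GOMEMLIMIT or a lower GOGC to keep RSS inside the pod's memory.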
14. Evaluation rubric
| Dimension | Senior bar | Staff bar |
|---|---|---|
| Allocation accounting | Reduces allocs/op with -benchmem | Accounts for every remaining alloc, ties each to an escape line, knows which to leave |
| GC cost model | Knows GC uses CPU and pauses | Predicts pause/CPU from heap shape & churn before measuring; explains assist & write barriers |
| sync.Pool | Uses it to cut allocs | Shows where it loses; avoids the reset/use-after-Put footguns and proves correctness |
| GOGC/GOMEMLIMIT | Tunes them and sees the effect | Reproduces the death spiral, states the safe margin, recommends settings per SLO |
| Heap layout | Packs structs | Quantifies pointer-scan cost; chooses flat/off-heap with mark-time evidence |
| Tail attribution | Reports p99 | Isolates the GC-attributable component and proves cause with trace |
| Communication | Clear findings note | Could defend every alloc/op and pause curve to a staff panel |
15. References
- Go runtime: "A Guide to the Go Garbage Collector" (the GOGC/GOMEMLIMIT + pacer doc), runtime/metrics package docs, GODEBUG=gctrace=1 format.
- Escape analysis: go build -gcflags='-m -m'; the compiler's escape-analysis notes.
- sync.Pool docs + the per-P, emptied-by-the-GC semantics.
- The Go Programming Language (Donovan & Kernighan): memory & profiling chapters.
- pprof: go tool pprof -alloc_objects / -alloc_space; net/http/pprof.
- See also: Interview Question/01-golang-language-and-runtime/ (runtime, GC, escape analysis) and Interview Question/17-performance-engineering/ (allocation reduction, profiling, tail latency).