Memory Allocator: Interview

1. How to use this file

26 questions in interview order — junior to staff — plus a "what NOT to say" section and a 5-minute prep checklist. Each question has a short answer (one-line headline, then expanded prose the length you'd give in the room) and where it matters a follow-up to expect. The Go memory allocator is one of those topics where the surface API is tiny (new, make, &T{}) but the runtime mechanics are deep — interviewers grade on whether you can move fluently between the user-visible API, the allocator data structures (mcache, mcentral, mheap), the size-class system, escape analysis, and the operational consequences (RSS, GOMEMLIMIT, fragmentation). What's tested is not whether you've memorized runtime/malloc.go line by line, but whether you can reason about why a given allocation went to the heap, what size class it landed in, and why RSS doesn't drop the moment your code finishes. Read top to bottom on first pass; on revision skim and re-read only the ones you stumbled on.

2. Junior questions (Q1–Q5)

Q1. What is a memory allocator?

A: A memory allocator is the layer between your code's "give me 64 bytes" request and the operating system's "here's a page of virtual memory" response — it carves OS pages into program-sized chunks, tracks which chunks are free, and hands them out without a syscall on every allocation.

The OS hands out memory in pages (4 KB on most systems, 16 KB on some ARM64 platforms) via mmap (or VirtualAlloc on Windows). Asking the kernel for 64 bytes every time you need a struct would be catastrophically slow: a syscall costs on the order of a microsecond, and a Go program might allocate millions of objects per second. The allocator sits in user space, requests large regions from the kernel up front (Go reserves heap arenas of 64 MB on 64-bit Linux), splits them into the sizes the program actually wants, hands them out, and reclaims them when the GC says they're no longer reachable. In Go specifically, the allocator lives in the runtime (runtime/malloc.go, runtime/mcache.go, runtime/mcentral.go, runtime/mheap.go) and is woven tightly into the garbage collector, the scheduler, and the goroutine stack-growth machinery.

Follow-up: Is "the allocator" the same thing as "the heap"? Answer: no. The heap is where memory lives (a region of virtual address space managed by the allocator). The allocator is the code that decides which bytes of the heap belong to which Go object. The GC is a third actor — it tells the allocator which objects are no longer reachable so the bytes can be reused. Junior candidates conflate these three; staff candidates name them separately and describe the contract between each pair.

Q2. What's the difference between stack and heap in Go?

A: The stack is per-goroutine, grows and shrinks automatically with function calls, and is freed instantly when a function returns; the heap is a shared region managed by the allocator and reclaimed by the GC.

Every goroutine gets a small stack (initially 2 KB in modern Go; it grows as needed by copying to a larger allocation). Local variables (function parameters, return values, and locals that "don't escape") live on the stack. When the function returns, the stack pointer moves back; there's no per-allocation cleanup cost. The heap, by contrast, holds objects whose lifetime exceeds their declaring function's scope, objects too big to fit on the stack, and anything the escape analyzer can't prove is stack-safe. Heap allocation is more expensive (a call into the runtime, possibly a trip through mcentral/mheap) and carries an ongoing cost: the GC must trace the object on every cycle until it's unreachable. Practically, you want the escape analyzer to put as much as possible on the stack: a stack-allocated &Point{} costs about as much as adjusting the stack pointer and zeroing 16 bytes.

Follow-up: Can you control whether something goes on the stack? Answer: not directly. You influence escape analysis by avoiding patterns it can't analyze — don't return pointers to locals you intend to be short-lived, don't pass them into interface{} parameters, don't capture them in closures that outlive the function. The decision is the compiler's; you can inspect it with go build -gcflags="-m".

Q3. What's `new` vs `make` in Go?

A: new(T) returns a *T pointing at a zero-valued T; make(T, ...) returns an initialized T and exists only for slice, map, and chan where construction needs more than zeroing.

new(int) returns *int pointing at 0. new(struct{Name string}) returns a *struct{Name string} with Name == "". The result is always a pointer; the underlying value is zero-initialized. make, by contrast, returns the value, not a pointer, and only works for the three built-in reference types: make([]int, 10, 100) returns a []int slice header (len 10, cap 100) pointing at a freshly allocated 100-element backing array; make(map[string]int) returns an initialized map (the runtime's internal map structure is allocated and ready); make(chan int, 5) returns a buffered channel. The split exists because slices, maps, and channels have internal state beyond "the zero value": a var m map[string]int is nil and panics on write; make(map[string]int) is empty-but-usable.

Follow-up: Does new(T) always allocate on the heap? Answer: no, despite the name. new(T) is just sugar for "give me a zeroed T and return its address". If escape analysis proves the pointer doesn't escape the function, the T lives on the stack and new(T) is free. The common interview wrong answer is "new allocates on heap, var allocates on stack" — both are subject to escape analysis.
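The point that new(T) is subject to escape analysis can be checked empirically. The following is a minimal sketch, assuming a recent Go toolchain with default optimizations; the names local, leaked, and sink are illustrative and not from any library. It uses testing.AllocsPerRun, which works outside of test files, to count heap allocations per call.

```go
package main

import (
	"fmt"
	"testing"
)

type Point struct{ X, Y int }

// sink is a package-level variable; storing a pointer here forces it to escape.
var sink *Point

//go:noinline
func local() int {
	p := new(Point) // does not escape: lives in local's stack frame
	p.X = 42
	return p.X
}

//go:noinline
func leaked() int {
	p := new(Point) // escapes: the pointer is stored in a package-level variable
	p.X = 42
	sink = p
	return p.X
}

func main() {
	fmt.Println("local allocs/op: ", testing.AllocsPerRun(1000, func() { _ = local() }))
	fmt.Println("leaked allocs/op:", testing.AllocsPerRun(1000, func() { _ = leaked() }))
}
```

Both functions use the same new(Point) expression; only the one whose pointer outlives the call should report an allocation.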

Q4. What is escape analysis?

A: Escape analysis is a compile-time pass that decides, for each allocation, whether the object can safely live on the stack (freed automatically when the function returns) or must "escape" to the heap (managed by the GC).

The rule is: an object lives on the heap if the compiler cannot prove that all references to it die before the function does. The classic cases that force escape: (1) returning a pointer to a local (return &x), because the caller will use it after x's frame is gone; (2) assigning to an interface{} that flows somewhere the compiler can't follow, because the interface value may outlive the function; (3) capturing in a closure that escapes; (4) sending the pointer on a channel; (5) the object is too large for the stack (in the current compiler, implicit allocations such as new(T), &T{}, and make go to the heap above 64 KB, and explicitly declared variables above 10 MB). Inspect with go build -gcflags="-m", which prints lines like ./main.go:12:9: &User{...} escapes to heap for each escaping allocation. The reason escape analysis matters: heap allocations can be 10–100× more expensive than stack ones (GC overhead, allocator path, cache effects), and a hot path with one stray escape can dominate latency.

Follow-up: What does "escapes to heap" actually do at runtime? Answer: the compiler emits a runtime.newobject (or runtime.makeslice etc.) call where the literal would have been, which routes through mallocgc in runtime/malloc.go. A non-escaping &T{} becomes an offset from SP (stack pointer) and a zero-init — no function call at all. The cost difference is roughly one cache-line write (stack) vs one function call plus possibly a P-cache miss (heap), which is why escape analysis dominates microbenchmarks.

Q4b. Why is escape analysis conservative?

A: Escape analysis is sound but incomplete — it accepts the cost of unnecessarily heap-allocating an object when it can't prove the object stays bounded, rather than risk a stack-use-after-return bug.

The compiler's job is correctness first, performance second. If the analyzer can't determine whether a pointer escapes (because the code path involves an interface, a function pointer, a generic type, or a closure with complex capture), it conservatively assumes escape and emits a heap allocation. This is "false positive escape" — the runtime overhead exists even though the object could safely live on the stack. The reverse mistake (placing on stack when it actually escapes) would be catastrophic: the stack frame disappears on return, leaving the caller holding a dangling pointer. The asymmetric cost — heap alloc is slow but correct, stack alloc is fast but unsafe when wrong — forces the analyzer toward heap. Senior Go engineers know which patterns confuse the analyzer (interfaces, large maps, complex generic instantiations) and refactor to give it cleaner cases. The line inlining call to X in -m output often unlocks better escape decisions because the inlined body exposes the pointer's full use to analysis.

Follow-up: Are there cases where escape analysis is too aggressive? Answer: yes — for example, slicing into a stack-allocated array arr with s := arr[:n] historically caused arr to escape because s could outlive the function. Recent compiler versions handle some cases better; the right mitigation is to look at the actual -m output rather than assume.

Q5. Where does Go allocate the `&T{}` literal?

A: Wherever escape analysis decides — stack if the pointer's reach is provably bounded by the function, heap otherwise.

p := &Point{X: 1, Y: 2} is the canonical example. If you use p locally — fmt.Println(p.X) — escape analysis sees the pointer dies in the function and emits a stack allocation; the Point is a 16-byte (two ints) slot on the stack. If you return p, store it in a global, send it on a channel, or pass it to fmt.Println(p) itself (where it goes through interface{} and escape analysis gives up), the &Point{} allocation is converted into a runtime.newobject call that returns a heap pointer. The &T{} syntax is identical in source — the runtime behaviour differs entirely based on what the surrounding code does with the result. This is why senior Go developers say "the heap is a property of how the pointer is used, not how it's created".

Follow-up: Why does passing to fmt.Println cause escape? Answer: fmt.Println takes ...any (i.e., ...interface{}). Putting a concrete pointer into an interface value means the runtime needs the pointer to remain valid for the lifetime of the interface — which the compiler can't bound at compile time. So it conservatively heap-allocates. This is why fmt.Println(x) in a hot loop is a notorious allocation source.

3. Middle questions (Q6–Q13)

Q6. Walk me through what happens when I call `make([]byte, 1024)`.

A: make([]byte, 1024) becomes a runtime.makeslice call that computes total bytes (1024 * 1 = 1024), rounds up to a size class (1024 maps to class 1024 exactly), takes a free slot from the matching span in the current P's mcache, and returns the address; the compiler builds a []byte slice header {ptr, 1024, 1024} around it.

The flow in detail. (1) The compiler sees make([]byte, 1024), knows byte is 1 byte wide, and (assuming the slice escapes, so it can't be placed on the stack) emits a call to runtime.makeslice(elem *_type, len, cap int). (2) makeslice calls mallocgc(1024, byteType, true); the third argument means "needs zeroing". (3) mallocgc checks the size: 1024 bytes is "small" (≤ 32 KB) and non-zero, so it consults the size class table (runtime/sizeclasses.go); 1024 falls into the size class whose bytes field is exactly 1024 (class 32 in the current table; Go has 67 small classes from 8 bytes to 32 KB). (4) The current P's mcache has an mspan for that size class with free 1024-byte slots; it takes the next one. (5) The slot is zeroed (because we asked) and returned. (6) The compiler wraps the pointer in a slice header, {base, 1024, 1024}, and that's what your code sees. No syscalls, no locking: just taking the next free slot from the per-P span. The fast path typically costs tens of nanoseconds, plus the time to zero 1 KB.

Follow-up: What changes if I write make([]byte, 1024*1024) instead? Answer: 1 MB is well above the "small object" threshold (32 KB). mallocgc routes it to the large-object path — directly allocate one or more pages from mheap, skip the size class machinery, and the span belongs to that single object. Large allocations bypass the per-P fast path and acquire the heap lock; they're rarer and cheaper to track individually.
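What the program can observe from this flow is small: a slice header with the requested length and capacity, pointing at zeroed memory. A minimal sketch, assuming a package-level sink (an illustrative name) to make sure both backing arrays are heap-allocated rather than placed on the stack:

```go
package main

import "fmt"

// sink forces the backing arrays onto the heap by letting them escape.
var sink []byte

// allZero reports whether every byte of b is zero.
func allZero(b []byte) bool {
	for _, v := range b {
		if v != 0 {
			return false
		}
	}
	return true
}

func main() {
	small := make([]byte, 1024) // small-object path: a slot in a size-classed span
	sink = small
	fmt.Println(len(small), cap(small), allZero(small)) // 1024 1024 true

	large := make([]byte, 1<<20) // 1 MiB: large-object path, dedicated span
	sink = large
	fmt.Println(len(large), cap(large), allZero(large)) // 1048576 1048576 true
}
```

The two calls look identical from user code; which allocator path served them is visible only through runtime statistics or tracing.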

Q7. Why does Go have its own allocator instead of using `malloc`?

A: Go's allocator is co-designed with the GC and the goroutine model — it needs concurrent allocation without per-call locking, integration with write barriers, and per-P caches that match the scheduler's structure. Generic malloc doesn't give you that.

Three concrete reasons. (1) GC coupling: Go's GC needs to know which words are pointers (for tracing) and which are scalars (for skipping). The allocator records pointer-bitmap metadata alongside each span; a generic malloc has no idea about object structure. (2) Per-P caches: each P (the scheduler's logical processor, not an OS thread) owns an mcache with size-classed free lists, so the common allocation path takes no locks. Generic malloc implementations use thread-local caches too (jemalloc, tcmalloc), but the integration with goroutine scheduling is Go-specific: when a goroutine migrates between Ps, the cache stays with the P, not the goroutine. (3) Goroutine stack management: stacks grow by copying, so the runtime allocates stack memory itself and must be able to move a stack and fix up the pointers into it; escape analysis ensures ordinary heap objects never point into a stack, which is what makes the copy safe. A generic malloc can't help with that. Beyond these, the allocator is tuned for Go's workload: many small, short-lived objects (the GC pacing assumes this) and uniform 8-byte alignment (Go has no __attribute__((packed)) equivalent at the allocator level).

Follow-up: Could Go just call jemalloc instead? Answer: not without rewriting the GC and the runtime. The allocator hands the GC pointer bitmaps; jemalloc has no concept of those. You'd lose the type-aware sweeping that makes Go's mark phase tractable. There's been discussion of "allocator pluggability" over the years; it's never landed because the coupling is too deep.

Q8. What's a size class?

A: A size class is a fixed bucket of bytes (8, 16, 24, 32, 48, …, 32768) that the allocator rounds requests up to, so it can manage objects in homogeneous spans rather than tracking variable-size blocks.

Go has ~67 size classes for small objects, listed in runtime/sizeclasses.go. When you ask for 25 bytes, the allocator rounds up to the next class — 32 bytes — and gives you a 32-byte slot. The "lost" 7 bytes are internal fragmentation; the win is enormous: each span holds objects of one size, so the free list is just a stack of identical slots, the allocator never has to search for a fit, and free just pushes back onto the stack. The class sizes themselves are tuned: the small classes (8, 16, 24, 32) are dense because Go programs allocate huge numbers of small structs; the larger classes grow geometrically (e.g., 4096, 4864, 5376, 6144) to keep waste bounded — the design target is ≤ 12.5% waste per class. Above 32 KB, the size-class system stops and objects become "large" with their own dedicated span.

Follow-up: Why not just have classes at every power of 2? Answer: too much waste. A 65-byte object in a 128-byte class wastes ~49%; in Go's 80-byte class (the next class up from 64) it wastes only ~19%. The size class table is hand-tuned (and machine-generated by runtime/mksizeclasses.go) to keep average waste low across realistic workloads.
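The rounding and waste arithmetic above can be reproduced with a few lines. This is a sketch, not the runtime's lookup: smallClasses is a hand-copied excerpt of the first entries of the table (an assumption to verify against runtime/sizeclasses.go for your Go version), and roundUp and wastePct are illustrative helpers.

```go
package main

import "fmt"

// smallClasses is a hand-copied excerpt of the smallest size classes.
// Assumption: check runtime/sizeclasses.go for the authoritative table.
var smallClasses = []int{8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// roundUp returns the smallest class that fits size, or -1 if the
// excerpt does not cover it.
func roundUp(size int) int {
	for _, c := range smallClasses {
		if size <= c {
			return c
		}
	}
	return -1
}

// wastePct is the internal fragmentation as a percentage of the slot size.
func wastePct(size, slot int) float64 {
	return 100 * float64(slot-size) / float64(slot)
}

func main() {
	for _, size := range []int{25, 65} {
		slot := roundUp(size)
		fmt.Printf("request %d B -> slot %d B, waste %.2f%%\n", size, slot, wastePct(size, slot))
	}
	// The same 65-byte request under a pure power-of-two scheme.
	fmt.Printf("65 B in a 128 B slot wastes %.2f%%\n", wastePct(65, 128))
}
```

The real runtime does not scan a list; it indexes precomputed lookup tables keyed by size, but the result for these inputs is the same rounding.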

Q9. What's `mcache`, `mcentral`, `mheap`?

A: Three-layer hierarchy: mcache is the per-P (per-scheduler) cache for the fast path; mcentral is a per-size-class shared pool that mcache refills from; mheap is the global, page-grained heap that mcentral carves spans out of.

The flow. Each P (logical processor in the scheduler; there's one per GOMAXPROCS slot) has an mcache struct holding one mspan pointer per span class: each of the ~67 size classes comes in a pointer-containing ("scan") and a pointer-free ("noscan") variant. Allocation pulls from the P's mcache first: no locks, just taking the next free slot. When a class's span runs out (all slots used), the mcache asks the corresponding mcentral for a fresh span. mcentral is a global pool per span class, shared by all Ps and therefore synchronized (a lock in older releases, lock-free span sets in recent ones); it tracks two sets of spans (those with free slots and those entirely full). When mcentral is also empty, it asks mheap to allocate pages and carve them into a span of the requested class. mheap manages the heap at page granularity (8 KB pages in modern Go), using a radix tree (since Go 1.14) to track which pages are allocated, free, scavenged (returned to OS), etc. The three-tier design is the standard "thread-local cache + central pool + global allocator" pattern, the same idea as jemalloc and tcmalloc, adapted to Go's per-P scheduler model.

Follow-up: Why per-P instead of per-goroutine? Answer: cache size. Per-goroutine would mean millions of caches in a real Go program (one per goroutine); per-P means one per CPU. The cache is reused as goroutines migrate; the per-P design amortizes the cache size over all goroutines that run on that P. The trade-off: when a goroutine is preempted mid-allocation and resumes on a different P, it loses cache locality — but the cost of "lost locality" is much smaller than the cost of "one cache per goroutine".
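The refill chain can be made concrete with a toy model. This is a sketch under loud assumptions: the types cache, central, and heap are invented stand-ins for mcache, mcentral, and mheap, a span is reduced to a counter of free slots, and nothing here reflects the runtime's real data layout. It only shows the shape of the fast path (no lock) versus the two slower paths (locked).

```go
package main

import (
	"fmt"
	"sync"
)

const slotsPerSpan = 4 // tiny on purpose so refills are visible

// span is a run of equal-sized slots; free counts slots not yet handed out.
type span struct{ free int }

// heap stands in for mheap: the global source of fresh spans.
type heap struct {
	mu     sync.Mutex
	carved int // number of spans carved so far
}

func (h *heap) newSpan() *span {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.carved++
	return &span{free: slotsPerSpan}
}

// central stands in for one mcentral: a shared pool for a single size class.
type central struct {
	mu      sync.Mutex
	partial []*span // spans that still have free slots
	heap    *heap
}

func (c *central) refill() *span {
	c.mu.Lock()
	defer c.mu.Unlock()
	if n := len(c.partial); n > 0 {
		s := c.partial[n-1]
		c.partial = c.partial[:n-1]
		return s
	}
	return c.heap.newSpan() // slowest path: go to the global heap
}

// cache stands in for one P's mcache entry for that size class.
// Only its owner touches it, so it needs no lock.
type cache struct {
	cur     *span
	central *central
	refills int
}

// alloc hands out one slot and leaves the cache only when the span is exhausted.
func (c *cache) alloc() {
	if c.cur == nil || c.cur.free == 0 {
		c.cur = c.central.refill()
		c.refills++
	}
	c.cur.free--
}

func main() {
	h := &heap{}
	ce := &central{heap: h}
	p0 := &cache{central: ce}
	for i := 0; i < 10; i++ {
		p0.alloc()
	}
	fmt.Println("refills:", p0.refills, "spans carved:", h.carved) // refills: 3 spans carved: 3
}
```

Ten allocations with four slots per span touch the shared layers only three times; with real span sizes the ratio is far more favorable.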

Q9b. How does allocation handle the fast path with no locking?

A: The fast path is a free-list pop on the per-P mcache — no atomics, no locks, no syscalls — because Go's scheduler guarantees that only one goroutine at a time can be running on a given P, so the mcache is single-threaded by construction.

This is the elegance of the per-P design. A goroutine running on P3 wants to allocate a 32-byte object. It indexes into mcache.alloc[sizeClass32], gets the mspan, and takes the next free slot: a handful of loads and stores, no synchronization primitives. The Go scheduler guarantees that no other goroutine runs on P3 simultaneously, and the runtime disables preemption for the duration of the allocation, so the mcache is never observed half-updated. When mcache runs dry, the refill from mcentral does synchronize with other Ps, but that's the slow path, hit roughly once per span's worth of slots in steady state. The fast path is what makes Go's allocator competitive: it's roughly 20–40 nanoseconds per small allocation, in the same range as thread-caching mallocs such as tcmalloc and jemalloc, and far cheaper than any allocator that takes a global lock on every call.

Follow-up: What about goroutine preemption mid-allocation? Answer: it can't happen in the middle of the fast path. mallocgc pins the goroutine to its M (acquirem) before touching the mcache and releases it afterwards, so the scheduler won't preempt it or move it to another P while the free-slot update is in flight. Preemption can occur before or after the allocation; if the goroutine later resumes on a different P, its next allocation simply uses that P's mcache. There's no "torn" state because each mcache is only ever modified by the code currently running on its P.

Q10. What's the tiny allocator?

A: The tiny allocator is a special fast path in mallocgc for very small (< 16 byte), non-pointer allocations — it packs several of them into a single 16-byte slot to reduce per-object overhead.

In Go, very small objects without pointers are common: short strings, small []byte slices, struct{int8}-like wrappers. Each one allocated to its own 8-byte or 16-byte slot wastes a lot of metadata overhead (one bit per word in the GC bitmap, a slot in the free list). The tiny allocator pulls a 16-byte slot from the size-16 class and packs successive tiny allocations into it: first allocation gets bytes 0–4, second gets 5–9, third gets 10–15. When the current tiny block fills, a new 16-byte slot is pulled. The state lives in mcache.tiny (a pointer to the current block) and mcache.tinyoffset. The constraint: only non-pointer objects qualify — if it contained a pointer, the GC would have to scan inside, and the tiny packing breaks the "one object per slot" invariant the bitmap relies on. The win is significant for allocation-heavy code: profiling typically shows 10–20% of small allocs route through tiny.

Follow-up: What's the cost of the tiny allocator? Answer: deferred reclamation. A tiny block isn't freed until all objects packed into it are unreachable. If you tiny-allocate three things and two die quickly but one lives forever, the entire 16-byte block sticks around. Usually a non-issue because tiny objects tend to have similar lifetimes; pathological cases exist but are rare.

Q11. Why are large objects allocated differently?

A: Large objects (> 32 KB) skip the size-class machinery, get their own dedicated span (one or more pages), and are tracked individually in mheap — because at that size the fragmentation cost of size classes outweighs the indexing cost.

The size-class system trades internal fragmentation for fast allocation. At 32 KB and below, fragmentation per object is small relative to allocation cost, so size classes win. Above 32 KB, the next size classes would have to be 36 KB, 40 KB, and so on, and a single 33 KB object in a 64 KB class wastes half the slot. Large allocations route directly to mheap.allocSpan(npages), which finds a free run of pages in the page allocator (a radix tree since Go 1.14, replacing the older treap), marks them allocated, and returns a one-object span. The span's metadata records "this is a large alloc" so the sweep phase knows to free the whole span at once rather than scanning for free slots. Cost: each large alloc acquires the heap lock; it's slower than the fast path, but large allocs are rare. The threshold (32 KB) is a constant in the runtime source (maxSmallSize), not a user-facing knob; it's a balance between size-class overhead and lock contention.

Follow-up: What about really large objects, like 1 GB? Answer: same path, but the page allocator has to find a contiguous run of 131072 pages. On a fragmented heap, this can fail and trigger a fresh mmap from the OS. The radix tree was introduced precisely because the old treap-based search degraded on heaps with millions of pages. Very large allocations also become individually traced by the GC (one object per span = one mark bit per span), which is unusually cheap from the GC's perspective — paradoxically, one 1 GB allocation is cheaper to manage than 1 million 1 KB allocations.
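The page counts quoted for large objects follow from simple ceiling division. A minimal sketch, assuming the 8 KB runtime page size mentioned earlier; pagesFor is an illustrative helper, not a runtime function:

```go
package main

import "fmt"

const pageSize = 8 << 10 // 8 KiB runtime page

// pagesFor returns how many whole pages a large object of n bytes occupies.
func pagesFor(n int) int { return (n + pageSize - 1) / pageSize }

func main() {
	for _, n := range []int{33 << 10, 1 << 20, 1 << 30} {
		p := pagesFor(n)
		fmt.Printf("%d bytes -> %d pages (%d bytes of tail waste)\n", n, p, p*pageSize-n)
	}
	// 33792 bytes -> 5 pages (7168 bytes of tail waste)
	// 1048576 bytes -> 128 pages (0 bytes of tail waste)
	// 1073741824 bytes -> 131072 pages (0 bytes of tail waste)
}
```

For a large object the only fragmentation is the unused tail of the last page, which is why the waste shrinks in relative terms as objects grow.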

Q12. When does my allocation escape to the heap?

A: When the compiler can't prove all references to the object die before the enclosing function returns: most commonly because you return a pointer, pass it through an interface the compiler can't follow, capture it in a closure that outlives the function, send it on a channel, or it's too large for the stack (over 64 KB for implicit allocations such as &T{} and make).

The mechanical rule: each allocation is a node in a "points-to" graph the compiler builds during compilation. If any reference to the object reaches a function return, a global, an interface value the compiler can't track, or a channel send, the object is marked escaping. Concretely:

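import "fmt"

type Point struct{ X, Y int }

// Stays on stack: the pointer never leaves the function.
func a() int { p := &Point{X: 1}; return p.X }

// Escapes: pointer returned to the caller.
func b() *Point { return &Point{X: 1} }

// Escapes: passed as an interface to a function the compiler can't see through.
// (A bare `var i any = &Point{}; _ = i` does not escape on its own.)
func c() { fmt.Println(&Point{}) }

// Escapes: captured by a closure that outlives the function.
func d() func() int { p := &Point{X: 1}; return func() int { return p.X } }

// Escapes: implicit allocation larger than 64 KB.
// (An explicit `var buf [100000]byte` can stay on the stack; that limit is 10 MB.)
func e() { buf := make([]byte, 100000); _ = buf }

// Escapes: sent on a channel.
func f(ch chan *Point) { ch <- &Point{X: 1} }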

You can audit with go build -gcflags="-m" (one -m for basic decisions, -m=2 for verbose). Output lines such as &Point{...} escapes to heap show exactly which allocations escaped, and -m=2 adds the reason. The senior move is to read -m output for any hot path and rearrange code to keep allocations stack-bound where possible.

Follow-up: What does inlining call to X mean in -m output? Answer: the compiler inlined the function call, which often enables escape analysis to see further (because the inlined body's allocations are now in the caller's scope). Inlining + escape analysis combine to keep more allocations on the stack; large functions that can't inline often produce escapes that smaller functions wouldn't.

Q13. How does `sync.Pool` work?

A: sync.Pool is a per-P cache of reusable heap objects: Put returns an object to the local P's pool, Get pops from it; when the P's pool is empty, it steals from other Ps; objects that aren't reused are dropped within two GC cycles to avoid hoarding.

The implementation (sync/pool.go) is itself a small allocator on top of the heap allocator. Each P has a local pool with two parts: private (a single fast slot that only the owning P touches) and shared (a lock-free queue that other Ps can steal from). Get tries private first, then the local shared queue, then steals from other Ps' shared queues, then checks the victim cache, and finally calls the user-supplied New function. Put writes to private if empty, else pushes onto shared. The key design constraint: pool contents don't survive long across GCs. Specifically, at the start of each GC cycle the runtime calls poolCleanup, which moves the current contents to a "victim" cache and drops the previous victim. So an idle object survives at most one GC cycle after it was Put and is gone by the second. The reason: if pools held memory indefinitely, the GC could never reclaim it, and a burst that filled the pool once would pin that memory forever.

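import "sync"

// Storing a []byte boxes the slice header on every Put; production code
// often pools *[]byte or *bytes.Buffer instead.
var bufPool = sync.Pool{New: func() any { return make([]byte, 0, 4096) }}

func process(data []byte) []byte {
    buf := bufPool.Get().([]byte)[:0]
    defer bufPool.Put(buf) // safe only because buf never leaves this function; see follow-up
    buf = append(buf, data...) // ... use buf as scratch space ...
    return append([]byte(nil), buf...) // copy out before Put
}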

Follow-up: What's wrong with defer bufPool.Put(buf)? Answer: if you return buf itself (without copying), the caller holds a reference to memory you've just put back in the pool — another goroutine can Get it and modify it under you. The pattern only works for internal buffers; never Put a slice you've returned. The race detector won't catch this in tests unless you happen to time it right; it's the kind of bug that ships and surfaces under production load.
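A common way to sidestep both the boxing cost and the aliasing hazard is to pool *bytes.Buffer and return a copy. The following is a minimal sketch; greet and bufPool are illustrative names, and the pattern assumes the result is small enough that copying it out is acceptable.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool holds *bytes.Buffer values; pooling a pointer avoids boxing a
// slice header into an interface on every Put.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// greet builds its result in a pooled scratch buffer and returns a copy.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still hold a previous caller's bytes
	defer bufPool.Put(buf) // safe: buf itself never leaves this function

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the caller never aliases pooled memory
}

func main() {
	fmt.Println(greet("gopher"))    // hello, gopher
	fmt.Println(greet("allocator")) // hello, allocator
}
```

The Reset before use matters as much as the copy on the way out: the pool makes no promise about the state of what Get returns.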

4. Senior questions (Q14–Q21)

Q14. Read this `-gcflags="-m"` output and explain.

A: The -m output reports each allocation decision and inlining choice — escape lines (X escapes to heap) tell you to investigate; does not escape confirms a stack alloc; moved to heap flags surprising heap-pushes.

Consider this output from go build -gcflags="-m" ./...:

./main.go:10:6: can inline newPoint
./main.go:14:9: inlining call to newPoint
./main.go:11:9: &Point{...} escapes to heap
./main.go:18:14: ... argument does not escape
./main.go:21:6: moved to heap: buf
./main.go:25:10: leaking param: data

Line by line. can inline newPoint: the compiler will substitute the function body where it's called; small functions usually get this. inlining call to newPoint: the call site got the inlined body, which can change escape outcomes downstream. &Point{...} escapes to heap: despite inlining, the resulting pointer is used in a way that forces heap allocation; you need to check whether the use is necessary. ... argument does not escape: the implicit slice the compiler builds for a variadic call (fmt.Println(x)-style) stays on the stack; good news, though the values boxed into it can still escape separately. moved to heap: buf: the compiler moved a named local variable to the heap because something took its address and that address may outlive the function; often fixable (use the value directly, or pre-allocate). leaking param: data: the parameter escapes via the function (stored in a long-lived structure, returned, etc.); for a parameter this is usually fine if intentional.

The senior move on -m output: focus on escapes to heap and moved to heap lines in hot paths, ignore the rest. A web handler with five escapes per request might allocate millions of objects per second under load — fix those. A startup function with ten escapes runs once and doesn't matter.

Follow-up: When is leaking param a problem? Answer: when the function is a tiny helper that should be allocation-free but isn't because a parameter escapes through it. E.g., a func log(msg string) that internally stores msg somewhere: every caller that builds msg on the fly now pays a heap allocation for it, because the backing bytes can no longer live on the caller's stack. Refactor to not store, or accept that the helper is now an allocation point and keep it off the hot path.

Q15. How would you reduce allocations in a hot path with no profiler available?

A: Six interventions, ordered by effort: read -gcflags="-m", swap to []byte slices preallocated outside the loop, replace fmt.Sprintf with strconv.AppendXxx, replace interface{} parameters with concrete types, introduce sync.Pool for reusable buffers, and check that string conversions aren't producing copies.

The triage in order. (1) Read -gcflags="-m" for the hot file. Identify the escapes to heap lines; those are the candidates. Each escape is an allocation per call; in a hot path that's millions per second. (2) Preallocate buffers. A loop that builds a string via += allocates a new string every iteration; switch to strings.Builder (which uses a []byte internally) or a hand-managed []byte you grow once. (3) Avoid fmt. fmt.Sprintf("%d", n) boxes n into an interface{} (which can escape), then formats via type switches and reflection. strconv.AppendInt(buf, n, 10) writes directly into a buffer you own and allocates nothing if the buffer has room. The same applies to fmt.Errorf("...: %v", err) in hot paths: return preallocated sentinel errors (errors.New at package level) from the hot path and add context with fmt.Errorf and %w at boundaries, not on every error. (4) Concrete types in hot interfaces. A function func f(x any) makes callers box the argument, which allocates whenever the value isn't pointer-shaped and the box escapes; func f(x int) doesn't. Generics (Go 1.18+) let you write f[T any](x T) and usually avoid the boxing. (5) sync.Pool for reusable buffers. When the same object type is allocated and discarded per call (parser scratch, encoder buffer), pool it. Watch the lifetime caveat from Q13. (6) String conversions. string(b) and []byte(s) generally copy and allocate (the compiler elides the copy in a few recognized patterns, such as a map lookup m[string(b)] or a comparison). strings.Builder.String() does not copy, which is a reason to prefer it over bytes.Buffer.String(), which does. The unsafe route (unsafe.String(unsafe.SliceData(b), len(b)) since Go 1.20, or the older *(*string)(unsafe.Pointer(&b)) cast) avoids the copy for an existing []byte, but is not recommended without a benchmark proving it matters.

Even without a profiler, GODEBUG=gctrace=1 will print every GC cycle: pause times, heap size before/after, MB allocated. If pause times are climbing, allocation pressure is climbing; if you've made the right changes, GC frequency drops measurably.

Follow-up: What about testing.AllocsPerRun and benchmark allocation reporting? Answer: those are the right tools when you can write a test or benchmark. testing.AllocsPerRun(runs, f) returns the average number of allocations per call of f, and b.ReportAllocs() (or go test -benchmem) shows per-op allocations and bytes. Goal for a hot path: zero. Anything non-zero is suspicious; investigate each one. The pattern in performance-critical code: write a benchmark before optimizing, get a baseline (e.g., 12 allocs/op, 320 B/op), apply one change, re-run, compare. If allocs dropped, keep the change; if not, revert. The discipline matters because intuition is wrong more often than it's right: an "obvious" optimization like switching from string concatenation to bytes.Buffer can actually add allocations if the inputs are small enough that the buffer's own setup and growth cost more than the concatenation did.
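The fmt versus strconv intervention is easy to measure directly. The sketch below is illustrative only: viaFmt, viaAppend, and sinkStr are made-up names, and the exact count for the fmt path depends on the Go version, so no specific number is claimed for it.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// sinkStr keeps the formatted result alive so the work cannot be optimized away.
var sinkStr string

// viaFmt formats n through fmt, which returns a freshly allocated string.
func viaFmt(n int64) string { return fmt.Sprintf("%d", n) }

// viaAppend formats n into dst, reusing dst's backing array when it has room.
func viaAppend(dst []byte, n int64) []byte { return strconv.AppendInt(dst[:0], n, 10) }

func main() {
	n := int64(1234567)
	buf := make([]byte, 0, 32) // allocated once, outside the measured code

	fmtAllocs := testing.AllocsPerRun(1000, func() { sinkStr = viaFmt(n) })
	appendAllocs := testing.AllocsPerRun(1000, func() { buf = viaAppend(buf, n) })

	fmt.Println("fmt.Sprintf allocs/op:      ", fmtAllocs)
	fmt.Println("strconv.AppendInt allocs/op:", appendAllocs)
	fmt.Println("same text:", sinkStr == string(buf))
}
```

The append variant stays allocation-free only while the buffer has spare capacity; size it for the largest value you expect.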

Q16. Why doesn't Go return memory to the OS immediately?

A: Returning memory requires a syscall (madvise(MADV_DONTNEED) on Linux), and the syscall is far more expensive than holding the memory; the runtime amortizes the cost via a background scavenger that runs in runtime.bgscavenge.

When the GC frees a span, the pages become available for the next allocation: same process, same address space, no kernel involvement. Releasing pages to the OS is a separate decision: it reduces RSS (resident set size), making the process less of an apparent memory hog, but it costs a syscall and means the next allocation in that range has to fault the pages back in. The Go runtime's compromise: the background scavenger sweeps slowly through freed spans, calling madvise on pages that have been idle for a while (MADV_DONTNEED by default on Linux; MADV_FREE was the default in Go 1.12 through 1.15). The pace is determined by a heuristic (and by GOMEMLIMIT since Go 1.19): fast enough that long-lived overhead doesn't accumulate, slow enough that transient spikes don't trigger syscall storms. Under GOMEMLIMIT, the scavenger gets more aggressive when the heap approaches the limit, to keep total RSS in bounds.

Follow-up: How does MADV_DONTNEED differ from MADV_FREE? Answer: MADV_DONTNEED immediately reclaims pages: RSS drops, but the next access faults. MADV_FREE (Linux 4.5+) marks pages reclaimable; the kernel may reclaim them under pressure, but if memory isn't tight, the pages stick around and the next access is free. Go 1.12 through 1.15 used MADV_FREE when available because it's cheaper; one consequence was that RSS in ps didn't immediately reflect freed memory. This tripped up monitoring dashboards constantly: operators saw "high RSS" and assumed a leak when the runtime had actually released the memory and the kernel was just lazy. The fix was GODEBUG=madvdontneed=1, which forces the eager behaviour at the cost of slightly more page-fault overhead. Since Go 1.16, MADV_DONTNEED is the default on Linux again, and GODEBUG=madvdontneed=0 opts back into MADV_FREE.
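The memory limit and an explicit release can also be driven from code, which is handy for experiments. A minimal sketch, assuming Go 1.19 or later; the 512 MiB figure is an arbitrary example, and applyLimit is an illustrative helper around the real runtime/debug calls.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyLimit sets the soft memory limit (the programmatic form of GOMEMLIMIT)
// and returns the limit that is in effect afterwards.
func applyLimit(bytes int64) int64 {
	debug.SetMemoryLimit(bytes)
	// A negative argument does not change the limit; it only reads it back.
	return debug.SetMemoryLimit(-1)
}

func main() {
	fmt.Println("limit now:", applyLimit(512<<20))

	// Forces a GC and asks the runtime to return as much memory to the OS
	// as it can. Useful when experimenting; rarely appropriate in a hot path.
	debug.FreeOSMemory()
}
```

The limit is soft: the runtime works harder to stay under it (more frequent GC, more eager scavenging) but will exceed it rather than fail an allocation.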

Q17. What's the difference between RSS, HeapSys, HeapInuse, HeapReleased?

A: RSS is what the OS sees (pages actually backed by physical memory); HeapSys is what the Go runtime has reserved from the OS (mmaped virtual address space); HeapInuse is what the runtime considers live (spans currently allocated to objects); HeapReleased is the portion the runtime has told the OS it doesn't need.

The relationships matter. HeapSys = HeapInuse + HeapIdle, where HeapIdle is reserved-but-not-allocated. HeapReleased ≤ HeapIdle: released pages are a subset of idle ones. The heap's contribution to RSS is approximately HeapSys - HeapReleased, provided the kernel has reclaimed the released pages; under MADV_FREE it may not have, so RSS can exceed that estimate. From runtime.MemStats:

var ms runtime.MemStats
runtime.ReadMemStats(&ms)
// ms.HeapSys      — virtual address space reserved
// ms.HeapInuse    — bytes in spans with at least one allocation
// ms.HeapIdle     — bytes in spans with no allocations
// ms.HeapReleased — bytes returned to OS (subset of HeapIdle)
// ms.HeapAlloc    — bytes allocated and not yet freed (within HeapInuse)

The classic confusion: "my Go process uses 2 GB but runtime.MemStats.HeapAlloc is 200 MB". Explanation: HeapAlloc is what's live; HeapSys is what's reserved; RSS is what's resident. They diverge because (a) spans have unused slots (HeapInuse > HeapAlloc), (b) idle spans haven't been released yet (HeapSys > HeapInuse), (c) released spans may still show as resident under MADV_FREE (RSS > HeapSys - HeapReleased). All four numbers are normal; the interesting question is which one is growing.
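To see these counters diverge, a small program can allocate a burst, drop it, and print the fields at each step. This is a sketch under stated assumptions: heapView and snapshot are hypothetical helpers, the 256 MiB burst is arbitrary, and the exact numbers vary by platform and Go version.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// heapView holds the MemStats fields discussed above, in bytes.
type heapView struct {
	Alloc, Inuse, Idle, Released, Sys uint64
}

// snapshot reads the current heap statistics. ReadMemStats briefly stops
// the world, so avoid calling it on a hot path.
func snapshot() heapView {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return heapView{
		Alloc:    ms.HeapAlloc,
		Inuse:    ms.HeapInuse,
		Idle:     ms.HeapIdle,
		Released: ms.HeapReleased,
		Sys:      ms.HeapSys,
	}
}

// String formats the view in MiB for easy comparison between steps.
func (v heapView) String() string {
	const mib = 1 << 20
	return fmt.Sprintf("alloc=%dMiB inuse=%dMiB idle=%dMiB released=%dMiB sys=%dMiB",
		v.Alloc/mib, v.Inuse/mib, v.Idle/mib, v.Released/mib, v.Sys/mib)
}

func main() {
	fmt.Println("start:    ", snapshot())

	// Allocate roughly 256 MiB in 1 MiB chunks.
	chunks := make([][]byte, 256)
	for i := range chunks {
		chunks[i] = make([]byte, 1<<20)
	}
	fmt.Println("allocated:", snapshot())
	runtime.KeepAlive(chunks)
	chunks = nil // drop every reference

	runtime.GC() // live heap shrinks; idle spans typically stay reserved
	fmt.Println("after GC: ", snapshot())

	debug.FreeOSMemory() // force a GC and return as much as possible to the OS
	fmt.Println("released: ", snapshot())
}
```

Comparing the "after GC" and "released" lines shows the difference between memory the runtime no longer uses and memory it has actually handed back.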

Follow-up: How do you read these in production? Answer: expose via expvar or Prometheus (prometheus/client_golang exports them automatically). The dashboards to watch: go_memstats_heap_alloc_bytes (logical use), go_memstats_heap_sys_bytes (reservation), process_resident_memory_bytes (OS view). Divergence between them tells you where the issue is: HeapAlloc rising → leak; HeapSys rising while HeapAlloc flat → fragmentation; RSS rising while HeapReleased rising → kernel hasn't reclaimed.
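The divergence rules above can be written down as a small decision function over two scrapes of the gauges. This is an illustrative sketch only: the sample type, the verdict constants, and the 10% tolerance are assumptions made for the example, not part of any library.

```go
package main

import "fmt"

// sample is one scrape of the gauges named above, in bytes.
type sample struct {
	HeapAlloc    uint64 // go_memstats_heap_alloc_bytes
	HeapSys      uint64 // go_memstats_heap_sys_bytes
	HeapReleased uint64 // go_memstats_heap_released_bytes
	RSS          uint64 // process_resident_memory_bytes
}

// Verdicts returned by classify (hypothetical labels for this sketch).
const (
	verdictLeak          = "live heap growing: suspect a leak or a larger working set"
	verdictFragmentation = "reservation growing with flat live heap: suspect fragmentation"
	verdictKernelLag     = "RSS growing while the runtime releases: kernel has not reclaimed yet"
	verdictNonGo         = "RSS growing outside the Go heap: check stacks, cgo, mmap"
	verdictStable        = "stable"
)

// grew reports whether cur exceeds prev by more than tolerance (a fraction).
func grew(prev, cur uint64, tolerance float64) bool {
	return float64(cur) > float64(prev)*(1+tolerance)
}

// classify applies the rules of thumb from the text to two samples taken
// some time apart. The 10% tolerance is an arbitrary illustrative choice.
func classify(prev, cur sample) string {
	const tol = 0.10
	switch {
	case grew(prev.HeapAlloc, cur.HeapAlloc, tol):
		return verdictLeak
	case grew(prev.HeapSys, cur.HeapSys, tol):
		return verdictFragmentation
	case grew(prev.RSS, cur.RSS, tol) && grew(prev.HeapReleased, cur.HeapReleased, tol):
		return verdictKernelLag
	case grew(prev.RSS, cur.RSS, tol):
		return verdictNonGo
	default:
		return verdictStable
	}
}

func main() {
	prev := sample{HeapAlloc: 200 << 20, HeapSys: 400 << 20, HeapReleased: 50 << 20, RSS: 380 << 20}
	cur := sample{HeapAlloc: 205 << 20, HeapSys: 900 << 20, HeapReleased: 60 << 20, RSS: 850 << 20}
	fmt.Println(classify(prev, cur))
}
```

In practice the same comparison is usually expressed as dashboard queries over a time window rather than code, but the ordering of the checks is the useful part.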

Q18. How does the allocator interact with GC pacing?

A: Every allocation contributes to the GC trigger — once the heap grows by a configurable ratio (GOGC, default 100% = "trigger when heap doubles"), the next allocation pays a "mutator assist" cost to help the concurrent collector keep up.

The Go GC is concurrent and incremental: it runs alongside the program, marking live objects in background goroutines while the program continues allocating. The pacer's job is to ensure marking finishes before the heap reaches its target, otherwise the program would have to stop and wait. The pacing equation, roughly: target heap size = live heap after the last GC × (1 + GOGC/100). With GOGC=100, the GC lets the total heap grow to about 2× the live heap before the next cycle completes, meaning half the heap is "headroom" for new allocations. If the program allocates faster than the GC can mark, the runtime forces allocating goroutines to do some marking work themselves; this is the mutator assist. Heavy allocation = heavy assist = your code slows down because it's helping the GC. The GCCPUFraction field in runtime.MemStats tells you the GC's share of available CPU since the program started; > 25% sustained is a smell.

GOMEMLIMIT (Go 1.19+) adds a second constraint: if total memory use approaches the limit, the GC runs more often, regardless of GOGC. The pacer becomes dual-loop: hit the GOGC target unless doing so would exceed GOMEMLIMIT, in which case run earlier. This lets you set a soft ceiling that the runtime works to stay under, so a burst is more likely to cost CPU than to end in an OOM kill.
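The two-constraint rule can be made concrete with a deliberately simplified calculation. This is a sketch, not the runtime's pacer: heapGoal and noGoal are hypothetical names, a negative gogc stands in for GOGC=off, a zero memLimit stands in for "no limit", and the real pacer also accounts for stacks, globals, and other details omitted here.

```go
package main

import (
	"fmt"
	"math"
)

// noGoal means "no automatic trigger" in this sketch.
const noGoal uint64 = math.MaxUint64

// heapGoal returns a simplified heap target in bytes.
//   liveHeap: bytes found live by the previous collection
//   gogc:     GOGC percentage; negative means GOGC=off
//   memLimit: GOMEMLIMIT in bytes; 0 means no limit (sketch convention)
//   nonHeap:  memory the runtime uses outside the heap (stacks, metadata)
func heapGoal(liveHeap uint64, gogc int, memLimit, nonHeap uint64) uint64 {
	goal := noGoal
	if gogc >= 0 {
		goal = liveHeap + liveHeap*uint64(gogc)/100
	}
	if memLimit > 0 {
		var limitGoal uint64
		if memLimit > nonHeap {
			limitGoal = memLimit - nonHeap
		}
		if limitGoal < goal {
			goal = limitGoal // the memory limit forces an earlier cycle
		}
	}
	return goal
}

func main() {
	const mib = 1 << 20
	fmt.Println(heapGoal(100*mib, 100, 0, 0)/mib, "MiB with GOGC=100, no limit")
	fmt.Println(heapGoal(100*mib, 100, 150*mib, 10*mib)/mib, "MiB with GOGC=100, 150 MiB limit")
	fmt.Println(heapGoal(100*mib, -1, 150*mib, 10*mib)/mib, "MiB with GOGC=off, 150 MiB limit")
}
```

The last case is the configuration often recommended for containers: GOGC=off plus a memory limit, so collection is driven by the limit alone.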

Follow-up: How is GOGC=off different from GOGC=10000? Answer: off disables the automatic, heap-growth-triggered GC: the heap grows without bound and only runtime.GC() triggers collection (unless GOMEMLIMIT is also set, in which case the limit still triggers cycles). GOGC=10000 means "let the heap grow by 100× the live size before triggering": extremely lazy GC, but still automatic. Use off for one-shot batch jobs where the process exits and the OS cleans up; never in long-running services without a memory limit.

Q18b. What does "write barrier" mean in the allocator context?

A: A write barrier is compiler-inserted code that runs on pointer writes during GC marking — it makes sure the collector sees new pointers written to already-scanned objects, so it doesn't miss live data.

Go's GC is concurrent: while the program runs, the GC scans the heap in the background, marking reachable objects. The problem: if the program stores a pointer to a not-yet-marked object into an object the GC has already scanned, and removes every other path to it, the collector never revisits the scanned object, the new target never gets marked, and the GC may free a live object. The write barrier prevents this: during marking, every obj.field = ptr runs a short piece of runtime code that records the pointers involved so the GC marks them. Go has used a "hybrid" write barrier (Dijkstra + Yuasa) since Go 1.8: it shades both the old value being overwritten and the new value being written, which removes the need to re-scan goroutine stacks during a stop-the-world phase at the end of marking.

The allocator integrates with this in two places. (1) Objects allocated while marking is in progress are allocated black (already marked), so a brand-new object is never freed by the cycle that was running when it was created. (2) Spans store pointer bitmaps (set up by mallocgc based on the type passed in) so the GC knows which words to scan. The barrier code is small (a handful of instructions on the fast path) but runs on every pointer write during marking; it's why concurrent GC isn't free even when your goroutine is doing no marking work itself. GODEBUG=gctrace=1 reports assist time separately; that figure is marking work done by allocating goroutines, not barrier overhead, which shows up only as slower pointer writes.

Follow-up: When is the write barrier off? Answer: outside GC cycles. Between collections, pointer writes are plain stores. The barrier turns on at the start of marking and off at the end of marking. The GC's pacer aims to keep total time-in-barrier low; if your program spends > 25% of CPU in marking, the barrier overhead is a real concern.

Q19. Why is Go's allocator non-compacting and what are the consequences?

A: Go's GC marks objects but never moves them — once allocated, an object stays at its original address until it's freed. The consequence is heap fragmentation: long-lived objects can pin large spans, and the heap's apparent size grows beyond what's strictly needed.

Compacting GCs (Java's G1, .NET's CLR) periodically copy live objects to a fresh region, defragmenting in the process. Cost: every object move requires updating every pointer to it, which means stopping the world (or sophisticated read barriers). Go's design choice: non-compacting, because (a) moving objects sits badly with code that treats addresses as stable (a pointer converted to uintptr, or handed to C through cgo, is invisible to the runtime, so it could not be updated), and (b) the goroutine stacks are already a moving target, and adding heap compaction would multiply complexity. The consequence is fragmentation: a span with one live object holds the whole 8 KB; many sparsely populated spans add up. In practice, fragmentation is manageable because (a) size classes mean each span only holds one size, so there is no "this 100-byte object blocks a 200-byte allocation" problem, and (b) partially used spans are refilled by later allocations of the same class before new spans are needed. But it does mean Go's heap can be 30–50% larger than the live set in pathological cases.

Follow-up: When does fragmentation actually bite? Answer: workloads that allocate many objects of a particular size, free most of them, then never allocate that size again. The spans for that size sit with a few live objects each, can't be reclaimed (one live object pins the whole span), and HeapInuse stays high while HeapAlloc is low. Diagnostic: large gap between HeapInuse and HeapAlloc after a stable workload. Mitigation: object pooling (so the same spans get reused). debug.FreeOSMemory(), which forces a GC and then scavenges aggressively, returns fully idle spans to the OS but cannot free a span that still holds a live object. The deeper fix is to redesign data structures so that hot-path allocations reuse the same size class, for example fixed-size message buffers in a connection pool rather than variable-size buffers per request.
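The pinned-span pattern described above is easy to reproduce. The following sketch allocates many 1 KiB objects, keeps every eighth one, and then compares HeapInuse with HeapAlloc; block, pinSpans, and heapGap are hypothetical names, and the size of the gap it prints will vary by Go version and platform.

```go
package main

import (
	"fmt"
	"runtime"
)

// block is sized to land in the 1024-byte size class.
type block [1024]byte

// pinSpans allocates n blocks, keeps every keepEvery-th one, and returns the
// survivors so the caller controls how long they stay live.
func pinSpans(n, keepEvery int) []*block {
	all := make([]*block, n)
	for i := range all {
		all[i] = new(block)
	}
	var kept []*block
	for i := 0; i < n; i += keepEvery {
		kept = append(kept, all[i])
	}
	return kept
}

// heapGap returns HeapInuse and HeapAlloc after a forced collection.
func heapGap() (inuse, alloc uint64) {
	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.HeapInuse, ms.HeapAlloc
}

func main() {
	kept := pinSpans(1<<17, 8) // about 128 MiB allocated, about 16 MiB kept
	inuse, alloc := heapGap()
	gap := int64(inuse) - int64(alloc)
	fmt.Printf("survivors=%d HeapInuse=%d KiB HeapAlloc=%d KiB gap=%d KiB\n",
		len(kept), inuse>>10, alloc>>10, gap>>10)
	runtime.KeepAlive(kept) // survivors must stay live until after the measurement
}
```

Because objects of one class are handed out sequentially within a span, keeping every eighth 1 KiB object tends to leave one survivor per 8 KiB span, which is the worst case for this size class.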

Q19b. What's the relationship between size classes and fragmentation?

A: Size classes bound internal fragmentation (the table is tuned so that rounding up wastes roughly 12.5% or less for most sizes) but introduce a different problem: external fragmentation across spans of different size classes.

The 12.5% figure is a design target rather than a per-object guarantee: a 25-byte object in a 32-byte slot wastes 22%, but waste across realistic workloads stays low because the size classes are chosen to bracket common allocation sizes tightly. Where size classes hurt is external fragmentation: a span dedicated to 64-byte objects can never hold a 128-byte object, even if that span has 99% free slots. If your workload shifts from many 64-byte allocations to many 128-byte allocations, the old spans sit there unused, and the heap grows to hold the new spans. The page allocator can eventually reclaim entirely-empty spans and reuse the pages for any class, but a span with one lingering 64-byte object pins all its pages for as long as that object lives, because the page allocator can't split a single span.
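The rounding arithmetic is simple enough to work through by hand. The sketch below uses an excerpt of the small end of the size-class table; smallClasses and roundUp are hypothetical names, the real table in runtime/sizeclasses.go continues up to 32 KB, and the tiny allocator for very small pointer-free objects is ignored.

```go
package main

import "fmt"

// smallClasses is an excerpt of the runtime's size-class table, in bytes.
// It is incomplete on purpose; see runtime/sizeclasses.go for the full list.
var smallClasses = []int{8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// roundUp returns the slot size a request of the given size would occupy,
// or ok=false if the request is larger than this excerpt covers.
func roundUp(size int) (slot int, ok bool) {
	for _, c := range smallClasses {
		if size <= c {
			return c, true
		}
	}
	return 0, false
}

func main() {
	for _, size := range []int{8, 25, 33, 65, 128} {
		slot, ok := roundUp(size)
		if !ok {
			fmt.Printf("%4d bytes: outside this excerpt\n", size)
			continue
		}
		wasted := float64(slot-size) / float64(slot) * 100
		fmt.Printf("%4d bytes -> %4d-byte slot, %.1f%% wasted\n", size, slot, wasted)
	}
}
```

Requests just above a class boundary are the worst case; requests that match a class exactly waste nothing, which is one reason fixed, class-aligned struct sizes are worth a look in hot paths.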

This is why long-running services with heterogeneous allocation patterns can show "steady-state" RSS that's 30–50% higher than the live set. Java's G1 collector compacts to fix this; Go's design accepts it as the cost of non-compacting collection. Mitigation is structural: keep allocation patterns consistent (one allocation profile per service, not "burst at startup, then quiet, then different burst on reload").

Follow-up: Does this mean Go is bad for long-running services? Answer: no — the absolute overhead is small in practice (most services run at 1.3–1.5× live-set RSS, which is acceptable). But for memory-constrained deployments (low-RAM containers, embedded systems), the fragmentation overhead matters; tune by reducing allocation variety and adding pooling where alloc patterns are known.

Q20. When should you NOT use `sync.Pool`?

A: When objects are cheap to allocate, when lifetimes are unbounded, when objects contain references the GC wouldn't otherwise hold, or when correctness depends on objects being independent.

Five anti-patterns. (1) Tiny objects. Pooling a Point (16 bytes) costs more in pool overhead (atomic ops, P-local indirection) than the allocation it saves. Rule of thumb: pool objects ≥ 1 KB. (2) Long-lived objects. sync.Pool is emptied by the GC: an idle pooled object is dropped after a cycle or two (since Go 1.13 a victim cache lets it survive one extra cycle). If you Get an object and hold it for minutes, you're not pooling, you're allocating-with-extra-steps. (3) Objects holding large references. If your pooled *Buffer references a 1 MB []byte, the pool keeps that 1 MB alive between GC cycles even if no one's using it; you've added memory pressure rather than relieved it. (4) State carryover bugs. If a pooled object holds state that wasn't reset on Put, the next Get sees the leftover. The senior pattern is to reset on Get (defensive: you don't trust the previous user), not on Put (the user may have already discarded the reference). (5) Hot ownership transfer. If you Put an object then a goroutine elsewhere Gets it and starts modifying, while the first goroutine still holds a reference, you've got a data race. The pool is not a synchronization primitive; it's a free-list.

// Wrong: state leak
b := pool.Get().(*bytes.Buffer)
// ... append things ...
pool.Put(b) // next user sees the old contents

// Right: reset on Get
b := pool.Get().(*bytes.Buffer)
b.Reset()
// ...
pool.Put(b)
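A fuller version of the reset-on-Get pattern, as a self-contained sketch. The names bufPool, getBuffer, putBuffer, and render are hypothetical, and the 64 KiB retention cap is an arbitrary choice that illustrates anti-pattern (3): oversized buffers are dropped instead of being kept alive by the pool.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out *bytes.Buffer values; New runs only when the pool is empty.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// getBuffer returns an empty buffer. It resets on Get, so it does not rely
// on the previous user having cleaned up.
func getBuffer() *bytes.Buffer {
	b := bufPool.Get().(*bytes.Buffer)
	b.Reset()
	return b
}

// putBuffer returns b to the pool unless it has grown large, in which case
// it is left for the GC so the pool does not pin a big backing array.
func putBuffer(b *bytes.Buffer) {
	const maxRetained = 64 << 10
	if b.Cap() > maxRetained {
		return
	}
	bufPool.Put(b)
}

// render builds a greeting using a pooled buffer. String() copies the bytes,
// so the result stays valid after the buffer goes back to the pool.
func render(name string) string {
	b := getBuffer()
	defer putBuffer(b)
	b.WriteString("hello, ")
	b.WriteString(name)
	return b.String()
}

func main() {
	fmt.Println(render("gopher"))
	fmt.Println(render("allocator"))
}
```

The caller never touches the buffer after putBuffer runs, which is what keeps anti-pattern (5) out of the picture.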

Follow-up: What's the right benchmark to prove a pool helps? Answer: go test -bench=. -benchmem -run='^$' ./pkg/ with and without the pool, comparing allocs/op and ns/op. Pools should drop allocs/op to near zero and reduce ns/op by the cost of allocation. If ns/op increases, the pool overhead exceeds the allocation savings, so remove it.

Q21. How would you debug a "RSS keeps growing but HeapInuse is flat" situation?

A: That's classic non-Go memory growth — cgo allocations, mmap regions outside the Go heap, goroutine stacks, or the OS not reclaiming MADV_FREE pages. Investigate with pprof, top/smaps, and runtime.MemStats.Sys.

Six suspects. (1) Goroutine stacks. Each goroutine starts at 2 KB but can grow to 1 GB. A leak of long-running goroutines with deep call stacks shows up as runtime.MemStats.StackInuse growing while HeapInuse is stable. Diagnose with pprof goroutine and pprof goroutine?debug=2 for full stacks. (2) cgo and malloc. If your program calls C code that does its own allocations, those bytes never appear in Go's MemStats; they're in the C heap. RSS grows; Go thinks all is well. Diagnose: top -p PID shows RSS growing while pprof shows zero growth, and that gap is C. (3) mmap outside the heap. Some packages (memory-mapped files, custom allocators) call syscall.Mmap directly; the mapped region counts in RSS but not in HeapSys. Diagnose: cat /proc/$PID/smaps lists every mapping; large unfamiliar regions are the culprit. (4) OS-level reclamation lag. MADV_FREE pages remain resident until the kernel reclaims them under pressure. This applies when the runtime uses MADV_FREE: by default on Linux with Go 1.12 through 1.15, or with GODEBUG=madvdontneed=0 on later versions. RSS doesn't drop even though the runtime has released. Diagnose: runtime.MemStats.HeapReleased is large but RSS isn't dropping; on the affected versions, set GODEBUG=madvdontneed=1 to force MADV_DONTNEED. (5) Pinned spans (fragmentation). As in Q19, long-lived objects pin sparse spans and HeapInuse stays high. But the question says HeapInuse is flat, so this is unlikely to be the culprit unless RSS-vs-HeapSys is the gap. (6) runtime.MemStats.Sys overall. This is the total bytes the runtime has from the OS, including stacks, mcache structures, GC metadata, etc. Compare Sys growth vs HeapSys growth: if Sys is growing but HeapSys isn't, it's stacks or metadata.

The systematic flow: runtime.ReadMemStats(&ms), check HeapSys, StackSys, MSpanSys, MCacheSys, Sys — whichever is growing identifies the bucket; then pprof or smaps localizes within that bucket.
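The "whichever is growing identifies the bucket" step can be automated by diffing two MemStats readings. This is a sketch: bucket, growth, and fastestGrowing are hypothetical helpers, and the goroutine burst in main exists only to make one bucket move.

```go
package main

import (
	"fmt"
	"runtime"
)

// bucket is one runtime memory category and how much it changed, in bytes.
type bucket struct {
	name  string
	delta int64
}

// diff returns after minus before as a signed value.
func diff(before, after uint64) int64 { return int64(after) - int64(before) }

// growth compares the per-category "Sys" fields of two readings.
func growth(before, after *runtime.MemStats) []bucket {
	return []bucket{
		{"HeapSys", diff(before.HeapSys, after.HeapSys)},
		{"StackSys", diff(before.StackSys, after.StackSys)},
		{"MSpanSys", diff(before.MSpanSys, after.MSpanSys)},
		{"MCacheSys", diff(before.MCacheSys, after.MCacheSys)},
		{"GCSys", diff(before.GCSys, after.GCSys)},
		{"OtherSys", diff(before.OtherSys, after.OtherSys)},
	}
}

// fastestGrowing returns the bucket with the largest delta
// (the zero bucket if the slice is empty).
func fastestGrowing(buckets []bucket) bucket {
	var best bucket
	for i, b := range buckets {
		if i == 0 || b.delta > best.delta {
			best = b
		}
	}
	return best
}

func main() {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	// Park a burst of goroutines so their stacks stay allocated.
	stop := make(chan struct{})
	for i := 0; i < 10000; i++ {
		go func() { <-stop }()
	}
	runtime.ReadMemStats(&after)
	close(stop)

	for _, b := range growth(&before, &after) {
		fmt.Printf("%-10s %+d KiB\n", b.name, b.delta>>10)
	}
	fmt.Println("fastest growing:", fastestGrowing(growth(&before, &after)).name)
}
```

In a service, the two readings would be minutes or hours apart; once the bucket is known, pprof or smaps narrows it down as described above.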

Follow-up: How would you confirm cgo specifically? Answer: build with CGO_ENABLED=0 and rerun the workload (if the program tolerates it). If RSS growth disappears, it was cgo. Failing that, run the C side under Valgrind or an AddressSanitizer/LeakSanitizer build; note that -msan (memory sanitizer) detects reads of uninitialized memory, not leaks. Those tools catch C-side leaks that Go's tooling can't see.

5. Staff/Architect questions (Q22–Q26)

Q22. Compare Go's allocator to jemalloc/glibc malloc.

A: All three are size-classed, thread-cached, page-grained allocators — the differences are in GC integration, allocation patterns optimized for, fragmentation strategy, and how memory is returned to the OS.

Dimension	Go runtime	jemalloc	glibc ptmalloc
Thread cache	per-P (`mcache`)	per-thread (tcache)	per-thread (tcache)
Size classes	~67 classes, hand-tuned	~110 classes, log-bucketed	~64 classes (small bins)
Large objects	direct from `mheap`, 8 KB pages	extent allocator, 4 KB pages	sbrk/mmap with chunks
GC integration	tight — pointer bitmaps, sweep	none, manual `free`	none, manual `free`
Compaction	none (non-moving)	none	none
Concurrency	per-P caches; central pool with locks	per-thread caches; per-arena locking	per-thread caches; arena per ~8 threads
OS return	`MADV_FREE`/`MADV_DONTNEED` via scavenger	aggressive `madvise`, configurable	conservative; `mmap`ed chunks return on free
Workload target	many small short-lived objects + concurrent GC	server workloads, fragmentation-resistant	general C programs

Go's allocator is specialized — it gives up the generality of jemalloc (which has knobs for everything) to be tightly co-designed with the GC. jemalloc's main advantage in head-to-head benchmarks for C programs: lower fragmentation under long-running workloads with diverse allocation sizes. Go's allocator's main advantage: zero-syscall fast path for the common case, integrated pointer-bitmap generation for the GC, per-P caches that match the scheduler. Glibc's allocator is the weakest of the three for concurrent workloads — its arena lock contention is famous for being the reason production C++ servers switch to jemalloc or tcmalloc.

Staff move: name what can't be ported. Go's allocator can't be swapped for jemalloc because the GC needs the pointer bitmaps — you'd have to rewrite mark/sweep simultaneously. Conversely, a C program can't easily adopt Go's approach because there's no GC to drive sweeping, no scheduler to define "per-P", and no escape analysis to keep allocations off the heap in the first place. Each allocator's strengths are tied to its language's constraints.

Follow-up: Why does tcmalloc exist if jemalloc is so good? Answer: history — tcmalloc was Google's allocator (used inside Google's C++ services) before jemalloc was widely adopted, and was designed before some of jemalloc's concurrency innovations. They've converged in approach; the practical differences today are mostly tuning defaults and integration with profiling tools.

Q23. Walk through the page allocator (radix tree).

A: Since Go 1.14, mheap uses a multi-level radix tree to track which pages are allocated, free, or scavenged — it replaced the old treap-based design because lookups degraded badly on heaps with millions of pages.

The structure (runtime/mpagealloc.go). The heap's virtual address space is divided into 8 KB pages. The page allocator must answer two queries quickly: (1) "find me a free run of N contiguous pages" and (2) "mark these pages allocated/free/scavenged". A flat bitmap (one bit per page) is fine for query (2) but linear for query (1); a treap was fast for both but degraded as the heap fragmented. The radix tree solution: levels of summarized bitmaps. The leaf level is a bit per page (0 = free, 1 = allocated). The next level up summarizes a chunk of leaves with three numbers: the count of free pages at the start of the region, the longest run of consecutive free pages anywhere in it, and the count of free pages at the end. The level above summarizes those summaries. To find N contiguous free pages, you walk down the tree, at each level skipping any subtree whose longest free run is less than N (while checking whether the end of one subtree and the start of the next together form a long enough run). Found-or-not in O(log heap-size), and the constant factor is small because each level is densely packed.
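The summary idea can be illustrated with a toy version: compute (start, max, end) for a leaf bitmap, then merge two neighbouring summaries without looking at the bits again. This is a sketch of the technique, not the runtime's packed representation; summary, summarize, merge, and canFit are hypothetical names, and true means "allocated" here.

```go
package main

import "fmt"

// summary describes free runs within a region of `size` pages: free pages at
// the start, the longest free run anywhere, and free pages at the end.
type summary struct{ start, max, end, size int }

// summarize scans a leaf bitmap where true means "allocated".
func summarize(allocated []bool) summary {
	s := summary{size: len(allocated)}
	run := 0
	for i, a := range allocated {
		if a {
			run = 0
			continue
		}
		run++
		if run == i+1 {
			s.start = run // still inside the leading free run
		}
		if run > s.max {
			s.max = run
		}
	}
	s.end = run
	return s
}

// merge combines the summaries of two adjacent regions (a before b).
func merge(a, b summary) summary {
	m := summary{start: a.start, end: b.end, size: a.size + b.size}
	if a.start == a.size { // a is entirely free, so the leading run continues into b
		m.start = a.size + b.start
	}
	if b.end == b.size { // b is entirely free, so the trailing run begins in a
		m.end = b.size + a.end
	}
	m.max = a.max
	if b.max > m.max {
		m.max = b.max
	}
	if joined := a.end + b.start; joined > m.max {
		m.max = joined // a run that straddles the boundary
	}
	return m
}

// canFit reports whether a region might hold n contiguous free pages.
func canFit(s summary, n int) bool { return s.max >= n }

func main() {
	left := summarize([]bool{false, true, false, false})
	right := summarize([]bool{false, false, true, false})
	both := merge(left, right)
	fmt.Printf("left=%+v right=%+v merged=%+v\n", left, right, both)
	fmt.Println("fits 4 pages:", canFit(both, 4), "fits 5 pages:", canFit(both, 5))
}
```

The merge step is why the tree is cheap to maintain: changing a few leaf bits only requires recomputing the summaries on the path to the root.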

Levels in modern Go (64-bit platforms such as amd64): 5 summary levels. Each entry in the top level summarizes 16 GiB of address space, each level below it fans out by a factor of 8, and each leaf entry summarizes one 4 MiB chunk (512 pages) backed by a per-page bitmap; that is enough for any realistic heap. Scavenging uses the same structure: the scavenger looks for runs of free, not-yet-scavenged pages, calls madvise on them, and records them as scavenged in a second per-chunk bitmap. That "scavenged" bit is effectively a third state (alongside free/allocated): it tells the runtime which free pages have been returned to the OS and will incur a page fault when reused.

Staff move: the choice of radix tree was made for operational reasons, not just performance. The old treap had pathological worst-cases on highly fragmented heaps that were hard to predict; the radix tree's bounded depth gives bounded latency, which matters for tail latency in production. Go's runtime team valued p99 over average performance — a design philosophy worth noting.

Follow-up: Why not just keep a per-size-class free list of spans? Answer: spans are size-classed but pages aren't — when a large object frees its multi-page span, the pages need to be reclaimable for any size class's future allocation, including a different large object. Per-class free lists handle small allocs; the page allocator handles the underlying multi-page coalescing.

Q24. What would you change about Go's allocator design?

A: Three changes I'd argue for, in decreasing order of how much I'd actually advocate: NUMA-aware per-P caches for large boxes, pluggable arena/per-region allocators for known-lifetime workloads, and tighter integration between size classes and CPU cache lines.

(1) NUMA awareness. On a multi-socket machine, accessing memory on a remote NUMA node is 2–3× slower than local. Go's allocator is NUMA-oblivious: a goroutine on socket 0 may get memory from socket 1 because the mheap doesn't track origin. For machines with > 64 cores across multiple sockets, this is a real performance leak. The change: per-NUMA-node mheap slices, with mcache refills preferring local-node pages, falling back cross-node only on exhaustion. Cost: complexity in the global heap structure; benefit: meaningful only on big iron.

(2) First-class arenas. The arena experimental package (Q25) is the right direction but limited. A general design: an arena is a manually-managed memory region; allocations inside the arena are nearly free (bump pointer, no per-object metadata, no GC tracking); the entire arena is freed in one operation. Useful for request-scoped allocations (an HTTP handler that knows everything dies at request end) and known-batched work (parsing a 1 GB file once). The downside is loss of memory safety: arena pointers can outlive the arena and dangle, so a stable API would need to enforce lifetimes somehow (the arena experiment's generic constructors return ordinary pointers and do not enforce this statically; Rust-style lifetimes would be cleaner but require language changes).

(3) Cache-line tuning. Size classes today are 8-byte-aligned. Some classes straddle cache lines (a 24-byte object on a 64-byte cache line may span two lines depending on offset within a span). For hot small objects (linked-list nodes, lookup-table entries), false sharing is a real cost. The change: alignment hints per type, similar to //go:notinheap but for cache-line alignment. Cost: more size classes, more metadata; benefit: measurable on tight inner loops.

What I would not change. (a) The non-compacting design — too much complexity for marginal benefit given size-classed spans already limit fragmentation. (b) GOGC as the primary knob — it's an awful name but the model is right; explaining it to teams is worth the investment. (c) The per-P caching — the fit with the scheduler is too good to give up.

Staff move: name the constraints. Most of these are workload-specific. NUMA awareness matters for a handful of database/storage products; arenas matter for request-heavy servers; cache-line tuning matters for inner loops. The base allocator is well-tuned for the median Go workload; changes should be opt-in or workload-detected.

Follow-up: Why hasn't NUMA awareness landed? Answer: complexity vs benefit. Most Go deployments are on cloud VMs that present a single virtual NUMA node (whether or not the underlying hardware has multiple). The cost-benefit only flips on bare-metal multi-socket machines, which are a minority. The Go team has shown willingness when the case is strong (the radix tree, GOMEMLIMIT); NUMA hasn't crossed that bar yet.

Q25. Discuss the `arena` experimental package and why it was proposed.

A: arena (Go 1.20 experimental, gated by GOEXPERIMENT=arenas) is a manually-managed memory region — allocate inside it cheaply, free the whole arena in one operation, bypass the GC for arena-allocated objects. Proposed for workloads where GC overhead dominates and lifetimes are known.

The API, roughly:

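//go:build goexperiment.arenas

package main

import "arena"

type Point struct{ X, Y int }

func main() {
	a := arena.NewArena()
	defer a.Free() // free everything at once

	// Allocate inside the arena
	p := arena.New[Point](a)
	s := arena.MakeSlice[byte](a, 1024, 1024)
	_, _ = p, s // must not be used after a.Free()
}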

The motivation: in workloads like protobuf decoding, every request allocates dozens of small structs that all die at request end. The GC traces them, sweeps them, accounts for them in pacing, and the work is wasted because the lifetime is bounded by the handler. An arena lets you allocate them in a pre-known region, and a.Free() discards the whole thing without GC involvement. The arena proposal reported CPU and memory savings of up to about 15% on large applications.
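The bump-pointer idea behind an arena can be sketched without the experimental package. This toy byteArena is an assumption-laden illustration (hypothetical type and method names, a single fixed buffer, byte slices only) and is not how the arena package is implemented; it also shows why lifetimes are the hard part.

```go
package main

import "fmt"

// byteArena is a toy bump allocator over one pre-allocated buffer.
type byteArena struct {
	buf []byte
	off int // next free offset in buf
}

// newByteArena reserves size bytes up front with a single allocation.
func newByteArena(size int) *byteArena {
	return &byteArena{buf: make([]byte, size)}
}

// alloc returns n zeroed bytes from the arena, or nil if it is exhausted.
// The cost is a bounds check and an addition; there is no per-object metadata.
func (a *byteArena) alloc(n int) []byte {
	if n < 0 || a.off+n > len(a.buf) {
		return nil
	}
	p := a.buf[a.off : a.off+n : a.off+n] // cap-limited so appends cannot overrun
	a.off += n
	return p
}

// reset discards every allocation at once by rewinding the offset.
// Slices handed out earlier still alias the buffer: using them after reset
// is this sketch's analogue of a use-after-free.
func (a *byteArena) reset() {
	for i := range a.buf[:a.off] {
		a.buf[i] = 0
	}
	a.off = 0
}

func main() {
	a := newByteArena(1 << 10)
	header := a.alloc(16)
	body := a.alloc(256)
	copy(header, "request-header")
	fmt.Println(len(header), len(body), a.off)

	a.reset() // everything from this "request" is gone in one step
	next := a.alloc(16)
	next[0] = 'X'
	fmt.Printf("stale header now reads %q\n", header[0]) // aliasing hazard
}
```

The last two lines are the safety problem in miniature: nothing in the type system stops the stale header slice from being read after the region has been reused.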

Why experimental and why pushback? The safety concern: nothing prevents an arena pointer from escaping the arena's scope. arena.New[Point](a) returns *Point; if you store it in a global, the pointer is still there after a.Free(), but the arena's underlying memory is gone: use-after-free. The current implementation makes a best-effort attempt to turn such accesses into crashes rather than silent corruption, but it's an escape hatch from Go's memory safety. The argument against making it stable: Go's value proposition is "safe by default"; arenas create a class of bugs that look like Go but behave like C. The compromise: keep it experimental, learn from real usage, decide later. Some teams use it for known-safe workloads (proto decoding inside generated code); broader adoption is intentionally slow.

Staff move: the proposal frames arenas as a performance feature, but the deeper question is whether Go should support manual memory management at all. Go has historically said "the GC is good enough, write idiomatic code"; arenas concede that for some workloads it isn't. Whether they ship in stable Go depends on whether the safety model can be tightened (generics-based lifetime tracking, escape analysis improvements) without turning arenas into Rust lifetimes — which would change the language's character.

Follow-up: What's the alternative to arenas for protobuf decoding? Answer: aggressive sync.Pool-ing of the message types (works for one-message-at-a-time decoding but not for nested messages with independent lifetimes), code-generated zero-copy decoders (works but limits message structure), or accepting the GC overhead and optimizing elsewhere. Each is partial; arenas are the cleanest fix when they apply.

Q26. How does `GOMEMLIMIT` interact with the allocator pacing?

A: GOMEMLIMIT (Go 1.19+) adds a soft cap on total memory the runtime will use; when memory approaches the limit, the GC and scavenger run more aggressively to bring it down, even at the cost of CPU.

Without GOMEMLIMIT, Go's GC pacing is controlled entirely by GOGC — the heap grows by GOGC% before the next GC. If your workload has a sudden allocation burst, the heap can spike to 10× its steady-state size; without intervention, you might OOM. GOMEMLIMIT adds the rule: total memory (heap + stacks + runtime metadata + everything in runtime.MemStats.Sys) should not exceed the limit. If it would, the GC triggers earlier than GOGC would say — potentially much earlier. The pacer becomes dual-objective: hit the GOGC target unless doing so would exceed GOMEMLIMIT. When the limit is being respected, GOGC drives the cadence; when memory is tight, the limit takes over.

The behavioural shift. With GOMEMLIMIT=2GiB and a normally-1.5 GiB workload, the runtime allows the usual GC behaviour until you approach 2 GiB. Then: (a) GC frequency increases (more cycles per minute), (b) scavenger gets more aggressive (more madvise calls), (c) eventually mutator assist forces every allocation to do GC work. CPU usage climbs because the GC is working harder. The application slows down — but doesn't OOM. The trade is explicit: extra CPU in exchange for bounded memory.

When to set it. Containerized deployments with hard memory limits (Kubernetes pods, Docker --memory). Set GOMEMLIMIT to ~90% of the container limit, leaving headroom for non-Go memory (cgo, mmap, kernel overhead). Without it, Go happily allocates up to the container's hard limit, the OOM killer fires, and your pod dies. With it, Go pre-empts itself before OOM, taking a latency hit but staying alive. The 90% guideline is rough — the safe margin depends on how much non-heap memory your program uses.
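Setting the limit from inside the process is one way to apply the 90% guideline. The sketch below reads the cgroup v2 file /sys/fs/cgroup/memory.max and calls debug.SetMemoryLimit; parseCgroupLimit and softLimit are hypothetical helpers, the path assumes cgroup v2 as seen from inside the container, and many deployments simply set the GOMEMLIMIT environment variable instead.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// parseCgroupLimit parses the contents of a cgroup v2 memory.max file.
// It returns ok=false for "max" (no limit) or anything unparseable.
func parseCgroupLimit(raw string) (limit int64, ok bool) {
	s := strings.TrimSpace(raw)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, false
	}
	return n, true
}

// softLimit returns the given percentage of the container limit.
func softLimit(containerLimit, percent int64) int64 {
	return containerLimit / 100 * percent
}

func main() {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		fmt.Println("no cgroup v2 limit visible; leaving the memory limit unchanged")
		return
	}
	limit, ok := parseCgroupLimit(string(raw))
	if !ok {
		fmt.Println("container has no memory limit; leaving the memory limit unchanged")
		return
	}
	target := softLimit(limit, 90)
	prev := debug.SetMemoryLimit(target) // returns the previously set limit
	fmt.Printf("container limit=%d bytes, memory limit set to %d (was %d)\n", limit, target, prev)
}
```

The percentage is the knob to revisit: a program with heavy cgo or mmap use needs more headroom than 10%.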

Staff move: GOMEMLIMIT is one of the most important runtime changes in years because it lets operators express memory budgets without per-app tuning. Before it, the only options were "set GOGC very low and waste CPU all the time" or "let it OOM". GOMEMLIMIT gives a third option: "stay under this number; spend whatever CPU you need". For multi-tenant Kubernetes clusters, that's the right default.

Follow-up: What happens if GOMEMLIMIT is set lower than the live heap? Answer: the GC runs back to back and mutator assists climb, so the program spends a large share of its CPU on GC instead of work. To avoid a full "death spiral", the runtime caps GC at roughly 50% of CPU time and, past that point, lets memory use exceed the limit; the limit is soft, so an OOM kill is still possible if the container ceiling is close. Either way it is a sign you've under-provisioned. The right response is to raise the limit or reduce the live heap and allocation rate; the runtime can't conjure memory it doesn't have.

6. What NOT to say: common interview mistakes

These signal a weak candidate.

"Everything is on the heap in Go." False: escape analysis keeps many short-lived values on the stack. The interviewer wants to hear about escape analysis; saying "everything is on the heap" tells them you've never read -gcflags="-m" output.
"new allocates on heap, var allocates on stack." Both go through escape analysis. The keyword is irrelevant; what matters is whether the address escapes the function.
"Go uses malloc internally." It does not. Go has its own allocator in runtime/malloc.go. The name shares a prefix; the implementations have nothing in common.
"sync.Pool is a memory pool that holds objects forever." The GC empties it: an idle object is dropped within a GC cycle or two. If you "pool" objects expecting them to survive minutes, you're allocating-with-extra-steps.
"You can force stack allocation with & or heap with new." Neither does that. Escape analysis decides; you influence it only by changing how the result is used.
"The GC immediately returns memory to the OS." It doesn't: freed pages are released later by a background scavenger, and if the runtime is using MADV_FREE (the Linux default in Go 1.12 through 1.15, opt-in via GODEBUG=madvdontneed=0 since) they may stay resident until kernel pressure.
"GOGC=off improves performance." Only for short-running batch jobs. In a long-running service, it means the heap grows unboundedly until you OOM.
"Channels and maps are allocated on the stack with make." Not in general: make for a channel allocates it on the heap, and maps usually end up there too; the compiler can keep a small map on the stack only when it provably does not escape.
"Size classes are powers of 2." They're hand-tuned (~67 classes), with geometric-but-not-power-of-2 spacing to bound waste at ~12.5%. Saying "powers of 2" is the classic wrong-textbook answer.
"Go's allocator is the same as jemalloc." Surface similarities (size classes, thread caches) hide deep differences (GC integration, pointer bitmaps, scheduler coupling). Anyone who says they're equivalent hasn't read either.
"RSS = HeapAlloc." Wrong on every level: RSS is OS-resident pages; HeapAlloc is bytes currently allocated to Go objects; the gap can be 10× in normal conditions.
"runtime.GC() should be called in production to keep memory low." Almost never the right answer. The GC's pacing is better than your guess; forcing GC just adds pause time without lowering steady-state memory. Reserve runtime.GC() for benchmarks and post-batch cleanup.
"Goroutine stacks are unlimited." They start at 2 KB and grow as needed, but they're bounded by a maximum that debug.SetMaxStack (in runtime/debug) adjusts; the default is 1 GB on 64-bit systems. A runaway recursive function will eventually crash with a stack overflow, not "infinite stack".
"arena.NewArena() is safe to use today." It's experimental, gated behind GOEXPERIMENT=arenas, and has known unsafety patterns. Don't claim it's production-ready.
"GOMEMLIMIT is a hard cap." It's a soft limit: the runtime works to stay under it but can exceed it. For a hard cap, use container limits (cgroups). The two work together.

7. 5-minute prep checklist

Read this once before walking into the interview. If you can speak each line in plain English without notes, you're ready.

The allocator's three layers. mcache (per-P, lock-free fast path) → mcentral (per-size-class, locked) → mheap (global, page-grained, radix tree).
Size classes. ~67 of them, hand-tuned, 8 bytes to 32 KB; rounds requests up to the next class; one span per class holds many same-size slots.
Tiny allocator. Packs sub-16-byte non-pointer allocations into shared 16-byte slots; reduces per-object overhead for common small structs.
Large objects. > 32 KB; skip size classes, get their own span direct from mheap; acquire the heap lock per allocation.
new vs make. new(T) returns *T zeroed; make returns the value, only works for slice/map/chan, initializes internal state.
new does not force heap. Escape analysis decides; new(T) with non-escaping result stays on stack.
Escape triggers. Return pointer, store in interface, capture in escaping closure, send on channel, too large for the stack (over 64 KB for implicit allocations such as new, make, or &T{}).
Inspect with -gcflags="-m". escapes to heap, moved to heap, does not escape, leaking param.
sync.Pool quirks. Per-P; cleared at every GC; never Put an object you've also returned to caller.
GC pacing. GOGC=100 means heap doubles before GC; mutator assist forces allocators to help collect when behind.
GOMEMLIMIT. Soft total-memory cap; trades CPU for memory; set to ~90% of container limit.
Non-compacting. Objects never move; fragmentation possible but bounded by size classes.
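OS return is lazy. The background scavenger releases idle pages gradually; on Linux it uses MADV_DONTNEED by default since Go 1.16 (MADV_FREE, where the kernel keeps pages until memory pressure, was the default in Go 1.12 to 1.15 and is still selectable with GODEBUG=madvdontneed=0); RSS may not drop immediately.
MemStats vocabulary. HeapAlloc (allocated objects, including unreachable ones not yet swept), HeapInuse (spans with at least one object), HeapIdle (spans with no objects), HeapReleased (idle memory returned to the OS), HeapSys (heap memory obtained from the OS), Sys (total obtained by the runtime).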
Allocator source files. runtime/malloc.go, runtime/mcache.go, runtime/mcentral.go, runtime/mheap.go, runtime/mpagealloc.go, runtime/sizeclasses.go.
Stdlib Fail-Fast at allocator level. debug.SetMaxStack (runtime/debug) for stack runaway, GOMEMLIMIT for memory runaway (soft, so pair it with a container limit), an unrecoverable fatal error rather than a panic on OOM during alloc.
When arena package applies. Request-scoped allocations, decoder scratch buffers, known-batch lifetimes; still experimental.
Diagnostic flow. GODEBUG=gctrace=1 for GC visibility; runtime.ReadMemStats for per-bucket numbers; pprof heap for allocation sites; pprof goroutine for stack leaks; /proc/PID/smaps for non-Go mappings.
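The MemStats vocabulary and the runtime.ReadMemStats step of the diagnostic flow can be rehearsed with a short program. A minimal sketch: heapSnapshot, readHeap, retainedIdle, and holdAndMeasure are hypothetical names introduced for illustration, and the 64 KiB allocation size is chosen only because it is above the 32 KB large-object threshold:

```go
package main

import (
	"fmt"
	"runtime"
)

// heapSnapshot copies the MemStats fields named in the checklist.
// Illustrative type, not part of the runtime API.
type heapSnapshot struct {
	Alloc, Inuse, Idle, Released, HeapSys, Sys uint64
}

// readHeap takes one snapshot. ReadMemStats stops the world briefly,
// so it belongs in diagnostics, not in hot paths.
func readHeap() heapSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapSnapshot{
		Alloc: m.HeapAlloc, Inuse: m.HeapInuse, Idle: m.HeapIdle,
		Released: m.HeapReleased, HeapSys: m.HeapSys, Sys: m.Sys,
	}
}

// retainedIdle estimates idle heap memory not yet returned to the OS
// (HeapIdle minus HeapReleased).
func (s heapSnapshot) retainedIdle() uint64 {
	if s.Released > s.Idle {
		return 0
	}
	return s.Idle - s.Released
}

// sink keeps the buffers reachable so the GC cannot free them.
var sink [][]byte

// holdAndMeasure allocates n buffers of size bytes, keeps them live,
// and returns a snapshot taken afterwards.
func holdAndMeasure(n, size int) heapSnapshot {
	for i := 0; i < n; i++ {
		sink = append(sink, make([]byte, size))
	}
	return readHeap()
}

func main() {
	before := readHeap()
	after := holdAndMeasure(1024, 64<<10) // 1024 large objects, 64 MiB total
	fmt.Printf("HeapAlloc: %d -> %d bytes\n", before.Alloc, after.Alloc)
	fmt.Printf("HeapInuse: %d, HeapIdle: %d, HeapReleased: %d\n",
		after.Inuse, after.Idle, after.Released)
	fmt.Printf("idle but not released: %d bytes\n", after.retainedIdle())
	fmt.Printf("HeapSys: %d, Sys: %d\n", after.HeapSys, after.Sys)
}
```

Running it with GODEBUG=gctrace=1 prints the GC cycles alongside these numbers, which is a quick way to connect the two views in the diagnostic flow.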

8. Further reading¶

runtime/malloc.go: https://cs.opensource.google/go/go/+/refs/heads/master:src/runtime/malloc.go — the canonical entry point for allocation. Read mallocgc top to bottom; everything else hangs off it.
runtime/sizeclasses.go: https://cs.opensource.google/go/go/+/refs/heads/master:src/runtime/sizeclasses.go — the size class table. Machine-generated; read the header comment for the design rationale.
runtime/mheap.go and runtime/mpagealloc.go: https://cs.opensource.google/go/go/+/refs/heads/master:src/runtime/mheap.go — the heap and page allocator. The radix tree implementation is in mpagealloc.go.
"Go's Memory Allocator" — Andre Carvalho: https://andrestc.com/post/go-memory-allocator/ — a walkthrough with diagrams of how mallocgc flows through the three caches. Good for solidifying the mental model.
"Getting to Go: The Journey of Go's Garbage Collector" — Rick Hudson: https://go.dev/blog/ismmkeynote — the GC's evolution; explains pacing, write barriers, and the trade-offs that shape the allocator.
"A Guide to the Go Garbage Collector": https://tip.golang.org/doc/gc-guide — the official, current-as-of-the-runtime guide. Covers GOGC, GOMEMLIMIT, pacing, and tuning. Read once after every major Go release.
"Proposal: arena package" — Cherry Mui & Russ Cox: https://github.com/golang/go/issues/51317 — the design discussion for arenas, including the safety concerns. Required reading if you want to argue about manual memory management in Go.