Memory Management in Depth - Senior

1. The runtime in one mental model

Hold these facts as a single picture:

P is the scheduling unit. It owns an mcache, so the small-object fast path is lock-free per P.
The heap lives in a sparse virtual address space (up to 256 TiB addressable with amd64's 48-bit addresses) and grows by mapping arenas (64 MiB each on 64-bit Linux).
The collector is concurrent and non-moving. It never compacts; it relies on size classes to keep fragmentation bounded.
The pacer is feedback-controlled. It adjusts the GC trigger point and the mark-assist debt so that the heap lands near the target just as marking finishes.
The OS-return mechanism is asynchronous. Idle pages are returned with MADV_FREE (lazy) or MADV_DONTNEED (immediate); reported RSS lags behind HeapReleased.

If you internalize those five points, almost every "weird" memory observation becomes predictable.

2. Pacer math, demystified

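Let H be the live heap at the end of cycle n-1. In simplified form (ignoring the GC roots, meaning goroutine stacks and globals, that Go 1.18+ also folds into the GOGC term), the pacer picks a target for cycle n: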

target = max(H × (1 + GOGC/100), minHeap)

It then sets a trigger (when to start marking) earlier than the target so that, given the current allocation rate r and mark rate m, marking finishes just as the heap reaches target.

If allocation outpaces marking, the runtime forces user goroutines to do mark assist — for every k bytes you allocate, you must mark c bytes. That assist debt is invisible until you read traces and notice that your goroutines spent time in runtime.gcAssistAlloc1.
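The target formula above can be checked with a few lines of arithmetic. This is a minimal sketch of the simplified formula only, not the runtime's pacer; the function name `heapTarget` and the 4 MiB floor passed as `minHeap` are illustrative assumptions.

```go
package main

import "fmt"

// heapTarget applies the simplified formula from the text:
// target = max(live * (1 + GOGC/100), minHeap).
// It is a reasoning aid, not the runtime's actual pacer code.
func heapTarget(live, minHeap uint64, gogc int) uint64 {
	target := live + live*uint64(gogc)/100
	if target < minHeap {
		return minHeap
	}
	return target
}

func main() {
	const MiB = 1 << 20
	for _, gogc := range []int{50, 100, 200} {
		t := heapTarget(48*MiB, 4*MiB, gogc)
		fmt.Printf("GOGC=%d live=48 MiB target=%d MiB\n", gogc, t/MiB)
	}
}
```

With a 48 MiB live heap the targets come out at 72, 96, and 144 MiB, which makes the RSS cost of raising GOGC concrete.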

Two practical levers:

Raise GOGC (or set debug.SetGCPercent) to widen the headroom and cut total GC CPU at the cost of higher RSS.
Set GOMEMLIMIT to bound the memory the runtime manages (a soft limit, not a hard RSS cap). The pacer will then GC more aggressively, possibly continuously, as you push toward the limit. Programs that thrash near the limit need to allocate and retain less, not a higher limit.
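Both levers can also be pulled at run time through `runtime/debug`. The sketch below is one possible wrapper; the helper names `applyKnobs` and `currentLimit` and the values 200 and 512 MiB are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyKnobs sets the GC percentage and the soft memory limit at run time
// and returns the previous values so the caller can restore them.
func applyKnobs(gogc int, limitBytes int64) (prevGOGC int, prevLimit int64) {
	prevGOGC = debug.SetGCPercent(gogc)
	prevLimit = debug.SetMemoryLimit(limitBytes)
	return prevGOGC, prevLimit
}

// currentLimit reads the limit without changing it: a negative argument
// to SetMemoryLimit only queries the current value.
func currentLimit() int64 { return debug.SetMemoryLimit(-1) }

func main() {
	prevGOGC, prevLimit := applyKnobs(200, 512<<20)
	fmt.Println("previous GOGC:", prevGOGC, "previous limit:", prevLimit)
	fmt.Println("limit now:", currentLimit())
	applyKnobs(prevGOGC, prevLimit) // restore the original settings
}
```

Passing -1 as the percentage is the programmatic equivalent of GOGC=off.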

3. `GOMEMLIMIT`: when to use it, when not to

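GOMEMLIMIT (Go 1.19+) is a soft cap on the total memory the Go runtime manages (heap, goroutine stacks, GC metadata, and a few smaller buckets). It does not count memory the runtime doesn't manage, such as C allocations made through cgo or the mapped binary itself.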

Scenario	Use `GOMEMLIMIT`?
Containerized app with a hard cgroup limit	Yes — set to ~90% of the cgroup limit to avoid OOMKill
Bursty workload with idle heap retained	Yes — bounds steady-state RSS
GC-CPU starved batch job	No — let `GOGC` rise instead
You can't tell if you're allocation-bound or memory-bound	Measure first
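For the containerized row, the 90% figure can be derived at startup instead of hard-coded. This is a sketch under assumptions: it reads the cgroup v2 file `/sys/fs/cgroup/memory.max` as seen from inside the container, and `softLimit` is a hypothetical helper name. A cgroup v1 host would need a different path.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// softLimit parses the contents of a cgroup v2 memory.max file and returns
// 90% of it. ok is false when the cgroup is unlimited ("max") or unparsable.
func softLimit(raw string) (limit int64, ok bool) {
	s := strings.TrimSpace(raw)
	if s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, false
	}
	return n / 10 * 9, true
}

func main() {
	// Assumed path: cgroup v2 unified hierarchy inside the container.
	raw, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		fmt.Println("no cgroup v2 limit found:", err)
		return
	}
	if limit, ok := softLimit(string(raw)); ok {
		debug.SetMemoryLimit(limit)
		fmt.Println("memory limit set to", limit, "bytes")
	} else {
		fmt.Println("cgroup is unlimited; leaving the memory limit alone")
	}
}
```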

The combination GOGC=off + GOMEMLIMIT=X is a special case: the GC never runs by ratio, only when total memory approaches X. Useful for spiky allocation patterns where you want all the headroom up to a ceiling you have sized yourself.

4. Stack copying and pointer rewriting

When a goroutine's stack must grow, the runtime:

Allocates a new contiguous stack twice the size of the old one.
Walks every frame using the stack maps (pointer bitmaps) the compiler emits for each function.
For each pointer that points into the old stack, computes its offset and writes the corresponding address in the new stack.
Updates the goroutine's bookkeeping (g.stack, g.stackguard0, g.sched.sp), frees the old stack, and resumes.
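The effect of these steps is visible from ordinary code: a pointer to a local keeps working across a growth even if the local's address changes. The sketch below is illustrative only; `deepen` and `valueAcrossGrowth` are hypothetical names, and whether the address actually moves depends on the compiler's escape decision and the current stack size, so the program reports it rather than assuming it.

```go
package main

import (
	"fmt"
	"unsafe"
)

// deepen recurses with a padded frame so the goroutine stack has to grow.
//
//go:noinline
func deepen(depth int) int {
	var pad [256]byte
	pad[depth%len(pad)] = 1
	if depth == 0 {
		return int(pad[0])
	}
	return deepen(depth-1) + int(pad[depth%len(pad)])
}

// valueAcrossGrowth takes a pointer to a local, forces stack growth, and
// reads through the pointer again. moved reports whether the numeric
// address changed; it stays false if x was placed on the heap.
func valueAcrossGrowth() (value int, moved bool) {
	x := 42
	p := &x
	before := uintptr(unsafe.Pointer(p))
	deepen(20000) // several MiB of frames
	after := uintptr(unsafe.Pointer(p))
	return *p, before != after
}

func main() {
	v, moved := valueAcrossGrowth()
	fmt.Println("value:", v, "address changed:", moved)
}
```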

The rule that a stack address never escapes into longer-lived storage exists because of step 3: if a stack address were stored in a location outside the goroutine (heap, global, channel), the runtime would have no record of it to rewrite. Escape analysis prevents this at compile time by moving such values to the heap; you can't sneak around it.

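Pathological case: a deeply recursive function that grows a stack to several MiB. Stacks double on each growth, so the number of copies is logarithmic, but each copy is O(stack size), and a goroutine that repeatedly grows, is shrunk by the GC, and grows again pays that cost every time. If you see this in a profile, restructure the algorithm. There is no public API to pre-grow a stack: runtime/debug.SetMaxStack only changes the ceiling at which the runtime aborts on stack overflow.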
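One common restructuring is to replace recursion with an explicit stack kept in a slice, so depth costs slice growth on the heap instead of goroutine stack copies. A minimal sketch, assuming a hypothetical binary tree type `node`:

```go
package main

import "fmt"

type node struct {
	left, right *node
}

// countNodes walks the tree with an explicit, heap-allocated stack
// instead of recursion.
func countNodes(root *node) int {
	if root == nil {
		return 0
	}
	count := 0
	stack := []*node{root}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		count++
		if n.left != nil {
			stack = append(stack, n.left)
		}
		if n.right != nil {
			stack = append(stack, n.right)
		}
	}
	return count
}

// chain builds a degenerate, list-shaped tree of the given depth,
// the worst case for a recursive walk.
func chain(depth int) *node {
	var root *node
	for i := 0; i < depth; i++ {
		root = &node{left: root}
	}
	return root
}

func main() {
	fmt.Println(countNodes(chain(1_000_000))) // 1000000
}
```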

5. Write barriers, hybrid

Go's barrier is hybrid: deletion (Yuasa) + insertion (Dijkstra). On a pointer store *slot = ptr during marking:

The previously stored pointer (*slot before write) is shaded grey (deletion barrier).
The newly stored pointer (ptr) is shaded grey (insertion barrier).

This permits stack scanning without rescans — once a stack is scanned, the barrier alone is enough to maintain the invariant. Practically, this kept STW pauses sub-millisecond after Go 1.8.

You see the barrier in benchmarks as a small per-pointer-store overhead during the mark phase. It is not optional.
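The two shading rules can be modeled on toy objects. This is a conceptual sketch, not runtime code: the `object` type, the `color` values, and `writePointer` are all illustrative, and the real barrier is emitted by the compiler and works on mark bits rather than a color field.

```go
package main

import "fmt"

type color int

const (
	white color = iota // not yet reached by the collector
	grey               // reached, fields not yet scanned
	black              // reached and fully scanned
)

type object struct {
	color color
	ref   *object
}

// shade greys a white object so the marker will visit it.
func shade(o *object) {
	if o != nil && o.color == white {
		o.color = grey
	}
}

// writePointer models the hybrid barrier for the store *slot = ptr.
func writePointer(slot **object, ptr *object) {
	shade(*slot) // deletion (Yuasa) half: the overwritten referent
	shade(ptr)   // insertion (Dijkstra) half: the new referent
	*slot = ptr
}

func main() {
	oldTarget, newTarget := &object{}, &object{}
	holder := &object{color: black, ref: oldTarget}
	writePointer(&holder.ref, newTarget)
	fmt.Println(oldTarget.color == grey, newTarget.color == grey) // true true
}
```

Because both referents end up grey, a black holder can never hide a white object, which is the invariant that makes stack rescans unnecessary.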

6. Finalizers, the trap

runtime.SetFinalizer(obj, func(o *Obj) { o.Close() })

What seniors learn the hard way:

Finalizers run after the object becomes unreachable, in a separate goroutine, in unspecified order.
They resurrect the object for one more cycle so the finalizer can read its fields. This delays reclamation.
They are not guaranteed to run before program exit. Never depend on them for visible side effects.
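A cycle that contains an object with a finalizer is not guaranteed to be collected, and the finalizer is not guaranteed to run. Two objects with finalizers pointing at each other can live forever.
Calling SetFinalizer on an object that already has one aborts the program with a fatal error (clear the old one first with SetFinalizer(obj, nil)), and obj must be a pointer to an allocated object, not a plain value.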
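One way to live with the points above is to treat the finalizer as a backstop behind an explicit Close, and to clear it once Close has run. A minimal sketch, where `Resource`, `Open`, and `errClosed` are hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// Resource stands in for a type that owns a non-memory resource.
type Resource struct {
	closed bool
}

var errClosed = errors.New("resource already closed")

// Open returns a Resource whose finalizer is a backstop only; callers
// are still expected to call Close.
func Open() *Resource {
	r := &Resource{}
	runtime.SetFinalizer(r, func(r *Resource) { r.Close() })
	return r
}

// Close releases the resource deterministically and clears the finalizer,
// so the object no longer needs an extra GC cycle to be reclaimed.
func (r *Resource) Close() error {
	if r.closed {
		return errClosed
	}
	r.closed = true
	runtime.SetFinalizer(r, nil)
	return nil
}

func main() {
	r := Open()
	fmt.Println("first Close:", r.Close())
	fmt.Println("second Close:", r.Close())
}
```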

For Go 1.24+ prefer runtime.AddCleanup: multiple cleanups per object, no resurrection, and the cleanup function never receives (and must not capture) the object itself, which removes most of the footguns. Existing finalizer code should migrate.

7. `runtime.KeepAlive`, the underused friend

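buf := allocateCBuffer()      // Go wrapper; assume its finalizer frees buf.ptr with C.free

_, err := C.write(fd, buf.ptr, C.size_t(buf.n))
runtime.KeepAlive(buf)        // keep the wrapper reachable until write() has returned

Without KeepAlive, the compiler can decide that buf's last Go use is reading buf.ptr for the argument list. A concurrent GC could then find the wrapper unreachable and run its finalizer, freeing the C memory while write() is still using it. KeepAlive extends the wrapper's lifetime to that program point. You need it whenever the lifetime of a non-Go resource (C memory, a file descriptor) is tied to a Go object through a finalizer or cleanup and you hand the raw resource to a call. Go pointers passed directly as cgo arguments are kept alive by cgo for the duration of the call; C code that must hold a Go pointer after the call returns needs runtime.Pinner (Go 1.21+), not KeepAlive.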

(The 1.24 cleanup API does not keep the object alive, which is part of what makes it safer than finalizers — but it also means you still need KeepAlive at C boundaries.)
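The same lifetime problem can be reproduced without cgo. In this sketch a package-level map stands in for memory that lives outside the Go heap, and a Go handle's finalizer releases the entry; `table`, `handle`, `newHandle`, and `size` are all hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// table simulates a resource owned by something other than the Go heap
// (C memory, a descriptor table). It is purely illustrative.
var (
	mu    sync.Mutex
	table = map[int][]byte{}
	next  int
)

type handle struct{ id int }

// newHandle registers n bytes and ties their lifetime to the handle
// with a finalizer.
func newHandle(n int) *handle {
	mu.Lock()
	defer mu.Unlock()
	next++
	table[next] = make([]byte, n)
	h := &handle{id: next}
	runtime.SetFinalizer(h, func(h *handle) {
		mu.Lock()
		delete(table, h.id)
		mu.Unlock()
	})
	return h
}

// size looks the resource up through the raw id. Once id is copied out,
// h has no further uses, so without KeepAlive the finalizer would be
// allowed to run before the lookup.
func size(h *handle) int {
	id := h.id
	mu.Lock()
	n := len(table[id])
	mu.Unlock()
	runtime.KeepAlive(h) // h stays reachable until this point
	return n
}

func main() {
	fmt.Println(size(newHandle(16))) // 16
}
```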

8. `sync.Pool` semantics

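var bufPool = sync.Pool{
    // Pool a *[]byte: putting a bare slice into an any allocates a copy of the slice header.
    New: func() any { b := make([]byte, 0, 4096); return &b },
}

bp := bufPool.Get().(*[]byte)
b := (*bp)[:0]                               // reuse the backing array, reset the length
defer func() { *bp = b; bufPool.Put(bp) }()  // store the (possibly grown) slice back before Put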

What seniors must remember:

Pool contents are dropped by the GC. Since Go 1.13 an idle item survives one cycle in a victim cache and is freed by the next. It's a hint, not a cache.
Per-P storage with stealing. Each P has a private slot plus a shared queue; Get tries the local P first and steals from other Ps only on a miss.
Don't put oversized values back. A 1 MiB buffer in a hot pool keeps that memory alive for as long as the pool stays busy; better to drop it if growth exceeds a threshold.
Pools pay off only on hot paths. A cold pool is pure overhead.

Use sync.Pool for high-frequency, short-lived, similarly-sized allocations (HTTP request scratch buffers, JSON encoders). Don't reach for it before measuring.
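The "don't put oversized values back" rule can be enforced in a small wrapper around Put. This is a sketch under assumptions: the 64 KiB threshold and the names `recycle` and `greet` are illustrative, and the right cap should come from measurements of your own buffers.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// maxPooledCap is an assumed threshold; tune it from measurements.
const maxPooledCap = 64 << 10

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// recycle returns b to the pool unless it has grown past the cap,
// and reports whether it was pooled.
func recycle(b *bytes.Buffer) bool {
	if b.Cap() > maxPooledCap {
		return false // let the GC reclaim the oversized buffer
	}
	b.Reset()
	bufPool.Put(b)
	return true
}

// greet builds a string in a pooled scratch buffer.
func greet(name string) string {
	b := bufPool.Get().(*bytes.Buffer)
	defer recycle(b)
	b.WriteString("hello, ")
	b.WriteString(name)
	return b.String() // String copies, so the result outlives the buffer
}

func main() {
	fmt.Println(greet("gopher"))
	big := bytes.NewBuffer(make([]byte, 0, 1<<20))
	fmt.Println("1 MiB buffer pooled:", recycle(big))
}
```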

9. `MADV_FREE` vs `MADV_DONTNEED`

On Linux, the runtime decides how to give pages back to the OS:

Mode	Behavior	RSS effect
`MADV_FREE` (default in Go 1.12 through 1.15 on Linux ≥ 4.5; opt in since Go 1.16 with `GODEBUG=madvdontneed=0`)	Pages are eligible for reclaim under memory pressure, but still counted as RSS until then	RSS appears flat after `debug.FreeOSMemory()`
`MADV_DONTNEED` (default before Go 1.12 and again since Go 1.16)	Physical pages are dropped immediately; the mapping stays and pages fault back in zeroed on next touch	RSS drops immediately, but next touch incurs a page fault

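For dashboards: a "memory leak" that's just retained idle pages shows up in one set of metrics (flat heap in use, large idle or released heap); a real leak shows up in another (a live heap that keeps growing). On Go 1.12 through 1.15, GODEBUG=madvdontneed=1 forces eager return and is what you set in cgroup-bounded containers when you'd rather pay the page-fault cost than report inflated RSS. Since Go 1.16 that is already the default.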
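To tell the two cases apart from inside the process, the heap can be split into in-use, idle-but-retained, and released parts with `runtime.ReadMemStats`. A minimal sketch; `idleBreakdown` is a hypothetical helper name, and in production `runtime/metrics` is the lighter-weight source for the same numbers.

```go
package main

import (
	"fmt"
	"runtime"
)

// idleBreakdown splits the heap into spans in use, idle spans the process
// still holds, and idle spans already released to the OS.
func idleBreakdown() (inUse, retainedIdle, released uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if m.HeapIdle > m.HeapReleased { // guard against unsigned underflow
		retainedIdle = m.HeapIdle - m.HeapReleased
	}
	return m.HeapInuse, retainedIdle, m.HeapReleased
}

func main() {
	inUse, retained, released := idleBreakdown()
	fmt.Printf("in use: %d KiB, idle but retained: %d KiB, released: %d KiB\n",
		inUse>>10, retained>>10, released>>10)
}
```

A growing in-use figure points at a real leak; a large retained or released figure with flat in-use points at pages the runtime simply has not handed back or the OS has not yet reclaimed.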

10. Reading a GC trace line

gc 23 @4.821s 6%: 0.040+1.8+0.014 ms clock, 0.32+0.10/3.5/9.2+0.11 ms cpu, 76->81->48 MB, 81 MB goal, 8 P

Field	Meaning
`gc 23`	23rd cycle since start
`@4.821s`	Time since process start
`6%`	Fraction of CPU spent in GC so far
`0.040+1.8+0.014 ms clock`	STW sweep term + concurrent mark + STW mark term, wall clock
`0.32+0.10/3.5/9.2+0.11 ms cpu`	Same phases in CPU time; the middle term splits into assist/background/idle mark time
`76->81->48 MB`	Heap size: at GC start -> at GC end -> live
`81 MB goal`	Pacer's target
`8 P`	GOMAXPROCS during this cycle

If the final live (48 MB above) is dropping but the goal (81) keeps rising, you've got an allocation burst that hasn't propagated yet. If the GC CPU share (the 6% field; runtime.MemStats.GCCPUFraction reports the same quantity as a fraction) climbs above ~25%, you're allocation-bound: fix the code, not the knobs.
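To track live heap against the goal over many cycles, the heap fields can be pulled out of each trace line. This is a sketch with assumed names (`heapRe`, `heapSizes`, `parseHeap`); it relies on the runtime printing the heap triple with ASCII `->` arrows and tolerates the extra stacks and globals fields that newer Go versions print after the goal.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// heapRe matches the "start->end->live MB, goal MB goal" part of a line
// produced with GODEBUG=gctrace=1.
var heapRe = regexp.MustCompile(`(\d+)->(\d+)->(\d+) MB, (\d+) MB goal`)

type heapSizes struct{ Start, End, Live, Goal int }

// parseHeap extracts the heap sizes in MB; ok is false if the line does
// not look like a gctrace line.
func parseHeap(line string) (heapSizes, bool) {
	m := heapRe.FindStringSubmatch(line)
	if m == nil {
		return heapSizes{}, false
	}
	var v [4]int
	for i := range v {
		v[i], _ = strconv.Atoi(m[i+1]) // the regexp admits digits only
	}
	return heapSizes{v[0], v[1], v[2], v[3]}, true
}

func main() {
	line := "gc 23 @4.821s 6%: 0.040+1.8+0.014 ms clock, " +
		"0.32+0.10/3.5/9.2+0.11 ms cpu, 76->81->48 MB, 81 MB goal, 8 P"
	h, ok := parseHeap(line)
	fmt.Println(h, ok) // {76 81 48 81} true
}
```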

11. Goroutine cost accounting

Each goroutine costs:

~2 KiB initial stack (often grows).
~200 B in the g struct and scheduler bookkeeping.
Whatever the closure or function captured.
Any object it transitively retains.

A million idle goroutines is ~2 GiB just in stacks. Goroutine leaks are usually heap leaks in disguise: the goroutine holds a closure that retains a slice that retains the rest of the request.

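import _ "net/http/pprof" // registers handlers on http.DefaultServeMux
// serve them: go http.ListenAndServe("localhost:6060", nil)
// then: go tool pprof http://localhost:6060/debug/pprof/goroutine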

Compare counts over time. Steady growth is the leak signal.
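The count itself is available in-process through `runtime.NumGoroutine`, which is cheap enough to export as a gauge. The sketch below imitates a leak with goroutines parked on a channel and measures the growth; `spawnBlocked` and `growth` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"runtime"
)

// spawnBlocked starts n goroutines that park on a channel, imitating a
// leak. The returned function releases them.
func spawnBlocked(n int) (release func()) {
	done := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-done }()
	}
	return func() { close(done) }
}

// growth reports how much runtime.NumGoroutine rises after spawning n
// blocked goroutines, then releases them.
func growth(n int) int {
	before := runtime.NumGoroutine()
	release := spawnBlocked(n)
	after := runtime.NumGoroutine()
	release()
	return after - before
}

func main() {
	fmt.Println("goroutine growth:", growth(1000))
}
```

In a real service the same before/after comparison runs across minutes or hours rather than across one function call.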

12. `debug.FreeOSMemory()`, the last-resort button

import "runtime/debug"

debug.FreeOSMemory()

Forces a GC and asks the runtime to return idle pages to the OS now. Useful in batch programs after a known peak (e.g., right after ingesting a big file), or in long-lived services right before going idle. Not a substitute for sane allocation patterns and not a regular maintenance routine — it's a hammer.

13. Summary

The Go memory system is a TCMalloc-derived allocator wrapped by a concurrent, non-moving, tri-color GC, paced against GOGC and bounded by GOMEMLIMIT. Knowing where the costs live — the write barrier during marking, the assist tax during allocation, the stack copy on growth, the MADV_FREE lag in RSS — turns "mysterious" behavior into a checklist. Reach for runtime/metrics, pprof, and gctrace before the knobs, and only reach for finalizers, runtime.GC(), or FreeOSMemory() when the alternative is worse.