Memory Management in Depth: Middle
1. The allocator hierarchy
A small heap allocation in Go travels through up to three levels, designed to keep the fast path lock-free:
goroutine → P.mcache (no lock, per-P)
↓ miss
mcentral (mutex, per size class)
↓ miss
mheap (mutex, page heap)
↓ miss
mmap from the OS
- mcache is attached to the P (processor) the goroutine is running on. Most small allocations are served from here alone, without taking a lock.
- mcentral is a global pool of partially used spans for a size class. When mcache runs out of objects in a class, it refills from mcentral.
- mheap manages the address space in 8 KiB pages. Large allocations (> 32 KiB) go straight here.
A "span" is a contiguous run of pages dedicated to one size class. Each free object inside a span is the same size, which removes the need for per-object metadata.
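The miss path through the three levels can be sketched with a toy model for a single size class. Everything here (toyCache, toyCentral, toyHeap, objectsPerSpan, allocPath) is a hypothetical name invented for illustration; the real runtime structures are far more involved, and the sketch only shows which level ends up serving each request.

```go
package main

import "fmt"

// Toy model only: these types mimic the roles of mcache, mcentral and mheap
// for one size class. They are not the runtime's real data structures.
const objectsPerSpan = 4 // hypothetical; real spans hold many more small objects

type toyHeap struct{ spansCarved int }

// newSpan stands in for carving a fresh span out of the page heap.
func (h *toyHeap) newSpan() int {
	h.spansCarved++
	return objectsPerSpan
}

type toyCentral struct {
	partialSpans int // spans with free objects waiting to be handed out
	heap         *toyHeap
}

// refill returns the free-object count of a span and the level that supplied it.
func (c *toyCentral) refill() (free int, level string) {
	if c.partialSpans > 0 {
		c.partialSpans--
		return objectsPerSpan, "mcentral"
	}
	return c.heap.newSpan(), "mheap"
}

type toyCache struct {
	free    int // free objects left in the cached span
	central *toyCentral
}

// alloc hands out one object and reports the deepest level it had to reach.
func (c *toyCache) alloc() string {
	level := "mcache"
	if c.free == 0 {
		c.free, level = c.central.refill()
	}
	c.free--
	return level
}

// allocPath performs n allocations and records where each one was served.
// seededSpans is the number of spans the central pool starts with.
func allocPath(n, seededSpans int) []string {
	cache := &toyCache{central: &toyCentral{partialSpans: seededSpans, heap: &toyHeap{}}}
	path := make([]string, 0, n)
	for i := 0; i < n; i++ {
		path = append(path, cache.alloc())
	}
	return path
}

func main() {
	fmt.Println(allocPath(6, 1))
}
```

In this model only the first allocation after a span is exhausted leaves the cache, which is the property that keeps the common case lock-free.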
2. Size classes
Small allocations are rounded up to one of ~67 fixed sizes (8, 16, 24, 32, 48, … up to 32 KiB). The full table lives in runtime/sizeclasses.go.
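Implication: struct{a int64; b int64; c byte} holds 17 bytes of fields, but alignment pads it to 24 bytes, so it lands in the 24-byte class. That's 7 bytes of padding per object. Rounding up to the next class costs memory the same way: a 40-byte object occupies a 48-byte slot. Multiplied across millions of objects, layout choices matter.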
type bad struct {
	a int64
	b int64
	c byte // 17 bytes of fields, padded to 24 → 24-byte class (7 B of padding)
}

type good struct {
	a int64
	b int64 // 16 bytes → 16-byte class (no waste)
}
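To check a layout yourself, a small sketch like the following may help. It assumes a 64-bit platform. The names padded, packed, smallClasses and classFor are made up for this example, and the class list is only an excerpt of the first few sizes, not the runtime's full table.

```go
package main

import (
	"fmt"
	"unsafe"
)

type padded struct {
	a int64
	b int64
	c byte // followed by 7 bytes of padding on 64-bit platforms
}

type packed struct {
	a int64
	b int64
}

// smallClasses is an excerpt of the first size classes, in bytes.
// The authoritative table is generated inside the runtime.
var smallClasses = []uintptr{8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// classFor returns the smallest listed class that fits size, or 0 if the
// excerpt does not cover it.
func classFor(size uintptr) uintptr {
	for _, c := range smallClasses {
		if size <= c {
			return c
		}
	}
	return 0
}

func main() {
	sizes := []uintptr{unsafe.Sizeof(padded{}), unsafe.Sizeof(packed{}), 40}
	for _, size := range sizes {
		c := classFor(size)
		fmt.Printf("size %d -> class %d (rounding waste %d B)\n", size, c, c-size)
	}
}
```

unsafe.Sizeof already includes alignment padding, so the padded struct reports its padded size rather than the sum of its fields.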
3. Tiny allocator
Allocations smaller than 16 bytes that contain no pointers get packed into a shared 16-byte block by the tiny allocator, amortizing the per-object cost. Pointer-containing objects are excluded, which is why a heap-allocated *int (8 bytes, but it holds a pointer) costs more than a pointer-free value of the same size: it always takes a slot of its own.
Practical consequence: prefer value types over *int when storing small scalars.
4. Where the GC pacer comes in
The runtime decides when to start a GC cycle using the pacer:
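With the default GOGC=100, GC kicks in when the heap has grown to roughly 2× the live set seen at the end of the last cycle. Since Go 1.18 the target also counts GC roots (goroutine stacks and globals), so the factor is not exactly 2.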
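The GOGC rule can be written out as arithmetic. This sketch uses the target-heap formula given in the Go GC guide; heapGoal is a hypothetical helper name, and the real pacer applies further adjustments on top of this target.

```go
package main

import (
	"fmt"
	"math"
)

// heapGoal applies the target-heap formula from the Go GC guide:
//
//	goal = live + (live + roots) * GOGC / 100
//
// A negative gogc models GOGC=off, where no heap goal applies.
func heapGoal(liveBytes, rootBytes uint64, gogc int) uint64 {
	if gogc < 0 {
		return math.MaxUint64
	}
	return liveBytes + (liveBytes+rootBytes)*uint64(gogc)/100
}

func main() {
	const MiB = 1 << 20
	fmt.Println(heapGoal(100*MiB, 0, 100) / MiB)     // 200
	fmt.Println(heapGoal(100*MiB, 0, 50) / MiB)      // 150
	fmt.Println(heapGoal(100*MiB, 4*MiB, 200) / MiB) // 308
}
```

Doubling GOGC does not double memory use; it doubles the headroom the heap may grow into before the next cycle starts.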
Other inputs the pacer considers:
- GOMEMLIMIT: an upper bound on total runtime memory. If set, the pacer will GC more often to stay under it.
- The expected CPU cost of the next cycle relative to allocation rate.
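import "runtime/debug"

prev := debug.SetGCPercent(200) // less frequent GC, more memory; returns the previous setting
defer debug.SetGCPercent(prev)  // restore it when the surrounding function returns
debug.SetMemoryLimit(2 << 30)   // 2 GiB soft cap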
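SetGCPercent(-1) disables GC entirely, unless a memory limit is set, in which case the collector still runs as the heap approaches the limit. Useful for short batch programs, dangerous for everything else.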
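Both setters return the previous value, which makes temporary tuning easy to scope. The following is a minimal sketch under the assumption that you want the old settings back afterwards; withTuning and currentMemoryLimit are hypothetical helpers, not runtime APIs.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// withTuning runs f with temporary GOGC and memory-limit settings and
// restores the previous values afterwards.
func withTuning(gcPercent int, memLimit int64, f func()) {
	prevPercent := debug.SetGCPercent(gcPercent)
	prevLimit := debug.SetMemoryLimit(memLimit)
	defer debug.SetGCPercent(prevPercent)
	defer debug.SetMemoryLimit(prevLimit)
	f()
}

// currentMemoryLimit reads the limit without changing it: a negative
// argument to SetMemoryLimit only reports the current value.
func currentMemoryLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	fmt.Println("before:", currentMemoryLimit())
	withTuning(200, 2<<30, func() {
		fmt.Println("inside:", currentMemoryLimit())
	})
	fmt.Println("after: ", currentMemoryLimit())
}
```

Scoping the change this way suits a batch phase inside a longer-lived process, where the relaxed settings should not outlive the phase.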
5. The mark-and-sweep cycle, slightly less abstracted
A GC cycle has four phases:
- Sweep termination (STW, microseconds): finish any in-progress sweeping from the previous cycle.
- Concurrent marking: GC workers walk the object graph from roots, painting reachable objects black. Your code is running, so a write barrier intercepts pointer stores to keep the graph consistent.
- Mark termination (STW, sub-millisecond): finalize the mark.
- Concurrent sweeping: reclaim unmarked objects span by span, lazily, as allocations need them. Spans left with no live objects go back to the heap.
The write barrier is the price of concurrent collection: every pointer store costs a little extra during the mark phase. That's invisible most of the time but can show up under microbenchmarks.
6. Tri-color invariant in one paragraph
Objects are conceptually white (unmarked), grey (queued for scan), or black (scanned, all children queued). The invariant: a black object must never point directly to a white object. The write barrier preserves this when your code mutates pointers during the mark phase.
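The coloring scheme can be illustrated with a toy marker over a hand-built object graph. This is a stop-the-world sketch with no write barrier and no concurrency, so it shows only the white/grey/black bookkeeping, not how the Go runtime implements marking. The object type and the mark and demoColors functions are invented for the example.

```go
package main

import "fmt"

type color int

const (
	white color = iota // not yet seen; garbage if still white when marking ends
	grey               // seen, children not yet scanned
	black              // scanned, every child already shaded
)

type object struct {
	name     string
	children []*object
	color    color
}

// mark shades everything reachable from roots. Objects enter the work list
// grey and turn black once all of their children have been shaded.
func mark(roots []*object) {
	var work []*object
	shade := func(o *object) {
		if o != nil && o.color == white {
			o.color = grey
			work = append(work, o)
		}
	}
	for _, r := range roots {
		shade(r)
	}
	for len(work) > 0 {
		o := work[len(work)-1]
		work = work[:len(work)-1]
		for _, c := range o.children {
			shade(c)
		}
		o.color = black
	}
}

// demoColors builds a three-object graph, marks from a single root, and
// returns the final color of each object.
func demoColors() map[string]color {
	a := &object{name: "a"}
	b := &object{name: "b"}
	c := &object{name: "c"}
	a.children = []*object{b}
	b.children = []*object{a} // cycle back to a
	c.children = []*object{b} // c points at live data but nothing points at c
	mark([]*object{a})
	return map[string]color{"a": a.color, "b": b.color, "c": c.color}
}

func main() {
	fmt.Println(demoColors())
}
```

Because an object turns black only after its children are shaded, no black object ever points at a white one while this marker runs; the write barrier exists to keep that true when the program mutates pointers mid-mark.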
7. Stack growth, in practice
A goroutine begins life with a 2 KiB stack. When a function would overflow it, the runtime:
- Allocates a stack 2× larger.
- Copies all frames over.
- Walks the old stack and rewrites every pointer that pointed into it so that it points into the new one.
- Resumes the goroutine.
Two consequences you should internalize:
- Pointers to stack variables must never be reachable from outside the goroutine's own stack, because they would dangle after the copy. The compiler's escape analysis enforces this by moving any variable whose address escapes to the heap.
- Deeply recursive functions can cause many growths, and each growth is real work. An iterative version avoids them; Go offers no public API for pre-sizing a goroutine's stack.
During GC, the runtime halves a goroutine's stack if the goroutine is using less than ¼ of it.
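The heap promotion described above can be observed by counting allocations. This sketch uses testing.AllocsPerRun, which is callable outside of tests; point, escapes, staysLocal and allocsPerCall are hypothetical names, and the go:noinline directives are there only to stop the compiler from optimizing the calls away.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y, z int64 }

// sink is package-level, so anything stored here outlives the call that made it.
var sink *point

//go:noinline
func escapes() *point {
	p := point{1, 2, 3}
	return &p // p outlives the frame, so the compiler places it on the heap
}

//go:noinline
func staysLocal() int64 {
	p := point{1, 2, 3}
	q := &p // q never leaves this frame, so p can stay on the stack
	return q.x + q.y + q.z
}

// allocsPerCall reports the average heap allocations per call of each function.
func allocsPerCall() (escaping, local float64) {
	escaping = testing.AllocsPerRun(100, func() { sink = escapes() })
	local = testing.AllocsPerRun(100, func() { _ = staysLocal() })
	return escaping, local
}

func main() {
	escaping, local := allocsPerCall()
	fmt.Println("escapes:   ", escaping, "allocs/call")
	fmt.Println("staysLocal:", local, "allocs/call")
}
```

Building with go build -gcflags="-m" prints the same decisions at compile time, without running anything.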
8. Heap leaks
Go has a GC, but you can absolutely "leak" by holding references you no longer need. The most common shapes:
| Pattern | Why it leaks |
|---|---|
| Global maps that only grow | No eviction policy |
| Goroutines blocked on channels nobody sends to | Goroutine, its stack, and everything it references stay live |
| Cached *Request or *Context from middleware | Pins the entire request graph |
| Slices of structs containing pointers, kept long-term | One struct can pin large object graphs |
| time.Ticker not stopped | Before Go 1.23, the runtime keeps an unstopped ticker and everything it references alive; from 1.23 on, an unreferenced ticker is collectable |
Leak detection tools: pprof heap profile (go tool pprof http://.../debug/pprof/heap) and runtime.NumGoroutine() over time.
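Watching runtime.NumGoroutine() over time can be sketched as follows. The example deliberately parks goroutines on a channel nobody sends to, then releases them by closing it; startBlocked, waitForGoroutines and leakAndRecover are hypothetical helpers, and the one-second timeout is an arbitrary choice for the demo.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// startBlocked launches n goroutines that block on a channel nobody sends to.
// It returns how much the goroutine count grew and a function that unblocks them.
func startBlocked(n int) (delta int, release func()) {
	before := runtime.NumGoroutine()
	never := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-never }()
	}
	return runtime.NumGoroutine() - before, func() { close(never) }
}

// waitForGoroutines polls until the count drops to at most want or the
// timeout expires, and reports whether it got there.
func waitForGoroutines(want int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= want {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return runtime.NumGoroutine() <= want
}

// leakAndRecover parks n goroutines, then releases them and checks that the
// count returns to its starting value.
func leakAndRecover(n int) (delta int, recovered bool) {
	base := runtime.NumGoroutine()
	delta, release := startBlocked(n)
	release()
	return delta, waitForGoroutines(base, time.Second)
}

func main() {
	delta, recovered := leakAndRecover(5)
	fmt.Println("blocked goroutines added:", delta)
	fmt.Println("back to baseline:", recovered)
}
```

In a real service the count never comes back down on its own, which is the signal to pull a goroutine profile and look for the blocked stacks.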
9. runtime.MemStats you should know
| Field | What it tells you |
|---|---|
| HeapAlloc | Bytes of allocated heap objects: live objects plus garbage not yet swept |
| HeapInuse | Bytes in spans currently used (≥ HeapAlloc due to size-class rounding) |
| HeapIdle | Bytes in idle spans the runtime is holding for reuse |
| HeapReleased | Bytes returned to the OS |
| NumGC | Cycles completed since start |
| PauseTotalNs | Cumulative STW nanoseconds |
| GCCPUFraction | Fraction of program CPU spent in GC since start |
ReadMemStats briefly stops the world. For high-frequency reads, use runtime/metrics.
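A before/after snapshot around a forced collection is a quick way to get a feel for these fields. This is a sketch for experimentation only; gcDelta and keep are hypothetical names, and calling runtime.GC() like this is not something to do in production code.

```go
package main

import (
	"fmt"
	"runtime"
)

// keep holds references so the allocations below stay live across the GC.
var keep [][]byte

// gcDelta allocates some memory, forces a collection, and returns the number
// of GC cycles completed in between along with the final stats snapshot.
func gcDelta() (cycles uint32, after runtime.MemStats) {
	var before runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < 1000; i++ {
		keep = append(keep, make([]byte, 100))
	}
	runtime.GC() // blocks until the collection has finished
	runtime.ReadMemStats(&after)
	return after.NumGC - before.NumGC, after
}

func main() {
	cycles, m := gcDelta()
	fmt.Println("GC cycles:   ", cycles)
	fmt.Println("HeapAlloc:   ", m.HeapAlloc)
	fmt.Println("HeapInuse:   ", m.HeapInuse)
	fmt.Println("HeapIdle:    ", m.HeapIdle)
	fmt.Println("HeapReleased:", m.HeapReleased)
}
```

The gap between HeapInuse and HeapAlloc is a rough estimate of memory dedicated to size classes but not currently holding objects.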
10. runtime/metrics (preferred over MemStats for monitoring)
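import (
	"fmt"
	"runtime/metrics"
)

samples := []metrics.Sample{
	{Name: "/gc/heap/live:bytes"}, // Go 1.21+
	{Name: "/gc/cycles/automatic:gc-cycles"},
	{Name: "/sched/goroutines:goroutines"},
}
metrics.Read(samples)
for _, s := range samples {
	// Unsupported names come back as KindBad, so check the kind before unwrapping.
	if s.Value.Kind() == metrics.KindUint64 {
		fmt.Println(s.Name, s.Value.Uint64())
	}
}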
This API is stable, versioned, named (no opaque struct fields), and does not stop the world. Use it for exporters and dashboards.
11. Allocator-friendly code patterns
| Pattern | Win |
|---|---|
| Preallocate with make([]T, 0, knownCap) | Avoids reallocation + copy as the slice grows |
| Reuse buffers via sync.Pool | Reduces allocation rate |
| Pass structs by value when they fit in 1–2 cache lines | Avoids heap escape via *T |
| Use bytes.Buffer / strings.Builder instead of += | One backing array vs many |
| Avoid interface{} for hot small values | Boxing forces a heap allocation |
| Group related fields, order by alignment | Smaller struct → smaller class → less waste |
sync.Pool is not a cache; the GC can drain it at any time. Don't rely on what's inside.
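Two of the patterns in the table, preallocation and buffer reuse, can be sketched together. The allocation counts come from testing.AllocsPerRun; grow, prealloc, compareAllocs, bufPool and greet are hypothetical names chosen for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"testing"
)

// sink is package-level so the slices below escape to the heap.
var sink []int

// grow appends to a nil slice, forcing repeated reallocation and copying.
func grow(n int) []int {
	var s []int
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// prealloc sizes the backing array once, up front.
func prealloc(n int) []int {
	s := make([]int, 0, n)
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

// compareAllocs reports the average allocations per call for both variants.
func compareAllocs(n int) (growing, preallocated float64) {
	growing = testing.AllocsPerRun(50, func() { sink = grow(n) })
	preallocated = testing.AllocsPerRun(50, func() { sink = prealloc(n) })
	return growing, preallocated
}

// bufPool hands out reusable buffers. The GC may empty it at any time, so
// New must always be able to produce a fresh one.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// greet builds a string in a pooled buffer and returns the buffer afterwards.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf)
	buf.Reset() // a pooled buffer may still hold a previous caller's data
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String()
}

func main() {
	growing, preallocated := compareAllocs(1000)
	fmt.Println("append from nil:", growing, "allocs")
	fmt.Println("preallocated:   ", preallocated, "allocs")
	fmt.Println(greet("gopher"))
}
```

Resetting the buffer on Get rather than on Put keeps the code correct even if some other caller forgets to clean up before returning a buffer.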
12. Tooling cheat sheet
| Tool | What it shows |
|---|---|
| go build -gcflags="-m" | Escape decisions for each allocation |
| GODEBUG=gctrace=1 | One line per GC cycle to stderr |
| GODEBUG=allocfreetrace=1 | Stack of every alloc/free (very noisy) |
| go test -benchmem | Allocs and bytes per op |
| go tool pprof http://.../debug/pprof/heap | In-use and alloc objects/bytes |
| go tool pprof -alloc_objects | Cumulative allocations, not just live |
13. Summary
Heap allocations climb a three-level hierarchy (mcache → mcentral → mheap) that's organized by size class. The GC is concurrent, mark-and-sweep, tri-color, non-generational, non-moving, and paced by the GOGC/GOMEMLIMIT knobs. Your job is to keep allocation rate sane, avoid leaking references, and choose data layouts that pack into the existing size classes. Reach for runtime/metrics and pprof long before you reach for runtime.GC().
Further reading
- Go GC guide: https://go.dev/doc/gc-guide
- runtime/metrics: https://pkg.go.dev/runtime/metrics
- TCMalloc design (the inspiration): https://google.github.io/tcmalloc/design.html