Memory Allocator — Practice Tasks¶
Twenty hands-on exercises to build intuition for how the Go runtime allocates memory: escape analysis, size classes, mcache/mcentral/mheap, the tiny allocator, large-object path, the scavenger, and the radix-tree page allocator. The goal is not to memorise runtime/malloc.go — it is to read code, run an experiment, and predict the allocator's response before the numbers come back. Difficulty climbs from Junior through Staff.
Each task gives a Goal, Difficulty, Skills, Steps, Acceptance criteria, a folded Hint, and a folded Reference solution. Read junior.md first — the difference between "the compiler decided this escapes" (escape analysis), "the runtime grabbed an mspan from mcentral" (small allocation), and "the runtime took a page from mheap" (large allocation) is the spine of every task below. The exercises force you to live with that distinction.
Source references throughout: runtime/malloc.go, runtime/mcache.go, runtime/mcentral.go, runtime/mheap.go, runtime/sizeclasses.go, runtime/mpagealloc.go, runtime/mgcscavenge.go. Read them with the failing benchmark in one window and the source in the other — the questions you ask the source change after you've measured.
Task 1: ReadMemStats before/after a 1M allocation¶
Goal. Use runtime.ReadMemStats to print HeapAlloc/HeapSys/HeapIdle before and after allocating 1,000,000 small items, and explain the deltas.
Difficulty. Junior.
Skills. runtime.MemStats, basic allocation accounting, reading HeapAlloc vs HeapSys.
Steps.
- Write a program that calls
runtime.GC()thenruntime.ReadMemStats(&m1). - Allocate
[]*struct{ X int }of length 1,000,000 and fill each slot with&struct{ X int }{i}. - Call
runtime.GC()thenruntime.ReadMemStats(&m2). - Print
HeapAlloc,HeapSys,HeapIdle,HeapObjects,Mallocsdeltas. - Comment on which delta is "live bytes" and which is "OS-reserved".
Acceptance criteria.
HeapObjectsdelta is at least 1,000,000.HeapAllocdelta is bounded near1_000_000 * 16bytes (size class 2 / size 16, give or take alignment).- You explain in a code comment why
HeapSys >= HeapAllocalways. - Program does not leak the slice (keep it alive until after the second ReadMemStats).
Hint
Call `runtime.GC()` before *each* `ReadMemStats` so the figure isn't polluted by unreachable garbage. To keep the slice alive past the measurement, reference it after the print — `runtime.KeepAlive(s)` is idiomatic. The smallest struct (`struct{ X int }` = 8 bytes) actually allocates into size class 2 (16 bytes) — see `runtime/sizeclasses.go`.Reference solution
package main
import (
"fmt"
"runtime"
)
type Item struct{ X int }
func main() {
var m1, m2 runtime.MemStats
runtime.GC()
runtime.ReadMemStats(&m1)
const N = 1_000_000
items := make([]*Item, N)
for i := 0; i < N; i++ {
items[i] = &Item{X: i}
}
runtime.GC()
runtime.ReadMemStats(&m2)
fmt.Printf("HeapAlloc delta: %d bytes (%.2f MB)\n",
m2.HeapAlloc-m1.HeapAlloc, float64(m2.HeapAlloc-m1.HeapAlloc)/1<<20)
fmt.Printf("HeapSys delta: %d bytes (%.2f MB)\n",
m2.HeapSys-m1.HeapSys, float64(m2.HeapSys-m1.HeapSys)/1<<20)
fmt.Printf("HeapIdle delta: %d bytes\n", int64(m2.HeapIdle)-int64(m1.HeapIdle))
fmt.Printf("HeapObjects delta: %d\n", m2.HeapObjects-m1.HeapObjects)
fmt.Printf("Mallocs delta: %d\n", m2.Mallocs-m1.Mallocs)
// HeapAlloc = live bytes the allocator has handed out to the program.
// HeapSys = bytes the runtime got from the OS via mmap (>= HeapAlloc).
// HeapIdle = pages held but not currently used to back live allocations.
// HeapSys = HeapAlloc + HeapIdle + (small fixed metadata). Always.
runtime.KeepAlive(items)
}
Extension. Re-run with GOGC=off and observe how HeapSys behaves differently when no GC fires between the two measurements.
Task 2: Escape analysis on three functions¶
Goal. Run go build -gcflags="-m" on a small program and identify the compiler's escape decision for three functions.
Difficulty. Junior.
Skills. Reading -gcflags="-m" output, distinguishing stack vs heap allocation.
Steps.
- Write a file
escape.gowith three functions: (a) returns a pointer to a local struct, (b) returns the struct by value, (c) takes the address but does not return it. - Build with
go build -gcflags="-m=2" escape.goand capture the output. - For each function, write a one-line comment summarising the compiler's decision: "escapes to heap" or "does not escape".
- Cross-check against your prediction — write predictions in a comment before running the build.
Acceptance criteria.
- Function (a) reports "moved to heap" or "escapes to heap".
- Function (b) reports no escape (return by value).
- Function (c) reports "does not escape" (address-of stays on stack).
- You can name which line of
-m=2output justifies each decision.
Hint
The `-m` flag prints escape decisions; `-m=2` prints why. Look for phrases `escapes to heap`, `moved to heap`, and `does not escape`. The classic "pointer escapes via return" appears as `&T literal escapes to heap`. A struct returned by value is copied into the caller's frame and never appears in escape output at all — that itself is the answer.Reference solution
// escape.go
package main
type T struct{ X, Y int }
// (a) PREDICTION: escapes — address leaks via return.
// $ go build -gcflags='-m=2' says: "&T{...} escapes to heap"
func newT() *T {
return &T{X: 1, Y: 2}
}
// (b) PREDICTION: stays on stack — return by value, no address taken.
// $ go build says nothing because there's no allocation to report.
func makeT() T {
return T{X: 3, Y: 4}
}
// (c) PREDICTION: stays on stack — address taken but does not escape.
// $ go build says: "&t does not escape"
func useT() int {
t := T{X: 5, Y: 6}
p := &t
return p.X + p.Y
}
func main() {
_ = newT()
_ = makeT()
_ = useT()
}
Extension. Add a fourth function that stores &t into a global slice. Predict, then verify the escape.
Task 3: new(int) vs &Int{0} — heap or stack?¶
Goal. Demonstrate that new(int) and &Int{0} may or may not heap-allocate depending on how the result is used.
Difficulty. Junior.
Skills. Escape analysis, understanding that allocation syntax does not determine allocation site.
Steps.
- Write four small functions:
a()callsnew(int)and returns the pointer.b()callsnew(int), dereferences once, and returns the int.c()returns&Int{0}(a custom struct).d()uses&Int{0}locally and returns one of its fields.- Run
go build -gcflags="-m"and record which escape. - Add benchmarks
BenchmarkA/BenchmarkB/BenchmarkC/BenchmarkDand run with-benchmem. Observeallocs/op.
Acceptance criteria.
a()andc()show 1 alloc/op.b()andd()show 0 allocs/op.- You can articulate "the keyword
newdoes not force heap allocation — the use of the result does".
Hint
Allocation in Go is not a syntactic decision. `new(T)` is *exactly* equivalent to `&T{}` to the compiler — both go through escape analysis. The decision rule is: if the address can outlive the current stack frame, heap; otherwise stack. Returning the pointer leaks the address upward; dereferencing immediately and returning the value does not.Reference solution
package alloc
type Int struct{ V int }
// a: escapes — pointer returned.
func a() *int {
p := new(int)
*p = 7
return p
}
// b: does not escape — pointer never leaves the frame.
func b() int {
p := new(int)
*p = 7
return *p
}
// c: escapes — pointer returned.
func c() *Int {
return &Int{V: 7}
}
// d: does not escape — address-of used purely locally.
func d() int {
i := &Int{V: 7}
return i.V
}
package alloc
import "testing"
func BenchmarkA(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = a()
}
}
func BenchmarkB(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = b1()
}
}
func b1() int { return b() }
func BenchmarkC(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = c()
}
}
func BenchmarkD(b *testing.B) {
for i := 0; i < b.N; i++ {
_ = d()
}
}
Extension. Add e() that passes new(int) to a function whose argument is interface{}. Predict the escape, then verify. Interface conversions are a notorious escape trigger.
Task 4: Allocations of fmt.Sprintf vs strconv.Itoa¶
Goal. Use -benchmem to measure allocations for converting an int to a string two ways.
Difficulty. Junior.
Skills. Benchmarking, reading allocs/op, understanding when reflection-heavy APIs cost more.
Steps.
- Write
BenchmarkSprintfthat callsfmt.Sprintf("%d", 12345)in a loop. - Write
BenchmarkItoathat callsstrconv.Itoa(12345)in a loop. - Run
go test -bench=. -benchmem -benchtime=2s. - Compare
B/opandallocs/op.
Acceptance criteria.
Sprintfreports at least 2 allocs/op (interface conversion + result string + formatter buffer churn).Itoareports 1 alloc/op (just the result string).ns/opforItoais roughly 5–10x faster.
Hint
`fmt.Sprintf` runs the reflection-based formatter, which boxes `12345` into an `interface{}` (one alloc), allocates a `pp` formatter (often pooled — sometimes zero), and assembles the string (one more alloc). `strconv.Itoa` is a hand-rolled digit loop into a `[]byte` whose result is converted to `string`. Read `strconv/itoa.go` — it's a hundred lines, all of which exist to skip the boxing.Reference solution
package fmtbench
import (
"fmt"
"strconv"
"testing"
)
func BenchmarkSprintf(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = fmt.Sprintf("%d", 12345)
}
}
func BenchmarkItoa(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = strconv.Itoa(12345)
}
}
func BenchmarkAppendInt(b *testing.B) {
// Bonus: zero-alloc with a reusable scratch buffer.
b.ReportAllocs()
buf := make([]byte, 0, 32)
for i := 0; i < b.N; i++ {
buf = strconv.AppendInt(buf[:0], 12345, 10)
}
_ = buf
}
Extension. Reproduce the comparison for fmt.Errorf("got %d", n) vs errors.New("got " + strconv.Itoa(n)). The "fmt for errors" idiom costs more than most engineers realise.
Task 5: Read runtime/sizeclasses.go — explain three classes¶
Goal. Pick three size classes from the table in runtime/sizeclasses.go and explain npages, size, nclass, and the resulting objects per span.
Difficulty. Middle.
Skills. Reading runtime sources, understanding span layout, computing waste.
Steps.
- Locate
runtime/sizeclasses.go. Read the comment block at the top. - Pick size classes 3 (size 32 B), 25 (size 1024 B), and 67 (size 32768 B).
- For each: compute
objects per span = (npages * 8192) / size. Verify against the table. - Identify the "tail waste" — bytes per span unused because
pagesize % size != 0. - Explain in plain English why class 67 needs more pages.
Acceptance criteria.
- Three correct (
size,npages,objects,tail-waste) tuples in a comment. - One sentence per class explaining why the runtime chose that
npages.
Hint
A "page" in the Go runtime is 8192 bytes (`_PageSize` in `runtime/malloc.go`). A size class with `npages = N` means each span occupies `N * 8192` bytes and holds `(N * 8192) / size` objects. The number of pages is chosen so that the leftover at the tail of the span is < ~12.5% of the span — read the constants `_MaxSmallSize` and `tailWaste` near the top of the file.Reference solution
// notes_sizeclasses.go
package notes
// runtime/sizeclasses.go (Go 1.22) table excerpt:
//
// class bytes/obj bytes/span objects tail waste max waste
// ...
// 3 32 8192 256 0 46.88%
// ...
// 25 1024 8192 8 0 21.48%
// ...
// 67 32768 32768 1 0 12.50%
//
// Class 3 (size = 32 B):
// npages = 1
// span = 8192 B
// objects = 8192 / 32 = 256
// tail = 8192 - 256*32 = 0 B (perfect packing)
// why 1 page: 32 divides 8192 evenly, so a single page is optimal.
//
// Class 25 (size = 1024 B):
// npages = 1
// span = 8192 B
// objects = 8192 / 1024 = 8
// tail = 0 B (perfect)
// why 1 page: 1024 divides 8192; 8 objects per span is enough granularity.
//
// Class 67 (size = 32 KiB):
// npages = 4
// span = 32768 B
// objects = 32768 / 32768 = 1
// tail = 0 B
// why 4 pages: a single object IS the span. Above _MaxSmallSize (32 KiB)
// we'd go through the large-object path (mheap directly, no mcache).
//
// The table is generated by runtime/mksizeclasses.go, which searches for
// the (size, npages) pair that keeps waste < 12.5% for each size class.
// See the comment in malloc.go for the historical justification — the
// 12.5% bound is the "worst-case fragmentation" guarantee Go gives.
Extension. Run go run runtime/mksizeclasses.go (it's a generator) and observe the algorithm. What would happen if you changed _MaxSmallSize to 16 KiB?
Task 6: Trigger a small-object refill from mcentral¶
Goal. Construct a benchmark that allocates so many objects of one size class that the per-P mcache must refill from mcentral, and observe the refill via a tracer.
Difficulty. Middle.
Skills. mcache/mcentral interaction, GODEBUG tracing, span lifecycle.
Steps.
- Pick size class 5 (size 64 B). Write a benchmark allocating 100,000
[64]bytearrays. - Run with
GODEBUG=allocfreetrace=0,gctrace=1— observe GC frequency. - Run with
GOEXPERIMENT=andGODEBUG=gctrace=1capturing one cycle. - Estimate: a single mspan for class 5 holds
8192 / 64 = 128objects, so 100,000 allocations require ~780 mcentral refills per P. Multiply by GOMAXPROCS. - Write the prediction as a comment; verify the gctrace pace agrees in order of magnitude.
Acceptance criteria.
- Benchmark allocates 100k
[64]byteand reports>= 100k allocs/opaggregate. - A comment estimates refill count and explains "refill = mcache asks mcentral for a fresh mspan".
GODEBUG=gctrace=1shows GC cycles consistent with the heap-growth rate.
Hint
The per-P mcache holds at most one mspan per size class. When that span runs out of free objects, mcache calls `(*mcentral).cacheSpan()` (see `runtime/mcentral.go`). mcentral keeps two lists per class: spans with free objects (`partial`) and spans that are full (`full`). On refill, mcentral pops from `partial`; if empty, it asks mheap for a fresh span. The full lifecycle is in the comment block of `mcentral.go`.Reference solution
package refill
import "testing"
type Block [64]byte
// One Block goes into size class 5 (size 64). One mspan for class 5 has
// npages = 1, so 8192 / 64 = 128 objects per span. 100_000 allocations
// require ~782 mcentral refills per goroutine, multiplied across all Ps
// that get scheduled to allocate.
func BenchmarkSmallRefill(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
bs := make([]*Block, 100_000)
for j := range bs {
bs[j] = &Block{}
}
_ = bs
}
}
Extension. Run the same benchmark with runtime.GOMAXPROCS(1) vs GOMAXPROCS(8). mcache is per-P; refill rate per P drops as Ps multiply because work is distributed.
Task 7: HeapAlloc delta for 100k small objects¶
Goal. Allocate 100,000 small objects and measure the precise HeapAlloc delta.
Difficulty. Middle.
Skills. runtime.MemStats, size-class rounding, predicting allocator bookkeeping.
Steps.
- Define
type S struct { A, B int32 }— 8 bytes net. - Capture
m1before, allocate 100,000*S, capturem2after (withruntime.GC()between). - Predict
HeapAllocdelta:100_000 * sizeClass(8) = 100_000 * 16 = 1_600_000bytes for the items, plus100_000 * 8 = 800_000for the pointer slice. - Print actual delta. Reconcile to within 1%.
Acceptance criteria.
- Prediction within 1% of observed
HeapAllocdelta. - Comment explains why net 8 bytes per object becomes 16 on the heap (size class 2: size 16, alignment & header overhead).
- Slice is kept alive via
runtime.KeepAlive.
Hint
The minimum object size in the small allocator is 16 bytes (size class 2). Anything <= 16 bytes rounds up to 16. To see the rounding, read `nextFreeFast` in `runtime/malloc.go` — it computes the size class via `class_to_size[sizeclass]`. The "extra" bytes are not visible to the program but cost real heap.Reference solution
package main
import (
"fmt"
"runtime"
)
type S struct {
A, B int32
}
func main() {
const N = 100_000
var m1, m2 runtime.MemStats
runtime.GC()
runtime.ReadMemStats(&m1)
xs := make([]*S, N)
for i := 0; i < N; i++ {
xs[i] = &S{A: int32(i), B: int32(-i)}
}
runtime.GC()
runtime.ReadMemStats(&m2)
want := uint64(N*16 + N*8) // 16 B per S (rounded), 8 B per slice slot
got := m2.HeapAlloc - m1.HeapAlloc
fmt.Printf("want ~%d B, got %d B (%.2f%% diff)\n",
want, got, 100*float64(int64(got)-int64(want))/float64(want))
runtime.KeepAlive(xs)
}
Extension. Change S to struct{ A, B, C int64 } (24 bytes net). Predict the new delta: 32 B/object (size class 4). Verify.
Task 8: sync.Pool for 1 KiB buffers¶
Goal. Use sync.Pool to recycle 1 KiB scratch buffers; compare allocations with/without the pool.
Difficulty. Middle.
Skills. sync.Pool, allocation amortisation, when pooling helps and when it doesn't.
Steps.
- Write
BenchmarkNoPoolthat creates a freshmake([]byte, 1024)per iteration, writes a digit pattern into it, and returns the first byte. - Write
BenchmarkPoolusing async.PoolwhoseNewreturns a 1 KiB buffer; get, write, return viaPut. - Run both with
-benchmem. Compareallocs/opandB/op. - Add
BenchmarkPoolUnderGCthat callsruntime.GC()once per iteration to show the pool drain on GC.
Acceptance criteria.
NoPoolreports 1 alloc/op, 1024 B/op.Poolreports near 0 allocs/op steady-state.PoolUnderGCreports near 1 alloc/op (pool is fully drained each GC cycle).- A comment explains "sync.Pool entries are cleared at every GC".
Hint
`sync.Pool` is *not* an object cache that survives GC — every full GC cycle empties pool victim caches. The two-generation Get/Put logic is in `sync/pool.go`. The pool wins when allocations are hot *between* GC cycles (most production workloads). Under tight memory pressure it offers little.Reference solution
package poolbench
import (
"runtime"
"sync"
"testing"
)
var bufPool = sync.Pool{
New: func() any {
b := make([]byte, 1024)
return &b
},
}
func BenchmarkNoPool(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
buf := make([]byte, 1024)
buf[0] = byte(i)
_ = buf[0]
}
}
func BenchmarkPool(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
pb := bufPool.Get().(*[]byte)
buf := *pb
buf[0] = byte(i)
_ = buf[0]
bufPool.Put(pb)
}
}
func BenchmarkPoolUnderGC(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
pb := bufPool.Get().(*[]byte)
(*pb)[0] = byte(i)
bufPool.Put(pb)
if i%1000 == 0 {
runtime.GC()
}
}
}
Extension. Use runtime.SetFinalizer to count how many pool buffers are reclaimed by GC. The number should equal the pool size at the moment GC ran.
Task 9: Pointer-heavy vs value-heavy structs and GC scan time¶
Goal. Compare a struct with all pointer fields vs all value fields; benchmark GC scan time.
Difficulty. Middle.
Skills. GC mark phase, pointer-bitmap layout, runtime.MemStats.PauseNs.
Steps.
- Define
type P struct { A, B, C *int }andtype V struct { A, B, C int }. - Allocate 1,000,000 of each in two separate runs.
- Trigger 10 GC cycles in each run; record
PauseTotalNs. - Compare. Pointers force GC to walk; values do not.
Acceptance criteria.
P-heap GC pause time is measurably larger (often 2–5x).- A comment explains "the GC mark phase walks pointer bitmaps; value-only structs have a zero bitmap and skip scanning".
Hint
Every type has a "ptr-mask" the compiler emits — one bit per word indicating "pointer here". The mark phase only enqueues words flagged as pointers. A struct of pure `int`s has an all-zero ptr-mask. A struct of `*int` has an all-ones ptr-mask. The cost difference is real and measurable on multi-million-object heaps.Reference solution
package main
import (
"fmt"
"runtime"
)
type P struct{ A, B, C *int }
type V struct{ A, B, C int }
func gcStats() (uint64, uint64) {
var m runtime.MemStats
runtime.ReadMemStats(&m)
return m.PauseTotalNs, m.NumGC
}
func runPointers(n int) {
xs := make([]*P, n)
one := 1
for i := range xs {
xs[i] = &P{A: &one, B: &one, C: &one}
}
for i := 0; i < 10; i++ {
runtime.GC()
}
runtime.KeepAlive(xs)
}
func runValues(n int) {
xs := make([]*V, n)
for i := range xs {
xs[i] = &V{A: 1, B: 1, C: 1}
}
for i := 0; i < 10; i++ {
runtime.GC()
}
runtime.KeepAlive(xs)
}
func main() {
const N = 1_000_000
p0, g0 := gcStats()
runPointers(N)
p1, g1 := gcStats()
fmt.Printf("pointers: %d GCs, total pause %.2f ms\n",
g1-g0, float64(p1-p0)/1e6)
p2, g2 := gcStats()
runValues(N)
p3, g3 := gcStats()
fmt.Printf("values: %d GCs, total pause %.2f ms\n",
g3-g2, float64(p3-p2)/1e6)
}
Extension. Use runtime.GCStats (from runtime/debug) for richer per-cycle data. Plot pause times across cycles — you'll see the first 1–2 cycles are noisier than steady-state.
Task 10: Closure capture forcing heap escape¶
Goal. Write code where closure capture forces a local variable onto the heap; confirm with -gcflags="-m".
Difficulty. Middle.
Skills. Closures, escape analysis, lifetime of captured variables.
Steps.
- Write a function that creates a local int
n, returns a closure capturing&n, and observe the escape. - Write a contrasting function that creates a local int
n, returns a closure that capturesnby value, and observe no escape. - Build with
-gcflags="-m=2"to see the escape graph.
Acceptance criteria.
- First closure:
-mreportsmoved to heap: n(or&n escapes to heap). - Second closure: no escape diagnostics for
n. - A comment explains "a closure captures variables by reference; if the closure outlives the frame, the captured vars must move to the heap".
Hint
A Go closure is a struct holding pointers to captured variables. If the closure is returned (escapes the frame), every captured variable must outlive the frame too, which forces them to the heap. Capturing by *value* (passing the int as an argument to the closure factory) doesn't capture — the value lives inside the closure struct and never needs heap promotion if the closure itself doesn't escape further.Reference solution
// closure_escape.go
package main
import "fmt"
// counter1 returns a closure that mutates `n`. `n` escapes because the
// closure outlives the frame and holds a pointer to `n`.
//
// $ go build -gcflags='-m' says: "moved to heap: n"
func counter1() func() int {
n := 0
return func() int {
n++
return n
}
}
// counter2 returns a closure that does NOT mutate the captured int,
// but captures BY REFERENCE because that's what closures do. So `n`
// still escapes.
//
// $ go build -gcflags='-m' says: "moved to heap: n"
func counter2() func() int {
n := 42
return func() int {
return n // still captured by reference; n must outlive counter2
}
}
// counter3 uses a factory that takes the int as an argument. The
// closure captures a parameter, but the parameter is copied into the
// closure struct itself; nothing on the outer frame needs to outlive
// counter3.
//
// $ go build -gcflags='-m' says nothing about heap for the int (the
// returned func itself still escapes — closures are heap-allocated when
// returned — but the int is part of the closure object).
func counter3(start int) func() int {
return func() int {
return start
}
}
func main() {
c1 := counter1()
fmt.Println(c1(), c1(), c1()) // 1 2 3
c2 := counter2()
fmt.Println(c2()) // 42
c3 := counter3(7)
fmt.Println(c3()) // 7
}
$ go build -gcflags='-m=2' closure_escape.go 2>&1 | grep -E 'heap|escape'
./closure_escape.go:10:2: moved to heap: n
./closure_escape.go:11:9: func literal escapes to heap
./closure_escape.go:21:2: moved to heap: n
./closure_escape.go:22:9: func literal escapes to heap
./closure_escape.go:32:9: func literal escapes to heap
Extension. Write a closure that captures a 64-byte struct. Build with -m=2. Then make a version that captures only a slice header (24 bytes). Compare allocation sizes.
Task 11: Profile with pprof -alloc_space — top 3 allocators¶
Goal. Profile a small program with pprof -alloc_space and identify the top 3 allocation sites.
Difficulty. Senior.
Skills. net/http/pprof, go tool pprof, reading flame graphs by allocated bytes.
Steps.
- Write a program that imports
_ "net/http/pprof"and starts ahttp.ListenAndServe(":6060", nil)in a goroutine. - In the main goroutine, run three workloads: (a) JSON-marshal 10,000 structs, (b) build 10,000 strings via
+, (c) allocate 10,000 1 KiB scratch buffers. - Capture a heap profile:
go tool pprof -alloc_space http://localhost:6060/debug/pprof/allocs. - Use
top10andlistto identify the top 3 sites byflatbytes. - Document the three top callers in a comment.
Acceptance criteria.
- Three callers identified with their
flatandcumbyte counts. - One sentence per site explaining the allocator's view (which size class, hot path or cold).
Hint
`-alloc_space` is "cumulative bytes allocated since process start" — *not* live heap. It captures churn. The complementary view `-inuse_space` is "what's still live now". For finding hot allocation sites, `-alloc_space` is the right axis. `top10 -cum` sorts by cumulative; `listReference solution
package main
import (
"encoding/json"
"log"
"net/http"
_ "net/http/pprof"
"strconv"
"time"
)
type Item struct{ ID int; Name string }
func workA() {
for i := 0; i < 10_000; i++ {
b, _ := json.Marshal(Item{ID: i, Name: "item-" + strconv.Itoa(i)})
_ = b
}
}
func workB() {
for i := 0; i < 10_000; i++ {
s := ""
for j := 0; j < 10; j++ {
s += "x"
}
_ = s
}
}
func workC() {
for i := 0; i < 10_000; i++ {
b := make([]byte, 1024)
b[0] = byte(i)
_ = b
}
}
func main() {
go func() { log.Fatal(http.ListenAndServe(":6060", nil)) }()
for {
workA()
workB()
workC()
time.Sleep(50 * time.Millisecond)
}
}
$ go tool pprof -alloc_space http://localhost:6060/debug/pprof/allocs
(pprof) top10
Showing nodes accounting for 130MB, 96.30% of 135MB total
flat flat% sum% cum cum%
40MB 29.6% 29.6% 40MB 29.6% main.workB (string concat)
35MB 25.9% 55.5% 45MB 33.3% encoding/json.Marshal
25MB 18.5% 74.0% 25MB 18.5% main.workC ([]byte alloc)
15MB 11.1% 85.1% 15MB 11.1% strconv.Itoa
...
// 1. main.workB — string concatenation in a loop.
// Each "s += 'x'" allocates a new string of length |s|+1. The total
// cost is O(N^2) bytes per outer iteration. The fix is strings.Builder
// or []byte assembly.
//
// 2. encoding/json.Marshal — emits a fresh []byte per call.
// Reflection-based marshalling allocates scratch buffers and a result
// slice every call. Mitigation: json.Encoder against a pooled bytes.Buffer.
//
// 3. main.workC — explicit make([]byte, 1024).
// Direct alloc; one per loop iteration. Already pool-able via sync.Pool
// if the call site can release the buffer at a defined point.
Extension. Add pprof.SetGoroutineLabels around each work* call. The flame graph then groups by label, which makes the per-workload view trivial.
Task 12: pprof -alloc_objects vs -alloc_space¶
Goal. Capture both -alloc_space (bytes) and -alloc_objects (count). Discuss when each view matters.
Difficulty. Senior.
Skills. Profile interpretation, distinguishing GC pressure from allocator throughput.
Steps.
- Reuse the program from Task 11.
- Capture two profiles:
-alloc_spaceand-alloc_objects. - Compare the top 5 in each. Are they the same? Probably not.
- Identify one site that is hot in count but cool in bytes (many small allocs) and one that is hot in bytes but cool in count (few large allocs).
- Write a one-paragraph rule of thumb: when to optimise the count and when to optimise the bytes.
Acceptance criteria.
- Two distinct top-5 lists captured.
- At least one "count-heavy, byte-light" callsite identified.
- One paragraph explains: count drives mcache/mcentral traffic and GC mark cost; bytes drive heap size and pause time.
Hint
The mark phase of the GC walks pointers; each *object* contributes mark-bitmap work proportional to its size, but the per-object overhead (queue push/pop, span lookup) is constant. So 1,000,000 16-byte allocs cost more *mark work* than 10,000 1 KiB allocs even though the latter is more bytes. Conversely, allocator throughput is dominated by `mallocgc` calls — also per-object, not per-byte.Reference solution
// Capture commands:
//
// $ go tool pprof -alloc_space http://localhost:6060/debug/pprof/allocs
// (pprof) top5 -cum
//
// $ go tool pprof -alloc_objects http://localhost:6060/debug/pprof/allocs
// (pprof) top5 -cum
//
// Typical divergence:
//
// Space view (bytes):
// 1. workB (string concat) 40 MB - bytes hot, count cool
// 2. encoding/json.Marshal 35 MB - bytes hot, count moderate
// 3. workC ([]byte 1024) 25 MB - bytes hot, count hot
//
// Object view (count):
// 1. strconv.Itoa 500_000 - count hot, bytes cool
// 2. json.encodeStringField 400_000 - count hot, bytes moderate
// 3. workB (string concat) 100_000 - count moderate, bytes hot
//
// Rule of thumb:
// * Optimise BYTES when paus times grow with heap size — i.e. the mark
// phase is your bottleneck, or RSS is climbing.
// * Optimise COUNT when malloc/mcache traffic dominates CPU — i.e. you
// see runtime.mallocgc high in CPU profiles, or GC frequency is high
// (small frequent allocs trigger GC sooner because GOGC is %-based).
//
// Practical priority: object-count optimisations almost always pay for
// themselves first. Bytes follow naturally — fewer small allocs usually
// means a tighter heap. The exception is "rarely-allocated large blob"
// (image decoders, big JSON payloads) — for those, bytes is the only
// useful axis.
Extension. Add runtime/pprof.WriteHeapProfile programmatically every 30 seconds. Diff successive snapshots with go tool pprof -base older.pb.gz newer.pb.gz to see what changed between two points in time.
Task 13: Watch the scavenger with GODEBUG=scavtrace=1¶
Goal. Set GODEBUG=scavtrace=1 and observe the scavenger releasing memory back to the OS over time.
Difficulty. Senior.
Skills. Reading scavtrace output, understanding the page scavenger lifecycle.
Steps.
- Write a program that allocates 100 MB of slices, drops them, runs
runtime.GC(), and sleeps for 60 seconds in a loop. - Run with
GODEBUG=scavtrace=1 go run .. - Observe the trace lines:
scav N KiB.... Note the cadence (~1 line per second when scavenger is active). - Compare
RSS(viaps -orss= -p $PID) at 0 s, 10 s, and 60 s.
Acceptance criteria.
- Scavtrace output captured.
- RSS measured at three timestamps showing post-drop decline.
- A comment explains: the scavenger releases idle pages back to the OS via
madvise(MADV_DONTNEED)(Linux) orVM_PAGEABLE(BSD/Darwin), and runs continuously in the background paced byruntime.GOMEMLIMITandGOGC.
Hint
The scavenger runs as a background goroutine (`runtime.bgscavenge`) that walks the page allocator's radix tree looking for runs of "released-able" pages. It releases at a pace targeted to keep the heap close to its target size. The `scavtrace=1` output prints per-cycle stats: how many KiB returned, current RSS estimate, time spent. Read `runtime/mgcscavenge.go` — the goal-seeking logic is well-commented.Reference solution
package main
import (
"fmt"
"os"
"runtime"
"time"
)
func main() {
fmt.Printf("pid=%d\n", os.Getpid())
for cycle := 0; cycle < 6; cycle++ {
// Allocate 100 MB.
big := make([][]byte, 100)
for i := range big {
big[i] = make([]byte, 1<<20) // 1 MiB
}
runtime.KeepAlive(big)
// Drop.
big = nil
runtime.GC()
fmt.Printf("cycle=%d dropped 100MB, sleeping 10s\n", cycle)
time.Sleep(10 * time.Second)
}
}
Extension. Set GOMEMLIMIT=200MiB (Go 1.19+) and re-run. The scavenger will be aggressive trying to keep RSS under the limit — observe how scavtrace cadence changes.
Task 14: debug.FreeOSMemory and RSS measurement¶
Goal. Force the scavenger via debug.FreeOSMemory and measure RSS before and after.
Difficulty. Senior.
Skills. runtime/debug.FreeOSMemory, /proc/self/status parsing, knowing when (and when not) to call it.
Steps.
- Allocate 200 MB. Print RSS via
/proc/self/status(Linux) orps -orss=. - Drop the allocation. Run
runtime.GC(). Print RSS — likely unchanged. - Call
debug.FreeOSMemory(). Print RSS — should drop substantially. - Document the observation in a comment, including the warning from godoc: this is a stop-the-world operation, not for hot paths.
Acceptance criteria.
- RSS measurements at three points: post-alloc, post-GC, post-FreeOSMemory.
- Drop after FreeOSMemory is at least 50% of the peak.
- Comment quotes/paraphrases the FreeOSMemory godoc warning.
Hint
`debug.FreeOSMemory` calls `runtime.GC()` then synchronously runs the scavenger to release every page it can. The godoc note: "FreeOSMemory forces a garbage collection followed by an attempt to return as much memory to the operating system as possible. (Even if this is not called, the runtime gradually returns memory to the operating system in a background task.)" Use this at known low-traffic moments (e.g. after a batch import), never in a per-request handler.Reference solution
package main
import (
"bufio"
"fmt"
"os"
"runtime"
"runtime/debug"
"strconv"
"strings"
"time"
)
func rssKB() int {
f, err := os.Open("/proc/self/status")
if err != nil {
return -1
}
defer f.Close()
scanner := bufio.NewScanner(f)
for scanner.Scan() {
line := scanner.Text()
if strings.HasPrefix(line, "VmRSS:") {
fields := strings.Fields(line)
n, _ := strconv.Atoi(fields[1])
return n
}
}
return -1
}
func main() {
fmt.Printf("startup: RSS=%d KB\n", rssKB())
// Allocate 200 MB.
big := make([][]byte, 200)
for i := range big {
big[i] = make([]byte, 1<<20)
big[i][0] = 1 // touch the page so it's resident
}
fmt.Printf("after alloc: RSS=%d KB\n", rssKB())
// Drop and GC.
big = nil
runtime.GC()
time.Sleep(100 * time.Millisecond) // GC is async-ish
fmt.Printf("after GC: RSS=%d KB (likely unchanged)\n", rssKB())
// Force OS release.
// "FreeOSMemory forces a garbage collection followed by an attempt
// to return as much memory to the operating system as possible.
// Use sparingly — it stops the world and is expensive."
debug.FreeOSMemory()
fmt.Printf("after FreeOSMemory: RSS=%d KB\n", rssKB())
}
Extension. Wrap the program with cgroup memory limit and observe how the OOM-killer triggers without FreeOSMemory but is dodged with it. Reproduces the Kubernetes memory-pressure pattern.
Task 15: Tiny allocator vs small-class allocator benchmark¶
Goal. Construct a benchmark comparing the tiny allocator (objects <16 B, no pointers) against the regular small-class allocator.
Difficulty. Senior.
Skills. Tiny allocator semantics, runtime/malloc.go, noscan types.
Steps.
- Read
mallocgcinruntime/malloc.go, specifically theif size < maxTinySizebranch and the explanation of why tiny allocs pack multiple small objects into a single 16-byte block. - Write
BenchmarkTinyallocating millions of*int8(1 byte, noscan). - Write
BenchmarkSmallNoScanallocating millions of*[16]byte(16 bytes, noscan). - Write
BenchmarkSmallScanallocating millions of*struct{ p *int }(16 bytes, scan). - Compare
B/opandns/op— the tiny allocator should pack 16*int8per 16-byte block.
Acceptance criteria.
BenchmarkTinyreports B/op << 16 (typically 1–2 B/op effective; the runtime amortises).BenchmarkSmallNoScanreports 16 B/op.BenchmarkSmallScanreports 16 B/op AND noticeably more GC pause time across runs.- Comment explains "tiny allocator: bumps a 16-byte offset; new block only when full".
Hint
The tiny allocator path in `runtime/malloc.go` is: if `size < 16` and the type has no pointers (`noscan`), use the per-P `mcache.tiny` buffer. The mcache holds a 16-byte block and a bump offset; objects of size 1, 2, 4, 8 are packed in. Once full, a fresh 16-byte block is taken from size class 2. This is why `make([]byte, 4)` rarely shows up in alloc profiles by bytes — they're amortised.Reference solution
package tinybench
import "testing"
type Big16 [16]byte
type Ptr16 struct{ P *int }
func BenchmarkTiny(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
x := new(int8) // 1 byte, noscan -> tiny allocator
*x = int8(i)
_ = x
}
}
func BenchmarkSmallNoScan(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
x := new(Big16) // 16 bytes, noscan -> size class 2
x[0] = byte(i)
_ = x
}
}
func BenchmarkSmallScan(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
x := &Ptr16{} // 16 bytes, HAS a pointer -> size class 2, scan
_ = x
}
}
Extension. Add a benchmark allocating *int (8 bytes, scan because Go internals treat scalar size as a hint but pointer types are always scan). Compare with *int8. The scan/noscan distinction matters more than the size.
Task 16: Large-object allocation (>32 KiB) benchmark¶
Goal. Benchmark allocations larger than _MaxSmallSize (32 KiB) which take the large-object path: straight to mheap.
Difficulty. Senior.
Skills. mheap, large-object lifecycle, runtime.MemStats.HeapInuse.
Steps.
- Write
BenchmarkLarge64KiBallocating onemake([]byte, 65536)per iteration. - Read
largeAllocinruntime/malloc.go. Note that large allocs bypass mcache entirely — every call goes throughmheap.allocunder the heap lock. - Compare against
BenchmarkLarge8KiB(one page, still small-class) andBenchmarkLarge128KiB. - Observe whether
ns/opscales linearly with size.
Acceptance criteria.
- All three benchmarks measured; B/op equals object size in each.
- ns/op shows the large path is significantly higher latency per byte than the small path (mheap lock + page allocation).
- Comment explains "large alloc bypasses mcache/mcentral, locks mheap, takes pages from the page allocator directly".
Hint
Anything `> _MaxSmallSize` (32 KiB) goes through `largeAlloc` -> `mheap_.alloc` -> `pageAlloc.alloc`. The heap is single-locked at the top. mcache short-circuits this for small allocs — every P has its own — but for large allocs, every goroutine contends. In a multi-tenant service that allocates large objects (image decoders, JSON payloads), this lock is the bottleneck.Reference solution
package largebench
import "testing"
func BenchmarkLarge8KiB(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
x := make([]byte, 8192)
x[0] = byte(i)
_ = x
}
}
func BenchmarkLarge64KiB(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
x := make([]byte, 65536)
x[0] = byte(i)
_ = x
}
}
func BenchmarkLarge128KiB(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
x := make([]byte, 131072)
x[0] = byte(i)
_ = x
}
}
Extension. Run BenchmarkLarge64KiB with GOMAXPROCS=16 and observe contention. Use go tool trace to see goroutines waiting on the mheap lock.
Task 17: Track a single allocation with MemProfileRate=1¶
Goal. Set runtime.MemProfileRate = 1 and runtime/pprof.Lookup("heap") to capture every allocation, then identify one specific allocation by stack.
Difficulty. Senior.
Skills. runtime.MemProfileRate, runtime/pprof.Lookup, sampled-vs-exhaustive profiling.
Steps.
- At program start, set
runtime.MemProfileRate = 1(default is 512 KiB — sampled). - Allocate exactly one 1024-byte slice from a function named
mySpecificAllocSite. - Capture:
pprof.Lookup("heap").WriteTo(f, 0). - Run
go tool pprof prof.outandlist mySpecificAllocSite. - Verify the single allocation is visible.
Acceptance criteria.
- Profile contains the precise call site with non-zero
flatbytes. - Comment explains "MemProfileRate=1 = sample every allocation; default 524288 samples ~1-in-512KiB; setting to 1 has 10-30% overhead and is for debugging not prod".
Hint
`runtime.MemProfileRate` is the sampling cadence in bytes between recorded allocations. With the default 512 KiB, small allocations are unlikely to appear at all. Setting it to 1 records every alloc — useful when chasing "why was this object created" but unacceptable in production. The setting must be applied *before* any allocation you want recorded.Reference solution
package main
import (
"log"
"os"
"runtime"
"runtime/pprof"
)
func mySpecificAllocSite() []byte {
return make([]byte, 1024)
}
func main() {
runtime.MemProfileRate = 1 // record every allocation
defer func() { runtime.MemProfileRate = 524288 }()
x := mySpecificAllocSite()
_ = x
f, err := os.Create("heap.prof")
if err != nil {
log.Fatal(err)
}
defer f.Close()
runtime.GC() // ensure profile is up to date
if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
log.Fatal(err)
}
}
Extension. Compare profile sizes: same program with MemProfileRate=1 vs default. The exhaustive profile can be 100x larger.
Task 18: Read runtime/mpagealloc.go — summarise the radix tree¶
Goal. Read runtime/mpagealloc.go and summarise the pageAlloc radix tree: structure, depth, why it exists.
Difficulty. Staff.
Skills. Reading dense runtime code, understanding hierarchical bitmap data structures.
Steps.
- Open
runtime/mpagealloc.go. Skim the top-of-file comment block. - Find
pageAllocstruct definition; identifysummary,chunks,searchAddr. - Understand the 5-level summary tree: each summary entry summarises 8 children below.
- Write a paragraph: what does the radix tree store? Why isn't a flat bitmap enough?
- Identify the function
(pageAlloc).find— how does it search?
Acceptance criteria.
- A 4-6 paragraph summary in a comment file (no code needed).
- Identifies: depth (5 levels), chunk size (4 MiB), per-summary "max contiguous free pages" hint.
- Explains why the tree exists: O(log N) search for a run of N free pages instead of O(heap-size) bitmap scan.
Hint
The radix tree is keyed by "address". Each leaf summarises a `pallocChunk` (4 MiB = 512 pages). Each higher level summarises 8 children below (so depth-1 = 32 MiB, depth-2 = 256 MiB, etc.). A summary entry records `(start, max, end)` — the run of free pages at the start, in the middle, and at the end of the region. Finding a run of N pages is then "descend the tree, pruning subtrees whose max < N".Reference solution
// notes_pagealloc.md (in code as a Go comment for the task)
//
// runtime/mpagealloc.go — pageAlloc radix tree
// ==============================================
//
// What it stores. Tracks which 8-KiB pages of the heap are free vs in
// use. Lives on top of mheap; small/large object allocations both end
// up asking pageAlloc for N contiguous pages.
//
// Why not a flat bitmap? A flat bitmap (one bit per page) requires
// scanning O(heap-size) bits to find a run of N free pages. For a 100
// GiB heap that's 1.6 GB of bitmap and a multi-millisecond scan per
// alloc. Unacceptable for the inner allocator loop. The radix tree
// short-circuits subtrees that cannot satisfy N.
//
// Structure. Up to 5 levels of summaries. Each leaf is a pallocChunk
// (4 MiB = 512 pages). Each summary entry above leaves contains:
// start: contiguous free pages at the START of the region
// max: longest contiguous free run ANYWHERE in the region
// end: contiguous free pages at the END of the region
// Combining children is associative: parent.max = max(child[0].end +
// child[1].start, max(child[i].max for i in children), ...).
//
// On 64-bit Linux, the maximum heap is 4 PiB. Five levels are enough
// to cover this: 4 MiB * 8^5 = 4 MiB * 32768 = 128 GiB per top-level
// entry; 8 top-level entries = 1 TiB; further extension via "heap
// arenas". See arenaL1Bits and arenaL2Bits in malloc.go.
//
// (pageAlloc).find(npages). Descends from the top. At each level,
// picks the leftmost child whose summary indicates a free run of
// >= npages. Recurses. At the leaf (chunk), does a bitmap scan over
// 512 bits. The descent prunes >99% of the heap on a typical lookup.
//
// scavenger interaction. The same radix tree carries "scavenged"
// state — pages that are free AND have been released to the OS via
// madvise. The scavenger uses the tree to find runs of in-heap-but-
// not-released pages, then madvise's them. The tree is the single
// source of truth for both allocation and scavenging.
//
// Senior takeaway. The page allocator went from a free-list of mspans
// (pre-1.14) to this radix tree because GC mark-termination is
// dominated by the time to scan free lists; the tree is essentially
// a B+ tree of free-page runs that supports both lookup and update
// in O(log N) where N is heap-page-count. Read commits 7fb5fe4
// (Go 1.14) for the rationale.
Extension. Run pprof against runtime.(*pageAlloc).find in a CPU profile of an allocation-heavy program. Notice the depth-first descent in the flame graph.
Task 19: Per-package allocation tracker¶
Goal. Build a custom allocation tracker that counts allocations per package by sampling runtime.MemProfile records and grouping by call-site package.
Difficulty. Staff.
Skills. runtime.MemProfile, runtime.FuncForPC, package-path parsing.
Steps.
- Call
runtime.MemProfile(records, true)to fetch the in-process memory profile. - For each record, walk
record.Stack()from the top frame. - Use
runtime.FuncForPC(pc).Name()to get qualified function names. - Extract the package path (everything before the last
.). - Aggregate
AllocBytes/AllocObjectsper package into amap[string]struct{Bytes, Objects int64}. - Print top 10 packages by
AllocBytes.
Acceptance criteria.
- Tracker prints a per-package breakdown.
- Self-test: allocate from three different packages and verify they appear distinctly.
- Comment explains "MemProfile is sampled; numbers are estimates not exact counts".
Hint
`runtime.MemProfile` writes into a slice of `runtime.MemProfileRecord`. Each record has `AllocBytes`, `AllocObjects`, `FreeBytes`, `FreeObjects`, and a `Stack []uintptr`. Live = Alloc - Free. The function name format is `pkg/path.FuncName` — strip from the last `.` to get the package path. For methods, the format is `pkg/path.(*Type).Method` — same strip-from-last-`.` trick works after balancing parentheses.Reference solution
package alloctracker
import (
"fmt"
"io"
"runtime"
"sort"
"strings"
)
type PackageStats struct {
AllocBytes int64
AllocObjects int64
InUseBytes int64
InUseObjects int64
}
// Snapshot walks the runtime's memory profile and groups by package
// of the top-of-stack frame.
func Snapshot() map[string]*PackageStats {
// Fetch records. Loop until the buffer is large enough.
var records []runtime.MemProfileRecord
for {
n, ok := runtime.MemProfile(records, true)
if ok {
records = records[:n]
break
}
records = make([]runtime.MemProfileRecord, n+50)
}
out := map[string]*PackageStats{}
for _, r := range records {
pkg := packageOfStack(r.Stack())
if pkg == "" {
pkg = "<unknown>"
}
s := out[pkg]
if s == nil {
s = &PackageStats{}
out[pkg] = s
}
s.AllocBytes += r.AllocBytes
s.AllocObjects += r.AllocObjects
s.InUseBytes += r.InUseBytes()
s.InUseObjects += r.InUseObjects()
}
return out
}
func packageOfStack(stack []uintptr) string {
for _, pc := range stack {
f := runtime.FuncForPC(pc)
if f == nil {
continue
}
name := f.Name()
// Skip runtime frames so we attribute to user code.
if strings.HasPrefix(name, "runtime.") {
continue
}
return packageOf(name)
}
return ""
}
func packageOf(funcName string) string {
// "pkg/path.FuncName" -> "pkg/path"
// "pkg/path.(*Type).Method" -> "pkg/path"
// Find the last '/'; from there, look for the first '.'.
slash := strings.LastIndex(funcName, "/")
dot := strings.Index(funcName[slash+1:], ".")
if dot < 0 {
return funcName
}
return funcName[:slash+1+dot]
}
func PrintTop(w io.Writer, n int) {
stats := Snapshot()
type row struct {
pkg string
s *PackageStats
}
rows := make([]row, 0, len(stats))
for k, v := range stats {
rows = append(rows, row{k, v})
}
sort.Slice(rows, func(i, j int) bool {
return rows[i].s.AllocBytes > rows[j].s.AllocBytes
})
if n > len(rows) {
n = len(rows)
}
fmt.Fprintf(w, "%-40s %15s %15s\n", "PACKAGE", "ALLOC_BYTES", "INUSE_BYTES")
for _, r := range rows[:n] {
fmt.Fprintf(w, "%-40s %15d %15d\n",
r.pkg, r.s.AllocBytes, r.s.InUseBytes)
}
// MemProfile is SAMPLED at runtime.MemProfileRate (default 512 KiB).
// Numbers are statistical estimates, not exact counts. Set
// runtime.MemProfileRate = 1 for exhaustive recording (high cost).
}
Extension. Group by the second-to-leaf frame (the caller's caller). For libraries like encoding/json this often reveals "which of my packages is feeding json.Marshal the most data".
Task 20: "RSS grows but Heap is flat" — diagnose with scavtrace¶
Goal. Reproduce the pathological case where process RSS grows over time but HeapAlloc stays flat. Diagnose using GODEBUG=scavtrace=1.
Difficulty. Staff.
Skills. Scavenger lifecycle, fragmentation, madvise(MADV_DONTNEED) vs returning to the runtime.
Steps.
- Write a program that, in a loop forever, does: (a) allocates 50 MB of small objects scattered across many goroutines, (b) drops all of them, (c) sleeps 100 ms.
- Run under
GODEBUG=scavtrace=1. Capture scavtrace lines. - Measure RSS every 5 seconds for 2 minutes.
- Observe:
HeapAllocreturns to baseline after each drop, but RSS climbs slowly. - Explain the cause — fragmentation. Pages with even one live object cannot be released; over time, "holes" accumulate.
Acceptance criteria.
- RSS climbs across measurements while HeapAlloc oscillates around a low baseline.
- Scavtrace output captured and a comment interprets each field.
- Diagnosis: each allocation cycle leaves some pages with at least one survivor (Go GC + scheduling timing), preventing those pages from being scavenged.
- Mitigation suggested:
debug.FreeOSMemory()periodic call, orGOMEMLIMITto cap total RSS, or restructuring to allocate from async.Poolof large buffers.
Hint
The scavenger can only release a *page* (8 KiB) once *all* objects on that page are free. If even one survivor pins a page, the whole page stays resident. With small objects spread across many goroutines, the random survivor pattern leaves many pages with one-or-two live objects each — those pages are wasted. This is the classic "heap fragmentation" pathology, common in services with high allocation churn and concurrent allocators.Reference solution
package main
import (
"bufio"
"fmt"
"os"
"runtime"
"strconv"
"strings"
"sync"
"time"
)
func rssKB() int {
f, err := os.Open("/proc/self/status")
if err != nil {
return -1
}
defer f.Close()
scanner := bufio.NewScanner(f)
for scanner.Scan() {
line := scanner.Text()
if strings.HasPrefix(line, "VmRSS:") {
fs := strings.Fields(line)
n, _ := strconv.Atoi(fs[1])
return n
}
}
return -1
}
type Small struct{ A, B int }
// keep is a global retention list that catches a small random fraction
// of allocations in each cycle. This emulates the realistic case where
// "some pointers survive" — caches, work queues, recently-completed
// records still being logged, etc.
var (
keepMu sync.Mutex
keep []*Small
)
func cycle() {
const N = 50_000
var wg sync.WaitGroup
for g := 0; g < 8; g++ {
wg.Add(1)
go func(g int) {
defer wg.Done()
local := make([]*Small, N)
for i := range local {
local[i] = &Small{A: g, B: i}
}
// Pin ~0.1% of objects randomly via the global keep slice.
// In production this models "incidental retention".
keepMu.Lock()
for i := 0; i < N; i += 1000 {
keep = append(keep, local[i])
}
if len(keep) > 5_000 {
keep = keep[2_500:] // trim, but not all the way
}
keepMu.Unlock()
runtime.KeepAlive(local)
}(g)
}
wg.Wait()
runtime.GC()
}
func main() {
fmt.Printf("pid=%d\n", os.Getpid())
go func() {
for {
cycle()
time.Sleep(100 * time.Millisecond)
}
}()
var m runtime.MemStats
for t := 0; t < 24; t++ {
runtime.ReadMemStats(&m)
fmt.Printf("t=%3ds HeapAlloc=%6d KB HeapInuse=%6d KB RSS=%6d KB NumGC=%d\n",
t*5, m.HeapAlloc/1024, m.HeapInuse/1024, rssKB(), m.NumGC)
time.Sleep(5 * time.Second)
}
}
pid=12345
t= 0s HeapAlloc= 1500 KB HeapInuse= 3000 KB RSS= 20000 KB NumGC=4
t= 5s HeapAlloc= 1800 KB HeapInuse= 9000 KB RSS= 45000 KB NumGC=22
t= 10s HeapAlloc= 1900 KB HeapInuse= 14000 KB RSS= 68000 KB NumGC=45
t= 30s HeapAlloc= 2100 KB HeapInuse= 28000 KB RSS= 140000 KB NumGC=120
t= 60s HeapAlloc= 2300 KB HeapInuse= 52000 KB RSS= 280000 KB NumGC=240
t=120s HeapAlloc= 2500 KB HeapInuse= 88000 KB RSS= 460000 KB NumGC=470
// HeapAlloc stays around 2 MiB — the program does NOT leak in the
// language sense. Each cycle drops 99.9% of its allocations and GC
// reclaims them. But because each cycle pins ~50 random objects in
// `keep`, those objects are scattered across hundreds of mspans. A
// page (8 KiB) holding even one live object cannot be returned to
// the OS. Over time, ever more pages become "lightly populated" and
// stick around.
//
// RSS climb = fragmentation, NOT leak.
//
// Mitigations (in order of impact):
// 1. Allocate the long-lived `keep` set from a separate arena
// (one big preallocated []Small) so retained objects are
// densely packed. Frees fragmentation immediately.
// 2. Set GOMEMLIMIT to cap RSS; the scavenger becomes aggressive.
// 3. Periodic debug.FreeOSMemory() during quiet windows.
// 4. Restart-on-bloat policy (last resort; trades availability).
//
// What is NOT a fix:
// - sync.Pool by itself — Pool entries also get scattered.
// - GOGC = lower — triggers more GCs but does not address packing.
// - Calling runtime.GC() in a tight loop — same as above.
Extension. Replace the random-retention pattern with a sync.Pool of preallocated large arenas, where short-lived objects are sliced off the arena and dropped together. Re-run and verify RSS no longer climbs.
Self-grading rubric¶
Score each task 0 (skipped), 1 (got it with hints), 2 (got it unaided), 3 (got it AND can predict the behaviour of a similar task before running it). Sum:
| Score | What it means |
|---|---|
| 0–20 | You've seen runtime.MemStats and -benchmem once. Re-read junior.md and middle.md; redo Tasks 1–8. The mental model of "size classes -> mcache -> mcentral -> mheap" must be reflex before deeper work. |
| 21–35 | You can run the allocator's diagnostics and predict the small-object path. Tasks 9–14 cement escape analysis, the scavenger, and pprof. The "RSS vs HeapAlloc" distinction in Task 13/14 is the single most common production confusion — confirm you've internalised it. |
| 36–48 | Senior level. You distinguish tiny vs small vs large allocations and can read a -alloc_space profile critically. Tasks 15–17 push you into the runtime source — start using it as documentation, not folklore. |
| 49–60 | Staff level. Tasks 18–20 require holding the radix tree and the fragmentation pathology in your head simultaneously. If Task 20's diagnosis didn't feel familiar, you haven't lived through it yet — ask any operator of a long-running Go service for war stories. |
The most important question is not did you finish — it is can you predict, for any new allocation pattern, which path through the allocator it takes and what its long-term RSS curve will look like? "This is a 24-byte struct with a pointer — size class 4, scan." "This is a 50 KiB buffer — large path, mheap lock, candidate for sync.Pool." "This is a hot []byte returning from a hot function — escape analysis will keep it on the stack unless the result interface-converts." If those come reflexively, you understand the allocator. If not, the rest is plumbing.
Concrete checks worth running before declaring done:
go test -bench=. -benchmemclean on every numerical claim — predict, then verify, then explain any 10%+ delta.- For escape analysis tasks (2, 3, 10): always rebuild with
-gcflags="-m=2"and read the justification, not just the verdict. Compiler decisions change between Go versions. - For pprof tasks (11, 12, 17): cross-check
-alloc_spaceagainst-alloc_objects. Top-N divergence between them is the diagnostic, not either one alone. - For scavenger tasks (13, 14, 20): always include RSS measurement, not just
MemStats. The whole point is that they disagree. - For Task 19 (per-package tracker): exercise from three packages and verify isolation; without that smoke test, name-parsing bugs will silently merge unrelated callers.
Stretch challenges¶
S1 — Cross-version allocator regression detector. Build a CI tool that runs a benchmark suite under Go 1.21, 1.22, 1.23 and reports allocator-related regressions per benchmark. Constraint: detect changes in B/op, allocs/op, AND new flat lines in pprof -alloc_objects between versions. The hard part is identifying which allocations are version-sensitive (escape analysis changes between minor versions affect specific patterns). Output: a Markdown table per affected package, showing baseline vs current numbers and the specific functions whose escape decisions changed. Bonus: integrate with go test -bench=. JSON output (go test -bench=. -count=10 -benchtime=2s -json | tee bench.json).
S2 — Live fragmentation observer. Build an HTTP endpoint that exposes the current heap fragmentation ratio: HeapInuse / HeapAlloc plus the per-size-class breakdown of "spans with one live object" (the fragmentation indicator). Pull data from runtime.ReadMemStats and runtime/metrics (the new structured metrics API). Constraint: < 1 ms p99 to respond, because the endpoint will be scraped by Prometheus every 15 seconds. The hard part is that runtime.MemStats is not cheap (it stops the world briefly) — you need to cache, choose the right sample window, and decide which metrics are actually fragmentation-sensitive (HeapInuse - HeapAlloc) versus which are noise.
S3 — Arena-backed object pool with safe lifetimes. Implement a typed arena allocator (similar to the experimental arena package but production-safe) that grants N instances of T from a preallocated chunk, tracks per-arena reference counts, and recycles the entire chunk when the count hits zero. Constraints: (a) the arena must work for types containing pointers — the chunk's pages must be scanned by the GC for the duration of any live reference, but freed atomically when no references remain. (b) Misuse (using an instance after the arena is freed) must be detectable in -race mode via explicit runtime.SetFinalizer-based instrumentation. (c) The benchmark BenchmarkArenaVsHeap must show 2-5x throughput improvement vs equivalent new(T) for hot workloads. Worth understanding before attempting: read the design doc for golang.org/x/exp/arena and the criticism of it (lifetime safety is the hard problem, not the speedup).