Memory Management in Depth: Optimize

1. Measure first, optimize second

The most common mistake in memory optimization is starting with a guess. Before you change a single line:

Capture a baseline benchmark with -benchmem.
Capture an allocation profile (-memprofile mem.out + pprof -alloc_objects).
Capture a GC trace (GODEBUG=gctrace=1) for at least one minute of representative load.
Identify the one allocation site responsible for the majority of bytes or count.

If you can't point at a specific call site and say "this is 40% of allocations", you don't have a target — you have a hunch. Don't optimize on a hunch.

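# -memprofile works on one package at a time, so run this inside the package under test rather than with ./...
go test -bench=. -benchmem -memprofile=mem.out -count=10 .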
go tool pprof -alloc_objects mem.out
(pprof) top
(pprof) list YourHotFunc
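
As a rough sketch of what a baseline measurement looks like in code, the program below drives a function through the standard benchmark harness and prints ns/op, B/op and allocs/op. `formatID` is a hypothetical stand-in for your own hot function; in a real project the closure body would normally live in a `_test.go` file as `func BenchmarkFormatID(b *testing.B)` and be run with `go test -bench`.

```go
package main

import (
	"fmt"
	"testing"
)

// formatID is a hypothetical hot function standing in for YourHotFunc.
func formatID(n int) string {
	return fmt.Sprintf("id-%d", n)
}

// sink keeps the result alive so the compiler cannot discard the call.
var sink string

// baseline runs formatID under the standard benchmark harness and
// returns the timing and allocation figures it collected.
func baseline() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = formatID(i)
		}
	})
}

func main() {
	r := baseline()
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

Record these numbers before touching the code; they are what every later change gets compared against.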

2. Struct field ordering

Go lays out structs much as C does. Fields are placed in declaration order (the gc compiler does not reorder them), with padding to satisfy each field's alignment, plus tail padding to satisfy the struct's own alignment.

type bad struct {
    a bool    // 1 + 7 padding
    b int64   // 8
    c bool    // 1 + 7 padding
}             // total: 24 bytes

type good struct {
    b int64   // 8
    a bool    // 1
    c bool    // 1 + 6 padding
}             // total: 16 bytes

For one struct nobody cares; for millions of records the savings are real.
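
The byte counts in the comments can be checked directly with `unsafe.Sizeof` and `unsafe.Offsetof`. The sketch below redeclares the two structs from above; on 64-bit platforms such as amd64 and arm64 it should report 24 and 16 bytes, while 32-bit targets align `int64` differently and will print smaller numbers.

```go
package main

import (
	"fmt"
	"unsafe"
)

type bad struct {
	a bool
	b int64
	c bool
}

type good struct {
	b int64
	a bool
	c bool
}

// sizes returns the in-memory size of each layout, padding included.
func sizes() (badSize, goodSize uintptr) {
	return unsafe.Sizeof(bad{}), unsafe.Sizeof(good{})
}

func main() {
	b, g := sizes()
	fmt.Printf("bad: %d bytes, good: %d bytes\n", b, g)

	// Field offsets show where the padding went.
	var v bad
	fmt.Printf("bad offsets: a=%d b=%d c=%d\n",
		unsafe.Offsetof(v.a), unsafe.Offsetof(v.b), unsafe.Offsetof(v.c))
}
```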

Detection:

go install honnef.co/go/tools/cmd/structlayout@latest
go install honnef.co/go/tools/cmd/structlayout-pretty@latest
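# structlayout takes one package and one type; structlayout-pretty reads its JSON output
structlayout -json ./path/to/pkg MyType | structlayout-pretty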

Or, for whole-package guidance:

go install golang.org/x/tools/go/analysis/passes/fieldalignment/cmd/fieldalignment@latest
fieldalignment ./...

Order by alignment descending (int64, *T, int32, int16, bool). That is a safe default: it never wastes space on padding between fields, though a hand-tuned layout can still be preferable when you want hot fields grouped together.

3. Slices: capacity is half the API

// Reallocates and copies more than a dozen times before reaching 5000 elements
s := make([]int, 0)
for i := 0; i < 5000; i++ {
    s = append(s, i)
}

// One allocation, zero copies
s := make([]int, 0, 5000)
for i := 0; i < 5000; i++ {
    s = append(s, i)
}

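The first form reallocates and copies more than a dozen times on its way to 5000 elements. Under pure doubling (1, 2, 4, 8, …, 8192) that is 14 allocations totalling roughly twice the final capacity; the actual runtime doubles small slices and then grows larger ones by about 1.25×, with details that vary by Go version. Wherever you can predict the final length, pre-size.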
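
Because the growth sequence depends on the Go version, it is easier to observe than to memorize. The sketch below counts how often `append` has to move to a larger backing array by watching `cap` change; `countGrowths` is a helper name invented for this example.

```go
package main

import "fmt"

// countGrowths appends n ints to a slice that starts with the given
// capacity and reports how many times append had to reallocate.
func countGrowths(n, initialCap int) int {
	s := make([]int, 0, initialCap)
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			growths++ // a new, larger backing array was allocated
		}
	}
	return growths
}

func main() {
	fmt.Println("no pre-sizing:", countGrowths(5000, 0))
	fmt.Println("pre-sized:    ", countGrowths(5000, 5000))
}
```

The pre-sized call reports zero reallocations; the other figure is whatever your toolchain's growth policy produces.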

For maps:

m := make(map[string]int, 5000)  // hint, not a hard cap

The hint lets the runtime size the map up front, which avoids incremental growth and rehashing until the map holds about that many entries.

4. Avoid `interface{}` boxing in hot loops

// Every element was boxed when it was stored into []any; the loop adds a type assertion per read
func sum(xs []any) (s int64) {
    for _, x := range xs {
        s += x.(int64)
    }
    return
}

// Zero allocation
func sum(xs []int64) (s int64) {
    for _, x := range xs {
        s += x
    }
    return
}

The first version pays a heap allocation when each int64 is stored into the slice, regardless of how you read it back out. Generics (Go 1.18+) make typed code reusable without paying the boxing cost:

// constraints here is golang.org/x/exp/constraints (not part of the standard library)
func sum[T constraints.Integer](xs []T) T {
    var s T
    for _, x := range xs {
        s += x
    }
    return s
}
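
To see the boxing cost in numbers, the sketch below measures allocations with `testing.AllocsPerRun`. It declares a local `Integer` constraint as a stand-in for `constraints.Integer` so that it needs only the standard library; `box`, `sumOf` and `boxingAllocs` are names invented for the example. The test values start at 1000 because the current runtime can avoid allocating for very small integers, which is an implementation detail rather than a guarantee.

```go
package main

import (
	"fmt"
	"testing"
)

// Integer is a local stand-in for constraints.Integer.
type Integer interface {
	~int | ~int8 | ~int16 | ~int32 | ~int64 |
		~uint | ~uint8 | ~uint16 | ~uint32 | ~uint64 | ~uintptr
}

// sumOf adds the elements without converting them to interfaces.
func sumOf[T Integer](xs []T) T {
	var s T
	for _, x := range xs {
		s += x
	}
	return s
}

// box copies xs into a []any; each element is converted to an interface.
func box(xs []int64) []any {
	out := make([]any, len(xs))
	for i, x := range xs {
		out[i] = x
	}
	return out
}

// sink forces the boxed slice to escape to the heap.
var sink []any

// boxingAllocs reports the average number of heap allocations made
// while boxing n int64 values.
func boxingAllocs(n int) float64 {
	xs := make([]int64, n)
	for i := range xs {
		xs[i] = int64(1000 + i)
	}
	return testing.AllocsPerRun(100, func() { sink = box(xs) })
}

func main() {
	fmt.Println("sum:", sumOf([]int64{1, 2, 3})) // sum: 6
	fmt.Println("allocations to box 100 values:", boxingAllocs(100))
}
```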

5. `strings.Builder` over `+=`

// Quadratic copying: each += allocates a new string and copies everything accumulated so far
var s string
for _, p := range parts {
    s += p
}

// Linear: one backing array, grows geometrically
var b strings.Builder
b.Grow(estimatedSize)
for _, p := range parts {
    b.WriteString(p)
}
s := b.String()

Grow matters: without it, the builder still allocates several times. With it, often once.

Same pattern with bytes.Buffer for []byte.
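
When the parts are already in hand, `estimatedSize` does not have to be an estimate: one pass over the lengths gives the exact figure. The sketch below does that and compares allocation counts for the two approaches; `concatNaive`, `concatBuilder`, `sampleParts` and `allocs` are illustrative names, not part of any library.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// concatNaive builds the result with +=, reallocating on each step.
func concatNaive(parts []string) string {
	var s string
	for _, p := range parts {
		s += p
	}
	return s
}

// concatBuilder sums the lengths first so Grow can reserve the exact size.
func concatBuilder(parts []string) string {
	n := 0
	for _, p := range parts {
		n += len(p)
	}
	var b strings.Builder
	b.Grow(n)
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

// sampleParts returns n copies of a short string to concatenate.
func sampleParts(n int) []string {
	parts := make([]string, n)
	for i := range parts {
		parts[i] = "chunk"
	}
	return parts
}

// sink keeps results alive so the work is not optimized away.
var sink string

// allocs reports the average heap allocations per call of f(parts).
func allocs(f func([]string) string, parts []string) float64 {
	return testing.AllocsPerRun(100, func() { sink = f(parts) })
}

func main() {
	parts := sampleParts(200)
	fmt.Println("+= allocs:     ", allocs(concatNaive, parts))
	fmt.Println("Builder allocs:", allocs(concatBuilder, parts))
}
```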

6. `sync.Pool` — when, where, how

Good candidates:

Many short-lived allocations of similar size, in concurrent paths (HTTP handlers, codec scratch).
Ownership is simple: the caller is finished with the value when it returns, so it can be handed back without tracking other references.

Bad candidates:

Long-lived objects (just allocate once at startup).
Variable-size objects without a cap discipline (the pool ends up retaining the largest buffers it has ever held).
Anything you couldn't safely call Reset() on.

Template:

var pool = sync.Pool{
    New: func() any { return &Decoder{buf: make([]byte, 0, 4096)} },
}

func decode(r io.Reader) (*Result, error) {
    d := pool.Get().(*Decoder)
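    defer func() {
        // Drop oversized buffers instead of pooling them, so one huge input cannot pin memory.
        if cap(d.buf) <= 64<<10 {
            d.Reset()
            pool.Put(d)
        }
    }()
    // The returned Result must not alias d.buf: d may be reused as soon as this function returns.
    return d.Decode(r)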
}

Bench before vs. after. If the win is < 10%, the complexity probably isn't worth it.
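
The template above relies on a `Decoder` type that is not shown. For something you can run as-is, here is the same get, use, reset, put cycle with `bytes.Buffer` as the pooled scratch object. `greet` and `maxPooledCap` are hypothetical names, and the 64 KiB cap simply mirrors the template.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// maxPooledCap bounds what goes back into the pool (assumed limit).
const maxPooledCap = 64 << 10

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// greet formats a message using a pooled scratch buffer.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		// Only return reasonably sized buffers; oversized ones are left to the GC.
		if buf.Cap() <= maxPooledCap {
			buf.Reset()
			bufPool.Put(buf)
		}
	}()
	buf.WriteString("hello, ")
	buf.WriteString(name)
	// String copies the bytes, so the result does not alias the pooled buffer.
	return buf.String()
}

func main() {
	fmt.Println(greet("gopher"))
	fmt.Println(greet("world")) // reuses a reset buffer if one is available
}
```

The copy made by `buf.String()` is what makes it safe to put the buffer back; returning `buf.Bytes()` instead would hand the caller memory that the next `Get` may overwrite.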

7. Escape analysis as an optimization tool

go build -gcflags="-m=2" ./... 2>&1 | grep "escapes to heap"

-m=2 adds detail: it explains why each variable escaped. The most actionable patterns:

Compiler message	Fix
`moved to heap: x` (older compilers: `&x escapes to heap`) from `return &x`	Return a value instead of a pointer
`make([]T, n) escapes to heap` (`n` is non-constant)	Use a constant size if you can bound `n`, or pass a pre-allocated slice in
`x escapes to heap` where `x` is converted to an interface	Avoid the interface conversion in the hot path
Closure captures `x` by reference	Pass by value into the closure or make it a method

You don't need to chase every escape — many are harmless. Focus on those in functions that appear in your allocation profile.
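
A small sketch of the first row of the table: two constructors for the same type, one returning a pointer and one returning a value. Building it with `-gcflags=-m` should flag `p` in `newPointPtr` as moved to the heap, and the allocation counts confirm the difference at run time. The `//go:noinline` directives are there only to keep the compiler from inlining the constructors and hiding the effect; all names are illustrative.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// newPointPtr returns the address of a local, so the local must live on the heap.
//
//go:noinline
func newPointPtr(x, y int) *point {
	p := point{x, y}
	return &p
}

// newPointVal returns a copy, so nothing needs to outlive the call.
//
//go:noinline
func newPointVal(x, y int) point {
	return point{x, y}
}

var (
	sinkPtr *point
	sinkVal point
)

// ptrAllocs and valAllocs report average heap allocations per constructor call.
func ptrAllocs() float64 {
	return testing.AllocsPerRun(1000, func() { sinkPtr = newPointPtr(1, 2) })
}

func valAllocs() float64 {
	return testing.AllocsPerRun(1000, func() { sinkVal = newPointVal(1, 2) })
}

func main() {
	fmt.Println("pointer return allocs/op:", ptrAllocs())
	fmt.Println("value return allocs/op:  ", valAllocs())
}
```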

8. Avoid allocations from common idioms

Idiom	Allocates?	Fix
`fmt.Sprintf("%d", n)`	Yes	`strconv.Itoa(n)`
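`[]byte(s)`	Usually (the compiler elides the copy only in a few recognized patterns)	`unsafe.Slice(unsafe.StringData(s), len(s))` (Go 1.20+) for read-only access
`string(b)`	Usually (lookups such as `m[string(b)]` and comparisons are optimized)	The reverse: `unsafe.String(unsafe.SliceData(b), len(b))` (Go 1.20+)
`time.Now().Sub(t)`	No (`time.Time` and `time.Duration` are plain values; `time.Since(t)` is the idiomatic shorthand)	None needed
`errors.New("...")` in a hot path	Yes	Predefine the sentinel as a package var
`map[string]T` keyed by short stable strings	No, but the key is hashed on every lookup	Use a typed enum or `int` key if the domain allows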

unsafe.StringData / unsafe.String are sharp tools — they let the same backing memory be viewed as string and []byte, which means you must guarantee no one mutates the bytes for the lifetime of any aliasing string.
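
For reference, a minimal sketch of the two zero-copy conversions using the Go 1.20+ `unsafe` helpers. `bytesToString` and `stringToBytes` are names chosen for this example; using `unsafe.SliceData` rather than `&b[0]` keeps the empty-slice case from panicking.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString views b as a string without copying.
// The caller must not modify b while the string is in use.
func bytesToString(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// stringToBytes views s as a byte slice without copying.
// The result is read-only: writing to it is undefined behavior.
func stringToBytes(s string) []byte {
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	b := []byte("hello")
	s := bytesToString(b)
	fmt.Println(s, len(stringToBytes("world"))) // hello 5
}
```

Keep conversions like these behind small, well-named functions so the aliasing contract is documented in exactly one place.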

9. `make` vs declared zero

var s []int            // nil slice, no allocation
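s := make([]int, 0)    // non-nil, empty slice; zero length, so no backing array is allocated
s := []int{}           // same as make([]int, 0)

For an empty result you'll often return, prefer the var form. Callers can append to a nil slice without trouble. None of the three forms allocates a backing array, so the difference is semantic: a nil slice compares equal to nil and encodes as null in encoding/json, while an empty non-nil slice encodes as [].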

Same for maps:

var m map[K]V          // nil; reads return zero, writes panic
m := make(map[K]V)     // empty but writable; may allocate if the map escapes to the heap

Pick based on whether the function may write to it.
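
The behavioral differences between nil and empty values are easy to check. The sketch below appends to a nil slice, reads from a nil map, and shows one place where nil and empty slices visibly differ, namely `encoding/json` output. The helper names are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encode marshals v to JSON and returns it as a string.
func encode(v any) string {
	out, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

// nilSliceJSON and emptySliceJSON encode the two kinds of empty slice.
func nilSliceJSON() string {
	var s []int
	return encode(s)
}

func emptySliceJSON() string {
	return encode([]int{})
}

func main() {
	var s []int
	s = append(s, 1, 2) // appending to a nil slice works

	var m map[string]int
	fmt.Println(len(s), m["missing"]) // 2 0

	fmt.Println(nilSliceJSON(), emptySliceJSON()) // null []
}
```

If a JSON consumer expects `[]` rather than `null`, that is a reason to return an explicitly empty slice even though the nil form is otherwise preferable.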

10. Concurrent allocation: per-P scratch

For a service that performs the same expensive allocation per request, the allocator scales well but locking inside your code might not. Instead of one global pool with a mutex, prefer:

sync.Pool (per-P internally).
A sharded structure with one lock per shard, sized from runtime.GOMAXPROCS(0) or indexed by a hash of the key as in the example that follows (rare, but useful for specialized cases).

type shards [256]struct {
    _    [48]byte   // pad to a 64 B cache line: 8 B mutex + 8 B map pointer on 64-bit
    mu   sync.Mutex
    data map[string]int
}

var sh shards

// fnv32 is assumed to be a helper that hashes k to a uint32 (for example with hash/fnv).
func get(k string) int {
    s := &sh[fnv32(k)%uint32(len(sh))]
    s.mu.Lock()
    defer s.mu.Unlock()
    return s.data[k]
}

Padding to 64 B prevents false sharing — when two adjacent shards live in one cache line and writes to one invalidate the other on every CPU. Measure with perf c2c (Linux) before reaching for this.
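
A fuller sketch of the sharded layout, with the padding computed from the struct size instead of hard-coded, so it stays correct if fields are added. The 64-byte line size is an assumption that holds for most x86-64 parts but not for every CPU, and `counters`, `shardState`, `shardFor` and `shardSize` are names invented for this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"unsafe"
)

const cacheLine = 64 // assumed cache-line size in bytes

// shardState holds the fields that are actually used.
type shardState struct {
	mu   sync.Mutex
	data map[string]int
}

// shard pads shardState so each array element is a multiple of cacheLine bytes.
type shard struct {
	shardState
	_ [cacheLine - unsafe.Sizeof(shardState{})%cacheLine]byte
}

type counters struct {
	shards [256]shard
}

func newCounters() *counters {
	c := &counters{}
	for i := range c.shards {
		c.shards[i].data = make(map[string]int)
	}
	return c
}

// shardFor picks a shard by hashing the key with 32-bit FNV-1a.
func (c *counters) shardFor(k string) *shard {
	h := fnv.New32a()
	h.Write([]byte(k))
	return &c.shards[h.Sum32()%uint32(len(c.shards))]
}

func (c *counters) add(k string, n int) {
	s := c.shardFor(k)
	s.mu.Lock()
	s.data[k] += n
	s.mu.Unlock()
}

func (c *counters) get(k string) int {
	s := c.shardFor(k)
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[k]
}

// shardSize reports the padded size of one shard.
func shardSize() uintptr { return unsafe.Sizeof(shard{}) }

func main() {
	c := newCounters()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.add("requests", 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.get("requests"), shardSize())
}
```

Hashing through `hash/fnv` is convenient for a sketch; a hot path would more likely use an inlined hash to avoid the extra work per lookup.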

11. Profile-guided optimization (PGO)

Go 1.21 GA'd PGO. The flow:

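# 1. Build a profilable binary (the app must import net/http/pprof and serve it on localhost:6060)
go build -o app ./cmd/app

# 2. Run under representative load, collect a 60-second CPU profile
curl -o default.pgo "http://localhost:6060/debug/pprof/profile?seconds=60"

# 3. Rebuild with the profile (-pgo=auto is the default since Go 1.21 and picks up default.pgo in the main package directory)
mv default.pgo cmd/app/
go build -pgo=auto -o app ./cmd/app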

Typical wins: 2–10% on CPU. For allocation patterns, PGO helps because the compiler inlines hotter paths more aggressively, which lets escape analysis see more, which can stack-allocate things that previously escaped.

See 11-pgo for the full treatment.

12. The optimization checklist

Before claiming a memory optimization is done:

You have a baseline benchmark.
You have a profile that pinpointed the hotspot.
You measured allocations/op, not just ns/op.
You ran benchstat over at least 10 iterations and the change is statistically meaningful.
You ran the full test suite, including the race detector.
You wrote (or have) a benchmark that guards against the regression you fixed.
You documented the change with the before/after numbers in the commit message.

Without these, you have a change. With them, you have an optimization.

13. Summary

The fastest way to optimize Go memory is to make the allocation never happen — pre-allocate, pool, avoid boxing, prefer values to pointers in small types. The second-fastest is to use the right primitive (strings.Builder, strconv.Itoa, generics over interface{}). Knobs (GOGC, GOMEMLIMIT) are blunter than they look; reach for them only when you've already done the algorithmic work. Measure, change one thing, measure again, commit with numbers.