GC Source — Optimization
1. How to use this file
Fifteen scenarios where code creates GC pressure on a hot path — extra allocations, large scan surface, or mark-assist bursts that show up as P99 jitter. Each entry has a Before (code + benchmark + GC trace) and a collapsible After (optimized code + numbers + why grounded in the runtime + trade-offs + when NOT).
Anchored at Go 1.23, amd64, GOGC=100. Numbers show the expected shape, not exact values: run go test -bench=. -benchmem with GODEBUG=gctrace=1 on your hardware before quoting them. The mechanics referenced live in runtime/mgc.go (cycle driver), runtime/mgcpacer.go (pacing), runtime/mgcmark.go (scan), runtime/mgcsweep.go (sweep), runtime/mbarrier.go (write barrier), and runtime/malloc.go (allocation fast path).
GC pressure has four shapes: per-call allocations that drive heap growth and trigger GCs sooner; large objects with many pointer fields that inflate scan time; sustained allocation rate forcing mark-assist onto mutator goroutines; short-lived garbage that escapes a cycle and inflates the next sweep. Most wins remove one of those four.
Reading order: Ex. 1, 2, 11, then any order. Ex. 4, 5, 12 are the ones senior reviews flag most.
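How to read gctrace: format is gc N @Ts P%: A+B+C ms clock, D+E/F/G+H ms cpu, I->J->K MB, L MB goal, ..., M P. A, B and C are wall-clock times for the three phases: stop-the-world sweep termination, concurrent mark and scan, and stop-the-world mark termination. The cpu section repeats those phases in CPU time and splits the mark phase into E/F/G: mutator assist (the bill goroutines pay in runtime/mgcmark.go:gcAssistAlloc), background mark, and idle mark. I→J→K is heap size at GC start → heap size at GC end → live heap. The trace lines in this file omit the cpu section, so the assist figures quoted next to them cannot be read off the lines shown. Rising assist means the mutator is doing GC work; longer GC intervals mean less allocation pressure.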
2. Exercise 1 — fmt.Sprintf in a hot loop
Difficulty: Junior+
A logger formats a cache key with fmt.Sprintf("%d-%d", id, n). The format machinery boxes both ints into any (one runtime.convT64 alloc each), pulls a *pp printer from the pool, walks the verb table, writes into an internal []byte, and string()-copies the result. At 200k QPS that's 600k allocs/sec just to make a map key.
BenchmarkSprintfKey-8 5000000 320 ns/op 48 B/op 3 allocs/op
gc 15 @4.34s 6%: 0.20+13+0.20 ms clock, 41->45->24 MB, 44 MB goal
GCs ~240 ms apart, mutator assist 1.5 ms. The pacer (runtime/mgcpacer.go in Go 1.23) starts each cycle earlier because the allocation rate is high.
After
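A minimal sketch of the rewrite this section describes. The helper names `appendKey` and `cacheKey` are illustrative and not taken from the original code; the key shape `"<id>-<n>"` matches the `Sprintf` call in the Before.

```go
package main

import (
	"fmt"
	"strconv"
)

// appendKey appends "<id>-<n>" to dst and returns the extended slice.
// The name and signature are illustrative, not taken from the original code.
func appendKey(dst []byte, id, n int64) []byte {
	dst = strconv.AppendInt(dst, id, 10)
	dst = append(dst, '-')
	return strconv.AppendInt(dst, n, 10)
}

// cacheKey formats into a fixed-size scratch array and pays for one string
// copy at the end. 41 bytes covers two int64 values (20 bytes each) plus '-'.
func cacheKey(id, n int64) string {
	var scratch [41]byte
	return string(appendKey(scratch[:0], id, n))
}

func main() {
	fmt.Println(cacheKey(42, 7))
}
```

Callers that already own a buffer can call `appendKey` directly and skip the string conversion altogether.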
`strconv.AppendInt` writes digits directly into a caller-owned buffer. No iface boxing, no `fmt` state, one final string copy. ~8× faster, GC interval doubled, assist halved.
**Why faster:** `fmt.Sprintf` allocates the result string plus one `any` per arg. `strconv.AppendInt` writes into the caller's buffer; the values never escape. Lower alloc rate delays the next trigger and shrinks assist (`gcAssistAlloc` runs on any goroutine that allocates while a mark phase is active and owes assist work).
**Trade-off:** The final `string(buf)` still copies. A pure map probe written as `m[string(buf)]` is already copy-free because the compiler elides the conversion for lookups; elsewhere `unsafe.String` skips the copy, but the bytes must not change while the string is in use.
**When NOT:** Cold paths, error messages, log lines where the I/O dwarfs format cost.
3. Exercise 2 — []byte(string) in a hot loop
Difficulty: Junior+
mac.Write([]byte(token)) triggers runtime.stringtoslicebyte — fresh []byte allocation plus copy on every call. At 1M ops/s the GC fires every 120 ms.
func sign(mac hash.Hash, token string) []byte {
mac.Reset()
mac.Write([]byte(token))
return mac.Sum(nil)
}
BenchmarkStringToBytes-8 10000000 140 ns/op 96 B/op 2 allocs/op
gc 29 @2.22s 8%: 0.21+18+0.20 ms clock, 52->58->30 MB, 56 MB goal
After
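A minimal sketch of the buffer-reuse variant, which does not depend on the hash exposing a `WriteString` method. The `signer` type, its fields, and the choice of HMAC-SHA256 are assumptions made for illustration; the original only shows a `hash.Hash` named `mac`.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
	"hash"
)

// signer owns one MAC and one scratch buffer. It is a hypothetical wrapper
// for this sketch and must not be shared across goroutines.
type signer struct {
	mac hash.Hash
	buf []byte
}

func newSigner(key []byte) *signer {
	return &signer{mac: hmac.New(sha256.New, key)}
}

// sign appends the MAC of token to dst and returns the extended slice.
func (s *signer) sign(dst []byte, token string) []byte {
	// Copy the token into reused storage; this stops allocating once the
	// buffer has grown to the largest token seen.
	s.buf = append(s.buf[:0], token...)
	s.mac.Reset()
	s.mac.Write(s.buf) // hash.Hash.Write never returns an error
	return s.mac.Sum(dst)
}

func main() {
	s := newSigner([]byte("demo-key"))
	fmt.Printf("%x\n", s.sign(nil, "token-1"))
}
```

Passing a reusable `dst` with enough capacity lets the caller keep the output slice off the allocation path as well.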
`io.WriteString` skips the conversion only when the writer implements `io.StringWriter`; otherwise it falls back to `Write([]byte(s))`, so check that your `hash.Hash` implementation actually has a `WriteString` method before relying on it. For APIs that only take `[]byte`, reuse a per-goroutine buffer with `append(buf[:0], s...)`. ~7× faster, zero allocations, GC interval 5× longer, assist negligible.
**Why faster:** `[]byte(s)` copies into a fresh allocation whenever the result escapes, as it does through an interface call. A `WriteString` method lets the writer read the string as-is; a reused buffer still copies but no longer allocates. Fewer allocations let the pacer trigger later, and assist drops.
**Trade-off:** A reused buffer pins the largest payload ever seen; cap with `if cap(buf) > maxIdle { buf = nil }`.
**When NOT:** When the byte slice escapes to another goroutine, because reuse becomes a race.
4. Exercise 3 — += for string concat in a loop
Difficulty: Junior+
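s += t is s = runtime.concatstring2(s, t) (concatstring3 for the three-operand form below): fresh allocation each call, O(N²) bytes copied across the loop. Each intermediate string is garbage as soon as the next iteration replaces it, but its bytes still count toward the heap trigger until a cycle sweeps them.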
func serialize(fields []string) string {
result := ""
for _, f := range fields { result += f + "," }
return strings.TrimSuffix(result, ",")
}
BenchmarkStringConcat-8 5000 280000 ns/op 480000 B/op 199 allocs/op
gc 6 @1.00s 12%: 0.40+24+0.22 ms clock, 60->68->32 MB, 64 MB goal
Mark-assist 3 ms.
After
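A minimal sketch of the `strings.Builder` version with an exact `Grow`, keeping the `serialize` signature from the Before. The pre-computed length is the sum of the field lengths plus one separator between each pair.

```go
package main

import (
	"fmt"
	"strings"
)

// serialize joins fields with commas using a single pre-sized buffer.
func serialize(fields []string) string {
	if len(fields) == 0 {
		return ""
	}
	n := len(fields) - 1 // one separator between each pair of fields
	for _, f := range fields {
		n += len(f)
	}
	var b strings.Builder
	b.Grow(n) // one allocation for the whole result
	for i, f := range fields {
		if i > 0 {
			b.WriteByte(',')
		}
		b.WriteString(f)
	}
	return b.String()
}

func main() {
	fmt.Println(serialize([]string{"id", "name", "email"}))
}
```

For this exact shape `strings.Join(fields, ",")` does the same work; the explicit Builder is the pattern to reach for when each element needs formatting on the way in.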
`strings.Builder` writes into one growing `[]byte` and returns it via `unsafe.String`, so there is no copy on `String()`. ~87× faster, 199× fewer allocations, assist drops to <1 ms.
**Why faster:** Builder grows geometrically: `log2(200) ≈ 8` allocations vs 200. With `Grow`, just one. The `+=` version allocates O(N²) bytes of string data; those intermediates are pointer-free, so they cost nothing to scan, but their volume pulls the next GC trigger forward.
**Trade-off:** Builder isn't safe to share across goroutines and must not be copied after first use (it panics). Writes after `String()` are allowed and never modify the string already returned.
**When NOT:** Single concat. Constant-shape `fmt.Sprintf` for occasional logging.
5. Exercise 4 — make([]byte, n) per request
Difficulty: Middle
A decoder allocates a fresh buffer per message. Average 4 KB, spikes to 64 KB; 50k QPS = ~200 MB/sec of garbage. The 64 KB buffers go through mheap (large object path in runtime/malloc.go:mallocgc), bypassing per-P caches.
func handleConn(c net.Conn) {
for {
var hdr [4]byte
if _, err := io.ReadFull(c, hdr[:]); err != nil { return }
n := binary.BigEndian.Uint32(hdr[:])
buf := make([]byte, n)
if _, err := io.ReadFull(c, buf); err != nil { return }
process(buf)
}
}
BenchmarkPerReqBuffer-8 20000 62000 ns/op 65536 B/op 1 allocs/op
gc 33 @1.30s 18%: 0.6+38+0.24 ms clock, 124->148->62 MB, 130 MB goal
GC every 100 ms, assist 5 ms — mutators pay 5 ms of GC work per cycle.
After
`sync.Pool` of byte buffers. Per-P pool slots are O(1) reads (`sync/pool.go`), and the pool is auto-drained by `runtime/mgc.go:clearpools` so it can't grow without bound.
var bufPool = sync.Pool{ New: func() any { b := make([]byte, 0, 8192); return &b } }
func getBuf(n int) *[]byte {
bp := bufPool.Get().(*[]byte)
if cap(*bp) < n { b := make([]byte, n); return &b }
*bp = (*bp)[:n]
return bp
}
func putBuf(bp *[]byte) {
if cap(*bp) > 1<<20 { return }
*bp = (*bp)[:0]
bufPool.Put(bp)
}
6. Exercise 5 — Pointer-heavy struct inflating scan time
Difficulty: Middle
A cache entry has 8 *string fields (nullable text columns). 5M entries = 40M pointers for runtime/mgcmark.go:scanobject to walk each mark phase. The pointer-mask bitmap (runtime/type.go's gcdata) records per pointer-sized word, not per byte, whether a slot holds a pointer: each pointer slot costs a dereference, while scalar words are skipped.
type Entry struct {
Title, Description, Author, Source *string
Tag1, Tag2, Tag3, Tag4 *string
CreatedAt, UpdatedAt int64
}
var cache = make(map[int64]*Entry, 5_000_000)
Mark phase 220 ms. CPU time 440 ms across 8 Ps — ~1.5% baseline lost to scanning.
After
Replace `*string` with `string` for usually-populated fields; for tags use IDs into an intern table (`[4]uint32`, no pointers, skipped in one bitmap step). Mark phase ~45 ms (~5× faster), heap 25% smaller.
**Why faster:** `scanobject` skips runs of non-pointer fields via the pointer-mask bitmap. 4 pointers per entry × 5M = 20M dereferences eliminated. String headers still hold a data pointer, but the pointer slots per entry drop from 8 (`*string` fields, each leading to a separately allocated string header with its own data pointer) to 4 (the data pointers of the inline strings).
**Trade-off:** Intern table adds lookup cost on read. Nullable semantics move to a `NullMask`, which is easy to forget on write. Empty string is now indistinguishable from "no value" without the mask.
**When NOT:** Small heaps (<100 MB) where mark phase is already milliseconds. Schemas with reflect-driven code that special-cases `*string` for nullability.
7. Exercise 6 — Map of pointers when values would do
Difficulty: Middle
map[string]*Metadata where Metadata is a 48-byte struct (two int64 plus two 16-byte string headers on amd64). Storing pointers costs: an extra alloc per insert, one pointer slot per map entry to the metadata on top of the 2 string pointers inside it, and an indirection on every read.
type Metadata struct {
SizeBytes, ModTime int64
Owner, MIME string
}
var cache = make(map[string]*Metadata, 1_000_000)
BenchmarkMapOfPointers-8 20000000 62 ns/op 0 B/op 0 allocs/op
gc 12 @30s 0%: 0.4+120+0.4 ms clock, 200->200->198 MB, 220 MB goal
After
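A minimal sketch of the value-typed map, including the read-modify-write that replaces in-place mutation. `setOwner` is a hypothetical helper added for illustration.

```go
package main

import "fmt"

type Metadata struct {
	SizeBytes, ModTime int64
	Owner, MIME        string
}

// setOwner updates one field with an explicit read-modify-write, because
// m[key].Owner = owner does not compile for a map of struct values.
func setOwner(m map[string]Metadata, key, owner string) bool {
	md, ok := m[key]
	if !ok {
		return false
	}
	md.Owner = owner
	m[key] = md // store the modified copy back
	return true
}

func main() {
	cache := make(map[string]Metadata)
	cache["a.txt"] = Metadata{SizeBytes: 12, Owner: "root", MIME: "text/plain"}
	setOwner(cache, "a.txt", "alice")
	fmt.Println(cache["a.txt"].Owner)
}
```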
Store the value inline. Go map buckets (`runtime/map.go:bmap`) hold values directly when small enough. ~1.5× faster lookup, mark phase 25% faster.
**Why faster:** Without the `*Metadata` slot, the bucket layout has 1M fewer pointers for `scanobject` to walk. Cache locality also improves: hit lands in the bucket cache line, no remote-heap fetch.
**Trade-off:** Map values are copied on assignment and read. For values >128 bytes the copy cost flips the equation. Mutation of a returned value doesn't update the map (`cache[k].Owner = "x"` is a compile error). Concurrent reads+writes still need a lock or `sync.Map`.
**When NOT:** Large values (>128 B). Values mutated in place by callers. Schemas where the pointer let callers share state.
8. Exercise 7 — []interface{} boxing in a hot loop
Difficulty: Middle
append(out, intVal) where out is []interface{} calls runtime.convT64 — an 8-byte heap object + iface header per int. 1M pushes/sec = 1M tiny allocs/sec.
type Pipeline struct{ out []interface{} }
func (p *Pipeline) Push(v int64) { p.out = append(p.out, v) }
BenchmarkIfacePush-8 20000000 62 ns/op 16 B/op 1 allocs/op
gc 8 @1.50s 9%: 0.30+22+0.20 ms clock, 56->62->28 MB, 60 MB goal
After
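A minimal sketch of the typed replacement, written as a generic `Pipeline[T]` so the same code serves more than one element type. Making `Pipeline` generic is one option among several; a plain `[]int64` field works equally well when only one type is needed.

```go
package main

import "fmt"

// Pipeline stores values of one concrete type, so Push never boxes.
type Pipeline[T any] struct{ out []T }

func (p *Pipeline[T]) Push(v T) { p.out = append(p.out, v) }

func main() {
	var p Pipeline[int64]
	for i := int64(0); i < 3; i++ {
		p.Push(i)
	}
	fmt.Println(p.out)
}
```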
Typed slice. If several element types need the same pipeline, make it generic. ~15× faster, allocations gone except for slice growth, GC interval 5× longer.
**Why faster:** `runtime/iface.go`'s `convT64` allocates a heap word per int. A generic `Pipeline[T]` instantiated with `int64` stores a plain `[]int64`, eliminating the box (Go specializes generic code per GC shape, and `int64` gets its own). The compiler can also emit tight loop code on a primitive slice.
**Trade-off:** Generics inflate binary size per instantiation. `any`-typed APIs read nicely, but their boxing cost is easy to miss until profiled.
**When NOT:** Genuinely heterogeneous slices (YAML config values). Low call rates where boxing is invisible.
9. Exercise 8 — time.After in a select loop
Difficulty: Middle
time.After allocates a fresh *Timer and channel each call, plus a runtime timer-heap entry (see runtime/time.go). In a select loop firing 10k/s this is 30k allocs/sec.
func worker(ch <-chan Task, timeout time.Duration) {
for {
select {
case t := <-ch: process(t)
case <-time.After(timeout): heartbeat()
}
}
}
BenchmarkTimeAfter-8 2000000 640 ns/op 192 B/op 3 allocs/op
gc 14 @5.00s 4%: 0.20+14+0.20 ms clock, 40->44->24 MB, 44 MB goal
After
One `*Timer` outside the loop, `Reset` per iteration. ~10× faster, zero allocs, GC interval 4× longer.
**Why faster:** The runtime timer (`runtime/time.go`) keeps a per-P heap, and new timers insert in O(log N). Before Go 1.23 an unfired `time.After` timer and its channel stayed reachable from that heap until the timer fired; Go 1.23 lets unreferenced timers be collected immediately, but every call still allocates them. `Reset` reuses both with an in-place heap update.
**Trade-off:** The drain-on-reset dance is one of Go's gnarliest API edges. Go 1.23 relaxed it (the timer channel no longer delivers stale values after `Stop` or `Reset`), but an explicit non-blocking drain stays correct across versions.
**When NOT:** One-shot timeouts. Tests where the timer fires before the goroutine exits.
10. Exercise 9 — Reading whole file into memory
Difficulty: Junior+
os.ReadFile on a 200 MB log: one 200 MB byte allocation + a 200 MB string copy + one []string holding a 16-byte header per line. Mark phase has to scan all line headers; peak RSS ~850 MB.
func countErrors(path string) (int, error) {
data, err := os.ReadFile(path)
if err != nil { return 0, err }
lines := strings.Split(string(data), "\n")
n := 0
for _, line := range lines {
if strings.Contains(line, "ERROR") { n++ }
}
return n, nil
}
BenchmarkReadAll-8 50 420000000 ns/op 420000000 B/op 2000001 allocs/op
gc 3 @0.95s 22%: 0.6+200+0.30 ms clock, 420->440->420 MB, 440 MB goal
Mark-assist 5 ms; peak RSS ~850 MB.
After
`bufio.Scanner` streams. `sc.Bytes()` returns a view into the scanner's buffer, so there is no per-line alloc. ~11× faster, ~5800× fewer allocations, peak RSS ~25 MB.
**Why faster:** Streaming caps live bytes. `sc.Bytes()` returns a view, no allocation. The GC never sees the 200 MB buffers or the multi-million-entry `[]string` whose line headers dominate a mark phase.
**Trade-off:** `sc.Bytes()` is reused across `Scan()` calls, so retaining it across iterations is a bug. Lines beyond `Buffer` cap fail; tune for outliers.
**When NOT:** Small files (<1 MB) where reading-all is simpler. Code needing random line access.
11. Exercise 10 — json.Marshal building intermediate buffers
Difficulty: Middle
json.Marshal builds the full output in an internal bytes.Buffer, copies to a result []byte, returns. For a 500 KB payload, peak heap ~1 MB (tree + buffer + return).
func writeJSON(w http.ResponseWriter, payload any) {
data, err := json.Marshal(payload)
if err != nil { http.Error(w, err.Error(), http.StatusInternalServerError); return }
w.Header().Set("Content-Type", "application/json")
w.Write(data)
}
BenchmarkMarshalCopy-8 1000 1200000 ns/op 980000 B/op 2400 allocs/op
gc 18 @2.00s 11%: 0.40+28+0.22 ms clock, 80->110->48 MB, 90 MB goal
After
`json.NewEncoder(w).Encode(payload)` writes to the writer without handing back a result slice. The encoder still builds the whole document in an internal buffer, but that buffer is recycled through a `sync.Pool`, and the extra copy that `Marshal` makes into a fresh `[]byte` disappears. ~3× faster, ~3× fewer allocations, GC interval 2.5× longer.
**Why faster:** No freshly allocated document-sized `[]byte` is created per request. Peak resident heap drops, and the pacer sees lower allocation pressure per request.
**Trade-off:** `Encode` appends a trailing newline by design. Errors discovered after the write has started leave partial JSON on the writer; buffer if rollback matters.
**When NOT:** Tiny payloads (<4 KB). Code retrying the same payload on transient errors, which should buffer first.
12. Exercise 11 — Per-request closure capture
Difficulty: Middle
A closure passed to a callback captures the request-scoped logger. Because the closure escapes through the callback API, cmd/compile/internal/escape marks the captured variables heap-allocated.
func handle(req Req, logger *Logger) error {
return forEachItem(req.Items, func(item Item) error {
logger.Info("processing", "id", item.ID)
return process(item)
})
}
BenchmarkClosureCapture-8 5000000 240 ns/op 64 B/op 2 allocs/op
gc 22 @3.00s 7%: 0.30+18+0.22 ms clock, 60->68->32 MB, 64 MB goal
After
Pass dependencies through arguments. The function value becomes a static code pointer with no captured environment.
func handle(req Req, logger *Logger) error {
return forEachItemCtx(logger, req.Items, processOne)
}
func processOne(logger *Logger, item Item) error {
logger.Info("processing", "id", item.ID)
return process(item)
}
func forEachItemCtx[C, T any](ctx C, items []T, fn func(C, T) error) error {
for _, it := range items { if err := fn(ctx, it); err != nil { return err } }
return nil
}
13. Exercise 12 — Large struct passed by value
Difficulty: Middle
process(cfg Config, item Item) with a 456-byte Config (on amd64, roughly half a kilobyte) copies the whole struct on every call. In a hot loop over 100k items, runtime.memmove dominates the CPU profile.
type Config struct {
Endpoints [16]string
Timeouts [16]int64
Flags uint64
AuthHeader, UserAgent, Region, Tenant string
}
func process(cfg Config, item Item) error { /* read cfg */ }
CPU profile: runtime.memmove near the top. GC quiet — but each call frame is huge.
After
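A minimal sketch of the pointer-passing version. The `Item` fields, the region check inside `process`, and the `processAll` driver are placeholders invented for illustration; only `Config` comes from the Before.

```go
package main

import (
	"fmt"
	"unsafe"
)

type Config struct {
	Endpoints                             [16]string
	Timeouts                              [16]int64
	Flags                                 uint64
	AuthHeader, UserAgent, Region, Tenant string
}

// Item is a hypothetical work unit for this sketch.
type Item struct {
	ID     int
	Region string
}

// process reads cfg through a pointer; it must treat cfg as read-only.
func process(cfg *Config, item Item) error {
	if item.Region != cfg.Region {
		return fmt.Errorf("item %d: region %q is not served by %q", item.ID, item.Region, cfg.Region)
	}
	return nil
}

// processAll shares one Config across the whole batch and counts failures.
func processAll(cfg *Config, items []Item) (failed int) {
	for _, it := range items {
		if err := process(cfg, it); err != nil {
			failed++
		}
	}
	return failed
}

func main() {
	cfg := &Config{Region: "eu"}
	items := []Item{{ID: 1, Region: "eu"}, {ID: 2, Region: "us"}}
	fmt.Println("struct bytes:", unsafe.Sizeof(*cfg)) // 456 on 64-bit platforms
	fmt.Println("failed:", processAll(cfg, items))
}
```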
Pass by pointer. The struct stays in one place; the callee gets an 8-byte word. ~5× faster.
**Why faster:** One 8-byte word is passed instead of copying 57 words (a copy of roughly half a kilobyte costs dozens of cycles). Cache locality improves: `*Config` stays in L1 across the batch.
**Trade-off:** Callee can mutate; document immutability. For concurrent hot-reload, `atomic.Pointer[Config]`.
**When NOT:** Small structs (≤16 B), where copy beats indirection. Value semantics matter for the design (each call sees an immutable snapshot).
14. Exercise 13 — append without capacity hint
Difficulty: Junior+
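var ids []int64; for ... { ids = append(ids, x) } triggers runtime/slice.go:growslice 14 times for N=10k: doubling while the slice is small, then a growth factor that tapers toward 1.25× (the switchover has been at 256 elements since Go 1.18; earlier releases doubled up to 1024). Each growth allocates and copies; intermediates become garbage.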
func collectIDs(items []Item) []int64 {
var ids []int64
for _, it := range items {
if it.Active { ids = append(ids, it.ID) }
}
return ids
}
BenchmarkAppendNoHint-8 10000 240000 ns/op 420000 B/op 14 allocs/op
gc 10 @2.50s 6%: 0.20+15+0.22 ms clock, 48->54->28 MB, 52 MB goal
After
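A minimal sketch of the pre-sized version, keeping the `collectIDs` signature from the Before. The `Item` fields are assumed from how the Before uses them. `len(items)` is a safe upper bound because at most every item is active.

```go
package main

import "fmt"

type Item struct {
	ID     int64
	Active bool
}

// collectIDs allocates once, sized for the worst case where every item
// is active.
func collectIDs(items []Item) []int64 {
	ids := make([]int64, 0, len(items))
	for _, it := range items {
		if it.Active {
			ids = append(ids, it.ID)
		}
	}
	return ids
}

func main() {
	items := []Item{{ID: 1, Active: true}, {ID: 2}, {ID: 3, Active: true}}
	ids := collectIDs(items)
	fmt.Println(ids, len(ids), cap(ids))
}
```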
Pre-size with `make([]T, 0, n)`. ~4× faster, 14× fewer allocations, GC interval 3× longer.
**Why faster:** One alloc instead of 14. Total bytes copied during growth is ~2× the final size; pre-allocation skips it. Even with a loose upper bound (50% fill rate), one oversize allocation beats 14 growths.
**Trade-off:** Overestimate wastes memory in the returned slice. `slices.Clip` only caps the capacity and frees nothing; to release the excess, copy into a right-sized slice with `slices.Clone`.
**When NOT:** N small (<32), where growth is rare and cheap. N unboundable.
15. Exercise 14 — defer in a hot loop
Difficulty: Middle
for _, it := range items { ... defer r.Close() ... } accumulates defer records (runtime/runtime2.go:_defer) on the heap, one per iteration, and they all run at function return, exhausting file descriptors meanwhile. Go 1.14 introduced open-coded defers, but only for defer statements outside loops; a defer inside a loop still takes the heap-record path.
func processAll(items []Item) error {
for _, it := range items {
r, err := open(it)
if err != nil { return err }
defer r.Close() // accumulates!
if err := process(r); err != nil { return err }
}
return nil
}
BenchmarkDeferInLoop-8 10000 320000 ns/op 480000 B/op 10000 allocs/op
gc 14 @2.00s 8%: 0.30+22+0.22 ms clock, 56->64->32 MB, 60 MB goal
After
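A hedged sketch of the helper-function restructuring. The `resource` type and the `live`/`peak` counters are hypothetical stand-ins for real file handles, added so the demo can show that at most one resource is open at a time; the named return propagates a `Close` error when nothing else failed.

```go
package main

import (
	"errors"
	"fmt"
)

type Item struct{ Name string }

// resource stands in for a file or connection (hypothetical).
type resource struct{ name string }

// live and peak track how many resources are open; demo bookkeeping only.
var live, peak int

func open(it Item) (*resource, error) {
	if it.Name == "" {
		return nil, errors.New("empty name")
	}
	live++
	if live > peak {
		peak = live
	}
	return &resource{name: it.Name}, nil
}

func (r *resource) Close() error { live--; return nil }

func process(r *resource) error { return nil }

// processOne opens, processes and closes a single item. The defer sits in
// a function body without a loop, and it runs before the next item opens.
func processOne(it Item) (err error) {
	r, err := open(it)
	if err != nil {
		return err
	}
	defer func() {
		if cerr := r.Close(); err == nil {
			err = cerr // surface Close errors unless process already failed
		}
	}()
	return process(r)
}

func processAll(items []Item) error {
	for _, it := range items {
		if err := processOne(it); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	err := processAll([]Item{{Name: "a"}, {Name: "b"}, {Name: "c"}})
	fmt.Println("err:", err, "peak open:", peak, "still open:", live)
}
```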
Restructure: per-item defer in a helper function. The compiler open-codes defers when no `defer` sits inside a loop, the function has at most 8 defers, and the number of return statements times the number of defers is at most 15. ~100× faster, zero allocations, resources close promptly.
**Why faster:** Open-coded defer (Go 1.14+) compiles into inline epilogue code at function exit, with no heap record. Per-iteration close also bounds the live FD set.
**Trade-off:** Restructuring requires a helper function. `Close` errors are normally swallowed; use a named return + deferred wrapper if you need them.
**When NOT:** When the loop body genuinely needs all resources open across iterations (very rare, usually a design smell).
16. Exercise 15 — errors.New per call
Difficulty: Junior+
errors.New("invalid") allocates an *errorString on the heap each call. At 100k QPS × 1% rejection = 1k error allocs/sec.
func validate(in Input) error {
if in.Name == "" { return errors.New("name required") }
if in.Age < 0 { return errors.New("age must be non-negative") }
return nil
}
BenchmarkErrorsNew-8 50000000 24 ns/op 16 B/op 1 allocs/op
gc 8 @5.00s 1%: 0.10+6+0.20 ms clock, 26->28->18 MB, 36 MB goal
After
Package-level sentinel: one allocation at init. ~13× faster, zero allocations, GC interval 6× longer.
**Why faster:** Sentinel is a static `*errorString`. Returning it copies a two-word iface header. `errors.New` always allocates. Bonus: `errors.Is(err, ErrNameRequired)` becomes a pointer compare.
**Trade-off:** Sentinel messages can't carry per-call context. Wrap with `fmt.Errorf("got %d: %w", val, ErrSentinel)` only when the value matters; that wrap still allocates, but only on the path that needs it.
**When NOT:** Errors carrying genuine per-call structured data (validation field list). Cold paths where readability beats nanoseconds.
17. When NOT to optimize
GC pressure dominates a profile only when (a) allocation rate × live-set triggers GC frequently and (b) mark-assist or STW contributes meaningfully to your latency budget. A CLI tool allocating 100 MB once at startup pays one GC; nothing here helps.
Profile first. GC overhead has four signatures in CPU and gctrace output:
- runtime.mallocgc near top of CPU profile → Ex. 1, 4, 5, 10, 13 (eliminate allocations).
- runtime.gcAssistAlloc or rising assist times in gctrace → Ex. 2, 4, 13 (cut alloc rate).
- runtime.scanobject dominating mark phase → Ex. 5, 6 (reduce pointer slots).
- Short GC interval in gctrace despite small live set → high allocation rate, see Ex. 1–4.
Common premature optimizations: sync.Pool (Ex. 4) for once-per-minute allocs; hand-unrolled keys (Ex. 1) on 1k-QPS endpoints; pre-sized slices (Ex. 13) for N=3; flattening *string (Ex. 5) on a struct with 10 instances total.
Correctness gaps disguised as optimizations: pool of buffers (Ex. 4) without Reset — next caller sees previous payload; reused timer (Ex. 8) without drain — phantom fires; map of values (Ex. 6) where callers mutated the returned struct expecting it to update the map; bufio.Scanner slice retained across iterations (Ex. 9) — buffer overwrites; defer moved into a helper (Ex. 14) silently swallowing Close errors; sentinel errors (Ex. 15) where two call sites needed distinguishable messages; pointer-field removal (Ex. 5) where nil semantically meant "absent" — null vs zero confusion.
GC tuning escape hatches. When code-level fixes are exhausted: GOMEMLIMIT (Go 1.19+) sets a soft limit on total runtime memory, making the GC run more often as the limit approaches to avoid OOM; GOGC=200 doubles the heap-growth target, trading memory for CPU; runtime/debug.SetGCPercent changes GOGC at run time. Read runtime/mgcpacer.go for the pacing math behind gctrace lines.
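The same knobs can be set from code through `runtime/debug`. The sketch below is illustrative only: `tuneGC` is a hypothetical helper, and the values 200 and 1 GiB are placeholders rather than recommendations.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// tuneGC applies a GOGC percentage and a soft memory limit, returning the
// previous settings so the caller can restore them.
func tuneGC(percent int, limitBytes int64) (prevPercent int, prevLimit int64) {
	prevPercent = debug.SetGCPercent(percent)    // same effect as GOGC
	prevLimit = debug.SetMemoryLimit(limitBytes) // same effect as GOMEMLIMIT
	return prevPercent, prevLimit
}

func main() {
	prevPercent, prevLimit := tuneGC(200, 1<<30)
	fmt.Println("previous GOGC:", prevPercent, "previous limit:", prevLimit)
	tuneGC(prevPercent, prevLimit) // restore
}
```

Both setters return the value they replaced, which makes temporary overrides around a batch job straightforward to undo.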
18. Summary
Always-ship wins (apply by default in any hot-path code):
- strconv.AppendInt over fmt.Sprintf for numeric keys (Ex. 1).
- io.WriteString / append(buf[:0], s...) over []byte(s) in tight loops (Ex. 2).
- strings.Builder over += for concat (Ex. 3).
- bufio.Scanner for line-oriented file processing (Ex. 9).
- json.NewEncoder(w).Encode for response writers (Ex. 10).
- Pass dependencies through arguments, not closures, in hot callbacks (Ex. 11).
- Pre-size slices with make([]T, 0, n) when N is bounded (Ex. 13).
- Package-level sentinel errors for hot-path rejection (Ex. 15).
- Restructure to keep defer out of hot loops (Ex. 14).
- Reuse time.Timer with Reset in worker loops (Ex. 8).
Wins behind a profile (when measurements justify them):
- sync.Pool of byte buffers when alloc rate is documented (Ex. 4).
- Map of values over map of pointers for read-mostly small structs (Ex. 6).
- Generics/typed slices over []interface{} boxing (Ex. 7).
- Pointer pass for large structs (Ex. 12).
- Flatten pointer-heavy structs in large caches (Ex. 5).
Specialty (only when the design calls for it):
- Custom arena per request for batch parsers with millions of small objects.
- String interning to fold repeated text into IDs.
- GOMEMLIMIT + GOGC tuning for steady-state services with predictable working set.
- runtime.SetFinalizer: avoid unless wrapping CGo handles; finalizers extend lifetimes across two GC cycles.
- unsafe.String / unsafe.SliceData for zero-copy at boundaries where lifetime is provable.
GC pressure on the hot path comes from four mechanics living in runtime/mgc.go and friends: every allocation moves the heap closer to the next trigger (computed by the pacer in runtime/mgcpacer.go); every live pointer in a marked object costs scanobject time; sustained alloc rate forces mark-assist onto mutator goroutines (gcAssistAlloc); short-lived garbage that escapes a cycle inflates the next sweep. Strip allocations from the hot path, flatten pointer-heavy structs that live in big caches, pool the unavoidable churn, and the four mechanics quiet down together. Profile, then apply the lever the trace points to: the four signatures above tell you which one.