Goroutine Stack Growth — Optimization¶
Table of Contents¶
- Introduction
- Establish a Baseline
- Optimization 1 — Convert Recursion to Iteration
- Optimization 2 — Move Large Locals to the Heap
- Optimization 3 — Use sync.Pool for Per-Goroutine Buffers
- Optimization 4 — Worker Pool Instead of Per-Task Goroutines
- Optimization 5 — Pre-Grow Long-Lived Worker Stacks
- Optimization 6 — Cap MaxStack to Fail Fast
- Optimization 7 — Right-Size Channel Buffers
- Optimization 8 — Pprof-Guided Cuts
- Optimization 9 — Inlining Hot Helpers
- When NOT to Optimize
Introduction¶
Stack growth is amortised cheap, but on hot paths it shows up. Each section is a real optimization with concrete before/after, expected wins, and how to measure. The order is by impact: convert-recursion-to-iteration usually wins biggest; inlining hot helpers wins least but is sometimes worth it.
Profile first. Optimize only when pprof points at stack growth as a measurable cost.
Establish a Baseline¶
Before any optimization, measure:
package main

import (
	"fmt"
	"net/http"
	_ "net/http/pprof"
	"runtime"
)

func main() {
	go http.ListenAndServe("localhost:6060", nil)
	// ... your workload ...
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Goroutines: %d\n", runtime.NumGoroutine())
	fmt.Printf("StackInuse: %d KB\n", m.StackInuse/1024)
	fmt.Printf("StackSys: %d KB\n", m.StackSys/1024)
}
Capture a CPU profile:
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
Inside pprof:
(pprof) top 10
If runtime.morestack_noctxt or runtime.newstack appears in the top 10, stack growth is costing measurable CPU. If not, skip stack-related optimizations and look at other hotspots.
Optimization 1 — Convert Recursion to Iteration¶
When: A recursive function is on a hot path, especially if depth can be large.
Expected gain: Often 20-50% for deeply recursive workloads. Eliminates morestack from pprof.
Before¶
func walk(n *Node, visit func(int)) {
	if n == nil {
		return
	}
	visit(n.Value)
	walk(n.Left, visit)
	walk(n.Right, visit)
}
After¶
func walk(root *Node, visit func(int)) {
	if root == nil {
		return
	}
	stack := make([]*Node, 0, 64)
	stack = append(stack, root)
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		visit(n.Value)
		if n.Right != nil {
			stack = append(stack, n.Right)
		}
		if n.Left != nil {
			stack = append(stack, n.Left)
		}
	}
}
Why it's faster¶
- No prologue check per call.
- Goroutine stack stays at minimum 2 KB.
- Slice grows by doubling (amortised cheap) on the heap; one heap allocation amortises across many appends.
- For long chains (depth > 50), no stack growth events.
How to measure¶
Benchmark both with go test -bench on a 10,000-node tree. Look at:
- ns/op (lower is better).
- allocs/op (iterative uses fewer; just the slice).
- B/op.
Add a pprof CPU profile to confirm no morestack in the iterative version.
Optimization 2 — Move Large Locals to the Heap¶
When: A function with a >1 KB local array runs in many short-lived goroutines.
Expected gain: Each fresh goroutine avoids growth from 2 KB → 4 KB → 8 KB.
Before¶
func handle(conn net.Conn) {
	var buf [8192]byte // 8 KB local — triggers growth
	for {
		n, err := conn.Read(buf[:])
		if err != nil {
			return // includes io.EOF; without this the loop spins forever
		}
		// ... process buf[:n] ...
	}
}
A fresh goroutine starts at 2 KB. The buf makes the frame ~8 KB. Stack must grow.
After¶
func handle(conn net.Conn) {
	buf := make([]byte, 8192) // heap-allocated
	for {
		n, err := conn.Read(buf)
		if err != nil {
			return
		}
		// ... process buf[:n] ...
	}
}
buf is on the heap (one allocation per connection). The goroutine's frame stays small. No stack growth.
Trade-off¶
You added one heap allocation per connection, which GC must eventually reclaim. For short-lived goroutines this is a wash. For long-lived connections (read in a loop), it's a one-time cost worth paying.
When the trade-off favours stack¶
If the goroutine is long-lived and the buffer is used millions of times, the post-growth stack version is cheaper because there's no GC overhead. Use sync.Pool (next optimization) to keep both worlds.
Optimization 3 — Use sync.Pool for Per-Goroutine Buffers¶
When: A per-task scratch buffer is used by many goroutines, each briefly.
Expected gain: Eliminate both heap-alloc-per-task and stack-growth-per-task.
Before¶
func process(in []byte) []byte {
	var scratch [16 * 1024]byte // stack local, 16 KB
	// ... write to scratch ...
	return append([]byte(nil), scratch[:len(in)]...)
}
Every call: stack growth (16 KB > 2 KB initial), plus a heap allocation for the return.
After¶
var scratchPool = sync.Pool{
	New: func() any {
		return make([]byte, 16*1024)
	},
}

func process(in []byte) []byte {
	scratch := scratchPool.Get().([]byte)
	defer scratchPool.Put(scratch[:cap(scratch)])
	// ... write to scratch ...
	out := make([]byte, len(in))
	copy(out, scratch[:len(in)])
	return out
}
The pool amortises the 16 KB allocation across calls. Goroutine stack stays small. Only the returned out is allocated per call. (One caveat: putting a bare []byte into a sync.Pool allocates a slice header on every Put; pooling a *[]byte, as staticcheck's SA6002 recommends, avoids even that.)
Gotcha: pool growth¶
sync.Pool per-P caches mean each P holds its own buffer. With high concurrency you may allocate hundreds of buffers. Tune the buffer size; if 4 KB is enough for 99% of calls, use 4 KB and grow only for outliers.
When NOT to use sync.Pool¶
- For tiny objects (< 1 KB). The pool overhead can exceed the alloc savings.
- For objects holding pointers — GC scans them anyway.
- For values you need to be zeroed — pools may return non-zero buffers; you must zero them.
Optimization 4 — Worker Pool Instead of Per-Task Goroutines¶
When: A high-rate stream of tasks, each modest in size.
Expected gain: Eliminates per-task goroutine creation and stack growth.
Before¶
Each task gets a fresh 2 KB goroutine. Process grows the stack. Total growths = number of tasks.
After¶
func handleStream(tasks chan Task) {
	workers := runtime.GOMAXPROCS(0) * 2
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range tasks {
				process(task)
			}
		}()
	}
	wg.Wait()
}
A fixed pool of workers. Each worker's stack grows once (during the first heavy task) and stays grown. Subsequent tasks reuse the warmed-up stack.
Trade-off¶
- Loses some parallelism if tasks are wildly different sizes (head-of-line blocking in worker channels).
- Adds channel send/receive cost.
- More complex code.
When to choose¶
- CPU-bound tasks where you don't want more goroutines than cores.
- Tasks of similar size.
- Sustained throughput more important than burst tolerance.
When to keep per-task goroutines:
- I/O-bound tasks with random latency.
- Bursty workloads.
- Code simplicity is more important than peak throughput.
Optimization 5 — Pre-Grow Long-Lived Worker Stacks¶
When: Latency-sensitive long-lived workers, where the first few requests pay growth tax.
Expected gain: Eliminates tail-latency spikes from stack growth.
Code¶
var stackSink byte

//go:noinline
func warmupStack() {
	// Force the stack to grow to ~32 KB by allocating a large local.
	var pad [30 * 1024]byte
	for i := 0; i < len(pad); i += 1024 {
		pad[i] = byte(i)
	}
	stackSink = pad[0] // keep pad observable so the compiler cannot elide it
}

func worker(tasks <-chan Task) {
	warmupStack()
	for task := range tasks {
		process(task)
	}
}
Why it works¶
warmupStack forces an early stack growth before any tasks arrive. Subsequent calls run on the larger stack. Latency variance drops.
When this helps¶
- p99 / p99.9 latency budgets near the cost of one stack growth (5-20 μs).
- A few "first request" spikes you can't tolerate.
When this doesn't help¶
- Throughput-bound workloads — total work is the same.
- Per-request goroutine model — each is cold anyway.
Optimization 6 — Cap MaxStack to Fail Fast¶
When: Service handles untrusted input that might cause deep recursion.
Expected gain: Process dies in 5 ms instead of 5 seconds on attack. Less memory consumed before death.
Code¶
Cap the per-goroutine stack at 64 MB instead of the default 1 GB using runtime/debug.SetMaxStack.
Why it's an optimization¶
- Memory safety: an attacker triggering recursion can consume only 64 MB, not 1 GB.
- Fast failure: the process dies before slowing other operations.
- Easier debugging: at 64 MB you see "stack overflow" in seconds, not minutes.
How to choose the cap¶
- If your recursion is bounded to depth N with frame size F, cap should be > N × F with some headroom.
- For typical web services, 16-64 MB is plenty.
- For known-deep recursion (e.g., a compiler), set higher.
Production tip¶
Combine with a process supervisor that restarts on crash (systemd, k8s). A stack-overflow attack now degrades into a restart loop, which alerting systems flag.
Optimization 7 — Right-Size Channel Buffers¶
When: Goroutines park on channels, holding their stacks.
Expected gain: Reduce StackSys when many parked goroutines.
Before¶
results := make(chan Result) // unbuffered
for i := 0; i < N; i++ {
	go func() {
		r := work()
		results <- r // blocks until consumed
	}()
}
If consumption is slow, N goroutines pile up, all parked, all holding their (possibly grown) stacks.
After¶
results := make(chan Result, N) // fully buffered
for i := 0; i < N; i++ {
	go func() {
		r := work()
		results <- r // returns immediately
	}()
}
Each goroutine returns immediately after sending, so its stack is freed. Only the buffered channel holds the results in heap memory.
Trade-off¶
You pay for buffer memory upfront. For large N this can dominate. Pick a buffer size matched to consumption rate, not total task count.
Optimization 8 — Pprof-Guided Cuts¶
When: You've exhausted obvious optimizations and pprof still shows growth.
Expected gain: Variable. Depends on what pprof identifies.
Workflow¶
- Capture CPU profile during representative load.
go tool pprof -alloc_objects http://...for allocation profile too.- Look for
runtime.morestack_noctxtandruntime.newstackin the top. - Follow the callers chain. Pprof's
tracescommand shows callers.
This lists every code path that triggered growth, with cumulative time.
- For each path, look at the calling function. Is it:
- A large frame? → heap-allocate the locals.
- A recursion? → iterate.
- A per-task spawn? → use a pool.
Example¶
Profile output:
Trace:
runtime.morestack_noctxt
caller: encoding/json.(*decodeState).object
caller: encoding/json.(*decodeState).value
caller: encoding/json.(*decodeState).array
Action: switch from encoding/json to a parser with smaller frames (e.g., jsoniter or fastjson).
Optimization 9 — Inlining Hot Helpers¶
When: A tight loop calls a small helper that itself triggers stack checks.
Expected gain: Marginal (saves 2-3 cycles per call) — only worth it when the helper is invoked billions of times.
Code¶
func clamp(x, lo, hi int) int {
	if x < lo {
		return lo
	}
	if x > hi {
		return hi
	}
	return x
}
Go has no //go:inline directive; only //go:noinline exists. The compiler inlines small functions automatically based on a cost heuristic, and a helper this small is inlined by default. Use -gcflags="-m=2" to see what was inlined and why anything was not.
Trade-off¶
- An inlined function shares the caller's frame — adds to its size.
- A larger caller frame may trigger growth that the smaller version wouldn't.
When this helps¶
- Helpers called millions of times per second.
- Tight inner loops.
- Functions small enough that inlining is a net win.
Use go test -bench to verify. Without a measurable improvement, don't bother.
When NOT to Optimize¶
Resist stack-related optimization when:
- morestack is not in the top of pprof. Stack growth is amortised cheap; if it's not measurable, no win is available.
- The recursion is bounded and shallow. Walking a balanced binary tree of 1M nodes is depth 20. Recursion is fine.
- The code is already clear. A 5% improvement that doubles complexity is a net negative.
- You haven't measured. Always benchmark before and after. Speculative optimizations often regress.
- The bottleneck is elsewhere. Database, network I/O, GC. Stack growth is rarely the dominant cost.
A pragmatic checklist¶
Optimize stack growth if:
- Pprof shows morestack / newstack in the top 20.
- You can measure a latency or throughput win.
- The optimization is local and reviewable.
- The before-state has a clear bug (unbounded recursion, large stack locals × many goroutines, leaks).
Skip stack-growth optimization if:
- The workload comfortably fits memory and CPU budgets.
- Code clarity matters more than micro-improvements.
- You haven't established a baseline.
Summary¶
Optimizations in order of impact:
- Convert recursion to iteration — biggest wins for deep recursion.
- Move large locals to heap — eliminates per-spawn growth.
- sync.Pool for scratch buffers — combines both above for hot paths.
- Worker pools — amortises growth across many tasks.
- Pre-grow long-lived workers — eliminates first-request latency spike.
- Cap MaxStack — defensive; fails fast on attack.
- Right-size channel buffers — frees parked goroutine stacks.
- Pprof-guided cuts — find specific paths to optimize.
- Inlining hot helpers — micro-level; last resort.
Always measure. The cost of stack growth is amortised cheap; the cost of optimizing prematurely is engineering time.