Goroutine Stack Growth — Senior Level¶
Table of Contents¶
- Introduction
- Architectural Implications of Growable Stacks
- Recursion vs Iteration — When Each Wins
- Stack-Heavy Workloads
- Designing Recursive-Descent Parsers Safely
- Stack Cost in High-Concurrency Services
- Stack Behaviour Across the Network Stack
- Stacks and Cgo
- Stacks and Signal Handlers
- Stack Growth and Latency Budgets
- Memory Budgeting at Scale
- Comparison with Other Runtimes
- Summary
Introduction¶
At senior level the goal is to make system-shaping decisions with stack growth in mind. You decide whether to use recursion or iteration in a hot parser, whether to spawn a goroutine per request or use a worker pool, whether to bound user input or rely on the runtime's 1 GB ceiling. You also know the asymmetries between Go stacks and the other stacks lurking in your process: the M's system stack, the signal stack, and cgo stacks. These asymmetries matter when you wire Go into C libraries, OS signals, or extreme-concurrency network code.
This file is less about how growth works (covered at middle level) and more about which architectural choices follow from the fact that it does.
Architectural Implications of Growable Stacks¶
Per-request goroutine is the canonical Go idiom¶
Because each goroutine costs ~2 KB initially, the dominant design pattern in Go is one goroutine per logical unit of work:
- One goroutine per accepted TCP connection.
- One goroutine per HTTP request.
- One goroutine per RPC handler.
- One goroutine per scheduled job.
In a thread-based language this pattern would not work — 100,000 threads each with 1 MB stacks would need 100 GB of address space. In Go it works because stacks scale with the work each goroutine does.
This single design pattern is what defines "idiomatic Go" for server code. It is enabled by growable stacks.
Goroutine-per-task vs worker pool¶
The trade-off:
- Goroutine-per-task — simple, scales by spawning, each task gets its own stack. Cost: stack growth happens per task; aggregated growth shows up in pprof. Memory cost is 2 KB + peak_growth per task.
- Worker pool — fixed number of goroutines that pull tasks from a channel. Each worker's stack settles to a peak after a few warmup tasks. Cost: complexity, channel contention. Memory cost is fixed.
When to use which:
- Per-task for I/O-bound work with hundreds to tens of thousands of concurrent operations. Latency benefits from no queueing.
- Worker pool for CPU-bound work, where you don't want to spawn more goroutines than CPUs. Also when each task's work is comparable in size and a long queue won't blow latency budgets.
The stack-growth angle: pools amortise growth across millions of tasks; per-task pays growth per task. For short tasks with low recursion, this rarely matters. For tasks that recurse heavily (e.g., parsing complex documents), it does.
Avoid spawning goroutines you don't need to spawn¶
A common mistake at senior level is reflexively spawning a goroutine for every operation:
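for _, it := range items {
	go process(it) // one goroutine, and one 2 KB stack, per item
}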
If items has 10 million entries and each process call is a millisecond of work, that is 10 million × 2 KB = 20 GB of transient stack memory. Better:
workers := runtime.GOMAXPROCS(0)
ch := make(chan Item, workers*2)
var wg sync.WaitGroup
for i := 0; i < workers; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for it := range ch {
process(it)
}
}()
}
for _, it := range items {
ch <- it
}
close(ch)
wg.Wait()
This caps both concurrency and stack memory.
Recursion vs Iteration — When Each Wins¶
Growable stacks make recursion safe up to the 1 GB ceiling. But "safe" is not "fast."
Recursion wins when¶
- Depth is provably bounded. Walking a balanced binary tree: depth is ~log N. A 1 million-node tree is ~20 levels deep. Recursion is clear, idiomatic, and the stack cost is trivial.
- Code clarity matters more than performance. Compilers, AST visitors, type-checkers. The recursive structure mirrors the data structure.
- The recursion has visible base cases. A reader can confirm termination.
Iteration wins when¶
- Depth depends on user input. Anything that takes JSON, XML, regex, or user code as input is vulnerable.
- Depth depends on data size linearly. Walking a linked list of length N as recursion is N stack frames — wasteful.
- Hot path performance matters. Each recursive call pays the prologue check. A loop pays nothing comparable.
- Tail-call structure exists. Go does not optimise tail calls. Convert by hand (a sketch follows this list).
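A minimal sketch of the hand conversion, using a hypothetical singly linked list type:
type node struct{ next *node }
// Tail-recursive: one stack frame per element — N frames for a list of length N.
func lengthRec(n *node, acc int) int {
	if n == nil {
		return acc
	}
	return lengthRec(n.next, acc+1) // Go will not turn this into a jump
}
// Hand-converted loop: constant stack; the loop body pays no prologue check.
func lengthIter(n *node) int {
	acc := 0
	for ; n != nil; n = n.next {
		acc++
	}
	return acc
}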
The "trampoline" pattern¶
For recursion whose depth is data-dependent but whose structure is simple, a trampoline converts it to iteration:
type cont func() cont
func walk(n *node, k cont) cont {
	if n == nil {
		return k // nothing to do here — resume the pending continuation
	}
	visit(n)
	return func() cont {
		// left subtree first; the inner continuation then takes the
		// right subtree and finally resumes k
		return walk(n.left, func() cont {
			return walk(n.right, k)
		})
	}
}
func run(start cont) {
	for c := start; c != nil; c = c() {
	}
}
// usage: run(walk(root, nil))
This pattern is less common in Go than in functional languages but appears in heavy-AST tooling. The cost is a closure allocation or two per node, all on the heap — trading stack growth for GC pressure.
Explicit-stack iteration¶
Most often, the right answer is to maintain an explicit slice as the work stack:
func walk(root *node) {
stack := []*node{root}
for len(stack) > 0 {
n := stack[len(stack)-1]
stack = stack[:len(stack)-1]
if n == nil {
continue
}
visit(n)
stack = append(stack, n.right, n.left)
}
}
This puts the stack on the heap, which means:
- Growth is via slice doubling, not stack copying.
- No 1 GB ceiling — limited only by available heap memory.
- No goroutine-stack growth on the traversal.
- Heap allocation cost (mitigatable with sync.Pool — see the sketch below).
For tree depths over a few thousand, this is the standard Go idiom.
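A hedged sketch of that sync.Pool mitigation, reusing the walk above (the pool holds a pointer to the slice so the slice header isn't re-boxed on every Put):
var stackPool = sync.Pool{
	New: func() any {
		s := make([]*node, 0, 1024)
		return &s
	},
}
func walkPooled(root *node) {
	sp := stackPool.Get().(*[]*node)
	stack := append((*sp)[:0], root)
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if n == nil {
			continue
		}
		visit(n)
		stack = append(stack, n.right, n.left)
	}
	*sp = stack // keep the grown capacity for the next traversal
	stackPool.Put(sp)
}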
Stack-Heavy Workloads¶
Some workloads naturally stress goroutine stacks:
Recursive descent parsers¶
encoding/json, encoding/xml, YAML libraries, regex engines, SQL parsers. Each level of nesting in the input is one or more stack frames.
encoding/json caps depth at 10,000 levels by default. Documents nested deeper return an error rather than triggering stack overflow. You should adopt the same pattern in any parser you write.
Tree walkers¶
AST traversal (compilers, linters, code generators). Usually balanced enough that recursion is fine, but a degenerate input (a 10,000-deep if x then if y then if z ...) can blow stacks if unbounded.
Backtracking algorithms¶
DFS, constraint satisfaction, game-tree search. Pure recursion grows stack linearly with search depth. For deep games (chess to depth 30), use iterative deepening with explicit stacks.
Pattern matching¶
regexp is safe in Go because the standard library uses RE2 (no backtracking, no recursion proportional to input). But third-party regex libraries with PCRE-style backtracking can blow stacks on adversarial inputs.
Functional-style code¶
Heavy higher-order combinators (map, fold, filter, monadic chains) can produce deep call chains. Less common in Go than in Haskell/Scala but appears in code ported from those languages.
Designing Recursive-Descent Parsers Safely¶
Three patterns combine to make recursive parsers safe:
1. Bound depth explicitly¶
const maxDepth = 1000
type parser struct {
depth int
// ...
}
func (p *parser) parseValue() (Value, error) {
if p.depth >= maxDepth {
return nil, fmt.Errorf("nesting too deep (max %d)", maxDepth)
}
p.depth++
defer func() { p.depth-- }()
// ... actual parsing ...
}
The user sees a clean error; your process survives.
2. Lower debug.SetMaxStack for defence in depth¶
If your depth limit fails (bug, missed code path), the process dies at 64 MB instead of 1 GB. Faster failure, less memory consumed, less impact on the rest of the host.
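A sketch, assuming 64 MB is an acceptable ceiling for the service:
import "runtime/debug"
func init() {
	// The default ceiling is 1 GB on 64-bit platforms. Capping it means a
	// missed depth check fails fast instead of consuming a gigabyte first.
	debug.SetMaxStack(64 << 20) // 64 MB
}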
3. Iterative parsers where speed matters¶
A hand-written iterative JSON parser is 2–3× faster than a recursive one because it avoids the prologue check and growth on hot paths. encoding/json/v2 and high-performance parsers like jsoniter and fastjson use iterative state machines for this reason.
For a normal application, the standard library's recursive parser is fine. For a service handling millions of small JSONs per second, switch to an iterative parser.
Stack Cost in High-Concurrency Services¶
The "million-connection" architecture¶
A server handling 1M concurrent TCP connections, one goroutine per connection, with each goroutine averaging a 4 KB stack, uses:
- 4 KB × 1,000,000 = 4 GB of stack memory.
- Plus 2 KB scheduler overhead per goroutine = 2 GB.
- Total: ~6 GB just for goroutine state.
This is reachable on commodity 16-core / 64 GB servers. With smarter design (read/write goroutines sharing buffers, parking idle connections), it can be reduced further.
The "buffer-per-goroutine" trap¶
A common pattern:
func handleConn(c net.Conn) {
buf := make([]byte, 64*1024) // 64 KB read buffer per connection
for {
n, err := c.Read(buf)
// ...
}
}
The make puts buf on the heap, but 64 KB × 1M connections = 64 GB. Plus the goroutine stack. Solutions:
- Smaller buffer. 4 KB or 8 KB is usually plenty.
- Shared buffer pool. sync.Pool, so idle connections release the buffer (see the sketch after this list).
- bufio.Reader with a moderate size. Lets you read efficiently without huge per-connection buffers.
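A sketch of the pooled variant of handleConn above; handle is a hypothetical callback that must not retain the buffer past the call:
var bufPool = sync.Pool{
	New: func() any { return make([]byte, 8*1024) },
}
func handleConn(c net.Conn) {
	defer c.Close()
	for {
		buf := bufPool.Get().([]byte) // borrowed only while actively reading
		n, err := c.Read(buf)
		if n > 0 {
			handle(buf[:n])
		}
		bufPool.Put(buf) // idle connections hold no buffer
		if err != nil {
			return
		}
	}
}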
Choosing the right starting concurrency¶
Rule of thumb for typical Go services:
- 1K–10K goroutines: no special design needed.
- 10K–100K goroutines: pay attention to per-goroutine memory; use bufio sparingly.
- 100K–1M goroutines: design for it — park idle connections, share buffers, watch StackSys.
- >1M: consider event-loop architectures (e.g., gnet, evio) that use fewer goroutines plus manual epoll.
The Go-native style works up to a few million on big hardware. Past that, the trade-off of one-goroutine-per-conn vs event-loop favours event loops.
Stack Behaviour Across the Network Stack¶
When a goroutine blocks on net.Conn.Read, the runtime:
- Calls the standard read(2) syscall (eventually).
- The fd was set non-blocking when it was registered; read returns EAGAIN immediately if no data is ready.
- The runtime parks the goroutine on the netpoller (epoll-based on Linux).
- The M is freed to run other goroutines.
While the goroutine is parked, its stack stays allocated. Stacks are shrunk only during GC, so a goroutine that stays parked keeps its full stack between cycles. If parked goroutines accumulate to millions, their stacks add up.
This is why StackSys in a million-connection server tends to hover at a few GB even when most goroutines are idle.
Mitigation: connection pooling on the client side¶
A client that opens many connections and lets them idle is paying for idle goroutine stacks. Closing idle connections (e.g., http.Transport.IdleConnTimeout) lets those goroutines exit and free their stacks.
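For the standard HTTP client this is a one-line knob (the values here are illustrative):
tr := &http.Transport{
	IdleConnTimeout: 30 * time.Second, // idle conns are closed; their goroutines exit
	MaxIdleConns:    100,
}
client := &http.Client{Transport: tr}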
Mitigation: read deadlines¶
A connection waiting forever on Read pins a goroutine forever. With a read deadline (c.SetReadDeadline(...)), Read returns a timeout error if no data arrives, letting the goroutine exit and free its stack.
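A sketch inside a connection handler (buf as in the earlier example; the timeout is illustrative):
c.SetReadDeadline(time.Now().Add(5 * time.Minute))
n, err := c.Read(buf) // fails with a timeout error if no data arrives
if err != nil {
	return // the goroutine exits; its stack is freed
}
// ... use buf[:n] ...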
Stacks and Cgo¶
When a goroutine calls a C function via cgo:
- The runtime switches from the goroutine's growable stack to the M's g0 stack.
- The C function runs on g0 — a fixed-size, OS-allocated thread stack (typically 8 KB or larger).
- The goroutine cannot grow its stack while in C. A deep C recursion can blow g0.
Implications:
- Don't recurse deeply in C from cgo. The fixed g0 size limits depth.
- Don't pass Go pointers to long-running C code. The Go stack may move at any growth event. If C holds a pointer into a Go stack and the stack moves, the pointer dangles. Go enforces this with cgo.Handle and pointer-check tooling (GODEBUG=cgocheck=2); a sketch follows this list.
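A minimal sketch of the cgo.Handle pattern — the handle is an integer that can safely cross into C while the Go value stays managed by the runtime (no actual C is shown here):
package main
import (
	"fmt"
	"runtime/cgo"
)
func main() {
	h := cgo.NewHandle("payload") // integer handle; pass uintptr(h) to C
	defer h.Delete()              // invalidate once C is done with it
	v := h.Value().(string)       // a C-to-Go callback resolves it back
	fmt.Println(v)
}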
Why two stacks?¶
The M's g0 stack is used for runtime internals (scheduler, GC, signal handling). C code lives in the same "system" world. Switching to g0 for cgo means:
- C code runs on a stack the kernel knows about.
- The Go stack-growth mechanism stays out of the way of C.
- The runtime's invariants (movable stacks, stack maps) are not violated.
runtime.LockOSThread and stacks¶
A goroutine that calls LockOSThread still has its growable Go stack. The lock pins the goroutine to an M but doesn't change its stack.
Stacks and Signal Handlers¶
When a signal (SIGURG for async preemption, SIGPROF for the CPU profiler, etc.) is delivered, the kernel switches to a separate signal stack — the M's gsignal stack. This is allocated at M creation, fixed size (~32 KB), and not growable.
Inside a signal handler:
- You cannot call most Go code (no allocations, no growable-stack operations).
- The Go runtime's signal handlers are written carefully to fit in 32 KB.
- User-installed signal handlers via the os/signal package run in a normal goroutine — the runtime catches the signal on the gsignal stack and forwards it to a channel that an ordinary goroutine receives on (see the sketch after this list).
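A sketch of that delivery path; the handler body runs on an ordinary growable goroutine stack:
package main
import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)
func main() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGTERM) // caught on gsignal, forwarded to ch
	sig := <-ch                        // received on a normal goroutine stack
	fmt.Println("shutting down on", sig)
}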
If you somehow ended up doing real work on gsignal (you would have to use cgo to install a signal handler that calls Go), you can blow it just like any fixed-size stack.
Stack Growth and Latency Budgets¶
A stack grow event involves:
- Allocate new stack (microseconds, hits the stack pool or the page allocator).
- Copy bytes — for a 16 KB stack, ~3 μs on modern hardware.
- Pointer fix-up — ~1 μs per frame, depending on number of pointer-containing slots.
Total: a single growth event is typically 5–20 μs for stack sizes up to 64 KB. Larger stacks take proportionally longer.
In a service with a 1 ms P99 latency budget, a stack grow event consumes 0.5–2% of the budget. Usually invisible. But if your request involves multiple growths (deep handler chain), it can add 50–100 μs to tail latency.
Mitigation: warmup¶
In long-lived workers (e.g., a fixed pool of workers serving requests), the first few requests grow the stack; later requests don't. Some services do explicit warmup:
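A hedged sketch of what that can look like — warmStack, work, and handle are all hypothetical here; the idea is to recurse once at startup so the worker's stack is grown before real traffic arrives:
// warmStack forces roughly depth × 512 bytes of stack growth up front.
func warmStack(depth int) byte {
	var pad [512]byte
	if depth > 0 {
		pad[0] = warmStack(depth - 1)
	}
	return pad[0]
}
go func() {
	warmStack(64) // pay the growth and copy once, at startup
	for req := range work {
		handle(req) // later requests find the stack already grown
	}
}()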
This is a niche optimisation; usually the first request's slight tail-latency hit is acceptable.
Mitigation: per-request goroutine = always cold¶
If every request gets a fresh goroutine, every request pays the cold start. For latency-sensitive RPC servers handling tiny messages, switching to a worker pool can shave microseconds.
Memory Budgeting at Scale¶
Budgeting per-goroutine memory at scale:
| Component | Cost |
|---|---|
| g struct | ~400 bytes |
| Initial stack | 2 KB |
| Scheduler bookkeeping | ~100 bytes |
| Per-goroutine baseline | ~2.5 KB |
| Plus typical handler local vars | 1–4 KB |
| Plus handler-allocated buffers | varies |
So for an HTTP server with 100K connections and one goroutine each:
- Baseline: 2.5 KB × 100K = 250 MB
- Stacks after growth to ~4 KB average: 4 KB × 100K = 400 MB
- Application buffers: usually the dominant term
Stacks rarely dominate total memory. A service with a 10 GB RSS rarely has more than a few GB in stacks. The dominant memory cost is usually application data (caches, buffers, request bodies).
Watching the budget¶
Track in production:
- runtime.NumGoroutine() — total goroutine count.
- runtime.MemStats.StackInuse — stack memory in active use.
- runtime.MemStats.StackSys — stack memory committed from the OS.
Alert when StackInuse exceeds an expected ceiling. A 10× growth usually indicates a leak.
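A sketch of that tracking, with an illustrative ceiling and sampling interval:
import (
	"log"
	"runtime"
	"time"
)
func watchStacks(ceiling uint64) {
	for range time.Tick(time.Minute) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms) // note: briefly stops the world
		if ms.StackInuse > ceiling {
			log.Printf("stack ceiling exceeded: inuse=%d MB sys=%d MB goroutines=%d",
				ms.StackInuse>>20, ms.StackSys>>20, runtime.NumGoroutine())
		}
	}
}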
Comparison with Other Runtimes¶
POSIX threads¶
Fixed stack size at creation (pthread_attr_setstacksize). Typically 8 MB default on Linux, 512 KB on macOS. No growth. Overflowing the stack hits a guard page and kills the process with a segmentation fault.
A million pthreads = 8 TB of address space. Infeasible.
Windows threads¶
Fixed at creation. Default 1 MB. Reserved virtual address space; committed lazily by the OS. Can be smaller, but no growth.
Java threads¶
Backed by OS threads since the move away from green threads. Default stack 512 KB on most JVMs (-Xss512k). No growth. Java 21+ adds virtual threads (Project Loom) which do have growable stacks similar to goroutines.
Erlang processes¶
Small stacks (typically a few hundred words), growable in segments. Erlang famously inspired the "many lightweight processes" design pattern. Garbage-collected per process, with an isolated heap per process.
Async/await runtimes (Rust Tokio, JS V8, Python asyncio)¶
Conceptually a different model — no per-task stack at all. Each async function compiles to a state machine. Suspended state is a heap-allocated struct. Total memory per task is whatever the state machine occupies — often less than 2 KB.
Trade-off: code must be async-coloured (functions marked async, awaits explicit). Goroutines hide all of that behind the go keyword.
Project Loom (Java)¶
Modern Java virtual threads use stack chunks similar to Go's segmented stacks of yore, with a twist: stacks are "thinned out" by unmounting (storing state in the heap when blocked, restoring on resume). The result is similar memory cost per virtual thread (~few KB) and similar performance to goroutines.
Comparison summary¶
| Runtime | Per-task memory | Growth? | Max stack |
|---|---|---|---|
| Goroutine (Go 1.4+) | 2 KB | Yes, copying | 1 GB (settable) |
| pthread (Linux) | 8 MB default | No | Fixed at creation |
| Java thread | 512 KB | No | Fixed |
| Java virtual thread | ~few KB | Yes, segmented + heap | Heap-bounded |
| Erlang process | ~few KB | Yes, segmented | Per-VM cap |
| Rust async task | ~few hundred bytes | N/A (state machine) | Heap-bounded |
| Python asyncio task | ~few hundred bytes | N/A | Heap-bounded |
Goroutines sit in a sweet spot: synchronous, blocking-style code (no async coloring) plus low per-task memory.
Summary¶
At senior level, stack growth becomes an architectural input:
- It enables the one-goroutine-per-task idiom that defines idiomatic Go.
- It does not free you from bounding recursion on untrusted input.
- It interacts with cgo (separate stack), signal handlers (separate stack), and the netpoller (parked stacks accumulate).
- Its cost amortises to near zero in steady state but remains visible at the long tail of latency budgets.
- It is the reason Go is in the same conversation as Loom, Erlang, and async runtimes.
The professional level walks the runtime source — morestack, newstack, stackalloc, copystack — and details the pointer fix-up via stack maps.