Goroutine Lifecycle — Optimize¶
Optimization exercises focused on reducing the cost of goroutine lifecycles: birth churn, waiting overhead, stack bloat, and the indirect GC pressure that long-lived goroutines impose.
Table of Contents¶
- Optimization Mindset
- Opt 1: Avoid Spawn-Per-Job Churn
- Opt 2: Reuse Closures, Not Goroutines
- Opt 3: Shrink Stack Size for Long-Lived Workers
- Opt 4: Reduce _Gwaiting Time
- Opt 5: Replace time.Tick with a Long-Lived Ticker
- Opt 6: Batch Spawns
- Opt 7: Nil Out References Before Long Waits
- Opt 8: Bound Goroutine Count
- Opt 9: Pool Reusable Goroutines
- Opt 10: Move Cleanup Off the Critical Path
- Measuring Lifecycle Costs
- Summary
Optimization Mindset¶
Goroutines are cheap, but cheap times millions matter. A goroutine costs:
- 2 KB initial stack (more after growth).
- ~400 bytes of g struct overhead.
- A run-queue slot.
- A closure on the heap (if go func() { ... }()).
- The duration of any GC scan over its stack.
- For long-waiting goroutines: pinned heap references.
If your service spawns 100k goroutines per second and each lives 1 ms, only ~100 are alive at any instant, but you pay 100k births and 100k deaths every second: 200k lifecycle operations/s in aggregate. Each is fast, but together they compete with real work for CPU and allocator bandwidth.
Optimizing lifecycle means:
- Spawn less. Reuse goroutines via pools.
- Wait less. Bound wait times, drain promptly.
- Hold less. Don't pin large closures during long waits.
- Die promptly. A goroutine ready to die should die — clean up, return.
Opt 1: Avoid Spawn-Per-Job Churn¶
Baseline¶
For 100k req/s, you spawn 100k goroutines/s. Each lives briefly. Birth+death overhead is non-trivial.
Optimized¶
type Worker struct {
    jobs chan Request
}

func New(workers int) *Worker {
    w := &Worker{jobs: make(chan Request, 1024)}
    for i := 0; i < workers; i++ {
        go w.run()
    }
    return w
}

func (w *Worker) run() {
    for req := range w.jobs {
        process(req)
    }
}

func (w *Worker) Handle(req Request) {
    w.jobs <- req
}
Measure¶
func BenchmarkSpawnPerJob(b *testing.B) {
    var wg sync.WaitGroup
    for i := 0; i < b.N; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // tiny work
        }()
    }
    wg.Wait()
}

func BenchmarkPool(b *testing.B) {
    jobs := make(chan struct{}, 1024)
    var wg sync.WaitGroup
    for i := 0; i < 16; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for range jobs {
                // tiny work
            }
        }()
    }
    for i := 0; i < b.N; i++ {
        jobs <- struct{}{}
    }
    close(jobs)
    wg.Wait()
}
For micro-work, the pool is typically 3-10x faster because spawn cost dominates.
When NOT to optimize¶
If each job does meaningful work (>10us), spawn overhead is amortized. Don't pool prematurely.
Opt 2: Reuse Closures, Not Goroutines¶
Each go func() { ... }() allocates the closure on the heap. For high spawn rates, allocation is a measurable cost.
Baseline¶
Each iteration allocates a fresh closure. Plus the captured loop variable bug (pre-1.22).
Optimized¶
One goroutine, one closure. Sequential iteration is often fine if items don't need true concurrency.
If you do need concurrency:
chunks := splitChunks(items, 8)
var wg sync.WaitGroup
for _, chunk := range chunks {
    chunk := chunk // pre-1.22 loop-variable capture
    wg.Add(1)
    go func() {
        defer wg.Done()
        for _, item := range chunk {
            process(item)
        }
    }()
}
wg.Wait()
One goroutine per chunk, not per item.
Opt 3: Shrink Stack Size for Long-Lived Workers¶
Long-lived worker goroutines may have grown stacks (e.g., to 4 KB or 8 KB) due to deep recursion or large local variables. Even after the call stack shrinks, the runtime keeps the larger stack until certain triggers (next stack growth, GC).
Baseline¶
A worker that occasionally calls a deep function:
func (w *Worker) run() {
    for j := range w.jobs {
        if j.IsRare {
            deepRecursion(j) // grows the stack to 16 KB
        } else {
            quickProcess(j)
        }
    }
}
After one rare job, the stack is 16 KB for the lifetime of the worker.
Optimized¶
Run the rare path in a dedicated goroutine:
func (w *Worker) run() {
    for j := range w.jobs {
        if j.IsRare {
            go deepRecursion(j) // its own short-lived goroutine
        } else {
            quickProcess(j)
        }
    }
}
Or accept the cost: each worker has a 16 KB stack, but you have 16 workers, not a million. Trade off based on absolute numbers.
A goroutine dump (runtime.Stack, or pprof's goroutine profile at debug=2) shows each goroutine's current frames; runtime.MemStats.StackInuse and the runtime/metrics counter /memory/classes/heap/stacks:bytes track total stack memory.
Opt 4: Reduce _Gwaiting Time¶
A goroutine in _Gwaiting holds memory and pins references. Reduce wait time:
Baseline¶
func worker(ctx context.Context, ch <-chan Job) {
    for {
        select {
        case <-ctx.Done():
            return
        case j := <-ch:
            process(j)
        }
    }
}
If ch is rarely fed, the worker waits most of the time. Its stack and closure are pinned.
Optimized¶
If feed rate is low and bursty, scale the pool dynamically:
func dispatcher(ctx context.Context, jobs <-chan Job) {
    sem := make(chan struct{}, 8) // max 8 concurrent workers
    for {
        select {
        case <-ctx.Done():
            return
        case j := <-jobs:
            sem <- struct{}{}
            go func() {
                defer func() { <-sem }()
                process(j)
            }()
        }
    }
}
Workers are spawned on demand; when the system is idle, no goroutines sit blocked, and the semaphore caps concurrency at 8. This trades some per-job spawn cost for zero idle memory. Best for low-rate, bursty workloads.
Opt 5: Replace time.Tick with a Long-Lived Ticker¶
time.Tick is convenient but returns a channel whose underlying ticker can never be stopped. Before Go 1.23, such tickers were never garbage collected, so repeated Tick calls accumulate live runtime timers.
Baseline¶
func heartbeat(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        case t := <-time.Tick(time.Second): // new ticker every iteration
            sendHeartbeat(t)
        }
    }
}
Every loop iteration calls time.Tick, creating a fresh ticker that is never stopped (and, before Go 1.23, never collected). The function exits cleanly on cancellation, but the tickers it created keep accumulating while it runs.
Optimized¶
func heartbeat(ctx context.Context) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case t := <-ticker.C:
            sendHeartbeat(t)
        }
    }
}
One ticker for the whole lifecycle. Stop() releases the underlying runtime timer.
Opt 6: Batch Spawns¶
If you spawn many short-lived goroutines, spawn rate matters. Each go involves runtime work — locking the per-P run queue, possibly waking an M.
Baseline¶
For 1M items, that's 1M go calls.
Optimized¶
Fan out by GOMAXPROCS, each goroutine processing many items:
n := runtime.GOMAXPROCS(0)
chunks := splitN(items, n)
var wg sync.WaitGroup
for _, chunk := range chunks {
    chunk := chunk // pre-1.22 loop-variable capture
    wg.Add(1)
    go func() {
        defer wg.Done()
        for _, item := range chunk {
            process(item)
        }
    }()
}
wg.Wait()
Only n goroutines, each long enough to amortize startup. CPU utilization is the same; lifecycle overhead is O(GOMAXPROCS) instead of O(N).
Opt 7: Nil Out References Before Long Waits¶
A goroutine's stack and closure are GC roots. Anything reachable from them is pinned.
Baseline¶
func processAfterDelay(bigData []byte, delay time.Duration) {
    time.Sleep(delay)
    save(bigData)
}

go processAfterDelay(blob, time.Hour) // blob pinned for an hour
During the hour-long sleep, blob is alive.
Optimized¶
If you can do the work before the wait:
func processWithDelayedReport(blob []byte, delay time.Duration) {
    intermediate := transform(blob)
    blob = nil // GC can now reclaim
    time.Sleep(delay)
    save(intermediate)
}
Or restructure: persist blob to disk, sleep, reload smaller form.
When it matters¶
This is a real production pattern for jobs that "do work later." If blob is 10 MB and you have 1000 deferred jobs, that is 10 GB of pinned memory.
Opt 8: Bound Goroutine Count¶
An unbounded goroutine count is a DoS vector and a memory blow-up. Always cap.
Baseline¶
http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
    go doWork(r.Body) // unbounded; r.Body is also invalid once the handler returns
    w.Write([]byte("queued"))
})
100k concurrent requests spawn 100k goroutines.
Optimized¶
var sem = make(chan struct{}, 256) // max 256 concurrent

http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
    select {
    case sem <- struct{}{}:
        body, err := io.ReadAll(r.Body) // copy before the handler returns
        if err != nil {
            <-sem
            http.Error(w, "bad request", http.StatusBadRequest)
            return
        }
        go func() {
            defer func() { <-sem }()
            doWork(body)
        }()
        w.Write([]byte("queued"))
    default:
        http.Error(w, "busy", http.StatusServiceUnavailable)
    }
})
Or use a proper queue with a worker pool — the right tool for sustained load.
Opt 9: Pool Reusable Goroutines¶
The runtime's free list reuses dead goroutines' g structs (and caches default-sized stacks), but grown stacks are released and every spawn still allocates its closure. For maximum reuse, use worker pools.
type Pool struct {
    jobs chan func()
    wg   sync.WaitGroup
}

func NewPool(n int) *Pool {
    p := &Pool{jobs: make(chan func(), 1024)}
    for i := 0; i < n; i++ {
        p.wg.Add(1)
        go func() {
            defer p.wg.Done()
            for fn := range p.jobs {
                fn()
            }
        }()
    }
    return p
}

func (p *Pool) Submit(fn func()) {
    p.jobs <- fn
}

func (p *Pool) Close() {
    close(p.jobs)
    p.wg.Wait()
}
Hot path: zero spawns. The workers live until Close, so their g structs and stacks are reused for every job.
Benchmark:
func BenchmarkPool_HotPath(b *testing.B) {
    p := NewPool(runtime.GOMAXPROCS(0))
    var done sync.WaitGroup
    for i := 0; i < b.N; i++ {
        done.Add(1)
        p.Submit(func() {
            defer done.Done()
            // ... tiny work ...
        })
    }
    done.Wait()
    p.Close()
}
Typical: 2-5x faster than spawn-per-job for sub-microsecond work, equivalent for >10us work.
Opt 10: Move Cleanup Off the Critical Path¶
A goroutine that does expensive cleanup before death extends its lifecycle. If cleanup is non-essential, defer it.
Baseline¶
func worker(ctx context.Context, w *Worker) {
    defer w.flushMetrics()     // 50ms
    defer w.closeConnections() // 100ms
    defer w.persistState()     // 200ms
    for {
        select {
        case <-ctx.Done():
            return
        case j := <-w.jobs:
            process(j)
        }
    }
}
When ctx is canceled, the goroutine spends ~350 ms in cleanup. Other goroutines waiting on this one (via wg.Wait) wait too.
Optimized¶
Decouple cleanup from worker lifecycle:
func worker(ctx context.Context, w *Worker) {
    defer close(w.exited)
    for {
        select {
        case <-ctx.Done():
            return
        case j := <-w.jobs:
            process(j)
        }
    }
}

// Separate cleanup goroutine
go func() {
    <-w.exited
    w.flushMetrics()
    w.closeConnections()
    w.persistState()
    close(w.cleanupDone)
}()
w.exited signals "the work is done"; the cleanup runs independently. The parent can wait on cleanupDone only if it actually needs the cleanup synchronously.
Measuring Lifecycle Costs¶
Counting spawns¶
import (
    "fmt"
    "runtime"
)

func main() {
    var stats runtime.MemStats
    runtime.ReadMemStats(&stats)
    before := stats.TotalAlloc
    // ... do work ...
    runtime.ReadMemStats(&stats)
    fmt.Println("allocated during workload:", stats.TotalAlloc-before)
}
TotalAlloc is cumulative, so the delta across a workload is its allocation pressure, which includes closures. If the delta is larger than expected, your goroutines may be allocating closures unnecessarily. (Alloc, by contrast, reports live heap and can shrink across a GC, so it is a poor measure of pressure.)
Tracking g allocations¶
The runtime exposes GC information via GODEBUG=gctrace=1 and scheduler activity via GODEBUG=schedtrace=1000. Use these to spot anomalies:
sched: gomaxprocs=8 idleprocs=0 threads=12 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 1 0 5 0 0 0]
Large run-queue values indicate spawn bursts outpacing the available workers.
runtime/trace¶
Run a workload between trace.Start and trace.Stop, then open the result with go tool trace. The "Goroutines" view shows:
- Number of goroutines over time.
- Spawn rate.
- Wait reasons distribution.
The dominant wait reason tells you where goroutines spend their lifecycle: if "chan receive" dominates, your goroutines spend most of their time blocked; a dense stream of goroutine-creation events means spawn rate itself is the cost worth attacking.
Benchmark template¶
func BenchmarkLifecycle(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        done := make(chan struct{})
        go func() {
            close(done)
        }()
        <-done
    }
}
The benchmark typically reports one 16-byte allocation per iteration: the closure capturing done. Eliminating it (e.g., by spawning a named function with explicit arguments, go f(done)) cuts allocation pressure.
Summary¶
Goroutine lifecycle optimization is rarely the first bottleneck, but at scale (100k+ spawns/s, millions of long-lived workers) it can dominate. The patterns:
- Pool instead of spawn for hot paths.
- One goroutine per chunk, not per item.
- Tickers, not time.Tick, for periodic work.
- Bound concurrency with semaphores or pools.
- Nil out big references before long waits.
- Decouple cleanup from worker lifecycle when possible.
- Measure with runtime/trace before optimizing: intuition deceives.
Verify each optimization preserves correctness with leak tests (goleak) and race detection (-race). Premature optimization of lifecycle is no different from any other premature optimization — measure, target the hot path, and beware of complexity creep.
See 02-detecting-leaks for the diagnostic toolbox and 03-preventing-leaks for patterns that combine correctness with efficiency.