
Goroutine Lifecycle — Optimize

Optimization exercises focused on reducing the cost of goroutine lifecycles: birth churn, waiting overhead, stack bloat, and the indirect GC pressure that long-lived goroutines impose.

Table of Contents

  1. Optimization Mindset
  2. Opt 1: Avoid Spawn-Per-Job Churn
  3. Opt 2: Reuse Closures, Not Goroutines
  4. Opt 3: Shrink Stack Size for Long-Lived Workers
  5. Opt 4: Reduce _Gwaiting Time
  6. Opt 5: Replace time.Tick with a Long-Lived Ticker
  7. Opt 6: Batch Spawns
  8. Opt 7: Nil Out References Before Long Waits
  9. Opt 8: Bound Goroutine Count
  10. Opt 9: Pool Reusable Goroutines
  11. Opt 10: Move Cleanup Off the Critical Path
  12. Measuring Lifecycle Costs
  13. Summary

Optimization Mindset

Goroutines are cheap, but cheap times millions matter. A goroutine costs:

  • 2 KB initial stack (more after growth).
  • ~400 bytes of g struct overhead.
  • A run-queue slot.
  • A heap-allocated closure (if go func() { ... }() captures variables).
  • The duration of any GC scan over its stack.
  • For long-waiting goroutines: pinned heap references.

If your service spawns 100k goroutines per second and each lives 1 ms, roughly 100 are in flight at any instant, and the runtime performs about 200k birth+death operations per second. Each one is fast, but the aggregate competes with real work for CPU and allocator bandwidth.

Optimizing lifecycle means:

  1. Spawn less. Reuse goroutines via pools.
  2. Wait less. Bound wait times, drain promptly.
  3. Hold less. Don't pin large closures during long waits.
  4. Die promptly. A goroutine ready to die should die — clean up, return.

Opt 1: Avoid Spawn-Per-Job Churn

Baseline

func handle(req Request) {
    go func() {
        process(req)
    }()
}

For 100k req/s, you spawn 100k goroutines/s. Each lives briefly. Birth+death overhead is non-trivial.

Optimized

type Worker struct {
    jobs chan Request
}

func New(workers int) *Worker {
    w := &Worker{jobs: make(chan Request, 1024)}
    for i := 0; i < workers; i++ {
        go w.run()
    }
    return w
}

func (w *Worker) run() {
    for req := range w.jobs {
        process(req)
    }
}

func (w *Worker) Handle(req Request) {
    w.jobs <- req
}

Measure

func BenchmarkSpawnPerJob(b *testing.B) {
    var wg sync.WaitGroup
    for i := 0; i < b.N; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // tiny work
        }()
    }
    wg.Wait()
}

func BenchmarkPool(b *testing.B) {
    jobs := make(chan struct{}, 1024)
    var wg sync.WaitGroup
    for i := 0; i < 16; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for range jobs {
                // tiny work
            }
        }()
    }
    for i := 0; i < b.N; i++ {
        jobs <- struct{}{}
    }
    close(jobs)
    wg.Wait()
}

For micro-work, the pool is typically 3-10x faster because spawn cost dominates.

When NOT to optimize

If each job does meaningful work (>10us), spawn overhead is amortized. Don't pool prematurely.


Opt 2: Reuse Closures, Not Goroutines

Each go func() { ... }() allocates the closure on the heap. For high spawn rates, allocation is a measurable cost.

Baseline

for _, item := range items {
    go func() {
        process(item)
    }()
}

Each iteration allocates a fresh closure. Before Go 1.22, these closures also all captured the same loop variable, the classic loop-capture bug.

Optimized

go func(items []Item) {
    for _, item := range items {
        process(item)
    }
}(items)

One goroutine, one closure. Sequential iteration is often fine if items don't need true concurrency.

If you do need concurrency:

chunks := splitChunks(items, 8)
for _, chunk := range chunks {
    chunk := chunk
    go func() {
        for _, item := range chunk {
            process(item)
        }
    }()
}

One goroutine per chunk, not per item.


Opt 3: Shrink Stack Size for Long-Lived Workers

Long-lived worker goroutines may have grown stacks (e.g., to 4 KB or 8 KB) due to deep recursion or large local variables. Even after the call stack shrinks, the runtime keeps the larger stack until certain triggers (next stack growth, GC).

Baseline

A worker that occasionally calls a deep function:

func (w *Worker) run() {
    for j := range w.jobs {
        if j.IsRare {
            deepRecursion(j) // grows the stack to 16 KB
        } else {
            quickProcess(j)
        }
    }
}

After one rare job, the stack stays at 16 KB until the runtime shrinks it (typically during a later GC), even though the common path needs a fraction of that.

Optimized

Run the rare path in a dedicated goroutine:

func (w *Worker) run() {
    for j := range w.jobs {
        if j.IsRare {
            go deepRecursion(j) // its own short-lived goroutine
        } else {
            quickProcess(j)
        }
    }
}

Or accept the cost: each worker has a 16 KB stack, but you have 16 workers, not a million. Trade off based on absolute numbers.

A full goroutine dump (runtime.Stack with all=true, or the /debug/pprof/goroutine?debug=2 endpoint) shows what each goroutine is executing, which helps identify which workers have hit the deep path.


Opt 4: Reduce _Gwaiting Time

A goroutine in _Gwaiting holds memory and pins references. Reduce wait time:

Baseline

func worker(ctx context.Context, ch <-chan Job) {
    for {
        select {
        case <-ctx.Done():
            return
        case j := <-ch:
            process(j)
        }
    }
}

If ch is rarely fed, the worker waits most of the time. Its stack and closure are pinned.

Optimized

If feed rate is low and bursty, scale the pool dynamically:

func dispatcher(ctx context.Context, jobs <-chan Job) {
    sem := make(chan struct{}, 8) // max 8 concurrent workers
    for {
        select {
        case <-ctx.Done():
            return
        case j := <-jobs:
            sem <- struct{}{}
            go func() {
                defer func() { <-sem }()
                process(j)
            }()
        }
    }
}

Workers are spawned on demand. When idle, no goroutines wait. Semaphore caps concurrency.

This trades spawn cost (some per-job) for memory savings (none idle). Best for low-rate, bursty workloads.


Opt 5: Replace time.Tick with a Long-Lived Ticker

time.Tick is convenient but returns a ticker that can never be stopped; before Go 1.23, such tickers were never garbage collected, so repeated Tick calls leak timers.

Baseline

func heartbeat(ctx context.Context) {
    for t := range time.Tick(time.Second) {
        sendHeartbeat(t)
        if ctx.Err() != nil {
            return
        }
    }
}

The ticker created by time.Tick can never be stopped. After heartbeat returns, the timer keeps firing into a channel nobody reads (and, before Go 1.23, was never collected). The loop also notices cancellation only after the next tick, up to a full second late.

Optimized

func heartbeat(ctx context.Context) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case t := <-ticker.C:
            sendHeartbeat(t)
        }
    }
}

One ticker for the whole lifecycle; Stop() releases the underlying runtime timer, and the select notices cancellation immediately.


Opt 6: Batch Spawns

If you spawn many short-lived goroutines, spawn rate matters. Each go statement does runtime work: allocating or reusing a g, enqueueing it on a run queue, possibly waking an idle M.

Baseline

for _, item := range items {
    go process(item)
}

For 1M items, that's 1M go calls.

Optimized

Fan out by GOMAXPROCS, each goroutine processing many items:

n := runtime.GOMAXPROCS(0)
chunks := splitN(items, n)
var wg sync.WaitGroup
for _, chunk := range chunks {
    chunk := chunk
    wg.Add(1)
    go func() {
        defer wg.Done()
        for _, item := range chunk {
            process(item)
        }
    }()
}
wg.Wait()

Only n goroutines, each long enough to amortize startup. CPU utilization is the same; lifecycle overhead is O(GOMAXPROCS) instead of O(N).


Opt 7: Nil Out References Before Long Waits

A goroutine's stack and closure are GC roots. Anything reachable from them is pinned.

Baseline

func processAfterDelay(bigData []byte, delay time.Duration) {
    time.Sleep(delay)
    save(bigData)
}

go processAfterDelay(blob, time.Hour) // blob pinned for an hour

During the hour-long sleep, blob is alive.

Optimized

If you can do the work before the wait:

func processWithDelayedReport(blob []byte, delay time.Duration) {
    intermediate := transform(blob)
    blob = nil // GC can now reclaim
    time.Sleep(delay)
    save(intermediate)
}

Or restructure: persist blob to disk, sleep, reload smaller form.

When it matters

This is a real production pattern for jobs that "do work later." If blob is 10 MB and you have 1000 deferred jobs, that is 10 GB of pinned memory.


Opt 8: Bound Goroutine Count

An unbounded goroutine count is a DoS vector and a memory blow-up. Always cap.

Baseline

http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
    go doWork(r.Body) // unbounded
    w.Write([]byte("queued"))
})

100k concurrent requests spawn 100k goroutines.

Optimized

var sem = make(chan struct{}, 256) // max 256 concurrent

http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
    select {
    case sem <- struct{}{}:
        go func() {
            defer func() { <-sem }()
            doWork(r.Body)
        }()
        w.Write([]byte("queued"))
    default:
        http.Error(w, "busy", http.StatusServiceUnavailable)
    }
})

Or use a proper queue with a worker pool — the right tool for sustained load.


Opt 9: Pool Reusable Goroutines

The runtime's free list reuses dead goroutines' g structs (and stacks up to the default size), but grown stacks are freed and closures are never reused. For maximum reuse, use worker pools.

type Pool struct {
    jobs chan func()
    wg   sync.WaitGroup
}

func NewPool(n int) *Pool {
    p := &Pool{jobs: make(chan func(), 1024)}
    for i := 0; i < n; i++ {
        p.wg.Add(1)
        go func() {
            defer p.wg.Done()
            for fn := range p.jobs {
                fn()
            }
        }()
    }
    return p
}

func (p *Pool) Submit(fn func()) {
    p.jobs <- fn
}

func (p *Pool) Close() {
    close(p.jobs)
    p.wg.Wait()
}

Hot path: zero spawns. The workers live until Close, so their g structs and stacks are reused indefinitely.

Benchmark:

func BenchmarkPool_HotPath(b *testing.B) {
    p := NewPool(runtime.GOMAXPROCS(0))
    var done sync.WaitGroup
    for i := 0; i < b.N; i++ {
        done.Add(1)
        p.Submit(func() {
            defer done.Done()
            // ... tiny work ...
        })
    }
    done.Wait()
    p.Close()
}

Typical: 2-5x faster than spawn-per-job for sub-microsecond work, equivalent for >10us work.


Opt 10: Move Cleanup Off the Critical Path

A goroutine that does expensive cleanup before death extends its lifecycle. If cleanup is non-essential, defer it.

Baseline

func worker(ctx context.Context, w *Worker) {
    defer w.flushMetrics()      // 50ms
    defer w.closeConnections()  // 100ms
    defer w.persistState()      // 200ms
    for {
        select {
        case <-ctx.Done():
            return
        case j := <-w.jobs:
            process(j)
        }
    }
}

When ctx is canceled, the goroutine spends ~350 ms in cleanup. Other goroutines waiting on this one (via wg.Wait) wait too.

Optimized

Decouple cleanup from worker lifecycle:

func worker(ctx context.Context, w *Worker) {
    defer close(w.exited)
    for {
        select {
        case <-ctx.Done():
            return
        case j := <-w.jobs:
            process(j)
        }
    }
}

// Separate cleanup goroutine
go func() {
    <-w.exited
    w.flushMetrics()
    w.closeConnections()
    w.persistState()
    close(w.cleanupDone)
}()

w.exited signals "the work is done"; the cleanup runs independently. The parent can wait on cleanupDone only if it actually needs the cleanup synchronously.


Measuring Lifecycle Costs

Counting spawns

import (
    "fmt"
    "runtime"
)

func main() {
    var stats runtime.MemStats
    runtime.ReadMemStats(&stats)
    fmt.Println("alloc:", stats.Alloc)

    // ... do work ...

    runtime.ReadMemStats(&stats)
    fmt.Println("alloc:", stats.Alloc)
}

Alloc is live heap bytes and falls after each GC; for cumulative allocation pressure (which includes closures), compare TotalAlloc before and after the workload. If it jumps far more than the work itself explains, your goroutines may be allocating closures unnecessarily.

Watching the scheduler

The runtime reports GC activity via GODEBUG=gctrace=1 and scheduler activity via GODEBUG=schedtrace=1000. Use them to spot anomalies:

sched: gomaxprocs=8 idleprocs=0 threads=12 spinningthreads=0 idlethreads=4 runqueue=0 [0 0 1 0 5 0 0 0]

Large run-queue numbers (the global runqueue count or the bracketed per-P queues) indicate spawn bursts.

runtime/trace

Run a workload with trace.Start/Stop, then open the result with go tool trace. The "Goroutines" view shows:

  • Number of goroutines over time.
  • Spawn rate.
  • Wait reasons distribution.

The dominant wait reason tells you where lifecycle time goes. If "chan receive" dominates, your goroutines spend most of their lives blocked; a dense band of goroutine-creation events points to an excessive spawn rate.

Benchmark template

func BenchmarkLifecycle(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        done := make(chan struct{})
        go func() {
            close(done)
        }()
        <-done
    }
}

Typical output:

BenchmarkLifecycle-8  3000000  450 ns/op  16 B/op  1 allocs/op

At least part of that allocation is the closure (it captures done). Eliminating it, e.g. by spawning a named function that takes done as a parameter, cuts allocation pressure.


Summary

Goroutine lifecycle optimization is rarely the first bottleneck — but at scale (100k+ goroutines/s, millions of long-lived workers), it dominates. The patterns:

  1. Pool instead of spawn for hot paths.
  2. One goroutine per chunk, not per item.
  3. Tickers, not time.Tick for periodic work.
  4. Bound concurrency with semaphores or pools.
  5. Nil out big references before long waits.
  6. Decouple cleanup from worker lifecycle when possible.
  7. Measure with runtime/trace before optimizing — intuition deceives.

Verify each optimization preserves correctness with leak tests (goleak) and race detection (-race). Premature optimization of lifecycle is no different from any other premature optimization — measure, target the hot path, and beware of complexity creep.

See 02-detecting-leaks for the diagnostic toolbox and 03-preventing-leaks for patterns that combine correctness with efficiency.