Preventing Goroutine Leaks — Optimize

Table of Contents

  1. Introduction
  2. Optimising Shutdown Latency
  3. Reducing Cancellation Overhead
  4. Fast Leak Triage
  5. Minimising the Cost of Prevention
  6. Avoiding Spawn Churn
  7. Hot-Path Cancellation Checks
  8. Optimising the Owning Struct Pattern
  9. When Not to Optimise
  10. Summary

Introduction

Prevention is not free. Every <-ctx.Done() case, every defer cancel(), every sync.WaitGroup adds nanoseconds and a few bytes. For most code, the cost is invisible. For hot paths — request routers, message-loop dispatchers, tight scheduling code — it matters.

This file covers two kinds of optimisation:

  1. Latency: make shutdown fast, make cancellation propagate quickly.
  2. Throughput: reduce the per-goroutine overhead of prevention.

The optimisations here come after the patterns from earlier files are in place. Optimising a leak-prone codebase is a waste; optimising a clean one pays off.


Optimising Shutdown Latency

The shutdown wall

A typical shutdown sequence:

cancel()
  -> goroutines notice ctx.Done()
  -> goroutines wind down (in-flight work)
  -> goroutines return
  -> Wait returns
  -> Close returns
  -> next component closes
  -> ...

The wall-clock time of shutdown is dominated by the slowest "wind down" step. Typical contributors:

  • An HTTP server waiting for a long-poll request to finish.
  • A database query that doesn't respect context cancellation.
  • A retry loop with exponential backoff sleeping past the cancellation.
  • A worker pool with one slow job in flight.

Measure first

Add a shutdown profiler:

func (s *Service) Close() error {
    start := time.Now()
    defer func() {
        log.Printf("Close took %v", time.Since(start))
    }()
    // ...
}

In CI, fail if shutdown exceeds the budget. In production, alert if shutdown latency exceeds the 95th percentile of historic shutdowns.

Quick wins

  • Replace time.Sleep with select { case <-ctx.Done(): return; case <-time.After(d): }: the goroutine wakes immediately on cancellation instead of waiting out the sleep.
  • Bound retry sleeps: cap exponential backoff. A 30-minute retry sleep ignores 30 minutes of cancellation.
  • Make every blocking call context-aware: db.QueryContext, http.NewRequestWithContext, net.Conn with SetDeadline.
  • Tighten loop iteration: a CPU-bound loop checking ctx.Err() every 10^9 iterations is slow to cancel. Drop to 10^6 or even 10^5 for sub-100ms latency.

Parallel shutdown

If you have 10 independent components, shut them down in parallel:

func (m *Manager) Close() error {
    g, _ := errgroup.WithContext(context.Background())
    for _, c := range m.components {
        c := c
        g.Go(c.Close)
    }
    return g.Wait()
}

Caveat: if components have dependencies (the API depends on the DB; close the API first), parallel close breaks the order. Use parallel only for truly independent components.

Two-phase shutdown is faster

The drain-then-cancel pattern:

Phase 1: stop accepting new work (instant)
Phase 2: wait for in-flight work (most of the time)
Phase 3: force-cancel residual (deadline-bound)

Phase 1 is instant: close(acceptCh) or listener.Close(). Phase 2 has a budget. Phase 3 is the safety net. The total is bounded by the budget.

func (s *Service) Shutdown(ctx context.Context) error {
    s.stopAccepting()
    done := make(chan struct{})
    go func() {
        s.wg.Wait()
        close(done)
    }()
    select {
    case <-done:
        return nil
    case <-ctx.Done():
        s.cancel() // force-cancel
        s.wg.Wait()
        return ctx.Err()
    }
}

Reducing Cancellation Overhead

The cost of <-ctx.Done() in a select

A single <-ctx.Done() case in a hot select adds:

  • A few nanoseconds for the select's case evaluation (one extra channel read attempt).
  • A few hundred bytes per goroutine (the context internal state).

For most workloads, the cost is invisible. For loops doing tens of millions of iterations per second per goroutine, it can show up in benchmarks.

Batch the check

Instead of checking on every iteration, check every N iterations:

for i := 0; i < len(items); i++ {
    if i&0xFFF == 0 {
        if ctx.Err() != nil {
            return ctx.Err()
        }
    }
    crunch(items[i])
}

Trade-off: cancellation latency increases by up to N iterations' worth of work. Pick N to match your latency budget.

Avoid context construction on the hot path

context.WithValue and context.WithCancel allocate. If a hot path constructs a derived context per call, you pay for those allocations:

// Slow if called millions of times per second
func handle(ctx context.Context, item Item) {
    sub, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
    defer cancel()
    process(sub, item)
}

If the per-call timeout is critical, accept the cost. Otherwise, propagate the parent context.

Don't select on ctx.Done() when not needed

// Unnecessary
select {
case <-ctx.Done():
    return ctx.Err()
case result := <-doWork():
    return result
}

If doWork already respects ctx, you don't need both. The select is a layer of defence; if your internal call propagates context correctly, drop the select and rely on the call to return.


Fast Leak Triage

When a leak is detected (NumGoroutine spike or pprof showing N goroutines parked in the same place), triage:

Step 1 — Take a goroutine profile

go tool pprof http://service/debug/pprof/goroutine
(pprof) top 20
(pprof) list functionName

Or, safer for production one-offs, grab the plain-text snapshot:

curl -s http://service/debug/pprof/goroutine?debug=2 > stacks.txt

The text format is more readable than the pprof binary for one-off investigations.

Step 2 — Group by stack

The debug=1 form of the profile already aggregates goroutines with identical stacks and prefixes each group with a count:

curl -s 'http://service/debug/pprof/goroutine?debug=1' > grouped.txt
grep -E '^[0-9]+ @' grouped.txt | sort -rn | head

Most leaks have many goroutines parked on the same line. The top of the histogram is your culprit.

Step 3 — Match to a pattern

The stack will show a select, a channel receive, a syscall. Match against the five patterns:

  • Parked on runtime.gopark -> select or channel op.
  • Parked on chan receive or chan send -> channels not closed / sender not buffered.
  • Parked on time.Sleep -> sleep without cancel.
  • Parked on sync.runtime_SemacquireMutex -> mutex deadlock.
  • Parked in internal/poll.runtime_pollWait -> blocked syscall (use SetDeadline).

Step 4 — Find the spawn site

The leaked goroutine's stack shows where it was created (after "created by" in the dump). That line tells you which function to fix.

Step 5 — Add a goleak test

Once you've identified the leak, add a goleak test that reproduces it. The fix lands with the regression test.


Minimising the Cost of Prevention

The structural overhead

A typical owning-struct pattern adds:

  • One context.WithCancel (allocation: ~200 bytes).
  • One chan struct{} for done (allocation: ~96 bytes).
  • One sync.WaitGroup if multiple goroutines (no allocation; fits in struct).

For a service with 100 long-lived components, the structural overhead is ~30 KB. Negligible.

Per-call overhead

Per-call context propagation:

  • Passing ctx as an argument: nothing (pointer).
  • ctx.Done() channel receive in a select: ~10ns.
  • ctx.Err() non-blocking check: ~5ns.

For a request rate of 10K req/s with five context-checked layers, the cost is ~250 microseconds/s. Less than 0.03%.

When to skip the wrapper

The concurrency.Go wrapper from professional.md adds tracking. Per-spawn cost: maybe 1 microsecond. For 1000 spawns/s, that's 1 ms/s, negligible. For 1 million spawns/s, it's 1 second/s of overhead — and at that rate, you should not be spawning per call anyway.

Use the wrapper for long-lived goroutines, not per-message workers.


Avoiding Spawn Churn

Spawn per item is usually wrong

for _, item := range items {
    go process(item) // BAD if items is large
}

Spawning a goroutine costs ~2 KB stack and a few hundred bytes of bookkeeping. For a million items, that's 2 GB and significant scheduler pressure.

Use a fixed pool

g, ctx := errgroup.WithContext(ctx)
g.SetLimit(16)
for _, item := range items {
    item := item
    g.Go(func() error { return process(ctx, item) })
}
return g.Wait()

SetLimit(16) bounds concurrency: at any time, at most 16 goroutines are running, and g.Go blocks until a slot frees. Each task still gets a fresh goroutine, but peak memory is ~32 KB instead of ~2 GB, and scheduler pressure stays bounded.

Reuse goroutines across requests

A common request-per-goroutine pattern wastes goroutines:

for {
    req := <-incoming
    go handle(req) // a new goroutine for every request
}

Better:

for i := 0; i < workers; i++ {
    go func() {
        for {
            select {
            case <-ctx.Done():
                return
            case req := <-incoming:
                handle(req)
            }
        }
    }()
}

The workers are reused. No churn.

Goroutine pooling

For extreme cases (very short-lived goroutines, high spawn rate), use a goroutine pool library:

  • github.com/panjf2000/ants — popular pool library.
  • github.com/sourcegraph/conc — structured-concurrency helpers with pooling.

Caveat: these libraries reintroduce a leak risk if not closed. They are an optimisation, not a substitute for the patterns.


Hot-Path Cancellation Checks

ctx.Err() vs <-ctx.Done()

<-ctx.Done() in a select is the standard way. ctx.Err() is faster for a one-off check:

if ctx.Err() != nil {
    return ctx.Err()
}

Err() is a cheap non-blocking call with no channel operation. <-Done() involves a channel receive attempt. In a tight loop without other communication, Err() is preferable.

Caching the Done channel

ctx.Done() returns the same channel each time, but the method call has overhead. For tight loops:

done := ctx.Done()
for i := 0; i < n; i++ {
    select {
    case <-done:
        return
    default:
    }
    work(i)
}

The cached done avoids repeated method calls. Marginal.

Branch prediction

A <-ctx.Done() case in a loop is rarely taken. The CPU's branch predictor learns this and the cost is essentially zero in the steady state. The actual cost is paid only on cancellation, when the prediction misses.


Optimising the Owning Struct Pattern

Skip sync.Once when single-call

type Lite struct {
    cancel context.CancelFunc
    done   chan struct{}
}

func (l *Lite) Close() {
    l.cancel()
    <-l.done
}

If Close is always called once, you don't need sync.Once. Cancelling twice is harmless; receiving on a closed channel is harmless. The pattern is naturally idempotent.

Use sync.Once only when concurrent Close calls are realistic (multi-goroutine cleanup paths).

Channel vs WaitGroup

For a single goroutine, chan struct{} closed in defer is slightly cheaper than sync.WaitGroup. For multiple goroutines, WaitGroup is cleaner. Don't agonise over this — both are fine.

Lazy spawn

Some types only need their goroutine if a method is called:

type Maybe struct {
    once sync.Once
    cancel context.CancelFunc
    done   chan struct{}
}

func (m *Maybe) Submit(j Job) {
    m.once.Do(func() {
        ctx, cancel := context.WithCancel(context.Background())
        m.cancel = cancel
        m.done = make(chan struct{})
        go m.run(ctx)
    })
    // ... actual submit ...
}

func (m *Maybe) Close() error {
    if m.cancel == nil {
        return nil
    }
    m.cancel()
    <-m.done
    return nil
}

Trade-off: Close must handle the "never started" case, and a concurrent Submit and Close race on m.cancel — add a mutex if both can be called from different goroutines. Useful for components that are constructed but rarely used.


When Not to Optimise

  • Don't add complexity to save 100ns in a path that runs 100 times a day.
  • Don't merge cancellation checks if it makes the code harder to read; the readability cost outlasts the performance gain.
  • Don't pool goroutines if your spawn rate is under 10K/s; the runtime handles that fine.
  • Don't shorten cancellation-check intervals below your latency budget; it just adds noise.

The right order: correctness (no leaks), readability (the patterns are clear), then performance (only if measured to be a bottleneck).

Profile-driven optimisation

For every prevention-overhead optimisation, run before/after benchmarks:

go test -bench=. -benchmem -count=10 ./...

Compare with benchstat. If the difference is under 5%, leave the code as-is and prefer the readable version.


Summary

Optimisation of leak-prevention code falls into two camps:

  1. Shutdown latency: replace sleeps with select+after; make every blocking call context-aware; bound retries; parallel close where independent; two-phase drain.
  2. Per-call overhead: cache ctx.Done() for tight loops; check ctx.Err() every N iterations for CPU-bound work; avoid per-call context construction; use goroutine pools to avoid spawn churn.

For triage, a goroutine profile grouped by stack frame identifies the leak's parking point. Match to one of the five patterns; the fix is the canonical one from junior.md.

The discipline is: correctness first, optimisation second, and only after profiling. The prevention patterns are cheap in absolute terms — measure before assuming otherwise.

See also: 04-pprof-tools for in-depth pprof workflows; 02-detecting-leaks for routine leak detection without an active incident.