Goroutine Best Practices — Middle Level¶
Table of Contents¶
- Introduction
- From Rules to Patterns
- errgroup in Anger
- Worker Pools, Bounded Channels, Semaphores
- Structured Concurrency in Go
- Graceful Shutdown
- Context Discipline
- Recover Helpers and Logging Strategy
- Concurrent Data Structures: Mutex vs sync.Map vs Channel-Owned State
- Testing Concurrent Code
- Leak Detection in CI
- Self-Assessment
- Summary
Introduction¶
The junior level introduced the twelve canonical rules. At middle level, you apply them in real services and run into the second-order questions: how big should the worker pool be? Where do recover helpers belong? When does errgroup not fit? What's the production-shaped error path? How do tests prove the rules are followed, not just that the code runs?
This file is the working-engineer's version. Each section assumes you already accept the rules; the question is how to wire them into a service that survives a production environment.
From Rules to Patterns¶
Rules are negative ("don't do X"); patterns are positive ("do Y"). A mature codebase doesn't read like a list of don'ts — it reads like a small set of repeating patterns. Three patterns cover roughly 90% of goroutine usage in a service:
| Pattern | Use case |
|---|---|
| Fan-out / fan-in with errgroup | Parallel calls to N downstreams, collect results or first error. |
| Bounded worker pool | Long-lived consumers reading from a queue or channel. |
| Periodic loop with context | Heartbeats, metric flushes, garbage collection of caches. |
The remaining 10% are special: pipelines, broadcast topologies, supervisors. Build them on top of the three.
The discipline: when you reach for go func(), ask "which of the three patterns is this?" If the answer is "none," double-check that you really want a new shape.
errgroup in Anger¶
golang.org/x/sync/errgroup is the workhorse. Here are the production-level details.
WithContext: not optional¶
Always use WithContext. The bare errgroup.Group{} is for trivial cases; in a service, you want first-error cancellation. Pass the returned ctx to children:
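g, ctx := errgroup.WithContext(parent)
g.Go(func() error { return callDownstream(ctx) }) // callDownstream stands in for the real work; note it receives the group's ctx
return g.Wait()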
Otherwise the cancellation cascade doesn't reach the child.
SetLimit(n): bound concurrency¶
SetLimit caps how many g.Go callbacks run concurrently (it ships with golang.org/x/sync, which is versioned independently of Go releases). Cleaner than a separate semaphore:
g, ctx := errgroup.WithContext(parent)
g.SetLimit(16)
for _, url := range urls {
url := url
g.Go(func() error { return fetch(ctx, url) })
}
return g.Wait()
Pick the limit by measuring, not by guessing. Typical starting points:
- CPU-bound work: runtime.GOMAXPROCS(0) (one worker per logical CPU).
- Network-bound work to one host: 8–64, depending on the host's tolerance.
- Network-bound work to many hosts: 256–1024 if you have to keep them all in flight.
When errgroup is not enough¶
- You need to collect all errors, not just the first. Use a mutex-protected slice or errors.Join (see the sketch after this list).
- You need partial results when some workers fail. Catch errors inside each Go and don't return them; aggregate manually.
- Workers depend on each other (output of A feeds B). Use a pipeline, not a fan-out.
- You need different timeouts per worker. Use context.WithTimeout per Go body.
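A minimal sketch of the first case, collecting every error instead of stopping at the first (items, Item, and process are stand-ins for your own types and work):
var (
	mu   sync.Mutex
	errs []error
)
var wg sync.WaitGroup
for _, item := range items {
	wg.Add(1)
	go func(item Item) {
		defer wg.Done()
		if err := process(ctx, item); err != nil {
			mu.Lock()
			errs = append(errs, err)
			mu.Unlock()
		}
	}(item)
}
wg.Wait()
return errors.Join(errs...) // nil when no worker failed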
Common errgroup bugs¶
// Bug A: forgot to shadow loop variable (pre-1.22)
for _, url := range urls {
g.Go(func() error { return fetch(ctx, url) }) // wrong url
}
// Bug B: bare parent context
g.Go(func() error { return fetch(parent, url) }) // cancellation lost
// Bug C: returning nil "to signal completion"
g.Go(func() error {
if done() { return nil } // doesn't cancel peers
return work()
})
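For contrast, a sketch of the corrected forms (errDone is a hypothetical sentinel; pick whatever fits your error model):
// Fix A: shadow the variable (unnecessary from Go 1.22, where each iteration gets its own url).
for _, url := range urls {
	url := url
	g.Go(func() error { return fetch(ctx, url) })
}
// Fix B: use the ctx returned by errgroup.WithContext, not the parent.
g, ctx := errgroup.WithContext(parent)
g.Go(func() error { return fetch(ctx, url) })
// Fix C: only a non-nil return cancels the peers.
var errDone = errors.New("done")
g.Go(func() error {
	if done() {
		return errDone
	}
	return work()
})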
Worker Pools, Bounded Channels, Semaphores¶
Three implementations of the same idea: "at most N goroutines in flight." Pick by ergonomics, not by raw performance — they are close enough.
Implementation A: fixed pool over a channel¶
func pool(ctx context.Context, in <-chan Job, n int, handle func(context.Context, Job) error) error {
g, ctx := errgroup.WithContext(ctx)
for i := 0; i < n; i++ {
g.Go(func() error {
for {
select {
case <-ctx.Done():
return ctx.Err()
case j, ok := <-in:
if !ok {
return nil
}
if err := handle(ctx, j); err != nil {
return err
}
}
}
})
}
return g.Wait()
}
Use when:
- Workers are long-lived (lifetime ≈ service lifetime).
- Jobs arrive over time.
- You care about throughput more than latency.
Implementation B: per-item goroutine with errgroup.SetLimit¶
g, ctx := errgroup.WithContext(ctx)
g.SetLimit(n)
for _, item := range items {
item := item
g.Go(func() error { return process(ctx, item) })
}
return g.Wait()
Use when:
- Items are known up front.
- You want a single join at the end.
- Each item is non-trivial (otherwise the spawn cost dominates).
Implementation C: semaphore channel¶
sem := make(chan struct{}, n)
var wg sync.WaitGroup
for _, item := range items {
item := item
sem <- struct{}{}
wg.Add(1)
go func() {
defer wg.Done()
defer func() { <-sem }()
process(item)
}()
}
wg.Wait()
Use when:
- You can't import errgroup for some reason.
- You're mixing pool ownership across multiple call sites.
The first two patterns are preferred; the semaphore works but carries more bookkeeping.
Sizing the pool¶
Three considerations:
- Resource caps. N file descriptors, N database connections, etc. Pool size ≤ resource cap.
- Downstream tolerance. A third-party API that 429s above 50 requests/sec needs N small enough that you stay below the limit.
- Memory. N goroutines × per-goroutine memory (stacks plus per-job allocations). Bound to fit your memory budget.
Start with a number based on the smallest of the three. Measure. Adjust.
Structured Concurrency in Go¶
Structured concurrency is the discipline that every goroutine's lifetime is bounded by a lexical scope — typically a function call. The function does not return until every goroutine it spawned has returned. There are no detached goroutines.
Go does not enforce structured concurrency at the language level (unlike Python's asyncio.TaskGroup or Trio's nursery). But you can adopt it as a convention:
func process(ctx context.Context, items []Item) error {
g, ctx := errgroup.WithContext(ctx)
for _, item := range items {
item := item
g.Go(func() error { return work(ctx, item) })
}
return g.Wait()
}
Every goroutine spawned by process returns before process does (by definition of g.Wait). Locally, structured concurrency is achieved.
Why it matters¶
- No leaks by construction. If process returns, every spawned goroutine has returned.
- Errors propagate naturally. g.Wait returns the first error.
- Cancellation cascades. Cancelling the context cancels every descendant.
- Stack traces make sense. Every goroutine's stack is reachable from a known function.
Where it breaks¶
- Fire-and-forget metrics. A goroutine that flushes metrics in the background doesn't fit lexical scoping.
- Long-lived workers. A worker pool that runs for the lifetime of the service is bounded by the service, not by a function.
For these, lift the scope to the service (Service.Run(ctx) runs until ctx cancels; every goroutine respects ctx). The structure is at the service level, not the function level.
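A sketch of that service-level shape (Service, worker, and flushMetrics are illustrative names, not a fixed API):
func (s *Service) Run(ctx context.Context) error {
	g, ctx := errgroup.WithContext(ctx)
	// Long-lived worker: its lifetime is bounded by the service's ctx.
	g.Go(func() error { return s.worker(ctx) })
	// Background metric flush: still structured, just scoped to the service.
	g.Go(func() error {
		t := time.NewTicker(10 * time.Second)
		defer t.Stop()
		for {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-t.C:
				s.flushMetrics(ctx)
			}
		}
	})
	return g.Wait() // Run returns only after every goroutine it spawned has returned.
}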
Graceful Shutdown¶
A correctly-behaving Go service does the following on SIGTERM:
- Stop accepting new work (close the listener, stop pulling from the queue).
- Cancel the root context to signal in-flight work to wind down.
- Wait (with a bounded deadline) for in-flight goroutines to return.
- Exit with code 0.
If shutdown takes longer than the deadline, log which goroutines are still alive (using pprof goroutine) and exit anyway.
Skeleton¶
func main() {
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// Translate SIGTERM/SIGINT into cancel.
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)
go func() {
<-sigCh
log.Println("shutting down")
cancel()
}()
if err := run(ctx); err != nil {
log.Fatal(err)
}
}
func run(ctx context.Context) error {
srv := &http.Server{Addr: ":8080", Handler: newRouter()}
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error {
log.Println("listening on", srv.Addr)
if err := srv.ListenAndServe(); err != http.ErrServerClosed {
return err
}
return nil
})
g.Go(func() error {
<-ctx.Done()
shutCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
return srv.Shutdown(shutCtx)
})
g.Go(func() error { return backgroundWorker(ctx) })
return g.Wait()
}
This wires together every rule:
- Single root context (Rule 4).
- errgroup joins everything (Rule 6).
- Each goroutine has a clear exit story (Rule 1).
- A bounded shutdown deadline (no infinite hang).
The "drain" stage¶
If the service runs background jobs from an in-memory queue, shutdown needs a drain stage:
g.Go(func() error {
<-ctx.Done()
// Stop accepting new jobs.
close(jobCh)
// The worker pool drains jobCh; pool's own goroutines exit naturally.
return nil
})
The order is: cancel root → workers stop pulling new external work → drain in-memory buffers → join everyone → exit.
Context Discipline¶
context.Context is the most-misused type in Go. The middle-level discipline:
Rules of context¶
- Context is the first parameter, named ctx. Standard library convention.
- Don't store contexts in structs. Pass them as call arguments.
- context.Background() only at the top of main or in tests. Nowhere else.
- context.TODO() is for when you don't know what context to use yet. It is a marker for "fix this later", and linters can be configured to flag it.
- WithCancel and WithTimeout return a cancel; always defer cancel() (otherwise the child context leaks until the parent is cancelled).
- Don't use context.Value for required parameters. Use it for cross-cutting trace IDs, not for the user ID your function needs.
Wrapping context¶
When a function needs both a deadline and the parent context's values:
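ctx, cancel := context.WithTimeout(parent, 2*time.Second)
defer cancel()
return downstream(ctx) // downstream stands in for the wrapped call; ctx carries parent's values plus the 2s deadline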
The result inherits values from parent and adds its own deadline. The cancel is for this timer; calling it doesn't cancel parent.
Detecting context misuse¶
- The contextcheck linter (available through golangci-lint) flags functions that take a context but don't use it.
- staticcheck's SA1029 flags context keys with built-in types.
- revive's context-as-argument rule enforces the "first parameter" convention.
Recover Helpers and Logging Strategy¶
A safeGo helper centralises the recover-and-log policy. The naive version:
func safeGo(name string, fn func()) {
go func() {
defer func() {
if r := recover(); r != nil {
log.Printf("goroutine %q panic: %v\n%s", name, r, debug.Stack())
}
}()
fn()
}()
}
The production version adds:
- Structured logging (zap, slog) with the panic value as a field.
- A metric (panics_total{goroutine="..."}) so panics are visible on the dashboard.
- Optional re-raise: if the panic represents an unrecoverable state (corrupted invariants), call os.Exit(2) rather than continue.
func SafeGo(ctx context.Context, name string, fn func(context.Context)) {
go func() {
defer func() {
if r := recover(); r != nil {
stack := debug.Stack()
slog.ErrorContext(ctx, "goroutine panic",
"name", name,
"panic", fmt.Sprintf("%v", r),
"stack", string(stack))
metrics.GoroutinePanics.WithLabelValues(name).Inc()
}
}()
fn(ctx)
}()
}
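Call sites stay one line each, and the name becomes the metric label ("cache-janitor" and janitor are illustrative):
SafeGo(ctx, "cache-janitor", func(ctx context.Context) { janitor(ctx) })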
Treat panics in goroutines as bugs to fix, not as part of the application's error model. Every panic that hits the recover should result in either:
- A fix in the code that caused the panic, or
- A documented "this input is invalid; we now return error X instead of panicking."
Concurrent Data Structures: Mutex vs sync.Map vs Channel-Owned State¶
When multiple goroutines share state, three patterns are common. Each has a sweet spot.
sync.Mutex / sync.RWMutex wrapping a plain map¶
type Cache struct {
mu sync.RWMutex
m map[string][]byte
}
func (c *Cache) Get(k string) ([]byte, bool) {
c.mu.RLock()
defer c.mu.RUnlock()
v, ok := c.m[k]
return v, ok
}
Pros: simple, fast under low contention, predictable. Cons: scales poorly under heavy write contention; one slow operation under the lock blocks everyone.
sync.Map¶
Pros: optimised for "many distinct keys, mostly reads" or "keys that are written once and read many times." Cons: slower than a mutex+map for moderate write rates, no value type safety (returns any), no iteration ordering guarantees.
Use sync.Map only when the documented sweet spot fits. Otherwise prefer mutex+map.
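For comparison, a sketch of the same Get against a sync.Map; note the type assertion the any return value forces on the caller:
var m sync.Map // logically map[string][]byte
func get(k string) ([]byte, bool) {
	v, ok := m.Load(k)
	if !ok {
		return nil, false
	}
	return v.([]byte), true // caller must assert the type; sync.Map stores any
}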
Channel-owned state¶
type Counter struct {
incCh chan struct{}
getCh chan chan int
}
func (c *Counter) run() {
var n int
for {
select {
case <-c.incCh:
n++
case ch := <-c.getCh:
ch <- n
}
}
}
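The exported methods are thin wrappers over the channels (a sketch; a real run loop would also select on a ctx or quit channel so the goroutine has an exit story):
func (c *Counter) Inc() { c.incCh <- struct{}{} }
func (c *Counter) Get() int {
	reply := make(chan int)
	c.getCh <- reply
	return <-reply
}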
Pros: serialises access without explicit locking; composes well with select. Cons: more code; introduces an additional goroutine (= an exit story to manage); slower for simple counters.
Use channel-owned state when the operation is naturally a message ("apply this update to the state") rather than a mutation ("increment that field").
Decision matrix¶
| Workload | Choice |
|---|---|
| Counter with high write rate | atomic.Int64 |
| Cache with mixed reads/writes | mutex + map |
| Read-mostly map with distinct keys per write | sync.Map |
| State that responds to commands and queries | channel-owned actor |
| Set of values for membership checks | mutex + map[T]struct{} |
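The first row needs no locking at all (Go 1.19+ atomic types; a sketch):
var requests atomic.Int64 // zero value is ready to use
requests.Add(1)           // increment from any goroutine, no lock
total := requests.Load()  // read from any goroutine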
Testing Concurrent Code¶
Testing concurrency is its own discipline. The middle-level toolkit:
Race detector¶
Enable in CI:
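go test -race ./...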
Run unit and integration tests under -race. Most flake reports trace back to a real race.
goleak per-package¶
package mypkg_test
import (
"testing"
"go.uber.org/goleak"
)
func TestMain(m *testing.M) {
goleak.VerifyTestMain(m)
}
Test-level alternative:
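func TestSomething(t *testing.T) {
	defer goleak.VerifyNone(t) // checks for leaked goroutines when this one test finishes
	// ...
}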
Deterministic synchronisation in tests¶
Instead of time.Sleep, expose a synchronisation point:
// In code:
type Worker struct {
onReady chan struct{}
}
func (w *Worker) Start() {
go func() {
w.setup()
close(w.onReady)
w.loop()
}()
}
// In test:
w := newWorker()
w.Start()
select {
case <-w.onReady:
case <-time.After(time.Second):
t.Fatal("timeout")
}
The test waits for the event, with a deadline that catches hangs.
testing/synctest (Go 1.24+)¶
Go 1.24 introduced testing/synctest (initially as an experiment behind GOEXPERIMENT=synctest), which fakes time and lets concurrent code be tested deterministically. Worth adopting where available:
synctest.Run(func() {
// time.Sleep, time.After, time.Tick all use fake time here.
// The package waits until all goroutines are blocked, then advances.
})
This eliminates a class of "wait long enough" flakiness.
Stress-testing¶
For code where ordering matters, run the test many times with -count:
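go test -race -count=1000 -run TestSuspect ./pkg   # TestSuspect and ./pkg are placeholders for the flaky test and its package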
A race that reproduces 1 in 100 runs becomes 10 reproductions in 1000 runs.
Leak Detection in CI¶
A leak detector in CI is what turns "we have a discipline" into "we enforce a discipline."
Strategy 1: goleak everywhere¶
Add goleak.VerifyTestMain(m) to every package's TestMain. New tests that leak fail. Old leaks that escaped notice now surface.
Cost: some legitimate background goroutines from imports (HTTP/2, DNS resolver) need to be allowed:
goleak.VerifyTestMain(m,
goleak.IgnoreTopFunction("internal/poll.runtime_pollWait"),
goleak.IgnoreCurrent(),
)
IgnoreCurrent ignores goroutines alive at the start (e.g., the test runner's own).
Strategy 2: production goroutine profiling¶
net/http/pprof exposes /debug/pprof/goroutine. In production, scrape it periodically:
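curl -s 'http://localhost:8080/debug/pprof/goroutine?debug=1' | head -n 1   # host/port illustrative; the first line reports "goroutine profile: total N"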
A leak shows up as a stack-trace count that grows monotonically over hours.
Set up an alert: if go_goroutines{job="myservice"} (Prometheus) crosses a threshold for more than 10 minutes, page.
Strategy 3: CI smoke test¶
Run the service under integration test, do typical workflows, then check the goroutine count is back to baseline:
base := runtime.NumGoroutine()
runWorkload()
if runtime.NumGoroutine() > base+5 {
t.Fatalf("leak: %d -> %d", base, runtime.NumGoroutine())
}
The "+5" tolerates noise from runtime-internal goroutines. Tune to your environment.
Self-Assessment¶
- I have used errgroup.SetLimit in real code.
- I can explain when sync.Map is preferred over a mutex+map.
- I have set up graceful shutdown for a service.
- My services use a single root context that cancels on SIGTERM.
- I have a safeGo-style helper or know which package I use.
- I run go test -race in CI on a dedicated job.
- I use goleak.VerifyTestMain in at least one package.
- I know that context.Background() should only appear in main and tests.
- I have replaced a time.Sleep synchronisation with an event channel.
- I have written or used testing/synctest-style tests.
Summary¶
At middle level, the twelve rules become three patterns (fan-out with errgroup, bounded worker pool, periodic loop with context) and a small toolkit (safeGo, errgroup.SetLimit, goleak, race detector in CI). Graceful shutdown is the integration test that proves the patterns work together. Tests use event-based synchronisation, not time.Sleep. Concurrent state uses the right primitive for the workload, not "always channels" or "always mutex." The next level — senior — is about pushing these conventions across a team via review checklists and style guides.