Goroutine Best Practices — Senior Level¶
Table of Contents¶
- Introduction
- Code Review Checklist for Goroutines
- Style Guides Compared: Effective Go, Uber, Google
- Establishing Team Conventions
- Linters and Automated Enforcement
- Concurrency Architecture Reviews
- Postmortem Patterns That Drove These Rules
- Onboarding New Engineers
- Library Authoring vs Application Authoring
- Migration: Cleaning Up a Legacy Codebase
- When to Break the Rules
- Self-Assessment
- Summary
Introduction¶
At senior level, the question shifts from "which rule applies here" to "how do I scale this discipline across a team, a codebase, and a release pipeline?" The same twelve rules from junior level become a code review checklist, a linter ruleset, a postmortem template, an onboarding curriculum, and an architectural review framework. The depth comes from knowing why each rule exists historically — which production failure drove it — and from running the review machinery that catches the violation before it ships.
After this you will:
- Lead a code review where concurrency bugs are caught by checklist, not by luck.
- Map your codebase's conventions to (and from) the major published style guides.
- Choose linters that catch the rules you actually care about.
- Lead the cleanup of an existing codebase that grew without these disciplines.
Code Review Checklist for Goroutines¶
This is the checklist a reviewer runs every time they see go, errgroup.Group, sync.WaitGroup, chan, sync.Mutex, or context.Context in a diff. Each item is a yes/no question. If any answer is "no" or "unclear," the diff is not ready.
1. Exit story¶
- Can I describe in one sentence how this goroutine returns?
- Is that exit guaranteed by code, or does it depend on an assumption ("this channel is always closed")?
- Is there a path where the exit never happens (deadlock, infinite loop without
select { case <-ctx.Done(): })?
2. WaitGroup discipline¶
- Is
wg.Addcalled in the parent, beforego? - Is
wg.Done()deferred as the first line of the goroutine body? - Is
wg.Waitcalled on the samewgfrom a single point? - If
wg.Addis called in a loop, is there acontinuepath that skips it?
3. Loop variables¶
- Is every loop variable used inside a goroutine closure either shadowed (
x := x) or passed as a function parameter? - What Go version does this module target (check
go.mod)? If < 1.22, the rule is mandatory; if >= 1.22, it's still preferred for clarity.
4. Context¶
- Does every long-running goroutine accept
ctx context.Context? - Does it check
ctx.Done()in anyselectthat waits on a channel? - Does it check
ctx.Err()before expensive work? - Is
context.Background()used outsidemainor tests? (If yes, justify.) - Is
cancelfromWithCancel/WithTimeout/WithDeadlinealwaysdefer-ed?
5. Panic recovery¶
- Is there a
defer recoverat the boundary of every goroutine that processes external input? - Does the recovery log AND increment a metric? (Silent recovery hides bugs.)
- Does the recovery let the goroutine exit, not loop on the bad state?
6. Coordination primitive¶
- Is
errgroupused instead of hand-rolledWaitGroup+ error channel? - If
errgroupwas rejected, is there a comment explaining why? - Is
g.SetLimit(n)used wherever concurrency must be bounded?
7. Channels vs mutex¶
- Is shared state protected by either a mutex/atomic OR owned by a single goroutine (channel-based actor)?
- Is there a channel where a mutex would be simpler? (Reverse smell.)
- Is there a mutex held across a channel send/recv? (Lock-order risk.)
8. time.Sleep¶
- Is
time.Sleepused anywhere as "wait for goroutine to finish"? - If
time.Sleepappears, is it (a) a ticker, (b) a backoff, or (c) explicitly justified?
9. Race detector¶
- Does CI run
go test -race? - Has this code been run under
-race? - If new test code is concurrent, does it pass
-race -count=10locally?
10. Concurrency bound¶
- Is every "for input in stream, go handle(input)" pattern bounded?
- Is the bound a constant, a configuration value, or computed from
GOMAXPROCS/resource limits? - Is there a comment explaining why the chosen bound is right?
11. Documentation¶
- If a new exported type or function uses goroutines internally or holds shared state, is its concurrency policy in the doc comment?
- If a method has different concurrency rules from its type, are they called out?
12. Leak detection¶
- Does the test for this code use
goleak? - If not, is there another way the test would catch a leak?
Universal smells¶
- Is there a
go func() { ... }()literal longer than 20 lines? Extract a function. - Is there a goroutine that captures the receiver
cof a method? Document its lifetime relative toc. - Is there a "main" goroutine in a non-
mainpackage? Why?
Print the checklist. Hand it to junior reviewers. Use it on your own code.
Style Guides Compared: Effective Go, Uber, Google¶
The four publicly available style references for Go concurrency are:
- Effective Go (official, foundational).
- Uber Go style guide (detailed, opinionated, widely adopted).
- Google Go style guide (Google-internal, since published).
- Go Code Review Comments (the closest thing to an official review checklist).
They agree on most things and differ on a few. Senior engineers know where.
Universal agreements¶
| Topic | All four say |
|---|---|
context.Context first parameter | Yes; ctx context.Context. |
Don't store Context in structs | Yes. |
Don't use Context.Value for required parameters | Yes. |
| Run race detector in CI | Yes. |
| Document concurrency safety on exported types | Yes. |
go f() arguments evaluated in caller | Yes (this is spec, not style). |
Where they diverge¶
| Topic | Effective Go | Uber | |
|---|---|---|---|
| Prefer channels or mutex? | Channels for ownership transfer, mutex for state. | Same; explicitly says "synchronisation primitives are NOT inherently worse." | Same. |
| Channel direction in function signatures | Use directional types when relevant. | Same; says always make channel direction part of the type when possible. | Same. |
sync.WaitGroup patterns | Light coverage. | Detailed: wg.Add in parent, defer wg.Done, exact pattern enforced. | Same as Uber. |
| Panic vs error | Use errors; panic for "impossible" cases. | Strong: never panic for expected errors, recover at goroutine boundaries. | Same. |
errgroup | Not mentioned (predates errgroup). | Recommended for error-aggregating fan-out. | Recommended. |
| Naming concurrency primitives | Light. | Done channels are chan struct{}; result channels named for content. | Same. |
| Method receivers and goroutines | Light. | "Make the zero value useful" applies to mutex-containing types. | Same. |
sync.Once.Do for lazy init | Light. | Yes; prefer sync.Once over hand-rolled. | Yes. |
The Uber rules that are most worth adopting¶
From uber-go/guide — these are the ones most teams formalise:
- Don't copy a
sync.Mutexorsync.WaitGroup. Use pointer fields or pass by pointer. - Defer
Unlockimmediately afterLock. Reads cleaner than carefully managing exit paths. - Channels' size is one or none. Either an unbuffered channel for synchronisation, or a buffer of 1 for "send-and-go." Larger buffers should have a comment justifying the size.
- Goroutine lifetime must be obvious. If it's not obvious, it's a leak.
- Start goroutines with a clear plan to stop them. Echoing Cheney.
The Google guide adds:
context.Background()only atmain's top.context.TODO()is allowed temporarily but should be flagged.- Don't use a bare
chan struct{}as a "lock"; usesync.Mutex.
Picking what to adopt¶
A team doesn't need to follow one guide exclusively. A practical approach: take Effective Go as the floor, add the Uber rules around WaitGroup and goroutine lifetime, add Google's rules around context, and document the team-specific overlay.
Establishing Team Conventions¶
Style guides cover the language. Team conventions cover the gap between "what the language allows" and "what we expect in this codebase." Examples:
- Naming.
Goroutinehelpers are in packagesafego. Worker pools inpool. Cancellable workers exposeRun(ctx). - Error policy. Worker errors that don't represent failure (e.g., expected
context.Canceled) are filtered before propagation. - Logging. Recovered panics log at
ERRORwith the goroutine name and a stack. - Metrics. Every long-running goroutine increments
goroutine_alive{name="..."}on start and decrements on exit. The metric tells you on a dashboard exactly which goroutines are running. - Tests. Every package uses
goleak.VerifyTestMain. Tests that legitimately leave background goroutines (HTTP/2 transport) declare it. - Code review. The checklist above is in
CONTRIBUTING.md. New reviewers walk through it.
Document the conventions in one place. Reference them in PR templates. The discipline is enforceable only if it's written down.
Linters and Automated Enforcement¶
A linter that catches a rule violation before review is worth ten checklist items. Useful ones for goroutine discipline:
| Linter | What it catches |
|---|---|
staticcheck | SA1029 (context keys), SA1019 (deprecated Sleep patterns), SA4006 (unused values), and many concurrency-relevant patterns. |
errcheck | Unchecked errors — including those from goroutine bodies. |
contextcheck | Functions that accept Context but don't pass it to callees. |
containedctx | Context stored in struct fields. |
noctx | HTTP requests without context.Context. |
bodyclose | HTTP response bodies that aren't closed (often in goroutines). |
loopclosure (go vet) | Loop variable captured by goroutine (pre-1.22). |
revive | Style; can enforce "context first argument" and many other rules. |
gosec | Some race-related smells. |
wsl | Whitespace; helps make defer wg.Done() immediately follow go func() {. |
Build them into the CI pipeline as a separate stage. Failing the lint stage should fail the PR check.
For deeper analysis:
go vetwith stdlib analysers catches a baseline of obvious bugs (printf format, lock copy, etc).golangci-lintaggregates many of the above with one configuration.- Custom analysers can be written with
golang.org/x/tools/go/analysisfor team-specific rules (e.g., "everygostatement must be a call to one of our helper functions").
Concurrency Architecture Reviews¶
A senior engineer reviewing a new service's design asks the concurrency questions before code is written:
- What is the maximum number of goroutines this service will have under load? "Unbounded" is unacceptable.
- What is the lifetime of each kind of goroutine? Per-request? Service-lifetime? Per-job?
- What signals a goroutine to exit? Context cancel, closed channel, both?
- Where does shared state live? Per-request, per-worker, global?
- Where are the locks? What's the lock order?
- What happens on SIGTERM? Drain queue, cancel context, wait with deadline, force exit?
- What does the goroutine profile look like on a healthy node? A baseline you can compare against during incidents.
- What's the leak-detection plan?
goleakin tests,pprofscraping in production?
Document the answers in a design doc. Re-read them six months later when you're debugging.
Postmortem Patterns That Drove These Rules¶
Each rule has a history. Understanding the failure modes is what makes the rule stick.
"The 4 AM page" — unbounded goroutines¶
A team launched a Kafka consumer that spawned one goroutine per message. Under normal load, fine. A backlog (from a partition rebalance) delivered 200 000 messages in 30 seconds. The service spawned 200 000 goroutines, allocated ~400 MB of stacks, OOMed, restarted, and re-spawned 200 000 more. Cascading restart loop, hours to recover. Rule 10 (bound concurrency) was born.
"The 7-day leak"¶
A long-running gRPC client opened a stream goroutine per call but never closed it on error. Over a week, the goroutine count climbed from 1 000 to 50 000. Each goroutine held a small buffer; total memory crept from 200 MB to 4 GB. No alert until the node was at 95% memory. Rule 1 (clear exit story) and Rule 12 (leak detection) followed.
"The Friday deploy"¶
A handler used time.Sleep(100 * time.Millisecond) to "give the background goroutine time to finish writing." Worked on developer laptops. CI passed. On a busy production node, the goroutine took 500 ms on its first run. The handler returned a 200 to the user with the database write not yet committed. The user re-submitted; double-write. Rule 8 (no time.Sleep) and Rule 9 (race detector) followed.
"The panic that took down everything"¶
A worker pool processed JSON from a queue. One malformed message triggered a nil pointer dereference. The panic killed the process, not the worker. Every other worker died with it. The entire fleet (sharing the same poison message) crashed in lockstep. Rule 5 (recover at boundary) followed.
"The mystery race"¶
A session token was stored in a goroutine-shared map. Two requests with similar timing caused one user to briefly see another's session. Discovered by a user; not by tests. Rule 9 (race detector in CI) was now non-negotiable.
Tell these stories during onboarding. The rules are not arbitrary; they are scars.
Onboarding New Engineers¶
A new engineer joining a Go team needs:
- Read the team's concurrency conventions doc. Time-boxed: 30 minutes.
- Walk through the checklist on a real PR. With a senior engineer.
- Run the race detector and
goleaklocally. See them flag real bugs. - Pair-review a concurrent feature. Be the second pair of eyes.
- Find a leak in the codebase. Often this is a known one held back for training.
Roughly 1 week of dedicated time produces a engineer who can review goroutine code. Two weeks produces one who can write it confidently.
Library Authoring vs Application Authoring¶
The rules apply differently in libraries versus applications.
In application code¶
- You control
main. You spawn goroutines yourself. - You enforce conventions via your codebase's helpers.
- You decide when to
os.Exit.
In library code¶
- You do not control
main. You may or may not be allowed to spawn goroutines. - You must never spawn a goroutine the caller doesn't know about. Exception: documented background goroutines (e.g., a
Clientwith a connection pool). Document them. - You must never panic on bad inputs in a goroutine. Recover and return an error to the caller via a channel or callback.
- You must document every exported type's concurrency policy.
- You must not assume the caller passes
context.Background(). Always thread the context.
A library that spawns surprise goroutines is unusable in production. A library that returns the goroutines via channels and accepts Context is reusable.
Migration: Cleaning Up a Legacy Codebase¶
When you inherit a codebase that grew without these disciplines, the cleanup follows a sequence:
Phase 1: Visibility¶
- Add
goleak.VerifyTestMainto every test package. Most will fail. Don't fix; record. - Enable
-racein CI. Most tests will fail. Don't fix; record. - Add a goroutine count metric to production. Watch it.
Phase 2: Stop the bleeding¶
- Add a
safeGohelper. - Forbid bare
goin new code (via review + lint). - Forbid
time.Sleepin tests (via review + lint).
Phase 3: Fix systematically¶
- Pick the worst leak from
pprof. Fix it. Verify by re-running. - Pick the worst race from
-race. Fix it. Verify. - Repeat. Track progress on a dashboard:
go_goroutines_leakedandrace_failures_per_week.
Phase 4: Convention¶
- Write the team's conventions doc.
- Add the code review checklist to
CONTRIBUTING.md. - Onboard the team.
Phase 5: Bound¶
- For each unbounded
go f()pattern, replace with a worker pool. - Add
errgroup.SetLimitwherever fan-out exists.
A realistic cleanup takes months, not days. The order matters: visibility before fixes, fixes before conventions, conventions before scaling.
When to Break the Rules¶
Every rule has a "but" case. The senior judgement is knowing when.
| Rule | Acceptable exception |
|---|---|
| Clear exit story (1) | The goroutine genuinely runs for the lifetime of the process (e.g., metrics flusher). Document it. |
wg.Add in parent (2) | Inside a self-spawning closure that adds to itself recursively (rare; consider errgroup instead). |
| Pass loop variables (3) | Single-iteration cases where there's no ambiguity. Still safer to pass. |
| Thread context (4) | Trivial helpers that complete synchronously. The threshold is "could this ever block on I/O." |
| Recover at boundary (5) | Trusted code that handles all its own errors (e.g., pure computation on validated input). |
| Prefer errgroup (6) | When the parent wants partial results regardless of errors, hand-roll. |
| Channels for flow (7) | Performance: a contended sync.Mutex may outperform a channel and vice versa. Benchmark. |
No time.Sleep (8) | Tickers, backoffs. Never as "wait until done." |
| Race detector in CI (9) | Code that intentionally races and uses sync/atomic to define the semantics — but flag those packages explicitly. |
| Bound concurrency (10) | When input is known finite and small. Still safer to bound. |
| Document safety (11) | Internal/unexported types with one user. Still helpful to document. |
goleak/pprof (12) | A test that legitimately leaves background goroutines. Use goleak.Ignore*. |
Break a rule with a comment explaining why. The comment is the rule.
Self-Assessment¶
- I have run a code review using a goroutine checklist.
- I can recite the major differences between Effective Go, Uber, and Google style guides on concurrency.
- I have set up
golangci-lintwith concurrency-relevant linters. - I have written a
CONTRIBUTING.mdsection on goroutine conventions. - I have led the cleanup of at least one leaky service.
- I can tell at least one production-postmortem story for each major rule.
- I know which Uber rules my team adopted and which it didn't.
- I have onboarded an engineer onto goroutine conventions.
- I have written a library that respects "no surprise goroutines."
- I have made a defensible decision to break one of the rules in a real codebase.
Summary¶
At senior level the goroutine rules become a system: a code review checklist, a linter ruleset, a team convention document, a postmortem-driven curriculum. The depth comes from knowing why each rule exists, where the major published style guides agree and disagree, and how to migrate a legacy codebase toward discipline without breaking it. The rules are not arbitrary; each one is the scar of a real outage. The next step — professional — is about how the rules map to runtime invariants and what each violation costs in observable terms.