Preventing Goroutine Leaks — Senior Level¶
Table of Contents¶
- Introduction
- Library Design: Leaks Impossible by Construction
- Code Review Checklist
- Real-World Example: HTTP Server Graceful Shutdown
- Real-World Example: Kafka Consumer Group Lifecycle
- Anti-Patterns at Scale
- Audit Checklist for an Existing Codebase
- Lifecycle Across Service Boundaries
- When Leaks Are Acceptable (and How to Document Them)
- Patterns Beyond Start/Stop
- Self-Assessment
- Summary
Introduction¶
At junior and middle levels, the goal is to write a single goroutine that does not leak. At senior level, the goal is to design systems — libraries, services, platforms — where leaks are not possible for any caller, no matter how careless they are. The lever is the API. A well-designed API makes the leak-free path the easiest path; a poorly designed API requires every caller to remember the same five-line cleanup ritual, which they will eventually forget.
This file is about three shifts:
- From discipline to design. Push the leak prevention out of caller code and into the type's contract.
- From individual reviews to systemic audits. Bring an existing codebase under control with a checklist.
- From textbook examples to real systems. HTTP servers, message consumers, and pool managers have well-known shapes; understanding them is the difference between writing a service and copying one.
You should already be fluent with errgroup, the Start/Stop struct, context propagation, and goleak in CI.
Library Design: Leaks Impossible by Construction¶
The contract: every spawn comes with a stop¶
A library that starts goroutines must expose a way to stop them. If your type's documentation says "this starts a background goroutine," it must also document the method that stops it. There is no exception.
Bad API:
Caller has no handle. The goroutine is immortal until the process exits. If a test calls StartReporter, the goroutine survives the test and contaminates the next one.
Good API:
type Reporter struct {
    cancel context.CancelFunc
    done   chan struct{}
}

func NewReporter(ctx context.Context, addr string) *Reporter {
    ctx, cancel := context.WithCancel(ctx)
    r := &Reporter{cancel: cancel, done: make(chan struct{})}
    go func() {
        defer close(r.done)
        report(ctx, addr)
    }()
    return r
}

func (r *Reporter) Close() error {
    r.cancel()
    <-r.done
    return nil
}
Close is documented, io.Closer-compatible, idempotent, and waits. The caller has no excuse for leaking the goroutine.
Take a context in the constructor, not the methods¶
A type that owns long-lived goroutines should take a context.Context once, at construction:
This ties the goroutine's lifetime to a context the caller controls. If the caller wants the goroutine to stop, they can cancel the context — and Close is still available as a synchronous wait.
Don't take ctx in every method of a long-lived type for the internal goroutines' lifetime; that confuses per-call cancellation with type lifetime. Per-call methods take ctx for that call's cancellation, which is a different concern.
io.Closer, not Stop¶
When the library exposes shutdown, prefer io.Closer over an ad hoc Stop method. Close() is the standard library's convention, and tools (linters, defer helpers, panicking-test wrappers) recognise it:
Reserve Stop for cases where Close is meaningfully different (e.g., a watcher that has both Stop for the watch loop and Close for the underlying file handle).
Don't expose a goroutine you don't own¶
If your library accepts a callback to run "in the background," you have implicitly handed the caller a goroutine they cannot stop. Either:
- Take a context.Context and use it as the stop signal.
- Return an io.Closer and document that callers must call Close to stop it.
- Refuse to expose the goroutine at all; require the caller to run the work themselves.
The pattern to avoid:
// Bad: caller has no handle to the goroutine
func OnEveryTick(interval time.Duration, fn func()) {
    go func() {
        for range time.Tick(interval) {
            fn()
        }
    }()
}
Compare to:
// Good: caller owns lifecycle
type Ticker struct { /* ... */ }
func NewTicker(ctx context.Context, interval time.Duration, fn func(context.Context)) *Ticker
func (t *Ticker) Close() error
Pool pattern: own the lifecycle of your workers¶
Connection pools, worker pools, and similar types own a set of goroutines. The lifecycle pattern:
type Pool struct {
    size   int
    cancel context.CancelFunc
    wg     sync.WaitGroup
    jobs   chan Job
}

func NewPool(ctx context.Context, size int) *Pool {
    ctx, cancel := context.WithCancel(ctx)
    p := &Pool{
        size:   size,
        cancel: cancel,
        jobs:   make(chan Job),
    }
    p.wg.Add(size)
    for i := 0; i < size; i++ {
        go func() {
            defer p.wg.Done()
            for {
                select {
                case <-ctx.Done():
                    return
                case j, ok := <-p.jobs:
                    if !ok {
                        return
                    }
                    j.Run(ctx)
                }
            }
        }()
    }
    return p
}

func (p *Pool) Submit(j Job) {
    p.jobs <- j
}

func (p *Pool) Close() {
    p.cancel()
    p.wg.Wait()
}
Submit doesn't accept a context, because the pool's goroutines have their own. If a submission needs a deadline, or must not block forever after Close, expose a context-aware variant that selects between the send and the caller's ctx.Done().
Two-phase shutdown for services¶
A service that handles in-flight work needs two phases:
- Drain: stop accepting new work; wait for in-flight work to complete.
- Terminate: cancel everything that hasn't finished by a deadline.
type Service struct {
    cancel   context.CancelFunc
    wg       sync.WaitGroup
    acceptCh chan struct{} // closed in Shutdown to stop accepting new work
    // ...
}

func (s *Service) Shutdown(ctx context.Context) error {
    close(s.acceptCh) // phase 1: stop accepting
    done := make(chan struct{})
    go func() {
        s.wg.Wait()
        close(done)
    }()
    select {
    case <-done:
        return nil
    case <-ctx.Done():
        s.cancel() // phase 2: force-cancel
        s.wg.Wait()
        return ctx.Err()
    }
}
The caller controls the drain window via the context's timeout. This is the same shape http.Server.Shutdown uses.
Code Review Checklist¶
Every PR that adds or modifies concurrent code goes through this list. Mechanical, not optional.
Spawn-level checks¶
- Every go f() is associated with a struct field or local variable that holds its cancel/stop mechanism.
- Every long-running goroutine accepts ctx context.Context as its first parameter.
- Every long-running goroutine has a select case for <-ctx.Done() and returns from it.
- No context.Background() appears outside main, top-level constructors, or tests.
- No context.TODO() was committed (it's a placeholder, not a value).
- No context.Context is stored in a struct field as a call parameter (it may be stored only when it owns the struct's internal goroutines).
Channel checks¶
- Every channel that is closed has exactly one closer.
- The closer is the unique sender (or the unique coordinator of multiple senders).
- Receivers check the ok value when a close is possible.
- Buffer sizes are documented (constant, capacity, or "unbounded queue with back-pressure").
- No time.After in a long-running loop.
Cleanup checks¶
- Every time.NewTicker is paired with defer t.Stop().
- Every time.NewTimer is paired with defer t.Stop().
- Every context.WithCancel/WithTimeout/WithDeadline has a corresponding cancel() (or defer cancel()).
- Every type that spawns a goroutine in its constructor exposes a Close (or equivalent) method.
- Every test that uses a struct with Close calls it in defer or test cleanup.
Test-level checks¶
- The package's TestMain calls goleak.VerifyTestMain(m) (or each test defers goleak.VerifyNone(t)).
- Tests cover the cancellation path: pass an already-cancelled context, assert the function returns.
- Tests cover the timeout path: pass a context with a short deadline, assert the function honours it.
- If the code uses tickers, tests use time.NewTicker with a small interval or inject a clock.
Mutex checks¶
- No mutex is held across a channel send or receive.
- No mutex is held across an HTTP call, database query, file I/O, or any other blocking syscall.
- defer mu.Unlock() immediately follows mu.Lock() unless there's a documented reason otherwise.
- RWMutex is used only when reads outnumber writes by 10x or more; otherwise plain Mutex.
Documentation checks¶
- Types that own goroutines document their lifecycle in the package or type comment.
- The Close method documents whether it is idempotent and whether it waits.
- If Close may return an error, the comment explains under what conditions.
Real-World Example: HTTP Server Graceful Shutdown¶
The canonical leak-free service skeleton.
package main

import (
    "context"
    "errors"
    "log"
    "net/http"
    "os/signal"
    "syscall"
    "time"

    "golang.org/x/sync/errgroup"
)

func main() {
    if err := run(); err != nil {
        log.Fatal(err)
    }
}

func run() error {
    rootCtx, stop := signal.NotifyContext(context.Background(),
        syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    g, ctx := errgroup.WithContext(rootCtx)

    srv := &http.Server{
        Addr:              ":8080",
        Handler:           buildHandler(),
        ReadHeaderTimeout: 5 * time.Second,
        ReadTimeout:       30 * time.Second,
        WriteTimeout:      30 * time.Second,
        IdleTimeout:       120 * time.Second,
    }

    // Goroutine 1: run the server.
    g.Go(func() error {
        log.Printf("listening on %s", srv.Addr)
        if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
            return err
        }
        return nil
    })

    // Goroutine 2: watch for shutdown and trigger graceful stop.
    g.Go(func() error {
        <-ctx.Done()
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        return srv.Shutdown(shutdownCtx)
    })

    return g.Wait()
}

func buildHandler() http.Handler {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // Honour the request context for backend calls.
        select {
        case <-time.After(time.Second):
            w.Write([]byte("ok\n"))
        case <-r.Context().Done():
            return
        }
    })
    return mux
}
What this code does right¶
- Signal handling: signal.NotifyContext returns a context cancelled by SIGINT or SIGTERM. The whole tree below it inherits cancellation for free.
- Structured concurrency: both the server goroutine and the shutdown goroutine live inside errgroup.WithContext. The function returns only after both finish.
- Drain window: srv.Shutdown accepts a context with a 30-second timeout, the standard for graceful shutdown. Within that window, the server finishes in-flight requests and refuses new ones.
- Independent shutdown context: the shutdown context is not derived from rootCtx (which is already cancelled) but from context.Background() with its own timeout. If we used the cancelled rootCtx, srv.Shutdown would return immediately and kill in-flight connections.
- Server-level timeouts: ReadTimeout, WriteTimeout, and IdleTimeout bound how long any single connection can stall. Without them, a slow client can hold a goroutine forever.
- Handler honours request context: the handler selects on r.Context().Done(). When the client disconnects or shutdown begins, the handler returns promptly.
Common bugs in this skeleton¶
- Using rootCtx for srv.Shutdown: the shutdown returns immediately because the context is already cancelled. In-flight requests are dropped.
- No IdleTimeout: keep-alive connections held by idle clients never time out, leaking a goroutine per connection.
- No ReadHeaderTimeout: a Slowloris attacker holds connections open with partial headers; each one is a leaked goroutine.
- Handler ignoring r.Context(): long handler operations don't notice the client disconnect; goroutines pile up.
Testing the skeleton¶
func TestRunShutsDownCleanly(t *testing.T) {
    defer goleak.VerifyNone(t)

    // runWithCtx is run refactored to accept its root context as a
    // parameter instead of installing its own signal handler.
    ctx, cancel := context.WithCancel(context.Background())
    errCh := make(chan error, 1)
    go func() { errCh <- runWithCtx(ctx) }()

    time.Sleep(50 * time.Millisecond) // let the server start
    cancel()
    if err := <-errCh; err != nil {
        t.Fatal(err)
    }
}
goleak asserts no goroutines remain. The test will fail loudly if any handler or background goroutine outlives the shutdown.
Real-World Example: Kafka Consumer Group Lifecycle¶
A Kafka consumer group is harder than an HTTP server because:
- Each partition is its own concurrent stream.
- Rebalances can swap partitions in and out mid-process.
- Commits must happen after processing, not before.
- The whole group must drain on shutdown without losing messages.
A leak-resistant skeleton (using a generic consumer interface; the principles apply to confluent-kafka-go, segmentio/kafka-go, IBM/sarama):
type Consumer struct {
    client KafkaClient
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func NewConsumer(ctx context.Context, cfg Config) (*Consumer, error) {
    ctx, cancel := context.WithCancel(ctx)
    c := &Consumer{cancel: cancel}
    client, err := newClient(cfg)
    if err != nil {
        cancel()
        return nil, err
    }
    c.client = client
    c.wg.Add(1)
    go func() {
        defer c.wg.Done()
        c.runGroup(ctx)
    }()
    return c, nil
}

func (c *Consumer) runGroup(ctx context.Context) {
    for {
        if err := c.consumeOnce(ctx); err != nil {
            if errors.Is(err, context.Canceled) {
                return
            }
            log.Printf("consume: %v; retrying in 1s", err)
            select {
            case <-ctx.Done():
                return
            case <-time.After(time.Second):
            }
        }
    }
}

func (c *Consumer) consumeOnce(ctx context.Context) error {
    session, err := c.client.JoinGroup(ctx)
    if err != nil {
        return err
    }
    defer session.Close()
    g, gctx := errgroup.WithContext(ctx)
    for _, partition := range session.Partitions() {
        partition := partition // capture the loop variable (pre-Go 1.22)
        g.Go(func() error {
            return c.consumePartition(gctx, session, partition)
        })
    }
    return g.Wait()
}

func (c *Consumer) consumePartition(ctx context.Context, session Session, p Partition) error {
    for {
        msg, err := p.Next(ctx)
        if err != nil {
            return err
        }
        if err := c.handle(ctx, msg); err != nil {
            return err
        }
        if err := session.Commit(ctx, msg); err != nil {
            return err
        }
    }
}

func (c *Consumer) Close() error {
    c.cancel()
    c.wg.Wait()
    return c.client.Close()
}
What this code does right¶
- One owning goroutine (runGroup) at the top level, which owns all its children through errgroup.
- Per-partition goroutines are spawned inside errgroup.WithContext, so a failure in one cancels the others, which is exactly what you want when the partition assignment is no longer valid.
- Cancellation reaches every layer: ctx is the first argument to JoinGroup, Next, handle, and Commit. None of them can block forever.
- Rebalance handling: when partitions change, Next returns an error (or the session ends); the loop returns, the parent consumeOnce returns, and runGroup calls consumeOnce again to re-join with new partitions.
- Backoff with cancellation: the retry sleep uses select so cancellation isn't delayed by the retry.
- Idempotent close: c.cancel() is safe to call multiple times; c.wg.Wait() blocks until all per-partition goroutines have stopped before closing the client.
Common bugs in Kafka consumers¶
- Forgetting to pass ctx to Next: the consumer blocks in the library, ignoring shutdown.
- Committing in a separate goroutine: race conditions on commit order; the leak appears as "I keep getting the same message after restart."
- Spawning a goroutine per message: if handle is slow, you accumulate goroutines per message. Use the partition goroutine to serialise; if you need parallelism, use a fixed pool per partition.
- Catching errors and continuing: consumeOnce should return on any error so the rebalance loop can re-join. Eating errors inside the partition goroutine masks broker problems.
Drain semantics¶
When Close is called mid-message:
- The partition goroutine sees ctx.Done() on its next Next call.
- If handle(ctx, msg) is in progress, it should also see ctx.Done() and return promptly (cooperative cancellation). Commit should not be called for half-processed messages.
If your handler does exactly-once work, the cancellation must happen before the side effect, not after. Design the handler accordingly.
Anti-Patterns at Scale¶
Fire-and-forget side effects¶
func ProcessRequest(req Request) error {
    go func() { auditLog.Write(req) }()      // fire-and-forget
    go func() { metrics.Increment("rcv") }() // fire-and-forget
    return doWork(req)
}
Symptoms:
- Audit log entries silently disappear on shutdown.
- Metrics inflation: spawning a goroutine per request can outpace processing during a spike.
- No way to apply back-pressure: if auditLog.Write slows down, you accumulate goroutines instead of failing requests.
Fix: a queue-based pattern. audit.Submit(req) puts a copy on a bounded channel; an owned background worker drains the channel. Submission either blocks (back-pressure) or fails fast (load shedding), with explicit behaviour. The worker has a clean shutdown that drains the queue.
Lazy goroutine for caching¶
var once sync.Once

func GetConfig() Config {
    once.Do(func() {
        go refreshLoop() // spawned on first call, no owner
    })
    return current
}
The goroutine is started by the first caller and outlives them all. No shutdown. Common in HTTP middleware initialisation.
Fix: explicit NewConfigService(ctx) that returns a struct with a Close. The HTTP server constructs it before ListenAndServe and closes it on shutdown.
"Just one more goroutine to handle the edge case"¶
Anywhere a goroutine is spawned conditionally, ownership is conditional too, and conditional ownership all but guarantees that someone, somewhere, will forget to close it.
Fix: spawn unconditionally inside a struct that always owns it, and have the goroutine itself check the condition.
Per-connection goroutine without bounds¶
Real services need a bound on concurrent connections. A simple bound is a semaphore (or errgroup's SetLimit):
g, ctx := errgroup.WithContext(ctx)
g.SetLimit(1000)
for {
    conn, err := listener.Accept()
    if err != nil {
        return err
    }
    g.Go(func() error {
        defer conn.Close()
        return handle(ctx, conn)
    })
}
Now connection count is bounded. Excess connections wait in the kernel's accept queue, and the OS can apply its own back-pressure (SYN queue overflow, RST).
Audit Checklist for an Existing Codebase¶
You inherit a codebase. There are leaks. How do you find and fix them systematically?
Phase 1: instrumentation (1 day)¶
- Add runtime.NumGoroutine to a debug endpoint or metric. Watch it over a few hours of normal traffic.
- Add pprof endpoints (net/http/pprof).
- Take a goroutine profile (go tool pprof http://service/debug/pprof/goroutine) under load and under idle. Diff them.
If NumGoroutine grows over time with steady traffic, you have leaks. The pprof profile shows where they are parked.
Phase 2: catalogue (1–2 days)¶
- grep -rn 'go ' --include='*.go' to list every spawn site.
- For each spawn site, identify:
  - The function or struct that owns it.
  - The stop signal (context, channel, none).
  - The wait point (WaitGroup, errgroup, none).
- Tag each entry: OWNED (clear lifecycle), IMPLICIT (owner clear but no wait), ORPHAN (no clear owner).
The output is a spreadsheet of every goroutine in the system. The orphans are your priority list.
Phase 3: add tests first (1 week)¶
For each package with concurrent code, add goleak.VerifyTestMain to a TestMain. Many tests will start failing. Each failure is a leak in the test (or in the code the test exercises).
Fix the leaks in the test setup first (forgotten Close calls), then fix the leaks in the code. After this phase, the test suite is clean.
Phase 4: refactor the orphans (1–4 weeks)¶
For each orphan goroutine:
- Wrap the spawn site in a struct (the Start/Stop pattern).
- Thread ctx from the construction point.
- Add the cancellation case to the goroutine's loop.
- Update callers to call Close.
- Add a test that exercises the cancellation path.
Do not try to fix all orphans in a single PR. One per PR, reviewed individually.
Phase 5: prevent regression (ongoing)¶
- CI: go test ./... with goleak required. Block merges on leaks.
- Linter: a custom staticcheck or golangci-lint config that flags go statements outside known patterns.
- Code review checklist (see above) made mandatory.
- Monthly review of the NumGoroutine baseline in production. Spikes get investigated.
When the audit reveals legitimate background goroutines¶
Some are: a connection pool's keepalive, a metrics flusher, a leader-election heartbeat. Document them in a BACKGROUND_GOROUTINES.md in the repo. List each one with: owner, lifetime, what triggers stop, how shutdown is verified. The doc is your allowlist of expected long-lived goroutines.
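One possible entry shape, to make the format concrete (all names are illustrative):

```
component: pgpool keepalive
owner:     pgpool.Pool (one goroutine per pool, one pool per process)
started:   NewPool, during service startup
stop:      Pool.Close cancels the pool context and waits
verified:  TestPoolClose runs under goleak.VerifyNone
```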
Lifecycle Across Service Boundaries¶
When a goroutine's lifetime crosses a process boundary (gRPC stream, WebSocket, long-poll), the rules change slightly.
Server streaming RPCs¶
The server-side stream handler runs in a goroutine until the client disconnects or it returns. The handler must:
- Watch stream.Context().Done() for client disconnect.
- Return promptly on disconnect (no fire-and-forget background work).
- Bound any per-stream resources (counters, caches) and free them on return.
func (s *Server) StreamData(req *pb.Req, stream pb.Svc_StreamDataServer) error {
    ctx := stream.Context()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case item := <-s.feed:
            if err := stream.Send(item); err != nil {
                return err
            }
        }
    }
}
Behind the scenes, gRPC manages the goroutine for this handler. Your job is to make it return promptly when the stream ends.
WebSocket connections¶
Each connection is a goroutine (or two: one for reads, one for writes). The pattern:
type Conn struct {
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func handleWebSocket(ws *websocket.Conn) {
    ctx, cancel := context.WithCancel(context.Background())
    c := &Conn{cancel: cancel}
    c.wg.Add(2)
    // Each loop cancels the shared context on exit, waking the other.
    go func() { defer c.wg.Done(); defer cancel(); c.readLoop(ctx, ws) }()
    go func() { defer c.wg.Done(); defer cancel(); c.writeLoop(ctx, ws) }()
    c.wg.Wait()
}
When either loop returns (network error, ping timeout, client close), it calls cancel() to wake the other, and the function blocks until both have exited. No leaked goroutine per connection.
When Leaks Are Acceptable (and How to Document Them)¶
Some goroutines are intentionally immortal. The Go runtime's GC has them. Standard library packages (net/http keepalive pools, database/sql connection pool maintainers) have them. Your service probably has a few.
The rule: an immortal goroutine is explicitly documented, with these facts:
- It is started exactly once, at process start (not per-request, not per-connection).
- Its memory footprint is bounded and known.
- Its work is idempotent if the process is killed mid-iteration.
- It is on the allowlist for goleak and for the goroutine-count alert.
A documented immortal goroutine is fine. An undocumented one is a leak in disguise.
Patterns Beyond Start/Stop¶
Supervisor pattern¶
A supervisor owns a set of goroutines. If one dies, the supervisor restarts it. Common in long-lived workers that may panic:
type Supervisor struct {
    ctx    context.Context
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func NewSupervisor(ctx context.Context) *Supervisor {
    ctx, cancel := context.WithCancel(ctx)
    return &Supervisor{ctx: ctx, cancel: cancel}
}

func (s *Supervisor) Spawn(name string, fn func(ctx context.Context) error) {
    s.wg.Add(1)
    go func() {
        defer s.wg.Done()
        for {
            err := s.runOnce(name, fn)
            if errors.Is(err, context.Canceled) {
                return
            }
            log.Printf("supervisor: %s exited: %v; restarting", name, err)
            select {
            case <-s.ctx.Done():
                return
            case <-time.After(time.Second):
            }
        }
    }()
}

// runOnce turns a panic in fn into an error so the loop above can restart it.
func (s *Supervisor) runOnce(name string, fn func(ctx context.Context) error) (err error) {
    defer func() {
        if r := recover(); r != nil {
            err = fmt.Errorf("%s panicked: %v", name, r)
        }
    }()
    return fn(s.ctx)
}
The supervisor itself is owned by its caller and shuts down cleanly. Children that panic don't bring down the process; they get restarted.
Lifecycle manager¶
For services with many components, a lifecycle manager coordinates start and stop:
type Component interface {
    Start(ctx context.Context) error
    Stop(ctx context.Context) error
}

type Lifecycle struct {
    components []Component
}

func (l *Lifecycle) Add(c Component) { l.components = append(l.components, c) }

func (l *Lifecycle) Start(ctx context.Context) error {
    for _, c := range l.components {
        if err := c.Start(ctx); err != nil {
            l.Stop(context.Background()) // unwind the components already started
            return err
        }
    }
    return nil
}

func (l *Lifecycle) Stop(ctx context.Context) error {
    var errs []error
    for i := len(l.components) - 1; i >= 0; i-- {
        if err := l.components[i].Stop(ctx); err != nil {
            errs = append(errs, err)
        }
    }
    return errors.Join(errs...)
}
Components register at construction; the manager starts them in order, stops them in reverse, and surfaces all errors. This is the structural backbone of frameworks like fx, kratos, and similar.
Self-Assessment¶
- You can design a library type that exposes a long-lived goroutine and makes it impossible for the caller to leak it.
- You can perform a code review of a 500-line concurrent diff and identify every leak vector.
- You can write the HTTP server graceful shutdown skeleton from memory, including all the standard timeouts.
- You can explain the role of each context in a Kafka consumer group's lifecycle.
- You can audit a 50,000-line codebase for leaks with a documented plan.
- You know when to use a supervisor and when not to (hint: rarely; structured concurrency covers most cases).
- You can explain why context.Background() inside srv.Shutdown is correct even though the parent context is already cancelled.
Summary¶
Senior-level leak prevention is design, not vigilance. The lever is the type's API:
- Every type that owns goroutines exposes a Close (or Stop).
- The Close is idempotent and waits.
- The constructor takes a context that scopes the goroutine's lifetime.
- The contract is documented in the type comment.
Beyond individual types, the patterns scale:
- HTTP servers use Shutdown with an independent timeout context.
- Kafka consumers use nested errgroup for partition lifecycle and rebalance.
- Supervisors restart panicking children without leaking.
- Lifecycle managers coordinate start/stop across whole services.
For an existing codebase, the audit is a five-phase project: instrument, catalogue, test, refactor, prevent. The output is a system where leaks are bugs caught in CI, not weekly fires in production. Once the design has these properties, the prevention work moves from individual code to platform-level guarantees — covered in the professional level.