Preventing Goroutine Leaks — Tasks¶
Table of Contents¶
- How to Use This File
- Warm-Up: Fix the Five Patterns
- Build the Start/Stop Struct
- Wire Up errgroup
- Add goleak to a Project
- HTTP Server Graceful Shutdown
- Worker Pool with Back-Pressure
- Refactor a Leaky Codebase
- Cancellation Latency Budget
- Build a Supervisor
- Audit Exercise
- Self-Check Solutions
How to Use This File¶
Each task has a stated goal, a starting point (or a blank slate), and a checklist of what counts as "done." Most tasks take 15–60 minutes. Do them in a fresh module so you can verify with go test ./... cleanly.
For every task, the unstated requirement is: goleak passes. Add goleak.VerifyTestMain early; let it catch your bugs.
Warm-Up: Fix the Five Patterns¶
Task 1.1 — Pattern 1: Sender on Unbuffered Channel¶
Starting code:
package main
import "fmt"
func fetch(u string) string {
return "result of " + u
}
func first(urls []string) string {
ch := make(chan string)
for _, u := range urls {
go func(u string) { ch <- fetch(u) }(u)
}
return <-ch
}
func main() {
fmt.Println(first([]string{"a", "b", "c", "d", "e"}))
}
Goal: After first returns, no goroutines remain.
Checklist: - [ ] make(chan string, len(urls)) so all senders can deposit. - [ ] Test asserts runtime.NumGoroutine() returns to baseline 100ms after first returns. - [ ] goleak passes.
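A minimal shape for the baseline assertion in the checklist — runtime.NumGoroutine is the real API; the 100ms settle time and the test name are just one way to do it:

func TestFirstLeavesNoGoroutines(t *testing.T) {
    before := runtime.NumGoroutine()
    _ = first([]string{"a", "b", "c", "d", "e"})
    time.Sleep(100 * time.Millisecond) // let the losing senders deposit into the buffer and exit
    if after := runtime.NumGoroutine(); after > before {
        t.Errorf("goroutine count grew: before=%d, after=%d", before, after)
    }
}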
Task 1.2 — Pattern 2: Receiver on Channel Never Closed¶
Starting code:
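If you need a starting point, a minimal stand-in that exhibits the pattern (Job and process are placeholder names):

func worker(in <-chan Job) {
    go func() {
        for j := range in { // BUG: if in is never closed, this range blocks forever
            process(j)
        }
    }()
}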
Goal: The goroutine can be stopped by the caller.
Checklist: - [ ] Accept ctx context.Context. - [ ] select watches <-ctx.Done() and the channel. - [ ] Return a Stop mechanism (cancel func, or wrap in a struct). - [ ] goleak passes after cancel() + wait.
Task 1.3 — Pattern 3: Infinite Default Loop¶
Starting code:
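If you need a starting point, a minimal stand-in that exhibits the pattern (Event and handle are placeholder names):

func watch(events <-chan Event) {
    go func() {
        for {
            select {
            case e := <-events:
                handle(e)
            default:
                // BUG: spins at full CPU when no event is ready, and nothing can stop the loop
            }
        }
    }()
}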
Goal: No busy-wait; clean cancellation.
Checklist: - [ ] Remove default (or replace with a time.Ticker if a periodic poll is genuinely needed). - [ ] Add <-ctx.Done() case. - [ ] CPU usage is near 0 when idle (verify with top or a Go benchmark).
Task 1.4 — Pattern 4: Mutex Held Across Channel Op¶
Starting code:
var mu sync.Mutex
var pending []Job
func enqueue(j Job, dispatch chan<- Job) {
mu.Lock()
defer mu.Unlock()
pending = append(pending, j)
dispatch <- j // BUG: holds mu across send
}
Goal: The send does not happen while the mutex is held.
Checklist: - [ ] Mutex critical section ends before any channel operation. - [ ] State update and send are still consistent. - [ ] Test: spawn 100 enqueuers and one slow consumer; no deadlock.
Task 1.5 — Pattern 5: Ticker Not Stopped¶
Starting code:
func startHeartbeat() {
go func() {
t := time.NewTicker(time.Second)
for {
select {
case <-t.C:
ping()
}
}
}()
}
Goal: Owner, cancellation, ticker stop.
Checklist: - [ ] Wrap in a struct with Close. - [ ] defer t.Stop() immediately after NewTicker. - [ ] <-ctx.Done() case in the select. - [ ] Test starts, sleeps 50ms, closes, asserts no goroutines remain.
Build the Start/Stop Struct¶
Task 2 — Generic Owned Loop¶
Build a LoopRunner type that runs a user-provided function on a configurable interval until Close:
type LoopRunner struct { /* fields */ }
func NewLoopRunner(ctx context.Context, interval time.Duration, fn func(context.Context)) *LoopRunner
func (l *LoopRunner) Close() error
Checklist: - [ ] Constructor takes parent context. - [ ] Uses time.NewTicker with defer Stop(). - [ ] Close is idempotent (use sync.Once). - [ ] Close waits for the loop to exit. - [ ] fn is called with the goroutine's context so it can respect cancellation. - [ ] goleak in the test passes.
Task 2.1 — Multi-Goroutine Variant¶
Extend LoopRunner to accept N concurrent worker functions. The struct owns N goroutines; Close waits for all of them.
Checklist: - [ ] sync.WaitGroup replaces the single done channel. - [ ] All goroutines share the same context. - [ ] Close cancels once and waits for all. - [ ] Test with N=10 and intentional 50ms latency in fn.
Wire Up errgroup¶
Task 3.1 — Parallel Fetch¶
Implement fetchAll(ctx, urls) that fetches each URL concurrently, returns the slice of bodies, and fails fast on the first error. Use errgroup.WithContext.
Checklist: - [ ] No hand-rolled WaitGroup or error channel. - [ ] Loop variable captured correctly (url := url in the loop body if you are on Go < 1.22). - [ ] Goroutines pass the derived ctx to fetch. - [ ] First error cancels siblings. - [ ] Test: one URL returns an error, observe that the others are cancelled.
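One possible shape, assuming fetch takes a context (unlike the Task 1.1 version) — a sketch to compare against after you have tried it, not the answer:

func fetchAll(ctx context.Context, urls []string) ([]string, error) {
    g, ctx := errgroup.WithContext(ctx)
    bodies := make([]string, len(urls))
    for i, url := range urls {
        i, url := i, url // not needed on Go 1.22+
        g.Go(func() error {
            b, err := fetch(ctx, url) // derived ctx: the first error cancels the rest
            if err != nil {
                return err
            }
            bodies[i] = b // each goroutine writes only its own slot, so no race
            return nil
        })
    }
    if err := g.Wait(); err != nil {
        return nil, err
    }
    return bodies, nil
}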
Task 3.2 — Bounded Parallelism¶
Modify Task 3.1 to limit concurrency to 8. Use g.SetLimit(8).
Checklist: - [ ] Observe that at most 8 fetches are in flight (instrument with a counter). - [ ] Total time on 100 URLs at 100ms each is ~1.3 seconds (⌈100 ÷ 8⌉ = 13 waves × 100ms). - [ ] No leaks if half the URLs fail.
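A sketch of the bounded variant; result collection stays as in Task 3.1, and the atomic counter is one way to satisfy the instrumentation item (returning an error on n > 8 is an assumption about how you want the test to fail):

var inFlight atomic.Int64 // instrumentation for "at most 8 in flight"

func fetchAllBounded(ctx context.Context, urls []string) error {
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(8) // g.Go now blocks until fewer than 8 group goroutines are active
    for _, url := range urls {
        url := url // not needed on Go 1.22+
        g.Go(func() error {
            n := inFlight.Add(1)
            defer inFlight.Add(-1)
            if n > 8 {
                return fmt.Errorf("observed %d concurrent fetches", n)
            }
            _, err := fetch(ctx, url)
            return err
        })
    }
    return g.Wait()
}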
Add goleak to a Project¶
Task 4 — Retrofit an Existing Project¶
Pick a small Go project you have (or use https://github.com/avelino/awesome-go to find one). Add goleak:
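The wiring itself is small; goleak.VerifyTestMain runs the package's tests and fails the run if goroutines are still alive afterwards (the package name here is hypothetical):

package mypkg

import (
    "testing"

    "go.uber.org/goleak"
)

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}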
Checklist: - [ ] go get go.uber.org/goleak. - [ ] Add TestMain with goleak.VerifyTestMain(m) to one package. - [ ] Run go test ./... and fix any leaks reported. - [ ] Decide whether any reported leaks come from third-party code; add an IgnoreTopFunction allowlist with the justification in a comment. - [ ] Repeat for every package with goroutine usage.
Task 4.1 — Per-Test Variant¶
For a flakier package, use per-test verification instead of TestMain:
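The per-test form is a single deferred call at the top of the test (the test name is hypothetical):

func TestConsumerShutsDown(t *testing.T) {
    defer goleak.VerifyNone(t) // fails this test if goroutines remain when it returns

    // ... start the component, exercise it, stop it ...
}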
Checklist: - [ ] Identify a test that has setup goroutines that should not survive the test. - [ ] Add the defer. - [ ] Run go test -count=10 -race ./... and ensure stability.
HTTP Server Graceful Shutdown¶
Task 5 — Build the Skeleton¶
Implement run(ctx) that: - Starts an HTTP server on :8080. - Has a /hello endpoint that sleeps 2 seconds, then writes "hi". - On SIGINT or SIGTERM, drains in-flight requests within 30 seconds.
Checklist: - [ ] signal.NotifyContext for SIGINT/SIGTERM. - [ ] errgroup.WithContext to coordinate the server goroutine and the shutdown goroutine. - [ ] srv.Shutdown(shutdownCtx) where shutdownCtx is context.WithTimeout(context.Background(), 30*time.Second) — not derived from the cancelled root context. - [ ] ReadHeaderTimeout, ReadTimeout, WriteTimeout, IdleTimeout set. - [ ] Handler respects r.Context().Done() (interrupts the sleep on disconnect). - [ ] Integration test: send a request, send SIGTERM mid-flight, request completes, server exits cleanly.
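A skeleton that satisfies the checklist — handler body, timeout values, and error plumbing are all adjustable; the structural points to keep are the fresh shutdown context and the handler watching r.Context():

package main

import (
    "context"
    "errors"
    "fmt"
    "net/http"
    "os/signal"
    "syscall"
    "time"

    "golang.org/x/sync/errgroup"
)

func run(ctx context.Context) error {
    ctx, stop := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    mux := http.NewServeMux()
    mux.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
        select {
        case <-time.After(2 * time.Second):
            fmt.Fprintln(w, "hi")
        case <-r.Context().Done(): // client disconnected: stop sleeping
        }
    })

    srv := &http.Server{
        Addr:              ":8080",
        Handler:           mux,
        ReadHeaderTimeout: 5 * time.Second,
        ReadTimeout:       10 * time.Second,
        WriteTimeout:      10 * time.Second,
        IdleTimeout:       60 * time.Second,
    }

    g, ctx := errgroup.WithContext(ctx)
    g.Go(func() error {
        err := srv.ListenAndServe()
        if errors.Is(err, http.ErrServerClosed) {
            return nil // normal path after Shutdown
        }
        return err
    })
    g.Go(func() error {
        <-ctx.Done() // signal received, or the server goroutine failed
        // Fresh context: the root is already cancelled and cannot bound the drain.
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        return srv.Shutdown(shutdownCtx)
    })
    return g.Wait()
}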
Task 5.1 — Timeout Bound¶
Modify the shutdown to exit forcibly after the 30-second timeout, logging which connections remain.
Checklist: - [ ] If Shutdown returns context.DeadlineExceeded, log the number of active connections. - [ ] Use srv.RegisterOnShutdown if needed for per-connection cleanup. - [ ] Verify with a misbehaving handler that holds for 60 seconds.
Worker Pool with Back-Pressure¶
Task 6 — Build a Pool¶
Implement a Pool type:
type Pool struct { /* ... */ }
func NewPool(ctx context.Context, size int, handle func(context.Context, Job) error) *Pool
func (p *Pool) Submit(ctx context.Context, j Job) error
func (p *Pool) Close() error
Behaviour: - N goroutines pull jobs from an internal channel and call handle. - Submit blocks until a worker is free, or until its ctx is cancelled. - Close cancels the pool's context, waits for workers, returns the first error encountered.
Checklist: - [ ] Internal job channel is unbuffered (back-pressure) or small buffer (smoothing). - [ ] Submit uses select to honour both p.in <- j and <-ctx.Done(). - [ ] Close is idempotent. - [ ] Test: submit 1000 jobs, close mid-flight, verify all submitted-and-accepted jobs were handled.
Task 6.1 — Load Shedding¶
Modify Submit to return an ErrPoolFull error if no worker is free within 100 ms. Implement using select with a time.After and a context.Done() case.
Checklist: - [ ] Sustained submit rate above worker capacity yields ErrPoolFull quickly. - [ ] Brief bursts within timeout are absorbed. - [ ] Test: 1000 submits at 10ms intervals to a pool of 5 workers handling 50ms jobs; some fraction succeeds, the rest get ErrPoolFull.
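One way to implement the deadline, assuming the Pool from Task 6 (ErrPoolFull and the message are yours to name); time.After is acceptable here because the wait is bounded:

var ErrPoolFull = errors.New("pool: no free worker within 100ms")

func (p *Pool) Submit(ctx context.Context, j Job) error {
    select {
    case p.in <- j:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(100 * time.Millisecond):
        return ErrPoolFull
    }
}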
Refactor a Leaky Codebase¶
Task 7 — The Mini-Service¶
Starting code (a small intentionally-leaky service):
package svc
import (
"fmt"
"time"
)
type Service struct {
in chan string
}
func NewService() *Service {
s := &Service{in: make(chan string)}
go s.consume()
go s.flush()
return s
}
func (s *Service) Submit(msg string) {
s.in <- msg
}
func (s *Service) consume() {
for msg := range s.in {
time.Sleep(10 * time.Millisecond)
fmt.Println(msg)
}
}
func (s *Service) flush() {
for {
time.Sleep(time.Second)
fmt.Println("flush")
}
}
Goal: Refactor to use the Start/Stop pattern. After refactoring, s.Close() stops both goroutines cleanly.
Checklist: - [ ] NewService takes a parent context. - [ ] Both consume and flush watch <-ctx.Done(). - [ ] flush uses time.Ticker with defer Stop(). - [ ] Submit uses select with <-ctx.Done() so it can return an error after Close. - [ ] sync.WaitGroup waits for both goroutines. - [ ] goleak.VerifyTestMain passes.
Task 7.1 — Add Tests¶
Write tests:
- TestStartClose: construct, submit one message, close, assert no goroutines remain.
- TestSubmitAfterClose: construct, close, attempt to submit, expect an error.
- TestConcurrentClose: call Close from 10 goroutines simultaneously; assert no panics, no leaks, exactly one shutdown.
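A sketch of the trickiest of the three — concurrent Close — assuming the refactored NewService(ctx) from Task 7; the "exactly one shutdown" property usually falls out of the sync.Once inside Close (or assert it explicitly with a counter):

func TestConcurrentClose(t *testing.T) {
    defer goleak.VerifyNone(t)

    s := NewService(context.Background())
    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            s.Close() // must be safe to call concurrently and repeatedly
        }()
    }
    wg.Wait()
}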
Cancellation Latency Budget¶
Task 8 — Measure and Enforce¶
Build a test that asserts the cancellation latency of your Service (Task 7) is under 100 ms.
func TestCancellationLatency(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
s := NewService(ctx)
time.Sleep(10 * time.Millisecond)
start := time.Now()
cancel()
// Wait for goroutines to exit; use s.Close() or s.Wait().
elapsed := time.Since(start)
if elapsed > 100*time.Millisecond {
t.Errorf("cancellation took %v, budget 100ms", elapsed)
}
}
Checklist: - [ ] Test passes with a clean implementation. - [ ] Intentionally add a 200ms sleep in the goroutine; test fails. - [ ] Replace time.Sleep with select { case <-ctx.Done(): return; case <-time.After(200*time.Millisecond): }; test passes again.
Task 8.1 — CPU-Bound Cancellation¶
Add a goroutine that does a CPU-bound loop (e.g., counting to 10^9). Ensure cancellation latency stays under 100 ms.
Checklist: - [ ] Loop checks ctx.Err() every N iterations. - [ ] Choose N so the check happens at least every 10 ms (typically N = 10^6 for tight loops). - [ ] Verify with the latency test.
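A shape for the cooperative check — the constant and the loop bound come straight from the checklist; tune checkEvery to your hardware:

func crunch(ctx context.Context) error {
    const checkEvery = 1_000_000 // aim for a ctx check at least every ~10ms
    for i := 0; i < 1_000_000_000; i++ {
        if i%checkEvery == 0 {
            if err := ctx.Err(); err != nil {
                return err // cancelled: abandon the computation promptly
            }
        }
        // ... the actual CPU-bound work for iteration i ...
    }
    return nil
}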
Build a Supervisor¶
Task 9 — Restart-on-Panic Supervisor¶
Implement a Supervisor type:
type Supervisor struct { /* ... */ }
func NewSupervisor(parent context.Context) *Supervisor
func (s *Supervisor) Spawn(name string, fn func(context.Context) error)
func (s *Supervisor) Close()
Behaviour: - Each Spawn registers a function. The supervisor runs it in a goroutine. - If the function returns an error, the supervisor restarts it after a 1-second backoff. - If the function returns nil, the supervisor does not restart it. - Close cancels the supervisor's context and waits for all children. - A panicking child is recovered and restarted (with a log message).
Checklist: - [ ] recover() in each child goroutine. - [ ] Restart uses select with <-ctx.Done() so it doesn't wait through the backoff on shutdown. - [ ] Close cancels and waits. - [ ] Test: spawn a child that panics every iteration; verify it is restarted but eventually Close stops it.
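A possible skeleton, with the panic-to-error conversion isolated in a helper so the restart loop stays readable; the names and the fixed 1-second backoff follow the task statement, everything else is an assumption:

type Supervisor struct {
    ctx    context.Context
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func NewSupervisor(parent context.Context) *Supervisor {
    ctx, cancel := context.WithCancel(parent)
    return &Supervisor{ctx: ctx, cancel: cancel}
}

func (s *Supervisor) Spawn(name string, fn func(context.Context) error) {
    s.wg.Add(1)
    go func() {
        defer s.wg.Done()
        for {
            err := s.runOnce(name, fn)
            if err == nil || s.ctx.Err() != nil {
                return // clean exit, or the supervisor is shutting down
            }
            log.Printf("child %s failed: %v; restarting in 1s", name, err)
            select {
            case <-s.ctx.Done():
                return // don't sit through the backoff during shutdown
            case <-time.After(time.Second):
            }
        }
    }()
}

// runOnce runs fn once, converting a panic into an error so the caller can restart it.
func (s *Supervisor) runOnce(name string, fn func(context.Context) error) (err error) {
    defer func() {
        if r := recover(); r != nil {
            err = fmt.Errorf("child %s panicked: %v", name, r)
        }
    }()
    return fn(s.ctx)
}

func (s *Supervisor) Close() {
    s.cancel()
    s.wg.Wait()
}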
Audit Exercise¶
Task 10 — Audit a Real Repository¶
Choose a real Go service (your own, or one of: https://github.com/grafana/loki, https://github.com/prometheus/prometheus, https://github.com/etcd-io/etcd). The smaller the better for a first pass.
Steps:
- Catalogue. Run grep -rn 'go func\|go [a-zA-Z]' --include='*.go' . and copy the output into a spreadsheet.
- Classify each spawn:
  - Owner identified? (Y/N)
  - Stop signal identified? (Y/N)
  - Wait point identified? (Y/N)
- Score the repo: the count of (Y, Y, Y) spawns divided by the total. A healthy repo is over 90%.
- Pick the worst three. Read the code and write down what would leak under shutdown.
- Write a hypothetical fix (don't submit; this is for learning).
Checklist: - [ ] Spreadsheet has every spawn. - [ ] Score is calculated. - [ ] Three fixes are written up. - [ ] You can articulate why the project's score is what it is (active maintenance, project age, framework usage).
Task 10.1 — Audit Your Own Project¶
Repeat Task 10 for code you have written. Be honest. Be the bug.
Checklist: - [ ] Score is calculated. - [ ] Lowest-scoring file is identified. - [ ] Action plan: convert to Start/Stop pattern, add goleak, submit PR.
Self-Check Solutions¶
Solution sketches¶
Task 1.1: make(chan string, len(urls)).
Task 1.2:
func worker(ctx context.Context, in <-chan Job) {
go func() {
for {
select {
case <-ctx.Done():
return
case j, ok := <-in:
if !ok {
return
}
process(j)
}
}
}()
}
Task 1.3: Remove default. Add <-ctx.Done().
Task 1.4:
func enqueue(j Job, dispatch chan<- Job) {
mu.Lock()
pending = append(pending, j)
mu.Unlock()
dispatch <- j
}
Task 1.5:
type Heartbeat struct {
cancel context.CancelFunc
done chan struct{}
}
func StartHeartbeat(parent context.Context) *Heartbeat {
ctx, cancel := context.WithCancel(parent)
h := &Heartbeat{cancel: cancel, done: make(chan struct{})}
go func() {
defer close(h.done)
t := time.NewTicker(time.Second)
defer t.Stop()
for {
select {
case <-ctx.Done():
return
case <-t.C:
ping()
}
}
}()
return h
}
func (h *Heartbeat) Close() {
h.cancel()
<-h.done
}
Task 2 sketch:
type LoopRunner struct {
cancel context.CancelFunc
done chan struct{}
closeOnce sync.Once
}
func NewLoopRunner(parent context.Context, interval time.Duration, fn func(context.Context)) *LoopRunner {
ctx, cancel := context.WithCancel(parent)
l := &LoopRunner{cancel: cancel, done: make(chan struct{})}
go func() {
defer close(l.done)
t := time.NewTicker(interval)
defer t.Stop()
for {
select {
case <-ctx.Done():
return
case <-t.C:
fn(ctx)
}
}
}()
return l
}
func (l *LoopRunner) Close() error {
l.closeOnce.Do(func() {
l.cancel()
<-l.done
})
return nil
}
Task 6 pool sketch:
type Pool struct {
cancel context.CancelFunc
wg sync.WaitGroup
in chan Job
err error
errMu sync.Mutex
}
func NewPool(parent context.Context, size int, handle func(context.Context, Job) error) *Pool {
ctx, cancel := context.WithCancel(parent)
p := &Pool{cancel: cancel, in: make(chan Job)}
for i := 0; i < size; i++ {
p.wg.Add(1)
go func() {
defer p.wg.Done()
for {
select {
case <-ctx.Done():
return
case j, ok := <-p.in:
if !ok {
return
}
if err := handle(ctx, j); err != nil {
p.recordErr(err)
}
}
}
}()
}
return p
}
func (p *Pool) Submit(ctx context.Context, j Job) error {
select {
case <-ctx.Done():
return ctx.Err()
case p.in <- j:
return nil
}
}
func (p *Pool) Close() error {
p.cancel()
p.wg.Wait()
return p.err
}
func (p *Pool) recordErr(err error) {
p.errMu.Lock()
defer p.errMu.Unlock()
if p.err == nil {
p.err = err
}
}
The tasks build on each other: warm-up patterns → owned struct → real services. By the end you have implemented every pattern at junior, middle, and senior levels and run goleak across each. The next file, find-bug.md, tests recognition rather than construction.