Goroutine Lifecycle — Middle Level¶
Table of Contents¶
- Introduction
- Designing for Explicit Lifecycle
- Ownership Trees
- Lifecycle and context.Context
- Lifecycle and Panics
- Lifecycle and runtime.Goexit
- Lifecycle and Deferred Cleanup
- Joining Children to Parents
- Cancellation Patterns
- Graceful Shutdown
- Observability: pprof goroutine and runtime/trace
- Lifecycle Anti-Patterns
- Testing Lifecycle
- Summary
Introduction¶
Focus: "I know goroutines have a lifecycle. How do I design programs whose goroutine lifecycles are obvious and bounded?"
At this level we stop describing lifecycle and start controlling it. The bar is: in every package you write, the lifetime of every goroutine should be obvious from a single function and bounded by an observable event. No "it will eventually exit." No "well, when the program shuts down."
We will cover:
- The mental shift from implicit spawn to explicit ownership.
- The standard primitives Go gives you to control lifecycle: channels, WaitGroup, context.Context, errgroup.
- How panics, Goexit, and deferred cleanup interact.
- The "graceful shutdown" pattern, which is the production-grade lifecycle pattern for daemons.
- How to observe lifecycle in a running program.
This material assumes you understand the basics from junior.md. It is the bridge to the leak-prevention techniques in 03-preventing-leaks.
Designing for Explicit Lifecycle¶
Three rules:
Rule 1: Spawn and join in the same function¶
A function that uses go should also wait for those goroutines to finish before returning. This makes the lifecycle local and inspectable.
// GOOD: lifecycle is local to fetchAll.
func fetchAll(ctx context.Context, urls []string) ([]Result, error) {
var wg sync.WaitGroup
results := make([]Result, len(urls))
for i, u := range urls {
wg.Add(1)
go func(i int, u string) {
defer wg.Done()
results[i] = fetch(ctx, u)
}(i, u)
}
wg.Wait()
return results, nil
}
// BAD: lifecycle escapes the function.
func startBackground(urls []string) {
for _, u := range urls {
go fetch(u) // no one waits, no cancellation
}
}
The second form is sometimes acceptable for daemons, but the lifecycle must then be owned by the surrounding type, not lost.
Rule 2: If you must spawn beyond the function, hand the lifecycle to an owner¶
type Daemon struct {
cancel context.CancelFunc
done chan struct{}
}
func StartDaemon() *Daemon {
ctx, cancel := context.WithCancel(context.Background())
d := &Daemon{cancel: cancel, done: make(chan struct{})}
go func() {
defer close(d.done)
d.run(ctx)
}()
return d
}
func (d *Daemon) Stop() {
d.cancel()
<-d.done
}
The lifecycle is now an explicit object. The caller can stop it and observe it.
Rule 3: Document the exit condition¶
Every go deserves a one-line comment naming the exit condition. If you cannot write the comment, you cannot ship the code.
Ownership Trees¶
Think of every goroutine as having an owner — the goroutine, function, or struct responsible for ending it. Build trees:
main
├── http.Server.ListenAndServe (1 listener goroutine)
│ ├── per-connection goroutine
│ │ └── per-handler context (cancelled on response complete)
│ ├── per-connection goroutine
│ └── ...
├── MetricsServer.Run
│ └── ticker goroutine (exits on ctx cancel)
└── CacheRefresher.Run
└── refresh goroutine (exits on ctx cancel)
Cancellation flows from root to leaves: ctx, cancel := context.WithCancel(...); on shutdown, cancel() propagates ctx.Done() to every leaf.
If a leaf goroutine has no path back to a root in the tree, it is an orphan. Orphans are the root cause of leaks. Code review should flag any go ... that does not visibly attach to a parent.
Lifecycle and context.Context¶
context.Context is Go's lifecycle-coordination primitive. Three things to internalize:
ctx.Done() is the cancellation signal¶
A goroutine that does not check ctx.Done() cannot be canceled. There is no force-stop.
ctx.Err() tells you why it was canceled¶
- context.Canceled — an explicit cancel() call.
- context.DeadlineExceeded — the deadline or timeout was reached.
Returning ctx.Err() from a goroutine that exits because of cancellation is idiomatic.
Always call cancel(), ideally via defer¶
Even if the timeout will fire anyway, calling cancel() early releases the timer and other resources the context holds internally.
Don't lose the context across the goroutine boundary¶
// BAD: child goroutine ignores ctx
go func() {
longRunning()
}()
// GOOD: child goroutine respects ctx
go func() {
longRunning(ctx)
}()
Lifecycle and Panics¶
Every goroutine has its own panic-handling chain. A panic anywhere in the goroutine unwinds the stack, runs all deferred functions, and — if nothing recovers — terminates the process.
The unrecovered panic path¶
panic("oops")
|
v
defer chain runs (most recent first)
|
v
no recover found
|
v
runtime.fatalpanic
|
v
process terminates with exit code 2
Recovering at the boundary¶
For any goroutine that runs untrusted or fallible code, wrap the entire body:
go func() {
defer func() {
if r := recover(); r != nil {
log.Printf("worker recovered: %v\n%s", r, debug.Stack())
metrics.IncPanicCount()
}
}()
work()
}()
After recovery, the goroutine reaches _Gdead cleanly. The rest of the program survives. This is the only way to scope a panic to one goroutine.
Where the panic boundary belongs¶
- At the top of every long-running worker.
- At the boundary of every "user code" callback (plugin systems, callbacks from event loops).
- In every http.HandlerFunc — but net/http does this for you by default.
It does not belong inside small library functions: leave panics to propagate so callers can decide.
Lifecycle and runtime.Goexit¶
runtime.Goexit ends the current goroutine immediately, running every defer-ed function on the stack. It is different from return (which only exits the current frame) and from panic (which is an error path).
Use cases¶
- Test framework internals: testing.(*T).FailNow calls Goexit to terminate a failing test goroutine without affecting other tests.
- Library functions that cannot signal failure any other way and want to force the current goroutine to stop.
Goexit from the main goroutine¶
func main() {
go func() {
for {
fmt.Println("worker")
time.Sleep(time.Second)
}
}()
runtime.Goexit()
}
The main goroutine ends, but the worker continues, so this program runs forever. (If every goroutine eventually ended, the runtime would abort with a "no goroutines" deadlock error rather than exit cleanly, because main never returned.)
Goexit and defer¶
Any deferred calls on the exiting goroutine's stack still run: Goexit honors the defer chain before the goroutine dies.
In production code, Goexit is rarely needed. Prefer return plus error values.
Lifecycle and Deferred Cleanup¶
Every goroutine has its own defer stack. Cleanup at the goroutine level is one of Go's idioms:
go func() {
defer wg.Done() // tell parent we're done
defer conn.Close() // free the connection
defer cancel() // free the timer in ctx
defer log.Println("worker exit") // diagnostics
work(ctx, conn)
}()
Order matters: deferred calls run in reverse order. The above runs log.Println first, then cancel(), then conn.Close(), then wg.Done(). Usually that ordering is correct — you want the wg.Done() to be last so the parent does not return before the goroutine's cleanup actually finished.
defer and recover and Goexit¶
All three play together:
- defer runs on normal return.
- defer runs on runtime.Goexit.
- defer runs on panic (and recover works inside a defer).
The only time defer does not run is os.Exit or syscall.Exit — those terminate the process without any unwinding.
Joining Children to Parents¶
sync.WaitGroup¶
The default tool for "wait for N goroutines to finish":
var wg sync.WaitGroup
for i := 0; i < n; i++ {
wg.Add(1)
go func() {
defer wg.Done()
work()
}()
}
wg.Wait()
Rules:
- Add(1) before go, never inside the goroutine.
- defer wg.Done() at the top of the goroutine.
- Wait() returns only when the counter reaches 0.
errgroup.Group¶
For fan-out with error propagation and shared cancellation:
import "golang.org/x/sync/errgroup"
g, ctx := errgroup.WithContext(ctx)
for _, u := range urls {
u := u // pin the loop variable (unnecessary as of Go 1.22)
g.Go(func() error {
return fetch(ctx, u)
})
}
if err := g.Wait(); err != nil {
return err
}
errgroup:
- Spawns child goroutines via g.Go.
- Cancels the shared context on the first error.
- Waits for all children before returning.
- The lifecycle of every child is bounded by the Wait call.
Channels for join¶
For more bespoke patterns:
A closed channel is a join signal. Useful when a WaitGroup would be overkill (one goroutine) or when you want a select on the done signal.
Cancellation Patterns¶
Channel-based cancellation (legacy)¶
quit := make(chan struct{})
go func() {
for {
select {
case <-quit:
return
case j := <-jobs:
process(j)
}
}
}()
// ... later ...
close(quit)
Works. But it does not compose: passing the quit channel down many layers is tedious, and you cannot attach a deadline or value.
context.Context (modern)¶
go func() {
for {
select {
case <-ctx.Done():
return
case j := <-jobs:
process(ctx, j)
}
}
}()
// ... later ...
cancel()
This composes: child contexts inherit cancellation, deadlines propagate down, values can be attached. Use context.Context for any new code.
Combining cancellation and channels¶
The lifecycle is bounded by the first event to fire.
Graceful Shutdown¶
The canonical production lifecycle pattern:
func main() {
ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer cancel()
srv := &http.Server{Addr: ":8080", Handler: routes()}
// start the server in a goroutine — lifecycle is the whole program.
serverErr := make(chan error, 1)
go func() {
serverErr <- srv.ListenAndServe()
}()
select {
case <-ctx.Done():
log.Println("shutdown signal received")
case err := <-serverErr:
log.Printf("server error: %v", err)
}
// graceful shutdown with a hard deadline.
shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer shutdownCancel()
if err := srv.Shutdown(shutdownCtx); err != nil {
log.Printf("forced shutdown: %v", err)
}
}
Properties:
- The signal handler (signal.NotifyContext) is the lifecycle trigger.
- srv.Shutdown waits for in-flight handler goroutines to finish, bounded by shutdownCtx.
- The serverErr channel is buffered so the server goroutine never blocks on send (no leaking sender).
Every long-running server should follow this shape. The goroutines spawned by handlers are owned by http.Server; Shutdown joins them.
Observability: pprof goroutine and runtime/trace¶
pprof goroutine¶
Add this to every server:
import _ "net/http/pprof"
func main() {
go func() {
log.Println(http.ListenAndServe("localhost:6060", nil))
}()
// ...
}
Then:
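The command presumably elided here connects pprof to the endpoint above (port 6060 matches the snippet; requires the server to be running):

```shell
go tool pprof http://localhost:6060/debug/pprof/goroutine
```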
Inside pprof:
- top — see the busiest stacks.
- list FuncName — see the source.
- web — render a call graph.
A leak shows up as a stack with thousands of goroutines parked on the same line.
Equivalent text dump:
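Assuming the same pprof server, the text dump is an HTTP fetch; debug=2 includes each goroutine's full stack and wait duration:

```shell
curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
```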
Each goroutine's stack with state in brackets — [chan receive, 12 minutes] is a smoking gun.
runtime/trace¶
Captures full lifecycle (every state transition with timestamps):
import "runtime/trace"
f, _ := os.Create("trace.out")
trace.Start(f)
defer trace.Stop()
// ... run workload ...
Then:
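The command to open the captured trace in the browser UI:

```shell
go tool trace trace.out
```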
The browser UI shows:
- Each goroutine's lifeline (born, run intervals, wait intervals, dead).
- Why each wait happened (channel, syscall, sleep).
- The cause of each schedule.
Use it once per project. The intuition you gain is irreplaceable.
runtime.Stack¶
The simplest dump:
Use it in SIGUSR1 handlers, in test failures, and in panic dumps.
Lifecycle Anti-Patterns¶
The fire-and-forget log¶
If the log sink is misconfigured (e.g., writing to a slow network), each of these goroutines blocks. Many of them add up to unbounded memory growth — a slow leak.
The unbounded retry¶
If op never succeeds, the goroutine never ends. Add a context, a max-retry count, or both.
The "trust me, it exits" goroutine¶
Fine — if someone closes ch. Bad if the original sender keeps a reference but never closes. Document the close-ownership.
The goroutine that recovers itself¶
This may keep the lifecycle "alive" but masks bugs. Better: log the panic, return, and let a supervisor restart.
The "spawn from spawn" cascade¶
Each spawned goroutine is now an orphan with no lifecycle parent. Use a worker pool instead.
Testing Lifecycle¶
Strategy 1: Baseline + leakcheck¶
func TestNoLeak(t *testing.T) {
before := runtime.NumGoroutine()
runMyCode(t)
// give the runtime a moment to update.
time.Sleep(50 * time.Millisecond)
after := runtime.NumGoroutine()
if after > before {
buf := make([]byte, 1<<20)
n := runtime.Stack(buf, true)
t.Fatalf("leak: before=%d after=%d\n%s", before, after, buf[:n])
}
}
Strategy 2: uber-go/goleak¶
goleak snapshots the goroutines that exist before the test, runs the test, and fails with sample stack traces if any unexpected goroutines remain afterward.
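A sketch of the usual wiring (requires the go.uber.org/goleak module; runMyCode is the same placeholder used in Strategy 1, stubbed here so the file compiles):

```go
package mypkg_test

import (
	"testing"

	"go.uber.org/goleak"
)

// runMyCode is a stand-in for the code under test.
func runMyCode(t *testing.T) {}

func TestNoLeakWithGoleak(t *testing.T) {
	// VerifyNone fails the test if any unexpected goroutines remain.
	defer goleak.VerifyNone(t)
	runMyCode(t)
}
```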
Strategy 3: Synthetic lifecycle test¶
func TestWorkerStopsOnContextCancel(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
done := make(chan struct{})
go func() {
defer close(done)
worker(ctx)
}()
cancel()
select {
case <-done:
case <-time.After(time.Second):
t.Fatal("worker did not exit on cancel")
}
}
Always check: did the goroutine end because of cancel, within a reasonable budget? time.After gives you the budget; do not omit it.
Summary¶
Lifecycle design at the middle level is about making the answer to "when does this goroutine end?" both easy to give and easy to verify:
- Spawn and join in the same function when possible. When not possible, attach the lifecycle to an explicit owner with a Stop or Close method.
- Pass context.Context to every goroutine; check ctx.Done() at every blocking point.
- Wrap every long-running goroutine body with a deferred recover to scope panics.
- Use sync.WaitGroup or errgroup.Group for join.
- Add pprof goroutine and runtime/trace to your toolbox.
- Test lifecycle explicitly: assert that goroutine count returns to baseline, or use goleak.
The senior level extends this to whole-system patterns: supervisor trees, hierarchical contexts, and the interaction between lifecycle and the garbage collector. The professional level dives into the runtime states (_Grunnable, _Gwaiting, _Gsyscall, _Gdead) and the g struct itself.
See also:
- 02-detecting-leaks — when this discipline breaks down, how to detect it.
- 03-preventing-leaks — patterns that make leaks structurally impossible.
- ../../10-scheduler-deep-dive — what happens between the lifecycle states.