Cancellation Propagation — Professional Level¶
Table of Contents¶
- Scope
- The context Package Implementation
- How WithCancel Builds the Tree
- The Closed-Channel Trick
- Scheduler Interactions
- Cost of select with ctx.Done()
- AfterFunc Internals
- WithCancelCause and Cause
- WithoutCancel
- Custom Context Implementations
- Cancellation Latency in the Scheduler
- Memory Footprint of the Context Tree
- Garbage Collection of Contexts
- Performance Benchmarks
- Cancellation Under High Concurrency
- Cancellation Race Conditions
- Cross-process Implementations
- Incident Post-Mortems
- Patterns That Push the Limits
- Cheat Sheet for Professionals
- Summary
- Further Reading
Scope¶
This file dives into the internals: how context.Context is implemented, what cancellation costs in the Go runtime, the race conditions that production code must defend against, and the lessons from large-scale incidents. The audience is engineers who design infrastructure libraries, debug production cancellation bugs, or optimise services for low-latency shutdown.
The patterns described at junior, middle, and senior level remain correct; this file explains why they work and where they break under pressure.
A note on assertions and cancellation¶
Some code uses assertions to validate invariants. Cancellation may invalidate assertions silently.
func process(ctx context.Context, data Data) {
intermediate := step1(ctx, data)
if intermediate == nil {
panic("invariant: step1 always returns non-nil")
}
finalize(ctx, intermediate)
}
If step1 is cancellable and returns nil on cancellation, the panic fires. Either:
- The assertion is wrong (step1 can return nil), and the code is unprepared.
- The assertion is right (step1 should never return nil), and cancellation is the bug.
Better:
func process(ctx context.Context, data Data) error {
intermediate, err := step1(ctx, data)
if err != nil {
return err
}
return finalize(ctx, intermediate)
}
Use error returns; assertions are for true invariants that cannot be violated by external events (like cancellation).
Cancellation in distributed counting¶
A subtle case: counting events across many goroutines while supporting cancellation. The naive approach uses a shared counter under a mutex; this works but contends under high write rates.
var counter int64
for _, item := range items {
item := item
g.Go(func() error {
if err := process(ctx, item); err != nil {
return err
}
atomic.AddInt64(&counter, 1)
return nil
})
}
On cancellation, the counter reflects partial work — the items processed before cancel. This is usually fine.
For more sophisticated counting (per-tenant, per-status), use per-goroutine counters reduced at the end:
results := make([]int64, len(items))
for i, item := range items {
i, item := i, item
g.Go(func() error {
if err := process(ctx, item); err != nil {
return err
}
results[i] = 1
return nil
})
}
// after Wait
total := int64(0)
for _, r := range results {
total += r
}
Per-index slot avoids contention. The reduce is sequential but fast.
For very large fan-outs, use sharded counters or expvar.Int for atomic-ish counting with low contention.
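A sketch of a sharded counter; the shard count and padding are illustrative, and each worker picks a shard (typically by worker index) so writers rarely share a cache line:
type ShardedCounter struct {
    shards [16]struct {
        n atomic.Int64
        _ [56]byte // pad toward a cache line to reduce false sharing
    }
}

func (c *ShardedCounter) Inc(shard int) {
    c.shards[shard%len(c.shards)].n.Add(1)
}

func (c *ShardedCounter) Total() int64 {
    var t int64
    for i := range c.shards {
        t += c.shards[i].n.Load()
    }
    return t
}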
Reading the runtime source for cancellation¶
The Go runtime source is fast to read once you know where to look. For cancellation:
- src/context/context.go: the package itself.
- src/runtime/chan.go: channel operations, including close.
- src/runtime/proc.go: scheduler, including parking and waking.
- src/runtime/select.go: select implementation.
- src/runtime/sema.go: semaphores used by sync.Mutex.
Reading these gives a deeper understanding of "what happens when cancel fires." The structures are well-named; the comments are extensive.
Notable functions¶
- closechan (chan.go): closes a channel and wakes waiters.
- gopark (proc.go): parks the calling goroutine.
- goready (proc.go): readies a goroutine to be scheduled.
- selectgo (select.go): implements select.
Tracing through cancel -> close -> closechan -> goready -> selectgo wakes up gives a complete picture.
Resources¶
The Go source has docs and design notes in src/runtime/HACKING.md and src/runtime/README.md. The scheduler design doc by Dmitry Vyukov is widely available.
Context package: design history¶
Worth understanding where context.Context came from before diving into internals.
context was originally an internal Google package, used heavily in the net/http and RPC libraries. It was promoted to the standard library in Go 1.7 (2016). The design was driven by:
- Need for cancellation in long-running RPC operations.
- Need to carry per-request values (trace IDs, auth) without polluting every function signature.
- Desire to make cancellation composable across library boundaries.
The choices that shaped it:
- Interface, not concrete type: lets users implement custom contexts (testing, wrapping).
- Immutability via the With* API: each derivation returns a new context; the parent is unchanged. Safe for concurrent use.
- Cancellation via channel close: leverages the existing channel primitive rather than introducing a new sync mechanism.
- Optional value bag: useful but easily abused.
- Per-derivation cancel function: explicit ownership of when to release.
The criticisms over time:
- "Why pass context as first parameter instead of last?" Convention; could have gone either way. The first-parameter rule is enforced by linters now.
- "Why is value lookup not type-safe?" Pragmatic — interface{} (now
any) for the value, no generics at the time of design. - "Why no Set after creation?" Immutability is a feature; concurrent safety would require locks otherwise.
- "Why a tree, not a flat list?" Hierarchical cancellation is a feature; cancelling a parent cancels descendants.
Despite the criticisms, no replacement has emerged. The API is stable; the ecosystem is built around it.
The context Package Implementation¶
The context package is implemented in src/context/context.go (go/src/context/). It is roughly 800 lines of Go, surprisingly compact for the role it plays in every modern Go program.
The interface¶
type Context interface {
Deadline() (deadline time.Time, ok bool)
Done() <-chan struct{}
Err() error
Value(key any) any
}
Four methods. Each context type implements them. The interface is the entire surface; everything else is helper functions.
Concrete types¶
The implementation has a handful of concrete types:
- emptyCtx — the value returned by Background() and TODO(). Always-on, always-empty.
- cancelCtx — what WithCancel returns. Holds the cancel state.
- timerCtx — what WithTimeout and WithDeadline return. Embeds cancelCtx plus a timer.
- valueCtx — what WithValue returns. Holds a key/value pair plus the parent.
- afterFuncCtx — the type registered by AfterFunc. Hidden from the public API.
The hierarchy is:
Context (interface)
├── emptyCtx (background/todo)
├── valueCtx (parent + key/value)
├── cancelCtx (parent + cancel state)
│ └── timerCtx (cancelCtx + timer)
└── withoutCancelCtx (parent values, no cancellation)
Each derived context embeds or composes the parent. Walking up the chain reaches Background.
emptyCtx¶
type emptyCtx struct{}
func (emptyCtx) Deadline() (time.Time, bool) { return time.Time{}, false }
func (emptyCtx) Done() <-chan struct{} { return nil }
func (emptyCtx) Err() error { return nil }
func (emptyCtx) Value(any) any { return nil }
The methods all return zero values. Done() returning nil is critical: a select case on a nil channel never fires, so case <-ctx.Done(): on Background is effectively a no-op case. The receive blocks forever (but in a select, the case is never chosen because nil-channel cases are never ready).
Background and TODO are pointers to package-level variables:
var (
background = new(emptyCtx)
todo = new(emptyCtx)
)
func Background() Context { return background }
func TODO() Context { return todo }
Two singleton instances. The difference is convention only: Background for "real root context," TODO for "I have not decided yet." Linters distinguish them.
valueCtx¶
type valueCtx struct {
Context
key, val any
}
func (c *valueCtx) Value(key any) any {
if c.key == key {
return c.val
}
return c.Context.Value(key)
}
Embeds the parent Context; overrides Value to check its own key first, then delegates to the parent. Looking up a value walks the chain until the key is found or the chain ends.
Performance note: Value is O(depth of chain). For deeply nested contexts this can be slow. In practice, depth is small (typically 1-5).
Looking at the runtime trace¶
Go's runtime trace (go tool trace) captures every goroutine event. For cancellation, you see:
- The cancel call.
- Each goroutine waking up from <-ctx.Done().
- The order in which they were scheduled.
- The time spent waiting between wake and run.
A typical cancel trace shows:
- cancel() at time t0.
- 100 goroutines marked "runnable" by t0+5us.
- Goroutines start running on available P's from t0+5us onwards.
- Last goroutine finishes by t0+50us.
This level of detail is invaluable for tuning. If you see goroutines waiting milliseconds between "runnable" and "running," the scheduler is over-subscribed; consider increasing GOMAXPROCS or reducing per-goroutine work.
The trace output is heavy (megabytes per second of recording), but the insights are uniquely detailed.
In-depth: the cancelCtxKey magic¶
The context package uses a sentinel to identify cancel contexts within the value lookup machinery:
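// &cancelCtxKey is the key that a cancelCtx returns itself for.
// Only the address matters; the value itself is never used.
var cancelCtxKey int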
This unexported variable is used as a key in Value lookups:
func (c *cancelCtx) Value(key any) any {
if key == &cancelCtxKey {
return c
}
return value(c.Context, key)
}
When someone calls ctx.Value(&cancelCtxKey), the lookup walks the chain. When it hits a cancelCtx, the special case returns the context itself.
This allows parentCancelCtx to find an ancestor cancel context even if there are valueCtx wrappers in between (parentCancelCtx itself is shown later, in the propagateCancel discussion).
The clever bit: the key is unexported, so external code cannot accidentally collide with it. The mechanism is internal to the package.
This is also why custom context implementations can't participate in the fast path — they would need to know about cancelCtxKey and respond to it, but it is unexported.
There is a similar trick for WithoutCancel:
func (c withoutCancelCtx) Value(key any) any {
if key == &cancelCtxKey {
return nil // explicitly mask the parent's cancel context
}
return c.c.Value(key)
}
This is how WithoutCancel detaches: it intercepts the cancelCtxKey lookup and returns nil, preventing the cancel propagation logic from finding the ancestor.
These internal details are not API; they are implementation. But understanding them helps decode the source.
A more thorough look at done lazy initialisation¶
The cancelCtx does not allocate its done channel until first call to Done():
func (c *cancelCtx) Done() <-chan struct{} {
d := c.done.Load()
if d != nil {
return d.(chan struct{})
}
c.mu.Lock()
defer c.mu.Unlock()
d = c.done.Load()
if d == nil {
d = make(chan struct{})
c.done.Store(d)
}
return d.(chan struct{})
}
Double-checked locking. The first load is atomic-only (fast path). If nil, take the lock, load again under lock, and create if still nil.
Why lazy? Because most cancelable contexts are never observed via Done(). They are used for value passing, or for Err() checks, but the channel itself is never selected on. Lazy creation avoids allocating channels for unobserved contexts.
The cancel method handles the lazy case:
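// Excerpt from cancel (the full method is shown later in this file):
d, _ := c.done.Load().(chan struct{})
if d == nil {
    c.done.Store(closedchan)
} else {
    close(d)
}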
If Done() was never called, the channel is nil. Store the package-level pre-closed channel. Any future Done() call returns the closed channel; any future select on Done() fires immediately.
This is a small but cumulative optimisation. Across millions of contexts, the saved allocations add up.
Anatomy of cancelCtx¶
A more detailed walkthrough of cancelCtx and how cancellation cascades.
The canceler interface¶
The package defines an internal interface:
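// A canceler is a context type that can be cancelled directly.
// The implementations are *cancelCtx and *timerCtx.
type canceler interface {
    cancel(removeFromParent bool, err, cause error)
    Done() <-chan struct{}
}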
Both *cancelCtx and *timerCtx implement it. The cascade walks this interface; concrete types extend behaviour (e.g. timerCtx.cancel also stops the timer).
The propagateCancel reasoning¶
The function has two paths:
- Parent is a *cancelCtx: register in the children map. When the parent cancels, it iterates and cancels each child synchronously.
- Parent is not a *cancelCtx: spawn a goroutine that waits on the parent's Done and then cancels the child.
The optimisation for the fast path avoids creating a watcher goroutine when one is not needed. For deeply nested WithCancel chains (all standard cancel contexts), no extra goroutines are created.
The slow path matters when:
- You use a custom Context implementation that is not a *cancelCtx.
- You might expect wrapping a *cancelCtx in a valueCtx and then calling WithCancel again to hit the slow path too, but that case is handled by parentCancelCtx:
func parentCancelCtx(parent Context) (*cancelCtx, bool) {
done := parent.Done()
if done == closedchan || done == nil {
return nil, false
}
p, ok := parent.Value(&cancelCtxKey).(*cancelCtx)
if !ok {
return nil, false
}
pdone, _ := p.done.Load().(chan struct{})
if pdone != done {
return nil, false
}
return p, true
}
It uses Value with a magic key (cancelCtxKey) that cancelCtx.Value returns self for:
func (c *cancelCtx) Value(key any) any {
if key == &cancelCtxKey {
return c
}
return value(c.Context, key)
}
So even if you wrap a cancelCtx in a valueCtx, the Value(&cancelCtxKey) call finds the underlying cancelCtx and the fast path is taken. This is a clever piece of API design.
Custom Context implementations do not respond to cancelCtxKey, so they take the slow path. If you implement a custom Context, you can opt into the fast path by responding to cancelCtxKey — but the key is unexported, making this impossible from outside the standard library.
Walking the cancel cascade by hand¶
Consider:
root := context.Background()
ctx1, cancel1 := context.WithCancel(root)
ctx2, cancel2 := context.WithTimeout(ctx1, time.Second)
ctx3, cancel3 := context.WithCancel(ctx2)
The tree:
- root is emptyCtx (singleton, Done() == nil).
- ctx1 is *cancelCtx with parent root. Since root.Done() == nil, propagateCancel returns early; no parent registration.
- ctx2 is *timerCtx with parent ctx1. ctx1 is a *cancelCtx, so ctx2 registers itself in ctx1.children.
- ctx3 is *cancelCtx with parent ctx2. ctx2 is a *cancelCtx (via embedding), so ctx3 registers in ctx2.children.
Calling cancel1:
- ctx1.cancel runs. Lock. Set ctx1.err = Canceled. Close ctx1.done. Iterate children (ctx2).
- ctx2.cancel runs. Lock. Set ctx2.err = Canceled (inheriting the parent's err). Close ctx2.done. Stop ctx2.timer. Iterate children (ctx3).
- ctx3.cancel runs. Lock. Set ctx3.err = Canceled. Close ctx3.done. No children.
All three contexts are cancelled. Calling cancel2 or cancel3 after is a no-op (idempotent).
Calling cancel3 first (before cancel1):
- ctx3.cancel runs. Sets err, closes done. Removes self from ctx2.children.
- ctx2 and ctx1 are unaffected. The tree partially cancels.
Calling cancel2 later:
- ctx2.cancel runs. Cancels ctx2. No children (ctx3 already removed). Stops the timer.
- ctx1 is unaffected.
This is "structured concurrency" working: subtree cancellation does not affect siblings or ancestors.
How WithCancel Builds the Tree¶
func WithCancel(parent Context) (ctx Context, cancel CancelFunc) {
c := withCancel(parent)
return c, func() { c.cancel(true, Canceled, nil) }
}
func withCancel(parent Context) *cancelCtx {
if parent == nil {
panic("cannot create context from nil parent")
}
c := &cancelCtx{}
c.propagateCancel(parent, c)
return c
}
The cancelCtx struct:
type cancelCtx struct {
Context
mu sync.Mutex // protects following fields
done atomic.Value // of chan struct{}, created lazily, closed by first cancel call
children map[canceler]struct{} // set to nil by the first cancel call
err error // set to non-nil by the first cancel call
cause error // set to non-nil by the first cancel call
}
Key fields:
- done is an atomic.Value holding a chan struct{}. Lazy: created on first call to Done().
- children is a map of "things that should also be cancelled when this is." Filled by propagateCancel.
- err and cause record the cancellation reason. Set once.
- mu protects the children map and the error field.
propagateCancel¶
func (c *cancelCtx) propagateCancel(parent Context, child canceler) {
c.Context = parent
done := parent.Done()
if done == nil {
return // parent is never cancelled
}
select {
case <-done:
// parent is already cancelled
child.cancel(false, parent.Err(), Cause(parent))
return
default:
}
if p, ok := parentCancelCtx(parent); ok {
// parent is a *cancelCtx, register as its child
p.mu.Lock()
if p.err != nil {
child.cancel(false, p.err, p.cause)
} else {
if p.children == nil {
p.children = make(map[canceler]struct{})
}
p.children[child] = struct{}{}
}
p.mu.Unlock()
return
}
// parent is not a *cancelCtx, watch it via a goroutine
goroutines.Add(1)
go func() {
select {
case <-parent.Done():
child.cancel(false, parent.Err(), Cause(parent))
case <-child.Done():
}
}()
}
Two paths:
- Parent is a *cancelCtx: register the child in the parent's children map. When the parent cancels, it iterates children and cancels each.
- Parent is a custom Context (not a *cancelCtx): spawn a goroutine that watches the parent's Done channel and cancels the child when it fires.
The optimisation: known cancel-context parents do not need a watcher goroutine; the cancellation cascades through the map. Unknown parents (custom Context implementations) need the watcher.
This is why mixing in custom contexts can cost more goroutines than you expect.
cancel method¶
func (c *cancelCtx) cancel(removeFromParent bool, err, cause error) {
if err == nil {
panic("context: internal error: missing cancel error")
}
if cause == nil {
cause = err
}
c.mu.Lock()
if c.err != nil {
c.mu.Unlock()
return // already cancelled
}
c.err = err
c.cause = cause
d, _ := c.done.Load().(chan struct{})
if d == nil {
c.done.Store(closedchan)
} else {
close(d)
}
for child := range c.children {
child.cancel(false, err, cause)
}
c.children = nil
c.mu.Unlock()
if removeFromParent {
removeChild(c.Context, c)
}
}
Walk through:
- Lock.
- If already cancelled, return (idempotent).
- Record err and cause.
- Close the done channel (or store the package-level pre-closed channel if Done() was never called).
- Cascade to children.
- Clear the children map.
- Unlock.
- Optionally remove self from parent's children map.
The cascade is synchronous: this cancel invokes each child's cancel, which invokes its grandchildren's, etc. The whole subtree is cancelled before this call returns.
For deep trees, the cascade is sequential and the call stack grows with depth. In practice, depth is small.
closedchan¶
A package-level pre-closed channel. When cancel fires before Done() was ever called, it stores closedchan instead of allocating a new channel and closing it. This avoids an allocation for the common case "cancelled before anyone observed."
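Its declaration in the package source:
// closedchan is a reusable closed channel.
var closedchan = make(chan struct{})

func init() {
    close(closedchan)
}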
The next call to Done() returns closedchan, which is already closed; any select on it fires immediately.
This is a small optimisation that matters when you create many short-lived cancellable contexts.
timerCtx: the deadline variant¶
WithDeadline (and WithTimeout which wraps it) returns a *timerCtx:
type timerCtx struct {
cancelCtx
timer *time.Timer
deadline time.Time
}
func WithDeadline(parent Context, d time.Time) (Context, CancelFunc) {
if parent == nil {
panic("cannot create context from nil parent")
}
if cur, ok := parent.Deadline(); ok && cur.Before(d) {
// The current deadline is already sooner than the new one.
return WithCancel(parent)
}
c := &timerCtx{
deadline: d,
}
c.cancelCtx.propagateCancel(parent, c)
dur := time.Until(d)
if dur <= 0 {
c.cancel(true, DeadlineExceeded, nil)
return c, func() { c.cancel(false, Canceled, nil) }
}
c.mu.Lock()
defer c.mu.Unlock()
if c.err == nil {
c.timer = time.AfterFunc(dur, func() {
c.cancel(true, DeadlineExceeded, nil)
})
}
return c, func() { c.cancel(true, Canceled, nil) }
}
Key points:
- If the parent's deadline is already sooner, just use WithCancel(parent) — the parent's deadline will fire first anyway.
- If the deadline has already passed, cancel immediately.
- Otherwise, install a time.AfterFunc that cancels when the duration elapses.
The time.AfterFunc is scheduled in the runtime's timer wheel. When it fires, it runs the callback (cancel the context). The cancel cascades to children.
cancel for timerCtx:
func (c *timerCtx) cancel(removeFromParent bool, err, cause error) {
c.cancelCtx.cancel(false, err, cause)
if removeFromParent {
removeChild(c.cancelCtx.Context, c)
}
c.mu.Lock()
if c.timer != nil {
c.timer.Stop()
c.timer = nil
}
c.mu.Unlock()
}
Stops the timer to free its resources. Without this, the timer wheel would still hold a reference to the (now cancelled) context.
Timer resource costs¶
Each active timer in the wheel costs a small struct plus a slot in the data structure. The Go runtime can handle millions of timers efficiently, but there is a cost. defer cancel() is what releases the timer; without it, the timer survives until it fires.
This is the resource leak go vet flags as "lostcancel".
The Closed-Channel Trick¶
The fundamental primitive of cancellation propagation is "closing a channel broadcasts to every receiver." Let me dig into why this is so cheap.
What close does to a channel¶
A channel is a struct in the runtime (hchan in runtime/chan.go). It has:
- A buffer (for buffered channels).
- A send queue (goroutines blocked on send).
- A receive queue (goroutines blocked on receive).
- A closed flag.
- A lock.
close(ch) acquires the lock, sets the closed flag, drains the receive queue (each waiting goroutine is woken and resumes with the zero value plus ok == false), and unblocks any future receives. Future sends panic.
For a chan struct{} with no buffer and N goroutines blocked on receive, the close wakes all N. The wake is sequential: the goroutines are queued, and the close iterates the queue. Each wake is fast (an enqueue onto the scheduler's run queue), but N wakes still take O(N) time.
For N = 10, this is negligible. For N = 10 000, the close takes a few microseconds. For N = 1 000 000, it can take milliseconds. Most pipelines are far from that scale.
Why receivers see the close immediately¶
After close, any receive on the channel returns immediately. This is implemented by checking the closed flag inside the receive operation. A new receive after close does not block; it returns the zero value and false.
In select, the case <-done becomes "ready" the moment the close completes. The select picks it (or another ready case) on its next evaluation.
Why this matters for ctx.Done()¶
ctx.Done() returns a channel. cancel closes that channel. Every goroutine selecting on ctx.Done() is woken. The wake-up is the broadcast.
Compare to alternatives:
- Atomic boolean: if cancelled { return }. Each goroutine must poll explicitly. No wake-up.
- Condition variable: cond.Broadcast() wakes waiters but requires every waiter to hold a mutex. Not composable with channel operations.
- Per-goroutine notification: each goroutine has its own signal channel. The canceller iterates and signals each. O(N) work for the canceller, but no shared close.
The closed-channel approach is the sweet spot: O(1) canceller work, automatic compatibility with select, and the runtime handles wake-up efficiently.
Channel close: a deeper look at hchan¶
The chan struct{} used by Done() is a runtime.hchan. The relevant structure (simplified from runtime/chan.go):
type hchan struct {
qcount uint // total data in the queue
dataqsiz uint // size of the circular queue
buf unsafe.Pointer // points to an array of dataqsiz elements
elemsize uint16
closed uint32
elemtype *_type
sendx uint
recvx uint
recvq waitq // list of recv waiters
sendq waitq // list of send waiters
lock mutex
}
closed is a flag; recvq is the list of goroutines waiting to receive.
The closechan runtime function (called by close()):
func closechan(c *hchan) {
if c == nil {
panic(plainError("close of nil channel"))
}
lock(&c.lock)
if c.closed != 0 {
unlock(&c.lock)
panic(plainError("close of closed channel"))
}
c.closed = 1
// wake up all readers
var glist gList
for {
sg := c.recvq.dequeue()
if sg == nil {
break
}
if sg.elem != nil {
typedmemclr(c.elemtype, sg.elem)
sg.elem = nil
}
if sg.releasetime != 0 {
sg.releasetime = cputicks()
}
gp := sg.g
gp.param = unsafe.Pointer(sg)
sg.success = false
if raceenabled {
raceacquireg(gp, c.raceaddr())
}
glist.push(gp)
}
// wake up all writers (they will panic)
for {
sg := c.sendq.dequeue()
if sg == nil {
break
}
// ... same handling as receivers: extract the goroutine from sg ...
gp := sg.g
glist.push(gp)
}
unlock(&c.lock)
for !glist.empty() {
gp := glist.pop()
gp.schedlink = 0
goready(gp, 3)
}
}
What we see:
- Lock the channel.
- Set closed = 1.
- Dequeue every waiting receiver; for each, mark as not-successful and add to a wake list.
- Dequeue every waiting sender (these will panic upon waking — sending to a closed channel panics, but the convention "owner closes" prevents this).
- Unlock the channel.
- Wake every goroutine in the list via goready.
The interesting bit: the wake-up is after the unlock. The lock is held only for the dequeue, not for the actual goready calls. This means the close is fast even with many receivers; the wake-up happens off the critical path.
goready adds the goroutine to a P's local run queue (or the global one). The scheduler picks it up on the next iteration.
Latency budget for close-and-wake¶
- Lock: tens of nanoseconds (uncontended) to microseconds (contended).
- Per-receiver dequeue: tens of nanoseconds.
- Wake (goready): tens of nanoseconds per goroutine.
- Receiver running its next instruction: depends on scheduler load.
For 1 receiver: 100 ns total. For 100 receivers: 10 microseconds. For 10 000 receivers: ~1 millisecond.
This is the actual latency budget for "cancellation broadcast." It explains why cancellation is "fast enough" for most pipelines but becomes measurable at very large scales.
Scheduler Interactions¶
When a goroutine is blocked on <-ctx.Done(), it is in the scheduler's "blocked" state. The scheduler removes it from the run queue and parks it on the channel's receive queue.
When close(done) runs, the scheduler walks the receive queue and re-queues each goroutine onto the run queue. They become runnable.
The runtime preempts (or schedules) them on available threads. The wake-up latency is:
- Time to walk the receive queue (microseconds for typical sizes).
- Time for the scheduler to pick the goroutine up on a thread (typically nanoseconds to single-digit microseconds).
The total: sub-millisecond for almost any scenario.
Cancellation during blocking I/O¶
If a goroutine is blocked in a system call (e.g. read on a socket), it is parked by the OS, not the Go scheduler. Closing ctx.Done() does not wake it. The goroutine must be woken some other way:
- The I/O completes naturally.
- The I/O fails (the file descriptor is closed elsewhere).
- A deadline on the I/O fires.
This is why cancellation must be plumbed into the I/O layer (via SetReadDeadline, db.QueryContext, etc.). The ctx.Done() close alone does not wake a goroutine in a syscall.
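A sketch of that plumbing for a raw net.Conn, assuming Go 1.21+ for context.AfterFunc; the deadline poke is what actually unblocks the read:
func readWithCtx(ctx context.Context, conn net.Conn, buf []byte) (int, error) {
    stop := context.AfterFunc(ctx, func() {
        // Force the blocked Read to return with a timeout error.
        conn.SetReadDeadline(time.Now())
    })
    defer stop()

    n, err := conn.Read(buf)
    if ctx.Err() != nil {
        return n, ctx.Err() // report cancellation rather than the i/o timeout
    }
    return n, err
}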
Cancellation under high goroutine count¶
The Go scheduler scales well. Closing a channel with 100 000 receivers takes a millisecond or so. The bottleneck is the OS scheduling of the runtime threads. On a busy host, the latency can climb.
For a service that needs to cancel 100 000 goroutines in under a second, this is fine. For 1 million goroutines in under a millisecond, you may want to engineer alternatives (per-shard cancellation, hierarchical fan-out).
Cost of select with ctx.Done()¶
A select with N cases has overhead proportional to N. Each iteration:
- Builds an array of cases.
- Checks each case for readiness.
- If any are ready, picks one at random (and executes it).
- If none are ready, parks the goroutine in each case's queue.
- On wake-up, removes from the other queues.
For 2-3 cases, the cost is tens of nanoseconds. Negligible for most code.
For tight loops where the select runs millions of times per second, the cost adds up. Optimisations:
- Cache done := ctx.Done() outside the loop.
- Avoid extra cases that are rarely ready.
- Use atomic flags for hot-path checks, with select for the rare slow path.
Micro-benchmarking your specific hot path is the right way to decide.
Benchmark example¶
func BenchmarkSelectCtxDone(b *testing.B) {
ctx := context.Background()
in := make(chan int, 1)
in <- 1
for i := 0; i < b.N; i++ {
select {
case <-in:
case <-ctx.Done():
}
in <- 1
}
}
On a modern machine: roughly 20-50 ns per iteration. For most pipelines, this is well under the per-item work cost and is invisible.
When the cost matters¶
In CPU-bound inner loops processing millions of items per second per goroutine, even 20 ns adds up to 2% of CPU. If you measure and find select overhead is the bottleneck, options:
- Move the cancel check outside the hot loop; check ctx.Err() every 1000 iterations or so (see the sketch after this list).
- Profile to confirm; sometimes the GC or per-item work is the actual culprit.
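A sketch of the batched check; Item and process stand in for the per-item work, and the interval is illustrative:
const checkEvery = 1000 // tune against the per-item cost

func crunch(ctx context.Context, items []Item) error {
    for i, it := range items {
        if i%checkEvery == 0 && ctx.Err() != nil {
            return ctx.Err()
        }
        process(it) // hot, non-blocking per-item work
    }
    return nil
}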
Go scheduler basics relevant to cancellation¶
The Go scheduler is GMP: G (goroutine), M (OS thread), P (processor / scheduling context). The relevant pieces:
- Each P has a local run queue of runnable goroutines.
- M's run goroutines from their associated P's queue.
- When a goroutine blocks, it is parked off the run queue.
- When unblocked, it is enqueued onto a P's run queue.
For cancellation:
- A goroutine waiting on <-ctx.Done() is parked on the channel's recvq.
- cancel() calls close(done), which iterates recvq and re-queues each goroutine.
- The scheduler picks them up on its next iteration.
The wake-up delay is:
- For a goroutine that was on a P's run queue: nanoseconds.
- For a goroutine that needs to be migrated: microseconds.
- Under heavy load (all P's busy): milliseconds while waiting for a slot.
GOMAXPROCS determines the number of P's. Increasing it speeds up cancellation under load (more parallelism for wake-ups) but increases overall scheduler overhead.
Preemption and cancellation¶
Go's async preemption (since 1.14) allows the runtime to interrupt a running goroutine at safe points and reschedule. This means even a CPU-bound goroutine that does not select on ctx.Done() will eventually be preempted and the scheduler will check its status.
But preemption does not by itself cancel the goroutine. The goroutine must check ctx.Err() to know it should exit. Without the check, preemption merely moves the goroutine off-CPU briefly, then resumes it.
So: preemption gives the scheduler control; the goroutine must use that control to check cancellation.
Locking and cancellation¶
A goroutine holding a sync.Mutex cannot be cancelled until it releases the mutex. There is no Mutex.LockContext in the standard library; the operation is uninterruptible.
Workarounds:
- Use a channel-based lock that supports select:
type CtxMutex struct {
ch chan struct{}
}
func New() *CtxMutex {
return &CtxMutex{ch: make(chan struct{}, 1)}
}
func (m *CtxMutex) Lock(ctx context.Context) error {
select {
case m.ch <- struct{}{}:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
func (m *CtxMutex) Unlock() {
<-m.ch
}
A 1-element buffered channel as a mutex. Lock-by-send (blocking until the buffer has room), Unlock-by-receive. The Lock can wait on <-ctx.Done().
The trade-off: this lock is slower than sync.Mutex (no fast path, no spinning) but is cancellable. Use for locks held across long operations.
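Usage mirrors sync.Mutex, except that acquiring the lock can fail:
m := New()
if err := m.Lock(ctx); err != nil {
    return err // cancelled while waiting for the lock
}
defer m.Unlock()
// ... critical section ...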
AfterFunc Internals¶
context.AfterFunc was added in Go 1.21. It registers a callback to run when a context is cancelled.
func AfterFunc(ctx Context, f func()) (stop func() bool) {
a := &afterFuncCtx{
f: f,
}
a.cancelCtx.propagateCancel(ctx, a)
return func() bool {
stopped := false
a.once.Do(func() {
stopped = true
})
if stopped {
a.cancel(true, Canceled, nil)
}
return stopped
}
}
Internally, AfterFunc creates an afterFuncCtx that subscribes to the parent's cancellation. When cancellation fires, the callback runs.
Why use AfterFunc instead of a watcher goroutine?
- No goroutine for the wait. The callback is invoked from the canceller's cascade, not a separate goroutine.
- Built-in unregistration. The stop function detaches the callback.
- Cleaner code. No explicit <-ctx.Done() plus separate cancel() plumbing.
Cost: a small alloc per registration. Use when you have many short-lived "do this on cancel" hooks (e.g. per-request cleanup).
Caveat: the callback runs from the cancelling goroutine. It must be fast and non-blocking, or it will delay other cancellations in the cascade.
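A typical registration, with releaseBuffers and reqID standing in for whatever per-request cleanup you need:
stop := context.AfterFunc(ctx, func() {
    releaseBuffers(reqID) // must be fast and non-blocking (see the caveat above)
})
defer stop() // detach the hook if the request completes before any cancellation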
WithCancelCause and Cause¶
WithCancelCause (Go 1.20+) returns a context that can be cancelled with a custom error.
func WithCancelCause(parent Context) (ctx Context, cancel CancelCauseFunc) {
c := withCancel(parent)
return c, func(cause error) { c.cancel(true, Canceled, cause) }
}
func Cause(c Context) error {
if cc, ok := c.Value(&cancelCtxKey).(*cancelCtx); ok {
cc.mu.Lock()
defer cc.mu.Unlock()
return cc.cause
}
return c.Err()
}
The cause is stored in the cancelCtx.cause field. Cause(ctx) walks up the tree (via Value) to find the nearest cancelCtx and returns its cause.
The reason Cause uses Value traversal: it must work even if the context has been wrapped in WithValue since the cancel was set. The traversal finds the cancelling ancestor.
ctx.Err() still returns Canceled for compatibility. Cause(ctx) returns the rich reason. Both are part of the public API.
Use it for debuggability¶
Logging Cause(ctx) at the cancellation boundary gives "why was this cancelled?" with the actual reason. Logging ctx.Err() gives "Canceled" — useless for diagnosis.
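A sketch of that boundary, with doWork standing in for the cancellable operation:
func run(ctx context.Context) error {
    if err := doWork(ctx); err != nil {
        if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
            log.Printf("work aborted: cause=%v", context.Cause(ctx))
        }
        return err
    }
    return nil
}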
The log captures the cause; the return preserves error compatibility.
How Cause traverses the tree¶
context.Cause(ctx) returns the cause associated with the cancellation. The implementation:
func Cause(c Context) error {
if cc, ok := c.Value(&cancelCtxKey).(*cancelCtx); ok {
cc.mu.Lock()
defer cc.mu.Unlock()
return cc.cause
}
return c.Err()
}
It calls c.Value(&cancelCtxKey), which the cancelCtx.Value method special-cases to return self. This walks up the chain through any valueCtxs until it finds a cancelCtx.
If found, it returns the cause field. If not (no cancel context anywhere in the chain), it falls back to c.Err().
Why this design? Because the cause may have been set on a parent context (e.g. via cancel(myErr) on the parent), and any descendant should be able to learn it.
Subtle: if both a parent and a child have causes, which one wins? The nearest cancelled cancelCtx in the chain. So if a child is cancelled with its own cause before the parent, the child's cause is returned. If the parent cancels first (cascading to the child), the parent's cause is propagated through cancel(removeFromParent, err, cause) and stored on each descendant.
The result: Cause(ctx) returns the most specific cause known to that context's subtree.
WithoutCancel¶
WithoutCancel (Go 1.21+) is a context that does not propagate cancellation from its parent.
type withoutCancelCtx struct {
c Context
}
func (withoutCancelCtx) Deadline() (time.Time, bool) { return time.Time{}, false }
func (withoutCancelCtx) Done() <-chan struct{} { return nil }
func (withoutCancelCtx) Err() error { return nil }
func (c withoutCancelCtx) Value(key any) any { return c.c.Value(key) }
The parent's values are preserved; cancellation is not. Useful for "this background task should outlive the request that triggered it":
func handler(ctx context.Context) {
err := work(ctx)
if err != nil {
// log the failure even if ctx cancelled
go logFailure(context.WithoutCancel(ctx), err)
}
}
logFailure keeps the request's trace ID and other values, but is not affected by the request's cancellation. It runs to completion.
Be careful with this — WithoutCancel is an escape hatch from structured concurrency. Use it only when you specifically need to detach from the parent's cancellation. If the detached work goes wrong, the structure cannot help you find it.
A deep example: implementing a context-aware worker pool from scratch¶
Putting it all together: a production-grade worker pool with all the right cancellation behaviour.
type Pool struct {
mu sync.Mutex
closed bool
workers []chan job
submit chan job
ctx context.Context
cancel context.CancelCauseFunc
wg sync.WaitGroup
}
type job struct {
ctx context.Context
run func(context.Context) error
res chan error
}
func New(parent context.Context, workers, buf int) *Pool {
ctx, cancel := context.WithCancelCause(parent)
p := &Pool{
submit: make(chan job, buf),
ctx: ctx,
cancel: cancel,
}
p.workers = make([]chan job, workers)
p.wg.Add(workers)
for i := 0; i < workers; i++ {
p.workers[i] = make(chan job)
go p.runWorker(p.workers[i])
}
p.wg.Add(1)
go p.dispatch()
return p
}
func (p *Pool) runWorker(ch chan job) {
defer p.wg.Done()
for {
select {
case <-p.ctx.Done():
return
case j, ok := <-ch:
if !ok {
return
}
// Combined context: cancel if either job or pool cancels
jobCtx, jobCancel := context.WithCancel(j.ctx)
stop := context.AfterFunc(p.ctx, jobCancel)
err := j.run(jobCtx)
stop()
jobCancel()
j.res <- err
}
}
}
func (p *Pool) dispatch() {
defer p.wg.Done()
nextWorker := 0
for {
select {
case <-p.ctx.Done():
// close all worker channels
for _, w := range p.workers {
close(w)
}
return
case j := <-p.submit:
// round-robin
target := p.workers[nextWorker]
nextWorker = (nextWorker + 1) % len(p.workers)
select {
case target <- j:
case <-p.ctx.Done():
j.res <- p.ctx.Err()
return
}
}
}
}
func (p *Pool) Submit(ctx context.Context, fn func(context.Context) error) error {
p.mu.Lock()
if p.closed {
p.mu.Unlock()
return errors.New("pool closed")
}
p.mu.Unlock()
res := make(chan error, 1)
j := job{ctx: ctx, run: fn, res: res}
select {
case p.submit <- j:
case <-ctx.Done():
return ctx.Err()
case <-p.ctx.Done():
return p.ctx.Err()
}
select {
case err := <-res:
return err
case <-ctx.Done():
return ctx.Err()
}
}
func (p *Pool) Close() {
p.mu.Lock()
if p.closed {
p.mu.Unlock()
return
}
p.closed = true
p.mu.Unlock()
p.cancel(errors.New("pool closed"))
p.wg.Wait()
}
Features:
- Bounded concurrency via per-worker channels.
- Per-job context combined with pool's context — either cancels the job.
- context.AfterFunc wires the pool's cancellation into the job's context without an extra goroutine.
- Cancellable Submit — respects both the caller's and pool's contexts.
- Cancellable Result — waiting for the result respects the caller's context.
- Graceful Close — cancels everything and joins.
The complexity reflects production realities: every place that can block must be cancellable from multiple directions.
Hidden subtleties of Value¶
context.WithValue is criticised for enabling implicit passing of parameters. But it has real subtleties worth knowing.
Value lookup walks the chain¶
func (c *valueCtx) Value(key any) any {
if c.key == key {
return c.val
}
return value(c.Context, key)
}
func value(c Context, key any) any {
for {
switch ctx := c.(type) {
case *valueCtx:
if key == ctx.key {
return ctx.val
}
c = ctx.Context
case *cancelCtx:
if key == &cancelCtxKey {
return ctx
}
c = ctx.Context
case withoutCancelCtx:
if key == &cancelCtxKey {
return nil
}
c = ctx.c
case *timerCtx:
if key == &cancelCtxKey {
return &ctx.cancelCtx
}
c = ctx.Context
case backgroundCtx, todoCtx:
return nil
default:
return c.Value(key)
}
}
}
A type switch over the context tree. Each layer either matches the key or delegates to the parent. Custom types fall through to c.Value(key) which is the interface method.
The cost is O(depth). For a tree of depth 5, the walk is 5 dispatches. For depth 50 (rare but possible), it is 50.
Comparable keys¶
The key in WithValue must be comparable (since the implementation does ==). String keys work but are dangerous because of collisions: two unrelated packages might both use "user_id" and clash.
The idiomatic key is a private unexported type:
type userIDKey struct{}
ctx = context.WithValue(ctx, userIDKey{}, uid)
// elsewhere:
uid, _ := ctx.Value(userIDKey{}).(string)
The struct type is unique to the package; no collision possible.
Performance of Value¶
In a hot path, Value is a measurable cost. The standard library uses it sparingly. If you find yourself doing ctx.Value in inner loops, cache the result outside the loop.
Custom Context Implementations¶
The Context interface is public; you can implement it. Common reasons:
- Test fixtures (a fakeContext with a controlled Done channel).
- Tracing wrappers that add span IDs.
- Aspect-oriented contexts that record values automatically.
A minimal custom implementation:
type myContext struct {
parent context.Context
done <-chan struct{}
err error
}
func (m *myContext) Deadline() (time.Time, bool) { return m.parent.Deadline() }
func (m *myContext) Done() <-chan struct{} { return m.done }
func (m *myContext) Err() error { return m.err }
func (m *myContext) Value(key any) any { return m.parent.Value(key) }
The catch: as noted in propagateCancel, if your custom type is not a *cancelCtx, the standard library will spawn a watcher goroutine for every WithCancel(myContext) call. This is correct but wasteful at scale.
For performance-critical code, the practical workaround is to wrap a standard cancellable context (one obtained from WithCancel) and forward both Done() and Value() to it unchanged; parentCancelCtx then finds the underlying cancelCtx through the Value lookup and the fast path still applies. The internal detail of "I am cancellable" is communicated via the Value method returning the cancel context for the magic cancelCtxKey. This relies on unexported behaviour; prefer not to customise unless necessary.
Cancellation Latency in the Scheduler¶
A measurement: how long from cancel() to a goroutine's first instruction after <-ctx.Done()?
On a modern x86 server:
- cancel() itself: a few hundred nanoseconds (lock + close + iterate children).
- Goroutine wake-up: on the order of microseconds, depending on scheduler state.
- First instruction: another few hundred nanoseconds.
Total: 1-10 microseconds for a simple cancellation reaching a single waiting goroutine. For N waiters, the time is O(N) for the close itself but parallel for the post-wake-up execution.
Under heavy CPU contention (many goroutines runnable, few CPUs available), the latency can be 100 microseconds to milliseconds. The wake itself is fast; getting CPU time to run is the bottleneck.
Implications¶
For pipelines that cancel rarely (once per request), latency is negligible. For pipelines that re-derive contexts in tight loops (a WithTimeout per item, for example), the allocation and scheduler interaction become measurable.
The optimisation: hoist context creation out of inner loops when possible. Re-use a single context for many iterations; check ctx.Err() inside the loop.
Cancellation in slow-tail systems¶
Some workloads have predictable slow tails: 99% of requests complete in 10 ms, but 1% take 5 seconds. The 1% is the long tail; the 99% is the fast path.
Cancellation is the tool for managing the long tail: cancel slow requests so resources stay available for the fast path.
ctx, cancel := context.WithTimeout(ctx, time.Duration(maxLatencyMs)*time.Millisecond)
defer cancel()
Set maxLatencyMs near the p99 of the fast path. Anything slower is cancelled, freeing the goroutine and resources for the next request.
The trade-off: a small fraction of requests fail with timeout. They are the requests that would have eaten resources from the rest.
This pattern is sometimes called "deadline propagation" or "tail amputation." It is essential for high-throughput services where the long tail can starve everyone.
Cancellation in HTTP middleware chains¶
A typical HTTP server has a chain of middleware: logging, auth, tracing, etc. Each middleware can affect the context.
func handler(w http.ResponseWriter, r *http.Request) {
// ... handler uses r.Context() ...
}
mux := http.NewServeMux()
mux.HandleFunc("/", handler)
var srv http.Handler = mux
srv = withLogging(srv)
srv = withTracing(srv)
srv = withTimeout(srv, 30*time.Second)
srv = withAuth(srv)
Each middleware wraps the next. The request flows from outermost to innermost; the response flows back.
For cancellation:
- withTimeout derives a new context with a deadline and replaces r.Context().
- withTracing adds a span to the context.
- withAuth may add user info to the context.
By the time the handler runs, r.Context() is a derived context with all the middleware's modifications. Cancellation cascades from the original request context to the deadline-narrowed handler context.
Implementation: withTimeout¶
func withTimeout(next http.Handler, d time.Duration) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), d)
defer cancel()
next.ServeHTTP(w, r.WithContext(ctx))
})
}
The new context replaces the request's; downstream sees the tighter deadline. The defer fires after the handler returns.
Implementation: withTracing¶
func withTracing(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
ctx, span := tracer.Start(r.Context(), r.URL.Path)
defer span.End()
next.ServeHTTP(w, r.WithContext(ctx))
})
}
The trace span is added to the context. The span's lifetime is bounded by the handler's execution (via defer).
Cancellation issues in middleware¶
A poorly written middleware can break cancellation:
// BAD: ignores cancellation
func badMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
time.Sleep(time.Second) // ignores ctx; blocks for full second
next.ServeHTTP(w, r)
})
}
Even if the client disconnects, the middleware sleeps for the full second. Latency budget burned.
// GOOD
func goodMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
select {
case <-time.After(time.Second):
case <-r.Context().Done():
return
}
next.ServeHTTP(w, r)
})
}
Cancellable sleep.
Lesson: every middleware should be cancellation-aware, even seemingly innocuous ones.
Cancellation under chaos (Jepsen-like testing)¶
For truly critical systems, treat cancellation paths as you would treat any other behaviour: test them under chaos. Jepsen-style testing introduces network partitions, process kills, and clock jumps to validate that the system handles them gracefully.
For Go pipelines, simpler chaos tests are valuable:
- Inject random delays at network calls.
- Inject random cancellations.
- Kill a random worker periodically.
- Simulate clock jumps (test with time.Now mocking).
func TestChaos(t *testing.T) {
for i := 0; i < 1000; i++ {
ctx, cancel := context.WithTimeout(context.Background(),
time.Duration(rand.Intn(100)+1)*time.Millisecond)
err := runPipeline(ctx)
cancel() // release the timer each iteration; defer in a loop would pile up until the test returns
if err != nil {
// log but continue
t.Logf("iteration %d: %v", i, err)
}
}
// After all iterations, verify no goroutine leaks
runtime.GC()
n := runtime.NumGoroutine()
if n > 5 {
t.Errorf("leaked goroutines: %d", n)
}
}
1000 iterations with random timeouts. Catches a wide variety of timing bugs.
For production systems, run chaos in staging: kill random pods, inject network delays, etc. The system should degrade gracefully and recover quickly. Cancellation paths are heavily exercised by chaos.
Cancellation in transactional pipelines¶
For pipelines that need exactly-once semantics (rare but valuable), cancellation interacts with the transaction protocol.
The pattern: batch a series of changes; commit only when the whole batch succeeds; rollback on any error or cancellation.
func transactionalPipeline(ctx context.Context, items []Item) error {
tx, err := db.BeginTx(ctx, nil)
if err != nil {
return err
}
defer tx.Rollback() // no-op if Commit succeeded
for _, item := range items {
if err := tx.ExecContext(ctx, "INSERT ...", item); err != nil {
return err
}
if ctx.Err() != nil {
return ctx.Err()
}
}
return tx.Commit()
}
Cancellation during the loop returns; defer tx.Rollback aborts. No partial state visible.
Cancellation during Commit is tricky. The commit may have been written to the database before the cancel signal arrived. The transaction may be committed even though the function returns context.Canceled.
This is fundamental: distributed commits are not atomic from the client's perspective. The transaction is or is not committed; the cancellation result is ambiguous from the client side.
Mitigation: query the database after a cancelled commit to determine actual state. Or use two-phase commit semantics if the database supports them.
For most pipelines, the simpler approach is good enough: rely on defer Rollback for the common case; accept the ambiguity for the rare race.
Cancellation in connection multiplexing¶
A common pattern: many logical streams over one physical connection. gRPC over HTTP/2 is the classic example. Each stream has its own context; cancellation is per-stream, not per-connection.
The implementation: the framework tracks streams by ID. Cancelling stream N sends RST_STREAM(N); other streams continue.
For Go applications using gRPC, this is automatic — each call has its own context, and cancellation doesn't affect other calls on the same connection.
For custom multiplexed protocols (e.g. a streaming WebSocket-of-streams), you need to implement per-stream context handling:
type Mux struct {
conn net.Conn
streams map[uint32]*Stream
mu sync.Mutex
}
type Stream struct {
id uint32
ctx context.Context
cancel context.CancelFunc
}
func (m *Mux) NewStream(parent context.Context) *Stream {
m.mu.Lock()
defer m.mu.Unlock()
id := nextID()
ctx, cancel := context.WithCancel(parent)
s := &Stream{id: id, ctx: ctx, cancel: cancel}
m.streams[id] = s
return s
}
func (m *Mux) CancelStream(id uint32) {
m.mu.Lock()
defer m.mu.Unlock()
if s, ok := m.streams[id]; ok {
s.cancel()
delete(m.streams, id)
}
}
Each stream's context is independent. Cancellation affects only the targeted stream.
Edge cases catalogue¶
A reference of edge cases professionals should know:
- Deriving from a nil parent context: panic. Always use Background or TODO as the root.
- Cancel a context that was never used: no-op; just releases the timer (if any).
- Done() on Background: returns nil; select on nil never fires.
- Value lookup on cancelled context: works fine; values are not invalidated.
- Receive on closed Done after cancel: returns immediately with zero value.
- WithCancel on cancelled parent: child is born cancelled; Done is pre-closed.
- WithTimeout(parent, 0): cancels immediately with DeadlineExceeded.
- WithTimeout(parent, negativeDuration): cancels immediately.
- WithDeadline(parent, pastTime): cancels immediately.
- WithCancel(nilParent): panic.
- Multiple cancel calls: idempotent.
- AfterFunc on cancelled context: callback runs immediately.
- Cause(ctx) before any cancel: returns nil (or parent's cause if any).
- WithoutCancel(cancelledParent): detached (Err() is nil, Done() is nil); parent values are still inherited.
- Concurrent cancel and read of Done: safe; reads see the close if after.
- Cancel during Done channel allocation race: handled by lazy init + closedchan.
Each edge case has a defined behaviour; the package source documents most of them. Professionals know the corners; production code respects them.
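Two of the corners above, demonstrated in a short sketch:
parent, cancel := context.WithCancel(context.Background())
cancel()

// WithCancel on a cancelled parent: the child is born cancelled.
child, childCancel := context.WithCancel(parent)
defer childCancel()
fmt.Println(child.Err()) // context.Canceled

// WithTimeout(parent, 0): cancelled immediately with DeadlineExceeded.
zero, zeroCancel := context.WithTimeout(context.Background(), 0)
defer zeroCancel()
<-zero.Done() // already closed
fmt.Println(zero.Err()) // context.DeadlineExceeded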
Cancellation through gRPC gateway and reverse proxies¶
A reverse proxy (Envoy, nginx) routing to a Go backend forwards cancellation by closing the upstream connection. The Go backend's r.Context() sees the cancel.
For gRPC gateway translating HTTP to gRPC: similar pattern. The gateway propagates the context cancellation to the gRPC call. The backend gRPC server sees its ctx cancel.
This works automatically if both ends use context-aware libraries. The chain:
client --(HTTP)--> gateway --(gRPC)--> backend
| | |
v v v
disconnects ctx cancels ctx cancels
(gateway code (backend code
sees this) sees this)
Each hop forwards the cancellation in its own way (TCP RST for HTTP, RST_STREAM for gRPC). The Go context tree on each hop sees the cancel.
Mid-chain cancellation hooks¶
A proxy or gateway may want to log cancellations as they pass through:
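// Sketch: plain net/http middleware using context.AfterFunc (Go 1.21+).
// The handler and log message are illustrative.
func withCancelLogging(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        stop := context.AfterFunc(r.Context(), func() {
            log.Printf("client gave up on %s after %v", r.URL.Path, time.Since(start))
        })
        defer stop()
        next.ServeHTTP(w, r)
    })
}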
The hook fires when the proxy detects the upstream closed. Useful for tracking client patience.
Cancellation in async stacks (planning for generics-era patterns)¶
Go generics (1.18+) enable typed channels and typed pipelines. Cancellation works the same way; the type system just makes signatures more precise.
type Pipeline[In, Out any] struct {
in <-chan In
out chan Out
}
func (p *Pipeline[In, Out]) Run(ctx context.Context, fn func(In) Out) {
defer close(p.out)
for v := range p.in {
result := fn(v)
select {
case p.out <- result:
case <-ctx.Done():
return
}
}
}
Same cancellation semantics; type parameter does not affect runtime behaviour. The benefit is compile-time safety: you cannot accidentally connect a Pipeline[int, int] to a Pipeline[string, string].
Libraries like github.com/sourcegraph/conc use generics to provide typed wrappers around errgroup-style patterns. Cancellation is built in.
Memory Footprint of the Context Tree¶
Each context node is small:
- emptyCtx: 0 bytes (singletons).
- valueCtx: ~48 bytes (header + key + value + parent pointer).
- cancelCtx: ~80 bytes (header + done atomic + map header + err + cause + parent).
- timerCtx: ~120 bytes (cancelCtx + timer struct).
A request that derives 10 contexts uses about 1 KB. For 10 000 concurrent requests, 10 MB. Not free but not large.
The children map allocations can add up: for a parent with 1000 children, the map header plus entries is ~40 KB. Frequently-changing parents (many WithCancel and immediate cancel) churn allocations.
When the footprint matters¶
Most services do not feel the context memory cost. Services with very high request rates (1M+ RPS) may. Profile with pprof -memprofile to see if context allocations are a top contributor.
A closing perspective: cancellation discipline as engineering culture¶
In organisations that take cancellation seriously, you see specific cultural patterns:
- Code reviews always check cancellation paths.
- Onboarding materials emphasise the "every goroutine has an exit" rule.
- Shared libraries expose only context-aware APIs.
- Postmortems systematically identify cancellation contributors.
- Tooling (linters, leak detectors) is set up by default in CI.
Without these, the same engineering team writes the same cancellation bugs repeatedly. The technical solutions (context, errgroup, etc.) are easy; the discipline is hard.
Investing in cancellation as a first-class concern pays dividends: faster shutdowns, fewer leaks, easier debugging, fewer incidents. The cost is the up-front time to design and the ongoing time to review. Compared to the cost of a single major leak incident, the investment is trivial.
A note on cancellation and observable side-effects¶
Cancellation interrupts work in progress. Side effects already done are not undone. This is a fundamental property; understanding it shapes design.
Examples:
- Database INSERT cancelled mid-write: the database driver may roll back, but if multiple INSERTs were committed before the cancel, those persist.
- HTTP request cancelled mid-write: the server may have partial bytes in its buffer; the wire may have partial data. The receiver sees an error.
- File write cancelled: bytes already written are on disk; the file may be partially written.
For safety:
- Use atomic operations where possible (rename for files, single INSERT for DB rows).
- Use transactions for multi-step database work.
- Use checksums or completion markers for files.
- Design recovery procedures for partial state.
The cancellation primitive does not save you from partial state. It interrupts when told; the cleanup is your responsibility.
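A sketch of the atomic-rename mitigation for files; the paths and permissions are illustrative:
func writeFileAtomic(ctx context.Context, path string, data []byte) error {
    tmp := path + ".tmp"
    if err := os.WriteFile(tmp, data, 0o644); err != nil {
        return err
    }
    if err := ctx.Err(); err != nil {
        os.Remove(tmp) // cancelled mid-way: discard the partial artifact
        return err
    }
    // Rename is atomic on POSIX filesystems: readers see either the old
    // file or the complete new one, never a partial write.
    return os.Rename(tmp, path)
}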
Cancellation budget allocation algorithms¶
For systems with strict per-step latency requirements, deadline allocation among steps becomes algorithmic.
Algorithm 1: equal split¶
Each step gets remaining / N budget.
func allocate(remaining time.Duration, steps int) time.Duration {
return remaining / time.Duration(steps)
}
Simple. Steps that finish early do not benefit later ones.
Algorithm 2: priority-weighted¶
Steps have priorities; budget is allocated proportionally.
func allocate(remaining time.Duration, weights []int) []time.Duration {
total := 0
for _, w := range weights {
total += w
}
result := make([]time.Duration, len(weights))
for i, w := range weights {
result[i] = remaining * time.Duration(w) / time.Duration(total)
}
return result
}
A critical step gets a larger fraction.
Algorithm 3: adaptive¶
Steps' historical durations are tracked. The budget is allocated based on past observation.
type Budget struct {
history []time.Duration
avg time.Duration
}
func (b *Budget) Allocate(remaining time.Duration, steps int) []time.Duration {
// Use historical averages to allocate proportionally
// ...
}
More sophisticated. Useful when step durations vary significantly.
Algorithm 4: greedy with floor¶
Each step gets min(remaining, max_step_time). Bounds per-step time absolutely.
func allocate(remaining time.Duration, maxPerStep time.Duration) time.Duration {
if remaining < maxPerStep {
return remaining
}
return maxPerStep
}
Prevents any single step from monopolising the budget.
Most production systems use a combination: a maximum per-step (algorithm 4), with dynamic allocation among steps (algorithm 3 or 2).
The cancellation point: every step has a context with the allocated budget. The step's deadline fires; the work cancels; the result either succeeds or fails. The orchestrator moves to the next step with the remaining parent budget.
Garbage Collection of Contexts¶
Contexts are tracked by the GC like any other Go object. The cycle:
- A context is allocated by WithCancel etc.
- The returned context and cancel function are held by the caller.
- When the caller's frame returns (or defer cancel() runs), the references are dropped.
- The context is unreachable; GC collects it on the next cycle.
The defer cancel() is important not only for releasing the timer (in WithTimeout) but also for removing the context from its parent's children map. Without cancel, the parent retains a reference to the child indefinitely (until the parent itself is cancelled).
Long-lived parents (e.g. an HTTP server context that lives for the server's lifetime) can accumulate children if cancels are forgotten. This is the "lostcancel" leak.
go vet catches the easy cases:
./main.go:42: the cancel function returned by context.WithCancel should be called, not discarded, to avoid a context leak
But complex flows can hide the leak. Manual defer cancel() is the safety net.
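A hedged example of a flow that the lostcancel check does not flag; watchKey and registerCancel are hypothetical helpers. The cancel function escapes into a registry, so vet sees it "used", but whether it is ever called depends on code far away.
func watchAll(parent context.Context, keys []string) {
    for _, key := range keys {
        ctx, cancel := context.WithCancel(parent)
        go watchKey(ctx, key)
        // Passing cancel away satisfies vet, but the child stays in parent's
        // children map until someone actually calls it.
        registerCancel(key, cancel)
    }
}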
Performance Benchmarks¶
Some measurements on a typical modern machine (Apple M1 / Go 1.22):
WithCancel cost¶
func BenchmarkWithCancel(b *testing.B) {
parent := context.Background()
for i := 0; i < b.N; i++ {
_, cancel := context.WithCancel(parent)
cancel()
}
}
About 200 ns per iteration. Includes allocation, mutex, and cancel.
select with ctx.Done()¶
func BenchmarkSelectCtxDone(b *testing.B) {
ctx := context.Background()
ch := make(chan int, 1)
ch <- 1
for i := 0; i < b.N; i++ {
select {
case <-ch:
case <-ctx.Done():
}
ch <- 1
}
}
Around 30 ns per iteration. Background().Done() returns a nil channel, and a receive from a nil channel can never proceed, so the select effectively reduces to the ready buffered receive.
Cancellation broadcast¶
func BenchmarkCancelBroadcast(b *testing.B) {
for i := 0; i < b.N; i++ {
ctx, cancel := context.WithCancel(context.Background())
var wg sync.WaitGroup
for j := 0; j < 100; j++ {
wg.Add(1)
go func() {
defer wg.Done()
<-ctx.Done()
}()
}
cancel()
wg.Wait()
}
}
About 50 microseconds per iteration. Most of it is the WaitGroup join; the actual broadcast is sub-microsecond for 100 receivers.
Take-aways¶
- Cancellation is fast — usually fast enough to ignore.
- Context allocation has a small cost; avoid in tight inner loops.
- Broadcast scales with receivers; very large fan-outs need care.
Cancellation in critical sections¶
A subtle interaction with sync.Mutex. A goroutine holding a mutex cannot be cancelled until it releases. If the work inside the critical section is long, cancellation latency suffers.
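The shape under discussion looks roughly like this; mu and heavyWork are placeholders, not code from earlier sections.
func update(ctx context.Context, mu *sync.Mutex) error {
    mu.Lock()
    defer mu.Unlock() // released on any return, including a cancellation return
    return heavyWork(ctx) // runs while holding the lock
}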
Cancellation arrives. heavyWork respects context and returns quickly. The defer fires; the mutex releases. Cancellation works through.
But if heavyWork does not respect context, the goroutine holds the mutex for its full duration. Other goroutines waiting on the mutex cannot proceed. They are not cancellable because they are in Lock().
Mitigation:
- Make heavyWork cancellable.
- Use a context-aware lock as shown earlier.
- Reduce the scope of the lock (do the heavy work outside).
The right answer: minimise mutex hold time. The standard advice "do not hold a mutex during I/O or expensive work" is doubly true under cancellation.
sync.RWMutex¶
Same considerations. A read-holder blocks writers; a write-holder blocks everyone. Cancellation cannot interrupt a RLock or Lock call.
For cancellable RWMutex, build on channels:
type CtxRWMutex struct {
write chan struct{}
readSem chan struct{}
}
func New(maxReaders int) *CtxRWMutex {
return &CtxRWMutex{
write: make(chan struct{}, 1),
readSem: make(chan struct{}, maxReaders),
}
}
func (m *CtxRWMutex) RLock(ctx context.Context) error {
select {
case m.readSem <- struct{}{}:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
func (m *CtxRWMutex) RUnlock() {
<-m.readSem
}
func (m *CtxRWMutex) Lock(ctx context.Context) error {
// ... acquire write lock, wait for readers ...
return nil
}
Custom locking is rarely worth it. Better: avoid long-held locks.
Cancellation when contexts are stored vs passed¶
context.Context is conventionally passed as the first parameter, not stored in struct fields. There is one common exception: long-lived components that need a stable context for their internal goroutines.
type Server struct {
ctx context.Context
cancel context.CancelFunc
// ...
}
func NewServer(parent context.Context) *Server {
ctx, cancel := context.WithCancel(parent)
return &Server{ctx: ctx, cancel: cancel}
}
func (s *Server) Start() {
go s.run()
}
func (s *Server) Stop() {
s.cancel()
}
The server stores ctx and cancel so its internal goroutines can use them. The public API is Start/Stop, not "pass me a context."
This pattern is sometimes criticised as anti-idiomatic. The defense: the context lifetime matches the server's lifetime; storing it makes that explicit.
But there is a trap: methods on Server should not also accept a context. If they do, which one wins? Convention:
- Methods that do per-call work accept ctx and pass it down, possibly combining with the server's context if they need both.
- Methods that affect the server itself (like Stop) use the stored context.
A method that takes ctx and ignores it (using s.ctx instead) is a bug — the caller is asking for the method to be cancellable, and you are ignoring them.
Cancellation Under High Concurrency¶
In a service handling 10 000+ concurrent goroutines, cancellation has subtle behaviours.
Thundering herd¶
cancel() wakes all waiters at once. If they all then try to acquire a shared lock or do I/O, they contend. The wake-up is fast; the contention afterwards is what matters.
Mitigation: stagger the work after wake-up. For example, each goroutine sleeps for a random small interval before contending.
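A minimal sketch of the staggering idea, assuming the contended resource is a mutex-protected flush; flushMu and flush are illustrative.
func waiter(ctx context.Context, flushMu *sync.Mutex, flush func()) {
    <-ctx.Done()
    // Spread the post-cancel work over a few milliseconds so thousands of
    // goroutines do not hit the shared lock in the same instant.
    time.Sleep(time.Duration(rand.Intn(5)) * time.Millisecond)
    flushMu.Lock()
    flush()
    flushMu.Unlock()
}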
Scheduler thrash¶
10 000 near-simultaneous wake-ups mean the scheduler suddenly has 10 000 runnable goroutines and a limited number of CPUs. The throughput is OK (the work gets done), but per-goroutine latency rises.
Mitigation: design for this — assume cancellation latency in a busy service is milliseconds, not microseconds. Plan SLAs accordingly.
Inner-loop contention¶
A pipeline that processes shared state (a counter, a map) under cancellation may see lock contention spike at shutdown. Every goroutine wakes and tries to flush its state.
Mitigation: per-goroutine state with batch reduction at exit. Each goroutine writes its local state to a per-goroutine slot; a single coordinator collects after all have exited.
High-volume cancellation: 100k goroutines on shutdown¶
A scenario: a server handling 100 000 concurrent connections, each with its own goroutine. On SIGTERM, every goroutine must exit.
The breakdown:
- cancel() on the root context: locks, closes the done channel, walks children.
- The 100 000 goroutines are not direct children (they are derived through per-request contexts), so the cascade propagates through the per-request tree.
- Each per-request context has its own goroutine (the request handler) plus possibly nested children.
The wake-up storm: 100 000 goroutines become runnable simultaneously. The Go scheduler distributes them across P's. With GOMAXPROCS=8, each P sees ~12 500 wake-ups.
Each goroutine, on waking, runs its defer chain (closes channels, releases resources, etc.). The total work is bounded by per-goroutine cleanup time.
Measured latency on a typical 8-core server: ~50 milliseconds from cancel() to the last goroutine exiting, assuming each goroutine does minimal cleanup work.
If per-cleanup work is heavier (e.g. flushing a buffer to disk), latency grows accordingly. For "100 000 goroutines, each flushing 10 ms of work to disk," shutdown takes seconds because disk I/O serialises.
Implication for design¶
For services with this scale of concurrency:
- Shutdown SLA should account for the storm.
- Cleanup per goroutine should be minimal.
- Bulk cleanup (batched I/O) is better than per-goroutine I/O.
- Test shutdown under production-like load, not just unit tests.
Cancellation Race Conditions¶
A few race conditions cancellation code must defend against.
Race: read after close¶
Closing a channel while a sender is blocked on it causes that send to panic. The sender never observes the close as data; it gets a runtime panic ("send on closed channel").
In cancellation code, the producer owns the close; nobody else closes its output. This avoids the race.
Race: cancel during close¶
A producer is about to close its channel:
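The producer being discussed has roughly this shape; values and out are placeholders.
func produce(ctx context.Context, values []int, out chan<- int) {
    defer close(out) // the owner closes its output exactly once
    for _, v := range values {
        select {
        case out <- v:
        case <-ctx.Done():
            return // the cancel case was picked; the deferred close still runs
        }
    }
}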
A consumer is reading the channel. Cancel fires. The producer's select picks the cancel case and returns; the defer close(out) runs. The consumer's range sees the close and exits.
No race here — the close is atomic, the read is atomic. The race that would exist is if two goroutines tried to close the same channel; the convention "owner closes" avoids that.
Race: cancel during external I/O¶
A goroutine is in db.Query(...). cancel() fires. The goroutine is blocked in the driver, which is in the OS. The cancel does not reach it directly.
The cure: use db.QueryContext(ctx, ...). The driver itself watches ctx.Done() and sends a cancel to the database. The blocked goroutine eventually returns with context.Canceled.
Without context-aware drivers, the cancellation is effectively delayed until the I/O completes naturally. The race is "did the I/O finish before the cancel deadline?"
Race: dual cancel sources¶
Two goroutines both call cancel() "simultaneously":
go func() {
if err := work(); err != nil {
cancel()
}
}()
go func() {
<-time.After(time.Second)
cancel()
}()
Idempotency saves us: the first cancel wins, the second is a no-op. No race.
But WithCancelCause records the cause from the first call. If you want a specific cause, ensure it is the first call. This is sometimes a race in practice.
Race: cancel before observe¶
ctx, cancel := context.WithCancel(parent)
cancel() // before anyone observes
done := ctx.Done() // returns closedchan (pre-closed)
<-done // immediately fires
No race. The closedchan optimisation handles "cancel before observe" cleanly.
Specific races involving select and cancellation¶
Race A: send picked first when cancel was already triggered¶
go func() {
cancel()
log.Println("cancel called")
}()
select {
case out <- v:
log.Println("sent")
case <-ctx.Done():
log.Println("cancelled")
}
If both cases are ready when select evaluates, it picks at random. Even though cancel() was called "before" the select (in source order), the goroutine ordering means select may pick "sent."
This is correct behaviour — see the Go spec on select. But it sometimes surprises engineers. The mental model: cancellation is eventually delivered, not instant.
Implication: do not write code that relies on "after cancel() no more values flow." There is always a small window of in-flight values.
Race B: close happens after read finished¶
ch := make(chan int, 1)
ch <- 1
go func() {
<-ch
cancel()
close(ch) // not really; ch is already empty
}()
select {
case v := <-ch:
use(v)
case <-ctx.Done():
}
This is poorly written but illustrative. The point: if cancel() happens after <-ch succeeded, the select on <-ch does not see the cancel — it already returned.
Race C: closing a channel while another goroutine selects¶
ch := make(chan int)
go func() {
close(ch)
}()
select {
case v, ok := <-ch:
if !ok { return }
use(v)
case <-ctx.Done():
return
}
If close(ch) and cancel(ctx) happen simultaneously, the select sees both cases ready. Random choice. If "channel closed" wins, ok == false. If "ctx done" wins, the cancel branch runs.
Both are valid outcomes. The code must handle either.
Race D: deadline fires during send¶
ctx is a deadline context. The deadline fires precisely as the send becomes possible. Select picks at random. Either outcome (sent or cancelled) is valid.
This is the same as race A; deadlines fire by calling cancel on the context, so the race is identical.
Defensive coding¶
For correctness, your code must work in all interleavings allowed by the runtime. Test patterns:
- Stress tests with many iterations (catches frequent races).
- Random delay injection (forces uncommon interleavings).
- Race detector (catches memory races between cancellation and data flow).
Cancellation under contention: a benchmark¶
What is the actual cost of cancellation under high contention? Let me design an experiment.
func BenchmarkCancelManyWaiters(b *testing.B) {
for _, n := range []int{1, 10, 100, 1000, 10000, 100000} {
b.Run(fmt.Sprintf("n=%d", n), func(b *testing.B) {
for i := 0; i < b.N; i++ {
ctx, cancel := context.WithCancel(context.Background())
var wg sync.WaitGroup
wg.Add(n)
for j := 0; j < n; j++ {
go func() {
defer wg.Done()
<-ctx.Done()
}()
}
cancel()
wg.Wait()
}
})
}
}
Indicative results (on an Apple M1 Pro, Go 1.22):
| n waiters | time per iteration |
|---|---|
| 1 | ~2 microseconds |
| 10 | ~10 microseconds |
| 100 | ~100 microseconds |
| 1000 | ~1 millisecond |
| 10000 | ~12 milliseconds |
| 100000 | ~150 milliseconds |
Approximately linear in n. The cost is the wake-up storm plus the WaitGroup join (which is also O(n)).
For most pipelines, n is small (10-100). Cancellation latency is sub-millisecond. For services with many concurrent waiters, the latency can exceed a millisecond. Plan SLAs accordingly.
Optimisation: hierarchical fan-out¶
For very large fan-outs (millions of goroutines), the linear cost of broadcast becomes a bottleneck. Hierarchical fan-out splits the waiters into groups, each watching a sub-context:
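A sketch of the structure, assuming three intermediate group contexts (g1, g2, g3) and a worker function; the grouping and names are illustrative.
func runHierarchical(root context.Context, n int, worker func(context.Context, int)) {
    g1, c1 := context.WithCancel(root)
    g2, c2 := context.WithCancel(root)
    g3, c3 := context.WithCancel(root)
    defer c1()
    defer c2()
    defer c3()
    groups := []context.Context{g1, g2, g3}
    for i := 0; i < n; i++ {
        go worker(groups[i%len(groups)], i) // each worker watches its group, not root
    }
    <-root.Done() // cancelling root cascades to g1, g2, g3, which wake their own waiters
}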
cancel() on root cancels g1, g2, g3 in sequence; each then cancels its workers in parallel. Total time: O(log n) instead of O(n) for the broadcast.
This is over-engineering for n < 1000. For n > 100 000, it can be necessary.
A useful technique: capability-based contexts¶
Carry capabilities (or "permissions") via context values, then cancel them by replacing the value. This is rare in Go but appears in some advanced libraries.
type capabilityKey struct{}
type Cap interface {
Allow(action string) bool
}
func WithCap(ctx context.Context, c Cap) context.Context {
return context.WithValue(ctx, capabilityKey{}, c)
}
func CapFrom(ctx context.Context) Cap {
c, _ := ctx.Value(capabilityKey{}).(Cap)
return c
}
func protected(ctx context.Context, action string) error {
c := CapFrom(ctx)
if c == nil || !c.Allow(action) {
return errors.New("forbidden")
}
return doIt(ctx, action)
}
Combined with cancellation: a separate path can "revoke" a capability by injecting a new one (a fresh context with revoked permissions) into downstream calls.
This is over-engineered for most apps but appears in security-sensitive systems.
Cancellation interaction with reflection and unsafe¶
Reflection (reflect) does not have a context-aware mode. Calls into reflection (e.g. reflect.Value.Call) are not cancellable from the outside. If your code does heavy reflection inside a stage, that work is uninterruptible.
Mitigation: do reflection-driven work in small chunks, check ctx.Err() between chunks.
unsafe operations are even more opaque. Pointer arithmetic, conversion between types — none of these are context-aware. They are also fast, so cancellation latency is rarely an issue.
The pattern: structure your code so that reflection or unsafe work is bounded. Cancellation can wait until the small bounded chunk completes.
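A sketch of chunked, cancellable reflection work; copying fields between two struct values of the same type stands in for "heavy reflection".
func copyFields(ctx context.Context, dst, src reflect.Value) error {
    // dst must be an addressable struct value of the same type as src.
    for i := 0; i < src.NumField(); i++ {
        if i%64 == 0 { // poll between chunks, not on every field
            if err := ctx.Err(); err != nil {
                return err
            }
        }
        dst.Field(i).Set(src.Field(i))
    }
    return nil
}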
Cancellation and panic interaction¶
A panic propagates up through deferred calls. If a deferred call calls recover, the panic is caught. If not, the goroutine dies.
Cancellation interaction:
- A defer cancel() runs during a panic. The cancellation fires.
- If recover() is in the deferred chain, the panic is caught; subsequent code runs (including the cancel).
- The cancel cascades to other goroutines normally.
This means panics naturally trigger cancellation if you structure the deferred calls correctly:
go func() {
defer cancel() // cancels on any return, including panic
defer func() {
if r := recover(); r != nil {
log.Printf("panic: %v", r)
}
}()
work(ctx)
}()
A panic in work triggers the deferred recover (logs); the deferred cancel runs; other goroutines watching ctx exit.
Panic in another goroutine¶
A panic in goroutine A does not affect goroutine B unless A's cancel cascades to B. If A's panic happens before cancel() runs, the process dies (the panic propagates up A's stack and the runtime kills the process).
To survive panics in goroutines, each goroutine that runs potentially-panicking code must have its own recover:
go func() {
defer func() {
if r := recover(); r != nil {
cancel() // cancel siblings
}
}()
work(ctx)
}()
The pattern: recover plus cancel — convert a panic into a graceful cancellation of the group.
Cross-process Implementations¶
Cancellation across processes uses different mechanisms than in-process.
gRPC¶
gRPC uses HTTP/2 streams. Cancellation is signalled by sending RST_STREAM. The receiver's stream errors out; the framework's handler sees the context cancel.
Deadlines are encoded as the grpc-timeout metadata header. The receiver decodes and reconstructs a context with the same remaining deadline.
The implementation handles edge cases:
- Network delay: the client's deadline minus network RTT is the actual deadline visible to the server.
- Clock skew: nodes may disagree on absolute time, so deadlines are encoded as durations remaining, not timestamps.
- Cancellation reasons: gRPC has status codes (Canceled, DeadlineExceeded) that map cleanly to Go context errors.
HTTP/2¶
HTTP/2 has explicit stream cancellation via RST_STREAM. The Go net/http server detects this and cancels r.Context(). The Go client similarly emits the frame when its context cancels.
For HTTP/1.1, the only mechanism is closing the TCP connection. This is heavier than RST_STREAM (kills any pipelined requests) and slower (TCP teardown is a multi-step protocol).
Message queues¶
Cancellation in a queue is asymmetric. Consumers can stop consuming, but published messages cannot be recalled. The producer can:
- Set a TTL on messages, so old ones expire.
- Use idempotency keys so duplicates are deduplicated.
- Use cooperative cancellation via a control channel.
The control-channel pattern: a separate topic broadcasts "cancel job X." Workers consume the control topic and stop processing X. This is heavyweight; only worth it for long-running jobs.
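A sketch of the worker side of that pattern; ControlMsg, Job, and the run callback are illustrative types, not a real queue client API.
type ControlMsg struct{ JobID string }
type Job struct{ ID string }
type Worker struct {
    mu        sync.Mutex
    cancelled map[string]bool // job IDs seen on the control topic
}
func (w *Worker) handleControl(msg ControlMsg) {
    w.mu.Lock()
    w.cancelled[msg.JobID] = true
    w.mu.Unlock()
}
func (w *Worker) process(ctx context.Context, job Job, run func(context.Context, Job) error) error {
    w.mu.Lock()
    skip := w.cancelled[job.ID]
    w.mu.Unlock()
    if skip {
        return nil // cancelled out of band: acknowledge and drop
    }
    return run(ctx, job)
}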
Deep dive: the select statement under the hood¶
A select with N cases compiles to a call to runtime.selectgo. The internals (simplified from runtime/select.go):
func selectgo(cas0 *scase, order0 *uint16, ncases int) (int, bool) {
// ... build pollOrder (random permutation of cases) and lockOrder (sorted) ...
// Pass 1: look for already-ready cases
for _, casi := range pollOrder {
switch cas := &cas0[casi]; cas.kind {
case caseRecv:
if cas.c.qcount > 0 || cas.c.closed != 0 {
return casi, true
}
case caseSend:
if cas.c.qcount < cas.c.dataqsiz {
return casi, true
}
}
}
// Pass 2: park goroutine in each case's queue
for _, casi := range lockOrder {
// enqueue current goroutine in the channel's wait queue
}
// Park; wait for wake
gopark(...)
// Pass 3: dequeue from other cases (the case that fired removed itself)
// ...
return firedCase, ok
}
Key points:
- The pollOrder is random, ensuring fairness across cases.
- The lockOrder is deterministic (sorted by channel address), preventing deadlocks during enqueue.
- The goroutine parks via gopark; wakes via goready from the case that fires.
For cancellation: when cancel closes ctx.Done(), every goroutine parked in a select where <-ctx.Done() is a case is woken via goready. They each resume selectgo at "pass 3," determine that the <-ctx.Done() case fired, and return.
The cost of selectgo:
- N comparisons to build the order.
- N enqueues.
- 1 park (if no case is ready).
- N dequeues on wake.
For N = 2 (typical: cancel + work), the cost is ~20-50 ns. For N = 10, ~100 ns. Linear in N but with small constants.
Cancellation in admin/control endpoints¶
A service typically exposes admin endpoints for runtime control:
- /debug/cancel to cancel a specific job.
- /debug/drain to start a graceful drain.
- /debug/status to report state.
The cancellation for these endpoints is interesting because the action is itself a cancellation.
func handleCancel(w http.ResponseWriter, r *http.Request) {
jobID := r.URL.Query().Get("id")
if cancel, ok := jobs.Get(jobID); ok {
cancel(errors.New("admin cancelled"))
w.WriteHeader(204)
return
}
w.WriteHeader(404)
}
The handler triggers cancellation of another goroutine's context. The handler itself respects its own request context for the response.
If the admin's client disconnects, the handler returns; the cancellation already fired. The cancelled job stops regardless of the admin connection.
This is asymmetric: the cancel trigger is one-way. Once fired, the admin's disconnection does not undo it.
For "drain" endpoints, the trigger may take longer:
func handleDrain(w http.ResponseWriter, r *http.Request) {
if err := s.Drain(r.Context()); err != nil {
http.Error(w, err.Error(), 500)
return
}
w.WriteHeader(204)
}
The drain is a blocking operation. It respects r.Context() — if the admin disconnects, the drain context cancels and Drain returns early. The actual drain may already be in progress; that progress is preserved.
Custom context implementation: when to and not to¶
The temptation: implement a custom context to add features (logging, metrics, custom semantics). Almost always: don't.
Why not?¶
- The standard library does not optimise for custom contexts. propagateCancel falls back to a watcher goroutine, which costs an extra goroutine per derived context.
- Value lookup fast paths assume known types. Custom types take the slow path.
- API users will be confused. "Why does this function return my own ContextWithExtra type and not a stdlib context?"
- Future stdlib features may not work. AfterFunc, Cause, etc. depend on internal magic keys that custom contexts don't expose.
What to do instead¶
For added behaviour, compose around Context, not implement it:
- Need logging? Wrap the call site with logging.
- Need metrics? Use AfterFunc to record on cancel.
- Need a custom cancel reason? Use WithCancelCause.
- Need a different deadline? Use WithDeadline.
The only legitimate case for a custom context: a test fixture that simulates a real context's behaviour with precise timing control.
type fakeContext struct {
done chan struct{}
err error
}
func (f *fakeContext) Done() <-chan struct{} { return f.done }
func (f *fakeContext) Err() error { return f.err }
// ... etc
The fake exposes a Cancel() method that the test can call. Useful for deterministic cancellation tests.
Detailed cancellation lifecycle in database/sql¶
The database/sql package's cancellation behaviour is worth understanding in detail.
When you call db.QueryContext(ctx, query, args...):
- The package acquires a connection from the pool (cancellable).
- The driver prepares and executes the query, watching ctx.Done().
- If ctx cancels, the driver issues a backend cancel (driver-dependent). For PostgreSQL via pgx, this is a cancel request sent over a separate connection.
- The query returns with context.Canceled (or a driver-specific cancellation error).
- The connection is returned to the pool.
Cancellation behaviour during specific operations:
Begin / BeginTx¶
BeginTx(ctx, opts) is cancellable. If ctx cancels before the transaction starts, it returns the error.
Commit / Rollback¶
Once a transaction is started, Commit and Rollback themselves are not cancellable in older drivers. The cancellation happens within the transaction's queries; commit blocks until the WAL flush.
Some newer drivers support cancelling a commit in flight. Test your specific driver.
Stmt.Exec / Stmt.Query¶
Prepared statement execution is cancellable via the context passed in. The driver tracks active executions per statement and cancels them on context cancel.
Conn¶
Conn is a single connection from the pool. Operations on a Conn are cancellable. Conn.Close returns the connection to the pool; it is not cancellable but is fast.
Pool exhaustion¶
If the pool is at max capacity and you request another connection, the request waits. The wait is cancellable via the context. If ctx cancels, the request fails with context.Canceled.
This is a critical resilience property: a service with a saturated DB pool returns errors fast rather than blocking indefinitely.
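A short usage sketch of bounding how long a request waits for a pooled connection; db is assumed to be an *sql.DB.
func withConn(ctx context.Context, db *sql.DB) error {
    ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
    defer cancel()
    conn, err := db.Conn(ctx) // waits for a free connection; the wait aborts when ctx cancels
    if err != nil {
        return err // context.DeadlineExceeded if the pool stayed saturated
    }
    defer conn.Close()
    // ... use conn ...
    return nil
}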
A long worked example: a search engine query pipeline¶
Putting many of the patterns together. A search engine receives a query, fans out to N shards, merges results, ranks, and responds. Cancellation must work at every stage.
type SearchService struct {
shards []ShardClient
ranker *Ranker
cache *Cache
logger *log.Logger
}
func (s *SearchService) Search(ctx context.Context, q Query) (*Results, error) {
ctx, cancel := context.WithCancelCause(ctx)
defer cancel(nil)
// Set per-search deadline if none
if _, ok := ctx.Deadline(); !ok {
var c context.CancelFunc
ctx, c = context.WithTimeout(ctx, 500*time.Millisecond)
defer c()
}
// Try cache
if r, ok := s.cache.Get(q); ok {
return r, nil
}
// Fan out to shards
g, gctx := errgroup.WithContext(ctx)
g.SetLimit(16)
shardResults := make([]ShardResult, len(s.shards))
for i, sh := range s.shards {
i, sh := i, sh
g.Go(func() error {
shardCtx, shardCancel := context.WithTimeout(gctx, 200*time.Millisecond)
defer shardCancel()
r, err := sh.Query(shardCtx, q)
if err != nil {
if errors.Is(err, context.DeadlineExceeded) {
s.logger.Printf("shard %d slow", i)
// partial result is OK
return nil
}
return err
}
shardResults[i] = r
return nil
})
}
if err := g.Wait(); err != nil {
return nil, fmt.Errorf("shard query: %w", err)
}
// Rank
rankCtx, rankCancel := context.WithTimeout(ctx, 100*time.Millisecond)
defer rankCancel()
ranked, err := s.ranker.Rank(rankCtx, shardResults)
if err != nil {
return nil, err
}
// Cache
s.cache.Set(q, ranked)
return ranked, nil
}
type Query struct{}
type Results struct{}
type ShardResult struct{}
type ShardClient interface {
Query(ctx context.Context, q Query) (ShardResult, error)
}
type Ranker struct{}
func (r *Ranker) Rank(ctx context.Context, in []ShardResult) (*Results, error) { return nil, nil }
type Cache struct{}
func (c *Cache) Get(q Query) (*Results, bool) { return nil, false }
func (c *Cache) Set(q Query, r *Results) {}
Cancellation analysis:
- Top level: WithCancelCause allows recording why the search ended. The total deadline (500 ms) caps the operation.
- Cache lookup: cheap; no cancellation issues.
- Fan-out: errgroup with a concurrency limit. Each shard has its own 200 ms timeout.
- Per-shard timeout: shard timeouts are independent of one another. A slow shard does not block others.
- Graceful degradation: a shard that times out returns nil (a partial result) rather than failing the whole search. This trade-off lets the system return partial results when some shards are slow.
- Rank stage: separate timeout (100 ms). Even if shard queries took 400 ms, the ranker has time.
- Cache write: synchronous; the write happens before the function returns. No cancellation issue.
The trickiest part: how to handle shard failures vs shard timeouts. The code distinguishes:
- Timeout: log, return nil (graceful).
- Error: return error (fail the search).
This is a policy decision; it could go either way. The cancellation primitives just enable expressing the policy.
Cancellation latency¶
Worst case:
- Total deadline: 500 ms.
- Shard fan-out: 200 ms.
- Rank: 100 ms.
If the client cancels at 250 ms (mid-shard fan-out), every shard's shardCtx cancels; their queries abort. g.Wait returns with gctx.Err(). The function returns with the cancellation reason.
Total cancellation latency: ~5-10 ms (the shard's response to context cancel, dominated by network if the shard is remote).
For a service handling many such searches per second, the cancellation behaviour ensures resources are released promptly when clients lose interest.
Multi-tenant cancellation isolation¶
A service shared across N tenants must isolate failures: one tenant's cancellation should not affect others.
Approach 1: per-tenant root context¶
Each tenant has its own root context derived from the application root:
type Tenants struct {
mu sync.Mutex
ctxs map[string]context.Context
cancels map[string]context.CancelFunc
}
func (t *Tenants) Get(tenant string, appCtx context.Context) (context.Context, context.CancelFunc) {
t.mu.Lock()
defer t.mu.Unlock()
if ctx, ok := t.ctxs[tenant]; ok {
return ctx, t.cancels[tenant]
}
ctx, cancel := context.WithCancel(appCtx)
t.ctxs[tenant] = ctx
t.cancels[tenant] = cancel
return ctx, cancel
}
func (t *Tenants) Cancel(tenant string) {
t.mu.Lock()
defer t.mu.Unlock()
if cancel, ok := t.cancels[tenant]; ok {
cancel()
delete(t.ctxs, tenant)
delete(t.cancels, tenant)
}
}
Tenants are isolated. Cancelling tenant A does not affect tenant B. Application shutdown cancels every tenant via the appCtx.
Approach 2: shared context with per-tenant filter¶
A single context per pipeline but filtering tasks by tenant:
type Task struct {
Tenant string
// ...
}
for {
select {
case <-ctx.Done():
return
case t := <-tasks:
if isCancelled(t.Tenant) {
// skip
continue
}
process(ctx, t)
}
}
Cancellation is checked via a separate "is this tenant cancelled" function. Simpler but less granular — does not give per-tenant ctx.Done() for downstream code.
The choice depends on the integration: if downstream code accepts a context, per-tenant root is cleaner. If downstream code does not, the filter approach works.
Cross-tenant resources¶
If tenants share resources (a DB pool), cancellation of one tenant should not exhaust resources for others. Use:
- Per-tenant connection limits.
- Per-tenant queues with shedding.
- Per-tenant quotas enforced before issuing the work.
The cancellation primitive is the same; the policy around it isolates tenants.
Incident Post-Mortems¶
Two paraphrased post-mortems from real incidents.
Incident: pod restarts on shutdown¶
A Kubernetes pod was being killed with SIGKILL because graceful shutdown did not complete within the configured terminationGracePeriodSeconds. The pod logs showed shutdown started but never finished.
Investigation: the application had a background goroutine doing periodic cleanup. The cleanup used time.Sleep(time.Minute) to wait between iterations. SIGTERM cancelled the context, but the goroutine was in time.Sleep and could not be woken.
Fix: replace time.Sleep with select { case <-ctx.Done(): return; case <-time.After(...): }. After the fix, shutdown completed in under a second.
Lesson: every blocking call must be cancellable, even mundane sleeps.
Incident: connection pool exhaustion under retries¶
A service backed by a database was failing under load. The DB pool exhausted; new requests waited forever.
Investigation: each request did a retry on transient errors with exponential backoff. The backoff used time.Sleep. During the sleep, the DB connection from the previous attempt was held. With 5 retries averaging 30 seconds total, the connection was tied up for 30 seconds per request.
Fix: release the DB connection before sleeping. Acquire fresh on retry. After the fix, connection turnover increased and the pool stayed healthy.
Lesson: hold resources for the minimum required time; release before any sleep or wait, even if cancellable.
Incident: cancellation cause not propagated¶
A service started failing with context canceled errors. The team could not determine the root cause from logs because all cancellations logged as "canceled" without further detail.
Investigation: the team had not used WithCancelCause. The actual reason (a downstream timeout) was never recorded.
Fix: introduce WithCancelCause at the top of each request handler. Cancellations now logged with cause. Root cause was a misconfigured backend timeout; obvious once it was visible in logs.
Lesson: invest in cancellation observability early. The cost is small; the debugging time saved is large.
Incident: cancel-leak via captured cancel function¶
A subtle real-world bug. A library exposed:
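The exposed API is not reproduced in this excerpt; based on the usage below, it looked roughly like this:
type Watcher struct {
    cancel context.CancelFunc
}
func NewWatcher(ctx context.Context) *Watcher {
    w := &Watcher{}
    go w.run(ctx)
    return w
}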
Looks fine. But the library also internally derived a child context:
func (w *Watcher) run(parent context.Context) {
ctx, cancel := context.WithCancel(parent)
w.cancel = cancel
// ... main loop ...
}
The user did:
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
w := NewWatcher(ctx)
// ... use w ...
When defer cancel() ran, the parent cancelled, which cancelled the watcher's child. The watcher exited. So far so good.
But the watcher also held an internal cancel that it never called. The internal cancel function was a closure over the child context's internal struct. The internal struct was retained by the closure, and the closure was retained by w.cancel, and w was retained by the caller until they were done with it.
The result: the cancelled context's internal struct was alive in memory long after cancellation. Multiplied by tens of thousands of watchers, this was a significant heap retention bug.
Fix: ensure the library's internal cancel is also called when the watcher exits.
func (w *Watcher) run(parent context.Context) {
ctx, cancel := context.WithCancel(parent)
defer cancel()
w.cancel = cancel
// ...
}
The defer cancel() releases the internal struct when the goroutine exits.
Lesson: every cancel function must be called. Not calling it leaks not just the timer (for WithTimeout) but also the parent-children-map entry.
Patterns That Push the Limits¶
Some advanced patterns that go beyond standard middle/senior material.
Pattern: hot-swappable workers¶
A pool that can replace individual workers without restarting:
type Pool struct {
mu sync.Mutex
workers map[int]*Worker
nextID int
}
func (p *Pool) Add(ctx context.Context, fn func(context.Context) error) int {
p.mu.Lock()
id := p.nextID
p.nextID++
workerCtx, cancel := context.WithCancel(ctx)
w := &Worker{ID: id, cancel: cancel, done: make(chan struct{})}
p.workers[id] = w
p.mu.Unlock()
go func() {
defer close(w.done)
fn(workerCtx)
}()
return id
}
func (p *Pool) Remove(id int) {
p.mu.Lock()
w, ok := p.workers[id]
delete(p.workers, id)
p.mu.Unlock()
if ok {
w.cancel()
<-w.done
}
}
Each worker has its own context derived from the pool's. Cancelling one does not affect others. The pool can add and remove workers at runtime.
Pattern: time-bounded cancel cascade¶
For shutdowns where you want a hard upper bound on how long the cascade takes:
func shutdown(deadline time.Duration) error {
done := make(chan struct{})
go func() {
// ... cancel everything, wait ...
close(done)
}()
select {
case <-done:
return nil
case <-time.After(deadline):
return errors.New("shutdown timed out")
}
}
If shutdown doesn't complete, abandon and report. The process can then exit anyway (the leaked goroutines are about to die with the process).
Pattern: per-message deadlines from header¶
Reading a deadline from a message metadata field:
func processMsg(parent context.Context, msg Message) error {
deadline := time.Unix(msg.DeadlineUnix, 0)
if time.Now().After(deadline) {
return errors.New("expired")
}
ctx, cancel := context.WithDeadline(parent, deadline)
defer cancel()
return process(ctx, msg)
}
Each message carries its own deadline; the processor enforces it. Old messages are dropped without processing.
Pattern: graceful upgrade¶
A process replaces itself without dropping requests:
func main() {
listener := net.Listen(...) // inherit fd from parent on upgrade
ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGUSR2)
defer cancel()
srv := &http.Server{Handler: ...}
go srv.Serve(listener)
<-ctx.Done() // SIGUSR2 received
// spawn new process with inherited fd
// wait for new process to be ready
srv.Shutdown(...)
}
The new process accepts new connections via the inherited fd; the old process drains existing connections. The cancellation triggers the swap.
This is the pattern in tableflip, endless, and similar libraries. Complex but valuable for zero-downtime deploys.
Cancellation in singleflight and how to extend it¶
golang.org/x/sync/singleflight deduplicates concurrent calls for the same key. As discussed, it has a known limitation: the first call's context dominates.
A common pattern: extend it to be per-caller-cancellable. The idea — track the number of waiters and cancel the underlying call only when all waiters cancel.
type CountingFlight struct {
mu sync.Mutex
flights map[string]*flight
}
type flight struct {
waiters int
cancel context.CancelFunc
res chan flightResult
}
type flightResult struct {
val any
err error
}
func (cf *CountingFlight) Do(ctx context.Context, key string, fn func(context.Context) (any, error)) (any, error) {
cf.mu.Lock()
f, ok := cf.flights[key]
if !ok {
flightCtx, cancel := context.WithCancel(context.Background())
f = &flight{
cancel: cancel,
res: make(chan flightResult, 1),
}
cf.flights[key] = f
go func() {
val, err := fn(flightCtx)
f.res <- flightResult{val, err}
cf.mu.Lock()
delete(cf.flights, key)
cf.mu.Unlock()
}()
}
f.waiters++
cf.mu.Unlock()
select {
case <-ctx.Done():
cf.mu.Lock()
f.waiters--
if f.waiters == 0 {
f.cancel()
}
cf.mu.Unlock()
return nil, ctx.Err()
case res := <-f.res:
// re-send for other waiters
select {
case f.res <- res:
default:
}
return res.val, res.err
}
}
Sketch only — production needs more careful error handling and result distribution. The point: cancellation across deduplicated calls requires tracking waiters and cancelling when the count reaches zero.
This pattern is a senior-level extension; production singleflight does not have it. The trade-offs (complexity, race conditions, distribution overhead) are real.
A reminder: idempotent shutdown is essential¶
Shutdown may be triggered multiple times in quick succession:
- SIGTERM arrives.
- A second SIGTERM arrives before the first finished.
- A user clicks "stop" twice in admin UI.
- An orchestrator timeout escalates from SIGTERM to SIGKILL.
Each trigger should be safely idempotent:
- Cancelling an already-cancelled context: no-op (built into the API).
- Closing an already-closed channel: panic. Avoid by using sync.Once or a flag.
- Calling shutdown() twice: should work — possibly waiting for the in-progress shutdown to finish.
type Server struct {
shutdownOnce sync.Once
cancel context.CancelFunc
done chan struct{}
}
func (s *Server) Shutdown(ctx context.Context) error {
s.shutdownOnce.Do(func() {
s.cancel()
close(s.done) // not exactly right; need to wait until actually done
})
select {
case <-s.done:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
A second call returns once the first completes (or the second's context cancels). Multiple concurrent calls are safe.
Production-grade shutdown handlers always have this pattern. Without it, double-trigger leads to panics or undefined behaviour.
Cancellation versus deletion: distinct concepts¶
Sometimes confused: cancellation says "stop"; deletion says "remove the underlying entity." They are different.
- A cancelled HTTP request is one whose work was stopped. The request itself (the record of "this URL was hit") may still exist for logging.
- A deleted job is one whose record is removed from the system. The job may have been previously cancelled or completed.
For a job queue:
- Cancel(jobID) stops the job from continuing; it stays in the system with a "cancelled" status.
- Delete(jobID) removes the job entirely; only valid for finished (cancelled, completed, or failed) jobs.
Designing this distinction up front avoids confusion. The cancellation primitive is the in-process signal; the deletion is the persistent state change.
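A minimal sketch of keeping the two operations distinct; the JobStore type and status strings are illustrative.
type JobRecord struct {
    Status string // "running", "cancelled", "completed", "failed"
    cancel context.CancelFunc
}
type JobStore struct {
    mu   sync.Mutex
    jobs map[string]*JobRecord
}
func (s *JobStore) Cancel(id string) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    j, ok := s.jobs[id]
    if !ok {
        return errors.New("not found")
    }
    j.cancel()             // in-process signal: stop the work
    j.Status = "cancelled" // the record stays, with a terminal status
    return nil
}
func (s *JobStore) Delete(id string) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    j, ok := s.jobs[id]
    if !ok {
        return errors.New("not found")
    }
    if j.Status == "running" {
        return errors.New("cannot delete a running job; cancel it first")
    }
    delete(s.jobs, id) // persistent state change: remove the record
    return nil
}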
Cancellation versus pausing¶
Another distinction: pausing means "stop temporarily; can resume." Cancellation means "stop permanently."
Go's context.Context does not have a "pause" primitive. To pause, you would need:
- A mechanism to block goroutines without exiting.
- A mechanism to resume them later.
This is rarely needed at the application level. Operating systems handle pause/resume of processes via SIGSTOP/SIGCONT. Application code typically just throttles (slow down) rather than pauses.
If you do need pause/resume, the pattern is:
type Pausable struct {
mu sync.Mutex
paused bool
cond *sync.Cond
}
func (p *Pausable) Pause() { p.mu.Lock(); p.paused = true; p.mu.Unlock() }
func (p *Pausable) Resume() { p.mu.Lock(); p.paused = false; p.cond.Broadcast(); p.mu.Unlock() }
func (p *Pausable) Wait(ctx context.Context) error {
p.mu.Lock()
for p.paused {
// wait on cond, but the wait is not context-aware
// ... this is a problem ...
}
p.mu.Unlock()
return nil
}
The challenge: sync.Cond.Wait is not context-aware. For cancellable pause, build on channels:
type Pausable struct {
mu sync.Mutex
pausedCh chan struct{}
}
func New() *Pausable {
p := &Pausable{pausedCh: make(chan struct{})}
close(p.pausedCh) // initially not paused
return p
}
func (p *Pausable) Pause() {
p.mu.Lock()
select {
case <-p.pausedCh:
p.pausedCh = make(chan struct{}) // swap in an open channel: open means paused
default:
}
p.mu.Unlock()
}
func (p *Pausable) Resume() {
p.mu.Lock()
select {
case <-p.pausedCh:
// already not paused
default:
close(p.pausedCh)
}
p.mu.Unlock()
}
func (p *Pausable) Wait(ctx context.Context) error {
p.mu.Lock()
ch := p.pausedCh
p.mu.Unlock()
select {
case <-ch:
return nil
case <-ctx.Done():
return ctx.Err()
}
}
A channel that is "open" when paused, "closed" when running. Wait blocks until resume or cancel. Sketch only; production needs careful handling.
Cancellation in batch APIs¶
A batch API accepts N items at once and processes them. Two patterns:
Pattern A: all-or-nothing¶
func ProcessBatch(ctx context.Context, items []Item) error {
g, ctx := errgroup.WithContext(ctx)
for _, item := range items {
item := item
g.Go(func() error { return process(ctx, item) })
}
return g.Wait()
}
First error cancels siblings; the whole batch fails.
Pattern B: best-effort¶
func ProcessBatch(ctx context.Context, items []Item) []error {
errs := make([]error, len(items))
var wg sync.WaitGroup
for i, item := range items {
i, item := i, item
wg.Add(1)
go func() {
defer wg.Done()
errs[i] = process(ctx, item)
}()
}
wg.Wait()
return errs
}
Each item is independent. Some succeed, some fail. Cancellation aborts all.
The choice depends on the use case. Transactional batches use Pattern A; notification fan-outs use Pattern B.
Hybrid: stop after K failures¶
A middle ground: best-effort up to K failures, then cancel siblings.
func ProcessBatchK(ctx context.Context, items []Item, maxFailures int) []error {
ctx, cancel := context.WithCancel(ctx)
defer cancel()
errs := make([]error, len(items))
var (
wg sync.WaitGroup
failures int
failuresMu sync.Mutex
)
for i, item := range items {
i, item := i, item
wg.Add(1)
go func() {
defer wg.Done()
err := process(ctx, item)
errs[i] = err
if err != nil {
failuresMu.Lock()
failures++
if failures >= maxFailures {
cancel()
}
failuresMu.Unlock()
}
}()
}
wg.Wait()
return errs
}
After K failures, cancel triggers; remaining items see ctx.Done() and abort. Useful for "fail fast after enough damage."
Cancellation in plugins and dynamic code¶
Go plugins (plugin package) and dynamically loaded code may have their own goroutines. Cancellation needs to reach them.
The pattern: plugins should accept a Context in their entry points:
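A sketch of a context-first plugin contract; the interface name and host helper are illustrative.
type Plugin interface {
    // Run blocks until ctx is cancelled or the plugin fails; all goroutines
    // the plugin spawns must be derived from ctx.
    Run(ctx context.Context) error
}
func runPlugin(hostCtx context.Context, p Plugin) error {
    ctx, cancel := context.WithCancel(hostCtx)
    defer cancel()
    return p.Run(ctx)
}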
The host passes a context derived from its lifecycle. Cancellation cascades.
If a plugin spawns its own goroutines without honouring the passed context, those goroutines leak when the plugin "stops." This is a common bug in plugin systems.
The fix: design the plugin interface to make Context mandatory. Provide a default Stop method that cancels the plugin's internal context. Document.
A note on cancellation in stateless vs stateful workers¶
Stateless workers (no internal accumulated state) are trivially cancellable. Cancel, drain, exit; no state lost.
Stateful workers (accumulating aggregations, caches, buffers) lose state on cancel. Two strategies:
Strategy 1: periodic externalisation¶
func stateful(ctx context.Context) {
state := &State{}
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
// best-effort save
save(context.Background(), state)
return
case <-ticker.C:
save(ctx, state)
case ev := <-events:
state.Apply(ev)
}
}
}
State is saved every second. Cancellation triggers a final save. The most recent second's work may be lost.
Strategy 2: write-ahead¶
Each event is written to a durable log before applied to in-memory state. Cancellation does not lose unprocessed events because they are still in the log; the recovery on restart replays from the last applied position.
for ev := range events {
if err := log.Append(ctx, ev); err != nil {
return err
}
state.Apply(ev)
state.SetCursor(ev.Offset)
}
The cursor is updated only after Apply. On cancellation, the log has events past the cursor; restart replays them.
Trade-off: write-ahead is more durable but slower. Periodic save is faster but loses recent work.
Most production systems use a hybrid: write-ahead for critical state, periodic save for derived state.
Cancellation patterns from related ecosystems¶
A short tour for perspective.
Kotlin coroutines¶
CoroutineScope has a parent-child relationship. Cancelling a scope cancels all coroutines started in it. Structured concurrency is enforced by the language.
The Go equivalent is errgroup. Kotlin's enforcement is stronger; you cannot accidentally leak a coroutine.
Rust async¶
tokio::spawn returns a JoinHandle. The handle has an abort() method that cancels the task. Without cancellation, the future runs to completion.
Rust's pattern is similar to Go's: cancellation is opt-in (via tokio::select!, JoinHandle::abort, or cancellation tokens). The discipline is the same.
Java's CompletableFuture¶
Cancellation in CompletableFuture is complicated. cancel(mayInterruptIfRunning) only fully works for futures created in specific ways. The composition has many edge cases.
The lesson: Go's context.Context is simpler than Java's equivalent. Trust the simplicity.
Python's asyncio¶
asyncio.Task has a cancel() method that raises CancelledError in the coroutine. The coroutine must handle it (or let it propagate).
Conceptually similar to Go's context, but the cancellation arrives as an exception rather than a channel close. Different mental model.
Erlang processes¶
Erlang processes have OS-process-like lifecycle: each has a PID, can be linked to others, and supervisors can restart them. Cancellation is via exit signals.
The actor pattern in Go (covered earlier) approximates this. Native Go does not have linked processes; you build the supervision tree.
A study of an errgroup-style implementation¶
Implementing your own errgroup reveals the patterns:
type Group struct {
cancel context.CancelCauseFunc
wg sync.WaitGroup
errOnce sync.Once
err error
}
func WithContext(ctx context.Context) (*Group, context.Context) {
ctx, cancel := context.WithCancelCause(ctx)
return &Group{cancel: cancel}, ctx
}
func (g *Group) Go(fn func() error) {
g.wg.Add(1)
go func() {
defer g.wg.Done()
if err := fn(); err != nil {
g.errOnce.Do(func() {
g.err = err
if g.cancel != nil {
g.cancel(err)
}
})
}
}()
}
func (g *Group) Wait() error {
g.wg.Wait()
if g.cancel != nil {
g.cancel(g.err)
}
return g.err
}
Differences from the real errgroup:
- Uses WithCancelCause so the cancellation reason is the actual error.
- Does not have SetLimit; for that, add a semaphore channel.
The core mechanism is unchanged: wait group for join, sync.Once for first-error capture, cancel propagation via the context.
When you'd write your own¶
Sometimes you need behaviour errgroup does not provide:
- Aggregating all errors instead of just the first.
- Soft cancellation: cancel on error but let siblings finish their current work.
- Custom timing: launch with staggered delays.
- Custom retry: re-run failed tasks.
For these, write your own group. The patterns above generalise: a wait group, a sync.Once or similar for error capture, a context for cancellation.
Memory ordering of cancellation observables¶
The Go memory model specifies which writes are visible after which synchronisation events. For cancellation:
- cancel() calls close(done), which is a synchronisation event.
- A receive on done after the close happens-after the close.
- Any writes ordered before the close are visible to the receiver after the receive.
This means:
var globalState string
go func() {
globalState = "shutting down" // (1)
cancel() // (2): close happens-before any receive
}()
<-ctx.Done() // (3): receive happens-after the close
fmt.Println(globalState) // (4): reads "shutting down"
The print sees "shutting down" — guaranteed by the memory model.
But:
var globalState string
go func() {
cancel() // (1)
globalState = "still going" // (2): unordered with the cancel observation
}()
<-ctx.Done() // (3)
fmt.Println(globalState) // (4): may print "" or "still going"
The print is racy. Writes done after cancel have no ordering guarantee with the receiver's read.
This is rarely relevant in well-designed code, but knowing it helps explain subtle bugs.
Real-world cancellation latency observations¶
Approximate latencies I have observed in production systems:
Microservice API endpoint¶
- Endpoint receives request, kicks off internal pipeline.
- Client disconnects mid-request.
- r.Context() cancels.
- Internal pipeline (5 stages) sees the cancel; each stage exits in ~10 us.
- DB driver cancels in-flight query; query takes ~5 ms to actually cancel server-side.
- Handler returns; goroutine count drops by ~5.
Total: ~5-10 ms from client disconnect to all cleanup done. Dominated by DB cancellation roundtrip.
Stream processor¶
- Service processing Kafka events in parallel (16 workers).
- SIGTERM arrives.
- rootCtx cancels.
- Each worker finishes its current event (~200 ms each, in parallel).
- Workers exit; cleanup runs.
- Total shutdown: ~250 ms.
Long-poll server¶
- 10 000 concurrent long-poll connections.
- SIGTERM arrives.
- Each handler's r.Context() cancels (via BaseContext).
- Each handler returns a 204; connections close.
- Total shutdown: ~50 ms (most goroutines wake and exit nearly simultaneously).
Heavy ETL job¶
- Single goroutine running a large ETL.
- Cancellation arrives.
- The goroutine is inside db.QueryContext for a slow query.
- DB cancellation takes ~10 seconds to actually abort the query.
- Goroutine returns; job exits.
Total: ~10 seconds — bounded by the database's ability to abort.
These numbers vary widely by system and language. The pattern: well-instrumented services have predictable cancellation latency; poorly instrumented services have nasty surprises.
In-depth: cancellation in WebSocket connections¶
A WebSocket connection is long-lived bidirectional. Cancellation is critical because the connection can live for hours.
func wsHandler(w http.ResponseWriter, r *http.Request) {
conn, err := upgrader.Upgrade(w, r, nil)
if err != nil {
return
}
defer conn.Close()
ctx, cancel := context.WithCancel(r.Context())
defer cancel()
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error { return readLoop(ctx, conn) })
g.Go(func() error { return writeLoop(ctx, conn) })
g.Go(func() error { return pingLoop(ctx, conn) })
_ = g.Wait()
}
func readLoop(ctx context.Context, conn *websocket.Conn) error {
for {
select {
case <-ctx.Done():
return ctx.Err()
default:
}
conn.SetReadDeadline(time.Now().Add(time.Minute))
_, msg, err := conn.ReadMessage()
if err != nil {
return err
}
handle(msg)
}
}
func writeLoop(ctx context.Context, conn *websocket.Conn) error {
for {
select {
case <-ctx.Done():
return ctx.Err()
case msg := <-outbox:
if err := conn.WriteMessage(websocket.TextMessage, msg); err != nil {
return err
}
}
}
}
func pingLoop(ctx context.Context, conn *websocket.Conn) error {
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(time.Second)); err != nil {
return err
}
}
}
}
Three goroutines: read, write, ping. They share a context. errgroup joins them and propagates errors.
Cancellation paths:
- Client disconnects: conn.ReadMessage errors out (a close or network error). The read loop returns; errgroup cancels; the write and ping loops see ctx.Done() and exit.
- Server shutdown: r.Context() cancels (via BaseContext); errgroup's ctx cancels; all three loops exit.
- Internal error: any loop returning an error triggers the errgroup cancel.
Total cleanup: the deferred conn.Close runs, releasing the WebSocket. The errgroup.Wait returns.
This is the production pattern for WebSocket servers handling many concurrent connections. Each connection's three goroutines join cleanly on disconnect.
Cancellation across runtime versions¶
Cancellation behaviour has evolved across Go versions. Notable changes:
Go 1.7¶
Introduced the context package. Before this, every project had its own done-channel patterns.
Go 1.14¶
Asynchronous preemption. Goroutines can be preempted at almost any point, not just at function calls. Tight loops without function calls can no longer starve the scheduler, so goroutines reacting to cancellation keep getting CPU time; the looping goroutine itself must still poll ctx.Err() to observe the cancel.
Go 1.16¶
signal.NotifyContext added. Replaced manual signal-handler-to-context plumbing.
Go 1.20¶
context.WithCancelCause, context.Cause. Rich cancellation reasons.
Go 1.21¶
context.AfterFunc, context.WithoutCancel. Cleaner side-effect and detached patterns.
Go 1.22¶
for ... range loop variables are per-iteration. Eliminates a class of capture bugs that affected goroutines spawned in loops.
Future-watching¶
Generics + context: typed channels with consistent cancellation semantics.
Each release smooths a rough edge in cancellation. Codebases that started on early Go often have legacy patterns; migrating to current idioms reduces bug surface.
A final architectural pattern: cancellation budgets across teams¶
For organisations with many teams shipping services that interact, cancellation behaviour is a coordination problem.
- Team A's service has a 5-second deadline for its operations.
- Team B's service is called by A; it inherits the deadline.
- Team B's service calls Team C's, narrowing the deadline.
- Etc.
If any team uses a context-unaware library, the chain breaks. If any team adds a hidden timeout, the budget is silently truncated.
Mitigations:
- Cross-team conventions: every team uses context-aware libraries. Code review enforces.
- Observability: traces show the actual deadline propagated through each hop. Reviews of traces reveal hidden waste.
- Service ownership of deadline: each service has a "max budget" SLA and enforces it. Surpluses are visible.
- Postmortem reviews: incidents involving deadline mishaps are investigated and remediated.
This is not a technical problem with a code solution; it is an organisational discipline supported by good tooling.
Cancellation and the runtime's "sysmon"¶
The Go runtime has a background thread called sysmon ("system monitor"). It wakes periodically (roughly every 20 µs, backing off to about 10 ms when idle) and handles:
- Network poller integration.
- Stuck-goroutine detection (preemption).
- Forced GC.
- Scavenger.
sysmon is largely irrelevant to user-level cancellation, but it does keep the scheduler responsive. Without it, a CPU-bound goroutine could prevent the scheduler from running other goroutines, including ones that need to handle cancellation.
Implication: even on a single-CPU system with all goroutines busy, sysmon ensures the scheduler runs periodically. Cancellation does not get starved.
This is why async preemption (Go 1.14+) and sysmon together make Go a viable language for low-latency cancellation. The runtime keeps things moving.
Cancellation correctness proofs (informal)¶
For a critical pipeline, you can write informal proofs of cancellation correctness. The proof template:
Claim: For any execution where cancel() is called, all goroutines spawned by runPipeline(ctx) exit within K microseconds.
Proof sketch:
- Every goroutine in the pipeline is in one of these states:
  - Running in a select with <-ctx.Done() as one case.
  - Running a per-item work function fn(ctx) that polls ctx.Err() every M operations.
- When cancel() fires, ctx.Done() closes. The scheduler wakes any goroutine blocked on <-ctx.Done() within X microseconds.
- A goroutine in a select case immediately picks <-ctx.Done() (or completes the other case and loops to a new select). Worst case: one full iteration of the loop, which is bounded by the per-item work T.
- A goroutine in fn(ctx) polls ctx.Err() within M operations. If we measure that M operations take Y microseconds, then cancellation latency is at most Y.
Conclusion: K = max(X + T, Y) = max(scheduler wake + worst per-item work, polling interval cost).
For a typical pipeline with T = 1 ms and Y = 10 us, K ≈ 1 ms. Bounded and provable.
This level of rigour is rare in production code but valuable for critical systems. The proof can be expressed as test assertions:
func TestCancellationBound(t *testing.T) {
ctx, cancel := context.WithCancel(context.Background())
done := make(chan struct{})
go func() {
runPipeline(ctx)
close(done)
}()
time.Sleep(100 * time.Millisecond)
start := time.Now()
cancel()
<-done
elapsed := time.Since(start)
if elapsed > 2*time.Millisecond {
t.Errorf("cancellation took %v, expected <2ms", elapsed)
}
}
The test is a runtime proof. Run it repeatedly under load to catch outliers.
Cancellation in production tracing¶
For observability at scale, every cancellation event should produce a trace span. The pattern:
func runPipeline(ctx context.Context) error {
ctx, span := tracer.Start(ctx, "pipeline.run")
defer span.End()
err := work(ctx)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
}
if ctx.Err() != nil {
span.AddEvent("cancelled", trace.WithAttributes(
attribute.String("cause", causeString(ctx)),
))
}
return err
}
The span captures:
- Start and end time of the pipeline.
- The cancellation event with its cause.
- The error, if any.
In the trace UI, you see the cancellation as a discrete event with context. For distributed traces (multiple services), this works across hops.
Sampling cancellations¶
For high-volume services, sampling traces is necessary. Cancellations are rare but interesting events worth always sampling.
The tracing system can be configured to sample 100% of cancellation events even if normal sampling is 1%. This gives high fidelity on the rare-but-important shutdowns.
Cancellation and the GC: rare interactions¶
The Go GC occasionally performs stop-the-world (STW) pauses, which are typically short. The GC does not block cancellation per se, but during a STW pause no goroutine can be woken until the pause ends.
For shutdown SLAs, this is usually irrelevant — GC pauses are sub-millisecond. But for very tight SLAs, you may notice cancellation latency variability that correlates with GC.
debug.SetGCPercent(-1) disables GC; useful for reproducing cancellation behaviour without GC interference.
Context retention by GC¶
A cancelled context is collectible if no one holds a reference. Common holders:
- The caller (until they let go).
- The parent's children map (until removeChild or parent cancel).
- Goroutines that captured ctx in closures.
The removeChild on cancel ensures parents do not retain cancelled children. The closures retain until the goroutines finish.
For long-running closures, the captured context lives as long as the closure. The cancellation reduces the context's "active work" but the context's memory footprint persists until the goroutine returns.
Cancellation in a "two-level" cache¶
A pattern where a fast in-memory cache fronts a slower remote cache. Reading is two-stage:
func (c *TwoLevel) Get(ctx context.Context, key string) (Value, error) {
if v, ok := c.local.Get(key); ok {
return v, nil
}
v, err := c.remote.Get(ctx, key)
if err != nil {
return Value{}, err
}
c.local.Set(key, v)
return v, nil
}
Cancellation interaction:
- Local lookup is fast; no cancellation issue.
- Remote lookup respects ctx.
- On cancellation, the remote call returns; local cache is not updated.
What about writes?
func (c *TwoLevel) Set(ctx context.Context, key string, v Value) error {
c.local.Set(key, v)
return c.remote.Set(ctx, key, v)
}
Local set is unconditional; remote set respects ctx. On cancellation, local has the value but remote may not. Future reads from another node will not see the value (because remote was not updated).
This is a consistency issue, not a cancellation correctness issue. The cancellation is honoured; the write semantics are different from what some callers might expect.
Solutions:
- Write to remote first, then local (sketched after this list). Cancellation may leave remote-only state.
- Use a write-behind pattern: queue writes, retry on failure.
- Accept the trade-off; document it.
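A minimal sketch of the remote-first option (SetRemoteFirst is a hypothetical name), reusing the TwoLevel type from above:
func (c *TwoLevel) SetRemoteFirst(ctx context.Context, key string, v Value) error {
	// The remote write respects ctx; on cancellation or failure, local is left untouched.
	if err := c.remote.Set(ctx, key, v); err != nil {
		return err
	}
	// The local write happens only after the remote succeeded, so local is never ahead of remote.
	c.local.Set(key, v)
	return nil
}
With the Get shown earlier, a read that misses locally still finds the value remotely and repopulates the local cache, so the remote-only window is self-healing.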
The lesson: cancellation respects the immediate caller but cannot reverse side effects already done. Design for this.
Cancellation profiling techniques¶
When debugging cancellation performance, several profiling tools help.
CPU profile¶
go tool pprof http://localhost/debug/pprof/profile?seconds=30 captures 30 seconds of CPU. Look for:
- High context.(*cancelCtx).cancel time: many cancellations or a slow cascade.
- High runtime.selectgo time: many selects (often fine; may need to be moved out of hot loops).
- High runtime.gopark / runtime.goready time: many wake-ups (cancellation storms).
Goroutine profile¶
go tool pprof http://localhost/debug/pprof/goroutine shows the call stacks of all live goroutines. Look for:
- Many goroutines stuck in chan receive: missed cancellation.
- Many goroutines in select: normal, but if numerous, may indicate a fan-out.
- Many goroutines in unique stacks: each is a potential leak.
Block profile¶
runtime.SetBlockProfileRate(1) enables block profiling. It captures time goroutines spend blocked. Useful for spotting:
- Long waits on <-ctx.Done(): usually fine, but if a goroutine is supposed to be active, this means it's stalled.
- Long waits on channel sends: producer wedged on a slow consumer.
Trace (execution trace)¶
go tool trace shows a timeline of goroutine activity. Useful for debugging:
- Cancellation cascades: see exactly when each goroutine wakes after cancel().
- Scheduler delays: gap between "runnable" and "running."
The trace can be heavy (hundreds of megabytes for 10 seconds of activity), but invaluable for understanding micro-level cancellation behaviour.
Cancellation across multiple Go programs¶
A pipeline of independent Go programs (e.g. a CLI tool that pipes into another) does not share context.Context. Cancellation between them uses Unix mechanisms:
- Pipe closure: closing stdout makes the next program see EOF on stdin.
- Signals: SIGPIPE on writing to a closed pipe; SIGTERM/SIGINT for explicit cancel.
- Exit codes: 0 for success, non-zero for failure.
A multi-process pipeline (producer | filter | consumer) shows the cascade: if the consumer exits, its stdin closes. The filter writes to its stdout and gets SIGPIPE; the filter exits. The producer writes to its stdout and gets SIGPIPE; the producer exits. Clean shutdown by Unix signal cascade.
In Go programs, you handle SIGPIPE by writing carefully. The default behaviour for stdout/stderr is to terminate on SIGPIPE; for other writes, the system call returns EPIPE.
For programs that listen on networks or read from files, the equivalent is SetWriteDeadline or closing the file descriptor.
Cross-program cancellation across language boundaries¶
A Go program piping into a Python program: same Unix mechanisms. The Python program closes its stdin and exits; the Go program writes to a closed pipe and gets EPIPE. Language-agnostic.
The lesson: at the Unix level, cancellation is "close the file descriptor." Go's context.Context is a higher-level abstraction; at the OS boundary, the abstraction shifts.
Cancellation in a microservice control plane¶
Imagine a control plane that orchestrates many worker nodes. Each node runs its own pipeline; the control plane can request cancellation for specific jobs.
Architecture¶
- Control plane: a service that receives admin commands (cancel job X).
- Workers: services that run jobs. They register with the control plane.
- Communication: gRPC.
Cancel propagation¶
- Admin calls controlPlane.CancelJob(jobID).
- Control plane looks up which worker is running the job.
- Control plane calls worker.Cancel(jobID) via gRPC.
- Worker's handler triggers cancellation on the local job's context.
- Local pipeline cancels; the job exits cleanly.
- Worker reports back: "job X cancelled."
The cross-process protocol:
- gRPC for the cancel command (a simple unary call).
- Worker's local context cancel.
- Worker's response to control plane.
Latency: tens of milliseconds for the cross-network hop, sub-millisecond for the local cancel.
Subtle issue: idempotency¶
The admin may retry the cancel command if the response is lost. The worker should handle "cancel job X" idempotently — if the job is already cancelled or finished, return success without panicking.
func (w *Worker) Cancel(ctx context.Context, jobID string) error {
w.mu.Lock()
defer w.mu.Unlock()
j, ok := w.jobs[jobID]
if !ok {
return nil // already cleaned up
}
j.cancel()
delete(w.jobs, jobID)
return nil
}
Subtle issue: split brain¶
If the control plane loses connection to a worker, it does not know if the cancel reached. The worker may still be running the job. Mitigation: per-job deadlines that fire even without control plane intervention.
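A minimal sketch of the per-job deadline (jobMaxDuration, workerCtx, and runJob are placeholders):
// Every job runs under its own timeout, independent of the control plane.
jobCtx, cancel := context.WithTimeout(workerCtx, jobMaxDuration)
defer cancel()
err := runJob(jobCtx, jobID)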
Even if the cancel command never arrives, the job times out at jobMaxDuration. The worker stays bounded.
Subtle issue: cancel cause across the wire¶
If the admin includes a reason for cancelling, the worker should record it. gRPC metadata can carry the reason:
md := metadata.Pairs("cancel-reason", "admin: too slow")
ctx = metadata.NewOutgoingContext(ctx, md)
client.Cancel(ctx, req)
// on the worker:
md, _ := metadata.FromIncomingContext(ctx)
reasons := md.Get("cancel-reason")
if len(reasons) > 0 {
log.Println("cancel reason:", reasons[0])
}
The reason flows from admin to control plane to worker, preserved across the network.
Cancellation in a queue-backed worker¶
Workers that consume from a queue (Kafka, SQS) have a particular cancellation shape:
- The consumer holds a connection to the broker.
- Each message is delivered with an "in-flight" lock; the consumer must ack to release.
- On cancellation, in-flight messages should be released (so they can be redelivered).
func runConsumer(ctx context.Context, broker Broker) error {
consumer := broker.NewConsumer(ctx)
defer consumer.Close() // releases all in-flight messages on broker side
for {
select {
case <-ctx.Done():
return ctx.Err()
default:
}
msg, err := consumer.Poll(ctx, 100*time.Millisecond)
if err != nil {
return err
}
if msg == nil {
continue
}
if err := process(ctx, msg); err != nil {
msg.Nack() // explicit release; will be redelivered
return err
}
msg.Ack() // acknowledged; will not be redelivered
}
}
Cancellation paths:
- ctx.Done() is checked at the top of every iteration.
- consumer.Poll(ctx, ...) respects context (typically uses ctx.Done() and the poll timeout, whichever fires first).
- process(ctx, msg) respects context.
- consumer.Close() is deferred; on any return, the consumer's broker connection is closed and in-flight messages are nacked.
The result: cancellation causes in-flight messages to be released. They will be redelivered to another consumer (or the same one on next start). At-least-once delivery is preserved.
Subtle: process during cancel¶
If process(ctx, msg) is mid-execution when cancel fires, it should return promptly. The caller then nacks the message (because the process did not succeed). The message is redelivered.
If process somehow completes successfully despite the cancel, the code should ack. The double-handling on redelivery is the consumer's responsibility to deduplicate.
This is the standard "at-least-once with idempotent consumers" architecture. Cancellation slots in cleanly.
Cancellation in event sourcing¶
Event sourcing: state is derived by replaying a sequence of events. Cancellation interacts with replay.
Replay with cancellation¶
func replay(ctx context.Context, log EventLog, applyFn func(Event) error) error {
cursor := ""
for {
select {
case <-ctx.Done():
return ctx.Err()
default:
}
events, next, err := log.Read(ctx, cursor, 100)
if err != nil {
return err
}
if len(events) == 0 {
return nil // replay complete
}
for _, ev := range events {
if err := applyFn(ev); err != nil {
return err
}
if ctx.Err() != nil {
return ctx.Err()
}
}
cursor = next
}
}
The replay can be cancelled at any point. State is partially derived; restart from cursor resumes.
For atomic-replay semantics (where partial replay is unacceptable), use snapshots:
// Take snapshot before replay
snapshot := state.Clone()
if err := replay(ctx, log, applyFn); err != nil {
// Restore on failure or cancellation
state.Restore(snapshot)
return err
}
Cancellation rolls back to the snapshot. At-most-once replay semantics.
Cheat Sheet for Professionals¶
// Inspect cause across the tree
cause := context.Cause(ctx)
// Detach but keep values
detached := context.WithoutCancel(ctx)
// Side-effect callback on cancel
stop := context.AfterFunc(ctx, func() { metrics.Inc() })
defer stop()
// Cancel with rich cause
ctx, cancel := context.WithCancelCause(parent)
defer cancel(nil)
cancel(fmt.Errorf("upstream: %w", err))
// Read the implementation tree
runtime.Stack(buf, true) // get all goroutine stacks
pprof.Lookup("goroutine").WriteTo(w, 2)
// Diagnose cancel hangs
debug.SetGCPercent(-1) // disable GC interference
// then send SIGQUIT to dump all goroutines
Detailed look at signal.NotifyContext implementation¶
func NotifyContext(parent context.Context, signals ...os.Signal) (ctx context.Context, stop context.CancelFunc) {
ctx, cancel := context.WithCancel(parent)
c := &signalCtx{
Context: ctx,
cancel: cancel,
signals: signals,
}
c.ch = make(chan os.Signal, 1)
Notify(c.ch, signals...)
if ctx.Err() == nil {
go func() {
select {
case <-c.ch:
c.cancel()
case <-c.Done():
}
}()
}
return c, c.stop
}
func (c *signalCtx) stop() {
c.cancel()
Stop(c.ch)
}
A clean composition:
- Wrap parent in WithCancel.
- Register a channel for signal delivery.
- Spawn a goroutine that cancels when either a signal arrives or the parent's context cancels (so the goroutine can exit cleanly).
- The stop function cancels and unregisters the signal handler.
The watcher goroutine is the implementation of "signal triggers cancel." It exits when either the signal fires (and it calls cancel) or the derived context's Done() fires (because stop was called or the parent was cancelled).
For long-lived programs, the goroutine lives for the program's lifetime. Negligible cost.
Memory model and cancellation¶
The Go memory model gives certain happens-before guarantees around channels.
- A send on a channel happens before the corresponding receive completes.
- The closing of a channel happens before a receive that returns because the channel is closed.
For cancellation:
- cancel() happens before <-ctx.Done() returns (closed). So any writes done before cancel() are visible to the receiver after <-ctx.Done() returns.
This is what makes the pattern work:
go func() {
config.Set("key", "value") // before cancel
cancel()
}()
go func() {
<-ctx.Done()
val := config.Get("key") // sees "value", guaranteed
}()
The synchronisation provided by channel close also synchronises any preceding memory writes. This is part of why Go's "communicate by sharing" idiom works without explicit memory barriers.
Implication for design¶
You can use cancellation as a synchronisation point. State changes done before cancel() are visible to any goroutine that wakes via <-ctx.Done(). This is occasionally useful: a cancelled state has more than just "I am cancelled"; it can carry final messages or state.
But: rely on this only for state that the cancelling goroutine wrote before calling cancel. Concurrent writes from other goroutines have no guaranteed ordering relative to the cancel.
Pre-1.21 patterns that are now obsolete¶
If you maintain old code, you may encounter patterns that were necessary before Go 1.21 added WithCancelCause, WithoutCancel, and AfterFunc. The modern replacements:
Old: custom cancel with reason¶
type CancelReason struct {
Reason error
mu sync.Mutex
cancel context.CancelFunc
}
func (c *CancelReason) Cancel(reason error) {
c.mu.Lock()
defer c.mu.Unlock()
if c.Reason == nil {
c.Reason = reason
c.cancel()
}
}
A wrapper around context.WithCancel that also records a reason. Now obsolete:
ctx, cancel := context.WithCancelCause(parent)
defer cancel(nil)
cancel(myErr) // records the cause
context.Cause(ctx) // retrieves it
Old: detached context via custom type¶
type detached struct {
context.Context
}
func (detached) Done() <-chan struct{} { return nil }
func (detached) Err() error { return nil }
func (detached) Deadline() (time.Time, bool) { return time.Time{}, false }
func Detach(ctx context.Context) context.Context {
return detached{ctx}
}
A custom context that strips cancellation. Now obsolete:
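The stdlib replacement is a single call:
detached := context.WithoutCancel(ctx) // Go 1.21+: keeps values, drops cancellation and deadline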
The built-in form is more efficient and well-tested.
Old: watcher goroutine for on-cancel side effects¶
A goroutine that waits on <-ctx.Done() and then runs cleanup. Now obsolete for non-blocking cleanup; a minimal sketch of the context.AfterFunc replacement (cleanup is a placeholder for fast, non-blocking work):
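stop := context.AfterFunc(ctx, func() {
	cleanup() // invoked once ctx is cancelled (or its deadline passes)
})
defer stop() // deregister the callback if it is no longer wanted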
No long-lived watcher goroutine: the callback is launched from the canceller's cascade in its own short-lived goroutine. Use it for fast, non-blocking cleanup. For long-running cleanup, the watcher goroutine is still appropriate.
Lesson¶
The standard library evolves. Patterns that were standard at one point become old. Periodically review your codebase for opportunities to replace homegrown solutions with stdlib equivalents.
Cancellation patterns in concurrent libraries¶
A short tour of how popular Go libraries handle cancellation.
golang.org/x/sync/singleflight¶
singleflight.Group.Do(key, fn) ensures only one execution of fn for a given key. If multiple goroutines call Do concurrently for the same key, only one runs fn; the others wait and share the result.
Cancellation:
- The first caller's context is the one used for fn. Cancelling that context propagates to fn.
- Other callers do not have their own context; they wait for the first.
- If the first is cancelled, all callers see the cancellation error.
The trade-off: deduplication wins; cancellation precision loses. The other callers cannot cancel "their" call because there is only one call.
There is DoChan for async use, and Forget to invalidate the in-progress call.
golang.org/x/sync/semaphore¶
semaphore.Weighted is a semaphore with Acquire(ctx, n) that respects context. If ctx cancels during acquisition, Acquire returns the error.
The implementation uses an internal queue and select with <-ctx.Done(). Standard pattern.
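A typical use, sketched (the weight of 8 is arbitrary):
sem := semaphore.NewWeighted(8) // at most 8 units held at once
if err := sem.Acquire(ctx, 1); err != nil {
	return err // ctx was cancelled (or timed out) while waiting for a slot
}
defer sem.Release(1)
// ... bounded work ...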
github.com/cenkalti/backoff/v4¶
Retry library with exponential backoff. Retry(operation, backoff) repeatedly calls operation until it succeeds or backoff.NextBackOff() returns Stop.
Cancellation: pass a BackOffContext wrapping a context. Cancellation aborts the retry loop and the operation.
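A sketch of the context-aware form (callRemote is a placeholder for the retried operation):
b := backoff.WithContext(backoff.NewExponentialBackOff(), ctx)
err := backoff.Retry(func() error {
	return callRemote(ctx) // the operation itself should also respect ctx
}, b)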
github.com/sourcegraph/conc¶
Modern concurrency utilities. The pool.New().WithContext(ctx) returns a pool whose tasks share a context. Cancellation cascades.
The library provides higher-level patterns (MapSlice, Stream, etc.) on top of errgroup-style primitives. Worth a look for clean cancellation semantics.
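A sketch of the shape (tasks and task are placeholders):
p := pool.New().WithContext(ctx)
for _, t := range tasks {
	t := t
	p.Go(func(ctx context.Context) error {
		return task(ctx, t) // every task shares the pool's ctx
	})
}
err := p.Wait() // joins all tasks; returns an error if any task failed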
Lessons across libraries¶
- Every library that does anything blocking takes a context.
- Cancellation is the universal "abort" signal.
- Different patterns trade between cancellation precision and other properties (deduplication, batching, fairness).
In-depth: cancellation in errgroup-like custom code¶
A custom group with features beyond errgroup. Let me show "collect all errors":
type AllErrors struct {
cancel context.CancelFunc
wg sync.WaitGroup
mu sync.Mutex
errs []error
}
func WithContext(ctx context.Context) (*AllErrors, context.Context) {
ctx, cancel := context.WithCancel(ctx)
return &AllErrors{cancel: cancel}, ctx
}
func (g *AllErrors) Go(fn func() error) {
g.wg.Add(1)
go func() {
defer g.wg.Done()
if err := fn(); err != nil {
g.mu.Lock()
g.errs = append(g.errs, err)
g.mu.Unlock()
}
}()
}
func (g *AllErrors) Wait() []error {
g.wg.Wait()
g.cancel()
return g.errs
}
Unlike errgroup, this never cancels on error — it lets every goroutine finish and collects all errors. Useful for "show me everything that went wrong."
The cancellation is only triggered explicitly via cancel or when the parent context cancels. Goroutines that respect context exit when the parent cancels; others run to completion.
This pattern combines best-effort execution with full error reporting.
For situations where you want to cancel after K errors:
type ThresholdGroup struct {
threshold int
counter atomic.Int32
cancel context.CancelFunc
wg sync.WaitGroup
mu sync.Mutex
errs []error
}
func (g *ThresholdGroup) Go(fn func() error) {
g.wg.Add(1)
go func() {
defer g.wg.Done()
if err := fn(); err != nil {
g.mu.Lock()
g.errs = append(g.errs, err)
g.mu.Unlock()
if g.counter.Add(1) >= int32(g.threshold) {
g.cancel()
}
}
}()
}
After K errors, cancel siblings. Useful for "stop processing once we have enough errors to draw a conclusion."
Cancellation in shutdown hooks and finalizers¶
runtime.SetFinalizer runs a function when an object is garbage-collected. Finalizers and cancellation are an awkward combination:
- Finalizers run in their own goroutines.
- Finalizers may run during shutdown.
- Finalizers cannot reliably participate in coordinated shutdown.
Best practice: avoid finalizers for anything that needs ordered shutdown. Use explicit Close methods called by orderly shutdown logic.
For OS resources (file handles, sockets), defer Close() is the right pattern. For application state, explicit Shutdown is right.
runtime.AddCleanup (Go 1.24+) is a finalizer alternative with better semantics, but still does not coordinate with application-level cancellation.
Heisenbugs in cancellation¶
Some cancellation bugs only manifest under specific timing. They are heisenbugs — adding logging makes them disappear because the logging changes the timing.
Typical heisenbug:
ctx, cancel := context.WithCancel(parent)
// A: cancels when some condition becomes true
go func() {
	if condition {
		cancel()
	}
}()
// B: waits for either cancellation or a value from source
select {
case <-ctx.Done():
	return
case v := <-source:
	process(v)
}
In rare interleavings, A's cancel() wins the race against the receive from source, so B exits without processing a value that was about to arrive. Adding a log line in A delays the cancel enough that B always receives the value and the issue never reproduces.
The fix: identify the actual race and add proper synchronisation. Removing the symptom (the log) does not fix the bug.
Tools that help:
- Race detector (catches memory races, which often coincide with timing bugs).
- -count=N repeated tests (increases chance of hitting rare interleavings).
- Manual delay injection in tests (time.Sleep(time.Duration(rand.Intn(100)) * time.Microsecond)).
- Chaos tests in CI.
The principle: timing bugs are real bugs. Treat them as first-class issues, not "flaky tests."
A pattern for cancellation with progress reporting¶
For long-running operations, the caller may want periodic progress updates. Cancellation must respect both the work and the reporting.
type ProgressEvent struct {
Done int
Total int
Err error
}
func runWithProgress(ctx context.Context, fn func(context.Context, func(int)) error, total int) <-chan ProgressEvent {
out := make(chan ProgressEvent, 16)
go func() {
defer close(out)
report := func(done int) {
select {
case out <- ProgressEvent{Done: done, Total: total}:
case <-ctx.Done():
}
}
err := fn(ctx, report)
if err != nil {
select {
case out <- ProgressEvent{Err: err}:
case <-ctx.Done():
}
}
}()
return out
}
The function reports progress through a callback that sends events. The caller receives events; cancellation flows through ctx.
Caller side:
progress := runWithProgress(ctx, doSomething, 1000)
for ev := range progress {
if ev.Err != nil {
return ev.Err
}
log.Printf("progress: %d/%d", ev.Done, ev.Total)
}
return nil
The range exits when the channel closes. On cancellation, the function returns, the channel closes, and the range exits.
The buffered channel (size 16) absorbs bursts so the reporter rarely blocks on a slow consumer. If the buffer fills, report blocks until the consumer catches up or ctx is cancelled; the variant below drops events instead.
Variant: drop only old events, keep recent ones:
report := func(done int) {
select {
case out <- ProgressEvent{Done: done, Total: total}:
case <-ctx.Done():
default:
// buffer full; replace oldest with newest
select {
case <-out:
default:
}
select {
case out <- ProgressEvent{Done: done, Total: total}:
case <-ctx.Done():
default:
}
}
}
A "keep recent" pattern. The consumer always sees the latest progress; intermediate states may be dropped under load.
Cancellation in caching layers¶
A cache that fetches values from a backend on miss may have an in-flight call:
type Cache struct {
sf singleflight.Group
store map[string]Value
mu sync.Mutex
}
func (c *Cache) Get(ctx context.Context, key string) (Value, error) {
c.mu.Lock()
if v, ok := c.store[key]; ok {
c.mu.Unlock()
return v, nil
}
c.mu.Unlock()
v, err, _ := c.sf.Do(key, func() (any, error) {
return c.fetch(ctx, key)
})
if err != nil {
return Value{}, err
}
return v.(Value), nil
}
singleflight deduplicates concurrent fetches for the same key. Cancellation interaction:
- The caller's context can cancel during the fetch.
- singleflight.Do returns when the fetch completes (or errors), but it doesn't respect the caller's context — the single in-flight fetch holds the underlying call.
- If the caller cancels, they get ctx.Err() back. The fetch continues in the background until it completes.
To avoid this, use singleflight.DoChan:
ch := c.sf.DoChan(key, func() (any, error) { ... })
select {
case res := <-ch:
return res.Val.(Value), res.Err
case <-ctx.Done():
return Value{}, ctx.Err()
}
The caller respects context; the underlying fetch continues. If another caller also wants the same key, they receive the same result when the fetch finishes.
For "forget the in-progress fetch":
Lets the next caller start a fresh fetch. Useful for retry-on-cancel patterns.
Cancellation in IO multiplexers (epoll/kqueue)¶
Go's network poller is built on epoll (Linux), kqueue (BSD/macOS), or IOCP (Windows). When a goroutine does a network read, it is parked off the OS thread; the poller wakes it when data arrives.
Cancellation interaction:
- conn.SetReadDeadline(time.Now()) tells the poller to wake the goroutine immediately with a timeout error.
- The Go runtime detects the deadline, removes the goroutine from the poll wait list, and re-queues it with an error.
This is how context.WithTimeout interacts with TCP reads (via the net package's deadline handling). The context's timeout becomes a SetReadDeadline call.
For db.QueryContext, the driver internally uses similar mechanics: it watches ctx.Done() and on cancellation closes the underlying connection or sets a deadline.
Implication¶
Network I/O is well-integrated with Go's cancellation. As long as you use the context-aware methods, network calls cancel cleanly.
The weak link is custom protocols implemented on top of net.Conn without using context. Always layer a deadline-based cancel mechanism.
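One way to layer it, sketched with context.AfterFunc (conn and buf are placeholders):
stop := context.AfterFunc(ctx, func() {
	// Wake any blocked Read immediately; it returns a timeout error.
	conn.SetReadDeadline(time.Now())
})
defer stop()
n, err := conn.Read(buf)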
File I/O cancellation: the difficult case¶
os.File.Read does not accept a context. The read blocks until data arrives (for pipes, sockets disguised as files) or returns immediately (for regular files).
For regular files, the read is typically fast (the OS may have the data in page cache). Cancellation interrupts the wait by closing the file — but this leaves no useful state.
For pipes and FIFOs, cancellation needs:
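a close triggered from outside the read. A sketch (f is an *os.File opened on the pipe):
go func() {
	<-ctx.Done()
	f.Close() // unblocks the in-progress Read; it returns an error
}()
n, err := f.Read(buf)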
The close from another goroutine makes the in-progress read return an error. Not elegant but functional.
A cleaner approach for some cases: use bufio.Scanner or io.Reader wrappers that periodically check context. But these only work if your reading logic can check between reads, not during.
The cgo reading variant¶
If you call into C code that does I/O, that I/O is uninterruptible by Go. The C call holds an OS thread. Mitigation: bound the number of concurrent C calls (so they cannot all block); use timeouts at the C-call level.
Cancellation in distributed actor systems¶
An actor system (Akka, Erlang-style) has many small "actors" each running in their own goroutine. Each actor has a mailbox. Messages are sent to actors; actors process them sequentially.
Cancellation of an actor:
type Actor struct {
mailbox chan Message
ctx context.Context
cancel context.CancelFunc
}
func New(parent context.Context, behaviour func(context.Context, Message)) *Actor {
ctx, cancel := context.WithCancel(parent)
a := &Actor{
mailbox: make(chan Message, 16),
ctx: ctx,
cancel: cancel,
}
go a.run(behaviour)
return a
}
func (a *Actor) run(behaviour func(context.Context, Message)) {
for {
select {
case <-a.ctx.Done():
return
case msg := <-a.mailbox:
behaviour(a.ctx, msg)
}
}
}
func (a *Actor) Send(msg Message) error {
select {
case a.mailbox <- msg:
return nil
case <-a.ctx.Done():
return a.ctx.Err()
}
}
func (a *Actor) Stop() {
a.cancel()
}
The actor processes messages one at a time. Cancellation interrupts the wait between messages (or interrupts the behaviour function if it respects context).
Supervisor for actors¶
A supervisor actor watches child actors and restarts them on failure:
type Supervisor struct {
	ctx      context.Context // parent context for children created by Spawn
	children map[string]*Actor
}
func (s *Supervisor) Spawn(name string, behaviour func(context.Context, Message)) {
a := New(s.ctx, func(ctx context.Context, msg Message) {
defer func() {
if r := recover(); r != nil {
s.notifyFailure(name, r)
}
}()
behaviour(ctx, msg)
})
s.children[name] = a
}
func (s *Supervisor) notifyFailure(name string, reason any) {
log.Printf("actor %s failed: %v", name, reason)
// restart logic ...
}
The supervisor itself is an actor; its cancellation cascades to children. Restart logic creates new actors with fresh contexts derived from the supervisor's.
This is structured concurrency at scale. Go's primitives (context, channels) compose into actor systems without special language support.
In-depth: cancellation race with channel send/receive¶
A nuanced race that can cause subtle bugs.
func producer(ctx context.Context, out chan int) {
for i := 0; ; i++ {
select {
case out <- i:
case <-ctx.Done():
close(out) // BUG: out may already have receivers waiting
return
}
}
}
The bug: the producer closes out after observing ctx.Done(). But what if a receiver is mid-receive?
Actually, closing a channel while receivers wait is fine — they wake up with the closed-channel signal. The bug is more subtle:
What if another goroutine also sends to out? Closing while someone is sending is a panic.
In well-designed code, the convention is that the producing goroutine is the channel's only writer and is the one that closes it, so no one else sends. The close is safe.
But the convention assumes:
- The producer is the only writer.
- The close happens after all sends.
If the code has multiple producers without coordination, the close is unsafe. Multiple-producer designs must use a separate closer that joins all producers first.
Race: cancel + close¶
producer goroutine:
select {
case out <- i:
// ...
case <-ctx.Done():
close(out)
return
}
consumer goroutine:
for v := range out {
process(v)
}
If ctx.Done() becomes ready at the same moment the consumer is ready to receive value i, what happens?
Actually, this is fine. The select non-determinism means either:
- The send wins: v flows to the consumer; the producer loops; next iteration sees ctx.Done(); closes; consumer's range ends.
- The cancel wins: producer closes; consumer's range receives any buffered values (none in unbuffered case) and ends.
Either way, the protocol terminates cleanly.
Race: send to nil channel¶
var ch chan int // nil
select {
case ch <- v: // case is never ready (nil channel)
case <-ctx.Done():
}
A nil channel's send/receive case is never ready. The select waits only on ctx.Done(). Safe.
This is sometimes used to disable a case dynamically; a minimal sketch (realOut is a placeholder):
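var out chan int // nil unless sending is enabled
if shouldSend {
	out = realOut
}
select {
case out <- v: // never ready while out is nil
case <-ctx.Done():
	return
}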
If shouldSend is false, out is nil; the send case is disabled. Cancellation still works.
A practical view: implementing graceful HTTP shutdown¶
Let me walk through http.Server.Shutdown as a case study. From the standard library:
func (srv *Server) Shutdown(ctx context.Context) error {
srv.inShutdown.Store(true)
srv.mu.Lock()
lnerr := srv.closeListenersLocked()
for _, f := range srv.onShutdown {
go f()
}
srv.mu.Unlock()
srv.closeIdleConns()
pollIntervalBase := time.Millisecond
nextPollInterval := func() time.Duration {
// ... exponential backoff capped at 500ms ...
}
timer := time.NewTimer(nextPollInterval())
defer timer.Stop()
for {
if srv.closeIdleConns() {
return lnerr
}
select {
case <-ctx.Done():
return ctx.Err()
case <-timer.C:
timer.Reset(nextPollInterval())
}
}
}
What's happening:
- Mark the server as in-shutdown (no new connections accepted).
- Close all listeners (stop the accept loop).
- Run any registered onShutdown hooks in goroutines.
- Close idle connections.
- Loop: poll until all connections are closed OR the context cancels.
The polling starts at 1 ms (aggressive) and backs off to 500 ms (relaxed). Each poll checks whether all connections are closed; if so, Shutdown returns.
If the context cancels before all connections close (typically because the SLA timer fired), return the context error. The remaining connections are left for forceful close (via srv.Close() or process exit).
The lesson: Shutdown is best-effort. It tries to be graceful with a deadline. After the deadline, the caller must decide what to do (force-close, leave to process exit, etc.).
Cancellation in CSP-style pipelines¶
CSP (Communicating Sequential Processes) is the model that inspired Go's channels. CSP pipelines have specific patterns:
Pipelined filters¶
func odds(ctx context.Context, in <-chan int) <-chan int {
out := make(chan int)
go func() {
defer close(out)
for v := range in {
if v%2 != 0 {
select {
case out <- v:
case <-ctx.Done():
return
}
}
}
}()
return out
}
func square(ctx context.Context, in <-chan int) <-chan int {
out := make(chan int)
go func() {
defer close(out)
for v := range in {
select {
case out <- v * v:
case <-ctx.Done():
return
}
}
}()
return out
}
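A matching generate stage, sketched so the composition below is self-contained (the bound of 100 is arbitrary):
func generate(ctx context.Context) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 0; i < 100; i++ {
			select {
			case out <- i:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}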
func main() {
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
nums := generate(ctx)
result := square(ctx, odds(ctx, nums))
for v := range result {
fmt.Println(v)
}
}
The composition is functional; each stage is a goroutine. Cancellation cascades naturally.
CSP-style pipelines are elegant but rare in production Go. Most production code uses errgroup to wrap the same shapes with error handling.
CSP and structured concurrency¶
CSP processes have implicit lifetimes (they end when their channels close). Combined with structured concurrency (every goroutine has a join), CSP becomes a powerful programming model.
The implementation: every stage takes ctx and respects it. The orchestrator joins via errgroup or WaitGroup. The output channel closes naturally when the producer exits.
Summary¶
At professional level, cancellation propagation is not just a pattern but an understanding of how the runtime implements the primitive. The closed-channel broadcast, the lock-protected children map, the goroutine-watcher for foreign contexts, the timer integration — all are simple structures that combine to give Go its cancellation semantics.
Knowing the internals helps you:
- Predict latency under load.
- Design for high-concurrency services without surprises.
- Debug rare race conditions.
- Choose between alternatives (AfterFunc vs watcher goroutine, custom context vs derived context).
The patterns at this level — hot-swap pools, graceful upgrade, time-bounded shutdowns, per-message deadlines — push the toolkit to its limits. They build on the same primitive but require careful sequencing and rigorous testing.
The final practical advice for cancellation propagation: read the context package source until it is familiar. Measure your service's cancellation behaviour under load. Set SLAs based on measurements, not guesses. Iterate.
Further Reading¶
- src/context/context.go in the Go source tree.
- "Go 1.21 release notes" — covers AfterFunc, WithCancelCause, WithoutCancel.
- "Go scheduler internals" — Go runtime source, runtime/proc.go.
- "Channels in Go" — Go runtime source, runtime/chan.go.
- "Context Considered Harmful?" — discussions of WithValue misuse: https://faiface.github.io/post/context-should-go-away-go2/
- "Go runtime: futexes, monitors, channels" — internal scheduler talk by Russ Cox.