Context — Optimization
1. How to use this file
Fourteen scenarios where context.Context usage allocates more, dispatches slower, or scales worse than it should. Each entry has a Before (code + benchmark) and a collapsible After (optimized code + benchmark + why + trade-offs + when NOT).
Anchored at Go 1.23, amd64. Numbers are reproducible-shape: run go test -bench=. -benchmem on your hardware before quoting them. Context cost is dominated by five things: per-derivation heap allocation, Value walking a linked chain of parents, per-call WithTimeout scheduling a runtime timer (a timer-heap entry, not a goroutine), lazy construction of the ctx.Done() channel in short-lived derived contexts, and the indirection through the Context interface itab. Most wins remove one of those five from the hot path. Reading order: Ex. 1, 2, 7, then any order. Ex. 4, 9, 10 are the ones most senior reviews flag.
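To make "run it on your hardware" concrete, here is a minimal sketch of a measurement harness. It uses testing.Benchmark so it runs as an ordinary program; in a real repository the same loop would normally live in a _test.go file and run under go test -bench. The key type, buildChain, and benchLookup are illustrative names, not part of the original exercises.

```go
package main

import (
	"context"
	"fmt"
	"testing"
)

// key is a hypothetical key type; each int value is a distinct context key.
type key int

// buildChain derives depth nested WithValue contexts; key(0) ends up deepest.
func buildChain(depth int) context.Context {
	ctx := context.Background()
	for i := 0; i < depth; i++ {
		ctx = context.WithValue(ctx, key(i), i)
	}
	return ctx
}

// benchLookup measures ctx.Value for the deepest key in a chain of the given depth.
func benchLookup(depth int) testing.BenchmarkResult {
	ctx := buildChain(depth)
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if ctx.Value(key(0)) == nil {
				b.Fatal("missing value")
			}
		}
	})
}

func main() {
	for _, d := range []int{1, 5, 12} {
		r := benchLookup(d)
		fmt.Printf("depth=%d  %d ns/op  %d allocs/op\n", d, r.NsPerOp(), r.AllocsPerOp())
	}
}
```

The absolute numbers will differ per machine; what should carry over is the shape, that is, how lookup cost grows with chain depth.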
2. Exercise 1 — Deep WithValue chain
A request handler attaches user, tenant, request ID, locale, and trace-span as five separate context.WithValue calls. Every ctx.Value(userKey) then walks five *valueCtx parents to find it. A middleware that reads four of the five per request multiplies the walk.
func attach(ctx context.Context, r *Request) context.Context {
	ctx = context.WithValue(ctx, userKey, r.User)
	ctx = context.WithValue(ctx, tenantKey, r.Tenant)
	ctx = context.WithValue(ctx, ridKey, r.RequestID)
	ctx = context.WithValue(ctx, localeKey, r.Locale)
	ctx = context.WithValue(ctx, spanKey, r.Span)
	return ctx
}

func handler(ctx context.Context) {
	u := ctx.Value(userKey).(*User)
	t := ctx.Value(tenantKey).(*Tenant)
	rid := ctx.Value(ridKey).(string)
	loc := ctx.Value(localeKey).(string)
	_, _, _, _ = u, t, rid, loc
}
After
Pack the five fields into one struct, attach with one `WithValue`. Lookup is one walk + one field load.
type ReqInfo struct {
	User      *User
	Tenant    *Tenant
	RequestID string
	Locale    string
	Span      *Span
}

type reqKey struct{}

func attach(ctx context.Context, r *Request) context.Context {
	return context.WithValue(ctx, reqKey{}, &ReqInfo{
		User: r.User, Tenant: r.Tenant, RequestID: r.RequestID,
		Locale: r.Locale, Span: r.Span,
	})
}

func reqOf(ctx context.Context) *ReqInfo { return ctx.Value(reqKey{}).(*ReqInfo) }

func handler(ctx context.Context) {
	r := reqOf(ctx)
	u, t, rid, loc := r.User, r.Tenant, r.RequestID, r.Locale
	_, _, _, _ = u, t, rid, loc
}
3. Exercise 2 — ctx.Value in a hot loop
A row scanner reads ctx.Value(tenantKey) on every row to filter. For a 1M-row scan, that's 1M iface dispatches plus 1M parent-chain walks (even one hop costs).
func scanRows(ctx context.Context, rows []Row) int {
	n := 0
	for _, r := range rows {
		t := ctx.Value(tenantKey).(*Tenant)
		if r.TenantID == t.ID {
			n++
		}
	}
	return n
}
After
Hoist the lookup once before the loop. The value is request-scoped; it does not change mid-scan. ~7.7× faster.
**Why faster:** `ctx.Value` is a virtual call (iface dispatch) followed by a parent walk that is opaque to the compiler, so it can be neither inlined nor moved out of the loop automatically. Hoisting reduces the loop body to a register-local int compare.
**Trade-off:** None for read-only values. If the tenant could change mid-iteration (it shouldn't), hoisting hides the change. Document the invariant if there's any doubt.
**When NOT:** Loops where the value semantically changes per item (extremely rare for context; that's not what context is for). Loops too short for the per-call overhead to matter (< 100 iters).
4. Exercise 3 — WithTimeout per call instead of WithDeadline shared
A function batches 1000 RPCs against the same upstream with a 5-second budget. Code calls context.WithTimeout(parent, 5*time.Second) inside the per-RPC helper, allocating a fresh *timerCtx (which embeds a cancelCtx) and scheduling a fresh timer for each.
func fanOut(parent context.Context, targets []string) error {
	for _, t := range targets {
		ctx, cancel := context.WithTimeout(parent, 5*time.Second)
		callRPC(ctx, t)
		cancel()
	}
	return nil
}
After
The whole batch shares one deadline. Derive once with `WithDeadline` (or `WithTimeout`) and pass the child to every RPC. ~21× faster, ~1000× fewer allocations.
**Why faster:** `WithTimeout` allocates a `*timerCtx`, schedules a `time.AfterFunc`, and lazily allocates a `chan struct{}` on first `Done()`. Doing this 1000 times schedules and stops 1000 runtime timers (timer-heap entries, not goroutines) and allocates 1000 cancel funcs. Sharing the deadline keeps one timer for the whole batch.
**Trade-off:** A single slow RPC eats budget from the rest, and they all fail at once. If you need per-RPC budgets, you genuinely need per-call timeouts; consider a smaller shared budget plus retries.
**When NOT:** Per-RPC budgets that differ. RPCs that must independently retry on timeout without affecting siblings. Long-running streaming RPCs where one deadline shouldn't terminate the others.
5. Exercise 4 — Repeated context.Background() calls
Hot-path code that doesn't have a context (background workers, init paths, tests) calls context.Background() per invocation. The call returns the same shared value every time, but each call site still pays a function call and an interface-value construction whenever the call is not inlined.
After
Package-level var. The standard library's `context.Background()` already returns a shared zero-size value, but caching the interface value in a package-level variable removes the call at each call site and lets the compiler load it directly. ~5.6× faster.
**Why faster:** `context.Background()` is a function call that returns a `backgroundCtx{}` (a zero-size type embedding `emptyCtx` since Go 1.21) wrapped in an interface; the compiler can sometimes inline it, often not. A package var is a direct interface load from static data. In a 10M/min loop the saved 5 ns per call is ~50 ms/min of CPU.
**Trade-off:** Trivially small win in absolute terms. Style guides sometimes prefer the explicit `context.Background()` call site for grep-ability. Many linters flag package-level state, so adopt sparingly.
**When NOT:** Code called less than ~1000 times per second, where the saving is invisible. Tests where each case appears to want a fresh context for isolation: Background is shared anyway, so there is nothing to isolate, but an explicit `context.Background()` or `context.TODO()` at the call site helps readers. Style-strict codebases.
6. Exercise 5 — Excessive WithCancel for short-lived child goroutines
A request fans out to 8 worker goroutines, each given its own context.WithCancel(parent), with the intent that a failure cancels the siblings. It does not: the eight children are independent, so one worker's cancel never reaches the others. Meanwhile the orchestrator allocates 8 *cancelCtx, 8 cancel funcs, and rewires 8 propagateCancel links.
func handle(ctx context.Context) error {
	var wg sync.WaitGroup
	errs := make([]error, 8)
	for i := 0; i < 8; i++ {
		cctx, cancel := context.WithCancel(ctx)
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer cancel()
			errs[i] = work(cctx, i)
		}(i)
	}
	wg.Wait()
	for _, e := range errs {
		if e != nil {
			return e
		}
	}
	return nil
}
After
`errgroup.WithContext` derives one cancelable child and cancels it on the first error. Workers share it. ~4.7× faster, ~10× fewer allocations.
**Why faster:** One `*cancelCtx`, one cancel func, one `propagateCancel` chain, not eight. `errgroup` cancels once on the first non-nil error; the original code had every worker call its own cancel even on success, racing them all through the propagation lock.
**Trade-off:** All workers cancel together; you cannot isolate one worker's failure. If `work` produces partial results worth keeping when one sibling fails, use independent contexts and aggregate errors manually.
**When NOT:** Workers that must run to completion regardless of sibling failures (collecting metrics from all shards). Heterogeneous fan-out where one worker's cancellation policy differs.
7. Exercise 6 — Spawning a goroutine in propagateCancel (non-stdlib Context)
A homegrown Context implementation supports cancellation by spawning a goroutine per derivation that watches the parent's Done() channel and forwards cancellation. For 10k derivations per second, that's 10k goroutines in flight, each holding ~2 KB of stack.
type myCtx struct {
	parent context.Context
	done   chan struct{}
	err    error
}

func WithCancel(parent context.Context) (*myCtx, func()) {
	c := &myCtx{parent: parent, done: make(chan struct{})}
	go func() { // one goroutine per derivation
		select {
		case <-parent.Done():
			c.err = parent.Err() // written without synchronization
			close(c.done)
		case <-c.done:
		}
	}()
	// Calling this twice, or after the parent fired, closes c.done twice and panics.
	return c, func() { close(c.done) }
}
After
Use the stdlib's `context.WithCancel`. It registers the child on the parent's child set under a mutex, with no goroutine per derivation. The parent's cancel walks the child set and closes them all synchronously. ~23× faster, no goroutine churn.
**Why faster:** Stdlib uses a `children map[canceler]struct{}` on the parent under a `sync.Mutex`; propagation is a synchronous walk, not a select on N goroutines. Goroutine creation is ~1-2 µs and 2 KB of stack each; 10k of them is 20 MB of stack.
**Trade-off:** None: the stdlib version is strictly better. The only reason custom `Context` implementations exist is to integrate with non-Go concurrency systems (very rare) or to attach extra metadata (use `WithValue` instead).
**When NOT:** You are writing a context-compatible cancellation primitive for a runtime that doesn't share stdlib's child-set model (gVisor, TinyGo subset). Even then, study `context.WithCancel`'s implementation first.
8. Exercise 7 — context.WithValue for function arguments
A function chain passes a *Logger through context because "everyone has a context already". Every callee does ctx.Value(loggerKey).(*Logger). Five layers deep, that's five lookups for one argument.
type loggerKey struct{}

func handler(ctx context.Context) {
	ctx = context.WithValue(ctx, loggerKey{}, slog.Default())
	step1(ctx)
}

func step1(ctx context.Context) { ctx.Value(loggerKey{}).(*slog.Logger).Info("a"); step2(ctx) }
func step2(ctx context.Context) { ctx.Value(loggerKey{}).(*slog.Logger).Info("b"); step3(ctx) }
func step3(ctx context.Context) { ctx.Value(loggerKey{}).(*slog.Logger).Info("c") }
After
Pass the logger as a parameter. `context.Value` is for request-scoped data that crosses package boundaries opaquely (auth, trace IDs); explicit dependencies belong in the function signature.
func handler(ctx context.Context, log *slog.Logger) { step1(ctx, log) }
func step1(ctx context.Context, log *slog.Logger)   { log.Info("a"); step2(ctx, log) }
func step2(ctx context.Context, log *slog.Logger)   { log.Info("b"); step3(ctx, log) }
func step3(ctx context.Context, log *slog.Logger)   { log.Info("c") }
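9. Exercise 8 — select { case <-ctx.Done(): ; default: } in a tight loop
A streaming loop polls ctx.Done() non-blockingly every iteration to bail early. The compiler lowers a single-case select with default to a non-blocking channel receive rather than a full selectgo call; that is fast but not free, and each iteration still pays an iface call to Done() plus the extra channel operation. In a 10M-iter loop it adds up.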
func stream(ctx context.Context, ch <-chan int) {
	for v := range ch {
		select {
		case <-ctx.Done():
			return
		default:
		}
		process(v)
	}
}
After
Blocking select on both channels. The receive from `ch` already blocks; combining it with `ctx.Done()` in one select removes the explicit poll. ~1.8× faster.
**Why faster:** The poll variant performs two channel operations per item: the non-blocking check of `Done()` and then the receive from `ch`. The blocking select performs one and parks the goroutine until either channel is ready. It also observes cancellation while `ch` is idle, which the poll variant cannot do because it is blocked in the range receive at that point.
**Trade-off:** Cannot interleave non-channel work between iterations the way a default+work pattern can. If `process(v)` must run even when `ch` is empty (e.g., flush a buffer), the blocking variant won't.
**When NOT:** Loops that do CPU work between channel ops and want to check cancellation periodically without blocking. Loops driven by something other than a channel (file reader, generator).
10. Exercise 9 — Excessive context derivation in middleware chain
A web server has 12 middleware layers, each calling context.WithValue to attach its own scoped data: tracing, logging, auth, locale, feature flags, request ID, deadline, etc. Per request that's 12 *valueCtx allocations.
func chain(handlers []Middleware) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		for _, m := range handlers {
			ctx = m.Apply(ctx, r) // each calls context.WithValue
		}
		handler(ctx, w, r)
	}
}
After
Combine related values into one struct attached via a single `WithValue`. Middleware that doesn't add cancellation/deadline writes fields into the shared struct instead of deriving.
type ReqCtx struct {
	TraceID   string
	Logger    *slog.Logger
	User      *User
	Locale    string
	Flags     FeatureFlags
	RequestID string
}

type reqKey struct{}

func WithReq(ctx context.Context, r *ReqCtx) context.Context {
	return context.WithValue(ctx, reqKey{}, r)
}

func ReqOf(ctx context.Context) *ReqCtx { return ctx.Value(reqKey{}).(*ReqCtx) }

func chain(handlers []Middleware) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		rc := &ReqCtx{}
		ctx := WithReq(r.Context(), rc)
		for _, m := range handlers {
			m.Apply(ctx, rc, r) // mutates rc instead of deriving
		}
		handler(ctx, w, r)
	}
}
11. Exercise 10 — WithTimeout + defer cancel() overhead
A per-request handler creates a 100 ms timeout. WithTimeout schedules a time.Timer; defer cancel() deschedules it. At 50k req/s that's 100k timer ops per second.
func handle(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()
	return downstream(ctx)
}
After
Pooling the `*timerCtx` and reusing it with `Reset` is not available to user code: Go 1.23 makes raw timer pooling correct, but `context`'s internal types are not exported. The realistic version skips the derivation entirely when the parent already carries a deadline within budget. (`context.AfterFunc`, added in Go 1.21, covers fire-once cleanup without allocating a child context; see Ex. 14.)
// If the parent already has a deadline within budget, don't derive again.
func handle(ctx context.Context) error {
	if d, ok := ctx.Deadline(); ok && time.Until(d) <= 100*time.Millisecond {
		return downstream(ctx)
	}
	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()
	return downstream(ctx)
}
12. Exercise 11 — errgroup.WithContext over-derivation
A pipeline has three stages, each starting its own errgroup.WithContext(parent). Three *cancelCtx allocations, three child registrations on the parent.
func pipeline(ctx context.Context) error {
	g1, gctx1 := errgroup.WithContext(ctx)
	for _, x := range stage1Items {
		g1.Go(func() error { return doStage1(gctx1, x) })
	}
	if err := g1.Wait(); err != nil {
		return err
	}
	g2, gctx2 := errgroup.WithContext(ctx)
	for _, x := range stage2Items {
		g2.Go(func() error { return doStage2(gctx2, x) })
	}
	if err := g2.Wait(); err != nil {
		return err
	}
	g3, gctx3 := errgroup.WithContext(ctx)
	for _, x := range stage3Items {
		g3.Go(func() error { return doStage3(gctx3, x) })
	}
	return g3.Wait()
}
After
Derive cancellation once and share it across the three stages. Each stage runs in a zero-value `errgroup.Group`, which derives nothing (calling `errgroup.WithContext` inside `run` would add a derivation per stage and defeat the purpose). The first failing task calls the shared `cancel`, so in-flight siblings still stop early, and you pay one derivation instead of three.
func pipeline(ctx context.Context) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	run := func(items []Item, fn func(context.Context, Item) error) error {
		var g errgroup.Group // zero value: no context derivation
		for _, x := range items {
			g.Go(func() error {
				err := fn(ctx, x)
				if err != nil {
					cancel() // first failure cancels the shared context
				}
				return err
			})
		}
		return g.Wait()
	}
	if err := run(stage1Items, doStage1); err != nil {
		return err
	}
	if err := run(stage2Items, doStage2); err != nil {
		return err
	}
	return run(stage3Items, doStage3)
}
13. Exercise 12 — Many goroutines each calling ctx.Done()
A scatter-gather across 200 goroutines has each call ctx.Done() once to set up its select. Only the first call on a cancelable context takes its mutex and lazily allocates the chan struct{}; later calls read the stored channel. Every call, however, still goes through the iface dispatch and, when the context is wrapped in WithValue layers, one forwarding call per layer to reach the cancelable ancestor.
func scatter(ctx context.Context, work []Task) {
	var wg sync.WaitGroup
	for _, t := range work {
		wg.Add(1)
		go func(t Task) {
			defer wg.Done()
			select {
			case <-ctx.Done():
				return
			case res := <-runAsync(t):
				handle(res)
			}
		}(t)
	}
	wg.Wait()
}
After
Resolve `ctx.Done()` once before fan-out; pass the channel directly. ~3× faster.
**Why faster:** `ctx.Done()` per goroutine is an iface call (`Context.Done`) that, for a derived context, walks parents until it finds the cancelable ancestor. Resolving it once turns 200 iface calls into 200 channel receives on a captured local, so the runtime can register the receive on the channel's wait queue directly.
**Trade-off:** If `ctx` is replaced (rare, since context references don't change), the captured `done` still points to the old channel. Almost never an issue, but `ctx.Done()` re-resolved per call would catch it. Don't capture across goroutine spawn boundaries that outlive the parent function.
**When NOT:** A single goroutine calling `ctx.Done()` once: no savings. Goroutines that legitimately need to re-resolve `Done()` because they receive a different context per loop iteration. Code where readers expect `select { case <-ctx.Done() }` and may be confused by the captured variable.
14. Exercise 13 — Closure-captured ctx vs pass-as-parameter
A worker function calls a helper that needs the context. Two styles: pass ctx as parameter, or capture it in a closure that the helper invokes. Capturing escapes ctx to the heap if the closure escapes; passing it threads through registers.
func work(ctx context.Context, items []Item) {
	helper := func(i Item) {
		if ctx.Err() != nil {
			return
		}
		process(ctx, i)
	}
	for _, it := range items {
		helper(it)
	}
}
After
Pass `ctx` explicitly. The closure no longer escapes, and the compiler can inline `helper`. ~1.3× faster, allocation eliminated.
**Why faster:** A closure that captures an iface (`ctx`) and escapes forces its captured state onto the heap, because the closure may outlive the stack frame. Passing the iface as a parameter keeps it in argument registers (Go 1.17+ register ABI handles iface in two regs). Compiler inlining sees the parameter form clearly.
**Trade-off:** Tiny absolute win: 160 ns and one allocation per 1000 calls. Readability is the real driver here: explicit ctx-as-parameter is the idiomatic Go style and grep-friendly. Closure form reads more naturally for one-shot inline helpers.
**When NOT:** Almost any case, since the perf delta is negligible. Optimize for the readability your team prefers. If the closure form reads better, ship it. The exception: hot paths where pprof shows the closure allocation.
15. Exercise 14 — Allocating cancel functions per request
context.WithCancel returns a CancelFunc — a closure over the internal cancel logic. For 100k req/s, each allocating a fresh cancel func, that's 100k closure heap objects per second.
func handle(ctx context.Context) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	return downstream(ctx)
}
After
Pooling the cancelable context would mean resetting a `*cancelCtx`, and stdlib internals aren't exported, so that route is closed. For the common case of "run cleanup when the parent cancels", use `context.AfterFunc` (Go 1.21+) instead: no new context and no cancel func are allocated. ~5× faster, ~3× fewer allocations.
**Why faster:** `WithCancel` allocates a `*cancelCtx`, a cancel closure, and registers the child on the parent's child set. `AfterFunc` allocates a single `afterFuncCtx` and registers it the same way, with no new context and no closure for the user. The returned `stop` function deregisters the callback; it's lightweight.
**Trade-off:** `AfterFunc` doesn't give you a new context to pass downstream; it just runs `cleanup` when the parent cancels. Use it when downstream code doesn't need cancellation propagation, only cleanup hooks. For real propagation, `WithCancel` is still required.
**When NOT:** Code that genuinely needs a new cancelable context for downstream (forwarding to RPC stubs that select on `ctx.Done`). Pre-1.21 Go versions, where `AfterFunc` doesn't exist. Code where the cleanup must run on success path too (use `defer`, not `AfterFunc`).
16. When NOT to optimize
Context cost dominates only when derivation, lookup, or cancellation propagation lives on a hot path measured in microseconds. If your handler does 100 ms of database work, a 5 µs WithValue overhead is invisible. The same holds for per-request middleware bundles, CLI commands building a one-shot context, and background jobs that derive once per minute. Profile first; the right answer is often "leave it alone".
Profile signatures. Context overhead has five typical fingerprints in a CPU profile: runtime.mallocgc from context.WithValue → Ex. 1 or 9; (*valueCtx).Value walking a chain → Ex. 1 or 2; (*timerCtx).cancel and runtime timer traffic → Ex. 3 or 10; runtime.selectnbrecv or runtime.chanrecv (runtime.selectgo for multi-case selects) hot in a loop → Ex. 8; (*cancelCtx).Done or (*cancelCtx).cancel showing high call count → Ex. 5, 11, 12, or 14.
Common premature optimizations: packing values into a struct (Ex. 1) when only one key is read; hoisting ctx.Value (Ex. 2) outside loops with < 100 iters; sharing WithTimeout (Ex. 3) across RPCs with independent SLAs; package-level Background() (Ex. 4) in code called once per request; errgroup (Ex. 5) when you need independent worker policies; eliminating context-passed loggers (Ex. 7) when the chain is two deep; AfterFunc (Ex. 14) when downstream actually needs a cancelable context.
Correctness gaps disguised as optimizations: struct-packed values (Ex. 1) where one field needs independent rebinding, since all readers see the new struct or none; hoisted ctx.Value (Ex. 2) where the value could change mid-loop; shared timeout (Ex. 3) where one slow call should not poison the whole batch; cached bgCtx (Ex. 4) that test fixtures forget to reset; errgroup (Ex. 5) where worker failures must not cancel siblings; bypassing stdlib WithCancel (Ex. 6) and forgetting the synchronous child-set propagation; logger-as-parameter (Ex. 7) breaking plugin contracts that expect context-only signatures; blocking select (Ex. 8) where a default branch was actually doing real work between channel ops; combined middleware struct (Ex. 9) where concurrent middleware race to write the same field; conditional timeout (Ex. 10) silently dropping a per-handler enforcement; shared done channel (Ex. 12) captured before a context replacement; closure vs param (Ex. 13) where the closure form was hiding a ctx aliasing bug; AfterFunc (Ex. 14) used where actual cancellation propagation was needed.
17. Summary
Always-ship wins (default in any new context-using code): hoist ctx.Value calls out of hot loops (Ex. 2); share WithTimeout/WithDeadline across batches that share an SLA (Ex. 3); use errgroup.WithContext instead of hand-rolled fan-out cancellation (Ex. 5); never write a custom Context — use the stdlib types (Ex. 6); pass explicit dependencies (loggers, DB handles) as parameters, not via ctx.Value (Ex. 7); blocking select on ctx.Done() instead of poll + default (Ex. 8); resolve ctx.Done() once before fan-out (Ex. 12).
Wins behind a profile (when measurements justify them): packing multiple values into one struct + one WithValue (Ex. 1, when valueCtx walks show); package-level Background() cache (Ex. 4, when called millions of times); combining middleware values (Ex. 9, when chain depth > 8); conditional timeout to skip redundant derivation (Ex. 10, when parent deadline usually wins); shared parent cancel across pipeline stages (Ex. 11, when WithCancel shows hot); AfterFunc instead of WithCancel for cleanup-only paths (Ex. 14, when cancel-func allocation shows in pprof).
Specialty (only when the design calls for it): timer pooling for very-high-QPS handlers where *timerCtx allocation is the bottleneck (Ex. 10); custom Context implementations only when integrating with a non-Go runtime; struct-packed ReqCtx shared across closed middleware ecosystems where allocation budgets are tight.
Context cost is derivation, lookup, propagation, dispatch, and channel construction. Strip those five from the hot path by choosing the right primitive: WithValue for opaque request-scoped data; WithCancel/WithDeadline/WithTimeout for cancellation and budgets; errgroup for cancel-on-first-error fan-out; AfterFunc for cleanup hooks; plain function parameters for explicit dependencies. The primitives are cheap; the wins come from matching the primitive to the shape of the work. Profile, then pick the lever; the five signatures above tell you which one.