Futures & Promises: Optimization
1. How to use this file
Twelve scenarios where Future/Promise code is slower, allocates more, or scales worse than it should. Each entry has a Before (code + benchmark) and a collapsible After (optimized code + benchmark + why + trade-offs + when NOT).
Anchored at Go 1.23, amd64. Numbers are reproducible-shape — run go test -bench=. -benchmem on your hardware before quoting them. Future cost is dominated by three things: goroutine creation, channel synchronization, and iface boxing. Most wins remove one of those three on the hot path.
Reading order: Exercise 1, 5 (generics), 8 (pooling), then the rest in any order.
2. Exercise 1: Goroutine-per-Future for tiny work
The reflex is f := Go(func() (T, error) { ... }) for every async value. For 200-ns work, the goroutine itself (~2 µs to start, a stack that starts at 2 KB and grows on demand, scheduler bookkeeping) costs 50× the work it wraps.
Before:
for _, k := range keys {
f := Go(func() (User, error) { return cache.Get(k) })
users = append(users, mustAwait(f))
}
After
Batch the trivially-fast work. Use `sync.Once`-cached futures only when the result is shared.
for _, k := range keys {
users = append(users, cache.Get(k))
}
// sync.Once-cached future for expensive shared work:
type Cached[T any] struct {
once sync.Once
val T
err error
fn func() (T, error)
}
func (c *Cached[T]) Get() (T, error) {
c.once.Do(func() { c.val, c.err = c.fn() })
return c.val, c.err
}
3. Exercise 2: Channel-of-Result with mutex
A Future that protects (val, err, done) with a mutex pays Lock/Unlock on the resolver and a guarded read on every Await. Post-resolution readers still take the lock to discover the value is ready.
Before:
type Future[T any] struct {
mu sync.Mutex; cond *sync.Cond
done bool; val T; err error
}
func (f *Future[T]) Await() (T, error) {
f.mu.Lock(); for !f.done { f.cond.Wait() }
v, err := f.val, f.err; f.mu.Unlock()
return v, err
}
After
`atomic.Pointer[result]` for the resolved state, plus a `chan struct{}` for the still-waiting hop. Post-resolution readers take the lock-free path.
type result[T any] struct { val T; err error }
type Future[T any] struct {
done chan struct{}
res atomic.Pointer[result[T]]
once sync.Once
}
func (f *Future[T]) Resolve(v T) {
f.once.Do(func() { f.res.Store(&result[T]{val: v}); close(f.done) })
}
func (f *Future[T]) Await(ctx context.Context) (T, error) {
if r := f.res.Load(); r != nil { return r.val, r.err } // fast path
select {
case <-f.done: r := f.res.Load(); return r.val, r.err
case <-ctx.Done(): var z T; return z, ctx.Err()
}
}
4. Exercise 3: Unbuffered Future channel
The textbook Future uses make(chan T). The send blocks until someone awaits. For a value that resolves before any consumer awaits, the resolver parks until one does, and leaks its goroutine if nobody ever awaits.
Before:
func NewFuture[T any]() *Future[T] { return &Future[T]{ch: make(chan T)} }
func (f *Future[T]) Resolve(v T) { f.ch <- v } // blocks until Await
// producer resolves immediately, consumer awaits 10 ms later → producer parks 10 ms
After
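A minimal sketch of what a cap-1 buffered Future might look like. The names `Future`, `NewFuture`, `Resolve`, `Await`, and the helper `resolveThenAwait` are illustrative assumptions, and the sketch supports a single consumer only.

```go
package main

import (
	"fmt"
	"sync"
)

// Future is a one-shot value holder backed by a channel of capacity 1.
// The type and method names are illustrative, not taken from a library.
type Future[T any] struct {
	ch   chan T
	once sync.Once
}

func NewFuture[T any]() *Future[T] {
	return &Future[T]{ch: make(chan T, 1)} // capacity 1: Resolve never parks
}

// Resolve stores the value. sync.Once turns a second call into a no-op
// instead of a blocked send on the already-full buffer.
func (f *Future[T]) Resolve(v T) {
	f.once.Do(func() { f.ch <- v })
}

// Await receives the value. This sketch supports one consumer only.
func (f *Future[T]) Await() T { return <-f.ch }

// resolveThenAwait resolves before any consumer exists, on the same goroutine.
// With make(chan T) the first Resolve would block forever here.
func resolveThenAwait() int {
	f := NewFuture[int]()
	f.Resolve(42)
	f.Resolve(7) // ignored by the one-shot guard
	return f.Await()
}

func main() {
	fmt.Println(resolveThenAwait()) // prints 42
}
```

Because the send lands in the buffer, the producer and consumer no longer need to meet in time, which is the property the resolve-before-await pattern depends on.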
Buffer of 1. The value goes into the buffer; the resolver never blocks. ~70× faster on the resolve-before-await pattern.
**Why faster:** Unbuffered channels do a rendezvous: the send hands the value directly to a waiting receive, parking the sender's G. A buffered channel just memcopies into the buffer. No park, no scheduler hop, no second context switch.
**Trade-off:** 8-24 B per Future for the buffer cell. One-shot semantics: a second send blocks. Pair with `sync.Once` to enforce.
**When NOT:** When you want synchronous handoff as backpressure, which is almost never desired for Futures.
5. Exercise 4: Per-Future context allocation
Calling context.WithTimeout per Future allocates a *timerCtx, registers a timer in the runtime heap, and chains parent cancel propagation. For 50 futures with the same logical deadline, that's 50 redundant allocations and 50 timers.
Before:
g, gctx := errgroup.WithContext(ctx)
for i, id := range ids {
i, id := i, id
g.Go(func() error {
ctx2, cancel := context.WithTimeout(gctx, 200*time.Millisecond) // per-future
defer cancel()
u, err := fetchUser(ctx2, id); out[i] = u; return err
})
}
After
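A sketch of the shared-deadline shape, under stated assumptions: `fetchUser` is a hypothetical stand-in for the real RPC, and `sync.WaitGroup` plus `sync.Once` stand in for `errgroup` so the example stays within the standard library. The point is the single `context.WithTimeout` call outside the loop.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fetchUser is a hypothetical stand-in for the real RPC; it only honors ctx.
func fetchUser(ctx context.Context, id int) (string, error) {
	select {
	case <-time.After(time.Millisecond):
		return fmt.Sprintf("user-%d", id), nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// fetchAll derives one deadline for the whole fan-out and shares it.
func fetchAll(parent context.Context, ids []int, budget time.Duration) ([]string, error) {
	ctx, cancel := context.WithTimeout(parent, budget) // one timerCtx, one timer
	defer cancel()

	out := make([]string, len(ids))
	var (
		wg       sync.WaitGroup
		errOnce  sync.Once
		firstErr error
	)
	for i, id := range ids {
		wg.Add(1)
		go func(i, id int) {
			defer wg.Done()
			u, err := fetchUser(ctx, id) // every sub-fetch sees the same ctx
			if err != nil {
				// Record the first error and cancel the siblings.
				errOnce.Do(func() { firstErr = err; cancel() })
				return
			}
			out[i] = u
		}(i, id)
	}
	wg.Wait()
	return out, firstErr
}

// demoShared runs three sub-fetches under one 200 ms budget.
func demoShared() ([]string, error) {
	return fetchAll(context.Background(), []int{1, 2, 3}, 200*time.Millisecond)
}

func main() {
	users, err := demoShared()
	fmt.Println(users, err)
}
```

With `errgroup.WithContext` the error bookkeeping disappears; the deadline handling stays the same.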
Derive *one* deadline at the request boundary and reuse it. ~6.5× faster, 26× fewer allocations.
**Why faster:** One `*timerCtx` instead of 50. One timer in the runtime heap (O(log n) per insert). `errgroup` already cancels its derived context on error, which gives the same propagation without per-future overhead.
**Trade-off:** All futures share the same deadline. If individual sub-requests need their own budget, per-future is correct. In practice the request-level budget is what callers care about.
**When NOT:** When per-future deadlines model the underlying SLA (50 ms per RPC regardless of fan-out). When the futures escape a request boundary.
6. Exercise 5: Boxing via any
A pre-generics Future stores interface{}. Every Resolve boxes the value into an iface, which means a heap allocation for most values that are not already pointers, including any multi-word struct. Every Await type-asserts.
Before:
type Future struct { done chan struct{}; val any; err error }
func (f *Future) Resolve(v any) { f.val = v; close(f.done) }
f.Resolve(user) // boxes User (40 B) → heap alloc
<-f.done; u := f.val.(User) // wait for resolution, then type assert
After
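A minimal sketch of a typed Future, assuming a hypothetical `User` payload and illustrative method names. `val` is stored as a `T` field, so neither `Resolve` nor `Await` goes through an interface conversion.

```go
package main

import (
	"fmt"
	"sync"
)

// User is a hypothetical payload type used only for this illustration.
type User struct {
	ID   int64
	Name string
}

// Future stores T directly, so Resolve never converts the value to an interface.
type Future[T any] struct {
	done chan struct{}
	once sync.Once
	val  T
	err  error
}

func NewFuture[T any]() *Future[T] {
	return &Future[T]{done: make(chan struct{})}
}

// Resolve and Reject are guarded by the same sync.Once, so only the first
// call takes effect and close(done) runs at most once.
func (f *Future[T]) Resolve(v T) {
	f.once.Do(func() { f.val = v; close(f.done) })
}

func (f *Future[T]) Reject(err error) {
	f.once.Do(func() { f.err = err; close(f.done) })
}

// Await returns the typed value; no type assertion is needed at the call site.
func (f *Future[T]) Await() (T, error) {
	<-f.done
	return f.val, f.err
}

// demoTyped resolves a Future[User] from another goroutine and awaits it.
func demoTyped() (User, error) {
	f := NewFuture[User]()
	go f.Resolve(User{ID: 1, Name: "ada"})
	return f.Await()
}

func main() {
	u, err := demoTyped()
	fmt.Println(u.Name, err)
}
```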
Generics. `Future[User]` stores `User` directly. ~14× faster, zero allocations.
**Why faster:** The compiler instantiates `Future[User]` with `f.val` laid out as a `User` field, not a two-word iface header. No iface materialization, no type-descriptor write, no type-assertion check.
**Trade-off:** One Future type per concrete payload, which is the right granularity. Generic instantiation grows binary size slightly per `T`, harmless at tens of types.
**When NOT:** A genuinely heterogeneous result channel that one consumer dispatches on type. There iface is unavoidable; reach for a tagged-union wrapper.
7. Exercise 6: errgroup with high SetLimit
A team sets g.SetLimit(1000) "for throughput". Each in-flight call uses 8 KB stack and a downstream connection. The downstream rate limiter, capped at 200 QPS, rejects most of the calls, and context switching costs more than the work itself.
Before:
g, gctx := errgroup.WithContext(ctx)
g.SetLimit(1000)
for _, id := range ids {
id := id
g.Go(func() error { return fetchUser(gctx, id) })
}
After
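A sketch of sizing the limit from Little's Law and enforcing it. `littleLimit` and `runLimited` are hypothetical helpers, and a buffered channel is used as a semaphore in place of `errgroup.SetLimit` so the example needs only the standard library.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// littleLimit returns ceil(qps * latency): the in-flight count that Little's Law
// predicts will saturate a downstream serving qps requests/sec at that latency.
func littleLimit(qps int, latency time.Duration) int {
	n := (int64(qps)*int64(latency) + int64(time.Second) - 1) / int64(time.Second)
	if n < 1 {
		n = 1
	}
	return int(n)
}

// runLimited runs jobs with at most limit in flight and reports the peak
// concurrency it observed.
func runLimited(limit int, jobs []func()) int64 {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var inFlight, peak atomic.Int64
	for _, job := range jobs {
		sem <- struct{}{} // blocks while limit jobs are already running
		wg.Add(1)
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }()
			n := inFlight.Add(1)
			for { // record the highest in-flight count seen so far
				p := peak.Load()
				if n <= p || peak.CompareAndSwap(p, n) {
					break
				}
			}
			job()
			inFlight.Add(-1)
		}(job)
	}
	wg.Wait()
	return peak.Load()
}

// demoLimit sizes the limit for 200 QPS at 50 ms and runs 100 short jobs under it.
func demoLimit() (limit int, peak int64) {
	limit = littleLimit(200, 50*time.Millisecond) // 200 * 0.05 = 10
	jobs := make([]func(), 100)
	for i := range jobs {
		jobs[i] = func() { time.Sleep(time.Millisecond) }
	}
	return limit, runLimited(limit, jobs)
}

func main() {
	limit, peak := demoLimit()
	fmt.Println("limit:", limit, "peak in flight:", peak)
}
```

In errgroup-based code the same number would be passed to `g.SetLimit`; the headroom factor is a tuning decision, not part of the formula.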
Apply Little's Law: `L = λ × W`. Downstream serves 200 QPS at 50 ms each: `L = 200 × 0.05 = 10`. A limit of 10-20 saturates without queueing. Same throughput (downstream-bound), ~11× better p99.
**Why faster:** Past saturation, more concurrency just adds queueing latency, which is Little's Law in reverse. 1000 in-flight calls against 200 QPS is 1000 / 200 = 5 seconds of queue.
**Trade-off:** Under-limit leaves throughput on the floor. Tune by measuring where downstream p99 starts climbing; add 2× headroom and a hard cap.
**When NOT:** CPU-bound work where downstream isn't the constraint. There `GOMAXPROCS × 2` is a reasonable starting limit.
8. Exercise 7: select per Future Await
A consumer awaiting N futures does sequential for { f.Await(ctx) }. When each future only starts, or only makes progress, once it is awaited, p99 is the sum of the wait times, not the max.
Before:
After
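A minimal fan-in sketch: one forwarder goroutine per future pushes into a shared results channel, so the consumer handles results in completion order and can stop on `ctx.Done()`. The `Future`, `Go`, `Done`, and `AwaitAll` names are assumptions made for this illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Future is a minimal illustrative type; Go starts the work eagerly.
type Future[T any] struct {
	done chan struct{}
	val  T
}

func Go[T any](fn func() T) *Future[T] {
	f := &Future[T]{done: make(chan struct{})}
	go func() { f.val = fn(); close(f.done) }()
	return f
}

// Done is closed once the value is available.
func (f *Future[T]) Done() <-chan struct{} { return f.done }

type indexed[T any] struct {
	i int
	v T
}

// AwaitAll forwards every future into one channel and collects the results
// by index, returning early if ctx is cancelled.
func AwaitAll[T any](ctx context.Context, fs []*Future[T]) ([]T, error) {
	results := make(chan indexed[T], len(fs)) // buffered: forwarders never block
	for i, f := range fs {
		go func(i int, f *Future[T]) {
			select {
			case <-f.Done():
				results <- indexed[T]{i: i, v: f.val}
			case <-ctx.Done(): // stop forwarding once the caller has given up
			}
		}(i, f)
	}
	out := make([]T, len(fs))
	for range fs {
		select {
		case r := <-results:
			out[r.i] = r.v
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return out, nil
}

// demoFanIn resolves three futures in reverse order and collects them by index.
func demoFanIn() ([]int, error) {
	fs := make([]*Future[int], 3)
	for i := range fs {
		i := i
		fs[i] = Go(func() int {
			time.Sleep(time.Duration(3-i) * time.Millisecond)
			return i * 10
		})
	}
	return AwaitAll(context.Background(), fs)
}

func main() {
	out, err := demoFanIn()
	fmt.Println(out, err)
}
```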
Fan-in with a results channel. For runtime-sized case sets, build `[]reflect.SelectCase` from each future's `Done()` channel plus `ctx.Done()`, then call `reflect.Select`.
**Why faster:** Sequential await sums latencies; fan-in takes the max. For 10 futures at 120 µs ± 40 µs, sequential is ~1200 µs, parallel max ~200 µs. `reflect.Select` is slower per case but supports runtime-sized sets.
**Trade-off:** Fan-in spawns a goroutine per future (~2 µs setup). For 3 futures, goroutine cost exceeds the parallelism win, so await sequentially. `reflect.Select` allocates per case.
**When NOT:** Futures with sharply different priorities, where you want the first resolved and may bail. Hand-coded `select` is clearer.
9. Exercise 8: Repeated Future creation in a loop
A streaming pipeline allocates one *Future[T] per item. At 100k items/sec that's 8 MB/s of garbage; GC pauses show in p99.
Before:
for j := range in {
f := &Future[Result]{done: make(chan struct{})}
go func(j Job, f *Future[Result]) {
r, err := doWork(j)
if err != nil { f.Reject(err) } else { f.Resolve(r) }
}(j, f)
out <- f
}
After
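A sketch of a pooled Future, under stated assumptions: `Result`, `Acquire`, and `Release` are hypothetical names, the Future is single-consumer, and `Resolve` or `Reject` is called exactly once per acquisition. This variation signals through a cap-1 notify channel whose token is consumed by `Await`, so the channel is reused along with the struct.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Result is a hypothetical payload type for this illustration.
type Result struct{ N int }

// Future signals completion with a token on a cap-1 channel instead of close(),
// because a closed channel cannot be reopened for the next user.
type Future struct {
	notify chan struct{}
	val    Result
	err    error
}

var futurePool = sync.Pool{
	New: func() any { return &Future{notify: make(chan struct{}, 1)} },
}

// Acquire returns a Future from the pool, allocating only on a miss.
func Acquire() *Future { return futurePool.Get().(*Future) }

// Resolve and Reject must be called exactly once per acquisition;
// a second call would block on the full buffer.
func (f *Future) Resolve(v Result) { f.val = v; f.notify <- struct{}{} }
func (f *Future) Reject(err error) { f.err = err; f.notify <- struct{}{} }

// Await consumes the token, leaving the channel empty and ready for reuse.
func (f *Future) Await() (Result, error) {
	<-f.notify
	return f.val, f.err
}

// Release must be called exactly once, after Await has returned.
func (f *Future) Release() {
	f.val, f.err = Result{}, nil
	futurePool.Put(f)
}

// demoPool cycles pooled futures through resolve, await, and release.
func demoPool() (sum int, rejected bool) {
	for i := 1; i <= 3; i++ {
		f := Acquire()
		go f.Resolve(Result{N: i})
		r, _ := f.Await()
		f.Release()
		sum += r.N
	}
	f := Acquire()
	go f.Reject(errors.New("boom"))
	_, err := f.Await()
	f.Release()
	return sum, err != nil
}

func main() {
	fmt.Println(demoPool()) // prints 6 true
}
```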
`sync.Pool` of Future structs. `close(done)` is irreversible, so switch to a notify channel recreated on Release. ~6.8× faster, half the allocations.
**Why faster:** Pool avoids `mallocgc` for the Future struct (40-80 B). GC pressure drops; pauses shrink. The remaining allocation is the notify channel; pool that too if you need to.
**Trade-off:** Release must be called exactly once, after the last consumer has awaited. Multi-consumer broadcast Futures cannot be pooled this way. `sync.Pool` may drop idle objects during GC, so hit-rate is high but not 100%.
**When NOT:** Below ~10k Futures/sec. When Future lifetime escapes the pool's scope.
10. Exercise 9: time.After leaks in awaiting loop
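A loop that awaits with a timeout calls time.After inside the select, so each iteration allocates a fresh timer and channel. Before Go 1.23, a timer abandoned by an early return also stayed in the runtime timer heap until it fired; as of Go 1.23 unreferenced timers are garbage-collected, so the remaining cost is the per-iteration allocation and timer-heap insert.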
Before:
for {
select {
case v := <-future.Done(): return process(v)
case <-time.After(100 * time.Millisecond): // new timer allocated every iteration
if retryCount++; retryCount > 10 { return errTimeout }
case <-ctx.Done(): return ctx.Err()
}
}
After
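A sketch of the single-timer loop. `awaitWithRetry`, `errTimeout`, and the `done <-chan int` parameter are assumptions standing in for the Future's `Done()` channel. The timer is only re-armed after its channel has been received from, which is safe on any Go version.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("future not resolved within retry budget")

// awaitWithRetry waits on done, waking every interval, and gives up after
// maxRetries timer fires.
func awaitWithRetry(ctx context.Context, done <-chan int, interval time.Duration, maxRetries int) (int, error) {
	t := time.NewTimer(interval) // allocated once, outside the loop
	defer t.Stop()
	retries := 0
	for {
		select {
		case v := <-done:
			return v, nil
		case <-t.C:
			retries++
			if retries > maxRetries {
				return 0, errTimeout
			}
			t.Reset(interval) // t.C was just drained, so re-arming is safe
		case <-ctx.Done():
			return 0, ctx.Err()
		}
	}
}

// demoResolved delivers a value after a few timer intervals have elapsed.
func demoResolved() (int, error) {
	done := make(chan int, 1)
	go func() {
		time.Sleep(5 * time.Millisecond)
		done <- 7
	}()
	return awaitWithRetry(context.Background(), done, time.Millisecond, 1000)
}

// demoTimedOut never delivers a value, so the retry budget runs out.
func demoTimedOut() error {
	_, err := awaitWithRetry(context.Background(), make(chan int), time.Millisecond, 2)
	return err
}

func main() {
	fmt.Println(demoResolved())
	fmt.Println(demoTimedOut())
}
```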
`time.NewTimer` once, reset per iteration. In Go 1.23, `Timer.Reset` and `Timer.Stop` are safe to call without draining `t.C`; the runtime handles the leftover-fire case. ~12× faster, zero allocations per iteration.
**Why faster:** `time.After` allocates a `*Timer` and registers it in the runtime timer heap (O(log n)) every call. With Reset, one timer is reused; its heap position is updated, not allocated.
**Trade-off:** Slightly more code. Pre-1.23 Reset semantics required draining `t.C` first, so old code has subtle bugs around stop-then-reset. Safe in 1.23.
**When NOT:** One-shot timeouts (a single select that runs once). There `time.After` is fine: at most one timer is outstanding.
11. Exercise 10: Future tree depth N
A pipeline chains futures: Map(f0, fn1) → Map(f1, fn2) → ... N deep. Each Map spawns a forwarder goroutine. For N=10 stages, 10 forwarder goroutines per item.
Before:
f := initialFetch()
for _, stage := range stages { // 10 stages, each spawns a goroutine
f = Map(f, stage)
}
result, _ := f.Await(ctx)
After
Flatten into a single combinator that runs all stages in one goroutine.
func Chain[T any](initial *Future[T], stages ...func(T) (T, error)) *Future[T] {
out := NewFuture[T]()
go func() {
v, err := initial.Await(context.Background())
if err != nil { out.Reject(err); return }
for _, fn := range stages {
v, err = fn(v)
if err != nil { out.Reject(err); return }
}
out.Resolve(v)
}()
return out
}
12. Exercise 11: singleflight on every call
A handler wraps every cache read in singleflight.Group.Do. For unique keys (high-cardinality per-user data) dedup never fires — but you still pay the map lookup, mutex, and iface box.
Before:
func GetUser(ctx context.Context, id string) (User, error) {
v, err, _ := g.Do(id, func() (any, error) { return fetchUser(ctx, id) })
return v.(User), err
}
The fetch is ~200 ns when cache-hit; singleflight adds ~580 ns of mutex+map+box.
After
Apply singleflight only on the cache-miss path, not on cache hits.
func GetUser(ctx context.Context, id string) (User, error) {
if u, ok := cache.Get(id); ok { return u, nil }
v, err, _ := g.Do(id, func() (any, error) {
if u, ok := cache.Get(id); ok { return u, nil } // double-check
u, err := fetchUser(ctx, id)
if err == nil { cache.Set(id, u) }
return u, err
})
return v.(User), err
}
13. Exercise 12: errgroup.Wait per request
A handler creates its own errgroup and spawns 50 sub-fetches per request. At 1000 QPS that's 50k goroutines/sec of churn, ~400 MB/s of stack memory churning.
Before:
g, gctx := errgroup.WithContext(ctx)
g.SetLimit(16)
for i, id := range ids {
i, id := i, id
g.Go(func() error { it, err := fetchItem(gctx, id); out[i] = it; return err })
}
return out, g.Wait()
After
Long-lived worker pool consuming a shared queue. Per-request work is submitting jobs and awaiting per-request futures.
type Job struct { id string; out *Future[Item]; ctx context.Context }
type Pool struct{ jobs chan Job }
func NewPool(workers int) *Pool {
p := &Pool{jobs: make(chan Job, workers*4)}
for i := 0; i < workers; i++ {
go func() {
for j := range p.jobs {
if j.ctx.Err() != nil { j.out.Reject(j.ctx.Err()); continue }
it, err := fetchItem(j.ctx, j.id)
if err != nil { j.out.Reject(err) } else { j.out.Resolve(it) }
}
}()
}
return p
}
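func (p *Pool) Submit(ctx context.Context, id string) *Future[Item] {
f := NewFuture[Item]()
select {
case p.jobs <- Job{id: id, out: f, ctx: ctx}:
case <-ctx.Done(): // queue is full and the caller has given up
f.Reject(ctx.Err())
}
return f
}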
14. When NOT to optimize
Future patterns dominate when many small async values are flying around. If your code creates 10 futures per minute, optimizing them is pointless — your time is in the work the futures wrap.
- Background sync that runs once per hour — keep the simplest channel-based Future.
- Test fixtures that fake async — no goroutine, just an already-resolved Future struct.
- Code where each "future" already wraps a 10 ms network call — goroutine overhead is < 0.1% of total cost.
Profile first. Look for time in runtime.chansend, runtime.gopark, runtime.mallocgc, and sync.(*Mutex).Lock — the four signatures of Future overhead.
Common premature optimizations:
- Pooling Future structs (Ex. 8) below 10k Futures/sec.
- `atomic.Pointer` (Ex. 2) when there's only one waiter: `sync.Mutex` matches it with simpler semantics.
- Fan-in await (Ex. 7) for ≤3 futures: sequential is shorter and faster.
- Flattening chains (Ex. 10) when stages are observably independent.
- Worker pool (Ex. 12) when per-request load is uneven.
Correctness gaps disguised as optimizations:
- Removing `sync.Once` from Resolve "because it's only called once", until a retry path calls it twice and panics on a closed channel.
- Buffered channel without a one-shot guard: a second Resolve blocks forever if the first value is still unread, or hands a later reader a different value.
- Pool reuse with active consumers still awaiting — Future mutated under their feet.
- Singleflight across security domains — one tenant's result returned to another's call.
15. Summary
Always-ship wins (apply by default in any new Future code):
- Buffered Future channel cap 1, never unbuffered (Ex. 3).
- Typed generic `Future[T]`, never `Future[any]` (Ex. 5).
- One context deadline at the request boundary (Ex. 4).
- `time.NewTimer` + `Reset` in any awaiting loop (Ex. 9).
- `sync.Once` around Resolve/Reject: non-negotiable correctness.
Wins behind a profile (when measurements justify them):
- `atomic.Pointer[Result]` for lock-free fast path (Ex. 2): when read-after-resolution is hot.
- `errgroup.SetLimit` via Little's Law (Ex. 6): when downstream is the constraint.
- Fan-in await for ≥8 futures (Ex. 7).
- Pool the Future struct (Ex. 8) — at ≥10k Futures/sec.
- Flatten chain combinators (Ex. 10).
- Singleflight only at miss path (Ex. 11).
- Shared worker pool (Ex. 12) — when goroutine spawn dominates per-request CPU.
Specialty (only when the design calls for it):
- `sync.Once`-cached Future for shared expensive results (Ex. 1).
- `reflect.Select` over dynamic Future sets (Ex. 7).
- Reference-counted Future for broadcast.
- Lazy Futures with `sync.Once` for fallback chains.
Future cost is goroutines, channels, and iface boxes. Strip those three by inlining cheap work, buffering one-shot channels, and using generics; then size concurrency to the real downstream. The shape is ~100 lines of Go; the discipline is what makes it production-grade. The Future is rarely where the time goes — but when it is, these are the levers.