Select Statement — Optimization¶
Honest framing first: in most Go services, select is not the bottleneck. The runtime's selectgo is well tuned, the typical for-select loop wakes a few hundred or a few thousand times per second, and the cost of one selection is dominated by what happens inside the chosen case, not by the selection machinery itself. If your profile does not show runtime.selectgo near the top, almost everything below is premature.
What is worth optimizing is the patterns around select: the timer that leaks because you wrote time.After inside a hot loop, the goroutine that busy-spins on default + Sleep, the case list that grows long enough that runtime fairness picks a slow path, the reflect.Select you reached for when a switch on len(cases) would have done. These are the changes that move real numbers — allocations per second, P99 latency, goroutine count, GC pause — and each of them has been a real production fix in real production code.
Each entry below states the problem, shows a "before" version, an "after" version, the realistic gain, and the caveat that prevents you from over-applying it.
Optimization 1 — Replace busy-wait with select+timer¶
Problem: Code that waits for an event by polling in a tight loop with time.Sleep burns CPU, sleeps too long when the event arrives mid-sleep, and adds latency proportional to the sleep granularity. A select on a channel and a timer does the same job with zero polling overhead.
Before:
func waitForReady(ready *atomic.Bool, deadline time.Time) bool {
for time.Now().Before(deadline) {
if ready.Load() {
return true
}
time.Sleep(5 * time.Millisecond)
}
return false
}
After:
func waitForReady(ctx context.Context, signal <-chan struct{}) bool {
select {
case <-signal:
return true
case <-ctx.Done():
return false
}
}
The producer closes signal (or sends once) when the resource is ready. The waiter parks on the channel — zero CPU until the wake-up — and resumes within microseconds. Gain: From ~2.5 ms median latency at non-zero CPU to ~5 µs at zero CPU. Across 10k waiters, GOMAXPROCS load drops by 10–30% in services that previously spent measurable time in the polling loop.
Caveat: You need a real channel from the producer. If the only signal you have is a flag in shared memory you cannot wake from, change the producer first; do not work around it with a polling waiter.
Optimization 2 — Batch operations to reduce select churn¶
Problem: A for-select that pays the full select-and-dispatch cost for each tiny message wastes most of its time in scheduler bookkeeping when the queue is hot. If consumers can process N items at once, you can amortise the wake-up cost.
Before:
for {
    select {
    case ev := <-events:
        write(ev) // one disk write per event
    case <-ctx.Done():
        return
    }
}
With 200k events/sec, that is 200k selectgo calls and 200k disk writes per second.
After:
const batchMax = 64
buf := make([]Event, 0, batchMax)
for {
select {
case ev := <-events:
buf = append(buf, ev)
// drain whatever is already queued, up to batchMax
drain:
for len(buf) < batchMax {
select {
case ev := <-events:
buf = append(buf, ev)
default:
break drain
}
}
writeBatch(buf)
buf = buf[:0]
case <-ctx.Done():
return
}
}
Gain: With bursty traffic and a bulk-capable downstream (disk, syscall, network buffer), throughput typically rises 5–20× and selectgo CPU drops by an order of magnitude.
Caveat: Latency for the first item in a quiet period is unchanged; only the steady-state amortises. If you must guarantee per-event latency, cap the batch by both size and a tight max-wait timer.
Optimization 3 — Use one time.NewTimer with Reset¶
Problem: time.After(d) inside a hot loop allocates a fresh *time.Timer per iteration. Before Go 1.23, that timer could not be garbage-collected until it fired. In a loop that completes in microseconds, you accumulate a Timer per iteration, each living for up to d. Heap usage grows; GC pressure rises; eventually you OOM.
Before:
for {
select {
case msg := <-input:
handle(msg)
case <-time.After(time.Second): // new *Timer every iteration
log.Println("idle")
return
}
}
A memory profile of this loop shows time.NewTimer and the timer goroutine dominating allocation. After:
idle := time.NewTimer(time.Second)
defer idle.Stop()
for {
select {
case msg := <-input:
handle(msg)
if !idle.Stop() {
<-idle.C // drain if it had already fired
}
idle.Reset(time.Second)
case <-idle.C:
log.Println("idle")
return
}
}
Gain: Allocations per second drop to zero for the timer path. In a measured proxy this changed Heap-In-Use from "growing 50 MB/min" to flat. P99 latency dropped because GC ran less often.
Caveat: The Stop/Drain/Reset dance is easy to get subtly wrong. On Go 1.23+ the timer semantics changed (Stop and Reset no longer require manual draining in many cases) — read the release notes for your Go version before copy-pasting old patterns.
Optimization 4 — Reduce the case count¶
Problem: Runtime selectgo cost grows roughly linearly with the number of cases (it sorts and locks each one). A select with 12 cases is measurably slower per-iteration than one with 3. Worse, large case lists are usually a code-smell: they often mean one goroutine is doing several jobs that should each have their own loop.
Before:
for {
select {
case j := <-jobsHigh:
run(j)
case j := <-jobsMid:
run(j)
case j := <-jobsLow:
run(j)
case <-ticker1.C:
flushA()
case <-ticker2.C:
flushB()
case <-ticker3.C:
flushC()
case e := <-eventsA:
recordA(e)
case e := <-eventsB:
recordB(e)
case <-ctx.Done():
return
}
}
After: Split by responsibility. One goroutine handles jobs (3 cases), one handles ticks (1 multi-channel case via a fan-in), one handles events. Each loop has 3–4 cases and a clear name in the stack trace.
go runJobs(ctx, jobsHigh, jobsMid, jobsLow)
go runFlushers(ctx, ticker1.C, ticker2.C, ticker3.C)
go runEvents(ctx, eventsA, eventsB)
Or merge homogeneous channels with a fan-in goroutine so the consumer only sees one channel.
Gain: Per-iteration selectgo cost drops by 2–3× when the case count goes from 9 down to 3. More importantly, each loop becomes independently testable and stoppable.
Caveat: Splitting introduces extra goroutines and channels. If the cases are genuinely interdependent — e.g. a state machine where any event can change which channels matter — keep them in one select and accept the per-iteration cost.
Optimization 5 — Use the nil-channel trick to disable cases dynamically¶
Problem: Some cases should only fire under certain conditions ("send only when buffer non-empty," "accept input only when not paused"). Building this with extra if guards or with multiple distinct selects forces you to allocate or branch on every iteration.
Before:
for {
if paused {
select {
case <-resume:
paused = false
case <-ctx.Done():
return
}
} else {
select {
case in := <-input:
buf = append(buf, in)
case <-pause:
paused = true
case <-ctx.Done():
return
}
}
}
After:
for {
var inCh <-chan Item = input
if paused {
inCh = nil // disable input case
}
select {
case in := <-inCh: // never fires when nil
buf = append(buf, in)
case <-pause:
paused = true
case <-resume:
paused = false
case <-ctx.Done():
return
}
}
Gain: Dead-code elimination at the language level. Fewer branches, a single select layout, easier to instrument.
Caveat: Confusing for readers who do not know the nil-channel rule. Add a one-line comment the first time it appears in a file.
Optimization 6 — Avoid select for trivial single-case usage¶
Problem: A select with exactly one channel case (and no default) is identical in behaviour to a plain receive — but pays the selectgo overhead. People reach for select reflexively when they only have one thing to wait on, especially after refactoring.
Before:
for {
    select { // one case, no default — identical to a plain receive
    case msg := <-input:
        process(msg)
    }
}
Same as for msg := range input { process(msg) } but slower and harder to read. After:
for msg := range input {
    process(msg)
}
For a single channel with cancellation, do not write a one-case select either — write the two cases the situation actually requires:
for {
    select {
    case msg, ok := <-input:
        if !ok {
            return
        }
        process(msg)
    case <-ctx.Done():
        return
    }
}
Gain: A few nanoseconds per iteration and meaningful readability. In tight loops processing millions of items, the cumulative saving is measurable.
Caveat: As soon as you need a second case (a timeout, a cancel, a heartbeat) you need select again. Don't trade away cancellation just to remove the select.
Optimization 7 — Replace select-with-default polling on ctx.Done()¶
Problem: Some code peeks at cancellation by polling: select { case <-ctx.Done(): return; default: }. This is correct but wasteful — the goroutine returns to a CPU-bound loop and only checks cancellation between iterations. If a unit of work is long, cancellation is delayed; if it is short, every iteration pays the polling cost.
Before:
for _, item := range hugeSlice {
select {
case <-ctx.Done():
return ctx.Err()
default:
}
expensive(item)
}
Cancellation is checked only between iterations, never during the expensive(item) call. Per-iteration cost is the selectgo overhead even though no cancellation is happening 99.9% of the time. After: If the work itself is channel-driven, do the cancellation in the same select that pulls work:
for {
select {
case item, ok := <-items:
if !ok {
return nil
}
expensive(item)
case <-ctx.Done():
return ctx.Err()
}
}
If the loop is over a slice, hand the cancellation check to a worker pool entry-point and only check it once per batch:
for batchStart := 0; batchStart < len(hugeSlice); batchStart += batchSize {
    select {
    case <-ctx.Done():
        return ctx.Err()
    default:
    }
    end := min(batchStart+batchSize, len(hugeSlice)) // don't slice past the end
    for _, item := range hugeSlice[batchStart:end] {
        expensive(item)
    }
}
Gain: Per-iteration overhead is amortised across batchSize items. Cancellation latency stays bounded by one batch.
Caveat: Polling with default is fine in code that is not on the hot path. Don't refactor a one-shot startup routine to save 10 ns of cancel-check overhead.
Optimization 8 — Coalesce timeouts: share one timer across multiple selects¶
Problem: A function that wraps several blocking operations in a row, each with its own select { ... case <-time.After(d): }, allocates a fresh timer at each step and pays for the bookkeeping each time, even though the wall-clock budget is shared.
Before:
func three(ctx context.Context, a, b, c <-chan int) (int, int, int, error) {
var x, y, z int
select {
case x = <-a:
case <-time.After(timeout):
return 0, 0, 0, errTimeout
}
select {
case y = <-b:
case <-time.After(timeout):
return 0, 0, 0, errTimeout
}
select {
case z = <-c:
case <-time.After(timeout):
return 0, 0, 0, errTimeout
}
return x, y, z, nil
}
After: Derive one deadline-bound context and let every step observe it:
func three(ctx context.Context, a, b, c <-chan int) (int, int, int, error) {
ctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
var x, y, z int
select {
case x = <-a:
case <-ctx.Done():
return 0, 0, 0, ctx.Err()
}
select {
case y = <-b:
case <-ctx.Done():
return 0, 0, 0, ctx.Err()
}
select {
case z = <-c:
case <-ctx.Done():
return 0, 0, 0, ctx.Err()
}
return x, y, z, nil
}
The derived context (from WithTimeout) covers all three selects, and the whole call has one shared budget. Gain: One *time.Timer instead of three; total wall-clock cap correct ("the whole call should finish in timeout," not "each step gets its own timeout"). For a deeply-pipelined function this saves real allocations and matches the user's actual expectation.
Caveat: Sometimes the per-step timeout is what you want — e.g. a retry loop where each attempt has its own clock. Pick the model that matches the contract you intend to expose.
Optimization 9 — Avoid select with default in a hot loop¶
Problem: A select { case ...: ; default: } in a tight loop is a busy-spin in disguise. The goroutine never blocks; it polls, finds nothing, runs default, polls again. CPU goes to 100% on one core even when no work is happening.
Before:
for {
    select {
    case j := <-jobs:
        run(j)
    default:
        // nothing ready — loop again immediately
    }
}
Spin-locked at full CPU. Adds noise to GC pause measurements, distorts scheduler decisions, and is invisible in metrics that average over windows larger than a millisecond.
After: Drop the default; let the goroutine block.
for {
    select {
    case j := <-jobs:
        run(j)
    case <-ctx.Done():
        return
    }
}
If you really need to do something while nothing is on jobs, drive that something from a timer or a different channel — not from default.
Gain: From 100% CPU to ~0% CPU when idle. Across a fleet, this is the difference between "we run at 60% utilisation" and "we run at 25% utilisation" — a real cost-of-goods change.
Caveat: A default case is correct in three places: non-blocking enqueue (drop on full), bounded drain after a blocking primary case (Optimization 2), and one-shot peeks. Outside those, treat it as a smell.
Optimization 10 — Pre-resolve channels outside the hot loop¶
Problem: Re-evaluating a channel expression on every iteration — by calling a method, dereferencing a pointer, or walking a struct — pays the cost on every iteration even though the answer never changes. The compiler cannot always hoist this, especially across interface boundaries.
Before:
type Bus struct{ subs []chan Event }
func (b *Bus) Run(ctx context.Context) {
for {
select {
case e := <-b.input.Events(): // method call, possibly through an interface
for i := range b.subs {
b.subs[i] <- e
}
case <-ctx.Done():
return
}
}
}
b.input.Events() afresh, possibly through a vtable. After:
func (b *Bus) Run(ctx context.Context) {
in := b.input.Events() // one method call, hoisted
subs := b.subs // local slice header
done := ctx.Done() // one method call, hoisted
for {
select {
case e := <-in:
for i := range subs {
subs[i] <- e
}
case <-done:
return
}
}
}
Gain: A few ns per iteration become free; in micro-benchmarks of busy loops this is sometimes 5–10% throughput. The bigger win is cleaner asm output and fewer surprises when the compiler decides not to inline.
Caveat: Only hoist if the channel really is invariant. If the source can change mid-loop (e.g. you swap the upstream when the connection reconnects), keep the method call inside.
Optimization 11 — runtime.LockOSThread + select for latency-critical paths (advanced)¶
Problem: The Go scheduler can move a goroutine between OS threads between iterations of a select loop. For most code this is invisible, but for sub-microsecond latency paths (a market-data tap, a kernel-bypass network interface, a real-time audio pipeline) the cost of being rescheduled onto a cold core (cache miss, branch-predictor reset, NUMA hop) shows up as occasional latency spikes in P99.999.
Before:
The same receive loop run as an ordinary goroutine, with no thread pinning. Median latency is fine; tail latency shows tens-of-µs spikes whenever the goroutine migrates.
After:
go func() {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
for {
select {
case pkt := <-pkts:
handle(pkt)
case <-stop:
return
}
}
}()
The goroutine is pinned to one OS thread. If the host has CPU isolation enabled (isolcpus=, cset shield, taskset), pin that thread to an isolated core via unix.SchedSetaffinity after LockOSThread.
Gain: Typical P99.999 latency improvements of 2–10× in low-latency networking loops. Cache-resident state stays warm; no scheduler-induced jitter.
Caveat: This is a heavy hammer. It removes a goroutine from the common scheduling pool, can starve sibling goroutines if they relied on that thread, and only helps if you can also stop the OS from scheduling other work onto the same core. Do not reach for it without a profile that shows scheduler-induced tail-latency.
Optimization 12 — reflect.Select is slow; avoid it for fixed-N cases¶
Problem: reflect.Select exists for selects whose case set is known only at runtime (e.g. a router that selects across an arbitrary slice of channels). It is dramatically slower than a literal select — typical numbers are 10–100× — because each call constructs reflect.Values, copies the case array, and dispatches through reflection.
Before:
func first(chs []<-chan int) (int, int) {
cases := make([]reflect.SelectCase, len(chs))
for i, c := range chs {
cases[i] = reflect.SelectCase{
Dir: reflect.SelectRecv,
Chan: reflect.ValueOf(c),
}
}
chosen, recv, _ := reflect.Select(cases)
return chosen, int(recv.Int())
}
After (fixed N): write the literal select:
func first2(a, b <-chan int) (int, int) {
select {
case v := <-a:
return 0, v
case v := <-b:
return 1, v
}
}
After (variable N): if N is bounded but small (say, at most 8), use a switch ladder over len(chs):
switch len(chs) {
case 1:
return 0, <-chs[0]
case 2:
select {
case v := <-chs[0]:
return 0, v
case v := <-chs[1]:
return 1, v
}
// ... up to the bound
}
After (truly variable N): fan-in goroutines write into a single merged channel, then the consumer does a 1-channel receive:
func merge(chs []<-chan int) <-chan int {
out := make(chan int, 64)
var wg sync.WaitGroup
wg.Add(len(chs))
for _, c := range chs {
go func(c <-chan int) {
defer wg.Done()
for v := range c {
out <- v
}
}(c)
}
go func() { wg.Wait(); close(out) }()
return out
}
Gain: Replacing reflect.Select with a fan-in pattern is typically 10–100× throughput improvement and removes per-iteration allocation entirely.
Caveat: If you genuinely have an unbounded set of sources that come and go at runtime (e.g. a chat-room server with N sessions), a fan-in goroutine per source plus a merged channel is the right architecture; do not try to keep reflect.Select "for flexibility."
Benchmarking and Measurement¶
Optimisation without measurement is folklore. For select-heavy code, the most useful signals are per-pattern microbenchmarks and CPU/heap profiles:
// Microbenchmarks per pattern
func BenchmarkSelectTwo(b *testing.B) {
a := make(chan int, 1)
c := make(chan int, 1)
a <- 1
for i := 0; i < b.N; i++ {
select {
case v := <-a:
a <- v
case v := <-c:
c <- v
}
}
}
func BenchmarkReflectSelectTwo(b *testing.B) {
a := make(chan int, 1)
c := make(chan int, 1)
a <- 1
cases := []reflect.SelectCase{
{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(a)},
{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(c)},
}
for i := 0; i < b.N; i++ {
chosen, v, _ := reflect.Select(cases)
if chosen == 0 {
a <- int(v.Int())
} else {
c <- int(v.Int())
}
}
}
Run with:
go test -bench=. -benchmem -benchtime=2s ./...
go test -bench=. -cpuprofile=cpu.out -memprofile=mem.out
go tool pprof -http :8080 cpu.out
Useful observations:
- runtime.selectgo near the top of a CPU profile means you have too many cases or too many iterations per second; consider Optimizations 4, 6, or 9.
- A growing time.NewTimer allocation column means Optimization 3 applies.
- Goroutines stuck on <-time.After(...) in a goroutine dump means a leak from a tight time.After loop.
- Per-iteration allocations > 0 in benchmem for a select-only benchmark almost always traces back to a time.After or a reflect.Select.
For end-to-end signals, watch P99 latency, allocation rate (runtime/metrics), and goroutine count over time. If a "fix" does not move them measurably, it was not a fix.
When NOT to Optimize¶
- The select runs once per request. A web handler that does one select per HTTP request is not the bottleneck — the database call inside the case is. Profile the case body, not the select.
- The case body dwarfs the select. If the chosen case takes a millisecond, the 100-nanosecond selectgo cost is invisible. Optimise the body first.
- You are pre-1.0 and the API is changing weekly. Stabilise the shape before micro-tuning the implementation; you will rewrite the hot loop anyway.
- The "optimised" version is unreadable. A nil-channel toggle, a Stop/Drain/Reset dance, and an OS-thread pin in one function is a maintenance liability. Spend cleverness only where the profile demands it.
- Cancellation latency is the metric, not throughput. Then add the cancel case, do not remove it. Optimization 7 is about polling; do not interpret it as "remove ctx.Done() from every loop."
- You have not run a profile. "It feels slow" is not a profile. Run pprof, go test -bench, runtime/metrics, or trace first. Then optimise.
Summary¶
select is one of the cheapest concurrency primitives Go provides — the runtime has had well over a decade to tune it — and most of the time the right answer is "leave it alone." When the profile says otherwise, the wins come not from rewriting selectgo but from the patterns around it: kill busy-waits with a real channel signal; batch when downstream allows; reuse one timer instead of leaking one per iteration; keep case counts small; turn cases off with nil channels instead of restructuring; do not wrap a single channel in a select; share one timer across a multi-step call; never spin with default in a hot loop; hoist invariant channels out of the loop; reach for LockOSThread only when scheduler jitter is provable; and avoid reflect.Select for any fixed-N case set.
Each entry above is small in code but large in cumulative effect. Apply them where measurement asks for them, leave the rest of your selects alone, and you will spend your optimisation budget where it actually moves numbers — on the work inside the case body, not on the select that picked it.