Premature Concurrency Optimization — Find the Bug¶
Twelve code snippets where concurrency is the bug — either making the code slower than the sequential version, increasing latency, or hiding correctness issues. For each, identify the bug, explain why it's a bug, and propose a fix.
Each snippet is presented first, followed by the analysis. Try to spot the bug before reading the analysis.
Snippet 1: parallel sum, sequential effect¶
func parallelSum(xs []int) int {
    var mu sync.Mutex
    var sum int
    var wg sync.WaitGroup
    for _, x := range xs {
        x := x
        wg.Add(1)
        go func() {
            defer wg.Done()
            mu.Lock()
            sum += x
            mu.Unlock()
        }()
    }
    wg.Wait()
    return sum
}
Bug¶
The mutex serialises all the goroutines. They're all waiting for the same lock to update one shared counter. The "parallel" version runs effectively sequentially, plus the overhead of goroutine spawn, mutex contention, and synchronisation.
Why it's a bug¶
- Spawn overhead per item: ~1 µs.
- Mutex contention: each item parks/wakes.
- For typical inputs, this is 100-1000× slower than the simple
for _, x := range xs { sum += x }.
Fix¶
Just use a for loop. Or, if xs is huge, partition: each goroutine sums its chunk into a local variable, and a final merge adds them. No mutex needed in the hot path.
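The partitioned version the fix describes can be sketched like this (chunkedSum and the chunk count are invented for illustration); each goroutine writes only its own slot, so there is no lock on the hot path:

```go
package main

import (
    "fmt"
    "sync"
)

// chunkedSum splits xs into nChunks contiguous ranges; each goroutine
// sums its own range into a private slot, and a final loop merges them.
func chunkedSum(xs []int, nChunks int) int {
    if nChunks < 1 || len(xs) < nChunks {
        nChunks = 1
    }
    partial := make([]int, nChunks)
    size := (len(xs) + nChunks - 1) / nChunks
    var wg sync.WaitGroup
    for c := 0; c < nChunks; c++ {
        c := c
        wg.Add(1)
        go func() {
            defer wg.Done()
            lo := c * size
            hi := lo + size
            if hi > len(xs) {
                hi = len(xs)
            }
            s := 0
            for _, x := range xs[lo:hi] {
                s += x
            }
            partial[c] = s // one write per goroutine, distinct index
        }()
    }
    wg.Wait()
    total := 0
    for _, s := range partial {
        total += s
    }
    return total
}

func main() {
    xs := make([]int, 1000)
    for i := range xs {
        xs[i] = i + 1
    }
    fmt.Println(chunkedSum(xs, 4)) // 500500
}
```

Only bother with this when len(xs) is large enough for the per-goroutine spawn cost to be amortised; benchmark before committing.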
Snippet 2: channel coordination on hot path¶
func processStream(input []Item) {
    work := make(chan Item)
    done := make(chan struct{})
    go func() {
        for item := range work {
            doTiny(item) // ~50 ns of work
        }
        close(done)
    }()
    for _, item := range input {
        work <- item
    }
    close(work)
    <-done
}
Bug¶
A channel send/recv costs ~50-250 ns. The work itself is ~50 ns. The channel coordination is 2-5× the cost of the actual work. The "concurrent" version is slower than just for _, item := range input { doTiny(item) }.
Fix¶
Don't use a channel for tiny work. Just call directly. If you must distribute work, batch it — send slices of 100 items, not single items.
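If distribution really is needed, the batched variant looks roughly like this (a sketch: int items and a doTiny callback stand in for the original's Item type). One channel operation now amortises over batchSize calls:

```go
package main

import "fmt"

// processStreamBatched sends slices instead of single items, so one
// channel send/recv pays for batchSize units of work instead of one.
func processStreamBatched(input []int, batchSize int, doTiny func(int)) {
    work := make(chan []int)
    done := make(chan struct{})
    go func() {
        for batch := range work {
            for _, item := range batch {
                doTiny(item)
            }
        }
        close(done)
    }()
    for lo := 0; lo < len(input); lo += batchSize {
        hi := lo + batchSize
        if hi > len(input) {
            hi = len(input)
        }
        work <- input[lo:hi]
    }
    close(work)
    <-done // the close(done) above gives us a happens-before edge
}

func main() {
    sum := 0
    processStreamBatched([]int{1, 2, 3, 4, 5}, 2, func(x int) { sum += x })
    fmt.Println(sum) // 15
}
```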
Snippet 3: per-item goroutine for fast work¶
func fastWork(items []int) []int {
    out := make([]int, len(items))
    var wg sync.WaitGroup
    for i, x := range items {
        i, x := i, x
        wg.Add(1)
        go func() {
            defer wg.Done()
            out[i] = x * 2 // trivial
        }()
    }
    wg.Wait()
    return out
}
Bug¶
Spawning a goroutine costs ~1 µs. The work is one multiplication (~1 ns). For 10,000 items, you spawn 10,000 goroutines costing ~10 ms — for work that should take ~10 µs.
Fix¶
Either use a simple for loop (almost certainly faster), or use a bounded worker pool with chunked work.
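A chunked version of the same function might look like this (a sketch; the worker count and names are invented): one goroutine per CPU, each owning a contiguous range of the output slice, so no two goroutines touch the same index:

```go
package main

import (
    "fmt"
    "runtime"
    "sync"
)

// fastWorkChunked gives each worker a contiguous slice range;
// spawn cost is paid once per worker, not once per item.
func fastWorkChunked(items []int) []int {
    out := make([]int, len(items))
    workers := runtime.GOMAXPROCS(0)
    if workers > len(items) {
        workers = 1 // tiny input: one goroutine is plenty
    }
    size := (len(items) + workers - 1) / workers
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        lo := w * size
        if lo >= len(items) {
            break
        }
        hi := lo + size
        if hi > len(items) {
            hi = len(items)
        }
        wg.Add(1)
        go func(lo, hi int) {
            defer wg.Done()
            for i := lo; i < hi; i++ {
                out[i] = items[i] * 2
            }
        }(lo, hi)
    }
    wg.Wait()
    return out
}

func main() {
    fmt.Println(fastWorkChunked([]int{1, 2, 3})) // [2 4 6]
}
```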
Snippet 4: false sharing in a counter pool¶
type Counters struct {
    Requests, Errors, BytesIn, BytesOut int64
}

func (c *Counters) AddRequest() {
    atomic.AddInt64(&c.Requests, 1)
}

func (c *Counters) AddError() {
    atomic.AddInt64(&c.Errors, 1)
}

// ... etc
Bug¶
All four counters fit in one 64-byte cache line. Multiple goroutines incrementing different counters will bounce the cache line between cores. Each atomic.AddInt64 becomes 10-30× slower than its uncontended cost.
Fix¶
Pad each counter to its own cache line:
type Counters struct {
    Requests int64
    _        [56]byte
    Errors   int64
    _        [56]byte
    BytesIn  int64
    _        [56]byte
    BytesOut int64
    _        [56]byte
}
Snippet 5: huge buffered channel hides overload¶
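The code for this snippet is missing from the text. What follows is a plausible reconstruction of the shape the analysis below describes; the Event type, names, and buffer size are invented:

```go
package main

import "fmt"

type Event struct{ ID int }

// runPipeline reconstructs the likely bug: a queue sized "big enough
// that the producer never blocks". That is exactly the problem, since
// the producer never feels backpressure when the consumer falls behind.
func runPipeline(events []Event, process func(Event)) {
    queue := make(chan Event, 1_000_000) // effectively unbounded
    done := make(chan struct{})
    go func() {
        for e := range queue {
            process(e) // consumer may be arbitrarily far behind
        }
        close(done)
    }()
    for _, e := range events {
        queue <- e // never blocks, so overload is invisible
    }
    close(queue)
    <-done
}

func main() {
    n := 0
    runPipeline([]Event{{1}, {2}, {3}}, func(Event) { n++ })
    fmt.Println(n) // 3
}
```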
Bug¶
The buffer is so large that the consumer can fall arbitrarily behind without backpressure. Latency for items in the queue can be seconds. Memory grows. The system appears healthy until OOM.
Fix¶
Use a small buffer (16, 64, 256). When the consumer slows, the producer blocks, which provides backpressure. If you want to drop instead of block, use a non-blocking send with a fallback.
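The drop-instead-of-block variant is a non-blocking send with a default branch. A minimal sketch (trySend is an invented helper):

```go
package main

import "fmt"

// trySend attempts a send without blocking; the default branch fires
// when the buffer is full, making overload visible instead of hidden.
func trySend(ch chan int, v int) bool {
    select {
    case ch <- v:
        return true
    default:
        return false // queue full: drop, and ideally bump a metric
    }
}

func main() {
    ch := make(chan int, 2) // small buffer: overload shows up fast
    sent, dropped := 0, 0
    for i := 0; i < 5; i++ {
        if trySend(ch, i) {
            sent++
        } else {
            dropped++
        }
    }
    fmt.Println(sent, dropped) // 2 3
}
```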
Snippet 6: sync.RWMutex hurting reader path¶
type Cache struct {
    mu sync.RWMutex
    m  map[string]string
}

func (c *Cache) Get(k string) string {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.m[k] // 20 ns of work
}
Bug¶
sync.RWMutex.RLock is ~50 ns; the critical section is ~20 ns. The reader spends more time on the lock than on the work. sync.RWMutex only pays off if critical sections are long enough to benefit from reader parallelism.
Fix¶
Use sync.Mutex (uncontended Lock is ~10 ns). Or, if writes are very rare, use copy-on-write with atomic.Pointer[map[string]string].
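The copy-on-write option can be sketched like this (names invented; requires Go 1.19+ for atomic.Pointer). Readers do a single atomic load with no lock; writers copy the whole map under a mutex and swap the pointer, which only pays off when writes are rare:

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

type cowCache struct {
    m  atomic.Pointer[map[string]string]
    mu sync.Mutex // serialises writers only
}

func newCowCache() *cowCache {
    c := &cowCache{}
    empty := map[string]string{}
    c.m.Store(&empty)
    return c
}

// Get is lock-free: one atomic pointer load, then a plain map read
// of an immutable snapshot.
func (c *cowCache) Get(k string) string {
    return (*c.m.Load())[k]
}

// Set copies the current map, applies the write, and publishes the
// new snapshot atomically. O(n) per write: only for rare writes.
func (c *cowCache) Set(k, v string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    old := *c.m.Load()
    next := make(map[string]string, len(old)+1)
    for ok, ov := range old {
        next[ok] = ov
    }
    next[k] = v
    c.m.Store(&next)
}

func main() {
    c := newCowCache()
    c.Set("a", "1")
    fmt.Println(c.Get("a")) // 1
}
```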
Snippet 7: sync.Pool for small objects¶
var pool = sync.Pool{
    New: func() interface{} { return make([]byte, 16) },
}

func process(x int) []byte {
    b := pool.Get().([]byte)
    defer pool.Put(b)
    binary.LittleEndian.PutUint64(b[:8], uint64(x))
    return b
}
Bug¶
There are two issues: (1) the buffer is 16 bytes — allocating it is ~30 ns, pool overhead is ~30 ns, no win. (2) Put returns the buffer while the caller still holds it; the next Get returns the same buffer, overwriting the caller's data. This is a correctness bug masquerading as an optimization.
Fix¶
For small buffers, just allocate. If you must reuse, ensure the buffer is no longer referenced before Put.
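For reference, the straightforward version the fix recommends: allocate a fresh slice, which for 8 bytes is cheaper than a pool round trip and cannot be overwritten behind the caller's back:

```go
package main

import (
    "encoding/binary"
    "fmt"
)

// process allocates a fresh, short-lived buffer per call. The caller
// owns the returned slice outright; no reuse, no race.
func process(x int) []byte {
    b := make([]byte, 8)
    binary.LittleEndian.PutUint64(b, uint64(x))
    return b
}

func main() {
    fmt.Println(process(42)) // [42 0 0 0 0 0 0 0]
}
```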
Snippet 8: goroutine leak from a forgotten close¶
func handler(w http.ResponseWriter, r *http.Request) {
    work := make(chan int)
    go func() {
        for x := range work {
            process(x)
        }
    }()
    // ... fill work
    // ... return without close(work)
}
Bug¶
The goroutine never exits because the channel is never closed. Every request leaks one goroutine. Eventually, OOM.
Fix¶
defer close(work) in the handler, or ensure all return paths close the channel.
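The corrected shape, as a sketch with invented names (handleWork stands in for the handler body): the channel is closed on every return path, and the worker is waited for so it cannot outlive the call:

```go
package main

import "fmt"

// handleWork closes the work channel via defer, so every return path
// releases the worker; waiting on done guarantees it has exited.
func handleWork(xs []int, process func(int)) {
    work := make(chan int)
    done := make(chan struct{})
    go func() {
        defer close(done)
        for x := range work {
            process(x)
        }
    }()
    defer func() {
        close(work) // runs on every return path
        <-done      // wait for the worker to drain and exit
    }()
    for _, x := range xs {
        work <- x
    }
}

func main() {
    sum := 0
    handleWork([]int{1, 2, 3}, func(x int) { sum += x })
    fmt.Println(sum) // 6
}
```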
Snippet 9: context not propagated, work continues after return¶
func handler(w http.ResponseWriter, r *http.Request) {
    var data []int
    go func() {
        data = expensiveFetch(context.Background()) // wrong!
    }()
    // ... handler returns
}
Bug¶
Two bugs: (1) the goroutine writes data without any synchronisation, so any read of it by the handler would be a data race; as written, the handler returns immediately and the result is simply discarded. (2) The goroutine uses Background instead of r.Context(), so if the client disconnects, the goroutine still runs to completion — wasting work and potentially expensive resources.
Fix¶
Use r.Context() for cancellation. Use errgroup.WithContext or wait properly for the goroutine before returning.
Snippet 10: parallel processing of dependent items¶
type Account struct{ Balance int }

func transfer(accounts []*Account, transfers []Transfer) {
    var wg sync.WaitGroup
    for _, t := range transfers {
        t := t
        wg.Add(1)
        go func() {
            defer wg.Done()
            accounts[t.From].Balance -= t.Amount
            accounts[t.To].Balance += t.Amount
        }()
    }
    wg.Wait()
}
Bug¶
The transfers are dependent (they update shared state). The "parallel" version has races — two goroutines updating the same account simultaneously give incorrect totals. This is a correctness bug, but it's also a performance issue because adding a mutex to fix it serialises everything, making parallel no faster than sequential.
Fix¶
Process transfers sequentially. If parallelism is essential, partition by account (each shard sequential within itself).
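Sharding by account is awkward here because a transfer touches two accounts. A classic alternative, not spelled out in the text, is a per-account mutex acquired in a fixed (index) order, so there is no deadlock: conflicting transfers serialise on the account locks while disjoint ones run in parallel. A sketch with invented field names:

```go
package main

import (
    "fmt"
    "sync"
)

type Account struct {
    mu      sync.Mutex
    Balance int
}

type Transfer struct{ From, To, Amount int }

// applyParallel locks the two touched accounts in ascending index
// order (fixed lock order => no deadlock), then applies the transfer.
func applyParallel(accounts []*Account, transfers []Transfer) {
    var wg sync.WaitGroup
    for _, t := range transfers {
        t := t
        wg.Add(1)
        go func() {
            defer wg.Done()
            a, b := t.From, t.To
            if a > b {
                a, b = b, a
            }
            accounts[a].mu.Lock()
            if b != a {
                accounts[b].mu.Lock()
            }
            accounts[t.From].Balance -= t.Amount
            accounts[t.To].Balance += t.Amount
            if b != a {
                accounts[b].mu.Unlock()
            }
            accounts[a].mu.Unlock()
        }()
    }
    wg.Wait()
}

func main() {
    accts := []*Account{{Balance: 100}, {Balance: 100}, {Balance: 100}}
    applyParallel(accts, []Transfer{{0, 1, 10}, {1, 2, 5}, {0, 2, 20}})
    fmt.Println(accts[0].Balance, accts[1].Balance, accts[2].Balance) // 70 105 125
}
```

Note this only helps if transfers mostly touch different accounts; under heavy contention on a few hot accounts it degrades to sequential, which is the article's point.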
Snippet 11: time.After in tight loop¶
func waitForChan(ch chan int) {
    for {
        select {
        case <-ch:
            // process
        case <-time.After(time.Second):
            log.Println("waited a second")
        }
    }
}
Bug¶
Each iteration creates a new time.After timer. If ch fires frequently, you accumulate timers — each held until it fires 1 second later. Memory grows.
Fix¶
Use a single time.NewTimer and Reset it once per iteration. Since Go 1.23, Stop and Reset guarantee no stale value is delivered afterwards, so no drain dance is needed:

timer := time.NewTimer(time.Second)
defer timer.Stop()
for {
    select {
    case <-ch:
        // process
    case <-timer.C:
        log.Println("waited a second")
    }
    timer.Reset(time.Second)
}

(On Go versions before 1.23, pair the Reset with the classic if !timer.Stop() { <-timer.C } drain in the ch case; otherwise a stale fire buffered in timer.C can produce one spurious timeout.)
Snippet 12: unbounded goroutine spawn per request¶
func handler(w http.ResponseWriter, r *http.Request) {
    var ids []int
    json.NewDecoder(r.Body).Decode(&ids)
    var wg sync.WaitGroup
    for _, id := range ids {
        id := id
        wg.Add(1)
        go func() {
            defer wg.Done()
            backendCall(id)
        }()
    }
    wg.Wait()
}
Bug¶
The number of goroutines is len(ids) — unbounded by the caller. A malicious or buggy client can send a giant array, spawning millions of goroutines. Two consequences: (1) memory explosion, (2) the backend gets a thundering herd that may DDoS it.
Fix¶
Bound the fan-out:
g, ctx := errgroup.WithContext(r.Context())
g.SetLimit(8)
for _, id := range ids {
    id := id
    g.Go(func() error { return backendCall(ctx, id) })
}
if err := g.Wait(); err != nil {
    http.Error(w, "backend error", http.StatusBadGateway)
    return
}
Also validate len(ids) against a sane maximum at the API boundary.
Summary¶
The 12 bugs covered:
- Parallel sum with mutex (sequential effect).
- Channel for tiny work (overhead dominates).
- Goroutine per fast item (spawn cost).
- False sharing (cache-line bouncing).
- Huge buffered channel (no backpressure).
- RWMutex for short reads (reader overhead).
- sync.Pool for tiny objects (overhead + race).
- Goroutine leak (channel never closed).
- Background context (no cancellation).
- Parallel dependent items (race).
- time.After in loop (timer leak).
- Unbounded fan-out (memory + DDoS).
All twelve are pieces of concurrent code that the author believed would be faster, more flexible, or safer — but each is in fact slower, leakier, or buggier than a simpler approach.
The lesson: every go, every chan, every mu is a commitment. Make sure it pays off.