Premature Concurrency Optimization — Optimize¶
Ten scenarios where the right optimization is to remove concurrency, simplify to sequential code, or batch instead of parallelising. For each: the original (over-concurrent) code, the diagnosis, the simplified version, and the expected outcome.
Scenario 1: parallel sum, simplified to sequential¶
Original¶
func sumParallel(xs []int) int {
    workers := 8
    chunkSize := (len(xs) + workers - 1) / workers
    partial := make([]int, workers)
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        w := w
        start := w * chunkSize
        end := start + chunkSize
        if end > len(xs) {
            end = len(xs)
        }
        wg.Add(1)
        go func() {
            defer wg.Done()
            for _, x := range xs[start:end] {
                partial[w] += x
            }
        }()
    }
    wg.Wait()
    sum := 0
    for _, p := range partial {
        sum += p
    }
    return sum
}
Diagnosis¶
For len(xs) < 1_000_000, the parallel version is slower than a plain for _, x := range xs { sum += x } loop, because goroutine setup and the final merge cost more than the parallelism saves.
Simplified¶
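A plain sequential loop, as named in the diagnosis (the function name sum is illustrative):

func sum(xs []int) int {
    total := 0
    for _, x := range xs {
        total += x
    }
    return total
}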
Outcome¶
Benchmark on 8 cores, len(xs) = 10_000:
- Parallel: 12 µs.
- Sequential: 2.5 µs.
5× faster. Less code. No bugs to write.
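A benchmark of this shape (using the sumParallel and sum functions above; exact numbers depend on hardware) is enough to reproduce the comparison:

func BenchmarkSum(b *testing.B) {
    xs := make([]int, 10_000)
    for i := range xs {
        xs[i] = i
    }
    b.Run("parallel", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            sumParallel(xs)
        }
    })
    b.Run("sequential", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            sum(xs)
        }
    })
}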
Scenario 2: channel-based aggregator, simplified to indexed slice¶
Original¶
func processItems(items []Item) []Result {
    out := make(chan struct {
        i int
        r Result
    }, len(items))
    for i, item := range items {
        i, item := i, item
        go func() {
            out <- struct {
                i int
                r Result
            }{i, process(item)}
        }()
    }
    results := make([]Result, len(items))
    for range items {
        x := <-out
        results[x.i] = x.r
    }
    return results
}
Diagnosis¶
The channel-and-aggregator pattern uses a goroutine per item and 2 channel ops per item. For small per-item work, this is expensive overhead.
Simplified¶
func processItems(items []Item) []Result {
    results := make([]Result, len(items))
    workers := runtime.GOMAXPROCS(0)
    var idx atomic.Int64
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for {
                i := idx.Add(1) - 1
                if int(i) >= len(items) {
                    return
                }
                results[i] = process(items[i])
            }
        }()
    }
    wg.Wait()
    return results
}
Outcome¶
Workers pull indices from an atomic counter and write directly to results[i]. No channels. 2-5× faster for typical workloads, with simpler code.
Scenario 3: per-item goroutine, simplified to batched¶
Original¶
func handleEvents(events []Event) {
    var wg sync.WaitGroup
    for _, ev := range events {
        ev := ev
        wg.Add(1)
        go func() {
            defer wg.Done()
            doWork(ev) // ~500 ns
        }()
    }
    wg.Wait()
}
Diagnosis¶
Spawn cost (~1 µs) dominates the work (~500 ns). 100,000 events = 100,000 goroutines = 100 ms of spawn cost.
Simplified¶
func handleEvents(events []Event) {
    workers := runtime.GOMAXPROCS(0)
    chunkSize := (len(events) + workers - 1) / workers
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        start := w * chunkSize
        end := start + chunkSize
        if end > len(events) {
            end = len(events)
        }
        if start >= end {
            continue
        }
        wg.Add(1)
        go func(chunk []Event) {
            defer wg.Done()
            for _, ev := range chunk {
                doWork(ev)
            }
        }(events[start:end])
    }
    wg.Wait()
}
Outcome¶
Goroutine count drops from len(events) to workers. Spawn cost amortised. 10-100× faster for short per-item work.
Scenario 4: lock-free queue, simplified to mutex-protected slice¶
Original¶
// A custom lock-free queue with atomic head/tail pointers,
// hazard pointers, epoch-based reclamation, ~300 lines of code.
type LockFreeQueue struct { /* ... */ }
Diagnosis¶
For most workloads (< 10M ops/sec), a sync.Mutex + slice is faster, simpler, and correct. The lock-free version is harder to verify, harder to maintain, and rarely outperforms.
Simplified¶
type Queue struct {
    mu    sync.Mutex
    items []Item
}
func (q *Queue) Push(item Item) {
    q.mu.Lock()
    q.items = append(q.items, item)
    q.mu.Unlock()
}
func (q *Queue) Pop() (Item, bool) {
    q.mu.Lock()
    defer q.mu.Unlock()
    if len(q.items) == 0 {
        return Item{}, false
    }
    item := q.items[0]
    q.items = q.items[1:]
    return item, true
}
Outcome¶
300 lines down to 20. Under typical contention, benchmarks show the mutex version is faster than the lock-free one, and its correctness is easy to verify.
Scenario 5: sharded map, simplified to single map¶
Original¶
type ShardedMap[K comparable, V any] struct {
    shards [32]struct {
        mu sync.Mutex
        m  map[K]V
    }
}
func (s *ShardedMap[K, V]) hash(k K) int { /* ... */ }
func (s *ShardedMap[K, V]) Get(k K) (V, bool) { /* ... */ }
func (s *ShardedMap[K, V]) Set(k K, v V) { /* ... */ }
Diagnosis¶
For maps with low contention (which is most maps), the per-access hash overhead exceeds the savings from reduced mutex contention. A single map behind a sync.Mutex is usually faster.
Simplified¶
type Map[K comparable, V any] struct {
    mu sync.Mutex
    m  map[K]V
}
// NewMap initialises the inner map so Set works on a fresh value.
func NewMap[K comparable, V any]() *Map[K, V] {
    return &Map[K, V]{m: make(map[K]V)}
}
func (m *Map[K, V]) Get(k K) (V, bool) {
    m.mu.Lock()
    defer m.mu.Unlock()
    v, ok := m.m[k]
    return v, ok
}
func (m *Map[K, V]) Set(k K, v V) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.m[k] = v
}
Outcome¶
Single map: ~30 ns/op. Sharded map: ~50 ns/op (hash + mutex).
The sharded version was roughly 65% slower, the opposite of its intent. Remove the sharding.
Scenario 6: sync.Pool for tiny objects, simplified to allocation¶
Original¶
var pool = sync.Pool{
    New: func() interface{} { return &SmallObj{} },
}
func process(x int) *SmallObj {
    o := pool.Get().(*SmallObj)
    o.value = x
    return o
}
func release(o *SmallObj) {
    *o = SmallObj{} // reset
    pool.Put(o)
}
Diagnosis¶
SmallObj is 16 bytes. Allocation is ~30 ns; pool overhead is ~30 ns. No win. And the Put/Get protocol requires the caller to remember to release — easy to forget.
Simplified¶
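Just allocate. A minimal sketch, assuming the only field set here is value, as in the original:

func process(x int) *SmallObj {
    // Allocate directly; at ~30 ns this costs no more than the pool round trip.
    return &SmallObj{value: x}
}
// No release step is needed: the garbage collector reclaims the object.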
Outcome¶
Same performance. Half the code. No risk of forgetting to release.
Scenario 7: actor pattern for simple state, simplified to mutex¶
Original¶
type Counter struct {
    incCh chan struct{}
    getCh chan chan int
}
func NewCounter() *Counter {
    c := &Counter{
        incCh: make(chan struct{}),
        getCh: make(chan chan int),
    }
    go c.run()
    return c
}
func (c *Counter) run() {
    n := 0
    for {
        select {
        case <-c.incCh:
            n++
        case out := <-c.getCh:
            out <- n
        }
    }
}
func (c *Counter) Inc() { c.incCh <- struct{}{} }
func (c *Counter) Get() int {
    out := make(chan int)
    c.getCh <- out
    return <-out
}
Diagnosis¶
Each operation is 2 channel ops = ~500 ns. A mutex would be ~10 ns. The actor pattern is 50× slower for trivial state.
Simplified¶
type Counter struct {
    mu sync.Mutex
    n  int
}
func (c *Counter) Inc() {
    c.mu.Lock()
    c.n++
    c.mu.Unlock()
}
func (c *Counter) Get() int {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.n
}
Outcome¶
50× faster. Less code. No background goroutine to manage. (For a counter, even better: use atomic.Int64 — 1 ns per op.)
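For completeness, the atomic.Int64 variant mentioned above; the name AtomicCounter is used here only to keep it distinct from the mutex version:

type AtomicCounter struct {
    n atomic.Int64
}
func (c *AtomicCounter) Inc()       { c.n.Add(1) }
func (c *AtomicCounter) Get() int64 { return c.n.Load() }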
Scenario 8: batched events sent one at a time¶
Original¶
func send(events <-chan Event, ch chan<- Batch) {
    for ev := range events {
        ch <- Batch{Events: []Event{ev}} // batch of one!
    }
}
Diagnosis¶
A "batch" of one defeats batching. Each event goes through a channel with full overhead.
Simplified¶
func send(events <-chan Event, ch chan<- Batch) {
    batch := Batch{Events: make([]Event, 0, 100)}
    flush := func() {
        if len(batch.Events) > 0 {
            ch <- batch
            batch = Batch{Events: make([]Event, 0, 100)}
        }
    }
    timer := time.NewTimer(50 * time.Millisecond)
    defer timer.Stop()
    for {
        select {
        case ev, ok := <-events:
            if !ok {
                flush()
                return
            }
            batch.Events = append(batch.Events, ev)
            if len(batch.Events) >= 100 {
                flush()
                timer.Reset(50 * time.Millisecond)
            }
        case <-timer.C:
            flush()
            timer.Reset(50 * time.Millisecond)
        }
    }
}
Outcome¶
100× fewer channel sends. Bounded latency (≤50 ms wait for partial batches).
Scenario 9: hedged on a contended backend, simplified to no hedge¶
Original¶
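A minimal sketch of a hedged request against an HTTP backend; the function name, the 50 ms hedge delay, and the use of net/http are illustrative:

func hedgedGet(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
    type result struct {
        resp *http.Response
        err  error
    }
    results := make(chan result, 2) // buffered so the losing request does not block forever

    do := func() {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            results <- result{nil, err}
            return
        }
        resp, err := client.Do(req)
        results <- result{resp, err}
    }

    go do()
    hedge := time.NewTimer(50 * time.Millisecond) // hedge delay, e.g. near the expected p95
    defer hedge.Stop()

    select {
    case r := <-results:
        return r.resp, r.err
    case <-hedge.C:
        go do() // first request is slow: fire a duplicate
    }
    r := <-results // whichever of the two finishes first
    return r.resp, r.err
}
// A production hedger must also cancel the losing request and close its response body.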
Diagnosis¶
Hedging adds load (~5-10% extra requests) to the backend. If the backend is already at capacity, hedging makes things worse — increasing load increases tail latency, which triggers more hedging.
Simplified¶
Remove the hedger. Address the actual problem: backend capacity. Add instances or fix the slow endpoint.
Outcome¶
Backend load drops. Tail latency improves. The hedger was a workaround for the real problem.
Scenario 10: multi-level cache hierarchy, simplified to one cache (or none)¶
Original¶
type Cache struct {
    L1 *sync.Map          // ultra-fast hot path
    L2 *bigcache.BigCache // larger, slower
    L3 *redis.Client      // remote, slowest
}
func (c *Cache) Get(key string) Value {
    if v, ok := c.L1.Load(key); ok {
        return v.(Value)
    }
    if v, err := c.L2.Get(key); err == nil {
        // promote to L1
        c.L1.Store(key, parseValue(v))
        return parseValue(v)
    }
    v, _ := c.L3.Get(key)
    parsed := parseValue(v)
    c.L2.Set(key, encodeValue(parsed))
    c.L1.Store(key, parsed)
    return parsed
}
Diagnosis¶
Three levels of cache add complexity, allocation, and possible inconsistencies. If L1's hit rate is high and the dataset fits in L1, you don't need L2. If L2's hit rate is high, you don't need L3.
Measured hit rates: L1 5%, L2 3%, L3 1%, for 9% combined cache effectiveness. Without the caches, the underlying source is fast enough.
Simplified¶
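A minimal sketch of the simplified path, assuming a Source interface with a Load method stands in for the underlying data source:

type Source interface {
    Load(key string) Value
}
type Store struct {
    source Source
}
func (s *Store) Get(key string) Value {
    // No cache layers: go straight to the source, which the measurements show is fast enough.
    return s.source.Load(key)
}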
(Optionally: a single LRU cache if the underlying source is expensive enough to warrant it.)
Outcome¶
Vastly simpler code. Latency unchanged. Memory savings. Operational simplicity.
Summary¶
Ten patterns of "remove the concurrency":
- Parallel sum → sequential.
- Channel aggregator → indexed slice.
- Per-item spawn → batched/chunked.
- Lock-free → mutex.
- Sharded map → single map.
- sync.Pool tiny → just allocate.
- Actor for state → mutex.
- Single-event batches → real batches.
- Hedger on saturated → remove.
- Multi-level cache → single or none.
In each case, the simpler code is faster, more reliable, and easier to reason about. The "optimization" was a tax with no return.
The senior engineer's gift to the team is recognising these patterns and shipping the simplification. Every removed line of premature concurrency is a future bug avoided.
End of optimize.