Premature Concurrency Optimization — Tasks¶
A series of 18 hands-on tasks to build measurement-driven concurrency intuition. Each task has a setup, instructions, expected outcomes, and reflection questions.
Run all tasks with go test -bench=. -benchmem, plus whatever extra flags a task calls for.
Task 1: Sequential vs concurrent sum¶
Setup: Write three implementations of summing a []float64.
// sumSeq: simple for-range loop
// sumParChunks: split into chunks, sum each in a goroutine, sum partial sums
// sumParAtomic: one goroutine per N items, atomic.AddInt64 to a global
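A minimal sketch of the first two variants, assuming chunking by runtime.NumCPU(); sumParAtomic follows the per-item shape described above, but note that atomic.AddInt64 works on integers, so it needs an integer accumulator (that detail is left to you):

import (
	"runtime"
	"sync"
)

func sumSeq(xs []float64) float64 {
	var total float64
	for _, x := range xs {
		total += x
	}
	return total
}

func sumParChunks(xs []float64) float64 {
	n := runtime.NumCPU()
	if n > len(xs) {
		n = 1
	}
	chunk := (len(xs) + n - 1) / n
	partials := make([]float64, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		lo, hi := i*chunk, (i+1)*chunk
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(i, lo, hi int) {
			defer wg.Done()
			var s float64
			for _, x := range xs[lo:hi] {
				s += x
			}
			partials[i] = s // each goroutine writes only its own slot; no locking needed
		}(i, lo, hi)
	}
	wg.Wait()
	var total float64
	for _, p := range partials {
		total += p
	}
	return total
}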
Instructions: 1. Write benchmarks for N = 100, 1000, 10000, 100000, 1000000, 10000000. 2. Run go test -bench=. -count=10 -cpu=1,2,4,8. 3. Compute the crossover point: at what N does parallel start to win? 4. Compute the crossover point for atomic vs chunks.
Expected outcomes: - sumSeq wins for small N (< ~100000). - sumParChunks wins for large N. - sumParAtomic loses to chunks because of cache contention.
Reflection: - Why does sumParAtomic lose? - What's the crossover N on your hardware? - How does the crossover shift with -cpu=2 vs -cpu=8?
Task 2: Goroutine spawn overhead¶
Setup: Measure the cost of spawning a goroutine.
func BenchmarkSpawn(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() { wg.Done() }()
	}
	wg.Wait()
}

func BenchmarkSpawnParallel(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		var wg sync.WaitGroup
		for pb.Next() {
			wg.Add(1)
			go func() { wg.Done() }()
			wg.Wait()
		}
	})
}
Instructions: 1. Run both benchmarks. 2. Note the ns/op. 3. Compare to a similar benchmark using a worker pool with channel send/recv.
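For step 3, one possible shape for the pool comparison (pool size and buffer length are assumptions); each iteration is a single channel dispatch into a fixed set of workers:

// Needs "runtime", "sync", and "testing" in a _test.go file.
func BenchmarkPoolDispatch(b *testing.B) {
	work := make(chan int, 1024)
	var wg sync.WaitGroup
	for w := 0; w < runtime.GOMAXPROCS(0); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range work {
				// the "task" is a no-op; we only measure dispatch cost
			}
		}()
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		work <- i
	}
	close(work)
	wg.Wait()
}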
Expected outcomes: - ~1 µs per spawn. - Pool with channel: ~100-200 ns per dispatched task.
Reflection: - When does the spawn cost matter? - For 10 µs of work, what's the relative overhead? - For 100 ns of work?
Task 3: Channel send/recv cost¶
Setup: Measure channel cost in various scenarios.
// BenchmarkChanBuffered: buffered channel, single producer, single consumer.
// BenchmarkChanUnbuffered: unbuffered channel, producer/consumer must sync.
// BenchmarkChanContended: N producers, 1 consumer.
// BenchmarkChanSelect: select with 2 cases.
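A sketch of the first two; the contended and select variants extend the same pattern (buffer size 128 is an assumption):

// Needs "testing" in a _test.go file.
func BenchmarkChanBuffered(b *testing.B) {
	ch := make(chan int, 128)
	go func() {
		for range ch {
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- i
	}
	close(ch)
}

func BenchmarkChanUnbuffered(b *testing.B) {
	ch := make(chan int) // every send must rendezvous with the receiver
	go func() {
		for range ch {
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- i
	}
	close(ch)
}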
Instructions: 1. Write all four benchmarks. 2. Measure per-op cost. 3. Compare to a mutex-protected slice.
Expected outcomes: - Buffered, hot: ~50 ns/op. - Unbuffered: ~250 ns/op. - Contended: depends on N; higher per-op due to lock contention. - Select: ~150 ns/op for 2 cases. - Mutex-protected slice: ~30 ns/op.
Reflection: - When is a mutex-protected slice clearly better? - When is a buffered channel sufficient? - When do the unbuffered channel's synchronization semantics justify the cost?
Task 4: False sharing demonstration¶
Setup: Two structs with adjacent vs padded fields.
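One possible pair of layouts, assuming a 64-byte cache line (the padding size is exactly the assumption the reflection asks you to verify):

type Adjacent struct {
	A int64
	B int64 // shares a cache line with A
}

type Padded struct {
	A int64
	_ [56]byte // pushes B onto a different cache line (assumes 64-byte lines)
	B int64
}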
Instructions: 1. Write a benchmark with two goroutines, one incrementing A and one incrementing B. 2. Run with Adjacent and Padded. 3. Compare ns/op.
Expected outcomes: - Adjacent: dramatically slower (4-8×) due to cache-line bouncing. - Padded: near-baseline speed.
Reflection: - What's the size of your CPU's cache line? - Where else does false sharing hide in real code? - How do you detect false sharing without already suspecting it?
Task 5: Mutex vs RWMutex for short reads¶
Setup: A map with concurrent reads and rare writes.
type CacheMutex struct {
	mu sync.Mutex
	m  map[string]string
}

type CacheRW struct {
	mu sync.RWMutex
	m  map[string]string
}
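A minimal sketch of Get on both; Set is the same shape with Lock/Unlock and an assignment:

func (c *CacheMutex) Get(k string) (string, bool) {
	c.mu.Lock()
	v, ok := c.m[k]
	c.mu.Unlock()
	return v, ok
}

func (c *CacheRW) Get(k string) (string, bool) {
	c.mu.RLock()
	v, ok := c.m[k]
	c.mu.RUnlock()
	return v, ok
}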
Instructions: 1. Implement Get/Set on both. 2. Benchmark concurrent reads with rare writes using b.RunParallel. 3. Compare.
Expected outcomes: - Mutex wins for short critical sections. - RWMutex only wins if the critical section is long (e.g. > 1 µs).
Reflection: - For your map, how long is the critical section? - Where would RWMutex actually help? - Could you eliminate the lock entirely (copy-on-write)?
Task 6: Sharded map breakeven¶
Setup: A map with concurrent access; compare single-mutex to sharded.
type SingleMap struct {
	mu sync.Mutex
	m  map[string]int
}

type ShardedMap struct {
	shards []struct {
		mu sync.Mutex
		m  map[string]int
	}
}
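The piece left to you is shard selection; a common sketch hashes the key with hash/fnv (the Incr method is illustrative):

// Needs "hash/fnv".
func (s *ShardedMap) shardIndex(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % len(s.shards)
}

func (s *ShardedMap) Incr(key string) {
	sh := &s.shards[s.shardIndex(key)]
	sh.mu.Lock()
	sh.m[key]++
	sh.mu.Unlock()
}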
Instructions: 1. Implement both. 2. Benchmark with varying contention (number of goroutines). 3. Find the breakeven: at what contention level does sharding win?
Expected outcomes: - At low contention (1-4 goroutines), single map wins (hash overhead). - At high contention (16+ goroutines), sharding wins.
Reflection: - What's your production contention level? - Does sharding pay off given that level? - What's the right shard count for your workload?
Task 7: sync.Pool benefit measurement¶
Setup: Allocate vs pool small buffers.
// BenchmarkAlloc: make([]byte, size) per call.
// BenchmarkPool: get from sync.Pool, use, put back.
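A sketch of the pooled side for a single size (1 KB here; the task sweeps several). Storing a pointer to the slice avoids an extra allocation on Put:

var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 1024)
		return &b
	},
}

func BenchmarkPool(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := bufPool.Get().(*[]byte)
		_ = (*buf)[:0] // stand-in for real use of the buffer
		bufPool.Put(buf)
	}
}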
Instructions: 1. Use buffer sizes of 64 B, 256 B, 1 KB, 4 KB, and 16 KB. 2. Benchmark each. 3. Note bytes/op and allocs/op (run with -benchmem or call b.ReportAllocs()).
Expected outcomes: - Pool overhead is comparable to allocation for small sizes. - Pool clearly wins for sizes > 1 KB.
Reflection: - At what size does pooling start to pay? - What's the memory cost of the pool when full? - When is sync.Pool wrong (e.g. mutable state)?
Task 8: Worker pool sizing¶
Setup: A CPU-bound task with N workers.
func cpuTask(x int) int {
	sum := 0
	for i := 0; i < 1000; i++ {
		sum += i * x
	}
	return sum
}
// Run with pools of size 1, 2, 4, 8, 16, 32.
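One way to sweep those sizes with sub-benchmarks (the harness shape is an assumption):

// Needs "fmt", "sync", and "testing".
func BenchmarkWorkerPool(b *testing.B) {
	for _, size := range []int{1, 2, 4, 8, 16, 32} {
		b.Run(fmt.Sprintf("workers=%d", size), func(b *testing.B) {
			jobs := make(chan int, size)
			var wg sync.WaitGroup
			for w := 0; w < size; w++ {
				wg.Add(1)
				go func() {
					defer wg.Done()
					for x := range jobs {
						_ = cpuTask(x)
					}
				}()
			}
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				jobs <- i
			}
			close(jobs)
			wg.Wait()
		})
	}
}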
Instructions: 1. Benchmark throughput at each size. 2. Plot throughput vs pool size.
Expected outcomes: - Throughput rises with pool size up to GOMAXPROCS. - Past that, diminishing returns. - Beyond ~2× GOMAXPROCS, throughput may decline.
Reflection: - What's the optimal size for your machine? - Does the optimum shift with task cost? - What if the task were I/O-bound?
Task 9: I/O fan-out¶
Setup: Simulate 10 concurrent HTTP calls each taking 50 ms.
func slowCall(ctx context.Context) error {
	time.Sleep(50 * time.Millisecond)
	return nil
}
// Sequential: 10 calls in a row = 500 ms.
// Parallel: 10 calls concurrent = ~50 ms.
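A sketch of the parallel version using golang.org/x/sync/errgroup (a plain WaitGroup plus your own error collection works too); g.Wait returns the first non-nil error, which ties into the error-handling reflection below:

// Needs "context" and "golang.org/x/sync/errgroup".
func fanOut(ctx context.Context) error {
	g, ctx := errgroup.WithContext(ctx)
	for i := 0; i < 10; i++ {
		g.Go(func() error { return slowCall(ctx) })
	}
	return g.Wait()
}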
Instructions: 1. Implement both. 2. Benchmark wall time.
Expected outcomes: - Sequential: ~500 ms. - Parallel: ~50 ms (limited by single call's latency).
Reflection: - This is the textbook win for concurrency. Why? - What happens if calls share a connection pool of size 5? - How would you handle errors (first error vs all errors)?
Task 10: Hedged execution¶
Setup: A backend with variable latency (mostly fast, occasionally slow).
func variableCall() error {
	if rand.Float64() < 0.1 {
		time.Sleep(200 * time.Millisecond) // slow tail
	} else {
		time.Sleep(20 * time.Millisecond)
	}
	return nil
}
Instructions: 1. Implement a hedger: after 50 ms, fire a backup. 2. Benchmark p99 latency with and without hedging.
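A minimal hedger sketch, assuming the call takes no arguments and the losing result can simply be discarded:

func hedged(call func() error, delay time.Duration) error {
	results := make(chan error, 2) // buffered so the slower call never blocks
	go func() { results <- call() }()
	timer := time.NewTimer(delay)
	defer timer.Stop()
	select {
	case err := <-results:
		return err // primary finished before the hedge delay
	case <-timer.C:
		go func() { results <- call() }() // fire the backup
		return <-results                  // first of primary/backup to finish wins
	}
}

// Usage: err := hedged(variableCall, 50*time.Millisecond)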
Expected outcomes: - Without hedging: p99 ~200 ms. - With hedging: p99 ~50-70 ms. - Backend load: ~1.1× (the ~10% of requests still outstanding at 50 ms trigger a backup).
Reflection: - When is hedging worth it? - What if the backend is already at capacity? - How would you measure the actual hedge rate?
Task 11: Profile a slow function¶
Setup: A deliberately slow function.
func slowFunc() {
	// Allocates a lot of garbage
	var s []byte
	for i := 0; i < 1000; i++ {
		s = append(s, byte(i%256))
	}
	// ... more wasteful work
}
Instructions: 1. Run benchmark with -cpuprofile=cpu.prof -memprofile=mem.prof. 2. Open both in go tool pprof. 3. Identify top consumers.
Expected outcomes: - CPU profile shows where time goes. - Memory profile shows allocation sites.
Reflection: - What's the easiest optimization? - Is concurrency the answer? (Hint: probably not for this kind of function.)
Task 12: Trace a concurrent program¶
Setup: A small program with workers and a producer.
func main() {
	items := make(chan int, 100)
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for x := range items {
				_ = x * x
			}
		}()
	}
	for i := 0; i < 1000; i++ {
		items <- i
	}
	close(items)
	wg.Wait()
}
Instructions: 1. Wrap with trace.Start/Stop. 2. Run; open go tool trace. 3. View the timeline.
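For step 1, the wrapping might look like this at the top of main (needs "os", "log", and "runtime/trace"):

f, err := os.Create("trace.out")
if err != nil {
	log.Fatal(err)
}
defer f.Close()
if err := trace.Start(f); err != nil {
	log.Fatal(err)
}
defer trace.Stop()
// ... existing producer/worker code ...

Then run the program and open the result with go tool trace trace.out.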
Expected outcomes: - Workers visible on multiple Ps. - Channel send/recv events. - Brief idle gaps.
Reflection: - Are workers well-utilized? - Where do they idle? - How could you improve parallelism?
Task 13: Compare two implementations with benchstat¶
Setup: Two implementations of the same function.
Instructions: 1. Implement both. 2. Run benchmarks with -count=20 for each. 3. Compare with benchstat.
Expected outcomes: - Clear winner with statistical significance, or - No significant difference (p > 0.05).
Reflection: - Was the winner what you expected? - If not, why? - Was the difference practically meaningful?
Task 14: GOMAXPROCS sensitivity¶
Setup: A benchmark that's affected by GOMAXPROCS.
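If you don't have a candidate handy, a minimal parallel benchmark works (reusing cpuTask from Task 8 is just one option):

func BenchmarkScaling(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			_ = cpuTask(42)
		}
	})
}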
Instructions: 1. Run with -cpu=1,2,4,8. 2. Plot ns/op vs cpu.
Expected outcomes: - Parallel benchmarks improve with more CPUs (up to a point). - Beyond a point, returns diminish or reverse.
Reflection: - What's the scaling curve? - Is there an optimal -cpu for your machine? - What if your container has a 2-CPU limit?
Task 15: Identify a leak¶
Setup: A function that leaks goroutines.
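If you need a concrete leaker, a minimal sketch (the sleep duration is arbitrary) could be:

func leaky() {
	go func() {
		time.Sleep(time.Hour) // never signalled, never joined: outlives every caller
	}()
}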
Instructions: 1. Call it 100 times. 2. Check runtime.NumGoroutine(). 3. Grab a goroutine dump.
Expected outcomes: - Goroutine count = 100+. - Dump shows 100 goroutines in time.Sleep.
Reflection: - How would you fix it? - How would CI catch this (hint: goleak)?
Task 16: Remove unnecessary concurrency¶
Setup: A function with goroutines that shouldn't be there.
func slowSum(xs []int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	var sum int
	for _, x := range xs {
		x := x
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			sum += x
			mu.Unlock()
		}()
	}
	wg.Wait()
	return sum
}
Instructions: 1. Benchmark this. 2. Replace with a simple loop. 3. Compare.
Expected outcomes: - The "concurrent" version is 100-1000× slower than the loop. - The simple loop is the obvious choice.
Reflection: - Why is this version slow? - What was the (mistaken) intent? - How would you spot this in code review?
Task 17: Batch instead of fan out per item¶
Setup: Process a stream of small items.
// BadDesign: spawn goroutine per item.
// GoodDesign: send batches of 100 items to a pool of workers.
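A sketch of GoodDesign's batching side (batch size 100 as stated; the worker count is an assumption):

func processBatches(items <-chan int, workers, batchSize int) {
	batches := make(chan []int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batches {
				for _, x := range batch {
					_ = x * x // stand-in for per-item work
				}
			}
		}()
	}
	batch := make([]int, 0, batchSize)
	for x := range items {
		batch = append(batch, x)
		if len(batch) == batchSize {
			batches <- batch
			batch = make([]int, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		batches <- batch // flush the final partial batch
	}
	close(batches)
	wg.Wait()
}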
Instructions: 1. Implement both. 2. Benchmark throughput. 3. Measure memory usage during benchmark.
Expected outcomes: - BadDesign: high goroutine count, high GC pressure. - GoodDesign: bounded goroutines, much higher throughput.
Reflection: - What's the batch size sweet spot? - How does it depend on item cost?
Task 18: Optimize a real (open source) example¶
Setup: Pick a Go open source library you use.
Instructions: 1. Run a representative workload through it. 2. Profile. 3. Identify any premature optimizations (sharded maps, sync.Pool, etc.). 4. Try replacing with simpler alternatives. 5. Benchmark.
Expected outcomes: - Sometimes the "optimization" pays off. - Often a simpler version is competitive.
Reflection: - Did the original optimization have a measurable justification? - Could you contribute a simplification PR? - What did you learn about library performance?
Wrap-up¶
After completing these tasks, you should have:
- Concrete intuition for when concurrency wins and loses.
- Familiarity with profiling tools.
- Experience with benchstat.
- Skepticism toward unmeasured optimizations.
The discipline is the durable outcome. Apply it in your own code.
End of tasks.