Premature Concurrency Optimization — Tasks

A series of 18 hands-on tasks to build measurement-driven concurrency intuition. Each task has a setup, instructions, expected outcomes, and reflection questions.

Run each task's benchmarks, and compare saved runs, with:

go test -bench=. -count=10 -cpu=1,2,4,8 -benchmem
benchstat <baseline.txt> <candidate.txt>

Task 1: Sequential vs concurrent sum

Setup: Write three implementations of summing a []int64.

// sumSeq: simple for-range loop
// sumParChunks: split into chunks, sum each chunk in its own goroutine, then add the partial sums
// sumParAtomic: split into chunks, but each goroutine adds every element to a shared counter with atomic.AddInt64
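A minimal sketch of the first two variants, assuming the standard sync import; the chunking scheme here is one reasonable choice, not the only one:

func sumSeq(xs []int64) int64 {
    var sum int64
    for _, x := range xs {
        sum += x
    }
    return sum
}

func sumParChunks(xs []int64, chunks int) int64 {
    if chunks < 1 {
        chunks = 1
    }
    partial := make([]int64, chunks)
    size := (len(xs) + chunks - 1) / chunks
    var wg sync.WaitGroup
    for c := 0; c < chunks; c++ {
        lo, hi := c*size, (c+1)*size
        if lo > len(xs) {
            lo = len(xs)
        }
        if hi > len(xs) {
            hi = len(xs)
        }
        wg.Add(1)
        go func(c, lo, hi int) {
            defer wg.Done()
            var s int64
            for _, x := range xs[lo:hi] {
                s += x
            }
            partial[c] = s // one slot per goroutine: no shared counter
        }(c, lo, hi)
    }
    wg.Wait()
    return sumSeq(partial)
}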

Instructions: 1. Write benchmarks for N = 100, 1000, 10000, 100000, 1000000, 10000000. 2. Run go test -bench=. -count=10 -cpu=1,2,4,8. 3. Compute the crossover point: at what N does parallel start to win? 4. Compute the crossover point for atomic vs chunks.

Expected outcomes: - sumSeq wins for small N (below roughly 100,000; the exact point is hardware-dependent). - sumParChunks wins for large N. - sumParAtomic loses to chunks because every add contends on the shared counter's cache line.

Reflection: - Why does sumParAtomic lose? - What's the crossover N on your hardware? - How does the crossover shift with -cpu=2 vs -cpu=8?


Task 2: Goroutine spawn overhead

Setup: Measure the cost of spawning a goroutine.

func BenchmarkSpawn(b *testing.B) {
    var wg sync.WaitGroup
    for i := 0; i < b.N; i++ {
        wg.Add(1)
        go func() { wg.Done() }()
    }
    wg.Wait()
}

func BenchmarkSpawnParallel(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        var wg sync.WaitGroup
        for pb.Next() {
            wg.Add(1)
            go func() { wg.Done() }()
            wg.Wait()
        }
    })
}

Instructions: 1. Run both benchmarks. 2. Note the ns/op. 3. Compare to a similar benchmark using a worker pool with channel send/recv.
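A possible shape for the worker-pool comparison (a sketch; the function name, pool size, and channel buffering are illustrative choices, and it assumes the runtime package for GOMAXPROCS):

// A fixed set of workers receives work over a channel, so the benchmark
// measures dispatch (send plus recv) rather than goroutine creation.
func BenchmarkPoolDispatch(b *testing.B) {
    tasks := make(chan struct{}, 128)
    done := make(chan struct{}, 128)
    for w := 0; w < runtime.GOMAXPROCS(0); w++ {
        go func() {
            for range tasks {
                done <- struct{}{}
            }
        }()
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        tasks <- struct{}{}
        <-done
    }
    close(tasks)
}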

Expected outcomes: - ~1 µs per spawn. - Pool with channel: ~100-200 ns per dispatched task.

Reflection: - When does the spawn cost matter? - For 10 µs of work, what's the relative overhead? - For 100 ns of work?


Task 3: Channel send/recv cost

Setup: Measure channel cost in various scenarios.

// BenchmarkChanBuffered: buffered channel, single producer, single consumer.
// BenchmarkChanUnbuffered: unbuffered channel, producer/consumer must sync.
// BenchmarkChanContended: N producers, 1 consumer.
// BenchmarkChanSelect: select with 2 cases.
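A sketch of the buffered single-producer, single-consumer case; the other three follow the same shape with different channel setups (the buffer size of 128 is arbitrary):

// The consumer drains the channel while the timed loop pays only for the send
// (plus occasional blocking when the buffer fills).
func BenchmarkChanBuffered(b *testing.B) {
    ch := make(chan int, 128)
    go func() {
        for range ch {
        }
    }()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        ch <- i
    }
    close(ch)
}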

Instructions: 1. Write all four benchmarks. 2. Measure per-op cost. 3. Compare to a mutex-protected slice.

Expected outcomes: - Buffered, hot: ~50 ns/op. - Unbuffered: ~250 ns/op. - Contended: depends on N; higher per-op due to lock contention. - Select: ~150 ns/op for 2 cases. - Mutex-protected slice: ~30 ns/op.

Reflection: - When is a mutex-protected slice clearly better? - When is a buffered channel sufficient? - When do the unbuffered channel's synchronization semantics justify the cost?


Task 4: False sharing demonstration

Setup: Two structs with adjacent vs padded fields.

type Adjacent struct {
    A, B int64
}

type Padded struct {
    A int64
    _ [56]byte // padding so B lands on a different 64-byte cache line than A
    B int64
}

Instructions: 1. Write a benchmark with two goroutines, one incrementing A and one incrementing B. 2. Run with Adjacent and Padded. 3. Compare ns/op.
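One possible benchmark (a sketch; atomic adds are used so the compiler cannot keep the counters in registers, and the name BenchmarkAdjacent is illustrative):

// Two goroutines hammer neighbouring fields of the same value. Swap Adjacent
// for Padded to compare.
func BenchmarkAdjacent(b *testing.B) {
    var v Adjacent
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        for i := 0; i < b.N; i++ {
            atomic.AddInt64(&v.A, 1)
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < b.N; i++ {
            atomic.AddInt64(&v.B, 1)
        }
    }()
    wg.Wait()
}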

Expected outcomes: - Adjacent: dramatically slower (4-8×) due to cache-line bouncing. - Padded: near-baseline speed.

Reflection: - What's the size of your CPU's cache line? - Where else does false sharing hide in real code? - How do you detect false sharing without already suspecting it?


Task 5: Mutex vs RWMutex for short reads

Setup: A map with concurrent reads and rare writes.

type CacheMutex struct {
    mu sync.Mutex
    m  map[string]string
}

type CacheRW struct {
    mu sync.RWMutex
    m  map[string]string
}

Instructions: 1. Implement Get/Set on both. 2. Benchmark RunParallel reads with rare writes. 3. Compare.
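A sketch of the read path and a read-mostly parallel benchmark; the 1-in-1000 write ratio is an arbitrary stand-in for "rare writes":

func (c *CacheRW) Get(k string) (string, bool) {
    c.mu.RLock()
    v, ok := c.m[k]
    c.mu.RUnlock()
    return v, ok
}

// Each parallel worker does mostly reads with an occasional write under the
// full lock.
func BenchmarkCacheRWRead(b *testing.B) {
    c := &CacheRW{m: map[string]string{"k": "v"}}
    b.RunParallel(func(pb *testing.PB) {
        i := 0
        for pb.Next() {
            if i%1000 == 0 {
                c.mu.Lock()
                c.m["k"] = "v"
                c.mu.Unlock()
            } else {
                c.Get("k")
            }
            i++
        }
    })
}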

Expected outcomes: - Mutex wins for short critical sections (RWMutex's reader bookkeeping costs more than the exclusivity it avoids). - RWMutex only wins when the critical section is long (e.g. > 1 µs) and reads dominate.

Reflection: - For your map, how long is the critical section? - Where would RWMutex actually help? - Could you eliminate the lock entirely (copy-on-write)?


Task 6: Sharded map breakeven

Setup: A map with concurrent access; compare single-mutex to sharded.

type SingleMap struct {
    mu sync.Mutex
    m  map[string]int
}

type ShardedMap struct {
    shards []struct {
        mu sync.Mutex
        m  map[string]int
    }
}

Instructions: 1. Implement both. 2. Benchmark with varying contention (number of goroutines). 3. Find the breakeven: at what contention level does sharding win?
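One way to route keys to shards (a sketch; FNV-1a via hash/fnv is an arbitrary hash choice, and NewShardedMap/Incr are illustrative names):

func NewShardedMap(n int) *ShardedMap {
    s := &ShardedMap{shards: make([]struct {
        mu sync.Mutex
        m  map[string]int
    }, n)}
    for i := range s.shards {
        s.shards[i].m = make(map[string]int)
    }
    return s
}

func (s *ShardedMap) Incr(key string) {
    h := fnv.New32a()
    h.Write([]byte(key)) // the []byte conversion allocates; it is part of the sharding cost
    sh := &s.shards[h.Sum32()%uint32(len(s.shards))]
    sh.mu.Lock()
    sh.m[key]++
    sh.mu.Unlock()
}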

Expected outcomes: - At low contention (1-4 goroutines), single map wins (hash overhead). - At high contention (16+ goroutines), sharding wins.

Reflection: - What's your production contention level? - Does sharding pay off given that level? - What's the right shard count for your workload?


Task 7: sync.Pool benefit measurement

Setup: Allocate vs pool small buffers.

// BenchmarkAlloc: make([]byte, size) per call.
// BenchmarkPool: get from sync.Pool, use, put back.

Instructions: 1. For sizes 64 B, 256 B, 1 KB, 4 KB, 16 KB. 2. Benchmark each. 3. Note bytes/op and allocs/op.
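A sketch for one size (4 KB); the package-level sink keeps the compiler from optimizing the plain allocation away, and the names are illustrative:

var bufPool = sync.Pool{New: func() any { return make([]byte, 4096) }}
var sink []byte // forces the allocation in BenchmarkAlloc4K to escape to the heap

func BenchmarkAlloc4K(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sink = make([]byte, 4096)
    }
}

func BenchmarkPool4K(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        buf := bufPool.Get().([]byte)
        buf[0] = byte(i) // touch the buffer so the loop has an observable effect
        bufPool.Put(buf)
    }
}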

Expected outcomes: - Pool overhead is comparable to allocation for small sizes. - Pool clearly wins for sizes > 1 KB.

Reflection: - At what size does pooling start to pay? - What's the memory cost of the pool when full? - When is sync.Pool the wrong tool (e.g. objects whose state must be reset between uses)?


Task 8: Worker pool sizing

Setup: A CPU-bound task with N workers.

func cpuTask(x int) int {
    sum := 0
    for i := 0; i < 1000; i++ {
        sum += i * x
    }
    return sum
}

// Run with pools of size 1, 2, 4, 8, 16, 32.

Instructions: 1. Benchmark throughput at each size. 2. Plot throughput vs pool size.
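A sketch of the sweep using sub-benchmarks; the atomic sink keeps cpuTask calls from being treated as dead code, and the names are illustrative:

var poolSink int64 // accumulated results so the work is not optimized away

func benchPoolSize(b *testing.B, size int) {
    items := make(chan int, size)
    var wg sync.WaitGroup
    for w := 0; w < size; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            var local int64
            for x := range items {
                local += int64(cpuTask(x))
            }
            atomic.AddInt64(&poolSink, local)
        }()
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        items <- i
    }
    close(items)
    wg.Wait()
}

func BenchmarkPoolSize(b *testing.B) {
    for _, size := range []int{1, 2, 4, 8, 16, 32} {
        b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
            benchPoolSize(b, size)
        })
    }
}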

Expected outcomes: - Throughput rises with pool size up to GOMAXPROCS. - Past that, diminishing returns. - Beyond ~2× GOMAXPROCS, throughput may decline.

Reflection: - What's the optimal size for your machine? - Does the optimum shift with task cost? - What if the task were I/O-bound?


Task 9: I/O fan-out

Setup: Simulate 10 concurrent HTTP calls each taking 50 ms.

func slowCall(ctx context.Context) error {
    time.Sleep(50 * time.Millisecond)
    return nil
}

// Sequential: 10 calls in a row = 500 ms.
// Parallel: 10 calls concurrent = ~50 ms.

Instructions: 1. Implement both. 2. Benchmark wall time.
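A sketch of the fan-out; errors are collected on a buffered channel and the first non-nil one wins (golang.org/x/sync/errgroup is a common alternative):

// Launch all calls at once, wait for every one, return the first error seen.
func fanOut(ctx context.Context, n int) error {
    errs := make(chan error, n)
    for i := 0; i < n; i++ {
        go func() {
            errs <- slowCall(ctx)
        }()
    }
    var first error
    for i := 0; i < n; i++ {
        if err := <-errs; err != nil && first == nil {
            first = err
        }
    }
    return first
}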

Expected outcomes: - Sequential: ~500 ms. - Parallel: ~50 ms (limited by single call's latency).

Reflection: - This is the textbook win for concurrency. Why? - What happens if calls share a connection pool of size 5? - How would you handle errors (first error vs all errors)?


Task 10: Hedged execution

Setup: A backend with variable latency (mostly fast, occasionally slow).

func variableCall() error {
    if rand.Float64() < 0.1 {
        time.Sleep(200 * time.Millisecond) // slow tail
    } else {
        time.Sleep(20 * time.Millisecond)
    }
    return nil
}

Instructions: 1. Implement a hedger: after 50 ms, fire a backup. 2. Benchmark p99 latency with and without hedging.
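A sketch of the hedger; the result channel is buffered so the losing attempt's send never blocks or leaks a goroutine (hedgedCall is an illustrative name):

func hedgedCall(hedgeAfter time.Duration) error {
    result := make(chan error, 2) // room for both attempts
    go func() { result <- variableCall() }()
    select {
    case err := <-result:
        return err // primary finished before the hedge delay
    case <-time.After(hedgeAfter):
        go func() { result <- variableCall() }() // fire the backup
        return <-result                          // first of the two to finish wins
    }
}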

Expected outcomes: - Without hedging: p99 ~200 ms. - With hedging: p99 ~70 ms. - Backend load: ~1.1× (the ~10% of calls still outstanding at 50 ms trigger a backup).

Reflection: - When is hedging worth it? - What if the backend is already at capacity? - How would you measure the actual hedge rate?


Task 11: Profile a slow function

Setup: A deliberately slow function.

func slowFunc() {
    // Allocates a lot of garbage
    var s []byte
    for i := 0; i < 1000; i++ {
        s = append(s, byte(i%256))
    }
    // ... more wasteful work
}

Instructions: 1. Run benchmark with -cpuprofile=cpu.prof -memprofile=mem.prof. 2. Open both in go tool pprof. 3. Identify top consumers.

Expected outcomes: - CPU profile shows where time goes. - Memory profile shows allocation sites.

Reflection: - What's the easiest optimization? - Is concurrency the answer? (Hint: probably not for this kind of function.)


Task 12: Trace a concurrent program

Setup: A small program with workers and a producer.

func main() {
    items := make(chan int, 100)
    var wg sync.WaitGroup
    for w := 0; w < 4; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for x := range items {
                _ = x * x
            }
        }()
    }
    for i := 0; i < 1000; i++ {
        items <- i
    }
    close(items)
    wg.Wait()
}

Instructions: 1. Wrap with trace.Start/Stop. 2. Run; open go tool trace. 3. View the timeline.
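The wrapping at the top of main can look like this (runtime/trace plus os and log; trace.out is an arbitrary file name, viewed afterwards with go tool trace trace.out):

// At the start of main, before the workers are launched:
f, err := os.Create("trace.out")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
if err := trace.Start(f); err != nil {
    log.Fatal(err)
}
defer trace.Stop()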

Expected outcomes: - Workers visible on multiple Ps. - Channel send/recv events. - Brief idle gaps.

Reflection: - Are workers well-utilized? - Where do they idle? - How could you improve parallelism?


Task 13: Compare two implementations with benchstat

Setup: Two implementations of the same function.

Instructions: 1. Implement both. 2. Run benchmarks with -count=20 for each. 3. Compare with benchstat.

Expected outcomes: - Clear winner with statistical significance, or - No significant difference (p > 0.05).

Reflection: - Was the winner what you expected? - If not, why? - Was the difference practically meaningful?


Task 14: GOMAXPROCS sensitivity

Setup: A benchmark that's affected by GOMAXPROCS.

Instructions: 1. Run with -cpu=1,2,4,8. 2. Plot ns/op vs cpu.

Expected outcomes: - Parallel benchmarks improve with more CPUs (up to a point). - Beyond a point, returns diminish or reverse.

Reflection: - What's the scaling curve? - Is there an optimal -cpu for your machine? - What if your container has a 2-CPU limit?


Task 15: Identify a leak

Setup: A function that leaks goroutines.

func leakyHandler() {
    go func() {
        time.Sleep(time.Hour)
    }()
}

Instructions: 1. Call it 100 times. 2. Check runtime.NumGoroutine(). 3. Grab a goroutine dump.
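A sketch of those steps (the dump comes from runtime/pprof; debug level 1 groups goroutines by identical stacks, and the short sleep just lets the leaked goroutines start):

for i := 0; i < 100; i++ {
    leakyHandler()
}
time.Sleep(100 * time.Millisecond)
fmt.Println("goroutines:", runtime.NumGoroutine())
pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)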

Expected outcomes: - Goroutine count = 100+. - Dump shows 100 goroutines in time.Sleep.

Reflection: - How would you fix it? - How would CI catch this (hint: goleak)?


Task 16: Remove unnecessary concurrency

Setup: A function with goroutines that shouldn't be there.

func slowSum(xs []int) int {
    var wg sync.WaitGroup
    var mu sync.Mutex
    var sum int
    for _, x := range xs {
        x := x
        wg.Add(1)
        go func() {
            defer wg.Done()
            mu.Lock()
            sum += x
            mu.Unlock()
        }()
    }
    wg.Wait()
    return sum
}

Instructions: 1. Benchmark this. 2. Replace with a simple loop. 3. Compare.

Expected outcomes: - The "concurrent" version is 100-1000× slower than the loop. - The simple loop is the obvious choice.

Reflection: - Why is this version slow? - What was the (mistaken) intent? - How would you spot this in code review?


Task 17: Batch instead of fan out per item

Setup: Process a stream of small items.

// BadDesign: spawn goroutine per item.
// GoodDesign: send batches of 100 items to a pool of workers.

Instructions: 1. Implement both. 2. Benchmark throughput. 3. Measure memory usage during benchmark.
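A sketch of GoodDesign; Item and process stand in for whatever the stream actually carries and does, and the batch size and worker count are arbitrary starting points:

// Collect items into fixed-size batches and hand each batch to a bounded pool.
func processBatched(items <-chan Item, workers, batchSize int) {
    batches := make(chan []Item, workers)
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for batch := range batches {
                for _, it := range batch {
                    process(it) // hypothetical per-item work function
                }
            }
        }()
    }
    batch := make([]Item, 0, batchSize)
    for it := range items {
        batch = append(batch, it)
        if len(batch) == batchSize {
            batches <- batch
            batch = make([]Item, 0, batchSize)
        }
    }
    if len(batch) > 0 {
        batches <- batch // flush the final partial batch
    }
    close(batches)
    wg.Wait()
}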

Expected outcomes: - BadDesign: high goroutine count, high GC pressure. - GoodDesign: bounded goroutines, much higher throughput.

Reflection: - What's the batch size sweet spot? - How does it depend on item cost?


Task 18: Optimize a real (open source) example

Setup: Pick a Go open source library you use.

Instructions: 1. Run a representative workload through it. 2. Profile. 3. Identify any premature optimizations (sharded maps, sync.Pool, etc.). 4. Try replacing with simpler alternatives. 5. Benchmark.

Expected outcomes: - Sometimes the "optimization" pays off. - Often a simpler version is competitive.

Reflection: - Did the original optimization have a measurable justification? - Could you contribute a simplification PR? - What did you learn about library performance?


Wrap-up

After completing these tasks, you should have:

  • Concrete intuition for when concurrency wins and loses.
  • Familiarity with profiling tools.
  • Experience with benchstat.
  • Skepticism toward unmeasured optimizations.

The discipline is the durable outcome. Apply it in your own code.

End of tasks.