When to Use Concurrency — Optimization Exercises¶
Each exercise presents concurrent code that either hurts performance or does not help. Remove or restructure the concurrency, and measure before and after.
Exercise 1 — Per-element goroutine for trivial work¶
Baseline.
func sum(nums []int) int {
var s atomic.Int64
var wg sync.WaitGroup
for _, n := range nums {
wg.Add(1)
go func(n int) {
defer wg.Done()
s.Add(int64(n))
}(n)
}
wg.Wait()
return int(s.Load())
}
Goal. Remove concurrency.
Solution.
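A minimal sequential version (a sketch; the names match the baseline):
func sum(nums []int) int {
    s := 0
    for _, n := range nums {
        s += n
    }
    return s
}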
Roughly 100x faster: no goroutine creation, no atomic operations, no scheduler involvement. The concurrency was decoration.
Exercise 2 — Fan-out where caching wins¶
Baseline.
func enrich(ctx context.Context, items []Item) []Enriched {
    out := make([]Enriched, len(items))
    g, _ := errgroup.WithContext(ctx)
    for i, item := range items {
        i, item := i, item
        g.Go(func() error {
            user := fetchUser(item.UserID) // one fetch per item, even when UserIDs repeat
            out[i] = enrich1(item, user)
            return nil
        })
    }
    g.Wait()
    return out
}
If many items share a UserID, you fetch the same user repeatedly.
Goal. Dedupe.
Solution.
import "golang.org/x/sync/singleflight"
func enrich(items []Item) []Enriched {
    out := make([]Enriched, len(items))
    var g singleflight.Group
    var wg sync.WaitGroup
    for i, item := range items {
        i, item := i, item
        wg.Add(1)
        go func() {
            defer wg.Done()
            // singleflight collapses concurrent calls that share a key into one;
            // every goroutine asking for this UserID shares a single fetch.
            user, _, _ := g.Do(item.UserID, func() (interface{}, error) {
                return fetchUser(item.UserID), nil
            })
            out[i] = enrich1(item, user.(User))
        }()
    }
    wg.Wait()
    return out
}
If 100 items share 10 unique users, 100 fetches become roughly 10. (singleflight only collapses calls that are in flight at the same time; here all the goroutines start together, so the overlap is high.)
Exercise 3 — Sub-spawning in HTTP handler¶
Baseline.
http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
go heavyTask(r) // fire-and-forget
w.WriteHeader(202)
})
The goroutine may leak or fail silently.
Goal. Make the async work managed.
Solution.
var jobs = make(chan *http.Request, 1024) // bounded queue
func init() {
    for i := 0; i < 16; i++ { // bounded worker pool
        go func() {
            for r := range jobs {
                // Caveat: once the handler returns, r's context is cancelled and
                // its body is closed; in practice, copy what heavyTask needs out
                // of the request before enqueueing.
                heavyTask(r)
            }
        }()
    }
}
http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
    select {
    case jobs <- r:
        w.WriteHeader(http.StatusAccepted) // 202
    default:
        http.Error(w, "overloaded", http.StatusServiceUnavailable) // 503
    }
})
Bounded workers, bounded queue, load shedding on overload.
Exercise 4 — Concurrent code serialised by lock¶
Baseline.
type Counter struct {
mu sync.Mutex
n int
}
func (c *Counter) Inc() {
c.mu.Lock()
c.n++
c.mu.Unlock()
}
Under contention on a 32-core machine, every increment bounces the cache line holding the counter between cores.
Goal. Reduce contention.
Solution. Sharded counter.
type ShardedCounter struct {
    shards [32]struct {
        n atomic.Int64
        _ [56]byte // pad each shard to a full 64-byte cache line
    }
}
func (c *ShardedCounter) Inc() {
    // someGoroutineHash is a placeholder: Go does not expose goroutine IDs,
    // so in practice the shard index comes from a per-worker ID or a cheap hash.
    idx := someGoroutineHash() % 32
    c.shards[idx].n.Add(1)
}
func (c *ShardedCounter) Load() int64 {
    var total int64
    for i := range c.shards {
        total += c.shards[i].n.Load()
    }
    return total
}
Cache-line-isolated shards. Reads are slower (sum all shards) but increments scale linearly.
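One way to avoid the hash entirely is to hand each worker a fixed shard index up front. A sketch; IncShard, the work channel, and the counter variable are illustrative additions, not part of the baseline:
// IncShard increments a specific shard; a worker that always passes its own
// index never contends with the other workers.
func (c *ShardedCounter) IncShard(idx int) {
    c.shards[idx%len(c.shards)].n.Add(1)
}
// Each worker owns shard w.
for w := 0; w < 32; w++ {
    w := w
    go func() {
        for range work {
            counter.IncShard(w)
        }
    }()
}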
Exercise 5 — Concurrent search where tiered is better¶
Baseline.
func find(ctx context.Context, query string) ([]Result, error) {
    g, _ := errgroup.WithContext(ctx)
    var (
        cache []Result
        db    []Result
        api   []Result
    )
    g.Go(func() error { cache = searchCache(query); return nil })
    g.Go(func() error { db = searchDB(query); return nil })
    g.Go(func() error { api = searchAPI(query); return nil })
    g.Wait()
    return merge(cache, db, api), nil
}
Every query pays the cost of all three lookups, even when the cache alone would have answered it.
Goal. Cheap-first.
Solution.
func find(query string) ([]Result, error) {
if r := searchCache(query); len(r) > 0 {
return r, nil
}
if r := searchDB(query); len(r) > 0 {
return r, nil
}
return searchAPI(query), nil
}
Most queries hit cache; saves database and API load.
Exercise 6 — Excessive worker pool¶
Baseline.
const workers = 256
queue := make(chan Job, 1024)
for i := 0; i < workers; i++ {
go func() {
for j := range queue {
process(j) // takes 10 ms CPU
}
}()
}
CPU-bound work; 256 workers on (say) 8 cores. The workers compete for CPU, and throughput plateaus at roughly 8 concurrent jobs.
Goal. Right-size the pool.
Solution.
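A sketch of the right-sized pool, using runtime.NumCPU to match the worker count to the cores:
workers := runtime.NumCPU() // 8 on the machine above
queue := make(chan Job, 1024)
for i := 0; i < workers; i++ {
    go func() {
        for j := range queue {
            process(j) // still 10 ms of CPU per job
        }
    }()
}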
8 workers process 800 jobs/sec at full CPU utilisation. 256 workers process the same 800 jobs/sec with scheduling overhead.
Exercise 7 — Unbounded goroutine spawn¶
Baseline.
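The baseline, reconstructed as a sketch consistent with the prose below:
func handle(events []Event) {
    var wg sync.WaitGroup
    for _, e := range events {
        e := e
        wg.Add(1)
        go func() { // one goroutine per event, no limit
            defer wg.Done()
            process(e)
        }()
    }
    wg.Wait()
}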
If events has 1M elements, 1M goroutines.
Goal. Bound concurrency.
Solution.
import "golang.org/x/sync/errgroup"
func handle(ctx context.Context, events []Event) error {
    g, _ := errgroup.WithContext(ctx)
    g.SetLimit(64) // at most 64 goroutines in flight
    for _, e := range events {
        e := e
        g.Go(func() error {
            process(e)
            return nil
        })
    }
    return g.Wait()
}
At most 64 goroutines run at once; with SetLimit, g.Go blocks until a slot frees.
Exercise 8 — Channel for one-shot wait¶
Baseline.
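The baseline, reconstructed as a sketch:
result := make(chan int, 1)
go func() {
    result <- doWork()
}()
v := <-result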
The channel exists only so the caller can wait for the goroutine, and the caller blocks on it immediately. Channel plus goroutine buy nothing.
Goal. Remove unnecessary indirection.
Solution.
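The direct call:
v := doWork() // same result, no goroutine, no channel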
Or, if you genuinely need the goroutine (e.g., to enforce a timeout), something like the following sketch, where computeWithTimeout is an illustrative name and the two-second deadline is illustrative too:
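func computeWithTimeout() (int, error) {
    result := make(chan int, 1) // buffered so the goroutine can finish even after a timeout
    go func() {
        result <- doWork()
    }()
    select {
    case v := <-result:
        return v, nil
    case <-time.After(2 * time.Second):
        return 0, errors.New("doWork timed out")
    }
}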
Exercise 9 — Concurrent code without measurement¶
Baseline.
A team adds concurrency to a function "to improve scalability." A benchmark shows the concurrent version is 3x slower.
Goal. Revert.
Solution. Profile. Confirm the concurrent version has more overhead than gain. Replace with sequential.
This requires writing the benchmark first. The discipline: every concurrent design comes with a benchmark.
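A skeleton for that comparison; sumSequential and sumConcurrent are stand-in names for the two variants:
func BenchmarkSumSequential(b *testing.B) {
    nums := make([]int, 1_000_000)
    for i := 0; i < b.N; i++ {
        sumSequential(nums)
    }
}
func BenchmarkSumConcurrent(b *testing.B) {
    nums := make([]int, 1_000_000)
    for i := 0; i < b.N; i++ {
        sumConcurrent(nums)
    }
}
go test -bench=. -benchmem puts the two numbers side by side.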
Exercise 10 — Reading multiple files¶
Baseline.
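A sketch of the sequential loop:
for _, path := range paths {
    data, _ := os.ReadFile(path)
    process(data)
}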
If paths holds 100 files of 1 MB each, the sequential loop takes roughly 10 s on a slow disk.
Goal. Concurrent read.
Solution.
sem := make(chan struct{}, 16) // bound for disk
var wg sync.WaitGroup
for _, path := range paths {
sem <- struct{}{}
wg.Add(1)
go func(path string) {
defer wg.Done()
defer func() { <-sem }()
data, _ := os.ReadFile(path)
process(data)
}(path)
}
wg.Wait()
16 concurrent reads (good for SSD; for HDD, lower).
But: process may be the bottleneck. Profile.
Exercise 11 — Hedging when downstream is at capacity¶
Baseline. A service hedges every request to a busy downstream. A sketch of the crude form, where downstream, Request, and Response are assumed names and every request is simply duplicated:
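func call(ctx context.Context, req Request) (Response, error) {
    results := make(chan Response, 2)
    for i := 0; i < 2; i++ { // fire the request twice, unconditionally
        go func() {
            r, _ := downstream(ctx, req)
            results <- r
        }()
    }
    return <-results, nil // take whichever copy answers first
}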
Downstream sees ~1.8x traffic. It is already at 90% capacity. Hedging pushes it past 100%: queues build up and latency rises everywhere.
Goal. Conditional hedging.
Solution. Hedge conditionally (sketched after this list). Only hedge when:
- The first request has not responded within a delay threshold (e.g., the observed p95 latency).
- Downstream is not already overloaded.
- The request is read-only.
Or use a feedback-loop circuit breaker that disables hedging under load.
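A sketch of the conditional form; hedgeDelay, downstreamOverloaded, and req.ReadOnly are assumed helpers:
func call(ctx context.Context, req Request) (Response, error) {
    first := make(chan Response, 1)
    go func() {
        r, _ := downstream(ctx, req)
        first <- r
    }()
    select {
    case r := <-first:
        return r, nil // answered quickly; no hedge issued
    case <-time.After(hedgeDelay): // e.g., the observed p95 latency
    }
    if downstreamOverloaded() || !req.ReadOnly {
        return <-first, nil // do not add load; wait for the original
    }
    hedge := make(chan Response, 1)
    go func() {
        r, _ := downstream(ctx, req)
        hedge <- r
    }()
    select {
    case r := <-first:
        return r, nil
    case r := <-hedge:
        return r, nil
    }
}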
Exercise 12 — Premature async¶
Baseline.
type Result struct {
out chan int
}
func compute(req Request) *Result {
r := &Result{out: make(chan int, 1)}
go func() {
r.out <- doWork(req)
}()
return r
}
// Caller:
result := compute(req)
v := <-result.out // blocks immediately
The caller waits anyway. Channel + goroutine is pure overhead.
Goal. Remove async.
Solution.
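The synchronous version:
func compute(req Request) int {
    return doWork(req)
}
// Caller:
v := compute(req)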
If you ever need to add async for a real reason (parallel sub-operations, timeout enforcement), add it then.
Exercise 13 — Worker pool that hides errors¶
Baseline.
for i := 0; i < workers; i++ {
go func() {
for j := range jobs {
err := process(j)
if err != nil {
log.Println(err) // logged, then forgotten
}
}
}()
}
Goal. Propagate errors.
Solution. Use errgroup:
g, ctx := errgroup.WithContext(ctx)
g.SetLimit(workers)
for j := range jobs {
    j := j
    g.Go(func() error {
        return process(ctx, j) // the error is returned, not just logged
    })
}
if err := g.Wait(); err != nil {
    return err // the first failure reaches the caller
}
The first error cancels ctx; workers that honour ctx stop early, and Wait returns that error.
Exercise 14 — Locking too broadly¶
Baseline.
type Cache struct {
mu sync.Mutex
data map[string][]Item
}
func (c *Cache) Search(query string) []Item {
c.mu.Lock()
defer c.mu.Unlock()
// ... search ...
return matching
}
A long search holds the lock; other readers wait.
Goal. Reduce critical section.
Solution.
// mu is now a sync.RWMutex.
func (c *Cache) Search(query string) []Item {
    c.mu.RLock()
    snapshot := c.data // capture only the map reference
    c.mu.RUnlock()
    // Search the snapshot without holding the lock. Safe only if writers
    // replace the map wholesale rather than mutating it in place.
    return search(snapshot, query)
}
Hold the lock only long enough to capture the snapshot; note that the field must become a sync.RWMutex for RLock to exist, and the no-mutation assumption means writers must publish a fresh map rather than edit the shared one. Or drop the lock on the read path entirely with atomic.Pointer, as sketched below.
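A sketch of the copy-on-write variant; readers never block, and a writer mutex plus atomic.Pointer (Go 1.19+) keeps publication safe:
type Cache struct {
    mu   sync.Mutex                        // serialises writers only
    data atomic.Pointer[map[string][]Item] // readers load without locking
}
func NewCache() *Cache {
    c := &Cache{}
    empty := map[string][]Item{}
    c.data.Store(&empty)
    return c
}
func (c *Cache) Search(query string) []Item {
    return search(*c.data.Load(), query) // lock-free read of the current map
}
func (c *Cache) Set(key string, items []Item) {
    c.mu.Lock()
    defer c.mu.Unlock()
    old := *c.data.Load()
    next := make(map[string][]Item, len(old)+1)
    for k, v := range old { // copy-on-write: never mutate a published map
        next[k] = v
    }
    next[key] = items
    c.data.Store(&next)
}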
Exercise 15 — Sequential where parallel saves seconds¶
Baseline.
func loadProfile(userID string) Profile {
user := loadUser(userID) // 50 ms
prefs := loadPrefs(userID) // 50 ms
history := loadHistory(userID) // 50 ms
return assemble(user, prefs, history)
}
Total: 150 ms.
Goal. Parallelise.
Solution.
func loadProfile(ctx context.Context, userID string) (Profile, error) {
var (
user User
prefs Prefs
history History
)
g, ctx := errgroup.WithContext(ctx)
g.Go(func() error {
var err error
user, err = loadUser(ctx, userID)
return err
})
g.Go(func() error {
var err error
prefs, err = loadPrefs(ctx, userID)
return err
})
g.Go(func() error {
var err error
history, err = loadHistory(ctx, userID)
return err
})
if err := g.Wait(); err != nil {
return Profile{}, err
}
return assemble(user, prefs, history), nil
}
Total: ~50 ms (max, not sum). 3x speedup.
This is concurrency done right — measurable benefit, structured, with proper error propagation.
Closing¶
Optimisation in this space is often about removing unnecessary concurrency. Patterns to recognise:
- Trivial work in goroutines. Just iterate.
- Concurrency serialised by a single lock. Restructure or remove.
- Async APIs callers wait for synchronously. Just call.
- Sub-spawning inside framework goroutines. Use a managed worker pool.
- Unbounded goroutine creation. Always bound.
- CPU pool sized for I/O work. Match to actual workload.
- Hedging on overloaded downstream. Conditional.
- Concurrent search where tiered is better. Try cheap first.
Concurrency-done-right patterns:
- Parallel I/O via errgroup. Genuine speedup.
- CPU work split into NumCPU goroutines. Parallelism on multi-core.
- Worker pool for unbounded input, bounded by downstream capacity.
- singleflight for deduping concurrent requests.
- Atomics for hot counters; sharding for very hot counters.
The discipline: every go statement justifies itself. Measure before and after. Default to sequential. Remove concurrency that does not pay.