Mutex Copying — Optimization Scenarios

Nine scenarios where code is optimised to reduce mutex contention without changing semantics. Each scenario presents a starting point, an analysis of the bottleneck, and an optimised solution.

The focus here is on avoiding contention; the copy-prevention rules from earlier files apply to every refactored version (and many of these refactors incidentally remove copy hazards by switching to pointer types and lock-free primitives).


Scenario 1: Counter under heavy contention

Starting point

type Counter struct {
    mu sync.Mutex
    n  int64
}

func (c *Counter) Inc() {
    c.mu.Lock()
    c.n++
    c.mu.Unlock()
}

func (c *Counter) Load() int64 {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.n
}

Bottleneck

Every Inc and Load acquires the mutex. At high throughput (millions of ops/sec), the lock itself becomes the bottleneck: a CPU profile shows >50% of time in Lock/Unlock.

Optimisation 1: Use atomic.Int64

type Counter struct {
    n atomic.Int64
}

func (c *Counter) Inc()       { c.n.Add(1) }
func (c *Counter) Load() int64 { return c.n.Load() }

Atomic operations cost ~5-10 ns; an uncontended mutex fast path costs ~20 ns, and contention adds far more. The atomic is also branchless. Throughput typically improves 2-3x in micro-benchmarks.

Optimisation 2: Sharded counter for extreme throughput

For >10M Inc/sec across many cores:

type Counter struct {
    shards [numShards]paddedAtomic
}

type paddedAtomic struct {
    n atomic.Int64
    _ [56]byte // pad to 64-byte cache line
}

const numShards = 64

func (c *Counter) Inc() {
    // fastrand stands in for any cheap per-goroutine random source
    // (e.g. rand.Uint32 from math/rand/v2); only shard spread matters.
    idx := fastrand() % numShards
    c.shards[idx].n.Add(1)
}

func (c *Counter) Load() int64 {
    var sum int64
    for i := range c.shards {
        sum += c.shards[i].n.Load()
    }
    return sum
}

Cache-line padding eliminates false sharing between shards. Inc distributes across shards; Load sums them.

Trade-off: Load is O(N) in the shard count. For low-frequency reads this is fine.

Measurement

Implementation    1 goroutine    16 goroutines    256 goroutines
Mutex             60M ops/s      12M ops/s        8M ops/s
Atomic            200M ops/s     100M ops/s       70M ops/s
Sharded atomic    200M ops/s     1.6B ops/s       2.5B ops/s

(Approximate; depends on hardware. Run your own benchmarks.)
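
A minimal harness for producing numbers like these, assuming each implementation exposes the same Inc method (the benchmark name is illustrative). The -cpu flag approximates the goroutine counts in the table above:

// Run with: go test -bench CounterInc -cpu 1,16,256
func BenchmarkCounterInc(b *testing.B) {
    var c Counter // swap in the mutex, atomic, or sharded version
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            c.Inc()
        }
    })
}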


Scenario 2: Read-mostly map

Starting point

type Lookup struct {
    mu   sync.Mutex
    data map[string]int
}

func (l *Lookup) Get(k string) (int, bool) {
    l.mu.Lock()
    defer l.mu.Unlock()
    v, ok := l.data[k]
    return v, ok
}

func (l *Lookup) Set(k string, v int) {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.data[k] = v
}

Bottleneck

99% of operations are Get. The mutex serialises reads, throttling read throughput.

Optimisation 1: switch to RWMutex

type Lookup struct {
    mu   sync.RWMutex
    data map[string]int
}

func (l *Lookup) Get(k string) (int, bool) {
    l.mu.RLock()
    defer l.mu.RUnlock()
    v, ok := l.data[k]
    return v, ok
}

func (l *Lookup) Set(k string, v int) {
    l.mu.Lock()
    defer l.mu.Unlock()
    l.data[k] = v
}

Reads can proceed concurrently. Throughput improves significantly when readers were the bottleneck.

Optimisation 2: COW with atomic.Pointer

For workloads with very rare writes (e.g., minute-scale config updates):

type Lookup struct {
    data atomic.Pointer[map[string]int]
    mu   sync.Mutex // protects writers from racing each other
}

func NewLookup() *Lookup {
    l := &Lookup{}
    m := make(map[string]int)
    l.data.Store(&m) // guarantee Load never returns nil
    return l
}

func (l *Lookup) Get(k string) (int, bool) {
    m := *l.data.Load()
    v, ok := m[k]
    return v, ok
}

func (l *Lookup) Set(k string, v int) {
    l.mu.Lock()
    defer l.mu.Unlock()
    old := *l.data.Load()
    n := make(map[string]int, len(old)+1)
    for kk, vv := range old {
        n[kk] = vv
    }
    n[k] = v
    l.data.Store(&n)
}

Reads are entirely lock-free (one atomic load). Writes are O(n) but rare. Excellent for cache/config use cases.

Trade-off table

Pattern              Read cost   Write cost   Memory            When to use
Mutex                Lock        Lock         small             Balanced
RWMutex              RLock       Lock         small             Read-dominant
atomic.Pointer COW   1 atomic    O(n) copy    2x during write   Very read-dominant, infrequent writes

Scenario 3: Reducing critical section size

Starting point

type RequestLog struct {
    mu      sync.Mutex
    entries []Entry
}

func (r *RequestLog) Record(req Request) {
    r.mu.Lock()
    defer r.mu.Unlock()
    entry := Entry{
        Timestamp: time.Now(),
        ID:        req.ID,
        Body:      processBody(req.Body), // slow
    }
    r.entries = append(r.entries, entry)
    persistToDisk(entry) // VERY slow
}

Bottleneck

Record holds the mutex through both processBody (CPU-heavy) and persistToDisk (I/O-bound, milliseconds). Contention piles up under any concurrent load.

Optimisation: move work outside the critical section

func (r *RequestLog) Record(req Request) {
    entry := Entry{
        Timestamp: time.Now(),
        ID:        req.ID,
        Body:      processBody(req.Body), // OUTSIDE lock
    }
    r.mu.Lock()
    r.entries = append(r.entries, entry)
    r.mu.Unlock()
    persistToDisk(entry) // OUTSIDE lock
}

Lock-held time drops from milliseconds to nanoseconds. Throughput rises proportionally.

Caveats

The order of writes to r.entries and persistToDisk differs between concurrent callers. If you need consistency (entries persisted in the same order they appear in r.entries), additional design is needed (a serialised writer goroutine reading from a channel, for example).
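
A minimal sketch of that serialised-writer design (channel capacity and startup are illustrative assumptions):

type RequestLog struct {
    mu      sync.Mutex
    entries []Entry
    ch      chan Entry // drained by exactly one writer goroutine
}

// run is started once, e.g. in a constructor: go r.run()
func (r *RequestLog) run() {
    for entry := range r.ch {
        r.mu.Lock()
        r.entries = append(r.entries, entry)
        r.mu.Unlock()
        persistToDisk(entry) // same order as r.entries, by construction
    }
}

func (r *RequestLog) Record(req Request) {
    r.ch <- Entry{
        Timestamp: time.Now(),
        ID:        req.ID,
        Body:      processBody(req.Body), // still outside any lock
    }
}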

For most "log records to disk" cases, order is not critical and the optimised version is correct.


Scenario 4: Batched updates

Starting point

type MetricCollector struct {
    mu     sync.Mutex
    counts map[string]int64
}

func (m *MetricCollector) Inc(name string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.counts[name]++
}

In a hot path that calls Inc thousands of times per request, this is a contention disaster.

Optimisation: batch increments per goroutine

type MetricCollector struct {
    mu     sync.Mutex
    counts map[string]int64
    pool   sync.Pool
}

type batch map[string]int64

func (m *MetricCollector) NewBatch() batch {
    if b := m.pool.Get(); b != nil {
        return b.(batch)
    }
    return make(batch)
}

func (b batch) Inc(name string) {
    b[name]++
}

func (m *MetricCollector) Flush(b batch) {
    m.mu.Lock()
    for k, v := range b {
        m.counts[k] += v
        delete(b, k)
    }
    m.mu.Unlock()
    m.pool.Put(b)
}

Usage:

b := collector.NewBatch()
for _, item := range items {
    b.Inc(item.Category)
}
collector.Flush(b)

The hot path inside the loop is map-modification only (no locking). Flush acquires the lock once for the entire batch.

Trade-off: counts in m.counts are eventually consistent (only visible after Flush). Most metrics workflows tolerate this.

When batching helps

  • Per-goroutine work generates many small updates.
  • Reads of the global state are infrequent or accept eventual consistency.
  • Goroutines have a natural "batch boundary" (end of request, end of file, etc.).

Scenario 5: Lock-free shutdown signal

Starting point

type Service struct {
    mu     sync.Mutex
    closed bool
}

func (s *Service) IsClosed() bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    return s.closed
}

func (s *Service) Close() {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.closed = true
}

Every "is the service still running?" check acquires the mutex. In a tight loop, this is wasted.

Optimisation: atomic.Bool

type Service struct {
    closed atomic.Bool
}

func (s *Service) IsClosed() bool { return s.closed.Load() }
func (s *Service) Close()         { s.closed.Store(true) }

Load is ~1 ns; the mutex version was ~20 ns. Both are correct.

Optimisation 2: context.Context

For cancellation-style signals, a context.Context is even better — it composes with channel-select patterns and is the idiomatic Go approach.

type Service struct {
    ctx    context.Context
    cancel context.CancelFunc
}

func NewService() *Service {
    ctx, cancel := context.WithCancel(context.Background())
    return &Service{ctx: ctx, cancel: cancel}
}

func (s *Service) Close() { s.cancel() }
func (s *Service) Done() <-chan struct{} { return s.ctx.Done() }

Goroutines select on s.Done() and handle cancellation natively.
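
For example, a worker loop might look like this (Job, the jobs channel, and handle are illustrative placeholders, not part of the type above):

func (s *Service) worker(jobs <-chan Job) {
    for {
        select {
        case <-s.Done():
            return // cancelled; exit promptly
        case j := <-jobs:
            handle(j)
        }
    }
}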


Scenario 6: Sharded session store

Starting point

type Sessions struct {
    mu       sync.RWMutex
    sessions map[string]*Session
}

At 100k sessions and high concurrency, even RWMutex contention becomes significant.

Optimisation: shard by session ID hash

type Sessions struct {
    shards [256]*shard
}

type shard struct {
    mu       sync.RWMutex
    sessions map[string]*Session
}

func NewSessions() *Sessions {
    s := &Sessions{}
    for i := range s.shards {
        s.shards[i] = &shard{sessions: make(map[string]*Session)}
    }
    return s
}

func (s *Sessions) shardFor(id string) *shard {
    h := fnv.New32a()
    h.Write([]byte(id))
    return s.shards[h.Sum32()%256]
}

func (s *Sessions) Get(id string) (*Session, bool) {
    sh := s.shardFor(id)
    sh.mu.RLock()
    defer sh.mu.RUnlock()
    sess, ok := sh.sessions[id]
    return sess, ok
}

func (s *Sessions) Set(id string, sess *Session) {
    sh := s.shardFor(id)
    sh.mu.Lock()
    defer sh.mu.Unlock()
    sh.sessions[id] = sess
}

Contention is divided across 256 mutexes. Throughput scales nearly linearly with shard count up to the goroutine count.

Choosing shard count

  • Small (<10) shards: easy reasoning, minimal memory overhead, modest concurrency support.
  • Medium (32-128) shards: good for most services.
  • Large (256-1024) shards: heavy concurrency, multi-core machines.

Power-of-2 shard counts allow & masking instead of % (faster).
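
For example, shardFor above could mask instead of divide (the shardCount constant is introduced here for illustration):

const shardCount = 256 // must stay a power of two for the mask below

func (s *Sessions) shardFor(id string) *shard {
    h := fnv.New32a()
    h.Write([]byte(id))
    return s.shards[h.Sum32()&(shardCount-1)] // & mask replaces the % divide
}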


Scenario 7: Object pool to avoid allocation under lock

Starting point

type Worker struct {
    mu     sync.Mutex
    buffer []byte
}

func (w *Worker) Process(data []byte) []byte {
    w.mu.Lock()
    defer w.mu.Unlock()
    w.buffer = w.buffer[:0]
    w.buffer = append(w.buffer, data...)
    return process(w.buffer)
}

The single buffer is shared. All Process calls serialise.

Optimisation: use sync.Pool

var bufferPool = sync.Pool{
    New: func() any {
        b := make([]byte, 0, 4096)
        return &b
    },
}

func Process(data []byte) []byte {
    b := bufferPool.Get().(*[]byte)
    defer bufferPool.Put(b)
    *b = (*b)[:0]
    *b = append(*b, data...)
    return process(*b)
}

No mutex needed. Each goroutine pulls a buffer from the pool; the pool internally uses per-P buffers and atomic operations for hot paths.

Trade-off: result needs to be copied (or used before the buffer returns to the pool); the pool's buffers are not stable references.
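
A sketch of the copying variant, assuming process returns a slice that aliases its input (that aliasing is an assumption here, not stated above):

func ProcessCopy(data []byte) []byte {
    b := bufferPool.Get().(*[]byte)
    defer bufferPool.Put(b)
    *b = append((*b)[:0], data...)
    out := process(*b)                 // may alias *b (assumed)
    return append([]byte(nil), out...) // copy made before Put runs
}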


Scenario 8: Avoiding RWMutex inversion

Starting point

type Cache struct {
    mu   sync.RWMutex
    data map[string]Result
}

func (c *Cache) GetOrCompute(k string, compute func() Result) Result {
    c.mu.RLock()
    if v, ok := c.data[k]; ok {
        c.mu.RUnlock()
        return v
    }
    c.mu.RUnlock()
    c.mu.Lock()
    defer c.mu.Unlock()
    // re-check under write lock
    if v, ok := c.data[k]; ok {
        return v
    }
    v := compute() // SLOW — held under write lock
    c.data[k] = v
    return v
}

compute runs while the write lock is held; if it is slow, every reader stalls behind it.

Optimisation: singleflight pattern

import "golang.org/x/sync/singleflight"

type Cache struct {
    mu   sync.RWMutex
    data map[string]Result
    sf   singleflight.Group
}

func (c *Cache) GetOrCompute(k string, compute func() Result) Result {
    c.mu.RLock()
    if v, ok := c.data[k]; ok {
        c.mu.RUnlock()
        return v
    }
    c.mu.RUnlock()

    v, _, _ := c.sf.Do(k, func() (any, error) {
        result := compute()
        c.mu.Lock()
        c.data[k] = result
        c.mu.Unlock()
        return result, nil
    })
    return v.(Result)
}

singleflight.Group ensures only one goroutine computes per key; others wait for the result. Crucially, the write lock is held only briefly to insert the result, not during compute.

Readers proceed concurrently while compute is in flight (they get RLock).


Scenario 9: Avoiding lock-on-read for immutable data

Starting point

type Config struct {
    mu      sync.RWMutex
    timeout time.Duration
    // ... many more fields ...
}

func (c *Config) GetTimeout() time.Duration {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.timeout
}

If c.timeout is set once at startup and never changes, the RLock is wasted.

Optimisation: separate immutable from mutable

type Config struct {
    // Set once during startup. No lock required.
    timeout time.Duration
    addr    string

    // Mutable; protected by mu.
    mu      sync.RWMutex
    refresh time.Time
    stats   Stats
}

func (c *Config) GetTimeout() time.Duration { return c.timeout } // no lock

Document the invariant: fields above the mu line are immutable after construction; fields below require the lock. Encode the contract in the type definition.

Strict alternative: split the type

type ImmutableConfig struct {
    Timeout time.Duration
    Addr    string
}

type MutableState struct {
    mu      sync.RWMutex
    refresh time.Time
    stats   Stats
}

type Service struct {
    Config *ImmutableConfig
    State  *MutableState
}

The split makes the immutability mechanical: Config has no mutex; you cannot accidentally read it under a lock that doesn't exist.


General optimisation principles

  1. Measure first. Profile mutex contention before optimising (a minimal profiling setup is sketched after this list). Don't optimise locks that don't appear in the profile.

  2. Reduce hold time. Move work outside the critical section. This is almost always the highest-impact change.

  3. Reduce lock frequency. Batch operations. Use goroutine-local accumulators flushed periodically.

  4. Increase concurrency. Shard the data. Use RWMutex if reads dominate.

  5. Eliminate the mutex. Use atomics, COW, or actor model.

  6. Avoid premature pessimisation. Don't shard a lock that isn't contended. Don't switch to atomic.Pointer for a balanced workload. Measure.

  7. Preserve correctness. Every optimisation must preserve the program's correctness contract. Many copy-related bugs in the wild were introduced during attempts to optimise.

  8. Document the change. Future maintainers should know why the code uses sharding/COW/atomics. Without the context, they may "simplify" back to the slow version.

  9. Watch out for copy bugs. Many of these optimisations involve switching to pointer types or lock-free structures. The same vet-driven discipline applies — noCopy markers, pointer receivers, vet in CI.

  10. Avoid distributed locks. Use in-process locking when in-process suffices. Distributed locking introduces a separate class of failure modes.
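
A minimal way to get a mutex contention profile (the sampling rate and address here are illustrative choices):

import (
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers
    "runtime"
)

func init() {
    // Sample roughly 1 in 5 mutex contention events.
    runtime.SetMutexProfileFraction(5)
    go http.ListenAndServe("localhost:6060", nil)
}

Inspect with go tool pprof http://localhost:6060/debug/pprof/mutex, or pass -mutexprofile mutex.out to go test when benchmarking.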


Summary

Optimisation table:

Original                       Optimised                            Trigger
Mutex-protected counter        atomic.Int64                         Profile shows mutex hot
Mutex-protected map            RWMutex                              Read-heavy workload
RWMutex map                    atomic.Pointer COW                   Very read-heavy, rare writes
Single mutex                   Sharded mutexes                      Contention at high concurrency
Wide critical section          Narrow section + outside-lock work   Slow operations under lock
Per-operation lock             Per-batch lock                       High-frequency updates
Lock-protected flag            atomic.Bool or context.Context       Lock-on-read in hot path
Lock-protected buffer          sync.Pool                            Object allocation under lock
Recompute under lock on miss   singleflight                         Slow compute under lock
Lock on immutable fields       No lock (split types or document)    Fields never change

Apply these in order of profile-driven evidence. Always benchmark before and after. Always verify with vet and -race.