Mutex Copying — Optimization Scenarios¶
Nine scenarios in which code is optimised to reduce mutex contention without changing semantics. Each scenario presents a starting point, an analysis of the bottleneck, and an optimised solution.
The focus here is on avoiding contention; the copy-prevention rules from earlier files apply to every refactored version (and many of these refactors incidentally remove copy hazards by switching to pointer types and lock-free primitives).
Scenario 1: Counter under heavy contention¶
Starting point¶
type Counter struct {
mu sync.Mutex
n int64
}
func (c *Counter) Inc() {
c.mu.Lock()
c.n++
c.mu.Unlock()
}
func (c *Counter) Load() int64 {
c.mu.Lock()
defer c.mu.Unlock()
return c.n
}
Bottleneck¶
Every Inc and Load acquires the mutex. At high throughput (millions of ops/sec) the lock serialises all callers and its overhead dominates; a CPU profile shows more than 50% of time spent in Lock/Unlock.
Optimisation 1: Use atomic.Int64¶
type Counter struct {
n atomic.Int64
}
func (c *Counter) Inc() { c.n.Add(1) }
func (c *Counter) Load() int64 { return c.n.Load() }
An atomic increment costs roughly 5-10 ns; an uncontended mutex Lock/Unlock pair costs roughly 20 ns, and far more under contention as goroutines queue. The atomic increment is also a single branch-free instruction on most architectures. Throughput typically improves 2-3x in micro-benchmarks, and more under heavy contention.
Optimisation 2: Sharded counter for extreme throughput¶
For >10M Inc/sec across many cores:
type Counter struct {
shards [numShards]paddedAtomic
}
type paddedAtomic struct {
n atomic.Int64
_ [56]byte // pad to 64-byte cache line
}
const numShards = 64
func (c *Counter) Inc() {
    // fastrand stands in for any cheap per-goroutine random source (e.g.
    // rand.Uint32 from math/rand/v2); with a power-of-two shard count this
    // could equally be written fastrand() & (numShards-1).
    idx := fastrand() % numShards
c.shards[idx].n.Add(1)
}
func (c *Counter) Load() int64 {
var sum int64
for i := range c.shards {
sum += c.shards[i].n.Load()
}
return sum
}
Cache-line padding eliminates false sharing between shards. Inc distributes across shards; Load sums them.
Trade-off: Load is O(N) in the shard count and sums values that may be changing concurrently, so it is an approximate rather than atomic snapshot. For low-frequency reads this is fine.
Measurement¶
| Implementation | 1 goroutine | 16 goroutines | 256 goroutines |
|---|---|---|---|
| Mutex | 60M ops/s | 12M ops/s | 8M ops/s |
| Atomic | 200M ops/s | 100M ops/s | 70M ops/s |
| Sharded atomic | 200M ops/s | 1.6B ops/s | 2.5B ops/s |
(Approximate; depends on hardware. Run your own benchmarks.)
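A minimal benchmark sketch for generating your own numbers (assumed to live in a _test.go file next to whichever Counter implementation is under test):
import "testing"

// RunParallel spins up GOMAXPROCS goroutines, each hammering Inc.
// Compare implementations with, for example:
//   go test -bench=CounterInc -cpu=1,16,256
func BenchmarkCounterInc(b *testing.B) {
    var c Counter
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            c.Inc()
        }
    })
}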
Scenario 2: Read-mostly map¶
Starting point¶
type Lookup struct {
mu sync.Mutex
data map[string]int
}
func (l *Lookup) Get(k string) (int, bool) {
l.mu.Lock()
defer l.mu.Unlock()
v, ok := l.data[k]
return v, ok
}
func (l *Lookup) Set(k string, v int) {
l.mu.Lock()
defer l.mu.Unlock()
l.data[k] = v
}
Bottleneck¶
99% of operations are Get. The mutex serialises reads, throttling read throughput.
Optimisation: switch to RWMutex¶
type Lookup struct {
mu sync.RWMutex
data map[string]int
}
func (l *Lookup) Get(k string) (int, bool) {
l.mu.RLock()
defer l.mu.RUnlock()
v, ok := l.data[k]
return v, ok
}
func (l *Lookup) Set(k string, v int) {
l.mu.Lock()
defer l.mu.Unlock()
l.data[k] = v
}
Reads now proceed concurrently, so throughput improves significantly whenever reader serialisation was the bottleneck.
Optimisation 2: COW with atomic.Pointer¶
For workloads with very rare writes (e.g., minute-scale config updates):
type Lookup struct {
    data atomic.Pointer[map[string]int]
    mu   sync.Mutex // protects writers from racing each other
}
func NewLookup() *Lookup {
    l := &Lookup{}
    m := make(map[string]int)
    l.data.Store(&m) // start with an empty map so Get never dereferences nil
    return l
}
func (l *Lookup) Get(k string) (int, bool) {
m := *l.data.Load()
v, ok := m[k]
return v, ok
}
func (l *Lookup) Set(k string, v int) {
l.mu.Lock()
defer l.mu.Unlock()
old := *l.data.Load()
n := make(map[string]int, len(old)+1)
for kk, vv := range old {
n[kk] = vv
}
n[k] = v
l.data.Store(&n)
}
Reads are entirely lock-free (one atomic load). Writes are O(n) but rare. Excellent for cache/config use cases.
Trade-off table¶
| Pattern | Read cost | Write cost | Memory | When to use |
|---|---|---|---|---|
| Mutex | Lock | Lock | small | Balanced |
| RWMutex | RLock | Lock | small | Read-dominant |
| atomic.Pointer COW | 1 atomic | O(n) copy | 2x during write | Read-very-dominant, infrequent writes |
Scenario 3: Reducing critical section size¶
Starting point¶
type RequestLog struct {
mu sync.Mutex
entries []Entry
}
func (r *RequestLog) Record(req Request) {
r.mu.Lock()
defer r.mu.Unlock()
entry := Entry{
Timestamp: time.Now(),
ID: req.ID,
Body: processBody(req.Body), // slow
}
r.entries = append(r.entries, entry)
persistToDisk(entry) // VERY slow
}
Bottleneck¶
Record holds the mutex through both processBody (which is CPU-heavy) and persistToDisk (I/O-bound, milliseconds). Contention piles up under any concurrent load.
Optimisation: move work outside the critical section¶
func (r *RequestLog) Record(req Request) {
entry := Entry{
Timestamp: time.Now(),
ID: req.ID,
Body: processBody(req.Body), // OUTSIDE lock
}
r.mu.Lock()
r.entries = append(r.entries, entry)
r.mu.Unlock()
persistToDisk(entry) // OUTSIDE lock
}
Lock-held time drops from milliseconds to the tens of nanoseconds a single append costs. Throughput rises proportionally.
Caveats¶
The order of writes to r.entries and persistToDisk differs between concurrent callers. If you need consistency (entries persisted in the same order they appear in r.entries), additional design is needed (a serialised writer goroutine reading from a channel, for example).
For most "log records to disk" cases, order is not critical and the optimised version is correct.
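Where persisted order must match the order of r.entries, a minimal sketch of the serialised-writer design mentioned above (reusing the Entry, Request, processBody and persistToDisk placeholders):
type RequestLog struct {
    mu      sync.Mutex
    entries []Entry
    ch      chan Entry // drained by a single writer goroutine
}
func NewRequestLog() *RequestLog {
    r := &RequestLog{ch: make(chan Entry, 1024)}
    go func() {
        for e := range r.ch {
            persistToDisk(e) // one writer: persisted order == channel order
        }
    }()
    return r
}
func (r *RequestLog) Record(req Request) {
    entry := Entry{Timestamp: time.Now(), ID: req.ID, Body: processBody(req.Body)}
    r.mu.Lock()
    r.entries = append(r.entries, entry)
    r.ch <- entry // sent under the lock so channel order matches entries order
    r.mu.Unlock()
}
The send happens under the lock to keep the two orders aligned; if the channel buffer fills, Record blocks, so size the buffer for the expected burst rate.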
Scenario 4: Batched updates¶
Starting point¶
type MetricCollector struct {
mu sync.Mutex
counts map[string]int64
}
func (m *MetricCollector) Inc(name string) {
m.mu.Lock()
defer m.mu.Unlock()
m.counts[name]++
}
In a hot path that calls Inc thousands of times per request, this is a contention disaster.
Optimisation: batch increments per goroutine¶
type MetricCollector struct {
mu sync.Mutex
counts map[string]int64
pool sync.Pool
}
type batch map[string]int64
func (m *MetricCollector) NewBatch() batch {
if b := m.pool.Get(); b != nil {
return b.(batch)
}
return make(batch)
}
func (b batch) Inc(name string) {
b[name]++
}
func (m *MetricCollector) Flush(b batch) {
m.mu.Lock()
for k, v := range b {
m.counts[k] += v
delete(b, k)
}
m.mu.Unlock()
m.pool.Put(b)
}
Usage:
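A sketch of the intended call pattern (the handler, Request.Items and processItem are hypothetical):
// Hypothetical request handler showing the batch lifecycle.
func handleRequest(m *MetricCollector, req Request) {
    b := m.NewBatch()
    defer m.Flush(b) // one lock acquisition per request
    for _, item := range req.Items {
        processItem(item)
        b.Inc("items_processed") // plain map write, no lock
    }
}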
The hot path inside the loop is map-modification only (no locking). Flush acquires the lock once for the entire batch.
Trade-off: counts in m.counts are eventually consistent (only visible after Flush). Most metrics workflows tolerate this.
When batching helps¶
- Per-goroutine work generates many small updates.
- Reads of the global state are infrequent or accept eventual consistency.
- Goroutines have a natural "batch boundary" (end of request, end of file, etc.).
Scenario 5: Lock-free shutdown signal¶
Starting point¶
type Service struct {
mu sync.Mutex
closed bool
}
func (s *Service) IsClosed() bool {
s.mu.Lock()
defer s.mu.Unlock()
return s.closed
}
func (s *Service) Close() {
s.mu.Lock()
defer s.mu.Unlock()
s.closed = true
}
Every "is the service still running?" check acquires the mutex. In a tight loop, this is wasted.
Optimisation: atomic.Bool¶
type Service struct {
closed atomic.Bool
}
func (s *Service) IsClosed() bool { return s.closed.Load() }
func (s *Service) Close() { s.closed.Store(true) }
An atomic Load costs about 1 ns versus roughly 20 ns for the mutex round-trip. Both versions are correct; the atomic is simply cheaper.
Optimisation 2: context.Context¶
For cancellation-style signals, a context.Context is even better — it composes with channel-select patterns and is the idiomatic Go approach.
type Service struct {
ctx context.Context
cancel context.CancelFunc
}
func NewService() *Service {
ctx, cancel := context.WithCancel(context.Background())
return &Service{ctx: ctx, cancel: cancel}
}
func (s *Service) Close() { s.cancel() }
func (s *Service) Done() <-chan struct{} { return s.ctx.Done() }
Goroutines select on s.Done() and handle cancellation natively.
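For example, a minimal worker sketch (Job and handle are hypothetical stand-ins for the service's real work):
func (s *Service) worker(jobs <-chan Job) {
    for {
        select {
        case <-s.Done():
            return // Close was called; exit promptly
        case j := <-jobs:
            handle(j)
        }
    }
}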
Scenario 6: Sharded session store¶
Starting point¶
At 100k sessions and high concurrency, even RWMutex contention becomes significant.
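As a reference point, a minimal sketch of the unsharded version (Session is the assumed session type):
// One RWMutex guards the entire map, so every Get and Set on any
// session contends on the same lock.
type Sessions struct {
    mu       sync.RWMutex
    sessions map[string]*Session
}
func (s *Sessions) Get(id string) (*Session, bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    sess, ok := s.sessions[id]
    return sess, ok
}
func (s *Sessions) Set(id string, sess *Session) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.sessions[id] = sess
}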
Optimisation: shard by session ID hash¶
type Sessions struct {
shards [256]*shard
}
type shard struct {
mu sync.RWMutex
sessions map[string]*Session
}
func NewSessions() *Sessions {
s := &Sessions{}
for i := range s.shards {
s.shards[i] = &shard{sessions: make(map[string]*Session)}
}
return s
}
func (s *Sessions) shardFor(id string) *shard {
h := fnv.New32a()
h.Write([]byte(id))
return s.shards[h.Sum32()%256]
}
func (s *Sessions) Get(id string) (*Session, bool) {
sh := s.shardFor(id)
sh.mu.RLock()
defer sh.mu.RUnlock()
sess, ok := sh.sessions[id]
return sess, ok
}
func (s *Sessions) Set(id string, sess *Session) {
sh := s.shardFor(id)
sh.mu.Lock()
defer sh.mu.Unlock()
sh.sessions[id] = sess
}
Contention is divided across 256 mutexes. Throughput scales nearly linearly with shard count, up to roughly the number of concurrently contending goroutines.
Choosing shard count¶
- Small (<10) shards: easy reasoning, minimal memory overhead, modest concurrency support.
- Medium (32-128) shards: good for most services.
- Large (256-1024) shards: heavy concurrency, multi-core machines.
Power-of-2 shard counts allow & masking instead of % (faster).
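For example, with a power-of-two shard count the index can be computed with a mask:
const numShards = 256 // power of two

// shardIndex avoids the integer division behind %; the identity
// h % numShards == h & (numShards-1) holds only for powers of two.
func shardIndex(h uint32) uint32 {
    return h & (numShards - 1)
}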
Scenario 7: Object pool to avoid allocation under lock¶
Starting point¶
type Worker struct {
mu sync.Mutex
buffer []byte
}
func (w *Worker) Process(data []byte) []byte {
w.mu.Lock()
defer w.mu.Unlock()
w.buffer = w.buffer[:0]
w.buffer = append(w.buffer, data...)
return process(w.buffer)
}
The single buffer is shared. All Process calls serialise.
Optimisation: use sync.Pool¶
var bufferPool = sync.Pool{
New: func() any {
b := make([]byte, 0, 4096)
return &b
},
}
func Process(data []byte) []byte {
b := bufferPool.Get().(*[]byte)
defer bufferPool.Put(b)
*b = (*b)[:0]
*b = append(*b, data...)
return process(*b)
}
No mutex needed. Each goroutine pulls a buffer from the pool; the pool internally uses per-P buffers and atomic operations for hot paths.
Trade-off: if the result returned by process aliases the buffer, it must be copied (or fully consumed) before the deferred Put hands the buffer back to the pool; pooled buffers are not stable references.
Scenario 8: Avoiding RWMutex inversion¶
Starting point¶
type Cache struct {
mu sync.RWMutex
data map[string]Result
}
func (c *Cache) GetOrCompute(k string, compute func() Result) Result {
c.mu.RLock()
if v, ok := c.data[k]; ok {
c.mu.RUnlock()
return v
}
c.mu.RUnlock()
c.mu.Lock()
defer c.mu.Unlock()
// re-check under write lock
if v, ok := c.data[k]; ok {
return v
}
v := compute() // SLOW — held under write lock
c.data[k] = v
return v
}
The compute call runs while the write lock is held, so every reader waits; if compute is slow, readers stall for its full duration.
Optimisation: singleflight pattern¶
import "golang.org/x/sync/singleflight"
type Cache struct {
mu sync.RWMutex
data map[string]Result
sf singleflight.Group
}
func (c *Cache) GetOrCompute(k string, compute func() Result) Result {
c.mu.RLock()
if v, ok := c.data[k]; ok {
c.mu.RUnlock()
return v
}
c.mu.RUnlock()
v, _, _ := c.sf.Do(k, func() (any, error) {
result := compute()
c.mu.Lock()
c.data[k] = result
c.mu.Unlock()
return result, nil
})
return v.(Result)
}
singleflight.Group ensures only one goroutine computes per key; others wait for the result. Crucially, the write lock is held only briefly to insert the result, not during compute.
Readers proceed concurrently while compute is in flight (they get RLock).
Scenario 9: Avoiding lock-on-read for immutable data¶
Starting point¶
type Config struct {
mu sync.RWMutex
// ... many fields ...
}
func (c *Config) GetTimeout() time.Duration {
c.mu.RLock()
defer c.mu.RUnlock()
return c.timeout
}
If c.timeout is set once at startup and never changes, the RLock is wasted.
Optimisation: separate immutable from mutable¶
type Config struct {
// Set once during startup. No lock required.
timeout time.Duration
addr string
// Mutable; protected by mu.
mu sync.RWMutex
refresh time.Time
stats Stats
}
func (c *Config) GetTimeout() time.Duration { return c.timeout } // no lock
Document the invariant: fields above the mu line are immutable after construction; fields below require the lock. Encode the contract directly in the type definition.
Strict alternative: split the type¶
type ImmutableConfig struct {
Timeout time.Duration
Addr string
}
type MutableState struct {
mu sync.RWMutex
refresh time.Time
stats Stats
}
type Service struct {
Config *ImmutableConfig
State *MutableState
}
The split makes the immutability mechanical: ImmutableConfig carries no mutex, so there is no lock to take, nothing to copy incorrectly, and no temptation to guard fields that never change.
General optimisation principles¶
- Measure first. Profile mutex contention before optimising. Don't optimise locks that don't appear in the profile.
- Reduce hold time. Move work outside the critical section. This is almost always the highest-impact change.
- Reduce lock frequency. Batch operations. Use per-goroutine accumulators flushed periodically.
- Increase concurrency. Shard the data. Use RWMutex if reads dominate.
- Eliminate the mutex. Use atomics, copy-on-write, or an actor model.
- Avoid premature pessimisation. Don't shard a lock that isn't contended. Don't switch to atomic.Pointer for a balanced workload. Measure.
- Preserve correctness. Every optimisation must preserve the program's correctness contract. Many copy-related bugs in the wild were introduced during attempts to optimise.
- Document the change. Future maintainers should know why the code uses sharding/COW/atomics. Without that context, they may "simplify" back to the slow version.
- Watch out for copy bugs. Many of these optimisations involve switching to pointer types or lock-free structures. The same vet-driven discipline applies: noCopy markers, pointer receivers, vet in CI.
- Avoid distributed locks. Use in-process locking when in-process suffices. Distributed locking introduces a separate class of failure modes.
Summary¶
Optimisation table:
| Original | Optimised | Trigger |
|---|---|---|
| Mutex-protected counter | atomic.Int64 | Profile shows mutex hot |
| Mutex map | RWMutex map | Read-heavy workload |
| RWMutex map | atomic.Pointer COW | Very read-heavy, rare writes |
| Single mutex | Sharded mutexes | Contention at high concurrency |
| Wide critical section | Narrow critical section + outside-lock work | Slow operations under lock |
| Per-operation lock | Per-batch lock | High-frequency updates |
| Lock-protected flag | atomic.Bool or context.Context | Lock-on-read in hot path |
| Lock-protected pool | sync.Pool | Object allocation under lock |
| Cache-miss-recomputes-under-lock | singleflight | Slow compute under lock |
| Lock for immutable fields | No lock (split types or document) | Fields never change |
Apply these changes in the order the profile evidence justifies. Always benchmark before and after, and always verify with go vet and the race detector (-race).