
sync.Map — Optimize

A field guide to choosing between concurrent-map alternatives and tuning each one. Use this when you have a sync.Map in production and want to know whether it earns its keep, or when you are choosing a data structure for a new use case.


Table of Contents

  1. The Decision Tree
  2. Step 1 — Identify the Workload
  3. Step 2 — Benchmark Honestly
  4. Step 3 — Choose the Right Primitive
  5. Tuning sync.Map Itself
  6. Tuning RWMutex + map
  7. Tuning a Sharded Map
  8. When atomic.Pointer[map] Beats Both
  9. When to Add singleflight
  10. Memory and GC Considerations
  11. Replacing a Concurrent Map Entirely
  12. Checklist Before Shipping

The Decision Tree

flowchart TD
    A[Need concurrent access to keyed data] --> B{Single value, no key?}
    B -- yes --> C[atomic.Pointer]
    B -- no --> D{Read/Write ratio > 100:1?}
    D -- yes --> E{Key set stable?}
    E -- yes --> F[sync.Map or atomic.Pointer of map]
    E -- no --> G{Need Len/snapshot/order?}
    G -- yes --> H[RWMutex + map]
    G -- no --> F
    D -- no --> I{Workload uniformly distributed?}
    I -- yes --> J[sharded RWMutex + map]
    I -- no --> K{Hot keys?}
    K -- yes --> L[redesign: per-shard atomics or buffered channel]
    K -- no --> H

The tree gives the answer about 80% of the time. The remaining 20% falls into one of these cases:

  • You need singleflight semantics (one compute per key).
  • You need TTL / LRU (use a library — ristretto, golang-lru).
  • You need persistence (use Redis / BoltDB / Pebble).

Step 1 — Identify the Workload

Before optimising, characterise:

  • What is the read/write ratio? Read-heavy favours sync.Map and atomic.Pointer[map]; write-heavy favours sharded RWMutex+map.
  • How many entries at steady state? Affects amplification cost, Range cost, snapshot cost.
  • Is the key set growing, stable, or shrinking? Growing keys hurt sync.Map (more rebuilds). Shrinking keys hurt all immutable structures.
  • Are operations uniformly distributed across keys? If hot keys dominate, no map structure helps; you need a different data flow.
  • Do you need Len? sync.Map does not have it.
  • Do you need ordered iteration? None of these provide it; layer a sorted slice on top.
  • Do you need an atomic snapshot? Only RWMutex+map and atomic.Pointer[map] support it.
  • Are values comparable? CompareAndSwap requires it.
  • Are values mutable structs or immutable? Mutable values invite races inside the value; consider immutable + CAS.
  • What is the lifetime of the data structure? Long-lived structures justify more setup; short-lived favour simplicity.

Write the answers down. Then proceed.


Step 2 — Benchmark Honestly

The biggest mistake is choosing by reputation. Run your real workload through each candidate.

Skeleton

package optimize

import (
    "sync"
    "testing"
)

type API interface {
    Get(k int) (int, bool)
    Set(k int, v int)
}

func benchWorkload(b *testing.B, m API, readRatio int) {
    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        i := 0
        for pb.Next() {
            if i%100 < readRatio {
                m.Get(i % 10000)
            } else {
                m.Set(i%10000, i)
            }
            i++
        }
    })
}

func BenchmarkSyncMap_Read95(b *testing.B) {
    var m wrappedSyncMap
    benchWorkload(b, &m, 95)
}
func BenchmarkLocked_Read95(b *testing.B) {
    m := newLockedMap()
    benchWorkload(b, m, 95)
}
func BenchmarkSharded_Read95(b *testing.B) {
    m := newShardedMap(64)
    benchWorkload(b, m, 95)
}

// repeat for Read50, Read5...
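
The skeleton assumes three thin wrappers behind the API interface. A minimal sketch of what wrappedSyncMap, newLockedMap, and newShardedMap might look like (the names come from the benchmarks above; the implementations are illustrative):

// wrappedSyncMap adapts sync.Map to the API interface.
type wrappedSyncMap struct{ m sync.Map }

func (w *wrappedSyncMap) Get(k int) (int, bool) {
    v, ok := w.m.Load(k)
    if !ok {
        return 0, false
    }
    return v.(int), true
}
func (w *wrappedSyncMap) Set(k, v int) { w.m.Store(k, v) }

// lockedMap is a plain RWMutex-guarded map.
type lockedMap struct {
    mu sync.RWMutex
    m  map[int]int
}

func newLockedMap() *lockedMap { return &lockedMap{m: make(map[int]int)} }

func (l *lockedMap) Get(k int) (int, bool) {
    l.mu.RLock()
    v, ok := l.m[k]
    l.mu.RUnlock()
    return v, ok
}
func (l *lockedMap) Set(k, v int) {
    l.mu.Lock()
    l.m[k] = v
    l.mu.Unlock()
}

// shardedIntMap splits the key space across n lockedMaps (n must be a power of two).
type shardedIntMap struct{ shards []*lockedMap }

func newShardedMap(n int) *shardedIntMap {
    s := &shardedIntMap{shards: make([]*lockedMap, n)}
    for i := range s.shards {
        s.shards[i] = newLockedMap()
    }
    return s
}

func (s *shardedIntMap) shard(k int) *lockedMap { return s.shards[k&(len(s.shards)-1)] }
func (s *shardedIntMap) Get(k int) (int, bool)  { return s.shard(k).Get(k) }
func (s *shardedIntMap) Set(k, v int)           { s.shard(k).Set(k, v) }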

Run:

go test -bench=. -benchmem -cpu=1,2,4,8 ./optimize

Compare ns/op, B/op, allocs/op. Decide.

What to look for

  • Linear scaling with cores. If a primitive does not scale, contention is the issue; capture a mutex profile (go test -mutexprofile=mutex.out) and inspect it with go tool pprof.
  • Allocations per operation. Boxing of value types is the silent killer for sync.Map. B/op > 0 means you are allocating per op.
  • Tail latency. ns/op is an average. For real systems, measure p99 with a histogram; the amplification cost of sync.Map rebuilds shows up at p99. A rough way to probe it is sketched below.
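
The sketch below records every op duration sequentially and reports the p99 as a custom metric. It needs sort and time in addition to the imports above, and the timer overhead (tens of nanoseconds per time.Now call) inflates the numbers, so treat the result as comparative rather than absolute:

func BenchmarkSyncMap_P99(b *testing.B) {
    var m sync.Map
    durs := make([]time.Duration, 0, b.N)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        start := time.Now()
        m.Store(i%10000, i)
        durs = append(durs, time.Since(start))
    }
    b.StopTimer()
    // Sequential on purpose: this isolates per-op cost spikes such as
    // rebuilds, not lock contention.
    sort.Slice(durs, func(i, j int) bool { return durs[i] < durs[j] })
    b.ReportMetric(float64(durs[len(durs)*99/100]), "p99-ns/op")
}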

Step 3 — Choose the Right Primitive

A flowchart:

        +---------------------------+
        | Read-mostly, stable keys? |
        +-------------+-------------+
                      | yes
                      v
        +---------------------------+
        | Writes rare (config-like)?|
        +------+-------------+------+
               | yes         | no
               v             v
        atomic.Pointer    sync.Map
          [map[K]V]          |
                             v
                +-------------------------+
                | Need Len/snapshot/order?|
                +------+---------+--------+
                       | yes     | no
                       v         v
                  RWMutex+map   done

For balanced or write-heavy:

                    +-----------------------+
                    | Hot keys concentrate? |
                    +---+-------+-----------+
                        | yes   | no
                        v       v
                    redesign  sharded
                              RWMutex+map
                              with maphash

Sample shard counts:

Concurrent goroutines    Shard count
4–8                      16
8–32                     64
32+                      128–256

More shards reduce contention but add Len cost (walking all shards) and per-shard memory.


Tuning sync.Map Itself

If sync.Map is the right fit, you can still tune around it:

1. Store pointers, not values

// Slow: boxing on every Store
m.Store("k", 42)

// Faster for value types: one-time alloc
v := 42
m.Store("k", &v)
// or store a *atomic.Int64 and mutate it in place

Pointers avoid per-write heap allocation. Especially valuable for primitive value types.

2. Pre-warm with Store at startup

If you know the key set at startup, store them all eagerly. The first miss-then-promote happens during init, not under request load.
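
For example, assuming the known keys live in a plain initialEntries map (hypothetical):

var m sync.Map

func init() {
    // Store the known key set eagerly so request-path Loads hit warm entries.
    for k, v := range initialEntries {
        m.Store(k, v)
    }
}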

3. Avoid Range on the hot path

Range is O(n) and visits every entry, including tombstoned ones. Schedule it off the request path (background metrics scrape every N seconds).
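
For example, count entries on a ticker rather than per request (entryCount is a hypothetical gauge; m is the sync.Map from above):

var entryCount atomic.Int64 // scraped by the metrics system

go func() {
    t := time.NewTicker(30 * time.Second)
    defer t.Stop()
    for range t.C {
        n := int64(0)
        m.Range(func(_, _ any) bool { n++; return true })
        entryCount.Store(n)
    }
}()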

4. Avoid mixed read/write patterns

If a single goroutine alternates Load and Store on the same key, you defeat the fast path. Consider whether the design is wrong: could you batch the updates?

5. Watch for amplification under churn

If runtime.ReadMemStats shows growing HeapInuse while live entries stay constant, you have a churn problem. Either:

  • Periodically replace the entire sync.Map (atomic pointer swap; see the sketch after this list).
  • Switch to RWMutex + map, which handles delete cleanly.
  • Use ristretto / golang-lru for proper LRU eviction.
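
A minimal sketch of the first option: keep the sync.Map behind an atomic.Pointer and periodically swap in a fresh one, so the tombstones in the old map become garbage (rotatingMap and its methods are illustrative):

type rotatingMap struct {
    p atomic.Pointer[sync.Map]
}

func newRotatingMap() *rotatingMap {
    r := &rotatingMap{}
    r.p.Store(&sync.Map{})
    return r
}

func (r *rotatingMap) current() *sync.Map { return r.p.Load() }

// rotate copies live entries into a fresh map and swaps it in. A writer that
// races with the copy can lose an update, so this suits caches, not sources
// of truth.
func (r *rotatingMap) rotate() {
    old := r.p.Load()
    fresh := &sync.Map{}
    old.Range(func(k, v any) bool {
        fresh.Store(k, v)
        return true
    })
    r.p.Store(fresh)
}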

Tuning RWMutex + map

The plain mutex-guarded map is often the right answer. Tune it:

1. Use RLock for reads, Lock for writes

mu.RLock()
v, ok := m[k]
mu.RUnlock()

Obvious, yet surprisingly often forgotten in legacy code that takes the write lock for reads. When no writer holds or is waiting for the lock, RLock is cheap (a single atomic update) and readers never block each other.

2. Keep the critical section tiny

// Slow: holds lock while doing expensive work
mu.Lock()
m[k] = expensive() // BAD
mu.Unlock()

// Fast: compute, then lock briefly
v := expensive()
mu.Lock()
m[k] = v
mu.Unlock()

Contention is proportional to the time spent holding the lock; keep the critical section to the bare map operation.

3. Pre-allocate the map

m := make(map[int]int, expectedSize)

Avoids re-hashing as the map grows.

4. Avoid pointer values where possible

A map[int]int is cache-friendlier than map[int]*int. Indirection costs cache misses.

5. Don't defer for very short critical sections

// Defer adds 10-20 ns
mu.Lock()
defer mu.Unlock()
v := m[k]

// Manual is faster for nano-critical reads
mu.RLock()
v := m[k]
mu.RUnlock()

For tiny critical sections, the defer overhead can be 20% of the lock cost. Use defer for safety in functions that may panic; skip it in hot-path reads.


Tuning a Sharded Map

1. Shard count is a power of 2

Lets you replace h % N with h & (N-1). Slightly faster, and the modular bias is gone.

const shardMask = 63 // N = 64
sh := shards[h & shardMask]

2. Use maphash for the hash

maphash.Hash is randomised per process and provides good distribution.

seed := maphash.MakeSeed()
var h maphash.Hash
h.SetSeed(seed)
h.WriteString(key)
sum := h.Sum64()

Do not use fnv or crc32 if keys are attacker-controlled — they are predictable and let attackers engineer hot-shard scenarios.

3. Cache shard pointer if accessing many times in a row

sh := s.shardFor(key)
sh.Lock()
// multiple ops on sh
sh.Unlock()

The shard lookup is a hash; avoid hashing twice.
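
For reference, a minimal sketch of the sharded structure these snippets assume (string-keyed to match the maphash example; field and method names are illustrative):

type shard struct {
    sync.RWMutex
    m map[string]int
}

type shardedMap struct {
    seed   maphash.Seed
    shards []*shard // length must be a power of two
}

func newShardedStringMap(n int) *shardedMap {
    s := &shardedMap{seed: maphash.MakeSeed(), shards: make([]*shard, n)}
    for i := range s.shards {
        s.shards[i] = &shard{m: make(map[string]int)}
    }
    return s
}

func (s *shardedMap) shardFor(key string) *shard {
    var h maphash.Hash
    h.SetSeed(s.seed)
    h.WriteString(key)
    return s.shards[h.Sum64()&uint64(len(s.shards)-1)]
}

func (s *shardedMap) Get(key string) (int, bool) {
    sh := s.shardFor(key)
    sh.RLock()
    v, ok := sh.m[key]
    sh.RUnlock()
    return v, ok
}

func (s *shardedMap) Set(key string, v int) {
    sh := s.shardFor(key)
    sh.Lock()
    sh.m[key] = v
    sh.Unlock()
}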

4. Len is O(shardCount)

Cache the total in an atomic.Int64 if you call it often.

5. Don't shard if you have fewer goroutines than shards

64 shards with 4 goroutines wastes memory and adds latency. Match shard count to expected parallelism.


When atomic.Pointer[map] Beats Both

The copy-on-write pattern has the fastest reads:

type CowMap[K comparable, V any] struct {
    p atomic.Pointer[map[K]V]
}

func NewCowMap[K comparable, V any]() *CowMap[K, V] {
    c := &CowMap[K, V]{}
    empty := make(map[K]V)
    c.p.Store(&empty)
    return c
}

func (c *CowMap[K, V]) Get(k K) (V, bool) {
    v, ok := (*c.p.Load())[k]
    return v, ok
}

func (c *CowMap[K, V]) Set(k K, v V) {
    for {
        old := c.p.Load()
        next := make(map[K]V, len(*old)+1)
        for k2, v2 := range *old {
            next[k2] = v2
        }
        next[k] = v
        if c.p.CompareAndSwap(old, &next) {
            return
        }
    }
}

Use when:

  • Reads dominate by 1000:1 or more.
  • Writes are batchable (or rare).
  • Map size is small enough that rebuilds are cheap.

Avoid when:

  • Map size is large (1M entries means rebuilding 1M entries on every write).
  • Writes are frequent (CAS retries multiply the cost).

Bonus: reads can be inlined by the compiler and have zero allocations. The single atomic load is the fastest possible "concurrent read."
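
When writes arrive in batches (a config reload, say), a whole-map swap avoids paying one copy per Set. A possible helper on the CowMap above:

// Replace swaps in a complete new map with one atomic store. The caller must
// not mutate next after handing it over.
func (c *CowMap[K, V]) Replace(next map[K]V) {
    c.p.Store(&next)
}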


When to Add singleflight

If your cache uses expensive compute (DB query, RPC, parse), and concurrent misses are possible, singleflight saves you compute time:

Compute cost     Concurrent miss probability   Use singleflight?
< 1 µs           Any                           No — overhead exceeds savings
1 µs – 100 µs    Low                           Probably no
1 µs – 100 µs    High                          Yes
> 100 µs         Any                           Yes

singleflight.Group.Do adds ~200 ns of overhead per call (map lookup + mutex). For sub-microsecond computes, this is more than the duplicate work. For database queries or external API calls, it is negligible.

Combine with sync.Map for caching (the senior-level guide shows the pattern); a rough sketch follows. For a complete cache, use groupcache or ristretto, which combine both internally.
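
A minimal version of the pattern (fetchFromDB is a hypothetical expensive lookup; values are strings for brevity):

type cache struct {
    m sync.Map           // key -> value
    g singleflight.Group // collapses concurrent misses per key
}

func (c *cache) Get(ctx context.Context, key string) (string, error) {
    if v, ok := c.m.Load(key); ok {
        return v.(string), nil
    }
    v, err, _ := c.g.Do(key, func() (any, error) {
        // Re-check: another goroutine may have filled the entry while we queued.
        if v, ok := c.m.Load(key); ok {
            return v, nil
        }
        val, err := fetchFromDB(ctx, key) // hypothetical expensive compute
        if err != nil {
            return nil, err
        }
        c.m.Store(key, val)
        return val, nil
    })
    if err != nil {
        return "", err
    }
    return v.(string), nil
}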


Memory and GC Considerations

Boxing in sync.Map

Storing int in sync.Map:

m.Store("k", 42) // boxes 42 as interface{}

Each Store allocates a small object (about 16 bytes on 64-bit). For high-throughput writes, this drives GC pressure. Mitigations:

  • Store a pointer such as *atomic.Int64 (one allocation per key, mutated in place; see the example below).
  • Use atomic.Int64 outside the map.
  • Use a typed []int indexed by hash.
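
For counters, the first mitigation looks like this: box the pointer once per key and mutate through it, so steady-state writes allocate nothing.

var counters sync.Map // key -> *atomic.Int64

func inc(key string) {
    c, ok := counters.Load(key)
    if !ok {
        // Allocate at most one atomic.Int64 per key; LoadOrStore resolves races.
        c, _ = counters.LoadOrStore(key, new(atomic.Int64))
    }
    c.(*atomic.Int64).Add(1)
}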

Tombstones

sync.Map retains deleted entries in read.m as nil or expunged. High-churn workloads accumulate them until the next dirty rebuild. Memory grows beyond live-entry count. The professional level explains the mechanics.

Map grow-shrink asymmetry

The Go runtime's built-in map grows on insert but does not shrink on delete. A map that once held 1M entries and now holds 100 retains the bucket allocation for the 1M. To reclaim, recreate:

newMap := make(map[K]V, len(oldMap))
for k, v := range oldMap {
    newMap[k] = v
}
oldMap = newMap

This applies to RWMutex+map and the inner map of sharded structures.

Pointer values cost cache misses

A map[int]*Entry requires dereferencing a pointer to read the entry. Each dereference is a potential cache miss. A map[int]Entry (value type) inlines the entry but copies on every read. Trade-off:

  • Small entries (<16 bytes): value type, no pointer.
  • Large entries: pointer (saves copies).
  • Mutable entries: pointer (so all readers see the same instance).

Replacing a Concurrent Map Entirely

Sometimes the optimisation is to stop using a concurrent map. Patterns:

1. Use an array indexed by ID

If keys are dense integer IDs, an []atomic.Int64 indexed by ID is faster than any map:

counters := make([]atomic.Int64, maxID)
counters[id].Add(1)

No hashing, no lock, no GC pressure.

2. Use a channel for write-then-aggregate

If many goroutines update shared state but only one reads:

updates := make(chan Update, 1024)
go func() {
    // Single owner: only this goroutine ever touches state, so no locking.
    state := make(map[K]V)
    for u := range updates {
        state[u.K] = u.V
    }
}()

Single-owner state; no concurrent access at all. Throughput bounded by channel ops (~50 ns each) and the aggregator's processing rate.

3. Use a per-goroutine map, merge later

Each goroutine maintains its own map[K]V. A separate phase merges them. Useful for embarrassingly parallel workloads where the merge is rare; a sketch follows.
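
A sketch of the pattern, assuming the work splits cleanly into chunks:

func countWords(chunks [][]string) map[string]int {
    partials := make([]map[string]int, len(chunks))
    var wg sync.WaitGroup
    for i, chunk := range chunks {
        wg.Add(1)
        go func(i int, chunk []string) {
            defer wg.Done()
            local := make(map[string]int) // owned by this goroutine only
            for _, w := range chunk {
                local[w]++
            }
            partials[i] = local // distinct index per goroutine: no race
        }(i, chunk)
    }
    wg.Wait()

    // Single-threaded merge phase.
    total := make(map[string]int)
    for _, p := range partials {
        for k, v := range p {
            total[k] += v
        }
    }
    return total
}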

4. Use a struct with named fields

If "keys" are a fixed enum, you do not need a map at all:

type Stats struct {
    Hits, Misses, Errors atomic.Int64
}

Clearer, more type-safe, and faster.

5. Use a database

For state larger than memory or requiring durability, a small embedded KV store (Pebble, BoltDB) gives you concurrent access with a real query API. Latency is microseconds to milliseconds, not nanoseconds, but you get persistence and queries for free.


Checklist Before Shipping

Before merging code that uses sync.Map (or any concurrent map):

  • I have measured the read/write ratio for this use case.
  • I have benchmarked at least two alternatives.
  • My code does not assume Range is a snapshot.
  • My code does not need a Len from the sync.Map (it has none).
  • My type assertions all use comma-ok.
  • My values are either immutable or have their own synchronisation.
  • My map is passed by pointer, not by value.
  • I have run go vet and go test -race.
  • I have documented the access-pattern assumption near the declaration.
  • I have considered whether the data structure should be a sync.Map at all, vs an atomic.Pointer[map], sharded map, or non-map structure.
  • For load-once expensive compute, I have considered singleflight.
  • For TTL or LRU, I have considered an external library.

If you can tick every box, you have made a defensible choice. If you cannot, find the gap and address it before shipping.


Closing thought

sync.Map is the right answer for two specific workloads and the wrong answer for everything else. The optimization mindset is not "how do I make my sync.Map faster" — it is "is sync.Map the right primitive here, and if not, what is?"

Measure, choose, document. The next engineer (often you, six months later) will thank you.