Concurrent LRU — Optimization¶
A concurrent LRU has many failure modes that masquerade as performance problems. Most "optimizations" first require correct measurement of what is actually slow. Each entry below states the problem, shows a "before" snippet, an "after" snippet, and the realistic gain. Numbers are illustrative — measure in your own code.
Optimization 1 — Sharding to break mutex contention¶
Problem. Single-mutex LRU serializes all operations through one lock. At 16+ goroutines on a multi-core machine, the lock becomes the bottleneck.
Before:
cache, _ := lru.New[string, *User](16384)
// 16 goroutines hammer the cache; only one is inside at a time.
After:
Throughput: ~50M ops/sec. Mutex wait drops to <10%.Gain. 10x throughput. The cost is 16 mutex structures (~400 bytes) and one hash per operation (~20 ns).
Optimization 2 — Power-of-two shard count and bitmask¶
Problem. Modulo for shard selection is slow (division on most CPUs).
Before:
After:
Gain. 5-10 ns per operation. Modest but free.
Optimization 3 — Padding shard structs against false sharing¶
Problem. Two shard structs on the same 64-byte cache line cause coherency traffic between CPUs.
Before:
type shard struct {
inner *lru.Cache[string, int] // 8 bytes
// total 8 bytes; multiple shards fit on one cache line
}
After:
type shard struct {
inner *lru.Cache[string, int]
_pad [56]byte
// total 64 bytes; each shard on its own line
}
Gain. 2-5x throughput improvement under heavy contention. Verify with unsafe.Sizeof.
Optimization 4 — Switching from golang-lru to ristretto¶
Problem. Even sharded golang-lru has a mutex-per-shard ceiling. At very high throughput (>50M ops/sec), ristretto's lock-free Get is faster.
Before:
Throughput: ~70M ops/sec.After:
cache, _ := ristretto.NewCache(&ristretto.Config{
NumCounters: 100_000_000,
MaxCost: 16384,
BufferItems: 64,
})
Gain. 2x throughput, plus better hit rate on skewed workloads. Cost: more complex API; eventually consistent.
Optimization 5 — sync.Pool for entry allocations¶
Problem. Every Add allocates a new *entry struct. At high throughput, this stresses the allocator.
Before:
After:
var entryPool = sync.Pool{
New: func() interface{} { return new(entry[K, V]) },
}
ent := entryPool.Get().(*entry[K, V])
ent.key = k
ent.val = v
// ... use ...
entryPool.Put(ent) // on eviction
Gain. Allocation rate drops by ~50%. GC overhead drops proportionally. Latency p99 improves by 5-15%.
Optimization 6 — Cache-line aligned atomic counters¶
Problem. Hit and miss counters on the same cache line cause coherency traffic when both are incremented from different goroutines.
Before:
After:
Gain. Counter increment latency drops 5-10x under contention.
Optimization 7 — Lazy aggregation of per-shard metrics¶
Problem. Each operation increments a global counter (with cache-line bouncing).
Before:
func (c *Cache) Get(k string) (*V, bool) {
v, ok := c.pick(k).Get(k)
if ok {
c.globalHits.Add(1) // contended
}
return v, ok
}
After:
type shard struct {
inner *lru.Cache
hits atomic.Uint64 // per-shard
_pad [56]byte
}
func (c *Cache) Stats() Stats {
var total uint64
for _, s := range c.shards {
total += s.hits.Load()
}
return Stats{Hits: total}
}
Gain. Each shard's counter is uncontended. Stats() is O(N) but called rarely (metrics scrape).
Optimization 8 — Pre-allocate map capacity¶
Problem. Default-sized map grows incrementally, causing rehashes.
Before:
After:
Gain. Avoids 3-5 rehashes for a 100K cache. Saves ~50 ms of init time and improves steady-state by avoiding allocator pressure.
Optimization 9 — Hash function selection¶
Problem. hash/maphash is excellent for correctness but ~20 ns per call. For very high throughput, a faster hash matters.
Before:
After (xxhash):
Gain. 15 ns per operation. At 100M ops/sec, 1.5 seconds of CPU per second saved across cores.
Trade-off. xxhash is not seeded by default. For non-adversarial inputs, fine. For untrusted keys, use a seeded variant.
Optimization 10 — Reduce critical section size¶
Problem. Doing expensive work inside the lock blocks other operations.
Before:
func (c *Cache) Set(k string, raw []byte) {
c.mu.Lock()
defer c.mu.Unlock()
var v Value
json.Unmarshal(raw, &v) // expensive! holds the lock
c.inner.Add(k, &v)
}
After:
func (c *Cache) Set(k string, raw []byte) {
var v Value
json.Unmarshal(raw, &v) // outside the lock
c.mu.Lock()
c.inner.Add(k, &v)
c.mu.Unlock()
}
Gain. Reduced lock hold time → less contention → higher throughput.
Optimization 11 — Avoid defer in hot paths¶
Problem. defer has a small but non-zero overhead (~5 ns in Go 1.14+, less in newer versions).
Before:
After (Go 1.14+ has zero-cost defer for simple cases; only do this if profile shows it):
func (c *Cache) Get(k string) (V, bool) {
c.mu.Lock()
v, ok := c.inner[k]
c.mu.Unlock()
if !ok {
var zero V
return zero, false
}
return v, true
}
Gain. Marginal; only matters if defer shows up in profiles.
Optimization 12 — Inline the hash function¶
Problem. A function call has overhead (~2-5 ns).
Before:
After (inline the pick logic):
Gain. 2-5 ns per operation. Only matters at extreme throughput. Modern Go compilers often inline automatically.
Optimization 13 — Off-heap storage for large caches¶
Problem. On-heap LRU with millions of entries stresses GC. Pause times grow with heap size.
Before:
cache := NewShardedLRU[string, *Value](1_000_000, 64)
// values are *Value pointers; GC scans them all
After:
import "github.com/allegro/bigcache/v3"
config := bigcache.DefaultConfig(10 * time.Minute)
config.HardMaxCacheSize = 4 // MB per shard × 1024 shards = ~4GB
cache, _ := bigcache.New(context.Background(), config)
// values are serialised []byte; GC sees one large byte arena
Gain. 10-20x lower GC pause. Cost: serialisation overhead on every operation; values are now bytes.
Optimization 14 — Combine related lookups into one batch¶
Problem. Looking up three values takes three Get calls, three lock acquisitions.
Before:
After (single cache for related data):
type Bundle struct {
User *User
Session *Session
Profile *Profile
}
bundle, _ := bundleCache.Get(id) // one lock acquisition
Gain. Reduces lock acquisitions by 3x. Trade-off: more eviction granularity (whole bundle goes when any value changes).
Optimization 15 — Lock-free check before locked update¶
Problem. Every Set takes the lock even if the entry exists with the same value.
Before:
After:
func (c *Cache) Set(k string, v *Value) {
if cur, ok := c.inner.Peek(k); ok && cur == v {
return // no change needed
}
c.mu.Lock()
defer c.mu.Unlock()
c.inner.Add(k, v)
}
Gain. Avoids redundant Sets. Useful when the same value is repeatedly Set (idempotent loaders, hot-path caches).
Optimization 16 — Bypass the cache for huge values¶
Problem. Caching a 10 MB value inserts and may evict 100 smaller values. The eviction cost overshadows the cache benefit.
Before:
After:
if estimateSize(hugeValue) > 1024*1024 { // 1 MB threshold
return hugeValue, nil // don't cache
}
cache.Add(id, hugeValue)
Gain. Prevents one bad request from destroying the cache. Trade: that specific request always misses.
Optimization 17 — Move logging out of the cache hot path¶
Problem. Every miss logs; under high miss rate, the log itself becomes the bottleneck.
Before:
After:
if _, ok := cache.Get(k); !ok {
if missCount.Add(1)%10000 == 0 {
log.Printf("cache miss rate sample: %s", k)
}
// ... load and store ...
}
Gain. Eliminates log contention as a bottleneck. Sampling preserves the diagnostic value.
Optimization 18 — Pre-warm the cache at startup¶
Problem. Cold cache → slow first requests → tail-latency spike during deploys.
Before:
After:
func (s *Service) Start(ctx context.Context) error {
hot, _ := s.analytics.TopKeys(ctx, 1000)
var wg sync.WaitGroup
sem := make(chan struct{}, 8)
for _, k := range hot {
wg.Add(1)
sem <- struct{}{}
go func(k string) {
defer wg.Done()
defer func() { <-sem }()
v, _ := s.loader(ctx, k)
s.cache.Add(k, v)
}(k)
}
wg.Wait()
return nil
}
Gain. Eliminates the cold-cache latency spike. Cost: a few seconds of startup time.
Optimization 19 — Singleflight for stampede prevention¶
Problem. When a hot key misses, N concurrent goroutines all call the loader.
Before:
if v, ok := cache.Get(k); ok { return v }
v, _ := load(k) // N callers race here
cache.Add(k, v)
return v
After:
if v, ok := cache.Get(k); ok { return v }
result, _, _ := sf.Do(k, func() (interface{}, error) {
v, err := load(k)
if err != nil {
return nil, err
}
cache.Add(k, v)
return v, nil
})
return result.(*Value)
Gain. Reduces load calls by up to N (one per concurrent miss). Major win at high concurrency.
Optimization 20 — Cost-based capacity for variable-size values¶
Problem. Mixed value sizes lead to OOM under count-based capacity.
Before:
Memory could be anywhere from 1 MB to 10 GB.After:
cache, _ := ristretto.NewCache(&ristretto.Config{
NumCounters: 100_000,
MaxCost: 1 * 1024 * 1024 * 1024, // 1 GB
BufferItems: 64,
})
cache.Set(k, v, int64(len(v))) // cost = byte size
Gain. Predictable memory usage. Eviction prefers large entries when over budget.
Optimization 21 — Replace interface{} with typed cache¶
Problem. Pre-generics caches return interface{}; every Get incurs type assertion and possibly an allocation.
Before:
After:
Gain. Saves ~5 ns per Get; cleaner code; type safety at compile time.
Optimization 22 — Per-goroutine L1 cache for hot keys¶
Problem. A few keys absorb most traffic; sharding does not help for those.
Before:
After:
type Worker struct {
localCache map[string]*Value // per-goroutine
shared *Cache
}
func (w *Worker) Get(k string) *Value {
if v, ok := w.localCache[k]; ok {
return v
}
v, _ := w.shared.Get(k)
w.localCache[k] = v
return v
}
Gain. For very hot keys, eliminates cache lookups entirely. Trade: per-goroutine memory + stale data.
Optimization 23 — Disable GC for short benchmarks¶
Problem. GC mid-benchmark perturbs results.
Before:
func BenchmarkCache(b *testing.B) {
cache, _ := lru.New[string, int](10000)
for i := 0; i < b.N; i++ {
cache.Get("key")
}
}
After:
func BenchmarkCache(b *testing.B) {
cache, _ := lru.New[string, int](10000)
runtime.GC() // clear any pending GC
debug.SetGCPercent(-1) // disable GC for the duration
defer debug.SetGCPercent(100)
b.ResetTimer()
for i := 0; i < b.N; i++ {
cache.Get("key")
}
}
Gain. Stable benchmark results. Caveat: only for short benchmarks (otherwise memory grows unbounded).
Optimization 24 — Profile-guided sizing¶
Problem. Picking cache size by intuition leads to over- or under-provisioning.
Before:
After:
// Measure working set first.
trace := captureProductionTrace(1 * time.Hour)
mrc := mattsonAlgorithm(trace)
// mrc[c] is hit rate at capacity c
// Pick c where mrc[c] reaches 90% (or your target)
cache, _ := lru.New[string, *User](optimalSize)
Gain. Right-sized cache. Memory savings (often 2-3x), better hit rate.
Optimization 25 — Stale-while-revalidate¶
Problem. TTL expiry causes simultaneous cache misses and a latency spike.
Before:
e, ok := cache.Get(k)
if !ok || time.Now().After(e.expireAt) {
return slowLoad(k) // every TTL boundary = slow path
}
return e.val
After:
e, ok := cache.Get(k)
if !ok {
return slowLoad(k)
}
age := time.Since(e.fetchedAt)
if age < softTTL {
return e.val
}
if age < hardTTL {
if e.refresh.CompareAndSwap(false, true) {
go refreshInBackground(k)
}
return e.val // stale, but immediate
}
return slowLoad(k)
Gain. Eliminates latency spike at TTL expiry. Users see slightly stale data but no slow responses.
Conclusion¶
Every optimization above should be motivated by a measurement. The order in which to apply them:
- Profile with
pprof(CPU, mutex, alloc). - Pick the biggest opportunity from the profile.
- Apply ONE optimization.
- Re-benchmark.
- Verify the improvement.
- Commit.
- Repeat.
Random optimization without measurement adds complexity without benefit. Disciplined optimization with measurement yields compounding improvements.
The typical sequence for a Go service:
- No optimization (single-mutex LRU). Works up to ~10K req/s.
- Sharding. Up to ~1M req/s.
- Singleflight + TTL jitter. Solves stampede problems.
- ristretto migration. Up to ~10M req/s per pod.
- Off-heap (bigcache). For >1 GB caches.
- NUMA tuning, hugepages. Last resort.
Each step earns 2-10x. The cumulative improvement, when justified by measurement, can be 100-1000x over a naive starting point.
Profile. Optimize. Measure. Iterate. That is the loop.