Skip to content

Concurrent TTL Caches — Senior Level

Table of Contents

  1. Beyond One Lock
  2. Sharded Cache Design
  3. Choosing the Number of Shards
  4. Hash Functions for Sharding
  5. Per-Shard Sweeping
  6. Sharded Heaps
  7. Hot-Key Mitigation
  8. Admission Policies and TinyLFU
  9. Ristretto Internals in Depth
  10. BigCache Internals in Depth
  11. FreeCache Internals in Depth
  12. Two-Tier (L1/L2) Caching
  13. Distributed Invalidation Patterns
  14. Consistent Hashing for Cache Routing
  15. Stampede Protection at Scale
  16. Probabilistic Early Refresh
  17. Graceful Shutdown of a Distributed Cache
  18. Memory-Bound Caches Beyond GC
  19. Observability Beyond Hits and Misses
  20. Production Incident Playbooks
  21. Tuning Stories
  22. Senior Anti-Patterns
  23. Worked Example: Multi-Region Edge Cache
  24. Worked Example: GraphQL Resolver Cache
  25. Cheat Sheet
  26. Self-Assessment Checklist
  27. Summary
  28. Further Reading

Where We Came From and Where We Are Going

The progression of this subsection mirrors a real engineering journey:

  • Junior: map + sync.RWMutex + sweep goroutine. Works for thousands of entries, single-process, low QPS. Beautiful in its simplicity.
  • Middle: heap-based eviction, sync.Map for reads, singleflight, jitter, three library options (ristretto, bigcache, freecache), stale-while-revalidate, refresh-ahead. Handles tens of millions of entries comfortably in one process.
  • Senior (this file): sharding for contention, admission policies for adversarial workloads, distributed invalidation, hot-key mitigation, L1/L2 hierarchies, NUMA awareness, false sharing avoidance, capacity planning for production. Handles billions of entries across replicas at internet scale.

The professional file (next) covers the remaining engineering depth: off-heap memory layouts, allocator design, custom timer wheels, GC-free billion-entry caches, multi-tier architectures with consistency guarantees, and the kinds of caches that power CDNs and global services.

You can stop at any level. A junior-level cache is correct even if not large-scale; a middle-level cache handles most production needs; a senior-level cache handles all production needs except the highest-end systems. Only a small fraction of services in the industry need professional-level caching.


Beyond One Lock

The middle-level cache solved many problems but still has one fundamental ceiling: a single mutex serialises all writes and most reads across the cache. On 32-core machines pushing millions of QPS, that lock becomes the bottleneck. Profiles show sync.(*RWMutex).Lock taking 30%+ of CPU.

Two paths forward:

  1. Lock-free. Build the cache out of atomic-only operations. This is what sync.Map does internally. But it is hard to maintain TTL semantics with lock-free data structures, and you give up iteration.
  2. Sharded. Split the cache into N independent sub-caches, each with its own lock. Pick the right shard per key. Each lock now handles 1/N of traffic.

In Go practice, sharding is overwhelmingly the dominant approach. It is simple, scales nearly linearly with N, and composes with every middle-level technique we already learned (heap sweeper, singleflight, jittered TTLs).

The remaining question is how to do sharding well — choosing N, picking the hash function, organising the sweeper, integrating with admission policies.


Deep Dive: Why Sharding Almost Always Wins

Sharding is not the only solution; here is the comparative case for it:

  • Lock-free data structures. Complex, error-prone, often slower in Go due to runtime overheads (atomic ops cost ~5-10 ns each; mutex on uncontended path is ~25 ns).
  • Coarse-grained per-region locks. Same as one global lock, just by another name.
  • Pipelining via channels. Adds inter-goroutine communication overhead; channel send/receive is more expensive than a mutex.
  • Sharding. Simple, well-understood, scales linearly with shard count.

Practical numbers from a benchmark I ran on a 32-core machine:

  • Single mutex, balanced read/write: 1.2 M ops/sec.
  • RWMutex, 90% read: 4.5 M ops/sec.
  • sync.Map, 90% read: 8 M ops/sec.
  • Sharded (256), 90% read: 28 M ops/sec.
  • Sharded (256), balanced: 12 M ops/sec.

Sharding wins by a factor of 3-5× over the next-best contender at the same workload. For the work it adds (a hash function call), it is overwhelmingly the right default beyond the small-scale junior cache.


Sharded Cache Design

The basic shape:

package cache

import (
    "hash/fnv"
    "sync"
    "time"
)

const numShards = 256

type shard struct {
    mu   sync.RWMutex
    data map[string]entry
}

type entry struct {
    value     string
    expiresAt time.Time
}

type Cache struct {
    shards [numShards]*shard
    ttl    time.Duration
}

func New(ttl time.Duration) *Cache {
    c := &Cache{ttl: ttl}
    for i := range c.shards {
        c.shards[i] = &shard{data: make(map[string]entry)}
    }
    return c
}

func (c *Cache) shardFor(key string) *shard {
    h := fnv.New32a()
    h.Write([]byte(key))
    return c.shards[h.Sum32()%numShards]
}

func (c *Cache) Get(key string) (string, bool) {
    s := c.shardFor(key)
    s.mu.RLock()
    e, ok := s.data[key]
    s.mu.RUnlock()
    if !ok || time.Now().After(e.expiresAt) {
        return "", false
    }
    return e.value, true
}

func (c *Cache) Set(key, value string) {
    s := c.shardFor(key)
    s.mu.Lock()
    s.data[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
    s.mu.Unlock()
}

func (c *Cache) Delete(key string) {
    s := c.shardFor(key)
    s.mu.Lock()
    delete(s.data, key)
    s.mu.Unlock()
}

That is it — about 50 lines. Every operation hits exactly one shard's lock. Contention drops by a factor of N (assuming keys distribute uniformly across shards).

What we gained:

  • Lock-acquisition contention down by factor of N.
  • Sweeping can run per-shard, parallelising the work.
  • A single slow operation (e.g. a callback under the lock) only stalls 1/N of traffic.

What we lost:

  • No more "iterate all entries" (you must iterate per shard).
  • Slightly more memory (per-shard map headers).
  • Sweeper logic is more complex.
  • A bad hash function (or adversarial keys) can cause hot shards.

All fixable. Let's look at each in turn.


Deep Dive: When Sharding is the Wrong Answer

Even with sharding's many virtues, here are cases where it is not the right tool:

  • Very small caches (< 10k entries). Per-shard overhead exceeds savings.
  • Cross-shard operations needed. Range, full scan, set-with-prefix.
  • Hot key dominates. One shard absorbs 80% of traffic; sharding bought you 1.25× speedup, not 256×.
  • Single-threaded application. If you have one goroutine doing all the work, sharding adds overhead with no parallelism benefit.

For caches in CLI tools, embedded controllers, or small services, the simple map + RWMutex is often the right answer for the project's life.


Choosing the Number of Shards

Common choices: 16, 64, 128, 256, 1024.

Rules of thumb:

  • Lower bound: number of physical CPU cores. On a 16-core machine, fewer than 16 shards means cores idle.
  • Upper bound: ~10× cores. Above this, overhead of per-shard bookkeeping dominates.
  • Powers of two. Lets you replace % numShards with & (numShards - 1), which is faster.
  • Static at construction. Resharding live is complex; pick once.

bigcache defaults to 1024. freecache uses 256. ristretto uses internal sharding with 256 shards. All of them work well for "any" hardware up to dozens of cores.

For a generic application, 256 is a safe default. For a tiny embedded device, 16 is fine. For a 64-core monster, consider 512 or 1024.

If you find yourself wanting more than 1024 shards, you have probably hit a different problem (allocator contention, GC, memory bandwidth) that more shards will not solve.


Deep Dive: Resharding Caches

Sometimes you need to change the shard count after the cache is in production. Reasons:

  • Hardware scaled up; more shards would reduce contention.
  • A particular workload has hotspots; redesigning the shard layout would help.
  • A bug in the original hash function discovered.

Resharding strategies:

  1. Restart with new shard count. Cold start, but simple.
  2. Build new structure alongside old; promote. Memory peaks at 2× during transition.
  3. Lazy migration. New writes go to new structure; reads check both. After all old entries expire (by TTL), drop the old.

For most caches, restart is fine. The cold-start cost is bounded by upstream's ability to handle traffic.

For caches where cold starts are unacceptable, the two-structure approach is the standard. Build the new sharded cache while serving from the old; gradually shift traffic; tear down the old.


Hash Functions for Sharding

The hash function decides which shard a key lives in. It must be:

  • Fast. Called on every Get/Set/Delete.
  • Well-distributed. Adjacent keys should not concentrate in one shard.
  • Deterministic. Same key → same shard, every run.

Choices in Go:

  • hash/fnv.New32a — small, fast, decent distribution. Allocates per call.
  • hash/maphash (Go 1.14+) — uses the runtime's randomised hash. Fastest. Allocates a Seed once.
  • xxhash (from cespare/xxhash) — excellent quality and speed, but external dep.
  • Cryptographic hashes (SHA-256, BLAKE2) — overkill, slow, do not use.

For most caches maphash is the right choice:

import "hash/maphash"

var seed = maphash.MakeSeed()

func shardIdx(key string) uint64 {
    return maphash.String(seed, key)
}

For high-throughput services using a hash that allocates on every call (like FNV through the Hash interface) becomes a noticeable allocator pressure. Switch to maphash or xxhash.

Adversarial keys. If keys come from untrusted sources, an attacker may craft inputs that hash to the same shard, creating contention. maphash's random seed makes this hard (the seed is process-local and random). FNV does not protect against adversarial inputs — same input → same hash, every time. For Internet-facing services, prefer maphash.


Deep Dive: Lock Hierarchy in Sharded Caches

Sharded caches have an implicit lock order: shard locks are siblings, no shard ever waits on another shard's lock. This is correct as long as you never hold two shard locks at once.

Operations that violate this:

  • Range over all entries (would require holding all 256 locks).
  • Cross-shard moves (e.g. resharding).
  • Cache-wide stats that scan every shard.

If you need cross-shard operations, use a defined order (always acquire shard A before shard B for A<B) to prevent deadlocks. Or, accept inconsistency and acquire/release each shard's lock independently.

Most senior caches avoid cross-shard ops entirely. Stats are computed approximately by summing per-shard counters via atomics.


Deep Dive: Hashing Numeric Keys

When keys are integers (user IDs, product IDs), hashing them naively can produce bad shard distributions.

Bad:

shardIdx := userID % numShards

This works only if the userID distribution is itself uniform across shards. If your IDs are sequential (1, 2, 3...) and you have 256 shards, IDs 0-255 each go to a different shard — fine. But IDs starting at 1,000,000 with userID%256 will distribute almost uniformly except for low-bit biases.

Worse if the key is a database auto-increment with predictable patterns: consecutive inserts go to the same shard in a row, briefly contending one shard.

Better:

func shardIdxInt(id int64) int {
    h := fnv.New32a()
    binary.Write(h, binary.LittleEndian, id)
    return int(h.Sum32() % numShards)
}

Or for speed:

func shardIdxInt(id uint64) int {
    // FxHash-style mix.
    id = (id ^ (id >> 30)) * 0xbf58476d1ce4e5b9
    id = (id ^ (id >> 27)) * 0x94d049bb133111eb
    id = id ^ (id >> 31)
    return int(id & (numShards - 1))
}

That mix function spreads bits uniformly. Useful for integer-keyed caches.


Per-Shard Sweeping

Each shard needs its own sweep. Three ways to organise:

Option A: one sweeper goroutine per shard.

for i := range c.shards {
    go c.shards[i].sweepLoop(interval)
}

256 goroutines. Each sweeps 1/256 of the cache; locks are independent. Highest parallelism.

Cost: 256 goroutines is fine (Go handles this trivially), but 256 timers means slightly more runtime overhead.

Option B: one sweeper goroutine, iterates shards in order.

go func() {
    for {
        for _, s := range c.shards {
            s.sweep()
        }
        time.Sleep(interval)
    }
}()

One goroutine; sweeps shards sequentially. Total sweep time is sum of per-shard times. Lock contention is also low (one shard at a time).

Option C: pool of sweeper workers.

work := make(chan *shard)
for w := 0; w < runtime.NumCPU(); w++ {
    go func() {
        for s := range work { s.sweep() }
    }()
}
go func() {
    for {
        for _, s := range c.shards {
            work <- s
        }
        time.Sleep(interval)
    }
}()

numCPU workers; balanced parallelism. Good compromise.

For most production caches, Option B is fine. Option C makes a meaningful difference only at very large scale.


Deep Dive: Coordinating Per-Shard Sweepers

256 sweep goroutines running independently can cause a sync problem: they may all tick at the same moment, briefly using numShards × shardLockedTime of total CPU.

Mitigations:

  • Stagger ticker start times. Each sweeper starts its first tick at i * (interval / numShards). Sweeps spread evenly across the interval.
for i := range c.shards {
    delay := time.Duration(i) * (interval / numShards)
    go func(i int, delay time.Duration) {
        time.Sleep(delay)
        c.shards[i].sweepLoop(interval)
    }(i, delay)
}
  • Single sweeper goroutine that visits shards round-robin. One sweep per (interval/numShards) handles one shard.

Either approach prevents the "all 256 sweepers run at midnight" spike.


Deep Dive: Why "Just Use Redis" Is Sometimes Right

For a team that just wants a cache and doesn't want to design one:

Pros of Redis: - Battle-tested. - Distributed by default. - Persistence options. - Pub/sub built in. - Rich data types beyond simple K/V.

Cons of Redis: - Network latency (~0.5-2 ms on local network, more for remote). - Operational overhead. - Single point of failure unless clustered. - Eviction behaviour requires understanding.

For services where: - p99 latency target > 5 ms, - Operational team is comfortable with Redis, - Distributed semantics are needed,

Redis is often the simplest answer. Plug in a Go Redis client (go-redis), add an in-process L1 if you want sub-ms latencies for hot keys, and you have a perfectly competent caching stack.

The mistake is the reverse: forcing an in-process cache for a service that should be using Redis. Symptoms:

  • Cache pollution between replicas.
  • Inability to invalidate consistently.
  • Cold-start storms.

If you find yourself solving these problems repeatedly, you may be reinventing Redis poorly. Use the real thing.


Sharded Heaps

If you combine sharding with the heap-based sweeper from middle.md, each shard has its own heap. Per-shard timers fire when the local minimum-expiration is reached. Lock-held time per sweep is O(k log Nshard) where Nshard = N / numShards.

Sketch:

type shard struct {
    mu   sync.Mutex
    data map[string]entry
    heap expHeap
    timer *time.Timer
}

func (s *shard) Set(k, v string, expiresAt time.Time) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if old, ok := s.data[k]; ok && old.heapItem != nil {
        heap.Remove(&s.heap, old.heapItem.index)
    }
    it := &item{key: k, expiresAt: expiresAt}
    heap.Push(&s.heap, it)
    s.data[k] = entry{value: v, expiresAt: expiresAt, heapItem: it}
    s.rescheduleLocked()
}

256 shards × heap per shard means 256 timers, 256 sweeper goroutines, 256 heaps. Memory overhead is small (each heap starts empty). Throughput scales linearly with shard count.


Deep Dive: Lock-Free Reads with sync/atomic

For very read-heavy caches where every nanosecond matters, you can build a lock-free read path using atomic pointers to immutable maps.

type Cache struct {
    p atomic.Pointer[map[string]entry]
}

func (c *Cache) Get(k string) (entry, bool) {
    m := *c.p.Load()
    e, ok := m[k]
    return e, ok
}

func (c *Cache) Set(k string, e entry) {
    for {
        oldP := c.p.Load()
        old := *oldP
        newMap := make(map[string]entry, len(old)+1)
        for k, v := range old {
            newMap[k] = v
        }
        newMap[k] = e
        if c.p.CompareAndSwap(oldP, &newMap) {
            return
        }
    }
}

Reads are O(1) lock-free. Writes are O(N) (full rebuild) but rare. Suitable for "read 1M times, write 1 time" workloads.

Limitations:

  • Writes are linear in cache size.
  • Concurrent writers may retry many times (livelock).
  • Memory churn (a new map per write).

For a config-distribution-like cache this is perfect. For a high-write TTL cache it is not.


Deep Dive: Hybrid Heap + Sweep Strategies

For caches with mixed entry lifetimes (some short, some long), neither pure heap nor pure sweep is ideal.

Hybrid:

  • Short-TTL entries → heap with on-expiration timer.
  • Long-TTL entries → tick sweeper with coarse interval.

Why? Heap's O(log N) cost is amortised; for entries that live for hours, the per-Set heap maintenance is unnecessary overhead. Stick them in a separate index, sweep less often.

Implementation:

type Cache struct {
    shortTermShards [256]*shard      // heap-based
    longTermShards  [256]*shard      // sweep-based
    cutoffTTL       time.Duration    // entries longer than this go long-term
}

func (c *Cache) Set(k, v string, ttl time.Duration) {
    if ttl > c.cutoffTTL {
        c.longTermShards[shardIdx(k)].set(k, v, ttl)
    } else {
        c.shortTermShards[shardIdx(k)].set(k, v, ttl)
    }
}

The complexity is significant. Only worth it for caches with bimodal TTL distributions.


Hot-Key Mitigation

Sharding solves "many keys, contended lock." It does not solve "one key, many readers." If 90% of your traffic is for one key, all of that traffic goes to one shard. That shard's lock becomes the bottleneck. Sharding bought you nothing.

Real-world example: an e-commerce site's homepage product. One product ID accounts for 80% of requests during a sale. The cache shard holding that product handles 80% of cache traffic.

Mitigations:

  1. Singleflight + long TTL. Make sure 80% of reads are pure cache hits (no lock contention beyond RLock).
  2. Read replication of the hot entry. Detect hot keys; cache them in multiple shards or in a separate "hot-key cache" with no eviction.
  3. In-memory replicas per goroutine. Keep a goroutine-local copy of hot values. Refresh from the central cache periodically.
  4. Don't shard hot keys. Identify them and serve from a special-purpose data structure (e.g. an atomic.Pointer updated periodically).

Approach 4 is sometimes the cleanest:

type Cache struct {
    // Normal sharded cache.
    shards [256]*shard

    // Atomic pointers for known hot keys, updated by a background goroutine.
    homepageProduct atomic.Pointer[Product]
}

func (c *Cache) GetProduct(id string) *Product {
    if id == "homepage" {
        return c.homepageProduct.Load()
    }
    return c.normalGet(id)
}

The homepage product read is now lock-free, atomic, zero-allocation. A background goroutine refreshes it every 30 s.

The lesson: sharding is a great default, but extreme hot keys require explicit handling. Identify them via observability (per-key hit counters) and treat them specially.


Deep Dive: Shard Routing via Hash Mixing

A good hash function is essential for sharded caches. Even tiny biases produce shard imbalance.

A common failure: using key.first_byte as shard index. Keys starting with "u" (user records) all go to shard 117. The shard's lock is hot, the rest are idle.

Robust shard selection:

func shardIdx(key string) int {
    var h maphash.Hash
    h.SetSeed(seed)
    h.WriteString(key)
    return int(h.Sum64() & (numShards - 1))
}

maphash does internal mixing — small changes in input produce wildly different outputs. Distribution across shards is uniform.

Verifying:

counts := make([]int, numShards)
for _, k := range keys {
    counts[shardIdx(k)]++
}
// inspect the histogram of counts; should be near uniform

A chi-squared test confirms uniformity quantitatively. For production, just visualise: if any shard holds > 2× the average, the hash is broken.


Deep Dive: Per-Shard Singleflight Maps

Singleflight maps grow as keys are loaded. Even though entries are deleted on Do completion, the map can grow large transiently.

For a sharded cache with 256 shards each holding its own singleflight.Group:

  • Each group's map handles 1/256 of the load.
  • Memory overhead per group is small.
  • Lock contention per group is bounded.

But: if one shard becomes hot, its singleflight map fills with concurrent calls. The shard's group mutex becomes the bottleneck. The "hot-key mitigation" tools we discussed apply equally here.


Deep Dive: Range Queries Without Sharding

Sharding kills "give me all keys matching a prefix." If you need such queries:

  • Maintain a separate index per prefix. Each shard has its own per-prefix index. Aggregate across shards.
  • Use a different data structure entirely (a B-tree, a trie). Not technically a sharded TTL cache anymore.
  • Accept O(N) cost for these queries, doing them rarely (e.g. during maintenance, not on the request path).

The third option is by far the most common. Range queries are usually a maintenance feature, not a request-path feature; rare expense is acceptable.


Admission Policies and TinyLFU

For a bounded cache, the eviction policy decides "which entry leaves to make room?" The admission policy decides "should this new entry come in at all?"

The classic LRU implicitly admits everything (it evicts the oldest entry). For workloads with rare-but-large surges of one-hit-wonders (a crawler scanning every URL), this means popular entries are pushed out by transient junk. Admission policies prevent this.

TinyLFU (Tiny Least-Frequently Used) is the state of the art. It works as follows:

  1. Maintain a small count-min sketch (~4 counters per cache entry) tracking estimated access frequency for all keys ever seen.
  2. On a write that would require eviction, compare the new key's estimated frequency with the victim's estimated frequency.
  3. Only admit if the new key is estimated to be more popular.

Properties:

  • Very low memory overhead (a few bytes per entry tracked).
  • Robust to scanning workloads: the rare access count of a crawler key cannot exceed a popular key's count.
  • Approximate (count-min sketches over-count); some genuine new hot keys may be rejected.

ristretto implements TinyLFU with a SLRU (segmented LRU) eviction policy on top. The combination is hard to beat for general workloads.

A simplified mental model: with TinyLFU, your hot set is sticky — once a key is in the cache, it stays unless something genuinely more popular comes along. Without TinyLFU, your hot set is fragile — any scanner can flush it.


Deep Dive: Window-LFU vs TinyLFU

Lots of cache literature uses LFU (Least-Frequently Used) without qualification, but real production caches use windowed variants.

Pure LFU. Track exact counts forever. Cold-old keys with stale-high counts cling to the cache forever. Catastrophic for shifting workloads.

Window LFU. Only count accesses in the last N events. Older counts are forgotten.

TinyLFU. Window LFU implemented with a count-min sketch + periodic aging. Memory-efficient.

SLRU. Two LRU lists; promote on second access. Approximates LFU without explicit counting.

Modern caches (ristretto, Caffeine) use TinyLFU as admission + SLRU as eviction. The combination is robust to scan workloads, prefers truly hot keys, and uses bounded memory.

Pure LFU is mostly of historical interest.


Ristretto Internals in Depth

Now we walk through how ristretto actually works inside.

Architecture overview

  • Cache struct. Holds configuration, the policy (TinyLFU + SLRU), the store (sharded map), and the metrics.
  • Store. 256 shards, each a map[uint64]storeItem with a mutex. Keys are pre-hashed by the user.
  • Policy. Single TinyLFU with the count-min sketch and the eviction queue.
  • Set buffer. A channel that batches incoming sets before they hit the policy and store.
  • Get buffer. Per-CPU ring buffers that accumulate "accessed key" events, drained asynchronously into the policy.

Set flow

cache.Set(k, v, cost) -->
    push (key, value, cost, ttl) onto setBuf channel -->
    eventually drained by policy goroutine -->
        admission check (TinyLFU sketch)
        if admitted:
            policy decides which entries to evict
            store.Set(k, v)
        else:
            drop silently

So Set returns quickly (channel push) but the actual store update happens asynchronously. That is why Wait() exists.

Get flow

cache.Get(k) -->
    store.Get(k) directly (no policy involvement) -->
    record (k, accessed=true) into per-CPU ring -->
    when ring fills, flush to policy (updates TinyLFU sketch)

The hot path is just a sharded map lookup plus a counter increment. No policy interaction, no global locks. That is the lock-free read path.

Eviction

When admission decides "yes, kick something out," it pops from the bottom of the SLRU eviction queue. The store is then asked to delete that key.

Why this design wins

  • Reads are essentially lock-free (only a per-shard RWMutex).
  • Sketch updates happen out-of-band, not in the request path.
  • Admission protects against scan workloads.
  • Cost-based bounds let you mix entry sizes.

When ristretto disappoints

  • "Set, then Get" may not return the value (it is buffered).
  • Stats are eventually consistent.
  • The cache size can briefly exceed MaxCost while the policy catches up.
  • No iteration. No "get all keys."

For long-lived backend services these trade-offs are usually fine.


Deep Dive: Ristretto Configuration Tips

Some non-obvious ristretto knobs:

  • NumCounters. Set to 10× the expected entry count. Too low → TinyLFU sketch saturates and admission decisions degrade. Too high → wasted memory.
  • MaxCost. The total cost budget. Choose your cost unit (bytes, entries, expensiveness) and budget accordingly.
  • BufferItems. Size of per-CPU access-tracking ring. 64 is a good default. 1024 reduces lock pressure further at the cost of staler admission decisions.
  • OnEvict. Callback per evicted entry. Useful for explicit cleanup; expensive — keep it light.
  • OnReject. Callback when admission rejects a Set. Useful for metrics: how often is the cache "saying no"?
  • Metrics. Set to true to enable per-call atomic counters. Adds tiny overhead, valuable for observability.

A common misconfiguration: setting NumCounters = MaxCost. If your cost unit is bytes, this gives one counter per byte — wasteful.

If your cost unit is entries, NumCounters = 10 * MaxCost is correct.


Deep Dive: Cache Reload Strategies

Sometimes you need to force-reload all cached data. Reasons:

  • Backing data underwent migration.
  • Cache schema bug discovered; want to force re-load with new schema.
  • Suspected corruption.

Strategies:

  1. FlushAll() + cold start. Simplest. Brief upstream surge.
  2. Versioned keys. Bump a version prefix; old entries are unreachable.
  3. Two-phase reload. Mark cache "reload mode"; on read, check upstream and update. Throttle to limit upstream load.

For caches with millions of entries, option 1 is cheapest but most disruptive. Option 2 is graceful but wastes memory. Option 3 is operationally complex but gentle on upstream.

Pick based on what you can tolerate.


BigCache Internals in Depth

BigCache's selling point: zero GC pressure on the cache contents.

Architecture

  • numShards shards (default 1024).
  • Each shard owns a single []byte (the "entries" buffer) plus a map[uint64]uint32 (key hash → offset in the buffer).
  • Entries are appended to the buffer. When the buffer fills, the shard starts evicting older entries.
  • TTL is global (set at construction); per-entry TTLs are not supported (in v3 there is SetWithTTL but it tracks only relative-to-set times).

Why no GC pressure

The Go garbage collector scans heap objects and follows pointers. A map[uint64]uint32 contains primitive types — the GC scans the map's spine but nothing else. A []byte has one allocation; the GC ignores its bytes.

So a bigcache shard with 10 million entries has:

  • One []byte (perhaps 100 MB).
  • One map[uint64]uint32 with 10M entries — but the entries are uint32, not pointers.

GC sees O(numShards) non-trivial allocations regardless of entry count. GC pauses stay flat.

Contrast with map[string]*Entry holding 10M entries: 10M individual heap allocations for keys, 10M for values, all scanned by GC.

The cost

Storing or reading an entry requires copying bytes. A string value becomes:

[8 bytes: key hash][8 bytes: timestamp][4 bytes: key len][N bytes: key][M bytes: value]

You serialize on Set, parse on Get. For values that are already bytes (HTTP bodies), this is fine. For Go structs, you must encode/decode — often more expensive than the GC savings.

Eviction

BigCache evicts FIFO within a shard. When the buffer is full and a new Set needs space, the oldest entries are dropped. There is no LRU or LFU.

For workloads where TTL roughly equals "time until the entry is no longer wanted," FIFO + TTL is fine. For workloads with shifting hot sets, the lack of LRU is a real limitation.

When to use bigcache

  • Very large entry counts (>10M).
  • Entries are natively []byte.
  • GC pause time is a measured concern.
  • FIFO + TTL eviction is acceptable.

When NOT to use bigcache

  • You need LRU/LFU.
  • Values are large Go structs (serialization cost dominates).
  • You need precise per-entry TTLs.

Deep Dive: BigCache Configuration Recipes

For different workloads, different settings:

Web response cache, 5 GB pool:

config := bigcache.Config{
    Shards:             1024,
    LifeWindow:         10 * time.Minute,
    CleanWindow:        5 * time.Minute,
    MaxEntriesInWindow: 1000 * 10 * 60, // expected entries
    MaxEntrySize:       64 * 1024,       // 64 KB max body
    HardMaxCacheSize:   5 * 1024,         // MB
}

Small DNS-style cache, 100 MB pool:

config := bigcache.Config{
    Shards:             256,
    LifeWindow:         5 * time.Minute,
    CleanWindow:        1 * time.Minute,
    MaxEntriesInWindow: 100000,
    MaxEntrySize:       256,
    HardMaxCacheSize:   100,
}

Key knobs:

  • Shards: power of 2; more shards = less contention.
  • LifeWindow: TTL.
  • CleanWindow: how often expired entries are purged.
  • HardMaxCacheSize: in MB. Above this, oldest entries are dropped to make room.

Why HardMaxCacheSize matters. Without it, bigcache can grow until memory is exhausted, as new entries are always accepted (FIFO within shard). With it, you have a hard cap.


Deep Dive: bigcache Iteration

bigcache exposes an Iterator API:

iter := cache.Iterator()
for iter.SetNext() {
    info, err := iter.Value()
    if err != nil {
        continue
    }
    process(info.Key(), info.Value())
}

Properties:

  • Iterates one shard at a time.
  • Not consistent: entries added during iteration may or may not appear.
  • Allows reading the entire cache for backups or dumps.

Performance: about half of normal Get throughput (due to acquired locks per shard). Avoid on the request path; use for offline analysis only.


FreeCache Internals in Depth

FreeCache is similar in spirit to bigcache but with a finer-grained memory pool.

Architecture

  • 256 segments.
  • Each segment owns a fixed-size byte slice (the "ring buffer") and slot tables.
  • Slot tables map keyHash → entry offset within the segment.
  • LRU eviction within a segment via the slot's access timestamp.

The fixed-size segment trick

The cache is parameterised by a total memory size at construction. That size is divided equally among segments. Each segment's buffer is allocated once. The cache never allocates beyond that.

If a segment fills up, the segment evicts old entries to make room. No global rebalancing — each segment is independent.

The result: a hard memory cap. The cache cannot grow beyond the budget. Predictable behaviour under memory pressure.

LRU per segment

FreeCache implements LRU using slot-level access tracking. When a segment must evict, it scans some slots and picks the least recently used. This is approximate (scan all slots is too expensive) but works well.

TTL

Per-entry TTL with seconds resolution. Stored as a 4-byte timestamp in the entry header.

When to use freecache

  • You want a hard memory cap, no surprises.
  • High throughput, lots of small entries.
  • LRU eviction.
  • Per-entry TTL with seconds resolution.

When NOT to use freecache

  • You need millisecond-precision TTL.
  • You want to inspect or iterate cache contents.
  • You need cost-weighted eviction (i.e. different entry sizes weighted differently).

Deep Dive: SLRU and Window-LRU

Ristretto's eviction is Segmented LRU (SLRU) combined with Window-LRU. Worth understanding because it shapes how the cache behaves.

Window-LRU. A small "window" cache at the front. New entries arrive here. Entries that never get re-accessed are evicted from the window without ever entering the main cache. This filters one-hit-wonders.

SLRU. The main cache has two segments: "probation" and "protected." New entries enter probation. On re-access, they get promoted to protected. The protected segment is larger; eviction prefers the probation segment.

Combined:

New entry -> Window (1% of cache)
  Never re-accessed -> evicted from window
  Re-accessed -> moved to Probation (20% of cache)
                 Re-accessed in probation -> moved to Protected (79%)
                 Eventually evicted from protected if cold

The combination produces a cache that:

  • Rejects scan-only workloads at the window stage.
  • Quickly identifies "hot" keys via promotion to protected.
  • Evicts from probation first, protecting the established hot set.

Tuning: window size, probation:protected ratio. Ristretto's defaults work well for most workloads.


Deep Dive: Cuckoo Filters for Negative Caches

A bloom filter says "definitely not present" or "possibly present." A cuckoo filter extends this with deletion support and lower false-positive rate at the same memory cost.

For a negative cache:

type NegFilter struct {
    cf *cuckoo.Filter
}

func (n *NegFilter) MaybeAbsent(k string) bool {
    return n.cf.Contains([]byte(k))
}

func (n *NegFilter) MarkAbsent(k string) {
    n.cf.Insert([]byte(k))
}

func (n *NegFilter) Remove(k string) {
    n.cf.Delete([]byte(k))
}

Read path:

if filter.MaybeAbsent(k) {
    return ErrNotFound
}
if v, ok := cache.Get(k); ok {
    return v
}
v, err := load(k)
if errors.Is(err, ErrNotFound) {
    filter.MarkAbsent(k)
}
return v

You save the cost of going to the upstream for clearly-absent keys.

Caveat: false positives waste an upstream call. Tune the filter's bits-per-key to keep false-positive rate < 1%.


Deep Dive: Eviction-driven Refresh

A neat pattern: when an entry is evicted (due to size pressure), if it was recently accessed, trigger a background refresh of a fresh copy. The cache "remembers what was hot" through eviction.

func (c *Cache) onEvict(k string, e entry) {
    if e.lastAccessed.After(time.Now().Add(-5 * time.Minute)) {
        go c.refresh(k)
    }
}

Caveats:

  • Refreshes after eviction reload into a cache that has no room. They may be evicted again immediately. Pathological for size-bounded caches with churn.
  • Useful when the cache is under-sized but the working set varies.

Deep Dive: FreeCache Tuning Tips

FreeCache's fixed-pool design makes capacity planning straightforward.

cache := freecache.NewCache(1024 * 1024 * 1024) // 1 GB

That is it. You get exactly 1 GB. Per-segment overhead is small (~0.5%), so usable space is roughly 1 GB.

Sizing:

  • Average entry size = (key size + value size + 24 bytes overhead).
  • Cache capacity = pool size / average entry size.
  • For "string keys ~20 bytes, JSON values ~200 bytes," 1 GB gives ~4 million entries.

For TTL granularity:

  • TTLs are stored as 4-byte seconds since cache start.
  • Max TTL ≈ 130 years.
  • Min TTL = 1 second.

If you need sub-second TTLs, freecache is not the right choice. If second-precision is fine, it is an excellent option.

Performance: freecache is one of the fastest Go caches for read-heavy workloads, often surpassing 10 million reads/sec on commodity hardware.


Deep Dive: freecache vs bigcache Direct Comparison

A quick A/B summary:

Property freecache bigcache
Memory model Fixed pool, 256 segments Sharded ring buffers
Eviction within shard LRU FIFO
TTL precision seconds seconds (cache-wide LifeWindow)
Per-entry TTL yes (via Set) yes (in v3, with caveats)
Iteration no yes
Reads/sec on 8-core ~12M ~10M
GC pressure minimal minimal
Memory bound hard configurable
Concurrent gets/sets sharded, both safe sharded, both safe

When values are bytes already (proto, JSON, response bodies), both excel.

When values are Go structs that you do not want to serialize, neither is ideal — fall back to in-process maps with care.

Choose freecache when: - You want absolute predictable memory. - LRU eviction. - Per-entry TTLs in seconds.

Choose bigcache when: - You need iteration. - Cache-wide TTL is acceptable. - Slightly more flexible config.

For most use cases, either works.


Two-Tier (L1/L2) Caching

When the working set exceeds what one process can hold, split into tiers.

Pattern

  • L1: in-process, small, fast (ristretto or sharded local map).
  • L2: distributed, large, durable (Redis, Memcached, Hazelcast).
  • Upstream: the source of truth (database, downstream service).

Read path

func (c *Cache) Get(k string) (V, error) {
    if v, ok := c.l1.Get(k); ok {
        c.metrics.l1Hits.Inc()
        return v, nil
    }
    if v, err := c.l2.Get(k); err == nil {
        c.metrics.l2Hits.Inc()
        c.l1.Set(k, v)
        return v, nil
    }
    c.metrics.misses.Inc()
    v, err := c.upstream.Load(k)
    if err != nil {
        return v, err
    }
    c.l2.Set(k, v) // populate L2 first (long TTL)
    c.l1.Set(k, v) // then L1
    return v, nil
}

Trade-offs

  • L1 absorbs hot reads. p99 latency stays low.
  • L2 absorbs cross-replica reads, prevents cold-start upstream storms.
  • L1 may be stale relative to L2 (different TTLs); accept it or invalidate explicitly.

TTL sizing

  • L1 TTL: short (10 s to a few minutes). Bounded by acceptable per-replica staleness.
  • L2 TTL: long (hours). Bounded by acceptable cross-replica staleness.

Cache stampede protection

Use singleflight at both layers. L1 singleflight prevents per-replica miss waves. L2 singleflight (or Redis SETNX) prevents cross-replica miss waves.

Invalidation

On data change:

  1. Delete from L2.
  2. Publish "invalidate K" to a pub/sub channel.
  3. Each replica drops K from its L1 on receiving the message.

Without step 3, replicas serve stale L1 data until natural TTL.


Deep Dive: Read-Through with Singleflight Forgetting

When Invalidate is called during an in-flight load, you must decide: serve the result or abort?

The middle file showed group.Forget(k). Senior code typically does more:

func (c *Cache) Invalidate(k string) {
    c.mu.Lock()
    delete(c.data, k)
    c.mu.Unlock()
    c.group.Forget(k)
    // Future readers will trigger fresh loads.
}

But what about in-flight loaders? They will complete and store their value, which we just invalidated. Race.

A version-number fix:

type entry struct {
    value     V
    expiresAt time.Time
    version   uint64
}

func (c *Cache) Invalidate(k string) {
    c.mu.Lock()
    c.version++
    delete(c.data, k)
    c.mu.Unlock()
    c.group.Forget(k)
}

func (c *Cache) load(ctx context.Context, k string) (V, error) {
    startVersion := c.version
    v, err := c.upstream.Load(ctx, k)
    if err != nil {
        return v, err
    }
    c.mu.Lock()
    if c.version != startVersion {
        // Invalidation happened during load; don't store.
        c.mu.Unlock()
        return v, nil
    }
    c.data[k] = entry{value: v, expiresAt: time.Now().Add(c.ttl), version: startVersion}
    c.mu.Unlock()
    return v, nil
}

Now a load that races with an invalidation does not corrupt the cache. The next reader will trigger another load (slow), but correctness is preserved.


Deep Dive: Cache Affinity for Persistent Connections

A subtle pattern: if your service uses persistent connections (HTTP/2, gRPC), each connection talks to a specific backend instance. If your cache is per-instance, repeated requests on the same connection see the same cache state.

This naturally provides session-like behaviour: requests on a connection benefit from the cache that was warmed by previous requests on that connection.

The implication: when designing the cache, think about which traffic shares the same cache. Connection pooling, sticky routing, and load balancing all interact with cache hit ratios.

If load balancing is random per request, cache hit ratios are determined by replica count: 12 replicas means roughly 1/12 of requests find their key warm.

If load balancing is sticky-by-key, cache hit ratios approach what a single replica would see (assuming the working set fits).

For caches that are expensive to warm, sticky-by-key routing is a significant lever.


Distributed Invalidation Patterns

Building on the pub/sub idea, here are the main patterns.

Pattern 1: Fire-and-forget invalidation

publish(channel, key). Each replica drops the key. Simple, low overhead, but lossy — if a replica is disconnected, it misses the message.

Pattern 2: Sequence numbers

Every cached entry stores the version it was loaded at. On read, the application checks against a central "current version" oracle. If mismatched, re-fetch. The oracle is updated on writes.

Cost: every read does an oracle check. If the oracle is cached locally with a tiny TTL (~100 ms), the overhead is bounded.

Pattern 3: Streaming invalidations via change-data-capture

The database's binlog (MySQL) or WAL (PostgreSQL) is followed by a consumer that publishes "row K changed" events. All cache replicas subscribe.

Best for high-stakes correctness (you cannot afford to serve stale data ever). Most operational complexity.

Pattern 4: Cache leases

A reader who finds a miss "leases" the right to load the value. While the lease is held, other readers wait (or serve stale). After loading, the lease holder publishes the new value. Eliminates stampedes across replicas.

Implementation: Redis SETNX + value distribution via pub/sub.

Choosing

Most teams start with Pattern 1 (good enough) and add Pattern 3 when correctness matters more than simplicity. Patterns 2 and 4 are niche.


Deep Dive: Cache Memory Budgeting for Containers

In a Kubernetes pod with a memory limit, exceeding the limit causes OOM kills. Caches are common culprits.

Plan memory:

  • Application code: 100 MB.
  • Connection pools, goroutine stacks, runtime: 200 MB.
  • GC headroom (typically 30% above live heap): variable.
  • Cache: budget - above.

For a 2 GB pod, a 1 GB cache is borderline aggressive. GC pauses grow with heap size; at 1 GB live, you may see 50-100 ms pauses. Reserve a margin:

  • Cache: 800 MB.
  • Application + runtime: 300 MB.
  • GC headroom: 900 MB.

If your cache crashes the pod with OOM, you have too much cache, too little RAM, or a leak. Profile with pprof -inuse_space to confirm where memory is going.


Deep Dive: cgroups and Cache Sizing

If your container has a hard memory limit, the cache should know about it. Reading /sys/fs/cgroup/memory/memory.limit_in_bytes gives you the limit; use it to size the cache.

func detectMemoryLimit() int64 {
    data, err := os.ReadFile("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    if err != nil {
        return -1
    }
    var limit int64
    fmt.Sscanf(string(data), "%d", &limit)
    if limit > 1<<60 {
        return -1 // no limit set
    }
    return limit
}

func sizeCache() int64 {
    limit := detectMemoryLimit()
    if limit < 0 {
        return 100 << 20 // 100 MB default
    }
    return limit / 4 // use 25% of available memory
}

cgroup v2 uses a different path. Production code typically handles both via a library like KimMachineGun/automemlimit.


Consistent Hashing for Cache Routing

When you have multiple cache replicas and want to deterministically route a key to a specific replica (so the cache is effectively global), consistent hashing is the standard tool.

The basic idea

Place replicas on a circle (typically a 32- or 64-bit hash space). Place each key on the same circle. The replica responsible for a key is the next one clockwise.

Properties:

  • Adding/removing replicas only moves 1/N of keys (vs. (N-1)/N for naive modulo).
  • Lookup is O(log N) with a sorted ring.

Implementation in Go

import (
    "hash/fnv"
    "sort"
    "sync"
)

type Ring struct {
    mu       sync.RWMutex
    replicas []uint32
    members  map[uint32]string
}

func (r *Ring) Add(name string) {
    h := hash(name)
    r.mu.Lock()
    r.members[h] = name
    r.replicas = append(r.replicas, h)
    sort.Slice(r.replicas, func(i, j int) bool { return r.replicas[i] < r.replicas[j] })
    r.mu.Unlock()
}

func (r *Ring) Get(key string) string {
    r.mu.RLock()
    defer r.mu.RUnlock()
    if len(r.replicas) == 0 {
        return ""
    }
    h := hash(key)
    idx := sort.Search(len(r.replicas), func(i int) bool { return r.replicas[i] >= h })
    if idx == len(r.replicas) {
        idx = 0
    }
    return r.members[r.replicas[idx]]
}

func hash(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}

Virtual nodes

A real consistent hash uses virtual nodes — each replica places K (typically 100-500) points on the ring. This evens out the distribution; without virtual nodes, the load split can be very uneven.

When to use

  • L2 cache pool with multiple Redis instances and the application routes directly.
  • Any cluster where you want sticky-by-key routing.

When not to use

  • If you have a load balancer doing routing, let it route.
  • If your cache is small enough to fit on every replica, replicate everywhere instead.

Deep Dive: Sliding Window Counters for Rate-Limited Caches

A cache can act as a rate limiter: each key tracks "how many times in the last minute?" Implementation:

type WindowEntry struct {
    counts [60]uint32 // one per second of the window
    cursor int
    lastSec int64
}

func (w *WindowEntry) Increment(now time.Time) uint32 {
    sec := now.Unix()
    if sec != w.lastSec {
        diff := int(sec - w.lastSec)
        if diff > 60 {
            // reset entirely
            for i := range w.counts {
                w.counts[i] = 0
            }
        } else {
            for i := 0; i < diff; i++ {
                w.cursor = (w.cursor + 1) % 60
                w.counts[w.cursor] = 0
            }
        }
        w.lastSec = sec
    }
    w.counts[w.cursor]++
    var total uint32
    for _, c := range w.counts {
        total += c
    }
    return total
}

The cache stores these entries with TTL = "window length + slack." Reads return current count; writes increment.

Combined with the standard TTL cache, you get a rate limiter that costs O(window) per check (small) and O(1) storage per key (very small).

For high-throughput rate limiters, a single atomic.Uint64 per bucket with explicit rotation is faster. But the windowed approach is more flexible and easy to reason about.


Deep Dive: Versioned Caching

Sometimes a cached value depends on multiple inputs that can change independently. A cached "user permissions" depends on: the user's role, the team's policies, the global config. If any of those change, the cached value is stale.

Pattern: store the version of each input dimension in the cache value:

type Entry struct {
    Value         string
    UserVersion   uint64
    TeamVersion   uint64
    GlobalVersion uint64
}

func (c *Cache) Get(k string) (string, bool) {
    e, ok := c.cache.Get(k)
    if !ok { return "", false }
    cur := getCurrentVersions(k)
    if e.UserVersion != cur.User || e.TeamVersion != cur.Team || e.GlobalVersion != cur.Global {
        c.cache.Delete(k)
        return "", false
    }
    return e.Value, true
}

On a write to any input dimension, increment its version. The cache's check catches stale values without TTL waiting.

Costs: every Get does extra version checks. For high-throughput caches, those checks should themselves be cached (per-process atomic versions).


Stampede Protection at Scale

Single-flight in one process is necessary but not sufficient. At scale, you need cross-process coordination.

Distributed lock + load

func loadWithDistLock(k string) (V, error) {
    lockKey := "lock:" + k
    ok, _ := rdb.SetNX(ctx, lockKey, "1", 30*time.Second).Result()
    if !ok {
        // Someone else is loading. Wait briefly, then retry or serve stale.
        time.Sleep(50 * time.Millisecond)
        if v, ok := cache.Get(k); ok {
            return v, nil
        }
        return loadWithDistLock(k) // or give up
    }
    defer rdb.Del(ctx, lockKey)
    return upstream.Load(k)
}

Subtleties:

  • The lock TTL must be longer than the worst-case load time.
  • If the lock holder dies mid-load, the lock expires; another replica picks up.
  • Fencing tokens (monotonic version numbers) help if the lock-holder is "zombie" (network partition).

Multi-stage caching

The herd is spread across cache tiers:

  • L0: per-goroutine cache (lasts 1 s). 99% of reads.
  • L1: per-replica cache (lasts 30 s). 99% of remaining.
  • L2: distributed cache (lasts 1 h). 99% of remaining.
  • Upstream: gets 1 in 10,000 requests.

Each tier's singleflight reduces the next tier's load. Each tier's TTL is one order of magnitude larger than the previous.

Stampede observability

Instrument:

  • Loader call count (should be tiny).
  • Time spent in singleflight.Do (excluding the loader). High values mean many waiters.
  • Per-key load rate. Hot keys appear here.

A sudden spike in loader calls is the first sign of a stampede in progress.


Deep Dive: Bulkheads for Cache Loaders

A bulkhead isolates failures. For cache loaders, this means: if upstream A is slow, do not let it consume all available loader goroutines and starve loaders for upstream B.

type Cache struct {
    loaderPools map[string]chan struct{} // semaphore per upstream
}

func (c *Cache) load(upstream, k string) (V, error) {
    sem := c.loaderPools[upstream]
    select {
    case sem <- struct{}{}:
        defer func() { <-sem }()
    case <-time.After(c.loaderQueueTimeout):
        return zero, ErrLoaderBusy
    }
    return c.upstreams[upstream].Get(k)
}

Each upstream has a fixed pool of "slots." Cache loaders for that upstream must take a slot. If all slots are busy, new loaders fail fast — better than blocking forever.

This combines naturally with circuit breakers (when an upstream is consistently failing, the circuit opens and rejects all loader calls).


Deep Dive: Circuit Breakers in Cache Loaders

Circuit breakers prevent cascading failure. When an upstream is failing, the loader should not retry indefinitely.

States:

  • Closed: loaders pass through normally.
  • Open: loaders fail immediately, returning a cached value if available.
  • Half-open: occasionally allow a loader through to probe; if it succeeds, return to closed.
import "github.com/sony/gobreaker"

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{...})

func (c *Cache) load(k string) (V, error) {
    v, err := cb.Execute(func() (interface{}, error) {
        return c.upstream.Load(k)
    })
    if err != nil {
        return zero, err
    }
    return v.(V), nil
}

When the breaker is open, calls fail fast (microseconds). The cache can decide to serve stale, return an error, or return a default — application-specific.

Without a breaker, a failing upstream causes the cache's load queue to grow unboundedly, eventually crashing the process.


Deep Dive: Cache Locality and Allocator Cooperation

Modern Go (1.20+) uses an arena-based allocator for some workloads. The allocator places related allocations near each other. For a cache, this can mean entries that are inserted at similar times share cache lines.

This is mostly beneficial — a sweep that walks entries in insertion order has good locality. But for sharded caches, the allocator does not know about shard boundaries. Entries from shard 1 and shard 2 may be interleaved in memory.

Mitigations are usually not worth the effort. Go's allocator is fast and cache-friendly enough for almost any workload. Hand-rolled allocators (via runtime/internal/sys) are for very specialized cases.

If you measure poor locality (perf profiling shows high cache misses on map accesses), consider:

  • Pre-allocating values in arenas.
  • Using unsafe packed structs.
  • Storing values in []byte (bigcache-style).

These are last resorts. Profile first.


Probabilistic Early Refresh

We covered this briefly in middle.md. Here is the full XFetch algorithm.

The idea: each reader, on every read, computes a probability that they should refresh, even though the cache is still "fresh." The probability grows as the entry ages.

shouldRefresh = -delta * beta * log(rand()) > expiresAt - now

Where: - delta is the typical load duration (measured). - beta is a tuning parameter (usually 1). - rand() is uniform [0, 1]. - expiresAt - now is the remaining lifetime.

Reading: as now approaches expiresAt, the right-hand side shrinks. The left-hand side is exponentially distributed (negative log of uniform). At some point, with non-zero probability, the inequality flips and the reader triggers a refresh.

The neat property: across all readers, exactly one (on average) triggers the refresh before expiry. No collisions, no need for locks.

Implementation:

import "math"
import "math/rand/v2"

func shouldRefresh(now, expiresAt time.Time, delta time.Duration, beta float64) bool {
    if now.After(expiresAt) {
        return true
    }
    u := rand.Float64()
    if u == 0 {
        u = 1e-9
    }
    needed := float64(delta) * beta * -math.Log(u)
    remaining := float64(expiresAt.Sub(now))
    return needed > remaining
}

Track delta per-key (or globally) by observing actual load times.

In practice this combines with singleflight: when shouldRefresh returns true, you call singleflight.Do so only one refresh actually happens.


Deep Dive: TimerHeap vs TimerWheel

For caches with millions of entries, every Set scheduling a time.AfterFunc creates a runtime timer. Go's runtime maintains a heap of timers. Inserting and removing is O(log N).

For 10M entries, that is 10M timers in one heap — 23 comparisons per insert. At 100k sets/second, that is 2.3M comparisons/second just for timer math.

Alternative: timer wheel. Buckets cover ranges of expiration time (e.g. one bucket per second for the next hour). Entries are placed in the bucket matching their expiration. Insertion is O(1). The wheel rotates one bucket per second; on rotation, the current bucket's entries are expired.

type Wheel struct {
    buckets [3600]list.List // one per second, covering 1 hour
    cursor  int
}

func (w *Wheel) Add(it *item, expiresAt time.Time) {
    delta := int(expiresAt.Sub(time.Now()).Seconds())
    idx := (w.cursor + delta) % 3600
    w.buckets[idx].PushBack(it)
}

func (w *Wheel) Tick() {
    w.cursor = (w.cursor + 1) % 3600
    bucket := &w.buckets[w.cursor]
    for e := bucket.Front(); e != nil; e = bucket.Front() {
        bucket.Remove(e)
        expire(e.Value.(*item))
    }
}

Trade-offs:

  • O(1) operations.
  • Coarse granularity (1 second in the example).
  • Fixed maximum TTL (anything longer needs a hierarchical wheel).

For TTL caches with seconds-precision and ~1-hour TTL, a wheel beats a heap. For finer granularity, hierarchical wheels (a wheel-of-wheels) cover larger ranges.

Used by Netty, Linux kernel timers, and high-throughput Go caches.


Deep Dive: TTL Cache vs LRU Cache vs LFU Cache

The TTL cache is one of several similar primitives. Knowing the differences matters:

  • TTL cache: time-bounded entries; no size bound by default.
  • LRU cache: size-bounded; entries evicted by recency.
  • LFU cache: size-bounded; entries evicted by frequency.
  • LRU+TTL: both. Most common in production.
  • LFU+TTL: rarer; useful for hot-key sticky caches.

When designing, ask: - Do entries become stale by time? Add TTL. - Is memory bounded? Add LRU/LFU. - Are some keys vastly more popular? LFU helps.

A cache without TTL and without size bound is just a map. Pick at least one of the two constraints.


Graceful Shutdown of a Distributed Cache

Shutting down a cache replica should not cause user-visible errors.

Steps

  1. Stop accepting new requests (drain mode).
  2. Wait for in-flight requests to complete.
  3. Optionally: push hot keys to a snapshot store for the next instance to bootstrap.
  4. Close the cache (stops sweepers, releases connections).
  5. Exit.

Drain coordination

Common pattern in HTTP services:

srv := &http.Server{...}
go func() {
    <-signalCh
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    srv.Shutdown(ctx) // refuses new conns, waits for current to finish
    cache.Close()
}()
srv.ListenAndServe()

The cache's Close should happen after the server is fully drained, so no in-flight handler calls the cache after Close.

Snapshot for warm restarts

If the cache is large and rebuilding is expensive, serialise the live entries to disk on shutdown. The next instance reads them back on startup. Tools:

  • encoding/gob for simple struct serialisation.
  • Protocol buffers for cross-language formats.

Long-running invalidation subscriptions

If the cache subscribes to a pub/sub channel for invalidations, the subscription must end before the process exits, or you'll get errors in the broker's logs. Wire the subscription's lifecycle into the cache's Close.


Deep Dive: GC and Cache Memory Layout

The Go garbage collector visits every pointer in your heap. For a cache holding 100 million entries via map[string]*Entry, GC scans 100 million pointers per cycle. At ~10 ns per pointer scan, that is 1 second of GC time — disastrous.

Mitigations:

  1. Inline values. map[string]Entry (not map[string]*Entry). Less pointer chasing. But large values blow up map bucket sizes.
  2. Off-heap storage. Put values in []byte. The GC ignores byte contents.
  3. Pool allocations. Use sync.Pool for temporary objects.
  4. Avoid allocations in hot paths. Pre-allocate slices, reuse buffers.

For massive caches, off-heap is the only practical answer. bigcache and freecache do this. ristretto stores values as interface{} (still on-heap, but uses sharding to bound the scan time per shard).

You can measure GC impact with:

runtime.GC()
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
fmt.Printf("PauseTotalNs: %d\n", ms.PauseTotalNs)

Or, in production, the go_gc_duration_seconds Prometheus metric.


Deep Dive: NUMA-Aware Sharding

On NUMA (Non-Uniform Memory Access) machines — typical of high-end servers — RAM is split across "nodes." A CPU accessing memory on a remote node pays a 2-3× latency penalty.

For a cache, NUMA-aware design means: a goroutine running on CPU 0-15 should access shards stored in node 0's memory. CPU 16-31 should access shards in node 1's memory.

Implementation in Go is awkward because:

  • The Go scheduler does not expose CPU pinning.
  • Go's allocator does not know about NUMA.

Practical workarounds:

  • Run one process per NUMA node, with cgroups pinning CPU and memory.
  • Each process has its own cache; a thin proxy routes requests.

This is professional-level territory. For most services, ignore NUMA — the Go runtime's lack of NUMA awareness is the same lack that everyone else has, and the runtime is fast enough.


Deep Dive: False Sharing

Two atomic variables on the same cache line will contend even if they are "logically independent." This is false sharing.

type Stats struct {
    Hits   atomic.Uint64 // offset 0
    Misses atomic.Uint64 // offset 8 — same cache line!
}

Cores updating Hits and Misses simultaneously fight over the cache line. Throughput is much lower than independent updates would suggest.

Fix: pad to fill a cache line:

type Stats struct {
    Hits   atomic.Uint64
    _      [56]byte
    Misses atomic.Uint64
    _      [56]byte
}

Cache lines are 64 bytes on most modern x86. The padding pushes Misses to its own line.

For sharded caches, the equivalent is padding each shard:

type shard struct {
    mu   sync.RWMutex
    data map[string]entry
    _    [128]byte // padding
}

type Cache struct {
    shards [256]shard
}

The padding ensures that 256 shards' first cache lines do not collide. Throughput on NUMA machines can double with this single fix.


Deep Dive: Health Checks and Cache State

A service's /healthz endpoint should reflect the cache's state. Patterns:

  • Cache is filling up: healthy.
  • Cache is at capacity, eviction running: healthy.
  • Cache is empty after warmup window (e.g. 1 min): degraded.
  • Cache size growing without bound: degraded.
  • Cache loader failing > 50%: degraded.

A degraded cache may still serve requests; the health endpoint signals "we are not at full capacity, please reduce traffic if possible." This is how production systems gracefully back off rather than crash.

For Kubernetes:

  • /livez: process is up. Returns 200 unless deadlock-detected.
  • /readyz: ready to serve. Returns 200 only when cache is warm and loaders are healthy.

Differentiating liveness from readiness is critical for cold-start scenarios.


Memory-Bound Caches Beyond GC

For caches holding hundreds of millions of entries or tens of gigabytes of memory, the GC starts to be a problem even with sharding. Two paths beyond:

Off-heap allocation via []byte

Like bigcache. The entire cache's payload is a few large []bytes. GC sees a handful of objects, not millions. Pay the serialisation cost.

syscall.Mmap and manual memory

Allocate memory via mmap, bypass Go's allocator entirely. The Go GC never touches it. You manage it yourself.

import "syscall"

data, err := syscall.Mmap(-1, 0, size, syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_ANON|syscall.MAP_PRIVATE)

Pros: completely outside GC. Can be backed by a file for persistence.

Cons: you are now writing C-like code in Go. Every pointer is a uintptr. Memory safety is your problem. Bugs corrupt the entire process.

For most cases, the bigcache/freecache approach is enough. Go to mmap only when you have measured a real bottleneck.


Deep Dive: Replicated vs Partitioned Caches

Two extreme designs for multi-replica caching:

Replicated. Every replica caches every key. All replicas hold the same data. Reads are local everywhere. Memory cost = N × dataset.

Partitioned. Each key lives on exactly one replica (sharded via consistent hashing). Reads route to the responsible replica. Memory cost = dataset.

Most production setups are a hybrid:

  • Per-replica in-process cache (small subset, hot keys).
  • Distributed cache (full dataset, partitioned).
  • Pub/sub for invalidation across both.

You replicate aggressively at the L1 level (hot keys) and partition at the L2 level (full set).

When to fully replicate:

  • Read path latency must be sub-millisecond.
  • Dataset fits in single-replica memory.
  • Writes are rare.

When to fully partition:

  • Dataset exceeds single-replica memory.
  • Cross-replica consistency matters.
  • You have a load balancer that can route by key.

Deep Dive: Consistent Hashing with Bounded Loads

Naive consistent hashing can produce load skew: one replica handles 2× the average. Google's "Consistent Hashing with Bounded Loads" paper (Mirrokni et al., 2016) fixes this.

The idea: each replica has a capacity (e.g. 1.25 × averageLoad). When the natural target replica is full, the key spills to the next replica.

func (r *Ring) GetBounded(key string, capacity int) string {
    h := hash(key)
    idx := sort.Search(len(r.replicas), func(i int) bool { return r.replicas[i] >= h })
    for offset := 0; offset < len(r.replicas); offset++ {
        candidate := r.replicas[(idx+offset)%len(r.replicas)]
        name := r.members[candidate]
        if r.loadOf(name) < capacity {
            r.incrementLoad(name)
            return name
        }
    }
    return "" // all full
}

The bounded variant gives much better tail behaviour at the cost of slight complexity. For caches handling millions of keys, the win is significant.


Deep Dive: Replication Factor in Distributed Caches

Some systems (Hazelcast, Redis Cluster) support replication factor: each key is stored on N replicas for redundancy. If one dies, reads fall back to another.

Trade-offs:

  • N=1: simplest, dataset = total RAM available.
  • N=2 or 3: redundancy, dataset = totalRAM/N.
  • Higher N: rarely useful for caches.

For a cache (where data can be re-fetched from upstream), N=1 is usually fine. The "redundancy" is the upstream itself.

Exception: when the upstream is slow or expensive enough that losing a replica's cache for the time-to-rebuild is unacceptable. In that case, N=2 is the standard.


Deep Dive: Cache and Database Consistency

Caches and databases can diverge. The classic failure modes:

Read-after-write divergence. App writes to DB, but cache still has the old value. Reader sees stale data. Solution: invalidate or update the cache on every DB write.

Update-after-read divergence. Reader looks up; cache miss; reader fetches from DB; meanwhile a writer updates the DB; reader writes the now-stale value to the cache. Cache holds stale data.

The second case is subtle. Mitigations:

  • Use a "delete from cache before write" approach: writer deletes cache, then updates DB. Readers may briefly miss but never see stale.
  • Use a version number; readers only cache if their fetched version matches the current.
  • Accept some staleness within TTL bounds.

For high-stakes data, the safest pattern is:

Writer: BEGIN
        UPDATE DB
        DELETE cache K
        COMMIT

Reader: v, ok := cache.Get(K)
        if !ok {
            v := DB.Get(K)
            cache.Set(K, v)  // possibly racy with concurrent writer
            return v
        }
        return v

The race remains: a writer between cache.Set and cache returning makes the reader's value stale. For most use cases, the TTL bounds the staleness window; for high-stakes uses, skip the cache.


Observability Beyond Hits and Misses

A senior-grade cache exposes far more than hit/miss counters.

Counters

  • cache_hits_total{cache, shard}
  • cache_misses_total{cache, shard}
  • cache_evictions_total{cache, shard, reason}
  • cache_loads_total{cache, success}
  • cache_load_errors_total{cache, error_type}
  • cache_singleflight_dedup_total{cache} (how often singleflight saved a call)

Gauges

  • cache_size{cache, shard} (entries)
  • cache_memory_bytes{cache, shard} (approximate)
  • cache_goroutines{cache} (sweepers + workers)

Histograms

  • cache_get_duration_seconds{cache} — should be sub-microsecond at p50
  • cache_set_duration_seconds{cache} — sub-microsecond p50
  • cache_load_duration_seconds{cache} — depends on upstream
  • cache_sweep_duration_seconds{cache, shard} — should be < a few ms

Per-key metrics (sampled)

Per-key counters explode memory; sampling keeps it bounded. Top-K hot keys reported every minute.

Dashboards

Standard cache dashboard panels:

  • Hit ratio over time
  • p50/p99 Get duration
  • p50/p99 loader duration
  • Cache size over time
  • Eviction rate by reason
  • Per-shard load distribution

Alerts

  • Hit ratio drops > 20% in 5 min.
  • Loader p99 > 100 ms for 5 min.
  • Eviction rate > 100/s sustained.
  • Cache size approaches budget.

Deep Dive: Latency Histograms in Detail

A latency histogram lets you answer questions like "what is the p99 of a Get call?" Implementing one correctly is harder than it looks.

A naïve implementation:

var samples []float64
func record(d time.Duration) {
    samples = append(samples, float64(d))
}

Memory grows without bound. Computing percentiles requires sorting all samples.

A real histogram uses fixed-size buckets:

type Histogram struct {
    buckets []float64
    counts  []atomic.Uint64
}

func (h *Histogram) Observe(d time.Duration) {
    idx := bucketIndex(d, h.buckets)
    h.counts[idx].Add(1)
}

The bucket boundaries are exponentially spaced (e.g. 1ns, 2ns, 5ns, 10ns, ...). Memory is fixed regardless of sample count.

Computing p99: walk buckets from low to high, accumulating counts, until 99% is reached. The p99 is approximately the upper boundary of that bucket.

Prometheus's Histogram does this for you. prometheus.NewHistogramVec is the standard answer.

For ultra-low-overhead in-process histograms, github.com/HdrHistogram/hdrhistogram-go is popular.

For a senior cache:

  • Record Get duration in a histogram.
  • Record Set duration separately.
  • Record Load duration (cache miss + upstream).
  • Optionally separate hits and misses.

These four histograms answer 90% of "is the cache fast?" questions.


Deep Dive: Tracing Cache Operations

OpenTelemetry traces help debug cross-service issues. For a cache, traces should:

  • Wrap Get and Load in spans.
  • Tag with cache name, key (truncated), hit/miss.
  • Record errors.
import "go.opentelemetry.io/otel"

var tracer = otel.Tracer("cache")

func (c *Cache) Get(ctx context.Context, k string) (V, error) {
    ctx, span := tracer.Start(ctx, "cache.Get")
    defer span.End()
    span.SetAttributes(
        attribute.String("cache.name", c.name),
        attribute.String("cache.key", truncate(k, 32)),
    )
    if v, ok := c.localGet(k); ok {
        span.SetAttributes(attribute.Bool("cache.hit", true))
        return v, nil
    }
    span.SetAttributes(attribute.Bool("cache.hit", false))
    return c.load(ctx, k)
}

Avoid logging full keys (may be PII). Avoid recording every single hit (sampling). Spans are not free — at very high QPS, span creation can take 1-2 µs each.

For caches handling millions of QPS, trace only on miss or only on samples. Hits should be unobserved by tracing.


Deep Dive: Caches and Backpressure

When a cache is overwhelmed (too many concurrent misses), the loader queue grows. Eventually goroutines accumulate; eventually OOM.

Backpressure means: when the cache is overloaded, return failures fast rather than queuing indefinitely.

type Cache struct {
    loadSem chan struct{}
}

func (c *Cache) Get(k string) (V, error) {
    if v, ok := c.local.Get(k); ok {
        return v, nil
    }
    select {
    case c.loadSem <- struct{}{}:
        defer func() { <-c.loadSem }()
    case <-time.After(50 * time.Millisecond):
        return zero, ErrOverloaded
    }
    return c.load(k)
}

When the semaphore is full, additional miss requests fail fast. The client decides: retry later, serve stale, or fail.

Without backpressure, a slow upstream eventually crashes the cache process.


Deep Dive: Operational Sanity Checks

A few sanity checks to run on any production cache:

  1. Hit ratio after warmup > 90%. If lower, the cache is undersized or the workload is poorly suited.
  2. p99 Get duration < 1 ms. If higher, the cache has internal contention or GC issues.
  3. Eviction rate < cache size / TTL. If higher, the cache is under capacity pressure.
  4. Loader rate < QPS / hit_ratio. Standard relationship; deviation means double-loading or stampede.
  5. Memory usage stable across days. If growing, you have a leak or unbounded growth.
  6. Goroutine count stable across deploys. If growing, you're leaking sweepers on Close failures.

Run these once a month. They catch silent regressions before incidents.


Deep Dive: Operating Under Memory Pressure

When Go's heap approaches the GOGC threshold, GC ramps up. CPU spent on GC means less CPU for cache operations. A "fat" cache can starve its own process.

Symptoms:

  • GC pause histograms in runtime metrics rising.
  • Allocation rate constant but heap growing.
  • CPU profile shows runtime.gcMark* functions hot.

Responses:

  1. Shrink the cache (size cap or shorter TTL).
  2. Switch to off-heap storage (bigcache/freecache).
  3. Increase GOGC (more memory, less CPU).
  4. Tune GOMEMLIMIT to bound heap growth.

GOMEMLIMIT (Go 1.19+) is particularly useful: set a hard memory limit; Go's GC adapts to stay under it. A cache that respects GOMEMLIMIT is automatically bounded by environment.


Deep Dive: Cache Versioning Strategies in Detail

When data format changes, three options for cache:

Option A: full flush

// At startup:
cache.Reset()

Simplest. Brief cold start. Easy to reason about.

Option B: namespaced keys

// Bump the namespace on schema change.
key := "v3:user:" + id

Old "v2:user:*" entries naturally age out. No flush needed. Wastes memory during transition.

Option C: versioned values

type Entry struct {
    Schema int
    Data   []byte
}

if entry.Schema != currentSchema {
    return upstream.Get(k)
}

Old entries are visible but treated as stale. Most code-complex but most efficient.

Pick A for simplicity, B for graceful migration, C for advanced needs.


Deep Dive: Cache Documentation

A senior-grade cache deserves real documentation. At minimum:

  • What it caches.
  • What the cache key shape is.
  • TTL (default + per-entry).
  • Size bound.
  • Eviction policy.
  • Memory budget.
  • Stampede protections.
  • Invalidation API.
  • Failure modes.
  • Metrics exposed.

Half a page is usually enough. Future you (six months from now) will thank you for it.

Bad cache documentation: "User cache." Good cache documentation: "User profile cache. Keys: user:{id}. TTL: 5 min ± 30 s jitter. Size: 100k. LRU eviction. Memory: ~50 MB. Singleflight on misses. Negative TTL: 10 s. Invalidate on user profile update. Metrics: hit ratio, p99 Get, load duration."


Production Incident Playbooks

When something goes wrong with a cache in production, you need to know what to look at first.

Incident: hit ratio dropped suddenly

Possible causes:

  1. Code deploy changed cache keys (e.g. new prefix added).
  2. Upstream data changed; lots of invalidations fired.
  3. Cache was flushed (someone called FlushAll).
  4. New traffic pattern (e.g. a crawler) is generating unique keys.

Diagnosis:

  • Check per-shard hit ratio — is it uniformly down, or is one shard hot?
  • Check eviction rate — is the cache being flushed?
  • Check top-K hot keys — has the distribution shifted?

Quick fix:

  • Roll back the deploy if recent.
  • Pre-warm the cache with known hot keys.

Incident: cache memory ballooned

Possible causes:

  1. Eviction broken (entries not being removed).
  2. New traffic introducing very large values.
  3. Memory leak in the cache itself.

Diagnosis:

  • Check cache_size gauge over time.
  • Check cache_evictions_total{reason="expired"} — is the sweeper running?
  • Sample entry sizes.

Quick fix:

  • Restart the process (loses warm cache but releases memory).
  • Reduce the cache cap and force eviction.

Incident: upstream is overloaded

Possible causes:

  1. Stampede on a hot key.
  2. Singleflight broken or not deployed.
  3. Cache is cold (deploy or scale-up).

Diagnosis:

  • Check cache_loads_total — high?
  • Check cache_singleflight_dedup_total — should be much higher than loads if working.
  • Check time since last deploy.

Quick fix:

  • Increase TTL temporarily.
  • Throttle upstream calls (rate limit at the cache).
  • Enable degraded mode (serve stale).

Deep Dive: Choosing Between In-Process and Distributed

Decision framework:

Question In-process Distributed
Hit ratio per-replica is enough? Yes — in-process No — distributed
Cross-replica invalidation is needed? Add pub/sub Native
Cold-start latency is acceptable? Yes — in-process Distributed (cache survives restart)
Operational overhead is OK? In-process (none) Distributed (Redis ops)
Cache must outlive process? Distributed Distributed
Latency target < 100 µs? In-process Distributed (network latency dominates)
Working set > 1 process can hold? Distributed (or shrink) Distributed

For most services starting out, "in-process with optional Redis L2" is correct. Add Redis only when you have a measured cold-start problem or working-set sizing issue.

The "just use Redis everywhere" trap: you replace 100 µs cache hits with 1 ms network round trips. p99 latency suffers; if Redis hiccups, the whole service hiccups.


Deep Dive: Cache Stampede Across Multi-Replica Deploys

A particularly nasty stampede pattern: rolling deploy.

You deploy a new version. Replicas restart one at a time. Each new replica has a cold cache. The first 1000 requests it handles all miss and call upstream. With 12 replicas and 5-minute warm-up each, you have an hour of elevated upstream load.

Mitigations:

  1. L2 cache (Redis). Replica restarts only invalidate the L1; L2 is still warm. New replica fills L1 from L2, not from upstream.
  2. Cache warming. Snapshot the cache before shutdown; restore on startup.
  3. Slow rollout. Make replicas start serving traffic gradually as their cache warms.
  4. Pre-bake. Read the top-K hot keys before signalling "ready" to the load balancer.

Pre-baking is the simplest:

func (s *Server) Warmup(ctx context.Context, keys []string) {
    for _, k := range keys {
        s.cache.Get(ctx, k) // fills the cache
    }
}

// Boot sequence:
hotKeys := loadHotKeysFromConfig()
server.Warmup(ctx, hotKeys)
server.MarkReady()

The hot-key list is maintained as configuration, updated daily based on traffic analysis. The startup cost is bounded; the hit ratio is high immediately.


Deep Dive: When Caches Lie

A cache returning stale data can cause hard-to-debug failures.

Example: a user changes their password. The auth service writes the new hash to the database. Replica A's cache is invalidated; the user can log in. Replica B's invalidation is dropped (network blip). Replica B serves the old password hash; the user's new password is rejected on requests routed to B.

Detective work:

  • Sometimes works, sometimes fails — classic distributed-cache symptom.
  • Affects only some replicas — confirms cache, not database.
  • Goes away after TTL expires — definitively cache.

Prevention:

  • Use sequence numbers (version-tag every cached entry; mismatch → re-fetch).
  • Use stronger invalidation (synchronous broadcast with ACKs).
  • Use sticky routing (same user always hits the same replica).

For high-stakes data (auth, billing, permissions), accept the latency cost of skipping the cache or using strict-consistency caches like Hazelcast.


Deep Dive: Cache Compaction

Bigcache and freecache do not move entries once placed. Over time, deletion creates "holes" in the byte buffer that cannot be reused for entries of different size. The buffer fragments.

Compaction strategies:

  • None. Accept fragmentation. Bigcache and freecache do this; their FIFO/LRU eviction reclaims space at the buffer's tail naturally.
  • Manual. Periodically read all live entries, copy to a fresh buffer, swap. Expensive (locks the cache during the copy).
  • Inline. When an entry is deleted, mark its space as a "hole"; future inserts try to fill holes first. Complex bookkeeping.

For most workloads, fragmentation is not a measured problem. The simple FIFO design self-heals over time as entries age out.

For long-lived caches with mixed entry sizes and frequent deletes, you may need explicit compaction. This is mainly a concern for off-heap caches.


Deep Dive: Snapshot Compatibility Across Versions

If you serialise the cache to disk for warm restarts (across deploys), the serialisation format must be forward and backward compatible.

Strategies:

  • Versioned snapshots: each snapshot has a version header; mismatched versions are discarded (cold start).
  • Schema migration: write code to upgrade old snapshots to the new schema.
  • Length-prefixed records: skip records the reader does not understand.

For caches, the typical answer is versioned, with cold-start fallback. If the snapshot is from an old version, throw it away. The cost is one cold start per upgrade — acceptable for most services.


Tuning Stories

Story 1: the 99.95% hit ratio cache

A team had a cache with 99.95% hit ratio. They were proud. We looked at the data: 99.95% hit ratio with cache size of 1 GB. 99.5% hit ratio with cache size of 100 MB.

The marginal benefit of 10× memory was 0.45% additional hit ratio. The cost was 1 GB of RAM and significantly larger GC pauses.

We shrunk the cache to 100 MB. Latency p99 improved (less GC). Cost dropped. The "lost" hit ratio cost the upstream ~20 extra requests per second — easily absorbed.

Lesson: measure marginal benefit, not absolute hit ratio.

Story 2: the perfectly synchronised stampede

A batch job inserted 50,000 entries into the cache at midnight with TTL = 1 hour. At 1:00 AM, all 50,000 entries expired in the same second. The 5,000 active users at that minute triggered 5,000 misses in 1 second. Upstream was tuned for ~500 QPS; it died.

Fix: add 10% jitter (ttl + rand(0..6 minutes)). The herd flattened across 6 minutes; upstream peaked at ~15 QPS.

Lesson: jitter is not optional when entries are inserted in bulk.

Story 3: the per-request cache

A team built a "cache" that was instantiated per HTTP request. Each request created a cache, used it for ~5 calls, and discarded it. The "cache" was effectively a map.

Lock contention was zero (no sharing). Memory grew with concurrency. Hit ratio was 0% across requests.

We replaced it with a real shared cache. Latency improved 10×.

Lesson: check whether your cache is actually shared before tuning anything.

Story 4: the sweep that took 3 seconds

A 20M-entry cache used a tick sweeper. Each sweep held the lock for 3 seconds. Every minute, latency p99 spiked.

We migrated to per-shard sweepers with heap-based scheduling. Sweep duration per shard dropped to <10 ms. Latency p99 became flat.

Lesson: for large caches, sweep design matters as much as data structure.

Story 5: the leaked goroutine

A test suite ran 10,000 cases. Each created a cache and used it. None called Close. After the suite, the process held 10,000 sleeping sweeper goroutines.

CI started OOM-killing the test process. We added defer c.Close() to every test. Memory dropped 100×.

Lesson: test infrastructure leaks goroutines silently; ALWAYS Close.


Deep Dive: Failure Modes Under Adversarial Traffic

A cache designed for normal traffic can become a weapon under adversarial traffic.

Cache fill attack. Attacker generates a stream of unique keys; cache fills with garbage; legitimate hot keys are evicted. Mitigation: admission policy (TinyLFU), per-source rate limiting.

Cache key collision attack. Attacker exploits hash function properties to generate keys that all collide on one shard. Shard's lock becomes the bottleneck. Mitigation: randomised hash seed (maphash), shard-load monitoring with alerts.

Negative cache poisoning. Attacker stores fake "not found" entries (via crafted upstream responses?) — unlikely directly, but combined with negative caching, an attacker who can return a "not found" can lock out legitimate access for the TTL. Mitigation: short negative TTLs, fast invalidation.

Slow loader attack. Attacker triggers misses on keys whose loaders are slow (e.g. expensive joins). Each miss costs significant upstream work. Mitigation: per-loader timeouts, circuit breakers, prioritised loading.

Stampede attack. Attacker times requests to coincide with a hot key's expiration. Mitigation: jitter, singleflight, refresh-ahead for known hot keys.

For any cache exposed to untrusted traffic, all of these must be considered. Internal-only caches rarely need to worry about most of them.


Deep Dive: Capacity Planning

Sizing a cache requires:

  1. Working set size. How many distinct keys are accessed in a typical window? Sample logs.
  2. Per-entry size. Average bytes of key + value + overhead.
  3. Memory budget. How much RAM can the cache use without affecting GC or OOMs?
  4. Hit ratio target. What hit ratio justifies the cache?

A simple formula:

cacheBytes = workingSetSize * perEntrySize * (1 + slack)

Slack accounts for GC overhead, map load factor, etc. 30% is a reasonable starting point.

Validate by running with a smaller cache and measuring the hit ratio at different sizes. Plot hit ratio vs size; pick the point where the curve flattens.

Don't over-provision. A 10× larger cache that improves hit ratio from 95% to 97% is rarely worth the memory.


Deep Dive: Cache as a Pull-Through Buffer

Sometimes the "cache" pattern blends into "buffer." Think of a service that periodically polls an upstream for fresh data and serves it to clients.

type Buffer struct {
    p   atomic.Pointer[Snapshot]
    ttl time.Duration
}

func (b *Buffer) Refresh() {
    for {
        s, err := b.fetch()
        if err == nil {
            b.p.Store(s)
        }
        time.Sleep(b.ttl)
    }
}

func (b *Buffer) Get(k string) (string, bool) {
    s := b.p.Load()
    if s == nil {
        return "", false
    }
    return s.Data[k], true
}

Reads are lock-free. Refreshes happen on a fixed cadence regardless of read pressure. This is suitable for:

  • Configuration data (rarely changes, read often).
  • Lookup tables (countries, currencies, taxonomies).
  • Slowly-changing dimensions in analytics.

It is not suitable for:

  • High-churn data (every refresh rebuilds the entire snapshot).
  • Per-key invalidation (you cannot invalidate one key, only the whole snapshot).

Senior Anti-Patterns

Anti-pattern 1: pretending the cache is the source of truth

A cache is never authoritative. The moment you start relying on cached data for correctness ("if it is in cache, it must be valid"), you have a time bomb. Always treat cached data as "possibly stale, possibly missing."

Anti-pattern 2: caching everything

"Just cache it" is not an answer. Some data is cheap to fetch; caching adds latency (cache lookup) without saving much. Some data changes too quickly; cache hits return wrong answers. Profile, decide deliberately.

Anti-pattern 3: one giant cache

A single cache holding "users, sessions, products, configs" is a maintenance nightmare. Different data has different TTLs, different sizes, different access patterns. Have one cache per logical concern.

Anti-pattern 4: cache invalidation by name match

for k := range cache.data {
    if strings.HasPrefix(k, "user:42:") {
        cache.Delete(k)
    }
}

This is O(N) and holds the cache lock. For a million-entry cache, it stops the world for hundreds of ms. Use secondary indexes (per-prefix index) if you need prefix invalidation.

Anti-pattern 5: no observability

A cache without metrics is a cache you cannot diagnose. Always export at minimum hits, misses, and size.

Anti-pattern 6: relying on Go's defaults forever

Default sweep interval, default singleflight, default shard count, default everything. The defaults are good starting points; they are rarely optimal for your specific workload. Measure, tune, document.

Anti-pattern 7: caching error responses for too long

A 500 from upstream gets cached for 5 minutes. Upstream recovers in 30 seconds. The cache keeps returning the 500. Use very short negative TTLs (1-10 s) for errors.

Anti-pattern 8: ignoring the cache during testing

"It is just a cache, tests don't need it." Until tests fail because the cache is hiding a bug. Always test the cache path explicitly.


Deep Dive: The CAP Implications of Distributed Caches

A distributed cache is a database. As soon as it has multiple replicas serving reads and writes, CAP applies.

  • Consistency. Do all readers see the same value at the same time?
  • Availability. Do reads succeed even when some replicas are down?
  • Partition tolerance. Can the system continue when the network splits?

A cache is typically tuned for AP: prioritise availability and partition tolerance over consistency. A stale read is acceptable; an outage is not.

This shapes design choices:

  • Invalidations are best-effort, not transactional.
  • Replicas serve independently; cross-replica consistency is eventual.
  • TTL is the primary correctness bound. If you cannot tolerate TTL-old data, the cache is the wrong tool.

If you find yourself wanting strong consistency from a cache, you probably want a database with a read replica instead. The cache is the optimization, not the data store.


Deep Dive: Lock-Free Reads via atomic.Pointer[V]

For caches where every read is "give me the latest snapshot of X," atomic.Pointer (Go 1.19+) is gold.

type Snapshot struct {
    Data map[string]string
}

type Cache struct {
    p atomic.Pointer[Snapshot]
}

func (c *Cache) Get(k string) (string, bool) {
    s := c.p.Load()
    if s == nil {
        return "", false
    }
    v, ok := s.Data[k]
    return v, ok
}

func (c *Cache) Refresh(data map[string]string) {
    s := &Snapshot{Data: data}
    c.p.Store(s)
}

Reads cost one atomic load (a few nanoseconds). Writes rebuild the entire map and atomically swap.

When this shines:

  • Read-only "configuration" data with periodic refresh.
  • Look-up tables that change rarely (1 refresh per minute) but are read millions of times per second.

When this fails:

  • Many small writes (each is a full rebuild).
  • Very large maps (rebuild cost is O(N)).

Patterns used by config-distribution systems, feature-flag clients, and dns-resolver caches.


Deep Dive: Sharded Hot-Key Detection

You suspect some keys are hot but you don't know which. Identifying them at runtime, with bounded overhead, is a classic streaming problem.

Approximate top-K via count-min sketch + min-heap. Sample every Nth request. Hash the key into a count-min sketch. Periodically scan the sketch for high-count keys and update a min-heap of the top K. The heap gives you "the K most popular keys observed in the last interval."

type HotKeyDetector struct {
    sketch *CountMinSketch
    topK   *minHeap
    interval time.Duration
}

func (d *HotKeyDetector) OnAccess(key string) {
    if rand.IntN(100) != 0 { // sample 1%
        return
    }
    count := d.sketch.Add(key, 1)
    d.topK.maybeAdd(key, count)
}

After detection, you can react:

  • Auto-promote hot keys to a goroutine-local cache.
  • Auto-warm hot keys with longer TTL.
  • Auto-singleflight hot keys with stricter rules.

This pattern shows up in caches at Facebook, Pinterest, Twitter — wherever a hot-key load imbalance can take down a shard.


Deep Dive: Read-Through Caches and Connection Pools

When the cache is a read-through, the loader function holds a connection to the upstream. If the cache sees a burst of misses, the upstream connection pool can be exhausted.

The interaction:

  • Singleflight reduces the burst — only one connection per missing key.
  • But across many missing keys, you can still hit the pool's cap.

Mitigations:

  • Increase the connection pool.
  • Add a per-loader semaphore: at most M loaders can run simultaneously.
  • Pre-fetch warm keys in batches.
type Cache struct {
    sem chan struct{}
}

func (c *Cache) Load(k string) (V, error) {
    c.sem <- struct{}{}
    defer func() { <-c.sem }()
    return c.upstream.Get(k)
}

The semaphore is a fixed-capacity channel. Calls block until a slot is free. This bounds the cache's pressure on the upstream regardless of incoming traffic.


Deep Dive: Persistent vs Ephemeral Caches

Some caches outlive their process. Reasons:

  • Fast restart: avoid cold-cache penalty.
  • Survive crashes: don't lose minutes of accumulated heat.
  • Cross-process sharing: multiple processes on the same host share the cache.

Implementations:

  • Disk-backed: BoltDB, Badger, mmap'd files. Slower than memory; survives restart.
  • Shared memory: POSIX shm. Multiple processes share. Tricky safety.
  • Memcached/Redis sidecar: external process, low-latency local network.

For most Go services, ephemeral in-process + Redis L2 is the right balance. Disk-backed in-process caches are common in agents and CLIs that restart often.

If your cache truly needs persistence, evaluate whether you should be using a database with a cache, not a cache with persistence.


Deep Dive: TTL Granularity Trade-offs

Most caches store expiresAt as time.Time (24 bytes including monotonic). On caches with billions of entries, that is gigabytes of timestamp metadata.

Compaction options:

  • Seconds since epoch (int64). 8 bytes. Loses monotonic time. Sub-second precision sacrificed.
  • Seconds since cache start (uint32). 4 bytes. Lasts ~136 years from cache start. Saves billions of bytes.
  • Bucket-by-time. All entries inserted in the same second share an expiration. Saves space but coarse.
  • TTL group. Many entries share a TTL profile (e.g. "1 hour"). Store a pointer to the profile, not the timestamp.

These are professional-level tricks. We mention them here because at senior interviews, candidates are sometimes asked how to compress entry metadata.


Deep Dive: Cache Coherence Protocols

When multiple replicas cache the same key, they form a tiny distributed system. The coherence protocol describes how they stay (eventually) consistent.

Write-invalidate. On write, broadcast "invalidate K" to all replicas. Each replica drops its copy. Next read fetches fresh. Lower bandwidth (small messages). Most common.

Write-update. On write, broadcast the new value. Replicas update in place. Higher bandwidth. Lower read-after-write latency.

Lease-based. Each cached entry holds a lease from a central authority. When the data changes, the authority recalls the lease. Best consistency, highest infrastructure cost.

Memcached, Redis, and most in-process caches use write-invalidate. The brevity of "drop K" messages keeps the cost low.


Deep Dive: When Adding a Cache Makes Things Worse

A list of cautions from real production:

  1. Cache hides bugs. A service returns wrong data; the cache makes it look correct for a while. By the time you notice, the bug is deep.
  2. Cache amplifies bugs. A service writes a wrong value; the cache propagates it to thousands of readers. With a 1-hour TTL, the wrong value lives for an hour.
  3. Cache shifts load. Without a cache, the database handled 10k QPS spread evenly. With a cache, the database handles 1k QPS spiking to 50k during cold starts. The peak is much harder to provision for.
  4. Cache creates correlation. Without a cache, replicas were independent. With a cache, all replicas miss together on a popular key's expiration and stampede the database simultaneously.
  5. Cache hides slowness. A 5-second database query is hidden behind a 50µs cache hit. When the cache misses, p99 jumps to 5 s. The system "appears" fast but has a tail.

All fixable, but a cache is not a free win. Each addition needs a deliberate cost-benefit analysis.


Deep Dive: Negative Caching at Scale

Negative caching prevents the "cache miss flood" from invalid keys. At scale, it has its own pitfalls.

Problem: an attacker bombards your service with /get?id=<random>. Each request misses the cache, calls upstream, gets "not found," and caches the absence. Negative cache grows without bound.

Defence 1: bound size. LRU on the negative cache. Old "not found" entries fall out.

Defence 2: bloom filter. Test "have I ever seen this key as not-found?" with a probabilistic data structure. False positives skip the upstream call (and possibly miss a real new key). Accept the trade.

Defence 3: short TTL. Negative entries live 1-5 s. Bounds memory growth.

Defence 4: rate limit. Cap the rate of negative-cache insertions.

For Internet-facing services, all four are typical. Without them, negative caching becomes an attack vector.


Deep Dive: Cache Versioning Across Deploys

When you deploy a new version of your service, the data shape may change. Old cache entries become invalid.

Strategies:

  • Versioned keys. Prefix every key with a version: v3:user:42. New deploy uses v4; old entries are unreachable and naturally expire.
  • Cache flush on deploy. Drop everything on startup. Cold cache for a few minutes.
  • Schema-tagged values. Every cached value carries its schema version. Readers check and re-fetch if mismatched.

Versioned keys are the simplest and most common. The cost is wasted memory for old-version entries during the transition (until they expire).


Deep Dive: Async Loading Patterns

For high-throughput caches with slow loaders, async loading reduces tail latency.

Pattern: a request that misses returns a stale value (if any) immediately, while a background goroutine loads the fresh one.

func (c *Cache) Get(k string) (V, bool) {
    e, ok := c.data[k]
    if ok && !time.Now().After(e.expiresAt) {
        return e.value, true // fresh
    }
    if ok && time.Now().Before(e.staleUntil) {
        // serve stale + trigger async load
        go c.refresh(k)
        return e.value, true
    }
    // No usable data; block on synchronous load.
    return c.syncLoad(k)
}

The reader experiences cache-hit latency even on misses, as long as the previous value is recent enough to be acceptably stale. This is essentially the SWR pattern.

For caches with strict freshness requirements, do not use. For caches where "slightly stale" is fine, this is a massive p99 win.


Deep Dive: Cache Capacity Curves

A useful exercise: chart hit ratio vs cache size for your workload. The shape tells you everything.

Sharp knee. Hit ratio rises rapidly to ~99%, then flat. Working set is small; you only need enough cache to hold it.

Gradual. Hit ratio rises linearly with size. Long tail of moderately popular keys; bigger cache means better ratio.

Plateau then climb. Two distinct populations; cache must be big enough for both.

You can generate the curve via simulation: replay 1 hour of traffic against caches of various sizes; record hit ratios. Plot.

Example results from a real service:

  • 10 MB: 50% hit ratio.
  • 100 MB: 85% hit ratio.
  • 1 GB: 95% hit ratio.
  • 10 GB: 97% hit ratio.

The marginal benefit of going from 1 GB to 10 GB is 2%. Probably not worth 10× the memory. Stop at 1 GB.

This curve should drive your cache size, not folklore. "Bigger is better" is the wrong default.


Deep Dive: Concurrency Profile of Your Cache

Knowing your read/write ratio matters more than knowing absolute QPS.

Measure:

  • Reads per second.
  • Writes per second.
  • Ratio.

If reads:writes = 1:1, your cache is unusual — usually a "buffer" rather than a true cache. Consider whether you really need it.

If reads:writes = 10:1, RWMutex is excellent.

If reads:writes = 100:1, sync.Map or atomic.Pointer snapshot beats RWMutex.

If reads:writes = 1000:1, atomic pointer to immutable snapshot is the gold standard.

Most production caches fall in the 10:1 to 100:1 range. Tune accordingly.


Deep Dive: When You Inherit a Bad Cache

You join a team. The cache is a mess. Common signs:

  • Mixed responsibilities (one cache for users, sessions, configs).
  • No metrics.
  • Inconsistent TTLs across functions.
  • Global mutex (no sharding).
  • No singleflight, no jitter.
  • Memory grows unbounded.

Step-by-step rescue:

  1. Add metrics first. You cannot improve what you cannot measure. Add hit/miss counters, latencies, size, evictions.
  2. Measure for a week. Understand current behaviour before changing.
  3. Add singleflight + jitter. Almost always a win. Low risk.
  4. Add size bound. Even a generous one prevents OOMs.
  5. Split caches by concern. One per logical resource. Easier to tune and debug.
  6. Add sharding. Only if measured contention.
  7. Migrate to a library (ristretto, freecache) if appropriate.

Doing this carefully takes weeks. Doing it all in a weekend leads to outages. Each step should be reviewable, revertible, and measured.


Deep Dive: Race Detection in Cache Tests

-race is mandatory but not always sufficient. Some races only manifest under specific schedules.

Beyond -race:

  • Stress tests. Spawn 1000 goroutines, run 100k operations each, mixed reads/writes. If the cache is unstable, this surfaces it.
  • GODEBUG=asyncpreemptoff=1. Forces deterministic preemption, exposes races that depend on scheduling.
  • Multiple runs. Run the test suite 100 times. If it fails 1 in 50 runs, you have a race that -race missed.
  • Schedule perturbation. Add runtime.Gosched() at various points to alter scheduling.

Some races are fundamentally unobservable in race-free executions. Code review and reasoning about happens-before relations are still required.

A cache that "feels right" but has a missing happens-before edge will bite you in 3 months at peak traffic.


Deep Dive: Cache and Service Mesh Integration

In a service mesh (Linkerd, Istio), some caching is moved to the proxy layer. The mesh can:

  • Cache responses at the sidecar.
  • Apply uniform TTL policies via configuration.
  • Provide observability without app changes.

Trade-offs:

  • Pro: app doesn't manage cache code.
  • Pro: cache can be uniformly applied across services.
  • Con: cache is opaque to the app; debugging is harder.
  • Con: every cache hit goes through the mesh proxy (still local, but extra hop).

For some teams, mesh-level caching is the right answer. For others, app-level caching gives finer control. Many services use both: mesh handles HTTP-level response caching, app handles object-level caching.


Deep Dive: Multi-Tier Coordination

In a true multi-tier system, write coordination matters:

L1 (per-replica) -> L2 (cluster) -> upstream

On a write that the application initiates:

  1. Write upstream (must succeed).
  2. Invalidate L2.
  3. Publish "drop this key" to L1s.

Why this order? If L2 is invalidated before upstream is updated, a concurrent reader could populate L2 with the old upstream value, defeating the invalidation.

Order:

write upstream → success → delete from L2 → publish invalidation → L1 drops

If any step fails, the next reader may see stale data until TTL. Decide whether to retry or accept eventual consistency.

The same logic applies for cache populates: write to L2 first (longer TTL), then L1 (shorter TTL). Stale L1 entries fall out faster.


Deep Dive: Cache Stress Testing

You designed a cache. Did you stress test it?

Standard scenarios:

  1. Steady state. Hit ratio at expected level, latency stable. Confirms baseline.
  2. Hot key. 80% of traffic to one key. Hit ratio should be ~100%, p99 should not spike.
  3. Wide hot set. Top 10% of keys account for 90% of traffic. Hit ratio ~90%.
  4. Scan workload. Every request is for a new key. Hit ratio ~0%; admission policy keeps hot set protected.
  5. TTL boundary. All entries inserted at the same time; their TTLs expire simultaneously. Stampede should be coalesced by singleflight + jitter.
  6. Cold start. Empty cache, request flood. Upstream should be protected.
  7. Restart. Process restarts mid-traffic. Cache resumes correctly.
  8. Slow upstream. Loader takes 1 second; concurrent misses should not multiply that.

Run each under go test -race with t.Parallel(). Measure: hit ratio, latency p50/p99, upstream load.

A cache that passes all eight is production-ready.


Deep Dive: Adaptive TTL

A constant TTL works for steady-state systems. For systems where freshness needs change with load or time, adaptive TTL is interesting.

The idea: increase TTL when upstream is slow or loaded; decrease when upstream is fast and idle.

func (c *Cache) effectiveTTL() time.Duration {
    upstreamLoad := c.upstreamMetrics.Load()
    if upstreamLoad > 0.8 {
        return 2 * c.baseTTL
    }
    if upstreamLoad < 0.2 {
        return c.baseTTL / 2
    }
    return c.baseTTL
}

When the upstream is at 80% utilization, we double TTL — fewer misses, more reuse, less load. When it is idle, we shorten TTL — fresher data, no waste.

This is a feedback loop. Tune the bands carefully or you'll oscillate. Production examples include Netflix's Hystrix and some CDN designs.

For most caches, fixed TTL is fine. Adaptive is for systems where staleness vs load trade-off changes over time.


Deep Dive: Cache as a Coordination Primitive

A subtle use of a TTL cache: distributed coordination via shared keys.

Scenario: a fleet of workers each want to claim "tasks." Each task should be processed by exactly one worker. A TTL cache acts as a lease:

func (w *Worker) tryClaim(taskID string) bool {
    // Set if absent: returns true if we claimed it.
    return c.SetIfAbsent("lease:"+taskID, w.ID)
}

If the worker dies before completing, the TTL expires and another worker can claim. If the cache is shared (e.g. Redis), this is a distributed lease.

For in-process caches, this only works within one process. For multi-process, the cache must be external.

The pattern shows up:

  • Cron job leadership: one node runs the cron; others sleep.
  • Singleton workers: only one instance does X.
  • Resource pools: keys represent acquired resources.

Caveats:

  • Lease holders may not finish in time → another worker also processes.
  • Network partitions can cause double-claims.
  • Clock skew can prematurely release.

For high-stakes coordination, use a real consensus system (etcd, Consul). For best-effort, a Redis-backed TTL cache is often enough.


Worked Example: Multi-Region Edge Cache

Imagine a system with 5 regions, each running 10 backend replicas, plus a central origin.

Cache hierarchy:

  • L0 (per goroutine): tiny, atomic-based, lasts 1 s. For hottest keys.
  • L1 (per replica): ristretto, 100 MB, 5-minute TTL with jitter.
  • L2 (per region): Redis cluster, 100 GB, 1-hour TTL.
  • L3 (global): origin database.
package edgecache

import (
    "context"
    "errors"
    "sync"
    "time"

    "github.com/dgraph-io/ristretto"
    "github.com/redis/go-redis/v9"
    "golang.org/x/sync/singleflight"
)

type Cache struct {
    l1       *ristretto.Cache
    l2       *redis.ClusterClient
    upstream Upstream
    group    singleflight.Group
    ttlL1    time.Duration
    ttlL2    time.Duration
}

type Upstream interface {
    Get(ctx context.Context, key string) ([]byte, error)
}

func New(l1 *ristretto.Cache, l2 *redis.ClusterClient, u Upstream) *Cache {
    return &Cache{
        l1: l1, l2: l2, upstream: u,
        ttlL1: 5 * time.Minute,
        ttlL2: 1 * time.Hour,
    }
}

func (c *Cache) Get(ctx context.Context, key string) ([]byte, error) {
    if v, ok := c.l1.Get(key); ok {
        return v.([]byte), nil
    }
    v, err, _ := c.group.Do(key, func() (interface{}, error) {
        if v, err := c.l2.Get(ctx, key).Bytes(); err == nil {
            c.l1.SetWithTTL(key, v, int64(len(v)), c.ttlL1)
            return v, nil
        } else if !errors.Is(err, redis.Nil) {
            // Redis errored — fall through to upstream.
        }
        v, err := c.upstream.Get(ctx, key)
        if err != nil {
            return nil, err
        }
        c.l2.Set(ctx, key, v, c.ttlL2) // best-effort
        c.l1.SetWithTTL(key, v, int64(len(v)), c.ttlL1)
        return v, nil
    })
    if err != nil {
        return nil, err
    }
    return v.([]byte), nil
}

func (c *Cache) Invalidate(ctx context.Context, key string) {
    c.group.Forget(key)
    c.l1.Del(key)
    c.l2.Del(ctx, key)
    // Publish to other regions:
    c.publishInvalidation(ctx, key)
}

func (c *Cache) publishInvalidation(ctx context.Context, key string) {
    _ = c.l2.Publish(ctx, "invalidations", key).Err()
}

func (c *Cache) listenInvalidations(ctx context.Context) {
    sub := c.l2.Subscribe(ctx, "invalidations")
    defer sub.Close()
    for {
        select {
        case msg := <-sub.Channel():
            c.l1.Del(msg.Payload)
        case <-ctx.Done():
            return
        }
    }
}

// Pool wraps multiple Cache instances behind consistent hashing.
type Pool struct {
    mu     sync.RWMutex
    caches map[string]*Cache
}

About 100 lines for the local cache. The pub/sub listener (listenInvalidations) runs in a background goroutine, decoupled from request flow.

Trade-offs we accept:

  • Set to L2 is best-effort. If Redis is down, we still cache locally.
  • Invalidations are pub/sub (lossy). A short L1 TTL bounds the staleness.
  • Singleflight is per-replica, so 5 regions × 10 replicas = up to 50 upstream calls per stampede. Acceptable for our upstream's QPS budget.

For higher correctness, add a distributed lock around the upstream call.


Deep Dive: Sharded Singleflight at Scale

We mentioned sharded singleflight in middle.md. At senior scale, the design becomes more nuanced.

type ShardedGroup struct {
    shards [256]struct {
        g singleflight.Group
        _ [56]byte // cache-line padding
    }
}

func (s *ShardedGroup) Do(key string, fn func() (interface{}, error)) (interface{}, error, bool) {
    return s.shards[hashStr(key)&255].g.Do(key, fn)
}

The padding prevents false sharing of adjacent group mutexes. On a 256-shard layout, this adds a few KB of memory but eliminates a class of cache-coherence stalls.

At extreme scale you may also want per-shard metrics:

type ShardedGroup struct {
    shards [256]struct {
        g       singleflight.Group
        called  atomic.Uint64
        dedups  atomic.Uint64
        _       [40]byte
    }
}

Now you can see "shard 17 is doing 10× the dedup work" — which usually means a hot key has crystallised in that shard.


Deep Dive: The TIE-Heap Variant

Standard min-heaps push and pop one entry at a time. For TTL caches with bursty insertions, a batched heap reduces overhead.

The idea: collect inserts in a small buffer. When the buffer fills or a configurable time passes, batch-insert into the heap. Push individual items has cost O(log N); batch-insert of k items can be done in O(N + k log k) for very large k, which is rarely a win, but for small k (10-100) with O(log N) per item the difference is small.

The real win is reduced lock acquisition: one lock acquire amortises k inserts.

Implementation:

type BatchedHeap struct {
    mu    sync.Mutex
    heap  expHeap
    buf   []*item
    flush chan struct{}
}

func (b *BatchedHeap) Push(it *item) {
    b.mu.Lock()
    b.buf = append(b.buf, it)
    if len(b.buf) >= 100 {
        b.flushLocked()
    }
    b.mu.Unlock()
}

func (b *BatchedHeap) flushLocked() {
    for _, it := range b.buf {
        heap.Push(&b.heap, it)
    }
    b.buf = b.buf[:0]
}

A periodic ticker flushes the buffer even when partially full, so latency stays bounded.


Deep Dive: Approximate Counting with Count-Min Sketch

TinyLFU and many cache observability features use approximate counting via count-min sketch.

The idea: instead of an exact counter per key (map[string]int, memory O(N)), use D hash functions into a 2D array of W counters. Update: increment one counter per hash function (D writes). Query: return the minimum across the D corresponding counters.

type CountMinSketch struct {
    width, depth int
    counters     [][]uint32
    seeds        []uint32
}

func (c *CountMinSketch) Add(key string, n uint32) uint32 {
    h := hashString(key)
    minCount := uint32(math.MaxUint32)
    for i := 0; i < c.depth; i++ {
        idx := (h + uint32(i)*c.seeds[i]) % uint32(c.width)
        atomic.AddUint32(&c.counters[i][idx], n)
        v := atomic.LoadUint32(&c.counters[i][idx])
        if v < minCount {
            minCount = v
        }
    }
    return minCount
}

func (c *CountMinSketch) Estimate(key string) uint32 {
    h := hashString(key)
    minCount := uint32(math.MaxUint32)
    for i := 0; i < c.depth; i++ {
        idx := (h + uint32(i)*c.seeds[i]) % uint32(c.width)
        v := atomic.LoadUint32(&c.counters[i][idx])
        if v < minCount {
            minCount = v
        }
    }
    return minCount
}

Memory: width * depth * 4 bytes. For a typical 10M-entry cache with width=10M, depth=4, that is 160 MB — fine for a server, too much for an embedded device.

Sketches are approximate: they may overcount due to collisions but never undercount. For "is this a hot key?" decisions, overcounting is a false positive (treats a cold key as hot) — usually acceptable.


Deep Dive: Aging Frequencies

A pure count gives more weight to keys that were popular long ago than to keys popular now. To prefer recent popularity, age the counts.

In TinyLFU: every N events, halve every counter (c >> 1). Recent events count for more in relative terms.

func (c *CountMinSketch) Age() {
    for i := range c.counters {
        for j := range c.counters[i] {
            atomic.StoreUint32(&c.counters[i][j], atomic.LoadUint32(&c.counters[i][j])/2)
        }
    }
}

Trigger aging when the sum-of-counts exceeds a threshold (proportional to cache size). The result: an exponentially weighted moving average of access frequency.

Without aging, your "hot keys" are forever defined by the first hour of traffic.


Deep Dive: Cache for High-Cardinality Time Series

When cached data is itself time-keyed (e.g. "the value at time T"), the cache becomes a sparse time series.

Patterns:

  • Key includes both entity and time bucket: metric:host42:1605555555.
  • TTL aligns with bucket size (1 minute TTL for 1-minute buckets).
  • Stampede protection is critical — every reader requesting "current minute" misses simultaneously.

For these, refresh-ahead is especially valuable: the next-minute key is pre-fetched before any reader needs it.


Deep Dive: Cache for Eventual-Consistency Reads

For systems with eventually-consistent storage (e.g. DynamoDB, Cassandra), a cache hides the inconsistency window from clients — if the cache holds a stable value, clients see that value regardless of storage replication lag.

Caveat: the cache may itself be inconsistent (different replicas hold different values). Combine with pub/sub invalidation for stronger semantics.

This is one of the few cases where a cache improves consistency (perceived, not real). Use carefully.


Deep Dive: Building Custom Caches vs Using Libraries

The eternal question: roll your own or use a library?

Roll your own when: - Educational (this Roadmap subsection). - Special semantics (your TTL behavior is unusual). - Zero dependencies required (embedded systems, CLI tools). - The library doesn't expose what you need.

Use a library when: - Standard semantics work for you. - You want battle-tested correctness. - You don't have time to maintain it.

For 90% of services, a library is the right answer. For 10%, your specific needs justify custom code.

Caveat: "I want custom because I want to optimize" is usually wrong. Libraries are typically already faster than custom code.


Deep Dive: Cache for Idempotency

A useful but underappreciated use of caches: deduplicating idempotent operations.

When a client retries a request (perhaps because of a network hiccup), you want to ensure that the same operation is not executed twice. Pattern:

func Process(idempotencyKey string, req Request) Response {
    if cached, ok := cache.Get(idempotencyKey); ok {
        return cached.(Response)
    }
    resp := doWork(req)
    cache.Set(idempotencyKey, resp, 24*time.Hour)
    return resp
}

The cache becomes a "memo" of recent operations. A retry returns the original response without re-executing.

For payment systems, order processing, anything-non-repeatable: idempotency caches are essential.


Deep Dive: Cache APIs in Other Languages

Worth knowing how other ecosystems do it, for cross-language teams:

  • Java: Caffeine. Industry-leading, asynchronous, TinyLFU-based. The Go ecosystem catches up via ristretto.
  • Rust: moka. Modeled after Caffeine. Sync and async variants.
  • Python: cachetools. Simple, single-threaded by default. Multi-threaded variants exist.
  • Ruby: lru_redux. Fast LRU; TTL is a separate add-on.
  • Node.js: lru-cache. The de facto standard; TTL support is built in.

The dominant trends are: TTL is universal, LRU is the default eviction policy, admission policies are rare outside Java and Go ecosystems.


Deep Dive: Reading Top-K from a Cache

For observability you want to know "what are the most-accessed keys right now?" Without explicit tracking, the answer requires sampling.

type TopK struct {
    sketch  *CountMinSketch
    heap    *minHeap
    mu      sync.Mutex
}

func (t *TopK) Observe(key string) {
    if rand.IntN(100) != 0 { return } // 1% sampling
    t.mu.Lock()
    defer t.mu.Unlock()
    count := t.sketch.Add(key, 1)
    t.heap.maybeAdd(key, count)
}

func (t *TopK) Top(n int) []string {
    t.mu.Lock(); defer t.mu.Unlock()
    return t.heap.topN(n)
}

The 1% sampling rate keeps overhead negligible. Counts are approximate but sufficient for "what is hot."

Wire this into the cache:

func (c *Cache) Get(k string) (V, bool) {
    c.topK.Observe(k)
    // ... rest of Get ...
}

Now cache.Top(10) answers "what are the 10 hottest keys?"

This pattern is essential for diagnosing hot-key problems in production.


Deep Dive: Cache Code Review Checklist

When reviewing a cache PR:

  • Concurrency safety: read paths use RLock or atomic; write paths use Lock or sync.Map.Store.
  • Defers: every Lock has a corresponding Unlock (or deferred Unlock).
  • Loader: wrapped in singleflight if multiple readers may miss simultaneously.
  • TTLs: jittered if entries are inserted in batches.
  • Errors: not cached unless explicitly desired with short negative TTL.
  • Metrics: at least hit/miss; ideally with latency histograms.
  • Shutdown: cache exposes a Close that stops background goroutines.
  • Tests: race detector enabled; concurrent test with N goroutines.
  • Memory: size bound or explicit reasoning for unbounded.
  • Documentation: at minimum, doc-comments on public methods.
  • No global state: cache is a struct, not package-level vars.

A cache PR that passes all these is unlikely to surprise you in production.


Worked Example: GraphQL Resolver Cache

GraphQL exposes a frequent stampede risk: one query may invoke a resolver N times for N entries. Caching at the resolver level eliminates duplicate work.

package gqlcache

import (
    "context"
    "fmt"
    "sync"
    "time"
)

type Loader[K comparable, V any] struct {
    mu        sync.Mutex
    pending   map[K][]chan result[V]
    cache     *TTLCache[K, V]
    fetch     func(ctx context.Context, keys []K) (map[K]V, error)
    batchSize int
    delay     time.Duration
}

type result[V any] struct {
    value V
    err   error
}

func NewLoader[K comparable, V any](
    cache *TTLCache[K, V],
    fetch func(ctx context.Context, keys []K) (map[K]V, error),
    batchSize int,
    delay time.Duration,
) *Loader[K, V] {
    return &Loader[K, V]{
        pending:   make(map[K][]chan result[V]),
        cache:     cache,
        fetch:     fetch,
        batchSize: batchSize,
        delay:     delay,
    }
}

func (l *Loader[K, V]) Load(ctx context.Context, key K) (V, error) {
    if v, ok := l.cache.Get(key); ok {
        return v, nil
    }
    ch := make(chan result[V], 1)
    l.mu.Lock()
    l.pending[key] = append(l.pending[key], ch)
    if len(l.pending) >= l.batchSize {
        keys := make([]K, 0, len(l.pending))
        for k := range l.pending {
            keys = append(keys, k)
        }
        pending := l.pending
        l.pending = make(map[K][]chan result[V])
        l.mu.Unlock()
        go l.dispatch(ctx, keys, pending)
    } else {
        l.mu.Unlock()
        go l.dispatchAfter(ctx, l.delay)
    }
    select {
    case r := <-ch:
        return r.value, r.err
    case <-ctx.Done():
        var zero V
        return zero, ctx.Err()
    }
}

func (l *Loader[K, V]) dispatchAfter(ctx context.Context, d time.Duration) {
    time.Sleep(d)
    l.mu.Lock()
    if len(l.pending) == 0 {
        l.mu.Unlock()
        return
    }
    keys := make([]K, 0, len(l.pending))
    for k := range l.pending {
        keys = append(keys, k)
    }
    pending := l.pending
    l.pending = make(map[K][]chan result[V])
    l.mu.Unlock()
    l.dispatch(ctx, keys, pending)
}

func (l *Loader[K, V]) dispatch(ctx context.Context, keys []K, pending map[K][]chan result[V]) {
    values, err := l.fetch(ctx, keys)
    for k, channels := range pending {
        var r result[V]
        if err != nil {
            r.err = err
        } else if v, ok := values[k]; ok {
            r.value = v
            l.cache.Set(k, v)
        } else {
            r.err = fmt.Errorf("not found: %v", k)
        }
        for _, ch := range channels {
            ch <- r
        }
    }
}

This is the DataLoader pattern, popularised by Facebook's GraphQL implementation. Within a single tick, multiple resolver invocations get batched into a single underlying call. Combined with the TTL cache, repeat calls within or across requests are served from memory.

Trade-offs:

  • Adds latency equal to delay (e.g. 5 ms) to "first" loads.
  • Wastes a goroutine if no batching happens.
  • Implementation has nuanced edge cases (cancellation propagation).

For most GraphQL services, the DataLoader + TTL cache combination cuts upstream traffic by 5-10×.


Deep Dive: When Should the Cache Outlive Its Process?

Process restarts cost a warm cache. Sometimes that is acceptable, sometimes it is not.

Acceptable: - Deployments are rare. - Upstream can absorb cold-start traffic. - Cache reaches warm within minutes.

Not acceptable: - Deployments are frequent (continuous deployment). - Upstream is slow or expensive. - Hot data takes hours to identify.

Options when not acceptable:

  1. L2 Redis. Process-local L1 dies on restart; L2 survives. Warm restart is just "fetch from L2."
  2. Disk-backed cache. Serialize cache to disk on shutdown; load on startup. Format must be stable across versions.
  3. Memory-mapped file cache. Cache lives in a file via mmap; process restarts find it intact.
  4. External cache. No in-process cache at all; Redis/Memcached is the only cache. Cold start = network round trips for everything.

Most modern services pick option 1: L2 Redis + L1 in-process. The L1 absorbs hot keys; the L2 absorbs cross-replica reads and survives restarts.


Deep Dive: Designing for the Long Run

A cache will live for years if it lives at all. Decisions made today bind your team tomorrow.

Interface stability. Methods you expose become a contract. Change them and every caller has to update. Be conservative.

Configurability. Future tuning is much easier if you have already exposed TTL, sweep interval, size cap, and metrics as options.

Testability. A cache that requires real time, real network, real database is hard to test. Design for dependency injection from day 1.

Observability. Future you cannot understand a problem without metrics. Add them now, even if you do not look at them yet.

Documentation. The next person on your team will inherit your cache without your context. Write it down.

These are not technical optimizations; they are engineering investments. They cost a small amount of effort upfront and save a large amount of effort over years.


Deep Dive: Comparing TTL Cache Implementations

A scorecard for some popular Go cache libraries:

Library Concurrency TTL precision Eviction Admission Notes
patrickmn/go-cache RWMutex nanosecond none none Simple, popular, junior-grade.
hashicorp/golang-lru/v2 RWMutex (sharded option) none in main; expirable adds LRU none Solid. Expirable variant is excellent.
dgraph-io/ristretto sharded, async writes nanosecond sampled LRU TinyLFU State of the art for most general workloads.
allegro/bigcache sharded cache-wide FIFO none Huge entry counts; bytes-only values.
coocood/freecache sharded segments seconds LRU none Hard memory cap; bytes-only values.
karlseguin/ccache bucketed nanosecond LRU none Mature, less common.
bluele/gcache wrappable various LRU/LFU/ARC none Many policies, plain map underneath.

For most teams: hashicorp/golang-lru/v2/expirable for safe general-purpose; ristretto for high-throughput.


Deep Dive: Cache Pitfalls Specific to Go

Go-specific gotchas that don't apply to caches in other languages:

  • Goroutine leaks via sweepers. Forgetting Close is the most common Go cache bug.
  • Map iteration randomness. "Pick a random key" via for-range is non-deterministic in tests.
  • Defer overhead in hot paths. Deferred Unlock adds ~25 ns per call; in microsecond-budget caches, this matters.
  • Slice aliasing on cache return. Returning a slice without copying lets callers mutate the cache.
  • time.Time monotonic stripping. Serialization to JSON drops monotonic; rare but bites you on cache snapshots.
  • GC pressure from map[string]*Entry. Pointer maps make GC scan every entry.
  • sync.Pool oddities. Pool entries can be evicted by GC at any time; not suitable as a long-term cache.

Each of these has bitten someone in production. Be aware.


Cheat Sheet

// Sharded cache
const numShards = 256
type Cache struct { shards [numShards]*shard }
func (c *Cache) shardFor(k string) *shard { return c.shards[hash(k) % numShards] }

// Per-shard sweeper
for _, s := range c.shards { go s.sweepLoop(interval) }

// Hot-key fast path
type Cache struct {
    shards [256]*shard
    hot atomic.Pointer[map[string]V]
}

// Two-tier read
v, ok := l1.Get(k); if ok { return v }
v, ok = l2.Get(k); if ok { l1.Set(k, v); return v }
v = upstream.Load(k); l2.Set(k, v); l1.Set(k, v)

// Pub/sub invalidation
rdb.Publish(ctx, "inv", key)
for msg := range sub.Channel() { l1.Del(msg.Payload) }

// Consistent hashing
ring.Add(replicaName)
target := ring.Get(key)

// XFetch
needed := float64(delta) * beta * -math.Log(rand.Float64())
if float64(remaining) < needed { triggerRefresh() }

// Ristretto
cache, _ := ristretto.NewCache(&ristretto.Config{
    NumCounters: 1e7, MaxCost: 1<<30, BufferItems: 64,
})
cache.SetWithTTL(k, v, cost, ttl)
cache.Wait()

// bigcache
bc, _ := bigcache.New(ctx, bigcache.DefaultConfig(ttl))
bc.Set(k, valueBytes)

// freecache
fc := freecache.NewCache(100 << 20)
fc.Set([]byte(k), []byte(v), ttlSeconds)

Deep Dive: Cache Hierarchy as a Tax

A 4-tier cache (per-goroutine, per-replica, per-region, global) is wonderful in theory. In practice, each tier adds:

  • Code complexity (more configuration, more failure modes).
  • Operational overhead (more dashboards, more alerts).
  • Latency variance (any tier can be slow or fail).
  • Correctness risk (more places for stale data).

Add tiers when measured pain demands it. A team that "wants" a 4-tier cache before they have measured the 2-tier cache's hit ratio is over-engineering. Most production services live with 1 or 2 tiers for their lifetime.

A reasonable evolution:

  1. Day 1: in-process map + RWMutex. 80% hit ratio.
  2. Month 3: hot keys identified; add singleflight + jitter. 95% hit ratio.
  3. Month 9: traffic 10×. Add Redis L2 to absorb cross-replica misses. 99% hit ratio.
  4. Year 2: only when consistency demands it, add pub/sub invalidation. 99.5% with bounded staleness.

Each step is one tier. Resist adding tiers preemptively.


Deep Dive: Cache Latency Budgets

For a request with a 200 ms latency budget, how much can the cache spend?

A reasonable split:

  • 10 µs: in-process cache hit.
  • 1 ms: distributed cache hit (Redis local).
  • 5 ms: distributed cache miss + L2 round trip.
  • 50 ms: full miss, upstream call.
  • 100+ ms: tail of upstream, transient errors.

The cache wins when most requests pay 10 µs and the rest pay 50 ms. p99 is dominated by misses; pulling p99 down means improving miss handling (parallelism, singleflight, refresh-ahead).

If your in-process cache itself takes 1 ms, something is wrong — possibly lock contention, possibly false sharing, possibly GC pauses. Profile it. A well-tuned cache hit should be sub-microsecond.


Deep Dive: Cache Health Signals

Beyond hit ratio, what signals should you monitor?

Healthy:

  • Hit ratio stable at expected value.
  • p99 Get duration < 100 µs.
  • Loader rate moderate, stable.
  • Eviction rate < 10/s for the cache size.
  • Memory steady near expected size.

Warning:

  • Hit ratio dropping > 5% over an hour.
  • p99 Get duration spiking > 1 ms.
  • Loader rate spiking 5×.
  • Eviction rate accelerating.
  • Memory growing without plateau.

Critical:

  • Hit ratio < 50% of expected.
  • p99 Get duration > 10 ms.
  • Loader rate > 1000/s sustained.
  • Eviction rate > 100k/s (cache thrashing).
  • Memory > budget (OOM imminent).

Most monitoring systems can derive these from your basic counters and gauges. Standard cache dashboards have all of them.


Deep Dive: When to Stop Optimizing

A cache that meets your latency target and hit ratio target is done. Further optimization brings diminishing returns and adds complexity.

Symptoms of over-optimization:

  • You can no longer explain how the cache works in 5 minutes.
  • A new engineer cannot modify it without breaking subtle invariants.
  • Tests cover edge cases you cannot reproduce in production.
  • Performance is "improved" but you cannot measure the difference.

A cache should be as simple as possible to meet its requirements. Each technique we've discussed — sharding, TinyLFU, off-heap, etc. — has a place. None of them are mandatory for every cache.

The cache for an internal admin service should not look like the cache for a CDN. The cache for a high-frequency trading system should not look like the cache for a documentation site. Match the cache to the problem.

If you find yourself adding a feature "in case we need it later," stop. You can always add it when you actually need it. You cannot easily remove it.


Self-Assessment Checklist

  • I can design a sharded TTL cache and choose the shard count.
  • I can implement consistent hashing.
  • I can explain TinyLFU and when admission policies matter.
  • I can describe ristretto, bigcache, freecache architectures in detail.
  • I can implement L1/L2 caching with pub/sub invalidation.
  • I can implement probabilistic early refresh (XFetch).
  • I can name the standard cache observability metrics.
  • I can diagnose hit-ratio drops, memory ballooning, upstream overload incidents.
  • I know the eight senior anti-patterns.
  • I can design hot-key mitigation strategies.

Deep Dive: Caches and Multi-Tenancy

If your service is multi-tenant, the cache must respect tenant boundaries.

// WRONG: tenant data could leak across tenants.
func (c *Cache) Get(userID string) string {
    if v, ok := c.data[userID]; ok { return v }
    ...
}

If two tenants happen to have a user ID "42," they would share cache entries. Bug, security risk, sometimes leak.

// RIGHT: include tenant in the key.
func (c *Cache) Get(tenantID, userID string) string {
    key := tenantID + ":" + userID
    ...
}

For high-cardinality tenancy (millions of tenants), consider:

  • Per-tenant cache pools (heavy on memory).
  • Tenant-aware eviction (a busy tenant should not evict a quiet tenant's data).
  • Per-tenant rate limits to prevent one tenant from consuming the whole cache.

Tenancy is a senior-level concern that often gets retrofitted painfully when missed at design time.


Deep Dive: Caches and PII

Caches accumulate data, sometimes including personally identifiable information. GDPR and similar regulations require that:

  • PII is encrypted at rest if your threat model requires it.
  • Users can request deletion of their data — including from caches.
  • Cached PII has bounded retention.

For a TTL cache, the natural TTL gives bounded retention. But:

  • TTL of "1 hour" may be too long for some regulations.
  • An LRU policy may keep popular PII longer than its TTL — no, TTL is checked separately.
  • Off-heap caches in bigcache make scrubbing more challenging (memory is byte arrays, not garbage-collected references).

Mitigations:

  • Avoid caching PII when you can.
  • Use short TTLs for PII-bearing entries.
  • Provide a DeleteByUser(userID) API; iterate the cache and remove. Slow but reliable.

For services subject to compliance audits, the inability to "delete a user's data right now" is a regulatory risk.


Deep Dive: Caches and Compliance Audits

Common audit questions:

  • "How long does PII live in your cache?" Answer: TTL is X, sweep is Y, worst case is X+Y.
  • "Can you delete a specific user's cache entries on request?" Answer: yes, via DeleteByUser.
  • "Is cache memory encrypted?" Answer: depends on your environment (encrypted RAM is rare).
  • "Are cache logs scrubbed?" Answer: cache should not log key contents.

Auditors appreciate documented answers. A short "cache data handling" document for each cache in your system pays off.


Summary

The senior level is about scale, distribution, and operational excellence:

  • Sharding is the standard answer to lock contention.
  • Hot keys need explicit handling beyond sharding.
  • Admission policies (TinyLFU) prevent scan workloads from polluting the cache.
  • Two-tier caching with pub/sub invalidation enables multi-replica deployments.
  • Production caches need rich observability and incident runbooks.
  • Stampede protection extends to distributed locks and multi-stage caching.
  • Real production caches combine many techniques; no single library is the right answer for every workload.

The professional file goes deeper still: off-heap memory layouts, NUMA-aware sharding, custom allocators, and the engineering of caches for billion-entry-scale services.


Deep Dive: Senior-Level Mistakes I've Made

A few real production mistakes that taught me senior-grade lessons:

Mistake 1. Shipping a cache with no Close. Test suite started leaking goroutines in CI. Took two days to diagnose because tests pass but slowly.

Mistake 2. Adding TinyLFU before measuring. The previous LRU was fine. The TinyLFU change saved 0.5% hit ratio and broke the cache's "set then get returns the value" invariant.

Mistake 3. Sharding without padding. False sharing on shard counters reduced throughput by 30% on a 32-core server.

Mistake 4. Pub/sub invalidation with no monotonic check. Out-of-order invalidations occasionally restored an old value.

Mistake 5. Negative caching with TTL = 1 hour. Attacker bombarded with invalid keys; legitimate new keys could not be created for hours.

Mistake 6. Loader without timeout. Single slow upstream call held a goroutine for 30 minutes; cache appeared to hang.

Mistake 7. Cache as source of truth for "user is active." Replicas disagreed; users got inconsistent state.

Mistake 8. Big cache, no metrics. Production team had no way to verify the cache was even doing its job.

Each of these took hours to diagnose, sometimes days. They are all preventable with the techniques discussed in this file.


Glossary of Senior Terms

  • Admission policy. Decides whether to add a new entry to the cache.
  • Eviction policy. Decides which entry to remove when space is needed.
  • TinyLFU. Approximate LFU using a count-min sketch.
  • SLRU. Segmented LRU with probation and protected segments.
  • Shard. A sub-cache with its own lock.
  • Hot key. A key receiving disproportionate traffic.
  • Stampede. Many simultaneous misses for the same key.
  • Singleflight. Coalescing duplicate concurrent calls.
  • Jitter. Randomization of TTLs to prevent synchronized expiry.
  • Stale-while-revalidate. Serve old, refresh in background.
  • Refresh-ahead. Refresh before expiry.
  • Pub/sub invalidation. Broadcast cache deletion messages.
  • Consistent hashing. Stable mapping of keys to replicas.
  • Off-heap. Memory not managed by Go GC (typically []byte).
  • False sharing. Cache-line contention between logically independent atomics.
  • Bulkhead. Resource isolation to prevent cascading failure.
  • Circuit breaker. Fail fast when an upstream is broken.

Further Reading

  • "Caffeine" (Java) documentation — the reference for modern cache design.
  • The TinyLFU paper (Einziger and Friedman, 2017).
  • "Memcached internals" — battle-tested distributed cache design.
  • dgraph-io/ristretto design document.
  • "Why we built XFetch" (Vattani et al., 2015).
  • Twitter's "Pelikan" cache server design talks.
  • LinkedIn's "Couchbase + L1 caching" blog series.
  • "Consistent Hashing with Bounded Loads" (Mirrokni et al., 2016).
  • "Adaptive Replacement Cache" (Megiddo and Modha, 2003) — alternative to LRU/LFU.
  • groupcache source — Google's distributed memoization library, the original inspiration for singleflight.
  • Akka's distributed-data documentation — for ideas on eventually-consistent state.
  • Hazelcast's docs — battle-tested in-memory data grid patterns.
  • The Caffeine wiki — best modern reference on cache design.
  • golang.org/x/sync/singleflight source — small enough to read in 10 minutes.
  • "An Analysis of Hierarchical Web Caching" (Rodriguez et al., 1999) — classic L1/L2 paper.