The Memory Hierarchy — Middle Level¶
Topic: The Memory Hierarchy Focus: The mechanisms behind the levels — cache lines, sets and associativity, miss types, prefetching, the TLB — and a working cost model you can reason with.
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Core Concepts
- Anatomy of a cache
- Associativity, sets, and eviction
- The three Cs of cache misses
- Spatial and temporal locality
- Prefetching
- The TLB and address translation
- Bandwidth vs latency
- A Cost Model You Can Use
- Mental Models
- Code Examples
- Coding Patterns
- Pros & Cons
- Best Practices
- Edge Cases & Pitfalls
- Summary
Introduction¶
At the junior level the hierarchy is a pyramid of latencies and one rule: be sequential. That rule works because of machinery — set-associative caches, hardware prefetchers, and a translation cache called the TLB — that you can now learn to reason about quantitatively. The goal of this tier is to replace "scattered access is slow" with a model that predicts how slow and why, so you can estimate the cost of a loop before you run it.
Prerequisites¶
- The junior picture: levels, approximate latencies, the 64-byte cache line, locality.
- Comfort with hexadecimal and bitwise thinking (address bits split into fields).
- Basic idea of virtual memory: programs see virtual addresses; hardware maps them to physical RAM.
Glossary¶
- Cache line / block — the unit of transfer and storage, typically 64 bytes.
- Set — a bucket of cache lines an address is allowed to live in.
- Way / associativity — how many lines fit in one set; an N-way cache has N candidate slots per set.
- Tag — the high address bits stored alongside a cached line to verify identity.
- Eviction — kicking out a line to make room; chosen by a replacement policy (pseudo-LRU).
- Compulsory / capacity / conflict miss — the three reasons a needed line isn't cached.
- Prefetcher — hardware (or compiler-inserted) logic that loads lines before they're demanded.
- TLB (Translation Lookaside Buffer) — a cache of virtual→physical page translations.
- Page — the unit of virtual memory mapping, usually 4 KB (or 2 MB/1 GB "huge pages").
- Working set — the distinct memory a program touches over a window of time.
Core Concepts¶
Anatomy of a cache¶
A cache splits every memory address into three fields:
| ----- tag ----- | -- set index -- | -- block offset -- |
high bits middle bits low 6 bits (for 64B lines)
- Block offset (low 6 bits for a 64-byte line) picks the byte within the line.
- Set index picks which set the line maps to.
- Tag is stored with the cached line; on a lookup the hardware compares the address's tag against the tags in that set.
So a cache is essentially a hardware hash table: the set index is the hash bucket, and the tag is the key check. A hit is a tag match in the right set; a miss triggers a fetch from the next level.
Associativity, sets, and eviction¶
- A direct-mapped cache has one line per set: each address has exactly one home. Simple and fast, but two hot addresses that map to the same set evict each other endlessly.
- A fully associative cache lets any line go anywhere: no conflicts, but checking every slot is expensive — used only for tiny caches like the TLB.
- Real L1/L2/L3 are N-way set-associative (commonly 8-way for L1/L2, 12–20-way for L3): a sweet spot. An address maps to one set; within that set, any of N ways may hold it.
When a set is full and a new line arrives, a replacement policy evicts one. Hardware can't afford true LRU across many ways, so it uses pseudo-LRU (tree-based bits) or similar approximations. The practical takeaway: recently used lines tend to survive, but it's not a guarantee.
The three Cs of cache misses¶
Every miss is one of three kinds — a vocabulary that tells you which fix applies:
- Compulsory (cold) miss — first-ever touch of a line. Unavoidable except by prefetching. Fix: prefetch, or touch less unique data.
- Capacity miss — the working set is bigger than the cache, so lines you'll need again get evicted before reuse. Fix: shrink the working set (blocking/tiling, smaller types).
- Conflict miss — the working set would fit, but too many hot lines map to the same set and evict each other. Fix: change layout/stride so addresses spread across sets (avoid power-of-two strides that alias).
A loop that strides by exactly a power of two (e.g. exactly 4096 bytes) is a classic conflict-miss generator: every access lands in the same handful of sets.
Spatial and temporal locality¶
- Temporal locality — if you touched address A, you'll likely touch A again soon. Caches exploit this by keeping lines after first use.
- Spatial locality — if you touched A, you'll likely touch A+1, A+2… soon. Caches exploit this by fetching a whole line and by prefetching the next line.
Real programs have both, which is why caches work at all. Code that destroys locality — pointer chasing, random hashing into a huge table, column-major scans — gets little benefit from the megabytes of cache sitting idle.
Prefetching¶
The CPU tries to hide latency by loading data before you ask:
- Hardware prefetchers detect simple patterns: sequential streams and constant strides. A forward
forloop over an array is the easiest pattern to detect — the prefetcher races ahead, so demand loads frequently hit. This is why sequential access is fast beyond just line reuse. - Software prefetch — explicit instructions (
__builtin_prefetchin C/GCC,_mm_prefetch) tell the CPU "I'll need this address soon." Useful for irregular-but-predictable patterns (e.g. you can compute the next node's address before processing the current one). Easy to overuse: a wrong prefetch wastes bandwidth and pollutes cache.
Prefetchers struggle with: pointer chasing (the next address is unknown until the current line arrives), random access, and very short loops.
The TLB and address translation¶
Programs use virtual addresses; the MMU translates them to physical RAM addresses using page tables. Walking the multi-level page table is itself several memory accesses — far too slow to do per load. So the CPU caches recent translations in the TLB.
- A TLB hit makes translation effectively free (overlapped with cache access).
- A TLB miss triggers a page walk (often hardware-driven), costing tens to hundreds of cycles — sometimes more than the data load it precedes.
- TLBs are small: on the order of 64–1500 entries. With 4 KB pages, a few thousand entries cover only a few megabytes. A loop scanning gigabytes with poor page locality can be TLB-bound even when the data caches are fine.
- Huge pages (2 MB) let one TLB entry cover 512× more memory, dramatically cutting TLB misses for large working sets — a common production tuning knob for databases and JVMs.
The lesson: there are two caches you can miss — the data cache and the TLB — and they fail for different reasons.
Bandwidth vs latency¶
These are independent axes and confusing them causes bad decisions:
- Latency = time for one access to come back (~100 ns to DRAM). It matters for dependent accesses, where you can't issue the next until the current returns (pointer chasing).
- Bandwidth = bytes per second the memory system can sustain (tens of GB/s per socket). It matters for streaming, where many independent accesses are in flight at once.
Modern CPUs hide latency by keeping many outstanding misses in flight (memory-level parallelism). A sequential scan saturates bandwidth and hides latency. A pointer chase exposes full latency on every step because each load depends on the previous — you get neither parallelism nor bandwidth.
A Cost Model You Can Use¶
Estimate a loop's memory cost in three steps:
- Count distinct cache lines touched, not bytes or operations.
nfloats scanned sequentially touchn/16lines (16 floats per 64-byte line). - Classify the reuse. Does the working set fit in L1 (~32 KB), L2 (~512 KB), or L3 (~tens of MB)? Pick the latency of the smallest level that holds it.
- Account for the access pattern. Sequential → prefetched, near-best case. Strided/random → add full miss latency per line, and check whether you're also blowing the TLB.
Worked example: summing 64 million int32 (256 MB). - Lines touched ≈ 256 MB / 64 B = 4 million. - 256 MB ≫ L3, so each line is a DRAM fetch the first time: ~capacity/compulsory miss territory. - Sequential, so the prefetcher hides most latency and you run at bandwidth limit (~tens of GB/s) → a few milliseconds. - Random access to the same data: each line exposes ~100 ns latency, ~4M × 100 ns ≈ 0.4 s — about 100× slower. Same data, same total, different pattern.
Mental Models¶
- A cache is a hardware hash table keyed by set index, with N slots per bucket and a tag check. Conflict misses are bucket collisions.
- Two caches, two failure modes. Data cache misses come from touching too much or too scattered data; TLB misses come from spanning too many pages. Huge pages fix the latter, not the former.
- Latency is hidden by parallelism; pointer chasing kills parallelism. Independent accesses overlap; dependent ones serialize.
Code Examples¶
Strided access reveals the cache line¶
// Touch one int per stride. Larger stride => fewer useful bytes per line
// => more lines per element => slower, until the line is one-int-per-line.
for (size_t i = 0; i < N; i += stride)
sum += a[i];
Plotting time vs stride, time rises until stride reaches 16 ints (one full 64-byte line per access) and then roughly plateaus — a direct measurement of the 64-byte line size from software.
Software prefetch for a predictable irregular walk¶
for (size_t i = 0; i < n; i++) {
// Hint the line we'll need a few iterations ahead.
__builtin_prefetch(&data[index[i + 8]], 0, 1);
process(data[index[i]]);
}
This helps only when the future address (index[i+8]) is known well before it's needed — i.e. the index array itself is streamed predictably.
Demonstrating the TLB in Go¶
// Walking with a page-sized stride touches a new page every step:
// great data-cache behavior is irrelevant once you exhaust the TLB.
const pageStride = 4096 / 8 // 8-byte elements
for i := 0; i < len(buf); i += pageStride {
sink += buf[i]
}
For a multi-GB buf, this is dominated by TLB misses, not data-cache misses — switching the allocation to huge pages can cut its time substantially.
Coding Patterns¶
- Loop interchange — reorder nested loops so the innermost index walks contiguous memory (row-major inner loop over columns).
- Loop blocking / tiling — process the data in cache-sized tiles so a tile is fully reused before eviction; turns capacity misses into hits (classic in matrix multiply).
- Avoid power-of-two strides and array dimensions — pad arrays (e.g.
[N][N+1]) to break the aliasing that causes conflict misses. - Batch independent work — issue several independent loads so the CPU overlaps their latency instead of serializing.
Pros & Cons¶
| Mechanism | Helps when | Hurts when |
|---|---|---|
| Set associativity | Reduces conflict misses vs direct-mapped | Costs power/area; not infinite |
| Hardware prefetch | Sequential/strided streams | Pointer chasing; pollutes cache on mispredict |
| Software prefetch | Predictable irregular access | Wrong distance wastes bandwidth |
| Huge pages | Large working sets (TLB pressure) | Fragmentation; wasted RAM if sparse |
Best Practices¶
- Make the inner loop walk contiguous memory. This single change wins more often than any micro-optimization.
- Size hot loops to a cache level deliberately — tile big computations so each tile fits L1/L2.
- Watch for power-of-two aliasing in array dimensions and strides; pad to dodge conflict misses.
- Treat the TLB as a separate budget. For huge sequential or sparse-across-pages workloads, consider huge pages.
- Prefer streaming over chasing. If you can express work as a scan of arrays rather than a walk of pointers, do it.
Edge Cases & Pitfalls¶
- The "fits in cache" cliff. Performance is roughly flat while the working set fits a level, then drops sharply when it spills to the next. Microbenchmarks that stay under the cliff lie about production behavior.
- Conflict misses with "small" data. A 256 KB array that should fit L2 can thrash if your stride aliases sets. Capacity isn't the only constraint.
- Prefetcher fooled by boundaries. Hardware prefetchers usually don't cross page boundaries, so very page-fragmented access loses prefetch benefit even if sequential within pages.
- Software prefetch as a crutch. It rarely fixes a fundamentally pointer-chasing structure; redesigning the layout usually beats sprinkling prefetch hints.
- Counting bytes, forgetting pages. Two layouts with identical byte counts can differ wildly if one spans many more pages (TLB) or sets (conflicts).
Summary¶
- A cache is a set-associative hardware hash table of 64-byte lines; addresses split into tag / set / offset.
- Misses come in three flavors — compulsory, capacity, conflict — and each points to a different fix.
- Prefetching hides latency for predictable patterns; pointer chasing defeats it.
- The TLB is a second cache you can miss; large or page-sparse workloads can be translation-bound, and huge pages help.
- Latency vs bandwidth are distinct: dependent accesses pay latency, independent streams ride bandwidth.
- Reason about loops by counting distinct lines, identifying the holding cache level, and classifying the access pattern — that model predicts the 10–100× differences you'll see in practice.
In this topic
- junior
- middle
- senior
- professional