The Memory Hierarchy — Middle Level¶

Topic: The Memory Hierarchy Focus: The mechanisms behind the levels — cache lines, sets and associativity, miss types, prefetching, the TLB — and a working cost model you can reason with.

Table of Contents¶

Introduction
Prerequisites
Glossary
Core Concepts
Anatomy of a cache
Associativity, sets, and eviction
The three Cs of cache misses
Spatial and temporal locality
Prefetching
The TLB and address translation
Bandwidth vs latency
A Cost Model You Can Use
Mental Models
Code Examples
Coding Patterns
Pros & Cons
Best Practices
Edge Cases & Pitfalls
Summary

Introduction¶

At the junior level the hierarchy is a pyramid of latencies and one rule: be sequential. That rule works because of machinery — set-associative caches, hardware prefetchers, and a translation cache called the TLB — that you can now learn to reason about quantitatively. The goal of this tier is to replace "scattered access is slow" with a model that predicts how slow and why, so you can estimate the cost of a loop before you run it.

Prerequisites¶

The junior picture: levels, approximate latencies, the 64-byte cache line, locality.
Comfort with hexadecimal and bitwise thinking (address bits split into fields).
Basic idea of virtual memory: programs see virtual addresses; hardware maps them to physical RAM.

Glossary¶

Cache line / block — the unit of transfer and storage, typically 64 bytes.
Set — a bucket of cache lines an address is allowed to live in.
Way / associativity — how many lines fit in one set; an N-way cache has N candidate slots per set.
Tag — the high address bits stored alongside a cached line to verify identity.
Eviction — kicking out a line to make room; chosen by a replacement policy (pseudo-LRU).
Compulsory / capacity / conflict miss — the three reasons a needed line isn't cached.
Prefetcher — hardware (or compiler-inserted) logic that loads lines before they're demanded.
TLB (Translation Lookaside Buffer) — a cache of virtual→physical page translations.
Page — the unit of virtual memory mapping, usually 4 KB (or 2 MB/1 GB "huge pages").
Working set — the distinct memory a program touches over a window of time.

Core Concepts¶

Anatomy of a cache¶

A cache splits every memory address into three fields:

| ----- tag ----- | -- set index -- | -- block offset -- |
        high bits      middle bits          low 6 bits (for 64B lines)

Block offset (low 6 bits for a 64-byte line) picks the byte within the line.
Set index picks which set the line maps to.
Tag is stored with the cached line; on a lookup the hardware compares the address's tag against the tags in that set.

So a cache is essentially a hardware hash table: the set index is the hash bucket, and the tag is the key check. A hit is a tag match in the right set; a miss triggers a fetch from the next level.

Associativity, sets, and eviction¶

A direct-mapped cache has one line per set: each address has exactly one home. Simple and fast, but two hot addresses that map to the same set evict each other endlessly.
A fully associative cache lets any line go anywhere: no conflicts, but checking every slot is expensive — used only for tiny caches like the TLB.
Real L1/L2/L3 are N-way set-associative (commonly 8-way for L1/L2, 12–20-way for L3): a sweet spot. An address maps to one set; within that set, any of N ways may hold it.

When a set is full and a new line arrives, a replacement policy evicts one. Hardware can't afford true LRU across many ways, so it uses pseudo-LRU (tree-based bits) or similar approximations. The practical takeaway: recently used lines tend to survive, but it's not a guarantee.

The three Cs of cache misses¶

Every miss is one of three kinds — a vocabulary that tells you which fix applies:

Compulsory (cold) miss — first-ever touch of a line. Unavoidable except by prefetching. Fix: prefetch, or touch less unique data.
Capacity miss — the working set is bigger than the cache, so lines you'll need again get evicted before reuse. Fix: shrink the working set (blocking/tiling, smaller types).
Conflict miss — the working set would fit, but too many hot lines map to the same set and evict each other. Fix: change layout/stride so addresses spread across sets (avoid power-of-two strides that alias).

A loop that strides by exactly a power of two (e.g. exactly 4096 bytes) is a classic conflict-miss generator: every access lands in the same handful of sets.

Spatial and temporal locality¶

Temporal locality — if you touched address A, you'll likely touch A again soon. Caches exploit this by keeping lines after first use.
Spatial locality — if you touched A, you'll likely touch A+1, A+2… soon. Caches exploit this by fetching a whole line and by prefetching the next line.

Real programs have both, which is why caches work at all. Code that destroys locality — pointer chasing, random hashing into a huge table, column-major scans — gets little benefit from the megabytes of cache sitting idle.

Prefetching¶

The CPU tries to hide latency by loading data before you ask:

Hardware prefetchers detect simple patterns: sequential streams and constant strides. A forward for loop over an array is the easiest pattern to detect — the prefetcher races ahead, so demand loads frequently hit. This is why sequential access is fast beyond just line reuse.
Software prefetch — explicit instructions (__builtin_prefetch in C/GCC, _mm_prefetch) tell the CPU "I'll need this address soon." Useful for irregular-but-predictable patterns (e.g. you can compute the next node's address before processing the current one). Easy to overuse: a wrong prefetch wastes bandwidth and pollutes cache.

Prefetchers struggle with: pointer chasing (the next address is unknown until the current line arrives), random access, and very short loops.

The TLB and address translation¶

Programs use virtual addresses; the MMU translates them to physical RAM addresses using page tables. Walking the multi-level page table is itself several memory accesses — far too slow to do per load. So the CPU caches recent translations in the TLB.

A TLB hit makes translation effectively free (overlapped with cache access).
A TLB miss triggers a page walk (often hardware-driven), costing tens to hundreds of cycles — sometimes more than the data load it precedes.
TLBs are small: on the order of 64–1500 entries. With 4 KB pages, a few thousand entries cover only a few megabytes. A loop scanning gigabytes with poor page locality can be TLB-bound even when the data caches are fine.
Huge pages (2 MB) let one TLB entry cover 512× more memory, dramatically cutting TLB misses for large working sets — a common production tuning knob for databases and JVMs.

The lesson: there are two caches you can miss — the data cache and the TLB — and they fail for different reasons.

Bandwidth vs latency¶

These are independent axes and confusing them causes bad decisions:

Latency = time for one access to come back (~100 ns to DRAM). It matters for dependent accesses, where you can't issue the next until the current returns (pointer chasing).
Bandwidth = bytes per second the memory system can sustain (tens of GB/s per socket). It matters for streaming, where many independent accesses are in flight at once.

Modern CPUs hide latency by keeping many outstanding misses in flight (memory-level parallelism). A sequential scan saturates bandwidth and hides latency. A pointer chase exposes full latency on every step because each load depends on the previous — you get neither parallelism nor bandwidth.

A Cost Model You Can Use¶

Estimate a loop's memory cost in three steps:

Count distinct cache lines touched, not bytes or operations. n floats scanned sequentially touch n/16 lines (16 floats per 64-byte line).
Classify the reuse. Does the working set fit in L1 (~32 KB), L2 (~512 KB), or L3 (~tens of MB)? Pick the latency of the smallest level that holds it.
Account for the access pattern. Sequential → prefetched, near-best case. Strided/random → add full miss latency per line, and check whether you're also blowing the TLB.

Worked example: summing 64 million int32 (256 MB). - Lines touched ≈ 256 MB / 64 B = 4 million. - 256 MB ≫ L3, so each line is a DRAM fetch the first time: ~capacity/compulsory miss territory. - Sequential, so the prefetcher hides most latency and you run at bandwidth limit (~tens of GB/s) → a few milliseconds. - Random access to the same data: each line exposes ~100 ns latency, ~4M × 100 ns ≈ 0.4 s — about 100× slower. Same data, same total, different pattern.

Mental Models¶

A cache is a hardware hash table keyed by set index, with N slots per bucket and a tag check. Conflict misses are bucket collisions.
Two caches, two failure modes. Data cache misses come from touching too much or too scattered data; TLB misses come from spanning too many pages. Huge pages fix the latter, not the former.
Latency is hidden by parallelism; pointer chasing kills parallelism. Independent accesses overlap; dependent ones serialize.

Code Examples¶

Strided access reveals the cache line¶

// Touch one int per stride. Larger stride => fewer useful bytes per line
// => more lines per element => slower, until the line is one-int-per-line.
for (size_t i = 0; i < N; i += stride)
    sum += a[i];

Plotting time vs stride, time rises until stride reaches 16 ints (one full 64-byte line per access) and then roughly plateaus — a direct measurement of the 64-byte line size from software.

Software prefetch for a predictable irregular walk¶

for (size_t i = 0; i < n; i++) {
    // Hint the line we'll need a few iterations ahead.
    __builtin_prefetch(&data[index[i + 8]], 0, 1);
    process(data[index[i]]);
}

This helps only when the future address (index[i+8]) is known well before it's needed — i.e. the index array itself is streamed predictably.

Demonstrating the TLB in Go¶

// Walking with a page-sized stride touches a new page every step:
// great data-cache behavior is irrelevant once you exhaust the TLB.
const pageStride = 4096 / 8 // 8-byte elements
for i := 0; i < len(buf); i += pageStride {
    sink += buf[i]
}

For a multi-GB buf, this is dominated by TLB misses, not data-cache misses — switching the allocation to huge pages can cut its time substantially.

Coding Patterns¶

Loop interchange — reorder nested loops so the innermost index walks contiguous memory (row-major inner loop over columns).
Loop blocking / tiling — process the data in cache-sized tiles so a tile is fully reused before eviction; turns capacity misses into hits (classic in matrix multiply).
Avoid power-of-two strides and array dimensions — pad arrays (e.g. [N][N+1]) to break the aliasing that causes conflict misses.
Batch independent work — issue several independent loads so the CPU overlaps their latency instead of serializing.

Pros & Cons¶

Mechanism	Helps when	Hurts when
Set associativity	Reduces conflict misses vs direct-mapped	Costs power/area; not infinite
Hardware prefetch	Sequential/strided streams	Pointer chasing; pollutes cache on mispredict
Software prefetch	Predictable irregular access	Wrong distance wastes bandwidth
Huge pages	Large working sets (TLB pressure)	Fragmentation; wasted RAM if sparse

Best Practices¶

Make the inner loop walk contiguous memory. This single change wins more often than any micro-optimization.
Size hot loops to a cache level deliberately — tile big computations so each tile fits L1/L2.
Watch for power-of-two aliasing in array dimensions and strides; pad to dodge conflict misses.
Treat the TLB as a separate budget. For huge sequential or sparse-across-pages workloads, consider huge pages.
Prefer streaming over chasing. If you can express work as a scan of arrays rather than a walk of pointers, do it.

Edge Cases & Pitfalls¶

The "fits in cache" cliff. Performance is roughly flat while the working set fits a level, then drops sharply when it spills to the next. Microbenchmarks that stay under the cliff lie about production behavior.
Conflict misses with "small" data. A 256 KB array that should fit L2 can thrash if your stride aliases sets. Capacity isn't the only constraint.
Prefetcher fooled by boundaries. Hardware prefetchers usually don't cross page boundaries, so very page-fragmented access loses prefetch benefit even if sequential within pages.
Software prefetch as a crutch. It rarely fixes a fundamentally pointer-chasing structure; redesigning the layout usually beats sprinkling prefetch hints.
Counting bytes, forgetting pages. Two layouts with identical byte counts can differ wildly if one spans many more pages (TLB) or sets (conflicts).

Summary¶

A cache is a set-associative hardware hash table of 64-byte lines; addresses split into tag / set / offset.
Misses come in three flavors — compulsory, capacity, conflict — and each points to a different fix.
Prefetching hides latency for predictable patterns; pointer chasing defeats it.
The TLB is a second cache you can miss; large or page-sparse workloads can be translation-bound, and huge pages help.
Latency vs bandwidth are distinct: dependent accesses pay latency, independent streams ride bandwidth.
Reason about loops by counting distinct lines, identifying the holding cache level, and classifying the access pattern — that model predicts the 10–100× differences you'll see in practice.