The Memory Hierarchy — Professional Level¶
Topic: The Memory Hierarchy
Focus: Profiling the hierarchy in production (perf counters, roofline, top-down analysis), plus hardware-level detail (MSHRs, DRAM banks, prefetch tuning) and war stories where the hierarchy decided the outcome.
Table of Contents¶
- Introduction
- Measuring the Hierarchy: the Counters That Matter
- Top-Down Microarchitecture Analysis
- The Roofline Model
- Hardware Detail You Eventually Need
  - MSHRs and memory-level parallelism
  - DRAM internals: banks, rows, and the page-hit gamble
  - Prefetcher tuning and its limits
  - Non-temporal stores and write bandwidth
- War Stories
- A Production Profiling Playbook
- Best Practices
- Edge Cases & Pitfalls
- Summary
Introduction¶
At this level you stop reasoning about the hierarchy abstractly and start measuring it on real hardware under real load, then attributing wall-clock time to specific levels with hardware performance counters. The questions become precise: Is this loop bound by L3 latency or DRAM bandwidth? Are we losing 20% to TLB walks? Is a single false-sharing line costing us a core? The tools are perf, toplev, likwid, Intel VTune, AMD uProf, and the roofline model — and the fixes are informed by how the silicon actually behaves.
Measuring the Hierarchy: the Counters That Matter¶
Every modern CPU exposes hardware performance monitoring counters (PMCs). The Linux entry point is perf.
# Cache miss overview for a workload
perf stat -e cycles,instructions,\
cache-references,cache-misses,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses,\
dTLB-loads,dTLB-load-misses ./app
# Where in the code the misses happen (sampling on LLC misses)
perf record -e LLC-load-misses:pp -c 1000 ./app
perf report
How to read the results:
- IPC (instructions / cycles). Below ~1.0 on a 4-wide+ core usually means stalls — frequently memory. Near 0.2–0.5 screams "memory-bound."
- LLC-load-misses are the expensive ones: each is roughly a DRAM round trip (~100 ns). Multiply count × ~100 ns for an upper bound on DRAM-stall time (misses that overlap hide part of it). If that bound approaches your runtime, you're likely DRAM-latency-bound.
- dTLB-load-misses → page walks. A high ratio (and high dtlb_load_misses.walk_active cycles) means you're translation-bound; reach for huge pages.
- Miss ratios mislead. A 2% miss ratio sounds great, but if it's 2% of 100 billion loads at 100 ns each, it's the whole runtime. Always convert to time, not ratio.
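To make the ratio-versus-time point concrete, the sketch below redoes the arithmetic from the last bullet. It assumes a flat ~100 ns per miss and no overlap between misses, so the result is an upper bound rather than a measurement; the function name and parameters are illustrative and not part of any tool.

```python
def dram_stall_upper_bound_s(loads: int, miss_ratio: float,
                             miss_latency_ns: float = 100.0) -> float:
    """Upper bound on DRAM stall time, assuming fully serialized misses."""
    misses = loads * miss_ratio
    return misses * miss_latency_ns * 1e-9  # ns -> s

# The example from the text: a 2% miss ratio on 100 billion loads.
stall_s = dram_stall_upper_bound_s(loads=100_000_000_000, miss_ratio=0.02)
print(f"worst-case DRAM stall: {stall_s:.0f} s")
```

A "great" 2% ratio turns into roughly 200 seconds of potential stall, which is the number to compare against wall-clock runtime.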
For NUMA, use numastat -p <pid> and the uncore counters (perf stat -e uncore_imc/.../ or PCM) to see per-socket memory traffic and remote-access fractions. perf c2c (cache-to-cache) is the dedicated tool for finding false sharing — it pinpoints the exact line and the two functions fighting over it.
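Why two unrelated variables end up fighting over one line is plain address arithmetic. The sketch below maps each element of a counter array to its cache-line index, assuming 64-byte lines and a line-aligned base address; the helper name is hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size

def lines_touched(num_counters: int, stride_bytes: int, base_addr: int = 0) -> list:
    """Cache-line index of each counter for a given per-counter stride."""
    return [(base_addr + i * stride_bytes) // CACHE_LINE_BYTES
            for i in range(num_counters)]

packed = lines_touched(num_counters=8, stride_bytes=8)    # plain 8-byte counters
padded = lines_touched(num_counters=8, stride_bytes=64)   # one counter per line

print("packed:", packed)  # all eight counters share one line
print("padded:", padded)  # eight distinct lines
```

With an 8-byte stride every counter lands on the same line, so each core's write invalidates the others' copies; a 64-byte stride gives each counter its own line at the cost of 8× the memory.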
Top-Down Microarchitecture Analysis¶
Counting misses tells you what is happening; Top-Down Microarchitecture Analysis (TMA) tells you whether it matters. TMA buckets every pipeline slot into four categories:
                 ┌─ Frontend Bound   (instruction supply: I-cache, iTLB, branch resteers)
Pipeline slot ───┼─ Bad Speculation  (wasted work: branch mispredicts, machine clears)
                 ├─ Retiring         (useful work, what you want)
                 └─ Backend Bound ───┬─ Core Bound   (execution ports/divider)
                                     └─ Memory Bound ┬─ L1/L2/L3 Bound
                                                     ├─ DRAM Bound (latency vs bandwidth)
                                                     └─ Store Bound
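Run it with Andi Kleen's toplev (pmu-tools) or VTune's "Microarchitecture Exploration". With toplev, start at level 1 (toplev -l1 ./app) for the four top-level buckets, then raise the level to drill into the Memory Bound subtree.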
The payoff: TMA tells you the fraction of pipeline slots lost to "DRAM Bound" or "L3 Bound." If a loop is 60% DRAM-Bound, no amount of clever arithmetic helps — you must cut bytes moved or improve locality. If it's "Core Bound," memory tuning is wasted effort. This single discipline stops the most common senior mistake: optimizing the wrong layer.
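For intuition about what the tools compute, here is a minimal sketch of the level-1 split using the formulas published for 4-wide Intel cores. It assumes the pre-Ice Lake event names shown in the comments; newer cores expose slot counts through dedicated top-down events, and toplev selects the right formulas per CPU. The counter values are made up purely to exercise the arithmetic.

```python
def tma_level1(cycles, idq_uops_not_delivered, uops_issued,
               uops_retired_slots, recovery_cycles, width=4):
    """Split pipeline slots into the four level-1 TMA buckets (fractions of 1.0)."""
    slots = width * cycles  # issue slots available over the run
    frontend = idq_uops_not_delivered / slots
    bad_spec = (uops_issued - uops_retired_slots + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    backend = 1.0 - (frontend + bad_spec + retiring)  # whatever is left
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }

# Made-up counter values, not taken from a real run.
buckets = tma_level1(
    cycles=1_000_000,                # CPU_CLK_UNHALTED.THREAD
    idq_uops_not_delivered=400_000,  # IDQ_UOPS_NOT_DELIVERED.CORE
    uops_issued=1_300_000,           # UOPS_ISSUED.ANY
    uops_retired_slots=1_200_000,    # UOPS_RETIRED.RETIRE_SLOTS
    recovery_cycles=25_000,          # INT_MISC.RECOVERY_CYCLES
)
for name, fraction in buckets.items():
    print(f"{name:16s} {fraction:6.1%}")
```

In this invented profile more than half the slots are Backend Bound, which is the signal to descend into the Memory Bound versus Core Bound split before touching any code.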
The Roofline Model¶
The roofline plots attainable FLOP/s (or ops/s) against arithmetic intensity = useful ops per byte moved from DRAM.
Performance (ops/s)
  ^           ____________________  <- compute roof (peak FLOP/s)
  |          /
  |         /   <- this slope = peak memory bandwidth (GB/s)
  |        /
  |       /
  +-----------+----------------------> arithmetic intensity (ops/byte)
         ridge point
- A kernel left of the ridge point is memory-bound: its ceiling is bandwidth × intensity. The only way up is raising intensity (reuse data more: blocking/tiling) or moving fewer bytes (SoA, compression, smaller types).
- A kernel right of the ridge is compute-bound: vectorize, use more cores, better algorithms.
Most data-processing and "AI inference on CPU" kernels live to the left — they're memory-bound, and the hierarchy is the binding constraint. Measure your kernel's intensity (FLOPs ÷ bytes from perf's memory traffic counters) and plot where it sits before optimizing. likwid-perfctr -g MEM_DP or Intel Advisor's roofline automate this.
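The roofline ceiling itself is a one-line formula: the lower of the compute roof and bandwidth × intensity. The sketch below evaluates it for a hypothetical machine with 1000 GFLOP/s peak compute and 100 GB/s peak DRAM bandwidth; substitute measured peaks for real use.

```python
def attainable_gflops(intensity, peak_gflops, peak_bw_gbs):
    """Roofline ceiling: the lower of the compute roof and the bandwidth slope."""
    return min(peak_gflops, peak_bw_gbs * intensity)

def ridge_point(peak_gflops, peak_bw_gbs):
    """Intensity (FLOP/byte) where the bandwidth slope meets the compute roof."""
    return peak_gflops / peak_bw_gbs

PEAK_GFLOPS = 1000.0  # hypothetical machine, not a real part
PEAK_BW_GBS = 100.0

ridge = ridge_point(PEAK_GFLOPS, PEAK_BW_GBS)
for intensity in (0.25, 2.0, 10.0, 40.0):
    ceiling = attainable_gflops(intensity, PEAK_GFLOPS, PEAK_BW_GBS)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:5.2f} FLOP/byte -> {ceiling:6.1f} GFLOP/s ({bound})")
```

On this machine a kernel at 0.25 FLOP/byte cannot exceed 25 GFLOP/s regardless of how well its arithmetic is vectorized, which is 2.5% of the compute roof.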
Hardware Detail You Eventually Need¶
MSHRs and memory-level parallelism¶
A core hides DRAM latency by keeping multiple misses in flight simultaneously, tracked by MSHRs (Miss Status Handling Registers) — typically ~10–20 per core. This is the hardware reason sequential scans hit bandwidth while pointer chasing hits latency: a scan generates many independent misses that fill all MSHRs and overlap; a pointer chase has exactly one outstanding miss at a time (the next address is unknown), so the core sees full latency on each, serialized.
Practical implication: to saturate DRAM bandwidth from one core you need enough independent in-flight accesses to cover the latency-bandwidth product (Little's Law: concurrency = latency × bandwidth). One stream often can't; multiple streams or multiple cores are needed to reach peak bandwidth. This is why a single-threaded memcpy rarely hits the chip's rated bandwidth and why you sometimes need several cores just to fill the memory pipe.
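Little's Law can be turned into a quick estimate in both directions: how many cache lines must be in flight to sustain a target bandwidth, and how much bandwidth a fixed number of MSHRs can sustain. The numbers below (100 GB/s, 100 ns, 12 MSHRs, 64-byte lines) are illustrative assumptions. Real cores often do somewhat better than this first-order bound because prefetchers issue additional requests from the L2.

```python
import math

CACHE_LINE_BYTES = 64  # assumed line size

def lines_in_flight_needed(bandwidth_gbs: float, latency_ns: float) -> float:
    """Concurrency = latency x bandwidth, expressed in cache lines."""
    bytes_in_flight = bandwidth_gbs * latency_ns  # GB/s * ns = bytes
    return bytes_in_flight / CACHE_LINE_BYTES

def per_core_bandwidth_gbs(mshrs: int, latency_ns: float) -> float:
    """Bandwidth one core sustains when every MSHR holds a demand miss."""
    return mshrs * CACHE_LINE_BYTES / latency_ns  # bytes/ns = GB/s

needed = lines_in_flight_needed(bandwidth_gbs=100.0, latency_ns=100.0)
one_core = per_core_bandwidth_gbs(mshrs=12, latency_ns=100.0)
cores = math.ceil(needed / 12)

print(f"lines in flight for 100 GB/s: {needed:.2f}")
print(f"one core with 12 MSHRs:       {one_core:.2f} GB/s")
print(f"cores needed to fill the pipe: {cores}")
```

Under these assumptions a single core tops out near 7.7 GB/s and it takes 14 cores' worth of outstanding misses to reach 100 GB/s.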
DRAM internals: banks, rows, and the page-hit gamble¶
DRAM isn't flat. It's organized into channels → ranks → banks, each bank a 2D array with a row buffer. Accessing a column in the currently open row (a row hit) is fast; accessing a different row (row miss / conflict) forces a precharge + activate — much slower. Sequential access tends to stay within open rows (row hits); random access thrashes rows (constant precharge/activate). This is a second, finer-grained reason random DRAM access is slower than the headline latency suggests, and why interleaving across channels/banks matters for bandwidth. You rarely program to this directly, but it explains benchmark anomalies and is why "DRAM latency" is a distribution, not a constant.
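The row-buffer states map onto the standard DRAM timing parameters: a row hit pays only the column access (tCAS), an access to a closed bank pays activate plus column access (tRCD + tCAS), and a row conflict pays precharge first (tRP + tRCD + tCAS). The sketch below uses 14 ns for each parameter as a round illustrative value and covers device timing only, not controller queuing or on-chip interconnect latency.

```python
def dram_access_ns(state: str, t_cas=14.0, t_rcd=14.0, t_rp=14.0) -> float:
    """Device-level access time for one request, by row-buffer state."""
    if state == "hit":        # target row already open
        return t_cas
    if state == "empty":      # bank precharged, no row open
        return t_rcd + t_cas
    if state == "conflict":   # a different row is open
        return t_rp + t_rcd + t_cas
    raise ValueError(f"unknown row-buffer state: {state}")

def mean_access_ns(hit_rate: float, conflict_rate: float) -> float:
    """Average access time for a mix of hits, conflicts, and empty-bank accesses."""
    empty_rate = 1.0 - hit_rate - conflict_rate
    return (hit_rate * dram_access_ns("hit")
            + empty_rate * dram_access_ns("empty")
            + conflict_rate * dram_access_ns("conflict"))

sequential = mean_access_ns(hit_rate=0.95, conflict_rate=0.05)  # assumed mix
random_access = mean_access_ns(hit_rate=0.05, conflict_rate=0.95)  # assumed mix
print(f"mostly row hits:      {sequential:.1f} ns")
print(f"mostly row conflicts: {random_access:.1f} ns")
```

The same DRAM part shows a spread of almost 3× between the two mixes, which is why a single "DRAM latency" figure hides a distribution.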
Prefetcher tuning and its limits¶
Server CPUs expose multiple prefetchers (L2 streamer, adjacent-line, DCU/IP-stride) that can be toggled via MSRs (on Intel, MSR 0x1A4). For most code, leave them on. But:
- For streaming-once workloads (read huge data once, no reuse), aggressive prefetch can pollute cache and waste bandwidth; some HPC/DB shops disable specific prefetchers.
- Hardware prefetchers generally stop at 4 KB page boundaries, so a stream must be re-detected on each new page, and 4 KB-fragmented access loses prefetch even when it is sequential within a page. That is another argument for huge pages beyond TLB savings, though the gain varies by microarchitecture.
- Software prefetch (prefetcht0/1/2/nta) helps only at the right distance: enough lookahead to cover latency, not so far that it evicts useful lines. Tuning it is empirical; measure, don't assume.
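A common starting point for the software-prefetch distance is the miss latency divided by the time one loop iteration takes, rounded up. The sketch below computes that seed value; the latency and per-iteration costs are assumptions, and the result is only where an empirical sweep should begin.

```python
import math

def prefetch_distance_iters(miss_latency_ns: float, ns_per_iteration: float) -> int:
    """Iterations of lookahead needed so a prefetch completes before its use."""
    return math.ceil(miss_latency_ns / ns_per_iteration)

# Assumed: ~100 ns miss latency, loop bodies costing 4 ns and 7 ns per element.
fast_loop = prefetch_distance_iters(miss_latency_ns=100.0, ns_per_iteration=4.0)
slow_loop = prefetch_distance_iters(miss_latency_ns=100.0, ns_per_iteration=7.0)

print(f"4 ns/iter -> prefetch {fast_loop} iterations ahead")
print(f"7 ns/iter -> prefetch {slow_loop} iterations ahead")
```

Cheaper loop bodies need a longer lookahead, and any change to the loop body shifts the right distance, which is one reason hand-tuned prefetch distances go stale.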
Non-temporal stores and write bandwidth¶
When you write data you will not read back soon (e.g. producing a large output buffer), normal stores waste bandwidth on RFO (Read-For-Ownership) — the CPU reads the line just to overwrite it — and pollute cache. Non-temporal / streaming stores (movnti, _mm_stream_ps, memset variants) bypass the cache and write straight to memory, roughly halving write traffic and sparing the cache. Used in optimized memcpy/memset, video pipelines, and large array initialization. The catch: they need fencing (sfence) and hurt if you do re-read the data soon.
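The "roughly halving" figure follows from counting DRAM transfers per output line. The sketch below does that accounting under two assumptions: the buffer is much larger than the cache, so every line is eventually written back, and the hardware does not elide the RFO for full-line writes (some implementations do).

```python
def dram_traffic_bytes(buffer_bytes: int, non_temporal: bool) -> int:
    """DRAM bytes moved to fill a write-only buffer once."""
    rfo_read = 0 if non_temporal else buffer_bytes  # line fetched before overwrite
    write_out = buffer_bytes                        # data written to memory
    return rfo_read + write_out

GIB = 1 << 30
regular = dram_traffic_bytes(GIB, non_temporal=False)
streaming = dram_traffic_bytes(GIB, non_temporal=True)

print(f"regular stores:      {regular / GIB:.0f} GiB of DRAM traffic")
print(f"non-temporal stores: {streaming / GIB:.0f} GiB of DRAM traffic")
```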
War Stories¶
1. The 40% NUMA tax nobody saw. A Go service on a 2-socket box degraded under load. CPU wasn't saturated; perf stat showed low IPC and numastat showed ~45% remote memory accesses. The cause: a startup routine allocated and zeroed all caches/buffers on one goroutine (one socket), then the runtime scheduled workers across both sockets. Fix: NUMA-aware sharding of the buffers with per-shard worker affinity (binding with numactl --cpunodebind and sizing GOMAXPROCS to match; GOMAXPROCS by itself limits parallelism but does not pin threads). Throughput rose ~35%.
2. A single line that cost a core. A lock-free counter array int64 hits[NumCPU] showed near-zero scaling past 4 threads. perf c2c flagged one 64-byte line carrying eight adjacent counters — textbook false sharing, the line bouncing between cores thousands of times per millisecond. Padding each counter to its own line restored linear scaling. The diff was four lines; the speedup was 3×.
3. The hash map that was secretly disk-bound. An "in-memory" cache spilled past RAM into swap on a memory-pressured node. Latency p99 jumped from microseconds to tens of milliseconds. vmstat showed si/so (swap-in/out) activity; the working set had quietly exceeded RAM and the kernel was paging to SSD. The hierarchy's bottom level had silently joined the hot path. Fix: cap the cache size below available RAM and add admission control — never let the working set cross the RAM→swap cliff.
4. Tiling a matrix kernel off the roofline. A naive matrix multiply ran at a fraction of peak. Roofline analysis put it deep in the memory-bound region (intensity ~ O(1) because each element was re-fetched from DRAM). Blocking into L1/L2-sized tiles raised arithmetic intensity by reusing each loaded tile O(tile) times, moving the kernel toward the compute roof — a multi-× speedup with identical FLOP count.
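The intensity jump in the tiling story can be estimated on paper. The sketch below assumes double-precision elements, counts only the A and B tile loads (the C tile is taken to stay resident), and treats the naive kernel as re-fetching both operands for every multiply-add. It is a simplified traffic model, not a measurement.

```python
def naive_intensity(elem_bytes: int = 8) -> float:
    """FLOP/byte when each multiply-add re-fetches both operands from DRAM."""
    return 2 / (2 * elem_bytes)

def tiled_intensity(tile: int, elem_bytes: int = 8) -> float:
    """FLOP/byte for one tile step: 2*b^3 FLOPs per two b x b tiles loaded."""
    flops = 2 * tile ** 3
    bytes_loaded = 2 * tile ** 2 * elem_bytes
    return flops / bytes_loaded

def tile_footprint_bytes(tile: int, elem_bytes: int = 8) -> int:
    """Working set of the A, B, and C tiles that must stay cache-resident."""
    return 3 * tile ** 2 * elem_bytes

print(f"naive:   {naive_intensity():.3f} FLOP/byte")
for b in (32, 64):
    print(f"tile {b}: {tiled_intensity(b):.1f} FLOP/byte, "
          f"{tile_footprint_bytes(b) // 1024} KiB working set")
```

Intensity grows linearly with the tile edge (b/8 for doubles) while the working set grows quadratically, so the largest tile that still fits the target cache level is the natural choice.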
A Production Profiling Playbook¶
- Establish boundness with TMA first: toplev -l1. If not Backend/Memory-Bound, stop tuning memory.
- Localize with sampling: perf record -e LLC-load-misses:pp → perf report to find the offending lines/functions.
- Convert misses to time, not ratios: LLC-load-misses × ~100 ns vs runtime.
- Check the TLB (dtlb_load_misses.walk_active cycles). High → huge pages.
- Check NUMA (numastat -p, uncore IMC counters per socket). High remote % → fix placement.
- Hunt false sharing with perf c2c if a parallel section scales poorly.
- Place the kernel on a roofline to decide whether to cut bytes or add compute.
- Verify with the same counters after the fix — confirm the bucket you targeted actually shrank.
Best Practices¶
- Profile before, attribute precisely, verify after. TMA + perf counters, every time. Never optimize memory by guesswork.
- Engineer for the RAM→swap cliff. Size working sets safely under RAM; monitor swap activity in prod. The bottom of the hierarchy is a latency cliff, not a slope.
- Use huge pages for large/sparse working sets — wins on both TLB and prefetch.
- Reach for non-temporal stores when producing large write-only buffers; disable specific prefetchers only for measured streaming-once workloads.
- Pin memory and threads on NUMA hardware; co-locate data with the cores that use it via parallel first-touch and affinity.
- Treat single-thread bandwidth as limited (MSHR/Little's Law); parallelize streaming to fill the memory pipe.
Edge Cases & Pitfalls¶
- Counter skid and attribution error. Non-precise events attribute misses to the wrong instruction; use :pp (PEBS/IBS) precise sampling for memory events.
- Microbenchmarks that fit in cache. They report fantasy numbers; size test data above the relevant cache level and include realistic TLB/NUMA conditions.
- Frequency scaling and uncore. Turbo, C-states, and uncore frequency shift latency/bandwidth between runs; pin frequency for repeatable measurement.
- Huge pages backfiring. Transparent Huge Pages can cause latency spikes from khugepaged compaction; for latency-critical services many shops disable THP and use explicit hugepages instead.
- Non-temporal stores hurting reuse. If the "write-only" buffer is read back soon, NT stores force a re-fetch from DRAM, which is a net loss. Validate the access pattern.
- Optimizing a non-bottleneck. A loop that's 5% of runtime, perfectly tuned, saves ~5%. TMA + profiling exist precisely to keep you on the dominant cost.
Summary¶
- Use perf counters + Top-Down (TMA) to attribute wall-clock time to specific hierarchy levels, and convert misses to time, not ratios.
- The roofline model tells you whether a kernel is memory- or compute-bound and therefore whether locality or compute tuning pays.
- Hardware reality shapes the rules: MSHRs cap in-flight misses (latency vs bandwidth), DRAM banks/rows make random access doubly slow, prefetchers don't cross pages, and non-temporal stores save write bandwidth.
- The most expensive production surprises come from the bottom of the hierarchy (swap) and from coherence/NUMA (false sharing, remote memory), all measurable with perf c2c, numastat, and vmstat.
- The professional discipline: measure → attribute → fix the dominant level → verify with the same counter.