The Memory Hierarchy — Interview Questions¶
Topic: The Memory Hierarchy
Introduction¶
These questions probe whether you can reason quantitatively about where data lives and what that costs — from approximate latencies through cache lines, associativity, the TLB, NUMA, and false sharing, up to profiling and data-layout design. Strong answers cite concrete numbers, name the mechanism, and connect it to a real fix.
Table of Contents¶
Conceptual¶
Question 1¶
Sketch the memory hierarchy with approximate latencies, and explain why it exists at all.
Top to bottom: registers (<1 ns), L1 (~1 ns / ~4 cycles, ~32 KB), L2 (~4 ns, ~256 KB–1 MB), L3 (~12–40 ns, tens of MB, shared), DRAM (~60–100 ns, GBs), NVMe SSD (~10–100 µs), network/disk (~ms). Each level is bigger, cheaper per byte, and slower. It exists because no single memory technology is simultaneously fast, large, and cheap: SRAM is fast but tiny and expensive; DRAM is dense but ~100 ns; flash is cheap and huge but microseconds. Stacking them and keeping the active working set high in the pyramid yields near-top-level speed at near-bottom-level cost — provided access patterns have locality.
Question 2¶
What is a cache line and why does its size dominate so many performance results?
A cache line is the fixed-size unit (typically 64 bytes) that moves between memory levels — you never transfer a single byte. Cost is therefore paid per distinct line touched, not per byte: reading 64 packed bytes costs about the same as reading one. This makes spatial locality decisive — sequential access uses every byte of each fetched line, while scattered access wastes most of each line, multiplying the number of lines (and misses) for the same logical data.
Question 3¶
Define spatial and temporal locality and explain how the cache exploits each.
Temporal locality: a recently accessed address is likely to be accessed again soon — caches exploit it by retaining lines after use. Spatial locality: addresses near a recent access are likely next — caches exploit it by fetching a whole 64-byte line and by prefetching the following line(s). Real programs exhibit both, which is the entire reason caches work; code that destroys locality (pointer chasing, random hashing) leaves megabytes of cache idle.
Question 4¶
Explain the three types of cache misses and the fix for each.
Compulsory (cold): first touch of a line — unavoidable except via prefetching or touching less unique data. Capacity: working set exceeds cache size, so reused lines are evicted before reuse — fix by shrinking the working set (blocking/tiling, smaller types). Conflict: the set would fit but too many hot lines map to the same set and evict each other — fix by changing stride/layout (avoid power-of-two strides, pad arrays). Naming the type tells you which lever to pull.
Question 5¶
What is the memory wall and what does it imply for software design?
The memory wall is the long-running divergence between CPU speed (which grew fast) and DRAM latency (which barely improved — a few percent per year). A core can now execute hundreds of instructions in one ~100 ns DRAM access. Implications: caches/prefetchers are the only thing keeping CPUs busy; adding cores doesn't help memory-bound code (they all wait on DRAM); compute is effectively free relative to a miss, so recomputing or compressing to move fewer bytes often wins. Locality becomes the master performance variable.
Question 6¶
What is the TLB and how can it bottleneck a program whose data caches are fine?
The TLB caches virtual→physical page translations so the CPU avoids a multi-level page-table walk on every access. It's small (tens to ~1500 entries); with 4 KB pages a few thousand entries cover only a few MB. A workload that spans many pages — large or page-sparse access — suffers TLB misses (page walks costing tens to hundreds of cycles) even with perfect data-cache behavior. Huge pages (2 MB) let one entry cover 512× more memory, often the fix.
Question 7¶
Distinguish latency from bandwidth and when each binds.
Latency is the time for one access to return (~100 ns to DRAM); bandwidth is sustained bytes/second (tens of GB/s). Latency binds dependent accesses where you can't issue the next until the current returns (pointer chasing) — you pay full latency serialized. Bandwidth binds independent streaming where many accesses overlap. CPUs hide latency via memory-level parallelism (multiple in-flight misses), so sequential scans ride bandwidth while pointer chases expose latency.
Tool-Specific¶
Question 8¶
Which perf counters do you check to diagnose a memory-bound loop, and how do you interpret them?
cache-misses, LLC-load-misses (each ≈ a DRAM round trip), L1-dcache-load-misses, dTLB-load-misses, plus cycles/instructions for IPC. Low IPC (≪1) on a wide core signals stalls. Critically, convert misses to time: LLC-load-misses × ~100 ns and compare to runtime — a small miss ratio can still be the whole runtime. High dtlb...walk_active cycles → TLB-bound (huge pages). Sample with perf record -e LLC-load-misses:pp to locate the offending code.
Question 9¶
What does Top-Down Microarchitecture Analysis tell you that raw miss counts don't?
TMA classifies every pipeline slot into Frontend-Bound, Bad-Speculation, Retiring, and Backend-Bound (which splits into Core-Bound and Memory-Bound, further into L1/L2/L3/DRAM-Bound). It tells you the fraction of stalls actually caused by memory — whether memory tuning will help at all. A loop that's 60% DRAM-Bound needs fewer bytes/better locality; one that's Core-Bound won't benefit from any cache work. It prevents optimizing the wrong layer. Run via toplev (pmu-tools) or VTune.
Question 10¶
How would you find false sharing in a poorly scaling parallel program?
Use perf c2c (cache-to-cache), which records HITM events and reports the exact cache line being contended and the functions/PCs writing it. Symptom: a parallel section scales poorly or even slows past a few threads despite no logical data sharing. The report shows a single 64-byte line ping-ponging between cores. Fix: pad the independently-written data so each lands on its own line (alignas(64), padding bytes, Java @Contended, Rust CachePadded).
Tricky / Trap¶
Question 11¶
Two loops do the same additions over the same 2D array; one is 8× slower. Why?
Loop order versus memory layout. A 2D array is stored row-major; the fast loop's inner index walks contiguous memory (every cache line fully used, hardware prefetch engaged), while the slow loop strides by a full row each step, touching one element per line and defeating prefetch. Same operations, same result — the difference is entirely cache-line utilization and prefetcher behavior, invisible in the source.
Question 12¶
A "2% cache miss rate" loop is the bottleneck. How is that possible?
Because miss ratio is the wrong metric — convert to time. 2% of, say, 100 billion loads is 2 billion misses; at ~100 ns each that's ~200 seconds of pure DRAM stall. A low ratio over a huge load count can be the entire runtime. Always multiply miss count by miss latency and compare to wall-clock, never reason from the percentage alone.
Question 13¶
A linked list and an array hold the same 10M integers. Why is iterating the list ~100× slower?
Each node = node->next is a dependency: the next address is unknown until the current line returns, so misses serialize and you pay full DRAM latency per node (~10M × ~100 ns ≈ 1 s) with no prefetch and no memory-level parallelism. The array scan generates independent, predictable accesses — prefetched, riding bandwidth, a few ms. Identical big-O, but the array's locality and access independence win by orders of magnitude.
Question 14¶
Your in-memory cache suddenly shows tens-of-ms p99 latency. The code didn't change. What happened?
The working set likely outgrew RAM and the kernel started paging to swap (SSD) — the bottom of the hierarchy silently entered the hot path. A microsecond cache became a millisecond one. Confirm with vmstat swap-in/out (si/so) and memory pressure metrics. Fix: cap the cache below available RAM with admission/eviction; never allow the working set to cross the RAM→swap cliff. It's a hierarchy cliff, not a gradual slope.
Design¶
Question 15¶
You're storing millions of particles processed by a SIMD physics loop. AoS or SoA, and why?
SoA (separate contiguous arrays per field) for the field-at-a-time SIMD kernel: updating positions touches packed x[] and vx[] lines with every byte used and lets SIMD load 8–16 lanes per instruction — commonly 2–8× faster than AoS, where each line drags whole records (including unused fields) and strides defeat vectorization. If a different hot loop needs whole records, AoS or a hybrid AoSoA (tiles of N records) balances SIMD width against record locality. Match layout to the dominant access pattern.
Question 16¶
Design the memory-placement strategy for a high-throughput service on a 2-socket NUMA server.
Make allocation NUMA-aware: rely on or control first-touch so pages land on the node of the thread that will use them — avoid the trap of one thread allocating+zeroing everything (all pages on one node) before a cross-socket pool processes it. Use parallel first-touch (each worker initializes the slice it later processes), pin threads and memory with numactl/affinity, and shard data per node. Verify with numastat (remote-access %) and uncore IMC counters. Ignoring this commonly costs 30–50% throughput; remote access runs ~1.5–2× the latency at reduced bandwidth.
In this topic
- interview
- tasks