Memory Layout — Senior Level¶

Topic: Memory Layout Focus: Design-level layout decisions — AoS vs. SoA, pointer chasing vs. flat arrays, object-header overhead across managed runtimes, and data-oriented design as an architecture.

Table of Contents¶

Introduction
Prerequisites
Glossary
Core Concepts
Real-World Analogies
Mental Models
Code Examples
Pros & Cons
Use Cases
Coding Patterns
Best Practices
Edge Cases & Pitfalls
Summary

Introduction¶

The lower tiers treated one struct at a time: how to order its fields, when to pack, how to avoid false sharing. The senior tier changes the unit of analysis from the struct to the data structure and the access pattern. Once you have millions of records and a loop that touches them, the dominant question is no longer "how is this struct laid out" but "how is the collection laid out relative to how the code traverses it."

Three decisions dominate senior-level layout design:

AoS vs. SoA — array-of-structs versus struct-of-arrays. The same fields, two physical arrangements, with order-of-magnitude performance differences depending on access pattern.
Pointer-based vs. flat — linked structures (lists, trees, graphs of heap pointers) versus contiguous arrays. This is the difference between cache-friendly and cache-hostile.
What the runtime adds — managed languages bolt object headers, GC metadata, and indirection onto your data, changing the layout you thought you designed.

These are not micro-optimizations. They are architectural choices that shape entire subsystems: ECS game engines, columnar databases, vectorized query engines, and high-throughput services are all, at heart, decisions about memory layout.

Prerequisites¶

You are fluent in alignment, padding, cache lines, false sharing, and hot/cold splitting (middle tier).
You understand the cache hierarchy and rough latencies (L1 ~4 cycles, L3 ~40, DRAM ~200+), and what a TLB miss and a hardware prefetcher are.
You have profiled real code and read cache-miss counters at least once.
You can reason about SIMD at a high level (one instruction operating on a vector of 4/8/16 lanes).

Glossary¶

AoS (Array of Structs) — struct P { x, y, z; }; P points[N]; — each record's fields are contiguous; records are laid end-to-end.
SoA (Struct of Arrays) — struct { float x[N], y[N], z[N]; } — each field gets its own contiguous array; one record's fields are scattered across arrays.
AoSoA / tiled — a hybrid: arrays of small SoA blocks (e.g. 8 records' worth of each field), balancing SIMD width and record locality.
Pointer chasing — traversing a structure by dereferencing a pointer to find the next node, where each node may live anywhere in the heap.
Data-oriented design (DOD) — an architecture philosophy: design around how data is transformed in bulk, not around conceptual objects. Layout follows access pattern.
Object header — per-object metadata a managed runtime prepends: identity/lock bits, GC marks, a class/type pointer.
Compressed oops — JVM optimization storing object references as 32-bit offsets (scaled) instead of 64-bit pointers, shrinking headers and references on heaps ≤ 32 GB.
Klass pointer — the JVM's pointer from an object to its class metadata.
ECS (Entity-Component-System) — a game/sim architecture storing components in SoA arrays keyed by entity, enabling cache-efficient bulk system passes.

Core Concepts¶

1. AoS vs. SoA: the same data, two physical worlds¶

Consider a particle system with position (x, y, z), velocity, and a color, processed by a physics step that only reads/writes positions and velocities.

AoS:

struct Particle { float x, y, z, vx, vy, vz; uint32_t color; };
Particle particles[1_000_000];

for (auto &p : particles)        // update positions
    p.x += p.vx, p.y += p.vy, p.z += p.vz;

Each Particle is 28 bytes (→ 32 with padding). The physics loop touches 6 of 7 fields, so AoS wastes little here — and AoS is excellent when you operate on whole records.

SoA:

struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    uint32_t color[N];
};
for (size_t i = 0; i < N; i++)
    P.x[i] += P.vx[i];   // ... etc

Now x[] is a dense contiguous array of floats. Three things happen:

SIMD becomes trivial. x[i] += vx[i] over a contiguous float array vectorizes to one AVX instruction per 8 elements. The compiler (and you) can load 8 lanes at once. In AoS, x values are 32 bytes apart — a gather, which SIMD hates.
Cache lines carry only useful data. A 64-byte line of x[] holds 16 positions, all of which you use. In AoS, a line holds 2 particles' worth of all fields, including color you didn't need.
Cold fields don't pollute hot scans. If the physics pass never touches color, SoA never loads it.

The flip side: if you need one whole particle (e.g., to render a single picked entity), SoA forces 7 separate array accesses across 7 distant regions — 7 cache lines for one record, where AoS needed one.

The rule: choose layout by access pattern. Operate on one field across many records → SoA. Operate on many fields of one record → AoS. Columnar scans, vectorized math, and bulk filters love SoA; transactional record-at-a-time access loves AoS.

This is exactly why columnar databases (Parquet, ClickHouse, DuckDB, Arrow) store data column-by-column: analytics queries scan one column across billions of rows, and SoA makes that bandwidth- and SIMD-optimal. Row stores (Postgres heap, MySQL InnoDB) are AoS, optimal for fetching whole rows in OLTP.

2. Pointer chasing vs. flat arrays¶

A linked list is the canonical cache-hostile structure:

struct Node { int value; struct Node *next; };

Each Node is heap-allocated, potentially far from its neighbors. Traversal is a chain of dependent loads: you cannot prefetch next->next until next resolves. Every hop is a likely cache miss (~200 cycles), and the CPU's out-of-order machinery stalls because the address of the next load depends on the result of the current one. This is pointer chasing, and it is why a linked list can be 10–50× slower to traverse than a flat array holding the same values — despite identical Big-O.

A vector/slice/array stores elements contiguously. Traversal is a linear sweep the hardware prefetcher predicts perfectly; misses are amortized across a whole cache line of elements. The same logic indicts:

Node-based trees (red-black trees, generic BSTs) vs. flat/array-backed structures (B-trees with large nodes, sorted arrays, implicit heaps).
Hash maps with separate-chaining vs. open-addressing (Swiss tables / hashbrown, Go's bucketized map) which keep entries in flat arrays.
Graphs of heap-pointer nodes vs. CSR (compressed sparse row) adjacency arrays.

The senior instinct: prefer flat, index-based structures over pointer-based ones for hot traversals. Replace *next pointers with array indices; replace heap-scattered nodes with a backing arena. You keep the algorithm and lose the cache misses.

3. Object-header overhead in managed runtimes¶

In C/Rust/Go (mostly), a struct is its fields. In managed runtimes, every heap object carries hidden metadata that dominates the layout of small objects.

JVM (HotSpot): each object has a header: - Mark word (8 bytes): identity hash, GC age/mark bits, lock state (biased/lightweight/heavyweight). - Klass pointer (4 bytes with compressed class pointers, else 8): points to the class metadata. - Plus alignment to 8 bytes.

So an Integer wrapping a 4-byte int costs 16 bytes (12-byte header + 4-byte value, padded to 16). A Long[] of boxed Longs is catastrophic: each element is a pointer (4–8 bytes) to a 16-byte heap object, scattered — pointer chasing plus header bloat. This is the entire reason for primitive arrays, IntStream, value types (Project Valhalla), and off-heap columnar storage in JVM data systems.

Compressed oops shrink this: on heaps ≤ 32 GB, the JVM stores references as 32-bit values scaled by the 8-byte alignment, so a reference is 4 bytes instead of 8 and the klass pointer is 4 bytes. Crossing the 32 GB heap threshold disables compressed oops and can make a larger heap hold less live data — a classic capacity-planning trap.

Go: objects have no per-object header. A struct on the heap is just its fields (the GC tracks metadata in separate span/bitmap structures, not inline). This is why Go structs are compact and why []Particle is genuinely contiguous. The cost is paid elsewhere (the GC's side tables), but your layout is what you wrote.

Other runtimes: CPython objects are enormous (refcount + type pointer + per-type fields; a bare int object is ~28 bytes), which is why NumPy exists — to store numbers in flat C arrays outside the object model. .NET has an 8-byte object header (sync block + method table pointer) plus 8-byte alignment.

The senior takeaway: in managed runtimes, the runtime's per-object overhead, not your field layout, often dominates — and the cure is fewer, bigger objects (struct-of-arrays, primitive arrays, value types, off-heap buffers) rather than millions of tiny ones.

4. Data-oriented design as an architecture¶

DOD inverts OOP's starting question. OOP asks "what is this thing?" and bundles its data and behavior into an object. DOD asks "what transformation runs over this data, and how is it laid out for that transformation to be fast?" Layout follows access pattern, not conceptual taxonomy.

The flagship is the ECS in game engines (Unity DOTS, Bevy, EnTT, Overwatch's engine). Instead of class Enemy { Transform t; Health h; AI ai; ... } (AoS objects, virtual dispatch, pointer chasing), you store each component in its own SoA array. A system (e.g., movement) iterates the Position and Velocity arrays in lockstep — pure linear, SIMD-friendly, branch-predictable passes over hot data, with cold components never loaded. The same idea powers vectorized query engines and high-frequency trading hot paths.

DOD is not anti-abstraction; it is anti-pessimistic-default-layout. It says: for the small set of code that runs millions of times, design the memory layout deliberately and let the access pattern dictate the structure.

Real-World Analogies¶

Spreadsheet stored by row vs. by column. AoS is a spreadsheet stored row-by-row: great for reading one person's whole record, slow for "average the salary column" (you skip across every row). SoA stores column-by-column: "sum the salary column" is one contiguous sweep, but reading one person means hopping across every column file. Choose by the query you run most.

A library's books vs. a card catalog. A flat array is books on one long shelf in order — you walk the shelf, grabbing each in turn. A linked list is a treasure hunt: each book contains a slip telling you which random shelf the next book is on. Same books, but the treasure hunt makes you sprint across the building between every read.

Shipping one product vs. a warehouse audit. AoS is a fulfillment center optimized to pick one complete order. SoA is optimized to audit "how many of SKU #5 across all orders." Neither layout is wrong; they serve opposite workloads.

Mental Models¶

"Layout is a function of the loop." There is no globally optimal layout — only a layout optimal for a given traversal. Identify the hottest loop, then lay data out for it; accept that other access patterns get slower.
"Every pointer hop is a potential 200-cycle stall." Treat heap pointers in hot paths as expensive. Flatten to indices/arenas where the traversal is hot.
"In managed runtimes, count the objects, not the fields." A million tiny objects = a million headers + a million scattered allocations. One big array = one allocation, zero headers. Prefer the latter.
"Bandwidth, not size." On a bulk scan, performance is bytes-moved-through-cache. SoA wins not because it's smaller but because every byte moved is a byte used.

Code Examples¶

Rust — AoS vs. SoA, and SIMD-friendliness¶

// AoS: one whole record is contiguous; bulk single-field ops gather.
#[derive(Clone, Copy)]
struct ParticleAoS { x: f32, y: f32, z: f32, vx: f32, vy: f32, vz: f32 }

struct ParticlesSoA {
    x: Vec<f32>, y: Vec<f32>, z: Vec<f32>,
    vx: Vec<f32>, vy: Vec<f32>, vz: Vec<f32>,
}

impl ParticlesSoA {
    // This loop is auto-vectorizable: x and vx are dense f32 arrays.
    fn step(&mut self) {
        for i in 0..self.x.len() {
            self.x[i] += self.vx[i];
            self.y[i] += self.vy[i];
            self.z[i] += self.vz[i];
        }
    }
}
// The AoS equivalent strides by sizeof(ParticleAoS) — the loads are 24 bytes
// apart, defeating contiguous SIMD loads and wasting cache-line bandwidth
// on vx/vy/vz the position-only pass might not need.

Go — flat indices instead of pointer chasing¶

// Pointer-based tree: each node is a separate, scattered allocation.
type NodePtr struct {
    Value       int
    Left, Right *NodePtr // chasing these is a cache miss per hop
}

// Arena/flat tree: nodes live in one contiguous slice; "pointers" are indices.
type Arena struct {
    Value       []int
    Left, Right []int32 // -1 == nil; indices into the same arrays
}

func (a *Arena) sum(i int32) int {
    if i < 0 {
        return 0
    }
    // Children are nearby in the backing arrays -> prefetcher-friendly,
    // and traversal order can be made contiguous by build order.
    return a.Value[i] + a.sum(a.Left[i]) + a.sum(a.Right[i])
}

Java — header overhead and the boxing trap¶

// Each boxed Long: 12-byte header + 8-byte value + padding = 24 bytes,
// PLUS an 8-byte reference in the array, scattered across the heap.
Long[] boxed = new Long[1_000_000];   // ~32 MB + pointer chasing on iterate

// Primitive array: pure contiguous 8-byte longs, one allocation, no headers.
long[] flat = new long[1_000_000];    // exactly 8 MB, dense, SIMD-able

// On heaps <= 32 GB, compressed oops make references 4 bytes; cross 32 GB
// and references jump to 8 bytes — a bigger heap can hold less live data.

Pros & Cons¶

Choice	Favors	Pays
AoS	Whole-record access, OLTP, locality of one entity, simpler code	Wastes line bandwidth on bulk single-field scans; poor SIMD
SoA	Bulk single-field scans, SIMD/vectorization, columnar analytics, cold-field exclusion	Whole-record access is N scattered reads; more bookkeeping; harder to mutate record count
AoSoA	Balance of SIMD width and record locality	Complex indexing; tuning the tile size
Flat/arena	Cache-friendly traversal, no per-node alloc, no chasing	Manual lifetime mgmt; resizing/relocation; index invalidation
Pointer-based	Easy insert/delete, stable addresses, natural recursion	Cache misses per hop; allocation overhead; no prefetch

Use Cases¶

SoA / columnar: analytics databases (ClickHouse, DuckDB, Arrow, Parquet), vectorized query engines, ML feature stores, particle/physics systems, signal processing.
AoS / row store: OLTP databases, request handlers fetching whole entities, anything record-at-a-time.
Flat/arena over pointers: parsers (AST in an arena/index vector), game scene graphs, ECS, compilers (region allocation), high-throughput servers.
Managed-runtime mitigations: primitive arrays over boxed collections, value types/records, off-heap buffers (Netty, Arrow, mapped files), @JvmInline/struct value types.

Coding Patterns¶

Structure-of-arrays transform: take a hot []Struct, split each field into a parallel array, and rewrite hot loops to iterate one field at a time. Keep an AoS "view" for rare whole-record access.
Index-as-pointer (arena): replace *Node with int32 indices into shared backing arrays; -1/sentinel for null. Enables compact serialization and cache locality.
Component arrays (ECS): store each component type in its own dense array; iterate the intersection of component sets in systems.
Off-heap / value-type escape hatch: in managed runtimes, move bulk numeric data out of the object model entirely (primitive arrays, direct ByteBuffer, Arrow vectors) to dodge headers and chasing.
AoSoA tiling: group records into tiles of K (e.g. SIMD width), SoA within a tile, AoS across tiles — when you need both SIMD and decent single-record locality.

Best Practices¶

Pick layout from the hottest access pattern, and write it down. Document "this is SoA because the analytics scan dominates"; the next engineer will otherwise "fix" it into AoS.
Default to flat, index-based structures in hot paths. Reach for pointer-based structures only when insert/delete churn or stable addressing genuinely demands it.
In managed runtimes, minimize object count for bulk data. Prefer primitive arrays, value types, and off-heap buffers over millions of small objects.
Keep AoS↔SoA reversible. Encapsulate the layout behind an interface so you can switch when profiling demands, without rewriting callers.
Watch the 32 GB JVM cliff. Size heaps to stay under compressed-oops threshold, or accept the reference-size jump consciously.
Validate with profiling, not theory. Confirm cache-miss reduction and vectorization actually happened (covered in the professional tier).

Edge Cases & Pitfalls¶

SoA hurts when you actually need whole records. A mixed workload (bulk scans and record fetches) may need both representations or an AoSoA compromise — don't dogmatically pick one.
SoA mutation of record count is painful. Inserting/removing a record means editing every parallel array consistently; a bug desyncs them silently.
"Linked lists are always bad" is too strong. For small N, intrusive lists with good allocation locality, or workloads dominated by O(1) splice/move, lists can win. The indictment is about cache-hostile traversal at scale.
Auto-vectorization is fragile. SoA enables SIMD but doesn't guarantee it; aliasing, non-contiguous access, or branches inside the loop can block the compiler. Check the generated assembly.
Compressed oops surprise across the heap cliff. Increasing -Xmx past ~32 GB can reduce effective capacity and slow everything down.
Arena indices are not lifetime-safe. A stale int32 index outliving its slot (after compaction/reuse) is a use-after-free analog with no type-system protection.
False sharing returns in SoA. Parallel writers each owning a slice of the same field array can false-share at slice boundaries; align partitions to cache lines.
Padding interacts with AoS strides. A poorly ordered AoS struct inflates the stride, multiplying wasted bandwidth across the whole array — junior-tier ordering still matters here.

Summary¶

At senior scale, the unit of analysis is the collection and its access pattern, not the lone struct.
AoS vs. SoA is chosen by the loop: SoA for bulk single-field scans, SIMD, and columnar analytics; AoS for whole-record access and OLTP. The same fields, two physical worlds, with order-of-magnitude consequences.
Pointer chasing (linked lists, node trees, chained maps, heap-pointer graphs) is cache-hostile: each hop is a dependent, unprefetchable, likely-miss load. Prefer flat, index/arena-based structures in hot traversals.
Managed runtimes add per-object headers (JVM mark word + klass pointer; .NET sync block + method table; CPython refcount + type) that dominate small-object layout; Go has no per-object header. The cure is fewer, bigger objects — primitive arrays, value types, off-heap buffers — and awareness of compressed-oops and the 32 GB JVM cliff.
Data-oriented design makes layout an architecture: design memory around the transformation, as in ECS engines and columnar/vectorized databases.
There is no universally best layout — only one optimal for a given access pattern. Choose deliberately, document the choice, and verify with profiling.

Next, the professional tier turns this into measured performance engineering: perf c2c, pahole, cache-miss counters, and quantifying every claim above.