Object Memory Layout: Optimize
Layout optimizations land in three buckets: shrink (reduce per-object footprint), reorder (avoid padding holes), and arrange (use cache lines well). Each section names a concrete move, the JOL output that confirms it landed, and a sketch of when not to apply it. Numbers are illustrative — always confirm with JMH and JOL in your environment.
1. Field reordering is mostly automatic, but check
HotSpot reorders fields inside each class on its own, placing larger primitives first (the behavior historically selected by the default -XX:FieldsAllocationStyle=1, a flag that newer JDKs no longer support). You almost never gain bytes by hand-ordering. The exception is inheritance holes, where the parent's fields end mid-word and the child's long fields must wait for the next 8-byte boundary.
```java
public class Base {
    boolean flag;        // offset 12, ends at offset 13
}
public class Derived extends Base {
    long timestamp;      // forced to offset 16; offsets 13..15 are a hole
}
```
Derived is 24 bytes; merging into one class can sometimes cut the hole (and sometimes — as Bug 4 in find-bug shows — make it worse). The optimization rule:
- Default to one class for value-carrier shapes; flatten inheritance unless you need polymorphism.
- When inheritance is required, group small fields together in the parent so the parent's fields end on an 8-byte boundary (offset 16 with a 12-byte header), eliminating the child-side hole.
- Verify with JOL before claiming the saving. The JVM's layout logic is wiser than your spreadsheet.
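The hole arithmetic in the Base/Derived example can be checked by hand. The sketch below is a simplified model, not HotSpot's layout code: it assumes a 12-byte header (compressed class pointers on) and 8-byte object alignment, and the names `HoleArithmetic` and `alignUp` are made up for illustration. JOL remains the authority on a real JVM.

```java
// Simplified model of the inheritance-hole arithmetic; NOT HotSpot's real algorithm.
final class HoleArithmetic {
    static final int HEADER_BYTES = 12;     // assumption: mark word + compressed klass pointer
    static final int OBJECT_ALIGNMENT = 8;  // assumption: default object alignment

    // Rounds offset up to the next multiple of alignment (alignment must be a power of two).
    static int alignUp(int offset, int alignment) {
        return (offset + alignment - 1) & -alignment;
    }

    // Size of an object whose parent contributes one boolean and whose child adds one long.
    static int derivedInstanceSize() {
        int parentEnd = HEADER_BYTES + 1;                 // boolean flag occupies offset 12
        int timestampOffset = alignUp(parentEnd, 8);      // long must start on an 8-byte boundary
        return alignUp(timestampOffset + 8, OBJECT_ALIGNMENT);
    }

    public static void main(String[] args) {
        int parentEnd = HEADER_BYTES + 1;
        int timestampOffset = alignUp(parentEnd, 8);
        System.out.println("parent ends at " + parentEnd);                      // 13
        System.out.println("long starts at " + timestampOffset);                // 16
        System.out.println("hole bytes     " + (timestampOffset - parentEnd));  // 3
        System.out.println("instance size  " + derivedInstanceSize());          // 24
    }
}
```

The same three steps (end of parent, round up for the widest child field, round up for object alignment) apply to any parent/child pair you want to sanity-check before opening JOL.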
2. Primitive arrays beat reference arrays for hot loops
For 1 million int values, the footprint gap is "only" about 5x, but traversal is roughly 10x slower once boxes are involved:
| Storage | Footprint | Sequential sum time (1M elements, JMH) |
|---|---|---|
| int[] | ~4 MB | ~0.6 ms |
| Integer[] | ~20 MB (4M refs + 16M of boxes) | ~6 ms |
| ArrayList<Integer> | ~20 MB (same boxes) | ~7 ms |
The reasons:
- int[] is contiguous memory; the CPU prefetcher streams the next cache line before you ask.
- Integer[] is references; each load chases a pointer to wherever the box lives.
- Boxing also kills auto-vectorization: the JIT cannot apply SIMD to a chain of dependent loads.
Rule: for primitive numeric data on hot paths, never go through boxed collections. Use int[], long[], double[], or IntStream/LongStream. Boxing belongs in slow paths only.
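To make the contrast concrete, here is a minimal sketch of the two traversal shapes. The class and method names are illustrative; the code shows the access pattern only and is not a benchmark (use JMH for timings).

```java
import java.util.stream.IntStream;

final class PrimitiveSum {
    // Contiguous int[]: sequential loads, no boxes, no pointer chasing.
    static long sum(int[] values) {
        long total = 0;
        for (int v : values) total += v;
        return total;
    }

    // Boxed equivalent: every element is a reference that is dereferenced and unboxed.
    static long sumBoxed(Integer[] values) {
        long total = 0;
        for (Integer v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        int[] primitive = IntStream.range(0, 1_000_000).toArray();
        Integer[] boxed = IntStream.range(0, 1_000_000).boxed().toArray(Integer[]::new);
        System.out.println(sum(primitive));   // 499999500000
        System.out.println(sumBoxed(boxed));  // 499999500000
    }
}
```

Both methods return the same value; only the memory traffic behind them differs.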
3. Records cooperate with escape analysis
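A record is final, has only final fields, and exposes no mutating method. None of this is strictly required for scalar replacement (HotSpot only needs to prove that the object does not escape), but it gives the JIT a small, simple shape whose methods inline easily, which is what makes the proof succeed in practice: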
```java
public record Vec3(double x, double y, double z) {
    public double dot(Vec3 o) { return x*o.x + y*o.y + z*o.z; }
}
double sumOfDots(double[] xs, double[] ys, double[] zs, Vec3 v) {
    double sum = 0;
    for (int i = 0; i < xs.length; i++) {
        Vec3 p = new Vec3(xs[i], ys[i], zs[i]); // allocation in source
        sum += p.dot(v);                        // hot path
    }
    return sum;
}
```
In the loop, once dot is inlined, the JIT can prove p never escapes (dot only reads fields of its receiver and its argument; p is never stored anywhere). The allocation is eliminated: the three doubles of p live in CPU registers, and the heap allocation drops out entirely.
To confirm, inspect the JIT's escape-analysis diagnostics for this method: you should see a line naming the allocation site as eliminated. The JMH GC profiler (-prof gc) should report 0 bytes per op on the loop. See ../05-escape-analysis-and-scalar-replacement/ for the deep treatment.
The take-away: use records for value carriers that flow through tight loops. They are an EA-friendly shape, not just a syntax convenience.
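One way to check the claim without JMH is to read the per-thread allocation counter that HotSpot exposes through com.sun.management.ThreadMXBean. This is a minimal sketch, assuming a HotSpot JVM; `AllocationProbe` and its method names are illustrative. The reported figure depends on JIT warm-up, so treat a near-zero number as supporting evidence rather than proof.

```java
import java.lang.management.ManagementFactory;
import java.util.Arrays;

final class AllocationProbe {
    record Vec3(double x, double y, double z) {
        double dot(Vec3 o) { return x * o.x + y * o.y + z * o.z; }
    }

    static double sumOfDots(double[] xs, double[] ys, double[] zs, Vec3 v) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) {
            sum += new Vec3(xs[i], ys[i], zs[i]).dot(v); // candidate for scalar replacement
        }
        return sum;
    }

    // Bytes allocated so far by the current thread (HotSpot-specific MXBean; -1 if unsupported).
    static long allocatedBytes() {
        com.sun.management.ThreadMXBean bean =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return bean.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        double[] xs = new double[1024], ys = new double[1024], zs = new double[1024];
        Arrays.fill(xs, 1.0);
        Vec3 v = new Vec3(1, 2, 3);
        double sink = 0;
        // Warm up so the loop is likely to be compiled by C2 before measuring.
        for (int i = 0; i < 20_000; i++) sink += sumOfDots(xs, ys, zs, v);
        long before = allocatedBytes();
        for (int i = 0; i < 1_000; i++) sink += sumOfDots(xs, ys, zs, v);
        long after = allocatedBytes();
        System.out.println("bytes allocated while measuring: " + (after - before));
        System.out.println("sink = " + sink); // keeps the result live
    }
}
```

If the allocation were not eliminated, the measured phase would allocate roughly 1,000 × 1,024 records; a figure orders of magnitude below that suggests scalar replacement took effect.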
4. @Contended for genuinely contended fields
For two fields written by different threads in a tight loop, putting them on separate cache lines is the most dramatic optimization in this file. A JMH micro-benchmark on a Pair { long a; long b; }:
| Setup | Throughput (ops/s/thread) |
|---|---|
| Plain volatile long a, b | ~1 M (false sharing) |
| @Contended long a, @Contended long b (padding kicks in) | ~9–12 M |
The cost is bytes: each @Contended field consumes ~128 bytes of padding around it (default -XX:ContendedPaddingWidth=128). Apply only to:
- High-throughput counters (LongAdder's Cell[]).
- Per-thread state held in a shared object (ThreadLocalRandom's seed).
- Hot fields in lock-free data structures.
Do not apply to ordinary fields. A 256-byte object that is never written by two threads is a pure cost.
To use @Contended from application code on HotSpot:
Without both flags, the annotation is silently ignored (Bug 3 in find-bug).
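As an assumption about which two flags are meant: on JDK 9 and later, application code typically needs an --add-exports for the jdk.internal.vm.annotation package (so the annotation is visible to the compiler) and -XX:-RestrictContended (so HotSpot honors it outside JDK classes). A sketch under that assumption, with Pair.java as a hypothetical source file:

```shell
# Compile: make jdk.internal.vm.annotation.Contended visible to application code (JDK 9+).
javac --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED Pair.java

# Run: lift the restriction that limits @Contended to JDK-internal classes.
java --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED \
     -XX:-RestrictContended Pair
```

Afterwards, check the JOL layout of the class: if the padding is missing, the annotation was not honored.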
5. Compressed oops sizing decisions
The compressed-oops cliff sits around 32 GB heap (2^32 compressed references × 8-byte object alignment). Below it, every reference is 4 bytes; at or above, every reference becomes 8 bytes. The implications for sizing a service:
- 31 GB heap is the safe maximum for compressed oops.
- 32 GB heap is the boundary. Some JVM versions auto-disable compressed oops, others widen the encoding shift. Test before committing.
- 48 GB heap disables compressed oops. Every object grows; you often have less live capacity than 32 GB.
Mitigation: -XX:ObjectAlignmentInBytes=16 extends the addressable range to ~64 GB at the cost of more per-object padding. Sometimes a smaller heap (24 GB) outperforms a larger one (48 GB) because the smaller one keeps compressed oops on.
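The 32 GB and 64 GB figures follow from one multiplication: a 32-bit compressed reference can name 2^32 aligned slots, and each slot is one object-alignment unit wide. A small sketch of that arithmetic (the class and method names are illustrative):

```java
final class CompressedOopsRange {
    // 2^32 distinct compressed references, each pointing at an aligned slot.
    static long maxAddressableBytes(int objectAlignmentInBytes) {
        return (1L << 32) * objectAlignmentInBytes;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        System.out.println(maxAddressableBytes(8) / gib);   // 32
        System.out.println(maxAddressableBytes(16) / gib);  // 64
    }
}
```

Doubling the alignment doubles the reach, but it also rounds every object up to a multiple of 16 bytes, which is the padding cost mentioned above.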
The decision tree:
- Run with a heap below the 32 GB cliff and observe footprint with jcmd <pid> GC.heap_info.
- If you need more, try -Xmx30g (definitely compressed) before scaling up.
- If pressure remains, scale out (more pods) rather than up (one larger pod losing compressed oops).
- If you must go past 32 GB, accept the inflation and budget for it.
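Rather than inferring from the heap size, a running service can report whether compressed oops are actually on. The sketch below assumes a 64-bit HotSpot JVM and uses the HotSpot-specific HotSpotDiagnosticMXBean; `CompressedOopsCheck` and `vmOption` are illustrative names.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class CompressedOopsCheck {
    // Returns the current value of a HotSpot VM option as a string, e.g. "true".
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("UseCompressedOops = " + vmOption("UseCompressedOops"));
        System.out.println("max heap (bytes)  = " + Runtime.getRuntime().maxMemory());
    }
}
```

Logging these two values at startup makes it obvious when a heap bump has silently crossed the cliff.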
6. Valhalla flat layouts: future-proofing today
Project Valhalla introduces value classes (JEP 401). When delivered, value instances are expected to:
- Have no header when flattened (no mark word, no klass pointer).
- Be flattened in arrays (Point[] is flat: 8 bytes per element for two ints).
- Be flattened as fields (a Vec3 field inside a Particle stores the 24 bytes inline, no reference).
Today, the closest approximation is records + escape analysis (which already eliminates many allocations) and primitive int[] for hot collections. The structures you build today that will benefit most from Valhalla:
- Records used as hot-loop value carriers (already EA-friendly).
- record[] arrays of small value carriers (today they pay reference + header per element; under Valhalla they can be flat).
- Methods returning small records (the JIT often inlines today; Valhalla aims to make the allocation-free return reliable).
Avoid building structures that cannot migrate to value classes:
- Classes using identity equality on small carriers (== on Money instances).
- Synchronization on small carriers.
- Storage of identity hash codes computed from object identity.
These prevent Valhalla migration. Migrate them to .equals() and Objects.hash(fields) now.
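A minimal sketch of what the migrated shape might look like for the Money carrier mentioned above. The field names are assumptions for illustration; the point is that equality and hashing depend only on field values, never on object identity.

```java
import java.util.Objects;

final class Money {
    private final long minorUnits;   // hypothetical field: amount in cents
    private final String currency;   // hypothetical field: ISO code such as "EUR"

    Money(long minorUnits, String currency) {
        this.minorUnits = minorUnits;
        this.currency = Objects.requireNonNull(currency);
    }

    // State-based equality: two Money objects with the same fields are interchangeable.
    @Override public boolean equals(Object o) {
        return o instanceof Money m && minorUnits == m.minorUnits && currency.equals(m.currency);
    }

    // Hash derived from the fields, not from System.identityHashCode.
    @Override public int hashCode() {
        return Objects.hash(minorUnits, currency);
    }

    public static void main(String[] args) {
        Money a = new Money(1999, "EUR");
        Money b = new Money(1999, "EUR");
        System.out.println(a.equals(b));                   // true
        System.out.println(a.hashCode() == b.hashCode());  // true
    }
}
```

A record gives the same equals and hashCode for free; the explicit class is shown only to make the mechanics visible.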
7. Struct-of-Arrays (SoA) for the dominant hot path
When one field of a small object dominates traversal cost, split it out:
```java
// AoS: array of references to Particle objects
public class Particle { double x, y, z, vx, vy, vz; }
Particle[] particles = ...;
void integrate(double dt) {
    for (Particle p : particles) p.x += p.vx * dt;
}
```
This loop touches only x and vx. Every loaded Particle brings four wasted doubles (y, z, vy, vz) into cache. SoA version:
```java
public class ParticleSystem {
    double[] x, y, z, vx, vy, vz;
    public void integrate(double dt) {
        for (int i = 0; i < x.length; i++) x[i] += vx[i] * dt;
    }
}
```
The inner loop now touches two contiguous arrays. Throughput on 1M particles improves 3–5x in typical benchmarks. The JIT also auto-vectorizes the double[] math when alignment cooperates.
Trade-off: SoA is hostile to OO modeling. Apply only on the one hot loop that matters; keep AoS Particle for general operations. The hybrid pattern: ParticleSystem owns SoA arrays for the integrator; Particle instances are a view generated on demand for non-hot paths.
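The hybrid pattern described in the trade-off can be sketched as follows. The `set` and `particleAt` methods and the nested `Particle` record are illustrative names, not part of the original example; the snapshot is an immutable copy, so changes to it do not write back into the arrays.

```java
final class ParticleSystem {
    // Immutable view of one particle, built on demand for non-hot paths.
    record Particle(double x, double y, double z, double vx, double vy, double vz) {}

    private final double[] x, y, z, vx, vy, vz;

    ParticleSystem(int n) {
        x = new double[n];  y = new double[n];  z = new double[n];
        vx = new double[n]; vy = new double[n]; vz = new double[n];
    }

    // Copies one particle's state into the SoA arrays.
    void set(int i, Particle p) {
        x[i] = p.x();   y[i] = p.y();   z[i] = p.z();
        vx[i] = p.vx(); vy[i] = p.vy(); vz[i] = p.vz();
    }

    // Hot loop: touches only the two arrays it needs.
    void integrate(double dt) {
        for (int i = 0; i < x.length; i++) x[i] += vx[i] * dt;
    }

    // Cold path: materialize an object view only when someone asks for it.
    Particle particleAt(int i) {
        return new Particle(x[i], y[i], z[i], vx[i], vy[i], vz[i]);
    }

    public static void main(String[] args) {
        ParticleSystem ps = new ParticleSystem(2);
        ps.set(0, new Particle(1, 0, 0, 2, 0, 0));
        ps.integrate(0.5);
        System.out.println(ps.particleAt(0).x()); // 2.0
    }
}
```

The integrator never sees a Particle object; only callers on non-hot paths pay for the view allocation.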
8. Memory-aligned access and Unsafe.putLong
For Unsafe-level optimization in low-level libraries (network protocols, custom allocators), aligned access is faster than unaligned:
- A long read at an offset that is a multiple of 8 is one CPU instruction.
- A long read at an unaligned offset is sometimes two reads + a merge on older CPUs, or a single slower instruction on newer ones.
JOL output confirms field alignment: HotSpot places long fields at 8-byte-aligned offsets. If you build a custom binary protocol with Unsafe.putLong, ensure your offsets are aligned the same way.
```java
unsafe.putLong(buffer, ARRAY_BYTE_BASE_OFFSET + 0, word0); // aligned
unsafe.putLong(buffer, ARRAY_BYTE_BASE_OFFSET + 1, word1); // UNALIGNED: penalty
unsafe.putLong(buffer, ARRAY_BYTE_BASE_OFFSET + 8, word1); // aligned
```
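Modern Intel/AMD x86 tolerates unaligned access with a small penalty; ARM is more sensitive. If you build for ARM (Apple Silicon, AWS Graviton), align everything.
Both ByteBuffer.allocateDirect(...) and Unsafe.allocateMemory(...) return native memory whose base address is suitably aligned for all primitive types (the Unsafe Javadoc states this for allocateMemory, which OpenJDK's direct buffers use underneath). What you must align yourself is every offset you compute inside the block.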
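If you can use ByteBuffer instead of raw Unsafe, the standard library can check and enforce alignment for you. This sketch swaps in ByteBuffer.alignedSlice and ByteBuffer.alignmentOffset (both available since Java 9) in place of the Unsafe calls above; `AlignedWrites` and `alignedDirect` are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class AlignedWrites {
    // Returns a direct buffer whose index 0 sits on an 8-byte boundary.
    static ByteBuffer alignedDirect(int capacity) {
        // Over-allocate by 7 bytes so an aligned window of `capacity` bytes always fits.
        ByteBuffer raw = ByteBuffer.allocateDirect(capacity + 7);
        return raw.alignedSlice(8).order(ByteOrder.nativeOrder());
    }

    public static void main(String[] args) {
        ByteBuffer buf = alignedDirect(64);
        buf.putLong(0, 0x1122334455667788L);             // index 0: aligned
        buf.putLong(8, 42L);                             // index 8: aligned
        System.out.println(buf.alignmentOffset(0, 8));   // 0
        System.out.println(buf.alignmentOffset(1, 8));   // 1, a long written here would be unaligned
        System.out.println(buf.getLong(8));              // 42
    }
}
```

alignmentOffset gives you a cheap assertion for protocol code: any index where it returns non-zero for unit size 8 is a place where a long write would straddle the boundary.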
9. Reducing object count, not just size
The biggest layout win is usually not allocating the object at all. Strategies:
- Object pooling for high-allocation-rate types (rare in modern JVMs; mostly the JIT's job through EA). Apply only if you have evidence from jcmd <pid> JFR.dump that allocation is the dominant cost.
- Caching small immutable values. Integer.valueOf(127) returns a cached box; new Integer(127) always allocates (and the constructor is deprecated). The same trick works in user code: cache the 16 most commonly seen values of your record.
- Replace many small objects with one composite. Three Optional<X> fields in a class are three potential allocations and three references; one EnumSet<Flags> is a single object backed by one long.
- Lazy initialization for rarely-used data. A Map<X, RareData> sideTable is paid for only by the few X instances that need rare data; the main class stays lean.
For count reduction, the rule of thumb: aim for the GC.class_histogram top entry to drop by half. Reorganize until it does.
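The caching strategy can be sketched in the style of Integer.valueOf. `Percent` is a hypothetical value type invented for this example, and the 0..100 cache range is an assumption about which values are common.

```java
final class Percent {
    // Pre-built instances for the common range; sized once at class load.
    private static final Percent[] CACHE = new Percent[101];
    static {
        for (int i = 0; i < CACHE.length; i++) CACHE[i] = new Percent(i);
    }

    private final int value;

    private Percent(int value) { this.value = value; }

    // Factory: returns a shared instance for 0..100, allocates only for rare values.
    static Percent of(int value) {
        return (value >= 0 && value < CACHE.length) ? CACHE[value] : new Percent(value);
    }

    int value() { return value; }

    // Callers should still compare by value; sharing is an optimization, not a contract.
    @Override public boolean equals(Object o) { return o instanceof Percent p && p.value == value; }
    @Override public int hashCode() { return Integer.hashCode(value); }

    public static void main(String[] args) {
        System.out.println(Percent.of(42) == Percent.of(42));         // true, same cached instance
        System.out.println(Percent.of(500).equals(Percent.of(500)));  // true, equal by value
    }
}
```

The private constructor forces every caller through the factory, so the cache cannot be bypassed by accident.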
10. When not to optimize layout
Layout optimization is wasted unless one of these holds:
- The class appears in the jcmd GC.class_histogram top 20.
- The class is on a hot allocation site in async-profiler -e alloc.
- The class is hit by multiple threads writing under contention.
- The class participates in a hot SoA-vs-AoS loop.
Outside these, code clarity wins. A 32-byte record that should be 24 saves nothing if the JVM allocates 100 of them per second — the GC absorbs the difference. Optimize where the count is huge or the rate is high; everywhere else, write the obvious code.
11. Quick rules
- Default to JOL output; do not estimate footprint by adding fields.
- Prefer primitive arrays (int[], long[], double[]) over boxed collections on hot paths.
- Use records for value carriers; they cooperate with escape analysis.
- @Contended only for fields written under cross-thread contention; confirm with JOL that padding actually landed.
- Stay under 32 GB heap to keep compressed oops; sometimes a smaller heap is faster.
- SoA for the dominant hot loop, AoS for everything else.
- Reduce object count before optimizing object size — the bigger lever.
- Optimize only the GC.class_histogram top 20 and the contended counters; everywhere else, prefer clarity.
12. What's next
| Topic | File |
|---|---|
| Hands-on JOL exercises | tasks.md |
| Interview Q&A | interview.md |
Related sibling chapters: ../05-escape-analysis-and-scalar-replacement/ for the allocation-elimination layer, ../02-vtable-and-itable/ for the dispatch-layer cost that interacts with layout, ../../03-design-principles/ for the design choices that drive the object count.
Memorize this: the levers are shrink, reorder, arrange. Shrink by pruning fields and using primitives over boxes. Reorder by trusting HotSpot — and watching out for inheritance holes. Arrange by understanding cache lines and @Contended. Optimize only the heavy hitters (GC.class_histogram top 20, contended counters). Everywhere else, code clarity wins — JOL is a tool, not a habit.