Stack vs Heap — Middle Level¶
Topic: Stack vs Heap Focus: The machine-level mechanics of frames and allocation, and a quantitative cost model for choosing between them.
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Core Concepts
- Anatomy of a stack frame
- Calling conventions
- Why stack allocation is one instruction
- Why heap allocation costs more
- Escape analysis
- Mental Models
- Code Examples
- Cost Model
- Pros & Cons
- Use Cases
- Coding Patterns
- Best Practices
- Edge Cases & Pitfalls
- Summary
Introduction¶
At the junior level "stack vs heap" is a story about lifetime. At the middle level it becomes a story about machine instructions and cost. You now need to know what a frame actually contains byte-for-byte, why bumping the stack pointer costs roughly one cycle while a heap allocation can cost dozens to thousands, and how a modern compiler can quietly turn a heap allocation into a stack allocation behind your back.
This is the level where you start reading allocation profiles and asking "why is this on the heap?" — and getting answers from tools instead of guessing.
Prerequisites¶
- Junior-level grasp: stack = call frames, heap = arbitrary-lifetime data, lifetime is the deciding factor.
- Comfort reading a little pseudo-assembly (
push,pop, register names). - You know what a CPU register and a cache line are, at least roughly.
- You can run a compiler with flags (e.g.,
go build -gcflags=-m,gcc -O2 -S).
Glossary¶
- Frame pointer (FP / RBP): A register pointing at a fixed spot in the current frame, used as a stable base to address locals and parameters.
- Stack pointer (SP / RSP): Points at the current top of the stack.
- Return address: The instruction address to resume at after the callee returns. Pushed by the
callinstruction. - Calling convention (ABI): The agreed rules for how arguments are passed, who saves which registers, and how results are returned.
- Prologue / epilogue: The instructions at a function's start/end that set up and tear down its frame.
- Bump allocation: Allocating by simply advancing a pointer past the requested size.
- Free list: A data structure tracking reusable freed heap blocks, grouped by size.
- Escape analysis: A compiler pass that decides whether a value can stay on the stack or must move ("escape") to the heap.
- Size class: A bucket of allocation sizes (e.g., 16, 32, 48 bytes) used by modern allocators to avoid per-request bookkeeping.
Core Concepts¶
Anatomy of a stack frame¶
When a function runs, its frame on x86-64 typically looks like this (addresses increasing upward, stack growing downward):
higher addresses
┌─────────────────────────────┐
│ caller's frame … │
├─────────────────────────────┤
│ argument 8, 9, … (if any) │ ← extra args spilled to stack
│ return address │ ← pushed by `call`
│ saved RBP (caller's FP) │ ← pushed by prologue
├─────────────────────────────┤ ← RBP points here
│ local variable a │
│ local variable b │
│ saved callee-saved regs │
│ spilled temporaries │
├─────────────────────────────┤ ← RSP points here (top)
│ (red zone, 128 bytes) │ ← scratch below RSP on System V
└─────────────────────────────┘
lower addresses
The first several arguments are passed in registers (on System V x86-64: RDI, RSI, RDX, RCX, R8, R9 for integers/pointers; XMM0–XMM7 for floats). Only when you run out of registers do arguments spill onto the stack. So a typical small call touches the stack mostly for the return address and a few locals.
A function's prologue and epilogue manage the frame:
push rbp ; save caller's frame pointer
mov rbp, rsp ; establish our frame pointer
sub rsp, 32 ; reserve 32 bytes for locals ← THE allocation
... ; function body
leave ; mov rsp, rbp ; pop rbp (tear down)
ret ; pop return address, jump to it
That single sub rsp, 32 is the allocation of every local in the function at once. With frame-pointer omission (-fomit-frame-pointer, on by default at -O2), even the push rbp/mov rbp,rsp pair disappears and locals are addressed off RSP directly.
Calling conventions¶
A calling convention (part of the platform ABI) pins down the contract between caller and callee so code compiled separately can interoperate:
- Argument passing: which registers, in what order, and how the rest spill to the stack.
- Return values: where results go (RAX/RDX for integers, XMM0 for floats; large structs returned via a hidden pointer).
- Caller-saved vs callee-saved registers: the callee may freely clobber caller-saved registers (RAX, RCX, RDX, RSI, RDI, R8–R11); it must preserve callee-saved ones (RBX, RBP, R12–R15) by saving and restoring them in its frame.
- Stack alignment: System V requires RSP to be 16-byte aligned at each
call. Violating it crashes SSE code.
Go does not use the C ABI; it has its own register-based convention (since Go 1.17) and its goroutine stacks differ fundamentally, which matters when you call into C via cgo and pay a transition cost.
Why stack allocation is one instruction¶
Reserving stack space is sub rsp, N. Releasing it is leave (or add rsp, N). There is:
- No search for a suitable block.
- No metadata written per object.
- No free list to update.
- No synchronization — the stack is thread-private.
And it is cache-hot: the top of the stack was touched on the previous call and is almost certainly already in L1. Bumping the pointer and writing a local is among the cheapest things a CPU does — call it ~1 cycle of overhead.
Why heap allocation costs more¶
A heap allocator (malloc, Go's mallocgc, the JVM's TLAB path) must do real work:
- Round the request up to a size class (e.g., a 20-byte request becomes a 32-byte slot).
- Find a free slot of that class — pop from a per-thread/per-CPU free list in the fast path.
- Update bookkeeping so the slot is marked used.
- On the slow path, when the local cache is empty, grab a fresh run of memory from a central pool, possibly under a lock, possibly via an
mmap/sbrksyscall to the OS. - For GC'd heaps, record the object for the collector (write barriers, span metadata), and later the GC must scan and reclaim it — a cost paid at collection time, not allocation time.
The fast path of a good allocator is tens of instructions — still ~10–100× a stack bump. The slow path (syscall, lock contention, GC trigger) can be thousands of cycles. And freeing is its own cost: returning a block to a free list, or, for GC, the mark/sweep work that scales with live-object count.
There is also a second-order cost: heap objects are scattered across address space, so chasing pointers between them causes cache misses (a miss to DRAM is ~100+ ns, ~hundreds of cycles). Stack data is dense and hot; heap data is sparse and cold.
Escape analysis¶
Managed and systems compilers run escape analysis: a static pass that asks "could a reference to this value outlive the current function?" If the answer is provably no, the value can be stack-allocated even if you wrote it like a heap object.
A value escapes when:
- Its address is returned to the caller.
- It is stored into something that outlives the function (a global, a heap object, a channel).
- It is captured by a closure that escapes.
- It is passed to a function the compiler cannot see through (an interface method, an indirect call), so the compiler must assume the worst.
In Go, go build -gcflags=-m prints these decisions. In the JVM, escape analysis enables scalar replacement (exploding an object into its fields, held in registers — no allocation at all) and lock elision. Escape analysis is conservative: if it cannot prove safety, it heap-allocates. So the way you write code (returning pointers, using interfaces) directly changes where data lands.
Mental Models¶
- The stack pointer is a high-water mark. Allocation lowers it; deallocation raises it. The "freed" memory is not zeroed — it is simply abandoned and overwritten by the next call. This is why uninitialized locals contain garbage.
- Heap allocation is a transaction; stack allocation is a register write. One involves a bookkeeper (the allocator); the other does not.
- Escape analysis is the compiler doing your lifetime reasoning for you. When you "shouldn't" allocate but can't prove it, the compiler can — or can't, and then you pay.
Code Examples¶
Go — seeing escape analysis decisions¶
package main
type Point struct{ X, Y int }
//go:noinline
func stackPoint() int {
p := Point{1, 2} // does NOT escape — stays on the stack
return p.X + p.Y
}
//go:noinline
func heapPoint() *Point {
p := Point{3, 4} // escapes — address returned
return &p
}
func main() {
_ = stackPoint()
_ = heapPoint()
}
The compiler tells you in plain words: heapPoint's p is moved to heap; stackPoint's is not.
C — explicit stack vs heap¶
#include <stdlib.h>
#include <string.h>
// stack: lifetime ends at return; cheap; cannot be returned
void process_local(void) {
char buf[256]; // sub rsp, 256 — basically free
memset(buf, 0, sizeof buf);
}
// heap: lifetime controlled by caller; costs an allocator round-trip
char *make_buffer(size_t n) {
char *buf = malloc(n); // allocator finds a slot, bookkeeps
if (buf) memset(buf, 0, n);
return buf; // safe: heap data outlives the function
// caller MUST free(buf)
}
Java — escape analysis and scalar replacement¶
// With -XX:+DoEscapeAnalysis (default in HotSpot), this allocation
// is often eliminated entirely: the Point's fields become locals/registers.
static long sumDistances(int n) {
long total = 0;
for (int i = 0; i < n; i++) {
Point p = new Point(i, i); // "new" — but may never hit the heap
total += p.x + p.y; // p doesn't escape the loop body
}
return total;
}
Here new Point(...) reads as a heap allocation, but because p never escapes, HotSpot's JIT can apply scalar replacement and allocate nothing — a striking example of "the source says heap, the machine says stack (or registers)."
Cost Model¶
Rough orders of magnitude on a modern x86-64 core (your mileage varies):
| Operation | Approx. cost | Why |
|---|---|---|
| Stack allocation (reserve N bytes) | ~1 cycle | sub rsp, N; cache-hot |
| Heap alloc, fast path (TLAB / per-P cache) | ~20–100 cycles | size-class lookup, free-list pop |
| Heap alloc, slow path (refill / syscall) | ~1,000–10,000+ cycles | central pool, lock, mmap/sbrk |
| Access to cache-hot stack data | ~1–4 cycles (L1) | dense, recently touched |
| Access to cache-cold heap data | ~100–300 cycles | likely DRAM miss |
| GC scan per live object | small but cumulative | paid at collection, scales with live set |
The practical takeaway: a tight loop that heap-allocates per iteration can be 10–100× slower than one that doesn't — not because the allocation is huge, but because of allocation plus cache misses plus the GC pressure it creates downstream.
Pros & Cons¶
Stack
- Pros: ~1-cycle alloc/free; automatic, leak-proof cleanup; cache-hot; no fragmentation; lock-free (thread-private).
- Cons: bounded size; scope-bound lifetime; cannot return references to locals (in unmanaged languages); large locals risk overflow.
Heap
- Pros: arbitrary lifetime and sharing; large objects; dynamically sized structures.
- Cons: alloc/free cost; allocator metadata and contention; fragmentation; GC pressure or manual-free bugs; cache-cold access.
Use Cases¶
- Stack: hot-loop temporaries, small fixed structs, function parameters, recursion state of bounded depth.
- Heap: returned objects, runtime-sized buffers and collections, shared state across goroutines/threads, anything large enough to threaten the stack limit.
Coding Patterns¶
- Pre-allocate and reuse heap buffers outside hot loops instead of allocating per iteration (e.g., a reused
[]byteor async.Poolin Go). - Pass small structs by value to keep them on the stack; pass large ones by pointer — but know that taking the address may cause an escape.
- Avoid interface boxing in hot paths. Storing a concrete value in an interface (Go) or autoboxing a primitive (Java) forces a heap allocation.
- Read the escape report before micro-optimizing. Let the tool, not intuition, tell you what escapes.
Best Practices¶
- Measure allocations, don't guess:
go test -bench -benchmem,pprofalloc profiles, JFR/async-profiler for the JVM. - Keep frames modest; move large buffers to the heap so you don't overflow the stack.
- In C, document ownership: every function that returns heap memory should state who frees it.
- Treat per-iteration heap allocation in a hot loop as a smell worth profiling.
Edge Cases & Pitfalls¶
- Surprise escapes: taking
&local, capturing a variable in an escaping closure, or putting a value in aninterface{}can silently move it to the heap. The escape report is the ground truth. - Stack overflow from large locals:
int big[1<<20]on the stack (4 MB) blows past typical limits instantly. - Frame-pointer omission breaks naive stack walkers: profilers may need
-fno-omit-frame-pointeror DWARF unwinding to produce correct stacks. - Misreading "new" as always-heap: with escape analysis,
new/makemay allocate nothing. The source is not the final word; the compiler is.
Summary¶
- A stack frame holds a call's return address, saved frame pointer, saved registers, and locals; the prologue's
sub rsp, Nallocates them all in one instruction. - Calling conventions dictate register usage, caller/callee-saved registers, and stack alignment — the contract that lets separately compiled code interoperate.
- Stack allocation is ~1 cycle and cache-hot; heap allocation costs tens of cycles on the fast path, thousands on the slow path, plus cache-miss and GC costs downstream.
- Escape analysis lets compilers stack-allocate values you wrote as heap objects (and JVM scalar replacement can eliminate the allocation entirely). The way you write code — returning pointers, using interfaces, capturing in closures — determines what escapes.
- Profile allocations; never optimize stack-vs-heap from intuition alone.
In this topic
- junior
- middle
- senior
- professional