Interpretation, Compilation, JIT, AOT — Hands-On Tasks¶
Introduction¶
You cannot feel the difference between a tree-walker and a bytecode VM, between a cold JIT and a warm one, or between AOT startup and JIT peak, by reading about them. You have to build the small versions and measure. These tasks take you from a 50-line interpreter to a profile-guided "compile-on-hot" loop, and every one ends with a number you can defend.
How to use this file: implement first, measure second, read the hints only after you have a result that surprises you. Use a real timer (perf_counter, System.nanoTime, performance.now) and always warm up before you measure a JIT. Tick the self-check boxes when you can explain the number to someone else, not when the program merely runs.
Table of Contents¶
Warm-Up¶
Task 1 — Disassemble three runtimes¶
Take the one-liner def f(n): return n*n + 1.
- In Python, run
import dis; dis.dis(f)and read the bytecode. - In Java, compile
int f(int n){ return n*n + 1; }and runjavap -con the class. - In a browser, paste the JS equivalent into the console with
%DisassembleFunction(Node--allow-natives-syntax) — or just observe that you can't see native code easily, and ask why.
Self-check: - [ ] I can point to the multiply, the add, and the return in each listing. - [ ] I can explain why the Python listing has explicit LOAD_FAST/BINARY_* opcodes but the Java JIT'd version eventually has none of them.
Hint
Python stays as bytecode and is interpreted; `javap` shows JVM *bytecode*, which is itself later JIT-compiled by HotSpot into native code you don't see in `javap`. The "I can't see it" for JS is the point: the native code is generated at runtime and thrown away on deopt.Task 2 — Tree-walker for arithmetic¶
Write a tree-walking interpreter for + - * / and integer literals over a parsed AST (you may hand-build the AST; parsing is not the point). Evaluate (2 + 3) * (4 + 5).
Self-check: - [ ] My eval(node) recurses on children and dispatches on node type. - [ ] I can state the per-node overhead (a virtual call / type switch per AST node, every time).
Core¶
Task 3 — Same language, bytecode VM¶
Compile the same AST from Task 2 to a tiny stack bytecode (PUSH, ADD, SUB, MUL, DIV) and write a fetch-decode-execute loop over a list/array of instructions. Run the same expression a million times in a loop with both the tree-walker and the bytecode VM. Record the ratio.
Self-check: - [ ] The bytecode VM is measurably faster. - [ ] I can explain why (no pointer-chasing over AST nodes; linear instruction stream; better cache behaviour; cheaper dispatch).
Hint
Expect somewhere between 2× and 10× depending on language. The win is mostly "flatten the tree into a linear stream and stop re-walking it." If your ratio is ~1×, you probably allocate inside the hot loop — hoist it out.Task 4 — Watch a JIT warm up¶
In Java (or Node), put a compute-heavy function in a loop and time each iteration block of, say, 100k calls. Print the per-block time for the first 30 blocks.
Self-check: - [ ] The first blocks are slow, then there's a step-change as the method gets compiled. - [ ] I can identify roughly which block the C2/TurboFan compile kicked in. - [ ] Adding -XX:+PrintCompilation (Java) shows the method being compiled at that moment.
Hint
HotSpot's default compile threshold is in the thousands of invocations; you'll see the interpreter → C1 → C2 staircase. On Node, `--trace-opt --trace-deopt` prints the same story for TurboFan.Task 5 — Force a deoptimization¶
Write a JS function that does integer math, call it 100k times with integers (so it gets optimized assuming integers), then call it once with a string or a float that breaks the speculation. Run with --trace-deopt.
Self-check: - [ ] I see a deoptimizing log line at the moment the type assumption breaks. - [ ] I can explain what "on-stack replacement" and "bailout to the interpreter" mean here.
Advanced¶
Task 6 — Dispatch overhead: switch vs computed-goto¶
Implement the bytecode loop from Task 3 twice in C: once with a switch on the opcode, once with a computed-goto / direct-threaded dispatch table (&&label GCC extension). Benchmark a long instruction stream.
Self-check: - [ ] I measured a difference (often 10–30%) and can explain it via branch prediction (one shared indirect branch in switch vs one per opcode in threading). - [ ] I understand why interpreter authors care about this single line of code.
Task 7 — AOT vs JIT: startup and peak¶
Take one CPU-bound program. Run it (a) on the JVM normally, and (b) compiled with GraalVM native-image (or a .NET app with NativeAOT). Measure two things: time to first output (startup) and steady-state throughput after warmup.
Self-check: - [ ] AOT starts much faster and uses less memory. - [ ] JIT reaches equal-or-higher peak throughput after warmup. - [ ] I can state which one I'd pick for (1) a CLI tool, (2) a long-running server, (3) a serverless function — and why.
Hint
This is the whole tradeoff in two numbers: AOT trades away runtime adaptive specialization (and breaks closed-world assumptions like reflection) for instant startup and low memory. Serverless cold-starts are exactly why AOT for managed languages came back.Task 8 — Profile-guided constant¶
Write a function with a branch whose outcome is 99% one-way. Compile it with PGO (-fprofile-generate then -fprofile-use, or pgo in Go/.NET) and inspect whether the hot path got laid out first / the cold path got outlined.
Self-check: - [ ] I can see (in the asm or in the timing) that the compiler used the profile. - [ ] I understand that a JIT gets this profile for free at runtime, while AOT needs a training run.
Capstone¶
Task 9 — A "compile-on-hot" mini-runtime¶
Build a toy runtime for the arithmetic language that:
- Starts every function as a tree-walker.
- Counts invocations per function.
- After a function crosses a threshold
N, compiles it to the bytecode form (Task 3) and switches future calls to the faster path. - Logs each "tier-up" event:
function <id> promoted after <N> calls.
Then add one speculation: assume all operands are small integers, compile a specialized path that skips type checks, and deoptimize (fall back to the tree-walker) the first time a non-integer appears — logging the bailout.
Self-check: - [ ] My runtime demonstrates the interpreter → compiled staircase on a real workload. - [ ] My deopt path is correct: the program produces the right answer even when speculation fails, just slower. - [ ] I can map every piece of my toy to its real-world counterpart (HotSpot tiers, OSR, deopt, profile counters).
Hint
Keep the "compiler" trivial — flattening the AST to bytecode *is* a compiler for this purpose. The lesson is the control structure (count → tier-up → specialize → guard → deopt), which is exactly what HotSpot, V8, and SpiderMonkey do at vastly larger scale.Self-Assessment¶
You have internalised this topic when you can:
- Place any runtime on the interpret→bytecode→baseline-JIT→optimizing-JIT→AOT spectrum and justify it.
- Explain warmup, OSR, and deoptimization from having watched them, not read them.
- Predict, for a new workload, whether AOT or JIT wins — and on which metric (startup, memory, peak).
- Explain why "the interpreter is slow" is mostly about dispatch and pointer-chasing, and what a bytecode VM does about it.
In this topic
- interview
- tasks