Bytecode & Virtual Machines — Senior Level¶

Topic: Bytecode & Virtual Machines
Focus: Interpreter dispatch techniques, superinstructions and stack caching, bytecode verification, lazy linking, and the handoff to the JIT.

Table of Contents¶

Introduction
Prerequisites
Glossary
Core Concepts
Real-World Analogies
Mental Models
Code Examples
Pros & Cons
Use Cases
Coding Patterns
Best Practices
Edge Cases & Pitfalls
Common Mistakes
Test Yourself
Cheat Sheet
Summary
Further Reading

Introduction¶

Focus: Making the interpreter fast (dispatch, superinstructions, stack caching), making it safe (verification), making it link (lazy resolution), and handing hot code to the JIT.

At the middle level, the interpreter loop was a switch and bytecode was a format you could parse. A senior engineer needs to know the parts that production VMs obsess over:

Dispatch. The switch-based loop is the slow way. Real interpreters use direct threading, computed goto, or tail-call threading to cut the per-instruction overhead — chiefly to dodge the CPU branch-predictor penalty that a single shared switch causes. On top of that: superinstructions (fuse common opcode pairs) and stack caching (keep the top of the operand stack in CPU registers).
Verification. Before the JVM runs bytecode it doesn't trust, a verifier proves the bytecode is type-safe and stack-balanced — statically, before a single instruction executes. This is what makes it safe to run untrusted bytecode (applets, plugins, sandboxes). Wasm's design takes this further: validation is linear-time and total.
Linking and resolution. A .class references other classes and methods by name (symbolic references). The VM resolves those to concrete addresses lazily, the first time each is actually used.
The JIT handoff. Bytecode is the input to a just-in-time compiler. We won't build a JIT here, but you must understand why bytecode is the natural handoff point and what properties make it JIT-friendly.

In one sentence: this page is about the engineering that turns "a switch over opcodes" into a fast, safe, linkable runtime.

🎓 Why this matters at this level: These topics separate "I wrote a toy VM" from "I understand why HotSpot, V8, CPython, and Wasm engines are built the way they are." Dispatch technique alone can change interpreter speed by a large constant factor (1.5–2.5× is often quoted, though the gain depends heavily on the CPU and workload); verification is the entire security story for running untrusted code; and the JIT handoff is the reason bytecode formats look the way they do.

Prerequisites¶

Required: junior.md and middle.md — stack vs register VMs, instruction anatomy, jumps/backpatching, the constant pool, class-file structure.
Required: Comfort with C-level thinking: function pointers, goto, the cost of a branch, and roughly what a CPU pipeline and branch predictor are.
Required: Knowing what a JIT compiler is at a conceptual level (translates bytecode → machine code at runtime).
Helpful: Familiarity with type systems (the verifier reasons about types) and with the idea of a dataflow / fixpoint analysis.

You do not need to build a JIT or know SSA/register allocation — those are downstream topics. We treat the JIT as the consumer of bytecode.

Glossary¶

Term	Definition
Dispatch	The mechanism the interpreter uses to jump from one opcode's handler to the next.
Switch dispatch	A `while(true){ switch(op){...} }` loop. Portable, but every opcode funnels through one shared indirect branch that the CPU predicts poorly.
Direct threading	Each bytecode is (or points to) the address of its handler; handlers jump directly to the next. Needs computed goto / labels-as-values.
Computed goto	A GCC/Clang extension (`&&label`, `goto *ptr`) used to implement direct/token threading in C.
Token threading	Bytecode stays as small tokens; a jump table maps token → handler address. The common "direct threading" in interpreters.
Tail-call threading	Each handler is a function ending in a tail call to the next handler; the compiler turns tail calls into jumps (`musttail`).
Branch predictor	CPU hardware guessing branch targets. A single shared dispatch branch is hard to predict; per-opcode branches predict better.
Superinstruction	A single opcode that fuses a frequent sequence (e.g. `LOAD+LOAD+ADD`) to cut dispatches.
Stack caching	Keeping the top N operand-stack entries in CPU registers instead of memory, reducing loads/stores.
Verification	A static check (before execution) that bytecode is type-safe, stack-balanced, and well-formed.
Stack-map frame	Per-jump-target type snapshot the JVM verifier uses to check merges in (near) linear time (Java 6+).
Symbolic reference	A class/method/field referred to by name in the constant pool, not yet resolved to an address.
Resolution / linking	Turning a symbolic reference into a concrete pointer/offset — done lazily on first use.
Hot path / hot method	Code executed often enough that the VM decides to JIT-compile it.
Tiered compilation	Running interpreted first, then JIT-compiling progressively hotter code at higher optimization.
OSR (on-stack replacement)	Swapping a running interpreted frame for a compiled one mid-execution (e.g. a hot loop).

Core Concepts¶

1. Dispatch: the inner loop's biggest lever¶

The naive interpreter:

while (1) {
    switch (*pc++) {
        case OP_ADD: /* ... */ break;
        case OP_LOAD: /* ... */ break;
        /* ... */
    }
}

Every iteration ends by branching back to the top of the loop and through one switch, which the compiler typically implements as a single indirect branch through a jump table. The problem is the branch predictor: that one shared indirect branch jumps to a different handler almost every iteration, so a predictor that keys mainly on the branch's address mispredicts often, and each miss stalls the pipeline. Recent CPUs fold more branch history into indirect-branch prediction and narrow the gap, which is a reason to measure rather than assume. The switch is correct and portable, but it leaves performance on the table.

Direct / token threading. Instead of returning to a central switch, each handler ends by jumping straight to the next handler. In GCC/Clang you implement this with computed goto (labels-as-values):

static void *table[] = { &&op_add, &&op_load, &&op_store, /* ... */ };
#define DISPATCH() goto *table[*pc++]

DISPATCH();
op_add:  { /* pop b,a; push a+b */ DISPATCH(); }
op_load: { /* push locals[*pc++] */ DISPATCH(); }

Now there's a separate indirect branch at the end of each handler. The branch predictor can learn per-handler patterns (e.g. LOAD is very often followed by another LOAD or ADD), so prediction improves. How much this buys depends on the CPU and the workload: 1.5–2.5× over switch dispatch is often quoted, while the comments in CPython's own interpreter loop cite up to 15–20% for its computed-goto build. CPython adopted computed-goto dispatch (when the compiler supports it) for exactly this reason.

Tail-call threading. A newer approach: make each handler a function whose last action is a tail call to the next handler. With a compiler that guarantees tail calls ([[clang::musttail]], or proper tail calls in some languages), these become jumps with no stack growth. The benefit: the optimizer treats each handler as its own function with its own register allocation, which can beat computed goto, and the technique does not need labels-as-values. It is not standard C either, though: it works only where the compiler offers a tail-call guarantee. (CPython 3.14's experimental "tail-calling interpreter" uses this.)

Direct threading (true). Pre-translate the bytecode so each instruction is a handler address (not a token indexed through a table). Saves the table lookup but bloats the code stream to pointer-width and complicates the format. Token threading is usually the sweet spot.
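The difference between token threading and true direct threading is easiest to see as a data-structure change. The following is a toy model in Python, not a performance demonstration: Python cannot express computed goto, so handler functions stand in for handler addresses, and the opcode numbers and helper names are made up for illustration.

```python
# Toy model only: function references stand in for handler addresses.

def op_push(vm, arg):
    vm["stack"].append(arg)

def op_add(vm, arg):
    b = vm["stack"].pop()
    a = vm["stack"].pop()
    vm["stack"].append(a + b)

PUSH, ADD = 0, 1               # hypothetical opcode tokens
TABLE = [op_push, op_add]      # token -> handler (the "jump table")

def run_token_threaded(code):
    """Token threading: every step indexes TABLE with the small token."""
    vm = {"stack": []}
    for token, arg in code:
        TABLE[token](vm, arg)
    return vm["stack"]

def pretranslate(code):
    """Direct threading: replace each token by its handler once, up front."""
    return [(TABLE[token], arg) for token, arg in code]

def run_direct_threaded(threaded):
    """No table lookup per step: the code stream already holds handlers."""
    vm = {"stack": []}
    for handler, arg in threaded:
        handler(vm, arg)
    return vm["stack"]

program = [(PUSH, 2), (PUSH, 3), (ADD, None)]
```

Both runners compute the same result for `program`; the pre-translated stream is wider (one reference per instruction instead of a small integer), which is the size cost described above.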

2. Superinstructions and stack caching¶

Two further interpreter optimizations, both attacking the number of dispatches and the cost per instruction:

Superinstructions. Profile real bytecode and you find that a few opcode pairs and triples dominate, for example LOAD_FAST; LOAD_FAST or LOAD_CONST; BINARY_OP. Fuse each hot sequence into a single opcode that does the work of all its parts: one fetch, one dispatch, one handler. Fewer dispatches, less PC bookkeeping. Some VMs generate superinstructions statically; others build them per-program. CPython has shipped fused opcodes of this kind (the LOAD_FAST_LOAD_FAST family). Its "specializing adaptive interpreter" (PEP 659; 3.11 exposed it through _ADAPTIVE opcode variants) is a related but distinct runtime technique: rather than fusing sequences, it rewrites a generic opcode into a type-specialized one (for example BINARY_OP into BINARY_OP_ADD_INT) based on observed types.
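A static fusion pass can be sketched as a peephole rewrite. The version below is a minimal illustration under stated assumptions: instructions are tuples, jump operands are instruction indices rather than byte offsets, and the opcode names (`LOAD`, `ADD`, `JUMP`, `LOAD_LOAD_ADD`) are hypothetical. It counts dispatches so the saving is visible.

```python
def fuse(code):
    """Rewrite LOAD i; LOAD j; ADD into one LOAD_LOAD_ADD i j.

    A triple is fused only if no jump lands on its 2nd or 3rd instruction,
    because those instructions stop existing as separate entry points.
    """
    targets = {ins[1] for ins in code if ins[0] == "JUMP"}
    out, new_index, i = [], {}, 0
    while i < len(code):
        new_index[i] = len(out)
        window = code[i:i + 3]
        if ([ins[0] for ins in window] == ["LOAD", "LOAD", "ADD"]
                and not ({i + 1, i + 2} & targets)):
            out.append(("LOAD_LOAD_ADD", window[0][1], window[1][1]))
            i += 3
        else:
            out.append(code[i])
            i += 1
    new_index[len(code)] = len(out)
    # Jump operands are instruction indices, so remap them after fusion.
    return [("JUMP", new_index[ins[1]]) if ins[0] == "JUMP" else ins
            for ins in out]

def run(code, local_vars):
    """Execute and return (final stack, number of dispatches)."""
    stack, pc, dispatches = [], 0, 0
    while pc < len(code):
        ins = code[pc]
        pc += 1
        dispatches += 1
        op = ins[0]
        if op == "LOAD":
            stack.append(local_vars[ins[1]])
        elif op == "ADD":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "LOAD_LOAD_ADD":
            stack.append(local_vars[ins[1]] + local_vars[ins[2]])
        elif op == "JUMP":
            pc = ins[1]
        else:
            raise ValueError(op)
    return stack, dispatches

program = [("LOAD", 0), ("LOAD", 1), ("ADD",), ("LOAD", 2), ("ADD",)]
fused = fuse(program)
# Index 3 is the middle of the only LOAD; LOAD; ADD triple, so it must stay.
guarded = [("LOAD", 0), ("JUMP", 3), ("LOAD", 0), ("LOAD", 1), ("ADD",)]
```

With locals `[1, 2, 3]`, `program` needs five dispatches and `fused` needs three for the same result, while `guarded` is left untouched because a jump lands inside the candidate triple.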

Stack caching. The operand stack normally lives in memory; every push/pop is a load/store. Stack caching keeps the top 1–2 entries in CPU registers, so chains like LOAD; LOAD; ADD avoid round-tripping to memory. It requires tracking, at each program point, how many of the top entries are currently in registers (the "cache state") and generating handler variants per state — more complex, but a real win for arithmetic-heavy code.
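As a rough model of one-entry stack caching, the sketch below keeps the top of stack in a local variable (`tos`) and counts accesses to the in-memory stack. It is an illustration under loud assumptions: a Python local only stands in for a CPU register, the counters ignore reads of locals, the opcodes are hypothetical, and the bytecode is assumed to be already verified (so `ADD` never underflows).

```python
def run_plain(code, local_vars):
    """Baseline: every push and pop touches the in-memory stack."""
    stack, mem_ops = [], 0
    for op, *args in code:
        if op == "LOAD":
            stack.append(local_vars[args[0]])
            mem_ops += 1                      # one push
        elif op == "ADD":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
            mem_ops += 3                      # two pops, one push
    return stack[-1], mem_ops

def run_tos_cached(code, local_vars):
    """Top-of-stack lives in `tos`; only deeper entries live in memory."""
    stack, mem_ops = [], 0
    tos, cached = None, False                 # cache state: 0 or 1 entry held
    for op, *args in code:
        if op == "LOAD":
            if cached:                        # spill the old top to memory
                stack.append(tos)
                mem_ops += 1
            tos, cached = local_vars[args[0]], True
        elif op == "ADD":
            a = stack.pop()                   # left operand comes from memory
            mem_ops += 1
            tos = a + tos                     # right operand was in the cache
    return tos, mem_ops

program = [("LOAD", 0), ("LOAD", 1), ("ADD",), ("LOAD", 2), ("ADD",)]
```

For `program` with locals `[1, 2, 3]`, the baseline performs nine stack-memory operations and the cached version four. Caching two entries would remove the rest for this program, at the price of more cache states and handler variants.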

3. Bytecode verification: safety before execution¶

The JVM is designed to run bytecode it did not produce and does not trust — downloaded applets, plugins, generated code. If a malicious .class could (say) pop from an empty stack, treat an int as a pointer, or jump into the middle of an instruction, it would break memory safety. The verifier prevents this statically, before the method runs, by proving:

Type safety. Every instruction gets operands of the type it expects (iadd sees two ints; you can't astore (store reference) something the verifier knows is an int). The verifier tracks an abstract type for every stack slot and local at every program point.
Stack balance / no under/overflow. The operand stack never underflows, never exceeds max_stack, and has a consistent depth and type signature at every point where control flow merges (e.g. the top of a loop, the join after an if).
Well-formed control flow. Every jump targets the start of a valid instruction within the method; no jumping into the middle of an instruction or out of the method.
Initialization. Objects are initialized (the constructor chain ran) before use; locals are written before read.

Historically the verifier ran an iterative dataflow fixpoint (merge abstract states at join points until stable) — correct but slow. Since Java 6, the compiler emits stack-map frames: a precomputed type snapshot at each branch target. The verifier then checks each instruction once against these snapshots — essentially a single linear pass, much faster. (This is why class files carry StackMapTable attributes.)
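The single-pass check against precomputed frames can be sketched for a tiny JVM-like subset. This is a simplified illustration, not the JVM's algorithm: it tracks only the operand stack (no locals, no initialization state), uses instruction indices as jump targets, and the `stack_maps` dictionary is an assumed stand-in for a StackMapTable.

```python
class VerifyError(Exception):
    """Raised when the toy verifier rejects a method."""

def pop(stack, *expected):
    """Pop the expected types (listed top-first) off the abstract stack."""
    for want in expected:
        if not stack:
            raise VerifyError("stack underflow")
        if stack[-1] != want:
            raise VerifyError(f"expected {want}, found {stack[-1]}")
        stack = stack[:-1]
    return stack

def verify(code, stack_maps, max_stack):
    """One linear pass; stack_maps maps a jump-target index to its stack."""
    def check_target(target, stack):
        if target not in stack_maps or not 0 <= target < len(code):
            raise VerifyError(f"jump to {target} has no stack-map frame")
        if stack_maps[target] != stack:
            raise VerifyError(f"stack {stack} does not match frame at {target}")

    stack, reachable = (), True
    for pc, ins in enumerate(code):
        if pc in stack_maps:
            if reachable and stack != stack_maps[pc]:
                raise VerifyError(f"fall-through mismatch at {pc}")
            stack, reachable = stack_maps[pc], True
        elif not reachable:
            raise VerifyError(f"code at {pc} follows a jump but has no frame")
        op = ins[0]
        if op == "iconst":
            stack = stack + ("int",)
        elif op == "aconst_null":
            stack = stack + ("ref",)
        elif op == "iadd":
            stack = pop(stack, "int", "int") + ("int",)
        elif op == "ifeq":
            stack = pop(stack, "int")
            check_target(ins[1], stack)
        elif op == "goto":
            check_target(ins[1], stack)
            reachable = False
        elif op == "ireturn":
            stack = pop(stack, "int")
            reachable = False
        elif op == "areturn":
            stack = pop(stack, "ref")
            reachable = False
        else:
            raise VerifyError(f"unknown opcode {op}")
        if len(stack) > max_stack:
            raise VerifyError("operand stack exceeds max_stack")
    if reachable:
        raise VerifyError("control falls off the end of the method")

def accepts(code, stack_maps, max_stack=2):
    try:
        verify(code, stack_maps, max_stack)
        return True
    except VerifyError:
        return False

type_confusion = [("iconst", 1), ("areturn",)]
unbalanced = [("iconst", 1), ("ifeq", 3), ("iconst", 2), ("ireturn",)]
balanced = [("iconst", 1), ("ifeq", 4), ("iconst", 2), ("ireturn",),
            ("iconst", 3), ("ireturn",)]
```

Each instruction is visited exactly once. Merge points are never recomputed; the pass only compares the incoming abstract stack with the declared frame, which is what replaces the fixpoint iteration.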

This is the whole reason it's safe to run untrusted JVM bytecode. Native machine code has no such gate — once it runs, it can do anything. Bytecode + verification is a checkable contract.

WebAssembly pushes this further by design. Wasm validation is linear-time, total, and single-pass by construction: structured control flow (no arbitrary gotos), explicit function/type signatures, and a type system simple enough that a streaming validator can accept-or-reject a module in one pass while it downloads. That's a deliberate design goal — Wasm is meant to be validated and compiled fast enough to start before the download finishes. It's the lesson of the JVM verifier, redesigned with validation-cost as a first-class constraint.

4. Linking: symbolic references and lazy resolution¶

A method like obj.foo() compiles to (roughly) invokevirtual #14, where pool entry #14 is a symbolic reference: the strings "class Bar", "method foo", "descriptor ()V". At compile time, javac has no idea where Bar.foo will live in memory, and at run time Bar might not even be loaded yet when the calling class is.

Resolution turns that symbolic reference into a concrete thing (a method-table index, a vtable slot, a direct pointer) and happens lazily: the first time an invokevirtual #14 executes, the VM resolves #14 (loading Bar if needed, checking access, finding the method), then typically caches the result so subsequent executions are fast (inline caches, rewriting the constant-pool entry to a "resolved" form). This lazy linking is why the JVM can start without eagerly loading the transitive closure of every referenced class, and why a NoSuchMethodError can surface at first call rather than at load. CPython does an analogous thing with LOAD_GLOBAL/LOAD_ATTR caching in its adaptive interpreter.
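Resolve-on-first-use with caching can be modelled in a few lines. The sketch below is an illustration only: `SymbolicRef`, `ToyVM`, and `LinkError` are invented names, a dictionary stands in for class files on disk, and the error strings merely echo the JVM error names.

```python
class LinkError(Exception):
    """Raised when a symbolic reference cannot be resolved."""

class SymbolicRef:
    """An unresolved constant-pool entry: names only, no addresses."""
    def __init__(self, class_name, method_name):
        self.class_name = class_name
        self.method_name = method_name

class ToyVM:
    def __init__(self, class_files):
        self.class_files = class_files   # name -> {method name: callable}
        self.loaded = {}                 # classes loaded so far
        self.resolutions = 0             # how often the slow path ran

    def load_class(self, name):
        if name not in self.loaded:
            if name not in self.class_files:
                raise LinkError(f"NoClassDefFoundError: {name}")
            self.loaded[name] = self.class_files[name]
        return self.loaded[name]

    def invoke(self, pool, index, *args):
        entry = pool[index]
        if isinstance(entry, SymbolicRef):          # slow path: first use
            methods = self.load_class(entry.class_name)
            if entry.method_name not in methods:
                raise LinkError(
                    f"NoSuchMethodError: {entry.class_name}.{entry.method_name}")
            # Overwrite the pool entry with its resolved form (the cache).
            entry = pool[index] = methods[entry.method_name]
            self.resolutions += 1
        return entry(*args)                         # fast path afterwards

vm = ToyVM({"Bar": {"foo": lambda: "foo ran"}})
pool = {14: SymbolicRef("Bar", "foo"), 15: SymbolicRef("Bar", "missing")}
```

Creating `vm` and `pool` loads nothing. The first `vm.invoke(pool, 14)` loads `Bar` and resolves the entry, later calls skip resolution, and the broken entry #15 raises only when something actually invokes it.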

5. Bytecode is the handoff point to the JIT¶

A modern high-performance VM does not just interpret. It runs a method interpreted at first (cheap startup, gather profile data: which branches are taken, which types actually occur, which call targets are monomorphic), and once the method or loop is hot, it JIT-compiles the bytecode to native machine code, optimized using the gathered profile (speculative inlining, type specialization, branch reordering). This is tiered compilation; swapping a running interpreted loop for compiled code mid-flight is on-stack replacement (OSR).
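The tier-up decision itself is small. The sketch below is a toy under stated assumptions: `HOT_THRESHOLD` is an arbitrary illustrative value, the opcodes are hypothetical, and "compilation" is a stand-in that translates bytecode into a Python lambda through a symbolic stack instead of emitting machine code. It shows only the counter, the switch of tiers, and that both tiers must agree.

```python
HOT_THRESHOLD = 3   # hypothetical; real VMs use much larger, tuned thresholds

def interpret(code, args):
    """Tier 0: a plain stack interpreter."""
    stack = []
    for op, *operand in code:
        if op == "ARG":
            stack.append(args[operand[0]])
        elif op == "CONST":
            stack.append(operand[0])
        elif op in ("ADD", "MUL"):
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b if op == "ADD" else a * b)
    return stack.pop()

def compile_to_python(code):
    """Stand-in for a JIT: build one expression via a symbolic stack."""
    stack = []
    for op, *operand in code:
        if op == "ARG":
            stack.append(f"args[{operand[0]}]")
        elif op == "CONST":
            stack.append(repr(operand[0]))
        elif op in ("ADD", "MUL"):
            b = stack.pop()
            a = stack.pop()
            stack.append(f"({a} {'+' if op == 'ADD' else '*'} {b})")
    # eval is applied to a string generated here, never to outside input.
    return eval(f"lambda args: {stack.pop()}")

class Method:
    def __init__(self, code):
        self.code = code
        self.calls = 0          # profile counter kept by the interpreter tier
        self.compiled = None    # filled in once the method is hot

    def __call__(self, *args):
        if self.compiled is not None:
            return self.compiled(args)            # compiled tier
        self.calls += 1
        if self.calls >= HOT_THRESHOLD:
            self.compiled = compile_to_python(self.code)
        return interpret(self.code, args)         # interpreter tier

method = Method([("ARG", 0), ("ARG", 1), ("ADD",), ("CONST", 2), ("MUL",)])
results = [method(1, 2) for _ in range(5)]
```

The first three calls run interpreted and the rest use the compiled form. A real VM would also record type and branch profiles during the interpreted calls and could switch tiers in the middle of a loop (OSR), neither of which this sketch attempts.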

Why is bytecode (not source, not a syntax tree) the right input to the JIT?

It's already parsed, name-resolved, and desugared — the JIT starts from simple, regular instructions, not text.
It's compact and uniform — fast to read and rewrite into the JIT's internal IR.
It's verifiable — the JIT can trust structural invariants (stack balance, type consistency) the verifier already established, instead of re-checking them.
It carries profile attachment points — each bytecode offset is a natural place to hang counters and type feedback.
It's language-agnostic — Kotlin, Scala, Clojure all reach the same JIT through the same bytecode, so the optimizer is written once.

We don't build the JIT here (that's the next topic). The key senior takeaway: bytecode's design — small, regular, verified, language-neutral — exists in large part to be an excellent JIT input, not just an interpreter input.

Real-World Analogies¶

1. Dispatch = a switchboard vs. direct extensions. Switch dispatch is an old phone exchange where every call goes back through one operator who then connects you — a bottleneck the operator can't anticipate. Threaded dispatch is direct extensions: each handler already knows where calls usually go next, so it forwards them itself. The branch predictor "learns the office's calling patterns."

2. Verification = a contract review before signing. Before you run untrusted code, the verifier reads the whole contract and proves no clause lets it reach into memory it shouldn't, overflow the stack, or jump somewhere illegal. Native code is a contract with no review — you find out it was malicious by being robbed.

3. Lazy linking = forwarding mail. A .class ships with names ("send to Bar.foo"), not street addresses. The first time you actually mail something there, the post office looks up the address, delivers it, and writes the address on a sticky note so next time is instant.

4. Superinstructions = abbreviations for common phrases. Instead of spelling out "load, load, add" every time, you coin one symbol meaning all three — fewer glances at the page (dispatches) for the same meaning.

5. Stack caching = keeping your current tools on the bench, not in the drawer. The top items you're actively using stay in registers (on the bench); only when you need older items do you go back to memory (the drawer).

Mental Models¶

Model 1: The interpreter's enemy is the branch predictor, not the ALU. Arithmetic in a handler is nearly free; the cost is the mispredicted indirect branch at dispatch. Every dispatch technique is really a branch-prediction optimization. Internalize that and the whole "switch vs threading vs tail-call" landscape makes sense.

Model 2: Verification is type-checking the bytecode as if it were a program in a tiny typed language. The verifier runs an abstract interpretation: it propagates types through the instructions and checks consistency at merges. Stack-map frames are just annotations that let it check one pass instead of iterating to a fixpoint.

Model 3: "Symbolic now, concrete later" is the JVM's linking philosophy. Compile-time keeps everything as names; runtime resolves on first touch and caches. This is what lets independently-compiled classes link without a global link step.

Model 4: Bytecode is a staging area between two compilers. The front-end compiler lowers source → bytecode; the JIT raises bytecode → native. Bytecode is engineered to be a good middle for both — small enough to ship and verify, regular enough to optimize.

Model 5: Interpret first, compile what's hot. Don't JIT everything (compilation isn't free and most code runs rarely). Profile cheaply via the interpreter; spend optimization budget only on hot code. This is the economic logic of every tiered VM.

Code Examples¶

Example 1: Switch dispatch vs computed-goto threading (C)¶

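#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

/* --- Switch dispatch: one shared indirect branch, mispredicts a lot --- */
static void run_switch(const long *pc) {
    long stack[64], *sp = stack;
    for (;;) {
        switch (*pc++) {
            case OP_PUSH:  *sp++ = *pc++;                 break;
            case OP_ADD:   sp[-2] += sp[-1]; sp--;        break;
            case OP_PRINT: printf("%ld\n", sp[-1]);       break;
            case OP_HALT:  return;
        }
    }
}

/* --- Computed-goto (token threading): per-handler branch, predicts better --- */
static void run_threaded(const long *pc) {
    /* Table order must match the enum above. */
    static void *tbl[] = { &&L_PUSH, &&L_ADD, &&L_PRINT, &&L_HALT };
    long stack[64], *sp = stack;
#define NEXT() goto *tbl[*pc++]
    NEXT();
    L_PUSH:  *sp++ = *pc++;            NEXT();
    L_ADD:   sp[-2] += sp[-1]; sp--;   NEXT();
    L_PRINT: printf("%ld\n", sp[-1]);  NEXT();
    L_HALT:  return;
#undef NEXT
}

int main(void) {
    const long prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    run_switch(prog);     /* prints 5 */
    run_threaded(prog);   /* prints 5 */
    return 0;
}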

Same semantics; the second form removes the jump back to the central switch and gives each opcode's NEXT() its own indirect branch, which can be predicted on its own. Speedups of roughly 1.5–2× are often quoted for this change, but the figure depends on the CPU and workload, so measure. (Requires GCC/Clang labels-as-values; MSVC needs a different approach.)

Example 2: A superinstruction¶

/* Profiling shows OP_LOAD immediately followed by OP_LOAD then OP_ADD is hot.
   Fuse them: one fetch, one dispatch, no intermediate stack juggling. */
L_LOAD_LOAD_ADD: {
    long a = locals[pc[0]];
    long b = locals[pc[1]];
    pc += 2;
    *sp++ = a + b;
    NEXT();
}

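The compiler/quickener rewrites the LOAD i; LOAD j; ADD triple into this single opcode when it sees the pattern. The rewrite is only valid when no jump targets the second or third instruction of the triple, because those entry points no longer exist after fusion.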

Example 3: Tail-call threading (modern, portable-ish)¶

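#include <stdint.h>

/* Minimal VM state for this sketch. */
typedef struct VM { const uint8_t *pc; long *sp; } VM;

typedef void (*Handler)(VM *vm);
static Handler handlers[256];           /* filled in at startup, one entry per opcode */

/* Clang spelling; the attribute requires caller and callee to share a signature. */
#define MUSTTAIL __attribute__((musttail))
static inline void dispatch(VM *vm) {
    uint8_t op = *vm->pc++;             /* fetch the token, advance pc */
    MUSTTAIL return handlers[op](vm);   /* becomes a jump, no stack growth */
}

static void op_add(VM *vm) {
    vm->sp[-2] += vm->sp[-1]; vm->sp--; /* pop b, a; push a+b */
    MUSTTAIL return dispatch(vm);
}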

Each handler is its own function (its own register allocation), and musttail guarantees the dispatch becomes a jump. This is the shape CPython 3.14's experimental interpreter uses.

Example 4: Watch the verifier reject bad bytecode (conceptual)¶

; Hand-written invalid JVM bytecode:
   iconst_1        ; push int 1        stack: [int]
   areturn         ; return a REFERENCE
; Verifier: areturn expects a reference on the stack, found int.
;           → VerifyError, method never runs.

; Stack-imbalanced bytecode:
   iconst_1        ; stack: [int]
   ifeq L          ; pops the int → both paths must merge with SAME stack shape
   iconst_2        ; fall-through: stack: [int]
L: ireturn         ; merge point: one path arrives empty, one with [int]
; Verifier: inconsistent stack height/type at L → VerifyError.

The verifier catches both before execution. A toy VM without verification would simply read garbage or crash at runtime.

Example 5: See CPython specialize (adaptive interpreter)¶

import dis

def hot(a, b):
    return a + b

# Run it many times so the adaptive interpreter specializes:
for _ in range(1000):
    hot(1, 2)

dis.dis(hot, adaptive=True)   # 3.11+: may show BINARY_OP specialized to BINARY_OP_ADD_INT

On a sufficiently warmed function, dis(..., adaptive=True) can reveal specialized opcodes (e.g. an int-specialized add). That is CPython's runtime specialization (quickening) at work; it is type-driven rewriting rather than sequence fusion.

Pros & Cons¶

Threaded dispatch (vs switch)

Pros	Cons
Often-quoted 1.5–2.5× interpreter speedup (CPU- and workload-dependent)	Computed goto is GCC/Clang-only (not standard C, not MSVC)
Better branch prediction	Harder to read/debug than a switch
Foundation for superinstructions	Tail-call form depends on compiler tail-call guarantees

Verification

Pros	Cons
Enables safely running untrusted bytecode	Adds load-time cost (mitigated by stack-map frames)
Lets the JIT trust structural invariants	The verifier itself is complex, security-critical code
Catches malformed/malicious classes early	Restricts what valid bytecode can express

JIT (bytecode as input)

Pros	Cons
Near-native speed for hot code	Compilation latency / memory; warmup cost
Profile-guided (better than static AOT in some cases)	Non-deterministic performance; harder to reason about
One JIT serves many languages via shared bytecode	Hard to build; large attack surface

Use Cases¶

Computed-goto / threaded dispatch: CPython, Ruby YARV, Lua, virtually every serious bytecode interpreter where the JIT is absent or hasn't kicked in.
Tail-call threading: CPython 3.14 experimental interpreter; emerging as a portable alternative to computed goto.
Verification: JVM (untrusted applets/plugins historically; today still validates every loaded class), WebAssembly (every module validated before instantiation), .NET (CIL verification, though often skipped for full-trust).
Lazy linking: the JVM's entire class-loading/linking model; lets large apps start without loading everything.
Tiered JIT + OSR: HotSpot (C1/C2), V8 (Ignition interpreter → Sparkplug/Maglev/TurboFan), the modern Wasm engines (baseline + optimizing tiers), CPython's emerging JIT.

Coding Patterns¶

Pattern 1: Choose dispatch by portability requirement¶

Need standard-C portability / MSVC? Use switch and accept the cost. Tail-call threading is an option only where the compiler offers a guaranteed tail call such as musttail, which standard C does not.
GCC/Clang only? Computed goto is the proven default.
Want maximum speed and have a modern toolchain? Tail-call threading is increasingly the winner.

Pattern 2: Quicken (rewrite) opcodes in place¶

Start with generic opcodes; after observing behavior, rewrite the bytecode in memory to specialized/superinstruction forms (with a guard + deopt path back to the generic opcode if assumptions break). This is the adaptive-interpreter pattern and the bridge toward a JIT.
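The guard-plus-deopt shape can be sketched as follows. It is a deliberately small illustration: the opcode names are hypothetical, the bytecode is a mutable list so it can be rewritten in place, and the rewrite happens after a single observation, whereas production interpreters typically use warmup counters before specializing.

```python
def run(code, local_vars):
    """Run `code`, rewriting opcodes in place as types are observed."""
    stack = []
    for ins in code:
        op = ins[0]
        if op == "LOAD":
            stack.append(local_vars[ins[1]])
        elif op == "BINARY_ADD":                      # generic form
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
            if type(a) is int and type(b) is int:
                ins[0] = "BINARY_ADD_INT"             # quicken in place
        elif op == "BINARY_ADD_INT":                  # specialized form
            b = stack.pop()
            a = stack.pop()
            if type(a) is int and type(b) is int:     # guard
                stack.append(a + b)                   # stand-in for a fast int add
            else:
                ins[0] = "BINARY_ADD"                 # deopt: restore generic opcode
                stack.append(a + b)                   # and use generic semantics now
        else:
            raise ValueError(op)
    return stack.pop()

code = [["LOAD", 0], ["LOAD", 1], ["BINARY_ADD"]]
first = run(code, [1, 2])            # observes ints, specializes
after_first = code[2][0]
second = run(code, [1, 2])           # takes the guarded fast path
third = run(code, ["a", "b"])        # guard fails, deopts, still correct
after_third = code[2][0]
```

The point of the sketch is the invariant: whatever opcode is currently installed, every run returns what the generic opcode would have returned.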

Pattern 3: Cache resolved symbolic references¶

On first execution of a call/load-attr site, resolve the symbolic reference and rewrite the site (inline cache / resolved pool entry) so subsequent hits skip resolution. Always keep a slow-path fallback for cache misses.

Pattern 4: Emit stack maps if you verify¶

If your format is verified, precompute a type/depth snapshot at each jump target and ship it (like StackMapTable). It turns verification from an iterative fixpoint into one linear pass.

Best Practices¶

Profile before optimizing dispatch. Confirm the interpreter loop is actually your bottleneck (it usually is for interpret-heavy workloads) before rewriting it into computed goto and losing readability.
Keep a slow, simple reference interpreter. Maintain a switch-based version alongside the threaded one as an oracle for differential testing. Threaded interpreters are easy to get subtly wrong.
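Make specialization reversible. Any quickened opcode that speculates on runtime behavior must have a deopt path: if a guard fails (an int-specialized add sees a float), fall back to the generic opcode. Specialization without deopt is a correctness bug. A plain superinstruction that only fuses a fixed sequence makes no such assumption and needs no guard, but it must never swallow a jump target.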
Treat the verifier as security-critical. If you accept untrusted bytecode, the verifier is your security boundary. Fuzz it. A verifier bug is a sandbox escape.
Design the format to be JIT-friendly from day one. Regular encoding, explicit types or type feedback hooks, stable bytecode offsets for profile attachment. Retrofitting JIT-friendliness onto an ad-hoc format is painful.
Resolve lazily, fail clearly. Lazy linking is great for startup, but surface resolution errors with the call site and what was missing, not an opaque crash.
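The "reference interpreter as oracle" practice above can be sketched as a small differential test. Everything here is illustrative: the two toy interpreters, the opcode set, and the generator are assumptions, and the generator only emits programs that never underflow the stack.

```python
import random

def run_reference(code):
    """Slow, obvious interpreter used as the oracle."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(op)
    return stack

def _push(stack, arg):
    stack.append(arg)

def _add(stack, arg):
    b = stack.pop()
    stack.append(stack.pop() + b)

def _mul(stack, arg):
    b = stack.pop()
    stack.append(stack.pop() * b)

HANDLERS = {"PUSH": _push, "ADD": _add, "MUL": _mul}

def run_optimized(code):
    """Stand-in for the faster interpreter under test (table dispatch)."""
    stack = []
    for op, arg in code:
        HANDLERS[op](stack, arg)
    return stack

def random_program(rng, length):
    """Generate a stack-safe program: binary ops only when depth >= 2."""
    code, depth = [], 0
    for _ in range(length):
        if depth >= 2 and rng.random() < 0.5:
            code.append((rng.choice(["ADD", "MUL"]), None))
            depth -= 1
        else:
            code.append(("PUSH", rng.randint(-9, 9)))
            depth += 1
    return code

def differential_test(trials=200, seed=0):
    """Run both interpreters on random programs; return the trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        program = random_program(rng, rng.randint(1, 20))
        assert run_reference(program) == run_optimized(program), program
    return trials
```

A fixed seed keeps failures reproducible, and the failing program is attached to the assertion so it can be minimized by hand.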

Edge Cases & Pitfalls¶

Computed goto and the pc++ ordering. Off-by-one in goto *tbl[*pc++] (incrementing before vs after the table read) silently corrupts dispatch. Get the fetch/advance order exactly right and test against the switch oracle.
Threaded interpreters defeat some debuggers/profilers. A giant function with computed gotos can confuse stack unwinders and sampling profilers; tail-call threading (separate functions) is friendlier here.
Stack-map frames must be exactly right. A wrong stack-map can make the verifier accept invalid bytecode (security hole) or reject valid bytecode. They're generated by the compiler; bugs there are nasty.
Specialization without proper guards = miscompilation. If BINARY_OP_ADD_INT runs on non-ints because the guard is missing or wrong, you get silent wrong results. The guard is the correctness.
Lazy resolution changes when errors appear. A NoSuchMethodError shows up at first call, possibly deep in a rarely-taken branch in production, not at load time. Different failure timing than AOT-linked languages.
Verification can be (partially) skipped. The JVM verifies less for classes loaded by the bootstrap loader; .NET runs "unverifiable" code in full trust. Skipping verification trades safety for speed/flexibility — know when it's happening.
OSR transition state. Replacing an interpreted frame with a compiled one mid-loop requires reconstructing the compiled frame's exact state from the interpreter's. Bugs here corrupt live variables. (Relevant when you understand JIT, but it's the bytecode/interpreter boundary that makes it tricky.)

Common Mistakes¶

Believing switch dispatch is "fast enough" without measuring. For interpret-bound code, threading is often a large, free win. Conversely, rewriting dispatch when the interpreter isn't the bottleneck is wasted effort — measure.
Thinking verification is optional polish. For untrusted bytecode it's the entire security model. Removing it doesn't just lose a check; it removes the sandbox.
Confusing "the verifier type-checks" with "the language is type-safe." The verifier checks bytecode invariants. A typed source language helps, but the verifier defends the VM regardless of source language.
Assuming the JIT replaces the interpreter. They coexist: interpret cold code (and to gather profiles), JIT hot code, deopt back to the interpreter when speculation fails. The interpreter never goes away.
Adding speculative specialization without a guard and a deopt path. Sooner or later the assumption breaks and the specialized opcode silently computes the wrong result.
Designing an ad-hoc bytecode then bolting on a JIT later. Irregular encodings and missing type feedback make JITing far harder. Plan the format for it.

Test Yourself¶

Why is switch dispatch slow, and what specifically does computed-goto threading improve? Name the hardware mechanism.
Contrast token/direct threading with tail-call threading. What does each need from the compiler?
What is a superinstruction, and what is stack caching? What cost does each attack?
List four things the JVM verifier proves. Why is it the prerequisite for running untrusted bytecode?
What are stack-map frames, and how do they change the verifier's algorithmic cost?
Explain lazy linking: what is a symbolic reference, when is it resolved, and why cache the result?
Give four reasons bytecode (not source, not an AST) is the natural input to a JIT.
What is OSR, and why is the interpreter↔JIT boundary the hard part?
Why must any opcode specialization have a deopt path?

Cheat Sheet¶

DISPATCH (fastest → most portable)
  tail-call threading (musttail; per-handler regalloc)   ~ fastest, portable-ish
  computed goto / token threading (&&label, goto *tbl)   GCC/Clang, 1.5–2.5× over switch
  switch dispatch                                         portable, 1 mispredicted branch
  → the enemy is the BRANCH PREDICTOR, not arithmetic

SUPERINSTRUCTIONS  fuse hot opcode sequences → fewer dispatches
STACK CACHING      keep top-of-stack in registers → fewer loads/stores
ADAPTIVE/QUICKEN   rewrite generic opcode → specialized (with GUARD + DEOPT)

VERIFICATION (JVM, before execution) proves:
  type safety | stack balance & consistent merges | legal jump targets | init-before-use
  STACK-MAP FRAMES → linear single pass (was iterative fixpoint pre-Java6)
  = the security boundary for UNTRUSTED bytecode
  Wasm: validation is linear, total, single-pass BY DESIGN (structured CF, explicit types)

LINKING  symbolic ref (name in pool) --first use--> RESOLVE --> cache (inline cache)
  lazy → fast startup; errors surface at first call, not load

JIT HANDOFF  interpret (profile) → hot → JIT to native (tiered) → OSR mid-loop
  bytecode is ideal JIT input: parsed, compact, verified, profile-hookable, lang-neutral

Summary¶

Dispatch is the interpreter's biggest lever. Switch dispatch funnels every opcode through one shared, poorly predicted indirect branch; computed-goto/token threading and tail-call threading give each opcode its own better-predicted branch, with speedups of 1.5–2.5× often quoted (CPU- and workload-dependent). Superinstructions cut the number of dispatches; stack caching cuts memory traffic per instruction; adaptive specialization rewrites generic opcodes into fast ones (always with a guard and a deopt path).
Verification statically proves bytecode is type-safe, stack-balanced, and well-formed before it runs — the foundation for executing untrusted bytecode. Stack-map frames make it a linear single pass. WebAssembly is designed so validation is linear, total, and single-pass by construction.
Linking keeps references symbolic (names in the constant pool) and resolves them lazily on first use, caching the result — enabling fast startup and independent compilation.
Bytecode is the handoff point to the JIT: a parsed, compact, verified, profile-hookable, language-neutral format that a tiered compiler turns into native code for hot paths (with OSR and deopt back to the interpreter).

professional.md zooms out to system-level concerns: WebAssembly as a deliberately-designed modern bytecode (linear memory, capability-based safety, validation/JIT economics), the BEAM and CPython internals, designing your own production bytecode (opcode budget, evolution, deopt), and the security of running untrusted bytecode at scale.