
Method Dispatch — Middle Level

Table of Contents

  1. Introduction
  2. The Three Call Forms in Assembly
  3. Locating itab.fun[] in Practice
  4. Per-Call Cost Anatomy
  5. Branch Predictor and Polymorphic Sites
  6. Inline Budget and the 80-Node Threshold
  7. Local Devirtualization
  8. Benchmark Methodology
  9. benchstat and Statistical Significance
  10. Common Pitfalls in Dispatch Benchmarks
  11. Type Assertions in Hot Loops
  12. Method Values vs Method Expressions Cost
  13. Reflection-Based Dispatch
  14. Tricky Questions
  15. Cheat Sheet
  16. Summary

Introduction

At the junior level you saw that dynamic dispatch exists and that it is slower. Now we look at how much slower, why, and how to measure it correctly. This file is hands-on: you will read assembly, peek at itab.fun[] from a debugger, and write benchmarks that actually distinguish dispatch cost from loop overhead.


The Three Call Forms in Assembly

Compile the following with go build -gcflags='-S' main.go 2>asm.txt:

type T struct{ n int }
func (t T) Inc() int { return t.n + 1 }

type I interface{ Inc() int }

func staticCall(t T) int    { return t.Inc() }
func dynamicCall(i I) int   { return i.Inc() }
func methodValueCall(t T) int {
    f := t.Inc
    return f()
}

Trimmed assembly (amd64, Go 1.22):

"".staticCall STEXT
    CALL    "".T.Inc(SB)        ; direct, fixed-address CALL

"".dynamicCall STEXT
    MOVQ    24(AX), DX          ; AX = itab; load fun[0]
    CALL    DX                  ; indirect CALL

"".methodValueCall STEXT
    LEAQ    "".T.Inc-fm(SB), AX
    ; ... allocate closure on heap or stack ...
    MOVQ    0(AX), DX           ; load fn pointer from closure
    CALL    DX                  ; indirect CALL

Three observations:

  1. The static form is a single CALL to a known label.
  2. The dynamic form must first load fun[i] through the itab pointer (when the interface value lives in memory rather than registers, that is two dependent loads: the itab word, then fun[i]).
  3. The method-value form often also allocates a closure when t.Inc escapes.


Locating itab.fun[] in Practice

The runtime stores itabs in a hash table keyed by (interface type, concrete type). The relevant declarations are in runtime/iface.go and runtime/runtime2.go:

// from runtime2.go
type itab struct {
    inter *interfacetype
    _type *_type
    hash  uint32
    _     [4]byte
    fun   [1]uintptr // variable-sized; one slot per interface method
}

You can confirm this yourself with go tool objdump:

go tool objdump -s "main.dynamicCall" myprog | head -20

Look for MOVQ N(AX), DX where N is 24 on 64-bit (offset of fun in itab). That N corresponds to fun[0]; fun[1] would be at 32, etc.

Inspecting under Delve

(dlv) b main.dynamicCall
(dlv) c
(dlv) regs
(dlv) print *(*[4]uint64)(unsafe.Pointer(uintptr(rax)))

The fourth word is fun[0] — the function pointer the runtime will jump to (the first three words are inter, _type, and hash plus padding).


Per-Call Cost Anatomy

A single dynamic call decomposes into:

Step                                     Approximate cost
─────────────────────────────────────────────────────────
Load itab from the interface word        <1 ns (often L1)
Load fun[i] from the itab                <1 ns (often L1)
Indirect CALL instruction                0.3-0.5 ns plus predictor
Predictor mispredict (if target varies)  10-20 ns penalty
icache miss when target body is "cold"   5-50 ns

Steady-state, monomorphic call sites hit the predictor's BTB and look like a constant ~1.5 ns. Polymorphic call sites can degrade to 5-10 ns.


Branch Predictor and Polymorphic Sites

Modern CPUs have a Branch Target Buffer (BTB) that learns the targets of indirect calls. If a single call site always invokes the same concrete method (monomorphic), the BTB caches it and the branch is "free."

// MONOMORPHIC — predictor wins
var w io.Writer = os.Stdout
for i := 0; i < N; i++ {
    w.Write(buf)            // always (*os.File).Write
}

// POLYMORPHIC — predictor misses
ws := []io.Writer{os.Stdout, os.Stderr, &bytes.Buffer{}}
for _, w := range ws {
    w.Write(buf)            // alternates
}

A simple benchmark:

func BenchmarkMonomorphic(b *testing.B) {
    var w io.Writer = io.Discard
    for i := 0; i < b.N; i++ {
        w.Write(nil)
    }
}

func BenchmarkPolymorphic(b *testing.B) {
    ws := []io.Writer{io.Discard, &bytes.Buffer{}}
    for i := 0; i < b.N; i++ {
        ws[i&1].Write(nil)
    }
}

Typical results show the polymorphic variant ~2-3x slower, dominated by mispredicts.


Inline Budget and the 80-Node Threshold

The inliner uses a "node budget" measured in AST nodes. Each statement, expression, or call counts. Historically the budget was 80; Go 1.22 kept ~80 with some refinements (see cmd/compile/internal/inline/inl.go).

A method whose body exceeds the budget is never inlined, even when statically dispatched.

// Likely inlined — small body
func (p Point) X() int { return p.x }

// Likely NOT inlined — too many nodes
func (p Point) Distance(q Point) float64 {
    dx := p.x - q.x
    dy := p.y - q.y
    return math.Sqrt(float64(dx*dx + dy*dy))
}

Run go build -gcflags='-m=2' and look for messages like:

./point.go:9:6: cannot inline (Point).Distance: function too complex: cost 124 exceeds budget 80

The cost listed is the inliner's estimate; tune your code if you want the inline.

Forcing or blocking inlining

You cannot force inlining. You can block it with //go:noinline:

//go:noinline
func (p Point) X() int { return p.x }

Useful in benchmarks to make sure both static and dynamic variants pay a real call cost.


Local Devirtualization

Even without PGO, the compiler can devirtualize when the concrete type flows from a local variable:

func run() int {
    var w io.Writer
    w = bytes.NewBuffer(nil)  // exact concrete type known here
    return w.(*bytes.Buffer).Len() // forced concrete via assertion
}

A more interesting case is when an interface assignment dominates the call:

func process() {
    var s Speaker = Greeter{}
    s.Hello() // compiler may statically rewrite this to Greeter.Hello
}

This pattern is observable through -gcflags='-m=2' as a message like devirtualizing s.Hello to Greeter. Without PGO, devirtualization is conservative — it gives up across function boundaries.


Benchmark Methodology

Dispatch benchmarks are tricky. The compiler is aggressive at deleting "useless" code, so a naive loop measures nothing:

// BAD: result discarded; compiler may delete the call
func BenchmarkBad(b *testing.B) {
    var s Speaker = Greeter{}
    for i := 0; i < b.N; i++ {
        s.Hello()
    }
}

Always consume the result and disable inlining of the boundary if needed:

var sink string

//go:noinline
func consume(s string) { sink = s }

func BenchmarkGood(b *testing.B) {
    var s Speaker = Greeter{}
    for i := 0; i < b.N; i++ {
        consume(s.Hello())
    }
}

Another guard is runtime.KeepAlive for non-string values, or assigning into a package-level var sink T.

Reset timer for setup-heavy benchmarks

func BenchmarkX(b *testing.B) {
    data := buildLargeFixture()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = process(data)
    }
}

benchstat and Statistical Significance

Run the benchmark several times and compare with benchstat:

go test -bench=. -count=10 -benchmem | tee old.txt
# (make change)
go test -bench=. -count=10 -benchmem | tee new.txt

benchstat old.txt new.txt
# name             old time/op    new time/op    delta
# Static-8         0.30 ns ± 2%   0.30 ns ± 1%   ~     (p=0.871 n=10+10)
# Dynamic-8        2.05 ns ± 3%   1.42 ns ± 2%   -30.7% (p=0.000 n=10+10)

p is the p-value; below 0.05 means the change is statistically significant.


Common Pitfalls in Dispatch Benchmarks

  1. Letting the compiler eliminate the call. Always consume the result.
  2. Confusing inlining with dispatch. A static call that gets inlined looks free; mark it //go:noinline if you want to compare call costs specifically.
  3. Not warming up. Early iterations pay icache and BTB misses. The testing framework re-runs the benchmark body with growing b.N, which amortizes warm-up at large N, but very short -benchtime runs can still be skewed by it.
  4. Mixing value and pointer receivers in the same benchmark. Calc{} and &Calc{} are distinct concrete types with distinct itabs, so a single call site that sees both becomes polymorphic.
  5. Comparing across machines. Cache sizes, predictor depth, and clock speed differ. Always benchmark on the deployment-target hardware.

Type Assertions in Hot Loops

// SLOW — assertion every iteration
for _, x := range items {
    if g, ok := x.(Greeter); ok {
        g.Hello()
    }
}

// FASTER: assert once, outside the hot loop
gs := make([]Greeter, 0, len(items))
for _, x := range items {
    if g, ok := x.(Greeter); ok {
        gs = append(gs, g)
    }
}
for _, g := range gs {
    g.Hello() // still dynamic, but no per-iteration assertion
}

The assertion itself is not free: for a concrete target type it is roughly one type-pointer comparison plus a branch, and an interface-to-interface assertion like the one above is a (cached) itab lookup. Hoisting removes that cost from the hot loop. Note that the calls through gs are still dynamic dispatch; to make them static, assert to the concrete type instead of to another interface.


Method Values vs Method Expressions Cost

A method value captures the receiver in a closure. If the closure escapes, that's a heap allocation:

fn := obj.Handle      // method value: a closure capturing obj
go fn()               // the closure escapes; heap alloc

A method expression does not capture; the receiver is passed as the first argument:

fn := (*Service).Handle
fn(obj, req)          // no closure; no escape

Benchmark:

type S struct{ n int }
func (s *S) Tick() { s.n++ }

var fnSink func()

func BenchmarkValue(b *testing.B) {
    s := &S{}
    for i := 0; i < b.N; i++ {
        fnSink = s.Tick // closure escapes via fnSink → heap alloc
        fnSink()
    }
}

func BenchmarkExpr(b *testing.B) {
    s := &S{}
    f := (*S).Tick      // hoisted out of loop
    for i := 0; i < b.N; i++ {
        f(s)
    }
}

-benchmem will reveal the allocation rate of the value form (the closure must escape to allocate; a method value created and called entirely locally may be stack-allocated). The expression form is allocation-free.


Reflection-Based Dispatch

v := reflect.ValueOf(obj)
m := v.MethodByName("Handle")
m.Call([]reflect.Value{reflect.ValueOf(req)})

Cost: ~100-500 ns per call, plus allocations. Reflection-based dispatch is two orders of magnitude slower than interface dispatch and three orders slower than static dispatch. Use it only outside hot paths (config, RPC plumbing, codecs that build a static plan once).


Tricky Questions

Q1: Why is an interface call slower even when the target stays the same? Because the compiler cannot prove that statically. Without proof, it must emit the indirect-call sequence. The CPU's BTB makes the actual run-time cost low for monomorphic sites, but the compiler still cannot inline through the call.

Q2: Does pointer vs value receiver change dispatch cost? Slightly. Pointer-receiver methods take one fewer copy; value-receiver methods on small types may be faster because the receiver lives in a register. The dispatch mechanism is the same.

Q3: Does Go ever cache itabs across runs? No. Itabs for conversions the compiler can see are emitted into the binary; the rest are built lazily at run time, cached in a hash table, and kept for the duration of the process. They are not serialized to disk.

Q4: Why does my benchmark show 0.30 ns/op for a "method call"? The compiler inlined it. Add //go:noinline to measure raw call cost, or accept that inlining is real and benchmark the realistic call site instead.

Q5: Can go test -bench measure CPU branch mispredicts? Not directly. Use perf stat -e branch-misses ./bench.test -test.bench=... on Linux to get real hardware counters.


Cheat Sheet

ASSEMBLY MARKERS
─────────────────────────────────
CALL pkg.X(SB)        static call (label = X)
MOVQ N(AX), DX        load itab.fun[(N-24)/8] (24 = fun[0], 32 = fun[1])
CALL DX               indirect call
LEAQ pkg.X-fm(SB), AX method value formation

INLINER COST ESTIMATE
─────────────────────────────────
budget ~80 nodes (Go 1.22)
each statement ~1 node
each call ~57 nodes (huge)
runtime calls usually disqualify

DISPATCH COSTS (steady state)
─────────────────────────────────
inlined static    ~0 ns
non-inlined static ~0.5-1 ns
monomorphic dynamic ~1.5-2 ns
polymorphic dynamic ~3-8 ns
reflect.Value.Call ~100-500 ns

BENCHMARK CHECKLIST
─────────────────────────────────
[ ] consume the result (sink, KeepAlive)
[ ] use -count=10 + benchstat
[ ] disable inlining if measuring call itself
[ ] reset timer after setup

Summary

Middle-level method-dispatch knowledge is largely about seeing what the compiler did and writing benchmarks that don't lie. You should:

  • Know that an interface call decomposes into two memory loads plus an indirect CALL.
  • Be able to point at itab.fun[i] in assembly or in a debugger.
  • Understand the BTB's role in monomorphic vs polymorphic call-site behavior.
  • Use //go:noinline and a sink variable to write honest benchmarks.
  • Hoist type assertions out of hot loops and prefer method expressions to method values when allocations matter.

The senior file dives deeper: how the compiler decides to devirtualize, what PGO actually changes in the binary, and how generics dispatch interacts with the inliner.