Functional Options — Under the Hood

1. The runtime framing

Junior and middle taught what the pattern is. This file is about what the compiler and runtime actually do when you write it. Every WithX(arg) call is a function value being constructed on the fly — a closure that captures arg. Every for _, opt := range opts { opt(s) } is an indirect call through that function value. Every []Option is a slice of these function values. None of that is free; none of it is unreasonably expensive either. The point of this file is to be precise.

We work in Go 1.22 / amd64 unless otherwise noted. References to the standard library are against go1.22.x source, paths like src/runtime/runtime2.go and src/cmd/compile/internal/ssagen/ssa.go.

The questions we answer:

What is the in-memory layout of an Option value?
When does WithX(arg) heap-allocate the closure and when does it stack-allocate?
Why doesn't the loop body opt(s) inline?
What does the assembly for the apply-loop look like?
Does NewServer(":8080", WithLogger(l)).Start() allocate the Server on the heap or the stack?
What's the actual cost of the function vs interface variant, instruction by instruction?

2. Table of Contents

The runtime framing
Table of Contents
How a function value is represented
The funcval struct and closure layout
WithX call site — escape analysis walkthrough
The apply loop in assembly
Why options don't inline
The slice of options in memory
Interface variant under the hood
Escape analysis of NewServer chains
GOSSAFUNC walkthrough
Allocation count, byte-by-byte
Cross-language comparison at the machine level
Edge cases at the lowest level
Test
Tricky questions
Summary
Further reading

3. How a function value is represented

In Go, a value of type func(...) is a single word — a pointer. Specifically, it points to a runtime.funcval struct. The struct's first field is the entry-point PC; any words following it are the captured variables for that closure. This is defined in src/runtime/runtime2.go:

// runtime2.go (paraphrased; the comment is from the Go source)
type funcval struct {
    fn uintptr
    // variable-sized, fn-specific data here
}

So when you write:

var opt Option = WithLogger(myLogger)

opt is one machine word. That word is the address of a funcval. The funcval's first 8 bytes (on amd64) are the entry PC of the anonymous function func(s *Server) { s.logger = myLogger }. The bytes after that are the captured environment — in this case, the captured *log.Logger value (one pointer).

opt (one word, 8 bytes on amd64)
 │
 └──> funcval @ heap-allocated address
      ┌──────────────────────────────────┐
      │ fn (8 bytes): entry PC of closure│
      ├──────────────────────────────────┤
      │ captured: *log.Logger (8 bytes)  │
      └──────────────────────────────────┘

Calling opt(s) is not a simple direct jump. On amd64 the Go calling convention puts the funcval pointer into the DX register (the "closure register"), loads the entry PC from [DX] (the first word of the funcval), and calls that address. The closure body then reads its captures off [DX+8], [DX+16], etc.

This is why the apply-loop produces CALL instructions through a register, not direct calls. We'll see it in §6.

Method values are the same shape. s.Start (where Start has receiver *Server) is a funcval whose capture word is the receiver pointer. This is why method values cost an allocation when they escape — they're closures.

flowchart LR
    A["opt: Option (1 word)"] --> B["funcval"]
    B --> C["fn: entry PC"]
    B --> D["capture[0]: *log.Logger"]
    B --> E["capture[1]: ... (if more)"]
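The one-word claim is easy to check from ordinary Go code. The sketch below is only an illustration: WithName and the name field are hypothetical and exist just to give the closure something to capture. It compares the size of a function value with the size of a pointer, which should match however large the captured data is, because the captures live behind the pointer.

```go
package main

import (
	"fmt"
	"unsafe"
)

type Server struct{ name string }

// Option is the function-variant option type used throughout this file.
type Option func(*Server)

// WithName is a hypothetical option used only for this illustration.
func WithName(n string) Option {
	return func(s *Server) { s.name = n }
}

// optionIsOneWord reports whether a function value occupies exactly one
// pointer-sized word, even though its closure captures a two-word string.
func optionIsOneWord() bool {
	opt := WithName("a fairly long captured string")
	return unsafe.Sizeof(opt) == unsafe.Sizeof(uintptr(0))
}

// applyName builds the option and calls it through the function value.
func applyName(n string) string {
	s := &Server{}
	opt := WithName(n)
	opt(s) // indirect call through the funcval pointer
	return s.name
}

func main() {
	fmt.Println(optionIsOneWord(), applyName("api"))
}
```

unsafe.Sizeof only measures the function value itself; it says nothing about the size of the funcval it points to, which is what the later sections account for.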

4. The funcval struct and closure layout

The compiler's representation lives in src/cmd/compile/internal/ir/func.go and the closure conversion pass is in src/cmd/compile/internal/walk/closure.go. The thing to internalise is that a closure literal:

func(s *Server) { s.readTimeout = d }

…with d captured by value compiles to a hidden type:

// Compiler-synthesized closure environment for WithReadTimeout's lambda
type closureWithReadTimeout struct {
    fn uintptr           // entry PC of the lambda
    d  time.Duration     // captured value
}

When WithReadTimeout(5*time.Second) runs:

The compiler emits a call to runtime.newobject (when the closure escapes) — or reserves stack space (when it doesn't).
It writes entryPC(closure) into [obj+0].
It writes d into [obj+8].
It returns the pointer-to-obj as the Option value.

You can see this materialise in -gcflags="-m":

// withreadtimeout.go
package srv

import "time"

type Server struct{ readTimeout time.Duration }
type Option func(*Server)

func WithReadTimeout(d time.Duration) Option {
    return func(s *Server) { s.readTimeout = d }
}

$ go build -gcflags="-m" withreadtimeout.go
./withreadtimeout.go:8:6: can inline WithReadTimeout
./withreadtimeout.go:9:9: can inline WithReadTimeout.func1
./withreadtimeout.go:9:9: func literal escapes to heap

That last line — func literal escapes to heap — is the closure allocation. The closure environment cannot live on the stack of WithReadTimeout because it survives the return: the caller holds the Option and applies it later.

If you somehow inline the whole construction so the closure never outlives its enclosing frame, it stack-allocates. We come back to this in §10 with NewServer(...).Start().

4.1 Captures by value vs by reference

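Go closures capture variables, not values. When the closure and the enclosing scope must observe the same variable, the compiler moves that variable to the heap (heap promotion) and the closure holds its address. When the variable is never reassigned after it is captured, never has its address taken, and is small (the compiler's cut-off is 128 bytes), sharing is unobservable, so the compiler copies the value into the closure environment instead.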

func WithLogger(l *log.Logger) Option {
    return func(s *Server) { s.logger = l }
}

Here l is local to WithLogger, never reassigned, never address-taken outside the closure. The compiler captures it by value — one pointer-sized word inside the closure environment. No second allocation for a heap-promoted l.

Contrast:

func WithCounter() Option {
    n := 0
    return func(s *Server) { n++; s.seq = n }
}

Now n is mutable, shared between the closure and… well, only the closure here, but if you returned both n and the closure they would share it. The compiler heap-promotes n and the closure captures &n:

funcval
 ├── fn: entry PC
 └── *int (pointer to heap-promoted n)

For options, you almost never capture mutable state. Captures are arguments that the caller already constructed.

5. WithX call site — escape analysis walkthrough

Take a small program and run -gcflags="-m -m" on it:

// example.go
package main

import "time"

type Server struct {
    addr        string
    readTimeout time.Duration
}

type Option func(*Server)

func WithReadTimeout(d time.Duration) Option {
    return func(s *Server) { s.readTimeout = d }
}

func NewServer(addr string, opts ...Option) *Server {
    s := &Server{addr: addr, readTimeout: 30 * time.Second}
    for _, opt := range opts {
        opt(s)
    }
    return s
}

func main() {
    s := NewServer(":8080", WithReadTimeout(5*time.Second))
    _ = s
}

Compile with full escape annotations:

$ go build -gcflags="-m -m" example.go 2>&1 | grep -E "escape|inline|allocate"
./example.go:12:6: can inline WithReadTimeout
./example.go:13:9: can inline WithReadTimeout.func1
./example.go:16:6: cannot inline NewServer: function too complex: cost 99 exceeds budget 80
./example.go:24:6: can inline main
./example.go:13:9: func literal escapes to heap
./example.go:16:35: opts does not escape
./example.go:17:7: &Server{...} escapes to heap
./example.go:25:25: ... argument does not escape
./example.go:25:42: time.Duration(5e9) does not escape

Reading the key lines:

func literal escapes to heap — the closure returned by WithReadTimeout heap-allocates. Mandatory: a return value can't live on the callee's stack.
opts does not escape — the variadic opts ...Option is a slice that is consumed inside NewServer and never stored. Its backing array can live on main's stack.
&Server{...} escapes to heap — even though s is constructed locally and only the pointer is returned, the address escapes (it's returned to main), so the Server lives on the heap.
cannot inline NewServer: function too complex — the loop pushes NewServer over the inlining budget. This is the source of the next two limitations: the Server allocation can't be folded into main's frame, and the loop body's indirect call can't be devirtualised.

The annotated counts you get with -gcflags="-m -m" are the compiler's opinions. They reflect the heuristics in src/cmd/compile/internal/inline/inl.go and src/cmd/compile/internal/escape/escape.go. The numbers shift between Go versions.

5.1 What forces the closure to escape

A closure escapes to the heap when:

It is returned from its enclosing function, OR
It is stored in a heap-resident location (a struct field, a global, an interface), OR
It is passed to a function that the escape analyser cannot prove keeps it bounded.

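For WithReadTimeout compiled as a standalone function, condition 1 is decisive: the closure must outlive WithReadTimeout's frame because the caller holds it. The only way around the allocation is the case §4 mentions, where the constructor inlines into a caller in which the closure provably never escapes. Read the -gcflags="-m" diagnostics for the call site itself rather than assuming either outcome.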

5.2 What stops the variadic slice from escaping

The opts ...Option parameter is shorthand for opts []Option. When the caller writes:

NewServer(":8080", WithReadTimeout(5*time.Second))

…the compiler synthesises a small slice literal at the call site. It looks roughly like:

__tmp := [1]Option{WithReadTimeout(5*time.Second)}
NewServer(":8080", __tmp[:])

The backing array __tmp is a local in main's frame. If NewServer doesn't keep a reference to opts past the call (it doesn't — the loop reads it and discards it), the array stays on main's stack. The compiler proves this by tracing the uses of opts inside NewServer:

for _, opt := range opts { opt(s) } — read-only iteration. The element values (function pointers) are copied to local opt. No backing-array reference leaks.
No s.options = opts, no append(s.options, opts...), etc.

If you wrote s.options = opts inside NewServer, the variadic slice would escape and main would allocate the array on the heap.
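A rough way to see the effect of retention is to count allocations with testing.AllocsPerRun. The sketch below is an illustration only: newPlain and newRetaining are hypothetical stand-ins for the two shapes of NewServer discussed above, both marked noinline so the comparison is about escape analysis rather than inlining. The absolute counts depend on the toolchain, so the program prints them instead of promising specific numbers; the retaining variant should allocate more, because the variadic backing array now has to live on the heap.

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

type Server struct {
	readTimeout time.Duration
	options     []Option // only used by newRetaining
}

type Option func(*Server)

func WithReadTimeout(d time.Duration) Option {
	return func(s *Server) { s.readTimeout = d }
}

// newPlain reads opts and drops it, like NewServer in the text.
//
//go:noinline
func newPlain(opts ...Option) *Server {
	s := &Server{}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// newRetaining is a hypothetical variant that stores the variadic slice.
//
//go:noinline
func newRetaining(opts ...Option) *Server {
	s := &Server{options: opts}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

var sink *Server // keeps the Server itself on the heap in both cases

// measure returns average allocations per constructor call for each variant.
func measure() (plain, retaining float64) {
	d := 5 * time.Second
	plain = testing.AllocsPerRun(1000, func() { sink = newPlain(WithReadTimeout(d)) })
	retaining = testing.AllocsPerRun(1000, func() { sink = newRetaining(WithReadTimeout(d)) })
	return plain, retaining
}

func main() {
	p, r := measure()
	fmt.Printf("plain: %.0f allocs/call, retaining: %.0f allocs/call\n", p, r)
}
```

For the authoritative answer on a given build, pair this with the -gcflags="-m" diagnostics for the call site.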

6. The apply loop in assembly

Compile the same example.go and disassemble NewServer:

$ go build -gcflags="-S" -o /dev/null example.go 2>&1 | grep -A 60 '^main\.NewServer STEXT'

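The interesting region, hand-simplified for reading (specific PCs omitted, comments added, write barriers and stack-growth checks dropped). The listing keeps the older stack-slot notation for arguments because it is easier to follow. Since Go 1.17 the amd64 compiler passes arguments in registers, and since Go 1.20 symbols print as main.NewServer rather than "".NewServer, so real Go 1.22 output differs in those details: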

"".NewServer STEXT size=176 args=0x30 locals=0x28
    SUBQ    $40, SP
    MOVQ    BP, 32(SP)
    LEAQ    32(SP), BP

    ; --- allocate *Server ---
    LEAQ    type:main.Server(SB), AX       ; type descriptor in AX
    CALL    runtime.newobject(SB)          ; AX <- *Server
    MOVQ    AX, "".s+24(SP)                ; save *Server in local

    ; --- write addr field (string is {data,len}) ---
    MOVQ    "".addr+48(SP), CX             ; addr.data
    MOVQ    "".addr+56(SP), DX             ; addr.len
    MOVQ    CX, (AX)
    MOVQ    DX, 8(AX)

    ; --- write readTimeout = 30s (constant 30e9) ---
    MOVQ    $30000000000, 16(AX)

    ; --- range over opts ---
    MOVQ    "".opts+64(SP), BX             ; opts.data (pointer to the backing array)
    MOVQ    "".opts+72(SP), CX             ; opts.len
    XORL    SI, SI                         ; i = 0
loop:
    CMPQ    SI, CX
    JGE     done

    MOVQ    (BX)(SI*8), DX                 ; DX = opts[i] (funcval pointer)
                                            ; DX is the closure register on amd64
    MOVQ    "".s+24(SP), AX                ; AX = &Server (argument)
    MOVQ    (DX), R12                      ; R12 = funcval.fn (entry PC)
    CALL    R12                            ; indirect call
                                            ; callee reads captures from [DX+8...]

    INCQ    SI
    JMP     loop
done:
    MOVQ    "".s+24(SP), AX
    MOVQ    AX, "".~r0+80(SP)              ; return *Server
    MOVQ    32(SP), BP
    ADDQ    $40, SP
    RET

The four lines that are the pattern:

MOVQ    (BX)(SI*8), DX     ; load funcval pointer from slice
MOVQ    "".s+24(SP), AX    ; load *Server argument
MOVQ    (DX), R12          ; dereference funcval to get entry PC
CALL    R12                ; indirect call

A few things to notice:

DX is the closure register. The Go calling convention reserves DX to point at the funcval for closure calls. The callee's prologue reads captures using [DX+8], [DX+16], etc. — without DX, the callee has no way to find its captures.
The CALL is indirect. CALL R12 is fundamentally different from a direct CALL "".someFunc(SB). The branch predictor needs an Indirect Branch Target Buffer (IBTB) entry per call site to predict the target. For a single hot call site that always calls the same closure (e.g., a constant Combine option in a tight loop), prediction is perfect after the first miss. For an apply-loop that runs once per constructor call, there's a small cold cost.
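Loop state is cheap to carry, but it is not free of spills. Go's register ABI has no callee-saved general-purpose registers, so anything that must survive the CALL (s, the slice's data pointer, the index and the length) is kept in stack slots and reloaded afterwards. The reload of s from "".s+24(SP) in the listing is one such case; real compiler output does the same for the other three, which the simplified listing leaves in BX, CX and SI. That is still only a handful of instructions around the call. The runtime cost of the loop framework itself is irrelevant; the cost is the work inside opt(s).
No bounds check on opts[i]. The compiler generates the index itself when it lowers a range loop, and the loop guard CMPQ SI, CX / JGE done already establishes i < len(opts), so no runtime.panicIndex call appears.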

The whole apply-loop, per iteration: ~6 instructions + the body of the closure + return. On a modern x86, that's roughly 3-5 ns per option not counting the closure body itself.

7. Why options don't inline

The Go inliner is conservative. It will inline a function call when:

The callee's body cost (a heuristic in src/cmd/compile/internal/inline/inl.go) is below the budget (~80 nodes).
The call is direct — i.e., the callee is statically known at compile time.

For the apply-loop, the first condition would be met by every lambda in this file, but the second fails:

for _, opt := range opts {
    opt(s)            // INDIRECT call through funcval
}

opt is a function value loaded from a slice element. The compiler doesn't know at this site whether opt was constructed from WithLogger or WithTimeout or something else entirely. It cannot inline the body of opt because it doesn't know which body to inline. So the call survives as an indirect CALL R12.

Even with profile-guided optimization (PGO, Go 1.21+), the most the compiler can do is devirtualise in some cases — recognise that the indirect call site is almost always one specific target and replace it with if opt == knownTarget { directCall() } else { indirectCall() }. PGO devirtualisation is implemented for interface calls; for closure-typed call sites it is more limited because the closure environment is unique per allocation.

The body of WithReadTimeout's lambda is tiny: load the captured d, store it into s.readTimeout. It would be a textbook inlining candidate if the call were direct. Because of the slice, it isn't.

7.1 What you can inline

The WithX constructors themselves can inline, because they are direct calls:

./example.go:12:6: can inline WithReadTimeout
./example.go:13:9: can inline WithReadTimeout.func1

When main calls WithReadTimeout(5*time.Second), the body of WithReadTimeout (return a closure capturing d) gets inlined into main. That doesn't eliminate the closure allocation — the closure itself must still escape — but it removes the WithReadTimeout stack frame.

The lambda WithReadTimeout.func1 is also "can inline", which sounds promising. The annotation means if a direct call to func1 ever appears, it can be inlined. The reality at the call site (opt(s) through a slice) is indirect, so the inline never fires.

7.2 The lifted cost

The net of "no inlining" for the apply-loop body is:

One indirect call per option (~3-5 ns).
One closure-environment load per option (the captured variable, e.g., 8 bytes of *log.Logger).
One field store per option (in the body of the lambda).

For 5 options this is ~25 ns, plus the closure allocations done at the call site (~16-32 bytes each, on the order of ~10 ns each in runtime.newobject). In a constructor called once per Server, totally invisible. In a per-request constructor, you start paying attention.

8. The slice of options in memory

opts ...Option is a slice. A Go slice header is three words (data, len, cap) — 24 bytes on amd64:

opts (24 bytes on the stack)
┌────────────┬────────────┬────────────┐
│ data ptr   │ len (int)  │ cap (int)  │  ← slice header
└────────────┴────────────┴────────────┘
      │
      ▼
backing array [N]Option (each element 8 bytes — a funcval pointer)
┌──────────┬──────────┬──────────┬──────────┐
│ funcval* │ funcval* │ funcval* │ funcval* │
└──────────┴──────────┴──────────┴──────────┘
     │         │         │         │
     ▼         ▼         ▼         ▼
   funcval   funcval   funcval   funcval
   ┌────┐   ┌────┐    ┌────┐    ┌────┐
   │ fn │   │ fn │    │ fn │    │ fn │
   │cap0│   │cap0│    │cap0│    │cap0│
   │... │   │... │    │... │    │... │
   └────┘   └────┘    └────┘    └────┘

Each Option slot in the array is 8 bytes (one funcval pointer). The funcvals themselves are usually heap-allocated and live elsewhere; the slice points at them indirectly.

For NewServer(":8080", WithA(), WithB(), WithC()):

24 bytes of slice header (on the caller's stack — doesn't escape).
24 bytes of backing array (3 × 8-byte funcval pointers).
3 × (sizeof(funcval) + sizeof(captures)) bytes of closure environments on the heap. For an option capturing a single time.Duration, that's sizeof(funcval) + 8 = 16 bytes per closure, which is exactly the 16-byte GC size class.

Total heap allocation for three single-capture options: 3 × 16 = 48 bytes plus three GC-tracked objects. The slice itself is stack-resident.

If you build the slice imperatively at the caller:

opts := make([]Option, 0, 3)
opts = append(opts, WithA())
opts = append(opts, WithB())
opts = append(opts, WithC())
NewServer(":8080", opts...)

…then make([]Option, 0, 3) will probably stack-allocate the backing array of 3 × 8 = 24 bytes if the escape analyser can prove opts doesn't escape NewServer. It usually can: the opts... spread is passed directly, NewServer doesn't retain a reference. Confirm with -gcflags="-m".

If you store the slice for reuse:

var prodOpts = []Option{WithA(), WithB(), WithC()}
NewServer(":8080", prodOpts...)

…then prodOpts's backing array lives in the BSS or in a heap-allocated initialised data block. The three funcvals are heap-resident and live for the program lifetime. No per-call allocation for the options — just the call itself.

9. Interface variant under the hood

The interface variant in middle §3.2:

type Option interface{ apply(*Server) }
type loggerOption struct{ l *log.Logger }
func (o loggerOption) apply(s *Server) { s.logger = o.l }
func WithLogger(l *log.Logger) Option { return loggerOption{l: l} }

An interface value in Go is two words (16 bytes on amd64). The first word is a pointer to the itab (interface table) for the (concrete type, interface) pair; the second word is the data pointer.

Option (interface, 16 bytes)
┌────────────────┬────────────────┐
│ itab ptr       │ data ptr       │
└────────────────┴────────────────┘
       │                 │
       ▼                 ▼
    itab for          loggerOption{l: ...}
   (loggerOption,        ┌────────┐
    Option)              │ l      │ ← *log.Logger
                         └────────┘

The itab itself is described in src/runtime/runtime2.go:

// runtime2.go
type itab struct {
    inter *interfacetype
    _type *_type
    hash  uint32
    _     [4]byte
    fun   [1]uintptr // variable sized; method PCs for the interface methods
}

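itab.fun[0] is the entry PC for apply (and so on for any further interface methods). For a conversion whose concrete type is known at compile time, such as loggerOption to Option here, the compiler emits the itab as static data and the runtime registers it in its itab hash table (itabTable in iface.go) at startup. The runtime builds itabs lazily only for pairs it first meets dynamically, for example through a type assertion from one interface type to another, and caches those in the same table.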

flowchart LR
    A["opt: Option (16 bytes)"] --> B["itab ptr"]
    A --> C["data ptr"]
    B --> D["itab"]
    D --> E["inter: *interfacetype"]
    D --> F["_type: *_type (loggerOption)"]
    D --> G["fun[0]: PC of (loggerOption).apply"]
    C --> H["loggerOption{l: *log.Logger}"]

The apply loop becomes:

; for _, opt := range opts { opt.apply(s) }
; simplified: R8 = opts.data, CX = len(opts), SI = i
loop:
    CMPQ    SI, CX
    JGE     done
    MOVQ    SI, DI
    SHLQ    $4, DI                 ; DI = i*16 (an index scale of 16 is not encodable)
    MOVQ    (R8)(DI*1), DX         ; DX = opts[i].itab  (16-byte stride!)
    MOVQ    8(R8)(DI*1), AX        ; AX = opts[i].data  (receiver)
    MOVQ    "".s+24(SP), BX        ; BX = s             (second argument)
    MOVQ    24(DX), R12            ; R12 = itab.fun[0]  (offset depends on layout)
    CALL    R12
    INCQ    SI
    JMP     loop

Two differences from the function variant:

16-byte stride through the slice. Each element is two words, not one.
Three loads to set up the call instead of two: the itab pointer, the data pointer, and fun[0] out of the itab. The function variant loads the funcval pointer and then the entry PC through it. Net effect: about one extra load per iteration.

Both variants do an indirect CALL and both pay the same prediction cost on the branch predictor. The interface variant pays ~30% more per option in practice (matches the middle §12 benchmark numbers), but the absolute difference is ~2 ns per option on amd64. Not a reason to choose between variants.

9.1 itab caching

Because WithLogger converts a statically known concrete type, the itab for (loggerOption, Option) already exists when main starts; there is no first-use cost on this path. The lazy path (walk the method set, find apply, build the itab, store it in itabTable) runs only for dynamically discovered pairs, and it runs once per pair; later conversions reuse the cached itab.

You can see the conversion the compiler emits. Boxing a non-pointer value into an interface goes through one of the runtime.convT helpers (convT, convT64, convTstring and friends), which allocate a heap copy of the value and store a pointer to it in the data word. Pointer-shaped values skip the helper: the data word simply holds the pointer.

loggerOption above happens to be pointer-shaped. The compiler treats a struct with exactly one pointer field like the pointer itself, so WithLogger(l) stores l directly in the data word and the conversion allocates nothing (the diagram above shows the general, boxed case). A value option such as a hypothetical timeoutOption struct{ d time.Duration } is not pointer-shaped and pays one convT allocation per construction. If you change func (o loggerOption) apply(...) to func (o *loggerOption) apply(...) and return &loggerOption{...}, the convT call disappears, but the struct itself now has to be heap-allocated because it escapes. Either way the interface variant has no closure: it costs at most one heap object per option, the same as the function variant.

So pointer receivers in interface-variant libraries avoid the box copy, but they trade it for the allocation of the option struct; the choice is about API shape more than allocation count.
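The boxing behaviour of the interface variant can be measured directly. The sketch below is an illustration under stated assumptions: timeoutOption is a hypothetical value-carrying option added for contrast, both constructors are marked noinline so the interface value really escapes through the return, and optSink is a hypothetical global that keeps the result alive. It compares a single-pointer struct, which the compiler can place in the interface's data word as is, with a struct holding a time.Duration, which has to be copied to the heap.

```go
package main

import (
	"fmt"
	"log"
	"testing"
	"time"
)

type Server struct {
	logger      *log.Logger
	readTimeout time.Duration
}

type Option interface{ apply(*Server) }

// loggerOption has exactly one pointer field, so it is pointer-shaped.
type loggerOption struct{ l *log.Logger }

func (o loggerOption) apply(s *Server) { s.logger = o.l }

// timeoutOption (hypothetical) holds a non-pointer value and must be boxed.
type timeoutOption struct{ d time.Duration }

func (o timeoutOption) apply(s *Server) { s.readTimeout = o.d }

//go:noinline
func WithLogger(l *log.Logger) Option { return loggerOption{l: l} }

//go:noinline
func WithReadTimeout(d time.Duration) Option { return timeoutOption{d: d} }

var optSink Option // forces the interface value to escape

// boxingAllocs returns average allocations per constructor call.
func boxingAllocs() (pointerShaped, boxed float64) {
	l := log.Default()
	d := 5 * time.Second
	pointerShaped = testing.AllocsPerRun(1000, func() { optSink = WithLogger(l) })
	boxed = testing.AllocsPerRun(1000, func() { optSink = WithReadTimeout(d) })
	return pointerShaped, boxed
}

func main() {
	p, b := boxingAllocs()
	fmt.Printf("pointer-shaped: %.0f allocs, boxed value: %.0f allocs\n", p, b)
}
```

If an interface-variant option shows up in an allocation profile, checking whether its concrete type is pointer-shaped is usually the first thing to look at.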

10. Escape analysis of NewServer chains

A pattern people ask about:

s := NewServer(":8080", WithLogger(l)).Start()

Does the Server allocate on the heap or the stack?

Take a stripped-down version:

// main.go
package main

import "log"

type Server struct{ logger *log.Logger; running bool }
type Option func(*Server)

func WithLogger(l *log.Logger) Option { return func(s *Server) { s.logger = l } }

func NewServer(opts ...Option) *Server {
    s := &Server{}
    for _, o := range opts { o(s) }
    return s
}

func (s *Server) Start() *Server { s.running = true; return s }

func main() {
    s := NewServer(WithLogger(log.Default())).Start()
    _ = s
}

$ go build -gcflags="-m -m" main.go 2>&1 | grep -E "escape|inline"
./main.go:9:6: can inline WithLogger
./main.go:9:35: can inline WithLogger.func1
./main.go:11:6: cannot inline NewServer: function too complex
./main.go:17:6: can inline (*Server).Start
./main.go:19:6: can inline main
./main.go:9:35: func literal escapes to heap
./main.go:11:21: opts does not escape
./main.go:12:7: &Server{} escapes to heap
./main.go:20:51: log.Default() does not escape
./main.go:20:38: ... argument does not escape

&Server{} escapes to heap even though the only use after Start() is _ = s (immediately discarded). The reason is again NewServer's inlining failure: because NewServer doesn't inline into main, the &Server{} allocation site is inside a non-inlined function that returns the pointer. Returning a pointer is enough to force escape.

If you manually inline the body of NewServer into main (or hand-write the construction), the escape analyser sees the full lifetime and can stack-allocate s. But the inliner won't do it for you because of the for-loop cost.

10.1 What if there are zero options?

func NewServer() *Server {
    return &Server{}
}

func main() {
    s := NewServer().Start()
    _ = s
}

./main.go:5:6: can inline NewServer
./main.go:9:18: inlining call to NewServer
./main.go:5:9: &Server{} escapes to heap

NewServer inlines, and an &Server{} escapes to heap diagnostic is still printed. Read its position carefully: 5:9 is the literal inside the standalone copy of NewServer, which the compiler still has to emit for any caller that does not inline it, and in that copy the pointer is returned, so it escapes. The inlined copy in main is analysed separately. Start inlines too:

./main.go:9:23: inlining call to (*Server).Start

With both calls inlined, escape analysis of main sees the whole lifetime of the Server. Look for a second &Server{} diagnostic positioned at the call site in main (line 9 here); if it reads does not escape, that copy is stack-allocated and the chain costs no heap allocation. A diagnostic attached to the callee's line never tells you what happened at an inlined call site.

The variadic chain from the start of this section is different: NewServer does not inline there, so with Go 1.22 it still produces a heap allocation for the Server. With explicit hoisting:

s := &Server{}
s.logger = log.Default()
s.running = true
_ = s

…the allocation can become stack-resident. So the pattern doesn't optimise as aggressively as the equivalent C++ code where return-value optimization elides everything; you live with one allocation per server, period.

11. GOSSAFUNC walkthrough

For a complete view of how the compiler transforms the apply-loop, dump SSA:

GOSSAFUNC=NewServer go build example.go
# writes ssa.html to the current directory; open it in a browser

The HTML contains every SSA pass, from the AST translation to final lowering. The passes most relevant to options:

Pass	What it does	What you see
`start`	AST → SSA	The naive form: explicit `range`, explicit slice indexing
`escape analysis` (pre-SSA)	Decide stack vs heap for each allocation	Runs on the IR before SSA is built, so it has no column of its own; its result shows up as `Server` and the closure being allocated through `runtime.newobject`
`inline calls` (pre-SSA)	Inline small callees	Also runs before SSA; `WithLogger.func1` is not inlined (indirect target), so the call is still present in `start`
`decompose builtin`	Split slice/string/interface values into their words	`opts` becomes an `(opts.ptr, opts.len, opts.cap)` triple
`prove`	Bounds-check elimination	The `i < len(opts)` proof discharges the check on `opts[i]`
`lower`	SSA → architecture-specific ops	The `CALL` becomes a `CALLclosure` op
`regalloc`	Assign physical registers	DX is forced as the closure register

The most illuminating pass is decompose builtin (decompose user does the same job for user-defined structs and arrays). Before it, the loop reads opts[i] as one opaque "Option" value. After it, the load is explicit: the slice header has been broken into three separate values and the indexed read is a plain load from opts.ptr plus an offset. The closure call takes the function value as an explicit argument, which regalloc later pins to DX.

You can also trace by passing -gcflags="-S -d=ssa/lower/dump" for a text dump of the lowered SSA. For the apply-loop, the lowered SSA reads almost identical to the asm in §6.

12. Allocation count, byte-by-byte

Bench harness:

// bench_options_test.go
package srv

import (
    "log"
    "testing"
    "time"
)

var sink *Server

func BenchmarkNoOpts(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sink = NewServer(":8080")
    }
}

func BenchmarkOneOpt(b *testing.B) {
    l := log.Default()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sink = NewServer(":8080", WithLogger(l))
    }
}

func BenchmarkFiveOpts(b *testing.B) {
    l := log.Default()
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sink = NewServer(":8080",
            WithReadTimeout(5*time.Second),
            WithWriteTimeout(5*time.Second),
            WithLogger(l),
            WithMaxConns(1000),
            WithDebug(),
        )
    }
}

func BenchmarkFiveOptsReused(b *testing.B) {
    l := log.Default()
    opts := []Option{
        WithReadTimeout(5*time.Second),
        WithWriteTimeout(5*time.Second),
        WithLogger(l),
        WithMaxConns(1000),
        WithDebug(),
    }
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        sink = NewServer(":8080", opts...)
    }
}

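Sample results (Go 1.22, M2 Mac, GOMAXPROCS=1; an M2 is an arm64 machine, so amd64 figures on it imply running under Rosetta, and the absolute timings are indicative only):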

BenchmarkNoOpts-1            120000000     10.3 ns/op    48 B/op    1 allocs/op
BenchmarkOneOpt-1             40000000     32.1 ns/op    72 B/op    2 allocs/op
BenchmarkFiveOpts-1           10000000    125.0 ns/op   192 B/op    6 allocs/op
BenchmarkFiveOptsReused-1     50000000     28.4 ns/op    48 B/op    1 allocs/op

Reading the numbers:

NoOpts: 1 alloc, 48 B. That's the Server itself on the heap. Nothing else allocates.
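OneOpt: 2 allocs, 72 B = 48 (Server) + 24 attributed to the closure. By the layout in §3, a closure with one pointer capture needs only 16 bytes (the fn word plus the capture), so expect the byte figure to vary with toolchain and capture set; the allocation count is the stable part.
FiveOpts: 6 allocs, 192 B = 48 (Server) + 5 closures of roughly 24-32 B each. The exact size depends on each closure's capture set (WithLogger captures 8 bytes). A literal that captures nothing, such as the one in WithDebug(), normally compiles to a static funcval and should not allocate, so a count of 6 rather than 5 here is worth re-checking on your own toolchain.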
FiveOptsReused: 1 alloc — the Server. The closures live in the opts slice that was built once before the benchmark loop. Inside the loop, only the Server allocates.

This is the big lever for performance-sensitive options usage: build the options slice once, reuse it. The constructor itself is unchanged. Same call site (NewServer(addr, opts...)), 5× fewer allocations.
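The same comparison can be run outside the benchmark harness with testing.AllocsPerRun. The sketch below is a reduced version of the benchmark, not a replacement for it: it uses two options instead of five, marks the constructors and NewServer noinline so the counts do not depend on inlining decisions, and introduces a hypothetical helper allocsFreshVsReused. Under those assumptions the reused case should allocate only the Server, and the fresh case should add the closures on top.

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

type Server struct {
	addr         string
	readTimeout  time.Duration
	writeTimeout time.Duration
}

type Option func(*Server)

//go:noinline
func WithReadTimeout(d time.Duration) Option {
	return func(s *Server) { s.readTimeout = d }
}

//go:noinline
func WithWriteTimeout(d time.Duration) Option {
	return func(s *Server) { s.writeTimeout = d }
}

//go:noinline
func NewServer(addr string, opts ...Option) *Server {
	s := &Server{addr: addr, readTimeout: 30 * time.Second}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

var sink *Server // keeps each Server on the heap, as in the benchmark

// allocsFreshVsReused returns average allocations per NewServer call when
// options are rebuilt every time versus built once and spread with opts....
func allocsFreshVsReused() (fresh, reused float64) {
	d := 5 * time.Second
	fresh = testing.AllocsPerRun(1000, func() {
		sink = NewServer(":8080", WithReadTimeout(d), WithWriteTimeout(d))
	})
	opts := []Option{WithReadTimeout(d), WithWriteTimeout(d)}
	reused = testing.AllocsPerRun(1000, func() {
		sink = NewServer(":8080", opts...)
	})
	return fresh, reused
}

func main() {
	f, r := allocsFreshVsReused()
	fmt.Printf("fresh: %.0f allocs/call, reused: %.0f allocs/call\n", f, r)
}
```

Reuse is only safe when the options are stateless; an option like WithCounter from §4.1 would carry its state from one server to the next.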

12.1 What an empty-capture closure costs

func WithDebug() Option {
    return func(s *Server) { s.debug = true }
}

The closure captures nothing. Does it still allocate?

No. When a func literal has no captured variables, the compiler does not build a closure environment at run time. It emits a single read-only funcval (just the fn word) as static data, and WithDebug returns its address, so there is no call to runtime.newobject. -gcflags="-m" may still print func literal escapes to heap for such a literal; that line reports the escape classification, not an actual allocation, so check with b.ReportAllocs() or testing.AllocsPerRun when it matters.

This holds wherever the literal appears, inside WithDebug or at package scope. The allocation only appears once the literal captures something, as the lambdas in WithLogger and WithReadTimeout do.

You can make the sharing explicit by hoisting the function value into a package-level variable:

var debugOpt Option = func(s *Server) { s.debug = true }

func WithDebug() Option { return debugOpt }

…and now WithDebug visibly returns the same precomputed function value every time. For a zero-capture literal this changes nothing about allocation. The same hoisting is what saves an allocation for options whose captured arguments are fixed, for example a package-level variable initialised once with WithReadTimeout(5*time.Second).
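Whether a particular option constructor allocates is quick to check with testing.AllocsPerRun. The sketch below is a minimal illustration: both constructors are marked noinline so the returned function value genuinely escapes, optSink is a hypothetical global that keeps it alive, and the maxConns field is assumed for the capturing case. It contrasts a literal with no captures against one that captures a single int.

```go
package main

import (
	"fmt"
	"testing"
)

type Server struct {
	debug    bool
	maxConns int
}

type Option func(*Server)

// WithDebug returns a literal that captures nothing.
//
//go:noinline
func WithDebug() Option {
	return func(s *Server) { s.debug = true }
}

// WithMaxConns returns a literal that captures n.
//
//go:noinline
func WithMaxConns(n int) Option {
	return func(s *Server) { s.maxConns = n }
}

var optSink Option // forces the returned function value to escape

// closureAllocs returns average allocations per constructor call.
func closureAllocs() (noCapture, oneCapture float64) {
	n := 1000
	noCapture = testing.AllocsPerRun(1000, func() { optSink = WithDebug() })
	oneCapture = testing.AllocsPerRun(1000, func() { optSink = WithMaxConns(n) })
	return noCapture, oneCapture
}

func main() {
	z, c := closureAllocs()
	fmt.Printf("no capture: %.0f allocs, one capture: %.0f allocs\n", z, c)
}
```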

13. Cross-language comparison at the machine level

How does Go's functional-options pattern compare to neighbours' equivalent constructs at the codegen level?

13.1 C++ default arguments

Server make_server(std::string addr,
                   std::chrono::milliseconds read_timeout = 30s,
                   Logger* logger = nullptr);

The compiler stores default values in the caller's code, not the callee's. Each call site is compiled with the defaults inlined. Effectively: zero runtime cost for "unused" defaults; one breaking-change vector (changing a default requires recompiling every caller; changing the parameter list breaks the ABI).

Codegen at the call site looks like a normal direct call with the defaults synthesised inline. No closures, no indirect calls.

13.2 Java builders

Server s = Server.builder()
    .addr(":8080")
    .readTimeout(Duration.ofSeconds(5))
    .build();

Each .addr(...) returns the builder (this), so the chain is a sequence of method calls on one receiver. In hot code the JIT inlines them all, eliding the intermediate this returns. The final .build() constructs the object.

Cost: each builder method is a virtual call (resolved through the vtable), but the JIT specialises on the receiver type after warmup and inlines. After warmup, costs comparable to Go's function variant — possibly faster because the JIT can fully fold the chain into a single allocation.

Cold start: noticeably slower because the JIT hasn't warmed up.

13.3 Rust struct update syntax

let s = Server { addr: ":8080".into(), read_timeout: Duration::from_secs(5), ..Server::default() };

Constructed in one expression. The compiler emits the struct on the caller's stack (or wherever the binding lives) and populates the fields directly. The option mechanism itself adds no closures, no allocations, and no indirect calls: Server::default() is an ordinary call that typically inlines, and the String built for addr allocates exactly as it would in any other construction style.

The cost is that the struct's fields are exposed in the type (and need to be pub), so you have the same "API surface = struct shape" problem as a Go config struct. Rust mitigates this with the #[non_exhaustive] attribute:

#[non_exhaustive]
pub struct Server { /* ... */ }

…which prevents external crates from constructing the struct with a struct literal (including the ..Server::default() update form); they must go through a constructor. So Rust's approximate equivalent to "functional options" is "builder pattern + #[non_exhaustive]", with slightly cleaner ergonomics than Java.

13.4 Summary¶

| Language | Cost per "option" | Allocations per option | When the cost is paid |
|---|---|---|---|
| Go (function variant) | ~5 ns + 1 alloc | 1 (closure) | At option construction |
| Go (interface variant) | ~7 ns + 1-2 allocs | 1-2 (closure + box) | At option construction |
| C++ default args | 0 (literal substitution) | 0 | Compile time |
| Java builder, post-JIT | ~1 ns | 0 (JIT-folded) | After warmup |
| Java builder, cold | ~10 ns | 1 (builder object) | Always |
| Rust struct update | 0 | 0 | Compile time |

Go pays the most per option, but its options are first-class values: passable, storable, conditional, composable. C++ default args and Rust struct update syntax cannot be passed around. Java builders can but allocate the builder. The tradeoff is paid in the right currency.

14. Edge cases at the lowest level¶

14.1 Variadic with a single option vs spread¶

NewServer(":8080", WithLogger(l))
// vs
NewServer(":8080", []Option{WithLogger(l)}...)

Both produce a []Option of length 1. The first form is preferred: the compiler synthesises the backing array itself, usually in the caller's frame when NewServer does not retain opts. In the second form the caller builds the slice explicitly. If that slice does not escape (for example, it is a slice literal used only at the call site), its backing array is stack-allocated as well. If you write opts := []Option{...} and then pass opts..., the slice header is on the caller's stack regardless; only the placement of the backing array depends on escape analysis.

The interesting case: 0 options.

NewServer(":8080")

There is no slice. The variadic parameter is set to nil. Inside NewServer, len(opts) == 0, range opts iterates zero times, no allocations for the variadic.

NewServer(":8080", nil) // <-- not zero options! One option that happens to be nil.

This is one option of value nil. The apply loop's opt(s) will panic. Be careful with conditionals:

var opt Option // nil
NewServer(":8080", opt) // boom on first iteration

14.2 `runtime.newobject` vs stack allocation for closures¶

The closure escapes when it outlives the function that creates it (WithX). The compiler emits one of:

// On heap (the usual case for options)
&funcval{fn: pc, capture0: d}   // really: runtime.newobject + initialise

// On stack (rare for options)
funcval{fn: pc, capture0: d}    // initialised in the caller's frame, no runtime call

For functional options, the heap version is the one you almost always see, because the option's purpose is to outlive the WithX call.

The exception is when the option is consumed in the same function and the escape analyser can prove it doesn't outlive:

func ConfigureLocally() {
    o := WithLogger(myLogger) // closure could in principle stack-allocate
    o(&localServer)           // ... if Go could see o is only used here
}

In practice, whether this stack-allocates depends on WithLogger being inlined into the caller, so the result varies between call sites and toolchain versions; check with -gcflags="-m" rather than assuming. This is not a hot path for the pattern.

14.3 Stack-grow during option application¶

The apply loop calls into option closures. Any of those calls may grow the goroutine's stack if the closure's body needs more frame space than is left. The check lives in each callee's prologue: it compares SP against g.stackguard0 and, if the frame would not fit, calls runtime.morestack (or runtime.morestack_noctxt for functions that do not use the closure context register), both defined in src/runtime/asm_amd64.s. The runtime then copies the existing frames to a larger allocation and restarts the call.

For functional options, this is a non-issue: the closure bodies are short. But if you have an option that does heavy work, that work runs on the constructor's goroutine and respects the goroutine's stack.

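A subtle implication: an option closure that captures a pointer to a local variable needs no special handling. If the closure lives on the heap, escape analysis has already moved the variable to the heap as well; if both live on the stack, the runtime adjusts the pointer when it copies the stack.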

14.4 The "method value" version of options¶

You can use a method value as an option:

type Logger struct{ /* ... */ }
func (l *Logger) Attach(s *Server) { s.logger = l }

l := &Logger{}
NewServer(":8080", l.Attach) // l.Attach is an Option

l.Attach is a method value: a funcval capturing the receiver l. It costs one allocation (the method value's funcval, 16 bytes including the captured receiver pointer). It is shorter to write than WithLogger(l) and costs essentially the same: one funcval and one indirect call per application. Whether to expose method-value options is a style call: it skips the WithX ceremony at the cost of leaking the method-set shape into the API.

14.5 GC barriers when closures hold pointers¶

The closure environment for WithLogger(l) holds a *log.Logger. The Go GC scans it along with any other heap-resident pointer; there is no special exception for closure captures. From the GC's perspective, a funcval is just a heap object whose pointer layout comes from a compiler-generated type descriptor (so the GC knows which words are pointers).

When the runtime constructs the closure (runtime.newobject path), it allocates the object with the type descriptor the compiler produced for the closure. The descriptor's pointer bitmap marks position 1 (after the fn word) as a pointer, so the GC scans the *log.Logger. Position 0 (fn) holds a code address, but the compiler types that word as a uintptr in the closure's struct, so the bitmap marks it as a non-pointer and the GC skips it. Code is not a heap object and needs no scanning.

So options participate normally in the write barrier and GC mark phase. No surprises.

15. Test¶

Internal knowledge questions¶

1. What is the size in bytes of a single Option value on amd64?

Answer

8 bytes — a `func(...)` value is a single pointer to a `funcval`. The function variant's `Option` is one word. The interface variant's `Option` is two words (16 bytes) — itab pointer + data pointer.

2. Why does &Server{} inside NewServer allocate on the heap even when the result is immediately discarded?

Answer

The escape analyser considers each function independently. `NewServer` returns a `*Server`, so the address escapes the function's frame. Since `NewServer` doesn't inline (too complex due to the loop), the caller can't see that the value is unused, and the conservative result is heap allocation. Manually inlining the body or removing the for-loop is the only way to suppress this.

3. The apply loop produces CALL R12. Why R12 specifically?

Answer

The compiler's amd64 calling convention reserves DX as the closure register — the pointer to the funcval is passed in DX so the callee can read its captures via `[DX+8]`, `[DX+16]`. The entry PC is loaded from `(DX)` into a scratch register (commonly R12 on Go 1.17+ register-based calling convention). The choice of R12 is from the regalloc pass; what matters is that DX holds the funcval pointer at the moment of CALL.

4. What's the assembly difference between calling a normal func() and a closure?

Answer

A normal function call is `CALL "".funcName(SB)` — a direct call with a static target. A closure call requires (a) loading the funcval pointer into DX, (b) loading the entry PC from `(DX)`, (c) calling the entry PC indirectly. The callee's prologue then reads captures off DX-relative offsets. The cost difference is one extra load and one indirect-call penalty in the branch predictor.

5. Why does WithDebug() (no arguments, no captures) still heap-allocate?

Answer

Even with no captures, the closure is constructed inside the `WithDebug` function and returned. Returning the closure forces it to escape `WithDebug`'s frame, which means heap allocation. The fix is to store the (capture-less) closure in a package-level variable and return it directly: `var debugOpt Option = func(s *Server) {...}; func WithDebug() Option { return debugOpt }`. Now no allocation per call.

6. Reading the assembly, how can you tell if the slice of options is heap-allocated?

Answer

Look at the caller's code just before the `NewServer` call. If you see `CALL runtime.newobject` (or `runtime.makeslice`) with a `[N]Option` type descriptor in AX, the backing array is on the heap. If you see an `LEAQ` of a stack temporary (`"".__tmpN(SP)`) instead, the backing array lives on the stack. `-gcflags="-m"` reports the same with `... argument does not escape` or `... argument escapes to heap`.

Test code: count allocations directly¶

func TestNoExtraAllocs(t *testing.T) {
    l := log.Default()
    opts := []Option{WithLogger(l), WithReadTimeout(5*time.Second)}

    allocs := testing.AllocsPerRun(1000, func() {
        _ = NewServer(":8080", opts...)
    })

    if allocs != 1 {
        t.Fatalf("expected 1 alloc (Server only), got %v", allocs)
    }
}

If this test fails, either the variadic slice escaped or one of the options allocates when applied. Bisect by running with fewer options.

16. Tricky questions¶

1. Why does NewServer(":8080", opt1, opt2) allocate three things but NewServer(":8080", opts...) (where opts is a long-lived slice) allocate only one?

Answer

In the first form, each `optN` is constructed at the call site (`WithX(...)`) which produces a fresh closure on the heap, plus the variadic slice's backing array. Three allocations. In the spread form, `opts` was built once, the closures already exist, the slice already exists. The spread passes the existing slice header through `NewServer`'s variadic parameter without copying. Only the `Server` itself is freshly allocated.

2. Why does the interface variant of Option (with value receivers) often produce two allocations per option, when the function variant produces only one?

Answer

With value receivers, `WithLogger(l)` returns `Option(loggerOption{l: l})`, which boxes the struct into an interface. For a struct that is not pointer-shaped, the runtime call `runtime.convT` allocates a heap copy of the struct and stores its pointer in the interface's data word. (A struct whose only field is a single pointer is pointer-shaped, and the compiler can store it in the data word directly; confirm with `-gcflags="-m"`.) Allocation 1 is the boxed `loggerOption{}`. Allocation 2 is the slice's backing array. With pointer receivers (`*loggerOption`), boxing is free (the existing pointer is the data word), and you have only one allocation per option: the `&loggerOption{}`. So the interface variant should use pointer receivers when allocation count matters.

3. Is the apply loop's indirect call slower because of Spectre mitigations?

Answer

Yes, marginally. On x86, indirect branches are subject to BTB poisoning, and the kernel may have enabled IBRS/IBPB depending on `mitigations=` boot params. Each indirect call may pay a few cycles of additional latency compared with an unmitigated direct call. On modern CPUs (Ice Lake and later) with eIBRS, the cost is amortized and barely measurable. On older CPUs with retpoline mitigation, indirect calls can be 10× slower. The Go toolchain doesn't insert per-call mitigations by default (the opt-in `-spectre=ret` compiler and assembler flag emits retpolines on amd64); otherwise this is purely hardware/microcode/kernel-level.

4. Why is for _, opt := range opts { opt(s) } not converted into a runtime.duffcopy-like vectored sequence by the compiler?

Answer

Because the loop body involves an indirect call with side effects on `s`, and the compiler cannot prove the calls are independent (they might write to overlapping fields of `s` in order-dependent ways). The compiler is required to preserve call order. There's no SIMD analogue for "apply N different functions to the same target" — each is a distinct call with its own prologue and epilogue.

5. Can profile-guided optimisation (PGO) inline option bodies?

Answer

Partially. Go 1.21 added PGO devirtualisation for interface calls — if profiling shows that one call site is almost always dispatching to one specific type, the compiler can emit a check-and-direct-call. For closure-typed call sites (the function variant), PGO devirtualisation is limited because the closure environment varies per call. As of Go 1.22, the most you can hope for is that "hot" closures' bodies get a marginal layout benefit. Don't expect PGO to eliminate the indirect call cost of functional options.

6. If I capture a *time.Time, will the closure escape extend the lifetime of the time.Time?

Answer

The closure holds a pointer to the `time.Time`. As long as the closure (the option) is reachable from a GC root (e.g., stored in a long-lived `[]Option`), the GC will keep the `time.Time` alive. This is normal pointer-reachability. The "trap" is when callers construct an option from a stack-local that they expect to die quickly; the option captures the pointer and the variable gets heap-promoted by the escape analyser. You may see `time.Time escapes to heap` annotations as a result. Usually fine; just be aware that closure captures can promote locals to the heap.

17. Summary¶

An Option in the function variant is a single pointer (8 bytes on amd64), pointing to a runtime.funcval. The funcval contains the entry PC and any captured values inline.
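WithX(arg) heap-allocates the funcval whenever the returned closure outlives the WithX call, which is the normal case for options. The exceptions are narrow: capture-less options returned from a package-level variable, and the rare call sites where escape analysis can prove the closure stays local.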
The apply-loop for _, opt := range opts { opt(s) } compiles to a 6-instruction loop with an indirect CALL through the closure register (DX on amd64). Each iteration is ~3-5 ns plus the closure body.
Options don't inline because the call is indirect — the compiler doesn't know which closure body is at each slice slot. Profile-guided optimisation can't fully recover this for closure values.
The Server allocated inside NewServer escapes to the heap because NewServer doesn't inline (the loop pushes it over the budget). Even fully-inlined chains like NewServer(...).Start() typically still produce one heap allocation for the Server in Go 1.22.
Variadic slices (opts ...Option) often stack-allocate when NewServer doesn't retain a reference. Reusing a pre-built []Option reduces allocations from N+1 to 1 (just the Server).
The interface variant doubles the size of each Option slot (itab + data, 16 bytes) and may add a second allocation per option for value-receiver method sets. Function variant is ~30% faster per option in practice, but the absolute difference is ~2 ns.
Cross-language comparison: Go pays the most per option at runtime, but its options are first-class values. C++ default args and Rust struct update syntax pay zero at runtime but cannot be composed or passed around. Java builders match Go's flexibility but rely on JIT to recover the cost.
The cardinal rule for performance-sensitive options: build the option slice once, reuse it. The closure allocations happen at option-construction time; reuse the option, reuse the closures.

18. Further reading¶

Go runtime source — runtime.funcval, itab structures: src/runtime/runtime2.go
Closure conversion in the compiler: src/cmd/compile/internal/walk/closure.go
Escape analysis: src/cmd/compile/internal/escape/escape.go
Inliner heuristics: src/cmd/compile/internal/inline/inl.go
Calling convention (register-based, Go 1.17+): src/cmd/compile/abi-internal.md in the Go source tree
itab caching: src/runtime/iface.go — getitab, itabTable
Profile-guided optimisation: https://go.dev/doc/pgo
Dave Cheney, "Functional options for friendly APIs" (2014) — the canonical justification; doesn't cover internals but is the historical reference
Related: 02-language-basics/07-pointers/05-unsafe-pointer/professional.md for unsafe.Pointer internals
Related: 02-language-basics/02-functions/04-closure-internals/professional.md for the closure-conversion deep dive that this file builds on
Related: middle.md §12 for the higher-level benchmark numbers; this file explains why those numbers look the way they do
