Functional Options — Under the Hood¶
1. The runtime framing¶
Junior and middle taught what the pattern is. This file is about what the compiler and runtime actually do when you write it. Every WithX(arg) call is a function value being constructed on the fly — a closure that captures arg. Every for _, opt := range opts { opt(s) } is an indirect call through that function value. Every []Option is a slice of these function values. None of that is free; none of it is unreasonably expensive either. The point of this file is to be precise.
We work in Go 1.22 / amd64 unless otherwise noted. References to the standard library are against go1.22.x source, paths like src/runtime/runtime2.go and src/cmd/compile/internal/ssagen/ssa.go.
The questions we answer:
- What is the in-memory layout of an Option value?
- When does WithX(arg) heap-allocate the closure and when does it stack-allocate?
- Why doesn't the loop body opt(s) inline?
- What does the assembly for the apply-loop look like?
- Does NewServer(":8080", WithLogger(l)).Start() allocate the Server on the heap or the stack?
- What's the actual cost of the function vs interface variant, instruction by instruction?
2. Table of Contents¶
- The runtime framing
- Table of Contents
- How a function value is represented
- The funcval struct and closure layout
- WithX call site — escape analysis walkthrough
- The apply loop in assembly
- Why options don't inline
- The slice of options in memory
- Interface variant under the hood
- Escape analysis of NewServer chains
- GOSSAFUNC walkthrough
- Allocation count, byte-by-byte
- Cross-language comparison at the machine level
- Edge cases at the lowest level
- Test
- Tricky questions
- Summary
- Further reading
3. How a function value is represented¶
In Go, a value of type func(...) is a single word — a pointer. Specifically, it points to a runtime.funcval struct. The struct's first field is the entry-point PC; any words following it are the captured variables for that closure. This is defined in src/runtime/runtime2.go:
// runtime2.go (paraphrased; the comment is from the Go source)
type funcval struct {
fn uintptr
// variable-sized, fn-specific data here
}
So when you write an option such as opt := WithLogger(myLogger), opt is one machine word. That word is the address of a funcval. The funcval's first 8 bytes (on amd64) are the entry PC of the anonymous function func(s *Server) { s.logger = myLogger }. The bytes after that are the captured environment: in this case, the captured *log.Logger value (one pointer).
opt (one word, 8 bytes on amd64)
│
└──> funcval @ heap-allocated address
┌──────────────────────────────────┐
│ fn (8 bytes): entry PC of closure│
├──────────────────────────────────┤
│ captured: *log.Logger (8 bytes) │
└──────────────────────────────────┘
Calling opt(s) is not a simple direct jump. The runtime convention on amd64 puts the funcval pointer into the DX register (the "closure register"), then jumps to [DX] — i.e., dereferences the first word of the funcval to get the entry PC. The closure prologue then reads its captures off [DX+8], [DX+16], etc.
This is why the apply-loop produces CALL instructions through a register, not direct calls. We'll see it in §6.
Method values are the same shape. s.Start (where Start has receiver *Server) is a funcval whose capture word is the receiver pointer. This is why method values cost an allocation when they escape — they're closures.
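The one-word claim and the "first word is the entry PC" claim can both be checked from user code. The following is a minimal sketch, assuming the gc toolchain's funcval layout described above; WithName, optionWords, entryPC and matchesReflect are hypothetical names introduced for the illustration, and the unsafe read depends on an implementation detail rather than on anything the language specification guarantees.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

type Server struct{ name string }

type Option func(*Server)

// WithName is a hypothetical option used only for this illustration.
func WithName(n string) Option {
	return func(s *Server) { s.name = n }
}

// optionWords reports how many machine words an Option value occupies.
func optionWords() uintptr {
	var opt Option
	return unsafe.Sizeof(opt) / unsafe.Sizeof(uintptr(0))
}

// entryPC reads the first word of the funcval that opt points at.
// Assumption: this relies on the gc toolchain's layout, not on the spec.
// opt must be non-nil.
func entryPC(opt Option) uintptr {
	fv := *(**uintptr)(unsafe.Pointer(&opt)) // opt's single word is a *funcval
	return *fv                               // first word of the funcval: fn
}

// matchesReflect compares our read with reflect's view of the code pointer.
func matchesReflect(opt Option) bool {
	return entryPC(opt) == reflect.ValueOf(opt).Pointer()
}

func main() {
	opt := WithName("api")
	fmt.Println("words per Option:", optionWords())
	fmt.Println("fn word matches reflect:", matchesReflect(opt))
}
```

reflect.Value.Pointer is documented to return the underlying code pointer for a Func, which is why it serves as the reference value here. Treat this as a way to look at the layout, not as something to ship.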
4. The funcval struct and closure layout¶
The compiler's representation lives in src/cmd/compile/internal/ir/func.go and the closure conversion pass is in src/cmd/compile/internal/walk/closure.go. The thing to internalise is that a closure literal such as func(s *Server) { s.readTimeout = d }, with d captured by value, compiles to a hidden type:
// Compiler-synthesized closure environment for WithReadTimeout's lambda
type closureWithReadTimeout struct {
fn uintptr // entry PC of the lambda
d time.Duration // captured value
}
When WithReadTimeout(5*time.Second) runs:
- The compiler emits a call to runtime.newobject (when the closure escapes), or reserves stack space (when it doesn't).
- It writes entryPC(closure) into [obj+0].
- It writes d into [obj+8].
- It returns the pointer-to-obj as the Option value.
You can see this materialise in -gcflags="-m":
// withreadtimeout.go
package srv
import "time"
type Server struct{ readTimeout time.Duration }
type Option func(*Server)
func WithReadTimeout(d time.Duration) Option {
return func(s *Server) { s.readTimeout = d }
}
$ go build -gcflags="-m" withreadtimeout.go
./withreadtimeout.go:8:6: can inline WithReadTimeout
./withreadtimeout.go:9:9: can inline WithReadTimeout.func1
./withreadtimeout.go:9:9: func literal escapes to heap
That last line — func literal escapes to heap — is the closure allocation. The closure environment cannot live on the stack of WithReadTimeout because it survives the return: the caller holds the Option and applies it later.
If you somehow inline the whole construction so the closure never outlives its enclosing frame, it stack-allocates. We come back to this in §10 with NewServer(...).Start().
4.1 Captures by value vs by reference¶
Go closures capture variables, not values, but the variables themselves can be moved to the heap (heap-promotion) so the closure can keep a stable address. If the captured variable doesn't need to be mutable across the closure and the outer scope, the compiler treats the capture as by-value.
Take WithLogger(l *log.Logger), which returns func(s *Server) { s.logger = l }. Here l is local to WithLogger, never reassigned, never address-taken outside the closure. The compiler captures it by value: one pointer-sized word inside the closure environment. No second allocation for a heap-promoted l.
Contrast a closure that increments a counter n declared in its enclosing function.
Now n is mutable, shared between the closure and… well, only the closure here, but if you returned both n and the closure they would share it. The compiler heap-promotes n and the closure captures &n.
For options, you almost never capture mutable state. Captures are arguments that the caller already constructed.
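The two capture styles can be seen side by side in a small program. This is a sketch under stated assumptions: WithLabel and newCounter are hypothetical names, and whether the compiler actually chooses by-value or by-reference capture is something to confirm with -gcflags="-m" on your own build. What the code demonstrates directly is the observable difference: closures that mutate a captured variable share it.

```go
package main

import "fmt"

type Server struct{ label string }

type Option func(*Server)

// WithLabel captures label and never reassigns it, so a copy of the value
// inside the closure environment is sufficient.
func WithLabel(label string) Option {
	return func(s *Server) { s.label = label }
}

// newCounter returns two closures that share n. Because n is mutated after
// capture, both closures must refer to the same variable, not to a copy.
func newCounter() (inc func(), get func() int) {
	n := 0
	inc = func() { n++ }
	get = func() int { return n }
	return inc, get
}

func main() {
	s := &Server{}
	WithLabel("blue")(s)

	inc, get := newCounter()
	inc()
	inc()

	fmt.Println(s.label, get()) // prints "blue 2"
}
```

Option constructors almost always look like WithLabel: the argument is fixed at construction time and only read afterwards.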
5. WithX call site — escape analysis walkthrough¶
Take a small program and run -gcflags="-m -m" on it:
// example.go
package main
import "time"
type Server struct {
addr string
readTimeout time.Duration
}
type Option func(*Server)
func WithReadTimeout(d time.Duration) Option {
return func(s *Server) { s.readTimeout = d }
}
func NewServer(addr string, opts ...Option) *Server {
s := &Server{addr: addr, readTimeout: 30 * time.Second}
for _, opt := range opts {
opt(s)
}
return s
}
func main() {
s := NewServer(":8080", WithReadTimeout(5*time.Second))
_ = s
}
Compile with full escape annotations:
$ go build -gcflags="-m -m" example.go 2>&1 | grep -E "escape|inline|allocate"
./example.go:12:6: can inline WithReadTimeout
./example.go:13:9: can inline WithReadTimeout.func1
./example.go:16:6: cannot inline NewServer: function too complex: cost 99 exceeds budget 80
./example.go:24:6: can inline main
./example.go:13:9: func literal escapes to heap
./example.go:16:35: opts does not escape
./example.go:17:7: &Server{...} escapes to heap
./example.go:25:25: ... argument does not escape
./example.go:25:42: time.Duration(5e9) does not escape
Reading the key lines:
- func literal escapes to heap: the closure returned by WithReadTimeout heap-allocates. This is mandatory here: a returned closure can't live on the callee's stack.
- opts does not escape: the variadic opts ...Option is a slice that is consumed inside NewServer and never stored. Its backing array can live on main's stack.
- &Server{...} escapes to heap: even though s is constructed locally and only the pointer is returned, the address escapes (it's returned to main), so the Server lives on the heap.
- cannot inline NewServer: function too complex: the loop pushes NewServer over the inlining budget. This is the source of the next two limitations: the Server allocation can't be folded into main's frame, and the loop body's indirect call can't be devirtualised.
The annotated counts you get with -gcflags="-m -m" are the compiler's opinions. They reflect the heuristics in src/cmd/compile/internal/inline/inl.go and src/cmd/compile/internal/escape/escape.go. The numbers shift between Go versions.
5.1 What forces the closure to escape¶
A closure escapes to the heap when:
- It is returned from its enclosing function, OR
- It is stored in a heap-resident location (a struct field, a global, an interface), OR
- It is passed to a function that the escape analyser cannot prove keeps it bounded.
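For WithReadTimeout, condition 1 is decisive: analysed on its own, the function returns the closure, so the closure must outlive WithReadTimeout's frame. When WithReadTimeout is inlined into a caller, escape analysis looks at the inlined copy as well, and that copy may be allowed to stay on the caller's stack if the closure never escapes there. Read the -m line for each call site rather than assuming one answer for every use.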
5.2 What stops the variadic slice from escaping¶
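The opts ...Option parameter is shorthand for opts []Option. When the caller writes NewServer(":8080", WithReadTimeout(5*time.Second)), the compiler synthesises a small slice at the call site.
It looks roughly like a one-element array __tmp that holds the option and is passed to NewServer as __tmp[:].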
The backing array __tmp is a local in main's frame. If NewServer doesn't keep a reference to opts past the call (it doesn't — the loop reads it and discards it), the array stays on main's stack. The compiler proves this by tracing the uses of opts inside NewServer:
- for _, opt := range opts { opt(s) }: read-only iteration. The element values (function pointers) are copied to local opt. No backing-array reference leaks.
- No s.options = opts, no append(s.options, opts...), etc.
If you wrote s.options = opts inside NewServer, the variadic slice would escape and main would allocate the array on the heap.
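The desugaring can be written out by hand to make the "backing array in the caller's frame" idea concrete. The sketch below is an approximation of what the compiler synthesises, not its literal output; viaVariadic and viaExplicitArray are hypothetical helper names, and tmp stands in for the compiler's hidden temporary.

```go
package main

import (
	"fmt"
	"time"
)

type Server struct {
	addr        string
	readTimeout time.Duration
}

type Option func(*Server)

func WithReadTimeout(d time.Duration) Option {
	return func(s *Server) { s.readTimeout = d }
}

func NewServer(addr string, opts ...Option) *Server {
	s := &Server{addr: addr, readTimeout: 30 * time.Second}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// viaVariadic lets the compiler build the argument slice.
func viaVariadic() *Server {
	return NewServer(":8080", WithReadTimeout(5*time.Second))
}

// viaExplicitArray spells out roughly the same thing by hand: a fixed-size
// array local to this frame, sliced and spread into the variadic parameter.
func viaExplicitArray() *Server {
	tmp := [1]Option{WithReadTimeout(5 * time.Second)}
	return NewServer(":8080", tmp[:]...)
}

func main() {
	a, b := viaVariadic(), viaExplicitArray()
	fmt.Println(a.readTimeout == b.readTimeout) // prints "true"
}
```

Both forms hand NewServer a slice of length 1; building either with -gcflags="-m" shows whether the array stays on the stack in your build.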
6. The apply loop in assembly¶
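Compile the same example.go and disassemble NewServer, for example with go build -gcflags=-S example.go, or with go tool objdump -s NewServer on the built binary.
The interesting region (cleaned up and simplified; specific PCs omitted; comments added; exact registers, offsets, and symbol prefixes differ in real output):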
"".NewServer STEXT size=176 args=0x30 locals=0x28
SUBQ $40, SP
MOVQ BP, 32(SP)
LEAQ 32(SP), BP
; --- allocate *Server ---
LEAQ type:srv.Server(SB), AX ; type descriptor in AX
CALL runtime.newobject(SB) ; AX <- *Server
MOVQ AX, "".s+24(SP) ; save *Server in local
; --- write addr field (string is {data,len}) ---
MOVQ "".addr+48(SP), CX ; addr.data
MOVQ "".addr+56(SP), DX ; addr.len
MOVQ CX, (AX)
MOVQ DX, 8(AX)
; --- write readTimeout = 30s (constant 30e9) ---
MOVQ $30000000000, 16(AX)
; --- range over opts ---
MOVQ "".opts+64(SP), BX ; opts.data (pointer to the backing array of Option values)
MOVQ "".opts+72(SP), CX ; opts.len
XORL SI, SI ; i = 0
loop:
CMPQ SI, CX
JGE done
MOVQ (BX)(SI*8), DX ; DX = opts[i] (funcval pointer)
; DX is the closure register on amd64
MOVQ "".s+24(SP), AX ; AX = &Server (argument)
MOVQ (DX), R12 ; R12 = funcval.fn (entry PC)
CALL R12 ; indirect call
; callee reads captures from [DX+8...]
INCQ SI
JMP loop
done:
MOVQ "".s+24(SP), AX
MOVQ AX, "".~r0+80(SP) ; return *Server
MOVQ 32(SP), BP
ADDQ $40, SP
RET
The four lines that are the pattern:
MOVQ (BX)(SI*8), DX ; load funcval pointer from slice
MOVQ "".s+24(SP), AX ; load *Server argument
MOVQ (DX), R12 ; dereference funcval to get entry PC
CALL R12 ; indirect call
A few things to notice:
- DX is the closure register. The Go calling convention reserves DX to point at the funcval for closure calls. The callee's prologue reads captures using [DX+8], [DX+16], etc.; without DX, the callee has no way to find its captures.
- The CALL is indirect. CALL R12 is fundamentally different from a direct CALL "".someFunc(SB). The branch predictor needs an Indirect Branch Target Buffer (IBTB) entry per call site to predict the target. For a single hot call site that always calls the same closure (e.g., a constant Combine option in a tight loop), prediction is perfect after the first miss. For an apply-loop that runs once per constructor call, there's a small cold cost.
- Little loop overhead. The listing keeps the index and length in SI and CX for readability. Go's amd64 ABI has no callee-saved general-purpose registers, so in real output any value that is live across the CALL is reloaded from the stack afterwards, as the listing already does for s. Either way the loop framework is a handful of instructions plus the call; the cost that matters is the work inside opt(s).
- No bounds check on opts[i]. Go's bounds-check-elimination pass proves i < len(opts) from the loop guard CMPQ SI, CX / JGE done, so no runtime.panicIndex thunk appears.
The whole apply-loop, per iteration: ~6 instructions + the body of the closure + return. On a modern x86, that's roughly 3-5 ns per option not counting the closure body itself.
7. Why options don't inline¶
The Go inliner is conservative. It will inline a function call when:
- The callee's body cost (a heuristic in src/cmd/compile/internal/inline/inl.go) is below the budget (~80 nodes).
- The call is direct, i.e., the callee is statically known at compile time.
For the apply-loop, the second condition is the one that fails:
opt is a function value loaded from a slice element. The compiler doesn't know at this site whether opt was constructed from WithLogger or WithTimeout or something else entirely. It cannot inline the body of opt because it doesn't know which body to inline. So the call survives as an indirect CALL R12.
Even with profile-guided optimization (PGO, Go 1.21+), the most the compiler can do is devirtualise in some cases — recognise that the indirect call site is almost always one specific target and replace it with if opt == knownTarget { directCall() } else { indirectCall() }. PGO devirtualisation is implemented for interface calls; for closure-typed call sites it is more limited because the closure environment is unique per allocation.
The body of WithReadTimeout's lambda is two SSA ops (MOVQ d, (s.readTimeout)). It would be a textbook inlining candidate if the call were direct. Because of the slice, it isn't.
7.1 What you can inline¶
The WithX constructors themselves can inline, because they are direct calls.
When main calls WithReadTimeout(5*time.Second), the body of WithReadTimeout (return a closure capturing d) gets inlined into main. That doesn't eliminate the closure allocation — the closure itself must still escape — but it removes the WithReadTimeout stack frame.
The lambda WithReadTimeout.func1 is also "can inline", which sounds promising. The annotation means if a direct call to func1 ever appears, it can be inlined. The reality at the call site (opt(s) through a slice) is indirect, so the inline never fires.
7.2 The lifted cost¶
The net of "no inlining" for the apply-loop body is:
- One indirect call per option (~3-5 ns).
- One closure-environment load per option (the captured variable, e.g., 8 bytes of *log.Logger).
- One field store per option (in the body of the lambda).
For 5 options this is ~25 ns, plus the closure allocations done at the call site (~16-32 bytes each, on the order of ~10 ns each in runtime.newobject). In a constructor called once per Server, totally invisible. In a per-request constructor, you start paying attention.
8. The slice of options in memory¶
opts ...Option is a slice. A Go slice header is three words (data, len, cap) — 24 bytes on amd64:
opts (24 bytes on the stack)
┌────────────┬────────────┬────────────┐
│ data ptr │ len (int) │ cap (int) │ ← slice header
└────────────┴────────────┴────────────┘
│
▼
backing array [N]Option (each element 8 bytes — a funcval pointer)
┌──────────┬──────────┬──────────┬──────────┐
│ funcval* │ funcval* │ funcval* │ funcval* │
└──────────┴──────────┴──────────┴──────────┘
│ │ │ │
▼ ▼ ▼ ▼
funcval funcval funcval funcval
┌────┐ ┌────┐ ┌────┐ ┌────┐
│ fn │ │ fn │ │ fn │ │ fn │
│cap0│ │cap0│ │cap0│ │cap0│
│... │ │... │ │... │ │... │
└────┘ └────┘ └────┘ └────┘
Each Option slot in the array is 8 bytes (one funcval pointer). The funcvals themselves are usually heap-allocated and live elsewhere; the slice points at them indirectly.
For NewServer(":8080", WithA(), WithB(), WithC()):
- 24 bytes of slice header (on the caller's stack — doesn't escape).
- 24 bytes of backing array (3 × 8-byte funcval pointers).
- 3 × (sizeof(funcval) + sizeof(captures)) bytes of closure environments on the heap. For an option capturing a single time.Duration, that's sizeof(funcval) + 8 = 16 bytes per closure, which already fits the 16-byte GC size class.
Total heap allocation for three single-capture options: 3 × 16 = 48 bytes plus three GC-tracked objects. The slice itself is stack-resident.
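The word counts behind this accounting can be checked with unsafe.Sizeof. A minimal sketch follows; headerWords, elemWords and backingBytes are hypothetical helpers, and the byte figures quoted in the text assume a 64-bit platform where one word is 8 bytes.

```go
package main

import (
	"fmt"
	"unsafe"
)

type Server struct{}

type Option func(*Server)

// word is the size of one machine word on the platform being compiled for.
const word = unsafe.Sizeof(uintptr(0))

// headerWords is the size of a []Option slice header, in words.
func headerWords() uintptr {
	var opts []Option
	return unsafe.Sizeof(opts) / word
}

// elemWords is the size of one Option slot in the backing array, in words.
func elemWords() uintptr {
	var o Option
	return unsafe.Sizeof(o) / word
}

// backingBytes is the size of a backing array holding n options.
func backingBytes(n int) uintptr {
	var o Option
	return uintptr(n) * unsafe.Sizeof(o)
}

func main() {
	fmt.Println("slice header words:", headerWords())
	fmt.Println("words per element: ", elemWords())
	fmt.Println("backing array bytes for 3 options:", backingBytes(3))
}
```

The sizes of the closure environments themselves are not visible this way, because unsafe.Sizeof only sees the one-word function value, not the funcval it points at.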
If you build the slice imperatively at the caller:
opts := make([]Option, 0, 3)
opts = append(opts, WithA())
opts = append(opts, WithB())
opts = append(opts, WithC())
NewServer(":8080", opts...)
…then make([]Option, 0, 3) will probably stack-allocate the backing array of 3 × 8 = 24 bytes if the escape analyser can prove opts doesn't escape NewServer. It usually can: the opts... spread is passed directly, NewServer doesn't retain a reference. Confirm with -gcflags="-m".
If you store the slice for reuse, for example in a package-level variable prodOpts that is initialised once with the three options, then prodOpts's backing array lives in the BSS or in a heap-allocated initialised data block. The three funcvals are heap-resident and live for the program lifetime. No per-call allocation for the options, just the call itself.
9. Interface variant under the hood¶
The interface variant in middle §3.2:
type Option interface{ apply(*Server) }
type loggerOption struct{ l *log.Logger }
func (o loggerOption) apply(s *Server) { s.logger = o.l }
func WithLogger(l *log.Logger) Option { return loggerOption{l: l} }
An interface value in Go is two words (16 bytes on amd64). The first word is a pointer to the itab (interface table) for the (concrete type, interface) pair; the second word is the data pointer.
Option (interface, 16 bytes)
┌────────────────┬────────────────┐
│ itab ptr │ data ptr │
└────────────────┴────────────────┘
│ │
▼ ▼
itab for loggerOption{l: ...}
(loggerOption, ┌────────┐
Option) │ l │ ← *log.Logger
└────────┘
The itab itself is described in src/runtime/runtime2.go:
// runtime2.go
type itab struct {
inter *interfacetype
_type *_type
hash uint32
_ [4]byte
fun [1]uintptr // variable sized; method PCs for the interface methods
}
itab.fun[0] is the entry PC for apply (and so on for any further interface methods). Itabs are cached in a hash table (runtime.itabTable in iface.go), so every Option built from loggerOption shares a single itab rather than getting a new one per value.
The apply loop becomes:
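; for _, opt := range opts { opt.apply(s) }
; (schematic: register choices are illustrative, and x86 addressing has no
;  *16 scale, so real output shifts the index or steps a pointer by 16)
loop:
CMPQ SI, CX
JGE done
MOVQ (BX)(SI*16), DX ; DX = opts[i].itab (16-byte stride!)
MOVQ 8(BX)(SI*16), AX ; AX = opts[i].data (the receiver)
MOVQ "".s+24(SP), DI ; s, moved into its argument register (shown as DI here)
MOVQ 24(DX), R12 ; R12 = itab.fun[0] (offset 24 in the itab layout above)
CALL R12
INCQ SI
JMP loop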
Two differences from the function variant:
- 16-byte stride through the slice. Each element is two words, not one.
- Two loads to set up the call — the itab pointer and the data pointer. The function variant does one load (the funcval pointer) plus one dereference (to get the entry PC). Net effect: about one extra load per iteration.
Both variants do an indirect CALL and both pay the same prediction cost on the branch predictor. The interface variant pays ~30% more per option in practice (matches the middle §12 benchmark numbers), but the absolute difference is ~2 ns per option on amd64. Not a reason to choose between variants.
9.1 itab caching¶
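For a conversion whose concrete type is known at compile time, which is the case for return loggerOption{l: l} in WithLogger, the compiler emits the itab as static data and the runtime only registers it in itabTable at startup. The lazy path, in which the runtime walks the method set of the concrete type, finds apply, builds an itab, and caches it, is taken for conversions resolved at run time, such as type assertions. Either way it is a one-time cost; subsequent assignments reuse the same itab.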
For value types boxed into an interface, the compiler emits a call to runtime.convT (or one of its specialised variants), which allocates a heap copy of the value and stores a pointer to it in the interface's data word. When the dynamic value is already a pointer, the box is just that pointer and no runtime call is needed.
If you change func (o loggerOption) apply(...) to func (o *loggerOption) apply(...) and return &loggerOption{...}, the boxing becomes "stick the pointer in the data word": no runtime.convT call and no boxed copy separate from the underlying object, although the &loggerOption{...} itself still escapes and is allocated once. There is no closure in the interface variant, so in either form the cost is at most one heap allocation per option, the same as the function variant. One caveat for the example above: a struct whose only field is a pointer is itself pointer-shaped, so the gc compiler can store loggerOption directly in the data word without copying; the box-copy matters for options that carry more than one word.
That's why some libraries' interface-variant options use pointer receivers — to avoid the box-copy.
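A pointer-receiver form of the interface variant, together with a check of the two-word interface size, could be sketched as follows. The type and function names mirror the ones used in this section; ifaceWords and attachWorks are hypothetical helpers added for the illustration, and the allocation behaviour should be confirmed with -gcflags="-m" or a benchmark rather than taken from this sketch.

```go
package main

import (
	"fmt"
	"log"
	"unsafe"
)

type Server struct{ logger *log.Logger }

type Option interface{ apply(*Server) }

type loggerOption struct{ l *log.Logger }

// apply has a pointer receiver, so the interface's data word can hold the
// *loggerOption directly.
func (o *loggerOption) apply(s *Server) { s.logger = o.l }

func WithLogger(l *log.Logger) Option { return &loggerOption{l: l} }

func NewServer(opts ...Option) *Server {
	s := &Server{}
	for _, o := range opts {
		o.apply(s)
	}
	return s
}

// ifaceWords reports the size of an Option interface value, in machine words.
func ifaceWords() uintptr {
	var o Option
	return unsafe.Sizeof(o) / unsafe.Sizeof(uintptr(0))
}

// attachWorks checks that the option actually configures the server.
func attachWorks() bool {
	l := log.Default()
	return NewServer(WithLogger(l)).logger == l
}

func main() {
	fmt.Println("words per interface Option:", ifaceWords())
	fmt.Println("logger attached:", attachWorks())
}
```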
10. Escape analysis of NewServer chains¶
A pattern people ask about is the chained call NewServer(":8080", WithLogger(l)).Start().
Does the Server allocate on the heap or the stack?
Take a stripped-down version:
// main.go
package main
import "log"
type Server struct{ logger *log.Logger; running bool }
type Option func(*Server)
func WithLogger(l *log.Logger) Option { return func(s *Server) { s.logger = l } }
func NewServer(opts ...Option) *Server {
s := &Server{}
for _, o := range opts { o(s) }
return s
}
func (s *Server) Start() *Server { s.running = true; return s }
func main() {
s := NewServer(WithLogger(log.Default())).Start()
_ = s
}
$ go build -gcflags="-m -m" main.go 2>&1 | grep -E "escape|inline"
./main.go:9:6: can inline WithLogger
./main.go:9:35: can inline WithLogger.func1
./main.go:11:6: cannot inline NewServer: function too complex
./main.go:17:6: can inline (*Server).Start
./main.go:19:6: can inline main
./main.go:9:35: func literal escapes to heap
./main.go:11:21: opts does not escape
./main.go:12:7: &Server{} escapes to heap
./main.go:20:51: log.Default() does not escape
./main.go:20:38: ... argument does not escape
&Server{} escapes to heap even though the only use after Start() is _ = s (immediately discarded). The reason is again NewServer's inlining failure: because NewServer doesn't inline into main, the &Server{} allocation site is inside a non-inlined function that returns the pointer. Returning a pointer is enough to force escape.
If you manually inline the body of NewServer into main (or hand-write the construction), the escape analyser sees the full lifetime and can stack-allocate s. But the inliner won't do it for you because of the for-loop cost.
10.1 What if there are zero options?¶
./main.go:5:6: can inline NewServer
./main.go:9:18: inlining call to NewServer
./main.go:5:9: &Server{} escapes to heap
NewServer inlines, but the &Server{} still escapes. The reason isn't the variadic anymore: Start() is a method call on the returned pointer, and Start's receiver parameter is treated conservatively unless the inliner can also fold Start. Here it can.
Both inlined, yet escape is still reported. The Go escape analyser is bounded — it doesn't always recognise that a fully-inlined chain produces a value that lives only in the current frame. This is a known limitation; some incremental progress has been made over Go 1.20/1.21/1.22.
Empirically, with Go 1.22, the chain in main still produces a heap allocation for the Server. With explicit hoisting, where main constructs the Server itself and applies the options to it, the allocation can become stack-resident. So the pattern doesn't optimise as aggressively as the equivalent C++ code where return-value optimization elides everything; you live with one allocation per server, period.
11. GOSSAFUNC walkthrough¶
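For a complete view of how the compiler transforms the apply-loop, dump SSA by building with the GOSSAFUNC environment variable set to the function name (GOSSAFUNC=NewServer go build); the compiler writes the result to ssa.html.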
The HTML contains every SSA pass, from the AST translation to final lowering. The passes most relevant to options:
| Pass | What it does | What you see |
|---|---|---|
| start | AST → SSA | The naive form: explicit range, explicit slice indexing |
| escape analysis (front end, before SSA) | Decide stack vs heap for each allocation | Server and closure marked "to heap" |
| inline calls (front end, before SSA) | Inline small callees | WithLogger.func1 not inlined (indirect target) |
| decompose builtin | Split slice/string/interface values into their words | opts becomes (opts.ptr, opts.len, opts.cap) triples |
| prove | Bounds-check elimination | The i < len(opts) proof discharges the check on opts[i] |
| lower | SSA → architecture-specific ops | The CALL becomes a CALLclosure op |
| regalloc | Assign physical registers | DX is forced as the closure register |
The most illuminating pass is decompose builtin (decompose user does the same job for user-defined structs and arrays). Before it, the loop reads opts[i] as one opaque "Option" value. After it, the load is explicit: the slice header has been broken into three separate values, the indexed read is loadOption(opts.ptr, i), and the closure call's environment register (DX) is explicit in the SSA.
You can also trace by passing -gcflags="-S -d=ssa/lower/dump" for a text dump of the lowered SSA. For the apply-loop, the lowered SSA reads almost identical to the asm in §6.
12. Allocation count, byte-by-byte¶
Bench harness:
// bench_options_test.go
package srv
import (
"log"
"testing"
"time"
)
var sink *Server
func BenchmarkNoOpts(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
sink = NewServer(":8080")
}
}
func BenchmarkOneOpt(b *testing.B) {
l := log.Default()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
sink = NewServer(":8080", WithLogger(l))
}
}
func BenchmarkFiveOpts(b *testing.B) {
l := log.Default()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
sink = NewServer(":8080",
WithReadTimeout(5*time.Second),
WithWriteTimeout(5*time.Second),
WithLogger(l),
WithMaxConns(1000),
WithDebug(),
)
}
}
func BenchmarkFiveOptsReused(b *testing.B) {
l := log.Default()
opts := []Option{
WithReadTimeout(5*time.Second),
WithWriteTimeout(5*time.Second),
WithLogger(l),
WithMaxConns(1000),
WithDebug(),
}
b.ReportAllocs()
for i := 0; i < b.N; i++ {
sink = NewServer(":8080", opts...)
}
}
Sample results (Go 1.22, amd64, M2 Mac, GOMAXPROCS=1):
BenchmarkNoOpts-1 120000000 10.3 ns/op 48 B/op 1 allocs/op
BenchmarkOneOpt-1 40000000 32.1 ns/op 72 B/op 2 allocs/op
BenchmarkFiveOpts-1 10000000 125.0 ns/op 192 B/op 6 allocs/op
BenchmarkFiveOptsReused-1 50000000 28.4 ns/op 48 B/op 1 allocs/op
Reading the numbers:
- NoOpts: 1 alloc, 48 B. That's the Server itself on the heap. Nothing else allocates.
- OneOpt: 2 allocs, 72 B = 48 (Server) + 24 (the closure, as reported). §8 puts a single-pointer capture at 16 bytes, so treat the per-closure split as approximate.
- FiveOpts: 6 allocs, 192 B = 48 (Server) + 5 × ~24-32 (closures). The exact size depends on each closure's capture set (WithLogger captures 8 bytes; WithDebug() captures nothing, the case §12.1 looks at).
- FiveOptsReused: 1 alloc, the Server. The closures live in the opts slice that was built once before the benchmark loop. Inside the loop, only the Server allocates.
This is the big lever for performance-sensitive options usage: build the options slice once, reuse it. The constructor itself is unchanged. Same call site (NewServer(addr, opts...)), 6 allocations down to 1.
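Outside a benchmark, the same comparison can be made with testing.AllocsPerRun, which reports the average number of allocations per call of a function. The sketch below is a reduced two-option version of the harness above; allocsFresh and allocsReused are hypothetical helper names, and the absolute counts depend on compiler version and inlining decisions, so only the relation between the two is worth relying on.

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

type Server struct {
	addr        string
	readTimeout time.Duration
	maxConns    int
}

type Option func(*Server)

func WithReadTimeout(d time.Duration) Option {
	return func(s *Server) { s.readTimeout = d }
}

func WithMaxConns(n int) Option {
	return func(s *Server) { s.maxConns = n }
}

func NewServer(addr string, opts ...Option) *Server {
	s := &Server{addr: addr}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// sink keeps the result alive so the Server cannot be optimised away.
var sink *Server

// allocsFresh constructs the options at every call.
func allocsFresh() float64 {
	return testing.AllocsPerRun(1000, func() {
		sink = NewServer(":8080", WithReadTimeout(5*time.Second), WithMaxConns(1000))
	})
}

// allocsReused builds the option slice once and spreads it at every call.
func allocsReused() float64 {
	opts := []Option{WithReadTimeout(5 * time.Second), WithMaxConns(1000)}
	return testing.AllocsPerRun(1000, func() {
		sink = NewServer(":8080", opts...)
	})
}

func main() {
	fmt.Println("allocs/op, fresh options: ", allocsFresh())
	fmt.Println("allocs/op, reused options:", allocsReused())
}
```

The reused form should never allocate more than the fresh form, and the Server stored in sink accounts for at least one allocation in both.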
12.1 What an empty-capture closure costs¶
Consider a WithDebug() that returns func(s *Server) { s.debug = true }. The closure captures nothing. Does it still allocate?
With the gc toolchain, normally not. A func literal with no captures has no environment to build, so the compiler emits a single static funcval (just the fn word) in read-only data, and every call to WithDebug returns a pointer to that same funcval. No runtime.newobject call is involved.
The -m output may still print "func literal escapes to heap" for such a literal, so check allocs/op rather than the diagnostic when it matters.
If you want the sharing to be explicit in the source, you can write:
var debugOpt Option = func(s *Server) { s.debug = true }
func WithDebug() Option { return debugOpt }
…and now WithDebug visibly returns the same precomputed function value every time. This is rarely worth the readability hit, but the technique exists.
13. Cross-language comparison at the machine level¶
How does Go's functional-options pattern compare to neighbours' equivalent constructs at the codegen level?
13.1 C++ default arguments¶
Server make_server(std::string addr,
std::chrono::milliseconds read_timeout = 30s,
Logger* logger = nullptr);
The compiler stores default values in the caller's code, not the callee's. Each call site is compiled with the defaults inlined. Effectively: zero runtime cost for "unused" defaults; one breaking-change vector (changing a default requires recompiling every caller; changing the parameter list breaks the ABI).
Codegen at the call site looks like a normal direct call with the defaults synthesised inline. No closures, no indirect calls.
13.2 Java builders¶
Each .addr(...) returns the builder (this), so the chain is a sequence of direct virtual method calls. In hot code, JIT-inlines them all, eliding the intermediate this returns. The final .build() constructs the object.
Cost: each builder method is a virtual call (resolved through the vtable), but the JIT specialises on the receiver type after warmup and inlines. After warmup, costs comparable to Go's function variant — possibly faster because the JIT can fully fold the chain into a single allocation.
Cold start: noticeably slower because the JIT hasn't warmed up.
13.3 Rust struct update syntax¶
Constructed in one expression. The compiler emits the struct on the caller's stack (or wherever the binding lives), populates the fields directly. Zero closures, zero allocations, zero function calls.
The cost is that the struct's fields are exposed in the type (and need to be pub), so you have the same "API surface = struct shape" problem as a Go config struct. Rust mitigates with the #[non_exhaustive] attribute on the struct,
which prevents external crates from constructing the struct with a struct literal; they must go through a constructor. So Rust's approximate equivalent to "functional options" is "builder pattern + #[non_exhaustive]", with slightly cleaner ergonomics than Java.
13.4 Summary¶
| Language | Cost per "option" | Allocations per option | When the cost is paid |
|---|---|---|---|
| Go (function variant) | ~5 ns + 1 alloc | 1 (closure) | At option construction |
| Go (interface variant) | ~7 ns + up to 1 alloc | 0-1 (boxed value) | At option construction |
| C++ default args | 0 (literal substitution) | 0 | Compile time |
| Java builder, post-JIT | ~1 ns | 0 (JIT-folded) | After warmup |
| Java builder, cold | ~10 ns | 1 (builder object) | Always |
| Rust struct update | 0 | 0 | Compile time |
Go pays the most per option, but its options are first-class values: passable, storable, conditional, composable. C++ default args and Rust struct update syntax cannot be passed around. Java builders can but allocate the builder. The tradeoff is paid in the right currency.
14. Edge cases at the lowest level¶
14.1 Variadic with a single option vs spread¶
Calling NewServer(addr, opt) and calling NewServer(addr, []Option{opt}...) both produce a []Option of length 1. The first form is preferred because the compiler synthesises the slice with the smallest possible cost. The second form is explicit: the caller builds the slice. If that slice doesn't escape (e.g., it's a slice literal used only for the call), it stack-allocates. If you write opts := []Option{...} and then pass opts..., the slice header is on the caller's stack regardless.
The interesting case: 0 options, as in NewServer(addr).
There is no slice. The variadic parameter is set to nil. Inside NewServer, len(opts) == 0, range opts iterates zero times, no allocations for the variadic.
By contrast, NewServer(addr, nil) passes one option of value nil. The apply loop's opt(s) will panic. Be careful with conditionals that can leave an Option variable nil.
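Two common ways to keep a nil option from reaching opt(s) are sketched below: skip nil entries inside the constructor, or build the slice conditionally so a nil is never passed. Both are design choices rather than something the pattern prescribes; buildOpts and the nil check in this NewServer are assumptions made for the illustration.

```go
package main

import "fmt"

type Server struct{ debug bool }

type Option func(*Server)

func WithDebug() Option {
	return func(s *Server) { s.debug = true }
}

// NewServer tolerates nil options instead of panicking on opt(s).
func NewServer(opts ...Option) *Server {
	s := &Server{}
	for _, opt := range opts {
		if opt == nil {
			continue // a nil func value cannot be called
		}
		opt(s)
	}
	return s
}

// buildOpts appends only the options that apply, so callers never pass a
// possibly-nil Option variable.
func buildOpts(debug bool) []Option {
	var opts []Option
	if debug {
		opts = append(opts, WithDebug())
	}
	return opts
}

func main() {
	var none Option // nil: declared but never assigned

	fmt.Println(NewServer(none).debug)               // prints "false"
	fmt.Println(NewServer(buildOpts(true)...).debug) // prints "true"
	fmt.Println(len(buildOpts(false)))               // prints "0"
}
```

Skipping nil silently can hide caller bugs, so some constructors prefer to let the panic happen or to return an error; the conditional-append form avoids the question entirely.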
14.2 runtime.newobject vs stack allocation for closures¶
The closure escapes when it outlives the constructor. The compiler emits one of:
// On heap (the usual case for options)
&funcval{fn: pc, capture0: d} // really: runtime.newobject + initialise
// On stack (rare for options)
funcval{fn: pc, capture0: d} // initialised in the caller's frame, no runtime call
For functional options, the heap version is the only one you ever see — because the option's purpose is to outlive the WithX call.
The exception is when the option is consumed in the same function and the escape analyser can prove it doesn't outlive:
func ConfigureLocally() {
o := WithLogger(myLogger) // closure could in principle stack-allocate
o(&localServer) // ... if Go could see o is only used here
}
In practice, even this stack-allocates inconsistently. The compiler is conservative; closures usually go to the heap. Not a hot path for the pattern.
14.3 Stack-grow during option application¶
The apply loop calls into option closures. Each call may grow the goroutine's stack if the closure's body deeply nests. The runtime's stack-grow path is runtime.morestack in src/runtime/asm_amd64.s. On entry, each closure's prologue compares the stack pointer against g.stackguard0 and, if there isn't enough room, calls runtime.morestack (the variant that preserves the closure context in DX; functions without a closure context use runtime.morestack_noctxt), which expands the stack by copying the existing frames to a larger allocation.
For functional options, this is a non-issue: the closure bodies are short. But if you have an option that does heavy work, that work runs on the constructor's goroutine and respects the goroutine's stack.
A subtle implication: an option closure that captures a stack-sensitive pointer doesn't need special handling. When the stack is copied, the runtime adjusts every live pointer that referred to the old stack.
14.4 The "method value" version of options¶
You can use a method value as an option:
type Logger struct{ /* ... */ }
func (l *Logger) Attach(s *Server) { s.logger = l }
l := &Logger{}
NewServer(":8080", l.Attach) // l.Attach is an Option
l.Attach is a method value: a funcval capturing the receiver l. It costs one allocation (the method value's funcval, 16 bytes including the captured receiver pointer). This is shorter than writing WithLogger(l) and costs the same: one funcval holding an entry PC and one captured pointer, with the method body typically inlined into the compiler-generated wrapper. Whether to expose method-value options is a style call: it skips the WithX ceremony at the cost of leaking the method-set shape into the API.
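The two spellings can be compared side by side. This is a minimal sketch with its own Server, Option and Logger types (the prefix field is an assumption added so that Logger is not empty); it only demonstrates that both forms are accepted as an Option and have the same effect, not their allocation behaviour.

```go
package main

import "fmt"

// Minimal stand-ins for the types used in this file.
type Server struct {
	addr   string
	logger *Logger
}

type Option func(*Server)

type Logger struct{ prefix string }

// Attach, once bound to a receiver, has the signature func(*Server),
// so the method value l.Attach is assignable to Option.
func (l *Logger) Attach(s *Server) { s.logger = l }

// WithLogger is the conventional constructor, shown for comparison.
func WithLogger(l *Logger) Option {
	return func(s *Server) { s.logger = l }
}

func NewServer(addr string, opts ...Option) *Server {
	s := &Server{addr: addr}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	l := &Logger{prefix: "srv"}
	a := NewServer(":8080", l.Attach)      // method value as option
	b := NewServer(":8080", WithLogger(l)) // closure as option
	fmt.Println(a.logger == l, b.logger == l) // prints: true true
}
```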
14.5 GC barriers when closures hold pointers¶
The closure environment for WithLogger(l) holds a *log.Logger. The Go GC scans these along with any other heap-resident pointers; there's no special exception for closure captures. From the GC's perspective, a funcval is just a heap object whose pointer layout is known from the type descriptor used to allocate it (so the GC knows which words are pointers).
When the runtime constructs the closure (runtime.newobject path), it allocates the object with the type descriptor produced by the compiler for the closure. The descriptor's pointer bitmap is set so the GC knows position 1 (after the fn word) is a *log.Logger and needs to be scanned. Position 0 (fn) holds a code address, but the compiler declares that field as uintptr in the closure's struct layout, so the bitmap marks it as a non-pointer and the GC does not scan it.
So options participate normally in the write barrier and GC mark phase. No surprises.
15. Test¶
Internal knowledge questions¶
1. What is the size in bytes of a single Option value on amd64?
Answer
8 bytes: a `func(...)` value is a single pointer to a `funcval`. The function variant's `Option` is one word. The interface variant's `Option` is two words (16 bytes): itab pointer + data pointer.
2. Why does &Server{} inside NewServer allocate on the heap even when the result is immediately discarded?
Answer
The escape analyser considers each function independently. `NewServer` returns a `*Server`, so the address escapes the function's frame. Since `NewServer` doesn't inline (too complex due to the loop), the caller can't see that the value is unused, and the conservative result is heap allocation. Manually inlining the body or removing the for-loop is the only way to suppress this.
3. The apply loop produces CALL R12. Why R12 specifically?
Answer
The compiler's amd64 calling convention reserves DX as the closure register: the pointer to the funcval is passed in DX so the callee can read its captures via `[DX+8]`, `[DX+16]`. The entry PC is loaded from `(DX)` into a scratch register (commonly R12 under the Go 1.17+ register-based calling convention). The choice of R12 is from the regalloc pass; what matters is that DX holds the funcval pointer at the moment of CALL.
4. What's the assembly difference between calling a normal func() and a closure?
Answer
A normal function call is `CALL "".funcName(SB)`, a direct call with a static target. A closure call requires (a) loading the funcval pointer into DX, (b) loading the entry PC from `(DX)`, (c) calling the entry PC indirectly. The callee's prologue then reads captures off DX-relative offsets. The cost difference is one extra load and one indirect-call penalty in the branch predictor.
5. Why does WithDebug() (no arguments, no captures) still heap-allocate?
Answer
With current compilers it usually does not, so check before assuming. A func literal with no captures needs no environment: the compiler emits one static funcval in read-only data and `WithDebug` returns its address without calling the allocator. Only a closure that captures at least one variable needs a fresh funcval per call. Storing the capture-less closure in a package-level variable and returning it directly, `var debugOpt Option = func(s *Server) {...}; func WithDebug() Option { return debugOpt }`, makes the sharing explicit and likewise allocates nothing per call. Confirm with `go build -gcflags=-m` or `testing.AllocsPerRun`.
6. Reading the assembly, how can you tell if the slice of options is heap-allocated?
Answer
Look at the caller's prologue. If you see `CALL runtime.newobject` (or `runtime.makeslice`) before the `NewServer` call with a `[N]Option` type descriptor in AX, the backing array is on the heap. If you instead see a `LEAQ` of a stack temporary such as `"".__tmpN(SP)`, the backing array lives in the caller's frame.
Test code: count allocations directly¶
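// Requires imports: log, testing, time.
func TestNoExtraAllocs(t *testing.T) {
	l := log.Default()
	// Build the options outside the measured function so that only the
	// Server itself is allocated per run.
	opts := []Option{WithLogger(l), WithReadTimeout(5 * time.Second)}
	allocs := testing.AllocsPerRun(1000, func() {
		_ = NewServer(":8080", opts...)
	})
	if allocs != 1 {
		t.Fatalf("expected 1 alloc (Server only), got %v", allocs)
	}
}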
If this test fails, the slice escaped or one of the options reallocates. Bisect by running with fewer options.
16. Tricky questions¶
1. Why does NewServer(":8080", opt1, opt2) allocate three things but NewServer(":8080", opts...) (where opts is a long-lived slice) allocate only one?
Answer
In the first form, each `optN` is constructed at the call site (`WithX(...)`), which produces a fresh closure on the heap: two allocations. The `Server` is the third. The variadic slice's backing array normally stays on the caller's stack because `NewServer` does not retain it. In the spread form, `opts` was built once: the closures already exist and so does the slice. The spread passes the existing slice header through `NewServer`'s variadic parameter without copying. Only the `Server` itself is freshly allocated.
2. Why does the interface variant of Option (with value receivers) often produce two allocations per option, when the function variant produces only one?
Answer
With value receivers, `WithLogger(l)` returns `Option(loggerOption{l: l})`, which boxes the struct into an interface. The runtime call `runtime.convT` allocates a heap copy of the struct and stores its pointer in the interface's data word. Allocation 1 is `loggerOption{}` boxed. Allocation 2 is the slice's backing array. With pointer receivers (`*loggerOption`), boxing is free (the existing pointer is the data word), and you have only one allocation per option: the `&loggerOption{}`. So the interface variant should use pointer receivers when allocation count matters.
3. Is the apply loop's indirect call slower because of Spectre mitigations?
Answer
Yes, marginally. On x86, indirect branches are subject to BTB poisoning, and the kernel may have enabled IBRS/IBPB depending on `mitigations=` boot params. Each indirect call may pay a few cycles of additional latency vs an unmitigated direct call. On modern CPUs (Ice Lake and later) with eIBRS, the cost is amortized and barely measurable. On older CPUs with retpoline mitigation, indirect calls can be 10× slower. The Go toolchain doesn't insert per-call mitigations by default (retpolines are opt-in through the compiler's `-spectre` flag), so in a default build this is purely hardware/microcode/kernel-level.
4. Why is for _, opt := range opts { opt(s) } not converted into a runtime.duffcopy-like vectored sequence by the compiler?
Answer
Because the loop body involves an indirect call with side effects on `s`, and the compiler cannot prove the calls are independent (they might write to overlapping fields of `s` in order-dependent ways). The compiler is required to preserve call order. There's no SIMD analogue for "apply N different functions to the same target": each is a distinct call with its own prologue and epilogue.
5. Can profile-guided optimisation (PGO) inline option bodies?
Answer
Partially. Go 1.21 added PGO devirtualisation for interface calls: if profiling shows that one call site is almost always dispatching to one specific type, the compiler can emit a check-and-direct-call. For closure-typed call sites (the function variant), PGO devirtualisation is limited because the closure environment varies per call. As of Go 1.22, the most you can hope for is that "hot" closures' bodies get a marginal layout benefit. Don't expect PGO to eliminate the indirect call cost of functional options.
6. If I capture a *time.Time, will the closure escape extend the lifetime of the time.Time?
Answer
The closure holds a pointer to the `time.Time`. As long as the closure (the option) is reachable from a GC root (e.g., stored in a long-lived `[]Option`), the GC will keep the `time.Time` alive. This is normal pointer-reachability. The "trap" is when callers construct an option from a stack-local that they expect to die quickly; the option captures the pointer and the variable gets heap-promoted by the escape analyser. You may see `time.Time escapes to heap` annotations as a result. Usually fine; just be aware that closure captures can promote locals to the heap.
17. Summary¶
- An Option in the function variant is a single pointer (8 bytes on amd64), pointing to a runtime.funcval. The funcval contains the entry PC and any captured values inline.
- WithX(arg) always heap-allocates the funcval, because the returned closure must outlive the WithX call. There is no way to eliminate this in the function variant.
- The apply-loop for _, opt := range opts { opt(s) } compiles to a 6-instruction loop with an indirect CALL through the closure register (DX on amd64). Each iteration is ~3-5 ns plus the closure body.
- Options don't inline because the call is indirect: the compiler doesn't know which closure body is at each slice slot. Profile-guided optimisation can't fully recover this for closure values.
- The Server allocated inside NewServer escapes to the heap because NewServer doesn't inline (the loop pushes it over the budget). Even fully-inlined chains like NewServer(...).Start() typically still produce one heap allocation for the Server in Go 1.22.
- Variadic slices (opts ...Option) often stack-allocate when NewServer doesn't retain a reference. Reusing a pre-built []Option reduces allocations from N+1 to 1 (just the Server).
- The interface variant doubles the size of each Option slot (itab + data, 16 bytes) and may add a second allocation per option for value-receiver method sets. Function variant is ~30% faster per option in practice, but the absolute difference is ~2 ns.
- Cross-language comparison: Go pays the most per option at runtime, but its options are first-class values. C++ default args and Rust struct update syntax pay zero at runtime but cannot be composed or passed around. Java builders match Go's flexibility but rely on JIT to recover the cost.
- The cardinal rule for performance-sensitive options: build the option slice once, reuse it. The closure allocations happen at option-construction time; reuse the option, reuse the closures.
18. Further reading¶
- Go runtime source, runtime.funcval and itab structures: src/runtime/runtime2.go
- Closure conversion in the compiler: src/cmd/compile/internal/walk/closure.go
- Escape analysis: src/cmd/compile/internal/escape/escape.go
- Inliner heuristics: src/cmd/compile/internal/inline/inl.go
- Calling convention (register-based, Go 1.17+): src/cmd/compile/abi-internal.md in the Go source tree
- itab caching: src/runtime/iface.go (getitab, itabTable)
- Profile-guided optimisation: https://go.dev/doc/pgo
- Dave Cheney, "Functional options for friendly APIs" (2014): the canonical justification; doesn't cover internals but is the historical reference
- Related: 02-language-basics/07-pointers/05-unsafe-pointer/professional.md for unsafe.Pointer internals
- Related: 02-language-basics/02-functions/04-closure-internals/professional.md for the closure-conversion deep dive that this file builds on
- Related: middle.md §12 for the higher-level benchmark numbers; this file explains why those numbers look the way they do