Factory Pattern — Under the Hood¶
1. What this level covers¶
The junior, middle, and senior levels taught the use of factories: how to write them, how to design them, how to evolve them. This document is about what the compiler and runtime do when a factory runs.
- The funcval runtime layout of factories stored as values (and how bufio.NewReader ends up in a registry).
- Escape analysis at the factory return boundary: when &T{} stays on the stack and when it escapes.
- The SSA-level cost of "constructor returns interface": where the boxing happens and what it costs.
- Init order, init() functions, and package-level variable initialization: the rules every registry-based factory depends on.
- sync.Once internals: the atomic fast path, the mutex slow path, and why the cost is one load per call.
- Registry map lookup: string hashing, bucket walk, and the memory layout of map[string]Factory.
- PGO devirtualization for hot factory call sites.
- Generic factories: GCShape stencilling and runtime dictionaries.
- Assembly output for a real factory call on amd64.
- Runtime helpers: runtime.newobject, runtime.makemap, runtime.makechan.
- Defer cost in cleanup-returning factories.
- Source dive of bufio.NewReader (constructor factory) and sql.Open (registry-based factory).
Anchored at Go 1.22, amd64. Inlining heuristics, PGO behavior, and generic stencilling rules shift across versions — verify against go version for your build.
2. Table of Contents¶
- What this level covers
- Table of Contents
- The funcval layout — factories as first-class values
- Escape analysis at the factory return boundary
- Compile-time interface boxing in factory returns
- Init order: package-level vars and init() functions
- sync.Once internals — atomic fast path, mutex slow path
- Registry map lookup cost
- PGO devirtualization for hot factory sites
- Generic factories under the hood
- Assembly for a typical factory call
- Runtime helpers — newobject, makemap, makechan
- Memory layout of registry maps
- Defer cost in cleanup-returning factories
- The "constructor returns interface" pattern at the SSA level
- bufio.NewReader line by line
- sql.Open line by line
- Benchmarks
- Tricky questions
- Summary
- Further reading
3. The funcval layout — factories as first-class values¶
When a factory is stored as a value — in a variable, a map, or a struct field — the runtime represents it as a funcval:
// src/runtime/runtime2.go
type funcval struct {
fn uintptr
// variable-sized list of captured variables follows
}
A bare function (func NewFoo() *Foo) compiled in a package, with no closures, has a funcval whose fn field is the address of the compiled code. No captures, no allocation — the funcval is a static symbol in the binary's read-only segment.
A closure factory (func() *Foo { return &Foo{cfg: cfg} } where cfg is captured) has a funcval whose fn points to the compiled closure body, followed by the captured variables inline. The closure is allocated on the heap when it escapes.
+--------------------+ bare factory
| funcval (rodata) | no captures
|--------------------|
| fn = NewFoo_addr |
+--------------------+
+--------------------+ closure factory
| funcval (heap) | captures cfg
|--------------------|
| fn = closure_addr |
| cfg (captured) |
+--------------------+
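A factory variable is a single pointer-sized word that points to the funcval. There is no separate receiver or context word: a method value such as d.Sound is itself a closure whose funcval carries the bound receiver as a capture. When you assign var mk func() *Foo = NewFoo, mk holds the address of NewFoo's static funcval. When the factory is fetched from a map (registry["foo"]), the map stores the same kind of pointer; calling it loads fn through that pointer, passes the funcval address in the closure context register (DX on amd64) so the body can reach its captures, and makes an indirect call.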
The cost: storing a factory in a registry is one pointer (8 bytes on 64-bit) of memory and one indirect call per invocation. Closure factories add the closure allocation if they escape — typically 16-64 bytes for the funcval plus captures.
This is why var Factories = map[string]func() Animal{...} is cheap: the values are funcval pointers, not allocations per entry.
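The two funcval flavours can be put side by side in one registry. This is a minimal sketch; Animal, Dog, newDog, namedDog, and registry are illustrative names, not part of any library.

```go
package main

import "fmt"

// Animal and Dog are illustrative types.
type Animal interface{ Sound() string }

type Dog struct{ name string }

func (d *Dog) Sound() string { return d.name + " says woof" }

// newDog is a bare factory: its funcval is a static symbol with no captures.
func newDog() Animal { return &Dog{name: "dog"} }

// namedDog returns a closure factory: its funcval carries the captured name.
func namedDog(name string) func() Animal {
	return func() Animal { return &Dog{name: name} }
}

// registry stores one funcval pointer per entry.
var registry = map[string]func() Animal{
	"dog": newDog,
	"rex": namedDog("Rex"),
}

func main() {
	fmt.Println(registry["dog"]().Sound())
	fmt.Println(registry["rex"]().Sound())
}
```

Both entries are called the same way; only the closure entry paid for a heap-allocated funcval when it was created.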
4. Escape analysis at the factory return boundary¶
A factory typically returns *T or an interface. Both are pointer-shaped, but the compiler decides whether the underlying T lives on the stack or the heap.
4.1 Pointer return — usually escapes¶
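For a factory such as func NewFoo() *Foo { return &Foo{} } that is not inlined, go build -gcflags="-m" . reports that the &Foo literal escapes to heap.
The pointer leaves the function frame, so Foo must outlive the call, and the compiler allocates it on the heap via runtime.newobject.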
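A small program to try this on, as a sketch: Foo and NewFoo are illustrative, and the noinline directive is there only to keep the call from being merged into main. Building it with -gcflags="-m" should print a diagnostic along the lines of "escapes to heap" for the literal; the exact wording varies by Go version.

```go
package main

import "fmt"

type Foo struct{ x int }

// NewFoo returns a pointer that outlives its own frame.
// The directive below keeps the call visible so the escape is easy to observe.
//
//go:noinline
func NewFoo() *Foo { return &Foo{x: 1} }

func main() {
	f := NewFoo()
	fmt.Println(f.x)
}
```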
4.2 When escape can be avoided¶
If the factory is inlined and the compiler can prove that the returned pointer never leaves the caller's frame, escape analysis can place the allocation on the caller's stack instead. This requires two things:
- The factory is inlined into the caller.
- After inlining, the resulting code shows the value never escapes.
Example:
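type Foo struct{ x int }
// NewFoo is small enough for the inliner to pick up on its own; Go has no //go:inline directive.
func NewFoo() *Foo { return &Foo{x: 1} }
func use() int {
f := NewFoo()
return f.x // f doesn't escape use
}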
With inlining, the compiler sees the equivalent of f := &Foo{x:1}; return f.x and can prove f doesn't escape use. The Foo is stack-allocated.
Without inlining (or with //go:noinline), the same code allocates on the heap. Inlining is the gate to factory escape elision.
4.3 Interface return — almost always escapes¶
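For a factory such as func NewDog() Animal { return &Dog{} }, -gcflags="-m" reports that the &Dog literal escapes to heap.
Why: the interface value Animal is constructed by boxing &Dog{}. The box (an iface struct) holds a pointer to the concrete Dog. When NewDog is compiled as a standalone function, the compiler must allocate the Dog on the heap because: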
- The interface escapes the function (it's returned).
- The compiler doesn't know the lifetime of the interface in the caller.
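When the factory is not inlined, the heap allocation happens even if the caller never stores the interface long-term. This is the interface-boxing tax on factories that return interfaces. If the factory is inlined and every method call on the result is devirtualized, escape analysis may still keep the value on the stack; check with -gcflags="-m" rather than assuming either outcome.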
4.4 Value return — no allocation¶
For a factory such as func NewFoo() Foo { return Foo{x: 1} }, -gcflags="-m" reports no escape: the value is returned by copy and nothing is allocated. The downsides: copying a large Foo is expensive, and each consumer gets its own copy (no shared identity across consumers).
As a rough rule, for small types (up to about 32 bytes) value-returning factories are often faster than pointer-returning ones. For large types, or when consumers need to share the result, pointers win.
4.5 Empirical check¶
Run this regularly during factory design: go build -gcflags="-m=2" .
The =2 gives verbose output showing why each decision was made. Pay attention to:
- "leaking param" — your factory's argument is being stored somewhere persistent.
- "moved to heap" — escape decided this allocation can't stay on the stack.
- "inlining call to" — your factory is being inlined; escape analysis runs on the merged code.
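The compiler's diagnostics can be cross-checked at run time by counting allocations. The sketch below uses testing.AllocsPerRun from the standard library; Foo, NewFooPtr, and NewFooVal are illustrative, and both factories are marked noinline so the measurement reflects the standalone functions rather than whatever the inliner would do at this particular call site.

```go
package main

import (
	"fmt"
	"testing"
)

type Foo struct{ x int }

// NewFooPtr returns a pointer, so the Foo must live on the heap.
//
//go:noinline
func NewFooPtr() *Foo { return &Foo{x: 1} }

// NewFooVal returns by value, so nothing is allocated.
//
//go:noinline
func NewFooVal() Foo { return Foo{x: 1} }

// allocsPtr reports the average heap allocations per NewFooPtr call.
func allocsPtr() float64 {
	return testing.AllocsPerRun(1000, func() { _ = NewFooPtr() })
}

// allocsVal reports the average heap allocations per NewFooVal call.
func allocsVal() float64 {
	return testing.AllocsPerRun(1000, func() { _ = NewFooVal() })
}

func main() {
	fmt.Println("pointer factory allocs/op:", allocsPtr())
	fmt.Println("value factory allocs/op:", allocsVal())
}
```

Removing the noinline directives and re-running is a quick way to see whether inlining lets the pointer variant drop to zero allocations in your build.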
5. Compile-time interface boxing in factory returns¶
A factory that returns Animal (interface) and constructs &Dog{} (concrete) does the interface conversion at the return statement.
The statement return &Dog{} lowers to roughly this pseudo-IR:
tmp_dog := runtime.newobject(<type Dog>) // heap alloc, returns *Dog
*tmp_dog = Dog{} // zero-init or struct init
return iface{
tab: &itab_Animal_Dog, // static itab address
data: tmp_dog, // pointer to the Dog
}
The itab (&itab_Animal_Dog) is a static symbol generated by the compiler: it's a *itab whose method pointers point at *Dog's implementations of Animal's methods. The itab is cached: every conversion from *Dog to Animal uses the same itab address.
5.1 itab cache¶
// src/runtime/runtime2.go (Go 1.22)
type itab struct {
inter *interfacetype // Animal type info
_type *_type // *Dog type info (the concrete type stored in the interface)
hash uint32 // copy of _type.hash, used for type switches
_ [4]byte
fun [1]uintptr // variable-sized method table
}
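For a conversion from a statically known concrete type such as *Dog to Animal, the compiler emits the itab as a static symbol and the linker resolves its address, so no runtime lookup happens at all. The runtime's itab table (itabTable) comes into play only when the concrete type is not known at compile time, for example an interface-to-interface assertion; those results are cached after the first lookup. For a factory, the steady-state itab cost is zero: the address is a constant in the instruction stream.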
5.2 The boxing cost is in the data, not the metadata¶
For a factory like return &Dog{}, the steady-state cost is:
- 1 heap allocation for the Dog (runtime.newobject).
- 0 cost for the itab (compile-time constant).
- 0 cost for the iface struct (it's a return-by-value pair).
The 1 allocation is the boxing tax. If your factory returns Animal 100k times per second, you have 100k heap allocations per second. If you return *Dog directly (concrete factory), you can sometimes avoid the allocation via inlining + escape analysis.
5.3 Watching for unintentional boxing¶
func NewDog() Animal { return &Dog{} } // boxes
func NewDog() *Dog { return &Dog{} } // doesn't box
// caller:
d := NewDog() // d is Animal or *Dog
var a Animal = d // if d is *Dog, boxing here instead
If the caller always uses *Dog directly, the concrete-returning factory is cheaper. If multiple call sites need the interface, return interfaces from the factory (the alternative is N boxing sites in the callers).
6. Init order: package-level vars and init() functions¶
Registry-based factories depend on init() to populate the registry. The rules for init order are precise.
6.1 Within a single file¶
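Variables are initialized in dependency order. The compiler builds a dependency graph and topologically sorts it, falling back to declaration order when there is no dependency. If a's initializer refers to b and b depends on nothing, b is initialized first, then a, even when a is declared first.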
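A minimal sketch of that rule. The names a, b, initB, and order are illustrative; order records what ran so the sequence can be printed.

```go
package main

import "fmt"

// a is declared first but depends on b, so b is initialized first.
var a = b + 1

var b = initB()

// order records the sequence in which the initializer and init() ran.
var order []string

func initB() int {
	order = append(order, "b")
	return 1
}

// init runs only after every package-level variable has been initialized.
func init() {
	order = append(order, "init")
}

func main() {
	fmt.Println(a, b, order) // prints: 2 1 [b init]
}
```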
6.2 Within a single package, across files¶
The go tool passes a package's files to the compiler sorted by file name, and that order only decides between variables that do not depend on each other. Variables with explicit dependencies are ordered by the graph regardless of file.
If x in a.go is initialized from y in b.go, y initializes first, then x. The file order is irrelevant when deps are explicit.
6.3 Across packages¶
Package A imports package B → package B's init runs before A's. Within B, all package-level vars initialize, then B's init() functions run in declaration order across files (alphabetical file order, then top-to-bottom within file).
For a main package that imports database/sql and a mysql driver package, the sequence is:
- database/sql package-level vars (including the registry map).
- database/sql init functions.
- mysql package-level vars.
- mysql init functions. This is where sql.Register("mysql", &MySQLDriver{}) runs.
- main package-level vars.
- main init functions.
- main() is called.
6.4 What goes wrong¶
Two pitfalls for factory registries:
Pitfall 1: import the driver but don't use anything from it.
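The blank import (import _ followed by the driver's import path) is required: it links the package into the build and runs its init (registering the driver) without referencing any exported symbol. A plain import with no references is a compile error (imported and not used), and with no import at all the package is not part of the build, so the driver never registers.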
Pitfall 2: registry access during a package's own init.
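// my_pkg/init.go (imports database/sql and slices)
func init() {
// sql.Drivers returns the registered driver names as a sorted []string, not a map.
registered := slices.Contains(sql.Drivers(), "mysql")
_ = registered // false if the mysql package's init has not run yet
}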
If my_pkg is initialized before mysql, the registry doesn't have "mysql" yet. The fix is to make my_pkg depend on mysql (import it), or defer the lookup until main runs.
6.5 The init order is a constructor's transitive closure¶
For a service with N drivers, all N init() functions run sequentially at process startup. If each driver's init takes 10 ms (parsing config, connecting to localhost), the process startup is delayed by N×10 ms before main() even begins. Lazy registration (a sync.Once-guarded init in the driver's first use) avoids this.
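Lazy registration can be sketched as follows. Driver, slowDriver, and getDriver are hypothetical names; the counter stands in for the expensive setup work that would otherwise sit in init().

```go
package main

import (
	"fmt"
	"sync"
)

// Driver is an illustrative interface.
type Driver interface{ Name() string }

type slowDriver struct{}

func (slowDriver) Name() string { return "slow" }

var (
	setupOnce  sync.Once
	setupCalls int
	driver     Driver
)

// getDriver performs the expensive setup on first use instead of in init().
func getDriver() Driver {
	setupOnce.Do(func() {
		setupCalls++ // stands in for config parsing, dialing, and so on
		driver = slowDriver{}
	})
	return driver
}

func main() {
	fmt.Println(getDriver().Name(), getDriver().Name(), setupCalls)
}
```

The trade-off is that the first caller pays the setup latency instead of process startup.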
7. sync.Once internals — atomic fast path, mutex slow path¶
Singleton factories use sync.Once to guarantee single initialization. Here's the actual implementation:
// src/sync/once.go
type Once struct {
done atomic.Uint32
m Mutex
}
func (o *Once) Do(f func()) {
if o.done.Load() == 0 {
o.doSlow(f)
}
}
func (o *Once) doSlow(f func()) {
o.m.Lock()
defer o.m.Unlock()
if o.done.Load() == 0 {
defer o.done.Store(1)
f()
}
}
7.1 The fast path¶
atomic.Uint32.Load() on amd64 compiles to a plain MOVL: no fence, no lock prefix, because x86 loads already have the ordering the Go memory model needs here. The fast path is that single load followed by a compare and branch.
When done is already 1 (after the first call), Do returns after the load. Cost: a few cycles on a hot cache line, so after the first Do all subsequent calls are essentially free.
7.2 The slow path¶
The first caller (and any concurrent callers racing for first) hit doSlow:
o.m.Lock() // acquire mutex
defer o.m.Unlock() // release on return
if o.done.Load() == 0 { // re-check inside the lock
defer o.done.Store(1)
f()
}
Double-checked locking. The outer check (done == 0) is the fast-path optimization; the inner check (done.Load() == 0 inside the lock) handles the race where two goroutines both saw done == 0 and entered doSlow — only one runs f.
The order of the inner defers matters: defer o.done.Store(1) is pushed before f() runs but executes after f() returns (deferreds are LIFO). So done flips to 1 before the mutex is unlocked. Any goroutine waiting on the mutex will see done == 1 after acquiring and will skip f().
7.3 Why the store happens after f()¶
Because Do promises that no call returns until the one call to f has completed. If done flipped to 1 before f ran, a concurrent caller could take the fast path and use the singleton before it was initialized.
The store is deferred, so it runs even if f panics. Do treats a panicking f as having returned: done is set to 1 and later calls to Do do not run f again. There is no retry.
For factory singletons, this means: if your f panics, the singleton is left in whatever partial state existed and is never rebuilt. Don't put panicking code in sync.Once.Do. Use OnceFunc / OnceValue (Go 1.21+), which record the panic and re-raise it on every later call instead of silently returning.
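The no-retry behavior is easy to confirm. In this sketch, runTwice is an illustrative helper that calls Do twice with a panicking initializer and recovers around each call.

```go
package main

import (
	"fmt"
	"sync"
)

// runTwice calls Do twice with a panicking f and reports what happened.
func runTwice() (firstPanicked, secondPanicked bool, calls int) {
	var once sync.Once
	attempt := func() (panicked bool) {
		defer func() {
			if recover() != nil {
				panicked = true
			}
		}()
		once.Do(func() {
			calls++
			panic("init failed")
		})
		return false
	}
	firstPanicked = attempt()  // f runs and panics
	secondPanicked = attempt() // f is skipped: done is already 1
	return firstPanicked, secondPanicked, calls
}

func main() {
	fmt.Println(runTwice()) // prints: true false 1
}
```

The second call returns normally even though initialization never succeeded, which is exactly the silent failure to guard against.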
7.4 OnceFunc and OnceValue (Go 1.21+)¶
// src/sync/oncefunc.go
func OnceFunc(f func()) func() {
var (
once Once
valid bool
p any
)
g := func() {
defer func() {
p = recover()
if !valid {
panic(p)
}
}()
f()
valid = true
}
return func() {
once.Do(g)
if !valid {
panic(p)
}
}
}
This wrapper caches the panic value too: subsequent calls re-panic with the same value. The factory-singleton-with-panic-recovery pattern is now a stdlib primitive.
For value-returning factories: sync.OnceValue(func() *Config { return loadConfig() }) returns func() *Config that runs once and caches both the value and any panic.
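A sketch of a OnceValue-backed singleton factory (requires Go 1.21 or later). Config and newConfigFactory are illustrative; the counter exists only to show that the loader runs once.

```go
package main

import (
	"fmt"
	"sync"
)

type Config struct{ Addr string }

// newConfigFactory returns a singleton factory plus a counter of real loads.
func newConfigFactory() (get func() *Config, loads *int) {
	n := 0
	get = sync.OnceValue(func() *Config {
		n++ // stands in for an expensive loadConfig call
		return &Config{Addr: "localhost:8080"}
	})
	return get, &n
}

func main() {
	get, loads := newConfigFactory()
	c1, c2 := get(), get()
	fmt.Println(c1 == c2, *loads) // prints: true 1
}
```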
8. Registry map lookup cost¶
A registry-based factory is a map[string]Factory. The cost of registry["mysql"] is:
- Hash the key.
- Find the bucket.
- Walk the bucket comparing keys.
- Return the value pointer.
8.1 String hashing¶
map[string]V uses Go's runtime string hash. On amd64 it is aeshash (AES-based) when the CPU supports the required AES and SSE instructions; otherwise the runtime uses a portable software fallback (wyhash-based in recent releases).
For a short string like "mysql" (5 bytes), aeshash is roughly 10-15 cycles on a modern x86. For longer keys (~64 bytes), it's still under 30 cycles. Hash cost is small but real.
8.2 Bucket layout¶
// src/runtime/map.go
type bmap struct {
tophash [8]uint8
// keys [8]K (variable size depending on K)
// values [8]V (variable size depending on V)
// overflow *bmap
}
Each bucket holds 8 key-value pairs plus an 8-byte tophash array (storing the top 8 bits of each entry's hash for fast skip), plus a pointer to an overflow bucket.
For map[string]func() Animal:
- string header is 16 bytes (pointer + length).
- func() Animal is 8 bytes (pointer to funcval).
- Each entry is 24 bytes plus 1 byte of tophash.
- One bucket holds 8 entries: 8×24 + 8 (tophash) + 8 (overflow pointer) = 208 bytes.
+-------------------------------+ bucket
| tophash [0..7] (8 bytes) |
|-------------------------------|
| key 0: string header (16 B) |
| key 1: ... |
| ... |
| key 7: ... |
|-------------------------------|
| value 0: funcval ptr (8 B) |
| value 1: ... |
| ... |
| value 7: ... |
|-------------------------------|
| overflow ptr (8 B) |
+-------------------------------+
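The 208-byte figure can be recomputed from the sizes of the parts. This sketch assumes a 64-bit platform and the bucket layout described above; slotsPerBucket and bucketBytes are illustrative names, not runtime symbols.

```go
package main

import (
	"fmt"
	"unsafe"
)

type Animal interface{ Sound() string }

// slotsPerBucket is the number of entries per bucket in the layout described here.
const slotsPerBucket = 8

// bucketBytes computes the bucket size for map[string]func() Animal
// as tophash + keys + values + overflow pointer.
func bucketBytes() uintptr {
	var key string
	var val func() Animal
	var overflow unsafe.Pointer
	return slotsPerBucket + // tophash: one byte per slot
		slotsPerBucket*unsafe.Sizeof(key) +
		slotsPerBucket*unsafe.Sizeof(val) +
		unsafe.Sizeof(overflow)
}

func main() {
	fmt.Println(bucketBytes()) // 208 on 64-bit platforms
}
```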
8.3 Lookup walk¶
- Hash the key: the low-order B bits select the bucket, and the top 8 bits become the entry's tophash.
- Scan the bucket's tophash array for matching bytes (fast: the array is 8 bytes and sits in one cache line).
- For each match, compare the actual key string (length first, then a memcmp-style compare).
- If found, return the value.
- If not, follow the overflow pointer to the next bucket.
For a 10-entry registry, the entries are spread over two buckets (a bucket holds at most 8), and a lookup probes only the one its hash selects. Lookup is ~30 cycles total (hash + tophash scan + key compare + value load), under 10 ns.
For a 1000-entry registry, the bucket count is a power of two: 256 buckets at the 6.5 load factor. The hash selects one bucket; the bucket walk is the same single-bucket cost. Lookup is still ~30 cycles unless the selected bucket has spilled into overflow buckets.
8.4 Compared to a switch¶
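func newAnimal(kind string) Animal { // named newAnimal to avoid shadowing the builtin make
switch kind {
case "dog": return &Dog{}
case "cat": return &Cat{}
}
return nil
}
For 2-3 cases a string switch is a short chain of comparisons (kind == "dog", then kind == "cat"), each a length check plus a few bytes of compare, and it is usually faster than the map. As the case count grows the gap narrows; where the crossover sits depends on key lengths and the compiler version, so benchmark before choosing.
The compiler does not emit a linear chain for large string switches: it orders the cases for binary search, and since Go 1.19 it can use jump tables on amd64 and arm64. These are implementation details that shift between releases. For large or dynamically extended case sets, prefer a map for predictability.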
8.5 Registry mutex overhead¶
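A registry that is written only during init and is read-only afterward needs no mutex: package initialization completes before main starts, so later reads are safe without locking. If writes can happen after startup, guard the map with sync.RWMutex and take RLock for reads and Lock for writes; locking only the writers is a data race.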
If the registry is mutated frequently, consider sync.Map, which uses lock-free reads for hot entries.
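A minimal sketch of a registry that stays safe when registration can happen after startup. Registry, Factory, Register, and Lookup are illustrative names.

```go
package main

import (
	"fmt"
	"sync"
)

type Animal interface{ Sound() string }

type Dog struct{}

func (*Dog) Sound() string { return "woof" }

type Factory func() Animal

// Registry guards its map with an RWMutex: readers share, writers exclude.
type Registry struct {
	mu sync.RWMutex
	m  map[string]Factory
}

func NewRegistry() *Registry { return &Registry{m: make(map[string]Factory)} }

// Register takes the write lock.
func (r *Registry) Register(name string, f Factory) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.m[name] = f
}

// Lookup takes the read lock only for the map access.
func (r *Registry) Lookup(name string) (Factory, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	f, ok := r.m[name]
	return f, ok
}

func main() {
	r := NewRegistry()
	r.Register("dog", func() Animal { return &Dog{} })
	if f, ok := r.Lookup("dog"); ok {
		fmt.Println(f().Sound())
	}
}
```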
9. PGO devirtualization for hot factory sites¶
Go 1.21+ supports profile-guided optimization (PGO). For factory call sites that always return the same concrete type, PGO can devirtualize subsequent interface calls.
Without PGO: the call d.Sound() goes through the itab — one indirect jump (~3 ns).
With PGO, if the profile shows *Dog dominates at this call site, the compiler rewrites:
TEXT main.process(SB)
CALL newAnimal
; d is in (AX, BX): (tab, data)
LEAQ go:itab.*main.Dog,main.Animal(SB), CX
CMPQ AX, CX
JNE not_dog
MOVQ BX, AX ; *Dog as receiver (first argument register)
CALL main.(*Dog).Sound(SB) ; direct call (inlinable)
JMP done
not_dog:
; fall back to indirect dispatch
MOVQ 24(AX), CX ; itab.fun[0]
MOVQ BX, AX ; data pointer as receiver
CALL CX
done:
RET
If *Dog is the hot type, the direct path is taken; the inlinable Sound may even fold into process, eliminating the call entirely. The cold path (not_dog) handles other types.
PGO doesn't devirtualize the factory itself (it's already a direct call). It devirtualizes the consumers of the factory's interface return.
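The same transformation can be written by hand in Go, which makes the shape easier to see. This is a sketch of what the rewritten call site amounts to, not compiler output; newAnimal and process are illustrative.

```go
package main

import "fmt"

type Animal interface{ Sound() string }

type Dog struct{}

func (*Dog) Sound() string { return "woof" }

type Cat struct{}

func (*Cat) Sound() string { return "meow" }

func newAnimal(kind string) Animal {
	if kind == "dog" {
		return &Dog{}
	}
	return &Cat{}
}

// process spells out the shape PGO gives a hot call site:
// a type check guarding a direct call, with interface dispatch as fallback.
func process(kind string) string {
	a := newAnimal(kind)
	if d, ok := a.(*Dog); ok {
		return d.Sound() // direct, inlinable call
	}
	return a.Sound() // indirect call through the itab
}

func main() {
	fmt.Println(process("dog"), process("cat"))
}
```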
9.1 When PGO helps factories¶
- Factory returns an interface.
- One concrete type is clearly the hottest callee at the call site in the profile.
- The interface method is small and inlinable.
For typical web services where one driver is overwhelmingly used (e.g. mysql in 99.9% of cases), PGO can eliminate the interface dispatch on the hot path.
9.2 When PGO doesn't help¶
- Factory returns concrete type (already devirtualized).
- Multiple concrete types are nearly equally hot.
- The factory itself is the bottleneck (not its return's methods).
10. Generic factories under the hood¶
Go 1.18+ supports generic factories:
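A minimal sketch of such a factory, using the New[T any]() *T shape discussed in the rest of this section. Dog is an illustrative type.

```go
package main

import "fmt"

type Dog struct{ Name string }

// New is a generic factory: it allocates a zero-valued T and returns its address.
func New[T any]() *T { return new(T) }

func main() {
	d := New[Dog]()
	d.Name = "Rex"
	n := New[int]()
	fmt.Println(d.Name, *n) // prints: Rex 0
}
```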
10.1 GCShape stencilling¶
The compiler doesn't generate one body per type argument. Instead, it generates one body per GCShape, a class of types that the compiler and garbage collector can treat identically.
In the current implementation the grouping rule is narrow:
- Two types share a GCShape only if they have the same underlying type.
- As the one exception, all pointer types share a single shape (shown as go.shape.*uint8 in symbol names).
- So int and int32 get different shapes, as do two struct types with different fields.
New[*Dog] and New[*Cat] both have pointer-shaped type arguments and share a single function body. New[Dog] and New[Cat] share one only if Dog and Cat have identical underlying types. A shared body takes a dictionary (runtime type metadata) and uses it for type-specific operations like new(T) and T's methods.
10.2 Dictionary passing¶
The dictionary is passed as a hidden first argument. For New[Dog], the dictionary holds:
- *runtime._type for Dog (used by runtime.newobject).
- Method pointers if T is constrained by an interface.
Compiled New[T any]() *T (simplified; the current register ABI passes the dictionary in a register rather than on the stack):
TEXT main.New[go.shape.*uint8](SB)
MOVQ dict+0(FP), AX ; dictionary
MOVQ 0(AX), BX ; *_type for T
MOVQ BX, AX
CALL runtime.newobject(SB) ; alloc a T
MOVQ AX, ret+8(FP) ; return the *T
RET
10.3 Cost of dictionary lookup¶
For most generic factories, the dictionary access adds 1-2 ns over a hand-written non-generic factory. The body is shared across many instantiations (smaller binary) at the cost of an indirect lookup.
For New[T]() *T specifically, the dictionary supplies the *_type for T, which records the type's size and GC pointer bitmap; runtime.newobject uses these to allocate and initialize. The cost is dominated by the allocation itself, not the dictionary.
10.4 When to prefer generic factories¶
- Utility code (generic containers, registries).
- Factories called few times but defined once.
When to prefer concrete factories:
- Hot paths where 1 ns per call matters.
- Factories where the type is fixed.
11. Assembly for a typical factory call¶
Take the canonical pattern:
type Animal interface { Sound() string }
type Dog struct{ Name string }
func (d *Dog) Sound() string { return "woof" }
func NewDog(name string) Animal {
return &Dog{Name: name}
}
func use() {
a := NewDog("Rex")
println(a.Sound())
}
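Compile with go tool compile -S -l main.go (the -l disables inlining so we can see the call). The listing below is hand-simplified: real Go 1.22 output passes arguments and results in registers and includes a stack-growth check, so expect the prologue and register use to differ.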
TEXT main.NewDog(SB)
SUBQ $32, SP ; allocate stack frame
MOVQ BP, 24(SP) ; save BP
LEAQ 24(SP), BP
LEAQ go:itab.*main.Dog,main.Animal(SB), R8 ; load itab (constant)
MOVQ R8, 0(SP) ; (eventually returned)
LEAQ type:main.Dog(SB), AX ; *_type for Dog
CALL runtime.newobject(SB) ; alloc heap Dog
; AX now holds *Dog
MOVQ name+0(FP), CX ; name.data
MOVQ name+8(FP), DX ; name.len
MOVQ CX, 0(AX) ; dog.Name.data
MOVQ DX, 8(AX) ; dog.Name.len
MOVQ AX, ret_data+24(FP) ; iface.data
LEAQ go:itab.*main.Dog,main.Animal(SB), R8
MOVQ R8, ret_tab+16(FP) ; iface.tab
MOVQ 24(SP), BP
ADDQ $32, SP
RET
Three observable costs:
- Heap allocation via runtime.newobject for the Dog. About 10-15 ns for a warm small size class.
- Two MOV writes to populate Dog.Name. Negligible.
- Two MOV writes to construct the iface return value (data and tab). The itab is a constant address (LEAQ), no runtime lookup.
Total: ~15 ns dominated by the allocation.
For the consumer:
TEXT main.use(SB)
SUBQ $24, SP
MOVQ $0x...., AX ; "Rex".data
MOVQ $3, BX ; "Rex".len
CALL main.NewDog(SB)
; AX = itab, BX = data
MOVQ 24(AX), CX ; itab.fun[0] = Sound
MOVQ BX, AX ; receiver = *Dog (first argument register)
CALL CX ; virtual call
; print result (omitted)
ADDQ $24, SP
RET
The virtual call costs one indirect jump (~3-5 ns on miss, ~1 ns on prediction hit).
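With inlining (no -l), NewDog is inlined into use. If the compiler then devirtualizes a.Sound() to (*Dog).Sound and inlines it, escape analysis can prove the Dog never leaves use and the heap allocation disappears. If the call stays an interface dispatch, the allocation remains. Which one you get depends on the compiler version and the surrounding code, so confirm with -gcflags="-m".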
12. Runtime helpers — newobject, makemap, makechan¶
Factories implicitly call these helpers. Knowing what they cost helps you read the assembly and reason about overhead.
12.1 runtime.newobject¶
// src/runtime/malloc.go
func newobject(typ *_type) unsafe.Pointer {
return mallocgc(typ.Size_, typ, true)
}
Allocates one object of size typ.Size_ and returns its pointer. The allocator picks a size class (tiny / small / large) and bumps a per-P (per-processor) cache pointer. Fast path: ~10 ns. Slow path (cache miss, span exhaustion): up to a microsecond.
The factory cost for return &Foo{x: 1} is dominated by newobject (allocation + zero-init) plus the field stores.
12.2 runtime.makemap¶
Creates a new map. If hint is small (≤8), only the hmap header is set up and the first bucket is allocated lazily on the first insert. Otherwise it computes B, the log2 of the bucket count needed for hint entries, and pre-allocates the bucket array.
For a factory that returns a freshly constructed map (return map[string]int{...}), the cost is roughly:
- Small map (≤8 entries): ~50 ns, one allocation for hmap + one for the bucket.
- Larger maps: scales with hint. A 1k-entry hint costs ~1-2 µs.
If your factory builds a registry on every call, the map allocation may dominate. Cache the map.
12.3 runtime.makechan¶
For unbuffered channels, allocates only the hchan header (~96 bytes). For buffered, allocates header + buffer (size * elem_size).
Factory cost: ~25-50 ns for unbuffered, plus buffer alloc for buffered.
12.4 Implications¶
A "constructor returns struct" factory:
- 1× newobject for the struct.
- 1× makemap or makechan per internal map / channel field.
- Field stores (cheap).
The total cost grows with the number of allocations. A factory that builds a struct with three maps and a channel has at least five allocations (more once the maps allocate buckets). Pool these objects (sync.Pool) if the factory is hot.
13. Memory layout of registry maps¶
For map[string]func() Animal, the hmap header looks like:
// src/runtime/map.go
type hmap struct {
count int // # live cells
flags uint8
B uint8 // log2(# buckets)
noverflow uint16
hash0 uint32 // random per-map seed
buckets unsafe.Pointer // *[2^B]bmap
oldbuckets unsafe.Pointer
nevacuate uintptr
extra *mapextra
}
The map header is 48 bytes on 64-bit platforms, including the pointer to the bucket array.
+-----------------+
| hmap header | 48 bytes
|-----------------|
| count |
| flags, B, ... |
| hash0 |
| buckets ptr ----+--> bucket array
| oldbuckets ptr |
+-----------------+
+----------------+
| bucket 0 | 208 bytes
|----------------|
| tophash[0..7] |
| keys[0..7] |
| values[0..7] |
| overflow ptr |
+----------------+
| bucket 1 |
| ... |
For a 10-entry registry:
- count = 10
- B = 1 (2 buckets; at load factor 6.5 they hold up to 13 entries before the map grows)
- buckets points at a 2×208 = 416-byte block.
- Total memory: ~464 bytes (48 + 416), plus size-class rounding.
For each entry:
- Key: 16 bytes (string header).
- Key data: variable (the actual string bytes, allocated elsewhere).
- Value: 8 bytes (funcval pointer).
The map is cache-friendly: a single bucket fits in 3-4 cache lines (each line is 64 bytes). Lookup walks one bucket — ~3 cache misses worst case.
13.1 Don't put a static registry in sync.Map¶
sync.Map is optimized for keys that are written once and read many times, or for goroutines working on disjoint key sets under concurrent mutation. For a static factory registry (writes only at init), a plain map[string]Factory guarded by sync.RWMutex (or unguarded if writes are truly init-only) is faster. sync.Map's constants are higher per lookup: values are stored as any, so each load adds an interface type assertion on top of the lookup.
14. Defer cost in cleanup-returning factories¶
A common pattern: factory returns both the object and a cleanup function.
func NewServer() (*Server, func(), error) {
s := &Server{}
cleanup := func() { s.Close() }
return s, cleanup, nil
}
func use() {
s, cleanup, err := NewServer()
if err != nil { ... }
defer cleanup()
s.Serve()
}
The cleanup is a closure. Costs:
14.1 Closure allocation¶
func() { s.Close() } captures s. When the closure escapes, the compiler allocates it on the heap (16 bytes on 64-bit: the fn word plus the captured pointer). One allocation per factory call.
-gcflags="-m" reports the func literal as escaping to heap, alongside &Server{}.
14.2 defer cost¶
defer cleanup() registers the deferred call. Since Go 1.14, a defer that is not inside a loop is usually open-coded: the compiler records it in a bitmask on the stack and emits the call directly at each function exit.
An open-coded defer costs only a few nanoseconds more than calling cleanup by hand. A defer that cannot be open-coded (for example, one inside a loop) goes through a defer record and costs more. For a factory consumer that calls cleanup once at exit, the defer overhead is negligible.
14.3 defer in loops — the trap¶
for range names {
s, cleanup, _ := NewServer()
defer cleanup() // BUG: builds up one deferred call per iteration
s.Serve()
}
Each iteration adds a defer record. They all fire at function exit, not at end-of-iteration, and in reverse (LIFO) order. Memory grows with N, and every server stays open until the enclosing function returns.
Fix: wrap the body in a function or call cleanup explicitly.
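A sketch of the wrap-in-a-function fix. Server, NewServer, and serveAll here are illustrative stand-ins; the shared log slice exists only to make the ordering of Serve and Close visible.

```go
package main

import "fmt"

// Server appends to a shared log so the call order can be inspected.
type Server struct {
	id  int
	log *[]string
}

func (s *Server) Serve() { *s.log = append(*s.log, fmt.Sprintf("serve %d", s.id)) }
func (s *Server) Close() { *s.log = append(*s.log, fmt.Sprintf("close %d", s.id)) }

// NewServer returns the server plus a cleanup closure.
func NewServer(id int, log *[]string) (*Server, func(), error) {
	s := &Server{id: id, log: log}
	return s, func() { s.Close() }, nil
}

// serveAll scopes each defer to one iteration by wrapping the body in a function.
func serveAll(n int) []string {
	var log []string
	for i := 0; i < n; i++ {
		func() {
			s, cleanup, err := NewServer(i, &log)
			if err != nil {
				return
			}
			defer cleanup() // runs when this iteration's function returns
			s.Serve()
		}()
	}
	return log
}

func main() {
	fmt.Println(serveAll(2)) // prints: [serve 0 close 0 serve 1 close 1]
}
```

Each server is closed before the next one is created, so resource use stays constant regardless of the loop length.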
14.4 When cleanup is unavoidable¶
If your factory must return cleanup (because the consumer can't run it inline), the closure allocation is the irreducible cost. Some optimizations:
- Pre-allocate the closure if your factory is called repeatedly with the same cleanup behavior (rare).
- Use a method on the returned object instead: func (s *Server) Close(). The caller calls defer s.Close() directly, with no closure needed.
The "object with Close method" pattern is strictly cheaper than "factory returns cleanup closure" because there's no per-call closure allocation.
15. The "constructor returns interface" pattern at the SSA level¶
Go's SSA backend lowers the pattern in stages. Take:
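A factory of the shape the SSA listing in 15.2 describes might look like the sketch below: one string comparison and two return sites, each converting a concrete pointer to the interface. NewAnimal, Dog, and Cat are illustrative names chosen to match that listing.

```go
package main

import "fmt"

type Animal interface{ Sound() string }

type Dog struct{ Name string }

func (*Dog) Sound() string { return "woof" }

type Cat struct{ Name string }

func (*Cat) Sound() string { return "meow" }

// NewAnimal has two return sites; each one allocates a concrete value
// and pairs its pointer with the matching itab.
func NewAnimal(kind string) Animal {
	if kind == "dog" {
		return &Dog{}
	}
	return &Cat{}
}

func main() {
	fmt.Println(NewAnimal("dog").Sound(), NewAnimal("cat").Sound())
}
```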
15.1 After type checking¶
The compiler resolves:
- Each return constructs an iface.
- The branch creates two iface values; SSA merges them at the function exit.
15.2 Initial SSA¶
b1:
v1 = arg "kind"
v2 = string "dog"
v3 = StringEquals v1 v2
If v3 goto b2 else b3
b2:
v4 = New <*Dog> ; runtime.newobject(Dog)
v5 = ConstAddr itab_Dog
v6 = MakeIface v5 v4
Ret v6
b3:
v7 = New <*Cat>
v8 = ConstAddr itab_Cat
v9 = MakeIface v8 v7
Ret v9
15.3 After optimization¶
The compiler:
- Constant-folds the string compare if kind is known.
- Hoists common allocation patterns.
- Marks the returned iface as escaping (it leaves the function).
If kind is known at the call site (e.g., the caller passes a literal) and the factory is inlined there, the entire branch may collapse to a single concrete construction. This is static devirtualization: done without PGO, just from constants and inlining.
15.4 The MakeIface SSA op¶
MakeIface(tab, data) is this listing's name for the op that constructs an interface from a tab and a data pointer (the compiler's own SSA op is called IMake). It lowers to two register writes (tab + data) with no runtime call. The tab is a compile-time constant pointer.
The expensive part of "return interface" is not MakeIface (free). It's the heap allocation that precedes it. Factory cost is allocation cost; interface boxing is a separate, free step.
16. bufio.NewReader line by line¶
// src/bufio/bufio.go
const (
defaultBufSize = 4096
minReadBufferSize = 16
)
func NewReader(rd io.Reader) *Reader {
return NewReaderSize(rd, defaultBufSize)
}
func NewReaderSize(rd io.Reader, size int) *Reader {
// Is it already a Reader?
b, ok := rd.(*Reader)
if ok && len(b.buf) >= size {
return b
}
r := new(Reader)
r.reset(make([]byte, max(size, minReadBufferSize)), rd)
return r
}
func (b *Reader) reset(buf []byte, r io.Reader) {
*b = Reader{
buf: buf,
rd: r,
lastByte: -1,
lastRuneSize: -1,
}
}
16.1 Returns concrete type¶
*Reader is returned, not an interface. Consumers get the concrete pointer and can call all *Reader methods directly. Every call is a static call, so there is nothing to devirtualize and no PGO is needed.
16.2 Idempotency check¶
b, ok := rd.(*Reader) — type assertion. If rd is already a *Reader with a large enough buffer, return it directly. Avoids double-buffering. This is a cooperation pattern: if the caller has already wrapped, don't wrap again.
The type assertion is the iface's tab compared against *Reader's itab — one comparison, ~1 ns. If false, fall through; if true, the size check decides.
16.3 Allocation cost¶
Two allocations:
- new(Reader): calls runtime.newobject for the Reader struct (88 bytes on 64-bit, served from the 96-byte size class).
- make([]byte, size): calls runtime.makeslice for the buffer (default 4 KiB).
Total ~4 KiB + ~96 bytes per factory call. For a server handling 10k requests/sec, that's about 40 MiB/sec of buffer churn. Use sync.Pool for hot paths.
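One way to pool readers, as a sketch: readerPool, firstLine, and demo are illustrative names, while bufio.NewReader, Reader.Reset, and sync.Pool are the standard-library APIs. The pool hands back a Reader whose 4 KiB buffer is reused across requests.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"sync"
)

// readerPool recycles *bufio.Reader values so the 4 KiB buffer is reused.
var readerPool = sync.Pool{
	New: func() any { return bufio.NewReader(nil) },
}

// firstLine reads one line from src using a pooled reader.
func firstLine(src io.Reader) (string, error) {
	br := readerPool.Get().(*bufio.Reader)
	br.Reset(src) // rebind the existing buffer to the new source
	defer func() {
		br.Reset(nil) // drop the reference to src before pooling
		readerPool.Put(br)
	}()
	line, err := br.ReadString('\n')
	if err != nil && err != io.EOF {
		return "", err
	}
	return strings.TrimRight(line, "\r\n"), nil
}

// demo runs firstLine on a fixed in-memory input.
func demo() string {
	s, _ := firstLine(strings.NewReader("hello\nworld"))
	return s
}

func main() {
	fmt.Println(demo()) // prints: hello
}
```

A pooled Reader must not be used after it is put back, so this only fits call sites where the reader's lifetime is clearly bounded.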
16.4 The reset method¶
Notice r.reset(...) is called on a freshly allocated Reader. Why not initialize in NewReaderSize? Because reset is reused by Reader.Reset (the public API to recycle a reader). The pattern:
func NewReaderSize(...) *Reader {
r := new(Reader)
r.reset(...)
return r
}
func (b *Reader) Reset(r io.Reader) {
if b.buf == nil {
b.buf = make([]byte, defaultBufSize)
}
b.reset(b.buf, r)
}
The factory and the reset share a private helper. This is a composition pattern: factory and recycler share state initialization.
16.5 Why *b = Reader{...}¶
Instead of setting each field. Assigning a struct literal in one expression lets the compiler:
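- Use a few wide stores or a block copy to set all fields (the exact instructions depend on struct size and compiler version).
- Zero any fields not mentioned.
- Avoid per-field stores.
For a struct of this size the difference from individual stores is small; the larger benefit is correctness, because fields not named in the literal (r, w, err) are reset to zero and a recycled Reader carries no stale state. Prefer whole-struct assignment when reinitializing a value.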
16.6 The whole factory in assembly (approx)¶
TEXT bufio.NewReaderSize(SB)
; type assertion: rd.(*Reader)
LEAQ go:itab.*bufio.Reader,io.Reader(SB), R8
CMPQ AX, R8
JNE allocate
MOVQ buf_len(BX), DX
CMPQ DX, CX
JGE return_existing
allocate:
LEAQ type:bufio.Reader(SB), AX
CALL runtime.newobject(SB)
; ... populate fields, makeslice for buffer, return
Straight-line code plus two runtime calls. No reflection, no maps, no locks. This is what a real factory looks like under the hood.
17. sql.Open line by line¶
// src/database/sql/sql.go
var drivers = make(map[string]driver.Driver)
var driversMu sync.RWMutex

func Register(name string, driver driver.Driver) {
	driversMu.Lock()
	defer driversMu.Unlock()
	if driver == nil {
		panic("sql: Register driver is nil")
	}
	if _, dup := drivers[name]; dup {
		panic("sql: Register called twice for driver " + name)
	}
	drivers[name] = driver
}

func Open(driverName, dataSourceName string) (*DB, error) {
	driversMu.RLock()
	driveri, ok := drivers[driverName]
	driversMu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("sql: unknown driver %q (forgotten import?)", driverName)
	}
	if driverCtx, ok := driveri.(driver.DriverContext); ok {
		connector, err := driverCtx.OpenConnector(dataSourceName)
		if err != nil {
			return nil, err
		}
		return OpenDB(connector), nil
	}
	return OpenDB(dsnConnector{dsn: dataSourceName, driver: driveri}), nil
}
17.1 The global registry¶
A package-level map plus a sync.RWMutex. The map is allocated eagerly: make(map[string]driver.Driver) runs as part of database/sql's package initialization. It starts empty and gains one entry per successful Register call.
The mutex is an RWMutex because reads (during Open) are frequent and concurrent, while writes (during init via Register) are rare.
17.2 Register's contract¶
if driver == nil { panic(...) }
if _, dup := drivers[name]; dup { panic(...) }
drivers[name] = driver
Two preconditions:
- Driver must not be nil.
- Name must be unique.
Both violations panic. Why panic instead of returning an error? Because Register is normally called from a driver package's init function, which has no caller to hand an error to. A panic is the only way to signal at startup that the registration is broken.
17.3 Open's two-phase logic¶
Phase one holds the read lock only for the map lookup and releases it immediately. Phase two builds the *DB without any lock: driveri is a copy of the interface value stored in the map, so it stays valid regardless of later changes to the registry.
This is a minimum-scope locking pattern: lock just long enough to retrieve, no longer.
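A small registry with the same contract and locking discipline might look like the sketch below. Shape, Factory, Register, and New are hypothetical names chosen for illustration; the structure mirrors the sql.Register and sql.Open code quoted above without claiming to reproduce it.

```go
package main

import (
	"fmt"
	"sync"
)

// Shape and Factory are hypothetical names for this illustration.
type Shape interface{ Area() float64 }
type Factory func(size float64) Shape

type square struct{ side float64 }

func (s square) Area() float64 { return s.side * s.side }

var (
	mu        sync.RWMutex
	factories = make(map[string]Factory)
)

// Register adds a factory. It panics on nil or duplicate registration,
// mirroring the contract described for sql.Register.
func Register(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	if f == nil {
		panic("registry: Register factory is nil")
	}
	if _, dup := factories[name]; dup {
		panic("registry: Register called twice for " + name)
	}
	factories[name] = f
}

// New looks the factory up under the read lock, releases the lock,
// and only then runs the factory.
func New(name string, size float64) (Shape, error) {
	mu.RLock()
	f, ok := factories[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("registry: unknown shape %q", name)
	}
	return f(size), nil // runs outside the lock
}

func init() {
	Register("square", func(size float64) Shape { return square{side: size} })
}

func main() {
	s, err := New("square", 3)
	fmt.Println(s.Area(), err) // 9 <nil>
	_, err = New("circle", 1)
	fmt.Println(err) // registry: unknown shape "circle"
}
```

Running the factory outside the lock means a slow constructor never blocks other lookups or registrations.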
17.4 Type assertion for context support¶
if driverCtx, ok := driveri.(driver.DriverContext); ok {
	connector, err := driverCtx.OpenConnector(dataSourceName)
	...
}
The driver interface evolved over time. DriverContext (added in Go 1.10) is the newer interface: it lets the driver parse the DSN once and return a Connector, whose Connect method accepts a context.Context. Older drivers don't implement it. The factory checks for it and takes the modern path when available, and otherwise falls back to the legacy adapter (dsnConnector{}).
This is a capability check pattern: ask the dependency if it can do the new thing; otherwise, adapt the old behavior.
17.5 No allocation on the hot path (mostly)¶
For a sql.Open("mysql", dsn) call:
- Take read lock, hash "mysql", look up, release lock.
- Type assert to DriverContext (no alloc).
- Call OpenConnector(dsn): driver-specific.
- Call OpenDB(connector): allocates the *DB.
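Up to the OpenDB call, the factory itself allocates little or nothing (the legacy fallback may allocate when it boxes a dsnConnector into the Connector interface). OpenDB then allocates the *DB together with its bookkeeping: a cancelable context, the opener channel, internal maps, and a background connection-opener goroutine. The driver may or may not allocate inside OpenConnector. The registry lookup is essentially free by comparison.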
17.6 The hidden cost: init order¶
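Drivers register themselves from their own init functions, so the program has to link the driver package in, usually with a blank import:
import _ "github.com/go-sql-driver/mysql"
Without this, Register for "mysql" never runs, and sql.Open("mysql", ...) returns an error. The blank import is how the registry pattern survives across packages.
If you forget the blank import, you get the famous error:
sql: unknown driver "mysql" (forgotten import?)
The parenthetical is in the source code, a sign that the Go authors expected this to be a common mistake.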
18. Benchmarks¶
Measured on Go 1.22, amd64, Intel i7-12700, GOMAXPROCS=8:
BenchmarkDirectConstruction-8 500000000 2.10 ns/op 0 B/op 0 allocs/op
BenchmarkValueFactory-8 400000000 2.50 ns/op 0 B/op 0 allocs/op
BenchmarkPointerFactoryInlined-8 400000000 2.65 ns/op 0 B/op 0 allocs/op
BenchmarkPointerFactoryEscapes-8 50000000 24.0 ns/op 16 B/op 1 allocs/op
BenchmarkInterfaceFactory-8 40000000 28.5 ns/op 16 B/op 1 allocs/op
BenchmarkRegistryLookup-8 100000000 9.30 ns/op 0 B/op 0 allocs/op
BenchmarkRegistryFactoryCall-8 40000000 31.4 ns/op 16 B/op 1 allocs/op
BenchmarkSyncOnceHot-8 1000000000 0.92 ns/op 0 B/op 0 allocs/op
BenchmarkSyncOnceCold-8 30000000 45.0 ns/op 0 B/op 0 allocs/op
BenchmarkGenericFactory-8 40000000 28.0 ns/op 16 B/op 1 allocs/op
BenchmarkFactoryWithCleanup-8 20000000 62.0 ns/op 48 B/op 2 allocs/op
BenchmarkFactoryPooled-8 200000000 5.80 ns/op 0 B/op 0 allocs/op
BenchmarkBufioNewReader-8 10000000 120.0 ns/op 4176 B/op 2 allocs/op
BenchmarkSqlOpen-8 5000000 280.0 ns/op ...
Observations:
- Direct construction (caller writes &Dog{}): 2.1 ns. The floor.
- Value factory (returns Foo, no pointer): 2.5 ns. The copy is cheap for small types.
- Pointer factory, inlined: 2.65 ns. Almost free when the compiler proves no escape.
- Pointer factory, escaping: 24 ns + 1 alloc. The heap allocation dominates.
- Interface factory: 28.5 ns. Heap alloc + iface construction. The interface-boxing tax.
- Registry lookup only (no call): 9.3 ns. Map hash + bucket walk.
- Registry call: 31.4 ns. Lookup + indirect call + alloc.
- sync.Once hot: 0.92 ns. One atomic load. The fastest possible synchronized factory.
- sync.Once cold: 45 ns. Mutex + check + store + factory body.
- Generic factory: 28 ns. Comparable to interface factory; dictionary overhead is real but small.
- Factory with cleanup: 62 ns + 2 allocs. The closure for cleanup is a second allocation.
- Pooled factory: 5.8 ns. sync.Pool.Get + reset. The way to make hot factories cheap.
- bufio.NewReader: 120 ns. Buffer alloc dominates.
- sql.Open: ~280 ns. Driver lookup + connector creation. Not hot-path-suitable (intentionally).
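Takeaway: factory dispatch costs less than 10 ns. Allocation is the real cost: one alloc per factory call is ~10-20 ns. At 100k QPS with one factory call per request, that's 100k allocations/sec, or about 1-2 ms of allocator time per second before any GC cost, and the total scales with the number of factory calls per request. Pool aggressively in hot paths.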
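Allocation counts like the ones in the table can be checked without a full benchmark run. The sketch below uses testing.AllocsPerRun on two hypothetical factories, newDogEscaping and newDogInline. The exact numbers depend on the compiler's inlining and escape decisions, so treat the non-escaping result as something to verify on your own build rather than a guarantee.

```go
package main

import (
	"fmt"
	"testing"
)

// Animal and Dog are hypothetical types for this measurement sketch.
type Animal interface{ Name() string }

type Dog struct{ name string }

func (d *Dog) Name() string { return d.name }

// newDogEscaping is kept out of line, so its result is heap-allocated.
//
//go:noinline
func newDogEscaping() *Dog { return &Dog{name: "rex"} }

// newDogInline is small enough that the compiler may inline it.
func newDogInline() *Dog { return &Dog{name: "rex"} }

var sink Animal // package-level sink that keeps the value reachable

func main() {
	escaping := testing.AllocsPerRun(1000, func() {
		sink = newDogEscaping()
	})
	local := testing.AllocsPerRun(1000, func() {
		d := newDogInline()
		_ = len(d.name) // used locally, never stored
	})
	fmt.Printf("escaping: %.0f allocs/op\n", escaping)
	fmt.Printf("local:    %.0f allocs/op\n", local)
}
```

With a standard optimized build the escaping factory reports one allocation per call, and the local one is typically reported as zero once the call is inlined.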
19. Tricky questions¶
Q1. Why does var _ Animal = (*Dog)(nil) not allocate, but var _ Animal = NewDog() allocates a Dog on the heap?
Answer
`(*Dog)(nil)` is a typed nil: the value is just the nil pointer, and no Dog exists. The interface holds (itab_Animal_Dog, nil). Zero allocations.
`NewDog()` calls the factory, which actually constructs `&Dog{}`. The Dog is allocated on the heap because it escapes through the returned interface. The interface holds (itab_Animal_Dog, *Dog).
The compile-time check `var _ = ...` discards the result, but the right-hand side is still evaluated. The first form has no side effects; the second performs the allocation.
Lesson: use `(*T)(nil)` for compile-time interface satisfaction checks. It's the same assertion at zero runtime cost.
Q2. Why does this code sometimes panic at startup?
// pkg_a/init.go
// mustOpen is a local helper that wraps sql.Open and panics on error.
var Driver = mustOpen("mysql", "...")

// main.go
import _ "github.com/go-sql-driver/mysql"
import "pkg_a"
Answer
`pkg_a` doesn't import mysql; only main does, so nothing forces mysql to initialize before `pkg_a`. The language only guarantees that a package is initialized after the packages it imports. Since Go 1.21 the remaining order is determined by sorting import paths (earlier releases left it unspecified), so whether mysql's `init` runs before `pkg_a.Driver` is evaluated depends on how the two paths happen to sort. When `pkg_a` goes first, mysql hasn't registered yet and `mustOpen` panics.
Fix: move the blank import into `pkg_a`. Then `pkg_a` declares its dependency, and mysql is always initialized before `pkg_a`'s variables.
The lesson: blank imports must live where the registry is *consumed*, not where the binary is assembled.
Q3. Why is the cost of bufio.NewReader higher than sql.Open per call sometimes, but lower in microbenchmarks?
Answer
`bufio.NewReader` always allocates a 4 KiB buffer, so it costs about the same everywhere: ~120 ns in a tight loop, plus the GC pressure that 4 KiB per call creates under load.
Microbenchmarks of `sql.Open` use `sqlmock` or a fake driver, so there is no real I/O. They measure only the registry lookup, connector creation, and `OpenDB`.
Against a real database the picture changes, but not inside `sql.Open` itself: `Open` only builds the `*DB` handle and may not touch the network at all. The first connection is dialed lazily on first use (`Ping`, `Query`, ...), which costs a network round trip (~1-10 ms). Later operations reuse pooled connections.
So which factory looks cheaper depends on what the measurement includes. The lesson: microbenchmarks don't capture I/O. Profile the real path.
Q4. What does this code print?
var x = computeX()
func computeX() int { return y + 1 }
var y = computeY()
func computeY() int { return 10 }
func init() { println("init:", x, y) }
Answer
`init: 11 10`. The compiler tracks dependencies in package-level var initializers, including references made inside the functions they call: `x` depends on `y` (via `computeX`), so `y` initializes first. `y = 10`, then `x = 11`.
If `computeX` reached `y` in a way the analysis can't see (for example, a method call through an interface value), the dependency is hidden and initialization falls back to declaration order, so `x` could be computed from `y`'s zero value. Stick to direct calls in package-level initializers.
Q5. Why does sync.Once use both an atomic and a mutex? Couldn't it just use the atomic?
Answer
The atomic alone gives you check-then-set. But `Do(f)` has to call `f`, which may be slow, and must not return until `f` has finished. Without a mutex, two goroutines could both see `done == 0`, both enter, and both run `f` concurrently. That violates the contract. A compare-and-swap could pick a single winner, but the losers would then return before `f` completes.
The mutex serializes the slow path so only one goroutine runs `f` while the others wait for it. The atomic is the fast path for subsequent calls.
You could implement it with just an atomic and busy-waiting, where the losers spin until the winner marks completion. That works but burns CPU while waiting. The mutex parks the waiting goroutines and frees the CPU, which is better resource use.
Q6. When is Foo stack-allocated in func New() *Foo { return &Foo{} }?
Answer
The `&Foo{}` expression escapes (it's returned). But escape analysis runs *after inlining*. If `New` is inlined into a caller like `f := New(); return f.X`, the compiler sees the merged code and may prove `Foo` doesn't escape, in which case it is stack-allocated.
If `New` isn't inlined (too large, or `//go:noinline`), the compiler conservatively heap-allocates.
Inlining gates escape elision. Keep factories small (1-2 statements) to stay within the inliner's budget.
Q7. Why is make(map[K]V) not a "factory" but func New[T]() *T { return new(T) } is?
Answer
Both construct values. The distinction is semantic, not technical.
`make` is a built-in. It's lowered by the compiler to `runtime.makemap` (or `makeslice`, `makechan`), a runtime helper. The built-in encapsulates the allocation logic for *built-in container types*.
`New[T]` is user-written. It's an ordinary function that can be stored and passed around as a factory value. The body could do anything (`new(T)`, set fields, return a wrapped value).
The compiler treats `make` specially because the built-in types have specific layouts. User-written factories are just functions. Both are factories in the pattern sense; only `New[T]` is a function value you can store in a registry or pass as an argument.
This is why you can't say `make(MyStruct)`: `make` accepts only slice, map, and channel types. Use `new(MyStruct)` (also a built-in) or a user factory.
20. Summary¶
Go factories are cheap at the call site:
- A direct factory call adds ~0.5 ns over the construction.
- An interface-returning factory adds 1 allocation (interface boxing tax).
- A registry-based factory adds a map lookup (~10 ns).
- A sync.Once singleton on the hot path is ~1 ns.
- Generic factories add ~1-2 ns for dictionary lookup.
The cost of a factory is almost always the allocations it makes, not the dispatch. Profile with go test -benchmem; the alloc count is the number to watch.
For hot paths:
- Return concrete types when possible (no interface boxing).
- Pool with sync.Pool if the type is heavy.
- Inline the factory by keeping it small.
- Defer init lazily via sync.Once to amortize startup.
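Lazy initialization can be sketched with sync.OnceValue, which wraps the sync.Once pattern for a function that returns a value (Go 1.21 or later). Config, builds, and loadConfig are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Config is a hypothetical expensive-to-build singleton.
type Config struct{ Region string }

var builds atomic.Int32 // counts how often the factory body runs

// loadConfig wraps the factory body so it runs at most once, on first use.
// sync.OnceValue requires Go 1.21 or later.
var loadConfig = sync.OnceValue(func() *Config {
	builds.Add(1)
	return &Config{Region: "eu-west-1"}
})

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = loadConfig() // all callers receive the same *Config
		}()
	}
	wg.Wait()
	fmt.Println(loadConfig().Region, builds.Load()) // eu-west-1 1
}
```

The construction cost is paid by whichever goroutine calls first; every later call takes the atomic fast path.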
For init-time factories (drivers, plugins):
- Allocation is irrelevant — runs once.
- Init order matters — use blank imports to declare dependencies.
- Panic is acceptable in init: the runtime has nowhere to send an error.
The senior-level skill is making factories invisible in profiles. Use the compiler's escape analysis, inlining heuristics, and PGO to push factory overhead below the noise floor.
21. Further reading¶
- src/runtime/runtime2.go: iface, eface, funcval definitions
- src/runtime/iface.go: itab construction, getitab
- src/runtime/map.go: hmap, bucket layout, mapaccess1/mapassign
- src/runtime/malloc.go: mallocgc, newobject, size classes
- src/sync/once.go: Once.Do atomic + mutex pattern
- src/sync/oncefunc.go: OnceFunc, OnceValue, OnceValues (Go 1.21+)
- src/bufio/bufio.go: NewReader, NewReaderSize (concrete factory)
- src/database/sql/sql.go: Register, Open (registry factory)
- src/cmd/compile/internal/escape/: escape analysis source
- src/cmd/compile/internal/inline/: inliner
- src/cmd/compile/internal/devirtualize/: PGO devirtualization
- src/cmd/compile/internal/ssagen/ssa.go: MakeIface SSA lowering
- Go blog, "Profile-guided optimization in Go 1.21": PGO mechanics
- Go proposal 17746: original generics design with stencilling
- "The Go Programming Language" §5.7: function values and closures
- "The Go Programming Language" §7: interfaces and itab
- Russ Cox, "Go Data Structures: Interfaces": historical context on iface layout