Factory Pattern — Under the Hood¶

1. What this level covers¶

The junior, middle, and senior levels covered how to use factories: how to write them, how to design them, how to evolve them. This document is about what the compiler and runtime do when a factory runs.

The funcval runtime layout of factories stored as values (and how bufio.NewReader ends up in a registry).
Escape analysis at the factory return boundary — when does &T{} stay on the stack, when does it escape.
The SSA-level cost of "constructor returns interface": where the boxing happens and what it costs.
Init order, init() functions, and package-level variable initialization — the rules every registry-based factory depends on.
sync.Once internals: the atomic fast path, the mutex slow path, and why the cost is one load per call.
Registry map lookup: string hashing, bucket walk, and the memory layout of map[string]Factory.
PGO devirtualization for hot factory call sites.
Generic factories: GCShape stencilling and runtime dictionaries.
Assembly output for a real factory call on amd64.
Runtime helpers: runtime.newobject, runtime.makemap, runtime.makechan.
Defer cost in cleanup-returning factories.
Source dive of bufio.NewReader (constructor factory) and sql.Open (registry-based factory).

Anchored at Go 1.22, amd64. Inlining heuristics, PGO behavior, and generic stencilling rules shift across versions — verify against go version for your build.

2. Table of Contents¶

What this level covers
Table of Contents
The funcval layout — factories as first-class values
Escape analysis at the factory return boundary
Compile-time interface boxing in factory returns
Init order: package-level vars and init() functions
sync.Once internals — atomic fast path, mutex slow path
Registry map lookup cost
PGO devirtualization for hot factory sites
Generic factories under the hood
Assembly for a typical factory call
Runtime helpers — newobject, makemap, makechan
Memory layout of registry maps
Defer cost in cleanup-returning factories
The "constructor returns interface" pattern at the SSA level
bufio.NewReader line by line
sql.Open line by line
Benchmarks
Tricky questions
Summary
Further reading

3. The funcval layout — factories as first-class values¶

When a factory is stored as a value — in a variable, a map, or a struct field — the runtime represents it as a funcval:

// src/runtime/runtime2.go
type funcval struct {
    fn uintptr
    // variable-sized list of captured variables follows
}

A bare function (func NewFoo() *Foo) compiled in a package, with no closures, has a funcval whose fn field is the address of the compiled code. No captures, no allocation — the funcval is a static symbol in the binary's read-only segment.

A closure factory (func() *Foo { return &Foo{cfg: cfg} } where cfg is captured) has a funcval whose fn points to the compiled closure body, followed by the captured variables inline. The closure is allocated on the heap when it escapes.

+--------------------+    bare factory
| funcval (rodata)   |    no captures
|--------------------|
| fn = NewFoo_addr   |
+--------------------+

+--------------------+    closure factory
| funcval (heap)     |    captures cfg
|--------------------|
| fn = closure_addr  |
| cfg (captured)     |
+--------------------+

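A func value is a single pointer-sized word that points to a funcval. There is no second word for a receiver or context: a method value such as d.Sound is compiled as a closure whose funcval captures the receiver. When you assign var newFoo func() *Foo = NewFoo, the variable holds one pointer to the static funcval. When the value is fetched from a map (registry["foo"]), the map stores that pointer; calling it loads fn through the pointer and jumps to it, passing the funcval address in the closure context register (DX on amd64) so a closure body can reach its captures.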

The cost: storing a factory in a registry is one pointer (8 bytes on 64-bit) of memory and one indirect call per invocation. Closure factories add the closure allocation if they escape — typically 16-64 bytes for the funcval plus captures.

This is why var Factories = map[string]func() Animal{...} is cheap: the values are funcval pointers, not allocations per entry.
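
A minimal sketch of the two layouts, assuming illustrative names (Animal, Dog, NamedDogFactory, registry) that are not part of any library. It stores a bare function and a closure in the same map and checks that a func value occupies one machine word.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Animal, Dog, and registry are illustrative names for this sketch only.
type Animal interface{ Sound() string }

type Dog struct{ name string }

func (d *Dog) Sound() string { return d.name + " says woof" }

// NewDog is a bare factory: its funcval is a static symbol with no captures.
func NewDog() Animal { return &Dog{name: "dog"} }

// NamedDogFactory returns a closure factory: its funcval carries the captured name.
func NamedDogFactory(name string) func() Animal {
	return func() Animal { return &Dog{name: name} }
}

// funcValueIsOneWord reports whether a func value is exactly pointer-sized.
func funcValueIsOneWord() bool {
	var f func() Animal = NewDog
	return unsafe.Sizeof(f) == unsafe.Sizeof(uintptr(0))
}

var registry = map[string]func() Animal{
	"dog": NewDog,                 // pointer to a static funcval
	"rex": NamedDogFactory("Rex"), // pointer to a heap-allocated closure
}

func main() {
	fmt.Println(funcValueIsOneWord())
	fmt.Println(registry["dog"]().Sound())
	fmt.Println(registry["rex"]().Sound())
}
```

Both map values have the same static type and the same size; only the closure entry needed a heap allocation, and it happened once at registration time rather than per lookup.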

4. Escape analysis at the factory return boundary¶

A factory typically returns *T or an interface. Both are pointer-shaped, but the compiler decides whether the underlying T lives on the stack or the heap.

4.1 Pointer return — usually escapes¶

func NewFoo() *Foo {
    return &Foo{x: 1}
}

Run go build -gcflags="-m" .:

./foo.go:3:9: &Foo{...} escapes to heap

The pointer leaves the function frame, so Foo must outlive the call. The compiler allocates on the heap via runtime.newobject.

4.2 When escape can be avoided¶

If the caller's use of the returned pointer is bounded and the compiler can prove the lifetime stays within the caller's frame, escape analysis can keep the allocation on the caller's stack. The compiler only sees enough to prove this when:

The factory is inlined into the caller.
After inlining, the resulting code shows the value never escapes.

Example:

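// Small enough to be inlined automatically. Go has no //go:inline directive;
// only //go:noinline exists, to turn inlining off.
func NewFoo() *Foo { return &Foo{x: 1} }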

func use() int {
    f := NewFoo()
    return f.x // f doesn't escape `use`
}

With inlining, the compiler sees the equivalent of f := &Foo{x:1}; return f.x and can prove f doesn't escape use. The Foo is stack-allocated.

Without inlining (or with //go:noinline), the same code allocates on the heap. Inlining is the gate to factory escape elision.
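
One way to observe the difference without reading compiler output is testing.AllocsPerRun, which can be called from ordinary code. The sketch below assumes a hypothetical Foo type and two otherwise identical factories; the exact numbers depend on the compiler version and build flags, so treat it as a measurement harness, not a guarantee.

```go
package main

import (
	"fmt"
	"testing"
)

// Foo is a hypothetical type for this sketch.
type Foo struct {
	name string
	x    int
}

// newFooInlinable is small enough for the compiler to inline by default.
func newFooInlinable() *Foo { return &Foo{name: "a", x: 1} }

// newFooNoInline is the same factory with inlining disabled.
//
//go:noinline
func newFooNoInline() *Foo { return &Foo{name: "a", x: 1} }

var sink int // keeps results live so the calls are not optimized away

// allocsInlinable measures heap allocations per call when the factory can be inlined.
func allocsInlinable() float64 {
	return testing.AllocsPerRun(1000, func() {
		f := newFooInlinable()
		sink += f.x
	})
}

// allocsNoInline measures heap allocations per call when it cannot be inlined.
func allocsNoInline() float64 {
	return testing.AllocsPerRun(1000, func() {
		f := newFooNoInline()
		sink += f.x
	})
}

func main() {
	fmt.Println("inlinable:", allocsInlinable())
	fmt.Println("noinline: ", allocsNoInline())
}
```

With default optimizations the inlinable variant is expected to report fewer allocations than the //go:noinline variant, which always pays for one heap object per call.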

4.3 Interface return — almost always escapes¶

func NewAnimal() Animal {
    return &Dog{}
}

-gcflags="-m":

./animal.go:3:9: &Dog{...} escapes to heap

Why: the interface value Animal is constructed by boxing &Dog{}. The box (an iface struct) holds a pointer to the concrete Dog. The compiler must allocate the Dog on the heap because:

The interface escapes the function (it's returned).
The compiler doesn't know the lifetime of the interface in the caller.

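When the factory is not inlined, the heap allocation happens even if the caller drops the interface immediately. This is the interface-boxing tax on factories that return interfaces. When the factory is inlined and the compiler can also devirtualize the method calls on the result, the allocation can sometimes stay on the stack, so check the -m output at the call site instead of assuming.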

4.4 Value return — no allocation¶

func NewFoo() Foo {
    return Foo{x: 1}
}

-gcflags="-m" says nothing about escape — the value is returned by copy. No allocation at all. The downside: copying a large Foo is expensive, and Foo is now stack-bound (no shared identity across consumers).

For small types (≤32 bytes), value-returning factories are often faster than pointer-returning ones. For large types or when consumers need to share the result, pointers win.

4.5 Empirical check¶

Run this regularly during factory design:

go build -gcflags="-m=2" ./... 2>&1 | grep -i "escape\|inline"

The =2 gives verbose output showing why each decision was made. Pay attention to:

"leaking param" — your factory's argument is being stored somewhere persistent.
"moved to heap" — escape decided this allocation can't stay on the stack.
"inlining call to" — your factory is being inlined; escape analysis runs on the merged code.

5. Compile-time interface boxing in factory returns¶

A factory that returns Animal (interface) and constructs &Dog{} (concrete) does interface conversion at the return statement. The compiler lowers this to:

return &Dog{}

becomes (pseudo-IR):

tmp_dog := runtime.newobject(<type Dog>)   // heap alloc, returns *Dog
*tmp_dog = Dog{}                            // zero-init or struct init
return iface{
    tab:  &itab_Animal_Dog,                  // static itab address
    data: tmp_dog,                           // pointer to the Dog
}

The itab (&itab_Animal_Dog) is a static symbol generated by the compiler: it's a *itab whose method pointers point at *Dog's implementations of Animal's methods. The itab is cached: every conversion from *Dog to Animal uses the same itab address.

5.1 itab cache¶

// src/runtime/runtime2.go (Go 1.22)
type itab struct {
    inter  *interfacetype  // Animal type info
    _type  *_type          // *Dog type info
    hash   uint32          // copy of _type.hash, used for type switches
    _      [4]byte
    fun    [1]uintptr      // variable-sized method table
}

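For a conversion from a statically known concrete type, such as *Dog to Animal in a factory return, the compiler emits the itab as a read-only symbol and the runtime registers these compiler-generated itabs in its itab table (itabTable) at startup. The runtime's find-or-create path is only needed when the type pair is not known at compile time, such as an interface-to-interface assertion. In a factory return the steady-state cost is zero: the itab address is a link-time constant.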

5.2 The boxing cost is in the data, not the metadata¶

For a factory like return &Dog{}, the steady-state cost is:

1 heap allocation for the Dog (runtime.newobject).
0 cost for the itab (compile-time constant).
0 cost for the iface struct (it's a return-by-value pair).

The 1 allocation is the boxing tax. If your factory returns Animal 100k times per second, you have 100k heap allocations per second. If you return *Dog directly (concrete factory), you can sometimes avoid the allocation via inlining + escape analysis.

5.3 Watching for unintentional boxing¶

func NewDog() Animal { return &Dog{} }  // boxes
func NewDog() *Dog   { return &Dog{} }  // doesn't box

// caller:
d := NewDog()                            // d is Animal or *Dog
var a Animal = d                         // if d is *Dog, boxing here instead

If the caller always uses *Dog directly, the concrete-returning factory is cheaper. If multiple call sites need the interface, return interfaces from the factory (the alternative is N boxing sites in the callers).

6. Init order: package-level vars and init() functions¶

Registry-based factories depend on init() to populate the registry. The rules for init order are precise.

6.1 Within a single file¶

var a = b + 1   // depends on b
var b = 10

Variables are initialized in dependency order. The compiler builds a dependency graph and topologically sorts it. The above initializes b first (no deps), then a.

6.2 Within a single package, across files¶

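Declaration order matters only as a tiebreaker among variables that are otherwise ready to initialize. Across files, that order is the order in which files are presented to the compiler, and the go command sorts them by file name. Variables with explicit dependencies are ordered by the graph regardless of file.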

// a.go
var x = y + 1

// b.go
var y = 10

y initializes first (in b.go), then x (in a.go). The file order is irrelevant when deps are explicit.
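
The same rule can be observed at run time. The sketch below uses a hypothetical registerStep helper to record when each initializer runs; a is declared first but depends on b, so b's initializer runs first, and init() runs only after all package-level variables.

```go
package main

import "fmt"

// a is declared first but depends on b, so b is initialized first.
var a = b + 1
var b = registerStep("b", 10)

// order has no initializer, so it is already usable (nil) when b's initializer runs.
var order []string

// registerStep is a hypothetical helper: it records that an initializer ran
// and passes the value through.
func registerStep(name string, v int) int {
	order = append(order, name)
	return v
}

// init runs after every package-level variable has been initialized.
func init() { order = append(order, "init") }

func main() {
	order = append(order, "main")
	fmt.Println(a, b, order)
}
```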

6.3 Across packages¶

Package A imports package B → package B's init runs before A's. Within B, all package-level vars initialize, then B's init() functions run in declaration order across files (alphabetical file order, then top-to-bottom within file).

main → mysql → sql

database/sql package-level vars (including the registry map).
database/sql init functions.
mysql package-level vars.
mysql init functions — this is where sql.Register("mysql", &MySQLDriver{}) runs.
main package-level vars.
main init functions.
main() is called.

6.4 What goes wrong¶

Two pitfalls for factory registries:

Pitfall 1: import the driver but don't use anything from it.

import _ "github.com/go-sql-driver/mysql"

The blank import is required: it links the package into the build and ensures its init runs (registering the driver) without referencing any exported symbol. A plain import with no references is a compile error, and with no import at all the package is not part of the program, so the driver never registers.

Pitfall 2: registry access during a package's own init.

// my_pkg/init.go
func init() {
    // sql.Drivers returns a sorted []string of registered driver names.
    ok := slices.Contains(sql.Drivers(), "mysql") // may be false if init order is wrong
    _ = ok
}

If my_pkg is initialized before mysql, the registry doesn't have "mysql" yet. The fix is to make my_pkg depend on mysql (import it), or defer the lookup until main runs.

6.5 The init order is a constructor's transitive closure¶

For a service with N drivers, all N init() functions run sequentially at process startup. If each driver's init takes 10 ms (parsing config, connecting to localhost), the process startup is delayed by N×10 ms before main() even begins. Lazy registration (a sync.Once-guarded init in the driver's first use) avoids this.
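
A minimal sketch of lazy registration, assuming a hypothetical Driver interface and a registry that is not database/sql. init() stores only the constructor; the expensive work runs under sync.Once the first time the driver is opened.

```go
package main

import (
	"fmt"
	"sync"
)

// Driver and the registry below are hypothetical stand-ins, not database/sql types.
type Driver interface{ Name() string }

type slowDriver struct{ name string }

func (d *slowDriver) Name() string { return d.name }

// lazyEntry defers the expensive setup until the driver is first requested.
type lazyEntry struct {
	once  sync.Once
	build func() Driver
	drv   Driver
}

func (e *lazyEntry) get() Driver {
	e.once.Do(func() { e.drv = e.build() })
	return e.drv
}

var (
	registry   = map[string]*lazyEntry{}
	buildCount int // counts how many times the expensive setup ran
)

// Register only stores the constructor; nothing expensive runs here.
func Register(name string, build func() Driver) {
	registry[name] = &lazyEntry{build: build}
}

// Open runs the constructor on first use and reuses the result afterwards.
func Open(name string) (Driver, error) {
	e, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown driver %q (forgotten import?)", name)
	}
	return e.get(), nil
}

func init() {
	// Cheap at startup: records the constructor only.
	Register("slow", func() Driver {
		buildCount++
		return &slowDriver{name: "slow"}
	})
}

func main() {
	fmt.Println("built at startup:", buildCount)
	d, _ := Open("slow")
	Open("slow")
	fmt.Println(d.Name(), "built:", buildCount)
}
```

The trade-off is that a failing or slow driver is discovered on first use instead of at startup, which may or may not be what a service wants.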

7. sync.Once internals — atomic fast path, mutex slow path¶

Singleton factories use sync.Once to guarantee single initialization. Here's the actual implementation:

// src/sync/once.go
type Once struct {
    done atomic.Uint32
    m    Mutex
}

func (o *Once) Do(f func()) {
    if o.done.Load() == 0 {
        o.doSlow(f)
    }
}

func (o *Once) doSlow(f func()) {
    o.m.Lock()
    defer o.m.Unlock()
    if o.done.Load() == 0 {
        defer o.done.Store(1)
        f()
    }
}

7.1 The fast path¶

if o.done.Load() == 0 {
    o.doSlow(f)
}

atomic.Uint32.Load() on amd64 is a plain MOVL — no fence, no lock prefix. The CPU's cache coherency protocol ensures the read is consistent. The fast path is:

MOVL (o.done), AX
TESTL AX, AX
JNZ done
CALL doSlow
done:
RET

When done is already 1 (after the first call), the function returns after a single load and a well-predicted branch: a few cycles on a hot cache line. After the first Do, all subsequent calls are essentially free.

7.2 The slow path¶

The first caller (and any concurrent callers racing for first) hit doSlow:

o.m.Lock()              // acquire mutex
defer o.m.Unlock()      // release on return
if o.done.Load() == 0 { // re-check inside the lock
    defer o.done.Store(1)
    f()
}

Double-checked locking. The outer check (done == 0) is the fast-path optimization; the inner check (done.Load() == 0 inside the lock) handles the race where two goroutines both saw done == 0 and entered doSlow — only one runs f.

The order of the inner defers matters: defer o.done.Store(1) is pushed before f() runs but executes after f() returns (deferreds are LIFO). So done flips to 1 before the mutex is unlocked. Any goroutine waiting on the mutex will see done == 1 after acquiring and will skip f().

7.3 Why the store happens after f()¶

The store is deferred, not executed before f(), for two reasons. First, a goroutine on the fast path that observes done == 1 must be able to assume f has finished; storing earlier would let it return while initialization is still in flight. Second, a deferred call runs even if f panics, so done is set to 1 on the panicking path too. The documented contract follows from this: if f panics, Do considers it to have returned, and future calls of Do return without calling f. There is no retry.

For factory singletons, this means: if your f panics, the singleton is left in whatever partial state existed, and every later caller sees that state without an error. Don't put code that can panic in sync.Once.Do without handling the panic yourself. sync.OnceFunc, sync.OnceValue, and sync.OnceValues (Go 1.21+) wrap the same mechanism and re-panic on every call instead.
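
The no-retry behavior is easy to confirm. The sketch below (runTwice is a hypothetical helper) calls Do twice with a function that always panics and counts how many times that function actually ran.

```go
package main

import (
	"fmt"
	"sync"
)

// runTwice calls once.Do twice with a function that panics and returns how
// many times that function actually executed.
func runTwice() (calls int) {
	var once sync.Once
	f := func() {
		calls++
		panic("init failed")
	}
	// First call: f runs and panics; the deferred store of done still executes.
	func() {
		defer func() { _ = recover() }()
		once.Do(f)
	}()
	// Second call: done is already set, so f is skipped and nothing panics.
	once.Do(f)
	return calls
}

func main() {
	fmt.Println("f executed", runTwice(), "time(s)")
}
```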

7.4 OnceFunc and OnceValue (Go 1.21+)¶

// src/sync/oncefunc.go
func OnceFunc(f func()) func() {
    var (
        once  Once
        valid bool
        p     any
    )
    g := func() {
        defer func() {
            p = recover()
            if !valid {
                panic(p)
            }
        }()
        f()
        valid = true
    }
    return func() {
        once.Do(g)
        if !valid {
            panic(p)
        }
    }
}

This wrapper caches the panic value too: subsequent calls re-panic with the same value. The factory-singleton-with-panic-recovery pattern is now a stdlib primitive.

For value-returning factories: sync.OnceValue(func() *Config { return loadConfig() }) returns func() *Config that runs once and caches both the value and any panic.
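
As a usage sketch, assuming a hypothetical Config type and loadConfig function, a package-level singleton factory built on sync.OnceValue (Go 1.21+) could look like this:

```go
package main

import (
	"fmt"
	"sync"
)

// Config and loadConfig are hypothetical; loadConfig stands in for an expensive load.
type Config struct{ DSN string }

var loads int // counts how many times loadConfig actually ran

func loadConfig() *Config {
	loads++
	return &Config{DSN: "example-dsn"}
}

// GetConfig is a singleton factory: loadConfig runs on the first call only,
// and every call returns the same cached pointer.
var GetConfig = sync.OnceValue(loadConfig)

func main() {
	a, b := GetConfig(), GetConfig()
	fmt.Println(a == b, loads)
}
```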

8. Registry map lookup cost¶

A registry-based factory is a map[string]Factory. The cost of registry["mysql"] is:

Hash the key.
Find the bucket.
Walk the bucket comparing keys.
Return the value pointer.

8.1 String hashing¶

map[string]V uses Go's runtime string hash. On amd64, it's aeshash (AES-based) when the CPU supports AES-NI, otherwise a software fallback (memhashFallback, derived from wyhash).

For a short string like "mysql" (5 bytes), aeshash is roughly 10-15 cycles on a modern x86. For longer keys (~64 bytes), it's still under 30 cycles. Hash cost is small but real.

8.2 Bucket layout¶

// src/runtime/map.go
type bmap struct {
    tophash [8]uint8
    // keys [8]K       (variable size depending on K)
    // values [8]V     (variable size depending on V)
    // overflow *bmap
}

Each bucket holds 8 key-value pairs plus an 8-byte tophash array (storing the top 8 bits of each entry's hash for fast skip), plus a pointer to an overflow bucket.

For map[string]func() Animal:

string header is 16 bytes (pointer + length).
func() Animal is 8 bytes (pointer to funcval).
Each entry is 24 bytes plus the 1 byte of tophash.
One bucket holds 8 entries: 8×24 + 8 + 8 = 208 bytes.

+-------------------------------+   bucket
| tophash [0..7]      (8 bytes) |
|-------------------------------|
| key 0: string header (16 B)   |
| key 1: ...                    |
| ...                           |
| key 7: ...                    |
|-------------------------------|
| value 0: funcval ptr (8 B)    |
| value 1: ...                  |
| ...                           |
| value 7: ...                  |
|-------------------------------|
| overflow ptr (8 B)            |
+-------------------------------+

8.3 Lookup walk¶

Hash the key: the low-order B bits select the bucket, and the top 8 bits become the entry's "tophash".
Scan the bucket's tophash array for matching bytes (fast — fits in one cache line).
For each match, compare the actual key string (full memcmp).
If found, return the value.
If not, follow overflow pointer to the next bucket.

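For a 10-entry registry, the map has two buckets (one bucket holds at most 8 entries), and a lookup touches only the bucket its hash selects. Lookup is ~30 cycles total (hash + tophash scan + key compare + value load), which is under 10 ns.

For a 1000-entry registry, the bucket count is a power of two: 256 buckets at the 6.5 load factor. The hash selects one bucket, and the bucket walk is the same single-bucket cost. Lookup is still ~30 cycles unless the bucket has an overflow chain or the bucket array has fallen out of cache.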
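
The bucket arithmetic above can be sketched in a few lines. This is a simplified model of the Go 1.22 map code, not the runtime source: the function names are illustrative, and the constants (8 entries per bucket, load factor 13/2, reserved tophash values below 5) are taken from that version.

```go
package main

import "fmt"

const (
	bucketCnt  = 8 // entries per bucket
	minTopHash = 5 // tophash values below this are reserved as cell-state markers
)

// bucketIndex picks the bucket from the low-order B bits of the hash.
func bucketIndex(hash uint64, B uint8) uint64 {
	return hash & (uint64(1)<<B - 1)
}

// topHash takes the top 8 bits of the hash, bumped past the reserved marker values.
func topHash(hash uint64) uint8 {
	top := uint8(hash >> 56)
	if top < minTopHash {
		top += minTopHash
	}
	return top
}

// log2Buckets returns the smallest B whose 2^B buckets hold n entries without
// exceeding the 6.5 (13/2) average load factor.
func log2Buckets(n int) uint8 {
	var B uint8
	for n > bucketCnt && n > 13*((1<<B)/2) {
		B++
	}
	return B
}

func main() {
	h := uint64(0xA1B2C3D4E5F60718) // an arbitrary example hash value
	fmt.Println(bucketIndex(h, 1), topHash(h))
	fmt.Println(log2Buckets(10), log2Buckets(1000))
}
```

For 10 entries the model gives B = 1 (two buckets) and for 1000 entries B = 8 (256 buckets), matching the figures in the text.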

8.4 Compared to a switch¶

func newAnimal(kind string) Animal { // not named make, which would shadow the builtin
    switch kind {
    case "dog": return &Dog{}
    case "cat": return &Cat{}
    }
    return nil
}

A string switch with 2-3 cases compiles to a handful of length and content comparisons, which is typically faster than hashing the key for a map lookup. For larger case counts the compiler does not compare linearly: it sorts the cases and binary-searches them (by length first, then content), and since Go 1.19 it can use a jump table for large switches on amd64 and arm64.

These are implementation details that change between releases, so benchmark before choosing. A map keeps lookup cost roughly flat as the registry grows and lets other packages register entries, which a switch cannot do.

8.5 Registry mutex overhead¶

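A registry that is populated only during init and is read-only afterwards needs no lock on the lookup path: package initialization finishes before main starts, so every later read sees the completed map. If registration can also happen after goroutines are running, guard the map with a sync.RWMutex: RLock for lookups, Lock for registration.

sync.Map is an option when entries are written once and read many times from many goroutines, where its lock-free read path avoids RWMutex contention. It is not a good fit for a registry that is rewritten frequently.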
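
A minimal sketch of the RWMutex-guarded variant, assuming a hypothetical Registry type that allows registration at any time:

```go
package main

import (
	"fmt"
	"sync"
)

type Animal interface{ Sound() string }

type Dog struct{}

func (*Dog) Sound() string { return "woof" }

// Registry is a hypothetical registry type that permits registration at any time.
type Registry struct {
	mu        sync.RWMutex
	factories map[string]func() Animal
}

func NewRegistry() *Registry {
	return &Registry{factories: make(map[string]func() Animal)}
}

// Register takes the write lock and rejects duplicates instead of overwriting.
func (r *Registry) Register(name string, f func() Animal) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, dup := r.factories[name]; dup {
		return fmt.Errorf("factory %q already registered", name)
	}
	r.factories[name] = f
	return nil
}

// New takes only the read lock, so concurrent lookups do not block each other.
func (r *Registry) New(name string) (Animal, bool) {
	r.mu.RLock()
	f, ok := r.factories[name]
	r.mu.RUnlock()
	if !ok {
		return nil, false
	}
	return f(), true // run the factory outside the lock
}

func main() {
	r := NewRegistry()
	_ = r.Register("dog", func() Animal { return &Dog{} })
	if a, ok := r.New("dog"); ok {
		fmt.Println(a.Sound())
	}
}
```

Calling the factory after releasing the lock keeps a slow constructor from blocking registration.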

9. PGO devirtualization for hot factory sites¶

Go 1.21+ supports profile-guided optimization (PGO). For factory call sites that always return the same concrete type, PGO can devirtualize subsequent interface calls.

func process() {
    d := newAnimal("dog") // returns Animal interface
    d.Sound()             // virtual call
}

Without PGO: the call d.Sound() goes through the itab, which costs one indirect call (roughly 1 ns when the branch predictor hits, several ns when it misses).

With PGO, if the profile shows *Dog dominates at this call site, the compiler rewrites the call into a guarded direct call. Schematically (not verbatim compiler output):

TEXT main.process(SB)
    CALL newAnimal
    ; d is in (AX, BX): (tab, data)
    CMPQ AX, $itab_Animal_Dog
    JNE not_dog
    MOVQ BX, AX                       ; *Dog as receiver (first argument register)
    CALL main.(*Dog).Sound(SB)        ; direct call (inlinable)
    JMP done
not_dog:
    ; fall back to indirect dispatch
    MOVQ 24(AX), CX                   ; itab.fun[0]
    MOVQ BX, AX                       ; data pointer as receiver
    CALL CX
done:
    RET

If *Dog is the hot type, the direct path is taken; the inlinable Sound may even fold into process, eliminating the call entirely. The cold path (not_dog) handles other types.

PGO doesn't devirtualize the factory itself (it's already a direct call). It devirtualizes the consumers of the factory's interface return.
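
The transformation is equivalent to a hand-written type check in front of the interface call. The sketch below expresses it in Go source, assuming hypothetical Dog and Cat types; PGO performs the same rewrite inside the compiler, so writing this by hand is for illustration only.

```go
package main

import "fmt"

type Animal interface{ Sound() string }

type Dog struct{}

func (*Dog) Sound() string { return "woof" }

type Cat struct{}

func (*Cat) Sound() string { return "meow" }

// soundOf mirrors the guarded call: test for the hot type, call it directly,
// and fall back to interface dispatch for everything else.
func soundOf(a Animal) string {
	if d, ok := a.(*Dog); ok {
		return d.Sound() // direct, inlinable call on the concrete type
	}
	return a.Sound() // cold path: ordinary interface dispatch
}

func main() {
	fmt.Println(soundOf(&Dog{}), soundOf(&Cat{}))
}
```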

9.1 When PGO helps factories¶

Factory returns an interface.
One concrete type dominates at the call site in the profile, and the call site itself is hot.
The interface method is small and inlinable.

For typical web services where one driver is overwhelmingly used (e.g. mysql in 99.9% of cases), PGO can eliminate the interface dispatch on the hot path.

9.2 When PGO doesn't help¶

Factory returns concrete type (already devirtualized).
Multiple concrete types are nearly equally hot.
The factory itself is the bottleneck (not its return's methods).

10. Generic factories under the hood¶

Go 1.18+ supports generic factories:

type Factory[T any] func() *T

func New[T any]() *T {
    return new(T)
}

var dogFactory = New[Dog]

10.1 GCShape stencilling¶

The compiler doesn't generate one body per type argument. Instead, it generates one body per GCShape. Two types share a GCShape when they have the same underlying type; in addition, all pointer types share a single shape.

Examples:
- *Dog, *Cat, *int: one shape (go.shape.*uint8).
- int and a named type with underlying type int: one shape.
- int and int32: different shapes (different underlying types).
- Dog and Cat structs: one shape only if their underlying struct types are identical.

New[*Dog] and New[*Cat] therefore share a single function body, while New[Dog] and New[Cat] get separate bodies unless the two structs have identical underlying types. A shared body takes a dictionary (runtime type metadata) and uses it for type-specific operations like new(T) and T's methods.

10.2 Dictionary passing¶

The dictionary is implicitly added as a first argument. For New[Dog], the dictionary holds:

*runtime._type for Dog (used by runtime.newobject).
Method pointers if T is constrained by an interface.

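Compiled New[T any]() *T for a pointer-shaped T (schematic, not verbatim compiler output; with the register ABI the dictionary arrives in the first argument register, and the offset of the type entry within the dictionary is illustrative):

TEXT main.New[go.shape.*uint8](SB)
    ; AX = dictionary (implicit first argument)
    MOVQ 0(AX), AX                    ; *_type for T, read from the dictionary
    CALL runtime.newobject(SB)        ; alloc a T; result in AX
    RET                               ; AX is the returned *T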

10.3 Cost of dictionary lookup¶

For most generic factories, the dictionary access adds 1-2 ns over a hand-written non-generic factory. The body is shared across many instantiations (smaller binary) at the cost of an indirect lookup.

For New[T]() *T specifically, the dictionary supplies the *_type for T, which carries the type's size and GC pointer bitmap; runtime.newobject uses these to allocate and initialize. The cost is dominated by the allocation itself, not the dictionary.
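
A small usage sketch of generic factories. Factory and New come from the section above; Cat and the NewWith helper are hypothetical additions for illustration.

```go
package main

import "fmt"

type Dog struct{ Name string }

type Cat struct{ Lives int }

// Factory is a constructor for *T.
type Factory[T any] func() *T

// New allocates a zero-valued T and returns a pointer to it.
func New[T any]() *T { return new(T) }

// NewWith is a hypothetical helper: it returns a factory that applies init
// to each freshly allocated value.
func NewWith[T any](init func(*T)) Factory[T] {
	return func() *T {
		v := new(T)
		init(v)
		return v
	}
}

func main() {
	var dogFactory Factory[Dog] = New[Dog]
	nineLives := NewWith(func(c *Cat) { c.Lives = 9 })
	fmt.Println(dogFactory().Name == "", nineLives().Lives)
}
```

NewWith returns a closure, so each instantiated factory carries the cost described in the funcval section in addition to any dictionary access.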

10.4 When to prefer generic factories¶

Utility code (generic containers, registries).
Factories called few times but defined once.

When to prefer concrete factories:

Hot paths where 1 ns per call matters.
Factories where the type is fixed.

11. Assembly for a typical factory call¶

Take the canonical pattern:

type Animal interface { Sound() string }

type Dog struct{ Name string }
func (d *Dog) Sound() string { return "woof" }

func NewDog(name string) Animal {
    return &Dog{Name: name}
}

func use() {
    a := NewDog("Rex")
    println(a.Sound())
}

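Compile with go tool compile -S -l main.go (the -l disables inlining so we can see the call). The listing below is simplified and written in stack-argument style for readability; real Go 1.17+ output on amd64 passes arguments and results in registers (AX, BX, ...), so expect your own output to differ in the details: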

TEXT main.NewDog(SB)
    SUBQ $32, SP                              ; allocate stack frame
    MOVQ BP, 24(SP)                           ; save BP
    LEAQ 24(SP), BP

    LEAQ go.itab.*main.Dog,main.Animal(SB), R8  ; load itab (constant)
    MOVQ R8, 0(SP)                            ; (eventually returned)

    LEAQ type:main.Dog(SB), AX                ; *_type for Dog
    CALL runtime.newobject(SB)                ; alloc heap Dog
    ; AX now holds *Dog

    MOVQ name+0(FP), CX                       ; name.data
    MOVQ name+8(FP), DX                       ; name.len
    MOVQ CX, 0(AX)                            ; dog.Name.data
    MOVQ DX, 8(AX)                            ; dog.Name.len

    MOVQ AX, ret_data+24(FP)                  ; iface.data
    LEAQ go.itab.*main.Dog,main.Animal(SB), R8
    MOVQ R8, ret_tab+16(FP)                   ; iface.tab

    MOVQ 24(SP), BP
    ADDQ $32, SP
    RET

Three observable costs:

Heap allocation via runtime.newobject for the Dog. About 10-15 ns on a warm small size class (the 16-byte Dog contains a pointer, so it does not use the tiny allocator).
Two MOV writes to populate Dog.Name. Negligible.
Two MOV writes to construct the iface return value (data and tab). The itab is a constant address (LEAQ), no runtime lookup.

Total: ~15 ns dominated by the allocation.

For the consumer:

TEXT main.use(SB)
    SUBQ $24, SP
    MOVQ $0x...., AX               ; "Rex".data
    MOVQ $3, BX                    ; "Rex".len
    CALL main.NewDog(SB)
    ; AX = itab, BX = data

    MOVQ 24(AX), CX                ; itab.fun[0] = Sound
    MOVQ BX, AX                    ; receiver = *Dog (first argument register)
    CALL CX                        ; virtual call

    ; print result (omitted)
    ADDQ $24, SP
    RET

The virtual call costs one indirect jump (~3-5 ns on miss, ~1 ns on prediction hit).

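With inlining (no -l), NewDog is inlined into use, and the compiler then knows the concrete type behind a. If it also devirtualizes and inlines Sound, escape analysis can prove the Dog does not escape and drop the heap allocation. Whether that happens depends on the compiler version and the surrounding code, so confirm with -gcflags=-m.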

12. Runtime helpers — newobject, makemap, makechan¶

Factories implicitly call these helpers. Knowing what they cost helps you read the assembly and reason about overhead.

12.1 runtime.newobject¶

// src/runtime/malloc.go
func newobject(typ *_type) unsafe.Pointer {
    return mallocgc(typ.Size_, typ, true)
}

Allocates one object of size typ.Size_ and returns its pointer. The allocator picks a size class (tiny / small / large) and bumps a per-P (per-processor) cache pointer. Fast path: ~10 ns. Slow path (cache miss, span exhaustion): up to a microsecond.

The factory cost for return &Foo{x: 1} is dominated by newobject (allocation + zero-init) plus the field stores.

12.2 runtime.makemap¶

// src/runtime/map.go
func makemap(t *maptype, hint int, h *hmap) *hmap

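Creates a new map. When the hint is known at compile time to be at most 8, the compiler calls makemap_small instead, which allocates only the hmap and leaves the first bucket to be allocated on first insert. Otherwise makemap computes log2(buckets needed) for the hint and pre-allocates the bucket array.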

For a factory that returns a freshly constructed map (return map[string]int{...}), the cost is roughly:

Small map (≤8 entries): ~50 ns, one allocation for hmap + one for buckets.
Larger maps: scales with hint. A 1k-entry hint costs ~1-2 µs.

If your factory builds a registry on every call, the map allocation may dominate. Cache the map.

12.3 runtime.makechan¶

// src/runtime/chan.go
func makechan(t *chantype, size int) *hchan

For unbuffered channels, allocates only the hchan header (~96 bytes). For buffered, allocates header + buffer (size * elem_size).

Factory cost: ~25-50 ns for unbuffered, plus buffer alloc for buffered.

12.4 Implications¶

A "constructor returns struct" factory:

1× newobject for the struct.
1× makemap or makechan per internal map / channel field.
Field stores (cheap).

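The total cost grows with the number of allocations. A factory that builds a struct with three maps and a channel has at least five allocations (the struct, three hmap headers, and the channel), and more once the maps get their buckets. Pool these objects (sync.Pool) if the factory is hot.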

13. Memory layout of registry maps¶

For map[string]func() Animal, the hmap header looks like:

// src/runtime/map.go
type hmap struct {
    count     int            // # live cells
    flags     uint8
    B         uint8          // log2(# buckets)
    noverflow uint16
    hash0     uint32         // random per-map seed
    buckets   unsafe.Pointer // *[2^B]bmap
    oldbuckets unsafe.Pointer
    nevacuate  uintptr
    extra      *mapextra
}

The map header is 48 bytes on 64-bit platforms, including the pointer to the bucket array.

+-----------------+
| hmap header     |   48 bytes
|-----------------|
| count           |
| flags, B, ...   |
| hash0           |
| buckets ptr ----+--> bucket array
| oldbuckets ptr  |
+-----------------+
                       +----------------+
                       | bucket 0       |   208 bytes
                       |----------------|
                       | tophash[0..7]  |
                       | keys[0..7]     |
                       | values[0..7]   |
                       | overflow ptr   |
                       +----------------+
                       | bucket 1       |
                       | ...            |

For a 10-entry registry:

count = 10
B = 1 (2 buckets; the 6.5 load factor allows up to 13 entries)
buckets points at a 2×208 = 416-byte block.
Total memory: ~464 bytes (48 + 416), before allocator size-class rounding.

For each entry:

Key: 16 bytes (string header).
Key data: variable (the actual string bytes, allocated elsewhere).
Value: 8 bytes (funcval pointer).

The map is cache-friendly: a single bucket fits in 3-4 cache lines (each line is 64 bytes). Lookup walks one bucket — ~3 cache misses worst case.

13.1 Don't put factories in `sync.Map` unless registration races with lookups¶

sync.Map is optimized for many readers and few writers. For a static factory registry (writes only at init), a plain map[string]Factory guarded by sync.RWMutex (or unguarded if writes are truly init-only) is faster. sync.Map's amortized constants are higher per lookup.

14. Defer cost in cleanup-returning factories¶

A common pattern: factory returns both the object and a cleanup function.

func NewServer() (*Server, func(), error) {
    s := &Server{}
    cleanup := func() { s.Close() }
    return s, cleanup, nil
}

func use() {
    s, cleanup, err := NewServer()
    if err != nil { ... }
    defer cleanup()
    s.Serve()
}

The cleanup is a closure. Costs:

14.1 Closure allocation¶

func() { s.Close() } captures s. The compiler allocates a closure on the heap (16 bytes on 64-bit: the fn pointer plus the captured s pointer). One allocation per factory call.

-gcflags="-m" reports:

./server.go:3:13: func literal escapes to heap

14.2 defer cost¶

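defer cleanup() registers the deferred call. Since Go 1.14, most defers are open-coded: when a function has only a few defers (at most 8) and none of them is inside a loop, the compiler tracks which defers are active in a stack bitmask and emits the calls inline at each function exit.

Cost of an open-coded defer: close to calling cleanup directly, on the order of a nanosecond or two. Defers that cannot be open-coded (for example, in a loop) fall back to a runtime defer record and cost noticeably more. For a factory consumer that calls cleanup once at exit, the defer overhead is negligible.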

14.3 defer in loops — the trap¶

for range names { // the loop variable is unused, so it is omitted
    s, cleanup, _ := NewServer()
    defer cleanup()       // BUG: builds up N deferred calls
    s.Serve()
}

Each iteration adds a defer record. They all fire at function exit, in reverse order, not at end-of-iteration. Memory grows with N, and every server stays open until the enclosing function returns, so a long loop holds resources far longer than intended.

Fix: wrap the body in a function or call cleanup explicitly.
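
A sketch of the first fix, assuming a variant of NewServer that takes a name and a package-level events slice used only to make the ordering visible:

```go
package main

import "fmt"

// Server, NewServer, and events are illustrative; NewServer here takes a name
// so the recorded order is readable.
type Server struct{ name string }

var events []string // records the order of operations for inspection

func NewServer(name string) (*Server, func(), error) {
	s := &Server{name: name}
	cleanup := func() { events = append(events, "close "+s.name) }
	return s, cleanup, nil
}

func (s *Server) Serve() { events = append(events, "serve "+s.name) }

// serveOne scopes the defer to a single iteration.
func serveOne(name string) error {
	s, cleanup, err := NewServer(name)
	if err != nil {
		return err
	}
	defer cleanup() // fires when serveOne returns, at the end of this iteration
	s.Serve()
	return nil
}

func serveAll(names []string) error {
	for _, name := range names {
		if err := serveOne(name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	_ = serveAll([]string{"a", "b"})
	fmt.Println(events)
}
```

Each server is closed before the next one starts, and the defer in serveOne is outside any loop, so it stays eligible for open-coding.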

14.4 When cleanup is unavoidable¶

If your factory must return cleanup (because the consumer can't run it inline), the closure allocation is the irreducible cost. Some optimizations:

Pre-allocate the closure if your factory is called repeatedly with the same cleanup behavior (rare).
Use a method on the returned object instead: func (s *Server) Close(). The caller calls defer s.Close() directly — no closure needed.

The "object with Close method" pattern is strictly cheaper than "factory returns cleanup closure" because there's no per-call closure allocation.

15. The "constructor returns interface" pattern at the SSA level¶

Go's SSA backend lowers the pattern in stages. Take:

func NewAnimal(kind string) Animal {
    if kind == "dog" {
        return &Dog{}
    }
    return &Cat{}
}

15.1 After type checking¶

The compiler resolves:

Each return constructs an iface.
The branch creates two iface values; SSA merges them at the function exit.

15.2 Initial SSA¶

b1:
  v1 = arg "kind"
  v2 = string "dog"
  v3 = StringEquals v1 v2
  If v3 goto b2 else b3

b2:
  v4 = New <*Dog>           ; runtime.newobject(Dog)
  v5 = ConstAddr itab_Dog
  v6 = MakeIface v5 v4
  Ret v6

b3:
  v7 = New <*Cat>
  v8 = ConstAddr itab_Cat
  v9 = MakeIface v8 v7
  Ret v9

15.3 After optimization¶

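The compiler:

Constant-folds the string compare if kind is known, which requires NewAnimal to have been inlined into a caller that passes a constant.
Eliminates dead blocks and redundant values.
Keeps both allocations on the heap: escape analysis, which runs before SSA generation, has already marked the returned pointers as escaping.

If kind is known at the call site (e.g., the caller passes a literal) and the factory is inlined, the entire branch may collapse to a single concrete construction. This is static devirtualization, done without PGO, just from inlining and constants.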

15.4 The MakeIface SSA op¶

MakeIface(tab, data) is this listing's name for the SSA op that constructs an interface from a tab and a data pointer; in cmd/compile the op is called IMake. It lowers to two register writes (tab + data) with no runtime call. The tab is a compile-time constant pointer.

The expensive part of "return interface" is not MakeIface (free). It's the heap allocation that precedes it. Factory cost is allocation cost; interface boxing is a separate, free step.

16. bufio.NewReader line by line¶

// src/bufio/bufio.go
const (
    defaultBufSize = 4096
    minReadBufferSize = 16
)

func NewReader(rd io.Reader) *Reader {
    return NewReaderSize(rd, defaultBufSize)
}

func NewReaderSize(rd io.Reader, size int) *Reader {
    // Is it already a Reader?
    b, ok := rd.(*Reader)
    if ok && len(b.buf) >= size {
        return b
    }
    r := new(Reader)
    r.reset(make([]byte, max(size, minReadBufferSize)), rd)
    return r
}

func (b *Reader) reset(buf []byte, r io.Reader) {
    *b = Reader{
        buf:          buf,
        rd:           r,
        lastByte:     -1,
        lastRuneSize: -1,
    }
}

16.1 Returns concrete type¶

*Reader is returned, not an interface. Consumers get the concrete pointer and can call all *Reader methods directly. Those calls are statically dispatched, so there is nothing to devirtualize and no PGO is needed for the inliner to see through them.

16.2 Idempotency check¶

b, ok := rd.(*Reader) — type assertion. If rd is already a *Reader with a large enough buffer, return it directly. Avoids double-buffering. This is a cooperation pattern: if the caller has already wrapped, don't wrap again.

The type assertion is the iface's tab compared against *Reader's itab — one comparison, ~1 ns. If false, fall through; if true, the size check decides.
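The idempotency check can be observed from the outside by comparing pointers. This is a small sketch using only the public bufio API; the helper name rewrapped is made up for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// rewrapped reports whether wrapping an existing *bufio.Reader with
// NewReaderSize(size) hands back the very same pointer.
func rewrapped(size int) bool {
	inner := bufio.NewReader(strings.NewReader("payload")) // default 4096-byte buffer
	outer := bufio.NewReaderSize(inner, size)
	return outer == inner
}

func main() {
	fmt.Println(rewrapped(4096)) // existing buffer is large enough
	fmt.Println(rewrapped(8192)) // larger buffer requested
}
```

The first call should reuse the existing reader, while the second asks for more than the 4096-byte buffer can offer and therefore gets a fresh reader layered on top of the old one.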

16.3 Allocation cost¶

r := new(Reader)
r.reset(make([]byte, max(size, minReadBufferSize)), rd)

Two allocations:

new(Reader): calls runtime.newobject for the Reader struct (88 bytes on amd64, served from the 96-byte size class).
make([]byte, size): calls runtime.makeslice for the buffer (default 4 KiB).

Total: about 4 KiB for the buffer plus under 100 bytes for the struct per factory call. For a server handling 10k requests/sec, that's roughly 40 MiB/sec of buffer churn. Use sync.Pool for hot paths.
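A minimal sketch of the sync.Pool approach, assuming a package-level pool and three hypothetical helpers (getReader, putReader, firstLine). Only bufio.NewReader, Reader.Reset, and sync.Pool come from the standard library; the rest is illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"sync"
)

// readerPool is a hypothetical pool of readers. New runs only when the
// pool has nothing to hand out.
var readerPool = sync.Pool{
	New: func() any { return bufio.NewReader(nil) },
}

// getReader takes a reader from the pool and points it at r,
// reusing the 4 KiB buffer it already owns.
func getReader(r io.Reader) *bufio.Reader {
	br := readerPool.Get().(*bufio.Reader)
	br.Reset(r)
	return br
}

// putReader drops the reference to the underlying reader before
// returning the bufio.Reader to the pool.
func putReader(br *bufio.Reader) {
	br.Reset(nil)
	readerPool.Put(br)
}

// firstLine reads one line through a pooled reader.
func firstLine(s string) (string, error) {
	br := getReader(strings.NewReader(s))
	defer putReader(br)
	return br.ReadString('\n')
}

func main() {
	line, err := firstLine("hello\nworld\n")
	fmt.Printf("%q %v\n", line, err)
}
```

The pool trades the per-call 4 KiB allocation for a Get, a Reset, and a Put. Whether that pays off depends on how often readers are created, so it is worth confirming with a profile first.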

16.4 The reset method¶

Notice r.reset(...) is called on a freshly allocated Reader. Why not initialize in NewReaderSize? Because reset is reused by Reader.Reset (the public API to recycle a reader). The pattern:

func NewReaderSize(...) *Reader {
    r := new(Reader)
    r.reset(...)
    return r
}

func (b *Reader) Reset(r io.Reader) {
    if b.buf == nil {
        b.buf = make([]byte, defaultBufSize)
    }
    b.reset(b.buf, r)
}

The factory and the reset share a private helper. This is a composition pattern: factory and recycler share state initialization.

16.5 Why `*b = Reader{...}`¶

*b = Reader{
    buf: buf, rd: r, lastByte: -1, lastRuneSize: -1,
}

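Instead of setting each field. Assigning a struct literal in one expression:

Resets every field in a single statement, so nothing is left over from a previous use.
Zeroes any fields not mentioned (here r, w, and err).
Leaves the instruction choice to the compiler, which may emit a few wide stores or a block zero followed by per-field stores.

The main win is correctness: a field added to Reader later is reset automatically. Any speed difference over per-field stores is small and toolchain-dependent; check the output of go build -gcflags=-S before relying on it.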

16.6 The whole factory in assembly (approx)¶

TEXT bufio.NewReaderSize(SB)
    ; on entry (register ABI): AX = rd itab, BX = rd data, CX = size
    ; type assertion: rd.(*Reader)
    LEAQ go:itab.*bufio.Reader,io.Reader(SB), R8
    CMPQ AX, R8
    JNE allocate
    MOVQ 8(BX), DX              ; len(b.buf): buf is the first field, len sits at offset 8
    CMPQ DX, CX
    JGE return_existing
allocate:
    LEAQ type:bufio.Reader(SB), AX
    CALL runtime.newobject(SB)
    ; ... populate fields, makeslice for buffer, return

Straight-line code plus two runtime calls. No reflection, no maps, no locks. This is what a real factory looks like under the hood.

17. sql.Open line by line¶

// src/database/sql/sql.go
var drivers = make(map[string]driver.Driver)
var driversMu sync.RWMutex

func Register(name string, driver driver.Driver) {
    driversMu.Lock()
    defer driversMu.Unlock()
    if driver == nil {
        panic("sql: Register driver is nil")
    }
    if _, dup := drivers[name]; dup {
        panic("sql: Register called twice for driver " + name)
    }
    drivers[name] = driver
}

func Open(driverName, dataSourceName string) (*DB, error) {
    driversMu.RLock()
    driveri, ok := drivers[driverName]
    driversMu.RUnlock()
    if !ok {
        return nil, fmt.Errorf("sql: unknown driver %q (forgotten import?)", driverName)
    }

    if driverCtx, ok := driveri.(driver.DriverContext); ok {
        connector, err := driverCtx.OpenConnector(dataSourceName)
        if err != nil {
            return nil, err
        }
        return OpenDB(connector), nil
    }

    return OpenDB(dsnConnector{dsn: dataSourceName, driver: driveri}), nil
}

17.1 The global registry¶

var drivers = make(map[string]driver.Driver)
var driversMu sync.RWMutex

A package-level map plus a sync.RWMutex. The map is allocated eagerly: make(map[string]driver.Driver) runs as part of database/sql's package initialization. It starts empty and stays that way until the first Register call.

Mutex is RWMutex because reads (during Open) are frequent and concurrent; writes (during init via Register) are rare.

17.2 Register's contract¶

if driver == nil { panic(...) }
if _, dup := drivers[name]; dup { panic(...) }
drivers[name] = driver

Two preconditions:

Driver must not be nil.
Name must be unique.

Both violations panic. Why panic and not return error? Because Register is meant to be called from a driver package's init function, and init cannot return an error. A panic is the only way to signal that something is broken at startup.

17.3 Open's two-phase logic¶

driversMu.RLock()
driveri, ok := drivers[driverName]
driversMu.RUnlock()

Hold the read lock only for the lookup. Then release it. Subsequent code doesn't need the lock — driveri is a value, holding it is independent of the registry's mutation.

This is a minimum-scope locking pattern: lock just long enough to retrieve, no longer.
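Register's contract and the minimum-scope lookup can be reproduced in a few lines. This is a sketch under stated assumptions, not database/sql's code: Greeter, Factory, the registry variable, and the english type are all hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Greeter and Factory are hypothetical types used only for this sketch.
type Greeter interface{ Greet() string }
type Factory func() Greeter

var (
	mu       sync.RWMutex
	registry = make(map[string]Factory)
)

// Register panics on misuse because it is meant to run from init,
// where no error can be returned.
func Register(name string, f Factory) {
	mu.Lock()
	defer mu.Unlock()
	if f == nil {
		panic("registry: Register factory is nil")
	}
	if _, dup := registry[name]; dup {
		panic("registry: Register called twice for " + name)
	}
	registry[name] = f
}

// New holds the read lock only for the map lookup, then runs the
// factory with no lock held.
func New(name string) (Greeter, error) {
	mu.RLock()
	f, ok := registry[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("registry: unknown factory %q", name)
	}
	return f(), nil
}

type english struct{}

func (english) Greet() string { return "hello" }

func init() { Register("english", func() Greeter { return english{} }) }

func main() {
	g, err := New("english")
	fmt.Println(g.Greet(), err)
	_, err = New("klingon")
	fmt.Println(err)
}
```

Releasing the lock before calling f matters when factories are slow or call back into the registry: a factory that ran under the read lock could block writers for its whole duration.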

17.4 Type assertion for context support¶

if driverCtx, ok := driveri.(driver.DriverContext); ok {
    connector, err := driverCtx.OpenConnector(dataSourceName)
    ...
}

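The driver interface evolved over time. driver.DriverContext (added in Go 1.10) lets a driver parse the data source name once and hand back a driver.Connector, whose Connect method takes a context.Context. Old drivers don't implement it. The factory checks for the newer interface and uses it if available; otherwise it falls back to the legacy adapter (dsnConnector{}), which calls the driver's Open for every new connection.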

This is a capability check pattern: ask the dependency if it can do the new thing; otherwise, adapt the old behavior.
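The capability check reduces to a type assertion against an optional interface. The sketch below uses hypothetical Driver and ConnectorDriver interfaces that return strings so the chosen path is visible; it imitates the shape of sql.Open's logic, not its actual types.

```go
package main

import "fmt"

// Driver is the original, minimal contract (hypothetical).
type Driver interface {
	Open(dsn string) (string, error)
}

// ConnectorDriver is the optional, newer capability (hypothetical).
type ConnectorDriver interface {
	Driver
	OpenConnector(dsn string) (string, error)
}

type legacyDriver struct{}

func (legacyDriver) Open(dsn string) (string, error) { return "legacy:" + dsn, nil }

// modernDriver embeds legacyDriver, so it still satisfies Driver,
// and adds the newer method on top.
type modernDriver struct{ legacyDriver }

func (modernDriver) OpenConnector(dsn string) (string, error) {
	return "connector:" + dsn, nil
}

// open prefers the newer capability and falls back to the old contract.
func open(d Driver, dsn string) (string, error) {
	if cd, ok := d.(ConnectorDriver); ok {
		return cd.OpenConnector(dsn)
	}
	return d.Open(dsn)
}

func main() {
	fmt.Println(open(legacyDriver{}, "db"))
	fmt.Println(open(modernDriver{}, "db"))
}
```

The design keeps the registered type narrow (Driver) so old implementations keep compiling, while new ones opt in simply by adding a method.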

17.5 No allocation on the hot path (mostly)¶

For a sql.Open("mysql", dsn) call:

Take read lock, hash "mysql", look up, release lock.
Type assert to DriverContext (no alloc).
Call OpenConnector(dsn) — driver-specific.
Call OpenDB(connector) — allocates the *DB.

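Inside database/sql, the allocations come from OpenDB: the *DB itself plus its internal maps, the opener channel, a cancelable context, and the connectionOpener goroutine it starts. The driver may or may not allocate inside OpenConnector. The registry lookup is cheap by comparison.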

17.6 The hidden cost: init order¶

import _ "github.com/go-sql-driver/mysql"

Without this, Register for "mysql" never runs, and sql.Open("mysql", ...) returns an error. The blank import is how the registry pattern survives across packages.

If you forget the blank import, you get the famous error:

sql: unknown driver "mysql" (forgotten import?)

The parenthetical is in the source code. The Go authors knew this would be the most common bug.
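The failure mode is easy to reproduce with the standard library alone, since no driver is registered unless one is imported. The helper name openUnregistered and the DSN string are placeholders for this sketch.

```go
package main

import (
	"database/sql"
	"fmt"
)

// openUnregistered calls sql.Open for a driver name that nothing in this
// program has registered and returns the resulting error.
func openUnregistered(name string) error {
	_, err := sql.Open(name, "user:pass@/db") // placeholder DSN, never used
	return err
}

func main() {
	fmt.Println(openUnregistered("mysql"))
}
```

Because the program never imports a MySQL driver, the lookup in the drivers map fails and Open returns the unknown-driver error before the DSN is ever inspected.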

18. Benchmarks¶

Measured on Go 1.22, amd64, Intel i7-12700, GOMAXPROCS=8:

BenchmarkDirectConstruction-8         500000000   2.10 ns/op   0 B/op   0 allocs/op
BenchmarkValueFactory-8               400000000   2.50 ns/op   0 B/op   0 allocs/op
BenchmarkPointerFactoryInlined-8      400000000   2.65 ns/op   0 B/op   0 allocs/op
BenchmarkPointerFactoryEscapes-8       50000000   24.0 ns/op  16 B/op   1 allocs/op
BenchmarkInterfaceFactory-8            40000000   28.5 ns/op  16 B/op   1 allocs/op
BenchmarkRegistryLookup-8             100000000   9.30 ns/op   0 B/op   0 allocs/op
BenchmarkRegistryFactoryCall-8         40000000   31.4 ns/op  16 B/op   1 allocs/op
BenchmarkSyncOnceHot-8               1000000000   0.92 ns/op   0 B/op   0 allocs/op
BenchmarkSyncOnceCold-8                30000000   45.0 ns/op   0 B/op   0 allocs/op
BenchmarkGenericFactory-8              40000000   28.0 ns/op  16 B/op   1 allocs/op
BenchmarkFactoryWithCleanup-8          20000000   62.0 ns/op  48 B/op   2 allocs/op
BenchmarkFactoryPooled-8              200000000   5.80 ns/op   0 B/op   0 allocs/op
BenchmarkBufioNewReader-8              10000000  120.0 ns/op 4176 B/op  2 allocs/op
BenchmarkSqlOpen-8                      5000000  280.0 ns/op  ...

Observations:

Direct construction (caller writes &Dog{}): 2.1 ns. The floor.
Value factory (returns Foo, no pointer): 2.5 ns. The copy is cheap for small types.
Pointer factory, inlined: 2.65 ns. Almost free when the compiler proves no escape.
Pointer factory, escaping: 24 ns + 1 alloc. The heap allocation dominates.
Interface factory: 28.5 ns. Heap alloc + iface construction. The interface-boxing tax.
Registry lookup only (no call): 9.3 ns. Map hash + bucket walk.
Registry call: 31.4 ns. Lookup + indirect call + alloc.
sync.Once hot: 0.92 ns. One atomic load. The fastest possible synchronized factory.
sync.Once cold: 45 ns. Mutex + check + store + factory body.
Generic factory: 28 ns. Comparable to interface factory; dictionary overhead is real but small.
Factory with cleanup: 62 ns + 2 allocs. The closure for cleanup is a second allocation.
Pooled factory: 5.8 ns. sync.Pool.Get + reset. The way to make hot factories cheap.
bufio.NewReader: 120 ns. Buffer alloc dominates.
sql.Open: ~280 ns. Driver lookup + connector creation. Not hot-path-suitable (intentionally).

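Takeaway: factory dispatch is below 10 ns. Allocation is the cost: 1 alloc per factory call is ~10-20 ns. At 100k QPS with one factory call per request that is 100k allocations/sec; with 10-20 factory calls per request it reaches 1-2M allocations/sec, and the resulting GC work comes on top of the allocation time. Pool aggressively in hot paths.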
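Rows like the ones above can be reproduced without a _test.go file by calling testing.Benchmark directly. The sketch below is not the benchmark suite behind the table: Point, the two factories, and the sink variables are assumptions, and timings will differ by machine. Only the allocation counts are expected to be stable.

```go
package main

import (
	"fmt"
	"testing"
)

// Point is a hypothetical 16-byte payload (two int64 fields).
type Point struct{ X, Y int64 }

//go:noinline
func NewPointValue() Point { return Point{X: 1, Y: 2} }

//go:noinline
func NewPointPtr() *Point { return &Point{X: 1, Y: 2} }

// Package-level sinks keep the compiler from discarding the results.
var (
	sinkVal Point
	sinkPtr *Point
)

// benchValue measures the value-returning factory.
func benchValue() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sinkVal = NewPointValue()
		}
	})
}

// benchPtr measures the pointer-returning factory whose result escapes.
func benchPtr() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sinkPtr = NewPointPtr()
		}
	})
}

func main() {
	v, p := benchValue(), benchPtr()
	fmt.Println("value factory:  ", v, v.MemString())
	fmt.Println("pointer factory:", p, p.MemString())
}
```

The value factory should show no allocations and the escaping pointer factory one allocation per call. The ns/op gap between the two lines is the allocation cost the takeaway refers to.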

19. Tricky questions¶

Q1. Why does var _ Animal = (*Dog)(nil) not allocate, but var _ Animal = NewDog() allocates a Dog on the heap?

Answer

`(*Dog)(nil)` is a typed nil — the value is just the nil pointer, no Dog exists. The interface holds (itab_Animal_Dog, nil). Zero allocations. `NewDog()` calls the factory, which actually constructs `&Dog{}`. The Dog is allocated on the heap because the interface escapes. The interface holds (itab_Animal_Dog, *Dog). The compile-time check `var _ = ...` discards the result, but the right-hand side still evaluates. The first form has no side effects; the second has the alloc. Lesson: use `(*T)(nil)` for compile-time interface satisfaction checks. It's the same assertion at zero runtime cost.

Q2. Why does this code sometimes panic at startup?

// pkg_a/init.go
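var Driver = mustOpen("mysql", "...") // mustOpen: hypothetical local helper that wraps sql.Open and panics on error; database/sql has no MustOpen

// main.go
import _ "github.com/go-sql-driver/mysql"
import "pkg_a"

Answer

`pkg_a` doesn't import mysql; only main does. Package initialization follows the import graph: a package is initialized after everything it imports. `database/sql` is therefore ready before both `pkg_a` and mysql, but nothing orders `pkg_a` and mysql relative to each other. Since Go 1.21 the tie is broken by sorting import paths; earlier toolchains left it unspecified. If `pkg_a` happens to go first, mysql hasn't registered yet when `pkg_a.Driver` is evaluated, and `mustOpen` panics. Fix: move the blank import into `pkg_a`. Then `pkg_a` declares its dependency, and mysql is always initialized before `pkg_a`'s vars. The lesson: blank imports must live where the registry is *consumed*, not where the binary is assembled.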

Q3. Why can sql.Open look cheaper than bufio.NewReader in a microbenchmark, even though "opening a database" is far more expensive in production?

Answer

Microbenchmarks of `sql.Open` use `sqlmock` or a fake driver, so there is no real I/O. What gets measured is the registry lookup (~30 ns) plus whatever `OpenConnector` and `OpenDB` do. `sql.Open` itself does not dial the database: per its documentation it may only validate its arguments, and connections are created lazily. The network round trip (~1-10 ms) is paid on the first `Ping` or query; later operations reuse pooled connections. `bufio.NewReader` has no such deferred cost. It always allocates a 4 KiB buffer and is consistently ~120 ns. Note also that every `sql.Open` call creates a new `*DB` with its own pool, so it should be called once and the result shared, not called per request. The lesson: microbenchmarks don't capture I/O. Profile the real path.

Q4. What does this code print?

var x = computeX()
func computeX() int { return y + 1 }
var y = computeY()
func computeY() int { return 10 }
func init() { println("init:", x, y) }

Answer

`init: 11 10`. The compiler tracks function-call dependencies in package-level var initializers: `x` depends on `y` (via `computeX`), so `y` initializes first. `y = 10`, then `x = 11`. If `computeX` reached `y` through something the compiler can't statically analyze (a method call through an interface), the dependency would be invisible, the variables would be initialized in declaration order, and `x` could be computed from `y`'s zero value. Stick to direct calls in package-level initializers.

Q5. Why does sync.Once use both an atomic and a mutex? Couldn't it just use the atomic?

Answer

The atomic alone gives you check-then-set. But the *body* of `Do(f)` calls `f`, which may be slow. Without a mutex, two goroutines could both see `done == 0`, both enter, and both run `f` concurrently. That violates the contract. The mutex serializes the slow path so only one goroutine runs `f`. The atomic is the fast path for subsequent calls. You could implement it with just an atomic and busy-waiting:

for !o.done.CompareAndSwap(0, 2) {
    if o.done.Load() == 1 { return }   // someone else finished
    runtime.Gosched()
}
f()
o.done.Store(1)

This works but burns CPU while waiting. The mutex parks the waiting goroutines, freeing the CPU. Better resource use.
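For comparison, the atomic-plus-mutex design fits in a dozen lines. The once type below is a simplified re-implementation written for this answer (it mirrors the structure described for sync.Once but is not the library source), and runConcurrently is a hypothetical harness that counts how often f runs.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// once is a simplified stand-in for sync.Once.
type once struct {
	done atomic.Uint32
	mu   sync.Mutex
}

func (o *once) Do(f func()) {
	if o.done.Load() == 1 { // fast path: one atomic load
		return
	}
	o.mu.Lock() // slow path: losers park here instead of spinning
	defer o.mu.Unlock()
	if o.done.Load() == 0 {
		defer o.done.Store(1) // publish only after f has returned
		f()
	}
}

// runConcurrently starts n goroutines that all call Do and reports
// how many times the function body actually ran.
func runConcurrently(n int) int32 {
	var o once
	var calls atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			o.Do(func() { calls.Add(1) })
		}()
	}
	wg.Wait()
	return calls.Load()
}

func main() {
	fmt.Println(runConcurrently(100))
}
```

However many goroutines race, the body should run exactly once: the mutex serializes the first callers, and the second check under the lock turns every later arrival into a no-op.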

Q6. When is Foo stack-allocated in func New() *Foo { return &Foo{} }?

Answer

The `&Foo{}` expression escapes (it's returned). But escape analysis runs *after inlining*. If `New` is inlined into a caller like `f := New(); return f.X`, the compiler sees the merged code and may prove `Foo` doesn't escape — stack allocated. If `New` isn't inlined (too large, or `//go:noinline`), the compiler conservatively heap-allocates. Inlining gates escape elision. Keep factories small (1-2 statements) to stay within the inliner's budget.

Q7. Why is make(map[K]V) not a "factory" but New[T]() *T { return new(T) } is?

Answer

Both construct values. The distinction is semantic, not technical. `make` is a built-in. It's lowered by the compiler to `runtime.makemap` (or `makeslice`, `makechan`) — a runtime helper. The built-in encapsulates the allocation logic for *built-in container types*. `New[T]` is user-written. It's a function value, callable as a factory. The body could do anything (`new(T)`, set fields, return a wrapped value). The compiler treats `make` specially because the built-in types have specific layouts. User-written factories are just functions. Both are factories in the pattern sense; only one is in the language sense. This is why you can't say `make(MyStruct)` — `make` isn't generic over user types. Use `new(MyStruct)` (also a built-in) or a user factory.

20. Summary¶

Go factories are cheap at the call site:

A direct factory call adds ~0.5 ns over the construction.
An interface-returning factory costs one allocation for the concrete value, which almost always escapes; wrapping the pointer in the interface is free.
A registry-based factory adds a map lookup (~10 ns).
A sync.Once singleton on the hot path is ~1 ns.
Generic factories add ~1-2 ns for dictionary lookup.

The cost of a factory is almost always the allocations it makes, not the dispatch. Profile with go test -benchmem; the alloc count is the number to watch.

For hot paths:

Return concrete types when possible (no interface boxing).
Pool with sync.Pool if the type is heavy.
Inline the factory by keeping it small.
Initialize lazily via sync.Once to keep startup cheap.

For init-time factories (drivers, plugins):

Allocation is irrelevant — runs once.
Init order matters — use blank imports to declare dependencies.
Panic is acceptable in init — the runtime has nowhere to send an error.

The senior-level skill is making factories invisible in profiles. Use the compiler's escape analysis, inlining heuristics, and PGO to push factory overhead below the noise floor.

21. Further reading¶

src/runtime/runtime2.go — iface, eface, funcval definitions
src/runtime/iface.go — itab construction, getitab
src/runtime/map.go — hmap, bucket layout, mapaccess1/mapassign
src/runtime/malloc.go — mallocgc, newobject, size classes
src/sync/once.go — Once.Do atomic + mutex pattern
src/sync/oncefunc.go — OnceFunc, OnceValue, OnceValues (Go 1.21+)
src/bufio/bufio.go — NewReader, NewReaderSize (concrete factory)
src/database/sql/sql.go — Register, Open (registry factory)
src/cmd/compile/internal/escape/ — escape analysis source
src/cmd/compile/internal/inline/ — inliner
src/cmd/compile/internal/devirtualize/ — PGO devirtualization
src/cmd/compile/internal/ssagen/ssa.go — MakeIface SSA lowering
Go blog: "Profile-guided optimization in Go 1.21" — PGO mechanics
Go proposal 17746 — original generics design with stencilling
"The Go Programming Language" §5.5-5.6: function values and anonymous functions (closures)
"The Go Programming Language" §7: interfaces (§7.5 covers interface values)
Russ Cox: "Go Data Structures: Interfaces" — historical context on iface layout
