Registry — Optimization
1. How to use this file
Twelve scenarios where Registry code is slower, allocates more, or scales worse than it should. Each entry has a Before (code + benchmark) and a collapsible After (optimized code + benchmark + why + trade-offs + when NOT).
Anchored at Go 1.23, amd64. Numbers are reproducible-shape — run go test -bench=. -benchmem on your hardware before quoting them. Registry cost is dominated by four things: lock contention on the read path, map key hashing, interface boxing, and reflect-based key construction. Most wins remove one of those four from the hot path.
Reading order: Exercise 1 (RWMutex → sync.Map), 3 (atomic.Pointer), 5 (typed generic), then the rest in any order. Exercises 7, 8, 10 are the ones most senior code reviews flag.
2. Exercise 1 — RWMutex contention on a hot read path
A codec registry guards map[string]Codec with sync.RWMutex. At startup it's fine. Under sustained read load on a hot path (a handler resolves a codec per request), each RLock/RUnlock pair pays two atomic read-modify-write operations. Above ~1M reads/sec across cores, the RWMutex's internal readerCount turns into a cache line that ping-pongs between cores.
Before:
type Registry struct {
	mu sync.RWMutex
	m  map[string]Codec
}
func (r *Registry) Get(name string) (Codec, bool) {
	r.mu.RLock(); c, ok := r.m[name]; r.mu.RUnlock()
	return c, ok
}
BenchmarkRWMutexGet-8 20000000 78 ns/op 0 B/op 0 allocs/op
BenchmarkRWMutexGet_Parallel-8 10000000 180 ns/op 0 B/op 0 allocs/op // 8 cores
After
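As a rough sketch, the `sync.Map` version of the registry above might look like the following. The `Codec` interface and the `jsonCodec` type are placeholders invented for this example, since the section does not show the real definitions.

```go
package main

import (
	"fmt"
	"sync"
)

// Codec is a hypothetical stand-in for the registry's value type.
type Codec interface {
	Name() string
}

// jsonCodec is a placeholder implementation used only for the demo.
type jsonCodec struct{}

func (jsonCodec) Name() string { return "json" }

// Registry stores codecs in a sync.Map keyed by name.
type Registry struct {
	m sync.Map // string -> Codec
}

// Register stores c under name, replacing any previous entry.
func (r *Registry) Register(name string, c Codec) {
	r.m.Store(name, c)
}

// Get returns the codec registered under name, if any.
func (r *Registry) Get(name string) (Codec, bool) {
	v, ok := r.m.Load(name)
	if !ok {
		return nil, false
	}
	c, ok := v.(Codec) // sync.Map stores any, so every Load needs an assertion
	return c, ok
}

func main() {
	var r Registry
	r.Register("json", jsonCodec{})

	c, ok := r.Get("json")
	fmt.Println(ok, c.Name())

	_, ok = r.Get("yaml")
	fmt.Println(ok)
}
```

The zero value of `sync.Map` is ready to use, so this `Registry` needs no constructor.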
`sync.Map` for read-mostly, scattered-write workloads. Once a key has been promoted into the map's read-only snapshot, reads of it skip the mutex entirely. ~2.8× faster single-thread, ~5× contended.
**Why faster:** `sync.Map.Load` on a key in its `read` half is an atomic pointer load of the read-only snapshot, a plain map lookup, and an atomic load of the entry's value, with no per-call atomic write. RWMutex always pays the `readerCount` add/sub, even when uncontended.
**Trade-off:** ~80 B per entry (vs ~16 B in `map`); awkward `Range`-only iteration; type assertion per `Load`.
**When NOT:** Write-heavy registries (hot-swap per request). Ordered iteration needed. Tiny entry counts (≤16) where `RWMutex+map` is smaller in cache footprint.
3. Exercise 2 — String key hash on the hot path
Every Get("postgres") hashes the string. Go's map uses AES-NI string hash where available — fast, but O(len(key)). For 24-char keys called millions of times per second, hash plus equality adds up.
Before:
var codecs = map[string]Codec{}
func Get(name string) (Codec, bool) {
	c, ok := codecs[name] // hash + memcmp
	return c, ok
}
// Hot path: Get("application/vnd.company.v3+json") per request
After
For closed, compile-time-known sets, assign each value a small integer ID and key by ID. Resolve name→ID once at the parse boundary.
type CodecID uint16
const (
	CodecJSON CodecID = iota
	CodecMsgpack
	CodecProtobuf
	CodecYAML
)
var codecsByID [256]Codec // dense array indexed by CodecID
var nameToID = map[string]CodecID{} // name → ID table, filled at startup (contents not shown in the original)
func Register(id CodecID, c Codec) { codecsByID[id] = c }
func Get(id CodecID) (Codec, bool) {
	if int(id) >= len(codecsByID) { return nil, false }
	c := codecsByID[id]
	return c, c != nil
}
func ResolveID(name string) (CodecID, bool) { id, ok := nameToID[name]; return id, ok }
4. Exercise 3 — Mutex per Get for a read-mostly registry
A registry mutated at startup and read forever after still pays a mutex on every read "just in case". For startup-only registries, the read should be lock-free.
Before:
type Registry struct {
	mu sync.Mutex
	m  map[string]Handler
}
func (r *Registry) Get(name string) (Handler, bool) {
	r.mu.Lock(); defer r.mu.Unlock()
	h, ok := r.m[name]
	return h, ok
}
BenchmarkMutexGet-8 30000000 45 ns/op 0 B/op 0 allocs/op
BenchmarkMutexGet_Parallel-8 5000000 280 ns/op 0 B/op 0 allocs/op // serialized
After
Copy-on-write via `atomic.Pointer[map]`. Writers build a new map under a writer-only mutex and publish it with a single atomic store. Reads do a single atomic pointer load: no lock, no contention scaling problem.
type Registry struct {
	m       atomic.Pointer[map[string]Handler]
	writeMu sync.Mutex // serialize writers only
}
func NewRegistry() *Registry {
	r := &Registry{}
	empty := map[string]Handler{}
	r.m.Store(&empty)
	return r
}
func (r *Registry) Register(name string, h Handler) {
	r.writeMu.Lock(); defer r.writeMu.Unlock()
	old := r.m.Load()
	next := make(map[string]Handler, len(*old)+1)
	for k, v := range *old { next[k] = v }
	next[name] = h
	r.m.Store(&next) // readers see either the old map or the new one, never a partial copy
}
func (r *Registry) Get(name string) (Handler, bool) {
	h, ok := (*r.m.Load())[name]
	return h, ok
}
5. Exercise 4 — Linear scan to match a key
A pattern-matching router scans a slice of (prefix, handler) to find the longest prefix match. O(N) per request — 50 routes × 10k QPS is 500k comparisons/sec.
Before:
type Route struct{ Prefix string; H Handler }
var routes []Route
func Lookup(path string) (Handler, bool) {
	var best Route; var bestLen int
	for _, r := range routes {
		if strings.HasPrefix(path, r.Prefix) && len(r.Prefix) > bestLen {
			best, bestLen = r, len(r.Prefix)
		}
	}
	if bestLen == 0 { return nil, false }
	return best.H, true
}
After
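For the prefix case, a byte-level trie is one way to get O(len(path)) lookups. The sketch below is illustrative only: `Trie`, `node`, and the `Handler` signature are assumptions rather than code from the original, and it keeps the Before code's behaviour of never matching an empty prefix.

```go
package main

import "fmt"

// Handler is a hypothetical handler type; the original does not define it.
type Handler func() string

// node is one byte-level trie node.
type node struct {
	children map[byte]*node
	h        Handler // non-nil if a registered prefix ends at this node
}

// Trie answers longest-prefix queries in time proportional to len(path).
type Trie struct{ root node }

// Insert registers h for prefix, creating nodes as needed.
func (t *Trie) Insert(prefix string, h Handler) {
	n := &t.root
	for i := 0; i < len(prefix); i++ {
		if n.children == nil {
			n.children = make(map[byte]*node)
		}
		next, ok := n.children[prefix[i]]
		if !ok {
			next = &node{}
			n.children[prefix[i]] = next
		}
		n = next
	}
	n.h = h
}

// Lookup returns the handler of the longest registered prefix of path.
func (t *Trie) Lookup(path string) (Handler, bool) {
	n := &t.root
	var best Handler
	for i := 0; i < len(path); i++ {
		next, ok := n.children[path[i]] // reading a nil map is safe
		if !ok {
			break
		}
		n = next
		if n.h != nil {
			best = n.h // remember the deepest match so far
		}
	}
	return best, best != nil
}

func main() {
	var t Trie
	t.Insert("/api/", func() string { return "api" })
	t.Insert("/api/users/", func() string { return "users" })

	if h, ok := t.Lookup("/api/users/42"); ok {
		fmt.Println(h())
	}
	if h, ok := t.Lookup("/api/orders"); ok {
		fmt.Println(h())
	}
	_, ok := t.Lookup("/static/x")
	fmt.Println(ok)
}
```

A map per node keeps the sketch short; a production trie would more likely compress single-child chains or split on path segments.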
For *exact* match, use a plain `map[string]Handler`. For *prefix* match, use a trie or the Go 1.22+ `ServeMux`, which matches patterns through a segment-based routing tree. ~22× faster (exact) or ~6× (trie).
**Why faster:** Map lookup is O(1) expected. Trie is O(len(path)). Linear scan is O(N×L).
**Trade-off:** Exact map requires caller-side path normalization (trailing slashes, case). Trie has higher construction cost, paid at startup.
**When NOT:** Sub-10 route count, where the linear scan stays within a few cache lines and beats the map hash. Wildcard/regex routes need a compiled matcher, not a registry.
6. Exercise 5 — Interface boxing on every Get
An untyped registry stores values as any. Every consumer wants the same type, Codec, yet every Load hands back a 16 B interface header and Get pays a type-assertion check before it can return.
Before:
type Registry struct{ m sync.Map } // map[string]any
func (r *Registry) Get(name string) (Codec, bool) {
	v, ok := r.m.Load(name)
	if !ok { return nil, false }
	return v.(Codec), true
}
After
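A minimal sketch of a typed registry, under one loud assumption: the original does not show the backing store, so this version uses a plain `map[string]T` behind an `RWMutex`, which needs no assertion at all. `Codec` and `jsonCodec` are placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

// Codec is a hypothetical value type for the demo.
type Codec interface{ Name() string }

type jsonCodec struct{}

func (jsonCodec) Name() string { return "json" }

// Registry is a typed registry: Get returns T directly, so callers never
// see an any value and never write a type assertion.
type Registry[T any] struct {
	mu sync.RWMutex
	m  map[string]T
}

// NewRegistry returns an empty registry for values of type T.
func NewRegistry[T any]() *Registry[T] {
	return &Registry[T]{m: make(map[string]T)}
}

// Register stores v under name, replacing any previous entry.
func (r *Registry[T]) Register(name string, v T) {
	r.mu.Lock()
	r.m[name] = v
	r.mu.Unlock()
}

// Get returns the value registered under name and whether it was found.
func (r *Registry[T]) Get(name string) (T, bool) {
	r.mu.RLock()
	v, ok := r.m[name]
	r.mu.RUnlock()
	return v, ok
}

func main() {
	codecs := NewRegistry[Codec]()
	codecs.Register("json", jsonCodec{})
	c, ok := codecs.Get("json") // c is already a Codec
	fmt.Println(ok, c.Name())

	ports := NewRegistry[int]()
	ports.Register("http", 80)
	p, _ := ports.Get("http")
	fmt.Println(p + 1) // plain int arithmetic, no unboxing
}
```

The same type parameter can be layered over the copy-on-write map from Ex. 3 when read contention matters.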
Generic `Registry[T]`. The compiler instantiates the registry for the stored type (Go stencils generics per GC shape rather than fully monomorphizing), and `Get` returns `T` directly. ~25% faster.
**Why faster:** Returning `T` directly skips materializing an `any` header on the value side. If the backing store is still `sync.Map`, the one remaining internal assertion is branch-predictor-friendly: one type per registry instance.
**Trade-off:** One `Registry[T]` instantiation per stored type. Five registries → ~10-30 KB binary growth. Negligible.
**When NOT:** Registries that genuinely store heterogeneous types (event dispatcher with one handler signature per event). There you need `any` plus a dispatch table.
7. Exercise 6 — reflect.Type key allocation per call
A handler registry keyed by reflect.TypeOf(payload). The reflect.Type itself is interned, but the iface boxing of the payload (to call TypeOf) can escape it to heap. The hash of a reflect.Type is fast (pointer compare), but the indirection through any is the real cost.
Before:
var handlers = map[reflect.Type]Handler{}
func Register(payload any, h Handler) { handlers[reflect.TypeOf(payload)] = h }
func Dispatch(payload any) error {
	h, ok := handlers[reflect.TypeOf(payload)]
	if !ok { return errNoHandler }
	return h(payload)
}
Dispatch(UserCreated{ID: 7}) // UserCreated escapes via 'any'
After
Key by a stable string name carried on the event itself.
type Event interface{ EventName() string }
var handlers = map[string]Handler{}
func Register(name string, h Handler) { handlers[name] = h }
func Dispatch(e Event) error {
	h, ok := handlers[e.EventName()]
	if !ok { return errNoHandler }
	return h(e)
}
func (UserCreated) EventName() string { return "user.created" }
8. Exercise 7 — Per-call Names() rebuild
A debug endpoint hits Registry.Names() per request. Each call locks, iterates, allocates, sorts. 100 QPS on a 500-entry registry is 50k entries/sec of churn for a static answer.
Before:
func (r *Registry) Names() []string {
	r.mu.RLock(); defer r.mu.RUnlock()
	out := make([]string, 0, len(r.m))
	for k := range r.m { out = append(out, k) }
	sort.Strings(out)
	return out
}
After
Cache the sorted slice with a version counter. Invalidate on write, rebuild lazily on first read.
type Registry struct {
	mu      sync.RWMutex
	m       map[string]Codec
	version atomic.Uint64
	cached  atomic.Pointer[cachedNames]
}
type cachedNames struct{ version uint64; names []string }
func (r *Registry) Register(name string, c Codec) {
	r.mu.Lock(); r.m[name] = c; r.version.Add(1); r.mu.Unlock()
}
func (r *Registry) Names() []string {
	v := r.version.Load()
	// Cache hit: the slice is shared, so callers must treat it as read-only.
	if c := r.cached.Load(); c != nil && c.version == v { return c.names }
	r.mu.RLock()
	out := make([]string, 0, len(r.m))
	for k := range r.m { out = append(out, k) }
	r.mu.RUnlock()
	sort.Strings(out)
	// If a write raced with the rebuild, v is already stale and the next call rebuilds.
	r.cached.Store(&cachedNames{version: v, names: out})
	return out
}
9. Exercise 8 — Lookup repeated in a tight loop
A hot encoding loop calls codecs.Get("json") per item. Even at 30 ns/op, 1M items/sec spends 30 ms/sec in the registry. The codec doesn't change — the loop re-looks-up every iteration.
Before:
for _, item := range items {
	c, ok := codecs.Get("json")
	if !ok { return errNoCodec }
	b, err := c.Encode(item)
	if err != nil { return err }
	out = append(out, b)
}
After
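A sketch of both shapes: the hoisted lookup for a fixed key, and a last-key cache for keys that vary but cluster. The `registry` type here is a hypothetical stand-in that counts lookups so the difference is visible; `rawCodec`, `keyed`, and the function names are likewise invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Codec is a hypothetical encoder interface.
type Codec interface {
	Encode(item string) ([]byte, error)
}

type rawCodec struct{}

func (rawCodec) Encode(item string) ([]byte, error) { return []byte(item), nil }

// registry is a stand-in that counts how often Get is called.
type registry struct {
	m       map[string]Codec
	lookups int
}

func (r *registry) Get(name string) (Codec, bool) {
	r.lookups++
	c, ok := r.m[name]
	return c, ok
}

var errNoCodec = errors.New("no codec")

// encodeAll resolves the codec once, before the loop.
func encodeAll(r *registry, name string, items []string) ([][]byte, error) {
	c, ok := r.Get(name) // hoisted: one lookup for the whole batch
	if !ok {
		return nil, errNoCodec
	}
	out := make([][]byte, 0, len(items))
	for _, item := range items {
		b, err := c.Encode(item)
		if err != nil {
			return nil, err
		}
		out = append(out, b)
	}
	return out, nil
}

// keyed is an item whose codec name can differ per element.
type keyed struct{ Kind, Body string }

// encodeMixed looks up again only when the key differs from the previous item.
func encodeMixed(r *registry, items []keyed) ([][]byte, error) {
	var lastKey string
	var last Codec
	out := make([][]byte, 0, len(items))
	for _, it := range items {
		if last == nil || it.Kind != lastKey {
			c, ok := r.Get(it.Kind)
			if !ok {
				return nil, errNoCodec
			}
			lastKey, last = it.Kind, c
		}
		b, err := last.Encode(it.Body)
		if err != nil {
			return nil, err
		}
		out = append(out, b)
	}
	return out, nil
}

func main() {
	r := &registry{m: map[string]Codec{"json": rawCodec{}, "raw": rawCodec{}}}

	out, err := encodeAll(r, "json", []string{"a", "b", "c"})
	fmt.Println(len(out), err, r.lookups)

	mixed := []keyed{{"json", "a"}, {"json", "b"}, {"raw", "c"}, {"raw", "d"}, {"json", "e"}}
	out, err = encodeMixed(r, mixed)
	fmt.Println(len(out), err, r.lookups)
}
```

With the five clustered items above, `encodeMixed` performs three lookups instead of five; the saving grows with the length of each run of equal keys.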
Hoist the lookup. In an HTTP server, hoist further: resolve at request parse time, stash on a parsed-request struct; handler never touches the registry.
**Why faster:** Trivial — one lookup instead of N. The point isn't the win but that registries are *easy to overuse* because they look free. Mid-level reviews catch loop-internal lookups.
**Trade-off:** If the key varies per iteration, you can't hoist. For loops where keys cluster (event types, batched routes), cache the prior result (`lastKey`, `lastCodec`) — 90%+ hit rate on clustered traffic.
**When NOT:** Per-iteration variable key with no clustering. Outside hot paths where one lookup amortizes over plenty of work.
10. Exercise 9 — Hot-reload copies O(N) per Register
A plugin manager re-registers all plugins every 30s. Each Register triggers a full COW (Ex. 3). 200 plugins × O(N) per call = O(N²) reload.
Before:
func (r *Registry) Register(name string, p Plugin) {
	r.writeMu.Lock(); defer r.writeMu.Unlock()
	old := r.m.Load()
	next := make(map[string]Plugin, len(*old)+1)
	for k, v := range *old { next[k] = v }
	next[name] = p
	r.m.Store(&next)
}
for name, p := range newPlugins { registry.Register(name, p) } // reload
After
Diff-based merge. Build the next map once from `old + delta`, store once.
func (r *Registry) RegisterAll(updates map[string]Plugin, removed []string) {
	r.writeMu.Lock(); defer r.writeMu.Unlock()
	old := r.m.Load()
	next := make(map[string]Plugin, len(*old)+len(updates))
	for k, v := range *old { next[k] = v }
	for _, k := range removed { delete(next, k) }
	for k, v := range updates { next[k] = v }
	r.m.Store(&next)
}
11. Exercise 10 — Mutex-guarded read in handler hot path
A request handler takes the registry's read lock on every request. p99 rises with concurrent requests because every reader updates the same lock word, and any writer stalls all readers queued behind it.
Before:
func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	name := r.Header.Get("X-Codec")
	h.registry.mu.RLock(); c, ok := h.registry.m[name]; h.registry.mu.RUnlock()
	if !ok { http.Error(w, "no codec", 400); return }
	// ... encode response with c
}
After
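One way the handler could look once the registry is the copy-on-write map from Ex. 3. This is a sketch under assumptions: the `Codec` interface with a `ContentType` method, `jsonCodec`, and the `status` helper are invented for the example, and the response body is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
)

// Codec is a hypothetical codec interface for the demo.
type Codec interface{ ContentType() string }

type jsonCodec struct{}

func (jsonCodec) ContentType() string { return "application/json" }

// Registry is the copy-on-write registry from Ex. 3, specialised to Codec.
type Registry struct {
	m       atomic.Pointer[map[string]Codec]
	writeMu sync.Mutex // serializes writers only
}

func NewRegistry() *Registry {
	r := &Registry{}
	empty := map[string]Codec{}
	r.m.Store(&empty)
	return r
}

func (r *Registry) Register(name string, c Codec) {
	r.writeMu.Lock()
	defer r.writeMu.Unlock()
	old := *r.m.Load()
	next := make(map[string]Codec, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[name] = c
	r.m.Store(&next)
}

func (r *Registry) Get(name string) (Codec, bool) {
	c, ok := (*r.m.Load())[name]
	return c, ok
}

type Handler struct{ registry *Registry }

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// One atomic load plus one map lookup; no lock on the request path.
	c, ok := h.registry.Get(r.Header.Get("X-Codec"))
	if !ok {
		http.Error(w, "no codec", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", c.ContentType())
	w.WriteHeader(http.StatusOK)
}

// status runs one in-memory request through h and returns the response code.
func status(h http.Handler, codec string) int {
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	req.Header.Set("X-Codec", codec)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	reg := NewRegistry()
	reg.Register("json", jsonCodec{})
	h := &Handler{registry: reg}
	fmt.Println(status(h, "json"), status(h, "yaml"))
}
```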
Lock-free check via `atomic.Pointer[map]` (Ex. 3). One atomic load, one map lookup, no lock. ~18× better p99.
**Why faster:** Removes the shared lock word from the read path. With 4000 concurrent goroutines, lock acquisition is the bottleneck even when the critical section is < 100 ns — Amdahl's Law.
**Trade-off:** As in Ex. 3, writes copy the map. For startup-only registries, free. For mutating ones, reload window has higher write cost — acceptable since reloads are rare.
**When NOT:** Single-threaded handlers (rare). Genuinely write-heavy registries where reads are cold.
12. Exercise 11 — Init-time registration with N items
A vendored library calls Register(name, impl) once per plugin in init(). With 300 plugins, init runs 300 individual registrations — each takes the lock, each inserts into the map (with rehashes on growth). Startup is 30 ms when it should be 0.5 ms.
Before:
func init() {
	plugin.Register("compress.gzip", &gzipPlugin{})
	plugin.Register("compress.snappy", &snappyPlugin{})
	plugin.Register("compress.zstd", &zstdPlugin{})
	// ... 297 more
}
After
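A sketch of what a bulk `RegisterAll` could look like on top of the copy-on-write registry from Ex. 3, with the per-entry validation done inside it. The `Plugin` interface, `namedPlugin`, and the error wording are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Plugin is a hypothetical plugin interface.
type Plugin interface{ Name() string }

// namedPlugin is a placeholder implementation.
type namedPlugin string

func (p namedPlugin) Name() string { return string(p) }

type Registry struct {
	writeMu sync.Mutex
	m       atomic.Pointer[map[string]Plugin]
}

// RegisterAll merges entries with one lock, one map allocation sized for the
// final count, and one publish. Nothing is published if any entry is invalid.
func (r *Registry) RegisterAll(entries map[string]Plugin) error {
	r.writeMu.Lock()
	defer r.writeMu.Unlock()

	var old map[string]Plugin
	if p := r.m.Load(); p != nil {
		old = *p
	}
	next := make(map[string]Plugin, len(old)+len(entries))
	for k, v := range old {
		next[k] = v
	}
	for k, v := range entries {
		if v == nil {
			return fmt.Errorf("plugin %q: nil implementation", k)
		}
		if _, dup := next[k]; dup {
			return fmt.Errorf("plugin %q: already registered", k)
		}
		next[k] = v
	}
	r.m.Store(&next)
	return nil
}

// Get is lock-free: one atomic load and one map lookup.
func (r *Registry) Get(name string) (Plugin, bool) {
	p := r.m.Load()
	if p == nil {
		return nil, false
	}
	v, ok := (*p)[name]
	return v, ok
}

var plugins Registry

func init() {
	// One call replaces the per-plugin Register calls.
	err := plugins.RegisterAll(map[string]Plugin{
		"compress.gzip":   namedPlugin("gzip"),
		"compress.snappy": namedPlugin("snappy"),
		"compress.zstd":   namedPlugin("zstd"),
	})
	if err != nil {
		panic(err)
	}
}

func main() {
	p, ok := plugins.Get("compress.zstd")
	fmt.Println(ok, p.Name())
	fmt.Println(plugins.RegisterAll(map[string]Plugin{"compress.gzip": namedPlugin("gzip")}))
}
```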
Bulk register. Define entries as a slice/map literal, register once. ~75× faster init.
**Why faster:** One lock, one map allocation sized for final count, one atomic publish. Per-call Register is N locks, repeated rehashes as the map grows, and N publishes.
**Trade-off:** Map literal must be statically constructible — no per-entry init-time validation (do it inside `RegisterAll`). Tools that scan `Register` calls lose grep-ability — name the bulk map well.
**When NOT:** Conditional registration (env vars, build tags). Per-entry `Register` is correct; cost paid only when conditions are met.
13. Exercise 12 — plugin.Lookup per call
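Go's plugin package (Linux/macOS) loads .so files. Each Lookup searches the plugin's exported-symbol table by name, and the caller then has to re-assert the symbol's type, so a signature mismatch surfaces at call time instead of load time. A poorly-shaped wrapper calls Lookup("DoWork") on every invocation.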
Before:
func CallPlugin(p *plugin.Plugin, arg string) (string, error) {
	sym, err := p.Lookup("DoWork") // ~3 µs per call
	if err != nil { return "", err }
	return sym.(func(string) (string, error))(arg)
}
After
Lookup once at load, cache the typed function pointer.
type LoadedPlugin struct {
	p      *plugin.Plugin
	DoWork func(string) (string, error)
}
func Load(path string) (*LoadedPlugin, error) {
	p, err := plugin.Open(path)
	if err != nil { return nil, err }
	sym, err := p.Lookup("DoWork")
	if err != nil { return nil, err }
	fn, ok := sym.(func(string) (string, error))
	if !ok { return nil, fmt.Errorf("plugin %s: DoWork wrong signature", path) }
	return &LoadedPlugin{p: p, DoWork: fn}, nil
}
func (lp *LoadedPlugin) Call(arg string) (string, error) { return lp.DoWork(arg) }
14. When NOT to optimize
Registry overhead dominates only when lookups land on the hot path of a high-QPS service. If your registry is read 100 times per minute, every optimization here is irrelevant — your time is in the work the registry routes to.
- Driver registry consulted once at process start — keep RWMutex+map.
- Image-format registry used once per uploaded file (kilobytes of work follow).
- Test-only registry for mocks — no production load, no scaling problem.
Profile first. Registry overhead has four signatures in a CPU profile:
- sync.(*RWMutex).RLock/RUnlock on a hot stack → Ex. 1 or 3.
- runtime.mapaccess2_faststr dominating → Ex. 2 (typed key) if keys are bounded.
- runtime.convT* on every Get → Ex. 5 (generics).
- reflect.TypeOf in a dispatch loop → Ex. 6 (string keys).
Common premature optimizations:
- sync.Map (Ex. 1) on a registry with < 100 reads/sec — RWMutex+map is fine.
- atomic.Pointer[map] (Ex. 3) with no measurable contention — write path gets worse for no win.
- Integer key IDs (Ex. 2) when string lookups don't show in profiles.
- Cached Names() (Ex. 7) when called once per minute.
- Plugin function caching (Ex. 12) when invocation latency is itself ms-scale.
Correctness gaps disguised as optimizations:
- Removing the nil-check from Register "to save 1 ns" — until a nil panics on first use, far from the registration site.
- atomic.Pointer swap without a write mutex — two concurrent registrations race; one loses silently.
- Caching Names() and letting callers mutate the slice — neighbors see corruption.
- Bulk RegisterAll overwriting without duplicate-check — hot-reload silently shadows old plugins.
- Hoisting a lookup out of a loop when the key varies — wrong result, slower to debug than the slow original.
- Replacing reflect.Type keys with string names without stable names — collisions across packages sharing a short name.
- Lock-free publish without a happens-before guarantee on the stored value's internal state (a Codec whose init is still in progress when stored).
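The first and fourth gaps are cheap to close at the registration site. The sketch below shows a Register that rejects nil values and duplicate names by panicking, the same policy database/sql's Register applies to drivers. The Codec interface, jsonCodec, and the panics helper are placeholders for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Codec is a hypothetical value type for the demo.
type Codec interface{ Name() string }

type jsonCodec struct{}

func (jsonCodec) Name() string { return "json" }

type Registry struct {
	mu sync.RWMutex
	m  map[string]Codec
}

// Register panics on a nil codec or a duplicate name, so the failure shows
// up at the registration site instead of at first use.
func (r *Registry) Register(name string, c Codec) {
	if c == nil {
		panic("registry: Register codec is nil for " + name)
	}
	r.mu.Lock()
	defer r.mu.Unlock() // deferred so the lock is released even when we panic
	if _, dup := r.m[name]; dup {
		panic("registry: Register called twice for " + name)
	}
	if r.m == nil {
		r.m = make(map[string]Codec)
	}
	r.m[name] = c
}

func (r *Registry) Get(name string) (Codec, bool) {
	r.mu.RLock()
	c, ok := r.m[name]
	r.mu.RUnlock()
	return c, ok
}

// panics reports whether fn panicked; it exists only to demonstrate the checks.
func panics(fn func()) (p bool) {
	defer func() { p = recover() != nil }()
	fn()
	return false
}

func main() {
	var r Registry
	r.Register("json", jsonCodec{})

	fmt.Println(panics(func() { r.Register("json", jsonCodec{}) })) // duplicate name
	fmt.Println(panics(func() { r.Register("yaml", nil) }))         // nil codec

	_, ok := r.Get("json")
	fmt.Println(ok)
}
```

The nil check only catches a nil interface value; a typed nil pointer stored in the interface would still pass and needs a check inside the concrete type.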
15. Summary
Always-ship wins (apply by default in any new Registry code):
- Generic Registry[T] over map[string]any (Ex. 5).
- Bulk RegisterAll for known-static sets (Ex. 11).
- Hoist Get out of hot loops (Ex. 8).
- Cache plugin.Lookup results in typed function fields (Ex. 12).
- Stable string keys, not reflect.Type (Ex. 6).
- Nil-check on Register and duplicate-key panic — correctness, not performance.
Wins behind a profile (when measurements justify them):
- sync.Map for read-mostly contended registries (Ex. 1) — when RWMutex.RLock shows in profile.
- atomic.Pointer[map] COW (Ex. 3, 10) — read contention dominates, writes rare.
- Cached Names() with version counter (Ex. 7) — introspection on a hot path.
- Diff-based bulk merge for hot-reload (Ex. 9) — scheduled reloads.
- Exact-match map over linear scan (Ex. 4) — route count > 10.
Specialty (only when the design calls for it):
- Integer ID key with dense array backing (Ex. 2) — closed sets, sub-10-ns lookups.
- Radix tree / trie for prefix routing — wildcard semantics, large route sets.
- Reference-counted plugin handles for safe hot-swap — swap can race with an in-flight call.
- Per-tenant scoped registries via context — isolation matters more than convenience.
Registry cost is locks, hashes, ifaces, and reflect-type extraction. Strip those four from the read path by choosing the right primitive: RWMutex+map for startup-only and 99% of code; atomic.Pointer[map] when reads scale across cores; sync.Map for read-mostly with occasional writes; integer keys when the alphabet is closed. The Registry is rarely where time goes — but when it is, these are the levers.