Singleton Pattern — Optimization
1. How to use this file
Twelve scenarios where singleton code costs more than it should. Each:
- Scenario — the inefficiency.
- Before — measured-slow code with realistic benchmark numbers.
- After (collapsible) — optimised version with benchmark comparison.
- Why faster — what changed at the runtime level.
- Trade-offs — what you lose by optimising.
- When NOT to do this — the cases where the optimisation isn't worth it.
Singletons sit on the hot path of nearly every Go service: a logger, a config, a DB pool, a metrics registry. The access function can be called millions of times per second. Even a 5 ns saving on Get(), multiplied by 1e7 calls per second, is 50 ms of CPU per second, or about 5% of one core. So unlike adapters, singleton micro-optimisations frequently do pay back.
But the inverse is also true: a misplaced atomic or a sync.Map where a sync.RWMutex would do can turn Get() into the bottleneck of a whole service. The right answer is almost always "the cheapest primitive that still gives the guarantees you need".
Benchmarks are illustrative — your numbers will differ. The qualitative direction (faster vs slower, allocs vs no allocs) is more important than the absolute ns/op.
Go 1.22, amd64, GOMAXPROCS=8.
2. Table of Contents
- How to use this file
- Table of Contents
- Exercise 1 — Mutex on every access
- Exercise 2 — Eager init in init() blocks startup
- Exercise 3 — sync.Once when the value is always needed
- Exercise 4 — sync.Map for shared singleton state
- Exercise 5 — Hot-reload with full lock
- Exercise 6 — Singleton called from a hot loop
- Exercise 7 — Global mutex for a read-heavy field
- Exercise 8 — PGO devirtualization of singleton-returning factory
- Exercise 9 — Singleton config parsed on every read
- Exercise 10 — JSON-loaded singleton re-read per test
- Exercise 11 — log.Default mutex on every log line
- Exercise 12 — Three singletons collapsed into one access
- When NOT to optimize
- Summary
Exercise 1 — Mutex on every access
Scenario: Classic textbook singleton using a sync.Mutex on every Get().
Before:
type Config struct {
    mu       sync.Mutex
    instance *Config
    Region   string
}

var cfg Config

func Get() *Config {
    cfg.mu.Lock()
    defer cfg.mu.Unlock()
    if cfg.instance == nil {
        cfg.instance = &Config{Region: "us-east-1"}
    }
    return cfg.instance
}
Benchmark:
BenchmarkGet_Mutex-8 30000000 42 ns/op 0 B/op 0 allocs/op
BenchmarkGet_Mutex-8-par 10000000 145 ns/op 0 B/op 0 allocs/op // contended
The Lock/Unlock pair is unavoidable on every call, and under contention the cost balloons because goroutines serialise.
After
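A minimal sketch of the optimised accessor for this exercise, assuming Go 1.19 or later for the generic `atomic.Pointer`. The names `initOnce` and `instance` are illustrative and not taken from the original listing:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Config is the read-only singleton payload.
type Config struct {
	Region string
}

var (
	initOnce sync.Once              // guards one-time construction
	instance atomic.Pointer[Config] // published value, read lock-free
)

// Get returns the shared Config. The steady-state path is one atomic load;
// sync.Once is only reached until the first initialisation has completed.
func Get() *Config {
	if c := instance.Load(); c != nil {
		return c
	}
	initOnce.Do(func() {
		instance.Store(&Config{Region: "us-east-1"})
	})
	return instance.Load()
}

func main() {
	a, b := Get(), Get()
	fmt.Println(a.Region, a == b) // prints "us-east-1 true"
}
```

The nil check keeps `sync.Once` off the steady-state path; the second `Load` after `Do` is only reached by callers that raced with initialisation.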
Use `sync.Once` for the init guard, then store the value in an `atomic.Pointer` for read-mostly access. The result is ~18× faster single-threaded and ~240× faster under contention.

**Why faster:** `atomic.Pointer.Load` compiles to a single `MOVQ` on amd64. x86 loads already have acquire ordering, so no separate fence instruction is needed. There is no lock, no defer, and no contention. After the first `Get()`, the fast path never touches `sync.Once`. Compare that to mutex Lock/Unlock, which is two atomic read-modify-write operations *plus* potential park/unpark on contention.

**Trade-offs:** Two atomic loads on the slow init path (cheap, but two instead of one). The pattern is harder to read than `sync.Once.Do(...)` alone, and junior reviewers may not see why both are needed.

**When NOT to do this:** When the singleton is mutated after init. `atomic.Pointer` works for read-mostly; for read-write, see Exercise 7 (RWMutex) or Exercise 5 (atomic swap on whole pointer).

Exercise 2 — Eager init in init() blocks startup
Scenario: A heavyweight singleton built in init(), slowing every binary start.
Before:
var db *sql.DB

func init() {
    var err error
    db, err = sql.Open("postgres", os.Getenv("DSN"))
    if err != nil {
        log.Fatal(err)
    }
    if err := db.Ping(); err != nil { // 50-300 ms network RTT
        log.Fatal(err)
    }
}
Measured at process startup: 287 ms pass before the binary can even print its version, because init() opened a database connection.
After
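A hedged sketch of the lazy accessor for this exercise. To stay runnable without a database driver or a network, `Pool` and `connect` are hypothetical stand-ins for `*sql.DB` and the `sql.Open` plus `Ping` sequence, and the DSN is an assumed value:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Pool stands in for *sql.DB so the sketch runs without a driver or a network.
type Pool struct{ dsn string }

// connectCalls counts how often the expensive step runs (demo only).
var connectCalls int

// connect is a hypothetical placeholder for sql.Open followed by Ping.
func connect(dsn string) (*Pool, error) {
	connectCalls++
	if dsn == "" {
		return nil, errors.New("empty DSN")
	}
	return &Pool{dsn: dsn}, nil
}

var (
	dbOnce sync.Once
	db     *Pool
	dbErr  error
)

// DB connects on first use and returns the cached result afterwards.
// A failed first attempt is cached too; retry policy is out of scope here.
func DB() (*Pool, error) {
	dbOnce.Do(func() {
		db, dbErr = connect("postgres://localhost/app") // assumed DSN
	})
	return db, dbErr
}

// Prime pays the connect cost at a moment the caller chooses,
// for example just before the HTTP server starts listening.
func Prime() error {
	_, err := DB()
	return err
}

func main() {
	fmt.Println("connects at startup:", connectCalls)
	if err := Prime(); err != nil {
		fmt.Println("prime failed:", err)
		return
	}
	p1, _ := DB()
	p2, _ := DB()
	fmt.Println("connects after use:", connectCalls, "same pool:", p1 == p2)
}
```

Commands that never call `DB()` or `Prime()` never pay for the connection, which is the point of moving the work out of `init()`.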
Defer the work behind `sync.Once`. Startup becomes 24× faster, and the DB connects only when `DB()` is first called.

**Why faster:** `init()` runs unconditionally when the package is imported, even for commands that don't use the DB (`--version`, `--help`, CLI tools that share the package). Lazy init defers the cost to the first real use.

**Trade-offs:** First request pays the connect latency. If your service is request-driven and you want predictable first-request latency, *prime* the singleton at the right moment instead of leaving it cold: call `DB()` once during startup, before the HTTP server starts accepting traffic, so the readiness probe only passes once the pool is warm.

**When NOT to do this:** When the singleton *must* exist before any user code runs (e.g., a global tracer that other packages' `init()` register into). In that case, eager init is required by ordering, not chosen for performance.

Exercise 3 — sync.Once when the value is always needed
Scenario: A singleton guarded by sync.Once that is always accessed by every code path. Lazy init costs one atomic load per access for no reason.
Before:
var (
    once   sync.Once
    metric prometheus.Counter // Counter is an interface, so no pointer is needed
)

func Metric() prometheus.Counter {
    once.Do(func() {
        metric = prometheus.NewCounter(prometheus.CounterOpts{Name: "reqs"})
    })
    return metric
}
Every call performs one atomic load on once.done. For a counter incremented millions of times per second, that's wasted work.
After
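A small sketch of the eager variant for this exercise. `Counter` here is a hypothetical stand-in for `prometheus.Counter`, used only so the example has no external dependencies:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter is a stand-in for prometheus.Counter (assumption for this sketch).
type Counter struct{ n atomic.Int64 }

func (c *Counter) Inc()         { c.n.Add(1) }
func (c *Counter) Value() int64 { return c.n.Load() }

// metric is assigned during package initialisation, which completes
// before main starts, so readers need no synchronisation to see it.
var metric *Counter

func init() {
	metric = &Counter{}
}

// Metric is a plain read of a package variable: no sync.Once, no atomic load.
func Metric() *Counter { return metric }

func main() {
	Metric().Inc()
	Metric().Inc()
	fmt.Println(Metric().Value()) // prints 2
}
```

The counter itself still uses an atomic for increments; only the accessor has lost its synchronisation cost.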
If the value is needed unconditionally and the cost of init is small, use plain `init()`. Access is 8× faster, and `Metric()` is now eligible for inlining.

**Why faster:** No atomic load. The variable is plain memory, so a read is an ordinary load. The compiler can also inline `Metric()`, which is now nothing more than a return of a package variable.

**Trade-offs:** Loses laziness. Init runs even if the singleton is never used. Counter creation takes microseconds, so for Prometheus counters this is fine; for a DB pool, this is Exercise 2's anti-pattern.

**When NOT to do this:** When init is expensive (network, file I/O) or has ordering hazards (depends on other packages' init that may not have run). Keep `sync.Once` for those cases.

Exercise 4 — sync.Map for shared singleton state
Scenario: A singleton service registry uses sync.Map because someone read that it's "concurrent". The workload is high write contention from many goroutines.
Before:
type Registry struct {
    services sync.Map // map[string]*Service
}

var reg = &Registry{}

func (r *Registry) Get(name string) *Service {
    v, _ := r.services.Load(name)
    if v == nil {
        return nil
    }
    return v.(*Service)
}

func (r *Registry) Register(name string, s *Service) {
    r.services.Store(name, s)
}
Under a concurrent Register benchmark (8 goroutines, mixed read/write), the stores serialise. sync.Map was designed for the "write-once, read-many" case (e.g., type-to-method caches). Workloads that keep storing new keys fall back to its internal mutex on the dirty map, and the type assertion on Load adds an interface unboxing step per call.
After
Shard the map. Use N small maps, each guarded by its own `sync.RWMutex`:

const shardCount = 32

type shard struct {
    mu sync.RWMutex
    m  map[string]*Service
}

type Registry struct {
    shards [shardCount]shard
}

func (r *Registry) shardFor(key string) *shard {
    return &r.shards[fnv32(key)%shardCount]
}

func (r *Registry) Get(name string) *Service {
    s := r.shardFor(name)
    s.mu.RLock()
    v := s.m[name] // reading a nil map is safe and yields nil
    s.mu.RUnlock()
    return v
}

func (r *Registry) Register(name string, svc *Service) {
    s := r.shardFor(name)
    s.mu.Lock()
    if s.m == nil {
        s.m = make(map[string]*Service)
    }
    s.m[name] = svc
    s.mu.Unlock()
}
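The sharded registry above calls `fnv32`, which the listing does not define. One plausible implementation, offered as an assumption, is an allocation-free 32-bit FNV-1a loop; the `stdFNV` helper below exists only to cross-check it against the standard library's `hash/fnv`:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const shardCount = 32

// fnv32 is an allocation-free 32-bit FNV-1a hash of key.
func fnv32(key string) uint32 {
	const (
		offset32 = 2166136261
		prime32  = 16777619
	)
	h := uint32(offset32)
	for i := 0; i < len(key); i++ {
		h ^= uint32(key[i])
		h *= prime32
	}
	return h
}

// stdFNV computes the same hash via hash/fnv, used here only as a cross-check.
func stdFNV(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // hash.Hash writes never return an error
	return h.Sum32()
}

func main() {
	for _, name := range []string{"billing", "auth", "search"} {
		fmt.Println(name, "-> shard", fnv32(name)%shardCount,
			"matches stdlib:", fnv32(name) == stdFNV(name))
	}
}
```

Any reasonably uniform hash works for picking a shard; FNV-1a is chosen here only because it is short and has a standard-library reference to compare against.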
Exercise 5 — Hot-reload with full lock
Scenario: A config singleton supports hot reload from disk. Every Get() takes a read lock, every reload takes a write lock.
Before:
type Config struct {
    Limits map[string]int
    Hosts  []string
}

var (
    cfgMu sync.RWMutex
    cfg   *Config
)

func Get() *Config {
    cfgMu.RLock()
    defer cfgMu.RUnlock()
    return cfg
}

func Reload(c *Config) {
    cfgMu.Lock()
    cfg = c
    cfgMu.Unlock()
}
BenchmarkRLock_Get-8 200000000 7.2 ns/op // single-threaded
BenchmarkRLock_Get-8-par 50000000 32 ns/op // 8 readers, contended on the RWMutex internal state
Even RLock has cost — it bumps an atomic reader counter. Under high concurrency that atomic becomes a contention point of its own.
After
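A minimal sketch of the copy-on-write reload for this exercise, assuming Go 1.19 or later. Copying the map and slice inside `Reload` is an addition of this sketch, made so that later mutations by the caller cannot race with readers:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type Config struct {
	Limits map[string]int
	Hosts  []string
}

var cfg atomic.Pointer[Config]

// Get returns the current snapshot, or nil before the first Reload.
// Callers must treat the returned value as read-only.
func Get() *Config { return cfg.Load() }

// Reload publishes a fresh snapshot built from c.
func Reload(c *Config) {
	next := &Config{
		Limits: make(map[string]int, len(c.Limits)),
		Hosts:  append([]string(nil), c.Hosts...),
	}
	for k, v := range c.Limits {
		next.Limits[k] = v
	}
	cfg.Store(next)
}

func main() {
	src := &Config{Limits: map[string]int{"rps": 100}, Hosts: []string{"a"}}
	Reload(src)
	old := Get()
	src.Limits["rps"] = 1 // does not leak into the published snapshot
	Reload(&Config{Limits: map[string]int{"rps": 200}})
	fmt.Println(old.Limits["rps"], Get().Limits["rps"]) // prints "100 200"
}
```

A reader that still holds `old` keeps a consistent view of the previous configuration; it picks up the new one on its next `Get()`.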
Store the whole config pointer in `atomic.Pointer[Config]` and swap on reload. Reads become 20–100× faster and scale perfectly with the number of readers.

**Why faster:** `atomic.Pointer.Load` is a single load-acquire. Multiple readers don't interact, because there is no shared atomic counter to contend on. The whole config is copy-on-write: `Reload` builds a new `*Config` and atomically swaps it in. Readers see either the old pointer or the new one, never a torn read.

**Trade-offs:**
1. **Reload allocates.** You can't mutate the config in place; you must build a new `*Config`. For configs that change rarely this is fine; for high-write workloads, see Exercise 4.
2. **Readers may see stale data briefly.** A reader that loaded the pointer just before `Reload` keeps using the old snapshot until its next `Load`. Acceptable for most config use; not acceptable for strict-consistency state.
3. **The `Config` value should be immutable.** If any field of the loaded `*Config` is itself a mutable map/slice, readers will race on that. Make all fields immutable, or deep-copy on reload.

**When NOT to do this:** When readers and writers need to coordinate (e.g., a reader needs to see "the version that was current when I started"). RWMutex with explicit versioning is clearer there. Also when the singleton state contains many fields where only one changes: copying the whole struct on every reload becomes wasteful.

Exercise 6 — Singleton called from a hot loop
Scenario: A hot inner loop calls Get() per iteration. Even a 2 ns Get() adds up over 1e9 iterations.
Before:
func process(items []Item) {
    for _, it := range items {
        cfg := config.Get() // fetched again on every iteration
        if it.Size > cfg.MaxSize {
            return
        }
        write(it)
    }
}
Even with atomic.Pointer, the loop performs an atomic load every iteration plus a field read.
After
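A sketch of the hoisted loop for this exercise. The `Config`, `Item`, and `write` definitions are minimal assumptions added so the example runs on its own:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type Config struct{ MaxSize int }
type Item struct{ Size int }

var current atomic.Pointer[Config]

func Get() *Config { return current.Load() }

// written counts accepted items; it stands in for real output (demo only).
var written int

func write(Item) { written++ }

// process reads the limit once, then runs the loop without touching the singleton.
func process(items []Item) {
	maxSize := Get().MaxSize // one atomic load for the whole batch
	for _, it := range items {
		if it.Size > maxSize {
			return
		}
		write(it)
	}
}

func main() {
	current.Store(&Config{MaxSize: 10})
	process([]Item{{Size: 1}, {Size: 5}, {Size: 11}, {Size: 2}})
	fmt.Println(written) // prints 2: the loop stops at the first oversized item
}
```

Copying `MaxSize` into a local rather than keeping the pointer also removes the per-iteration field load, at the cost of freezing the limit for the duration of the batch.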
Hoist the singleton fetch out of the loop. This is ~4× faster on this loop.

**Why faster:** The compiler *cannot* hoist `config.Get()` out of the loop automatically. It must assume the call has side effects, and an atomic load has to be performed every time it is executed because another goroutine may have stored a new pointer in between. Once you hoist it manually, the inner loop becomes pure arithmetic + branch + memory write, with no atomic loads and no `Get()` calls.

**Trade-offs:** If the singleton is hot-reloadable (Exercise 5) and the loop is long-running, you'll miss a reload that happens mid-loop. For a 1 ms loop that's fine; for a 10-minute batch, it's a bug. Two safe patterns:
1. Re-fetch on outer chunk boundaries (every N items).
2. Pass `cfg` explicitly as a function argument so the call site decides when to refresh.

**When NOT to do this:** When the loop body is already slow (>1 µs/iteration). The `Get()` call disappears into the noise. Don't optimise loops where the singleton fetch isn't measurably hot.

Exercise 7 — Global mutex for a read-heavy field
Scenario: Singleton wraps a counter behind a sync.Mutex. Reads dominate writes 1000:1.
Before:
type Stats struct {
    mu    sync.Mutex
    count int64
    label string
}

var stats Stats

func GetCount() int64 {
    stats.mu.Lock()
    defer stats.mu.Unlock()
    return stats.count
}

func Inc() {
    stats.mu.Lock()
    stats.count++
    stats.mu.Unlock()
}
BenchmarkMutex_Get-8 100000000 18 ns/op
BenchmarkMutex_Get-8-par 20000000 180 ns/op // 8 readers contend
Lock/Unlock serialises reads even when no writer is present.
After (atomic for counter; RWMutex if structure needed)
For a single counter, `atomic.Int64` is the right primitive:

type Stats struct {
    count atomic.Int64
}

var stats Stats

func GetCount() int64 { return stats.count.Load() }

func Inc() { stats.count.Add(1) }
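The heading of this fix also mentions `RWMutex` for cases where more than a single counter must stay consistent. A hedged sketch of that variant follows; `Snapshot` and `Record` are illustrative names, and the pairing of `count` with `label` is an assumption based on the fields in the Before listing:

```go
package main

import (
	"fmt"
	"sync"
)

// Stats holds two fields that must be read together consistently,
// so a single atomic counter is no longer enough.
type Stats struct {
	mu    sync.RWMutex
	count int64
	label string
}

var stats Stats

// Snapshot takes the shared lock, so concurrent readers do not block each other.
func Snapshot() (int64, string) {
	stats.mu.RLock()
	defer stats.mu.RUnlock()
	return stats.count, stats.label
}

// Record takes the exclusive lock and updates both fields as one unit.
func Record(label string) {
	stats.mu.Lock()
	stats.count++
	stats.label = label
	stats.mu.Unlock()
}

func main() {
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				Record("worker")
			}
		}()
	}
	wg.Wait()
	n, label := Snapshot()
	fmt.Println(n, label) // prints "8000 worker"
}
```

This only pays off when reads clearly dominate; with frequent writes the reader bookkeeping inside `RWMutex` can cost more than a plain `Mutex`.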
Exercise 8 — PGO devirtualization of singleton-returning factory
Scenario: A singleton is exposed via an interface for testability. Calls to its methods go through interface dispatch.
Before:
type Storage interface {
    Save(key string, val []byte) error
    Load(key string) ([]byte, error)
}

var storage Storage = newRealStorage() // initialised at startup

func Get() Storage { return storage }

// In a hot handler:
func handle(req Request) error {
    return Get().Save(req.Key, req.Val)
}
Each Save call is an indirect call through the interface's itab, so the compiler cannot inline it.
After (with PGO)
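To make the transformation discussed in this section concrete, here is a hand-written equivalent of the guarded direct call that PGO devirtualization produces at a hot call site. It is a sketch only: `realStorage` is a toy in-memory type assumed for illustration (and not safe for concurrent use), and `saveGuarded` is a hypothetical name. The compiler generates this shape itself; you would not normally write it by hand.

```go
package main

import "fmt"

type Storage interface {
	Save(key string, val []byte) error
	Load(key string) ([]byte, error)
}

// realStorage is a toy in-memory implementation (assumption for this sketch).
type realStorage struct{ data map[string][]byte }

func newRealStorage() *realStorage {
	return &realStorage{data: map[string][]byte{}}
}

func (s *realStorage) Save(key string, val []byte) error {
	s.data[key] = val
	return nil
}

func (s *realStorage) Load(key string) ([]byte, error) {
	v, ok := s.data[key]
	if !ok {
		return nil, fmt.Errorf("key %q not found", key)
	}
	return v, nil
}

var storage Storage = newRealStorage()

func Get() Storage { return storage }

// saveGuarded shows the shape of a devirtualized call site: a cheap type
// check, a direct call for the dominant type, and an indirect fallback.
func saveGuarded(key string, val []byte) error {
	s := Get()
	if rs, ok := s.(*realStorage); ok {
		return rs.Save(key, val) // direct call: eligible for inlining
	}
	return s.Save(key, val) // fallback: ordinary interface dispatch
}

func main() {
	if err := saveGuarded("k", []byte("v")); err != nil {
		fmt.Println("save failed:", err)
		return
	}
	v, err := Get().Load("k")
	fmt.Println(string(v), err) // prints "v <nil>"
}
```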
Collect a CPU profile under production load, then rebuild with PGO enabled. The result is ~2.5× faster on the hot call.

**Why faster:** PGO sees that `storage` is always `*realStorage` in the profile. It rewrites the call site into a type check followed by a direct call to `(*realStorage).Save`, with a fallback indirect call if the type ever differs. Direct calls can be inlined; indirect ones cannot.

**Trade-offs:**
- Build pipeline must produce and ship a profile. Stale profiles target outdated types.
- ~5–10% larger binaries (PGO inlines more aggressively).
- Only helps when one concrete type dominates the interface. If you genuinely swap implementations, PGO has nothing to specialise on.

**When NOT to do this:** Small services, batch jobs, anything where you'd be tuning build complexity for invisible wins. For sub-1k QPS services the engineering cost beats the runtime savings.

**Alternative without PGO:** if the singleton is *truly* immutable across the binary's life, drop the interface entirely and have `Get()` return the concrete `*realStorage`, so every call is direct. The trade: tests can no longer substitute the singleton. (You can recover testability with a per-test override variable; see Exercise 10.)

Exercise 9 — Singleton config parsed on every read
Scenario: "Lazy" config that parses environment variables every time Get() is called, defeating the point of a singleton.
Before:
type Config struct {
    Region string
    Limit  int
}

func Get() *Config {
    return &Config{
        Region: os.Getenv("AWS_REGION"),
        Limit:  parseIntOr(os.Getenv("LIMIT"), 100),
    }
}

func parseIntOr(s string, def int) int {
    if s == "" {
        return def
    }
    n, err := strconv.Atoi(s)
    if err != nil {
        return def
    }
    return n
}
Two os.Getenv lookups (not kernel syscalls, but each one takes a lock on Go's copy of the process environment), one strconv.Atoi, and one heap allocation on every call.
After
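A hedged sketch of the cached version for this exercise, including the explicit `Reload` that the trade-offs mention. The `getenv` variable is a hypothetical seam added so the environment can be substituted in tests, and `loadOnce` is an illustrative name:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
	"sync/atomic"
)

type Config struct {
	Region string
	Limit  int
}

// getenv is a hypothetical seam so tests can substitute the environment.
var getenv = os.Getenv

var (
	cached   atomic.Pointer[Config]
	loadOnce sync.Once
)

func parseIntOr(s string, def int) int {
	n, err := strconv.Atoi(s) // Atoi("") fails, so empty input yields def
	if err != nil {
		return def
	}
	return n
}

// parse does the per-process work: env lookups, integer parsing, one allocation.
func parse() *Config {
	return &Config{
		Region: getenv("AWS_REGION"),
		Limit:  parseIntOr(getenv("LIMIT"), 100),
	}
}

// Get is a single atomic load once the config has been published.
func Get() *Config {
	if c := cached.Load(); c != nil {
		return c
	}
	// CompareAndSwap keeps a concurrent Reload from being overwritten.
	loadOnce.Do(func() { cached.CompareAndSwap(nil, parse()) })
	return cached.Load()
}

// Reload re-parses and atomically swaps in the new snapshot,
// for example from a SIGHUP handler.
func Reload() { cached.Store(parse()) }

func main() {
	c := Get()
	fmt.Println(c.Region, c.Limit, c == Get())
}
```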
Parse once, cache in `atomic.Pointer`. This is ~280× faster after the first call, with zero allocations.

**Why faster:** Parsing is moved from per-call to per-process. After the first call, `Get()` is a single atomic load. The parse-and-allocate work happens exactly once.

**Trade-offs:** The config can't pick up environment changes after process start. If you need that, reload explicitly (and atomically swap, per Exercise 5). Don't parse-on-read just to allow env-var changes; call `Reload()` from a SIGHUP handler instead.

**When NOT to do this:** When the config is intentionally per-call dynamic (rare, and usually a sign of a different bug: the singleton has become a misnomer for "look up environment").

Exercise 10 — JSON-loaded singleton re-read per test
Scenario: A singleton loads its data from a JSON file at startup. Tests want isolation, so each test calls a "reset" that re-parses the file.
Before:
type Catalog struct{ Items map[string]Item }

var (
    once    sync.Once
    catalog *Catalog
)

func Get() *Catalog {
    once.Do(func() {
        data, _ := os.ReadFile("catalog.json")
        catalog = &Catalog{}
        json.Unmarshal(data, catalog)
    })
    return catalog
}

func ResetForTest() {
    once = sync.Once{}
    catalog = nil
    _ = Get() // re-reads the file
}
In a test suite with 500 tests, each calling ResetForTest:
Per test: 1× file read (~50 µs) + 1× JSON unmarshal (~800 µs) = ~850 µs × 500 tests = 425 ms just on catalog parsing.
After
Parse the JSON once *per process* (not per test). Store the parsed result in a package-level var; reset by swapping a fresh deep copy:

var (
    parseOnce    sync.Once
    catalogProto *Catalog // the immutable parsed prototype
)

func loadProto() *Catalog {
    parseOnce.Do(func() {
        data, _ := os.ReadFile("catalog.json")
        catalogProto = &Catalog{}
        json.Unmarshal(data, catalogProto)
    })
    return catalogProto
}

var current atomic.Pointer[Catalog]

func Get() *Catalog {
    if p := current.Load(); p != nil {
        return p
    }
    current.Store(loadProto())
    return current.Load()
}

func ResetForTest() {
    // Deep-clone the proto so tests can mutate freely.
    current.Store(deepClone(loadProto()))
}
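`ResetForTest` above relies on a `deepClone` helper that the listing does not show. A possible sketch follows; the fields of `Item` are assumed purely for illustration, since the original does not define them, and a real clone has to copy whatever reference-typed fields the actual `Item` contains:

```go
package main

import "fmt"

// Item's fields are assumed for illustration; the original does not define them.
type Item struct {
	Name string
	Tags []string
}

type Catalog struct{ Items map[string]Item }

// deepClone copies the map and every slice reachable from it, so the
// clone shares no mutable memory with src.
func deepClone(src *Catalog) *Catalog {
	if src == nil {
		return nil
	}
	dst := &Catalog{Items: make(map[string]Item, len(src.Items))}
	for key, it := range src.Items {
		it.Tags = append([]string(nil), it.Tags...) // fresh backing array
		dst.Items[key] = it
	}
	return dst
}

func main() {
	proto := &Catalog{Items: map[string]Item{
		"a": {Name: "apple", Tags: []string{"fruit"}},
	}}
	c := deepClone(proto)
	c.Items["a"].Tags[0] = "changed"
	c.Items["b"] = Item{Name: "banana"}
	fmt.Println(proto.Items["a"].Tags[0], len(proto.Items), len(c.Items)) // prints "fruit 1 2"
}
```

An in-memory copy like this is typically far cheaper than re-reading and re-parsing the JSON file, which is what makes the per-test reset affordable.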
Exercise 11 — log.Default mutex on every log line
Scenario: Code uses log.Printf (or a homegrown logger) on the hot path. The standard library's log.Logger takes a mutex on every line, even when the level would suppress the message.
Before:
var lvl atomic.Int32 // 0=debug, 1=info, 2=warn, 3=error

func DebugF(format string, args ...any) {
    if lvl.Load() <= 0 {
        log.Printf("DEBUG "+format, args...) // log.Default mutex inside
    }
}
Even when debug is suppressed, the arguments are evaluated at the call site and any non-interface values are boxed into the ...any slice before DebugF can check the level. The fast path takes ~180 ns and 2 allocations per ignored debug line.
After
Two layers of optimisation:
1. Check the level *before* arg evaluation (kill the boxing).
2. When emitting, use a logger that writes to a buffered, lock-free per-goroutine sink, or use `slog` with a handler that discards records at low levels.

type Logger struct {
    level atomic.Int32
    out   io.Writer  // typically os.Stderr, or a buffered sink
    mu    sync.Mutex // only used when actually emitting
}

func (l *Logger) DebugEnabled() bool { return l.level.Load() <= 0 }

func (l *Logger) Debug(msg string, args ...any) {
    if !l.DebugEnabled() {
        return
    }
    l.emit("DEBUG", msg, args) // emit (not shown) formats and writes under l.mu
}

// Caller pattern, where logger is the shared *Logger
// (named logger to avoid shadowing the standard log package):
if logger.DebugEnabled() {
    logger.Debug("user %d state=%s", userID, state)
}
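The `Logger` above calls an `emit` method that is not shown. The sketch below fills it in under stated assumptions: `emit` formats outside the lock and writes one line under `mu`, the level constants are illustrative, and `countWriter` is a hypothetical helper for observing how many lines were written:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync"
	"sync/atomic"
)

const (
	LevelDebug int32 = iota
	LevelInfo
)

type Logger struct {
	level atomic.Int32
	mu    sync.Mutex // only taken when a line is actually written
	out   io.Writer
}

func (l *Logger) DebugEnabled() bool { return l.level.Load() <= LevelDebug }

func (l *Logger) Debug(format string, args ...any) {
	if !l.DebugEnabled() {
		return
	}
	l.emit("DEBUG", format, args)
}

// emit is an assumed implementation: format first, then hold the lock
// only for the write itself.
func (l *Logger) emit(prefix, format string, args []any) {
	line := prefix + " " + fmt.Sprintf(format, args...) + "\n"
	l.mu.Lock()
	defer l.mu.Unlock()
	io.WriteString(l.out, line)
}

// countWriter counts Write calls (demo and test helper only).
type countWriter struct{ lines int }

func (w *countWriter) Write(p []byte) (int, error) {
	w.lines++
	return len(p), nil
}

func main() {
	logger := &Logger{out: os.Stderr}
	logger.level.Store(LevelInfo)
	userID, state := 42, "active"

	// With the guard, a suppressed line costs one atomic load and no boxing.
	if logger.DebugEnabled() {
		logger.Debug("user %d state=%s", userID, state)
	}

	logger.level.Store(LevelDebug)
	logger.Debug("user %d state=%s", userID, state) // written to stderr
}
```

The explicit `DebugEnabled()` guard at the call site is what avoids building the `...any` slice; the check inside `Debug` only protects callers that forget the guard.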
Exercise 12 — Three singletons collapsed into one access
Scenario: A hot request handler fetches three independent singletons (config, metrics, logger). Each fetch is cheap, but three atomic loads back-to-back add up.
Before:
func handle(req Request) error {
    cfg := config.Get()
    met := metrics.Get()
    lg := logger.Get()
    if req.Size > cfg.MaxSize {
        met.IncReject()
        lg.Warn("too big")
        return ErrTooBig
    }
    ...
}
Three atomic loads, three function calls.
After
Bundle related singletons into a single context struct exposed via one `Get()`:

type Services struct {
    Cfg     *Config
    Metrics *Metrics
    Logger  *Logger
}

var svc atomic.Pointer[Services]

func init() {
    svc.Store(&Services{
        Cfg:     loadConfig(),
        Metrics: newMetrics(),
        Logger:  newLogger(),
    })
}

func Get() *Services { return svc.Load() }

// Caller:
func handle(req Request) error {
    s := Get() // one atomic load covers all three singletons
    if req.Size > s.Cfg.MaxSize {
        s.Metrics.IncReject()
        s.Logger.Warn("too big")
        return ErrTooBig
    }
    ...
}
When NOT to optimize
Singleton access isn't free, but it isn't the bottleneck in most services either. The right order of operations:
- Profile. Run `go test -bench=. -cpuprofile=cpu.out` or `curl /debug/pprof/profile`, then `go tool pprof -top cpu.out`.
- Identify. Is `config.Get` (or your equivalent) in the top 10 of CPU? Top 20? If not, leave it alone.
- Mutex profile too. Run `go test -mutexprofile=mu.out`. Singleton contention shows up here, not in CPU. A 1 ms median lock-wait will not appear in CPU profiles but will tank tail latency.
- Apply selectively. Use the cheapest primitive that still gives the guarantees:
  - Read-mostly immutable → `atomic.Pointer` swap (Exercises 1, 5).
  - Read-mostly with field reads → `RWMutex` (Exercise 7).
  - Counter only → `atomic.Int64` (Exercise 7).
  - Truly contended map → shard (Exercise 4).
  - Truly cold init that may never run → `sync.Once` (Exercises 1, 2).
  - Always-needed init → plain `init()` + package var (Exercise 3).
- Measure again. Confirm the optimisation removed the bottleneck. If not, revert (simplicity wins).
What's almost always worth it (no profile needed):
- Replacing `Mutex.Lock()`/`Unlock()` around a single read of an immutable pointer with `atomic.Pointer.Load`. The mutex was over-engineering from the start.
- Pre-compiling, pre-parsing, or pre-allocating anything that the singleton hands out (Exercise 9).
- Moving expensive `init()` to `sync.Once`-guarded lazy init (Exercise 2). This is faster and improves startup.
What's rarely worth it:
- Sharding a small map. Profile first; 32 shards of a 50-entry map is just waste.
- PGO for a singleton-returning factory in a low-QPS service.
- Manually hoisting a `Get()` call out of a loop that isn't hot.
What can backfire:
- Using `sync.Map` because "it's lock-free". It isn't; it's "lock-free for the read-mostly path". For mixed workloads it's slower than sharded `map` + `RWMutex`.
- Replacing `Mutex` with `RWMutex` when reads are not actually dominant. The reader-counter overhead can make things worse.
- `atomic.Pointer` swaps when the underlying struct contains mutable maps/slices. Readers still race on those.
Summary
Wins that always ship:
- `sync.Once` + `atomic.Pointer` for read-mostly singletons (Exercise 1).
- Lazy init via `sync.Once` instead of `init()` for expensive resources (Exercise 2).
- Plain `init()`, not `sync.Once`, for always-needed, cheap-to-build singletons (Exercise 3).
- Parse env/config once, cache (Exercise 9).
- Atomic level check before formatting log args (Exercise 11).
Wins behind a profile:
- Shard `sync.Map` or contended map (Exercise 4).
- Atomic-pointer swap for hot reload (Exercise 5).
- Hoist `Get()` out of hot loops (Exercise 6).
- `atomic.Int64` or `RWMutex` for read-heavy fields (Exercise 7).
- Bundle related singletons into one access (Exercise 12).
Wins that trade off flexibility:
- PGO devirtualization of interface-returning factories (Exercise 8).
- Sharing a parsed prototype across tests instead of re-parsing (Exercise 10).
- Bundling singletons into a `Services` struct (Exercise 12).
Rarely worth it:
- Sharding small maps.
- PGO for low-QPS services.
- Manual hoist of cold-path `Get()` calls.
Singletons are special among patterns: they sit on the hot path, so even small per-call savings compound. But the inverse is also true — a misplaced atomic or shard scheme will cost more than the textbook implementation it replaced. Profile, identify, then apply the cheapest primitive that still gives the guarantees you need.