Singleton — Optimize¶

Source: refactoring.guru/design-patterns/singleton Format: Slow / inefficient implementations + benchmarks + optimized version + tradeoffs.

Each exercise: take a working but slow Singleton, profile or reason about why it's slow, optimize, and measure the win.

Table of Contents¶

Optimization 1: Replace synchronized lazy with lazy holder
Optimization 2: Replace mutex with sync.Once
Optimization 3: Recreating value inside Once
Optimization 4: Bound an unbounded cache
Optimization 5: Sharded singleton for hot reads
Optimization 6: Eager init slowing startup
Optimization 7: Async logger to remove serialization
Optimization 8: Trim the singleton's deep object graph
Optimization 9: Defer expensive Python module init
Optimization 10: Per-context state instead of mutated singleton

All benchmarks below come from a 2024 Apple M2 Pro, single-threaded unless noted.

Optimization 1: Replace synchronized lazy with lazy holder¶

Slow code (Java)¶

public final class Logger {
    private static Logger INSTANCE;
    private Logger() {}
    public static synchronized Logger getInstance() {  // BOTTLENECK
        if (INSTANCE == null) INSTANCE = new Logger();
        return INSTANCE;
    }
    public void log(String msg) { /* ... */ }
}

Benchmark (JMH, 8 threads)¶

Benchmark                     Mode  Cnt   Score   Error  Units
LoggerSync.getInstance       thrpt   10  29.8M  ±  0.4M  ops/s

Throughput plateaus at single-thread speed because every read takes the lock.

Optimized — Lazy Holder¶

public final class Logger {
    private Logger() {}
    private static class Holder { static final Logger INSTANCE = new Logger(); }
    public static Logger getInstance() { return Holder.INSTANCE; }
    public void log(String msg) { /* ... */ }
}

Benchmark after¶

LoggerHolder.getInstance     thrpt   10  4500M  ±  20M  ops/s

150× speedup. The hot path is now a single getstatic instruction; after JIT, it's effectively constant-folded.

Tradeoff¶

Lazy Holder is correct, simple, fast.
The class can't easily be subclassed for testing — use interface + DI if mocking is needed.
For a replaceable singleton (test scenarios), expose a __reset() for tests or use enum + factory.

Optimization 2: Replace mutex with sync.Once¶

Slow code (Go)¶

var (
    instance *Service
    mu       sync.Mutex
)

func Get() *Service {
    mu.Lock(); defer mu.Unlock()
    if instance == nil {
        instance = &Service{}
    }
    return instance
}

Benchmark¶

func BenchmarkMutexLazy(b *testing.B) {
    for i := 0; i < b.N; i++ { _ = Get() }
}

BenchmarkMutexLazy-8     100000000     11.0 ns/op

Optimized — `sync.Once`¶

var (
    instance *Service
    once     sync.Once
)

func Get() *Service {
    once.Do(func() { instance = &Service{} })
    return instance
}

Benchmark after¶

BenchmarkOnce-8     500000000      2.3 ns/op

~5× speedup. Hot path is a single atomic load — no mutex contention.

Tradeoff¶

sync.Once is the idiomatic Go pattern. No real tradeoff vs the mutex approach for this use case — strictly better.

Optimization 3: Recreating value inside Once¶

Slow / buggy code (Go)¶

var (
    instance *Service
    once     sync.Once
)

func InitWithConfig(cfg Config) *Service {
    once.Do(func() {
        instance = buildService(cfg)
    })
    return buildService(cfg)   // BUG/SLOW: build every call
}

Symptom¶

Every call to InitWithConfig allocates a new Service and discards it (returning the freshly built one instead of the cached instance). 100× slower than expected.

Benchmark¶

BenchmarkBadInit-8     1000000     1500 ns/op    (allocates per call)

Optimized¶

func Get(cfg Config) *Service {
    once.Do(func() { instance = buildService(cfg) })
    return instance
}

Benchmark after¶

BenchmarkGet-8     500000000     2.3 ns/op    0 allocs/op

~650× speedup. And no allocations.

Lesson¶

Subtle bug: returning buildService(cfg) instead of instance. Code review catches this; benchmarks make it screaming-obvious.

Optimization 4: Bound an unbounded cache¶

Slow code (Python)¶

class Cache:
    _instance = None
    @classmethod
    def get(cls):
        if cls._instance is None: cls._instance = cls()
        return cls._instance
    def __init__(self): self.data = {}
    def set(self, key, value): self.data[key] = value
    def lookup(self, key): return self.data.get(key)

Symptoms¶

After running the application for hours, RSS grows from 200 MB to 8 GB. tracemalloc shows millions of entries in Cache.data.

# tracemalloc snapshot
[<frame at cache.py:8 (set)>]   12.5 GiB

Optimized — Bounded LRU + TTL¶

from cachetools import TTLCache

cache = TTLCache(maxsize=10_000, ttl=300)

Memory after¶

RSS stable at ~250 MB regardless of runtime.

Benchmark (set/lookup)¶

	Naive `dict`	`TTLCache`
`set`	~150 ns	~700 ns
`lookup` (hit)	~80 ns	~300 ns

TTLCache is slower per op but bounded — the real tradeoff is between op cost and memory growth. For long-running services, bounded almost always wins.

Tradeoff¶

4× slower per operation.
30× less memory in steady state.
Eviction may evict hot entries — tune maxsize based on working set.

Optimization 5: Sharded singleton for hot reads¶

Slow code (Go)¶

var (
    instance *Cache
    once     sync.Once
)

type Cache struct {
    mu sync.RWMutex
    m  map[string]string
}

func Get() *Cache {
    once.Do(func() { instance = &Cache{m: map[string]string{}} })
    return instance
}

func (c *Cache) Lookup(k string) string {
    c.mu.RLock(); defer c.mu.RUnlock()
    return c.m[k]
}

func (c *Cache) Set(k, v string) {
    c.mu.Lock(); defer c.mu.Unlock()
    c.m[k] = v
}

Symptoms¶

Under 64 concurrent goroutines doing reads, throughput plateaus around 30 M ops/s. pprof shows 60% time in RWMutex.RLock.

Optimized — Sharded¶

const shardCount = 32

type Cache struct {
    shards [shardCount]struct {
        mu sync.RWMutex
        m  map[string]string
    }
}

func newCache() *Cache {
    c := &Cache{}
    for i := range c.shards {
        c.shards[i].m = map[string]string{}
    }
    return c
}

var (
    instance *Cache
    once     sync.Once
)

func Get() *Cache { once.Do(func() { instance = newCache() }); return instance }

func (c *Cache) Lookup(k string) string {
    s := &c.shards[fnv32(k)%shardCount]
    s.mu.RLock(); defer s.mu.RUnlock()
    return s.m[k]
}

func (c *Cache) Set(k, v string) {
    s := &c.shards[fnv32(k)%shardCount]
    s.mu.Lock(); defer s.mu.Unlock()
    s.m[k] = v
}

func fnv32(s string) uint32 {
    h := uint32(2166136261)
    for i := 0; i < len(s); i++ { h ^= uint32(s[i]); h *= 16777619 }
    return h
}

Benchmark (64 goroutines)¶

Implementation	Throughput	Lock time
Single RWMutex	30 M ops/s	60%
32 shards	850 M ops/s	<5%

~28× speedup. Most goroutines now contend on different mutexes.

Tradeoff¶

Memory: N hash maps instead of 1 — slight overhead.
Iteration order: harder to enumerate all keys consistently.
Hash function: must be fast and well-distributed.

Optimization 6: Eager init slowing startup¶

Slow code (Java)¶

public final class Reports {
    private static final Reports INSTANCE = new Reports();
    private final ExpensiveCache cache;
    private Reports() { this.cache = ExpensiveCache.preload(); }   // 800 ms
    public static Reports getInstance() { return INSTANCE; }
}

Symptom¶

App startup takes 1.2 s, mostly in Reports's static init. The Reports feature is used by only 5% of users.

Optimized — Lazy holder¶

public final class Reports {
    private final ExpensiveCache cache;
    private Reports() { this.cache = ExpensiveCache.preload(); }
    private static class Holder { static final Reports INSTANCE = new Reports(); }
    public static Reports getInstance() { return Holder.INSTANCE; }
}

Result¶

Startup: 400 ms (3× faster).
First call to Reports.getInstance() for users who use it: +800 ms.
95% of users who never call it: pay nothing.

Tradeoff¶

Faster startup vs. higher first-call latency for the feature. Worth it when: - Most users don't use the feature. - Or, the cold path is acceptable.

If even the first call must be fast, run Reports.getInstance() in a background thread shortly after startup (warmup).

Optimization 7: Async logger to remove serialization¶

Slow code (Go)¶

type Logger struct{ mu sync.Mutex; out io.Writer }

func (l *Logger) Log(msg string) {
    l.mu.Lock(); defer l.mu.Unlock()
    fmt.Fprintln(l.out, msg)   // syscall under the lock
}

Symptom¶

100 concurrent goroutines logging 1 KB each → ~5 µs per log on average. Lock-and-syscall serializes all calls.

Optimized — Async logger¶

type Logger struct {
    ch chan string
}

func newLogger(out io.Writer) *Logger {
    l := &Logger{ch: make(chan string, 1024)}
    go func() {
        bw := bufio.NewWriter(out)
        ticker := time.NewTicker(50 * time.Millisecond)
        defer bw.Flush()
        for {
            select {
            case msg := <-l.ch:
                bw.WriteString(msg); bw.WriteByte('\n')
            case <-ticker.C:
                bw.Flush()
            }
        }
    }()
    return l
}

func (l *Logger) Log(msg string) { l.ch <- msg }

Benchmark¶

	Sync	Async
100 goroutines × 1k logs	500 ms	30 ms
Per-log latency at producer	5 µs	50 ns

~17× speedup at the call site. Producers don't wait for I/O.

Tradeoff¶

Logs may be lost on SIGKILL — they're in the buffer.
Need explicit shutdown flush.
Buffered I/O delays log visibility by ~50 ms.

For most production logging, this is acceptable. For audit logs that must persist, keep them sync (or use durable log shipping).

Optimization 8: Trim the singleton's deep object graph¶

Slow code¶

public final class Settings {
    private static final Settings INSTANCE = new Settings();
    private final List<UserSession> activeSessions = new ArrayList<>();   // BUG: grows forever
    private final Map<String, byte[]> attachmentCache = new HashMap<>();  // BUG: grows forever

    public static Settings getInstance() { return INSTANCE; }
    // ...
}

Symptom¶

Heap dumps show the singleton retaining 500 MB after 1 day of operation. GC pauses grow from 50 ms to 500 ms because the old generation is full of singleton-rooted objects.

Optimized¶

public final class Settings {
    private static final Settings INSTANCE = new Settings();

    // Bounded: only the 100 most recent sessions
    private final EvictingQueue<UserSession> activeSessions = EvictingQueue.create(100);

    // Bounded with LRU eviction + TTL
    private final Cache<String, byte[]> attachmentCache = Caffeine.newBuilder()
        .maximumSize(50)
        .expireAfterAccess(Duration.ofMinutes(10))
        .build();
}

Result¶

Old gen size: stable at 50 MB.
GC pauses: 50 ms.

Lesson¶

Singletons are GC roots. Anything they reference, anything, becomes effectively immortal. If a collection grows unbounded inside a singleton, it's a memory leak with extra steps. Audit collections inside singletons for: add/put without remove/evict.

Optimization 9: Defer expensive Python module init¶

Slow code¶

# heavy.py
import time
time.sleep(0.5)   # simulate expensive setup
data = {i: f"value-{i}" for i in range(1_000_000)}

Symptom¶

from heavy import data adds 500 ms + 60 MB to every process that imports it, even if data is never used.

Optimized — Lazy property¶

# heavy.py
class _Lazy:
    _data = None
    @property
    def data(self):
        if self._data is None:
            time.sleep(0.5)
            self._data = {i: f"value-{i}" for i in range(1_000_000)}
        return self._data

heavy = _Lazy()

Callers use heavy.data. The expensive initialization runs only on first access.

Result¶

Import time: 50 ms (10× faster).
Memory before first access: 0 MB.
First access after import: 500 ms (same total cost, just deferred).

Tradeoff¶

First access becomes slower (cold start).
heavy.data becomes a property call instead of a plain attribute (negligible).

Useful when: many modules import heavy, few of them use data.

Optimization 10: Per-context state instead of mutated singleton¶

Slow / wrong code (Python)¶

class RequestContext:
    _instance = None
    @classmethod
    def get(cls):
        if cls._instance is None: cls._instance = cls()
        return cls._instance
    def __init__(self): self.user = None; self.tenant = None

# in middleware
RequestContext.get().user = parse_user(req)
RequestContext.get().tenant = parse_tenant(req)

# in handler
user = RequestContext.get().user

Symptom¶

Under concurrent requests, user A's context is visible to user B. Subtle bugs: requests see wrong user identity. Security leak.

The "Singleton" was hijacked to hold per-request state — but it's process-wide.

Optimized — Context-local¶

Python (using contextvars):

from contextvars import ContextVar

_user: ContextVar = ContextVar("user", default=None)
_tenant: ContextVar = ContextVar("tenant", default=None)

class RequestContext:
    @staticmethod
    def set_user(u): _user.set(u)
    @staticmethod
    def user(): return _user.get()
    @staticmethod
    def set_tenant(t): _tenant.set(t)
    @staticmethod
    def tenant(): return _tenant.get()

contextvars are scoped per asyncio task / per-thread (in WSGI). Each request gets its own state.

Go (using context.Context):

type ctxKey int
const (
    userKey ctxKey = iota
    tenantKey
)

func WithUser(ctx context.Context, u User) context.Context {
    return context.WithValue(ctx, userKey, u)
}

func User(ctx context.Context) User {
    return ctx.Value(userKey).(User)
}

Pass ctx through every function call. The Go convention is "context as first parameter."

Java (using ThreadLocal):

public final class RequestContext {
    private static final ThreadLocal<User> USER = new ThreadLocal<>();
    public static void setUser(User u) { USER.set(u); }
    public static User user() { return USER.get(); }
}

In a servlet filter:

RequestContext.setUser(parseUser(req));
try { chain.doFilter(...); }
finally { USER.remove(); }   // critical — clear on response

Result¶

No cross-request contamination.
Each request has its own state, isolated by language-level mechanisms.
Slightly more boilerplate (context passing) — worth it for correctness.

Lesson¶

The Singleton is a fixed-cardinality construct: exactly one. The moment you find yourself wanting "one per request" or "one per session," you have outgrown Singleton. Use:

contextvars (Python)
context.Context (Go)
ThreadLocal (Java) — careful with thread pools, always remove()
DI container with request scope (Spring @RequestScope)

This is the most expensive Singleton anti-pattern in real production systems — a security bug waiting to happen.

Optimization Tips¶

How to find singleton bottlenecks¶

Profile. pprof (Go), async-profiler (Java), py-spy (Python).
Look for time in lock methods. RWMutex.RLock, synchronized, Lock.acquire.
Look for time in getInstance(). Even when correct, it shouldn't be a bottleneck.
Heap dumps. Singletons retaining lots of memory? Audit their fields.
GC logs. Long pauses → check old-gen contents → check singleton-rooted objects.

Optimization checklist¶

Lock-free hot path (atomic load on the singleton reference)
Bounded collections (LRU, TTL, size cap)
Lazy init for expensive setup
Sharded state for hot mutable singletons
Async I/O instead of synchronous, lock-held writes
Per-context state separated from process-global state
Explicit shutdown / flush

Anti-optimizations to avoid¶

❌ Premature DCL — use lazy holder or enum.
❌ Volatile fields without understanding JMM.
❌ Sharding when there's no contention measured.
❌ Async logging when sync is fast enough — adds operational complexity.
❌ Reactor patterns inside a singleton — singleton's cardinality says "1," reactors say "scale out."

← Back to Singleton folder · ↑ Creational Patterns · ↑↑ Roadmap Home

Previous: Singleton — Find-Bug

Singleton roadmap complete. All 8 files: junior · middle · senior · professional · interview · tasks · find-bug · optimize.

Singleton — Optimize¶

Table of Contents¶

Optimization 1: Replace synchronized lazy with lazy holder¶

Slow code (Java)¶

Benchmark (JMH, 8 threads)¶

Optimized — Lazy Holder¶

Benchmark after¶

Tradeoff¶

Optimization 2: Replace mutex with sync.Once¶

Slow code (Go)¶

Benchmark¶

Optimized — sync.Once¶

Benchmark after¶

Tradeoff¶

Optimization 3: Recreating value inside Once¶

Slow / buggy code (Go)¶

Symptom¶

Benchmark¶

Optimized¶

Benchmark after¶

Lesson¶

Optimization 4: Bound an unbounded cache¶

Slow code (Python)¶

Symptoms¶

Optimized — Bounded LRU + TTL¶

Memory after¶

Benchmark (set/lookup)¶

Tradeoff¶

Optimization 5: Sharded singleton for hot reads¶

Slow code (Go)¶

Symptoms¶

Optimized — Sharded¶

Benchmark (64 goroutines)¶

Tradeoff¶

Optimization 6: Eager init slowing startup¶

Slow code (Java)¶

Symptom¶

Optimized — Lazy holder¶

Result¶

Tradeoff¶

Optimization 7: Async logger to remove serialization¶

Slow code (Go)¶

Symptom¶

Optimized — Async logger¶

Benchmark¶

Tradeoff¶

Optimization 8: Trim the singleton's deep object graph¶

Slow code¶

Symptom¶

Optimized¶

Result¶

Lesson¶

Optimization 9: Defer expensive Python module init¶

Slow code¶

Symptom¶

Optimized — Lazy property¶

Result¶

Tradeoff¶

Optimization 10: Per-context state instead of mutated singleton¶

Slow / wrong code (Python)¶

Symptom¶

Optimized — Context-local¶

Result¶

Lesson¶

Optimization Tips¶

How to find singleton bottlenecks¶

Optimization checklist¶

Anti-optimizations to avoid¶

Optimized — `sync.Once`¶