Singleton — Optimize¶
Source: refactoring.guru/design-patterns/singleton
Format: slow / inefficient implementations + benchmarks + optimized version + tradeoffs.
Each exercise: take a working but slow Singleton, profile or reason about why it's slow, optimize, and measure the win.
Table of Contents¶
- Optimization 1: Replace synchronized lazy with lazy holder
- Optimization 2: Replace mutex with sync.Once
- Optimization 3: Recreating value inside Once
- Optimization 4: Bound an unbounded cache
- Optimization 5: Sharded singleton for hot reads
- Optimization 6: Eager init slowing startup
- Optimization 7: Async logger to remove serialization
- Optimization 8: Trim the singleton's deep object graph
- Optimization 9: Defer expensive Python module init
- Optimization 10: Per-context state instead of mutated singleton
All benchmarks below come from a 2024 Apple M2 Pro, single-threaded unless noted.
Optimization 1: Replace synchronized lazy with lazy holder¶
Slow code (Java)¶
public final class Logger {
private static Logger INSTANCE;
private Logger() {}
public static synchronized Logger getInstance() { // BOTTLENECK
if (INSTANCE == null) INSTANCE = new Logger();
return INSTANCE;
}
public void log(String msg) { /* ... */ }
}
Benchmark (JMH, 8 threads)¶
Throughput plateaus at single-thread speed because every read takes the lock.
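The benchmark table itself isn't reproduced here; a minimal JMH harness along these lines is what would produce it — the class name LoggerBench and the annotation values are illustrative, not from the original setup:
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@Threads(8)
@State(Scope.Benchmark)
public class LoggerBench {
    @Benchmark
    public Logger getInstance() {
        // Returning the instance keeps the JIT from eliding the call.
        return Logger.getInstance();
    }
}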
Optimized — Lazy Holder¶
public final class Logger {
private Logger() {}
private static class Holder { static final Logger INSTANCE = new Logger(); }
public static Logger getInstance() { return Holder.INSTANCE; }
public void log(String msg) { /* ... */ }
}
Benchmark after¶
150× speedup. The hot path is now a single getstatic instruction; after JIT, it's effectively constant-folded.
Tradeoff¶
- Lazy Holder is correct, simple, fast.
- The class can't easily be subclassed for testing — use interface + DI if mocking is needed.
- For a replaceable singleton (test scenarios), expose a `__reset()` for tests or use enum + factory.
Optimization 2: Replace mutex with sync.Once¶
Slow code (Go)¶
var (
instance *Service
mu sync.Mutex
)
func Get() *Service {
mu.Lock(); defer mu.Unlock()
if instance == nil {
instance = &Service{}
}
return instance
}
Benchmark¶
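The "before" numbers aren't reproduced here; a go test benchmark along these lines is how Get() gets exercised under parallel load (it lives in a _test.go file in the same package — the file name is an assumption):
// get_bench_test.go — run with: go test -bench=Get -cpu=64
func BenchmarkGet(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			_ = Get() // hot path: lock (before) vs. atomic load (after)
		}
	})
}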
Optimized — sync.Once¶
var (
instance *Service
once sync.Once
)
func Get() *Service {
once.Do(func() { instance = &Service{} })
return instance
}
Benchmark after¶
~5× speedup. Hot path is a single atomic load — no mutex contention.
Tradeoff¶
sync.Once is the idiomatic Go pattern. No real tradeoff vs the mutex approach for this use case — strictly better.
Optimization 3: Recreating value inside Once¶
Slow / buggy code (Go)¶
var (
instance *Service
once sync.Once
)
func InitWithConfig(cfg Config) *Service {
once.Do(func() {
instance = buildService(cfg)
})
return buildService(cfg) // BUG/SLOW: build every call
}
Symptom¶
Every call to InitWithConfig builds and returns a fresh Service instead of the cached instance, so the value initialized inside once.Do is never reused. 100× slower than expected.
Benchmark¶
Optimized¶
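The optimized listing follows directly from the bug described below: build inside once.Do and return the cached pointer.
var (
	instance *Service
	once     sync.Once
)

func InitWithConfig(cfg Config) *Service {
	once.Do(func() {
		instance = buildService(cfg) // runs exactly once; later cfg values are ignored — first caller wins
	})
	return instance // return the cached instance, not a fresh build
}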
Benchmark after¶
~650× speedup, and zero allocations on the hot path.
Lesson¶
Subtle bug: returning `buildService(cfg)` instead of `instance`. Code review catches this; benchmarks make it screamingly obvious.
Optimization 4: Bound an unbounded cache¶
Slow code (Python)¶
class Cache:
_instance = None
@classmethod
def get(cls):
if cls._instance is None: cls._instance = cls()
return cls._instance
def __init__(self): self.data = {}
def set(self, key, value): self.data[key] = value
def lookup(self, key): return self.data.get(key)
Symptoms¶
After running the application for hours, RSS grows from 200 MB to 8 GB. tracemalloc shows millions of entries in Cache.data.
Optimized — Bounded LRU + TTL¶
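The optimized listing isn't shown in the source; a minimal sketch assuming the third-party cachetools package (its TTLCache is an LRU cache with per-item time-to-live) — the maxsize and ttl values are illustrative and should be tuned to the working set:
from cachetools import TTLCache


class Cache:
    _instance = None

    @classmethod
    def get(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        # Bounded: at most 100k entries, each dropped 10 minutes after insertion.
        self.data = TTLCache(maxsize=100_000, ttl=600)

    def set(self, key, value):
        self.data[key] = value

    def lookup(self, key):
        return self.data.get(key)  # expired or evicted keys simply return None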
Memory after¶
RSS stable at ~250 MB regardless of runtime.
Benchmark (set/lookup)¶
| Operation | Naive dict | TTLCache |
|---|---|---|
| set | ~150 ns | ~700 ns |
| lookup (hit) | ~80 ns | ~300 ns |
TTLCache is slower per op but bounded — the real tradeoff is between op cost and memory growth. For long-running services, bounded almost always wins.
Tradeoff¶
- 4× slower per operation.
- 30× less memory in steady state.
- Eviction may evict hot entries — tune `maxsize` based on the working set.
Optimization 5: Sharded singleton for hot reads¶
Slow code (Go)¶
var (
instance *Cache
once sync.Once
)
type Cache struct {
mu sync.RWMutex
m map[string]string
}
func Get() *Cache {
once.Do(func() { instance = &Cache{m: map[string]string{}} })
return instance
}
func (c *Cache) Lookup(k string) string {
c.mu.RLock(); defer c.mu.RUnlock()
return c.m[k]
}
func (c *Cache) Set(k, v string) {
c.mu.Lock(); defer c.mu.Unlock()
c.m[k] = v
}
Symptoms¶
Under 64 concurrent goroutines doing reads, throughput plateaus around 30 M ops/s. pprof shows 60% time in RWMutex.RLock.
Optimized — Sharded¶
const shardCount = 32
type Cache struct {
shards [shardCount]struct {
mu sync.RWMutex
m map[string]string
}
}
func newCache() *Cache {
c := &Cache{}
for i := range c.shards {
c.shards[i].m = map[string]string{}
}
return c
}
var (
instance *Cache
once sync.Once
)
func Get() *Cache { once.Do(func() { instance = newCache() }); return instance }
func (c *Cache) Lookup(k string) string {
s := &c.shards[fnv32(k)%shardCount]
s.mu.RLock(); defer s.mu.RUnlock()
return s.m[k]
}
func (c *Cache) Set(k, v string) {
s := &c.shards[fnv32(k)%shardCount]
s.mu.Lock(); defer s.mu.Unlock()
s.m[k] = v
}
func fnv32(s string) uint32 {
h := uint32(2166136261)
for i := 0; i < len(s); i++ { h ^= uint32(s[i]); h *= 16777619 }
return h
}
Benchmark (64 goroutines)¶
| Implementation | Throughput | Lock time |
|---|---|---|
| Single RWMutex | 30 M ops/s | 60% |
| 32 shards | 850 M ops/s | <5% |
~28× speedup. Most goroutines now contend on different mutexes.
Tradeoff¶
- Memory: N hash maps instead of 1 — slight overhead.
- Iteration order: harder to enumerate all keys consistently.
- Hash function: must be fast and well-distributed.
Optimization 6: Eager init slowing startup¶
Slow code (Java)¶
public final class Reports {
private static final Reports INSTANCE = new Reports();
private final ExpensiveCache cache;
private Reports() { this.cache = ExpensiveCache.preload(); } // 800 ms
public static Reports getInstance() { return INSTANCE; }
}
Symptom¶
App startup takes 1.2 s, mostly in the static initializer of Reports. The Reports feature is used by only 5% of users.
Optimized — Lazy holder¶
public final class Reports {
private final ExpensiveCache cache;
private Reports() { this.cache = ExpensiveCache.preload(); }
private static class Holder { static final Reports INSTANCE = new Reports(); }
public static Reports getInstance() { return Holder.INSTANCE; }
}
Result¶
- Startup: 400 ms (3× faster).
- First call to `Reports.getInstance()` for users who use the feature: +800 ms.
- The 95% of users who never call it pay nothing.
Tradeoff¶
Faster startup vs. higher first-call latency for the feature. Worth it when:
- Most users don't use the feature.
- Or the cold path is acceptable.
If even the first call must be fast, run Reports.getInstance() in a background thread shortly after startup (warmup).
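A minimal sketch of that warmup — the thread name and placement in startup code are illustrative:
// Somewhere in application startup, after the critical path has finished:
Thread warmup = new Thread(Reports::getInstance, "reports-warmup");
warmup.setDaemon(true); // never block JVM shutdown on the warmup
warmup.start();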
Optimization 7: Async logger to remove serialization¶
Slow code (Go)¶
type Logger struct{ mu sync.Mutex; out io.Writer }
func (l *Logger) Log(msg string) {
l.mu.Lock(); defer l.mu.Unlock()
fmt.Fprintln(l.out, msg) // syscall under the lock
}
Symptom¶
100 concurrent goroutines logging 1 KB each → ~5 µs per log on average. Lock-and-syscall serializes all calls.
Optimized — Async logger¶
type Logger struct {
ch chan string
}
func newLogger(out io.Writer) *Logger {
l := &Logger{ch: make(chan string, 1024)}
go func() {
bw := bufio.NewWriter(out)
ticker := time.NewTicker(50 * time.Millisecond)
defer bw.Flush()
for {
select {
case msg := <-l.ch:
bw.WriteString(msg); bw.WriteByte('\n')
case <-ticker.C:
bw.Flush()
}
}
}()
return l
}
func (l *Logger) Log(msg string) { l.ch <- msg }
Benchmark¶
| Metric | Sync | Async |
|---|---|---|
| 100 goroutines × 1k logs (total) | 500 ms | 30 ms |
| Per-log latency at producer | 5 µs | 50 ns |
~17× speedup at the call site. Producers don't wait for I/O.
Tradeoff¶
- Logs may be lost on `SIGKILL` — they're still in the buffer.
- Needs an explicit shutdown flush (sketched at the end of this section).
- Buffered I/O delays log visibility by ~50 ms.
For most production logging, this is acceptable. For audit logs that must persist, keep them sync (or use durable log shipping).
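A sketch of the explicit shutdown flush from the tradeoff list above. The done field and Close method are additions to the earlier Logger; the names are illustrative:
type Logger struct {
	ch   chan string
	done chan struct{}
}

func newLogger(out io.Writer) *Logger {
	l := &Logger{ch: make(chan string, 1024), done: make(chan struct{})}
	go func() {
		bw := bufio.NewWriter(out)
		ticker := time.NewTicker(50 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case msg, ok := <-l.ch:
				if !ok { // channel closed and drained: final flush, signal Close
					bw.Flush()
					close(l.done)
					return
				}
				bw.WriteString(msg)
				bw.WriteByte('\n')
			case <-ticker.C:
				bw.Flush()
			}
		}
	}()
	return l
}

// Close stops accepting new logs, waits for buffered messages to reach the
// writer, and flushes. Call it from the process's shutdown path; logging
// after Close panics (send on a closed channel).
func (l *Logger) Close() {
	close(l.ch)
	<-l.done
}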
Optimization 8: Trim the singleton's deep object graph¶
Slow code¶
public final class Settings {
private static final Settings INSTANCE = new Settings();
private final List<UserSession> activeSessions = new ArrayList<>(); // BUG: grows forever
private final Map<String, byte[]> attachmentCache = new HashMap<>(); // BUG: grows forever
public static Settings getInstance() { return INSTANCE; }
// ...
}
Symptom¶
Heap dumps show the singleton retaining 500 MB after 1 day of operation. GC pauses grow from 50 ms to 500 ms because the old generation is full of singleton-rooted objects.
Optimized¶
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.google.common.collect.EvictingQueue;

public final class Settings {
    private static final Settings INSTANCE = new Settings();
    // Bounded: only the 100 most recent sessions (Guava EvictingQueue)
    private final EvictingQueue<UserSession> activeSessions = EvictingQueue.create(100);
    // Bounded by size, with idle-time expiry (Caffeine)
    private final Cache<String, byte[]> attachmentCache = Caffeine.newBuilder()
        .maximumSize(50)
        .expireAfterAccess(Duration.ofMinutes(10))
        .build();
    public static Settings getInstance() { return INSTANCE; }
}
Result¶
- Old gen size: stable at 50 MB.
- GC pauses: 50 ms.
Lesson¶
Singletons hang off GC roots for the life of the process: anything they reference becomes effectively immortal. If a collection grows unbounded inside a singleton, it's a memory leak with extra steps. Audit collections inside singletons for add/put calls without a matching remove/evict.
Optimization 9: Defer expensive Python module init¶
Slow code¶
# heavy.py
import time
time.sleep(0.5) # simulate expensive setup
data = {i: f"value-{i}" for i in range(1_000_000)}
Symptom¶
from heavy import data adds 500 ms + 60 MB to every process that imports it, even if data is never used.
Optimized — Lazy property¶
# heavy.py
import time


class _Lazy:
    _data = None

    @property
    def data(self):
        if self._data is None:
            time.sleep(0.5)  # simulate expensive setup
            self._data = {i: f"value-{i}" for i in range(1_000_000)}
        return self._data


heavy = _Lazy()
Callers use heavy.data. The expensive initialization runs only on first access.
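Call-site illustration, using the numbers from the example above:
from heavy import heavy

print(heavy.data[0])  # first access pays the ~500 ms build
print(heavy.data[1])  # later accesses are plain dict lookups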
Result¶
- Import time: 50 ms (10× faster).
- Memory before first access: 0 MB.
- First access after import: 500 ms (same total cost, just deferred).
Tradeoff¶
- First access becomes slower (cold start).
- `heavy.data` becomes a property call instead of a plain attribute (negligible).
Useful when many modules import `heavy` but few of them use `data`.
Optimization 10: Per-context state instead of mutated singleton¶
Slow / wrong code (Python)¶
class RequestContext:
_instance = None
@classmethod
def get(cls):
if cls._instance is None: cls._instance = cls()
return cls._instance
def __init__(self): self.user = None; self.tenant = None
# in middleware
RequestContext.get().user = parse_user(req)
RequestContext.get().tenant = parse_tenant(req)
# in handler
user = RequestContext.get().user
Symptom¶
Under concurrent requests, user A's context is visible to user B. Subtle bugs: requests see wrong user identity. Security leak.
The "Singleton" was hijacked to hold per-request state — but it's process-wide.
Optimized — Context-local¶
Python (using contextvars):
from contextvars import ContextVar
_user: ContextVar = ContextVar("user", default=None)
_tenant: ContextVar = ContextVar("tenant", default=None)
class RequestContext:
@staticmethod
def set_user(u): _user.set(u)
@staticmethod
def user(): return _user.get()
@staticmethod
def set_tenant(t): _tenant.set(t)
@staticmethod
def tenant(): return _tenant.get()
contextvars are scoped per asyncio task / per-thread (in WSGI). Each request gets its own state.
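Middleware/handler usage mirrors the buggy version, but each request's values now live in its own context (parse_user/parse_tenant are the same hypothetical helpers as before):
# in middleware — each task/thread sets values in its own context
RequestContext.set_user(parse_user(req))
RequestContext.set_tenant(parse_tenant(req))

# in handler — reads only ever see this request's values
user = RequestContext.user()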
Go (using context.Context):
type ctxKey int
const (
userKey ctxKey = iota
tenantKey
)
func WithUser(ctx context.Context, u User) context.Context {
return context.WithValue(ctx, userKey, u)
}
func User(ctx context.Context) User {
return ctx.Value(userKey).(User)
}
Pass ctx through every function call. The Go convention is "context as first parameter."
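A sketch of the usual net/http wiring; parseUser is a hypothetical helper:
func authMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := WithUser(r.Context(), parseUser(r)) // attach the per-request user
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func handler(w http.ResponseWriter, r *http.Request) {
	u := User(r.Context()) // read it back anywhere downstream — no global state
	_ = u
}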
Java (using ThreadLocal):
public final class RequestContext {
    private static final ThreadLocal<User> USER = new ThreadLocal<>();
    public static void setUser(User u) { USER.set(u); }
    public static User user() { return USER.get(); }
    public static void clear() { USER.remove(); }
}
In a servlet filter:
RequestContext.setUser(parseUser(req));
try { chain.doFilter(...); }
finally { RequestContext.clear(); } // critical — clear on response, or pooled threads leak state
Result¶
- No cross-request contamination.
- Each request has its own state, isolated by language-level mechanisms.
- Slightly more boilerplate (context passing) — worth it for correctness.
Lesson¶
The Singleton is a fixed-cardinality construct: exactly one. The moment you find yourself wanting "one per request" or "one per session," you have outgrown Singleton. Use:
- `contextvars` (Python)
- `context.Context` (Go)
- `ThreadLocal` (Java) — careful with thread pools, always `remove()`
- DI container with request scope (Spring `@RequestScope`)
This is the most expensive Singleton anti-pattern in real production systems — a security bug waiting to happen.
Optimization Tips¶
How to find singleton bottlenecks¶
- Profile. `pprof` (Go), `async-profiler` (Java), `py-spy` (Python).
- Look for time in lock methods: `RWMutex.RLock`, `synchronized`, `Lock.acquire`.
- Look for time in `getInstance()`. Even when correct, it shouldn't be a bottleneck.
- Heap dumps. Singletons retaining lots of memory? Audit their fields.
- GC logs. Long pauses → check old-gen contents → check singleton-rooted objects.
Optimization checklist¶
- Lock-free hot path (atomic load on the singleton reference)
- Bounded collections (LRU, TTL, size cap)
- Lazy init for expensive setup
- Sharded state for hot mutable singletons
- Async I/O instead of synchronous, lock-held writes
- Per-context state separated from process-global state
- Explicit shutdown / flush
Anti-optimizations to avoid¶
- ❌ Premature DCL (double-checked locking) — use lazy holder or enum.
- ❌ Volatile fields without understanding JMM.
- ❌ Sharding when there's no contention measured.
- ❌ Async logging when sync is fast enough — adds operational complexity.
- ❌ Reactor patterns inside a singleton — singleton's cardinality says "1," reactors say "scale out."
Singleton roadmap complete. All 8 files: junior · middle · senior · professional · interview · tasks · find-bug · optimize.