sync.Pool Internals — Professional Level¶

Table of Contents¶

Introduction
Profiling Pool Churn
Identifying Anti-Pattern Usage in Production
Quantifying Pool Impact
Custom Pool Variants
Comparing to fastcache and bytebufferpool
NUMA and Per-Socket Considerations
Interaction with the Runtime Scheduler
Pool Sizing at Scale
Observability: Metrics You Can Export
Capacity Caps and the Long-Tail Problem
Operational Playbook
Cheat Sheet
Further Reading

Introduction¶

This file is for the engineer responsible for keeping a Go service alive in production. By now you understand what sync.Pool does and how it works internally. The question here is operational: how do you spot a pool that is misbehaving, how do you measure the impact of changing one, and what alternatives exist when sync.Pool is not enough?

We focus on three skills:

Reading pprof output and runtime/metrics data to diagnose pool behavior.
Making informed tradeoffs between sync.Pool, third-party pools (bytebufferpool, fastcache), and hand-rolled solutions.
Designing systems where pools are first-class observability subjects, not invisible internals.

Profiling Pool Churn¶

Heap profile signature¶

The fingerprint of a misbehaving pool in go tool pprof -alloc_space is a single allocation site near sync.(*Pool).New taking a substantial share. Example:

Showing top 10 nodes out of 53
      flat  flat%   sum%        cum   cum%
   2.8GB    35%    35%       2.8GB    35%  main.newBuffer
   1.4GB    18%    53%       1.4GB    18%  bytes.makeSlice
   ...

main.newBuffer is the New function. If it accounts for > 5% of allocation pressure, the pool is missing badly. Either the workload is Get-heavy without matching Put, or the GC is draining the pool faster than it fills.

CPU profile signature¶

In go tool pprof http://localhost:6060/debug/pprof/profile or a recorded pprof.CPUProfile, look for:

Function	What it tells you
`sync.(*Pool).Get` (high self)	Frequent calls; might be fine if hit rate is high.
`sync.(*Pool).getSlow` (high self)	Fast path missing; pool is empty most of the time.
`sync.(*Pool).pinSlow` (high self)	Pool repeatedly hitting cold path — usually means `GOMAXPROCS` changes (unusual) or first-call latency on many pools.
`runtime.gcStart` linked to your `New`	GC is draining the pool; consider higher GOGC.
`runtime.(*mheap).alloc` calls dominating	The allocator is overloaded; pool may help if objects are big enough.

A healthy pool has Get self-time but no getSlow time. The opposite is the warning sign.

Trace view¶

go tool trace shows GC events on the timeline. Overlay them on your latency histogram: if every GC line corresponds to a latency spike, the pool drain (or just the GC pause) is contributing. To distinguish:

With GOGC=200, the GC frequency halves. If the latency spikes halve in frequency but not size, the cause is the GC pause itself.
If the latency spikes disappear, the cause is pool drain.

This separation guides the fix: GC pause spikes call for tuning GC (and the pool is fine); drain spikes call for either keeping the pool warmer or moving off sync.Pool entirely.

Identifying Anti-Pattern Usage in Production¶

The "always misses" pool¶

var pool = sync.Pool{New: func() any { return new(big) }}

func handler(w http.ResponseWriter, r *http.Request) {
    b := pool.Get().(*big)
    // ... use b in async goroutine, never Put ...
    go process(b)
}

Signature. New count grows linearly with request count; Put is never called. The "pool" is just a constructor with extra steps.

Fix. Either ensure Put runs in process (with the lifetime tradeoff that implies), or remove the pool.

The "tiny object" pool¶

type Pair struct{ K, V int }
var pool = sync.Pool{New: func() any { return new(Pair) }}

Signature. Benchmarks with and without the pool show no difference, or the pool is slower.

Fix. Remove. Pass by value.

The "wrong granularity" pool¶

var bufferPool sync.Pool

func processRow(row []byte) {
    b := bufferPool.Get().(*bytes.Buffer)
    defer bufferPool.Put(b)
    // ... 5 µs of work ...
}

Signature. Hot path; each call is short; the pool overhead dominates.

Fix. Acquire the buffer once at the outer batch level, reuse across rows:

func processBatch(rows [][]byte) {
    b := bufferPool.Get().(*bytes.Buffer)
    defer bufferPool.Put(b)
    for _, row := range rows {
        b.Reset()
        // ... process row using b ...
    }
}

The "wrong size class" pool¶

var pool = sync.Pool{New: func() any { return make([]byte, 0, 1024) }}

But the real workload needs 64 KB buffers. Every Get returns a 1 KB buffer that gets append-grown to 64 KB and then Put back. Now the pool is full of mixed-size buffers.

Signature. Memory grows quickly even with the pool; bytes.makeSlice shows up large in heap profile.

Fix. Match the New capacity to typical usage. Or cap the retained size in Put (see Scenario 3 in optimize.md).

Quantifying Pool Impact¶

To measure the impact of removing or modifying a pool:

Baseline: record runtime/metrics:
/gc/heap/allocs:bytes (cumulative)
/gc/heap/objects:objects (current)
/gc/cycles/total:gc-cycles
/sched/pauses/total/gc:seconds over a fixed window (60-300 s of load).
Treatment: apply the change (add pool, remove pool, change cap, change GOGC).
Compare: the deltas of allocs:bytes and cycles:gc-cycles over the same window.

A pool removal that increases allocs:bytes by 30% but reduces request p99 latency by 5% is a win (the pool was costing more in overhead than it saved). The opposite is a loss.

Synthetic benchmark scaffolding¶

func BenchmarkPooled(b *testing.B) {
    b.ReportAllocs()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            buf := pool.Get().(*bytes.Buffer)
            buf.Reset()
            buf.WriteString("payload")
            pool.Put(buf)
        }
    })
}

func BenchmarkUnpooled(b *testing.B) {
    b.ReportAllocs()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            buf := &bytes.Buffer{}
            buf.WriteString("payload")
            _ = buf
        }
    })
}

Run with go test -bench=. -benchmem -cpu=1,4,8,16. Vary -cpu because pool behavior changes with GOMAXPROCS — at GOMAXPROCS=1 there is no stealing.

Custom Pool Variants¶

Bounded LIFO pool¶

When you need a hard upper bound on retained memory and sync.Pool's "may be removed at any time" semantics are unacceptable:

type Bounded struct {
    mu   sync.Mutex
    pool []*bytes.Buffer
    cap  int
}

func (p *Bounded) Get() *bytes.Buffer {
    p.mu.Lock()
    if n := len(p.pool); n > 0 {
        b := p.pool[n-1]
        p.pool = p.pool[:n-1]
        p.mu.Unlock()
        return b
    }
    p.mu.Unlock()
    return &bytes.Buffer{}
}

func (p *Bounded) Put(b *bytes.Buffer) {
    b.Reset()
    p.mu.Lock()
    if len(p.pool) < p.cap {
        p.pool = append(p.pool, b)
    }
    p.mu.Unlock()
}

Tradeoffs vs sync.Pool:

Aspect	Bounded	`sync.Pool`
Throughput (no contention)	~150 ns/op	~7 ns/op
Throughput (16-way contention)	~1.2 µs/op	~25 ns/op
Max retained	`cap` * sizeof(*Buffer)	unbounded
GC interaction	None	Drains on every GC
Pre-warming behavior	Trivial	Hidden behind per-P state

Use the bounded variant when predictability is more important than throughput — e.g., memory-constrained embedded targets, or systems with strict SLOs on RSS.

Sharded mutex pool¶

Closer to sync.Pool's throughput but still GC-immune:

type Sharded struct {
    shards [256]struct {
        mu   sync.Mutex
        list []*bytes.Buffer
        _    [56]byte // pad against false sharing
    }
}

func (p *Sharded) Get() *bytes.Buffer {
    i := fastrand() & 0xff
    s := &p.shards[i]
    s.mu.Lock()
    n := len(s.list)
    if n > 0 {
        b := s.list[n-1]
        s.list = s.list[:n-1]
        s.mu.Unlock()
        return b
    }
    s.mu.Unlock()
    return &bytes.Buffer{}
}

This is the basic shape of bytebufferpool (without size-class tracking).

Per-P pool with explicit cleanup¶

If you want sync.Pool's per-P speed but not its GC drain, you can roll your own with runtime_procPin exported via //go:linkname. This is what some HFT libraries do. It is not portable across Go versions because the runtime hook is unstable; expect breakage on every release.

import _ "unsafe" // for go:linkname

//go:linkname runtime_procPin sync.runtime_procPin
func runtime_procPin() int

//go:linkname runtime_procUnpin sync.runtime_procUnpin
func runtime_procUnpin()

Use with extreme caution. The maintenance burden is high.

Comparing to fastcache and bytebufferpool¶

`valyala/bytebufferpool`¶

A drop-in sync.Pool for *bytes.Buffer, with size-class tracking. The library remembers, per-pool, the typical size of buffers Put back, and uses that to inform when to Put (large buffers are dropped instead of pooled, preventing the long-tail problem from optimize.md scenario 3).

Code shape:

import "github.com/valyala/bytebufferpool"

func handler() {
    b := bytebufferpool.Get()
    defer bytebufferpool.Put(b)
    // ...
}

Strengths: - Same fast-path performance as sync.Pool (it is a sync.Pool underneath). - Built-in size-class adaptation. - Drop-in.

Weakness: Specific to *ByteBuffer (their own type, byte-slice-backed). Not generic.

`VictoriaMetrics/fastcache`¶

A fixed-size sharded LRU cache, not a pool. It is the right tool when you have key-based lookup; the wrong tool when you have anonymous scratch objects. Often used alongside sync.Pool: fastcache for cache hits, sync.Pool for scratch buffers used in cache misses.

Hand-rolled freelist (linked list)¶

type freelist struct {
    head atomic.Pointer[node]
}

type node struct {
    next *node
    buf  *bytes.Buffer
}

CAS-based push and pop. Pre-Go-1.13, this was sometimes competitive with sync.Pool. Today it is uniformly slower because of the per-CAS contention vs sync.Pool's per-P locality. Useful only as an exercise.

Comparison table¶

Aspect	`sync.Pool`	`bytebufferpool`	`fastcache`	Hand-rolled CAS list
Per-P fast path	yes	yes (same impl)	no	no
GC drain	yes	yes	no	no
Key-based lookup	no	no	yes	no
Size adaptation	no	yes	no	no
Bounded	no	semi-bounded	yes	no
Throughput rank	1	1	4	3

NUMA and Per-Socket Considerations¶

sync.Pool is per-P, not per-socket. On a 2-socket server with 32 cores per socket (64 Ps total), there is no awareness of which Ps share an LLC. Stealing across sockets is therefore possible and expensive: cross-socket cache line traffic is 100+ ns vs ~20 ns intra-socket.

In practice this is rarely a problem because:

The Go scheduler tries to keep goroutines on a stable P.
Stealing happens only when the local P's pool is empty.
The cost is amortized — one cross-socket steal supplies an object that may be reused many times locally.

If you can prove cross-socket stealing is hurting you (e.g., perf stat -e cache-misses shows huge LLC miss rate concentrated in pool code), you have two options:

Pin worker goroutines to a single socket with runtime.LockOSThread plus taskset/cpuset cgroup constraints.
Use one pool per socket, with workers selecting based on runtime/internal/cpu (or via linkname-imported runtime.getg()).

Option 1 is portable; option 2 requires runtime hackery and is fragile.

Interaction with the Runtime Scheduler¶

Preemption during `pin`¶

runtime_procPin increments m.locks. While m.locks > 0, the scheduler will not preempt this goroutine and will not run scavenging on this M. This means:

pin cannot be called with the world stopped.
pin cannot recurse safely if the inner code can yield.
The pool's fast path is bounded constant time — no scheduler delays.

If a goroutine is pinned for too long, it can starve other goroutines on the same M. sync.Pool pins only for the duration of Get/Put, which is nanoseconds — well below any reasonable starvation threshold. User code that pins should follow the same rule.

`pinSlow` race with GC¶

pinSlow (the cold path of pin) calls poolRaceAddr and runtime_LoadAcquintptr to atomically resize the local array. It races with poolCleanup, but poolCleanup runs with the world stopped, so the race is resolved by mutual exclusion at the runtime level. pinSlow always observes a fully-formed local/victim pair.

GOMAXPROCS changes¶

runtime.GOMAXPROCS(n) triggers re-creation of the runtime's P list. The next pool pin after the change calls pinSlow and re-allocates a local array of size n. Any objects that were in the old local[i] for i >= n are lost — they go to GC on the next cycle.

Practical implication: do not call runtime.GOMAXPROCS after startup if you care about pool warmth.

Pool Sizing at Scale¶

How many objects does a pool steady-state hold?

Empirically, after warm-up, a sync.Pool holds approximately:

N_objects ≈ (peak_in_flight_objects / GOMAXPROCS) * GOMAXPROCS
        =  peak_in_flight_objects

In other words: across all Ps, the pool's total inventory is roughly the high-water mark of concurrent uses since the last two GCs.

If your service processes 1000 concurrent requests, each holding one pooled buffer, the pool holds ~1000 buffers. If each buffer is 64 KB, that is 64 MB of pool memory — substantial.

To estimate before deploying:

pool_memory ≈ peak_concurrency * sizeof(pooled_object) * 2 (for victim cache)

The 2× is conservative: at any moment, the victim cache may hold the previous generation while the local cache holds the current generation.

Observability: Metrics You Can Export¶

sync.Pool exposes no metrics. To get visibility, wrap it:

type InstrumentedPool struct {
    inner    sync.Pool
    gets     atomic.Uint64
    puts     atomic.Uint64
    news     atomic.Uint64
}

func (p *InstrumentedPool) New() any {
    p.news.Add(1)
    return /* construct */
}

func (p *InstrumentedPool) Get() any {
    p.gets.Add(1)
    return p.inner.Get()
}

func (p *InstrumentedPool) Put(x any) {
    p.puts.Add(1)
    p.inner.Put(x)
}

Export via Prometheus:

my_pool_gets_total
my_pool_puts_total
my_pool_news_total

Useful derived metrics:

Miss rate: news / gets. Should be < 5% in steady state.
Imbalance: (gets - puts) / gets. Should be near zero; a positive value indicates leaked references.
Churn rate per GC: delta(news) / delta(gc_cycles). High value means GC is draining the pool faster than it fills.

These three numbers alone are enough to diagnose 90% of pool issues in production.

Capacity Caps and the Long-Tail Problem¶

Without a cap, a single oversized request can wedge a giant buffer into the pool. The buffer survives 1-2 GC cycles before being released; meanwhile, every subsequent Get returns the giant buffer, even for tiny payloads.

The cap pattern from optimize.md:

const maxRetain = 64 * 1024

func putBuf(b *bytes.Buffer) {
    if b.Cap() > maxRetain {
        return // let GC reclaim
    }
    b.Reset()
    pool.Put(b)
}

For finer control, multiple pools of different size classes:

var (
    poolSmall  = sync.Pool{New: func() any { return bytes.NewBuffer(make([]byte, 0, 1024)) }}
    poolMedium = sync.Pool{New: func() any { return bytes.NewBuffer(make([]byte, 0, 16*1024)) }}
    poolLarge  = sync.Pool{New: func() any { return bytes.NewBuffer(make([]byte, 0, 256*1024)) }}
)

func getBuf(estimatedSize int) *bytes.Buffer {
    switch {
    case estimatedSize < 1024:
        return poolSmall.Get().(*bytes.Buffer)
    case estimatedSize < 16*1024:
        return poolMedium.Get().(*bytes.Buffer)
    default:
        return poolLarge.Get().(*bytes.Buffer)
    }
}

Add a corresponding putBuf that consults b.Cap() to choose which pool to return to. This is what HTTP/2 framers and many high-end servers do.

Operational Playbook¶

Symptom: latency p99 spikes correlated with GC¶

Compare GC frequency before and after raising GOGC (200 or 300).
If spikes shrink but persist: the GC pause itself is the cause. Tune heap target or use runtime/debug.SetMemoryLimit.
If spikes go away: pool drain was the cause. Either keep raising GOGC or pre-warm the pool harder.

Symptom: rising memory under steady-state load¶

Check pool miss rate via your wrapper metrics.
If miss rate is healthy but memory still rises: a pool object is holding onto unbounded state (e.g., a buffer that grew but never shrinks). Add a Cap() check in Put.
If miss rate is high and memory rises: somewhere code is Geting without Puting — leaked references. Audit defer statements.

Symptom: CPU dominated by `getSlow`¶

Verify GOMAXPROCS matches the actual core count.
Check the Put/Get ratio. If Put < Get, the workload is consuming faster than producing — pool cannot help.
Consider whether pooling is actually the right answer; for very short-lived objects, removing the pool may be faster.

Symptom: pool works in dev but not in prod¶

Compare GOMAXPROCS between environments. A 4-core dev box behaves very differently from a 64-core prod box for steal-heavy workloads.
Check GC frequency — prod may GC much more often due to higher allocation pressure elsewhere.
Check whether prod has GOGC overridden in deployment config.

Cheat Sheet¶

Situation	Action
Adding a pool	First benchmark without it; only add if benchmarks improve >10%.
Pool overhead too high	Object is too small; remove pool.
Pool misses too high	Either pre-warm or check for missing `Put`.
Pool memory unbounded	Add `Cap()` check or size-class shards.
Pool drains too often	Raise `GOGC` or use a non-GC alternative.
Cross-socket stealing	Pin workers; or one pool per socket.
Need bounded retention	Use `Bounded` mutex pool or `bytebufferpool`.
Need key lookup	Wrong tool; use `fastcache` or a real cache.