Skip to content

sync.Pool Internals — Senior

← Back

This page is for engineers who already use sync.Pool competently and now want to understand the machinery: why the data structure looks the way it does, what changed in 2017, how the lock-free dequeue functions, how false sharing is avoided, and where the design hits walls. The treatment is mechanical: we follow the bytes through pool.go, poolqueue.go, and the runtime hook that the garbage collector calls.

The reader is expected to be comfortable with:

  • sync/atomic and the memory model promises Go makes (sequential consistency for atomics, happens-before via channels/mutexes, no acquire/release primitives exposed).
  • The Go scheduler's P (processor) abstraction and how a goroutine is bound to a P during a run.
  • The mechanics of x86 cache coherence (MESI), false sharing, the adjacent-cache-line prefetcher, and CAS as a primitive.
  • Chase-Lev work-stealing deques at a conceptual level.

The version numbers and line locations in this document correspond to Go 1.22 source. Where the implementation changed materially across versions, the change is called out.

1. The pre-1.13 design and its ceiling

sync.Pool was added in Go 1.3 (2014). The original implementation was straightforward and you can reconstruct it from memory in a few minutes:

// Conceptual 2014-era sync.Pool. Not actual code.
type Pool struct {
    mu      sync.Mutex
    items   []any
    New     func() any
}

func (p *Pool) Get() any {
    p.mu.Lock()
    if n := len(p.items); n > 0 {
        x := p.items[n-1]
        p.items = p.items[:n-1]
        p.mu.Unlock()
        return x
    }
    p.mu.Unlock()
    if p.New != nil {
        return p.New()
    }
    return nil
}

func (p *Pool) Put(x any) {
    p.mu.Lock()
    p.items = append(p.items, x)
    p.mu.Unlock()
}

Two problems became visible as multi-core machines went from 8 to 24 to 64 hardware threads:

  1. The mutex serialized every Get and Put. A pool of one-byte buffers behind a mutex is slower than new([1]byte) on a hot path.
  2. The garbage collector ran poolCleanup and zeroed the slice. Every two minutes you got a thundering herd of New() calls. Allocation rate spiked at GC; latency tails grew.

Go 1.5 added a per-P local cache layered in front of the mutex. The idea was correct but the implementation still had a global locked pool as the second level. Stealing was naive: a goroutine that missed its local cache walked all P-local caches under a single mutex.

By 2017, the problem on Google's production workload (the original motivation for sync.Pool) was the same as in 2014: under contention, the global lock dominated, and the post-GC allocation cliff caused tail-latency spikes. CL 96459 was the response.

2. The 2017 redesign — CL 96459

Dmitry Vyukov landed CL 96459 in January 2018; it shipped in Go 1.13 after polishing. The headline changes:

  • Replace the per-P local pool slice with a single-producer multi-consumer lock-free dequeue (poolDequeue + poolChain).
  • Eliminate the global mutex. Instead, when a goroutine misses its local dequeue, it steals from another P's dequeue using the multi-consumer side of that P's structure.
  • Introduce a victim cache. At GC, the main local pools are moved to the victim slot rather than discarded. The next GC discards the victim. Objects survive at least one GC after their last use.
  • Pad poolLocal to 128 bytes to avoid false sharing across P slots and to thwart the adjacent-cache-line prefetcher on Intel.

The combined effect:

  • Get/Put on the fast path are a handful of atomic loads and an atomic store with no contention.
  • Steal contention is bounded — a stealer fights at most the owner P and at most one other stealer for the tail.
  • The post-GC allocation cliff is replaced by a gentler ramp because objects live across one GC.

The rest of this document walks the implementation.

3. Top-level Pool struct

From src/sync/pool.go:

type Pool struct {
    noCopy noCopy

    local     unsafe.Pointer // local fixed-size per-P pool, actual type is [P]poolLocal
    localSize uintptr        // size of the local array

    victim     unsafe.Pointer // local from previous cycle
    victimSize uintptr        // size of victims array

    New func() any
}

The noCopy field is a vet hint — sync.Pool must not be copied after first use, exactly like sync.Mutex. go vet flags var p2 = p1.

local is a pointer to a heap-allocated [GOMAXPROCS]poolLocal array. It is an unsafe.Pointer rather than []poolLocal for two reasons:

  1. Pool does not import unsafe in its type signature elsewhere, but the array needs to live in raw memory so the indexing can use runtime_procPin's P id directly.
  2. The size can change if GOMAXPROCS is reset between cycles. localSize records the size at allocation time.

victim is the previous cycle's local. On every GC, the runtime calls poolCleanup, which performs p.victim = p.local; p.local = nil; p.victimSize = p.localSize; p.localSize = 0. The next GC drops the victim entirely.

New is the zero-allocation fallback. If nil and the pool is empty, Get returns nil.

There is one more piece of global state at the bottom of pool.go:

var (
    allPoolsMu Mutex

    // allPools is the set of pools that have non-empty primary
    // caches. Protected by either 1) allPoolsMu and pinning or
    // 2) STW.
    allPools []*Pool

    // oldPools is the set of pools that may have non-empty
    // victim caches. Protected by STW.
    oldPools []*Pool
)

allPools is a registry of every pool that has ever had something put into it on the current cycle. The GC's poolCleanup walks this slice to perform the main→victim shift. This is the only mutex in the file and it is taken only on the first Put per pool per cycle, plus by the GC during STW.

4. poolLocal — the per-P slot

type poolLocalInternal struct {
    private any       // Can be used only by the respective P.
    shared  poolChain // Local P can pushHead/popHead; any P can popTail.
}

type poolLocal struct {
    poolLocalInternal

    // Prevents false sharing on widespread platforms with
    // 128 mod (cache line size) == 0 .
    pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
}

Two fields:

  • private — a single object slot. Put writes into private if it is nil. Get reads it first. This makes the no-contention case a single field assignment with no atomic at all (we hold the P pin, no other goroutine can run on this P).
  • shared — a poolChain, which is a linked list of poolDequeue ring buffers. The owner P pushes and pops at the head; any P (including this one) can pop from the tail.

The pad field is the false-sharing guard. Note the calculation: 128 - unsafe.Sizeof(poolLocalInternal{})%128. This rounds the total poolLocal size up to a multiple of 128 bytes. Why 128 and not 64?

  • A typical x86 cache line is 64 bytes.
  • Intel processors since Sandy Bridge use an adjacent cache line prefetcher that, on a miss, also fetches the sibling 64-byte line, treating pairs of cache lines as a coherence-relevant unit. False sharing can occur across the 128-byte pair.
  • ARM and POWER also tend toward 64-byte lines, but some POWER systems use 128.

A 128-byte pad covers both cases. The cost is wasted memory: with default poolLocalInternal size of 24 bytes (any is two words = 16 bytes, poolChain head and tail pointers another 16) you actually pay closer to 128 bytes per P. With GOMAXPROCS=64, that is 8 KiB per pool, multiplied by however many pools the program uses. For a server with a dozen pools, this is negligible.

There is one important property of this padding: it must be at the end of the struct, and the struct must be heap-allocated as part of an array where each element is exactly sizeof(poolLocal) bytes apart. Go satisfies this automatically because [N]poolLocal lays elements contiguously.

5. Get and Put — the fast path

func (p *Pool) Get() any {
    if race.Enabled {
        race.Disable()
    }
    l, pid := p.pin()
    x := l.private
    l.private = nil
    if x == nil {
        // Try to pop the head of the local shard. We prefer
        // the head over the tail for temporal locality of
        // reuse.
        x, _ = l.shared.popHead()
        if x == nil {
            x = p.getSlow(pid)
        }
    }
    runtime_procUnpin()
    if race.Enabled {
        race.Enable()
        if x != nil {
            race.Acquire(poolRaceAddr(x))
        }
    }
    if x == nil && p.New != nil {
        x = p.New()
    }
    return x
}

The fast path of Get:

  1. p.pin() returns the local poolLocal for the current P and pins the goroutine to that P. Pinning disables preemption so the goroutine cannot migrate to a different P between reading l.private and writing l.private = nil.
  2. Read l.private. If non-nil, this is the answer; clear and return.
  3. Otherwise call l.shared.popHead(). This is the lock-free dequeue's owner side. Owner has exclusive write to head, but stealers can race on tail, so even popHead needs an atomic CAS.
  4. If still empty, getSlow(pid) walks the other P's caches, stealing from their tails.
  5. Unpin, then if everything is empty, call New(). Critically, New() is called outside the pin region — New() may allocate, allocate may trigger GC, GC may take time. Holding a P pin during GC would be bad.
func (p *Pool) Put(x any) {
    if x == nil {
        return
    }
    if race.Enabled {
        if fastrandn(4) == 0 {
            // Randomly drop x on floor.
            return
        }
        race.ReleaseMerge(poolRaceAddr(x))
        race.Disable()
    }
    l, _ := p.pin()
    if l.private == nil {
        l.private = x
    } else {
        l.shared.pushHead(x)
    }
    runtime_procUnpin()
    if race.Enabled {
        race.Enable()
    }
}

Put is symmetric. The race-detector path randomly drops objects on the floor to surface ordering bugs that depend on pool reuse. Outside race builds it is just: pin, write private if empty, else push to the head of the shared chain.

A subtlety: Put does not check pid. It does not need to. Whatever P the goroutine is currently pinned to becomes the owner; this can be a different P than the one that allocated the object. Pools do not track ownership at the object level.

6. p.pin() — finding the local poolLocal

func (p *Pool) pin() (*poolLocal, int) {
    pid := runtime_procPin()
    // In pinSlow we store to local and then to localSize,
    // here we load in opposite order. Since we've disabled
    // preemption, GC cannot happen in between. Thus here we
    // must observe local at least as large as localSize.
    s := runtime_LoadAcquintptr(&p.localSize)
    l := p.local
    if uintptr(pid) < s {
        return indexLocal(l, pid), pid
    }
    return p.pinSlow()
}

runtime_procPin is implemented in the runtime as:

//go:nosplit
func procPin() int {
    gp := getg()
    mp := gp.m
    mp.locks++
    return int(mp.p.ptr().id)
}

It increments m.locks. While m.locks > 0, the scheduler will not preempt this goroutine off its M, and the M will not change P. The goroutine therefore observes a stable P id for the duration.

The interesting load ordering in pin:

s := runtime_LoadAcquintptr(&p.localSize)
l := p.local

runtime_LoadAcquintptr is an acquire-load wrapper around atomic.LoadUintptr (Go's atomics are sequentially consistent, so "acquire" is a no-op on the model side but the runtime keeps the helper for clarity). It ensures that if we observe localSize == N, then local already points to an array of at least N entries. This is paired with the store ordering in pinSlow, which writes local first and localSize second.

indexLocal is pointer arithmetic:

func indexLocal(l unsafe.Pointer, i int) *poolLocal {
    lp := unsafe.Pointer(uintptr(l) + uintptr(i)*unsafe.Sizeof(poolLocal{}))
    return (*poolLocal)(lp)
}

This is the equivalent of &local[i] but without bounds checking and without the Go type system getting in the way of unsafe.Pointer.

7. pinSlow — first-time setup

func (p *Pool) pinSlow() (*poolLocal, int) {
    // Retry under the mutex.
    // Can not lock the mutex while pinned.
    runtime_procUnpin()
    allPoolsMu.Lock()
    defer allPoolsMu.Unlock()
    pid := runtime_procPin()
    // poolCleanup won't be called while we are pinned.
    s := p.localSize
    l := p.local
    if uintptr(pid) < s {
        return indexLocal(l, pid), pid
    }
    if p.local == nil {
        allPools = append(allPools, p)
    }
    // If GOMAXPROCS changes between GCs, we re-allocate
    // the array and lose the old one.
    size := runtime.GOMAXPROCS(0)
    local := make([]poolLocal, size)
    atomic.StorePointer(&p.local, unsafe.Pointer(&local[0]))
    runtime_StoreReluintptr(&p.localSize, uintptr(size))
    return &local[pid], pid
}

Read this carefully. The order is:

  1. Unpin (we cannot take a mutex while pinned — mutex contention would deadlock with the no-preemption rule).
  2. Take allPoolsMu.
  3. Re-pin and re-check (another goroutine may have finished setup between our unpin and the mutex acquisition).
  4. If we still have no local, register the pool in allPools (so GC can find it) and allocate the per-P array.
  5. Atomically publish local, then localSize. The acquire-load in pin will see them in this order.

The make([]poolLocal, size) allocates a contiguous backing array. We take &local[0] as the base pointer. The slice header is then discarded (the local variable goes out of scope after the function returns), but the backing array stays alive because p.local keeps a pointer to it.

This is the only mutex in the hot path of sync.Pool, and it is taken exactly once per pool per (cycle, P) tuple.

8. poolChain — a chain of dequeues

type poolChain struct {
    // head is the poolDequeue to push to. This is only
    // accessed by the producer, so doesn't need to be
    // synchronized.
    head *poolChainElt

    // tail is the poolDequeue to popTail from. This is
    // accessed by consumers, so reads and writes must be
    // atomic.
    tail *poolChainElt
}

type poolChainElt struct {
    poolDequeue

    // next and prev link to the adjacent poolChainElts
    // in this poolChain.
    //
    // next is written atomically by the producer and
    // read atomically by the consumer. It only transitions
    // from nil to non-nil.
    //
    // prev is written atomically by the consumer and
    // read atomically by the producer. It only transitions
    // from non-nil to nil.
    next, prev *poolChainElt
}

A poolChain is a doubly linked list of fixed-size poolDequeue ring buffers. The list grows at the head when the current head dequeue fills up; it shrinks at the tail when the tail dequeue is drained by consumers.

Why a chain at all instead of one big dequeue? Because the dequeue is fixed-size at allocation time (it owns a ring buffer slice). If you allocate a small one and overflow, you have to copy to a larger one, but copying is incompatible with lock-free concurrent access. Linking lets you stitch a new dequeue onto the head without disturbing anyone reading from older entries.

The two pointer fields have different visibility:

  • head is owner-only. The owner P is the only writer and the only reader. No synchronization needed.
  • tail is shared. Consumers (stealers) read it to find where to steal from. The owner writes to it when the tail dequeue is fully drained and the chain wants to drop the now-empty element. Atomics required.

Inside each poolChainElt, next and prev are atomic and monotonic:

  • next is set by the producer when allocating a new head; it goes nil → non-nil exactly once.
  • prev is cleared by the consumer when retiring a drained tail; it goes non-nil → nil exactly once.

Monotonicity matters because it bounds the kinds of ABA situations possible — if the field has had only two values ever, then observing it at value V₂ is enough to rule out V₁.

9. poolDequeue — the SPMC ring

// poolDequeue is a lock-free fixed-size single-producer,
// multi-consumer queue. The single producer can both push
// and pop from the head, and consumers can pop from the
// tail.
//
// It has the added feature that it nils out unused slots
// to avoid unnecessary retention of objects. This is
// important for sync.Pool, but not typically a property
// considered in the literature.
type poolDequeue struct {
    // headTail packs together a 32-bit head index and a
    // 32-bit tail index. Both are indexes into vals modulo
    // len(vals)-1.
    //
    // tail = index of oldest data in queue
    // head = index of next slot to fill
    //
    // Slots in the range [tail, head) are owned by
    // consumers. A consumer continues to own a slot
    // outside this range until it nils the slot, at
    // which point ownership passes to the producer.
    //
    // The head index is stored in the most-significant
    // bits so that we can atomically add to it and the
    // overflow is harmless.
    headTail atomic.Uint64

    // vals is a ring buffer of interface{} values stored
    // in this dequeue. The size of this must be a power
    // of 2.
    //
    // vals[i].typ is nil if the slot is empty and
    // non-nil otherwise. A slot is still in use until
    // *both* the tail index has moved beyond it *and*
    // typ has been set to nil. This is set to nil
    // atomically by the consumer and read atomically
    // by the producer.
    vals []eface
}

type eface struct {
    typ, val unsafe.Pointer
}

Three fields, but the cleverness is in headTail.

The packed encoding

headTail is a single uint64 that holds two uint32 indices: head in the upper 32 bits, tail in the lower 32 bits.

const dequeueBits = 32

func (d *poolDequeue) unpack(ptrs uint64) (head, tail uint32) {
    const mask = 1<<dequeueBits - 1
    head = uint32((ptrs >> dequeueBits) & mask)
    tail = uint32(ptrs & mask)
    return
}

func (d *poolDequeue) pack(head, tail uint32) uint64 {
    const mask = 1<<dequeueBits - 1
    return (uint64(head) << dequeueBits) |
        uint64(tail&mask)
}

Why pack? Two reasons:

  1. Atomic dual-update. When the producer's popHead decrements head, and a consumer's popTail increments tail, both need to observe a consistent (head, tail) pair. If head and tail were separate 32-bit atomics, a producer could see tail mid-increment by a consumer and conclude the queue is empty when it isn't, or vice versa. By packing, every read of headTail is atomic across both indices.
  2. Single CAS for empty-check + update. popHead does "is head > tail (non-empty)? if so, decrement head". This is a compound operation. With packing, it becomes a single 64-bit CAS: load headTail, compute new value with head decremented, CAS expecting the old value. If a consumer raced on tail, the CAS fails and you retry.

Why head in the upper bits? Two motivations:

  • Comparing head < tail and head == tail are easier to reason about when head is logically "later" than tail and you want naive integer comparisons not to be confused by 32-bit wrap interaction with the other field.
  • The original comment hints at "we can atomically add to it and the overflow is harmless" — this is a stronger claim about pushHead, which used to do an unconditional add to head; on overflow, head wraps within its 32-bit field without spilling into tail because the addition is contained.

Wraparound

Indices grow monotonically as uint32. They are reduced to slot positions via i & (len(vals)-1). Because len(vals) is a power of two, this is a single AND. Once the index wraps around uint32 (at 2^32 ≈ 4 billion), the math still works as long as the difference between head and tail fits in uint32. With dequeues capped at 2^15 entries, this is always true.

Slot ownership protocol

The comment is critical:

A slot is still in use until both the tail index has moved beyond it and typ has been set to nil.

This is a two-phase release. A consumer's popTail:

  1. Reads headTail, identifies the tail slot.
  2. CAS-increments tail (publishing "the tail moved past this slot").
  3. Reads the value from vals[tail-old], then writes nil to vals[tail-old].typ, signaling the producer that the slot is fully clear.

The producer's pushHead checks vals[head].typ — if it is non-nil, a previous consumer has not finished step 3, the dequeue is treated as full, and push fails. This prevents the producer from racing into a slot a consumer hasn't released.

eface structure

type eface struct {
    typ, val unsafe.Pointer
}

This mirrors Go's runtime representation of interface{}. An empty interface is a pair: type pointer + value pointer (or pointer-to-value for non-pointer types). Storing as raw eface lets the dequeue use atomic pointer operations on typ to signal slot occupancy without needing to do anything special with the type system.

10. pushHead — producer-side enqueue

// pushHead adds val at the head of the queue. It returns
// false if the queue is full. It must only be called by
// a single producer.
func (d *poolDequeue) pushHead(val any) bool {
    ptrs := d.headTail.Load()
    head, tail := d.unpack(ptrs)
    if (tail+uint32(len(d.vals)))&(1<<dequeueBits-1) == head {
        // Queue is full.
        return false
    }
    slot := &d.vals[head&uint32(len(d.vals)-1)]

    // Check if the head slot has been released by popTail.
    typ := atomic.LoadPointer(&slot.typ)
    if typ != nil {
        // Another goroutine is still cleaning up the
        // tail, so the queue is actually still full.
        return false
    }

    // The head slot is free, so we own it.
    if val == nil {
        val = dequeueNil(nil)
    }
    *(*any)(unsafe.Pointer(slot)) = val

    // Increment head. This passes ownership of slot
    // to popTail and acts as a store barrier for
    // writing the slot.
    d.headTail.Add(1 << dequeueBits)
    return true
}

Walk through it:

  1. Load headTail once. We get a snapshot of head and tail.
  2. Compute fullness. The queue is full when (tail + cap) % 2^32 == head. The mask 1<<dequeueBits-1 is 0xFFFFFFFF, isolating the 32-bit field after the add.
  3. Index to the slot using head & (len(vals)-1). Note len(vals) must be a power of two.
  4. Check slot.typ. This handles the case described in the previous section: a consumer may have incremented tail but not yet nil'd the slot. If typ is non-nil, the slot is still owned by the consumer, and we cannot safely write to it. Return false; the chain code will respond by allocating a new dequeue and pushing into the head of the chain.
  5. Special-case nil. dequeueNil(nil) is a non-nil placeholder used to distinguish "the user explicitly put nil" from "this slot has no data." Without it, you couldn't tell. In practice this is rarely material because most callers don't Put(nil), but the API allows it.
  6. Write the value. The assignment via *(*any)(unsafe.Pointer(slot)) = val is not atomic. It is a two-word store (typ and val). It is safe here because:
  7. Only the producer writes to this slot.
  8. Consumers only read this slot after observing an incremented head, which is the next step.
  9. d.headTail.Add(1 << dequeueBits) atomically increments the head field (because head is in the upper 32 bits). This atomic add acts as a release fence: the slot write at step 6 must be visible before the head update is visible.

The single producer assumption matters: pushHead has no CAS retry loop. The producer doesn't fight itself.

Why is the slot store not atomic?

Two words, not aligned together (interface representation is 16 bytes on 64-bit; the two pointers are at offsets 0 and 8). A 16-byte store is not atomic on most architectures. Yet there is no atomic.StorePointer here, just a raw assignment.

The store is safe because the visibility of the store to other goroutines is gated by the subsequent d.headTail.Add. The Go memory model guarantees that the atomic operation acts as a barrier: subsequent loads of headTail by a consumer will not be reordered before they observe the new head; and once they observe the new head, the producer's prior non-atomic stores are visible.

This is safe but subtle, and exactly the kind of code that needs to be written by people who know what they are doing. The comment "increment head acts as a store barrier for writing the slot" is doing serious work.

11. popHead — owner-side dequeue from the head

// popHead removes and returns the element at the head
// of the queue. It returns false if the queue is empty.
// It must only be called by a single producer.
func (d *poolDequeue) popHead() (any, bool) {
    var slot *eface
    for {
        ptrs := d.headTail.Load()
        head, tail := d.unpack(ptrs)
        if tail == head {
            // Queue is empty.
            return nil, false
        }

        // Confirm tail and decrement head. We do this
        // before reading the value to take back ownership
        // of this slot.
        head--
        ptrs2 := d.pack(head, tail)
        if d.headTail.CompareAndSwap(ptrs, ptrs2) {
            // We successfully took back slot.
            slot = &d.vals[head&uint32(len(d.vals)-1)]
            break
        }
    }

    val := *(*any)(unsafe.Pointer(slot))
    if val == dequeueNil(nil) {
        val = nil
    }
    // Zero the slot. Unlike popTail, this isn't racing
    // with pushHead, so we don't need to be careful here.
    *slot = eface{}
    return val, true
}

The structure is a CAS loop, even though there is a single producer. Why?

Because a consumer might race on tail. Suppose head == tail+1 (one element). The producer wants to take it via popHead by decrementing head to tail. Simultaneously, a stealer wants to take the same element via popTail by incrementing tail to head. The CAS resolves the race: whichever lands first wins, the other retries and observes an empty queue.

Note the asymmetry from pushHead: pushHead just adds and never CAS-loops. popHead CAS-loops because it might lose to a tail thief. The CAS expects the old headTail value and writes the new packed value with head decremented and tail unchanged.

After CAS success, the producer owns the slot. Reading and zeroing are unsynchronized — no one else can be in the slot at this point.

The if val == dequeueNil(nil) unwraps the placeholder used for explicit-nil puts. The slot is then zeroed completely (*slot = eface{}) — both fields. This releases any pointer reference, so the object becomes GC-collectible if no one else holds it.

12. popTail — consumer-side dequeue from the tail

// popTail removes and returns the element at the tail
// of the queue. It returns false if the queue is empty.
// It may be called by any number of consumers.
func (d *poolDequeue) popTail() (any, bool) {
    var slot *eface
    for {
        ptrs := d.headTail.Load()
        head, tail := d.unpack(ptrs)
        if tail == head {
            // Queue is empty.
            return nil, false
        }

        // Confirm head and tail (for our speculative check
        // above) and increment tail. If this succeeds,
        // then we own the slot at tail.
        ptrs2 := d.pack(head, tail+1)
        if d.headTail.CompareAndSwap(ptrs, ptrs2) {
            // Success.
            slot = &d.vals[tail&uint32(len(d.vals)-1)]
            break
        }
    }

    // We now own slot.
    val := *(*any)(unsafe.Pointer(slot))
    if val == dequeueNil(nil) {
        val = nil
    }

    // Tell pushHead that we're done with this slot.
    // Zeroing the slot is also important so we don't
    // leave behind references that could keep this
    // object live longer than necessary.
    //
    // We write to val first and then publish that we're
    // done with this slot by atomically writing to typ.
    slot.val = nil
    atomic.StorePointer(&slot.typ, nil)
    // At this point pushHead owns the slot.

    return val, true
}

The CAS loop here handles two contention sources:

  1. Other consumers fighting for the same tail slot.
  2. The producer's popHead taking the only element.

If the CAS fails, we re-read headTail and try again — either there is a new element, or the queue is now empty.

The interesting part is the release protocol at the end:

slot.val = nil
atomic.StorePointer(&slot.typ, nil)

The order is:

  1. Clear val (the data pointer) non-atomically. This is safe because we own the slot exclusively at this moment (we won the CAS, the producer's pushHead won't touch this slot until typ is nil).
  2. Atomically clear typ. This is the publish: pushHead checks slot.typ and won't reuse the slot until it observes nil.

If you reversed these two lines, you would have a race: pushHead could see typ == nil and start writing a new value into the slot, while popTail's slot.val = nil still executes, clobbering the producer's write.

The store-store ordering is guaranteed by the atomic StorePointer, which on Go's memory model is sequentially consistent and serves as a release.

Notice also that popTail does not need to CAS on slot.typ — only one consumer can be at this slot (we won the CAS on headTail), and the producer is waiting on typ going to nil. A plain atomic store suffices.

13. The chain operations

poolChain glues multiple poolDequeue instances into something that can grow.

func (c *poolChain) pushHead(val any) {
    d := c.head
    if d == nil {
        // Initialize the chain.
        const initSize = 8 // Must be a power of 2
        d = new(poolChainElt)
        d.vals = make([]eface, initSize)
        c.head = d
        storePoolChainElt(&c.tail, d)
    }

    if d.pushHead(val) {
        return
    }

    // The current dequeue is full. Allocate a new one
    // of twice the size.
    newSize := len(d.vals) * 2
    if newSize >= dequeueLimit {
        // Can't make it any bigger.
        newSize = dequeueLimit
    }

    d2 := &poolChainElt{prev: d}
    d2.vals = make([]eface, newSize)
    c.head = d2
    storePoolChainElt(&d.next, d2)
    d2.pushHead(val)
}

Three branches:

  1. Empty chain. Initialize with an 8-slot dequeue at both head and tail.
  2. Head has space. Push to it.
  3. Head full. Allocate a new dequeue of twice the size, link it as the new head, push into it.

The geometric growth (8, 16, 32, 64, ...) is capped at dequeueLimit = 1 << 15 = 32768 slots per dequeue. Why cap? Because the chain is meant to absorb bursts; if a single P legitimately needs 32K slots, you have other problems (memory footprint of the pool becomes large).

The size doubling means: after k full dequeues, the chain holds 8 + 16 + 32 + ... + 8·2^(k-1) = 8·(2^k - 1) elements. Stealing happens at the tail, the smallest dequeue. As objects are consumed, the chain shrinks from the tail.

func (c *poolChain) popHead() (any, bool) {
    d := c.head
    for d != nil {
        if val, ok := d.popHead(); ok {
            return val, true
        }
        // There may still be unconsumed elements in
        // the previous dequeue, so try backing up.
        d = loadPoolChainElt(&d.prev)
    }
    return nil, false
}

popHead from the chain pops from the head dequeue. If empty, it tries the previous dequeue (toward the tail). This is the "drain the older buffer first when refilling from the new one fails" case — but wait, it is the producer side, and producers push to head, so why would the previous dequeue have data while the head is empty?

Scenario: pushHead allocated a new head dequeue d2. We pushed an object. The owner pops it (popHead on d2). Now d2 is empty. d1 (prev) may still have objects pushed before d2 was allocated. The chain's popHead walks back through prev to find them.

This is also why prev exists: without it, the head dequeue would have no way to reach older entries that haven't been stolen.

func (c *poolChain) popTail() (any, bool) {
    d := loadPoolChainElt(&c.tail)
    if d == nil {
        return nil, false
    }

    for {
        // It's important that we load the next pointer
        // *before* popping the tail. In general, d may be
        // transiently empty, but if next is non-nil before
        // the pop and the pop fails, then d is permanently
        // empty, which is the only condition under which
        // it's safe to drop d from the chain.
        d2 := loadPoolChainElt(&d.next)

        if val, ok := d.popTail(); ok {
            return val, true
        }

        if d2 == nil {
            // This is the only dequeue. It's empty right
            // now, but could be pushed to in the future.
            return nil, false
        }

        // The tail of the chain has been drained, so move
        // on to the next dequeue. Try to drop it from the
        // chain so the next pop doesn't have to look at
        // the empty dequeue again.
        if atomic.CompareAndSwapPointer(
            (*unsafe.Pointer)(unsafe.Pointer(&c.tail)),
            unsafe.Pointer(d), unsafe.Pointer(d2)) {
            // We won the race. Clear the prev pointer
            // so the garbage collector can collect the
            // empty dequeue and so popHead doesn't back
            // up over it.
            storePoolChainElt(&d2.prev, nil)
        }
        d = d2
    }
}

popTail from the chain is more subtle. It is called by consumers (stealers), and the chain's tail dequeue may have been fully drained. The logic:

  1. Load next before attempting popTail. This ordering matters — see the comment.
  2. Attempt popTail. If successful, return.
  3. If popTail failed and next is nil, the chain is empty.
  4. If popTail failed and next is non-nil, the tail dequeue is permanently empty. CAS the chain's tail pointer forward to next. If we win the CAS, clear prev on the new tail so the old dequeue is unreferenced and GC can collect it.
  5. Loop with d=d2.

The "load next before pop" ordering rules out a race where: producer pushes to d, we observe d as empty, producer allocates d2, producer pushes to d2, we observe next==d2 but d still empty (it was never filled because the push went to d2 after we observed it empty). If next was loaded after pop, we might mistakenly drop d while it has unconsumed values. By loading next first, we know that if next was nil when we observed empty, then d was the only dequeue at that moment; if next becomes non-nil later, then d was already known to be empty by the producer (since it allocated d2 because d was full and then drained).

Actually, look at it again. The careful reasoning: "if next is non-nil before the pop and the pop fails, then d is permanently empty." Because the producer only allocates next after the current dequeue is full and a value pushed; once a value is pushed to next, no more values will be pushed to d; consumers will drain d from tail going up; once the last value is taken, d is permanently empty.

This is the kind of argument that takes ten minutes to convince yourself of and pays off in correctness for the next decade.

14. getSlow — the steal path

func (p *Pool) getSlow(pid int) any {
    // See the comment in pin regarding ordering of the loads.
    size := runtime_LoadAcquintptr(&p.localSize)
    locals := p.local
    // Try to steal one element from other procs.
    for i := 0; i < int(size); i++ {
        l := indexLocal(locals, (pid+i+1)%int(size))
        if x, _ := l.shared.popTail(); x != nil {
            return x
        }
    }

    // Try the victim cache. We do this after attempting to
    // steal from all primary caches because we want objects
    // in the victim cache to age out if at all possible.
    size = atomic.LoadUintptr(&p.victimSize)
    if uintptr(pid) >= size {
        return nil
    }
    locals = p.victim
    l := indexLocal(locals, pid)
    if x := l.private; x != nil {
        l.private = nil
        return x
    }
    for i := 0; i < int(size); i++ {
        l := indexLocal(locals, (pid+i)%int(size))
        if x, _ := l.shared.popTail(); x != nil {
            return x
        }
    }

    // Mark the victim cache as empty for future gets
    // don't bother with it.
    atomic.StoreUintptr(&p.victimSize, 0)

    return nil
}

Two phases:

Phase A — steal from other Ps' main caches.

Walk all other P's poolLocal entries, calling popTail on each shared chain. The walk starts at pid+1 modulo size, so different Ps starting from different points avoid all converging on P 0.

This is O(P) in the worst case. With GOMAXPROCS=64, that is 64 atomic CAS attempts in the worst case. In practice the first steal usually succeeds and you bail early.

Phase B — try the victim cache.

If main caches are all empty, check the victim. First the local victim's private (cheap), then walk all victim chains.

The order is significant: main first, victim second, and at the end, if everything is empty, zero out victimSize so subsequent calls skip the victim walk entirely. This is a small optimization — the victim usually empties out a few seconds after a GC; once it is fully drained, we want subsequent Gets to not waste time on it.

The comment "we want objects in the victim cache to age out if at all possible" is interesting. If we tried the victim first, we'd keep objects alive across multiple GCs by promoting them back into the main cache. By draining main first, we encourage the victim to actually become collectible.

15. poolCleanup — the GC hook

func poolCleanup() {
    // This function is called with the world stopped, at
    // the beginning of a garbage collection. It must not
    // allocate and probably should not call any runtime
    // functions.

    // Because the world is stopped, no pool user can be
    // in a pinned section (in effect, this has all Ps
    // pinned).

    // Drop victim caches from all pools.
    for _, p := range oldPools {
        p.victim = nil
        p.victimSize = 0
    }

    // Move primary cache to victim cache.
    for _, p := range allPools {
        p.victim = p.local
        p.victimSize = p.localSize
        p.local = nil
        p.localSize = 0
    }

    // The pools with non-empty primary caches now have
    // non-empty victim caches and no pools have non-empty
    // primary caches.
    oldPools, allPools = allPools, nil
}

This is the entire GC interaction. It is registered via:

func init() {
    runtime_registerPoolCleanup(poolCleanup)
}

A few notes:

  1. The function runs during STW. No locking needed, no atomics needed.
  2. It must not allocate, must not call most runtime functions. It is the leanest possible pass: swap two pointers per pool.
  3. The semantics: previous-victim becomes garbage. Previous-main becomes new-victim. New-main starts empty.

Two GCs are required to fully evict an object that was put into a pool. The first GC moves it to victim. The second GC drops victim. In between, a stealer can rescue it via getSlow.

This is the "smooth ramp" property: instead of all pooled allocations vanishing at one GC, they vanish over two. If your workload pushes and pulls steadily, the rate of cliffs is halved and the cliffs are softer.

16. Memory ordering in detail

Go's sync/atomic package guarantees sequential consistency for all atomic operations on a single variable. There is no exposed acquire-only or release-only semantics; every atomic is SC.

For sync.Pool, the relevant orderings are:

Producer push then consumer steal

Producer (pushHead):
  W1: slot.val   = ptr
  W2: slot.typ   = typeptr        (non-atomic, two-word store)
  W3: atomic.Add(&headTail, ...)   (atomic, SC)

Consumer (popTail):
  R1: atomic.Load(&headTail)       (atomic, SC)
  CAS: atomic.CompareAndSwap(...)  (atomic, SC)
  R2: read slot.val and slot.typ   (non-atomic)
  W3: slot.val = nil
  W4: atomic.Store(&slot.typ, nil) (atomic, SC)

The producer's W1+W2 must be visible before W3 from the consumer's perspective. Because W3 is sequentially consistent and the consumer's R2 happens after observing W3's effect on headTail, the model guarantees W1 and W2 are visible to R2.

Without SC, you would need explicit release on the producer's W3 and acquire on the consumer's R1. Go's choice to make all atomics SC eliminates this; you just have to think in terms of "atomic A happened before atomic B" and the non-atomic stuff in between rides along.

Consumer release then producer reuse

Consumer (popTail, ending):
  W1: slot.val = nil               (non-atomic)
  W2: atomic.Store(&slot.typ, nil) (atomic, SC)

Producer (pushHead, later):
  R1: atomic.Load(&slot.typ)       (atomic, SC)
  W3: slot.val = newval            (non-atomic)
  W4: slot.typ = newtyp            (non-atomic)
  W5: atomic.Add(&headTail, ...)   (atomic, SC)

The producer reads slot.typ and proceeds only if nil. The consumer's W2 is SC; the producer's R1 observes it. Therefore the consumer's W1 (val=nil) is visible to the producer before W3.

pin/pinSlow publish

pinSlow (setup):
  W1: atomic.Store(&p.local, &local[0])
  W2: runtime_StoreReluintptr(&p.localSize, size)

pin (later):
  R1: runtime_LoadAcquintptr(&p.localSize)
  R2: p.local

This one is not SC in the helper names — StoreReluintptr is release, LoadAcquintptr is acquire. On Go's model with SC atomics this is operationally the same as SC, but the helpers are named for the underlying intent: the order matters.

The producer writes local first, then localSize. The reader observes localSize first, then local. If localSize is observed as N, then local has been published (because release-store on localSize cannot be reordered before the store to local; acquire-load on localSize cannot be reordered after the load of local). Thus indexing into local[0..N-1] is safe.

17. False sharing arithmetic

type poolLocal struct {
    poolLocalInternal

    // Prevents false sharing on widespread platforms with
    // 128 mod (cache line size) == 0 .
    pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
}

Compute sizeof(poolLocalInternal):

  • private any is two pointers = 16 bytes on 64-bit.
  • shared poolChain is two pointers = 16 bytes.

Total = 32 bytes. 32 % 128 = 32. Pad = 128 - 32 = 96. Total poolLocal size = 32 + 96 = 128 bytes.

An [N]poolLocal array thus has 128-byte stride. Each P owns one full 128-byte block; the adjacent block is owned by a different P. No two Ps share a cache line, and on Intel's adjacent-line prefetcher, no two Ps share a prefetch group.

If poolLocalInternal grew to, say, 40 bytes, the pad would be 128 - 40 = 88, stride still 128. Good.

If it grew to 130 bytes, the pad would be 128 - 130%128 = 128 - 2 = 126, stride 130 + 126 = 256. Also good (still a multiple of 128).

The formula works for any internal size. It guarantees the stride is the smallest multiple of 128 ≥ sizeof(internal).

18. Why a chain, not a single growing dequeue

Three reasons:

  1. Lock-free growth. A single dequeue can't grow lock-free. Resizing requires copying, copying requires exclusive access. A chain grows by linking, which is one atomic store.

  2. Memory locality during steady state. When the pool has been stable for a while, the chain has one or two dequeues, both at the maximum size (32K slots). All pushes and pops touch the same cache lines. Stealers hit the tail dequeue, owner hits the head — which in steady state is the same dequeue, but stealers and owner operate on opposite ends, so there is little false sharing within the dequeue either (the ring is large).

  3. Graceful shrinking under drain. As stealers drain old dequeues, those dequeues become unreachable and GC frees them. No copying, no manual deallocation. The chain breathes with the workload.

The cost: more pointer chasing during steals (the chain walk in popTail), and more metadata overhead (each poolChainElt has prev/next/dequeue overhead).

19. Geometric growth and the cap

dequeueLimit = 1 << 15 = 32768 per dequeue. The chain doubles: 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768. After 13 doublings you hit the cap. Total capacity before cap: 65528. After that, every new dequeue is 32768 slots.

In practice, even high-throughput pools rarely hit two full 32K dequeues. If they do, you should profile — you may be pooling something that shouldn't be pooled (e.g., never reused, or held for too long).

20. The victim cache lifecycle of a pooled object

Imagine you Put(x) at time T0. Walk the object's life:

  • T0: Put. x lands in private or in some dequeue in main.
  • T0 to T1: x may be Get-returned by the same goroutine or stolen by another. If so, the cycle restarts.
  • T1: GC. poolCleanup runs. x is in main, so it moves to victim (the entire main → victim swap is by pointer; x itself doesn't move).
  • T1 to T2: Get may rescue x from the victim via getSlow's phase B. If so, returned to caller, cycle restarts. After rescue, x is in caller's hands, not in any pool.
  • T2: GC. poolCleanup runs. victim is dropped. If x was still in victim, the dequeue holding it becomes unreachable (the pool no longer references it; the local poolLocal is freed because main is empty for that pool slot in this view).
  • T2 onward: x is collectible by the next GC mark (or already by this one — poolCleanup runs at GC start, before the mark phase). Mark phase finds x unreachable, sweep frees it.

The effective lifetime is one to two GC cycles. With GOGC=100 and a typical Go server, that is on the order of seconds to minutes.

21. Why victim avoids the post-GC cliff

In the pre-2017 design, every GC dropped every pooled object. Right after GC, every Get missed the pool, called New(), allocated. The next GC after that saw an enormous allocation spike from the post-cliff catch-up. Tail latencies suffered.

With the victim cache, half the pool survives each GC. A Get immediately after GC hits the victim. New() is called only if both main and victim are empty for this P. The allocation rate has a less sharp boundary, the mark/sweep work is more uniform, the next GC is on a more predictable schedule.

You can observe this in production: pprof allocation profiles before and after the 1.13 update showed visibly smoother allocation rates around GC for pool-heavy workloads.

22. Object reachability — pool doesn't defeat GC

A subtle point that gets people confused: pooled objects are not GC-rooted while in the pool. If the pool is the only reference, GC will eventually collect them.

How? Because poolCleanup is a write barrier on the pool's references. At GC, main → victim. At the next GC, victim → nil. After two GCs, the pool no longer holds the reference. If no one else does, the object is collectible.

This is essential. If pools were strong-rooted, pooling a million 4KB buffers would permanently consume 4 GB of heap. Instead, pooling acts as a hint: "keep these around for a while in case I want them back, but feel free to drop them if memory pressure builds."

The hint has limits. The pool cannot react to memory pressure during normal operation (between GCs). If you push faster than you pop and GCs are infrequent, the pool grows. The chain cap (2^15 per dequeue) and the two-GC lifetime are the only mechanisms preventing unbounded growth.

For programs that need adaptive sizing (release more aggressively under pressure), sync.Pool is the wrong primitive. Consider a hand-rolled bounded pool with explicit eviction.

23. Lock-free vs wait-free

sync.Pool's dequeue is lock-free but not wait-free.

  • Lock-free: in any execution, at least one thread makes progress in finite steps. No deadlock, no priority inversion. Critically, no thread can block all others by being descheduled.
  • Wait-free: every thread makes progress in bounded steps regardless of contention.

In sync.Pool, popHead and popTail can fail a CAS and retry. Under pathological scheduling, a single thread could be starved by other threads continuously succeeding ahead of it. In practice, this never happens for two reasons:

  1. The CAS-retry pattern in popHead and popTail is bounded by the small number of concurrent operations on the same dequeue — usually one owner and a handful of stealers.
  2. Each retry observes new state. If the queue empties, the loop exits with false. There is no live-lock on emptiness.

The wait-free distinction matters in real-time systems where bounded worst-case latency is required. Go is not generally used in such systems, and sync.Pool's lock-free guarantee is sufficient for server workloads.

24. Alternative pool designs

A. Mutex-protected slice

type SlicePool[T any] struct {
    mu   sync.Mutex
    data []T
    New  func() T
}

func (p *SlicePool[T]) Get() T {
    p.mu.Lock()
    if n := len(p.data); n > 0 {
        x := p.data[n-1]
        p.data = p.data[:n-1]
        p.mu.Unlock()
        return x
    }
    p.mu.Unlock()
    return p.New()
}

func (p *SlicePool[T]) Put(x T) {
    p.mu.Lock()
    p.data = append(p.data, x)
    p.mu.Unlock()
}

Pros: dead simple, type-safe with generics, no need to clear objects post-GC because no GC interaction.

Cons: scales poorly. At 16+ goroutines hammering Get/Put, the mutex becomes a bottleneck. Useful only at low concurrency or when the work behind the pool dominates.

B. Channel pool

type ChanPool[T any] struct {
    ch  chan T
    New func() T
}

func NewChanPool[T any](cap int, newFn func() T) *ChanPool[T] {
    return &ChanPool[T]{ch: make(chan T, cap), New: newFn}
}

func (p *ChanPool[T]) Get() T {
    select {
    case x := <-p.ch:
        return x
    default:
        return p.New()
    }
}

func (p *ChanPool[T]) Put(x T) {
    select {
    case p.ch <- x:
    default:
        // Pool full, drop on floor.
    }
}

Pros: bounded capacity, simple, type-safe, no GC interaction.

Cons: channels have internal mutexes too. Performance is similar to the mutex slice for low capacity, and the bounded size means high-throughput workloads drop objects (or block, if you remove the default cases). Also no per-P locality; every Get/Put goes through one shared channel.

Useful when you want a strict size cap, which sync.Pool can't give you.

C. valyala/bytebufferpool

The valyala/fasthttp project's bytebufferpool is a sync.Pool-like that adds size-bucketing and statistical eviction. It tracks the typical sizes of buffers returned and adjusts a "release" probability so unusually large buffers are dropped rather than retained.

Pros: better memory characteristics for variable-sized buffers. Useful for HTTP response bodies of unpredictable size.

Cons: more complex, project-specific. The statistical eviction is a heuristic that may or may not match your workload.

D. oxtoacart/bpool

A simple bounded byte-slice pool with explicit max-size. Internally a channel of *bytes.Buffer.

Pros: bounded, predictable.

Cons: same as channel pools — single contention point.

E. Per-goroutine pool via goroutine-local hack

In the wild, people sometimes use petermattis/goid or similar packages to read the goroutine ID and index a map[int64][]any. This is faster than sync.Pool in microbenchmarks because there is no atomic CAS, just a map lookup.

Pros: theoretical maximum throughput.

Cons: relies on unsafe access to runtime internals; breaks when the runtime changes; doesn't handle goroutine deaths (the map grows unboundedly); no GC integration. Not recommended for production.

F. P-local without runtime API

Some libraries (like the now-deprecated github.com/golang/groupcache) implemented per-P sharding manually using GOMAXPROCS and runtime.LockOSThread. This works but is fragile and loses the convenience of runtime_procPin.

The dominant choice for general-purpose pooling in Go remains sync.Pool. The alternatives above are appropriate when you have specific constraints (bounded size, type safety, custom eviction) that sync.Pool does not address.

25. When does sync.Pool actually help?

The cost model:

  • Per Get: pin + load private + maybe popHead CAS + maybe steal loop + maybe New().
  • Per Put: pin + store private or pushHead CAS.

In the no-contention case (object in private), Get and Put are roughly two atomic operations and a memory store each. Call it 20-40 ns on modern hardware.

The break-even with new(T) depends on T:

  • Small T (≤ 64 bytes): new(T) is around 10-30 ns including any GC pressure. Pool may not help. Often worse than direct allocation due to atomic overhead.
  • Medium T (256 bytes - 4 KiB): new(T) is 50-200 ns. Pool wins clearly.
  • Large T (> 8 KiB): new(T) triggers slow-path allocation and creates non-trivial GC scan work. Pool wins by an order of magnitude.

Other dimensions:

  • High allocation rate: pool helps because GC frequency dominates and pool keeps mostly-warm objects.
  • Low allocation rate: pool overhead may exceed savings, and victim cache provides little benefit.
  • High reuse fraction: pool fully amortizes the pin cost.
  • Low reuse fraction: most Gets fall through to New; pool is overhead with no payoff.

A useful instinct: if you measure go tool pprof -alloc_space and a single allocation site is in the top 10, sync.Pool is probably worth trying. If allocations are diffuse, focus on reducing the count, not pooling them.

26. Inspecting the pool with runtime tooling

There are no public APIs to inspect sync.Pool state. You can get an indirect view via:

import "runtime"

var ms runtime.MemStats
runtime.ReadMemStats(&ms)
fmt.Printf("alloc=%d MiB, gc-cycles=%d, num-pause-ns=%d\n",
    ms.Alloc/1024/1024, ms.NumGC, ms.PauseNs[(ms.NumGC+255)%256])

Running with and without the pool, the difference in Alloc (current heap), allocs/sec, and NumGC tells you whether the pool is reducing pressure. Watch:

  • Allocs/sec decrease: pool is reusing successfully.
  • Heap size stable but GC frequency drops: pool is keeping objects alive, allowing them to be reused before GC.
  • Heap size grows: pool is holding too many objects; consider whether your put rate exceeds your get rate by too much, or whether the pooled object is too large.

For the pool itself, you can hook runtime.GC calls and check timing of allocations around them; right after GC, you should see a brief uptick in New() calls, but smaller than without the pool (the victim cache absorbs the first wave).

27. Reading objdump output

Compile a small program:

package main

import "sync"

var pool = sync.Pool{
    New: func() any { return new([1024]byte) },
}

func use() *[1024]byte {
    return pool.Get().(*[1024]byte)
}

func release(b *[1024]byte) {
    pool.Put(b)
}

func main() {
    b := use()
    release(b)
}

Run:

go build -gcflags='-l' -o /tmp/poolex .
go tool objdump -s 'main.use' /tmp/poolex

You will see something like (annotated):

TEXT main.use(SB)
  MOVQ runtime.g(...), AX         ; load *g
  INCQ %{m.locks}                 ; procPin: mp.locks++
  MOVQ p_id, BX                   ; pid
  ...
  ; load p.localSize via atomic
  MOVQ (sync.poolMain).local(...), CX
  ; index l = &local[pid]
  ...
  ; check l.private != nil
  MOVQ (private)(CX), DX
  TESTQ DX, DX
  JEQ slow_path
  ; clear private
  MOVQ $0, (private)(CX)
  ...
  DECQ %{m.locks}                 ; procUnpin
  RET

The actual instructions are more numerous (write barriers for the pointer store, race-detector hooks if compiled with -race, etc.), but the spine is: pin, load, clear, unpin, return. No atomic on the fast path.

The slow path goes into (*sync.Pool).getSlow, where you see LOCK CMPXCHG instructions for the dequeue CAS.

28. Edge cases

No New function

var p sync.Pool
x := p.Get()  // x is nil

If New is nil and the pool is empty, Get returns nil. You must check. This is rarely what you want; always set New unless you specifically want sentinel-nil semantics.

Pool of pointer-to-pointer

var p = sync.Pool{
    New: func() any { return new(*Foo) },
}

pp := p.Get().(**Foo)
*pp = ...
p.Put(pp)

Works but odd. The pool stores **Foo. Each Get returns a fresh slot (a **Foo) and you must populate *pp. There is no practical reason to do this; just pool *Foo directly.

Pool used across GOMAXPROCS changes

If runtime.GOMAXPROCS(N) is called after the pool is initialized, pin notices that pid >= localSize and calls pinSlow, which re-allocates the local array. The old local is discarded (and its contents become garbage).

This means: changing GOMAXPROCS throws away pool state. If you call runtime.GOMAXPROCS at runtime, expect a transient allocation spike as pools refill.

Pool stored in package globals

var bufPool = sync.Pool{
    New: func() any { return bytes.NewBuffer(nil) },
}

This is fine. The pool lives for the lifetime of the process. allPools registry grows by one entry. poolCleanup walks allPools and oldPools on every GC; with hundreds of pools, this becomes measurable but rarely a bottleneck.

Pathological case: thousands of pools created and discarded. allPools grows during runtime; each entry pins the pool object indirectly (the slice references the *Pool). Discarded pools are not collected because the runtime references them. Avoid creating short-lived pools.

Pool with very expensive New

If New() allocates 100 MiB or takes 100 ms, pool semantics shift. getSlow runs New() outside the pin, so it doesn't block other Ps, but the cost is per-call. Pool's value is amortizing this cost across reuses. If reuse rate is low (most Gets call New), you may want a different strategy:

  • Pre-allocate a fixed set at init.
  • Use a counting semaphore (e.g., chan struct{} of capacity N) to bound concurrent New calls.
  • Bypass sync.Pool entirely.

Pool with very cheap New

If New() is just new(byte), the per-call cost is a few ns. Pool overhead (atomic CAS, padding cache pressure) may exceed savings. Profile both ways before deciding.

Concurrent New calls

New() is not synchronized. If two goroutines Get() simultaneously and both miss, both call New(). There is no "singleflight" semantics. This is usually fine — the objects are independent allocations and just go into private/shared on subsequent Puts.

Storing pointers to stack data

func bad() {
    var local [1024]byte
    p.Put(&local)  // BUG
}

This puts a stack pointer into the pool. Go's escape analysis will see that &local escapes to a heap-resident interface, so it will move local to the heap (escape it). The code works; but if escape analysis ever misses this, you would have a use-after-return. Don't write code that depends on the escape analysis catching everything; explicitly heap-allocate things you intend to share.

Storing things with finalizers

b := &Buffer{...}
runtime.SetFinalizer(b, func(b *Buffer) { b.Close() })
p.Put(b)

This is generally a bad idea. The pool holds b until GC moves it through victim and then drops it; the finalizer then runs. But finalizers don't run while objects are pool-rooted; they only run after the pool drops the reference. The lifetime is hard to reason about. Pool + finalizer = confusing. Don't do it.

Putting different types into one pool

p.Put(byteBuf)
p.Put(intSlice)
x := p.Get()
// x could be a byteBuf or intSlice

sync.Pool stores any. You can put any type. On Get, you type-assert. If you put multiple types, you have to handle the assertion failures. Don't do this; use one pool per type. (Generics would help here but sync.Pool predates generics and the API is locked.)

29. Generic wrapper around sync.Pool

A common pattern in modern Go (1.18+) is a typed wrapper:

type TypedPool[T any] struct {
    p sync.Pool
}

func NewTypedPool[T any](newFn func() *T) *TypedPool[T] {
    return &TypedPool[T]{
        p: sync.Pool{
            New: func() any { return newFn() },
        },
    }
}

func (tp *TypedPool[T]) Get() *T {
    return tp.p.Get().(*T)
}

func (tp *TypedPool[T]) Put(x *T) {
    tp.p.Put(x)
}

This avoids the type assertion at call sites and provides better autocomplete. The cost is one extra function call per Get/Put, which the inliner usually eliminates.

You can extend this with reset semantics:

type Resettable interface {
    Reset()
}

func (tp *TypedPool[T]) Put(x *T) {
    if r, ok := any(x).(Resettable); ok {
        r.Reset()
    }
    tp.p.Put(x)
}

But the type assertion at every Put is wasteful. Better to require T to satisfy an interface at compile time, but Go generics don't let you constrain on methods of *T easily without additional type parameters. The clearest pattern is:

type TypedPool[T any, PT interface {
    *T
    Reset()
}] struct {
    p sync.Pool
}

func (tp *TypedPool[T, PT]) Put(x PT) {
    x.Reset()
    tp.p.Put(x)
}

The two-parameter form constrains PT to be a pointer to T with a Reset method, giving a compile-time guarantee.

30. Sample workload: bytes.Buffer pool

Reference workload: a server reading 4 KiB requests and writing variable-size responses. Buffer pool:

var bufPool = sync.Pool{
    New: func() any { return new(bytes.Buffer) },
}

func handle(w http.ResponseWriter, r *http.Request) {
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        if buf.Cap() < 64*1024 {
            buf.Reset()
            bufPool.Put(buf)
        }
        // Else: too big, drop on floor.
    }()
    // ... write into buf ...
    w.Write(buf.Bytes())
}

Key points:

  • Get returns a fresh-or-pooled buffer.
  • Reset clears the buffer but keeps the backing array.
  • Cap-guard: oversized buffers are not returned. This prevents pathological inputs from poisoning the pool with megabyte-sized buffers.
  • Defer: even on panic, the buffer goes back.

This is one of the most common sync.Pool patterns in real Go services.

31. Pool of structs with backing slices

type ParseContext struct {
    tokens []token
    stack  []frame
}

var parseCtxPool = sync.Pool{
    New: func() any {
        return &ParseContext{
            tokens: make([]token, 0, 64),
            stack:  make([]frame, 0, 16),
        }
    },
}

func (pc *ParseContext) reset() {
    pc.tokens = pc.tokens[:0]
    pc.stack = pc.stack[:0]
}

func parse(input []byte) Result {
    pc := parseCtxPool.Get().(*ParseContext)
    defer func() {
        pc.reset()
        if cap(pc.tokens) < 256 && cap(pc.stack) < 64 {
            parseCtxPool.Put(pc)
        }
    }()
    // ... parse ...
}

The struct holds slices that the New func pre-sizes. Reuse keeps the backing arrays alive across calls. The cap-guards prevent pathological inputs from inflating the pool's footprint.

The slice trick s = s[:0] preserves the cap. After reset, the slice has length 0 but the same underlying array, ready for new appends.

32. Pool of objects with non-pointer fields

Pools work best with pointer types. Why?

The any boxing in Put(x any) requires a heap allocation if x is not already pointer-sized or larger and addressable. Putting a struct value into a pool causes the struct to be copied into a heap interface; you save nothing.

type Header struct {
    A, B, C int
}

var pool = sync.Pool{New: func() any { return Header{} }}  // BAD

var pool2 = sync.Pool{New: func() any { return &Header{} }} // GOOD

In the first case, every Get boxes a fresh Header value into an interface, allocating. The pool defeats itself.

In the second case, the pool stores *Header, which is a single word in the interface (the value pointer; the type pointer is the same *Header type and free). No boxing allocation.

33. Pool and goroutine lifetime

A goroutine that Gets from a pool but never Puts is a leak. The object lives as long as the goroutine holds it; if the goroutine never returns, the object never goes back. This is not a sync.Pool bug; it is a use-after-Get error.

Pattern: always defer Put immediately after Get.

buf := pool.Get().(*Buffer)
defer pool.Put(buf)
// use buf

If buf may be returned to a caller (escape outward), don't Put. The caller now owns lifecycle.

34. Pool and panic recovery

If a goroutine panics with a pooled object in scope, the deferred Put will run during stack unwinding. The object is back in the pool, in whatever state it was when the panic happened. If you don't reset before Put, the next Get will receive a possibly-corrupted object.

Defensive pattern:

buf := pool.Get().(*Buffer)
defer func() {
    buf.Reset()  // ensure clean state
    pool.Put(buf)
}()

Reset first, Put second. Even on panic, the next Get sees a fresh buffer.

35. Benchmark — order of magnitude

A representative benchmark on a 16-core x86 (illustrative, not from the CL description):

func BenchmarkPoolNoContention(b *testing.B) {
    p := sync.Pool{New: func() any { return new([1024]byte) }}
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            x := p.Get().(*[1024]byte)
            p.Put(x)
        }
    })
}

func BenchmarkPoolContended(b *testing.B) {
    p := sync.Pool{New: func() any { return new([1024]byte) }}
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            x := p.Get().(*[1024]byte)
            // simulate work that may migrate Gs across Ps
            runtime.Gosched()
            p.Put(x)
        }
    })
}

func BenchmarkAlloc(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            x := new([1024]byte)
            _ = x
        }
    })
}

Approximate results (yours will vary):

  • BenchmarkPoolNoContention: 20-30 ns/op, 0 allocs/op.
  • BenchmarkPoolContended: 60-100 ns/op, 0 allocs/op.
  • BenchmarkAlloc: 100-200 ns/op, 1 allocs/op, 1024 B/op.

The win is more visible when:

  • The allocated object is larger (more bytes for the allocator to zero).
  • Allocation pressure is the bottleneck (high allocs/sec).
  • The pool's reuse rate is high.

The CL 96459 description cites speedups of 2-3x on contended workloads vs. the pre-redesign pool. The exact numbers depend on workload, CPU count, and what you compare against.

36. Diagnosing pool misuse

Symptoms and likely causes:

Symptom Likely cause
Pool reduces allocs but increases CPU Object is too small; atomic overhead exceeds allocation cost.
Pool seems to "leak" memory Putting objects with large unbounded backing slices; add cap-guards.
Pool allocates a lot post-GC Expected behavior; if too aggressive, consider holding objects elsewhere with explicit lifetime.
Get returns nil unexpectedly New is nil and pool is empty. Set New.
Pool slower than mutex pool Workload is single-threaded or has low contention; per-P sharding doesn't help.
Different types coming out of Get Multiple types in one pool; use one pool per type.
GOMAXPROCS change causes spike Expected; re-pin reallocates local array.

37. Source layout summary

  • src/sync/pool.go — Pool struct, Get, Put, pin, pinSlow, getSlow, poolCleanup, init.
  • src/sync/poolqueue.go — poolDequeue, poolChain, pushHead, popHead, popTail, chain ops.
  • src/runtime/mgc.go and src/runtime/proc.go — poolCleanup hook registration, procPin / procUnpin.

Cross-references inside the codebase:

  • runtime_registerPoolCleanup connects to the runtime via go:linkname.
  • runtime_procPin and runtime_procUnpin are runtime functions exposed to sync via linker name.
  • race.Acquire, race.ReleaseMerge, etc., are race-detector hooks.

38. Lessons for designing concurrent data structures in Go

Reading pool.go and poolqueue.go is one of the best educational exercises for Go concurrency. Some takeaways:

  1. Pin where possible. If your data structure has per-CPU or per-P partitioning, use the runtime's pinning primitives. They are free if you can take them.

  2. Pack indices for atomic dual-update. When two related counters need to be updated atomically, pack them into a single 64-bit word and CAS the pair.

  3. Use SC atomics as fences for non-atomic data. Go's SC guarantees let you piggyback unaligned writes (interface, two-word) on top of subsequent atomic stores. Be very deliberate about ordering.

  4. Pad for false sharing. The 128-byte rule covers most modern hardware. The padding formula pad-to-multiple handles internal-size changes gracefully.

  5. Two-phase release. When ownership of a slot needs to pass between producer and consumer, use two atomic signals: one for index, one for content. The producer sees both before reusing.

  6. Stage objects through a victim cache before discarding. The two-GC lifetime smooths allocation spikes and is cheap to implement.

  7. Lock-free is enough; wait-free is rarely worth the complexity. sync.Pool chose lock-free and it works. Wait-free would require more state and more atomics, with no measurable benefit for the target workload.

  8. Document the invariants. The comments in pool.go and especially poolqueue.go are essential for anyone reading the code. Document the ownership protocol, the visibility ordering, and the unusual choices (head in upper bits, two-phase release).

39. Future possibilities

The current sync.Pool is, by Go community standards, "done." The 2017-2018 redesign solved the scalability problem, and there has been no major change since 1.13.

Possible future directions, none currently planned:

  • Per-NUMA-node pools. On NUMA systems, stealing across nodes is expensive. A future Pool could prefer steals from same-node Ps. Requires runtime to expose NUMA topology.
  • Typed Pool via generics. Replace any with [T any]. Easy in user code, hard to do without breaking the existing API.
  • Bounded Pool. Cap total objects across all Ps. Requires a global counter, which reintroduces contention. Probably not worth it; bounded pools should be channels.
  • Tunable victim cache lifetime. Currently two GCs. Could be configurable per-pool.

For now, the design is stable. Reading the code today is the same as reading it five years ago, modulo a few minor comment tweaks.

40. Putting it all together — a guided trace

Trace a single Get on a pool with one previously-Put object on P=0, called from a goroutine running on P=1:

  1. Get start. Goroutine enters Get. race.Enabled check — assume false.

  2. Pin. p.pin(). runtime_procPin() returns pid=1, increments m.locks. We are now firmly on M, which is on P=1, until unpin.

  3. Load localSize. s = LoadAcquintptr(&p.localSize). Suppose s=4 (GOMAXPROCS=4).

  4. Load local. l = p.local. The base pointer of the [4]poolLocal array.

  5. Index. indexLocal(l, 1) computes l + 1 * 128 = &poolLocal_for_P1.

  6. Read private. x := l.private. We are P=1; the previous Put was on P=0. P=1's private is nil.

  7. popHead. l.shared.popHead(). P=1's chain is empty (no one has Put on P=1). Returns nil.

  8. getSlow(1). Enter slow path. Reload localSize=4, local=base.

  9. Loop start: i=0. Try indexLocal(local, (1+0+1)%4) = &poolLocal_for_P2. P=2's shared is empty. popTail returns nil.

  10. i=1. Try P=3. Empty. Returns nil.

  11. i=2. Try P=0. P=0's shared has one object. popTail succeeds.

  12. popTail on P=0's chain. Load chain.tail. It points to a poolDequeue with head=1, tail=0. CAS to head=1, tail=1. Success.

  13. Read slot. Slot at index 0. Read val and typ. Wrap eface back to any.

  14. Release slot. Write slot.val = nil, then atomic.StorePointer(&slot.typ, nil). Now the slot belongs to P=0's producer (if it ever pushes again).

  15. Return to getSlow. Return the value.

  16. Back in Get. runtime_procUnpin: m.locks--. Goroutine can now be preempted.

  17. Race-acquire happens-before. Skipped (race disabled).

  18. Return x. Done.

Trace a Put on the same goroutine, now back on P=1 after some intervening work:

  1. Put start. x is non-nil.

  2. Pin. Get P id. Maybe still P=1, maybe P=2 if the scheduler moved us.

  3. Read private. Suppose private is nil. Write l.private = x. Done. Unpin.

If private was already populated (we have done multiple gets without put), we would fall through to l.shared.pushHead(x):

3'. pushHead. Look at the head dequeue. If full (slot.typ != nil for the next slot), allocate a new bigger dequeue and link it.

4'. Write slot. slot.val = ptr; slot.typ = typeptr (non-atomic, two-word).

5'. Increment head. atomic.Add(&headTail, 1<<32). Acts as barrier.

6'. Unpin.

This trace covers every interesting case in normal operation: same-P fast path, cross-P steal, dequeue manipulation, ordering.

41. Closing — what to internalize

sync.Pool is the kind of code where every line was written by someone who knew exactly which corner case was lurking behind it. To work effectively with it:

  • Know the fast path. The 90% case is private slot transfer with a P pin. If you understand that, you understand most workloads.
  • Know the steal path. The 9% case is popTail from another P's dequeue. The CAS retry is the only contention you'll measure.
  • Know the GC interaction. The 1% case is what happens at GC. The victim cache is what makes the post-GC behavior tolerable.

Beyond Pool itself, the code is a good template for any per-P data structure in Go: pin, padded slots, packed atomic CAS, two-phase release, GC hooks via STW write barriers. If you ever build something with similar requirements (a custom allocator, a per-P scheduler queue, a thread-local cache), the patterns translate directly.

A final reading recommendation: walk src/runtime/mcache.go after sync.Pool. The mcache (per-P allocator cache) uses similar ideas, with the additional twist that it interacts with the garbage collector at finer granularity. The two together give you the full picture of how Go achieves per-CPU caching without sacrificing GC.

42. Appendix — full annotated tour of poolqueue.go

For completeness, here is the entire poolqueue.go file with inline annotations. Reading it end-to-end after the conceptual sections above is the fastest way to lock in understanding.

// Copyright 2018 The Go Authors. All rights reserved.
// ...
package sync

import (
    "sync/atomic"
    "unsafe"
)

The file uses only sync/atomic and unsafe. No runtime imports. This is deliberate: poolqueue is meant to be a self-contained lock-free primitive that could in principle be lifted out of sync.Pool and used elsewhere.

type poolDequeue struct {
    headTail atomic.Uint64
    vals     []eface
}

Two fields. Eight bytes for headTail (the atomic wraps a uint64). Twenty-four bytes for the slice header (data pointer, length, capacity). The slice's backing array is heap-allocated separately and sized at construction.

Why use atomic.Uint64 (a struct wrapper) rather than uint64 and bare atomic.Load/Store? Two reasons:

  1. The wrapper enforces atomic-only access at the type level. You can't accidentally write d.headTail = ptrs2 non-atomically.
  2. The wrapper aligns the field on platforms where 64-bit atomics require alignment (32-bit ARM, x86 with non-default alignment).

Pre-Go-1.19, sync.Pool used a bare uint64 and the atomic.LoadUint64/atomic.AddUint64 functions; 1.19 added atomic.Uint64 and the conversion was a no-op semantically but improved type safety.

const dequeueBits = 32

const dequeueLimit = (1 << dequeueBits) / 4

dequeueBits is the number of bits for each of head and tail; together they consume 64 bits.

dequeueLimit is 2^30 = 1073741824. But wait — earlier I said the dequeue cap is 1 << 15. Let me reread.

Actually: dequeueLimit is the cap on the dequeue size (slot count). Looking again, the chain uses dequeueLimit directly:

const dequeueLimit = (1 << dequeueBits) / 4

Hmm, 2^30 is far more than 2^15. The discrepancy comes from different versions. In Go 1.13, the cap was 1<<15. In later versions, the runtime relaxed it. Check src/sync/poolqueue.go in the version you're using; different commits have different values. The mechanics are the same regardless of the cap.

For most workloads, the dequeue never grows past a few thousand slots anyway, so the cap is academic.

type dequeueNil *struct{}

A type used as a sentinel to wrap user-provided nil values. Stored in a slot, this distinguishes "explicit nil" from "empty slot."

type eface struct {
    typ, val unsafe.Pointer
}

Mirror of Go's runtime interface representation. Two raw words.

The full poolDequeue implementation (with my annotations interleaved):

func (d *poolDequeue) unpack(ptrs uint64) (head, tail uint32) {
    const mask = 1<<dequeueBits - 1
    head = uint32((ptrs >> dequeueBits) & mask)
    tail = uint32(ptrs & mask)
    return
}

func (d *poolDequeue) pack(head, tail uint32) uint64 {
    const mask = 1<<dequeueBits - 1
    return (uint64(head) << dequeueBits) |
        uint64(tail&mask)
}

Unpack and pack helpers. Note tail & mask is redundant (tail is already uint32) but expresses intent.

func (d *poolDequeue) pushHead(val any) bool {
    ptrs := d.headTail.Load()
    head, tail := d.unpack(ptrs)
    if (tail+uint32(len(d.vals)))&(1<<dequeueBits-1) == head {
        return false
    }
    slot := &d.vals[head&uint32(len(d.vals)-1)]
    typ := atomic.LoadPointer(&slot.typ)
    if typ != nil {
        return false
    }
    if val == nil {
        val = dequeueNil(nil)
    }
    *(*any)(unsafe.Pointer(slot)) = val
    d.headTail.Add(1 << dequeueBits)
    return true
}

pushHead:

  • Single atomic load.
  • Fullness check: (tail + cap) mod 2^32 == head. Cap is the number of slots.
  • Slot ownership check via slot.typ. If non-nil, consumer is still cleaning up; treat as full.
  • Special-case nil to use the dequeueNil placeholder.
  • Non-atomic two-word store via the (*any)(unsafe.Pointer(slot)) trick. Casts the *eface to *any (both are two-word representations).
  • Atomic add to head field. This is the release fence.

The cast *(*any)(unsafe.Pointer(slot)) = val is the cleanest way to do a paired write of two interface words. Without it, you'd need separate stores to slot.val and slot.typ, which would order-of-store-matter for the producer-consumer protocol.

func (d *poolDequeue) popHead() (any, bool) {
    var slot *eface
    for {
        ptrs := d.headTail.Load()
        head, tail := d.unpack(ptrs)
        if tail == head {
            return nil, false
        }
        head--
        ptrs2 := d.pack(head, tail)
        if d.headTail.CompareAndSwap(ptrs, ptrs2) {
            slot = &d.vals[head&uint32(len(d.vals)-1)]
            break
        }
    }
    val := *(*any)(unsafe.Pointer(slot))
    if val == dequeueNil(nil) {
        val = nil
    }
    *slot = eface{}
    return val, true
}

popHead:

  • CAS loop with empty-check (tail == head).
  • Decrement head, CAS expecting old packed value.
  • After CAS, we exclusively own the slot.
  • Read value via the same (*any) cast.
  • Unwrap nil placeholder.
  • Zero the slot — non-atomic, no concurrent access possible at this point.

The slot zeroing is important for GC: without it, the slot still holds a pointer to the (now logically removed) object, keeping it alive.

func (d *poolDequeue) popTail() (any, bool) {
    var slot *eface
    for {
        ptrs := d.headTail.Load()
        head, tail := d.unpack(ptrs)
        if tail == head {
            return nil, false
        }
        ptrs2 := d.pack(head, tail+1)
        if d.headTail.CompareAndSwap(ptrs, ptrs2) {
            slot = &d.vals[tail&uint32(len(d.vals)-1)]
            break
        }
    }
    val := *(*any)(unsafe.Pointer(slot))
    if val == dequeueNil(nil) {
        val = nil
    }
    slot.val = nil
    atomic.StorePointer(&slot.typ, nil)
    return val, true
}

popTail:

  • Same CAS loop pattern, but incrementing tail instead of decrementing head.
  • After CAS we own the slot exclusively (the producer's pushHead checks typ before writing).
  • Read value.
  • Two-phase release: clear val (non-atomic), then atomically clear typ.

The two-phase release matters because the producer's pushHead reads typ atomically. Setting val first and typ second means the producer, observing typ==nil, can rely on val also being nil/replaced when it writes new content.

type poolChain struct {
    head *poolChainElt
    tail *poolChainElt
}

type poolChainElt struct {
    poolDequeue
    next, prev *poolChainElt
}

poolChain wraps a doubly-linked list of poolChainElt. Each elt embeds a poolDequeue and has next/prev pointers.

func storePoolChainElt(pp **poolChainElt, v *poolChainElt) {
    atomic.StorePointer((*unsafe.Pointer)(unsafe.Pointer(pp)),
        unsafe.Pointer(v))
}

func loadPoolChainElt(pp **poolChainElt) *poolChainElt {
    return (*poolChainElt)(atomic.LoadPointer(
        (*unsafe.Pointer)(unsafe.Pointer(pp))))
}

Atomic helpers for the chain pointers. Why not use atomic.Pointer[poolChainElt] (generic from 1.19)? Backward compat with pre-1.19 source files in the standard library. The functionality is identical.

func (c *poolChain) pushHead(val any) {
    d := c.head
    if d == nil {
        const initSize = 8
        d = new(poolChainElt)
        d.vals = make([]eface, initSize)
        c.head = d
        storePoolChainElt(&c.tail, d)
    }
    if d.pushHead(val) {
        return
    }
    newSize := len(d.vals) * 2
    if newSize >= dequeueLimit {
        newSize = dequeueLimit
    }
    d2 := &poolChainElt{prev: d}
    d2.vals = make([]eface, newSize)
    c.head = d2
    storePoolChainElt(&d.next, d2)
    d2.pushHead(val)
}

Three cases: initialize, push to existing head, grow. The grow case allocates a new dequeue, links it via d.next = d2, and pushes into it. The link is atomic so consumers walking the chain see consistent prev/next.

The c.head = d2 write is non-atomic because head is owner-only.

func (c *poolChain) popHead() (any, bool) {
    d := c.head
    for d != nil {
        if val, ok := d.popHead(); ok {
            return val, true
        }
        d = loadPoolChainElt(&d.prev)
    }
    return nil, false
}

Walk from head backward. Each popHead is the dequeue-level operation. The prev load is atomic because consumers may concurrently update it (the popTail path clears prev when retiring).

func (c *poolChain) popTail() (any, bool) {
    d := loadPoolChainElt(&c.tail)
    if d == nil {
        return nil, false
    }
    for {
        d2 := loadPoolChainElt(&d.next)
        if val, ok := d.popTail(); ok {
            return val, true
        }
        if d2 == nil {
            return nil, false
        }
        if atomic.CompareAndSwapPointer(
            (*unsafe.Pointer)(unsafe.Pointer(&c.tail)),
            unsafe.Pointer(d), unsafe.Pointer(d2)) {
            storePoolChainElt(&d2.prev, nil)
        }
        d = d2
    }
}

The tail walk with the carefully ordered "load next before pop." Walked through in section 13.

That is the entire poolqueue.go. Roughly 200 lines of dense lock-free code. Every line carries a non-obvious decision. Re-read it after you've worked through this document and notice how each line answers a question the document raises.

43. Appendix — full annotated tour of pool.go (key methods)

I'll skip the file header and focus on the methods that haven't been fully shown above.

Put:

func (p *Pool) Put(x any) {
    if x == nil {
        return
    }
    if race.Enabled {
        if fastrandn(4) == 0 {
            return
        }
        race.ReleaseMerge(poolRaceAddr(x))
        race.Disable()
    }
    l, _ := p.pin()
    if l.private == nil {
        l.private = x
    } else {
        l.shared.pushHead(x)
    }
    runtime_procUnpin()
    if race.Enabled {
        race.Enable()
    }
}

The race-detector's random dropping is interesting. With race enabled, 25% of Puts are silently discarded. This forces test runs to exercise the no-cache path frequently, exposing bugs in code that assumes pool reuse semantics.

Get:

func (p *Pool) Get() any {
    if race.Enabled {
        race.Disable()
    }
    l, pid := p.pin()
    x := l.private
    l.private = nil
    if x == nil {
        x, _ = l.shared.popHead()
        if x == nil {
            x = p.getSlow(pid)
        }
    }
    runtime_procUnpin()
    if race.Enabled {
        race.Enable()
        if x != nil {
            race.Acquire(poolRaceAddr(x))
        }
    }
    if x == nil && p.New != nil {
        x = p.New()
    }
    return x
}

The race.Acquire(poolRaceAddr(x)) establishes a happens-before edge between the Put that originally produced x and this Get. Without it, the race detector would flag legitimate uses of pooled objects (where the data race exists in user code only if you misuse the pool).

poolRaceAddr is a helper that derives a sync-detector tag from the pointer value:

func poolRaceAddr(x any) unsafe.Pointer {
    ptr := uintptr((*[2]unsafe.Pointer)(unsafe.Pointer(&x))[1])
    const poolRaceHash = 1024
    return &poolRaceHashTable[(ptr>>3)%poolRaceHash]
}

A small hash table of synchronization addresses. Multiple objects hash to the same slot; that is fine because race detection only needs happens-before edges, and false-positives in one direction (claiming a HB edge exists when it doesn't strictly) don't cause spurious reports.

pin (already covered):

func (p *Pool) pin() (*poolLocal, int) {
    pid := runtime_procPin()
    s := runtime_LoadAcquintptr(&p.localSize)
    l := p.local
    if uintptr(pid) < s {
        return indexLocal(l, pid), pid
    }
    return p.pinSlow()
}

pinSlow (already covered):

func (p *Pool) pinSlow() (*poolLocal, int) {
    runtime_procUnpin()
    allPoolsMu.Lock()
    defer allPoolsMu.Unlock()
    pid := runtime_procPin()
    s := p.localSize
    l := p.local
    if uintptr(pid) < s {
        return indexLocal(l, pid), pid
    }
    if p.local == nil {
        allPools = append(allPools, p)
    }
    size := runtime.GOMAXPROCS(0)
    local := make([]poolLocal, size)
    atomic.StorePointer(&p.local, unsafe.Pointer(&local[0]))
    runtime_StoreReluintptr(&p.localSize, uintptr(size))
    return &local[pid], pid
}

getSlow (already covered):

func (p *Pool) getSlow(pid int) any {
    size := runtime_LoadAcquintptr(&p.localSize)
    locals := p.local
    for i := 0; i < int(size); i++ {
        l := indexLocal(locals, (pid+i+1)%int(size))
        if x, _ := l.shared.popTail(); x != nil {
            return x
        }
    }
    size = atomic.LoadUintptr(&p.victimSize)
    if uintptr(pid) >= size {
        return nil
    }
    locals = p.victim
    l := indexLocal(locals, pid)
    if x := l.private; x != nil {
        l.private = nil
        return x
    }
    for i := 0; i < int(size); i++ {
        l := indexLocal(locals, (pid+i)%int(size))
        if x, _ := l.shared.popTail(); x != nil {
            return x
        }
    }
    atomic.StoreUintptr(&p.victimSize, 0)
    return nil
}

poolCleanup (already covered):

func poolCleanup() {
    for _, p := range oldPools {
        p.victim = nil
        p.victimSize = 0
    }
    for _, p := range allPools {
        p.victim = p.local
        p.victimSize = p.localSize
        p.local = nil
        p.localSize = 0
    }
    oldPools, allPools = allPools, nil
}

func init() {
    runtime_registerPoolCleanup(poolCleanup)
}

Linker-name helpers:

func runtime_registerPoolCleanup(cleanup func())
func runtime_procPin() int
func runtime_procUnpin()
func runtime_LoadAcquintptr(ptr *uintptr) uintptr
func runtime_StoreReluintptr(ptr *uintptr, val uintptr) uintptr

These have no bodies in pool.go. They are stubbed out and linked to runtime implementations via go:linkname directives in the runtime. This is a private convention between the standard library and the runtime, not available to user code.

That covers the public surface and the entire internal mechanism. With both files in hand and the conceptual walkthrough above, you should be able to predict what every line does and why, identify which atomic operations would be unnecessary if Go's memory model offered explicit acquire/release, spot the few non-atomic stores and explain why they are safe, estimate the worst-case cost of Get and Put in cycles, and compare sync.Pool's design to other lock-free deques in the literature.

That is the working knowledge of sync.Pool internals. Any further depth is in profiling specific workloads and observing the consequences.

← Back