Starvation — Optimisation¶
Table of Contents¶
- Goal of This File
- Optimisation 1: Shorten the RWMutex Critical Section
- Optimisation 2: Replace RWMutex with atomic.Pointer
- Optimisation 3: Shard the Lock
- Optimisation 4: Bounded Queues with Adaptive Capacity
- Optimisation 5: Aging Priority Queue with Heap
- Optimisation 6: Per-CPU Local Counters
- Optimisation 7: Hot Path / Cold Path Split
- Measurement and Validation
- Summary
Goal of This File¶
You have identified starvation in a Go service. You have shortened the critical sections, added back-pressure, and the system is stable. Now you want to make it faster while preserving fairness. This file is about the second-order improvements: how to reduce p99 latency further once the obvious bugs are gone.
Each section presents a before/after with measurement guidance. Every optimisation has a trade-off; we name it explicitly.
Optimisation 1: Shorten the RWMutex Critical Section¶
Before¶
type Cache struct {
	mu   sync.RWMutex
	data map[string]*Entry
}

func (c *Cache) Lookup(k string) *Result {
	c.mu.RLock()
	defer c.mu.RUnlock()
	entry, ok := c.data[k]
	if !ok {
		return nil
	}
	return entry.Compute() // expensive
}
Problem¶
Compute() is slow (e.g., 1 ms). It runs under the read lock. Writers wait for every active Compute() to finish.
After¶
func (c *Cache) Lookup(k string) *Result {
	// Snapshot the entry pointer, then release the lock before the slow call.
	c.mu.RLock()
	entry, ok := c.data[k]
	c.mu.RUnlock()
	if !ok {
		return nil
	}
	return entry.Compute()
}
Trade-off¶
The entry pointer could be invalidated between unlock and use. Two options:
- If Entry.Compute() is safe on a snapshot (the entry is immutable after publication), this is fine. Set the field via atomic pointer swap on update.
- If not, hold a per-entry lock or refcount inside Entry.
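If entries are treated as immutable, updates publish a whole new Entry rather than mutating the existing one. A minimal sketch, assuming updates replace entries wholesale; Update and the value field are illustrative names:

func (c *Cache) Update(k string, v Value) {
	e := &Entry{value: v} // built completely before publication, never mutated after
	c.mu.Lock()
	c.data[k] = e
	c.mu.Unlock()
}

A reader that snapshotted the old pointer under RLock keeps using the old, still-valid Entry.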
Measurement¶
- Before: writer p99 ≈ read critical-section duration × (active readers ÷ cores). With 100 concurrent readers each holding the lock for 1 ms, writers can wait tens to hundreds of milliseconds depending on core count.
- After: writer p99 = map lookup time ≈ microseconds.
Optimisation 2: Replace RWMutex with atomic.Pointer¶
Before¶
type Config struct {
	mu sync.RWMutex
	v  *ConfigData
}

func (c *Config) Get() *ConfigData {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.v
}

func (c *Config) Set(v *ConfigData) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.v = v
}
Problem¶
Get is called millions of times per second; it pays for two atomic operations on every call (RLock/RUnlock increments and decrements of readerCount). For a pointer load, this is overkill.
After¶
import "sync/atomic"

type Config struct {
	v atomic.Pointer[ConfigData]
}

func (c *Config) Get() *ConfigData {
	return c.v.Load()
}

func (c *Config) Set(v *ConfigData) {
	c.v.Store(v)
}
Trade-off¶
- Requires ConfigData to be effectively immutable after publication. Mutating the struct after Store is a data race.
- No way to atomically update multiple fields. Bundle related state in a struct and store a new pointer.
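When an update needs to change one field while keeping the rest, a copy-on-write loop preserves that invariant. A sketch, assuming an initial value has already been stored; SetTimeout and the Timeout field are illustrative:

func (c *Config) SetTimeout(d time.Duration) {
	for {
		old := c.v.Load()
		next := *old     // copy the current snapshot
		next.Timeout = d // modify only the copy
		if c.v.CompareAndSwap(old, &next) {
			return // readers see either the old or the new struct, never a mix
		}
		// Another writer published first; retry against the fresh value.
	}
}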
Measurement¶
Get goes from ~20 ns to ~2 ns. At millions of QPS this is a measurable CPU saving and removes the source of RWMutex contention entirely.
Optimisation 3: Shard the Lock¶
Before¶
type Counter struct {
	mu    sync.Mutex
	count int64
}

func (c *Counter) Inc() {
	c.mu.Lock()
	c.count++
	c.mu.Unlock()
}

func (c *Counter) Get() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.count
}
Problem¶
Under high concurrency, the single lock is contended. The lock itself becomes a starvation source for waiters.
After¶
const shards = 64

type Counter struct {
	counts [shards]struct {
		mu    sync.Mutex
		value int64
		_     [48]byte // pad each shard to 64 bytes so shards sit on separate cache lines
	}
}

func (c *Counter) Inc() {
	s := &c.counts[shardIndex()]
	s.mu.Lock()
	s.value++
	s.mu.Unlock()
}

func (c *Counter) Get() int64 {
	var sum int64
	for i := range c.counts {
		c.counts[i].mu.Lock()
		sum += c.counts[i].value
		c.counts[i].mu.Unlock()
	}
	return sum
}

// shardIndex spreads goroutines across shards. One cheap option: math/rand/v2's
// top-level functions use per-thread state, so drawing a random shard adds no
// shared contention; any per-goroutine or per-CPU dispersal works as well.
func shardIndex() int {
	return rand.IntN(shards) // rand is math/rand/v2
}
Trade-off¶
- Get is now O(shards). Still fast, but not O(1).
- Reading is no longer point-in-time consistent across shards (different shards are observed at different moments).
- More memory: each shard has its own cache line.
For pure counters, prefer atomic.Int64 or atomic.AddInt64. Sharding is more interesting for complex state where atomics do not apply.
Measurement¶
For high-contention counters, sharded versions achieve nearly linear scaling with cores. The unsharded version saturates at 30-50 M ops/sec depending on hardware; the sharded version scales to hundreds of millions.
Optimisation 4: Bounded Queues with Adaptive Capacity¶
Before¶
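A fixed-capacity baseline of the kind the problem below describes, sketched with a buffered channel; BoundedQueue, Job, and TryPush are illustrative names:

type BoundedQueue struct {
	jobs chan Job
}

func NewBoundedQueue(capacity int) *BoundedQueue {
	return &BoundedQueue{jobs: make(chan Job, capacity)}
}

// TryPush applies back-pressure: it rejects work when the buffer is full.
func (q *BoundedQueue) TryPush(j Job) bool {
	select {
	case q.jobs <- j:
		return true
	default:
		return false
	}
}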
Problem¶
Fixed capacity. If the workload is bursty, you either:
- Set capacity high enough for the burst (wasting memory and latency in normal load).
- Set it low (rejecting work during bursts).
After¶
Implement a queue with adaptive capacity. Start at low capacity; grow when fullness exceeds 80% sustained; shrink when below 20% sustained.
type AdaptiveQueue struct {
	mu       sync.Mutex
	items    []Job
	cap      int
	minCap   int
	maxCap   int
	fullness atomic.Int64 // EWMA in percent
}
// Resize logic runs periodically in a background goroutine.
A full implementation is involved; the pattern is:
- Track utilisation as an exponentially weighted moving average.
- Resize at most once per N seconds.
- Use sync.Cond to signal Pop after a resize.
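A sketch of the background resize step, assuming the thresholds above; resizeLoop, the done channel, and the 5-second tick are illustrative:

func (q *AdaptiveQueue) resizeLoop(done <-chan struct{}) {
	t := time.NewTicker(5 * time.Second) // resize at most once per tick
	defer t.Stop()
	for {
		select {
		case <-done:
			return
		case <-t.C:
		}
		f := q.fullness.Load() // EWMA utilisation in percent
		q.mu.Lock()
		switch {
		case f > 80 && q.cap < q.maxCap:
			q.cap = min(q.cap*2, q.maxCap) // the next grow copies items to a larger slice
		case f < 20 && q.cap > q.minCap:
			q.cap = max(q.cap/2, q.minCap)
		}
		q.mu.Unlock()
	}
}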
Trade-off¶
- Complexity. Adaptive systems are harder to reason about.
- Hysteresis tuning. Grow/shrink thresholds need testing.
- Per-resize allocation. Each grow copies items to a new slice.
For most services, fixed capacity tuned to peak observed load + headroom is simpler and good enough.
Measurement¶
Compare p99 latency under bursty load (10x the base rate for 1 second every 10 seconds). The adaptive queue should keep latency stable; a fixed small queue drops work during bursts; a fixed large queue keeps stale items longer.
Optimisation 5: Aging Priority Queue with Heap¶
Before¶
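A linear-scan baseline, as a sketch. It assumes effective(it) combines base priority with time waited, with lower values popping first; scanQueue and its fields are illustrative:

type scanQueue struct {
	mu    sync.Mutex
	items []*Item
}

// Pop scans every queued item to find the best effective priority: O(N) per call.
func (q *scanQueue) Pop() *Item {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return nil
	}
	best := 0
	for i := 1; i < len(q.items); i++ {
		if effective(q.items[i]) < effective(q.items[best]) {
			best = i
		}
	}
	it := q.items[best]
	q.items = append(q.items[:best], q.items[best+1:]...)
	return it
}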
Problem¶
For large queues (>1000 items), the O(N) scan per pop dominates.
After¶
Use a min-heap keyed by effective(it). Periodically rebuild the heap because aging shifts the keys.
import "container/heap"

type AgingQueue struct {
	mu        sync.Mutex
	h         itemHeap // heap.Interface implementation keyed by effective(it)
	nonEmpty  *sync.Cond
	lastBuilt time.Time
}

func NewAgingQueue() *AgingQueue {
	q := &AgingQueue{lastBuilt: time.Now()}
	q.nonEmpty = sync.NewCond(&q.mu)
	return q
}

func (q *AgingQueue) Push(it *Item) {
	q.mu.Lock()
	heap.Push(&q.h, it)
	q.mu.Unlock()
	q.nonEmpty.Signal()
}

func (q *AgingQueue) Pop() *Item {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.h.Len() == 0 {
		q.nonEmpty.Wait()
	}
	// Aging shifts effective priorities over time; periodically re-heapify
	// so the ordering stays approximately correct.
	if time.Since(q.lastBuilt) > 100*time.Millisecond {
		heap.Init(&q.h)
		q.lastBuilt = time.Now()
	}
	return heap.Pop(&q.h).(*Item)
}
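The itemHeap type referenced above is an ordinary heap.Interface implementation. A minimal sketch, again assuming lower effective(it) means higher priority:

type itemHeap []*Item

func (h itemHeap) Len() int           { return len(h) }
func (h itemHeap) Less(i, j int) bool { return effective(h[i]) < effective(h[j]) }
func (h itemHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *itemHeap) Push(x any) { *h = append(*h, x.(*Item)) }

func (h *itemHeap) Pop() any {
	old := *h
	n := len(old)
	it := old[n-1]
	old[n-1] = nil // drop the reference so the item can be collected
	*h = old[:n-1]
	return it
}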
Trade-off¶
- Heap operations are O(log N) but heap keys change over time as items age. Periodic rebuilds keep order approximately correct.
- The rebuild is O(N). Frequency tuned to balance correctness vs. cost.
- Strict aging guarantees are weaker; items can be popped slightly out of "true" effective-priority order between rebuilds.
For most workloads, 100 ms rebuild interval is invisible.
Optimisation 6: Per-CPU Local Counters¶
Before¶
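The baseline the problem below describes is a single counter that every goroutine updates with an atomic add (a sketch):

var total atomic.Int64

func inc() {
	total.Add(1) // every core contends on the same cache line
}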
Problem¶
All goroutines issue atomic operations against a single cache line. Under high contention from many cores, that cache line bounces between cores. Each Add takes 50-200 ns instead of the 2-5 ns it would take uncontended.
After¶
Per-P counters with periodic aggregation. Go does not expose the current P to user code, so a slice of roughly runtime.NumCPU() counters indexed by a cheap dispersal function (a hashed or XOR-shifted per-goroutine value, say) is a reasonable approximation:
import "runtime"

var counters []atomic.Int64

func init() {
	// Round the slice length up to a power of two so the mask in inc
	// always lands inside the slice.
	n := 1
	for n < runtime.NumCPU() {
		n *= 2
	}
	counters = make([]atomic.Int64, n)
}

func inc() {
	idx := dispersedIdx() // any cheap dispersal, e.g. a per-goroutine hash
	counters[idx&(len(counters)-1)].Add(1)
}

func total() int64 {
	var sum int64
	for i := range counters {
		sum += counters[i].Load()
	}
	return sum
}
Pad each atomic.Int64 to a cache line to avoid false sharing; for example, assuming a 64-byte cache line:
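// One counter per cache line. 64 bytes is the common line size on x86-64;
// adjust the padding for other architectures.
type paddedCounter struct {
	n atomic.Int64
	_ [56]byte
}

var counters []paddedCounter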
Trade-off¶
- total is O(N).
- Sums are eventually consistent across counters.
- Extra memory: one cache line per CPU.
For pure increment counters this scales to billions of operations per second on commodity hardware.
Optimisation 7: Hot Path / Cold Path Split¶
Before¶
func (s *Service) Handle(r Request) Response {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r.Type == Common {
		return s.handleCommon(r)
	}
	return s.handleRare(r) // slow
}
Problem¶
The rare path is slow but the lock is shared with the common path. Common requests wait behind rare ones.
After¶
Split the lock; let common requests use a fast lock and rare requests use a different lock.
type Service struct {
	commonMu sync.Mutex
	common   CommonState

	rareMu sync.Mutex
	rare   RareState
}

func (s *Service) Handle(r Request) Response {
	if r.Type == Common {
		s.commonMu.Lock()
		defer s.commonMu.Unlock()
		return s.handleCommon(r)
	}
	s.rareMu.Lock()
	defer s.rareMu.Unlock()
	return s.handleRare(r)
}
If the rare path needs to read common state, snapshot it under the common lock first:
func (s *Service) handleRare(r Request) Response {
	s.commonMu.Lock()
	snapshot := s.common.Snapshot()
	s.commonMu.Unlock()
	// Now do slow work without holding commonMu.
	...
}
Trade-off¶
- More locks = more bookkeeping.
- Cross-lock invariants become harder. If common and rare must be updated atomically together, you need a coordination lock anyway.
The pattern works best when common and rare are independent enough that the only interaction is reads.
Measurement¶
Common-path p99 should drop dramatically. Rare-path p99 stays the same.
Measurement and Validation¶
Every optimisation needs three numbers:
- p50 / p99 / p99.9 latency of the affected operation, before and after.
- Throughput (ops/sec), before and after.
- CPU and memory profile, before and after, to confirm the optimisation paid off in the budget you expected.
Use go test -bench with -benchtime=10s to get stable measurements. For percentiles, use the harness from tasks.md Task 10.
Run with GOMAXPROCS set to the machine's core count (the default) and a realistic concurrency level. Single-threaded benchmarks rarely show contention effects.
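For example, a parallel benchmark for the sharded counter from Optimisation 3 (a sketch; b.RunParallel spreads iterations across GOMAXPROCS worker goroutines, so lock contention shows up directly in the reported ns/op):

func BenchmarkCounterInc(b *testing.B) {
	var c Counter
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.Inc()
		}
	})
}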
Validate fairness explicitly: under high load, no single goroutine should have starved. Plot per-goroutine completion counts and verify they are within a factor of 2-3 of each other.
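A minimal check along these lines, assuming each worker goroutine records its completion count (completions and fairnessRatio are illustrative):

func fairnessRatio(completions []int64) float64 {
	lo, hi := completions[0], completions[0]
	for _, c := range completions[1:] {
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
	}
	// A ratio well above 2-3 under steady load suggests some workers starved.
	return float64(hi) / float64(lo)
}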
Summary¶
Once obvious starvation bugs are fixed (long critical sections, biased selects, unbounded queues, single-tenant queues), the remaining optimisations are:
- Shrink critical sections by snapshotting.
- Replace RWMutex with atomic.Pointer where data is immutable.
- Shard locks to reduce contention.
- Use adaptive queue sizing for bursty workloads.
- Upgrade aging queues to heaps for large queue sizes.
- Use per-P counters to eliminate atomic contention.
- Split hot-path and cold-path locks.
Each is a trade-off. Measure before and after. Keep the fairness guarantee — a faster system that starves some workload is a regression, not an improvement.
Continue to tasks.md for hands-on exercises or find-bug.md for diagnosis practice.