Livelock — Middle Level¶
Table of Contents¶
- Introduction
- Anatomy of a Livelock
- CAS-Loop Livelock in Detail
- Back-Off Strategies
- The Standard Cure: Jitter
- The cenkalti/backoff Library
- Diagnosing Livelock in a Running Service
- Livelock vs Contention
- Sharding to Reduce Contention
- Singleflight and Coalescing
- Patterns That Reliably Avoid Livelock
- Summary
Introduction¶
At the junior level you learned the shape and the basic cure. At the middle level you learn to:
- See livelock as a contention pattern, not a single bug.
- Reason about CAS-loop performance under varying numbers of goroutines.
- Use exponential back-off with jitter correctly, including the difference between full jitter, equal jitter, and decorrelated jitter.
- Use github.com/cenkalti/backoff end-to-end with context cancellation.
- Diagnose livelock from pprof CPU profiles and runtime metrics.
- Build sharded data structures that resist contention by design.
- Combine singleflight and back-off in a single cache-fill pipeline.
The middle-level skill is reaching for the right cure on the first try, not flailing through symptoms.
Anatomy of a Livelock¶
Every livelock has four ingredients:
- Two or more actors. Goroutines, processes, services.
- A shared resource. A mutex, an atomic, a row, a leadership lease.
- A reaction loop. Each actor reacts to the others' state and retries.
- Symmetry. The actors' reactions are similar enough to keep colliding.
Remove any one of these and the livelock disappears:
- Reduce to one actor — no livelock, but no concurrency either.
- Eliminate the shared resource (sharding) — different actors hit different resources.
- Remove the loop (give up after N attempts) — bounded failure, not livelock.
- Break the symmetry (jitter, priority) — collisions become rare.
The third and fourth are the common production cures. Sharding helps when contention is over data; jitter helps when contention is over coordination.
// All four ingredients are visible here:
// 1. multiple goroutines
// 2. one shared atomic
// 3. an unbounded retry loop
// 4. all goroutines use the same retry rhythm
for {
	old := counter.Load()                   // shared resource
	if counter.CompareAndSwap(old, old+1) { // reaction
		return
	} // symmetric: every goroutine runs this identical loop
}
CAS-Loop Livelock in Detail¶
The CAS loop is Go's most common livelock-prone shape. The semantics are simple: read, compute, swap, retry on failure.
Under low contention this is fast — the loop almost always exits on the first try. Under high contention the loop's success probability collapses.
The contention model¶
With N goroutines all CAS-ing on the same location, a given attempt succeeds with probability roughly 1/N. In each round of N concurrent attempts, one succeeds and N-1 fail, so the wasted work is proportional to N-1.
Throughput per goroutine is therefore roughly C/N, where C is the rate at which successful CAS operations can serialise through the cache line. Doubling N halves the per-goroutine throughput, while aggregate throughput stays roughly constant: the same number of successful CAS operations per second regardless of N.
This is not always livelock in the strict "no progress" sense — some CAS succeeds per round. But it is the spirit of livelock: most work is wasted, and adding more goroutines makes the problem worse.
When CAS-loop livelock crosses into true zero-progress¶
If the computation between read and swap is expensive, or has side effects that interfere with other goroutines (for example, it allocates from a contended pool), the round time grows with N. If the round time grows faster than attempts can succeed, successes per second fall toward zero: now you have observable livelock.
Demonstration¶
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var counter atomic.Int64
	var success atomic.Int64
	const goroutines = 1000
	var wg sync.WaitGroup
	stop := make(chan struct{})
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-stop:
					return
				default:
				}
				old := counter.Load()
				if counter.CompareAndSwap(old, old+1) {
					success.Add(1)
				}
			}
		}()
	}
	time.Sleep(time.Second)
	s := success.Load()
	close(stop)
	wg.Wait()
	fmt.Printf("successes in 1s with %d goroutines: %d\n", goroutines, s)
}
Run with goroutines = 1, then 10, 100, 1000, 10000. Plot the result. The line is roughly flat — adding goroutines does not increase throughput, only CPU usage.
Cure 1: Use the dedicated atomic operation¶
atomic.Int64.Add is implemented as a single LOCK XADD on x86 — one round trip to the cache line, no CAS retries. It is the cure for the specific case where the new value is a function only of the old value.
Cure 2: Sharding the counter¶
If you have many goroutines incrementing one counter, give each shard its own counter and sum them on read.
type ShardedCounter struct {
	shards [64]struct {
		v atomic.Int64
		_ [56]byte // pad to a cache line (64 bytes) to avoid false sharing
	}
}

func (c *ShardedCounter) Inc() {
	// goroutineHash is any cheap per-goroutine hash (not shown); it only
	// needs to spread callers across shards.
	shard := goroutineHash() % 64
	c.shards[shard].v.Add(1)
}

func (c *ShardedCounter) Sum() int64 {
	var s int64
	for i := range c.shards {
		s += c.shards[i].v.Load()
	}
	return s
}
Inc becomes O(1) per goroutine with essentially no contention; Sum becomes O(shards) but is cheap relative to the work being counted. Variants of this sharded-counter pattern appear throughout production metrics code.
Cure 3: Mutex with parking¶
Counter-intuitively, switching from a CAS loop to a sync.Mutex can increase throughput under heavy contention. The mutex parks the loser; the loser's CPU is freed. Only the winner runs. With 1000 goroutines and sync.Mutex, you may see throughput double or triple compared to a CAS loop.
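A sketch of the mutex-based counter, assuming the same increment workload (type and method names are ours):

```go
package main

import (
	"fmt"
	"sync"
)

// MutexCounter trades the CAS retry loop for a lock. Under heavy
// contention the runtime parks waiting goroutines instead of letting
// them spin, so losers cost no CPU while they wait.
type MutexCounter struct {
	mu sync.Mutex
	n  int64
}

func (c *MutexCounter) Inc() {
	c.mu.Lock() // losers sleep here; only the winner runs
	c.n++
	c.mu.Unlock()
}

func (c *MutexCounter) Value() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

func main() {
	var c MutexCounter
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.Inc() }()
	}
	wg.Wait()
	fmt.Println(c.Value()) // 1000
}
```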
Back-Off Strategies¶
When a retry will not succeed immediately, you wait before retrying. How you wait determines whether you escape livelock or amplify it.
Constant back-off¶
Pros: simple. Cons: if many goroutines wait the same time, they collide on the same future tick.
Linear back-off¶
Pros: spreads attempts over time. Cons: still symmetric — N goroutines on attempt 2 all wait 20 ms and collide together.
Exponential back-off¶
Pros: spreads attempts further over time. Cons: still symmetric. The classic mistake is "I have exponential back-off, I am safe from livelock." You are not. Without jitter, N goroutines still re-collide.
Exponential with full jitter¶
base := time.Duration(1<<attempt) * time.Millisecond
sleep := time.Duration(rand.Int63n(int64(base)))
time.Sleep(sleep)
The sleep is uniformly random in [0, base). The expected sleep is base/2, but each goroutine's sleep is independent, so the probability of two goroutines picking the same wait is negligible at nanosecond resolution.
Exponential with equal jitter¶
base := time.Duration(1<<attempt) * time.Millisecond
sleep := base/2 + time.Duration(rand.Int63n(int64(base/2)))
time.Sleep(sleep)
Half the back-off is deterministic (the minimum wait), half is random. Used when you want a minimum wait but still randomise above it.
Decorrelated jitter¶
The AWS-recommended pattern:
// Decorrelated jitter — each new delay depends on the previous one.
prev := baseDelay
for attempt := 0; attempt < maxAttempts; attempt++ {
	if try() {
		return
	}
	// sleep uniformly in [baseDelay, min(maxDelay, prev*3))
	upper := prev * 3
	if upper > maxDelay {
		upper = maxDelay
	}
	next := baseDelay
	if upper > baseDelay { // guard: rand.Int63n panics on a zero argument
		next += time.Duration(rand.Int63n(int64(upper - baseDelay)))
	}
	time.Sleep(next)
	prev = next
}
Decorrelated jitter has the smoothest convergence behaviour under retry-storm workloads. It is the gold standard for distributed retry.
Which to choose¶
| Strategy | Where to use |
|---|---|
| Constant | Almost never. Demo code only. |
| Linear | Light contention, predictable workload. |
| Exponential, no jitter | Almost never. |
| Exponential + full jitter | Standard retry. The 80% answer. |
| Exponential + equal jitter | When you want a minimum wait. |
| Decorrelated jitter | Distributed systems with retry storms. |
The Standard Cure: Jitter¶
Why does jitter cure livelock? Because livelock is synchronisation on a wait. Two goroutines that fail at time t and wait identical times retry at the same time t + d. Adding random noise to d desynchronises them. After a few rounds of jittered retries, the probability of collision drops geometrically.
The math: if two goroutines pick uniformly random sleeps in [0, base), the probability they retry within ε of each other is approximately 2ε / base. For base = 100 ms and ε = 100 µs, the probability is 0.2%. After two rounds of independent jitter, the probability is 0.0004%.
How much jitter¶
A rule of thumb: jitter should be at least 50% of the base back-off to be effective. Smaller jitter slows the desynchronisation; larger jitter wastes time.
The most defensible default: full jitter — sleep uniformly random in [0, base). The expected sleep is base/2, but the worst case is base. The average rate of retry is comparable to plain exponential at half the base, and the desynchronisation is excellent.
Seeding math/rand¶
In Go 1.20+, the global math/rand is seeded at startup automatically. In Go 1.19 and earlier, you must seed it yourself:
Even better: use math/rand/v2 (Go 1.22+), which is goroutine-safe by default and properly seeded.
Per-goroutine rand.Source¶
The global math/rand is goroutine-safe but uses a mutex internally. Under extreme contention, that mutex itself can become a hot spot. The cure: each goroutine uses its own rand.Source:
r := rand.New(rand.NewSource(time.Now().UnixNano())) // *rand.Rand; not safe for concurrent use
// then:
sleep := time.Duration(r.Int63n(int64(base)))
This is overkill for normal retry loops but matters when you are retrying millions of times per second.
The cenkalti/backoff Library¶
github.com/cenkalti/backoff/v4 is the de-facto standard back-off library in Go. It provides exponential back-off with jitter, context-aware cancellation, and a clean interface.
Basic use¶
import "github.com/cenkalti/backoff/v4"

operation := func() error {
	return doRiskyThing()
}

b := backoff.NewExponentialBackOff()
b.InitialInterval = 100 * time.Millisecond
b.MaxInterval = 5 * time.Second
b.MaxElapsedTime = 30 * time.Second
b.RandomizationFactor = 0.5 // 50% jitter

err := backoff.Retry(operation, b)
backoff.Retry calls operation until it returns nil or until MaxElapsedTime is exceeded.
With context cancellation¶
ctx, cancel := context.WithTimeout(parent, 10*time.Second)
defer cancel()
err := backoff.Retry(operation, backoff.WithContext(b, ctx))
When ctx is cancelled, backoff.Retry stops retrying immediately and returns ctx.Err().
Permanent vs transient errors¶
Not every error should be retried. A 401 Unauthorized will never succeed by retrying. The library lets you mark some errors as permanent:
operation := func() error {
	err := doRiskyThing()
	if isPermanent(err) {
		return backoff.Permanent(err)
	}
	return err
}
backoff.Permanent short-circuits the retry — the library stops immediately and returns the wrapped error.
Custom back-off strategies¶
The library has a BackOff interface:
You can implement decorrelated jitter or any custom strategy.
Pitfalls of the library¶
- MaxElapsedTime defaults to 15 minutes. In a server context that is forever. Always set it explicitly.
- RandomizationFactor defaults to 0.5 — full jitter is achieved by setting it to 1.0 and manually adjusting. Read the source to confirm semantics.
- The library is not lock-free; each Retry call has a small fixed overhead. For ultra-high-frequency retries (microsecond scale) you may want a hand-rolled loop.
Diagnosing Livelock in a Running Service¶
You suspect livelock. Your CPU is hot. Throughput is bad. What do you do?
Step 1: runtime.NumGoroutine over time¶
If the count is stable (no rising leak) but CPU is hot, livelock is on the table. A rising count points to a goroutine leak; a falling-to-zero count points to deadlock.
Step 2: pprof CPU profile¶
Look for:
- A function that dominates the CPU time.
- A call site that looks like a tight loop (one short function called millions of times).
- Calls into sync/atomic.CompareAndSwap* or sync.(*Mutex).TryLock.
The flame graph will show a wide bar at the bottom for the busy loop and short bars above for the body. Classic livelock signature.
Step 3: goroutine dump¶
Or simpler: send the process SIGQUIT (Ctrl-\ in a terminal); the Go runtime prints every goroutine's stack to stderr.
A livelocked goroutine will show in a runnable or running state with a stack pointing at the retry loop. Many goroutines with similar stacks pointing at the same retry path is highly suggestive.
Step 4: Trace¶
The trace shows scheduler activity. In a livelock, you will see goroutines repeatedly running short bursts and immediately yielding or rescheduling. Lots of tiny stripes is the visual cue.
Step 5: Application-level metric¶
The single most useful tool: a successes_per_second gauge. If it is near zero while CPU is hot, you are livelocked.
Step 6: Reproduce in a benchmark¶
Once you suspect a pattern, reproduce it in a Benchmark test that scales b.N or runs at fixed concurrency. A 20-line benchmark can confirm or refute the hypothesis in seconds.
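A sketch of such a benchmark for the CAS-loop shape (wrapped in main via testing.Benchmark so it runs standalone):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

var counter atomic.Int64

// BenchmarkCASLoop reproduces the contended CAS shape. In a real test
// file, run it with `go test -bench CASLoop -cpu 1,4,16`: ns/op rising
// with -cpu is the contention/livelock signature.
func BenchmarkCASLoop(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			for {
				old := counter.Load()
				if counter.CompareAndSwap(old, old+1) {
					break
				}
			}
		}
	})
}

func main() {
	// testing.Benchmark runs a benchmark outside the test framework.
	fmt.Println(testing.Benchmark(BenchmarkCASLoop))
}
```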
Livelock vs Contention¶
These two are easily confused but distinct.
Contention is the cost of multiple goroutines competing for a resource. A heavily contended mutex slows everyone down, but throughput still scales with the rate at which the critical section can complete. Contention is a performance issue.
Livelock is when contention becomes pathological — the throughput is bounded below what the critical section's intrinsic speed would allow, because most of the cost is in retry/wait overhead.
Rule of thumb:
- If you halve the goroutines and throughput stays roughly the same, you have contention: the critical section is saturated and work is serialising through it.
- If you halve the goroutines and throughput goes up, you have livelock: retry overhead, not useful work, was consuming the capacity.
In practice the line blurs. Many systems sit on the contention/livelock boundary.
Sharding to Reduce Contention¶
The most effective way to eliminate livelock from data-contention scenarios is to not have contention. Sharding splits the resource so most accesses go to different shards.
Sharded counter¶
Shown above. Each shard has its own atomic; goroutines distribute by hash.
Sharded map¶
sync.Map is not sharded internally — it has a different optimisation. For genuine sharding, use a custom structure:
import "hash/fnv"

type ShardedMap struct {
	shards [64]struct {
		mu sync.Mutex
		m  map[string]any
	}
}

func (s *ShardedMap) shardFor(key string) *struct {
	mu sync.Mutex
	m  map[string]any
} {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s.shards[h.Sum32()%64]
}
64 shards means contention per shard is 1/64 of the unsharded case. The number of shards is usually chosen as a power of 2 ≥ GOMAXPROCS * 4.
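Putting it together as a self-contained, runnable sketch with Get and Set added (the named shard type, the method names, and the lazy map initialisation are ours):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

type shard struct {
	mu sync.Mutex
	m  map[string]any
}

type ShardedMap struct {
	shards [64]shard
}

func (s *ShardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s.shards[h.Sum32()%64]
}

// Set locks only the key's shard; writes to different shards proceed
// in parallel with no contention.
func (s *ShardedMap) Set(key string, val any) {
	sh := s.shardFor(key)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	if sh.m == nil {
		sh.m = make(map[string]any)
	}
	sh.m[key] = val
}

func (s *ShardedMap) Get(key string) (any, bool) {
	sh := s.shardFor(key)
	sh.mu.Lock()
	defer sh.mu.Unlock()
	v, ok := sh.m[key]
	return v, ok
}

func main() {
	var m ShardedMap
	m.Set("alice", 1)
	v, ok := m.Get("alice")
	fmt.Println(v, ok) // 1 true
}
```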
When sharding does not help¶
- Cross-shard transactions. If you need atomicity across two keys, sharding makes it harder.
- Skewed access. If 90% of accesses hit the same key, 64 shards do not help — one shard still gets all the contention.
- Read-heavy workloads. Reads do not contend much; sharding writes is the point.
Cache-line considerations¶
Each shard's atomic counter should be on its own cache line (64 bytes on x86, 128 on ARM). Otherwise false sharing — two shards on the same cache line — causes cache-line ping-pong and re-creates the contention you tried to eliminate.
Singleflight and Coalescing¶
golang.org/x/sync/singleflight solves the specific livelock case where many goroutines all try to do the same expensive thing.
import "golang.org/x/sync/singleflight"

var group singleflight.Group

func GetUser(id string) (*User, error) {
	v, err, _ := group.Do(id, func() (any, error) {
		return fetchUser(id) // only one goroutine runs this at a time per id
	})
	if err != nil {
		return nil, err // v may be nil on error; check before asserting
	}
	return v.(*User), nil
}
If 100 goroutines all call GetUser("alice") at the same time, only one runs fetchUser. The other 99 wait for the result.
This is not just an optimisation — it is a livelock cure for the thundering-herd pattern where many goroutines, each retrying on cache miss, all hit the database simultaneously.
When singleflight helps¶
- Cache fill on miss.
- Database connection establishment.
- Expensive computation keyed by input.
- DNS lookup.
When it does not¶
- Per-request unique work.
- Operations that should not be coalesced (e.g., increment a counter — coalescing means losing increments).
- Operations with side effects that must run exactly N times.
The Forget method¶
Singleflight caches the in-flight result. If the underlying value can change, call group.Forget(id) after the work so future calls do not get the stale result.
Patterns That Reliably Avoid Livelock¶
Pattern 1: Use the right primitive¶
| Need | Primitive | Why it avoids livelock |
|---|---|---|
| Counter increment | atomic.AddInt64 | Single instruction, no retry. |
| Set if zero | atomic.CompareAndSwapInt64(&v, 0, x) | Bounded retry by problem definition. |
| Read-then-write | sync.Mutex | Parks losers, no spin. |
| Cache fill | singleflight | Coalesces. |
| Bulk increments | Sharded counter | No contention. |
Pattern 2: Acquire locks in a global order¶
If you must hold multiple locks, define a global order (by address, by ID) and acquire them in that order. Both deadlock and back-off-and-retry livelock disappear.
Pattern 3: Bound retries¶
Never write for { try() }. Always:
for attempt := 0; attempt < maxAttempts; attempt++ {
	if try() {
		return nil
	}
	backoff(attempt)
}
return errBudgetExhausted
A livelock with a retry budget is at worst a slow failure, not a permanent stall.
Pattern 4: Use context cancellation¶
Pair every retry loop with a context:
for {
	if try() {
		return nil
	}
	select {
	case <-time.After(backoff()):
	case <-ctx.Done():
		return ctx.Err()
	}
}
A caller can cancel the retry from outside.
Pattern 5: Server-side back-pressure¶
Servers should reject load they cannot serve. A 429 Too Many Requests is friendlier than letting clients retry forever. Pair with Retry-After so clients know how long to wait.
Pattern 6: Use sync.Map or sharded structures for hot reads¶
sync.Map performs well when keys are mostly read after initial write. For write-heavy workloads, a sharded map[string]any behind sharded mutexes is the standard.
Summary¶
Livelock at the middle level is a recognisable pattern with a small set of standard cures. The most common Go shape is the CAS loop under heavy contention; the standard cure is to either eliminate the contention (sharding, dedicated atomic ops) or to break the symmetry (jitter, priority).
github.com/cenkalti/backoff is the standard library for retry; learn its options and pitfalls. singleflight is the standard library for coalescing duplicate work. Sharding is the standard tool for eliminating contention by design.
Diagnose livelock with pprof CPU profiles, a goroutine dump, and most importantly an application-level success-rate counter. Fix it by understanding which of the four ingredients — actors, resource, loop, symmetry — you can remove or modify.
Move on to senior.md for advanced livelock analysis at the algorithmic and distributed-systems scale.