N-Barrier — Find the Bug¶
Each section presents a broken barrier, asks you to find the bug before reading the analysis, then explains the failure mode and the fix. Run the broken versions with go run -race and GOMAXPROCS=8 to reproduce.
Table of Contents¶
- Bug 1: WaitGroup Used as a Barrier
- Bug 2: Bare
ifAroundcond.Wait() - Bug 3:
SignalInstead ofBroadcast - Bug 4: Reset Count Without a Generation
- Bug 5: Early Return Skips the Barrier
- Bug 6: N Mismatch
- Bug 7: Channel Barrier Captures the Wrong Gate
- Bug 8: Single Barrier on a Buffer Swap
- Bug 9:
cond.Wait()Without Holding the Lock - Bug 10: No Cancellation — Deadlock on Error
- Bug 11: Phase Action Re-Enters the Barrier
- Bug 12: Reading Shared State Without the Barrier's Edge
Bug 1: WaitGroup Used as a Barrier¶
Broken code¶
func runPhases(n int) {
for phase := 0; phase < 3; phase++ {
var wg sync.WaitGroup
for i := 0; i < n; i++ {
wg.Add(1)
go func(id int) {
defer wg.Done()
compute(id, phase) // phase 1 work
refine(id, phase) // BUG: assumes all compute() finished
}(i)
}
wg.Wait()
}
}
Failure mode¶
refine for worker 0 may run while worker 2 is still inside compute. The WaitGroup only tells the loop when all workers finish both steps; it does not synchronise the workers between compute and refine. If refine reads data that other workers' compute writes, it reads a half-written phase.
Fix¶
Put a real barrier between the two steps:
b := NewCyclic(n)
go func(id int) {
defer wg.Done()
compute(id, phase)
b.Wait() // all compute() done before any refine()
refine(id, phase)
}(i)
Bug 2: Bare if Around cond.Wait()¶
Broken code¶
func (b *Barrier) Wait() {
b.mu.Lock()
defer b.mu.Unlock()
b.count++
if b.count == b.n {
b.cond.Broadcast()
return
}
if b.count < b.n { // BUG: if, not for
b.cond.Wait()
}
}
Failure mode¶
cond.Wait() can return spuriously, or be woken by an unrelated broadcast. With a bare if, a waiter that wakes early does not re-check count < n — it falls through and "proceeds" while parties are still missing. The barrier releases too early, non-deterministically.
Fix¶
Always re-check the predicate in a loop. This is the single most repeatedsync.Cond rule. Bug 3: Signal Instead of Broadcast¶
Broken code¶
func (b *Barrier) Wait() {
b.mu.Lock()
defer b.mu.Unlock()
b.count++
if b.count == b.n {
b.cond.Signal() // BUG
return
}
for b.count < b.n {
b.cond.Wait()
}
}
Failure mode¶
Signal() wakes exactly one waiter. The other N-2 sleeping parties stay parked forever. With N=4, the Nth party returns, one waiter wakes, and two are stuck — a deadlock that looks like "most workers finished."
Fix¶
Use Broadcast() — a barrier must wake all waiters:
Signal is for "wake one worker to handle one item"; barriers are never that. Bug 4: Reset Count Without a Generation¶
Broken code¶
func (b *Barrier) Wait() {
b.mu.Lock()
defer b.mu.Unlock()
b.count++
if b.count == b.n {
b.cond.Broadcast()
b.count = 0 // BUG: reset with no generation guard
return
}
for b.count < b.n {
b.cond.Wait()
}
}
Failure mode¶
The "fast goroutine races into the next phase" bug. A just-released fast party loops back into Wait(), does count++ (now 1) before a slow party still inside cond.Wait() wakes and re-checks count < n. The slow party now sees count == 1 < n, keeps sleeping, and the new phase's count is corrupted. Across many phases this deadlocks or releases the wrong set of parties. Hard to reproduce, worse under high GOMAXPROCS.
Fix¶
Use a generation counter; wait on the generation changing, not the count:
gen := b.generation
b.count++
if b.count == b.n {
b.generation++
b.count = 0
b.cond.Broadcast()
return
}
for gen == b.generation {
b.cond.Wait()
}
Bug 5: Early Return Skips the Barrier¶
Broken code¶
func worker(id int, b *Barrier, data []Item) error {
for _, it := range data {
if !valid(it) {
return errInvalid // BUG: returns before b.Wait()
}
process(it)
}
b.Wait()
finalize(id)
return nil
}
Failure mode¶
If worker 3 hits an invalid item and returns, it never calls b.Wait(). The other N-1 workers reach the barrier and block forever waiting for the party that left. The whole cohort deadlocks. In tests with always-valid data it passes; the first bad record in production freezes the pipeline.
Fix¶
Either ensure every path reaches the barrier, or use an abortable barrier:
func worker(ctx context.Context, id int, b *Safe, data []Item) error {
for _, it := range data {
if !valid(it) {
b.Abort() // release peers instead of stranding them
return errInvalid
}
process(it)
}
if err := b.Wait(ctx); err != nil {
return err
}
finalize(id)
return nil
}
Bug 6: N Mismatch¶
Broken code¶
func main() {
const workers = 4
b := NewBarrier(workers + 1) // BUG: barrier expects 5, only 4 arrive
var wg sync.WaitGroup
for i := 0; i < workers; i++ {
wg.Add(1)
go func(id int) { defer wg.Done(); b.Wait() }(i)
}
wg.Wait()
}
Failure mode¶
The barrier waits for 5 arrivals; only 4 goroutines exist. The 4th arrival sees count == 4 < 5 and sleeps. No 5th party ever comes. wg.Wait() never returns — deadlock. (The runtime may report "all goroutines are asleep - deadlock!")
Fix¶
Derive N from the actual party count, in one place:
Better: passworkers to both the loop and the barrier from a single variable so they cannot drift. Bug 7: Channel Barrier Captures the Wrong Gate¶
Broken code¶
func (b *ChanBarrier) Wait() {
b.mu.Lock()
b.count++
if b.count == b.n {
close(b.gate)
b.gate = make(chan struct{})
b.count = 0
b.mu.Unlock()
return
}
b.mu.Unlock()
<-b.gate // BUG: reads b.gate AFTER unlocking, may be the NEW gate
}
Failure mode¶
After unlocking, a waiter reads b.gate fresh. If the Nth party has already closed the old gate and installed a new one in the meantime, this waiter blocks on the new (still-open) gate and misses its release — it waits a full extra generation, or forever if the cohort shrinks. A torn read between "which generation am I in" and "which gate do I wait on."
Fix¶
Capture the gate while holding the lock, then block on the captured value:
Bug 8: Single Barrier on a Buffer Swap¶
Broken code¶
for t := 0; t < ticks; t++ {
computeBand(id, cur, next) // reads cur, writes next
b.Wait()
if id == 0 {
cur, next = next, cur // swap
}
// BUG: no second barrier; next tick's computeBand starts immediately
}
Failure mode¶
After the single barrier, worker 1 may start the next tick's computeBand (reading cur) before worker 0 has finished the swap. Workers read inconsistent buffers — a data race and wrong simulation output. go test -race flags concurrent read/write of the shared buffers.
Fix¶
A second barrier after the swap:
Bug 9: cond.Wait() Without Holding the Lock¶
Broken code¶
func (b *Barrier) Wait() {
b.mu.Lock()
b.count++
last := b.count == b.n
b.mu.Unlock() // BUG: unlock before Wait
if last {
b.cond.Broadcast()
return
}
b.cond.Wait() // panic: must hold the lock
}
Failure mode¶
sync.Cond.Wait() requires the caller to hold the cond's lock; it panics with "sync: unlock of unlocked mutex" / unspecified behaviour otherwise. Even if it didn't panic, there is a lost-wakeup race: between Unlock() and Wait(), the Nth party could broadcast, and this waiter would then sleep forever having missed it.
Fix¶
Hold the lock across the whole critical section, including Wait():
b.mu.Lock()
defer b.mu.Unlock()
b.count++
if b.count == b.n {
b.cond.Broadcast()
return
}
for b.count < b.n {
b.cond.Wait() // releases & re-acquires the lock atomically
}
Bug 10: No Cancellation — Deadlock on Error¶
Broken code¶
func phase(ctx context.Context, b *Barrier, fn func() error) error {
if err := fn(); err != nil {
return err // BUG: peers still waiting at b.Wait()
}
b.Wait()
return nil
}
Failure mode¶
A party whose fn() errors returns without arriving. Peers block at b.Wait() forever. There is also no way for an external ctx cancellation to release parked parties — a cancelled job leaves goroutines (and whatever they hold) stranded.
Fix¶
Use an abortable, context-aware barrier:
func phase(ctx context.Context, b *Safe, fn func() error) error {
if err := fn(); err != nil {
b.Abort()
return err
}
return b.Wait(ctx) // returns ErrBroken on abort or ctx cancel
}
Bug 11: Phase Action Re-Enters the Barrier¶
Broken code¶
b := NewActionBarrier(n, func() {
log.Println("phase complete")
b.Wait() // BUG: re-enters Wait() while the last party holds the lock
})
Failure mode¶
The phase action runs inside the last party's Wait() while holding the mutex. Calling b.Wait() again tries to re-lock the same (non-reentrant) sync.Mutex — instant self-deadlock. Any blocking call in the action (a channel send, a network call, a lock) similarly stalls every party, since none can be released until the action returns.
Fix¶
Keep the phase action short, non-blocking, and never re-entrant:
Do work that must block outside the barrier (afterWait() returns), gated to one party. Bug 12: Reading Shared State Without the Barrier's Edge¶
Broken code¶
var ready bool // plain bool, no synchronisation
// last party
func setup(b *Barrier) {
ready = true // BUG: written outside the lock that establishes the edge
b.Wait()
}
// other parties
func use(b *Barrier) {
b.Wait()
if ready { /* ... */ } // may not observe the write on ARM
}
Failure mode¶
The write to ready happens before b.Wait(), but ready is a plain variable touched outside any shared lock. While the barrier provides a happens-before edge for memory ordered through its own mutex, a naive reading might assume all prior writes are visible. They are — the barrier's unlock/lock chain does order the write before the post-barrier read — but only if the write truly happens-before the arriving party's Wait() call and the data is not concurrently mutated. The real bug here is ready being written by one party while another party might still be mutating related state, with no agreement on who owns it. On weak memory models this surfaces as a torn/stale read; under -race it is reported as a data race if any concurrent write exists.
Fix¶
Make ownership explicit: write the shared state under the same lock the barrier uses (e.g., in a phase action) or before the writer's Wait() with no concurrent writers, and read it only after Wait() returns: