Condition Variables — Senior Level¶
Topic: Condition Variables
Focus: futex internals, performance, alternatives, hand-coded waitsets
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Core Concepts
- Real-World Analogies
- Mental Models
- Code Examples
- Pros & Cons
- Use Cases
- Coding Patterns
- Clean Code
- Best Practices
- Edge Cases & Pitfalls
- Common Mistakes
- Tricky Points
- Test Yourself
- Tricky Questions
- Cheat Sheet
- Summary
- What You Can Build
- Further Reading
- Related Topics
- Diagrams & Visual Aids
Introduction¶
At the junior level, a condition variable is a black box: you call wait, the thread sleeps; you call signal, a thread wakes. At the middle level, you understand spurious wakeups, predicates, and the canonical loop. At the senior level, you stop seeing the condition variable as a primitive at all. You see a thin wrapper over a kernel wait facility (the futex on Linux, address-based waits such as WaitOnAddress on Windows, the psynch calls on Darwin) and you start asking different questions: How many cache lines does a broadcast touch? Does my workload actually benefit from waiting, or am I paying for a syscall I could amortize? Would a bounded queue serve me better? Could I use LockSupport.park and a custom waitset?
This document is the engineer's tour. We open up glibc's pthread_cond_t, follow the bytes through futex(2), walk through the requeue trick that makes broadcast not catastrophic, profile a contended broadcast storm, compare a hand-rolled producer/consumer against LinkedBlockingQueue, and explain why idiomatic Go code almost never uses sync.Cond. By the end you will know when to reach for a condvar, when to reach for a channel, and when to reach for nothing at all.
The senior posture is not "more condition variables". It is fewer, better-placed condition variables, and a clear-eyed picture of what they cost.
Prerequisites¶
Before reading this senior treatment, you should be solid on:
- Junior: What a condvar does and the while (!predicate) wait() loop.
- Middle: Spurious wakeups, signal vs broadcast, monitor coupling with a mutex, the lost-wakeup problem.
- Memory model: Acquire/release semantics, happens-before, why the mutex around a condvar provides the publication barrier.
- Linux primitives: the futex(2) syscall, wait/wake/requeue operations, the idea that the kernel uses the address of an int as a key.
- Profiling: perf, strace, contention counters, off-CPU profiling with eBPF or perf sched.
- POSIX and Java APIs: pthread_cond_t, java.util.concurrent.locks.Condition, LockSupport.park/unpark.
If any of those feels shaky, return to the middle-level page and the mutex internals page first. The material here will reference them without re-deriving.
Glossary¶
- Futex: Fast Userspace muTEX. A Linux syscall that lets userspace manage a wait queue keyed by a memory address; the kernel only gets involved on contention.
- FUTEX_WAIT: Atomically check *addr == expected and sleep if true.
- FUTEX_WAKE: Wake up to N waiters parked on addr.
- FUTEX_CMP_REQUEUE: If *addr1 still holds the expected value, wake up to a given number of waiters on addr1 (typically one) and move up to a given number of the rest from addr1's queue to addr2's queue without waking them.
- Wait morphing / requeue trick: Use FUTEX_CMP_REQUEUE so a broadcast moves waiters from the condvar's queue to the mutex's queue, preventing the thundering herd on the mutex.
- Thundering herd / mutex stampede: Many threads wake simultaneously, all try to acquire the same mutex, and most go back to sleep, wasting context switches.
- Group counters: __g1_orig_size and friends, glibc's internal per-group bookkeeping that orders signals against waits.
- Generation counter: A monotonically increasing integer used to distinguish "the signal that woke me" from "a stale signal from before I started waiting".
- Cache line ping-pong: Multiple CPUs writing to the same cache line, causing constant cache coherency traffic.
- Park / unpark: Java's low-level "block this thread until told otherwise" primitive, exposed via java.util.concurrent.locks.LockSupport.
- Waitset: The data structure (often a linked list) that tracks blocked threads; it is what a condvar wraps.
- LBQ: java.util.concurrent.LinkedBlockingQueue, a battle-tested BlockingQueue that is optionally bounded.
Core Concepts¶
1. Futex-backed condition variable implementation (glibc, musl)¶
A modern pthread_cond_t on Linux is not a kernel object. It is a small struct of integers in user memory plus a few futex syscalls. The kernel knows nothing about "condition variables" — only about wait queues keyed by an address.
The simplified layout (real glibc has more fields for fairness and shared mappings):
struct __pthread_cond_s {
    uint64_t __wseq;               // wait sequence: advanced on each wait
    uint64_t __g1_start;           // start of group 1 (the one being signaled)
    unsigned int __g_refs[2];      // per-group reference counts of waiters using the group
    unsigned int __g_size[2];      // remaining waiters in each group
    unsigned int __g1_orig_size;   // initial size of G1
    unsigned int __wrefs;          // total waiters + flag bits
    unsigned int __g_signals[2];   // futex words waiters sleep on: available signals per group
};
The trick that makes glibc's condvar correct under broadcast and signal is dual groups. Waiters arriving "now" join G2. When a signal or broadcast happens, the implementation closes G1 (drains it) and rotates G2 into G1. This means a signal cannot accidentally wake a thread that called wait after the signal returned — the generation counter keeps them in G2.
Why this matters: Earlier glibc versions had subtle ordering bugs under broadcast that were not fixed until 2016 (bug 13165). The "new condvar" landed in glibc 2.25. If you read pre-2.25 documentation you will see a different algorithm. The takeaway: condvars are hard even for the people who write libc.
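musl takes a different path: for process-private condvars it keeps an explicit list of waiter nodes, each with its own futex word, and it falls back to a single sequence counter only for process-shared condvars. Both libraries are correct; the trade-off is code complexity against throughput under heavy broadcast.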
2. The futex_requeue trick used by pthread_cond_broadcast¶
A naive broadcast would wake every waiter with FUTEX_WAKE. All of them would race for the mutex. N-1 would lose, go back to sleep, and the kernel would have done N context switches to make 1 thread useful. This is the mutex stampede.
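The fix: FUTEX_CMP_REQUEUE. The kernel wakes one waiter from the condvar's queue and moves the rest to the mutex's wait queue without waking them. As the first thread releases the mutex, the next one (already queued on the mutex) is woken normally: one wake per useful unit of work. Note that glibc's pthread_cond_broadcast used this technique before the 2.25 rewrite; the group-based implementation described above wakes its waiters with a plain FUTEX_WAKE, so check which libc version you are profiling before assuming requeue is in play.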
broadcast()
|
v
FUTEX_CMP_REQUEUE(cond_addr -> mutex_addr, wake=1, requeue=INT_MAX)
|
+-- wake 1 thread on cond_addr (it will try to lock mutex)
+-- move the rest from cond_addr's queue to mutex_addr's queue
(they stay parked, just on a different key)
This is one of the most beautiful pieces of systems engineering in modern kernels. It turns an O(N) wake into an O(1) wake plus an O(N) metadata move — and the metadata move never enters userspace.
3. Why broadcast on a single condvar can stampede the mutex¶
Even with requeue, broadcast is not free:
- Every requeued thread will, eventually, fight for the mutex.
- If the predicate is only satisfiable for one of them, N-1 will acquire, check the predicate, fail, and wait again.
- That means N lock/unlock cycles and N condvar reinserts.
Rule: Use broadcast only when the predicate change can satisfy multiple waiters. For "one slot freed in a bounded queue", use signal. For "the producer is done and everyone should observe EOF", use broadcast.
A surprisingly common bug: a thread pool with one shared condvar, broadcast on every task submission. With 64 worker threads, every submit causes 64 wakeups, 64 lock acquisitions, 63 failed predicate checks, 63 re-waits. Throughput collapses past 16 cores.
4. When to use channels/queues instead of condvars (Go, Rust)¶
In Go, the idiomatic concurrency primitive is the channel. A buffered channel with <- is operationally equivalent to a bounded queue with internal condition variables, but with a syntax that makes lost-wakeup bugs nearly impossible to write.
// Idiomatic Go: no condvar in sight.
ch := make(chan Job, 64)
// producer
ch <- job
// consumer
for job := range ch {
handle(job)
}
Go does ship sync.Cond, but the standard library team has publicly said it is rarely the right tool, and the Go runtime team has at times considered deprecating it. The reasons:
- It is easy to misuse (lost-wakeup, missing the for loop).
- Goroutines are cheap, so "block one goroutine per condition" via a channel is fine.
- The runtime's scheduler is aware of channels and can do tricks (direct hand-off) that sync.Cond cannot.
In Rust, std::sync::mpsc and crossbeam-channel cover most cases. std::sync::Condvar exists but the type system makes you carry the predicate through Mutex<T>, which already nudges you toward queues.
The senior heuristic: if the predicate is "there is an item in a queue", use a queue. Reach for a condvar only when the predicate is genuinely arbitrary ("the global config version has advanced", "the checkpoint coordinator has reached state X").
5. The cost: cache line ping-pong of the condvar's internal counter¶
Every signal, broadcast, and wait writes to the condvar's sequence counters. These counters live in one or two cache lines. With many CPUs operating on the same condvar, those lines bounce between caches constantly.
Profile with perf c2c (cache-to-cache):
$ perf c2c record ./my_program
$ perf c2c report
Shared Cacheline Distribution Pareto Table
=============================================
HITM Total Source Line
87% 42k pthread_cond_signal nptl/pthread_cond_signal.c:74
87% of cache-line hits in HITM state (modified-elsewhere) on the condvar's sequence field is a smoking gun. Solutions:
- Shard the condvar: one condvar per shard of work, reducing contention.
- Use a queue: the queue's slot pointers naturally distribute writes.
- Switch to LockSupport.park/unpark (Java) and a per-thread parking flag, so each waiter touches a different cache line.
6. Java's LockSupport.park/unpark as a lower-level primitive¶
java.util.concurrent.locks.LockSupport is the closest thing in the JVM to a raw futex. It exposes:
LockSupport.park(); // block this thread until unparked
LockSupport.parkNanos(nanos); // block with timeout
LockSupport.unpark(thread); // wake a specific thread
Critical properties:
- unpark is sticky: if unpark(t) is called before t calls park, the next park returns immediately. This eliminates the lost-wakeup window that plagues raw condvars. Permits do not accumulate: several unpark calls before a park still let only one park through.
- No mutex required: park/unpark operates on the thread itself, not a shared address. You build your own waitset.
- Per-thread: unpark targets a specific thread, not "anyone waiting on this object". That means O(1) targeted wakeups.
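Every blocking queue in java.util.concurrent ultimately parks its waiters through park/unpark, never through Object.wait(). SynchronousQueue and LinkedTransferQueue call LockSupport directly. LinkedBlockingQueue and ArrayBlockingQueue use ReentrantLock plus Condition, and Condition.await() is itself implemented on top of park/unpark in AbstractQueuedSynchronizer.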
7. Hand-coded waitsets in lock-free libraries¶
When you write a high-performance queue (e.g. LinkedTransferQueue, Disruptor, JCTools queues), you do not use a condvar. You build a waitset explicitly:
1. Producer publishes an item (lock-free CAS or write).
2. If a consumer has registered itself as waiting for that item, the producer calls LockSupport.unpark(consumer.thread).
3. Consumer enqueues itself on a waitset node, publishes "I am parked", re-checks the predicate (memory-fence dance), then calls LockSupport.park().
The advantages over a condvar:
- No mutex held during the wait — producers can publish concurrently.
- O(1) targeted wakeup — no "wake all and let one win".
- Cache-friendly — each waiter touches its own node, not a shared counter.
The disadvantages:
- You must hand-roll the publication memory fence and the re-check.
- Subtle ABA bugs around node reuse.
- Hard to review; only reach for this in proven hot paths.
8. Building a producer/consumer with LinkedBlockingQueue (Java) — when stdlib beats hand-rolled condvar¶
The naive Java implementation of a bounded queue uses a ReentrantLock and two Conditions — notFull and notEmpty. It works. It is correct. It is also slower than LinkedBlockingQueue because:
- LBQ uses two separate locks — one at the head, one at the tail. Producers and consumers do not contend.
- LBQ uses atomic size counters, so size checks can skip the lock in the fast path.
- LBQ is battle-tested under every JVM bug, every OS, every generation of hardware. Your hand-rolled version is not.
The senior decision is rarely "should I write a condvar pattern". It is "should I take the LBQ in my standard library, or do I have a measured reason to write something else". The measured reason had better be benchmarks, not vibes.
9. Profiling condvar contention¶
Tools that reveal condvar pain:
- perf lock (Linux): shows lock and condvar contention statistics.
- perf sched + perf timechart: visualizes off-CPU time per thread; long off-CPU regions on a condvar are visible immediately.
- bpftrace / bcc: trace futex syscalls with stack traces.
- JVM: async-profiler in wall-clock mode shows time spent in Object.wait / Condition.await per call site.
- pidstat -w -t: voluntary context switches per thread; a thread with 100k+ voluntary switches per second is almost certainly bouncing on a condvar.
Look for the signatures:
| Signature | Likely cause |
|---|---|
| High futex syscalls, low CPU | Broadcast storm |
| One waiter, low futex syscalls | Healthy or no work |
| HITM on cond struct cacheline | Cache ping-pong, shard or queue |
| Long off-CPU on pthread_cond_* | Predicate transitions are rare |
| Many threads woken, one works | Use signal instead of broadcast |
10. Real-world bugs: lost wakeup, missed predicate transition¶
Two recurring production incidents:
Lost wakeup (already covered at middle level, but seen here in shipped code):
// BUG: signal happens between predicate check and wait.
if (queue.empty()) {
pthread_cond_wait(&cv, &mu);
}
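If the check runs without the mutex held, the producer can signal between empty() and wait(), and the consumer sleeps forever. Fix: hold the mutex across both the check and the wait, and use while instead of if so that a spurious or stolen wakeup re-checks the predicate.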
Missed predicate transition: the predicate flips from false to true and back to false between signal and wakeup.
// Producer
pthread_mutex_lock(&mu);
queue.push(item);
pthread_cond_signal(&cv);
pthread_mutex_unlock(&mu);
// Meanwhile a second consumer drains the queue.
// Original consumer wakes, sees empty queue, goes back to wait.
// No bug — but a profiler will show high "useless wakeup" rate.
This is not incorrect, but at scale it is a waste. A bounded queue with a single consumer would avoid it.
11. Why Go does not bless sync.Cond in idiomatic code¶
The Go documentation for sync.Cond says, in paraphrase, that for many simple use cases a channel is the better choice. Rob Pike's mantra "Do not communicate by sharing memory; instead, share memory by communicating" is the philosophical version. The mechanical reasons:
- sync.Cond does not integrate with the Go scheduler's hand-off optimizations.
- The for !predicate { c.Wait() } pattern is easy to get wrong; a channel receive blocks until a value is actually available, so there is no loop to forget.
- Cancellation via context.Context is awkward: you need an extra goroutine to call c.Broadcast() on cancel.
When Go code does need cond-like behaviour, the idiom is a channel of struct{}, broadcast by closing it (the closed-channel broadcast pattern under Coding Patterns).
This pattern handles single-shot broadcast cleanly. For repeated broadcasts, generate a new channel each time and publish it via a sync/atomic.Value.
Real-World Analogies¶
- Air traffic control vs. radio broadcast. A condvar broadcast is the tower yelling "everyone check the runway". Every pilot wakes up, most check, see they are not cleared to land, and go back to a holding pattern. The futex requeue trick is the tower instead saying "you, plane 7, land now; everyone else, queue at the next ATC channel" — one wakeup, the rest stay parked but on the right queue.
- Email vs. office PA. A targeted LockSupport.unpark(thread) is email. A broadcast is the office PA system. Use email by default.
- Restaurant order ticker vs. shouting "order up". A blocking queue is the order ticker: clean, ordered, one consumer pulls one ticket. Condvars with broadcast are the line cook shouting "order up" and every server racing to claim it.
- Generation counters as version stamps. The __wseq field is a document version number. A waiter records "I started waiting at version 7", a signaler bumps to version 8. The version comparison makes "did the event I am waiting for already happen" decidable.
Mental Models¶
Model 1: Condvar = wait address + generation counter. Everything else (broadcast, signal, requeue) is bookkeeping to manage one address and one counter correctly.
Model 2: Mutex stampede is the default; requeue is the fix. Always ask "how many threads wake when I signal, and how many can actually do work?" If those numbers do not match, you have inefficiency.
Model 3: Wait-set granularity is the design knob. One condvar per million tasks is bad. One condvar per worker is fine. The unit of parking should be the unit of "this exact thread has work".
Model 4: Queues and channels are condvars with the predicate baked in. "Is the queue non-empty" is such a common predicate that languages ship it as a primitive. Use the primitive.
Model 5: Park/unpark is the assembly language; condvar is the high-level API. Park is more flexible but easier to corrupt. Build the high-level API only when the standard one does not exist or has measured overhead.
Code Examples¶
Example 1: glibc cond var implementation walkthrough (annotated)¶
This is a simplified version of pthread_cond_wait from glibc 2.34, focusing on the structure. It is not meant to compile — read it as a specification.
int __pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex) {
// 1. Increment wait sequence and join group G2.
uint64_t wseq = atomic_fetch_add(&cond->__wseq, 2);
unsigned int g = wseq & 1; // group index (0 or 1)
uint64_t seq = wseq >> 1;
atomic_fetch_add(&cond->__g_size[g], 1);
// 2. Bump waiter refcount so destroyers wait for us.
atomic_fetch_add(&cond->__wrefs, 8); // +1 in upper bits
// 3. Release the mutex (publish that we are about to wait).
pthread_mutex_unlock(mutex);
// 4. Wait on our group's futex address.
while (true) {
unsigned int signals = atomic_load(&cond->__g_signals[g]);
if (signals & 1) break; // group closed -> wakeup
if (signals >= 2) { // signal available
if (atomic_compare_exchange(&cond->__g_signals[g],
signals, signals - 2)) {
break; // consumed a signal
}
continue; // retry
}
// No signal yet. Sleep on the futex.
futex_wait(&cond->__g_signals[g], signals);
}
// 5. Decrement group size; if last, allow group rotation.
if (atomic_fetch_sub(&cond->__g_size[g], 1) == 1) {
// last in this group; signaler may now rotate G2 -> G1
futex_wake(&cond->__g_refs[g], INT_MAX);
}
// 6. Drop waiter refcount.
atomic_fetch_sub(&cond->__wrefs, 8);
// 7. Reacquire the mutex.
pthread_mutex_lock(mutex);
return 0;
}
Key takeaways:
- The condvar maintains two groups for ordering correctness.
- __g_signals[g] is the futex address. Its lowest bit means "group closed"; higher bits count available signals.
- The waiter does not own the mutex during futex_wait; that is the publication window the signaler needs.
- Generation correctness comes from wseq. A late signal cannot wake a future waiter because the future waiter is in a different group.
Example 2: Profiling a broadcast storm¶
A worker pool with a single shared cond var. We will profile it and fix it.
// broadcast_storm.c — compile: gcc -O2 -pthread broadcast_storm.c -o storm
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>
#define WORKERS 64
#define TASKS 200000
typedef struct {
int *buf;
int head, tail, count, cap;
pthread_mutex_t mu;
pthread_cond_t cv; // ONE cond var for all workers — bad
int shutdown;
} pool_t;
static atomic_long g_done;
void pool_init(pool_t *p, int cap) {
p->buf = calloc(cap, sizeof(int));
p->head = p->tail = p->count = 0;
p->cap = cap;
pthread_mutex_init(&p->mu, NULL);
pthread_cond_init(&p->cv, NULL);
p->shutdown = 0;
}
void submit(pool_t *p, int task) {
pthread_mutex_lock(&p->mu);
while (p->count == p->cap) pthread_cond_wait(&p->cv, &p->mu);
p->buf[p->tail] = task;
p->tail = (p->tail + 1) % p->cap;
p->count++;
pthread_cond_broadcast(&p->cv); // BUG: should be signal
pthread_mutex_unlock(&p->mu);
}
int take(pool_t *p) {
pthread_mutex_lock(&p->mu);
while (p->count == 0 && !p->shutdown)
pthread_cond_wait(&p->cv, &p->mu);
if (p->count == 0) { pthread_mutex_unlock(&p->mu); return -1; }
int v = p->buf[p->head];
p->head = (p->head + 1) % p->cap;
p->count--;
pthread_cond_broadcast(&p->cv); // BUG: should be signal
pthread_mutex_unlock(&p->mu);
return v;
}
void *worker(void *arg) {
pool_t *p = arg;
int t;
while ((t = take(p)) >= 0) {
// simulate work
atomic_fetch_add(&g_done, 1);
}
return NULL;
}
int main(void) {
pool_t p; pool_init(&p, 256);
pthread_t threads[WORKERS];
for (int i = 0; i < WORKERS; i++)
pthread_create(&threads[i], NULL, worker, &p);
for (int i = 0; i < TASKS; i++) submit(&p, i);
pthread_mutex_lock(&p.mu);
p.shutdown = 1;
pthread_cond_broadcast(&p.cv);
pthread_mutex_unlock(&p.mu);
for (int i = 0; i < WORKERS; i++) pthread_join(threads[i], NULL);
printf("done=%ld\n", atomic_load(&g_done));
return 0;
}
Profile:
$ perf stat -e context-switches,cpu-migrations ./storm
[...]
1,234,567,890 context-switches
6,543,210 migrations
# Way more context switches than tasks. Broadcast storm confirmed.
$ perf record -g ./storm && perf report
# Top function: __pthread_cond_broadcast (45%), futex_wait_setup (22%).
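Fix: change both broadcast calls in submit and take to signal, and keep the broadcast on shutdown. In this program (one producer, capacity above the worker count) that is safe; with several producers, split the condvar into not_full and not_empty so a signal can never wake the wrong kind of waiter. Re-measure with the same perf stat command.
A 1000x reduction in context switches by changing one word.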
Example 3: LinkedBlockingQueue vs hand-rolled condvar benchmark (Java, JMH)¶
// build with JMH: org.openjdk.jmh
import java.util.concurrent.*;
import java.util.concurrent.locks.*;
import org.openjdk.jmh.annotations.*;
@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
public class QueueBench {
static class HandRolled<T> {
private final Object[] buf;
private int head, tail, count;
private final ReentrantLock lock = new ReentrantLock();
private final Condition notFull = lock.newCondition();
private final Condition notEmpty = lock.newCondition();
HandRolled(int cap) { buf = new Object[cap]; }
void put(T x) throws InterruptedException {
lock.lock();
try {
while (count == buf.length) notFull.await();
buf[tail] = x; tail = (tail + 1) % buf.length; count++;
notEmpty.signal();
} finally { lock.unlock(); }
}
@SuppressWarnings("unchecked")
T take() throws InterruptedException {
lock.lock();
try {
while (count == 0) notEmpty.await();
Object x = buf[head];
head = (head + 1) % buf.length; count--;
notFull.signal();
return (T) x;
} finally { lock.unlock(); }
}
}
HandRolled<Integer> hand;
LinkedBlockingQueue<Integer> lbq;
@Setup
public void setup() {
hand = new HandRolled<>(1024);
lbq = new LinkedBlockingQueue<>(1024);
}
@Benchmark
@GroupThreads(8)   // threads running this method within the "hand" group
@Group("hand")
public void handPut() throws Exception { hand.put(1); }
@Benchmark
@GroupThreads(8)
@Group("hand")
public Integer handTake() throws Exception { return hand.take(); }
@Benchmark
@GroupThreads(8)   // threads running this method within the "lbq" group
@Group("lbq")
public void lbqPut() throws Exception { lbq.put(1); }
@Benchmark
@GroupThreads(8)
@Group("lbq")
public Integer lbqTake() throws Exception { return lbq.take(); }
}
Typical results on a 16-core machine (your numbers will vary):
Benchmark Mode Cnt Score Error Units
QueueBench.hand:put thrpt 5 3.2M ops/s
QueueBench.hand:take thrpt 5 3.2M ops/s
QueueBench.lbq:put thrpt 5 5.8M ops/s
QueueBench.lbq:take thrpt 5 5.8M ops/s
LBQ wins by roughly 80% because it splits the head/tail locks. The hand-rolled version contends on a single lock; producers and consumers serialize. Even with a "correct" condvar implementation, the architectural choice (single lock vs two) dominates.
Example 4: Java LockSupport.park/unpark¶
A minimal one-shot event that does not use Object.wait or Condition:
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;
public final class OneShot {
private static final Object DONE = new Object();
private final AtomicReference<Object> state = new AtomicReference<>();
/** Block until {@link #fire} is called. Safe to call before fire. */
public void await() {
Thread me = Thread.currentThread();
// Try to register self.
while (true) {
Object s = state.get();
if (s == DONE) return; // already fired
if (state.compareAndSet(null, me)) break;
// Another thread is parked already? OneShot is single-waiter
// by design; throw if misused.
if (s instanceof Thread)
throw new IllegalStateException("multi-waiter");
}
// Park loop. park() may return spuriously; re-check state.
while (state.get() != DONE) {
LockSupport.park(this);
if (Thread.interrupted())
throw new RuntimeException("interrupted");
}
}
public void fire() {
Object prev = state.getAndSet(DONE);
if (prev instanceof Thread t) LockSupport.unpark(t);
// unpark is sticky: even if t calls park later, it returns
// immediately. No mutex, no condvar.
}
}
Try:
OneShot s = new OneShot();
Thread t = new Thread(() -> { s.await(); System.out.println("fired!"); });
t.start();
Thread.sleep(100);
s.fire();
t.join();
This is roughly 5-10x faster than the equivalent Object.wait / Object.notify pattern because there is no monitor lock acquisition. It is also harder to reason about — note the spurious-park check, the interruption handling, the misuse guard. The senior lesson: drop down to park/unpark only when you have measured a real bottleneck.
Pros & Cons¶
Pros (when used correctly):
- Universal: every POSIX, every JVM, every modern language.
- Cheap on the uncontended fast path (futex stays in userspace).
- Composes naturally with mutexes.
- Generation counters give strong ordering guarantees.
Cons (when used carelessly):
- Broadcast can stampede the mutex without FUTEX_CMP_REQUEUE.
- Cache line ping-pong on the internal counter under heavy contention.
- Lost-wakeup bugs and missed-transition wakeups are easy to write.
- The wait/signal API does not compose with cancellation cleanly.
- signal does not say which waiter wakes: you cannot do targeted wakeups without dropping to park/unpark.
Use Cases¶
Use a condition variable when:
- Arbitrary predicate — "config version > X", "checkpoint phase reached", "all subordinate workers done". A queue does not model this directly.
- Coarse-grained coordination — a few threads, infrequent signals, correctness over throughput.
- You are implementing a higher-level primitive — a fork/join pool, a custom BlockingQueue, a graph executor.
- Predicate is multi-dimensional — "queue is non-empty AND we are not paused AND budget remains".
Prefer a queue or channel when:
- The predicate is "an item is available."
- You are in Go. Use channels unless you have a measured reason not to.
- You need cancellation via context.Context or CancellationToken.
- You can tolerate the queue's memory overhead.
Prefer park/unpark when:
- You are building a lock-free data structure with rare blocking.
- You need targeted wakeups (specific thread, not "any waiter").
- You have profiled the condvar's cache pressure and it is a top hit.
Coding Patterns¶
Pattern: signal vs broadcast decision table.
| Predicate change | Use |
|---|---|
| One unit of work added | signal |
| All work done; shutdown announced | broadcast |
| State transition that matters to multiple kinds of waiter | broadcast |
| Queue slot freed | signal |
| Configuration version bumped | broadcast |
Pattern: sharded condvars.
Instead of one condvar for N workers, give each worker its own condition / waitset. Producer chooses which to wake (round-robin or work-stealing). This eliminates broadcast storms structurally.
Pattern: closed-channel broadcast (Go).
type Event struct{ ch chan struct{} }
func New() *Event { return &Event{ch: make(chan struct{})} }
func (e *Event) Wait() { <-e.ch }
func (e *Event) Trigger() { close(e.ch) }
Pattern: park-based waitset.
Each blocked thread enqueues a node containing its Thread reference. Producer pops a node, calls LockSupport.unpark(node.thread). No mutex, O(1) wakeup. Used by LinkedTransferQueue.
Pattern: deadline + monotonic clock.
Always use the OS monotonic clock for cond var timeouts. Wall-clock time can jump (NTP corrections, manual changes): a backward jump stretches the wait far past the intended deadline, and a forward jump fires it early.
Clean Code¶
- Name your condvars after the predicate. Not cv1, but not_empty, slot_available, config_updated.
- Comment the predicate inline. The while loop should reference the predicate by name.
- Encapsulate. Never expose a raw pthread_cond_t or Condition to callers; wrap it in a BlockingQueue or Event that enforces the protocol.
- Pair every condvar with exactly one mutex. Document this in the type that owns them.
- Document the signal/broadcast choice. A code comment "signal because only one waiter can proceed" prevents the next maintainer from "fixing" it to broadcast.
Best Practices¶
- Default to a queue. Reach for a condvar only when the predicate does not map cleanly to "items available".
- Use signal by default; broadcast only when justified.
- Use the standard library's queue. Hand-rolled is almost never faster after benchmarking honestly.
- Profile under load before declaring victory. The cost of a wrong choice scales with thread count.
- Beware static condvars. A pthread_cond_t at file scope shared across modules is an anti-pattern; ownership is unclear.
- Monotonic clock for timeouts. Always.
- Test the lost-wakeup race with stress tests, not unit tests. Run for hours, not seconds.
- Document the invariants the condvar protects, not just its syntax.
Edge Cases & Pitfalls¶
- Destroying a condvar with waiters. UB on POSIX; glibc protects this via __wrefs but you should not rely on it.
- Signaling without holding the mutex. Allowed by POSIX, but POSIX only promises predictable scheduling when the signaler holds the mutex, and it prevents the requeue optimization. It becomes a real missed wakeup if the predicate is also changed outside the mutex.
- pthread_cond_timedwait with CLOCK_REALTIME. This is the POSIX default; explicitly select the monotonic clock with pthread_condattr_setclock(&attr, CLOCK_MONOTONIC).
- Mutex held by a different thread when signaling. Allowed, but defeats the requeue optimization: the kernel cannot move waiters to a mutex queue if the lock is not held.
- Spurious wakeups in CI. Real on all real systems. Run your tests on real hardware, not just in a single-threaded simulator.
- Java Object.notify vs notifyAll. Same trade-off as signal/broadcast but without requeue; Java pre-6 had a thundering herd on every notifyAll.
- Re-entry through condvar wait. If your wait predicate calls into user-supplied callbacks, those callbacks might re-enter your monitor and deadlock.
Common Mistakes¶
- Using if instead of while around the wait. Even after a decade of warnings, this still appears in PRs.
- Broadcast for "wake one" because "it's safer". It is correct but 1000x slower.
- Multiple condvars on one mutex without a coordination contract. Allowed but easy to mis-signal.
- Calling signal outside the mutex "for speed". Defeats the requeue optimization and risks lost wakeups.
- Treating condvar as a counter. signal does not stack; calling it 5 times with no waiters does not wake the next 5 waiters.
- Forgetting to also signal on cancellation/shutdown. Threads waiting forever.
- Re-using a condvar after destroying it. Some implementations allow this; none promise it.
- Sharing a condvar across processes without pthread_condattr_setpshared and a shared memory mapping.
Tricky Points¶
- Requeue requires the mutex address to be a futex word. On platforms where the mutex is more elaborate (e.g. PI mutexes for realtime), requeue may be disabled. Profile accordingly.
- Generation counter rollover. __wseq is 64-bit and will not wrap in any practical lifetime, but on 32-bit platforms with a 32-bit wseq, you can theoretically wrap. Glibc handles this; your hand-rolled imitation may not.
- futex(2) vs futex_waitv(2). The newer futex_waitv lets you wait on multiple addresses; condvars do not use it yet, but some custom waitset libraries do.
- pthread_cond_signal may wake more than one. Permitted by POSIX ("at least one"). Most implementations wake exactly one, but your code must not rely on it.
- Memory model. The mutex release on the signaler's side and the acquire on the waiter's side provide the happens-before. Changing the predicate without the mutex loses that edge, which shows up first on weakly ordered architectures (ARM, Power).
Test Yourself¶
- Why does `pthread_cond_broadcast` use `FUTEX_CMP_REQUEUE` instead of `FUTEX_WAKE` with a large count?
- What is the dual-group design in glibc's condvar and what problem does it solve?
- Why is `LockSupport.unpark` sticky, and how does this help avoid lost wakeups?
- Under what conditions does `pthread_cond_signal` wake more than one thread?
- Why does Go's standard library de-emphasize `sync.Cond`?
- How would you measure that your condvar is causing cache-line ping-pong?
- When is broadcast cheaper than N signals?
- Why do `LinkedBlockingQueue` and `ReentrantLock`+`Condition` benchmark differently for producer/consumer workloads?
- What is the difference between `FUTEX_WAKE` and `FUTEX_WAKE_OP`?
- What does it mean to "build a waitset by hand", and when is it worth it?
Tricky Questions¶
- A broadcast wakes 64 threads, all check the predicate, 63 fail and re-wait. Throughput is poor. Your "fix" is to switch to signal. Why might that be wrong? (Answer: because the predicate change may satisfy multiple waiters, e.g. a chunk of data arrived, not just one item. The correct fix is to issue N signals instead of one broadcast, or to redesign the queue so each waiter waits on its own slot.)
- You replace a condvar with a channel. CPU goes down but tail latency p99 goes up. Why? (Channels in Go schedule via the runtime's goroutine queues, which can introduce extra context switches for low-frequency wakeups. The condvar bypasses this. For p99 you might need a buffered channel, GOMAXPROCS tuning, or `runtime.LockOSThread`.)
- You design a system with one condvar per shard. A producer sometimes does not know which shard the work belongs to and so broadcasts to all. Now you have the stampede again. How do you structure this? (Use a routing layer that classifies the work first, then enqueues to a single shard. Or use a hierarchical condvar tree: a top-level "something happened" signal that wakes one router, which then targets the right shard.)
- A teammate proposes signaling outside the mutex "to reduce contention". Should you accept it? (Reject unless they can show measurements. It is allowed by POSIX but defeats the requeue optimization, can cause missed wakeups under shutdown races, and weakens the happens-before relationship on some architectures.)
- You read a 2012 blog post that says `pthread_cond_broadcast` is broken under signal/wait races. Is it still relevant? (Possibly. That was glibc bug 13165, fixed in glibc 2.25 in 2017. If your target platform has older glibc, yes. On modern systems, no.)
- Why does `Object.notify` in Java not have a requeue optimization? (Java's monitor design predates futex tricks; the JVM maintains its own intrinsic locks. Some JVMs do internal biased-locking and lock coarsening, but there is no externally-visible requeue. This is part of why `java.util.concurrent` was added: better primitives.)
- You see a benchmark where `BlockingQueue.put` is slower than a raw mutex + condvar. Possible? (Yes, in single-threaded microbenchmarks where the queue's extra atomic counters cost more than they save. Real concurrent loads tell a different story.)
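The first answer above suggests issuing N signals instead of one broadcast when a chunk of data arrives. A minimal Java sketch of that idea follows; `BatchQueue`, its `waiters` counter, and the demo class are hypothetical names introduced here, and the counter is only one of several ways to bound the number of signals.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical queue for illustration: a batch of k items issues at most k signals.
final class BatchQueue<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final Deque<T> items = new ArrayDeque<>();
    private int waiters;   // threads inside take()'s wait loop; guarded by lock

    // Returns the number of signals issued.
    int putAll(Collection<T> batch) {
        lock.lock();
        try {
            items.addAll(batch);
            // waiters may over-count threads that were signalled but have not
            // run yet; signals beyond the condition queue are harmless no-ops.
            int wakeups = Math.min(batch.size(), waiters);
            for (int i = 0; i < wakeups; i++) {
                notEmpty.signal();
            }
            return wakeups;
        } finally {
            lock.unlock();
        }
    }

    T take() throws InterruptedException {
        lock.lock();
        try {
            waiters++;
            try {
                while (items.isEmpty()) {
                    notEmpty.await();
                }
            } finally {
                waiters--;
            }
            return items.pollFirst();
        } finally {
            lock.unlock();
        }
    }
}

final class BatchQueueDemo {
    // Four consumers take one item each; two batches of two arrive. Returns the sum taken.
    static int run() {
        BatchQueue<Integer> queue = new BatchQueue<>();
        AtomicInteger sum = new AtomicInteger();
        Thread[] consumers = new Thread[4];
        for (int i = 0; i < consumers.length; i++) {
            consumers[i] = new Thread(() -> {
                try {
                    sum.addAndGet(queue.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[i].start();
        }
        queue.putAll(Arrays.asList(1, 2));
        queue.putAll(Arrays.asList(3, 4));
        for (Thread t : consumers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return sum.get();
    }

    public static void main(String[] args) {
        System.out.println(run());   // 1 + 2 + 3 + 4 = 10
    }
}
```

With 64 waiters and a batch of 3, this wakes at most 3 threads rather than 64, while the `while` loop in `take` still covers a waiter that loses its item to a thread that arrived without waiting.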
Cheat Sheet¶
PRIMITIVE WHEN TO USE COST
----------------------------------------------------------------
Condvar Arbitrary predicate, few waiters Medium
Channel/Queue "Item is available" predicate Low (built-in)
park/unpark Targeted wakeup, lock-free DS Lowest, hardest
Closed channel One-shot broadcast (Go) Lowest
CountDownLatch "All N done" predicate Low
SIGNAL vs BROADCAST
signal : one waiter can proceed prefer
broadcast : multiple waiters can proceed only when justified
FUTEX TRICKS
FUTEX_WAIT : park if *addr == expected
FUTEX_WAKE n : wake up to n waiters
FUTEX_CMP_REQUEUE : wake 1, move rest to another queue (broadcast magic)
GLIBC INTERNALS
__wseq : 64-bit wait sequence, group bit = LSB
__g_signals[2] : per-group futex address
__g_size[2] : waiters remaining in each group
__wrefs : refcount for safe destroy
PROFILING
perf lock : lock + cond contention stats
perf c2c : cache line ping-pong
bpftrace futex : per-call-site futex stats
async-profiler : JVM wall-clock view of await()
pidstat -w : voluntary context switches per thread
DEFAULTS
monotonic clock for timedwait
while loop for predicate
signal, not broadcast
hold the mutex when signaling
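The DEFAULTS block can be sketched in Java as a timed wait that combines the predicate loop with a relative time budget. `TimedGate` and `TimedGateDemo` are hypothetical names for this illustration. `Condition.awaitNanos` takes a relative timeout and returns an estimate of the time remaining; that OpenJDK measures it with `System.nanoTime` rather than the wall clock is an assumption about the implementation, not something the interface promises.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical one-shot gate, for illustration only.
final class TimedGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition opened = lock.newCondition();
    private boolean open;   // guarded by lock

    void open() {
        lock.lock();
        try {
            open = true;
            opened.signalAll();   // every waiter can proceed once the gate is open
        } finally {
            lock.unlock();
        }
    }

    // Returns true if the gate opened before the timeout elapsed.
    boolean awaitOpen(long timeout, TimeUnit unit) throws InterruptedException {
        long remaining = unit.toNanos(timeout);
        lock.lock();
        try {
            while (!open) {
                if (remaining <= 0L) {
                    return false;   // budget spent and the predicate is still false
                }
                // A spurious wakeup re-waits only for the remainder,
                // not for the full timeout again.
                remaining = opened.awaitNanos(remaining);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}

final class TimedGateDemo {
    static boolean timesOutWhenClosed() {
        try {
            return !new TimedGate().awaitOpen(20, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    static boolean returnsOnceOpened() {
        TimedGate gate = new TimedGate();
        new Thread(gate::open).start();
        try {
            return gate.awaitOpen(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(timesOutWhenClosed() + " " + returnsOnceOpened());
    }
}
```

The same shape applies to `pthread_cond_timedwait`, where the equivalent choice is a condvar initialized with `CLOCK_MONOTONIC` and an absolute deadline computed once before the loop.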
Summary¶
At the senior level, the condition variable is not an answer; it is a question. Every time you reach for one, ask: what is the predicate? How many waiters can proceed when it changes? Does my standard library already model this as a queue or channel? Can I shard? Can I target a specific thread with park/unpark?
The mechanics (futex requeue, dual groups, generation counters, cache line ping-pong) are not trivia. They explain why broadcast can stampede, why signal-without-mutex breaks the requeue optimization, and why a sharded design can outperform a single contended condvar by orders of magnitude. Understanding them lets you read profiler output and diagnose contention without trial and error.
The cultural shift, especially from a C/Java background to Go or Rust, is that idiomatic modern concurrency often has no condvar at all. Channels and queues bake the predicate in and make the correctness story trivial. When you do need a condvar, you reach for it with care, document the protocol, and choose signal over broadcast until measurements justify otherwise.
The simplest summary: understand the futex, prefer the queue, and never trust the first profile.
What You Can Build¶
- A custom thread pool with per-worker `LockSupport.park`/`unpark` and work-stealing, with no shared condvar.
- A bounded blocking queue with split head/tail locks, beating `ArrayBlockingQueue` on your specific workload after benchmarking.
- A versioned config broadcaster using a closed-channel-per-version pattern in Go.
- A coordination barrier (phaser-style) using park/unpark.
- A "configurable broadcast strategy" library that picks signal vs broadcast at runtime based on predicate metadata.
- A debug shim that wraps `pthread_cond_*` and logs every signal/wait with call-site and contention.
Further Reading¶
- Drepper, Ulrich. "Futexes Are Tricky", the canonical paper.
- Linux kernel docs: `Documentation/locking/futex-requeue-pi.rst`.
- glibc source: `nptl/pthread_cond_wait.c`, `nptl/pthread_cond_signal.c` (after 2.25).
- Doug Lea, "The Java Memory Model" and the `java.util.concurrent` JSR-166 papers.
- Brendan Gregg, Systems Performance, chapters on off-CPU profiling.
- Aleksey Shipilev's blog posts on JMH and on JVM locking.
- Go issue 21165, the discussion on `sync.Cond`.
- Linux `man 2 futex` and `man 7 pthreads`.
Related Topics¶
- Mutexes: the primitive condvars build on.
- Futexes: the kernel mechanism beneath.
- Channels: the higher-level alternative.
- Lock-Free Data Structures: where hand-coded waitsets live.
- Java `java.util.concurrent`: the practical home of park/unpark.
- Go `sync` package: why `sync.Cond` is rarely idiomatic.
Diagrams & Visual Aids¶
Broadcast without requeue (the stampede):
T0 T1 T2 T3 T4 T5 T6 T7
| | | | | | | |
broadcast()
v v v v v v v v
wake wake wake wake wake wake wake wake
| | | | | | | |
+---+---+---+---+---+---+---+
|
v
fight for mutex
|
v
T3 wins, others wait
-> 7 wasted wakeups
Broadcast with FUTEX_CMP_REQUEUE:
T0 T1 T2 T3 T4 T5 T6 T7 (parked on cond)
| | | | | | | |
broadcast()
v
wake T0 only
requeue T1..T7 onto mutex queue
| | | | | | | |
T0 T1..T7 still parked, but on mutex
|
v
T0 acquires mutex, does work, unlocks
|
v
mutex unlock wakes T1 (only)
T1 does work, unlocks -> wakes T2 ...
-> 1 wake per useful unit of work
glibc dual-group rotation:
wseq=0 wseq=1 wseq=2 wseq=3
G1 active G1 active G1 closed G1 closed
G2 -> G1 G2 -> G1
new G2 new G2
waiter at wseq=1 -> joins G1 (current)
signal arrives -> closes G1, rotates G2 into G1
waiter at wseq=4 -> joins fresh G2 (cannot be woken
by a signal that targeted old G1)
Cache line ping-pong on shared condvar:
CPU0 CPU1 CPU2 CPU3
^ ^ ^ ^
| | | |
modify modify modify modify
\____ /\___ /\___ /
cache line of __g_signals[0]
bounces M -> M -> M -> M
~100ns per transition
Sharded condvar (the fix):
Shard 0 Shard 1 Shard 2 Shard 3
[queue,mu,cv][queue,mu,cv][queue,mu,cv][queue,mu,cv]
^ ^ ^ ^
| | | |
W0,W1 W2,W3 W4,W5 W6,W7
Producer routes work to one shard.
Signal touches only that shard's cache line.
No cross-CPU bounce.
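The sharded layout in the diagram can be sketched in Java as follows. `ShardedMailbox`, `shardFor`, and the hash-modulo routing rule are assumptions made for this illustration. Note that separate Java objects are not guaranteed to sit on separate cache lines, so this shows the structure rather than a measured layout.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sharded mailbox: each shard owns its queue, mutex and condvar.
final class ShardedMailbox<T> {
    private static final class Shard<T> {
        final ReentrantLock lock = new ReentrantLock();
        final Condition notEmpty = lock.newCondition();
        final Deque<T> items = new ArrayDeque<>();
    }

    private final List<Shard<T>> shards = new ArrayList<>();

    ShardedMailbox(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new Shard<>());
        }
    }

    // Routing rule is an assumption: hash of the key modulo the shard count.
    int shardFor(Object key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    void put(Object key, T item) {
        Shard<T> shard = shards.get(shardFor(key));
        shard.lock.lock();
        try {
            shard.items.addLast(item);
            shard.notEmpty.signal();   // touches only this shard's lock and condvar
        } finally {
            shard.lock.unlock();
        }
    }

    T take(int shardIndex) throws InterruptedException {
        Shard<T> shard = shards.get(shardIndex);
        shard.lock.lock();
        try {
            while (shard.items.isEmpty()) {
                shard.notEmpty.await();
            }
            return shard.items.pollFirst();
        } finally {
            shard.lock.unlock();
        }
    }
}

final class ShardedMailboxDemo {
    // One worker per shard takes two items; returns the per-shard sums.
    static int[] run() {
        int shardCount = 4;
        ShardedMailbox<Integer> mailbox = new ShardedMailbox<>(shardCount);
        int[] sums = new int[shardCount];
        Thread[] workers = new Thread[shardCount];
        for (int i = 0; i < shardCount; i++) {
            final int shard = i;
            workers[i] = new Thread(() -> {
                try {
                    sums[shard] = mailbox.take(shard) + mailbox.take(shard);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        for (int key = 0; key < 8; key++) {
            mailbox.put(key, key);   // Integer keys hash to themselves, so key % 4 picks the shard
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));   // [4, 6, 8, 10]
    }
}
```

Each `put` wakes at most one worker on one shard, so no producer ever needs a broadcast across shards as long as it can classify the work before enqueueing.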
Park/unpark targeted wakeup:
ProducerThread ConsumerThread
| |
| publish item | (parked)
| atomic.compareAndSet |
| LockSupport.unpark(consumer) ---------> wake
| | re-check predicate
v v
continue do work
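The targeted wakeup in the last diagram can be sketched with `LockSupport` directly. `ParkHandoff` is a hypothetical single-consumer slot invented for this illustration; it assumes exactly one thread ever calls `await`, and it relies on the documented behavior that `unpark` leaves a permit behind if the target has not parked yet.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// Hypothetical single-consumer handoff slot built on park/unpark.
final class ParkHandoff<T> {
    private final AtomicReference<T> slot = new AtomicReference<>();
    private volatile Thread consumer;   // the one thread allowed to call await()

    // Returns false if the previous item has not been consumed yet.
    boolean publish(T item) {
        if (!slot.compareAndSet(null, item)) {
            return false;
        }
        // unpark(null) is a no-op. If the consumer has registered but not parked
        // yet, the permit is remembered and its next park() returns immediately.
        LockSupport.unpark(consumer);
        return true;
    }

    T await() throws InterruptedException {
        consumer = Thread.currentThread();   // register before the first check
        T item;
        while ((item = slot.getAndSet(null)) == null) {
            LockSupport.park(this);          // may return spuriously, hence the loop
            if (Thread.interrupted()) {
                throw new InterruptedException();
            }
        }
        return item;
    }
}

final class ParkHandoffDemo {
    static String run() {
        ParkHandoff<String> handoff = new ParkHandoff<>();
        AtomicReference<String> seen = new AtomicReference<>();
        Thread consumerThread = new Thread(() -> {
            try {
                seen.set(handoff.await());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumerThread.start();
        handoff.publish("item-1");
        try {
            consumerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println(run());   // item-1
    }
}
```

There is no mutex here, so the ordering does the work: the consumer registers itself and then checks the slot, while the producer fills the slot and then reads the registration. Whichever side runs second observes the other, and the sticky permit covers the window between the consumer's check and its `park`.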
End of senior-level treatment.