Concurrency — Find the Bug¶
Concurrency is the richest bug source in software: the code reads correctly, the tests pass, the demo works — and then it fails once in ten thousand runs on a customer's machine. The bugs below are real, subtle, and intermittent. Each compiles and runs. Find the race, the visibility hole, or the deadlock before you open the answer.
Table of Contents¶
- Snippet 1 — Check-then-act on a map (Go)
- Snippet 2 — Lost update on a counter (Java)
- Snippet 3 — Visibility bug on a stop flag (Java)
- Snippet 4 — Loop variable captured by goroutines (Go)
- Snippet 5 — Deadlock from inconsistent lock ordering (Go)
- Snippet 6 —
synchronizedon the wrong lock object (Java) - Snippet 7 — Double-checked locking without
volatile(Java) - Snippet 8 — Goroutine leak on a blocked channel (Go)
- Snippet 9 —
WaitGroup.Addinside the goroutine (Go) - Snippet 10 — Shared
SimpleDateFormatacross threads (Java) - Snippet 11 — TOCTOU on a lock file (Python)
- Snippet 12 — Unbuffered channel deadlock on the main goroutine (Go)
- Snippet 13 — Closing a channel from the receiver (Go)
- Snippet 14 —
defer mu.Unlock()after an earlyreturnpath (Go) - Scorecard
- Related Topics
How to Use¶
Read each snippet and decide what's wrong before expanding the answer. For a concurrency bug, a complete diagnosis answers four questions:
- What is the data race or coordination flaw? Name the two operations that interleave badly, or the channel/lock that never gets signaled.
- Why is it intermittent? Almost every concurrency bug passes the first thousand runs. Explain the window that has to be hit.
- What's the fix? Usually a lock, an atomic, a
volatile/memory barrier, acontext, or restructuring ownership so the data is never shared. - Would a race detector catch it? Go's
-raceflag and Java's tooling (e.g. the FindBugs/SpotBugsMT_CORRECTNESSchecks, ThreadSanitizer for native code) catch data races — unsynchronized access to shared memory. They do not catch deadlocks, channel leaks, or logical races where every access happens to go through a lock but the sequence is still wrong. Knowing which category a bug falls into tells you which tool to reach for.
Difficulty is marked Easy / Medium / Hard. The Hard ones survive code review at most companies.
Snippet 1¶
Difficulty: Easy — Check-then-act on a map (Go)
package cache
import "sync"
type Counter struct {
mu sync.Mutex
counts map[string]int
}
func NewCounter() *Counter {
return &Counter{counts: make(map[string]int)}
}
// Increment bumps the count for key by one.
func (c *Counter) Increment(key string) {
if _, ok := c.counts[key]; !ok {
c.mu.Lock()
c.counts[key] = 0
c.mu.Unlock()
}
c.mu.Lock()
c.counts[key]++
c.mu.Unlock()
}
What's wrong?
Answer
**The race:** the `if _, ok := c.counts[key]; !ok` check reads the map **outside** the lock. Two goroutines can both observe `!ok` for a new key, both take the lock (serially), and both write `c.counts[key] = 0`. That part is merely redundant. The real damage is that the *read* of `c.counts` in the `if` is a concurrent map read happening while another goroutine holds the lock and is doing `c.counts[key]++` — a write. **Concurrent read and write of a Go map is a hard runtime error**, not just a logic bug: the runtime detects it and panics with `fatal error: concurrent map read and map write`, which cannot be recovered. **Why it's intermittent:** the unsynchronized read only collides with a write when the two goroutines hit the exact same map during a rehash or bucket write. Under light load you may never see it; under production load it crashes the whole process. **The fix:** put the entire check-then-act under one lock so the map is only ever touched while holding `mu`. Go's zero-value semantics make the whole `if !ok` dance unnecessary — `c.counts[key]++` on a missing key starts from 0. **Race detector:** `go test -race` catches this immediately. The detector reports the unsynchronized read in `Increment` against the locked write, with both stack traces. This is the canonical case `-race` was built for.Snippet 2¶
Difficulty: Easy — Lost update on a counter (Java)
public class RequestMeter {
private long total = 0;
public void record() {
total++;
}
public long total() {
return total;
}
}
// Used from a thread-per-request server:
// meter.record(); // called by every request thread
What's wrong?
Answer
**The race:** `total++` is not atomic. It compiles to three steps: read `total`, add 1, write `total` back. Two threads can both read the same value (say 41), both compute 42, both write 42. One increment is lost. With N threads hammering `record()`, the final count is reliably *less* than the number of calls. There is a second, sneakier problem: even reads via `total()` have no memory barrier, so a reader thread may see a stale value indefinitely (visibility — see Snippet 3). **Why it's intermittent:** the lost-update window is the few nanoseconds between read and write-back. Under low concurrency the threads rarely overlap there, so the count looks correct. Under load, thousands of updates vanish — and the loss is non-deterministic, so the bug "comes and goes." **The fix:** use an atomic, which performs read-modify-write as one indivisible operation and also provides the memory barrier for visibility. For high-contention counters where you only read occasionally, prefer `LongAdder` — it shards the count across cells to avoid CAS contention. **Race detector:** SpotBugs flags this as `VO_VOLATILE_INCREMENT` / unsynchronized increment. At runtime, Java's ThreadSanitizer-style tools (or the `java.util.concurrent` stress harness `jcstress`) reproduce the lost update deterministically. Plain unit tests almost never do.Snippet 3¶
Difficulty: Medium — Visibility bug on a stop flag (Java)
public class Worker implements Runnable {
private boolean running = true;
public void stop() {
running = false;
}
@Override
public void run() {
while (running) {
doWork();
}
}
private void doWork() { /* CPU-bound, no I/O, no locks */ }
}
// Main thread:
// Worker w = new Worker();
// new Thread(w).start();
// Thread.sleep(1000);
// w.stop(); // worker should exit... does it?
What's wrong?
Why it's intermittent: This one is the classic "works on my laptop, hangs in production."
Answer
**The bug (visibility, not a data race in the classic sense):** `running` is a plain field. The worker thread reads it in a tight loop with no synchronization. The Java Memory Model gives **no guarantee** that the write `running = false` on the main thread ever becomes visible to the worker thread. The JIT is permitted to hoist `running` into a register, turning the loop into `while (true)`. The worker spins forever even after `stop()` returns. **Why it's intermittent — and why it's worse than intermittent:** on **x86**, hardware has a strong memory model and stores become globally visible quickly, so the loop usually *does* terminate — you'll never see the bug on your Intel/AMD dev machine. On **ARM** (weaker memory model — Apple Silicon, AWS Graviton, mobile), the store can sit in a store buffer indefinitely and the loop genuinely never exits. The bug is invisible until you deploy to ARM servers. Even on x86, a more aggressive JIT optimization level can hoist the read and hang. **The fix:** declare the flag `volatile`. `volatile` guarantees that writes are visible to all threads and prevents the JIT from caching the read in a register. For a flag that's also read-modified, use `AtomicBoolean`. For interruptible blocking work, prefer `Thread.interrupt()` + checking `Thread.currentThread().isInterrupted()`. **Race detector:** this is a *visibility* bug, which most data-race detectors **do** classify as a race (unsynchronized read + write to `running`). `jcstress` reproduces it reliably. But a plain functional test on x86 will pass forever, which is exactly why visibility bugs are so dangerous — the platform hides them.Snippet 4¶
Difficulty: Medium — Loop variable captured by goroutines (Go)
package main
import (
"fmt"
"sync"
)
func main() {
urls := []string{"a.com", "b.com", "c.com"}
var wg sync.WaitGroup
for _, url := range urls {
wg.Add(1)
go func() {
defer wg.Done()
fmt.Println("fetching", url)
}()
}
wg.Wait()
}
What's wrong?
Answer
**The bug:** the closure captures the loop variable `url` **by reference**, not by value. In Go versions before 1.22, there is a single `url` variable reused across iterations. By the time the goroutines actually run (the scheduler may not start them until after the loop finishes), `url` holds the last value. The output is often `fetching c.com` printed three times — and the first two URLs are never fetched. This is the single most common Go concurrency bug, and the equivalent exists in JavaScript with `var` in a `for` loop closing over `setTimeout` callbacks. **Why it's intermittent:** if the scheduler happens to run each goroutine before the next iteration overwrites `url`, you might see the correct output — so it can pass on a lightly loaded machine and fail under load or with `GOMAXPROCS>1`. The result is data-dependent and timing-dependent. **The fix (any one of these):** Bind the value per iteration by shadowing: Or pass it as an argument (clearest — the value is copied at call time): **Go 1.22+:** the language changed so each iteration gets a fresh `url`, fixing this for `for range` loops. But you must know your Go version and `go.mod` language level; older modules still have the old semantics. Pass-by-argument works on every version. **Race detector:** `go vet` has a `loopclosure` analyzer that flags this statically. `-race` catches it at runtime if the captured variable is *written* concurrently, but the pure read-of-last-value case is best caught by `go vet`.Snippet 5¶
Difficulty: Hard — Deadlock from inconsistent lock ordering (Go)
package bank
import "sync"
type Account struct {
mu sync.Mutex
balance int
}
func Transfer(from, to *Account, amount int) {
from.mu.Lock()
defer from.mu.Unlock()
to.mu.Lock()
defer to.mu.Unlock()
from.balance -= amount
to.balance += amount
}
What's wrong?
Answer
**The deadlock:** `Transfer` always locks `from` then `to`. But the *order of the accounts* depends on the caller's arguments. Consider two concurrent transfers: - Goroutine 1: `Transfer(accountA, accountB, 100)` — locks A, then wants B. - Goroutine 2: `Transfer(accountB, accountA, 50)` — locks B, then wants A. If both grab their first lock before either grabs the second, G1 holds A and waits for B, while G2 holds B and waits for A. Neither releases. **Classic deadlock from inconsistent lock ordering** — the locks are acquired in a different global order depending on call arguments. **Why it's intermittent:** the deadlock only triggers when two transfers between the *same pair* of accounts run in *opposite directions* and both acquire their first lock in the tiny window before the second. Most of the time one finishes first. In a test with two accounts and a tight loop, it can hang within seconds; in production it might surface once a week as a stuck request and a leaked connection. **The fix:** impose a **consistent global lock order**. Lock the accounts in a deterministic order (e.g. by a stable ID) regardless of transfer direction.type Account struct {
id uint64 // unique, assigned at creation
mu sync.Mutex
balance int
}
func Transfer(from, to *Account, amount int) {
first, second := from, to
if first.id > second.id {
first, second = second, first
}
first.mu.Lock()
defer first.mu.Unlock()
second.mu.Lock()
defer second.mu.Unlock()
from.balance -= amount
to.balance += amount
}
Snippet 6¶
Difficulty: Hard — synchronized on the wrong lock object (Java)
public class BoundedBuffer {
private final Object writeLock = new Object();
private final java.util.LinkedList<Integer> items = new java.util.LinkedList<>();
public void put(int value) {
synchronized (writeLock) {
items.addLast(value);
}
}
public Integer poll() {
synchronized (items) { // <-- different lock!
return items.isEmpty() ? null : items.removeFirst();
}
}
}
What's wrong?
Answer
**The bug:** `put` synchronizes on `writeLock`; `poll` synchronizes on `items`. They are **different monitors**. Mutual exclusion only holds between threads that lock the *same* object. A `put` and a `poll` can run truly concurrently, both mutating the same `LinkedList`. `LinkedList` is not thread-safe; concurrent `addLast` and `removeFirst` corrupt its internal node pointers — you get lost elements, `NullPointerException` deep inside `LinkedList`, or a size counter that no longer matches the actual node chain. **Why it's intermittent:** two writers are correctly serialized (both use `writeLock`), and two pollers are serialized (both use `items`), so single-direction stress tests pass. The corruption needs a `put` and a `poll` to overlap on the exact list operation. Low traffic almost never overlaps; production load hits it and you get a mysterious `NPE` with a stack trace inside the JDK that points nowhere in your code. **The fix:** every method that touches `items` must hold the **same** lock.public class BoundedBuffer {
private final Object lock = new Object();
private final java.util.LinkedList<Integer> items = new java.util.LinkedList<>();
public void put(int value) {
synchronized (lock) { items.addLast(value); }
}
public Integer poll() {
synchronized (lock) {
return items.isEmpty() ? null : items.removeFirst();
}
}
}
Snippet 7¶
Difficulty: Hard — Double-checked locking without volatile (Java)
public class ConfigRegistry {
private static ConfigRegistry instance; // not volatile
private final Map<String, String> settings;
private ConfigRegistry() {
this.settings = loadFromDisk(); // expensive
}
public static ConfigRegistry getInstance() {
if (instance == null) { // first check (no lock)
synchronized (ConfigRegistry.class) {
if (instance == null) { // second check (locked)
instance = new ConfigRegistry();
}
}
}
return instance;
}
private static Map<String, String> loadFromDisk() { /* ... */ return null; }
}
What's wrong?
Answer
**The bug:** classic broken double-checked locking. The field `instance` is **not `volatile`**. The statement `instance = new ConfigRegistry()` is not atomic — it is roughly: (1) allocate memory, (2) run the constructor, (3) publish the reference into `instance`. The JMM permits the compiler/CPU to **reorder** steps 2 and 3, so `instance` can become non-null *before* the constructor finishes initializing `settings`. Now a second thread enters `getInstance`, sees `instance != null` on the first (unlocked) check, skips the lock, and returns the object — while `settings` is still `null` (or partially constructed). The caller dereferences `settings` and gets an `NullPointerException` or reads a half-built map. The object escaped before it was fully constructed. **Why it's intermittent:** the reordering only happens under specific JIT/hardware conditions, and only matters if a second thread observes the partially-published reference in the narrow window before construction completes. On x86 you may never see it; on a weakly-ordered ARM CPU it's far more likely. This bug shipped in countless codebases precisely because it almost always works. **The fix:** mark the field `volatile`. Since Java 5's revised memory model, `volatile` establishes a happens-before edge that forbids the harmful reordering — the write to `instance` cannot be reordered before the constructor's writes. Better: use the **initialization-on-demand holder idiom**, which is lazy, thread-safe, and lock-free, relying on the JVM's class-initialization guarantees: The JVM guarantees `Holder` is initialized exactly once, on first access, with full memory safety — no `volatile`, no synchronization in your code. **Race detector:** the missing-`volatile` reordering is genuinely hard to catch dynamically because it depends on speculative reordering; `jcstress` is the right tool and has canonical test cases for exactly this idiom. SpotBugs flags `DC_DOUBLECHECK` (double-checked locking) statically — treat that warning as a real bug, never suppress it.Snippet 8¶
Difficulty: Medium — Goroutine leak on a blocked channel (Go)
package search
import "context"
// First returns the first result from any backend; the rest are abandoned.
func First(query string, backends []func(string) string) string {
results := make(chan string) // unbuffered
for _, backend := range backends {
b := backend
go func() {
results <- b(query) // send result
}()
}
return <-results // take the first, ignore the rest
}
What's wrong?
Answer
**The leak:** the channel `results` is **unbuffered**. We launch one goroutine per backend, but `First` only ever receives **once** (`<-results`). The first goroutine to finish sends successfully and `First` returns. Every *other* goroutine blocks forever on `results <- b(query)` because there is no receiver left. Those goroutines never exit. They hold their stacks, their captured variables, and any resources `b` opened (HTTP connections, file handles) for the lifetime of the process. **Why it's intermittent — actually it's a steady leak:** the function returns the correct answer every time, so functional tests pass and the bug is invisible to correctness checks. The damage is cumulative: each call to `First` with N backends leaks N-1 goroutines. Memory and goroutine count climb slowly until the process OOMs or the scheduler grinds — often hours or days after deploy, far from the cause. **The fix:** give the channel enough buffer that every sender can complete and exit even if no one receives their value, and use a `context` so abandoned work can be cancelled.func First(ctx context.Context, query string, backends []func(context.Context, string) string) string {
results := make(chan string, len(backends)) // buffered: every sender can finish
ctx, cancel := context.WithCancel(ctx)
defer cancel() // signal the losers to stop
for _, backend := range backends {
b := backend
go func() {
results <- b(ctx, query)
}()
}
return <-results
}
Snippet 9¶
Difficulty: Medium — WaitGroup.Add inside the goroutine (Go)
package main
import (
"fmt"
"sync"
)
func processAll(jobs []int) {
var wg sync.WaitGroup
for _, job := range jobs {
go func(j int) {
wg.Add(1) // <-- Add inside the goroutine
defer wg.Done()
fmt.Println("processing", j)
}(job)
}
wg.Wait() // wait for all jobs
fmt.Println("all done")
}
What's wrong?
Answer
**The race:** `wg.Add(1)` is called *inside* each goroutine, which races with `wg.Wait()` on the main goroutine. The spec is explicit: calls to `Add` with a positive delta must **happen before** `Wait`. Here the loop can finish launching goroutines and reach `wg.Wait()` *before* any goroutine has run its `wg.Add(1)`. At that moment the counter is still 0, so `Wait` returns immediately — `processAll` prints "all done" and returns while jobs are still running or haven't started. Worse, a `Done` can then drive the counter negative, which panics (`sync: negative WaitGroup counter`). **Why it's intermittent:** if the scheduler happens to run every goroutine's `Add` before the main goroutine reaches `Wait`, it works. With few jobs on a single core it often appears correct; with `GOMAXPROCS>1` or many jobs, `Wait` frequently wins the race and returns early — or you get the negative-counter panic. The output is non-deterministic. **The fix:** call `Add` on the launching goroutine, *before* the `go` statement, so it is guaranteed to happen-before `Wait`. You may also `wg.Add(len(jobs))` once before the loop. **Race detector:** `-race` detects the data race between the concurrent `Add` and `Wait` on the `WaitGroup`'s internal state, reporting it with both stacks. The negative-counter case panics outright. This is a frequent enough mistake that it's worth memorizing the rule: *`Add` before `go`, always.*Snippet 10¶
Difficulty: Hard — Shared SimpleDateFormat across threads (Java)
public final class Timestamps {
// shared static formatter to avoid allocating one per call
private static final SimpleDateFormat FORMAT =
new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
public static String format(Date when) {
return FORMAT.format(when);
}
}
// Called concurrently from many request threads:
// String ts = Timestamps.format(new Date());
What's wrong?
Answer
**The bug:** `SimpleDateFormat` is **not thread-safe** — a famous JDK footgun. Internally it uses a shared mutable `Calendar` field as scratch space during `format()`. When two threads call `FORMAT.format(when)` simultaneously, they stomp on that shared `Calendar`. Symptoms range from subtle (one thread's date bleeds into another's output — you get a timestamp with the wrong year or month) to violent (`NumberFormatException`, `ArrayIndexOutOfBoundsException`, or `Empty digit list` thrown from deep inside the formatter on malformed intermediate state). **Why it's intermittent:** with a single thread, or under low concurrency, the formatter completes before another call begins and everything is correct. The corruption needs two `format` calls to overlap on the shared `Calendar`. It can run flawlessly in dev and unit tests (single-threaded), then produce garbage timestamps or random exceptions only under production concurrency — and the bad output is data-dependent, so it looks like random data corruption rather than a threading issue. **The fix:** use the **immutable, thread-safe** `java.time.format.DateTimeFormatter` (Java 8+). A single shared instance is safe to use from any number of threads.import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
public final class Timestamps {
private static final DateTimeFormatter FORMAT =
DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss").withZone(ZoneOffset.UTC);
public static String format(Instant when) {
return FORMAT.format(when);
}
}
Snippet 11¶
Difficulty: Medium — TOCTOU on a lock file (Python)
import os
LOCK = "/tmp/job.lock"
def run_exclusive(task):
# Acquire a crude file lock so only one process runs the task.
if not os.path.exists(LOCK): # time-of-check
open(LOCK, "w").close() # time-of-use
try:
task()
finally:
os.remove(LOCK)
else:
print("another instance is running; skipping")
What's wrong?
Answer
**The race (TOCTOU — time-of-check to time-of-use):** the gap between `os.path.exists(LOCK)` and `open(LOCK, "w").close()` is unguarded. Two processes (or threads) can both call `os.path.exists`, both see `False`, and both proceed to create the file and run `task()` concurrently — defeating the entire purpose of the lock. The check and the acquire are two separate operations with a window between them. **Why it's intermittent:** the window is only the few microseconds between the `exists` check and the `open`. Two processes have to start within that window to collide. Cron jobs that fire at the same second, or a thundering herd on startup, hit it; manual testing where you launch processes by hand never does. So the "lock" appears to work for months, then two jobs run together exactly when it matters (e.g. a billing run executes twice). **The fix:** make check-and-acquire a **single atomic operation**. `O_CREAT | O_EXCL` asks the OS to create the file *and fail if it already exists*, atomically, in the kernel. For robust locking that survives crashes, prefer `fcntl.flock` (advisory locks the OS releases automatically when the process dies) or a dedicated library like `filelock`. The hand-rolled lock file above also leaks the lock forever if the process is `kill -9`'d between create and remove. **Race detector:** thread/data-race detectors don't apply here — this is a race across **processes** mediated by the filesystem, not shared memory. You reason about it by spotting the non-atomic check-then-act pattern, the same shape as Snippet 1. The fix is always to fold check and act into one atomic syscall or one locked critical section.Snippet 12¶
Difficulty: Easy — Unbuffered channel deadlock on the main goroutine (Go)
package main
import "fmt"
func main() {
ch := make(chan int) // unbuffered
ch <- 42 // send
value := <-ch // receive
fmt.Println(value)
}
What's wrong?
Answer
**The deadlock:** an unbuffered channel requires a sender and a receiver to **rendezvous** — the send `ch <- 42` blocks until some *other* goroutine is ready to receive. Here everything runs on the single `main` goroutine. `ch <- 42` blocks waiting for a receiver, but the receiver (`<-ch`) is the very next line on the *same* goroutine, which can never run because we're blocked on the send. The Go runtime detects that all goroutines are stuck and aborts: `fatal error: all goroutines are asleep - deadlock!`. **Why it's *not* intermittent (the instructive part):** unlike most bugs here, this one fails **deterministically and immediately** every single time, because the Go runtime has a built-in global deadlock detector for the all-goroutines-blocked case. That makes it a useful contrast: the dangerous deadlocks (Snippet 5) are *partial* — some goroutines stay alive — so the runtime stays quiet and you get a silent hang instead of a crash. **The fix:** the send and receive must be on different goroutines so they can rendezvous, or the channel needs a buffer. Different goroutines: Or a buffer of at least 1, so the send completes without a waiting receiver: **Race detector:** not a data race; `-race` is irrelevant. The runtime's deadlock detector catches the total-deadlock case for free. Recognize the pattern: **a blocking channel operation with no concurrent counterpart on another goroutine.**Snippet 13¶
Difficulty: Hard — Closing a channel from the receiver (Go)
package pipeline
import "sync"
// fanIn merges values from multiple producers into one channel.
func fanIn(producers []<-chan int) <-chan int {
out := make(chan int)
var wg sync.WaitGroup
for _, p := range producers {
wg.Add(1)
go func(in <-chan int) {
defer wg.Done()
for v := range in {
out <- v
}
close(out) // close when this producer is drained
}(p)
}
go func() {
wg.Wait()
}()
return out
}
What's wrong?
Answer
**The bug:** each fan-in goroutine calls `close(out)` when *its* input is drained. But `out` is shared by all of them. The first goroutine to finish closes `out`; when a second goroutine then tries `out <- v` or `close(out)` again, it triggers a runtime panic: **`send on closed channel`** or **`close of closed channel`**. The cardinal rules are violated: a channel must be closed **exactly once**, and only by the **sender side that owns it** — never by multiple producers racing to close the same channel. **Why it's intermittent:** if one producer finishes and closes `out` while the others happen to have already drained and exited, you might get away with it. The panic needs a second producer to still be sending (or closing) after the first close. With one slow producer and one fast one, or under load with many producers, the panic fires. It can pass small tests and crash in production where producers have uneven timing. **The fix:** close the output channel **once**, after **all** producers are done, from the single goroutine that owns `out`. The `wg.Wait()` goroutine is the natural place — but it must do the closing.func fanIn(producers []<-chan int) <-chan int {
out := make(chan int)
var wg sync.WaitGroup
for _, p := range producers {
wg.Add(1)
go func(in <-chan int) {
defer wg.Done()
for v := range in {
out <- v // never close here
}
}(p)
}
go func() {
wg.Wait() // all producers drained
close(out) // exactly one close, by the owner
}()
return out
}
Snippet 14¶
Difficulty: Medium — defer mu.Unlock() placed after an early return path (Go)
package store
import (
"errors"
"sync"
)
type Store struct {
mu sync.Mutex
data map[string]int
}
func (s *Store) Bump(key string, delta int) (int, error) {
if delta == 0 {
return 0, errors.New("delta must be non-zero")
}
s.mu.Lock()
v, ok := s.data[key]
if !ok {
return 0, errors.New("key not found") // <-- returns while holding the lock
}
defer s.mu.Unlock()
v += delta
s.data[key] = v
return v, nil
}
What's wrong?
Answer
**The bug:** `s.mu.Lock()` is acquired, but the `defer s.mu.Unlock()` is placed *after* the `if !ok` early return. When the key is missing, the function returns **while still holding the mutex** — the `defer` was never reached, so it never registered. The lock is now held forever. The next caller to `Bump` (or any method that locks `s.mu`) blocks indefinitely. One missing-key lookup permanently wedges the whole `Store`. **Why it's intermittent:** the code works perfectly for every existing key — which is the common path that tests and demos exercise. The lock leak only triggers on the *error* path (`!ok`), and only *future* callers suffer, not the one that took the bad path. So a single missing-key request in production silently poisons the store, and every subsequent request hangs — appearing as a sudden, total stall with no obvious trigger in the logs. **The fix:** register `defer s.mu.Unlock()` **immediately** after `Lock()`, before any branch that can return. `defer` runs on every return path, including the early one.func (s *Store) Bump(key string, delta int) (int, error) {
if delta == 0 {
return 0, errors.New("delta must be non-zero")
}
s.mu.Lock()
defer s.mu.Unlock() // registered before any early return
v, ok := s.data[key]
if !ok {
return 0, errors.New("key not found") // unlock still runs
}
v += delta
s.data[key] = v
return v, nil
}
Scorecard¶
Track how many you diagnosed correctly before opening the answer. A correct diagnosis names the racing operations (or the missed signal), explains the intermittency window, and gives the right fix category.
| # | Snippet | Bug class | Difficulty | Got it? |
|---|---|---|---|---|
| 1 | Check-then-act on a map | Data race / TOCTOU | Easy | ☐ |
| 2 | Lost update on a counter | Lost update (non-atomic RMW) | Easy | ☐ |
| 3 | Visibility bug on a stop flag | Visibility (missing volatile) | Medium | ☐ |
| 4 | Loop variable captured | Closure capture | Medium | ☐ |
| 5 | Inconsistent lock ordering | Deadlock | Hard | ☐ |
| 6 | synchronized on wrong object | Split lock / unsafe collection | Hard | ☐ |
| 7 | Double-checked locking | Unsafe publication (reordering) | Hard | ☐ |
| 8 | Blocked-channel goroutine leak | Resource leak | Medium | ☐ |
| 9 | WaitGroup.Add in goroutine | Add/Wait race | Medium | ☐ |
| 10 | Shared SimpleDateFormat | Unsafe shared mutable state | Hard | ☐ |
| 11 | TOCTOU on a lock file | Cross-process race | Medium | ☐ |
| 12 | Unbuffered channel deadlock | Deadlock (no counterpart) | Easy | ☐ |
| 13 | Closing channel from receivers | Double close / send on closed | Hard | ☐ |
| 14 | defer Unlock after early return | Lock leak | Medium | ☐ |
Scoring:
- 12–14: You think in interleavings. You'd catch these in review before they ship.
- 8–11: Solid. Re-study the Hard ones — unsafe publication (7) and split locks (6) are the ones that survive most code reviews.
- 4–7: You spot the obvious races but miss the visibility and ordering subtleties. Internalize the rule that on x86 most visibility bugs hide — test on ARM, run
jcstress/-race. - 0–3: Re-read
junior.mdfor the foundations, then come back. Every bug here maps to one rule there.
The meta-lesson: notice how often the bug is invisible to ordinary tests and only a category-specific tool catches it — -race/TSan for data races, jcstress for memory-model reordering, goleak/pprof for leaks, goroutine dumps for partial deadlocks, and plain reasoning about ordering for logical races that no tool flags. Concurrency correctness is designed in (clear ownership, immutability, one lock per invariant), not tested in.
Related Topics¶
junior.md— the foundational rules each of these bugs violates.tasks.md— hands-on exercises: write the failing test, then the fix.- Chapter README — the positive concurrency rules and anti-patterns to avoid.
- Anti-Patterns — broader catalog of recurring design mistakes, including concurrency anti-patterns.
- Refactoring — restructuring code (e.g. extracting a single locked critical section, introducing immutability) to make these bugs structurally impossible.
In this topic