Flaky Tests: Find the Bug
Category: Testing Anti-Patterns → Flaky Tests — the same test, on the same code, passes sometimes and fails sometimes. Also known as: non-deterministic tests · intermittent tests · heisentests
This file is diagnostic practice. Each snippet below is a plausible test (or a test-plus-production pair) that is flaky — it passes sometimes and fails sometimes on unchanged code. Your job is to read it like a reviewer hunting non-determinism and answer three questions:
What is the source of non-determinism? What concrete failure does it produce? How do you make it deterministic?
The source is always one (or more) of the seven from middle.md: timing, async race, shared state, randomness, real clock, external dependency, ordering. The skill is naming which one on sight — because the cure follows directly from the cause.
How to use this file: read each snippet, write your own answer before expanding the collapsible. Some snippets look perfectly innocent and would pass a casual review; the flake lives in an uncontrolled input, not a syntax error.
Table of Contents
- The async wait that guesses
- The token that expires too soon
- The two tests that hate each other
- The order that came back wrong
- The price from the internet
- The counter that lost count
- The random sample that sometimes breaks
- The "today" report
Snippet 1: The async wait that guesses
// Go: a test for an event bus that delivers on a background goroutine.
func TestPublish_Delivers(t *testing.T) {
    bus := NewBus()
    var received string
    bus.Subscribe("topic", func(msg string) { received = msg })
    bus.Publish("topic", "hello")     // delivered asynchronously
    time.Sleep(10 * time.Millisecond) // give it "time to arrive"
    if received != "hello" {
        t.Fatalf("got %q", received)
    }
}
What is the source of non-determinism? What concrete failure does it produce? How do you make it deterministic?
Answer
**Source: timing** (a `sleep`-based wait racing async delivery), plus a latent **data race**: `received` is written on the bus goroutine and read on the test goroutine with no synchronization.

**Failure:** the 10ms sleep only guesses how long delivery takes. On a loaded CI box the delivery goroutine may not be scheduled within that window, so `received` is still `""` and the test fails. Even when the test passes, the unsynchronized read and write is a data race that `go test -race` reports.

**Fix: synchronize on the actual delivery instead of sleeping.** Have the handler send on a channel and wait on it with a generous timeout:

func TestPublish_Delivers(t *testing.T) {
    bus := NewBus()
    got := make(chan string, 1)
    bus.Subscribe("topic", func(msg string) { got <- msg }) // signal, not shared var
    bus.Publish("topic", "hello")
    select {
    case msg := <-got:
        require.Equal(t, "hello", msg)
    case <-time.After(2 * time.Second): // upper bound only; returns as soon as the message arrives
        t.Fatal("message never delivered")
    }
}
Snippet 2: The token that expires too soon
# Python: production reads the wall clock; the test waits real time.
import time

class AuthToken:
    def __init__(self, ttl_seconds):
        self.expires_at = time.time() + ttl_seconds

    def valid(self):
        return time.time() < self.expires_at

def test_token_validity():
    token = AuthToken(ttl_seconds=2)
    assert token.valid()
    time.sleep(2)
    assert not token.valid()
What is the source of non-determinism? What concrete failure does it produce? How do you make it deterministic?
Answer
**Source: the real clock** (production calls `time.time()`; the test races a real `sleep` against a real TTL).

**Failure:** the test sleeps for exactly the TTL, so it leaves no margin at the boundary. `time.sleep` and `time.time()` are not guaranteed to agree to the last tick, and the wall clock can be adjusted (for example by NTP) while the test sleeps, so the second `valid()` call can land just before `expires_at` and fail intermittently. On a badly stalled machine the first assertion can fail too, if more than 2 seconds pass before it runs. It's also slow: a real 2-second wait on every run.

**Fix: inject the clock and control time in the test.**

class AuthToken:
    def __init__(self, ttl_seconds, clock):
        self.clock = clock
        self.expires_at = clock.now() + ttl_seconds

    def valid(self):
        return self.clock.now() < self.expires_at

class FakeClock:
    def __init__(self, t=0.0):
        self.t = t

    def now(self):
        return self.t

    def advance(self, dt):
        self.t += dt

def test_token_validity():
    clock = FakeClock()
    token = AuthToken(ttl_seconds=2, clock=clock)
    assert token.valid()
    clock.advance(2)          # exactly at the boundary; "<" means invalid
    assert not token.valid()  # deterministic, instant, exact
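Once the clock is injected, production code needs a real clock to pass in. A minimal sketch, assuming a hypothetical `SystemClock` class (not part of the original example) that exposes the same `now()` method as `FakeClock`; `AuthToken` and `FakeClock` are repeated from the fix so the block runs on its own:

```python
import time


class SystemClock:
    """Hypothetical production clock with the same interface as FakeClock."""

    def now(self):
        return time.time()  # wall clock, seconds since the epoch


class FakeClock:
    """Test clock: time moves only when the test says so."""

    def __init__(self, t=0.0):
        self.t = t

    def now(self):
        return self.t

    def advance(self, dt):
        self.t += dt


class AuthToken:
    def __init__(self, ttl_seconds, clock):
        self.clock = clock
        self.expires_at = clock.now() + ttl_seconds

    def valid(self):
        return self.clock.now() < self.expires_at


# Production wiring: the real clock is passed in explicitly.
session_token = AuthToken(ttl_seconds=3600, clock=SystemClock())

# Test wiring: both sides of the expiry boundary, with no real waiting.
clock = FakeClock()
token = AuthToken(ttl_seconds=2, clock=clock)
clock.advance(1.5)
still_valid = token.valid()  # before expiry
clock.advance(0.5)
expired = not token.valid()  # exactly at expiry
```

Passing the clock explicitly, rather than defaulting to the wall clock inside `AuthToken`, keeps the dependency visible at every call site; a default argument would be a reasonable alternative if call sites are numerous.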
Snippet 3: The two tests that hate each other
# Python: a module-level cache, plus two tests that share it.
_user_cache = {}

def get_user(uid):
    if uid not in _user_cache:
        _user_cache[uid] = db.load(uid)
    return _user_cache[uid]

def test_loads_from_db(mocker):
    spy = mocker.patch("mod.db.load", return_value=User(1, "Ada"))
    assert get_user(1).name == "Ada"
    spy.assert_called_once()  # expects exactly one DB load

def test_returns_cached(mocker):
    spy = mocker.patch("mod.db.load", return_value=User(1, "Ada"))
    get_user(1)
    get_user(1)
    assert spy.call_count == 1  # expects the cache to serve the 2nd call
What is the source of non-determinism? What concrete failure does it produce? How do you make it deterministic?
Answer
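The short version of the cure, ahead of the full diagnosis: give each test its own cache. A minimal sketch, assuming a hypothetical `UserCache` class and a `load` callable standing in for `db.load`; it uses `unittest.mock.Mock` in place of the `mocker` fixture so it runs without pytest plugins, and the `User` tuple is an illustrative stand-in for the real type:

```python
from collections import namedtuple
from unittest.mock import Mock

# Stand-in for the User type used in the snippet (hypothetical shape).
User = namedtuple("User", ["uid", "name"])


class UserCache:
    """Hypothetical per-instance cache; `load` stands in for db.load."""

    def __init__(self, load):
        self._load = load
        self._entries = {}

    def get_user(self, uid):
        if uid not in self._entries:
            self._entries[uid] = self._load(uid)
        return self._entries[uid]


def test_loads_from_db():
    load = Mock(return_value=User(1, "Ada"))
    cache = UserCache(load)  # fresh cache: nothing leaks in from other tests
    assert cache.get_user(1).name == "Ada"
    load.assert_called_once_with(1)


def test_returns_cached():
    load = Mock(return_value=User(1, "Ada"))
    cache = UserCache(load)
    cache.get_user(1)
    cache.get_user(1)
    assert load.call_count == 1  # second call served from this cache


# Both orders pass, because no state survives between the two tests.
test_returns_cached()
test_loads_from_db()
test_loads_from_db()
test_returns_cached()
```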
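**Source: shared mutable state / test-order dependence.** The module-level `_user_cache` persists across tests.

**Failure:** whichever test runs second in the same process sees a warm cache. If `test_returns_cached` runs first, it populates `_user_cache[1]`; then `test_loads_from_db` hits the cache, `db.load` is never called, and `spy.assert_called_once()` fails with 0 calls. In the opposite order, `test_returns_cached` finds the entry already cached and fails with `call_count == 0`. Each test passes when run alone, so the result depends on which tests share a process and in what order: randomized order (pytest-randomly), parallel workers, or running a single test all change which one goes red. Each test looks fine in isolation; the bug is the leaked global.

**Fix: reset the shared cache between tests**, for example by clearing `_user_cache` in a fixture that runs before every test. Better still, make the cache an injected object constructed per test, so there is no global to leak. Verify by running the suite in shuffled order, for example with pytest-randomly installed: it should be green in every order.

Snippet 4: The order that came back wrong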
// Java (JUnit 5): verifying which permissions a role grants.
@Test
void editorPermissions() {
    Set<String> perms = roleService.permissionsFor("editor"); // backed by a HashSet
    assertEquals(List.of("read", "write", "comment"),
                 new ArrayList<>(perms)); // order from the set
}
What is the source of non-determinism? What concrete failure does it produce? How do you make it deterministic?
Answer
**Source: ordering assumption.** `perms` is a `HashSet`, whose iteration order is unspecified, but the test asserts a specific ordered `List`.

**Failure:** `new ArrayList<>(perms)` yields the elements in `HashSet` iteration order, which depends on the elements' hash codes and the table's capacity and is not `["read", "write", "comment"]` in general. The test passes only while the set happens to iterate in that order. Because `String.hashCode()` is specified, the order tends to be stable for a given JDK and set, which is what lets the test pass review. It breaks when the JDK's `HashSet` implementation changes, when the set is built with a different capacity, or when a permission is added or removed and the internal layout shifts. If the service ever returns an immutable `Set.of(...)`, the iteration order can differ from one JVM run to the next.

**Fix: assert membership, not order.** Compare as a set, which is the real intent (these three permissions), for example `assertEquals(Set.of("read", "write", "comment"), perms)`, or sort both sides before comparing. Don't assert an order the producer never promised.

Snippet 5: The price from the internet
// Go: a test for the pricing client.
func TestGetPrice(t *testing.T) {
    client := NewPriceClient("https://api.exchange.example/v1") // real endpoint
    price, err := client.GetPrice("BTC")
    require.NoError(t, err)
    require.Greater(t, price, 0.0)
}
What is the source of non-determinism? What concrete failure does it produce? How do you make it deterministic?
Answer
**Source: external dependency.** The test hits a real, remote HTTP API over the network.

**Failure:** every non-deterministic property of the outside world becomes a flake: a DNS hiccup, a rate limit (429), a 200ms latency spike past the client timeout, the API being briefly down during their deploy, or simply running CI offline. None of these is a bug in your code, yet each can turn the build red intermittently. The test is also slow.

**Fix: fake the boundary with an in-process server returning a fixed response.**

func TestGetPrice(t *testing.T) {
    srv := httptest.NewServer(http.HandlerFunc(
        func(w http.ResponseWriter, r *http.Request) {
            // assert, not require: the handler runs on the server's goroutine,
            // and FailNow must be called from the goroutine running the test.
            assert.Equal(t, "/v1/price/BTC", r.URL.Path) // check the request too
            fmt.Fprint(w, `{"price": 42000.0}`)          // fixed, instant
        }))
    defer srv.Close()

    client := NewPriceClient(srv.URL + "/v1")
    price, err := client.GetPrice("BTC")
    require.NoError(t, err)
    require.Equal(t, 42000.0, price) // exact and repeatable
}
Snippet 6: The counter that lost count
// Go: test for a metrics counter incremented from many goroutines.
func TestCounter_Concurrent(t *testing.T) {
    c := &Counter{}
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            c.Inc() // c.value++ (plain int, no synchronization)
        }()
    }
    wg.Wait()
    require.Equal(t, 1000, c.value) // sometimes 994, 997, 1000...
}
What is the source of non-determinism? What concrete failure does it produce? How do you make it deterministic — and where is the real bug?
Answer
**Source: an async race, but the race is in the production code, not the test.** This is the important case from `professional.md`: a flaky test that is a **correct report of a real bug**.

**Failure:** `c.Inc()` does `c.value++`, a non-atomic read-modify-write. With 1000 goroutines racing on it, updates get lost (two goroutines read 500 and both write 501), so `c.value` intermittently ends up below 1000. The test itself synchronizes correctly (`wg.Wait()` guarantees every goroutine has finished) yet still flakes, because the bug is the data race in `Inc`.

**Do NOT "fix" the test** by sleeping, retrying, or asserting `>= 990`. That hides a real data-loss bug that shows up more or less often depending on CPU count and load.

**Surface the race reliably and fix production:** run the test under `go test -race`, which reports the unsynchronized access to `c.value` directly instead of waiting for a lost update to show up, then make `Inc` safe for concurrent use, for example with an atomic increment from `sync/atomic` or a mutex around the field. Now the test passes deterministically because the code is correct. The flaky test did its job: it found a concurrency bug.

Snippet 7: The random sample that sometimes breaks
# Python: testing a percentile function on random data.
import random

def test_p95_is_within_range():
    data = [random.random() for _ in range(1000)]  # fresh random data each run
    p95 = percentile(data, 95)
    assert 0.90 <= p95 <= 1.0  # "should usually be high"
What is the source of non-determinism? What concrete failure does it produce? How do you make it deterministic?
Answer
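Both cures for this snippet, sketched in one block ahead of the diagnosis: fixed input with an exact expected value, and a seeded local RNG with a true invariant. The sketch assumes a hypothetical nearest-rank `percentile` implementation, since the original does not show one, and the seed value is arbitrary:

```python
import math
import random


def percentile(data, p):
    """Hypothetical nearest-rank percentile; stands in for the function under test."""
    ordered = sorted(data)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def test_p95_known_input():
    # Fixed, known input: the expected answer can be worked out by hand.
    data = list(range(1, 101))  # 1, 2, ..., 100
    assert percentile(data, 95) == 95


def test_p95_stays_within_data_bounds():
    # Local, seeded RNG: the same 1000 values on every run, so a failure reproduces.
    rng = random.Random(12345)  # arbitrary seed
    data = [rng.random() for _ in range(1000)]
    p95 = percentile(data, 95)
    # A true invariant of any percentile, not a probabilistic guess.
    assert min(data) <= p95 <= max(data)


test_p95_known_input()
test_p95_stays_within_data_bounds()
```

Using `random.Random(seed)` rather than `random.seed(seed)` keeps the seeding local to the test, so it neither depends on nor disturbs the global RNG state that other tests may share.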
**Source: unseeded randomness.** The input is fresh random data every run, and the assertion encodes a probabilistic assumption about it.

**Failure:** the p95 of 1000 uniform(0, 1) samples is usually near 0.95, but it varies from run to run. With this sample size and bound a failure is rare; with a tighter bound or fewer samples it becomes common. Either way, when the assertion does fail you can't reproduce it, because the next run draws different data.

**Fix: pick the right cure for the intent.**
- If you're testing the **percentile algorithm**, use fixed, known input and assert the exact answer, with no randomness at all.
- If you genuinely want randomized inputs (property-style), **seed a local RNG** so failures reproduce, and assert a true invariant (e.g. `min(data) <= p95 <= max(data)`), not a probabilistic guess.

The flake came from unseeded randomness combined with a probabilistic assertion. Seed it, and assert what's actually invariant.

Snippet 8: The "today" report
// Java: a report that filters records "from this month".
List<Record> thisMonth(List<Record> all) {
    YearMonth now = YearMonth.now(); // reads the real clock
    return all.stream()
        .filter(r -> YearMonth.from(r.date()).equals(now))
        .collect(toList());
}

@Test
void filtersCurrentMonth() {
    var records = List.of(
        new Record(LocalDate.now()), // "this month", usually
        new Record(LocalDate.now().minusMonths(2)));
    assertEquals(1, thisMonth(records).size());
}
What is the source of non-determinism? What concrete failure does it produce? How do you make it deterministic?
Answer
**Source: the real clock / date boundary.** Both production (`YearMonth.now()`) and the test (`LocalDate.now()`) read the real calendar.

**Failure:** the test silently assumes that its two `LocalDate.now()` calls land in the same month as production's `YearMonth.now()`. Run it at 23:59:59 on the last day of the month and the clock can roll over between the calls, putting the "current" record in a different month, so the filter returns 0 and the assertion fails. It's a once-a-month, time-of-day flake nobody can reproduce on demand.

**Fix: inject a `Clock` and build test data relative to a fixed date.**

List<Record> thisMonth(List<Record> all, Clock clock) {
    YearMonth now = YearMonth.now(clock); // controllable
    return all.stream()
        .filter(r -> YearMonth.from(r.date()).equals(now))
        .collect(toList());
}

@Test
void filtersCurrentMonth() {
    Clock clock = Clock.fixed(Instant.parse("2026-06-15T12:00:00Z"), ZoneOffset.UTC);
    LocalDate today = LocalDate.now(clock); // fixed: 2026-06-15
    var records = List.of(
        new Record(today),                 // June 2026
        new Record(today.minusMonths(2))); // April 2026
    assertEquals(1, thisMonth(records, clock).size()); // deterministic, any time of day
}
Summary: patterns to spot
You diagnose a flake by asking "what uncontrolled input does this test's result secretly depend on?" The repeatable tells from these eight snippets:
- A `sleep` before an assertion is almost always a timing flake racing async work (Snippet 1). Replace it with a condition wait or a synchronization signal, and check for an accompanying data race on the shared variable.
- A real `sleep`/TTL pair, or `now()` / `today` / `YearMonth.now()`, is a clock flake that fails at boundaries (midnight, month-end, DST, CI's timezone) (Snippets 2, 8). Inject a clock.
- Tests that pass alone but fail depending on what ran before them point to shared mutable state: a module global, static field, or unreset cache (Snippet 3). Reset in teardown, or stop sharing.
- Asserting an ordered result from a set/map/dict is an ordering flake (Snippet 4). Compare as a set, or sort first; the producer never promised order.
- A real network/DB/filesystem call in a unit test inherits the outside world's non-determinism (Snippet 5). Fake the boundary.
- A flaky concurrency test may be reporting a real production race (Snippet 6). Don't stabilize the test; run under `-race` and fix the code. The flake is a true bug report.
- Unseeded randomness plus a probabilistic assertion is irreproducible by construction (Snippet 7). Seed a local RNG and assert a real invariant, or use fixed known input.
The meta-lesson: a flaky test is never "just flaky." It depends on an input it should control — and sometimes (Snippet 6) it's correctly reporting that the system is the non-deterministic one. Name the input, then either control it (fix the test) or fix the code it's exposing.
Related Topics
- junior.md: the seven sources of non-determinism, defined.
- middle.md: the worked cure for each source.
- tasks.md: fix these same flakes hands-on, end to end.
- professional.md: Snippet 6's deeper lesson, when the flaky test means the system is flaky.
- Concurrency Anti-Patterns → Shared State: the production races behind Snippets 1 and 6.
- Testing Anti-Patterns → Mystery Guest: hidden shared fixtures, a common cause of Snippet 3's order-dependence.