Flaky Tests & Reliability — Middle Level

Master the root-cause taxonomy of flakiness (async, ordering, isolation, concurrency, non-determinism, external deps, leaks, environment) and learn to detect and measure it instead of guessing.

Table of Contents

Introduction
Prerequisites
Glossary
Core Concept 1 -- The Root-Cause Taxonomy
Core Concept 2 -- Async & Timing
Core Concept 3 -- Test Ordering & Shared State
Core Concept 4 -- Isolation Failures: Leaked Singletons, Static & DB State
Core Concept 5 -- Non-Determinism: Random, Time, Iteration Order, Locale
Core Concept 6 -- External Dependencies, Concurrency & Resource Leaks
Core Concept 7 -- Detection & Measurement
Real-World Examples
Mental Models
Common Mistakes
Test Yourself
Cheat Sheet
Summary
Further Reading
Related Topics

Introduction

Focus: The complete root-cause taxonomy of flaky tests — recognizing each category and fixing it — plus how to detect and measure flakiness rather than rely on gut feel.

At the junior tier you learned that a flaky test is a broken test and that the trust it erodes is the real damage. You fixed your first sleep→poll case. Now you need the full diagnostic map: when a test is flaky, what kind of flaky is it, and what's the canonical fix?

This page is built around the root-cause taxonomy. Almost every flaky test in existence falls into one of these buckets. Learn to name the bucket on sight, and fixing becomes mechanical instead of mysterious.

Prerequisites

The junior page: flaky = broken, trust is the asset, sleep→poll.
Comfortable writing tests with setup/teardown and fixtures.
Basic understanding of threads/async, databases in tests, and CI.
Familiarity with Test Doubles and Test Data Management.

Glossary

Term	Meaning
Root cause	The underlying reason a test is non-deterministic, not the surface symptom.
Hermetic test	A test that depends on nothing outside its own controlled inputs — no real network, clock, or shared state.
Test isolation	The property that each test starts from a clean, independent state and leaves no residue.
Order dependence	When a test passes or fails depending on which tests ran before it.
Flakiness rate	The fraction of runs in which a given test fails despite no code change.
Polling / explicit wait	Repeatedly checking a condition with a timeout, instead of sleeping a fixed time.
Seed	A fixed starting value that makes a random generator reproducible.
Test pollution	When one test mutates shared state (DB rows, statics, globals) that another test reads.

Core Concept 1 -- The Root-Cause Taxonomy

Nearly every flaky test traces to one of these eight families. Internalize this list: it is the spine of the whole discipline.

#	Root cause	Smell	One-line fix
1	Async / timing	`sleep(n)`; "works locally, fails in CI"	Poll for a condition with a timeout
2	Order & shared state	Passes in suite, fails alone (or vice versa)	Make tests independent; reset state
3	Isolation failure	Leaked singletons, static fields, stray DB rows	Per-test isolation/transactions; reset statics
4	Concurrency / races	Fails ~1/N runs, no pattern	Fix the real race (in test or product)
5	Non-determinism	Random/time/map-order/locale in assertions	Seed it; inject a clock; sort; pin locale
6	External dependencies	Fails on network/3rd-party hiccup	Stub the dependency; make it hermetic
7	Resource leaks/exhaustion	Fails late in the run, ports/FDs/memory	Close resources; bound pools; clean up
8	Environment differences	CI vs. local, OS, timezone, container	Pin/normalize the environment

Mental shortcut: when a test flakes, walk the table top to bottom and ask "could it be this?" The first match is usually right.

Core Concept 2 -- Async & Timing

The most common family. The work happens asynchronously and the test guesses how long to wait.

// ❌ FLAKY: fixed sleep races the goroutine
func TestCacheWarmup(t *testing.T) {
    cache.WarmAsync()
    time.Sleep(100 * time.Millisecond) // guess
    if !cache.IsWarm() {
        t.Fatal("cache not warm")
    }
}

// ✅ STABLE: poll until ready, bounded by a timeout
func TestCacheWarmup(t *testing.T) {
    cache.WarmAsync()
    require.Eventually(t, cache.IsWarm, 5*time.Second, 10*time.Millisecond)
}

require.Eventually (testify) polls the condition every 10 ms for up to 5 s and fails the test if it never becomes true. Recognize this family by sleep calls and the "works on my machine" signature. Fix it by waiting on the actual post-condition: a status, a row, an element. Even better, expose a deterministic hook (a callback, a channel, a done signal) so the test waits on a real event rather than polling.
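The same two ideas can be sketched in Python with only the standard library. The `wait_until` helper and the `Cache` class below are hypothetical names invented for illustration, not part of any library: the first test polls a condition with a timeout, the second blocks on a real completion signal.

```python
import threading
import time


def wait_until(condition, timeout=5.0, interval=0.01):
    """Poll condition() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # Final check so a condition that flipped during the last sleep still counts.
    return bool(condition())


class Cache:
    """Hypothetical cache that warms itself on a background thread."""

    def __init__(self):
        self._warm = threading.Event()

    def warm_async(self):
        threading.Thread(target=self._warm.set, daemon=True).start()

    def is_warm(self):
        return self._warm.is_set()

    def wait_warm(self, timeout):
        # Deterministic hook: block on the actual completion event.
        return self._warm.wait(timeout)


def test_cache_warmup_by_polling():
    cache = Cache()
    cache.warm_async()
    assert wait_until(cache.is_warm, timeout=5.0, interval=0.01)


def test_cache_warmup_by_signal():
    cache = Cache()
    cache.warm_async()
    assert cache.wait_warm(timeout=5.0)
```

The timeout is an upper bound, not a guess: a fast machine returns as soon as the condition holds, and a slow one gets the full budget before the test fails.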

Core Concept 3 -- Test Ordering & Shared State

A test passes only because of what ran before it, or fails because a sibling polluted shared state.

# ❌ FLAKY (order-dependent): relies on a module-level list
USERS = []

def test_add_user():
    USERS.append("alice")
    assert len(USERS) == 1          # passes only if run first

def test_count_users():
    assert len(USERS) == 1          # passes only if test_add_user ran first

Run these in a different order (or in parallel) and at least one fails. The rule for recognizing it: a test that passes in the suite but fails when run alone, or fails when the suite is shuffled, is order-dependent. Modern runners can shuffle order to expose this (pytest -p randomly, which requires the pytest-randomly plugin; go test -shuffle=on).

# ✅ STABLE: each test owns its state
def test_add_user():
    users = []
    add_user(users, "alice")
    assert len(users) == 1

def test_count_users():
    users = ["alice"]
    assert count(users) == 1

The principle: every test must set up its own world and tear it down. Never let global lists, caches, or env vars carry data between tests.
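One way to enforce that principle with the standard library's unittest is to build state in `setUp` and register cleanup with `addCleanup`, which runs even when the test body raises. This is a sketch; `APP_MODE` is a made-up environment variable and `UserRegistryTest` is an illustrative name.

```python
import os
import unittest


class UserRegistryTest(unittest.TestCase):
    def setUp(self):
        # Fresh state for every test: nothing survives from a sibling.
        self.users = []
        # Set an env var for the test and restore it afterwards,
        # even if the test body fails part-way through.
        previous = os.environ.get("APP_MODE")  # APP_MODE is hypothetical
        os.environ["APP_MODE"] = "test"
        self.addCleanup(self._restore_env, "APP_MODE", previous)

    @staticmethod
    def _restore_env(name, previous):
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous

    def test_add_user(self):
        self.users.append("alice")
        self.assertEqual(len(self.users), 1)

    def test_starts_empty(self):
        # Holds in any order because setUp rebuilt the list.
        self.assertEqual(self.users, [])
        self.assertEqual(os.environ["APP_MODE"], "test")


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UserRegistryTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Both tests pass in either order, and the environment variable is back to its prior value once the suite finishes.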

Core Concept 4 -- Isolation Failures: Leaked Singletons, Static & DB State

A subtler cousin of ordering: state leaks through singletons, static fields, or a shared database.

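# ❌ FLAKY: a leaked singleton remembers the previous test
class FeatureFlags:        # process-wide singleton
    _instance = None

    def __init__(self):
        self.enabled = set()

    @classmethod
    def instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

def is_premium_visible():  # illustrative product code reading the singleton
    return "premium" in FeatureFlags.instance().enabled

def test_premium_on():
    FeatureFlags.instance().enabled.add("premium")
    assert is_premium_visible()      # leaves "premium" enabled!

def test_premium_off_by_default():
    assert not is_premium_visible()  # fails if test_premium_on ran first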

The singleton carries premium into the next test. Fix: reset shared state in teardown, or — better — don't use a process-wide singleton in tested code; inject the dependency so each test gets a fresh instance (see Test Doubles).
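A minimal sketch of the injected alternative, assuming the product code can be changed to accept its flags as a parameter. The constructor signature and `is_enabled` method are illustrative choices, not an existing API.

```python
class FeatureFlags:
    """Plain object instead of a process-wide singleton."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def is_enabled(self, name):
        return name in self._enabled


def is_premium_visible(flags):
    # The flags object is passed in, so no hidden global is consulted.
    return flags.is_enabled("premium")


def test_premium_on():
    assert is_premium_visible(FeatureFlags(enabled={"premium"}))


def test_premium_off_by_default():
    assert not is_premium_visible(FeatureFlags())
```

Each test constructs its own instance, so there is nothing to reset and the run order no longer matters.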

For databases, the canonical fix is transaction rollback or truncate-per-test:

# ✅ STABLE: each test runs in a transaction that is rolled back
@pytest.fixture
def db():
    conn = engine.connect()
    txn = conn.begin()
    yield conn
    txn.rollback()        # nothing leaks to the next test
    conn.close()

Stray rows from one test changing another's query results is one of the most common integration-test flakes. See Integration Testing and Test Data Management for the full isolation toolkit.
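The rollback pattern can be tried end to end with the standard library's sqlite3 and an in-memory database. This is a sketch under simple assumptions: the `users` table and the helper names are invented for illustration, and the connection uses sqlite3's default transaction handling, where an INSERT implicitly opens a transaction.

```python
import sqlite3
from contextlib import contextmanager


def make_database():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT NOT NULL)")
    conn.commit()  # the schema is shared; only test data gets rolled back
    return conn


@contextmanager
def rolled_back(conn):
    """Yield the connection, then discard whatever the test wrote."""
    try:
        yield conn
    finally:
        conn.rollback()


def count_users(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


def check_insert_is_visible(conn):
    with rolled_back(conn):
        conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
        assert count_users(conn) == 1


def check_table_starts_empty(conn):
    with rolled_back(conn):
        assert count_users(conn) == 0
```

Because the rollback sits in a `finally` block, a failing assertion inside the test still leaves the table clean for the next one.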

Core Concept 5 -- Non-Determinism: Random, Time, Iteration Order, Locale

The test reads a source of variability the team forgot to control.

Unseeded randomness — seed it:

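import random

# ❌ FLAKY: the module-level generator produces different values every run
# data = [random.random() for _ in range(100)]

# ✅ STABLE: a private generator with a fixed seed
rng = random.Random(42)
data = [rng.random() for _ in range(100)]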

Real time / dates — inject a clock instead of calling now():

// ❌ FLAKY: depends on the wall clock; the answer changes as real time passes
func IsExpired(t Token) bool { return time.Now().After(t.Expiry) }

// ✅ STABLE: inject the clock so tests control "now"
type Clock interface{ Now() time.Time }
func IsExpired(c Clock, t Token) bool { return c.Now().After(t.Expiry) }
// test uses a fixed clock: fixedClock{at: parse("2026-01-01T00:00:00Z")}

Map / set iteration order — don't assert on it:

// ❌ FLAKY: Go randomizes map iteration order
for k := range m { result = append(result, k) }
assert.Equal(t, []string{"a","b","c"}, result)

// ✅ STABLE: sort before asserting, or compare as a set
sort.Strings(result)
assert.Equal(t, []string{"a","b","c"}, result)

Locale / timezone — pin them: set TZ=UTC and a fixed locale in the test environment so number/date formatting is stable across machines.
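Alongside pinning the environment, formatting code can avoid the ambient settings altogether. The sketch below assumes two hypothetical helpers: one renders timestamps with an explicit UTC timezone, the other formats numbers with Python's locale-independent format specifier.

```python
from datetime import datetime, timezone


def format_timestamp(epoch_seconds):
    """Render an epoch timestamp in UTC, ignoring the machine's TZ setting."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


def format_amount(value):
    """Format with fixed separators rather than the ambient locale."""
    return f"{value:,.2f}"


def test_format_timestamp_is_utc():
    assert format_timestamp(0) == "1970-01-01 00:00:00 UTC"


def test_format_amount_uses_fixed_separators():
    assert format_amount(1234567.5) == "1,234,567.50"
```

These assertions hold regardless of the machine's TZ or locale, because neither function reads them.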

The unifying fix: push every non-deterministic input behind an injectable seam — clock, RNG, locale — so the test controls it. This is prevention-by-design; you'll formalize it at the senior tier.
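A compact Python sketch of the seam idea, with the clock and the random generator both passed in as parameters. `Token`, `is_expired`, and `pick_winner` are illustrative names; production code would pass something like `lambda: datetime.now(timezone.utc)` and a `random.Random()` instance.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Token:
    expiry: datetime


def is_expired(token, now):
    """`now` is a zero-argument callable returning an aware datetime."""
    return now() > token.expiry


def pick_winner(entrants, rng):
    """`rng` is any object with a choice() method, such as random.Random."""
    # Sort first so set iteration order cannot influence the result.
    return rng.choice(sorted(entrants))


FIXED_NOW = datetime(2026, 1, 1, tzinfo=timezone.utc)


def fixed_clock():
    return FIXED_NOW


def test_token_expiry():
    assert is_expired(Token(FIXED_NOW - timedelta(seconds=1)), fixed_clock)
    assert not is_expired(Token(FIXED_NOW + timedelta(seconds=1)), fixed_clock)


def test_pick_winner_is_reproducible():
    entrants = {"alice", "bob", "carol"}
    first = pick_winner(entrants, random.Random(42))
    second = pick_winner(entrants, random.Random(42))
    assert first == second
```

Neither test sleeps or reads the real clock, and the expiry boundary is checked one second on each side of a fixed instant.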

Core Concept 6 -- External Dependencies, Concurrency & Resource Leaks

External dependencies. A unit test that calls a real API, DNS, or third-party service will flake whenever the network does — and that failure is not your bug. Make such tests hermetic by stubbing the dependency (see Test Doubles). Reserve real-network calls for a small, clearly-labeled E2E layer (see End-to-End Testing).
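As a sketch of what "hermetic" looks like in code, the function below receives its rate lookup as a parameter, so a test can hand it an in-memory stub instead of a network client. All names here (`convert`, `stub_rates`, `ExchangeRateError`) are hypothetical.

```python
class ExchangeRateError(Exception):
    """Raised by a rate source when it cannot supply a rate."""


def convert(amount, currency, fetch_rate):
    """Convert `amount` using the rate returned by fetch_rate(currency)."""
    try:
        rate = fetch_rate(currency)
    except ExchangeRateError:
        return None  # the caller decides how to handle a missing rate
    return round(amount * rate, 2)


def stub_rates(table):
    """Build an in-memory replacement for the real rate service."""
    def fetch_rate(currency):
        if currency not in table:
            raise ExchangeRateError(currency)
        return table[currency]
    return fetch_rate


def test_convert_uses_supplied_rate():
    assert convert(10, "EUR", stub_rates({"EUR": 1.5})) == 15.0


def test_convert_handles_unavailable_rate():
    # The outage path is exercised on demand instead of waiting for a real one.
    assert convert(10, "XXX", stub_rates({})) is None
```

The stub also makes the failure path testable deterministically, which a real network dependency never allows.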

Concurrency / real races. A test that fails roughly 1 in N runs with no obvious pattern often points at a genuine data race, sometimes in the test, sometimes in the product code. Run with a race detector or stress harness (go test -race, ThreadSanitizer, OpenJDK's jcstress for the JVM). The concurrency-patterns topic covers synchronizing shared state correctly. Crucially: don't assume it's the test's fault; investigate whether the product is actually racy.
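The sketch below shows the shape of the fix rather than a detector: a hypothetical `Counter` whose read-modify-write is guarded by a lock, exercised by a test that joins its worker threads instead of sleeping. The thread and iteration counts are arbitrary.

```python
import threading


class Counter:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, this read-modify-write could interleave
        # across threads and lose updates.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def hammer(counter, threads=8, increments=1000):
    def work():
        for _ in range(increments):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()  # wait on real completion, not on a sleep
    return counter.value


def test_counter_is_thread_safe():
    assert hammer(Counter()) == 8 * 1000
```

If a test like this fails intermittently against real product code, that is evidence of a race in the product, not a reason to loosen the assertion.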

Resource leaks / exhaustion. Tests that pass early in the run and fail late — "address already in use," "too many open files," OOM — are leaking ports, file descriptors, connections, or memory. Fix: close everything in teardown (defer/try-with-resources/fixtures), bind to ephemeral ports (:0), and bound connection pools.
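A small sketch of the ephemeral-port and close-everything advice using the standard socket module. `start_listener` is a hypothetical helper; binding to port 0 asks the operating system to choose a free port.

```python
import socket
from contextlib import closing


def start_listener():
    """Bind to port 0 so the OS picks a free port; return (socket, port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", 0))
        server.listen()
    except OSError:
        server.close()  # do not leak the descriptor on a failed bind
        raise
    return server, server.getsockname()[1]


def test_two_listeners_do_not_collide():
    first, first_port = start_listener()
    with closing(first):
        second, second_port = start_listener()
        with closing(second):
            assert first_port != second_port
    # Both sockets are closed here, even if the assertion above had failed.
    assert first.fileno() == -1
    assert second.fileno() == -1
```

Hard-coding a port would make two such tests collide when run in parallel; letting the OS assign one removes that failure mode.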

Core Concept 7 -- Detection & Measurement

You cannot manage what you don't measure. Stop guessing whether a test is flaky — prove it.

Re-run to detect (not to hide). Run the suite or a suspect test many times:

go test -run TestOrder -count=100        # Go: run 100 times
pytest tests/test_orders.py --count=100  # pytest-repeat

If it fails some of the time, it's flaky. This is using reruns diagnostically — the opposite of using them to mask failures (see senior tier on retries).
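What those flags do can be sketched in a few lines: call the test many times on unchanged code and count the failures. The helpers below are illustrative, and `make_every_nth_failure` is a deliberately fake flaky test used only to make the demonstration reproducible.

```python
def run_repeatedly(test_fn, runs=100):
    """Call test_fn `runs` times; return how many calls raised AssertionError."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except AssertionError:
            failures += 1
    return failures


def classify(failures, runs):
    if failures == 0:
        return "stable in this sample"
    if failures == runs:
        return "consistently failing"
    return "flaky"


def make_every_nth_failure(n):
    """Stand-in for a flaky test: fails on every n-th call."""
    calls = {"count": 0}

    def test_fn():
        calls["count"] += 1
        assert calls["count"] % n != 0

    return test_fn
```

For example, `run_repeatedly(make_every_nth_failure(4), runs=100)` returns 25, which `classify` reports as "flaky". Zero failures in a sample only says the test was stable in that sample; a rare flake may need many more runs to surface.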

Flakiness rate as a metric. Track, per test, failures / total runs over a window of runs on unchanged code. A test that fails 2 of 100 such runs has a 2% flakiness rate. Trended over time, this number tells you whether reliability is improving.
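A minimal sketch of the metric, assuming run history is available as chronological `(test_name, passed)` pairs and that every recorded run was on unchanged code. The function name and the `window` parameter are illustrative.

```python
from collections import defaultdict


def flakiness_rates(runs, window=None):
    """Compute failures / total runs per test.

    runs:   iterable of (test_name, passed) pairs in chronological order.
    window: if set, only the most recent `window` runs of each test count.
    """
    history = defaultdict(list)
    for name, passed in runs:
        history[name].append(bool(passed))

    rates = {}
    for name, outcomes in history.items():
        recent = outcomes[-window:] if window else outcomes
        rates[name] = recent.count(False) / len(recent)
    return rates
```

With 98 passes followed by 2 failures for one test, the overall rate is 0.02, while a window of the last 10 runs gives 0.2. Comparing the two is a quick way to see that reliability is getting worse.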

Flaky-test detection in CI. Mature setups detect flakiness by re-running failed tests once and flagging any test that flips fail→pass as flaky — recording it, not hiding it. Tooling that does this includes:

pytest-rerunfailures (Python), Maven Surefire / Gradle retry plugins (JVM) — re-run on failure and surface that a retry happened.
go test -count=N driven by a CI script that loops a suspect test and records each failure.
Platform tools: Datadog Test Optimization, BuildPulse, Gradle Enterprise/Develocity flaky detection, and Google's internal flaky-test infrastructure — they aggregate fail-then-pass events into dashboards and per-test flakiness scores.

The non-negotiable rule: reruns are for identifying flakiness, never for silently making CI green. If a test needed a rerun to pass, that fact must be recorded and visible.
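The fail-then-pass rule can be sketched as a small classifier over per-test attempt results. The data shape (a dict of test name to ordered booleans) and both function names are assumptions made for illustration, not the format of any particular CI tool.

```python
def summarize_attempts(attempts):
    """Classify each test from its ordered attempt outcomes.

    attempts: dict of test name -> list of booleans, one per attempt.
    Returns a dict of test name -> "passed", "flaky", or "failed".
    """
    report = {}
    for name, outcomes in attempts.items():
        if outcomes and outcomes[0]:
            report[name] = "passed"
        elif any(outcomes[1:]):
            report[name] = "flaky"  # failed first, passed on a rerun
        else:
            report[name] = "failed"
    return report


def build_is_green(report):
    # A flaky result may not block the build, but it stays in the report.
    return "failed" not in report.values()
```

The important property is that "flaky" is its own recorded outcome: a test that needed a rerun never collapses into "passed".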

Real-World Examples

Shuffle exposes the rot. A team turns on go test -shuffle=on and 14 previously-green tests start failing: all were order-dependent, propped up by the fixed default run order. The suite had been lying for months.
The race detector earns its keep. A test flaked 1-in-200. -race revealed an unsynchronized map write in production code. The flaky test had been catching a real concurrency bug nobody believed.
TZ=UTC saves the build. A date-formatting test passed in CI (UTC) and failed for an engineer in UTC+9. Pinning TZ=UTC everywhere — local and CI — ended a recurring "works on my machine" argument.

Mental Models

Walk the taxonomy. Eight families. Name the bucket, apply the canonical fix.
Push variability behind a seam. Clock, RNG, network, locale — inject them so tests control them.
Hermetic or honest. A test should depend only on its own inputs; if it must touch the real world, label it and quarantine its flakiness risk.
Reruns diagnose, never disguise. Re-running to find flakiness is science; re-running to hide it is fraud.

Common Mistakes

Bumping the sleep instead of polling. A slower flaky test is still flaky.
Cleaning up in the test body, not teardown. If the test fails mid-way, cleanup is skipped and the next test inherits the mess. Use fixtures/defer/finally.
Asserting on unordered collections (maps, sets, parallel results) as if they were ordered.
Blaming the test for a real race. Run -race before deciding the product is fine.
Mocking the clock in one place, reading now() in another. Inject it everywhere or the flake survives.
Calling reruns a "fix." They're detection, not a cure.

Test Yourself

Name all eight root-cause families and give a one-line fix for each.
A test passes in the full suite but fails when run alone. Which family, and how do you fix it?
Show how to make a date-dependent function testable without sleep or wall-clock reads.
What does go test -shuffle=on (or pytest -p randomly) help you find?
Distinguish using reruns to detect flakiness from using them to hide it. Why does the distinction matter?

Cheat Sheet

ROOT-CAUSE TAXONOMY (name the bucket, apply the fix)
  1 Async/timing   sleep → poll (Eventually / wait_until)
  2 Order/state    pass-in-suite-fail-alone → isolate state; shuffle to detect
  3 Isolation      leaked singleton/static/DB → reset in teardown; txn rollback
  4 Concurrency    1/N failures → run -race; fix the REAL race (test or product)
  5 Non-determinism random→seed; time→inject clock; map→sort; locale/TZ→pin
  6 External deps   network/3rd-party → stub; make hermetic
  7 Resource leaks  fails late, ports/FDs/mem → close in teardown; ephemeral ports
  8 Environment     CI≠local → pin TZ/locale/OS/container

DETECTION
  Reproduce: go test -count=100 | pytest --count=100
  Shuffle:   go test -shuffle=on | pytest -p randomly
  Races:     go test -race | ThreadSanitizer
  Metric:    flakiness rate = failures / total runs (trend it)
  RULE: reruns DETECT flakiness, never HIDE it.

Summary

Nearly every flaky test belongs to one of eight families: async/timing, order & shared state, isolation failures, concurrency/races, non-determinism, external dependencies, resource leaks, and environment differences. For each, learn the smell and the canonical fix — and notice the unifying theme: push every uncontrolled input behind an injectable seam and isolate every test's state. Then measure: reproduce with high-count reruns, shuffle to expose order dependence, run race detectors, and track a per-test flakiness rate. Use reruns to detect flakiness, never to hide it — because a flaky test is a broken test, and the trust it erodes is the only asset your suite has.