Property-Based Testing — Senior Level¶
Roadmap: Testing → Property-Based Testing
Stateful/model-based PBT finds the ordering bugs no single-call property can. Add determinism, seed discipline, and the judgment of where PBT actually earns its keep.
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Core Concept 1 — Stateful / model-based testing
- Core Concept 2 — A worked stateful model in Hypothesis
- Core Concept 3 — Determinism, seeds, and reproducibility
- Core Concept 4 — PBT without flakiness in CI
- Core Concept 5 — Regression: pinning a counterexample
- Core Concept 6 — The tool landscape by ecosystem
- Core Concept 7 — Where PBT pays off, and where it doesn't
- Real-World Examples
- Mental Models
- Common Mistakes
- Test Yourself
- Cheat Sheet
- Summary
- Further Reading
- Related Topics
Introduction¶
Focus: stateful/model-based PBT, the determinism and seed discipline that keeps randomized tests trustworthy, and the engineering judgment of where to spend PBT effort.
Single-call properties (round-trip, invariant) catch a lot, but the gnarliest bugs are sequence bugs: this operation, then that one, then a third, leaves the system in an impossible state. Stateful/model-based PBT generates programs — random sequences of operations — and checks the system against a model after each step. This is where PBT finds the concurrency, ordering, and lifecycle bugs that destroy production systems and that no human writes a test for. This level also tackles the elephant in the room for senior engineers: randomized tests and CI flakiness, and how to get the bug-finding power of PBT without a red build that nobody trusts.
Prerequisites¶
- Middle level: property patterns, generators, integrated shrinking.
- Experience with stateful systems (caches, queues, connection pools, parsers with internal state).
- Familiarity with your CI's flakiness controls (see
../12-flaky-tests-and-reliability/).
Glossary¶
| Term | Meaning |
|---|---|
| Stateful / model-based testing | Generate random sequences of operations; check the system against a simplified model after each. |
| Command / rule | One generated operation (e.g., push(3), pop()). |
| Model | A simple, obviously-correct mirror of the system's expected state. |
| Precondition | A guard saying when a command is valid in the current state. |
| Postcondition | The check run after a command, comparing system to model. |
| Seed | The value that determines which inputs are generated; replay it to reproduce. |
| Database (Hypothesis) | The local store of previously-failing examples, replayed first each run. |
| Coverage-guided fuzzing | Mutation + coverage feedback to explore inputs; lives in dynamic-analysis (not PBT). |
Core Concept 1 — Stateful / model-based testing¶
The idea: instead of testing one call, you let the framework generate a sequence of operations against a real system and against a simple model of that system, then assert they agree at every step.
- Model — a trivially correct mirror. For a stack, the model is a Python list. For a key-value store, a dict. The model is allowed to be slow or naive; it just has to be obviously right.
- Commands — the operations the framework can choose from (
put,get,delete,flush). - Preconditions — when a command is legal (can't
popan empty stack). - Postconditions — after running a command on the real system, assert the result matches what the model predicts.
The framework generates a random program — say put(a,1); put(b,2); delete(a); get(a); flush() — runs it against both, and reports any divergence. Then it shrinks the program: it deletes operations and simplifies arguments until it has the shortest sequence that still triggers the divergence. A 40-operation failure shrinks to "put(x); delete(x); get(x) returns the old value" — a precise reproduction of a lifecycle bug.
This technique has found real bugs in databases, distributed systems, and file systems where single-call testing found nothing. It is the highest-leverage form of PBT for stateful code.
Core Concept 2 — A worked stateful model in Hypothesis¶
Hypothesis's RuleBasedStateMachine generates and shrinks operation sequences for you:
from hypothesis.stateful import RuleBasedStateMachine, rule, precondition, invariant
from hypothesis import strategies as st
class BoundedQueueModel(RuleBasedStateMachine):
"""Test a real BoundedQueue (capacity 3) against a plain list model."""
def __init__(self):
super().__init__()
self.system = BoundedQueue(capacity=3) # system under test
self.model = [] # the obvious-correct model
@rule(value=st.integers())
@precondition(lambda self: len(self.model) < 3)
def enqueue(self, value):
self.system.enqueue(value)
self.model.append(value)
@rule()
@precondition(lambda self: len(self.model) > 0)
def dequeue(self):
assert self.system.dequeue() == self.model.pop(0) # FIFO must agree
@invariant()
def size_matches(self):
assert self.system.size() == len(self.model)
@invariant()
def never_exceeds_capacity(self):
assert self.system.size() <= 3
TestBoundedQueue = BoundedQueueModel.TestCase
Hypothesis will weave together enqueues and dequeues in every order, check the FIFO contract and the invariants after each, and if it finds a divergence, hand you the minimal failing sequence. jqwik offers the same via @StateMachine/Action classes; fast-check via fc.commands([...]) with fc.modelRun. The pattern is identical across ecosystems: system + model + commands + invariants.
Core Concept 3 — Determinism, seeds, and reproducibility¶
PBT is randomized, but it must be reproducible. Every framework derives all its inputs from a seed — one number. Same seed ⇒ same inputs ⇒ same pass/fail. Reproducibility hinges on three rules:
- The framework reports the seed on failure. Hypothesis prints a
@reproduce_failuredecorator or the failing example itself; fast-check printsseedandpath; jqwik prints the random seed. Capture it from the CI log. - Replay the exact failure locally. fast-check:
fc.assert(prop, { seed: 1234, path: "5:2" }). Hypothesis: paste the printed@reproduce_failure(...)decorator above the test, or rely on its example database (.hypothesis/), which automatically replays the last failure first. - The property itself must be deterministic. If your property reads the wall clock, a global RNG, network, or filesystem ordering, the same input can pass once and fail next time — and now the seed doesn't reproduce anything. Inject the clock, seed any internal randomness, and avoid order-dependent reads. A non-reproducible PBT failure is almost always a non-deterministic property, not a framework quirk.
# Make the property deterministic: pass time in, don't read it inside.
@given(st.datetimes(), st.integers(min_value=0, max_value=10_000))
def test_token_expiry(issued_at, ttl_seconds):
token = make_token(issued_at, ttl_seconds)
assert not token.is_expired(at=issued_at) # fresh
assert token.is_expired(at=issued_at + timedelta(seconds=ttl_seconds + 1))
Core Concept 4 — PBT without flakiness in CI¶
The legitimate fear: "random tests = random red builds." Manage it deliberately.
Two valid strategies, pick one per suite:
- Fixed seed in CI (deterministic mode). Pin the seed so every CI run executes the same inputs. Builds are reproducible; a failure is a real failure, never noise. Cost: you stop discovering new inputs in CI — the property only ever sees one fixed sample set. Good for the merge-gating suite where determinism matters most.
- Random seed + report it (exploratory mode). Let CI roam new inputs every run for maximum bug-finding, but ensure the framework prints the seed on failure so any red build is reproducible. Cost: a green PR can go red later when CI rolls a new seed onto a latent bug — which is correct (the bug was always there) but surprises teams. Best run on a nightly/scheduled job, not as a blocking PR gate.
A common mature setup: deterministic, bounded PBT on the PR gate (fixed seed, modest example count) + exploratory PBT nightly (random seed, high example count, longer deadlines). New bugs surface in the nightly with a reproducible seed; the PR gate stays stable.
Other controls: cap per-example time (deadline in Hypothesis) so a slow generated input doesn't cause timeout flakiness; set a sensible max_examples (more in nightly, fewer on the gate); keep the .hypothesis/ example database out of CI caches if you want CI runs independent, or in a persistent cache if you want CI to remember and re-check past failures. The key principle: a PBT failure must always be reproducible from information in the build log. If it isn't, fix the property's determinism before touching seeds. See ../12-flaky-tests-and-reliability/.
Core Concept 5 — Regression: pinning a counterexample¶
When PBT finds a bug, the discovered counterexample is gold — but a future random run might not hit it again. Promote it to a permanent example test so it's checked on every run, deterministically, forever:
from hypothesis import given, example, strategies as st
@given(st.lists(st.integers()))
@example([0, 0]) # the shrunk counterexample PBT found
@example([-1, 2_147_483_648]) # an overflow case it found later
def test_sort_is_ordered(xs):
result = my_sort(xs)
assert all(result[i] <= result[i+1] for i in range(len(result)-1))
@example(...) forces that exact input to run in addition to generated ones (jqwik: @Example / a seeded data point; fast-check: add an explicit unit test). Now the regression is locked in regardless of seed, and the property keeps hunting for new bugs. This is the standard discipline: find with PBT, pin with @example. Hypothesis's example database does this automatically for the last failure, but @example makes it explicit, reviewable, and durable in source control.
Core Concept 6 — The tool landscape by ecosystem¶
| Ecosystem | Tool | Entry points | Notes |
|---|---|---|---|
| Python | Hypothesis | @given, strategies as st, @example, RuleBasedStateMachine | The reference implementation; best-in-class shrinking, example DB, stateful. |
| Java/JVM | jqwik | @Property, @ForAll, Arbitraries, @Provide, state machines | JUnit 5 platform; integrated shrinking. |
| JS/TS | fast-check | fc.property, fc.assert, arbitraries, fc.commands | Mature shrinking; seed/path replay; model-based via commands. |
| Rust | proptest, quickcheck | proptest! macro / Arbitrary trait | proptest has integrated shrinking + a persistence file; quickcheck is the classic port. |
| Go | gopter; testing.F (native fuzzing) | gopter generators/commands; f.Fuzz | Go's native go test -fuzz is coverage-guided fuzzing, adjacent but distinct (below). |
Where coverage-guided fuzzing fits. Fuzzing (libFuzzer, AFL, Go's native fuzzer, Atheris) also generates many inputs, but its engine is mutation + coverage feedback aimed at crashes/hangs/sanitizer trips, usually on byte/string inputs. PBT is property-driven, checks rich logical invariants, and emphasizes shrinking to a clean counterexample. They overlap (Hypothesis can run in a fuzzing mode; Go blurs the line) but serve different goals. Fuzzing is covered in ../../dynamic-analysis-and-sanitizers/ — don't duplicate it here. The senior heuristic: structured logical invariants → PBT; "does any byte sequence crash the parser?" → fuzzing; and a parser deserves both.
Core Concept 7 — Where PBT pays off, and where it doesn't¶
High ROI (write PBT first):
- Parsers, serializers, codecs — round-trip is free money; the input space is exactly where humans miss cases.
- Data structures — heaps, trees, ring buffers, LRU caches; oracle and stateful testing shine.
- Encoding / financial / date-time logic — rounding, overflow, time zones, money; clear invariants, catastrophic bugs.
- Anything with an algebra — sets, intervals, CRDTs; commutativity/associativity/idempotence are testable laws.
- State machines / protocols — stateful PBT finds ordering and lifecycle bugs nothing else does.
Poor fit (don't force it):
- Thin glue / CRUD with no logic — the only property is "it calls the dependency," which a mock-based unit test states more clearly.
- Code with no statable invariant — if you can't articulate a rule, PBT degenerates into a slow, vague example test.
- Heavily side-effecting code without a clean model — building the model costs more than it returns; reach for integration tests.
- UI rendering / look-and-feel — snapshot/approval testing (
../08-snapshot-and-approval-testing/) fits better.
The judgment: PBT's cost is the skill of finding properties and slightly slower tests; its payoff is finding the bug class you'd never enumerate. Spend it where invariants are clear and bugs are expensive.
Real-World Examples¶
- A connection pool. Stateful PBT generates acquire/release/timeout sequences; invariant: leased + idle == total, never over capacity, no double-lease. Finds the "release twice" and "leak on exception" bugs.
- A JSON/Protobuf codec. Round-trip across a recursive generator finds the unicode, deep-nesting, and integer-boundary losses; the failure pins as an
@example. - An interval/calendar library. Metamorphic + invariants on overlap/merge; stateful add/remove sequences catch the "merge leaves a gap" bug.
- A CRDT. Commutativity and convergence are the correctness laws; PBT is essentially the spec.
Mental Models¶
- Stateful PBT generates programs, not values. The counterexample is a minimal script that breaks your system.
- Determinism first, seeds second. If a failure won't reproduce, the property is non-deterministic — fix that before blaming randomness.
- Find with PBT, pin with
@example. Every discovered bug becomes a permanent, seed-independent regression. - PBT and fuzzing are cousins, not twins. Logical invariants vs. crash-hunting; a parser wants both.
Common Mistakes¶
- Non-deterministic properties. Reading the clock/RNG/network inside the property makes failures irreproducible and seeds useless.
- Random seed on the blocking PR gate with no plan for the surprise red build. Use fixed seed on the gate, random seed nightly.
- Skipping the model in stateful tests. Asserting ad-hoc facts after each command misses divergences; a real model catches them.
- Letting a discovered counterexample evaporate. If you don't pin it with
@example, the next seed may never find it again. - Forcing PBT onto glue code. No invariant ⇒ no value; use the right test type.
- Confusing PBT with fuzzing and writing both in the wrong section — keep coverage-guided fuzzing in dynamic-analysis.
Test Yourself¶
- Describe the four pieces of a stateful/model-based test and what each contributes.
- A PBT failure won't reproduce from the reported seed. What's the most likely cause and fix?
- Contrast fixed-seed and random-seed CI strategies; when do you use each?
- How do you turn a discovered counterexample into a durable regression?
- Give one system where PBT is high ROI and one where it's a poor fit, with reasons.
- How does coverage-guided fuzzing differ from PBT, and where is it documented?
Cheat Sheet¶
# Stateful (Hypothesis)
class M(RuleBasedStateMachine):
@rule(x=st.integers()) ... # generated commands
@precondition(lambda self: ...) # legality guard
@invariant() ... # checked after every step
# Determinism + reproducibility
# - inject clock/RNG; never read them inside the property
# - on failure: capture the seed from the log
# - fast-check: fc.assert(p, {seed, path}) ; Hypothesis: @reproduce_failure / .hypothesis DB
# CI strategy
# PR gate -> fixed seed, modest max_examples (stable, reproducible)
# nightly -> random seed, high max_examples (exploratory, reproducible via reported seed)
# Regression: find with PBT, pin with @example(<shrunk counterexample>)
Summary¶
Stateful/model-based PBT generates random sequences of operations and checks the system against a simple model, finding ordering and lifecycle bugs that single-call properties miss — and it shrinks the failing program to a minimal script. Reproducibility rests on seeds and on deterministic properties; if a failure won't replay, fix the property's determinism first. Run deterministic (fixed-seed) PBT on the merge gate and exploratory (random-seed, reported) PBT nightly to get bug-finding power without flaky gates. Always pin discovered counterexamples with @example. PBT pays off most on parsers, data structures, codecs, and protocols with clear invariants; it's a poor fit for invariant-free glue. The professional level covers PBT as specification, org-wide adoption, and combining it with mutation testing.
Further Reading¶
- John Hughes, "Testing the Hard Stuff and Staying Sane" (QuviQ / Erlang QuickCheck) — the stateful-PBT classic.
- Hypothesis docs — "Stateful testing" and "Reproducing failures."
- fast-check docs — "Model based testing" and "Replay a failure."
- The
property-based-testingskill (stateful patterns and seed discipline).
Related Topics¶
- Property-Based Testing — Middle — property patterns, generators, shrinking.
- Property-Based Testing — Professional — PBT as spec, adoption, ROI.
- Flaky Tests and Reliability — keeping randomized tests deterministic.
- Dynamic Analysis and Sanitizers — coverage-guided fuzzing.
- Mutation Testing — validating that your properties actually catch bugs.
In this topic
- junior
- middle
- senior
- professional