Unit Tests — Professional Level¶

Focus: the deep end and the debates. What a unit actually is, the schools that disagree about it, property-based and mutation testing as instruments that measure what coverage cannot, the genuine costs of mocking, and the famous critiques that every senior engineer should be able to argue both sides of.

Table of Contents¶

What is a "unit", really?
Solitary vs. sociable: the classicist/mockist split
The cost of over-mocking
Test-induced design damage and "TDD is dead"
Property-based testing: invariants over examples
Mutation testing: who tests the tests?
The theoretical limits of coverage
Testing time, concurrency, and nondeterminism
Snapshot testing and its pitfalls
Testing vs. formal methods
Common Mistakes
Test Yourself
Cheat Sheet
Summary
Further Reading
Related Topics

What is a "unit", really?¶

The word unit in "unit test" is the most contested term in the discipline, and most arguments about test quality are actually arguments about this definition in disguise.

There is no canonical answer. The original meaning, from Kent Beck's SUnit and the early XP literature, ties a unit to a test, not a class: the unit of isolation is the test itself — each test runs without depending on the state any other test left behind (Beck, Test-Driven Development: By Example, 2002). Under this reading, "unit" describes the isolation of tests from each other, not the size of the code under test.
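A minimal Python sketch of isolation in this sense, assuming a hypothetical `Counter` class that is not part of the original text. Each test builds its own fixture in `setUp`, so neither test depends on what the other left behind and the two pass in any order.

```python
import unittest


class Counter:
    """Hypothetical class under test."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1


class CounterTest(unittest.TestCase):
    def setUp(self):
        # A fresh fixture per test: nothing leaks from one test to the next.
        self.counter = Counter()

    def test_starts_at_zero(self):
        self.assertEqual(self.counter.value, 0)

    def test_inc_adds_one(self):
        self.counter.inc()
        self.assertEqual(self.counter.value, 1)
```

Had the two tests shared one module-level `Counter`, `test_starts_at_zero` would pass or fail depending on execution order, which is exactly the coupling this definition rules out.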

A later and now-dominant folk definition equates "unit" with "one class" or "one method." This is the reading that produces a mock for every collaborator. It is not what Beck meant, and treating it as a law produces the brittle suites discussed below.

Martin Fowler's synthesis is the practical one: a unit test is small, fast, and isolated from things that are slow or hard to control — the file system, the network, the clock, other processes — but the size of the unit is a judgement call (Fowler, UnitTest, 2014). The unit might be a single function, or it might be a cluster of cooperating objects that form one cohesive behaviour.

flowchart TD
    Q{What does "unit" mean to you?}
    Q -->|the test is isolated from other tests| Beck["Beck / xUnit (test isolation)"]
    Q -->|one class per test, mock everything else| Folk["Folk definition (solitary / mockist)"]
    Q -->|a cohesive cluster of collaborating objects| Fowler["Fowler / Detroit (sociable / classicist)"]
    Folk --> Risk1["Risk: tests coupled to implementation"]
    Fowler --> Risk2["Risk: harder to localize a failure"]
    Beck --> Pragmatic["Pragmatic baseline every school accepts"]

The reason the definition matters: it determines what you mock, and what you mock determines whether your tests enable change or obstruct it.

Solitary vs. sociable: the classicist/mockist split¶

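Fowler, adopting Jay Fields's terminology, distinguishes two styles of unit test (Fowler, UnitTest, 2014); the classicist/mockist split they map onto is the subject of his Mocks Aren't Stubs (2007):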

Solitary tests isolate the unit from all its collaborators, replacing each with a test double. The unit under test is genuinely alone.
Sociable tests let the unit use its real collaborators (the cheap, fast, deterministic ones) and only double out what is slow or non-deterministic.

These map onto two schools of TDD:

	Classicist / Detroit school	Mockist / London school
Origin	Kent Beck, the original C3/Chrysler XP team	Steve Freeman & Nat Pryce, GOOS (2009)
Default test style	Sociable	Solitary
What gets mocked	Only awkward collaborators (I/O, clock, network)	All collaborators outside the unit
Verification style	State — assert on the result/output	Behaviour — assert on interactions (calls made)
Design driver	Tests specify observable behaviour; design emerges through refactoring	Tests drive the discovery of interfaces ("need-driven development")
Risk	Failures harder to localize; a bug ripples across many tests	Tests couple to how the code works, not what it does

The London school is articulated most fully in Freeman & Pryce's Growing Object-Oriented Software, Guided by Tests (GOOS). Their insight is genuinely valuable: writing the test first, with mocks for not-yet-existing collaborators, discovers the roles and interfaces the object needs — "listen to the tests" is their refrain. When mocking is painful, that pain is design feedback that the object has the wrong responsibilities.

The classicist counter, argued by Beck and by Vladimir Khorikov in Unit Testing Principles, Practices, and Patterns (2020), is that mocking everything couples the test to the implementation's collaboration structure, which is exactly the thing refactoring changes. A sociable test that asserts on output survives any internal restructuring that preserves behaviour; a mockist test that asserts "the repository's save was called once with these arguments" breaks the moment you change how the work is delegated, even when the observable result is identical.

Khorikov's reconciliation, widely adopted now: mock only at the boundaries of the system that you don't control and that have observable, out-of-process effects, such as the message bus, the third-party API, or the SMTP server. Do not mock in-process collaborators whose only effect is internal. In his terms, mocks are for verifying communication with unmanaged dependencies; interaction-heavy mocking of in-process collaborators is a smell.

// Classicist / sociable: assert on the OUTCOME. The real Cart, real
// pricing rules, real discount engine all run. Only the clock is faked.
func TestCheckout_AppliesWeekendDiscount(t *testing.T) {
    clock := fixedClock(saturday)               // double the awkward thing only
    cart := NewCart(realPricing, realDiscounts) // real collaborators
    cart.Add(item("widget", 1000), 2)

    total := cart.Total(clock)

    if total != 1700 { // 2000 - 15% weekend discount
        t.Fatalf("got %d, want 1700", total)
    }
}

// Mockist / solitary: assert on the INTERACTION. Brittle — it knows
// that Total delegates to discounts.Apply, which refactoring may change.
func TestCheckout_DelegatesToDiscountEngine(t *testing.T) {
    discounts := &mockDiscounts{}
    cart := NewCart(realPricing, discounts)
    cart.Add(item("widget", 1000), 2)

    cart.Total(fixedClock(saturday))

    discounts.AssertCalledOnceWith(t, 2000, saturday) // couples to "how"
}

The first test passes as long as the answer is right. The second passes only as long as the mechanism is unchanged — and fails on a behaviour-preserving refactor, which inverts the whole point of having tests.
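Khorikov's rule can be sketched in Python as follows. This is an illustration under assumptions: `PriceList`, `Checkout`, and the `mailer.send` signature are hypothetical names, not taken from the original. The in-process collaborator is real; only the out-of-process mail boundary is doubled, and the single interaction assertion covers the externally visible side effect.

```python
from unittest.mock import Mock


class PriceList:
    """In-process collaborator: cheap and deterministic, so the test uses the real one."""

    def __init__(self, prices):
        self._prices = prices

    def price_of(self, sku):
        return self._prices[sku]


class Checkout:
    def __init__(self, price_list, mailer):
        self._price_list = price_list
        self._mailer = mailer  # out-of-process boundary (for example SMTP)

    def complete(self, email, skus):
        total = sum(self._price_list.price_of(s) for s in skus)
        self._mailer.send(email, f"Your total is {total}")
        return total


def test_checkout_totals_and_sends_receipt():
    mailer = Mock()  # the only double: the unmanaged boundary
    checkout = Checkout(PriceList({"widget": 1000, "gadget": 250}), mailer)

    total = checkout.complete("a@example.com", ["widget", "gadget"])

    assert total == 1250  # state verification on the outcome
    # One interaction assertion, on the externally observable side effect only.
    mailer.send.assert_called_once_with("a@example.com", "Your total is 1250")
```

How `Checkout` computes the total can be restructured freely; the test only breaks if the total or the outgoing message changes.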

The cost of over-mocking¶

"Tests enable change" is the promise. Over-mocking inverts it. Three concrete failure modes:

1. Tests that pass while production breaks. A mock encodes your belief about how a collaborator behaves. If that belief is wrong — the real HTTP client throws on a 204, the real DB returns rows in a different order, the real serializer drops nulls — the mock cannot know. Your test is green; production is red. This is the central argument for integration and contract tests as a complement: the mock asserts your assumption, the contract test verifies it. J.B. Rainsberger's "Integrated Tests Are a Scam" (2009) frames the trade as combinatorial: with N collaborators you cannot integration-test every path, so you push detail into focused tests and pin the assumptions with contract tests at each seam.

2. Tests coupled to implementation that block refactoring. Every verify(mock).method(args) is an assertion about internal structure. Refactoring is, by definition, behaviour-preserving structural change. A suite saturated with interaction assertions makes every refactor a test-rewrite, so people stop refactoring. The suite that was supposed to make change safe now taxes it.

3. The mock drifts from reality silently. Hand-written stubs and over-eager Mockito.when(...) chains rot. The real method gains a parameter or changes its contract; the mock still answers the old shape because it is typed loosely or because the mocking framework matches leniently.
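The drift in point 3 can be shown with Python's standard `unittest.mock`. In this sketch, `PaymentGateway` and its `charge` signature are hypothetical. A plain `Mock` accepts a call with the old argument list, while a double built with `create_autospec` checks calls against the real signature and raises `TypeError`.

```python
from unittest.mock import Mock, create_autospec


class PaymentGateway:
    """Hypothetical real collaborator whose charge() gained a `currency` parameter."""

    def charge(self, amount_cents, currency):
        raise NotImplementedError("talks to the network in production")


def call_with_stale_signature(gateway):
    # The caller still uses the old one-argument shape.
    return gateway.charge(1000)


def stale_call_is_rejected(gateway):
    try:
        call_with_stale_signature(gateway)
    except TypeError:
        return True
    return False


loose = Mock()  # accepts any call shape, so the drift goes unnoticed
strict = create_autospec(PaymentGateway, instance=True)  # enforces the real signature
```

With `loose` the stale call succeeds silently; with `strict` it fails in the test rather than in production. Autospec only checks call shape, not behaviour, so it narrows this failure mode without removing it.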

Mitigations, in order of preference:

Prefer fakes (a real, in-memory implementation of the interface) over mocks for stateful collaborators, for example an in-memory repository that actually stores and queries. A fake is exercised through the same interface as production and keeps real state, so it cannot return an answer that contradicts what was stored. The shared vocabulary here is Gerard Meszaros's taxonomy (dummy / stub / spy / mock / fake), popularized by Fowler.
Mock only out-of-process, you-don't-own-it boundaries (Khorikov).
Where you must double an owned boundary, back it with a consumer-driven contract (Pact) or a verified test double so the double's behaviour is checked against the real thing.

# Fake over mock: an in-memory repo exercised through the real interface.
# Cannot lie about behaviour the way a per-call mock can.
class InMemoryOrderRepo:
    def __init__(self): self._db = {}
    def save(self, order): self._db[order.id] = order
    def by_id(self, oid): return self._db.get(oid)

def test_reorder_uses_previous_items():
    repo = InMemoryOrderRepo()              # fake, not mock
    repo.save(Order(id=1, items=["a", "b"]))
    svc = OrderService(repo)

    new_id = svc.reorder(1)                  # assert OUTCOME...

    assert repo.by_id(new_id).items == ["a", "b"]   # ...not interactions
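One way to keep a fake honest is the "verified test double" idea from the mitigations list: run the same contract checks against the fake and against the real implementation. The sketch below is an assumption-laden illustration. The note repositories are hypothetical, and an in-memory SQLite database stands in for whatever the production store is.

```python
import sqlite3


class InMemoryNoteRepo:
    """Fake: keeps notes in a dict."""

    def __init__(self):
        self._notes = {}

    def save(self, note_id, text):
        self._notes[note_id] = text

    def by_id(self, note_id):
        return self._notes.get(note_id)


class SqliteNoteRepo:
    """Stand-in for the real repository, backed by an in-memory SQLite database."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, text TEXT)")

    def save(self, note_id, text):
        self._conn.execute(
            "INSERT OR REPLACE INTO notes (id, text) VALUES (?, ?)", (note_id, text)
        )

    def by_id(self, note_id):
        row = self._conn.execute(
            "SELECT text FROM notes WHERE id = ?", (note_id,)
        ).fetchone()
        return row[0] if row else None


def check_repo_contract(make_repo):
    """The same expectations run against every implementation."""
    repo = make_repo()
    assert repo.by_id(1) is None        # a missing id yields None
    repo.save(1, "first")
    assert repo.by_id(1) == "first"     # a saved value is readable
    repo.save(1, "second")
    assert repo.by_id(1) == "second"    # saving again overwrites


for factory in (InMemoryNoteRepo, SqliteNoteRepo):
    check_repo_contract(factory)
```

If the real repository's behaviour changes, the contract check fails for it, which signals that the fake needs the same update.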

Test-induced design damage and "TDD is dead"¶

In 2014 David Heinemeier Hansson (DHH) published "TDD is dead. Long live testing." and the follow-up "Test-induced design damage." His claim, sharpened in the subsequent Is TDD Dead? video conversations with Kent Beck and Martin Fowler, was that the dogmatic pursuit of fast, isolated, unit-testable code damages the design: it pushes teams to extract layers of indirection — service objects, repositories, hexagonal ports — whose only reason to exist is to make a mock injectable. You pay in architecture for a testing convenience.

This is a real phenomenon, not a strawman. Signs of test-induced design damage:

Interfaces with exactly one production implementation, created solely so a mock can implement them.
Constructors that take a dozen collaborators because every dependency was inverted "for testing."
Logic split across thin classes that have to be re-assembled in your head to understand a feature, because each was carved out to be "unit testable."

The Is TDD Dead? dialogue reached a nuanced settlement rather than a winner. Beck conceded that the mockist style can drive over-decoupling; Fowler reaffirmed sociable testing and self-testing code as the durable core; DHH accepted that having a fast test suite is valuable even if not test-first. The synthesis most teams now hold: TDD as a discipline is a tool, not a religion; test the behaviour through the widest interface that stays fast; do not contort the design to satisfy a mocking framework.

The deeper theoretical critique came from Jim Coplien, whose "Why Most Unit Testing Is Waste" (2014) and his recorded debate with Robert C. Martin argue that (a) most unit tests assert tautologies the developer already believed, (b) test code is a liability that must itself be maintained and can outweigh its value, and (c) system-level and architectural correctness — which he frames through Design by Contract and proper architecture — catches the bugs that matter more reliably than a sea of method-level tests. Martin's rebuttal defends TDD's role in enabling fearless change and in producing decoupled designs. You do not have to pick a side to extract the operational lesson: a test must justify its maintenance cost by either catching a regression a human would otherwise ship, or documenting behaviour that is otherwise unclear. A test that does neither is waste, exactly as Coplien says.

Property-based testing: invariants over examples¶

Example-based tests check that specific inputs produce specific outputs. They test the points you thought of. Bugs live in the points you didn't.

Property-based testing (PBT), invented by John Hughes and Koen Claessen with QuickCheck for Haskell (2000), inverts this: you state an invariant that must hold for all inputs in a domain, and the framework generates hundreds of randomized inputs trying to break it. When it finds a counterexample, it shrinks it to the minimal failing case.
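The generate-then-shrink loop can be sketched with the standard library alone. This is a toy stand-in for a real framework, not how QuickCheck or Hypothesis are implemented. `remove_value` is a deliberately buggy hypothetical function, and the shrinker only drops list elements.

```python
import random


def remove_value(xs, target):
    """Buggy on purpose: list.remove drops only the first occurrence."""
    out = list(xs)
    if target in out:
        out.remove(target)
    return out


def target_is_gone(case):
    """Property: after removal, the target no longer appears."""
    xs, target = case
    return target not in remove_value(xs, target)


def generate(rng):
    xs = rng.choices(range(5), k=rng.randint(0, 8))
    return xs, rng.randrange(5)


def shrink(case, prop):
    """Greedily drop list elements while the property keeps failing."""
    xs, target = case
    shrunk = True
    while shrunk:
        shrunk = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if not prop((candidate, target)):
                xs, shrunk = candidate, True
                break
    return xs, target


def find_counterexample(prop, seed=0, runs=200):
    rng = random.Random(seed)  # fixed seed so a failure can be replayed
    for _ in range(runs):
        case = generate(rng)
        if not prop(case):
            return shrink(case, prop)
    return None
```

Whatever failing list the generator stumbles on, the shrinker reduces it to a list holding exactly two copies of the target, which is the smallest input that exposes the bug.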

The implementations: QuickCheck (Haskell), Hypothesis (Python, by David MacIver), gopter and Go 1.18+ native testing.F fuzzing (Go), jqwik and junit-quickcheck (Java), fast-check (JS/TS), ScalaCheck (Scala), PropEr (Erlang).

The canonical property families:

Round-trip / inverse: decode(encode(x)) == x. Encoders, parsers, serializers.
Invariant preservation: sorting preserves length and multiset of elements; a balanced-tree insert preserves the balance invariant.
Idempotence: f(f(x)) == f(x) (normalization, dedup).
Commutativity / associativity of merges, monoidal combine.
Oracle / model-based: the implementation agrees with a slow-but-obviously-correct reference (e.g., your fast cache agrees with a plain dict).
Metamorphic: when you can't state the absolute answer, you can state how the output should change when the input changes (e.g., adding an item never decreases a cart total).
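Three of these families, sketched with a hand-rolled `for_all` helper so the example runs on the standard library. All function names are hypothetical, and a real framework would add shrinking and smarter generators.

```python
import random


def for_all(generate, prop, runs=200, seed=0):
    """Hand-rolled stand-in for a PBT framework: no shrinking, fixed seed."""
    rng = random.Random(seed)
    for _ in range(runs):
        case = generate(rng)
        assert prop(case), f"property failed for {case!r}"


def normalize(text):
    return " ".join(text.split())


def fast_unique_count(xs):
    return len(set(xs))


def slow_unique_count(xs):
    # Obviously-correct reference used as the oracle.
    seen = []
    for x in xs:
        if x not in seen:
            seen.append(x)
    return len(seen)


def cart_total(prices):
    return sum(prices)


def gen_text(rng):
    return "".join(rng.choice("ab \t\n") for _ in range(rng.randint(0, 12)))


def gen_ints(rng):
    return [rng.randint(0, 20) for _ in range(rng.randint(0, 10))]


def check_properties():
    # Idempotence: normalizing twice equals normalizing once.
    for_all(gen_text, lambda s: normalize(normalize(s)) == normalize(s))
    # Oracle: the fast implementation agrees with the slow reference.
    for_all(gen_ints, lambda xs: fast_unique_count(xs) == slow_unique_count(xs))
    # Metamorphic: adding a non-negative price never lowers the total.
    for_all(gen_ints, lambda xs: cart_total(xs + [5]) >= cart_total(xs))
```

None of the three properties names an expected output for a specific input; each states a relationship that must hold for every generated case.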

# Hypothesis: a round-trip invariant catches encoder bugs that
# hand-picked examples miss — e.g., empty strings, unicode, embedded delimiters.
from hypothesis import given, strategies as st

@given(st.dictionaries(st.text(), st.text()))
def test_querystring_roundtrip(params):
    assert parse_qs(encode_qs(params)) == params
# Hypothesis will report a *shrunk* counterexample like {'': ''} or {'a': '&'}

// Go 1.18+ native fuzzing is property-based testing with coverage-guided
// input generation. The corpus + invariant find the inputs you didn't.
func FuzzRoundTrip(f *testing.F) {
    f.Add("hello")
    f.Fuzz(func(t *testing.T, s string) {
        if got := Decode(Encode(s)); got != s {
            t.Fatalf("round-trip failed: %q -> %q", s, got)
        }
    })
}

// jqwik: an invariant (sortedness + permutation), not an example.
@Property
void sortingProducesOrderedPermutation(@ForAll List<Integer> xs) {
    List<Integer> sorted = MySort.sort(xs);
    assertThat(sorted).isSorted();
    assertThat(sorted).containsExactlyInAnyOrderElementsOf(xs);
}

The payoff is not just bug-finding; it is that stating the property forces you to articulate what the code is actually supposed to do — often the hardest and most valuable part. Hughes' "QuickCheck Testing for Fun and Profit" and his "How to Specify It!" (2019) are the foundational reads. The cost: PBT needs deterministic, side-effect-bounded code to be tractable, and flaky generators can produce hard-to-reproduce failures (mitigated by recording the seed).

Mutation testing: who tests the tests?¶

Coverage tells you which lines ran. It says nothing about whether your assertions would notice if those lines were wrong. A test that calls a function and asserts nothing can execute every line of it, so it reports full coverage while catching nothing short of a crash.

Mutation testing measures the thing coverage cannot. The tool introduces small faults — mutants — into the production code (flip < to <=, + to -, && to ||, delete a statement, replace a return with a constant) and reruns the suite against each mutant. If a test fails, the mutant is killed — good, your suite detects that fault. If all tests still pass, the mutant survived — your suite is blind to that fault. The mutation score (killed / total non-equivalent mutants) is a far stronger quality signal than line coverage.
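The kill/survive bookkeeping can be illustrated by hand. In this sketch the mutants are written out manually as lambdas, whereas a real tool generates them from the source. `qualifies_for_free_shipping` and both suites are hypothetical.

```python
def qualifies_for_free_shipping(total_cents):
    return total_cents >= 1000


# Hand-written mutants of the function above (a real tool generates these).
MUTANTS = {
    "boundary >= to >": lambda total: total > 1000,
    "negated >= to <": lambda total: total < 1000,
    "constant True": lambda total: True,
}


def weak_suite(fn):
    """Executes the function once, far from the boundary."""
    return fn(5000) is True


def strong_suite(fn):
    """Also pins the boundary value and the value just below it."""
    return fn(5000) is True and fn(1000) is True and fn(999) is False


def mutation_score(suite):
    assert suite(qualifies_for_free_shipping), "suite must pass on the original"
    killed = [name for name, mutant in MUTANTS.items() if not suite(mutant)]
    return len(killed), len(MUTANTS)
```

Both suites execute every line of the function, so line coverage cannot tell them apart. The weak suite kills one mutant of three; the strong suite kills all three because it checks the boundary.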

Tools: PITest (PIT) for Java/JVM — the mature standard; mutmut and cosmic-ray for Python; go-mutesting and gremlins for Go; Stryker for JS/TS/C#.

The theory is old: mutation testing was proposed by DeMillo, Lipton & Sayward (1978) and rests on two hypotheses: the competent programmer hypothesis (real bugs are small deviations from correct code) and the coupling effect (tests that catch simple faults also catch complex ones). Empirical work (Just et al., FSE 2014, "Are Mutants a Valid Substitute for Real Faults in Software Testing?") found that mutant detection correlates significantly with real-fault detection, independently of code coverage.

The two hard problems:

Equivalent mutants: a mutant that changes the code but not its behaviour (e.g., i <= n vs i < n+1). It can never be killed; it deflates your score and must be excluded by hand. Detecting equivalence is undecidable in general — this is the practical tax on mutation testing.
Cost: running the whole suite once per mutant is expensive. Mitigations: mutate only changed files in CI (incremental analysis, supported by PIT and Stryker), test selection (run only tests covering the mutated line), and parallelism.

# PITest output excerpt — the survived mutants are your real to-do list:
> Generated 142 mutations, Killed 128 (90%)
> SURVIVED  changed conditional boundary  Pricing.java:54  (< -> <=)
> SURVIVED  removed call to log.audit(..)  Order.java:88

A 90% mutation score with a surviving boundary mutant on a pricing rule is a direct, actionable signal: you have no test that distinguishes "1000 or more" from "more than 1000." No coverage report would ever tell you that.

The theoretical limits of coverage¶

Coverage metrics form a lattice of increasing strength, and senior engineers should know what each does and does not guarantee.

Statement coverage: every line executed. Weakest. Satisfied by code that runs but is never checked.
Branch coverage: every edge of every decision taken (both the true and false of each if).
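Condition coverage: every boolean sub-expression takes both values. On its own it does not imply branch coverage: for a && b, the inputs (true, false) and (false, true) give each condition both values while the decision is false both times.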
MC/DC (Modified Condition/Decision Coverage): every condition independently affects the decision's outcome. Mandated by DO-178C Level A for avionics software — the regulatory recognition that branch coverage is insufficient for compound conditions.
Path coverage: every path through the control-flow graph. Infeasible in general — a loop introduces an unbounded number of paths; the count is exponential in the number of branches.
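The gap between branch coverage and MC/DC can be checked mechanically for a small decision. The sketch below assumes the unique-cause form of MC/DC and a hypothetical decision `a and (b or c)`. It looks for pairs of test vectors that differ in exactly one condition and flip the outcome.

```python
from itertools import product


def decision(a, b, c):
    return a and (b or c)


def independence_pairs(decide, n):
    """For each condition, the vector pairs that differ only there and flip the outcome."""
    pairs = {i: [] for i in range(n)}
    for v in product([False, True], repeat=n):
        for i in range(n):
            if not v[i]:
                w = v[:i] + (True,) + v[i + 1:]
                if decide(*v) != decide(*w):
                    pairs[i].append((v, w))
    return pairs


def satisfies_mcdc(tests, decide, n):
    """True if every condition has an independence pair inside the test set."""
    chosen = set(tests)
    pairs = independence_pairs(decide, n)
    return all(
        any(v in chosen and w in chosen for v, w in pairs[i]) for i in range(n)
    )


# Four vectors (n + 1 for three conditions) are enough here.
MCDC_SET = [
    (True, True, False),
    (False, True, False),
    (True, False, False),
    (True, False, True),
]
# Takes the decision both ways, so it achieves branch coverage only.
BRANCH_ONLY = [(True, True, True), (False, False, False)]
```

`BRANCH_ONLY` drives the decision to both true and false yet never shows that any single condition matters on its own, which is the weakness MC/DC is designed to close.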

Three theoretical facts worth internalizing:

100% coverage is not correctness. Coverage measures execution, not verification. It cannot detect a missing branch: code you forgot to write has no line to be covered. It cannot detect a wrong-but-covered assertion. Dijkstra's dictum applies in full: "Program testing can be used to show the presence of bugs, but never to show their absence!" (Notes on Structured Programming, 1970).
Coverage as a target corrupts it (Goodhart's Law). Mandating 100% coverage produces assertion-free tests and trivial getter tests written to hit lines, not to verify behaviour. The metric goes up; quality does not.
The right use of coverage is finding holes, not proving completeness. A coverage report's value is the uncovered lines — they tell you what no test touches. The covered lines tell you nothing about quality; that is mutation testing's job.

Testing time, concurrency, and nondeterminism¶

Determinism is the precondition for a useful unit test. The four classic sources of nondeterminism, and how to neutralize each:

Time and clocks. Never call System.currentTimeMillis(), time.Now(), or datetime.now() directly in code under test. Inject a clock abstraction. Java has java.time.Clock (Clock.fixed(...) in tests). Go: pass a func() time.Time or a clock interface. Python: inject a callable, or use freezegun/time-machine. This makes "expires after 30 days," "applies weekend discount," and "retries with backoff" deterministic.

// Inject java.time.Clock; tests pump a fixed or steppable instant.
class TokenService {
    private final Clock clock;
    TokenService(Clock clock) { this.clock = clock; }
    boolean expired(Token t) { return t.issuedAt().plus(TTL).isBefore(clock.instant()); }
}
// test: new TokenService(Clock.fixed(t0.plus(TTL).plusSeconds(1), UTC))
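The same injection idea makes "retries with backoff" testable without waiting. A minimal Python sketch, assuming a hypothetical `retry` helper: the sleep function is a parameter, so the test passes a recorder while production code would pass `time.sleep`.

```python
def retry(operation, sleep, attempts=4, base_delay=0.5):
    """Call operation until it succeeds, sleeping base_delay * 2**n between tries."""
    for n in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** n)


def test_retry_backs_off_exponentially():
    delays = []  # the fake sleep records the delay instead of waiting
    outcomes = [ConnectionError(), ConnectionError(), "ok"]

    def flaky():
        outcome = outcomes.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    assert retry(flaky, sleep=delays.append) == "ok"
    assert delays == [0.5, 1.0]
```

The test runs instantly and asserts the exact backoff schedule, which a test built on real sleeping could only approximate.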

Concurrency. A test that spawns goroutines/threads and asserts on shared state without first synchronizing with them is racy by construction. Strategies: (1) prefer testing the pure, concurrency-free core and the synchronization separately; (2) run the suite under a race detector where one exists (Go's -race, which is built on ThreadSanitizer), and on the JVM use jcstress for JMM-level claims; (3) make the schedule deterministic by passing an executor you control, so you decide ordering. jcstress (the Java Concurrency Stress harness) is the serious tool for asserting memory-model properties; it runs each test a very large number of times and classifies the observed results.

// Always run concurrency tests under the race detector in CI:
//   go test -race ./...
func TestCounter_Concurrent(t *testing.T) {
    c := NewCounter()
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() { defer wg.Done(); c.Inc() }()
    }
    wg.Wait()
    if c.Value() != 1000 { t.Fatalf("race lost updates: %d", c.Value()) }
}
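Strategy (3), making the schedule deterministic, might look like this in Python. `InlineExecutor` and `fetch_lengths` are hypothetical; the executor subclasses the standard `concurrent.futures.Executor` and runs every task immediately on the calling thread.

```python
from concurrent.futures import Executor, Future


class InlineExecutor(Executor):
    """Runs each task immediately on the calling thread, in submission order."""

    def submit(self, fn, /, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def fetch_lengths(urls, fetch, executor):
    futures = [executor.submit(fetch, url) for url in urls]
    return [len(f.result()) for f in futures]


def test_fetch_lengths_is_deterministic():
    calls = []

    def fake_fetch(url):
        calls.append(url)
        return "x" * len(url)

    result = fetch_lengths(["a", "bbb"], fake_fetch, InlineExecutor())

    assert result == [1, 3]
    assert calls == ["a", "bbb"]  # order is fixed because nothing runs in parallel
```

This verifies the orchestration logic only. It says nothing about thread safety, which still needs a race detector or a stress harness.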

Randomness. Inject the RNG or its seed. A test that depends on rand without a fixed seed is flaky; one with a fixed seed is reproducible. (Property-based tools do this for you and print the seed on failure.)
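A small sketch of RNG injection, with a hypothetical `pick_winner` function. Because the generator is a parameter, the test chooses the seed, and two runs with the same seed must agree.

```python
import random


def pick_winner(entrants, rng):
    """The RNG is a parameter, so tests decide the seed."""
    # Sorting first removes the set's iteration order as a second source of variation.
    return rng.choice(sorted(entrants))


def test_same_seed_same_winner():
    entrants = {"ana", "bo", "cy", "di"}
    first = pick_winner(entrants, random.Random(1234))
    second = pick_winner(entrants, random.Random(1234))
    assert first == second
    assert first in entrants
```

The test deliberately does not hard-code which entrant wins, since that depends on the generator's algorithm rather than on the behaviour under test.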

Ordering and external state. Map/dict iteration order (Go randomizes it deliberately), filesystem listing order, DB result order without ORDER BY — all nondeterministic. Assert on sets, or sort before comparing.
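For example, assuming a hypothetical `active_usernames` function that builds its result from a set, the test below avoids depending on iteration order in either of the two ways just described.

```python
def active_usernames(users):
    """Returns names in whatever order the set yields them."""
    return list({u["name"] for u in users if u["active"]})


def test_active_usernames_ignores_order():
    users = [
        {"name": "ana", "active": True},
        {"name": "bo", "active": False},
        {"name": "cy", "active": True},
    ]
    result = active_usernames(users)
    assert sorted(result) == ["ana", "cy"]   # sort before comparing
    assert set(result) == {"ana", "cy"}      # or compare as sets
```

Asserting `result == ["ana", "cy"]` directly would pass on some runs and fail on others, because string hashing is randomized per process by default.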

Flaky tests are a CI emergency, not a nuisance. Google's research (Micco; Memon et al., "Taming Google-Scale Continuous Testing," ICSE-SEIP 2017) found ~16% of their tests exhibited some flakiness and that flaky failures train engineers to ignore red builds — destroying the signal value of the entire suite. The discipline: quarantine a flaky test immediately (so it stops poisoning the build), file a ticket, and fix or delete it on a deadline. A retry-until-green wrapper is a last resort that hides the root cause and must be paired with tracking.

Snapshot testing and its pitfalls¶

Snapshot (a.k.a. golden-master / approval) testing records the output of a function on the first run and asserts equality against that recording thereafter. Jest popularized it for UI component trees; the same idea appears earlier and more generally in ApprovalTests (Llewellyn Falco) and in Michael Feathers' "characterization tests" for legacy code.

It is genuinely valuable in two situations: pinning the behaviour of legacy code you don't yet understand before refactoring it (Feathers, Working Effectively with Legacy Code), and capturing large structured outputs (rendered HTML, serialized ASTs) where hand-writing assertions is impractical.

The pitfalls are severe and specific:

Rubber-stamping. When a snapshot diff appears, the path of least resistance is --updateSnapshot. Done reflexively, it accepts bugs as the new truth. A snapshot test only works if a human actually reads the diff — and most don't, especially under deadline.
Snapshots assert everything, including what you don't care about. A timestamp, a random ID, or a reordered map in the output makes the snapshot fail for reasons unrelated to the behaviour under test — generating noise that trains the rubber-stamp reflex. Mitigation: normalize/redact volatile fields before snapshotting.
No expressed intent. A snapshot says "the output is this" but never "because of this rule." Reading the test tells you nothing about why the output should be what it is — the opposite of a test as executable specification.
Giant snapshots couple to layout. A 500-line component snapshot breaks on any markup tweak, even a purely cosmetic one, producing the same refactor-tax as over-mocking.
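The normalize/redact mitigation can be sketched as a small pre-processing step. The payload shape, the placeholder strings, and the two regular expressions are assumptions for illustration; a real project would redact whatever its own output makes volatile.

```python
import json
import re

ISO_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")
UUID = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def normalize_snapshot(payload):
    """Serialize with stable key order, then redact volatile values."""
    text = json.dumps(payload, sort_keys=True, indent=2)
    text = ISO_TIMESTAMP.sub("<timestamp>", text)
    return UUID.sub("<uuid>", text)


GOLDEN = """{
  "created_at": "<timestamp>",
  "id": "<uuid>",
  "status": "paid"
}"""


def test_receipt_snapshot():
    payload = {
        "status": "paid",
        "id": "3f2b8c1e-9a4d-4e6f-8b2a-1c5d7e9f0a3b",
        "created_at": "2024-05-01T12:30:00Z",
    }
    assert normalize_snapshot(payload) == GOLDEN
```

With key order fixed and volatile values replaced, a snapshot diff now means the meaningful part of the output changed, which makes the diff worth reading.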

The rule: snapshot tests are a characterization and legacy-pinning tool, not a substitute for behavioural assertions. Keep them small, redact nondeterminism, and treat an unread snapshot update in code review as you would treat a disabled assertion.

Testing vs. formal methods¶

Tests sample the input space; they cannot, even in principle, prove the absence of bugs across an infinite domain. Hillel Wayne is the clearest contemporary voice on what lies beyond testing and where it is worth going (Practical TLA+, 2018; "Formal methods, the QuickCheck way," and his essays).

The spectrum, from cheapest/weakest to most expensive/strongest:

Example-based tests — sample specific points.
Property-based tests — sample randomized points against an invariant (orders of magnitude more points, still sampling).
Type systems — prove a class of properties for all inputs at compile time (a NonEmptyList can't be empty; an exhaustive match handles every variant). This is verification, not testing.
Formal specification & model checking — TLA+, Alloy, P. You write a specification and a model checker explores the entire (bounded) state space, finding interleavings no test would reach. This is where AWS found real bugs in S3, DynamoDB, and EBS that testing had not (Newcombe et al., "How Amazon Web Services Uses Formal Methods," CACM 2015).
Mechanized proof — Coq, Isabelle, Lean, the F* of Project Everest. Proves correctness for all inputs, unbounded. Cost is enormous; reserved for protocols, crypto, compilers (CompCert), and kernels (seL4).
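To make "explores the entire (bounded) state space" concrete, here is a toy exhaustive search in Python. It is not TLA+ and not how a production model checker is built. It models two hypothetical processes that each increment a shared counter in two non-atomic steps (read, then write) and enumerates every interleaving.

```python
def step(state, pid):
    """Advance process `pid` by one atomic step; return None if it has finished."""
    pcs, local, shared = state
    pcs, local = list(pcs), list(local)
    if pcs[pid] == 0:        # step 0: read the shared counter
        local[pid] = shared
    elif pcs[pid] == 1:      # step 1: write back local + 1
        shared = local[pid] + 1
    else:
        return None
    pcs[pid] += 1
    return tuple(pcs), tuple(local), shared


def final_counter_values():
    """Exhaustively explore every interleaving of two non-atomic increments."""
    initial = ((0, 0), (0, 0), 0)
    seen, stack, finals = {initial}, [initial], set()
    while stack:
        state = stack.pop()
        successors = [s for s in (step(state, 0), step(state, 1)) if s is not None]
        if not successors:
            finals.add(state[2])  # both processes have finished
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return finals
```

The search reports two possible final values, 1 and 2. The value 1 is the lost update that arises when both processes read before either writes, an interleaving a conventional test would hit only by luck.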

Wayne's pragmatic framing: property-based testing is the on-ramp from examples toward formal thinking, because both force you to specify properties rather than enumerate cases. The senior judgement is matching the rigor to the cost of being wrong: example tests for a CRUD endpoint; property tests for a serializer or a money library; TLA+ for a distributed-consensus or concurrency protocol where an undiscovered interleaving costs data loss. You do not formally verify a button handler, and you do not unit-test your way to confidence in a Paxos variant.

Common Mistakes¶

Equating "unit" with "class" and mocking every collaborator. Produces brittle, implementation-coupled tests that break on behaviour-preserving refactors — the opposite of "tests enable change." Mock only out-of-process, unowned boundaries.
Asserting on interactions when you could assert on output. verify(mock).save(...) couples to mechanism; assertEquals(expected, result) couples to behaviour. Prefer state verification.
Treating 100% coverage as proof of quality. Coverage measures execution, not verification. Use it to find untested code; use mutation testing to judge test strength.
Chasing a coverage target. Goodhart's Law: mandating 100% breeds assertion-free getter tests. The number rises while quality falls.
Leaving flaky tests in CI. One flaky test trains the whole team to ignore red. Quarantine on sight; fix or delete on a deadline.
Hidden time/RNG/concurrency dependencies in the unit. time.Now() or unseeded rand in code under test makes it untestable and flaky. Inject the clock and the seed.
Rubber-stamping snapshot updates. --updateSnapshot without reading the diff accepts bugs as the new baseline. Snapshots are a legacy-pinning tool, not a behavioural assertion.
Inverting every dependency "for testability." Test-induced design damage: single-implementation interfaces and twelve-collaborator constructors that exist only to inject a mock. Decouple for design reasons, not framework convenience.
Writing tautological tests. A test that re-implements the production logic to "check" it, or asserts something the type system already guarantees, is the waste Coplien warns about. Each test must catch a regression or document non-obvious behaviour.

Test Yourself¶

A teammate insists every test must mock all collaborators "because that's what a unit test is." How do you respond, citing the literature?

Answer

The premise — "unit" means "one class, mock the rest" — is the *folk* definition, not the original one. Beck's xUnit meaning is **test isolation from other tests**; Fowler defines a unit test as small/fast/isolated-from-slow-things, with the size of the unit a judgement call. The all-mocks style is the **London/mockist** school (Freeman & Pryce, GOOS); the **Detroit/classicist** school (Beck, Khorikov) uses sociable tests and reserves mocks for out-of-process boundaries. The concrete cost of all-mocking: tests couple to *how* the code collaborates, so behaviour-preserving refactors break them — inverting "tests enable change." Recommend mocking only unowned, out-of-process dependencies and asserting on output, not interactions.

Your suite has 95% line coverage. Is it a good suite? What would you measure instead?

Answer

Unknown from coverage alone. Coverage measures *execution*, not *verification* — assertion-free tests yield high coverage and catch nothing. Run **mutation testing** (PIT/mutmut/Stryker): the mutation score tells you whether your assertions would actually *notice* a fault. Surviving mutants are an actionable list of blind spots — e.g., a surviving boundary mutant (`<`→`<=`) means no test distinguishes the boundary value. Also use coverage's *uncovered* lines to find untested code; ignore the covered-percentage as a quality claim (Goodhart).

What is an equivalent mutant, and why does it matter for mutation testing?

Answer

A mutant whose code change does not change observable behaviour (e.g., `i <= n` vs `i < n+1`, or a mutation in dead code). It can never be killed by any test, so it deflates the mutation score and must be excluded manually. Detecting equivalence is **undecidable** in general — this is the main practical tax on mutation testing, alongside runtime cost.

Write a property (not an example) that would test a merge(a, b) of two sorted lists. Why is it stronger than examples?

Answer

Properties: (1) output is sorted; (2) output is a permutation of `a ++ b` (same multiset); (3) `len(merge) == len(a) + len(b)`; (4) `merge(a, [])` equals `a`. A framework (Hypothesis/jqwik/gopter) generates hundreds of randomized `a`, `b` and shrinks any counterexample to the minimal failing case. Stronger than examples because it covers inputs you never thought of — empty lists, duplicates, single elements, huge lists — and it forces you to articulate the actual specification of `merge`.

DHH says TDD causes "test-induced design damage." Give a concrete example and the counter-argument.

Answer

Example: extracting a `Repository` interface with one production implementation, plus a service layer, purely so a mock can be injected — adding indirection that obscures the feature for no design benefit. Counter (from the *Is TDD Dead?* dialogue): the damage comes from *dogmatic mockist* TDD, not from testing itself; a sociable, state-asserting style needs far less decoupling, and self-testing code still pays for itself. The synthesis: decouple for genuine design reasons, test behaviour through the widest fast interface, and don't contort architecture to satisfy a mocking framework.

Why might a test with verify(emailSender).send(...) pass while email is broken in production?

Answer

The mock encodes your *assumption* about `emailSender`, not its real behaviour. If the real sender throws on a malformed address, rejects an oversized body, or the SMTP credentials are wrong, the mock — which simply records the call — knows none of it. The interaction assertion is satisfied; production fails. The fix is a layered strategy: keep the focused test, but verify the assumption with a **contract test** (Pact) or an **integration test** against a real or containerized SMTP server.

When is snapshot testing the right tool, and what guardrails make it safe?

Answer

Right when **pinning legacy behaviour you don't yet understand** before refactoring (Feathers' characterization tests) or capturing large structured output where hand-asserting is impractical. Guardrails: keep snapshots small and focused; **redact/normalize** volatile fields (timestamps, IDs, unordered maps) so failures mean real behaviour change; require the diff to be *read* in review; never `--updateSnapshot` reflexively. Treat snapshots as characterization, not as a replacement for intent-expressing behavioural assertions.

You need confidence in a distributed leader-election protocol. Are unit tests enough? What else?

Answer

No. Unit and even integration tests *sample* schedules and cannot reach the rare interleavings where consensus protocols fail; that is precisely the bug class that costs data loss. Escalate up the rigor spectrum: **property-based tests** for the local invariants, and a **formal specification + model checker** (TLA+ / P) that exhaustively explores the bounded state space — the approach AWS used to find real bugs in S3/DynamoDB that testing missed (Newcombe et al., CACM 2015). Match rigor to the cost of being wrong.

Cheat Sheet¶

Concept	One-line takeaway
"Unit"	Not "one class" — it's test isolation; unit size is a judgement call (Fowler)
Classicist / Detroit	Sociable tests, real collaborators, assert on output
Mockist / London	Solitary tests, mock collaborators, assert on interactions; drives interface discovery (GOOS)
Mock vs. fake	Prefer a fake (in-memory real impl) for stateful collaborators; it can't lie
When to mock	Only unowned, out-of-process boundaries (Khorikov)
Over-mocking cost	Tests pass while prod breaks; tests block refactoring
Test-induced design damage	Don't invert dependencies just to inject a mock (DHH)
Property-based testing	State an invariant over all inputs; framework generates + shrinks (QuickCheck/Hypothesis/gopter/jqwik)
Mutation testing	Inject faults; killed = good, survived = blind spot. Real measure of suite strength (PIT/mutmut/Stryker)
Coverage	Measures execution, not verification. Use it to find holes, never as a target (Goodhart)
MC/DC	Strongest practical coverage; mandated by DO-178C Level A
Determinism	Inject the clock, the seed, the executor. Run concurrency under `-race`/`jcstress`
Flaky test	CI emergency — quarantine, then fix or delete on a deadline
Snapshot test	Legacy-pinning/characterization tool; redact nondeterminism; read the diff
Beyond testing	Property tests → types → model checking (TLA+) → mechanized proof; match rigor to cost-of-wrong

Summary¶

The professional view of unit testing is mostly the ability to hold the tensions. "Unit" has no fixed meaning; the classicist and mockist schools disagree productively, and the right default — sociable tests that assert on output and mock only unowned out-of-process boundaries — falls out of one observation: tests are supposed to enable change, and interaction-coupled tests do the opposite. Over-mocking buys isolation at the price of suites that pass while production breaks and that calcify the design; test-induced design damage is the architectural version of the same mistake.

Coverage is a hole-finder, not a quality metric — it measures execution, not verification, and collapses the moment it becomes a target. The instruments that actually measure suite quality are mutation testing (does an assertion notice when the code is wrong?) and property-based testing (does the code hold its invariants across inputs you never imagined?). Both force you to state a specification rather than enumerate examples, which is why property testing is the on-ramp to formal methods. And for the bug classes that sampling cannot reach — concurrency, distributed protocols, time — escalate deliberately: inject the clock and the seed, run under a race detector, and reach for a model checker when the cost of being wrong justifies it.
