Test Strategy & the Pyramid -- Middle Level
The pyramid is one shape among several: learn the trophy, the honeycomb, and the rule for picking the right one.
Table of Contents
- Introduction
- Prerequisites
- Glossary
- Core Concept 1 -- Shapes are answers to "where do bugs live?"
- Core Concept 2 -- The Testing Pyramid (Cohn)
- Core Concept 3 -- The Testing Trophy (Dodds)
- Core Concept 4 -- The Honeycomb (Spotify)
- Core Concept 5 -- The ice-cream cone anti-pattern
- Core Concept 6 -- Sizes vs types: Google's small/medium/large
- Core Concept 7 -- The allocation heuristic
- Real-World Examples
- Mental Models
- Common Mistakes
- Test Yourself
- Cheat Sheet
- Summary
- Further Reading
- Related Topics
Introduction
Focus: the competing test-suite shapes, why "size" is a sharper axis than "type", and how to allocate tests so each one earns its keep.
The pyramid is the default, not the law. For a React front-end, a microservice, or a thin CRUD API, a different shape catches more bugs per minute of CI time. This page covers the four shapes you will meet (pyramid, trophy, honeycomb, ice-cream cone), a more useful way to classify tests (Google's sizes), and a concrete heuristic for deciding where each test belongs.
Prerequisites
Required
- Junior level of this topic: the three levels, the pyramid, the feedback-loop argument.
- You write unit and integration tests routinely in some language/framework.
Helpful
- Exposure to a mocking library and to a test that spins up a real database or HTTP server.
- A rough sense of your own project's CI duration.
Glossary
| Term | Plain-English meaning |
|---|---|
| Testing Trophy | Shape with a fat integration middle; popularised by Kent C. Dodds for JS/front-end. |
| Honeycomb | Spotify's model for microservices: a fat integration middle, with few "implementation detail" (isolated unit) tests below it and few "integrated" (cross-service) tests above it. |
| Ice-cream cone | Anti-pattern: many slow E2E/manual tests on top, few unit tests below. |
| Test size (Google) | Classification by resources a test may touch (memory, network, disk), not by scope. |
| Small / Medium / Large | Google's sizes: small = single process no I/O; medium = single machine, localhost only; large = multi-machine/network. |
| Contract test | Verifies two services agree on a message format, without running both end-to-end. |
| Allocation | Deciding how many tests of each kind, and which behaviour goes at which level. |
Core Concept 1 -- Shapes are answers to "where do bugs live?"
Every suite shape is implicitly answering: where, in this kind of system, do defects actually occur — and where are they cheapest to catch? The right shape follows from three properties of your system:
- Architecture. A pure-logic library hides its bugs in algorithms (catch with unit tests). A microservice hides them at boundaries — serialization, HTTP, DB mapping (catch with integration/contract tests). A UI hides them in wiring components together.
- Change rate. Code that changes often needs fast feedback, so push tests down. Stable boundaries can afford a few slower tests.
- Cost of a missed bug. A wrong number in a payment service is catastrophic; a misaligned tooltip is not. Spend confidence where the blast radius is large.
There is no universally correct shape — only a correct shape for a given system at a given time.
Core Concept 2 -- The Testing Pyramid (Cohn)
```
      /\
     /E2\          few, slow, full system
    /----\
   / Integ\        moderate
  /--------\
 /   Unit   \      many, fast, isolated
/____________\
```
Cohn's pyramid is the right default when most of your complexity is in code logic that can be exercised without I/O: business rules, calculations, algorithms, domain models. Classic back-end services with rich domains fit it well. The base is wide because logic has the most cases and they are cheapest to cover in isolation.
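To make "logic has the most cases and they are cheapest to cover in isolation" concrete, here is a minimal sketch of a table-driven unit test. The pricing rule (`discounted_total`, a 10% member discount, a 50% cap) is invented for illustration and does not come from any real system; the point is only that each extra case costs one line and no I/O.

```python
from decimal import Decimal


def discounted_total(subtotal: Decimal, is_member: bool, coupon_percent: int = 0) -> Decimal:
    """Hypothetical rule: members get 10% off, coupons add to it, total discount capped at 50%."""
    if subtotal < 0:
        raise ValueError("subtotal must be non-negative")
    percent = min((10 if is_member else 0) + coupon_percent, 50)
    return (subtotal * (100 - percent) / 100).quantize(Decimal("0.01"))


# One row per case: (subtotal, is_member, coupon_percent, expected_total).
CASES = [
    (Decimal("100.00"), False, 0, Decimal("100.00")),
    (Decimal("100.00"), True, 0, Decimal("90.00")),
    (Decimal("100.00"), False, 20, Decimal("80.00")),
    (Decimal("100.00"), True, 20, Decimal("70.00")),
    (Decimal("100.00"), True, 60, Decimal("50.00")),  # discount is capped
    (Decimal("0.00"), True, 20, Decimal("0.00")),
]


def test_discount_matrix() -> None:
    for subtotal, is_member, coupon, expected in CASES:
        assert discounted_total(subtotal, is_member, coupon) == expected
```

With a test runner that supports parametrisation, each row would typically become its own reported test case; the plain loop keeps the sketch dependency-free.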
Core Concept 3 -- The Testing Trophy (Dodds)
Kent C. Dodds argued that for front-end / JavaScript apps the pyramid mis-allocates: unit tests that mock everything around a component prove little, because most front-end bugs are in how pieces wire together and how a component behaves when rendered with its real children.
```
      ___
     /E2E\          a few critical flows
    /-----\
   /       \
  | INTEGR. |       <- the fat middle: render real component trees
   \       /
    \-----/
     \Unit/         pure functions, small
      \_/
 ----static----     types + lint (the base)
```
The trophy has a fat integration layer and adds a static base (TypeScript types, ESLint) that catches a class of bugs before any test runs. The slogan Dodds popularised (originally a tweet by Guillermo Rauch): "Write tests. Not too many. Mostly integration." Use the trophy when your bugs cluster in component composition and rendering rather than in deep logic.
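```jsx
// Trophy-style "integration" test: render the real component tree, no deep mocks.
// CheckoutForm and fakeApi are illustrative names for the component and its stubbed backend.
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import '@testing-library/jest-dom'; // provides the toBeVisible matcher

test('submitting the form shows a success message', async () => {
  render(<CheckoutForm onSubmit={fakeApi} />);
  await userEvent.type(screen.getByLabelText('Card'), '4242424242424242');
  await userEvent.click(screen.getByRole('button', { name: 'Pay' }));
  expect(await screen.findByText('Payment received')).toBeVisible();
});
```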
Core Concept 4 -- The Honeycomb (Spotify)
For microservices, Spotify (André Schaffer) proposed the honeycomb: lots of integration tests, a thin layer of unit tests, and very few "integrated" (cross-service / E2E) tests.
```
     integrated tests        (thin top -- few, expensive, flaky)
   /                  \
  |  INTEGRATION TESTS  |    (fat middle -- the bulk)
   \                  /
      implementation         (thin bottom -- isolated unit, only where logic is real)
```
The reasoning: a single microservice is usually thin — it receives a request, talks to a DB or another service, transforms data, responds. Its bugs live at the seams (HTTP, JSON, SQL), not in deep algorithms. So integration tests that exercise the service with real (or in-memory) infrastructure give the most confidence per test. Integrated tests across many services are kept few because they are slow and flaky — their job is largely replaced by contract tests (see Concept 7 and Contract Testing).
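As a rough illustration of a honeycomb-style integration test, the sketch below drives a thin request handler against an in-memory SQLite database, so the JSON parsing and the SQL are both exercised for real. The handler, table, and field names (`handle_create_order`, `orders`, `total_cents`) are assumptions made up for this example, and a real service would go through its actual HTTP layer rather than a plain function call.

```python
import json
import sqlite3
from typing import Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, total_cents INTEGER NOT NULL)"
    )


def handle_create_order(conn: sqlite3.Connection, body: str) -> Tuple[int, str]:
    """Hypothetical handler: parse JSON, insert a row, return (status, JSON body)."""
    try:
        payload = json.loads(body)
        customer = payload["customer"]
        total_cents = int(payload["total_cents"])
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "invalid request"})
    cur = conn.execute(
        "INSERT INTO orders (customer, total_cents) VALUES (?, ?)",
        (customer, total_cents),
    )
    conn.commit()
    return 201, json.dumps({"id": cur.lastrowid, "customer": customer, "total_cents": total_cents})


def test_create_order_round_trip() -> None:
    conn = sqlite3.connect(":memory:")  # real SQL engine, no server to boot
    create_schema(conn)
    status, body = handle_create_order(conn, '{"customer": "ada", "total_cents": 1250}')
    assert status == 201
    created = json.loads(body)
    row = conn.execute(
        "SELECT customer, total_cents FROM orders WHERE id = ?", (created["id"],)
    ).fetchone()
    assert row == ("ada", 1250)


def test_malformed_json_is_rejected() -> None:
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    status, _ = handle_create_order(conn, "{not json")
    assert status == 400
    assert conn.execute("SELECT COUNT(*) FROM orders").fetchone() == (0,)
```

Both tests check a seam (request body to row, bad input to no row) rather than any deep logic, which is the kind of bug the honeycomb expects a thin service to have.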
Core Concept 5 -- The ice-cream cone anti-pattern
```
\                              /
 \  M A N U A L   T E S T S   /     <- biggest: humans clicking
  \--------------------------/
   \     E2E / UI tests     /       <- many, slow, flaky
    \----------------------/
     \    integration     /
      \------------------/
       \      unit      /           <- thinnest: almost none
        \--------------/
```
The pyramid turned upside down. It happens by default, not by decision: teams test through the UI because "that's how users use it," skip unit tests as "too granular," and lean on manual QA. The result is a suite that is slow (tens of minutes), flaky (UI timing), and uninformative (failures say "checkout broke" with no line number). If your CI takes 30+ minutes and people re-run failed jobs hoping they pass, you are probably standing on a cone. The fix is to push coverage down — re-prove logic in fast unit tests and delete redundant E2E cases.
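"Pushing coverage down" usually starts by pulling a rule out of the UI into a plain function that can be tested without a browser. A minimal sketch, assuming a hypothetical sign-up form whose rules are "email looks like an email" and "password has at least 12 characters" (both rules and the name `validate_signup` are invented for illustration):

```python
import re
from typing import List

# Deliberately loose pattern for illustration; not a full email validator.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_signup(email: str, password: str) -> List[str]:
    """Return a list of validation errors; an empty list means the input is valid."""
    errors = []
    if not _EMAIL_RE.match(email):
        errors.append("email: invalid format")
    if len(password) < 12:
        errors.append("password: must be at least 12 characters")
    return errors


def test_valid_input_has_no_errors() -> None:
    assert validate_signup("ada@example.com", "correct horse battery") == []


def test_each_rule_reports_its_own_error() -> None:
    assert validate_signup("not-an-email", "correct horse battery") == ["email: invalid format"]
    assert validate_signup("ada@example.com", "short") == [
        "password: must be at least 12 characters"
    ]
```

Once cases like these exist at the unit level, the E2E suite only needs one case showing that a validation error reaches the screen, not one per rule.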
Core Concept 6 -- Sizes vs types: Google's small/medium/large
"Unit / integration / E2E" classifies by scope (how much code is exercised). Google's testing culture classifies by size — what resources the test is allowed to touch — and finds it more useful operationally.
| Size | May use | May NOT use | Speed target |
|---|---|---|---|
| Small | single process, single thread, in-memory | network, disk, real DB, sleep, system clock, multiple threads | < 100 ms |
| Medium | single machine, localhost (DB, server on loopback), multiple threads | network to other machines | < 1 s (cap ~minutes) |
| Large | multiple machines, real network, full environment | — (anything goes) | seconds to minutes |
Why size beats scope: the thing that actually makes a test slow and flaky is touching shared, non-deterministic resources — the network, the disk, the wall clock, real time. A "unit test" that calls time.sleep() or hits localhost:5432 has the cost profile of an integration test no matter what you call it. Asking "can this test touch the network/disk/clock?" predicts speed and flakiness better than asking "is it a unit or integration test?"
```python
import time

# Token and FakeClock are illustrative names for the code under test and its test double.

# Looks like a "unit" test by scope, but it's MEDIUM by size: it sleeps + uses the clock.
def test_token_expires():
    t = Token(ttl=1)
    time.sleep(1.1)  # <-- real clock, non-deterministic, slow
    assert t.expired()

# Same behaviour as a SMALL test: inject the clock.
def test_token_expires_small():
    clock = FakeClock(now=0)
    t = Token(ttl=1, clock=clock)
    clock.advance(2)
    assert t.expired()  # deterministic, microseconds
```
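For readers who want to run the clock-injection idea, here is one minimal way `Token` and `FakeClock` could be written. This is a sketch under assumptions: the clock is any object with a `now()` method, and `SystemClock` is a hypothetical default added here so that `Token(ttl=1)` still works without an injected clock.

```python
import time


class SystemClock:
    """Production clock: reads the real monotonic clock."""

    def now(self) -> float:
        return time.monotonic()


class FakeClock:
    """Test clock: time only moves when the test says so."""

    def __init__(self, now: float = 0.0) -> None:
        self._now = float(now)

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += seconds


class Token:
    """Expires ttl seconds after creation, measured on the injected clock."""

    def __init__(self, ttl: float, clock=None) -> None:
        self._clock = clock if clock is not None else SystemClock()
        self._expires_at = self._clock.now() + ttl

    def expired(self) -> bool:
        return self._clock.now() >= self._expires_at
```

The design choice is that `Token` never calls `time` directly; the only place that touches the real clock is `SystemClock`, so every test of expiry logic can stay small.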
Core Concept 7 -- The allocation heuristic
The decision rule for "where does this test go," combining everything above:
- Find the lowest level that can prove the behaviour. Pure logic? Unit. Needs a real DB/HTTP round-trip? Integration. Needs the user-visible flow across the whole stack? E2E.
- Keep size small where possible. Inject the clock, fake the network, use in-memory adapters — turn would-be medium/large tests into small ones.
- Cover each behaviour once. If unit tests fully cover the discount matrix, integration/E2E should not re-test discount cases — they test wiring, not logic.
- Replace cross-service E2E with contract tests at boundaries. Instead of booting six services to check service A talks to service B, write a contract: A's expectations of B, verified against B in isolation. (See Contract Testing.)
- Reserve E2E for critical journeys only. Sign-up, checkout, the one flow that loses money if it breaks — and just the spine of each.
Result: most behaviour is proven by fast small tests; a moderate set of integration tests proves the wiring and I/O; contract tests guard the seams between services; and a tiny, hand-picked set of E2E tests proves the whole machine assembles.
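The heuristic can be read as a small decision function. The sketch below is one possible encoding, not a standard algorithm: the `Behaviour` flags and the order in which they are checked are assumptions chosen to mirror the bullets above.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    UNIT = "unit"
    INTEGRATION = "integration"
    CONTRACT = "contract"
    E2E = "e2e"


@dataclass(frozen=True)
class Behaviour:
    """Hypothetical description of one behaviour we want to prove."""

    name: str
    needs_real_io: bool = False             # real DB or HTTP round-trip required
    crosses_service_boundary: bool = False  # depends on another service's message format
    is_critical_journey: bool = False       # user-visible, whole-stack, costly if broken


def lowest_level(b: Behaviour) -> Level:
    """Pick the cheapest level that can still prove the behaviour."""
    if b.is_critical_journey:
        return Level.E2E          # reserved for the spine of critical flows
    if b.crosses_service_boundary:
        return Level.CONTRACT     # a contract instead of booting both services
    if b.needs_real_io:
        return Level.INTEGRATION  # wiring and I/O, not logic
    return Level.UNIT             # pure logic: cover it once, here
```

For example, `lowest_level(Behaviour("discount matrix"))` yields `Level.UNIT`, while a behaviour flagged `crosses_service_boundary=True` lands on `Level.CONTRACT` rather than E2E.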
Real-World Examples
A React e-commerce front-end → trophy. Static (TS + ESLint) catches typos and prop mismatches. A fat layer of React Testing Library tests renders real component trees ("clicking Add shows the cart badge incrementing"). Pure helpers (currency formatting) get small unit tests. Two or three Cypress E2E tests cover sign-up and checkout.
A payments microservice → honeycomb/pyramid blend. Deep money math is unit-tested exhaustively (pyramid instinct, because the logic is real and dangerous). The HTTP+DB seams get many integration tests (honeycomb instinct). A contract with the orders service replaces a flaky cross-service E2E.
A 35-minute Selenium suite → an ice-cream cone being fixed. The team finds that 80% of E2E cases exercise form validation that is covered nowhere else. They move validation to unit tests, keep 6 journey E2E tests, and CI drops to 9 minutes.
Mental Models
- The shape is a consequence, not a choice. Pick it from where your bugs live (architecture × change rate × cost of a miss), not from a blog post.
- Size, not just scope. "Can it touch the network/disk/clock?" predicts pain better than the unit/integration label.
- Confidence is bought with money. Every level up buys realism and pays in speed, determinism, and debuggability. Buy only where the realism is worth the bill.
- Contract tests are E2E's cheaper substitute at seams. Same goal (the services agree), a fraction of the cost.
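To show what "the services agree" can mean without running both services, here is a hand-rolled sketch of a consumer-driven contract check. Everything in it is hypothetical: the `/stock/...` endpoint, the field names, and the `provider_handler` function standing in for service B's entry point. Dedicated contract-testing tools do considerably more than this.

```python
from typing import Callable, Dict, List, Tuple

# Written by the consumer (service A): what it sends and what it relies on getting back.
CONTRACT = {
    "request": {"method": "GET", "path": "/stock/sku-123"},
    "response": {"status": 200, "body_fields": {"sku": str, "available": int}},
}


def provider_handler(method: str, path: str) -> Tuple[int, Dict]:
    """Hypothetical provider (service B) entry point, called in-process."""
    if method == "GET" and path.startswith("/stock/"):
        sku = path.rsplit("/", 1)[-1]
        return 200, {"sku": sku, "available": 7, "warehouse": "eu-1"}
    return 404, {"error": "not found"}


def verify_contract(contract: Dict, handler: Callable[[str, str], Tuple[int, Dict]]) -> List[str]:
    """Run the provider alone and list every way it breaks the consumer's expectations."""
    request, expected = contract["request"], contract["response"]
    status, body = handler(request["method"], request["path"])
    problems = []
    if status != expected["status"]:
        problems.append(f"status: expected {expected['status']}, got {status}")
    for field, field_type in expected["body_fields"].items():
        if field not in body:
            problems.append(f"missing field: {field}")
        elif not isinstance(body[field], field_type):
            problems.append(f"wrong type for {field}: {type(body[field]).__name__}")
    return problems


def test_provider_honours_consumer_contract() -> None:
    assert verify_contract(CONTRACT, provider_handler) == []
```

Extra fields such as `warehouse` are tolerated on purpose: the contract pins only what the consumer actually reads, so the provider stays free to add fields without breaking the check.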
Common Mistakes
- Cargo-culting the pyramid onto a front-end. Mocking everything around a component yields green tests that prove nothing; the trophy exists for this reason.
- Calling slow tests "unit tests." Naming doesn't change cost; a sleeping/networking test is medium-sized whatever the folder it lives in.
- Booting the whole world to test one seam. Use contract tests instead of N-service E2E.
- Redundant coverage across levels. The same rule tested at unit, integration, and E2E — three places to maintain, triple the runtime, no extra confidence.
- Letting the cone grow silently. Nobody decides "let's invert the pyramid"; it happens when E2E is the path of least resistance. Watch CI time.
Test Yourself
- Your service is a thin HTTP→DB transformer with little logic. Pyramid, trophy, or honeycomb? Why?
- A "unit test" calls requests.get("http://localhost:8080"). What size is it really, and what does that imply?
- Give one behaviour that belongs in a unit test and one that belongs in a contract test, for the same microservice.
- Why does the ice-cream cone usually appear by accident rather than on purpose?
- The discount matrix is fully unit-tested. Where should it not be re-tested, and why?
Cheat Sheet
```
SHAPES (pick by where YOUR bugs live)
  Pyramid    rich domain logic  -> wide unit base
  Trophy     front-end / JS     -> fat integration + static base
  Honeycomb  microservices      -> fat integration, contracts at seams
  Cone       <ANTI-PATTERN>     -> too many E2E/manual; push coverage DOWN

SIZE (Google) -- the operational axis
  Small   no net/disk/clock/threads   < 100 ms  (prefer this)
  Medium  localhost only              < ~1 s
  Large   multi-machine / real net    seconds+

ALLOCATION
  lowest level that proves it | keep size small | cover once
  seams -> contract tests     | E2E -> critical journeys only
```
Summary
- The pyramid is the default, not the only shape. The trophy (fat integration + static) fits front-ends; the honeycomb (fat integration + contracts) fits microservices; the ice-cream cone is the anti-pattern you slide into by accident.
- Choose the shape from where your bugs live: architecture × change rate × cost of a missed defect.
- Classify by size (network/disk/clock?) not just scope — size predicts speed and flakiness.
- Allocate by the heuristic: lowest level that proves it, keep it small, cover once, contracts at seams, E2E for critical journeys only.
Further Reading
- Kent C. Dodds, "The Testing Trophy and Testing Classifications."
- André Schaffer (Spotify), "Testing of Microservices" (the honeycomb).
- Mike Wacker (Google Testing Blog), "Just Say No to More End-to-End Tests."
- Google, Software Engineering at Google, ch. on test sizes.
- The unit-testing-patterns and mocking-strategies skills.
Related Topics
- Unit Testing · Integration Testing · End-to-End Testing
- Contract Testing — the cheaper substitute for cross-service E2E.
- Test Doubles, Mocks & Fakes — how to shrink test size.
- Flaky Tests & Reliability — the tax on upper layers.
- Senior level — designing a risk-based strategy under real constraints.