Review Anti-patterns — Professional Level¶
Roadmap: Code Review → Review Anti-patterns The senior page taught you to name the anti-patterns and dodge them in your own reviews. This page is about diagnosing them across a whole org — reading review health from telemetry the way an SRE reads a dashboard, then running a remediation program in the one order that actually works, because the fixes have dependencies. It is also the section capstone: the throughline that ties the previous seven topics into a single claim about what makes review excellent.
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Core Concept 1 — The Diagnostic: Reading Review Health From Signals
- Core Concept 2 — Root Causes, Not Symptoms
- Core Concept 3 — The Remediation Program, In Dependency Order
- Core Concept 4 — Review Theatre: The Org-Level Danger
- Core Concept 5 — The Culture Layer: Making Good Review Valued
- Core Concept 6 — Durable Artifacts: The Charter and Calibration
- War Stories
- Decision Frameworks
- Mental Models
- Common Mistakes
- Test Yourself
- Cheat Sheet
- Summary
- Further Reading
- Related Topics
Introduction¶
Focus: Diagnosing and remediating review dysfunction across an org, and synthesizing the whole section into one claim about what makes review excellent.
The senior page treated anti-patterns as things an individual reviewer recognizes and avoids: don't bikeshed, don't rubber-stamp, don't gatekeep. That's necessary and not sufficient. At staff and principal level the job changes shape entirely: nobody is asking you to write better comments on a specific PR. They're asking why the org's review is slow, or shallow, or driving people out — and to fix it across hundreds of engineers who will not read your manifesto.
That is a different skill. Individual anti-patterns are a judgement problem; org-wide review dysfunction is a systems problem. The dysfunction you see is almost never a character flaw in your reviewers — it's the predictable output of the incentives, tooling gaps, load, and norms you've put them under. A bikeshedding culture is not a culture of petty people; it's a culture with no auto-formatter and no style guide, so the cheapest comment to write is a style comment. A rubber-stamping culture is not a culture of lazy people; it's usually a culture where reviewers are drowning, review isn't valued at promotion time, or the org has quietly signalled that velocity beats correctness. Diagnose the system, not the people.
This page gives you three things: a diagnostic (how to read which anti-patterns dominate, from signals you can actually measure), a remediation program (the specific order to fix them, because the fixes depend on each other), and the culture work that the cheap fixes can't touch. Then, as the capstone, it ties the section together — because every one of topics 01 through 07 is, in the end, a defense against one of these anti-patterns.
Prerequisites¶
- Required: senior.md — the catalogue of anti-patterns (bikeshedding, rubber-stamping, gatekeeping, nitpicking, late design feedback, scope creep) and the individual-reviewer fixes.
- Required: 07 — Review Metrics & Tempo — TTFR, approval rate, reviewer load, and Goodhart's law on review metrics. The diagnostic in this page reads those metrics.
- Required: You've operated a team or org and owned an outcome (delivery, quality, retention) that review either helped or hurt.
- Helpful: You've run or read a developer-experience survey and seen review show up as a top pain point.
- Helpful: You've watched a well-intentioned process change make things worse, and learned why.
Glossary¶
- TTFR (time to first review): wall-clock from "PR ready" to the first substantive reviewer action. The single most load-bearing tempo signal.
- Approval rate / approval latency: the fraction of PRs approved and how fast. High-and-fast together is a theatre smell; low-and-slow is a bottleneck smell.
- Nit-to-design ratio: the proportion of review comments that are cosmetic (
nit:, style, naming preference) versus substantive (correctness, design, security). A culture's review depth in one number. - Reviewer concentration: how unevenly review load is distributed — e.g., the Gini coefficient of reviews-per-reviewer, or "what % of all reviews do the top 3 people do?"
- Escaped-defect class: the category of bug that review failed to catch and production found (concurrency, authz, N+1, null deref). Tells you which review lens is missing, not just that review is weak.
- Review theatre: mandatory-but-meaningless review that produces an approval without producing scrutiny. The dangerous failure mode, because it manufactures false assurance.
- Psychological safety: the shared belief that you can submit imperfect code, ask a "dumb" question, or push back on a reviewer without being punished. The substrate every other fix sits on.
- Review charter: a short, durable, written agreement on what review is for, what blocks a merge, SLAs, and severity vocabulary. The artifact that outlives any individual.
- Calibration: a recurring session where reviewers review the same PR and reconcile differences, converging the org's standards.
Core Concept 1 — The Diagnostic: Reading Review Health From Signals¶
You cannot remediate what you haven't diagnosed, and "our reviews feel bad" is not a diagnosis. The staff move is to read review health from a small set of signals the way you'd read a system from its golden signals — and crucially, to read them in combination, because no single number is interpretable alone.
The four signal families:
- Tempo — TTFR, review latency, PR cycle time (07).
- Depth — nit-to-design comment ratio, comments-per-100-LOC, change-request rate.
- Distribution — reviewer concentration (who does the reviews), author concentration (whose code gets reviewed by whom).
- Outcome — escaped-defect classes, revert/hotfix rate on reviewed code, and the DevEx-survey free-text on review.
The diagnostic power is in the combinations, not the individual gauges. The two-by-two of TTFR against approval rate is the fastest read you have:
APPROVAL RATE / SPEED
fast + ~100% slow / high change-request
┌────────────────────────┬────────────────────────────┐
low TTFR │ REVIEW THEATRE │ HEALTHY-ENGAGED (or churny) │
(fast first │ rubber-stamping; │ fast pickup, real scrutiny │
pickup) │ false assurance │ — verify with depth signal │
├────────────────────────┼────────────────────────────┤
high TTFR │ ABSENTEE / OVERLOAD │ BOTTLENECK + FRICTION │
(slow first │ nobody picks it up, │ slow AND adversarial; │
pickup) │ then waves it through │ worst quadrant for morale │
└────────────────────────┴────────────────────────────┘
- Fast TTFR + ~100% approval + near-zero change requests is the classic review theatre signature. Approvals are being produced without scrutiny. Confirm with the depth signal: if nit-to-design is also sky-high (or there are barely any comments at all), you've found rubber-stamping.
- Slow TTFR + high change-request rate + adversarial DevEx free-text points at gatekeeping or overload — reviews are both late and punishing.
- Fast TTFR + healthy mix of substantive comments + moderate change requests is what good looks like. Don't "fix" it.
Then layer the other signals:
- Reviewer concentration that's extreme (the top 2 people do 70% of reviews) predicts a bottleneck and an impending burnout-driven quality collapse — and a bus-factor risk in the review knowledge itself.
- Nit-to-design ratio dominated by nits, with style comments everywhere, is the bikeshedding fingerprint — and it almost always means style isn't automated.
- Escaped-defect classes are the most under-used signal. If production keeps surfacing the same class of bug (say, authz checks or N+1 queries) that review theoretically should catch, you don't have a generic "review is weak" problem — you have a missing lens problem, which points straight at the 04 — Security & Performance checklist or the 03 — Correctness & Design bug-hunt not being applied.
The professional discipline: treat these as diagnostic signals, not targets. The instant any of them becomes a number people are measured on, Goodhart's law (07) deforms it — and you'll have manufactured an anti-pattern while trying to detect one. (See the gamed "comments per review" war story below; that's exactly this failure.) Read them; never quota them.
Core Concept 2 — Root Causes, Not Symptoms¶
The same visible symptom can have several different root causes, and the right fix depends entirely on which one you have. Treating the symptom is how you get a remediation program that fails. Map symptom → candidate root causes before you touch anything:
| Visible symptom | Likely dominant anti-pattern | Plausible root causes (must distinguish) |
|---|---|---|
| Comment threads full of style/naming debate | Bikeshedding | Style not automated; no agreed style guide; review is the only place style gets enforced |
| Fast approvals, bugs leaking, no comments | Rubber-stamping / theatre | Reviewer overload; review not valued at promo; "ship fast" pressure; mandatory-but-pointless mandate |
| One person blocks everything on preference | Gatekeeping | No norms on what blocks; status/seniority unchecked; reviewer's preferences treated as rules |
| Reviews are days late | Bottleneck | Load concentration; no SLA/rotation; review is invisible labor nobody prioritizes |
| Design rewritten in review after a week of work | Late design feedback | Big PRs (no early signal); no design-review step before code; review starts too late |
| "Could you also fix…" creeping every PR | Scope creep via review | No scope norms; reviewers conflate "while you're here" with the change at hand |
| Nitpicking spikes after a metric launch | Gamed metric | A volume metric (comments/review) became a target; Goodhart |
The principal-level insight is that most of these root causes are systemic and cheap-to-medium to fix — except the last layer. Style automation, PR-size limits, SLAs, rotation, and norms are all mechanical interventions you can ship in weeks. The genuinely hard root cause — the one that needs leadership and patience and can't be automated — is culture and psychological safety: the gatekeeping that's really a status problem, the rubber-stamping that's really a "review isn't valued here" problem. That asymmetry is exactly why the remediation order in the next section is what it is: do the cheap, dependency-free, high-leverage fixes first, then spend your scarce cultural capital on the layer only leadership can move.
Core Concept 3 — The Remediation Program, In Dependency Order¶
The most common way a review-improvement effort fails is doing the fixes in the wrong order — typically opening with the culture conversation, which is the hardest, slowest, and most easily undermined by the mechanical problems you haven't fixed yet. You can't ask people to "be less nitpicky" while bikeshedding is still the cheapest comment to write. The fixes have a dependency graph; respect it.
The order, and why each step is positioned where it is:
1. Automate style first. This is the fast win and it has zero dependencies. An auto-formatter (gofmt/Prettier/Black/rustfmt) plus a linter (06 — Tooling & Automation) makes style comments impossible — there is nothing to debate when the formatter already decided. This single move kills the largest, most morale-corrosive anti-pattern (bikeshedding) cheaply, and it earns you credibility for the harder work. Do it first precisely because it's easy and visible.
2. Shrink PRs. Once style noise is gone, the next leverage point is size. Small PRs review faster, get deeper scrutiny, and — critically — surface design problems early instead of after a week of sunk work (02 — PR Scope & Size). Many downstream anti-patterns (late design feedback, rubber-stamping of huge diffs nobody can actually read, slow TTFR) are partly caused by oversized PRs. Fix size before you fix tempo, because size is a tempo input.
3. Fix load, SLA, and rotation. Now address the bottleneck and overload that drive both slowness and rubber-stamping (07). Set a TTFR SLA, cap reviewer WIP, and rotate review duty so it's not three heroes. This has to come after size, because shrinking PRs reduces per-review load and makes the SLA achievable — try to set the SLA first and you've set people up to fail.
4. Set norms, a style guide, severity labels, and calibration. With the mechanical noise gone and load survivable, now establish what review is for and how to do it: a written norm on what blocks a merge, a severity vocabulary (nit: / suggestion: / blocking:) so authors can triage feedback, the team's residual style decisions the formatter can't make, and calibration sessions to converge standards (05 — Feedback). Norms land after automation because a norm that says "don't argue about style" is hollow if style isn't already automated.
5. Fix or remove bad metrics. If any review metric has become a target (comments per review, approval counts), de-target it before the culture work — because a gamed metric will actively sabotage the cultural change you're about to attempt (07). You cannot ask for thoughtful, sparing feedback while rewarding comment volume.
6. Then, the culture and psychological-safety layer. Only now — with style automated, PRs small, load survivable, norms written, and metrics de-weaponized — do you spend leadership capital on the hardest layer: making review collaborative not adversarial, de-statusing it, and making good reviewing valued. This is last not because it's least important — it's the most important — but because it's the most expensive and the most likely to be quietly undone by the mechanical problems you've now cleared out of its way.
The principal heuristic: sequence by dependency and cost, not by importance. The cheap, dependency-free, high-leverage fixes go first; they build trust and remove the obstacles that would sabotage the cultural work. Opening with "we need to talk about our review culture" before automating the formatter is the single most common way these programs die.
Core Concept 4 — Review Theatre: The Org-Level Danger¶
Of all the anti-patterns, review theatre is the one that should scare a principal engineer the most, because it's the only one that manufactures false assurance. A team with no review at all knows it has no safety net. A team with review theatre — mandatory approval that produces a green checkmark without producing scrutiny — believes it is protected and is not. That belief is what makes it dangerous: it's load-bearing in incident postmortems, compliance attestations, and the org's risk model, and it's hollow.
The connection to Quality Gates is direct and important. A required-review gate is only as good as the review behind it. When the org makes review mandatory but does nothing to make it meaningful — no norms, no load relief, no culture of actually reading — you get a gate that's reliably green and reliably empty. The gate's existence becomes the org's evidence that "we review everything," which is precisely backwards: the mandate without the substance is worse than honesty, because it launders an absence of scrutiny into a record of approvals.
Detecting theatre (the signal combination from Core Concept 1, sharpened):
- Approval latency far shorter than is plausible for the diff size — 600-line PRs approved in 90 seconds, repeatedly.
- Approval rate at or near 100% with a vanishing change-request rate.
- Comment volume near zero, or all comments cosmetic (high nit-to-design).
- Escaped defects in classes review should catch, on PRs that were "reviewed and approved."
- DevEx free-text like "review is just a box to tick here."
Killing theatre without killing real review is the delicate part — and the trap is to respond by making the mandate heavier (more required approvers, more mandatory checklists), which only adds friction to the empty ritual. The fix is to attack the cause of the emptiness, which is always one of the systemic root causes: reviewers are overloaded (fix load), review isn't valued (fix incentives — Core Concept 5), PRs are too big to actually read (fix size), or there are no norms about what scrutiny means (fix norms). You make review meaningful, and only then is the mandate worth keeping. In some cases the right move is to reduce the mandate — drop required review on low-risk paths via policy-as-code in the quality gate — so the human review you do require is real, on the changes that warrant it. Less theatre, more scrutiny where it counts.
The hard rule: never let a mandatory-review policy substitute for evidence that review is actually happening. A green gate is an input to a question ("is this scrutinized?"), not an answer. If you can't tell theatre from real review from your signals, your gate is decorative — and decorative gates fail in the postmortem.
Core Concept 5 — The Culture Layer: Making Good Review Valued¶
This is the layer the automation can't reach, and the reason most review cultures quietly degrade. The core economic fact: good reviewing is invisible labor. A reviewer who catches a subtle race condition, asks the one question that reveals a wrong design, and mentors a junior through a cleaner approach has produced enormous value — and left almost no visible artifact. The author's name is on the merged code. The reviewer's contribution evaporates.
The consequence is brutal and predictable: if your performance and promotion processes ignore review, review degrades to rubber-stamping — rationally. Engineers respond to what's rewarded. If the org rewards shipping your own code and is silent on the time you spent making other people's code better, then careful review is a personal sacrifice with no return, and people stop making it. Rubber-stamping isn't a moral failure under those incentives; it's the equilibrium. You cannot fix this with a memo. You fix it by changing what's valued, which means leadership, not a process tweak.
What the culture work actually consists of:
- Make good reviewing count at promo time. Explicitly recognize high-quality review in performance reviews and promotion packets — "raised the quality bar through review," with examples. The moment review is career-relevant, careful review becomes rational. This is the highest-leverage cultural lever, and it requires the people who run calibration and promotion to buy in.
- Model it at the top. Staff and principal engineers must visibly do excellent, kind, substantive review — and visibly receive it well, including on their own code. Psychological safety is set by example: if the most senior person rewrites juniors in adversarial comments, that is the org's review culture, regardless of any written norm.
- De-status review; it's collaboration, not gatekeeping. Reframe review from "the senior person judges the junior's code" to "two engineers making the change better." That reframe attacks gatekeeping at its root (it's a status behavior) and lowers the stakes of receiving feedback. Authors should expect to push back on reviewers as a normal, safe act.
- Protect psychological safety deliberately. People must be able to submit imperfect work, ask naive questions, and disagree with a reviewer without fear. Where that's absent, every other fix underperforms: people game the metrics, avoid the harsh reviewer, and stop surfacing the risky changes that most need review.
The leadership reality: the cheap mechanical fixes (style automation, size limits, SLAs) get you most of the visible improvement and you can ship them yourself. But the ceiling on review quality is set by the culture layer, and that ceiling only moves when leadership makes good review valued and safe. This is the work that doesn't fit in a sprint and doesn't show on a dashboard — and it's the difference between an org that reviews well for a quarter and one that reviews well for years.
Core Concept 6 — Durable Artifacts: The Charter and Calibration¶
Culture set by personality is fragile — it leaves when the person leaves. The way you make a healthy review culture durable is to encode it in two artifacts that outlive any individual: a review charter and a calibration practice.
The review charter is a short, written, team-owned document (one page, not a policy binder) that answers the questions every review argument is secretly about:
- What is review for here? (defects + design + knowledge-sharing — not style policing, which the formatter owns.)
- What blocks a merge versus what's advisory? (Correctness, security, broken tests, wrong design block. Preferences don't.)
- What's the severity vocabulary? (
nit:/suggestion:/blocking:— so authors can triage.) - What's the TTFR SLA, and how is review load shared? (The rotation and the response-time expectation.)
- What does the tooling own? (Style, formatting, lint — so humans don't.)
The charter's value is that it makes the implicit explicit and shared. Most review conflict is two people operating from different unstated models of what review is; the charter gives them one model to point at. It also de-statuses disagreement: "the charter says preferences don't block" is a neutral arbiter that doesn't require the junior to out-argue the senior.
Calibration is the practice that keeps the charter real over time. Periodically, several reviewers review the same PR independently, then compare: what did each flag, what did each let go, what did they disagree on, and how do they reconcile it against the charter? This converges the org's standards (so the review you get doesn't depend on which reviewer you drew), surfaces drift, and is itself a venue for modeling good review. It's the same instinct as interviewer calibration — without it, "the bar" is whatever each reviewer privately believes.
The durability test: if your best reviewer left tomorrow, would review quality survive? If the answer is "no, they held the standard in their head," you have a personality-dependent culture, not a durable one. The charter and calibration are how you move the standard from one person's head into the team's shared, written practice — the difference between a good quarter and a good decade.
War Stories¶
The org of pure theatre. A ~200-engineer company had "100% review coverage" as a point of pride — every PR required an approval, and the dashboard showed it. It also had a steady stream of production incidents in reviewed code: null derefs, an authz check that was never added, a query that took down a database. Pulling the numbers told the story instantly: median approval latency was under two minutes regardless of PR size, approval rate was 99.6%, and the comments-per-PR was 0.3. It was a green gate over an empty ritual. The turnaround was deliberately ordered: first an auto-formatter and linter went in (removing the only comments anyone was writing); then a 400-LOC soft cap on PRs (so reviewers could actually read them); then a TTFR SLA with review rotation so the same five people weren't drowning. Only after those did leadership make the cultural move — recognizing good review in promo packets and dropping the mandatory review requirement on low-risk paths so human review concentrated where it mattered. Within two quarters comments-per-PR rose past 3, change-request rate climbed off the floor, and the recurring incident classes dried up. The mandate hadn't been protecting them; the substance they added did.
The gatekeeper who drove attrition. One senior engineer was the de-facto reviewer for a core service and treated every review as a gate they personally guarded — rewriting authors' approaches in the comments, blocking on personal style preferences, and relitigating settled decisions. TTFR for that service was fine; the cost showed up in the exit interviews: three mid-level engineers cited "review feels like an interrogation" on the way out. The diagnostic signal was reviewer concentration (this person did ~65% of the service's reviews) crossed with adversarial DevEx free-text scoped to that team. The mechanical fixes (rotation to spread reviews, a charter clarifying that preferences don't block) helped at the margin, but the real fix required leadership: the engineer's manager made "review collaboratively; preferences are suggestions, not blocks" an explicit expectation tied to their next-level criteria, and a principal engineer modeled the alternative by reviewing alongside them. The lesson: gatekeeping rooted in status is a culture problem wearing a process costume — rotation alone won't fix it; leadership has to move the incentive.
The bikeshedding culture, ended in an afternoon. A team's PRs routinely accumulated a dozen comments about brace placement, import ordering, and var versus let. Authors dreaded review; reviewers felt productive. The nit-to-design ratio was lopsided — cosmetic comments outnumbered substantive ones roughly four to one. The fix was almost insultingly cheap: adopt the language's standard formatter, wire it into a pre-commit hook and CI, and reformat the whole repo in one commit. Overnight, style comments became impossible — the formatter had already decided — and reviewers, with nothing cosmetic left to say, started commenting on logic and design. This is the canonical "automate style first" win: the cheapest intervention in the whole program killed the most morale-corrosive anti-pattern in an afternoon.
The bottleneck org fixed by rotation and load limits. A platform team had two engineers everyone routed reviews to. TTFR crept to two-plus days, the two heroes were burning out, and — the quiet danger — they'd started rubber-stamping just to clear the queue, so quality dropped as latency rose. Reviewer concentration was the smoking gun (the top two did over 70% of reviews). The fix was structural, not exhortative: a review rotation that spread the duty across the team, a WIP cap so no one held more than a few open reviews, and CODEOWNERS tuned so reviews routed to a group, not two names. TTFR fell back under a day, the heroes stopped drowning, and — counterintuitively to the org — quality went up once reviewers weren't clearing a queue under duress.
The gamed "comments per review" metric. A well-meaning eng-leadership team, wanting to "encourage thorough review," put average comments per review on a dashboard and praised the high scorers. Within a month the metric was up and review was worse: reviewers had learned to pad reviews with trivial nitpicks to hit the number. Substantive feedback didn't increase; cosmetic noise did, and authors got more demoralized, not less. This is Goodhart's law on review, exactly: the instant a depth proxy became a target, people optimized the proxy and not the thing. The fix was to kill the metric entirely, state plainly that comment volume was never the goal, and replace the dashboard with calibration sessions where reviewers compared substance on shared PRs. The lesson burned in: review metrics are for diagnosis, never for targets — measuring them is fine, rewarding them deforms them.
Decision Frameworks¶
Symptom → dominant anti-pattern → root cause → fix:
| Symptom (from signals) | Dominant anti-pattern | Root cause | First fix |
|---|---|---|---|
| Style/naming debates everywhere; nit-heavy | Bikeshedding | Style not automated | Auto-formatter + linter (06) |
| Fast approvals, no comments, leaking bugs | Rubber-stamping / theatre | Overload, or review not valued | Fix load + SLA, then incentives |
| 600-LOC PRs approved in 90s | Review theatre | Mandate without substance + big PRs | Shrink PRs (02); make review real before keeping mandate |
| One reviewer blocks on preference | Gatekeeping | Status unchecked; no norms | Charter ("preferences don't block") + leadership |
| Reviews days late; heroes burning out | Bottleneck | Load concentration | Rotation + WIP cap (07) |
| Design rewritten after a week | Late design feedback | PRs too big; no early signal | Smaller PRs + design review before code |
| Nitpicking spikes after a dashboard | Gamed metric | Volume metric became a target | Kill the metric (07) |
Review-culture remediation order (sequence by dependency + cost):
| Order | Intervention | Why here | Link |
|---|---|---|---|
| 1 | Automate style | Fast win, zero deps, kills bikeshedding cheaply | 06 |
| 2 | Shrink PRs | Size is a tempo and depth input | 02 |
| 3 | Fix load / SLA / rotation | Achievable only after PRs shrink | 07 |
| 4 | Norms + style guide + severity labels + calibration | Norms land only after noise + load are handled | 05 |
| 5 | Fix / remove bad metrics | De-weaponize before culture work | 07 |
| 6 | Culture + psychological safety | Hardest, needs leadership; do last so nothing sabotages it | this page |
Real review vs review theatre — detection:
| Signal | Real review | Review theatre |
|---|---|---|
| Approval latency vs PR size | Scales with size | Flat and implausibly fast |
| Approval rate | < 100%; some changes requested | ~100%, near-zero change requests |
| Comment depth | Substantive ≫ cosmetic | Near-zero, or all cosmetic |
| Escaped defects | Few in review-catchable classes | Recurring in classes review should catch |
| DevEx free-text | "review makes my code better" | "review is a box to tick" |
Making good reviewing valued:
| Lever | Mechanism | Owner |
|---|---|---|
| Promo recognition | Cite high-quality review in perf/promo packets | Leadership + calibration committee |
| Modeling at the top | Staff/principal visibly review well and receive it well | Senior ICs |
| De-statusing | Reframe review as collaboration, not gatekeeping | Managers + norms |
| Psychological safety | Make imperfect PRs and push-back safe | Everyone, set by leaders |
| Charter + calibration | Encode the standard so it's not personality-bound | Team |
Review-charter checklist (one page, team-owned):
| Question the charter must answer | Default position |
|---|---|
| What is review for? | Defects + design + knowledge — not style |
| What blocks a merge? | Correctness, security, broken tests, wrong design |
| What's advisory? | Preferences, optional suggestions, future cleanups |
| Severity vocabulary? | nit: / suggestion: / blocking: |
| TTFR SLA + load sharing? | A stated target + a rotation |
| What does tooling own? | Style, formatting, lint (so humans don't) |
Mental Models¶
-
Diagnose the system, not the people. Review dysfunction is the predictable output of incentives, tooling gaps, load, and norms — not a character flaw in your reviewers. Bikeshedding means style isn't automated; rubber-stamping means reviewers are drowning or review isn't valued. Fix the system and the behavior follows.
-
Fixes have a dependency graph; sequence by dependency and cost. Automate style → shrink PRs → fix load/SLA → set norms → fix metrics → then culture. The cheap, dependency-free wins go first; they build trust and clear the obstacles that would otherwise sabotage the hard cultural work.
-
Review theatre is more dangerous than no review. No review knows it has no net; theatre believes it's protected and isn't. A green mandatory-review gate is an input to "is this scrutinized?", never the answer.
-
Good reviewing is invisible labor — so it degrades to rubber-stamping unless it's valued. If promo and perf ignore review, careful review is an unrewarded sacrifice and people rationally stop. You change this by changing what's valued, which is leadership work, not a memo.
-
Culture by personality is fragile; encode it. A charter and calibration move the standard from one person's head into the team's shared, written practice — the difference between a good quarter and a good decade.
-
Review metrics diagnose; they never target. The moment a depth or tempo signal becomes a number people are rewarded for, Goodhart deforms it and you manufacture the anti-pattern you were hunting.
Common Mistakes¶
-
Opening with the culture conversation. "We need to talk about our review culture" before the formatter is automated and PRs are small is the single most common way these programs die — you're asking people to be less nitpicky while style debate is still the cheapest comment to write. Do the cheap mechanical fixes first.
-
Treating the symptom instead of the root cause. The same symptom (slow reviews) has different causes (load vs no SLA vs invisible labor) needing different fixes. Map symptom → root cause before intervening, or you'll ship a fix that doesn't address the actual mechanism.
-
Responding to review theatre by making the mandate heavier. More required approvers and mandatory checklists add friction to an empty ritual. Attack the cause of the emptiness (load, value, PR size, norms); sometimes the right move is to reduce the mandate so the review you keep is real.
-
Putting a review metric on a dashboard people are judged by. Comments-per-review, approval counts — any of them — gets gamed the instant it's a target, and you manufacture nitpicking or padding. Measure for diagnosis; never reward the proxy.
-
Fixing the bottleneck with exhortation instead of structure. "Please review faster" doesn't move TTFR when two people do 70% of reviews. Rotation, WIP caps, and group CODEOWNERS do. And note that an overloaded reviewer rubber-stamps to clear the queue — so the bottleneck is also a quality problem.
-
Mistaking gatekeeping for a process problem. Rotation spreads the load but doesn't fix a reviewer whose adversarial reviews are a status behavior. That needs leadership to move the incentive and a senior to model the alternative — process tweaks alone won't do it.
-
Letting the standard live in one person's head. If your best reviewer holds "the bar" privately and leaves, review quality leaves with them. Encode it in a charter and keep it real with calibration.
Test Yourself¶
- Your org's dashboard shows median approval latency under two minutes, ~100% approval rate, and 0.3 comments per PR — alongside recurring production incidents in reviewed code. Name the anti-pattern, explain why it's more dangerous than having no review, and give the first three remediation steps in order.
- Two engineers do 70% of a team's reviews, TTFR is two days, and quality is also dropping. Explain the causal link between the concentration and the quality drop, and give the structural fix.
- Why is "automate style" the first step in the remediation program rather than, say, "set review norms"? What dependency makes the order matter?
- A leadership team puts "average comments per review" on a dashboard to encourage thoroughness, and review gets worse. Explain the mechanism by name, and state the rule about review metrics it violates.
- Good reviewing is described as "invisible labor." Explain why that property predicts a drift toward rubber-stamping, and name the single highest-leverage lever to counteract it.
- A senior engineer's adversarial reviews are driving attrition, but their TTFR is fine. Which signals reveal the problem, and why won't rotation alone fix it?
- What two durable artifacts make a healthy review culture survive the departure of your best reviewer, and what does each one do?
Answers
1. **Review theatre / rubber-stamping** — a mandatory-review gate producing approvals without scrutiny. It's more dangerous than no review because it manufactures *false assurance*: the org believes it's protected (and treats the green gate as evidence in postmortems and compliance) while no actual scrutiny is happening. First three steps, in order: **(1) automate style** (formatter + linter — the fast, dependency-free win), **(2) shrink PRs** (so reviewers can actually read them), **(3) fix load + SLA + rotation** (so reviewing is survivable). Only after those do you touch incentives/culture. 2. Concentration causes the quality drop because the two overloaded reviewers start **rubber-stamping to clear the queue** — quality falls *as* latency rises. It's not just a speed problem; an overloaded reviewer skims. Structural fix: **review rotation** to spread the duty, a **WIP cap** so no one holds too many open reviews, and **group CODEOWNERS** so reviews route to a team, not two names. Exhortation ("review faster") won't move it. 3. Automating style has **zero dependencies** and is the cheapest, most visible win — it makes bikeshedding *impossible* (the formatter already decided) and earns credibility for the harder work. Norms must come *after*, because a norm that says "don't argue about style" is hollow if style isn't already automated — you'd be asking people to stop doing the cheapest thing available to them. The dependency is: norms about style are only enforceable once style is automated. 4. **Goodhart's law** — "when a measure becomes a target, it ceases to be a good measure." Reviewers optimized the proxy (comment volume) by padding with nitpicks rather than improving the thing (substantive scrutiny). The rule it violates: **review metrics are for diagnosis, never for targets** — measure them, never reward them. 5. A careful reviewer produces enormous value (catching a race, redirecting a wrong design, mentoring) but leaves almost no visible artifact — the author's name is on the code. So if perf/promo **ignore review**, careful review is an unrewarded personal sacrifice, and engineers rationally stop making it — the equilibrium is rubber-stamping. Highest-leverage counter-lever: **make good reviewing count at promotion/performance time** (explicitly recognized, with examples). That makes careful review career-rational. 6. The revealing signals are **reviewer concentration** (this person does most of the team's reviews) crossed with **adversarial DevEx free-text** scoped to that team, and the cost surfaces in **exit interviews**, not TTFR. Rotation alone won't fix it because the behavior is a **status/culture problem wearing a process costume** — spreading reviews reduces exposure but doesn't change the reviewer's adversarial stance. That needs **leadership** to make collaborative review an explicit expectation tied to the next-level criteria, plus a senior IC modeling the alternative. 7. A **review charter** (a one-page, team-owned doc stating what review is for, what blocks a merge vs what's advisory, the severity vocabulary, the SLA/rotation, and what tooling owns) — it moves the implicit standard into a shared, neutral artifact. And **calibration** (reviewers periodically reviewing the same PR and reconciling against the charter) — it keeps the standard converged and real over time, instead of living in one person's head.Cheat Sheet¶
THE DIAGNOSTIC (read combinations, never single gauges)
TTFR x approval rate:
fast + ~100% + no comments → REVIEW THEATRE / rubber-stamping
slow + high change-requests → GATEKEEPING or OVERLOAD
fast + substantive comments → HEALTHY (do not "fix")
reviewer concentration high → BOTTLENECK (+ impending quality collapse)
nit-to-design lopsided to nits → BIKESHEDDING (style not automated)
same escaped-defect class → MISSING LENS (03 correctness / 04 sec+perf)
REMEDIATION ORDER (sequence by dependency + cost)
1. automate style fast win, 0 deps kills bikeshedding (06)
2. shrink PRs size = tempo + depth surfaces design early (02)
3. load / SLA / rotation achievable after #2 fixes bottleneck (07)
4. norms + style guide + severity labels + calibration (05)
5. fix / remove bad metrics de-weaponize before culture (07)
6. culture + psych safety hardest; leadership; LAST (this page)
REVIEW THEATRE
danger = FALSE ASSURANCE (worse than no review)
detect: latency flat vs PR size; ~100% approval; ~0 comments; recurring bugs
kill it: fix the CAUSE of emptiness (load/value/size/norms),
sometimes REDUCE the mandate — never make the empty ritual heavier
a green required-review gate is an INPUT to "is this scrutinized?", not the answer
CULTURE LAYER (the ceiling on review quality)
good review = INVISIBLE LABOR → degrades to rubber-stamping unless VALUED
highest lever: make good review count at PROMO/PERF time
model it at the top (review well AND receive it well)
de-status: collaboration, not gatekeeping
protect psychological safety
DURABLE ARTIFACTS
review charter: what review is for / what blocks / severity vocab / SLA / tooling owns
calibration: same PR, several reviewers, reconcile → converge the standard
test: if your best reviewer left, does the standard survive?
GOLDEN RULE
metrics DIAGNOSE, never TARGET (Goodhart deforms any rewarded proxy)
Summary¶
- Diagnose before you remediate. Read review health from combinations of signals — TTFR × approval rate, nit-to-design ratio, reviewer concentration, escaped-defect classes, DevEx free-text — to identify which anti-pattern dominates and why. No single gauge is interpretable alone; the two-by-two of tempo against approval rate is your fastest read.
- Map symptom → root cause → fix. The same symptom (slow, shallow, punishing review) has different systemic root causes — style-not-automated, PRs-too-big, overload, no-norms, bad-metrics, culture — each needing a different intervention. Treating the symptom ships a fix that misses the mechanism.
- Run the remediation program in dependency order: automate style (06) → shrink PRs (02) → fix load/SLA/rotation (07) → set norms + severity labels + calibration (05) → fix/remove bad metrics (07) → then the culture layer. Cheap, dependency-free wins first; they clear the obstacles that would sabotage the hard work.
- Review theatre is the org-level danger — mandatory-but-meaningless review that manufactures false assurance over an empty quality gate. Detect it from implausibly-fast approvals and absent comments; kill it by making review meaningful (and sometimes reducing the mandate), never by making the empty ritual heavier.
- The culture work is the ceiling and the hardest layer. Good reviewing is invisible labor, so it degrades to rubber-stamping unless leadership makes it valued (promo recognition), modeled (at the top), de-statused (collaboration not gatekeeping), and safe (psychological safety). Encode it durably in a charter and keep it real with calibration so it survives the departure of your best reviewer.
The section synthesis. Excellent review is not a vibe or a personality — it's the right things, done in the right order, defended against the predictable failure modes. Look at the right things first (01); keep changes small so review is deep and design feedback is early (02); hunt for the correctness and design problems only a human catches (03); bring the security and performance lenses (04); give feedback that's specific, severity-labelled, and useful (05); let automation own everything mechanical so humans review what only humans can (06); keep a healthy tempo and metrics that diagnose rather than distort (07); and recognize, diagnose, and systematically remediate the anti-patterns that wreck all of the above (08). The throughline of the entire section: review is a high-leverage collaboration to ship good code — and almost all of its dysfunction is systemic, diagnosable, and fixable. The last tier — interview.md — distills the whole topic into the questions that probe whether someone can actually run this diagnosis and remediation at scale.
Further Reading¶
- Winters, Manshreck & Wright — Software Engineering at Google, ch. 9 ("Code Review"). The single best published treatment: why review exists, the human factors, and the data on what works at scale. The grounding for the whole section.
- Google — "Code Review Developer Guide". Public and definitive: the standard of review, speed of reviews, and how to write comments — the practices these anti-patterns are deviations from.
- DORA & SPACE on review + DevEx. The findings that lightweight, in-team review beats heavyweight external approval, and that review tempo and experience are real productivity and well-being signals — the evidence base for the tempo and culture arguments.
- Michael Lynch — "How to Do Code Reviews Like a Human". The practical companion on tone, ego, and feedback — the human craft that the culture layer of this page depends on.
- Pointer: interview.md — the questions that test whether someone can read review health and run the remediation, not just recite the anti-patterns.
Related Topics¶
- 01 — What to Look For & In What Order — the order of operations these anti-patterns subvert; "missing lens" escaped-defect classes point back here.
- 05 — Giving & Receiving Feedback — severity labels, calibration, and the feedback craft that the norms step installs.
- 07 — Review Metrics & Tempo — the signals the diagnostic reads, and Goodhart's law on why those signals must never become targets.
- Quality Gates — the mandatory-review policy that review theatre hollows out; where to put (and remove) the required-review gate.
- Soft-Skills → Code Review — the human half: psychological safety, tone, and navigating the disagreements the culture layer is about.
In this topic