Premature Optimization Traps — Optimize It (Inverted)
Category: Performance Anti-Patterns → Premature Optimization Traps — code twisted for speed that was never measured and rarely matters.
This optimize.md is inverted. For the other anti-patterns in this chapter you take slow code and make it fast. Here you do the opposite, because the anti-pattern is misplaced speed: you take code that was prematurely "optimized" into an unreadable mess and simplify it back to clarity — then prove, with a benchmark, that the simplification cost nothing (and sometimes gained). "Optimizing" a premature optimization means removing it.
The file ends with the necessary counterweight: one real hotspot, correctly optimized with measurement — so you see that the discipline isn't "never optimize," it's "optimize the proven 3%, and prove it." Every case shows before/after with numbers.
How to use this file: for each case, predict the benchmark delta before reading it. If you expect the "optimized" version to be meaningfully faster and it isn't, that prediction error is the lesson. The numbers below are representative (Apple M-series / OpenJDK 21 / CPython 3.12) — reproduce them on your machine; the shape of the result is what generalizes.
Table of Contents
| # | Case | Direction | Lang |
|---|---|---|---|
| 1 | Un-unroll the loop | simplify back | Go |
| 2 | Delete the StringBuilder ceremony | simplify back | Java |
| 3 | Rip out the cache that costs more than it saves | simplify back | Python |
| 4 | Un-inline the hand-flattened helper | simplify back | Go |
| 5 | The real hotspot, correctly optimized | optimize for real | Go |
Case 1 — Un-unroll the loop
Before — manually 4-way unrolled into four separate accumulators. Someone "optimized" a dot product.
func Dot(a, b []float64) float64 {
var s0, s1, s2, s3 float64
n, i := len(a), 0
for ; i+4 <= n; i += 4 { // manual 4-way unroll into 4 accumulators
s0 += a[i] * b[i]
s1 += a[i+1] * b[i+1]
s2 += a[i+2] * b[i+2]
s3 += a[i+3] * b[i+3]
}
sum := s0 + s1 + s2 + s3
for ; i < n; i++ { // tail
sum += a[i] * b[i]
}
return sum
}
After — the obvious loop.
func Dot(a, b []float64) float64 {
var sum float64
for i := range a {
sum += a[i] * b[i]
}
return sum
}
Prove it (run go test -bench=Dot -count=10 once per version, save each output to a file, and compare the two files with benchstat; n=4096):
Result: identical (~, p=0.55). The compiler schedules the simple loop just as well. Go's gc compiler does not auto-vectorize today; on toolchains that do, the plain loop is the shape the optimizer recognizes, so the clean version can win. The unroll bought nothing and cost the tail-loop bookkeeping (an i+4 <= n off-by-one waiting to happen) plus four accumulators a reader must mentally re-combine. Clarity was free.
Note the subtle behavioral trap the unroll introduced: four separate accumulators sum floats in a different order than one accumulator, so the two versions can disagree in the last ULP. The "optimization" silently changed the result. Reverting also restored the obvious summation order.
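The reordering effect described above can be made visible with a small program. This is a minimal sketch, not part of the original benchmark: the names dotSimple and dotUnrolled are illustrative, and the input is deliberately contrived (a large value followed by small ones) so the difference is far bigger than the one-ULP drift typical of real data.

```go
package main

import "fmt"

// dotSimple accumulates left to right in a single accumulator.
func dotSimple(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// dotUnrolled mirrors the 4-accumulator "Before" version: elements
// 0,4,8,... go to s0, elements 1,5,9,... go to s1, and so on.
func dotUnrolled(a, b []float64) float64 {
	var s0, s1, s2, s3 float64
	n, i := len(a), 0
	for ; i+4 <= n; i += 4 {
		s0 += a[i] * b[i]
		s1 += a[i+1] * b[i+1]
		s2 += a[i+2] * b[i+2]
		s3 += a[i+3] * b[i+3]
	}
	sum := s0 + s1 + s2 + s3
	for ; i < n; i++ { // tail
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	// Contrived input: at magnitude 1e16 adjacent float64 values are 2 apart,
	// so adding 1 to 1e16 is lost to rounding. The single accumulator loses
	// the first 1; the unrolled version cancels 1e16 against -1e16 in s0 and
	// keeps both 1s in s1.
	a := []float64{1e16, 1, 0, 0, -1e16, 1, 0, 0}
	b := []float64{1, 1, 1, 1, 1, 1, 1, 1}
	fmt.Println(dotSimple(a, b), dotUnrolled(a, b)) // prints: 1 2
}
```

Neither answer is "the" correct one in general; the point is only that the two loops are not interchangeable bit for bit, so swapping one for the other is a behavior change that deserves a test.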
Case 2 — Delete the StringBuilder ceremony
Before — StringBuilder for a fixed three-part string.
String url(String host, int port, String path) {
return new StringBuilder()
.append("https://").append(host)
.append(':').append(port)
.append('/').append(path)
.toString();
}
After — plain concatenation.
String url(String host, int port, String path) {
return "https://" + host + ":" + port + "/" + path;
}
Prove it (JMH, AverageTime, 5 warm-up + 10 measured iterations, 3 forks):
Benchmark Mode Cnt Score Error Units
url_stringBuilder avgt 30 38.1 ± 1.2 ns/op
url_concat avgt 30 31.7 ± 0.9 ns/op <-- clearer AND ~17% faster
Result: the clear version is faster. Since JDK 9 (JEP 280), javac compiles + concatenation to an invokedynamic call bootstrapped by StringConcatFactory, which often beats a hand-written StringBuilder chain, and it's one readable line. The "optimization" was both slower and uglier. Reverting it improved both axes.
The StringBuilder is the right call inside a loop with unknown iteration count — that's the measurable case. For a fixed, small number of operands, the compiler wins. Know which case you're in; don't apply the reflex blindly.
Case 3 — Rip out the cache that costs more than it saves
Before — a memoization dict around a cheap computation, in a request hot path.
_fee_cache = {}
def fee(amount: float) -> float:
key = round(amount, 2)
if key in _fee_cache:
return _fee_cache[key]
result = amount * 0.029 + 0.30 # the entire "expensive" computation
_fee_cache[key] = result
return result
After — just compute it. Delete _fee_cache; the body of fee becomes the single statement return amount * 0.029 + 0.30.
Prove it (pyperf timeit, distinct amounts so the cache actually misses like production):
cached: 210 ns ± 7 ns per call (dict lookup + insert + the multiply)
direct: 48 ns ± 2 ns per call (just the multiply) <-- 4.4x FASTER
Result: the cache made it 4× slower — the dict lookup and insert cost far more than the multiply-add they "save," and real traffic has distinct amounts so the hit rate is low anyway. Worse, the cache is an unbounded memory leak (every distinct amount retained forever) and shared mutable state (not thread-safe). The "optimization" added a leak, a race, and a slowdown.
Removing it restored a pure, fast, thread-safe one-liner. Caching is justified only when a profile shows the computation is expensive and repeated with the same inputs — neither held here.
Case 4 — Un-inline the hand-flattened helper
Before — a validation helper hand-inlined "to avoid the call".
func process(rs []*Record) {
for _, r := range rs {
// validity check inlined by hand at 3 call sites in this file:
if r != nil && r.ID > 0 && r.Name != "" &&
r.TS > 0 && r.TS < time.Now().UnixNano() {
save(r)
}
}
}
After — extract the named helper back out.
func valid(r *Record) bool {
return r != nil && r.ID > 0 && r.Name != "" &&
r.TS > 0 && r.TS < time.Now().UnixNano()
}
func process(rs []*Record) {
for _, r := range rs {
if valid(r) {
save(r)
}
}
}
Prove it (go test -bench=Process -count=10; confirm inlining with -gcflags=-m):
$ go build -gcflags='-m' ./... 2>&1 | grep 'inline.*valid'
./record.go:1:6: can inline valid
./record.go:14:11: inlining call to valid <-- compiler inlines it for us
$ benchstat hand-inlined.txt extracted.txt
│ hand-inlined │ extracted │
│ sec/op │ sec/op vs base │
Process-10 9.81µs ± 1% 9.79µs ± 2% ~ (p=0.73 n=10)
Result: identical (~), because the compiler inlines valid anyway — the -gcflags=-m output proves it. The hand-inlining gained nothing the compiler wasn't already doing, while costing a named concept and duplicating the rule across three call sites (so a future change to "valid" must be made in three places — a real bug magnet). Extracting it back was free at runtime and a large readability win.
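For readers who want to reproduce the comparison, here is one possible harness. It is a sketch under stated assumptions: Record, save, and sampleRecords are hypothetical stand-ins (the original does not show them), and it calls testing.Benchmark from main so it runs as a plain program. In a real repository the two closures would normally live in a _test.go file as BenchmarkProcess functions and be run with go test -bench.

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// Record and save are hypothetical stand-ins; the original does not define them.
type Record struct {
	ID   int64
	Name string
	TS   int64
}

var saved int // package-level sink so the loop body has an observable effect

func save(r *Record) { saved++ }

// valid is the extracted helper from the "After" listing.
func valid(r *Record) bool {
	return r != nil && r.ID > 0 && r.Name != "" &&
		r.TS > 0 && r.TS < time.Now().UnixNano()
}

func processExtracted(rs []*Record) {
	for _, r := range rs {
		if valid(r) {
			save(r)
		}
	}
}

// processInlined repeats the rule by hand, as in the "Before" listing.
func processInlined(rs []*Record) {
	for _, r := range rs {
		if r != nil && r.ID > 0 && r.Name != "" &&
			r.TS > 0 && r.TS < time.Now().UnixNano() {
			save(r)
		}
	}
}

// sampleRecords builds n slots; every tenth slot is left nil (invalid).
func sampleRecords(n int) []*Record {
	rs := make([]*Record, n)
	for i := range rs {
		if i%10 != 0 {
			rs[i] = &Record{ID: int64(i), Name: "r", TS: 1}
		}
	}
	return rs
}

func main() {
	rs := sampleRecords(1000)
	inlined := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			processInlined(rs)
		}
	})
	extracted := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			processExtracted(rs)
		}
	})
	fmt.Println("hand-inlined:", inlined)
	fmt.Println("extracted:   ", extracted)
}
```

A single run like this only gives two point estimates. The p-value in the benchstat table comes from repeating each benchmark (the -count=10 in the command above) and comparing the samples, which is the part that separates a real difference from noise.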
Case 5 — The real hotspot, correctly optimized
The counterweight. The discipline is not "never optimize" — it's "optimize the proven hotspot, and prove the win." Here's that done right, so the contrast with Cases 1–4 is sharp.
The situation. A log-ingestion service misses its throughput target. We profile before touching anything:
go test -bench=Ingest -cpuprofile cpu.out
go tool pprof -top cpu.out
# flat flat% function
# 4.20s 63.1% parseTimestamp <-- the hotspot; called per log line
# 0.91s 13.7% splitFields
# ...
parseTimestamp is 63% of CPU, measured, on the hottest path (per line). This is Knuth's critical 3% — the place optimization is supposed to go. The before code re-parses the layout string on every call:
// Before — time.Parse re-interprets the layout for every single line.
func parseTimestamp(s string) (int64, error) {
t, err := time.Parse("2006-01-02T15:04:05Z07:00", s) // hot: millions of calls
if err != nil {
return 0, err
}
return t.UnixNano(), nil
}
The fix — a specialized parser for the known, fixed format, avoiding the general layout machinery. It's uglier, so it ships boxed: a clean signature, a fuzz oracle proving it matches time.Parse, and a benchmark guard.
// After — hand parse the fixed RFC3339-UTC shape. Justified: 63% of CPU
// (see cpu.out / PR), 3.1x faster (BenchmarkParseTimestamp), verified equal
// to time.Parse by FuzzParseTimestamp. Don't simplify without re-benchmarking.
func parseTimestamp(s string) (int64, error) {
if len(s) != 20 || s[4] != '-' || s[7] != '-' || s[10] != 'T' ||
s[13] != ':' || s[16] != ':' || s[19] != 'Z' {
return slowParse(s) // fall back to time.Parse for anything unexpected
}
y := atoi4(s[0:4]); mo := atoi2(s[5:7]); d := atoi2(s[8:10])
h := atoi2(s[11:13]); mi := atoi2(s[14:16]); sec := atoi2(s[17:19])
return time.Date(y, time.Month(mo), d, h, mi, sec, 0, time.UTC).UnixNano(), nil
}
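The listing leaves atoi2, atoi4, slowParse, and the oracle undefined, so here is one way they might look. Everything below is an assumption for illustration, not the original implementation: the helpers return -1 on a non-digit, the fast path also checks every separator, and any field outside a plainly valid range is handed back to time.Parse. That last step matters because time.Date silently normalizes out-of-range values (month 13 rolls into the next year) where time.Parse returns an error. Days 29 to 31 are sent to the slow path on purpose to keep the sketch short; a production version would check days per month instead.

```go
package main

import (
	"fmt"
	"time"
)

// slowParse is the general path: the original time.Parse call.
func slowParse(s string) (int64, error) {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		return 0, err
	}
	return t.UnixNano(), nil
}

// atoiFixed converts a run of ASCII digits, returning -1 on any non-digit.
// atoi2 and atoi4 are hypothetical helpers named after the calls in the listing.
func atoiFixed(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c < '0' || c > '9' {
			return -1
		}
		n = n*10 + int(c-'0')
	}
	return n
}

func atoi2(s string) int { return atoiFixed(s) }
func atoi4(s string) int { return atoiFixed(s) }

// parseTimestamp takes the fast path only for the exact shape
// YYYY-MM-DDTHH:MM:SSZ with plainly valid fields; all else falls back.
func parseTimestamp(s string) (int64, error) {
	if len(s) != 20 || s[4] != '-' || s[7] != '-' || s[10] != 'T' ||
		s[13] != ':' || s[16] != ':' || s[19] != 'Z' {
		return slowParse(s)
	}
	y, mo, d := atoi4(s[0:4]), atoi2(s[5:7]), atoi2(s[8:10])
	h, mi, sec := atoi2(s[11:13]), atoi2(s[14:16]), atoi2(s[17:19])
	if y < 0 || mo < 1 || mo > 12 || d < 1 || d > 28 ||
		h < 0 || h > 23 || mi < 0 || mi > 59 || sec < 0 || sec > 59 {
		return slowParse(s) // let time.Parse accept or reject it
	}
	return time.Date(y, time.Month(mo), d, h, mi, sec, 0, time.UTC).UnixNano(), nil
}

// agrees is the oracle: both parsers fail, or both succeed with equal values.
func agrees(s string) bool {
	got, gotErr := parseTimestamp(s)
	want, wantErr := slowParse(s)
	if (gotErr != nil) != (wantErr != nil) {
		return false
	}
	return gotErr != nil || got == want
}

func main() {
	samples := []string{
		"2024-03-15T12:34:56Z",      // fast path
		"2024-02-29T00:00:00Z",      // day 29: falls back, valid leap day
		"2023-02-29T00:00:00Z",      // falls back, rejected by time.Parse
		"2024-13-01T00:00:00Z",      // month out of range
		"2024-03-15T12:34:56+02:00", // offset form, wrong length for fast path
		"20x4-03-15T12:34:56Z",      // non-digit in the year
		"",
	}
	for _, s := range samples {
		fmt.Printf("%q agree=%v\n", s, agrees(s))
	}
}
```

Under go test, the same agrees check would sit inside a fuzz target (testing.F, available since Go 1.18) seeded with samples like these, so the fuzzer searches for an input where the two parsers disagree rather than relying on a hand-picked list.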
Prove it (benchstat, -count=10):
│ before │ after │
│ sec/op │ sec/op vs base │
ParseTimestamp-10 198.0n ± 1% 63.4n ± 1% -68.0% (p=0.000)
And confirm the whole-program win by re-profiling:
Why this is the right kind of optimization — and the inverse of Cases 1–4:
| | Cases 1–4 (premature) | Case 5 (justified) |
|---|---|---|
| Profile says it's hot? | No | Yes — 63% of CPU |
| Benchmark proves the win? | No (it was noise/negative) | Yes — 3.1×, p=0.000 |
| Win matters? | No (cold/cheap) | Yes — 2.4× throughput |
| Guarded + oracle + comment? | No | Yes — fuzz oracle + benchmark + why |
| Verdict | remove it | keep it, boxed |
The cleverness in Case 5 is no greater than in Case 1 — what makes it right is the evidence and the box around it. And note the re-profile: after the fix, parseTimestamp dropped to 8%, so it's no longer the hotspot. The next optimization target (if any) is now splitFields — and you'd repeat the loop, or stop if you've hit the throughput target.
The discipline — recap
Inverting the usual exercise drove home the anti-pattern's core truth:
- Most "optimizations" benchmark as noise or worse (Cases 1, 4: ~; Cases 2, 3: the clear version wins). Simplifying them back is free or a gain — and removes the bugs they smuggled in (Case 1's reordered float sum, Case 3's leak + race, Case 4's duplication).
- The clear version is the optimizer's friend. Compilers inline (Case 4), vectorize (Case 1), and pick the best concat strategy (Case 2) for you — and do it better on code they recognize. Hand-doing their job usually ties their hands.
- "Optimizing" a premature optimization means deleting it and proving with benchstat/JMH/pyperf that clarity cost nothing.
- Real optimization is the same discipline pointed at the proven 3% (Case 5): profile → confirm it's hot → benchmark the fix → box it with an oracle and a guard → re-profile. The cleverness is identical to the premature cases; the evidence is what makes it right.
premature: clever + no profile + no benchmark → REMOVE (clarity is free)
justified: clever + profiled hot + benchmarked → KEEP, boxed (the critical 3%)
Related Topics
- junior.md · middle.md · senior.md · professional.md — recognize → measure → judge → the hard line.
- find-bug.md — the spotting counterpart, including the justified opt you must not delete.
- tasks.md — guided practice: revert an unhelpful micro-opt and guard a justified one.
- N+1 in Code → optimize.md · Unnecessary Allocation → optimize.md · Wrong Data Structure → optimize.md — the real hotspots, where optimize.md runs in its normal direction.
- Refactoring → Refactoring Techniques — reverting an over-clever optimization is a refactoring with a benchmark attached.
- The profiling-techniques and big-o-analysis skills — the measurement that separates Case 5 from Cases 1–4.