Automated Large-Scale Refactoring — Optimize This
Category: Anti-Patterns at Scale → Automated Large-Scale Refactoring
Covers (collectively): Codemods & AST transforms · Type-aware rewrites · Pattern tools (Comby, Semgrep, gofmt -r) · Idempotency & verification · Landing huge mechanical diffs
These are make-the-rollout-good exercises. Each gives you a transform that runs but is slow, unsafe, or both at scale — a bash loop that boots a fresh JVM per file across 10,000 files, a regex masquerading as a codemod, an unverified mass diff. Your job is to fix it on two axes at once: correctness and speed. Faster but still corrupting strings is not a win; safe but taking six hours per run means nobody will re-run it to verify.
The mindset: at scale, startup cost dominates and every edge case is present in some file. A per-file process spawn that's invisible at n=10 is a 30-minute tax at n=10,000; a regex that's "probably fine" will hit the one file where it isn't. Optimize the rollout, not just the transform.
How to use this file: read "Before," predict the bottleneck and the unsafe case yourself, then expand "After." The note on why it's faster AND safer is the point.
Table of Contents
| # | Exercise | Problem | Fix |
|---|---|---|---|
| 1 | The per-file Node spawn | Startup × 10k files | One process, batch the files |
| 2 | The per-file JVM (OpenRewrite the wrong way) | JVM boot × every file | One run over the whole module |
| 3 | The unsafe regex rollout | sed corrupts code | Structural/AST tool |
| 4 | The unverified mass diff | No safety net at scale | Add a machine-checkable verification pass |
| 5 | The serial codemod that ignores cores | One core, hours of wall time | Parallel shards + deterministic merge |
Exercise 1 — The per-file Node spawn
Problem: This script applies a jscodeshift transform to a large repo by looping in bash and invoking the runner once per file. On 10,000 files it takes ~25 minutes, almost all of it Node/V8 startup.
```bash
# Before: a fresh `node` (and full jscodeshift load) per file.
find src -name '*.ts' | while read -r f; do
  npx jscodeshift -t transform.js "$f"  # spawns node + parses transform EACH time
done
```
Why is this slow, and how do you fix it without changing what the transform does?
After
**Diagnosis.** The cost isn't the transform; it's **process startup paid 10,000 times.** Each `npx jscodeshift ...` boots Node, initializes V8, resolves and loads `jscodeshift` + its parser + `transform.js`, then transforms *one* file and exits. That fixed startup (~100–200 ms of Node boot + module load) dominates; the actual AST work on a single file is milliseconds. On top of that, `npx` itself does a resolution step on every invocation.

```text
Before: 10,000 × (node boot + load jscodeshift + load transform + 1 file)  ≈ 25 min
                 └──────────────── paid 10,000 times ─────────────────┘
```

```bash
# After: ONE node process; jscodeshift walks the tree and uses a worker pool.
npx jscodeshift -t transform.js --extensions=ts --parser=ts src
```
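The diagnosis can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a simple cost model, not a benchmark: the 150 ms startup is the midpoint of the 100–200 ms range quoted above, the 2 ms of per-file AST work is an assumed stand-in for "milliseconds", and the function name `rolloutMillis` is hypothetical.

```go
package main

import "fmt"

// rolloutMillis estimates total wall time in milliseconds when each process
// pays startupMs once and then handles up to filesPerProcess files.
// It deliberately ignores parallelism and disk I/O.
func rolloutMillis(files, filesPerProcess, startupMs, perFileMs int) int {
	if filesPerProcess < 1 {
		filesPerProcess = 1
	}
	processes := (files + filesPerProcess - 1) / filesPerProcess // ceiling division
	return processes*startupMs + files*perFileMs
}

func main() {
	const files, startupMs, perFileMs = 10_000, 150, 2 // assumed figures

	perFile := rolloutMillis(files, 1, startupMs, perFileMs)
	batched := rolloutMillis(files, files, startupMs, perFileMs)

	fmt.Printf("one process per file: %d s\n", perFile/1000)
	fmt.Printf("one process total:    %.2f s\n", float64(batched)/1000)
}
```

Under these assumptions the per-file loop spends 1,500 of its 1,520 seconds on startup, which lines up with the ~25 minutes in the problem statement, while the single-process run pays startup once. The worker pool would shrink the remaining per-file work further; the model leaves that out.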
Exercise 2 — The per-file JVM (OpenRewrite the wrong way)
Problem: A team wants an OpenRewrite recipe applied to a Java service. Someone wrote a loop that invokes the build per file. JVM startup plus Maven's own startup, paid thousands of times, turns a 2-minute job into an hour-plus — and it's also wrong.
```bash
# Before: boots Maven + JVM + re-parses the project for EVERY file.
find . -name '*.java' | while read -r f; do
  mvn -q org.openrewrite.maven:rewrite-maven-plugin:run \
    -Drewrite.activeRecipes=com.example.RenameApi \
    -Drewrite.includes="$f"
done
```
Two things are wrong here. What are they, and what's the correct invocation?
After
**Diagnosis: two faults.**

1. **JVM + Maven startup × every file.** Each `mvn` invocation boots the JVM, starts Maven, and re-resolves the project. That's seconds of fixed cost per file before any rewriting happens. At thousands of files it's the entire runtime.
2. **It defeats the *reason* you chose OpenRewrite.** OpenRewrite is type-aware: it builds a **Lossless Semantic Tree for the whole module** so it can resolve types across files. Running it one file at a time means each run parses that file *without the rest of the project's type context*, so the type attribution that makes the recipe safe is degraded or absent. You've thrown away the exact feature you paid for.

**Fix: run the recipe ONCE over the whole module.** Drop the loop and the per-file `-Drewrite.includes` filter, and invoke the same `rewrite-maven-plugin:run` goal a single time from the module root. OpenRewrite parses the full project, builds the typed LST, and applies the recipe across all files in a single JVM. In a multi-module monorepo, run it at the reactor root so every module is parsed with shared type context.

**Correctness improves and speed improves together.** Speed: the JVM and project resolution are paid once, not per file (an hour-plus → minutes). Correctness: with the whole-project LST, the recipe resolves types across files, so a `ChangeMethodName` recipe targets exactly the intended type's method everywhere and leaves same-named methods on other types alone, which the per-file run could not guarantee. This is the recurring at-scale lesson: **the type-aware tool must see the whole module; feeding it one file at a time is both slower and less safe.**

Exercise 3 — The unsafe regex rollout
Problem: A migration replaces the deprecated constructor call NewClient(url) with NewClientWithOptions(url, DefaultOpts) across a Go monorepo. Someone shipped it as a sed loop "because it was fast to write." It is fast — and it corrupts the codebase.
```bash
# Before: fast to write, unsafe at scale.
grep -rl 'NewClient(' --include='*.go' . | while read -r f; do
  sed -i 's/NewClient(\(.*\))/NewClientWithOptions(\1, DefaultOpts)/g' "$f"
done
```
```go
// What it hits in the wild:
c := NewClient(baseURL)               // intended → NewClientWithOptions(baseURL, DefaultOpts)
d := NewClient(join(host, port))      // nested call: right only while nothing later on the line contains ')'
f, g := NewClient(a), NewClient(b)    // hypothetical two-call line: greedy .* spans both → BREAKS
// see NewClient(url) for the old form   // a comment → corrupted
msg := "call NewClient(url) directly" // a string → corrupted
e := NewClientV2(baseURL)             // not matched, but an identifier ending in NewClient would be
```
What does this corrupt, and what's the safe rollout that's also fast?
After
**Diagnosis: `sed`'s `.*` is greedy and text-only.**

- `NewClient(join(host, port))`: `\(.*\)` runs to the *last* `)` on the line. When the call is the only thing on the line, that happens to produce the intended `NewClientWithOptions(join(host, port), DefaultOpts)`. As soon as anything after the call contains a `)`, such as a second call or a trailing comment, the greedy `.*` swallows everything between the first `NewClient(` and that last `)`, and `, DefaultOpts` lands in the wrong place.
- The **comment** and **string** mentioning `NewClient(url)` are rewritten, so documentation and user-facing text change meaning.
- `NewClientV2(` is not matched, because the text `NewClient(` never appears in it. The pattern is unanchored, though: any identifier that *ends* in `NewClient` (a hypothetical `MockNewClient(`, say) is rewritten too, and `sed` gives you no protection against such hazards.

It's "fast to write" but produces a diff you cannot trust, on code that may not even compile.

**Fix: use a structural tool.** Comby gives balanced-span safety with the same one-line authoring cost, and runs over the whole tree in one go:

```bash
# After: structural, balanced-aware, ignores strings/comments, one invocation.
comby 'NewClient(:[args])' 'NewClientWithOptions(:[args], DefaultOpts)' .go -in-place
```
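To make the difference concrete, here is a minimal sketch that swaps Comby for Go's standard `go/ast` package (a different structural tool, used here only because it is self-contained). `rewriteWithRegex` mirrors the greedy `sed` pattern; `rewriteWithAST` rewrites call expressions only. Both function names and the `sample` source are hypothetical, and the AST version is syntactic, not type-aware: it matches the bare identifier `NewClient`, so a package-qualified call or a shadowing local function would need extra handling.

```go
package main

import (
	"bytes"
	"fmt"
	"go/ast"
	"go/format"
	"go/parser"
	"go/token"
	"regexp"
)

// greedy mirrors the sed pattern from the Before script: text-only and greedy.
var greedy = regexp.MustCompile(`NewClient\((.*)\)`)

// rewriteWithRegex applies the text-level substitution to a single line.
func rewriteWithRegex(line string) string {
	return greedy.ReplaceAllString(line, "NewClientWithOptions(${1}, DefaultOpts)")
}

// rewriteWithAST parses src and rewrites only real call expressions whose
// callee is the identifier NewClient. Strings and comments are not call
// expressions, so they are left alone.
func rewriteWithAST(src string) (string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, parser.ParseComments)
	if err != nil {
		return "", err
	}
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		if fn, ok := call.Fun.(*ast.Ident); ok && fn.Name == "NewClient" {
			fn.Name = "NewClientWithOptions"
			call.Args = append(call.Args, ast.NewIdent("DefaultOpts"))
		}
		return true // keep descending: arguments may contain further calls
	})
	var buf bytes.Buffer
	if err := format.Node(&buf, fset, file); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// sample is a hypothetical input containing the hazards listed above.
const sample = `package p

func f() {
	d := NewClient(join(host, port)) // see NewClient(url) for the old form
	msg := "call NewClient(url) directly"
	e := NewClientV2(baseURL)
}
`

func main() {
	fmt.Println("regex:", rewriteWithRegex("a := NewClient(x); b := NewClient(y)"))

	out, err := rewriteWithAST(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Print(out)
}
```

The regex path needs only one line with two calls to go wrong, while the AST path never sees the comment or the string as candidates in the first place. That is the property to look for in whichever structural tool you pick.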
Exercise 4 — The unverified mass diff
Problem: A codemod produced a 30,000-line diff across 900 files. The rollout script applies it and opens a PR. There is no verification — no compile check, no idempotency check, no test run. The author plans to "review the diff." Reviewing 30,000 lines by eye is not verification; it's theater. Add a real safety net.
```bash
# Before: apply and pray.
npx jscodeshift -t migrate.js --extensions=ts src
git add -A && git commit -m "mechanical migration"
gh pr create --title "Migrate API" --body "Big codemod, please review the diff."
```
What verification passes turn this from "apply and pray" into a trustworthy rollout — and why are they cheaper than reading the diff?
After
**Diagnosis.** At 30,000 lines, human review can't *verify* correctness; it can only sample. The trustworthiness must come from **machine-checkable invariants**, run automatically, because they scale to any diff size and don't fatigue. You cannot read 900 files; you *can* assert four properties about them.

**Fix: a verification pipeline, cheapest signal first:**

```bash
#!/usr/bin/env bash
set -euo pipefail

# 1. Apply the codemod (one process; see Exercise 1).
npx jscodeshift -t migrate.js --extensions=ts --parser=ts src

# 2. STILL COMPILES: the strongest cheap signal. Unparseable output fails here.
npx tsc --noEmit

# 3. IDEMPOTENT: re-apply; a clean second run proves the transform reaches a fixed point.
git add -A
npx jscodeshift -t migrate.js --extensions=ts --parser=ts src
if ! git diff --quiet; then
  echo "FAIL: codemod is not idempotent; second run changed files" >&2
  git diff --stat >&2
  exit 1
fi

# 4. BEHAVIOR PRESERVED: a pure refactor must not change test outcomes.
npm test

# 5. NO RESIDUALS: the old pattern is fully gone (catches variants the codemod missed).
if grep -rn --include='*.ts' 'oldApiCall(' src; then
  echo "FAIL: legacy pattern still present; partial migration" >&2
  exit 1
fi

# 6. Formatting as a separate, pinned, final pass (keeps the logical diff clean).
npx prettier --write src

echo "All invariants hold; safe to open PR."
```
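The idempotency invariant in step 3 can also be checked in memory, as a unit test on the transform itself, before anything touches the repository. The sketch below shows the fixed-point check in Go; `isIdempotent`, `renameCall`, and `addHeader` are hypothetical names, and the two transforms are plain string functions standing in for a real AST-based codemod.

```go
package main

import (
	"fmt"
	"strings"
)

// isIdempotent reports whether applying transform a second time leaves the
// output of the first application unchanged (the fixed-point property).
func isIdempotent(transform func(string) string, src string) bool {
	once := transform(src)
	return transform(once) == once
}

// renameCall is a hypothetical well-behaved codemod: after one pass no
// "oldApiCall(" remains, so a second pass has nothing left to do.
func renameCall(src string) string {
	return strings.ReplaceAll(src, "oldApiCall(", "newApiCall(")
}

// addHeader is a hypothetical buggy codemod: it prepends a marker without
// checking whether the marker is already present, so every run grows the file.
func addHeader(src string) string {
	return "// migrated\n" + src
}

func main() {
	src := "x := oldApiCall(1)\n"
	fmt.Println("renameCall idempotent:", isIdempotent(renameCall, src))
	fmt.Println("addHeader idempotent: ", isIdempotent(addHeader, src))
}
```

Running this kind of check over a handful of fixture files catches the "second run keeps changing things" class of bug in seconds, which makes the repository-wide re-apply in the script a confirmation and not the first line of defence.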
Exercise 5 — The serial codemod that ignores cores
Problem: A custom Go codemod (using go/ast) processes files one at a time on a single goroutine. On a 16-core machine and 12,000 files it runs for ~18 minutes using one core. Parallelize it — without breaking determinism.
```go
// Before: serial; one core; deterministic but slow.
func main() {
	files := listGoFiles("./...") // returns a sorted []string
	for _, path := range files {
		src, _ := os.ReadFile(path)
		out := transform(src) // pure: bytes in, bytes out, no shared state
		os.WriteFile(path, out, 0o644)
	}
}
```
The transform is pure per file. Parallelize across cores while keeping the output deterministic and the run idempotent.
After
**Diagnosis.** `transform` is **pure and independent per file**, which is perfect for parallelism, but the loop uses one core, so 15 of 16 cores sit idle. The risk in parallelizing is **introducing nondeterminism**: if workers write shared state or you collect results in completion order, the output (or any report) becomes order-dependent and the "re-apply → empty diff" check goes flaky.

**Fix: a bounded worker pool over the *already-sorted* file list.** Each worker handles whole files independently; ordering is preserved because files don't interact, and any aggregated output is keyed back to the sorted index.

```go
// After: N workers, one per core; deterministic because work is per-file and independent.
func main() {
	files := listGoFiles("./...") // STILL sorted → deterministic file set
	workers := runtime.NumCPU()
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for path := range jobs {
				src, err := os.ReadFile(path)
				if err != nil {
					log.Fatalf("read %s: %v", path, err)
				}
				out := transform(src) // pure: no shared state to race on
				if err := os.WriteFile(path, out, 0o644); err != nil {
					log.Fatalf("write %s: %v", path, err)
				}
			}
		}()
	}

	for _, p := range files { // feed in sorted order
		jobs <- p
	}
	close(jobs)
	wg.Wait()
}
```
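The worker pool above writes files but produces no report. If the rollout also needs one (which files changed, which failed), the "keyed back to the sorted index" idea can be sketched as follows. This is one possible shape, not the only one: `fileResult`, `runSharded`, and the `process` callback are hypothetical names, and `process` stands in for read + transform + write.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// fileResult is a hypothetical per-file report entry.
type fileResult struct {
	Path    string
	Changed bool
	Err     error
}

// runSharded fans paths out to a bounded worker pool. Workers receive an
// index and write only to their own slot in results, so the report comes
// back in input order no matter which worker finishes first.
func runSharded(paths []string, workers int, process func(path string) (bool, error)) []fileResult {
	if workers < 1 {
		workers = 1
	}
	results := make([]fileResult, len(paths))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				changed, err := process(paths[i])
				results[i] = fileResult{Path: paths[i], Changed: changed, Err: err}
			}
		}()
	}
	for i := range paths { // feed indices in sorted order
		jobs <- i
	}
	close(jobs)
	wg.Wait() // all slot writes happen before results is read
	return results
}

func main() {
	paths := []string{"a.go", "b.go", "c.go", "d.go"} // already sorted
	// Stand-in for read + transform + write; reports whether the file changed.
	process := func(path string) (bool, error) {
		return path == "b.go" || path == "d.go", nil
	}
	for _, r := range runSharded(paths, runtime.NumCPU(), process) {
		fmt.Println(r.Path, r.Changed, r.Err)
	}
}
```

Two design choices are worth noting. Each goroutine writes to a distinct slice element, so no mutex is needed and completion order cannot leak into the report. Errors are recorded per file and not raised with `log.Fatalf` inside a worker, which lets in-flight writes on other workers finish before the run is declared failed.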
Summary
- Startup cost dominates at scale. The #1 rollout bug is one process per file — spawning Node/JVM/Maven 10,000 times. Fix: one process over many files (jscodeshift takes a directory; OpenRewrite parses the whole module once). 10–50× wins with zero change to the transform (Exercises 1–2).
- Feeding a type-aware tool one file at a time is slower and less safe: it loses the cross-file type context that justified choosing it. Run it over the whole module (Exercise 2).
- "Fast to write" regex trades a tiny authoring saving for unbounded corruption — nested calls, strings, comments. A structural tool (Comby) is equally fast to write and actually safe (Exercise 3).
- Reading a 30,000-line diff is not verification. Trust comes from machine-checkable invariants: compiles, idempotent (re-apply → empty diff), tests green, no residual pattern, formatter as a final pinned pass. Cheaper than eyeballing and scales to any diff (Exercise 4).
- Parallelize the embarrassingly-parallel part — independent per-file transforms across a worker pool — but keep ordering explicit and never let completion order leak into output, or you trade speed for nondeterminism (Exercise 5).
- The through-line: optimize the rollout, not just the transform. Speed comes from amortizing startup and using cores; safety comes from structural/type-aware tooling plus an automated verification pass. You need both, because at scale every edge case is present and every fixed cost is multiplied.
Related Topics
- Anti-Patterns at Scale — overview — where rollout mechanics fit among at-scale techniques.
- Architecture → Anti-Patterns — the architecture-level view of large mechanical change.
- Hotspot Analysis — scope the rollout to the code that matters.
- Architecture Fitness Functions — the CI invariants that make Exercise 4's verification permanent.
- Strangler Fig & Seams — the staged alternative when a mass rollout is too risky.
- Level files: `senior.md` covers the principles behind these optimizations, at depth.