Automated Large-Scale Refactoring — Find the Bug¶
Category: Anti-Patterns at Scale → Automated Large-Scale Refactoring Covers (collectively): Codemods & AST transforms · Type-aware rewrites · Pattern tools (Comby, Semgrep, gofmt -r) · Idempotency & verification · Landing huge mechanical diffs
This file is critical-reading practice for codemods. Each snippet below is a transform — a sed script, a Semgrep rule, a jscodeshift program, a ts-morph migration — that looks like it does the job and would pass a casual demo on one clean file. Your task is to find the flaw that makes it unsafe at scale: the case where it corrupts a string, double-applies, breaks on nested syntax, ignores a shadow, or rewrites the wrong thing because it isn't type-aware.
What you're training: the reviewer's instinct that a mechanical transform run on 10,000 files will hit every edge case that exists in those files, not just the happy one. A codemod that's right 99% of the time is wrong on 100 files. Read the transform, then read it again imagining the nastiest input in the repo.
How to use this file: read each snippet, write down what input breaks it and how, then expand the answer.
Table of Contents¶
- The
sedrename that ate a comment - The wrapper that wraps itself
- The regex that can't count parentheses
- The rename that ignored the shadow
- The
.mapthat wasn't an array - The import the codemod added twice
- The nondeterministic diff
Snippet 1 — The sed rename that ate a comment¶
# Migrating the method getUser() -> fetchUser() across a JS service.
sed -i 's/getUser(/fetchUser(/g' src/**/*.js
// src/api.js — some of what this hits
const u = db.getUser(id); // intended target → db.fetchUser(id)
// getUser() is deprecated; prefer the cache // a doc comment
log("getUser() returned null for " + id); // a string literal
session.getUser(); // a DIFFERENT object's getUser()
What input breaks this transform, and how?
Answer
**`sed` matches text, not code — it corrupts the comment, the string literal, and an unrelated receiver's method.** After the run: - `// getUser() is deprecated` → `// fetchUser() is deprecated` — the comment now documents a method that didn't exist when it was written; its meaning is rewritten. - `log("getUser() returned null...")` → `log("fetchUser() returned null...")` — a **user-facing log string** silently changed. If anything greps logs or matches on that text, it breaks. - `session.getUser()` → `session.fetchUser()` — `session` is a *different type* that happens to also have `getUser()`. You did not mean to rename it, and now it calls a method that may not exist → runtime error. **Why it's the canonical regex-on-code failure:** `getUser(` is just six characters to `sed`. It has no idea which occurrences are calls on the receiver you care about, which are inside strings/comments, or which belong to another type. It also *misses* a multi-line call (`db\n .getUser(id)`), leaving a straggler. **Safe fix — a type-aware rename (ts-morph), or at minimum a structural tool:**// rename only db.getUser via the type checker
import { Project } from "ts-morph";
const project = new Project({ tsConfigFilePath: "tsconfig.json" });
// find the getUser method on the DB type and rename it; references follow the binding
project.getSourceFileOrThrow("src/db.ts")
.getClassOrThrow("Database")
.getInstanceMethodOrThrow("getUser")
.rename("fetchUser");
project.saveSync();
Snippet 2 — The wrapper that wraps itself¶
// jscodeshift: wrap every readFile(...) call in instrument(...) for tracing.
module.exports = function (fileInfo, api) {
const j = api.jscodeshift;
return j(fileInfo.source)
.find(j.CallExpression, { callee: { name: "readFile" } })
.replaceWith((path) =>
j.callExpression(j.identifier("instrument"), [path.node])
)
.toSource();
};
// Run once: readFile("/x") -> instrument(readFile("/x"))
// Run twice: instrument(readFile("/x")) -> instrument(instrument(readFile("/x"))) ??
What input breaks this transform, and how?
Answer
**It is not idempotent — a second run double-wraps every call.** The transform matches `readFile(...)` and wraps it. After the first run the call is `instrument(readFile("/x"))` — the inner `readFile(...)` *still matches the finder*, so a second run produces `instrument(instrument(readFile("/x")))`. And it *will* run twice: on a rebased branch, a re-merged revert, a CI "re-apply to confirm clean" check, or someone re-running the command. The result is silent corruption of files that were already correct — the most dangerous kind, because the diff looks plausible. **Safe fix — guard on the output shape: skip a `readFile` call whose parent is already `instrument(...)`:**module.exports = function (fileInfo, api) {
const j = api.jscodeshift;
return j(fileInfo.source)
.find(j.CallExpression, { callee: { name: "readFile" } })
.filter((path) => {
const parent = path.parent.node; // the enclosing node
return !(
parent.type === "CallExpression" &&
parent.callee.name === "instrument"
); // already wrapped → skip
})
.replaceWith((path) =>
j.callExpression(j.identifier("instrument"), [path.node])
)
.toSource();
};
Snippet 3 — The regex that can't count parentheses¶
# Convert assert(cond, msg) -> assert(cond, msg, {fatal: true}) in a Go test helper.
# (Author chose sed because "the AST tool was overkill for one extra arg.")
sed -i -E 's/assert\(([^,]+), ([^)]+)\)/assert(\1, \2, FatalOpt)/g' *_test.go
// some matches in the wild
assert(ok, "must be true") // simple → fine
assert(equalish(a, b), "mismatch") // arg has its own parens + comma
assert(want, fmt.Sprintf("got %d (n=%d)", x, n)) // nested call, commas, parens
What input breaks this transform, and how?
Answer
**The regex has no idea where the call's arguments end — `[^,]+` and `[^)]+` break the moment an argument contains a comma or a parenthesis.** - `assert(equalish(a, b), "mismatch")`: the first capture `([^,]+)` stops at the comma inside `equalish(a, b)`, grabbing only `equalish(a`. The result is mangled: `assert(equalish(a, b, "mismatch"), FatalOpt)` or similar nonsense — the parentheses no longer balance and it won't compile (best case) or compiles into wrong code (worse case). - `assert(want, fmt.Sprintf("got %d (n=%d)", x, n))`: the `[^)]+` for the second arg stops at the **first** `)` — the one inside the format string's `(n=%d)` — truncating the argument mid-expression. A regex cannot match **balanced delimiters**; that's a fundamental limitation (it's not a regular language). Real code has nested calls, multi-arg expressions, parens in strings — so a `(...)` -matching regex is wrong on any non-trivial argument. **Safe fix — Comby, whose holes bind to balanced spans:** `:[cond]` correctly captures `equalish(a, b)` as one balanced span (Comby knows the inner `)` closes `equalish(`), and `:[msg]` captures the whole `fmt.Sprintf(...)` including its parens and commas. Comby also won't fire inside strings/comments. For Go specifically you could also use a `go/ast` program, but Comby is the one-line, balanced-aware tool that does exactly what the broken `sed` *meant* to do.Snippet 4 — The rename that ignored the shadow¶
// jscodeshift: the global helper format() was renamed to formatValue().
// Rename all calls to format().
module.exports = function (fileInfo, api) {
const j = api.jscodeshift;
return j(fileInfo.source)
.find(j.CallExpression, { callee: { name: "format" } })
.forEach((path) => { path.node.callee.name = "formatValue"; })
.toSource();
};
import { format } from "./helpers";
export function report(rows) {
return rows.map((r) => format(r)); // the global helper → formatValue(r)
}
export function custom(rows, format) { // PARAMETER named format — shadows the import
return rows.map((r) => format(r)); // calls the LOCAL param, NOT the helper!
}
What input breaks this transform, and how?
Answer
**It renames by *name*, ignoring scope — so it wrongly rewrites a call to a local variable that *shadows* the helper.** In `custom`, the parameter is also named `format`. Inside that function, `format(r)` calls the **parameter**, not the imported helper. The transform sees the identifier `format` and blindly renames it to `formatValue`, producing `formatValue(r)` — which now refers to a name that doesn't exist in that scope (or, worse, to a *different* `formatValue` if one exists). The call's meaning changed completely. The syntactic finder `{ callee: { name: "format" } }` matches the *spelling*. It does not resolve which **binding** that identifier refers to — the imported helper or the local parameter. Shadowing is exactly where spelling and binding diverge. **Safe fix — resolve the symbol (type-/scope-aware rename):**// ts-morph: rename the imported binding; the language service respects scope.
import { Project } from "ts-morph";
const project = new Project({ tsConfigFilePath: "tsconfig.json" });
const helpers = project.getSourceFileOrThrow("src/helpers.ts");
helpers.getFunctionOrThrow("format").rename("formatValue");
project.saveSync();
Snippet 5 — The .map that wasn't an array¶
// jscodeshift-style: a perf migration replaces array .map(fn) with mapFast(arr, fn).
// "All our hot loops use .map; let's swap them."
module.exports = function (fileInfo, api) {
const j = api.jscodeshift;
return j(fileInfo.source)
.find(j.CallExpression, {
callee: { type: "MemberExpression", property: { name: "map" } },
})
.replaceWith((path) => {
const recv = path.node.callee.object; // the thing before .map
const fn = path.node.arguments[0];
return j.callExpression(j.identifier("mapFast"), [recv, fn]);
})
.toSource();
};
const a = items.map((x) => x * 2); // Array.map → intended target
const m = new Map<string, number>();
const v = m.map; // (hypothetical) — different type entirely
userScores.map((s) => s.id); // is userScores an Array? a custom collection?
new Set([1, 2]).values(); // unrelated, but illustrates: .map exists on many types
What input breaks this transform, and how?
Answer
**It's not type-aware — it rewrites every `.map(...)` regardless of the receiver's type, but `map` exists on many types (`Array`, RxJS `Observable`, Immutable.js collections, custom classes, a `Map`-like wrapper), and `mapFast` is only correct for arrays.** `{ property: { name: "map" } }` matches the *member name* `map` on *any* object. So: - `observable.map((x) => ...)` (RxJS) gets rewritten to `mapFast(observable, fn)` — a **completely different operation**; `mapFast` expects an array and will throw or silently misbehave on an Observable. - A custom collection's `.map` with different semantics (lazy, returning a view) is replaced by the eager array version — a behavior change that compiles fine and surfaces as a subtle bug. - A property access spelled `.map` that isn't even a call may be mis-handled. This is the textbook **"not type-aware → rewrites the wrong `map`"** failure. The receiver's *type* determines whether `.map` is `Array.prototype.map`; a syntactic finder can't ask that question. **Safe fix — gate on the resolved receiver type with ts-morph:**import { Project, SyntaxKind, Node } from "ts-morph";
const project = new Project({ tsConfigFilePath: "tsconfig.json" });
for (const sf of project.getSourceFiles()) {
for (const call of sf.getDescendantsOfKind(SyntaxKind.CallExpression)) {
const callee = call.getExpression();
if (!Node.isPropertyAccessExpression(callee)) continue;
if (callee.getName() !== "map") continue;
const recvType = callee.getExpression().getType();
if (!recvType.isArray()) continue; // <-- the type gate
const recv = callee.getExpression().getText();
const fn = call.getArguments()[0].getText();
call.replaceWithText(`mapFast(${recv}, ${fn})`);
}
}
project.saveSync();
Snippet 6 — The import the codemod added twice¶
// jscodeshift: ensure every file that uses logger imports it.
// Add `import { logger } from "./log";` to the top of each matching file.
module.exports = function (fileInfo, api) {
const j = api.jscodeshift;
const root = j(fileInfo.source);
const usesLogger = root.find(j.Identifier, { name: "logger" }).size() > 0;
if (usesLogger) {
const imp = j.importDeclaration(
[j.importSpecifier(j.identifier("logger"))],
j.literal("./log")
);
root.get().node.program.body.unshift(imp); // prepend the import
}
return root.toSource();
};
// A file that ALREADY imports logger:
import { logger } from "./log";
logger.info("hi");
// After running the codemod once:
import { logger } from "./log";
import { logger } from "./log"; // ?? duplicate
logger.info("hi");
What input breaks this transform, and how?
Answer
**It's not idempotent and doesn't check for an existing import — it adds a duplicate `import { logger }` to files that already have one.** The condition is just "does the identifier `logger` appear?" — which is true for files that *already* import it (the import itself contains the identifier, and so does every use). So the codemod prepends a second, identical import. Run it again and you get three. The duplicate import is a syntax error in strict ESM or a lint failure at best; at worst a bundler dedups it and hides the bug until it doesn't. This is the **double-apply** failure again, in import-management clothing: the transform matches a state (`uses logger`) that is *also true after* it has run, so it never reaches a fixed point. **Safe fix — check for an existing import first; match the "missing" state, not the "uses" state:**module.exports = function (fileInfo, api) {
const j = api.jscodeshift;
const root = j(fileInfo.source);
const usesLogger = root.find(j.Identifier, { name: "logger" }).size() > 0;
const alreadyImported =
root
.find(j.ImportDeclaration, { source: { value: "./log" } })
.find(j.ImportSpecifier, { imported: { name: "logger" } })
.size() > 0;
if (usesLogger && !alreadyImported) { // <-- the guard
const imp = j.importDeclaration(
[j.importSpecifier(j.identifier("logger"))],
j.literal("./log")
);
root.get().node.program.body.unshift(imp);
}
return root.toSource();
};
Snippet 7 — The nondeterministic diff¶
// A Go codemod that adds a build tag to every file in a set of packages.
// It collects target files in a map, then rewrites each.
func run(targets map[string]*ast.File) {
for path, file := range targets { // <-- iterating a map
addBuildTag(file)
writeFile(path, file)
}
// Imports across files are deduped into a shared set, emitted at the end:
emitImports(sharedImportSet) // sharedImportSet is a map[string]bool
}
$ go run codemod.go && git diff > run1.diff
$ git checkout . && go run codemod.go && git diff > run2.diff
$ diff run1.diff run2.diff
# ... the diffs differ between runs, even though the input is identical ...
What input breaks this transform, and how?
Answer
**It's nondeterministic — it iterates Go maps, whose iteration order is randomized by design, so the output (file processing order and emitted import order) changes from run to run.** Two sources here: 1. `for path, file := range targets` — Go deliberately randomizes `map` iteration order. If processing order affects output (e.g. shared state, or just the order edits are applied to overlapping regions), the result varies. Even when it doesn't change file *contents*, it can change the order of operations in logs/reports. 2. `sharedImportSet` is a `map[string]bool`; `emitImports` ranges over it, so the emitted import block is in a **random order** each run — `run1` has `import ("a"; "b")`, `run2` has `import ("b"; "a")`. Nondeterminism is poison at scale: the "re-apply and expect an empty diff" CI check becomes flaky, reviewers can't trust that what they reviewed is what lands, and build caches break. It also makes the *idempotency* test unreliable — a second run produces a *different reordering*, which looks like the codemod "did something" when it merely shuffled. **Safe fix — impose a deterministic order: sort everything before iterating.**func run(targets map[string]*ast.File) {
paths := make([]string, 0, len(targets))
for p := range targets {
paths = append(paths, p)
}
sort.Strings(paths) // deterministic file order
for _, path := range paths {
addBuildTag(targets[path])
writeFile(path, targets[path])
}
imports := make([]string, 0, len(sharedImportSet))
for imp := range sharedImportSet {
imports = append(imports, imp)
}
sort.Strings(imports) // deterministic import order
emitImports(imports)
}
Summary — the recurring failure modes¶
| # | Snippet | Root failure | The class of bug |
|---|---|---|---|
| 1 | sed rename | Text match hits strings/comments/wrong receiver | Not structural — regex on code |
| 2 | self-wrapping wrapper | Matches its own output | Not idempotent — double-applies |
| 3 | paren-counting regex | Can't match balanced delimiters | Not structural — breaks on nesting |
| 4 | shadowed rename | Renames by spelling, not binding | Not scope/type-aware — ignores shadows/overloads |
| 5 | wrong .map | Rewrites map on any type | Not type-aware — wrong receiver type |
| 6 | duplicate import | "Add X" without "unless X exists" | Not idempotent — double-applies |
| 7 | shuffled output | Iterates an unordered map | Not deterministic — order varies |
Three rules cover all seven:
- Edit structure, not text — use an AST/CST/type-aware tool (or at least Comby) so you never touch strings, comments, or the wrong scope/type.
- Be idempotent — match the "before" state only; never the "after." Prove it with a double-run-empty-diff test.
- Be deterministic — sort every unordered collection; pull out timestamps/randomness; run the formatter as a separate pinned pass.
A codemod that violates any one of these will corrupt files at scale, because at scale every edge case is present in some file.
Related Topics¶
- Anti-Patterns at Scale — overview — the at-scale context for these failures.
- Architecture → Anti-Patterns — the architecture-level view of mass change.
- Hotspot Analysis — targeting codemods where they pay off.
- Architecture Fitness Functions — CI checks that catch the residuals these buggy codemods leave behind.
- Strangler Fig & Seams — when a change is too semantic to mechanize safely.
- Level files:
senior.md— the principles these bugs violate, explained in depth.
In this topic