Large-Scale Automated Migrations: Middle
Source: Google, "Software Engineering at Google" (Large-Scale Changes ch.); OpenRewrite docs
The junior file gave you the playbook. This file is the operator's manual for the four parts you'll actually build and tune: sharding strategy, automated PR generation and review bots, completion tracking, and idempotency + deprecation enforcement.
1. Sharding strategies in depth
Sharding is choosing the cut lines that turn one change into many independent ones. The cut you pick determines reviewability, CI cost, and how partial failure is contained.
1.1 Shard by ownership (default for monorepos)
Route each shard to exactly one review team. The boundary is the OWNERS (or CODEOWNERS) file. This is the default because it makes review someone's job with no coordination.
# Group changed files by their nearest OWNERS-defined area.
# (Sketch: walk up from each file to the nearest OWNERS file.)
nearest_owner() {
  dir=$(dirname "$1")
  # Stop at "." (relative paths) or "/" (absolute paths) so the loop always ends.
  while [ "$dir" != "." ] && [ "$dir" != "/" ]; do
    [ -f "$dir/OWNERS" ] && { echo "$dir"; return; }
    dir=$(dirname "$dir")
  done
  echo "ROOT"
}
mkdir -p shards   # the awk redirect below cannot create the directory itself
while read -r f; do echo "$(nearest_owner "$f")|$f"; done < changed-files.txt \
  | sort | awk -F'|' '{ name = $1; gsub(/\//, "_", name); print $2 >> ("shards/" name ".files") }'
1.2 Shard by build target (best for CI signal)
Make each shard one compilation/test unit (Bazel target, Go package, Maven module). The payoff: each PR's CI is a clean, self-contained build+test of exactly the affected unit, so a green check actually means something. Pair this with a test-impact system (Google's TAP) that runs only the tests reachable from the changed target.
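The grouping step can be sketched the same way as the ownership walk, only keyed on a build-unit marker file instead of OWNERS. This is a minimal sketch under assumptions: the helper name `nearest_unit` is hypothetical, and the default marker `pom.xml` (one Maven module per shard) is just one choice; pass a different marker for other build systems.

```shell
#!/usr/bin/env bash
# nearest_unit FILE [MARKER]
# Print the closest ancestor directory of FILE that contains MARKER
# (hypothetical default: pom.xml, i.e. one Maven module per shard).
# Prints ROOT when no ancestor has the marker.
nearest_unit() {
  local dir marker="${2:-pom.xml}"
  dir=$(dirname "$1")
  while :; do
    [ -f "$dir/$marker" ] && { echo "$dir"; return 0; }
    # "." ends a relative walk, "/" ends an absolute one.
    case "$dir" in .|/) break ;; esac
    dir=$(dirname "$dir")
  done
  echo "ROOT"
}
```

Feeding each changed file through `nearest_unit` and grouping on the printed directory gives one shard per build unit, so each PR's CI run covers exactly that unit.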
1.3 Shard by file count (fallback)
When there's no clean ownership or build boundary, cap shards at N files (e.g. 50). Crude but keeps PRs reviewable. Combine with ownership where possible — never split a single owner's files across shards if you can avoid it, since that fragments their review.
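One way to combine the file cap with ownership, sketched under assumptions: the input is `owner|file` lines already sorted by owner (the same intermediate format the ownership sketch in 1.1 produces before its `awk` step), and the function name `shard_by_count` and the `shard-NNN` naming are illustrative only.

```shell
#!/usr/bin/env bash
# shard_by_count MAX
# Reads "owner|file" lines (sorted by owner) on stdin and prints
# "shard-id|file". A new shard starts whenever the owner changes or the
# current shard already holds MAX files, so an owner's files are split
# only when that owner has more than MAX of them.
shard_by_count() {
  awk -F'|' -v max="$1" '
    $1 != owner || n == max { shard++; n = 0; owner = $1 }
    { n++; printf "shard-%03d|%s\n", shard, $2 }
  '
}
```

This variant never packs two owners into one shard; whether small owners should be merged to reduce PR count is a separate trade-off it does not attempt.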
1.4 Sizing
| Too small (1 file/PR) | Too large (5,000 files/PR) |
|---|---|
| Thousands of PRs, review fatigue, bot/CI overhead dominates | Unreviewable, one failure blocks everything, conflict-prone |
Aim for the largest shard a single owning team can still meaningfully review — usually tens of files. The bot makes many shards cheap, so err toward smaller.
When NOT to shard: an atomic change with no compatibility shim (a cross-module rename that won't compile when only partially migrated) can't be sharded into independent PRs. Add a shim first to make shards independent, or do one atomic change. See senior.md.
2. Automated PR generation and review bots
You will not hand-create hundreds of PRs. A bot does, and it owns the rollout lifecycle.
2.1 What the bot does per shard
for each shard:
1. create a branch
2. run the codemod scoped to the shard's files
3. commit with a standard, traceable message (links the tracking issue)
4. open a PR; request review from the shard's OWNERS
5. trigger CI (ideally only the affected tests)
6. on green + owner-approval -> merge (or auto-merge label)
7. on red -> quarantine the shard, log why, move on (don't block other shards)
8. report status to the completion dashboard
A thin sketch over the GitHub CLI:
#!/usr/bin/env bash
# rollout.sh: one shard, one PR. Real systems add retries, rate limits, backoff.
set -euo pipefail
TRACKING_ISSUE="$1"; shard="$2"   # shard: e.g. payments
git switch -c "lsc/getuser-rename/${shard}" origin/main
# Unquoted on purpose: assumes the listed paths contain no whitespace.
jscodeshift -t rename-getuser.ts $(cat "shards/${shard}.files")
# The blank line separates the commit subject from the body.
git commit -am "[LSC] rename getUserById -> users.findById (${shard})

Automated migration. Tracking: #${TRACKING_ISSUE}
Generated by codemod rename-getuser.ts. Reply with STOP on the issue to opt this area out."
git push -u origin HEAD
# gh takes team reviewers as org/team-slug, without a leading "@".
gh pr create --fill --label automerge,lsc \
  --reviewer "org/${shard}-owners" \
  --body "Mechanical change, see #${TRACKING_ISSUE}. CI must be green before merge."
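Steps 6 and 7 of the per-shard loop reduce to a small decision on CI result and owner approval. The sketch below is one way to write it; the function name `next_state` and the argument values (`green`/`red`, `yes`/`no`) are assumptions for illustration, and the state names are the ones the status table in section 3.2 uses.

```shell
#!/usr/bin/env bash
# next_state CI APPROVED
# Map a shard's CI result and owner approval to its next rollout state.
next_state() {
  local ci="$1" approved="$2"
  case "$ci" in
    red)
      echo "quarantined" ;;            # step 7: park it, don't block other shards
    green)
      if [ "$approved" = "yes" ]; then
        echo "merged"                  # step 6: green + owner approval
      else
        echo "in-review"               # green, still waiting on the owner
      fi ;;
    *)
      echo "in-review" ;;              # CI pending or unknown: keep waiting
  esac
}
```

Keeping the decision in one pure function makes it easy to test separately from the `git` and `gh` calls around it.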
2.2 Why review still happens (and why it's fast)
Even mechanical changes get owner review, because owners know context the codemod doesn't (a file that's about to be deleted, a generated file, a special case). But review is fast: the PR is small, the pattern is repetitive, and the owner is verifying "yes, apply this here," not re-deriving correctness. Google's LSC process formalizes this — a global approver vouches for the mechanics of the change, and local OWNERS approve applying it to their code.
2.3 Opt-out and pacing
- Opt-out: owners can reply STOP to keep the bot off their area (e.g. a directory mid-rewrite). The bot records the opt-out so it shows up in the long-tail report.
- Pacing / rate limiting: don't open 600 PRs in one minute — you'll DoS CI and reviewers. Roll out in waves (e.g. 30 PRs/hour), watch the failure rate, and back off if it spikes.
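The "back off if it spikes" rule can be sketched as a function that sizes the next wave from the previous wave's results. Everything specific here is an assumption: the name `next_wave_size`, the 10% default threshold, and the halving policy are illustrative, not taken from any particular bot.

```shell
#!/usr/bin/env bash
# next_wave_size SIZE FAILED TOTAL [THRESHOLD_PCT]
# Print the size of the next wave of PRs. If more than THRESHOLD_PCT
# (hypothetical default: 10) of the previous wave's TOTAL shards FAILED,
# halve the wave, but never drop below one PR.
next_wave_size() {
  local size="$1" failed="$2" total="$3" threshold="${4:-10}"
  if [ "$total" -gt 0 ] && [ $(( failed * 100 / total )) -gt "$threshold" ]; then
    size=$(( size / 2 ))
    if [ "$size" -lt 1 ]; then size=1; fi
  fi
  echo "$size"
}
```

A driver loop would open that many PRs, sleep until the next wave (an hour, for the 30 PRs/hour example), collect the failure count, and call the function again.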
When NOT to automate the bot: for a few dozen shards, opening PRs by hand with a script is fine. The full bot (auto-merge, dashboards, opt-out) pays off in the hundreds-of-shards range.
3. Completion tracking
A migration without a live progress number stalls and rots. Make progress a metric.
3.1 The remaining-sites counter
The simplest, most honest tracker: count un-migrated call sites and watch it go to zero.
# migration-progress.sh: run in CI, emit to your metrics system
# -o prints one line per match, so wc -l counts call sites, not matching lines
total_old=$(grep -roE 'getUserById\(' src/ | wc -l | tr -d ' ')
echo "remaining_old_api_sites $total_old"   # 0 == migration complete
Plot it over time. A healthy migration trends monotonically to zero; a flat line at 12% means you've hit the long tail and need human intervention (senior file).
3.2 A status table the bot maintains
| Shard | State | PR | CI | Notes |
|---|---|---|---|---|
| payments | merged | #1204 | green | |
| search | in-review | #1205 | green | awaiting owner |
| ads | quarantined | #1206 | red | codemod hit a macro; needs manual |
| legacy-billing | opted-out | — | — | owner: STOP, dir being deleted |
The bot updates this on every state change. Three things must always be visible: % complete, which shards are red and why, and which sites are opted-out (the planned long tail).
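As a rough sketch of how those three numbers could be derived, assume the bot also keeps the table as a tab-separated file of `shard<TAB>state` lines. The file format and the function name `summarize_status` are assumptions for illustration; the "why" for red shards would come from the Notes column, which this sketch leaves out.

```shell
#!/usr/bin/env bash
# summarize_status FILE
# FILE holds "shard<TAB>state" lines (hypothetical format). Prints the
# percent of shards merged, the quarantined (red) shards, and the
# opted-out shards. Opted-out shards stay in the denominator, so the
# planned long tail keeps the percentage honest.
summarize_status() {
  awk -F'\t' '
    { total++ }
    $2 == "merged"      { merged++ }
    $2 == "quarantined" { red = red " " $1 }
    $2 == "opted-out"   { opted = opted " " $1 }
    END {
      pct = total ? int(merged * 100 / total) : 0
      printf "complete: %d%% (%d/%d shards)\n", pct, merged, total
      printf "quarantined:%s\n", red
      printf "opted-out:%s\n", opted
    }
  ' "$1"
}
```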
When NOT to over-instrument: a one-team, 30-file change doesn't need a dashboard; a grep | wc -l count in the PR description is enough. Reserve the tracking infra for migrations whose long tail you can't eyeball.
4. Idempotency and re-runs
(Junior section 6 covered the definition; here's how it shapes the operation.)
Because shards fail, get reverted, and conflict, the bot will re-run the codemod on already-migrated files. So the codemod must be idempotent and the rollout must be re-runnable end to end: re-running rollout.sh for a shard that's already merged should produce an empty diff and open no PR.
# re-runnable guard: if the codemod produces no change, skip the PR entirely
jscodeshift -t rename-getuser.ts $(cat "shards/${shard}.files")
if git diff --quiet; then
echo "shard ${shard} already migrated, nothing to do"; exit 0
fi
This guard is what makes the whole rollout safe to restart after a crash, a bad night, or a mid-flight codemod bug fix: re-run everything, and only the still-unmigrated shards produce PRs.
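The guard only helps if the codemod really is idempotent, which is worth checking before the rollout starts. A minimal sketch, assuming the codemod is a command that rewrites the file named by its last argument in place (as `jscodeshift -t rename-getuser.ts FILE` does); the helper name `is_idempotent` is hypothetical.

```shell
#!/usr/bin/env bash
# is_idempotent FILE CMD [ARGS...]
# Apply CMD twice to a scratch copy of FILE and succeed only if the
# second pass changes nothing. Returns 1 if the second pass changed the
# file, 2 if setup or the command itself failed. FILE is never modified.
is_idempotent() {
  local src="$1"; shift
  local dir work rc=0
  dir=$(mktemp -d) || return 2
  work="$dir/$(basename "$src")"      # keep the extension: parsers may key on it
  cp "$src" "$work" || rc=2
  if [ "$rc" -eq 0 ]; then
    "$@" "$work" >/dev/null || rc=2   # first pass: migrate
    cp "$work" "$dir/first-pass"
    "$@" "$work" >/dev/null || rc=2   # second pass: should be a no-op
    if [ "$rc" -eq 0 ]; then
      cmp -s "$dir/first-pass" "$work" || rc=1
    fi
  fi
  rm -rf "$dir"
  return "$rc"
}
```

For example, `is_idempotent src/users.ts jscodeshift -t rename-getuser.ts` (with a hypothetical path) could run over a sample of files from each shard as a pre-flight check.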
Quarantining files that won't transform
Some files will refuse to transform cleanly — parse errors, generated code, macros, exotic syntax. Don't let one bad file fail a shard. Skip it, record it, keep going:
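mkdir -p quarantine
while IFS= read -r f; do
  # --fail-on-error makes jscodeshift exit non-zero when the transform errors;
  # </dev/null keeps it from consuming the file list the loop is reading.
  if ! jscodeshift --fail-on-error -t rename-getuser.ts "$f" 2>>"quarantine/${shard}.log" </dev/null; then
    echo "$f" >> "quarantine/${shard}.skipped"   # handle by hand later
  fi
done < "shards/${shard}.files"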
The quarantine list is part of your long tail (senior file). It's better to migrate 98% automatically and hand-do the 2% than to let the 2% block everything.
5. Deprecation enforcement
(Junior section 8 introduced this; here's the operational sequence.)
Enforcement is what makes a migration stay done. Roll it out in three stages so you never block existing code before it's migrated:
- Warn — add a lint rule at warning level while the rollout runs. New code gets nudged; existing code isn't blocked.
- Error — once remaining-sites hits zero, flip the rule to error. Now CI rejects any reintroduction.
- Delete — remove the old API. The compiler now enforces it for free and the lint rule becomes redundant.
OpenRewrite (Java) bakes this into the recipe model — a recipe is both the migration and a re-runnable check, so you can run the same recipe in CI to catch regressions:
# rewrite.yml: a recipe that both migrates and (re-run in CI) enforces
type: specs.openrewrite.org/v1beta/recipe
name: com.example.RenameGetUserById
displayName: Migrate getUserById to users.findById
recipeList:
  - org.openrewrite.java.ChangeMethodName:
      methodPattern: 'com.example.UserRepo getUserById(..)'
      newMethodName: findById
Running this recipe across the repo performs the migration. Running it again in CI as a dry run, and failing the build if it would make any change, enforces it: any reintroduced getUserById gives the recipe something to rewrite, so its result is non-empty and CI fails.
When NOT to enforce early: flipping the lint rule to error before the rollout finishes will block teams' unrelated PRs that still contain the old API. Always warn during rollout, error only after remaining-sites is zero.
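For codebases without a recipe-style tool, the warn-then-error sequence can be approximated with a check on the lines a PR adds. This is a sketch under assumptions, not a real lint rule: the function name `old_api_gate` is hypothetical, the input is a unified diff on stdin (for example from `git diff origin/main...HEAD`), and the stage is chosen from the remaining-sites count described in 3.1.

```shell
#!/usr/bin/env bash
# old_api_gate REMAINING_SITES
# Reads a unified diff on stdin and counts added lines that call the old
# API. While REMAINING_SITES > 0 (rollout in progress) it only warns;
# once the count is zero, any reintroduction fails the check.
old_api_gate() {
  local remaining="$1" hits
  # Added lines start with "+"; "+++ " is the file header, not content.
  hits=$(awk '/^\+/ && !/^\+\+\+ / && /getUserById\(/ { n++ } END { print n + 0 }')
  if [ "$hits" -eq 0 ]; then
    return 0
  fi
  if [ "$remaining" -gt 0 ]; then
    echo "warning: $hits new use(s) of getUserById; prefer users.findById" >&2
    return 0
  fi
  echo "error: $hits new use(s) of getUserById; the old API is fully migrated" >&2
  return 1
}
```

Because it looks only at added lines, existing call sites in a touched file never block an unrelated PR, which is the property the warn stage needs.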
Next
- senior.md: cross-team coordination, OWNERS at scale, sequencing dependent migrations, atomic vs incremental, and the long tail.
- The automated safety nets sibling section (03): the test-impact and CI signal that gates every shard.
- ../../04-large-scale-refactoring/01-branch-by-abstraction/junior.md: the abstraction layer that lets old and new APIs coexist so shards become independent.