Maintainability Index — Senior Level

The middle page taught you to compute the MI, read its 0–100 scale, and wire it into a gate. This page is the cross-examination: where the formula came from, why almost none of it survives contact with a modern polyglot codebase, and what a senior reaches for instead of a single seductive scalar.

Table of Contents

Introduction
Prerequisites
The Provenance Problem — Where 171 and the Coefficients Actually Came From
The Composite-Metric Problem — A Scalar You Have to Decompose Anyway
The Halstead Foundation Is Shaky — Volume Isn't Comparable Across Tools
Why Three Tools Give Three Numbers — The Variant Zoo
LOC Domination — The ln(LOC) Term Eats the Formula
Gaming and Insensitivity — The Comment Term and the File Split
The Honest Senior Position — What MI Is Allowed to Be
Better Alternatives — Components, Hotspots, Coupling, History
Mental Models
Common Mistakes
Test Yourself
Cheat Sheet
Summary
Further Reading
Related Topics

Introduction

Focus: A rigorous critique of the Maintainability Index — why it is seductive and largely unsound — and the metrics a senior uses in its place.

By the middle level you can compute the MI by hand: take Halstead Volume V, McCabe cyclomatic complexity G, and lines of code LOC, feed them through

MI = 171 − 5.2·ln(V) − 0.23·G − 16.2·ln(LOC)

then apply Visual Studio's rescale, MI_VS = max(0, MI · 100 / 171), to map the result onto 0–100, color it green/yellow/red at 20/10, and call a file "maintainable." It feels like measurement. A single number, a familiar scale, a threshold: it has the shape of a thermometer.

The senior job is to notice that the shape is borrowed and the substance is thin. Every term in that formula is contestable, the act of collapsing three signals into one destroys the information you need to act, the Halstead input isn't even comparable between two tools on the same file, and the dominant term is just file length wearing a lab coat. None of this makes MI worthless — a trend of MI on one codebase under one pinned tool can be a mild smell detector. But it does make MI indefensible as an absolute, as a cross-project comparison, or as a target. This page walks each failure precisely, with the formula and the evidence, and then points you at the metrics that actually carry signal: the components directly, churn×complexity hotspots, change coupling, and defect history.

Prerequisites

Required: You've internalized middle.md — you can compute Halstead V, McCabe G, the original 171 formula, and the VS 0–100 rescale, and you know the standard 10/20 thresholds.
Required: A working grasp of cyclomatic & cognitive complexity — what V(G) counts and where it already misleads, because MI inherits every one of those weaknesses.
Helpful: You've maintained a legacy codebase long enough to know which files were actually painful — and noticed the pain didn't track any single static number.
Helpful: Comfort reading a regression equation: what a fitted coefficient is, what R² claims, and why "fit on a sample" is not "law of nature."

The Provenance Problem — Where 171 and the Coefficients Actually Came From

Start with the number people quote most and understand least: 171. It is not a constant of software, like π is a constant of geometry. It is a regression intercept — a leftover from curve-fitting that happened to land near 171 for one dataset in one study.

The Maintainability Index was published by Paul Oman and Jack Hagemeister at the University of Idaho around 1991–1994. Their method was straightforward empirical statistics: take a corpus of programs, have engineers rate or otherwise establish their maintainability, then run a multivariate regression to find the linear combination of cheap-to-compute metrics (Halstead Volume, cyclomatic complexity, LOC, comment ratio) that best predicted the maintainability signal. The coefficients −5.2, −0.23, −16.2, and the intercept 171 are simply the fitted parameters that minimized error on that corpus. They are the answer to "what weights make this line fit these points," nothing more.

And the corpus is the crux. The programs were a small set of systems written in C and Pascal, in the idioms of the late 1980s and early 1990s, several of them HP (Hewlett-Packard) industrial software. Procedural code. No closures, no generics, no async/await, no lambdas, no decorators, no list comprehensions, no operator overloading, no metaprogramming, no garbage-collected reference semantics, no package managers pulling in 800 transitive dependencies. The "maintainability" being predicted was human effort to maintain Pascal and C in that era, as judged in that era.

Key insight: A regression model is a description of the data it was trained on. Its coefficients carry no authority outside that distribution. Applying a model fitted on 1990s Pascal/C to 2020s TypeScript, Rust, or Kotlin is using a curve fit on one population to score a completely different one — and expecting the intercept 171 to still mean something is statistically unjustifiable.

Three consequences follow, and they are damning:

No external validity for modern languages. Nothing in the original study establishes that the same coefficients predict maintenance effort in languages with constructs that didn't exist in the training set. A list comprehension and the explicit loop it replaces have wildly different Halstead and LOC profiles; the model was never shown either, so its scoring of them is extrapolation, not measurement.
The threshold bands are even softer than the formula. The familiar "< 10 red, 10–20 yellow, > 20 green" guidance is a tool convention on the rescaled 0–100 score, not a finding of the original studies: Coleman, Oman, et al. proposed bands like 65/85 on the unscaled formula, and tools later remapped them. The cutoffs are conventions layered on conventions, not validated decision boundaries for your language and domain.
Re-derivation almost never happens. The honest move — if you insist on a composite — would be to re-fit the regression on your codebase against your maintenance signal (e.g., time-to-fix, defect counts). Essentially no team does this. They ship the 1990s coefficients verbatim, which means they are scoring modern code with a model nobody re-validated.

The provenance problem alone is enough to disqualify MI as an absolute or cross-project measure. The intercept 171 deserves precisely the skepticism you'd give any magic number with no derivation in the codebase — except here the derivation exists, in a thirty-year-old paper, on data that looks nothing like what you ship.

The Composite-Metric Problem — A Scalar You Have to Decompose Anyway

Suppose, generously, the coefficients were perfect. The MI would still be the wrong shape of metric, because it is a composite: it crushes three independent signals — Halstead Volume, cyclomatic complexity, and LOC — into one scalar. Compression to a scalar is lossy by construction, and the information it throws away is exactly the information you need to act.

Consider two files that both score MI = 42:

| | File A | File B |
|---|---|---|
| Halstead Volume `V` | very high (dense expressions, many operators) | low |
| Cyclomatic `G` | low (almost no branching) | very high (deeply nested decisions) |
| LOC | small | large |

These two files demand opposite remediations. File A is a wall of dense, operator-heavy expressions you should decompose into named steps; File B is a long, branch-tangled procedure you should flatten and split by responsibility. The MI says "42, 42" — identical — and so tells you nothing about which knob to turn. The metric that is supposed to guide maintenance is silent on the only question maintenance asks: what, specifically, is wrong here?

Key insight: Any actionable response to a low MI requires decomposing it back into its components — you must go look at V, G, and LOC separately to know what to do. But if you have to read the components to act, the composite added nothing except a lossy intermediate that can mask a problem (a great score on two terms hiding a terrible third). The aggregation step is pure loss with negative value.

This is the general pathology of single-number quality scores, and measurement theory has a rule against it: you cannot meaningfully add or average across incommensurable dimensions. Volume (a count of token information), complexity (a count of independent paths), and length (a count of lines) measure different things in different units; a weighted sum of them is a number whose units are "nothing," and whose movements are uninterpretable. When MI drops three points, you cannot say why without re-deriving the parts, so the scalar is, at best, a flashing light that says "look closer," at which point you discard it and read the components. A flashing light that you must immediately ignore to do your job is a weak instrument.

There's a sharper way to see the loss. The composite is non-invertible: from MI = 42 you cannot recover (V, G, LOC). Many different triples map to the same MI. So MI is a hash of the thing you care about, with collisions — and you would never make a decision on a hash when the preimage is sitting right there in the same tool.
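The collision can be sketched directly with the raw four-term formula quoted earlier. The two metric triples below are hypothetical values chosen for illustration, not measurements of any real file, and `volume_for_target` is a helper introduced here only to construct the collision.

```python
import math

def mi_raw(volume: float, cyclomatic: float, loc: float) -> float:
    """Original four-term MI (no comment term, no rescale)."""
    return 171 - 5.2 * math.log(volume) - 0.23 * cyclomatic - 16.2 * math.log(loc)

def volume_for_target(target_mi: float, cyclomatic: float, loc: float) -> float:
    """Solve the formula for V so that (V, G, LOC) lands exactly on target_mi."""
    exponent = (171 - 0.23 * cyclomatic - 16.2 * math.log(loc) - target_mi) / 5.2
    return math.exp(exponent)

TARGET = 72.0  # hypothetical raw score shared by both files

# File A: short and almost branch-free, so the score implies dense expressions.
file_a = (volume_for_target(TARGET, cyclomatic=2, loc=40), 2, 40)
# File B: longer and branch-heavy, so the same score implies far lower Volume.
file_b = (volume_for_target(TARGET, cyclomatic=25, loc=60), 25, 60)

for name, (v, g, loc) in (("A", file_a), ("B", file_b)):
    print(f"File {name}: V={v:.0f} G={g} LOC={loc} -> MI={mi_raw(v, g, loc):.2f}")
```

Both lines report the same MI even though the Volume values differ by roughly an order of magnitude, which is the non-invertibility in miniature: the score alone cannot say which file you are looking at.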

The Halstead Foundation Is Shaky — Volume Isn't Comparable Across Tools

Even granting that a composite is acceptable, one of its three inputs is built on sand. Halstead Volume is V = N · log₂(η), where N = N₁ + N₂ is total operator + operand occurrences and η = η₁ + η₂ is the count of distinct operators + operands. Computing V requires answering a deceptively hard question: what counts as an operator, and what counts as an operand?

Halstead never gave a language-agnostic, mechanically unambiguous definition — and there isn't one. The classification is tokenizer- and language-dependent, and every tool draws the lines differently:

Is ++ one operator or + applied twice? Is += one operator or two?
Is a function call f(x, y) — does the call () count as an operator? Do the commas? Is f an operand or an operator?
Are [], ., ->, ::, ? : operators? Most tools say yes; they disagree on how many.
Are keywords like if, return, else operators? (The common convention says yes — but it's a convention.)
Do type annotations, generics (List<Map<String, Int>>), decorators, f-string interpolations, or destructuring patterns contribute operators and operands? Each language adds constructs the original C/Pascal scheme never contemplated, and each tool improvises.
How are literals counted — is every distinct integer literal a distinct operand, or are all integers one "operand kind"?

Because the classification rules differ, η₁, η₂, N₁, N₂ differ, so V differs, so MI differs — for the exact same source file, depending only on which tool tokenized it. Halstead's metrics were criticized for precisely this from early on (Halstead's "software science" has been one of the most disputed corners of software measurement for decades), and the dispute is not academic: it means Volume is not a stable, comparable quantity across languages or even across tools for one language.

Key insight: A measurement you cannot reproduce across instruments is not a measurement; it's a tool artifact. Two MI tools disagreeing on the same file aren't both "approximately right"; they are computing different functions because they tokenize differently. This is the first of two reasons Visual Studio, radon, and SonarQube can report different MI numbers for identical code.

This propagates straight into the composite. The −5.2·ln(V) term is one of two logarithmic terms in the formula, so MI is structurally sensitive to V, and V is the least reproducible of the three inputs. You have built a "maintainability" score on top of a quantity whose value is an opinion held by a tokenizer. (The other two inputs are firmer. LOC differs across tools only in definition, such as physical vs logical lines and whether blanks and comments count, and cyclomatic G varies modestly with how each tool counts boolean operators and case labels. Halstead is the wobbliest leg of the tripod by a wide margin.)
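To make the tokenizer dependence concrete, here is a minimal sketch that applies V = N · log₂(η) to the same two-line snippet under two counting conventions. Both conventions are hypothetical and invented for illustration; neither is claimed to match any particular tool.

```python
import math

def halstead_volume(operators: list[str], operands: list[str]) -> float:
    """V = N * log2(eta): N = total occurrences, eta = distinct operators + distinct operands."""
    total = len(operators) + len(operands)
    distinct = len(set(operators)) + len(set(operands))
    return total * math.log2(distinct) if distinct else 0.0

# Source being measured (identical text for both conventions):
#     i += 1
#     total = f(i, total)

# Hypothetical convention A: `+=` is one operator, the call `()` is an operator,
# `f` is an operand, and commas are ignored.
ops_a = ["+=", "=", "()"]
operands_a = ["i", "1", "total", "f", "i", "total"]

# Hypothetical convention B: `+=` splits into `+` and `=`, the function name `f`
# is itself an operator, and the argument comma counts as an operator.
ops_b = ["+", "=", "=", "f", ","]
operands_b = ["i", "1", "total", "i", "total"]

volume_a = halstead_volume(ops_a, operands_a)
volume_b = halstead_volume(ops_b, operands_b)
print(f"convention A: V = {volume_a:.2f}")
print(f"convention B: V = {volume_b:.2f}")
```

The source text never changes, yet the two conventions disagree on N (9 versus 10 occurrences) and therefore on V. On a real file the same kind of disagreement compounds across thousands of tokens.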

Why Three Tools Give Three Numbers — The Variant Zoo

The tokenizer divergence above is compounded by a second, independent source of disagreement: there is no single MI formula. There is a family of variants, and major tools each picked a different member of the family. So when Visual Studio, radon, and SonarQube hand you three different "maintainability" numbers, two distinct effects are stacked.

The variants in the wild:

1. The original (Oman/Coleman) — no comments term:

MI = 171 − 5.2·ln(V) − 0.23·G − 16.2·ln(LOC)

2. The comment-adjusted variant — adds a term rewarding the fraction of comment lines CM (perCM = comment_lines / LOC):

MI = 171 − 5.2·ln(V) − 0.23·G − 16.2·ln(LOC) + 50·sin(√(2.4·perCM))

That sin(√(...)) term is itself a fitted curiosity, capped so that comments can lift MI by at most ~50 points. Its mere existence is a red flag (more on the gaming it enables below).

3. The Microsoft / Visual Studio rescale — takes the no-comments formula, clamps at zero, and rescales to 0–100 for a friendlier dashboard:

MI_VS = max(0, (171 − 5.2·ln(V) − 0.23·G − 16.2·ln(LOC)) · 100 / 171)

with the green/yellow/red bands at 20/10 on the 0–100 scale.

4. Tool-specific reinterpretations — radon (Python) implements the comment-adjusted formula and also rescales to 0–100 with its own A/B/C letter bands; SonarQube historically computed a maintainability/MI-style number and then moved its product toward the SQALE model (a remediation-effort ratio, surfaced as A–E ratings) instead of the classic MI — so "SonarQube's maintainability" often isn't the Oman formula at all.

Stack the two effects and the divergence is fully explained:

Key insight: Two MI tools can disagree on the same file because (a) they tokenize Halstead differently, and (b) they implement different members of the MI formula family (with/without the comment term, with/without the 0–100 rescale, with different bands — or a different model entirely). The number is not a property of your code; it is a property of your code × the tool's tokenizer × the tool's chosen formula.
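Effect (b) can be isolated by holding the inputs fixed and changing only the formula. The sketch below implements the three variants exactly as written in this section, with `per_cm` treated as a fraction in [0, 1] as defined above. The input values are hypothetical, and real tools may differ in details such as how they scale the comment percentage.

```python
import math

def mi_original(v: float, g: float, loc: float) -> float:
    """Variant 1: the four-term formula, no comment term, no rescale."""
    return 171 - 5.2 * math.log(v) - 0.23 * g - 16.2 * math.log(loc)

def mi_with_comments(v: float, g: float, loc: float, per_cm: float) -> float:
    """Variant 2: adds 50*sin(sqrt(2.4*perCM)), perCM = comment-line fraction."""
    return mi_original(v, g, loc) + 50 * math.sin(math.sqrt(2.4 * per_cm))

def mi_visual_studio(v: float, g: float, loc: float) -> float:
    """Variant 3: no comment term, clamped at zero, rescaled to 0-100."""
    return max(0.0, mi_original(v, g, loc) * 100 / 171)

# Hypothetical metrics for one file, held constant across all three variants.
V, G, LOC, PER_CM = 1000.0, 5, 50, 0.20

original = mi_original(V, G, LOC)
commented = mi_with_comments(V, G, LOC, PER_CM)
rescaled = mi_visual_studio(V, G, LOC)
print(f"original:         {original:.1f}")
print(f"comment-adjusted: {commented:.1f}")
print(f"VS rescale:       {rescaled:.1f}")
```

Same file, same Halstead counts, three different "maintainability" numbers. Add a tokenizer difference on top and the spread widens further.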

The practical corollary is strict and non-negotiable: an MI value is meaningless without naming the tool and version that produced it. "Our MI is 68" is an incomplete statement, like a temperature with no unit. And it makes any cross-project comparison that uses different tools — or even the same tool across a version bump that changed tokenization — invalid on its face.

LOC Domination — The `ln(LOC)` Term Eats the Formula

Look at the magnitudes of the coefficients on the two logarithmic terms:

−5.2·ln(V)        ← Halstead Volume
−16.2·ln(LOC)     ← lines of code

The LOC term carries roughly three times the weight of the Volume term, and the cyclomatic term (−0.23·G, linear, not logarithmic) is small until G gets large. In practice, across a real codebase, ln(V) and ln(LOC) are strongly correlated — bigger files have more tokens — so the formula's variance is dominated by file size. MI is, to a first approximation, a decreasing function of how long the file is.

A back-of-envelope check makes this vivid. Hold complexity and Volume's per-line density roughly constant and just grow the file. Each doubling of LOC subtracts 16.2·ln(2) ≈ 11.2 points from the raw score (plus a few more from the correlated ln(V) term). Going from a 50-line file to an 800-line file (four doublings) costs about 45 raw points from the LOC term alone, or roughly 26 points on the 0–100 rescale: enough to walk a file from "green" to "red" purely on length, with no change in branching or expression density at all.
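The arithmetic is easy to reproduce. This sketch computes only the contribution of the −16.2·ln(LOC) term; `loc_penalty` is a helper name introduced here.

```python
import math

LOC_COEFFICIENT = 16.2  # weight on ln(LOC) in the MI formula

def loc_penalty(loc_before: int, loc_after: int) -> float:
    """Raw MI points lost to the LOC term alone when a file grows."""
    return LOC_COEFFICIENT * math.log(loc_after / loc_before)

per_doubling = loc_penalty(50, 100)
growth_50_to_800 = loc_penalty(50, 800)
on_rescaled_scale = growth_50_to_800 * 100 / 171

print(f"one doubling:        {per_doubling:.1f} raw points")
print(f"50 -> 800 lines:     {growth_50_to_800:.1f} raw points")
print(f"same, 0-100 rescale: {on_rescaled_scale:.1f} points")
```

Because the term is logarithmic, the penalty depends only on the growth ratio: four doublings cost four times one doubling, whatever the starting size.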

Key insight: Because of the heavily-weighted ln(LOC) term, MI largely tracks file size — the very same critique leveled at cyclomatic complexity, which also correlates strongly with LOC. You have taken a complexity metric that already mostly measures length, mixed in a Volume metric that also grows with length, and weighted length the heaviest of all. MI is, to a disappointing degree, an elaborate proxy for "this file is big."

This collapses much of MI's claimed value. "Big files are riskier" is true but trivial — you can get it from wc -l, instantly, with a stable definition, in every language, with zero tokenizer ambiguity and zero magic constants. If a Halstead-plus-McCabe-plus-regression apparatus mostly reproduces what line-counting already tells you, the apparatus is not earning its complexity. And it's worse than line counting in one way: MI's size signal is entangled with the unstable Volume term, so it's a noisier proxy for size than the line count it's largely echoing.

It also means the easiest way to "improve" MI is to split files — which sometimes genuinely helps and just as often produces a fragmented codebase where related logic is scattered across many small files to game the per-file score, harming real maintainability while the dashboard turns green. The metric rewards a structural change that may have nothing to do with whether the code is easier to maintain.

Gaming and Insensitivity — The Comment Term and the File Split

A metric used as a target gets optimized — and per Goodhart's Law, "when a measure becomes a target, it ceases to be a good measure." MI is unusually easy to game, in two directions, and the comment-adjusted variant is the most embarrassing example in all of code metrics.

Padding comments to raise MI. Recall the comment term: + 50·sin(√(2.4·perCM)), where perCM is the comment-line fraction. Adding comment lines mechanically increases MI, up to ~50 points, regardless of whether the comments say anything true or useful. The literal exploit:

# This function does the thing.
# It does the thing well.
# The thing is done here.
# See above for what the thing is.
# TODO: nothing, the thing is fine.
def do_the_thing(x):
    return x  # return the thing

You can take a genuinely awful function and lift its MI into "maintainable" territory by burying it in vacuous or even misleading comments, and misleading comments make maintenance harder, the exact opposite of what the score now claims. A metric that rewards comment quantity while being blind to comment quality and accuracy is not measuring maintainability; it's measuring typing. (Visual Studio's formula omits the comment term; radon and others kept it.)

Splitting to dodge the size penalty. Because the ln(LOC) term punishes length per-file, you can raise every file's MI by chopping one cohesive 600-line module into six 100-line files. Sometimes that's a real improvement. Often it isn't — you get six files that must be read together to understand anything, with the cohesion that justified keeping them together now invisible to a per-file metric. The dashboard improves; the codebase gets harder to navigate. The metric is measuring file boundaries, not maintainability.
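Both moves can be quantified with the formulas already given. This is a sketch under stated assumptions: the metric values are hypothetical, the split assumes Volume and branching divide evenly across the six pieces (real Volume does not divide this cleanly), and it ignores whether a given tool counts comment lines toward LOC.

```python
import math

def mi_raw(v: float, g: float, loc: float) -> float:
    """Four-term MI without the comment term or the rescale."""
    return 171 - 5.2 * math.log(v) - 0.23 * g - 16.2 * math.log(loc)

def comment_bonus(per_cm: float) -> float:
    """Points added by the comment term for a comment-line fraction in [0, 1]."""
    return 50 * math.sin(math.sqrt(2.4 * per_cm))

# Move 1: pad comments. The code is untouched; only the comment fraction moves.
bonuses = [comment_bonus(fraction) for fraction in (0.0, 0.1, 0.3, 0.5)]
for fraction, bonus in zip((0.0, 0.1, 0.3, 0.5), bonuses):
    print(f"perCM={fraction:.1f} -> bonus {bonus:+.1f}")

# Move 2: split one 600-line module into six 100-line files.
# Assumption: Volume and cyclomatic complexity divide evenly across the pieces.
whole = mi_raw(v=12_000, g=60, loc=600)
piece = mi_raw(v=2_000, g=10, loc=100)
print(f"whole module: {whole:.1f}   each of six pieces: {piece:.1f}")
```

Neither move changes a single executable statement, yet both shift the score by tens of points.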

The mirror image of gamability is insensitivity — changes that hurt maintainability a lot while barely moving MI:

Bad names. Renaming everything to a, b, tmp, data2 devastates readability — the single biggest real driver of maintenance cost — and MI doesn't move at all. Halstead counts occurrences and distinct tokens; it is utterly blind to whether a name communicates intent.
Deep coupling. A function with simple internals but tendrils into fifteen other modules (high efferent coupling, hidden temporal dependencies) is a maintenance nightmare, and its MI can be excellent — MI sees only within-file metrics and is blind to the dependency graph entirely (that's coupling & cohesion's domain).
Spooky action at a distance. Global mutable state, order-dependent initialization, implicit side effects — none register in V, G, or LOC.
Concurrency hazards. A data race or a subtle lock-ordering bug is invisible to every term of the formula.

Key insight: MI is simultaneously too easy to game (comment padding, file splitting) and too insensitive to what actually drives maintenance cost (naming, coupling, hidden state, concurrency). The things you can do to move the number are mostly not the things that make code maintainable, and the things that make code unmaintainable mostly don't move the number. That double failure is fatal for any use of MI as a target.

The Honest Senior Position — What MI Is Allowed to Be

Tear all of that down and a small, defensible residue remains. The senior position on MI is neither "it's astrology" nor "it's a quality score." It is precise about the one shape of use that survives the critique:

MI may be used as a coarse trend on a single codebase over time, under one pinned tool and version — and never as anything else. Specifically:

✅ Trend, same codebase, pinned tool. If radon's MI on your service has slid from 70 to 55 over six months with the tokenizer and formula held constant, that's a mild signal worth a look — the direction and relative movement can hint at accreting cruft. Even here, treat it as a smoke alarm, not a verdict, and always confirm by reading the components and the code.
❌ Never an absolute. "This file scores 48" means nothing on its own. There is no validated mapping from an MI value to a real maintainability outcome for your language and domain; the bands are 1990s conventions.
❌ Never cross-project. Different codebases, languages, or tools tokenize and formulate differently. Comparing your team's 62 to another team's 71 is comparing two different functions evaluated on two different populations. Meaningless.
❌ Never a target or a gate that fails a build. The moment MI becomes a target, it gets gamed (comment padding, file splitting) and stops measuring anything. A hard MI gate trains your team to produce green dashboards, not maintainable code.
❌ Never a substitute for reading the code or the components. If MI flags a file, the only legitimate next step is to look at V, G, LOC separately and then open the file. At that point MI has done its entire job — pointing — and you proceed on the underlying signals.

Key insight: The strongest honest claim for MI is "a noisy, tool-specific trend on one codebase that occasionally points at a file worth reading." Everything beyond that — absolute scores, cross-team leaderboards, build gates, grades — is misuse that the metric's own provenance, composite structure, and tokenizer-dependence cannot support.

There is a deeper meta-lesson here, and it's the real senior takeaway: resist the demand for a single quality number. Management, dashboards, and our own desire for tidiness all pull toward "give me one score." MI is what you get when you give in: a borrowed thermometer shape wrapped around incommensurable, unstable, gameable inputs. The mature answer to "what's our code quality score?" is "quality isn't a scalar — here are the specific risks, located in specific files, with specific remediations," which is exactly what the alternatives below provide.

Better Alternatives — Components, Hotspots, Coupling, History

If MI is at best a weak pointer, what does a senior actually instrument? Four families, in roughly increasing order of signal-to-noise. The throughline: keep the dimensions separate, and prefer history over a static snapshot.

1. Look at the components directly — never the composite. The honest version of "MI" is just its inputs, unaggregated: cyclomatic/cognitive complexity, function/file length, and (if you must) Halstead Volume, each reported on its own axis with its own threshold. A file that's long and branchy and dense is genuinely suspicious; a file that's long but flat is just long. Keeping the axes separate preserves the actionability the composite destroys — you can see which dimension is bad and therefore what to do. Prefer cognitive complexity over cyclomatic where available: it penalizes the nesting and control-flow tangling humans actually struggle with, rather than just counting paths, so it correlates better with real comprehension difficulty.
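One way to keep the axes separate is to report which dimensions exceed their own limit instead of summing them. A minimal sketch follows; the file names, metric values, and limits are all hypothetical and would need tuning per codebase.

```python
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)
class FileMetrics:
    path: str
    cognitive_complexity: int
    loc: int
    halstead_volume: float

# Hypothetical per-axis limits; each axis keeps its own threshold.
LIMITS: Mapping[str, float] = {
    "cognitive_complexity": 15,
    "loc": 400,
    "halstead_volume": 8000.0,
}

def exceeded_axes(metrics: FileMetrics, limits: Mapping[str, float] = LIMITS) -> list[str]:
    """Return the names of the dimensions over their limit, never a combined score."""
    return [axis for axis, limit in limits.items() if getattr(metrics, axis) > limit]

long_but_flat = FileMetrics("reports/export.py", cognitive_complexity=4, loc=900, halstead_volume=6500.0)
short_but_tangled = FileMetrics("billing/rules.py", cognitive_complexity=38, loc=120, halstead_volume=3000.0)

for metrics in (long_but_flat, short_but_tangled):
    print(metrics.path, "->", exceeded_axes(metrics) or "within limits")
```

The output names the dimension that is off, so the remediation is visible immediately: split the long file, flatten the tangled one.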

2. Churn × complexity — the hotspot. This is the highest-leverage single idea in the whole roadmap, and it comes from Adam Tornhill (Your Code as a Crime Scene, Software Design X-Rays). Complexity alone is inert: a gnarly file nobody ever touches costs you nothing. The risk is complexity you keep changing. Multiply a file's complexity (or just its size) by its churn — how often it changes in version control — and the product surfaces the files that are both hard to understand and under constant modification. Those are where bugs cluster and where refactoring pays off. Crucially, this is behavioral: it uses your real Git history, so it's grounded in what your team actually does, not in a 1990s regression.

# crude hotspot input: per-file change frequency over the last year, from Git
git log --format=format: --name-only --since='12 months ago' \
  | grep -E '\.(go|ts|py|java)$' | sort | uniq -c | sort -rn | head -20
# → files with the most commits; cross-reference the high-churn names
#   against the largest/most-complex files. The intersection is your hotspot set.

(Tools like CodeScene productize this with proper complexity trends and visualizations; the one-liner is enough to find your worst offenders today.)
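Once churn counts and a per-file complexity (or size) figure are in hand, the ranking itself is a few lines. This sketch assumes both have already been collected into dictionaries keyed by path; the function name and sample data are hypothetical.

```python
def rank_hotspots(
    churn: dict[str, int],
    complexity: dict[str, float],
    top: int = 10,
) -> list[tuple[str, float]]:
    """Score each file as commits x complexity and return the highest scores first."""
    shared_paths = churn.keys() & complexity.keys()
    scores = {path: churn[path] * complexity[path] for path in shared_paths}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]

# Hypothetical inputs: commit counts from version control, complexity from a linter.
churn = {"billing/rules.py": 42, "reports/export.py": 3, "util/strings.py": 30}
complexity = {"billing/rules.py": 38, "reports/export.py": 60, "util/strings.py": 2}

ranked = rank_hotspots(churn, complexity)
for path, score in ranked:
    print(f"{score:8.0f}  {path}")
```

Note how the most complex file (`reports/export.py`) and one of the most frequently changed (`util/strings.py`) both rank below the file that is high on both axes.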

3. Change coupling (temporal coupling). Files that change together across commits are coupled in a way no static metric can see — they share a hidden dependency through behavior, not through imports. If pricing.ts and invoice_pdf.ts are edited in the same commit 80% of the time, they're effectively one module with a fragile, undocumented seam, even if neither imports the other. Change coupling mined from history exposes exactly the spooky action at a distance that MI is blind to, and it predicts where a change in one place will silently break another.
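A minimal sketch of mining change coupling, assuming the commit history has already been parsed into one set of file paths per commit. The coupling ratio used here (shared commits divided by commits touching either file) is one reasonable definition among several, and the commit data is hypothetical.

```python
from collections import Counter
from itertools import combinations

def change_coupling(
    commits: list[set[str]],
    min_shared: int = 2,
) -> dict[tuple[str, str], float]:
    """For each file pair: shared commits / commits touching either file."""
    touched: Counter = Counter()
    together: Counter = Counter()
    for files in commits:
        touched.update(files)
        together.update(combinations(sorted(files), 2))
    coupling = {}
    for (first, second), shared in together.items():
        if shared >= min_shared:  # ignore pairs that co-changed only by chance
            either = touched[first] + touched[second] - shared
            coupling[(first, second)] = shared / either
    return coupling

# Hypothetical history: each set is the files changed in one commit.
commits = [
    {"pricing.ts", "invoice_pdf.ts"},
    {"pricing.ts", "invoice_pdf.ts"},
    {"pricing.ts", "invoice_pdf.ts", "README.md"},
    {"pricing.ts", "invoice_pdf.ts"},
    {"pricing.ts"},
    {"README.md"},
]

coupling = change_coupling(commits)
for pair, ratio in coupling.items():
    print(pair, f"{ratio:.0%}")
```

The `min_shared` floor filters out incidental pairs such as a README edited once alongside code; what remains are the seams worth investigating.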

4. Defect history. The most direct maintainability signal is where the bugs actually are. Join your bug tracker / incident records to the files touched by their fixes, and you get an empirical map of fragility: no proxy, no regression, no tokenizer. This connects to one of the strongest results in the literature: Nagappan & Ball (Microsoft Research) showed that relative code churn measures are highly predictive of defect density, where absolute churn counts are poor predictors. What you change and what has broken before beat what the code looks like right now. That finding is the empirical case for preferring #2–#4 over MI.

Key insight: Every good alternative shares two properties MI lacks. First, it keeps dimensions separate, so the number is actionable — it tells you what to do, not just that something is off. Second, it uses history (churn, change coupling, defect data), which the literature shows predicts maintenance pain far better than any static-snapshot composite. MI is the opposite on both counts: aggregated and static. That's why a senior reaches past it.

A complete picture, then, layers these: keep the components on separate axes for a quick static read, rank files by churn×complexity to find where to look, use change coupling to find hidden seams, and ground the whole thing in defect history to confirm what's actually fragile. None of these collapses to a single seductive scalar — and that refusal is the point. (For what to do once you've located the risk, that's Technical Debt Management and Refactoring; this roadmap only locates it.)

Mental Models

171 is a regression intercept, not a constant of nature. It's the leftover from curve-fitting on a small 1990s C/Pascal corpus. Treat every coefficient in the MI formula the way you'd treat any magic number with no derivation in your codebase — except this one's derivation is on data that looks nothing like what you ship.
A composite is a lossy hash of the thing you care about. From MI = 42 you cannot recover (V, G, LOC) — many triples collapse to it. Since you must read the components to act anyway, the aggregation adds nothing but the chance to mask one bad dimension behind two good ones.
Halstead Volume is an opinion held by a tokenizer. "What's an operator?" has no language-agnostic answer, so V — and therefore MI — changes with the tool, not just the code. A number you can't reproduce across instruments isn't a measurement.
MI is mostly wc -l in a lab coat. The −16.2·ln(LOC) term outweighs the others and correlates with the Volume term; MI's variance is dominated by file size. If a Halstead+McCabe+regression machine mostly reproduces line-counting, it isn't earning its complexity.
What you can move isn't what matters; what matters doesn't move it. You can game MI with comment padding and file splits (which often don't help); you can wreck maintainability with bad names, coupling, and hidden state (which don't register). That double failure is why MI can't be a target.
Quality is not a scalar. The honest answer to "what's our quality score?" is a located, dimensioned one: specific risks, in specific files, with specific fixes — found via components, hotspots, change coupling, and defect history, not collapsed into one borrowed number.

Common Mistakes

Quoting an MI value without naming the tool and version. "Our MI is 68" is like a temperature with no unit. Different tools tokenize Halstead differently and implement different formula variants (with/without the comment term, with/without the 0–100 rescale, or a different model entirely). The number is code × tokenizer × formula, not a property of the code.
Comparing MI across projects, languages, or teams. Invalid on its face — you're comparing different functions evaluated on different populations, scored by a model fit on neither. MI cross-project leaderboards measure tool and language differences, not relative quality.
Making MI a build gate. The moment it's a target it gets gamed (Goodhart). Teams pad comments and split files to turn the dashboard green while the code gets worse. A failing MI gate trains green dashboards, not maintainability.
Treating a low MI as a diagnosis. It isn't — it's at most a pointer. A low score doesn't tell you why; you have to decompose back into V, G, LOC and read the file. If you're going to read the components anyway, you didn't need the composite.
Trusting the comment-adjusted variant. The +50·sin(√(2.4·perCM)) term rewards comment quantity, blind to quality and truth. You can lift an awful function into "maintainable" with vacuous or misleading comments, making maintenance harder while the score claims the opposite. (Visual Studio's formula omits this term.)
Believing MI when names are bad or coupling is high. MI is blind to identifier quality (the biggest real driver of comprehension cost) and to the entire dependency graph. A file with terrible names and fifteen hidden dependencies can score beautifully.
Refitting nothing, then claiming rigor. If you insist on a composite, the only defensible move is to re-fit the regression on your code against your maintenance signal (time-to-fix, defects). Shipping the 1990s coefficients verbatim and calling it a measurement is borrowing someone else's model for a population it never saw.
Reaching for MI when a hotspot would do. The question "where should I spend refactoring effort?" is answered far better by churn×complexity and defect history than by any static composite — and those use your history, which the literature shows predicts maintenance pain better than any snapshot.

Test Yourself

1. Where do the numbers 171, −5.2, −0.23, and −16.2 in the MI formula come from, and why does that origin undermine using MI on a modern TypeScript codebase?
2. Two files both score MI = 42. Explain why that tells you almost nothing actionable, and what you must do next to actually fix either one.
3. Visual Studio, radon, and SonarQube report three different MI numbers for the same file. Give the two independent reasons this happens.
4. Why is it fair to say MI "mostly tracks file size"? Reference the specific terms and their coefficients.
5. Describe two concrete ways to game MI upward without improving (or while actively harming) maintainability.
6. Name two changes that badly hurt maintainability but barely move MI, and explain why the formula is blind to them.
7. State the one defensible use of MI, with all its qualifications.
8. A teammate wants to add an MI gate to CI. Propose what to instrument instead, and justify it from the empirical literature.

Answers

1. They are the **fitted parameters of a multivariate regression** (Oman & Hagemeister, ~1991–94) that best predicted maintainability on a **small corpus of 1990s C and Pascal programs** (several HP systems). `171` is the intercept. A regression describes only its training distribution; those coefficients carry no authority over languages with constructs (closures, generics, async, decorators, comprehensions) absent from the corpus. Scoring modern TypeScript with them is extrapolation, and `171` retains no meaning outside the original data.
2. MI is a **composite** that collapses `V`, `G`, `LOC` into one scalar; it's non-invertible, so `42` could be high-Volume/low-complexity (a dense expression wall: decompose into named steps) or low-Volume/high-complexity (a branch-tangled procedure: flatten and split). The score is identical while the fixes are opposite. To act you must **decompose back into `V`, `G`, `LOC`** and read the file, at which point the composite added nothing.
3. (a) **Different tokenizers**: "what counts as an operator/operand" has no language-agnostic definition, so each tool computes a different Halstead `V`, hence different MI, for identical code. (b) **Different formula variants**: they implement different members of the MI family (with/without the `+50·sin(√…)` comment term, with/without the 0–100 rescale and bands, or, like SonarQube's SQALE-based ratings, a different model altogether).
4. The two logarithmic terms are `−5.2·ln(V)` and `−16.2·ln(LOC)`; the LOC coefficient is ~**3×** the Volume one, and across real code `ln(V)` and `ln(LOC)` are strongly correlated (bigger files = more tokens). So MI's variance is **dominated by file length**: each doubling of LOC subtracts ~11 points from the LOC term alone. MI is largely an elaborate, noisier proxy for `wc -l`.
5. (a) **Pad comments**: in the comment-adjusted variant, `+50·sin(√(2.4·perCM))` raises MI purely by adding comment lines, even vacuous or misleading ones (which make maintenance *harder*). (b) **Split files**: the `ln(LOC)` penalty is per-file, so chopping one cohesive module into many small files raises each file's MI while scattering related logic and hiding cohesion.
6. Examples: **bad names** (rename everything to `a`, `b`, `tmp`) devastate readability, but Halstead counts only token occurrences/distinct counts, blind to whether a name communicates intent; **high coupling / hidden state** (tendrils into many modules, global mutable state, order-dependent init) goes unseen because MI uses only *within-file* metrics and never sees the dependency graph or side effects. Also valid: concurrency hazards (races, lock ordering) register in no term.
7. MI is defensible *only* as a **coarse trend on a single codebase over time, under one pinned tool and version**: a mild smoke alarm worth a look, always confirmed by reading the components and the code. Never an absolute, never cross-project/cross-tool, never a target or build gate, never a substitute for reading the code or its components.
8. Instrument the **components separately** (cognitive complexity, length, optionally Volume, each on its own axis), rank files by **churn × complexity** to find hotspots, mine **change coupling** for hidden seams, and join **defect history** to find real fragility. Justification: **Nagappan & Ball** showed relative *churn* predicts defect density better than static size/complexity snapshots, and Tornhill's hotspot work shows complexity *you keep changing* is where risk concentrates. So history-based, dimension-separated signals beat a static composite, and none of them is gameable into a green build the way an MI gate is.
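
Answers 2, 4, and 5(a) can be checked with a few lines of arithmetic. The sketch below evaluates the original three-term formula and the comment term exactly as written on this page, taking `perCM` as the fraction `comment_lines/LOC`. The two "files" are hypothetical inputs chosen for illustration, not measurements from any tool.

```python
import math


def mi_original(volume: float, complexity: float, loc: float) -> float:
    """Original three-term MI as quoted on this page (unscaled, can go negative)."""
    return 171 - 5.2 * math.log(volume) - 0.23 * complexity - 16.2 * math.log(loc)


def comment_bonus(comment_fraction: float) -> float:
    """Comment term of the comment-adjusted variant; input is comment_lines / LOC."""
    return 50 * math.sin(math.sqrt(2.4 * comment_fraction))


# Answer 2: two hypothetical files with the same score and opposite problems.
dense = dict(volume=8000.0, complexity=5, loc=100)
# Choose the Volume that exactly offsets 35 extra decision points at equal length.
tangled_volume = dense["volume"] * math.exp(-0.23 * (40 - 5) / 5.2)
tangled = dict(volume=tangled_volume, complexity=40, loc=100)

# Answer 4: doubling LOC with V and G held fixed costs 16.2 * ln(2) points.
doubling_cost = mi_original(8000.0, 5, 100) - mi_original(8000.0, 5, 200)

# Answer 5(a): raising the comment fraction alone raises the adjusted score.
padding_gain = comment_bonus(0.40) - comment_bonus(0.10)

if __name__ == "__main__":
    print(f"dense   MI = {mi_original(**dense):.2f}")
    print(f"tangled MI = {mi_original(**tangled):.2f}  (V = {tangled_volume:.0f})")
    print(f"doubling LOC costs {doubling_cost:.1f} points")
    print(f"10% -> 40% comments adds {padding_gain:.1f} points")
```

The doubling cost comes out at about 11.2 points. Holding Volume fixed while doubling LOC is itself a simplification: in real code Volume grows with length, so the observed drop would be larger.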

Cheat Sheet¶

THE FORMULA(S)  — there is no single one
  original:   MI = 171 − 5.2·ln(V) − 0.23·G − 16.2·ln(LOC)
  +comments:  ... + 50·sin(√(2.4·perCM))         perCM = comment_lines/LOC
  VS rescale: MI = max(0, original · 100/171)     bands: ≥20 green, 10–19 yellow, <10 red
  → quoting an MI without the TOOL+VERSION is meaningless

WHY IT'S WEAK
  171 & coeffs   = regression fit on 1990s C/Pascal (HP) corpus — not laws, don't generalize
  composite      = collapses V,G,LOC → one scalar; non-invertible; can't tell you WHY
  Halstead V     = tokenizer-dependent ("what's an operator?") → not comparable across tools
  ln(LOC) term   = weight 16.2 ≫ others → MI mostly tracks FILE SIZE (≈ wc -l)
  gameable       = pad comments ↑MI; split files ↑MI — often WITHOUT real improvement
  insensitive    = blind to names, coupling, hidden state, concurrency

WHY 3 TOOLS DISAGREE (two independent causes)
  1) different tokenizers → different Halstead V
  2) different formula variant (comment term? rescale? SQALE instead?)

ONLY DEFENSIBLE USE
  ✅ coarse TREND on ONE codebase, ONE pinned tool/version → "go look"
  ❌ absolute  ❌ cross-project  ❌ target/gate  ❌ replacement for reading code/components

USE INSTEAD (separate dimensions; prefer history over snapshot)
  components directly   cognitive complexity + length (+ Volume) on separate axes
  churn × complexity    hotspots — complexity you KEEP CHANGING (Tornhill)
  change coupling       files that change together = hidden seam
  defect history        where bugs actually are; Nagappan&Ball: churn > static size
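
The "change coupling" row can be sketched in a few lines, assuming the commit history has already been reduced to one list of touched file paths per commit. The `change_coupling` helper, the `min_shared` threshold, and the sample history are all hypothetical; real tooling usually also normalizes by how often each file changes on its own.

```python
from collections import Counter
from itertools import combinations


def change_coupling(commits, min_shared=2):
    """Count how often each pair of files appears in the same commit."""
    pair_counts = Counter()
    for files in commits:
        # set() removes duplicates and sorted() fixes the order, so each
        # unordered pair is counted once per commit under a single key.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    return [(pair, n) for pair, n in pair_counts.most_common() if n >= min_shared]


# Hypothetical history: each inner list is the set of files one commit touched.
SAMPLE_COMMITS = [
    ["orders/cart.py", "billing/tax.py"],
    ["orders/cart.py", "billing/tax.py", "README.md"],
    ["orders/cart.py", "billing/tax.py"],
    ["api/routes.py"],
    ["api/routes.py", "README.md"],
]

if __name__ == "__main__":
    for (left, right), shared in change_coupling(SAMPLE_COMMITS):
        print(f"{shared} shared commits: {left} <-> {right}")
```

In this invented history the only pair that clears the threshold is `billing/tax.py` with `orders/cart.py`: two files in different packages that keep changing together, which is the kind of hidden seam no within-file metric can see.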

Summary¶

The provenance is fatal to absolute/cross-project use. 171 and the coefficients are regression parameters fit on a small set of 1990s C/Pascal (HP) programs. A model describes only its training distribution; nothing licenses applying those weights to modern languages, and the threshold bands are conventions on conventions.
The composite shape destroys actionability. Collapsing Volume, complexity, and LOC into one non-invertible scalar means a low MI never tells you why. You must decompose back into the components to act — so the aggregation is pure loss that can also mask a single bad dimension.
The Halstead leg is unstable. "What counts as an operator/operand" is tokenizer- and language-dependent, so Volume — and therefore MI — differs across tools for the same file. A number you can't reproduce across instruments isn't a measurement.
Three tools, three numbers, two causes. Divergence comes from different tokenizers and different formula variants (with/without the comment term, with/without the 0–100 rescale, or SQALE instead). An MI value is meaningless without its tool and version.
It's mostly an elaborate file-size proxy. The −16.2·ln(LOC) term outweighs the rest and correlates with Volume, so MI largely tracks length — what wc -l gives you for free, with no magic constants and no tokenizer ambiguity.
It's both gameable and insensitive. Comment padding and file splitting move the number without helping (Goodhart); bad names, coupling, hidden state, and concurrency hazards wreck maintainability without moving it.
The honest position: MI is at best a coarse trend on one codebase under one pinned tool — never an absolute, never cross-project, never a target. Reach instead for the components directly, churn × complexity hotspots, change coupling, and defect history — all of which keep dimensions separate and lean on history, which the literature (Nagappan & Ball; Tornhill) shows predicts maintenance pain far better than any static composite.

You now hold the cross-examination cold: you can say exactly why MI is seductive, exactly where each term fails, and exactly what to instrument instead. The next layer — professional.md — is about operating that conviction across an organization: how to talk leadership out of an MI leaderboard, how to stand up hotspot-based prioritization, and how to make "quality isn't a scalar" an actual policy.