Data Clumps — Specification¶
Audience: engineers who want a precise, falsifiable definition of "data clump" — one a tool can check, one a reviewer can defend.
This document fixes the vocabulary, the metric, and the invariants that any Value Object (VO) extracted from a clump must satisfy. Where Fowler (Refactoring, 2nd ed., 2018, ch. 3) gives an informal definition, we tighten it into something measurable.
1. Formal Definition¶
A data clump in a Java codebase is an ordered tuple of parameter or field types T = (t_1, t_2, ..., t_n) with n >= 3 such that:
1. Recurrence: `T` (modulo argument order and synonymous renames) appears in the signature of at least 3 distinct methods or field-declaration groups in the same module.
2. Cohesion: the tuple represents one concept in the domain language; replacing any `t_i` independently is not meaningful.
3. Co-variation: when one element changes, the others tend to change with it. Tools approximate this by co-edit frequency (Git churn correlation > 0.6 over the last 200 commits on the file).
4. Primitive bias: at least two of the `t_i` are primitive types, boxed primitives, `String`, `BigDecimal`, `LocalDate`/`LocalDateTime`, `UUID`, or other "stringly-typed" carriers.
Conditions (1) and (4) are mechanical and check-friendly. Conditions (2) and (3) require domain knowledge and are confirmed by a human reviewer; tools surface candidates, humans confirm.
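Condition (4) is the simplest to mechanize. The following is a minimal sketch, assuming the candidate tuple is available as a list of `Class<?>` objects; `PrimitiveBias`, `CARRIERS`, and `holds` are hypothetical names, and the carrier set only mirrors the types listed in condition (4).

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical helper: checks condition (4), "primitive bias", for a candidate tuple.
final class PrimitiveBias {
    // Non-primitive carriers named in condition (4); extend this set per codebase.
    private static final Set<Class<?>> CARRIERS = Set.of(
            String.class, BigDecimal.class, LocalDate.class, LocalDateTime.class, UUID.class,
            Boolean.class, Byte.class, Short.class, Character.class,
            Integer.class, Long.class, Float.class, Double.class);

    private PrimitiveBias() {}

    static boolean isCarrier(Class<?> type) {
        return type.isPrimitive() || CARRIERS.contains(type);
    }

    // True when at least two element types are primitives or listed carriers.
    static boolean holds(List<Class<?>> tuple) {
        return tuple.stream().filter(PrimitiveBias::isCarrier).count() >= 2;
    }
}
```

For example, `PrimitiveBias.holds(List.of(String.class, String.class, int.class))` is true, while a tuple made only of domain types is rejected. A real tool would likely read types from an AST rather than loaded classes.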
2. The Metric — Parameter Repetition Count (PRC)¶
We define PRC(T) for a candidate tuple T as:
PRC(T) = | { m in M : the parameter types of signature(m) include every element of T, in any order, adjacent or scattered } |
where M is the set of all methods in the analyzed scope (typically a module or package root).
Decision rule:
| PRC(T) | Verdict |
|---|---|
| 1 | Acceptable. Single use is not a clump. |
| 2 | Watch. Add a `// TODO: candidate VO` comment. |
| 3+ | Refactor. Extract a VO. ArchUnit rule should fail. |
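Three is not arbitrary. It follows the "Rule of Three" that Fowler relays from Don Roberts (Refactoring, ch. 2): "The first time you do something, you just do it. The second time you do something similar, you wince at the duplication, but you do the duplicate thing anyway. The third time you do something similar, you refactor."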
2.1 Worked computation¶
Given the methods below:
```java
void charge(BigDecimal amount, Currency currency, Account account);
void refund(BigDecimal amount, Currency currency, Account account);
Receipt invoice(Customer c, BigDecimal amount, Currency currency);
```
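Candidate tuple `T = (BigDecimal, Currency)`. All three signatures contain both types, so `PRC(T) = 3`. Verdict: refactor. The extracted VO is `Money`, subject to the invariants in Section 3. Strictly, this pair has n = 2 and falls below the n >= 3 threshold of Section 1; it is used here to illustrate the count.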
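The count can be reproduced with reflection. This is a sketch under stated assumptions, not a production detector: `PrcCounter`, `Billing`, and the empty `Account`, `Customer`, and `Receipt` interfaces are hypothetical stand-ins, and matching is by exact parameter type only (no synonym or subtype handling).

```java
import java.lang.reflect.Method;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Currency;
import java.util.List;

final class PrcCounter {
    // Placeholder domain types standing in for those in the worked example.
    interface Account {}
    interface Customer {}
    interface Receipt {}

    // The three signatures from the worked computation.
    interface Billing {
        void charge(BigDecimal amount, Currency currency, Account account);
        void refund(BigDecimal amount, Currency currency, Account account);
        Receipt invoice(Customer c, BigDecimal amount, Currency currency);
    }

    private PrcCounter() {}

    // Counts methods whose parameter types include every tuple element,
    // in any order and not necessarily adjacent (multiset containment).
    static int prc(List<Class<?>> tuple, Class<?>... scope) {
        int count = 0;
        for (Class<?> owner : scope) {
            for (Method m : owner.getDeclaredMethods()) {
                List<Class<?>> remaining = new ArrayList<>(Arrays.asList(m.getParameterTypes()));
                boolean containsAll = true;
                for (Class<?> t : tuple) {
                    containsAll &= remaining.remove(t); // consumes one occurrence per tuple element
                }
                if (containsAll) count++;
            }
        }
        return count;
    }

    // Maps a PRC value onto the decision table.
    static String verdict(int prc) {
        if (prc >= 3) return "Refactor";
        return prc == 2 ? "Watch" : "Acceptable";
    }

    public static void main(String[] args) {
        int prc = prc(List.of(BigDecimal.class, Currency.class), Billing.class);
        System.out.println("PRC = " + prc + ", verdict: " + verdict(prc)); // PRC = 3, verdict: Refactor
    }
}
```

Removing each matched type from a working copy of the parameter list keeps a tuple such as `(String, String)` from matching a method that has only one `String` parameter.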
3. Value Object Invariants¶
A type V qualifies as a VO if and only if it satisfies all five invariants below. Tools can check 3.1 to 3.3 mechanically and 3.4 partially; 3.5 needs a human reviewer.
3.1 Immutability (mechanically checkable)¶
- All fields `final`.
- No setters, no `void mutate*` methods.
- Defensive copies on input and output of any mutable reference (collections, dates pre-Java 8, arrays).
- For records (JEP 395, finalized in Java 16, 2021), the language makes every component field `final`, so shallow immutability is enforced; mutable components still need the defensive copies above.
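The defensive-copy bullet deserves a concrete case, because a `final` field alone does not stop a caller from mutating a list it still holds. A minimal sketch, with `Tags` as a hypothetical VO:

```java
import java.util.List;

// Hypothetical VO with a collection component.
record Tags(List<String> values) {
    Tags {
        // List.copyOf rejects a null list and null elements and returns an
        // unmodifiable copy, so later changes to the caller's list cannot leak in.
        values = List.copyOf(values);
    }
}
```

Because the stored list is unmodifiable, the generated accessor can return it directly without a second copy on output. Arrays have no equivalent wrapper and would need a `clone()` in both the constructor and a hand-written accessor.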
3.2 Equality by value (mechanically checkable)¶
- `equals(Object)` overridden; uses every field deemed identity-defining.
- `hashCode()` overridden consistently with `equals`.
- For records, both methods are generated; do not override unless you have a precise reason (e.g., normalized representation).
```java
public record Money(BigDecimal amount, Currency currency) {
    // equals: compares amount AND currency by value.
    // hashCode: derived from both fields.
    // Both generated by the compiler. Do not override.
}
```
3.3 Self-validation (mechanically checkable)¶
- Every constructor or compact-constructor body rejects invalid input.
- No "valid-when-the-caller-is-careful" types. The type is its own guard.
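```java
import java.time.LocalDate;
import java.util.Objects;

public record DateRange(LocalDate start, LocalDate end) {
    // Compact constructor: rejects nulls and inverted ranges before the fields are assigned.
    public DateRange {
        Objects.requireNonNull(start, "start");
        Objects.requireNonNull(end, "end");
        if (end.isBefore(start)) {
            throw new IllegalArgumentException("end " + end + " is before start " + start);
        }
    }
}
```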
3.4 Side-effect-free behavior (semi-checkable)¶
- All instance methods are pure: same input -> same output, no I/O, no mutation.
- A "modifier" returns a new instance: `money.add(other)` returns a new `Money`.
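A sketch of such a modifier follows. The currency-mismatch check is an assumption about what `add` should reject; the source does not specify it.

```java
import java.math.BigDecimal;
import java.util.Currency;
import java.util.Objects;

record Money(BigDecimal amount, Currency currency) {
    Money {
        Objects.requireNonNull(amount, "amount");
        Objects.requireNonNull(currency, "currency");
    }

    // Pure "modifier": neither operand changes; the sum is a new instance.
    Money add(Money other) {
        if (!currency.equals(other.currency)) {
            // Assumed rule: adding across currencies is an error, not a conversion.
            throw new IllegalArgumentException(
                    "Currency mismatch: " + currency + " vs " + other.currency);
        }
        return new Money(amount.add(other.amount), currency);
    }
}
```

`BigDecimal.add` is itself side-effect free, so the method performs no mutation and no I/O, and the same two operands always yield an equal result.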
3.5 Conceptual whole (not mechanically checkable)¶
- The VO maps to exactly one term in the ubiquitous language.
- It has a name a domain expert would recognize. `MoneyHelperDataDTOV2` fails this test.
4. JEP 395 — Records, In Specification Detail¶
JEP 395 (https://openjdk.org/jeps/395), finalized in Java 16, defines records as transparent carriers for immutable data. For the purposes of VO extraction, the relevant guarantees are:
- Each record component induces a `private final` field, a public accessor, a canonical constructor parameter, and an `equals`/`hashCode` contribution.
- `toString()` is generated and includes every component.
- Records implicitly extend `java.lang.Record` and cannot extend any other class. They may implement interfaces.
- Records are implicitly `final`: they cannot be subclassed. This is desirable for VOs (subclassing breaks Liskov for value equality).
- The compact constructor runs before the implicit field assignments and is the canonical place to put invariants.
- Records may declare static factory methods, static fields, and additional instance methods (but no additional instance fields).
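A VO defined as a record with a compact constructor that throws on invalid input satisfies invariants 3.1 and 3.2 by language construction, provided every component is itself immutable or defensively copied, and satisfies 3.3 through that constructor. Only 3.4 and 3.5 still require engineering judgement.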
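Several of these guarantees can be seen together in one small type. The sketch below uses a hypothetical `Stay` record; the rule that check-out must fall after check-in is an assumption made for illustration.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Objects;

// Hypothetical VO showing a compact constructor, a static factory,
// and a derived instance method on a record.
record Stay(LocalDate checkIn, LocalDate checkOut) {
    // Runs before the implicit field assignments.
    Stay {
        Objects.requireNonNull(checkIn, "checkIn");
        Objects.requireNonNull(checkOut, "checkOut");
        if (!checkOut.isAfter(checkIn)) {
            throw new IllegalArgumentException("checkOut must be after checkIn");
        }
    }

    // Static factory: allowed on records and routed through the canonical constructor.
    static Stay ofNights(LocalDate checkIn, int nights) {
        return new Stay(checkIn, checkIn.plusDays(nights));
    }

    // Additional instance method: derived from components, no extra field needed.
    long nights() {
        return ChronoUnit.DAYS.between(checkIn, checkOut);
    }
}
```

Two `Stay` values built from the same dates are equal through the generated `equals`, whichever construction path produced them, and the factory cannot bypass validation because it still calls the canonical constructor.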
4.1 Record limitations to know¶
- No inheritance of state. If your VO hierarchy requires shared state (rare for VOs, common for entities), records are wrong.
- Default `equals` compares every component. If a component is a `BigDecimal`, `new Money(new BigDecimal("1.0"), USD).equals(new Money(new BigDecimal("1.00"), USD))` is false because `BigDecimal.equals` considers scale. Normalize in the compact constructor or override `equals` (and `hashCode`) with documentation.
- Default `toString` exposes every field, including sensitive ones. Override for types holding secrets (e.g., `CreditCard`, `ApiKey`).
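The last two limitations can be addressed inside the record itself. This is a sketch, not the only remedy: `NormalizedMoney` is a hypothetical variant that uses `stripTrailingZeros` (fixing the scale to the currency's fraction digits is another option), and the redacted format in `ApiKey.toString` is an arbitrary choice.

```java
import java.math.BigDecimal;
import java.util.Currency;
import java.util.Objects;

// Hypothetical variant of Money that normalizes scale so 1.0 and 1.00 compare equal.
record NormalizedMoney(BigDecimal amount, Currency currency) {
    NormalizedMoney {
        Objects.requireNonNull(amount, "amount");
        Objects.requireNonNull(currency, "currency");
        // Reassigning the parameter changes what the implicit field assignment stores.
        amount = amount.stripTrailingZeros();
    }
}

// Hypothetical secret-holding VO: toString is overridden so the value stays out of logs.
record ApiKey(String value) {
    ApiKey {
        Objects.requireNonNull(value, "value");
        if (value.isBlank()) {
            throw new IllegalArgumentException("API key must not be blank");
        }
    }

    @Override
    public String toString() {
        return "ApiKey[value=***]";
    }
}
```

With normalization in the constructor, the generated `equals` and `hashCode` stay untouched and consistent. One side effect to be aware of: `stripTrailingZeros` can produce a negative scale (for example, 100 is stored as 1E+2), so display code should use `toPlainString()`.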
5. Boundary Cases¶
5.1 When a clump is not a clump¶
- Test fixture parameters. A `@ParameterizedTest` taking `(String, String, int)` is documenting cases, not modeling domain. PRC counts only production methods.
- Logging signatures. `logger.info(String, Object, Object)`: repetitive but not domain.
- Constructor of a builder. Builder methods often take repeated primitive groups; the builder itself is the parameter object.
5.2 When a VO is too small¶
- A single-field "VO" with no invariants beyond non-null is just a wrapper. It can still be valuable (type safety), but call it a Tiny Type and judge it by the primitive-obsession lens, not the data-clumps lens.
5.3 When a VO is too big¶
- More than ~5 components, or components that change independently, means you have an entity or aggregate. Split it.
6. Tooling Cross-Reference¶
| Tool | What it checks | Maturity in 2026 |
|---|---|---|
| IntelliJ SSR | Pattern in signatures | Stable |
| ArchUnit | Architectural invariants | Stable |
| Spoon / JavaParser | Custom AST rules | Stable |
| SonarQube S107 | Too many parameters | Heuristic only |
| PMD ExcessiveParameterList | Same as above | Heuristic only |
S107 and ExcessiveParameterList fire only on long parameter lists. They are a useful signal but not a sufficient one: a 3-parameter method can still be a clump, and it will not trigger either rule at typical thresholds.
7. What's next¶
- `../07-primitive-obsession/specification.md`: defines the "tiny type" boundary case in detail.
- `../../06-anemic-domain-model/specification.md`: explains why VOs must carry behavior to count as domain-rich.
- `./find-bug.md`: applies this specification to ten real scenarios.
Memorize this¶
A data clump is a tuple of 3+ co-traveling, mostly-primitive types appearing in 3+ method signatures within one module. Its remedy is a Value Object satisfying immutability, value-equality, self-validation, side-effect-free behavior, and a single name in the ubiquitous language. In modern Java, the default carrier for that VO is a `record` (JEP 395) with a compact constructor enforcing every invariant.