Data Clumps — Specification

Audience: engineers who want a precise, falsifiable definition of "data clump" — one a tool can check, one a reviewer can defend.

This document fixes the vocabulary, the metric, and the invariants that any Value Object (VO) extracted from a clump must satisfy. Where Fowler (Refactoring, 2nd ed., 2018, ch. 3) gives an informal definition, we tighten it into something measurable.


1. Formal Definition

A data clump in a Java codebase is an ordered tuple of parameter or field types T = (t_1, t_2, ..., t_n) with n >= 3 such that:

  1. Recurrence: T (modulo argument order and synonymous renames) appears in the signature of at least 3 distinct methods or field-declaration groups in the same module.
  2. Cohesion: the tuple represents one concept in the domain language; replacing any t_i independently is not meaningful.
  3. Co-variation: when one element changes, the others tend to change with it. Tools approximate this by co-edit frequency (Git churn correlation > 0.6 over the last 200 commits on the file).
  4. Primitive bias: at least two of the t_i are primitive types, boxed primitives, String, BigDecimal, LocalDate/LocalDateTime, UUID, or other "stringly-typed" carriers.

Conditions (1) and (4) are mechanical and can be checked by a tool. Conditions (2) and (3) require domain knowledge: tools surface candidates, and a human reviewer confirms them.
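As a rough illustration of how mechanical condition (4) is, the check can be sketched in a few lines of Java. The class name PrimitiveBias and its method holds are hypothetical, and the carrier set lists only the types named in condition (4); a real tool would probably make that set configurable.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical helper: checks condition (4), primitive bias, for a candidate tuple.
final class PrimitiveBias {

    // Carrier types named in condition (4); boxed primitives are listed explicitly.
    private static final Set<Class<?>> CARRIERS = Set.of(
            Boolean.class, Byte.class, Short.class, Character.class,
            Integer.class, Long.class, Float.class, Double.class,
            String.class, BigDecimal.class,
            LocalDate.class, LocalDateTime.class, UUID.class);

    private PrimitiveBias() {}

    // True when at least two tuple elements are primitives or listed carriers.
    static boolean holds(List<Class<?>> tuple) {
        long carriers = tuple.stream()
                .filter(t -> t.isPrimitive() || CARRIERS.contains(t))
                .count();
        return carriers >= 2;
    }
}
```

For example, PrimitiveBias.holds(List.of(String.class, int.class, Currency.class)) is true because two of the three elements are carriers.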


2. The Metric — Parameter Repetition Count (PRC)

We define PRC(T) for a candidate tuple T as:

PRC(T) = | { m in M : the parameter types of m include every element of T, in any order and at any positions (contiguous or scattered) } |

where M is the set of all methods in the analyzed scope (typically a module or package root).

Decision rule:

| PRC(T) | Verdict |
|--------|---------|
| 1 | Acceptable. Single use is not a clump. |
| 2 | Watch. Add a // TODO: candidate VO comment. |
| 3+ | Refactor. Extract a VO. ArchUnit rule should fail. |

Three is not arbitrary. It follows the "Rule of Three" that Fowler relays in Refactoring (ch. 2) and credits to Don Roberts: the first time you do something, you just do it; the second time, you wince at the duplication but do it anyway; the third time, you refactor.

2.1 Worked computation

Given the methods below:

void charge(BigDecimal amount, Currency currency, Account account);
void refund(BigDecimal amount, Currency currency, Account account);
Receipt invoice(Customer c, BigDecimal amount, Currency currency);

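Candidate tuple T = (BigDecimal, Currency). All three signatures contain both types, so PRC(T) = 3. Verdict: refactor. The extracted VO is Money with the obvious invariants (at minimum, a non-null amount and a non-null currency).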
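The computation above can be reproduced with a small reflection-based sketch. It is not how any particular tool works: Account, Customer, and Receipt are empty placeholder classes, and the Billing interface and Prc helper are hypothetical names introduced only to hold the three signatures and the counting logic.

```java
import java.lang.reflect.Method;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Currency;
import java.util.List;

// Hypothetical placeholder types standing in for the domain classes of 2.1.
final class Account {}
final class Customer {}
final class Receipt {}

// The three signatures from the worked computation.
interface Billing {
    void charge(BigDecimal amount, Currency currency, Account account);
    void refund(BigDecimal amount, Currency currency, Account account);
    Receipt invoice(Customer c, BigDecimal amount, Currency currency);
}

final class Prc {
    private Prc() {}

    // Counts the methods declared by `scope` whose parameter list contains every
    // type of `tuple`, in any order and at any position (multiset containment).
    static int of(Class<?> scope, List<Class<?>> tuple) {
        int count = 0;
        for (Method m : scope.getDeclaredMethods()) {
            List<Class<?>> remaining = new ArrayList<>(Arrays.asList(m.getParameterTypes()));
            boolean containsAll = true;
            for (Class<?> t : tuple) {
                // remove(Object) deletes one occurrence, so repeated types in the tuple are respected.
                if (!remaining.remove(t)) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                count++;
            }
        }
        return count;
    }

    // Decision rule from section 2: 3+ refactor, 2 watch, otherwise acceptable.
    static String verdict(int prc) {
        if (prc >= 3) return "Refactor";
        if (prc == 2) return "Watch";
        return "Acceptable";
    }
}
```

Under these assumptions, Prc.of(Billing.class, List.of(BigDecimal.class, Currency.class)) yields 3, while the wider tuple (BigDecimal, Currency, Account) yields 2 because invoice does not take an Account.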


3. Value Object Invariants

A type V qualifies as a VO if and only if it satisfies all five invariants below. Tools can check three of them fully (3.1 to 3.3) and one partially (3.4); the fifth (3.5) needs human judgement.

3.1 Immutability (mechanically checkable)

  • All fields final.
  • No setters, no void mutate* methods.
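  • Defensive copies on input and output of any mutable reference (collections, arrays, legacy java.util.Date).
  • For records (JEP 395, finalized in Java 16, 2021), the language makes every component field final. This immutability is shallow: a component of a mutable type still needs a defensive copy.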
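A record's fields are final, but a component of a mutable type can still be changed through the reference the caller keeps, so the defensive-copy bullet applies to records too. The sketch below assumes Java 16 or later; the Tags record is a hypothetical example, not a type from this specification.

```java
import java.util.List;

// Hypothetical VO: an ordered list of tag names.
record Tags(List<String> values) {
    Tags {
        // List.copyOf returns an unmodifiable copy (and rejects null elements),
        // so later changes to the caller's list cannot leak into the record.
        values = List.copyOf(values);
    }
}
```

Because the stored list is unmodifiable, the generated accessor values() can return it directly without a second copy on output.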

3.2 Equality by value (mechanically checkable)

  • equals(Object) overridden; uses every field deemed identity-defining.
  • hashCode() overridden consistently with equals.
  • For records, both methods are generated; do not override unless you have a precise reason (e.g., normalized representation).

import java.math.BigDecimal;
import java.util.Currency;

public record Money(BigDecimal amount, Currency currency) {
    // equals: compares amount AND currency by value.
    // amount uses BigDecimal.equals, which is scale-sensitive (see 4.1).
    // hashCode: derived from both fields.
    // Both generated by the compiler; override only with a documented reason.
}

3.3 Self-validation (mechanically checkable)

  • Every constructor or compact-constructor body rejects invalid input.
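  • No "valid-when-the-caller-is-careful" types. The type is its own guard.

import java.time.LocalDate;
import java.util.Objects;

public record DateRange(LocalDate start, LocalDate end) {
    // Compact constructor: runs before the fields are assigned.
    public DateRange {
        Objects.requireNonNull(start, "start");
        Objects.requireNonNull(end, "end");
        if (end.isBefore(start)) {
            throw new IllegalArgumentException("end must not be before start");
        }
    }
}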

3.4 Side-effect-free behavior (semi-checkable)

  • All instance methods are pure: same input -> same output, no I/O, no mutation.
  • A "modifier" returns a new instance: money.add(other) returns a new Money.
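A minimal sketch of such a "modifier", assuming Java 16 or later. The currency-mismatch check is an assumption added for illustration; the specification itself only requires that add return a new Money.

```java
import java.math.BigDecimal;
import java.util.Currency;
import java.util.Objects;

record Money(BigDecimal amount, Currency currency) {
    Money {
        Objects.requireNonNull(amount, "amount");
        Objects.requireNonNull(currency, "currency");
    }

    // Pure "modifier": neither operand changes; the sum is a new instance.
    Money add(Money other) {
        if (!currency.equals(other.currency)) {
            // Assumed policy: adding amounts in different currencies is rejected.
            throw new IllegalArgumentException("currency mismatch");
        }
        return new Money(amount.add(other.amount), currency);
    }
}
```

Calling a.add(b) leaves both a and b untouched, which is what lets a VO be shared freely between callers.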

3.5 Conceptual whole (not mechanically checkable)

  • The VO maps to exactly one term in the ubiquitous language.
  • It has a name a domain expert would recognize. MoneyHelperDataDTOV2 fails this test.

4. JEP 395 — Records, In Specification Detail

JEP 395 (https://openjdk.org/jeps/395), finalized in Java 16, defines records as transparent carriers for immutable data. For the purposes of VO extraction, the relevant guarantees are:

  1. Each record component induces a private final field, a public accessor, a canonical constructor parameter, and an equals/hashCode contribution.
  2. toString() is generated and includes every component.
  3. Records implicitly extend java.lang.Record and cannot extend any other class. They may implement interfaces.
  4. Records are implicitly final — they cannot be subclassed. This is desirable for VOs (subclassing breaks Liskov for value equality).
  5. The compact constructor runs before the implicit field assignments and is the canonical place to put invariants.
  6. Records may declare static factory methods, static fields, and additional instance methods (but no additional instance fields).

A VO defined as a record with a compact constructor that throws on invalid input satisfies invariants 3.2 and 3.3 by language construction, and 3.1 as well provided every component is itself immutable or defensively copied. Only 3.4 and 3.5 still require engineer judgement.
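The guarantees listed above can be observed on a small record. This is a sketch assuming Java 16 or later; the static factory ofDays and the instance method contains are hypothetical additions used to illustrate guarantee 6.

```java
import java.time.LocalDate;
import java.util.Objects;

record DateRange(LocalDate start, LocalDate end) {
    // Compact constructor: validates the parameters before the fields are assigned.
    DateRange {
        Objects.requireNonNull(start, "start");
        Objects.requireNonNull(end, "end");
        if (end.isBefore(start)) {
            throw new IllegalArgumentException("end must not be before start");
        }
    }

    // Static factory (allowed by guarantee 6); it still goes through the canonical constructor.
    static DateRange ofDays(LocalDate start, long days) {
        return new DateRange(start, start.plusDays(days));
    }

    // Additional instance method; it reads the components and adds no state.
    boolean contains(LocalDate date) {
        return !date.isBefore(start) && !date.isAfter(end);
    }
}
```

Two ranges built from the same dates are equal and share a hash code, the accessors start() and end() exist without being written, and the generated toString() lists both components.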

4.1 Record limitations to know

  • No inheritance of state. If your VO hierarchy requires shared state (rare for VOs, common for entities), records are wrong.
  • Default equals compares every component. If a component is a BigDecimal, new Money(new BigDecimal("1.0"), USD).equals(new Money(new BigDecimal("1.00"), USD)) is false because BigDecimal.equals considers scale. Normalize in the compact constructor (for example with stripTrailingZeros()) or override equals and document why.
  • Default toString exposes every field, including sensitive ones. Override for types holding secrets (e.g., CreditCard, ApiKey).
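The last two limitations can each be handled in a few lines. The sketch below assumes Java 16 or later. Using stripTrailingZeros() is only one possible normalization (fixing the scale to the currency's default fraction digits is another), and the ApiKey record with its redacted format is a hypothetical example.

```java
import java.math.BigDecimal;
import java.util.Currency;
import java.util.Objects;

record Money(BigDecimal amount, Currency currency) {
    Money {
        Objects.requireNonNull(amount, "amount");
        Objects.requireNonNull(currency, "currency");
        // Normalize the scale so that 1.0 and 1.00 produce equal records.
        amount = amount.stripTrailingZeros();
    }
}

record ApiKey(String value) {
    ApiKey {
        Objects.requireNonNull(value, "value");
    }

    // Override the generated toString so the secret never reaches logs.
    @Override
    public String toString() {
        return "ApiKey[value=***]";
    }
}
```

With this normalization the two Money instances from the bullet above compare equal, and the generated hashCode stays consistent because it is computed from the normalized amount.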

5. Boundary Cases

5.1 When a clump is not a clump

  • Test fixture parameters. A @ParameterizedTest taking (String, String, int) is documenting cases, not modeling domain. PRC counts only production methods.
  • Logging signatures. logger.info(String, Object, Object) — repetitive but not domain.
  • Constructor of a builder. Builder methods often take repeated primitive groups; the builder itself is the parameter object.

5.2 When a VO is too small

  • A single-field "VO" with no invariants beyond non-null is just a wrapper. It can still be valuable (type safety), but call it a Tiny Type and judge it by the primitive-obsession lens, not the data-clumps lens.
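For illustration, a Tiny Type can be as small as the sketch below (Java 16 or later). CustomerId and OrderId are hypothetical names; the point is only that the compiler now rejects passing one where the other is expected, even though both wrap a UUID.

```java
import java.util.Objects;
import java.util.UUID;

// Hypothetical Tiny Types: one field, no invariant beyond non-null.
record CustomerId(UUID value) {
    CustomerId {
        Objects.requireNonNull(value, "value");
    }
}

record OrderId(UUID value) {
    OrderId {
        Objects.requireNonNull(value, "value");
    }
}
```

A method declared as find(CustomerId id) cannot be called with an OrderId, and the two never compare equal even when they wrap the same UUID.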

5.3 When a VO is too big

  • More than ~5 components, or components that change independently, means you have an entity or aggregate. Split it.

6. Tooling Cross-Reference

| Tool | What it checks | Maturity in 2026 |
|------|----------------|------------------|
| IntelliJ SSR | Pattern in signatures | Stable |
| ArchUnit | Architectural invariants | Stable |
| Spoon / JavaParser | Custom AST rules | Stable |
| SonarQube S107 | Too many parameters | Heuristic only |
| PMD ExcessiveParameterList | Too many parameters (same as S107) | Heuristic only |

S107 and ExcessiveParameterList fire on long parameter lists. They are a useful signal but not a clump detector: at their default thresholds a 3-parameter method never triggers them, yet it can still carry a clump.


7. What's next

  • ../07-primitive-obsession/specification.md — defines the "tiny type" boundary case in detail.
  • ../../06-anemic-domain-model/specification.md — explains why VOs must carry behavior to count as domain-rich.
  • ./find-bug.md — applies this specification to ten real scenarios.

Memorize this

A data clump is a tuple of 3+ co-traveling, mostly-primitive types appearing in 3+ method signatures within one module. Its remedy is a Value Object satisfying immutability, value-equality, self-validation, side-effect-free behavior, and a single name in the ubiquitous language. In modern Java, the default carrier for that VO is a record (JEP 395) with a compact constructor enforcing every invariant.