Organizing Data — Optimize¶
12 cases where the refactor is correct but the perf cost is real.
Optimize 1 — Encapsulate Collection allocates per call (Java)¶
At 10K req/s, with each request calling `orders()` once, that is 10K list copies/s.
Cost & Fix
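The accessor variants this case compares can be sketched side by side. This is a minimal illustration, not the original listing: the `Customer` class, the method names `ordersCopy`, `ordersView`, and `ordersStream`, and the use of `String` in place of an order type are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical Customer; String stands in for an Order type.
class Customer {
    private final List<String> orders = new ArrayList<>();

    void add(String order) { orders.add(order); }

    // Copying accessor: allocates a new immutable list on every call, O(N).
    List<String> ordersCopy() { return List.copyOf(orders); }

    // Unmodifiable view: allocates only a small wrapper, no element copy.
    // Later changes to the underlying list are visible through the view.
    List<String> ordersView() { return Collections.unmodifiableList(orders); }

    // Stream: elements are pulled lazily from the underlying list.
    Stream<String> ordersStream() { return orders.stream(); }
}
```

The view and the stream both read the live list, so they trade the per-call copy for the requirement that the list is not mutated while a caller is iterating.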
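A copying accessor (for example, one that returns `List.copyOf(orders)`) allocates a new immutable list per call. For ~50-element lists, that is ~400 bytes/call → 4 MB/s of garbage.

**Fix:** Return an unmodifiable view (no copy), for example via `Collections.unmodifiableList`, or expose a `Stream`. The caller doesn't get an O(N) copy; it just iterates lazily. The trade-off is that a view is not a snapshot: if the underlying list is mutated during iteration, the iterator throws `ConcurrentModificationException`.

Optimize 2 — Replace Data Value with Object on a hot path (Java)¶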
```java
class Transaction {
    private Money amount; // was double
}

class Money {
    private final double value;
    private final Currency currency;
}
```
Consider a batch processing job that handles 10M transactions per minute.
Cost & Fix
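For the batch case, the "keep primitives, share the currency" idea might look like the sketch below. `TransactionBatch` and its methods are hypothetical names, and `Currency` is assumed to be `java.util.Currency`, whose `getInstance` returns one shared instance per currency code.

```java
import java.util.Currency;

// Hypothetical batch holder: one shared Currency, amounts kept as primitives.
final class TransactionBatch {
    private final Currency currency;  // flyweight: one instance per ISO code
    private final double[] amounts;   // no per-transaction Money object

    TransactionBatch(Currency currency, double[] amounts) {
        this.currency = currency;
        this.amounts = amounts;
    }

    // Hot loop over primitives: nothing is allocated per element.
    double total() {
        double sum = 0.0;
        for (double a : amounts) {
            sum += a;
        }
        return sum;
    }

    Currency currency() { return currency; }
}
```

A `Money` value can still be created once at the edge of the batch (for the final total, say), which keeps the domain type without paying for it per transaction.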
Each `Money` is a heap object: header + 2 fields (~24 bytes). 10M `Money` objects per minute → ~240 MB/min. Escape analysis can only remove an allocation when the object does not escape; a `Money` that is stored in a field or returned from a method escapes, so GC pressure rises.

**Fix options:**

1. For batch / numerical hot paths: keep primitives (`double amount; Currency currency`).
2. Use a flyweight `Currency` (interned, one singleton per code), which is already common.
3. Wait for Project Valhalla's value classes.

For domain code (one `Money` per request), the cost is invisible.

Optimize 3 — Replace Type Code with Class breaks switch optimization (Java)¶
```java
enum Status { ACTIVE, INACTIVE, PENDING }

switch (status) {
    case ACTIVE -> ...;
    case INACTIVE -> ...;
    case PENDING -> ...;
}
```
vs. the old `int status` with `switch (status)`.
Cost & Fix
Both compile to `tableswitch`, so the dispatch cost is effectively the same (javac typically switches on the enum's ordinal through a small synthetic lookup array). **No regression.** ✓

The myth "enums are slower than int switch" is from C-era thinking; on modern JVMs, enum switches compile to fast code. Even on very hot paths over millions of items there is no per-item allocation: each enum constant is allocated once and reached through a static reference, so the only difference from an `int` is that the field holds a reference.

**No fix needed.** Don't avoid enums for performance.

Optimize 4 — Replace Magic Number adds runtime lookup (Python)¶
The refactored `total` reads a module-level `TAX_RATE` constant, vs. the old `def total(amount): return amount * 1.07`.
Cost & Fix
CPython looks up `TAX_RATE` in the module's namespace on every call, which is a small overhead. PyPy and Cython can largely optimize this away. At ~1M calls/sec in tight loops, it can show up in profiles.

**Fix:** Bind the constant to a local before the hot loop (`rate = TAX_RATE`), or capture it at function definition time, for example as a default argument (`def total(amount, rate=TAX_RATE)`).

For most code: irrelevant. For numerical hot loops: convert to a NumPy vectorized op.

Optimize 5 — Encapsulate Field's accessor not inlined by JIT (Java)¶
```java
class Account {
    private double balance;
    public double balance() { return balance; } // ✓ free in steady state
}
```
Question: when isn't this free?
Cost & Fix
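The two mitigations this case ends up recommending (a `final` accessor, plus direct package-private access for internal hot paths) can be sketched as follows. `BalanceReport` and `totalBalance` are hypothetical names standing in for an internal hot loop in the same package.

```java
// Sketch of the two mitigations; not needed in the common case.
class Account {
    double balance;  // package-private: same-package code may read it directly

    Account(double balance) { this.balance = balance; }

    // final: cannot be overridden, so every call site has exactly one target.
    public final double balance() { return balance; }
}

// Hypothetical internal hot path living in the same package as Account.
final class BalanceReport {
    static double totalBalance(Account[] accounts) {
        double sum = 0.0;
        for (Account a : accounts) {
            sum += a.balance;  // direct field read, no accessor call
        }
        return sum;
    }
}
```

Code outside the package still goes through `balance()`, so the public contract stays encapsulated.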
Cost cases:

- `balance()` is overridden in many subclasses, so the call site becomes megamorphic and the JIT falls back to a virtual call instead of inlining.
- The calling code is cold (still interpreted) or too large for the JIT to compile, so the accessor is never inlined there.
- Reflection- or proxy-based access (for example, Spring DI proxies) wraps the call.

Mitigations:

- Mark `balance()` `final` if it is not meant to be overridden.
- For hot fields, expose a direct package-private field for internal use and keep the public accessor for outside callers.

Generally: 99% of the time, no fix is needed. Encapsulate Field's runtime cost is zero.

Optimize 6 — Replace Array with Object adds per-instance memory (Java)¶
Comparing the array form with the record form, for 10M instances:

- Array form: 400 MB.
- Record form: 240 MB.

Records win.
Cost & Fix
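To make the two forms concrete, here is a hypothetical pair. The `name`/`age` fields and the `ArrayVsRecord` wrapper are assumptions for illustration only and are not the listing the figures above were measured on.

```java
// Hypothetical example; the field names are illustrative.
final class ArrayVsRecord {
    // Array form: positions carry the meaning, the int is boxed,
    // and nothing stops a caller from putting the wrong type in a slot.
    static Object[] personAsArray(String name, int age) {
        return new Object[] { name, age };
    }

    // Record form: named, typed components; the int stays primitive.
    record Person(String name, int age) {}
}
```

Besides the type safety, the record avoids the array's length field and the boxed `Integer`, which is where a smaller footprint would come from in a case like this.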
Records are typically *better* than arrays for memory: type safety plus a comparable or smaller footprint. The exception is when you have hundreds of fields and the record ends up allocating more. In that case a column store (NumPy, pandas, `Arrow`) is the better choice for analytical workloads.

**Fix:** Use records by default. For columnar data, use proper columnar storage.

Optimize 7 — Bidirectional with serialization loop (Java + Jackson)¶
```java
class Customer { @JsonManagedReference List<Order> orders; }
class Order { @JsonBackReference Customer customer; }
```
Cost & Fix
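Projecting to DTOs at the boundary could look like the sketch below. It leaves Jackson out so that it stays self-contained; the `id` fields, the `addOrder` helper, and the `OrderDto`/`CustomerDto` records are hypothetical names, not part of the original entities.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical entities with a bidirectional link.
final class Customer {
    final long id;
    final List<Order> orders = new ArrayList<>();

    Customer(long id) { this.id = id; }

    Order addOrder(long orderId) {
        Order o = new Order(orderId, this);
        orders.add(o);
        return o;
    }
}

final class Order {
    final long id;
    final Customer customer;  // back reference: this makes the graph cyclic

    Order(long id, Customer customer) {
        this.id = id;
        this.customer = customer;
    }
}

// Flat, acyclic shapes for the API boundary: ids instead of object references.
record OrderDto(long id, long customerId) {
    static OrderDto from(Order o) { return new OrderDto(o.id, o.customer.id); }
}

record CustomerDto(long id, List<OrderDto> orders) {
    static CustomerDto from(Customer c) {
        return new CustomerDto(c.id, c.orders.stream().map(OrderDto::from).toList());
    }
}
```

Because the DTOs hold a `customerId` rather than a `Customer`, any serializer can walk them without cycle handling, and each endpoint decides which shape it returns.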
Without the annotations, JSON serialization recurses infinitely → stack overflow. With them, `Customer` serializes its orders and each order's `customer` is suppressed. **Correct, but the JSON omits the customer reference on orders.** For an API consumer that needs `order.customer.id`, the shape is wrong.

**Fix:** Use DTOs for serialization. Don't expose entities directly.

Lesson: Bidirectional + entity serialization is a perf footgun. Always project to DTOs at the boundary.

Optimize 8 — Encapsulate Collection with concurrent mutation (Java)¶
```java
class Customer {
    private final List<Order> orders = new ArrayList<>();
    public List<Order> orders() { return Collections.unmodifiableList(orders); }
    public synchronized void add(Order o) { orders.add(o); }
}

// Caller:
for (Order o : customer.orders()) { ... } // iteration
// Concurrently, on another thread: customer.add(...)
```
Cost & Fix
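Two of the options discussed in this case, copy-on-write and snapshot-on-read, can be sketched as follows. `CowCustomer` and `SnapshotCustomer` are hypothetical names, and `String` stands in for an order type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Option 1: copy-on-write. Each write copies the backing array (O(N));
// an iterator keeps reading the array that existed when it was created.
final class CowCustomer {
    private final List<String> orders = new CopyOnWriteArrayList<>();

    List<String> orders() { return Collections.unmodifiableList(orders); }

    void add(String order) { orders.add(order); }
}

// Option 2: snapshot copy on read. Reads pay O(N); writes stay cheap.
// The copy is taken under the same lock as add(), so it is consistent.
final class SnapshotCustomer {
    private final List<String> orders = new ArrayList<>();

    synchronized List<String> orders() { return List.copyOf(orders); }

    synchronized void add(String order) { orders.add(order); }
}
```

In both variants a caller can finish its loop even if `add` runs in the meantime; it simply does not see the elements added after iteration started.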
Iterating over the unmodifiable view while another thread mutates the underlying list can throw `ConcurrentModificationException`. The fail-fast check is best-effort, so stale or inconsistent reads are also possible.

**Fix options:**

1. **`CopyOnWriteArrayList`:** write is O(N), read is concurrent and snapshot-stable.
2. **Snapshot copy on read:** return `List.copyOf(orders)` from a `synchronized` accessor, so the copy is taken under the same lock as `add`.
3. **Immutable collections (Guava `ImmutableList`):** copy-on-add, share-on-read.

Choose by read/write ratio. For read-mostly: `CopyOnWriteArrayList`. For write-heavy: snapshot copies.

Optimize 9 — Replace Type Code with State allocates per transition (Java)¶
```java
class Order {
    private OrderStatus status;
    public void submit() { status = new SubmittedStatus(); }
    public void ship() { status = new ShippedStatus(); }
}
```
Cost & Fix
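The enum-implements-interface variant of the State pattern might look like this minimal sketch. The `canShip` method and the `Statuses` enum name are assumptions introduced for illustration; only `Order`, `OrderStatus`, `submit`, and `ship` come from the listing above.

```java
// Hypothetical state interface; canShip() is an illustrative behavior.
interface OrderStatus {
    boolean canShip();
}

// Each constant is created once at class initialization and shared by all orders.
enum Statuses implements OrderStatus {
    NEW       { public boolean canShip() { return false; } },
    SUBMITTED { public boolean canShip() { return true; } },
    SHIPPED   { public boolean canShip() { return false; } }
}

class Order {
    private OrderStatus status = Statuses.NEW;

    public void submit() { status = Statuses.SUBMITTED; }  // no allocation

    public void ship() {
        if (!status.canShip()) {
            throw new IllegalStateException("order cannot be shipped from " + status);
        }
        status = Statuses.SHIPPED;  // no allocation
    }

    public OrderStatus status() { return status; }
}
```

This only works when the states carry no per-order data; a state that needs its own fields still has to be a separate object.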
Each transition allocates a new state object. For long-lived orders, this is fine. For a high-throughput system with millions of state transitions per minute, it adds GC pressure.

**Fix:** Use singletons for stateless states, or use the enum-implements-interface pattern: each enum value is a singleton, so there is zero per-transition allocation.

Optimize 10 — Encapsulate Field on a Python class without slots (Python)¶
```python
class Account:
    def __init__(self, balance):
        self._balance = balance

    @property
    def balance(self):
        return self._balance
```
Each instance has a `__dict__` (~280 bytes of overhead; the exact figure depends on the CPython version).
Cost & Fix
For 10M instances, that's 2.8 GB just in dict overhead.

**Fix:** Declare `__slots__` on the class (here, `__slots__ = ('_balance',)`) so instances carry no `__dict__`, or use `@dataclass(slots=True, frozen=True)` (Python 3.10+). This reduces per-instance overhead to ~50 bytes.

For domain code with one instance per request: irrelevant. For batches / pipelines: critical.

Optimize 11 — Replace Reference to Value triggers expensive equals (Java)¶
Suppose `Customer` is a record with an `Address` component. Records auto-generate `equals` (and `hashCode`) from all components, and `Address` has its own auto-generated `equals` based on its fields.
```java
HashMap<Customer, X> map = ...;
map.get(someCustomer); // calls hashCode, then equals → both walk all fields recursively
```
For complex nested values, equality is O(total field count).
Cost & Fix
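Both id-based approaches for this case can be sketched as below. The record components, the `CustomerIndex` class, and the `CustomerById` record are hypothetical; they only illustrate keying by id versus overriding equality.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical records; the component names are assumptions.
record Address(String street, String city, String zip) {}

record Customer(long id, String name, Address address) {}

// Approach 1: key the map by id. A lookup hashes and compares one Long
// and never touches the nested Address.
final class CustomerIndex {
    private final Map<Long, Customer> byId = new HashMap<>();

    void put(Customer c) { byId.put(c.id(), c); }

    Customer get(long id) { return byId.get(id); }
}

// Approach 2: a record whose equality is deliberately id-only.
// WARNING: this is NOT the default component-wise record equality.
record CustomerById(long id, String name) {
    @Override
    public boolean equals(Object o) {
        return o instanceof CustomerById other && other.id == id;
    }

    @Override
    public int hashCode() { return Long.hashCode(id); }
}
```

Keying by id leaves the record's value semantics untouched, which is usually the less surprising of the two.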
For lookups by `Customer` id, the entire address is hashed and compared. Wasteful.

**Fix:** Use the id as the map key, or cache the hash code in a final field (this needs a regular class, since a record cannot declare extra instance fields). Alternatively, override `equals` and `hashCode` to use the id only, with a documented warning that equality is then *not* the default component-wise record equality.

This is one place where Java records' default behavior may not be what you want; lean on the type system to enforce id-based comparison externally.

Optimize 12 — Magic Number Constant in JS / TypeScript (TypeScript)¶
V8 can generally inline `TAX_RATE` as long as it is `const` and not exported (or if the bundler inlines the constant at build time).
Cost & Fix
For a module-level `const`, V8 compiles the read to a constant load. In TypeScript, this is generally as fast as the `0.07` literal. For `export const`, V8 may have to resolve the value through the module's exports, which can be slower in tight loops.

**Fix:** For hot paths, pin the constant locally (for example, `const rate = TAX_RATE;` before the loop), or just inline the literal. JIT optimizers handle most of this; it only matters in extreme hot paths.

Patterns¶
| Refactor | Risk |
|---|---|
| Encapsulate Collection (copy) | Per-call alloc |
| Replace Data Value with Object | Per-instance alloc + header |
| Encapsulate Field | Almost free |
| Replace Type Code with State | Per-transition alloc — use singletons |
| Bidirectional + JSON | Stack overflow / wrong shape |
| ConcurrentMod over views | Need CoW or snapshot |
| Records with deep equals | Lookup cost |
| Python without slots | Per-instance overhead |
Next¶
- tasks.md — practice clean refactors
- find-bug.md — wrong refactors
- interview.md — review