Functional Interfaces and Lambdas — Optimize¶
Lambdas are not magic and not slow by default — but they have three measurable costs (cold-start linkage, allocation per capture, virtual dispatch) and a long list of small choices (primitive specialization, method-reference equivalence, hoisting) that decide whether your stream pipeline runs at C-speed or 5× slower. All numbers below are illustrative; verify with JMH on your hardware and JDK.
1. The three costs at a glance¶
| Cost | When it happens | Order of magnitude |
|---|---|---|
| Cold-start linkage | First execution of a lambda's invokedynamic | ~1–5 µs once per call site |
| Capture allocation | Each evaluation of a capturing lambda | 16–24 bytes per object |
| Virtual dispatch | Each SAM invocation | 0–15 ns (mono/bi/megamorphic) |
Steady-state, the only one that matters in most code is the capture allocation, and it disappears if the JIT proves the lambda doesn't escape. The cold-start and dispatch costs are real in benchmarks and microservice warmup; they're noise in long-running CPU loops.
2. Cold-start cost: the metafactory bootstrap¶
The first time a lambda's invokedynamic site executes, the JVM calls LambdaMetafactory.metafactory (JEP 126), which spins a class, loads it, and produces a CallSite. Order of magnitude: a few microseconds.
// First call to this method pays ~5 µs for the metafactory; subsequent calls pay ~nothing.
Function<String, Integer> length = String::length;
length.apply("hello");
This matters for:
- Microbenchmarks. A JMH benchmark that creates the lambda inside the measured method overstates the cost by including linkage on every iteration. JMH's
@Setupis the right place to construct the lambda. - Microservice startup. A service that lazily creates 1 000 lambda call sites at first request pays ~5 ms of one-time linkage. Use
CDS(Class Data Sharing) and AOT (jlink --compress, GraalVM native-image) to amortise. - Serverless cold start. Every Lambda function on AWS Lambda pays this when it first runs. Reach for non-lambda alternatives only if profiling proves it dominates; usually it doesn't.
-Xlog:class+load=info shows spun lambda classes loading; jcmd <pid> VM.classloader_stats summarises them. If a process spawns thousands of distinct lambda sites at startup, you'll see the metafactory cost in the GC logs as Metaspace pressure.
3. Steady-state cost: no slower than anonymous classes¶
Once the call site is linked and the JIT has profiled it, a lambda's invocation cost is the same as an anonymous inner class's invocation cost — not a magical free call. Both go through invokeinterface (or invokevirtual for a few specific lambda forms), both are subject to the same monomorphic/bimorphic/megamorphic JIT profile.
// Two equivalent shapes in steady state:
Function<String, Integer> a = s -> s.length();
Function<String, Integer> b = new Function<>() { public Integer apply(String s) { return s.length(); } };
At a monomorphic call site, C2 inlines the SAM body and erases the dispatch — both shapes become as fast as a direct call. At a megamorphic call site (many distinct lambda instances flowing through forEach), both fall back to a real virtual call with the same cost.
The thing lambdas have over anonymous classes is: no separate class file per occurrence. Anonymous classes inflate the class count (and Metaspace) once per source location; lambdas share one spun class per site. For a project with thousands of small callbacks, that's measurable.
4. Capture allocation: the 16-byte tax¶
A non-capturing lambda is a singleton. Allocation cost: zero.
A capturing lambda allocates a small object every time the lambda expression is evaluated. The object holds the captured values plus the standard 12-byte object header:
Runnable nonCapturing = () -> System.out.println("hi"); // singleton, 0 bytes per use
String prefix = "log: ";
Runnable capturing = () -> System.out.println(prefix); // ~16 bytes per evaluation
For a typical small capture (one reference, padded), that's ~16 bytes per evaluation; for two references, ~24; for one long/double, ~24. In a hot loop that creates the lambda each iteration, you'll see allocation pressure that wasn't obvious in source.
C2's escape analysis often saves you: if the lambda is created inside a method and doesn't escape, EA scalar-replaces the capture object — fields live in registers, no heap allocation. Verify with -XX:+UnlockDiagnosticVMOptions -XX:+PrintEliminateAllocations. EA succeeds when:
- The lambda's lifetime is bounded by the method's stack frame.
- The lambda isn't stored in a heap field, passed to a method the JIT can't see through, or returned.
When EA fails, hoist the lambda:
// Bad — fresh lambda per call, EA may fail across the loop boundary:
void process(List<Order> orders) {
orders.forEach(o -> log.info("processing {}", o.id()));
}
// Better — non-capturing static method reference:
void process(List<Order> orders) {
orders.forEach(MyClass::logProcessing);
}
private static void logProcessing(Order o) { /* log */ }
// Best when you want capture but only once:
private static final Consumer<Order> LOGGER = o -> log.info("processing {}", o.id());
void process(List<Order> orders) { orders.forEach(LOGGER); }
The third form turns "N lambda evaluations per call" into "one lambda evaluation ever" — the consumer is reused.
5. Primitive specializations — skip the boxing tax¶
A Function<Integer, Integer> boxes int to Integer on the way in and unboxes on the way out. In a tight loop over millions of elements, that's millions of Integer allocations (mitigated by the small-value cache, but still a measurable hit).
// Generic — boxes:
Function<Integer, Integer> square = x -> x * x;
int total = 0;
for (int x : xs) total += square.apply(x); // box-call-unbox per element
// Primitive specialization — no boxing:
IntUnaryOperator square = x -> x * x;
int total = 0;
for (int x : xs) total += square.applyAsInt(x);
JMH shows a typical 2–5× speedup in this shape — but the JIT can often optimise the boxed version too once EA proves the Integer doesn't escape. Don't switch on faith; measure first and verify with -XX:+PrintEliminateAllocations.
The stream API mirrors the same idea:
versus
Reach for IntStream, LongStream, DoubleStream and the primitive *Function types whenever you're in a hot path over primitives.
6. Method references and lambdas compile to the same bytecode¶
There is no runtime difference between s -> s.length() and String::length — both compile to invokedynamic against LambdaMetafactory.metafactory with the same instantiatedMethodType and an implMethod handle pointing at String.length. The bytecode is identical; the JIT decisions are identical.
invokedynamic #N, 0 // apply: ()LFunction;
// BSM: LambdaMetafactory.metafactory
// args: ( (Ljava/lang/Object;)Ljava/lang/Object;,
// String.length()I,
// (Ljava/lang/String;)Ljava/lang/Integer; )
Picking between them is purely a readability decision (professional.md). Don't profile-tune on the assumption one is faster than the other.
7. Megamorphic call sites — when polymorphism stops being free¶
HotSpot's type profiler at a SAM call site tracks observed receiver types. The three states:
- Monomorphic (one observed type): C2 inlines the SAM body. ~0 ns extra over a direct call.
- Bimorphic (two types): C2 emits a type check + two inlined bodies. ~1–2 ns extra per call.
- Megamorphic (≥ three): falls back to a real
invokeinterfacethrough the itable. ~5–15 ns extra, and the secondary wins of inlining (escape analysis, constant folding) collapse.
A forEach over a list of mostly-identical lambda receivers stays monomorphic. A pipeline dispatch that sees dozens of distinct lambda classes goes megamorphic — and lambdas are particularly easy to make megamorphic because each capturing lambda evaluation produces a new instance of a distinct class.
// Likely megamorphic — each call site sees a fresh capture class:
Map<String, Function<Order, String>> handlers = ...;
orders.forEach(o -> handlers.get(o.type()).apply(o));
Inspect: -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining prints (virtual call) vs (inline) vs (bimorphic) decisions at each callsite.
Mitigations are the usual ones: reduce the number of distinct receivers, pre-resolve dispatch to a small enum, or accept the cost (often the readability is worth more than 10 ns per call).
8. Recycle vs allocate per call site¶
For long-lived lambda registrations (event listeners, comparator constants, filter predicates), allocate once and reuse:
public final class OrderFilters {
public static final Predicate<Order> IS_PAID = Order::isPaid;
public static final Predicate<Order> IS_OVERDUE = o -> o.due().isBefore(LocalDate.now());
public static final Predicate<Order> NEEDS_REVIEW = IS_OVERDUE.and(IS_PAID.negate());
}
Compared to constructing the same predicate at each call site:
// Three allocations per call:
orders.stream()
.filter(o -> o.due().isBefore(LocalDate.now()))
.filter(o -> !o.isPaid())
.toList();
Hoisting also helps the JIT: a constant-static-final lambda's class profile is stable across the whole program, so monomorphism is much more likely. The cost is one constant per lambda; the savings are one allocation per call.
When not to hoist:
- The lambda captures call-site-specific state (a local variable, a method parameter).
- The lambda is used in exactly one place; the hoist is over-engineering.
- The codebase favours readability over micro-optimisation and the call site reads better in-place.
9. Bench checklist — do these before claiming "lambdas are slow"¶
- Confirm steady state. Run the JMH benchmark with
@Warmup(iterations = 10)minimum. Cold-start linkage shows in first-iteration numbers and disappears after warmup. - Hoist construction. Create the lambda in
@Setup, not inside@Benchmark. Otherwise you're benchmarkingLambdaMetafactory. - Pin the receivers. A megamorphic site is the actual bottleneck in a lot of "lambdas are slow" reports — and it would be slow with anonymous classes too.
- Check escape analysis.
-XX:+PrintEliminateAllocationssays whether the capture allocations are being eliminated. - Compare apples to apples. A boxed
Function<Integer, Integer>vs afor-loop withintis not a lambda problem — it's a boxing problem. UseIntUnaryOperatorfor the lambda side too. - Profile with
async-profiler. Wall-clock and CPU profiles will show whether time is in the lambda body, the dispatch, or somewhere unrelated.
A typical "lambdas are 30× slower" report turns out to be: a JMH that builds the lambda inside the loop (cold + allocation), with no warmup, comparing against a hand-inlined integer expression. Fix any of those and the ratio collapses.
10. Quick rules¶
- Cold-start linkage is ~µs and one-time per call site — irrelevant in long-running code, real in benchmarks and serverless.
- Non-capturing lambdas are singletons; capturing ones allocate ~16–24 bytes per evaluation unless EA eliminates them.
- In hot paths over primitives, use
IntStream/IntUnaryOperator/ToIntFunction— skip the boxing tax. - Method references and lambdas compile to the same bytecode — pick by readability, not by performance.
- Megamorphic SAM call sites cost ~5–15 ns extra plus loss of inlining; rein in the receiver set or accept the cost.
- Hoist long-lived lambdas to
static finalconstants for stable JIT profiles and zero re-allocation. - Before claiming "lambdas are slow", check warmup, megamorphism, boxing, and escape analysis in that order.
11. What's next¶
| Topic | File |
|---|---|
| Hands-on exercises (some include JMH targets) | tasks.md |
| Interview Q&A | interview.md |
See also: ../../06-method-dispatch-and-internals/01-jvm-method-dispatch/ for the JIT side of invokeinterface/invokedynamic, ../03-reflection-and-annotations/ for the MethodHandle and Lookup mechanics, and ../05-default-methods-and-diamond-problem/ for how default methods interact with hot-path dispatch.
Memorize this: the three costs are linkage (one-time, µs), allocation (per-evaluation, bytes), and dispatch (per-call, ns). Linkage is irrelevant in steady state; allocation often disappears under EA; dispatch is monomorphic unless you actively scatter receivers. Boxing — not lambda dispatch — is the usual real slowdown. Reach for primitive specializations and hoisted constants before reaching for "I'll just write the for-loop".