Future / Promise — Optimize¶
Ten optimization walkthroughs for Future/Promise code. Each shows before, the problem, the after, why it's faster/safer, and the trade-off. Concepts come from senior.md/professional.md.
Table of Contents¶
- Serial awaits → parallel fan-out
- Blocking get() → composition
- Common pool → bulkheaded executors
- Excessive async hops → collapse sync stages
- Unbounded fan-out → bounded concurrency
- Fail-fast → partial results
- No timeout → bounded latency
- Per-item futures in a hot loop → batching
- Redundant duplicate calls → dedupe/memoize
- Blocking pipeline → virtual threads
- Optimization Tips
1. Serial awaits → parallel fan-out¶
Before
var a = supplyAsync(() -> callA(), io).join();
var b = supplyAsync(() -> callB(), io).join(); // starts only after A
Problem. B is not even submitted until A has completed, so total latency is A + B although the two calls are independent.
After
var fa = supplyAsync(() -> callA(), io);
var fb = supplyAsync(() -> callB(), io);
return combine(fa.join(), fb.join()); // latency = max(A, B)
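The difference can be measured with a small runnable sketch. The `slowCall` helper, the 200 ms delays, and the two-thread pool are assumptions chosen for illustration; they stand in for `callA`/`callB` and the `io` executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class FanOutDemo {
    // Hypothetical stand-in for a remote call: sleeps, then returns a value.
    static int slowCall(int value, long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return value;
    }

    // Serial: join() on A returns before B is even submitted.
    static int serial(ExecutorService io) {
        int a = CompletableFuture.supplyAsync(() -> slowCall(1, 200), io).join();
        int b = CompletableFuture.supplyAsync(() -> slowCall(2, 200), io).join();
        return a + b;
    }

    // Parallel: both futures are started before either result is awaited.
    static int parallel(ExecutorService io) {
        CompletableFuture<Integer> fa = CompletableFuture.supplyAsync(() -> slowCall(1, 200), io);
        CompletableFuture<Integer> fb = CompletableFuture.supplyAsync(() -> slowCall(2, 200), io);
        return fa.thenCombine(fb, Integer::sum).join();
    }

    // Runs one variant on a fresh two-thread pool and returns elapsed milliseconds.
    static long elapsedMillis(boolean runParallel) {
        ExecutorService io = Executors.newFixedThreadPool(2);
        try {
            long start = System.nanoTime();
            int sum = runParallel ? parallel(io) : serial(io);
            if (sum != 3) {
                throw new AssertionError("unexpected sum " + sum);
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("serial   ms: " + elapsedMillis(false));
        System.out.println("parallel ms: " + elapsedMillis(true));
    }
}
```

The serial variant should take roughly the sum of the two delays and the parallel variant roughly the longer one. `thenCombine` is used here in place of the two `join()` calls so the combination itself is also non-blocking until the final `join()`.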
2. Blocking get() → composition¶
Before
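As a hedged illustration of this step, the sketch below puts a blocking `get()` next to the composed form. The `load` and `render` helpers are hypothetical stand-ins, and `load` uses the default pool only because it does no real IO here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

final class ComposeDemo {
    // Hypothetical async loader; a real one would run on a dedicated IO executor.
    static CompletableFuture<String> load(int id) {
        return CompletableFuture.supplyAsync(() -> "user-" + id);
    }

    // Hypothetical pure transform applied to the loaded value.
    static String render(String raw) {
        return raw.toUpperCase();
    }

    // Blocking style: the caller's thread is parked inside get() until load completes.
    static String renderBlocking(int id) {
        try {
            return render(load(id).get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    // Composed style: returns immediately; render runs once load has completed.
    static CompletableFuture<String> renderAsync(int id) {
        return load(id).thenApply(ComposeDemo::render);
    }
}
```

Both variants produce the same value; the difference is that `renderAsync` hands a future back to its caller instead of holding a thread while waiting.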
Problem. The calling thread blocks idle until load finishes; under request load this ties up request threads.
After
Why. No thread blocks; the transform runs on completion. The request thread is freed (it returns a Future up the stack).
Trade-off. The whole call stack must become Future-returning; half-migrations leave a blocking boundary somewhere.
3. Common pool → bulkheaded executors¶
Before
Problem. Blocking IO on the CPU-sized shared common pool starves parallel streams and other libraries; CPU and IO contend for the same threads.
After
supplyAsync(() -> blockingHttp(url), ioPool) // IO isolated
    .thenApplyAsync(this::parse, cpuPool); // CPU isolated
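One way the two pools might be set up is sketched below. The pool sizes, the thread-name prefixes, and the bodies of `blockingHttp` and `parse` are assumptions for illustration; the helpers simply report which thread ran them so the isolation is visible.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

final class BulkheadDemo {
    // Names threads "<prefix>-N" and marks them daemon so the JVM can exit.
    static ThreadFactory named(String prefix) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + counter.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
    }

    // Assumed sizing: 32 IO threads is illustrative, not a recommendation.
    static final ExecutorService ioPool = Executors.newFixedThreadPool(32, named("io"));
    // CPU pool sized to the core count, since its tasks should never block.
    static final ExecutorService cpuPool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors(), named("cpu"));

    // Hypothetical blocking call; returns the name of the thread it ran on.
    static String blockingHttp(String url) {
        return Thread.currentThread().getName();
    }

    // Hypothetical CPU-bound step; appends the name of the thread it ran on.
    static String parse(String body) {
        return body + " -> " + Thread.currentThread().getName();
    }

    static CompletableFuture<String> fetchAndParse(String url) {
        return CompletableFuture.supplyAsync(() -> blockingHttp(url), ioPool)
                .thenApplyAsync(BulkheadDemo::parse, cpuPool);
    }
}
```

Joining `fetchAndParse(...)` yields a trace of the form `io-N -> cpu-M`, showing that the blocking step and the parse step ran on different pools.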
4. Excessive async hops → collapse sync stages¶
Before
f.thenApplyAsync(this::trim, ex)
    .thenApplyAsync(String::toLowerCase, ex)
    .thenApplyAsync(this::normalize, ex); // 3 executor hops for pure transforms
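A collapsed form of the chain above might look like the following sketch. The bodies of `trim` and `normalize` are hypothetical; only the shape of the two pipelines matters.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class CollapseDemo {
    // Hypothetical cheap, pure transforms.
    static String trim(String s) {
        return s.trim();
    }

    static String normalize(String s) {
        return s.replace(' ', '_');
    }

    // Three executor submissions for three trivial functions.
    static CompletableFuture<String> hopping(CompletableFuture<String> f, Executor ex) {
        return f.thenApplyAsync(CollapseDemo::trim, ex)
                .thenApplyAsync(String::toLowerCase, ex)
                .thenApplyAsync(CollapseDemo::normalize, ex);
    }

    // One inline stage: the same functions composed into a single thenApply.
    static CompletableFuture<String> collapsed(CompletableFuture<String> f) {
        return f.thenApply(s -> normalize(trim(s).toLowerCase()));
    }
}
```

Both methods produce the same string; the collapsed one schedules nothing on an executor.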
Problem. Each *Async call is an executor submit plus a cross-core wakeup (~µs each); for trivial pure functions the dispatch cost dwarfs the work.
After
Why. Pure, CPU-trivial transforms should run inline; collapsing them removes scheduling overhead and cache-cold resumes.
Trade-off. Inline stages run on the completing thread (or on the caller's thread if the future is already complete), so this is only safe when the work is fast and non-blocking.
5. Unbounded fan-out → bounded concurrency¶
Before
Problem. A million tasks are queued at once: memory blows up and the downstream is overloaded, with no rate control.
After
Semaphore gate = new Semaphore(64);
var fs = millionIds.stream()
    .map(id -> {
        // Acquire on the submitting thread: waiting inside a pool task would park
        // pool threads and can deadlock a fixed-size pool.
        gate.acquireUninterruptibly(); // at most 64 fetches in flight
        return fetch(id).whenComplete((r, ex) -> gate.release());
    })
    .toList();
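The bound can be checked with a runnable sketch that records the peak number of overlapping calls. This variant acquires the permit on the submitting thread, so the producer itself is throttled; the `fetch` body, the 16-thread pool, and the 5 ms delay are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedFanOutDemo {
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    // Hypothetical async fetch that tracks how many calls overlap.
    static CompletableFuture<Integer> fetch(int id, Executor io) {
        return CompletableFuture.supplyAsync(() -> {
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            inFlight.decrementAndGet();
            return id * 2;
        }, io);
    }

    // Starts `items` fetches with at most `limit` in flight; returns the observed peak.
    static int runBounded(int items, int limit) {
        inFlight.set(0);
        peak.set(0);
        ExecutorService io = Executors.newFixedThreadPool(16);
        Semaphore gate = new Semaphore(limit);
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        try {
            for (int id = 0; id < items; id++) {
                gate.acquireUninterruptibly(); // the producer waits here when the gate is full
                futures.add(fetch(id, io).whenComplete((r, ex) -> gate.release()));
            }
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return peak.get();
        } finally {
            io.shutdown();
        }
    }
}
```

With `runBounded(200, 4)` the observed peak never exceeds 4, even though the pool has 16 threads available.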
6. Fail-fast → partial results¶
Before
Problem. If b fails, the whole aggregate rejects, and you lose the perfectly good a and c results.
After
var wrapped = Stream.of(a, b, c)
    .map(f -> f.handle((r, ex) -> ex == null ? Outcome.ok(r) : Outcome.fail(ex)))
    .toList();
CompletableFuture.allOf(wrapped.toArray(CompletableFuture[]::new))
    .thenApply(v -> wrapped.stream().map(CompletableFuture::join).toList());
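A self-contained version of the same pattern is sketched below. The `Outcome` class shown here is a hypothetical minimal wrapper; the original does not define it, so its fields and factory methods are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class PartialResultsDemo {
    // Hypothetical result wrapper: error is null on success, non-null on failure.
    static final class Outcome<T> {
        final T value;
        final Throwable error;

        private Outcome(T value, Throwable error) {
            this.value = value;
            this.error = error;
        }

        static <T> Outcome<T> ok(T value) {
            return new Outcome<>(value, null);
        }

        static <T> Outcome<T> fail(Throwable error) {
            return new Outcome<>(null, error);
        }

        boolean isOk() {
            return error == null;
        }
    }

    // Completes normally with one Outcome per input, in input order.
    static <T> CompletableFuture<List<Outcome<T>>> collectAll(List<CompletableFuture<T>> futures) {
        List<CompletableFuture<Outcome<T>>> wrapped = new ArrayList<>();
        for (CompletableFuture<T> f : futures) {
            // handle() converts a failure into a normal value, so the wrapper never fails.
            CompletableFuture<Outcome<T>> w =
                    f.handle((r, ex) -> ex == null ? Outcome.ok(r) : Outcome.fail(ex));
            wrapped.add(w);
        }
        return CompletableFuture.allOf(wrapped.toArray(new CompletableFuture[0]))
                .thenApply(v -> wrapped.stream()
                        .map(CompletableFuture::join) // already complete, so join() does not block
                        .collect(Collectors.toList()));
    }
}
```

Callers then inspect each entry with `isOk()` and decide per item whether to use the value or report the error.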
Why. handle turns each failure into a successful Outcome, so allOf never completes exceptionally; you return everything available.
Trade-off. Callers must now branch per item on success or failure instead of following one all-or-nothing path.
7. No timeout → bounded latency¶
Before
Problem. A stuck dependency hangs the request indefinitely, holding resources and breaking the SLA.
After
return supplyAsync(() -> slowDependency(), io)
    .orTimeout(300, TimeUnit.MILLISECONDS) // reject if too slow
    .exceptionally(ex -> cachedFallback()); // graceful degradation
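The behavior on both sides of the deadline can be exercised with the sketch below. The `slowDependency` and `cachedFallback` bodies and the single-thread executor are assumptions; `orTimeout` requires Java 9 or later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class TimeoutDemo {
    // Hypothetical dependency that takes `millis` to answer.
    static String slowDependency(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
        return "fresh";
    }

    // Hypothetical degraded answer served when the dependency is too slow.
    static String cachedFallback() {
        return "cached";
    }

    static String fetch(long dependencyMillis, long timeoutMillis) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        try {
            return CompletableFuture.supplyAsync(() -> slowDependency(dependencyMillis), io)
                    .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                    .exceptionally(ex -> cachedFallback())
                    .join();
        } finally {
            // orTimeout only fails the future; interrupting the worker is what
            // actually stops the abandoned call in this sketch.
            io.shutdownNow();
        }
    }
}
```

Note that `exceptionally` as written maps every failure to the fallback, not only timeouts; a stricter version could inspect the exception type before degrading.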
Trade-off. orTimeout rejects the Future but does not stop the underlying work, which runs on and keeps occupying a thread; pair it with cooperative cancellation if that matters.
8. Per-item futures in a hot loop → batching¶
Before
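As a hedged sketch of the batching idea, the code below issues one future per chunk of ids rather than one per id and counts the round-trips. The `fetchBatch` endpoint is hypothetical and assumes the backend can resolve many ids in one call.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class BatchDemo {
    static final AtomicInteger roundTrips = new AtomicInteger();

    // Hypothetical batch endpoint: one round-trip resolves every id in the list.
    static CompletableFuture<Map<Integer, String>> fetchBatch(List<Integer> ids) {
        roundTrips.incrementAndGet();
        Map<Integer, String> out = new HashMap<>();
        for (int id : ids) {
            out.put(id, "item-" + id);
        }
        return CompletableFuture.completedFuture(out);
    }

    // Splits ids into chunks of batchSize, starts one future per chunk, and merges the results.
    static CompletableFuture<Map<Integer, String>> fetchAll(List<Integer> ids, int batchSize) {
        CompletableFuture<Map<Integer, String>> merged =
                CompletableFuture.completedFuture(new HashMap<Integer, String>());
        for (int from = 0; from < ids.size(); from += batchSize) {
            List<Integer> chunk = ids.subList(from, Math.min(from + batchSize, ids.size()));
            // Each merge stage depends on the previous one, so the accumulator
            // map is never mutated concurrently.
            merged = merged.thenCombine(fetchBatch(chunk), (acc, part) -> {
                acc.putAll(part);
                return acc;
            });
        }
        return merged;
    }
}
```

For 1000 ids and a batch size of 100 this performs 10 round-trips instead of 1000.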
Problem. N network round-trips and N future allocations; the per-future overhead and per-call latency dominate.
After
Why. One round-trip, one future; this amortizes both network latency and allocation across the batch.
Trade-off. Larger batches add latency for the first item and need backend batch support; choose a batch size that balances latency against throughput.
9. Redundant duplicate calls → dedupe/memoize¶
Before
CompletableFuture<Price> price(String sym) {
    return supplyAsync(() -> api.quote(sym), io); // 100 callers → 100 identical calls
}
After
ConcurrentHashMap<String, CompletableFuture<Price>> inflight = new ConcurrentHashMap<>();
CompletableFuture<Price> price(String sym) {
    var f = inflight.computeIfAbsent(sym, s -> supplyAsync(() -> api.quote(s), io)); // single-flight
    // Attach cleanup after computeIfAbsent returns: a callback that fires inline
    // inside the mapping function would modify the map mid-computation.
    f.whenComplete((r, ex) -> inflight.remove(sym, f));
    return f;
}
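The single-flight effect can be verified with a runnable sketch that counts backend calls. Here the cleanup callback is attached after `computeIfAbsent` returns, and a latch holds the backend call open until every caller has registered so the count is deterministic. The `quote` body, the latch, and the symbol name are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class SingleFlightDemo {
    static final ConcurrentHashMap<String, CompletableFuture<Integer>> inflight = new ConcurrentHashMap<>();
    static final AtomicInteger backendCalls = new AtomicInteger();
    static volatile CountDownLatch release = new CountDownLatch(0);

    // Hypothetical backend: counts invocations and waits until the demo releases it.
    static int quote(String sym) {
        backendCalls.incrementAndGet();
        try {
            release.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sym.length();
    }

    static CompletableFuture<Integer> price(String sym, Executor io) {
        CompletableFuture<Integer> f =
                inflight.computeIfAbsent(sym, s -> CompletableFuture.supplyAsync(() -> quote(s), io));
        // Conditional remove: only clears the entry if it still maps to this future.
        f.whenComplete((r, ex) -> inflight.remove(sym, f));
        return f;
    }

    // Issues `callers` concurrent requests for one symbol; returns how many hit the backend.
    static int callsFor(int callers) {
        backendCalls.set(0);
        release = new CountDownLatch(1);
        ExecutorService io = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < callers; i++) {
                futures.add(price("ACME", io));
            }
            release.countDown(); // let the single backend call finish
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return backendCalls.get();
        } finally {
            io.shutdown();
        }
    }
}
```

`callsFor(100)` reports a single backend call: the other 99 callers share the first caller's future.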
10. Blocking pipeline → virtual threads¶
Before
// Deeply nested CompletableFuture chain just to avoid blocking platform threads.
fetchF(id).thenCompose(this::enrichF).thenCompose(this::persistF).thenApply(...);
After
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    var raw = scope.fork(() -> persist(enrich(fetch(id)))); // plain blocking, virtual thread
    scope.join().throwIfFailed();
    return raw.get();
}
Trade-off. CompletableFuture is still needed at interop boundaries (libraries, async servlet APIs), so both styles coexist. StructuredTaskScope is also a preview API in JDK 21, so it requires --enable-preview there.
Optimization Tips¶
- Profile before optimizing. Measure where p99 latency actually lives — usually pool queueing, not your functions. A JMH/async-profiler flame graph beats intuition.
- The executor hop is the unit of cost. Count *Async boundaries; each is ~µs. Collapse cheap pure stages (Opt 4), keep hops only at CPU↔IO confinement changes.
- Parallelize independent work, sequence only true dependencies (Opt 1), and never block on one future before starting the next.
- Always bound concurrency and time (Opts 5, 7); unbounded fan-out and missing timeouts are the top two production failure modes.
- Single-flight + batching (Opts 8, 9) attack call count, which often dwarfs per-call tuning.
- When blocking is cheap (Loom), simpler often wins (Opt 10): don't carry CompletableFuture complexity where a virtual-threaded StructuredTaskScope is clearer.
- Re-measure after each change; async optimizations frequently move the bottleneck rather than removing it.