Future / Promise: Optimize

Ten optimization walkthroughs for Future/Promise code. Each shows the Before code, the problem, the After code, why the change is faster or safer, and the trade-off. Concepts come from senior.md/professional.md.

Table of Contents

Serial awaits → parallel fan-out
Blocking get() → composition
Common pool → bulkheaded executors
Excessive async hops → collapse sync stages
Unbounded fan-out → bounded concurrency
Fail-fast → partial results
No timeout → bounded latency
Per-item futures in a hot loop → batching
Redundant duplicate calls → dedupe/memoize
Blocking pipeline → virtual threads
Optimization Tips

1. Serial awaits → parallel fan-out

Before

var a = supplyAsync(() -> callA(), io).join();
var b = supplyAsync(() -> callB(), io).join();   // starts only after A

Problem. Latency = A + B; the two independent calls don't overlap.

After

var fa = supplyAsync(() -> callA(), io);
var fb = supplyAsync(() -> callB(), io);
return combine(fa.join(), fb.join());            // latency = max(A, B)

Why. Both requests are in flight at once; you pay only the slower one. Trade-off. Higher peak concurrency on downstreams — ensure pools/rate limits allow it.
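
The After snippet still blocks the caller in join(). Where the caller can hand a future upward, the same overlap can be expressed with thenCombine. A minimal sketch, assuming hypothetical callA/callB stand-ins that each sleep 200 ms and a small fixed pool playing the role of io:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelFanOut {
    // Hypothetical stand-ins for two independent remote calls.
    static int callA() { pause(200); return 1; }
    static int callB() { pause(200); return 2; }

    static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts both calls before waiting on either, then combines them without blocking.
    static CompletableFuture<Integer> sumAsync(ExecutorService io) {
        CompletableFuture<Integer> fa = CompletableFuture.supplyAsync(ParallelFanOut::callA, io);
        CompletableFuture<Integer> fb = CompletableFuture.supplyAsync(ParallelFanOut::callB, io);
        return fa.thenCombine(fb, Integer::sum);
    }

    // Runs the sketch once; join() appears only here, at the outermost edge.
    static int demo() {
        ExecutorService io = Executors.newFixedThreadPool(4);
        try {
            return sumAsync(io).join();
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints 3
    }
}
```

Because both futures are created before anything waits, the elapsed time should be close to the slower call rather than the sum of the two.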

2. Blocking get() → composition

Before

int v = supplyAsync(this::load, io).get();
return transform(v);

Problem. The calling thread blocks idle until load finishes; under request load this ties up request threads.

After

return supplyAsync(this::load, io).thenApply(this::transform);

Why. No thread blocks; the transform runs when load completes. The request thread is freed because the method returns a Future up the stack instead of waiting. Trade-off. The whole call stack must become Future-returning; half-migrations leave a blocking boundary somewhere.

3. Common pool → bulkheaded executors

Before

supplyAsync(() -> blockingHttp(url))          // commonPool
    .thenApplyAsync(this::parse);             // commonPool

Problem. Blocking IO on the CPU-sized shared common pool starves parallel streams and other libraries; CPU and IO contend for the same threads.

After

supplyAsync(() -> blockingHttp(url), ioPool)   // IO isolated
    .thenApplyAsync(this::parse, cpuPool);     // CPU isolated

Why. Bulkheading prevents a slow downstream from consuming threads the fast path needs; each workload is sized for its nature. Trade-off. More pools to size, name, and monitor; mis-sizing one shifts the bottleneck.
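
The After snippet assumes ioPool and cpuPool already exist. One way they might be built is sketched below; the pool size of 32, the thread-name prefixes, and the helper names are illustrative assumptions, not recommendations:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class Bulkheads {
    // Names threads "<prefix>-<n>" so thread dumps and metrics show which pool is saturated.
    static ThreadFactory named(String prefix) {
        AtomicInteger seq = new AtomicInteger();
        return r -> {
            Thread t = new Thread(r, prefix + "-" + seq.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
    }

    // Assumption: 32 is a placeholder; size the IO pool from downstream latency and limits.
    static ExecutorService newIoPool() {
        return Executors.newFixedThreadPool(32, named("io"));
    }

    // CPU-bound work gains little from more threads than cores.
    static ExecutorService newCpuPool() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(cores, named("cpu"));
    }

    // Records which pool ran each stage.
    static String demo() {
        ExecutorService ioPool = newIoPool();
        ExecutorService cpuPool = newCpuPool();
        try {
            return CompletableFuture
                .supplyAsync(() -> Thread.currentThread().getName(), ioPool)
                .thenApplyAsync(ioName -> ioName + " -> " + Thread.currentThread().getName(), cpuPool)
                .join();
        } finally {
            ioPool.shutdown();
            cpuPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());   // e.g. "io-1 -> cpu-1"
    }
}
```

Named threads make the bulkheads visible: a thread dump full of parked io-* threads points at the slow downstream rather than at the parsing stage.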

4. Excessive async hops → collapse sync stages

Before

f.thenApplyAsync(this::trim, ex)
 .thenApplyAsync(String::toLowerCase, ex)
 .thenApplyAsync(this::normalize, ex);   // 3 executor hops for pure transforms

Problem. Each *Async call submits a task to the executor and typically wakes a thread on another core, which costs on the order of microseconds; for trivial pure functions the dispatch cost dwarfs the work.

After

f.thenApply(s -> normalize(s.trim().toLowerCase()));   // one inline stage

Why. Pure CPU-trivial transforms should run inline; collapsing removes scheduling overhead and cache-cold resumes. Trade-off. Inline stages run on the completing thread — only safe when the work is fast and non-blocking.

5. Unbounded fan-out → bounded concurrency

Before

var fs = millionIds.stream().map(id -> supplyAsync(() -> fetch(id), io)).toList();

Problem. A million queued tasks at once: memory blowup and downstream overload (no rate control).

After

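Semaphore gate = new Semaphore(64);
var fs = millionIds.stream()
    .map(id -> {
        gate.acquireUninterruptibly();                 // submitting thread waits for a permit
        return supplyAsync(() -> fetch(id), io)
            .whenComplete((r, ex) -> gate.release());  // permit returns when the fetch finishes
    })
    .toList();

Why. At most 64 fetches are in flight, which bounds queued work and respects downstream capacity. The permit is taken on the submitting thread rather than inside a pool task, so workers never park on the gate while the fetches that would release it sit behind them in the queue. Trade-off. Lower peak throughput, and the submitting thread blocks while the gate is full; the cap becomes a tuning knob you must size to the downstream.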
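
To check that a gate really caps in-flight work, one option is to count it. The sketch below assumes a hypothetical fetch that sleeps for 5 ms and records the highest concurrency it observes; the permit is acquired on the submitting thread so pool workers never wait on the gate:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedFanOut {
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger maxInFlight = new AtomicInteger();

    // Hypothetical fetch: records concurrency, then simulates a short remote call.
    static int fetch(int id) {
        int now = inFlight.incrementAndGet();
        maxInFlight.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        inFlight.decrementAndGet();
        return id * 2;
    }

    // Runs fetch for ids 0..count-1 with at most `limit` tasks submitted but unfinished.
    static List<Integer> fetchAll(int count, int limit, ExecutorService io) {
        Semaphore gate = new Semaphore(limit);
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            final int id = i;
            gate.acquireUninterruptibly();   // producer waits here while `limit` tasks are outstanding
            futures.add(CompletableFuture.supplyAsync(() -> fetch(id), io)
                .whenComplete((r, ex) -> gate.release()));
        }
        List<Integer> results = new ArrayList<>();
        for (CompletableFuture<Integer> f : futures) {
            results.add(f.join());
        }
        return results;
    }

    // Returns the highest concurrency observed with a 16-thread pool and a limit of 4.
    static int demo() {
        ExecutorService io = Executors.newFixedThreadPool(16);
        try {
            fetchAll(200, 4, io);
            return maxInFlight.get();
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("max in flight: " + demo());
    }
}
```

The pool has 16 threads, yet the observed maximum stays at or below the limit of 4, because a permit is released only after its fetch has returned.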

6. Fail-fast → partial results

Before

CompletableFuture.allOf(a, b, c).thenApply(v -> List.of(a.join(), b.join(), c.join()));

Problem. If b fails, the whole aggregate completes exceptionally (rejects), and you lose the perfectly good a and c results.

After

var wrapped = Stream.of(a, b, c)
    .map(f -> f.handle((r, ex) -> ex == null ? Outcome.ok(r) : Outcome.fail(ex)))
    .toList();
CompletableFuture.allOf(wrapped.toArray(CompletableFuture[]::new))
    .thenApply(v -> wrapped.stream().map(CompletableFuture::join).toList());

Why. handle turns each failure into a successful Outcome (a result type you define yourself), so the allOf aggregate always completes normally and you return everything available. Trade-off. Callers must now branch per-item on success/failure instead of one all-or-nothing path.
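
Outcome is not a JDK type. One possible shape for it, together with a generic helper built from the same handle/allOf steps, is sketched below; the record, its factory methods, and the name settleAll are assumptions made for illustration (records need Java 16+):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class PartialResults {
    // Hypothetical result holder: error == null means the call succeeded.
    record Outcome<T>(T value, Throwable error) {
        static <T> Outcome<T> ok(T value) { return new Outcome<>(value, null); }
        static <T> Outcome<T> fail(Throwable error) { return new Outcome<>(null, error); }
        boolean isOk() { return error == null; }
    }

    static <T> Outcome<T> toOutcome(T value, Throwable error) {
        if (error == null) {
            return Outcome.ok(value);
        }
        return Outcome.fail(error);
    }

    // Completes normally once every input has finished, whether it succeeded or failed.
    static <T> CompletableFuture<List<Outcome<T>>> settleAll(List<CompletableFuture<T>> futures) {
        List<CompletableFuture<Outcome<T>>> wrapped = futures.stream()
            .map(f -> f.<Outcome<T>>handle((r, ex) -> toOutcome(r, ex)))
            .collect(Collectors.toList());
        return CompletableFuture.allOf(wrapped.toArray(new CompletableFuture[0]))
            .thenApply(v -> wrapped.stream().map(CompletableFuture::join).collect(Collectors.toList()));
    }

    // One failed input out of three still yields two usable results.
    static String demo() {
        CompletableFuture<Integer> a = CompletableFuture.completedFuture(1);
        CompletableFuture<Integer> b = new CompletableFuture<>();
        b.completeExceptionally(new IllegalStateException("boom"));
        CompletableFuture<Integer> c = CompletableFuture.completedFuture(3);
        List<Outcome<Integer>> results = settleAll(List.of(a, b, c)).join();
        long ok = results.stream().filter(Outcome::isOk).count();
        return "ok=" + ok + " failed=" + (results.size() - ok);
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints ok=2 failed=1
    }
}
```

For futures that fail inside supplyAsync, the Throwable seen by a downstream stage is usually a CompletionException wrapping the original, so callers that inspect the error may want to unwrap getCause().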

7. No timeout → bounded latency

Before

return supplyAsync(() -> slowDependency(), io).thenApply(this::use);

Problem. A stuck dependency hangs the request indefinitely, holding resources and breaking SLA.

After

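return supplyAsync(() -> slowDependency(), io)
    .orTimeout(300, TimeUnit.MILLISECONDS)            // fail with TimeoutException if too slow
    .exceptionally(ex -> cachedFallback())            // graceful degradation (assumes the same type as slowDependency)
    .thenApply(this::use);                            // same transform as Before, applied to either value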

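Why. Bounds tail latency and converts a hang into a fast, degraded answer. Trade-off. orTimeout (Java 9+) fails the Future with a TimeoutException but doesn't stop the underlying work, which keeps running and occupying a pool thread; pair it with cooperative cancellation if that matters. Note that exceptionally also replaces non-timeout failures with the fallback.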
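
One way to pair a timeout with cooperative cancellation is to keep a handle on the submitted task and interrupt it when the timeout fires. A minimal sketch, assuming the work is interruptible (it blocks in calls that respond to interruption); callWithTimeout is a hypothetical helper, not a JDK method:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutWithCancel {
    // Runs `work` on `pool`. If it has not finished within `timeoutMs`, the returned future
    // fails with TimeoutException and the worker thread is interrupted.
    static <T> CompletableFuture<T> callWithTimeout(Callable<T> work, ExecutorService pool, long timeoutMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        Future<?> task = pool.submit(() -> {
            try {
                result.complete(work.call());
            } catch (Throwable t) {
                result.completeExceptionally(t);
            }
        });
        result.orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
        result.whenComplete((r, ex) -> {
            // Interrupt only on timeout or explicit cancel, not on normal completion.
            if (ex instanceof TimeoutException || ex instanceof CancellationException) {
                task.cancel(true);
            }
        });
        return result;
    }

    // Times out a 5-second sleep after 50 ms and checks that the work saw the interrupt.
    static String demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch interrupted = new CountDownLatch(1);
        CompletableFuture<String> f = callWithTimeout(() -> {
            try {
                Thread.sleep(5_000);          // hypothetical slow dependency
                return "late";
            } catch (InterruptedException e) {
                interrupted.countDown();      // the work observed the cancellation
                throw e;
            }
        }, pool, 50);
        boolean timedOut = false;
        try {
            f.join();
        } catch (CompletionException e) {
            timedOut = e.getCause() instanceof TimeoutException;
        }
        boolean stopped = false;
        try {
            stopped = interrupted.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdownNow();
        return "timedOut=" + timedOut + " stopped=" + stopped;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Interruption only helps when the work cooperates; a dependency that ignores interrupts keeps its thread until it returns on its own.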

8. Per-item futures in a hot loop → batching

Before

ids.forEach(id -> supplyAsync(() -> db.lookup(id), io));   // N round-trips, N futures

Problem. N network round-trips and N future allocations; the per-future overhead and per-call latency dominate.

After

CompletableFuture<Map<Id, Row>> batch = supplyAsync(() -> db.lookupBatch(ids), io);

Why. One round-trip, one future; amortizes both network latency and allocation across the batch. Trade-off. Larger batches add latency for the first item and need backend batch support; choose a batch size that balances latency vs throughput.
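
When the id list is too large for a single batch, a middle ground is one future per chunk. The sketch below assumes a hypothetical backend function that resolves a whole chunk per call; the chunk size of 3 is only there to keep the demo small:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class ChunkedBatch {
    // Splits ids into consecutive chunks of at most `size` elements.
    static <T> List<List<T>> chunks(List<T> ids, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += size) {
            out.add(ids.subList(i, Math.min(i + size, ids.size())));
        }
        return out;
    }

    // Issues one async batch lookup per chunk and merges the per-chunk maps.
    static <K, V> CompletableFuture<Map<K, V>> lookupAll(
            List<K> ids, int batchSize, Function<List<K>, Map<K, V>> lookupBatch, Executor io) {
        List<CompletableFuture<Map<K, V>>> parts = new ArrayList<>();
        for (List<K> chunk : chunks(ids, batchSize)) {
            parts.add(CompletableFuture.supplyAsync(() -> lookupBatch.apply(chunk), io));
        }
        return CompletableFuture.allOf(parts.toArray(new CompletableFuture[0]))
            .thenApply(v -> {
                Map<K, V> merged = new HashMap<>();
                for (CompletableFuture<Map<K, V>> part : parts) {
                    merged.putAll(part.join());   // already complete, so this does not block
                }
                return merged;
            });
    }

    // Seven ids in chunks of three: three round-trips instead of seven.
    static String demo() {
        AtomicInteger roundTrips = new AtomicInteger();
        // Hypothetical backend: one call resolves a whole chunk of ids.
        Function<List<Integer>, Map<Integer, String>> backend = chunk -> {
            roundTrips.incrementAndGet();
            Map<Integer, String> rows = new HashMap<>();
            for (Integer id : chunk) {
                rows.put(id, "row-" + id);
            }
            return rows;
        };
        ExecutorService io = Executors.newFixedThreadPool(2);
        try {
            Map<Integer, String> all = lookupAll(List.of(1, 2, 3, 4, 5, 6, 7), 3, backend, io).join();
            return "roundTrips=" + roundTrips.get() + " rows=" + all.size();
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints roundTrips=3 rows=7
    }
}
```

The chunk size is the knob the trade-off describes: larger chunks mean fewer round-trips but a longer wait before the first row is available.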

9. Redundant duplicate calls → dedupe/memoize

Before

CompletableFuture<Price> price(String sym) {
    return supplyAsync(() -> api.quote(sym), io);   // 100 callers → 100 identical calls
}

Problem. Concurrent callers for the same key each launch a duplicate expensive call (a "cache stampede").

After

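ConcurrentHashMap<String, CompletableFuture<Price>> inflight = new ConcurrentHashMap<>();
CompletableFuture<Price> price(String sym) {
    var f = inflight.computeIfAbsent(sym, s -> supplyAsync(() -> api.quote(s), io));
    // Evict outside the mapping function: a callback that fires inside computeIfAbsent
    // would modify the map recursively. remove(key, value) evicts only this entry.
    f.whenComplete((r, ex) -> inflight.remove(sym, f));   // single-flight
    return f;
}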

Why. Concurrent callers for the same key share one in-flight Future (single-flight), collapsing N calls into one. Trade-off. A failure is shared by all waiters; and you must evict on completion to avoid serving stale results forever.
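
The collapse from N calls to one can be observed by counting backend invocations. In the sketch below, quote is a hypothetical expensive call held open by a latch so that all 100 callers arrive while it is still in flight; eviction is registered outside the computeIfAbsent mapping function and uses remove(key, value) so it only evicts its own entry:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleFlight {
    final ConcurrentHashMap<String, CompletableFuture<Integer>> inflight = new ConcurrentHashMap<>();
    final AtomicInteger backendCalls = new AtomicInteger();
    final CountDownLatch release = new CountDownLatch(1);

    // Hypothetical expensive call; it waits on the latch so the demo is deterministic.
    int quote(String sym) {
        backendCalls.incrementAndGet();
        try {
            release.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sym.length();
    }

    // Callers for the same symbol share one in-flight future.
    CompletableFuture<Integer> price(String sym, Executor io) {
        CompletableFuture<Integer> f = inflight.computeIfAbsent(sym,
            s -> CompletableFuture.supplyAsync(() -> quote(s), io));
        f.whenComplete((r, ex) -> inflight.remove(sym, f));   // evict only this entry
        return f;
    }

    // 100 callers for one symbol should reach the backend exactly once.
    static int demo() {
        SingleFlight cache = new SingleFlight();
        ExecutorService io = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<Integer>> callers = new ArrayList<>();
            for (int i = 0; i < 100; i++) {
                callers.add(cache.price("ACME", io));
            }
            cache.release.countDown();   // let the single backend call finish
            for (CompletableFuture<Integer> c : callers) {
                c.join();
            }
            return cache.backendCalls.get();
        } finally {
            io.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("backend calls: " + demo());   // prints backend calls: 1
    }
}
```

Every caller registers its own eviction callback; the extra remove calls are harmless because remove(key, value) does nothing once the entry is gone or has been replaced.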

10. Blocking pipeline → virtual threads

Before

// Deeply nested CompletableFuture chain just to avoid blocking platform threads.
fetchF(id).thenCompose(this::enrichF).thenCompose(this::persistF).thenApply(...);

Problem. Callback-shaped control flow, lost stack traces, and executor-confinement complexity, all to avoid blocking code that is cheap to write.

After (Java 21+, with preview features enabled for StructuredTaskScope)

try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    var raw = scope.fork(() -> persist(enrich(fetch(id))));  // plain blocking, virtual thread
    scope.join().throwIfFailed();
    return raw.get();
}

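Why. Virtual threads make blocking cheap, so the pipeline becomes linear, debuggable code with real stack traces and proper cancellation. Trade-off. Virtual threads need Java 21+, and StructuredTaskScope is a preview API in Java 21 (enable it with --enable-preview) whose API has been revised in later previews; CompletableFuture is still needed at interop boundaries (libraries, async servlet APIs), so both styles coexist.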
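
Where preview APIs are not an option, a single sequential pipeline can run on a virtual thread through the non-preview Executors.newVirtualThreadPerTaskExecutor() (Java 21+). A minimal sketch, with fetch, enrich, and persist as hypothetical blocking stages:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class VirtualPipeline {
    // Hypothetical blocking stages; each pause stands in for IO.
    static String fetch(int id) { pause(); return "raw-" + id; }
    static String enrich(String raw) { pause(); return raw + "+meta"; }
    static String persist(String enriched) { pause(); return enriched + ":saved"; }

    static void pause() {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Plain blocking calls in sequence, executed on a virtual thread.
    static String run(int id) throws Exception {
        try (ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor()) {
            Future<String> result = vt.submit(() -> persist(enrich(fetch(id))));
            return result.get();
        }
    }

    static String demo() {
        try {
            return run(7);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints raw-7+meta:saved
    }
}
```

This covers the linear case only; the structured scope adds value once several subtasks are forked and should be cancelled together on the first failure.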

Optimization Tips

Profile before optimizing. Measure where p99 latency actually lives; it is often pool queueing rather than your functions. A JMH benchmark or an async-profiler flame graph beats intuition.
The executor hop is the unit of cost. Count *Async boundaries; each costs on the order of microseconds. Collapse cheap pure stages (Opt 4) and keep hops only where work moves between CPU and IO pools.
Parallelize independent work, sequence only true dependencies (Opt 1) — and never block one future before starting the next.
Always bound concurrency and time (Opts 5, 7); unbounded fan-out and missing timeouts are among the most common production failure modes.
Single-flight + batching (Opts 8, 9) attack call count, which often dwarfs per-call tuning.
When blocking is cheap (Loom), simpler often wins (Opt 10): don't carry CompletableFuture complexity where a virtual-threaded StructuredTaskScope is clearer.
Re-measure after each change — async optimizations frequently move the bottleneck rather than removing it.