
Chain of Responsibility — Optimize

Source: refactoring.guru/design-patterns/chain-of-responsibility

Each section presents a Chain of Responsibility implementation that works but is wasteful. Profile, optimize, measure.


Table of Contents

  1. Optimization 1: Inline static chain for JIT
  2. Optimization 2: Iterative runner — avoid stack frames
  3. Optimization 3: Batch processing in chain
  4. Optimization 4: Cache expensive handler results
  5. Optimization 5: Skip handlers that don't apply
  6. Optimization 6: Combine compatible handlers
  7. Optimization 7: Compile chain to single method
  8. Optimization 8: Replace CompletableFuture with virtual threads
  9. Optimization 9: Use LongAdder for hot counters
  10. Optimization 10: Sort handlers by short-circuit probability
  11. Optimization Tips
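
All examples assume a minimal Handler base class along the lines of this sketch. Note the convention (also used by refactoring.guru) that setNext returns the handler just linked, so calls chain; a fluent expression therefore evaluates to the tail of the chain, not the head.

public abstract class Handler {
    protected Handler next;

    // Returns the newly linked handler so calls chain:
    // a.setNext(b).setNext(c) links a → b → c and evaluates to c.
    public Handler setNext(Handler next) {
        this.next = next;
        return next;
    }

    public abstract void handle(Request r);
}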

Optimization 1: Inline static chain for JIT

Before

public class ChainBuilder {
    public Handler build(Request r) {
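        // chain shape could depend on r; rebuilt for every request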
        Handler h = new AuthHandler();
        h.setNext(new LogHandler()).setNext(new BusinessHandler());
        return h;
    }
}

// Per request:
Handler chain = builder.build(req);
chain.handle(req);

Three handler allocations per request. And because build(r) may return a different chain shape each time, call sites go megamorphic and the JIT can't inline.

After

public class StaticChain {
    private static final Handler PIPELINE;
    static {
        PIPELINE = new AuthHandler();
        PIPELINE.setNext(new LogHandler()).setNext(new BusinessHandler());
    }

    public static void process(Request r) {
        PIPELINE.handle(r);
    }
}

StaticChain.process(req);

Measurement. No per-request allocation. Stable chain shape → JIT inlines next.handle() calls into one method body. ~2-3× faster after warmup.

Lesson: Chain assembly should happen once at startup, not per request. JIT can fully optimize a static chain.


Optimization 2: Iterative runner — avoid stack frames

Before

public abstract class Handler {
    protected Handler next;
    public abstract void handle(Request r);   // recursive next.handle(r)
}

// 50-deep chain × 100K req/s = 50 stack frames per req × 100K = 5M frames/s.
// JIT inlining helps but doesn't eliminate frame overhead in megamorphic case.

After

public enum HandleResult { CONTINUE, SHORT_CIRCUIT }

public abstract class IterHandler {
    public abstract HandleResult handle(Request r);
}

public class ChainRunner {
    private final IterHandler[] handlers;

    public ChainRunner(List<IterHandler> handlers) {
        this.handlers = handlers.toArray(new IterHandler[0]);
    }

    public void run(Request r) {
        for (IterHandler h : handlers) {
            if (h.handle(r) == HandleResult.SHORT_CIRCUIT) return;
        }
    }
}

Measurement. No nested call frames. Tighter loop; cache-friendly array access. ~10-30% faster for long chains.

Trade-off. Loses the onion model: post-work after the downstream handlers return is no longer natural. For pure forward-only chains it's a clear win; a sketch for recovering post-work follows.
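
A sketch, with a hypothetical after() hook (not part of the runner above), that unwinds in reverse order and so mimics the return path of the recursive version:

public abstract class OnionHandler {
    public abstract HandleResult handle(Request r);
    public void after(Request r) {}   // optional post-work; default no-op
}

public class OnionRunner {
    private final OnionHandler[] handlers;

    public OnionRunner(List<OnionHandler> handlers) {
        this.handlers = handlers.toArray(new OnionHandler[0]);
    }

    public void run(Request r) {
        int ran = 0;
        for (; ran < handlers.length; ran++) {
            if (handlers[ran].handle(r) == HandleResult.SHORT_CIRCUIT) break;
        }
        // Unwind: post-work in reverse order for every handler that ran
        for (int i = Math.min(ran, handlers.length - 1); i >= 0; i--) {
            handlers[i].after(r);
        }
    }
}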

Lesson: For deep chains where onion model isn't needed, iterative runner with array beats recursion.


Optimization 3: Batch processing in chain

Before

for (Request r : requests) {
    chain.handle(r);   // chain runs once per request
}

100K requests = 100K chain traversals. Each traversal: 10 virtual calls.

After

public abstract class BatchHandler {
    protected BatchHandler next;
    public abstract void handle(List<Request> batch);
}

public final class AuthBatchHandler extends BatchHandler {
    public void handle(List<Request> batch) {
        // Vectorize JWT verifications (e.g., batch RSA)
        List<Request> valid = batch.stream()
            .filter(r -> verifyToken(r.token()))
            .toList();
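        // Note: rejected requests are dropped here; a real system needs
        // a response/error path for them (omitted)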
        if (next != null) next.handle(valid);   // pass valid ones forward
    }
}

public final class BusinessBatchHandler extends BatchHandler {
    public void handle(List<Request> batch) {
        // Bulk DB query
        Map<String, User> users = repo.findAllById(batch.stream().map(Request::userId).toList());
        // ... process
    }
}

Measurement. Per-batch dispatch cost amortized across batch size. For batch size 100: ~100× fewer virtual calls. DB queries batched: 1 query instead of 100.

Trade-off. Latency: requests wait until batch fills. Use for high-throughput scenarios where latency tolerated (analytics, log processing). Not for synchronous user requests.
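
To bound that latency, a micro-batcher can flush on size or on time. A sketch (hypothetical glue code, not part of the chain classes above) built on a blocking queue:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class MicroBatcher {
    private static final int MAX_BATCH = 100;
    private static final long MAX_WAIT_MS = 10;

    private final BlockingQueue<Request> queue = new LinkedBlockingQueue<>();
    private final BatchHandler chain;

    public MicroBatcher(BatchHandler chain) { this.chain = chain; }

    public void submit(Request r) { queue.add(r); }

    // Run on a dedicated thread: wait up to MAX_WAIT_MS for work, then drain
    // whatever has accumulated (capped at MAX_BATCH) into one chain traversal.
    public void runLoop() throws InterruptedException {
        while (true) {
            Request first = queue.poll(MAX_WAIT_MS, TimeUnit.MILLISECONDS);
            if (first == null) continue;
            List<Request> batch = new ArrayList<>(MAX_BATCH);
            batch.add(first);
            queue.drainTo(batch, MAX_BATCH - 1);
            chain.handle(batch);
        }
    }
}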

Lesson: Batching changes throughput economics. Apache Kafka, Spark, Flink all use batch-style chain processing.


Optimization 4: Cache expensive handler results

Before

public class JwtAuthHandler extends Handler {
    public void handle(Request r) {
        Claims claims = jwtParser.parse(r.token());   // expensive: RSA verify, parse
        r.setUser(decodeUser(claims));
        if (next != null) next.handle(r);
    }
}

// Same user makes many requests in a session; JWT parsed every time.

After

public class JwtAuthHandler extends Handler {
    private final Cache<String, Claims> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofMinutes(5))
        .build();

    public void handle(Request r) {
        Claims claims = cache.get(r.token(), this::parseAndVerify);
        r.setUser(decodeUser(claims));
        if (next != null) next.handle(r);
    }

    private Claims parseAndVerify(String token) {
        return jwtParser.parse(token);   // expensive
    }
}

Measurement. With 99% cache hit rate: 100× faster for hot tokens. Cache miss: same as before. Memory: bounded by maximumSize.

Trade-off. TTL must be ≤ token expiry. Cache invalidation on logout. Distributed: per-instance cache; eventual consistency.
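
Logout invalidation can hook straight into the same cache. A sketch, assuming the logout path has a reference to this handler and the raw token:

// Evicts the cached claims so the next request re-verifies the token.
public void onLogout(String token) {
    cache.invalidate(token);
}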

Lesson: Expensive handlers (DB, crypto, network) → cache. Caffeine is the de facto standard in-process Java cache. A per-instance cache is fine for most cases.


Optimization 5: Skip handlers that don't apply

Before

public class TenantValidationHandler extends Handler {
    public void handle(Request r) {
        if (r.tenantId() != null) {
            // expensive validation
            validateTenant(r.tenantId());
        }
        if (next != null) next.handle(r);
    }
}

// Many requests don't have tenantId — handler still runs (no-op).
// 1M requests × 0.1µs check + dispatch = 100ms wasted.

After

public class ConditionalHandler extends Handler {
    private final Predicate<Request> applies;
    private final Handler delegate;

    public ConditionalHandler(Predicate<Request> applies, Handler delegate) {
        this.applies = applies;
        this.delegate = delegate;
    }

    public void handle(Request r) {
        if (applies.test(r)) {
            delegate.handle(r);
        }
        if (next != null) next.handle(r);   // forward regardless
    }
}

Handler chain = new AuthHandler();
chain.setNext(new ConditionalHandler(
        r -> r.tenantId() != null,
        new TenantValidationHandler()))
    .setNext(new BusinessHandler());

Measurement. Per-request: predicate check ~1ns vs handler dispatch + body. For 99% non-applicable requests: 100× faster on this step.

Better: pre-route requests so they bypass the handler entirely:

public class Router {
    private final Handler tenantChain;      // built once at startup; includes TenantValidationHandler
    private final Handler nonTenantChain;   // built once at startup; skips it entirely

    public Handler routeFor(Request r) {
        return r.tenantId() != null ? tenantChain : nonTenantChain;
    }
}

Chain selection costs one branch per request, and the predicate check disappears from the chain entirely.

Lesson: A handler that's a no-op for most requests is dispatch waste. Use predicates or routing to skip.


Optimization 6: Combine compatible handlers

Before

new HeaderValidator()      // validates headers
.setNext(new HeaderEnricher())   // adds derived headers
.setNext(new HeaderLogger())     // logs all headers
.setNext(...)

3 separate handlers, 3 dispatch calls, 3 iterations over headers.

After

public class HeaderProcessor extends Handler {
    public void handle(Request r) {
        // Single pass over headers
        Map<String, String> headers = r.headers();
        Map<String, String> derived = new HashMap<>();
        for (var entry : headers.entrySet()) {
            // validate
            validate(entry.getKey(), entry.getValue());
            // enrich: collect derived headers into `derived` (added after the
            // loop to avoid modifying the map while iterating)
            // log
            log.info("{}: {}", entry.getKey(), entry.getValue());
        }
        headers.putAll(derived);   // add derived headers after the pass
        if (next != null) next.handle(r);
    }
}

Measurement. 1 dispatch + 1 pass instead of 3. ~3× faster when iteration cost dominates.

Trade-off. Less modular. Harder to reorder. Use only when handlers are tightly coupled and always run together.

Lesson: CoR's modularity has a cost. For inner-loop chains, fusing handlers eliminates dispatch and improves cache locality. Profile before fusing — most chains aren't bottlenecks.


Optimization 7: Compile chain to single method

Before

Handler chain = new AuthHandler();
chain.setNext(new LogHandler()).setNext(new BusinessHandler());

chain.handle(req);   // 3 virtual calls

After (annotation processor or codegen)

public final class GeneratedChain {
    public static void process(Request r) {
        // Inlined Auth
        if (!verify(r.token())) throw new UnauthorizedException();

        // Inlined Log (pre-work)
        long start = System.currentTimeMillis();

        // Inlined Business
        handleBusiness(r);

        // Inlined Log (post-work)
        log.info("{} {}ms", r.url(), System.currentTimeMillis() - start);
    }
}

Measurement. Zero CoR overhead — chain dissolved into linear code. ~5-10× faster than dynamic dispatch chain.

Trade-off. Build complexity (annotation processor / codegen). Loss of runtime configurability. Best for very hot paths and stable chain configurations.

Tools: Java annotation processors (the mechanism behind Dagger for DI and MapStruct for mappers), ANTLR for parser generation, Roslyn source generators in C#.

Lesson: When chain configuration is fixed at build time, codegen eliminates abstraction. ANTLR generates parsers; same idea for chains.
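
As an illustration of the input side, here is a hypothetical annotation (illustrative names, not a real library) that a processor could consume to emit GeneratedChain:

import java.lang.annotation.*;

@Retention(RetentionPolicy.SOURCE)   // consumed at compile time, not retained
@Target(ElementType.TYPE)
public @interface CompiledChain {
    Class<? extends Handler>[] value();   // handlers in execution order
}

// The processor reads this and generates GeneratedChain.process(Request).
@CompiledChain({AuthHandler.class, LogHandler.class, BusinessHandler.class})
interface RequestPipeline {}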


Optimization 8: Replace CompletableFuture with virtual threads

Before

public CompletableFuture<Response> handle(Request req) {
    return validate(req)
        .thenCompose(this::auth)
        .thenCompose(this::log)
        .thenCompose(this::business);
}

Each thenCompose allocates a CompletableFuture plus an internal completion node. For a 4-step chain at 100K req/s: roughly 20MB/s of garbage, plus async scheduling overhead.

After (Java 21 virtual threads)

public Response handle(Request req) {
    Request validated = validate(req);
    Request authed = auth(validated);
    Request logged = log(authed);
    return business(logged);
}

// Caller:
Thread.startVirtualThread(() -> {
    Response r = handler.handle(req);
    sendResponse(r);
});

Measurement. No future allocations. No async scheduling. Virtual thread mounts/unmounts on carrier thread when blocked. 1M concurrent virtual threads cheap.

Trade-off. Java 21+ only. Some operations pin the virtual thread to its OS carrier (synchronized blocks, native code), so blocking there stalls the carrier thread. Test for pinning, e.g. by running with -Djdk.tracePinnedThreads=full.

Lesson: Project Loom changes async-vs-sync trade-off. For I/O-bound CoR chains: synchronous code with virtual threads is simpler AND faster. CompletableFuture becomes legacy for most use cases.
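
In a server, the idiomatic form is a virtual-thread-per-task executor rather than spawning threads by hand. A sketch (the incoming() request source is hypothetical):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One cheap virtual thread per request; blocking calls inside the chain
// unmount the carrier thread instead of holding it.
try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (Request req : incoming()) {
        executor.submit(() -> sendResponse(handler.handle(req)));
    }
}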


Optimization 9: Use LongAdder for hot counters

Before

public class HitCounter extends Handler {
    private final AtomicLong count = new AtomicLong();

    public void handle(Request r) {
        count.incrementAndGet();
        if (next != null) next.handle(r);
    }
}

// 100K req/s on 32-core machine:
// AtomicLong's CAS contention serializes — ~50ns per increment under contention.

After

public class HitCounter extends Handler {
    private final LongAdder count = new LongAdder();

    public void handle(Request r) {
        count.increment();
        if (next != null) next.handle(r);
    }

    public long count() { return count.sum(); }
}

Measurement. LongAdder is sharded — each thread updates its own cell. No contention. ~10× faster under high contention.

Trade-off. sum() is O(N) cells (N = thread count). Slow for frequent reads. Use when writes >> reads.

For metrics:

  • Counter (write-heavy, read only at scrape time) → LongAdder.
  • Real-time dashboard counter (frequent reads) → AtomicLong.

Lesson: For hot atomic counters in chain handlers (rate limit, hit count), LongAdder scales better than AtomicLong. Java 8+.


Optimization 10: Sort handlers by short-circuit probability

Before

Handler chain = new HeaderValidator();       // always passes
chain.setNext(new BodyParser())              // always succeeds
    .setNext(new ExpensiveCheck())           // takes 50ms; rejects 10%
    .setNext(new AuthCheck())                // rejects 20%
    .setNext(new BusinessHandler());

// Every request pays the 50ms ExpensiveCheck, including the 20% that AuthCheck will reject anyway.

After

Handler chain = new AuthCheck();             // rejects 20% — early
chain.setNext(new ExpensiveCheck())          // rejects 10% — second
    .setNext(new HeaderValidator())          // always passes — last
    .setNext(new BodyParser())
    .setNext(new BusinessHandler());

// Rejected requests now exit early — avoid expensive work.

Measurement. If 30% of requests get rejected overall:

  • Before: every request pays the 50ms ExpensiveCheck before AuthCheck can reject it.
  • After: the 20% rejected by AuthCheck never reach ExpensiveCheck, saving 50ms × 20% = 10ms per request on average.

At 100K req/s, that is 20K fewer ExpensiveCheck invocations per second.

Trade-off. Reorders may break dependencies (handlers expecting earlier handler's work). Document and test.

Lesson: Place handlers most likely to short-circuit first. Place expensive handlers as late as possible. Fail fast, fail cheap. Same principle as DB query optimization (filter early, project late).
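
The ordering can also be kept honest with data. A sketch (hypothetical wiring) that counts short-circuits per guard with LongAdder (Optimization 9) and periodically re-sorts order-independent guards by observed rejection rate:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

public class MeasuredGuard {
    final IterHandler guard;                  // from Optimization 2
    final LongAdder seen = new LongAdder();
    final LongAdder rejected = new LongAdder();

    MeasuredGuard(IterHandler guard) { this.guard = guard; }

    HandleResult handle(Request r) {
        seen.increment();
        HandleResult result = guard.handle(r);
        if (result == HandleResult.SHORT_CIRCUIT) rejected.increment();
        return result;
    }

    double rejectionRate() {
        long n = seen.sum();
        return n == 0 ? 0.0 : (double) rejected.sum() / n;
    }

    // Rebuild periodically (not per request): highest rejection rate first.
    // A fuller version would also weight by per-guard cost, and this is only
    // valid when the guards are order-independent (see the trade-off above).
    static List<MeasuredGuard> reorder(List<MeasuredGuard> guards) {
        List<MeasuredGuard> sorted = new ArrayList<>(guards);
        sorted.sort(Comparator.comparingDouble(MeasuredGuard::rejectionRate).reversed());
        return sorted;
    }
}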


Optimization Tips

  • Static chain assembly. Build once at startup; JIT inlines.
  • Iterative runner for deep chains; avoids stack frames and recursion overhead.
  • Batch processing when latency permits — vector operations + amortized dispatch.
  • Cache expensive handler results. JWT, DB lookups, RSA verifications.
  • Skip non-applicable handlers with predicates or routing.
  • Fuse compatible handlers when modularity isn't worth the dispatch cost.
  • Codegen for build-time-fixed chains.
  • Virtual threads instead of CompletableFuture for I/O-bound chains (Java 21+).
  • LongAdder for hot counters under contention.
  • Order handlers by short-circuit probability — fail fast.
  • Profile first. Chain dispatch is rarely the bottleneck. Cache misses, allocations, lock contention usually dominate.
