Leader/Followers — Optimize¶

Optimization walkthroughs for the Leader/Followers concurrency pattern. Each shows before → problem → after → why → trade-off. Foundations in senior.md and professional.md.

Table of Contents¶

Opt 1 — Shrink the Promotion Critical Section
Opt 2 — Shard the Handle Set
Opt 3 — Promote Per-Handle for Burst Parallelism
Opt 4 — Pin Threads to Cores (NUMA)
Opt 5 — Use EPOLLONESHOT for Free Suspend/Resume
Opt 6 — Batch-Drain Selected Keys
Opt 7 — Offload Blocking Handlers
Opt 8 — FIFO Token to Kill Convoys
Opt 9 — Right-Size the Pool by Blocking Fraction
Opt 10 — Avoid Allocation in the Hot Loop
Optimization Tips

Opt 1 — Shrink the Promotion Critical Section¶

Before. The lock is held across select() and the handler (as in find-bug Bug 6), or across allocation/logging. Problem. Any work under the promotion lock serializes the whole pool — the lock becomes a global mutex, throughput collapses to single-thread. After. Hold the lock for exactly two operations: flip leaderPresent and signal(). Everything else (select, dispatch, logging, allocation) runs outside it. Why. The promotion lock's cache line is contended by all N threads; the shorter the hold, the less time others spin/block. Minimal critical section = maximal overlap. Trade-off. None for correctness; you must be disciplined that no shared state is read outside the lock.

Opt 2 — Shard the Handle Set¶

Before. One selector, one promotion lock, N threads. At high request rates, profiling shows CPU dominated by lock acquire + CV signal. Problem. The single token's cache line bounces between all N cores (MESI), and the lock saturates — the hand-off cost you removed reappears as lock contention. After. M selectors, M promotion locks, N/M threads each; connections hashed to shards and pinned. Why. Each token is now contended by only N/M threads, cutting coherence traffic roughly M-fold. Throughput scales until per-shard contention reappears. Trade-off. Cross-shard load imbalance (hot shard vs idle shard) and no work stealing. Choose M so threadsPerShard stays small (2–4) without starving idle shards.

Opt 3 — Promote Per-Handle for Burst Parallelism¶

Before. On a burst, select() returns K ready handles; the leader promotes once, then dispatches all K serially on its own thread. Problem. The K handlers run sequentially even though K threads are available — burst latency is K × handler time. After. Promote, take one handle; the new leader's select() returns the remaining K-1 and it promotes again, taking one — so the burst fans out across threads. Why. Each handle gets its own thread; burst latency drops toward 1 × handler time. Trade-off. More promotion-lock churn per burst (K signals instead of 1). Worth it only when handlers are non-trivial and bursts are common; for tiny handlers, serial dispatch on a warm cache may still win.

Opt 4 — Pin Threads to Cores (NUMA)¶

Before. Pool threads float across all cores/sockets; connections aren't affinitized to NUMA nodes. Problem. Detect (kernel touches socket buffer on core A) and process (handler runs on core B, different socket) cross NUMA boundaries — cold caches and cross-socket coherence traffic erase Leader/Followers' cache advantage. After. Per-shard pools pinned to a core/socket; connections registered on the selector owned by that socket's pool. Why. Detect+process stay on one socket; the warm-cache benefit (the whole point of the pattern) is realized and compounded. Trade-off. Reduced flexibility and worse balance if traffic is skewed across sockets; requires NUMA-aware connection placement.

Opt 5 — Use `EPOLLONESHOT` for Free Suspend/Resume¶

Before. Userspace suspend/resume (key.interestOps(0) before promote, restore after) to prevent double-dispatch (find-bug Bug 10). Problem. Extra selector mutations and lock juggling per event in userspace. After (C/epoll). Register handles with EPOLLONESHOT: the kernel automatically disarms a handle after reporting it once, until you explicitly re-arm it post-handler. Why. The kernel does suspend/resume atomically; the new leader can never re-detect a handle currently being processed. Less userspace bookkeeping, no race window. Trade-off. A mandatory epoll_ctl(MOD) re-arm after each handler (a syscall). Java NIO lacks a direct equivalent — you emulate it with interest-ops; this optimization is C/epoll-specific.

Opt 6 — Batch-Drain Selected Keys¶

Before. Loop: select() returns, take one key, promote, process, repeat — one select() syscall per event. Problem. Under high event rates, one syscall per event dominates; syscall overhead throttles throughput. After. Drain all ready keys from one select(), promote, then process the batch (or fan out per Opt 3). One syscall amortizes over K events. Why. Syscall count drops from O(events) toward O(bursts); fewer kernel transitions, higher throughput. Trade-off. Slightly higher per-event latency for the last key in a large batch, and tension with Opt 3 (per-handle promotion). Tune batch size against your latency SLO.

Opt 7 — Offload Blocking Handlers¶

Before. A handler occasionally does a blocking DB/RPC call inline. Problem. A blocked processor can't lead; a few concurrent slow handlers starve the leader role, and detection latency spikes — Leader/Followers' worst failure mode. After. Detect the blocking step and hand just that step to a separate blocking-IO thread pool (or async client), returning the L/F thread to follow immediately. Why. L/F threads stay short and uniform — the precondition for the pattern — while slow work runs elsewhere. Trade-off. Reintroduces a hand-off for the blocking subset (the cost you avoided), and added complexity. If most handlers block, abandon L/F for Half-Sync/Half-Async entirely.

Opt 8 — FIFO Token to Kill Convoys¶

Before. A plain Condition + signal(); whichever thread re-contends fastest after processing keeps winning leadership. Problem. Unfair leadership causes a convoy: the fast re-contender hogs the leader role, others starve, and tail latency for "unlucky" threads' connections spikes. After. A FIFO follower queue (ACE's ACE_Token model): promote the head of the wait queue, not an arbitrary follower. Why. Strict FIFO bounds the wait any follower experiences; leadership rotates fairly; tail latency tightens. Trade-off. Slightly higher per-promotion overhead (queue management vs a single signal). Worth it for latency-fairness-sensitive workloads; overkill for uniform ones.

Opt 9 — Right-Size the Pool by Blocking Fraction¶

Before. Pool size = CPU count (the reflex for CPU-bound pools). Problem. If handlers spend a fraction f of their time blocked (even briefly), CPU-count threads leave cores idle during those blocks and can starve the leader role. After. Size N ≈ cores / (1 − f). Measure f empirically (handler wall-time vs CPU-time). Why. Provides enough threads that one is always free to lead even while others are momentarily blocked, without over-subscribing. Trade-off. Over-sizing increases promotion-lock contention (more threads on the token) and memory; under-sizing risks leader starvation. It's a measured balance, not a formula to apply blindly.

Opt 10 — Avoid Allocation in the Hot Loop¶

Before. Each iteration allocates a fresh ArrayList<SelectionKey> (or boxes, or builds log strings) per select(). Problem. Hot-loop allocation creates GC pressure that introduces pauses — directly hitting the tail latency Leader/Followers is chosen to protect. After. Reuse a per-thread scratch buffer for the drained keys; defer/skip log-string construction unless logging is enabled; avoid boxing. Why. Zero steady-state allocation → no GC-induced jitter → stable p99. Trade-off. Slightly more code (object reuse, thread-local buffers) and care to keep reused buffers thread-confined. Standard low-latency hygiene.

Optimization Tips¶

Profile before optimizing. Leader/Followers' justification is performance, so every change here must be validated against context-switch count, cache-misses, and p99 — not wall-clock alone (see professional.md's microbenchmark anatomy).
The promotion lock is the usual ceiling. Opts 1, 2, and 8 all target it. Shrink the critical section first (free); shard when one token saturates; add FIFO only for fairness.
Warm cache is the whole point. Opts 4 and 10 protect the cache-locality advantage; if you let detect and process drift to different cores or thrash GC, you've thrown away the reason to use the pattern.
Know when to stop. If you find yourself adding Opt 7 (offloading) for most handlers, the workload is blocking — switch to Half-Sync/Half-Async. If you push sharding to one thread per shard, you've arrived at Thread-Pool-per-Reactor; that's a valid destination, not a failure.
Each optimization has an inverse trade. Per-handle promotion (Opt 3) adds lock churn; batching (Opt 6) adds latency; sharding (Opt 2) adds imbalance. Optimize against your specific SLO, and measure both directions.