Lucas' Theorem — Senior Level¶
Focus: How do you architect a large-scale counting system around Lucas' theorem? When a service must answer millions of "how many ways, modulo
M" queries with arguments up to10^18, the senior questions are about choosing the modulus/prime, sharing precomputed factorial tables across workers, combining CRT for composite moduli, caching the cheap and short-circuiting the rest, and observing correctness and latency in production. This file treats Lucas as an infrastructure component, not just a formula.
Table of Contents¶
- Introduction
- System Design: A Combinatorics Service
- Choosing the Prime and the Modulus
- Shared and Sharded Factorial Tables
- Combining CRT Across a Composite Modulus
- Comparison with Alternatives
- Architecture Patterns
- Code Examples
- Observability
- Failure Modes
- Summary
Introduction¶
A senior engineer rarely computes a single binomial. The real setting is a counting service: lattice-path counts, ballot/Catalan-derived quantities, multinomial weights, error-correcting-code sizes, or analytics over enormous combinatorial spaces — all reported modulo some M because the true integers are too large to transmit. The arguments n, k reach 10^18, the query rate reaches tens of thousands per second, and the answers must be provably correct under a fixed modulus.
Lucas' theorem is the kernel that makes this feasible for a small prime modulus, and Lucas+CRT (and the Granville extension) handles composite moduli. The senior concerns are systemic:
- The modulus is often given by the product spec (e.g. "report mod
142857for legacy reasons"). You must factor it and route each prime/prime-power to the right kernel. - A size-
pfactorial table is a shared, immutable resource. Who builds it, where does it live, how is it warmed, and what is the memory budget per replica? - Many queries short-circuit to
0; a cache and an early-exit path turn those into nanosecond responses. - Correctness is non-negotiable: a single off-by-one in the digit loop silently corrupts results, so the service needs continuous validation against an oracle.
System Design: A Combinatorics Service¶
The request path:
- Modulus router factors
M(cached factorizations) and dispatches each prime factor to a kernel. - Lucas kernel owns a size-
pfact/invFacttable per relevant small prime, loaded from a shared cache at startup. - CRT combiner merges per-prime residues for square-free
M. - Result cache memoizes hot
(n, k, M)triples and serves zero short-circuits instantly.
The factorial tables are immutable after build — perfect for read replication and zero-lock sharing.
Choosing the Prime and the Modulus¶
Often you cannot choose the modulus — it is dictated by the problem. But when you can, or when you must factor a given M, the choices matter.
| Situation | Decision |
|---|---|
| You control the modulus, want speed | Pick a small prime so the size-p table is tiny and cache-resident (e.g. p = 1009, 9973). |
| You control it, want a large prime field | Use 10^9 + 7 — but then Lucas is the wrong tool; use the O(n) factorial table (sibling 23) if n fits. |
Modulus is a fixed square-free M | Factor M = ∏ p_i; one Lucas table per distinct prime; CRT to combine. |
Modulus is a prime power p^e | Granville kernel; table grows but stays O(p^e) or O(p) with care. |
| Modulus is general composite | Factor into prime powers; Granville per power; CRT. |
Key tension: Lucas's cost is O(p + log_p n). A larger base p means fewer digits (log_p n shrinks) but a bigger table (O(p) grows). For huge n and a fixed memory budget B, pick the largest prime whose table fits in B; that minimizes the digit count without overflowing memory. In practice the table dominates, so you choose p near the memory ceiling.
Prime p | digits for n = 10^18 | table size | total work |
|---|---|---|---|
| 2 | ~60 | 16 B | tiny table, many digits |
| 1009 | ~6 | 16 KB | balanced |
| 10^5 | ~4 | 1.6 MB | few digits, modest table |
| 10^6 | ~3 | 16 MB | fewest digits, large table |
Shared and Sharded Factorial Tables¶
The size-p fact/invFact arrays are the only nontrivial state. Treat them as a shared immutable artifact:
- Build once at startup (or lazily on first use per prime), then mark read-only. No locks needed for reads.
- Replicate, do not recompute. If many workers need the same prime's table, build it in one place and ship the bytes (e.g. via a sidecar or a memory-mapped file) rather than each worker spending
O(p). - Memory-map large tables. For
pnear10^6, an mmap'd file lets the OS page-cache share one physical copy across processes on the same host. - Warm caches before taking traffic. A cold worker that builds a 16 MB table on the first request adds a latency spike; warm it in a readiness probe.
- Bound the set of supported primes. Do not allow arbitrary user-supplied huge primes to trigger a multi-gigabyte table build — that is a denial-of-service vector (see Failure Modes).
Combining CRT Across a Composite Modulus¶
For a square-free M = p_1 · … · p_r, the service runs r independent Lucas queries (parallelizable, one per prime) and merges with CRT. For a general M, factor into prime powers M = ∏ q_j^{e_j}, run the Granville kernel per power, then CRT.
Architectural points:
- The per-prime queries are embarrassingly parallel — fan them out across cores or workers, then reduce via CRT. Latency is
maxover primes, not the sum. - Precompute the CRT coefficients for a fixed
M: theM_i = M / p_iand their inverses depend only on the modulus, not on(n, k). Compute them once whenMis registered. - Distinct-prime requirement is a routing invariant. The router must never send a non-prime (or a prime power) to the plain Lucas kernel. Enforce this at the factorization step.
Comparison with Alternatives¶
| Attribute | Lucas kernel | Lucas + CRT | Granville (prime power) | Factorial table (sibling 23) |
|---|---|---|---|---|
| Modulus | single small prime | square-free composite | prime power | any prime (large ok) |
Max n | 10^18 / bignum | 10^18 / bignum | 10^18 / bignum | ~10^7 |
| Setup memory | O(p) | Σ O(p_i) | O(p) or O(p^e) | O(n) |
| Per-query | O(log_p n) | Σ O(log_{p_i} n) | O(e · log_p n) | O(1) |
| Parallelism | per digit (limited) | per prime (high) | per digit | none needed |
| Production usage | counting mod small prime | mod square-free M | mod p^e quantities | mod 10^9+7 with small n |
Senior takeaway: the kernels are complementary, selected by (n, M). The service's value is the router that picks correctly and the shared tables that amortize setup.
Architecture Patterns¶
- Zero short-circuit fast path: before any table work, if
k > nor any base-pdigit comparison fails, return0. This is the cheapest possible response and very common. - Result memoization: hot
(n, k, M)triples (e.g. repeated analytics queries) cache well; the answer is a tiny integer. - Idempotent, stateless workers: because tables are immutable, any worker can serve any query; scale horizontally behind a load balancer.
Code Examples¶
Thread-safe Lucas service with a shared immutable table¶
Go¶
package main
import (
"sync"
)
// LucasTable is an immutable, shareable factorial table for a prime p.
type LucasTable struct {
p int64
fact []int64
invFact []int64
}
func powMod(a, e, m int64) int64 {
a %= m
if a < 0 {
a += m
}
r := int64(1) % m
for e > 0 {
if e&1 == 1 {
r = r * a % m
}
a = a * a % m
e >>= 1
}
return r
}
func buildTable(p int64) *LucasTable {
fact := make([]int64, p)
invFact := make([]int64, p)
fact[0] = 1 % p
for i := int64(1); i < p; i++ {
fact[i] = fact[i-1] * i % p
}
invFact[p-1] = powMod(fact[p-1], p-2, p)
for i := p - 1; i > 0; i-- {
invFact[i-1] = invFact[i] * i % p
}
return &LucasTable{p: p, fact: fact, invFact: invFact}
}
// Query is lock-free: the table is immutable after build.
func (t *LucasTable) Query(n, k int64) int64 {
if k < 0 || k > n {
return 0
}
res := int64(1) % t.p
for n > 0 || k > 0 {
ni, ki := n%t.p, k%t.p
if ki > ni {
return 0
}
res = res * t.fact[ni] % t.p * t.invFact[ki] % t.p * t.invFact[ni-ki] % t.p
n /= t.p
k /= t.p
}
return res
}
// TableCache builds each prime's table at most once.
type TableCache struct {
mu sync.Mutex
tables map[int64]*LucasTable
}
func NewTableCache() *TableCache { return &TableCache{tables: map[int64]*LucasTable{}} }
func (c *TableCache) Get(p int64) *LucasTable {
c.mu.Lock()
t, ok := c.tables[p]
if !ok {
t = buildTable(p)
c.tables[p] = t
}
c.mu.Unlock()
return t // safe to use concurrently — immutable
}
Java¶
import java.util.concurrent.ConcurrentHashMap;
public class LucasService {
public static final class Table {
final long p;
final long[] fact, invFact;
Table(long p) {
this.p = p;
fact = new long[(int) p];
invFact = new long[(int) p];
fact[0] = 1 % p;
for (int i = 1; i < p; i++) fact[i] = fact[i - 1] * i % p;
invFact[(int) p - 1] = powMod(fact[(int) p - 1], p - 2, p);
for (int i = (int) p - 1; i > 0; i--) invFact[i - 1] = invFact[i] * i % p;
}
long query(long n, long k) { // immutable -> thread-safe reads
if (k < 0 || k > n) return 0;
long res = 1 % p;
while (n > 0 || k > 0) {
long ni = n % p, ki = k % p;
if (ki > ni) return 0;
res = res * fact[(int) ni] % p * invFact[(int) ki] % p
* invFact[(int) (ni - ki)] % p;
n /= p; k /= p;
}
return res;
}
}
static long powMod(long a, long e, long m) {
a %= m; if (a < 0) a += m;
long r = 1 % m;
while (e > 0) { if ((e & 1) == 1) r = r * a % m; a = a * a % m; e >>= 1; }
return r;
}
private final ConcurrentHashMap<Long, Table> cache = new ConcurrentHashMap<>();
public Table table(long p) { return cache.computeIfAbsent(p, Table::new); }
}
Python¶
import threading
class LucasTable:
"""Immutable factorial table for a prime p; safe to share across threads."""
def __init__(self, p):
self.p = p
self.fact = [1] * p
for i in range(1, p):
self.fact[i] = self.fact[i - 1] * i % p
self.inv_fact = [1] * p
self.inv_fact[p - 1] = pow(self.fact[p - 1], p - 2, p)
for i in range(p - 1, 0, -1):
self.inv_fact[i - 1] = self.inv_fact[i] * i % p
def query(self, n, k):
if k < 0 or k > n:
return 0
res = 1 % self.p
p = self.p
while n > 0 or k > 0:
ni, ki = n % p, k % p
if ki > ni:
return 0
res = res * self.fact[ni] % p * self.inv_fact[ki] % p * self.inv_fact[ni - ki] % p
n //= p
k //= p
return res
class TableCache:
def __init__(self):
self._lock = threading.Lock()
self._tables = {}
def get(self, p):
with self._lock:
t = self._tables.get(p)
if t is None:
t = LucasTable(p)
self._tables[p] = t
return t # immutable -> reads need no lock
What it does: a thread-safe cache that builds each prime's factorial table at most once; the immutable tables are then queried concurrently with no read locks.
Capacity Planning and Latency Budget¶
Sizing a Lucas-based counting service is dominated by two numbers: the table memory per supported prime and the per-query digit count.
Memory. Each prime needs fact and invFact arrays of p int64 each — 16p bytes. A whitelist of, say, 20 small primes up to 10^5 costs 20 × 16 × 10^5 ≈ 32 MB — trivial, and shareable across all workers on a host via mmap. A single prime near 10^6 costs 16 MB. The budget blows up only if you allow primes near 10^9 (16 GB) — hence the whitelist.
Latency. A query is O(log_p n) digit iterations after the (amortized) table build. For n ≤ 10^18 and p ≥ 1000, that is ≤ 6 iterations, each a handful of modular multiplies — sub-microsecond in compiled languages. The realistic p99 driver is not the math; it is request parsing, the result-cache lookup, and (for composite moduli) the CRT fan-out. Budget accordingly:
| Component | Typical cost | Notes |
|---|---|---|
| Parse + validate request | ~1 µs | dominates for tiny math |
| Result-cache lookup | ~1 µs | hot zero short-circuits live here |
| Single Lucas digit walk | < 1 µs | ≤ 6 digits for n ≤ 10^18, p ≥ 1000 |
CRT fan-out (square-free M) | max over primes + combine | parallelizable |
| Cold table build (first hit) | O(p) ≈ ms | warm in readiness probe |
Throughput. With tables warm and immutable, queries are CPU-bound and lock-free, so throughput scales linearly with cores. A single modern core handles millions of n ≤ 10^18 Lucas queries per second for a moderate prime; the service scales out horizontally behind the load balancer because workers are stateless.
Deployment, Warming, and Rollout¶
Because the only state is immutable tables, deployment is simple — but the first request after a deploy can stall while a big table builds. Senior practices:
- Warm in the readiness probe. Build (or mmap) every whitelisted prime's table before the worker reports ready; do not accept traffic until tables are present. This converts a per-request latency spike into a fixed startup cost.
- Pin tables in a sidecar or shared-memory segment. One builder process produces the tables; workers map them read-only. A rolling deploy of workers does not rebuild tables.
- Version the whitelist with the binary. Adding a new supported prime is a config change that triggers a table build; treat it like a schema migration so capacity is checked first.
- Canary the Granville path separately. The prime-power extension has subtle sign logic (see
professional.md§6.2); roll it out behind a flag and run the shadow validator at elevated sampling during the canary.
A Worked Routing Example¶
A client asks for C(n, k) mod 12600 with n = 10^15. The router proceeds:
- Factor the modulus:
12600 = 2^3 · 3^2 · 5^2 · 7. Cached factorization. - Classify each factor:
2^3 = 8→ prime power,e = 3→ Granville kernel (and note thep = 2, e ≥ 3sign rule!).3^2 = 9→ prime power → Granville kernel.5^2 = 25→ prime power → Granville kernel.7→ prime → plain Lucas kernel.- Fan out four independent queries (parallel), each returning a residue mod its prime power.
- CRT-combine the four residues (precomputed CRT coefficients for
12600) into the final answer mod12600.
The router's correctness invariant: every modulus reaching the plain Lucas kernel is a genuine prime, and every set of moduli reaching CRT is pairwise coprime. A misclassification here is the classic silent-wrong-answer bug, so the classification step is unit-tested against a bignum oracle for the whole supported modulus set.
Observability¶
| Metric | What it tells you | Alert threshold |
|---|---|---|
lucas_query_latency_p99 | tail latency of a single query | > 1 ms |
table_build_seconds{p} | one-time setup cost per prime | > 200 ms (table too large) |
zero_shortcircuit_ratio | fraction returning 0 early | track for cache sizing |
result_cache_hit_ratio | memoization effectiveness | < 0.5 → reconsider keys |
crt_combine_errors | non-coprime moduli reached CRT | any > 0 is a bug |
table_memory_bytes | total factorial-table footprint | > 80% of budget |
oracle_mismatch_total | shadow check vs Pascal/bignum disagreed | any > 0 is a bug |
A continuous shadow validator is the most important signal: periodically pick small (n, k, p), compute via Pascal's triangle or Python bignum, and assert equality with the production kernel. Lucas bugs are silent — a wrong digit loop returns a plausible number, not an exception.
Abuse, Input Validation, and Trust Boundaries¶
A counting service that accepts an arbitrary modulus M and arbitrary n, k from untrusted callers has a real attack surface — all of it stemming from the O(p) (or O(p^e)) table build:
- Large-prime table bomb. A request with modulus
= 999999937(a large prime) would trigger a ~16 GB table build. Defense: maintain an allow-list of supported primes/moduli; reject anything outside it, or route large primes to a table-free path (per-digit on-the-fly inverses,O(log p · log_p n)time,O(1)space) with a hardncap. - Prime-power blow-up. Modulus
= 2^30asks for a size-2^30Granville table. Defense: capp^e(e.g. rejectp^e > 10^7); the table-free per-digit variant handles largerpat higher per-query cost. - Pathological argument sizes.
nas a 10000-digit bignum is legitimate for Lucas (cost isO(log_p n)digits) but the parser and the result encoder must bound input length to avoid memory abuse. - Factorization cost. Factoring an adversarial 64-bit
Mis cheap, but a 256-bitMis not. Defense: bound the modulus bit-length, and cache factorizations.
The trust boundary is the request edge: validate modulus membership, argument bit-length, and 0 ≤ k ≤ n before any table work. The math kernels themselves should assume already-validated inputs.
Distributed CRT and Multi-Region Considerations¶
For very high query volumes, the per-prime Lucas evaluations of a composite-modulus query can be sharded across machines, with a coordinator performing the final CRT:
- Sharded fan-out: route
C(n,k) mod p_ito the worker that already holds primep_i's table (consistent hashing on the prime). This keeps each table resident on a fixed set of nodes rather than replicated everywhere — useful when supporting many distinct primes whose total table memory exceeds one host. - Coordinator-side CRT: the coordinator collects the
rresidues and combines them with precomputed CRT coefficients. The CRT step isO(r)and stateless. - Idempotency for retries: the computation is pure (no side effects), so a timed-out per-prime query can be retried on any replica holding that prime's table without correctness risk.
- Consistency: because tables are immutable and derived solely from
(prime, e), every region computes identical results — there is no replication-lag class of bug. The only cross-region concern is the whitelist/config version, which should be rolled out atomically.
Failure Modes¶
- Oversized table from a large prime: a user-supplied huge prime triggers an
O(p)multi-GB build → OOM / DoS. Mitigation: whitelist supported primes; reject or route large-prime requests to the factorial-table kernel with annbound. - Non-coprime CRT inputs: a prime power slipped into the plain-Lucas/CRT path → wrong answer. Mitigation: the router must classify each factor (prime vs prime power) and send powers to Granville; assert pairwise coprimality before CRT.
- Silent digit-loop bug: looping while
n > 0only (dropping highkdigits) returns wrong values with no error. Mitigation: loopwhile n > 0 or k > 0; cover with the shadow validator. - Overflow in
fact[i]*j: whenpis pushed large, intermediate products nearp^2overflow 64-bit. Mitigation: keeppin the Lucas regime; use 128-bit or reduce frequently. - Cold-start latency spike: first request after deploy builds a big table inline. Mitigation: warm tables in the readiness probe; do not serve until built.
- Cache stampede on a hot miss: many concurrent misses for the same prime each build the table. Mitigation: single-flight build (one builder, others wait) — the
TableCachelock above does this.
Caching Strategy in Depth¶
The result cache deserves senior attention because the distribution of Lucas answers is bimodal: a large fraction of real queries short-circuit to 0, and the rest return small integers in [0, M).
- Cache the zeros aggressively. A query short-circuits when some base-
pdigitk_i > n_i; such inputs are common in sparse counting problems. Caching(n, k, M) → 0(or even a Bloom filter of known-zero triples) serves them in nanoseconds. - Key normalization. Use the symmetry
C(n,k) = C(n,n-k)to canonicalize keys (k ← min(k, n-k)), roughly halving the key space. - TTL vs immutable. Results are mathematically immutable —
C(n,k) mod Mnever changes — so cache entries never expire for correctness reasons; eviction is purely an LRU memory-management decision. - Negative-result caching is safe. Unlike most systems, caching a "this is 0" answer carries no staleness risk.
| Cache layer | Hit characteristic | Eviction |
|---|---|---|
| Zero Bloom filter | short-circuit triples | never (rebuild offline) |
| LRU result cache | hot (n,k,M) | LRU by memory budget |
| Factorial table cache | per supported prime | never (immutable, whitelisted) |
Prime-Power (Granville) Path: What a Senior Must Know Operationally¶
When the modulus has a square factor (M = p^e · …, e ≥ 2), plain Lucas is mathematically invalid — 9 is not prime, so "C(n,k) mod 9 via Lucas" is a category error. The service must route such factors to the Granville/Andrew kernel. The full derivation lives in professional.md §6; the senior-level operational facts are:
- Table size is
O(p^e), notO(p). A factor2^20would demand a million-entry coprime-product table. The whitelist and thep^e ≤ 10^7cap from the Abuse section apply to the prime power, not just the prime. - There is a sign trap. The Wilson-style block sign is
+1for2^ewithe ≥ 3but−1for oddp^eand forp^e = 4(professional.md§6.2). This is the single most common silent bug; it is why the canary in the Rollout section samples thep = 2, e ≥ 3andp^e = 4cases at elevated rate. μ = v_p(C(n,k))is computed from carries (Kummer), independently of the table. Ifμ ≥ ethe answer is0 mod p^eand the whole table walk is skipped — another high-value short-circuit to cache.- Per-query cost is
O(e · log_p n)after theO(p^e)build — still tiny per query; the build dominates and must be warmed.
| Modulus shape | Kernel | Build cost | Per-query | Routing invariant |
|---|---|---|---|---|
small prime p | plain Lucas | O(p) | O(log_p n) | factor is genuinely prime |
square-free M = ∏ p_i | Lucas + CRT | Σ O(p_i) | Σ O(log_{p_i} n) | factors pairwise coprime, all prime |
prime power p^e | Granville/Andrew | O(p^e) | O(e·log_p n) | factor classified as power, not prime |
| general composite | per-power Granville + CRT | Σ O(p_i^{e_i}) | Σ O(e_i log n) | prime-power decomposition correct |
Precompute Sizing: A Decision Procedure¶
Given a fixed modulus M and a memory budget B, a senior sizes the precompute before the first deploy, not reactively:
- Factor
M = ∏ p_i^{e_i}once at registration; cache it. - Sum the table footprints: plain-prime factors cost
16 p_ibytes (fact+invFactasint64); prime-power factors cost≈ 8 p_i^{e_i}bytes (single coprime-product tableg). Reject registration ifΣ > B. - For free-choice moduli, pick the largest prime whose
16p ≤ B; this minimizes the digit countlog_p n(fewer, fatter digits) while staying in budget — the trade restated from the "Choosing the Prime" table. - Decide table-free fallback for any factor whose table would breach
B: switch that factor to the on-the-fly inverse path (O(log p · log_p n)time,O(1)space), accepting slower per-query for zero memory.
This converts an unbounded "build whatever the request asks" liability into a bounded, reviewed capacity decision — the same discipline you would apply to any precomputed index.
Cross-References¶
23-binomial-coefficients— the sibling on the unrestrictedC(n,k) mod pvia anO(n)factorial table. That is the right tool whennis small enough to tabulate andpis large (e.g.10^9+7); Lucas is the right tool whennis astronomical andpis small. The router in this file chooses between them by the(n, p)regime — see the comparison table in Comparison with Alternatives.06-crt— the Chinese Remainder Theorem machinery that the CRT combiner and the distributed coordinator depend on. The pairwise-coprimality invariant the router enforces is exactly CRT's precondition; review06-crtfor the coefficient precomputation (M_i = M/p_iand their inverses) that makes the combine stepO(r).
Summary¶
At the senior level, Lucas' theorem is an infrastructure kernel inside a counting service. The architecture centers on a modulus router that factors M and dispatches each prime to the right kernel (plain Lucas for a small prime, Lucas+CRT for a square-free composite, Granville for a prime power), backed by shared immutable factorial tables that are built once, memory-mapped, replicated, and queried lock-free. Prime choice trades digit count (log_p n) against table size (O(p)) under a memory budget; per-prime CRT queries fan out in parallel; zero short-circuits and result memoization serve the cheap cases instantly. The defining production risk is silent incorrectness — a wrong digit loop or a non-coprime CRT input returns a plausible-but-wrong integer — so a continuous shadow validator against a bignum/Pascal oracle, plus strict routing invariants and prime whitelisting, are the senior's primary safeguards.
Next step: continue to professional.md for the formal proof of Lucas via generating functions and Frobenius, Kummer's carry theorem, the prime-power (Granville/Andrew) extension, and the full complexity analysis.