Rabin-Karp (Rolling-Hash String Matching) — Middle Level¶
Focus: How to choose the base and modulus so collisions are rare, why double hashing crushes the false-positive rate, why verification keeps the algorithm correct, and how the same rolling-hash skeleton extends to multi-pattern set matching and 2D grid search — plus when KMP or the Z-algorithm is the better tool.
Table of Contents¶
- Introduction
- Deeper Concepts
- Choosing the Base and Modulus
- Double Hashing
- Verification and Collision Handling
- Multi-Pattern Set Matching
- 2D Rabin-Karp
- Comparison with KMP and Z
- Prefix-Hash Applications
- Advanced Patterns
- Code Examples
- Error Handling
- Performance Analysis
- Best Practices
- Visual Animation
- Summary
Introduction¶
At junior level the message was a single mechanism: a polynomial hash that rolls in O(1), compared against the pattern hash, with verification on hits. At middle level you start asking the engineering questions that decide whether the solution is correct and fast in practice:
- What base and modulus actually keep collisions near
1/p, and why does primality matter? - How does double hashing take the collision probability from
~1/pto~1/p²— effectively zero? - Why is verification not optional, and what is the precise relationship between "expected
O(n+m)" and "worst caseO(n·m)"? - How do you search for many patterns of the same length at once without re-scanning the text per pattern?
- How does the one-dimensional roll generalize to a two-dimensional roll for grid/image pattern search?
- When should you reach for KMP or Z instead, and what does Rabin-Karp give you that they do not?
These are the questions that separate "I can code the basic matcher" from "I can pick and tune the right hashing scheme for the problem in front of me."
A useful framing: at junior level the rolling hash is the algorithm; at middle level it is a reusable component you parameterize (base, modulus, single vs double), compose (sets for multi-pattern, two passes for 2D, prefix arrays for substring queries), and place correctly against alternatives (KMP, Z, Aho-Corasick, suffix structures). The mechanics never change — O(1) roll plus verify — but the surrounding decisions determine whether your solution is correct, fast, and the right tool at all. The rest of this document works through each of those decisions in turn, with runnable code in Go, Java, and Python.
Deeper Concepts¶
The polynomial hash, restated precisely¶
For a string s[0..m-1] over character codes s[i], define
So the leftmost character has the highest power. This convention makes the roll clean: when the window advances, the leftmost character (highest power b^{m-1}) is the one removed.
The roll, derived¶
Let H_i = H(T[i..i+m-1]). We want H_{i+1} = H(T[i+1..i+m]). Subtract the leading term, multiply the rest by b (shifting all powers up by one), and add the new last character:
The only precomputed constant is b^{m-1} mod p. Everything else is two multiplies, one subtract, one add, one reduce — O(1).
Prefix hashes — an alternative that supports random access¶
Instead of rolling, precompute prefix hashes pre[k] = H(T[0..k-1]). Then the hash of any substring T[l..r] is
This gives the hash of any substring in O(1) after O(n) preprocessing — essential for "count distinct substrings" and "longest duplicate substring" (see interview.md). The rolling form is the special case where you only ever need consecutive equal-length windows.
A practical note: building the power table pow[0..n] (with pow[k] = pow[k-1]·b mod p) alongside the prefix array is what makes the substring formula O(1) rather than O(log L) per query. Always precompute both pre[] and pow[] together in the same O(n) pass; computing b^{r-l+1} on demand inside a query loop is a frequent and silent performance regression.
Choosing the Base and Modulus¶
The collision probability of a single comparison is roughly 1/p for a random hash, so the modulus p directly controls correctness cost. Guidelines:
- Modulus
p: a large prime. Common choices are1_000_000_007or1_000_000_009(just under2^30). For stronger guarantees use a 61-bit Mersenne prime2^61 − 1, which keeps products inside 64 bits with a fast reduction. Primality matters because it makes the hash a polynomial over a fieldZ_p, which by the Schwartz-Zippel / polynomial-root argument bounds collisions atm/pfor a random base (two distinct length-mstrings collide only if a degree-<mpolynomial vanishes at the base). - Base
b: larger than the alphabet, ideally random. It must exceed the maximum character code (so each character is a distinct "digit"); 256 or 257 works for bytes, 31/131/137 are popular for lowercase text. For adversarial safety pickbrandomly in[256, p)at runtime (seesenior.mdon hash flooding). - Offset character codes by 1. Map
'a' → 1not0, so that"a","aa","aaa"get distinct hashes and leading characters cannot vanish.
| Choice | Collision rate | Notes |
|---|---|---|
p = 101 (toy) | ~1% | Teaching only; constant spurious hits. |
p ≈ 10^9 + 7 | ~10^{-9} per compare | Good default; fits in 64-bit products. |
p = 2^61 − 1 | ~4·10^{-19} per compare | Strong; needs 128-bit or careful mulmod. |
Two independent (b, p) | ~p^{-2} | Double hashing — effectively collision-free. |
Double Hashing¶
A single 30-bit modulus gives a per-comparison collision probability around 10^{-9}. Over a text with n ≈ 10^9 windows (or many comparisons across many queries), the birthday-style accumulation can make a collision likely. The fix is to compute two independent hashes — different (base, modulus) pairs — and treat two strings as equal only if both hashes agree.
If the two hashes are independent, the joint collision probability is ~ (1/p₁)·(1/p₂). With two ~10^9 primes that is ~10^{-18} — for all practical purposes zero. Represent the pair as a single 64-bit key (h1 << 32) | h2 so set lookups and comparisons stay O(1).
Double hashing does not remove the need to verify if you require guaranteed correctness — but in competitive programming it is often used in place of verification, because the residual probability is negligible. In production code that must never be wrong, keep the verification step; in throwaway contest code, double hashing alone is usually accepted.
Verification and Collision Handling¶
Verification is the bridge from "probably correct" to "definitely correct." The logic at each window:
- Compare
winHashtopatHash. If unequal → definitely not a match, skip. - If equal → it is a candidate. Compare characters directly.
- Match → report.
- Mismatch → it was a spurious hit (collision); discard and continue.
The cost analysis: let c be the number of spurious hits. Total time is O(n + (matches + c)·m). With a good hash, E[c] ≈ (n − m)/p, which for p ≈ 10^9 is effectively 0. So expected time is O(n + m + matches·m) — and if you only need the count or first match it is O(n + m). The worst case O(n·m) arises exactly when c = Θ(n), which an adversary can engineer against a fixed hash but not against a randomized one.
Multi-Pattern Set Matching¶
Suppose you have k patterns, all of the same length m, and want to find which (if any) occur at each text position. Naively running Rabin-Karp k times costs O(k·n). Better: hash all k patterns once into a hash set (or hash map from hash → pattern), then roll a single window across the text and, at each position, look up the window hash in the set in O(1) expected.
1. For each pattern, compute its hash → insert into set S (store the pattern too, for verification).
2. Roll one window across T.
3. At each position, if winHash ∈ S → verify against the matching pattern(s) → report.
This is expected O(n + k·m) — one text scan regardless of k. (Patterns of different lengths need one set per distinct length, or the Aho-Corasick automaton, which is the proper multi-length multi-pattern tool — see sibling 04-aho-corasick.) Store the full pattern alongside its hash so verification can resolve which pattern matched and reject collisions.
One subtlety: as k grows, the chance that some window collides with some pattern grows linearly in k (a union bound over the k targets). So the effective collision pressure is ~ n·k·(m−1)/p, and the modulus should be chosen with k in mind — another argument for a 61-bit modulus or double hashing once k is in the thousands. The full analysis is in professional.md.
2D Rabin-Karp¶
To find an r × c pattern grid inside an R × C text grid, apply the rolling hash twice:
- Hash every horizontal window of width
c. For each row, compute the rolling hash of each length-chorizontal segment. This collapses each row into a sequence of "column-strip hashes." Cost:O(R·C). - Hash vertically over those row-hashes. Treat each column of row-hashes as a 1D string and roll a height-
rwindow down it, using a second base. A 2D match occurs where the combined vertical-of-horizontal hash equals the pattern's combined hash.
The pattern grid is reduced to a single hash the same way (hash its rows, then hash that column of hashes). The total cost is expected O(R·C + r·c) — linear in the grid size — versus O(R·C·r·c) for naive 2D matching. Verification on a 2D hash hit compares the full r × c subgrid. The key insight: a 2D rolling hash is just a 1D rolling hash whose "characters" are themselves 1D rolling hashes.
Why two different bases for 2D¶
The horizontal pass uses base b_1 and the vertical pass uses base b_2. Using the same base for both would conflate a horizontal shift with a vertical shift — two different grids could then collide simply because their row-hash sequences are rotations of each other. Independent bases make the combined hash a genuine bivariate polynomial evaluation G(b_2, b_1), whose collision bound is ~((r−1)+(c−1))/p (see professional.md). In code, this means precomputing two high powers: b_1^{c−1} for the horizontal roll and b_2^{r−1} for the vertical roll.
2D worked sketch¶
For a 2×2 pattern in a 3×3 grid: first build rowHash[i][j] for each of the 3 rows and the 2 horizontal positions, then for each of the 2 column positions roll a height-2 window down the 3 row-hashes (2 vertical positions). That is 3·2 row hashes + 2·2 combined hashes = a handful of O(1) rolls, all linear in the grid. Each combined-hash hit triggers a full 2×2 subgrid verification.
Comparison with KMP and Z¶
| Property | Rabin-Karp | KMP (01-kmp) | Z-algorithm (02-z-algorithm) |
|---|---|---|---|
| Time (typical) | expected O(n + m) | worst-case O(n + m) | worst-case O(n + m) |
| Time (adversarial) | O(n·m) unless randomized | O(n + m) always | O(n + m) always |
| Preprocessing | hash pattern, O(m) | failure function, O(m) | Z-array of P#T, O(m) |
| Extra space | O(1) | O(m) | O(n + m) |
| Multi-pattern (equal length) | natural — hash a set, one scan | needs Aho-Corasick | awkward |
| 2D / grid search | natural — roll twice | hard | hard |
| Substring-pair comparison | O(1) via prefix hashes | not supported | not supported |
| Correctness risk | must verify (lossy hash) | exact, no verification | exact, no verification |
Use Rabin-Karp when you need multi-pattern equal-length search, 2D matching, or O(1) substring comparisons, or when its simplicity wins. Use KMP/Z when you need a hard worst-case linear guarantee on adversarial input with a single pattern.
A closer look at the trade-off¶
KMP and Z preprocess the pattern into a structure (failure function / Z-array) that lets them never re-examine a text character — giving a deterministic O(n+m) with no verification and no probability of error. Rabin-Karp preprocesses the pattern into a single number and accepts a tiny error probability (eliminated by verification) in exchange for the ability to compare any string to that number in O(1). That difference — a number versus an automaton — is exactly why Rabin-Karp generalizes to sets of patterns (a set of numbers), to 2D (a number-of-numbers), and to arbitrary substring queries (prefix-hash arithmetic), while KMP/Z, built around a single linear scan, do not. Neither is "better"; they occupy different points on the generality-versus-guarantee axis.
What about suffix structures?¶
Suffix automata and suffix arrays (05-suffix-structures) subsume many hashing applications (distinct substrings, longest duplicate) exactly and often in O(n). The hashing route trades a log factor and a controllable error probability for dramatically simpler code and lower constant factors. For one-off problems and contests, hashing usually wins on implementation time; for a library that must be exact and is built once, the suffix structure may be worth the complexity.
Prefix-Hash Applications¶
The prefix-hash form (Section "Deeper Concepts") supports O(1) hashing of any substring, which unlocks several problems that pure rolling cannot:
Longest common extension (LCE)¶
Given two indices i and j, how far do the suffixes starting there agree? Binary search the length L: the answer is at least L iff hashSub(i, i+L−1) == hashSub(j, j+L−1). Each probe is O(1), so an LCE query is O(log n) after O(n) preprocessing. LCE is a building block for suffix comparison, palindrome detection, and string-periodicity tests.
Longest duplicate substring¶
Binary search the answer length L. For each candidate L, hash every length-L window (rolling, or via prefix hashes) and insert into a set; if any hash repeats, a duplicate of length L exists, so try longer, else shorter. Total O(n log n) expected. Verification (or double hashing) guards against a false "duplicate" from a collision. This is the canonical "binary search + hashing" pattern (see interview.md Challenge 3).
Counting distinct substrings of a fixed length¶
Roll a length-L window across the string, insert each hash into a set, and return the set size. With a 61-bit modulus or double hashing the collision probability is negligible, giving an O(n) expected count per length. Summing over all L gives total distinct substrings in O(n²) expected (the suffix automaton does this exactly in O(n) — the hashing route trades a factor for simplicity).
Palindrome checks¶
Build a forward prefix hash and a reverse prefix hash. A substring T[l..r] is a palindrome iff its forward hash equals the reverse hash of the same range. Each check is O(1), enabling fast palindrome-counting and longest-palindromic-substring binary searches.
Advanced Patterns¶
Pattern: Anchored multi-length search via per-length sets¶
When patterns have a few distinct lengths L_1, …, L_t, maintain one rolling hash per distinct length and one set of pattern hashes per length. A single forward pass updates all t rolling hashes (each O(1)), so total cost is O(t·n + Σ|P|). This stays attractive while t is small; once lengths are many and varied, Aho-Corasick (04-aho-corasick) is the right tool.
Pattern: Sliding-window deduplication¶
To detect near-duplicate windows in a stream (e.g. repeated log lines, shingled document fingerprints), roll a hash over each length-w window and record seen hashes in a bounded LRU set. A repeated hash flags a candidate duplicate to verify. This is the kernel of shingling-based similarity detection.
Pattern: Hash-then-verify as a general filter¶
Generalize the Rabin-Karp discipline: whenever an expensive equality test can be preceded by a cheap hash compare, use the hash as a filter and the full test as verification. Database join probes, content-addressable stores, and memoization keys all follow this shape. The rolling property is what makes the filter cheap when the keys are sliding windows.
Pattern: Combining with binary search on the answer¶
Many "find the longest/shortest X with property P" string problems become O(n log n) by binary searching the length and using a rolling/prefix hash to test property P in O(n) per length. Longest duplicate substring, longest common substring of two strings, and longest repeated-with-gap substring all fit this template.
Code Examples¶
Example: Double-Hash Rabin-Karp with Verification¶
Go¶
package main
import "fmt"
const (
b1, p1 = 131, 1_000_000_007
b2, p2 = 137, 998_244_353
)
func hashPair(s string) (int64, int64) {
var h1, h2 int64
for i := 0; i < len(s); i++ {
h1 = (h1*b1 + int64(s[i]+1)) % p1
h2 = (h2*b2 + int64(s[i]+1)) % p2
}
return h1, h2
}
func search(txt, pat string) []int {
n, m := len(txt), len(pat)
var res []int
if m == 0 || m > n {
return res
}
ph1, ph2 := hashPair(pat)
var hp1, hp2 int64 = 1, 1
for i := 0; i < m-1; i++ {
hp1 = hp1 * b1 % p1
hp2 = hp2 * b2 % p2
}
w1, w2 := hashPair(txt[:m])
for i := 0; i+m <= n; i++ {
if w1 == ph1 && w2 == ph2 && txt[i:i+m] == pat {
res = append(res, i)
}
if i+m < n {
w1 = ((w1-int64(txt[i]+1)*hp1%p1+p1)%p1*b1 + int64(txt[i+m]+1)) % p1
w2 = ((w2-int64(txt[i]+1)*hp2%p2+p2)%p2*b2 + int64(txt[i+m]+1)) % p2
}
}
return res
}
func main() {
fmt.Println(search("abracadabra", "abra")) // [0 7]
}
Java¶
import java.util.*;
public class DoubleHashRK {
static final long B1 = 131, P1 = 1_000_000_007L;
static final long B2 = 137, P2 = 998_244_353L;
static long[] hashPair(String s) {
long h1 = 0, h2 = 0;
for (int i = 0; i < s.length(); i++) {
int c = s.charAt(i) + 1;
h1 = (h1 * B1 + c) % P1;
h2 = (h2 * B2 + c) % P2;
}
return new long[]{h1, h2};
}
static List<Integer> search(String txt, String pat) {
int n = txt.length(), m = pat.length();
List<Integer> res = new ArrayList<>();
if (m == 0 || m > n) return res;
long[] ph = hashPair(pat);
long hp1 = 1, hp2 = 1;
for (int i = 0; i < m - 1; i++) { hp1 = hp1 * B1 % P1; hp2 = hp2 * B2 % P2; }
long[] w = hashPair(txt.substring(0, m));
for (int i = 0; i + m <= n; i++) {
if (w[0] == ph[0] && w[1] == ph[1] && txt.regionMatches(i, pat, 0, m)) res.add(i);
if (i + m < n) {
int out = txt.charAt(i) + 1, in = txt.charAt(i + m) + 1;
w[0] = ((w[0] - out * hp1 % P1 + P1) % P1 * B1 + in) % P1;
w[1] = ((w[1] - out * hp2 % P2 + P2) % P2 * B2 + in) % P2;
}
}
return res;
}
public static void main(String[] args) {
System.out.println(search("abracadabra", "abra")); // [0, 7]
}
}
Python¶
B1, P1 = 131, 1_000_000_007
B2, P2 = 137, 998_244_353
def hash_pair(s):
h1 = h2 = 0
for ch in s:
c = ord(ch) + 1
h1 = (h1 * B1 + c) % P1
h2 = (h2 * B2 + c) % P2
return h1, h2
def search(txt, pat):
n, m = len(txt), len(pat)
res = []
if m == 0 or m > n:
return res
ph1, ph2 = hash_pair(pat)
hp1, hp2 = pow(B1, m - 1, P1), pow(B2, m - 1, P2)
w1, w2 = hash_pair(txt[:m])
for i in range(n - m + 1):
if w1 == ph1 and w2 == ph2 and txt[i:i + m] == pat:
res.append(i)
if i + m < n:
out, inn = ord(txt[i]) + 1, ord(txt[i + m]) + 1
w1 = ((w1 - out * hp1) * B1 + inn) % P1
w2 = ((w2 - out * hp2) * B2 + inn) % P2
return res
if __name__ == "__main__":
print(search("abracadabra", "abra")) # [0, 7]
What it does: maintains two independent rolling hashes; a position is a candidate only if both match, and even then characters are verified. Run: go run main.go | javac DoubleHashRK.java && java DoubleHashRK | python dh.py
Example: Multi-Pattern Set Matching (Python)¶
def multi_search(txt, patterns):
"""All patterns share length m. Returns {pattern: [start indices]}."""
m = len(patterns[0])
table = {}
for p in patterns:
table.setdefault(hash_pair(p), []).append(p)
n = len(txt)
result = {p: [] for p in patterns}
if m > n:
return result
hp1, hp2 = pow(B1, m - 1, P1), pow(B2, m - 1, P2)
w1, w2 = hash_pair(txt[:m])
for i in range(n - m + 1):
cands = table.get((w1, w2))
if cands:
for p in cands: # verify each candidate
if txt[i:i + m] == p:
result[p].append(i)
if i + m < n:
out, inn = ord(txt[i]) + 1, ord(txt[i + m]) + 1
w1 = ((w1 - out * hp1) * B1 + inn) % P1
w2 = ((w2 - out * hp2) * B2 + inn) % P2
return result
Worked Comparison: Where the Time Goes¶
Consider searching a 1 MB text (n = 10^6) for a 100-character pattern (m = 100), with p = 10^9 + 7.
- Naive: up to
(n−m+1)·m ≈ 10^8character comparisons. On real (non-adversarial) English text it is far faster because mismatches happen early, but the worst case is10^8. - Rabin-Karp: one
O(m)pattern hash (100 ops), oneO(m)first-window hash (100 ops), then~10^6rolls at a constant ~5 arithmetic ops each (~5·10^6ops), plus verification only on hits. Expected spurious hits≈ (n−m)·(m−1)/p ≈ 10^6·100/10^9 ≈ 0.1— essentially none. So total ≈5·10^6operations, dominated entirely by the roll. - KMP:
O(m)failure function plusO(n)scan — also~10^6steps, with a slightly larger per-step constant but no verification and a hard worst-case guarantee.
The lesson: Rabin-Karp's cost is the roll, not the verification, as long as the modulus is large. Shrinking p to 101 would push expected spurious hits to ~10^4, each costing O(m) — ~10^6 extra ops — visibly slowing the search. This is the quantitative argument for a large prime.
Step-by-Step: A Collision and Its Rejection¶
To see verification earn its keep, force a collision with a deliberately tiny modulus p = 7, base b = 10. Search P = "12" in T = "75" (treating digits as their numeric codes for clarity).
The hashes are equal (5 == 5) — a hash hit — yet "12" ≠ "75". With a tiny modulus, distinct strings collide easily. Verification compares "75" to "12" character by character, finds them different, and rejects the candidate. No false match is reported. Replace p = 7 with 10^9 + 7 and this particular collision disappears, but the structural guarantee — verification rejects every collision — holds for any modulus. That is why correctness never depends on the modulus, only speed does.
Error Handling¶
| Error | Cause | Fix |
|---|---|---|
| Two hashes correlated | Same base or same modulus for both. | Use distinct (base, modulus) pairs; ideally distinct primes. |
Collisions despite large p | Birthday effect across many comparisons. | Switch to double hashing or keep verification. |
| Multi-pattern misses some | Patterns of differing length in one set. | One set per length, or use Aho-Corasick. |
| 2D search wrong | Reused the same base for rows and columns ambiguously. | Use independent bases for the horizontal and vertical passes. |
| Negative residue | Subtraction underflow before reduce. | (h - x % p + p) % p. |
| Slow multi-pattern | Re-scanning text per pattern. | Hash all patterns into one set, single text scan. |
Performance Analysis¶
- Single pattern: expected
O(n + m), worstO(n·m). The expectation holds becauseE[spurious hits] ≈ (n−m)/p ≈ 0for largep. - Double hashing: doubles the constant factor of each roll (two hashes instead of one) but reduces collision probability to
~p^{-2}; verification then almost never fires. - Multi-pattern (equal length
m,kpatterns):O(n + k·m)expected — one scan. The set lookup isO(1)expected. - 2D (
R×Ctext,r×cpattern):O(R·C + r·c)expected. The first pass isO(R·C), the second isO(R·C)over the row-hash grid. - Cache behavior: the roll touches only two characters per step, so it is extremely cache-friendly; the verification, when it fires, is a contiguous scan.
- Constant factors: each roll is ~2 multiplies, 1 subtract, 1 add, and 1–2 reductions. Double hashing doubles this; a 61-bit Mersenne hash adds a few shift/mask ops for the fold. Against KMP (1 array read + 1 compare + occasional failure-jump per character), Rabin-Karp's per-step constant is slightly higher, but its scan pattern is simpler and branch-light.
Comparing the variants quantitatively¶
| Variant | Per-roll ops | Collision rate | Verify cost |
|---|---|---|---|
| Single 30-bit hash | ~5 | ~10^{-9} | O(m) per hit |
| Double 30-bit hash | ~10 | ~10^{-18} | rarely fires |
| Single 61-bit hash | ~8 (with fold) | ~4·10^{-19} | rarely fires |
| Naive (no hash) | — | — | every window is a full compare |
The takeaway: spending 2× per-roll cost on double hashing buys ~10^9× fewer collisions — almost always worth it when many strings are compared, and the verification cost then becomes negligible.
Choosing the Right Variant: A Decision Guide¶
| Situation | Variant | Reasoning |
|---|---|---|
| One pattern, trusted input, just need correctness | single 30-bit hash + verify | simplest; verification guarantees exactness |
| One pattern, many comparisons or untrusted input | random base + verify, or 61-bit hash | avoids flooding and birthday collisions |
| Contest, speed over rigor | double 30-bit hash, skip verify | residual error ~p^{-2} accepted |
k patterns, equal length | hash set of patterns, single scan | O(n + k·m) |
| Patterns of a few distinct lengths | one rolling hash + set per length | linear in lengths × n |
| Many varied-length patterns | switch to Aho-Corasick | collision-free, length-independent |
| 2D / image / grid | 2D Rabin-Karp (roll twice) | O(R·C) |
| Many substring-equality queries | prefix hashes | O(1) per query after O(n) |
| Binary-search-on-length problems | rolling/prefix hash per length | O(n log n) family |
The single most important branch is verify vs Monte-Carlo: keep verification whenever a wrong answer is unacceptable; drop it only with a strong (double or 61-bit) hash and an explicit, tiny error budget.
Frequently Asked Middle-Level Questions¶
Why does the leftmost character get the highest power? So that the character removed during a roll is the one carrying b^{m-1} — making the rolling update a clean "subtract the top term, shift, add the bottom term." The opposite convention also works but makes the update formula uglier.
Can I skip recomputing b^{m-1} per pattern? Yes if all patterns share length m: compute it once. If lengths vary, you need one high power per length (cache them).
Is double hashing the same as a 64-bit hash? Conceptually similar in collision rate, but two independent 30-bit hashes are usually faster than one true 64-bit mulmod in languages without __int128, and they keep every product inside 64 bits without special handling. A single 2^{61}−1 hash is the cleaner choice when 128-bit multiply is cheap.
Why not just use the language's built-in substring search? For a single short pattern it is usually faster (tuned two-way algorithm). Reach for Rabin-Karp when the shape of the problem is multi-pattern, 2D, or substring-equality queries — that is where the rolling hash's flexibility pays off.
Does the alphabet size matter? The base must exceed the maximum character code so each symbol is a distinct digit; beyond that, a larger alphabet does not change the asymptotics, only the constant in the collision bound ((m−1)/p is alphabet-independent given a random base).
Can I roll backward? Yes — the same algebra runs in reverse: to move the window from T[i..i+m-1] to T[i-1..i+m-2], subtract the trailing character, divide by the base (multiply by b^{-1} mod p), and add the new leading character times b^{m-1}. The backward roll needs a modular inverse, so the forward roll is preferred; bidirectional scans usually just keep two independent forward hashes.
Why offset character codes by 1? Without the offset, a character mapped to code 0 contributes nothing, so "\0abc" and "abc" (different lengths) or "a" and "aa" can hash alike in edge cases. Mapping every symbol to {1, …, |Σ|} guarantees distinct-length all-same-char strings get distinct hashes and leading symbols never vanish.
How do I pick between one 61-bit hash and two 30-bit hashes? If your language has a cheap 128-bit multiply (C/C++/Rust __int128, or Go bits.Mul64), the single 61-bit hash is simpler and faster. If not (or in contest Java), two 30-bit hashes keep every product inside a native 64-bit word with no special handling, at the cost of a second roll per step.
Best Practices¶
- Default to a large prime modulus near
10^9; upgrade to double hashing or2^61 − 1when correctness pressure is high. - Offset character codes by 1 so distinct-length all-same-char strings never collide trivially.
- Keep verification in production code; rely on double-hash-only correctness solely in throwaway contexts.
- For multi-pattern, store the pattern next to its hash so verification can resolve which one matched.
- For 2D, decompose explicitly into a horizontal pass then a vertical pass — do not try to write one fused quadruple loop.
- Benchmark against KMP/Z; if the input is single-pattern and adversarial, prefer those.
- Precompute both the prefix-hash array and the power array in one
O(n)pass when you need substring queries; never exponentiate per query. - For multi-pattern and binary-search-on-length problems, scale the modulus with the number of comparisons (favor 61-bit or double hashing), not merely with the string length.
- Decide verify-vs-Monte-Carlo explicitly and document it; the choice is the single biggest correctness lever in the whole technique.
Visual Animation¶
See
animation.htmlfor an interactive visualization.The animation demonstrates the window sliding across the text, the rolling hash updating as the leading character leaves and the trailing character enters, the hash-vs-pattern comparison, and the verification step on a hash hit. Edit the text and pattern to watch collisions and confirmed matches play out.
Summary¶
At middle level, mastery of Rabin-Karp means treating the rolling hash as a tunable, composable component rather than a fixed recipe — choosing parameters from collision math, composing it into sets and grids, and placing it correctly against KMP, Z, Aho-Corasick, and suffix structures.
The correctness and speed of Rabin-Karp live in three middle-level decisions. First, the base and modulus: a large prime keeps the per-comparison collision rate near 1/p, and a random base defeats adversaries. Second, double hashing: two independent hashes drive collisions to ~p^{-2}, effectively eliminating spurious hits — though verification stays in code that must never be wrong. Third, the rolling-hash skeleton generalizes: hash a set of equal-length patterns for one-scan multi-pattern search, and roll twice (horizontal then vertical) for 2D grid matching. Against KMP and Z, Rabin-Karp trades a worst-case guarantee for unmatched flexibility — multi-pattern, 2D, and O(1) substring comparison via prefix hashes. Choose it when that flexibility is what the problem needs.
The next file, senior.md, takes these decisions into production: defending against hash flooding with a randomized base, keeping 64-bit arithmetic exact, and reusing the very same rolling hash as the engine of fingerprinting and content-defined chunking.
The file after that, professional.md, proves the collision bound and expected-time analysis that justify every parameter choice made here, including why a prime field is required and why verification — not the modulus — is what guarantees correctness.