Boyer-Moore String Matching — Junior Level¶
One-line summary: Boyer-Moore searches for a pattern
Pinside a textTby aligning the pattern under the text and comparing right to left. On a mismatch it uses precomputed rules — the bad-character rule and the good-suffix rule — to slide the pattern forward by many characters at once, often skipping past whole chunks of the text. This makes it sublinear on average: it can find a match while looking at fewer thanncharacters.
Table of Contents¶
- Introduction
- Prerequisites
- Glossary
- Core Concepts
- Big-O Summary
- Real-World Analogies
- Pros & Cons
- Step-by-Step Walkthrough
- Code Examples
- Coding Patterns
- Error Handling
- Performance Tips
- Best Practices
- Edge Cases & Pitfalls
- Common Mistakes
- Cheat Sheet
- Visual Animation
- Summary
- Further Reading
Introduction¶
You want to find the word "EXAMPLE" inside a long document. The obvious approach — the naive algorithm — lines the pattern up at the very start of the text, compares character by character from the left, and if anything fails to match, slides the pattern forward by exactly one position and tries again. That works, but it is wasteful: a mismatch deep into the pattern throws away everything it just learned and re-checks almost the same characters from a position only one step to the right.
In 1977, Robert S. Boyer and J Strother Moore published an algorithm that flipped two of the naive algorithm's assumptions and produced one of the fastest practical string searchers ever invented:
- Compare the pattern right to left, not left to right. Align
PunderT, then start comparing at the last character ofPand walk backward toward its first character. - On a mismatch, slide the pattern forward by more than one position, using information about which text character caused the mismatch and what suffix of the pattern had already matched.
The magic of the right-to-left scan is that a single mismatch at the end of the pattern can tell you the alignment is hopeless for many shifts to come — so you can jump the pattern forward by a big stride instead of crawling one step at a time. The longer the pattern and the bigger the alphabet, the bigger those jumps tend to be. This is why Boyer-Moore (and its lighter cousin, Horspool, covered in sibling 09-horspool) is the engine behind real-world tools: classic grep, memmem, and many strstr/String.indexOf implementations are built on these ideas.
Boyer-Moore uses two heuristics and, on each mismatch, takes the larger of the two shifts they suggest (a bigger safe jump is always better):
- The bad-character rule looks at the text character that caused the mismatch and shifts the pattern so that character lines up with its last occurrence in the pattern (or past it entirely, if it does not occur).
- The good-suffix rule looks at the portion of the pattern that did match (a suffix) and shifts so that this matched suffix re-aligns with another occurrence of it in the pattern.
At junior level we focus on the right-to-left scan and the bad-character rule — they are enough to understand why the algorithm is fast and to write a correct (if slightly simplified) searcher. The good-suffix rule, the combined-shift correctness, and the linear-time worst case are developed in middle.md and proved in professional.md.
Contrast this with KMP (Knuth-Morris-Pratt, sibling 01-kmp): KMP scans left to right, never re-examines a text character, and guarantees O(n + m) worst case. Boyer-Moore scans right to left, skips text characters, and is usually faster in practice but needs Galil's rule to also guarantee linear worst case. KMP is the "never look back" algorithm; Boyer-Moore is the "look at as little as possible" algorithm.
Prerequisites¶
Before reading this file you should be comfortable with:
- Strings as arrays of characters — indexing
T[i], lengthn = len(T),m = len(P). - The naive (brute-force) substring search — slide a window, compare, advance by one. Boyer-Moore is a smarter version of this loop.
- Zero-based indexing — the last character of
PisP[m-1]. - Big-O notation —
O(n),O(nm),O(m + σ)whereσis the alphabet size. - Arrays / hash maps — the bad-character rule stores a small lookup table keyed by character.
- ASCII / character codes — mapping a character to an integer index (e.g.
'A'→65).
Optional but helpful:
- A glance at KMP (sibling
01-kmp) to appreciate the left-to-right vs right-to-left contrast. - The notion of a prefix (start) and a suffix (end) of a string.
Glossary¶
| Term | Meaning |
|---|---|
Text T | The (usually long) string we search inside. Length n. |
Pattern P | The (usually short) string we search for. Length m. |
| Alignment / shift | A position s where the pattern's first character sits under T[s], so P[j] is compared with T[s + j]. |
| Right-to-left scan | Comparing P[m-1] with T[s+m-1] first, then moving leftward toward P[0]. |
| Mismatch | A position j where P[j] != T[s+j] during the scan. |
| Bad character | The text character T[s+j] that caused the mismatch. |
| Good suffix | The already-matched suffix P[j+1 .. m-1] to the right of a mismatch. |
| Bad-character rule | Heuristic that shifts P to align the bad character with its last occurrence in P. |
| Good-suffix rule | Heuristic that shifts P so the matched suffix re-aligns with another copy (or a prefix) of itself. |
| Last-occurrence table | A table last[c] = index of the rightmost occurrence of character c in P, or -1 if absent. |
| Sublinear | Examining fewer than n characters on average — Boyer-Moore's claim to fame. |
| σ (sigma) | The alphabet size (e.g. 256 for bytes, 26 for lowercase letters). |
| Galil's rule | A bookkeeping trick that makes the worst case O(n) (covered in middle.md/professional.md). |
Core Concepts¶
1. The Right-to-Left Scan¶
Place the pattern under the text at some shift s (initially s = 0). The naive algorithm compares from the left. Boyer-Moore compares from the right:
Concretely, we set j = m - 1 and compare P[j] with T[s + j]. If they match, decrement j and compare again. If we reach j = -1, the whole pattern matched at s — report an occurrence. If P[j] != T[s+j] for some j, we have a mismatch and must compute how far to slide.
Why right-to-left? Because the last character of the pattern is compared against a new region of the text first. If that single character is one that does not even appear in the pattern, we instantly know the pattern cannot match anywhere that overlaps it — so we can leap the whole pattern past it.
2. The Bad-Character Rule (the entry point)¶
When P[j] mismatches the text character c = T[s + j], the bad-character rule asks: "Where is the rightmost occurrence of c in the pattern?"
- If
coccurs inPat some indexk < j, slide the pattern so thatP[k]lines up under the mismatched text position. The shift isj - k. - If
coccurs inPonly at indices> j(to the right of the mismatch), that occurrence is useless here; treat it as not helping and fall back to a shift of1(or let the good-suffix rule take over). - If
cdoes not occur inPat all, slide the pattern completely pastc: the shift isj + 1.
We precompute a last-occurrence table last[c] once, in O(m + σ), so each lookup is O(1).
T: . . . a b X d . . . mismatch: P[j] != 'X'
P: ? ? ? ? ? 'X' is not in P
→ shift P right past 'X' (big jump)
3. The Good-Suffix Rule (a preview)¶
Sometimes the suffix you already matched is itself informative. Suppose the last three characters matched but the fourth-from-last mismatched. That matched suffix is a known substring of the text. The good-suffix rule slides the pattern so that another copy of that suffix inside the pattern lines up — or, if none exists, so that the longest pattern prefix that equals a suffix of the matched part lines up. This rule is fully developed in middle.md; for now, just know it is the second shift candidate.
4. Take the Maximum Shift¶
On every mismatch you compute both the bad-character shift and the good-suffix shift, and you slide by the larger of the two. A larger shift is always safe as long as each rule is individually safe (it never skips a real match — proved in professional.md). The bigger jump simply finishes the search faster.
5. Why It Is Sublinear¶
Because a mismatch at the right end of the pattern can trigger a shift of up to m, the algorithm often advances by close to a full pattern length per probe. On English text with a moderate alphabet, it commonly examines only about n / m characters in the best regions — fewer than n. That is the headline result: Boyer-Moore can run faster than reading every character of the text.
Big-O Summary¶
| Operation | Time | Space | Notes |
|---|---|---|---|
| Build last-occurrence (bad-char) table | O(m + σ) | O(σ) | One pass over P, one init of the table. |
| Build good-suffix table | O(m) | O(m) | Two linear passes (see middle.md). |
| Single alignment scan | O(m) | O(1) | Right-to-left comparison loop. |
| Search, average case | O(n / m) to O(n) | — | Sublinear on typical text/large alphabet. |
| Search, best case | O(n / m) | — | Every probe mismatches on the last char and jumps by m. |
| Search, worst case (no Galil) | O(nm) | — | Pathological repetitive inputs. |
| Search, worst case (with Galil) | O(n) | — | Galil's rule caps re-comparisons. |
The headline numbers are O(m + σ) preprocessing and an average-case sublinear search that, in the best case, costs only O(n / m).
Real-World Analogies¶
| Concept | Analogy |
|---|---|
| Right-to-left scan | Checking a long shelf for a specific book by glancing at the last letter of each candidate spine first — a quick way to reject most books instantly. |
| Bad-character rule | You are looking for the word ending in "...ING", you glance at a spot and see a Q where you expected G. Since Q is nowhere in your target word, you skip the whole word and jump ahead. |
| Good-suffix rule | You matched "ING" at the end but the letter before it was wrong; you slide forward to the next place in your word where "ING" (or part of it) could line up again. |
| Taking the max shift | Two friends each suggest how far to fast-forward a video; you trust whoever says "skip further" because both are guaranteeing you won't miss the scene. |
| Sublinear search | Skimming a book by reading only every fifth word and still reliably finding the chapter you want — you touch far fewer words than the book contains. |
| Last-occurrence table | A cheat sheet that, for each letter, tells you the rightmost spot that letter appears in the word you're hunting. |
Where the analogy breaks: real shelves do not let you "slide" your search target, and the safety guarantee (never skipping a real occurrence) is a precise mathematical property, not just intuition — see professional.md.
Pros & Cons¶
| Pros | Cons |
|---|---|
Sublinear on average — can examine far fewer than n characters. | Worst case is O(nm) without Galil's rule. |
| Gets faster as the pattern grows (bigger possible jumps). | Preprocessing and tables add fixed overhead — overkill for tiny patterns. |
| Excellent on large alphabets (bad-character jumps are large). | Less effective on small/binary alphabets (few distinct chars → small jumps). |
The bad-character rule alone (Horspool, sibling 09) is tiny and very practical. | The full good-suffix table is fiddly and easy to get subtly wrong. |
Underpins production grep, memmem, many strstr. | For very short patterns, naive or memchr-style scanning can win. |
| No re-reading of the text in the common case. | Cache behaviour can suffer from large jumps on huge texts. |
When to use: searching for a moderately long pattern in a long text, especially over a large alphabet (bytes, Unicode, natural language); when average speed matters more than worst-case guarantees.
When NOT to use: very short patterns (the jumps are tiny), tiny/binary alphabets (use KMP or specialized methods), or hard real-time systems that need a guaranteed O(n) bound without implementing Galil's rule.
Step-by-Step Walkthrough¶
Search for P = "EXAMPLE" (length m = 7) in T = "HERE IS A SIMPLE EXAMPLE". We will use only the bad-character rule for clarity.
Step 0 — build the last-occurrence table for P. For each character, record its rightmost index in P = E X A M P L E (indices 0..6):
Step 1 — align at s = 0. Compare right to left starting at P[6]='E' vs T[6].
'S' is not in P, so last['S'] = -1. Bad-character shift = j - last[c] = 6 - (-1) = 7. Slide the pattern forward by 7 — we skip the entire first window.
Step 2 — align at s = 7. Now P[6]='E' lines up with T[13].
Suppose T[s+6] = 'E' matches; move left. P[5]='L' vs T[s+5], and so on. If a mismatch occurs at some j with bad character c, compute j - last[c] and jump again. Whenever c is absent from P, the jump is the full j + 1.
Step 3 — eventually align at the real occurrence. When the window sits exactly over the "EXAMPLE" at the end of T, the right-to-left scan matches P[6], P[5], …, P[0] all the way down to j = -1. That is a reported match. Then we shift by 1 (or by the good-suffix prefix shift) to continue searching for further occurrences.
The key observation: at s = 0 we examined a single character (T[6]='S') and immediately jumped 7 positions. The naive algorithm would have advanced by 1 and re-examined nearly the same window. That is the sublinear behaviour in action.
Code Examples¶
Example: Boyer-Moore with the Bad-Character Rule¶
This builds the last-occurrence table and searches with the right-to-left scan, falling back to a shift of 1 when the bad-character rule would not advance.
Go¶
package main
import "fmt"
const SIGMA = 256
// buildLast returns last[c] = rightmost index of byte c in p, or -1.
func buildLast(p string) [SIGMA]int {
var last [SIGMA]int
for i := 0; i < SIGMA; i++ {
last[i] = -1
}
for i := 0; i < len(p); i++ {
last[p[i]] = i
}
return last
}
// search returns all start indices where p occurs in t.
func search(t, p string) []int {
res := []int{}
n, m := len(t), len(p)
if m == 0 || m > n {
return res
}
last := buildLast(p)
s := 0 // current alignment
for s <= n-m {
j := m - 1
for j >= 0 && p[j] == t[s+j] {
j-- // scan right to left
}
if j < 0 {
res = append(res, s) // full match at s
s++ // shift by 1 to find overlaps
} else {
c := t[s+j]
shift := j - last[c] // bad-character shift
if shift < 1 {
shift = 1 // never go backward or stall
}
s += shift
}
}
return res
}
func main() {
t := "HERE IS A SIMPLE EXAMPLE"
p := "EXAMPLE"
fmt.Println(search(t, p)) // [17]
fmt.Println(search("aaaaa", "aa")) // [0 1 2 3]
}
Java¶
import java.util.*;
public class BoyerMooreBadChar {
static final int SIGMA = 256;
static int[] buildLast(String p) {
int[] last = new int[SIGMA];
Arrays.fill(last, -1);
for (int i = 0; i < p.length(); i++) {
last[p.charAt(i)] = i;
}
return last;
}
static List<Integer> search(String t, String p) {
List<Integer> res = new ArrayList<>();
int n = t.length(), m = p.length();
if (m == 0 || m > n) return res;
int[] last = buildLast(p);
int s = 0;
while (s <= n - m) {
int j = m - 1;
while (j >= 0 && p.charAt(j) == t.charAt(s + j)) {
j--; // right to left
}
if (j < 0) {
res.add(s);
s++;
} else {
char c = t.charAt(s + j);
int shift = j - last[c];
if (shift < 1) shift = 1;
s += shift;
}
}
return res;
}
public static void main(String[] args) {
System.out.println(search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE")); // [17]
System.out.println(search("aaaaa", "aa")); // [0, 1, 2, 3]
}
}
Python¶
SIGMA = 256
def build_last(p):
"""last[c] = rightmost index of char c in p, else -1."""
last = [-1] * SIGMA
for i, ch in enumerate(p):
last[ord(ch)] = i
return last
def search(t, p):
res = []
n, m = len(t), len(p)
if m == 0 or m > n:
return res
last = build_last(p)
s = 0
while s <= n - m:
j = m - 1
while j >= 0 and p[j] == t[s + j]:
j -= 1 # scan right to left
if j < 0:
res.append(s)
s += 1
else:
c = ord(t[s + j])
shift = j - last[c] # bad-character shift
s += max(1, shift) # never stall
return res
if __name__ == "__main__":
print(search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE")) # [17]
print(search("aaaaa", "aa")) # [0, 1, 2, 3]
What it does: builds the last-occurrence table, slides the pattern with the bad-character rule, and reports every start index where P occurs. Run: go run main.go | javac BoyerMooreBadChar.java && java BoyerMooreBadChar | python bm.py
Coding Patterns¶
Pattern 1: Naive Search (the oracle you test against)¶
Intent: Before trusting Boyer-Moore, compare its output against the dead-simple naive search on small inputs.
def naive_search(t, p):
n, m = len(t), len(p)
res = []
for s in range(n - m + 1):
if t[s:s + m] == p:
res.append(s)
return res
This is O(nm) but obviously correct. Boyer-Moore must return the same list of indices.
Pattern 2: First-Occurrence-Only (strstr-style)¶
Intent: Return the first match index (or -1), like strstr / indexOf. Just return s on the first full match instead of collecting all.
Pattern 3: Last-Occurrence Table as a Hash Map¶
Intent: For large or sparse alphabets (Unicode), use a hash map {char: index} instead of a size-σ array. Lookups default to -1 for absent characters.
Error Handling¶
| Error | Cause | Fix |
|---|---|---|
| Infinite loop | Computed shift was 0 (or negative) and s never advanced. | Clamp shift = max(1, shift). |
| Missed match at the very end | Loop condition s < n-m instead of s <= n-m. | Use s <= n - m so the last window is checked. |
| Index out of range | Comparing T[s+j] with s+j >= n. | Loop only while s <= n-m; then s+j < n always holds. |
| Wrong results on empty pattern | m == 0 not handled. | Decide a convention (e.g. match at every position) and guard it. |
| Non-ASCII characters crash the array table | Char code exceeds σ. | Use a hash map, or size the table to the real alphabet. |
Off-by-one in last[c] | Stored leftmost instead of rightmost occurrence. | Overwrite during a left-to-right pass so the last write wins. |
Performance Tips¶
- Use an array table for byte alphabets (
σ = 256) —O(1)lookups with no hashing overhead. - Use a hash map for large/sparse alphabets (Unicode) to avoid a huge mostly-empty array.
- Clamp the shift to at least 1, but never more — that would skip real matches.
- Add the good-suffix rule (see
middle.md) for materially larger jumps on patterns with repeated suffixes. - For very short patterns, a
memchr-style scan for the first character may beat full Boyer-Moore — measure. - Reuse the precomputed tables across many searches with the same pattern; build them once.
Best Practices¶
- Always validate Boyer-Moore against the naive oracle (Pattern 1) on random small strings before trusting it.
- Decide and document your empty-pattern and empty-text conventions up front.
- Keep
buildLastandsearchas separate, individually testable functions. - Name the alphabet-size constant once (
SIGMA = 256); never sprinkle magic numbers. - State clearly whether you return the first match or all matches.
- For production, prefer the well-tested library
strstr/indexOfunless you have a measured reason to roll your own (seesenior.md).
Edge Cases & Pitfalls¶
- Empty pattern (
m = 0) — define behaviour explicitly; many libraries return position 0. - Pattern longer than text (
m > n) — no match possible; return empty/-1immediately. - Pattern equals text — a single match at
s = 0. - Overlapping matches —
"aa"in"aaaa"occurs at 0, 1, 2; shift by 1 after a match to catch overlaps. - All-same-character inputs —
"aaaa"in"aaaaaaaa"is the worst case for the bad-character rule (tiny jumps); this is where Galil's rule matters (middle.md). - Case sensitivity —
"Example"≠"example"; normalize case beforehand if needed. - Bad character to the right of the mismatch —
j - last[c]can be≤ 0; you must clamp to1.
Common Mistakes¶
- Forgetting to clamp the shift to at least 1 — the program loops forever when
last[c] >= j. - Storing the leftmost occurrence in the table instead of the rightmost — produces wrong (sometimes negative) shifts.
- Scanning left to right by habit — that is the naive/KMP direction, not Boyer-Moore; the whole speedup vanishes.
- Using
s < n-mand missing a match in the final window. - Skipping the
max(1, shift)because "the bad-character shift is always positive" — it is not when the bad character occurs to the right ofj. - Confusing length with index — the last character is
P[m-1], and a shift past an absent character isj + 1, notj. - Assuming
O(n)worst case without Galil's rule — the plain algorithm can degrade toO(nm).
Cheat Sheet¶
| Question | Tool | Formula |
|---|---|---|
| Where to start each comparison | right-to-left scan | begin at j = m - 1 |
| Bad-character shift | last-occurrence table | shift = j - last[c] (clamp to ≥ 1) |
Char c absent from P | last[c] = -1 | shift = j + 1 |
| After a full match | continue searching | shift by 1 (or good-suffix prefix shift) |
| Combine both rules | take the larger | shift = max(badChar, goodSuffix) |
| Build table cost | preprocessing | O(m + σ) |
| Average search cost | sublinear | down to O(n / m) |
Core algorithm (bad-character only):
build last[c] for all c # O(m + σ)
s = 0
while s <= n - m:
j = m - 1
while j >= 0 and P[j] == T[s+j]: j -= 1 # scan right→left
if j < 0: report match at s; s += 1
else: s += max(1, j - last[T[s+j]]) # bad-character jump
Visual Animation¶
See
animation.htmlfor an interactive visualization.The animation demonstrates: - Aligning the pattern under the text and scanning right to left - Highlighting the mismatched bad character in the text - Showing the bad-character and good-suffix shift candidates side by side - Jumping the pattern forward by the chosen (maximum) shift - Play / Pause / Step controls and editable text and pattern fields
Summary¶
Boyer-Moore searches for a pattern by aligning it under the text and comparing right to left. On a mismatch it slides the pattern forward by the larger of two precomputed shifts: the bad-character rule (align the mismatched text character with its last occurrence in the pattern, or jump past it if it is absent) and the good-suffix rule (re-align the suffix you already matched). Because a mismatch near the right end can authorize a jump of almost a full pattern length, the search is sublinear on average — it can find a match while examining fewer than n characters. Preprocessing the tables costs O(m + σ). The plain algorithm's worst case is O(nm); Galil's rule brings it down to O(n). Master the right-to-left scan and the bad-character table here, then add the good-suffix rule in middle.md. Contrast with KMP (sibling 01-kmp), the left-to-right, never-look-back algorithm, and Horspool (sibling 09-horspool), the bad-character-only simplification that powers real grep.
Further Reading¶
- Boyer, R. S., & Moore, J S. (1977). A Fast String Searching Algorithm. Communications of the ACM.
- Introduction to Algorithms (CLRS) — the string-matching chapter and the Boyer-Moore exercises.
- Gusfield, D. Algorithms on Strings, Trees, and Sequences — the definitive treatment of the good-suffix rule.
- Sibling topic
01-kmp— the left-to-right, linear-time alternative for contrast. - Sibling topic
09-horspool— the simplified bad-character-only variant used in practice. - cp-algorithms.com — "Prefix function" and string-matching foundations.