Aho-Corasick Multi-Pattern String Matching — Practice Tasks¶
All tasks must be solved in Go, Java, and Python. Each task ships with starter code or a precise spec in all three languages. Implement the Aho-Corasick logic and validate against the evaluation criteria. Always test against a brute-force naive multi-search oracle (
str.findper pattern) on small inputs before trusting the automaton.
Beginner Tasks (5)¶
Task 1 — Build a trie of patterns and count nodes¶
Problem. Insert a list of lowercase patterns into a trie (children indexed by c - 'a'). Output the total number of trie nodes (including the root). Shared prefixes must share nodes.
Input / Output spec. - Read P, then P patterns. - Print the node count (root counts as 1).
Constraints. - 1 ≤ P ≤ 1000, patterns lowercase a–z, total length ≤ 10^5.
Hint. Each distinct prefix is exactly one node. {he, hers} shares h, he, so the node count is 1 (root) + |distinct prefixes|.
Starter — Go.
package main
import "fmt"
const SIGMA = 26
func main() {
var p int
fmt.Scan(&p)
next := [][SIGMA]int{}
// TODO: add a root node, insert each pattern, count nodes
_ = next
fmt.Println( /* node count */ )
}
Starter — Java.
import java.util.*;
public class Task1 {
static final int SIGMA = 26;
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
int p = sc.nextInt();
// TODO: build trie, count nodes
System.out.println(/* node count */ 0);
}
}
Starter — Python.
import sys
SIGMA = 26
def main():
data = sys.stdin.read().split()
p = int(data[0])
pats = data[1:1 + p]
# TODO: build trie, count nodes including root
print(0)
if __name__ == "__main__":
main()
Evaluation criteria. - {he, hers, his} produces the correct shared-prefix node count. - Root is counted; empty pattern list yields node count 1. - O(total length) insertion.
Task 2 — Naive multi-search oracle¶
Problem. Implement naive_search(patterns, text) that returns all (pattern_index, start_index) occurrences (overlaps included), using only repeated substring search. This is your correctness oracle for all later tasks.
Input / Output spec. - Read P, the patterns, then the text. - Print each (id, start) sorted by (start, id).
Constraints. 1 ≤ P ≤ 50, total length ≤ 2000, |text| ≤ 2000.
Hint. For each pattern, repeatedly call find(pattern, fromIndex) advancing by 1 each time to catch overlaps. O(P · n · maxlen) is fine here.
Reference (Python).
def naive_search(patterns, text):
res = []
for pid, p in enumerate(patterns):
start = text.find(p)
while start != -1:
res.append((pid, start))
start = text.find(p, start + 1)
return sorted(res, key=lambda r: (r[1], r[0]))
Evaluation criteria. - Catches overlapping occurrences (aa in aaa at starts 0 and 1). - Sorted output; deterministic across all three languages.
Task 3 — Failure links by BFS¶
Problem. After inserting patterns, compute the failure link fail[u] for every node using BFS, following the rule fail(u) = goto(fail(parent), c). Output fail[u] for all nodes in node-id order.
Input / Output spec. - Read P, the patterns. - Print fail[0], fail[1], … (root's fail is 0).
Constraints. 1 ≤ P ≤ 500, total length ≤ 10^4.
Hint. Initialize the root's children with fail = 0 and enqueue them. Process the queue in FIFO order. To compute fail(child), you need the parent's completed transitions — so completing goto and computing fail happen together in the BFS.
Worked check. For {he, she, his, hers} (trie from junior.md), fail(node "she") = node "he" and fail(node "hers") follows the suffix s. Verify a few links by hand.
Evaluation criteria. - fail[root] == 0; all depth-1 nodes have fail == 0. - fail[u] is always strictly shallower than u (no cycles except the root self-link). - Matches a hand trace on {he, she, his, hers}.
Task 4 — Report all matches (full Aho-Corasick)¶
Problem. Given patterns and a text, report every (pattern_id, end_index) occurrence, including overlaps, using the complete automaton (goto + fail + merged outputs).
Input / Output spec. - Read P, patterns, then the text. - Print each (id, end) sorted by (end, id).
Constraints. 1 ≤ P ≤ 10^4, total length ≤ 10^5, |text| ≤ 10^6.
Hint. Merge output[u] += output[fail[u]] during BFS so reporting at a state is one list walk. The match at state u after reading index i ends at i.
Evaluation criteria. - Matches naive_search (converted to end indices) for small inputs. - Reports overlapping matches and shorter patterns that are suffixes of longer ones. - O(n + z) scan after O(m·σ) build.
Task 5 — Contains any keyword (boolean)¶
Problem. Given a banned-word list and a text, return true iff the text contains at least one banned word as a substring. Early-exit on the first match.
Input / Output spec. - Read P, the words, then the text. - Print true or false.
Constraints. 1 ≤ P ≤ 10^4, |text| ≤ 10^6.
Hint. Build the automaton; during the scan, the moment you reach a state with a nonempty output, return true. No need to enumerate all matches.
Evaluation criteria. - Returns true on the first match and stops scanning. - Returns false when no word occurs. - Correct when one word is a substring of another.
Starter — Python.
def contains_any(words, text):
# TODO: build automaton; scan; return True at first nonempty output
return False
Intermediate Tasks (5)¶
Task 6 — Count occurrences of every pattern (fail-tree aggregation)¶
Problem. Given patterns and a text, output, for each pattern id, the number of times it occurs (overlaps counted), computed in O(n + m) via fail-tree subtree sums — not by enumerating matches.
Constraints. 1 ≤ P ≤ 10^5, total length ≤ 10^5, |text| ≤ 10^6.
Hint. Scan once, incrementing cnt[state] on each landing. Then process nodes in reverse BFS order, doing cnt[fail[u]] += cnt[u]. A pattern's count is cnt[its end node].
Reference oracle (Python).
def brute_counts(patterns, text):
res = []
for p in patterns:
c, start = 0, text.find(p)
while start != -1:
c += 1
start = text.find(p, start + 1)
res.append(c)
return res
Starter — Python (fail-tree skeleton).
from collections import deque
SIGMA = 26
def count_each(patterns, text):
nxt = [[-1] * SIGMA]
fail = [0]
end_node = [] # node where each pattern ends
cnt = [0]
def node():
nxt.append([-1] * SIGMA)
fail.append(0)
cnt.append(0)
return len(nxt) - 1
for p in patterns:
cur = 0
for ch in p:
c = ord(ch) - 97
if nxt[cur][c] == -1:
nxt[cur][c] = node()
cur = nxt[cur][c]
end_node.append(cur)
# TODO: BFS to build fail + complete goto, record order
# TODO: scan text, increment cnt[state]
# TODO: reverse-BFS push cnt up the fail tree
# TODO: return [cnt[end_node[p]] for p in range(len(patterns))]
return [0] * len(patterns)
Evaluation criteria. - ["a","aa"] on "aaaa" returns [4, 3]. - Matches brute_counts for small inputs. - No per-occurrence enumeration; O(n + m).
Task 7 — First occurrence end-position of each pattern¶
Problem. Given patterns and a text, output for each pattern the end index of its first occurrence, or -1 if it never occurs.
Constraints. 1 ≤ P ≤ 10^4, total length ≤ 10^5, |text| ≤ 10^6.
Hint. Scan once. At each state, walk merged outputs; for each pattern id not yet seen, record the current index as its first end. Use a seen / firstEnd array initialized to -1.
Evaluation criteria. - Reports the earliest end index per pattern. - -1 for patterns that never occur. - Single pass; matches a naive per-pattern find.
Worked check. For ["he","she"] and text she, he first ends at index 2 (via the output link from she), and she first ends at index 2. Both are reported from the same scan position, demonstrating that output links surface co-terminating patterns.
Task 8 — Censor banned words¶
Problem. Replace every character covered by any banned word with * (union of all match ranges). Overlapping matches all contribute.
Constraints. 1 ≤ P ≤ 10^4, total length ≤ 10^5, |text| ≤ 10^6.
Hint. Get all (start, end) matches via the automaton, mark a boolean cover[] array over text positions, then rebuild masking covered positions. For large match counts, prefer a difference-array (+1 at start, -1 at end+1) over per-character marking.
Reference check (Python).
# censor(["ab","bc"], "xabcy") == "x***y"
def naive_censor(banned, text):
cover = [False] * len(text)
for p in banned:
start = text.find(p)
while start != -1:
for j in range(start, start + len(p)):
cover[j] = True
start = text.find(p, start + 1)
return "".join("*" if cover[i] else ch for i, ch in enumerate(text))
Evaluation criteria. - Correct union masking for overlapping matches. - ["ab","bc"] on "xabcy" yields "x***y". - Uses a difference array for the cover marking when matches are dense.
Task 9 — Longest pattern matched at each position¶
Problem. For each text index i, output the length of the longest pattern that ends at i, or 0 if none ends there.
Constraints. 1 ≤ P ≤ 10^4, total length ≤ 10^5, |text| ≤ 10^6.
Hint. Precompute for each node the maximum pattern length in its merged output set (maxLen[u] = max over outputs). During the scan, maxLen[state(i)] is the answer at i. This avoids walking the output list per position.
Evaluation criteria. - For {he, hers} and text hers, position 1 (he) gives 2, position 3 (hers) gives 4. - 0 where no pattern ends. - O(n) scan after O(m) precompute of maxLen.
Task 10 — Streaming scan across chunks¶
Problem. Implement scan(state, chunk, baseOffset) that processes a chunk of text starting from a carried automaton state and returns the new state, reporting matches at absolute offsets. Demonstrate that splitting a text into chunks produces the same matches as scanning it whole.
Constraints. 1 ≤ P ≤ 10^4, total length ≤ 10^5, total stream length ≤ 10^7, chunk size arbitrary.
Hint. The entire scan state is one integer; thread it across chunk calls. A pattern straddling a boundary is handled automatically because no character is re-read and state is preserved. Report end positions as baseOffset + i.
Evaluation criteria. - Chunked scan and whole-text scan produce identical match sets. - Matches straddling chunk boundaries are caught. - Only O(1) state carried between chunks (no buffering of prior chunks).
Advanced Tasks (5)¶
Task 11 — Alphabet reduction for byte input¶
Problem. Build an Aho-Corasick automaton over arbitrary bytes (σ up to 256) using alphabet reduction: map only the byte values that appear in patterns to a dense range [0, σ'), build transition rows of width σ', and translate each text byte through the map during scanning. Report all matches.
Constraints. 1 ≤ P ≤ 10^4, total pattern length ≤ 10^5, |text| ≤ 10^7, byte alphabet.
Hint. First pass over patterns marks used bytes; assign each a dense index. A byte not in any pattern routes the automaton to the root path. Compare memory against a dense [256] build.
Evaluation criteria. - Correct matches identical to a full [256]-width build. - Memory measurably lower than the unreduced build (report the row width σ'). - Handles bytes absent from all patterns (route to root).
Task 12 — Replace-and-rebuild (atomic dictionary swap)¶
Problem. Implement a matcher whose dictionary can be replaced at runtime. rebuild(patterns) constructs a fresh immutable automaton; search(text) uses whatever automaton is currently published. Concurrent searches during a rebuild must use a consistent (old or new) automaton, never a half-built one.
Constraints. 1 ≤ P ≤ 10^5, total length ≤ 10^6, searches may run concurrently with a rebuild.
Hint. Build the new automaton fully into local structures, then publish it via an atomic reference/pointer swap. Readers snapshot the reference at the start of a scan. The automaton is read-only, so sharing is safe.
Evaluation criteria. - A search started before a swap sees the old dictionary throughout. - No reader ever observes a partially built automaton. - Build is off the critical path; publish is a single atomic store.
Task 13 — Shortest substring containing all patterns¶
Problem. Given patterns (all present at least once is not guaranteed) and a text, find the length of the shortest substring of the text that contains every pattern as a substring; -1 if impossible.
Constraints. 1 ≤ P ≤ 1000, total length ≤ 10^5, |text| ≤ 10^6.
Hint. Collect all occurrence intervals (start, end) via the automaton; sort by end; slide a window over occurrences requiring all P pattern ids covered, tracking the minimum span [min start in window, current end]. This is the interview "shortest window" challenge.
Reference oracle (Python).
def brute_shortest_window(patterns, text):
n = len(text)
best = float("inf")
for i in range(n):
for j in range(i, n):
sub = text[i:j + 1]
if all(p in sub for p in patterns):
best = min(best, j - i + 1)
break # shortest ending at this start
return -1 if best == float("inf") else best
Evaluation criteria. - ["ab","bc"] on "zzabxbcyy" returns 5 (span covering ab and bc). - -1 when some pattern never occurs. - Two-pointer window over occurrences, not brute-force over all substrings. - Matches brute_shortest_window for small inputs.
Task 14 — DNA motif counting (small alphabet)¶
Problem. Over the DNA alphabet {A, C, G, T} (σ = 4), count occurrences of many short motifs in a long sequence. Use the dense array automaton (the small alphabet makes it ideal) and fail-tree counting.
Constraints. 1 ≤ P ≤ 10^5, total motif length ≤ 10^6, sequence length ≤ 10^7, alphabet {A,C,G,T}.
Hint. Map A,C,G,T → 0,1,2,3. The dense [4] rows are tiny, so the full goto table is cheap. Use the O(n + m) fail-tree count from Task 6. Note that overlapping motif occurrences are counted (e.g. AA in AAA).
Degree sanity check. With σ = 4, each node has exactly 4 transitions after completion; total transition table size is 4 × nodes. Assert this after building to catch row-width bugs.
Evaluation criteria. - Correct overlapping counts for motifs like AA, ACA. - Dense [4] rows; full goto table built and scanned in linear time. - Matches a naive per-motif count on small sequences.
Task 15 — Decide whether Aho-Corasick is the right tool¶
Problem. Given a problem description with constraints (P, m, n, query_type, dictionary_volatility), classify the best approach as one of: AHO_CORASICK, KMP_SINGLE, KMP_PER_PATTERN, SUFFIX_STRUCTURE_OF_TEXT, or WRONG_FOR_FUZZY. Justify each decision.
Constraints / cases to handle. - Many patterns, one or few reused texts, exact match → AHO_CORASICK. - Exactly one pattern → KMP_SINGLE (sibling 01-kmp). - Pattern set changes every query (no reuse), few patterns → KMP_PER_PATTERN. - One fixed text, many changing pattern sets → SUFFIX_STRUCTURE_OF_TEXT (suffix automaton/array). - Approximate / fuzzy / regex matching → WRONG_FOR_FUZZY (Aho-Corasick is exact substring only).
Hint. Encode the decision rules from middle.md and senior.md. The trap cases are the single-pattern one (use KMP) and the fuzzy one (Aho-Corasick does not do approximate matching).
Evaluation criteria. - Correctly classifies all five canonical cases with justification. - The KMP_SINGLE and WRONG_FOR_FUZZY branches explicitly explain why Aho-Corasick is suboptimal/inapplicable. - References the right complexity (O(n + z), O(P·n + m), etc.).
Benchmark Task¶
Task B — Aho-Corasick vs KMP-per-pattern crossover¶
Problem. Empirically find the pattern count P at which Aho-Corasick overtakes running KMP once per pattern, for a fixed text length n. Implement two workloads on a random text and random short patterns (fixed seed):
- (a) KMP-per-pattern: build each pattern's prefix function and scan the text per pattern —
O(P·n + m). - (b) Aho-Corasick: build one automaton and scan the text once —
O(m·σ + n + z).
Measure wall-clock time for n = 10^6 across P ∈ {1, 4, 16, 64, 256, 1024}. Report a table and identify the crossover P.
Starter — Python.
import random
import time
def gen_patterns(p, seed, length=8):
r = random.Random(seed)
return ["".join(r.choice("abcd") for _ in range(length)) for _ in range(p)]
def gen_text(n, seed):
r = random.Random(seed + 1)
return "".join(r.choice("abcd") for _ in range(n))
def kmp_search_all(patterns, text):
# TODO: prefix function per pattern, scan text per pattern; count matches
return 0
def aho_search_all(patterns, text):
# TODO: build one automaton, scan once; count matches
return 0
def bench(fn, patterns, text):
start = time.perf_counter()
fn(patterns, text)
return time.perf_counter() - start
def main():
n = 1_000_000
text = gen_text(n, 42)
print("P kmp_per_pattern_ms aho_ms")
for p in [1, 4, 16, 64, 256, 1024]:
pats = gen_patterns(p, 7)
a = [bench(kmp_search_all, pats, text) for _ in range(3)]
b = [bench(aho_search_all, pats, text) for _ in range(3)]
ms = lambda xs: sum(xs) / len(xs) * 1000
print(f"{p:<8d} {ms(a):<22.2f} {ms(b):<10.2f}")
if __name__ == "__main__":
main()
Evaluation criteria. - Same seed produces the same patterns and text across Go, Java, and Python. - KMP-per-pattern (a) wins or ties for P = 1; Aho-Corasick (b) wins for large P. Report the crossover P. - Both methods report identical total match counts on the same inputs. - Report includes the mean across at least 3 runs and does not time pattern/text generation. - Writeup: a short note on the measured crossover P and how it compares to theory (the P·n term of KMP-per-pattern vs the single n of Aho-Corasick).
General Guidance for All Tasks¶
- Always validate against the naive oracle first. Every search/count task above ships with (or references) a slow per-pattern oracle. Run it on small
P,m,n, diff against the automaton, and only then trust it on large inputs. - Handle the root correctly.
fail[root] = root, and the root's missing transitions loop back to the root. This is the single most common build bug. - Build failure links in BFS order. A fail link points to a shallower node, so all shallower nodes must be finalized first.
- Merge or walk output links. Forgetting
output[u] += output[fail[u]](or skipping the output chain) silently drops overlapping/suffix patterns — testheinsideshe. - Be consistent about end vs start positions. A match at state
uafter indexiends ati; start =i − len + 1. - Choose representation by alphabet. Dense arrays for small alphabets (DNA, lowercase); alphabet reduction or maps for byte/Unicode input.
- Build once, scan many. Never rebuild the automaton inside a per-text or per-query loop.
Suggested Test Vectors¶
Use these shared cases across all three language implementations so outputs can be diffed directly:
| Patterns | Text | Expected (matches as pattern@end) |
|---|---|---|
he she his hers | ushers | she@3, he@3, hers@5 |
a aa | aaaa | counts a=4, aa=3 (total 7) |
ab bc | xabcy | censor → x***y |
ab bc | zzabxbcyy | shortest window length 5 |
abc | xyz | no matches; contains → false |
a | `` (empty) | no matches |
Progression and Difficulty Notes¶
- Beginner (1–5) establishes the data structure: trie construction, the oracle you validate against, failure-link BFS, the full reporting scan, and the boolean early-exit variant. Finish these before attempting the rest — every later task reuses the
buildroutine. - Intermediate (6–10) layers the standard production variants on top: counting via the fail tree, first-occurrence tracking, censoring, longest-match-per-position, and streaming across chunks. These exercise the output machinery and the online property.
- Advanced (11–15) stresses scale and judgement: byte alphabets with reduction, runtime dictionary swaps, the shortest-window two-pointer combination, small-alphabet bioinformatics counting, and the meta-skill of deciding when Aho-Corasick is the wrong tool at all.
- Benchmark (B) ties it together empirically by locating the Aho-Corasick vs KMP-per-pattern crossover, the single most important practical fact: the
Pfactor.
Common Grading Pitfalls¶
- Forgetting to merge
output[fail[u]]— passes the single-pattern cases, fails on overlapping/suffix patterns (heinshe). Always include an overlap test. - Mishandling the root so the scan dereferences
-1— crashes on the very first character that has no root edge. - Counting via per-position enumeration in Task 6 instead of fail-tree aggregation — passes correctness but violates the
O(n + m)requirement; check the complexity, not just the answer. - Off-by-one between start and end positions — surfaces only in position-sensitive tasks (censor, shortest window); pin the convention and assert it.
How to Validate Your Solutions¶
A disciplined validation loop catches nearly all bugs before they reach the large test cases:
- Hand-trace the tiny case. Run
{he, she, his, hers}onushersand confirmshe@3,he@3,hers@5. If this fails, your output-link merging or root handling is wrong — fix it before anything else. - Diff against the naive oracle. Generate random small dictionaries (
P ≤ 8, lengths≤ 5) and random texts (≤ 30chars) over a 3-letter alphabet; compare your automaton's sorted output against the brute-forcestr.findoracle for hundreds of trials. - Check the degenerate cases. Empty text, empty dictionary, a single pattern (must equal KMP), duplicate patterns, and a pattern equal to the whole text.
- Cross-language determinism. Feed the same seeded inputs to your Go, Java, and Python solutions and diff the outputs byte-for-byte. Divergence usually means an unstable sort or a normalization difference.
- Scale up last. Only after 1–4 pass should you run the large constraint-sized inputs to check performance (
O(n + z)scan,O(n + m)count).
Each task must be implemented and tested in Go, Java, and Python, with identical outputs across all three on the same seeded inputs.