Knuth-Morris-Pratt (KMP) — Interview Preparation¶
KMP is a perennial interview favorite because it tests one crisp insight — "on a mismatch, slide the pattern by the longest border instead of restarting" — and then probes whether you can (a) build the prefix function correctly in O(m), (b) justify the O(n+m) bound with an amortized argument, (c) recognize the byproducts (period, borders, repeated-substring detection), and (d) avoid the classic traps (overlap counting, off-by-one on the match index, confusing KMP's "no re-read" with Boyer-Moore's "skip"). This file is a question bank by seniority, behavioral prompts, and four end-to-end coding challenges with runnable Go, Java, and Python solutions.
Quick-Reference Cheat Sheet¶
| Question | Tool | Complexity |
|---|---|---|
| Find first / all occurrences | KMP search with failure links | O(n + m) |
| Build failure function | prefix function (lps) | O(m) |
| Count occurrences (overlapping) | KMP search, count++; j = lps[j-1] | O(n + m) |
Smallest period of s | L - lps[L-1] (clean iff divides L) | O(m) |
| Longest border / happy prefix | lps[L-1] | O(m) |
| Repeated-substring-pattern | period < L and L % period == 0 | O(m) |
| Search via builder only | prefix function of P # T, find value == m | O(n + m) |
| Count simple paths / fuzzy match | not KMP (different problem) | — |
Core algorithm:
buildLPS(P): kmpSearch(T, P):
lps[0]=0; len=0 lps=buildLPS(P); j=0
for i in 1..m-1: for i in 0..n-1:
while len>0 and P[i]!=P[len]: while j>0 and T[i]!=P[j]: j=lps[j-1]
len=lps[len-1] if T[i]==P[j]: j++
if P[i]==P[len]: len++ if j==m: report i-m+1; j=lps[j-1]
lps[i]=len
Key facts: - lps[i] = longest proper prefix of P[0..i] that is also a suffix (a border). - The text index never moves backward → O(n) search. - After a match, j = lps[j-1] (not 0) to catch overlaps. - Smallest period = m - lps[m-1]; it is a clean repetition iff that divides m.
Junior Questions (12 Q&A)¶
J1. What does lps[i] represent?¶
The length of the longest proper prefix of P[0..i] that is also a suffix of P[0..i] — equivalently the longest border of that prefix. "Proper" means it cannot be the whole substring.
J2. Why is the naive matcher slow?¶
After a partial match it shifts by one and restarts comparing from the start of the pattern, re-reading text it already examined. On inputs like T = "aaaa…ab", P = "aa…ab" this is O(nm).
J3. What is KMP's time complexity?¶
O(n + m): O(m) to build the prefix function and O(n) to scan the text. It holds on every input — no adversarial worst case.
J4. Why does the search never move the text index backward?¶
On a mismatch KMP changes only j (how much of the pattern matched) via j = lps[j-1]; the text index i only ever advances. Already-matched text is never re-read as a fresh comparison.
J5. What is lps[0] and why?¶
0. A single character has no proper prefix, so its longest border is the empty string.
J6. How do you find overlapping matches?¶
After reaching j == m, set j = lps[j-1] instead of j = 0. This keeps the matched suffix as state so overlapping occurrences (e.g. "aa" in "aaa") are found.
J7. What is the match start index when j first equals m?¶
i - m + 1, where i is the current text index. A common bug is reporting i.
J8. Is KMP for one pattern or many?¶
One. For many patterns, generalize to Aho-Corasick, which builds one automaton with KMP-style failure links over all patterns.
J9. Does KMP skip characters like Boyer-Moore?¶
No. KMP never skips text; it only avoids re-reading. Boyer-Moore matches from the right and can skip whole blocks (sublinear average).
J10. What is a border of a string?¶
A string that is simultaneously a proper prefix and a proper suffix. The empty string is a border of every nonempty string. lps[L-1] is the longest border of the whole string.
J11. How do you get the smallest period from lps?¶
period = L - lps[L-1]. If L % period == 0, the string is a clean repetition of its first period characters.
J12. What if the pattern is empty?¶
Conventionally it matches everywhere (at every position including the end). Decide and document; many implementations special-case it.
Mid-Level Questions (10 Q&A)¶
M1. Walk through building lps for "aabaaab".¶
a a b a a a b → lps = [0,1,0,1,2,2,3]. At i=5 (s='a'), len was 2 (s[2]='b' mismatch... actually len=2, compare s[5]='a' vs s[2]='b' mismatch → len=lps[1]=1, s[5]='a' vs s[1]='a' match → len=2), giving lps[5]=2; at i=6 ('b'), s[6]='b' vs s[2]='b' match → len=3.
M2. Prove the construction is O(m).¶
len rises by at most 1 per position (≤ m-1 total rises) and every while pass strictly decreases it. A quantity that rises ≤ m-1 times and only falls one-or-more per while pass can fall ≤ m-1 times total, so the inner loop runs O(m) times across the whole build. (Potential argument with Φ = len.)
M3. Explain the P # T trick.¶
Build the prefix function of P + '#' + T with a sentinel # absent from both. No border exceeds m. Wherever the value equals m, the pattern ends there, giving a match. It reuses the builder as a searcher but uses O(n+m) extra memory.
M4. How do you count occurrences of each prefix of a string?¶
Tally cnt[lps[i]] over all i, then propagate down the border chain (cnt[lps[k-1]] += cnt[k] for k high→low) and add 1 per prefix. This counts how many times each prefix appears as a substring.
M5. KMP vs Z-function?¶
Equivalent power (O(n+m)), interconvertible in linear time. lps reads "longest border ending here"; Z reads "longest prefix-match starting here." KMP's failure links generalize to automata/Aho-Corasick; Z is often considered simpler for one-off matching.
M6. KMP vs Rabin-Karp?¶
KMP is deterministic and collision-free, O(n+m) always. Rabin-Karp uses rolling hashes — great for many equal-length patterns or 2D search — but is O(nm) worst case on collisions and incorrect if you skip verification.
M7. KMP vs Boyer-Moore?¶
Boyer-Moore is sublinear on average (skips via bad-character/good-suffix from the right) and is what grep often uses, but can degrade without the good-suffix rule. KMP's O(n+m) is unconditional and streams cleanly.
M8. What is the KMP automaton and when is it worth it?¶
A DFA with m+1 states and transition δ[state][char], removing the inner while (one lookup per char). Worth O(m·|Σ|) memory only for small alphabets; impractical for Unicode.
M9. Why j = lps[j-1] and not lps[j] on mismatch?¶
j is a length; the last matched index is j-1. The longest border of the matched prefix P[0..j-1] is lps[j-1]. Using lps[j] reads beyond the matched region and is wrong.
M10. How does KMP handle streaming input?¶
The only carry state is the integer j plus the immutable lps. Feed the text in chunks without resetting j; matches can span chunk boundaries. No lookback buffer is needed.
Senior Questions (8 Q&A)¶
S1. When would you NOT use KMP in production?¶
When a tuned library indexOf/memmem (SIMD prefilter) suffices on non-adversarial text — it is often several times faster. Also when matching many patterns (use Aho-Corasick) or when Boyer-Moore's average skipping wins for long patterns over large alphabets.
S2. What Unicode pitfalls bite KMP?¶
Byte-level KMP on UTF-8 is correct (self-synchronizing) and fast; rune/grapheme-level needs decoding/normalization. Java's charAt returns UTF-16 code units, so astral-plane code points (emoji) are surrogate pairs and charAt-based KMP can match half a pair — use codePoints() or bytes.
S3. How do you test a KMP implementation?¶
Differential testing against a naive oracle and the library matcher over random small-alphabet strings; property checks (every reported index is a real match; 0 ≤ lps[i] ≤ i); streaming equivalence (one chunk vs byte-by-byte); adversarial timing test ("a"*10^6 vs "a"*1000+"b") to catch quadratic regressions.
S4. Derive the worst-case comparison count.¶
KMP does at most 2n − 1 comparisons in the search: each text character yields one advancing/final comparison, and fall-backs are amortized against prior advances. The strong failure function (true KMP vs Morris–Pratt) shaves redundant fall-backs further.
S5. Explain border/period duality.¶
s has a border of length b iff it has period p = L − b. So the longest border (lps[L-1]) gives the smallest period L − lps[L-1]. This is the basis of repeated-substring detection.
S6. Where does Fine–Wilf come in?¶
If s has periods p and q with L ≥ p + q − gcd(p,q), it also has period gcd(p,q). It governs how multiple periods interact and underlies fast all-periodicity (runs) algorithms; the period KMP exposes is consistent with it.
S7. How does Aho-Corasick relate to KMP?¶
It is KMP generalized to a trie of many patterns: a node's failure link points to the longest proper suffix of its path that is another pattern's prefix — the multi-pattern analog of lps[j-1]. One automaton, one O(n + Σm + matches) scan.
S8. What metrics/limits would you add around a KMP service?¶
Cap pattern/text size at trust boundaries (the lps array scales with pattern length); emit matches-found, bytes-scanned, chunks-processed; log the matching granularity (byte vs rune); use a 64-bit global offset for long streams.
Behavioral Prompts¶
- "Tell me about a time you optimized a slow algorithm." Frame: profiled an
O(nm)log-scanner, identified re-reading on mismatch, switched to KMP (or a library SIMD search after benchmarking), measured the latency drop. Emphasize measuring before and after. - "Describe debugging a subtle correctness bug." The overlap-counting bug (
j=0vsj=lps[j-1]) found only by differential testing against the naive oracle — a great story about test harnesses catching off-by-one logic. - "When did you choose the 'boring' solution over the clever one?" Choosing the standard-library matcher over hand-rolled KMP because benchmarks showed SIMD won on real data and the worst case was not adversarial — judgment over cleverness.
- "Explain a complex concept to a non-expert." Explain the failure link with the "lost-your-place-reading" analogy; gauge whether you can teach, not just code.
Coding Challenges¶
Challenge 1 — Substring Search (return all start indices)¶
Problem. Given text T and pattern P, return all starting indices where P occurs in T (overlaps allowed). If P is empty, return an empty list.
Go¶
package main
import "fmt"
func buildLPS(p string) []int {
lps := make([]int, len(p))
l := 0
for i := 1; i < len(p); i++ {
for l > 0 && p[i] != p[l] {
l = lps[l-1]
}
if p[i] == p[l] {
l++
}
lps[i] = l
}
return lps
}
func search(t, p string) []int {
if p == "" {
return []int{}
}
lps := buildLPS(p)
res := []int{}
j := 0
for i := 0; i < len(t); i++ {
for j > 0 && t[i] != p[j] {
j = lps[j-1]
}
if t[i] == p[j] {
j++
}
if j == len(p) {
res = append(res, i-len(p)+1)
j = lps[j-1]
}
}
return res
}
func main() { fmt.Println(search("abxabcabcaby", "abcaby")) } // [6]
Java¶
import java.util.*;
public class Challenge1 {
static int[] buildLPS(String p) {
int[] lps = new int[p.length()];
int l = 0;
for (int i = 1; i < p.length(); i++) {
while (l > 0 && p.charAt(i) != p.charAt(l)) l = lps[l - 1];
if (p.charAt(i) == p.charAt(l)) l++;
lps[i] = l;
}
return lps;
}
static List<Integer> search(String t, String p) {
List<Integer> res = new ArrayList<>();
if (p.isEmpty()) return res;
int[] lps = buildLPS(p);
int j = 0;
for (int i = 0; i < t.length(); i++) {
while (j > 0 && t.charAt(i) != p.charAt(j)) j = lps[j - 1];
if (t.charAt(i) == p.charAt(j)) j++;
if (j == p.length()) { res.add(i - p.length() + 1); j = lps[j - 1]; }
}
return res;
}
public static void main(String[] a) { System.out.println(search("abxabcabcaby", "abcaby")); } // [6]
}
Python¶
def build_lps(p):
lps = [0] * len(p)
l = 0
for i in range(1, len(p)):
while l > 0 and p[i] != p[l]:
l = lps[l - 1]
if p[i] == p[l]:
l += 1
lps[i] = l
return lps
def search(t, p):
if not p:
return []
lps = build_lps(p)
res, j = [], 0
for i in range(len(t)):
while j > 0 and t[i] != p[j]:
j = lps[j - 1]
if t[i] == p[j]:
j += 1
if j == len(p):
res.append(i - len(p) + 1)
j = lps[j - 1]
return res
print(search("abxabcabcaby", "abcaby")) # [6]
Challenge 2 — Count Occurrences (overlapping)¶
Problem. Return the number of times P occurs in T, counting overlaps. Example: count("aaaaa", "aa") == 4.
Go¶
func count(t, p string) int {
if p == "" {
return 0
}
lps := buildLPS(p) // reuse from Challenge 1
c, j := 0, 0
for i := 0; i < len(t); i++ {
for j > 0 && t[i] != p[j] {
j = lps[j-1]
}
if t[i] == p[j] {
j++
}
if j == len(p) {
c++
j = lps[j-1]
}
}
return c
}
Java¶
static int count(String t, String p) {
if (p.isEmpty()) return 0;
int[] lps = buildLPS(p);
int c = 0, j = 0;
for (int i = 0; i < t.length(); i++) {
while (j > 0 && t.charAt(i) != p.charAt(j)) j = lps[j - 1];
if (t.charAt(i) == p.charAt(j)) j++;
if (j == p.length()) { c++; j = lps[j - 1]; }
}
return c;
}
Python¶
def count(t, p):
if not p:
return 0
lps = build_lps(p)
c = j = 0
for ch in t:
while j > 0 and ch != p[j]:
j = lps[j - 1]
if ch == p[j]:
j += 1
if j == len(p):
c += 1
j = lps[j - 1]
return c
print(count("aaaaa", "aa")) # 4
Challenge 3 — Shortest Period (smallest repeating unit length)¶
Problem. Return the length of the smallest string w such that s is a prefix of w repeated. Equivalently the smallest period. Example: period("abcabca") == 3, period("aba") == 2.
Go¶
func period(s string) int {
if s == "" {
return 0
}
lps := buildLPS(s)
return len(s) - lps[len(s)-1]
}
Java¶
static int period(String s) {
if (s.isEmpty()) return 0;
int[] lps = buildLPS(s);
return s.length() - lps[s.length() - 1];
}
Python¶
def period(s):
if not s:
return 0
lps = build_lps(s)
return len(s) - lps[-1]
print(period("abcabca"), period("aba")) # 3 2
Challenge 4 — Repeated Substring Pattern (LeetCode 459)¶
Problem. Return true iff s can be built by repeating a substring two or more times. Example: "abab" → true, "aba" → false, "abcabcabc" → true.
Go¶
func repeatedSubstringPattern(s string) bool {
n := len(s)
if n < 2 {
return false
}
lps := buildLPS(s)
p := n - lps[n-1]
return p < n && n%p == 0
}
Java¶
static boolean repeatedSubstringPattern(String s) {
int n = s.length();
if (n < 2) return false;
int[] lps = buildLPS(s);
int p = n - lps[n - 1];
return p < n && n % p == 0;
}
Python¶
def repeated_substring_pattern(s):
n = len(s)
if n < 2:
return False
lps = build_lps(s)
p = n - lps[-1]
return p < n and n % p == 0
print(repeated_substring_pattern("abab")) # True
print(repeated_substring_pattern("aba")) # False
print(repeated_substring_pattern("abcabcabc")) # True
Challenge 5 — Shortest palindrome by prepending (LeetCode 214)¶
Problem. Find the shortest palindrome you can make by adding characters in front of s. Build lps of s + '#' + reverse(s); the longest palindromic prefix has length lps[last], so prepend the reverse of the remaining suffix.
Go¶
func shortestPalindrome(s string) string {
rev := reverse(s)
comb := s + "#" + rev
lps := buildLPS(comb)
k := lps[len(lps)-1] // length of longest palindromic prefix
return reverse(s[k:]) + s
}
func reverse(s string) string {
b := []byte(s)
for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
b[i], b[j] = b[j], b[i]
}
return string(b)
}
Java¶
static String shortestPalindrome(String s) {
String rev = new StringBuilder(s).reverse().toString();
String comb = s + "#" + rev;
int[] lps = buildLPS(comb);
int k = lps[lps.length - 1];
return new StringBuilder(s.substring(k)).reverse().toString() + s;
}
Python¶
def shortest_palindrome(s):
comb = s + "#" + s[::-1]
lps = build_lps(comb)
k = lps[-1]
return s[k:][::-1] + s
print(shortest_palindrome("aacecaaa")) # "aaacecaaa"
print(shortest_palindrome("abcd")) # "dcbabcd"
Why it works. lps[last] of s + '#' + reverse(s) is the longest prefix of s that is also a suffix of reverse(s) — i.e. the longest palindromic prefix of s. Whatever follows it must be mirrored in front. This is a classic "KMP in disguise" interview problem; recognizing the P#T shape is the whole trick.
Rapid-Fire Trivia (one-liners interviewers like)¶
- "What is
lps[lps.length-1]?" The length of the longest border of the whole string — andlength - thatis the smallest period. - "Can KMP run in less than
O(n)?" No — any exact matcher must read every text character in the worst case (Ω(n)lower bound). - "What is the worst-case comparison count?" At most
2n − 1with the strong failure function. - "Why a sentinel in
P#T?" So no border crosses the boundary; the sentinel must be absent from both strings. - "Does KMP do extra work on
"aaaa…"?" The build/search are stillO(m)/O(n); only non-amortized border-enumeration loops are quadratic on all-equal strings. - "KMP or Z-function in an interview?" Either; they are equivalent. Pick the one you can write bug-free under pressure.
- "What state does streaming KMP carry?" One integer
j(plus the immutablelps) — that is why it streams.
Mock Interview Walkthrough (Substring Search)¶
A model 20-minute flow an interviewer expects:
- Clarify (1 min): "Overlapping matches counted? Return first index or all? ASCII or Unicode? Can the pattern be empty?" Asking these signals seniority.
- Naive baseline (2 min): State the
O(nm)brute force and why it is slow (re-reading after partial matches). This motivates KMP. - Insight (3 min): "On a mismatch after matching a prefix, the matched text equals that prefix, so I can shift by the longest border without rewinding the text." Define
lps. - Build
lps(5 min): Code the prefix function; trace it on a small example aloud to catch the fall-back logic. - Search (4 min): Code the matcher; emphasize
inever moves backward and thej = lps[j-1]fall-back. - Complexity (2 min):
O(n+m)with the amortized rise-vs-fall argument. - Test (3 min): Walk one example, mention edge cases (empty pattern, overlaps, pattern longer than text) and that you would diff against a naive oracle.
Candidates who skip step 1 or 7 lose points even with correct code; the algorithm is necessary but not sufficient.
Whiteboard Tips¶
- State the invariant out loud: "
j= length of the longest pattern prefix that is a suffix of the text read so far." It makes the fall-back step obviously correct. - Build
lpson a small example first ("abab"or"aabaa") so the interviewer sees you understand borders before you write the matcher. - Defend
O(n+m)with the amortized argument —len/jfalls at most as often as it rises. Interviewers love this. - Mention the byproducts (period, repeated-pattern) — it signals depth beyond memorized code.
- Name the alternatives (Z-function, Rabin-Karp, Boyer-Moore, Aho-Corasick) and one-line each trade-off.
- Watch the two bugs:
i - m + 1for the start index, andj = lps[j-1](not0) after a match.
Summary¶
KMP interviews reward the failure-link insight plus disciplined execution: build the prefix function in amortized O(m), scan the text once in O(n) without rewinding, and reuse lps for periods, borders, and repeated-substring detection. Be ready to prove the linear bound with the rise-vs-fall amortized argument, to contrast KMP with the Z-function, Rabin-Karp, Boyer-Moore, and Aho-Corasick, and to dodge the overlap and off-by-one traps. The four challenges here — search, overlapping count, shortest period, repeated-substring-pattern — are the canonical set; if you can write all three language versions of each from memory and explain the invariant, you are ready.
Further Reading¶
- LeetCode 28 (Find the Index of the First Occurrence in a String), 459 (Repeated Substring Pattern), 1392 (Longest Happy Prefix), 686 (Repeated String Match).
- cp-algorithms.com — "Prefix function. Knuth-Morris-Pratt."
- Introduction to Algorithms (CLRS) §32.4.
- Sibling
02-z-function,03-rabin-karp,08-boyer-moore,05-aho-corasick. - The
string-algorithmsrepo skill for the broader pattern-matching toolbox. - Gusfield — Algorithms on Strings, Trees, and Sequences (prefix/Z-function unification).
- Sibling files
junior.md(intuition),middle.md(applications),professional.md(proofs) in this topic.
Final Checklist Before the Interview¶
- Can write
buildLPSfrom memory in Go, Java, and Python without looking. - Can write the matcher and explain why
inever moves backward. - Can state and defend the
O(n+m)amortized argument in two sentences. - Know the four canonical problems: search, overlapping count, shortest period, repeated-substring-pattern.
- Recognize the
P#Tshape in disguised problems (shortest palindrome, min prepend). - Can name the trade-offs vs Z-function, Rabin-Karp, Boyer-Moore, and Aho-Corasick in one line each.
- Remember the two killer bugs: start index
i - m + 1, andj = lps[j-1](not 0) after a match.
If every box is checked, you can handle KMP at any level the interview throws at you — from "find the substring" to "prove it is linear" to "now make it streaming."