Word Break (String Segmentation DP) — Senior Level¶
Word Break is a textbook DP on a whiteboard — but the moment the dictionary holds millions of entries, the input is adversarial, the segmentation count explodes, or the same string-splitting logic powers a real tokenizer/NLP pipeline, every detail (Trie vs Aho-Corasick, substring allocation, exponential output guards, modular counting, failure modes) becomes a correctness or performance incident.
Table of Contents¶
- Introduction
- Huge Dictionaries: Trie and Aho-Corasick
- Avoiding Exponential Blow-Up
- Tokenization and NLP Relevance
- Modular and Big-Integer Counting at Scale
- Memory and Allocation Engineering
- Code Examples
- Observability and Testing
- Failure Modes
- Summary
1. Introduction¶
At the senior level the question is no longer "what is the recurrence" but "what system am I building, and what breaks when I scale it?". Word Break shows up in three guises that look different but share one engine:
- Feasibility / counting under a fixed vocabulary — "can this be tokenized?", "how many tokenizations?". The dictionary may be huge; the answer is a boolean or an exact (often modular) count.
- Enumeration with an exponential ceiling — "give me all segmentations", where the output itself can be exponential and must be guarded, streamed, or capped.
- Real text segmentation / tokenization — word segmentation for languages without spaces (Chinese, Japanese, Thai), subword tokenization (BPE/WordPiece), and lexer construction, where Word Break is the skeleton and the scoring/ambiguity resolution is the production complication.
All three reduce to: build a fast membership/longest-match structure over the dictionary, run the prefix DP, and choose the right output mode (decide / count / one / all). The senior decisions are: which dictionary structure (hash set, Trie, Aho-Corasick), how to keep enumeration from exploding, how to keep the count exact and overflow-free, and how to make the inner loop allocation-free.
2. Huge Dictionaries: Trie and Aho-Corasick¶
2.1 Why the hash set degrades¶
The hash-set DP does s[j:i] in dict for each (j, i). Each check constructs a substring (O(L) allocation + copy) and hashes it (O(L)). For a long string against a big dictionary, this allocation churn dominates — O(n²) substring builds, each producing garbage for the collector. The asymptotics hide a brutal constant factor.
2.2 The Trie: longest-match without allocation¶
A Trie over the dictionary lets you, from each start position, walk s[start], s[start+1], … down the tree and emit every dictionary word starting there — zero substring allocation, pure pointer-following. From a segmentable position i, descend the Trie along s[i..]; at every end-of-word node at depth d, mark dp[i + d] = true. This is the standard production upgrade and the basis for everything below. Build cost is O(total dictionary characters); query is O(maxWordLen) per start, O(n · maxWordLen) overall, with no per-check garbage.
2.3 Aho-Corasick: all matches in one pass¶
When you want all dictionary words ending at every position of s in a single linear scan, Aho-Corasick is the tool. It is a Trie augmented with failure links (like KMP, but over the whole dictionary), letting you scan s once in O(n + total matches) and report, at each position i, every dictionary word that ends at i. Feed those matches into the DP: for each match (start, end), if dp[start] then set dp[end] = true. This decouples matching (O(n + Σ matches)) from the DP and is ideal when the dictionary is fixed and reused across many input strings.
The trade-off: Aho-Corasick has higher build complexity and memory (failure links, output links) than a plain Trie. Use it when the dictionary is large, stable, and queried against many strings; use a plain Trie for one-off or smaller dictionaries.
2.3a Choosing the dictionary structure — a decision table¶
| Structure | Build | Per-query | Memory | Best for |
|---|---|---|---|---|
| Hash set + substring | O(‖D‖) | O(n²) substrings | O(‖D‖) | tiny inputs, prototypes |
| Hash set + rolling hash | O(‖D‖) | O(n·L) expected | O(‖D‖) | medium, want no Trie code |
| Trie | O(‖D‖) | O(n·L), no alloc | O(‖D‖) nodes | the default production choice |
| Aho-Corasick | O(‖D‖) + links | O(n + matches) | larger (links) | huge fixed dict, many strings |
| DAWG / radix | O(‖D‖) | O(n·L) | far smaller | natural-language lexicons |
| Succinct (LOUDS) | O(‖D‖) | O(n·L) × small const | near-optimal bits | memory-constrained, web-scale |
The recurrence never changes; only this oracle does. Treat it as a pluggable interface so you can swap a hash set for a Trie for Aho-Corasick without touching the DP — the same abstraction-boundary discipline as the semiring choice.
2.4 Compressed and external dictionaries¶
For dictionaries too large for RAM (web-scale vocabularies), options include:
- Compressed tries (radix trees / DAWG) — collapse single-child chains and share suffixes; a DAWG (directed acyclic word graph) merges equivalent suffixes for major memory savings on natural-language lexicons.
- Succinct tries (LOUDS, etc.) — store the structure in near information-theoretic minimum space, trading a small constant-factor query slowdown for huge memory wins.
- Sharded / external membership — for truly massive vocabularies, a bloom filter front-end (to reject non-words cheaply) backed by an on-disk or distributed exact set.
The DP recurrence never changes; only the membership/longest-match oracle does. This is the same abstraction-boundary discipline as choosing a semiring in matrix problems — keep the DP generic over the dictionary structure.
3. Avoiding Exponential Blow-Up¶
3.1 The output-sensitivity wall¶
Word Break II is output-sensitive: its complexity is Ω(total length of all sentences), which for "aaaa…a" with {"a","aa"} is Θ(n · φⁿ). No cleverness escapes this — the answer set is exponentially large. The senior responsibility is to never accidentally trigger it.
3.2 Guards before you enumerate¶
- Decide first. Run Word Break I. If
dp[n]is false, return empty immediately. - Prune with reachability. Precompute
canReachEnd[i](= "iss[i..n)segmentable?"). During backtracking, only recurse intoendwhencanReachEnd[end]. This eliminates dead-end branches that produce nothing — the difference between instant failure and a timeout on"aaaa…ab". - Cap or stream. If the caller wants "up to
Ksentences" or "the lexicographically first", do not materialize the whole list — use a generator/iterator that yields on demand and stops afterK. - Detect the blow-up early. The counting DP (
O(n·L), polynomial) tells you how many sentences exist before you try to list them. If the count exceeds a threshold, refuse or switch to streaming.
3.3 The count-then-decide pattern¶
Because counting is polynomial and enumerating is exponential, a robust API computes the count first (mod a large prime, or exactly with big integers if it fits), and only enumerates when the count is small enough to be safe. This is the single most important architectural decision for any Word Break II endpoint exposed to untrusted input — it is a denial-of-service vector otherwise (a short adversarial string can demand exponential output).
3.4 A worked blow-up: "aaaa…a"¶
Concretely, s = "a" * 40, D = {"a", "aa"}:
decide (WB-I): true, instant (O(n·L))
count (#WB): F(41) = 165,580,141 ... in O(n·L), instant
enumerate (WB-II): 165,580,141 sentences, each ~40 chars
=> ~6.6 GB of output text. OOM on any normal machine.
The decision and the count both finish in microseconds; the enumeration is physically impossible to materialize. A naive wordBreakII call on this input takes down the process. The count (cheaply computed first) tells you in advance that enumeration is unsafe. This is why the senior rule is absolute: never call the enumerator without a count guard.
3.5 The reachability pruning, with code¶
def reachable_to_end(s, dict_set, max_len):
n = len(s)
reach = [False] * (n + 1)
reach[n] = True # empty suffix is segmentable
for i in range(n - 1, -1, -1):
for end in range(i + 1, min(n, i + max_len) + 1):
if s[i:end] in dict_set and reach[end]:
reach[i] = True
break
return reach # reach[i] == "is s[i..n) segmentable?"
During Word Break II backtracking, recurse into end only when reach[end] is true. On "aaaa…ab" (segmentable prefix, dead b), reach[0] is false, so the entire search is cut at the root — O(n) instead of an exponential dead-end walk.
4. Tokenization and NLP Relevance¶
4.1 Word segmentation for space-less languages¶
Chinese, Japanese, and Thai text has no spaces; segmenting it into words is exactly Word Break, with the lexicon as the dictionary. The naive Word Break is ambiguous (many valid segmentations), so production systems add scoring: choose the segmentation that maximizes a probability (e.g., the product of unigram word frequencies, or an HMM/CRF score). The DP recurrence changes from boolean OR to a max over scores — a min-plus/Viterbi-style DP:
This is the maximum-probability / longest-matching / shortest-path segmentation, and it is the same prefix DP with a scoring semiring instead of booleans.
4.2 Subword tokenization (BPE, WordPiece, SentencePiece)¶
Modern NLP tokenizers (used by LLMs) segment words into subword units from a learned vocabulary. WordPiece, in particular, does greedy longest-match Word Break against its vocabulary (with a special continuation prefix like ##). The DP/greedy-segmentation skeleton is identical; the vocabulary is learned rather than given. Understanding Word Break is directly understanding how a tokenizer splits text into the tokens an LLM consumes.
4.3 Lexing / regex tokenization¶
A lexer splits source code into tokens by matching against token patterns. When patterns are literal keywords/identifiers, this is Word Break with the keyword set as the dictionary; with regular expressions it generalizes to maximal-munch matching against a DFA. The connection: Word Break is the literal-dictionary special case of regular-expression tokenization, and the Trie/Aho-Corasick acceleration is the literal-set analogue of the DFA the regex engine builds.
4.4 The ambiguity problem¶
Real tokenization must resolve ambiguity, which pure Word Break does not. Strategies: greedy longest-match (fast, sometimes wrong), maximum-probability DP (Viterbi), or returning the full lattice of segmentations (the Word Break II output as a DAG) for a downstream model to score. The segmentation lattice — a DAG where edges are dictionary matches — is the compact representation of all segmentations without exponential enumeration, and it is what production NLP pipelines actually pass downstream.
4.5 Worked Viterbi segmentation¶
Maximum-probability segmentation in log-space (the production word-segmentation DP):
import math
def best_segmentation(s, log_prob, max_len):
"""log_prob: dict word -> log probability. Returns the most likely split."""
n = len(s)
NEG = float("-inf")
best = [NEG] * (n + 1)
parent = [-1] * (n + 1)
best[0] = 0.0
for i in range(1, n + 1):
for j in range(max(0, i - max_len), i):
w = s[j:i]
if best[j] != NEG and w in log_prob:
score = best[j] + log_prob[w]
if score > best[i]:
best[i] = score
parent[i] = j
if best[n] == NEG:
return None # not segmentable
words, i = [], n
while i > 0:
words.append(s[parent[i]:i])
i = parent[i]
return list(reversed(words))
# For "thtable" with P("the")>P("th"), the high-prob words win the max.
The recurrence is the (max, +) semiring over log-probabilities; working in log-space avoids the floating-point underflow that multiplying many small probabilities would cause. This is exactly how a tokenizer picks the most likely segmentation of space-less text among the (possibly exponentially many) valid ones — without enumerating them.
5. Modular and Big-Integer Counting at Scale¶
5.1 Overflow budget¶
The number of segmentations grows like φⁿ (or faster), overflowing 64-bit integers around n ≈ 90 for the {"a","aa"} family. Two strategies:
- Modular count: reduce
count[i]mod a prime (10⁹+7) at every addition. Always safe, gives the count modp. - Big integers: when the exact count is required (e.g., reporting "there are N tokenizations"), use arbitrary-precision integers; each addition is
O(digits)anddigits = Θ(n), so the DP becomesO(n² · n / wordsize)— still polynomial, but no longer linear in arithmetic.
5.2 CRT for large exact counts¶
As with other counting DPs, if you need an exact count larger than one prime can represent but want to avoid full big-integer cost, run the modular count under several coprime primes and reconstruct via CRT. Each prime is an independent DP pass — embarrassingly parallel.
5.3 Counting the shortest or longest segmentation¶
Beyond counting all segmentations, you often want the segmentation with fewest words (min DP) or most words (max DP):
This is a shortest-path-style DP (each dictionary word is a unit-cost edge); it answers "minimum number of words to tile s" in O(n·L). The scoring/probability variant of Section 4.1 is the weighted generalization.
5.4 Which query, which DP — a quick map¶
| Business question | DP variant | Semiring |
|---|---|---|
| "Is this a valid token stream?" | boolean | (OR, AND) |
| "How many ways to tokenize?" | counting (mod p) | (+, ×) |
| "Fewest tokens?" | min DP | (min, +1) |
| "Most likely tokenization?" | Viterbi (log-prob) | (max, +) |
| "Give me one tokenization" | boolean + parent ptrs | (OR, AND) |
| "Give me all tokenizations" | backtracking (guarded!) | list |
Pin down which of these the requirement actually asks for before writing code — the wrong choice (e.g. enumerating when they only wanted the count) is the most common and most expensive senior-level mistake here.
6. Memory and Allocation Engineering¶
6.1 Kill substring allocation¶
The dominant hidden cost in the hash-set DP is s[j:i] substring construction — O(n²) allocations. The Trie walk eliminates this entirely (it follows characters in place). In Go, prefer indexing s[j] byte-by-byte over slicing; in Java, the Trie avoids String.substring; in Python, the Trie avoids slice objects.
6.2 The dp array¶
The boolean dp is O(n) — negligible. For Word Break II, the memo stores lists of sentences per index, which is where memory explodes. Prefer:
- Storing the DAG of matches (
O(n + matches)) instead of materialized sentences. - Reconstructing on demand from parent pointers rather than caching full sentence lists.
6.3 Reuse the dictionary structure¶
Build the Trie/Aho-Corasick once and reuse it across all queries. A classic production regression is rebuilding the dictionary structure per request — wasting O(dict) build time and allocating fresh nodes each call. The structure is read-only after construction, so it is trivially shareable across threads and request handlers.
6.4 Encoding and Unicode¶
For real text, "character" is ambiguous: bytes vs runes vs grapheme clusters. Word Break must agree with the dictionary's notion of a character. In Go, iterate runes (not bytes) for multi-byte UTF-8; in Java, beware surrogate pairs; in Python, strings are already code points. Mismatched encoding between s and the dictionary is a silent correctness bug — the Trie walk follows the wrong units.
6.5 Allocation comparison, measured¶
The difference between the hash-set and Trie versions is almost entirely allocation, not asymptotics:
| Version | Substring allocations | GC pressure | Wall-clock (n=10⁵, big dict) |
|---|---|---|---|
Hash set + s[j:i] | O(n·L) per query | high | seconds, GC-bound |
| Trie walk | 0 | minimal | tens of ms |
// HOT PATH ANTI-PATTERN: allocates a fresh string every inner iteration.
if dict[s[j:i]] { dp[i] = true } // s[j:i] is a new string each time
// PREFERRED: walk the trie, follow bytes in place, zero allocation.
node = node.children[s[j]] // no new string built
For a service processing many requests, this allocation difference is the gap between meeting and missing a latency SLO. Always prefer the in-place Trie walk on the hot path, and build the Trie once at startup.
6.6 Cross-query caching with a shared dictionary¶
When many queries hit the same dictionary (a tokenizer service, an autocomplete backend), the dictionary structure is shared (Section 6.3) but the per-string dp is not — yet repeated or overlapping inputs recur. Two complementary caches help:
- Result cache (string → answer). Memoize the boolean/count answer keyed by the input string. For a hot working set (popular queries), an LRU turns an
O(n·L)recompute into anO(1)lookup. The key insight: the answer is a pure function of(string, dictionary version), so the cache must be invalidated on dictionary mutation — bump a version counter on every dictionary update and fold it into the key, or the cache will serve stale segmentability after a word is added/removed. - Incremental segmentation (append-only streaming). When a string grows by appending characters (a streaming tokenizer fed one chunk at a time), the prefix DP is monotone:
dp[0..n]computed for the old prefix is still valid; appendingmcharacters only requires fillingdp[n+1..n+m], reusing the retaineddparray. This makes incremental workO(m·L)per append instead of recomputingO((n+m)·L)from scratch — the same forward-only structure that lets the DP extend without revisiting decided cells.
A streaming segmenter therefore keeps the shared read-only Trie plus a per-stream dp slice that it grows in place. The result cache sits in front for whole-string repeats; the incremental dp handles the grow-by-suffix case. Both depend on the dictionary being immutable for the lifetime of a cached answer.
// Incremental, append-only Word Break: extend dp without recomputing the prefix.
type Segmenter struct {
root *trieNode
s []byte
dp []bool // dp[k] = s[:k] segmentable; len(dp) == len(s)+1
}
func NewSegmenter(root *trieNode) *Segmenter {
return &Segmenter{root: root, dp: []bool{true}} // dp[0] = true
}
// Append one chunk; only fills the new dp cells, reusing the old prefix DP.
func (sg *Segmenter) Append(chunk []byte) bool {
old := len(sg.s)
sg.s = append(sg.s, chunk...)
n := len(sg.s)
for len(sg.dp) <= n {
sg.dp = append(sg.dp, false)
}
for i := 0; i < n; i++ {
if !sg.dp[i] {
continue
}
node := sg.root
for j := i; j < n; j++ {
node = node.children[sg.s[j]]
if node == nil {
break
}
if node.end && j+1 > old { // only newly reachable cells
sg.dp[j+1] = true
}
}
}
return sg.dp[n]
}
// LRU result cache keyed by (string, dictionary version) for a shared dictionary.
import java.util.*;
public class CachedWordBreak {
private final Node root; // shared, read-only after build
private volatile long dictVersion; // bump on any dictionary mutation
private final LinkedHashMap<String, Boolean> cache;
public CachedWordBreak(Node root, int capacity) {
this.root = root;
this.cache = new LinkedHashMap<>(capacity, 0.75f, true) {
protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
return size() > capacity;
}
};
}
public synchronized void onDictionaryChanged() {
dictVersion++; // invalidate logically...
cache.clear(); // ...and physically drop stale answers
}
public boolean query(String s) {
String key = dictVersion + "