8.8 regexp — Find the Bug
Audience. You've read middle.md and senior.md, and you want to train your eye for the bugs that actually ship. Each snippet is short, looks roughly right, and contains at least one real bug from the patterns the earlier files describe. Read the snippet, find the bug, then read the analysis. The bugs are mostly contractual or performance — not visual.
1. Compiling in a hot loop
func looksLikeID(s string) bool {
re := regexp.MustCompile(`^[A-Z]{2}\d{6}$`)
return re.MatchString(s)
}
Analysis
regexp.MustCompile runs every time looksLikeID is called. A function that should cost roughly 50 ns (one regex match) now costs on the order of 10-50 µs (compile plus match), a 200-1000x slowdown. On a hot path, that's the difference between a 100 µs handler and a 50 ms handler.
The fix:
var idRE = regexp.MustCompile(`^[A-Z]{2}\d{6}$`)
func looksLikeID(s string) bool {
return idRE.MatchString(s)
}
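go vet doesn't catch this, and the staticcheck and gocritic rules usually cited for regexp code target different problems: SA1000 reports invalid patterns, and regexpMust suggests MustCompile for constant patterns. staticcheck's SA6000 flags the closely related mistake of calling the package-level regexp.Match functions in a loop. Keep them in your CI, but rely on benchmarks and profiles to catch a compile on a hot path.
2. Wrong replace syntax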
re := regexp.MustCompile(`(\w+) (\w+)`)
out := re.ReplaceAllString("Alice Smith", "\\2 \\1")
fmt.Println(out)
Analysis
The replacement uses \\2 \\1 (Perl/PCRE-style backreferences in the replacement). Go uses $N instead. The output is the literal text \2 \1, not Smith Alice.
The fix is to write the replacement as "$2 $1", which produces Smith Alice.
To put a literal $ in the replacement, use $$. To use the literal-replacement form (no $N interpretation at all), call ReplaceAllLiteralString.
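The replacement forms discussed above can be compared side by side. This is a minimal sketch; the helper names (swapNames, perlStyle, literalTemplate, escapedDollar) are made up for illustration and are not part of the original snippet.

```go
package main

import (
	"fmt"
	"regexp"
)

var nameRE = regexp.MustCompile(`(\w+) (\w+)`)

// swapNames uses $N references, which Go expands to the captured groups.
func swapNames(s string) string {
	return nameRE.ReplaceAllString(s, "$2 $1")
}

// perlStyle uses \N references, which Go copies through as literal text.
func perlStyle(s string) string {
	return nameRE.ReplaceAllString(s, `\2 \1`)
}

// literalTemplate turns off $ expansion for the whole replacement.
func literalTemplate(s string) string {
	return nameRE.ReplaceAllLiteralString(s, "$2 $1")
}

// escapedDollar shows $$ producing one literal dollar sign.
func escapedDollar(s string) string {
	return nameRE.ReplaceAllString(s, "$$ $1")
}

func main() {
	fmt.Println(swapNames("Alice Smith"))       // Smith Alice
	fmt.Println(perlStyle("Alice Smith"))       // \2 \1
	fmt.Println(literalTemplate("Alice Smith")) // $2 $1
	fmt.Println(escapedDollar("Alice Smith"))   // $ Alice
}
```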
3. The $1abc ambiguity
Analysis
The replacement string $1abc is parsed as "capture group named 1abc," which doesn't exist. The substitution becomes empty, so the output is count: (with no 42abc and no 42).
The fix is to use ${1}abc, which makes the boundary explicit and produces count: 42abc.
This is the most common "why does my replace produce the wrong thing" question on Go-related forums.
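A small sketch of the ambiguity, assuming a pattern of (\d+) and the input "count: 42" (both chosen to line up with the analysis above; the function names are illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

var numRE = regexp.MustCompile(`(\d+)`)

// ambiguous asks for a group named "1abc", which does not exist,
// so the reference expands to the empty string.
func ambiguous(s string) string {
	return numRE.ReplaceAllString(s, "$1abc")
}

// explicit uses braces to end the group reference after the 1.
func explicit(s string) string {
	return numRE.ReplaceAllString(s, "${1}abc")
}

func main() {
	fmt.Printf("%q\n", ambiguous("count: 42")) // "count: "
	fmt.Printf("%q\n", explicit("count: 42"))  // "count: 42abc"
}
```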
4. The longest-alternative pitfall
Analysis
Leftmost-first picks the first alternative that matches at the leftmost position. Both http and https match at position 0; because http comes first in the alternation, that's what's returned. The result is http, not https.
The fix is to put the longer alternative first, as in https|http.
Or use CompilePOSIX for leftmost-longest semantics — but at the cost of speed and Perl-syntax features.
The same bug strikes lexer-style patterns: (int|integer) matches integer as int because of leftmost-first ordering.
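The ordering effect is easy to check directly. The sketch below compares four ways of writing the pattern against the same URL; the variable names and the example URL are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	shortFirst = regexp.MustCompile(`http|https`)      // leftmost-first: picks http
	longFirst  = regexp.MustCompile(`https|http`)      // longer alternative listed first
	optionalS  = regexp.MustCompile(`https?`)          // no alternation at all
	posix      = regexp.MustCompilePOSIX(`http|https`) // leftmost-longest semantics
)

func main() {
	const url = "https://example.com"
	fmt.Println(shortFirst.FindString(url)) // http
	fmt.Println(longFirst.FindString(url))  // https
	fmt.Println(optionalS.FindString(url))  // https
	fmt.Println(posix.FindString(url))      // https
}
```

Writing https? sidesteps the ordering question entirely, which is often the simplest option when one alternative is a prefix of the other.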
5. FindString != "" for a bool check
Analysis
Two issues:
- Does more work than the question needs. FindString has to locate the full extent of the match in order to return it. For a bool decision, that's unnecessary work; MatchString can stop as soon as it knows a match exists.
- Empty match treated as no match. A pattern that can match empty (a*, \b, (?:)) returns "" on a successful empty match. The check above incorrectly says "no match" when an empty match did occur.
The fix is to call MatchString, which reports whether a match exists, empty or not.
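A minimal sketch of the difference, assuming the pattern a* (one of the empty-matching patterns listed above); hasMatchBuggy and hasMatch are illustrative names:

```go
package main

import (
	"fmt"
	"regexp"
)

var maybeAs = regexp.MustCompile(`a*`)

// hasMatchBuggy treats a successful empty match as "no match".
func hasMatchBuggy(s string) bool {
	return maybeAs.FindString(s) != ""
}

// hasMatch asks the question directly.
func hasMatch(s string) bool {
	return maybeAs.MatchString(s)
}

func main() {
	// a* matches the empty string at position 0 of "bbb".
	fmt.Println(hasMatchBuggy("bbb")) // false
	fmt.Println(hasMatch("bbb"))      // true
}
```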
6. bytes API unnecessarily wrapped through string
func extract(body []byte) string {
re := regexp.MustCompile(`status=(\w+)`)
m := re.FindStringSubmatch(string(body))
if m == nil { return "" }
return m[1]
}
Analysis
Beyond the compile-in-hot-path bug, the string(body) allocates a copy of the entire body. For a 1 MB body this is 1 MB of allocations per call.
The fix is to use the byte methods:
var statusRE = regexp.MustCompile(`status=(\w+)`)
func extract(body []byte) string {
m := statusRE.FindSubmatch(body)
if m == nil { return "" }
return string(m[1]) // small allocation: just the captured value
}
The *Regexp is the same object — only the input/output types differ.
7. Forgetting to anchor
var pathRE = regexp.MustCompile(`/users/(\d+)`)
func userID(path string) (int, bool) {
m := pathRE.FindStringSubmatch(path)
if m == nil { return 0, false }
n, _ := strconv.Atoi(m[1])
return n, true
}
Analysis
The pattern is unanchored, so it matches /users/42 and /api/users/42 and /admin/users/42/extra. If the caller assumes this validates the full path shape, it doesn't.
It's also slower than anchored: unanchored patterns must try every starting position, while \A anchors stop after position 0.
The fix is to anchor both ends, as in ^/users/(\d+)$.
Or accept the unanchored form but document the looseness ("matches any path that contains /users/<id>").
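One possible anchored version is sketched below. It also checks the Atoi error instead of discarding it, which the original snippet did not do; treat that as an editorial assumption about the intended behavior.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Anchored at both ends: the whole path must be /users/<digits>.
var pathRE = regexp.MustCompile(`^/users/(\d+)$`)

func userID(path string) (int, bool) {
	m := pathRE.FindStringSubmatch(path)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil { // for example, a digit run too long for int
		return 0, false
	}
	return n, true
}

func main() {
	fmt.Println(userID("/users/42"))       // 42 true
	fmt.Println(userID("/api/users/42"))   // 0 false
	fmt.Println(userID("/users/42/extra")) // 0 false
}
```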
8. (?i) for ASCII-only data
Analysis
Not strictly a bug — the pattern is correct — but (?i) enables full Unicode case-folding. For log levels which are guaranteed ASCII, that's wasted CPU.
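A faster alternative, when only these three casings ever appear in the data, is to spell them out (unlike (?i), this will not match mixed forms such as eRRor):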
var levelRE = regexp.MustCompile(`\b(?:ERROR|WARN|INFO|DEBUG|error|warn|info|debug|Error|Warn|Info|Debug)\b`)
Or even simpler: lowercase the input before matching.
var levelRE = regexp.MustCompile(`\b(error|warn|info|debug)\b`)
func match(line string) bool {
return levelRE.MatchString(strings.ToLower(line))
}
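strings.ToLower has an ASCII fast path in Go and is often faster than (?i) folding, though it allocates a new string whenever the line contains an uppercase letter, so benchmark on representative input.
9. Greedy .* swallowing too much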
re := regexp.MustCompile(`<a href="(.*)">`)
m := re.FindStringSubmatch(`<a href="x"><a href="y">`)
fmt.Println(m[1])
Analysis
The greedy .* matches as much as possible. The captured text is x"><a href="y, not x as the author probably intended.
The fix is to use a non-greedy quantifier or a more precise class:
re := regexp.MustCompile(`<a href="(.*?)">`) // lazy: stops at the first ">
// or
re := regexp.MustCompile(`<a href="([^"]*)">`) // class: anything but a quote
The class version is faster and cleaner because it expresses intent directly. Lazy quantifiers in RE2 don't change the cost model (there is no backtracking either way; the engine runs the same automaton simulation); they only change which match is preferred, and here that gives the expected result.
(And as always: don't parse HTML with regex. Use golang.org/x/net/html.)
10. Mutating Longest() from multiple goroutines
var re = regexp.MustCompile(`a|aa`)
func match(s string) string {
re.Longest()
return re.FindString(s)
}
Analysis
Longest() is a mutating method. Calling it concurrently with match operations races. The first call mutates the regex; the second call (from another goroutine) might race with a FindString in progress.
The fix: call Longest() once at setup, or use CompilePOSIX from the start.
var re = regexp.MustCompilePOSIX(`a|aa`) // already leftmost-longest
func match(s string) string {
return re.FindString(s)
}
11. Copy() left over from old code
Analysis
Copy has been deprecated since Go 1.12 (it was added in Go 1.6 to work around lock contention). The package now maintains a pool of per-call match state internally, so concurrent calls on the shared *Regexp are safe and don't contend. The Copy here adds an allocation per call without buying anything.
The fix is to delete the Copy call and match on the shared *Regexp directly.
When you find Copy in a code review, flag it. Removing it is safe.
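For reference, a sketch of what the Copy-free version can look like: several goroutines match against one shared *Regexp. The pattern, the countWords helper, and the inputs are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var wordRE = regexp.MustCompile(`\w+`)

// countWords matches each input on its own goroutine using the one shared
// *Regexp. No Copy is needed: the matching methods are safe for concurrent use.
func countWords(inputs []string) int {
	counts := make([]int, len(inputs)) // one slot per goroutine, nothing shared
	var wg sync.WaitGroup
	for i, in := range inputs {
		wg.Add(1)
		go func(i int, in string) {
			defer wg.Done()
			counts[i] = len(wordRE.FindAllString(in, -1))
		}(i, in)
	}
	wg.Wait()

	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	fmt.Println(countWords([]string{"a b", "c d e"})) // 5
}
```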
12. Missing scanner buffer size
var lineRE = regexp.MustCompile(`error: (\w+)`)
func processFile(path string) error {
f, err := os.Open(path)
if err != nil { return err }
defer f.Close()
s := bufio.NewScanner(f)
for s.Scan() {
if m := lineRE.FindStringSubmatch(s.Text()); m != nil {
log.Println(m[1])
}
}
return s.Err()
}
Analysis
bufio.Scanner has a default token cap of 64 KiB. On a single line longer than that, Scan() returns false and s.Err() reports bufio.ErrTooLong. Scanning stops at that point, so the matcher never sees the long line or anything after it.
The fix is to raise the cap with s.Buffer before the first call to Scan.
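A sketch of raising the cap, assuming a caller-chosen maximum line length. The maxLine parameter, the sampleLog input, and the findErrorsInString wrapper are hypothetical and exist only to make the example self-contained.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

var lineRE = regexp.MustCompile(`error: (\w+)`)

// findErrors scans r line by line, allowing lines up to maxLine bytes.
func findErrors(r io.Reader, maxLine int) ([]string, error) {
	s := bufio.NewScanner(r)
	// Start with a 64 KiB buffer and let the scanner grow it up to maxLine.
	s.Buffer(make([]byte, 0, 64*1024), maxLine)
	var found []string
	for s.Scan() {
		if m := lineRE.FindSubmatch(s.Bytes()); m != nil {
			found = append(found, string(m[1]))
		}
	}
	return found, s.Err()
}

// sampleLog returns a two-line log whose first line is about 100 KB long.
func sampleLog() string {
	return strings.Repeat("x", 100000) + " error: boom\nerror: bang\n"
}

func findErrorsInString(text string, maxLine int) ([]string, error) {
	return findErrors(strings.NewReader(text), maxLine)
}

func main() {
	// With a 64 KiB cap the scan stops at the long first line.
	fmt.Println(findErrorsInString(sampleLog(), 64*1024))
	// With a 1 MiB cap both lines are seen.
	fmt.Println(findErrorsInString(sampleLog(), 1<<20))
}
```

Pick the cap from what the input can realistically contain; an unbounded cap just moves the failure from ErrTooLong to memory pressure.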
This is a cross-leaf bug — see ../01-io-and-file-handling/senior.md section 16. It bites regex users particularly often because log lines from production services occasionally have huge embedded payloads.
13. Allocation in a hot loop via Text()
var ipRE = regexp.MustCompile(`\b\d{1,3}(\.\d{1,3}){3}\b`)
func countIPs(r io.Reader) (int, error) {
s := bufio.NewScanner(r)
n := 0
for s.Scan() {
if ipRE.MatchString(s.Text()) {
n++
}
}
return n, s.Err()
}
Analysis
s.Text() allocates a string per line. For a 100 MB log with a million lines, that's a million allocations.
The fix: use s.Bytes() and the []byte API.
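s.Bytes() returns a view into the scanner's buffer (no allocation); Match([]byte) accepts it directly. The view is only valid until the next call to Scan, which is fine here because Match doesn't retain it. The match operation is the same; only the input wrapping differs.
14. Naive replace double-scan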
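A sketch of the byte-based version of countIPs. The countIPsInString wrapper is an addition for demonstration and is not part of the original snippet.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

var ipRE = regexp.MustCompile(`\b\d{1,3}(\.\d{1,3}){3}\b`)

// countIPs counts lines that contain something shaped like an IPv4 address.
func countIPs(r io.Reader) (int, error) {
	s := bufio.NewScanner(r)
	n := 0
	for s.Scan() {
		// s.Bytes() is a view into the scanner's buffer; Match does not keep it.
		if ipRE.Match(s.Bytes()) {
			n++
		}
	}
	return n, s.Err()
}

// countIPsInString is a convenience wrapper for trying the function out.
func countIPsInString(text string) (int, error) {
	return countIPs(strings.NewReader(text))
}

func main() {
	fmt.Println(countIPsInString("10.0.0.1 ok\nnope\nfrom 192.168.1.1\n")) // 2 <nil>
}
```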
re := regexp.MustCompile(`\d+`)
out := re.ReplaceAllStringFunc(input, func(match string) string {
sm := re.FindStringSubmatch(match)
n, _ := strconv.Atoi(sm[0])
return strconv.Itoa(n * 2)
})
Analysis
Two issues:
- The callback receives the whole match. There's no need to call FindStringSubmatch inside it; the match is already match.
- Even if you did need submatches, calling FindStringSubmatch inside the callback runs the regex a second time on the same data. Wasteful.
For the simple case:
out := re.ReplaceAllStringFunc(input, func(match string) string {
n, _ := strconv.Atoi(match)
return strconv.Itoa(n * 2)
})
For a case where you genuinely need submatches in the callback, walk indices manually with FindAllStringSubmatchIndex — see middle.md section 14.
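As a rough illustration of the index-walking approach, here is a sketch that doubles the numeric value in key=value pairs while keeping the key. The pattern (\w+)=(\d+) and the doubleValues helper are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

var kvRE = regexp.MustCompile(`(\w+)=(\d+)`)

// doubleValues rewrites every key=N as key=2N in a single pass over input.
func doubleValues(input string) string {
	var b strings.Builder
	last := 0
	for _, loc := range kvRE.FindAllStringSubmatchIndex(input, -1) {
		// loc[0:2] is the whole match, loc[2:4] group 1, loc[4:6] group 2.
		b.WriteString(input[last:loc[0]]) // text between matches, unchanged
		key := input[loc[2]:loc[3]]
		n, err := strconv.Atoi(input[loc[4]:loc[5]])
		if err != nil {
			b.WriteString(input[loc[0]:loc[1]]) // leave an unparsable match as-is
		} else {
			b.WriteString(key + "=" + strconv.Itoa(n*2))
		}
		last = loc[1]
	}
	b.WriteString(input[last:]) // trailing text after the final match
	return b.String()
}

func main() {
	fmt.Println(doubleValues("a=1, b=20 end")) // a=2, b=40 end
}
```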
15. Trusting an Atoi after \d
var idRE = regexp.MustCompile(`id=(\d+)`)
func extractID(s string) int {
m := idRE.FindStringSubmatch(s)
if m == nil { return 0 }
n, _ := strconv.Atoi(m[1])
return n
}
Analysis
The pattern guarantees m[1] is one or more digits, so Atoi can't fail with a syntax error. It can still fail with a range error: \d+ matches arbitrary-length digit strings, one with 25 digits won't fit in int, and Atoi then returns the platform maximum along with a *strconv.NumError wrapping strconv.ErrRange.
Discarding the error silently turns an overflowing ID into the maximum int.
The fix: cap the digit count, or check the error. Note that a bare cap truncates a longer run to its first 18 digits; append \b to the pattern if such input should be rejected instead.
var idRE = regexp.MustCompile(`id=(\d{1,18})`) // capped digits
// or
n, err := strconv.Atoi(m[1])
if err != nil { return 0 } // overflow handled explicitly instead of ignored
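One way to make the failure visible to callers is to return the error instead of a sentinel 0. This is a sketch; the changed signature is an assumption, not the original author's API.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

var idRE = regexp.MustCompile(`id=(\d+)`)

// extractID returns the parsed ID, or an error when there is no ID
// or the digits do not fit in an int.
func extractID(s string) (int, error) {
	m := idRE.FindStringSubmatch(s)
	if m == nil {
		return 0, errors.New("no id found")
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, fmt.Errorf("id %q: %w", m[1], err)
	}
	return n, nil
}

func main() {
	fmt.Println(extractID("id=42"))
	fmt.Println(extractID("id=9999999999999999999999999")) // range error
}
```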
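16. (?P<name> versus (?<name>
Analysis
Before Go 1.22 this is a compile error: "invalid or unsupported Perl syntax: (?<". Older releases accept only the Python-style (?P<name>...); Go 1.22 and later also accept the Perl-style (?<name>...).
The fix, if the code has to build with older toolchains, is to write (?P<name>...).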
Migrators from JavaScript, Java, .NET, or PCRE hit this constantly.
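A short sketch of the (?P<name>...) form together with SubexpIndex. The date pattern and the field helper are invented for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// (?P<name>...) is accepted by every Go release that has named groups.
var dateRE = regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})`)

// field returns the text captured by the named group, or "" if the input
// does not match or the name is unknown.
func field(s, name string) string {
	m := dateRE.FindStringSubmatch(s)
	i := dateRE.SubexpIndex(name) // -1 when no group has that name
	if m == nil || i < 0 {
		return ""
	}
	return m[i]
}

func main() {
	fmt.Println(field("released 2024-05", "year"))  // 2024
	fmt.Println(field("released 2024-05", "month")) // 05
}
```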
17. \b in a []byte context with non-ASCII
Analysis
\b is a word-boundary assertion. In Go's regex, "word" is defined by \w — which is ASCII-only [0-9A-Za-z_]. So between é (a non-word character per Go's definition) and w, there's a word boundary. The pattern matches both word instances inside préword and wordé.
If you intended Unicode word boundaries, you have to build them manually with Unicode classes such as \p{L} and \p{N}.
Or pre-segment the input using bufio.Scanner with a custom split function.
This is a real footgun for international text. Document the ASCII-only behavior of \b in any code that's expected to handle mixed-script input.
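Because RE2 has no lookaround, one way to approximate a Unicode-aware boundary is to require either an edge of the text or a character that is not a letter, digit, or underscore on each side. The sketch below assumes the literal being searched for is word; it is only one possible emulation.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// ASCII-only boundaries: é counts as a non-word character.
	asciiRE = regexp.MustCompile(`\bword\b`)
	// Emulated Unicode boundaries: a text edge or a non-letter, non-digit,
	// non-underscore character on each side.
	unicodeRE = regexp.MustCompile(`(?:^|[^\p{L}\p{N}_])word(?:$|[^\p{L}\p{N}_])`)
)

func main() {
	for _, s := range []string{"préword", "wordé", "a word here", "word"} {
		fmt.Printf("%-12q ascii=%-5v unicode=%v\n",
			s, asciiRE.MatchString(s), unicodeRE.MatchString(s))
	}
}
```

The emulated form consumes the neighboring characters, so it suits yes/no checks better than extraction; with FindAll, two occurrences separated by a single character would share that character and the second could be missed.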
18. Building patterns from user input without escaping
func search(needle, haystack string) bool {
re := regexp.MustCompile(needle)
return re.MatchString(haystack)
}
Analysis
Two bugs:
- MustCompile panics on bad patterns. If needle is (, the call panics. Use Compile and handle the error.
- The user can inject regex metacharacters. Searching for a.b matches axb as well as the literal a.b. If you want a substring search, escape the input with regexp.QuoteMeta.
The fix depends on intent:
// User wants to search for an exact literal string:
re, err := regexp.Compile(regexp.QuoteMeta(needle))
if err != nil { return false }
// User wants to write regex syntax, you accept that:
re, err := regexp.Compile(needle)
if err != nil { return false }
regexp.QuoteMeta is the package's escape-for-literal helper. Use it any time the input is supposed to be a string, not a pattern.
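The two intents can be sketched as separate helpers. The names containsLiteral and matchesPattern are hypothetical, and both compile per call only to keep the example short.

```go
package main

import (
	"fmt"
	"regexp"
)

// containsLiteral treats needle as plain text. The error from Compile is
// still checked rather than assumed impossible.
func containsLiteral(needle, haystack string) (bool, error) {
	re, err := regexp.Compile(regexp.QuoteMeta(needle))
	if err != nil {
		return false, err
	}
	return re.MatchString(haystack), nil
}

// matchesPattern treats pattern as regex syntax supplied by the caller.
func matchesPattern(pattern, haystack string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(haystack), nil
}

func main() {
	fmt.Println(containsLiteral("a.b", "axb")) // false <nil>
	fmt.Println(matchesPattern("a.b", "axb"))  // true <nil>
	_, err := matchesPattern("(", "anything")
	fmt.Println(err != nil) // true
}
```

For a pure substring check with no regex features involved, strings.Contains does the same job without compiling anything.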
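19. Split on empty pattern produces surprises
Analysis
The empty pattern matches at every position, including before the first character and after the last. Go's Split drops the empty match at position 0 and adds no trailing piece after a match that ends at the end of the input, so the result is ["a", "b", "c"]: one piece per character. Callers expecting "split by some delimiter" get character-sized pieces and are confused. Empty strings do show up with non-empty delimiters: a match at the very start or end of the input, or two adjacent matches, each produce a "" piece.
If you want to split into individual runes, say so directly with strings.Split(input, "") or a []rune conversion.
If you want regex-driven splitting without the empty pieces, filter them out: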
parts := re.Split(input, -1)
out := parts[:0]
for _, p := range parts {
if p != "" {
out = append(out, p)
}
}
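A quick way to see what Split returns for a given pattern is to print the pieces with %q. In this sketch the helpers splitOn, nonEmpty, and joined are illustrative, and the pattern is compiled per call only to keep the demo short.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// splitOn compiles pattern and splits input on every match.
func splitOn(pattern, input string) []string {
	return regexp.MustCompile(pattern).Split(input, -1)
}

// nonEmpty filters out empty pieces in place.
func nonEmpty(parts []string) []string {
	out := parts[:0]
	for _, p := range parts {
		if p != "" {
			out = append(out, p)
		}
	}
	return out
}

// joined renders pieces with a visible separator for easy comparison.
func joined(parts []string) string {
	return strings.Join(parts, "|")
}

func main() {
	fmt.Printf("%q\n", splitOn(``, "abc"))               // ["a" "b" "c"]
	fmt.Printf("%q\n", splitOn(`,`, ",a,,b,"))           // ["" "a" "" "b" ""]
	fmt.Printf("%q\n", nonEmpty(splitOn(`,`, ",a,,b,"))) // ["a" "b"]
	fmt.Printf("%q\n", strings.Split("héllo", ""))       // ["h" "é" "l" "l" "o"]
}
```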
20. MustCompile on user input
func searchHandler(w http.ResponseWriter, r *http.Request) {
pattern := r.FormValue("q")
re := regexp.MustCompile(pattern)
results := matchAll(re, corpus)
json.NewEncoder(w).Encode(results)
}
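Analysis
MustCompile panics on a bad pattern. net/http recovers panics raised inside a handler, so the process survives, but a user submitting ( gets a dropped connection and the server logs a stack trace instead of returning a clean 400. The same call made on a goroutine with no recover in place takes down the entire process. Either way, a malicious user can burn server resources and flood the logs by sending a stream of bad patterns.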
The fix is Compile plus error handling:
re, err := regexp.Compile(pattern)
if err != nil {
http.Error(w, "invalid pattern: "+err.Error(), 400)
return
}
Add the bounds from professional.md section 2: length cap, complexity cap, possibly compile cache.
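A sketch of Compile plus a length cap, factored into a helper so it can be exercised without a running server. The maxPatternLen value of 256 and the compileUserPattern name are assumptions; the handler body is a placeholder rather than the original matchAll logic.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"regexp"
)

// maxPatternLen is a hypothetical cap; choose one that fits your inputs.
const maxPatternLen = 256

// compileUserPattern validates and compiles an untrusted pattern.
func compileUserPattern(p string) (*regexp.Regexp, error) {
	if p == "" {
		return nil, errors.New("empty pattern")
	}
	if len(p) > maxPatternLen {
		return nil, fmt.Errorf("pattern longer than %d bytes", maxPatternLen)
	}
	return regexp.Compile(p)
}

func searchHandler(w http.ResponseWriter, r *http.Request) {
	re, err := compileUserPattern(r.FormValue("q"))
	if err != nil {
		http.Error(w, "invalid pattern: "+err.Error(), http.StatusBadRequest)
		return
	}
	// Placeholder response; a real handler would run the search here.
	fmt.Fprintf(w, "compiled %q\n", re.String())
}

func main() {
	_, err := compileUserPattern("(")
	fmt.Println(err != nil) // true: reported as an error, no panic
}
```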
21. Mismatched submatch count
Analysis
The pattern has two capture groups. When the input is a, only the first group participates; the second one didn't match, so its entry in the submatch slice is "" — but the slice itself has length NumSubexp+1 = 3. The print shows an empty string, which the author probably didn't expect.
The fix is to check submatch indices explicitly using the index API:
m := re.FindStringSubmatchIndex("a")
if m[4] >= 0 {
fmt.Println("matched group b")
} else {
fmt.Println("matched group a")
}
Or use named captures with SubexpIndex:
re := regexp.MustCompile(`(?P<a>a)|(?P<b>b)`)
m := re.FindStringSubmatch("a")
ai := re.SubexpIndex("a")
bi := re.SubexpIndex("b")
if m[ai] != "" { /* group a */ }
if m[bi] != "" { /* group b */ }
22. Reading: what to read next
- optimize.md — when the bug isn't correctness but throughput.
- tasks.md — exercises building each of the patterns these bugs subvert.
- interview.md — pattern-spotting questions in conversational form.