Concurrent Fuzzing — Find the Bug¶
Real fuzz failure reports, with diagnosis and fix. Each scenario is presented as the fuzzer would surface it: a stack trace, the offending input, and a question. The diagnosis follows after a deliberate pause.
Table of Contents¶
- Scenario 1: A panic in a parser
- Scenario 2: A race in a "thread-safe" counter
- Scenario 3: A race in a "thread-safe" cache
- Scenario 4: A timeout (suspected deadlock)
- Scenario 5: A round-trip mismatch in JSON
- Scenario 6: A non-deterministic failure
- Scenario 7: A non-linearisable history
- Scenario 8: Bug found in the fuzz harness itself
Scenario 1: A panic in a parser¶
Report¶
--- FAIL: FuzzParseHeader (0.02s)
--- FAIL: FuzzParseHeader (0.00s)
panic: runtime error: index out of range [0] with length 0 [recovered]
panic: runtime error: index out of range [0] with length 0
goroutine 7 [running]:
...
myapp/parser.ParseHeader(...)
/home/me/myapp/parser/header.go:42
myapp/parser.FuzzParseHeader.func1(0xc000012340, {0x0, 0x0, 0x0})
/home/me/myapp/parser/header_fuzz_test.go:23
Failing input written to testdata/fuzz/FuzzParseHeader/a3f7e1...
The code¶
func ParseHeader(data []byte) (Header, error) {
if data[0] == 0x80 {
return Header{}, errInvalid
}
// ...
}
The input¶
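The reproducer, in Go's corpus encoding (the empty input, consistent with the diagnosis below):

go test fuzz v1
[]byte("")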
Question¶
Why does the fuzzer fail on the seemingly simple empty input? What is the fix?
Diagnosis: The function indexes data[0] without checking length. On an empty slice, data[0] panics. The fuzzer feeds an empty slice as one of its first mutations.
Fix:
func ParseHeader(data []byte) (Header, error) {
if len(data) == 0 {
return Header{}, errEmpty
}
if data[0] == 0x80 {
return Header{}, errInvalid
}
// ...
}
Regression: the empty-input reproducer is now committed under testdata/fuzz/FuzzParseHeader/. Any future regression that re-introduces the missing length check is caught instantly.
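For reference, a harness of the shape the stack trace implies (a sketch; the original test file is not shown) needs only the property "never panics":

func FuzzParseHeader(f *testing.F) {
	f.Add([]byte{0x80}) // seed: a minimal valid-looking prefix
	f.Fuzz(func(t *testing.T, data []byte) {
		_, _ = ParseHeader(data) // property: returns or errors, never panics
	})
}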
Scenario 2: A race in a "thread-safe" counter¶
Report¶
==================
WARNING: DATA RACE
Write at 0x00c000018028 by goroutine 8:
myapp/counter.(*Counter).Inc()
/home/me/myapp/counter/counter.go:18 +0x44
Previous read at 0x00c000018028 by goroutine 9:
myapp/counter.(*Counter).Value()
/home/me/myapp/counter/counter.go:32 +0x44
Goroutine 8 (running) created at:
myapp/counter.FuzzCounterReads.func1.1()
/home/me/myapp/counter/counter_fuzz_test.go:25 +0x144
...
==================
--- FAIL: FuzzCounterReads (0.03s)
Failing input written to testdata/fuzz/FuzzCounterReads/4b2e9c...
FAIL
exit status 66
The code¶
type Counter struct {
mu sync.Mutex
n int
}
func (c *Counter) Inc() {
c.mu.Lock()
c.n++
c.mu.Unlock()
}
func (c *Counter) Value() int {
return c.n // <-- no lock here
}
Question¶
Inc holds the lock. Why is the race detector unhappy about Value?
Diagnosis: Value reads c.n without acquiring mu. A concurrent Inc writes c.n under the lock, but a concurrent Value reads outside the lock. There is no happens-before edge between the write and the read; the race detector flags it correctly.
Fix:
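func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}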
Or use sync/atomic:
type Counter struct {
n atomic.Int64
}
func (c *Counter) Inc() { c.n.Add(1) }
func (c *Counter) Value() int { return int(c.n.Load()) }
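To provoke the race reliably under -race, a harness of the shape the report implies (a sketch, not the original test) reads and writes from concurrent goroutines:

func FuzzCounterReads(f *testing.F) {
	f.Fuzz(func(t *testing.T, n uint8) {
		var c Counter
		var wg sync.WaitGroup
		for i := 0; i < int(n)%8+2; i++ {
			wg.Add(2)
			go func() { defer wg.Done(); c.Inc() }()
			go func() { defer wg.Done(); _ = c.Value() }()
		}
		wg.Wait()
	})
}

Run it with go test -race -fuzz=FuzzCounterReads; the detector flags the unlocked read almost immediately.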
Lesson: "thread-safe" must apply to all methods, not just the obvious mutators. Reads need synchronisation too.
Scenario 3: A race in a "thread-safe" cache¶
Report¶
==================
WARNING: DATA RACE
Write at 0x00c0000a4060 by goroutine 12:
runtime.mapassign_faststr()
/usr/local/go/src/runtime/map_faststr.go:208 +0x0
myapp/cache.(*Cache).Put()
/home/me/myapp/cache/cache.go:34 +0xfc
Previous read at 0x00c0000a4060 by goroutine 13:
runtime.mapaccess2_faststr()
/usr/local/go/src/runtime/map_faststr.go:13 +0x0
myapp/cache.(*Cache).Get()
/home/me/myapp/cache/cache.go:25 +0xa8
==================
The code¶
type Cache struct {
mu sync.Mutex
m map[string]string
}
func (c *Cache) Get(k string) (string, bool) {
c.mu.Lock()
if v, ok := c.m[k]; ok {
c.mu.Unlock()
return v, true
}
return "", false // <-- forgot to unlock!
}
func (c *Cache) Put(k, v string) {
c.mu.Lock()
defer c.mu.Unlock()
c.m[k] = v
}
Question¶
The race detector points at the map assignment. But the bug is elsewhere. Where?
Diagnosis: the Unlock lives only in the hit branch. On a cache miss, Get returns with the mutex still held, so the next caller's Lock blocks forever. The primary symptom of this bug is a hang; any race the detector reports afterwards is a secondary effect of the broken lock discipline, since map accesses are no longer consistently protected. The root cause is the early return that skips the unlock.
Fix:
func (c *Cache) Get(k string) (string, bool) {
c.mu.Lock()
defer c.mu.Unlock()
v, ok := c.m[k]
return v, ok
}
Lesson: prefer defer mu.Unlock() over manual mu.Unlock(). Manual unlock is one early-return away from a leak.
Scenario 4: A timeout (suspected deadlock)¶
Report¶
panic: test timed out after 10m0s
running tests:
FuzzPipeline (10m0s)
goroutine 1 [chan receive, 10 minutes]:
testing.(*F).Fuzz(...)
/usr/local/go/src/testing/fuzz.go:200 +0x4cc
myapp.FuzzPipeline(0xc000012340)
/home/me/myapp/pipeline_fuzz_test.go:18 +0x12c
The code¶
func RunPipeline(in []byte) []byte {
ch1 := make(chan byte)
ch2 := make(chan byte)
done := make(chan []byte)
go func() {
for _, b := range in {
ch1 <- b
}
close(ch1)
}()
go func() {
for b := range ch1 {
ch2 <- b * 2
}
// forgot to close(ch2)
}()
go func() {
var out []byte
for b := range ch2 {
out = append(out, b)
}
done <- out
}()
return <-done
}
Question¶
Why does the fuzz target time out?
Diagnosis: The middle goroutine never closes ch2. The third goroutine for b := range ch2 blocks forever waiting for either a value or a close. The main goroutine blocks on <-done. Deadlock.
The fuzzer hit this on its very first input: even the empty input deadlocks, because ch2 is never closed. The timeout caught it.
Fix:
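go func() {
	for b := range ch1 {
		ch2 <- b * 2
	}
	close(ch2) // the last sender closes; the consumer's range can now end
}()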
Lesson: the sending side of a channel owns its close: exactly one goroutine, the last sender, should close it when consumers range over it. The fuzzer's timeout-based hang detection caught this; without -fuzz, this bug might lurk for months until a downstream consumer happened to range over an unclosed channel.
Scenario 5: A round-trip mismatch in JSON¶
Report¶
--- FAIL: FuzzJSONRoundTrip (0.05s)
--- FAIL: FuzzJSONRoundTrip (0.00s)
round_trip_test.go:24: round-trip mismatch
first: map[string]interface{}{"a":1.7976931348623157e+308}
second: map[string]interface{}{"a":+Inf}
Failing input written to testdata/fuzz/FuzzJSONRoundTrip/8c4d1e...
The input¶
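Reconstructed from the report (a corpus file holding JSON whose value decodes to math.MaxFloat64):

go test fuzz v1
[]byte("{\"a\":1.7976931348623157e+308}")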
Question¶
Both values look reasonable. What is the actual round-trip mismatch?
Diagnosis: 1.7976931348623157e+308 is the maximum finite float64 (math.MaxFloat64). The first json.Unmarshal lands exactly on it, and json.Marshal re-emits the same decimal string. On the second decode, though, the final decimal-to-binary rounding step sits right on the boundary of the representable range and saturates to +Inf.
This is a known limitation of round-trip equality for floats. The fix is not to "fix" the encoding but to relax the invariant:
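A minimal sketch of one relaxation, assuming the harness decodes into interface{} values as the report suggests: canonicalise through one round trip first, then require stability from that point on.

var v1 interface{}
if err := json.Unmarshal(data, &v1); err != nil {
	t.Skip("not valid JSON")
}
b1, err := json.Marshal(v1)
if err != nil {
	t.Skip("value not re-encodable")
}
// b1 is the canonical form; the relaxed invariant is that it is now stable.
var v2 interface{}
if err := json.Unmarshal(b1, &v2); err != nil {
	return // boundary values may refuse to re-decode; out of scope here
}
b2, err := json.Marshal(v2)
if err != nil {
	return // e.g. the value saturated to +Inf, which JSON cannot encode
}
if !bytes.Equal(b1, b2) {
	t.Fatalf("unstable after canonicalisation:\n%s\n%s", b1, b2)
}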
Or use json.Number mode so the value stays a string:
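dec := json.NewDecoder(bytes.NewReader(data))
dec.UseNumber() // numbers decode as json.Number (the literal digits), not float64
var v interface{}
if err := dec.Decode(&v); err != nil {
	t.Skip("not valid JSON")
}
// Re-encoding a json.Number writes the original digits back verbatim,
// so the float64 boundary never enters the picture.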
Lesson: round-trip invariants must account for the precision and semantics of the format. JSON does not promise float round-tripping near the limits. A "failure" here is the fuzzer correctly identifying a real semantic limitation of the format.
Scenario 6: A non-deterministic failure¶
Report¶
--- FAIL: FuzzPipeline (0.43s)
pipeline_fuzz_test.go:31: expected ok, got !ok
Failing input written to testdata/fuzz/FuzzPipeline/9d2b73...
You re-run:
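$ go test -run=FuzzPipeline/9d2b73 -race
PASS
ok      myapp   0.014s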
Question¶
The fuzzer found a failure but you cannot reproduce it. What is wrong?
Diagnosis: The fuzz function depends on non-determinism somewhere. Likely candidates:
- Time. The fuzz body or system under test uses time.Now(), time.Sleep, or a deadline.
- Unseeded rand. A rand.Intn somewhere uses the default Source, which is seeded fresh each process, so runs differ.
- Scheduler-dependent assertions. An assertion runs before a WaitGroup has actually waited (rare but possible).
- Map iteration order. Two iterations over the same map produce different orders; iterating and asserting on order is non-deterministic.
Diagnostic strategy:
- Re-run with -count=100: go test -run=FuzzPipeline/9d2b73 -count=100 -race. If 5/100 runs fail, the bug is real but rare.
- Audit the fuzz function and the SUT for time.Now, rand, and map iteration.
- Stub time with a fake clock; seed rand from the fuzz input; sort map keys before iterating.
- Re-run the fuzzer; the next failure should reproduce deterministically.
Fix sketch:
f.Fuzz(func(t *testing.T, seed int64, data []byte) {
rng := rand.New(rand.NewSource(seed))
clk := newFakeClock()
p := NewPipeline(WithRand(rng), WithClock(clk))
_ = p.Run(data)
})
Lesson: non-determinism in fuzz targets is the highest-priority bug class. It obscures every future fuzz failure. Fix it before anything else.
Scenario 7: A non-linearisable history¶
Report (from porcupine)¶
--- FAIL: FuzzKVLinearisability (1.21s)
kv_fuzz_test.go:62: history is not linearisable
Operations:
[#0 set("k", 1) start=... end=...]
[#1 get("k") -> 0 start=... end=...]
[#2 set("k", 2) start=... end=...]
[#3 get("k") -> 1 start=... end=...] ← problem
No serial order is consistent with these intervals.
The code¶
type KV struct {
mu sync.Mutex
m map[string]int
// Custom Get that snapshots without lock for "performance":
}
func (kv *KV) Get(k string) int {
return kv.m[k] // <-- unlocked
}
Question¶
Why is the history not linearisable?
Diagnosis: each return value constrains where an operation can take effect. get -> 0 (#1) must linearise before set("k", 1) (#0); get -> 1 (#3) must linearise after set 1 but before set("k", 2) (#2). The checker's verdict tells us the recorded intervals rule that out: #2 completed before #3 began, so real time forces #3 after #2, where the only permissible result is 2.
Candidate serial orders:
- set 1, get -> 0, set 2, get -> 1: invalid; the first get would have to see 1, not 0.
- get -> 0, set 1, get -> 1, set 2: value-consistent, but it places #3 before #2, which the real-time order forbids.
The unlocked Get reads the map with no happens-before edge to the writers, so it can observe stale values. No serial order is consistent with the history; the KV store is not linearisable.
Fix: lock Get:
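func (kv *KV) Get(k string) int {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	return kv.m[k]
}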
Or use sync.RWMutex's RLock if reads vastly dominate.
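A sketch of that variant (Set assumed to exist, as the operation history shows):

type KV struct {
	mu sync.RWMutex
	m  map[string]int
}

func (kv *KV) Get(k string) int {
	kv.mu.RLock() // shared lock: concurrent readers do not block each other
	defer kv.mu.RUnlock()
	return kv.m[k]
}

func (kv *KV) Set(k string, v int) {
	kv.mu.Lock() // exclusive lock: writers wait for readers to drain
	defer kv.mu.Unlock()
	kv.m[k] = v
}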
Lesson: linearisability checking is a powerful tool. A history of just 4 operations exposes the inconsistency. Without it you would only see "weird reads sometimes." The fuzzer + porcupine combo gives you both inputs and proofs.
Scenario 8: Bug found in the fuzz harness itself¶
Report¶
--- FAIL: FuzzQueueConcurrent (0.01s)
panic: runtime error: index out of range [3] with length 3 [recovered]
panic: runtime error: index out of range [3] with length 3
goroutine 17 [running]:
myapp/queue.decodeOps(...)
/home/me/myapp/queue/queue_fuzz_test.go:18
The panic is in decodeOps, inside the fuzz test file, not in the system under test.
The code¶
func decodeOps(data []byte) []op {
var out []op
for i := 0; i < len(data); i += 4 {
out = append(out, op{
kind: data[i],
key: string(data[i+1 : i+3]),
val: int(data[i+3]),
})
}
return out
}
Question¶
What is wrong?
Diagnosis: The decoder indexes data[i+3] without checking that a full 4-byte record remains (i+3 < len(data)). On a trailing partial chunk it panics: an input like [0x00, 0x01, 0x02] (length 3) slices data[1:3] successfully, then indexes data[3], which is out of range.
Fix:
func decodeOps(data []byte) []op {
var out []op
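	// Only complete 4-byte records are decoded; a trailing
	// partial chunk is ignored rather than indexed out of range.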
for i := 0; i+4 <= len(data); i += 4 {
out = append(out, op{
kind: data[i],
key: string(data[i+1 : i+3]),
val: int(data[i+3]),
})
}
return out
}
Lesson: the decoder must be total. Any panic in the harness obscures the system under test. Write a tiny fuzz test that exercises just the decoder (f.Fuzz(func(t *testing.T, data []byte) { decodeOps(data) })) to validate this property.
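Spelled out as a standalone sketch:

func FuzzDecodeOps(f *testing.F) {
	f.Fuzz(func(t *testing.T, data []byte) {
		_ = decodeOps(data) // the decoder must be total: no panic on any input
	})
}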
This is also a reason to make harness code as boring as possible. Bugs in the harness produce mysterious-looking fuzz reports that waste hours of debugging.
Summary¶
Eight scenarios, each a real-world shape of fuzz failure: panic in a parser, data race on shared state, partial unlock paths, deadlock from unclosed channels, semantic round-trip mismatches at float boundaries, non-deterministic failures that signal a hidden time/rand dependency, linearisability violations under concurrent execution, and bugs in the fuzz harness itself. The recurring diagnostic moves are: (1) reproduce locally with go test -run=FuzzXxx/<hash>, (2) read the input and the stack trace together, (3) ask whether the bug is in the SUT or the harness, (4) check for non-determinism if reproduction fails, (5) commit the reproducer and a stress-replay test. Master these moves and triage time drops from hours to minutes.