Wrong Data Structure — Exercises¶
Category: Performance Anti-Patterns → Wrong Data Structure — hands-on practice swapping the collection to fit the access pattern.
These are fix-it exercises, not recognition quizzes (find-bug.md does recognition). For each you get a problem, starting code (Go, Java, or Python — the language varies on purpose), acceptance criteria, and a collapsible solution with the big-O before/after and, where it matters, a benchmark. The point is to make the change and state the cost-model reason it's faster.
How to use this file. Read the problem, do it in your editor before opening the solution, and write down the before/after big-O yourself. Refer back to
middle.mdfor the cost-table cheat sheet and the canonical fixes.
Table of Contents¶
| # | Exercise | Wrong → Right | Lang | Difficulty |
|---|---|---|---|---|
| 1 | Replace the membership scan | list → set | Python | ★ easy |
| 2 | Replace find-by-key scans | list scan → map | Go | ★★ medium |
| 3 | Top-k without sorting the world | full sort → heap | Java | ★★ medium |
| 4 | Fix the O(n) pop-front queue | list → deque | Python | ★ easy |
| 5 | Turn a nested-loop join into a hash join | nested loop → hash join | Go | ★★ medium |
| 6 | Find the small-n crossover | benchmark slice vs map | Go | ★★★ hard |
Exercise 1 — Replace the membership scan¶
Wrong → Right: list → set · Language: Python · Difficulty: ★ easy
tag_new flags log lines whose request id hasn't been seen in an allowlist. It crawls on large inputs.
def tag_new(lines, allowed_ids): # allowed_ids is a list
out = []
for line in lines:
rid = line.request_id
out.append((line, rid in allowed_ids)) # O(n) scan per line
return out
Acceptance: identical output; the per-line check must be O(1) average; the whole function O(n + m).
Solution
The dominant operation is **membership** (`rid in allowed_ids`), so the right structure is a **set**, built once. **Big-O:** O(n·m) → **O(n + m)**. For 100k lines × 100k allowed ids that's ~10 billion comparisons down to ~200k. **Benchmark** (`pyperf`, 100k × 100k, illustrative): list ≈ 19 s, set ≈ 11 ms — three orders of magnitude. **The cost-model reason:** `in` on a list is a linear scan (O(m)); on a set it's a hash + bucket check (O(1) average). Build the set **once**, outside the loop — building it per line would re-pay O(m) and keep the quadratic.Exercise 2 — Replace find-by-key scans¶
Wrong → Right: list scan → map · Language: Go · Difficulty: ★★ medium
enrich attaches a customer name to each order by scanning the customer slice per order.
func findCustomer(custs []Customer, id int) (Customer, bool) {
for _, c := range custs { // O(n) scan
if c.ID == id {
return c, true
}
}
return Customer{}, false
}
func enrich(orders []Order, custs []Customer) {
for i := range orders {
if c, ok := findCustomer(custs, orders[i].CustomerID); ok { // O(n) each
orders[i].CustomerName = c.Name
}
}
}
Acceptance: same result; lookups O(1) average; whole function O(n + q). Handle the build-once requirement explicitly.
Solution
The dominant operation is **lookup by key** (CustomerID). Index the customers into a **map once**, then probe.func indexByID(custs []Customer) map[int]Customer {
idx := make(map[int]Customer, len(custs)) // pre-sized: avoids rehash spikes
for _, c := range custs {
idx[c.ID] = c
}
return idx
}
func enrich(orders []Order, custs []Customer) {
idx := indexByID(custs) // O(n), once — hoisted out of the loop
for i := range orders {
if c, ok := idx[orders[i].CustomerID]; ok { // O(1) average
orders[i].CustomerName = c.Name
}
}
}
Exercise 3 — Top-k without sorting the world¶
Wrong → Right: full sort → heap · Language: Java · Difficulty: ★★ medium
topScores returns the k highest scores from a large stream by sorting everything.
List<Integer> topScores(List<Integer> scores, int k) {
List<Integer> copy = new ArrayList<>(scores);
copy.sort(Collections.reverseOrder()); // O(n log n), O(n) memory
return copy.subList(0, Math.min(k, copy.size()));
}
Acceptance: same top-k (order within the result may differ if you document it); O(n log k) time, O(k) memory; must not sort the whole input.
Solution
Keep a **bounded min-heap of size k**: the root is the smallest of the current best k, so a new score only enters if it beats the root.List<Integer> topScores(List<Integer> scores, int k) {
if (k <= 0) return List.of();
PriorityQueue<Integer> heap = new PriorityQueue<>(k); // min-heap
for (int s : scores) {
if (heap.size() < k) {
heap.offer(s);
} else if (s > heap.peek()) { // s beats the current k-th best
heap.poll(); // drop the smallest
heap.offer(s); // O(log k)
}
}
List<Integer> out = new ArrayList<>(heap);
out.sort(Collections.reverseOrder()); // sort only the k results
return out;
}
Exercise 4 — Fix the O(n) pop-front queue¶
Wrong → Right: list → deque · Language: Python · Difficulty: ★ easy
A breadth-first worker uses a list as a FIFO queue.
def drain(initial):
queue = list(initial)
processed = []
while queue:
item = queue.pop(0) # O(n) — shifts the whole list each time
processed.append(item)
queue.extend(children(item)) # appends are fine; pop(0) is the problem
return processed
Acceptance: identical processing order; front-removal must be O(1); draining O(total items).
Solution
The access pattern is **FIFO** (append at the back, remove from the front). The right structure is `collections.deque`. **Big-O:** draining O(n²) → **O(n)** (n total items processed). **Benchmark** (50,000 items, illustrative): list `pop(0)` ≈ 1.2 s, `deque.popleft()` ≈ 6 ms. **The cost-model reason:** `list.pop(0)` removes index 0, forcing every remaining element to shift down one slot — O(n) per pop, O(n²) to drain. A deque is a doubly-linked block structure with O(1) operations at both ends. (Java equivalent: `ArrayList.remove(0)` → `ArrayDeque`. Go: a ring buffer or head index, *not* `q = q[1:]` for long-lived queues, which leaks the backing array.)Exercise 5 — Turn a nested-loop join into a hash join¶
Wrong → Right: nested loop → hash join · Language: Go · Difficulty: ★★ medium
reconcile pairs payments with invoices by matching invoice id, with a nested loop.
func reconcile(payments []Payment, invoices []Invoice) []Match {
var out []Match
for _, p := range payments { // n
for _, inv := range invoices { // × m
if inv.ID == p.InvoiceID {
out = append(out, Match{p, inv})
break
}
}
}
return out
}
Acceptance: same matches; O(n + m); hash the smaller collection.
Solution
A nested loop matching by key is a **hash join** in disguise. Build a hash index of the smaller side once, probe with the larger.func reconcile(payments []Payment, invoices []Invoice) []Match {
// Build phase: index the (assumed smaller) invoices by ID. O(m).
byID := make(map[int]Invoice, len(invoices))
for _, inv := range invoices {
byID[inv.ID] = inv
}
// Probe phase: scan payments once, O(1) lookup each. O(n).
out := make([]Match, 0, len(payments))
for _, p := range payments {
if inv, ok := byID[p.InvoiceID]; ok {
out = append(out, Match{p, inv})
}
}
return out
}
Exercise 6 — Find the small-n crossover¶
Wrong → Right: measure, don't assume · Language: Go · Difficulty: ★★★ hard
"Always use a map for membership" is wrong for small n: a slice scan over a few cache-resident ints beats a hash lookup. Write a benchmark that finds the crossover — the n below which the slice scan wins.
Acceptance: a benchmark sweeping n that reports slice-scan vs map-lookup times per size, with the worst case for the scan (target = last element). State the crossover you observe and why it exists.
Solution
package bench
import (
"fmt"
"testing"
)
func makeSlice(n int) []int {
s := make([]int, n)
for i := range s {
s[i] = i
}
return s
}
func makeMap(n int) map[int]struct{} {
m := make(map[int]struct{}, n)
for i := 0; i < n; i++ {
m[i] = struct{}{}
}
return m
}
func sliceContains(s []int, x int) bool {
for _, v := range s {
if v == x {
return true
}
}
return false
}
var sink bool // package-level: defeats dead-code elimination
func BenchmarkCrossover(b *testing.B) {
for _, n := range []int{4, 8, 16, 32, 64, 128, 256} {
s := makeSlice(n)
m := makeMap(n)
target := n - 1 // worst case for the scan: the last element
b.Run(fmt.Sprintf("slice/n=%d", n), func(b *testing.B) {
for i := 0; i < b.N; i++ {
sink = sliceContains(s, target)
}
})
b.Run(fmt.Sprintf("map/n=%d", n), func(b *testing.B) {
for i := 0; i < b.N; i++ {
_, sink = m[target]
}
})
}
}
Wrap-Up¶
| Exercise | Wrong structure | Right structure | Big-O fix |
|---|---|---|---|
| 1 | list membership | set | O(n·m) → O(n + m) |
| 2 | list find-by-key | map (built once) | O(n·q) → O(n + q) |
| 3 | full sort for top-k | bounded heap | O(n log n) → O(n log k), O(k) mem |
| 4 | list as queue | deque | O(n²) → O(n) |
| 5 | nested-loop join | hash join | O(n·m) → O(n + m) |
| 6 | (measure) | slice below crossover | — find where simple wins |
The throughline: name the hot operation, pick the structure that makes it cheap, preprocess into it once — and for anything performance-critical, benchmark it (Exercise 6 shows why the table isn't always the final word).
Related Topics¶
find-bug.md— the spotting counterpart: identify the mismatch in a snippet.optimize.md— a full before/after with profiling and benchmark numbers, including the crossover caveat.middle.md— the cost-table cheat sheet and the canonical fixes practiced here.- N+1 in Code → tasks — the same "preprocess once, not per item" discipline for I/O.
- The
big-o-analysis,hash-table-design, andprofiling-techniquesskills — cost analysis, set/map internals, and the benchmarking discipline.
In this topic