Bloom Filter — Practice Tasks¶
All tasks must be solved in Go, Java, and Python. Each task lists constraints, expected behavior, and how you'll be evaluated. Always verify the core invariant after every task: a "no" must always be correct (no false negatives).
Beginner Tasks¶
Task 1 — Implement a fixed-size Bloom filter from scratch¶
Implement a Bloom filter with a given m (bits) and k (hash functions). Provide add(item) and mightContain(item). Use double hashing (g_i = h1 + i·h2 mod m) to derive the k indices from two base hashes. Pack bits into 64-bit words (or a bytearray), not one boolean per bit.
Go¶
package main
type Bloom struct {
// bits []uint64; m, k uint64
}
func NewBloom(m, k uint64) *Bloom { /* implement */ return nil }
func (b *Bloom) Add(data []byte) { /* implement */ }
func (b *Bloom) MightContain(data []byte) bool { /* implement */ return false }
func main() {
// b := NewBloom(64, 3); b.Add([]byte("hi")); print(b.MightContain([]byte("hi")))
}
Java¶
public class Task1 {
static class Bloom {
Bloom(int m, int k) { /* implement */ }
void add(String s) { /* implement */ }
boolean mightContain(String s) { /* implement */ return false; }
}
public static void main(String[] args) {
// Bloom b = new Bloom(64, 3); b.add("hi"); System.out.println(b.mightContain("hi"));
}
}
Python¶
class Bloom:
def __init__(self, m, k):
... # implement
def add(self, item):
... # implement
def might_contain(self, item):
... # implement
if __name__ == "__main__":
pass # b = Bloom(64, 3); b.add("hi"); print(b.might_contain("hi"))
- Constraints:
O(k)per op; bits packed into words; deterministic hashing. - Expected:
mightContainreturnsTruefor every added item. - Evaluation: correctness, bit-packing, double hashing.
Task 2 — Add from(n, p) constructor with sizing math¶
Add a constructor that computes m = ceil(−n·ln p / (ln 2)²) and k = round((m/n)·ln 2) from a target item count n and FPR p. Print the chosen m and k.
- Constraints:
k ≥ 1; rejectp ≤ 0orp ≥ 1. - Expected: for
n=1000, p=0.01→ roughlym≈9586, k≈7. - Evaluation: formula correctness, input validation.
Task 3 — Verify the no-false-negative property empirically¶
Insert n = 50,000 distinct strings. Then query all n of them and assert that every query returns True. Print the count of false negatives (must be 0).
- Constraints: use distinct items; assert zero false negatives.
- Expected:
falseNegatives = 0on every run. - Evaluation: correct invariant verification.
Task 4 — Measure the observed false-positive rate¶
After inserting n items, query n never-inserted items and compute the fraction that return True. Compare this observed FPR to the analytic p = (1 − e^(−kn/m))^k. Print both.
- Constraints: disjoint insert/test sets; report 4 decimals.
- Expected: observed FPR within ~20% of analytic for
n ≥ 10,000. - Evaluation: measurement correctness, closeness to theory.
Task 5 — Early-exit query¶
Modify mightContain to count how many of the k bits it inspects before deciding. Show that for non-members it usually inspects far fewer than k bits (returns on the first 0).
- Constraints: return on first
0bit; track inspected-bit count. - Expected: average inspected bits for non-members ≪
k. - Evaluation: early-exit logic, instrumentation.
Intermediate Tasks¶
Task 6 — Union two Bloom filters¶
Implement union(a, b) that returns a new filter equal to the bitwise OR of two filters with identical m, k, and hash seeds. Verify that the union answers "present" for every item inserted into either source filter.
- Constraints: reject filters with mismatched
m/k; OR the bit arrays. - Expected: union contains every item of both inputs (no false negatives).
- Evaluation: correct OR merge, parameter check.
Task 7 — Counting Bloom filter with delete¶
Replace each bit with a 4-bit counter (pack two counters per byte, or use a uint8 array). add increments the k counters; delete decrements them; "set" means counter > 0. Demonstrate that after add(x); add(y); delete(x), querying y still returns True (no false negative from deletion).
- Constraints: saturate counters at 15 (don't overflow/underflow); guard against decrementing below 0.
- Expected: deletes don't break other items' membership.
- Evaluation: counter logic, deletion safety. (Cross-ref:
23-counting-bloom-filter.)
Task 8 — Bloom filter vs hash set: memory and speed¶
Insert 1,000,000 URLs into both a Bloom filter (sized for 1% FPR) and a set/HashSet/map. Report the memory used by each and the query throughput of each. Quantify the memory saving.
- Constraints: measure actual bytes where the language allows; same item set.
- Expected: Bloom uses ~10×–50× less memory.
- Evaluation: fair measurement, correct reporting.
Task 9 — Tune FPR vs bits-per-element¶
For bits/element ∈ {4, 6, 8, 10, 12, 14}, build a filter, insert n items, and measure the observed FPR. Tabulate bits/elem → observed FPR and confirm each extra bit multiplies FPR by roughly 0.6185.
- Constraints: keep
nfixed; recomputekoptimally perm. - Expected: FPR falls geometrically (~×0.62 per bit).
- Evaluation: correct sweep, geometric trend.
Task 10 — Blocked (cache-line) Bloom filter¶
Implement a blocked Bloom filter: the first hash selects a 512-bit block; the remaining k bits are set within that single block. Confirm each operation touches exactly one block, and compare its FPR to a standard filter of the same total m.
- Constraints: block size = 512 bits; all
kbits inside one block. - Expected: one block per op; FPR slightly higher than standard for same
m. - Evaluation: blocking correctness, FPR comparison. (Cross-ref:
senior.md.)
Advanced Tasks¶
Task 11 — Scalable Bloom filter¶
Implement a scalable Bloom filter: a list of sub-filters where, on filling the current one to target load, you append a new larger filter with a tighter FPR (p·r^i, r = 0.9). mightContain checks all sub-filters; add goes to the newest. Insert 10× the initial design n and confirm the global FPR stays below the target.
- Constraints: geometric size growth; tightening per-slice FPR.
- Expected: overall FPR ≤ target despite massive overfill.
- Evaluation: growth logic, compounded-FPR bound. (Cross-ref:
senior.md.)
Task 12 — LSM-style per-SSTable Bloom filters¶
Simulate an LSM read path: create several "SSTables," each a set of keys with its own Bloom filter. Implement get(key) that checks each SSTable's filter and only "reads" (scans the real key set of) the tables whose filter says "maybe." Count how many disk reads the filters saved on a workload that is 90% absent keys.
- Constraints: per-SSTable filter; skip files on filter "no".
- Expected: ~90% of would-be reads avoided for absent keys.
- Evaluation: correct skip logic, read-saving metric. (Cross-ref:
04-lsm-tree.)
Task 13 — Concurrent (lock-light) Bloom filter¶
Make inserts thread-safe using atomic OR / CAS per word (no global lock), since bits only go 0→1. Run many goroutines/threads inserting concurrently, then verify from a single thread that all inserted items return True.
- Constraints: atomic bit-set; no data races (run with
-race/ thread sanitizer). - Expected: zero false negatives under concurrency.
- Evaluation: correctness under contention, race-freedom.
Task 14 — Estimate cardinality from a Bloom filter¶
Given a filled Bloom filter with known m and k, estimate the number of inserted distinct items from the count of set bits X using n̂ = −(m/k)·ln(1 − X/m). Compare your estimate to the true n across several fills.
- Constraints: count set bits efficiently (popcount); apply the inversion formula.
- Expected: estimate within a few percent of true
n(away from saturation). - Evaluation: popcount usage, formula correctness, accuracy.
Task 15 — Web-crawler URL dedup¶
Build a URL-dedup component for a crawler: before "fetching," check the Bloom filter; if "new," fetch and record. Feed a stream with many repeats and measure (a) duplicate fetches avoided and (b) unique URLs wrongly skipped due to false positives. Tune m, k so wrongly-skipped stays under 0.5%.
- Constraints: stream processing; track both metrics.
- Expected: near-100% duplicate suppression, <0.5% wrongful skips.
- Evaluation: dedup effectiveness, FPR control, realistic design.
Benchmark Task¶
Compare insert and query throughput, and observed FPR, across all 3 languages for a filter sized at
n = 1,000,000,p = 0.01.
Go¶
package main
import (
"fmt"
"time"
)
func main() {
n := 1000000
// b := New(n, 0.01) // your sized constructor
start := time.Now()
for i := 0; i < n; i++ {
// b.Add(fmt.Sprintf("k-%d", i))
_ = i
}
fmt.Printf("inserts of %d in %v\n", n, time.Since(start))
}
Java¶
public class Benchmark {
public static void main(String[] args) {
int n = 1_000_000;
// Bloom b = new Bloom(n, 0.01);
long start = System.nanoTime();
for (int i = 0; i < n; i++) {
// b.add("k-" + i);
}
System.out.printf("inserts of %d in %d ms%n", n, (System.nanoTime() - start) / 1_000_000);
}
}
Python¶
import time
n = 1_000_000
# b = Bloom(n, 0.01)
start = time.perf_counter()
for i in range(n):
# b.add(f"k-{i}")
pass
print(f"inserts of {n} in {time.perf_counter() - start:.3f}s")
Report a table of language → insert ms → query ms → observed FPR. Note where each language's hashing/allocation dominates, and whether the observed FPR matches the analytic ~1%.
In this topic
- interview
- tasks