Bloom Filter — Interview Preparation¶
Tiered questions (Junior → Middle → Senior → Professional) followed by a multi-language coding challenge. Answers are concise but complete — expand on the math where the interviewer probes.
Table of Contents¶
- Junior Questions
- Middle Questions
- Senior Questions
- Professional Questions
- Rapid-Fire Round
- Coding Challenge
Junior Questions¶
Q1. What is a Bloom filter?
A space-efficient probabilistic data structure for approximate set membership. It's a bit array of
mbits pluskhash functions. Insert sets thekbits an item hashes to; query returns "present" only if allkbits are1.
Q2. What's the one guarantee a Bloom filter always gives?
No false negatives. If it says "not present," the item is definitely absent. It may have false positives (say "present" for a never-inserted item), but never false negatives.
Q3. Why are there no false negatives?
Insert only ever turns bits from
0to1, never back. So an inserted item'skbits stay1forever, and a later query for it always finds all1s → "present."
Q4. What causes a false positive?
A never-inserted item whose
kbit positions were all set to1by other items. The bits collide, so the query mistakenly reports "present."
Q5. What operations does a Bloom filter support, and at what cost?
insertandmightContain, bothO(k)— computekhash positions and set/checkkbits.kis a small constant (~7), so effectivelyO(1), independent of how many items are stored.
Q6. Can you delete from a standard Bloom filter?
No. Clearing an item's bits could clear bits shared with other items, introducing false negatives and breaking the core guarantee. Use a counting Bloom filter to delete.
Q7. How is a Bloom filter different from a hash set?
A hash set stores the actual items (exact, deletable, enumerable) but uses memory proportional to item size. A Bloom filter stores no items — just bits — using ~10 bits/item regardless of item size, at the cost of possible false positives and no deletion/enumeration.
Q8. Can a Bloom filter tell you what items it contains?
No. It can't enumerate or recover items — it only answers approximate yes/no membership.
Q9. If a query hits a 0 bit on the second of k checks, what do you do?
Return "absent" immediately — early exit. One
0is conclusive proof of non-membership; no need to check the remaining bits.
Q10. Give a real use case.
A web crawler dedups URLs: before fetching, ask the Bloom filter "seen this URL?" If "no," fetch and record it; this avoids re-fetching billions of URLs with only a few bits each.
Middle Questions¶
Q11. State the false-positive-rate formula.
p ≈ (1 − e^(−kn/m))^k, wherem= bits,k= hashes,n= inserted items. The inner term1 − e^(−kn/m)is the expected fraction of set bits.
Q12. What is the optimal number of hash functions?
k* = (m/n)·ln 2 ≈ 0.693·(m/n). At thisk, exactly half the bits are set, minimizing the FPR.
Q13. How do you size the bit array for a target FPR p?
m = −n·ln p / (ln 2)². Equivalently, bits per element =−log₂ p / ln 2 ≈ 1.44·log₂(1/p). About 10 bits/item gives ~1% FPR.
Q14. Why isn't "more hash functions always better"?
More hashes set more bits per insert, saturating the array with
1s. Past the optimum, the array fills so densely that almost any query finds all bits set → FPR rises again. There's a sweet spot at half-full.
Q15. What is double hashing and why use it?
Kirsch–Mitzenmacher: derive all
kindices from two base hashes viag_i(x) = h1(x) + i·h2(x) mod m. It gives the same asymptotic FPR while computing only 2 hashes instead ofk, saving CPU.
Q16. Walk me through inserting and querying an item.
Insert: compute
h1, h2; fori = 0..k-1, set bit(h1 + i·h2) % m. Query: same indices; return false on the first0bit, true if all are1.
Q17. Memory for 1 billion items at 1% FPR?
~10 bits/item × 10⁹ ≈ 10 Gbit ≈ 1.25 GB. A hash set of 1B URLs would need tens of GB or more — the Bloom filter is an order of magnitude smaller.
Q18. What happens if you insert far more than the design n?
The FPR climbs toward 1 — the filter starts answering "present" to almost everything and stops being useful. Re-size or use a scalable Bloom filter.
Q19. Why must h2 be non-zero in double hashing?
If
h2 = 0, allkindices collapse toh1, so effectivelyk = 1and the FPR spikes. Forceh2odd/non-zero.
Q20. Compare Bloom filter vs counting Bloom filter.
Counting Bloom replaces each bit with a small counter (e.g. 4 bits): insert increments, delete decrements, "set" means counter > 0. It supports deletion at ~4× the memory.
Q21. Two Bloom filters with the same parameters — how do you union them?
Bitwise OR of the two bit arrays. The result answers membership over the union of both sets, exactly. (Intersection via AND is not safe — it can create false negatives.)
Q22. When would you pick a hash set over a Bloom filter?
When you need exact answers, deletion, enumeration, the set is small enough that memory isn't a concern, or a false "yes" would be incorrect rather than just wasteful.
Senior Questions¶
Q23. How do LSM-tree databases use Bloom filters?
Each immutable SSTable carries a Bloom filter over its keys. On a point read, the engine checks each candidate SSTable's filter; a "no" lets it skip that file's disk read entirely. This eliminates most disk I/O for absent keys in read/miss-heavy workloads (RocksDB, Cassandra, LevelDB, HBase).
Q24. What's a blocked Bloom filter and why does it matter?
A standard filter spreads
kbits across the whole array → up tokcache misses per query. A blocked filter confines allkbits to one cache-line-sized block (512 bits): one cache miss per op. It trades a slightly higher FPR for 2–4× lower latency on large filters.
Q25. Design a cache admission filter to fight one-hit-wonders.
Keep a Bloom filter of "seen keys." On first request, the filter says "no" → record it but don't cache the object. On second request → "yes" → admit to the real cache. This keeps single-access objects out of the cache. Use a counting/rotating filter so entries age out.
Q26. How do scalable Bloom filters handle unknown n?
Maintain a growing list of filters. When the current fills to target load, add a new, larger filter with a tighter per-filter FPR (
p·r^i,r≈0.9) so the compounded global FPR stays bounded. Queries check all filters (any "yes" → present); inserts go to the newest.
Q27. How do you tune FPR vs memory in production?
The lever is bits/key (
m/n); each extra bit multiplies FPR by ~0.6185. If the protected op is cheap (local SSD), accept higher FPR (~8–10 bits). If it's expensive (cross-DC RPC, cache stampede), drive FPR to 0.1–0.01% (14–20 bits). Balance filter RAM againstFPR × cost_of_wasted_op × query_rate.
Q28. Why can't you reflect upstream deletions in a plain Bloom filter?
It can't clear bits without risking false negatives, so deleted keys keep returning "maybe present." Fix: periodic full rebuild from the live keyset, or a counting Bloom filter with TTL rotation.
Q29. How do networking systems use Bloom filters?
Cache digests (Squid peers summarize their contents), router/switch forwarding and per-flow state in fast SRAM, DDoS/duplicate-packet detection at line rate, and IDS payload pre-screening. The draw is constant-time decisions in tiny fast memory.
Q30. Bloom filter vs cuckoo filter — when each?
Bloom: mergeable by OR (good for distributed union), lock-light concurrency (bits only rise), slightly better at high FPR. Cuckoo: supports deletion, lower space at low FPR, but harder to merge. Pick cuckoo when you need deletes + low FPR + tight space; Bloom when you need merges or simplicity.
Q31. How do you make a Bloom filter concurrent?
Inserts only OR bits in, so atomic OR / CAS per word makes inserts lock-free; reads are plain loads. No global lock needed. This is far simpler than concurrency for structures that mutate in both directions.
Q32. What's the "adversarial false positive" risk?
If an attacker knows the hash seeds, they can craft keys that all collide on already-set bits, spiking the effective FPR or forcing expensive backing lookups. Mitigate with secret/keyed hash seeds.
Professional Questions¶
Q33. Derive Pr[a given bit is 0] after n inserts.
Each of the
knindependent hash placements misses a given bit with probability1 − 1/m. SoPr[bit = 0] = (1 − 1/m)^(kn) ≈ e^(−kn/m).
Q34. Prove the optimal k = (m/n)·ln 2.
Minimize
ln p(k) = k·ln(1 − e^(−kn/m)). Setting the derivative to zero and substitutingk = (m/n)ln 2givese^(−kn/m) = 1/2, and the derivative evaluates to−ln2 + ln2 = 0. Convexity ofln pover the range makes it the global minimum. The conditione^(−kn/m)=1/2means exactly half the bits are set.
Q35. What is the information-theoretic space lower bound for an AMQ?
Any structure with FPR
εand no false negatives needs at leastn·log₂(1/ε)bits. It follows from a counting argument over how manyn-subsets the structure must distinguish given theεslack on non-members.
Q36. How far from optimal is the Bloom filter?
It uses
1.4427·n·log₂(1/ε)bits vs the boundn·log₂(1/ε). The ratio is1/ln2 = log₂ e ≈ 1.44, so it's ~44% above optimal. Cuckoo/quotient/ribbon filters narrow this gap.
Q37. Why is the expectation-based FPR formula reliable on a single instance?
The number of set bits
Xconcentrates:E[X] = mρ,Var[X] ≤ mρ(1−ρ), and a bounded-differences (McDiarmid) bound givesPr[|X−E[X]| ≥ λ] ≤ 2exp(−λ²/(2nk²)). SoX/m → ρwith exponentially small deviation, and the realized FPR ≈ρ^kmatches the formula.
Q38. Why does double hashing preserve the FPR despite dependence?
For a single element the
g_ilie on an arithmetic progression modm(dependent), but across different elements the(h1, h2)pairs are independent, so the array-wide bit distribution matches independent hashing. Kirsch–Mitzenmacher proveE[p_double] = (1−e^(−kn/m))^k·(1+o(1)).
Q39. State the no-false-negatives proof.
Invariant: bits are monotone (only written to
1). Whenxis inserted, allkof its bits become1and stay1. Hence any later query forxsees all1s → "present." Contrapositive: a "no" means some bit is0, never set, soxwas never inserted — a proof of non-membership.
Q40. Give the exact (non-approximate) FPR and when the approximation fails.
The simple
(1 − (1−1/m)^(kn))^kslightly overstates independence; the exact value (Bose et al.) sums over the random number of distinct set bits. Thee-approximation error isO(k/m), negligible form ≫ k, but matters for tinymor very largek.
Rapid-Fire Round¶
| # | Question | Answer |
|---|---|---|
| 41 | False negatives possible? | Never. |
| 42 | False positives possible? | Yes, tunable. |
| 43 | Optimal k? | (m/n)·ln 2. |
| 44 | Bits/item for 1% FPR? | ~9.6 (≈10). |
| 45 | Can you delete? | Not from a plain Bloom filter. |
| 46 | Merge two filters? | Bitwise OR (same params). |
| 47 | Insert/query time? | O(k). |
| 48 | FPR at half-full array? | (1/2)^k. |
| 49 | Filter that allows delete? | Counting Bloom / cuckoo filter. |
| 50 | One-cache-line variant? | Blocked Bloom filter. |
| 51 | Why not k separate hashes? | Use double hashing h1+i·h2. |
| 52 | Space lower bound? | n·log₂(1/ε) bits. |
| 53 | Estimate n from a filter? | n̂ = −(m/k)·ln(1 − X/m). |
| 54 | Cost of one extra bit/elem? | FPR ×0.62. |
| 55 | Does re-querying reduce error? | No — same fixed filter, same answer. |
| 56 | Where in an LSM is the filter dropped? | Bottom (largest) level, to save RAM. |
| 57 | Counter width for counting Bloom? | 4 bits (max 15) is ample. |
| 58 | Is "yes" a certificate? | No — only "no" is a proof. |
Behavioral / Design Discussion Prompts¶
These open-ended prompts test judgment, not recall — practice talking through them.
D1. "We added a Bloom filter and latency got worse, not better. Why?"
Likely a large unblocked filter causing
kcache misses per probe, or the workload is mostly hits (so the filter rarely says "no" and only adds work). Fix: use a blocked filter; reconsider whether a filter helps a hit-heavy workload at all (it shines on miss-heavy paths).
D2. "Our false-positive rate drifted from 1% to 15% over a month. What happened?"
The set grew past the design
n— the filter overfilled. No errors, just silent degradation. Fix: monitor fill ratio / inserted count, rebuild larger, or switch to a scalable Bloom filter.
D3. "Can we use the Bloom filter to answer the query directly and skip the database?"
No — a "yes" is only probable. The filter must front a source of truth that confirms positives; using it standalone serves phantom results on every false positive.
D4. "We need to remove expired entries. The Bloom filter can't delete — what now?"
Counting Bloom filter (decrement on delete) or double-buffering two filters that rotate on a time window. Never clear bits in a plain filter — that creates false negatives.
D5. "How would you choose between a Bloom filter and a cuckoo filter for a new service?"
Need deletes and low FPR with tight space → cuckoo. Need distributed merges (OR), simple lock-light concurrency, or you're at high FPR → Bloom. Also weigh ecosystem support: most storage engines ship Bloom out of the box.
Coding Challenge¶
Challenge: Implement a sized Bloom filter and verify its no-false-negative property¶
Build a Bloom filter that sizes
mandkfrom a target item countnand false-positive ratep. ImplementaddandmightContainusing double hashing. Then prove empirically that for any item you inserted,mightContainreturnstrue(no false negatives), and report the observed false-positive rate against the analytic one.
Requirements: 1. m = ceil(−n·ln p / (ln 2)²), k = round((m/n)·ln 2). 2. add(x) sets k bits; mightContain(x) returns false on the first 0 bit. 3. Insert n items, confirm all return true, then test n never-inserted items and count false positives.
Go¶
package main
import (
"fmt"
"hash/fnv"
"math"
)
type Bloom struct {
bits []uint64
m, k uint64
}
func New(n int, p float64) *Bloom {
m := uint64(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
k := uint64(math.Max(1, math.Round(float64(m)/float64(n)*math.Ln2)))
return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}
func (b *Bloom) base(s string) (uint64, uint64) {
h := fnv.New64a()
h.Write([]byte(s))
x := h.Sum64()
return x & 0xffffffff, (x >> 32) | 1
}
func (b *Bloom) Add(s string) {
h1, h2 := b.base(s)
for i := uint64(0); i < b.k; i++ {
idx := (h1 + i*h2) % b.m
b.bits[idx/64] |= 1 << (idx % 64)
}
}
func (b *Bloom) MightContain(s string) bool {
h1, h2 := b.base(s)
for i := uint64(0); i < b.k; i++ {
idx := (h1 + i*h2) % b.m
if b.bits[idx/64]&(1<<(idx%64)) == 0 {
return false
}
}
return true
}
func main() {
n, p := 10000, 0.01
b := New(n, p)
for i := 0; i < n; i++ {
b.Add(fmt.Sprintf("in-%d", i))
}
// no false negatives
fn := 0
for i := 0; i < n; i++ {
if !b.MightContain(fmt.Sprintf("in-%d", i)) {
fn++
}
}
// false positives
fp := 0
for i := 0; i < n; i++ {
if b.MightContain(fmt.Sprintf("out-%d", i)) {
fp++
}
}
fmt.Printf("m=%d k=%d falseNegatives=%d observedFPR=%.4f\n",
b.m, b.k, fn, float64(fp)/float64(n))
}
Java¶
import java.nio.charset.StandardCharsets;
public class Bloom {
final long[] bits; final int m, k;
Bloom(int n, double p) {
m = (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
bits = new long[(m + 63) / 64];
}
long[] base(String s) {
long h = 0xcbf29ce484222325L;
for (byte by : s.getBytes(StandardCharsets.UTF_8)) { h ^= (by & 0xff); h *= 0x100000001b3L; }
return new long[]{ h & 0xffffffffL, (h >>> 32) | 1L };
}
void add(String s) {
long[] h = base(s);
for (int i = 0; i < k; i++) {
int idx = (int) (((h[0] + (long) i * h[1]) % m + m) % m);
bits[idx >>> 6] |= 1L << (idx & 63);
}
}
boolean mightContain(String s) {
long[] h = base(s);
for (int i = 0; i < k; i++) {
int idx = (int) (((h[0] + (long) i * h[1]) % m + m) % m);
if ((bits[idx >>> 6] & (1L << (idx & 63))) == 0) return false;
}
return true;
}
public static void main(String[] args) {
int n = 10000; double p = 0.01;
Bloom b = new Bloom(n, p);
for (int i = 0; i < n; i++) b.add("in-" + i);
int fn = 0, fp = 0;
for (int i = 0; i < n; i++) if (!b.mightContain("in-" + i)) fn++;
for (int i = 0; i < n; i++) if (b.mightContain("out-" + i)) fp++;
System.out.printf("m=%d k=%d falseNegatives=%d observedFPR=%.4f%n",
b.m, b.k, fn, (double) fp / n);
}
}
Python¶
import math
class Bloom:
def __init__(self, n, p):
self.m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
self.k = max(1, round(self.m / n * math.log(2)))
self.bits = bytearray((self.m + 7) // 8)
def _base(self, s):
import hashlib
h = int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "little")
return h & 0xFFFFFFFF, (h >> 32) | 1
def add(self, s):
h1, h2 = self._base(s)
for i in range(self.k):
idx = (h1 + i * h2) % self.m
self.bits[idx >> 3] |= 1 << (idx & 7)
def might_contain(self, s):
h1, h2 = self._base(s)
for i in range(self.k):
idx = (h1 + i * h2) % self.m
if not (self.bits[idx >> 3] >> (idx & 7)) & 1:
return False
return True
if __name__ == "__main__":
n, p = 10000, 0.01
b = Bloom(n, p)
for i in range(n):
b.add(f"in-{i}")
fn = sum(1 for i in range(n) if not b.might_contain(f"in-{i}"))
fp = sum(1 for i in range(n) if b.might_contain(f"out-{i}"))
print(f"m={b.m} k={b.k} falseNegatives={fn} observedFPR={fp / n:.4f}")
Expected output (all languages): falseNegatives=0 always, and observedFPR ≈ 0.01 (close to the target p, within statistical noise). The zero false negatives is the property to highlight in an interview — it holds with probability 1, not on average.
Bonus Challenge: Union two filters and estimate intersection size¶
Given two Bloom filters with identical
(m, k)holding setsAandB, build their union by bitwise OR (which is exact), then estimate|A|,|B|,|A ∪ B|from the set-bit counts using the Swamidass–Baldi estimatorn̂ = −(m/k)·ln(1 − X/m), and finally estimate|A ∩ B| = n̂(A) + n̂(B) − n̂(A∪B)by inclusion–exclusion.
Requirements: 1. union(a, b) returns a new filter = a.bits OR b.bits (reject mismatched params). 2. estimateCount() = −(m/k)·ln(1 − setBits/m). 3. Print estimated |A|, |B|, |A∪B|, |A∩B| and compare to the true values.
Go¶
package main
import (
"fmt"
"hash/fnv"
"math"
)
type Bloom struct {
bits []uint64
m, k uint64
}
func New(m, k uint64) *Bloom { return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k} }
func (b *Bloom) idx(s string, i uint64) uint64 {
h := fnv.New64a(); h.Write([]byte(s)); x := h.Sum64()
h1, h2 := x&0xffffffff, (x>>32)|1
return (h1 + i*h2) % b.m
}
func (b *Bloom) Add(s string) {
for i := uint64(0); i < b.k; i++ { idx := b.idx(s, i); b.bits[idx/64] |= 1 << (idx % 64) }
}
func (b *Bloom) Union(o *Bloom) *Bloom {
u := New(b.m, b.k)
for i := range b.bits { u.bits[i] = b.bits[i] | o.bits[i] }
return u
}
func (b *Bloom) SetBits() (c uint64) {
for _, w := range b.bits { for ; w != 0; w &= w - 1 { c++ } }
return
}
func (b *Bloom) Estimate() float64 {
return -float64(b.m) / float64(b.k) * math.Log(1-float64(b.SetBits())/float64(b.m))
}
func main() {
a, bb := New(2000, 5), New(2000, 5)
for i := 0; i < 300; i++ { a.Add(fmt.Sprintf("k-%d", i)) } // |A|=300
for i := 200; i < 500; i++ { bb.Add(fmt.Sprintf("k-%d", i)) } // |B|=300, overlap 100
u := a.Union(bb)
inter := a.Estimate() + bb.Estimate() - u.Estimate()
fmt.Printf("|A|≈%.0f |B|≈%.0f |A∪B|≈%.0f |A∩B|≈%.0f (true 300/300/500/100)\n",
a.Estimate(), bb.Estimate(), u.Estimate(), inter)
}
Java¶
import java.nio.charset.StandardCharsets;
public class BloomUnion {
final long[] bits; final int m, k;
BloomUnion(int m, int k) { this.m = m; this.k = k; bits = new long[(m + 63) / 64]; }
int idx(String s, int i) {
long h = 0xcbf29ce484222325L;
for (byte by : s.getBytes(StandardCharsets.UTF_8)) { h ^= (by & 0xff); h *= 0x100000001b3L; }
long h1 = h & 0xffffffffL, h2 = (h >>> 32) | 1L;
return (int) (((h1 + (long) i * h2) % m + m) % m);
}
void add(String s) { for (int i = 0; i < k; i++) { int x = idx(s, i); bits[x >>> 6] |= 1L << (x & 63); } }
BloomUnion union(BloomUnion o) {
BloomUnion u = new BloomUnion(m, k);
for (int i = 0; i < bits.length; i++) u.bits[i] = bits[i] | o.bits[i];
return u;
}
long setBits() { long c = 0; for (long w : bits) c += Long.bitCount(w); return c; }
double estimate() { return -(double) m / k * Math.log(1 - (double) setBits() / m); }
public static void main(String[] args) {
BloomUnion a = new BloomUnion(2000, 5), b = new BloomUnion(2000, 5);
for (int i = 0; i < 300; i++) a.add("k-" + i);
for (int i = 200; i < 500; i++) b.add("k-" + i);
BloomUnion u = a.union(b);
double inter = a.estimate() + b.estimate() - u.estimate();
System.out.printf("|A|≈%.0f |B|≈%.0f |A∪B|≈%.0f |A∩B|≈%.0f (true 300/300/500/100)%n",
a.estimate(), b.estimate(), u.estimate(), inter);
}
}
Python¶
import math, hashlib
class Bloom:
def __init__(self, m, k):
self.m, self.k = m, k
self.bits = bytearray((m + 7) // 8)
def _idx(self, s, i):
h = int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "little")
h1, h2 = h & 0xFFFFFFFF, (h >> 32) | 1
return (h1 + i * h2) % self.m
def add(self, s):
for i in range(self.k):
x = self._idx(s, i)
self.bits[x >> 3] |= 1 << (x & 7)
def union(self, o):
u = Bloom(self.m, self.k)
u.bits = bytearray(a | b for a, b in zip(self.bits, o.bits))
return u
def estimate(self):
set_bits = sum(bin(byte).count("1") for byte in self.bits)
return -self.m / self.k * math.log(1 - set_bits / self.m)
if __name__ == "__main__":
a, b = Bloom(2000, 5), Bloom(2000, 5)
for i in range(300):
a.add(f"k-{i}")
for i in range(200, 500):
b.add(f"k-{i}")
u = a.union(b)
inter = a.estimate() + b.estimate() - u.estimate()
print(f"|A|≈{a.estimate():.0f} |B|≈{b.estimate():.0f} "
f"|A∪B|≈{u.estimate():.0f} |A∩B|≈{inter:.0f} (true 300/300/500/100)")
Expected output: estimates close to |A|≈300, |B|≈300, |A∪B|≈500, |A∩B|≈100. Key talking points: union is exact (OR of bit arrays), the Swamidass–Baldi inversion recovers cardinality from the fill, and inclusion–exclusion gives a cheap overlap estimate — all without storing a single item.
Next step: tasks.md — hands-on Bloom filter practice tasks in Go, Java, and Python.
In this topic
- interview
- tasks