MinHash — Practice Tasks¶
All tasks must be solved in Go, Java, and Python. Validate every estimator against the exact Jaccard on small sets before trusting it at scale.
Beginner Tasks¶
Task 1: Implement exact Jaccard similarity J(A,B) = |A∩B| / |A∪B| for two integer sets, with the convention J(∅,∅) = 1. This is your correctness oracle for every later task.
Go¶
package main
import "fmt"
func exactJaccard(a, b []int) float64 {
sa := map[int]bool{}
for _, x := range a {
sa[x] = true
}
inter, union := 0, map[int]bool{}
for x := range sa {
union[x] = true
}
sb := map[int]bool{}
for _, x := range b {
sb[x] = true
union[x] = true
if sa[x] {
inter++
}
}
if len(union) == 0 {
return 1.0
}
return float64(inter) / float64(len(union))
}
func main() {
fmt.Println(exactJaccard([]int{1, 2, 3, 4}, []int{2, 3, 5})) // 0.4
}
Java¶
import java.util.*;
public class Task1 {
static double exactJaccard(int[] a, int[] b) {
Set<Integer> sa = new HashSet<>(), sb = new HashSet<>(), union = new HashSet<>();
for (int x : a) { sa.add(x); union.add(x); }
for (int x : b) { sb.add(x); union.add(x); }
if (union.isEmpty()) return 1.0;
int inter = 0;
for (int x : sa) if (sb.contains(x)) inter++;
return (double) inter / union.size();
}
public static void main(String[] args) {
System.out.println(exactJaccard(new int[]{1,2,3,4}, new int[]{2,3,5})); // 0.4
}
}
Python¶
def exact_jaccard(a, b):
sa, sb = set(a), set(b)
if not sa and not sb:
return 1.0
return len(sa & sb) / len(sa | sb)
if __name__ == "__main__":
print(exact_jaccard([1, 2, 3, 4], [2, 3, 5])) # 0.4
- Constraints: handle empty sets; deduplicate inputs.
- Expected Output:
0.4 - Evaluation: correctness on disjoint, identical, empty, and overlapping sets.
Task 2: Implement a single MinHash value: given a set and one hash function h(x) = (a·x + b) mod p, return min over x of h(x). Provide starter code in all 3 languages. - Constraints: use a prime p > 2^32; reduce x %= p to avoid overflow.
Task 3: Build a length-k MinHash signature from k affine hash functions and estimate Jaccard as the fraction of agreeing slots. Verify against exactJaccard for A={1,2,3,4}, B={2,3,5} with k=256. - Constraints: distinct random (a_i, b_i) per slot; the estimate must land within ±0.08 of 0.4.
Task 4: Write a shingling function that turns a string into a set of integer feature-hashes using word k-grams (k=3). Confirm that two documents differing by one inserted word still share most shingles. - Constraints: canonicalize (lowercase, collapse whitespace) before shingling.
Task 5: Empirically demonstrate the 1/√k rule: for a fixed pair with known J, compute the MinHash estimate for k ∈ {16, 64, 256, 1024} averaged over many random seeds, and print the observed standard deviation. It should roughly halve each time k quadruples.
Intermediate Tasks¶
Task 6: Implement the bottom-k (k-minimum-values) estimator: keep the k smallest distinct hash values of each set under one hash, then estimate Jaccard from the fraction of the k smallest merged values present in both. Provide starter code in all 3 languages. - Constraints: one hash per element; compare accuracy to the k-hash estimator on the same pair.
Task 7: Extend bottom-k to also output a cardinality estimate (k−1)/v_(k), where v_(k) is the k-th smallest hash normalized to [0,1). Verify it approximates the true distinct count.
Task 8: Implement one-permutation MinHash: hash each element once, bin the range into k buckets, and take the per-bucket minimum. Detect empty bins and report how many occur for small sets.
Task 9: Add densification to Task 8: fill each empty bin by copying from the nearest non-empty bin (with a deterministic offset). Show that the densified estimate is unbiased on small sets where naive OPH was biased.
Task 10: Implement LSH banding: given many signatures, split each into b bands of r rows (k=b·r), bucket by band hash, and return all candidate pairs sharing a bucket. Verify that a high-Jaccard pair is reported and a disjoint pair usually is not.
Advanced Tasks¶
Task 11: Plot (or print a table for) the S-curve P[candidate] = 1 − (1 − J^r)^b for several (b, r) with b·r = 128, and pick (b, r) whose threshold (1/b)^{1/r} is closest to 0.8. Provide starter code in all 3 languages. - Constraints: report the (b,r) choice and the resulting threshold.
Task 12: Build a small near-duplicate detector end to end: shingle a list of short documents, MinHash to signatures, LSH-band to candidates, then verify with exact Jaccard and report clusters with J ≥ 0.7.
Task 13: Implement b-bit MinHash: store only the low b bits of each slot and use the adjusted estimator that accounts for random low-bit agreement (Ĵ = (observed_agreement − 2^{−b}) / (1 − 2^{−b})). Compare its accuracy and memory to full-width MinHash.
Task 14: Make MinHash streaming and mergeable: maintain a running signature that updates in O(k) per element, and implement merge(sigA, sigB) as the slot-wise minimum. Verify merge(sig(A), sig(B)) == sig(A ∪ B).
Task 15: Implement a chooseK helper using the Hoeffding bound k ≥ ln(2/δ)/(2ε²), then run an experiment confirming that with that k, the fraction of estimates falling outside ±ε of the true J is at most δ over many random pairs.
Benchmark Task¶
Compare signature-build performance across all 3 languages as
kgrows (O(n·k)).
Go¶
package main
import (
"fmt"
"time"
)
func main() {
elems := make([]uint64, 5000)
for i := range elems {
elems[i] = uint64(i*2654435761) & 0xFFFFFFFF
}
for _, k := range []int{32, 64, 128, 256, 512} {
mh := New(k, 1) // your MinHash builder
start := time.Now()
for i := 0; i < 200; i++ {
mh.Sign(elems)
}
fmt.Printf("k=%4d: %.3f ms/sig\n", k, float64(time.Since(start).Microseconds())/200000.0)
}
}
Java¶
public class Benchmark {
public static void main(String[] args) {
long[] elems = new long[5000];
for (int i = 0; i < elems.length; i++) elems[i] = ((long) i * 2654435761L) & 0xFFFFFFFFL;
for (int k : new int[]{32, 64, 128, 256, 512}) {
MinHashSig mh = new MinHashSig(k, 1); // your builder
long start = System.nanoTime();
for (int i = 0; i < 200; i++) mh.sign(elems);
System.out.printf("k=%4d: %.3f ms/sig%n", k, (System.nanoTime() - start) / 200.0 / 1_000_000);
}
}
}
Python¶
import timeit
elems = [(i * 2654435761) & 0xFFFFFFFF for i in range(5000)]
for k in [32, 64, 128, 256, 512]:
mh = MinHashSig(k, 1) # your builder
t = timeit.timeit(lambda: mh.sign(elems), number=20)
print(f"k={k:4d}: {t / 20 * 1000:.3f} ms/sig")
Expected: build time grows linearly in
k, confirmingO(n·k). This motivates bottom-k and one-permutation MinHash (one hash per element) whenkis large.