Locality-Sensitive Hashing (LSH) — Practice Tasks¶
All tasks must be solved in Go, Java, and Python. Always validate against a brute-force nearest-neighbor oracle on small data before trusting LSH at scale.
Beginner Tasks¶
Task 1: Implement a brute-force nearest-neighbor oracle: given a list of vectors and a query, return the id of the vector with the highest cosine similarity. This is your correctness baseline for every later task.
Go¶
package main
import (
"fmt"
"math"
)
func cosine(a, b []float64) float64 {
var dot, na, nb float64
for i := range a {
dot += a[i] * b[i]
na += a[i] * a[i]
nb += b[i] * b[i]
}
return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
func bruteNN(items [][]float64, q []float64) (int, float64) {
best, bestSim := -1, -2.0
for i, x := range items {
if s := cosine(q, x); s > bestSim {
bestSim, best = s, i
}
}
return best, bestSim
}
func main() {
items := [][]float64{{1, 0}, {0, 1}, {0.9, 0.1}}
id, s := bruteNN(items, []float64{0.95, 0.05})
fmt.Println(id, s) // expect 0 or 2
}
Java¶
public class Task1 {
static double cosine(double[] a, double[] b) {
double dot = 0, na = 0, nb = 0;
for (int i = 0; i < a.length; i++) { dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i]; }
return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
static int bruteNN(double[][] items, double[] q) {
int best = -1; double bestSim = -2;
for (int i = 0; i < items.length; i++) {
double s = cosine(q, items[i]);
if (s > bestSim) { bestSim = s; best = i; }
}
return best;
}
public static void main(String[] args) {
double[][] items = {{1,0},{0,1},{0.9,0.1}};
System.out.println(bruteNN(items, new double[]{0.95,0.05}));
}
}
Python¶
import math
def cosine(a, b):
dot = sum(p * q for p, q in zip(a, b))
na = math.sqrt(sum(p * p for p in a))
nb = math.sqrt(sum(q * q for q in b))
return dot / (na * nb)
def brute_nn(items, q):
best, best_sim = -1, -2.0
for i, x in enumerate(items):
s = cosine(q, x)
if s > best_sim:
best_sim, best = s, i
return best, best_sim
if __name__ == "__main__":
items = [[1, 0], [0, 1], [0.9, 0.1]]
print(brute_nn(items, [0.95, 0.05]))
- Constraints: handle the zero vector (define similarity 0 or skip). Reject mismatched dimensions.
- Expected Output: the index of the most-similar vector (
0or2above). - Evaluation: correctness on identical, orthogonal, and near-duplicate vectors.
Task 2: Implement a single random-hyperplane hash bit: given a vector x and a random normal r, return 1 if x·r ≥ 0 else 0. Provide starter code in all 3 languages. - Constraints: use a fixed seed so the plane is reproducible; verify two near-parallel vectors usually share the bit.
Task 3: Build an m-bit hyperplane signature and store vectors in a dictionary keyed by the full signature. Implement a query that returns all items in the query's bucket. Verify a near-duplicate shares the query's bucket for small m. - Constraints: generate the m planes once; never regenerate per query.
Task 4: Empirically confirm P[same bit] = 1 - θ/π: for several fixed angles θ, generate two unit vectors at that angle, draw many random planes, and report the observed fraction of agreeing bits. It should track 1 - θ/π.
Task 5: Implement estimate_cosine_from_signature(sigA, sigB) that estimates the angle as θ̂ = π · (fraction of differing bits) and converts to cosine cos(θ̂). Compare against exact cosine for several pairs with m = 256 bits.
Intermediate Tasks¶
Task 6: Implement banded LSH over hyperplane signatures: split the k-bit signature into b bands of r bits, hash each band into its own table, and a query unions candidate ids across all bands. Provide starter code in all 3 languages. - Constraints: k = b·r; de-duplicate candidate ids; re-rank by exact cosine.
Task 7: Implement the S-curve P = 1 - (1 - p^r)^b and print a table of P vs p ∈ {0.1,…,0.9} for (b,r) = (5,4), (10,2), and (2,10). Identify which configuration has the highest threshold.
Task 8: Write a threshold picker: given a budget k and a target threshold t, return the (b, r) with b·r = k minimizing |(1/b)^{1/r} - t|. Verify it returns (8, 16) for k = 128, t = 0.8.
Task 9: Implement multiple hash tables (L): build L independent banded indexes and union candidates across all of them. Measure how recall improves with L against the brute-force oracle on random data.
Task 10: Implement the p-stable (Euclidean) hash h(x) = floor((a·x + c)/w) with Gaussian a, uniform c ∈ [0,w), and tunable width w. Build an LSH index for L2 distance and verify nearby points usually share a bucket while far points do not.
Advanced Tasks¶
Task 11: Implement multi-probe LSH for hyperplane bits: for a query, generate a probe sequence that flips the least-confident bits first (smallest |x·r|), and gather candidates from the probed buckets. Show that with fewer tables, multi-probe matches the recall of plain LSH with more tables. Provide starter code in all 3 languages. - Constraints: cap the number of probes per query; report recall vs probe budget.
Task 12: Build an end-to-end near-duplicate detector using MinHash-LSH banding (reuse signatures from ../02-minhash): shingle short documents, MinHash to signatures, band into candidates, then confirm with exact Jaccard and report clusters with J ≥ 0.7.
Task 13: Measure recall vs selectivity: over random high-dimensional data with known true neighbors, sweep (b, r, L) and plot (or tabulate) recall@10 against candidates/N. Identify the cheapest configuration achieving recall ≥ 0.95.
Task 14: Make the index incremental and concurrent: support insert(id, vec), delete(id) via tombstones, and concurrent query, guarded by a read/write lock. Verify tombstoned ids never appear in candidate sets and that a compaction pass physically removes them.
Task 15: Implement and verify the O(N^ρ) intuition empirically: for a fixed family, estimate p₁, p₂ from sampled near/far pairs, compute ρ = ln p₁/ln p₂, then measure how the candidate-count grows as N grows (it should scale roughly like N^ρ, sublinearly). Report the fitted exponent.
Benchmark Task¶
Compare query latency and recall across all 3 languages as
Ngrows (LSH should stay sublinear).
Go¶
package main
import (
"fmt"
"math/rand"
"time"
)
func main() {
d := 64
for _, n := range []int{1000, 10000, 100000} {
idx := NewLSH(8, 8, d, 1) // your banded LSH (k = 64 bits)
rng := rand.New(rand.NewSource(7))
for i := 0; i < n; i++ {
v := make([]float64, d)
for j := range v {
v[j] = rng.NormFloat64()
}
idx.Add(i, v)
}
q := make([]float64, d)
for j := range q {
q[j] = rng.NormFloat64()
}
start := time.Now()
for i := 0; i < 1000; i++ {
idx.Query(q)
}
fmt.Printf("N=%7d: %.3f ms/query\n", n, float64(time.Since(start).Microseconds())/1000000.0)
}
}
Java¶
import java.util.Random;
public class Benchmark {
public static void main(String[] args) {
int d = 64;
for (int n : new int[]{1000, 10000, 100000}) {
LSH idx = new LSH(8, 8, d, 1); // your banded LSH
Random rng = new Random(7);
for (int i = 0; i < n; i++) {
double[] v = new double[d];
for (int j = 0; j < d; j++) v[j] = rng.nextGaussian();
idx.add(i, v);
}
double[] q = new double[d];
for (int j = 0; j < d; j++) q[j] = rng.nextGaussian();
long start = System.nanoTime();
for (int i = 0; i < 1000; i++) idx.query(q);
System.out.printf("N=%7d: %.3f ms/query%n", n, (System.nanoTime() - start) / 1000.0 / 1_000_000);
}
}
}
Python¶
import random
import timeit
def build(n, d):
idx = LSH(b=8, r=8, d=d, seed=1) # your banded LSH
rng = random.Random(7)
for i in range(n):
idx.add(i, [rng.gauss(0, 1) for _ in range(d)])
q = [rng.gauss(0, 1) for _ in range(d)]
return idx, q
for n in [1000, 10000, 100000]:
idx, q = build(n, 64)
t = timeit.timeit(lambda: idx.query(q), number=200)
print(f"N={n:7d}: {t / 200 * 1000:.3f} ms/query")
Expected: query time grows far slower than linearly in
N(sublinear), confirming the LSH advantage — the query touches only its buckets, not allNitems. Compare against a brute-force scan to confirm both the speedup and the recall trade-off.