Misra-Gries Heavy Hitters — Practice Tasks¶
All tasks must be solved in Go, Java, and Python. Each task ships with starter code or a precise spec in all three languages. Implement the Misra-Gries logic and validate against the evaluation criteria. Always check your output against a brute-force exact counter (hash map /
Counter) on small inputs, and confirm both that no true heavy hitter is missed and that the estimate obeysf(x) - n/k ≤ est(x) ≤ f(x).
Beginner Tasks (5)¶
Task 1 — Misra-Gries first pass¶
Problem. Implement misraGries(stream, k) returning the candidate map (≤ k - 1 entries) with the three update rules: increment, claim free slot, decrement-all (drop zeros).
Input / Output spec. - Input: a list/slice of string items and integer k ≥ 2. - Output: a map from candidate item to its (undercounted) estimate.
Constraints. Use at most k - 1 counters at any time. Do not store all distinct items.
Hint. When iterating to decrement, snapshot the keys first to avoid mutating during iteration.
Starter — Go.
package main
import "fmt"
func misraGries(stream []string, k int) map[string]int {
// TODO: three rules
return nil
}
func main() {
fmt.Println(misraGries([]string{"A", "B", "A", "C", "C", "A", "B", "D", "A"}, 3))
}
Starter — Java.
import java.util.*;
public class Task1 {
static Map<String,Integer> misraGries(List<String> stream, int k) {
// TODO
return new HashMap<>();
}
public static void main(String[] args) {
System.out.println(misraGries(
Arrays.asList("A","B","A","C","C","A","B","D","A"), 3));
}
}
Starter — Python.
def misra_gries(stream, k):
# TODO: three rules
return {}
if __name__ == "__main__":
print(misra_gries(["A","B","A","C","C","A","B","D","A"], 3))
- Expected Output: a map containing
A(e.g.{A:2, D:1}); exact contents depend on order butAmust be present. - Evaluation: correctness, never exceeding
k - 1counters, dropping zeros.
Task 2 — Boyer-Moore majority (k = 2)¶
Problem. Implement the single-counter majority-vote and confirm it is Misra-Gries with k = 2. Return the candidate and whether it is a true majority (verify with a second pass).
Constraints. O(1) extra space for the first pass.
Hint. Adopt the current item as candidate whenever count == 0; increment on match, decrement otherwise.
Expected Output: for [2,2,1,1,2,2] → (2, true); for [1,2,3,4] → some candidate, false.
Evaluation: correctness on both majority-exists and no-majority inputs.
Task 3 — Two-pass exact heavy hitters¶
Problem. Using your Task 1 function, return the exact set of items with frequency > n/k via a verification second pass.
Constraints. Second pass counts only the ≤ k - 1 candidates.
Hint. Build exact = {c: 0} over candidates, re-scan the stream, then filter by count > n/k.
Expected Output: for the sample stream, k = 3 → {A: 4}.
Evaluation: no false positives in the final result; no heavy hitter missed.
Task 4 — Validate the undercount bound¶
Problem. Generate a random stream, run Misra-Gries, and assert for every reported item that f(x) - n/k ≤ est(x) ≤ f(x), and that every true heavy hitter is present.
Constraints. Use a brute-force exact counter as the oracle.
Hint. Compute true frequencies with a full hash map only for testing (allowed since it is offline validation).
Expected Output: "bound holds; no heavy hitter missed" with no assertion failures.
Evaluation: both halves of the bound checked; completeness checked.
Task 5 — Choose k from a fraction¶
Problem. Write kFromFraction(phi) returning ⌈1/phi⌉, and a wrapper that finds items above phi · n using Misra-Gries + verification.
Constraints. 0 < phi < 1.
Hint. phi = 0.01 → k = 100 → 99 counters.
Expected Output: for phi = 0.5 on [2,2,1,1,2,2], returns {2: 4}.
Evaluation: correct k derivation; correct heavy-hitter result.
Intermediate Tasks (5)¶
Task 6 — Streaming interface (online updates)¶
Problem. Implement a MisraGries object with add(x) (online, one item at a time) and candidates() methods, so it can process an unbounded stream without holding it all.
Constraints. State is O(k); add is O(1) amortized.
Hint. Track n as you add for later threshold computation.
Expected Output: feeding the sample stream item by item yields the same candidates as the batch version.
Evaluation: online correctness, bounded state.
Task 7 — Merge two summaries¶
Problem. Implement merge(s1, s2, k): add common counts, union keys, then if more than k - 1 remain, subtract the k-th largest value from all and drop non-positives.
Constraints. Result has ≤ k - 1 entries.
Hint. Sort the values descending; the element at index k - 1 is the subtraction threshold.
Expected Output: merging {X:5,Y:3} and {X:2,Z:4} with k = 3 keeps the dominant key(s) within 2 slots.
Evaluation: size bound respected; merged estimates still satisfy f(x) - N/k ≤ est(x) ≤ f(x) over the combined stream.
Task 8 — Shard-then-merge driver¶
Problem. Split a stream into m shards, run an independent Misra-Gries per shard, then merge all summaries into a global one. Compare the result to a single-pass Misra-Gries over the concatenation.
Constraints. Shards are processed independently (simulate parallelism).
Hint. Merge is associative; any tree works.
Expected Output: the merged heavy set matches the single-pass heavy set on the same data.
Evaluation: merge equivalence; guarantee preserved.
Task 9 — Tumbling-window heavy hitters¶
Problem. Process a stream in fixed-size windows; reset the summary each window and emit per-window candidates.
Constraints. O(k) state at any time.
Hint. Reset (or start a fresh instance) at each window boundary.
Expected Output: per-window candidate lists; a window dominated by one item reports that item.
Evaluation: correct windowing; bounded memory.
Task 10 — Compare with Space-Saving¶
Problem. Implement a minimal Space-Saving (replace the minimum counter on overflow, assign min + 1) and compare its estimates with Misra-Gries on the same stream. Report which overcounts and which undercounts.
Constraints. Both use k - 1 counters.
Hint. Track Space-Saving's estimate vs true frequency to confirm est ≥ f.
Expected Output: Misra-Gries estimates ≤ f; Space-Saving estimates ≥ f.
Evaluation: correct error directions demonstrated.
Advanced Tasks (5)¶
Task 11 — Thread-safe Misra-Gries¶
Problem. Wrap the summary with a lock so concurrent add calls are safe, then build a concurrent driver that feeds the stream from multiple workers.
Constraints. No data races; decrement-all is a single critical section.
Hint. Compare correctness against the single-threaded result on the same multiset.
Expected Output: concurrent run yields a summary consistent with the sequential one.
Evaluation: thread safety, correctness under concurrency.
Task 12 — Tightness construction¶
Problem. Construct a stream (given n, k) that forces a target heavy item's estimate down to exactly f(x) - ⌊n/k⌋, demonstrating the bound is tight.
Constraints. The target must still be a true heavy hitter.
Hint. Interleave the target with k - 1 distinct fillers and a fresh trigger each round so a decrement-all hits the target every cycle.
Expected Output: measured undercount equals ⌊n/k⌋.
Evaluation: the construction achieves equality in the bound.
Task 13 — Worst-case O(1) per item¶
Problem. Replace the O(k) decrement-all sweep with a linked-bucket structure (counts grouped into buckets) so increments and the overflow step are O(1) worst case.
Constraints. Same guarantees as the basic version.
Hint. This is the Stream-Summary structure used by Space-Saving; adapt it to Misra-Gries' decrement semantics or implement the equivalent global-offset trick (keep a base offset instead of decrementing every counter).
Expected Output: identical heavy set, faster worst-case updates.
Evaluation: correctness plus O(1) worst-case per item.
Task 14 — Decay / sliding window¶
Problem. Add exponential decay: every T items, multiply all counts by γ < 1 (rounding/dropping zeros). Discuss how the guarantee changes to weighted frequencies.
Constraints. Keep counts integer or fixed-point.
Hint. Decay makes "frequency" recency-weighted; document the new threshold semantics.
Expected Output: recent-dominant items rank above stale ones.
Evaluation: correct decay; recency reflected.
Task 15 — Real-data heavy hitters (DDoS-style)¶
Problem. Given a large file of "IP" tokens (one per line, synthetic is fine), find IPs sending more than 0.1% of packets using Misra-Gries (k = 1000) + verification, and report them with exact counts.
Constraints. Do not load all distinct IPs into a hash map for the first pass; only the ≤ 999 candidates plus a verification pass.
Hint. Stream the file line by line.
Expected Output: the small set of top-talker IPs with exact counts above the 0.1% threshold.
Evaluation: bounded first-pass memory; exact final result.
Benchmark Task¶
Compare performance across all 3 languages. Measure first-pass time vs stream size with
k = 100.
Go¶
package main
import (
"fmt"
"math/rand"
"time"
)
func misraGries(stream []int, k int) map[int]int {
c := map[int]int{}
for _, x := range stream {
if _, ok := c[x]; ok {
c[x]++
} else if len(c) < k-1 {
c[x] = 1
} else {
for key := range c {
c[key]--
if c[key] == 0 {
delete(c, key)
}
}
}
}
return c
}
func main() {
for _, n := range []int{10000, 100000, 1000000} {
stream := make([]int, n)
for i := range stream {
stream[i] = rand.Intn(5000)
}
start := time.Now()
misraGries(stream, 100)
fmt.Printf("n=%8d: %.1f ms\n", n, float64(time.Since(start).Microseconds())/1000.0)
}
}
Java¶
import java.util.*;
public class Benchmark {
static Map<Integer,Integer> misraGries(int[] stream, int k) {
Map<Integer,Integer> c = new HashMap<>();
for (int x : stream) {
if (c.containsKey(x)) c.put(x, c.get(x)+1);
else if (c.size() < k-1) c.put(x, 1);
else {
var it = c.entrySet().iterator();
while (it.hasNext()) {
var e = it.next();
if (e.getValue() == 1) it.remove(); else e.setValue(e.getValue()-1);
}
}
}
return c;
}
public static void main(String[] args) {
Random rnd = new Random(1);
for (int n : new int[]{10000, 100000, 1000000}) {
int[] stream = new int[n];
for (int i = 0; i < n; i++) stream[i] = rnd.nextInt(5000);
long start = System.nanoTime();
misraGries(stream, 100);
System.out.printf("n=%8d: %.1f ms%n", n, (System.nanoTime()-start)/1e6);
}
}
}
Python¶
import random, time
def misra_gries(stream, k):
c = {}
for x in stream:
if x in c:
c[x] += 1
elif len(c) < k - 1:
c[x] = 1
else:
for key in list(c):
c[key] -= 1
if c[key] == 0:
del c[key]
return c
if __name__ == "__main__":
for n in (10_000, 100_000, 1_000_000):
stream = [random.randint(0, 4999) for _ in range(n)]
t = time.perf_counter()
misra_gries(stream, 100)
print(f"n={n:>9}: {(time.perf_counter()-t)*1000:.1f} ms")
Goal: observe near-linear scaling in n (the first pass is O(n) amortized), and compare the constant factors across languages. Try raising k and the distinct-value range to see how decrement-all frequency affects runtime.