Reservoir Sampling — Middle Level¶
One-line summary: Algorithm R does one RNG draw per item and keeps item
iwith probabilityk/i. That is wasteful whenn ≫ k, because almost every draw is "discard." Algorithm L fixes this by computing, in one geometric jump, how many items to skip before the next accepted one — turningO(n)random draws intoO(k·(1 + log(n/k))). And when items have weights, the A-Res / Efraimidis-Spirakis scheme assigns each item a keyu^(1/w)(withuuniform in(0,1)) and keeps theklargest keys — a one-pass, weighted, uniform-without-replacement sample.
Table of Contents¶
- Introduction
- Why Algorithm R Works (the invariant)
- Comparison with Alternatives
- Algorithm L — skip-based optimization
- Weighted Reservoir Sampling (A-Res / Efraimidis-Spirakis)
- Reservoir Sampling vs the Naive Shuffle, in Depth
- A Worked Weighted Example (Efraimidis-Spirakis keys)
- Code Examples
- Error Handling
- Performance Analysis
- Best Practices
- Visual Animation
- Summary
Introduction¶
Focus: "Why does it work?" and "When should I choose this variant?"
At the junior level you saw Algorithm R: fill k slots, then keep the i-th item with probability k/i. It is correct and O(1) per item, but it pays a hidden tax — it calls the random number generator once per item, even though when n ≫ k the vast majority of those draws end in "discard." If you are sampling k = 100 items from a stream of a billion, you make a billion RNG calls to keep a hundred items.
This level answers three questions:
- Why is keeping item
iwith probabilityk/iexactly right? (The reservoir maintains an invariant: afteriitems, it is a uniformk-sample of thosei.) - How do we avoid the wasted draws? (Algorithm L computes how far to skip ahead before the next accepted item, in
O(k log(n/k))random draws.) - What if items have weights — some should be sampled more often than others? (Weighted reservoir sampling via the Efraimidis-Spirakis key trick
u^(1/w).)
We will also nail down why the naive "collect everything and shuffle" is the wrong tool for streams, even though it gives the same distribution.
Why Algorithm R Works (the invariant)¶
The correctness rests on a single loop invariant:
Invariant I(i): After processing the first
iitems, the reservoir contains a uniform random subset of sizemin(k, i)of thoseiitems — i.e. everyk-subset is equally likely, and (equivalently) each of theiitems is in the reservoir with probabilityk/i.
Base case. After i = k items, the reservoir is all k of them — the only k-subset of k items, trivially uniform. Each is present with probability k/k = 1. ✓
Inductive step. Assume I(i) holds. Item i+1 arrives. With probability k/(i+1) it enters the reservoir, evicting a uniformly chosen slot; with probability 1 - k/(i+1) it is discarded.
- New item
x_{i+1}ends in the reservoir with probabilityk/(i+1)— exactly its fair share. ✓ - An old item
x_m(m ≤ i) was in the reservoir with probabilityk/i(byI(i)). It stays unlessx_{i+1}enters and evicts its slot:P(evicted) = (k/(i+1)) · (1/k) = 1/(i+1). So:
Every item — old and new — has probability k/(i+1). I(i+1) holds. By induction I(n) gives each item probability k/n. (The full subset-uniformity proof is in professional.md.)
This is the same telescoping cancellation from the junior walkthrough, now stated as an invariant you can carry through the loop. The k/i is not a magic constant — it is precisely the value that keeps the invariant true at every step.
Comparison with Alternatives¶
| Attribute | Algorithm R | Algorithm L | Naive shuffle | Weighted (A-Res) |
|---|---|---|---|---|
| Memory | O(k) | O(k) | O(n) | O(k) (a heap) |
| Passes | 1 | 1 | 1 (+shuffle) | 1 |
Needs n in advance? | No | No | Yes (stores all) | No |
| RNG calls | O(n) | O(k·(1+log(n/k))) | O(n) | O(n) (one key each) |
| Per-item time | O(1) | O(1) amortized | O(1) collect | O(log k) heap push |
| Weighted? | No | No | No (uniform) | Yes |
| Without replacement? | Yes | Yes | Yes | Yes |
Choose Algorithm R when: simplicity matters and n is not enormous, or per-item cost is dominated by reading the item anyway. Choose Algorithm L when: n ≫ k and RNG/branch cost per item is significant (high-throughput pipelines). Choose the naive shuffle when: the data already fits in memory and n is known — there is no reason to stream. Choose A-Res when: items have weights and you want each picked in proportion to its weight, still in one pass.
Algorithm L — skip-based optimization¶
The insight: in Algorithm R, consecutive items are independently accepted or rejected. Instead of flipping a coin for each item, ask "how many items will I skip before the next acceptance?" — a geometric-like quantity you can sample in one draw, then jump directly to that item.
Vitter's Algorithm L maintains a value W and, after each acceptance, computes the number of items to skip:
init: fill reservoir with first k items
W = exp(log(random()) / k) # random() uniform in (0,1)
loop:
skip = floor( log(random()) / log(1 - W) ) # number of items to skip
i += skip + 1
if i <= n:
reservoir[ randInt(0, k-1) ] = item_i # accept item i, random slot
W *= exp(log(random()) / k)
else:
stop
Each accepted item costs O(1) and consumes a constant number of RNG calls; the number of acceptances over the whole stream is O(k·(1 + log(n/k))) in expectation (the reservoir gets "refreshed" about k ln(n/k) times). So instead of n random draws you make roughly k log(n/k) of them — a massive saving when n is billions and k is hundreds.
Algorithm L produces exactly the same uniform distribution as Algorithm R — it just samples the gaps between acceptances directly instead of testing every item. (There is an even older Algorithm Z that is asymptotically optimal; L is the simplest near-optimal one and the usual choice.)
Weighted Reservoir Sampling¶
Sometimes items are not equal: item with weight w should appear in the sample with probability proportional to w. The classic Algorithm R has no notion of weight. The elegant fix is the Efraimidis-Spirakis scheme (often called A-Res, "Algorithm for Reservoir sampling, weighted").
The key trick. For each item with weight w_i, draw a uniform u_i ∈ (0, 1) and compute a key:
Then keep the k items with the largest keys. That single rule yields a weighted random sample without replacement: the probability item i is in the sample is proportional to its weight in the precise A-Res sense.
Why a max-heap of size k. You cannot store all keys (O(k) memory!). Maintain a min-heap of the current top-k keys. For each new item:
- compute its key;
- if the heap has fewer than
kitems, push it; - otherwise, if the new key is larger than the heap's smallest key, pop the smallest and push the new one.
This keeps the k largest keys seen so far, in one pass, with O(log k) per item.
Intuition for u^(1/w). A larger weight w makes 1/w smaller, so u^(1/w) is pushed closer to 1 — heavier items tend to get larger keys and are more likely to survive in the top-k. The exact distribution of u^(1/w) is engineered so the top-k selection matches the desired weighted-sampling probabilities. (The correctness proof is in professional.md.) A common refinement, A-ExpJ, applies the same skip idea as Algorithm L to the weighted case for fewer RNG calls.
Reservoir Sampling vs the Naive Shuffle, in Depth¶
It is worth being precise about why the naive "collect everything, shuffle, take k" approach is the wrong tool for streams, because the two methods produce the same distribution — the difference is entirely operational.
The naive method:
naive_sample(stream, k):
all = list(stream) # O(n) memory — must hold everything
shuffle(all) # Fisher-Yates, O(n) time
return all[:k]
Both reservoir sampling and the naive shuffle return a uniform k-subset. But:
| Dimension | Naive shuffle | Reservoir sampling |
|---|---|---|
| Memory | O(n) — the whole stream is materialized | O(k) — only the reservoir + a counter |
Needs n known up front? | Effectively yes (you store all of it) | No — works on unbounded streams |
| Can start emitting before the stream ends? | No — must see everything first | The reservoir is a valid sample at every prefix |
| Works on an infinite stream? | No — O(n) blows up | Yes — memory stays fixed |
| Passes over data | 1 to collect + 1 to shuffle | 1 |
| Distribution | uniform k-subset | identical uniform k-subset |
The deep connection: reservoir sampling is a streaming, memory-bounded partial Fisher-Yates shuffle. In Fisher-Yates you walk the array and swap each position with a random earlier-or-equal position; in reservoir sampling you do the same "swap into one of the first k slots with the right probability" decision, but you only keep the first k slots and discard the rest as you go. If you could afford O(n) memory and knew n, you would just shuffle. The entire reason reservoir sampling exists is to drop the O(n) memory and the need to know n — at zero cost to correctness. So the rule of thumb is simple: if the data already fits in memory and n is known, shuffle; otherwise, stream with a reservoir.
A subtle but important consequence of "valid at every prefix": you can stop a reservoir sampler at any moment — after a timeout, a budget, or a signal — and the reservoir you have is a correct uniform sample of everything seen so far. The naive shuffle has no such property; interrupt it mid-collection and you have a biased prefix of the stream.
A Worked Weighted Example (Efraimidis-Spirakis keys)¶
Let us make u^(1/w) concrete. Suppose k = 1 (pick a single weighted item) and three items with weights w = [1, 1, 4]. Item 3 should be picked four times as often as item 1 or item 2 — i.e. with probability 4/6 = 2/3, versus 1/6 each for items 1 and 2.
For each item we draw a uniform u ∈ (0,1) and form the key u^(1/w):
item 1 (w=1): key1 = u1^(1/1) = u1
item 2 (w=1): key2 = u2^(1/1) = u2
item 3 (w=4): key3 = u3^(1/4) (a fourth root — pulled toward 1)
We keep the item with the largest key. Why does item 3 win 2/3 of the time? Because u3^(1/4) is the fourth root of a number in (0,1), which is always closer to 1 than u3 itself — a heavy weight systematically inflates the key. The exact distribution of u^(1/w) is engineered so that P(key_i is the maximum) equals w_i / Σ w_j. You can confirm this empirically:
import random
from collections import Counter
def weighted_pick_once(weights):
best_key, best = -1.0, None
for i, w in enumerate(weights):
key = random.random() ** (1.0 / w)
if key > best_key:
best_key, best = key, i
return best
counts = Counter(weighted_pick_once([1, 1, 4]) for _ in range(300_000))
print(counts) # item 2 (index) ≈ 200k, items 0 and 1 ≈ 50k each
The output shows item index 2 chosen about 2/3 of the time, exactly matching w_3 / Σw = 4/6. For k > 1 the same keys are used — you simply keep the top-k largest keys instead of the single maximum, and you maintain them in a size-k min-heap so memory stays O(k). This is the whole A-Res scheme.
A practical note on numerical stability. For large weights or very many items, u^(1/w) can underflow to 0.0 in floating point (every key rounds to zero and ties break arbitrarily, destroying the weighting). The fix is to compare in log space: instead of the key u^(1/w), store and compare log(u) / w. Since log is monotonic, the top-k by log(u)/w is exactly the top-k by u^(1/w), but the arithmetic never underflows. Production weighted samplers almost always use the log-space form.
Code Examples¶
Example 1: Algorithm L (skip-based, unweighted)¶
Go¶
package main
import (
"fmt"
"math"
"math/rand"
)
// ReservoirSampleL implements Vitter's Algorithm L: same uniform result as
// Algorithm R, but ~O(k log(n/k)) random draws instead of O(n).
func ReservoirSampleL(stream []int, k int) []int {
n := len(stream)
if n <= k {
out := make([]int, n)
copy(out, stream)
return out
}
reservoir := make([]int, k)
copy(reservoir, stream[:k])
w := math.Exp(math.Log(rand.Float64()) / float64(k))
i := k - 1 // 0-based index of last processed item
for {
skip := int(math.Floor(math.Log(rand.Float64()) / math.Log(1-w)))
i += skip + 1
if i >= n {
break
}
reservoir[rand.Intn(k)] = stream[i] // accept into a random slot
w *= math.Exp(math.Log(rand.Float64()) / float64(k))
}
return reservoir
}
func main() {
stream := make([]int, 1000)
for i := range stream {
stream[i] = i
}
fmt.Println(ReservoirSampleL(stream, 5)) // 5 uniform samples from 0..999
}
Java¶
import java.util.*;
public class ReservoirL {
static int[] sampleL(int[] stream, int k) {
int n = stream.length;
if (n <= k) return Arrays.copyOf(stream, n);
int[] reservoir = Arrays.copyOf(stream, k);
Random rnd = new Random();
double w = Math.exp(Math.log(rnd.nextDouble()) / k);
int i = k - 1;
while (true) {
int skip = (int) Math.floor(Math.log(rnd.nextDouble()) / Math.log(1 - w));
i += skip + 1;
if (i >= n) break;
reservoir[rnd.nextInt(k)] = stream[i];
w *= Math.exp(Math.log(rnd.nextDouble()) / k);
}
return reservoir;
}
public static void main(String[] args) {
int[] stream = new int[1000];
for (int i = 0; i < 1000; i++) stream[i] = i;
System.out.println(Arrays.toString(sampleL(stream, 5)));
}
}
Python¶
import math
import random
def reservoir_sample_l(stream, k):
"""Algorithm L: same uniform result as R, ~O(k log(n/k)) random draws."""
reservoir = []
it = iter(stream)
for _ in range(k): # fill phase
try:
reservoir.append(next(it))
except StopIteration:
return reservoir # n < k
w = math.exp(math.log(random.random()) / k)
i = k - 1
items = list(stream) # for index access in this demo
n = len(items)
while True:
skip = math.floor(math.log(random.random()) / math.log(1 - w))
i += skip + 1
if i >= n:
break
reservoir[random.randrange(k)] = items[i]
w *= math.exp(math.log(random.random()) / k)
return reservoir
if __name__ == "__main__":
print(reservoir_sample_l(range(1000), 5))
What it does: fills k slots, then repeatedly samples a geometric "skip" gap to jump straight to the next accepted item, replacing a random slot. Same distribution as Algorithm R, far fewer RNG calls.
Example 2: Weighted reservoir sampling (A-Res with a heap)¶
Go¶
package main
import (
"container/heap"
"fmt"
"math"
"math/rand"
)
type keyed struct {
key float64
item int
}
type minHeap []keyed
func (h minHeap) Len() int { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].key < h[j].key }
func (h minHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(keyed)) }
func (h *minHeap) Pop() interface{} {
old := *h
n := len(old)
x := old[n-1]
*h = old[:n-1]
return x
}
// WeightedSample keeps the k items with the largest keys u^(1/w).
func WeightedSample(items []int, weights []float64, k int) []int {
h := &minHeap{}
heap.Init(h)
for idx, it := range items {
key := math.Pow(rand.Float64(), 1.0/weights[idx])
if h.Len() < k {
heap.Push(h, keyed{key, it})
} else if key > (*h)[0].key {
heap.Pop(h)
heap.Push(h, keyed{key, it})
}
}
out := make([]int, 0, h.Len())
for _, e := range *h {
out = append(out, e.item)
}
return out
}
func main() {
items := []int{1, 2, 3, 4, 5}
weights := []float64{1, 1, 1, 10, 10} // 4 and 5 favored
fmt.Println(WeightedSample(items, weights, 2))
}
Java¶
import java.util.*;
public class WeightedReservoir {
record Keyed(double key, int item) {}
static List<Integer> weightedSample(int[] items, double[] w, int k) {
PriorityQueue<Keyed> heap =
new PriorityQueue<>(Comparator.comparingDouble(Keyed::key)); // min-heap
Random rnd = new Random();
for (int idx = 0; idx < items.length; idx++) {
double key = Math.pow(rnd.nextDouble(), 1.0 / w[idx]);
if (heap.size() < k) {
heap.add(new Keyed(key, items[idx]));
} else if (key > heap.peek().key()) {
heap.poll();
heap.add(new Keyed(key, items[idx]));
}
}
List<Integer> out = new ArrayList<>();
for (Keyed e : heap) out.add(e.item());
return out;
}
public static void main(String[] args) {
int[] items = {1, 2, 3, 4, 5};
double[] w = {1, 1, 1, 10, 10};
System.out.println(weightedSample(items, w, 2));
}
}
Python¶
import heapq
import random
def weighted_sample(items, weights, k):
"""A-Res: keep the k items with the largest keys u^(1/w)."""
heap = [] # min-heap of (key, item)
for it, w in zip(items, weights):
key = random.random() ** (1.0 / w)
if len(heap) < k:
heapq.heappush(heap, (key, it))
elif key > heap[0][0]:
heapq.heapreplace(heap, (key, it))
return [item for _, item in heap]
if __name__ == "__main__":
items = [1, 2, 3, 4, 5]
weights = [1, 1, 1, 10, 10] # 4 and 5 favored
print(weighted_sample(items, weights, 2))
What it does: assigns each item a key u^(1/w) and keeps the k largest via a size-k min-heap — a one-pass weighted sample without replacement.
Error Handling¶
| Scenario | What goes wrong | Correct approach |
|---|---|---|
log(1 - w) with w = 1 | w reaches 1, log(0) → -inf, skip becomes huge/NaN. | Clamp random() away from 0/1, or special-case w ≥ 1. |
| Zero weight in A-Res | 1/w divides by zero. | Reject or skip items with w ≤ 0 (zero-weight items can never be sampled). |
| Heap comparator inverted | Keeping the smallest keys instead of largest. | Use a min-heap and replace its min only when the new key is larger. |
random() returns exactly 0 | log(0) = -inf, key 0^(1/w) = 0. | Use a generator that returns values in (0, 1) or re-draw on 0. |
Algorithm L on n ≤ k | Skip loop runs on too-short stream. | Short-circuit: if n ≤ k, return all items. |
| Float precision for tiny weights | u^(1/w) underflows for huge w. | Work in log space: compare log(u)/w instead of u^(1/w). |
Performance Analysis¶
The dominant cost difference between R and L is RNG calls, not arithmetic. Below is a micro-benchmark skeleton for the per-item throughput.
Go¶
package main
import (
"fmt"
"time"
)
func benchmark() {
sizes := []int{1000, 10000, 100000, 1000000}
k := 100
for _, n := range sizes {
data := make([]int, n)
for i := range data {
data[i] = i
}
start := time.Now()
for r := 0; r < 20; r++ {
_ = ReservoirSampleL(data, k) // compare against Algorithm R
}
fmt.Printf("n=%8d: %.3f ms\n", n, float64(time.Since(start).Milliseconds())/20.0)
}
}
Java¶
public static void benchmark() {
int[] sizes = {1000, 10000, 100000, 1000000};
int k = 100;
for (int n : sizes) {
int[] data = java.util.stream.IntStream.range(0, n).toArray();
long start = System.nanoTime();
for (int r = 0; r < 20; r++) {
ReservoirL.sampleL(data, k);
}
long elapsed = System.nanoTime() - start;
System.out.printf("n=%8d: %.3f ms%n", n, elapsed / 20.0 / 1_000_000);
}
}
Python¶
import timeit
k = 100
for n in (1_000, 10_000, 100_000, 1_000_000):
data = list(range(n))
t = timeit.timeit(lambda: reservoir_sample_l(data, k), number=20)
print(f"n={n:>8}: {t/20*1000:.3f} ms")
Expected shape: Algorithm R's RNG calls grow linearly with n; Algorithm L's grow like k log(n/k), so the gap widens dramatically as n increases with k fixed. The A-Res heap adds O(log k) per kept candidate, negligible when k is small.
| Variant | RNG calls | Per-item time | Memory |
|---|---|---|---|
| Algorithm R | n | O(1) | O(k) |
| Algorithm L | ≈ k·(1 + ln(n/k)) | O(1) amortized | O(k) |
| A-Res (weighted) | n (one key each) | O(log k) worst | O(k) |
| A-ExpJ (weighted+skip) | ≈ k·(1 + ln(n/k)) | O(log k) | O(k) |
Best Practices¶
- Use Algorithm R for clarity and moderate
n; switch to Algorithm L only when profiling shows RNG/branch cost matters at largen. - For weighted sampling, prefer the log-space key
log(u)/woveru^(1/w)to avoid underflow with large weights or many items. - Maintain weighted samples with a size-
kmin-heap; never store all keys. - Validate any variant against Algorithm R empirically: over many runs each item's inclusion frequency should match
k/n(or the weighted target). - Seed the RNG explicitly in tests for reproducibility; use a fresh, well-seeded RNG in production.
- Keep the fill phase, the accept rule, and the skip computation as separate, individually tested pieces.
- Document the guarantee: uniform-without-replacement (R/L) vs weighted-without-replacement (A-Res).
Visual Animation¶
See
animation.htmlfor interactive visualization.Middle-level relevance: - Watch the
k/iacceptance probability shrink asigrows (why later items are rarer to accept) - See how few items Algorithm L actually "lands on" versus how many Algorithm R inspects - Observe the reservoir staying a uniform sample at every prefix (the invariant)
Summary¶
At the middle level, reservoir sampling is understood through its invariant — after i items the reservoir is a uniform k-sample of those i — which is exactly why the acceptance probability must be k/i. Two practical refinements matter: Algorithm L samples the gap to the next accepted item, cutting RNG calls from O(n) to O(k·(1 + log(n/k))) while producing the identical distribution; and weighted reservoir sampling (A-Res / Efraimidis-Spirakis) assigns each item the key u^(1/w) and keeps the k largest via a min-heap, giving a one-pass weighted sample without replacement. The naive collect-and-shuffle matches the distribution but needs O(n) memory and a known n, so it is the wrong tool for true streams.
Next step: senior.md