LSM-Tree — Hands-On Tasks¶
All tasks must be solvable in Go, Java, and Python. Reference sketches below are in Python for brevity; Go/Java translations follow the patterns in
junior.md§12 andinterview.md. Tasks 1–5 build the core engine; tasks 6–10 add the production accelerators (bloom filter, sparse index, leveled compaction, WAL); tasks 11–15 push to scale, correctness, and benchmarking.
Beginner Tasks¶
Task 1 — Memtable + flush + newest-first get¶
Goal: Implement LSM(flush_at) with put(k, v), get(k), and an automatic flush of the memtable to an immutable in-memory sorted SSTable when it reaches flush_at entries. get resolves the memtable then SSTables newest-first.
Acceptance: - After put("b","1") put("d","2") put("a","3") with flush_at=3, there is exactly one SSTable [a=3, b=1, d=2] and the memtable is empty. - get returns the newest value when a key was overwritten across flushes (e.g. put("d","5") after a flush → get("d")=="5"). - Each SSTable is sorted; newer SSTables are searched first.
Hints: - Keep SSTables in a list with index 0 = newest (prepend on flush). - Binary-search inside each SSTable.
Reference (Python): See junior.md §12.3.
Task 2 — Deletes via tombstones¶
Goal: Add delete(k) that inserts a tombstone, and make get return "not found" when the newest version of a key is a tombstone.
Acceptance: - put("b","1") (flush) then delete("b") → get("b") is not-found, even though b=1 still exists in an older SSTable. - A delete on a never-existing key is harmless.
Hints: - Represent a tombstone as a sentinel value (e.g. a unique string or a None-wrapper). - delete(k) is just put(k, TOMBSTONE).
Common bug: removing the key instead of tombstoning — the deleted value resurrects on the next read.
Task 3 — Compaction (merge all SSTables)¶
Goal: Implement compact() that merges all SSTables into one, keeping the newest value per key and dropping tombstones.
Acceptance: - After the §11 trace, compact() yields a single SSTable [a=3, c=4, d=5] (no b, no tombstone, no stale d=2). - Post-compaction get results are unchanged for live keys and still not-found for deleted keys.
Hints: - Iterate SSTables oldest→newest into a dict so newer overwrites older; then drop tombstones and sort. - Verify with an oracle (a plain dict that you apply the same put/delete stream to).
Reference (Python): See junior.md §12.3 compact.
Task 4 — Range scan (k-way merge)¶
Goal: Add scan(lo, hi) returning all live (key, value) pairs with lo <= key < hi, in sorted order, merging across the memtable and all SSTables with newest-wins, suppressing tombstoned keys.
Acceptance: - A key updated in a newer SSTable appears once with its newest value. - A key deleted (tombstone in a newer source) does not appear, even if an older SSTable holds it. - Output is sorted ascending by key.
Hints: - Build iterators over each source; use a heap keyed by key, breaking ties by source recency (newest first). - When several sources sit on the same key, take the newest entry and advance all iterators past that key.
Common bug: forgetting that a tombstone must suppress the key, then be filtered from the output.
Task 5 — WAL and crash recovery¶
Goal: Add a write-ahead log: every put/delete appends to an on-disk (or in-memory list standing in for disk) log before updating the memtable. Implement recover() that rebuilds the memtable by replaying the WAL.
Acceptance: - Simulate a crash by discarding the in-memory memtable but keeping the WAL; recover() reconstructs the exact pre-crash state (verified against an oracle). - After a flush, the WAL segment covering the flushed memtable is truncated.
Hints: - Order matters: append to WAL → (fsync) → apply to memtable → ack. - On recovery, replay entries in order; later entries overwrite earlier ones.
Common bug: truncating the WAL before the memtable is durably flushed → data loss on crash.
Intermediate Tasks¶
Task 6 — Bloom filter per SSTable¶
Goal: Attach a bloom filter to each SSTable. In get, consult the filter first; skip the SSTable's block read when the filter says "definitely not present."
Acceptance: - For a key absent from an SSTable, the filter returns "no" (no false negatives), and the SSTable is skipped without a "block read" (count block reads to prove it). - Measured false-positive rate over many random absent keys is within ~2× of the theoretical p for the chosen bits/key.
Hints: - Bits per key ≈ 1.44·log2(1/p); k hashes ≈ (m/n)·ln 2. Use two independent hashes to derive k (Kirsch–Mitzenmacher double hashing). - Instrument a block_reads counter to demonstrate the savings.
Task 7 — Sparse index per SSTable¶
Goal: Lay out each SSTable as fixed-size blocks. Build a sparse index (first key + block offset per block). get binary-searches the index, reads one block, then searches within it.
Acceptance: - A lookup performs exactly one block read (after the bloom check passes), regardless of SSTable size. - The sparse index size is O(num_blocks), not O(num_keys).
Hints: - Pick a block size in entries (e.g. 4). The index entry for block j is (keys[j*block], j). - Binary-search the index for the largest first-key <= target; that's the block to read.
Task 8 — Leveled compaction¶
Goal: Replace flat SSTables with leveled organization: L0 (overlapping), then levels L1, L2, ... each fanout× larger with non-overlapping key ranges. Flushes go to L0; when a level exceeds its budget, compact its overflow into the next level (newest-wins).
Acceptance: - After many puts, each level ≥1 has non-overlapping SSTable ranges, and a point read probes at most one SSTable per level (plus L0). - Read amplification (sources probed per get) is bounded by ≈ number_of_levels, much lower than flat.
Hints: - Model a level ≥1 as one sorted run; cascading overflow merges newer-into-older. - Verify the non-overlap invariant after each compaction.
Reference (Python): See middle.md §10.3.
Task 9 — Size-tiered compaction + amplification comparison¶
Goal: Implement size-tiered compaction (merge T similarly-sized SSTables into the next tier) alongside the leveled one from Task 8. Instrument write, read, and space amplification for both.
Acceptance: - On the same workload, size-tiered shows lower write amp and higher read/space amp than leveled — reproduce the §2/§12 numbers in middle.md qualitatively. - A summary table prints WA / RA / SA for each strategy.
Hints: - WA = total disk bytes written / user bytes; RA = sources probed per get; SA = stored bytes / live bytes. - Use the same random put/delete/get stream for both strategies.
Task 10 — Sequence numbers and snapshot reads¶
Goal: Stamp every write with a monotonically increasing sequence number. Store keys as (user_key, seq) sorted by user_key asc, seq desc. Implement get_as_of(key, snapshot_seq) returning the newest version with seq <= snapshot_seq.
Acceptance: - get_as_of reflects only writes up to the snapshot, ignoring newer ones (MVCC). - Plain get(key) equals get_as_of(key, current_seq).
Hints: - "Newest wins" becomes "first entry per user_key in the (key asc, seq desc) order whose seq <= snapshot." - A tombstone version still applies if it's the chosen version for that snapshot.
Advanced Tasks¶
Task 11 — Concurrent memtable with background flush¶
Goal: Make writes non-blocking during flush: when the active memtable fills, freeze it (move to an immutable queue) and start a fresh active memtable immediately; a background worker flushes frozen memtables. get checks active → frozen (newest-first) → SSTables.
Acceptance: - Writes never block on flush (no foreground stall in the common case). - Reads see a consistent newest-first view across active, frozen, and on-disk sources under concurrent load. - Stress test with concurrent writers + readers against an oracle (eventually consistent after flushes settle).
Hints: - Go: a skip list or sync.RWMutex-guarded map + a channel-fed flush goroutine. Java: ConcurrentSkipListMap + an executor. Python: threading + a flush thread. - The immutable-memtable queue must be searched newest-first, between the active memtable and the SSTables.
Task 12 — Rate-limited compaction scheduler with metrics¶
Goal: Run compaction in the background under a token-bucket rate limiter (bytes/sec), triggered when compaction debt (pending bytes) exceeds a threshold. Emit write/read/space amplification and pending-bytes metrics.
Acceptance: - Compaction throughput never exceeds the configured byte rate (measured over a window). - Under sustained ingest above the compaction rate, pending_bytes (debt) rises — demonstrate the leading indicator of a write stall. - Metrics are queryable at any time.
Hints: - Use the scheduler skeletons in senior.md §8. - Simulate a "write stall": when debt crosses a stop-threshold, block/slow put.
Task 13 — Tombstone garbage collection with grace period¶
Goal: Implement correct tombstone GC: drop a tombstone during compaction only when the compaction reaches the bottom level (no older copy can survive) and a configured grace period (wall-clock or logical) has elapsed.
Acceptance: - A tombstone dropped only at the bottom level + after grace; otherwise retained. - Construct a scenario where dropping a tombstone too early would resurrect a key, and show your implementation does not.
Hints: - Track each tombstone's creation time/seq; compare against now - grace at compaction. - The bottom-level check: this compaction includes the oldest run that could contain the key.
Task 14 — Persist SSTables to real files¶
Goal: Write SSTables to actual files with the layout from middle.md §5 (data blocks, index block, filter block, footer) and per-block CRC checksums. Reopen the database from disk and serve reads; verify checksums on read.
Acceptance: - A database survives process restart: reopen, replay WAL for the un-flushed memtable, and serve correct reads. - A deliberately corrupted block is detected via checksum on read (and recovered from WAL/another copy if available).
Hints: - Footer stores offsets of index and filter blocks; read it first on open. - Use a length-prefixed entry encoding; compute and store a CRC per data block.
Task 15 — Benchmark: LSM-tree vs B-tree (or sorted map)¶
Goal: Benchmark your LSM-tree against a B-tree / balanced sorted map (TreeMap, container/list+, Python sortedcontainers) across (a) write-heavy, (b) read-heavy, and (c) delete-heavy workloads, reporting throughput and the three amplifications for the LSM-tree.
Compare performance across all 3 languages and both structures.
Go¶
package main
import (
"fmt"
"math/rand"
"time"
)
func main() {
sizes := []int{1000, 10000, 100000}
for _, n := range sizes {
// build a workload: 70% writes, 25% reads, 5% deletes
ops := make([]int, n)
keys := make([]string, n)
for i := range ops {
r := rand.Intn(100)
switch {
case r < 70:
ops[i] = 0 // put
case r < 95:
ops[i] = 1 // get
default:
ops[i] = 2 // delete
}
keys[i] = fmt.Sprintf("k%d", rand.Intn(n))
}
start := time.Now()
// runLSMWorkload(ops, keys) // your LSM-tree
elapsed := time.Since(start)
fmt.Printf("n=%7d: %.3f ms\n", n, float64(elapsed.Milliseconds()))
}
}
Java¶
import java.util.*;
public class Benchmark {
public static void main(String[] args) {
int[] sizes = {1000, 10000, 100000};
Random rnd = new Random(42);
for (int n : sizes) {
int[] ops = new int[n];
String[] keys = new String[n];
for (int i = 0; i < n; i++) {
int r = rnd.nextInt(100);
ops[i] = r < 70 ? 0 : (r < 95 ? 1 : 2); // put / get / delete
keys[i] = "k" + rnd.nextInt(n);
}
long start = System.nanoTime();
// runLsmWorkload(ops, keys); // your LSM-tree
long elapsed = System.nanoTime() - start;
System.out.printf("n=%7d: %.3f ms%n", n, elapsed / 1_000_000.0);
}
}
}
Python¶
import random, time
def main():
for n in (1_000, 10_000, 100_000):
random.seed(42)
ops, keys = [], []
for _ in range(n):
r = random.randint(0, 99)
ops.append(0 if r < 70 else (1 if r < 95 else 2)) # put/get/delete
keys.append(f"k{random.randint(0, n)}")
start = time.perf_counter()
# run_lsm_workload(ops, keys) # your LSM-tree
elapsed = (time.perf_counter() - start) * 1000
print(f"n={n:>7}: {elapsed:.3f} ms")
if __name__ == "__main__":
main()
Acceptance: - The LSM-tree out-throughputs the in-place sorted-map structure on the write-heavy workload. - The read-heavy and delete-heavy results illustrate the read/space-amplification trade-off (LSM slower or larger). - Report write/read/space amplification for the LSM-tree across all three workloads, and discuss which compaction strategy you would pick for each.
In this topic
- interview
- tasks