Why are Data Structures Important? — Senior Level¶
Table of Contents¶
- Introduction
- DS in Production Systems
- Choosing DS for Distributed Systems
- DS for High-Throughput Low-Latency Systems
- Trade-offs at Scale
- Concurrent Data Structures
- Code Examples
- Failure Modes
- Summary
Introduction¶
Focus: "How do data structure choices shape production systems at scale?"
At the senior level, data structures are no longer abstract containers you choose during coding exercises. They are architectural decisions that determine latency percentiles, memory budgets, operational complexity, and failure domains. A wrong choice discovered in production costs weeks of engineering effort to replace.
This document covers how production systems like Redis, PostgreSQL, Kafka, and Memcached rely on specific data structures, how to choose DS for distributed environments, and how to build concurrent data structures that perform under contention.
DS in Production Systems¶
Redis Data Structures¶
Redis is fundamentally a data structure server. Every Redis command maps to an operation on a specific DS:
| Redis Type | Underlying DS | Use Case | Key Operations |
|---|---|---|---|
| String | Simple Dynamic String (SDS) | Caching, counters, session data | GET, SET, INCR |
| List | Quicklist (linked list of ziplists) | Message queues, activity feeds | LPUSH, RPOP, LRANGE |
| Set | Hash table (intset for small sets) | Tags, unique visitors, dedup | SADD, SMEMBERS, SINTER |
| Sorted Set | Skip list + hash table | Leaderboards, rate limiting, scheduling | ZADD, ZRANGE, ZRANGEBYSCORE |
| Hash | Ziplist (small) / Hash table (large) | Object storage, user profiles | HSET, HGET, HGETALL |
| Stream | Radix tree of listpacks | Event sourcing, log aggregation | XADD, XREAD, XRANGE |
| HyperLogLog | Probabilistic registers | Cardinality estimation (unique counts) | PFADD, PFCOUNT |
| Bloom Filter | Bit array + hash functions | Membership testing with false positives | BF.ADD, BF.EXISTS |
Database Indexes — B+ Trees¶
Every relational database (PostgreSQL, MySQL, SQLite) uses B+ trees for indexes:
- Internal nodes store keys and pointers to children (fanout of 100-1000).
- Leaf nodes store key-value pairs and are linked sequentially.
- Height is typically 3-4 for billions of rows, since height ≈ log_fanout(N).
- One disk read per level = 3-4 disk reads to find any row.
Why B+ trees and not hash tables? Hash tables support only equality queries (WHERE id = 42). B+ trees support equality, range queries (WHERE age BETWEEN 20 AND 30), prefix queries, and sorted iteration.
Why B+ trees and not BSTs? BSTs have one key per node and high height. B+ trees pack hundreds of keys per node, reducing disk reads. Each node fits in one disk page (4-16 KB).
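To make the height claim concrete, here is a back-of-the-envelope sketch in Go; the fanout of 500 is an illustrative assumption for a page of integer keys, not any specific database's value.

package main

import (
    "fmt"
    "math"
)

// indexHeight estimates how many levels a B+ tree needs to cover n keys
// when every node holds up to fanout children.
func indexHeight(n, fanout float64) int {
    return int(math.Ceil(math.Log(n) / math.Log(fanout)))
}

func main() {
    // With a fanout of ~500, even a billion rows fit in roughly 4 levels,
    // i.e. about 4 page reads to reach any row.
    for _, rows := range []float64{1e6, 1e9, 1e12} {
        fmt.Printf("%.0e rows -> height %d\n", rows, indexHeight(rows, 500))
    }
}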
Message Queues — Log-Structured Storage¶
Kafka uses an append-only log (essentially a persistent array):
- Producers append messages to the end: O(1).
- Consumers read from any offset: O(1) with direct disk seek.
- No random writes, no deletions — sequential I/O only.
- This design achieves millions of messages per second per partition.
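A toy append-only log in Go (a sketch of the idea, not Kafka's actual segment format) shows why appends and offset-based reads stay cheap:

package main

import "fmt"

// Log is a toy append-only log, an in-memory stand-in for a partition.
// Producers only append; consumers read by offset without mutating the log.
type Log struct {
    records []string
}

// Append adds a record at the tail and returns its offset: O(1).
func (l *Log) Append(record string) int {
    l.records = append(l.records, record)
    return len(l.records) - 1
}

// Read returns up to max records starting at offset: O(1) to locate the start.
func (l *Log) Read(offset, max int) []string {
    if offset >= len(l.records) {
        return nil
    }
    end := offset + max
    if end > len(l.records) {
        end = len(l.records)
    }
    return l.records[offset:end]
}

func main() {
    log := &Log{}
    log.Append("order-created")
    log.Append("order-paid")
    log.Append("order-shipped")
    // Two independent consumers simply track their own offsets.
    fmt.Println(log.Read(0, 2)) // [order-created order-paid]
    fmt.Println(log.Read(2, 2)) // [order-shipped]
}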
Caches — Hash Tables with Eviction¶
Memcached and application-level caches are hash tables augmented with eviction policies:
| Eviction Policy | DS Required | Behavior |
|---|---|---|
| LRU (Least Recently Used) | Hash map + doubly linked list | Evict least recently accessed |
| LFU (Least Frequently Used) | Hash map + frequency buckets | Evict least frequently accessed |
| TTL (Time To Live) | Hash map + min-heap or timer wheel | Evict expired entries |
| Random | Hash map | Evict random entry (simplest) |
Choosing DS for Distributed Systems¶
Distributed Hash Tables (DHTs)¶
In distributed systems, a single hash map is not enough. You need consistent hashing to distribute keys across nodes:
Traditional hashing: node = hash(key) % N
Problem: Adding a node remaps almost all keys.
Consistent hashing: node = first_node_clockwise_from(hash(key))
Benefit: Adding a node remaps only about 1/N of the keys.
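A minimal consistent-hash ring sketch in Go, with a single hash point per node and no virtual nodes (production rings add virtual nodes to smooth the distribution):

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// Ring maps each node to one point on a 32-bit hash ring and routes each
// key to the first node clockwise from the key's hash.
type Ring struct {
    points []uint32
    nodes  map[uint32]string
}

func hashOf(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}

func NewRing(nodes []string) *Ring {
    r := &Ring{nodes: make(map[uint32]string)}
    for _, n := range nodes {
        p := hashOf(n)
        r.points = append(r.points, p)
        r.nodes[p] = n
    }
    sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
    return r
}

// NodeFor finds the first ring point at or after hash(key), wrapping around.
func (r *Ring) NodeFor(key string) string {
    h := hashOf(key)
    i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
    if i == len(r.points) {
        i = 0 // wrap around the ring
    }
    return r.nodes[r.points[i]]
}

func main() {
    ring := NewRing([]string{"node-a", "node-b", "node-c"})
    for _, key := range []string{"user:1", "user:2", "user:3"} {
        fmt.Println(key, "->", ring.NodeFor(key))
    }
}

Adding a node claims only the arc between its hash point and its predecessor's, which is exactly the ~1/N remapping property described above.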
| Distributed DS | Based On | Used By | Trade-off |
|---|---|---|---|
| Consistent hash ring | Hash table + ring | DynamoDB, Cassandra, Memcached | Even distribution vs rebalancing cost |
| Merkle tree | Hash tree | Git, blockchain, anti-entropy | Integrity verification vs storage |
| CRDTs | Various (counters, sets, maps) | Riak, Redis Enterprise | Eventual consistency vs complexity |
| LSM Tree | Sorted runs + merge | Cassandra, RocksDB, LevelDB | Write throughput vs read amplification |
| Raft/Paxos log | Append-only log | etcd, CockroachDB | Consensus vs latency |
CAP Theorem and DS Implications¶
The choice of distributed data structure determines which guarantees you get:
- CP systems (Consistency + Partition tolerance): Use replicated logs with consensus (Raft). Strong consistency but higher latency.
- AP systems (Availability + Partition tolerance): Use CRDTs or last-writer-wins registers. Always available but eventually consistent.
- CA systems (not realistic in distributed systems): Single-node database with B+ tree indexes.
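As a small illustration of the AP column, a grow-only counter (G-Counter) is about the simplest CRDT: replicas increment independently and merges always converge. This is a sketch of the idea, not any particular library's API.

package main

import "fmt"

// GCounter is a grow-only counter CRDT: each replica increments only its
// own slot, and merge takes the per-replica maximum, so merging is
// commutative, associative, and idempotent.
type GCounter map[string]int

func (c GCounter) Increment(replica string) { c[replica]++ }

func (c GCounter) Value() int {
    total := 0
    for _, v := range c {
        total += v
    }
    return total
}

func (c GCounter) Merge(other GCounter) {
    for replica, v := range other {
        if v > c[replica] {
            c[replica] = v
        }
    }
}

func main() {
    a := GCounter{}
    b := GCounter{}
    a.Increment("A") // replica A sees two increments
    a.Increment("A")
    b.Increment("B") // replica B sees one increment, concurrently
    a.Merge(b)
    b.Merge(a)
    fmt.Println(a.Value(), b.Value()) // both replicas converge to 3
}

Because merges converge regardless of order, both sides of a partition can keep accepting writes and reconcile later, which is the AP trade-off in practice.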
DS for High-Throughput Low-Latency Systems¶
Requirements¶
- P99 latency < 1 ms — Cannot afford O(n) operations on any path.
- Millions of operations per second — Must minimize allocation and contention.
- Predictable performance — Avoid data structures whose amortized O(1) hides occasional expensive operations (hash table resize spikes).
DS Selection for Low-Latency¶
| Requirement | DS Choice | Why |
|---|---|---|
| Fast key lookup | Open-addressing hash map | No pointer chasing, cache-friendly |
| Sorted data | B-tree (not BST) | Better cache locality, fewer levels |
| Queue with priority | D-ary heap (d=4 or 8) | Fewer cache misses than binary heap |
| Rate limiting | Token bucket (circular buffer) | O(1) operations, fixed memory |
| Time-based expiry | Timer wheel (hierarchical) | O(1) insert/cancel/expire |
| Deduplication | Bloom filter | O(1) check, minimal memory |
| Approximate counting | Count-Min Sketch | O(1) increment/query, fixed memory |
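As one example from the table, a token-bucket rate limiter stays O(1) per request with fixed memory. The sketch below uses a counter-based refill rather than a circular buffer, and is single-threaded for clarity:

package main

import (
    "fmt"
    "time"
)

// TokenBucket allows bursts up to capacity and refills at rate tokens per
// second. Allow is O(1) and the struct uses a fixed amount of memory.
type TokenBucket struct {
    capacity   float64
    tokens     float64
    rate       float64 // tokens added per second
    lastRefill time.Time
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
    return &TokenBucket{capacity: capacity, tokens: capacity, rate: rate, lastRefill: time.Now()}
}

func (b *TokenBucket) Allow() bool {
    // Refill based on elapsed time, capped at capacity.
    now := time.Now()
    b.tokens += now.Sub(b.lastRefill).Seconds() * b.rate
    if b.tokens > b.capacity {
        b.tokens = b.capacity
    }
    b.lastRefill = now
    if b.tokens >= 1 {
        b.tokens--
        return true
    }
    return false
}

func main() {
    bucket := NewTokenBucket(3, 1) // burst of 3, refill 1 token/sec
    for i := 0; i < 5; i++ {
        fmt.Printf("request %d allowed: %v\n", i+1, bucket.Allow())
    }
}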
Avoiding Latency Spikes¶
| DS | Spike Source | Mitigation |
|---|---|---|
| Hash table | Resize (copy all elements) | Pre-allocate, incremental resize |
| Dynamic array | Resize (copy all elements) | Pre-allocate capacity |
| BST (unbalanced) | Degenerate to O(n) | Use self-balancing (Red-Black, AVL) |
| Garbage-collected DS | GC pause | Use arena allocation, object pools |
| Linked structures | Memory fragmentation | Use slab allocation |
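The pre-allocation mitigations in the first two rows are the easiest to apply. A short Go sketch (the size hint is illustrative):

package main

import "fmt"

func main() {
    // Without capacity hints, both the map and the slice re-grow (and copy)
    // several times as they fill, causing occasional slow inserts on the hot
    // path. Pre-sizing moves that cost to startup.
    const expected = 100000

    m := make(map[int]string, expected) // size hint avoids rehash spikes
    s := make([]int, 0, expected)       // capacity avoids repeated copies

    for i := 0; i < expected; i++ {
        m[i] = "value"
        s = append(s, i)
    }
    fmt.Println(len(m), len(s), cap(s)) // cap(s) == expected: no growth occurred
}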
Trade-offs at Scale¶
Read-Optimized vs Write-Optimized¶
| Dimension | Read-Optimized | Write-Optimized |
|---|---|---|
| DS | B+ tree, sorted array, hash table | LSM tree, append-only log |
| Read | O(log n) or O(1) | O(log n) * levels (read amplification) |
| Write | O(log n) with random I/O | O(1) sequential append |
| Space | 1x data + index overhead | 2-10x due to compaction |
| Use case | OLTP databases, caches | Time-series DBs, event stores |
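A toy Go sketch of the write-optimized column: writes land in an in-memory memtable and are flushed to immutable sorted runs, while reads check newest-first (compaction and the on-disk format are omitted):

package main

import (
    "fmt"
    "sort"
)

type kv struct{ key, value string }

// lsm is a toy LSM tree: Put is an O(1) memtable insert, a full memtable is
// flushed to a sorted immutable run, and Get checks the memtable and then
// each run from newest to oldest (read amplification).
type lsm struct {
    memtable map[string]string
    runs     [][]kv // each run is sorted by key
    limit    int
}

func newLSM(limit int) *lsm {
    return &lsm{memtable: map[string]string{}, limit: limit}
}

func (l *lsm) Put(key, value string) {
    l.memtable[key] = value
    if len(l.memtable) >= l.limit {
        l.flush()
    }
}

func (l *lsm) flush() {
    run := make([]kv, 0, len(l.memtable))
    for k, v := range l.memtable {
        run = append(run, kv{k, v})
    }
    sort.Slice(run, func(i, j int) bool { return run[i].key < run[j].key })
    l.runs = append(l.runs, run)
    l.memtable = map[string]string{}
}

func (l *lsm) Get(key string) (string, bool) {
    if v, ok := l.memtable[key]; ok {
        return v, true
    }
    for i := len(l.runs) - 1; i >= 0; i-- { // newest run first
        run := l.runs[i]
        j := sort.Search(len(run), func(j int) bool { return run[j].key >= key })
        if j < len(run) && run[j].key == key {
            return run[j].value, true
        }
    }
    return "", false
}

func main() {
    db := newLSM(2)
    db.Put("a", "1")
    db.Put("b", "2") // triggers a flush to a sorted run
    db.Put("a", "3") // newer value shadows the flushed one
    v, _ := db.Get("a")
    fmt.Println(v) // 3
}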
Mutable vs Immutable¶
| Dimension | Mutable DS | Immutable DS |
|---|---|---|
| Update | In-place modification | Create new version, keep old |
| Concurrency | Needs locks or CAS | Lock-free (readers never blocked) |
| Memory | One copy | Multiple versions (higher memory) |
| Crash recovery | Complex (WAL needed) | Simple (old version always valid) |
| Examples | B+ tree, hash table | Persistent tree, append-only log |
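A small Go copy-on-write sketch of the immutable column: writers publish a fresh snapshot via an atomic swap, so readers never block and never observe a half-applied update.

package main

import (
    "fmt"
    "sync/atomic"
)

// Config is published as an immutable snapshot: writers build a new copy and
// swap it in atomically; in-flight readers keep using the old version.
type Config map[string]string

func main() {
    var current atomic.Value
    current.Store(Config{"timeout": "1s"})

    // Reader: loads whatever snapshot is current, always a complete version.
    read := func(key string) string {
        cfg := current.Load().(Config)
        return cfg[key]
    }

    // Writer: copy-on-write, so the old snapshot stays valid for readers.
    update := func(key, value string) {
        old := current.Load().(Config)
        next := make(Config, len(old)+1)
        for k, v := range old {
            next[k] = v
        }
        next[key] = value
        current.Store(next)
    }

    fmt.Println(read("timeout")) // 1s
    update("timeout", "500ms")
    fmt.Println(read("timeout")) // 500ms
}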
Memory vs Disk¶
| Dimension | In-Memory DS | On-Disk DS |
|---|---|---|
| Latency | Nanoseconds | Microseconds-milliseconds |
| Capacity | GBs (expensive) | TBs (cheap) |
| DS design | Optimize for CPU cache | Optimize for disk page size |
| Examples | Hash maps, skip lists, heaps | B+ trees, LSM trees, SSTables |
Concurrent Data Structures¶
Lock-Free vs Lock-Based¶
| Approach | Mechanism | Pros | Cons |
|---|---|---|---|
| Mutex | One thread at a time | Simple, correct | Contention, priority inversion |
| Read-Write Lock | Multiple readers OR one writer | Better read throughput | Write starvation possible |
| Lock-Free (CAS) | Compare-and-swap atomic ops | No blocking, no deadlock | Complex, ABA problem |
| Wait-Free | Every operation completes in bounded steps | Strongest guarantee | Very hard to implement |
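A minimal compare-and-swap retry loop in Go shows the lock-free pattern from the table; a lock-free counter is the simplest case, and real lock-free structures are considerably harder.

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var counter int64
    var wg sync.WaitGroup

    // Each goroutine increments via a CAS retry loop: read the current value,
    // try to swap in current+1, and retry if another goroutine won the race.
    // No goroutine ever blocks while holding a lock.
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                for {
                    cur := atomic.LoadInt64(&counter)
                    if atomic.CompareAndSwapInt64(&counter, cur, cur+1) {
                        break
                    }
                    // CAS failed: the value changed underneath us; retry.
                }
            }
        }()
    }
    wg.Wait()
    fmt.Println(counter) // always 8000
}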
Common Concurrent DS¶
| DS | Concurrent Version | Used In |
|---|---|---|
| Hash Map | ConcurrentHashMap (Java), sync.Map (Go) | Caches, routing tables |
| Queue | Lock-free MPSC/MPMC queues | Message passing, work stealing |
| Skip List | ConcurrentSkipListMap (Java) | Sorted concurrent access |
| Counter | Atomic integer, LongAdder (Java) | Metrics, counters |
| Ring Buffer | Disruptor pattern (LMAX) | Ultra-low-latency event processing |
Code Examples¶
Concurrent-Safe Cache with Sharding¶
Go:
package main
import (
"fmt"
"hash/fnv"
"sync"
)
const numShards = 16
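// ShardedCache splits keys across independently locked shards so that
// operations on different shards never contend on the same mutex.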
type ShardedCache struct {
shards [numShards]struct {
sync.RWMutex
data map[string]string
}
}
func NewShardedCache() *ShardedCache {
c := &ShardedCache{}
for i := 0; i < numShards; i++ {
c.shards[i].data = make(map[string]string)
}
return c
}
func (c *ShardedCache) getShard(key string) int {
h := fnv.New32a()
h.Write([]byte(key))
return int(h.Sum32()) % numShards
}
func (c *ShardedCache) Set(key, value string) {
shard := c.getShard(key)
c.shards[shard].Lock()
c.shards[shard].data[key] = value
c.shards[shard].Unlock()
}
func (c *ShardedCache) Get(key string) (string, bool) {
shard := c.getShard(key)
c.shards[shard].RLock()
val, ok := c.shards[shard].data[key]
c.shards[shard].RUnlock()
return val, ok
}
func main() {
cache := NewShardedCache()
cache.Set("user:1", "Alice")
cache.Set("user:2", "Bob")
if val, ok := cache.Get("user:1"); ok {
fmt.Println("Found:", val)
}
}
Java:
import java.util.concurrent.ConcurrentHashMap;
public class ConcurrentCache {
private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
public void set(String key, String value) {
cache.put(key, value);
}
public String get(String key) {
return cache.get(key);
}
public String getOrCompute(String key, java.util.function.Function<String, String> loader) {
return cache.computeIfAbsent(key, loader);
}
public static void main(String[] args) {
ConcurrentCache cache = new ConcurrentCache();
cache.set("user:1", "Alice");
cache.set("user:2", "Bob");
// Thread-safe get-or-compute
String value = cache.getOrCompute("user:3", key -> {
// Simulate expensive computation
return "Carol (computed)";
});
System.out.println("user:3 = " + value);
}
}
Python:
import threading
from collections import OrderedDict
class ThreadSafeLRUCache:
def __init__(self, capacity):
self.capacity = capacity
self.cache = OrderedDict()
self.lock = threading.Lock()
def get(self, key):
with self.lock:
if key not in self.cache:
return None
self.cache.move_to_end(key)
return self.cache[key]
def put(self, key, value):
with self.lock:
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
if len(self.cache) > self.capacity:
self.cache.popitem(last=False)
cache = ThreadSafeLRUCache(capacity=3)
cache.put("user:1", "Alice")
cache.put("user:2", "Bob")
cache.put("user:3", "Carol")
cache.put("user:4", "Dave") # Evicts "user:1"
print(cache.get("user:1")) # None (evicted)
print(cache.get("user:2")) # Bob
Timer Wheel for Scheduled Expiration¶
Go:
package main
import (
"fmt"
"time"
)
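// TimerWheel buckets scheduled keys into fixed slots; each Tick advances
// the wheel by one slot and returns whatever expires in it.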
type TimerWheel struct {
slots [][]string
numSlots int
current int
interval time.Duration
}
func NewTimerWheel(numSlots int, interval time.Duration) *TimerWheel {
slots := make([][]string, numSlots)
for i := range slots {
slots[i] = []string{}
}
return &TimerWheel{slots: slots, numSlots: numSlots, interval: interval}
}
func (tw *TimerWheel) Schedule(key string, delay time.Duration) {
ticks := int(delay / tw.interval)
slot := (tw.current + ticks) % tw.numSlots
tw.slots[slot] = append(tw.slots[slot], key)
}
func (tw *TimerWheel) Tick() []string {
tw.current = (tw.current + 1) % tw.numSlots
expired := tw.slots[tw.current]
tw.slots[tw.current] = []string{}
return expired
}
func main() {
wheel := NewTimerWheel(60, time.Second)
wheel.Schedule("session:abc", 5*time.Second)
wheel.Schedule("session:def", 5*time.Second)
wheel.Schedule("session:ghi", 10*time.Second)
for i := 0; i < 15; i++ {
expired := wheel.Tick()
if len(expired) > 0 {
fmt.Printf("Tick %d: expired %v\n", i+1, expired)
}
}
}
Java:
import java.util.*;
public class TimerWheel {
private final List<String>[] slots;
private final int numSlots;
private int current = 0;
@SuppressWarnings("unchecked")
public TimerWheel(int numSlots) {
this.numSlots = numSlots;
this.slots = new ArrayList[numSlots];
for (int i = 0; i < numSlots; i++) {
slots[i] = new ArrayList<>();
}
}
public void schedule(String key, int delayTicks) {
int slot = (current + delayTicks) % numSlots;
slots[slot].add(key);
}
public List<String> tick() {
current = (current + 1) % numSlots;
List<String> expired = slots[current];
slots[current] = new ArrayList<>();
return expired;
}
public static void main(String[] args) {
TimerWheel wheel = new TimerWheel(60);
wheel.schedule("session:abc", 5);
wheel.schedule("session:def", 5);
wheel.schedule("session:ghi", 10);
for (int i = 0; i < 15; i++) {
List<String> expired = wheel.tick();
if (!expired.isEmpty()) {
System.out.printf("Tick %d: expired %s%n", i + 1, expired);
}
}
}
}
Python:
class TimerWheel:
def __init__(self, num_slots):
self.num_slots = num_slots
self.slots = [[] for _ in range(num_slots)]
self.current = 0
def schedule(self, key, delay_ticks):
slot = (self.current + delay_ticks) % self.num_slots
self.slots[slot].append(key)
def tick(self):
self.current = (self.current + 1) % self.num_slots
expired = self.slots[self.current]
self.slots[self.current] = []
return expired
wheel = TimerWheel(60)
wheel.schedule("session:abc", 5)
wheel.schedule("session:def", 5)
wheel.schedule("session:ghi", 10)
for i in range(15):
expired = wheel.tick()
if expired:
print(f"Tick {i + 1}: expired {expired}")
Failure Modes¶
| DS in Production | Failure Mode | Impact | Mitigation |
|---|---|---|---|
| Hash table cache | Hash collision attack | O(n) lookups, DoS | Use SipHash or randomized hash |
| B+ tree index | Index bloat after deletions | Slow queries, wasted disk | REINDEX periodically |
| Bloom filter | False positive rate too high | Unnecessary work | Size filter for target FP rate |
| LSM tree | Write amplification | Disk I/O spikes during compaction | Leveled compaction, rate limiting |
| Skip list | Memory fragmentation | Increased latency over time | Memory pool / arena allocator |
| Concurrent hash map | Hot shard | One shard gets all traffic | Better hash function, more shards |
| Timer wheel | Slot overflow | Missed expirations | Hierarchical timer wheel |
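For the Bloom filter row, sizing for a target false-positive rate p with n items follows the standard formulas m = -n * ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash functions; a small calculator sketch in Go:

package main

import (
    "fmt"
    "math"
)

// bloomSize returns the bit-array size (m) and number of hash functions (k)
// needed to hold n items at false-positive rate p, using the standard
// formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2.
func bloomSize(n, p float64) (bits, hashes int) {
    m := -n * math.Log(p) / (math.Ln2 * math.Ln2)
    k := (m / n) * math.Ln2
    return int(math.Ceil(m)), int(math.Round(k))
}

func main() {
    for _, p := range []float64{0.01, 0.001} {
        bits, hashes := bloomSize(1e6, p)
        fmt.Printf("1M items, p=%.3f: %.1f MB, %d hash functions\n",
            p, float64(bits)/8/1e6, hashes)
    }
}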
Summary¶
| Concept | Key Takeaway |
|---|---|
| Redis | A data structure server — choosing the right Redis type = choosing the right DS |
| Database indexes | B+ trees enable O(log n) queries on billions of rows |
| Message queues | Append-only logs achieve millions of writes/sec with sequential I/O |
| Distributed DS | Consistent hashing, CRDTs, LSM trees power distributed databases |
| Low-latency DS | Open-addressing hash maps, D-ary heaps, timer wheels minimize cache misses |
| Trade-offs at scale | Read-optimized vs write-optimized, mutable vs immutable, memory vs disk |
| Concurrent DS | Sharding, lock-free algorithms, and CAS operations enable safe parallelism |
| Failure modes | Every production DS has failure modes — plan for them |
Remember: In production, the data structure IS the system. Redis is a skip list. PostgreSQL is a B+ tree. Kafka is an append-only log. Choose wisely.