Introduction to Data Structures & Algorithms — Middle Level¶

Table of Contents¶

Introduction
Why DSA Matters Beyond Performance
The Engineer's Decision Framework
Reading the Access Pattern
Trade-Off Axes Every Structure Sits On
How Algorithmic Strategies Map to Problems
Asymptotic vs Real-World Cost
Choosing a Structure — Worked Examples
Composing Data Structures
When DSA Goes Wrong in Production
Standard Library vs Hand-Rolled
Code Examples — Same Problem, Stronger Solutions
How to Practice at the Middle Level
Summary

Introduction¶

Focus: "Why does the choice matter?" and "When do I reach for which tool?"

The junior file answered what is DSA. The middle level is about judgement: given a vague requirement and a real codebase, how do you pick the right data structure and algorithm, justify the choice, and recognize the trade-offs you accepted? Middle engineers do not memorize structures — they read access patterns and derive the structure.

This document covers the decision framework, the axes you trade along, the gap between asymptotic and wall-clock performance, common composite structures, and the production failure modes that come from a wrong choice.

Why DSA Matters Beyond Performance¶

Performance is the obvious reason. But once you are past the junior stage, three more reasons start to dominate:

1. Data structure choice constrains future evolution¶

Once a hot path uses a particular structure, everything around it inherits that structure's assumptions. Choose a list when you need a set, and every caller has to defensively deduplicate forever. Choose a synchronous queue when you needed a pub/sub, and every new feature has to bolt on filtering. The right structure pays you back for years; the wrong one taxes every future change.

2. The structure encodes the invariant¶

A set "cannot contain duplicates" is a guarantee enforced by the structure itself — no caller can violate it. A min-heap "always returns the smallest pending item" is enforced by the structure. Picking the right structure means the invariant is unbreakable, instead of being a comment that some future commit will silently invalidate.

3. Communication and design vocabulary¶

"This is a write-behind cache with LRU eviction on top of an LSM-tree-backed store" is unambiguous and takes five seconds to say. The same sentence in code-level terms ("a buffer that flushes asynchronously...") takes a paragraph and is ambiguous. DSA is the shared dialect of system design.

The Engineer's Decision Framework¶

A middle engineer applies the same checklist on every non-trivial design:

Characterize the data.
Size (10? 10⁶? 10¹²?)
Mutability (read-only? frequent updates?)
Distribution (uniform? skewed? streaming?)
Domain (integers in a small range? arbitrary strings?)
Characterize the operations.
What is the ratio of reads to writes?
Which operations must be O(1) or O(log n)?
Which can be amortized? Which can be batched?
Are queries point queries, range queries, or aggregates?
Do you need stable ordering, or is "some order" acceptable?
Characterize the constraints.
Memory budget — RAM, cache, disk?
Latency SLA — p50, p99, p99.9?
Concurrency — readers/writers, single-threaded, distributed?
Persistence — in-memory only, write-through, write-back?
Failure model — must we survive crash mid-update?
Match the structure to the answers. This is mechanical once steps 1–3 are honest.
Estimate the resulting Big-O and back-of-envelope cost. Verify the operation budget fits the latency SLA before writing code.

The mistake of a junior engineer is to skip directly to step 4. The mistake of a senior who skipped this discipline is to skip steps 1–3 and pick the structure they used last week.

Reading the Access Pattern¶

The "access pattern" is the most useful concept at the middle level. It is the answer to: "Across all operations my code will perform, what are the dominant ones, and what shape do they have?"

Common access patterns¶

Pattern	Description	Default structure
Random-access by index	"Give me the i-th element"	Array
Insertion / removal at known position	Cursor-based edits, log replay	Linked list, gap buffer
LIFO	Last-in is first-out	Stack
FIFO	First-in is first-out	Queue / deque
Membership	"Is X present?" — order irrelevant	Hash set
Key → value lookup	"Get the record for ID 42"	Hash map
Ordered iteration / range scan	"All entries between A and Z"	Tree map / sorted array
Top-K / priority	"Give me the smallest/largest pending item"	Heap
Prefix search	"All words starting with `cat`"	Trie
Graph traversal	"Reach all nodes connected to X"	Adjacency list + BFS/DFS
Range query / aggregate	"Sum/min/max over an interval"	Prefix sums, segment tree
Sliding window	A moving range over a sequence	Deque, monotonic queue
Streaming / unbounded	Data arrives forever, cannot store all	Bloom filter, reservoir sample, sketch

A single application usually has more than one access pattern, which is why structures are routinely composed (see Composing Data Structures).

Trade-Off Axes Every Structure Sits On¶

Every data structure can be described as a point on a small number of axes. There is no free lunch: gaining on one axis costs on another.

Axis	Two ends	Example
Time vs space	Fewer ops vs less memory	Hash table (more space, faster ops) vs sorted array (less space, slower inserts)
Read vs write	Fast reads vs fast writes	B+ tree (fast read, slow write) vs LSM tree (slow read, fast write)
Worst case vs average case	Predictable vs fast-on-average	Balanced BST (O(log n) worst) vs hash table (O(1) avg, O(n) worst)
Mutable vs immutable	In-place edits vs persistent	Array vs persistent tree
Ordered vs unordered	Iteration order matters or not	Tree map vs hash map
Stable vs unstable	Equal keys keep original order	Merge sort (stable) vs quicksort (unstable)
Cache-friendly vs pointer-chasing	Contiguous vs scattered	Array vs linked list
Single-thread vs concurrent	Lock-free / locked / mutex-free	Plain hash map vs `sync.Map` / `ConcurrentHashMap`
Approximate vs exact	Saves space at the cost of error	Bloom filter, HyperLogLog vs hash set

A middle engineer can articulate which axis they prioritized and which they sacrificed for any design choice.

How Algorithmic Strategies Map to Problems¶

Algorithms group into a small number of families. Each family has a question that selects it.

Family	The question it answers	Typical signal
Brute force	"Can I just try everything?"	Tiny input, prototype
Two pointers	"Can I traverse from both ends?"	Sorted array, pair-finding
Sliding window	"Can I incrementally update a range as it moves?"	Contiguous subarray, substring with property
Binary search	"Can I halve the search space each step?"	Sorted / monotonic property
Greedy	"Does the local best lead to the global best?"	Activity selection, Huffman, Dijkstra
Divide and conquer	"Can I split, solve, merge?"	Mergesort, FFT, closest pair
Dynamic programming	"Do subproblems overlap?"	Counting paths, edit distance, knapsack
Backtracking	"Can I prune dead branches early?"	N-queens, Sudoku, generation
BFS / DFS	"Am I exploring a graph?"	Reachability, shortest hops
Topological sort	"Are there dependencies?"	Build order, course schedule
Union-Find	"Am I merging components incrementally?"	Connected components, Kruskal
Hashing	"Can I trade space for O(1) lookup?"	Membership, frequency, dedup
Randomization	"Will an expected bound suffice?"	Quickselect, randomized quicksort, MinHash
Approximation	"Can I sacrifice exactness for speed/space?"	Bloom filter, HyperLogLog, LSH

Middle-level practice is partly about training your pattern recognition so that the right family announces itself when you read the problem.

Asymptotic vs Real-World Cost¶

Big-O is necessary, but it is not sufficient. At the middle level you start caring about the constants.

1. Cache locality matters more than asymptotics for small n¶

For arrays of fewer than ~10⁴ elements, a linear scan over a contiguous slice beats a binary tree lookup, because the array fits in L1/L2 cache and the tree does not.

Hardware reality: a single L1 cache hit is ~1 ns; an L3 miss to RAM is ~100 ns; an SSD read is ~100 µs. A "fast" algorithm that pointer-chases through RAM is 100× slower than an "asymptotically worse" one that walks contiguous memory.

2. Hidden constants in language libraries¶

Operation	Big-O	Realistic constant
Go `map[K]V` lookup	O(1)	~30 ns (hashing + bucket walk)
Java `HashMap` lookup	O(1)	~30–50 ns
Python `dict` lookup	O(1)	~50 ns
Slice `append` (Go, amortized)	O(1) amortized	spikes to O(n) on grow
Python `list.insert(0, x)`	O(n)	a hidden O(n) on the "fast" language's list

The middle engineer asks: "What is the actual cost, not the textbook cost?"

3. Worst case vs typical case¶

A hash table is O(1) average but O(n) worst case under adversarial input. For an internal cache that does not see attacker-controlled keys, the average case is what you ship. For a public API that accepts user-controlled keys, you must consider the worst case — which is why many languages (Python, Go, Java) randomize hash seeds at startup.

4. Amortized cost¶

A dynamic array's push is O(1) amortized but each individual push may be O(n) when it grows the backing buffer. For batch work this is fine. For a real-time system with strict per-operation latency, "amortized O(1)" can still violate the SLA on a single bad operation. The right tool there is a structure with worst-case bounds, e.g. a deque with fixed-size chunks.

Choosing a Structure — Worked Examples¶

Example 1: Rate limiter — last 60 seconds of request timestamps¶

Operations:

record(now): append a timestamp.
count(): how many timestamps are within the last 60 s?

Access pattern: append-on-one-end + expire-from-the-other-end + count. This is a sliding window.

Wrong: a plain list (O(n) to drop stale entries). Right: a deque. Append on the right; pop from the left while the left entry is older than now - 60. Then len(deque) is the answer. Each operation is O(1) amortized.

Example 2: Spell checker¶

Operations:

is_word(s): is s in the dictionary?
suggestions(prefix): list all words starting with prefix.

Access pattern: membership + prefix lookup.

Wrong: a hash set (no prefix support). Right: a trie. Membership is O(|s|). Prefix lookup walks down the trie to the prefix node and DFS-collects the subtree.

Example 3: Job scheduler¶

Operations:

submit(job, priority).
nextJob(): returns the highest-priority pending job.

Access pattern: insert anywhere + extract maximum.

Wrong: a sorted list (O(n) insert). Right: a max-heap. Both ops are O(log n).

Example 4: Online median (data arrives one at a time, query median anytime)¶

Wrong: re-sort the whole list every time you want the median (O(n log n) per query). Right: two heaps — a max-heap for the lower half, a min-heap for the upper half, kept balanced. Insert O(log n), query O(1).

Example 5: Detect duplicate URLs in a 100 GB log on a 16 GB machine¶

Constraint: the full set cannot fit in RAM.

Wrong: a hash set (will OOM). Right: a Bloom filter for the membership check, accepting some false positives, with a slower exact check on the suspicious subset. If you must be exact, an external-memory hash or sort-merge dedup.

These five examples illustrate the entire middle-level skill: read the access pattern, list the constraints, pick the structure whose strengths match.

Composing Data Structures¶

Real systems almost never use one structure in isolation. The middle level introduces composite structures.

LRU cache = hash map + doubly linked list¶

Hash map gives O(1) key lookup.
Doubly linked list gives O(1) move-to-front and evict-from-back.
Together: O(1) get and put, with bounded size and least-recently-used eviction.

Median data stream = max-heap + min-heap¶

Already shown above. Two heaps cover two access patterns the language's standard heap cannot do alone.

Index on a database = B+ tree + bloom filter (often)¶

B+ tree provides ordered, paged on-disk access.
An optional Bloom filter per page short-circuits the "definitely not here" case without a disk read.

Inverted index = hash map + posting list¶

Hash map: term → posting list of document IDs.
Posting list: usually a sorted array, sometimes compressed (variable-byte, Roaring bitmaps).
Together: full-text search.

Counter with top-K = hash map + heap¶

Hash map for counts.
Heap (size K) for the running top-K.

The lesson: when no single structure satisfies all your access patterns, layer them.

When DSA Goes Wrong in Production¶

Middle engineers learn DSA partly from the literature and partly from the production incidents listed here. These failures recur.

Failure	Root cause	Fix
Quadratic blow-up on adversarial input	Naïve nested loop on user-supplied data	Hashing, sort + sweep, or input cap
Hash table degenerates to O(n) under attack	Predictable hash seed enables collision attack	Randomized hash seed; treap; capped chain length
Memory exhaustion in long-running process	Unbounded cache	TTL, LRU/LFU, max-size eviction
Recursion stack overflow	Deep recursion on user data (e.g. JSON nesting)	Iterative version with explicit stack
Thundering herd	Many threads recompute the same expired cache entry	Probabilistic early expiration; single-flight
Lock contention on a popular map	Single-mutex map under concurrency	Striped locks; `sync.Map` / `ConcurrentHashMap`
Slow `O(n²)` join on growing tables	A pair-wise loop instead of a hash join	Hash the smaller side; sort-merge join
Latency spike from amortized resize	Dynamic-array growth at a bad moment	Pre-size; or use a worst-case bounded deque
GC pressure from short-lived nodes	Linked-list of tiny objects	Switch to slice/array; pool objects
False sharing on a per-CPU counter	Adjacent atomic counters on the same cache line	Pad to cache-line size

Each is a DSA problem in disguise.

Standard Library vs Hand-Rolled¶

A general rule for middle engineers:

Use the standard library by default. Hand-roll a structure only when (a) the standard library does not provide it, (b) you have measured and the standard library version is the bottleneck, or (c) the educational value is the point.

Library choices by language¶

Need	Go	Java	Python
Dynamic array	`[]T` (slice)	`ArrayList<T>`	`list`
Linked list	`container/list`	`LinkedList<T>`	`collections.deque`
Stack	slice with `append` / `pop`	`Deque<T>` (`ArrayDeque`)	`list` or `deque`
Queue	`chan T` or slice	`Queue<T>` (`ArrayDeque`)	`collections.deque`
Deque	slice or `container/list`	`ArrayDeque<T>`	`collections.deque`
Hash map	`map[K]V`	`HashMap<K, V>`	`dict`
Hash set	`map[K]struct{}`	`HashSet<T>`	`set`
Sorted map	(none — use sorted slice + `sort.Search` or third-party tree)	`TreeMap<K, V>`	`sortedcontainers.SortedDict`
Priority queue / heap	`container/heap`	`PriorityQueue<T>`	`heapq`
LRU cache	third-party (e.g. `hashicorp/golang-lru`)	`LinkedHashMap` with `removeEldestEntry`	`functools.lru_cache` / `collections.OrderedDict`

Knowing what your standard library provides is itself a DSA skill.

Code Examples — Same Problem, Stronger Solutions¶

A classic middle-level problem: longest substring without repeating characters. Three solutions illustrate the progression of thinking.

Naïve — try every substring (O(n³))¶

Go¶

package main

import "fmt"

func longestUnique(s string) int {
    best := 0
    for i := 0; i < len(s); i++ {
        for j := i + 1; j <= len(s); j++ {
            seen := map[byte]bool{}
            ok := true
            for k := i; k < j; k++ {
                if seen[s[k]] {
                    ok = false
                    break
                }
                seen[s[k]] = true
            }
            if ok && j-i > best {
                best = j - i
            }
        }
    }
    return best
}

func main() { fmt.Println(longestUnique("abcabcbb")) }

Java¶

public class LongestUniqueNaive {
    public static int longestUnique(String s) {
        int best = 0;
        for (int i = 0; i < s.length(); i++) {
            for (int j = i + 1; j <= s.length(); j++) {
                boolean[] seen = new boolean[128];
                boolean ok = true;
                for (int k = i; k < j; k++) {
                    if (seen[s.charAt(k)]) { ok = false; break; }
                    seen[s.charAt(k)] = true;
                }
                if (ok) best = Math.max(best, j - i);
            }
        }
        return best;
    }
    public static void main(String[] args) {
        System.out.println(longestUnique("abcabcbb"));
    }
}

Python¶

def longest_unique_naive(s):
    best = 0
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            if len(set(s[i:j])) == j - i:
                best = max(best, j - i)
    return best

if __name__ == "__main__":
    print(longest_unique_naive("abcabcbb"))

Better — sliding window with a hash set (O(n))¶

Go¶

package main

import "fmt"

func longestUniqueWindow(s string) int {
    seen := map[byte]bool{}
    left, best := 0, 0
    for right := 0; right < len(s); right++ {
        for seen[s[right]] {
            delete(seen, s[left])
            left++
        }
        seen[s[right]] = true
        if right-left+1 > best {
            best = right - left + 1
        }
    }
    return best
}

func main() { fmt.Println(longestUniqueWindow("abcabcbb")) }

Java¶

import java.util.HashSet;

public class LongestUniqueWindow {
    public static int longestUnique(String s) {
        HashSet<Character> seen = new HashSet<>();
        int left = 0, best = 0;
        for (int right = 0; right < s.length(); right++) {
            while (seen.contains(s.charAt(right))) {
                seen.remove(s.charAt(left++));
            }
            seen.add(s.charAt(right));
            best = Math.max(best, right - left + 1);
        }
        return best;
    }
    public static void main(String[] args) {
        System.out.println(longestUnique("abcabcbb"));
    }
}

Python¶

def longest_unique_window(s):
    seen = set()
    left = best = 0
    for right, ch in enumerate(s):
        while ch in seen:
            seen.remove(s[left])
            left += 1
        seen.add(ch)
        best = max(best, right - left + 1)
    return best

if __name__ == "__main__":
    print(longest_unique_window("abcabcbb"))

Strongest — sliding window with a `lastSeen` map (still O(n), fewer rewinds)¶

Go¶

package main

import "fmt"

func longestUniqueFast(s string) int {
    last := make(map[byte]int)
    left, best := 0, 0
    for right := 0; right < len(s); right++ {
        if idx, ok := last[s[right]]; ok && idx >= left {
            left = idx + 1
        }
        last[s[right]] = right
        if right-left+1 > best {
            best = right - left + 1
        }
    }
    return best
}

func main() { fmt.Println(longestUniqueFast("abcabcbb")) }

Java¶

import java.util.HashMap;

public class LongestUniqueFast {
    public static int longestUnique(String s) {
        HashMap<Character, Integer> last = new HashMap<>();
        int left = 0, best = 0;
        for (int right = 0; right < s.length(); right++) {
            char c = s.charAt(right);
            if (last.containsKey(c) && last.get(c) >= left) {
                left = last.get(c) + 1;
            }
            last.put(c, right);
            best = Math.max(best, right - left + 1);
        }
        return best;
    }
    public static void main(String[] args) {
        System.out.println(longestUnique("abcabcbb"));
    }
}

Python¶

def longest_unique_fast(s):
    last = {}
    left = best = 0
    for right, ch in enumerate(s):
        if ch in last and last[ch] >= left:
            left = last[ch] + 1
        last[ch] = right
        best = max(best, right - left + 1)
    return best

if __name__ == "__main__":
    print(longest_unique_fast("abcabcbb"))

Version	Time	Space	Pattern used
Naïve	O(n³)	O(n)	None
Sliding window + set	O(n)	O(n)	Two pointers + hash set
Sliding window + map	O(n) (fewer rewinds)	O(n)	Two pointers + last-seen map

The improvement from version 2 to version 3 is a middle-level move: same Big-O, smaller constant, achieved by storing more useful information.

How to Practice at the Middle Level¶

The junior file said "implement, do not just read". The middle file extends that:

Implement, then optimize. Solve a problem with the obvious approach; then re-solve it asking "where is the wasted work?". The second pass is where you learn pattern recognition.
State invariants before writing code. What property of the data structure must always hold? If you cannot state it, your loop conditions will be wrong.
Practice naming the access pattern. Before solving any problem, write one sentence describing its access pattern. "This is a range-sum problem with point updates." If the sentence is sharp, the structure follows.
Read production source code. The Java HashMap, Go runtime/map.go, Python dictobject.c, Redis dict.c — each one is a master class in DSA in the wild. Reading these teaches you what real hash tables look like, not the textbook diagram.
Time your solutions. Add a microbenchmark to every non-trivial implementation. The gap between your model of "fast" and the measured truth is where you learn the most.
Reproduce failures. When a colleague reports "the cache is slow", write the smallest reproduction. Half the time the bug is a DSA mistake (unbounded growth, accidental quadratic, lock contention).
Vary the inputs. Every solution should be tested on: empty, single element, all-duplicates, already-sorted, reverse-sorted, large random, adversarially-crafted. This is what hardens your code beyond toy examples.

Summary¶

At the middle level, DSA is no longer about memorizing structures — it is about reading the access pattern of a problem and deriving the structure from it. You learn to articulate trade-offs (read vs write, time vs space, worst vs average, exact vs approximate), to spot when a single structure does not suffice and structures must be composed (LRU cache, median-of-stream), and to recognize the production failure modes that come from a wrong choice.

Asymptotic complexity is necessary but not sufficient: at this level you also reason about cache locality, hidden constants, amortized vs worst-case bounds, and adversarial inputs. The standard library is your default — hand-rolling is reserved for the rare case where it is justified.

The next file, senior.md, takes the same problems and embeds them in distributed systems, concurrency, and observability. The level after that, professional.md, proves the bounds formally.

Introduction to Data Structures & Algorithms — Middle Level¶

Table of Contents¶

Introduction¶

Why DSA Matters Beyond Performance¶

1. Data structure choice constrains future evolution¶

2. The structure encodes the invariant¶

3. Communication and design vocabulary¶

The Engineer's Decision Framework¶

Reading the Access Pattern¶

Common access patterns¶

Trade-Off Axes Every Structure Sits On¶

How Algorithmic Strategies Map to Problems¶

Asymptotic vs Real-World Cost¶

1. Cache locality matters more than asymptotics for small n¶

2. Hidden constants in language libraries¶

3. Worst case vs typical case¶

4. Amortized cost¶

Choosing a Structure — Worked Examples¶

Example 1: Rate limiter — last 60 seconds of request timestamps¶

Example 2: Spell checker¶

Example 3: Job scheduler¶

Example 4: Online median (data arrives one at a time, query median anytime)¶

Example 5: Detect duplicate URLs in a 100 GB log on a 16 GB machine¶

Composing Data Structures¶

LRU cache = hash map + doubly linked list¶

Median data stream = max-heap + min-heap¶

Index on a database = B+ tree + bloom filter (often)¶

Inverted index = hash map + posting list¶

Counter with top-K = hash map + heap¶

When DSA Goes Wrong in Production¶

Standard Library vs Hand-Rolled¶

Library choices by language¶

Code Examples — Same Problem, Stronger Solutions¶

Naïve — try every substring (O(n³))¶

Go¶

Java¶

Python¶

Better — sliding window with a hash set (O(n))¶

Go¶

Java¶

Python¶

Strongest — sliding window with a lastSeen map (still O(n), fewer rewinds)¶

Go¶

Java¶

Python¶

How to Practice at the Middle Level¶

Summary¶

Strongest — sliding window with a `lastSeen` map (still O(n), fewer rewinds)¶