Quickselect and the k-th Order Statistic — Mathematical Foundations¶
Table of Contents¶
- Formal Definition
- Correctness Proof
- Expected O(n) Proof — Geometric Series
- Expected O(n) Proof — Indicator Variables
- Exact Expected Comparison Count for the Median
- High-Probability Bound
- Median-of-Medians: Worst-Case O(n) Proof
- Worked Example: Median-of-Medians Step by Step
- Variance and the Distribution of Running Time
- Selection as a Building Block
- Lower Bounds for Selection
- Las Vegas vs Monte Carlo: Classifying Quickselect
- Adversary Argument for the Min-and-Max Bound
- Information-Theoretic View
- Parallel and External-Memory Complexity
- Comparison with Alternatives
- Summary
- Appendix: One-Line Proof Summary
Formal Definition¶
Problem (Selection / k-th order statistic):
Input: array A[0..n-1] of n elements from a totally ordered set,
and an integer k with 0 <= k <= n-1.
Output: the element x = A_(k) such that exactly k elements of A are < x
(the k-th smallest, 0-based; k=0 is the minimum, k=n-1 the maximum).
RANDOMIZED-SELECT(A, lo, hi, k): // k is a GLOBAL 0-based index into the sort
if lo == hi:
return A[lo]
q = RANDOMIZED-PARTITION(A, lo, hi) // pivot lands at final index q
if k == q:
return A[q]
else if k < q:
return RANDOMIZED-SELECT(A, lo, q - 1, k) // recurse LEFT only
else:
return RANDOMIZED-SELECT(A, q + 1, hi, k) // recurse RIGHT only
RANDOMIZED-PARTITION(A, lo, hi):
i = RANDOM(lo, hi)
exchange A[i] with A[hi]
return PARTITION(A, lo, hi) // Lomuto, pivot = A[hi]
CLRS Chapter 9 reference. The structural distinction from QUICKSORT is exactly one deleted recursive call.
Correctness Proof¶
We rely on the partition postcondition and prove selection by induction on the subarray size m = hi - lo + 1.
Partition postcondition (Lomuto)¶
After PARTITION(A, lo, hi) returns q:
A[lo..q-1]all<= A[q],A[q+1..hi]all>= A[q],A[q]occupies its final sorted position withinA[lo..hi]; i.e. its rank within the subarray isq - lo.
(The loop-invariant proof is identical to Quick Sort's and is omitted; see the partition correctness in the quick-sort topic.)
Selection correctness (induction on m)¶
Claim. RANDOMIZED-SELECT(A, lo, hi, k) returns the element whose global sorted rank is k, for every lo <= k <= hi.
Base case (m = 1): lo == hi. The single element has rank lo = k, and we return it. ✓
Inductive step: Assume the claim for all subarrays of size < m. Run partition; pivot lands at q, with rank exactly q. Three cases:
k == q: the pivot is the element of rankk. ReturnA[q]. ✓k < q: every element of rank< qlies inA[lo..q-1](by the postcondition, all are<= A[q]and there are exactlyq - loof them, occupying rankslo..q-1). The target rankkis among them, so the element of rankkis the same inA[lo..q-1]. That subarray has sizeq - lo < m; by IH the recursive call returns it. ✓k > q: symmetric — the target lies inA[q+1..hi], sizehi - q < m; IH applies. ✓
The recursion strictly decreases m, so it terminates. By induction the returned element has global rank k. QED.
Expected O(n) Proof — Geometric Series¶
This is the intuitive proof; the next section gives the rigorous indicator-variable version.
Definition of a "good" pivot¶
Call a partition balanced (or the pivot good) if q - lo lies in the middle half of the subarray, i.e. the pivot's rank is between the 25th and 75th percentile. Equivalently, the surviving side has size at most 3/4 of the current size.
A uniformly random pivot is good with probability 1/2 (it must land in the middle half of m positions: positions m/4 to 3m/4, which is m/2 of the m positions).
Expected number of partitions to halve the size¶
Let the current subarray have size m. The number of partitions until we get one good pivot is geometrically distributed with success probability 1/2, so the expected number of partitions before a good split is 2. After a good split the size drops to at most 3m/4.
Summing the work¶
Let T(n) be the expected number of comparisons. Each partition over a subarray of size m costs m - 1 < m comparisons. Group the work into phases: phase j is the span during which the subarray size lies in [(3/4)^{j+1} n, (3/4)^j n). By the argument above, the expected number of partitions per phase is at most 2, and each costs at most (3/4)^j n. Hence:
E[T(n)] ≤ Σ_{j ≥ 0} (expected partitions in phase j) · (max cost per partition in phase j)
≤ Σ_{j ≥ 0} 2 · (3/4)^j n
= 2n · Σ_{j ≥ 0} (3/4)^j
= 2n · 1/(1 - 3/4)
= 2n · 4
= 8n
= O(n). QED
The constant 8 is loose; the tight expected comparison count for the median is about 3.4n and the exact recurrence-based bound is 4n. The essential point is the geometric series converges because we recurse into only one (shrinking) side. Quick Sort, recursing into both sides, has no such telescoping: every level re-incurs the full n, giving Θ(n log n).
The clean recurrence form¶
Conditioning on a good split every (expected) constant number of steps yields the textbook upper-bound recurrence:
By the Master Theorem with a = 1, b = 4/3, f(n) = Θ(n): since log_b a = 0 < 1, we are in Case 3, and T(n) = Θ(f(n)) = Θ(n).
Expected O(n) Proof — Indicator Variables¶
The rigorous CLRS argument. We bound expected comparisons exactly, with no "good pivot" hand-waving.
Let the elements in sorted order be z_1 < z_2 < ... < z_n. Define the indicator
Total comparisons X = Σ_{i<j} X_{ij}, so E[X] = Σ_{i<j} Pr[z_i and z_j compared] (linearity of expectation).
Key observation. Two elements z_i, z_j are compared only if one of them is chosen as a pivot while both are still in the active subarray, and crucially, comparisons only happen against a pivot. In selection (unlike sorting) we also discard whole sides, which can only reduce the chance two elements ever meet — so the selection bound is no worse than the sort bound. For the element we seek (rank k), define for the pair (z_i, z_j):
Pr[z_i and z_j are compared] ≤ Pr[the first pivot chosen from {z_i, ..., z_j} is z_i or z_j].
Among the j - i + 1 elements in that contiguous rank range, each is equally likely to be the first picked as a pivot (random pivots), so
Therefore (this already bounds it by Quick Sort's count, O(n log n); the selection-specific refinement below gives O(n)):
A tighter, selection-aware accounting fixes the target rank k and sums by distance from k. CLRS shows the expected comparisons for RANDOMIZED-SELECT satisfy the recurrence
Bounding max(q, n-1-q) and summing the harmonic-like series gives a closed form linear in n:
Inductive verification that E[T(n)] ≤ 4n: assume E[T(m)] ≤ 4m for m < n. The averaged recurrence's sum is dominated by the larger side, max(q, n-1-q), which ranges over ⌈n/2⌉ .. n-1 as q varies. Substituting the hypothesis and evaluating the arithmetic series:
E[T(n)] ≤ (2/n) Σ_{m=⌊n/2⌋}^{n-1} 4m + O(n)
≤ (8/n)·(3n²/8) + O(n)
= 3n + O(n)
≤ 4n for n large enough. QED
So E[T(n)] = O(n), with the leading constant provably bounded — expected linear time, no assumption on the input distribution, only on the algorithm's internal randomness.
Exact Expected Comparison Count for the Median¶
The O(n) bound hides a constant. For randomized selection of the median, the exact leading constant can be derived and is a standard advanced exercise.
Let C(n, k) be the expected number of comparisons to select rank k from n distinct random elements. A partition over n elements costs n - 1 comparisons and lands the pivot uniformly at any rank q ∈ {0, ..., n-1}. Conditioning on q:
Solving this recurrence (Knuth, TAOCP Vol. 3, §5.3.3) gives a closed form in terms of harmonic numbers. For selecting rank k (using H_m = Σ_{i=1}^m 1/i):
Evaluating at the median k ≈ n/2 and using H_n ≈ ln n:
So the median costs about 3.39n comparisons in expectation — markedly below the 1.39 n log₂ n of a full Quick Sort, and the source of the "~3.4n" figure quoted at the middle level. Selecting ranks near the ends is cheaper: k = 0 (the minimum) costs exactly n - 1, and the constant interpolates between ~2n (extreme ranks) and ~3.39n (median) as k moves inward. The maximum expected cost over all k is at the median, which is why "find the median" is the canonical hardest case for selection.
High-Probability Bound¶
Expectation alone permits rare slow runs. A concentration bound shows slow runs are exponentially unlikely.
Theorem. For any constant c, randomized Quickselect runs in O(n) time with probability at least 1 - n^{-c}.
Proof sketch. Track the size of the active subarray. A partition is "successful" if it shrinks the subarray to at most 3/4 of its size (probability ≥ 1/2, independently each step). After t partitions, the number of successes S is a sum of independent indicators with E[S] ≥ t/2. To reduce size from n to 1 we need at most log_{4/3} n = O(log n) successes. Choosing t = Θ(log n) and applying a Chernoff bound:
With E[S] = Θ(log n) and a constant δ, this is ≤ exp(-Θ(log n)) = n^{-c}. So with probability ≥ 1 - n^{-c} we accumulate enough successes within O(log n) partitions; total work is then Σ of geometrically shrinking partition costs = O(n). QED.
Practical consequence: the O(n²) tail of Quickselect has probability vanishing polynomially in n for random pivots — but it is nonzero, which is exactly why production code adds the median-of-medians fallback (introselect) to make the bound deterministic.
Median-of-Medians: Worst-Case O(n) Proof¶
The BFPRT algorithm (Blum, Floyd, Pratt, Rivest, Tarjan, 1973) selects in O(n) worst case, deterministically — no randomness.
SELECT(A, k):
1. If n <= 5: sort A directly and return A[k].
2. Partition A into ⌈n/5⌉ groups of 5 (the last may be smaller).
3. Sort each group; let m_i = median of group i. Let M = {m_1, ..., m_{⌈n/5⌉}}.
4. x = SELECT(M, ⌈|M|/2⌉) # median of the medians (RECURSION 1)
5. Partition A around the value x; let q = rank of x.
6. if k == q: return x
else if k < q: return SELECT(A[lo..q-1], k) (RECURSION 2)
else: return SELECT(A[q+1..hi], k - (q+1)) (RECURSION 2)
The pivot-quality lemma (why neither side exceeds 7n/10)¶
Let x be the median of medians. Arrange the ⌈n/5⌉ groups conceptually as columns of 5, sorted within each column, with the columns ordered by their medians.
- Consider the groups whose median is
≤ x. There are at least⌈ (1/2)⌈n/5⌉ ⌉ ≈ n/10of them (x is the median of the medians). - In each such group, the median and the two elements below it are all
≤ x— that's 3 elements per group. - So at least
3 · (n/10) = 3n/10elements are≤ x.
By the symmetric argument, at least 3n/10 elements are ≥ x. Therefore the number of elements strictly greater than x is at most n - 3n/10 = 7n/10, and likewise the number strictly less is at most 7n/10.
Conclusion: whichever side recursion 2 takes, it has at most 7n/10 elements.
(More careful bookkeeping for small n and partial groups gives 7n/10 + 6; the +6 is absorbed into the analysis for n above a small threshold.)
The recurrence and its solution¶
- Step 3 (sort
⌈n/5⌉groups of 5):O(n)total (each group is constant work). - Step 4 (median of medians on
n/5elements):T(n/5). - Steps 5–6 (partition + recurse one side
≤ 7n/10):O(n) + T(7n/10).
Solve by substitution. Guess T(n) ≤ c·n. Assume it for sizes < n:
T(n) ≤ c·(n/5) + c·(7n/10) + a·n
= c·n·(1/5 + 7/10) + a·n
= c·n·(2/10 + 7/10) + a·n
= c·n·(9/10) + a·n
= c·n - c·n/10 + a·n.
This is ≤ c·n provided a·n ≤ c·n/10, i.e. c ≥ 10a. Such a c exists, so:
Why the group size is 5¶
The recurrence works iff the two fractions sum to less than 1: 1/5 + 7/10 = 9/10 < 1.
- Groups of 3: each group contributes only its median and 1 element below →
2 · (n/6) = n/3elements≤ x, so the recurse side can be up to2n/3. The recurrence becomesT(n) ≤ T(n/3) + T(2n/3) + O(n), whose fractions sum to1→ solves toO(n log n), not linear. Too weak. - Groups of 5:
1/5 + 7/10 = 9/10 < 1→ linear. The smallest odd group size that works. - Groups of 7, 9, ...: also linear, with smaller recursion fraction but larger per-group sort cost; 5 is the standard sweet spot.
The large constant (c ≥ 10a) is why BFPRT is impractical as the default pivot and is used as the introselect fallback instead.
Worked Example: Median-of-Medians Step by Step¶
To make the 7n/10 lemma concrete, take 15 elements (3 groups of 5):
A = [12, 3, 5, 7, 19, 1, 8, 4, 11, 2, 9, 14, 6, 10, 13]
Step 1 — groups of 5:
G1 = [12, 3, 5, 7, 19]
G2 = [1, 8, 4, 11, 2]
G3 = [9, 14, 6, 10, 13]
Step 2 — sort each group, take the median (3rd element):
G1 sorted = [3, 5, 7, 12, 19] -> median 7
G2 sorted = [1, 2, 4, 8, 11] -> median 4
G3 sorted = [6, 9, 10, 13, 14] -> median 10
M = [7, 4, 10]
Step 3 — median of medians = SELECT(M, 1) = 7. (x = 7)
Step 4 — guarantee check:
Groups with median <= x=7: G1 (med 7) and G2 (med 4).
In G1: the median 7 and the two below it (3,5) are <= 7 -> 3 elements.
In G2: the median 4 and the two below it (1,2) are <= 4 <= 7 -> 3 elements.
So at least 6 elements are <= 7 (≈ 3·(#groups with median ≤ x)).
Symmetrically at least 6 elements are >= 7.
=> Whichever side we recurse into has at most 15 - 6 = 9 ≈ 7n/10 elements.
The bound "at least 3 elements per qualifying group, and at least half the groups qualify" is exactly the 3n/10 below / 3n/10 above guarantee that caps each recursive side at 7n/10. With more groups the fractions converge to the clean 3/10 and 7/10 used in the recurrence.
Variance and the Distribution of Running Time¶
Expectation is not the whole story; a senior-level analysis asks about the distribution of comparisons.
For randomized Quickselect finding the median, the number of comparisons C_n satisfies:
The Θ(n) standard deviation means individual runs can deviate from the mean by a constant fraction — runtime is "spread out," which is why a single Quickselect call has visibly variable latency. This is in contrast to, say, merge sort whose comparison count is nearly deterministic. Two consequences:
- Tail latency matters. A service that calls Quickselect once per request sees a wider latency distribution than one that sorts; budget the p99 accordingly, or use introselect to clamp the tail.
- Floyd–Rivest reduces both the mean (toward 1.5n) and the variance (by selecting tight bracketing pivots from a sample), which is why it is preferred for large, latency-sensitive selection.
The high-probability bound from the previous section says the time complexity class (O(n)) is stable with probability 1 - n^{-c}; the variance result here describes the constant-factor spread within that class.
Selection as a Building Block¶
Linear-time selection is a primitive that makes other algorithms provably efficient:
| Algorithm | How selection is used | Resulting bound |
|---|---|---|
| Balanced Quicksort | Pick the exact median as pivot via BFPRT | Worst-case O(n log n) (deterministic) |
| Median-of-medians k-d tree build | Median split at each level | O(n log n) construction, balanced tree |
| Weighted median / 1-D facility location | Select around accumulated weight | O(n) optimal placement |
| Linear-time order-statistics on the fly | Repeated partial selection | Avoids full sort |
| Linear programming in fixed dimension | Megiddo's prune-and-search uses median splits | O(n) for constant dimension |
The deterministic median pivot is the theoretical bridge that turns Quicksort's expected O(n log n) into a worst-case O(n log n): partition around the exact median (found in O(n) by BFPRT) and both sides are exactly n/2, giving the clean T(n) = 2T(n/2) + O(n) = O(n log n) with no quadratic tail. In practice the BFPRT constant makes this slower than randomized Quicksort, so it is a theoretical guarantee rather than a production default — but it proves that O(n log n) worst-case comparison sorting is achievable without heaps.
Lower Bounds for Selection¶
How few comparisons can any comparison-based selection algorithm use?
Trivial lower bound: Ω(n)¶
Every element must be examined at least once: if an algorithm never inspects A[i], an adversary can set A[i] to be the answer (e.g. the true k-th element), making the output wrong. So any correct selection needs Ω(n) comparisons. Quickselect (expected) and BFPRT (worst case) both match this bound — selection is Θ(n).
Exact comparison complexity (finer bounds)¶
The constant in front of n has been studied intensely:
| Problem | Lower bound | Best known upper bound |
|---|---|---|
| Minimum (k=0) | n - 1 (tight) | n - 1 |
| Min and max | ⌈3n/2⌉ - 2 (tight) | ⌈3n/2⌉ - 2 |
| Median (worst case) | ≥ (2 + ε)n (Dor–Zwick) | ≤ 2.95n (Dor–Zwick) |
| Median (BFPRT) | — | ~5.43n |
| Median (expected, randomized) | 1.5n (lower) | ≈ 3.4n (Floyd–Rivest expected 1.5n + o(n)) |
Two notable results:
- Min-and-max in
⌈3n/2⌉-2: process elements in pairs (1 comparison each), then route the smaller toward the running min and larger toward the running max (2more comparisons per pair) →3comparisons per2elements =3n/2. Provably optimal. - Floyd–Rivest achieves expected
1.5n + n^{1/2+o(1)}comparisons for the median — close to the1.5nlower bound — by sampling a small subset to pick two pivots that bracket the target tightly. It is faster than median-of-medians in practice but probabilistic.
Selection is strictly easier than sorting¶
Sorting requires Ω(n log n) comparisons (decision-tree / log(n!) argument). Selection requires only Θ(n). The separation is fundamental: ordering all elements carries Θ(n log n) bits of information (which permutation), whereas naming one element of known rank carries only Θ(n) work. This is the formal justification for "if you need one rank, never sort."
Las Vegas vs Monte Carlo: Classifying Quickselect¶
Randomized algorithms split into two families, and it matters which one Quickselect is.
- Las Vegas: always returns the correct answer; the running time is the random variable. (Example: randomized Quickselect — output is always the true k-th element; only the time varies.)
- Monte Carlo: runs in bounded time but may return a wrong answer with small probability. (Example: Miller–Rabin primality, a random sampling estimate of a quantile.)
Quickselect is a Las Vegas algorithm. This is a crucial guarantee: no matter how unlucky the pivots are, it never returns the wrong order statistic — it only takes longer. The randomness is purely an efficiency device, not a correctness device. The correctness proof in this document makes no probabilistic assumption; it holds for every possible sequence of pivot choices.
You can convert any Las Vegas algorithm with expected time T into a Monte Carlo algorithm with worst-case time O(T) by truncation (stop after c·T steps and report failure), and vice versa by repetition. Introselect is essentially a derandomization-on-demand: it runs the Las Vegas fast path but, rather than risk a long tail, deterministically switches to BFPRT to make the time bound hard while keeping correctness exact.
Quickselect: always correct, expected O(n), tail O(n²) [Las Vegas]
Median-of-medians: always correct, worst-case O(n) [Deterministic]
Introselect: always correct, worst-case O(n) [Las Vegas + deterministic fallback]
Quantile sketch: approximate answer, O(n) one pass [Monte Carlo-ish, bounded error]
Adversary Argument for the Min-and-Max Bound¶
The ⌈3n/2⌉ - 2 bound for finding both the minimum and maximum deserves its adversary proof, since it is a clean example of an information-theoretic lower bound for selection.
Assign each element one of four labels based on what comparisons have revealed: N (never compared), L (has lost at least one comparison, so is not the max), W (has won at least one, so is not the min), WL (both won and lost — eliminated from both races). Initially all n elements are N.
To certify the answer, every element except the eventual min and max must reach WL (it must be known to be neither). Count the label-state "units" needed: an N must gain 2 units (one win, one loss); an L or W must gain 1. The adversary answers comparisons to give as few units as possible:
- Comparing two N elements yields at most 2 units (one becomes W, the other L) per comparison.
- Every other comparison yields at most 1 unit.
There are ⌈n/2⌉ N-vs-N comparisons available at best (pairing up). The total units required is 2n - 2 (two survivors need not be fully eliminated). So:
units needed: 2n - 2
max units from N-vs-N comparisons: 2·⌊n/2⌋
remaining units (1 each): (2n - 2) - 2·⌊n/2⌋
total comparisons ≥ ⌊n/2⌋ + [(2n-2) - 2⌊n/2⌋]
= ⌈3n/2⌉ - 2. QED
The matching algorithm (pair up, route smaller to the min-tournament, larger to the max-tournament) achieves exactly this, so the bound is tight. This is the canonical demonstration that even simple order statistics have a precise, provable comparison cost below which no algorithm can go.
Information-Theoretic View¶
Why is selection Θ(n) while sorting is Θ(n log n)? The decision-tree / information argument makes the gap precise.
A comparison-based algorithm is a binary decision tree: each internal node is a comparison with two outcomes, each leaf an answer. To distinguish among M possible answers the tree needs depth ≥ log2 M, i.e. at least log2 M comparisons in the worst case.
- Sorting: the answer is a permutation; there are
M = n!of them.log2(n!) = Θ(n log n)by Stirling, giving theΩ(n log n)sorting lower bound. - Selection: the answer is a single element (which of the
nis the k-th). NaivelyM = nsuggests onlylog2 ncomparisons — but that ignores the certification cost. To prove an element is the k-th, you must establish it is≥exactlykelements and≤the rest, which forces inspecting every element:Ω(n). The certification, not the identification, dominates.
So selection's Θ(n) is the cost of verifying the rank, and sorting's Θ(n log n) is the cost of identifying the full permutation. Selection carries far less information (one ranked element vs an entire ordering), and the comparison cost reflects exactly that information content. This is the formal underpinning of the practical maxim from the junior level: asking for less lets you do less.
Sorting: distinguish n! outcomes -> Ω(log n!) = Ω(n log n)
Selection: certify 1 element's rank -> Ω(n) [every element must be read]
Gap: a factor of Θ(log n)
Parallel and External-Memory Complexity¶
PRAM (parallel) model¶
| Algorithm | Work | Span (depth) |
|---|---|---|
| Sequential Quickselect | Θ(n) expected | Θ(n) |
| Parallel-partition Quickselect | Θ(n) expected | Θ(log² n) expected |
| Cole–Vishkin / optimal parallel selection | Θ(n) | Θ(log n log* n) |
The partition step parallelizes via parallel prefix sums (counting elements below the pivot, then scattering). Each level has O(log n) span from the prefix sum; the expected O(log n) levels give O(log² n) total span with O(n) work — work-optimal. Theoretically optimal selection achieves O(log n) span, but the constants make the simpler O(log² n) scheme the practical choice.
The fundamental obstacle: selection has an inherently sequential "spine" — you cannot start the second partition until the first reveals which side holds rank k. This is why selection parallelizes worse than sorting despite being asymptotically cheaper sequentially.
External-memory (I/O) model¶
With N elements, cache/block size B, memory size M:
Sequential scan (one pass): O(N/B) I/Os
Selection (Quickselect-style): O(N/B) I/Os expected
Sorting: Θ((N/B) log_{M/B}(N/B)) I/Os
Selection needs only O(N/B) I/Os — a constant number of passes — because each partition is a linear scan and the surviving side shrinks geometrically. Sorting needs the extra log_{M/B} merge factor. So in the external-memory model selection is asymptotically cheaper than sorting by the merge factor, mirroring the in-RAM n vs n log n gap. This is the formal reason large-scale "top-k" and "percentile" jobs avoid full external sorts.
Communication complexity (distributed)¶
For the exact median across p machines each holding n/p elements, the count-and-narrow protocol uses O(log(value range)) rounds, each broadcasting one candidate and summing p counts — total communication O(p · log(range)) words, independent of n. Exact distributed selection is therefore communication-cheap but round-bound; approximate sketches collapse it to a single round at the cost of bounded error, which is why production telemetry overwhelmingly uses sketches.
Comparison with Alternatives¶
| Method | Worst-case time | Expected time | Space | Deterministic? | Streaming? | Output |
|---|---|---|---|---|---|---|
| Randomized Quickselect | Θ(n²) | Θ(n) | O(1) | No | No | one rank |
| Median-of-medians (BFPRT) | Θ(n) | Θ(n) | O(log n) | Yes | No | one rank |
Introselect (nth_element) | Θ(n) | Θ(n) | O(log n) | hybrid | No | one rank |
| Floyd–Rivest | Θ(n) (whp) | Θ(n), ~1.5n cmp | O(1) | No | No | one rank |
| Heap (size k) | Θ(n log k) | Θ(n log k) | O(k) | Yes | Yes | top-k sorted |
| Full sort + index | Θ(n log n) | Θ(n log n) | O(1)–O(n) | Yes | No | full order |
| Counting/radix (bounded int keys) | Θ(n + U) | Θ(n + U) | O(U) | Yes | No | any rank |
Theoretical summary: selection is Θ(n); the only question is whether you want the expected-linear randomized version (small constants, possible rare blowup) or the worst-case-linear deterministic version (BFPRT, large constant). Introselect resolves the tension by combining them.
Summary¶
At the professional level, selection is fully characterized:
- Correctness follows by induction on subarray size from the partition postcondition; the single recursive branch is justified because all elements of rank
< qlie left of the pivot and all of rank> qlie right. - Expected O(n) has two proofs: the geometric series (
2n·Σ(3/4)^j = O(n), from recursing into one shrinking side, givingT(n) ≤ T(3n/4) + O(n)), and the rigorous indicator-variable argument boundingE[T(n)] ≤ 4n. - High-probability O(n) follows from a Chernoff bound; the bad tail vanishes polynomially in
n. - Worst-case O(n) is achieved deterministically by median-of-medians, whose pivot guarantees each side
≤ 7n/10, yieldingT(n) ≤ T(n/5) + T(7n/10) + O(n)with the critical sum1/5 + 7/10 = 9/10 < 1, solving toO(n). Group size 5 is the smallest that makes the fractions sum below 1. - Lower bounds: selection is
Θ(n)(every element must be read); min-and-max is exactly⌈3n/2⌉-2; median needs≥ (2+ε)nworst case. Selection is provably easier than sorting (Θ(n)vsΩ(n log n)).
The production synthesis is introselect: randomized Quickselect for average-case speed, median-of-medians fallback for the worst-case guarantee — matching the Θ(n) lower bound with practical constants.
Appendix: One-Line Proof Summary¶
For quick recall, the entire complexity story compressed:
Correctness: induction on subarray size + partition postcondition.
Expected O(n): one-side recursion -> T(n) ≤ T(3n/4) + O(n) -> Master Thm Case 3.
(rigorously, E[T(n)] ≤ 4n by indicator variables / substitution.)
Median const: ≈ 3.39n comparisons (harmonic-sum recurrence).
High-prob: Chernoff on "good splits" -> O(n) w.p. 1 - n^{-c}.
Worst O(n): median-of-medians: each side ≤ 7n/10,
T(n) ≤ T(n/5) + T(7n/10) + O(n), 9/10 < 1 -> O(n).
Group size 5: smallest making the two recursion fractions sum below 1.
Lower bound: Ω(n) (must read every element); selection is Θ(n).
Min & max: exactly ⌈3n/2⌉ - 2 (adversary / unit-counting argument).
Sorting gap: sorting Ω(n log n) (log n! decision tree); selection is Θ(log n) cheaper.
Class: Las Vegas — always correct; running time is the random variable.
I/O model: O(N/B) I/Os (a few passes) vs sorting's Θ((N/B) log_{M/B}(N/B)).
If you can reconstruct the argument behind each line, you understand selection at the professional level. The through-line is a single structural fact — recursing into one side instead of two — whose consequences ripple through every model: sequential (n vs n log n), external-memory (N/B vs (N/B) log), and information-theoretic (certify one rank vs identify a permutation).
Next step: continue to interview.md for tiered Quickselect interview questions and a 3-language coding challenge.