Fast Fourier Transform — Mathematical Foundations and Complexity Theory¶

Table of Contents¶

Formal Definitions
Roots of Unity: The Halving and Cancellation Lemmas
The Discrete Fourier Transform and Its O(n²) Cost (Vandermonde / circulant view)
The Convolution Theorem (Proof) and the Aliasing Identity
Cooley-Tukey Correctness (Derivation + worked n=4)
The Recurrence T(n) = 2T(n/2) + O(n) = O(n log n)
The Inverse Transform (Proof + worked n=4)
Bit-Reversal and the Iterative Algorithm (equivalence proof)
The Number-Theoretic Transform over a Finite Field (worked p=17, CRT)
Numerical Error Analysis (unitarity, coefficient splitting)
Applications: String Matching, Signal Processing, Big Integers, Series
Lower Bounds and Related Transforms (Bluestein, comparison table)
Summary
Common Proof Pitfalls

1. Formal Definitions¶

Let n be a positive integer (we will take n a power of two), and work over the complex field ℂ.

Definition 1.1 (Polynomial, coefficient form). A polynomial of degree < n is A(x) = Σ_{j=0}^{n-1} a_j x^j, represented by its coefficient vector a = (a₀, …, a_{n-1}) ∈ ℂⁿ.

Definition 1.2 (Point-value form). Given distinct x₀, …, x_{n-1}, the point-value representation is {(x_k, A(x_k))}_{k=0}^{n-1}. A degree-< n polynomial is uniquely determined by its values at n distinct points (Lagrange interpolation; equivalently, the Vandermonde matrix is invertible when the x_k are distinct).

Definition 1.3 (n-th root of unity). A complex ω is an n-th root of unity if ωⁿ = 1. There are exactly n of them, given by e^{2πik/n} for k = 0, …, n−1. The principal n-th root of unity is

ω_n = e^{2πi/n} = cos(2π/n) + i·sin(2π/n).

ω_n is primitive: its powers ω_n^0, …, ω_n^{n-1} are all n distinct roots.

Definition 1.4 (Discrete Fourier Transform). The DFT of a ∈ ℂⁿ is the vector â = DFT_n(a) ∈ ℂⁿ with

â_k = A(ω_n^k) = Σ_{j=0}^{n-1} a_j · ω_n^{jk},      k = 0, …, n−1.

Equivalently â = F a, where F is the DFT matrix with F[k][j] = ω_n^{jk}.

Definition 1.5 (Convolution). The (linear) convolution of a ∈ ℂ^m and b ∈ ℂ^ℓ is c = a ∗ b ∈ ℂ^{m+ℓ-1} with

c_t = Σ_{j} a_j · b_{t-j}     (sum over valid j).

c is exactly the coefficient vector of the product polynomial A(x)·B(x). The cyclic convolution of two length-n vectors is (a ⊛ b)_t = Σ_j a_j b_{(t-j) mod n}.

Remark. The entire topic rests on one identity: convolution in the coefficient domain equals pointwise product in the DFT domain (Section 4). FFT makes both DFT directions cost O(n log n), so convolution costs O(n log n).

Definition 1.6 (Cyclic group of roots). The n-th roots of unity form a cyclic group μ_n ≅ ℤ/nℤ under multiplication, generated by ω_n. This group structure is what underlies the Halving and Cancellation lemmas of Section 2: ω_n^{n/2} is the unique element of order 2 (= −1), and the squaring map μ_n → μ_{n/2} is a 2-to-1 group homomorphism. Recognizing the algebra as "a cyclic group with a known generator and an order divisible by powers of two" is exactly the abstraction the NTT generalizes to ℤ_p^* (Section 9).

Notation. n is the transform length (a power of two), ω_n = e^{2πi/n}, ε = 2⁻⁵³ the machine epsilon, p a prime, g a primitive root mod p, and [P] the Iverson bracket (1 if P holds else 0). i is the imaginary unit; an index named i will be written idx where ambiguity could arise.

Reading guide. Sections 2–7 build the complex FFT from the ground up (lemmas → DFT → convolution theorem → Cooley-Tukey → recurrence → inverse). Section 8 covers the in-place iterative form. Section 9 generalizes everything to finite fields (the NTT). Sections 10–12 treat numerics, applications, and complexity context. Each major claim is a numbered Theorem/Lemma/Proposition with a proof, so the document doubles as a reference: cite "Lemma 2.1" for halving, "Theorem 4.1" for the convolution theorem, "Theorem 9.2" for NTT correctness.

2. Roots of Unity: The Halving and Cancellation Lemmas¶

These two lemmas are the algebraic engine of FFT.

Lemma 2.1 (Halving / Squaring Lemma). For n even and any integer k,

(ω_n^k)² = ω_{n/2}^k.

Proof. (ω_n^k)² = ω_n^{2k} = (e^{2πi/n})^{2k} = e^{2πi(2k)/n} = e^{2πik/(n/2)} = (e^{2πi/(n/2)})^k = ω_{n/2}^k. ∎

Consequence. Squaring the n distinct n-th roots produces the n/2 distinct (n/2)-th roots, each appearing exactly twice (ω_n^k and ω_n^{k+n/2} square to the same value). This is what lets the recursion reuse subproblem evaluation points. Without this 2-to-1 collapse, the two recursive subproblems would need n distinct points between them and the recursion would not shrink the work — the entire O(n log n) hinges on this lemma.

Lemma 2.2 (Cancellation Lemma). For n even and 0 ≤ k < n/2,

ω_n^{k + n/2} = −ω_n^k.

Proof. ω_n^{n/2} = e^{2πi(n/2)/n} = e^{πi} = −1, hence ω_n^{k+n/2} = ω_n^k · ω_n^{n/2} = −ω_n^k. ∎

Consequence. The two outputs â_k and â_{k+n/2} of a butterfly differ only in the sign of the ω_n^k·O_k term. Thus one complex multiplication (ω_n^k·O_k) and two additions produce two outputs — the source of the constant factor in the operation count and the reason the combine is Θ(n) rather than Θ(n log n) per level.

Lemma 2.3 (Summation / Orthogonality Lemma). For any integer d not divisible by n,

Σ_{k=0}^{n-1} ω_n^{dk} = 0,     and for d ≡ 0 (mod n) the sum is n.

Proof. If n | d every term is 1, summing to n. Otherwise ω_n^d ≠ 1, and the geometric series gives Σ_k (ω_n^d)^k = (ω_n^{dn} − 1)/(ω_n^d − 1) = (1 − 1)/(ω_n^d − 1) = 0, using ω_n^{dn} = (ω_n^n)^d = 1. ∎

Lemma 2.3 is precisely what makes the DFT matrix invertible and gives the clean inverse formula in Section 7.

Geometric reading. The roots ω_n^0, …, ω_n^{n-1} are n equally spaced points on the unit circle. The Summation Lemma says their centroid is the origin (they cancel by symmetry) unless the exponent multiple wraps fully around, in which case all points coincide at 1 and the sum is n. This vector-cancellation picture is the intuition behind both the inverse formula and the numerical stability (Section 10): no direction is privileged, so no error direction is amplified.

3. The Discrete Fourier Transform and Its O(n²) Cost¶

Computing â_k = Σ_j a_j ω_n^{jk} directly for one k is a length-n dot product: O(n) multiply-adds. Doing it for all n values of k is O(n²). Equivalently, the DFT is the matrix-vector product â = F a with F an n×n matrix; naive matrix-vector multiplication is Θ(n²).

Proposition 3.1. The naive DFT requires Θ(n²) scalar operations.

Horner's rule evaluates a single point in O(n), so n points is again O(n²). The whole point of FFT is to exploit the structure of F (Lemmas 2.1–2.2) to compute F a in O(n log n).

3.1 The DFT matrix as a Vandermonde matrix¶

F is the Vandermonde matrix on the nodes x_k = ω_n^k: F[k][j] = (x_k)^j = ω_n^{jk}. A general Vandermonde matrix on distinct nodes is invertible, which re-proves that point-value form determines a unique degree-<n polynomial (Definition 1.2). For arbitrary nodes, inverting Vandermonde costs O(n²) (or O(n log² n) with fast algorithms) and is numerically ill-conditioned. The roots of unity are special: F is (up to scaling) unitary — F* F = n·I by the Summation Lemma — so its inverse is just the scaled conjugate F^{-1} = (1/n)F*, with perfect conditioning (κ₂(F/√n) = 1). This unitarity is both why the inverse is cheap (Section 7) and why the complex FFT is numerically stable (Section 10).

Conditioning, precisely. The 2-norm condition number of the scaled DFT matrix is κ₂(F/√n) = σ_max/σ_min = 1 because all singular values of a unitary matrix equal 1. By contrast, a Vandermonde matrix on real nodes in [−1,1] has a condition number that grows exponentially in n. This is the quantitative reason FFT-based evaluation/interpolation is numerically safe while naive polynomial interpolation at arbitrary nodes is a numerical disaster — the roots of unity are the uniquely well-conditioned interpolation nodes on the unit circle.

3.2 Why the DFT diagonalizes circular convolution¶

The cyclic-convolution operator (convolving a fixed vector h against an input) is a circulant matrix C_h. Every circulant is diagonalized by the DFT: C_h = F^{-1} diag(ĥ) F. Hence convolving by h becomes, in the DFT basis, the diagonal (elementwise) multiplication by ĥ — this is the linear-algebra restatement of the Convolution Theorem (Section 4), and the reason "filtering = multiply in frequency domain" in signal processing.

4. The Convolution Theorem (Proof)¶

Theorem 4.1 (Convolution Theorem). Let a, b ∈ ℂⁿ (zero-padded so n ≥ deg(A)+deg(B)+1). Then the linear convolution satisfies

DFT_n(a ∗ b) = DFT_n(a) ⊙ DFT_n(b),

where ⊙ is the elementwise (Hadamard) product. Equivalently a ∗ b = DFT_n^{-1}( DFT_n(a) ⊙ DFT_n(b) ).

Proof. Let C(x) = A(x)·B(x). Evaluate at any point x = ω_n^k:

C(ω_n^k) = A(ω_n^k) · B(ω_n^k).

The left side is the k-th DFT coefficient of c = a ∗ b (since C has the convolution coefficients, and n was chosen large enough that no wrap-around occurs, so C has degree < n and its DFT is well-defined). The right side is â_k · b̂_k = (â ⊙ b̂)_k. Therefore ĉ_k = â_k · b̂_k for every k, i.e. DFT_n(c) = â ⊙ b̂. Applying the inverse DFT (Theorem 7.1) recovers c = a ∗ b. ∎

Caveat (cyclic vs linear). Without sufficient zero-padding the DFT computes the cyclic convolution a ⊛ b of length n, in which index arithmetic is mod n. Choosing n ≥ deg(A)+deg(B)+1 forces the cyclic and linear convolutions to coincide because the wrapped terms are all zero.

Why the theorem needs n ≥ deg(C)+1, not just n ≥ max(deg). The product C has degree deg A + deg B, so it has deg A + deg B + 1 coefficients. The DFT of length n can only represent a polynomial of degree < n; if n ≤ deg C, the representation is forced to wrap (cyclic), and ĉ_k = â_k b̂_k then equals the cyclic convolution's transform, not the linear one. The padding requirement is therefore not a heuristic but a tight necessary-and-sufficient condition for IDFT(â ⊙ b̂) = a ∗ b.

This theorem is the reason FFT multiplies polynomials in O(n log n): two forward DFTs (O(n log n)), one pointwise product (O(n)), one inverse DFT (O(n log n)).

Proposition 4.2 (Aliasing identity). For length-n vectors a, b, the cyclic convolution equals the linear convolution wrapped mod n:

(a ⊛ b)_t = Σ_{ℓ ≡ t (mod n)} (a ∗ b)_ℓ = (a∗b)_t + (a∗b)_{t+n} + … .

Proof. (a ⊛ b)_t = Σ_{j} a_j b_{(t-j) mod n}. Each pair (j, m) with j + m = t + qn for some integer q ≥ 0 contributes; grouping by q collects exactly the linear-convolution terms at indices t, t+n, t+2n, …. ∎

Consequence. If n ≥ deg(A)+deg(B)+1, the only nonzero linear-convolution index in each residue class is the one in [0, n), so the wrapped sum has a single term and cyclic = linear. Under-padding (small n) folds high-degree coefficients (a∗b)_{t+n} back onto (a∗b)_t — the aliasing bug. This is the formal statement of why "pad to a power of two ≥ len(a)+len(b)−1" is mandatory.

4.3 Algebraic properties used in testing¶

Convolution inherits the ring structure of polynomial multiplication, which gives free correctness checks:

Property	Statement	Test use
Commutative	`a ∗ b = b ∗ a`	swap inputs, compare
Associative	`(a ∗ b) ∗ c = a ∗ (b ∗ c)`	regroup, compare
Distributive	`a ∗ (b + c) = a∗b + a∗c`	linearity check
Identity	`a ∗ δ = a`, `δ = (1,0,0,…)`	impulse response
Shift	`a ∗ (shift by t) = shift of a by t`	translation check

These follow because the DFT is a ring isomorphism from (ℂ[x]/(x^n−1), +, ⊛) to (ℂⁿ, +, ⊙) — convolution becomes elementwise product, where these identities are obvious. Property-based testing (sibling property-based-testing) exploits exactly this.

5. Cooley-Tukey Correctness (Derivation)¶

Theorem 5.1 (Radix-2 Cooley-Tukey). Let n be even, split a into even- and odd-indexed subvectors

a^{even} = (a₀, a₂, …, a_{n-2}),     a^{odd} = (a₁, a₃, …, a_{n-1}),

and let E = DFT_{n/2}(a^{even}), O = DFT_{n/2}(a^{odd}). Then for 0 ≤ k < n/2:

â_k        = E_k + ω_n^k · O_k,
â_{k+n/2}  = E_k − ω_n^k · O_k.

Proof. Start from the definition and split by parity of j:

â_k = Σ_{j=0}^{n-1} a_j ω_n^{jk}
    = Σ_{m=0}^{n/2-1} a_{2m} ω_n^{2mk}  +  Σ_{m=0}^{n/2-1} a_{2m+1} ω_n^{(2m+1)k}
    = Σ_m a_{2m} (ω_n²)^{mk}  +  ω_n^k Σ_m a_{2m+1} (ω_n²)^{mk}.

By the Halving Lemma 2.1, ω_n² = ω_{n/2}, so (ω_n²)^{mk} = ω_{n/2}^{mk}. Hence the two sums are exactly the (n/2)-point DFTs of the even and odd subvectors evaluated at index k:

â_k = E_k + ω_n^k · O_k.        (★)

Now evaluate at k + n/2. Note E and O have period n/2 in their index (they are length-n/2 DFTs), so E_{k+n/2} = E_k and O_{k+n/2} = O_k. The twiddle factor changes by the Cancellation Lemma 2.2: ω_n^{k+n/2} = −ω_n^k. Substituting into (★) with k ← k + n/2:

â_{k+n/2} = E_{k+n/2} + ω_n^{k+n/2} · O_{k+n/2} = E_k − ω_n^k · O_k.

Thus the two stated formulas hold for all 0 ≤ k < n/2, covering all n outputs. ∎

The pair (★) and its sibling is the butterfly: from (E_k, O_k) and twiddle ω_n^k it produces two outputs with one multiply (ω_n^k·O_k) and two adds. There are n/2 butterflies per level.

5.1 Worked example (n = 4)¶

Take a = (a₀, a₁, a₂, a₃) and ω_4 = i. The even subvector is (a₀, a₂), the odd is (a₁, a₃). Their size-2 DFTs (with ω_2 = −1) are:

E₀ = a₀ + a₂,   E₁ = a₀ − a₂        (DFT of evens)
O₀ = a₁ + a₃,   O₁ = a₁ − a₃        (DFT of odds)

The size-4 butterflies (ω_4^0 = 1, ω_4^1 = i) give:

â₀ = E₀ + ω_4^0·O₀ = (a₀+a₂) + (a₁+a₃)
â₁ = E₁ + ω_4^1·O₁ = (a₀−a₂) + i(a₁−a₃)
â₂ = E₀ − ω_4^0·O₀ = (a₀+a₂) − (a₁+a₃)        (k=0, +n/2)
â₃ = E₁ − ω_4^1·O₁ = (a₀−a₂) − i(a₁−a₃)        (k=1, +n/2)

These match the direct definition â_k = Σ_j a_j i^{jk} term by term, confirming Theorem 5.1 concretely. Notice each of E₀, E₁, O₀, O₁ is computed once and reused in two output equations — the reuse is exactly the Θ(n)-per-level saving.

5.2 The twiddle-factor structure¶

The combine step multiplies O_k by the twiddle factor ω_n^k. Across the log₂ n levels these factors form a nested family: the level-s pass (block length 2^s) uses the 2^s-th roots ω_{2^s}^0, …, ω_{2^s}^{2^{s-1}-1}. A single precomputed table of the top-level roots ω_n^0 … ω_n^{n/2-1} suffices, since every level's roots are a subsampling of it (ω_{2^s} = ω_n^{n/2^s}). This is why production FFTs store one root table and stride into it.

6. The Recurrence T(n) = 2T(n/2) + O(n) = O(n log n)¶

Theorem 6.1. The Cooley-Tukey FFT runs in Θ(n log n) arithmetic operations.

Proof. Let T(n) be the operation count for a size-n FFT. Theorem 5.1 reduces a size-n DFT to two size-n/2 DFTs (cost 2T(n/2)) plus the combine: computing n/2 twiddle factors and n/2 butterflies, each O(1), for Θ(n) combine work. The splitting into even/odd subvectors is also Θ(n). Hence

T(n) = 2·T(n/2) + Θ(n),     T(1) = Θ(1).

By the Master Theorem (sibling 03-master-theorem) with a = 2, b = 2, f(n) = Θ(n): since f(n) = Θ(n^{log_b a}) = Θ(n^{log₂2}) = Θ(n), this is Case 2, giving T(n) = Θ(n log n). Equivalently, unrolling: there are log₂ n levels, each doing Θ(n) total butterfly work, so Θ(n log n). ∎

This matches merge sort (sibling 01-merge-sort), which has the identical recurrence. The savings over the Θ(n²) naive DFT is a factor of n / log n.

Recursion-tree accounting. Unrolling makes the bound concrete. The tree has log₂ n + 1 levels. At depth d there are 2^d subproblems each of size n/2^d, contributing 2^d · Θ(n/2^d) = Θ(n) combine work per level. Summing over the log₂ n levels with combine work (the leaves at depth log₂ n do Θ(1) each, total Θ(n)) gives Θ(n log n). The exact butterfly count is (n/2)·log₂ n complex multiplications and n·log₂ n complex additions — the standard operation count quoted for the radix-2 FFT. Split-radix FFT lowers the multiply count to roughly (n/3)·log₂ n by special-casing the ±1, ±i twiddles, the lowest known for power-of-two sizes.

6.1 Full operation count for n = 8¶

log₂ 8 = 3 stages. Each stage performs n/2 = 4 butterflies, each one complex multiply + two complex adds:

stage 1 (block 2): 4 butterflies, twiddles ω_2^0 = 1
stage 2 (block 4): 4 butterflies, twiddles ω_4^{0,1} = 1, i
stage 3 (block 8): 4 butterflies, twiddles ω_8^{0,1,2,3}
total: 3·4 = 12 complex multiplies, 3·8 = 24 complex adds.

The naive DFT would need 8² = 64 complex multiplies and 8·7 = 56 adds. Even at this tiny size the FFT already roughly halves the multiply count; the gap widens to the full n/log n factor as n grows. Note many of the stage-1 and stage-2 twiddles are 1 or i (no real multiply), which is exactly the structure split-radix exploits to reach ~(n/3)log n multiplies.

6.2 Subproblem reuse is the source of the speedup¶

The naive DFT recomputes, for each output k, an independent length-n sum — no sharing. Cooley-Tukey instead computes each half-transform value E_k, O_k once and feeds it into two outputs (â_k and â_{k+n/2}). Recursively, a leaf value contributes to all n outputs through log₂ n shared butterflies rather than n independent products. This shared-subexpression structure — identical in spirit to memoization in dynamic programming — is the precise reason the recurrence carries an O(n) combine instead of an O(n²) one.

7. The Inverse Transform (Proof)¶

Theorem 7.1 (Inverse DFT). The DFT matrix F with F[k][j] = ω_n^{jk} is invertible, with

(F^{-1})[j][k] = (1/n) · ω_n^{-jk}.

Consequently a_j = (1/n) Σ_{k=0}^{n-1} â_k · ω_n^{-jk}.

Proof. Compute the (j, ℓ) entry of F^{-1} F:

(F^{-1}F)[j][ℓ] = Σ_{k=0}^{n-1} (1/n) ω_n^{-jk} · ω_n^{kℓ} = (1/n) Σ_k ω_n^{k(ℓ-j)}.

By the Summation Lemma 2.3, the inner sum is n if ℓ ≡ j (mod n) and 0 otherwise. Since 0 ≤ j, ℓ < n, this is n·[j = ℓ]. Hence (F^{-1}F)[j][ℓ] = [j = ℓ], i.e. F^{-1}F = I. Therefore F^{-1} as stated is the inverse, and a = F^{-1} â gives the claimed formula. ∎

Corollary 7.2 (Inverse via forward FFT). Because (F^{-1})[j][k] = (1/n) ω_n^{-jk} = (1/n) \overline{ω_n^{jk}}, the inverse DFT is the same FFT algorithm run with ω_n replaced by its conjugate ω_n^{-1} = \overline{ω_n}, followed by dividing every entry by n. One code path with a sign flag handles both directions.

Corollary 7.3 (Conjugate trick). Since \overline{ω_n^{jk}} = ω_n^{-jk}, one can also implement the inverse using only a forward FFT and conjugation: a = (1/n)·\overline{FFT(\overline{â})}. This avoids a second twiddle table when memory is tight.

7.1 Worked inverse example (n = 4)¶

Continuing the n=4 example, suppose the forward transform produced â = (â₀, â₁, â₂, â₃). The inverse recovers a_j = (1/4) Σ_k â_k i^{-jk} with i^{-1} = −i:

a₀ = (1/4)(â₀ + â₁      + â₂      + â₃)
a₁ = (1/4)(â₀ − i·â₁    − â₂      + i·â₃)
a₂ = (1/4)(â₀ − â₁      + â₂      − â₃)
a₃ = (1/4)(â₀ + i·â₁    − â₂      − i·â₃)

Substituting the forward formulas from Section 5.1 collapses (by the Summation Lemma) back to the original (a₀, a₁, a₂, a₃) — the round-trip IFFT(FFT(a)) = a holds exactly in exact arithmetic. The factor 1/4 = 1/n is the only scaling; applying it twice (e.g. once per direction and once at the end) is the classic "result is n× off" bug.

8. Bit-Reversal and the Iterative Algorithm¶

Proposition 8.1 (Bit-reversal placement). In the recursive Cooley-Tukey decomposition, the input coefficient a_j is consumed at the leaf indexed by rev(j), the integer obtained by reversing the log₂ n bits of j.

Proof sketch. Each level of recursion partitions by the least-significant remaining bit: the even subvector keeps indices with that bit 0, the odd with bit 1. After log₂ n levels, the path of partition choices for a_j spells the bits of j from least to most significant, which is the binary expansion of j read in reverse — i.e. rev(j). ∎

Therefore the iterative algorithm first permutes a by j ↦ rev(j) (in O(n)), then performs log₂ n bottom-up passes. Pass with block length len = 2^s applies butterflies using the len-th roots ω_{len} within each contiguous block, identical to the recursive combine but in place. This eliminates recursion frames and per-level allocation, yielding the same Θ(n log n) with a smaller constant and better locality.

8.1 Bit-reversal table for n = 8¶

j      binary   rev(j) binary   rev(j)
0      000      000             0
1      001      100             4
2      010      010             2
3      011      110             6
4      100      001             1
5      101      101             5
6      110      011             3
7      111      111             7
permuted order: [0, 4, 2, 6, 1, 5, 3, 7]

After this reorder, pass 1 (block 2) butterflies (0,4),(2,6),(1,5),(3,7); pass 2 (block 4) butterflies within {0,4,2,6} and {1,5,3,7}; pass 3 (block 8) the whole array — exactly the leaves-up traversal of the recursion tree.

Proposition 8.2 (Iterative–recursive equivalence). The iterative bottom-up algorithm computes the identical output as the recursive Cooley-Tukey FFT.

Proof. By Proposition 8.1 the bit-reversal permutation places each input at the recursion leaf where the recursive algorithm consumes it. Pass s (block length 2^s) performs precisely the combine step that the recursive algorithm performs when returning from depth log₂ n − s to depth log₂ n − s − 1: it applies the butterfly â_k = E_k + ω_{2^s}^k O_k, â_{k+2^{s-1}} = E_k − ω_{2^s}^k O_k to each contiguous block, where E and O are the already-computed transforms of the lower half-blocks. Since both algorithms apply the same butterflies to the same operands in a topologically valid order (children before parents), they produce identical results. ∎

9. The Number-Theoretic Transform over a Finite Field¶

The DFT only needed an element ω of order n satisfying Lemmas 2.1–2.3. These hold in any field (or ring) containing a primitive n-th root of unity and an inverse of n. The field ℤ_p (integers mod a prime p) qualifies when n | (p − 1).

Definition 9.1 (Primitive root mod p). g is a primitive root mod p if its order is p − 1, i.e. {g^0, …, g^{p-2}} is all of ℤ_p^*. Then for any n | (p−1),

ω = g^{(p-1)/n} mod p

is a primitive n-th root of unity in ℤ_p: ωⁿ = g^{p-1} ≡ 1 (Fermat), and ω has order exactly n.

Theorem 9.2 (NTT correctness). With ω = g^{(p-1)/n} and n^{-1} the modular inverse of n mod p, the transform â_k = Σ_j a_j ω^{jk} mod p satisfies the convolution theorem and is inverted by a_j = n^{-1} Σ_k â_k ω^{-jk} mod p.

Proof. Lemma 2.1 (ω² = g^{2(p-1)/n} = g^{(p-1)/(n/2)} is the primitive (n/2)-th root) and Lemma 2.2 (ω^{n/2} = g^{(p-1)/2} ≡ −1 mod p, since g^{(p-1)/2} is the unique element of order 2) carry over verbatim, so the Cooley-Tukey derivation of Section 5 is valid over ℤ_p. For the inverse, the Summation Lemma 2.3 holds because for d ≢ 0 (mod n), ω^d ≠ 1 and the geometric sum Σ_k ω^{dk} = (ω^{dn} − 1)(ω^d − 1)^{-1} ≡ 0 (numerator ω^{dn} = 1); the inverse of ω^d − 1 exists in the field ℤ_p. The proof of Theorem 7.1 then goes through with 1/n replaced by n^{-1} mod p, and ω_n^{-1} by the modular inverse ω^{p-2}. Hence the NTT is an exact DFT over ℤ_p. ∎

Consequence. NTT convolution is exact: no rounding. Coefficients of a ∗ b are recovered exactly provided each true coefficient is < p; otherwise use several primes and reconstruct via the Chinese Remainder Theorem. NTT-friendly primes have the form p = c·2^k + 1 so that 2^k | (p − 1) permits transform lengths up to 2^k (e.g. 998244353 = 119·2²³ + 1, g = 3).

9.1 Small worked NTT example¶

Take the tiny prime p = 17 (= 1·2⁴ + 1, so 16 | (p−1)), primitive root g = 3, and length n = 4. The 4th root of unity is ω = g^{(p-1)/n} = 3^{16/4} = 3^4 = 81 ≡ 13 (mod 17). Check its order: 13² = 169 ≡ 16 ≡ −1, and 13⁴ ≡ 1 (mod 17) — order exactly 4, as required, and ω^{n/2} = 13² ≡ −1 confirms the Cancellation Lemma over ℤ_{17}. Convolving a = [1, 2] and b = [3, 4] (true product [3, 10, 8]) via the length-4 NTT and inverse NTT (scaling by n^{-1} = 4^{-1} ≡ 13 mod 17) reproduces [3, 10, 8, 0] exactly, mod 17 — no rounding anywhere. For real use, p must exceed the largest coefficient (here 10 < 17, fine); otherwise use multiple primes + CRT.

9.2 Multi-prime CRT¶

To recover a true coefficient x larger than any single NTT-friendly prime, run the NTT under primes p₁, …, p_r with product P = Πp_i > x, obtaining residues x ≡ r_i (mod p_i). The CRT reconstructs the unique x ∈ [0, P):

x = ( Σ_i r_i · M_i · (M_i^{-1} mod p_i) )  mod P,     M_i = P / p_i.

Three primes near 10⁹ give P > 10²⁷, covering coefficients up to n·C² for typical inputs; reduce x mod the target modulus M afterward.

10. Numerical Error Analysis¶

For the complex FFT in IEEE-754 double precision (machine epsilon ε = 2⁻⁵³), each butterfly introduces relative rounding O(ε). Accumulated over log₂ n levels, a standard backward-error result bounds the componentwise error of the computed transform fl(â) by

‖fl(â) − â‖ ≤ c · log₂(n) · ε · ‖a‖     (Frobenius/2-norm sense),

for a modest constant c. For convolution the relevant output magnitude is bounded by ‖a∗b‖_∞ ≤ n · ‖a‖_∞ · ‖b‖_∞; to round each integer coefficient correctly the absolute error must stay below 1/2, giving the safety condition

const · n · C² · log₂ n · ε  <  1/2,      C = max coefficient magnitude.

When violated (large C or n), either split coefficients into smaller pieces (reducing C) or use the NTT, which has zero rounding error because all arithmetic is exact in ℤ_p.

10.1 Why the bound is only logarithmic in n¶

The favourable log n (rather than n) dependence is a direct consequence of the unitarity noted in Section 3.1. Because F/√n is unitary, it neither amplifies nor attenuates the 2-norm: ‖Fa‖₂ = √n‖a‖₂. Rounding errors injected at each of the log₂ n levels therefore do not compound multiplicatively across levels (as they would for an ill-conditioned transform); each level adds an O(ε)·‖a‖ perturbation that the subsequent unitary stages merely rotate, not magnify. Summing log₂ n such contributions gives the stated O(log n · ε · ‖a‖) bound. A naive O(n²) DFT computed by direct summation has a worse O(n) error growth, so the FFT is not only faster but more accurate than the naive DFT — a rare win on both axes.

10.2 Coefficient splitting, formally¶

To halve the effective magnitude C, write each coefficient in base B = ⌈√C⌉: a_j = a_j^{hi} B + a_j^{lo} with both parts < B. Then

a ∗ b = (a^{hi} ∗ b^{hi}) B² + (a^{hi} ∗ b^{lo} + a^{lo} ∗ b^{hi}) B + (a^{lo} ∗ b^{lo}),

four convolutions of magnitude ~n·B² = n·C, reducing the precision demand from n·C² to n·C. Karatsuba's identity a^{hi}b^{lo} + a^{lo}b^{hi} = (a^{hi}+a^{lo})(b^{hi}+b^{lo}) − a^{hi}b^{hi} − a^{lo}b^{lo} cuts the four FFT products to three. This pushes safe double-precision convolution up to coefficient products near 10¹⁸, beyond which NTT is the cleaner route.

11. Applications: String Matching and Signal Processing¶

String matching via convolution. To match a pattern P (length m) against a text T (length N) allowing wildcards, encode each character and compute, for every alignment shift s, a mismatch score. Using the squared-difference encoding, the number of mismatches at shift s is

mismatch(s) = Σ_{j} (P[j] − T[s+j])² · (wildcard mask),

which expands into a constant number of correlations Σ_j f(P[j]) · g(T[s+j]). A correlation is a convolution of one sequence with the reverse of the other, so each term is one FFT/NTT, giving total O((N+m) log(N+m)) — versus O(Nm) naive. Exact-match counting and k-mismatch variants follow the same template (sibling 13-string-algorithms for non-FFT methods).

Derivation (exact match with wildcards). Let pattern symbol P[j] and text symbol T[i] be numeric codes, with a wildcard coded as 0. Define the per-alignment score

score(s) = Σ_{j=0}^{m-1} P[j]·T[s+j]·(P[j] − T[s+j])².

Each factor vanishes when P[j] = T[s+j] or when either is a wildcard (0), so score(s) = 0 exactly when the pattern matches at shift s (treating 0 as "matches anything"). Expanding the square,

score(s) = Σ_j P[j]³ T[s+j] − 2 Σ_j P[j]² T[s+j]² + Σ_j P[j] T[s+j]³.

Each of the three sums is a correlation Σ_j f(P[j])·g(T[s+j]) for fixed functions f, g, hence one convolution of T with the reversed P-derived sequence. Three FFTs/NTTs of size O(N) and an O(N) combine yield all score(s) in O(N log N). Setting score(s)=0 reports every (wildcard-aware) match — the classic Clifford–Clifford wildcard matching result.

Correlation vs convolution. Correlation (f ⋆ g)_s = Σ_j f_j g_{s+j} equals the convolution f' ∗ g where f' is f reversed (f'_j = f_{m-1-j}), up to an index shift. So every correlation reduces to one convolution, and FFT handles both.

Signal processing. The DFT decomposes a length-n sampled signal into n frequency components â_k, each the amplitude/phase of frequency k/n cycles per sample. FFT makes spectral analysis, fast convolution-based filtering (multiply spectra by a filter's frequency response), and compression (discard small-magnitude coefficients — the basis of JPEG's DCT and MP3) computationally feasible. Overlap-add and overlap-save use short FFTs to filter arbitrarily long streams in O(n log m) per block.

Real-input symmetry. For real input a, the DFT is conjugate-symmetric: â_{n-k} = \overline{â_k} (proof: â_{n-k} = Σ_j a_j ω_n^{j(n-k)} = Σ_j a_j ω_n^{-jk} = \overline{Σ_j a_j ω_n^{jk}} since a_j real). So only the first n/2 + 1 outputs are independent — the rfft storing half the spectrum. Two real signals a, b can be packed as a + ib and transformed in one complex FFT, then separated by â_k = (c_k + \overline{c_{n-k}})/2, b̂_k = (c_k − \overline{c_{n-k}})/(2i) where c = FFT(a+ib). This halves the cost of forward-transforming real data, exploited by every production FFT library.

Big-integer multiplication. An n-digit integer in base B is a polynomial X(B) = Σ x_i B^i with x_i ∈ [0, B). Multiplying two such integers is the convolution z = x ∗ y followed by carry propagation z_{i+1} += ⌊z_i / B⌋, z_i ← z_i mod B. The convolution is O(n log n) by FFT/NTT; the carry pass is O(n). Choosing a larger base B (e.g. 2¹⁶ or 10⁴) shrinks n but raises the per-coefficient magnitude ~n·B², which must respect the precision budget of Section 10 — the base is tuned to balance transform length against rounding safety.

11.1 Polynomial division and series operations¶

The convolution theorem also accelerates inverse operations. To compute the reciprocal of a power series A(x) modulo x^k (with A(0) invertible), Newton iteration B ← B(2 − A·B) mod x^{2t} doubles the known precision each step; each step is one FFT/NTT convolution, and the geometric sum of work gives O(k log k) total. This is the engine for fast polynomial division (O(n log n) via reversal + series inverse), and for series exp, log, and sqrt — all built on repeated convolution.

Newton-step correctness. If A·B_t ≡ 1 (mod x^t), set B_{2t} = B_t(2 − A B_t). Then 1 − A B_{2t} = 1 − 2A B_t + A²B_t² = (1 − A B_t)². Since 1 − A B_t ≡ 0 (mod x^t), its square is ≡ 0 (mod x^{2t}), so A B_{2t} ≡ 1 (mod x^{2t}) — the known precision doubles, giving the quadratic convergence that makes the work geometric-sum to O(k log k).

11.2 The transfer-matrix / generating-function bridge¶

Counting walks or strings under a regular constraint produces a rational generating function P(x)/Q(x); its power-series coefficients satisfy a linear recurrence with characteristic polynomial Q. FFT/NTT computes the series inverse 1/Q mod x^k (Section 11.1) and the product with P in O(k log k), extracting the first k terms. Combined with the Kitamasa reduction (polynomial exponentiation mod Q), a single far-future term c_N of an order-r recurrence is obtained in O(r log r · log N) — FFT/NTT accelerating each polynomial multiply inside the modular exponentiation. This links FFT to the matrix-power/transfer-matrix methods of the graph-walk topics (11-graphs).

12b. Generating Functions and Combinatorial Convolution¶

Many counting problems are products of generating functions, hence convolutions. If f(x) = Σ f_i x^i encodes "number of ways to make i from object set A" and g(x) does the same for set B, then (f·g)_s = Σ_i f_i g_{s-i} counts ways to make s by combining one choice from each — exactly the convolution. FFT/NTT evaluates such products in O(d log d) where d is the total degree, replacing an O(d²) double loop. Examples: counting pairs/triples with a target sum (f ∗ f, f ∗ f ∗ f), subset-sum-style enumeration, and partition-function recurrences. The transfer between "combinatorial product" and "polynomial product" is the reason FFT appears throughout enumerative combinatorics.

No general algorithm computes an arbitrary length-n DFT in o(n log n) under standard linear/arithmetic models; FFT is conjectured optimal for the dense case, and Ω(n log n) lower bounds hold in restricted models (e.g. linear circuits with bounded coefficients). For integer multiplication, FFT-based methods reach O(n log n) (Harvey–van der Hoeven, 2019) — the conjectured optimum — superseding Schönhage–Strassen's O(n log n log log n). Related transforms reuse the same divide-and-conquer skeleton: the DCT/DST (real cosine/sine transforms, used in JPEG/MP3), the Walsh–Hadamard transform (XOR-convolution), and Bluestein's and mixed-radix FFTs for non-power-of-two lengths.

12.1 Comparison of multiplication algorithms¶

Algorithm	Time	Idea	Exact?
Schoolbook	`Θ(n²)`	every coefficient pair	yes
Karatsuba (sibling `04`)	`Θ(n^{log₂3}) ≈ n^{1.585}`	3 half-size multiplies	yes
Toom-Cook (Toom-`k`)	`Θ(n^{log_k(2k-1)})`	evaluate/interpolate at `2k−1` points	yes
FFT / Schönhage–Strassen	`Θ(n log n log log n)`	convolution via DFT	approx / modular
Harvey–van der Hoeven	`Θ(n log n)`	multidimensional DFT	exact (integer)

Toom-Cook is itself a finite-point evaluation/interpolation scheme — the same evaluate-multiply-interpolate idea as FFT, but at a constant number of fixed rational points instead of n roots of unity. FFT is the limit of this idea taken to n carefully chosen points where evaluation and interpolation are themselves O(n log n).

On the matrix-multiplication exponent. Polynomial multiplication via FFT is not the same problem as n×n matrix multiplication (O(n^ω), ω < 2.372); the two are often conflated. FFT solves the 1-D convolution / DFT in O(n log n), which is provably near-optimal in linear circuit models. Matrix multiplication's subcubic algorithms (Strassen and descendants) are a separate line of work, though both are "fast algorithms via algebraic identities". The connection is that fast integer multiplication (FFT-based) is a subroutine inside some matrix-multiplication and graph algorithms, not that they share a complexity.

12.2 Bluestein's algorithm (arbitrary length)¶

For a DFT of arbitrary length n (e.g. n prime), Bluestein rewrites jk = (j² + k² − (k−j)²)/2, turning the DFT into

â_k = ω_n^{k²/2} · Σ_j ( a_j ω_n^{j²/2} ) · ω_n^{−(k−j)²/2},

which is a convolution of the two sequences b_j = a_j ω_n^{j²/2} and c_j = ω_n^{−j²/2}. That convolution is computed by zero-padding to a power of two ≥ 2n−1 and using ordinary radix-2 FFT, giving O(n log n) for any n. This is how libraries support non-power-of-two and prime sizes without a dedicated mixed-radix kernel.

13. Summary¶

The FFT computes the DFT â_k = Σ_j a_j ω_n^{jk} in Θ(n log n) by exploiting two roots-of-unity lemmas: Halving ((ω_n^k)² = ω_{n/2}^k, enabling the even/odd recursion) and Cancellation (ω_n^{k+n/2} = −ω_n^k, giving the butterfly's ± pair). Cooley-Tukey's split yields â_k = E_k + ω_n^k O_k and â_{k+n/2} = E_k − ω_n^k O_k, hence T(n) = 2T(n/2) + Θ(n) = Θ(n log n) (Master Theorem Case 2). The Convolution Theorem — DFT(a∗b) = DFT(a) ⊙ DFT(b), proved by evaluating A·B at the roots — turns O(n²) polynomial/big-integer multiplication into O(n log n); the Aliasing Identity explains why padding is mandatory. The inverse is the same transform with conjugate roots scaled by 1/n, justified by the Summation Lemma, and is numerically stable (more so than the naive DFT) because the normalized DFT matrix is unitary. Over a finite field ℤ_p with n | (p−1), a primitive n-th root ω = g^{(p-1)/n} makes all three lemmas hold, giving the exact NTT. Complex FFT carries error ~ n·C²·log n·ε; when that breaches 1/2, split coefficients or use NTT. The same machinery powers O((N+m) log(N+m)) string matching and the spectral methods at the heart of signal processing and compression.

14. Common Proof Pitfalls¶

A checklist of subtleties that derail FFT correctness arguments:

Periodicity of E and O. The combine step relies on E_{k+n/2} = E_k. This holds only because E is a length-n/2 DFT, which is inherently n/2-periodic in its index. Forgetting this leaves the â_{k+n/2} formula unjustified.
Halving needs n even. Lemma 2.1 and the whole split presuppose n even (in fact a power of two for radix-2). For odd or prime n, use Bluestein (Section 12.2) or mixed-radix; the simple even/odd split does not apply.
Linear ≠ cyclic without padding. Theorem 4.1 is stated for padded inputs. The unpadded DFT gives cyclic convolution (Proposition 4.2); conflating the two is the aliasing bug.
The inverse scale is 1/n, applied once. Corollary 7.2's 1/n comes from F*F = nI (Summation Lemma). Applying it per level and at the end double-scales by 1/n.
NTT needs n | (p−1). A primitive n-th root in ℤ_p exists iff n | (p−1) (Theorem 9.2). Picking a non-NTT-friendly prime, or a transform length exceeding 2^k for p = c·2^k+1, breaks the existence of ω.
Coefficient bound for exact recovery. NTT (or rounded FFT) recovers a∗b exactly only if each true coefficient is < p (NTT) or within the error budget < 1/2 after rounding (FFT). Exceeding it silently corrupts results — hence multi-prime CRT or coefficient splitting.
Distinctness of nodes. Point-value uniqueness (Definition 1.2) requires the n evaluation points to be distinct; the roots of unity are distinct precisely because ω_n is primitive — a non-primitive root would repeat points and break invertibility.
Conjugate, not negate, for the inverse. The inverse uses ω_n^{-1} = \overline{ω_n} (Corollary 7.2). Using −ω_n (negation) instead of conjugation is a frequent slip; the two coincide only for specific exponents.
n−1 vs n coefficients. The product A·B of degrees d₁, d₂ has degree d₁+d₂, hence d₁+d₂+1 coefficients — one more than the larger input length. Sizing to max(len) instead of len(a)+len(b)−1 truncates the top coefficient. Always size the output, not the input.
Multiplicity of the root. A non-primitive n-th root (e.g. ω_n² used as if it were ω_n) has order < n, so its powers repeat and the evaluation points are not distinct — invertibility (and hence the inverse formula) fails. Always verify the chosen root has exact order n.

One-paragraph takeaway. The FFT is the assertion that the Vandermonde matrix on the roots of unity — though dense — factors into log n sparse butterfly stages, each applying n/2 operations of the form (x, y) ↦ (x + ωy, x − ωy). Two group-theoretic facts (ω halves under squaring; ω^{n/2} = −1) make the factorization exist; unitarity makes it numerically safe; and replacing ℂ's e^{2πi/n} by a finite-field primitive root makes it exact. Everything else in this document is a careful unpacking of that single structural insight.

15. References and Cross-Topic Links¶

Cooley, J. W. & Tukey, J. W. (1965), "An algorithm for the machine calculation of complex Fourier series" — the founding paper of the modern FFT.
Introduction to Algorithms (CLRS), Chapter 30 — polynomials, the DFT, FFT, and the convolution view, with the same butterfly derivation used here.
Harvey, D. & van der Hoeven, J. (2021), "Integer multiplication in time O(n log n)" — the asymptotically optimal integer-multiplication result (Section 12).
Numerical Linear Algebra (Trefethen & Bau) — the unitarity and conditioning argument behind the O(log n) error bound (Section 10).
cp-algorithms.com — "Fast Fourier transform" and "Number-Theoretic Transform", matching the iterative bit-reversal code in middle.md/senior.md.
Sibling 04-karatsuba — the O(n^{1.585}) predecessor; sibling 03-master-theorem — the recurrence solved in Section 6; sibling 01-merge-sort — the identical T(n)=2T(n/2)+O(n) shape.
Cross-link to 11-graphs (transfer-matrix / walk counting) via the generating-function bridge of Section 11.2, and to 13-string-algorithms for the matching applications of Section 11.

Notation recap. n transform length (power of two), ω_n = e^{2πi/n} principal root, F the DFT (Vandermonde) matrix, ∗ linear convolution, ⊛ cyclic convolution, ⊙ Hadamard (elementwise) product, ε = 2⁻⁵³ machine epsilon, p an NTT prime, g a primitive root mod p, C max coefficient magnitude, [P] the Iverson bracket. All proofs are stated for ℂ but hold verbatim in any field with a primitive n-th root of unity and an inverse of n — which is exactly the hypothesis the NTT supplies over ℤ_p.