Loops — Optimization Exercises¶
Optimize each slow loop pattern. Measure the improvement with
timeit.
Score Card¶
| # | Difficulty | Topic | Type | Optimized? | Speedup |
|---|---|---|---|---|---|
| 1 | Easy | List comprehension vs for loop | CPU | [ ] | ___x |
| 2 | Easy | enumerate vs range(len) | CPU | [ ] | ___x |
| 3 | Easy | join vs += for strings | CPU/Memory | [ ] | ___x |
| 4 | Medium | Generator vs list in loop | Memory | [ ] | ___x |
| 5 | Medium | map/filter vs loops | CPU | [ ] | ___x |
| 6 | Medium | Avoiding global lookups in loops | CPU | [ ] | ___x |
| 7 | Medium | itertools usage | CPU | [ ] | ___x |
| 8 | Hard | numpy vectorization vs loops | CPU | [ ] | ___x |
| 9 | Hard | multiprocessing for CPU-bound loops | CPU | [ ] | ___x |
| 10 | Hard | C extension via ctypes for hot loops | CPU | [ ] | ___x |
Total optimized: ___ / 10
Exercise 1: List Comprehension vs For Loop¶
Difficulty: Easy
import timeit
# SLOW: Building a list with a for loop and append
def squares_slow(n: int) -> list[int]:
"""Build a list of squares using a for loop."""
result = []
for i in range(n):
result.append(i * i)
return result
n = 500_000
slow_time = timeit.timeit(lambda: squares_slow(n), number=10)
print(f"Slow (for + append): {slow_time:.4f}s")
Hint
List comprehensions are not just syntactic sugar — they compile to a dedicated `LIST_APPEND` bytecode that avoids the overhead of looking up and calling the `.append()` method on every iteration.
Optimized Solution
# FAST: List comprehension — single expression, optimized bytecode
def squares_fast(n: int) -> list[int]:
"""Build a list of squares using a list comprehension."""
return [i * i for i in range(n)]
fast_time = timeit.timeit(lambda: squares_fast(n), number=10)
print(f"Fast (comprehension): {fast_time:.4f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
# Typical speedup: 1.3-2x
Learn More
- Use `dis.dis(squares_slow)` and `dis.dis("[i*i for i in range(10)]")` to compare bytecode, as sketched below.
- List comprehensions can also have a memory advantage: CPython can pre-size the result list when the iterable's length is predictable (e.g., from `range()`), though this is an implementation detail.
- For side-effect-only loops (e.g., calling `print()`), do NOT use a comprehension — use a regular `for` loop. Comprehensions are for building data.
- Nested comprehensions are fine: `[x*y for x in rows for y in cols]` is faster than nested `for` + `append`.
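For example, a minimal bytecode comparison (the exact output differs across CPython versions):
import dis

# Loop version: a method lookup and call for .append() on every iteration
dis.dis(squares_slow)

# Comprehension: a single dedicated LIST_APPEND instruction per iteration
dis.dis("[i * i for i in range(10)]")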
Exercise 2: enumerate vs range(len)¶
Difficulty: Easy
import timeit
# SLOW: Using range(len()) to get both index and value
def index_values_slow(data: list[int]) -> list[tuple[int, int]]:
"""Pair each element with its index using range(len())."""
result = []
for i in range(len(data)):
result.append((i, data[i])) # data[i] is a separate subscript operation
return result
data = list(range(500_000))
slow_time = timeit.timeit(lambda: index_values_slow(data), number=10)
print(f"Slow (range(len)): {slow_time:.4f}s")
Hint
`enumerate()` is a built-in iterator implemented in C that yields `(index, value)` tuples directly, avoiding the cost of a separate `data[i]` subscript operation on each iteration. It also reads more clearly.
Optimized Solution
# FAST: enumerate yields (index, value) pairs natively
def index_values_fast(data: list[int]) -> list[tuple[int, int]]:
"""Pair each element with its index using enumerate()."""
return [(i, v) for i, v in enumerate(data)]
fast_time = timeit.timeit(lambda: index_values_fast(data), number=10)
print(f"Fast (enumerate): {fast_time:.4f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
# Typical speedup: 1.3-1.8x
Learn More
- `enumerate()` accepts a `start` parameter: `enumerate(data, start=1)` begins counting at 1.
- For dicts, use `dict.items()` instead of iterating over keys and looking up values.
- `zip()` is the counterpart when you need to iterate over two sequences in parallel — also implemented in C. See the sketch below.
- The `range(len())` anti-pattern is one of the most common Python performance mistakes flagged in code reviews.
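A minimal illustration of both idioms (the sample data here is hypothetical):
names = ["ada", "grace", "alan"]
scores = [95, 98, 91]

# enumerate with start=1 gives human-friendly numbering
for rank, name in enumerate(names, start=1):
    print(f"{rank}. {name}")

# zip pairs two sequences in parallel, with no index arithmetic
for name, score in zip(names, scores):
    print(f"{name}: {score}")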
Exercise 3: join vs += for String Concatenation¶
Difficulty: Easy
import timeit
# SLOW: Building a string with += in a loop
def concat_slow(n: int) -> str:
"""Build a large string by concatenating with +=."""
result = ""
for i in range(n):
result += f"item-{i}," # Creates a new string object every iteration
return result
n = 100_000
slow_time = timeit.timeit(lambda: concat_slow(n), number=5)
print(f"Slow (+=): {slow_time:.4f}s")
Hint
Strings in Python are immutable. Every `+=` creates a brand-new string, copying all existing characters plus the new part. This turns an O(n) task into O(n^2). Building a list and calling `str.join()` once at the end avoids all the copying.
Optimized Solution
# FAST: Collect parts in a list, join once at the end
def concat_fast(n: int) -> str:
"""Build a large string using join()."""
parts = [f"item-{i}" for i in range(n)]
return ",".join(parts)
fast_time = timeit.timeit(lambda: concat_fast(n), number=5)
print(f"Fast (join): {fast_time:.4f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
# Typical speedup: 2-10x (grows with n)
Learn More
- CPython has a special optimization where `+=` on a string with a reference count of 1 can resize the string in place. But this is an implementation detail, not guaranteed, and it fails when other references exist.
- For file-like output, use `io.StringIO()` as a buffer — it accumulates writes internally and avoids the quadratic copying, much like the `join()` pattern. A sketch follows this list.
- The `join()` pattern is also cleaner: `"\n".join(lines)` vs. a loop with `result += line + "\n"`.
- This same principle applies to bytes: use `b"".join(chunks)` instead of `result += chunk`.
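A minimal sketch of the `io.StringIO()` buffer approach:
import io

def concat_buffered(n: int) -> str:
    """Build a large string with an in-memory text buffer."""
    buf = io.StringIO()
    for i in range(n):
        buf.write(f"item-{i},")  # Appends to the internal buffer; no full copy per write
    return buf.getvalue()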
Exercise 4: Generator vs List in Loop¶
Difficulty: Medium
import timeit
import sys
# SLOW: Building a full intermediate list just to iterate over it
def sum_of_squares_slow(n: int) -> int:
"""Sum of squares using an intermediate list."""
squares = [i * i for i in range(n)] # Allocates entire list in memory
total = 0
for s in squares:
total += s
return total
n = 2_000_000
slow_time = timeit.timeit(lambda: sum_of_squares_slow(n), number=5)
# Show memory difference (getsizeof reports the list object itself, not its int elements)
list_size = sys.getsizeof([i * i for i in range(n)])
print(f"List memory: {list_size / 1024 / 1024:.1f} MB")
print(f"Slow (list intermediate): {slow_time:.4f}s")
Hint
If you only need to iterate once over the results, a generator expression `(x for x in ...)` produces items one at a time without storing the entire sequence in memory. For aggregation functions like `sum()`, passing a generator avoids the intermediate list entirely.
Optimized Solution
# FAST: Generator expression — no intermediate list in memory
def sum_of_squares_fast(n: int) -> int:
"""Sum of squares using a generator expression with sum()."""
return sum(i * i for i in range(n)) # Generator — O(1) memory
fast_time = timeit.timeit(lambda: sum_of_squares_fast(n), number=5)
gen_size = sys.getsizeof(i * i for i in range(n))
print(f"Generator memory: {gen_size} bytes")
print(f"Fast (generator + sum): {fast_time:.4f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
# Typical speedup: 1.2-2x (plus massive memory savings)
Learn More
- Use generators when: you iterate once, the data is large, or you only need a subset (e.g., with `any()`, `all()`, `min()`, `max()`).
- Use lists when: you need random access, you iterate multiple times, or you need `len()`.
- `itertools` functions (e.g., `chain`, `islice`) work with generators and maintain the lazy-evaluation benefit.
- Generator expressions can be chained: `sum(x for x in (i*i for i in range(n)) if x % 2 == 0)`. A short-circuiting sketch follows.
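For example, `any()` over a generator stops at the first match, and `islice()` pulls only the items you ask for (a small sketch):
from itertools import islice

# Short-circuits at the first square above the threshold; O(1) memory
found = any(i * i > 10_000 for i in range(2_000_000))

# Lazily take the first 5 squares without materializing the rest
first_five = list(islice((i * i for i in range(2_000_000)), 5))
print(found, first_five)  # True [0, 1, 4, 9, 16]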
Exercise 5: map/filter vs Explicit Loops¶
Difficulty: Medium
import timeit
# SLOW: Explicit loop with conditional append
def transform_filter_slow(data: list[int]) -> list[str]:
"""Convert even numbers to hex strings using a for loop."""
result = []
for x in data:
if x % 2 == 0:
result.append(hex(x))
return result
data = list(range(1_000_000))
slow_time = timeit.timeit(lambda: transform_filter_slow(data), number=5)
print(f"Slow (for loop): {slow_time:.4f}s")
Hint
`map()` and `filter()` are built-in functions implemented in C. They apply a function to each element (or filter elements) without the overhead of a Python-level loop, `if` statement, or `.append()` call. Combined, they can replace a loop+conditional+transform pattern.
Optimized Solution
# FAST: map + filter — C-level iteration
def transform_filter_fast(data: list[int]) -> list[str]:
"""Convert even numbers to hex strings using map + filter."""
return list(map(hex, filter(lambda x: x % 2 == 0, data)))
# FASTEST: List comprehension — best of both worlds
def transform_filter_fastest(data: list[int]) -> list[str]:
"""Convert even numbers to hex strings using a list comprehension."""
return [hex(x) for x in data if x % 2 == 0]
fast_time = timeit.timeit(lambda: transform_filter_fast(data), number=5)
fastest_time = timeit.timeit(lambda: transform_filter_fastest(data), number=5)
print(f"Fast (map+filter): {fast_time:.4f}s (speedup: {slow_time/fast_time:.1f}x)")
print(f"Fastest (comprehension): {fastest_time:.4f}s (speedup: {slow_time/fastest_time:.1f}x)")
# Typical speedup: 1.1-1.5x (map/filter), 1.3-1.8x (comprehension)
Learn More
- `map(func, iterable)` is fastest when `func` is a C built-in (`hex`, `str`, `len`, `int`).
- If `func` is a lambda, a list comprehension is usually as fast or faster.
- `filter(None, iterable)` is a fast way to remove falsy values (0, None, "", [], etc.).
- `itertools.starmap(func, iterable_of_tuples)` is the multi-argument version of `map()`; see the sketch below.
- In Python 3, `map()` and `filter()` return iterators — wrap them with `list()` if you need a list.
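A minimal sketch of `starmap` with the `operator` module, which avoids writing a lambda for each pair:
from itertools import starmap
import operator

pairs = [(2, 3), (4, 5), (6, 7)]

# starmap unpacks each tuple into the function's arguments: operator.mul(2, 3), ...
products = list(starmap(operator.mul, pairs))
print(products)  # [6, 20, 42]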
Exercise 6: Avoiding Global Lookups in Loops¶
Difficulty: Medium
import timeit
import math
# SLOW: Global and attribute lookups on every iteration
def compute_distances_slow(coords: list[tuple[float, float]]) -> list[float]:
"""Compute Euclidean distance from origin for each point."""
distances = []
for x, y in coords:
distances.append(math.sqrt(x * x + y * y)) # LOAD_GLOBAL + LOAD_ATTR per iteration
return distances
coords = [(float(i), float(i + 1)) for i in range(300_000)]
slow_time = timeit.timeit(lambda: compute_distances_slow(coords), number=5)
print(f"Slow (global lookups): {slow_time:.4f}s")
Hint
Every `math.sqrt` inside the loop requires `LOAD_GLOBAL` (dictionary lookup for `math`) followed by `LOAD_ATTR` (dictionary lookup for `sqrt`). Similarly, `distances.append` requires `LOAD_FAST` + `LOAD_ATTR`. Caching both as local variables before the loop replaces all of these with simple `LOAD_FAST` (array index) operations.
Optimized Solution
# FAST: Cache global/attribute lookups as local variables
def compute_distances_fast(coords: list[tuple[float, float]]) -> list[float]:
"""Compute distances with cached local lookups."""
sqrt = math.sqrt # Cache function as local
result = []
append = result.append # Cache method as local
for x, y in coords:
append(sqrt(x * x + y * y)) # Two LOAD_FAST calls only
return result
# FASTEST: Comprehension + local cache (comprehension scope is local too)
def compute_distances_fastest(coords: list[tuple[float, float]]) -> list[float]:
sqrt = math.sqrt
return [sqrt(x * x + y * y) for x, y in coords]
fast_time = timeit.timeit(lambda: compute_distances_fast(coords), number=5)
fastest_time = timeit.timeit(lambda: compute_distances_fastest(coords), number=5)
print(f"Fast (cached locals): {fast_time:.4f}s (speedup: {slow_time/fast_time:.1f}x)")
print(f"Fastest (comprehension): {fastest_time:.4f}s (speedup: {slow_time/fastest_time:.1f}x)")
# Typical speedup: 1.2-1.5x (cached), 1.3-1.8x (comprehension)
Learn More
- This technique is used throughout the Python standard library (e.g., `json/encoder.py`).
- The default-argument trick also works: `def f(coords, _sqrt=math.sqrt)` — the default is evaluated once, at function definition time. A sketch follows this list.
- List comprehensions do create their own local scope, but `[math.sqrt(x) ...]` still performs a `LOAD_GLOBAL` for `math` on each iteration. Cache it before the comprehension: `sqrt = math.sqrt; [sqrt(x) ...]`.
- Profile before optimizing: `python -m cProfile script.py`, or `line_profiler` for line-by-line data.
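A minimal sketch of the default-argument trick (the `_sqrt` name is just a convention for an internal parameter):
import math

def norms(coords, _sqrt=math.sqrt):
    """_sqrt is bound once, when the function is defined, so each call
    skips the repeated math.sqrt global/attribute lookup."""
    return [_sqrt(x * x + y * y) for x, y in coords]

print(norms([(3.0, 4.0)]))  # [5.0]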
Exercise 7: itertools for Efficient Loop Patterns¶
Difficulty: Medium
import timeit
from itertools import chain, product
# SLOW: Manual nested loop to flatten a list of lists
def flatten_slow(nested: list[list[int]]) -> list[int]:
"""Flatten a list of lists using nested for loops."""
result = []
for sublist in nested:
for item in sublist:
result.append(item)
return result
# SLOW: Manual cartesian product with nested loops
def cartesian_slow(a: list[int], b: list[str], c: list[float]) -> list[tuple]:
"""Compute cartesian product of three lists."""
result = []
for x in a:
for y in b:
for z in c:
result.append((x, y, z))
return result
nested = [list(range(100)) for _ in range(5000)] # 5000 sublists of 100 items
a = list(range(50))
b = [chr(65 + i) for i in range(26)]
c = [float(i) / 10 for i in range(20)]
slow_flat = timeit.timeit(lambda: flatten_slow(nested), number=5)
slow_cart = timeit.timeit(lambda: cartesian_slow(a, b, c), number=5)
print(f"Slow flatten: {slow_flat:.4f}s")
print(f"Slow cartesian: {slow_cart:.4f}s")
Hint
`itertools.chain.from_iterable()` flattens an iterable of iterables in C, without any Python-level nested loop. `itertools.product()` computes the cartesian product of multiple iterables entirely in C, replacing deeply nested `for` loops.
Optimized Solution
from itertools import chain, product
# FAST: chain.from_iterable — C-level flatten
def flatten_fast(nested: list[list[int]]) -> list[int]:
"""Flatten using itertools.chain.from_iterable."""
return list(chain.from_iterable(nested))
# FAST: itertools.product — C-level cartesian product
def cartesian_fast(a: list[int], b: list[str], c: list[float]) -> list[tuple]:
"""Compute cartesian product using itertools.product."""
return list(product(a, b, c))
fast_flat = timeit.timeit(lambda: flatten_fast(nested), number=5)
fast_cart = timeit.timeit(lambda: cartesian_fast(a, b, c), number=5)
print(f"Fast flatten: {fast_flat:.4f}s (speedup: {slow_flat/fast_flat:.1f}x)")
print(f"Fast cartesian: {fast_cart:.4f}s (speedup: {slow_cart/fast_cart:.1f}x)")
# Typical speedup: 2-4x (flatten), 2-5x (cartesian)
Learn More
- Key `itertools` functions for loop optimization:
  - `chain.from_iterable()` — flatten nested iterables
  - `product()` — replace nested for loops
  - `groupby()` — replace manual grouping logic
  - `islice()` — slice an iterator without materializing it
  - `accumulate()` — running totals without a manual loop
  - `compress()` — filter by a selector iterable
- `itertools` functions return iterators, so wrap them with `list()` only when you need a list.
- Combine with the `operator` module for even more speed: `itertools.starmap(operator.mul, pairs)` is faster than the equivalent lambda. Sketches of `groupby()` and `accumulate()` follow.
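Minimal sketches of `groupby()` and `accumulate()` (the sample data is hypothetical):
from itertools import accumulate, groupby

# groupby needs its input sorted by the grouping key
words = sorted(["banana", "apple", "avocado", "cherry", "blueberry"])
for letter, group in groupby(words, key=lambda w: w[0]):
    print(letter, list(group))  # a ['apple', 'avocado'], then b, then c

# accumulate yields running totals without a manual loop
print(list(accumulate([1, 2, 3, 4])))  # [1, 3, 6, 10]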
Exercise 8: NumPy Vectorization vs Python Loops¶
Difficulty: Hard
import timeit
# SLOW: Pure Python loop for numerical computation
def moving_average_slow(data: list[float], window: int) -> list[float]:
"""Compute moving average using a Python loop."""
result = []
for i in range(len(data) - window + 1):
total = 0.0
for j in range(window):
total += data[i + j]
result.append(total / window)
return result
data = [float(i) + 0.5 for i in range(100_000)]
window = 50
slow_time = timeit.timeit(lambda: moving_average_slow(data, window), number=3)
print(f"Slow (Python loop): {slow_time:.4f}s")
Hint
NumPy performs operations on entire arrays at the C level, often using SIMD instructions, bypassing the Python interpreter loop entirely. For a moving average, `np.convolve` or `np.cumsum` can compute the result in one vectorized pass.
Optimized Solution
import numpy as np
# FAST: NumPy vectorized moving average using cumsum
def moving_average_fast(data_np: np.ndarray, window: int) -> np.ndarray:
"""Compute moving average using NumPy cumulative sum trick."""
cumsum = np.cumsum(data_np)
cumsum[window:] = cumsum[window:] - cumsum[:-window]
return cumsum[window - 1:] / window
data_np = np.array(data)
fast_time = timeit.timeit(lambda: moving_average_fast(data_np, window), number=3)
print(f"Fast (NumPy cumsum): {fast_time:.4f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
# Verify correctness
py_result = moving_average_slow(data[:1000], window)
np_result = moving_average_fast(np.array(data[:1000]), window).tolist()
assert all(abs(a - b) < 1e-10 for a, b in zip(py_result, np_result)), "Results mismatch!"
print("Correctness verified.")
# Typical speedup: 50-200x
Learn More
- Rule of thumb: if you are looping over numbers in Python, you should probably be using NumPy.
- Key vectorization patterns (the first two are sketched below):
  - Replace `for` + `if` with boolean indexing: `arr[arr > 0]`
  - Replace nested loops with broadcasting: `a[:, None] + b[None, :]`
  - Replace accumulation loops with `np.cumsum`, `np.cumprod`, `np.diff`
- `np.vectorize()` is NOT true vectorization — it is a convenience wrapper that still loops in Python.
- For DataFrames, pandas uses NumPy under the hood. Avoid `.iterrows()` and use vectorized operations.
- If NumPy is not available, consider the `array` module for typed arrays — far more memory-compact than lists for numeric data.
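Minimal sketches of the first two patterns:
import numpy as np

arr = np.array([-2, -1, 0, 1, 2])

# Boolean indexing replaces a for + if filter, with no Python-level loop
positives = arr[arr > 0]  # array([1, 2])

# Broadcasting replaces a nested loop: pairwise sums as a (3, 2) grid
a = np.array([0, 10, 20])
b = np.array([1, 2])
grid = a[:, None] + b[None, :]
print(positives, grid, sep="\n")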
Exercise 9: multiprocessing for CPU-Bound Loops¶
Difficulty: Hard
import timeit
import math
# SLOW: Single-threaded CPU-bound computation
def is_prime(n: int) -> bool:
"""Check if n is prime."""
if n < 2:
return False
if n < 4:
return True
if n % 2 == 0 or n % 3 == 0:
return False
i = 5
while i * i <= n:
if n % i == 0 or n % (i + 2) == 0:
return False
i += 6
return True
def count_primes_slow(numbers: list[int]) -> int:
"""Count primes in a list — single process."""
return sum(1 for n in numbers if is_prime(n))
# Large numbers to make primality testing expensive
numbers = list(range(100_000, 200_000))
slow_time = timeit.timeit(lambda: count_primes_slow(numbers), number=3)
result_slow = count_primes_slow(numbers)
print(f"Slow (single process): {slow_time:.4f}s — found {result_slow} primes")
Hint
Python's GIL (Global Interpreter Lock) prevents threads from running Python bytecode in parallel. For CPU-bound work, `multiprocessing` spawns separate processes, each with its own GIL, allowing true parallel execution across CPU cores. `multiprocessing.Pool.map()` distributes chunks of work across a pool of worker processes.
Optimized Solution
from multiprocessing import Pool, cpu_count
# FAST: Distribute work across multiple CPU cores
def count_primes_chunk(chunk: list[int]) -> int:
"""Count primes in a chunk — runs in a worker process."""
return sum(1 for n in chunk if is_prime(n))
def count_primes_fast(numbers: list[int]) -> int:
"""Count primes using multiprocessing Pool."""
n_workers = cpu_count()
    chunk_size = max(1, len(numbers) // n_workers)  # Guard against a zero chunk size for tiny inputs
chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
with Pool(n_workers) as pool:
results = pool.map(count_primes_chunk, chunks)
return sum(results)
fast_time = timeit.timeit(lambda: count_primes_fast(numbers), number=3)
result_fast = count_primes_fast(numbers)
assert result_fast == result_slow, f"Mismatch: {result_fast} != {result_slow}"
print(f"Fast ({cpu_count()} processes): {fast_time:.4f}s — found {result_fast} primes")
print(f"Speedup: {slow_time / fast_time:.1f}x")
# Typical speedup: 2-8x depending on CPU core count
Learn More
- Use `multiprocessing.Pool` for data parallelism (same function, different data chunks).
- Use `concurrent.futures.ProcessPoolExecutor` for a higher-level API with `submit()` and `as_completed()`; see the sketch below.
- `pool.imap_unordered()` is faster when order does not matter — it yields results as they complete.
- On platforms that default to the spawn start method (Windows, macOS), guard pool creation with `if __name__ == "__main__":` so worker processes do not re-run the module's top-level code.
- For shared state between processes, use `multiprocessing.Manager()` or `multiprocessing.shared_memory`.
- Python 3.13+ (free-threaded build) can run threads in parallel without the GIL — but multiprocessing remains the portable approach.
- Avoid multiprocessing for I/O-bound tasks — use `asyncio` or `threading` instead.
- Process creation and inter-process data transfer carry real overhead, so multiprocessing only helps when each task takes at least several milliseconds.
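A minimal sketch of the same prime count with `concurrent.futures`, assuming `count_primes_chunk` from above is importable by the workers (i.e., defined in a module, not a REPL):
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count

def count_primes_futures(numbers: list[int]) -> int:
    """Split the work into one chunk per worker and sum the partial counts."""
    n_workers = cpu_count()
    chunk_size = max(1, len(numbers) // n_workers)
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_primes_chunk, chunks))

if __name__ == "__main__":
    print(count_primes_futures(list(range(100_000, 200_000))))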
Exercise 10: C Extension via ctypes for Hot Loops¶
Difficulty: Hard
import timeit
# SLOW: Pure Python hot loop — sum of absolute differences
def sum_abs_diff_slow(a: list[float], b: list[float]) -> float:
"""Compute sum of absolute differences between two lists."""
total = 0.0
for i in range(len(a)):
total += abs(a[i] - b[i])
return total
import random
random.seed(42)
n = 1_000_000
a = [random.random() for _ in range(n)]
b = [random.random() for _ in range(n)]
slow_time = timeit.timeit(lambda: sum_abs_diff_slow(a, b), number=5)
print(f"Slow (Python loop): {slow_time:.4f}s")
Hint
You can write a tiny C function, compile it into a shared library, and call it from Python using `ctypes`. The C function operates on raw memory buffers with no Python object overhead, making it orders of magnitude faster for numerical loops. Use `ctypes.CDLL` to load the library and define argument/return types.
Optimized Solution
import ctypes
import tempfile
import os
import subprocess
import array
# Step 1: Write a small C file
c_code = r"""
#include <math.h>
double sum_abs_diff(const double* a, const double* b, int n) {
double total = 0.0;
for (int i = 0; i < n; i++) {
total += fabs(a[i] - b[i]);
}
return total;
}
"""
# Step 2: Compile the C code into a shared library
tmpdir = tempfile.mkdtemp()
c_path = os.path.join(tmpdir, "fast_math.c")
so_path = os.path.join(tmpdir, "fast_math.so")
with open(c_path, "w") as f:
f.write(c_code)
subprocess.run(
["gcc", "-O2", "-shared", "-fPIC", "-o", so_path, c_path, "-lm"],
check=True,
)
# Step 3: Load and call via ctypes
lib = ctypes.CDLL(so_path)
lib.sum_abs_diff.argtypes = [
ctypes.POINTER(ctypes.c_double),
ctypes.POINTER(ctypes.c_double),
ctypes.c_int,
]
lib.sum_abs_diff.restype = ctypes.c_double
# Convert Python lists to C-compatible arrays
arr_a = (ctypes.c_double * n)(*a)
arr_b = (ctypes.c_double * n)(*b)
def sum_abs_diff_fast() -> float:
return lib.sum_abs_diff(arr_a, arr_b, n)
fast_time = timeit.timeit(sum_abs_diff_fast, number=5)
# Verify correctness
py_result = sum_abs_diff_slow(a, b)
c_result = sum_abs_diff_fast()
assert abs(py_result - c_result) < 1e-6, f"Mismatch: {py_result} vs {c_result}"
print(f"Fast (C/ctypes): {fast_time:.4f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
print("Correctness verified.")
# Typical speedup: 30-100x
# Cleanup
os.remove(c_path)
os.remove(so_path)
os.rmdir(tmpdir)
Learn More
- `ctypes` is in the standard library — no external dependencies needed.
- For production code, consider these alternatives:
  - **Cython**: Write Python-like code that compiles to C. Easier than raw C.
  - **cffi**: A more Pythonic alternative to `ctypes` for calling C code.
  - **pybind11**: Best for calling C++ code from Python.
  - **Numba** (`@jit`): JIT-compiles Python functions to machine code using LLVM. No C code needed.
- When using `ctypes`, always set `.argtypes` and `.restype` — without them, ctypes assumes all arguments and return values are `int`, leading to wrong results or crashes on 64-bit systems.
- Use `array.array('d', data)` for C-compatible storage: `(ctypes.c_double * n)(*a)` copies element by element, while an `array` can expose its existing buffer to ctypes.
- NumPy arrays can be passed directly to ctypes via `a.ctypes.data_as(ctypes.POINTER(ctypes.c_double))`; a sketch follows this list.
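A minimal sketch of the NumPy route, assuming the `lib` handle and argtypes set up above; the pointer hand-off itself makes no copy:
import numpy as np

a_np = np.ascontiguousarray(a, dtype=np.float64)  # one-time conversion from the list
b_np = np.ascontiguousarray(b, dtype=np.float64)

# data_as() exposes the array's existing buffer as a ctypes pointer
ptr_a = a_np.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
ptr_b = b_np.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
print(lib.sum_abs_diff(ptr_a, ptr_b, n))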
Cheat Sheet¶
| Pattern | Slow | Fast | Type | Typical Gain |
|---|---|---|---|---|
| List building | for + append() | List comprehension | CPU | 1.3-2x |
| Index + value | range(len(x)) + x[i] | enumerate(x) | CPU | 1.3-1.8x |
| String concat | result += s in loop | "".join(parts) | CPU/Mem | 2-10x |
| Intermediate data | [x for x in ...] then loop | Generator (x for x in ...) | Memory | 1.2-2x |
| Transform + filter | for + if + append | map()/filter() or comprehension | CPU | 1.3-1.8x |
| Global lookups | math.sqrt in loop | sqrt = math.sqrt before loop | CPU | 1.2-1.5x |
| Nested loops | Manual nested for | itertools.product, chain | CPU | 2-5x |
| Numeric loops | Python for over numbers | NumPy vectorization | CPU | 50-200x |
| CPU-bound parallel | Single-process loop | multiprocessing.Pool.map() | CPU | 2-8x |
| Hot inner loop | Pure Python arithmetic | C function via ctypes | CPU | 30-100x |