Linear Search — Optimization Exercises¶

12 exercises that take the naive O(n) linear search and squeeze out performance through algorithmic, micro-architectural, or systems-level techniques. Each exercise is paired with a measurable benchmark target.

Table of Contents¶

Sentinel to Skip Bound Check
Early-Exit on Found
Parallel Search Across Chunks
SIMD Vectorized Comparison
Branch-Free Comparison
Cache-Line Aware Iteration
Bloom Filter Pre-Check for Rare Values
Use the Language Built-In
Jump Search (sqrt(n) Hybrid)
Self-Organizing List (Move-to-Front)
Switch to Binary Search if Sorted
Aho-Corasick for Multi-Pattern Search
Summary Table

1. Sentinel to Skip Bound Check¶

Naive¶

def search(arr, target):
    i = 0
    while i < len(arr) and arr[i] != target:   # ← two comparisons per iter
        i += 1
    return i if i < len(arr) else -1

Optimized¶

def sentinel_search(arr, target):
    n = len(arr)
    if n == 0: return -1
    last = arr[n-1]
    arr[n-1] = target          # place sentinel
    i = 0
    while arr[i] != target:    # ← one comparison per iter
        i += 1
    arr[n-1] = last            # restore
    if i < n - 1 or last == target:
        return i
    return -1

Speedup Target¶

~5-15% on tight integer loops. Larger savings in environments with bad branch prediction.

Caveat¶

Mutates the array temporarily. Not thread-safe without locking. Modern CPUs already make the bounds check cheap; this technique is more impactful in embedded / DSP code.

2. Early-Exit on Found¶

Naive¶

def search(arr, target):
    found = -1
    for i, v in enumerate(arr):
        if v == target:
            found = i           # ← keeps scanning
    return found

Optimized¶

def search(arr, target):
    for i, v in enumerate(arr):
        if v == target:
            return i            # ← stop now
    return -1

Speedup Target¶

~2× on average (n/2 vs n comparisons). Dramatic for cases where target is near the front.

Caveat¶

Only correct if you want the first match. For "find all," you must scan to the end.

3. Parallel Search Across Chunks¶

Naive¶

Single-threaded scan of n elements.

Optimized — Go¶

import (
    "context"
    "sync"
    "sync/atomic"
)

func ParallelSearch(arr []int, target int, workers int) int {
    n := len(arr)
    var found atomic.Int64
    found.Store(-1)
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    var wg sync.WaitGroup
    chunk := (n + workers - 1) / workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        start := w * chunk
        end := start + chunk
        if end > n { end = n }
        go func(lo, hi int) {
            defer wg.Done()
            for i := lo; i < hi; i++ {
                select {
                case <-ctx.Done(): return
                default:
                }
                if arr[i] == target {
                    for {
                        cur := found.Load()
                        if cur != -1 && cur < int64(i) { return }
                        if found.CompareAndSwap(cur, int64(i)) {
                            cancel()
                            return
                        }
                    }
                }
            }
        }(start, end)
    }
    wg.Wait()
    return int(found.Load())
}

Speedup Target¶

~workers× up to memory bandwidth saturation (typically 4-8× for int32 arrays).

Caveat¶

Worth it only when n × per-element cost > thread spawn cost (~tens of microseconds). For small arrays, parallel is slower than serial.

4. SIMD Vectorized Comparison¶

Naive¶

Scalar one-element-at-a-time compare.

Optimized — Go (using `bytes.IndexByte` for byte-array case)¶

import "bytes"

// Built-in: hand-tuned assembly using AVX2/SSE on x86, NEON on ARM.
idx := bytes.IndexByte(haystack, byte('x'))

Optimized — C with AVX2 intrinsics¶

#include <immintrin.h>

int simd_search_int32(const int32_t* arr, size_t n, int32_t target) {
    __m256i tgt = _mm256_set1_epi32(target);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v   = _mm256_loadu_si256((const __m256i*)(arr + i));
        __m256i cmp = _mm256_cmpeq_epi32(v, tgt);
        int     m   = _mm256_movemask_epi8(cmp);
        if (m) return (int)i + __builtin_ctz(m) / 4;
    }
    for (; i < n; i++) if (arr[i] == target) return (int)i;
    return -1;
}

Optimized — Python (NumPy)¶

import numpy as np

def numpy_search(arr_np, target):
    matches = np.where(arr_np == target)[0]
    return int(matches[0]) if len(matches) else -1

Speedup Target¶

4-8× on x86 with AVX2; 8-16× with AVX-512. NumPy's vectorization gives Python ~50× speedup over pure-Python loops.

Caveat¶

Always falls back to scalar tail loop. Alignment matters. Auto-vectorization by compilers is unreliable for anything beyond simple loops.

5. Branch-Free Comparison¶

Naive (branchy)¶

for (size_t i = 0; i < n; i++) {
    if (arr[i] == target) return (int)i;   // ← unpredictable branch when target is rare
}
return -1;

Optimized (branch-free)¶

int branchless_search(const int* arr, size_t n, int target) {
    int found = -1;
    for (size_t i = 0; i < n; i++) {
        int eq = (arr[i] == target);   // 0 or 1
        // If eq, set found to i (only the first time, due to ternary).
        found = (found == -1 && eq) ? (int)i : found;
    }
    return found;
}

Speedup Target¶

Slower in average case (always scans the whole array), but constant-time — useful for cryptographic comparison where timing leaks are unacceptable.

Caveat¶

Don't use for general-purpose search. Use only when timing-channel resistance is required (password verification, MAC validation).

6. Cache-Line Aware Iteration¶

Naive (Array-of-Structs)¶

typedef struct { char name[60]; int age; int id; } Person;  // 68 bytes, padded to 72 typically.
Person people[N];

int find_age(Person* p, size_t n, int target) {
    for (size_t i = 0; i < n; i++) {
        if (p[i].age == target) return (int)i;
    }
    return -1;
}
// Each iteration touches 64-72 bytes, only 4 of which (age) are needed.
// 1 cache line per element — bandwidth-bound.

Optimized (Struct-of-Arrays)¶

struct PeopleSoA {
    char (*names)[60];   // separate
    int* ages;           // contiguous → 16 ages per cache line
    int* ids;
};

int find_age_soa(struct PeopleSoA* p, size_t n, int target) {
    for (size_t i = 0; i < n; i++) {
        if (p->ages[i] == target) return (int)i;
    }
    return -1;
}
// 16 elements per cache line → 16× more cache-efficient.

Speedup Target¶

4-16× depending on struct size and hot-field ratio. Key insight behind ECS architectures in game engines.

Caveat¶

SoA layout is more complex to maintain (parallel arrays must stay in sync). Tradeoff against code clarity and locality of related fields.

7. Bloom Filter Pre-Check for Rare Values¶

Naive¶

def search(arr, target):
    for i, v in enumerate(arr):
        if v == target:
            return i
    return -1

If target is rarely present (say, 1% of queries find a match), 99% of queries scan the whole array fruitlessly.

Optimized¶

class BloomFilteredArray:
    def __init__(self, arr, m_bits=10000, k_hashes=7):
        self.arr = arr
        self.bits = bytearray(m_bits // 8)
        self.k = k_hashes
        for v in arr:
            for h in self._hashes(v):
                self.bits[h // 8] |= (1 << (h % 8))

    def _hashes(self, v):
        # In real code, use mmh3 or xxhash with k different seeds.
        import hashlib
        for i in range(self.k):
            h = int(hashlib.md5(f"{i}-{v}".encode()).hexdigest(), 16)
            yield h % (len(self.bits) * 8)

    def search(self, target):
        for h in self._hashes(target):
            if not (self.bits[h // 8] & (1 << (h % 8))):
                return -1   # definitely absent
        # Possibly present — fall back to linear scan
        for i, v in enumerate(self.arr):
            if v == target:
                return i
        return -1   # false positive from Bloom

Speedup Target¶

~10-100× when miss rate is high (most queries return -1 in O(k) instead of O(n)).

Caveat¶

One-time O(n) build cost.
Extra O(m) memory.
Not worth it if hits are common (you'd scan anyway).
Bloom filter requires updating when array changes.

8. Use the Language Built-In¶

Naive¶

def my_search(arr, target):
    for i, v in enumerate(arr):
        if v == target:
            return i
    return -1

Optimized¶

try:
    idx = arr.index(target)   # implemented in C
except ValueError:
    idx = -1

import "slices"
idx := slices.Index(arr, target)   // SIMD-tuned for many types

int idx = list.indexOf(target);

Speedup Target¶

~50× in Python (avoids the interpreter loop), ~2-8× in Go/Java (uses SIMD for byte arrays).

Caveat¶

You lose flexibility: built-ins use == / equals. For custom predicates, you still hand-roll.

9. Jump Search (sqrt(n) Hybrid)¶

Naive¶

Linear scan: O(n).

Optimized (only on sorted arrays)¶

import math

def jump_search(arr, target):
    n = len(arr)
    step = int(math.sqrt(n))
    prev = 0
    while prev < n and arr[min(step, n) - 1] < target:
        prev = step
        step += int(math.sqrt(n))
        if prev >= n:
            return -1
    # Linear scan within the block
    while prev < min(step, n) and arr[prev] < target:
        prev += 1
    if prev < n and arr[prev] == target:
        return prev
    return -1

Speedup Target¶

O(√n) — much better than linear's O(n), worse than binary's O(log n).

When It Wins¶

Cache-friendly: large jumps make better use of prefetching than binary search's bisecting jumps.
Useful on sorted data where reading is much cheaper than seeking (e.g., on disk or tape).
Practical on huge sorted arrays where binary search's pointer-jumping kills the cache.

Caveat¶

Requires sorted data. Almost always beaten by binary search in RAM-resident scenarios. Mostly a curiosity; interpolation search (O(log log n) on uniform data) is more useful.

10. Self-Organizing List (Move-to-Front)¶

Naive¶

arr = [...]   # static order

def search(arr, target):
    for i, v in enumerate(arr):
        if v == target:
            return i
    return -1

Optimized¶

class MTFList:
    def __init__(self, items):
        self.items = list(items)

    def search(self, target):
        for i, v in enumerate(self.items):
            if v == target:
                # Move to front
                self.items.pop(i)
                self.items.insert(0, target)
                return 0
        return -1

Speedup Target¶

Uniform access: no improvement.
Zipf-distributed access (top 20% of keys = 80% of queries): up to 5× speedup.
Hot-key access (one key dominates): converges to O(1) after warmup.

Caveat¶

Mutation breaks thread safety without locking.
Adversarial workloads can trigger constant rearrangement (worst case O(n²)).
Counter-intuitive contract: search has side effects.

11. Switch to Binary Search if Sorted¶

Naive¶

sorted_arr = [...]   # static, sorted

def linear(arr, target):
    for i, v in enumerate(arr):
        if v == target:
            return i
    return -1

Optimized¶

import bisect

def binary(arr, target):
    i = bisect.bisect_left(arr, target)
    if i < len(arr) and arr[i] == target:
        return i
    return -1

Speedup Target¶

n = 1,000: 100× faster.
n = 1,000,000: 50,000× faster.

When It Doesn't Apply¶

Array isn't sorted (sorting first costs O(n log n) — only worthwhile for many queries).
Array is small (n < ~32) — linear's cache friendliness wins.
Predicate isn't equality (binary search needs an ordering).

Sorted-Array Maintenance¶

If the array changes frequently, keeping it sorted costs O(n) per insert. If reads dominate, sort once and binary-search. If writes dominate, switch to a balanced BST or skip list (O(log n) inserts).

12. Aho-Corasick for Multi-Pattern Search¶

Naive¶

def find_any(text, patterns):
    for p in patterns:
        idx = text.find(p)
        if idx != -1:
            return (p, idx)
    return None
# Cost: O(|text| × |patterns|)

Optimized¶

import ahocorasick   # pip install pyahocorasick

def build_ac(patterns):
    ac = ahocorasick.Automaton()
    for p in patterns:
        ac.add_word(p, p)
    ac.make_automaton()
    return ac

def find_any_ac(text, ac):
    for end_idx, pattern in ac.iter(text):
        return (pattern, end_idx - len(pattern) + 1)
    return None
# Cost: O(|text| + sum(|patterns|) + matches), one pass through text.

Speedup Target¶

~|patterns|× — finding 100 patterns in a 1 MB text drops from 100 ms to 1 ms.

Use Cases¶

IDS / IPS: scan packets for thousands of attack signatures.
Spam filters: match against millions of known phrases.
Lexers: tokenize source code with hundreds of keywords.
Bioinformatics: search DNA for many gene sequences.

Caveat¶

Building the automaton is O(sum of pattern lengths).
Memory grows with patterns + alphabet.
Overkill for 1-2 patterns — use Boyer-Moore or built-in find.

Summary Table¶

#	Technique	Wins When	Speedup	Complexity Cost
1	Sentinel	Tight integer loops, mutable buffer	5-15%	Mutates array, low
2	Early exit	Always (for "find first")	Up to 2×	None
3	Parallel	Large n, multi-core	Up to `p`×	Coordination, threads
4	SIMD	Hot loops, primitive types	4-16×	Platform-specific
5	Branch-free	Crypto / constant-time required	Slower	Always full scan
6	SoA layout	Single-field search on big structs	4-16×	Refactor data layout
7	Bloom filter	High miss rate	10-100×	O(n) build, extra mem
8	Built-in	General use	2-50×	None
9	Jump search	Huge sorted arrays, slow seeks	√n / n	Requires sorted
10	MTF	Skewed access pattern	Up to N×	Mutation, side effects
11	Binary search	Sorted, frequent queries	log n / n	Requires sorted
12	Aho-Corasick	Multi-pattern search	`\|patterns\|`×	Build cost, memory

Decision Flowchart¶

Need to find target in collection
        │
        ├─ n < 32 ?
        │     └─ YES → naive linear search (cache wins)
        │
        ├─ Is data sorted?
        │     └─ YES → binary search (#11)
        │
        ├─ Repeated queries on same data?
        │     └─ YES → build hash set / index (one-time O(n) cost)
        │
        ├─ Multiple patterns to find at once?
        │     └─ YES → Aho-Corasick (#12)
        │
        ├─ Most queries miss (target absent)?
        │     └─ YES → Bloom filter pre-check (#7)
        │
        ├─ Hot loop, primitives, room for SIMD?
        │     └─ YES → SIMD (#4) or use built-in (#8)
        │
        ├─ Multi-core idle and n very large?
        │     └─ YES → parallel (#3)
        │
        └─ Otherwise → naive linear with early-exit (#2)

The takeaway: most "linear search optimizations" are really just choosing a different algorithm. Pure linear search optimization (sentinel, SIMD, parallel) gets you constant-factor speedups. Algorithmic switches (binary, hash, Aho-Corasick) get you asymptotic improvements. Always profile first; then choose the smallest hammer that solves your problem.

Linear Search — Optimization Exercises¶

Table of Contents¶

1. Sentinel to Skip Bound Check¶

Naive¶

Optimized¶

Speedup Target¶

Caveat¶

2. Early-Exit on Found¶

Naive¶

Optimized¶

Speedup Target¶

Caveat¶

3. Parallel Search Across Chunks¶

Naive¶

Optimized — Go¶

Speedup Target¶

Caveat¶

4. SIMD Vectorized Comparison¶

Naive¶

Optimized — Go (using bytes.IndexByte for byte-array case)¶

Optimized — C with AVX2 intrinsics¶

Optimized — Python (NumPy)¶

Speedup Target¶

Caveat¶

5. Branch-Free Comparison¶

Naive (branchy)¶

Optimized (branch-free)¶

Speedup Target¶

Caveat¶

6. Cache-Line Aware Iteration¶

Naive (Array-of-Structs)¶

Optimized (Struct-of-Arrays)¶

Speedup Target¶

Caveat¶

7. Bloom Filter Pre-Check for Rare Values¶

Naive¶

Optimized¶

Speedup Target¶

Caveat¶

8. Use the Language Built-In¶

Naive¶

Optimized¶

Speedup Target¶

Caveat¶

9. Jump Search (sqrt(n) Hybrid)¶

Naive¶

Optimized (only on sorted arrays)¶

Speedup Target¶

When It Wins¶

Caveat¶

10. Self-Organizing List (Move-to-Front)¶

Naive¶

Optimized¶

Speedup Target¶

Caveat¶

11. Switch to Binary Search if Sorted¶

Naive¶

Optimized¶

Speedup Target¶

When It Doesn't Apply¶

Sorted-Array Maintenance¶

12. Aho-Corasick for Multi-Pattern Search¶

Naive¶

Optimized¶

Speedup Target¶

Use Cases¶

Caveat¶

Summary Table¶

Decision Flowchart¶

Optimized — Go (using `bytes.IndexByte` for byte-array case)¶