Bit-Parallel Techniques (Word-Level Parallelism / Bitset Acceleration) — Middle Level¶

Focus: the full bitset operation set on a multi-word uint64[], the cross-word left shift that powers B |= B << w, boolean reachability / transitive closure by OR-ing rows, and bit-parallel substring search with Shift-Or (Bitap) — all unified by the rule operate on whole words, get 64 lanes per instruction.

Introduction¶

At junior level the message was a single line: B |= B << w computes subset-sum reachability with a 64× constant-factor win. At middle level you implement the bitset properly — as a multi-word uint64[] with correct cross-word shifting — and you learn that the same word-parallel idea is the engine behind three more classic speedups:

The full bitset API: set/clear/test plus word-parallel and/or/xor/not, count (popcount), and findFirstSet (ctz).
Boolean reachability / transitive closure: represent each adjacency row as a bitset and OR rows together — O(V³/64) instead of O(V³).
Shift-Or / Bitap: track all in-progress matches of a length-≤ 64 pattern in one register, scanning text in O(m) with a tiny constant.
The cross-word shift details that the junior level hid behind Python's big integers.

These are the questions that separate "I can write B |= B << w when the language gives me big integers" from "I can implement bit-parallel acceleration correctly in a fixed-width language and recognize when it applies."

Deeper Concepts¶

The bitset as a `uint64[]`¶

A bitset over [0, T) is an array of ⌈T/64⌉ words. The addressing is the same everywhere:

word(s)  = s >> 6      // s / 64
posn(s)  = s & 63      // s % 64
mask(s)  = 1 << posn(s)

The single-bit ops are O(1); the whole-bitset ops are O(T/64):

Op	Per word	Meaning
`and(A,B)`	`C[i] = A[i] & B[i]`	set intersection
`or(A,B)`	`C[i] = A[i] \\| B[i]`	set union
`xor(A,B)`	`C[i] = A[i] ^ B[i]`	symmetric difference
`andNot(A,B)`	`C[i] = A[i] & ~B[i]`	set difference
`count(A)`	`Σ popcount(A[i])`	cardinality
`ffs(A)`	first nonzero word, then `ctz`	minimum element

The cross-word left shift (the heart of `B |= B << w`)¶

When w is not a multiple of 64, shifting the whole bitset left by w moves each word's high bitShift bits into the next higher word. Decompose w:

wordShift = w >> 6        // whole words to move up
bitShift  = w & 63        // bits within a word

The shifted value landing in output word i comes from source word i - wordShift, shifted left by bitShift, plus the top bitShift bits of word i - wordShift - 1:

out[i] = (src[i-wordShift] << bitShift)
       | (src[i-wordShift-1] >> (64 - bitShift))   // only if bitShift > 0

Two correctness rules:

Direction. For an in-place left shift you must iterate high index → low index, or you overwrite a source word before reading it.
The bitShift == 0 trap. x >> (64 - 0) is x >> 64, which is undefined in Go/Java. Guard the carry term with if bitShift > 0.

Python sidesteps all of this: a Python int is an arbitrary-width bitset and B |= B << w carries across "words" automatically at C speed.

Why `B |= B << w` is exactly subset-sum¶

Bit s set means "sum s reachable". B << w relabels bit s as bit s + w — i.e. "for every reachable s, the sum s + w (take this item) is also reachable". The OR unions "don't take it" (old B) with "take it" (B << w). One shift + one OR processes all T/64 words, 64 candidate sums per instruction.

popcount and ffs: counting and searching¶

popcount (hardware POPCNT) gives the number of reachable sums in O(T/64). findFirstSet/ctz gives the smallest set bit: scan words for the first nonzero, then ctz that word. Skipping zero words makes ffs fast on sparse bitsets.

A worked cross-word shift¶

Suppose w = 70 (so wordShift = 1, bitShift = 6) and a bitset with B[0] having bit 60 set. Shifting left by 70 should land that bit at position 60 + 70 = 130, i.e. word 2, bit 2:

out[2] = (B[1] << 6) | (B[0] >> 58)   # B[0] bit 60 -> (60-58)=2 of out[2]

The high bitShift = 6 bits of B[0] (positions 58–63) carry into the low 6 bits of word 2. Bit 60 is among them, landing at out[2] bit 60 - (64 - 6) = 2. This is exactly the carry term B[src-1] >> (64 - bitShift). Trace through it once by hand and the formula stops being mysterious.

Why the OR keeps both alternatives¶

In B |= B << w, the OR unions two sets: the unshifted B (sums reachable without the new item) and the shifted B << w (sums reachable with it). Neither is discarded, which is exactly the subset-sum recurrence "take it or leave it". If you wrote B = B << w (assignment, not OR-assign), you would force taking every item — a different and wrong computation. The |= is load-bearing.

Comparison with Alternatives¶

Bitset vs boolean array vs hash set¶

Structure	Membership	AND/OR of two sets	Memory for `T` slots	Best when
`bool[]`	O(1)	O(T)	T bytes	tiny, readability over speed
bitset `uint64[]`	O(1)	O(T/64)	T/8 bytes	dense booleans, many set ops
hash set	O(1) avg	O(size)	~size·constant	sparse sets, huge universe

The bitset wins decisively when the universe T is moderate (fits in memory as T/64 words) and the booleans are dense — exactly the subset-sum and reachability regimes. For sparse sets over a gigantic universe, a hash set's O(size) beats the bitset's O(T/64).

Subset-sum: bitset shift vs scalar DP¶

Approach	Time	Notes
Recursive enumeration	O(2ⁿ)	hopeless beyond ~25 items
Scalar boolean DP	O(n·T)	the textbook table
Bitset shift `B \\|= B << w`	O(n·T/64)	same class, 64× smaller constant

Concrete (n = 1000, T = 10^5):

Approach	Operations	Verdict
Scalar DP	10⁸ bool writes	~0.1–1 s
Bitset shift	10⁸ / 64 ≈ 1.6·10⁶ word ops	a few ms

String search: Shift-Or vs KMP vs naive¶

Algorithm	Preprocess	Search	Pattern length limit	Notes
Naive	—	O(n·m)	none	slow on bad inputs
KMP	O(m)	O(n)	none	optimal asymptotically
Shift-Or (Bitap)	O(σ + m)	O(n) with tiny constant	≤ word width (64) per word	branch-free; extends to fuzzy match

Shift-Or is not asymptotically better than KMP, but its inner loop is two or three branch-free word ops, so in practice it is extremely fast for short patterns and trivially extends to approximate matching (Shift-Add, Bitap with errors).

Advanced Patterns¶

Pattern: in-place cross-word `B |= B << w`¶

The production subset-sum core. Iterate high→low, carry across words, guard the bitShift == 0 case. (Full code in Code Examples.)

Pattern: bounded subset-sum (cap the bitset width)¶

If only totals up to a cap C matter, size the bitset to C+1 words and let bits beyond C fall off the end — they are answers you do not care about. This bounds memory and time to O(n·C/64) regardless of the raw item magnitudes.

Pattern: "which items" reconstruction¶

Reachability tells you that a sum is achievable, not which items. Keep a per-item snapshot (or recompute): after the full bitset is built, walk items backward, and for target t, if t - w was reachable before item w was added, item w is part of a solution; recurse on t - w. This is O(n) extra per reconstruction.

Pattern: word-parallel set algebra¶

Union, intersection, difference, and complement of dense sets are one loop of word ops each — the foundation for reachability closure and for filtering.

graph TD A["bitset rows = adjacency"] --> B["reach[v] = neighbors of v"] B --> C["repeat: reach[v] |= reach[u] for u in reach[v]"] C --> D["fixed point = transitive closure"] D --> E["test reach[v] bit w: is w reachable from v?"]

Graph and Reachability Applications¶

Boolean transitive closure by OR-ing rows¶

Represent the graph as V bitsets: row[v] has bit u set iff there is an edge v → u. Reachability closure (Warshall-style) becomes:

for k in 0..V-1:
    for i in 0..V-1:
        if row[i] has bit k set:
            row[i] |= row[k]      # one word-parallel OR over V/64 words

Each inner row[i] |= row[k] is O(V/64) instead of O(V), so the whole closure is O(V³/64) instead of O(V³). For V in the low thousands this is the difference between feasible and not. This is the canonical "boolean matrix multiply / reachability" bitset acceleration, and it composes with binary exponentiation if you need exactly-k-step reachability (square the boolean adjacency bitset).

Reachability in exactly `k` steps¶

Boolean matrix power: (A^k)[i][j] = 1 iff a length-k walk exists. Packing each row as a bitset makes each boolean multiply O(V³/64); binary exponentiation gives A^k in O(V³/64 · log k).

Worked reachability example¶

Graph 0→1, 1→2, 2→0. Rows as bitsets: row[0]={1}, row[1]={2}, row[2]={0}. After adding reflexivity and running the closure, row[0] becomes {0,1,2} — vertex 0 reaches all three (it is a cycle). Each absorb step row[i] |= row[k] is one word-parallel OR. For V ≤ 64 the whole adjacency row is a single uint64, so the entire closure is a tight double loop of register ORs — extremely fast, the canonical demonstration of why dependency-graph reachability uses bitsets in compilers and build systems.

Bitset Boolean multiply, explicitly¶

C[i] = OR over t of (A[i] has bit t set ? B[t] : 0)

i.e. output row i is the OR of the source rows B[t] for every t where A[i] has an edge. This is the row-bitset multiply; it composes with binary exponentiation (square the boolean adjacency) for exact-k reachability. The identity matrix is the reflexive bitset (I[i] = only bit i).

Bit-Parallel String Matching (Shift-Or)¶

Shift-Or (a.k.a. Bitap) finds all occurrences of a pattern P of length m ≤ 64 in a text T, scanning each text character with a couple of word operations. It is the cleanest illustration that an automaton's whole state can live in one register.

The idea: state bits track in-progress matches¶

Encode a state word R where bit j is 0 iff the pattern prefix P[0..j] currently matches the text ending at the current position. (Shift-Or uses 0 = "matched" so a left shift naturally introduces a fresh 0 at the bottom.) Precompute, for each alphabet character c, a mask mask[c] whose bit j is 0 iff P[j] == c.

The per-character update is:

R = (R << 1) | mask[text[i]]

R << 1 advances every in-progress match by one position (and shifts a fresh 0 into bit 0, optimistically starting a new match).
| mask[c] invalidates (sets to 1) every position j where P[j] != c.

A full match ending at position i is detected when bit m-1 of R is 0:

if (R & (1 << (m-1))) == 0:  report match ending at i

Why it is correct (state automaton encoded in bits)¶

R is the bit-encoded state of the nondeterministic prefix automaton: bit j low ⇔ "the first j+1 characters of P align with the text ending here". The shift moves the "how far have we matched" frontier forward by one; OR-ing the character mask kills alignments that the new character contradicts. Because all m prefix-states live in distinct bit lanes of one word, all transitions happen in a single << and | — the entire automaton steps in O(1) word ops per text character.

Initialization and conventions¶

Start R = ~0 (all 1 = no match in progress). Build masks with mask[c] all-ones, then clear bit j for each j with P[j] == c. The "0 means matched" convention is what makes the optimistic << introduce a starting 0. (A dual "1 means matched" Shift-And formulation exists; pick one and be consistent.)

Extensions¶

Patterns longer than 64: use multiple words and carry the shift across them (like the subset-sum cross-word shift).
Approximate matching (Bitap with k errors): keep k+1 state words R₀..R_k; R_d allows up to d edits, updated from R_d, R_{d-1} via shift/OR/AND. This is the basis of agrep and fuzzy search.

Worked Shift-Or trace¶

For P = "abab", mask['a'] = …1010 (bits 0,2 cleared), mask['b'] = …0101 (bits 1,3 cleared). Scanning "ababcabab" and watching the low 4 bits of R, a match is reported when bit 3 (= m-1) becomes 0. This happens at text index 3 (the substring starting at 0) and again at index 8 (starting at 5). Two matches, found with one shift + one OR + one test per character — the automaton stepping entirely inside one register.

Code Examples¶

Multi-word bitset with cross-word shift + subset-sum (counting semiring of booleans)¶

Go¶

package main

import (
    "fmt"
    "math/bits"
)

type Bitset struct{ w []uint64 }

func NewBitset(nbits int) *Bitset { return &Bitset{w: make([]uint64, (nbits+63)>>6)} }
func (b *Bitset) Set(s int)       { b.w[s>>6] |= 1 << uint(s&63) }
func (b *Bitset) Test(s int) bool { return b.w[s>>6]>>uint(s&63)&1 == 1 }
func (b *Bitset) Count() int {
    c := 0
    for _, x := range b.w {
        c += bits.OnesCount64(x)
    }
    return c
}

// ShiftLeftOr performs b |= b << amt, carrying across words.
func (b *Bitset) ShiftLeftOr(amt int) {
    ws, bs := amt>>6, uint(amt&63)
    for i := len(b.w) - 1; i >= 0; i-- {
        var v uint64
        if src := i - ws; src >= 0 {
            v = b.w[src] << bs
            if bs > 0 && src-1 >= 0 {
                v |= b.w[src-1] >> (64 - bs)
            }
        }
        b.w[i] |= v
    }
}

func subsetSum(weights []int, cap int) *Bitset {
    b := NewBitset(cap + 1)
    b.Set(0)
    for _, w := range weights {
        b.ShiftLeftOr(w)
    }
    return b
}

func main() {
    b := subsetSum([]int{3, 1, 2, 64, 65}, 200)
    fmt.Println(b.Test(67), b.Test(200), b.Count()) // cross-word sums work
}

Java¶

public class Bitset {
    final long[] w;
    Bitset(int nbits) { w = new long[(nbits + 63) >> 6]; }
    void set(int s)     { w[s >> 6] |= 1L << (s & 63); }
    boolean test(int s) { return ((w[s >> 6] >>> (s & 63)) & 1L) == 1L; }
    int count() { int c = 0; for (long x : w) c += Long.bitCount(x); return c; }

    void shiftLeftOr(int amt) {
        int ws = amt >> 6, bs = amt & 63;
        for (int i = w.length - 1; i >= 0; i--) {
            long v = 0L;
            int src = i - ws;
            if (src >= 0) {
                v = w[src] << bs;
                if (bs > 0 && src - 1 >= 0) v |= w[src - 1] >>> (64 - bs);
            }
            w[i] |= v;
        }
    }

    static Bitset subsetSum(int[] weights, int cap) {
        Bitset b = new Bitset(cap + 1);
        b.set(0);
        for (int x : weights) b.shiftLeftOr(x);
        return b;
    }

    public static void main(String[] args) {
        Bitset b = subsetSum(new int[]{3, 1, 2, 64, 65}, 200);
        System.out.println(b.test(67) + " " + b.test(200) + " " + b.count());
    }
}

Python¶

def subset_sum(weights, cap=None):
    B = 1
    for w in weights:
        B |= B << w
    if cap is not None:
        B &= (1 << (cap + 1)) - 1  # drop sums beyond the cap
    return B


if __name__ == "__main__":
    B = subset_sum([3, 1, 2, 64, 65], cap=200)
    print((B >> 67) & 1, (B >> 200) & 1, bin(B).count("1"))

Shift-Or (Bitap) exact substring search¶

Go¶

package main

import "fmt"

// shiftOr returns starting indices of all matches of pat (len <= 64) in text.
func shiftOr(text, pat string) []int {
    m := len(pat)
    var mask [256]uint64
    for i := range mask {
        mask[i] = ^uint64(0)
    }
    for j := 0; j < m; j++ {
        mask[pat[j]] &^= 1 << uint(j) // clear bit j (0 = matches)
    }
    var R uint64 = ^uint64(0)
    matchBit := uint64(1) << uint(m-1)
    var res []int
    for i := 0; i < len(text); i++ {
        R = (R << 1) | mask[text[i]]
        if R&matchBit == 0 {
            res = append(res, i-m+1)
        }
    }
    return res
}

func main() {
    fmt.Println(shiftOr("ababcabab", "abab")) // [0 5]
}

Java¶

import java.util.*;

public class ShiftOr {
    static List<Integer> search(String text, String pat) {
        int m = pat.length();
        long[] mask = new long[256];
        Arrays.fill(mask, ~0L);
        for (int j = 0; j < m; j++)
            mask[pat.charAt(j)] &= ~(1L << j);   // 0 = matches
        long R = ~0L, matchBit = 1L << (m - 1);
        List<Integer> res = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            R = (R << 1) | mask[text.charAt(i)];
            if ((R & matchBit) == 0) res.add(i - m + 1);
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(search("ababcabab", "abab")); // [0, 5]
    }
}

Python¶

def shift_or(text, pat):
    m = len(pat)
    mask = [~0] * 256
    for j, ch in enumerate(pat):
        mask[ord(ch)] &= ~(1 << j)        # 0 = matches
    R = ~0
    match_bit = 1 << (m - 1)
    full = (1 << 64) - 1                  # keep R in 64 bits like Go/Java
    res = []
    for i, ch in enumerate(text):
        R = ((R << 1) | mask[ord(ch)]) & full
        if R & match_bit == 0:
            res.append(i - m + 1)
    return res


if __name__ == "__main__":
    print(shift_or("ababcabab", "abab"))  # [0, 5]

Error Handling¶

Scenario	What goes wrong	Correct approach
`bitShift == 0` carry	`x >> (64 - 0)` is `>> 64`, undefined.	Guard the carry term with `if bitShift > 0`.
Wrong shift direction	In-place left shift low→high overwrites unread source words.	Iterate high→low for left shift.
Pattern longer than word	Shift-Or with `m > 64` silently drops the high bits.	Multi-word Shift-Or, or use KMP for long patterns.
Java arithmetic shift	`>>` on `long` leaks the sign bit.	Use `>>>` (logical) for unsigned moves.
Python R grows unbounded	Without masking, `R << 1` keeps widening the int.	AND with `(1<<64)-1` each step to emulate fixed width.
Bitset undersized	Sums beyond the array vanish.	Size to `sum(weights)` or the cap.
Wrong match convention	Mixing "0 = match" and "1 = match" Shift-Or/Shift-And.	Pick one convention; build masks accordingly.

Performance Analysis¶

Workload	Scalar	Bitset (`/64`)	Speedup
Subset-sum, `n=10³`, `T=10⁵`	10⁸ writes	~1.6·10⁶ word ops	~64×
Transitive closure, `V=2000`	8·10⁹	1.25·10⁸ word ops	~64×
Shift-Or, text `10⁷`, `m≤64`	n/a	~2·10⁷ word ops	branch-free, cache-friendly

The dominant cost is the number of word operations, T/64 per pass. The single biggest practical levers:

Skip empty words in ffs/scan loops.
Native popcount/ctz instead of bit-by-bit counting.
Flat contiguous word array for cache locality (8× density vs bool[]).
In Python, use native big-int shifts — they run in C and dwarf any hand-rolled word loop.
Skip zero words in the cross-word shift and in ffs scans; a long run of zero words costs nothing if you detect and jump over it.
Cap the universe to the largest target you care about; an un-capped subset-sum bitset can grow to sum(weights) bits and blow the cache.

import time

def bench(n, t):
    import random
    weights = [random.randint(1, t // n) for _ in range(n)]
    start = time.perf_counter()
    B = 1
    for w in weights:
        B |= B << w
    return time.perf_counter() - start, bin(B).count("1")

# Typical: n=1000, t=10**5 -> a few ms in CPython, since << and | are C-level.

Best Practices¶

Implement the cross-word shift once, test it hard: it is the single most bug-prone routine. Compare against the scalar oracle on random inputs.
Choose a fixed match convention (0=match for Shift-Or) and document it.
Size bitsets to the real maximum (sum(weights) or a cap); never guess.
Use logical shifts (>>> in Java; unsigned in Go) to avoid sign-bit leaks.
Prefer the language's native bitset when available: Python int, Java BitSet, C++ std::bitset/dynamic_bitset — only hand-roll uint64[] when you need explicit control or shift-OR.
Validate against the scalar DP / KMP on small inputs before trusting any bit-parallel routine.

Visual Animation¶

See animation.html for an interactive view.

The middle-level animation highlights: - The reachable-sums bit array as a row of cells, lit iff the sum is reachable. - Each item performing B |= B << w: the shifted copy slides up by w, newly-set bits flash green, then OR-merge. - A live popcount (count of reachable totals) and an editable item list.

Summary¶

Mini-recap of the three accelerations¶

Subset-sum: B |= B << w per item, O(n·T/64) — the shift adds w to all reachable sums.
Reachability: OR adjacency-row bitsets, O(V³/64) — absorb a target's reachable set with one word-parallel OR.
Shift-Or: R = (R << 1) | mask[c] per text char, O(n) — the whole prefix automaton steps in one register.

All three are the same idea (operate on whole words) on three different boolean structures.

A bitset is an array of 64-bit words; whole-bitset boolean ops (and/or/xor/andNot, count, ffs) run in O(T/64), giving the flat 64× win. The implementation crux is the cross-word left shift behind B |= B << w: split w into wordShift and bitShift, iterate high→low in place, carry the top bitShift bits into the next word, and guard the bitShift == 0 case. The same word-parallel principle powers boolean transitive closure (OR adjacency-row bitsets, O(V³/64)) and Shift-Or / Bitap substring search, where the entire prefix automaton lives in one register and steps with a single (R << 1) | mask[c]. None of these change the asymptotic class — they multiply throughput by the word width. Master the multi-word bitset, the cross-word shift, and the Shift-Or state encoding, and one small toolkit accelerates subset-sum, reachability, and string matching alike.

Bit-Parallel Techniques (Word-Level Parallelism / Bitset Acceleration) — Middle Level¶

Table of Contents¶

Introduction¶

Deeper Concepts¶

The bitset as a uint64[]¶

The cross-word left shift (the heart of B |= B << w)¶

Why B |= B << w is exactly subset-sum¶

popcount and ffs: counting and searching¶

A worked cross-word shift¶

Why the OR keeps both alternatives¶

Comparison with Alternatives¶

Bitset vs boolean array vs hash set¶

Subset-sum: bitset shift vs scalar DP¶

String search: Shift-Or vs KMP vs naive¶

Advanced Patterns¶

Pattern: in-place cross-word B |= B << w¶

Pattern: bounded subset-sum (cap the bitset width)¶

Pattern: "which items" reconstruction¶

Pattern: word-parallel set algebra¶

Graph and Reachability Applications¶

Boolean transitive closure by OR-ing rows¶

Reachability in exactly k steps¶

Worked reachability example¶

Bitset Boolean multiply, explicitly¶

Bit-Parallel String Matching (Shift-Or)¶

The idea: state bits track in-progress matches¶

Why it is correct (state automaton encoded in bits)¶

Initialization and conventions¶

Extensions¶

Worked Shift-Or trace¶

Code Examples¶

Multi-word bitset with cross-word shift + subset-sum (counting semiring of booleans)¶

Go¶

Java¶

Python¶

Shift-Or (Bitap) exact substring search¶

Go¶

Java¶

Python¶

Error Handling¶

Performance Analysis¶

Best Practices¶

Visual Animation¶

Summary¶

Mini-recap of the three accelerations¶

The bitset as a `uint64[]`¶

The cross-word left shift (the heart of `B |= B << w`)¶

Why `B |= B << w` is exactly subset-sum¶

Pattern: in-place cross-word `B |= B << w`¶

Reachability in exactly `k` steps¶