Montgomery Multiplication (and Barrett Reduction) — Practice Tasks¶
All tasks must be solved in Go, Java, and Python. Each task ships with starter code or a precise spec in all three languages. Implement the Montgomery/Barrett logic and validate against the evaluation criteria. Always test against the plain
(a*b) % noracle (and the built-inpow(a, b, n)for modpow) on small inputs before trusting the fast version.
Beginner Tasks (5)¶
Task 1 — Compute n' = -n^{-1} mod R¶
Problem. Given an odd modulus n and R = 2^k, compute n' = (-n^{-1}) mod R two ways: Newton iteration and extended Euclid. Verify they agree and that n·n' ≡ R - 1 (mod R).
Input / Output spec. - Read n (odd) and k. - Print n' and OK if (n * n') mod R == R - 1, else FAIL.
Constraints. - n odd, 1 ≤ k ≤ 64, R = 2^k > n.
Hint. Newton: start inv = n (correct mod 8), repeat inv = inv*(2 - n*inv) mod R until stable. Then n' = (-inv) mod R.
Starter — Go.
package main
import "fmt"
func nPrime(n uint64, k uint) uint64 {
// TODO: Newton iteration for n^{-1} mod 2^k, then negate mod 2^k.
return 0
}
func main() {
var n uint64
var k uint
fmt.Scan(&n, &k)
np := nPrime(n, k)
mask := (uint64(1) << k) - 1
fmt.Println(np)
if (n*np)&mask == mask {
fmt.Println("OK")
} else {
fmt.Println("FAIL")
}
}
Starter — Java.
import java.util.*;
public class Task1 {
static long nPrime(long n, int k) {
// TODO: Newton iteration mod 2^k, then negate
return 0;
}
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
long n = sc.nextLong();
int k = sc.nextInt();
long np = nPrime(n, k);
long mask = (1L << k) - 1;
System.out.println(np);
System.out.println(((n * np) & mask) == mask ? "OK" : "FAIL");
}
}
Starter — Python.
def n_prime(n, k):
# TODO: Newton iteration for n^{-1} mod 2^k, then negate
return 0
def main():
n, k = map(int, input().split())
R = 1 << k
np = n_prime(n, k)
print(np)
print("OK" if (n * np) % R == R - 1 else "FAIL")
if __name__ == "__main__":
main()
Evaluation criteria. - Newton and extended-Euclid results match. - n·n' ≡ -1 (mod R) holds. - Rejects even n (no inverse mod 2^k).
Task 2 — Implement REDC¶
Problem. Implement REDC(T) for 0 ≤ T < n·R returning T·R^{-1} mod n, using n' from Task 1. Use only mask (mod R), shift (÷ R), multiplies, and one conditional subtraction.
Input / Output spec. - Read n (odd), k, and T. - Print REDC(T).
Constraints. n odd, R = 2^k > n, 0 ≤ T < n·R.
Hint. m = (T mod R)·n' mod R; t = (T + m·n) / R; if t >= n: t -= n. Verify T + m·n is divisible by R.
Reference oracle (Python).
def redc_oracle(T, n, k):
R = 1 << k
Rinv = pow(R, -1, n) # slow reference using a real inverse
return (T * Rinv) % n
Evaluation criteria. - Matches redc_oracle for many random T. - The division by R is exact ((T + m·n) % R == 0). - Output in [0, n).
Task 3 — Convert in and out of Montgomery form¶
Problem. Using REDC, implement to_mont(x) = x·R mod n (via REDC(x·(R² mod n))) and from_mont(x̄) = REDC(x̄). Verify from_mont(to_mont(x)) == x for all x in [0, n).
Input / Output spec. - Read n (odd), k. For each x in 0..min(n,1000)-1, check the round-trip. - Print OK if all round-trips hold, else the first failing x.
Constraints. n odd, R = 2^k > n.
Hint. Precompute R² mod n once. to_mont(1) must equal R mod n (the Montgomery identity).
Evaluation criteria. - from_mont(to_mont(x)) == x for all tested x. - to_mont(1) == R mod n. - to_mont uses no division by n (only REDC).
Task 4 — Montgomery multiplication¶
Problem. Implement mont_mul(ā, b̄) = REDC(ā·b̄) and verify that for a, b in [0, n), from_mont(mont_mul(to_mont(a), to_mont(b))) == (a*b) % n.
Input / Output spec. - Read n (odd), k, a, b. - Print (a*b) mod n computed via Montgomery, and OK/FAIL against (a*b) % n.
Constraints. n odd, 0 ≤ a, b < n, R = 2^k > n.
Hint. Stay in Montgomery form: a and b go in, the product stays in form, convert out at the end.
Evaluation criteria. - Matches (a*b) % n for random a, b. - Correct on the boundary a = b = n-1. - No division by n in mont_mul.
Task 5 — Montgomery modular exponentiation¶
Problem. Implement modpow(base, exp) = base^exp mod n entirely in Montgomery form, converting in/out only at the boundaries. Read n (odd), base, exp.
Input / Output spec. - Read n, base, exp. Print base^exp mod n.
Constraints. n odd, 0 ≤ base < 2^63, 0 ≤ exp ≤ 10^18.
Hint. Accumulator starts at to_mont(1). (F: result = mont_mul(result, b)) on set bits; square b each step. Convert out at the end.
Starter — Python.
class Mont:
def __init__(self, n, k=64):
assert n % 2 == 1
self.n, self.k = n, k
self.R, self.mask = 1 << k, (1 << k) - 1
self.np = (-pow(n, -1, self.R)) % self.R
self.r2 = (self.R * self.R) % n
def redc(self, t):
# TODO: m = (t & mask)*np & mask; u = (t + m*n) >> k; subtract if u >= n
return t % self.n # placeholder
def mul(self, a, b): return self.redc(a * b)
def into(self, x): return self.mul(x % self.n, self.r2)
def outof(self, x): return self.redc(x)
def modpow(self, base, exp):
# TODO: res = into(1); b = into(base); binary exponentiation with mul; outof(res)
return pow(base, exp, self.n) # placeholder oracle
if __name__ == "__main__":
m = Mont(1_000_000_007)
assert m.modpow(2, 10) == 1024
assert m.modpow(3, 10**18) == pow(3, 10**18, 1_000_000_007)
print("OK")
Evaluation criteria. - Matches the built-in / plain pow(base, exp, n). - O(log exp) Montgomery multiplies; instant for exp = 10^18. - modpow(base, 0) == 1.
Intermediate Tasks (5)¶
Task 6 — Barrett reduction¶
Problem. Implement Barrett reduction for an arbitrary modulus n (even allowed): precompute μ = ⌊2^{2k}/n⌋ with 2^k > n, then reduce x < n² via q = (x·μ) >> 2k, r = x - q·n, with at most two corrections.
Constraints. 1 ≤ n < 2^32 (so x < n² fits in 64 bits), 0 ≤ x < n².
Hint. Use k = bitlen(n) so 2^k > n, and reduce with the 2k shift. Assert the correction loop runs ≤ 2 times.
Reference oracle (Python).
Starter — Python.
class Barrett:
def __init__(self, n):
self.n = n
self.k = 2 * n.bit_length() + 1
self.mu = (1 << self.k) // n # one-time division
def reduce(self, x): # 0 <= x < n^2
# TODO: q = (x * mu) >> 2k; r = x - q*n; while r >= n: r -= n
return x % self.n # placeholder; replace with Barrett
if __name__ == "__main__":
import random
for n in (1_000_000_006, 1_000_000_007): # even and odd
b = Barrett(n)
for _ in range(10000):
x = random.randrange(n * n)
assert b.reduce(x) == x % n
print("OK")
Starter — Go.
package main
import (
"fmt"
"math/big"
)
type Barrett struct {
N, MU *big.Int
k uint
}
func newBarrett(n uint64) Barrett {
N := new(big.Int).SetUint64(n)
k := uint(2*N.BitLen() + 1)
MU := new(big.Int).Lsh(big.NewInt(1), k)
MU.Div(MU, N)
return Barrett{N: N, MU: MU, k: k}
}
func (b Barrett) reduce(x *big.Int) *big.Int {
// TODO: q = (x*MU) >> k ; r = x - q*N ; correct while r >= N
return new(big.Int).Mod(x, b.N) // placeholder
}
func main() {
b := newBarrett(1_000_000_006)
fmt.Println(b.reduce(big.NewInt(999_999_999_999)))
}
Evaluation criteria. - Matches x % n for random x < n², including even n. - Correction loop runs at most twice (assert it). - No division by n in the hot path (only the one-time μ).
Task 7 — Montgomery vs Barrett vs plain % cross-check¶
Problem. For a fixed odd prime modulus, implement all three mulmods (plain, Montgomery, Barrett) and assert they agree on a large batch of random (a, b) pairs. Report which has the fewest division-like operations.
Constraints. n odd prime < 2^31, at least 10^5 random pairs.
Hint. Plain uses one % n; Montgomery uses REDC (no ÷ n); Barrett uses two multiplies + shift + ≤2 subtracts.
Evaluation criteria. - All three agree on every pair. - A short writeup: Montgomery has the cheapest per-multiply but needs conversions; Barrett needs none but costs an extra multiply; plain needs a division.
Task 8 — Montgomery-based Miller-Rabin¶
Problem. Implement deterministic Miller-Rabin for n < 2^64 using witnesses {2,3,5,7,11,13,17,19,23,29,31,37}, with the witness loop running in Montgomery form (the candidate n is odd after the small-prime screen).
Constraints. 2 ≤ n < 2^64.
Hint. Build the Montgomery context once per n. Precompute one = to_mont(1) and minus_one = to_mont(n-1) to compare against without converting out.
Reference oracle (Python).
def is_prime_oracle(n):
if n < 2:
return False
i = 2
while i * i <= n:
if n % i == 0:
return False
i += 1
return True
# Use only for small n to validate the Montgomery version.
Test vectors (use these to validate).
is_prime(2) = True is_prime(561) = False (Carmichael, fools Fermat but not MR)
is_prime(7) = True is_prime(1_000_000_007) = True
is_prime(1) = False is_prime(1_000_000_011) = False
is_prime(25) = False is_prime(2_147_483_647) = True (Mersenne prime 2^31 - 1)
is_prime(341) = False is_prime(3_215_031_751) = False (strong pseudoprime base 2,3,5,7)
Evaluation criteria. - Matches is_prime_oracle for all n up to 10^6. - Correct on known large primes (1_000_000_007) and composites (561 Carmichael, 1_000_000_011). - Even n > 2 rejected immediately by the small-prime screen. - The witness loop runs in Montgomery form (no ÷ n in the squaring loop).
Task 9 — Pollard-rho with Montgomery inner loop¶
Problem. Implement Pollard-rho factorization where the iteration x ← x² + c (mod n) runs in Montgomery form. Combine with Montgomery Miller-Rabin (Task 8) to fully factor a composite n < 2^64.
Constraints. 2 ≤ n < 2^64, odd composite cofactors.
Hint. Keep x, y, c in Montgomery form (addition is free in the form). Use Brent's cycle detection and batch products before a single gcd. Recurse on found factors.
Evaluation criteria. - Correctly factors semiprimes and multi-factor composites. - Inner loop uses Montgomery mulmod (no ÷ n). - Matches trial-division factorization for small n.
Task 10 — Constant-time conditional subtraction¶
Problem. Replace REDC's branching if t >= n: t -= n with a branchless masked subtraction, and verify it produces identical results to the branching version across random inputs.
Constraints. n odd < 2^63.
Hint. Compute t2 = t - n (may wrap), derive mask from the borrow (all-ones when t < n), and select: result = (t & mask) | (t2 & ~mask). In Go/Java use unsigned comparison to set the mask.
Evaluation criteria. - Branchless and branching REDC agree on all random inputs. - No data-dependent branch in the hot path (inspect the code). - Still correct on the boundary t = n and t = 2n-1.
Advanced Tasks (5)¶
Task 11 — 64-bit R = 2^64 Montgomery with 128-bit products¶
Problem. Implement full 64-bit Montgomery (R = 2^64) using the language's 128-bit product primitive (bits.Mul64, Math.multiplyHigh, __int128/big-int), including the carry-aware conditional subtraction. Support odd n up to nearly 2^64.
Constraints. n odd, n < 2^64 - 1.
Hint. REDC takes (hi, lo); the low limb cancels to 0 (capture only its carry); add into the high limb; the conditional subtract must account for a carry-out of the high word.
Evaluation criteria. - Matches plain (a*b) % n (computed with 128-bit intermediates) for random large n. - Correct for n just below 2^64 (carry-out handled). - modpow matches the built-in for large n.
Task 12 — Redundant [0, 2n) representation (deferred subtraction)¶
Problem. Implement a Montgomery variant that keeps representatives in [0, 2n) and omits the conditional subtraction in the inner loop, normalizing once at the end. Require 4n ≤ R so bounds never overflow.
Constraints. n odd, 4n ≤ R = 2^k.
Hint. Without the subtract, REDC outputs stay in [0, 2n) when inputs are in [0, 2n) and 4n ≤ R. Prove the bound, then drop the branch; normalize the final result with one subtract.
Evaluation criteria. - Inner loop has no conditional subtraction. - Final normalized result matches the standard Montgomery output. - A short proof/argument that the [0, 2n) invariant is maintained.
Task 13 — NTT with Montgomery butterflies¶
Problem. Implement a number-theoretic transform (sibling 13-ntt) under a fixed NTT-friendly prime (e.g. 998244353) where every butterfly multiply is a Montgomery multiply. Convert coefficients into Montgomery form once before the transform and out once after.
Constraints. Power-of-two transform length up to 2^20, prime 998244353.
Hint. Precompute twiddle factors in Montgomery form. Addition/subtraction stay in the form for free; only the twiddle multiply uses REDC. The O(n) conversions are negligible against O(n log n) butterflies.
Evaluation criteria. - Polynomial multiplication via Montgomery NTT matches a naive O(n²) convolution mod the prime. - Conversions happen exactly once each (in/out), not per butterfly. - Measurable speedup over a %-based NTT in a compiled language.
Task 14 — Reducer selection benchmark¶
Problem. For a fixed odd modulus, benchmark plain %-modpow, Montgomery modpow, and Barrett-modpow across a range of exponents. Identify where (if at all) Montgomery overtakes plain, and where Barrett sits. Report a table.
Constraints. n odd < 2^31, exponents ∈ {2^8, 2^16, 2^32, 2^60}, ≥ 3 runs each.
Hint. Build each reducer's constants once. Time only the modpow loops (not setup). In compiled languages the division cost makes Montgomery/Barrett win as the exponent grows; report the crossover.
Evaluation criteria. - Same modulus/exponents across Go, Java, Python; results agree numerically. - Table reports mean over ≥ 3 runs, excluding setup time. - Writeup: Montgomery wins for large exponents (many multiplies); for tiny exponents plain % may tie or win (no conversion to amortize).
Task 15 — Decide the right reducer for a workload¶
Problem. Implement classify(n_is_odd, modulus_is_constant, multiplies_per_modulus, constant_time_required) returning one of PLAIN, MONTGOMERY, BARRETT, or CRT_SPLIT, with justification.
Constraints / cases to handle. - Odd n, many multiplies, runtime modulus → MONTGOMERY. - Even n, many multiplies → BARRETT. - One-off multiply, or compile-time-constant modulus → PLAIN (compiler strength-reduces %). - Even n but Montgomery speed truly required on the odd part → CRT_SPLIT. - Constant-time crypto, odd n → MONTGOMERY (branchless subtract).
Hint. Encode the decision matrix from senior.md §3. The trap is the even-modulus case (Montgomery is invalid) and the compile-time-constant case (the compiler already optimized %).
Evaluation criteria. - Correctly classifies all listed cases. - The even-n branch never returns MONTGOMERY. - Justification cites amortization, the odd-modulus restriction, and compiler strength-reduction.
Benchmark Task¶
Task B — Division-vs-shift micro-benchmark¶
Problem. Empirically demonstrate that the hardware division behind % is the bottleneck. Time a tight loop of (a*b) % n (division) against a tight loop of a single REDC (shift-based) for the same number of iterations, with a runtime (non-constant) modulus so the compiler cannot strength-reduce.
Starter — Python.
import time
class Mont:
def __init__(self, n, k=64):
self.n, self.k = n, k
self.R, self.mask = 1 << k, (1 << k) - 1
self.np = (-pow(n, -1, self.R)) % self.R
self.r2 = (self.R * self.R) % n
def redc(self, t):
m = ((t & self.mask) * self.np) & self.mask
u = (t + m * self.n) >> self.k
return u - self.n if u >= self.n else u
def mul(self, a, b): return self.redc(a * b)
def main():
import random
n = random.choice([1_000_000_007, 998_244_353]) # runtime modulus
m = Mont(n)
a = m.r2 # some Montgomery-form value
iters = 200_000
t0 = time.perf_counter()
acc = 1
for _ in range(iters):
acc = (acc * a) % n # division-based
td = (time.perf_counter() - t0) * 1000
t0 = time.perf_counter()
acc = a
for _ in range(iters):
acc = m.mul(acc, a) # REDC-based (shift)
tr = (time.perf_counter() - t0) * 1000
print(f"division loop: {td:.2f} ms REDC loop: {tr:.2f} ms")
if __name__ == "__main__":
main()
Evaluation criteria. - Same runtime modulus across Go, Java, Python (so no compiler strength-reduction of %). - In compiled languages, the REDC loop is faster than the division loop; report the ratio. - Note that CPython's big-int % is C-level, so the shape (division dominates) matters more than the absolute numbers there; the concrete win is in Go/Java/C. - Report mean over ≥ 3 runs; exclude setup time.
General Guidance for All Tasks¶
- Always validate against the plain oracle first. Every task references
(a*b) % nor the built-inpow(a, b, n). Run it on small inputs, diff, then trust the fast version. - Assert the modulus is odd for Montgomery. Fail loudly; route even moduli to Barrett.
- Verify the magic constant once:
n·n' ≡ -1 (mod R)andfrom_mont(to_mont(x)) == x. - Never skip the conditional subtraction (except in the deliberate redundant-representation Task 12, where you prove the bound).
- Mind the 128-bit product: in Go/Java/C,
a*bof two 64-bit values overflows; usebits.Mul64/Math.multiplyHigh/__int128. - Don't use Montgomery for a single multiply — converting in and out costs more than one
%. - Reuse the context. Compute
n'andR² mod n(or Barrettμ) once per modulus, never per operation. - Remember the Montgomery identity is
R mod n, not1. Accumulators in modpow start atto_mont(1). - Branchless subtract only where you prove the bound (Task 12); elsewhere the plain conditional subtraction is correct and clear.
-
The candidate in Miller-Rabin is odd after the small-prime screen, which is exactly why Montgomery applies in Task 8.
-
Cross-check the magic constant with
n·n' ≡ R - 1 (mod R)in every task that computesn'. - For Barrett, keep inputs
< n²— the quotient bound (≤ 2 corrections) assumes it.
Each task must be implemented and tested in Go, Java, and Python, with identical outputs across all three on the same inputs.