Skip to content

Floating-Point (IEEE 754) — Hands-On Tasks

Topic: Floating-Point (IEEE 754) Focus: Build intuition by making the bugs happen and then fixing them — bit inspection, rounding, cancellation, summation, comparison, money, and reproducibility.


How to use this file

Work top to bottom; difficulty rises. Each task has: - a clear goal and self-check boxes ([ ]) you tick when your output matches, - a Hint (collapsed in your head — try first), - a Solution sketch (sparse — the idea, not full code, except where the exact bits matter).

Use any language with IEEE 754 doubles (all of C, Java, Python, Go, Rust, JS qualify). Where a task needs the exact stored value, print with 17 significant digits (%.17g, f"{x:.17f}", strconv.FormatFloat(x,'g',17,64)).


Tier 1 — See the approximation (Junior)

Task 1.1 — Reproduce 0.1 + 0.2

Goal: Print 0.1 + 0.2 and confirm it isn't 0.3.

  • 0.1 + 0.2 prints 0.30000000000000004
  • 0.1 + 0.2 == 0.3 prints false
  • 0.1 printed with 17 digits shows it's stored as 0.10000000000000001

Hint: Use your language's 17-digit format specifier to reveal the true stored value.

Solution sketch: print(0.1 + 0.2); print(0.1 + 0.2 == 0.3); print(f"{0.1:.17f}"). All IEEE 754 languages give the same answer.


Task 1.2 — Which fractions are exact?

Goal: For each of 0.5, 0.25, 0.75, 0.1, 0.2, 0.3, 0.125, 0.7, decide (by reasoning, then by printing 17 digits) whether it's exactly representable.

  • You predicted exactness from the rule "denominator (in lowest terms) is a power of 2" before running
  • 0.5, 0.25, 0.75, 0.125 print exactly; 0.1, 0.2, 0.3, 0.7 show extra digits

Hint: 0.75 = 3/4, denominator 4 = 2². 0.7 = 7/10, denominator 10 — not a power of 2.

Solution sketch: Loop and print each with 17 digits. The power-of-2-denominator ones round-trip cleanly.


Task 1.3 — The NaN and Infinity tour

Goal: Produce NaN, +Inf, -Inf and verify their quirks.

  • 0.0/0.0 (computed from variables) is NaN; 1.0/0.0 is +Inf; -1.0/0.0 is -Inf
  • NaN == NaN is false; your isnan() helper returns true
  • Inf + 1 == Inf; 1 / Inf == 0; Inf - Inf is NaN

Hint: In Go, division of constant zeros won't compile — use variables or math.Inf/math.NaN. In Python, 0.0/0.0 raises; use float('nan'), float('inf').

Solution sketch: Build the values from variables, then run each comparison. Confirm x != x matches your library's isnan.


Task 1.4 — The money-as-float bug

Goal: Show that summing 0.10 ten times as a double does not give 1.0, then fix it with integer cents.

  • Summing 0.10 ten times prints something like 0.9999999999999999
  • Summing integer cents (10 ten times) and dividing by 100 prints exactly 1.0

Hint: The fix is to do all arithmetic in integers and convert to dollars only for display.

Solution sketch: total = sum(0.10 for _ in range(10)) vs cents = 10*10; dollars = cents/100.


Task 1.5 — Replace == with a tolerance

Goal: Write a close(a, b, eps=1e-9) helper and use it to make 0.1 + 0.2 "equal" 0.3.

  • 0.1 + 0.2 == 0.3 is false but close(0.1 + 0.2, 0.3) is true
  • You can articulate why eps = 1e-9 is fine here but wrong for comparing two billion-sized numbers

Hint: abs(a - b) < eps. The "why wrong for big numbers" answer is in Tier 3.

Solution sketch: def close(a, b, eps=1e-9): return abs(a-b) < eps.


Tier 2 — Quantify the grid (Middle)

Task 2.1 — Inspect the bits of a double

Goal: Decompose 1.0, 0.5, 2.0, 0.1, -0.1 into sign, exponent (biased and unbiased), and fraction.

  • 1.0 → sign 0, raw exponent 1023 (unbiased 0), fraction 0
  • 2.0 → raw exponent 1024 (unbiased 1)
  • 0.5 → raw exponent 1022 (unbiased −1)
  • 0.1 shows a repeating fraction pattern in the bits

Hint: Reinterpret the double's 8 bytes as a uint64 (struct.unpack('>Q', struct.pack('>d', x)) in Python; memcpy in C; math.Float64bits in Go), then mask: sign = bit 63, exponent = bits 52–62, fraction = bits 0–51.

Solution sketch: bits = float64_to_uint64(x); sign = bits >> 63; exp = (bits >> 52) & 0x7FF; frac = bits & ((1<<52)-1). Print exp and exp - 1023.


Task 2.2 — Machine epsilon and ULP

Goal: Compute machine epsilon empirically, and observe ULP growth.

  • You find ε ≈ 2.22e-16 by halving until 1.0 + eps == 1.0
  • ulp(1.0), ulp(1e6), ulp(1e16) differ by powers of 2 (the last is 2.0)
  • 1.0 + eps != 1.0 but 1.0 + eps/2 == 1.0

Hint: To find ε: start at 1.0, keep halving a candidate while 1.0 + candidate != 1.0; the last value that still changed 1.0 is ε. For ULP, use math.ulp (Python/Go have it) or nextafter(x, inf) - x.

Solution sketch: Loop halving; compare to sys.float_info.epsilon. math.ulp(1e16) is 2.0 because adjacent doubles there are 2 apart.


Task 2.3 — The 2^53 integer ceiling

Goal: Find the smallest positive integer N such that N and N+1 are not both representable as a double.

  • 2**53 + 1 == 2**53 is true
  • 2**53 is the boundary; below it, all integers are exact
  • (JS bonus) Number.MAX_SAFE_INTEGER === 2**53 - 1

Hint: At 2^53 the ULP becomes 2.0, so odd integers there can't be stored.

Solution sketch: Print 2**53 + 1 == 2**53. Explain via ULP at that magnitude.


Task 2.4 — Round half to even

Goal: Demonstrate banker's rounding and contrast it with "round half up."

  • round(0.5) == 0, round(1.5) == 2, round(2.5) == 2, round(3.5) == 4 (ties-to-even)
  • Summing many rounded ties shows ties-to-even has ~zero bias while round-half-up drifts up
  • You can state why the IEEE default is ties-to-even (Vancouver / bias)

Hint: Generate many values of the form k + 0.5 and round both ways; compare the sums of rounding errors.

Solution sketch: Round 0.5..n.5 ties-to-even (built-in round in Python 3) vs math.floor(x + 0.5) (half-up). The half-up sum drifts positive.


Task 2.5 — Catastrophic cancellation in the quadratic formula

Goal: Solve x² + 1e8·x + 1 = 0 with the naive and stable formulas; compare to the true roots (≈ -1e-8 and ≈ -1e8).

  • The naive (-b + sqrt(b²-4ac))/(2a) gives a wrong small root (lost digits)
  • The stable form (choose the sign that adds magnitudes, then use c/q for the other root) gives ≈ -1e-8
  • You can point to the exact subtraction that cancelled

Hint: -b + sqrt(b² - 4ac)-1e8 + 1e8 — the leading digits cancel. Compute q = -(b + sign(b)*sqrt(disc))/2, then roots are q/a and c/q.

Solution sketch: Implement both; print both roots; verify the stable small root against -1e-8.


Task 2.6 — Non-associativity and order-dependent sums

Goal: Show that summing a list in different orders gives different results.

  • (1e16 + 1.0) - 1e16 is 0.0 (the 1.0 was absorbed)
  • Summing [1.0, 1e16, 1.0, -1e16] left-to-right vs smallest-magnitude-first differs
  • You can explain absorption as "the small value is below one ULP of the running total"

Hint: Sort by absolute value ascending before summing to preserve small contributions.

Solution sketch: sum(xs) vs sum(sorted(xs, key=abs)). The sorted order keeps the two 1.0s.


Task 2.7 — Kahan summation

Goal: Implement Kahan (compensated) summation and beat naive sum on a large input.

  • Naive sum([0.1] * 10_000_000) drifts from 1_000_000.0
  • Your Kahan sum is much closer (often exact to display precision)
  • (Bonus) Compare against your language's fsum/pairwise sum

Hint: Keep a comp (compensation) variable: y = x - comp; t = total + y; comp = (t - total) - y; total = t.

Solution sketch: Implement the four-line loop above. Print naive vs Kahan vs math.fsum.


Task 2.8 — A correct isclose

Goal: Write a comparison that works near zero and at huge magnitudes.

  • isclose(0.1 + 0.2, 0.3) is true
  • isclose(1e-18, 0.0) is false with only relative tolerance, true once you add an absolute tolerance
  • isclose(1e20, 1e20 + 1) is true (1 is below one ULP there) — relative tolerance handles it

Hint: Combined form: abs(a-b) <= max(rel_tol * max(|a|,|b|), abs_tol).

Solution sketch: Implement the combined tolerance; test all three cases above.


Tier 3 — The machine beneath (Senior)

Task 3.1 — FMA changes the answer

Goal: Find inputs where fma(a, b, c) differs from a*b + c.

  • With a = 1 + 2^-27, b = 1 - 2^-27, c = -1: separate gives 0.0, fused gives -2^-54
  • fma(a, b, -(a*b)) extracts the exact rounding error of a*b (verify a*b == p + e exactly with the bits)
  • You can explain it as "one rounding vs two"

Hint: Use your language's fma (math.fma in newer Python/Go, std::fma in C++, f64::mul_add in Rust). Confirm hardware FMA, or the difference may not appear under software emulation.

Solution sketch: Compute sep = a*b - 1.0 and fused = fma(a, b, -1.0). Print both with 20 digits.


Task 3.2 — Observe -ffast-math breaking a NaN check (C/C++)

Goal: Compile a NaN-detection loop with and without -ffast-math (or -ffinite-math-only) and show it stops working.

  • Without the flag, for (...) if (v[i] != v[i]) return 1; detects a planted NaN
  • With -ffinite-math-only, the compiler folds v[i] != v[i] to false and the NaN slips through
  • (Bonus) Inspect the assembly diff to see the check disappear

Hint: Plant a NaN in the array (e.g., 0.0/0.0 from a volatile zero so the compiler can't constant-fold it away).

Solution sketch: Two builds, same source; diff the output and the objdump.


Task 3.3 — Kahan deleted by reassociation (C/C++)

Goal: Show that -fassociative-math (part of -ffast-math) optimizes Kahan's compensation to zero, restoring the drift.

  • Kahan sum at -O2 is accurate
  • Kahan sum at -O2 -ffast-math drifts like a naive sum (compensation eliminated)
  • Marking comp/t volatile restores correctness even under fast-math

Hint: The compiler proves (t - sum) - y == 0 algebraically. volatile forbids that reasoning.

Solution sketch: Same Kahan function, two builds; observe drift returns; add volatile and watch it fix.


Task 3.4 — x87 double rounding (32-bit x86, optional)

Goal: On a target using x87 (-m32 -mfpmath=387), reproduce a value that compares unequal to itself after a store.

  • volatile double x = 0.1 + 0.2; double y = 0.1 + 0.2; (x == y) can be false under x87
  • On x86-64 SSE2 (default) it's always true
  • You can explain it via FLT_EVAL_METHOD (2 vs 0)

Hint: The volatile forces a 64-bit memory store; the non-volatile y may stay in an 80-bit register.

Solution sketch: If you can't build 32-bit x87, read your platform's FLT_EVAL_METHOD and explain what would happen. (Most modern setups are 0 and clean.)


Task 3.5 — Quiet vs signaling NaN bits

Goal: Construct a quiet NaN with a custom payload; inspect the quiet bit.

  • Your qNaN has exponent all-ones, top fraction bit = 1, and your payload in the low bits
  • isnan returns true; the value is still NaN after + 1.0
  • (Discussion) Explain why producing a persistent signaling NaN is hard (loads often quiet it)

Hint: Build the uint64: (0x7FF << 52) | (1 << 51) | payload, then reinterpret as a double.

Solution sketch: Pack the bits, unpack as double, run isnan, add 1.0, confirm still NaN.


Task 3.6 — Decimal floating point for 0.1

Goal: Show that a decimal type represents 0.1 exactly while binary does not.

  • Decimal('0.1') + Decimal('0.2') == Decimal('0.3') is true
  • Decimal(0.1) (from the float) captures the binary error — 0.1000000000000000055...; Decimal('0.1') (from the string) is exact
  • You can explain the base-10 significand

Hint: Always construct decimals from strings, never from binary floats.

Solution sketch: Compare string-constructed vs float-constructed Decimal. Use BigDecimal in Java, decimal.Decimal in Python.


Task 3.7 — Shortest round-trip vs %.17g

Goal: Compare your language's default float printing against forced 17-digit printing.

  • Default printing of 0.1 shows 0.1 (shortest round-trip)
  • %.17g of 0.1 shows 0.10000000000000001
  • Both, when parsed back, give the same double; the default is just shorter

Hint: The default printers in Python repr, Go %v, Rust {}, modern JS use Grisu/Ryū-style shortest round-trip; C printf("%g") does not.

Solution sketch: Print 0.1 and 0.1 + 0.2 both ways; round-trip each string back to a double and assert bit-equality.


Tier 4 — Production discipline (Professional)

Task 4.1 — Reproduce the Vancouver Stock Exchange drift

Goal: Simulate an index updated ~3000×/day for ~600 days, with truncation vs round-to-nearest, and measure the divergence.

  • Truncation produces a steadily declining index (systematic downward bias)
  • Round-to-nearest stays near the true value
  • The gap is large (the real incident: ~520 vs ~1098)

Hint: Each "update" multiplies by a near-1 factor and re-quantizes to 3 decimals; truncate (floor) vs round.

Solution sketch: Loop 3000 * 600 times applying a tiny random walk and trunc(x*1000)/1000 vs round(x*1000)/1000. Plot or print both.


Task 4.2 — Safe float→int narrowing (the Ariane lesson)

Goal: Write a conversion that range-checks before narrowing a double to an int16, and prove the unchecked version is unsafe.

  • The naive cast of 40000.0 to int16 wraps/overflows (or is UB in C)
  • Your checked version returns an error for out-of-range and NaN/Inf inputs
  • (C bonus) -fsanitize=float-cast-overflow flags the unchecked cast

Hint: Check isfinite(x) and INT16_MIN <= round(x) <= INT16_MAX before casting.

Solution sketch: if not finite or out of [-32768, 32767]: error; else int16(round(x)). Compare to the bare cast.


Task 4.3 — Money type that forbids floats

Goal: Build a Money type storing integer minor units that makes adding a double a type error (or runtime guard), with correct splitting.

  • Money(1099, "USD") + Money(250, "USD") = Money(1349, "USD")
  • Adding mismatched currencies fails
  • split($10.00, 3) returns shares summing to exactly $10.00 (e.g., 334+333+333)

Hint: Largest-remainder split: base, rem = divmod(total, parts); first rem shares get +1.

Solution sketch: A frozen struct/dataclass wrapping cents: int and currency: str; __str__ divides by 100 for display only.


Task 4.4 — Bound drift with re-baselining

Goal: Build a running-total accumulator that periodically recomputes from the source of truth, and show it beats a naive long-lived double accumulator.

  • A naive double accumulator over many small increments drifts (or freezes via absorption)
  • Your re-baselining accumulator stays accurate
  • You can state the three drift signatures (linear, √n, freeze)

Hint: Every N adds, recompute the total with fsum/Kahan over the durable history and reset the accumulator.

Solution sketch: Keep running and a history; every N updates set running = fsum(history).


Task 4.5 — Catch NaN at the boundary

Goal: Add finiteness guards so a planted NaN trips an alarm at its source, not three layers downstream.

  • Without guards, a NaN injected early surfaces as a NaN output with no clue where it came from
  • With assert isfinite(x) at each stage boundary, the assert fires at the injection point with a stack trace
  • (C bonus) feenableexcept(FE_INVALID) traps on the producing instruction

Hint: not isfinite catches NaN and ±Inf. Place guards after parse, before store, at module edges.

Solution sketch: A pipeline parse → transform → aggregate → store; inject NaN in transform; add require_finite() at each boundary; observe where it trips.


Task 4.6 — Reproducibility fingerprint across configs

Goal: Hash the exact bit patterns of a computed float array and show that different build flags (FMA on/off, fast-math) change the hash.

  • You hash the raw 8-byte representations (not decimal strings)
  • -ffp-contract=off vs -ffp-contract=fast (or -mfma) can change the hash for an FMA-eligible kernel
  • -ffast-math changes the hash (reassociation/subnormals)

Hint: Pack each double as 8 bytes (struct.pack('>d', v) / math.Float64bits) and feed to SHA-256.

Solution sketch: Compute sum(a[i]*b[i] + c[i]) two ways/two builds; fingerprint each; compare. Discuss what it means for caching/consensus.


Capstone

Task C.1 — A float diagnostics CLI

Goal: Build a small tool that, given a decimal string, prints: the exact stored double (17 digits), the sign/exponent/fraction bits, the ULP at that magnitude, the next and previous representable doubles, and whether the input round-trips through the shortest printer.

  • Handles normal, subnormal, zero, ±Inf, and NaN inputs
  • Correctly shows 0.1 is stored as 0.10000000000000001 with a repeating fraction
  • Shows ulp(1e16) == 2.0 and the neighbors 1e16 ± 2

Hint: Combine the bit-inspection (2.1), ULP (2.2), nextafter, and shortest/%.17g printing (3.7).

Solution sketch: Parse → to_bits → decode fields → ulp/nextafter → print. This consolidates Tier 2–3 into one reusable utility.


Task C.2 — A statistics module that survives bad inputs

Goal: Implement mean, variance (Welford), and sum (Kahan/fsum) that are accurate on adversarial inputs and reject non-finite values at the boundary.

  • Variance via Welford never goes negative (the naive E[x²]-E[x]² can, then sqrt → NaN)
  • Summation stays accurate on [1e16, 1.0, -1e16, 1.0, ...]
  • Non-finite inputs raise at ingestion, not in the output

Hint: Welford: maintain n, mean, m2; update incrementally. Guard each input with isfinite.

Solution sketch: Welford for variance; Kahan/fsum for the total; require_finite on every appended value. Compare against the naive formulas on the adversarial array.


Self-Assessment

You've mastered this topic when you can, without looking anything up:

  • Decode a double's sign/exponent/fraction bits and explain the implicit 1 and the bias.
  • Explain why 0.1 isn't representable and predict 0.1 + 0.2.
  • State machine epsilon and explain how ULP scales with magnitude.
  • Recognize catastrophic cancellation and absorption on sight and reformulate around them.
  • Write a correct isclose (combined relative + absolute) and know when to use ULP.
  • Test for NaN/Inf correctly and explain NaN != NaN.
  • Explain FMA, x87 extended precision, and what -ffast-math breaks.
  • Choose the right money representation and explain the Patriot/Ariane/Vancouver incidents.
  • Bound drift in a long-running accumulator and diagnose its signature.