CAS and Atomic Primitives — Interview Preparation¶

Tiered questions (Junior → Middle → Senior → Professional) with answer focus, followed by a full coding challenge — a lock-free counter and a Treiber-stack push via CAS — in Go, Java, and Python. Aim to explain the CAS loop out loud while you code; interviewers grade reasoning about correctness and ordering as much as the code.

Table of Contents¶

Junior Questions
Middle Questions
Senior Questions
Professional Questions
Rapid-Fire Round
Scenario / Design Questions
Coding Challenge — Lock-Free Counter + Treiber Push
Bonus Challenge — Atomic Max via CAS (Go/Java/Python)
Whiteboard Trace Exercise
Follow-up Discussion Points
Self-Assessment Checklist

1. Junior Questions¶

#	Question	Expected Answer Focus
1	What does "atomic" mean?	Indivisible read-modify-write; no thread observes an intermediate state or interleaves.
2	Why is `counter++` unsafe across threads?	It's read→add→write (3 steps); interleaving causes lost updates.
3	What does CAS do?	`CAS(addr, expected, new)` writes `new` only if `addr == expected`, atomically, and reports success/failure.
4	Write the CAS loop in words.	read `old` → compute `desired = f(old)` → CAS; if it fails (value changed), retry.
5	Difference between fetch-add and CAS?	fetch-add does `+= delta` in one hardware op; CAS lets you install any value-dependent result via a loop.
6	When prefer fetch-add over a CAS loop?	When the update is a pure add — simpler and faster (and wait-free).
7	What is `exchange` / `swap`?	Unconditionally store new, return old. CAS without the compare.
8	Why must you re-read `old` inside the loop?	Each retry must use fresh data; reading once outside reuses stale `old` and livelocks/fails.
9	Does Python's `+=` on a global int need protection?	Yes — the GIL narrows the window but doesn't guarantee atomicity of `+=`; use a lock or `multiprocessing.Value`.
10	Atomic load vs a mutex for reading a flag?	Atomic load is cheaper and sufficient for a single word; no need for a lock.
11	What is `test-and-set` used for?	Building a spinlock: set a flag to 1 and check its old value to learn if you acquired it.
12	Does CAS block a thread?	No — on failure it returns immediately; the thread retries. No sleeping/context switch.
13	What's the final value if 3 threads each CAS-increment a counter starting at 0, 1000 times?	3000 — CAS guarantees no lost updates regardless of interleaving.

Model answer (Q2 + Q3): "counter++ compiles to load, add, store. Two threads can both load 41, both store 42 — one increment is lost. CAS fixes this: I read 41, compute 42, then CAS(addr, 41, 42) which writes 42 only if addr is still 41. If another thread already moved it to 42, my CAS fails, I re-read 42, compute 43, and retry. No update is ever lost."

2. Middle Questions¶

#	Question	Expected Answer Focus
1	What is the ABA problem?	A value goes A→B→A; CAS sees "still A" and succeeds wrongly, missing that the world changed.
2	Why are integer counters immune to ABA?	The value carries no hidden identity; "5 again" means the same thing — increment still correct.
3	Name three ABA fixes.	Tagged/versioned pointer (bump a counter), hazard pointers, double-width CAS.
4	What does `AtomicStampedReference` give you?	A (reference, int stamp) pair CAS'd together; the stamp bump defeats ABA.
5	Explain acquire vs release ordering.	release on a store publishes prior writes; acquire on a load makes them visible after — forming happens-before.
6	What's wrong with `relaxed` for publishing a pointer to an init'd object?	No ordering between the field write and the pointer publish; a reader may see the pointer before the data (esp. on ARM).
7	How do you build a spinlock from an atomic?	CAS flag 0→1 to acquire (acquire order), store 0 to release (release order), spin with a pause hint.
8	fetch-add is wait-free; a CAS loop is only lock-free — why?	fetch-add finishes in bounded steps always; a CAS loop's unlucky thread can retry unboundedly while others succeed.
9	Why does a single hot `AtomicLong` scale poorly?	All threads contend one cache line; updates serialize. Use a striped `LongAdder`.
10	When is a CAS-based approach worse than a mutex?	Multi-variable invariants, long sections — CAS guards one word and composes poorly.
11	What does the GIL mean for Python atomics?	One thread runs bytecode at a time; `+=` still spans bytecodes so it's not guaranteed atomic — use locks or `multiprocessing`.
12	Why pair release with acquire and not use both as release?	Release publishes on the writer; acquire consumes on the reader — the edge only forms when a release is read by an acquire.

Model answer (Q1): "ABA: thread T1 reads pointer A and pauses. Others pop A, pop B, and re-push A reusing the same address. T1 resumes and CAS(top, A, B) succeeds because top is A again — but A.next is no longer B; B was freed. The structure corrupts. The fix is to make the value never repeat: attach a version counter and CAS (pointer, version) together, e.g. AtomicStampedReference."

3. Senior Questions¶

#	Question	Expected Answer Focus
1	Walk through a Treiber stack push and pop and where ABA strikes.	push installs a fresh node (safe); pop's CAS on top is ABA-prone; needs tag/HP or GC.
2	What is the "helping" pattern in the Michael–Scott queue?	A thread completes another's half-done op (swing a lagging tail) → keeps the structure lock-free.
3	How does exponential backoff with jitter help a contended CAS loop?	De-synchronizes contenders, cuts cache-line thrash; jitter avoids lock-step collisions.
4	What is false sharing and how do you fix it?	Independent atomics on one 64-byte line invalidate each other; pad each to its own line.
5	Why does code work on x86 but corrupt on ARM?	x86 TSO rarely reorders; ARM is weak — missing acquire/release fences surface there.
6	Compare Go's and Java's memory models for atomics.	Go: seq_cst only (simple, no knobs). Java: `AtomicX` = seq_cst, `VarHandle` exposes acquire/release/opaque.
7	What is safe memory reclamation and why is it hard lock-free?	Can't free a node a stalled thread may still read; needs hazard pointers/EBR/RCU (or a GC).
8	When would you choose a Disruptor/ring buffer over a lock-free linked queue?	Pre-allocated array, no per-node alloc, padded sequence counters; lower latency, no GC churn.
9	What's the elimination optimization on a stack?	Colliding push/pop cancel via a side array, bypassing the contended top → near-linear scaling.
10	What metric warns you of an impending contention cliff?	Rising CAS-retry rate (and `perf c2c` HITM events); instrument and alert on it.
11	How does NUMA change your striping strategy?	Home each stripe's line near its writers; pin threads; NUMA-aware alloc so writes stay local.
12	What is flat combining and when does it beat lock-free CAS?	One combiner applies all pending ops under one lock; wins by keeping the structure hot in one cache vs bouncing.
13	How do you safely roll out a lock-free replacement for a mutex?	Library first, property/linearizability tests, ARM in CI, canary/shadow, feature-flag rollback.

Model answer (Q4): "Cache coherence works per 64-byte line. If two atomics that different threads update share a line, every write to one invalidates the other in the other core's L1 — they ping-pong as if one variable. Padding each hot atomic to its own line (8 bytes + 56 bytes pad, or @Contended in Java) removes the coupling. A striped counter only scales if its stripes are padded."

4. Professional Questions¶

#	Question	Expected Answer Focus
1	Define linearizability.	Each op takes effect at one linearization point between invoke/respond; the point-order is a legal sequential history respecting real-time order.
2	Why is linearizability composable but sequential consistency not?	Linearizability is local: object-wise linearizable ⇒ whole-system linearizable. SC isn't local.
3	State the progress hierarchy formally.	wait-free (bounded steps, every process) ⊃ lock-free (some process progresses) ⊃ obstruction-free (progress in isolation) ⊃ blocking.
4	What is a consensus number?	Max #processes for which an object solves wait-free consensus; a hard measure of synchronization power.
5	Consensus numbers of register, queue, CAS?	Register = 1, queue/stack/test-and-set/fetch-add = 2, CAS/LL-SC = ∞.
6	Why is CAS "universal"?	Consensus number ∞ ⇒ Herlihy's universal construction implements any sequential object wait-free using CAS.
7	Sketch 2-process (or n-process) consensus from CAS.	`CAS(C, ⊥, myVal)`; winner's value is the decision, losers read `C`. Wait-free for any n.
8	LL/SC vs CAS — key difference?	LL/SC fails on any intervening write (ABA-immune) but allows spurious failures; CAS compares values only.
9	What is the DRF-SC guarantee?	A data-race-free program (proper acquire/release) behaves as if sequentially consistent.
10	Define happens-before in the axiomatic model.	Transitive closure of sequenced-before ∪ synchronizes-with; release→acquire via reads-from creates the sw edge.
11	Why do registers have consensus number 1?	Bivalency/critical-state argument: a write hidden by a solo run can't create a mutually-visible irrevocable decision.
12	State the wait-free space/step cost of the universal construction.	O(n) space and O(n) steps per op (helping replays all n pending ops) — a proof, not a practical impl.
13	What is the fast-path/slow-path pattern for wait-freedom?	Cheap lock-free CAS attempts, then fall to a wait-free helping path only after starvation — pays the O(n) tax rarely.

Model answer (Q6): "Herlihy showed that solving wait-free consensus among n processes is equivalent to implementing arbitrary shared objects wait-free for n processes. CAS has consensus number ∞ — the first process whose CAS on a ⊥ cell succeeds fixes the decision, for any n. So CAS can drive the universal construction: agree on a total order of operations via consensus, append to a shared log, and have every process replay it (helping) to get its response — yielding a wait-free, linearizable implementation of any sequential object. That's why hardware ships CAS."

5. Rapid-Fire Round¶

Prompt	Answer
Go atomic add API	`atomic.Int64.Add`
Java +1 atomic	`AtomicLong.incrementAndGet()`
Java ABA-safe ref	`AtomicStampedReference`
Java fine ordering API	`VarHandle` (`getAcquire`/`setRelease`)
Java scalable counter	`LongAdder`
x86 CAS instruction	`LOCK CMPXCHG`
ARM CAS mechanism	`LDXR`/`STXR` (LL/SC)
Spin hint (Java)	`Thread.onSpinWait()`
Cache line size (typical)	64 bytes
CAS consensus number	∞
fetch-add progress class	wait-free
CAS-loop progress class	lock-free
Publish data through a flag	release store + acquire load
Python true-parallel option	`multiprocessing` (GIL blocks threads)
Exchange/swap returns	the old value
test-and-set builds	a spinlock
Helping pattern upgrades	lock-free → wait-free
Treiber stack progress	lock-free (not wait-free)
Disruptor avoids	allocation + false sharing
LL/SC immune to	the ABA problem

6. Scenario / Design Questions¶

These are open-ended; interviewers want a structured discussion, not a single right answer.

S1. "We have a global request counter that's now the hottest line in our service — QPS dropped after we added it. Diagnose and fix." Lead with: a single atomic counter serializes on one cache line; every core's increment invalidates it elsewhere (cache-line ping-pong), worse across NUMA sockets. Fix: stripe it — per-core/per-thread padded cells (LongAdder in Java, padded [N]atomic.Int64 in Go), summed on read. Confirm by measuring throughput vs thread count before/after, and watch perf c2c HITM events. Mention that if exact real-time reads aren't needed, an approximate periodically-summed counter is even cheaper.

S2. "Design a lock-free bounded MPMC queue for a low-latency trading path." Reach for a ring buffer / Disruptor: pre-allocated array (no per-node alloc, no GC churn), producers fetch-add a claim sequence, consumers track a published cursor; pad all sequence counters to their own cache lines to kill false sharing. Discuss backpressure when full, the publish cursor with release/acquire ordering, and why array-based beats linked-node for tail latency (no pointer chasing, no allocation).

S3. "Your lock-free stack passes all tests but corrupts data once a week in production. What's your hypothesis?" Two leading suspects: (1) ABA — a rare interleaving where top cycles to the same node; fix with a tagged pointer / hazard pointers (and note GC prevents the crash but not the logical corruption). (2) A memory-ordering bug invisible on x86 but exposed on ARM hosts in the fleet; fix with proper release/acquire and add ARM to CI. Emphasize: instrument the CAS-retry rate and add jcstress/-race property tests.

S4. "When would you deliberately choose a mutex over a lock-free design?" When the critical section spans multiple variables (compound invariant a single CAS can't cover), when the section is long, when contention is low (a mutex is then just as fast and far simpler), or when correctness/maintainability matters more than the last few hundred nanoseconds. "Make it correct and simple first; go lock-free only when profiling proves the lock is the bottleneck."

S5. "Explain to a junior why x++ is unsafe but atomic.Add(&x,1) is safe, in one minute." x++ is read-add-write — three steps two threads can interleave, losing an update. atomic.Add performs the same read-add-write as one indivisible hardware instruction (LOCK XADD), so no interleaving is possible and no increment is lost.

7. Coding Challenge — Lock-Free Counter + Treiber Push¶

Task. Implement (a) a thread-safe counter using an explicit CAS loop (not fetch-add — show you understand the loop), and (b) push for a Treiber stack using CAS. Both must be correct under concurrent access.

Go¶

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// (a) Lock-free counter via explicit CAS loop.
type Counter struct{ v atomic.Int64 }

func (c *Counter) Inc() {
    for {
        old := c.v.Load()                    // read
        if c.v.CompareAndSwap(old, old+1) {  // compute + CAS
            return
        }
        // CAS failed -> someone else incremented; retry
    }
}
func (c *Counter) Get() int64 { return c.v.Load() }

// (b) Treiber stack push via CAS.
type node struct {
    val  int
    next *node
}
type Stack struct{ top atomic.Pointer[node] }

func (s *Stack) Push(v int) {
    n := &node{val: v}
    for {
        t := s.top.Load()        // read current top
        n.next = t               // point new node at it
        if s.top.CompareAndSwap(t, n) { // publish atomically
            return
        }
        // top changed -> retry
    }
}
func (s *Stack) Pop() (int, bool) {
    for {
        t := s.top.Load()
        if t == nil {
            return 0, false
        }
        if s.top.CompareAndSwap(t, t.next) {
            return t.val, true
        }
    }
}

func main() {
    var c Counter
    var s Stack
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for j := 0; j < 100_000; j++ {
                c.Inc()
            }
            s.Push(id)
        }(i)
    }
    wg.Wait()
    fmt.Println("counter =", c.Get()) // 800000
    cnt := 0
    for {
        if _, ok := s.Pop(); !ok {
            break
        }
        cnt++
    }
    fmt.Println("pushed items popped =", cnt) // 8
}

Java¶

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class Challenge {

    // (a) Lock-free counter via explicit CAS loop.
    static final class Counter {
        private final AtomicLong v = new AtomicLong(0);
        void inc() {
            while (true) {
                long old = v.get();                  // read
                if (v.compareAndSet(old, old + 1))   // compute + CAS
                    return;
                // retry on failure
            }
        }
        long get() { return v.get(); }
    }

    // (b) Treiber stack push via CAS.
    static final class Stack<T> {
        static final class Node<T> { final T val; Node<T> next; Node(T v){val=v;} }
        private final AtomicReference<Node<T>> top = new AtomicReference<>(null);
        void push(T v) {
            Node<T> n = new Node<>(v);
            while (true) {
                Node<T> t = top.get();        // read
                n.next = t;                   // link
                if (top.compareAndSet(t, n))  // publish
                    return;
            }
        }
        T pop() {
            while (true) {
                Node<T> t = top.get();
                if (t == null) return null;
                if (top.compareAndSet(t, t.next)) return t.val;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Stack<Integer> s = new Stack<>();
        Thread[] ts = new Thread[8];
        for (int i = 0; i < 8; i++) {
            final int id = i;
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) c.inc();
                s.push(id);
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println("counter = " + c.get()); // 800000
        int cnt = 0;
        while (s.pop() != null) cnt++;
        System.out.println("pushed items popped = " + cnt); // 8
    }
}

Python¶

import threading

# Python has no lock-free CAS; the GIL serializes bytecode. We implement the
# SAME CAS-LOOP STRUCTURE but make the compare-and-set atomic with a lock,
# which is what real Python code does. The loop is identical to Go/Java.

class _Cas:
    """A cell with an atomic compare_and_set (lock-backed)."""
    def __init__(self, value=None):
        self._v = value
        self._g = threading.Lock()
    def load(self):
        with self._g:
            return self._v
    def cas(self, expected, new):
        with self._g:
            if self._v is expected or self._v == expected:
                self._v = new
                return True
            return False

# (a) Counter via explicit CAS loop.
class Counter:
    def __init__(self):
        self._cell = _Cas(0)
    def inc(self):
        while True:
            old = self._cell.load()            # read
            if self._cell.cas(old, old + 1):   # compute + CAS
                return
    def get(self):
        return self._cell.load()

# (b) Treiber stack push via CAS. Nodes are (value, next) tuples.
class Stack:
    def __init__(self):
        self._top = _Cas(None)
    def push(self, v):
        while True:
            t = self._top.load()               # read top
            n = (v, t)                          # link new node
            if self._top.cas(t, n):             # publish
                return
    def pop(self):
        while True:
            t = self._top.load()
            if t is None:
                return None
            value, nxt = t
            if self._top.cas(t, nxt):
                return value

if __name__ == "__main__":
    c, s = Counter(), Stack()
    def work(idx):
        for _ in range(50_000):
            c.inc()
        s.push(idx)
    threads = [threading.Thread(target=work, args=(i,)) for i in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("counter =", c.get())                # 400000
    cnt = 0
    while s.pop() is not None:
        cnt += 1
    print("pushed items popped =", cnt)         # 8

Grading rubric: - Read happens inside the loop (correctness) — must-have. - CAS used, not a plain store — must-have. - Push links n.next = t before the CAS — must-have. - Mentions ABA on pop and that GC (Go/Java) or a tag protects it — strong signal. - Notes Python's GIL / lock substitution — strong signal.

8. Bonus Challenge — Atomic Max via CAS (Go/Java/Python)¶

Task. Maintain the maximum value ever offered by concurrent threads. This cannot be done with fetch-add — the new value depends on the old in a non-additive way — so it is the canonical "you really need a CAS loop" exercise. Short-circuit when the candidate can't beat the current max.

Go¶

import "sync/atomic"

func AtomicMax(addr *atomic.Int64, candidate int64) {
    for {
        old := addr.Load()
        if candidate <= old {
            return // current max already wins; no CAS needed
        }
        if addr.CompareAndSwap(old, candidate) {
            return
        }
        // lost the race; re-read and re-check
    }
}

Java¶

import java.util.concurrent.atomic.AtomicLong;

static void atomicMax(AtomicLong addr, long candidate) {
    while (true) {
        long old = addr.get();
        if (candidate <= old) return;
        if (addr.compareAndSet(old, candidate)) return;
    }
}
// JDK shortcut for this exact pattern:
//   addr.accumulateAndGet(candidate, Math::max);

Python¶

# Lock-backed CAS cell (Python has no hardware CAS). Loop is identical to Go/Java.
import threading

class Cell:
    def __init__(self, v=0):
        self._v = v
        self._g = threading.Lock()
    def load(self):
        with self._g:
            return self._v
    def cas(self, expected, new):
        with self._g:
            if self._v == expected:
                self._v = new
                return True
            return False

def atomic_max(cell, candidate):
    while True:
        old = cell.load()
        if candidate <= old:
            return
        if cell.cas(old, candidate):
            return

Why fetch-add fails here: fetch-add can only add a fixed delta; "keep the larger of old and candidate" is a value-dependent decision, so you must read, decide, and conditionally install via CAS. The short-circuit (candidate <= old) avoids needless CAS traffic when the candidate loses.

9. Whiteboard Trace Exercise¶

Interviewers love asking you to trace an interleaving. Practice this one out loud. Shared counter = 9. Threads A and B each run one CAS-loop increment. Fill in the table for the interleaving A-read, B-read, A-compute, B-compute, A-CAS, B-CAS.

Step	Actor	Action	A.old	B.old	shared	Result
1	A	read	9	–	9	—
2	B	read	9	9	9	—
3	A	compute 9+1	9	9	9	desired=10
4	B	compute 9+1	9	9	9	desired=10
5	A	CAS(9→10)	9	9	10	success
6	B	CAS(9→10)	9	9	10	fail (shared≠9)
7	B	re-read	9	10	10	fresh read
8	B	compute 10+1	9	10	10	desired=11
9	B	CAS(10→11)	9	10	11	success

Final: counter = 11. Both increments counted; B retried once. Expected talking point: the failed CAS at step 6 is what prevents the lost update — B re-reads instead of blindly writing 10.

10. Follow-up Discussion Points¶

"Make the counter wait-free." → switch to fetch-add / incrementAndGet; explain wait-free vs lock-free.
"Now it's the hottest line in the system." → stripe it (LongAdder / padded cells), discuss false sharing.
"Make pop ABA-safe without a GC." → tagged pointer / hazard pointers / DCAS.
"Prove your stack is correct." → linearization points at the successful CAS; lock-free progress argument.
"Why might this corrupt on ARM but not x86?" → weak vs TSO memory model; need acquire/release (Go gives seq_cst free; Java via VarHandle).
"Add backoff." → exponential backoff with jitter + onSpinWait/Gosched.
"Make it wait-free, not just lock-free." → fast-path/slow-path with helping (Kogan–Petrank); explain the O(n) wait-free tax.
"How would you test this?" → jcstress (JVM), go test -race, randomized linearizability checking against a sequential model; CI on ARM.

11. Self-Assessment Checklist¶

Tick these before claiming you "know CAS." If any are shaky, revisit the linked level file.

Next step: tasks.md — graded hands-on tasks (beginner → advanced) plus a contention benchmark across Go, Java, and Python.