Parallel Programming

"If your problem isn't CPU-bound, parallelism won't make it faster — it'll just make it more complicated."

This roadmap is about using multiple cores (and SIMD lanes, and sometimes GPUs) to do real CPU work faster. It is the member of the Concurrency, Async & Parallel trio that targets compute-bound workloads: the place where Amdahl's Law, false sharing, cache lines, and work-stealing all start to matter.

Looking for non-blocking I/O (event loops, async/await)? See Async Programming.

Looking for concurrency primitives (mutex, channels, threads)? See Concurrency.

Looking for distributed compute (multi-node, MapReduce, Spark)? That belongs in the System Design and Data tracks — parallelism across machines is a different problem from parallelism within one machine, which is what this roadmap covers.

Why a Dedicated Roadmap

Parallel programming is the part of "make this faster" that the textbooks underplay:

A naive 8-core parallelisation often gives 3×, not 8× — and the reason (Amdahl, contention, cache-line bouncing) isn't obvious from the code.
Concurrency primitives let two tasks share state safely; parallel programming asks how to partition work so they don't have to share state at all.
The right abstraction depends on the shape of the work: data-parallel (the same operation on many items), task-parallel (different operations running at the same time), or pipeline (items flowing through stages that run on different cores).
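
The "3×, not 8×" figure in the first bullet can be sanity-checked against Amdahl's Law, which bounds speedup on N cores by 1 / (s + (1 - s) / N) for a serial fraction s. The sketch below is a minimal illustration, not a measurement: the class and method names are hypothetical, and it assumes the serial fraction is the only thing limiting speedup. Contention and cache-line bouncing, which the law does not model, push real numbers lower still.

```java
// Minimal sketch of Amdahl's Law. All names here are illustrative assumptions.
final class AmdahlSketch {

    /** Ideal speedup on `cores` cores when `serialFraction` of the runtime cannot be parallelised. */
    static double speedup(double serialFraction, int cores) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
    }

    /** Serial fraction implied by an observed speedup, assuming Amdahl's Law is the only limit. */
    static double impliedSerialFraction(double observedSpeedup, int cores) {
        // Solve 1/S = s + (1 - s)/N for s.
        return (cores / observedSpeedup - 1.0) / (cores - 1.0);
    }

    public static void main(String[] args) {
        int cores = 8;
        double s = impliedSerialFraction(3.0, cores);
        System.out.printf("serial fraction for 3x on %d cores: %.3f%n", cores, s);
        System.out.printf("speedup at that fraction: %.2f%n", speedup(s, cores));
        System.out.printf("upper bound as cores grow: %.2f%n", 1.0 / s);
    }
}
```

Under these assumptions, a serial fraction of 5/21 (roughly 24%) is enough to cap an 8-core run at 3×, and no number of additional cores can take it past 1/s = 4.2×.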

Roadmap	Question it answers
Concurrency	How do logical flows coordinate?
Async Programming	How do I do thousands of waits at once?
Parallel Programming (this)	How do I keep all the cores busy on the same problem?

Sections

#	Topic	Focus
01	Amdahl & Gustafson	The two laws that bound how much speedup parallelism can give; serial fraction; strong vs weak scaling
02	Data Parallelism	Parallel `map`, `reduce`, `forEach`; SIMD-style thinking at a high level; Java parallel streams, Rust Rayon, .NET PLINQ
03	Task Parallelism	Independent task graphs; `Future` join trees; pipelining stages
04	Fork-Join	Divide-and-conquer pattern; Cilk-style, Java `ForkJoinPool`, Rust `rayon::join`
05	Work-Stealing	How modern parallel runtimes balance load (Tokio, Rayon, ForkJoinPool, TBB); when it helps and hurts
06	SIMD & Vectorization	Auto-vectorization; intrinsics (SSE/AVX/NEON); Rust `std::simd` (nightly-only at the time of writing), Go SIMD packages
07	Parallel Collections	Java parallel streams, Scala par-collections, Rust Rayon — pros, cons, and when they silently regress
08	Cache & False Sharing	Cache lines, padding, `@Contended`, cache-friendly layouts (SoA vs AoS for parallel scans)
09	NUMA-aware Parallelism	Why putting threads on the wrong socket can halve your performance
10	GPU & Accelerator Offload	CUDA / OpenCL / Metal at a conceptual level; when "throw it at a GPU" works and when it doesn't
11	Benchmarking Parallel Code	Why microbenchmarks lie; warm-up, JIT, contention noise; flame graphs vs lock-contention profilers
12	Anti-patterns	"Parallel for-loop everywhere"; over-parallelisation; lock-as-coordination disguised as parallelism
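
As a rough preview of how the shapes behind topics 02 to 04 differ in code, here is a small JDK-only sketch. It is illustrative rather than definitive: the class, method, and constant names are hypothetical, the `THRESHOLD` cutoff is an untuned assumption, and everything runs on the common `ForkJoinPool` for simplicity.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.stream.LongStream;

// Illustrative sketch: class, method, and constant names are hypothetical.
final class ShapesSketch {

    // Data-parallel: the same operation (squaring) on every item, then a reduction.
    static long sumOfSquaresStream(long n) {
        return LongStream.rangeClosed(1, n).parallel().map(x -> x * x).sum();
    }

    // Fork-join: halve the range until it is small, then combine partial sums.
    static final class SumOfSquares extends RecursiveTask<Long> {
        private static final long THRESHOLD = 10_000; // assumed cutoff, not tuned
        private final long lo;
        private final long hi; // inclusive upper bound

        SumOfSquares(long lo, long hi) {
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Long compute() {
            if (hi - lo < THRESHOLD) {
                long sum = 0;
                for (long x = lo; x <= hi; x++) {
                    sum += x * x;
                }
                return sum;
            }
            long mid = lo + (hi - lo) / 2;
            SumOfSquares left = new SumOfSquares(lo, mid);
            SumOfSquares right = new SumOfSquares(mid + 1, hi);
            left.fork();                      // schedule the left half for another worker
            long rightSum = right.compute();  // keep the right half on this thread
            return left.join() + rightSum;
        }
    }

    static long sumOfSquaresForkJoin(long n) {
        return ForkJoinPool.commonPool().invoke(new SumOfSquares(1, n));
    }

    // Task-parallel: two different computations run independently and are joined once.
    static long sumPlusEvenCount(long n) {
        CompletableFuture<Long> sum =
            CompletableFuture.supplyAsync(() -> LongStream.rangeClosed(1, n).sum());
        CompletableFuture<Long> evens =
            CompletableFuture.supplyAsync(
                () -> LongStream.rangeClosed(1, n).filter(x -> x % 2 == 0).count());
        return sum.thenCombine(evens, Long::sum).join();
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println(sumOfSquaresStream(n) == sumOfSquaresForkJoin(n));
        System.out.println(sumPlusEvenCount(n));
    }
}
```

None of the three variants shares mutable state between workers: each one partitions the input and combines results only at the end. Whether any of them beats a plain loop at this problem size is a benchmarking question, not something the sketch demonstrates.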

Languages

Cross-language comparison: Java (`ForkJoinPool`, parallel streams, `@Contended`), Rust (Rayon, `std::simd`), C++ (TBB, OpenMP, intrinsics), Go (`runtime.GOMAXPROCS`, manual fan-out; deliberately no parallel-collections magic), C# (PLINQ, `Parallel.For`).

Status

⏳ Structure defined; content pending.

References

The Art of Multiprocessor Programming — Herlihy & Shavit
Structured Parallel Programming — McCool, Robison, Reinders
What Every Programmer Should Know About Memory — Ulrich Drepper (the cache-line classic)
Is Parallel Programming Hard, And, If So, What Can You Do About It? — Paul McKenney (free)
Aleksey Shipilëv — JVM concurrency talks on false sharing, @Contended

Project Context

Part of the Senior Project — a personal effort to consolidate the essential knowledge of software engineering in one place.