Graph Representation — Senior Level¶
A representation that is perfect on a single node — an
O(V²)matrix, a pointer-rich adjacency list — becomes a production incident the moment the graph outgrows one machine's RAM, must be queried concurrently, or has to be updated while traffic is live. At scale, the representation is the system-design decision.
Table of Contents¶
- Introduction
- System Design — Graph Stores and When to Materialize
- Distributed and Out-of-Core Graphs
- Concurrency — Immutable CSR and Concurrent Reads
- Comparison at Scale
- Architecture Patterns
- Code Examples
- Observability
- Failure Modes
- Capacity Planning — Bytes per Edge
- Serialization, Loading, and Versioning
- Incremental and Streaming Updates
- Zero-Copy mmap Load — Full Implementation
- Capacity Planning Worked End-to-End
- Summary
1. Introduction¶
At the senior level the question is no longer "matrix or list?" but "where does this graph live, how is it built, who reads it concurrently, and what breaks when it grows past one machine?". A graph representation has three properties that drive every architectural decision:
- Size.
O(V²)matrices are unusable past a few thousand vertices;O(V+E)lists scale to hundreds of millions of edges on one box, then need partitioning. - Mutability. A live social or routing graph changes constantly; a batch analytics graph is built once and read a billion times. These want opposite representations.
- Access pattern. Point lookups ("are
uandvconnected?") want different layouts than full scans (PageRank, connected components).
The senior decisions are therefore: when to materialize a graph versus query it from a database; how to partition it across machines; how to make it safe for thousands of concurrent readers; and how to plan capacity in bytes-per-edge so you know the break point before you hit it.
2. System Design — Graph Stores and When to Materialize¶
2.1 Three tiers of graph storage¶
| Tier | When right | When wrong |
|---|---|---|
| In-memory CSR / list | Fits in RAM, read-heavy, latency-critical traversal. | Graph exceeds RAM or must survive restarts without a rebuild. |
| Graph database | Durable, transactional, ad-hoc queries, mutable. | You run iterative analytics over the whole graph — DB traversal overhead dominates. |
| Distributed engine | Whole-graph analytics (PageRank, components) on 10¹¹+ edges. | You need single-pair point queries — batch latency is seconds to minutes. |
2.2 Materialize vs query-on-demand¶
The most expensive recurring decision: do you materialize an in-memory representation (CSR), or query edges from a backing store per traversal step?
- Materialize when the graph fits in RAM and is read far more than written. Building CSR once and traversing it millions of times amortizes the build cost to nothing.
- Query-on-demand when the graph is huge, sparse in access (you only ever touch a small neighborhood), or changes faster than you can rebuild. A 2-hop friend recommendation touches thousands of edges out of billions — materializing the whole graph would be absurd.
A common hybrid: materialize a CSR snapshot for the read-mostly core, fall back to the database for the long tail and for writes.
3. Distributed and Out-of-Core Graphs¶
3.1 Partitioning — the central problem¶
To store a graph across P machines you must assign each vertex (and its edges) to a partition. The quality of the partition determines how many edges cross machines, and cross-partition edges are the cost — every traversal step that crosses a partition is a network hop.
- Edge-cut partitioning (assign vertices to machines, cut edges between them). Used by Pregel/Giraph. Minimizing the cut is NP-hard; in practice hash partitioning (
hash(vertex) mod P) is used for simplicity, and metis-style multilevel partitioners are used when cut quality matters. - Vertex-cut partitioning (assign edges to machines, replicate vertices). Used by PowerGraph/GraphX, much better for power-law graphs where a few hub vertices have enormous degree. A single hub's edges are split across many machines instead of overwhelming one.
3.2 Pregel / vertex-centric model¶
Pregel (and its open-source descendants Giraph and Spark GraphX) represents the graph as partitioned adjacency lists and runs computation in supersteps: each vertex reads incoming messages, updates its state, and sends messages along its out-edges. The representation is a distributed CSR — each partition holds a local CSR plus a routing table mapping remote vertex ids to their owning machine. PageRank, connected components, and shortest paths all fit this "think like a vertex" model.
3.3 Out-of-core (single machine, disk-backed)¶
When a graph exceeds RAM but you do not want a cluster, out-of-core engines (GraphChi, X-Stream) stream the CSR from disk in shards. The key trick is ordering edges so that a sequential disk scan visits all edges touching a contiguous vertex range — turning random graph access into sequential I/O. CSR's flat layout is exactly what makes this possible; a pointer-based adjacency list cannot be streamed.
4. Concurrency — Immutable CSR and Concurrent Reads¶
4.1 The read-mostly graph¶
Most large graphs are read far more than written. The cleanest concurrency story is an immutable CSR: build it once, then share it across all reader threads with zero locking. Two flat arrays with no interior mutability are trivially thread-safe; readers never contend.
4.2 Handling writes — snapshot swap¶
When the graph must change, do not mutate the shared CSR in place. Instead:
- Apply writes to a side buffer (a mutable delta log or a fresh adjacency list).
- Periodically rebuild a new immutable CSR snapshot from the base plus deltas.
- Atomically swap a single pointer (
atomic.Valuein Go,AtomicReferencein Java) so new readers see the new snapshot; old readers finish on the old one and it is GC'd.
This is copy-on-write at the whole-graph granularity. Readers are lock-free; writers pay a periodic rebuild. It is exactly how read-optimized stores serve a slowly-changing graph under heavy concurrent load.
4.3 Fine-grained mutable graphs¶
If writes are frequent and reads must see them immediately, you fall back to a concurrent adjacency list with per-vertex locks (lock striping by vertex id) or a concurrent map of neighbor sets. This loses CSR's cache benefits but supports real-time mutation — the trade-off a graph database makes internally.
5. Comparison at Scale¶
| Representation | Build | Point query | Full scan | Mutable | Distributable | When |
|---|---|---|---|---|---|---|
| In-memory matrix | O(V²) | O(1) | O(V²) | yes | poorly (dense blocks) | small dense graphs only |
| In-memory adjacency list | O(V+E) | O(d) | O(V+E) | yes | hard (pointers) | mutable, fits RAM |
| Immutable CSR | O(V+E) | O(d) | O(V+E), fastest | no (rebuild) | yes (shard offsets) | static, read-heavy, hot |
| Graph database (Neo4j) | incremental | O(d) + I/O | slow (full scan over storage) | yes, ACID | clustered | durable, transactional, ad-hoc |
| Distributed CSR (Pregel) | O((V+E)/P) | network hop | O((V+E)/P) per superstep | batch | yes | whole-graph analytics |
The in-memory CSR wins whenever the graph fits in RAM and is read-mostly. The graph database wins when durability and transactions matter. The distributed engine wins only when the graph genuinely does not fit on one large machine — and modern machines hold a lot of edges, so reach for the cluster later than you think.
6. Architecture Patterns¶
6.1 CSR snapshot with atomic swap¶
writes --> delta log --> (periodic rebuild) --> new immutable CSR
|
readers ----atomic load of current snapshot pointer--+
Readers always dereference one atomic pointer to the current CSR and traverse lock-free. A background job rebuilds and swaps. Bounded staleness (the rebuild interval) is the trade-off for lock-free reads.
6.2 Tiered: hot CSR + cold database¶
Keep the read-hot subgraph (recent, high-traffic vertices) materialized in CSR; serve cold queries from the durable graph database. A cache-miss promotes a vertex's neighborhood into the hot tier. This bounds RAM while keeping the common case fast.
6.3 Compressed representations for scale¶
For web-scale graphs, store CSR with gap-encoded, variable-length neighbor ids (sort each vertex's neighbors, store deltas, compress with a byte-aligned varint or the WebGraph framework). This routinely cuts a web graph from O(E) 4-byte ids to 2–4 bits per edge, letting a graph that would need 400 GB fit in 30 GB. The cost is O(d) decode per neighbor scan — usually worth it because memory, not CPU, is the bottleneck at this scale.
7. Code Examples¶
7.1 Go — immutable CSR with atomic snapshot swap¶
package graph
import (
"sort"
"sync"
"sync/atomic"
)
// CSR is an immutable, read-only graph snapshot. Safe for concurrent readers.
type CSR struct {
offset []int32
target []int32
}
func (c *CSR) Neighbors(u int32) []int32 {
return c.target[c.offset[u]:c.offset[u+1]]
}
// buildCSR turns a directed edge slice into an immutable CSR.
func buildCSR(n int32, edges [][2]int32) *CSR {
offset := make([]int32, n+1)
for _, e := range edges {
offset[e[0]+1]++
}
for i := int32(1); i <= n; i++ {
offset[i] += offset[i-1]
}
target := make([]int32, len(edges))
cursor := make([]int32, n)
copy(cursor, offset[:n])
for _, e := range edges {
u := e[0]
target[cursor[u]] = e[1]
cursor[u]++
}
return &CSR{offset: offset, target: target}
}
// Store holds the current snapshot plus a pending write buffer.
type Store struct {
n int32
current atomic.Pointer[CSR] // lock-free read
mu sync.Mutex // guards pending writes
pending [][2]int32
base [][2]int32
}
func NewStore(n int32, edges [][2]int32) *Store {
s := &Store{n: n, base: edges}
s.current.Store(buildCSR(n, edges))
return s
}
// Read path: no locks, just an atomic load.
func (s *Store) Neighbors(u int32) []int32 {
return s.current.Load().Neighbors(u)
}
// Write path: buffer the edge; it becomes visible after Rebuild.
func (s *Store) AddEdge(u, v int32) {
s.mu.Lock()
s.pending = append(s.pending, [2]int32{u, v})
s.mu.Unlock()
}
// Rebuild folds pending edges into a fresh snapshot and swaps atomically.
func (s *Store) Rebuild() {
s.mu.Lock()
all := make([][2]int32, 0, len(s.base)+len(s.pending))
all = append(all, s.base...)
all = append(all, s.pending...)
s.base = all
s.pending = nil
s.mu.Unlock()
sort.Slice(all, func(i, j int) bool { return all[i][0] < all[j][0] })
s.current.Store(buildCSR(s.n, all)) // readers see new snapshot atomically
}
Notes for review:
- Reads are fully lock-free — one atomic pointer load, then a slice into immutable arrays.
- Writes are buffered and applied in batches by
Rebuild; this trades bounded staleness for lock-free reads. int32ids halve memory versusint; for> 2³¹vertices, switch toint64.
7.2 Java — immutable CSR shared across reader threads¶
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;
public final class GraphStore {
// Immutable snapshot: two flat arrays, no interior mutability.
public static final class CSR {
final int[] offset;
final int[] target;
CSR(int[] offset, int[] target) {
this.offset = offset;
this.target = target;
}
public int[] neighbors(int u) {
return Arrays.copyOfRange(target, offset[u], offset[u + 1]);
}
}
private final int n;
private final AtomicReference<CSR> current = new AtomicReference<>();
public GraphStore(int n, int[][] edges) {
this.n = n;
current.set(build(n, edges));
}
public int[] neighbors(int u) { // lock-free read
return current.get().neighbors(u);
}
public void swap(int[][] edges) { // build new snapshot, publish atomically
current.set(build(n, edges));
}
private static CSR build(int n, int[][] edges) {
int[] offset = new int[n + 1];
for (int[] e : edges) offset[e[0] + 1]++;
for (int i = 1; i <= n; i++) offset[i] += offset[i - 1];
int[] target = new int[edges.length];
int[] cursor = Arrays.copyOf(offset, n);
for (int[] e : edges) target[cursor[e[0]]++] = e[1];
return new CSR(offset, target);
}
}
The final fields plus the AtomicReference publication give the JMM guarantee that a reader observing a new snapshot sees fully-initialized arrays. No reader ever needs a lock.
7.3 Python — out-of-core CSR scan from disk shards¶
import struct
from typing import Iterator, Tuple
def scan_csr_shard(offset_path: str, target_path: str,
lo: int, hi: int) -> Iterator[Tuple[int, int]]:
"""Stream edges for vertices [lo, hi) sequentially from on-disk CSR arrays.
offset_path: int32 offsets, length V+1
target_path: int32 targets, length E
Yields (u, v) edges; sequential disk reads make this out-of-core friendly.
"""
with open(offset_path, "rb") as fo:
fo.seek(lo * 4)
offs = struct.unpack(f"<{hi - lo + 1}i", fo.read((hi - lo + 1) * 4))
start, end = offs[0], offs[-1]
with open(target_path, "rb") as ft:
ft.seek(start * 4)
chunk = ft.read((end - start) * 4)
targets = struct.unpack(f"<{end - start}i", chunk)
for u in range(lo, hi):
a = offs[u - lo] - start
b = offs[u - lo + 1] - start
for k in range(a, b):
yield u, targets[k]
Because CSR is flat, one seek + one big read pulls an entire vertex range's edges as sequential I/O — the property that makes disk-backed graph processing viable. A pointer-based adjacency list could never be streamed this way.
8. Observability¶
A graph representation is invisible until it OOMs or a traversal mysteriously slows. Instrument:
| Metric | Type | Why |
|---|---|---|
graph_vertices, graph_edges | gauge | Track growth toward the RAM ceiling. |
graph_bytes_resident | gauge | Actual memory; compare against capacity plan. |
graph_max_degree | gauge | A new super-hub can blow up per-vertex work and skew partitions. |
csr_rebuild_seconds | histogram | Snapshot rebuild time; if it exceeds the write rate you fall behind. |
csr_snapshot_staleness_seconds | gauge | How old is the current snapshot? Bounds read consistency. |
traversal_neighbors_visited | histogram | Detect a query touching far more of the graph than expected. |
partition_cross_edge_ratio | gauge | Fraction of edges crossing machines; high = bad partition, more network hops. |
graph_cache_miss_rate | gauge | CSR locality regression (e.g. after losing vertex-order locality). |
The most actionable are graph_bytes_resident against the capacity plan, and partition_cross_edge_ratio for distributed graphs — a bad partition silently turns local reads into network storms.
9. Failure Modes¶
9.1 Matrix OOM on a sparse graph¶
Someone reaches for int[V][V] on a graph that grew to V = 10⁵. That is 10¹⁰ cells — instant OOM. Mitigation: never allocate a matrix without checking V² · cellsize against available RAM; default to a list/CSR.
9.2 Super-hub degree explosion¶
A power-law graph gains a celebrity vertex with degree 10⁸. Any per-vertex operation that materializes neighbors(hub) allocates a huge slice; hash partitioning dumps the whole hub on one machine. Mitigation: vertex-cut partitioning, streaming neighbor iteration (never copy the slice), and degree-capping or two-level hub handling.
9.3 Snapshot rebuild falling behind writes¶
Write rate exceeds rebuild throughput; the delta log grows unbounded and staleness climbs. Mitigation: incremental CSR updates for the hot region, faster (parallel) rebuilds, or back-pressure on writes.
9.4 Lost vertex-order locality¶
Ids get renumbered (e.g. after a re-import) so neighbors are no longer near each other in target[]. Cache-miss rate spikes, traversals slow 3–5×. Mitigation: relabel vertices by a locality-preserving order (BFS order, or a space-filling curve for spatial graphs) before building CSR.
9.5 Cross-partition message storm¶
A poorly partitioned distributed graph sends a message across the network for nearly every edge. Mitigation: measure partition_cross_edge_ratio, repartition with a multilevel partitioner, or switch edge-cut → vertex-cut for power-law graphs.
9.6 Concurrent mutation tearing¶
Someone mutates a "shared" adjacency list while readers traverse it, producing torn reads or ConcurrentModificationException. Mitigation: make the shared representation immutable (CSR + atomic swap); route all writes through the rebuild path.
10. Capacity Planning — Bytes per Edge¶
10.1 The fundamental numbers¶
Per directed edge, by representation (32-bit ids):
- CSR:
4 bytes(oneint32intarget[]) plus the amortizedoffsetcost4·(V+1)/Ebytes. ForE ≫ V, effectively 4 bytes/edge. - Adjacency list (
int[]per vertex):4 bytes/edgefor the value plus slice/array header overhead per vertex (~24–48 bytes/vertexin Go/Java) and capacity slack (often 1.5–2× over-allocation). - Adjacency list (
List<Integer>in Java):16 bytes/edge— eachIntegeris a boxed object (~16 B) plus the backing array reference. Avoid boxed collections for large graphs. - Adjacency matrix:
V²/8 bytesfor a bitset (0/1),V² · 4for weighted. Independent ofE. - Compressed (gap + varint):
0.25–1 byte/edgeon web graphs — the only way to fit10¹¹-edge graphs in commodity RAM.
10.2 Worked example¶
A social graph: V = 2×10⁸ users, average degree 200, so E = 4×10¹⁰ directed edges.
- CSR with
int32ids:4·10¹⁰ · 4 B = 160 GBof targets +(2×10⁸+1)·4 B ≈ 0.8 GBoffsets ≈ 161 GB. Fits on a single large memory-optimized instance (e.g. 256–512 GB RAM). Materialize it. - Same as
List<List<Integer>>in Java:~16 B/edge = 640 GBplus per-vertex overhead — does not fit; you would shard or switch to primitive CSR. - Compressed CSR (gap-encoded, ~1 B/edge):
~40 GB— fits comfortably, at the cost of varint decode per neighbor.
10.3 When to leave a single node¶
Move off in-memory CSR when any of these holds:
bytes(E) + bytes(V)exceeds the largest practical single-machine RAM (today ~1–2 TB) even after compression.- The graph must be durable and transactional → graph database.
- Whole-graph analytics latency on one machine is unacceptable → distributed engine with partitioned CSR.
Until then, one big box with a compressed immutable CSR beats a cluster on both latency and operational simplicity. Modern hardware holds tens of billions of edges; reach for distribution later than instinct suggests.
11. Serialization, Loading, and Versioning¶
A graph that lives only in RAM is one process crash away from a multi-minute rebuild. At scale you persist the built representation, not just the raw edges, so a restart is a mmap, not a recompute.
11.1 On-disk CSR layout¶
Persist CSR as a self-describing binary blob: a fixed header (magic, version, n, m, id_width, flags) followed by the offset[] array and the target[] array, both little-endian and naturally aligned. Because both are flat, the file is a byte-for-byte image of the in-memory arrays:
The payoff is zero-copy load: mmap the file and reinterpret the bytes as int32 slices — no parsing, no allocation, no rebuild. A 160 GB social-graph CSR loads in the time it takes the OS to set up page tables; pages fault in lazily as traversals touch them. A pointer-based adjacency list cannot do this — its heap pointers are meaningless across processes — which is another reason CSR dominates the read-mostly tier.
11.2 Versioning and schema evolution¶
Bake a version byte and an id_width (4 vs 8) into the header so a reader can reject or up-convert an incompatible file rather than silently misinterpret offsets. When you cross the 2³¹-vertex boundary you flip id_width to 8 and double the file; readers branch on the header. Treat the format like any wire protocol: never reuse a field meaning, always add new fields after the existing ones, and validate n, m, and offset[n] == m on load before trusting a single byte.
11.3 Snapshot distribution¶
For a fleet of read replicas, the rebuild-and-swap pattern (§4.2) extends across machines: a builder node produces a new immutable CSR file, writes it to object storage (S3/GCS) under a content-addressed key, and replicas atomically mmap the new blob and swap their snapshot pointer. This is the same copy-on-write story at cluster granularity — replicas serve lock-free reads from an immutable mapped file while the next snapshot is built out of band. Staleness equals the build-plus-distribute interval, which you expose as a metric (csr_snapshot_staleness_seconds).
12. Incremental and Streaming Updates¶
The snapshot-swap model assumes you can afford a periodic full rebuild. For a fast-changing hot region or a true streaming graph, a full Θ(n + m) rebuild per batch is wasteful. Three escalating strategies:
12.1 Delta overlay on an immutable base¶
Keep the immutable CSR base plus a small mutable overlay (a per-vertex hash map of added/removed neighbors). A neighbor query reads the base block and applies the overlay diff:
Reads stay lock-free against the base and take a cheap lock (or a second atomic snapshot) only on the overlay. When the overlay grows past a threshold (say 5–10% of m), fold it into the base with one background rebuild and clear it. This bounds both staleness and the per-query overlay cost.
12.2 Slotted / gap CSR¶
Allocate each vertex's neighbor block with slack (a cap[u] ≥ deg[u]), so an insertion writes into a free slot in place instead of shifting the whole target[]. This is the CSR-with-gaps used by GPU dynamic-graph frameworks (Hornet, faimGraph). Inserts are O(1) until a block overflows, at which point that single block is relocated to a fresh region (amortized O(1) with doubling). It sacrifices a little space and perfect locality for in-place mutation without a global rebuild.
12.3 Log-structured merge for graphs¶
Borrowing from LSM trees: writes land in a small in-memory level (an adjacency hash map); when it fills it is flushed to an immutable sorted CSR run; runs are periodically merged. Reads consult the in-memory level plus the runs, newest-wins. This gives high write throughput and bounded read fan-out, and is essentially how transactional graph databases (and time-evolving graph stores) keep both writes cheap and full scans CSR-fast. The trade-off mirrors the LSM read-amplification story: more runs means more blocks to merge per neighbor query until compaction catches up.
12.4 Choosing a strategy¶
| Write rate | Read latency need | Strategy |
|---|---|---|
| Low (hourly batches) | Microsecond, lock-free | Full rebuild + atomic swap |
| Moderate, bursty | Low, slightly stale OK | Delta overlay on immutable base |
| High, in-place | Microsecond, GPU/many-core | Slotted (gap) CSR |
| Very high, durable | Millisecond, durable | LSM-structured CSR runs |
The unifying principle: keep the bulk of the graph in immutable, cache-friendly CSR and confine mutation to a small, separately-managed delta — never mutate the shared flat arrays in place under concurrent readers.
13. Zero-Copy mmap Load — Full Implementation¶
Section 11 argued that persisting the built CSR lets a restart be a mmap rather than a multi-minute rebuild. This section shows the actual mechanics in all three languages, including the header validation that keeps a corrupt or version-skewed file from being silently misinterpreted as offsets and targets.
The on-disk layout is the one from §11.1: a 64-byte header followed by the two flat arrays.
offset 0 [ magic "CSRG" | u32 version | u32 idWidth | u64 n | u64 m | u32 flags | pad ] 64 B
offset 64 [ offset[]: (n+1) * 4 B little-endian int32 ]
offset ... [ target[]: m * 4 B little-endian int32 ]
13.1 Go — mmap a CSR file and reinterpret bytes with no copy¶
package graph
import (
"encoding/binary"
"fmt"
"os"
"unsafe"
"golang.org/x/sys/unix"
)
const headerSize = 64
const magic = 0x47525343 // "CSRG" little-endian
// MappedCSR is a read-only CSR backed directly by an mmap'd file.
// No parsing, no allocation: the arrays alias the mapped pages.
type MappedCSR struct {
data []byte // the whole mmap region (kept alive so the GC won't unmap)
offset []int32 // aliases data[64 : 64+(n+1)*4]
target []int32 // aliases the remainder
n, m int64
}
func OpenCSR(path string) (*MappedCSR, error) {
f, err := os.Open(path)
if err != nil {
return nil, err
}
defer f.Close()
fi, _ := f.Stat()
data, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()),
unix.PROT_READ, unix.MAP_SHARED)
if err != nil {
return nil, err
}
// Validate the header before trusting a single offset.
if binary.LittleEndian.Uint32(data[0:4]) != magic {
return nil, fmt.Errorf("bad magic, not a CSR file")
}
if v := binary.LittleEndian.Uint32(data[4:8]); v != 1 {
return nil, fmt.Errorf("unsupported version %d", v)
}
if w := binary.LittleEndian.Uint32(data[8:12]); w != 4 {
return nil, fmt.Errorf("id width %d unsupported (need int64 path)", w)
}
n := int64(binary.LittleEndian.Uint64(data[16:24]))
m := int64(binary.LittleEndian.Uint64(data[24:32]))
// Reinterpret the mapped bytes as []int32 slices — the zero-copy step.
offBytes := data[headerSize : headerSize+(n+1)*4]
tgtBase := headerSize + (n+1)*4
tgtBytes := data[tgtBase : tgtBase+m*4]
off := unsafe.Slice((*int32)(unsafe.Pointer(&offBytes[0])), n+1)
tgt := unsafe.Slice((*int32)(unsafe.Pointer(&tgtBytes[0])), m)
// Cheap integrity check: the last offset must equal the edge count.
if int64(off[n]) != m {
return nil, fmt.Errorf("offset[n]=%d != m=%d, file corrupt", off[n], m)
}
return &MappedCSR{data: data, offset: off, target: tgt, n: n, m: m}, nil
}
func (c *MappedCSR) Neighbors(u int32) []int32 {
return c.target[c.offset[u]:c.offset[u+1]]
}
func (c *MappedCSR) Close() error { return unix.Munmap(c.data) }
The traversal then faults pages in lazily: a BFS that touches only a small neighborhood pays for only the pages it reads, not the whole 160 GB file. MAP_SHARED + PROT_READ means many processes on one box share the same physical pages — N replicas of a read-only graph cost one copy of RAM.
13.2 Java — MappedByteBuffer as an IntBuffer view¶
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
public final class MappedCSR implements AutoCloseable {
private static final int MAGIC = 0x47525343; // "CSRG"
private final IntBuffer offset; // (n+1) ints, view into the map
private final IntBuffer target; // m ints
private final long n, m;
private final FileChannel channel;
public MappedCSR(String path) throws IOException {
RandomAccessFile raf = new RandomAccessFile(path, "r");
channel = raf.getChannel();
long size = channel.size();
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
buf.order(ByteOrder.LITTLE_ENDIAN);
if (buf.getInt(0) != MAGIC) throw new IOException("bad magic");
if (buf.getInt(4) != 1) throw new IOException("unsupported version");
if (buf.getInt(8) != 4) throw new IOException("id width != 4");
n = buf.getLong(16);
m = buf.getLong(24);
// Slice off the header, then expose the two regions as IntBuffers.
buf.position(64);
offset = buf.slice().asIntBuffer();
offset.limit((int) (n + 1));
buf.position(64 + (int) (n + 1) * 4);
target = buf.slice().asIntBuffer();
target.limit((int) m);
if (offset.get((int) n) != m) throw new IOException("offset[n] != m");
}
public int degree(int u) { return offset.get(u + 1) - offset.get(u); }
public int neighbor(int u, int k) { return target.get(offset.get(u) + k); }
@Override public void close() throws IOException { channel.close(); }
}
MappedByteBuffer keeps the bytes in the OS page cache, outside the Java heap — so a 160 GB graph does not inflate the heap or pressure the garbage collector, and the GC never scans it. This is the standard trick for serving graphs larger than a sane heap.
13.3 Python — mmap + memoryview.cast for a parse-free load¶
import mmap
import struct
from typing import List
class MappedCSR:
"""Read-only CSR backed by an mmap; arrays are zero-copy memoryviews."""
MAGIC = 0x47525343 # "CSRG"
HEADER = 64
def __init__(self, path: str) -> None:
self._file = open(path, "rb")
self._mm = mmap.mmap(self._file.fileno(), 0, prot=mmap.PROT_READ)
magic, version, id_width = struct.unpack_from("<III", self._mm, 0)
if magic != self.MAGIC:
raise ValueError("not a CSR file")
if version != 1:
raise ValueError(f"unsupported version {version}")
if id_width != 4:
raise ValueError("id width != 4")
self.n, self.m = struct.unpack_from("<QQ", self._mm, 16)
view = memoryview(self._mm)
off_end = self.HEADER + (self.n + 1) * 4
# cast() reinterprets the bytes as int32 with no copy.
self.offset = view[self.HEADER:off_end].cast("i")
self.target = view[off_end:off_end + self.m * 4].cast("i")
if self.offset[self.n] != self.m:
raise ValueError("offset[n] != m, file corrupt")
def neighbors(self, u: int) -> List[int]:
return self.target[self.offset[u]:self.offset[u + 1]].tolist()
def degree(self, u: int) -> int:
return self.offset[u + 1] - self.offset[u]
def close(self) -> None:
self.offset.release()
self.target.release()
self._mm.close()
self._file.close()
memoryview.cast("i") gives an int32 view over the mapped bytes with zero copying; slicing it never materializes a list until .tolist() is called. Even in Python — where everything is usually boxed — the bulk graph stays as raw mapped pages, so a multi-gigabyte CSR loads in milliseconds and shares the page cache across worker processes.
13.4 The architecture this unlocks¶
┌─────────────────────────┐
│ builder node │
│ edges → sorted CSR │
│ → write CSRG blob │
└────────────┬────────────┘
│ PUT graph-<contenthash>.csr
▼
┌─────────────────────────┐
│ object storage (S3/GCS) │
│ content-addressed blobs │
└────────────┬────────────┘
GET (lazy/range)│
┌─────────────────┬───────┴────────┬─────────────────┐
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│replica 0│ │replica 1│ │replica 2│ │replica 3│
│ mmap + │ │ mmap + │ │ mmap + │ │ mmap + │
│ atomic │ │ atomic │ │ atomic │ │ atomic │
│ swap ptr│ │ swap ptr│ │ swap ptr│ │ swap ptr│
└─────────┘ └─────────┘ └─────────┘ └─────────┘
lock-free reads from the immutable mapped snapshot; next blob built out of band
Each replica does exactly what §13.1–§13.3 implement: mmap the new blob, validate its header, atomically swap the snapshot pointer (the §4.2 pattern, now at fleet scale). The content-addressed key means a replica that already holds blob <hash> skips the download entirely; rollback is "point at the previous hash." Staleness equals the build-plus-distribute interval, exported as csr_snapshot_staleness_seconds.
14. Capacity Planning Worked End-to-End¶
§10 gave the bytes-per-edge numbers; this section walks a single realistic sizing decision from requirements to an instance choice, the way it would happen in a design review.
14.1 The requirement¶
Serve 2-hop neighborhood queries for a follow graph.
V = 5×10⁸accounts, mean out-degree 150, p99 latency budget 5 ms, read QPS 200k, write QPS 20k (new follows), durability handled separately by the source-of-truth database. The in-memory graph is a read accelerator, rebuildable from the DB.
14.2 Step 1 — raw size¶
E_directed = V * mean_out_degree = 5e8 * 150 = 7.5e10 directed edges
target[] = E * 4 B (int32 ids) = 7.5e10 * 4 = 300 GB
offset[] = (V + 1) * 4 B ≈ 5e8 * 4 = 2 GB
----------------------------------------------------------
primitive CSR resident ≈ 302 GB
V = 5×10⁸ < 2³¹ = 2.1×10⁹, so int32 ids are safe — no need for the 8-byte id path that would double everything to ~600 GB.
14.3 Step 2 — does it fit, and at what cost?¶
| Option | Resident | Fits one box? | Note |
|---|---|---|---|
| Primitive CSR, int32 | ~302 GB | Yes (memory-optimized 512 GB–1 TB instance) | Headroom for the OS, page cache, query state. |
Boxed List<List<Integer>> | ~1.2 TB (16 B/edge) | Barely / no | GC pressure makes p99 unachievable; rejected. |
| Gap+varint compressed CSR | ~75–150 GB (1–2 B/edge) | Comfortably | Adds varint decode per neighbor — check the latency budget. |
14.4 Step 3 — latency check against the budget¶
A 2-hop query touches ≈ deg + Σ_{first hop} deg ≈ 150 + 150·150 = 22,650 edges in the worst typical case. Each neighbor read is a likely cache miss (~100 ns on a random target[]/offset[] access):
primitive CSR: 22,650 misses * 100 ns ≈ 2.3 ms (within 5 ms budget) ✓
compressed CSR: 22,650 * (100 ns + ~20 ns decode) ≈ 2.7 ms (still within budget) ✓
Both meet 5 ms. The decision hinges on RAM cost versus instance availability, not latency — so pick primitive CSR if a 512 GB instance is cheap and available, compressed CSR if you want to run on a 256 GB instance or pack more replicas per host.
14.5 Step 4 — write path and rebuild cadence¶
20k follows/s into a 7.5×10¹⁰-edge base is 1.7×10⁹ new edges/day — about 2.3% growth/day. Using the delta-overlay strategy (§12.1) with a 5% overlay threshold, a full rebuild is needed roughly every two days. Rebuild cost:
rebuild = counting-sort over (E + delta) ≈ 7.7e10 edges
parallel build at ~2e9 edges/s/core * 32 cores ≈ 6.4e10 edges/s
rebuild wall time ≈ 7.7e10 / 6.4e10 ≈ 1.2 s of CPU-bound scatter
+ sort + I/O ≈ tens of seconds total
Comfortably under the every-two-day cadence, so the overlay never overflows and csr_rebuild_seconds stays well below the inter-rebuild interval — the §9.3 failure mode (rebuild falling behind writes) does not trigger.
14.6 Step 5 — replica fleet sizing¶
read QPS 200k, per-query CPU ≈ 2.3 ms (cache-miss bound, single thread)
per-core throughput ≈ 1 / 2.3 ms ≈ 435 q/s
cores needed ≈ 200,000 / 435 ≈ 460 cores
at 64 cores/replica ≈ 8 replicas for compute
Memory says one replica per ~512 GB box; compute says ~8 boxes. Compute dominates here, so provision ~8 replicas (plus 1–2 for redundancy), each mmap-ing the shared blob (§13). The conclusion: a fleet of ten 512 GB / 64-core instances serves a 75-billion-edge follow graph at 200k QPS within a 5 ms budget — no cluster graph engine required, vindicating the §5 "reach for distribution later than instinct suggests" rule.
15. Summary¶
- The representation is a system-design decision: size, mutability, and access pattern dictate it more than asymptotics do.
- Materialize an in-memory CSR when the graph fits in RAM and is read-mostly; query a graph database when durability and transactions matter; reach for a distributed engine only when the graph truly does not fit on one large machine.
- Partitioning quality (edge-cut vs vertex-cut) determines network cost; power-law graphs need vertex-cut to tame super-hubs.
- Immutable CSR plus atomic snapshot swap gives lock-free concurrent reads with bounded staleness — the cleanest concurrency story for a read-heavy graph.
- Plan in bytes-per-edge: primitive CSR is ~4 B/edge, boxed Java lists ~16 B/edge, compressed web graphs under 1 B/edge. The break point is predictable; compute it before you allocate.
- Watch
graph_bytes_resident,csr_snapshot_staleness, super-hub degree, andpartition_cross_edge_ratio— these catch the failures (OOM, stale reads, message storms) that throughput metrics miss. - Persist the built CSR as a flat, versioned binary blob so a restart is a zero-copy
mmaprather than a multi-minute rebuild; distribute snapshots to read replicas via content-addressed object storage and atomic pointer swap. - For fast-changing graphs, confine mutation to a small delta — a delta overlay on an immutable base, a slotted (gap) CSR for in-place inserts, or LSM-structured CSR runs — and never mutate the shared flat arrays in place under concurrent readers.
References to study further: Pregel (Malewicz et al. 2010), PowerGraph vertex-cut partitioning (Gonzalez et al. 2012), GraphChi out-of-core (Kyrola et al. 2012), the WebGraph compression framework (Boldi & Vigna), and Neo4j's native graph storage internals.