Composite — Optimization¶
1. How to use this file¶
Twelve scenarios where Composite-shaped code allocates more, traverses slower, or scales worse than it should. Each entry has a Before (code + benchmark) and a collapsible After (optimized code + benchmark + why + trade-offs + when NOT).
Anchored at Go 1.23, amd64. Numbers are reproducible-shape — run go test -bench=. -benchmem on your hardware before quoting them. Composite cost is dominated by four things: per-node interface dispatch, per-traversal slice allocation, deep recursion stack growth, and pointer-chasing across the heap. Most wins remove one of those four from the hot path. Reading order: Ex. 1, 2, 5, then any order. Ex. 4, 7, 10 are the ones most senior reviews flag.
2. Exercise 1 — Recursive Walk on a deep tree¶
A pure-interface Composite walks a 50k-node parse tree of average depth 200. Each recursive call pushes a ~80 B frame. Go's stack copy on overflow is amortized, but a balanced 200-deep tree bounces between stack sizes.
type Node interface{ Children() []Node }
func Walk(n Node, fn func(Node)) {
fn(n)
for _, c := range n.Children() { Walk(c, fn) }
}
After
Iterative DFS with an explicit `[]Node` stack. Reuse via `sync.Pool` for very hot loops. ~1.6× faster, depth bounded by an explicit slice not the goroutine stack. **Why faster:** No call/return overhead per node. The stack slice fits in L1 once stabilized. The runtime stops growing/shrinking the goroutine stack mid-walk. **Trade-off:** One slice allocation per walk (amortize via `sync.Pool`). Pre-order is easy; post-order needs a two-phase stack or `(node, visited)` pairs. Panic traces no longer show traversal depth. **When NOT:** Shallow trees (depth ≤ 32) where recursion is shorter. Trees needing natural post-order without bookkeeping. Test code where stack traces are valuable.3. Exercise 2 — Children() copying a slice per call¶
A Folder.Children() returns a defensive copy of d.children to prevent caller mutation. A walker over 100k nodes pays 100k allocations.
func (d *Folder) Children() []Node {
out := make([]Node, len(d.children))
copy(out, d.children)
return out
}
After
Return the underlying slice. Document the contract: callers must not mutate. The stdlib does this (`bytes.Buffer.Bytes()`). ~25× faster, zero allocations. **Why faster:** A slice header is three words on the stack — no heap traffic. The previous version copied 8 B per child × 100k = 0.8 MB through the allocator. **Trade-off:** Callers can corrupt the tree by mutating the returned slice. Mitigated by code review and a doc comment. For untrusted library boundaries, expose `Len()` + `Child(i)` instead — allocation-free and unaliasable. **When NOT:** Public API where mutation safety beats throughput. Concurrent access where another goroutine may `Add` mid-iteration — needs a copy or RWMutex-guarded view.4. Exercise 3 — Interface method dispatch per node¶
A pure-interface Composite calls n.Kind() and n.Children() per node. Each call is an indirect through the iface itab — a pipeline stall under random node-type interleaving.
type Node interface {
Kind() Kind
Children() []Node
}
func CountByKind(n Node, k Kind) int {
count := 0
Walk(n, func(x Node) { if x.Kind() == k { count++ } })
return count
}
After
Tagged-struct: one concrete `Node` with `Kind` field. Dispatch becomes a field load + integer compare. Inlines cleanly.type Kind uint8
const (
KindFile Kind = iota
KindFolder
)
type Node struct {
Kind Kind
Name string
Children []*Node
}
func CountByKind(root *Node, k Kind) int {
count := 0
var walk func(*Node)
walk = func(n *Node) {
if n.Kind == k { count++ }
for _, c := range n.Children { walk(c) }
}
walk(root); return count
}
5. Exercise 4 — ast.Inspect descending into subtrees you don't need¶
A linter scans a 200-file package for fmt.Println calls. ast.Inspect walks every node, but the linter only cares about *ast.CallExpr. Returning true descends into bodies it doesn't need.
ast.Inspect(file, func(n ast.Node) bool {
if call, ok := n.(*ast.CallExpr); ok {
if sel, ok := call.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == "Println" {
report(call)
}
}
return true
})
After
Return `false` early to skip subtrees that can't contain the target.ast.Inspect(file, func(n ast.Node) bool {
switch x := n.(type) {
case *ast.CallExpr:
if sel, ok := x.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == "Println" {
report(x)
}
return false // Args can't contain a top-level Println
case *ast.FuncDecl, *ast.BlockStmt, *ast.File:
return true
case ast.Expr:
return false // skip type exprs, literals, idents
}
return true
})
6. Exercise 5 — Per-node new allocation during parse¶
A parser allocates each *Node with &Node{...}. For a 500k-node tree, that's 500k separate heap objects, ~32 B each, fragmented and individually traced by the GC.
type Node struct {
Kind Kind
Name string
Children []*Node
}
func parseFolder(name string, entries []entry) *Node {
n := &Node{Kind: KindFolder, Name: name}
for _, e := range entries {
if e.IsDir {
n.Children = append(n.Children, parseFolder(e.Name, e.Sub))
} else {
n.Children = append(n.Children, &Node{Kind: KindFile, Name: e.Name})
}
}
return n
}
After
Arena: one big `[]Node` per parse, hand out pointers into it. Sub-chunks chained when full. GC sees one allocation per chunk. ~10× faster, ~3500× fewer allocations. **Why faster:** `runtime.mallocgc` per-call overhead is 30-50 ns regardless of size. Arena amortizes across 4096 nodes per chunk. GC scan drops from O(nodes) to O(chunks). Children of the same parent typically land in the same chunk — cache locality. The arena freezes a chunk when it fills so prior pointers stay stable. **Trade-off:** Arena lifetime = tree lifetime — no individual frees. Replacing a subtree needs re-walking to compact. `*Node` from one arena passed to another is a footgun. **When NOT:** Trees heavily mutated post-parse. Trees living across multiple parses where node identity must persist. Small trees (< 1k nodes) where wins don't pay for API friction.7. Exercise 6 — Linear search in Children for a named child¶
A config tree resolves root.Get("server.http.port") by splitting on . and linearly scanning each level's Children. For wide nodes (50 keys), the inner loop is O(N) per segment.
func (n *Node) Get(path string) (*Node, bool) {
cur := n
for _, seg := range strings.Split(path, ".") {
var found *Node
for _, c := range cur.Children {
if c.Name == seg { found = c; break }
}
if found == nil { return nil, false }
cur = found
}
return cur, true
}
After
Index children by name lazily on first lookup.type Node struct {
Kind Kind
Name string
Children []*Node
byName map[string]*Node // lazily built; nil if never indexed
}
func (n *Node) child(name string) (*Node, bool) {
if n.byName == nil && len(n.Children) > 8 {
n.byName = make(map[string]*Node, len(n.Children))
for _, c := range n.Children { n.byName[c.Name] = c }
}
if n.byName != nil { c, ok := n.byName[name]; return c, ok }
for _, c := range n.Children { // fall through for small fanouts
if c.Name == name { return c, true }
}
return nil, false
}
8. Exercise 7 — Pointer-tangled tree with NextSibling/PrevSibling¶
A DOM-style Composite (modeled on golang.org/x/net/html.Node) uses Parent, FirstChild, NextSibling, PrevSibling pointers. Each node is its own heap allocation; sibling iteration chases pointers across the heap.
type Node struct {
Parent, FirstChild, NextSibling, PrevSibling *Node
Data string
}
func walkSiblings(parent *Node, fn func(*Node)) {
for c := parent.FirstChild; c != nil; c = c.NextSibling { fn(c) }
}
After
Contiguous `[]Node` arena; reference parents/children/siblings by `int32` index. Sibling iteration becomes sequential memory access. ~2.3× faster. **Why faster:** The original chases pointers to arbitrary heap addresses; the prefetcher can't predict them. The indexed version dereferences into a contiguous slice — sequential access hits cache lines already prefetched. Node size drops from 40 B to 24 B, packing more nodes per cache line. **Trade-off:** Indices are stable only if the slice doesn't reallocate — pre-size or freeze the tree post-build. Removing a node leaves a tombstone (or requires compaction). Loses cross-tree node sharing. `n.FirstChild.Data` becomes `t.nodes[t.nodes[n].FirstChild].Data`; wrap in helpers. **When NOT:** Trees needing cross-tree references or persistent node identity across mutations. Trees mutated heavily — index churn outweighs cache wins. Public APIs where `*Node` is part of the contract.9. Exercise 8 — Concurrent walk with a channel per node¶
A parallel walker sends every node through a channel to a worker pool. Per-node send/receive is ~150 ns and contends on the channel's lock.
func ParallelWalk(root Node, fn func(Node)) {
ch := make(chan Node, 64)
var wg sync.WaitGroup
for i := 0; i < runtime.GOMAXPROCS(0); i++ {
wg.Add(1)
go func() { defer wg.Done(); for n := range ch { fn(n) } }()
}
var enqueue func(Node)
enqueue = func(n Node) {
ch <- n
for _, c := range n.Children() { enqueue(c) }
}
enqueue(root); close(ch); wg.Wait()
}
After
Fan-out worker pool seeded with the root's children. Each worker pulls a *subtree* and walks it locally. Channel sees subtree roots only.func ParallelWalk(root Node, fn func(Node)) {
work := make(chan Node, runtime.GOMAXPROCS(0))
var wg sync.WaitGroup
walkLocal := func(n Node) { // iterative DFS, no channel per node
stack := []Node{n}
for len(stack) > 0 {
x := stack[len(stack)-1]; stack = stack[:len(stack)-1]
fn(x)
kids := x.Children()
for i := len(kids) - 1; i >= 0; i-- { stack = append(stack, kids[i]) }
}
}
for i := 0; i < runtime.GOMAXPROCS(0); i++ {
wg.Add(1)
go func() { defer wg.Done(); for n := range work { walkLocal(n) } }()
}
fn(root)
for _, c := range root.Children() { work <- c }
close(work); wg.Wait()
}
10. Exercise 9 — json.Marshal of a 100 MB tree¶
json.Marshal builds the entire output in a []byte before returning. Peak memory doubles (tree + serialized form), and the GC sees a multi-MB allocation.
func WriteTree(w io.Writer, root *Node) error {
data, err := json.Marshal(root)
if err != nil { return err }
_, err = w.Write(data); return err
}
After
Stream JSON manually, writing each subtree as it's emitted. Peak memory stays bounded.func WriteTree(w io.Writer, n *Node) error {
io.WriteString(w, `{"name":`)
nameBytes, _ := json.Marshal(n.Name)
w.Write(nameBytes)
if len(n.Children) > 0 {
io.WriteString(w, `,"children":[`)
for i, c := range n.Children {
if i > 0 { io.WriteString(w, ",") }
if err := WriteTree(w, c); err != nil { return err }
}
io.WriteString(w, `]`)
}
_, err := io.WriteString(w, `}`); return err
}
11. Exercise 10 — Naive O(n²) tree diff¶
A reconciler computes the structural diff of two trees by checking, for each node in tree A, whether tree B contains an isomorphic subtree. Naive: for every (a, b) pair, recursively compare — O(n²) at best.
func equalSubtree(a, b *Node) bool {
if a.Kind != b.Kind || a.Name != b.Name || len(a.Children) != len(b.Children) { return false }
for i := range a.Children {
if !equalSubtree(a.Children[i], b.Children[i]) { return false }
}
return true
}
func findMoves(oldTree, newTree *Node) []Move {
var moves []Move
var visit func(*Node)
visit = func(n *Node) {
for _, m := range allNodes(newTree) {
if equalSubtree(n, m) { moves = append(moves, Move{n, m}); break }
}
for _, c := range n.Children { visit(c) }
}
visit(oldTree); return moves
}
After
Hash each subtree bottom-up: `H(kind, name, child_hashes...)`. Equal subtrees collide. Lookup is O(1) per node.func computeHash(n *Node) uint64 {
h := fnv.New64a()
binary.Write(h, binary.LittleEndian, n.Kind)
h.Write([]byte(n.Name))
for _, c := range n.Children { binary.Write(h, binary.LittleEndian, computeHash(c)) }
return h.Sum64()
}
func findMoves(oldTree, newTree *Node) []Move {
newIndex := map[uint64]*Node{}
var index func(*Node)
index = func(n *Node) { newIndex[computeHash(n)] = n; for _, c := range n.Children { index(c) } }
index(newTree)
var moves []Move
var visit func(*Node)
visit = func(n *Node) {
if m, ok := newIndex[computeHash(n)]; ok {
moves = append(moves, Move{n, m}); return // whole subtree matched
}
for _, c := range n.Children { visit(c) }
}
visit(oldTree); return moves
}
12. Exercise 11 — Visitor allocating a closure per call¶
A traversal builds a per-node closure capturing local state. Each Walk(root, func(n Node) { ... }) heap-allocates the closure because the func escapes through Walk. A watcher calling Walk 10k times per scan churns 240 KB/s of garbage.
func CountFiles(root Node) int {
count := 0
Walk(root, func(n Node) { // closure escapes — heap alloc
if _, ok := n.(*File); ok { count++ }
})
return count
}
After
Define a Visitor struct once; reuse across walks. Method dispatch is monomorphic per visitor type, and no closure escapes.type Visitor interface{ Visit(n Node) (descend bool) }
func Walk(n Node, v Visitor) {
if !v.Visit(n) { return }
for _, c := range n.Children() { Walk(c, v) }
}
type FileCounter struct{ Count int }
func (fc *FileCounter) Visit(n Node) bool {
if _, ok := n.(*File); ok { fc.Count++ }
return true
}
func CountFiles(root Node) int { fc := FileCounter{}; Walk(root, &fc); return fc.Count }
13. Exercise 12 — ast.Walk on a hot lint path¶
A linter uses ast.Walk(visitor, file). The interface requires Visit(ast.Node) ast.Visitor — returning a new visitor per node. Each call materializes an iface header on the return path.
type printlnFinder struct{}
func (p printlnFinder) Visit(n ast.Node) ast.Visitor {
if call, ok := n.(*ast.CallExpr); ok {
if sel, ok := call.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == "Println" {
report(call)
}
}
return p // returning iface per node
}
ast.Walk(printlnFinder{}, file)
After
`ast.Inspect` takes `func(ast.Node) bool` — a single closure, no per-node visitor return. ~1.4× faster, ~2.5× less garbage. **Why faster:** `ast.Walk` boxes the returned `ast.Visitor` per node. `ast.Inspect` internally wraps the closure in a single visitor instance — the closure is allocated once at the top, not per visit. The bool return avoids iface header construction on the hot return path. **Trade-off:** `ast.Inspect` can't vary descent behavior beyond yes/no — no per-subtree visitor switch. Most linters don't need it; state-machine visitors do. Beware of capturing huge state through `ast.Inspect` if called many times. **When NOT:** Visitors that legitimately need to return a different visitor per subtree (switching modes when entering a function body). Tools needing the pre/post hook of `Walk`. Code where explicit visitor types beat marginal speed.14. When NOT to optimize¶
Composite cost dominates only when traversal is on the hot path of a high-frequency operation. If your tree is walked once per minute (config reload, periodic report), every optimization here is irrelevant: one-shot AST walks in a CLI tool, filesystem scans triggered by user action, test fixtures built per case — Composite cost is dwarfed by surrounding work.
Profile first. Composite overhead has four signatures in a CPU profile: runtime.mallocgc on a hot stack → Ex. 2 or 5; runtime.morestack_noctxt on traversal → Ex. 1; runtime.convI2I per node → Ex. 3 or 11; runtime.mapaccess2_faststr on a tree walk → Ex. 7 (or reconsider whether you're using the tree as a map).
Common premature optimizations: iterative Walk (Ex. 1) on trees of depth < 32; arena (Ex. 5) on trees < 1k nodes; index-by-name (Ex. 7) when fanout averages < 8; contiguous index tree (Ex. 8) when sibling iteration isn't a hotspot; hash diff (Ex. 10) on trees diffed once per session.
Correctness gaps disguised as optimizations: returning the underlying Children slice (Ex. 2) without documenting "do not mutate"; tagged struct (Ex. 3) without exhaustiveness checks — a new Kind falls through silently; pruned ast.Inspect (Ex. 4) that skips a category the linter was supposed to cover; arena (Ex. 5) where a pointer escapes the arena's lifetime — UAF returning garbage; cached child index (Ex. 7) where Add/Remove forgot to update the map; indexed tree (Ex. 8) where slice growth invalidates pointers callers held; fan-out walk (Ex. 9) calling a non-thread-safe fn; streaming JSON (Ex. 9) that doesn't flush on error; hash diff (Ex. 10) without collision fallback — two different subtrees treated as identical; struct visitor (Ex. 11) reused across concurrent walks.
15. Summary¶
Always-ship wins (default in any new Composite code): return the underlying Children slice with a "do not mutate" doc (Ex. 2); return false from ast.Inspect for subtrees that can't contain the target (Ex. 4); hoist tree lookups out of hot loops; prefer ast.Inspect over ast.Walk for closure-style traversals (Ex. 12); stable identity (Name, ID) on nodes — not pointer equality — when reconciling across rebuilds.
Wins behind a profile (when measurements justify them): iterative Walk (Ex. 1, when morestack shows); tagged-struct (Ex. 3, when convI2I shows); arena allocation (Ex. 5, when mallocgc shows in parse paths); indexed children (Ex. 7, when linear scan in Children shows); contiguous-slice tree (Ex. 8, when pointer-chase stalls show); fan-out worker pool (Ex. 9, when single-threaded walk is CPU-bound and tree is balanced); streaming JSON (Ex. 9, when peak memory matters); hashed subtree diff (Ex. 10, when reconciliation is on a request path); struct visitor (Ex. 11, when closure-per-walk shows in alloc profiles).
Specialty (only when the design calls for it): custom arena with per-chunk freezing for parsers with millions of nodes per parse; slice-of-Node-as-DOM for game scene graphs and AST batch transforms; bottom-up hash diff with collision fallback for VDOM reconcilers and structural search; work-stealing parallel walk for heavily skewed trees with deep subtree workloads.
Composite cost is dispatch, allocation, depth, and cache misses. Strip those four from the read path by choosing the right primitive: pure-interface for ergonomic typed analysis; tagged-struct for tight memory and switch dispatch; arena for parse-heavy paths; indexed slices for cache-friendly siblings. The pattern itself is cheap — the wins come from matching the traversal shape to the tree shape. Profile, then pick the lever; the four signatures above tell you which one.