Arrays — Optimize¶
Overview¶
Each exercise presents slow or suboptimal code involving arrays. Identify the performance issue, understand why it's slow, then implement the optimized solution. Difficulty: 🟢 Easy | 🟡 Medium | 🔴 Hard. Type icons: 📦 Memory | ⚡ CPU | 🔄 Algorithm | 💾 Cache.
Exercise 1 🟢 ⚡ — Avoid Copying Large Array by Value¶
Title: Summing a large array passed by value
What it does: Computes the sum of a 10,000-element array.
Problem: Passing [10000]int by value copies 80KB on every call.
Slow Code:
func sum(arr [10000]int) int {
    total := 0
    for _, v := range arr {
        total += v
    }
    return total
}

func main() {
    var arr [10000]int
    for i := range arr { arr[i] = i }
    fmt.Println(sum(arr)) // copies 80KB!
}
Benchmark:
Hint
Pass a pointer to the array instead of passing the array by value. This passes only 8 bytes (the pointer) rather than 80KB.
Optimized Solution
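A sketch applying the hint (only the signature and call site change; `range` over an array pointer works the same as over the array):

```go
// Optimized: take a pointer, so only 8 bytes cross the call boundary.
func sum(arr *[10000]int) int {
    total := 0
    for _, v := range arr { // ranging over *[10000]int is allowed in Go
        total += v
    }
    return total
}
```

The caller passes `sum(&arr)` instead of `sum(arr)`.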
**Optimized Benchmark:** 5x faster, zero allocations.
Exercise 2 🟢 📦 — Use Array Instead of Slice for Small Fixed Collections¶
Title: Using a slice where an array is more efficient
What it does: Stores RGB color components.
Problem: A slice adds a 24-byte header and a heap allocation. For exactly 3 components, an array is better.
Slow Code:
type Color struct {
    components []uint8 // 24-byte header + heap alloc
}

func newRed() Color {
    return Color{components: []uint8{255, 0, 0}} // heap allocation
}

func brightness(c Color) float64 {
    sum := 0
    for _, v := range c.components {
        sum += int(v)
    }
    return float64(sum) / 3.0
}
Benchmark:
Hint
When the size is always exactly 3, use `[3]uint8` instead of `[]uint8`. This embeds the data inline in the struct with no separate heap allocation.
Optimized Solution
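A sketch following the hint, with the slice replaced by a fixed-size array:

```go
type Color struct {
    components [3]uint8 // stored inline: no slice header, no heap allocation
}

func newRed() Color {
    return Color{components: [3]uint8{255, 0, 0}}
}

func brightness(c Color) float64 {
    sum := 0
    for _, v := range c.components {
        sum += int(v)
    }
    return float64(sum) / 3.0
}
```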
**Optimized Benchmark:** 5x faster, zero allocations. The array is embedded inline in the struct.
Exercise 3 🟢 🔄 — Use Array as Map Key Instead of String¶
Title: Constructing a string key for cache lookup
What it does: Cache lookup using SHA256 hash as key.
Problem: Converting [32]byte to string allocates a 32-byte string on every lookup.
Slow Code:
var cache = map[string]string{}

func get(hash [32]byte) (string, bool) {
    key := string(hash[:]) // allocates a new string every call
    v, ok := cache[key]
    return v, ok
}
Benchmark:
Hint
Use `[32]byte` directly as the map key. Arrays are comparable and can be map keys without any conversion or allocation.
Optimized Solution
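A sketch applying the hint: the map is keyed by the array type itself, so the lookup needs no conversion.

```go
var cache = map[[32]byte]string{}

func get(hash [32]byte) (string, bool) {
    v, ok := cache[hash] // arrays are comparable: direct key, zero allocations
    return v, ok
}
```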
**Optimized Benchmark:** 3x faster, zero allocations per lookup.
Exercise 4 🟡 💾 — Cache-Line-Friendly Array Access Pattern¶
Title: Column-major vs row-major matrix traversal
What it does: Sums all elements of a 1000x1000 matrix.
Problem: Column-major traversal (outer loop = column, inner loop = row) causes cache misses because the matrix is stored in row-major order.
Slow Code:
const n = 1000

var matrix [n][n]int64

func sumColumnMajor() int64 {
    var total int64
    for col := 0; col < n; col++ {
        for row := 0; row < n; row++ {
            total += matrix[row][col] // cache-unfriendly: jumps 8000 bytes each step
        }
    }
    return total
}
Benchmark:
Hint
Go (like C) stores multi-dimensional arrays in row-major order. Make the inner loop iterate over columns within a row, so consecutive memory addresses are read.
Optimized Solution
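A sketch with the loop order swapped so access is sequential (the `n` and `matrix` declarations from the slow code are repeated here so the snippet stands alone):

```go
const n = 1000

var matrix [n][n]int64

func sumRowMajor() int64 {
    var total int64
    for row := 0; row < n; row++ {
        for col := 0; col < n; col++ {
            total += matrix[row][col] // consecutive addresses: cache- and prefetcher-friendly
        }
    }
    return total
}
```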
**Optimized Benchmark:** 5x faster due to sequential memory access pattern and CPU prefetcher effectiveness.
Exercise 5 🟡 📦 — Avoid Heap Escape by Keeping Array Local¶
Title: Array unnecessarily escaping to heap
What it does: Computes a checksum of input data using a temporary buffer.
Problem: The buffer array is returned as a pointer, causing it to escape to the heap.
Slow Code:
func computeChecksum(data []byte) *[32]byte {
    buf := [32]byte{} // escapes to heap because &buf is returned
    copy(buf[:], data[:min(len(data), 32)])
    // ... compute checksum using buf ...
    return &buf // forces heap allocation
}

func process(data []byte) {
    cs := computeChecksum(data)
    fmt.Printf("%x\n", *cs)
}
Benchmark:
Hint
Return the array by value instead of by pointer. The caller receives a copy, but the copy is on the caller's stack — no heap allocation needed.
Optimized Solution
func computeChecksum(data []byte) [32]byte {
    var buf [32]byte // stays on stack — no escape!
    copy(buf[:], data[:min(len(data), 32)])
    // ... compute checksum ...
    return buf // returned by value — copy to caller's stack
}

func process(data []byte) {
    cs := computeChecksum(data) // cs is on stack
    fmt.Printf("%x\n", cs)
}

func min(a, b int) int {
    if a < b { return a }
    return b
}
Exercise 6 🟡 ⚡ — Bounds Check Elimination with Index Hint¶
Title: Repeated bounds checks in hot loop
What it does: Processes elements from two arrays in parallel.
Problem: The compiler may emit bounds checks on every array access inside the loop.
Slow Code:
func addArrays(a, b [1000]int) [1000]int {
    var result [1000]int
    for i := 0; i < 1000; i++ {
        result[i] = a[i] + b[i] // potentially 3 bounds checks per iteration
    }
    return result
}
Benchmark:
Hint
Use `range` over one of the arrays: the compiler then knows the index is in bounds for all three same-length arrays. (For slices, a pre-loop access such as `_ = a[999]` serves as a BCE hint.)
Optimized Solution
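A sketch applying the hint: `range` yields an index the compiler can prove is in bounds for all three fixed-size arrays.

```go
func addArrays(a, b [1000]int) [1000]int {
    var result [1000]int
    for i, v := range a { // i < 1000 is proven, so b[i] and result[i] need no checks
        result[i] = v + b[i]
    }
    return result
}
```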
**Optimized Benchmark:** 2x faster. Verify with: `go build -gcflags="-d=ssa/check_bce/debug=1"`
Exercise 7 🟡 🔄 — Pre-Sort for Binary Search vs Linear Search¶
Title: Linear search on a fixed-size lookup table
What it does: Searches for a value in a sorted array of 256 entries.
Problem: Linear scan is O(n); binary search is O(log n).
Slow Code:
var lookup [256]int

func find(target int) int {
    for i, v := range lookup {
        if v == target {
            return i // O(n) linear search
        }
    }
    return -1
}
Benchmark:
Hint
For sorted arrays, binary search reduces the average case from O(n) to O(log n). The `sort.SearchInts` function implements this.
Optimized Solution
import "sort"

var lookup [256]int

func init() {
    for i := range lookup { lookup[i] = i * 2 }
    // lookup is already sorted in this example
}

func findFast(target int) int {
    // Binary search: at most 8 comparisons (log2(256) = 8)
    i := sort.SearchInts(lookup[:], target)
    if i < len(lookup) && lookup[i] == target {
        return i
    }
    return -1
}
Exercise 8 🟡 📦 — Avoid Intermediate Slice Allocation¶
Title: Unnecessary slice allocation when operating on array
What it does: Converts array to JSON.
Problem: Creates an intermediate slice just to pass to a function.
Slow Code:
import "encoding/json"

func serializeScores(scores [10]int) ([]byte, error) {
    // Creating a slice allocates a new backing array
    s := make([]int, len(scores))
    for i, v := range scores {
        s[i] = v
    }
    return json.Marshal(s) // extra allocation
}
Benchmark:
Hint
Use `scores[:]` to create a slice that shares the array's memory — no copy, no allocation.
Optimized Solution
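A sketch following the hint: slicing the array shares its backing memory, so the copy loop and the extra backing array disappear.

```go
import "encoding/json"

func serializeScores(scores [10]int) ([]byte, error) {
    // scores[:] is a view over the array: no copy, no new backing array
    return json.Marshal(scores[:])
}
```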
**Optimized Benchmark:** 1.6x faster, one fewer allocation.
Exercise 9 🔴 💾 — False Sharing Elimination¶
Title: Parallel increment of adjacent array elements
What it does: 8 goroutines each count events in their own bucket.
Problem: All 8 counters fit on one cache line, causing cross-CPU cache invalidation.
Slow Code:
type Stats struct {
    buckets [8]int64 // 64 bytes = 1 cache line — false sharing!
}

func benchmark(s *Stats) {
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func(idx int) {
            defer wg.Done()
            for n := 0; n < 1000000; n++ {
                atomic.AddInt64(&s.buckets[idx], 1)
            }
        }(i)
    }
    wg.Wait()
}
Benchmark:
Hint
Pad each counter to occupy its own 64-byte cache line. This prevents writes to one counter from invalidating the cache line for other counters.
Optimized Solution
const cacheLineSize = 64

type paddedCounter struct {
    val int64
    _   [cacheLineSize - 8]byte // pad to 64 bytes
}

type Stats struct {
    buckets [8]paddedCounter // each on its own cache line
}

func benchmarkFast(s *Stats) {
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func(idx int) {
            defer wg.Done()
            for n := 0; n < 1000000; n++ {
                atomic.AddInt64(&s.buckets[idx].val, 1)
            }
        }(i)
    }
    wg.Wait()
}
Exercise 10 🔴 ⚡ — Vectorizable Array Operation¶
Title: Scalar element-wise comparison
What it does: Counts elements greater than a threshold.
Problem: Pure Go scalar loop; the compiler may not auto-vectorize.
Slow Code:
func countAbove(arr *[1024]int32, threshold int32) int {
    count := 0
    for i := range arr {
        if arr[i] > threshold {
            count++
        }
    }
    return count
}
Benchmark:
Hint
Restructure to help the compiler vectorize: use a fixed-size inner loop (4 or 8 elements), avoid data-dependent branches inside the loop, and use `int32` (which fits 8 per AVX2 register).
Optimized Solution
// Branchless version: no data-dependent branch inside the loop body
func countAboveFast(arr *[1024]int32, threshold int32) int {
    count := 0
    for i := range arr {
        // Compute (threshold - arr[i]) in 64 bits so it cannot overflow;
        // its sign bit is 1 exactly when arr[i] > threshold.
        // Adding the shifted sign bit counts matches without branching.
        d := int64(threshold) - int64(arr[i])
        count += int(uint64(d) >> 63)
    }
    return count
}

// For maximum performance, use assembly or golang.org/x/sys for SIMD.
// The key insight: avoid early exits, keep the loop body simple and predictable.
Exercise 11 🔴 🔄 — Zero-Allocation Lookup with Precomputed Array¶
Title: Computing a function on every call vs precomputed table
What it does: Converts a byte value to its hex string representation.
Problem: Computing hex conversion on every call involves format parsing overhead.
Slow Code:
func byteToHex(b byte) string {
    return fmt.Sprintf("%02x", b) // allocates a string every call
}

func encodeAll(data [256]byte) string {
    var sb strings.Builder
    for _, b := range data {
        sb.WriteString(byteToHex(b))
    }
    return sb.String()
}
Benchmark:
Hint
Precompute a lookup table of all 256 possible hex values at program startup. Each call is then a simple array lookup — O(1) with no allocation.
Optimized Solution
// Precomputed lookup table — initialized once, read-only afterwards
var hexTable [256][2]byte

func init() {
    const digits = "0123456789abcdef"
    for i := 0; i < 256; i++ {
        hexTable[i][0] = digits[i>>4]
        hexTable[i][1] = digits[i&0xf]
    }
}

func encodeAllFast(data [256]byte) string {
    var buf [512]byte
    for i, b := range data {
        entry := hexTable[b]
        buf[i*2] = entry[0]
        buf[i*2+1] = entry[1]
    }
    return string(buf[:]) // single allocation at the end
}
Exercise 12 🔴 📦 — Pool Stack-Allocated Buffers for HTTP Handlers¶
Title: Allocating temporary buffers per request
What it does: Reads and processes HTTP request bodies.
Problem: Each request allocates a new buffer; under high load this creates GC pressure.
Slow Code:
func handler(w http.ResponseWriter, r *http.Request) {
    buf := make([]byte, 32*1024) // 32KB allocated per request
    n, _ := io.ReadFull(r.Body, buf)
    process(buf[:n])
}
Benchmark (under 1000 rps):
Hint
Use `sync.Pool` to reuse buffers across requests. Combined with a fixed-size array inside the pool objects, this eliminates most allocation overhead.
Optimized Solution
const bufSize = 32 * 1024

// Pool of reusable buffers — size known at compile time
type requestBuffer struct {
    data [bufSize]byte
}

var bufPool = sync.Pool{
    New: func() interface{} {
        return &requestBuffer{}
    },
}

func handlerFast(w http.ResponseWriter, r *http.Request) {
    buf := bufPool.Get().(*requestBuffer)
    defer bufPool.Put(buf)
    n, _ := io.ReadFull(r.Body, buf.data[:])
    process(buf.data[:n])
}