Cross-compilation — Optimization

A release matrix that builds N targets sequentially with a cold cache wastes minutes per push. These exercises cut wall time and artifact size. Numbers are illustrative; measure with time on your own program.


Exercise 1: Parallelize target builds

Before — a shell loop builds one target after another:

for t in linux/amd64 linux/arm64 darwin/arm64 windows/amd64; do
  os=${t%/*}; arch=${t#*/}
  CGO_ENABLED=0 GOOS=$os GOARCH=$arch go build -o dist/app-$os-$arch .
done

After — fan out via make -j, GitHub Actions matrix, or xargs -P:

printf '%s\n' linux/amd64 linux/arm64 darwin/arm64 windows/amd64 \
  | xargs -P 4 -I{} bash -c '
      t={}; os=${t%/*}; arch=${t#*/}
      CGO_ENABLED=0 GOOS=$os GOARCH=$arch \
        go build -trimpath -ldflags="-s -w" \
        -o dist/app-$os-$arch .'

| Metric | Sequential (4 targets) | Parallel -P 4 |
|---|---|---|
| Wall time (small app, warm cache) | ~12s | ~4s |
| CPU usage | 1 core busy | up to 4 cores busy |

Each target gets its own linker process — the work is embarrassingly parallel.
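
The fan-out above can also be factored into small functions, so the target string is passed as an argument rather than substituted into the script text. This is a minimal sketch, not a definitive build script: the function names `out_name` and `build_one` are hypothetical, and the `.exe` suffix for Windows artifacts is an assumption added here (the original commands do not add one).

```bash
# out_name prints the artifact path for an "os/arch" pair such as linux/amd64.
# Assumption: Windows binaries get an .exe suffix; go build -o does not add it for you.
out_name() {
  local os="${1%/*}" arch="${1#*/}" ext=""
  if [ "$os" = "windows" ]; then
    ext=".exe"
  fi
  printf 'dist/app-%s-%s%s\n' "$os" "$arch" "$ext"
}

# build_one cross-compiles a single target with the same flags as the loop above.
build_one() {
  local os="${1%/*}" arch="${1#*/}"
  CGO_ENABLED=0 GOOS="$os" GOARCH="$arch" \
    go build -trimpath -ldflags="-s -w" -o "$(out_name "$1")" .
}

# Possible usage in bash (exported functions are visible to child bash processes):
#   export -f out_name build_one
#   printf '%s\n' linux/amd64 linux/arm64 darwin/arm64 windows/amd64 \
#     | xargs -P 4 -I{} bash -c 'build_one "$1"' _ {}
```

Passing the target as `$1` keeps the quoting simple if the target list ever comes from a file or a CI variable instead of literals.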


Exercise 2: Share GOCACHE and GOMODCACHE across targets

Before — CI evicts the build cache between jobs; every target downloads modules and recompiles the standard library.

After — pin and persist the caches:

- uses: actions/cache@v4
  with:
    path: |
      ~/.cache/go-build
      ~/go/pkg/mod
    key: go-${{ runner.os }}-${{ hashFiles('**/go.sum') }}
    restore-keys: go-${{ runner.os }}-

Different GOOS/GOARCH combinations have different cache keys inside GOCACHE, but they live in the same directory; one shared cache stores the compiled stdlib for every target you build.

| Metric | No cache | Persisted cache |
|---|---|---|
| Stdlib compile per target | repeated every job | once, then reused |
| First-build time | ~40s | ~40s |
| Subsequent-build time | ~40s | ~6s |
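
To see why the cache key in the workflow only changes when dependencies change, the following sketch derives a similar key locally. It is an approximation for illustration: `cache_key` is a hypothetical helper, `sha256sum` (GNU coreutils) is assumed to be installed, and the value will not match what `hashFiles` computes in GitHub Actions.

```bash
# cache_key prints go-<os>-<sha256 of go.sum>, a local analogue of the workflow key.
# Assumption: sha256sum is available; on macOS, `shasum -a 256` is the usual substitute.
cache_key() {
  local sum_file="${1:-go.sum}" hash
  hash=$(sha256sum "$sum_file" | cut -d' ' -f1)
  printf 'go-%s-%s\n' "$(uname -s)" "$hash"
}
```

Running `cache_key` before and after a `go get` shows the key changing only when go.sum does. It is also worth running `go env GOCACHE GOMODCACHE` on the runner to confirm that the cached paths match the directories Go actually uses.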

Exercise 3: Shrink and stabilize with -trimpath -buildvcs=false -ldflags="-s -w"

Before:

go build -o app .
ls -l app   # 12 MB

After:

CGO_ENABLED=0 go build \
  -trimpath -buildvcs=false \
  -ldflags="-s -w" \
  -o app .
ls -l app   # 8 MB

| Metric | Default flags | -s -w + trim |
|---|---|---|
| Binary size | 12 MB | ~8 MB (≈33% smaller) |
| Reproducibility | builder-dependent paths embedded | trimmed |
| Cost | full debug info kept | DWARF gone, harder post-mortem on prod cores; acceptable for most services |

For a service deployed to many regions, the size cut directly reduces image-pull time and bandwidth.
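
To measure the saving on your own program rather than reading it off `ls -l`, a small helper can compute the reduction. This is a sketch: `size_bytes` and `pct_smaller` are hypothetical names, and the percentage uses integer arithmetic, so it rounds down.

```bash
# size_bytes FILE prints the size of FILE in bytes.
size_bytes() {
  wc -c < "$1" | tr -d ' '
}

# pct_smaller BEFORE AFTER prints by what integer percentage AFTER is smaller than BEFORE.
pct_smaller() {
  local before="$1" after="$2"
  echo $(( (before - after) * 100 / before ))
}
```

With two builds saved under hypothetical names, `pct_smaller "$(size_bytes app-default)" "$(size_bytes app-stripped)"` prints the reduction; for the illustrative 12 MB and 8 MB figures above it gives 33.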


Exercise 4: Prefer pure-Go over cgo where possible

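Before: CGO_ENABLED=1 (the default for native builds on Linux/macOS when a C compiler is available) pulls in cgo for net and os/user, can force linking through the slower external system linker once cgo packages outside the standard library are involved, and requires a C cross-toolchain to cross-compile.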

After:

CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o app .

| Metric | cgo on | cgo off |
|---|---|---|
| Link time | ~1.2s | ~0.3s |
| Cross-compile setup | needs C cross-compiler | none |
| Runtime DNS resolver | libc / nsswitch | pure-Go resolver |
| Binary deployability | needs target libc | static, drops into scratch |

If you must keep cgo for a concrete reason (SQLite, OpenSSL, a vendor SDK), confine it to the binaries or packages that actually need it and build everything else with CGO_ENABLED=0.
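
To confirm which setting a released binary was actually built with, the build information embedded by the Go toolchain can be inspected with `go version -m`. The helper below is a sketch that reads that output on stdin; `cgo_setting` is a hypothetical name, and it assumes the binary was built by a toolchain that records build settings (Go 1.18 or later).

```bash
# cgo_setting reads `go version -m <binary>` output on stdin and prints the
# recorded CGO_ENABLED value (0 or 1). It prints nothing if the setting is absent.
cgo_setting() {
  awk '$1 == "build" && $2 ~ /^CGO_ENABLED=/ { sub(/^CGO_ENABLED=/, "", $2); print $2 }'
}
```

Typical usage would be `go version -m ./app | cgo_setting`, which can serve as a release gate so that a cgo-enabled binary is not shipped by accident.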


Exercise 5: docker buildx cache reuse for multi-arch

Before — multi-arch build redoes everything from scratch each push:

docker buildx build --platform linux/amd64,linux/arm64 -t app:latest --push .

After — push and pull a registry cache:

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --cache-to   type=registry,ref=ghcr.io/acme/app:buildcache,mode=max \
  --cache-from type=registry,ref=ghcr.io/acme/app:buildcache \
  -t ghcr.io/acme/app:latest --push .

Combine with the --platform=$BUILDPLATFORM Go cross-compile pattern (see senior.md §6) so the slow go build step runs on the native host and Buildx only needs to assemble the final per-arch layer.

| Metric | No buildx cache | Registry cache |
|---|---|---|
| Unchanged-source rebuild | ~120s | ~15s (manifest update only) |
| New dependency rebuild | ~120s | ~60s (cached stdlib + module download) |

Exercise 6: Only build targets whose inputs changed

Before — every push rebuilds all matrix entries even when only docs changed.

After — make per-target outputs depend on the right files in Make (or use dorny/paths-filter in GitHub Actions to skip the matrix entirely on doc-only changes):

SRC := $(shell find . -name '*.go' -not -path './dist/*') go.mod go.sum

# The stem ($*) is "<os>-<arch>", e.g. linux-amd64. Recipe lines must start with a tab.
dist/app-%: $(SRC)
	@os=$(word 1,$(subst -, ,$*)); arch=$(word 2,$(subst -, ,$*)); \
	 CGO_ENABLED=0 GOOS=$$os GOARCH=$$arch \
	   go build -trimpath -ldflags="-s -w" -o $@ .

make dist/app-linux-amd64 only rebuilds if a Go file or go.mod/go.sum is newer than the output.

| Metric | Always rebuild all | mtime-aware |
|---|---|---|
| Docs-only push | 5 builds | 0 builds |
| Single-file change | 5 builds | 5 builds (Go sources affect all targets) |

This wins most on doc/CI/script-only changes — surprisingly common in real repos.
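
Where neither Make nor a path-filter action is available, the same decision can be sketched in plain shell by inspecting the list of changed files. The function name `needs_build` is hypothetical, and the pattern only covers the inputs named above (Go sources, go.mod, go.sum); extend it if your build embeds other files.

```bash
# needs_build reads changed file paths on stdin, one per line, and exits 0
# if any of them is a .go file, go.mod, or go.sum; otherwise it exits non-zero.
needs_build() {
  grep -Eq '(\.go|(^|/)go\.(mod|sum))$'
}
```

A CI step might run `git diff --name-only origin/main...HEAD | needs_build` and skip the build matrix when it fails; the base ref `origin/main` is an assumption about your branch layout.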


Exercise 7: One build per arch family when ABI is shared

For Linux, linux/amd64 baseline (GOAMD64=v1) and linux/arm64 are the two arch families most services need. Avoid duplicating builds for linux/amd64 + linux/amd64-v3 unless you have a measured performance reason — pick one level per binary and ship it.

Before — building app-linux-amd64 and app-linux-amd64-v3 and app-linux-amd64-v4:

| Variant | Built | Used in practice |
|---|---|---|
| v1, v2, v3, v4 | 4 binaries | Usually 1 |

After: ship v1 for portability or v3 for owned hardware, not both. Dropping the unused variants takes the amd64 part of the release matrix from four binaries to one, and the size of dist/ shrinks in proportion.

If you genuinely need both, gate the higher variant on an explicit deploy target so you do not build it for every release.
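
One way to gate the higher variant is to derive the GOAMD64 level from an explicit deploy target and default to the portable baseline. This is a sketch under stated assumptions: the `DEPLOY_TARGET` variable, its `owned-hw` value, and the `amd64_level` helper are hypothetical names, not part of any existing pipeline.

```bash
# amd64_level prints the GOAMD64 level to build for.
# Assumption: DEPLOY_TARGET=owned-hw marks hardware you control that supports v3.
amd64_level() {
  case "${DEPLOY_TARGET:-}" in
    owned-hw) echo v3 ;;
    *)        echo v1 ;;
  esac
}
```

A release step could then run `GOAMD64=$(amd64_level) CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o dist/app-linux-amd64 .`, so the v3 binary is only produced when the deploy target asks for it.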


Measurement checklist

  • Build targets in parallel (make -j, xargs -P, CI matrix).
  • Persist GOCACHE and GOMODCACHE across CI jobs.
  • Always pass -trimpath -buildvcs=false -ldflags="-s -w" for release artifacts.
  • Set CGO_ENABLED=0 unless you have a concrete cgo reason.
  • Use docker buildx registry cache for multi-arch images.
  • Skip jobs on doc-only changes via path filters or make dependencies.
  • Do not multiply targets by GOAMD64 levels without a measured win.