API & Library Design — Optimize & Reconcile
A clean API and a fast API are not enemies, but they pull in different directions. The ergonomic shape (`func(opts ...Option)`, returning a fresh `[]Result`, accepting `interface{}` for flexibility) allocates, copies, and boxes on every call. That cost is invisible off the hot path and ruinous on it. The discipline here is not "make the API ugly to make it fast." It is: keep the ergonomic API as the default, then offer an explicit fast/batch/streaming path beside it, such as `Read` next to `ReadByte`, `Marshal` next to `MarshalTo(buf)`, `Find` next to an iterator. Each scenario below pairs a clean API with its measured cost and the principled reconciliation.
Table of Contents
- Functional options allocate per call on the hot path
- Returning a fresh copy vs exposing the internal slice
- The append-to-destination (`dst []byte`) pattern
- Streaming vs materializing the whole result list
- Batch API beside the single-item API (N+1 at the boundary)
- `[]byte` vs `string` at the API boundary (zero-copy)
- Generic API forcing boxing / interface conversion
- Reflection-driven flexibility (`any`) vs a typed fast path
- Sync blocking API vs offering a streaming/async variant
- Callback API that escapes the closure to the heap
- Returning an interface that defeats inlining and EA
- Defensive validation re-run on every internal call
- Over-flexible "do everything" entry point
- Pagination cursor that re-scans from the start
- Rules of Thumb
- Related Topics
Scenario 1 — Functional options allocate per call on the hot path
The functional-options pattern is the canonical "clean Go API": self-documenting, backward-compatible, defaults baked in.
type Client struct{ timeout time.Duration; retries int; gzip bool }
type Option func(*Client)
func WithTimeout(d time.Duration) Option { return func(c *Client) { c.timeout = d } }
func WithRetries(n int) Option { return func(c *Client) { c.retries = n } }
func (c *Client) Do(req *Request, opts ...Option) (*Response, error) {
    cfg := *c
    for _, o := range opts { o(&cfg) } // applies per-call overrides
    return c.send(&cfg, req)
}
Cost. Each `Do(req, WithTimeout(2*time.Second))` call can allocate in three places: the variadic `opts` backing array (1 alloc), one closure per option (each `With…` returns a closure capturing its argument, 1 alloc each when it escapes to the heap), and the `cfg := *c` copy, which typically escapes because `&cfg` is passed to option closures the compiler cannot see through. For a single option that is 2–3 allocations per call. At 200k calls/sec that is ~500k allocs/sec feeding the GC. Running `go test -bench=. -benchmem` will show something like ~120 ns/op, 3 allocs/op versus ~8 ns/op, 0 allocs/op for a plain method call.
Resolution
Options belong at **construction** time, not on the per-request hot method. Configure once, call cheaply.
When a *per-call* override is genuinely needed on a hot path, provide an explicit struct-config fast path beside the ergonomic one, with no closures and no variadic parameter: a `DoWith` method that takes a `RequestConfig`. `RequestConfig` is a plain value the caller fills; passing it allocates nothing (it can stay on the stack). Keep `Do(req, opts...)` for the 99% cold-path callers; document `DoWith` as "for hot loops." This is the same split as `http.Client` (configured once) versus per-request fields on `http.Request`.
Scenario 2 — Returning a fresh copy vs exposing the internal slice
A getter that returns the internal slice lets callers mutate your invariants. The clean fix is a defensive copy.
func (o *Order) Lines() []Line {
    out := make([]Line, len(o.lines))
    copy(out, o.lines) // safe: caller can't corrupt o.lines
    return out
}
Cost. Every call allocates and copies the whole slice. A reporting loop that calls order.Lines() 1,000 times on a 50-line order does 50,000 element copies and 1,000 allocations for data that never changed. pprof shows Lines dominating alloc_space.
Resolution
Three tiers, cleanest default first:
1. **Read-only access by index**: no copy, no leak.
2. **Iteration callback** (Go 1.23 `iter.Seq`): the caller can't retain or mutate the backing array.
3. **Copy** stays as the explicit "I need an independent snapshot" method, named to signal cost: `LinesCopy()`.
The defensive copy was protecting against mutation; index/iteration access removes the mutation vector *and* the allocation. Reserve the copy for callers who truly need to own the data. (Mirrors gold Bloaters Optimize 7: same Tell-Don't-Ask resolution, here applied to the API surface a library hands out.)
Scenario 3 — The append-to-destination (dst []byte) pattern
Encoding APIs naturally return a freshly allocated buffer:
func (m *Message) Marshal() []byte {
    buf := make([]byte, 0, m.size())
    buf = append(buf, m.header...)
    buf = append(buf, m.body...)
    return buf
}
Cost. Every Marshal() allocates a new buffer. In a serialization loop over 1M messages that is 1M allocations the caller can never reuse — the buffer is born, written to a socket, and discarded.
Resolution
Adopt the standard-library **append-to-dst** convention: keep the convenient nullary form, and add a variant that writes into a caller-owned buffer.
// Convenient: allocates a new, pre-sized slice. Cold path / one-shot callers.
func (m *Message) Marshal() []byte { return m.AppendTo(make([]byte, 0, m.size())) }
// Fast: appends into dst and returns the grown slice. Hot loops reuse one buffer.
func (m *Message) AppendTo(dst []byte) []byte {
    dst = append(dst, m.header...)
    dst = append(dst, m.body...)
    return dst
}
Scenario 4 — Streaming vs materializing the whole result list
The simplest signature returns everything at once.
public List<Row> query(String sql) { // builds the entire list in memory
    List<Row> rows = new ArrayList<>();
    try (ResultSet rs = stmt.executeQuery(sql)) {
        while (rs.next()) rows.add(mapRow(rs));
    }
    return rows;
}
Cost. A query returning 10M rows materializes 10M Row objects (~100 bytes each ≈ 1 GB) before the caller sees the first one. Latency to first row = time to last row. On a 4 GB heap shared with the rest of the application this can OOM; even when it fits, the caller usually only needs to fold over the rows once.
Resolution
Make the **streaming contract** the primary API and let materialization be the trivial wrapper, not the reverse:
// Primary: caller controls consumption; constant memory; first row is immediate.
public Stream<Row> queryStream(String sql) { … } // lazy, backed by the ResultSet
// Convenience for small results, documented as "loads all rows":
public List<Row> query(String sql) {
    try (Stream<Row> s = queryStream(sql)) { return s.collect(toList()); }
}
Scenario 5 — Batch API beside the single-item API (N+1 at the boundary)
A clean per-item method composes beautifully — and is a latency disaster when the per-call overhead is fixed.
class UserRepo:
    def get(self, user_id: int) -> User:   # one round trip each
        return self._db.query_one("SELECT * FROM users WHERE id = %s", user_id)
# Caller, looks innocent:
users = [repo.get(uid) for uid in order_user_ids]   # 500 IDs → 500 round trips
Cost. Each get is ~1 ms of network round-trip. 500 calls = 500 ms of serial latency that is 99% wait. This is N+1 surfacing at the API boundary: the API only offered a per-item door, so the caller had no way to express "I want all 500."
Resolution
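The batch door this resolution describes can be sketched in Go (the scenario's own snippet is Python, where the method is `get_many`). Everything here is illustrative: `fakeDB` and `queryByIDs` are hypothetical stand-ins that only count round trips, modelling one `WHERE id IN (...)` query per call.

```go
package main

import "fmt"

type User struct {
    ID   int
    Name string
}

// fakeDB stands in for a real database and counts round trips.
type fakeDB struct {
    rows       map[int]User
    roundTrips int
}

// queryByIDs models a single IN-list query: one round trip for any number of IDs.
func (db *fakeDB) queryByIDs(ids []int) []User {
    db.roundTrips++
    out := make([]User, 0, len(ids))
    for _, id := range ids {
        if u, ok := db.rows[id]; ok {
            out = append(out, u)
        }
    }
    return out
}

type UserRepo struct{ db *fakeDB }

// Get stays the clean single-item door.
func (r *UserRepo) Get(id int) (User, bool) {
    us := r.db.queryByIDs([]int{id})
    if len(us) == 0 {
        return User{}, false
    }
    return us[0], true
}

// GetMany is the batch door: the fixed per-call cost is paid once for the whole set.
func (r *UserRepo) GetMany(ids []int) map[int]User {
    out := make(map[int]User, len(ids))
    for _, u := range r.db.queryByIDs(ids) {
        out[u.ID] = u
    }
    return out
}

func main() {
    db := &fakeDB{rows: map[int]User{}}
    ids := make([]int, 500)
    for i := range ids {
        ids[i] = i + 1
        db.rows[i+1] = User{ID: i + 1, Name: fmt.Sprintf("user%d", i+1)}
    }
    repo := &UserRepo{db: db}

    for _, id := range ids {
        repo.Get(id) // N+1 shape: one round trip per ID
    }
    looped := db.roundTrips

    db.roundTrips = 0
    users := repo.GetMany(ids)
    fmt.Println(looped, db.roundTrips, len(users)) // 500 1 500
}
```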
Offer a **batch method beside** the single-item one. Keep `get` (it reads cleanly for the single case); add `get_many` that amortizes the fixed cost across the set.
500 IDs now cost ~1 round trip (~1–2 ms) instead of 500. The single-item `get` can even be defined in terms of the batch when convenient, but usually keep both implementations, so that `get` stays a clean one-liner for the common single lookup. Document the guidance explicitly: *"Calling `get` in a loop? Use `get_many`."* DataLoader-style coalescing is the automatic version of this same principle. The clean per-item API survives; the batch path is the explicit fast lane for the boundary that would otherwise go N+1.
Scenario 6 — []byte vs string at the API boundary (zero-copy)
A parser that takes a string is pleasant; a parser that takes []byte avoids a copy.
func ParseToken(s string) (Token, error) { … }
// Caller has bytes off the wire:
tok, err := ParseToken(string(buf)) // string(buf) COPIES the entire buffer
Cost. string([]byte) allocates and copies because strings are immutable in Go. In a parser fed by bufio.Reader, every line incurs one full copy of the line's bytes. Symmetrically, an API that returns substrings as fresh strings copies each one. For a 1 KB line at 100k lines/sec, that is 100 MB/sec of pure copy overhead.
Resolution
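A minimal sketch of the shape this resolution describes, assuming a toy `key=value` token format. The names `ParseTokenBytes`, `ParseTokenString`, and `ValueCopy`, and the `Token` fields, are hypothetical.

```go
package main

import (
    "bytes"
    "errors"
    "fmt"
)

// Token's fields are views into the buffer passed to ParseTokenBytes. They are
// valid only until the caller modifies or reuses that buffer.
type Token struct{ Key, Value []byte }

var errNoSeparator = errors.New("token: missing '='")

// ParseTokenBytes is the zero-copy core: it slices b and copies nothing.
func ParseTokenBytes(b []byte) (Token, error) {
    i := bytes.IndexByte(b, '=')
    if i < 0 {
        return Token{}, errNoSeparator
    }
    return Token{Key: b[:i], Value: b[i+1:]}, nil
}

// ParseTokenString is the convenience wrapper; callers holding a string pay
// the one conversion copy here.
func ParseTokenString(s string) (Token, error) { return ParseTokenBytes([]byte(s)) }

// ValueCopy returns an independent string that is safe to retain.
func (t Token) ValueCopy() string { return string(t.Value) }

func main() {
    buf := []byte("user=alice")
    tok, err := ParseTokenBytes(buf)
    if err != nil {
        fmt.Println("parse failed:", err)
        return
    }
    kept := tok.ValueCopy() // owned copy
    buf[5] = 'A'            // caller reuses the buffer: the view changes with it
    fmt.Println(string(tok.Value), kept) // Alice alice
}
```

The last two lines of `main` show the aliasing contract in action: the view follows the buffer, the copy does not.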
Define the core API over `[]byte` (the zero-copy boundary) and provide a `string` convenience wrapper, never the other way around.
For results, return *views* into the caller's buffer instead of fresh strings, documenting the aliasing contract clearly.
This is the `bytes` vs `strings` package split, and how `encoding/json` token scanning and `net/http` header parsing avoid per-field allocation. The contract must be explicit: zero-copy means the result borrows the input's lifetime. Offer a `…Copy` accessor for callers who need to retain the value. Java's analog is `ByteBuffer`/`CharSequence` views vs `new String(bytes)`; Python's is `memoryview(buf)` vs `bytes(buf)`.
Scenario 7 — Generic API forcing boxing / interface conversion
An API typed as interface{} / Object looks maximally flexible.
public interface Cache {
    void put(String key, Object value);
    Object get(String key);
}
// Numeric hot path:
cache.put("hits", hits);              // int → Integer (autoboxing allocates)
int h = (Integer) cache.get("hits");  // unbox + checked cast
Cost. Every put/get of a primitive autoboxes: int → Integer allocates an object (outside the −128..127 cache). At 1M ops/sec on counters that is ~1M Integer allocations/sec plus a checked downcast on the way out. async-profiler shows Integer.valueOf in the allocation flame graph.
Resolution
Use generics to keep the clean shape while erasing the cast, and offer a **primitive-specialized** path for the numeric hot case:
public interface Cache<V> { // type-safe, no downcast at call site
    void put(String key, V value);
    V get(String key);
}
public interface IntCache { // zero boxing on the hot numeric path
    void put(String key, int value);
    int get(String key); // returns int directly
}
Scenario 8 — Reflection-driven flexibility (any) vs a typed fast path
A "just give me anything" serializer is the most flexible API imaginable.
func Encode(v any) ([]byte, error) { // uses reflect to walk arbitrary structs
    return reflectEncode(reflect.ValueOf(v))
}
Cost. Unless the encoder caches per-type metadata, reflection re-discovers the type's fields, tags, and kinds on every call: passing a non-pointer value as `any` boxes it (an allocation), and the field walk does map lookups and type switches per field. Benchmarks of encoding/json (reflection-based) versus a code-generated encoder show the codegen path 3–8× faster with near-zero allocations, because the typed path is just field writes the compiler can inline.
Resolution
Keep the reflective `Encode(any)` as the universal fallback: it's what makes the library usable for arbitrary types with no setup. Beside it, offer an interface that types can implement to get the typed fast path, and **dispatch to it when present**:
type Encodable interface {
    AppendEncoding(dst []byte) []byte // hand-written, allocation-free
}
func Encode(v any) ([]byte, error) {
    if e, ok := v.(Encodable); ok { // fast path: no reflection
        return e.AppendEncoding(nil), nil
    }
    return reflectEncode(reflect.ValueOf(v)) // flexible fallback
}
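To see the dispatch end to end, here is a runnable sketch. `Point` is a hypothetical hot type with a hand-written encoder, and `json.Marshal` is swapped in for `reflectEncode` as the reflective fallback; the source does not say which fallback the library uses.

```go
package main

import (
    "encoding/json"
    "fmt"
    "strconv"
)

type Encodable interface {
    AppendEncoding(dst []byte) []byte
}

// Point (hypothetical) opts into the fast path with a hand-written encoder.
type Point struct{ X, Y int }

func (p Point) AppendEncoding(dst []byte) []byte {
    dst = append(dst, `{"x":`...)
    dst = strconv.AppendInt(dst, int64(p.X), 10)
    dst = append(dst, `,"y":`...)
    dst = strconv.AppendInt(dst, int64(p.Y), 10)
    return append(dst, '}')
}

// Encode checks for the typed interface first and falls back to reflection.
// json.Marshal is an assumed stand-in for the document's reflectEncode.
func Encode(v any) ([]byte, error) {
    if e, ok := v.(Encodable); ok {
        return e.AppendEncoding(nil), nil
    }
    return json.Marshal(v)
}

func main() {
    fast, _ := Encode(Point{X: 1, Y: 2})         // typed fast path
    slow, _ := Encode(map[string]int{"n": 3})    // reflective fallback
    fmt.Println(string(fast), string(slow))      // {"x":1,"y":2} {"n":3}
}
```

Callers on a hot loop can skip `Encode` entirely and call `AppendEncoding` with a reused buffer, which also avoids boxing the value into `any`.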
Scenario 9 — Sync blocking API vs offering a streaming/async variant
A synchronous method that returns the finished result is the easiest thing to call.
public Report generate(Query q) { // blocks until the whole report is built
    return heavyComputation(q);   // 8 seconds for a large report
}
Cost. The caller's thread is pinned for the full 8 seconds; a request thread pool of 200 is exhausted by 200 slow reports while CPUs sit idle waiting on I/O. The caller also can't show progress or cancel — the API contract is "all or nothing, blocking."
Resolution
Keep the blocking `generate` (it's the right shape for scripts and tests), and offer a non-blocking / streaming variant beside it so server callers can free the thread and stream partials:
// Convenient, blocking: fine for CLI/tests/cron.
public Report generate(Query q) { return generateAsync(q).join(); }
// Non-blocking: frees the calling thread; composes with other async work.
public CompletableFuture<Report> generateAsync(Query q) { … }
// Streaming: first section is visible immediately; supports cancel/progress.
public Flow.Publisher<ReportSection> generateStream(Query q) { … }
Scenario 10 — Callback API that escapes the closure to the heap
A callback-style API is clean and inversion-of-control friendly.
func (idx *Index) Walk(prefix string, fn func(key string, val []byte)) {
    for _, e := range idx.entriesWithPrefix(prefix) {
        fn(e.key, e.val)
    }
}
// Caller:
var count int
idx.Walk("user:", func(k string, v []byte) { count++ })
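Cost. The closure `func(k, v) { count++ }` captures `count` by reference. When the compiler cannot prove that `fn` stays within the call (for example, `Walk` is reached through an interface, or it stores or forwards `fn`), escape analysis moves `count` to the heap and allocates the closure. If `Walk` itself is called in a loop, that is one closure allocation per outer iteration. In that case `go build -gcflags='-m'` prints `func literal escapes to heap` and `moved to heap: count`.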
Resolution
Two complementary moves. First, prefer the **range-over-func iterator** (Go 1.23) as the primary API: the compiler can often keep the loop body on the stack and it reads like a normal `for`.
Second, when a callback API must stay, document that **the callback should not capture** for hot uses, and offer a context parameter so state can be passed without a capturing closure.
The iterator form is the clean default that also happens to be allocation-friendly; the `ctx`-passing callback is the escape hatch for code that profiles hot. Java's analog: prefer a primitive-specialized `IntConsumer` and avoid capturing lambdas in tight `forEach` loops; Python has no closure-allocation concern but the same "pass state explicitly" guidance reduces surprise.
Scenario 11 — Returning an interface that defeats inlining and EA
Returning an interface decouples callers from the concrete type — textbook clean design.
type Reader interface{ Read(p []byte) (int, error) }
func NewBuffer(b []byte) Reader { // returns interface, hides *bytes.Reader
    return bytes.NewReader(b)
}
Cost. Returning Reader (an interface) forces every Read call through dynamic dispatch (an itable lookup), which the compiler cannot inline. It also defeats escape analysis: a value stored into an interface generally escapes to the heap. For a tight read loop, that is a missed inline on the hottest method plus an extra allocation. Benchmarks of "return interface" vs "return concrete type" on small hot methods commonly show 2–4× differences once inlining is lost.
Resolution
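A minimal sketch of the rule this resolution states: the constructor returns the concrete type, and only the consuming function is written against an interface. `CountBytes` is a hypothetical consumer added for illustration.

```go
package main

import (
    "bytes"
    "fmt"
    "io"
)

// NewBuffer returns the concrete type, so callers see the full method set and
// calls on it are static.
func NewBuffer(b []byte) *bytes.Reader { return bytes.NewReader(b) }

// CountBytes (hypothetical) accepts an interface: any reader will do.
func CountBytes(r io.Reader) (int64, error) { return io.Copy(io.Discard, r) }

func main() {
    buf := NewBuffer([]byte("hello"))
    first, _ := buf.ReadByte() // concrete method, no dynamic dispatch

    var r io.Reader = buf // the caller opts into the abstraction, at its own call site
    n, _ := CountBytes(r)
    fmt.Printf("%c %d %d\n", first, n, buf.Len()) // h 4 0
}
```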
**Return the concrete type; accept interfaces.** The Go proverb "accept interfaces, return structs" is a performance rule as much as a clarity rule.
Callers who want the abstraction can assign the concrete type to an interface variable themselves, at *their* choice and *their* call site, and hot callers keep the concrete type so `Read` inlines and the value can stay on the stack. The concrete return type is also *more* informative (callers see the full method set), so this rarely costs ergonomics. When you genuinely must hide the type (plugin boundary, multiple implementations), keep the interface return but make the methods coarse-grained enough that one dynamic dispatch covers real work; don't put a dispatch barrier in front of a one-line getter. (Same inlining concern as gold Bloaters Optimize 3, here triggered by the *return type* of a public API rather than an extracted helper.)
Scenario 12 — Defensive validation re-run on every internal call
A library that validates its inputs on every public method is "defensive" and safe.
class Matrix:
    def __init__(self, rows): self.rows = rows
    def multiply(self, other):
        self._validate()    # checks rectangular + numeric, O(n·m)
        other._validate()
        return self._mul(other)
    def transpose(self):
        self._validate()    # validates AGAIN
        return self._t()
Cost. _validate() is O(n·m) — it walks every cell. A chain m.transpose().multiply(x).transpose() re-validates the same already-trusted matrix on every step. For a 1000×1000 matrix that is millions of redundant cell checks per operation, and they dominate the runtime of cheap ops like transpose.
Resolution
**Validate once at the boundary (the constructor), then trust the type internally** (Parse, Don't Validate). Public entry points and methods that *receive untrusted data* validate; methods that only transform an already-valid instance do not:
class Matrix:
    def __init__(self, rows):
        self._validate(rows)    # the ONLY validation; runs once per object
        self.rows = rows
    @classmethod
    def _trusted(cls, rows):    # internal constructor, skips validation
        m = cls.__new__(cls)
        m.rows = rows
        return m
    def transpose(self):
        # Materialize the columns: a bare zip() iterator is single-use and not indexable.
        return Matrix._trusted([list(col) for col in zip(*self.rows)])
    def multiply(self, other):
        return Matrix._trusted(self._mul(other))    # no re-validation
Scenario 13 — Over-flexible "do everything" entry point
One mega-function that handles every case is "convenient" — one symbol to learn.
def fetch(url, *, parse=None, retries=0, cache=None, transform=None,
          paginate=False, stream=False, validate=None, rate_limit=None):
    # 200 lines branching on every combination of options
    ...
Cost. Every call pays for all the machinery: the function checks cache is not None, rate_limit is not None, sets up pagination state, wires a streaming generator, etc., even for the trivial fetch(url) case. The branch soup also blocks the interpreter/JIT from specializing, and forces every caller to read 9 parameters to understand the common case. Flexibility taxed onto the simple path.
Resolution
Split into a **minimal core plus composable decorators**, small focused functions that the caller assembles, with the simple case staying a one-liner:
def fetch(url) -> bytes:        # minimal core: one job, no branches
    return _http_get(url)
def with_retries(fn, n): ...    # opt-in wrappers, each O(1) overhead
def with_cache(fn, cache): ...
def paginated(fn): ...
get = with_retries(with_cache(fetch, cache), 3)    # caller composes exactly what they need
data = fetch(url)                                  # simple case pays for nothing extra
Scenario 14 — Pagination cursor that re-scans from the start
An offset-based page API is the simplest pagination to expose.
public Page<Item> list(int offset, int limit) { // OFFSET/LIMIT under the hood
    return query("SELECT * FROM items ORDER BY id LIMIT ? OFFSET ?", limit, offset);
}
Cost. SQL OFFSET n makes the database read and discard the first n rows on every page. Page 1 is fast; page 10,000 (OFFSET 1_000_000) scans a million rows to throw them away. A client paging through the whole table is O(n²) in total work, and tail-page latency grows linearly with depth.
Resolution
Keep offset paging for shallow UI cases (jump-to-page-5 needs it), but offer a **keyset / cursor** API beside it for deep or full scans. The contract returns an opaque cursor instead of a numeric offset:
// Shallow, random-access: fine for small offsets.
public Page<Item> list(int offset, int limit) { … }
// Deep/sequential: O(1) per page regardless of depth.
public CursorPage<Item> list(Cursor after, int limit) {
    return query("SELECT * FROM items WHERE id > ? ORDER BY id LIMIT ?",
                 after.lastId(), limit); // index seek, no row-skipping
}
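The cursor contract can be illustrated without a database. In this Go sketch a sorted slice stands in for an indexed primary key and a binary search stands in for the index seek; `Store`, `Page`, and `ListAfter` are hypothetical names, and IDs are assumed to be positive so that the zero `Cursor` means "from the start".

```go
package main

import (
    "fmt"
    "sort"
)

// Cursor is opaque to callers; its zero value means "start from the beginning".
type Cursor struct{ lastID int }

type Page struct {
    Items []int
    Next  Cursor
    More  bool
}

// Store keeps IDs sorted ascending, standing in for an indexed primary key.
type Store struct{ ids []int }

// ListAfter seeks directly to the first ID greater than the cursor, so the
// work per page does not depend on how deep the caller has paged.
func (s *Store) ListAfter(after Cursor, limit int) Page {
    start := sort.SearchInts(s.ids, after.lastID+1) // first index with id > lastID
    end := start + limit
    if end > len(s.ids) {
        end = len(s.ids)
    }
    p := Page{
        Items: append([]int(nil), s.ids[start:end]...), // copy: do not leak the backing array
        Next:  after,
        More:  end < len(s.ids),
    }
    if end > start {
        p.Next = Cursor{lastID: s.ids[end-1]}
    }
    return p
}

func main() {
    s := &Store{ids: []int{10, 20, 30, 40, 50}}
    cur, pages, seen := Cursor{}, 0, 0
    for {
        p := s.ListAfter(cur, 2)
        pages++
        seen += len(p.Items)
        if !p.More {
            break
        }
        cur = p.Next
    }
    fmt.Println(pages, seen) // 3 5
}
```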
Rules of Thumb
- Keep the ergonomic API as the default; add the fast path beside it, never instead of it. `Read` and `ReadByte`, `Marshal` and `AppendTo`, `get` and `get_many`. Cold-path callers get clarity; hot-path callers get an explicit, documented opt-in.
- Push per-call configuration to construction time. Functional options, builders, and validation belong where you pay once (the constructor), not on the method invoked a million times.
- Return the lazy thing; let the caller materialize. Streaming/iterator as the core API, `list(...)`/`collect()` as the one-line wrapper. You can't retrofit O(1) memory onto a method whose contract is "return the whole list."
- Define the zero-copy boundary as the core; wrap it for convenience. `[]byte`/`memoryview`/`ByteBuffer` core, `string` convenience wrapper, never the reverse, because the wrapper costs exactly one copy only when needed.
- Accept interfaces, return concrete types. Returning an interface forces dynamic dispatch and heap escape on the caller's hottest methods; let callers choose their own abstraction level.
- Offer a batch door at every boundary that has fixed per-call cost. Network, disk, and DB calls amortize; a per-item-only API forces N+1 onto every caller and they cannot fix it from outside.
- Validate once at the boundary, trust the type within. Parse, Don't Validate: a constructor that guarantees validity lets every internal transform skip O(n) re-checks.
- Reflection/`any`/boxing is a fine fallback, never the only path. Intercept a typed interface (`json.Marshaler`-style) so hot types opt out of the slow generic route without losing the universal one.
- Measure before adding a fast path. `go test -bench=. -benchmem`, JMH `-prof gc`, `pprof`/async-profiler. If the ergonomic API isn't on a hot path, do not split it: a second method is real surface area and maintenance cost. Premature fast paths are a sprawling-surface smell.
- Make the fast path's contract explicit. Zero-copy results borrow input lifetime; reused buffers can't be retained; cursors are opaque. Document the constraint at the method, or callers will use the fast path wrong.
Related Topics
- README.md — the positive rules for this chapter (minimal surface, least astonishment, designing errors into the contract).
- find-bug.md — spotting API-design defects (leaky internals, boolean obsession, missing deprecation paths).
- professional.md — judgment calls and review heuristics for evolving a public API.
- Boundaries — the consumer's side: wrapping third-party APIs you don't control.
- Abstraction & Information Hiding — internal module quality that underlies a clean public surface.
- Refactoring — code smells (bloaters, defensive copies, primitive obsession) whose fixes recur in API performance reconciliation.