Runes: Professional Level
Introduction
Focus: "What happens under the hood?"
At the professional level, we examine how Go's UTF-8 decoding works internally, how range over strings is implemented by the compiler, the memory layout of rune data, and the performance characteristics of rune operations.
How UTF-8 Decoding Works Internally
The range Loop Compilation
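// Source:
for i, r := range s {
	process(i, r)
}

// Semantically equivalent loop written with the utf8 package:
for i := 0; i < len(s); {
	r, size := utf8.DecodeRuneInString(s[i:])
	process(i, r)
	i += size
}

// The gc compiler does not emit a call to utf8.DecodeRuneInString. It inlines a
// check for single-byte (ASCII) runes and calls the runtime helper decoderune
// only for bytes >= utf8.RuneSelf (0x80), so ASCII text incurs no call overhead.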
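utf8.DecodeRune Implementation
// Simplified illustration of the decoding logic. The real code in
// src/unicode/utf8/utf8.go uses lookup tables and also rejects overlong
// encodings and surrogates in the longer sequences.
func DecodeRune(p []byte) (r rune, size int) {
	n := len(p)
	if n < 1 {
		return RuneError, 0
	}
	b0 := p[0]
	if b0 < 0x80 {
		// Single byte (ASCII): 0xxxxxxx
		return rune(b0), 1
	}
	if b0 < 0xC2 {
		// 0x80-0xBF: continuation byte without a leading byte.
		// 0xC0-0xC1: could only start an overlong 2-byte encoding.
		return RuneError, 1
	}
	if b0 < 0xE0 {
		// 2-byte sequence: 110xxxxx 10xxxxxx
		if n < 2 || p[1]&0xC0 != 0x80 {
			return RuneError, 1
		}
		return rune(b0&0x1F)<<6 | rune(p[1]&0x3F), 2
	}
	// 3-byte and 4-byte sequences follow a similar pattern...
	return RuneError, 1 // placeholder so the sketch compiles
}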
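Memory Layout
string in Go:
type stringHeader struct {
	Data unsafe.Pointer // pointer to bytes (8 bytes on 64-bit)
	Len  int            // byte length (8 bytes on 64-bit)
}
Total: 16 bytes for the header on 64-bit. The bytes themselves live wherever the string was created: read-only data for literals, typically the heap for strings built at run time.
rune = int32:
4 bytes in memory
[]rune:
type sliceHeader struct {
	Data unsafe.Pointer // 8 bytes
	Len  int            // 8 bytes
	Cap  int            // 8 bytes
}
Total: 24 bytes header + 4*n bytes data (n = number of runes)
Converting "Hello" to []rune:
- needs 5 * 4 = 20 bytes of rune data (the allocator may round this up, and a short result that does not escape can stay on the stack)
- decodes each rune from UTF-8 and stores it in the new array
- O(n) time and memory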
Compiler Perspective
Optimization: range on ASCII-Only Strings
The gc compiler does not need to prove that a string is ASCII-only. The loop it generates checks each leading byte at run time: a byte below utf8.RuneSelf (0x80) is used directly as the rune, and only other bytes go through full UTF-8 decoding. ASCII-heavy strings therefore iterate at close to byte-loop speed:
const s = "hello"
for i, r := range s {
	// Every iteration here takes the fast path: r = rune(s[i]), then i++
	process(i, r)
}
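One way to picture an ASCII shortcut in string iteration is a per-byte comparison against utf8.RuneSelf before falling back to the general decoder. The sketch below is hand-written Go, not compiler output; walk and its fast/slow counters are hypothetical and exist only to make the two paths visible.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// walk visits every rune in s. Bytes below utf8.RuneSelf are treated as
// complete runes; everything else goes through the general decoder.
// It returns how many runes took each path.
func walk(s string, visit func(i int, r rune)) (fast, slow int) {
	for i := 0; i < len(s); {
		if b := s[i]; b < utf8.RuneSelf {
			visit(i, rune(b)) // ASCII: the byte is the rune
			i++
			fast++
			continue
		}
		r, size := utf8.DecodeRuneInString(s[i:])
		visit(i, r)
		i += size
		slow++
	}
	return fast, slow
}

func main() {
	fast, slow := walk("h\u00e9llo, \u4e16\u754c", func(i int, r rune) {
		fmt.Printf("%d:%c ", i, r)
	})
	fmt.Printf("\nfast=%d slow=%d\n", fast, slow)
}
```

The check costs one comparison per rune, so mixed text pays almost nothing for the shortcut while pure ASCII avoids the decoder entirely.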
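How string(r) Works
s := string(rune(20013)) // "中"
// The conversion UTF-8 encodes the rune into a new string:
// U+4E2D → 0xE4 0xB8 0xAD (3 bytes)
// With a constant operand, as here, the compiler folds the conversion at compile time.
// With a variable operand it calls the runtime helper intstring, which encodes
// the rune and allocates unless the result can stay in a small stack buffer.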
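The encoding step itself can be reproduced with utf8.EncodeRune, which writes into a caller-supplied buffer. The sketch below is a minimal illustration; encode is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode returns the UTF-8 bytes of r, using a fixed 4-byte scratch buffer
// for the encoding and copying out only the bytes that were written.
func encode(r rune) []byte {
	var buf [utf8.UTFMax]byte
	n := utf8.EncodeRune(buf[:], r)
	return append([]byte(nil), buf[:n]...)
}

func main() {
	for _, r := range []rune{'A', '\u00e9', '\u4e2d', '\U0001F600'} {
		fmt.Printf("%U -> % X (%d bytes)\n", r, encode(r), utf8.RuneLen(r))
	}
	// A surrogate is not a valid code point and is encoded as U+FFFD.
	fmt.Printf("% X\n", encode(0xD800))
	fmt.Println(string(rune(20013)) == "\u4e2d")
}
```

When many runes are written in sequence, strings.Builder.WriteRune avoids creating one small string per rune.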
Performance Internals
Rough figures that depend on hardware and Go version; confirm with benchmarks on the target machine:
utf8.DecodeRuneInString: ~3-8 ns per call
range loop over string: ~2-6 ns per byte (amortized)
[]rune(s) conversion: O(n) time and memory
utf8.RuneCountInString: O(n) time, O(1) memory (no allocation, unlike []rune)
string(r): UTF-8 encodes and may allocate (10-15 ns)
strings.Builder.WriteRune: amortized O(1) per rune, O(n) total
Source Code Walkthrough
unicode/utf8 package (src/unicode/utf8/utf8.go):
- DecodeRune, DecodeRuneInString: decode first rune
- EncodeRune: encode rune to byte slice
- RuneCountInString: count runes without allocation
- ValidString: check UTF-8 validity
- RuneLen: bytes needed for a rune
Key constant: RuneError = U+FFFD (replacement character)
Returned by DecodeRune with size 1 for an invalid sequence and with size 0 for empty input.
Edge Cases at the Lowest Level
// U+FFFD: the replacement character
// Appears when DecodeRune encounters invalid UTF-8
invalidUTF8 := string([]byte{0xFF, 0xFE})
for _, r := range invalidUTF8 {
	fmt.Printf("U+%04X\n", r) // U+FFFD for each invalid byte
}
// Null rune: valid in Go strings (strings can contain null bytes)
s := "hello\x00world"
fmt.Println(len(s)) // 11 (null byte included)
Test
package runes_pro_test

import (
	"testing"
	"unicode/utf8"
	"unsafe"
)

func TestStringLayout(t *testing.T) {
	// A string header is a pointer plus a length: 16 bytes on 64-bit.
	var s string
	want := unsafe.Sizeof(uintptr(0)) + unsafe.Sizeof(int(0))
	if got := unsafe.Sizeof(s); got != want {
		t.Errorf("string header is %d bytes, want %d", got, want)
	}
}

func TestRuneSize(t *testing.T) {
	var r rune
	if unsafe.Sizeof(r) != 4 {
		t.Errorf("rune should be 4 bytes (int32), got %d", unsafe.Sizeof(r))
	}
}

func TestDecodeRune(t *testing.T) {
	// Chinese character "中" encoded as UTF-8
	encoded := []byte{0xE4, 0xB8, 0xAD}
	r, size := utf8.DecodeRune(encoded)
	if r != '中' || size != 3 {
		t.Errorf("DecodeRune: got r=%q size=%d, want r='中' size=3", r, size)
	}
}
Tricky Questions
Q: How does the Go runtime avoid allocations when ranging over a string? A: The range loop over a string operates directly on the underlying byte array of the string. It incrementally decodes UTF-8 bytes at the current offset without converting the entire string to []rune.
Q: What is utf8.RuneError and when does it appear? A: utf8.RuneError (U+FFFD, value 65533) is returned by DecodeRune when it encounters an invalid UTF-8 byte sequence. It's also the Unicode replacement character shown when text cannot be displayed.
Summary
At the machine level, a rune is a 4-byte int32 value, held in a register or in memory like any other integer. Iterating with range decodes UTF-8 on the fly without allocating, using an inlined check for ASCII bytes and a runtime helper for multi-byte sequences (the rough ~2-6 ns/byte figure above). The utf8 package decodes with small lookup tables rather than long branch chains. Converting a string to []rune costs O(n) time and memory and is a common avoidable cost in rune-heavy code. Prefer range and the utf8 package functions, which need only O(1) extra space.
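As an illustration of the O(1)-space approach, the sketch below fetches the n-th rune of a string with a range loop instead of indexing into []rune(s). runeAt is a hypothetical helper; it trades an O(n) scan for zero allocation.

```go
package main

import "fmt"

// runeAt returns the n-th rune of s (0-based) and true, or 0 and false
// when n is out of range. It never builds a []rune.
func runeAt(s string, n int) (rune, bool) {
	if n < 0 {
		return 0, false
	}
	for _, r := range s {
		if n == 0 {
			return r, true
		}
		n--
	}
	return 0, false
}

func main() {
	s := "h\u00e9llo"
	if r, ok := runeAt(s, 1); ok {
		fmt.Printf("%U\n", r) // U+00E9
	}
	_, ok := runeAt(s, 5)
	fmt.Println(ok) // false: the string has only 5 runes
}
```

If many random accesses are needed on the same string, a single []rune conversion can still be the better trade.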