Lexer / Scanner: Professional
This file is about shipping tooling on top of Go's lexical layer: linters, formatters, code generators, and analyzers. We cover the go/scanner and go/token APIs as a production dependency, how they relate to the compiler's cmd/compile/internal/syntax, how gofmt uses scanning, and the edge cases that bite real tools.
1. Two scanners, two audiences
| Concern | go/scanner + go/token | cmd/compile/internal/syntax |
|---|---|---|
| Audience | tooling (gofmt, vet, linters, you) | the gc compiler internals |
| Importable | yes, stable public API | no (internal/), API can change |
| Position model | token.Pos + token.FileSet | syntax.Pos + syntax.PosBase |
| Buffer model | whole file in a []byte | streaming io.Reader + growable buf |
| Token set | token.Token (has COMMENT, ELLIPSIS, etc.) | syntax.token (compact, at most 64 tokens) |
| Comment handling | ScanComments mode → COMMENT token | comments/directives mode → callback |
| Used by | go/parser, go/ast, go/types | the compiler's own parser/AST |
Rule of thumb: if you are building anything outside the compiler, use go/scanner and go/token (or, more often, go/parser + go/ast, which sit on top of them). Never import cmd/compile/internal/*: the packages are internal and unstable, and the go tool rejects imports of an internal package from outside the tree that contains it.
A subtle difference: go/scanner scans an entire source []byte you already have in memory, whereas the compiler's source streams from an io.Reader with a sliding window. For tooling the whole-file model is simpler and just as fast at file scale.
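The whole-file model can be sketched in a few lines. This is a minimal illustration rather than a recommended tool layout: the helper name tokenize and its output format are assumptions made for the example, and only go/scanner and go/token calls from the standard library are used.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokenize scans an in-memory source buffer in one pass and renders each
// token as "line:col TOKEN literal". The name and format are illustrative.
func tokenize(name string, src []byte) []string {
	fset := token.NewFileSet()
	file := fset.AddFile(name, fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, src, nil, 0) // nil handler: errors are only counted

	var out []string
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		p := fset.Position(pos)
		out = append(out, fmt.Sprintf("%d:%d %s %q", p.Line, p.Column, tok, lit))
	}
	return out
}

func main() {
	for _, line := range tokenize("x.go", []byte("package p\nvar x = 1\n")) {
		fmt.Println(line)
	}
}
```

The whole buffer is handed to Init up front, and the size passed to AddFile must match len(src).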
2. The production go/scanner API
The full surface you build on:
type Scanner struct {
	ErrorCount int // exported field: number of errors encountered
	// other fields are unexported
}
func (s *Scanner) Init(file *token.File, src []byte, err ErrorHandler, mode Mode)
func (s *Scanner) Scan() (pos token.Pos, tok token.Token, lit string)
type ErrorHandler func(pos token.Position, msg string)
type Mode uint
const (
	ScanComments    Mode = 1 << iota // return comments as COMMENT tokens instead of skipping them
	dontInsertSemis                  // unexported; used only by the scanner's own tests
)
token side:
fset := token.NewFileSet()
file := fset.AddFile(name, fset.Base(), len(src))
// later: fset.Position(pos) -> token.Position{Filename, Offset, Line, Column}
A robust tool always installs an ErrorHandler and checks s.ErrorCount:
var s scanner.Scanner
var firstErr error
s.Init(file, src, func(p token.Position, msg string) {
	if firstErr == nil {
		firstErr = fmt.Errorf("%s: %s", p, msg)
	}
}, scanner.ScanComments)
for {
	pos, tok, lit := s.Scan()
	if tok == token.EOF {
		break
	}
	_, _ = pos, lit // a real tool would consume these
}
if s.ErrorCount > 0 {
	return firstErr // file had lexical errors; do not trust the token stream
}
3. Most tools want the AST, not raw tokens
In practice, lexing alone is rarely what a linter needs. The standard stack:
src []byte
  └─ go/scanner → tokens (rarely used directly)
       └─ go/parser → *ast.File (what most tools start from)
            └─ go/ast → walk/inspect nodes
                 └─ go/types → type information
go/parser.ParseFile runs the scanner for you and returns an *ast.File, recording positions in the FileSet you pass in. You drop to go/scanner directly only when you genuinely need the token stream: for example a custom formatter, a syntax highlighter, a tool counting tokens, or something that must inspect text the AST discards (exact whitespace between tokens, malformed input you still want to tokenize).
When you do need both structure and trivia, go/parser.ParseComments keeps comments in the AST (ast.File.Comments, *ast.CommentGroup), which is how gofmt and doc tools preserve and reflow comments.
4. How gofmt uses scanning
gofmt (and go/printer) deliberately work at the AST level, not the token level, for the structural rewrite, but lexical facts drive its output:
- Comment attachment. go/parser with ParseComments records each comment's position (from the scanner) so the printer can re-attach it to the nearest node and re-emit it. Floating comments that the scanner located between tokens are the hardest part of formatting.
- Semicolon normalization. Because ASI already happened in the scanner, gofmt simply omits the inserted semicolons in output and relies on the same rules round-tripping. This is why gofmt can safely delete the ; you typed at a line end: the scanner would re-insert it.
- //line and directives. The printer preserves //go: and //line directives verbatim because the scanner surfaced them as comments with exact text and position; mangling them would break builds.
The practical lesson: a formatter must round-trip the scanner's decisions. If your tool rewrites source text, re-run the output through go/parser and compare the resulting AST with the original to check that you did not change meaning. At a minimum, confirm that the output still parses, which is the check go/printer's own tests apply to formatted output.
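The semicolon point is easy to observe from the token stream. In the sketch below (the helper name semicolons is an assumption for the example), go/scanner reports a typed semicolon with the literal ";" and an inserted one with the literal "\n", which is the documented behaviour of Scan.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// semicolons scans src and classifies every SEMICOLON token as "typed"
// (present in the source) or "inserted" (added by the scanner at a newline
// or at EOF).
func semicolons(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("asi.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)

	var kinds []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok != token.SEMICOLON {
			continue
		}
		if lit == "\n" {
			kinds = append(kinds, "inserted")
		} else {
			kinds = append(kinds, "typed")
		}
	}
	return kinds
}

func main() {
	fmt.Println(semicolons("x := 1;\ny := 2\n"))
}
```

Both lines end in a SEMICOLON token as far as the parser is concerned, which is why dropping the typed one changes nothing.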
5. Building a linter rule: a worked pattern
Say you want to flag return statements that are immediately followed on the next line by an expression — the ASI footgun. You could do it lexically:
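// Flag the pattern: a bare 'return' whose inserted ';' (the newline)
// is followed by another statement in the same block.
// Heuristic only: assumes s was initialized without ScanComments and
// ignores labels that a goto could still reach.
prev := token.ILLEGAL
afterBareReturn := false // last two tokens were 'return' and an inserted ';'
for {
	pos, tok, lit := s.Scan()
	if tok == token.EOF {
		break
	}
	if afterBareReturn && tok != token.RBRACE && tok != token.CASE && tok != token.DEFAULT {
		// this token starts a statement on a later line that can never run
		fmt.Printf("%s: unreachable code after bare return\n", fset.Position(pos))
	}
	// go/scanner marks an inserted semicolon with lit == "\n"; a typed one is ";".
	afterBareReturn = prev == token.RETURN && tok == token.SEMICOLON && lit == "\n"
	prev = tok
}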
In real life you would use go/ast + go/analysis for this (it is exactly what vet's unreachable pass does), but the point stands: lexical tools see the inserted semicolons, which is sometimes precisely the signal you want.
The modern, supported way to ship such a check is the golang.org/x/tools/go/analysis framework: write an *analysis.Analyzer, receive the *token.FileSet and []*ast.File through the *analysis.Pass handed to your Run function, and emit diagnostics with positions. It plugs into go vet, gopls, and CI uniformly.
6. Edge cases that bite real tools
These are the ones that generate bug reports:
- BOM at file start. A UTF-8 BOM (EF BB BF) is allowed only as the first character; the scanner skips it. A BOM mid-file is an error. Tools that bytes.TrimSpace or slice the source manually can corrupt positions; let the scanner handle the BOM.
- CRLF line endings. The scanner treats \r as whitespace and drops \r inside raw strings, so a raw string's literal text can be shorter than its span in the file. A tool that derives offsets or columns from literal lengths, or keeps its own line count, can be off on CRLF files; rely on token.Position, never your own count.
- //go: takes no space. Linters that "normalize" comments by inserting a space after // will silently disable //go:noinline, //go:embed, //go:generate, etc. Never touch directive comments.
- Build tags must precede package. //go:build and // +build lines are only build constraints if they sit before the package clause with a blank line after. A tool that reorders top-of-file comments can break the build.
- Raw vs interpreted strings. A rewriter that re-quotes strings must not turn `a\b` (literal backslash) into "a\b" (escape): different bytes. Decode with strconv.Unquote, which handles both forms, and re-encode the decoded value with strconv.Quote rather than swapping the delimiters.
- Malformed but tokenizable input. With an error handler installed, the scanner still returns tokens for bad literals (bad = true on the compiler side; in go/scanner the literal text is returned and ErrorCount increments). Tools must check ErrorCount before trusting the stream.
- ELLIPSIS and three dots. The sequence ... is one token (token.ELLIPSIS); the scanner peeks one character ahead to distinguish it from two dots (scanned as two PERIOD tokens, which the parser rejects) and from a single dot. A naive hand-rolled tokenizer often gets this wrong.
- Numeric literals with separators. 1_000, 0x_FF, 0b1010 are valid; 1__0 and 1_ are not, and _1 scans as an identifier rather than a number. A tool that re-emits numbers must preserve the underscores exactly or it changes the source text (gofmt preserves them).
- Implicit semicolons and trailing commas. Removing a trailing comma in a multi-line composite literal turns the prior line's literal into a semicolon-terminated statement, producing a syntax error. Code generators that emit Go must keep trailing commas.
7. Source rewriting safely: the round-trip rule
Any tool that emits or rewrites Go source must respect the scanner's decisions, or it will produce code that compiles to something different — or not at all. The discipline that makes rewriting safe:
// 1. Parse the original into an AST (this runs the scanner for you).
fset := token.NewFileSet()
orig, err := parser.ParseFile(fset, name, src, parser.ParseComments)
if err != nil {
	return err
}
// 2. Mutate the AST (rename, insert, reorder nodes); never edit text directly.
// 3. Print it back out with go/printer (or format.Node), which re-derives
//    formatting and re-applies the scanner's rules (semicolons, etc.).
var buf bytes.Buffer
if err := format.Node(&buf, fset, orig); err != nil {
	return err
}
// 4. Re-parse the output. A parse error means the rewrite produced invalid Go;
//    comparing check with the intended AST (not shown here) is what proves
//    that meaning is unchanged.
check, err := parser.ParseFile(token.NewFileSet(), name, buf.Bytes(), 0)
if err != nil {
	return fmt.Errorf("rewrite produced invalid Go: %w", err)
}
_ = check
The reason step 3 is safe is that go/printer emits source that the scanner will re-tokenize identically: it never puts { on its own line after ), always keeps trailing commas, never inserts a space into a directive, and writes string literals as they appear in the AST rather than re-quoting them. Hand-editing source text bypasses all of that and is how rewriters introduce ASI bugs.
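The comparison in step 4 is left open above. One possible way to do it, sketched here under stated assumptions, is a reflective walk over both trees that ignores where a token.Pos points and only checks whether it is set. The names sameShape and sameAST are hypothetical, comments are not compared, and the check is conservative: purely cosmetic differences such as an added pair of parentheses around a declaration group are reported as changes.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"reflect"
)

var posType = reflect.TypeOf(token.NoPos)

// sameShape reports whether two values taken from go/ast trees are
// structurally equal. token.Pos fields are compared only for validity.
func sameShape(a, b reflect.Value) bool {
	if a.Type() != b.Type() {
		return false
	}
	if a.Type() == posType {
		return (a.Int() == 0) == (b.Int() == 0)
	}
	switch a.Kind() {
	case reflect.Ptr, reflect.Interface:
		if a.IsNil() || b.IsNil() {
			return a.IsNil() == b.IsNil()
		}
		return sameShape(a.Elem(), b.Elem())
	case reflect.Slice:
		if a.Len() != b.Len() {
			return false
		}
		for i := 0; i < a.Len(); i++ {
			if !sameShape(a.Index(i), b.Index(i)) {
				return false
			}
		}
		return true
	case reflect.Struct:
		for i := 0; i < a.NumField(); i++ {
			if !sameShape(a.Field(i), b.Field(i)) {
				return false
			}
		}
		return true
	case reflect.String:
		return a.String() == b.String()
	case reflect.Int:
		return a.Int() == b.Int()
	case reflect.Bool:
		return a.Bool() == b.Bool()
	}
	return false // unexpected kind: report a difference rather than guess
}

// sameAST parses both sources and compares the resulting files.
// SkipObjectResolution keeps the trees free of cycles.
func sameAST(x, y string) (bool, error) {
	fx, err := parser.ParseFile(token.NewFileSet(), "", x, parser.SkipObjectResolution)
	if err != nil {
		return false, err
	}
	fy, err := parser.ParseFile(token.NewFileSet(), "", y, parser.SkipObjectResolution)
	if err != nil {
		return false, err
	}
	return sameShape(reflect.ValueOf(fx), reflect.ValueOf(fy)), nil
}

func main() {
	same, err := sameAST("package p\nfunc f(){return}\n", "package p\n\nfunc f() {\n\treturn\n}\n")
	fmt.Println(same, err)
}
```

Checking position validity rather than dropping positions entirely matters because a few of them carry meaning, for example the = in a type alias or the ... in a variadic call.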
8. Working with directives in tooling
Because directives are just comments to the scanner, a tool that needs to find or preserve them works at the comment level:
// With ScanComments, COMMENT tokens carry directive text verbatim.
for {
	pos, tok, lit := s.Scan()
	if tok == token.EOF {
		break
	}
	if tok == token.COMMENT && strings.HasPrefix(lit, "//go:") {
		fmt.Printf("%s pragma: %s\n", fset.Position(pos), lit)
	}
}
Things tooling must get right with directives:
- Never normalize the spacing. //go:embed must stay glued; inserting a space silently disables it. Many naive "comment formatters" have shipped exactly this regression.
- //go:build placement. It must come before the package clause with a blank line after. A tool that sorts or moves leading comments must special-case build constraints (gofmt does, and even synthesizes the new-style //go:build from a legacy // +build).
- //line rebasing. If your tool reports positions, honor //line directives by using fset.Position (which already accounts for them) rather than counting newlines yourself.
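The //line point can be seen by asking the FileSet for both views of the same token. In this sketch the helper name varPositions, the file name p.go and the template name gen.tmpl are assumptions for illustration; fset.Position returns the directive-adjusted position and fset.PositionFor with adjusted set to false returns the physical one.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

const sample = "package p\n//line gen.tmpl:100\nvar x int\n"

// varPositions scans src and returns the adjusted and the raw position of the
// first 'var' keyword. The scanner records //line directives as it goes, even
// when comments are not returned as tokens.
func varPositions(src string) (adjusted, raw token.Position) {
	fset := token.NewFileSet()
	file := fset.AddFile("p.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)
	for {
		pos, tok, _ := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok == token.VAR {
			return fset.Position(pos), fset.PositionFor(pos, false)
		}
	}
	return adjusted, raw
}

func main() {
	adj, raw := varPositions(sample)
	fmt.Println("adjusted:", adj)
	fmt.Println("raw:     ", raw)
}
```

A tool that counts newlines itself would report the raw location, which is the wrong file for anyone editing the template that generated the code.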
9. Performance for tooling
For a one-shot CLI, scanning cost is negligible; for an editor backend (gopls) that re-scans on every keystroke across thousands of files, it matters:
- Reuse a single token.FileSet across files in a run; it deduplicates and compacts position storage.
- Reuse one scanner.Scanner value by calling Init again per file; Init resets state and the struct is cheap to keep around.
- Prefer go/parser incremental reuse via gopls' snapshot model rather than re-parsing whole packages; the scanner is fast but go/types is not.
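The first two points can be sketched as follows. The function name countTokens and the in-memory file map are assumptions for the example; a real tool would read files from disk and do more than count.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// countTokens scans every source with one shared FileSet and one Scanner
// value, re-initialized per file, and returns the token count per file name.
func countTokens(files map[string][]byte) map[string]int {
	fset := token.NewFileSet() // shared across the whole run
	var s scanner.Scanner      // reused; Init resets its state

	counts := make(map[string]int, len(files))
	for name, src := range files {
		file := fset.AddFile(name, fset.Base(), len(src))
		s.Init(file, src, nil, 0)
		for {
			_, tok, _ := s.Scan()
			if tok == token.EOF {
				break
			}
			counts[name]++
		}
	}
	return counts
}

func main() {
	counts := countTokens(map[string][]byte{
		"a.go": []byte("package a\n"),
		"b.go": []byte("package b\nvar x int\n"),
	})
	fmt.Println(counts["a.go"], counts["b.go"])
}
```

Each file gets its own base offset in the shared FileSet, so a token.Pos from any file can still be resolved later with a single fset.Position call.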
10. Professional takeaways
- Build on go/scanner and go/token (and usually go/parser and go/ast above them); never import the compiler's internal/syntax.
- Always install an ErrorHandler and gate on ErrorCount before trusting tokens.
- gofmt round-trips the scanner's ASI and comment decisions; your rewriters must too. Verify by re-parsing and comparing ASTs.
- The biting edge cases are directives (no space, must precede package), raw-vs-interpreted strings, numeric separators, BOM/CRLF, and trailing commas under ASI.
- Ship checks through golang.org/x/tools/go/analysis so they integrate with go vet and gopls.
Further reading
- go/scanner docs: https://pkg.go.dev/go/scanner
- go/token docs: https://pkg.go.dev/go/token
- go/parser docs: https://pkg.go.dev/go/parser
- go/analysis framework: https://pkg.go.dev/golang.org/x/tools/go/analysis
- gofmt / go/printer source: https://go.dev/src/go/printer/printer.go
- Compiler directives: https://pkg.go.dev/cmd/compile#hdr-Compiler_Directives
- Build constraints (//go:build): https://pkg.go.dev/cmd/go#hdr-Build_constraints