Lexical elements: Tokens
January 10, 2023
Tokens
Tokens form the vocabulary of the Go language. There are four classes: identifiers, keywords, operators and punctuation, and literals. White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored except as it separates tokens that would otherwise combine into a single token. Also, a newline or end of file may trigger the insertion of a semicolon. While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.
In other words, after removing comments (and replacing them with either spaces or newlines, as discussed yesterday), we can treat a source file as being passed through a split function that splits on any run of space, tab, carriage return, and newline characters (or /[ \t\r\n]+/ in RegExp speak). This should seem pretty intuitive, as most modern languages work in a similar way.
It’s not quite this simple, though, because of the way Go handles implicit semicolons, which we’ll be discussing tomorrow.
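To make the split idea concrete, here is a minimal sketch of it, using the same regular expression. The helper name splitOnWhitespace is made up for this post and is not part of any Go package. It is only an approximation of what the real scanner does.

```go
package main

import (
	"fmt"
	"regexp"
)

// goWhitespace matches runs of the four white space characters the spec
// lists: space, horizontal tab, carriage return, and newline.
var goWhitespace = regexp.MustCompile(`[ \t\r\n]+`)

// splitOnWhitespace is a hypothetical helper for illustration only.
// It splits src on white space and drops the empty strings that
// leading or trailing white space would otherwise produce.
func splitOnWhitespace(src string) []string {
	var pieces []string
	for _, p := range goWhitespace.Split(src, -1) {
		if p != "" {
			pieces = append(pieces, p)
		}
	}
	return pieces
}

func main() {
	// Mixed tabs, spaces, and a CRLF line ending all act as separators.
	fmt.Printf("%q\n", splitOnWhitespace("var\tx  int =\r\n42"))
	// prints ["var" "x" "int" "=" "42"]
}
```

Keep in mind that tokens don't need white space between them at all, so a plain split like this would leave something like a+b as a single piece, where the real scanner sees three tokens.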
Finally, the last sentence is also intuitive, but important: “the next token is the longest sequence of characters that form a valid token.”
This just means that the identifier forward won’t be interpreted as the keyword for followed by ward during tokenization.
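If you want to see the longest-match rule in action, the standard library's go/scanner package will tokenize a string for you. The sketch below is just one way to poke at it. The tokenize helper and the example.go file name are made up for illustration, and the automatically inserted semicolons are skipped to keep the output short.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokenize is a hypothetical helper: it scans src with go/scanner and
// describes each token. Literals and identifiers are shown as KIND(text);
// keywords and operators are shown as their own text.
func tokenize(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)

	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok == token.SEMICOLON {
			continue // skip semicolons, including automatically inserted ones
		}
		desc := tok.String()
		if tok.IsLiteral() { // true for identifiers and basic literals
			desc += "(" + lit + ")"
		}
		out = append(out, desc)
	}
	return out
}

func main() {
	fmt.Printf("%q\n", tokenize("forward"))  // prints ["IDENT(forward)"]
	fmt.Printf("%q\n", tokenize("for ward")) // prints ["for" "IDENT(ward)"]
	fmt.Printf("%q\n", tokenize("a<<=1"))    // prints ["IDENT(a)" "<<=" "INT(1)"]
}
```

The last line shows the same rule applied to operators: <<= comes out as a single token rather than as < followed by <=.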
Quotes from The Go Programming Language Specification, Version of June 29, 2022