Lexical elements: Tokens

January 10, 2023

Tokens

Tokens form the vocabulary of the Go language. There are four classes: identifiers, keywords, operators and punctuation, and literals. White space, formed from spaces (U+0020), horizontal tabs (U+0009), carriage returns (U+000D), and newlines (U+000A), is ignored except as it separates tokens that would otherwise combine into a single token. Also, a newline or end of file may trigger the insertion of a semicolon. While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

In other words, after removing comments (and replacing them with either spaces or newlines, as discussed yesterday), we can treat a source file as being passed through a split function that splits on any run of space, tab, carriage return, and newline characters (or /[ \t\r\n]+/ in RegExp speak). This should seem pretty intuitive, as most modern languages work in a similar way.

It’s not quite this simple, though, because of the way Go handles implicit semicolons, which we’ll be discussing tomorrow.
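To make the "split function" picture concrete, here is a minimal sketch of it in Go. The names whiteSpace and splitOnWhiteSpace are made up for this illustration, and this is not how the Go compiler tokenizes source; it only shows what splitting on the four white space characters looks like.

```go
package main

import (
	"fmt"
	"regexp"
)

// whiteSpace matches any run of the four white space characters the spec
// names: space, horizontal tab, carriage return, and newline.
var whiteSpace = regexp.MustCompile(`[ \t\r\n]+`)

// splitOnWhiteSpace is a hypothetical helper for illustration only. It splits
// src on runs of white space and drops the empty pieces that appear when src
// starts or ends with white space.
func splitOnWhiteSpace(src string) []string {
	var parts []string
	for _, p := range whiteSpace.Split(src, -1) {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return parts
}

func main() {
	fmt.Printf("%q\n", splitOnWhiteSpace("var   x\t=\r\n 42"))
	// prints ["var" "x" "=" "42"]
}
```

This is only a first approximation. Given x=42, the sketch returns a single piece, while Go sees three tokens there, because white space is only needed where tokens would otherwise combine. It also knows nothing about string literals, which may contain spaces.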

Finally, the last sentence is also intuitive, but important: “the next token is the longest sequence of characters that form a valid token.”

This just means that forward won’t be interpreted as for followed by ward during tokenization.
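We can watch the longest-match rule in action with the standard library's go/scanner package. The following is a small sketch, not a definitive tool: the tokenize helper and its kind(literal) output format are invented for this example, and automatically inserted semicolons are skipped on purpose, since that topic belongs to tomorrow.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokenize is a hypothetical helper for illustration. It scans src with
// go/scanner and returns each token formatted as kind(literal).
func tokenize(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0) // nil error handler, default mode

	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		// Skip semicolons the scanner inserted automatically; their
		// literal is "\n" rather than ";".
		if tok == token.SEMICOLON && lit == "\n" {
			continue
		}
		// Operators and punctuation carry no literal, so fall back to
		// the token's own string form.
		if lit == "" {
			lit = tok.String()
		}
		out = append(out, fmt.Sprintf("%s(%s)", tok, lit))
	}
	return out
}

func main() {
	fmt.Println(tokenize("forward"))  // prints [IDENT(forward)]
	fmt.Println(tokenize("for ward")) // prints [for(for) IDENT(ward)]
}
```

Without the space, the scanner keeps consuming characters and produces a single identifier. With the space, the same letters become the keyword for followed by the identifier ward.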

Quotes from The Go Programming Language Specification, Version of June 29, 2022