Source code representation

Source code is Unicode text encoded in UTF-8. The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points. For simplicity, this document will use the unqualified term character to refer to a Unicode code point in the source text.

Each code point is distinct; for instance, uppercase and lowercase letters are different characters.

Implementation restriction: For compatibility with other tools, a compiler may disallow the NUL character (U+0000) in the source text.

Implementation restriction: For compatibility with other tools, a compiler may ignore a UTF-8-encoded byte order mark (U+FEFF) if it is the first Unicode code point in the source text. A byte order mark may be disallowed anywhere else in the source.
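To make the spec's point about canonicalization concrete, here is a minimal sketch that compares the two spellings of "é" it describes: the single accented code point, and a plain letter followed by a combining accent. The names precomposed, combined, and runeCount are my own, chosen only for illustration; the escapes are used so the difference stays visible even if an editor normalizes the text.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// precomposed is "é" written as the single code point U+00E9.
const precomposed = "\u00e9"

// combined is "é" written as "e" (U+0065) followed by the
// combining acute accent (U+0301).
const combined = "e\u0301"

// runeCount reports how many Unicode code points s contains.
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	// Go does not canonicalize, so the two spellings are different strings.
	fmt.Println(precomposed == combined) // false

	// One code point versus two.
	fmt.Println(runeCount(precomposed), runeCount(combined)) // 1 2

	// Their UTF-8 encodings differ in length as well.
	fmt.Println(len(precomposed), len(combined)) // 2 3
}
```

If you do need the two spellings to compare equal, that is a normalization step you apply yourself; the language will not do it for you.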

I suppose it should be no surprise that Go source code is explicitly defined as UTF-8, since Ken Thompson and Rob Pike were co-creators of both Go and UTF-8!

What this means is that it’s possible to unambiguously write any Unicode text into your Go source code. Having dealt with ambiguous encodings and special runtime modules in some other languages in the past, I find this to be quite a nice thing!

You can see this demonstrated in the Go Playground, whose default Hello World program includes some Chinese characters:

package main
import "fmt"

func main() {
	fmt.Println("Hello, 世界")
}

But it works just as well with any other Unicode characters.

This does mean there’s room for some confusion, as there are many Unicode characters that (in)famously look alike.
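As a small illustration of the look-alike problem, the sketch below compares the Latin letter "a" with the Cyrillic letter "а". They render almost identically in most fonts, but they are different code points, so Go treats them as different characters. The helper codePoints is a hypothetical name of my own, not something from the standard library.

```go
package main

import "fmt"

// codePoints returns each code point of s in U+XXXX notation.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	latin := "a"         // U+0061 LATIN SMALL LETTER A
	cyrillic := "\u0430" // U+0430 CYRILLIC SMALL LETTER A

	// The glyphs look alike, but the strings are not equal.
	fmt.Println(latin == cyrillic) // false

	// Printing the code points makes the difference visible.
	fmt.Println(codePoints(latin), codePoints(cyrillic)) // [U+0061] [U+0430]
}
```

Dumping code points like this is a handy first step when two identifiers or strings look the same but the compiler insists they are not.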

The Go Programming Language Specification, Version of June 29, 2022

January 5, 2023
