Prepitch: Character integer literals

Text (even down to an individual scalar’s worth of text) has multiple, interchangeable byte representations, even within a single encoding (i.e. without leaving UTF‐8). Anything written in a source file is text, and may switch forms without asking the user. But the different forms produce different numbers, or even different numbers of numbers. Two equal source files should not produce two different programs. Nor should a source file produce a different program than its copy produces. But this is what starts to happen when Unicode characters are used as something other than Unicode characters.
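To make that concrete, here is a minimal sketch using only the standard library. The variable names are mine, and the two forms are written with `\u{…}` escapes so that the example itself cannot be changed by normalization in transit. It shows one character of text that compares equal as a string, yet yields a different number of scalars and bytes depending on the form.

```swift
// Two spellings of the same text "é".
let precomposed = "\u{E9}"     // NFC: U+00E9
let decomposed = "e\u{301}"    // NFD: U+0065 U+0301

// Swift compares strings by canonical equivalence, so these are equal.
let sameText = (precomposed == decomposed)

// The underlying numbers differ, and so does how many of them there are.
let precomposedScalars = precomposed.unicodeScalars.map { $0.value }
let decomposedScalars = decomposed.unicodeScalars.map { $0.value }
let precomposedBytes = Array(precomposed.utf8)
let decomposedBytes = Array(decomposed.utf8)

print(sameText)            // true
print(precomposedScalars)  // [233]
print(decomposedScalars)   // [101, 769]
print(precomposedBytes)    // [195, 169]
print(decomposedBytes)     // [101, 204, 129]
```

If a literal like 'é' were allowed to initialize an integer, which of these numbers it produced would depend on which form the file happened to be saved in.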

Option 4: Only What Is Safe

| Literal | UInt8 | UInt16 | UInt32 | Unicode.Scalar | Character | Notes |
|---|---|---|---|---|---|---|
| 'x' | 120 | 120 | 120 | U+0078 | x | ASCII scalar |
| '©' | error | error | error | U+00A9* | © | Latin‐1 scalar |
| 'é' | error | error | error | U+00E9/error* | é | Latin‐1 scalar which expands under NFD |
| '花' | error | error | error | U+82B1* | 花 | BMP scalar |
| ';' | error | error | error | U+037E/U+003B*† | ; | BMP scalar which changes under NFx |
| 'שּׂ' | error | error | error | U+FB2D/error* | שּׂ | BMP scalar which expands under NFx |
| '𓀎' | error | error | error | U+1300E* | 𓀎 | Supplemental plane scalar |
| 'ē̱' | error | error | error | error | ē̱ | Character with no single‐scalar representation |
| 'ab' | error | error | error | error | error | Multiple characters |

* Cells with an asterisk are existing behaviour that is dangerous. As you can see, some characters in these ranges are safe to initialize as literals and others aren’t. You have to know a lot about Unicode to know which are which. At least if you have gone to the effort to call Unicode.Scalar out from under Character, then you must already know you are dealing with Unicode, and you’ve taken a certain level of responsibility for what happens. But that waiving of safety is not as clear when you are thinking of them as numbers (UIntx).

† These are equal strings.
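As a rough illustration of the integer columns above, here is a sketch of the rule "only a single ASCII scalar gets a number". The helper `safeIntegerValue(of:)` is hypothetical, invented for this example; it is not an existing or proposed API, and it only models the UInt8 column. The last lines show the † case: the two strings compare equal, but their scalars do not.

```swift
/// Hypothetical helper (not a real or proposed API): returns a number only
/// when the text is exactly one ASCII scalar, mirroring the UInt8 column.
func safeIntegerValue(of text: String) -> UInt8? {
    let scalars = text.unicodeScalars
    guard scalars.count == 1, let scalar = scalars.first, scalar.isASCII else {
        return nil  // corresponds to "error" in the table
    }
    return UInt8(scalar.value)  // always fits, since ASCII is below 128
}

let ascii = safeIntegerValue(of: "x")             // 120
let latin1 = safeIntegerValue(of: "\u{A9}")       // nil
let expanded = safeIntegerValue(of: "e\u{301}")   // nil
let multiple = safeIntegerValue(of: "ab")         // nil

// The † row: U+037E (Greek question mark) and U+003B (semicolon).
let greekQuestionMark = "\u{37E}"
let semicolon = "\u{3B}"
let stringsAreEqual = (greekQuestionMark == semicolon)  // true
let greekScalar: Unicode.Scalar = "\u{37E}"
let semicolonScalar: Unicode.Scalar = "\u{3B}"
let scalarsAreEqual = (greekScalar == semicolonScalar)  // false

print(stringsAreEqual, scalarsAreEqual)  // true false
```

Under this sketch the ASCII semicolon would yield 59 while the Greek question mark yields nothing, even though the two are equal as strings. That is the kind of divergence the asterisks are warning about.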
