Text (even down to an individual scalar’s worth of text) has multiple, interchangeable byte representations, even within a single encoding (i.e. without leaving UTF‐8). Anything written in a source file is text, and may switch forms without asking the user. But the different forms produce different numbers, or even different numbers of numbers. Two equal source files should not produce two different programs. Nor should a source file produce a different program than its copy produces. But this is what starts to happen when Unicode characters are used as something other than Unicode characters.
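As a minimal sketch of the point above, the snippet below writes the same character in two canonically equivalent forms and prints the numbers each form produces. The variable names are illustrative only, and the escapes are used so that the example itself cannot be silently normalized.

```swift
// The same character "é", written in two canonically equivalent forms.
let precomposed = "\u{E9}"     // NFC: a single scalar
let decomposed = "e\u{301}"    // NFD: base letter plus combining acute accent

// String comparison uses canonical equivalence, so this is equal text.
print(precomposed == decomposed)                      // true

// Yet the underlying numbers differ in both value and count.
print(precomposed.unicodeScalars.map { $0.value })    // [233]
print(decomposed.unicodeScalars.map { $0.value })     // [101, 769]
print(Array(precomposed.utf8))                        // [195, 169]
print(Array(decomposed.utf8))                         // [101, 204, 129]
```

If a tool were to convert one form into the other inside a source file, the text would be unchanged, but any integer derived from it would not be.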
Option 4: Only What Is Safe
Literal | UInt8 | UInt16 | UInt32 | Unicode.Scalar | Character | Notes |
'x' | 120 | 120 | 120 | U+0078 | x | ASCII scalar |
'©' | error | error | error | U+00A9* | © | Latin‐1 scalar |
'é' | error | error | error | U+00E9/error* | é | Latin‐1 scalar which expands under NFD |
'花' | error | error | error | U+82B1* | 花 | BMP scalar |
';' | error | error | error | U+037E/U+003B*† | ; | BMP scalar which changes under NFx |
'שּׂ' | error | error | error | U+FB2D/error* | שּׂ | BMP scalar which expands under NFx |
'𓀎' | error | error | error | U+1300E* | 𓀎 | Supplemental plane scalar |
'ē̱' | error | error | error | error | ē̱ | Character with no single‐scalar representation |
'ab' | error | error | error | error | error | Multiple characters |
* Cells with an asterisk are existing behaviour that is dangerous. As you can see, some characters in these ranges are safe to initialize as literals and others aren’t. You have to know a lot about Unicode to know which are which. At least if you have gone to the effort to call Unicode.Scalar out from under Character, then you must already know you are dealing with Unicode, and you’ve taken a certain level of responsibility for what happens. But that waiving of safety is not as clear when you are thinking of them as numbers (UIntx).
† These are equal strings.
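The dagger note can be checked with a short sketch, assuming only the standard library. The constant names are made up for illustration, and escapes stand in for the literal characters so that normalization of this text cannot change what is being compared.

```swift
// U+037E GREEK QUESTION MARK is canonically equivalent to U+003B SEMICOLON.
let greekQuestionMark = "\u{37E}"
let semicolon = "\u{3B}"

// As strings they compare equal.
print(greekQuestionMark == semicolon)       // true

// As scalars they are distinct, with distinct numeric values.
let greekScalar: Unicode.Scalar = "\u{37E}"
let semicolonScalar: Unicode.Scalar = "\u{3B}"
print(greekScalar == semicolonScalar)           // false
print(greekScalar.value, semicolonScalar.value) // 894 59
```

So a literal that normalization turns from one into the other stays the same string but becomes a different Unicode.Scalar, which is the hazard marked with an asterisk in the table.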