Prepitch: Character integer literals

SDGGiesbrecht · January 9, 2019, 11:11pm

Text (even down to an individual scalar’s worth of text) has multiple, interchangeable byte representations, even within a single encoding (i.e. without leaving UTF‐8). Anything written in a source file is text, and tools such as editors may switch it from one form to another without asking the user. But the different forms produce different numbers, or even different numbers of numbers. Two equal source files should not produce two different programs. Nor should a source file produce a different program than its copy produces. But this is what starts to happen when Unicode characters are used as something other than Unicode characters.
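To make the “different numbers of numbers” point concrete, here is a minimal sketch using only the standard library. The variable names are just for illustration, and the literals are written with `\u{…}` escapes so that the example itself cannot be silently changed by normalizing this text.

```swift
// Two spellings of "é" that Swift treats as the same text.
let precomposed = "\u{E9}"     // NFC form: U+00E9
let decomposed = "e\u{301}"    // NFD form: U+0065 U+0301

// String equality follows canonical equivalence, so these compare equal.
let sameText = (precomposed == decomposed)

// The underlying numbers differ, both in value and in count.
let precomposedBytes = Array(precomposed.utf8)   // [0xC3, 0xA9]
let decomposedBytes = Array(decomposed.utf8)     // [0x65, 0xCC, 0x81]
let precomposedScalars = precomposed.unicodeScalars.map { $0.value }
let decomposedScalars = decomposed.unicodeScalars.map { $0.value }

print(sameText)
print(precomposedBytes, decomposedBytes)
print(precomposedScalars, decomposedScalars)
```

If a literal like `'é'` were allowed to become an integer, which of these numbers it produced would depend on which form the source file happened to be in at compile time.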

Option 4: Only What Is Safe

Literal	UInt8	UInt16	UInt32	Unicode.Scalar	Character	Notes
'x'	120	120	120	U+0078	x	ASCII scalar
'©'	error	error	error	U+00A9*	©	Latin‐1 scalar
'é'	error	error	error	U+00E9/error*	é	Latin‐1 scalar which expands under NFD
'花'	error	error	error	U+82B1*	花	BMP scalar
';'	error	error	error	U+037E/U+003B*†	;	BMP scalar which changes under NFx
'שּׂ'	error	error	error	U+FB2D/error*	שּׂ	BMP scalar which expands under NFx
'𓀎'	error	error	error	U+1300E*	𓀎	Supplemental plane scalar
'ē̱'	error	error	error	error	ē̱	Character with no single‐scalar representation
'ab'	error	error	error	error	error	Multiple characters

* Cells marked with an asterisk show existing behaviour that is dangerous. As you can see, some characters in these ranges are safe to initialize as literals and others aren’t, and you have to know a lot about Unicode to know which are which. At least if you have gone to the effort to call Unicode.Scalar out from under Character, then you must already know you are dealing with Unicode, and you’ve taken a certain level of responsibility for what happens. But that waiving of safety is not as clear when you are thinking of them as numbers (UIntx).

† These are equal strings: U+037E and U+003B are canonically equivalent.
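As a rough illustration of the † row (a sketch, with names chosen only for this example), the two scalars compare equal as characters but not as scalars or as numbers:

```swift
// U+037E (Greek question mark) is canonically equivalent to U+003B (semicolon).
let greekQuestionMark: Unicode.Scalar = "\u{37E}"
let semicolon: Unicode.Scalar = "\u{3B}"

// Character comparison follows canonical equivalence.
let equalAsCharacters = (Character(greekQuestionMark) == Character(semicolon))

// Unicode.Scalar comparison looks at the raw scalar value.
let equalAsScalars = (greekQuestionMark == semicolon)

print(equalAsCharacters)                          // the text is equal
print(equalAsScalars)                             // the scalars are not
print(greekQuestionMark.value, semicolon.value)   // 894 59
```

So a source file that gets normalized would keep the same Character but silently change the Unicode.Scalar, which is the danger the asterisk is flagging.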