SE-0243: Codepoint and Character Literals

CTMacUser · March 14, 2019, 1:19am

Just as a String is a sequence of Characters, a Character is in turn a sequence of code points. If you mean to compare (ASCII) code points to bytes within binary data, then wouldn't Unicode.Scalar be the correct abstraction? A scalar always maps to exactly one integer value, while a Character may consist of several scalars.
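To make the distinction concrete, here is a small sketch using standard library APIs (`Character.unicodeScalars`, `Unicode.Scalar.value`, `UInt8(ascii:)`). The variable names and the sample bytes are just for illustration.

```swift
// Two spellings of the same abstract character "é".
let composed: Character = "\u{E9}"          // one scalar: U+00E9
let decomposed: Character = "\u{65}\u{301}" // two scalars: U+0065 + U+0301

// Character comparison uses canonical equivalence, so these are equal...
let sameCharacter = (composed == decomposed)

// ...even though the underlying scalar sequences differ in length.
let composedCount = composed.unicodeScalars.count
let decomposedCount = decomposed.unicodeScalars.count

// A Unicode.Scalar, by contrast, is exactly one integer.
let letterA: Unicode.Scalar = "A"
let letterAValue: UInt32 = letterA.value

// Comparing an ASCII scalar against a byte of binary data
// (the byte array is made-up sample data).
let bytes: [UInt8] = [0x41, 0x42, 0x43]
let firstByteIsA = letterA.isASCII && bytes[0] == UInt8(ascii: letterA)
```

Here `sameCharacter` is true while the two scalar counts are 1 and 2, which is why a Character has no single obvious integer value but a scalar does.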

As for text files mangling your representation, wouldn't scalars be better there too? The abstract character "é" may have two Character representations: U+00E9 as a single scalar, or U+0065 followed by U+0301 as a base letter plus combining accent. If the characters within single quotes must always be Unicode scalars, then only one interpretation of "é" is allowed in the object code, no matter which way it is stored in the source file. It does mean that the compiler must have ICU or an equivalent to find valid recompositions that resolve to a single scalar. If we allow "\u{}" Unicode escapes within single quotes, we can mandate that they always use the single-scalar version and never a decomposed form. (In other words, recomposition is allowed when it results from translating the source file's encoding to object code, and never when the user deliberately splits a single-scalar character into its official decomposed form.)
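As a rough runtime illustration of the recomposition step (not how the compiler would actually implement it), Foundation's `precomposedStringWithCanonicalMapping` applies NFC normalization. The helper `singleScalar(from:)` below is a hypothetical name of mine and not part of the proposal.

```swift
import Foundation

// Hypothetical helper: NFC-normalize the literal's contents and accept
// the result only if it collapses to exactly one Unicode scalar.
func singleScalar(from literal: String) -> Unicode.Scalar? {
    let scalars = literal.precomposedStringWithCanonicalMapping.unicodeScalars
    return scalars.count == 1 ? scalars.first : nil
}

// A decomposed "é", as an editor might have saved it in the source file.
let fromDecomposedFile = singleScalar(from: "e\u{301}")

// The precomposed "é" yields the same scalar.
let fromComposedFile = singleScalar(from: "\u{E9}")

// "q" + combining acute has no precomposed form, so it is rejected.
let noSingleScalarForm = singleScalar(from: "q\u{301}")
```

Both file encodings of "é" end up as U+00E9, while a cluster with no single-scalar form gives nil. Rejecting a deliberately decomposed "\u{}" escape would need a separate check before normalization, which this sketch leaves out.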