Unicode scalar literals

I think everybody should read this exchange again and again:

> Character-ness is (in general) a runtime decision. It can only be made at compile time for ASCII. So anything related to Unicode-aware (i.e. potentially non-ASCII) text processing depends heavily on runtime features, and IMO should not be done at compile time (or it might introduce mismatches which lead to catastrophic bugs in other parts of the system).

> Perhaps our existing `.asciiValue` property makes assumptions which are not appropriate for byte comparisons, but I don't think Unicode scalars are any better than adding a fixed/"raw" version of `.asciiValue`.
>
> Having `.asciiValue` collapse the CRLF extended grapheme cluster to a single `UInt8` (LF) was an intentional decision designed to support returning a single `UInt8` value, as CRLF is the only multi-scalar extended grapheme cluster that exists in ASCII. The discussion there also anticipated this question and considered returning a 2-element tuple, which I would consider a more acceptable solution than baking more stuff into the compiler.
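
The CRLF behaviour described in that last message is easy to check; a quick sketch (output matches the documented behaviour of `Character.asciiValue`):

```swift
// "\r\n" is a single extended grapheme cluster, so it is one Character,
// and Character.asciiValue deliberately maps it to LF (0x0A).
let crlf: Character = "\r\n"
print(crlf.asciiValue as Any)  // Optional(10), i.e. LF

// The scalars on their own keep their usual ASCII values.
let cr: Character = "\r"
print(cr.asciiValue as Any)    // Optional(13), i.e. CR
```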


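The first quoted point, that character-ness depends on runtime Unicode data rather than on anything the compiler can see, is also easy to demonstrate; a minimal sketch using canonical equivalence:

```swift
// Two different scalar sequences form equal Characters, because grapheme
// breaking and canonical equivalence are decided by the Unicode tables
// shipped with the runtime, not by the compiler.
let precomposed: Character = "\u{00E9}"  // é as a single scalar
let decomposed: Character = "e\u{0301}"  // 'e' + combining acute accent
print(precomposed == decomposed)                 // true
print(String(decomposed).unicodeScalars.count)   // 2
```
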
(off-topic: yay for quoting code blocks :confused:)

I think this is the correct way to do it. Other languages, like C++, are moving away from compiler magic and towards 'ordinary code' that is evaluated at compile time (`constexpr`, and more recently `consteval`).

The canonical byte form of your `String` is provided by whatever bytes are stored in your source file. The standard library's `.asciiValue` (or whatever "raw" version we add) performs some numeric checks which, IIUC, could be trivially implemented with `@compilerEvaluable`. I see no need for special syntax or additional compiler features.
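
To make that concrete, here is a minimal sketch of what such a "raw" variant could look like, assuming scalar-level semantics. The property name is made up for illustration, and `@compilerEvaluable` stays in a comment because it is only a pitched attribute, not a shipping feature:

```swift
extension Unicode.Scalar {
    /// The scalar's encoded value as a byte, if it is in the ASCII range.
    /// Unlike Character.asciiValue there is no grapheme-cluster handling,
    /// so CR and LF are never collapsed.
    // @compilerEvaluable  // hypothetical; pitched, not yet in the language
    var rawASCIIValue: UInt8? {
        value <= 0x7F ? UInt8(value) : nil
    }
}

let colon: Unicode.Scalar = ":"
print(colon.rawASCIIValue as Any)  // Optional(58)
```

Since the body is a single integer comparison on `value`, there is nothing here that a compile-time evaluator could not handle.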