Unicode scalar literals

You keep stating that this is the primary use case; it may be yours, but it is emphatically not the primary use case for this pitch. As the core team decided, the pitch should not touch the topic of ASCII APIs. @michelf is illustrating one major use case improved by making Unicode scalars more ergonomic to use. Again, at the core team's direction, it is explicitly a non-goal of this pitch to make any changes to ASCII facilities available in Swift.

It's not my primary use case; it's the primary, or only, use case presented in all the threads leading to this point. If you read my replies here, I would prefer to make Character more ergonomic, if we're choosing one. And if ASCII processing isn't important, then perhaps you should mention that to whoever wrote the pitch:

I'm really only stating that low-level text processing, the motivation mentioned constantly throughout the pitch, is primarily done at the ASCII level, and I don't see how an ergonomic way of expressing Unicode.Scalar leads to an ergonomic and efficient way of doing such processing.

I'm not sure what major use case you're referring to. JSON parsing would not be done at the Unicode.Scalar level.


ASCII byte processing is very important, but it is not the focus of this pitch. It can be done at the Unicode scalar level, however, and cannot be done at the level of extended grapheme clusters.

Unicode text cannot be processed “at the ASCII level.” When it comes to JSON, it is properly done at the Unicode scalar level.

If you mean processing as a sequence of UTF-8 bytes, you certainly can do that, but you lose access to any Unicode properties for inspecting and manipulating any contents unless you go back to the Unicode scalar level.
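A small sketch of that trade-off, using only standard library API:

```swift
// At the UTF-8 level, text is just a sequence of code units (bytes).
let text = "π"
assert(Array(text.utf8) == [0xCF, 0x80])  // two raw UTF-8 bytes

// Dropping back to the Unicode scalar level regains Unicode properties.
let scalar = text.unicodeScalars.first!   // U+03C0 GREEK SMALL LETTER PI
assert(scalar.value == 0x3C0)
assert(scalar.properties.isAlphabetic)    // property queries need a scalar, not a byte
```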

I misinterpreted your question to be about looking for ASCII delimiters using Character. Sorry about that.

But you can use the Unicode scalar view to parse text formats. It might not be as perfectly optimized as dealing directly with the UTF-8 code units, but it'll give you correct results and will be easier to write code for because we have a Unicode scalar literal.

Yes, the various delimiters, etc. are in the ASCII-compatible range so you would scan the UTF-8 bytes, nicely matching the new encoding in Swift 5. You are only interested in comparisons to these ASCII characters, so these would be the only ones that would conceivably be expressed as literals. And, as far as I'm aware, you wouldn't be inspecting any of the Unicode.Scalar properties when parsing JSON, because they're not relevant. And even if you were for some reason, you would presumably be inspecting them on an element from the Unicode.Scalar view, not on a literal.
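As a concrete sketch of that style of scanning (standard library API only; the JSON string is just an example):

```swift
// Scan the UTF-8 view for ASCII structural characters. Non-ASCII content
// can never collide with them: all bytes of a multi-byte UTF-8 sequence
// are >= 0x80, so a byte below 0x80 is always a genuine ASCII character.
let json = #"{"numbers": [1, 2, 3]}"#
var braceDepth = 0
for byte in json.utf8 {
    if byte == UInt8(ascii: "{") { braceDepth += 1 }
    if byte == UInt8(ascii: "}") { braceDepth -= 1 }
}
assert(braceDepth == 0)  // braces are balanced
```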


And now for something completely different

Stepping back into the mêlée after regrouping for a few days… I became involved in this pitch for two reasons. It seemed to me a cool idea that Swift could have a modern analogue of how single-quoted literals are used in, say, C and Java, i.e. an element of a String, and that, using the integer conversions, we could round off a few rough corners in the ergonomics of working with buffers of integers representing text.

If we believe that it is impossible to anticipate the run-time segmentation of an extended grapheme cluster at compile time, due to the fluid nature of Unicode, and that integer conversions have been roundly rejected by the community, then the first goal is not achievable, and I no longer support the idea of creating a single-quoted Character literal, let alone Unicode.Scalar literals.

Turning to the second of my goals, the ergonomics: let's focus on another element of the Core Team's decision -- that the proposal could be broken into two separate proposals:

I'd like to propose a separate, targeted "pure swift" solution to the ergonomics problems, adding code to the standard library rather than the far more involved adding of a new literal type.

Focusing on the original motivations, for me one problem was the fixed type of the utility initialiser UInt8(ascii: "a"). What if you wanted the more common CChar or Int8 type for your ASCII value? UInt8(ascii: "a") isn't so bad, but the more typical CChar(UInt8(ascii: "a")) or UInt16(UInt8(ascii: "a")) is getting inconvenient. This can be solved by writing an extension on FixedWidthInteger and including it in the standard library:

extension FixedWidthInteger {
  /// Construct a fixed-width integer from the code point value of `v`.
  ///
  /// - Precondition: `v` is in the ASCII range (`v.value < 128`).
  @inlinable
  @available(swift 5.1)
  public init(ascii v: Unicode.Scalar) {
    _precondition(v.value < 128,
                  "Code point value does not fit into ASCII")
    self = Self(v.value)
  }
}

This would allow Int8(ascii: "a") or UInt16(ascii: "a") or any of the integer types directly. QED.

Looking at the second use case, initialising arrays, which is particularly inconvenient at the moment, I'd suggest an analogous initialiser for arrays extracting ASCII values from the characters in a String. This would allow:

let hexcodes = [Int8](ascii: "0123456789abcdef")
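A sketch of what such an initialiser could look like (hypothetical; not part of the standard library, and the error message mirrors the extension above):

```swift
// Hypothetical Array initialiser extracting ASCII values from a String's
// Unicode scalars, trapping on any non-ASCII content.
extension Array where Element: FixedWidthInteger {
    init(ascii string: String) {
        self = string.unicodeScalars.map { scalar -> Element in
            precondition(scalar.isASCII, "Code point value does not fit into ASCII")
            return Element(scalar.value)
        }
    }
}

let hexcodes = [Int8](ascii: "0123456789abcdef")
assert(hexcodes.count == 16)
assert(hexcodes[0] == 48)   // ASCII "0"
assert(hexcodes[10] == 97)  // ASCII "a"
```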

The third use case was typically scanning through a buffer of integer values for particular ASCII values, currently:

if cString.advanced(by: 2).pointee == UInt8(ascii: "i") {

The ergonomics of this can be improved by adding an operator to the standard lib for comparison of an integer to a Unicode.Scalar:

public func == <T: FixedWidthInteger>(lhs: T?, rhs: Unicode.Scalar) -> Bool {
    _precondition(rhs.isASCII, "Only ASCII Unicode.Scalar accepted in this context")
    return lhs == T(rhs.value)
}

The following would then work:

if cString.advanced(by: 2).pointee == "i" {

I don't feel the implicit assumption that the value used in the comparison is the ASCII value of the character is unreasonable. Number literals are taken to be in base 10 by convention without having to suffix them with .decimalValue or suchlike. We have to draw the line somewhere.

This, along with a few other operators to get switch statements working, results in a PR that runs to all of 101 lines (including tests) and solves the problem, rather than the delicate, staged changes that would be involved in introducing a single-quoted literal into the compiler and deprecating the old syntax. If someone would like to help turn this into a proposal for review, please get in touch.
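For illustration, the switch support mentioned here might be sketched with a pattern-matching operator along these lines (hypothetical; the actual PR may differ):

```swift
// Hypothetical ~= overload so integer code units can be matched against
// (ASCII-only) Unicode.Scalar literals in switch statements.
func ~= <T: FixedWidthInteger>(pattern: Unicode.Scalar, value: T) -> Bool {
    precondition(pattern.isASCII, "Only ASCII Unicode.Scalar accepted in this context")
    return value == T(pattern.value)
}

let byte: UInt8 = 0x69  // the code unit for "i"
var matchedI = false
switch byte {
case "i": matchedI = true
default:  break
}
assert(matchedI)
```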


The core team’s guidance is that the portion of the previous proposal about APIs for processing ASCII should be pitched second, after this topic about single-quoted literals:

Once single-quoted literals have been added to the language, this part of the proposal (or an alternative, such as the addition of a trapping or nil-returning ascii property) can be re-pitched separately.

It’s an important discussion, but since we’ve been asked to defer that discussion until later, let’s focus on the first topic first.

Yes, if you have no interest in the content of the JSON you’re parsing, then you can parse by only inspecting the ASCII bytes. But this is again begging the question: if by definition you are interested only in ASCII parsing, then this pitch does not specifically address your use case.

I’ve given up on single quoted literals altogether and am not at all keen on Unicode.Scalar literals — let's save single quotes for something else — 100 lines in the standard lib can achieve a lot of what I was after. It seems very strange to me to design these two things independently.


Yet that’s the core team’s guidance, so let’s see how this one plays out.

C’est la vie. If you’re looking for a mostly complete implementation it’s here

I do feel that this implicit assumption is unreasonable. And this is where the lack of a dedicated Unicode scalar literal syntax gets you into ambiguity:

Unicode scalars have corresponding Unicode scalar values, so it's perfectly reasonable (in my view) to compare any fixed-width integer to any Unicode scalar: there is no question of encoding as you're already working at a specific level: the level of Unicode scalars.

public func == <T: FixedWidthInteger>(lhs: T, rhs: Unicode.Scalar) -> Bool {
    return lhs == rhs.value
}

public func == <T: FixedWidthInteger>(lhs: Unicode.Scalar, rhs: T) -> Bool {
    return lhs.value == rhs
}
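A self-contained sketch of how such comparisons read in use (restating one of the operators above):

```swift
// Heterogeneous comparison between an integer and a Unicode scalar:
// both sides are code point values, so no question of encoding arises.
func == <T: FixedWidthInteger>(lhs: T, rhs: Unicode.Scalar) -> Bool {
    return lhs == rhs.value
}

let source = "x = 1"
var equalsSigns = 0
for scalar in source.unicodeScalars where 0x3D == scalar {  // "=" is U+003D
    equalsSigns += 1
}
assert(equalsSigns == 1)
```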

So far we agree.

But this reads quite unacceptably in today's Swift when literal values are involved, because double-quotation marks denote string literals first, then possibly extended grapheme clusters or Unicode scalars. Strings can be encoded in any of several ways, as can extended grapheme clusters; extended grapheme clusters don't just have a value--I guess that's why they're called "clusters." With no visible indication you then have to ask, in what sense are we equating this literal with a number? That does not seem reasonable to leave unanswered at the point of use.


I agree that line would read better if the Unicode.Scalar were expressed by a single-quoted literal.

if cString.advanced(by: 2).pointee == 'i' {

Part of my point is that, using operators, we don’t need to box ourselves into Unicode.Scalar literals instead of Character literals, as we do with a trapping .ascii property, which requires a particular default type. There are many reasons to have Unicode.Scalar literals, including better error reporting, but it’s a very awkward fit into the Swift String model.

We don't need to, but it's not just about the single-quoted literal version reading better; because of the lack of any indication as to encoding, I at least would be opposed to any operator being added which causes the double-quoted literal version to be accepted. (This pitch would address that by causing a deprecation warning with that usage.)

OK, how about the best of both worlds? Single-quoted literals that default to Unicode.Scalar if they contain a single Unicode scalar, and to Character otherwise.

(swift) 'a'.ascii
// r0 : UInt8 = 97
(swift) '🇨🇦'.ascii
<REPL Input>:1:12: error: value of type 'Character' has no member 'ascii'
'🇨🇦'.ascii
(swift) '🙂'.ascii
Fatal error: Code point value does not fit into ASCII

I would think that's the worst of both worlds.

We would lose static checking that the literal is a single Unicode scalar, which is a major feature of this pitch. We would make extended grapheme clusters expressible by single-quoted literals, again raising the question of unspecified encodings. And we would lose any ability for the user to know by inspection that we are working at the level of non-normalized code points.

Meanwhile, most characters, when surrounded by single quotes, would default to creating a Unicode scalar, and users wouldn't know why; and normalizing the source code would change the inferred type of some literals.


You’re a tough audience.

:slight_smile: It's definitely worth talking through all options, though.

My honest opinion is that any modern file format should be designed to either not pretend to be text in any way, or else actually behave as text. If a file format is under the guise of text, it should be designed to safely undergo normal text handling operations. Swapping line endings preserves text intent; the file format’s intent should also be preserved. Converting a text file from one encoding to another preserves text intent; the file format’s intent should also be preserved. Performing normalization on a text file also preserves text intent; the file format’s intent should also be preserved.

I understand that file formats predating Unicode often don’t meet that expectation. I also understand that some more recent file formats fall short of that expectation, usually due to oversight. I even understand that .swift is a file format that doesn’t succeed in this respect. Finally, I understand that for backwards‐compatibility reasons, many such issues can never be fixed. I still firmly believe that it should be an unwavering law of any future design that, “If it says it is text, it really is text.”

That means for Swift, in my very strong opinion, the only correct way to write code at the level of non‐normalized code points is to reference them using their code point identifiers: \u{E9}. Such references are intent‐preserving when processed as text. Beyond that, their non‐normalized intent is plain to any reader.
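A small illustration of why the \u{…} spelling preserves that intent (standard library behaviour only):

```swift
// The code point reference pins down exactly which scalars are meant,
// even though the two spellings are canonically equivalent as Strings.
let precomposed = "\u{E9}"         // é as a single scalar, U+00E9
let decomposed  = "\u{65}\u{301}"  // e followed by U+0301 COMBINING ACUTE ACCENT
assert(precomposed == decomposed)              // String comparison is canonical equivalence
assert(precomposed.unicodeScalars.count == 1)  // but the scalar contents differ
assert(decomposed.unicodeScalars.count == 2)
```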

Holding that view, it follows logically that I find the very existence of ExpressibleByUnicodeScalarLiteral unfortunate. Unicode.Scalar should have been ExpressibleByIntegerLiteral instead. We are stuck with ExpressibleByUnicodeScalarLiteral because of backwards‐compatibility, but it would be best to sweep it under the rug and let it remain a quirky corner case that requires extra effort to pull out from under ExpressibleByExtendedGraphemeClusterLiteral. I am categorically opposed to elevating Unicode scalar literals.

I make no judgement on how useful either of the following may or may not be, but they are the only reasonable ways forward that I can see:

  1. If the goal is to find some sort of general text element deserving of a separate literal from strings, it is Character. Character literals preserve text intent, and they are always safe to use. I suspect this is closest to what the core team wants, since it resembles what they originally envisioned way back at the beginning of all of this.

    • Whether or not compile time validation is possible is irrelevant; runtime breakage from a different version of ICU only occurs when operating systems are updated (which can cause other, much more widespread runtime issues :wink:). ICU‐related breakage is limited to recent aspects of Unicode, and it evaporates as devices catch up. Only developers trying to use cutting‐edge Unicode features are likely to notice, and they are probably aware of both Unicode and the issues to expect in the wake of its updates.
    • On the other hand, Unicode.Scalar is not a candidate for a separate literal from strings unless it is number‐based. Yes, scalars are the correct level to be working at for a lot of text processing, but spelling them with text literals is the wrong way to do it. The breakage such spelling encounters occurs across all of Unicode, including its oldest parts. This kind of breakage will never go away, and as Unicode adoption and understanding spreads into ever more tools, the resulting problems are likely to only become more frequent over time.
  2. If the desire is to make scalar‐based file format parsing easier, then a new ExpressibleByASCIILiteral should be considered, either for a separate ASCII.Scalar type or where Unicode.Scalar would simply conform to it.

    • ASCII code points are the most significant when it comes to parsing existing formats (XML, JSON, YML, CSV, C). When parsing Swift, Java, or some format that can use non‐ASCII scalars for semantic purposes, individual code points are irrelevant; only large categorized sets of code points matter, so even then there is no real need to use individual literals. That means ASCII already covers the vast majority of real use cases.
    • ASCII code points are inert across almost all text processing, so they are safe to use. The only instance where they are not inert is if line endings are switched, and that is completely irrelevant because directly spelled line endings are not valid in text literals anyway§.

provided the new encoding is a superset of the file’s contents.
except conversions to EBCDIC or other small encodings, which are lossy conversions unlikely to be performed on Swift source anyway.
§except in multiline string literals, which are not candidates for any form of character literal anyway.


This topic has demonstrated a tendency to go in circles, so I do not intend to continue discussion. I have taken extra effort this time to express my final thoughts clearly and thoroughly. If you genuinely want clarification about something I said, you are welcome to ask, but if your post seems more like an argument, I am unlikely to reply. You are free to disagree with my opinions and advice, but please respect my wish to leave it at that and move on.


Very interesting write-up. It advances an argument very clearly, but I think you start with a mistaken premise:

"Plain text is a pure sequence of character codes; plain Unicode-encoded text is therefore a sequence of Unicode character codes" (their definition, not ours). Therefore, if you substitute a UTF-8 encoded text with a UTF-16 encoded text, that can be the same Unicode text. But if you substitute a Unicode text with one in which the line endings have been swapped, or in which the character codes have been normalized, that is a different Unicode text. Two strings may be equivalent for the purposes of Swift's == operator, but that only guarantees their substitutability for the purposes of modeling a sequence of extended grapheme clusters.

In other words, Unicode does not match the expectation you set out as a premise. What it means to "behave as [Unicode] text" is multilayered and subject to more constraints than you have outlined. In Swift, Unicode text is modeled by String and exposes views at many levels (UTF-8, UTF-16, Unicode scalars, extended grapheme clusters) to support standards-based text manipulation. This is similar in some ways to how an Int models both an integer and a sequence of bits (hence the bitwise operators it exposes).

If we are to have first-class support for Unicode's facilities, we need to make it more ergonomic to work with text as a sequence of Unicode scalars, and (therefore) to work with Unicode scalars generally, just as we are trying to make it more ergonomic to work with String in other ways.

I agree with some of your other premises: for instance, that it is important to make "non-normalized intent" of Unicode scalar literals plain to the reader. This is why I am proposing a dedicated notation for such literals. These literals do not need to be inert to normalization just as no literals need to be inert to normalization, but they do achieve rather than frustrate the goal you articulate of better delineating when normalization is or is not at play.
