SE-0243: Codepoint and Character Literals

SDGGiesbrecht · March 13, 2019, 3:51am

All sorts of things handle Swift source, not just the compiler, and they are fully compliant with Unicode if they normalize the source in some way or another. Where Swift syntax plays by the rules of Unicode equivalence, all is well. Such is the case with "x̱̄" as Character now. But wherever Swift syntax relies on a particular scalar representation—effectively ignoring unicode equivalence—then the source code is volatile, because any tool which (rightly) assumes unicode equivalence may be unknowingly changing the semantics of the Swift source code, sometimes even breaking the syntax rules so it no longer compiles.

Are there Swift‐related tools already doing this sort of thing? Yes.

This forum performs NFC on copy‐and‐paste, and as demonstrated in the pitch thread here, can render valid code uncompilable.
While Git does not perform Unicode normalization on file contents yet, it does already normalize filenames to NFD on macOS and NFC elsewhere. (Since it already normalizes line endings and whitespace, it is not much of a stretch to imagine it in the future.)

Those are just the two I have already run into. As Unicode awareness continues to spread, the number is likely to grow.

Currently, only constructions which deliberately call out Unicode.Scalar are vulnerable to this, such as "é" as Unicode.Scalar. I do not even like that this problem is already reachable, but at least right now you have to ask for it with as Unicode.Scalar, and as stated in an earlier post, it could not be protected against anyway without also blocking "\u{E8}" as Unicode.Scalar.

As for future syntax, I am strongly against anything which increases the number of places in the language where these vulnerabilities can be encountered—especially where it can be encountered without some explicit reminder that “This is lower‐level Unicode; think about what you are doing.”

Once the originally proposed deprecations had come into effect and 'é' as Unicode.Scalar completely replaced "é" as Unicode.Scalar, the net effect would only have been that the danger zone had moved, not that it had increased. And I could live with that.

But if 'é' as Character—where it must actually be a scalar—is proposed to be indefinitely possible (deprecation is fine), then it would be a new place to run into errors. That new zone feels much worse because (1) there is no nearby reminder of Unicode, let alone Scalar; (2) someone can know enough to write that code without ever having heard of Unicode whatsoever; (3) there is a very simple, safer alternative available: "é" as Character.

I do not really care how the vulnerability is prevented, only that it is. Many new alternative designs have already been mentioned in this review, and it is not evident to me which direction this will go if it is returned for revision. Some of the proposed alternatives do not even have to deal with this issue in the first place, some have simple fixes, others could be solved with some work, and still others are basically irreparable in this respect. I have not thought through each to know which will have to tiptoe around the ABI to get it right and which will have easy solutions. That is why my last few posts were littered with the word “if”. I am only trying to say that whichever way development continues, I think this needs to be a design consideration.