Pitch: Unicode Equivalence for Swift Source

SDGGiesbrecht · March 13, 2019, 9:17pm

Unicode is Swifty...

Swift claims to support Unicode source. Many languages aim to be encoding agnostic by only using ASCII in the grammar and not attempting to interpret anything beyond that in comments and string literals. But Swift has gone much further and defined its grammar in terms of Unicode. The formal definitions of identifiers and operators are each a screen full of Unicode scalar references and demonstrate clear and deliberate support for advanced Unicode concepts like combining characters. Swift’s documentation even goes so far as to encourage use of Unicode. Unicode characters appear in identifiers in The Swift Programming Language already on page 3, “The Basics” under “Naming Constants and Variables” and recur in examples throughout the book.

...but Swift has dropped the ball.

Swift proudly touts support for Unicode equivalence across the documentation for its String type, but the compiler does not actually consider it.

As @michelf noted recently:

String Comparison for Identifiers

[T]he Swift compiler does not normalize the source, and in particular it will not do normalization before comparing identifiers. Otherwise these two é identifiers would clash:
let é = "é" // é written as one unicode scalar
let é = "é" // é written as "e" with a combining accent

In addition to allowing clashes, it fails to resolve references the developer expects to match:

let café = "Queen’s Lane Coffee House"
print(café) // Error: “Use of unresolved identifier 'café'”

In addition to being surprising, this means that wherever Unicode actually gets used, two developers with different input methods may have difficulty using each other’s code.
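The mismatch can be reproduced at runtime. This is only a sketch: the variable names are mine, and the `\u{…}` escapes are there so the example keeps its exact scalars even if this text gets normalized on the way to you.

```swift
// "café" spelled two ways, with the scalars pinned by escapes.
let composed = "caf\u{E9}"       // é as the single scalar U+00E9
let decomposed = "cafe\u{301}"   // e followed by U+0301 COMBINING ACUTE ACCENT

// String equality honours canonical equivalence...
let equalAsStrings = (composed == decomposed)

// ...but the scalar sequences differ, and the raw sequence is what the
// compiler currently compares when it matches identifiers.
let equalAsScalars = composed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)

print(equalAsStrings, equalAsScalars)   // prints "true false"
```

So the same two spellings that `String` treats as equal are treated as different names when they appear as identifiers.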

Due to quirks of tooling, a single developer may write functions he later has trouble calling himself: A developer might write most of his code locally with a decomposition‐based input method—such as most Korean keyboards—and he writes it in Xcode, which is normalization agnostic. But during review on GitHub, he detects a typo in a function name and quickly fixes it there. Since GitHub’s interface checks everything in in NFC, he now—unknowingly—has a composed form function name. After a while he pulls the change to his local machine. When he types the function name with his keyboard in order to call it somewhere, the compiler tells him it does not exist. Confused, he tries it in several places and eventually concludes that the declaration must be buggy, so he erases and retypes it only to be puzzled by the fact that it now magically works. (Alternatively he gives up, reverts the commit and starts over, refactoring the code to avoid what he perceives as a compiler glitch.)
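A rough illustration of what happens to the function name in that scenario, assuming the input method emits conjoining jamo and the web interface applies NFC. The syllable chosen and the variable names are only for illustration; the normalization call is Foundation's `precomposedStringWithCanonicalMapping`.

```swift
import Foundation

// 한 typed as three conjoining jamo (assumed output of a decomposing input method).
let typed = "\u{1112}\u{1161}\u{11AB}"

// What an NFC-normalizing tool would store instead.
let committed = typed.precomposedStringWithCanonicalMapping

let typedScalars = typed.unicodeScalars.map { $0.value }
let committedScalars = committed.unicodeScalars.map { $0.value }

// Same text to a reader and to String ==, different scalars to the compiler.
print(typed == committed)
print(typedScalars.count, committedScalars.count)
```

The two spellings render identically, which is why the developer in the story has no visible clue about what went wrong.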

Known tools that are not really compatible with Swift’s current behaviour include:

GitHub UI: Commits made through the web interface are in NFC; but commits made by the command line are unaltered.
Swift Forums: Copied and pasted text becomes NFC, but text entered directly is unaltered.

Possible Solutions

Essentially the Swift compiler needs to respect Unicode equivalence when it compares tokens. Several concrete routes to accomplishing this exist:

A) Have the compiler test tokens for equality the same way as the Swift String type does.
- Advantages:
  - By keeping the on‐disk representation around the whole time, error messages and diagnostics will have the same representation as the file. This means the token will look the same on platforms/fonts that handle equivalence poorly in display.
  - Specific Unicode scalar sequences can be selected with direct string‐like literals at the developer’s own risk. (Compare option C.)
- Disadvantages:
  - Each comparison is slower during the compilation process.
  - A few corner cases around string‐like literals are still surprising. (Compare option C.)
B) Have the compiler normalize each token after tokenization, with the exception of string‐like literals.
- Advantages:
  - Fast repeated comparison after the cost of a single conversion.
  - Specific Unicode scalar sequences can be selected with direct string‐like literals at the developer’s own risk. (Compare option C.)
- Disadvantages:
  - Diagnostics may look odd on some platforms. (Compare option A.)
  - A few corner cases around string‐like literals are still surprising. (Compare option C.)
C) Have the compiler parse a normalized copy of the entire file.
- Advantages:
  - Fast repeated comparison after the cost of a single conversion.
  - No more corner cases or delayed surprises around string literals. (See here and here.) If external normalization happens to the source code it will make no difference to the resulting program. Every string‐like literal will either always generate the same compiled scalar sequence or always fail to compile.
  - Cleanly prevents any similar problems from ever appearing in the future, no matter how syntax evolves. There would be no need to think about which new syntax elements need to behave like syntax tokens and which need to behave like user strings. All behaviour would be consistent and not require evolution consideration.
- Disadvantages:
  - Diagnostics may look odd on some platforms. (Compare option A.)
  - Non‐normal sequences will not be representable as direct literals and each scalar will need to be expressly requested by its hex value in order to evade the normalization: let ångström = "\u{212B}". While this is already the only way to make source resilient, the requirement might come as an initial surprise to some developers. However, as @Michael_Ilseman has pointed out, Swift has never actually promised to preserve scalar sequences in a String even during runtime, so even let ångström = "\u{212B}" technically has no guarantees.
D) Have the compiler normalize the actual on‐disk file.
- Advantages:
  - Shares advantages of C.
  - Encourages writers of tooling to think about Unicode equivalence, because if they produce non‐normal source they would trigger diffs that jump back and forth.
- Disadvantages:
  - Shares disadvantages of C.
  - May seem too intrusive.
  - Opposite normalization to Swift‐agnostic text tools could cause certain tool combinations to become annoying.
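To make the difference between A and B concrete, here is a toy model of an identifier table in each style. It is a sketch under assumptions, not how the compiler is implemented: the table shapes, the `normalizedKey` helper and the choice of NFC are all mine.

```swift
import Foundation

let composed = "caf\u{E9}"       // é as U+00E9
let decomposed = "cafe\u{301}"   // e followed by U+0301

// Option A flavour: keep the on-disk spelling as the key and rely on
// String's == and hashing, which treat canonically equivalent text as equal.
var tableA: [String: Int] = [:]
tableA[decomposed] = 1
let lookupA = tableA[composed]
// The stored key still has the spelling from the file (5 scalars here),
// so a diagnostic could echo it unchanged.
let storedScalarCount = tableA.keys.first!.unicodeScalars.count

// Option B flavour: normalize each token once (NFC is an arbitrary choice
// for this sketch), then compare plain scalar arrays with no Unicode logic.
func normalizedKey(_ token: String) -> [Unicode.Scalar] {
    return Array(token.precomposedStringWithCanonicalMapping.unicodeScalars)
}
var tableB: [[Unicode.Scalar]: Int] = [:]
tableB[normalizedKey(decomposed)] = 1
let lookupB = tableB[normalizedKey(composed)]

print(lookupA == lookupB)
```

Both lookups succeed. The trade-off is where the cost and the original spelling end up: A pays on every comparison but remembers what was on disk, B pays once per token and forgets it.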

(I personally favour option C, because it is future proof, consistent, and predictable. A and B are also good options with only minimal surprises.)

Source Compatibility

Properly handling Unicode equivalence will break source that contains equivalent identifiers that the developer intended to be distinct, such as @michelf’s example near the top. An important question is whether that sort of thing is a bug that should ever have compiled in the first place—which is how I would qualify it.

Option C would also change the semantics of combinations like let x = "écrire".unicodeScalars.first!. Those oft assumed semantics were never actually promised, so it falls in a similar category to SE‐0241 about encodedOffset. It only breaks misuse, but there is a lot of misuse. So at the very least, migration aids should be considered.
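A small sketch of the kind of code whose result would shift. The scalars are pinned with escapes so both spellings can be shown side by side, and NFC is assumed as the normalization form purely for illustration.

```swift
import Foundation

let precomposedWord = "\u{E9}crire"    // é as U+00E9
let decomposedWord = "e\u{301}crire"   // e followed by U+0301

// Today the first scalar depends on how the literal happened to be saved.
let firstPrecomposed = precomposedWord.unicodeScalars.first!.value
let firstDecomposed = decomposedWord.unicodeScalars.first!.value

// If the file were normalized before parsing (NFC assumed here), a direct
// literal in either spelling would yield the same first scalar.
let firstAfterNFC = decomposedWord
    .precomposedStringWithCanonicalMapping
    .unicodeScalars.first!.value

print(firstPrecomposed, firstDecomposed, firstAfterNFC)
```

Code that inspects `unicodeScalars` of a direct literal is exactly the "misuse" a migration aid would need to flag.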

ABI Stability

...uh oh... Can someone more knowledgeable tell me whether name mangling and such already handles equivalence or if this whole pitch goes right here and Swift in Unicode is doomed for ever?

CC: @Michael_Ilseman

Immediate Mitigation on a Per‐Project Basis

Projects which extensively use Unicode can defend against this to some degree by scripting normalization of all project files before build and at check in.^† Then at least your project is internally consistent and the risk is much lower. But you will still run into issues if you try to import two packages which use opposite normalization forms. And SourceKit will still throw compiler errors at you while you type if your input method assumes the opposite normalization form to the rest of your package—though liberal use of autocomplete can reduce that somewhat.

† For anyone interested, a tool I wrote, Workspace, can concisely normalize packages: $ workspace normalize, though that feature is largely unadvertised because I never recognized its importance.
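For those who would rather script it by hand, a minimal sketch of such a normalization step might look like the following. The function names are hypothetical, NFC is an assumed choice, and this is not how Workspace is implemented. Note that it normalizes the whole file, string literals included.

```swift
import Foundation

/// Hypothetical helper: returns the NFC form of `source`, or nil when the
/// text is already in NFC. The check compares scalars, because String's own
/// == would report equivalent spellings as equal.
func normalizedIfNeeded(_ source: String) -> String? {
    let nfc = source.precomposedStringWithCanonicalMapping
    if nfc.unicodeScalars.elementsEqual(source.unicodeScalars) {
        return nil
    }
    return nfc
}

/// Hypothetical helper: rewrites the file in place only when normalization
/// actually changes it, so untouched files produce no diff.
func normalizeFile(at url: URL) throws -> Bool {
    let original = try String(contentsOf: url, encoding: .utf8)
    guard let nfc = normalizedIfNeeded(original) else {
        return false
    }
    try nfc.write(to: url, atomically: true, encoding: .utf8)
    return true
}
```

Calling `normalizeFile(at:)` on every source file before build and at check in would give the internal consistency described above. It does nothing for dependencies that use the opposite form.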

John_McCall · March 13, 2019, 9:22pm

Technically normalizing unicode names could break ABI, but I think we should clearly do it anyway. I see the lack of normalization as a longstanding bug in the compiler.

SDGGiesbrecht · March 13, 2019, 9:31pm

...hmm...

It will not affect the ABI of anything ASCII, since that is static under all normalization. At least the API surface area of the standard, core and system libraries restricts itself to ASCII at the moment, right? Are there internals that don’t? If not, then this would be theoretically ABI‐breaking, but break nothing in practice, since nothing that is declared ABI‐stable uses the affected functionality?

Only thinking out loud. None of those are facts I am absolutely sure of.

jrose · March 13, 2019, 9:43pm

Normalizing Unicode names could lose user data in NSCoding archives. :-(

I do think we should normalize for typo-correction purposes, but I don't think it's worth slowing down the compiler for something that most users will not encounter anyway.

EDIT: a programming language is by nature a parseable format, and while it's a parseable format for humans I feel like it's valid to be stricter about it than text. We don't want to start accepting U+037E GREEK QUESTION MARK as a statement delimiter even though its canonical form is a semicolon.
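The Greek question mark case is easy to check at runtime. A short sketch, with variable names of my own choosing:

```swift
import Foundation

let greekQuestionMark = "\u{37E}"   // U+037E GREEK QUESTION MARK

// Canonical normalization replaces it with U+003B SEMICOLON.
let nfcScalars = greekQuestionMark
    .precomposedStringWithCanonicalMapping
    .unicodeScalars.map { $0.value }

// String equality follows canonical equivalence, so the two compare equal.
let equalAsStrings = (greekQuestionMark == ";")

print(nfcScalars, equalAsStrings)
```

That is the sort of equivalence that is reasonable for text but would be surprising if the lexer applied it to punctuation.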

John_McCall · March 13, 2019, 9:47pm

Right, we can fix this because none of the core libraries are using identifiers with multiple representations. Note that there are several different normalizations we could use; ideally we would use the most compact, but it might be more prudent to use a normalization that matches what Xcode (sorry for the platform bias, but it probably needs to be Xcode) currently outputs.

John_McCall · March 13, 2019, 9:50pm

ASCII sequences are always canonical, so I wouldn't expect this to significantly slow down the compiler; we can very cheaply remember during identifier-lexing whether we saw a non-ASCII character, and we can put redundant entries in the identifier table for non-canonical strings.
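The ASCII fast path can be sketched roughly like this. The function name is hypothetical and NFC is an assumed choice of form; a real lexer would track the non-ASCII flag while scanning rather than rescanning the token.

```swift
import Foundation

/// Hypothetical helper: returns the canonical spelling of an identifier,
/// skipping normalization entirely for pure-ASCII names.
func canonicalName(_ spelling: String) -> String {
    // ASCII text is unchanged by every normalization form, so a cheap
    // byte check is enough to take the fast path.
    if spelling.utf8.allSatisfy({ $0 < 0x80 }) {
        return spelling
    }
    // Only identifiers containing non-ASCII scalars pay for normalization.
    return spelling.precomposedStringWithCanonicalMapping
}

let ascii = canonicalName("count")
let accented = canonicalName("cafe\u{301}")
print(ascii.unicodeScalars.count, accented.unicodeScalars.count)
```

Since the overwhelming majority of identifiers in existing code are ASCII, almost every token would take the first branch.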

I agree that we don't necessarily have to normalize non-identifiers during lexing, although I'm not sure this would really be particularly problematic.

xwu · March 13, 2019, 10:14pm

This has been discussed extensively in a few prior threads. I would encourage you to read through them to get a sense of where things stand.

In short, there is a thorough set of rules already laid out in UAX#31 on how to normalize identifiers in programming languages. Several of us have written several versions of a proposal to adopt it, but each time it has failed because of issues with emoji. Recent versions of Unicode now have more robust classifications for emoji, so the proposal can be resurrected with better luck now, probably. No need to start from scratch; feel free to build on the work that we’ve already done.

All of this applies only to identifiers. Literals should never be messed with by the compiler. They are, after all, supposed to be literals.

SDGGiesbrecht · March 13, 2019, 10:48pm

Thanks. I had searched, but I guess I picked the wrong search terms. Knowing there must be some threads and being more persistent allowed me to find several. I’ll post links here in chronological order for others who land here. (I actually haven’t read the threads yet. I will now.)

SDGGiesbrecht · March 13, 2019, 11:37pm

While the first one started with the same premise, the other threads seem to be much wider in scope and become largely about other issues like which characters should be operators and which should be identifiers. None of that matters to me. I only care about the compiler correctly recognizing matching tokens. But I defer to those who have already done much heavier lifting. Thanks for the hard work, @xwu.

michelf · March 14, 2019, 12:57am

If we don't normalize and then we add reflection APIs, we'll have to deal with the possibility of two distinct properties having names that are equal to each other when it comes to String equality. But then maybe backward compatibility requires it.

Here's an idea to sort-of deprecate non-normalized identifiers without forbidding them outright: make the compiler normalize all identifiers except for those in `backticks`. This should make sure any desire for a weird normalization is written in a way that'll make people suspicious that something is going on. I believe this should be relatively easy to implement. The migrator could also automatically detect and migrate those symbols to whatever the developer chooses.
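As a rough sketch of that rule, assuming NFC as the normalization form and a hypothetical `resolvedSpelling(ofToken:)` helper that receives the raw token text:

```swift
import Foundation

/// Hypothetical helper: the name the compiler would use for an identifier
/// token under the "normalize unless in backticks" rule.
func resolvedSpelling(ofToken token: String) -> String {
    // Backticked identifiers keep their scalars verbatim.
    if token.count >= 2, token.hasPrefix("`"), token.hasSuffix("`") {
        return String(token.dropFirst().dropLast())
    }
    // Everything else is normalized (NFC assumed for this sketch).
    return token.precomposedStringWithCanonicalMapping
}

let plain = resolvedSpelling(ofToken: "cafe\u{301}")
let escaped = resolvedSpelling(ofToken: "`cafe\u{301}`")
print(plain.unicodeScalars.count, escaped.unicodeScalars.count)
```

The plain token collapses to the composed spelling, while the backticked one keeps its five scalars, so any deliberately unusual spelling has to be visibly marked in the source.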

John_McCall · March 14, 2019, 12:59am

We would welcome a fix for the normalization-of-names issue. The rules about what exactly is an operator are a completely separate issue and should not block a normalization fix, which as you say is ABI-affecting and therefore should be fixed ASAP before there's a bunch of code relying on libraries with stable ABIs using inconsistent normalizations.

As a general matter, ABI issues can be fixed, you just have to be aware of the practical impact of the fix.

xwu · March 14, 2019, 1:49am

Since APIs that differ only in normalization form would arguably be a bug, would you consider that narrow change alone to require a proposal?

John_McCall · March 14, 2019, 2:24am

I wouldn't, but I can raise that question to the rest of the Core Team.

jawbroken · March 14, 2019, 3:22am

A targeted fix here just for identifiers would be great. I have no interest in normalising anything else, and would consider it a bug if string literals were normalised.

Chris_Lattner3 · March 17, 2019, 4:10am

+1, completely agreed.

-Chris

Panajev · March 17, 2019, 7:43am

Important enough to delay Swift 5.0 and iOS 12.2? It seems to me that the release dates of these two should be open to a delay to fit in this ABI-breaking change you really, really do not want people to rely on.
Would be quite puzzled if that cannot happen (there are business driven decisions behind the next iOS point release, but getting the ABI right at the start seems to be a worthy goal).

Panajev · March 17, 2019, 7:50am

On the other hand, I fear this is ultimately a mistake: although an essentially ASCII-only grammar seems not inclusive, it does provide a simple, easy-to-parse and easy-to-share solution. A lingua franca does have some advantages, as Anglocentric as that may sound, but that is quite off topic on a boat already sailed :).

John_McCall · March 17, 2019, 5:41pm

No, we cannot delay Swift releases over a relatively minor bug that we have no reason to think that we cannot fix in a future release.

dhoepfl · March 21, 2019, 11:37am

I, too, think that this is a bug and that the `backticks`-form should preserve encoding. I also see this bug as non-critical, so the following is more for later reference:

I think, due to character duplication (“Å” (U+00C5) vs. “Å” (U+212B)), the compiler has to use NFD internally. These characters differ in NFC, but are the same in NFD. Code points that do not exist in NFD should be disallowed outside `identifiers`.

It might be necessary to allow `\u{…}` inside backticked identifiers to allow NFD users to use all identifiers. Otherwise anyone using an NFD editor would have no way of using a module that uses any of those characters if the module was written using NFC.

SDGGiesbrecht · March 21, 2019, 7:38pm

They are the same in either.

NFC always does NFD first^†—decomposing and reordering (NFD) before it recomposes in a possibly different way.

In this case NFD transforms both into U+0041 + U+030A. NFC then recombines that into U+00C5.

Maybe you are confusing it with this fact: The first time NFD or NFC is performed it will “lose information” by removing the “distinction” between the ångström unit and the Scandinavian letter Å. The ångström will never reappear in NF‐anything. But after initial normalization there is only one NFC representation for any NFD and vice versa, so no information is lost no matter how many times data gets converted back and forth.

^†Implementations may skip or refactor this step for performance reasons when it is known that doing so will not alter the result.
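This can be checked with Foundation's normalization properties. A quick sketch, where the `scalars` helper is only there for readable output:

```swift
import Foundation

let letter = "\u{C5}"        // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
let angstrom = "\u{212B}"    // U+212B ANGSTROM SIGN

/// Hypothetical helper: the scalar values of a string, for comparison.
func scalars(_ text: String) -> [UInt32] {
    return text.unicodeScalars.map { $0.value }
}

// NFD: both decompose to U+0041 followed by U+030A.
let letterNFD = scalars(letter.decomposedStringWithCanonicalMapping)
let angstromNFD = scalars(angstrom.decomposedStringWithCanonicalMapping)

// NFC: both end up as U+00C5; the ångström sign does not come back.
let letterNFC = scalars(letter.precomposedStringWithCanonicalMapping)
let angstromNFC = scalars(angstrom.precomposedStringWithCanonicalMapping)

// Converting back and forth after the first normalization is lossless.
let roundTrip = scalars(
    angstrom.precomposedStringWithCanonicalMapping.decomposedStringWithCanonicalMapping
)

print(letterNFD == angstromNFD, letterNFC == angstromNFC, roundTrip == angstromNFD)
```

All three comparisons come out true, so the two characters are indistinguishable in either form.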