Pitch: Unicode Equivalence for Swift Source

While the first one started with the same premise, the other threads seem to be much wider in scope and became largely about other issues, like which characters should be operators and which should be identifiers. None of that matters to me. I only care about the compiler correctly recognizing matching tokens. But I defer to those who have already done much heavier lifting. Thanks for the hard work, @xwu.

If we don't normalize and then we add reflection APIs, we'll have to deal with the possibility of two distinct properties having names that are equal to each other when it comes to String equality. But then maybe backward compatibility requires it.

Here's an idea to sort-of deprecate non-normalized identifiers without forbidding them outright: make the compiler normalize all identifiers except for those in `backticks`. This should make sure any desire for a weird normalization is written in a way that'll make people suspicious that something is going on. I believe this should be relatively easy to implement. The migrator could also automatically detect and migrate those symbols to whatever the developer chooses.
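To make the idea concrete, here is a minimal sketch of the rule. `canonicalIdentifier` is a hypothetical helper written for illustration, not anything that exists in the compiler, and the choice of NFC is an assumption (any single normalization form would serve). It relies on Foundation's `precomposedStringWithCanonicalMapping`.

```swift
import Foundation

/// Hypothetical helper (not a compiler API): returns the name the compiler
/// would use internally for an identifier token as written in source.
func canonicalIdentifier(_ token: String) -> String {
    // A token written in backticks keeps its code points exactly as typed.
    if token.count >= 2, token.first == "`", token.last == "`" {
        return String(token.dropFirst().dropLast())
    }
    // Everything else is normalized (NFC is assumed here).
    return token.precomposedStringWithCanonicalMapping
}

// "e" + U+0301 (combining acute accent), once plain and once in backticks.
let plain = canonicalIdentifier("e\u{0301}")
let escaped = canonicalIdentifier("`e\u{0301}`")

print(plain.unicodeScalars.count)    // 1: recomposed to U+00E9
print(escaped.unicodeScalars.count)  // 2: left as written
```

Under this rule only the backticked spelling can name a non-normalized symbol, which is what would make such usage stand out in review.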


We would welcome a fix for the normalization-of-names issue. The rules about what exactly is an operator are a completely separate issue and should not block a normalization fix, which as you say is ABI-affecting and therefore should be fixed ASAP before there's a bunch of code relying on libraries with stable ABIs using inconsistent normalizations.

As a general matter, ABI issues can be fixed; you just have to be aware of the practical impact of the fix.


Since APIs that differ only in normalization form would arguably be a bug, would you consider that narrow change alone to require a proposal?

I wouldn't, but I can raise that question to the rest of the Core Team.


A targeted fix here just for identifiers would be great. I have no interest in normalising anything else, and would consider it a bug if string literals were normalised.


+1, completely agreed.

-Chris


Important enough to delay Swift 5.0 and iOS 12.2? It seems to me that the release dates of these two should be open to a delay to fit in this ABI-breaking change, since you really, really do not want people to rely on the current behavior.
Would be quite puzzled if that cannot happen (there are business driven decisions behind the next iOS point release, but getting the ABI right at the start seems to be a worthy goal).

On the other hand, I fear this is ultimately a mistake: although an essentially ASCII-only grammar does not seem inclusive, it does provide a simple, easy-to-parse, and easy-to-share solution. A lingua franca does have some advantages, as Anglocentric as that may sound. But that is quite off topic, on a boat that has already sailed :).

No, we cannot delay Swift releases over a relatively minor bug that we have no reason to think that we cannot fix in a future release.


I, too, think that this is a bug and that the `backticks`-form should preserve encoding. I also see this bug as non-critical, so the following is more for later reference:

I think, due to character duplication (“Å” (U+00C5) vs. “Å” (U+212B)), the compiler has to use NFC internally. These characters differ in NFC, but are the same in NFD. Code points that do not exist in NFD should be disallowed outside `identifiers`.

It might be necessary to allow `\u{…}` inside backticked identifiers so that NFD users can use all identifiers. Otherwise anyone using an NFD editor would have no way of using a module that uses any of those characters if the module was written using NFC.

They are the same in either.

NFC always does NFD first—decomposing and reordering (NFD) before it recomposes in a possibly different way.

In this case NFD transforms both into U+0041 + U+030A. NFC then recombines that into U+00C5.

Maybe you are confusing it with this fact: The first time NFD or NFC is performed it will “lose information” by removing the “distinction” between the ångström unit and the Scandinavian letter Å. The ångström will never reappear in NF‐anything. But after initial normalization there is only one NFC representation for any NFD and vice versa, so no information is lost no matter how many times data gets converted back and forth.
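This can be checked from Swift itself. The snippet below is only an illustration using Foundation's `decomposedStringWithCanonicalMapping` (NFD) and `precomposedStringWithCanonicalMapping` (NFC); `scalars` is a small helper introduced here to print code points.

```swift
import Foundation

let letter = "\u{00C5}"    // LATIN CAPITAL LETTER A WITH RING ABOVE
let angstrom = "\u{212B}"  // ANGSTROM SIGN

/// Helper for this example: the string's Unicode scalars as hex values.
func scalars(_ s: String) -> [String] {
    s.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
}

let nfdLetter = scalars(letter.decomposedStringWithCanonicalMapping)
let nfdAngstrom = scalars(angstrom.decomposedStringWithCanonicalMapping)
let nfcLetter = scalars(letter.precomposedStringWithCanonicalMapping)
let nfcAngstrom = scalars(angstrom.precomposedStringWithCanonicalMapping)

print(nfdLetter, nfdAngstrom)  // ["41", "30A"] ["41", "30A"]
print(nfcLetter, nfcAngstrom)  // ["C5"] ["C5"]

// Round-tripping after the first normalization changes nothing further.
let roundTrip = angstrom.decomposedStringWithCanonicalMapping
    .precomposedStringWithCanonicalMapping
print(scalars(roundTrip))      // ["C5"]
```

Both characters end up identical in either form; U+212B does not come back once the text has been normalized.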


Implementations may skip or refactor this step for performance reasons when it is known that doing so will not alter the result.


So it would be a mere bug fix then? Or was that not what Chris’ answer meant and the core team’s answer still pending?

@xwu, is there already some implementation associated with the previous broader discussions that could be used as a starting point to factor out the targeted bug fix? Or would it be better to start from scratch?

With the proviso that we'd want to be able to run an implementation through as much compatibility testing as we could, yes, consensus on the Core Team was that this would just be a bug-fix.


No, the previous discussion pre-dated the need for an implementation.

Thanks for clarifying, I didn't know this.

So fixing this bug and switching to any NF* internally is a source-breaking change, no matter which NF* is used. The fact that duplicated characters are then reduced to the same identifier should be mentioned in the Swift Book under Lexical Structure/Identifiers.

The only instance where anything would be source‐breaking is if code is already trying to use equal text, such as é (decomposed, U+0065 U+0301) and é (composed, U+00E9), as two separate identifiers. The compiler would begin considering them equal. Such usage is highly unlikely in practice, no matter how exotic the human language of the developer.

Even if a user’s code has been written in one normalization form (or a random hodge‐podge) and the compiler decides to think internally in the other form, the user will see no observable difference. All the same identifiers still reference all the same things.

The issue is only on the ABI level. If an already compiled binary used one normalization form and the compiler starts mangling names only in the opposite normalization form, then no new module could link against any existing symbol whose name differs between the normalization forms. (Since ASCII is static across normalization, users of ASCII code would never notice.)
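A small illustration of why the mismatch only shows up at the byte level: Swift's `String` equality is based on canonical equivalence, while a symbol name in a binary is just a sequence of bytes. The names below are made up for the example.

```swift
// The same identifier text spelled in two normalization forms.
let composed = "caf\u{00E9}"     // é as a single scalar (NFC)
let decomposed = "cafe\u{0301}"  // e + combining acute accent (NFD)

// String comparison uses canonical equivalence, so these are equal.
let equalAsStrings = (composed == decomposed)

// The encoded bytes, which are what a linker would compare, differ.
let composedBytes = Array(composed.utf8)
let decomposedBytes = Array(decomposed.utf8)
let equalAsBytes = (composedBytes == decomposedBytes)

print(equalAsStrings)   // true
print(equalAsBytes)     // false
print(composedBytes)    // [99, 97, 102, 195, 169]
print(decomposedBytes)  // [99, 97, 102, 101, 204, 129]
```

An all-ASCII name produces the same bytes in every normalization form, which is why such code would never notice.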


I am less familiar with the compiler than I am with the other parts of the project written in Swift. I have heard mumblings about ICU vs no ICU in the compiler and I don’t know what the actual reality is. Is there easy access to a string normalization function inside the compiler, or will it require modification to even get access to that ability?

(It may take some time before I get to this. If someone beats me to it, I won’t hold it against you. :wink:)

I think module names are affected by this too.

I just saw this other post...

...and experimented with variations of the user’s instructions.

Creating a project and entering the name in either NFD or NFC results in the same thing: Xcode will have NFD in its UI for the project settings, and the directory and process name are NFD too. But the mangled class reference in the Interface Builder file is NFC. And right now they aren’t considered matching. I’m not an Interface Builder expert though, so there may be more going on in that case.

If we are to encourage the usage of Unicode identifiers, there are a few places where normalization (or the lack thereof) can pose a problem:

  • Interface Builder: outlets, actions, class names, because the Objective-C runtime does not normalize names.

  • Generated files for modules, libraries, frameworks: not all filesystems normalize Unicode and/or use Unicode-aware name lookups. Thinking specifically of Linux vs. macOS here. To counter that, build tools should normalize names before creating files, and later use the normalized name when reading the files it generated. (I haven't tested, but I presume this is a problem.)

  • String identifiers in Cocoa and elsewhere. For instance: notification names. Swift will see two names as being the same, but NSNotificationCenter will not on platforms where comparison is done using NSString. Note that comparison is done using String in swift-corelibs-foundation, not NSString, so results will probably not match Apple platforms.

Generally, anywhere a string is used as an identifier might be a problem if not all comparisons are normalized ones. Everyone should be testing for this, but I bet no one is.
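As a rough sketch of the failure mode and the mitigation: the registry below is hypothetical and stands in for anything that compares names byte for byte (a non-normalizing filesystem, a runtime name lookup). The names and the `normalizedKey` helper are invented for this example; normalizing to NFC is an assumption, and the only requirement is that writer and reader agree on one form.

```swift
import Foundation

/// Key as a byte-wise system would see it: the raw UTF-8 of the name.
func rawKey(_ name: String) -> [UInt8] {
    Array(name.utf8)
}

/// Key after normalizing first (NFC assumed), as suggested for build tools.
func normalizedKey(_ name: String) -> [UInt8] {
    Array(name.precomposedStringWithCanonicalMapping.utf8)
}

let registeredName = "Caf\u{00E9}DidOpen"   // written in NFC
let lookedUpName = "Cafe\u{0301}DidOpen"    // same text, written in NFD

// Byte-wise registry: the lookup misses even though the Strings are equal.
var rawRegistry: [[UInt8]: String] = [:]
rawRegistry[rawKey(registeredName)] = "handler"
let rawHit = rawRegistry[rawKey(lookedUpName)]

// Normalizing on both the write and the read side makes the lookup succeed.
var normalizedRegistry: [[UInt8]: String] = [:]
normalizedRegistry[normalizedKey(registeredName)] = "handler"
let normalizedHit = normalizedRegistry[normalizedKey(lookedUpName)]

print(registeredName == lookedUpName)  // true
print(rawHit as Any)                   // nil
print(normalizedHit as Any)            // Optional("handler")
```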

My personal advice for programmers would be to stick to ASCII if possible when choosing identifiers because it is not vulnerable to normalization issues. It might be a long time before the entire ecosystem can handle things harmoniously, and you probably don't want subtle bugs to creep in because one API sees two strings as identical while another doesn't.
