String Comparison for Identifiers

unicode

(Michel Fortin) #1

There are two ways to write "é" in Unicode. You can write it as:

  • LATIN SMALL LETTER E WITH ACUTE (U+00E9)
  • LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)

These are two forms of the same character, and there are many characters that can be represented in more than one form in Unicode.

In Swift, two strings will compare equal regardless of which form they are encoded in. But this is not the case everywhere, and this has some peculiar effects.
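To make the two forms concrete, here is a small sketch that spells them with `\u{...}` escapes, so the result does not depend on how an editor or a forum normalizes pasted text. The variable names are just for illustration.

```swift
// The two spellings of "é", written with escapes so they survive copy and paste.
let precomposed = "\u{E9}"     // LATIN SMALL LETTER E WITH ACUTE
let decomposed = "e\u{301}"    // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

// Swift's String equality treats canonically equivalent strings as equal.
let equalAsStrings = (precomposed == decomposed)

// The underlying Unicode scalars are nonetheless different.
let precomposedScalars = precomposed.unicodeScalars.map { $0.value }
let decomposedScalars = decomposed.unicodeScalars.map { $0.value }

print(equalAsStrings)        // true
print(precomposedScalars)    // [233]
print(decomposedScalars)     // [101, 769]
```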

For instance, the Swift compiler does not normalize the source, and in particular it will not do normalization before comparing identifiers. Otherwise these two é identifiers would clash:

let é = "é" // é written as one unicode scalar
let é = "é" // é written as "e" with a combining accent

(Note: you might have to type "e" and use the character palette to create a combining acute accent, as pasting code in this forum will normalize the text.)

So if you write a tool interpreting Swift code, the right way is to compare identifiers for equality by doing a Unicode scalar comparison. At least this is the case with the current Swift version.
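A minimal sketch of what such a comparison could look like in a tool written in Swift. The function name `identifiersMatch` is hypothetical; it only illustrates comparing scalar by scalar instead of using `String`'s `==`, which would treat the two forms as equal.

```swift
/// Hypothetical helper: compares two identifier spellings scalar by scalar,
/// with no normalization, matching the compiler behaviour described above.
func identifiersMatch(_ a: String, _ b: String) -> Bool {
    return a.unicodeScalars.elementsEqual(b.unicodeScalars)
}

let sameForm = identifiersMatch("caf\u{E9}", "caf\u{E9}")
let mixedForms = identifiersMatch("caf\u{E9}", "cafe\u{301}")

print(sameForm)      // true
print(mixedForms)    // false, even though "caf\u{E9}" == "cafe\u{301}" is true
```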

This is actually an interoperability problem for identifying things with Unicode strings in general: different tools and systems will compare identifiers in different ways and misunderstandings will ensue. Tools written in some languages will likely perform normalization automatically on comparison while others will not.

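import Foundation
"\u{E9}" == "e\u{301}" // true: String compares by canonical equivalence
"\u{E9}" as NSString == "e\u{301}" as NSString // false: NSString compares literally, without normalization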

And almost nobody thinks about testing normalization, a relatively obscure Unicode feature. Those who will end up exercising those bugs are probably people seeking security vulnerabilities. Think of "é" in a user name for instance:

  • If your user account manager lets you create two users with different normalizations but then your backend code treats the two strings as the same user, you might get two distinct accounts sharing the same backend data :dizzy_face:.

  • Or if the account manager recognizes the two different normalizations as the same user but then passes the user name to a backend system that only recognizes one form, the backend might fail to find the associated data depending on which normalization was used to log in.

All this to say, if you are using Unicode identifiers of some sort, you should think about normalization and how other components of the system will compare those identifiers for equality. If they aren't all on the same page, it invites trouble.
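One way to get every component on the same page is to pick a single normalization form at the boundary and convert identifiers to it before storing them or passing them on. The sketch below assumes NFC and uses Foundation's `precomposedStringWithCanonicalMapping`; the helper name `canonicalUserName` is hypothetical, and whether NFC is the right choice depends on what the other components expect.

```swift
import Foundation

/// Hypothetical helper: converts a user name to NFC so that components
/// comparing raw bytes or code units all see the same representation.
func canonicalUserName(_ name: String) -> String {
    return name.precomposedStringWithCanonicalMapping
}

let typedPrecomposed = "Andr\u{E9}"
let typedDecomposed = "Andre\u{301}"

// Before normalization the encoded bytes differ...
let rawBytesMatch = Array(typedPrecomposed.utf8) == Array(typedDecomposed.utf8)

// ...after normalization they are identical.
let normalizedBytesMatch =
    Array(canonicalUserName(typedPrecomposed).utf8) ==
    Array(canonicalUserName(typedDecomposed).utf8)

print(rawBytesMatch)           // false
print(normalizedBytesMatch)    // true
```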


SE-0243: Codepoint and Character Literals
Pitch: Unicode Equivalence for Swift Source
(Xiaodi Wu) #2

See:

https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20161017/028174.html


(Jeremy David Giesbrecht) #3

[whistles]

Wow...

[ongoing stunned silence]

I could almost swear that this was not always the case. I thought I tested identifier and operator equivalence back in Swift 1 when I first saw that Unicode was encouraged. It has certainly been the first thing I did in every other language I picked up. I guess I forgot to with Swift. What an oversight!


#4

Past discussions about this were dominated by questions about malicious use of confusable characters and about what counts as an identifier and what counts as an operator. These related issues are much more complex design spaces, which is probably part of the reason why progress stalled there. I think fixing the comparison of identifiers to operate as expected (i.e. using the same notion of equality as Swift strings) would be a much more straightforward matter, and I suspect that a targeted proposal here would be uncontroversial (except for scope creep). Perhaps it would even be considered a bug fix. It is very mildly source breaking, but I highly doubt anyone was purposefully relying on this behaviour.


(Michel Fortin) #5

It should probably be mentioned here that Jeremy just started a pitch for this.