There are two ways to write "é" in Unicode. You can write it as:
- LATIN SMALL LETTER E WITH ACUTE (U+00E9)
- LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)
These are two forms of the same character, and there are many characters that can be represented in more than one form in Unicode.
In Swift, two strings compare equal regardless of which form they are encoded in. But this is not the case everywhere, and it has some peculiar effects.
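To make the two forms concrete, here is a small sketch that spells them with explicit scalar escapes, so the example survives any normalization applied when copying and pasting. The variable names are only for illustration.

```swift
let precomposed = "\u{E9}"   // LATIN SMALL LETTER E WITH ACUTE
let decomposed = "e\u{301}"  // "e" followed by COMBINING ACUTE ACCENT

// Swift's String equality uses canonical equivalence, so the two forms match.
print(precomposed == decomposed)          // true

// Both are a single Character, but the underlying scalars and bytes differ.
print(precomposed.count, decomposed.count) // 1 1
print(precomposed.unicodeScalars.count)    // 1
print(decomposed.unicodeScalars.count)     // 2
print(Array(precomposed.utf8))             // [195, 169]
print(Array(decomposed.utf8))              // [101, 204, 129]
```

Anything that compares the scalars or bytes rather than the `String` values will see two different sequences here.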
For instance, the Swift compiler does not normalize the source, and in particular it will not do normalization before comparing identifiers. Otherwise these two é identifiers would clash:
let é = "é" // é written as one unicode scalar
let é = "é" // é written as "e" with a combining accent
(Note: you might have to type "e" and use the character palette to create a combining acute accent, as pasting code in this forum will normalize the text.)
So if you write a tool that interprets Swift code, the right way to compare identifiers for equality is a Unicode scalar comparison. At least, this is the case with the current Swift version.
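A scalar-by-scalar comparison could look something like the following. This is a minimal sketch; `sameIdentifier` is a hypothetical helper name, not something taken from the compiler.

```swift
/// Hypothetical helper: compares two identifiers scalar by scalar,
/// without applying any Unicode normalization first.
func sameIdentifier(_ a: String, _ b: String) -> Bool {
    return a.unicodeScalars.elementsEqual(b.unicodeScalars)
}

print(sameIdentifier("\u{E9}", "\u{E9}"))    // true: identical scalars
print(sameIdentifier("\u{E9}", "e\u{301}"))  // false: different scalars
print("\u{E9}" == "e\u{301}")                // true: String equality normalizes
```

Using `==` on the `String` values instead would treat the two identifiers as the same one, which is not how the compiler sees them.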
This is actually an interoperability problem for identifying things with Unicode strings in general: different tools and systems will compare identifiers in different ways and misunderstandings will ensue. Tools written in some languages will likely perform normalization automatically on comparison while others will not.
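import Foundation // needed for NSString
"\u{E9}" == "e\u{301}" // true: String compares by canonical equivalence
"\u{E9}" as NSString == "e\u{301}" as NSString // false: NSString compares literally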
And almost nobody thinks about testing normalization, a relatively obscure Unicode feature. Those who will end up exercising those bugs are probably people seeking security vulnerabilities. Think of "é" in a user name for instance:
- If your user account manager lets you create two users with different normalizations but then your backend code treats the two strings as the same user, you might get two distinct accounts sharing the same backend data.
- Or if the account manager recognizes the two different normalizations as the same user but then passes the user name to a backend system that only recognizes one form, the backend might fail to find the associated data, depending on which normalization was used to log in.
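The first scenario can be reproduced in a few lines. This is only a sketch under stated assumptions: the account manager is imagined as keying users by raw UTF-8 bytes and the backend as keying data by `String`. Both dictionaries and all names are hypothetical.

```swift
let nameNFC = "andr\u{E9}"    // precomposed é
let nameNFD = "andre\u{301}"  // "e" + combining acute accent

// Hypothetical account manager: keys are raw UTF-8 bytes, so the two
// spellings are registered as two different users.
var accounts: [[UInt8]: Int] = [:]
accounts[Array(nameNFC.utf8)] = 1
accounts[Array(nameNFD.utf8)] = 2
print(accounts.count)     // 2

// Hypothetical backend: keys are Swift Strings, which hash and compare
// by canonical equivalence, so both users land on the same entry.
var backendData: [String: String] = [:]
backendData[nameNFC] = "data written by account 1"
backendData[nameNFD] = "data written by account 2"
print(backendData.count)  // 1
```

The second write silently overwrites the first, so two accounts end up sharing one record.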
All this to say, if you are using Unicode identifiers of some sort, you should think about normalization and how other components of the system will compare those identifiers for equality. If they aren't all on the same page, it invites trouble.
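One way to get every component on the same page is to pick a single normalization form and convert identifiers to it at the boundary, before they are stored or forwarded. The sketch below assumes NFC is the chosen form and uses Foundation's `precomposedStringWithCanonicalMapping`. `canonicalIdentifier` is a hypothetical helper name.

```swift
import Foundation

/// Hypothetical helper: converts an identifier to NFC so that later
/// scalar-level or byte-level comparisons agree with String equality.
func canonicalIdentifier(_ raw: String) -> String {
    return raw.precomposedStringWithCanonicalMapping
}

let fromDecomposed = canonicalIdentifier("e\u{301}")
let fromPrecomposed = canonicalIdentifier("\u{E9}")

// After normalization, even a scalar-by-scalar comparison matches.
print(fromDecomposed.unicodeScalars.elementsEqual(fromPrecomposed.unicodeScalars)) // true
print(Array(fromDecomposed.utf8) == Array(fromPrecomposed.utf8))                   // true
```

This only helps if it is applied consistently. A component that skips the step brings back the mismatch.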