Swift string comparison doesn't consider ligatures equivalent to their components

Related but different: interestingly, according to Xcode these are two different identifiers:

// one form is decomposed, another precomposed:
let è = 42
let è = 24

This has nothing to do with ligatures and does look like a bug. Or is it not?
FWIW, Finder doesn't allow two items named like this in the same folder.
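
For what it's worth, a quick check suggests the standard library itself does treat the two forms as equal (String equality is defined over canonical equivalence), so the surprise is confined to how the compiler handles identifiers. A minimal sketch:

let precomposed = "\u{E8}"    // "è" as a single scalar (U+00E8)
let decomposed = "e\u{300}"   // "e" followed by U+0300 COMBINING GRAVE ACCENT

print(precomposed == decomposed)   // true: String == uses canonical equivalence
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)   // 1 2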

2 Likes

I don't think this is controversial at all; Unicode itself agrees:

Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.

Fonts intended to render the contents and appearance of manuscripts might be expected to have all the requisite ligatures.

The existing ligatures, such as “fi”, “fl”, and even “st”, exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances. [JC] & [AF]

<snip>

The Unicode Standard is a character encoding standard, and is not intended to standardize ligatures or other presentation forms, or any other aspects of the details of font and glyph design. The ligatures which you can find in the Unicode Standard are compatibility encodings only—and are not meant to set a precedent requiring the encoding of all ligatures as characters.

In other words, oﬃce and office should not compare as equal except strictly in compatibility contexts, and these ligature characters are not intended to be used as a substitute for ligatures as a typesetting feature.

These assertions are incorrect: U+FB03 LATIN SMALL LIGATURE FFI is not an alternative form of ASCII "ffi" and shouldn't be used or treated as such; specifically, the two are not canonically equivalent by Unicode's definition of "canonical".
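
For the record, a minimal sketch (using Foundation's normalization properties) of where that line falls for this pair: canonical equivalence keeps them distinct, while compatibility (NFKD) mapping folds them together.

import Foundation

let ligature = "o\u{FB03}ce"   // "oﬃce" spelled with U+FB03 LATIN SMALL LIGATURE FFI
let plain = "office"           // plain ASCII

print(ligature == plain)   // false: Swift's == only abstracts over canonical equivalence

// Compatibility (NFKD) decomposition maps U+FB03 to "ffi":
print(ligature.decomposedStringWithCompatibilityMapping == plain)   // true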

4 Likes

Out of genuine curiosity, have you received reports of this from your customers? In all honesty, I would be extremely surprised... These are pretty obscure characters the vast majority of computer users would never know about.

No, the app angle is hypothetical (for me).

But the domain name examples are very real. I've been subject to countless attempts at fraud (or worse) through domain name spoofing.

I think it's fair to say that examples like search-and-replace of "office" are quite fringe, but name spoofing is a very real problem.

I suppose it's arguably the prerogative of the Unicode Consortium to decree what they think, which I guess they have.

But ultimately what matters most is what actual people expect, in the general and broad sense. The "laypeople".

None of these technicalities, no matter how rational in their own ways, convince me at all that "oﬃce" != "office" for most purposes. That there are specific circumstances in which they should not be considered equivalent doesn't change what the safer and more intuitive default should be.

I don't understand your reasoning here, I'm afraid. You're saying that because there is a possibility of spoofing you need a good way to do extra, expert work to detect it… but if the spoofing weren't possible to begin with, why would you care?

Spoofing is indeed a real concern — enough so that ICU has an API to help detect character spoofing: uspoof.h. Where spoofing is a concern, very special care needs to be taken: in some contexts, compatibility equivalence can help avoid a spoofing attack, and in others, compatibility equivalence leads to spoofing attacks.

Using a dedicated API to detect this is really important for applying semantic understanding to the text you're looking at.

In the case of domain names, for example: both registrars and browsers should use APIs like this to prevent fraud at the source, so that registrars can help block unsavory domain-registration attempts, and browsers can help inform users that links may not be taking them where they think they're going.

But: applying these semantics to general string comparison and equality is highly unlikely to yield the results you're looking for.

4 Likes

This topic isn't really confined to Swift and has been explored in great detail elsewhere. With respect to domain names, I'd encourage you to peruse the following Unicode Consortium FAQ and follow the links to the relevant technical recommendations and standards (UTR #36, UTS #39, UTS #46):

There are recommendations in the same vein for source code too which I'm sure you'll be capable of finding.

The overall point I'm making is that you needn't, and shouldn't, be trying to discover these from scratch and case by case, as the resources and standards from folks who've thought about this are out there. Furthermore, much about UX and security considerations (including how they intersect with localization and internationalization) is use case–specific and thus requires developers to hold it right, because it exceeds the remit of any general-purpose API. But the resources are there.

Edit: I see @itaiferber's said basically the same thing as my last paragraph but better :)

Edit 2: While it's certainly not a primary source and I wouldn't ever recommend retrieving any authoritative lists from it, Wikipedia actually has a fairly complete and (IMO) well done series of articles on Unicode security-related topics as well. It's possible that some will find it more approachable than the technical documents which are primary sources.

2 Likes

Why shouldn't that "detection" be automatic: the intuitive, default path?

Your argument could be applied to Unicode support in general. "There are libraries for handling Unicode text, so why should the default string implementation worry about that stuff? Sure, Unicode support is critical for some things, but if you need Unicode support, just use those Unicode libraries. What's the problem?"

One might argue that Unicode support is more generally important and useful than the equality involving ligatures, and that's certainly true. But that only matters insofar as we might have to make trade-offs: Unicode support is certainly not free, but worth it - are there any trade-offs with handling ligatures more naturally, and even if there are, are they not also worth it?

Now, you did touch on what might be an important wrinkle, which is that it really is very hard to resolve all of these specific cases; to determine what really is equivalent or not. It might be impossible to prevent spoofing, no matter what, let alone with respect to default string behaviour that does have to balance other interests. But better is still better. I'm just not convinced Swift's String has hit the sweet spot. And more generally I'm surprised by the push-back on even trying to raise that bar.

Because, as @itaiferber has said: "in some contexts, compatibility equivalence can help avoid a spoofing attack, and in others, compatibility equivalence leads to spoofing attacks." Put another way, there is no "intuitive, default path."

4 Likes

100% agree.

Here we disagree, although to be clear I don't know that you're wrong, I just am hoping you are. Optimistically, at least. :laughing:

It's precisely because this problem space is truly gnarly and requires ridiculous levels of intelligence and experience that it shouldn't be left to every individual developer, not only to address it but even to know that they need to.

In principle that is, to be clear. Again, maybe this problem is truly intractable. Maybe it truly is impossible to have a sane default that covers every case in a sufficiently safe way (however we might define that).

Maybe Swift's already very nearly there. But it can't be exactly there yet, because it still thinks "oﬃce" != "office".

Because String.== is not the right place for it. Specifically, if you take a look at the uspoof.h API, you can get a sense of the types of checks that may be necessary, and the signals you might be looking for:

enum USpoofChecks {
  USPOOF_SINGLE_SCRIPT_CONFUSABLE = 1,
  USPOOF_MIXED_SCRIPT_CONFUSABLE = 2,
  USPOOF_WHOLE_SCRIPT_CONFUSABLE = 4,
  USPOOF_CONFUSABLE = USPOOF_SINGLE_SCRIPT_CONFUSABLE | USPOOF_MIXED_SCRIPT_CONFUSABLE | USPOOF_WHOLE_SCRIPT_CONFUSABLE,
  USPOOF_ANY_CASE = 8,
  USPOOF_RESTRICTION_LEVEL = 16,
  USPOOF_SINGLE_SCRIPT = USPOOF_RESTRICTION_LEVEL,
  USPOOF_INVISIBLE = 32,
  USPOOF_CHAR_LIMIT = 64,
  USPOOF_MIXED_NUMBERS = 128,
  USPOOF_HIDDEN_OVERLAY = 256,
  ...
}

String.== could not possibly give you this information, because it returns a single bit: "equal", or "not equal". There are no toggles on String.==, nor can you work backwards from that single bit to figure out what the contents are that you're looking at. In other words, you need a dedicated API to do this sort of checking. Maybe Swift could do with an ergonomic, easily-accessible API for doing this sort of in-depth checking, but String.== is not it.
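
To put that in Swift terms, a dedicated API would need both toggles and a result richer than a single Bool. Purely as a hypothetical sketch (none of these types exist in the standard library; the shape just mirrors the ICU checks above):

struct SpoofChecks: OptionSet {
    let rawValue: UInt32
    static let singleScriptConfusable = SpoofChecks(rawValue: 1 << 0)
    static let mixedScriptConfusable  = SpoofChecks(rawValue: 1 << 1)
    static let wholeScriptConfusable  = SpoofChecks(rawValue: 1 << 2)
    static let invisible              = SpoofChecks(rawValue: 1 << 5)
    static let mixedNumbers           = SpoofChecks(rawValue: 1 << 7)
}

// A checker reports which of the requested checks fired, rather than collapsing
// everything into "equal" / "not equal".
func spoofFindings(in candidate: String, checking checks: SpoofChecks) -> SpoofChecks {
    // A real implementation would call into something like ICU's uspoof_check.
    return []
}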

FWIW, spoofing checks would reject "oﬃce" as a probable spoof (so if String.== rejected spoofs, it would return false), still not giving you the result you're looking for.

3 Likes

The good news is that this should not be possible. Both registrars and browsers should process the domain according to UTS #46 (IDNA), which will perform a compatibility decomposition and turn microsoftoﬃce.com into microsoftoffice.com (Live URL Viewer).

For some reason, the forums software Punycodes the ligatured version (as xn--microsoftoce-nu49d.com). Since it encodes invalid characters, that is not considered a valid domain and a standards-conforming browser should refuse to navigate to it. (Live URL Viewer). Unfortunately, Chrome currently allows domains to contain invalid Punycode :confused: (at least, it allows them to be parsed -- it may reject requests to them at some lower level, and of course they should not be registered to anything anyway).

WebKit should fully conform to the standard, so they won't consider that a valid domain. Firefox is not fully conforming, but currently has better compatibility than Chrome. There is an effort to improve compliance across all browsers, but it takes time.

3 Likes

Right, as @itaiferber demonstrates regarding that example, there is no single implementation of == which satisfies all your criteria for "intuitiveness" in even the one case, let alone "every case in a sufficiently safe way."

3 Likes

This type of problem, unfortunately, is intractable. There's a balance to strike between shielding developers from the underlying complexity of a subject and giving them the tools to directly address that complexity — and it's tradeoffs all the way down.

For example, you could try to hide the complexity of NFC/NFD/NFKC/NFKD mappings in String equality to avoid having to force developers to pick an equality mode for every string comparison they perform, and you'll still get threads on the forums about why NFC was chosen and not NFD or NFKD :wink:
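
For reference, Foundation does expose all four forms today, so code that needs a specific mode can opt in explicitly. A minimal sketch:

import Foundation

let s = "e\u{300}"   // "è" in decomposed form

let nfc  = s.precomposedStringWithCanonicalMapping        // NFC
let nfd  = s.decomposedStringWithCanonicalMapping         // NFD
let nfkc = s.precomposedStringWithCompatibilityMapping    // NFKC
let nfkd = s.decomposedStringWithCompatibilityMapping     // NFKD

print(nfc.unicodeScalars.count, nfd.unicodeScalars.count)   // 1 2
print(nfc == nfkc, nfd == nfkd)   // true true: no compatibility-only characters in this example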

It's just not always possible to distill down an extraordinarily complex subject which necessarily requires thought, attention, and care into something that "just works" 99% of the time. An API abstracting over unbelievable complexity can't simultaneously be extremely simple and offer endless toggles, nor be intuitive and understandable without making tradeoffs on behalf of the developer.

Unfortunately, the root of the problem is the underlying complexity, and not the API surrounding it. You can't just make that go away.

2 Likes

Such checks are only necessary if spoofing is possible because identifiers are not sufficiently canonicalised. Or, more precisely: validation checks must perform the exact same normalisation as name generation.

For example, if DNS treats "oﬃce.com" and "office.com" as distinct domains, then your TLS certificate validation must also distinguish between those two representations; otherwise you could use those two domains' certificates interchangeably, which would obviously be very bad.

But if you don't treat them as distinct, then cert validation doesn't need to either (indeed, perhaps it shouldn't, otherwise it'd erroneously reject arguably valid certificates).

Though in this case, either way, you could seemingly require byte-exact matching of names in certificates to DNS names. But then you'd still need user agents to correctly canonicalise inputs (lest they erroneously check the cert against what the user entered, say "oﬃce.com", rather than the canonical name "office.com").
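
To sketch that invariant in code (purely illustrative; canonicalise(_:) here is a hypothetical stand-in for whatever processing the naming system actually defines, e.g. UTS #46 for domain names):

// Hypothetical: whatever normalisation the naming system defines. For real domain
// names this would be UTS #46 processing, not the lowercasing placeholder used here.
func canonicalise(_ hostname: String) -> String {
    hostname.lowercased()
}

// Name generation and name validation must apply the same canonicalisation:
func register(domain: String) -> String {
    canonicalise(domain)
}

func certificateMatches(certificateName: String, requestedHost: String) -> Bool {
    canonicalise(certificateName) == canonicalise(requestedHost)
}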

Anyway, it doesn't seem like anyone's changing anyone's mind with this discussion, and I don't know how else to present my argument, so I'm going to stop here. But that's not to say this wasn't a very useful and welcome discussion - I've learnt a lot more about Unicode and text handling in general. I hope others found this thread useful too, and will find it useful in future (hello future-reader!). Thank you everyone!

As the ancient saying goes, "Unicode is dark and full of terrors". :stuck_out_tongue_winking_eye:

When it comes to compatibility decomposition, it is not always the correct thing to do:

Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate. They can be applied more freely to domains with restricted character sets.

Compatibility equivalence is a weaker type of equivalence between characters or sequences of characters which represent the same abstract character (or sequence of abstract characters), but which may have distinct visual appearances or behaviors. The visual appearances of the compatibility equivalent forms typically constitute a subset of the expected range of visual appearances of the character (or sequence of characters) they are equivalent to. However, these variant forms may represent a visual distinction that is significant in some textual contexts, but not in others. As a result, greater care is required to determine when use of a compatibility equivalent is appropriate.

(UAX #15)

  • Compatibility normalization removes meaning. For example, the character sequence 8½ (including the character ½ [U+00BD VULGAR FRACTION ONE HALF]), when normalized using one of the compatibility normalization forms (that is, NFKD or NFKC), becomes an ASCII character sequence: 81/2.

(W3C Character Model for the World Wide Web: String Matching)

In general, it can be said that NFKC is more aggressive about finding matches between code points than NFC. For things like the spelling of users' names, NFKC may not be the best form to use. At the same time, one of the nice things about NFKC is that it deals with the width of characters that are otherwise similar, by canonicalizing half-width to full-width. This mapping step can be crucial in practice. A replacement for Stringprep depends on analyzing the different use profiles and considering whether NFKC or NFC is a better normalization for each profile.

(RFC 6885 - Stringprep Revision and Problem Statement for the Preparation and Comparison of Internationalized Strings (PRECIS))

I think everybody agrees that we should offer this feature (just as we should offer case-insensitive comparison), but it is not the best thing to use by default because it removes information. It should be opt-in for when you know it's the right thing to use.
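
To make "removes information" concrete, here's a small sketch of the W3C example above, using Foundation's compatibility mapping:

import Foundation

let price = "8\u{BD}"   // "8½" (U+00BD VULGAR FRACTION ONE HALF)

// Compatibility decomposition flattens the fraction into "1", U+2044 FRACTION SLASH, "2",
// losing the fact that this was a single fraction character.
print(price.decomposedStringWithCompatibilityMapping)   // "81⁄2"
print(price.decomposedStringWithCanonicalMapping)       // "8½": canonical forms leave it alone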

1 Like

Thanks for flagging these issues. One outcome for Swift String might be to stimulate work on the outstanding issues (as @xwu noted).

But another stopping point might be figuring out how to advise a Swift developer, right now, who wants to avoid surprising users with "false" negatives from ligature searches.

One suggestion:

As @Karl suggested, one could offer this as an option if the user query has ligatures or could be expected to match ligatures (together with explanations that it might be wrong). It seems like writing that function for detecting ligature-sensitive queries in English wouldn't be too hard.
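
Here's a rough sketch of such a check, covering only the Latin presentation-form ligatures (U+FB00 through U+FB06); Arabic and other scripts would need a much larger table, and this says nothing about whether folding is actually the right behaviour for a given search:

import Foundation

// The Latin presentation-form ligatures and the letter sequences they stand in for.
let latinLigatures: [Character: String] = [
    "\u{FB00}": "ff", "\u{FB01}": "fi", "\u{FB02}": "fl",
    "\u{FB03}": "ffi", "\u{FB04}": "ffl", "\u{FB05}": "st", "\u{FB06}": "st",
]

// A query is "ligature-sensitive" if it contains a ligature character itself, or a
// letter sequence that a ligature in the searched text could stand in for.
func queryIsLigatureSensitive(_ query: String) -> Bool {
    if query.contains(where: { latinLigatures[$0] != nil }) { return true }
    return latinLigatures.values.contains {
        query.range(of: $0, options: [.caseInsensitive]) != nil
    }
}

print(queryIsLigatureSensitive("office"))   // true: text containing "oﬃce" would not match it
print(queryIsLigatureSensitive("cat"))      // false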

Is matching ligatures a high-traffic issue that others have noticed or addressed? FWIW, I searched GitHub for "ligature" in Swift code:

  • 65 issues, none pertaining to matching (all to display)
  • one repo matched, a small display demo
  • 1.5K Swift files. Of the first 80, all were about displaying ligatures, except [1] couchbase, which offered an option to ignore accents and ligatures when searching.

So there's little data that it's high-traffic. But lest we give up too soon, [2] Betts noted that these compatibility ligatures are rare in English but could be quite common/extensive in Arabic (so my GitHub search sample is biased). Who knows about other languages?

So it seems like a library package with language-sensitive workarounds could be a stopgap for developers now (and a test suite for solutions long-term), but those most affected (i.e., motivated to write such a package) might not be reading this forum.


(Quoting references to avoid excerpting)

[1] couchbase: https://github.com/couchbase/couchbase-lite-ios/blob/13331a4ae72431011ffc23042eb6f8d0250480e2/Swift/Defaults.swift#L38

[2] Betts https://github.com/MaddTheSane/SwiftMacTypes/blob/12e2b389587d6c29fb84ae0785ad278e9687614d/CoreTextAdditions/CTStringAttributesAdditions.swift#L81

1 Like

It is a known, unfixable bug (due to ABI implications). It gets ugliest in module names in import statements. Basically, in contrast to the Standard Library, the compiler is string‐illiterate and processes bytes naïvely like C. You can find more information by searching these forums or the bug trackers.

Partly for this reason, alongside swift-format, I run all new source through an NFD normalizer and flag anything that would change under NFKD as a style violation.
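
A minimal sketch of that kind of check, assuming Foundation (the real tooling presumably reports locations and fixes files in place):

import Foundation

// Normalize source to NFD, and flag it if compatibility (NFKD) normalization would
// change it further, i.e. if it contains compatibility-only characters.
func normalizeAndLint(_ source: String) -> (nfd: String, violatesStyle: Bool) {
    let nfd = source.decomposedStringWithCanonicalMapping
    let nfkd = source.decomposedStringWithCompatibilityMapping
    return (nfd, nfd != nfkd)
}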

1 Like

You would probably be the only person on earth who does this. Most people would, at most, make sure it looks like the correct URL, and then click it or copy it into the address bar.

2 Likes