Swift string comparison doesn't consider ligatures equivalent to their components

wadetregaskis · August 9, 2023, 1:27am

I just discovered that Swift String considers "office" and "oﬃce" to be unequal, which surprised me given that it considers e.g. "caña" and "caña" to be equal (that is, both composed and decomposed forms). To the layperson these cases are equivalent - they're just different ways of representing the same graphemes (in some fonts, at least).

I did some digging and it seems like this is because Swift promises to use "canonical" comparison, not "compatible" comparison, in Unicode terms. The documentation mentions this but doesn't explain what that means.
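A minimal sketch of the distinction, standard library only. The strings are written with explicit scalar escapes so the difference stays visible whatever the font does; the variable names are just for illustration:

```swift
// Canonical equivalence: precomposed ñ (U+00F1) vs. n + combining tilde (U+0303).
let composed = "ca\u{00F1}a"
let decomposed = "can\u{0303}a"
print(composed == decomposed)                        // true
print(composed.unicodeScalars.count,
      decomposed.unicodeScalars.count)               // 4 5

// Compatibility equivalence only: U+FB03 LATIN SMALL LIGATURE FFI vs. "ffi".
let plain = "office"
let ligated = "o\u{FB03}ce"
print(plain == ligated)                              // false
print(plain.count, ligated.count)                    // 6 4
```

The first pair differs only in how one character is spelled in code points, so == treats them as equal. The second pair differs in the characters themselves, so == does not.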

I dug further and found some controversy and storied history on ligatures in general in Unicode, which maybe sheds some light on this (e.g. it appears the current Unicode opinion on ligatures is that they should not be used for kerning purposes, like "ﬃ" is, but it still contains "inappropriate" ligatures - again, like "ﬃ" - that were added before this change of heart).

NSString is similar except if you use compare(_:options:) with the caseInsensitive option, in which case it considers ligatures equal to their decomposed forms. Which is weird since that's got nothing to do with character case.

I guess this post is mostly just an FYI / warning to others. But I am curious why Swift chose to do "canonical" comparisons rather than "compatible"? There also appears to be no way to do a "compatible" comparison in the Swift standard library - you have to import Foundation to get the string overlays in order to access the aforementioned NSString method.

tera · August 9, 2023, 2:10am

Nice find!

Interestingly, both Xcode and this window in Safari think the strings are the same when searching. Ditto Finder.

let one = "office"
let another = "oﬃce"

print(one == another) // false
print(one.compare(another, options: []) == .orderedSame) // false
print(one.compare(another, options: .caseInsensitive) == .orderedSame) // true

print((one as NSString).isEqual(to: another)) // false
print((one as NSString).compare(another, options: []) == .orderedSame) // false
print((one as NSString).compare(another, options: .caseInsensitive) == .orderedSame) // true

I guess "String.compare" just calls through to the original NSString.compare and thus behaves identically.

Indeed looks strange that this has anything to do with character case.

Edit: this one works and gives "true" even without importing Foundation:

print(one.compare(another, options: .caseInsensitive) == .orderedSame) // true

wadetregaskis · August 9, 2023, 2:41am

Are you sure you didn't have Foundation imported? String doesn't have a compare method in the standard lib on my system (Xcode 15 beta 4 IIRC; Swift 5.9).

$ swift repl
1> let a = "office"
a: String = "office"
2> let b = "oﬃce"
b: String = "oﬃce"
3> a.compare(b, options: .caseInsensitive)
error: repl.swift:3:3: error: value of type 'String' has no member 'compare'
a.compare(b, options: .caseInsensitive)
~ ^~~~~~~

tera · August 9, 2023, 2:48am

You are right, I did have another file with import Foundation in it and although I was getting errors about NSString not found, there weren't errors about compare, hence I thought it's available.

Found this (no Foundation):

print(one == another) // false
print(one.lowercased() == one) // true
print(another.lowercased() == another) // true
print(one.lowercased() == another.lowercased()) // false
print(one.uppercased() == another.uppercased()) // true

SDGGiesbrecht · August 9, 2023, 2:50am

Compatibility decompositions are concessions foisted on Unicode by the need to be compatible (as a superset) with earlier encodings. Characters that have a compatibility decomposition would never have been encoded in Unicode if they had been proposed directly to it. Generally, if you are even using them, you are doing something wrong, because the sorts of things they encode (such as ligation) are not a property of the text, but of display formatting. In Unicode’s opinion, they do not belong in raw strings in the first place. However, folding them involves a loss of information on that formatting level, so it cannot safely be applied to legacy text that does make use of them. (In the example of “ff”, not every pair of “f”s is supposed to be ligated in display; those which straddle the halves of a compound word are supposed to remain separate. You cannot recover that distinction again by a cursory look at the context, you need to do it manually or by dictionary lookup.) That is fundamentally different from canonical decompositions like “ñ”, where “recovery” is trivial so normalization isn’t lossy.

If you want to know if two strings are equivalent in compatibility terms, then Foundation’s decomposedStringWithCompatibilityMapping is what you are looking for.
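A rough sketch of what that looks like, assuming Foundation is available; compatibilityEqual is a hypothetical helper name, not an existing API:

```swift
import Foundation

let plain = "office"
let ligated = "o\u{FB03}ce"   // U+FB03 LATIN SMALL LIGATURE FFI

// Hypothetical helper: compare two strings after compatibility decomposition (NFKD).
func compatibilityEqual(_ a: String, _ b: String) -> Bool {
    a.decomposedStringWithCompatibilityMapping
        == b.decomposedStringWithCompatibilityMapping
}

print(plain == ligated)                      // false
print(compatibilityEqual(plain, ligated))    // true
```

Note that this stays case-sensitive: it only folds the compatibility characters, so "Office" and "office" remain unequal.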

tera · August 9, 2023, 2:54am

Heh, not at all, there's just no capital version of that ligature, hence it's converting to normal characters.
Try this one instead:

"BEĲING"
"BEIJING"

Now Xcode and Safari disagree about these strings equivalence! And Finder allows creating these two files in the same folder.
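The difference between the two examples can be seen directly in the case mappings. A small sketch, standard library only, with the characters spelled as scalar escapes:

```swift
let ffi = "\u{FB03}"   // ﬃ, LATIN SMALL LIGATURE FFI: no single-character uppercase form
let ij = "\u{0133}"    // ĳ, LATIN SMALL LIGATURE IJ: has an uppercase partner, U+0132

// The ffi ligature uppercases to three ordinary letters...
print(ffi.uppercased() == "FFI")                 // true
print(ffi.uppercased().unicodeScalars.count)     // 3

// ...while ĳ uppercases to a single character, which is still not "IJ".
print(ij.uppercased() == "\u{0132}")             // true
print(ij.uppercased() == "IJ")                   // false
```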

wadetregaskis · August 9, 2023, 3:08am

SDGGiesbrecht:

Compatibility decompositions are concessions foisted on Unicode by the need to be compatible (as a superset) with earlier encodings. Characters that have a compatibility decomposition would never have been encoded in Unicode if they had been proposed directly to it. Generally, if you are even using them, you are doing something wrong, because the sorts of things they encode (such as ligation) are not a property of the text, but of display formatting. In Unicode’s opinion, they do not belong in raw strings in the first place. However, folding them involves a loss of information on that formatting level, so it cannot safely be applied to legacy text that does make use of them. (In the example of “ff”, not every pair of “f”s is supposed to be ligated in display; those which straddle the halves of a compound word are supposed to remain separate. You cannot recover that distinction again by a cursory look at the context, you need to do it manually or by dictionary lookup.) That is fundamentally different from canonical decompositions like “ñ”, where “recovery” is trivial so normalization isn’t lossy.

The problem is, while it explains how we got here, it doesn't help with cases like simple text search in my app, where users quite rightly expect ligatures to be irrelevant w.r.t. finding matches in the text.

"office" is canonically equivalent to "oﬃce" to a human, one could argue, using the very definition of canonical used by Unicode (two code point sequences that have the same appearance and meaning). Again, for some fonts and/or renderers. But I think it's fair to say that minor rendering differences in ligatures like this aren't what a layperson would consider significant differences in appearance. (font and design nerds aren't laypeople, sorry )

Yes, I can use NSString's comparison methods to help with this, but that has multiple problems:

- Some library code implicitly uses string equality (== operator), beyond my control.
- Most people don't know about these edge cases, so they'd never know they have to use something other than the == operator.
- Foundation isn't desirable as a dependency in some contexts (e.g. some "server" programs, programs that want to be portable across operating systems, etc).
  - I haven't tested if Foundation on Linux & Windows has all this functionality and behaves identically, but in general Foundation can't be assumed to work the same across operating systems, from what I've seen and heard.

Anyway, I suppose this horse has already left the barn.

It'd be nice to see improved Unicode support in Swift's standard library, though - e.g. the ability to compare strings for "compatible" equivalence, to explicitly control normalisation, etc.

xwu · August 9, 2023, 3:23am

For sure, there are areas of improvement to be made here (see String case folding and normalization APIs)—and it would be good to see some movement here. I'm not in a position to champion this at the moment, but if someone wants to run with it...

taylorswift · August 9, 2023, 3:46am

moreover, there’s a typosquatting vulnerability in ligature comparisons. you don’t want to say, match a domain name with an == implementation that decomposes ligatures.

(well maybe not domain names, but other kinds of unicode-supporting identifiers.)

SDGGiesbrecht · August 9, 2023, 3:49am

Yes, having compatibility equivalence methods in the Standard Library is a shoo-in. I think the string team’s rough target design is lazy views (which would then be open to use with any collection algorithm).

wadetregaskis · August 9, 2023, 2:18pm

Yes I do. What we actually don't want are domain name registrars thinking ligatures represent distinct character sequences and allowing two registrations that appear virtually if not literally identical to end-users.

I also don't know if DNS actually has this vulnerability, though I'd be surprised if it didn't given all its past typo-squatting problems.

bbrk24 · August 9, 2023, 2:23pm

Just finding ligatures isn't sufficient for that, consider a (U+0061) vs а (U+0430).
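For illustration, a short standard-library sketch of that pair. The two letters are unrelated code points, and neither canonical nor compatibility normalization maps one onto the other:

```swift
let latin = "a"               // U+0061 LATIN SMALL LETTER A
let cyrillic = "\u{0430}"     // U+0430 CYRILLIC SMALL LETTER A

print(latin == cyrillic)      // false

// Print the code point behind each look-alike.
for text in [latin, cyrillic] {
    for scalar in text.unicodeScalars {
        print("U+" + String(scalar.value, radix: 16, uppercase: true))
    }
}
// U+61
// U+430
```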

wadetregaskis · August 9, 2023, 2:27pm

Though thinking more about this in the context of users searching textual documents, one typically doesn't want case sensitivity there either. Yet == is case-sensitive, and I'm not sure that's wrong.

So maybe my gripes with == here are misguided. The very real problem remains of library code using ==, but now I'm thinking that the root problem there really is that library code, and the only viable solution is fixing it to use a more appropriate string comparison method.

It sucks that the answer actually seems to be "developer education", as such, since that's a really hard problem. But I don't see a way around that, realistically (one could hypothetically remove built-in String equality in order to force developers to use a more thought-provoking method, like compare(_:options:), but that's obviously (a) too late to consider and (b) likely to cause a riot).
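As a sketch of what "a more appropriate comparison" might look like for document search, assuming Foundation is acceptable: searchKey and looselyContains are hypothetical helper names, and folding with compatibility decomposition plus lowercasing is just one possible choice of leniency.

```swift
import Foundation

// Hypothetical helper: fold a string into a key for loose matching by
// applying compatibility decomposition (NFKD) and then lowercasing.
func searchKey(_ text: String) -> String {
    text.decomposedStringWithCompatibilityMapping.lowercased()
}

// Hypothetical helper: substring search on the folded keys.
func looselyContains(_ haystack: String, _ needle: String) -> Bool {
    searchKey(haystack).contains(searchKey(needle))
}

let document = "The head o\u{FB03}ce is closed."    // contains U+FB03 (ﬃ)
print(document.contains("Office"))                  // false
print(looselyContains(document, "Office"))          // true
```

The original text is left untouched; only the keys used for matching are folded, so the ligature information is not lost from the document itself.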

wadetregaskis · August 9, 2023, 2:29pm

Ugh, great, yet another example where Unicode "canonical" comparison doesn't work properly. Or at least, not as implemented in Swift String.

tera · August 9, 2023, 3:04pm

operator ~~ that does case insensitive comparison

import Foundation   // needed for compare(_:options:)

infix operator ~~ : ComparisonPrecedence

extension String {
    static func ~~ (lhs: Self, rhs: Self) -> Bool {
        lhs.compare(rhs, options: .caseInsensitive) == .orderedSame
    }
}

print("oﬃce" == "office")       // false
print("oﬃce" ~~ "office")       // true
print("beĳing" == "beijing")    // false
print("beĳing" ~~ "beijing")    // false
print("HATE" == "НАТЕ")         // false
print("HATE" ~~ "НАТЕ")         // false

When searching, Safari treats the two "office" spellings and the two "beijing" spellings as the same, while Xcode treats only the "office" pair as the same and the "beijing" pair as different. Apart from the last "homograph spoofing" pair, the strings do look different, both on this site and in Xcode.

My view: those comparisons should give false for all entries. A potential fix could be at the "Unicode level" itself: for every ligature (that we are stuck supporting) make sure there are both lowercase and uppercase versions, then we won't have a crazy exceptional situation that "ﬃ".uppercased() == "FFI".

Generalising: for any a and b strings (characters, grapheme clusters, or whatever the proper nomenclature is) the following precondition should be true:

precondition(
    a != b ||
    (a.lowercased() == b.lowercased() && a.uppercased() == b.uppercased())
)
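One possible reading of that property, offered as an assumption rather than as the exact intent, is that lowercasing and uppercasing should never disagree about whether two strings match. A sketch that checks this for a given pair (caseMappingsAgree is a hypothetical name):

```swift
// Hypothetical helper: true when lowercasing and uppercasing give the same
// verdict on whether the two strings are equal.
func caseMappingsAgree(_ a: String, _ b: String) -> Bool {
    (a.lowercased() == b.lowercased()) == (a.uppercased() == b.uppercased())
}

print(caseMappingsAgree("\u{FB03}", "ffi"))    // false: only the uppercased forms match
print(caseMappingsAgree("\u{0133}", "ij"))     // true: neither form matches
print(caseMappingsAgree("Office", "office"))   // true: both forms match
```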

wadetregaskis · August 9, 2023, 3:32pm

I can see a reasonable argument for that re. ==, but not for String.compare(_:options:) with .caseInsensitive.

For String.compare(_:options:) with .caseInsensitive, it seems apparent that "case insensitive" is taken as meaning case folding (I bet the implementation, that calls ICU or whatever, is making exactly that mapping between option flags). And it's actually an interesting naming question - in plain English "case" just refers to 'A' vs 'a' but in Unicode parlance "case" seems to be way broader (and I assume some languages have a broader notion of "case" too, than English does). So "case insensitive" in Unicode parlance actually is an acceptable name, I think. It could be documented better, though, in the String API.

Why is that "crazy"? The ligature "ﬃ" is simply an alternative form of the three ASCII characters "ffi", intended as a hint to text renderers but with absolutely no semantic difference in English*. There are no guarantees it will render differently, even - the text renderer might choose to use that ligature even if the raw string doesn't, or the font might lack that ligature so the renderer decomposes it to its constituent characters, or the font might contain the ligature but defines it as visually identical to the sequence of decomposed characters.

Note that this is not the case for all ligatures, e.g. æ is formally defined as a ligature in Unicode (even though arguably it's not, in a renderer sense, since no [English] human would merge those two characters), but it is not necessarily semantically equivalent to ae because it's actually a distinct character in some languages. I'm not sure how to handle that complication… maybe in e.g. English locales it should be considered semantically equivalent, but in e.g. Danish locales it should not?

The phrase "case-preserving but -insensitive" comes to mind. As a default, it's better to err on the side of being permissive (oﬃce == office etc) because that's safer, either literally (e.g. domain name spoofing) or figuratively (e.g. text searching, where false positives are less problematic than false negatives). But it's also important to correctly preserve the original encoding in case the distinctions are important for some use or in some context.

That might inform how Unicode enhancements to String should be implemented (such as whether String should force a particular normalisation, or whether APIs should be structured like a.transform == b.transform vs a.compare(b, options: …)).

* = An interesting question is what the expectations are in other locales. I have no knowledge there.

tera · August 9, 2023, 3:43pm

By that I mean we have a crazy situation on our hands:

"oﬃce" ~~ "office"       // true
"beĳing" ~~ "beijing"    // false

And I am arguing that (1) the two should give the same result and (2) the "false" is better here, see below.

I believe it is the opposite of safety... It would be safer to treat them differently (always), so you can easily tell that the two strings are different. That holds for domain spoofing on the user side (and the domain registrars should not allow those ligatures in the first place) and for text, where false positives would be dangerous simply because they are inconsistent: "I searched for 'oﬃce' and replaced it with 'headquarters' and that worked for both spellings, so I'll just go ahead and replace all 'beĳing's with 'peking's and surely that will handle both spellings, right?"

wadetregaskis · August 9, 2023, 4:15pm

Can you elaborate how treating them differently helps?

e.g. if in your browser you wrote "microsoftoﬃce.com" and it took you to a different site than "microsoftoffice.com", because your browser & DNS think those are distinct strings, then that could be the basis for a scam (assuming "microsoftoffice.com" were a legitimate website - surprisingly it's currently unresolving; you'd think Microsoft would have grabbed that and made it redirect at least).

e.g. if in your DNS server code you handle new registrations by checking for existing ones, and you treat ligatures as distinct from their component characters, then you're going to allow registration of confusingly misleading domain names (as in the above example).

I say it's safer to be more lenient in equality checks by default because otherwise avoiding any of the above problems requires extra work by developers, that they often don't even realise they need to do.

These problems are in fact real, not hypothetical, sadly. Many very smart people worked on DNS internationalisation (and libraries and servers) and still screwed it up. If string equality in all these languages was not fooled by ligatures, it's much less likely we'd have these problems - those people would instead have had to go out of their way to cause these problems, which (I like to think) they would not have done.

It should ignore ligatures, yes. That it might discard the ligatures is fine (certainly the Unicode consortium currently thinks so - they wish they'd never added ligatures), as in it is the least problematic result. Or do I misunderstand why you're implying this is bad?

The more we discuss this, the stronger I feel that String should be more lenient in what it considers equal. Swift has gone to great lengths to stress that strings are not mindless sequences of bytes, to the point of making indexing into strings ergonomically difficult and computationally expensive - blasphemy by the standards of many preceding languages! I've always supported Swift in doing this, but I'm beginning to think Swift actually hasn't gone far enough.

If people really want to compare raw bytes of a string, they can (at their own peril), but they should have to do extra work to do so. They should have to turn off case-folding, or ligature equality, etc, if that's really what they want.
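For reference, a standard-library sketch of what that "extra work" looks like today: going through the unicodeScalars or utf8 views bypasses canonical equivalence entirely.

```swift
let composed = "ca\u{00F1}a"       // precomposed ñ (U+00F1)
let decomposed = "can\u{0303}a"    // n + combining tilde (U+0303)

// String equality uses canonical equivalence.
print(composed == decomposed)                                              // true

// Comparing the underlying scalars or UTF-8 bytes is stricter.
print(composed.unicodeScalars.elementsEqual(decomposed.unicodeScalars))    // false
print(composed.utf8.elementsEqual(decomposed.utf8))                        // false
```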

English upper vs lower case is still undecided in my mind. Having == treat "A" as equal to "a" would be a pretty significant departure from convention. But then, so was treating strings as Unicode at all! My goodness the howling and whinging from the C/C++ and Python and Win32 people, years ago. Yet we look back now and their hesitance (obstinance, even) seems comical and to have not aged well. So it'll be interesting to have the benefit of hindsight on this discussion, some years from now.

QuinceyMorris · August 9, 2023, 4:32pm

At the risk of being controversial…

It's not Unicode's business to have anything to do with ligatures as a typesetting feature — that is, as part of the process of constructing the visual representation of some text. There is no such thing as a "ligature character" (please excuse my dragging the word "character" in here). In computerized typography terms, there's only a ligature glyph.

However, Unicode does and should have a graphical character that looks like a ligature, so that we can name and talk about the process of ligature formation in typography. This allows us to say things like this:

On the printed page, oﬃce is how the string "office" appears when rendered by typographically sophisticated software.

Note that in this case, we're just displaying a pictograph of a ligature in that bolded word, and it's not a typographic decision whether it's rendered as a ligature or not. (It would in fact be a terrible error to break it back down into two "f" glyphs and an "i".)

Similarly, no one is going to be surprised that "a×is" and "axis" are string-unequal. (That's a multiplication sign in the first quoted word.)

The whole point of Unicode, right from the beginning, was to abstract away from typographic representation. That doesn't mean that Unicode can't be "aware" of typographic representation, so that we can talk about it using Unicode strings, but typography operates at a whole different level.

As @SDGGiesbrecht said, this is entirely different from the composed/decomposed distinction, because that's not about what the final glyphs look like, it's about how the "same character" (same grapheme cluster) is represented in terms of Unicode code points.

I'm fine with the concept of having convenience functions that deliberately obscure the distinction between these things, but we still need to keep the fundamental concepts of Unicode in mind.

tera · August 9, 2023, 4:34pm

Easy. When someone sends me the microsoftoﬃce.com link the first thing I'd do is paste it in a few text editors I trust (to not do the ligature folding), type microsoftoffice.com myself and compare the two strings via "search". Then I'd be alerted that the two strings are actually different and won't open the link. And if we push the default behaviour to do the folding – it would be harder for me to find a good text editor that doesn't do the folding to do the comparison, I'd be more easily fooled that the two strings are the same.

If you look at the above funny SOS picture I'd like to be alerted that there's some hidden message in there rather than not see it and forward an innocently looking message.

I believe ligatures are a bad thing to begin with. We should probably deprecate them somehow, pushing users to reach for other means (tracking, kerning, etc). And even if we are stuck with them forever for compatibility reasons, the best course of action would be to not help with those automatic folding conversions. If that makes users' lives somewhat harder in some respects (they have to search and replace "office" twice), that's fine: users will be keener to move away from those legacy constructions, and they will thank us in other situations, when having the two "beijing" spellings compare as different helps them reveal spoofing or other differences that should or should not be there.