Case-insensitive UTF8 comparison

How do I go around implementing UTF8 byte comparison in a case-insensitive way?
Or is there already a library for this?

I cannot import the whole Foundation just for this, so String.caseInsensitiveCompare is not an option.

Case insensitive string comparison is quite complex. Even without case insensitivity it‘s not trivial, due to various unicode specifics.

If you‘re only interested in ASCII it is simple to roll your own small comparison function. But if you need to handle international text you should use Swift‘s provided String implementation.

1 Like

I do know it's a daunting and complex task to implement that, but that was fine by me.
I thought maybe I can get some pointers on how to do it, or maybe there are some shortcuts I could take through Swift's String processing like the UnicodeScalar or UTF8View types or such.

For now I'm just using String.uppercased() ... I'll benchmark these later. I assume it's decently slow, but it's better to have the slow function than not have any. I'll also try to look at how Swift handles this, maybe there are shortcuts I could take.

This is the related PR: Allow binary data in `Name` + Better case-insensitive equality check by MahdiBM · Pull Request #21 · MahdiBM/swift-dns · GitHub

Case-insensitive Latin-alphabet comparison is language-specific. You'll need to know the language your user is reading/writing in before you can correctly apply it.

If your use case does not involve presenting text to a user (e.g. you just need a sort order or something), you can use the "C locale" which mostly defaults to English rules, however you still need the Unicode database in order to compare arbitrary Unicode code points, which implies you need Foundation.

3 Likes

Hmm this is about domain names. I think String.uppercased() is already doing an adequately good job.
I'm thinking if I want to expose a better comparison function and all, I can do it under a trait that enables usage of Foundation :thinking:. Alongside other utilities that would need Foundation.

Domain names are technically ASCII. Unicode is simulated via punycode encoding. There are various comparison rules offered by the Unicode Consortium for comparing internationalized domain names safely, and case sensitivity is only part of the problem space.

Just FYI. :grimacing:

2 Likes

Yeah they are technically ASCII but in the real world I've noticed some non-ASCII domain that browsers would still accept, and also I've noticed some DNS resolvers like dig, do accept non-ASCII bytes.
Thanks for the heads up though, I had seen some DNS RFCs mention domain names need to be compared case-insensitively, but after digging a bit more I noticed there is the RFC 3490 which is dedicated to "Internationalizing Domain Names in Applications (IDNA)". I'll save a note somewhere to come back to this later. For now I'm happy with String.uppercased() which looks to be doing an adequately good job.

I mentioned benchmarking before ... it appears for a domain like google.com. the throughput of equality checking is ~16.7 million/s, and for googße.com. vs googSSe.com. (which are equal when compared case insensitively) the throughput is 2.9 million/s on the same machine, which is 5.7 times slower (benchmark, code).
So I guess String.uppercased() generally makes the equality check 5-6 times slower.

If performance really matters, it might make sense to optimise for the common cases:

First, check for bitwise equality. When that fails, check for simple ASCII case-insensitive equality. Only when the previous test detects non-ASCII characters use a more expensive check, like comparing the uppercased() strings.

When dealing with DNS, it might be more important to consider correctness, though. That really depends on the use case and specification. Comparing uppercased() strings produces different results than proper case-insensitive comparison—and thus might open security holes.

2 Likes

Hmm right, that's what I've tried to do. I also tried a few variations but this one seemed to be the most reasonable and speedy one.
I think it might be a good idea to add an ASCII short-circuit as well.

Good point. So far no logic relies on the equality check, but I'll need to keep that in mind. There is a fine chance that some logic crosses paths with the equality check at some point.

1 Like

With this PR adding ASCII short-circuit (code changes), I now have these numbers when equality checking for 2 names:
Identical Names: max 20 million/s
Lowercased ASCII vs fully uppercased: max 16.6 million/s
Lowercased UTF8 vs fully uppercased: max 2.8 million/s

They really are all ASCII. Non-ASCII domain names are encoded with punycode (as I mentioned earlier) but the actual textual representation is always ASCII. Take a look at this Wikipedia article for a good overview.

2 Likes

Another way you could compare is by masking both bytes with byte & 0b11011111 which will unset the ASCII uppercase bit on the letters. Assuming you’ve previously validated that both domain names only contain valid characters, this check will match case-insensitively.

1 Like

Hmm good point. I wasn't sure how those other bytes are allowed, considering RFC 1035 was pretty clear that "letter–digit–hyphen" is what is allowed in a domain name. Thought maybe a newer RFC has overruled that.

In that case UTF8 checking might be useless. I guess that's the next thing i'm taking care of.

Right ... though I dont think it'll make a difference comapred to what I have since we'll still have 2 bitwise operations.

Interesting ... I already had tests for this UTF8 vs punycode situation so I had to dig into it why what I do looks to be wrong...

So I was basing everything on what dig does when you enter a command on CLI (e.g. dig @8.8.4.4 helooß.co.uk. A). In that case dig just encodes the string as UTF8. So for ß, dig encodes [195, 159] on the wire. That's one of the reasons UTF8 had stuck in my mind.

Trying Chromium and Safari though, they do use punycode ... so i think dig is just doing a literal search for whatever characters you enter, instead of using punycode.

Considering the dig behavior surprised me, I did a little bit of digging on that front as well.
dig's manpage mentions it does support IDN and thus puny code, but only if it was enabled at compile time. On macOS's default dig, that's not enabled. Even then you might need to use the +idnout/+idnin flags to have punycode support enabled, I haven't tried.

I ended up going for IDNA and implementing it in this PR: Implement IDNA and Punycode by MahdiBM · Pull Request #33 · MahdiBM/swift-dns · GitHub.
Thanks to @grynspan for pointing me to the right direction.

Implements rfc5891 "Internationalized Domain Names in Applications (IDNA)" and friends. Uses them in Name .
Uses Unicode IDNA test suite with ~6400 test cases, to make sure of the compatibility.
Runs each test case extensively so each test case might even result in 2-3-4-5 test runs.

I'm thinking of possibly putting all the IDNA stuff into a separate package, but haven't decided yet. IDNA is specific to domain names (not arbitrary text) so I'll have to think more to see if it'll be useful outside a DNS library.

<meant to post that to another thread>

I ended up publishing the IDNA stuff as its own separate package: GitHub - MahdiBM/swift-idna: An implementation of IDNA - Internationalized Domain Names in Applications

Other than fixing the "140K lines of C which results in GitHub making it look like this is a C package" problem, I'm thinking of putting importing/depending-on swift-idna behind a Trait since most won't need it anyway and I think it adds a decent amount to the binary size.