Pitch: Unicode Processing APIs


ICU's uspoof checker implements confusable detection as specified by UAX39. That means we could implement exactly the same functionality in native Swift, without wrapping ICU, and I have actually considered doing that.

The reason I didn't is because confusable detection is very complex, and the services exposed by USpoofCheck are possibly best left as implementation details for use by experts. Good spoof-checking implementations need to be deliberate about which UAX39 options they select, and will typically want to augment it with other, domain-specific or contextual checks. Additionally, modern computing platforms could potentially offer better spoof detection than USpoofCheck; there's still a lot of research to be done.
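To give a sense of what "being deliberate about which UAX39 options you select" involves: UAX39 defines a ladder of restriction levels that a checker has to choose between (ICU exposes this via uspoof_setRestrictionLevel). The sketch below just names that ladder in Swift; it is purely illustrative, not an API proposal.

```swift
// The UAX39 / UTS #39 restriction levels, from strictest to loosest.
// Purely illustrative - ICU's real knob is uspoof_setRestrictionLevel.
enum RestrictionLevel: Comparable {
    case asciiOnly              // only ASCII characters allowed
    case singleScript           // all characters resolve to a single script
    case highlyRestrictive      // single script, or Latin + {Han, Kana} / {Han, Bopomofo} / {Han, Hangul}
    case moderatelyRestrictive  // as above, plus Latin mixed with most other recommended scripts
    case minimallyRestrictive   // any mixture of recommended scripts
    case unrestricted           // no restriction at all
}
```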

I know this because I ported most of Chromium's IDN spoof checking to a proof-of-concept extension library for WebURL. AFAIK no system or third-party library exposes browser-quality spoof-checking of domains, so it's an interesting thing to experiment with, and I wanted to make sure WebURL exposed good APIs for it because the processing can be quite intensive.

It uses ICU's uspoof, so if you want to try it out on a Mac, first run `brew install icu4c`, then `swift test` from the command line (macOS's built-in ICU doesn't include uspoof, and I couldn't get Xcode to set the right include flags for the brew-installed library, but it all works from the command line).

So what does Chromium do for spoof-checking? Well, it does use USpoofCheck, but it configures a custom set of allowed characters starting with UAX31's "Candidate Characters for Inclusion in Identifiers" with a bunch of extra characters removed - e.g. U+2010 HYPHEN is removed because it is confusable with U+002D HYPHEN-MINUS, and some characters are removed because they are allegedly rendered blank by certain default system fonts.
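As a rough illustration of that step (the data below is made up; a real allowed set is handed to ICU as a UnicodeSet via uspoof_setAllowedChars), the idea is simply "base set minus known-risky characters":

```swift
// Sketch of an "allowed characters" policy. The base set and removals are
// placeholders; Chromium's real data starts from UAX31's candidate
// identifier characters and removes entries such as U+2010 HYPHEN.
struct AllowedCharacterPolicy {
    var allowed: Set<Unicode.Scalar>

    init(baseSet: Set<Unicode.Scalar>, removals: Set<Unicode.Scalar>) {
        self.allowed = baseSet.subtracting(removals)
    }

    /// Returns true if every scalar in the hostname label is permitted.
    func permits(_ label: String) -> Bool {
        label.unicodeScalars.allSatisfy { allowed.contains($0) }
    }
}
```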

If there are no issues, it then checks a set of TLD-specific rules (e.g. Latin small letter thorn, þ, is only allowed in Icelandic domains: although it is used in that language, in other contexts it can be used to spoof b or p), and it additionally looks for digit lookalikes, Kana, and combining diacritics, which can be problematic even within a single script and trigger even more thorough processing.
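A toy version of the TLD-specific part might look like the following (the character lists and TLD assignments are invented stand-ins, not Chromium's actual tables):

```swift
// Characters that are only acceptable under specific TLDs in this sketch.
let restrictedToTLDs: [Character: Set<String>] = [
    "þ": ["is"],  // Latin small letter thorn: fine under .is, spoofs b/p elsewhere
]

/// Returns true if the label uses a TLD-restricted character under the wrong TLD.
func violatesTLDRules(label: String, tld: String) -> Bool {
    label.contains { character in
        if let allowedTLDs = restrictedToTLDs[character] {
            return !allowedTLDs.contains(tld)
        }
        return false
    }
}
```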

As part of that more thorough processing, it performs a kind of loose match against a database of "skeletons". Skeletons are also defined by UAX39, and uspoof.h does indeed include a function for getting the skeleton of a string.
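For reference, UAX39's skeleton transform is: normalize to NFD, map every character through the confusables table, then apply NFD again (ICU's uspoof_getSkeleton does this for you). Here is a toy version with a two-entry table, just to show the shape of the operation; the real table is Unicode's confusables.txt:

```swift
import Foundation  // for decomposedStringWithCanonicalMapping (NFD)

// A toy confusable table; the real one is Unicode's confusables.txt.
let confusablePrototypes: [Unicode.Scalar: String] = [
    "\u{0430}": "a",  // CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
    "\u{2010}": "-",  // HYPHEN                  -> HYPHEN-MINUS
]

/// NFD, map each scalar to its prototype, NFD again.
func toySkeleton(_ input: String) -> String {
    let nfd = input.decomposedStringWithCanonicalMapping
    let mapped = nfd.unicodeScalars
        .map { confusablePrototypes[$0] ?? String(Character($0)) }
        .joined()
    return mapped.decomposedStringWithCanonicalMapping
}
```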

Chromium's implementation doesn't just call that function as-is, though - it has a set of custom rules which generate up to 128 variants of each hostname, and a skeleton is generated for each of those. The skeletons are then tested against a database of almost 5000 of the most popular domains. I haven't ported any of this, because WebURL and the spoofcheck library are third-party packages, and I don't think many applications would be willing to accept the increased download size even if I went to the effort of implementing it.
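The final lookup then boils down to something like this (everything here, the variant generation and the domain list, is a placeholder for Chromium's much larger machinery):

```swift
/// Returns true if any variant of the hostname has a skeleton matching a
/// skeleton from the database of popular domains.
func looksLikePopularDomain(
    variants: [String],
    skeleton: (String) -> String,
    popularDomainSkeletons: Set<String>
) -> Bool {
    variants.contains { popularDomainSkeletons.contains(skeleton($0)) }
}

// Usage sketch: the database would be built offline from the popular-domain list.
// let database = Set(popularDomains.map(toySkeleton))
// let suspicious = looksLikePopularDomain(variants: hostnameVariants,
//                                         skeleton: toySkeleton,
//                                         popularDomainSkeletons: database)
```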

So that's a brief walkthrough of what a production-quality spoof-checker looks like; hopefully it illustrates that good detection is seriously involved. Even if we provided the functionality from ICU's uspoof.h, you'd still need very detailed study of several Unicode standards in order to use it effectively, and even then you'd likely need to augment it with context-specific logic (such as that for specific TLDs), before ultimately falling back to some kind of database.

Just providing USpoofCheck is not going to be useful for most people (and it may even be counterproductive if configured incorrectly, giving them a false sense of security); it is probably more productive to focus on higher-level but also more specific APIs -- e.g. APIs specifically for rendering domains.


But all of this analysis only looks at the text itself. Many more factors affect confusability; for instance, the font and the text size matter a great deal. The UAX39 data tries to take that into account:

The prospective confusables were gathered from a number of sources. Erik van der Poel contributed a list derived from running a program over a large number of fonts to catch characters that shared identical glyphs within a font, and Mark Davis did the same more recently for fonts on Windows and the Macintosh. Volunteers from Google, IBM, Microsoft and other companies gathered other lists of characters. These included native speakers for languages with different writing systems. The Unicode compatibility mappings were also used as a source. The process of gathering visual confusables is ongoing: the Unicode Consortium welcomes submission of additional mappings. The complex scripts of South and Southeast Asia need special attention. The focus is on characters that have Identifier_Status=Allowed, because they are of most concern.

The fonts used to assess the confusables included those used by the major operating systems in user interfaces. In addition, the representative glyphs used in the Unicode Standard were also considered. Fonts used for the user interface in operating systems are an important source, because they are the ones that will usually be seen by users in circumstances where confusability is important, such as when using IRIs (Internationalized Resource Identifiers) and their sub-elements (such as domain names).

This process of manual inspection is clearly not ideal. Additionally, the UI fonts on Apple platforms have been tweaked several times over the years, but Apple is not specifically mentioned as contributing confusable data (which doesn't mean they haven't). It is possible that Unicode's official data (used by ICU) is suboptimal on Apple devices; we can't say. What we can say is that the best confusable detection would account for the sophisticated text scaling, layout, and rendering performed by the OS frameworks.

Another thing to consider is that there are other sources of data which can sway the confusability analysis - for instance, you might already know which scripts the user understands, or you might have access to their bookmarks or browser history, letting you know that a domain which looks confusable is actually trusted.

I can't remember specifically where I read it, but I believe Chrome's omnibox does factor in these kinds of signals (because it's a browser; it can collect this data itself). Obviously that is quite sensitive information, and it would not be okay for every application to have access to it. Any API which considered this data would need to isolate it from the application to safeguard user privacy.

And if we really want to consider everything (and we do, especially if somebody is spoofing the hostname of a bank or online pharmacy), it's important to remember that Unicode isn't the only way that people spoof domains - elision (deciding how a long hostname gets truncated for display) is also far from trivial.

IMO, all of this points to the OS as the best place to offer high quality spoof-checking, which doesn't just depend on the text itself, but also on how the text is rendered, and what we can reasonably expect the reader to discern.

To be even more specific, I think there should be some kind of specialised Label for displaying URLs and hostnames, which automatically handles elision and Unicode rendering where safe. I wanted to write one for WebURL but will probably never find the time.
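To make the idea concrete, something along these lines, where the spoof check and the IDNA conversion are stubbed out; none of this is an existing or proposed API:

```swift
import SwiftUI

// Hypothetical stubs: a real implementation would run the spoof checks
// discussed above and IDNA's ToASCII conversion.
func isSafeToRenderAsUnicode(_ hostname: String) -> Bool {
    hostname.allSatisfy(\.isASCII)  // placeholder policy
}
func punycodeForm(of hostname: String) -> String {
    hostname  // placeholder: should apply IDNA ToASCII
}

/// Sketch of a specialised hostname label (not a real API).
struct HostnameLabel: View {
    let hostname: String
    var body: some View {
        // Show Unicode only when deemed safe; otherwise fall back to the
        // ASCII (Punycode) form, as browsers do. Elision policy lives here too.
        Text(isSafeToRenderAsUnicode(hostname) ? hostname : punycodeForm(of: hostname))
            .lineLimit(1)
            .truncationMode(.middle)
    }
}
```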


There was recently a high-profile incident which exposed the limitations of Unicode's manually curated confusable tables:

Google-hosted malvertising leads to fake Keepass site that looks genuine | Ars Technica

The issue is that U+0137 LATIN SMALL LETTER K WITH CEDILLA ( ķ ) wasn't listed as a possible confusable for U+006B LATIN SMALL LETTER K ( k ). Since they are both Latin-script characters, all of that advanced mixed-script analysis described above didn't kick in. It's not the first time the tables have missed a confusable, and it won't be the last.
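You can see how subtle the pair is with a couple of lines of Swift; under NFD the spoofed character is just k plus a combining cedilla, yet the confusables data never mapped one to the other:

```swift
import Foundation

let spoofed = "\u{0137}eepass.info"   // "ķeepass.info"
let genuine = "keepass.info"

print(spoofed == genuine)  // false, yet the two render almost identically

// NFD shows that U+0137 is k (U+006B) followed by COMBINING CEDILLA (U+0327).
let decomposed = spoofed.decomposedStringWithCanonicalMapping
print(decomposed.unicodeScalars.prefix(2).map { String($0.value, radix: 16) })
// ["6b", "327"]
```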

IMO, the best approach would be to use machine learning. If you search around, you'll find a fair amount of recent academic literature on the topic - it is promising, and there is evidence that it is being further developed with a view to possible production use.

TLDR: I don't think the standard library is the best place for this. Actually doing an excellent job (which is important - spoofing can have seriously bad consequences) requires data and tuning that is not really feasible outside of the OS.
