Swift library for ICU (uspoof et al)

wadetregaskis · January 8, 2024, 6:06am

Continuing the discussion from Pitch: Unicode Processing APIs:

I'd be happy enough if there were a robust ICU Swift package (as in, just on GitHub or wherever). Indeed it doesn't have to be built into the OS, let alone the standard lib. Though there are benefits to being in the OS, like amortising the library size across many uses, not to mention more reliable updates (at least in principle).

I'm aware there's icu-swift - and I applaud @allevato for creating it, it looks like it was a lot of work! I forget now if there were additional reasons (maybe build problems?) but I think I was discouraged from using it in a real app because it hasn't been updated in seven years. I did consider adopting it, but that seems like more of a commitment than I can plausibly take on right now.

But if foundational libraries like uspoof aren't available, it's hard to build context-specific ones atop them. Being available as native Swift, rather than an ergonomically-poor bridge from C, is important (IMO) to encouraging their use at all.

I'm aware of uspoof's limitations - I experimented with it a couple of months ago and it was trivial to find really egregious cases that it completely misses, like the ķ case you noted.

As a consequence, actually, I've slowly been dabbling with a method of (and library for) visual comparison - loosely based on a paper I found - which looks promising and not all that hard, but I don't think that suits all uses either. I'd rather use it in combination with uspoof - and any other relevant libraries, too. Belt and suspenders, as it were.

I actually did look around for this, and found only a handful of meaningful papers - only one of which included enough detail to even try to replicate their work. If you happen to know of an enumeration or index of relevant papers, I'd be appreciative!

sspringer · January 8, 2024, 10:42am

Well, wouldn't you just have to try to recognize e.g. letters with diacritics (there is a corresponding ICU property) using a “classical” character recognition model trained on letters without diacritics? Even the general task of finding similar-looking characters should be easy to implement (one might search for a model or algorithm tested with the MPEG-7 dataset). The character images should be available from the Unicode website.

Update: Hmm. After some rethinking, I guess a generic tool for recognizing similar shapes might fail here. Also, if you really wanted to compare all characters with each other, you would need to compare 100,000 images every second to finish in about 3 days. But for detecting spoofed Latin characters, the first idea could work (you might try to recognize all characters).
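For anyone who wants to check that estimate, here is a back-of-the-envelope sketch. It assumes roughly 150,000 assigned characters (Unicode 15.1 lists 149,813) and one image per character, which ignores fonts, sizes and renderers entirely. The variable names are just for illustration.

```swift
// Assumption: one rendered image per assigned character, Unicode 15.1 count.
let characterCount = 149_813.0
let comparisonsPerSecond = 100_000.0   // the rate quoted above
let secondsPerDay = 86_400.0

// Every ordered pair (a, b), which is what a naive double loop would visit.
let orderedPairs = characterCount * characterCount
// Each unordered pair once, skipping self-comparisons.
let unorderedPairs = characterCount * (characterCount - 1) / 2

let daysOrdered = orderedPairs / comparisonsPerSecond / secondsPerDay
let daysUnordered = unorderedPairs / comparisonsPerSecond / secondsPerDay

print("ordered pairs:   \(daysOrdered) days")
print("unordered pairs: \(daysUnordered) days")
```

Under these assumptions the naive double loop comes out at roughly 2.6 days, and comparing each pair only once roughly halves that to about 1.3 days.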

Update 2: …But in addition, beware of combining characters.
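To illustrate why combining characters matter here, a small sketch using only the Swift standard library: the same visible letter can be one code point or a base letter plus a combining mark, so a comparison that works per code point sees different input for what is a single glyph on screen. The variable names are only for illustration.

```swift
let precomposed = "\u{00E9}"   // é as a single code point
let decomposed = "e\u{0301}"   // e followed by COMBINING ACUTE ACCENT

// Swift's String equality uses canonical equivalence, so these are equal...
print(precomposed == decomposed)                 // true

// ...and both are a single Character (grapheme cluster)...
print(precomposed.count, decomposed.count)       // 1 1

// ...but the underlying scalars differ in number and value.
print(precomposed.unicodeScalars.count)          // 1
print(decomposed.unicodeScalars.count)           // 2
print(Array(precomposed.unicodeScalars) == Array(decomposed.unicodeScalars)) // false

// The second scalar of the decomposed form is a nonspacing mark.
let mark = Array(decomposed.unicodeScalars)[1]
print(mark.properties.generalCategory == .nonspacingMark) // true
```

So the unit being rendered and compared probably needs to be the grapheme cluster rather than the individual code point, which makes the space of "characters" to compare larger than the code point count suggests.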

wadetregaskis · January 8, 2024, 5:22pm

I don't think there's much utility in limiting oneself to diacritics or similar specific glyph features. Diacritics at least generally do look significantly different from their absence - Unicode is full of much more egregious cases where two glyphs for two different code points look literally identical, not to mention the sets which are practically identical because they differ only in imperceptible ways, like a fraction of a pixel shift in kerning.

Scale is indeed a challenge here - it's a comparable problem to computing rainbow tables. I haven't gotten to the point of actually trying to precompute the collisions for an entire glyph set yet, but I suspect it will be on the order of weeks to months [on consumer hardware]. And that's just for a single configuration - a specific font at a specific size using specific style and weight, in a specific renderer, in a specific context (Apple's font rendering differs subtly depending on which APIs you use and where you're rendering to).

Which is all part of why I think faulty but fast options, like uspoof, are still valuable. Malicious agents face the same computational scale problems, and have access to the same libraries like uspoof and the general spoofing metadata precomputed / collated in Unicode, so it's in a way a fair match. If defenders don't make use of the same conveniences, they lose markedly - attackers can find homoglyphs virtually instantly with uspoof & similar libraries, so if defenders aren't pre-screening with those same methods they're opening themselves up to terrible compute-amplification attacks.
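A minimal sketch of that pre-screening idea, to make it concrete. Everything here is hypothetical: `confusables` is a tiny hand-picked table rather than Unicode's real confusables data, `skeleton` skips the normalization steps a real implementation would perform, and none of this is uspoof's API. The point is only the shape of the pipeline: a cheap table lookup first, with the expensive visual comparison reserved for whatever the table cannot decide.

```swift
// Hypothetical, hand-picked subset of confusable mappings, for illustration only.
// A real implementation would use Unicode's confusables data or call uspoof.
let confusables: [Unicode.Scalar: String] = [
    "\u{0430}": "a",  // CYRILLIC SMALL LETTER A
    "\u{0435}": "e",  // CYRILLIC SMALL LETTER IE
    "\u{043E}": "o",  // CYRILLIC SMALL LETTER O
    "\u{0440}": "p",  // CYRILLIC SMALL LETTER ER
    "\u{0441}": "c",  // CYRILLIC SMALL LETTER ES
    "\u{03BF}": "o",  // GREEK SMALL LETTER OMICRON
]

/// Cheap stage: map each scalar through the table to build a comparison key.
/// Simplification: no normalization before or after the mapping.
func skeleton(_ text: String) -> String {
    var result = ""
    for scalar in text.unicodeScalars {
        if let replacement = confusables[scalar] {
            result += replacement
        } else {
            result.unicodeScalars.append(scalar)
        }
    }
    return result
}

enum Verdict {
    case sameText          // identical strings, nothing to flag
    case confusable        // caught by the cheap table lookup
    case needsVisualCheck  // the table cannot decide; fall back to the slow path
}

func prescreen(_ candidate: String, against known: String) -> Verdict {
    if candidate == known { return .sameText }
    if skeleton(candidate) == skeleton(known) { return .confusable }
    return .needsVisualCheck
}

// "paypal" with a Cyrillic а in second position is caught cheaply.
print(prescreen("p\u{0430}ypal", against: "paypal"))   // confusable
// A digit one in place of "l" is not in this table, so it goes to the slow path.
print(prescreen("paypa1", against: "paypal"))          // needsVisualCheck
```

Only the `needsVisualCheck` results would ever reach the rendering-based comparison, which is where cases the tables miss would have to be caught.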