Swift needs String comparison suitable for search fields over user lists

TL;DR: Swift's various "-insensitive" searches (case, diacritic, width) are inadequate for searching for results in lists presented to users.

I use an iPad app that manages a library of songs. I was having some trouble with the search feature and an attempt to show the developer how it ought to work led me down this rabbit hole. The simplest version of this problem is:

Currently someone typing the search string “Bo's So” will never find a song called “Bo’s Song” in their library, because the punctuation characters are slightly different.

I'm not interested in calling either style “wrong;” they both occur all the time in practice and in some cases one or the other may be very difficult to type on an iPad.

This isn't a case, diacritic, or width difference, so none of the standard tools for normalizing strings or localized insensitive search work. The simplest hack I can think of is to strip all punctuation from both strings before doing a containment check.
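
A minimal sketch of that hack, to make the idea concrete. The function names here are made up for illustration, and "punctuation" is assumed to mean the Unicode punctuation general categories; lowercasing is added only as a crude stand-in for case insensitivity.

```swift
import Foundation

/// Hypothetical helper: true for scalars in any Unicode punctuation category.
func isPunctuation(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .connectorPunctuation, .dashPunctuation, .openPunctuation,
         .closePunctuation, .initialPunctuation, .finalPunctuation,
         .otherPunctuation:
        return true
    default:
        return false
    }
}

/// Hypothetical helper: drops punctuation scalars, then lowercases.
func strippingPunctuation(_ text: String) -> String {
    var kept = String.UnicodeScalarView()
    for scalar in text.unicodeScalars where !isPunctuation(scalar) {
        kept.append(scalar)
    }
    return String(kept).lowercased()
}

/// Containment check performed on the stripped forms of both strings.
func punctuationInsensitiveContains(_ haystack: String, _ needle: String) -> Bool {
    strippingPunctuation(haystack).contains(strippingPunctuation(needle))
}
```

With this, searching "Bo\u{2019}s Song" (curly apostrophe) for "Bo's So" (straight apostrophe) succeeds, because both reduce to forms starting with "bos so".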

But the problem is knottier than that simple hack can handle: I work with lots of Hawaiian songs, whose names often include ʻokinas (sounds like a glottal stop and is considered a letter in Hawaiian). Now, the official glyph for an ʻokina is Unicode 02BB, MODIFIER LETTER TURNED COMMA, which as its name implies is suitably classified as a modifier letter, not as punctuation. On an iPad that character appears to be impossible to type (even with a Hawaiian keyboard, unlike on iPhone!), and anyway even if it were possible but inconvenient, everyone is going to type some convenient and similar-looking punctuation character instead. Because it isn't punctuation, the hack described above won't work: you'll never find "ʻuliʻuli" (with ʻokinas) by searching for "'uli'uli" (with apostrophes) because "uliuli" is not a substring of "ʻuliʻuli".
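
The classification difference can be checked directly with the standard library's `Unicode.Scalar.Properties`. This is only a quick inspection snippet; the constant names are illustrative.

```swift
// Inspect the general category of the ʻokina and two common stand-ins.
let okina: Unicode.Scalar = "\u{02BB}"       // MODIFIER LETTER TURNED COMMA
let apostrophe: Unicode.Scalar = "\u{0027}"  // APOSTROPHE
let rightQuote: Unicode.Scalar = "\u{2019}"  // RIGHT SINGLE QUOTATION MARK

let okinaIsModifierLetter = okina.properties.generalCategory == .modifierLetter
let apostropheIsPunctuation = apostrophe.properties.generalCategory == .otherPunctuation
let rightQuoteIsPunctuation = rightQuote.properties.generalCategory == .finalPunctuation

print(okinaIsModifierLetter, apostropheIsPunctuation, rightQuoteIsPunctuation)
// prints "true true true"
```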

Looking for a general solution, I checked the Unicode category of modifier letters and found that many don't look like punctuation, so a quick hack based on the Unicode category isn't viable. I ended up special-casing ʻokinas to treat them as punctuation, but it was pretty unsatisfying. It seems to me that a general solution would normalize each modifier letter into some other, less esoteric character that appears similar.
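
Roughly, the special case amounts to something like the following sketch. All names are hypothetical, and the only exception handled is U+02BB; it is not a general answer for other modifier letters.

```swift
import Foundation

/// Hypothetical predicate: punctuation scalars, plus the ʻokina (U+02BB)
/// as a hand-picked exception, are ignored when searching.
func isIgnoredInSearch(_ scalar: Unicode.Scalar) -> Bool {
    if scalar == "\u{02BB}" { return true }  // special case: treat ʻokina as punctuation
    switch scalar.properties.generalCategory {
    case .connectorPunctuation, .dashPunctuation, .openPunctuation,
         .closePunctuation, .initialPunctuation, .finalPunctuation,
         .otherPunctuation:
        return true
    default:
        return false
    }
}

/// Hypothetical search key: ignored scalars removed, result lowercased.
func searchKey(_ text: String) -> String {
    var kept = String.UnicodeScalarView()
    for scalar in text.unicodeScalars where !isIgnoredInSearch(scalar) {
        kept.append(scalar)
    }
    return String(kept).lowercased()
}

/// Containment check on the search keys of title and query.
func titleMatches(_ title: String, query: String) -> Bool {
    searchKey(title).contains(searchKey(query))
}
```

Here both "\u{02BB}uli\u{02BB}uli" (with ʻokinas) and "'uli'uli" (with apostrophes) reduce to "uliuli", so the search finds the song.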

All that said, the Modifier Letters category is an artificial grouping. I bet there are plenty of other Unicode characters that are hard to type and are commonly represented as other easy-to-type characters. My general claim is that Swift should have facilities for better handling this use case, and my justification for it being in scope is that exactly the same motivation drives the existence of diacritic- and width-insensitive searching. It shouldn't be up to this song library application to discover which set of hacks will work out in practice because the book library application author will have to discover the same set.

Maybe the semantics of these searches ultimately ought to be specified by Unicode, but Swift can lead the way.

Thoughts?

Interesting case. I agree that current string search in most user-facing apps is not very usable in English; I can't imagine the problems in other languages.

Why does this need to be in the stdlib, instead of in a common search package, similar to say NIO?

I am all for the discussion of solving this common search problem, just wonder where it would ultimately be best distributed.

This sounds like the sort of problem that localizedStandardContains should solve? See localizedStandardContains(_:) in the Apple Developer Documentation.

I’ve not tested the above to know if it does though.
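
A quick way to check would be something like the snippet below. It is only a sketch: the variable names are illustrative, and the punctuation result is deliberately left without an expected value, since it depends on what Foundation actually does.

```swift
import Foundation

let title = "Bo\u{2019}s Song"  // curly apostrophe, as stored in the library

// Differs from the title only in case; expected to match.
let matchesIgnoringCase = title.localizedStandardContains("bo\u{2019}s so")

// Differs in the apostrophe character; this is the case in question.
let matchesStraightApostrophe = title.localizedStandardContains("Bo's So")

print(matchesIgnoringCase, matchesStraightApostrophe)
```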

Would be curious to hear how FuzzyMatch (GitHub: ordo-one/FuzzyMatch, "Fuzzy string matches at full speed") would work for you, in ED mode, the default (it should work pretty well if you just skip typing the apostrophe/punctuation at all).

Agreed that there should be a system-provided function providing the best possible string matching. Where that belongs, I’ll leave up to others.

Part of me is thinking additional normalization rules for cases like this would be great to have. Though I imagine it could be quite a while before anything could be ratified and implemented. Still, worth a shot?

For the ʻokina case: when building a multimedia application for a local Hawaiian historical site, a common character that was used was U+0060 (`), at least by the Hawaiian studies teacher at the time. This was on Macs, where one only had MacRoman. Others would use U+0027 ('), and then there are the smart quote variations. Thus, several characters could be mapped.

Out of curiosity, are you talking about virtual keyboards here that work differently on iPad and iPhone, in which case it could be an iOS bug?

Yes

Good point, yes it should! It does not :(

@dabrahams, did a small fix in "feat(minor): add confusable-character normalization" (Pull Request #11 by hassila, ordo-one/FuzzyMatch on GitHub), so if you get 1.1.0 and try it, it should work well for the test cases you mentioned; I added them to the test suite.

Agree it would be nice to have something like this built in, but I think the most useful thing would be something like the ED mode, which allows for user typos too. String search in general assumes that the input is correct, which it often is not when humans provide it, as in a search field.
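
To illustrate what typo tolerance can look like, here is a plain Levenshtein distance, assuming "ED" stands for edit distance. This is a textbook sketch with a hypothetical function name, not FuzzyMatch's implementation or API.

```swift
/// Levenshtein distance: the minimum number of single-character insertions,
/// deletions and substitutions needed to turn `a` into `b`.
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }

    // previous[j] holds the distance between the first i-1 characters of s
    // and the first j characters of t.
    var previous = Array(0...t.count)
    for i in 1...s.count {
        var current = [i] + Array(repeating: 0, count: t.count)
        for j in 1...t.count {
            let substitution = previous[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        previous = current
    }
    return previous[t.count]
}
```

Under this measure, "Bo's Song" and "Bo\u{2019}s Song" are at distance 1, as are "Bo's Song" and "Bos Song", so a search that accepts small distances tolerates both a different apostrophe and a missing one.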

For historical context, there’s an algorithm called Soundex that has been used as an index key for English words and pronunciations of names. You can transform the user’s query under Soundex and match against the precomputed transformations of your search corpus. I vaguely recall seeing a Soundex field associated with driver licenses at some point, probably for communicating over police radio.
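
For reference, a sketch of American Soundex as it is commonly described (first letter plus three digits, padded with zeros). The function name is made up, and only ASCII letters are handled.

```swift
/// Sketch of American Soundex: keeps the first letter, encodes following
/// consonants as digits, collapses repeated codes, and pads to four characters.
func soundex(_ word: String) -> String {
    // Digit classes; vowels, H, W and Y have no code.
    func code(_ c: Character) -> Character? {
        switch c {
        case "B", "F", "P", "V": return "1"
        case "C", "G", "J", "K", "Q", "S", "X", "Z": return "2"
        case "D", "T": return "3"
        case "L": return "4"
        case "M", "N": return "5"
        case "R": return "6"
        default: return nil
        }
    }

    let letters = word.uppercased().filter { $0.isASCII && $0.isLetter }
    guard let first = letters.first else { return "" }

    var result = String(first)
    var previous = code(first)
    for c in letters.dropFirst() {
        if result.count == 4 { break }
        let current = code(c)
        if let digit = current, current != previous {
            result.append(digit)
        }
        // H and W do not separate two letters with the same code; vowels do.
        if c != "H" && c != "W" { previous = current }
    }
    return result + String(repeating: "0", count: 4 - result.count)
}
```

For example, `soundex("Robert")` and `soundex("Rupert")` both give "R163", so either spelling would find the other in an index keyed this way.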

Obviously extending Soundex to every dialect in use on Earth would be a massive and never-ending challenge. But it might provide a starting point for someone researching what’s been done in this space.

Before getting too ambitious, maybe we need an API that treats
"bo`s", "bo’s", "boʼs" and "bo's" as equal.
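
As a rough sketch of what such a comparison could do, the snippet below folds a small, hand-picked set of apostrophe-like characters to U+0027 before comparing. The table and function names are hypothetical, and the set is certainly incomplete.

```swift
/// Hypothetical, deliberately small table of characters that people type
/// (or that software substitutes) in place of an apostrophe.
let apostropheLike: Set<Character> = [
    "\u{0060}",  // GRAVE ACCENT
    "\u{2018}",  // LEFT SINGLE QUOTATION MARK
    "\u{2019}",  // RIGHT SINGLE QUOTATION MARK
    "\u{02BC}",  // MODIFIER LETTER APOSTROPHE
    "\u{02BB}",  // MODIFIER LETTER TURNED COMMA (ʻokina)
]

/// Replaces every apostrophe-like character with a plain apostrophe (U+0027).
func foldingApostrophes(_ text: String) -> String {
    String(text.map { apostropheLike.contains($0) ? "'" : $0 })
}

/// Equality after folding; everything else is compared as usual.
func looselyEqual(_ a: String, _ b: String) -> Bool {
    foldingApostrophes(a) == foldingApostrophes(b)
}
```

With this table, all four spellings above compare equal to one another, while "bos" (no apostrophe at all) still compares unequal to "bo's".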

Thanks; I don't have any code of my own, though; I was merely trying to provide an actionable bug report/feature request for someone else's software.

Ah, I see. Then I guess the action could be to give that a try. I just made it go back to Swift 6.0, so hopefully it is usable for many more now. It should also make searches much faster if they use the normal case-insensitive contains searches (like 10x...).