[pitch] [Recap] Adding StringMap's algorithm for faster string match results

Omar_Charif · February 1, 2018, 11:24pm

So, I wrote an article a while ago demonstrating the huge performance gain using StringMap algorithm instead of regular expressions for straight forward string full matching. The main goal here is to improve performance for sequential string matching done on a single block of string or a string array.
The motivation here is simply performance, a lot of time difference comparing google's re to my StringMap implementation (seconds), you can read the article here https://gist.github.com/o-micron/f7c98226e5e0c81164fbc2704a855132

If you want a swift package to play around with StringMap you can find it here https://github.com/o-micron/StringMap

I guess swift would benefit from that time difference and I would love to hear back from the community wether this is going to play well with the upcoming schedule of improvements.

-- o-micron

Nevin · February 2, 2018, 4:50am

Could you briefly explain what “StringMap”, “RE2”, and “fullmatch” mean, for those of us who are not experts in this area?

Omar_Charif · February 2, 2018, 12:34pm

I apologize, StringMap is an algorithm developed to map a string into smaller arrays of features represented in floating point numbers. For simplicity, you can transform a long word into a single float that uniquely represents this word, which then is going to be easier to compare against other words or (float numbers as well). You may think of it like a hashmap, you hash each word into a unique representation.
Say you have a string s = "Swift is cool";
you can represent "Swift" in a float (0.0), "is" in a float (1.0) and "cool" in a float (2.0) for example.
And whenever you want to search for a word you simply try to compare numbers instead of character by character.

By RE2 I mean regular expressions version 2.0.

By fullmatch I mean fully matching a string with another string, example:
matching "Swift" with "Swift" will match but "Swift" with "Swift is cool" will not match.

I wrote a practical example on Github https://github.com/o-micron/StringMap

I think it is easier to use than regular expressions in the simple cases and will give you a huge performance gain, I built a whole app out of this

-- o-micron

rayhamel · February 2, 2018, 9:47pm

Swift Strings are Unicode-aware, meaning two strings that compare equal may not have the same binary representation. They therefore can't be compared using a (non-Unicode-aware) hash function.

Omar_Charif · February 2, 2018, 10:16pm

Thanks for pointing this out ray, this case is handled, the algorithm is aware of that and shall give you right results just as expected :)

You can try this out yourself, clone StringMap swift package from https://github.com/o-micron/StringMap

You can create an example just like the one written swift/String.swift line 503 to 506.
cafe\u{301} will match with café en français.

-- o-micron

Omar_Charif · February 2, 2018, 10:52pm

1 more thing, I guess there is a bug in this area that needs improvement, will work on it. Thanks for pointing this out !

Omar_Charif · February 3, 2018, 12:30am

You are right ray, do you know a way out to ensure that for example cafe\u{301} be transformed to café ?

I thought I would transform "cafe\u{301}" to "café", so I get utf16 array of unicodeScalars of same length, so that I can always get the same result returned from String.UnicodeScalarView(cafe1.unicodeScalars) and String.UnicodeScalarView(cafe2.unicodeScalars)

Do you think this is possible ?

On the other hand, do you think cafe\u{301} should match café ? I mean this might eventually help us getting to match cafe with café but are they really equal ? I think not. And they must not match, that's the current way StringMap reacts. If they are exactly the same from utf16 point of view then they must match.

-- o-micron