Use string counts to short-circuit string comparisons

vonlost · November 28, 2021, 6:42pm

My Mac app involves repeated equality comparisons of many megabytes of an array of strings. I was (naively?) surprised that such comparisons became immensely faster after I added my own shadow array of the count of each string in the array; if the counts are not equal, the strings cannot be equal, so I then don't check the actual strings.

I would have thought that Swift could easily and efficiently use a string's count to short-circuit string comparisons if the counts differ. Operations like "contains" could also benefit from first checking the counts.

Would this be a reasonable refinement?

Tino · November 28, 2021, 6:56pm

I guess such a huge change for String is not feasible — but as this basic type has some really unfortunate (and for many developers coming from other languages) unexpected limitations, I like the idea of adding another string-type which offers true random access (so not only store the size).

David_Smith · November 28, 2021, 7:47pm

There are cases where unicode sequences of different lengths (edit: different UTF8 counts, not lengths) can compare equal (e.g. ü vs u + combining umlaut). That said, this should be able to work for known-ASCII Strings, and I believe it should be able to work for known-NFC Strings, though I'd need to think that over.

Question for you: what does your performance profile (say, in Instruments' time profiler) look like without this optimization? I'm very curious if this is the native String equality codepath, or the bridged one, and I have a hunch it might be the latter.

SDGGiesbrecht · November 28, 2021, 7:54pm

They became faster because you cached information in the shadow array, not because of anything specific to strings. My guess is caching the hash value instead of the count would be even more efficient, and that goes for just about any non‐trivial type. (It is precisely what Hashable is designed for.)

A String is (at least natively) UTF‐8 under the hood, so x.utf8.count is the only O(1) “count” available. Unfortunately,

let string = "caractère"
let decomposed = string.decomposedStringWithCanonicalMapping
let composed = string.precomposedStringWithCanonicalMapping

assert(decomposed.utf8.count != composed.utf8.count)
assert(decomposed == composed)

So it turns out to be a logical flaw to use it as a short‐circuit strategy. The same thing prevents use of the UTF‐16 and scalar views.

The only view where the count is necessarily equal between equal strings is the character view default to the string itself:

let string = "caractère"
let decomposed = string.decomposedStringWithCanonicalMapping
let composed = string.precomposedStringWithCanonicalMapping

assert(decomposed.count == composed.count)
assert(decomposed == composed)

But this property does not help us, because the count of the string is not a simple memory offset, and requires going over every byte of the string figuring out the grapheme boundaries, which is O(n). Doing that first in a separate step is in all likelihood slower than just checking each character as you go. The equality check can short‐circuit on the first unequal character, but finding the count requires going all the way to the end of the string no matter what.

ksluder · November 28, 2021, 8:51pm

These don’t have the same count? They are the same number of grapheme clusters.

David_Smith · November 28, 2021, 9:02pm

Ah, sorry, yes, I was jumping ahead. They have the same count, but not (iirc? Vacation brain heh) the same easily accessible cached value we could look at. Basically I'm thinking we can short circuit to self.utf8.count IFF some conditions are met. I'll actually pull out the work laptop and go read the code once I'm back from vacation though.

Michael_Ilseman · November 29, 2021, 12:50am

If they are both known-NFC (a superset of all-ASCII, ASCII-ness being checked at construction time), they will compare via memcmp semantics as a fast-path, including an initial code unit count check. Lazily bridged strings don’t benefit from this system, unless they were contiguous ASCII C strings.

The stdlib just landed native normalization support, so we could check more at construction time and/or provide API now to force into normal form for fast comparisons. We could even provide lazily normalized views. I’m pretty excited about native normalization support. CC @Alejandro

Edit: a related optimization is sticking grapheme cluster counts as an additional tail allocation ala breadcrumbs (or reuse the same one). In this way native (and again, only native) strings could benefit from that. Equality could take it into account as well, if available. We would only do that for sufficiently long strings such that it’s worth the memory usage (like we do for the UTF-16 breadcrumbs).

young · November 29, 2021, 4:38am

error: value of type 'String' has no member 'composedStringWithCanonicalMapping'

err, where is this from? Don't see this in NSString or String

SDGGiesbrecht · November 29, 2021, 6:15am

I spelled it wrong. Sorry.

precomposedStringWithCanonicalMapping

tera · November 29, 2021, 6:33am

At the first glance it doesn't look impossible to maintain the length variable within the string and make count constant time operation. For example when adding two strings together to form the third string the resulting count is not too far from the sum of the counts of the two strings, and just a few chars on the edges of the strings needs to be checked to determine the new length.

If not that, then caching the count to calculate the string length only once also seems reasonable. If that's built-in to String itself - it should be as fast as what you are currently doing, just much more convenient.

To the OP: what are your strings? are they unicode or ascii? Is there min/max length and if so what are those? How distributed are the strings (e.g. UUID().string is one thing and strings possibly having the same long prefix is another).

vonlost · November 29, 2021, 6:29pm

All unicode, min length zero, average length 17, max length around 50. All names of people and groups of people (including international like Chinese, Hebrew, Russian), most beginning with A to Z (upper) but can be anything, about 25% with prefix "[[" (special meaning token).

vonlost · November 29, 2021, 7:48pm

My "immensely faster" is four times faster with the shadow array of lengths.

Dmitriy_Ignatyev · November 29, 2021, 8:10pm

The topic is very interesting, but the general-purpose solution is not trivial.
Look at this example:

  let c2 = "Cafe\u{301}"
  let c3 = "Café"
  
  c2 == c3 // true
  c2.hash // 12562450099221
  c3.hash // 24033636967
  c2.utf8.count // 6
  c3.utf8.count // 5

c1 and c2 are equal, but underlying scalars are different.
For now, comparison optimization is up to developer. In general, we have 3 kinds of string internals. Let focus on two of them:
For long strings when they are heap-allocated, we can store and cache count. I mean grapheme clusters count.
For short strings, when they are stack-allocated, it is not possible, because for short strings we have limited memory space which varies on 32/64 bit processors. 15 symbols can be stored on 64 bit platforms if I'm not mistaken. If the string is larger, then it is allocated in heap.
So, if we do this, we get an asymmetry: large strings cache their count, but short strings don't have this feature. If they have, then fewer characters can be stored, and this breaks another optimization.

David_Smith · November 29, 2021, 8:12pm

I've pushed a speculative (completely untested) version of this optimization (for fast-utf8 NFC strings only) to a branch and asked swift-ci to take a look. I have some other work to get done before following up, but I figured I may as well run the benchmarks and tests over it while I'm busy.

vonlost · November 29, 2021, 8:36pm

How about a tag when a string is allocated that controls when counts can be compared?

divanise · November 29, 2021, 9:14pm

That looks like a (very surprising) bug to me. The documentation for Hashable reads:

The examples above imply to me that == is normalizing the strings, but .hash isn't. Can that even be possible, though? Wouldn't have someone figured out that [String: _] was behaving oddly by now if that was the case?

David_Smith · November 29, 2021, 9:16pm

I think you may be getting NSString's -hash there. Try hash(into:) or the deprecated hashValue.

xwu · November 29, 2021, 9:19pm

Exactly: hashValue is the same for c2 and c3.

Dmitriy_Ignatyev · November 29, 2021, 9:22pm

This is surprising, however. But the Dictionary works properly.

  var dict: [String: Int] = [:]
  dict[c2] = 2
  dict[c3] = 3
  print(dict) // ["Café": 3]

If print deprecated hashValue, they are equal:

  c2.hashValue // 2451573986552666876
  c3.hashValue // 2451573986552666876

David_Smith · November 29, 2021, 9:33pm

Surprisingly, just a 1.2x win on one benchmark in CI. We may just not have benchmark coverage for long non-equal-length native NFC String equality, I'll look into that later. Or my change could be wrong :) we'll see