Use string counts to short-circuit string comparisons

With that example at least. :smiley: I was thinking more about a heisenbug where the dictionary would behave differently depending on whether c2 and c3 got hashed into the same bucket or not, which would depend on the dictionary's size.

Ah, that makes sense.

Yes. Regarding "deprecated": it's only deprecated "as a Hashable requirement" (which raises a question of its own: how do you implement a value type with a completely custom Hashable implementation, where you want specific hashes for specific values, without hitting the deprecation?). However, it's totally fine to call it. Note that hashValue's complexity is O(string length).

Depending upon the usage pattern of those strings (e.g. how often a string is used after creation, how often strings are compared for equivalence vs. < and >), consider putting all names into a separate data structure:

var names: [String] = []
var nameIndices: [String: Int] = [:]

func name(at index: Int) -> String {
    names[index]
}

func nameToIndex(_ name: String) -> Int {
    if let index = nameIndices[name] {
        return index
    }
    names.append(name)
    let index = names.count - 1
    nameIndices[name] = index
    return index
}

and then store and use indices instead of the actual strings.
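To make the interning idea above concrete, here's a hedged sketch that wraps the same two functions into a hypothetical `NameTable` struct (the name and shape are illustrative, not anything from the thread), so callers store and compare cheap `Int` indices instead of strings:

```swift
// Hypothetical wrapper around the interning functions above.
// Interning guarantees each distinct name gets exactly one index,
// so index equality substitutes for string equality.
struct NameTable {
    private var names: [String] = []
    private var nameIndices: [String: Int] = [:]

    func name(at index: Int) -> String {
        names[index]
    }

    mutating func index(of name: String) -> Int {
        if let index = nameIndices[name] {
            return index  // already interned
        }
        names.append(name)
        let index = names.count - 1
        nameIndices[name] = index
        return index
    }
}

var table = NameTable()
let a = table.index(of: "Alice")
let b = table.index(of: "Bob")
assert(a != b)
assert(table.index(of: "Alice") == a)  // same name, same index
assert(table.name(at: b) == "Bob")
```

Comparing two `Int`s is a single instruction, so this pays off when the same names are compared many times after creation.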

I've kicked off a test and merge on Early out for unequal NFC counts in String == by Catfish-Man · Pull Request #40314 · apple/swift · GitHub, thanks for pointing out this missing optimization! :slight_smile:

4 Likes

This is a broad topic, and there are a lot of different angles and aspects both in the problem space and in this thread, so I'll focus on improving equality-check performance for often-compared native Swift strings of sufficient length (at least 16 UTF-8 code units; anything shorter would be a small-form string). See Piercing the String Veil. These would not need to have the isNFC flag set, which @David_Smith's optimization requires.

If we care about this performance case, then we should cache the NFC-normalized code unit count. If the string is in NFC, this is equivalent to the code unit count, so no caching is needed when isNFC is set. Often, this cache would be populated by hashing, which has to consume the entire string in NFC anyway (cc @lorentey): strings stored in dictionaries and sets will typically be hashed and compared for equality, potentially multiple times, so this would save already-computed information for future use. Note that comparison aborts when the inputs diverge, so comparison could only set the cache when it has inspected the entire string.
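As a rough illustration of why a cached NFC count helps, here's a hedged sketch. The `CountCachedString` type and its eager caching are hypothetical (the real proposal would cache lazily inside the string's storage class); it uses Foundation's `precomposedStringWithCanonicalMapping` to obtain the NFC form:

```swift
import Foundation

// Illustrative only: cache the NFC-normalized UTF-8 code unit count and
// use it to short-circuit equality. Canonically equivalent strings must
// have identical NFC forms, hence identical NFC code unit counts.
struct CountCachedString {
    let value: String
    let nfcUTF8Count: Int  // cached eagerly here for simplicity

    init(_ s: String) {
        value = s
        // Foundation's precomposedStringWithCanonicalMapping produces NFC.
        nfcUTF8Count = s.precomposedStringWithCanonicalMapping.utf8.count
    }

    static func == (lhs: Self, rhs: Self) -> Bool {
        // Early out: different NFC counts means not canonically equivalent.
        guard lhs.nfcUTF8Count == rhs.nfcUTF8Count else { return false }
        return lhs.value == rhs.value  // fall back to the full comparison
    }
}

let precomposed = CountCachedString("caf\u{E9}")   // é as one scalar, 5 NFC bytes
let decomposed = CountCachedString("cafe\u{301}")  // e + combining acute, also 5
assert(precomposed.nfcUTF8Count == decomposed.nfcUTF8Count)
assert(precomposed == decomposed)  // canonically equivalent
```

The interesting case is the `guard`: two long strings with different NFC counts are rejected without touching their contents at all.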

Since we do normalization natively, it should be much easier to count things, remember whether the input was already in NFC, etc. (cc @Alejandro). Another thing that can be set is the storage class's copy of the isNFC flag if we traverse the entire string and learn that it is in fact in NFC. This is a very common scenario, because strings tend to be put in NFC for compactness anyways (which is ultimately why the stdlib uses NFC instead of the simpler NFD).

We can't update the struct String's isNFC bit, except inside mutating methods, so we'd add a check for the class's copy of the isNFC bit. With native normalization in place, we can also add something like a

extension String {
  public mutating func canonicalize()
}

This would eagerly bridge the string (if applicable), convert it to NFC, and analyze the content to set all of the corresponding fast bits, like isASCII and isNFC, but also a hasSingleScalarGraphemeClusters bit for when the character view is equivalent to the scalar view, etc. (It should check those properties after normalization, because Unicode.) This would be a superset of what's done inside makeContiguousUTF8, but unlike that method it wouldn't preserve the contents of the string as viewed through the scalar or code unit views, because it normalizes.
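The proposed canonicalize() doesn't exist today, but the normalization half of it can be approximated with Foundation; a minimal sketch, assuming only the Foundation NFC transform:

```swift
import Foundation

// Rough stand-in for the normalization part of the proposed canonicalize():
// replace the stored scalars with their NFC form. (The real method would
// also re-derive fast bits like isASCII/isNFC on the result.)
var s = "cafe\u{301}"                          // decomposed: 6 UTF-8 code units
s = s.precomposedStringWithCanonicalMapping    // NFC: 5 UTF-8 code units
assert(s.utf8.count == 5)
// After normalizing, canonically equivalent strings are byte-identical:
assert(Array(s.utf8) == Array("caf\u{E9}".utf8))
```

Note how this changes what the code unit view reports, which is exactly why it has to be an explicit, opt-in mutation rather than something the stdlib does behind your back.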

3 Likes

IMO, there are far more interesting optimisation opportunities than storing an NFC-normalized code unit count. Adding overhead to a type like String is no joke; typical programs can have many thousands of strings popping in and out of existence all the time, and computing an NFC code-unit count on every string creation/mutation is not a trivial amount of extra work.

Things like that can be... okay, if they're shown to be clearly worth it. But I don't think this particular piece of data is ever going to be worth the cost. It's too narrow-use. Native normalization opens up lots of doors for more exotic comparison techniques, and we should thoroughly explore those before thinking about eating more precious RAM.

Such as?

Yes, I am intimately familiar with the problem space.

I think you misunderstand my recommendation. I'm not recommending we eagerly do anything, unless explicitly requested by the user via a method call or an initializer label. I omitted discussing the caching strategy, but it would be more akin to breadcrumbs (single atomic compare-and-swap, because we don't care if we're beat). It should be the lightest weight possible because the complexity and overhead of caching quickly eclipses gains.

Again, this sounds like you're envisioning a caricature rather than what's proposed.

Well, I'm not as familiar with the problem space as you are, but one simple example that springs to mind is checking for equality alternately between the start and end of the string (so we can early-exit faster if a string has a common prefix).

I've only taken a brief look at string comparison (basically to see how much of it was available as public APIs for non-String storage), but it didn't seem like we did things like that.
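The alternating-ends idea can be sketched in a few lines. This is a hedged illustration, not stdlib code, and it's only valid for strings already known to be byte-comparable (e.g. both all-ASCII, or both known NFC):

```swift
// Sketch: compare byte-comparable strings alternately from the front and
// the back, so strings with a long shared prefix but differing suffixes
// fail on an early iteration instead of after scanning the whole prefix.
func bytewiseEqualAlternating(_ a: [UInt8], _ b: [UInt8]) -> Bool {
    guard a.count == b.count else { return false }
    var front = 0
    var back = a.count - 1
    while front <= back {
        if a[front] != b[front] { return false }  // advance from the start
        if a[back] != b[back] { return false }    // and from the end
        front += 1
        back -= 1
    }
    return true
}

let x = Array("a/very/long/shared/prefix/file1".utf8)
let y = Array("a/very/long/shared/prefix/file2".utf8)
assert(!bytewiseEqualAlternating(x, y))  // differs at the tail: caught on iteration 1
assert(bytewiseEqualAlternating(x, x))
```

Whether this beats a straight memcmp in practice depends on string lengths and where differences cluster, which is the workload question raised later in the thread.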

I'm having a hard time picturing what that could look like - init(..., cachedPerformanceFlag: true)? Or some sort of mutating func updateCachedPerformanceFlag()?

Can you elaborate on what you're suggesting by:

but also saying that it should be enabled via some explicit opt-in?

Is amortizing the cost of O(n) grapheme cluster counting already accomplished with the breadcrumb work? Seems that’s a common operation that could benefit from this treatment too.

What are you trying to do? This is basically always a sign that you're doing something wrong with Hashable. (It's fine to want specific values to hash to the same thing, but you shouldn't be choosing the hashes).

Sorry, this is a broad and multi-faceted space and I want to contribute a targeted discussion, while also being very busy with something else :sweat_smile:

Right, so there's different ways that we could perform the comparison based on expected properties of the two strings we're comparing, such as long shared prefixes. This can come up with strings representing paths, URLs, domains, etc., which often have the unfortunate property of also ending in the same handful of characters.

The main thing blocking these kinds of adjustments is confidence that we should privilege them, that is confidence that whatever workloads we devise are representative and worthwhile.

Generally speaking, I've found the higher-order bit to be about hitting the NFC fast paths vs not, and currently only all-ASCII strings hit that fast path. Once we know that we can call memcmp for equality, it's true we could peek around after checking their lengths. This is probably only relevant for very large strings that have very long shared prefixes: memcmp is one of the faster things that String does.

extension String {
  /// Normalize, nativize, analyze, etc.
  public mutating func canonicalize()
  
  public init<C: Collection>(canonicalizingUTF8: C) where C.Element == UTF8.CodeUnit
}

or similar. String initializers normally will preserve exactly the scalar values passed in, mutation won't typically change the stored scalar values beyond what's requested by the mutation, etc. This would be API that exists to explicitly change the stored scalars into NFC.

This might be where the confusion is. If we want to modify the contents or update the String struct, we can only do so in mutating methods (or as part of initialization). However, we are free to set bits on the backing class or do anything else we want to the class instance, provided we don't violate value semantics for a string that may be shared across threads. That backing class has a copy of the isNFC bit that the String struct has, because we needed a 16-byte header regardless. Those bits stored on the class can be updated as part of read-only operations, if we do it right (atomic compare-exchange).

If you pass a String into a dictionary API, it will typically hash the string's NFC bytes to get a bucket number, then if occupied it will compare against whatever's already in that bucket and linearly probe from there (IIRC, but @lorentey would remember this better). Hash consumes the entirety of the string's contents (yet another discussion) as lazily normalized NFC. This lazily NFC-normalized code path is now done through native code in the stdlib that we can control. It should be fairly straightforward to add a counter to our normalized view, and a sticky bit to determine if we ever encountered contents that were not NFC.

At the end of such a hash or comparison function that has lazily normalized all the bytes, we will know the number of NFC code units present and whether the original string was already NFC. This isn't really any extra work or space at this point, but we now have the option to use this information to speed up subsequent operations, if we want to. So if we have a place to store this information (and we already do, it's just a question of how best to manage it), that has the potential to immensely speed up subsequent comparisons. If we determine the string is in fact NFC (the vast majority are), then string comparison and hashing forevermore can just operate over the raw bytes without lazily normalizing them every time by looking stuff up in data tables. Even if it's not in NFC, we can also store the NFC code unit count for early false exits (and we have a place to do so, just may want to tweak some details, yet another discussion). And of course if we determine a string to be NFC, it's unnecessary to store the NFC count since we already store the code unit count.
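A hedged sketch of the two facts that fall out of a full normalizing pass, using Foundation's NFC transform as a stand-in for the stdlib's native normalizer (the `nfcFacts` name is made up for illustration):

```swift
import Foundation

// After fully normalizing a string (as hashing must), we learn both the
// NFC code unit count and whether the original was already in NFC, at
// essentially no extra cost over the pass itself.
func nfcFacts(_ s: String) -> (nfcUTF8Count: Int, wasAlreadyNFC: Bool) {
    let nfc = s.precomposedStringWithCanonicalMapping
    // A sticky "saw non-NFC content" bit reduces to: did any byte change?
    return (nfc.utf8.count, Array(nfc.utf8) == Array(s.utf8))
}

let already = nfcFacts("caf\u{E9}")      // input is already NFC
assert(already.wasAlreadyNFC && already.nfcUTF8Count == 5)

let decomposed = nfcFacts("cafe\u{301}") // input needed normalization
assert(!decomposed.wasAlreadyNFC && decomposed.nfcUTF8Count == 5)
```

In the stdlib these two facts would be recorded in the storage class rather than returned, so every later hash or comparison of the same string can skip the table lookups.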

malloc will give us memory in buckets, which actually stair-step grow in size with the allocation size. String makes sure to claim that extra leftover space as spare capacity for mutations and for sticking the current breadcrumbs pointer inside. This is the level where we reason about what "sufficiently long" means, as adding a pointer on the end may fit in the existing bucket or bump us to a new bucket. Alternatively, we can use the single breadcrumbs pointer for all metadata we might want to remember (grapheme cluster counts, NFC code unit counts, etc), and we synchronize from within that.

When we have a piece of information we want to store or cache, we do an atomic compare-exchange, because the information we have was calculated from the (under value semantics) immutable contents, so if we're beat by another thread on the same String, which is rare but possible, we don't care because the results are ultimately the same.
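The "we don't care if we're beat" protocol can be illustrated with a deliberately simplified, single-threaded sketch; a real implementation would use an atomic compare-exchange, but the first-writer-wins semantics look like this:

```swift
// Simplified illustration of lossy caching. Because the cached value is
// derived from contents that are immutable under value semantics, every
// thread computes the same number, so losing the race is harmless.
final class CountCache {
    private var cached: Int = -1  // -1 means "unset"; real code uses an atomic

    /// Returns true if this call installed the value (i.e. "won").
    /// Stands in for compareExchange(expected: -1, desired: value).
    func tryStore(_ value: Int) -> Bool {
        guard cached == -1 else { return false }
        cached = value
        return true
    }

    var value: Int? { cached == -1 ? nil : cached }
}

let cache = CountCache()
assert(cache.tryStore(5))   // first computation wins and is stored
assert(!cache.tryStore(5))  // a "losing" store is a no-op, not an error
assert(cache.value == 5)
```

The invariant that makes this safe is that every candidate value is identical, so there is nothing to reconcile when two threads race.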

Here's a diagram of the allocation sizes from when I trimmed off the breadcrumb pointers when strings weren't sufficiently long, sometime last year.

9 Likes

Not currently, but that's an obvious improvement. I think the stored NFC code unit count is more valuable, but we'll want to have something that could also extend to grapheme counts as well.

1 Like

Most of this is over my head, so I'll ask: Can the compiler be taught to create the appropriate count entries for string literals? My code also has thousands of tests against literals for ==, hasPrefix, hasSuffix, contains. Thanks.

As long as the literals are NFC [edit actually ASCII apparently, that's something we can improve!], the optimization from this thread will apply to == on them. hasPrefix and hasSuffix appear to already have a similar optimization, though obviously it won't be applicable as often.

Should the compiler normalize non-raw string literals? Maybe unless the code points are specified with escapes?

I would say normalize the literal before resolving the escapes. The particular code points of anything unescaped may not survive source operations like copy‐and‐paste anyway, and thus should never be relied on. (At one time I suggested normalizing the entire source file this way to even improve the reliability of symbol references, but many found the idea too invasive and possibly troublesome for backwards ABI compatibility.)

This article shows a custom Hashable implementation: Hashable / Hasher - NSHipster