Each time you try to look up a new string in the set, its hash has to be computed.
Oh sure, I didn't think of that.
Firstly, you'll need the strings in your database to be normalised. Let's say they're normalised as NFC.
In the general case, you will also need to normalise the input string. Once both input and DB are in the same normalisation form, things are easy -- a match will contain the same scalars, and therefore the same code-units. That's the point of normalisation. When you compute a hash or check strings for equality, it's also going to perform this normalisation behind the scenes.
Often the inputs will already be normalised, so it may be worth performing a lookup on the string as given (it may return false negatives, but never a false positive). In other words:
import Foundation

public func isReservedWord(_ input: String) -> Bool {
    if lookup(input.unicodeScalars) {
        return true
    }
    return lookup(input.precomposedStringWithCanonicalMapping.unicodeScalars)
}

private func lookup(_ scalars: some Collection<Unicode.Scalar>) -> Bool {
    // Look up the exact sequence of scalars,
    // using your preferred database structure.
    fatalError("replace with your database lookup")
}
The standard library does not expose any normalisation APIs, so you'll have to use Foundation's precomposedStringWithCanonicalMapping to obtain an NFC string. That API is eager and returns a bridged UTF-16 string (which is why it could be faster to perform lookups at the scalar level).
In the future, hopefully the standard library will expose APIs which compute the normalised scalars on-demand. That's how it is implemented already (as an iterator, which computes the next set of normalised scalars each time you call next()), it just isn't exposed. Again, it returns a sequence of unicode scalars, so performing the lookup in terms of scalars can save you a little bit of overhead.
I have a small tool that filters listed words out of a larger array of words. Filtering was implemented by looping over an array of words to ignore.
Out of curiosity, I changed the array of words to ignore into a Set<String>, and indeed, this was measurably slower than looping through the array, even though I excluded the time to create the Set from the Array from the measurement. Otherwise the code remained the same. A good learning experience, indeed.
This depends on the number of items to compare. Set is good for a large number of items. This is not a surprise.
I believe it could be beneficial to store the hash along with the value for large values like long strings, big arrays, etc. When the value is changed, mark the hash invalid (O(1)). When the hash is needed and is invalid, recalculate it (O(value size)) and store it along with the value. If the hash is already valid, use it (O(1)). As this is for large values only (e.g. values larger than 1K), the memory overhead could be insignificant (< 1%).
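A minimal sketch of that lazy-caching idea, using a hypothetical reference-type wrapper (nothing like this exists in the standard library); the cache is invalidated on mutation and recomputed only when the hash is actually requested:

final class HashCachingBox<Value: Hashable>: Hashable {
    private var cachedHash: Int?          // nil means "invalid"

    var value: Value {
        didSet { cachedHash = nil }       // O(1) invalidation on change
    }

    init(_ value: Value) { self.value = value }

    func hash(into hasher: inout Hasher) {
        if cachedHash == nil {
            cachedHash = value.hashValue  // O(value size), only when needed
        }
        hasher.combine(cachedHash!)       // O(1) afterwards
    }

    static func == (a: HashCachingBox, b: HashCachingBox) -> Bool {
        a.value == b.value
    }
}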
Wouldn't that make String un-Sendable? Or will storing the hash trigger a copy-on-write? Maybe use atomic reads and writes for the hash, so the worst that can happen is two threads computing the hash in parallel and overwriting it with the same value.
There's always a trade-off: writing the hash is not free and will slow things down for no reason in cases you aren't going to reuse this hash later.
You could however use that strategy locally where it makes sense, like when populating a set or dictionary, by using a wrapper type like this one:
struct PreHashed<Value: Hashable>: Hashable {
    init(_ value: Value) {
        self.value = value
        self.hashValue = value.hashValue
    }

    var value: Value {
        didSet { hashValue = value.hashValue }
    }

    var hashValue: Int

    func hash(into hasher: inout Hasher) {
        hasher.combine(hashValue)
    }

    static func == (a: PreHashed, b: PreHashed) -> Bool {
        // fast path checking hashValue first
        a.hashValue == b.hashValue && a.value == b.value
    }
}

var a = Set([PreHashed("Hello"), PreHashed("World")])
assert(a.contains(PreHashed("Hello")))
While this pre-hashing can make sense for strings, it does not for integers which are too trivial to hash to benefit from a stored hash. Trade-offs again.
No, the string (contents) is not changed.
Writing an aligned machine word is inherently atomic (without using special atomic operations) and is not slow. As you mentioned, at worst more than one thread would overwrite the saved hash location with the same value. No need to protect the hash location with a mutex, etc.
This is probably the best that can be done without implementing it in the standard library itself. It has the drawback of recalculating the hash on every change. Recalculating the hash (and updating the stored cached value) during the hashValue call itself could only be done in the standard library implementation.
Indeed. Hence my suggestion to only use it for large values to make space overhead tiny.
FWIW this is not true, both because of caches not shared across CPU cores, and because undefined behavior is still undefined even if the naive instruction by instruction compilation would do what you want. But it is cheap to use a relaxed atomic store of an aligned machine-word.
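For illustration, here is a hedged sketch of that relaxed store/load using the swift-atomics package; the zero-means-uncomputed convention and the LazyHash type are assumptions made for this example, not anything the standard library provides:

import Atomics

final class LazyHash {
    // Assumption: 0 means "not computed yet"; a real hash of 0 would simply be recomputed.
    private let cachedHash = ManagedAtomic<Int>(0)

    func hash(computeHash: () -> Int) -> Int {
        let cached = cachedHash.load(ordering: .relaxed)
        if cached != 0 { return cached }
        let fresh = computeHash()
        // Racing threads may each store the same freshly computed value; that is harmless.
        cachedHash.store(fresh, ordering: .relaxed)
        return fresh
    }
}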
I meant that, without using atomics but with aligned memory, this could happen (and it is harmless):
1. CPU core #1 looks at a memory location and finds zero (no hash).
2. CPU core #1 recalculates the hash and writes it to the memory location.
3. CPU core #2 looks at the same memory location and finds zero (its cache is not yet synced).
4. CPU core #2 recalculates the hash and writes it to the memory location (the same value as the first one).
and this could not happen (as an alternative step 3):
3. CPU core #2 looks at the memory location and finds a mixture of the old value and a partially written new value.
Do you mean that this could happen? Or do you mean that there is no guarantee the compiler will generate code that accesses the value in one go, and it might choose to access it, say, byte by byte?
It's true that aligned pointer-sized stores are single-copy atomic (non-tearing) on every reasonably modern x86 and arm64 architecture. It is not true that pointer-sized stores are single-copy atomic on every architecture.
totally incorrect thing that I originally said:
The more serious problem (at least from my perspective) is that you would have to update or invalidate the hash on any modification of a string, requiring a barrier or lock so that observed hash values are always in-sync with observed string contents, depending on the memory ordering rules of the architecture.
It would have to be a lock on architectures without double-word atomics, right?
The Swift runtime requires double-wide atomics anyway, doesn't it?
TBH I'm not convinced that caching hashes is going to be much of a win in practice anyway, since the vast majority of Dictionary lookups (and therefore the vast majority of non-insertion hashing) seem likely to be constant keys rather than dynamically created ones.
What would be potentially interesting is pre-generating hash values for constant strings, but that's also
- tricky at best to do right (Swift randomizes hashes so we'd somehow need to generate something that could have the randomness incorporated into it after the fact),
- potentially harmful (we may discover we need to change the hashing algorithm, which would invalidate any pre-generated hashes; if not carefully designed, it could prevent us from doing that),
- potentially wasteful (uses 64 bytes of __DATA for each String, and not all constant Strings are ever hashed).
No, hash caching would not be beneficial.
Swift uses "per-instance" hash seeding, so Swift's hash function is not fixed -- it changes in each Set/Dictionary instance. This primarily isn't for any security benefit -- we already get most of that with per-process seeding! First and foremost it's a performance improvement: it prevents loops like this from unexpectedly taking quadratic time:
let a: Set<Int> = ...
var b: Set<Int> = ...
for i in a {
    b.insert(i)
}
Such loops often happen organically, but they are also used to implement operations like Set.union or Dictionary.merge. The natural expectation here is that insert would usually take O(1) time; however, without per-instance hashing, these loops would over-populate the left side of the result table, loading it to over 100% capacity, and resulting in insert taking linear time -- rendering the loop an unexpectedly quadratic operation. Per-instance seeding scatters the items in the result table uniformly across the entire table storage, making the table-wide load factor work as intended.
Some baseline performance concerns for hashed collections:
- To ensure good lookup performance, hash tables need to implement open addressing with linear probing.
- To ensure good iteration performance, iteration needs to enumerate items in hash table order.
- To avoid merge loops slowing to a crawl, hashed collections that implement the above two optimizations must individually randomize the hash function within each collection value.
Any cached hash would necessarily use a single hash function, so allowing Set/Dictionary to directly use it would lose this crucial third protection. Operations like Set.union, Dictionary.merge (and hand-written loops that merge contents from multiple hash tables) would take quadratic time -- not sometimes, not just on inconveniently chosen input data, but always, for all input.
So the only way hash caching could possibly work is by running the cached integer through another, per-instance-seeded hash function. (As in, putting hasher.combine(self._someCachedHash) into the hash(into:) implementation.) This still violates Hashable requirements, but it's good enough to work well for "organically" chosen keys. However, it softens the effect of caching the hash, as this will still feed one word to a custom Hasher (1 time unit) and then finalize it (~3-4 time units) for every hash operation. To get any measurable difference, the keys would need to be at least 64-128 bytes wide, possibly more.
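As a concrete illustration of that compromise (entirely hypothetical names; _someCachedHash is assumed to be kept in sync with the contents by some fixed hash function), the hash(into:) implementation would look roughly like this:

struct CachedHashKey: Hashable {
    var contents: String
    var _someCachedHash: Int   // assumed: precomputed from `contents` with a fixed hash function

    func hash(into hasher: inout Hasher) {
        // Feed only the cached word to the per-instance-seeded Hasher.
        hasher.combine(_someCachedHash)
    }

    static func == (a: CachedHashKey, b: CachedHashKey) -> Bool {
        a.contents == b.contents
    }
}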
What would be actually beneficial is to only use guaranteed-normalized Strings as keys. (The really costly part is String normalization, not hashing: feeding UTF-8 data to Hasher is essentially free.)
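A small sketch of that suggestion, assuming Foundation's precomposedStringWithCanonicalMapping (mentioned earlier in the thread) is used to bring every key into NFC once, up front, so later hashing and equality checks operate on already-normalized keys:

import Foundation

// "café" spelled with a precomposed é and with a combining accent: canonically equal strings.
let rawKeys = ["caf\u{E9}", "cafe\u{301}"]

var counts: [String: Int] = [:]
for key in rawKeys {
    let normalized = key.precomposedStringWithCanonicalMapping   // NFC, computed once per key
    counts[normalized, default: 0] += 1
}
// Both spellings land on the same (already-NFC) key.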
Relatively speaking, sure, but empirically the hot path for NSDictionary lookup at least (I haven't looked at Dictionary recently, but conceptually at least it should be similar…) is NSString equality testing. Turns out looking at bytes at all matters.
(Though this is all a bit of a digression, since the proposal was to only do this caching for very long strings, which are an edge case and wouldn't show up in my statistics)
You'd need the synchronisation anyway if you modify a value (say, a string) in one thread and access it in another. I mean, this would be wrong today:
thread1: getHash(value) with no locks or other barriers
thread2: change value with no locks or other barriers
thread1 or thread3: getHash(of value) with no locks or other barriers
I haven't completely got your idea; I will need to reread it a few times to fully grasp it. Could you clarify this:
- What you wrote (per-instance hashing) would also be achievable by getting the non-per-instance hash (an integer value) and passing it through a per-instance hasher. In this case we'd only have to hash 8 bytes (on 64-bit platforms) instead of the whole value.
- String or UTF-8 hash calculation is O(string size in bytes). Is it not?
Edit:
Please correct me if I am completely off. Open-addressing probing does not necessarily use a +1 or other fixed-constant (linear) step. It could use a pseudorandom algorithm where the step is chosen dynamically, say +7, +125, +2, +13, +86, etc. So long as this process is deterministic, it will work just as correctly as linear probing, and with respect to hash collision avoidance it could give better results than a naïve +1 or other fixed-constant probing. What you are saying about "over-populating the left side of the result table" is what I don't understand... Could you outline a simple example that illustrates your point?
I addressed the first question in my second-to-last paragraph above. This would only start reducing hashing costs for keys wider than 64-128 bytes, or more. For smaller keys, this would generally make things slower, not faster. I.e., the constant factors involved are way too high to make this optimization practical.
For the second point: Yes, string hashing takes time that is linear in the size of the string. No surprise: String.== has the same performance characteristics. This is not actually a problem in practice: it is inherent in how hashing/equality checks work.
Large keys rarely if ever occur in practice, and unlike weak hashing, they are unlikely to lead to denial-of-service attacks on their own. (DoS attacks typically require an unexpectedly superlinear payoff: "processing twice as much data takes twice as time" is just business as usual, it isn't an exploitable weakness. But a linear-looking operation that accidentally turns quadratic can easily become an attack surface. And the attack can be completely accidental -- it doesn't need to be deliberate.)
String.== and String.hash(into:) are generally both dominated by the cost of normalization. A successful hash lookup involves exactly one Hasher session, and at least one == invocation. Making the Hasher session run in O(1) time will do nothing about == -- hash lookups will generally still cost linear time in the size of the key.
Granted, of the many horrific ways to implement O(1) hashing, hash memoization is probably the least terrible. But hash caching still inherently violates Hashable's stated requirements, as such types do not feed all the bits that get compared by == to Hasher -- they only feed it with some fixed-size distillation of that. This artificially reduces the strength of hashing. This may or may not enable some HashDoS scenarios in the standard Set/Dictionary; but it definitely breaks advanced use cases of Hashable outside of the standard Set/Dictionary types, such as in Bloom filters.
Hashable is designed to allow generating arbitrarily wide hash values, where each returned bit works as if it was generated by an independent, universal hash function. For instance, here is one way you can use Hashable to generate 128-bit hashes:
extension Hashable {
    var wideHash: (Int, Int) {
        var h1 = Hasher()
        h1.combine(0)
        h1.combine(self)
        var h2 = Hasher()
        h2.combine(1)
        h2.combine(self)
        return (h1.finalize(), h2.finalize())
    }
}
The 128-bit values returned by this extension can be used as if Hasher was a native 128-bit hash function. (To the extent that our hash function is actually universal.)
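For instance (a hypothetical use, in the spirit of the Bloom filter case mentioned earlier), the two 64-bit halves can be combined to derive several independent-looking bucket indices:

let (h1, h2) = "hello".wideHash
let tableSize = 1 << 20

// Double-hashing-style derivation: index_i = h1 + i * h2 (mod table size).
let indices = (0..<4).map { i in
    Int(UInt(bitPattern: h1 &+ i &* h2) % UInt(tableSize))
}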
When this is run on a type that doesn't correctly conform to Hashable, then the returned bits will not contain enough information to make this work as expected. For instance, if String implemented any form of hash caching, then the 128-bit values returned by someString.wideHash would have roughly the same distinguishing power as a regular 64-bit hashValue -- the extra returned bits would not give us any information about the original string, rendering this mechanism useless.
String-keyed dictionaries can be remarkably slow in Swift. But this generally isn't at all because String feeds all its contents to Hasher -- in practice, the slowness tends to be because String values aren't necessarily NFC, and repeatedly normalizing them for hashing and equality checks can be really, really expensive.
Making String cache its hash values, or changing String to only hash some of its contents would lead to significant breakage; please do not suggest the stdlib do that (at least not without also supplying some hard benchmarking data to justify such a thing).
If you want to experiment with string keys that implement broken hashing, it is easy to do so by manually wrapping the String in your own custom type -- the type can then implement hash(into:) and == in whatever way you like. In this case, the scope of any breakage would be restricted to your own project, and it wouldn't introduce system-wide issues.
Predictions:
- Implementing unconditional hash caching for all strings would make things generally slower, not faster.
- Implementing hash caching only for "large enough" strings would also risk a general performance regression, as strings would need to compare their UTF-8 size against some constant cutoff to figure out whether to use a cache. For bridged strings, getting a UTF-8 count would be an O(n) operation.
There are many "correct" ways to implement a hash table. But if we're targeting a machine made in the last few decades, and we want to optimize lookup performance, then open addressing + linear probing is the only practical option, and the step size has to be set to 1.
The whole point is that we really, really need performance critical operations to be expressed as a series of sequential memory accesses. Hash lookups inherently need to start by accessing the hash bucket indicated by the hash value -- and that is an unpredictable (and therefore very costly) memory access. If the first access doesn't get us the right key, we really can't afford to incur another unpredictable access on the second, third etc. subsequent try. Iterating over the collision chain needs to map to strictly sequential accesses.
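To make the memory-access argument concrete, here is a toy sketch of open addressing with linear probing (step size 1); it is not how Set/Dictionary are actually implemented, and it assumes the table never fills up:

struct ToyHashTable<Key: Hashable> {
    private var buckets: [Key?]

    init(capacity: Int) {
        buckets = Array(repeating: nil, count: capacity)
    }

    mutating func insert(_ key: Key) {
        var index = Int(UInt(bitPattern: key.hashValue) % UInt(buckets.count))
        while let occupant = buckets[index] {
            if occupant == key { return }          // already present
            index = (index + 1) % buckets.count    // linear probe: the next bucket over
        }
        buckets[index] = key
    }

    func contains(_ key: Key) -> Bool {
        // The first access is the unpredictable one; every subsequent probe
        // walks sequentially through adjacent buckets.
        var index = Int(UInt(bitPattern: key.hashValue) % UInt(buckets.count))
        while let occupant = buckets[index] {
            if occupant == key { return true }
            index = (index + 1) % buckets.count
        }
        return false
    }
}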
More about "implicitly atomic", bordering on a diatribe
I agree that "torn" values are very unlikely. However, you've left out the possibility that there's more going on than just "read memory" and "write memory". The silliest example is if the word-sized value is an instance variable, or global or static variable, in which case the dynamic exclusivity checks will kick in and potentially complain.
But let's assume you're doing a direct access through an UnsafePointer, the closest you can get in Swift to emitting a single aligned load or store instruction. Do you have a guarantee of the behavior you described above? No, you still don't. In fact, there's an immediate way today to make that break: turn on Thread Sanitizer. (The way you tell Thread Sanitizer that it's okay for other threads to read a stale value is to use a relaxed atomic.)
But maybe you say Thread Sanitizer is an artificial scenario that doesn't count? That's still not good enough. The OS, or hardware, or anything really, is permitted to track additional information about memory; in fact, we know it does this kind of thing on a per-page basis on most operating systems to support permission protections and virtual memory. So it could in theory do something like TSan for every program, making pages or sub-regions of pages "thread-associated", and catching any attempt to access them across threads, for both performance and correctness reasons. This sounds like an outlandish amount of work and overhead to enable for every process by default, but arm64e exists, even if it's only being used in limited scenarios. Other memory-safety-focused ABIs/architectures like CHERI exist as well. So I wouldn't say it's impossible, even if today's x86_64 and arm64 processors and operating systems don't do anything like that.
Finally, undefined behavior is undefined. I don't want to make undefined behavior a boogeyman that will deliberately produce wrong answers, but the compiler is within its rights to say "I can prove this load races with that store and therefore I can load whatever value I want". (This usually happens when the compiler is assuming a certain combination of conditions can't possibly happen and therefore it should save code size rather than emit code that "should" be dead.)
So no, do not use a single non-atomic machine-word load or store to communicate across threads without any other form of synchronization in Swift. Or C. Or Rust.
Obviously! Like a 100-character string: a hash of a sentence in a book, a hash of a file's contents, etc.
Yes and no. Sure, we could come up with lots of examples where we compare equal or almost-equal strings (in many cases comparing equal strings degenerates into comparing identical strings via an O(1) comparison), but in many other real-life cases you compare different strings (an extreme example: UUID strings), so the comparison typically terminates very early, well before reaching the end of the string(s) (†).
Hmm. I am genuinely interested to see some real-world statistics on this.
"Hash", understandably. For ==, is the normalisation step done upfront for both strings (this will affect my claim above, †) or is it done on the fly, in a "normalise as you go" manner? I guess the latter could speed things up by avoiding unnecessary work.
I am not sure I follow you here. A cached hash gives exactly the same result as an uncached hash recalculated on the fly (††), so I can't see how caching could affect anything here. What could affect wide-hash quality is this difference:
// good
combine(0); combine(contents) // to get the lower 64-bit word
combine(1); combine(contents) // to get the higher 64-bit word
vs
// bad
combine(0); combine(hash) // to get the lower 64-bit word
combine(1); combine(hash) // to get the higher 64-bit word
If that's what you mean, I am with you. Note that this still has potential for an optimisation:
// still good
combine(contents); combine(0) // to get the lower 64-bit word
combine(contents); combine(1) // to get the higher 64-bit word
which would get you to the same entropy, and then naturally:
// equally good but up to twice as fast!
let doOnce = internal_combine(contents) // 128 bits wide or more
combine doOnce and 0 // to get the lower 64-bit word
combine doOnce and 1 // to get the higher 64-bit word
to do the main lengthy operation only once.
Only the latter (hashing only some of its contents) would "break" things, and I am not suggesting that!
Could you expand on "broken"? (See †† above.) In my view, caching could only affect timing, not the result itself.
Yep, of course... Bear in mind that throughout this discussion I was suggesting not to change the hash meaningfully, only to speed up its calculation via memoization.
Notably, this tangent leads us back to the topic's original suggestion: using string length to optimise string comparison. How often do bridged strings occur (again, a genuine question)? If those are the problem, we could probably exclude them as well. Pseudocode:
if let len = string.fastStringLength, len > threshold {
    ... string length is available in O(1) time and the string is long enough
} else {
    ... business as usual
}
My prediction for this superficial wrapper (as you described it):
- It would make things slower and increase memory requirements, because every string's size would increase, which would be critical for short strings. There are myriads of tiny strings in a typical app, and their size would effectively be doubled.
So if this optimisation is to be done at all (including for initial benchmarking to see whether it's worth doing!), it has to be done deeper: the cached hash value should live in the internal reference object itself. Note: not only strings could benefit from this optimisation, but other large objects as well (arrays, dictionaries, sets, data, etc.).
So, you (impersonally) decided to go for good caching performance, but in return you get all those issues resulting in a higher chance of clustering in typical usage scenarios, and to work around those issues you need to be creative with per-instance hashing... Is this a win overall? I need to reread those old classic algorithm books in light of this discussion.