I think the problem is mostly that there are so many good hash functions to choose from. After 2011's hash collision DoS scare, SipHash has become a popular choice; it has been designed for exactly this purpose. AFAIK, it is still considered relatively secure in this niche, while it is also fast enough for general use. Luckily, it's also already implemented in stdlib.
SipHash's primary drawback is that it doesn't fit well into the classic hashValue API -- it is rather slow to repeatedly reinitialize/finalize brand new SipHash states in nested hashValue invocations.
(This is not unique to SipHash; it is also true for most/all other high-quality hashes.)
This can be solved by introducing an alternative hashing interface (like the hash(into:) approach discussed elsewhere in this thread). Depending on how often we expect users to roll their own hashes, that interface may not even need to be public.
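To make this concrete, here is a minimal sketch of what such an interface could look like. All of the names below (Hasher, combine, finalize) are placeholders of my own, not an existing stdlib API, and the mixing step is a toy stand-in for real SipHash rounds:

// A minimal sketch of an incremental hashing interface. The names are
// placeholders for illustration only, not an existing stdlib API.
struct Hasher {
    private var state: UInt64 = 0   // stand-in for a real SipHash state

    // Feed one word of input into the single, shared hash state.
    mutating func combine(_ value: UInt64) {
        // Toy mixing step; a real hasher would run SipHash rounds here.
        state = (state &+ value) &* 0x9E3779B97F4A7C15
        state ^= state >> 29
    }

    // Extract the hash value exactly once, at the very end.
    func finalize() -> Int {
        return Int(truncatingIfNeeded: state)
    }
}

struct GridPoint {
    var x: Int
    var y: Int

    // Nested components feed the *same* hasher, so the expensive
    // initialize/finalize work happens only once per hashed value,
    // no matter how deeply the type is composed.
    func hash(into hasher: inout Hasher) {
        hasher.combine(UInt64(bitPattern: Int64(x)))
        hasher.combine(UInt64(bitPattern: Int64(y)))
    }
}

// How a hashed collection could drive it:
var hasher = Hasher()
GridPoint(x: 3, y: 4).hash(into: &hasher)
print(hasher.finalize() & 127)   // bucket index in a 128-slot table

The key difference from hashValue is that composition happens on the hasher's state rather than on finished hash values, so there is only one initialization and one finalization per lookup.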
I believe a high-quality hash function like SipHash will be able to generate good hash values regardless of the distribution of its input -- as long as we feed it all the relevant bits, it should produce hash values with a nice wide spread.
I think it's worth making a fun detour into the nitty gritty details of stdlib's hashed collections, to underline your point about hash tables being sensitive to hash value distributions:
<detour>
Interestingly, Int.hashValue returning self is not a particularly great choice for Swift's hashed collections, because it often produces long sequential chains of hash values. The hash table behind Dictionary and Set uses open addressing with linear probing, and such chains can cause disastrous performance problems, even though they may not initially contain any actual collisions.
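To see why, here is a toy model of linear probing (emphatically not the stdlib's actual implementation). Sequential hash values fill one long run of adjacent buckets, and any later key whose ideal bucket falls inside that run has to walk all the way to its end:

// Toy model of a linear-probing bucket array, not the stdlib's implementation.
let capacity = 16                       // always a power of two
var occupied = [Bool](repeating: false, count: capacity)

func insert(hashValue: Int) -> Int {
    var bucket = hashValue & (capacity - 1)   // ideal bucket
    var probes = 1
    while occupied[bucket] {                  // walk forward until a hole is found
        bucket = (bucket + 1) & (capacity - 1)
        probes += 1
    }
    occupied[bucket] = true
    return probes
}

// Sequential hash values (hashValue == self) fill buckets 0 through 7 back to back.
for key in 0..<8 { _ = insert(hashValue: key) }

// A later key whose ideal bucket falls inside that run has to probe to its
// end, even though its hash value is unique.
print(insert(hashValue: 16))   // ideal bucket 0, but it takes 9 probes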
Set and Dictionary work around this problem by not using the raw values returned by hashValue: they call _mixInt on every hash they generate. Mixing up the bits a little is enough to break these long chains into much shorter pieces, restoring hash table performance in this very common case.
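For illustration, a mixing step in this spirit could look like the SplitMix64-style avalanche function below -- this is a well-known construction, not the actual _mixInt implementation:

// A SplitMix64-style avalanche function: every input bit influences every
// output bit. This only illustrates the idea behind _mixInt, not its code.
func mix(_ value: UInt64) -> UInt64 {
    var x = value
    x ^= x >> 30
    x = x &* 0xBF58476D1CE4E5B9
    x ^= x >> 27
    x = x &* 0x94D049BB133111EB
    x ^= x >> 31
    return x
}

// Sequential inputs no longer land in sequential buckets:
for i in 1..<5 {
    print(i, mix(UInt64(i)) & 15)   // bucket index in a 16-slot table
}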
(Another way to solve this would be to switch to another probing method; however, linear probing's cache friendliness makes it difficult to replace it without significantly slowing down lookups into colliding buckets.)
So while Int.hashValue may be trivial, the actual hash function is much more complicated, to fix issues with distribution quality. (Switching to a high-quality hash function would eliminate the need for this mixing step.)
As a natural consequence of what it does, _mixInt (like every other high-quality hash function) introduces collisions that weren't in the original set of hash values. This is perfectly acceptable, but it can be surprising when you look at microbenchmarking results.
Another interesting observation is that mixing decouples the order of elements in the hash table from the numerical value of their original hash values, so it significantly reduces the expected performance of inserting/retrieving a sequential list of integer keys:
var set: Set<Int> = []
for i in 0..<1_000_000 {
    set.insert(i) // This doesn't have great locality of reference
}
This is rarely (if ever) relevant in practice, but it often pops up in naive benchmarks. (E.g., try measuring the loop above to compare the performance of Set<Int> to a hash table that does not reorder its elements, like NSSet or std::unordered_set<int64_t>. You may find Set performs slower on such artificial benchmarks, while on real-life workloads the opposite may often be true.)
</detour>
It's not nice to force users into an API they can't reasonably be expected to use correctly. I think hashValue before SE-0185 clearly fell into this category. It is possible (indeed, necessary) to implement a universally good, resilient hash function in the stdlib for synthesized hashing; and it seems quite sensible to me to provide this default hash function as a courtesy for users whose types synthesized hashing isn't yet able to cover.
Sufficiently motivated, experienced developers should certainly be able to provide their own hash implementations, but the default hash should be good enough that even they wouldn't bother. A fully hand-rolled hashValue should be merely an option for optimization, rather than a requirement for correctness.
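To illustrate what that division of labor could look like, here is a hypothetical example: the combineHashes helper is a name I'm making up, standing in for whatever default hash function the stdlib would expose, and the type can't use synthesized hashing because its == ignores a cached field:

// `combineHashes` is a hypothetical stdlib helper wrapping the default
// (SipHash-quality) hash function; the name and body are mine, not a real API.
func combineHashes(_ hashes: Int...) -> Int {
    // Toy stand-in: a real implementation would feed each value into a
    // single SipHash state and finalize it once.
    var state: UInt64 = 0x9E3779B97F4A7C15
    for h in hashes {
        state ^= UInt64(bitPattern: Int64(h))
        state = state &* 0xBF58476D1CE4E5B9
        state ^= state >> 31
    }
    return Int(truncatingIfNeeded: state)
}

// A type that (hypothetically) can't use synthesized Hashable because
// equality ignores the cached description.
struct Measurement: Hashable {
    var value: Double
    var unit: String
    var cachedDescription: String?

    static func == (lhs: Measurement, rhs: Measurement) -> Bool {
        return lhs.value == rhs.value && lhs.unit == rhs.unit
    }

    // Hand-rolled, but correctness only requires listing the fields that
    // participate in ==; the quality of the hash comes from the helper.
    var hashValue: Int {
        return combineHashes(value.hashValue, unit.hashValue)
    }
}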
One problem I have with hashValue is that it is not easy to recognize a bad implementation. For example, I don't think it's feasible to write unit tests to verify that hashValue does not collide too often -- how many collisions are too many? Over what set of inputs? Similarly, I don't think we can put runtime warnings in the stdlib to catch badly colliding hashValues. We could provide debugging probes to determine the collision rate, but users would need to remember to look at them.
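For what it's worth, such a probe wouldn't need to be fancy -- the hard part is exactly the arbitrariness above. A rough sketch of my own (the sample and whatever threshold you'd compare the result against are arbitrary choices):

// Rough collision-rate probe: hash a sample of values and measure how many
// distinct values end up sharing a hash. Picking the sample and deciding
// what rate is "acceptable" is the part that's hard to standardize.
func collisionRate<T: Hashable>(of sample: [T]) -> Double {
    let distinctValues = Set(sample).count
    let distinctHashes = Set(sample.map { $0.hashValue }).count
    guard distinctValues > 0 else { return 0 }
    return 1.0 - Double(distinctHashes) / Double(distinctValues)
}

let sample = (0..<100_000).map { "key-\($0)" }
print(collisionRate(of: sample))   // how many is "too many"?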