While the `Set`-based implementation presented by @blangmuir is logically sound, elegant, and easy to comprehend, I recently found that this approach has considerable repercussions for performance.
We were using the following generic structure:

```swift
struct UnorderedPair<T> { let a, b: T }
```
Our initial implementation used `Set` to provide an order-independent hash value. For our use case, however, we noticed pretty severe performance issues down the line (millions of such values may be inserted into a `Set` or `Dictionary`). Profiling showed that the `Set` incurred a significant penalty for managing its underlying reference-counted storage.
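For context, the conformance we started from looked roughly like this (a sketch from memory, not @blangmuir's exact code):

```swift
// Sketch of the Set-based approach (reconstructed; details may differ).
struct UnorderedPair<T> { let a, b: T }

extension UnorderedPair: Hashable where T: Hashable {
    static func == (lhs: UnorderedPair, rhs: UnorderedPair) -> Bool {
        (lhs.a == rhs.a && lhs.b == rhs.b) || (lhs.a == rhs.b && lhs.b == rhs.a)
    }

    func hash(into hasher: inout Hasher) {
        // A Set's hash value is order-independent, but building one here
        // allocates reference-counted storage on every call, which is
        // where the profiler showed the time going.
        hasher.combine(Set([a, b]))
    }
}
```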
Upon looking into it, we tried providing a custom implementation of `hashValue` using the `^` (bitwise XOR) operator instead. SE-0206 deprecates custom implementations of `hashValue`, but we kept this one as a performance baseline for the reasons discussed below.
```swift
extension UnorderedPair: Hashable where T: Hashable {
    static func == (lhs: UnorderedPair, rhs: UnorderedPair) -> Bool { ... }
    var hashValue: Int { return a.hashValue ^ b.hashValue }
}
```
Measuring these two implementations revealed remarkable performance differences. The measurements were taken with the `XCTestCase.measure(_:)` API, in the `Release` configuration, and with `hashValue` / `hash(into:)` marked `@inlinable`.

For generating `1_000_000` hash values for a given `UnorderedPair<String>`, we got the following results:
Implementation | Time | Delta |
---|---|---|
`hashValue` / `^` | 0.0535s | --- |
`hash(into:)` / `Set` | 1.753s | 3175% |
For generating `1_000_000` hash values for a given `UnorderedPair<Int>`, we got the following results:
Implementation | Time | Delta |
---|---|---|
`hashValue` / `^` | 0.0326s | --- |
`hash(into:)` / `Set` | 1.764s | 5311% |
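The benchmark had roughly the following shape (a sketch; the names and setup here are illustrative, and the `UnorderedPair` definitions are repeated only to keep the snippet self-contained):

```swift
import XCTest

// Minimal redefinitions so the sketch is self-contained.
struct UnorderedPair<T> { let a, b: T }

extension UnorderedPair: Hashable where T: Hashable {
    static func == (lhs: UnorderedPair, rhs: UnorderedPair) -> Bool {
        (lhs.a == rhs.a && lhs.b == rhs.b) || (lhs.a == rhs.b && lhs.b == rhs.a)
    }
    var hashValue: Int { a.hashValue ^ b.hashValue }
}

final class UnorderedPairHashingTests: XCTestCase {
    // Illustrative harness, not the exact one we used.
    func testHashValuePerformance() {
        let pairs = (0..<1_000_000).map { UnorderedPair(a: $0, b: $0 &+ 1) }
        measure {
            var sink = 0
            for pair in pairs { sink ^= pair.hashValue }
            _ = sink // keep the result alive so the loop isn't optimized away
        }
    }
}
```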
First, is using the `^` operation logically correct and safe in these cases? I have tested several million iterations to ensure that there are no collisions between hash values where there shouldn't be, and so far things check out. I would be very interested to know if there is danger here, though!
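One degenerate case we can point out ourselves: XOR cancels equal values, so any pair whose two elements are equal hashes to 0, regardless of the element. Here is a small self-contained check (the `xorHash` helper is just a stand-in for the `hashValue` implementation above):

```swift
// Self-contained check of two properties of the XOR scheme.
struct UnorderedPair<T> { let a, b: T }

extension UnorderedPair where T: Hashable {
    // Stand-in for the custom hashValue shown earlier.
    var xorHash: Int { a.hashValue ^ b.hashValue }
}

// Order-independence: XOR is commutative.
assert(UnorderedPair(a: "x", b: "y").xorHash
    == UnorderedPair(a: "y", b: "x").xorHash)

// Degenerate class: equal elements cancel to 0, so every such pair
// collides at the same hash value.
assert(UnorderedPair(a: "x", b: "x").xorHash == 0)
assert(UnorderedPair(a: 42, b: 42).xorHash == 0)
```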
We noticed an interesting thing that could perhaps be elucidated by @lorentey or others regarding the interplay between `hashValue` and `hash(into:)`. In order to get a further handle on things, we tried out the following implementation:

```swift
func hash(into hasher: inout Hasher) {
    hasher.combine(a.hashValue ^ b.hashValue)
}
```
For generating `1_000_000` hash values for a given `UnorderedPair<String>`, we got the following results:
Implementation | Time | Delta |
---|---|---|
`hashValue` / `^` | 0.0535s | --- |
`hash(into:)` / `^` | 0.107s | 99% |
For generating `1_000_000` hash values for a given `UnorderedPair<Int>`, we got the following results:
Implementation | Time | Delta |
---|---|---|
`hashValue` / `^` | 0.0326s | --- |
`hash(into:)` / `^` | 0.062s | 91% |
Are these implementations logically different? What could be contributing to the performance differences here?
Let me know if I am missing something obvious, as I am not an expert in any of these domains. Feel free to point me to places for further research. I am currently looking into SE-0206 itself, as well as the linked thread on combining hashes.