While the `Set`-based implementation presented by @blangmuir is logically sound, elegant, and easy to comprehend, I recently found that this approach has considerable repercussions for performance.
We were using the following generic structure:

```swift
struct UnorderedPair<T> { let a, b: T }
```
Our initial implementation used `Set` to provide an order-independent hash value. For our use case, however, we noticed pretty severe performance issues down the line (millions of such values may be inserted into a `Set` or `Dictionary`). Profiling showed that the `Set` incurred a significant penalty for managing its underlying reference-counted storage.
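For context, the conformance we started from looked roughly like this (a sketch from memory, not @blangmuir's exact code):

```swift
// Sketch of the Set-based approach (reconstructed; details may differ).
struct UnorderedPair<T> { let a, b: T }

extension UnorderedPair: Hashable where T: Hashable {
    static func == (lhs: UnorderedPair, rhs: UnorderedPair) -> Bool {
        (lhs.a == rhs.a && lhs.b == rhs.b) || (lhs.a == rhs.b && lhs.b == rhs.a)
    }

    func hash(into hasher: inout Hasher) {
        // A Set's hash value is order-independent, but building one here
        // allocates reference-counted storage on every call, which is
        // where the profiler showed the time going.
        hasher.combine(Set([a, b]))
    }
}
```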
Upon looking into it, we tried providing a custom implementation of `hashValue` using the `^` (bitwise XOR) operator instead. SE-0206 deprecates custom implementations of `hashValue`, but we kept this one as a performance baseline for the reasons discussed below.
```swift
extension UnorderedPair: Hashable where T: Hashable {
    static func == (lhs: UnorderedPair, rhs: UnorderedPair) -> Bool { ... }
    var hashValue: Int { return a.hashValue ^ b.hashValue }
}
```
Measuring these two implementations revealed remarkable performance differences. The measurements were taken with the `XCTestCase.measure(_:)` API, in the `Release` configuration, and with `hashValue` / `hash(into:)` marked `@inlinable`.

For generating `1_000_000` hash values for a given `UnorderedPair<String>`, we got the following results:
Implementation | Time | Delta |
---|---|---|
`hashValue` / `^` | 0.0535s | --- |
`hash(into:)` / `Set` | 1.753s | 3175% |
For generating `1_000_000` hash values for a given `UnorderedPair<Int>`, we got the following results:
Implementation | Time | Delta |
---|---|---|
`hashValue` / `^` | 0.0326s | --- |
`hash(into:)` / `Set` | 1.764s | 5311% |
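The benchmark had roughly the following shape (a sketch; the names and setup here are illustrative, and the `UnorderedPair` definitions are repeated only to keep the snippet self-contained):

```swift
import XCTest

// Minimal redefinitions so the sketch is self-contained.
struct UnorderedPair<T> { let a, b: T }

extension UnorderedPair: Hashable where T: Hashable {
    static func == (lhs: UnorderedPair, rhs: UnorderedPair) -> Bool {
        (lhs.a == rhs.a && lhs.b == rhs.b) || (lhs.a == rhs.b && lhs.b == rhs.a)
    }
    var hashValue: Int { a.hashValue ^ b.hashValue }
}

final class UnorderedPairHashingTests: XCTestCase {
    // Illustrative harness, not the exact one we used.
    func testHashValuePerformance() {
        let pairs = (0..<1_000_000).map { UnorderedPair(a: $0, b: $0 &+ 1) }
        measure {
            var sink = 0
            for pair in pairs { sink ^= pair.hashValue }
            _ = sink // keep the result alive so the loop isn't optimized away
        }
    }
}
```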
First, is using the `^` operation logically correct and safe in these cases? I have tested several million iterations to ensure that there are no collisions between hash values where there shouldn't be, and so far things check out. I would be very interested to know if there is danger here, though!
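One degenerate case we can point out ourselves: XOR cancels equal values, so any pair whose two elements are equal hashes to 0, regardless of the element. Here is a small self-contained check (the `xorHash` helper is just a stand-in for the `hashValue` implementation above):

```swift
// Self-contained check of two properties of the XOR scheme.
struct UnorderedPair<T> { let a, b: T }

extension UnorderedPair where T: Hashable {
    // Stand-in for the custom hashValue shown earlier.
    var xorHash: Int { a.hashValue ^ b.hashValue }
}

// Order-independence: XOR is commutative.
assert(UnorderedPair(a: "x", b: "y").xorHash
    == UnorderedPair(a: "y", b: "x").xorHash)

// Degenerate class: equal elements cancel to 0, so every such pair
// collides at the same hash value.
assert(UnorderedPair(a: "x", b: "x").xorHash == 0)
assert(UnorderedPair(a: 42, b: 42).xorHash == 0)
```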
We noticed an interesting thing that could perhaps be elucidated by @lorentey or others regarding the interplay between `hashValue` and `hash(into:)`. In order to get a further handle on things, we tried out the following implementation:

```swift
func hash(into hasher: inout Hasher) {
    hasher.combine(a.hashValue ^ b.hashValue)
}
```
For generating `1_000_000` hash values for a given `UnorderedPair<String>`, we got the following results:
Implementation | Time | Delta |
---|---|---|
`hashValue` / `^` | 0.0535s | --- |
`hash(into:)` / `^` | 0.107s | 99% |
For generating `1_000_000` hash values for a given `UnorderedPair<Int>`, we got the following results:
Implementation | Time | Delta |
---|---|---|
`hashValue` / `^` | 0.0326s | --- |
`hash(into:)` / `^` | 0.062s | 91% |
Are these implementations logically different? What could be contributing to the performance differences here?
Let me know if I am missing something obvious, as I am not an expert in any of these domains. Feel free to point me to places for further research. I am currently looking into SE-0206 itself, as well as the linked thread on combining hashes.