Set with 'custom Equatable and default Hashable Structs'

AlexSedykh · October 3, 2018, 7:01pm

Hello. I makes struct Bundle with custom Equatable behavior. And ideally, I want to in the final set must be only two items ('one'(or 'two' or 'three') and 'four'), but in several runs the final set contains different count of items. What I do wrong? And what hashable function is need to be?

// Forum

public struct Forum {
    public init() {}
    
    struct Bundle: Equatable, Hashable, CustomStringConvertible {
        var description: String { return name }
        
        let name: String
        let personOne: String
        let personTwo: String
        
        public static func == (lhs: Bundle, rhs: Bundle) -> Bool {
            var result = false
            if lhs.personOne == rhs.personOne && lhs.personTwo == rhs.personTwo { result = true }
            if lhs.personOne == rhs.personTwo && lhs.personTwo == rhs.personOne { result = true }
            return result
        }
        
        public init(first: String, second: String, name: String) {
            personOne = first
            personTwo = second
            self.name = name
        }
        
//        func hash(into hasher: inout Hasher) {
//            hasher.combine(???)
//        }
    
    }
    
    public func work() {
        let one = Bundle(first: "10", second: "3", name: "one")
        let two = Bundle(first: "3", second: "10", name: "two")
        let three = Bundle(first: "10", second: "3", name: "three")
        let four = Bundle(first: "5", second: "3", name: "four")
        
        var set = Set<Bundle>()
        set.insert(one)
        set.insert(two)
        set.insert(three)
        set.insert(four)
        
        print(set)
        
        // [two, one, three, four]
        // [one, four, three]
        // [two, four, three, one]
        // [four, one, two]
        // [four, three, two, one]
        // [two, four, one]
        // [one, four]
        
    }

}

blangmuir · October 3, 2018, 8:18pm

The problem is that the hash is more specific than your == - i.e. two elements that are equal are getting different hashes. In general, anytime you manually implement == you will need to implement hash(into:) also so that all equal elements get the same hash. The default implementation would look something like this:

func hash(into hasher: inout Hasher) {
  hasher.combine(name)
  hasher.combine(personeOne)
  hasher.combine(personTwo)
}

Which has two issues: first, it includes name, which your equality function ignores, and second the order of values passed to hasher.combine matters, so A,B has a different hash than B,A.

One simple fix would be to use Set, which will give you an order-independent hash value:

func hash(into hasher: inout Hasher) {
  hasher.combine(Set([personOne, personeTwo]))
}

If you're interested in how Set is calculating an order-independent hash, it's implemented here:

github.com

apple/swift/blob/6cc6f4f1825d7965ffc71e93f537b360698cd33e/stdlib/public/core/Set.swift#L487


      
              // FIXME(ABI)#177: <rdar://problem/18915294> Cache Set<T> hashValue
          
          
    // Generate a seed from a snapshot of the hasher.  This makes members' hash
              // values depend on the state of the hasher, which improves hashing
              // quality. (E.g., it makes it possible to resolve collisions by passing in
              // a different hasher.)
              var copy = hasher
              let seed = copy._finalize()
          
          
    var hash = 0
              for member in self {
                hash ^= member._rawHashValue(seed: seed)
              }
              hasher.combine(hash)
            }
          }
          
          
extension Set: _HasCustomAnyHashableRepresentation {
            public __consuming func _toCustomAnyHashable() -> AnyHashable? {
              return AnyHashable(_box: _SetAnyHashableBox(self))
            }

jsbean · October 3, 2018, 11:05pm

While the Set-based implementation presented by @blangmuir is logically sound, elegant, and easy to comprehend, I recently found that this approach has considerable repercussions for performance.

We were using the following generic structure:

struct UnorderedPair <T> { let a, b: T }

Our initial implementation used Set to provide an order-independent hash value. For our use case, however, we noticed pretty severe performance issues down the line (millions of such values may be inserted into a Set or Dictionary). Profiling showed that the Set incurred a significant penalty for managing its underlying reference-counted storage.

Upon looking into it, we tried providing a custom implementation of hashValue using the ^ (bitwise XOR) operation instead.

SE-0206 states that custom implementations of hashValue are to be deprecated, but this baseline of performance is kept for reasons discussed below.

extension UnorderedPair: Hashable where T: Hashable {
    static func == (lhs: UnorderedPair, rhs: UnorderedPair) -> Bool { ... }
    var hashValue: Int { return a.hashValue ^ b.hashValue }
}

The measurement of these two implementations revealed remarkable performance differences.

These measurements were tested using the XCTestCase.measure(_:) API, using the Release configuration, with hashValue / hash(into:) marked @inlinable

For generating 1_000_000 hash values for a given UnorderedPair<String>, we got the following results:

Implementation	Time	Delta
`hashValue` / `^`	`0.0535s`	---
`hash(into:)` / `Set`	`1.753s`	`3175%`

For generating 1_000_000 hash values for a given UnorderedPair<Int>, we got the following results:

Implementation	Time	Delta
`hashValue` / `^`	`0.0326s`	---
`hash(into:)` / `Set`	`1.764s`	`5311%`

First, is using the ^ operation logically correct and safe in these cases? I have tested several million iterations to ensure that there isn't collision between hash values where there shouldn't be, and so far things check out. I would be very interested to know if there is danger here, though!

We noticed an interesting thing that could perhaps be elucidated by @lorentey or others regarding the interplay between hashValue and hash(into:). In order to get a further handle on things, we tried out the following implementation:

func hash(into hasher: inout Hasher) {
    hasher.combine(a.hashValue ^ b.hashValue)
}

For generating 1_000_000 hash values for a given UnorderedPair<String>, we got the following results:

Implementation	Time	Delta
`hashValue` / `^`	`0.0535s`	---
`hash(into:)` / `^`	`0.107s`	`99%`

For generating 1_000_000 hash values for a given UnorderedPair<Int>, we got the following results:

Implementation	Time	Delta
`hashValue` / `^`	`0.0326s`	---
`hash(into:)` / `^`	`0.062s`	`91%`

Are these implementations logically different? What could be contributing to the performance differences here?

Let me know if I am missing something obvious, as I am not expert in any of these domains. Feel free to point me to places for further research. I am currently looking into SE-0206 itself, as well as the linked thread on combining hashes.

lorentey · October 5, 2018, 2:26pm

You're on the right track! Your last hash(into:) implementation is a great way to hash components whose order doesn't matter for equality.

extension UnorderedPair: Equatable where T: Equatable {
    static func == (lhs: UnorderedPair, rhs: UnorderedPair) -> Bool {
      return 
        (lhs.a == rhs.a && lhs.b == rhs.b) || 
        (lhs.a == rhs.b && lhs.b == rhs.a)
    }
}

extension UnorderedPair: Hashable where T: Hashable {    
    func hash(into hasher: inout Hasher) {
      hasher.combine(a.hashValue ^ b.hashValue)
    }
}

Creating and immediately discarding a Set works great, but as you've seen, it has allocation overhead.

The reason hashValue comes out first in your hashing benchmarks is that you aren't measuring the same amount of work: the hashValue case doesn't include the postprocessing Set and Dictionary has to do on the returned value to guarantee good dispersion. Set and Dictionary never uses the hashValue as is to find the correct bucket for elements -- they always need to run the value through a hasher to reduce the chance of performance regressions due to hash value clustering:

https://github.com/apple/swift/blob/swift-4.2-branch/stdlib/public/core/Dictionary.swift#L2560-L2566

As a general rule, conforming to Hashable by implementing hashValue is never faster than implementing hash(into:) and feeding the same value to it. (It is, however, often slower.)

A benchmark that would capture real-life hashing performance is to measure the time it takes to insert values of your type into a Set. In the case of UnorderedPair, the two implementations come out as exactly the same in this benchmark.

There is no reason whatsoever to ever implement hashValue in code targeting Swift 4.2 or above.