Why does Hashable require Equatable?

Hi there,

Recently, I've experienced an obscure bug using Dictionary caused by mismtatching implementation of hashValue and ==, something like below.

struct Foo { let n, m: Int }
extension Foo: Equatable {
  static func == (lhs: Foo, rhs: Foo) -> Bool {
    return lhs.n * lhs.m == rhs.n * rhs.m
  }
}
extension Foo: Hashable {
  var hashValue: Int { return "\(n)\(m)".hashValue }
}

var dict1 = [Foo: Int]()
dict1[Foo(n: 1, m: 2)] = 1
dict1[Foo(n: 2, m: 1)] = 2
print("\(dict1.count)")  // sometimes 1, sometimes 2.

I could avoid the problem by explicitly using hashValue as the key as below.

var dict2 = [Int: Int]()
dict2[Foo(n: 1, m: 2).hashValue] = 1
dict2[Foo(n: 2, m: 1).hashValue] = 2
print("\(dict2.count)")  // always 2.

But this doesn't seem to make much sense since Key is required to be Hashble already.

So, why does Hashable have to require Equatable whereas required hashValue is Int and therefore it can be compared by definition?

I encoutered this problem when I imported a 3rd party framework which happened to be implementing Equatable to some common types, and it happened to contradict my Hashable implementation to the type I was using as a Dictionary key.

Asking users to make sure == and hashValue don't contradict each other is a bit too much burden, isn't it? What do you think?

Because different objects can have the same hash. For example, if you had Int128, then you cannot give each of them unique hash. Because of that Dictionary and other types using Hashable first quickly check if there is a matching hash, and then make sure you have the same object using ==.

tl;dr
if a == b, then a.hashValue == b.hashValue, but if a.hashValue == b.hashValue, then a may or may not equal b

15 Likes

Yes, that's how Dictionary is implemented now. But my question is why it has to be in that way and isn't it too much berden on the user to make sure == and hashValue don't contradict each other.

I've experience the problem when I imported a 3rd party framework that happened to be implementing Equatable to the type of which I was using for a dictionary key. I implemented hashValue and == to the type, but the compiler didn't warn me there were two == functions to the type, and I needed to realize my == method wasn't called at runtime to figure out the bug.

Anyway, when two objects are not equal but have the same hashValue, what behavior should a user reasonably expect?

struct Foo { let n }
extension Foo: Equatable {
  static func == (lhs: Foo, rhs: Foo) -> Bool { return rhs.n == lhs.n }
}
extension Foo: Hashable {
  var hashValue: Int { return n / 2 }
}

var dict1 = [Foo: Int]()
dict1[Foo(n: 2)] = 2
dict1[Foo(n: 3)] = 3
print("\(dict1.count)")  // I was expecting this to be 1.

and every other data structure in every language using Hashable equivalents

Equatable documentation says that "Equality implies substitutability—any two instances that compare equally can be used interchangeably in any code that depends on their values".
User should expect that values that aren't equal aren't substituable, i.e. are different keys.
You should treat hash as a tool to speed things up.

2 Likes

It's actually fundamental to what hashing is. You cannot encode arbitrary amounts of information in to a fixed-size format without potentially losing some of it.

The simplest sort of collection is an Array. If you want to check if something is in the Array or not, you would need to compare every element to the test object, which doesn't scale well for large Arrays. Hash-tables make this more scalable by doing a sort-of approximate test first.

The dictionary has an internal Array, and the position an object gets placed is related to the object's contents. So first, it needs to normalise the object's arbitrary contents (in to something called the hashValue), then it needs to map those hash values to positions in its internal Array. A good hash-function will try to make distinct objects have distinct hashValues as much as possible, and a good hash-table will try to spread those values in its internal Array relatively evenly.

Sometimes there will be "hash collisions" - where distinct values produce the same hashValue, or where distinct hashValues get mapped to the same position in the internal Array. That means the hash-table's internal Array has to be an Array-of-Arrays (or an Array of "buckets"). So to test if an object is contained in the table, it first needs to calculate which "bucket" the object would be in, then check each object in the bucket to test if it really is in there or not.

In your example, Foo(n: 2) and Foo(n: 3) will get placed in the same "bucket" because they have the same hashValue, but because they are distinct (by ==), they will both be stored individually.

13 Likes

Hashable extends Equatable because hashing is a meaningless operation without equality: the requirements on hash values are defined in terms of equal and non-equal elements.

From a practical viewpoint, without Equatable, you may be able to add elements to a Dictionary or a Set, but you wouldn't be able to look them up later. This eliminates the primary purpose of these data structures.

When you calculate a hash, you're distilling the original value to a fixed-width sequence of bits that can be used as a direct evidence of inequality. Two equal elements must always produce the same hash value, so if we have two items with different hash values, then we are guaranteed to have two unequal items. The sole purpose of hash values is to serve as such evidence. If equality has no meaning, then there is no such thing as a hash.

The original code above violates the basic Hashable requirement. For example:

let a = Foo(n: 1, m: 2)
let b = Foo(n: -2, m: -1)
print(a == b) // ⟹ true
print(a.hashValue == b.hashValue) // ⟹ (usually!) false

This is a serious error that can produce all sorts of weird behavior -- including phantom and/or duplicate Dictionary keys. Unfortunately it is extremely difficult for Set and Dictionary to reliably catch this error. Sometimes they get lucky and notice an unexpected duplicate key while they're resizing themselves, in which case they can (and do!) trap with an error message that indicates the issue. Unfortunately, there is no guarantee this will occur during testing. (The current development snapshots catch this a little more often, especially in debug builds, but it simply isn't feasible to include full Hashable validation in the standard collection types.) The easiest way to not run into issues like this is to let the compiler synthesize Equatable and Hashable for you.

Of course, compiler synthesis is only appropriate when the type is the straightforward case. Judging by your definition for Foo.==, this is probably not the case here. Assuming you meant to express that Foo.m and Foo.n should be considered interchangeable, then this thread from last week will probably come useful -- it includes a direct example of how to do that sort of thing correctly.

Additional notes:

  • It is a good idea to treat explicit manually conformances of Equatable and Hashable as a code smell. Unfortunately, they are sometimes unavoidable; if you must manually implement them, make sure the implementations are carefully peer-reviewed and fully tested.
  • Be careful about the use of * in Foo.==. The definition as written considers Foo(n: 6, m: -7) to be equal to Foo(n: -1, m: 42), which may not be what you intended. The multiplication may also overflow.
  • On the other hand, if the definition of Foo.== above is indeed exactly what you want/need, then here is a matching implementation of Hashable:
    extension Foo: Hashable {
      func hash(into hasher: inout Hasher) {
        hasher.combine(n * m)
      }
    }
    
    The trick is to always feed hasher the exact same values that you compare in ==. (Don't try to optimize hashing by only including some of them -- it'll most likely only slow things down. In addition, it will definitely make the code harder to understand/review/maintain, so it'll be more likely for bugs to creep in.)
  • Don't implement hashValue in new code. Implement hash(into:) instead; it's a lot easier and it frequently performs better.
24 Likes

As I mentioned the above, in my case, the violation was inadvertently introduced by simply importing a 3rd party framework which happened to be implementing Equatable to the type of which I was using for Dictionary's Key. It was extremely tricky for me to even realize the bug because my == and the hidden == were almost the same.

1 Like

I see! Conflicting conformances are indeed a serious issue, especially with protocol hierarchies that build on each other, like Equatable/Hashable.

Retroactive protocol conformances are useful, but some forms of it can be dangerous. I think that it's a good idea to never conform other people's types to standard library protocols. The upstream authors are very likely to add an official conformance in some future revision -- after all, you found a use case for it, so others will probably come across one, too, and some of them will submit feature requests for it.

Sadly I can't suggest a fully satisfactory replacement to such conformances. I do know one solution that works today, although it's not particularly pretty: if I need to customize, say, Hashable for some external type Foo that I don't fully control, I wrap it in a little wrapper struct that is private to my module, and then use that wrapper wherever I need my definition of hashing:

struct MyFoo {
  var value: Foo
}
extension MyFoo: Equatable {
  static func ==(left: MyFoo, right: MyFoo) -> Bool {
    guard foo(left.value) == foo(right.value) else { return false }
    guard bar(left.value) == bar(right.value) else { return false }
    return true
  }
}
extension MyFoo: Hashable {
  func hash(into hasher: inout Hasher) {
    hasher.combine(foo(value))
    hasher.combine(bar(value))
  }
}  

Granted, wrappers like this are somewhat harder to use as I have to litter .value all over my codebase, but in exchange, they are perfectly safe from outside interference, including whatever shenanigans Foo's author might be up to. They also have practically no effect on performance or memory overhead at runtime.

Elsewhere on the evolution forums there are pitches for features such as "newtype" that would make creating/using such wrappers a lot nicer.

6 Likes

Let's return to the concrete problem of keeping Equatable and Hashable conformances in sync. Consider the code below, adapted from an earlier example:

struct Bar { 
  let a, b: Int 
  let c: String
}

extension Bar: Equatable {
  static func ==(left: Bar, right: Bar) -> Bool {
    // Silly and bogus example illustrating a complex case -- do not copy
    guard left.a * left.b == right.a * right.b else { return false }
    guard left.a + left.b == right.a + right.b else { return false }
    guard left.c == right.c else { return false }
    return true
  }
}

extension Bar: Hashable {
  func hash(into hasher: inout Hasher) {
    // Silly and bogus example illustrating a complex case -- do not copy
    hasher.combine(a * b)
    hasher.combine(a + b)
    hasher.combine(c)
  }
}

There are obvious similarities between the == and hash(into:) definitions above. You can almost define == in terms of the underlying hash encoding that a properly written hash(into:) feeds to the hasher.

hash(into:) is also eerily, annoyingly similar to encode(to:) from Encodable. You could in fact define both Equatable and Hashable in terms of encode(to:), by encoding Bar to an explicit byte sequence using a stable encoder, then comparing/hashing these encodings. But that would be terribly inefficient.

Here is a draft of a slightly better way:

protocol Distillable: Hashable {
  associatedtype Essence: Hashable
  var essence: Essence { get }
}

extension Distillable {
  static func ==(left: Self, right: Self) -> Bool {
    return left.essence == right.essence
  }

  func hash(into hasher: inout Hasher) {
    hasher.combine(essence)
  }
}

struct Bar: Distillable { 
  let a, b: Int 
  let c: String
  var essence: [AnyHashable] { return [a * b, a + b, c] }
}

Note how you only need to define the relevant components of Bar once, in essence.

Note that I haven't actually tried this code -- it's merely intended to illustrate a concept. However, something like this may actually be practical in some cases. Just be aware that creating an array of AnyHashables like this won't fare well in any performance-critical code.

6 Likes

Please excuse the self-reply storm -- I forgot to go for the obvious punchline.

Take the following hypothetical version of Bar; it replaces Distillable with some sort of non-existent magical metaprogramming feature:

struct Bar: Hashable {
  let a, b: Int
  let c: String

  $generateHashable(a * b, a + b, c) // strawman syntax
}

In theory, this could be mechanically transformed to the first, verbose definition of Bar above, so that we can get the exact same runtime performance without any of the confusing boilerplate.

I would love to be able to do something like this at some point. If Equatable/Hashable is an important enough use case to deserve some sort of special one-off syntax, it might be worth pitching it as a dedicated language feature.

I believe this is what Haskell does with its generic datatypes? The compiler can derive things like serialization, equality and hashing for you – but with no special treatment of some protocols like Codable, if I understand the feature correctly. Neat.

Another approach, which I believe Dictionary actually uses, is to add items with identical hashes to the next spaces in the array. It’s unlikely that you’d have two hashes right next to each other, so if an item’s hash leads to index 42, but there’s already something there, it would use index 43 to store it. Then when you want to retrieve an item whose key has that hash from the dictionary, you’d start at index 42, then compare the key against that item. When the comparison fails, you’d try 43 and so on until you get to nil, at which point you know there’s no object in the dictionary for that key.

3 Likes

If I‘m not mistaken, this is called linear probing. Hash values can still collide quite a lot, it really also depends on the storage size you have for the hash table itself. Linear probing is just one strategy to resolve collisions, but there are more and also better ones. Collisions in this context doesn‘t necessarily mean a.hashValue == b.hashValue. If you have a dynamic array as a hash table you will likely have more collisions at the beginning, but because of some hash function like hashValue % table.count. When the max size is reached you‘ll double the size of the array (hence dynamic) and to minimize collisions you‘d need to re-hash all the previous values of the hash table.

2 Likes

Aside: some hash tables do that, but there’s also Fibonacci hashing which is a cool little trick to avoid collisions. That guy’s blog is full of different hash table tips and tricks, for those interested.

But yes, in general a hash collision leads to a linear, O(n) search. The whole point of hash tables is to try and avoid that.

5 Likes

Not inefficient at all. The encode(to:) machinery is really just a very fast way to list the contents of every field (much faster than reflection). Most encoders generate a byte sequence from such data, but you can also directly update a hash value for a very fast and simple Hashable conformance.

In fact, SwiftProtobuf uses almost exactly this approach to implement Hashable conformance. It is very fast and requires no additional generated code since it just reuses the encoding support hooks.

Equatable is more complex because of the need to interleave traversals of two different objects.

3 Likes

In principle Hashing should not necessarily and by all means require Equatable. If the two were unrelated and I were mistakenly making a Set out of Hashable but not Equatable things – compiler would just complain and I'd make the "thing" Equatable right there and then to fix the error. It's definitely too late to change this in Swift and, IMHO, it's not a big deal; moreover it saves some keystrokes (so we can have just "Hashable" and not "Hashable, Equatable" combo - as having both features is what we want anyway most of the time). Bloom filter is one of those exceptional examples when hashing is useful without equality, and if you have one of those extremely rare cases when equality is not needed (or actively harmful) – just define a no-op fatalError("equatable not supported") Equatable conformance.


I love this example. What's your thoughts now, should we have it in the standard library?

Edit: one of the downsides to the "essence" approach is lack of fast fail (e.g. when we compare two random UUID strings it's good to quickly get false after comparing the first characters). Another downside is that essence is calculated multiple times for a value that is otherwise unchanged. Could we somehow cache it? Bike-shedding:

what user writes:

struct Bar: Distillable { 
  var a: Int 
  var b: Int 
  var c: String
  var essence: Essence { (a * b, a + b, c) }
}

what's under the hood:

struct Bar { 
    var a: Int { didSet { changed() } }
    var b: Int { didSet { changed() } }
    var c: String { didSet { changed() } }

    var _essence: Essence! // extra field added
    var chageCount: Int = 0 // extra field added
}

extension Bar {

    func changed() {
        _essence = nil
        changeCount += 1
    }

    var essence: Essence {
        if _essence == nil { _essence = essence }
        return _essence
    }
    static func ==(left: Self, right: Self) -> Bool {
        left.essence == right.essence
    }
    func hash(into hasher: inout Hasher) {
         hasher.combine(essence)
    }
}

hash caching could be done similarly (calculate hash once and never again until something changes). And build-in EQ can compare (cached) hash values first thing to "fast fail" when hashes are different.

I think Karoy's post up-thread does a good job of explaining why this is not the case, i.e., why hashability inherently requires a notion of equality:

If by 'hash' you only mean 'can be reduced to a fixed-width sequence of bits' without any further constraints, then sure, you don't need equality for that, but it's unclear what you'd be able to do with such a reduction.

Bloom filters require a notion of equality because you use them to test whether the 'same' value is definitely not in the given set. If you don't have the property that 'same' values produce the same hash value for all the hash functions you're using in your bloom filter, then there will be instances a and b of the 'same' value and some hash function h(x) where if you add a to your set and then query whether b is in the set, your proposed implementation would return 'definitely not' since for the kth hash function h(a) ≠ h(b) even though a correct implementation would return 'maybe.'

1 Like

I mean there are cases when EQ implementation is not needed (even if there is still a "notion of same-ness"), e.g. this implementation:

static func == (a: Self, b: Self) -> Bool {
    // pseudo code ahead:
    sleep(3 hours)
    return super.==(a, b)
}

or any other EQ implementation is not needed for Bloom filter implementation to operate.

Whether the implementation will be called for a particular operation is somewhat orthogonal to whether the underlying notion of equality is required for hashability to make sense. Every time you invoke the hash function to test bloom filter membership (or set membership, or dictionary membership...) you are implicitly relying on that notion of sameness due to the semantic guarantees of the hash function (that unequal hashes implies unequal values).

IOW, even if Swift had separated Hashable and Equatable into unrelated protocols, it would always be an error to conform to Hashable alone because the semantic guarantees of Hashable would be meaningless without an implementation for Equatable.

1 Like

I truly believe this /while typically true/ is not true in general case.

A working implementation of a simple bloom filter.
class C {}

struct S: Hashable {
    let string: String
    var c: C? // let's have this field here, so that we don't have synthesized EQ / hash
    
    func hash(into hasher: inout Hasher) {
        hasher.combine(string)
    }
    
    // EQ
    static func == (a: Self, b: Self) -> Bool {
        assert(false, "won't be called")
        // or let return some garbage
        return Bool.random()
    }
}

struct WrappedS: Hashable {
    let s: S
    let salt: Int
    
    static func == (a: Self, b: Self) -> Bool {
        assert(false, "won't be called wither")
        // or let return some garbage
        return Bool.random()
    }
}

struct BloomFilter {
    var bools: [Bool]
    let k: Int
    
    init(size: Int, k: Int) {
        bools = [Bool](repeating: false, count: size)
        self.k = k
    }
    
    private func indexForString(_ string: String, _ i: Int) -> Int {
        let v = WrappedS(s: S(string: string), salt: i).hashValue
        return Int(UInt.init(bitPattern: v) % UInt(bools.count))
    }
    
    mutating func add(_ string: String) {
        for i in 0 ..< k {
            bools[indexForString(string, i)] = true
        }
    }
    
    func maybeHas(_ string: String) -> Bool { // if false - definitely doesn't have
        for i in 0 ..< k {
            if !bools[indexForString(string, i)] { return false }
        }
        // maybe has
        return true
    }
}

var filter = BloomFilter(size: 10000, k: 10)
filter.add("Hello")
filter.add("World")
filter.add("Earth")
filter.add("Globe")

print(filter.maybeHas("Hello")) // true
print(filter.maybeHas("World")) // true
print(filter.maybeHas("Earth")) // true
print(filter.maybeHas("Globe")) // true
print(filter.maybeHas("Ooops")) // false
print(filter.maybeHas("Noope")) // false

Edit. please ignore the logical error in this demo implementation in that it combines a hash + salt to make another hash. Correct implementation would need to add salt to the data itself and calculate the hash of that.

This implementation is usable regardless of how I implement element's eq – it's just not called – and even if I implement "eq" incorrectly (or, if it was allowed not implementing it) - this implementation would give me correct results for the implied notion of sameness (which in this case takes string into account but ignores "c"): testing for an element would give either "maybe has" or "definitely doesn't have" result.

Obviously if I try to use "S" with a Dictionary / Set or another typical collection - that would indeed require a workable "EQ" that should match the implied notion of "sameness" and agree with EQ/hashability invariant.

Given this line of logic the following data structures are not unimaginable to me:

struct Set<Element: OnlyHashable & Equatable> { ... }
struct Dictionary<Key: OnlyHashable & Equatable, Value> { ... }
struct BloomFilter<Element: OnlyHashable> { ... }