Why does Hashable require Equatable?

Hi there,

Recently, I ran into an obscure bug when using Dictionary, caused by mismatched implementations of hashValue and ==, something like the code below.

struct Foo { let n, m: Int }
extension Foo: Equatable {
  static func == (lhs: Foo, rhs: Foo) -> Bool {
    return lhs.n * lhs.m == rhs.n * rhs.m
  }
}
extension Foo: Hashable {
  var hashValue: Int { return "\(n)\(m)".hashValue }
}

var dict1 = [Foo: Int]()
dict1[Foo(n: 1, m: 2)] = 1
dict1[Foo(n: 2, m: 1)] = 2
print("\(dict1.count)")  // sometimes 1, sometimes 2.

I could avoid the problem by explicitly using hashValue as the key, as shown below.

var dict2 = [Int: Int]()
dict2[Foo(n: 1, m: 2).hashValue] = 1
dict2[Foo(n: 2, m: 1).hashValue] = 2
print("\(dict2.count)")  // always 2.

But this doesn't seem to make much sense, since Key is already required to be Hashable.

So why does Hashable have to require Equatable, when the required hashValue is an Int and can therefore already be compared by definition?

I encountered this problem when I imported a third-party framework that happened to implement Equatable for some common types, and it happened to contradict my Hashable implementation for the type I was using as a Dictionary key.

Asking users to make sure == and hashValue don't contradict each other is a bit too much of a burden, isn't it? What do you think?

Because different objects can have the same hash. For example, if you had an Int128, you couldn't give each value a unique hash. Because of that, Dictionary and other types that use Hashable first do a quick check for a matching hash, and then make sure it really is the same object using ==.

tl;dr
if a == b, then a.hashValue == b.hashValue, but if a.hashValue == b.hashValue, then a may or may not equal b
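
In code, the contract reads something like this (just a sketch for illustration, not anything from the standard library):

func equalImpliesEqualHash<T: Hashable>(_ a: T, _ b: T) -> Bool {
  // (a == b) implies (a.hashValue == b.hashValue); the reverse is not required.
  return a != b || a.hashValue == b.hashValue
}

For a law-abiding Hashable type, this can never return false for any pair of values.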

10 Likes

Yes, that's how Dictionary is implemented now. But my question is why it has to be that way, and whether it isn't too much of a burden on the user to make sure == and hashValue don't contradict each other.

I experienced the problem when I imported a third-party framework that happened to implement Equatable for the type I was using as a dictionary key. I had implemented hashValue and == for that type myself, but the compiler didn't warn me that there were two == functions for it, and I only tracked down the bug after realizing that my == method wasn't being called at runtime.

Anyway, when two objects are not equal but have the same hashValue, what behavior should a user reasonably expect?

struct Foo { let n: Int }
extension Foo: Equatable {
  static func == (lhs: Foo, rhs: Foo) -> Bool { return rhs.n == lhs.n }
}
extension Foo: Hashable {
  var hashValue: Int { return n / 2 }
}

var dict1 = [Foo: Int]()
dict1[Foo(n: 2)] = 2
dict1[Foo(n: 3)] = 3
print("\(dict1.count)")  // I was expecting this to be 1.

and every other data structure in every language using Hashable equivalents

Equatable documentation says that "Equality implies substitutability—any two instances that compare equally can be used interchangeably in any code that depends on their values".
Users should expect that values that aren't equal aren't substitutable, i.e. they are different keys.
You should treat hash as a tool to speed things up.
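
Concretely, with the Foo from your example, the two values share a hashValue but compare unequal, so they are simply two distinct keys:

var dict = [Foo: Int]()
dict[Foo(n: 2)] = 2
dict[Foo(n: 3)] = 3
print(dict.count)            // 2
print(dict[Foo(n: 3)] ?? -1) // 3 -- lookup still works, via ==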

It's actually fundamental to what hashing is. You cannot encode arbitrary amounts of information into a fixed-size format without potentially losing some of it.

The simplest sort of collection is an Array. If you want to check if something is in the Array or not, you would need to compare every element to the test object, which doesn't scale well for large Arrays. Hash-tables make this more scalable by doing a sort-of approximate test first.

The dictionary has an internal Array, and the position an object gets placed in is related to the object's contents. So first, it needs to normalise the object's arbitrary contents (into something called the hashValue), then it needs to map those hash values to positions in its internal Array. A good hash-function will try to make distinct objects have distinct hashValues as much as possible, and a good hash-table will try to spread those values across its internal Array relatively evenly.

Sometimes there will be "hash collisions" - where distinct values produce the same hashValue, or where distinct hashValues get mapped to the same position in the internal Array. That means the hash-table's internal Array has to be an Array-of-Arrays (or an Array of "buckets"). So to test if an object is contained in the table, it first needs to calculate which "bucket" the object would be in, then check each object in the bucket to test if it really is in there or not.

In your example, Foo(n: 2) and Foo(n: 3) will get placed in the same "bucket" because they have the same hashValue, but because they are distinct (by ==), they will both be stored individually.
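
Here's a rough sketch of that bucket scheme, just to make it concrete (this is not how Dictionary is actually implemented, only the general shape, and it ignores resizing):

struct TinyHashSet<Element: Hashable> {
  // The internal Array of "buckets"; each bucket holds the elements whose
  // hashValue maps to that position.
  private var buckets: [[Element]] = Array(repeating: [], count: 8)

  private func bucketIndex(of element: Element) -> Int {
    // Map the arbitrary hashValue to a position in the internal Array.
    return abs(element.hashValue % buckets.count)
  }

  mutating func insert(_ element: Element) {
    let index = bucketIndex(of: element)
    // A matching hash alone isn't enough -- only == can tell whether the
    // element is already stored in this bucket.
    if !buckets[index].contains(element) {
      buckets[index].append(element)
    }
  }

  func contains(_ element: Element) -> Bool {
    let index = bucketIndex(of: element)
    return buckets[index].contains(element)
  }
}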

10 Likes

Hashable extends Equatable because hashing is a meaningless operation without equality: the requirements on hash values are defined in terms of equal and non-equal elements.

From a practical viewpoint, without Equatable, you may be able to add elements to a Dictionary or a Set, but you wouldn't be able to look them up later. This eliminates the primary purpose of these data structures.

When you calculate a hash, you're distilling the original value to a fixed-width sequence of bits that can be used as a direct evidence of inequality. Two equal elements must always produce the same hash value, so if we have two items with different hash values, then we are guaranteed to have two unequal items. The sole purpose of hash values is to serve as such evidence. If equality has no meaning, then there is no such thing as a hash.

The original code above violates the basic Hashable requirement. For example:

let a = Foo(n: 1, m: 2)
let b = Foo(n: -2, m: -1)
print(a == b) // ⟹ true
print(a.hashValue == b.hashValue) // ⟹ (usually!) false

This is a serious error that can produce all sorts of weird behavior -- including phantom and/or duplicate Dictionary keys. Unfortunately it is extremely difficult for Set and Dictionary to reliably catch this error. Sometimes they get lucky and notice an unexpected duplicate key while they're resizing themselves, in which case they can (and do!) trap with an error message that indicates the issue. Unfortunately, there is no guarantee this will occur during testing. (The current development snapshots catch this a little more often, especially in debug builds, but it simply isn't feasible to include full Hashable validation in the standard collection types.) The easiest way to not run into issues like this is to let the compiler synthesize Equatable and Hashable for you.
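
For a simple type, synthesis means you write nothing at all:

// The compiler derives == and hash(into:) from all stored properties,
// so the two can never disagree with each other.
struct Point: Hashable {
  let x, y: Int
}

let p = Point(x: 1, y: 2)
let q = Point(x: 1, y: 2)
assert(p == q && p.hashValue == q.hashValue)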

Of course, compiler synthesis is only appropriate in the straightforward case. Judging by your definition of Foo.==, that probably doesn't apply here. Assuming you meant to express that Foo.m and Foo.n should be considered interchangeable, then this thread from last week will probably come in useful -- it includes a direct example of how to do that sort of thing correctly.
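
For illustration, one way to express "m and n are interchangeable" is to derive both == and hash(into:) from the same canonical, ordered form of the pair, so the two cannot drift apart (this is just a sketch, not necessarily what the linked thread does; it would replace the earlier extensions on Foo):

extension Foo: Hashable {
  // Canonical, ordered form of the pair; both == and hashing use it.
  private var canonicalPair: (Int, Int) {
    return (min(n, m), max(n, m))
  }
  static func == (lhs: Foo, rhs: Foo) -> Bool {
    return lhs.canonicalPair == rhs.canonicalPair
  }
  func hash(into hasher: inout Hasher) {
    hasher.combine(canonicalPair.0)
    hasher.combine(canonicalPair.1)
  }
}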

Additional notes:

  • It is a good idea to treat explicit, manually written conformances of Equatable and Hashable as a code smell. Unfortunately, they are sometimes unavoidable; if you must implement them by hand, make sure the implementations are carefully peer-reviewed and fully tested.
  • Be careful about the use of * in Foo.==. The definition as written considers Foo(n: 6, m: -7) to be equal to Foo(n: -1, m: 42), which may not be what you intended. The multiplication may also overflow.
  • On the other hand, if the definition of Foo.== above is indeed exactly what you want/need, then here is a matching implementation of Hashable:
    extension Foo: Hashable {
      func hash(into hasher: inout Hasher) {
        hasher.combine(n * m)
      }
    }
    
    The trick is to always feed hasher the exact same values that you compare in ==. (Don't try to optimize hashing by only including some of them -- it'll most likely only slow things down. In addition, it will definitely make the code harder to understand/review/maintain, so it'll be more likely for bugs to creep in.)
  • Don't implement hashValue in new code. Implement hash(into:) instead; it's a lot easier and it frequently performs better.
19 Likes

As I mentioned above, in my case the violation was inadvertently introduced simply by importing a third-party framework that happened to implement Equatable for the type I was using as the Dictionary's Key. It was extremely tricky for me to even notice the bug, because my == and the hidden == were almost the same.

1 Like

I see! Conflicting conformances are indeed a serious issue, especially with protocol hierarchies that build on each other, like Equatable/Hashable.

Retroactive protocol conformances are useful, but some forms of them can be dangerous. I think it's a good idea never to conform other people's types to standard library protocols. The upstream authors are very likely to add an official conformance in some future revision -- after all, you found a use case for it, so others will probably come across one too, and some of them will submit feature requests for it.

Sadly I can't suggest a fully satisfactory replacement to such conformances. I do know one solution that works today, although it's not particularly pretty: if I need to customize, say, Hashable for some external type Foo that I don't fully control, I wrap it in a little wrapper struct that is private to my module, and then use that wrapper wherever I need my definition of hashing:

struct MyFoo {
  var value: Foo
}
extension MyFoo: Equatable {
  static func ==(left: MyFoo, right: MyFoo) -> Bool {
    guard foo(left.value) == foo(right.value) else { return false }
    guard bar(left.value) == bar(right.value) else { return false }
    return true
  }
}
extension MyFoo: Hashable {
  func hash(into hasher: inout Hasher) {
    hasher.combine(foo(value))
    hasher.combine(bar(value))
  }
}  

Granted, wrappers like this are somewhat harder to use as I have to litter .value all over my codebase, but in exchange, they are perfectly safe from outside interference, including whatever shenanigans Foo's author might be up to. They also have practically no effect on performance or memory overhead at runtime.
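
For what it's worth, usage then looks something like this (Foo, foo and bar below are hypothetical stand-ins for the external type and whatever accessors the wrapper compares and hashes, included only so the snippet is self-contained):

struct Foo { var id: Int }
func foo(_ f: Foo) -> Int { return f.id }
func bar(_ f: Foo) -> Int { return f.id &* 31 }

// The wrapper is the dictionary key, so my == and hash(into:) always apply.
var cache: [MyFoo: String] = [:]
let external = Foo(id: 7)                // value handed to me by the framework
cache[MyFoo(value: external)] = "cached"
let unwrapped = cache.keys.first?.value  // the .value hop mentioned above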

Elsewhere on the evolution forums there are pitches for features such as "newtype" that would make creating/using such wrappers a lot nicer.

5 Likes

Let's return to the concrete problem of keeping Equatable and Hashable conformances in sync. Consider the code below, adapted from an earlier example:

struct Bar { 
  let a, b: Int 
  let c: String
}

extension Bar: Equatable {
  static func ==(left: Bar, right: Bar) -> Bool {
    // Silly and bogus example illustrating a complex case -- do not copy
    guard left.a * left.b == right.a * right.b else { return false }
    guard left.a + left.b == right.a + right.b else { return false }
    guard left.c == right.c else { return false }
    return true
  }
}

extension Bar: Hashable {
  func hash(into hasher: inout Hasher) {
    // Silly and bogus example illustrating a complex case -- do not copy
    hasher.combine(a * b)
    hasher.combine(a + b)
    hasher.combine(c)
  }
}

There are obvious similarities between the == and hash(into:) definitions above. You can almost define == in terms of the underlying hash encoding that a properly written hash(into:) feeds to the hasher.

hash(into:) is also eerily, annoyingly similar to encode(to:) from Encodable. You could in fact define both Equatable and Hashable in terms of encode(to:), by encoding Bar to an explicit byte sequence using a stable encoder, then comparing/hashing these encodings. But that would be terribly inefficient.

Here is a draft of a slightly better way:

protocol Distillable: Hashable {
  associatedtype Essence: Hashable
  var essence: Essence { get }
}

extension Distillable {
  static func ==(left: Self, right: Self) -> Bool {
    return left.essence == right.essence
  }

  func hash(into hasher: inout Hasher) {
    hasher.combine(essence)
  }
}

struct Bar: Distillable { 
  let a, b: Int 
  let c: String
  var essence: [AnyHashable] { return [a * b, a + b, c] }
}

Note how you only need to define the relevant components of Bar once, in essence.

Note that I haven't actually tried this code -- it's merely intended to illustrate a concept. However, something like this may actually be practical in some cases. Just be aware that creating an array of AnyHashables like this won't fare well in any performance-critical code.

2 Likes

Please excuse the self-reply storm -- I forgot to go for the obvious punchline.

Take the following hypothetical version of Bar; it replaces Distillable with some sort of non-existent magical metaprogramming feature:

struct Bar: Hashable {
  let a, b: Int
  let c: String

  $generateHashable(a * b, a + b, c) // strawman syntax
}

In theory, this could be mechanically transformed to the first, verbose definition of Bar above, so that we can get the exact same runtime performance without any of the confusing boilerplate.

I would love to be able to do something like this at some point. If Equatable/Hashable is an important enough use case to deserve some sort of special one-off syntax, it might be worth pitching it as a dedicated language feature.

I believe this is what Haskell does with its generic datatypes? The compiler can derive things like serialization, equality and hashing for you – but with no special treatment of some protocols like Codable, if I understand the feature correctly. Neat.

Another approach, which I believe Dictionary actually uses, is to add items with identical hashes to the next spaces in the array. It’s unlikely that you’d have two hashes right next to each other, so if an item’s hash leads to index 42, but there’s already something there, it would use index 43 to store it. Then when you want to retrieve an item whose key has that hash from the dictionary, you’d start at index 42, then compare the key against that item. When the comparison fails, you’d try 43 and so on until you get to nil, at which point you know there’s no object in the dictionary for that key.
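
In code, that strategy looks roughly like this (a toy sketch that ignores resizing and assumes the table never fills up -- not Dictionary's real implementation):

struct ProbingTable<Key: Hashable, Value> {
  private var slots: [(key: Key, value: Value)?] = Array(repeating: nil, count: 16)

  private func startIndex(for key: Key) -> Int {
    return abs(key.hashValue % slots.count)
  }

  mutating func insert(_ value: Value, forKey key: Key) {
    var index = startIndex(for: key)
    // Walk forward from the hashed position until we find the key itself
    // (overwrite) or an empty slot (insert here).
    while let slot = slots[index] {
      if slot.key == key {
        slots[index] = (key: key, value: value)
        return
      }
      index = (index + 1) % slots.count
    }
    slots[index] = (key: key, value: value)
  }

  func lookup(_ key: Key) -> Value? {
    var index = startIndex(for: key)
    // Stop at the first empty slot: if the key were present, it would have
    // been stored before any gap.
    while let slot = slots[index] {
      if slot.key == key { return slot.value }
      index = (index + 1) % slots.count
    }
    return nil
  }
}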

2 Likes

If I'm not mistaken, this is called linear probing. Hash values can still collide quite a lot; it also depends on the storage size of the hash table itself. Linear probing is just one strategy for resolving collisions, and there are others, some of them better. A collision in this context doesn't necessarily mean a.hashValue == b.hashValue: if you have a dynamic array as the hash table, you will likely have more collisions at the beginning, not because the hash values themselves collide, but because the bucket index comes from something like hashValue % table.count. When the maximum size is reached you double the size of the array (hence dynamic), and to minimize collisions you then need to re-hash all the values previously stored in the table.
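
A tiny illustration of why growing the storage forces that re-hash: the bucket index depends on the current table size, so the same hash value can map to a different slot after the array doubles.

let hash = "swift".hashValue
let indexBefore = abs(hash % 8)   // slot in a table with 8 buckets
let indexAfter  = abs(hash % 16)  // slot after doubling to 16 buckets
// The two indices can differ, so every stored element has to be moved to
// its new bucket after the resize.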

1 Like

Aside: some hash tables do that, but there’s also Fibonacci hashing which is a cool little trick to avoid collisions. That guy’s blog is full of different hash table tips and tricks, for those interested.

But yes, in general a hash collision leads to a linear, O(n) search. The whole point of hash tables is to try and avoid that.

2 Likes

Not inefficient at all. The encode(to:) machinery is really just a very fast way to list the contents of every field (much faster than reflection). Most encoders generate a byte sequence from such data, but you can also directly update a hash value for a very fast and simple Hashable conformance.

In fact, SwiftProtobuf uses almost exactly this approach to implement Hashable conformance. It is very fast and requires no additional generated code since it just reuses the encoding support hooks.

Equatable is more complex because of the need to interleave traversals of two different objects.
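
As a toy illustration of that single-traversal idea (this is only the general shape, not SwiftProtobuf's actual API):

struct Message {
  var id: Int
  var name: String

  // Single source of truth for "what are my fields" -- the same traversal
  // could feed an encoder or a hasher.
  func visitFields(_ visit: (AnyHashable) -> Void) {
    visit(id)
    visit(name)
  }
}

extension Message: Hashable {
  // == is written out by hand, per the caveat above about interleaving
  // two traversals.
  static func == (lhs: Message, rhs: Message) -> Bool {
    return lhs.id == rhs.id && lhs.name == rhs.name
  }
  func hash(into hasher: inout Hasher) {
    // Reuse the field traversal to update the hasher directly, without
    // producing an intermediate byte buffer.
    visitFields { hasher.combine($0) }
  }
}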

2 Likes