Why does Hashable require Equatable?

and every other data structure in every language using Hashable equivalents

Equatable documentation says that "Equality implies substitutability—any two instances that compare equally can be used interchangeably in any code that depends on their values".
User should expect that values that aren't equal aren't substituable, i.e. are different keys.
You should treat hash as a tool to speed things up.

2 Likes

It's actually fundamental to what hashing is. You cannot encode arbitrary amounts of information in to a fixed-size format without potentially losing some of it.

The simplest sort of collection is an Array. If you want to check if something is in the Array or not, you would need to compare every element to the test object, which doesn't scale well for large Arrays. Hash-tables make this more scalable by doing a sort-of approximate test first.

The dictionary has an internal Array, and the position an object gets placed is related to the object's contents. So first, it needs to normalise the object's arbitrary contents (in to something called the hashValue), then it needs to map those hash values to positions in its internal Array. A good hash-function will try to make distinct objects have distinct hashValues as much as possible, and a good hash-table will try to spread those values in its internal Array relatively evenly.

Sometimes there will be "hash collisions" - where distinct values produce the same hashValue, or where distinct hashValues get mapped to the same position in the internal Array. That means the hash-table's internal Array has to be an Array-of-Arrays (or an Array of "buckets"). So to test if an object is contained in the table, it first needs to calculate which "bucket" the object would be in, then check each object in the bucket to test if it really is in there or not.

In your example, Foo(n: 2) and Foo(n: 3) will get placed in the same "bucket" because they have the same hashValue, but because they are distinct (by ==), they will both be stored individually.

13 Likes

Hashable extends Equatable because hashing is a meaningless operation without equality: the requirements on hash values are defined in terms of equal and non-equal elements.

From a practical viewpoint, without Equatable, you may be able to add elements to a Dictionary or a Set, but you wouldn't be able to look them up later. This eliminates the primary purpose of these data structures.

When you calculate a hash, you're distilling the original value to a fixed-width sequence of bits that can be used as a direct evidence of inequality. Two equal elements must always produce the same hash value, so if we have two items with different hash values, then we are guaranteed to have two unequal items. The sole purpose of hash values is to serve as such evidence. If equality has no meaning, then there is no such thing as a hash.

The original code above violates the basic Hashable requirement. For example:

let a = Foo(n: 1, m: 2)
let b = Foo(n: -2, m: -1)
print(a == b) // ⟹ true
print(a.hashValue == b.hashValue) // ⟹ (usually!) false

This is a serious error that can produce all sorts of weird behavior -- including phantom and/or duplicate Dictionary keys. Unfortunately it is extremely difficult for Set and Dictionary to reliably catch this error. Sometimes they get lucky and notice an unexpected duplicate key while they're resizing themselves, in which case they can (and do!) trap with an error message that indicates the issue. Unfortunately, there is no guarantee this will occur during testing. (The current development snapshots catch this a little more often, especially in debug builds, but it simply isn't feasible to include full Hashable validation in the standard collection types.) The easiest way to not run into issues like this is to let the compiler synthesize Equatable and Hashable for you.

Of course, compiler synthesis is only appropriate when the type is the straightforward case. Judging by your definition for Foo.==, this is probably not the case here. Assuming you meant to express that Foo.m and Foo.n should be considered interchangeable, then this thread from last week will probably come useful -- it includes a direct example of how to do that sort of thing correctly.

Additional notes:

  • It is a good idea to treat explicit manually conformances of Equatable and Hashable as a code smell. Unfortunately, they are sometimes unavoidable; if you must manually implement them, make sure the implementations are carefully peer-reviewed and fully tested.
  • Be careful about the use of * in Foo.==. The definition as written considers Foo(n: 6, m: -7) to be equal to Foo(n: -1, m: 42), which may not be what you intended. The multiplication may also overflow.
  • On the other hand, if the definition of Foo.== above is indeed exactly what you want/need, then here is a matching implementation of Hashable:
    extension Foo: Hashable {
      func hash(into hasher: inout Hasher) {
        hasher.combine(n * m)
      }
    }
    
    The trick is to always feed hasher the exact same values that you compare in ==. (Don't try to optimize hashing by only including some of them -- it'll most likely only slow things down. In addition, it will definitely make the code harder to understand/review/maintain, so it'll be more likely for bugs to creep in.)
  • Don't implement hashValue in new code. Implement hash(into:) instead; it's a lot easier and it frequently performs better.
24 Likes

As I mentioned the above, in my case, the violation was inadvertently introduced by simply importing a 3rd party framework which happened to be implementing Equatable to the type of which I was using for Dictionary's Key. It was extremely tricky for me to even realize the bug because my == and the hidden == were almost the same.

1 Like

I see! Conflicting conformances are indeed a serious issue, especially with protocol hierarchies that build on each other, like Equatable/Hashable.

Retroactive protocol conformances are useful, but some forms of it can be dangerous. I think that it's a good idea to never conform other people's types to standard library protocols. The upstream authors are very likely to add an official conformance in some future revision -- after all, you found a use case for it, so others will probably come across one, too, and some of them will submit feature requests for it.

Sadly I can't suggest a fully satisfactory replacement to such conformances. I do know one solution that works today, although it's not particularly pretty: if I need to customize, say, Hashable for some external type Foo that I don't fully control, I wrap it in a little wrapper struct that is private to my module, and then use that wrapper wherever I need my definition of hashing:

struct MyFoo {
  var value: Foo
}
extension MyFoo: Equatable {
  static func ==(left: MyFoo, right: MyFoo) -> Bool {
    guard foo(left.value) == foo(right.value) else { return false }
    guard bar(left.value) == bar(right.value) else { return false }
    return true
  }
}
extension MyFoo: Hashable {
  func hash(into hasher: inout Hasher) {
    hasher.combine(foo(value))
    hasher.combine(bar(value))
  }
}  

Granted, wrappers like this are somewhat harder to use as I have to litter .value all over my codebase, but in exchange, they are perfectly safe from outside interference, including whatever shenanigans Foo's author might be up to. They also have practically no effect on performance or memory overhead at runtime.

Elsewhere on the evolution forums there are pitches for features such as "newtype" that would make creating/using such wrappers a lot nicer.

6 Likes

Let's return to the concrete problem of keeping Equatable and Hashable conformances in sync. Consider the code below, adapted from an earlier example:

struct Bar { 
  let a, b: Int 
  let c: String
}

extension Bar: Equatable {
  static func ==(left: Bar, right: Bar) -> Bool {
    // Silly and bogus example illustrating a complex case -- do not copy
    guard left.a * left.b == right.a * right.b else { return false }
    guard left.a + left.b == right.a + right.b else { return false }
    guard left.c == right.c else { return false }
    return true
  }
}

extension Bar: Hashable {
  func hash(into hasher: inout Hasher) {
    // Silly and bogus example illustrating a complex case -- do not copy
    hasher.combine(a * b)
    hasher.combine(a + b)
    hasher.combine(c)
  }
}

There are obvious similarities between the == and hash(into:) definitions above. You can almost define == in terms of the underlying hash encoding that a properly written hash(into:) feeds to the hasher.

hash(into:) is also eerily, annoyingly similar to encode(to:) from Encodable. You could in fact define both Equatable and Hashable in terms of encode(to:), by encoding Bar to an explicit byte sequence using a stable encoder, then comparing/hashing these encodings. But that would be terribly inefficient.

Here is a draft of a slightly better way:

protocol Distillable: Hashable {
  associatedtype Essence: Hashable
  var essence: Essence { get }
}

extension Distillable {
  static func ==(left: Self, right: Self) -> Bool {
    return left.essence == right.essence
  }

  func hash(into hasher: inout Hasher) {
    hasher.combine(essence)
  }
}

struct Bar: Distillable { 
  let a, b: Int 
  let c: String
  var essence: [AnyHashable] { return [a * b, a + b, c] }
}

Note how you only need to define the relevant components of Bar once, in essence.

Note that I haven't actually tried this code -- it's merely intended to illustrate a concept. However, something like this may actually be practical in some cases. Just be aware that creating an array of AnyHashables like this won't fare well in any performance-critical code.

6 Likes

Please excuse the self-reply storm -- I forgot to go for the obvious punchline.

Take the following hypothetical version of Bar; it replaces Distillable with some sort of non-existent magical metaprogramming feature:

struct Bar: Hashable {
  let a, b: Int
  let c: String

  $generateHashable(a * b, a + b, c) // strawman syntax
}

In theory, this could be mechanically transformed to the first, verbose definition of Bar above, so that we can get the exact same runtime performance without any of the confusing boilerplate.

I would love to be able to do something like this at some point. If Equatable/Hashable is an important enough use case to deserve some sort of special one-off syntax, it might be worth pitching it as a dedicated language feature.

I believe this is what Haskell does with its generic datatypes? The compiler can derive things like serialization, equality and hashing for you – but with no special treatment of some protocols like Codable, if I understand the feature correctly. Neat.

Another approach, which I believe Dictionary actually uses, is to add items with identical hashes to the next spaces in the array. It’s unlikely that you’d have two hashes right next to each other, so if an item’s hash leads to index 42, but there’s already something there, it would use index 43 to store it. Then when you want to retrieve an item whose key has that hash from the dictionary, you’d start at index 42, then compare the key against that item. When the comparison fails, you’d try 43 and so on until you get to nil, at which point you know there’s no object in the dictionary for that key.

3 Likes

If I‘m not mistaken, this is called linear probing. Hash values can still collide quite a lot, it really also depends on the storage size you have for the hash table itself. Linear probing is just one strategy to resolve collisions, but there are more and also better ones. Collisions in this context doesn‘t necessarily mean a.hashValue == b.hashValue. If you have a dynamic array as a hash table you will likely have more collisions at the beginning, but because of some hash function like hashValue % table.count. When the max size is reached you‘ll double the size of the array (hence dynamic) and to minimize collisions you‘d need to re-hash all the previous values of the hash table.

2 Likes

Aside: some hash tables do that, but there’s also Fibonacci hashing which is a cool little trick to avoid collisions. That guy’s blog is full of different hash table tips and tricks, for those interested.

But yes, in general a hash collision leads to a linear, O(n) search. The whole point of hash tables is to try and avoid that.

5 Likes

Not inefficient at all. The encode(to:) machinery is really just a very fast way to list the contents of every field (much faster than reflection). Most encoders generate a byte sequence from such data, but you can also directly update a hash value for a very fast and simple Hashable conformance.

In fact, SwiftProtobuf uses almost exactly this approach to implement Hashable conformance. It is very fast and requires no additional generated code since it just reuses the encoding support hooks.

Equatable is more complex because of the need to interleave traversals of two different objects.

3 Likes

In principle Hashing should not necessarily and by all means require Equatable. If the two were unrelated and I were mistakenly making a Set out of Hashable but not Equatable things – compiler would just complain and I'd make the "thing" Equatable right there and then to fix the error. It's definitely too late to change this in Swift and, IMHO, it's not a big deal; moreover it saves some keystrokes (so we can have just "Hashable" and not "Hashable, Equatable" combo - as having both features is what we want anyway most of the time). Bloom filter is one of those exceptional examples when hashing is useful without equality, and if you have one of those extremely rare cases when equality is not needed (or actively harmful) – just define a no-op fatalError("equatable not supported") Equatable conformance.


I love this example. What's your thoughts now, should we have it in the standard library?

Edit: one of the downsides to the "essence" approach is lack of fast fail (e.g. when we compare two random UUID strings it's good to quickly get false after comparing the first characters). Another downside is that essence is calculated multiple times for a value that is otherwise unchanged. Could we somehow cache it? Bike-shedding:

what user writes:

struct Bar: Distillable { 
  var a: Int 
  var b: Int 
  var c: String
  var essence: Essence { (a * b, a + b, c) }
}

what's under the hood:

struct Bar { 
    var a: Int { didSet { changed() } }
    var b: Int { didSet { changed() } }
    var c: String { didSet { changed() } }

    var _essence: Essence! // extra field added
    var chageCount: Int = 0 // extra field added
}

extension Bar {

    func changed() {
        _essence = nil
        changeCount += 1
    }

    var essence: Essence {
        if _essence == nil { _essence = essence }
        return _essence
    }
    static func ==(left: Self, right: Self) -> Bool {
        left.essence == right.essence
    }
    func hash(into hasher: inout Hasher) {
         hasher.combine(essence)
    }
}

hash caching could be done similarly (calculate hash once and never again until something changes). And build-in EQ can compare (cached) hash values first thing to "fast fail" when hashes are different.

I think Karoy's post up-thread does a good job of explaining why this is not the case, i.e., why hashability inherently requires a notion of equality:

If by 'hash' you only mean 'can be reduced to a fixed-width sequence of bits' without any further constraints, then sure, you don't need equality for that, but it's unclear what you'd be able to do with such a reduction.

Bloom filters require a notion of equality because you use them to test whether the 'same' value is definitely not in the given set. If you don't have the property that 'same' values produce the same hash value for all the hash functions you're using in your bloom filter, then there will be instances a and b of the 'same' value and some hash function h(x) where if you add a to your set and then query whether b is in the set, your proposed implementation would return 'definitely not' since for the kth hash function h(a) ≠ h(b) even though a correct implementation would return 'maybe.'

1 Like

I mean there are cases when EQ implementation is not needed (even if there is still a "notion of same-ness"), e.g. this implementation:

static func == (a: Self, b: Self) -> Bool {
    // pseudo code ahead:
    sleep(3 hours)
    return super.==(a, b)
}

or any other EQ implementation is not needed for Bloom filter implementation to operate.

Whether the implementation will be called for a particular operation is somewhat orthogonal to whether the underlying notion of equality is required for hashability to make sense. Every time you invoke the hash function to test bloom filter membership (or set membership, or dictionary membership...) you are implicitly relying on that notion of sameness due to the semantic guarantees of the hash function (that unequal hashes implies unequal values).

IOW, even if Swift had separated Hashable and Equatable into unrelated protocols, it would always be an error to conform to Hashable alone because the semantic guarantees of Hashable would be meaningless without an implementation for Equatable.

1 Like

I truly believe this /while typically true/ is not true in general case.

A working implementation of a simple bloom filter.
class C {}

struct S: Hashable {
    let string: String
    var c: C? // let's have this field here, so that we don't have synthesized EQ / hash
    
    func hash(into hasher: inout Hasher) {
        hasher.combine(string)
    }
    
    // EQ
    static func == (a: Self, b: Self) -> Bool {
        assert(false, "won't be called")
        // or let return some garbage
        return Bool.random()
    }
}

struct WrappedS: Hashable {
    let s: S
    let salt: Int
    
    static func == (a: Self, b: Self) -> Bool {
        assert(false, "won't be called wither")
        // or let return some garbage
        return Bool.random()
    }
}

struct BloomFilter {
    var bools: [Bool]
    let k: Int
    
    init(size: Int, k: Int) {
        bools = [Bool](repeating: false, count: size)
        self.k = k
    }
    
    private func indexForString(_ string: String, _ i: Int) -> Int {
        let v = WrappedS(s: S(string: string), salt: i).hashValue
        return Int(UInt.init(bitPattern: v) % UInt(bools.count))
    }
    
    mutating func add(_ string: String) {
        for i in 0 ..< k {
            bools[indexForString(string, i)] = true
        }
    }
    
    func maybeHas(_ string: String) -> Bool { // if false - definitely doesn't have
        for i in 0 ..< k {
            if !bools[indexForString(string, i)] { return false }
        }
        // maybe has
        return true
    }
}

var filter = BloomFilter(size: 10000, k: 10)
filter.add("Hello")
filter.add("World")
filter.add("Earth")
filter.add("Globe")

print(filter.maybeHas("Hello")) // true
print(filter.maybeHas("World")) // true
print(filter.maybeHas("Earth")) // true
print(filter.maybeHas("Globe")) // true
print(filter.maybeHas("Ooops")) // false
print(filter.maybeHas("Noope")) // false

Edit. please ignore the logical error in this demo implementation in that it combines a hash + salt to make another hash. Correct implementation would need to add salt to the data itself and calculate the hash of that.

This implementation is usable regardless of how I implement element's eq – it's just not called – and even if I implement "eq" incorrectly (or, if it was allowed not implementing it) - this implementation would give me correct results for the implied notion of sameness (which in this case takes string into account but ignores "c"): testing for an element would give either "maybe has" or "definitely doesn't have" result.

Obviously if I try to use "S" with a Dictionary / Set or another typical collection - that would indeed require a workable "EQ" that should match the implied notion of "sameness" and agree with EQ/hashability invariant.

Given this line of logic the following data structures are not unimaginable to me:

struct Set<Element: OnlyHashable & Equatable> { ... }
struct Dictionary<Key: OnlyHashable & Equatable, Value> { ... }
struct BloomFilter<Element: OnlyHashable> { ... }

But the Element type of your BloomFilter here is not S or WrappedS—it's the type of the objects which you are add-ing and maybeHas-ing, i.e., String. Your S and WrappedS are still relying on the underlying guarantee that the 'same' value will always produce the same hash value. In the case of S this arises because we want that two S values containing the same String will always produce the same hasher state when they are hashed as part of WrappedS, and WrappedS further relies on this guarantee to ensure that given the 'same' S and the 'same' salt i, you'll produce the same index for bools every time.

That you've deliberately broken the implementation for == while also ensuring it doesn't get called isn't really material here. The implementation of BloomFilter here only works because of the semantic guarantees imposed on String (and transitively, S) by the combination of Equatable and Hashable.

To look at this another way, suppose that String exposed a public capacity property that gave the current capacity of the underlying buffer, and we implemented your example exactly as above, except that we implemented S.hash(into:) as:

    func hash(into hasher: inout Hasher) {
        hasher.combine(string)
        hasher.combine(string.capacity)
    }

Clearly, this gives us a broken BloomFilter implementation because we could potentially have:

var s = "Hello"
filter.add(s)
s.append("!")
s.removeLast()
filter.maybeHas(s) // false

But how can you explain what's broken here without invoking the notion of 'sameness'? And how would you describe what's broken in the following implementation of WrappedS.hash(into:) without invoking the same?

func hash(into hasher: inout Hasher) {
  hasher.combine(s)
  hasher.combine(salt)
  hasher.combine(Int.random())
}

My point, ultimately, is that you haven't really implemented Hashable without Equatable, because there are still very important properties that you want your Hashable implementation to have in order for BloomFilter to work properly. And you'll find that if you try to describe what those properties are, it's precisely a description of the relationship between Hashable and the correct implementation of Equatable for your types.

ETA: Yet another way to think about this is that the relationship between Equatable and Hashable is about more than just particular implementations of protocol requirements. It's about the fundamental idea that hashability requires you to invoke a notion of equality which agrees with your hash value in some way. Yes, it's possible to write algorithms which depend only on hashability and never directly invoke equality, but such algorithms still depend on the underlying equality notion in order to work properly.

5 Likes

String is of course irrelevant here. Here's a simplified version that has custom types only and has corrected salting / hashing implementation:

Simplified version
// Notion of "sameness" a.value == b.value
// Notion of EQ - absent / broken
struct Value: Hashable {
    var value: Int
    func hash(into hasher: inout Hasher) {
        hasher.combine(value)
    }
    static func == (a: Self, b: Self) -> Bool {
        fatalError("EQ is deliberatly froken here")
    }
    func hashWithSalt(_ salt: Int) -> Int {
        Value(value: value + salt).hashValue
    }
}

struct BloomFilter {
    var bools: [Bool]
    let k: Int
    
    init(size: Int, k: Int) {
        bools = [Bool](repeating: false, count: size)
        self.k = k
    }
    
    private func indexForString(_ value: Value, _ i: Int) -> Int {
        let v = value.hashWithSalt(i)
        return Int(UInt.init(bitPattern: v) % UInt(bools.count))
    }
    
    mutating func add(_ value: Value) {
        for i in 0 ..< k {
            bools[indexForString(value, i)] = true
        }
    }
    
    func maybeHas(_ value: Value) -> Bool { // if false - definitely doesn't have
        for i in 0 ..< k {
            if !bools[indexForString(value, i)] { return false }
        }
        // maybe has
        return true
    }
}

var filter = BloomFilter(size: 10000, k: 10)
filter.add(Value(value: 1))
filter.add(Value(value: 2))
filter.add(Value(value: 3))
filter.add(Value(value: 4))

print(filter.maybeHas(Value(value: 1))) // true
print(filter.maybeHas(Value(value: 2))) // true
print(filter.maybeHas(Value(value: 3))) // true
print(filter.maybeHas(Value(value: 4))) // true
print(filter.maybeHas(Value(value: 5))) // false
print(filter.maybeHas(Value(value: 6))) // false

Maybe we are not that far apart then? I can agree that hashability must agree with the "implied notion of sameness", what I am disagreeing with is that hashability's implied notion of sameness must be necessarily expressed as the matching Equatable (explicit or autosynthesized). I believe the above example shows that.

But this is precisely what Hashable: Equatable (along with the protocols' respective documentation) expresses: that to be Hashable, a type must have:

  • A notion of sameness (i.e., an equivalence relation)
  • A hash function
  • 'Agreement' between the notion of sameness and the values produced by the hash function

You can of course write implementations which violate the semantic guarantees of Swift protocols and then write code against those implementations which carefully avoid the brokenness, but that doesn't imply such conformances are actually useful.

Suppose Swift did separate Hashable and Equatable. Because, as you agree, Hashable is useless without an implied notion of sameness, we'd likely want to have Hashable: HashableEquivalent (or else define static func hashableEquivalent(_: Self, _: Self) -> Bool on Hashable directly).

But would such a setup actually be desirable? If HashableEquivalent and Equatable implementations were permitted to diverge, then we could have a == b (Equatable) but a.hashValue != b.hashValue (or similarly, we could have filter.maybeContains(a) == false but largeExpensiveBackingArray.contains(a) == true). So, we'd probably want to require that if a type is both Equatable and Hashable, that the implementations of Self.== and Self.hashableEquivalent always agree.

But then, are there any Hashable types for which it would not make sense to conform to Equatable? I think not. Your BloomFilter implementation purports to be able to answer "is element x definitely not in the collection?" but what would answering that question mean without a highly suitable notion of 'same' for the purposes of an Equatable implementation?

4 Likes