The basic underlying requirement is that any two values of a hashable type must feed the same sequence of bytes to Hasher if and only if they are equal. Two values that aren't equal must feed data that differs by at least a single bit. Everything else follows from this one requirement.
It's usually easy to ensure unique encodings, but it gets a little tricky in special cases like hashable collections. One subtle gotcha is that when we concatenate variable-length data together, we may lose track of the original boundary between the two components, which opens the way for hash collisions. For example, consider the Foo struct below:
struct Foo: Hashable {
    let a: [Int]
    let b: [Int]

    func hash(into hasher: inout Hasher) {
        hasher.combine(a)
        hasher.combine(b)
    }

    static func ==(left: Foo, right: Foo) -> Bool { ... }
}
We need to ensure these instances all feed different data to the hasher:
Foo(a: [], b: [1, 2, 3])
Foo(a: [1], b: [2, 3])
Foo(a: [1, 2], b: [3])
Foo(a: [1, 2, 3], b: [])
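To make the failure mode concrete, here is a minimal sketch of what goes wrong when length information is left out. (NaiveFoo is a hypothetical stand-in that flattens both arrays straight into the hasher.)

struct NaiveFoo: Hashable {
    let a: [Int]
    let b: [Int]

    func hash(into hasher: inout Hasher) {
        // Deliberately broken: combining only the elements loses the
        // boundary between `a` and `b`.
        for element in a { hasher.combine(element) }
        for element in b { hasher.combine(element) }
    }

    static func ==(left: NaiveFoo, right: NaiveFoo) -> Bool {
        return left.a == right.a && left.b == right.b
    }
}

let values = [
    NaiveFoo(a: [], b: [1, 2, 3]),
    NaiveFoo(a: [1], b: [2, 3]),
    NaiveFoo(a: [1, 2], b: [3]),
    NaiveFoo(a: [1, 2, 3], b: []),
]
// All four values are distinct, yet each one feeds the identical byte
// sequence (1, 2, 3) to the hasher, so they are guaranteed to collide:
print(Set(values.map { $0.hashValue }).count) // 1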
Foo's hash(into:) implementation is already perfect -- we don't want to change it. To satisfy the hash encoding requirement, we just need to make sure Array uses a suitable encoding. For example, here is one way to do this:
extension Array: Hashable where Element: Hashable {
    func hash(into hasher: inout Hasher) {
        hasher.combine(count) // <== This is new
        for element in self {
            hasher.combine(element)
        }
    }
}
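With the count prefixed, each of the four values above now feeds a distinct byte sequence; the counts (0 and 3, 1 and 2, and so on) mark where a ends and b begins. Here is a quick sanity check, assuming Foo's == compares the stored properties memberwise. (Hasher is randomly seeded, so the concrete hash values change from run to run, but their distinctness doesn't.)

let hashes = [
    Foo(a: [], b: [1, 2, 3]),
    Foo(a: [1], b: [2, 3]),
    Foo(a: [1, 2], b: [3]),
    Foo(a: [1, 2, 3], b: []),
].map { $0.hashValue }
// Barring an astronomically unlikely seed-dependent collision:
print(Set(hashes).count) // 4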
This is very similar to how enumerations need to add discriminators to their hash encoding:
enum Bar: Hashable {
    case a(Int)
    case b(Int)
    case c(String)

    func hash(into hasher: inout Hasher) {
        switch self {
        case let .a(v):
            hasher.combine(0)
            hasher.combine(v)
        case let .b(v):
            hasher.combine(1)
            hasher.combine(v)
        case let .c(v):
            hasher.combine(2)
            hasher.combine(v)
        }
    }

    static func ==(left: Bar, right: Bar) -> Bool { ... }
}
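Without the discriminators, Bar.a(1) and Bar.b(1) would feed identical payload bytes and collide by construction; with them, the two cases feed (0, 1) and (1, 1). A quick check, assuming == is filled in the obvious way:

let x = Bar.a(1)
let y = Bar.b(1)
print(x == y)                     // false
print(x.hashValue == y.hashValue) // false (modulo a chance collision)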
None of this affects simple cases like hashing stored properties in Foo; simply combining them one by one works great. As Alexis explained, it's unnecessary to add such discriminators between every single combine call; doing that would only slow things down, without meaningfully improving things.
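For example, when every stored property has a fixed-size hash encoding, the component boundaries can never shift, so plain one-by-one combining is already collision-free by construction. A minimal sketch with a hypothetical Point type:

struct Point: Hashable {
    let x: Int
    let y: Int

    func hash(into hasher: inout Hasher) {
        // Each Int feeds a fixed number of bytes, so the encoding of one
        // (x, y) pair can never be confused with that of a different pair.
        // No counts or discriminators are needed.
        hasher.combine(x)
        hasher.combine(y)
    }
}

(This is exactly what the compiler-synthesized hash(into:) would do anyway.)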
I would also add that I don't see a need to add special support for such discriminators/delimiters in the Hasher API; types that need to think about hash encodings need to, well, think about hash encodings. Even in the most complicated cases, coming up with a good hash encoding is a lot easier than writing a good hash function; and I suspect there is no one-size-fits-all API for doing this that wouldn't also slow down the common case too much.