The basic underlying requirement is that any two values of a hashable type must feed the same sequence of bytes to Hasher if and only if they are equal. Two values that aren't equal must feed data that differs by at least a single bit. Everything else follows from this one requirement.
It's usually easy to ensure unique encodings, but it gets a little tricky in special cases like hashable collections. One subtle gotcha is that when we concatenate variable-length data together, we may lose track of the original boundary between the two components, which opens the way for hash collisions. For example, consider the Foo struct below:
struct Foo: Hashable {
    let a: [Int]
    let b: [Int]

    func hash(into hasher: inout Hasher) {
        hasher.combine(a)
        hasher.combine(b)
    }

    static func ==(left: Foo, right: Foo) -> Bool { ... }
}
We need to ensure these instances all feed different data to the hasher:
Foo(a: [], b: [1, 2, 3])
Foo(a: [1], b: [2, 3])
Foo(a: [1, 2], b: [3])
Foo(a: [1, 2, 3], b: [])
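To make the failure mode concrete, here is a minimal sketch of what goes wrong when length information is left out. (NaiveFoo is a hypothetical stand-in that flattens both arrays straight into the hasher.)

struct NaiveFoo: Hashable {
    let a: [Int]
    let b: [Int]

    func hash(into hasher: inout Hasher) {
        // Deliberately broken: combining only the elements loses the
        // boundary between `a` and `b`.
        for element in a { hasher.combine(element) }
        for element in b { hasher.combine(element) }
    }

    static func ==(left: NaiveFoo, right: NaiveFoo) -> Bool {
        return left.a == right.a && left.b == right.b
    }
}

let values = [
    NaiveFoo(a: [], b: [1, 2, 3]),
    NaiveFoo(a: [1], b: [2, 3]),
    NaiveFoo(a: [1, 2], b: [3]),
    NaiveFoo(a: [1, 2, 3], b: []),
]
// All four values are distinct, yet each one feeds the identical byte
// sequence (1, 2, 3) to the hasher, so they are guaranteed to collide:
print(Set(values.map { $0.hashValue }).count) // 1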
Foo's hash(into:) implementation is already perfect -- we don't want to change it. To satisfy the hash encoding requirement, we just need to make sure Array uses a suitable encoding. For example, here is one way to do this:
extension Array: Hashable where Element: Hashable {
    func hash(into hasher: inout Hasher) {
        hasher.combine(count) // <== This is new
        for element in self {
            hasher.combine(element)
        }
    }
}
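With the count prefixed, each of the four values above now feeds a distinct byte sequence; the counts (0 and 3, 1 and 2, and so on) mark where a ends and b begins. Here is a quick sanity check, assuming Foo's == compares the stored properties memberwise. (Hasher is randomly seeded, so the concrete hash values change from run to run, but their distinctness doesn't.)

let hashes = [
    Foo(a: [], b: [1, 2, 3]),
    Foo(a: [1], b: [2, 3]),
    Foo(a: [1, 2], b: [3]),
    Foo(a: [1, 2, 3], b: []),
].map { $0.hashValue }
// Barring an astronomically unlikely seed-dependent collision:
print(Set(hashes).count) // 4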
This is very similar to how enumerations need to add discriminators to their hash encoding:
enum Bar: Hashable {
    case a(Int)
    case b(Int)
    case c(String)

    func hash(into hasher: inout Hasher) {
        switch self {
        case let .a(v):
            hasher.combine(0)
            hasher.combine(v)
        case let .b(v):
            hasher.combine(1)
            hasher.combine(v)
        case let .c(v):
            hasher.combine(2)
            hasher.combine(v)
        }
    }

    static func ==(left: Bar, right: Bar) -> Bool { ... }
}
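Without the discriminators, Bar.a(1) and Bar.b(1) would feed identical payload bytes and collide by construction; with them, the two cases feed (0, 1) and (1, 1). A quick check, assuming == is filled in the obvious way:

let x = Bar.a(1)
let y = Bar.b(1)
print(x == y)                     // false
print(x.hashValue == y.hashValue) // false (modulo a chance collision)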
None of this affects simple cases like hashing stored properties in Foo; simply combining them one by one works great. As Alexis explained, it's unnecessary to add such discriminators between every single combine call; doing that would only slow things down, without meaningfully improving things.
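For example, when every stored property has a fixed-size hash encoding, the component boundaries can never shift, so plain one-by-one combining is already collision-free by construction. A minimal sketch with a hypothetical Point type:

struct Point: Hashable {
    let x: Int
    let y: Int

    func hash(into hasher: inout Hasher) {
        // Each Int feeds a fixed number of bytes, so the encoding of one
        // (x, y) pair can never be confused with that of a different pair.
        // No counts or discriminators are needed.
        hasher.combine(x)
        hasher.combine(y)
    }
}

(This is exactly what the compiler-synthesized hash(into:) would do anyway.)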
I would also add that I don't see a need to add special support for such discriminators/delimiters in the Hasher API; types that need to think about hash encodings need to, well, think about hash encodings. Even in the most complicated cases, coming up with a good hash encoding is a lot easier than writing a good hash function; and I suspect there is no one-size-fits-all API for doing this that wouldn't also slow down the common case too much.