Meaning of 'must' in `Hashable` documentation

Jumhyn · October 28, 2020, 8:16pm

In the documentation for Hashable.hash(into:) we currently have the following line:

The components used for hashing must be the same as the components compared in your type’s == operator implementation.

IIRC, back when Hashable was simply based on the hashValue requirement, the semantic "rule" of the protocol was something like "values which return true for val1 == val2 must also return true for val1.hashValue == val2.hashValue," which is a notably weaker guarantee.

In particular, the current semantics of Hashable would seem to prohibit implementing a sort of "fast path" version of hash(into:) which only hashes a subset of the components used for Equatable.
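
To make that concrete, here is a minimal sketch of the kind of "fast path" I have in mind. `Document` and its fields are hypothetical names made up for illustration. The sketch only shows that such a type satisfies the older, weaker rule (equal values hash equally) even though `body` is never fed to the hasher.

```swift
// Hypothetical type, for illustration only.
struct Document: Hashable {
    var id: Int
    var body: String

    // Equality compares both components.
    static func == (lhs: Document, rhs: Document) -> Bool {
        lhs.id == rhs.id && lhs.body == rhs.body
    }

    // "Fast path": hashes only a subset of what == compares.
    // Equal values still produce equal hashes, but this does not match
    // the documented "same components as ==" wording.
    func hash(into hasher: inout Hasher) {
        hasher.combine(id)
    }
}

let a = Document(id: 1, body: "hello")
let b = Document(id: 1, body: "hello")
let c = Document(id: 1, body: "goodbye")

// Equal values hash equally, as the weaker rule demands.
assert(a == b && a.hashValue == b.hashValue)
// Set still distinguishes a from c, because it falls back to == on collisions.
assert(Set([a, b, c]).count == 2)
```

Under the weaker reading this conformance would be valid; under the current wording it would not be.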

The top-level Hashable documentation features a somewhat confusing statement (emphasis added):

Hashing a value means feeding its essential components into a hash function, represented by the Hasher type. Essential components are those that contribute to the type’s implementation of Equatable. Two instances that are equal must feed the same values to Hasher in hash(into:), in the same order.

The final sentence seems in line with the former, looser semantics for Hashable, but it is seemingly contradicted by other elements of the documentation.

Are the newer, stricter semantics of Hashable meant to be interpreted as I've understood them? If so, why the change?

xwu · October 28, 2020, 8:41pm

See the discussion here:

github.com/apple/swift: [Foundation] Modernize hashing in Foundation's Swift-only types
apple:master ← lorentey:foundation-hashing · opened by lorentey, 12:25AM - 06 Apr 19 UTC · +518 -167

This PR upgrades the Foundation overlay to use the new hashing API introduced in… [SE-0206]. This gets rid of the deprecation warnings for `hashValue` implementations and brings the overlay in sync with current Swift best practices.

[SE-0206]: https://github.com/apple/swift-evolution/blob/master/proposals/0206-hashable-enhancements.md

## Background

[SE-0206] radically changed how hashing works in Swift 4.2+. It is now the stdlib's responsibility to choose a hash function; types implementing `Hashable` only need to worry about selecting what they want to feed to the hasher.

To make this possible, `Hashable` now requires hashing to be implemented through the `hash(into:)` method. The old `hashValue` property is deprecated as a `Hashable` requirement, and it's only kept to support source-level compatibility with existing code. `Set` and `Dictionary` have switched over to using `hash(into:)` as their entry point for hashing; they do not directly call `hashValue` at all.

Beyond simplifying hashing, the intent of SE-0206 is to enable Swift to provide certain guarantees about its quality. In particular: as long as `hash(into:)` feeds enough data the hasher to unambiguously decide equality, *Swift attempts to guarantee that collision attacks won't be possible*. For this to work, it is critically important for `Hashable` implementations to include everything that `Equatable.==` looks at; and this is *especially* the case for the basic boundary types that come built-in with Swift, like `Data`.

The exact hash algorithm used is a private implementation detail of the stdlib; the `Hasher` API was carefully designed to leak as little information about it as possible, enabling future versions of the stdlib to switch to other algorithms without worrying about ABI compatibility issues.

While random seeding makes the hash algorithm opaque, it is technically possible for Swift code to rely on the particular hash encoding of a stdlib- (or SDK-)provided type. This is highly unlikely to happen in practice, but the longer we wait to fix hashing issues, the more likely it will be that someone starts relying on hashing particulars. For example, I could imagine this to happen through a well-meaning desire to work around a broken hash implementation. It is therefore important to do this sooner rather than later.

## Notes

I'll highlight points specific to a particular changes as self-review comments. Here is a list of general remarks:

1. It is neither required nor desirable for Swift's `hashValue` to return reproducible values.

2. We don't need to ever match hash values between Swift types and the Cocoa classes that they bridge with. The act of bridging involves a full rehashing of all Set/Dictionary values.

3. The only hard requirement is that hashing needs to be consistent with the equivalence classes implemented by `==` *within the same type*.

4. Hashing is meaningless without equality. `Equatable` only defines equality for values of a single type. Therefore, `Hashable` makes no requirements or expectations about the hash values produced by values of different types.

   - It is perfectly fine for hash values produced by different types to not match. `(0 as Int8).hashValue` and `(0 as Int16).hashValue` will (typically) not be equal in Swift.
   - It is also perfectly fine for values of two distinct types to produce matching hash values. `Measurement<UnitLength>.hash(into:)` and `Measurement<UnitDuration>.hash(into:)` are allowed to use the same hash encoding, since there is no way they can ever produce large-scale collisions within the same hash table.

5. While this is not a hard requirement for user code, for boundary types provided in the stdlib/SDK, we require that hashing isn't just consistent with equality, but that it's *equivalent* to it. The Swift test suite has [checks] to actively enforce this -- this is possible through [repeatedly salting the hash function][salt].

   "Optimizing" hashing by omitting some of the data compared by `==` is generally a mistake in Swift, because it *completely* breaks all guarantees about the strength of hashing, and opens the door to (accidental or deliberate) collision attacks. It's perfectly acceptable to hash a gigabyte of data if someone inserts some large value (such as a big collection) as a key in a hash table. Multi-megabyte String keys are easy to protect against; hidden hashing weaknesses aren't.

   [checks]: https://github.com/apple/swift/blob/master/stdlib/private/StdlibUnittest/StdlibUnittest.swift#L2417-L2578
   [salt]: https://github.com/apple/swift/blob/master/stdlib/private/StdlibUnittest/StdlibUnittest.swift#L2541-L2562

6. It would be nice if we could somehow improve hashing in Foundation's Objective-C classes, but this is outside the scope of this PR (or indeed, this repository). However, we should at least make sure Swift programmers can rely on the quality of hashes produced by native Swift value types.

7. https://github.com/apple/swift-corelibs-foundation/pull/2118 is the corresponding PR for swift-corelibs-foundation. These PRs should land at the same time.

8. Replacing `hashValue` implementations on types with `hash(into:)` has no effect on the symbols exposed as ABI -- the compiler automatically synthesizes `hashValue` from `hash(into:)`, and vice versa. (This is not the case in protocol extensions, but we've fixed those in the 4.2/5.0 timeframe.) In the Swift 5.0 ABI, all of these types include definitions for both `hashValue` and `hash(into:)` so we don't need to worry about availability issues. ~~The sole exception is `URL`, which implements `hashValue` but isn't `Hashable`. Unfortunately we can't add the conformance in 5.1, but we should add a (versioned) `hash(into:)` definition to discourage people from trying to implement it on their own.~~ (Edit: `URL` is not a special case; it's `Hashable` like the rest. I got mislead by an unrelated error.)

rdar://problem/43394032

In particular, @lorentey writes:

Beyond simplifying hashing, the intent of SE-0206 is to enable Swift to provide certain guarantees about its quality. In particular: as long as hash(into:) feeds enough data the hasher to unambiguously decide equality, Swift attempts to guarantee that collision attacks won't be possible. For this to work, it is critically important for Hashable implementations to include everything that Equatable.== looks at; and this is especially the case for the basic boundary types that come built-in with Swift, like Data.

So the short answer, I guess, is that the protocol attempts through its documentation to ensure a certain hash quality for conforming types. Attempting a “fast path” which does not adhere to its recommendations will degrade that quality, which has associated consequences for its downstream uses.
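
As a rough illustration of what "degraded quality" means here, the sketch below compares how many distinct hash values a batch of distinct keys produces. `PartialKey` and `FullKey` are hypothetical types invented for this example, and the batch size of 1,000 is arbitrary.

```swift
// Hypothetical key types, for illustration only.
struct PartialKey: Hashable {
    var bucket: Int
    var payload: Int

    static func == (lhs: PartialKey, rhs: PartialKey) -> Bool {
        lhs.bucket == rhs.bucket && lhs.payload == rhs.payload
    }

    // payload is compared by == but never hashed.
    func hash(into hasher: inout Hasher) {
        hasher.combine(bucket)
    }
}

struct FullKey: Hashable {
    var bucket: Int
    var payload: Int
    // Synthesized == and hash(into:) both use bucket and payload.
}

// 1,000 distinct keys that differ only in payload.
let partialHashes = Set((0..<1_000).map { PartialKey(bucket: 0, payload: $0).hashValue })
let fullHashes = Set((0..<1_000).map { FullKey(bucket: 0, payload: $0).hashValue })

print(partialHashes.count) // 1: every key collides
print(fullHashes.count)    // almost certainly 1000, though exact values vary per run
```

With `PartialKey`, anyone who controls `payload` can put arbitrarily many keys into a single collision chain, and the hasher's random seeding cannot prevent it.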

Jumhyn · October 28, 2020, 8:51pm

Thanks @xwu, that indeed answers my question! This excerpt is also highly relevant:

While this is not a hard requirement for user code, for boundary types provided in the stdlib/SDK, we require that hashing isn't just consistent with equality, but that it's equivalent to it. The Swift test suite has checks to actively enforce this -- this is possible through repeatedly salting the hash function. "Optimizing" hashing by omitting some of the data compared by == is generally a mistake in Swift, because it completely breaks all guarantees about the strength of hashing, and opens the door to (accidental or deliberate) collision attacks. It's perfectly acceptable to hash a gigabyte of data if someone inserts some large value (such as a big collection) as a key in a hash table. Multi-megabyte String keys are easy to protect against; hidden hashing weaknesses aren't.

Yep, makes total sense. I will note, though, that this is not expressed in the documentation as a "recommendation"—it's expressed as a "must" which, by my reading, means that types which don't satisfy this detail do not validly conform to Hashable (just as types for which == is not an equivalence relation have not validly conformed to Equatable).

Given @lorentey's statement that the equivalence between hashability and equatability is "not a hard requirement for user code," I'm still curious whether the documentation intends to express this as a recommendation or a requirement for valid conformers.

Jumhyn · October 28, 2020, 10:08pm

It's also worth noting that the as-accepted wording for hash(into:)'s documentation was:

  /// Hash the essential components of this value into the hash function
  /// represented by `hasher`, by feeding them into it using its `combine`
  /// methods.
  ///
  /// Essential components are precisely those that are compared in the type's
  /// implementation of `Equatable`.
  func hash(into hasher: inout Hasher)

So it appears that this has been the rule since the dawn of SE-0206.
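
For comparison, a conformance that follows the "precisely those that are compared" wording might look like the sketch below. `Polygon` and `cachedArea` are hypothetical names used only for illustration. The non-essential cache is left out of both `==` and `hash(into:)`, so the two stay in step.

```swift
// Hypothetical type, for illustration only.
struct Polygon: Hashable {
    var vertices: [Int]
    var cachedArea: Double? = nil   // derived data, not an essential component

    // Equality looks only at vertices.
    static func == (lhs: Polygon, rhs: Polygon) -> Bool {
        lhs.vertices == rhs.vertices
    }

    // Hashing feeds exactly what == compares, nothing more and nothing less.
    func hash(into hasher: inout Hasher) {
        hasher.combine(vertices)
    }
}

let plain = Polygon(vertices: [1, 2, 3])
let cached = Polygon(vertices: [1, 2, 3], cachedArea: 6.0)

// The cache differs, but the values are equal and hash identically.
assert(plain == cached)
assert(plain.hashValue == cached.hashValue)
```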