[Accepted] SE-0206: Hashable Enhancements

I really like this proposal but I am just wondering why requiring a combine call for each property? Let’s say we want to combine the hash of 5-6 properties, it would look like this:

func hash(into hasher: inout Hasher) {
    hasher.combine(topLeft)
    hasher.combine(bottomRight)
    hasher.combine(bottomLeft)
    hasher.combine(topRight)
    hasher.combine(midBottom)
    hasher.combine(midTop)
}

Needless to say this code could be greatly reduced by having an overload that takes a Set or Array that would look like this (or by using keypaths):

func hash(into hasher: inout Hasher) {
    hasher.combine([topLeft, bottomRight, bottomLeft, topRight, midBottom, midTop])
}

Sorry! :pensive: I had to pick a sacrificial example, and the docs were the most obvious victim. To be fair, I think documenting how to implement hashValue in a paragraph or two is an impossible task; GridPoint made a valiant effort, which made it all the more delicious to rip it apart. :smiling_imp: It’s actually a great representative example for the top ~5% of hashValue implementations out there – I’ve seen far, far worse. (Including most of my own attempts.)

This is worth considering, especially if we keep the generic Hasher.combine(_:) – that function is a nice replacement for direct calls to hash(into:). But the natural way to define hash(into:) is to make it an instance method; do we really want to carve out a new API convention for these cases? We also need to ensure that we remain able to override hash(into:) in class hierarchies; making it a static requirement makes that more difficult.

I’d prefer to add some metadata attribute / doc string syntax that de-emphasizes rarely used API like hash(into:), encode(to:), init(arrayLiteral:), etc. in the docs and editor/autocomplete UI, but still allows them to be defined in the natural way. (I also think hash(into:) should still be listed in autocomplete menus, it should just not be in a prominent position. It may not be frequently called, but I’m sure it’ll still be used sometimes.)

It’s not required. A problem discussed when automatic Hashable conformance was being reviewed is that of having non-relevant properties be part of the hash input. If you have a Circle, let’s say, with:

  • float radius,
  • vec2 center,
  • float area,
  • float perimeter

Area and perimeter are not relevant because they are derived from the radius. The proposed combine technique allows you to only give it what’s relevant; that is, radius and center.

I don’t know if that was your issue.

Regarding an overload that takes an array or list as argument: it could be good, but I think it reduces readability imo. It’s easier to miss something if everything’s in the same line.

Sure, I got that this combine method allows for more flexibility regarding which properties participate in the hash, but in some cases it could be all properties participate except one. In this case, I think it’d make sense to have a single method that takes all the key paths of the properties which should or shouldn’t participate in the hash.
I guess many people will just copy/paste the combine calls and they will forget to change the property name and so they could end up with multiple combine for the same property (what will happen in that case from the compiler perspective?). In case we use a Set of key paths, we could enforce a single combine for each property instead.

1 Like

Note that this sort of comment might be taken as dismissing a commenter’s concerns. The Swift core team and engineers are humans just like you and me and not omniscient, so it’s reasonable to raise concerns about the design being proposed, especially ones like @Gankro’s that are based on real-world experience with a similar design. I know you probably didn’t intend any harm in your reply, just keep that in mind.

2 Likes

There is an overload of Hasher.combine(bits:) that takes an UnsafeRawBufferPointer. Do you think we need to provide more ways to hash bulk data?

Hash memoization and custom hash functions are still possible in the hash(into:) world, although they will incur the cost of feeding a single word to Hasher. (This is similar to how Swift 4.1’s Set and Dictionary unconditionally postprocess hash values with _mixInt, it’s just more explicit.)

class Foo: Hashable {
  // ...
  var _cachedHash: Int? = nil
  func hash(into hasher: inout Hasher) {
    if let hash = _cachedHash {
      hasher.combine(bits: hash)
      return
    }
    var hash = 0
    // Calculate hash however you like
    _cachedHash = hash
    hasher.combine(bits: hash)
  }
}

I think we should be careful not to overgeneralize Hashable; I consider it more as a support protocol for Set and Dictionary, rather than something that should cover all possible forms of hashing.

For example, custom seed values makes it possible to use Hashable types in e.g. some forms of Bloom filters. This is a nice bonus, but it is not why we added this feature – the proposal adds supports for custom seed values because this turned out to be a requirement for Set/Dictionary. These collections have no need for an uncombine operation, so Hasher doesn’t provide it – rolling hashes would needlessly restrict the pool of hash functions Hasher could implement.

For specialized/generalized hashing requirements, I think it is best to define separate protocol(s).

If there are many components, it may indeed make sense to stage them in local buffers rather than feeding them into the hasher one by one. The problem with constructing an Array or Set directly in hash(into:) is that the overhead of allocating storage for these collections may very well be larger than combine's call overhead. Keypaths currently have significant overhead on their own, too. (Edit: added “currently”)

I suspect that for six components, just calling combine six times will be competitive with most alternatives.

When buffering does turn out to be useful (e.g., for variable-length types), the combine(_:) overload that takes an UnsafeRawBufferPointer should cover the most general case – but we need more data! If you know a use case which would work better with something else, please let us know.

You’re completely right Joe.

I’m sorry @Gankro. Your concerns are very much valid. I didn’t mean otherwise.

That is a really good point. Maybe the compiler should issue a warning in cases like that:

  • Two or more combines on a single property.
1 Like

Yes! The stdlib itself uses this technique for implementing hashing for Set and Dictionary. (An alternative would be to keep a copy of hash in their storage representation, and keep it updated with each mutation.)

I’m really not sure. I suspect Set and Dictionary may be the only order-independent collections most users will ever come across, much less implement, in Swift.

1 Like

We’re certainly not there today, but in the fullness of time we would like constant folding and loop unrolling to work well enough to make it viable to build abstractions on top of collections of key paths.

1 Like

What is your evaluation of the proposal?

Huge +1 in general. The proposed changes are very well motivated and an important area for improvement.

There has been some good discussion of design details in this thread. I don’t have a strong opinion about the details and trust those with more expertise to make the right decisions.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes. As the proposal notes, expecting the average Swift developer to provide a good implementation of hashValue is unreasonable as it requires expertise they are unlikely to have. The move to a solution that only requires developers to specify the input data to be hashed is an enormous improvement.

Does this proposal fit well with the feel and direction of Swift?

Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

N / A

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

A quick read of the final proposal as well as participation in the earlier discussion threads.

So I tried to do this in Rust and ran into a bunch of trouble. Until now I couldn’t quite remember what the issue was, but I just did, and it might not effect Swift if I recall the available APIs. In Rust we let you look up in HashMap with types that don’t exactly match the type of the key (Q where K: Borrow<Q>).

The canonical example is you can lookup elements in a Dictionary<Array, _> with a Slice (using Swift types here for clarity). Adding this API that gave types two different hashes created chaos because if one of the owned or borrowed types remembered to override hash() but not hashOnly(), everything fell apart. Like I literally couldn’t bootstrap the rust compiler because I missed some cases and it would miscompile itself.

2 Likes

There is an overload of Hasher.combine(bits:) that takes an UnsafeRawBufferPointer. Do you think we need to provide more ways to hash bulk data?

You are correct. Thanks for challenging my criticism. It should be possible to support high-speed hashing then.

I think we should be careful not to overgeneralize

Agreed.

Although you can cache the hash it isn’t obvious how you do this, is significant boiler plate, and has some overhead. Is it possible to add caching? EG:

struct Expensive {
    private var hash: Int?
    override func hash(into: inout Hasher) {
        guard h = hash else {
            into.combine( ... )
            ...
            hash = into.finalize()
        }
        into.previouslyCachedValue(h)
    }
}

IE I’m proposing adding previouslyCachedValue to Hasher. As a nice addition, this change gives a reason for finalize being a public function (at present the author of hash(into:) can call finalize but you shouldn’t!). An implication of this change is that a Hasher should be cheap to create so that the overhead of caching is low.

2 Likes

I wasn’t thinking about the class hierarchy issue—that’s a good point. And yes, something to deemphasize this batch of “usually the wrong thing” APIs would be great to have if tools could hook into it. Thanks!

(And j/k about the example, though it used to just return x ^ y, which is reprehensible.)

1 Like

A couple of minor quibbles below, but overall a strong +1 from me. This is an important issue that needs to be addressed in one way or another. I have read the full proposal and the discussion so far and agree with most of the arguments for the proposed design.

[Just a side note: this is really something of an exemplary proposal. The motivation and historical summary are clear and detailed. The full API and doc comments are camera-ready and nicely described in the surrounding text. Nice job.]

Comments:

  1. I find the bits labels in the various type-specific forms of combine(bits:) somewhat strange. Yes, I get that the hasher doesn’t care that what you’re passing in is an Int; it just wants its 32 bits or 64 bits of data. By why is this important to represent in the API? Why are these not just type-specific overloads for combine()?

    In the current design, if you call combine() on an Int (as the GridPoint example seems to imply that you should) the hasher presumably calls Int.hash(into:), which presumably then turns arounds and re-calls Hasher.combine(bits: Int). Why not just avoid this little dance and accept combine(Int) directly?

  2. @thomasvl raised the issue of how to hash things order-independently. The official solution of hashing all the parts separately with their own hashers and then XORing the results seems functional and not hard to implement given that this is an uncommon requirement. However, the proposal itself talks at some length about how instantiating a bunch of separate hashers may be potentially nonperformant. This case on its own seems like a good argument for making Hasher a protocol.

  3. It’s not obvious from the proposal that Hasher is, in fact, guaranteed to be order-sensitive, though that’s probably what most people would guess. This should really be an explicit part of the API contract that is mentioned in the header comments. If you’re hashing a sequence of things, there’s no need to, e.g., interleave sequence numbers if you want order-sensitive hashing.

  4. I agree with @hlovatt that the ability to cache hash values is important.

  5. This is getting exceedingly picayune, but it bothers me that the GridPoint example in the header comments shows how to add hashing to a struct that in fact needs no code from you at all; the compiler would handle it just fine on its own, and in fact it would be stylistically better for you to let the compiler do so. It would be more representative to include (in the GridPoint struct) an instance variable (e.g., a color) that you don’t want to include in hashing or equality comparisons and to note that its presence prevents the use of default hashing.

2 Likes

This does seem important. I’m not sure you’d need any more API, though. Wouldn’t this suffice?

struct Expensive {
    private var hash: Int?
    override func hash(into: inout Hasher) {
        if hash == nil {
            var selfHasher = Hasher()
            selfHasher.combine( ... )
            ...
            hash = selfHasher.finalize()
        }
        into.combine(bits: hash!)
    }
}

It does assume that Hasher is sufficiently cheap to instantiate, and that you don’t mind the performance penalty of rehashing the cached hash value as an Int. This model also doesn’t play well with compiler-generated hashing implementations: there’s no obvious solution for objects that can be straighforwardly (but expensively) hashed by the default machinery, other than implementing hashing by hand. Probably live-withable, though.

2 Likes

If so, maybe append would be better suited than combine?

This seems like it could be a relatively common scenario. Instead of requiring implementors to use an ad-hoc scheme such as packing in their own size, maybe it should be elevated into the formal API, e.g.:

extension Hasher {
    public mutating func markEndOfGroup() {}
}

I see two potential advantages to including this in the API. First, the goal is just to perturb the hashing algorithm in some way, and the hash algorithm implementors know the most efficient and effective way to do that. Ad-hoc implementors would probably just hash in an Int of some kind, but that can probably be improved upon.

Second, including this as an API call allows it to be optimized out entirely in the case where no calls intervene between markEndOfGroup() and finalize().

Terms of Service

Privacy Policy

Cookie Policy