[Accepted] SE-0206: Hashable Enhancements

You’re completely right Joe.

I’m sorry @Gankro. Your concerns are very much valid. I didn’t mean otherwise.

That is a really good point. Maybe the compiler should issue a warning in cases like that:

  • Two or more combines on a single property.
1 Like

Yes! The stdlib itself uses this technique for implementing hashing for Set and Dictionary. (An alternative would be to keep a copy of hash in their storage representation, and keep it updated with each mutation.)

I’m really not sure. I suspect Set and Dictionary may be the only order-independent collections most users will ever come across, much less implement, in Swift.

1 Like

We’re certainly not there today, but in the fullness of time we would like constant folding and loop unrolling to work well enough to make it viable to build abstractions on top of collections of key paths.

1 Like

What is your evaluation of the proposal?

Huge +1 in general. The proposed changes are very well motivated and an important area for improvement.

There has been some good discussion of design details in this thread. I don’t have a strong opinion about the details and trust those with more expertise to make the right decisions.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes. As the proposal notes, expecting the average Swift developer to provide a good implementation of hashValue is unreasonable as it requires expertise they are unlikely to have. The move to a solution that only requires developers to specify the input data to be hashed is an enormous improvement.

Does this proposal fit well with the feel and direction of Swift?

Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

N / A

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

A quick read of the final proposal as well as participation in the earlier discussion threads.

So I tried to do this in Rust and ran into a bunch of trouble. Until now I couldn’t quite remember what the issue was, but I just did, and it might not effect Swift if I recall the available APIs. In Rust we let you look up in HashMap with types that don’t exactly match the type of the key (Q where K: Borrow<Q>).

The canonical example is you can lookup elements in a Dictionary<Array, _> with a Slice (using Swift types here for clarity). Adding this API that gave types two different hashes created chaos because if one of the owned or borrowed types remembered to override hash() but not hashOnly(), everything fell apart. Like I literally couldn’t bootstrap the rust compiler because I missed some cases and it would miscompile itself.

2 Likes

There is an overload of Hasher.combine(bits:) that takes an UnsafeRawBufferPointer. Do you think we need to provide more ways to hash bulk data?

You are correct. Thanks for challenging my criticism. It should be possible to support high-speed hashing then.

I think we should be careful not to overgeneralize

Agreed.

Although you can cache the hash it isn’t obvious how you do this, is significant boiler plate, and has some overhead. Is it possible to add caching? EG:

struct Expensive {
    private var hash: Int?
    override func hash(into: inout Hasher) {
        guard h = hash else {
            into.combine( ... )
            ...
            hash = into.finalize()
        }
        into.previouslyCachedValue(h)
    }
}

IE I’m proposing adding previouslyCachedValue to Hasher. As a nice addition, this change gives a reason for finalize being a public function (at present the author of hash(into:) can call finalize but you shouldn’t!). An implication of this change is that a Hasher should be cheap to create so that the overhead of caching is low.

2 Likes

I wasn’t thinking about the class hierarchy issue—that’s a good point. And yes, something to deemphasize this batch of “usually the wrong thing” APIs would be great to have if tools could hook into it. Thanks!

(And j/k about the example, though it used to just return x ^ y, which is reprehensible.)

1 Like

A couple of minor quibbles below, but overall a strong +1 from me. This is an important issue that needs to be addressed in one way or another. I have read the full proposal and the discussion so far and agree with most of the arguments for the proposed design.

[Just a side note: this is really something of an exemplary proposal. The motivation and historical summary are clear and detailed. The full API and doc comments are camera-ready and nicely described in the surrounding text. Nice job.]

Comments:

  1. I find the bits labels in the various type-specific forms of combine(bits:) somewhat strange. Yes, I get that the hasher doesn’t care that what you’re passing in is an Int; it just wants its 32 bits or 64 bits of data. By why is this important to represent in the API? Why are these not just type-specific overloads for combine()?

    In the current design, if you call combine() on an Int (as the GridPoint example seems to imply that you should) the hasher presumably calls Int.hash(into:), which presumably then turns arounds and re-calls Hasher.combine(bits: Int). Why not just avoid this little dance and accept combine(Int) directly?

  2. @thomasvl raised the issue of how to hash things order-independently. The official solution of hashing all the parts separately with their own hashers and then XORing the results seems functional and not hard to implement given that this is an uncommon requirement. However, the proposal itself talks at some length about how instantiating a bunch of separate hashers may be potentially nonperformant. This case on its own seems like a good argument for making Hasher a protocol.

  3. It’s not obvious from the proposal that Hasher is, in fact, guaranteed to be order-sensitive, though that’s probably what most people would guess. This should really be an explicit part of the API contract that is mentioned in the header comments. If you’re hashing a sequence of things, there’s no need to, e.g., interleave sequence numbers if you want order-sensitive hashing.

  4. I agree with @hlovatt that the ability to cache hash values is important.

  5. This is getting exceedingly picayune, but it bothers me that the GridPoint example in the header comments shows how to add hashing to a struct that in fact needs no code from you at all; the compiler would handle it just fine on its own, and in fact it would be stylistically better for you to let the compiler do so. It would be more representative to include (in the GridPoint struct) an instance variable (e.g., a color) that you don’t want to include in hashing or equality comparisons and to note that its presence prevents the use of default hashing.

2 Likes

This does seem important. I’m not sure you’d need any more API, though. Wouldn’t this suffice?

struct Expensive {
    private var hash: Int?
    override func hash(into: inout Hasher) {
        if hash == nil {
            var selfHasher = Hasher()
            selfHasher.combine( ... )
            ...
            hash = selfHasher.finalize()
        }
        into.combine(bits: hash!)
    }
}

It does assume that Hasher is sufficiently cheap to instantiate, and that you don’t mind the performance penalty of rehashing the cached hash value as an Int. This model also doesn’t play well with compiler-generated hashing implementations: there’s no obvious solution for objects that can be straighforwardly (but expensively) hashed by the default machinery, other than implementing hashing by hand. Probably live-withable, though.

2 Likes

If so, maybe append would be better suited than combine?

This seems like it could be a relatively common scenario. Instead of requiring implementors to use an ad-hoc scheme such as packing in their own size, maybe it should be elevated into the formal API, e.g.:

extension Hasher {
    public mutating func markEndOfGroup() {}
}

I see two potential advantages to including this in the API. First, the goal is just to perturb the hashing algorithm in some way, and the hash algorithm implementors know the most efficient and effective way to do that. Ad-hoc implementors would probably just hash in an Int of some kind, but that can probably be improved upon.

Second, including this as an API call allows it to be optimized out entirely in the case where no calls intervene between markEndOfGroup() and finalize().

Wait, I’m slightly confused. When we talk about order-sensitivity, is it…

  1. When I hash Circle A and Circle B, and A.hash and B.hash don’t change if I hash A first or B first.
  2. When I hash a Vector2 V, and V.hash is the same if I reorder my calls to combine from: combine(V.x); combine(V.y); to combine(V.y); combine(V.x);

Which one is it?

Yeah, append actually suits it better from that POV.

Isn’t it better to append the length inside the finalize() implementation? Since when calling finalize() you’re implicitly declaring that it’s the end of the sequence.

Oops, sorry - I meant order sensitivity in the same sense that (I think) @thomasvl is using it. That is, given this naive implementation

extension Collection {
    public func hash(into: inout Hasher) {
        self.forEach { into.combine($0) }
    }
}

I would expect that in the default implementation, [x, y] would have a different hash value from [y, x]. But the current proposal doesn’t actually state this one way or the other.

It’s important because if you explicitly want hash values to change when elements are reordered, and the standard hashing implementation doesn’t care about the order in which components are presented, you have to do some additional work to ensure that reordering changes the hash value.

Conversely, for something like a Set, you probably do not want to order to matter, hence the earlier discussion of hashing the values of elements individually and then XORing them.

2 Likes

I see. Yes, this should be mentioned in the proposal.

As for which one of the two… I’d prefer it to be order-dependent. I’m pretty sure that’s more secure.

1 Like

See @Gankro’s first post in this thread for more context - I think that will help clarify.

The need is to delineate element subgroups within the sequence of elements presented to a Hasher. So it’s different from just finalizing the Hasher.

I’m not sure if I understand correctly. Please tell me if I’m wrong here:

  • The API proposed allows for an implementation that defends against HashDOS attacks by, for example, using mixins after every call to combine.
  • The problem with that as the default would be performance-wise. It’s true that you can implement a hash with mixins that doesn’t need to store all the pieces in a buffer before hashing — it could happen piece by piece.
  • But adding mixins in between would make combine more expensive to run, which depending on how much more expensive that is, would impose a performance hit on otherwise lightweight use-cases.
  • Therefore, the suggestion is to be able to disable security mixins as an option.

How wrong did I get it? I feel I’m missing something, but I can’t quite get at it.

Terms of Service

Privacy Policy

Cookie Policy