[Accepted] SE-0206: Hashable Enhancements

We're certainly not there today, but in the fullness of time we would like constant folding and loop unrolling to work well enough to make it viable to build abstractions on top of collections of key paths.

What is your evaluation of the proposal?

Huge +1 in general. The proposed changes are very well motivated and an important area for improvement.

There has been some good discussion of design details in this thread. I don't have a strong opinion about the details and trust those with more expertise to make the right decisions.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes. As the proposal notes, expecting the average Swift developer to provide a good implementation of hashValue is unreasonable as it requires expertise they are unlikely to have. The move to a solution that only requires developers to specify the input data to be hashed is an enormous improvement.

Does this proposal fit well with the feel and direction of Swift?

Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

N / A

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

A quick read of the final proposal as well as participation in the earlier discussion threads.

There is an overload of Hasher.combine(bits:) that takes an UnsafeRawBufferPointer. Do you think we need to provide more ways to hash bulk data?

You are correct. Thanks for challenging my criticism. It should be possible to support high-speed hashing then.

I think we should be careful not to overgeneralize

Agreed.

Although you can cache the hash, it isn’t obvious how to do so, it involves significant boilerplate, and it has some overhead. Is it possible to add caching? E.g.:

struct Expensive {
    private var hash: Int?
    override func hash(into: inout Hasher) {
        guard h = hash else {
            into.combine( ... )
            ...
            hash = into.finalize()
        }
        into.previouslyCachedValue(h)
    }
}

I.e., I’m proposing adding previouslyCachedValue to Hasher. As a nice addition, this change gives a reason for finalize being a public function (at present the author of hash(into:) can call finalize but you shouldn’t!). An implication of this change is that a Hasher should be cheap to create so that the overhead of caching is low.

I wasn't thinking about the class hierarchy issue—that's a good point. And yes, something to deemphasize this batch of "usually the wrong thing" APIs would be great to have if tools could hook into it. Thanks!

(And j/k about the example, though it used to just return x ^ y, which is reprehensible.)

A couple of minor quibbles below, but overall a strong +1 from me. This is an important issue that needs to be addressed in one way or another. I have read the full proposal and the discussion so far and agree with most of the arguments for the proposed design.

[Just a side note: this is really something of an exemplary proposal. The motivation and historical summary are clear and detailed. The full API and doc comments are camera-ready and nicely described in the surrounding text. Nice job.]

Comments:

  1. I find the bits labels in the various type-specific forms of combine(bits:) somewhat strange. Yes, I get that the hasher doesn't care that what you're passing in is an Int; it just wants its 32 bits or 64 bits of data. But why is this important to represent in the API? Why are these not just type-specific overloads for combine()?

    In the current design, if you call combine() on an Int (as the GridPoint example seems to imply that you should) the hasher presumably calls Int.hash(into:), which presumably then turns around and re-calls Hasher.combine(bits: Int). Why not just avoid this little dance and accept combine(Int) directly?

  2. @thomasvl raised the issue of how to hash things order-independently. The official solution of hashing all the parts separately with their own hashers and then XORing the results seems functional and not hard to implement given that this is an uncommon requirement. However, the proposal itself talks at some length about how instantiating a bunch of separate hashers may be potentially nonperformant. This case on its own seems like a good argument for making Hasher a protocol.

  3. It's not obvious from the proposal that Hasher is, in fact, guaranteed to be order-sensitive, though that's probably what most people would guess. This should really be an explicit part of the API contract that is mentioned in the header comments. If you're hashing a sequence of things, there's no need to, e.g., interleave sequence numbers if you want order-sensitive hashing.

  4. I agree with @hlovatt that the ability to cache hash values is important.

  5. This is getting exceedingly picayune, but it bothers me that the GridPoint example in the header comments shows how to add hashing to a struct that in fact needs no code from you at all; the compiler would handle it just fine on its own, and in fact it would be stylistically better for you to let the compiler do so. It would be more representative to include (in the GridPoint struct) an instance variable (e.g., a color) that you don't want to include in hashing or equality comparisons and to note that its presence prevents the use of default hashing.
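To illustrate point 5, here is a sketch (not from the proposal) of the suggested variant: a GridPoint with an extra display-only property that must not participate in hashing or equality, which is exactly the situation where the compiler-synthesized conformance no longer suffices.

```swift
// Hypothetical variant of the documentation's GridPoint example:
// `color` is deliberately excluded from hashing and equality, so a
// manual Hashable implementation is actually required.
struct GridPoint {
    var x: Int
    var y: Int
    var color: String  // display-only; must not affect hashing or equality
}

extension GridPoint: Hashable {
    static func == (lhs: GridPoint, rhs: GridPoint) -> Bool {
        return lhs.x == rhs.x && lhs.y == rhs.y
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(x)
        hasher.combine(y)
    }
}
```

Two points that differ only in color then compare equal and produce the same hash, as Hashable's contract requires.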

This does seem important. I'm not sure you'd need any more API, though. Wouldn't this suffice?

struct Expensive {
    private var hash: Int?
    override func hash(into: inout Hasher) {
        if hash == nil {
            var selfHasher = Hasher()
            selfHasher.combine( ... )
            ...
            hash = selfHasher.finalize()
        }
        into.combine(bits: hash!)
    }
}

It does assume that Hasher is sufficiently cheap to instantiate, and that you don't mind the performance penalty of rehashing the cached hash value as an Int. This model also doesn't play well with compiler-generated hashing implementations: there's no obvious solution for objects that can be straightforwardly (but expensively) hashed by the default machinery, other than implementing hashing by hand. Probably live-withable, though.

If so, maybe append would be better suited than combine?

This seems like it could be a relatively common scenario. Instead of requiring implementors to use an ad-hoc scheme such as packing in their own size, maybe it should be elevated into the formal API, e.g.:

extension Hasher {
    public mutating func markEndOfGroup() {}
}

I see two potential advantages to including this in the API. First, the goal is just to perturb the hashing algorithm in some way, and the hash algorithm implementors know the most efficient and effective way to do that. Ad-hoc implementors would probably just hash in an Int of some kind, but that can probably be improved upon.

Second, including this as an API call allows it to be optimized out entirely in the case where no calls intervene between markEndOfGroup() and finalize().

Wait, I’m slightly confused. When we talk about order-sensitivity, is it...

  1. When I hash Circle A and Circle B, and A.hash and B.hash don’t change if I hash A first or B first.
  2. When I hash a Vector2 V, and V.hash is the same if I reorder my calls to combine from: combine(V.x); combine(V.y); to combine(V.y); combine(V.x);

Which one is it?

Yeah, append actually suits it better from that POV.

Isn’t it better to append the length inside the finalize() implementation, since by calling finalize() you’re implicitly declaring that it’s the end of the sequence?

Oops, sorry - I meant order sensitivity in the same sense that (I think) @thomasvl is using it. That is, given this naive implementation

extension Collection {
    public func hash(into: inout Hasher) {
        self.forEach { into.combine($0) }
    }
}

I would expect that in the default implementation, [x, y] would have a different hash value from [y, x]. But the current proposal doesn't actually state this one way or the other.

It's important because if you explicitly want hash values to change when elements are reordered, and the standard hashing implementation doesn't care about the order in which components are presented, you have to do some additional work to ensure that reordering changes the hash value.

Conversely, for something like a Set, you probably do not want to order to matter, hence the earlier discussion of hashing the values of elements individually and then XORing them.
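As a concrete sketch of the XOR scheme mentioned above (my own illustration, not proposal text): each component gets its own Hasher, and the finalized per-element hashes are XORed so that element order cannot influence the result. The `UnorderedPair` type here is made up for demonstration.

```swift
// Order-independent hashing via per-element hashers whose finalized
// values are XORed together. XOR is commutative, matching the
// order-insensitive definition of ==.
struct UnorderedPair<Element: Hashable>: Hashable {
    var first: Element
    var second: Element

    static func == (lhs: UnorderedPair, rhs: UnorderedPair) -> Bool {
        return (lhs.first == rhs.first && lhs.second == rhs.second)
            || (lhs.first == rhs.second && lhs.second == rhs.first)
    }

    func hash(into hasher: inout Hasher) {
        var h1 = Hasher()
        h1.combine(first)
        var h2 = Hasher()
        h2.combine(second)
        // Feed the commutative combination into the outer hasher.
        hasher.combine(h1.finalize() ^ h2.finalize())
    }
}
```

Within a single process all Hasher instances share the same random seed, so the per-element hashes are deterministic and equal pairs hash equally regardless of element order.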

I see. Yes, this should be mentioned in the proposal.

As for which one of the two... I’d prefer it to be order-dependent. I’m pretty sure that’s more secure.

See @Gankro's first post in this thread for more context - I think that will help clarify.

The need is to delineate element subgroups within the sequence of elements presented to a Hasher. So it's different from just finalizing the Hasher.

I’m not sure if I understand correctly. Please tell me if I’m wrong here:

  • The API proposed allows for an implementation that defends against HashDOS attacks by, for example, using mixins after every call to combine.
  • The problem with making that the default would be performance. It’s true that you can implement a hash with mixins that doesn’t need to store all the pieces in a buffer before hashing — it could happen piece by piece.
  • But adding mixins in between would make combine more expensive to run, which depending on how much more expensive that is, would impose a performance hit on otherwise lightweight use-cases.
  • Therefore, the suggestion is to be able to disable security mixins as an option.

How wrong did I get it? I feel I’m missing something, but I can’t quite get at it.

The latter point, cheap rehash and finalize (you didn't mention that), was my concern in suggesting an extra method. I also noted that instantiating a hasher needs to be cheap. As a further point, whilst your code is very similar to mine, it isn't as obvious, whereas a method called previouslyCachedValue is.

The concern about the cost of instantiating the hasher is there with both pieces of code, therefore is there a better solution that avoids instantiating the hasher?

A big +1 from me on this one. :smile:

When writing my original blog post about the problems with today's Hashable and the GridPoint documentation example in particular I jumped from my chair when I saw the collision statistics from running some quick tests on it. Being able to provide the very go-to example from the documentation as an adversarial example for the status quo was just perfect. So I have to thank you for that. It was the wind in the sails of this proposal. :raised_hands:t2: Without it I would have had to come up with a straw-man of my own, which would have greatly diminished the strength of the argument I was trying to make with the blog post.

Consider a deeply nested tree structure that recursively calls hash(into:) on its children and own value:

func hash(into hasher: inout Hasher) {
    hasher.combine([leftChild, self.value, rightChild])
}

With this one would get thousands of short-lived (and mostly rather small) Array allocations per hash computation. This would not only be bad for performance but might also badly fragment your memory, bringing your allocator to its knees.
Furthermore, this would only be useful for homogeneous values. For my example above, [leftChild, self.value, rightChild] would have to be of type [Hashable], imposing dynamic dispatch on the elements, further slowing things down.
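A minimal sketch of the allocation-free alternative (the Node type is mine, purely for illustration): recurse into the same hasher rather than building a temporary array of children.

```swift
// Recursive hashing without intermediate Array allocations: each node
// feeds its children and its own value directly into the shared hasher.
final class Node: Hashable {
    var value: Int
    var left: Node?
    var right: Node?

    init(_ value: Int, left: Node? = nil, right: Node? = nil) {
        self.value = value
        self.left = left
        self.right = right
    }

    static func == (lhs: Node, rhs: Node) -> Bool {
        return lhs.value == rhs.value && lhs.left == rhs.left && lhs.right == rhs.right
    }

    func hash(into hasher: inout Hasher) {
        left?.hash(into: &hasher)   // recurse with the same hasher
        hasher.combine(value)
        right?.hash(into: &hasher)
    }
}
```

Note that a plain in-order traversal like this cannot distinguish differently shaped trees with the same element sequence, which ties into the group-delimiting discussion elsewhere in this thread.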

You're not calling finalize() as an implementor of Hashable though:

extension GridPoint: Hashable {
  func hash(into hasher: inout Hasher) {
    hasher.combine(x)
    hasher.combine(y)
  }
}

and you shouldn't either. Finalize is meant to only ever be called once on a Hasher at the very end of hash computation, as implied by the name.

As such this won't work either:

The line hash = into.finalize() conflicts with the semantics of Hasher.

Your snippet is not valid Swift (missing let and return in guard let), so I'm not exactly sure what you meant to express; I might be reading something different into your proposed approach than you intended.

Hasher is stateful. As such instead you would need to do something like this:

struct Expensive {
    private var hash: Int?
    override func hash(into: inout Hasher) {
        let cachedHash = self.hash ?? {
            var hasher = Hasher()
            hasher.combine( ... )
            ...
            return hasher.finalize()
        }()
        into.previouslyCachedValue(cachedHash)
    }
}

You need to create a new Hasher to compute the cached hash, as the Hasher's current state is dependent on whatever it has hashed so far. There are a couple of things wrong with this caching strategy though:

  • func hash(into: inout Hasher) now needs to be marked as mutating func in Hashable, or else your snippet won't compile, being a struct. hash(into:) should however never ever imply that mutating self is intended behavior. It would turn the whole protocol into a foot gun by giving the impression of the wrong semantics.

  • mutating func hash(into: inout Hasher) would require types that make use of hashing and allow for key lookup à la func contains(key: T) to do this:

    func contains(key: T) -> Bool {
        var mutableKey = key // 👈🏻 copy of key
        var hasher = Hasher()
        mutableKey.hash(into: &hasher) // 👈🏻 call of mutating func on key
        let hash = hasher.finalize()
        return ...
    }
    

    This would trigger Swift's copy-on-write on every key lookup for types making use of isKnownUniquelyReferenced(_:) or having similar CoW semantics.

  • By instantiating a local Hasher via Hasher() one would be killing all of the efforts made to make this API work with stuff like Bloom Filters or other data structures/algorithms that might expect proper use of a single hasher.

  • It is unclear to me how previouslyCachedValue(_:) would play with the fact that Hasher is stateful. Furthermore, allowing instances of Hashable to mess with the state of Hasher is a violation of concerns and encapsulation, imho.

These are really interesting points; thank you for raising them.

There is a small semantic issue here -- it isn't clear if we can allow hasher.combine(42 as Int64) and hasher.combine(42 as Int8) to produce different hashes.

They are different types, so I'd argue we can and should be free to do that. However, the standard library defines heterogeneous comparison operators between all fixed-width integer types. This means that (42 as Int8) == (42 as Int64), and having the two values produce different hashes may be confusing. (AnyHashable and its interaction with NSNumber produces additional complications.)
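The heterogeneous comparison mentioned above is easy to demonstrate (this snippet is mine, not from the proposal):

```swift
// The standard library defines == across all fixed-width integer types
// (via BinaryInteger), so integers of different widths compare equal.
// This is why width-dependent hashes could surprise users.
let narrow: Int8 = 42
let wide: Int64 = 42
print(narrow == wide)  // true: heterogeneous == on BinaryInteger
```

Since equal values are expected to hash equally, having these two produce different hashes would violate the intuition that == and hashing agree, even though they are distinct types.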

The bits: label for the primitive combine functions makes it clear that the bit width of integer values is significant.

The implementation of hash(into:) for integer types is currently quite a bit more complicated than that. I hope to make hasher.combine(foo) mean the same thing as hasher.combine(bits: foo) for all integer types, but it will take more discussion and some extra work.

(For Int, the current implementation does indeed end up calling combine(bits: Int), but this is an accidental consequence of Hasher not implementing the full API proposed here, and we will likely need to change it once we have combine(bits: Int8) working.)

We shouldn't be doing unnecessary work. However, as far as I can see, commutative hashing requires calculating a separate hash value for each element, and spawning a new hasher is simply the correct way to do that.

Hasher.finalize() is not in itself slow; calling it only harms performance when it's an unnecessary call that could have been eliminated.

Can you provide an example how a Hasher protocol would make things faster here? (I'd hesitate to have Set and Dictionary produce weaker hash values.)

This is a good point! We changed the name from append to combine late in the drafting process; I think combine is clearly better, but it does not by itself imply strict ordering. We can (and should) make up for this in the documentation. (append is uncomfortably close to Array.append. While Hasher.combine and Array.append are similar operations in an abstract sense, they actually mean completely different things.)

I agree too, and your implementation is the correct way to do it. (I.e., spawning/finalizing a separate hasher to generate the cached hash, then feeding the hash value in hash(into:).)

Note that caching the hash involves mutating operations, so for value types, you cannot do it directly in hash(into:). (This is by design, and the same applied for hashValue.) This is not a real limitation, as you can simply force-generate the cached hash sometime before you start using the value in hashing contexts.
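A sketch of that pattern, under the stated constraints (the `cacheHash()` name is illustrative, not proposed API): the cached hash is force-generated by a separate mutating method with its own Hasher, so hash(into:) itself stays non-mutating.

```swift
// Pre-computed hash caching: a separate mutating method stores the
// hash; hash(into:) feeds the cached value (or recomputes the same
// value on the slow path) into the caller's hasher.
struct Expensive: Hashable {
    var payload: [Int]
    private var cachedHash: Int? = nil

    init(payload: [Int]) {
        self.payload = payload
    }

    static func == (lhs: Expensive, rhs: Expensive) -> Bool {
        return lhs.payload == rhs.payload
    }

    /// Call once, before the value is first used as a dictionary key etc.
    mutating func cacheHash() {
        var hasher = Hasher()
        hasher.combine(payload)
        cachedHash = hasher.finalize()
    }

    func hash(into hasher: inout Hasher) {
        // Equal values must hash equally, so the slow path computes
        // exactly what cacheHash() would have stored.
        let h: Int
        if let cached = cachedHash {
            h = cached
        } else {
            var inner = Hasher()
            inner.combine(payload)
            h = inner.finalize()
        }
        hasher.combine(h)
    }
}
```

Because all Hasher instances in one process share a seed, a cached and an uncached copy of the same value still produce identical hashes.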

Hm; maybe we should simply turn the example into a demonstration of how the compiler implements hash(into:). It would serve the same purpose, but it would avoid getting bogged down in questions like why wouldn't you want to hash the color of a GridPoint.

It seems to me this wouldn't carry its weight as an API. It isn't clear that we'd be able to implement it well. People implementing hashable collections would still need to be aware that they need to call this function, so we'll need to call it out somewhere in the docs. But we may just as well describe what they need to do and why.

Not without knowing more details about the collection! In particular, unless we know more about the elements, combining the count seems like a better choice than any constant bit pattern.
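A sketch of the count-based delimiting this suggests (the free function is my illustration): feeding each group's count before its elements marks the group boundaries, so groupings with the same flattened elements hash differently.

```swift
// Delimiting element subgroups by combining each group's count before
// its elements: [[1], [2, 3]] and [[1, 2], [3]] then produce different
// hashes (with overwhelming probability) despite identical flat contents.
func hashGroups<Element: Hashable>(_ groups: [[Element]]) -> Int {
    var hasher = Hasher()
    for group in groups {
        hasher.combine(group.count)  // delimiter: where this group ends
        for element in group {
            hasher.combine(element)
        }
    }
    return hasher.finalize()
}
```

Unlike a fixed sentinel value, the count also feeds real entropy from the collection's structure into the hash.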