Combining hashes

CTMacUser · January 27, 2018, 3:46am

Should we add a function to combine several hash values? It would help to implement hashValue for various product types.

xwu · January 27, 2018, 4:02am

The challenge of combining hash values, of course, is to find a formula that minimizes collisions. There are some decent ones in general, but whether or not they're appropriate for your type depends on the underlying distributions of the hash values you're trying to combine, and only you can know that. If you mindlessly call a combineHashValues function without devoting thought to that, then the result may or may not be very good--or even acceptable.

That said, we do in fact have such a function in the standard library named _combineHashValues(_:_:). It's the same one as provided by Boost. I'm just not sure we should be encouraging everyone to use it.

hashemi · January 27, 2018, 4:53am

For better or worse, SE-0185 Synthesizing Equatable and Hashable conformance forced us to pick a “generic” way to combine hashes. This is what the proposal authors wrote and it looks like this is how it was implemented in 4.1:

† The choice of hash function is left as an implementation detail, not a fixed part of the design; as such, users should not depend on specific characteristics of its behavior. The most likely implementation would call the standard library's _mixInt function on each member's hash value and then combine them with exclusive-or (^), which mirrors the way Collection types are hashed today.

I think it’s harder to defend hiding this function and discouraging its use now that the compiler is using it to synthesize hash values for arbitrary structs.

xwu · January 27, 2018, 5:28am

No, in fact, _combineHashValues did not exist at all prior to SE-0185, and defending its being hidden is explicitly what the proposal does.

The point is that, if you don't care how Hashable is implemented, then it's fine to default to some arbitrary one-size-fits-all function. That's what happens when you use the synthesized conformance: you're saying, "I'm good with whatever arbitrary hash value Swift can come up with, and I don't even need a guarantee that it'll be the same in the next version of Swift."

But, if you're implementing your own conformance, then by definition you care exactly how the hash value is computed. Therefore, we very much discourage calling the internal function, which we reserve the right to change at any time. (In fact, I was the most recent person who changed it.)

xwu · January 27, 2018, 5:31am

To expand on that point, I rather think it'd be a good idea to change the magic number in _combineHashValues at least once or twice every version.

This is based on the notion that a flexible design where the flexibility is never actually used becomes de facto ossified. Therefore, to prevent reliance on the precise formula for computing a synthesized hash, that formula should change on a regular basis as we reserve the right to do.

[cc @allevato]

hashemi · January 27, 2018, 5:36am

If you have a struct with four values, but you only want 3 of them to participate in the hash, you may still not care how the hashes are combined. You can still change the implementation of the function to use a different magic number, without scaring people off of using it.

xwu · January 27, 2018, 5:49am

Well, technically, that is caring about how the hashes are combined, but I see what you're getting at. I'm not saying that there are no scenarios in which you might be tempted to call the internal function. I'm saying that there's a very good reason for not exposing that function, which we use for a default implementation of a protocol requirement, to users for their custom implementations.

In Swift, either you use a default implementation of a protocol requirement, or you implement your own. There are definitely reasons why you may or may not care about part of what's required to write your own custom implementation, but that goes for every protocol requirement under the sun; we don't expose parts of default implementations as bits and pieces just because that might be the case.

Especially in a circumstance such as this, where it's not onerous to write your own 1-3 line function to combine hashes (or, optionally, something more elaborate if you want). Heck, just straight-up copy the internal implementation (as the author, I suppose I can just say that).

hashemi · January 27, 2018, 5:56am

I just worry that people can do much worse than a sanctioned function even if it’ll be suboptimal sometimes.

Personally, whenever I have to combine hashes, I paste in the first thing that comes up on StackOverflow.

xwu · January 27, 2018, 6:18am

If the quality of the combined hash is of concern to you even a little bit, then I would not recommend proceeding as though the internal function is better than what you find on StackOverflow.

In fact, I would be shocked if the top answer on StackOverflow hasn't had many, many orders of magnitude more eyeballs on it than the shipping version in Swift.

And if you tailor your search on StackOverflow to the specific characteristics of the type that you're hashing for, you might even find an even superior solution on that site than the general one-size-fits-all solution.

maven · January 27, 2018, 1:27pm

I've recently been dealing with this in C++ (which offers hashes for some built-in types, but no way to combine them).There are a number of proposals in this area (e.g. Types don't know #, N3333, N3876 if you're interested.

Now to my actual point: It seems nice to somehow expose the fields / members to be hashed and leave the hashing and combining up to the compiler (unless overridden), but until that happens it's IMO better to offer an at least decent method of combining hashes, as otherwise people resort to xor or +, which will be worse as they commute.

FWIW, I ended up using a round of MurmurHash3_32 for 32-bit values, or half of MurmurHash3_128 for 64-bit values (+ constant as in boost to prevent combining 0 to 0 repeatedly).

taylorswift · January 27, 2018, 7:14pm

for the record i don’t support this. changing hash functions every version for no reason but to fuck with people is just silly.

taylorswift · January 27, 2018, 7:18pm

while we’re on the topic of defacto reliance on unsupported functionality, i really doubt the underscore in front of _combineHashValues(_:_:) really prevents people from using it. if a symbol is public people will inevitably come to depend on it,, this is already the case for the _capacity property on array and the _sin and _cos intrinsics. the fact that people are willing to ignore the underscore warning means that these functions are useful and we should consider providing them as part of the supported interface.

xwu · January 27, 2018, 7:52pm

Language, please. This is a professional setting. I've laid out the rationale for making regular changes to the function, so it's disingenuous to say that there is "no reason."

Underscored properties aren't listed in autocomplete and aren't documented. People may rely on it at their own peril, but that only emphasizes why it's important that these internal methods, which are public for implementation reasons, are not merely documented as being unstable but should actually manifest that behavior.

taylorswift · January 27, 2018, 8:45pm

your reason is literally “we don’t want user code to break due to changing the hash implementation, so we’re going to break user code by changing the underlying hash implementation”

&& if people use the underscored function a lot that means that there’s a demand and use for that functionality

ps the code of conduct doesn’t specify what english dialect everyone has to speak in. people have different ways of communicating

Jon_Shier · January 27, 2018, 10:32pm

Like random number generation (though perhaps less popular), combining hashes is something everyone who needs to implement Hashable in a way that can't be synthesized will have to do, so something built into the standard library seems necessary. Otherwise they will just use _combineHashValues(_:_:). It's only the tiny tiny minority of users who will ever need anything else, or who will do the research necessary to find the algorithm optimal for their use.

xwu · January 28, 2018, 7:13am

And some ways of communicating are not acceptable in this forum, and one of those is the use of profanity.

xwu · January 28, 2018, 7:22am

Since _combineHashValues(_:_:) is hidden, if they've done the research necessary to find the function, they'll have at least found the source code for that, if not also numerous other alternative implementations while searching.

It would seem to me that, first, we'll need to figure out how pervasively people need to implement Hashable because it can't be synthesized, and second, ask if a whole bunch of those use cases can be solved by better synthesis. Then, we'd need to see if the non-solvable cases are largely appropriate for a general-use hash combining function (as you claim) or not appropriate. I'd be curious to know as well whether other languages with something like Hashable provide a general-use hash combining function and how well that's served those languages.

allevato · January 29, 2018, 12:33am

To be honest, I'd be supportive of an even bigger hammer—mixing something like the current process ID or some other randomized value determined at the start of execution into the synthesized hash value to ensure that people don't depend on behavior that they shouldn't, and improve security by making it harder for bad actors to craft intentionally harmful payloads.

xwu · January 29, 2018, 12:59am

The final _mixInt call that ends every synthesized hash value implementation, I believe, can indeed support that. If I recall, there's a temporary hardcoded magic value that goes into the _mixInt computation, but the design leaves room for making that value randomly selected at each start of execution.

Jon_Shier · January 29, 2018, 2:05am

By far most developers will use the synthesized versions. For those who realize they need manual combination, some small percentage will know the importance of how the values are combined. The rest will Google "swift hashable example" and stumble across an SO or blog post that mentions _combineHashValues(_:_:) and use it directly. Or, perhaps even more likely and worse, use one of the older methods of combining property hashes (just ^ing them together) and not move beyond that, since there's no official solution. Some small percentage of previously ignorant users will realize they need to do something special to combine hash values and research that. That may have them stumble across _combineHashValues(_:_:)'s implementation, but most likely they'll use some generic but recommended hashing pattern.

So, given the likelihood that developers will never need to venture beyond _combineHashValues(_:_:)'s capabilities but the significant improvement it provides over naive hash combination, it should be exposed publicly. Frankly, I'm not sure what the downside is here.