Combining hashes

You're positing that, no matter what we do, most users will use something without understanding it (whether it be xor or _combineHashValues). That's very bad, and it's made strictly worse--not better--by making something public so that users can use it without understanding.

I don't think your assessment is reflective of Swift users, as demonstrated by @maven, who wrote about how he researched different hashing functions and chose MurmurHash for his use case. However, if you're convinced that rampant use of _combineHashValues without understanding is what will occur, then it will be important to make that function internal to avoid abuse. It's only public at the moment for testing, but since we have standard library types that now use it, we can test those instead.

1 Like

This doesn't really make sense. Developers constantly use APIs they don't understand, whether array sorting or UIKit APIs for view management. That you think combining hashes is something that should require research doesn't change the fact that most developers (and people in general) are lazy. Until they notice something wrong, the simplest or easiest-to-implement solution will be the one chosen. So why not make that choice both obvious and high quality?

3 Likes

Maybe I’m naive, but it seems to me that if it’s ok to synthesize hashValue using all members it should be equally ok to expose a hash combine function that can be used when one or more properties must be omitted.

Consider the scenario where a type is using a synthesized hashValue. The author later decides to add a property to cache an expensive computation. Nothing else has changed. The author now has to write a custom implementation of hashValue in order to omit the cache property. I don’t understand why simply passing the same exact hash values to the same combining function that were used by the synthesized implementation is a bad idea. The result is equally good before and after the cache property was added.
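
To make the scenario concrete, a minimal sketch (the type names are illustrative):

struct Size: Hashable {
  var width, height: Int    // everything synthesized
}

// After adding one cache property, both requirements must be hand-written
// just to omit it, and the hash combination becomes ad hoc:
struct CachedSize: Hashable {
  var width, height: Int
  var areaCache: Int? = nil  // caches an expensive computation

  static func == (lhs: CachedSize, rhs: CachedSize) -> Bool {
    return lhs.width == rhs.width && lhs.height == rhs.height
  }

  var hashValue: Int {
    // One would prefer to pass width and height to the very same
    // combining function the synthesized implementation uses.
    return width.hashValue ^ (height.hashValue &* 31)
  }
}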

Are you able to explain why synthesis is a good idea and exposing a combining function for custom implementations is a bad idea?

4 Likes

We can’t stop people from using APIs they don’t understand, but we can certainly avoid designing APIs to support such behavior.

Of course, if there were an obvious high-quality algorithm to combine hashes, we could vend that. But there isn’t one.

Most hash combining functions work fairly reasonably if the hash values being combined are uniformly distributed. But consider: the hash value of an Int is, very reasonably, the value itself. This is perfect for Int because it requires no computation and guarantees no collisions. But imagine if you have a Person struct with a member var age: Int. It would be comically bad to assume that age has a uniformly distributed hash value, and a general-use hash combining function would cause many more collisions than you’d likely want.
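
To make the failure mode concrete, here's a minimal sketch, using plain xor as a stand-in for a combiner that trusts its inputs to be uniformly distributed (the type and names are illustrative, not _combineHashValues itself):

struct Pair: Hashable {
  var a, b: Int

  // Naive combination: Int's hashValue is the value itself, so small
  // inputs leave all the entropy in the low bits.
  var hashValue: Int { return a.hashValue ^ b.hashValue }

  static func == (lhs: Pair, rhs: Pair) -> Bool {
    return lhs.a == rhs.a && lhs.b == rhs.b
  }
}

let pairs = (0..<100).flatMap { a in (0..<100).map { b in Pair(a: a, b: b) } }
// 10,000 distinct values, but at most 128 distinct hashes:
print(Set(pairs.map { $0.hashValue }).count)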

If a user implements a hash value calculation in a naive way and runs into lots of collisions that eat their Dictionary keys, they have full ownership of the problem, and they’ll debug and move on. If a user calls a standard library function that’s the blessed method of combining hashes and finds that it’s the reason why their Dictionary keys are being eaten, then the fault is on us.

Array's sort() is currently unstable, which can lead to unexpected results. Yet it's allowed in the standard library because such expectations are extremely rare, and therefore users don't have to implement their own sorting functionality. It's the same here. Expose a common implementation that's useful for the vast majority of developers while not standing in the way of those who need to implement their own version. Otherwise you're just making the default experience that much worse. And frankly, if it's so important for hash combination to be well researched and implemented, then the standard library should provide such an experience out of the box rather than requiring every single Swift developer to implement their own.
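
For example (a minimal sketch; the ordering of equal elements is simply not guaranteed today):

let people = [("Alice", 30), ("Bob", 30), ("Carol", 25)]
let byAge = people.sorted { $0.1 < $1.1 }
// ("Carol", 25) sorts first, but nothing guarantees that ("Alice", 30)
// still precedes ("Bob", 30), because sort(by:) makes no stability promise.
print(byAge)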

1 Like

I've written in depth about this distinction above, but it seems like it's not sticking. So let's try an analogy:

  • You go to a restaurant and you'd like some wine. You say to the waiter, "I don't know wines, just bring me something. Really, I have no preferences, just a wine of any sort. I trust you have good taste." The waiter brings you a very fine merlot.

  • You go to a restaurant and you'd like some wine. You have preferences. Perhaps it's just, "I don't like red wines." But maybe it's specifically, "I want a 1997 cabernet sauvignon from X vineyard." Only the waiter can't actually respond to your preferences, because it's not a person but an iPad waiter app, and there's no interface to enter your preferences. (Just as the compiler has no way of knowing, when you conform to Hashable with a custom implementation of hashValue, just how custom you intend it to be.) What the waiter app can do, though, is show you the wines (the APIs) it has to offer, and it helpfully says, "We recommend the merlot." "But I don't like red wines." "We recommend the merlot." "I thought merlot was a red wine?" "We recommend the merlot."

Is a very fine merlot a bad wine? No. Is it perfectly acceptable to bring it out to customers who say they literally just want any bottle of wine? Sure. Should we offer it to everyone? No.

This is an excellent argument--for the default sort algorithm being stable. Dave Abrahams has been on about that for years, and we may finally have some movement on it.

I too consider default unstable sorting in the standard library to be a bug--and very much not a precedent to be emulated.

Not only that, but you write that requiring stability would be "extremely rare"--I don't think there's any basis to assume that. Likewise, any particular implementation of _combineHashValues being much less than ideal for a particular type is unlikely to be rare.

What is your opinion on supporting some kind of annotation to omit specific properties from the synthesized equality and hash value implementations? It is a disservice to users to expect them to know how to select a good hash combining algorithm for the sole reason that they must omit one or more properties.

I'll quote what I wrote above:

It's important to have empirical data about how often people need to omit specific stored properties from the synthesized implementation. Could those properties often be lazily computed properties instead? Often, I find that lazy would be a perfect fit for my model but for the fact that it can't ever be updated. If that's the overwhelmingly common case, then the most appropriate solution would be the implementation of property behaviors.

So I think it's far too hasty to conclude that supporting some kind of annotation to omit specific properties would be the best solution; it might be, but we need more data.
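
As an aside, the lazy limitation mentioned in the quote looks like this in code (a minimal sketch):

struct Report {
  var entries: [String]
  lazy var summary: String = self.entries.joined(separator: "\n")
}

var report = Report(entries: ["a", "b"])
_ = report.summary        // computed once and stored
report.entries.append("c")
print(report.summary)     // still "a\nb": stale, and there is no way to
                          // reset a lazy property to "not yet computed"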

1 Like

The idea of "transience" for properties was discussed in the proposal as a future direction. It was something I considered, but it wasn't critical to getting the basics in place.

The catch here is that such a feature only whittles down the space of "needs to hand-implement hashValue" a little bit, in a specific use case. It doesn't handle situations where you have something you want to hash but hash differently. Maybe that's ok, but we should really figure out how frequent that use case is before we start in on it.

(Note also that "transience" could also apply to concepts like Codable, but we probably don't want all transience to be uniform; you can easily imagine a situation where you want a field to be serialized, but not included in an equality or hash computation.)

If you're clever with conformances, you can implement transience today, in fact:

struct Transient<Wrapped: Equatable>: Equatable {
  var wrapped: Wrapped

  init(_ value: Wrapped) { self.wrapped = value }

  // Any two Transient values compare equal, so a Transient member can
  // never affect a synthesized ==.
  static func == (lhs: Transient<Wrapped>, rhs: Transient<Wrapped>) -> Bool {
    return true
  }
}
extension Transient: Hashable where Wrapped: Hashable {
  // A constant hash, so a Transient member never perturbs a synthesized
  // hashValue.
  var hashValue: Int { return 0 }
}

Such a type plays no role in equality or hash value computations, and it can be used to implement caches in an otherwise synthesized type, and computed property accessors can be written to handle wrapping/unwrapping seamlessly if needed.
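
For instance, a cache in an otherwise fully synthesized type might look like this (a minimal sketch; the type and property names are illustrative):

struct Document: Hashable {
  var text: String

  // Participates in storage but, thanks to Transient, not in the
  // synthesized == or hashValue.
  private var _wordCountCache = Transient(-1)   // -1 means "not yet computed"

  var wordCount: Int {
    mutating get {
      if _wordCountCache.wrapped < 0 {
        _wordCountCache.wrapped = text.split(separator: " ").count
      }
      return _wordCountCache.wrapped
    }
  }
}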

Is this a great solution? No, probably not, because it requires jumping through some hoops to write boilerplate around access to the wrapped value. But it would be interesting to see if something like this empirically shows up in people's code bases to work around the issue, because that can motivate whether such an attribute would be solving a real common problem.

1 Like

A better analogy for the situation we have right now, which is what you’re advocating, is this:

  • You go to a restaurant and open your mouth and the waiter will pick a four course meal and spoon feed it to you. (Synthesized hashable)

  • You go to a restaurant and ask for soup and you’re told, “well, since you know you want soup, I’ll point you to our kitchen. Feel free to go in, find the perfect ingredients for your soup and cook it exactly how you want it. Feel free to look up recipes online. You’ll probably find a recipe much better than anything we can make you here” (BYOH)

4 Likes

That's not a bad analogy. So, let's continue with it. Right now, those are the only two choices we offer: go to a restaurant and eat the fixed menu (which is, incidentally, not uncommon among a certain tier of restaurants) or design your own meal from scratch (which is, incidentally, also an actual type of restaurant).

What you're advocating for is serving everyone a meal without making sure they're aware of what's contained in the meal. In fact, because you think it's too much to expect people to read the menu, you'll make sure to offer a recommendation that they can choose blindly if they wish. Sure, they can go against your recommendation and cook their own meal instead, but how likely is that when they've already come to a restaurant and you're offering them this meal as the best one that the restaurant makes?

Now, consider that people have food allergies, sometimes deadly. Others have religious restrictions on diet. Doesn't matter: you recommend that they eat this meal. How long will your restaurant stay open?

All of the issues you raise apply to synthesized Hashable, but we’ve decided that it’s useful enough and that it won’t stop people from making their hashes from scratch if they have food allergies.

Synthesized Hashable, in this analogy, is like choosing to walk into a restaurant with a fixed menu. The user who chooses to eat at the restaurant is, by doing so, already declaring that they have no food allergies.

A customized Hashable is, in this analogy, literally declaring that you have food restrictions and can't eat just anything. Providing a standard library function that combines hashes is like providing a recommendation to the customer. But the compiler (waiter) isn't smart enough to tailor the recommendation to each person, and there isn't a menu item (hash combining algorithm) that everyone can eat; so instead, it recommends peanut butter sandwiches (because they are the most popular menu item) to people who will die if they eat peanut butter. That's not good, or even acceptable.

I feel like it's not uncommon for the Swift standard library to include a correct but not necessarily the fastest implementation of a function. For example, arc4random (and, by extension, the upcoming Random proposal's default stdlib generator) is about 10x slower than most other general-purpose RNGs (source). Is it also reasonable for people to blame the stdlib when they can't generate 100 million random numbers per second?

As far as I know, it's pretty easy to sacrifice a bit of speed to get a hash combining function that works whether or not the inputs are uniformly distributed. If nothing else, I'm sure running SHA-1 on the 16-byte combined value of the two integers would work perfectly fine, though that's an extreme example.

Go actually randomizes dictionary iteration for similar reasons. The idea is to uncover cases that wrongly rely on undefined behavior.

2 Likes

Swift string hashing actually already does this on Linux. Try running print(Set((0..<100).map(String.init))). It'll be different every time you restart the REPL (or the program, if you compile it).

If you want to fall back to the default hashValue implementation, you can make another type for it. Though it's not the cleanest approach, I don't think it's too bad either.

struct Foo: Hashable {
  var a, b, c, d: Int

  var hashValue: Int {
    // A local type whose synthesized Hashable covers only a, b, and c,
    // deliberately leaving d out of the hash.
    struct Hasher: Hashable {
      var a, b, c: Int
    }
    return Hasher(a: a, b: b, c: c).hashValue
  }
}
1 Like

I believe Hashable.hashValue is not a great API. It combines two distinct tasks that should be kept separate:

  1. Choosing a particular hash function
  2. Feeding it with data

When we manually implement Hashable, the job should just be about identifying bits and pieces of our data that must participate in hashing. We shouldn't need to think about how these bits are shredded into a fixed-width value, because it is a completely different concern, and even worse, it requires careful consideration and specialist knowledge. Almost nobody is able (or willing) to provide a good implementation of hashValue from scratch, because task (1) is hard.

SE-0185 is a partial solution to this problem. If we augmented it with some sort of facility (e.g., a @transient attribute) to omit specific fields from synthesized implementations, it would reduce the problem of implementing Hashable to task (2) above, leaving the choice of a suitable hash function to the standard library.
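
To illustrate the difference such an attribute would make, a sketch (the @transient spelling is hypothetical; today's workaround below is what the attribute would eliminate):

import Foundation

struct Record: Hashable {
  var id: Int
  var payload: String
  var lastAccessed: Date?   // should not participate in == or hashValue

  // With a hypothetical @transient attribute, both members below would
  // disappear:  @transient var lastAccessed: Date?
  static func == (lhs: Record, rhs: Record) -> Bool {
    return lhs.id == rhs.id && lhs.payload == rhs.payload
  }

  var hashValue: Int {
    // Ad hoc combining: exactly the task (1) burden described above.
    return id.hashValue ^ (payload.hashValue &* 16777619)
  }
}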

Exposing _combineHashValues as public API is one way to help people who still need to implement Hashable manually. However, it is not the best choice to expose it as is: it lacks support for initializing/managing/finalizing internal state, so it cannot implement an important class of hash functions. Additionally, there is the issue of discoverability; the hashValue API itself does not naturally guide developers to use helper functions like this. (Although this can probably be solved with documentation.)

Even worse, the concept of a "good hashValue" is not universal -- some applications require secure, randomized, complex hashes, while others may be satisfied with a less secure function that calculates faster. Baking the hash function directly into our hashValue implementation makes it harder to reuse our data types. Potentially, it also makes it harder to replace the hash function in case a bug (such as a security issue) comes to light later.

If I were to redesign hashing from scratch, I would assign the responsibility of selecting a particular hash function to the collections that rely on hashing, and change Hashable to be concerned only with task (2) above. @Vincent_Esche's HashVisitable pitch from last year is one way to do this. I'm currently experimenting with a slightly modified approach that changes Hashable directly.

In any case, the Hasher concept from Vincent's pitch is a likely candidate interface for replacing _combineHashValues and _mixInt internally in the stdlib. If we want to expose the hash function used for synthesized hashing as public stdlib API, I think Hasher is the way to go!
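
For the curious, here is a minimal sketch of what such a split might look like. This is illustrative only, with hypothetical names (HasherSketch, HashVisitable) and a toy FNV-1a mixer standing in for a real, seeded hash function; it is not the actual API from the pitch:

protocol HasherSketch {
  mutating func combine(bits: Int)   // task (2): feed data in
  func finalize() -> Int             // task (1) lives inside the hasher
}

struct FNV1aHasher: HasherSketch {
  private var state: UInt64 = 0xcbf29ce484222325   // FNV offset basis

  mutating func combine(bits: Int) {
    var value = UInt64(bitPattern: Int64(bits))
    for _ in 0..<8 {                                 // mix in all 8 bytes
      state = (state ^ (value & 0xff)) &* 0x100000001b3   // FNV prime
      value >>= 8
    }
  }

  func finalize() -> Int { return Int(truncatingIfNeeded: state) }
}

protocol HashVisitable {
  func hash<H: HasherSketch>(into hasher: inout H)
}

struct Coordinate: HashVisitable {
  var x, y: Int

  // The conforming type only identifies the data that participates;
  // it never chooses or implements a hash function.
  func hash<H: HasherSketch>(into hasher: inout H) {
    hasher.combine(bits: x)
    hasher.combine(bits: y)
  }
}

var hasher = FNV1aHasher()
Coordinate(x: 1, y: 2).hash(into: &hasher)
print(hasher.finalize())

A collection like Dictionary could then instantiate whatever (possibly randomly seeded) hasher it wants and feed each key through it.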

1 Like

Hash values have been explicitly documented since Swift 1.0 to potentially change not just between releases, but between executions of the same program. This was not done to annoy people, but to allow random seeds, which is an important tool for protecting against collision attacks. (The stdlib has not yet fully enabled random seeds, but this may change.)

The exact hash function used in synthesized Hashable implementations is an implementation detail of the stdlib, and it is subject to change between any two releases.

4 Likes