[Accepted] SE-0206: Hashable Enhancements

To me, this seems like an extremely significant change: it replaces a single struct with three protocols, a caseless enum, and two classes. It may seem like a surface-level change to you, but it significantly complicates the hashing API. The new hash(into:) and (especially!) HashCache would also very likely not benchmark the way you expect.

We did carefully consider making Hasher a protocol. The proposal explains why we decided to go with a simple struct instead. There may be a use case for custom Hashers, but it hasn't come up in review discussions so far.

The Bloom filter you've shown does not utilize multiple Hasher implementations, so it argues more against generalizing Hasher than in favor of it. This is as it should be; we expect that the stdlib's Hasher implementation will be good enough to support Bloom filters -- there is no reason to implement custom Hashers for that. (Note that Bloom64 allocates a shockingly large amount of space to cache hash values; I believe this goes directly against the spirit of bloom filtering, and some of that space would be much better used as additional state space for the filter itself. The current version of insert/contains also yields a much higher rate of false positives than what Bloom suggested.)

The idea to partially automate hash caching is intriguing, but it seems like it is orthogonal to this proposal, and it can be built on top of it later. It also adds a brand new concept for tracking mutations that seems non-obvious enough to deserve its own separate discussion thread.

(I'd like to note again that, in order to improve hash table performance, we can't reuse hash values across Set and Dictionary instances (or even during the lifetime of a single collection instance). So caching full raw hash values would rarely be helpful for these collections.)

This is a remarkably confident assertion. It's difficult to correctly reason about performance without actually comparing benchmarks. I strongly suspect you may be overestimating the cost of passing around an inout struct, and underestimating the cost of allocating/deallocating a class instance or of passing around a protocol existential. Performance considerations weren't the only reason we chose the design we proposed, but they were indeed quite significant -- and the hundreds of benchmarks I had run over the past few months were decidedly in favor of an interface that uses inout structs.

3 Likes

Er, I don’t quite follow. What exactly is the problem with having, say, an instance of Set cache its own hash value? (Or more accurately, cache its “prehash”, namely the xor of the hash values of its elements.)

Any mutation would clear the cached hash so it can never become stale, and I doubt that setting an optional integer to nil has any significant performance implications relative to mutating a set.

The prehash is only used in the set’s own hash(into: Hasher) implementation, so it never “leaks”. Instantiating a new set using the elements of an existing set doesn’t look at either set’s prehash.

Storing one set into another calls the inner set’s hash(into:) method with the outer set’s hasher, exactly as expected. And instantiating a new set using the elements of the outer set involves calling the inner set’s hash(into:) method with the new set’s hasher. So the inner set’s cached prehash gets reused successfully, while the outer sets are free to maintain independent hashers.

• • •

Edit: Scratch all that. We can’t cache hashes (or even prehashes) because the elements could be classes, which can be mutated without it being “mutating”.

• • •

Edit 2: Wait, if a class instance can have its hash value change out from under a Set, that poses larger problems even without caching. How do we handle sets where the element is a class today?

• • •

Edit 3: It appears we do not handle that situation today. The set claims not to contain the mutated class instance, although it does show up during iteration.

Hold your horses! :wink: I was talking about the specific idea of reusing the raw hash values across Set/Dictionary instances, not about hash caching in general. Hash tables must choose a new seed when they resize themselves, to prevent unacceptable performance regressions. (This is unrelated to collision attacks.)

Using the memoized result of a secondary hash function to reduce the number of bytes fed into Hasher in hash(into:) is a perfectly valid optimization. (That being said, it does have some fundamental correctness tradeoffs, and I believe it should only be used sparingly, and with copious amounts of justification in the form of before/after profiling data.)

There is nothing wrong with that, except it would needlessly slow down mutations (especially bulk removals) for Sets that will never be hashed -- which I expect to be the usual case. Much less importantly, it also interferes with using Sets as keys in advanced hashing scenarios, like Bloom filters.

Keeping an up-to-date hash cache for Sets is still an important optimization; it's just probably better done outside of the standard library. (But if the performance tradeoff seems worth it, the standard Set may end up doing it too. This is an implementation detail that is independent of the proposal we're discussing.)

No, it's a serious programming error to do that.

While this is an interesting topic, it too seems unrelated to the proposal at hand.

Cocoa handles this by defining the NSCopying protocol, and making a copy of all keys inserted into NSSet/NSDictionary instances. Foundation differentiates between mutable/immutable classes, where a copy of a mutable class gets to be immutable. This works fine within types defined in Foundation, but it's just a coding convention, not a language-enforced rule. You're free to ignore this convention when you're defining your own NSObject subclasses, and then you'll run into the same problem.

Swift chose to handle the same issue by emphasizing value types over reference types. Value types are literally impossible to mutate while they're stored in Set/Dictionary's hash table. Notably, this is enforced by the type system of the language itself, rather than just a convention.

1 Like

Okay, that makes sense.

True.

The same issue arises for a struct that contains a class instance. If the class instance is visible anywhere outside the struct (either through another reference, or even as a read-only property on the struct itself), the hash value of the struct can be changed out from under a set it is in. This applies even if the struct uses copy-on-write, because no write to the struct is occurring.

And yes, this is not related to the subject of the current thread, except that it came to mind while discussing cached hashes.

1 Like

This review is focused only on source compatibility. I read that section in depth and skimmed the rest of the propasal. I did not read all of the review thread or previous discussion.

When upgrading to Swift 4.2, Hashable types written for earlier versions of Swift will need to be migrated to implement hash(into:) instead of hashValue. There are two options for doing this:

IIUC, we will synthesize the new hash(into:) based on the user's hashValue implementation when using the pre-4.2 language mode. Why not do the same in 4.2 and deprecate hashValue instead of forcing users to update it immediately? A warning combined with excellent fixits is a great way to get everyone to migrate over time, but it's painful for it to become an error in the same release that the new functionality is added. The bar for making it an error in 4.2 should be that we can migrate it fully automatically, but...

The compiler should simplify the migration process by providing fix-its for both options. A fix-it to remove hashValue should only be provided if automatic synthesis is available. For the second option, it would be nice to have the compiler suggest a full implementation of hash(into:), but a template would probably suffice:

... I believe this is not sufficient for automatic migration. We will not be able to ensure that the behaviour of the program doesn't change with either of these options. I think we really need the user to make the decision to change their hashing.

1 Like

Thank you for bringing this up; ease of migration is a hugely important issue.

While the compiler would produce a deprecation warning in 4.2 mode, synthesis would continue to work, and no code would need to be actually changed until/unless hashValue is actually removed from the protocol, which is not going to happen soon (if ever).

(It may also be a good idea to defer hashValue's formal deprecation till a future language version. We'd like to encourage people to implement hash(into:), but there may be gentler ways of doing that than formally deprecating hashValue. However, it seems like a bad idea to support both requirements indefinitely -- we do want to get rid of the legacy interface at some point.)

I agree wholeheartedly. It may be helpful to provide it as an interactive fix-it attached to the deprecation warning, though.

The proposal doesn't cover fully automated migration of existing hashValue implementations; it seems Difficult to do that correctly in the most general case (with complications like caching, partial hashing, etc.). However, automatic migration may be possible for the simpler cases.

For example, (in theory at least) the migrator could recognize that the hashValue implementation below calls the getters of all stored properties without touching anything else, and it could replace it with a hash(into:) implementation or remove it altogether:

struct Foo: Hashable {
  var a: Int
  var b: Double

  var hashValue: Int { return a.hashValue ^ 31 &* b.hashValue }
  static func ==(l: Foo, r: Foo) -> Bool { return l.a == r.a && l.b == r.b }
}

(The same goes for the == implementation.) Of course, the "without touching anything else" part may not be easy to formally verify. (The code does call the hashValue properties along with a bunch of Int operations. Do we implement a whitelist?)

This migration would still change the meaning of the code in evil cases like the one below, where a.hashValue is canceled out from the returned hash value:

var hashValue: Int { 
  var hash = 0
  hash ^= a.hashValue
  hash = 31 &* b.hashValue
  return hash
}

I'd argue a human migrator would probably not fair any better, and I'd also argue the original intention behind this code was probably to hash both fields; the second xor was simply lost somewhere along the way.

As an improvement to this, the migrator could recognize partial hashing, and try to do the right thing there, too.

Thanks for everyone who has participated in the review thus far! The core team discussed this proposal in our meeting today. There are still a couple days left in the review period, but we're tentatively inclined to accept this proposal, but wanted to suggest some changes based on public review discussion and our own review:

  • As @jrose noted, init(seed:)'s behavior may be misleading to users who expect it to enable deterministic hashing, and has no real benefit over a collection implementation seeding the hasher directly with combine calls. To avoid misleading users, and to simplify the interface, we strongly suggest not including init(seed:) in the public Hasher API, instead suggesting that hashed collection implementers seed their hashers manually.

  • As a further simplification, the core team observes that the many combine(bits:) overloads on Hasher that take fixed-sized integers don't need to be public API. Instead, they can be @usableFromInline entry points invoked by @inlinable implementations of hash(into:) on the integer types themselves:

    extension Hasher {
      @usableFromInline internal mutating func _combine(bits: UInt8)
      /* etc. */
    }
    
    extension UInt8: Hashable {
      @inlinable public func hash(into hasher: inout Hasher) {
        hasher._combine(bits: self)
      }
    }
    

    This greatly reduces the public API surface area of Hasher, in particular eliminating the large overload set for combine(bits:), while preserving the performance characteristics of the proposed design, since using the generic combine<H: Hashable> method with a fixed-sized integer type would naturally inline down to a call to the corresponding @usableFromInline entry point. This would reduce the public Hasher API to only init(), combine<H: Hashable>(_: H), the combine(bits:) method that takes a raw buffer of hash input, and finalize() as public API.

With those changes, the proposal is likely to be accepted. The core team also considered many of the points raised in public review. @Gankro raised the importance of mixins for defending against collision family attacks in hashes of collection values, and many people suggested dedicated APIs for facilitating this in response. @lorentey noted in reply that dedicated API for this doesn't really carry its weight or shield the implementer of a collection hash from needing to know that they need mixins or what information to mix into the hasher, and the core team concurs with this point of view. Documentation will describe the importan

In the same post, @Gankro also pointed out the usefulness of a one-shot hashing interface for certain families of hash function that change algorithm based on input size, and so benefit from a complete view of the hash input at once. In the proposed design, the hasher implementation is intended to be fixed and chosen by the standard library, and the current algorithm being considered does not require one-shot hashing to benefit. If a future version of Swift decided to change the hash function to one that benefited from one-shot hashing, we should be able to introduce a new requirement resiliently by adding a new requirement with a default implementation in terms of the current incremental hashing interface. In a follow-up post, @Gankro described problems Rust had doing exactly that, albeit acknowledging that some of those problems are due to other Rust-specific design choices in their hashing design, particularly the ability to index their hashed containers with types other than the key type of the container itself. It isn't clear that Swift would suffer these same issues resiliently introducing one-shot hashing, or that Swift would have cause to switch the standard hash function to one that required it, so the core team would like to err on the side of simplicity and leave it out of the current design.

@lemire raised the concern that the proposed design does not allow for rolling hash functions. @lorentey replied that Hashable's primary purpose is hashed collections, and that the usefulness of rolling hashes is largely independent of Hashable's intended purpose, and the core team concurs.

@hlovatt and others asked how to cache hashing work in this design; it's more non-obvious and error-prone to do so than before. @lorentey replied, and the core team agrees, that automated caching is an orthogonal design question that could be added in a separate proposal, and that hash caching should in general be used sparingly as it's rarely a win in practice.

Many people raised concerns about the naming of methods in the proposed design. Particularly, @nnnnnnnn noted that perhaps hash(into:) ought to be deemphasized as a static method, to avoid polluting documentation and code completion, and many people raised the issue that combine may imply order insensitivity. The core team feels that the documentation/code completion pollution problem is a larger issue that deserves more holistic consideration. While many alternative names for combine were suggested, the core team did not find any of them significantly clearer. The core team is inclined to accept the names as is.

The core team considered the question of whether the old hashValue requirement could be eliminated altogether in a future version of Swift. There are significant source compatibility concerns in doing so; @blangmuir also lays these concerns out nicely in his post. Deprecating hashValue would not affect source compatibility with existing Swift 3 or 4 code while discouraging new code from being written to implement it, but the protocol requirement cannot be completely removed without breaking source compatibility, since there is no perfect automated migration path. However, a special case in the Hashable synthesis logic may be possible; in theory, the compiler could see that a type is declared to conform to Hashable with only a hashValue implementation provided, and in the situation, the compiler could then synthesize a hash(into:) implementation in terms of hashValue. It would be nice to be able to eliminate hashValue entirely from the API and ABI of the standard library, but it isn't worth breaking source stability for, and the core team does not want to gate acceptance of this proposal on even more compiler work to handle special-case compatibility hacks.

Thanks again to everyone who's given feedback in the review discussion. The review period is still scheduled to continue through this Friday, so if you have any other concerns to raise about the proposal or responses to the core team's decisions thus far, you still have time to provide them, and your feedback is welcome.

9 Likes

Just out of curiosity, do you have a bottom-line estimate of the performance cost of making Hasher a protocol rather than a unitary struct?

Minor question that I don’t see addressed in the proposal: will Hasher itself conform to Hashable?

It's not obvious to me why it would. Is there a use case you had in mind?

I really like the proposal but the one feature that I would like is a way to explicitly control the hash value given by finalize() rather than being forced into the underlying implementation used by Hasher.

The use case I see for this is when working with non-Swift tools (such as C, Python, etc.) which can have hash values. Being able to have consistent hash values of a datatype in each realm is very handy.

For example if I have this in a C header:

struct Foo {
   int bar;
   int baz;
}

int ComputeHashOfFoo(struct Foo foo);

I would personally want the hash value that I compute in C land to be the same as in Swift if I added an extension like this:

extension Foo: Hashable {
    func hash(into hasher: inout Hasher) {
        hasher.fix(to: Int(ComputeHashOfFoo(self)))
    }
}

Also, I'm not at all attached to the name fix(to: Int) or just a function like this being the interface to accomplish this task, it is just a quick example.

I imagine this isn't a very common use case so to discourage the use of fixing the value it should probably be named something like unsafelyLockFinalizedValue(to: Int) or something along those lines. My brain is a little too fried to think of better naming at the moment...

It could, and it would be cool, but you’d need something that differentiates two Hashers in a way that is prone to hashing.
This could be accomplished by extracting a representation of each Hashers’ hashing algorithm using some advanced Reflection feature. Then, you’d feed that representation to a Hasher and get their hash values.

I don’t think something like that exists in Swift atm, but damn that would be cool to make someday. A Hashtable with Hashers as keys? Hell yeah, why not :D

I wanted to raise a couple of questions about the design of finalize that came up in discussions yesterday:

First of all, a recurring theme in this discussion was the finalize method and the fact that, as proposed, it invalidates the Hasher value it's invoked on so that further combine operations trap. There's a straightforward answer to this problem in the language today: finalize could be made into a nonmutating consuming method. In Swift today, that would prevent the finalize operation from invalidating the hasher, since it would receive a copy of the hasher state instead of modifying the hasher in-place. Improperly invoking finalize on an inout hasher argument would give you a hash value based on the current hasher state up to this point, but not invalidate the hasher for future use. Looking forward to the proposed ownership model for Swift, a hash(into:) implementation could mark the hasher argument as move-only (and maybe Hasher itself could be made move-only or explicit-copy-only conditional on Swift language version), making so that you wouldn't be allowed to finalize during a hash(into:) operation, since you can't consume an inout-borrowed value. It's debatable whether silently providing an incomplete hash value is a better failure mode than trapping after improper use of finalize in the short term, but this design choice could lead to the correct long-term behavior further down in Swift's roadmap.

Next, I'd like to suggest the idea of using a stronger return type for the return type of finalize. @lorentey pointed out a potential problem if in the future we added support for one-shot hashing, since a one-shot hashOnly has the potential to be misimplemented by ignoring its hasher and simply returning a fixed value:

struct MyType: Hashable {
  func hashOnly(into hasher: Hasher) -> Int {
    // We ought to use the hasher to generate a hash so that we factor in its seed,
    // but nothing's forcing us to:
    return 1738
  }
}

This is even worse than hashValue since we'd expect hashOnly to return a pre-seeded value, so the raw hash value would be accepted blindly instead of getting mixed in with the seed like a plain hashValue would. We could use the type system to direct use here, by wrapping up the final hash value in a struct that only the standard library Hasher can construct:

@fixedLayout public struct HashValue {
  public let value: Int

  // Only usable by Hasher. Could be public API with a sufficiently scary name?
  fileprivate init(_rawHashValue: Int) { self.value = _rawHashValue }
}

That way, hashOnly or another API that is intended to return a strong hash value can return the HashValue type instead of a plain Int to ensure that the result in fact came from a Hasher.

3 Likes

@lorentey Thanks for commenting.

I do consider the change I suggested as a surface change, primarily because it retains the innovative step of providing the hasher instead of receiving a hash value.

Actually I expect them currently to be slower because of retain-release cycles, however once we get ownership manifesto and can mark the hasher as @nonescaping (or whatever it is called) then I would expect that like in other languages the cost of passing a class will be lower than passing a largish struct.

I can think of lots of uses for custom hashers:

  1. Often implementations of Bloom filters Provide many different hashing functions, e.g. GitHub - Baqend/Orestes-Bloomfilter: Library of different Bloom filters in Java with optional Redis-backing, counting and many hashing options. provides 10+.
  2. You may well want to know exactly what hasher is used, for example many Bloom Filter implementations choose Murmur as the best compromise between speed and performance.
  3. A library of Hashers could, maybe not at first but latter on, contain common message hashers like MD5 and SHA-1.

The Bloom Filter I wrote was just an example, I don't consider it serious code. I wanted to demonstrate that the proposed API would enable more exotic use of hashers whilst still enabling Caching and not requiring tons of boilerplate code. As noted above there are many useful hashers.

Yes it could be a seperate proposal.

Two points:

  1. The caching shown is far from optimal to keep the post short, as noted in the comments. In practice a SortedDictionary would perform better than the example's Dictionary. (But SortedDictionary doesn't exist yet!)
  2. The extra space is only for caching if nothing caches then the overhead would be small (see point 1). If something is cached then yes you are trading of space for speed, but that is what the programmer has decreed worthwhile.

In summary: I think the main point is the restructuring to enable future growth and to anticipate the implementation of the ownership manifesto and I agree that the caching could be seperate.

Thanks for putting all your work into your proposal, it is a really worthwhile improvement to Swift.

1 Like

After sleeping on this, I'm getting increasingly convinced we should just embrace that one-shot hashing provides an unsafe escape hatch from Hasher:

protocol Hashable {
  func hash(into hasher: inout Hasher)
  func unsafeHashValue(seed: (UInt64, UInt64)) -> Int
}

extension Hashable {
  public func unsafeHashValue(seed: (UInt64, UInt64)) -> Int {
    var hasher = Hasher(seed: seed)
    self.hash(into: &hasher)
    return hasher.finalize()
  }
}

In addition to one-shot hashing, this would also allow people to experiment with new hash functions without having to change the stdlib. (unsafeHashValue is essentially the same as the hashValue property, but with a seed.)

(To be clear, we should not do this now. Adding this as public API would be a separate proposal. But introducing a HashValue type now may turn out to be overly cautious later.)

Fair enough. I suppose this doesn't make much difference if it isn't public API to begin with, but would it make sense to pass a pre-seeded Hasher in instead of the seed value itself, or is it useful for the implementer to be able to take full control of hashing?

Passing in the seed is more geared towards the "escape hatch" use case rather than one-shot hashing. (An operation to extract a seed from Hasher would make a messy public API.) As long as we only want to enable one-shot hashing, there is no reason to expose seed values.

That said, in the current version of the (non-public) one-shot hashing implementation, it was useful to collapse all Hasher operations into a single resilient call, including its initialization. Eliminating the extra noninlinable function call makes a measurable improvement for tiny keys. This will change once Set/Dictionary stores full local Hasher instances rather than just their local seeds.

1 Like

The core team has accepted this proposal with the following revisions:

  • As @jrose noted, init(seed:)'s behavior may be misleading to users who expect it to enable deterministic hashing, and has no real benefit over a collection implementation seeding the hasher directly with combine calls. To avoid misleading users, and to simplify the interface, we strongly suggest not including init(seed:) in the public Hasher API, instead suggesting that hashed collection implementers seed their hashers manually.

  • As a further simplification, the core team observes that the many combine(bits:) overloads on Hasher that take fixed-sized integers don’t need to be public API. Instead, they can be @usableFromInline entry points invoked by @inlinable implementations of hash(into:) on the integer types themselves:

    extension Hasher {
      @usableFromInline internal mutating func _combine(bits: UInt8)
      /* etc. */
    }
    
    extension UInt8: Hashable {
      @inlinable public func hash(into hasher: inout Hasher) {
        hasher._combine(bits: self)
      }
    }
    

    This greatly reduces the public API surface area of Hasher, in particular eliminating the large overload set for combine(bits:), while preserving the performance characteristics of the proposed design, since using the generic combine<H: Hashable> method with a fixed-sized integer type would naturally inline down to a call to the corresponding @usableFromInline entry point. This would reduce the public Hasher API to only init(), combine<H: Hashable>(_: H), the combine(bits:) method that takes a raw buffer of hash input, and finalize() as public API.

  • The core team recommends that Hasher.finalize() be made a __consuming nonmutating method instead of a mutating method that invalidates the Hasher value. This saves some overhead from Hasher having to check its validation state, and looking forward to a future version of Swift with an ownership model and support for move-only values, would allow correct use of finalize() to be statically enforced on move-only Hasher instances.

With the exception of the last point, these were the same revisions suggested by the core team in our previous review of the proposal. In further study, @lorentey discovered that there is significant performance value to providing a one-shot hashing interface for standard library types, particularly pure-ASCII Strings, though the benefit for user-defined types will be less due to resilience. Therefore, while the implementation may use one-shot hashing as an implementation detail, the core team does not see the need to design a public API for this functionality at this time.

Thanks again to everyone for participating in the review!

19 Likes

I'm a little late to the party and slightly offtopic (applies to Equatable as well), but still: it would be great if the compiler error message could be improved in case auto-synthesizing fails. Currently, it only outputs "does not conform to protocol 'Hashable'"; it could inform us about why in a note. Typical cases include "property/associated value X is not Hashable so I could not synthesize".

But then, the compiler doesn't even tell us which protocol requirement is unfulfilled, so...

Thanks @reitzig. If you notice deficiencies in diagnostics, please file bugs at bugs.swift.org, or start a new thread under the "Users" tag to discuss them. This review thread is completed, and it's off-topic to discuss implementation issues here.