[Accepted] SE-0206: Hashable Enhancements

Minor question that I don’t see addressed in the proposal: will Hasher itself conform to Hashable?

It's not obvious to me why it would. Is there a use case you had in mind?

I really like the proposal but the one feature that I would like is a way to explicitly control the hash value given by finalize() rather than being forced into the underlying implementation used by Hasher.

The use case I see for this is when working with non-Swift tools (such as C, Python, etc.) which can have hash values. Being able to have consistent hash values of a datatype in each realm is very handy.

For example if I have this in a C header:

struct Foo {
   int bar;
   int baz;
}

int ComputeHashOfFoo(struct Foo foo);

I would personally want the hash value that I compute in C land to be the same as in Swift if I added an extension like this:

extension Foo: Hashable {
    func hash(into hasher: inout Hasher) {
        hasher.fix(to: Int(ComputeHashOfFoo(self)))
    }
}

Also, I'm not at all attached to the name fix(to: Int) or just a function like this being the interface to accomplish this task, it is just a quick example.

I imagine this isn't a very common use case so to discourage the use of fixing the value it should probably be named something like unsafelyLockFinalizedValue(to: Int) or something along those lines. My brain is a little too fried to think of better naming at the moment...

It could, and it would be cool, but you’d need something that differentiates two Hashers in a way that is prone to hashing.
This could be accomplished by extracting a representation of each Hashers’ hashing algorithm using some advanced Reflection feature. Then, you’d feed that representation to a Hasher and get their hash values.

I don’t think something like that exists in Swift atm, but damn that would be cool to make someday. A Hashtable with Hashers as keys? Hell yeah, why not :D

I wanted to raise a couple of questions about the design of finalize that came up in discussions yesterday:

First of all, a recurring theme in this discussion was the finalize method and the fact that, as proposed, it invalidates the Hasher value it's invoked on so that further combine operations trap. There's a straightforward answer to this problem in the language today: finalize could be made into a nonmutating consuming method. In Swift today, that would prevent the finalize operation from invalidating the hasher, since it would receive a copy of the hasher state instead of modifying the hasher in-place. Improperly invoking finalize on an inout hasher argument would give you a hash value based on the current hasher state up to this point, but not invalidate the hasher for future use. Looking forward to the proposed ownership model for Swift, a hash(into:) implementation could mark the hasher argument as move-only (and maybe Hasher itself could be made move-only or explicit-copy-only conditional on Swift language version), making so that you wouldn't be allowed to finalize during a hash(into:) operation, since you can't consume an inout-borrowed value. It's debatable whether silently providing an incomplete hash value is a better failure mode than trapping after improper use of finalize in the short term, but this design choice could lead to the correct long-term behavior further down in Swift's roadmap.

Next, I'd like to suggest the idea of using a stronger return type for the return type of finalize. @lorentey pointed out a potential problem if in the future we added support for one-shot hashing, since a one-shot hashOnly has the potential to be misimplemented by ignoring its hasher and simply returning a fixed value:

struct MyType: Hashable {
  func hashOnly(into hasher: Hasher) -> Int {
    // We ought to use the hasher to generate a hash so that we factor in its seed,
    // but nothing's forcing us to:
    return 1738
  }
}

This is even worse than hashValue since we'd expect hashOnly to return a pre-seeded value, so the raw hash value would be accepted blindly instead of getting mixed in with the seed like a plain hashValue would. We could use the type system to direct use here, by wrapping up the final hash value in a struct that only the standard library Hasher can construct:

@fixedLayout public struct HashValue {
  public let value: Int

  // Only usable by Hasher. Could be public API with a sufficiently scary name?
  fileprivate init(_rawHashValue: Int) { self.value = _rawHashValue }
}

That way, hashOnly or another API that is intended to return a strong hash value can return the HashValue type instead of a plain Int to ensure that the result in fact came from a Hasher.

3 Likes

@lorentey Thanks for commenting.

I do consider the change I suggested as a surface change, primarily because it retains the innovative step of providing the hasher instead of receiving a hash value.

Actually I expect them currently to be slower because of retain-release cycles, however once we get ownership manifesto and can mark the hasher as @nonescaping (or whatever it is called) then I would expect that like in other languages the cost of passing a class will be lower than passing a largish struct.

I can think of lots of uses for custom hashers:

  1. Often implementations of Bloom filters Provide many different hashing functions, e.g. GitHub - Baqend/Orestes-Bloomfilter: Library of different Bloom filters in Java with optional Redis-backing, counting and many hashing options. provides 10+.
  2. You may well want to know exactly what hasher is used, for example many Bloom Filter implementations choose Murmur as the best compromise between speed and performance.
  3. A library of Hashers could, maybe not at first but latter on, contain common message hashers like MD5 and SHA-1.

The Bloom Filter I wrote was just an example, I don't consider it serious code. I wanted to demonstrate that the proposed API would enable more exotic use of hashers whilst still enabling Caching and not requiring tons of boilerplate code. As noted above there are many useful hashers.

Yes it could be a seperate proposal.

Two points:

  1. The caching shown is far from optimal to keep the post short, as noted in the comments. In practice a SortedDictionary would perform better than the example's Dictionary. (But SortedDictionary doesn't exist yet!)
  2. The extra space is only for caching if nothing caches then the overhead would be small (see point 1). If something is cached then yes you are trading of space for speed, but that is what the programmer has decreed worthwhile.

In summary: I think the main point is the restructuring to enable future growth and to anticipate the implementation of the ownership manifesto and I agree that the caching could be seperate.

Thanks for putting all your work into your proposal, it is a really worthwhile improvement to Swift.

1 Like

After sleeping on this, I'm getting increasingly convinced we should just embrace that one-shot hashing provides an unsafe escape hatch from Hasher:

protocol Hashable {
  func hash(into hasher: inout Hasher)
  func unsafeHashValue(seed: (UInt64, UInt64)) -> Int
}

extension Hashable {
  public func unsafeHashValue(seed: (UInt64, UInt64)) -> Int {
    var hasher = Hasher(seed: seed)
    self.hash(into: &hasher)
    return hasher.finalize()
  }
}

In addition to one-shot hashing, this would also allow people to experiment with new hash functions without having to change the stdlib. (unsafeHashValue is essentially the same as the hashValue property, but with a seed.)

(To be clear, we should not do this now. Adding this as public API would be a separate proposal. But introducing a HashValue type now may turn out to be overly cautious later.)

Fair enough. I suppose this doesn't make much difference if it isn't public API to begin with, but would it make sense to pass a pre-seeded Hasher in instead of the seed value itself, or is it useful for the implementer to be able to take full control of hashing?

Passing in the seed is more geared towards the "escape hatch" use case rather than one-shot hashing. (An operation to extract a seed from Hasher would make a messy public API.) As long as we only want to enable one-shot hashing, there is no reason to expose seed values.

That said, in the current version of the (non-public) one-shot hashing implementation, it was useful to collapse all Hasher operations into a single resilient call, including its initialization. Eliminating the extra noninlinable function call makes a measurable improvement for tiny keys. This will change once Set/Dictionary stores full local Hasher instances rather than just their local seeds.

1 Like

The core team has accepted this proposal with the following revisions:

  • As @jrose noted, init(seed:)'s behavior may be misleading to users who expect it to enable deterministic hashing, and has no real benefit over a collection implementation seeding the hasher directly with combine calls. To avoid misleading users, and to simplify the interface, we strongly suggest not including init(seed:) in the public Hasher API, instead suggesting that hashed collection implementers seed their hashers manually.

  • As a further simplification, the core team observes that the many combine(bits:) overloads on Hasher that take fixed-sized integers don’t need to be public API. Instead, they can be @usableFromInline entry points invoked by @inlinable implementations of hash(into:) on the integer types themselves:

    extension Hasher {
      @usableFromInline internal mutating func _combine(bits: UInt8)
      /* etc. */
    }
    
    extension UInt8: Hashable {
      @inlinable public func hash(into hasher: inout Hasher) {
        hasher._combine(bits: self)
      }
    }
    

    This greatly reduces the public API surface area of Hasher, in particular eliminating the large overload set for combine(bits:), while preserving the performance characteristics of the proposed design, since using the generic combine<H: Hashable> method with a fixed-sized integer type would naturally inline down to a call to the corresponding @usableFromInline entry point. This would reduce the public Hasher API to only init(), combine<H: Hashable>(_: H), the combine(bits:) method that takes a raw buffer of hash input, and finalize() as public API.

  • The core team recommends that Hasher.finalize() be made a __consuming nonmutating method instead of a mutating method that invalidates the Hasher value. This saves some overhead from Hasher having to check its validation state, and looking forward to a future version of Swift with an ownership model and support for move-only values, would allow correct use of finalize() to be statically enforced on move-only Hasher instances.

With the exception of the last point, these were the same revisions suggested by the core team in our previous review of the proposal. In further study, @lorentey discovered that there is significant performance value to providing a one-shot hashing interface for standard library types, particularly pure-ASCII Strings, though the benefit for user-defined types will be less due to resilience. Therefore, while the implementation may use one-shot hashing as an implementation detail, the core team does not see the need to design a public API for this functionality at this time.

Thanks again to everyone for participating in the review!

19 Likes

I'm a little late to the party and slightly offtopic (applies to Equatable as well), but still: it would be great if the compiler error message could be improved in case auto-synthesizing fails. Currently, it only outputs "does not conform to protocol 'Hashable'"; it could inform us about why in a note. Typical cases include "property/associated value X is not Hashable so I could not synthesize".

But then, the compiler doesn't even tell us which protocol requirement is unfulfilled, so...

Thanks @reitzig. If you notice deficiencies in diagnostics, please file bugs at bugs.swift.org, or start a new thread under the "Users" tag to discuss them. This review thread is completed, and it's off-topic to discuss implementation issues here.