[Accepted] SE-0206: Hashable Enhancements

I'd like to ask a few questions, but I'm currently on my phone and it's hard to type code samples, so I hope there is still some time left. I'll come back to this tomorrow.


Just wanted to throw a few names into the bikeshedding bucket:

  • take
  • attach

Python's hashlib module provides a nonmutating, nondestructive variant of finalize, called digest. Our use case is not the same as hashlib's, but what if we borrowed this idea?

A non-destructive digest (or finalized?) may have (very) slight performance implications, but it would alleviate the worry that people may call finalize at inopportune moments. However, incorrect use of finalize is a serious programming error; I'd prefer to trap when it happens rather than hide the issue.
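(For concreteness, a nondestructive digest could be built on the proposed struct by finalizing a copy of the hasher. A minimal sketch, with digest() as a hypothetical name that is not part of the proposal:)

extension Hasher {
    func digest() -> Int {
        var copy = self        // Hasher is a struct, so this copies the accumulated state
        return copy.finalize() // only the copy is consumed; `self` stays usable
    }
}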

Hiding finalize() is not a bad idea per se, but it seems to overly complicate the API surface. (It would be the right choice if Hasher was a protocol, but it's not.) It also prevents the trick that keeps commutative hashing working for Bloom filters (see my earlier implementation for Set.hash(into:)), unless we add even more complications.
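(For reference, the trick in question looks roughly like the sketch below; unorderedHash(into:) is a hypothetical name chosen to avoid clashing with the real conformance.)

extension Set {
    // Hash each element with its own copy of the incoming hasher, fold the
    // finalized per-element values with XOR (which is order-independent),
    // and feed the folded value back into the original hasher.
    func unorderedHash(into hasher: inout Hasher) {
        var folded = 0
        for element in self {
            var elementHasher = hasher // copies the state accumulated so far
            elementHasher.combine(element)
            folded ^= elementHasher.finalize()
        }
        hasher.combine(folded)
    }
}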

I can sympathize with that, but statefulness is an essential part of Hasher's design. combine is an unashamedly mutating function, and I don't think we should change that. (I know a pure Hasher would not work well, because I tried implementing one.)

Floating a few more names: ingest, consume, update

This is a great question. I guess not much would go wrong, since hash(into:) implementations would still naturally feed components to the hasher in a unique order in the vast majority of cases.

The semantics of combine matter a great deal in case components cannot be sorted in any particular order, but I don't see how that would happen outside of "unordered" collections like Set and Dictionary. (And we assume those are written by relatively experienced Swift developers, who are willing to read the docs.)

This is a solid argument for keeping the combine name.

Just to clarify: Hasher is extremely sensitive to how combine calls are ordered.
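For example (a quick sketch using the proposed API; actual values vary per process due to random seeding):

var h1 = Hasher()
h1.combine(1)
h1.combine(2)

var h2 = Hasher()
h2.combine(2)
h2.combine(1)

print(h1.finalize() == h2.finalize()) // false, except for vanishingly rare collisions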

1 Like

No, no, everything goes wrong! :-)

First, people who want order-independent hashing (and assume that that's what they will be getting from Hasher) will feed their items to Hasher in an arbitrary order. Hashing will work fine and will still return a meaningful hash value. There will be nothing to inform the developer they've done something wrong unless they specifically write test cases to verify order-invariant hashing. If they don't do that, they'll just eventually run across strange cases where hash values that (they think) should be equal are not.

Second, those who want order-dependent hashing but believe Hasher is order-invariant will do extra work in an attempt to "make" hashing be order-dependent. For example, one approach would be to manually and independently hash each element of a list and XOR its hash value with its index, feeding the result of that calculation to the top-level Hasher. This will work fine but is much slower than it should be. Again, there won't be any smoking gun that alerts the developer to an error.
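(The workaround described above might look like this sketch; hashOrderDependently is a hypothetical helper, and the per-element Hasher pass is exactly the extra work that makes it slow.)

func hashOrderDependently<Element: Hashable>(_ list: [Element], into hasher: inout Hasher) {
    for (index, element) in list.enumerated() {
        var elementHasher = Hasher()  // a full, separate hashing pass per element
        elementHasher.combine(element)
        hasher.combine(elementHasher.finalize() ^ index)
    }
}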

I don't think this is an issue that needs to be addressed in the API. It just needs to be really clear in the docs and header comments that hashing is order-dependent.

1 Like
hasher.stirIn(foo)

If we use labels, maybe this is good...?

hasher.feed(from: thing)

From that point of view, "stir in" and "feed from" aren't very clear about implying order. You can make an argument in their favor, but having to make an argument to end users is the thing we're trying to avoid.

What about refine(using:)? Each step of a refinement depends on the previous intermediate result.

1 Like

What is your evaluation of the proposal?
+1

Is the problem being addressed significant enough to warrant a change to Swift?
Yes

Does this proposal fit well with the feel and direction of Swift?
Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
N/A

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?
I read the proposal and have followed the discussions.

I think that the shed might be best colored combine(next:)

I think if the code is mildly restructured you can get some benefits:

  1. Performance improvements by not copying hashers with a large state (use a class).
  2. Prevent/discourage people from calling finalize.
  3. Provide a selection of hashing algorithms.
  4. Enable caching of expensive hashes.

The restructuring I am suggesting is:

protocol MyHashable { // As before except not an `inout` since `Hasher` is a class. Spelt `MyHashable` to allow example code to run.
    func hash(into: Hasher)
}

// Similar to before, but without `finalize` or `init` since the user shouldn't call these; must be a class since hashers can have a large internal state and therefore should not be bitwise copied.
protocol Hasher: AnyObject {
    func combine(bits: UInt64) // Only combine methods.
    
    // Other types of combine ...
}

// Mixin `finalize` and `init`.
protocol HasherImplementation: Hasher { // A HasherImplementation isa Hasher.
    init() // No arguments, since one use is as a default hasher. Seed using combine.
    
    func finalize() -> UInt64 // Good for this to be the same type as Random, so hashing and random interoperate.
}

enum Hashers { // Library of Hash functions.
    typealias Default = Sip
    
    final class Sip: HasherImplementation { // *Not* Sip but instead something simple to post!
        private var hash: UInt64 = 211 // Real seed is longer!
        
        func finalize() -> UInt64 {
            // Real code ...
            return hash
        }
        
        func combine(bits: UInt64) {
            hash = 499 &* hash &+ bits // Real code more involved!
        }
        
        // Other types of combine ...
    }
}

/// Protocol that enables an implementing class to track modifications so that something that uses hashes can cache the hash.
protocol CacheableHash: AnyObject, MyHashable { // Only classes!
    var modificationCount: UInt64 { get } // Incremented each time a modification is made.
}

// A user of a hashing algorithm, set, dictionary, etc., would normally support caching by using this class rather than the `Hasher` directly.
final class HashCache {
    private struct Cached {
        weak var element: CacheableHash?
        let modificationCount: UInt64
        let hash: UInt64
    }
    
    private var cache = [ObjectIdentifier : Cached]() // Would be a sorted dictionary ideally!
    
    private let hasherFactory: () -> HasherImplementation
    
    private var numberOfHashes = 0
    
    init(hasherFactory: @escaping () -> HasherImplementation = { Hashers.Default.init() }) {
        self.hasherFactory = hasherFactory
    }
    
    func hash(for element: MyHashable, withSeed seed: UInt64 = 0) -> UInt64 {
        defer {
            if numberOfHashes == 131 {
                numberOfHashes = 0
                for entry in cache {
                    if entry.value.element == nil {
                        cache.removeValue(forKey: entry.key) // Remove collected elements.
                    }
                }
            } else {
                numberOfHashes += 1
            }
        }
        guard let e = element as? CacheableHash else {
            let hasher = hasherFactory()
            hasher.combine(bits: seed)
            element.hash(into: hasher)
            return hasher.finalize()
        }
        let id = ObjectIdentifier(e)
        if let cached = cache[id] {
            if cached.element == nil {
                cache.removeValue(forKey: id)
            } else if cached.modificationCount == e.modificationCount {
                return cached.hash
            }
        }
        let hasher = hasherFactory()
        hasher.combine(bits: seed)
        e.hash(into: hasher)
        let hash = hasher.finalize()
        cache[id] = Cached(element: e, modificationCount: e.modificationCount, hash: hash)
        return hash
    }
}

The Hashable protocol, called MyHashable above to prevent name clashes, has a hash(into:) method that accepts a class and therefore is not an inout parameter. This is more efficient for hashers that have a large state.

The Hasher protocol only has combine methods, not init or finalize, thus preventing accidental calls to something you shouldn't use. As noted above, hashers are classes.

HasherImplementation mixes in init and finalize.

Hashers is a library of hash functions with a nominated default hasher. Initially probably only Sip would be provided and hence Sip would be the default.

CacheableHash is a protocol that you implement if you want the hash value cached because it is expensive to calculate. CacheableHash requires the implementer to be a class because the cache needs to track the objects whose hashes it has cached.

HashCache wraps a HasherImplementation and adds caching for CacheableHash types. You would normally wrap a HasherImplementation in a HashCache because HashCache works with both types that implement CacheableHash and types that don't, i.e. it is universal. You can choose which hasher to use with HashCache, but by default it uses Hashers.Default. The code for HashCache uses a plain dictionary; in practice a sorted dictionary would be used (improved performance and less space).

To demonstrate how to use this API, a simple Bloom filter and some tests that exercise HashCache with both cached and non-cached types are given below:

/// Example of something that can cache hashes.
struct Bloom64<T> where T: MyHashable {
    private var hashersSeeds = zip([HashCache(), HashCache(), HashCache()], [1021, 2053, 4099] as [UInt64])
    
    private var bloom: UInt64 = 0
    
    mutating func insert(_ element: T) {
        for hash in hashes(for: element) {
            bloom |= hash
        }
    }
    
    func contains(_ element: T) -> Bool {
        for hash in hashes(for: element) {
            guard bloom & hash != 0 else {
                return false
            }
        }
        return true
    }
    
    private func hashes(for element: T) -> [UInt64] {
        return hashersSeeds.map { hasherSeed in
            (1 as UInt64) << (hasherSeed.0.hash(for: element, withSeed: hasherSeed.1) & 0x3F)
        }
    }
}

// Make `UInt64` conform to `MyHashable` for testing of cheap hashes.
extension UInt64: MyHashable {
    func hash(into: Hasher) {
        into.combine(bits: self)
    }
}
var bi = Bloom64<UInt64>()
bi.insert(1 as UInt64)
print(bi.contains(1 as UInt64)) // T
print(bi.contains(0 as UInt64)) // F

// Example of an expensive class that has its hash cached.
class Expensive: CacheableHash {
    private(set) var modificationCount: UInt64 = 0
    
    private var state: UInt64 // Real code would contain a lot of state!
    
    init(_ initial: UInt64) {
        state = initial
    }
    
    func hash(into: Hasher) {
        print("Calculating hash for \(state) with count of \(modificationCount)")
        into.combine(bits: state) // Real code would be an extensive data structure to hash.
    }
    
    func modify() {
        state += 1
        modificationCount += 1 // Any 'mutating' method *must* increment count.
    }
}
var be = Bloom64<Expensive>()
let e0 = Expensive(0)
let e1 = Expensive(11)
be.insert(e1) // "Calculating hash for 11 with a count of 0" x 3 for e1
print(be.contains(e1)) // No hash calculation because the cached value for e1 is used, T
print(be.contains(e0)) // "Calculating hash for 0 with a count of 0" x 3 because 1st time for e0, F
e1.modify()
print(be.contains(e1)) // "Calculating hash for 12 with a count of 1" x 3 because e1 modified, F

I think it is a modest change that has great benefits. What do you think?

1 Like

@lorentey you posted an example of an enum with associated values somewhere above, but I couldn't quite follow what you said about it at the time. Is it reasonable to prefix each case with an index-like value when creating a new hash value?

I wonder if this proposal will eventually allow us to synthesize the hash(into:) function for enums with associated values, which is currently missing. I couldn't find the reasoning why synthesizing the hash value for enums with associated values would be source breaking. If someone can clarify this here or send me a PM, I'd very much appreciate it.

In my code I use a very ugly solution, because I didn't come up with anything better, and I know that at least this solution will work as expected:

enum Foo : Hashable {
  case a
  case b
  case c(Int)
  case d(Int, String)

  var hashValue: Int {
    switch self {
    case .a:
      return "a".hashValue
    case .d(let intValue, let stringValue):
      return "d-\(intValue)\(stringValue)".hashValue
    ...
    }
  }
  
  static func == (lhs: Foo, rhs: Foo) -> Bool { ... }
}

I really liked your earlier example, which associated an index with each case before consuming the enum case's associated values.

Yes! Feeding discriminator values to the hasher is the correct way to do it.
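For the Foo enum upthread, the pattern looks roughly like this (a sketch; the actual discriminator values synthesis uses are an implementation detail):

extension Foo {
    func hash(into hasher: inout Hasher) {
        switch self {
        case .a:
            hasher.combine(0) // discriminator first...
        case .b:
            hasher.combine(1)
        case .c(let int):
            hasher.combine(2)
            hasher.combine(int) // ...then any associated values
        case .d(let int, let string):
            hasher.combine(3)
            hasher.combine(int)
            hasher.combine(string)
        }
    }
}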

I have good news! This was included in SE-0185, and it is already fully supported in the current compiler. This proposal merely updates generated code to use hash(into:). For example, this works just fine:

enum Foo: Hashable {
  case a(Int, String?, [Double])
  case b([String: Bool])
  case c(Set<Foo>)
}

SE-0185 does require that associated values implement Hashable. (Note that the example above makes use of conditional Hashable conformances introduced by SE-0143 that did not make it into the 4.1 release but are implemented on the master branch.)

1 Like

To me, this seems like an extremely significant change: it replaces a single struct with three protocols, a caseless enum, and two classes. It may seem like a surface-level change to you, but it significantly complicates the hashing API. The new hash(into:) and (especially!) HashCache would also very likely not benchmark the way you expect.

We did carefully consider making Hasher a protocol. The proposal explains why we decided to go with a simple struct instead. There may be a use case for custom Hashers, but it hasn't come up in review discussions so far.

The Bloom filter you've shown does not utilize multiple Hasher implementations, so it argues more against generalizing Hasher than in favor of it. This is as it should be; we expect that the stdlib's Hasher implementation will be good enough to support Bloom filters -- there is no reason to implement custom Hashers for that. (Note that Bloom64 allocates a shockingly large amount of space to cache hash values; I believe this goes directly against the spirit of bloom filtering, and some of that space would be much better used as additional state space for the filter itself. The current version of insert/contains also yields a much higher rate of false positives than what Bloom suggested.)

The idea to partially automate hash caching is intriguing, but it seems like it is orthogonal to this proposal, and it can be built on top of it later. It also adds a brand new concept for tracking mutations that seems non-obvious enough to deserve its own separate discussion thread.

(I'd like to note again that, in order to improve hash table performance, we can't reuse hash values across Set and Dictionary instances (or even during the lifetime of a single collection instance). So caching full raw hash values would rarely be helpful for these collections.)

This is a remarkably confident assertion. It's difficult to correctly reason about performance without actually comparing benchmarks. I strongly suspect you may be overestimating the cost of passing around an inout struct, and underestimating the cost of allocating/deallocating a class instance or of passing around a protocol existential. Performance considerations weren't the only reason we chose the design we proposed, but they were indeed quite significant -- and the hundreds of benchmarks I had run over the past few months were decidedly in favor of an interface that uses inout structs.

3 Likes

Er, I don’t quite follow. What exactly is the problem with having, say, an instance of Set cache its own hash value? (Or more accurately, cache its “prehash”, namely the xor of the hash values of its elements.)

Any mutation would clear the cached hash so it can never become stale, and I doubt that setting an optional integer to nil has any significant performance implications relative to mutating a set.

The prehash is only used in the set’s own hash(into:) implementation, so it never “leaks”. Instantiating a new set using the elements of an existing set doesn’t look at either set’s prehash.

Storing one set into another calls the inner set’s hash(into:) method with the outer set’s hasher, exactly as expected. And instantiating a new set using the elements of the outer set involves calling the inner set’s hash(into:) method with the new set’s hasher. So the inner set’s cached prehash gets reused successfully, while the outer sets are free to maintain independent hashers.

• • •

Edit: Scratch all that. We can’t cache hashes (or even prehashes) because the elements could be classes, which can be mutated without it being “mutating”.

• • •

Edit 2: Wait, if a class instance can have its hash value change out from under a Set, that poses larger problems even without caching. How do we handle sets where the element is a class today?

• • •

Edit 3: It appears we do not handle that situation today. The set claims not to contain the mutated class instance, although it does show up during iteration.

Hold your horses! :wink: I was talking about the specific idea of reusing the raw hash values across Set/Dictionary instances, not about hash caching in general. Hash tables must choose a new seed when they resize themselves, to prevent unacceptable performance regressions. (This is unrelated to collision attacks.)

Using the memoized result of a secondary hash function to reduce the number of bytes fed into Hasher in hash(into:) is a perfectly valid optimization. (That being said, it does have some fundamental correctness tradeoffs, and I believe it should only be used sparingly, and with copious amounts of justification in the form of before/after profiling data.)
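(For illustration, that optimization looks something like the sketch below. All names are hypothetical, and the memo must be refreshed on every mutation, which is part of the correctness tradeoff mentioned.)

struct BigDocument: Hashable {
    private(set) var contents: [UInt8]
    private var memoizedDigest: Int // secondary hash of `contents`

    init(contents: [UInt8]) {
        self.contents = contents
        self.memoizedDigest = contents.hashValue
    }

    mutating func replaceContents(with newContents: [UInt8]) {
        contents = newContents
        memoizedDigest = newContents.hashValue // every mutating member must refresh the memo
    }

    static func == (l: BigDocument, r: BigDocument) -> Bool {
        return l.contents == r.contents
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(memoizedDigest) // feeds one word instead of all of `contents`
    }
}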

There is nothing wrong with that, except it would needlessly slow down mutations (especially bulk removals) for Sets that will never be hashed -- which I expect to be the usual case. Much less importantly, it also interferes with using Sets as keys in advanced hashing scenarios, like Bloom filters.

Keeping an up-to-date hash cache for Sets is still an important optimization; it's just probably better done outside of the standard library. (But if the performance tradeoff seems worth it, the standard Set may end up doing it too. This is an implementation detail that is independent of the proposal we're discussing.)

No, it's a serious programming error to do that.

While this is an interesting topic, it too seems unrelated to the proposal at hand.

Cocoa handles this by defining the NSCopying protocol, and making a copy of all keys inserted into NSSet/NSDictionary instances. Foundation differentiates between mutable/immutable classes, where a copy of a mutable class gets to be immutable. This works fine within types defined in Foundation, but it's just a coding convention, not a language-enforced rule. You're free to ignore this convention when you're defining your own NSObject subclasses, and then you'll run into the same problem.

Swift chose to handle the same issue by emphasizing value types over reference types. Value types are literally impossible to mutate while they're stored in Set/Dictionary's hash table. Notably, this is enforced by the type system of the language itself, rather than just a convention.

1 Like

Okay, that makes sense.

True.

The same issue arises for a struct that contains a class instance. If the class instance is visible anywhere outside the struct (either through another reference, or even as a read-only property on the struct itself), the hash value of the struct can be changed out from under a set it is in. This applies even if the struct uses copy-on-write, because no write to the struct is occurring.

And yes, this is not related to the subject of the current thread, except that it came to mind while discussing cached hashes.

1 Like

This review is focused only on source compatibility. I read that section in depth and skimmed the rest of the proposal. I did not read all of the review thread or previous discussion.

When upgrading to Swift 4.2, Hashable types written for earlier versions of Swift will need to be migrated to implement hash(into:) instead of hashValue. There are two options for doing this:

IIUC, we will synthesize the new hash(into:) based on the user's hashValue implementation when using the pre-4.2 language mode. Why not do the same in 4.2 and deprecate hashValue instead of forcing users to update it immediately? A warning combined with excellent fixits is a great way to get everyone to migrate over time, but it's painful for it to become an error in the same release that the new functionality is added. The bar for making it an error in 4.2 should be that we can migrate it fully automatically, but...

The compiler should simplify the migration process by providing fix-its for both options. A fix-it to remove hashValue should only be provided if automatic synthesis is available. For the second option, it would be nice to have the compiler suggest a full implementation of hash(into:), but a template would probably suffice:

... I believe this is not sufficient for automatic migration. We will not be able to ensure that the behaviour of the program doesn't change with either of these options. I think we really need the user to make the decision to change their hashing.

1 Like

Thank you for bringing this up; ease of migration is a hugely important issue.

While the compiler would produce a deprecation warning in 4.2 mode, synthesis would continue to work, and no code would need to be actually changed until/unless hashValue is actually removed from the protocol, which is not going to happen soon (if ever).

(It may also be a good idea to defer hashValue's formal deprecation till a future language version. We'd like to encourage people to implement hash(into:), but there may be gentler ways of doing that than formally deprecating hashValue. However, it seems like a bad idea to support both requirements indefinitely -- we do want to get rid of the legacy interface at some point.)

I agree wholeheartedly. It may be helpful to provide it as an interactive fix-it attached to the deprecation warning, though.

The proposal doesn't cover fully automated migration of existing hashValue implementations; it seems difficult to do that correctly in the most general case (with complications like caching, partial hashing, etc.). However, automatic migration may be possible for the simpler cases.

For example, (in theory at least) the migrator could recognize that the hashValue implementation below calls the getters of all stored properties without touching anything else, and it could replace it with a hash(into:) implementation or remove it altogether:

struct Foo: Hashable {
  var a: Int
  var b: Double

  var hashValue: Int { return a.hashValue ^ 31 &* b.hashValue }
  static func ==(l: Foo, r: Foo) -> Bool { return l.a == r.a && l.b == r.b }
}
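The migrated form might then look like this (a sketch of the second option; the resulting hash values differ from the original, which is fine since hash values are not guaranteed to be stable):

struct Foo: Hashable {
  var a: Int
  var b: Double

  // Replaces the hand-written hashValue; synthesis could also provide this.
  func hash(into hasher: inout Hasher) {
    hasher.combine(a)
    hasher.combine(b)
  }
  static func ==(l: Foo, r: Foo) -> Bool { return l.a == r.a && l.b == r.b }
}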

(The same goes for the == implementation.) Of course, the "without touching anything else" part may not be easy to formally verify. (The code does call the hashValue properties along with a bunch of Int operations. Do we implement a whitelist?)

This migration would still change the meaning of the code in evil cases like the one below, where a.hashValue is canceled out from the returned hash value:

var hashValue: Int { 
  var hash = 0
  hash ^= a.hashValue
  hash = 31 &* b.hashValue
  return hash
}

I'd argue a human migrator would probably not fare any better, and I'd also argue the original intention behind this code was probably to hash both fields; the second xor was simply lost somewhere along the way.

As an improvement to this, the migrator could recognize partial hashing, and try to do the right thing there, too.

Thanks to everyone who has participated in the review thus far! The core team discussed this proposal in our meeting today. There are still a couple of days left in the review period, but we're tentatively inclined to accept this proposal, with some suggested changes based on the public review discussion and our own review:

  • As @jrose noted, init(seed:)'s behavior may be misleading to users who expect it to enable deterministic hashing, and it has no real benefit over a collection implementation seeding the hasher directly with combine calls. To avoid misleading users, and to simplify the interface, we strongly suggest not including init(seed:) in the public Hasher API; hashed collection implementers can seed their hashers manually instead (see the sketch after this list).

  • As a further simplification, the core team observes that the many combine(bits:) overloads on Hasher that take fixed-sized integers don't need to be public API. Instead, they can be @usableFromInline entry points invoked by @inlinable implementations of hash(into:) on the integer types themselves:

    extension Hasher {
      @usableFromInline internal mutating func _combine(bits: UInt8)
      /* etc. */
    }
    
    extension UInt8: Hashable {
      @inlinable public func hash(into hasher: inout Hasher) {
        hasher._combine(bits: self)
      }
    }
    

    This greatly reduces the public API surface area of Hasher, in particular eliminating the large overload set for combine(bits:), while preserving the performance characteristics of the proposed design, since using the generic combine<H: Hashable> method with a fixed-sized integer type would naturally inline down to a call to the corresponding @usableFromInline entry point. This would reduce the public Hasher API to only init(), combine<H: Hashable>(_: H), the combine(bits:) method that takes a raw buffer of hash input, and finalize() as public API.
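For the first bullet, manual seeding could look like the following sketch; instanceSeed is a hypothetical per-collection value, and bucketCount is assumed to be a power of two:

func bucketIndex<Key: Hashable>(for key: Key, instanceSeed: Int, bucketCount: Int) -> Int {
    var hasher = Hasher()
    hasher.combine(instanceSeed) // the seed is fed as ordinary hash input
    hasher.combine(key)
    return hasher.finalize() & (bucketCount - 1) // mask; assumes power-of-two capacity
}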

With those changes, the proposal is likely to be accepted. The core team also considered many of the points raised in public review. @Gankro raised the importance of mixins for defending against collision family attacks in hashes of collection values, and many people suggested dedicated APIs for facilitating this in response. @lorentey noted in reply that a dedicated API for this doesn't really carry its weight or shield the implementer of a collection hash from needing to know that they need mixins or what information to mix into the hasher, and the core team concurs with this point of view. Documentation will describe the importance of mixing in this kind of structural information when implementing hash(into:) for collection types.

In the same post, @Gankro also pointed out the usefulness of a one-shot hashing interface for certain families of hash functions that change algorithm based on input size and so benefit from a complete view of the hash input at once. In the proposed design, the hasher implementation is intended to be fixed and chosen by the standard library, and the current algorithm being considered does not require one-shot hashing to benefit. If a future version of Swift decided to change the hash function to one that benefited from one-shot hashing, we should be able to introduce it resiliently by adding a new requirement with a default implementation in terms of the current incremental hashing interface. In a follow-up post, @Gankro described problems Rust had doing exactly that, albeit acknowledging that some of those problems are due to other Rust-specific design choices in their hashing design, particularly the ability to index their hashed containers with types other than the key type of the container itself. It isn't clear that Swift would suffer these same issues resiliently introducing one-shot hashing, or that Swift would have cause to switch the standard hash function to one that required it, so the core team would like to err on the side of simplicity and leave it out of the current design.

@lemire raised the concern that the proposed design does not allow for rolling hash functions. @lorentey replied that Hashable's primary purpose is hashed collections, and that the usefulness of rolling hashes is largely independent of Hashable's intended purpose, and the core team concurs.

@hlovatt and others asked how to cache hashing work in this design; it's more non-obvious and error-prone to do so than before. @lorentey replied, and the core team agrees, that automated caching is an orthogonal design question that could be added in a separate proposal, and that hash caching should in general be used sparingly as it's rarely a win in practice.

Many people raised concerns about the naming of methods in the proposed design. Particularly, @nnnnnnnn noted that perhaps hash(into:) ought to be deemphasized as a static method, to avoid polluting documentation and code completion, and many people raised the issue that combine may imply order insensitivity. The core team feels that the documentation/code completion pollution problem is a larger issue that deserves more holistic consideration. While many alternative names for combine were suggested, the core team did not find any of them significantly clearer. The core team is inclined to accept the names as is.

The core team considered the question of whether the old hashValue requirement could be eliminated altogether in a future version of Swift. There are significant source compatibility concerns in doing so; @blangmuir also lays these concerns out nicely in his post. Deprecating hashValue would not affect source compatibility with existing Swift 3 or 4 code while discouraging new code from being written to implement it, but the protocol requirement cannot be completely removed without breaking source compatibility, since there is no perfect automated migration path. However, a special case in the Hashable synthesis logic may be possible; in theory, the compiler could see that a type is declared to conform to Hashable with only a hashValue implementation provided, and in that situation it could synthesize a hash(into:) implementation in terms of hashValue. It would be nice to be able to eliminate hashValue entirely from the API and ABI of the standard library, but it isn't worth breaking source stability for, and the core team does not want to gate acceptance of this proposal on even more compiler work to handle special-case compatibility hacks.
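The special-cased synthesis described above would amount to generating a shim along these lines (a sketch; Legacy is a hypothetical type that only provides hashValue):

struct Legacy: Hashable {
    var id: Int
    var hashValue: Int { return id } // user-written legacy requirement

    // What the compiler could synthesize from the hashValue above:
    func hash(into hasher: inout Hasher) {
        hasher.combine(hashValue)
    }
}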

Thanks again to everyone who's given feedback in the review discussion. The review period is still scheduled to continue through this Friday, so if you have any other concerns to raise about the proposal or responses to the core team's decisions thus far, you still have time to provide them, and your feedback is welcome.

9 Likes

Just out of curiosity, do you have a bottom-line estimate of the performance cost of making Hasher a protocol rather than a unitary struct?