[Accepted] SE-0206: Hashable Enhancements

I haven’t seen this mentioned so far as a practical problem. Why do you think we need to solve it?

feed is probably not the best choice here.

As an English verb, it means roughly “put something into”; it requires a preposition or other clarification unless the direct object is the entity being fed. So, for example, you can “feed the dog” (direct object is the recipient) but you can’t just “feed an apple” or “feed an integer.” You have to “feed an integer to the hasher” or “feed the hasher an integer” or “feed on an integer.”

In addition, as @QuinceyMorris noted, the thing being fed can’t be the subject of the phrase. So

hasher.feed(anInt)

is weird because the hasher isn’t feeding anything; it’s being fed.

append seems to me to imply that objects are being accumulated into a to-be-hashed list, which is potentially misleading. And I’m not sure I see the implication of order-dependency; yes, there is an implication that the hypothetical to-be-hashed list would be hashed in order, but that doesn’t necessarily mean that the hash value is order-dependent.

consume implies disposal of the object being consumed, which is a bit odd.

I agree with @lorentey that hash is too overloaded, although otherwise it would be ideal. Too many hashers and hash values flying around already.

ingest and digest have a strangely biological ring, but ingest actually isn’t bad. I like ingest more than digest because the latter is too easily confused with a noun.

Other possibilities along these same lines: absorb, integrate, mix.

All things considered, I think I still like combine as the best compromise.

Edit: update seems off to me as well, even if it is the standard term. It means “to bring up to date,” but this API doesn’t have anything to do with time or with staleness.

3 Likes

Yes, and it could be “Feed foo to what?”. It’s not really meaningful until a preposition appears.

It’s a reasonable objection. It certainly would be a word stolen from a different discipline.

Another direction to go is with a word like complexify or perturb or even randomize that more accurately describes what the operation is doing. (Edit: To read right, these would probably need a with: keyword on the parameter.)

I’m OK with combine too. I guess one question would be: if someone misunderstood that the combine was order-dependent, what exactly would go wrong with the usability of Hasher?

2 Likes

I’d like to ask a few questions but I’m currently on my phone and it’s hard to type code samples, so I hope there is still some time left. I’ll come back tomorrow on that.


Just wanted to throw a few names into the bikeshedding bucket:

  • take
  • attach

Python’s hashlib module provides a nonmutating, nondestructive variant of finalize, called digest. Our use case is not the same as hashlib’s, but what if we borrowed this idea?

A non-destructive digest (or finalized?) may have (very) slight performance implications, but it would alleviate the worry that people may call finalize at inopportune moments. However, incorrect use of finalize is a serious programming error; I’d prefer to trap when it happens rather than hiding the issue.
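For illustration, a nondestructive digest could be sketched on top of the proposed struct-based Hasher by copying the state before finalizing. (This is a hypothetical extension, not part of SE-0206; `digest` is my name for it.)

```swift
// Hypothetical sketch: a nonmutating variant of finalize, in the spirit
// of Python's hashlib.digest(). Because Hasher is a struct, copying it
// preserves the receiver's state.
extension Hasher {
    func digest() -> Int {
        var copy = self       // value copy; the original stays usable
        return copy.finalize()
    }
}

var hasher = Hasher()
hasher.combine(42)
let intermediate = hasher.digest()  // peek without consuming
hasher.combine(7)                   // still legal to keep feeding
let final = hasher.finalize()
```

The "(very) slight" cost mentioned above is the state copy on every digest call.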

Hiding finalize() is not a bad idea per se, but it seems to overly complicate the API surface. (It would be the right choice if Hasher was a protocol, but it’s not.) It also prevents the trick that keeps commutative hashing working for Bloom filters (see my earlier implementation for Set.hash(into:)), unless we add even more complications.

I can sympathize with that, but statefulness is an essential part of Hasher's design. combine is an unashamedly mutating function, and I don’t think we should change that. (I know a pure Hasher would not work well, because I tried implementing one.)

Floating a few more names: ingest, consume, update

This is a great question. I guess not much would go wrong, since hash(into:) implementations would still naturally feed components to the hasher in a unique order, in the vast majority of cases.

The semantics of combine matter a great deal in case components cannot be sorted in any particular order, but I don’t see how that would happen outside of “unordered” collections like Set and Dictionary. (And we assume those are written by relatively experienced Swift developers, who are willing to read the docs.)

This is a solid argument for keeping the combine name.

Just to clarify: Hasher is extremely sensitive to how combine calls are ordered.
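To make that concrete, here is a minimal sketch (assuming the proposed API) showing that reordering combine calls changes the finalized value:

```swift
var forward = Hasher()
forward.combine(1)
forward.combine(2)

var reversed = Hasher()
reversed.combine(2)
reversed.combine(1)

// Hasher mixes each new input into its accumulated state, so with
// overwhelming probability these two finalized values differ.
print(forward.finalize() == reversed.finalize())
```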

1 Like

No, no, everything goes wrong! :-)

First, people who want order-independent hashing (and assume that that’s what they will be getting from Hasher) will feed their items to Hasher in an arbitrary order. Hashing will work fine and will still return a meaningful hash value. There will be nothing to inform the developer they’ve done something wrong unless they specifically write test cases to verify order-invariant hashing. If they don’t do that, they’ll just eventually run across strange cases where hash values that (they think) should be equal are not.

Second, those who want order-dependent hashing but believe Hasher is order-invariant will do extra work in an attempt to “make” hashing be order-dependent. For example, one approach would be to manually and independently hash each element of a list and XOR its hash value with its index, feeding the result of that calculation to the top-level Hasher. This will work fine but is much slower than it should be. Again, there won’t be any smoking gun that alerts the developer to an error.
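The workaround described above might look like this sketch (written under the mistaken assumption that Hasher is order-invariant; the function names are mine):

```swift
// Slow: hash each element with its own throwaway Hasher, mix in the
// element's index, and feed only the result to the outer hasher.
func hashAssumingOrderInvariance(_ items: [Int], into hasher: inout Hasher) {
    for (index, item) in items.enumerated() {
        var inner = Hasher()
        inner.combine(item)
        hasher.combine(inner.finalize() ^ index)
    }
}

// What should be written instead: combine is already order-dependent,
// so elements can simply be fed directly.
func hashDirectly(_ items: [Int], into hasher: inout Hasher) {
    for item in items {
        hasher.combine(item)
    }
}
```

The "slow" version finalizes a fresh Hasher per element, which is exactly the needless extra work described above.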

I don’t think this is an issue that needs to be addressed in the API. It just needs to be really clear in the docs and header comments that hashing is order-dependent.

1 Like

hasher.stirIn(foo)

If we use labels, maybe this is good…?

hasher.feed(from: thing)

From that point of view, “stir in” and “feed from” aren’t very clear about implying order. You can make an argument in their favor, but having to make an argument to end users is the thing we’re trying to avoid.

What about refine(using:)? Each step of a refinement depends on the previous intermediate result.

1 Like

What is your evaluation of the proposal?
+ 1

Is the problem being addressed significant enough to warrant a change to Swift?
Yes

Does this proposal fit well with the feel and direction of Swift?
Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
N/A

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?
I read the proposal and have followed the discussions.

I think that the shed might be best colored combine(next:)

I think if the code is mildly restructured you can get some benefits:

  1. Performance improvements by not copying hashers with a large state (use a class).
  2. Prevent/discourage people from calling finalize.
  3. Provide a selection of hashing algorithms.
  4. Enable caching of expensive hashes.

The restructuring I am suggesting is:

protocol MyHashable { // As before except not an `inout` since `Hasher` is a class. Spelt `MyHashable` to allow example code to run.
    func hash(into: Hasher)
}

// Similar to before, but without `finalize` or `init` since the user shouldn't call these; must be a class since hashers can have a large internal state and therefore should not be bitwise copied.
protocol Hasher: AnyObject {
    func combine(bits: UInt64) // Only combine methods.
    
    // Other types of combine ...
}

// Mixin `finalize` and `init`.
protocol HasherImplementation: Hasher { // A HasherImplementation is a Hasher.
    init() // No arguments, since one use is as a default hasher. Seed using combine.
    
    func finalize() -> UInt64 // Good to be same type as Random, so hashing and random interoperate.
}

enum Hashers { // Library of Hash functions.
    typealias Default = Sip
    
    final class Sip: HasherImplementation { // *Not* Sip but instead something simple to post!
    private var hash: UInt64 = 211 // Real seed is longer!
        
        func finalize() -> UInt64 {
            // Real code ...
            return hash
        }
        
        func combine(bits: UInt64) {
            hash = 499 &* hash &+ bits // Real code more involved!
        }
        
        // Other types of combine ...
    }
}

/// Protocol that enables an implementing class to track modifications so that something that uses hashes can cache the hash.
protocol CacheableHash: AnyObject, MyHashable { // Only classes!
    var modificationCount: UInt64 { get } // Incremented each time a modification is made.
}

// A user of a hashing algorithm, set, dictionary, etc., would normally support caching by using this class rather than the `Hasher` directly.
final class HashCache {
    private struct Cached {
        weak var element: CacheableHash?
        let modificationCount: UInt64
        let hash: UInt64
    }
    
    private var cache = [ObjectIdentifier : Cached]() // Would be a sorted dictionary ideally!
    
    private let hasherFactory: () -> HasherImplementation
    
    private var numberOfHashes = 0
    
    init(hasherFactory: @escaping () -> HasherImplementation = { Hashers.Default.init() }) {
        self.hasherFactory = hasherFactory
    }
    
    func hash(for element: MyHashable, withSeed seed: UInt64 = 0) -> UInt64 {
        defer {
            if numberOfHashes == 131 {
                numberOfHashes = 0
                for entry in cache {
                    if entry.value.element == nil {
                        cache.removeValue(forKey: entry.key) // Remove collected elements.
                    }
                }
            } else {
                numberOfHashes += 1
            }
        }
        guard let e = element as? CacheableHash else {
            let hasher = hasherFactory()
            hasher.combine(bits: seed)
            element.hash(into: hasher)
            return hasher.finalize()
        }
        let id = ObjectIdentifier(e)
        if let cached = cache[id] {
            if cached.element == nil {
                cache.removeValue(forKey: id)
            } else if cached.modificationCount == e.modificationCount {
                return cached.hash
            }
        }
        let hasher = hasherFactory()
        hasher.combine(bits: seed)
        e.hash(into: hasher)
        let hash = hasher.finalize()
        cache[id] = Cached(element: e, modificationCount: e.modificationCount, hash: hash)
        return hash
    }
}

The Hashable protocol, called MyHashable above to prevent name clashes, has a hash(into:) method that accepts a class and therefore does not need an inout parameter. This is more efficient for hashers that have a large state.

The Hasher protocol only has combine methods, not init or finalize, thus preventing accidental calls to something you shouldn’t. As noted above, hashers are classes.

HasherImplementation mixes in init and finalize.

Hashers is a library of hash functions with a nominated default hasher. Initially probably only Sip would be provided and hence Sip would be the default.

CacheableHash is a protocol that you implement if you want HashCache to cache the hash value because it is expensive to calculate. CacheableHash requires the implementer to be a class because the cache needs to track the objects it has cached the hashes for.

HashCache wraps a HasherImplementation and adds to it caching for CacheableHash types. You would normally wrap a HasherImplementation in a HashCache because HashCache works with both types that implement CacheableHash and types that don’t, i.e. it is universal. You can choose which hasher to use with HashCache, but by default it uses Hashers.Default. The code for HashCache uses a dictionary; in practice a sorted dictionary would be used (improved performance and less space).

To demonstrate how to use this API, a simple Bloom filter and some tests which demonstrate HashCache with both cached and non-cached types are given below:

/// Example of something that can cache hashes.
struct Bloom64<T> where T: MyHashable {
    private var hashersSeeds = zip([HashCache(), HashCache(), HashCache()], [1021, 2053, 4099] as [UInt64])
    
    private var bloom: UInt64 = 0
    
    mutating func insert(_ element: T) {
        for hash in hashes(for: element) {
            bloom |= hash
        }
    }
    
    func contains(_ element: T) -> Bool {
        for hash in hashes(for: element) {
            guard bloom & hash != 0 else {
                return false
            }
        }
        return true
    }
    
    private func hashes(for element: T) -> [UInt64] {
        return hashersSeeds.map { hasherSeed in
            (1 as UInt64) << (hasherSeed.0.hash(for: element, withSeed: hasherSeed.1) & 0x3F)
        }
    }
}

// Make `UInt64` conform to `MyHashable` for testing of cheap hashes.
extension UInt64: MyHashable {
    func hash(into: Hasher) {
        into.combine(bits: self)
    }
}
var bi = Bloom64<UInt64>()
bi.insert(1 as UInt64)
print(bi.contains(1 as UInt64)) // T
print(bi.contains(0 as UInt64)) // F

// Example of an expensive class that has its hash cached.
class Expensive: CacheableHash {
    private(set) var modificationCount: UInt64 = 0
    
    private var state: UInt64 // Real code would contain a lot of state!
    
    init(_ initial: UInt64) {
        state = initial
    }
    
    func hash(into: Hasher) {
        print("Calculating hash for \(state) with count of \(modificationCount)")
        into.combine(bits: state) // Real code would be an extensive data structure to hash.
    }
    
    func modify() {
        state += 1
        modificationCount += 1 // Any 'mutating' method *must* increment count.
    }
}
var be = Bloom64<Expensive>()
let e0 = Expensive(0)
let e1 = Expensive(11)
be.insert(e1) // "Calculating hash for 11 with count of 0" x 3 for e1
print(be.contains(e1)) // No calculating hash because uses cached value for e1, T
print(be.contains(e0)) // "Calculating hash for 0 with count of 0" x 3 because 1st time for e0, F
e1.modify()
print(be.contains(e1)) // "Calculating hash for 12 with count of 1" x 3 because e1 modified, F

I think it is a modest change that has great benefits. What do you think?

1 Like

TL;DR

@lorentey you posted an example of an enum with associated values somewhere above but I couldn’t quite follow then what you said about it. Is it reasonable to prefix each case with an index-like value when creating a new hash value?

I wonder if this proposal will eventually allow us to synthesize the hash(into:) function for enums with associated values, which is currently missing. I couldn’t find the reasoning why synthesizing the hash value for enums with associated values is source breaking. If someone can clarify this here or send me a PM, I’d very much appreciate it.

In my code I use a very ugly solution, because I didn’t come up with anything better and know that at least this solution will work as expected:

enum Foo : Hashable {
  case a
  case b
  case c(Int)
  case d(Int, String)

  var hashValue: Int {
    switch self {
    case .a:
      return "a".hashValue
    case .d(let intValue, let stringValue):
      return "d-\(intValue)\(stringValue)".hashValue
    ...
    }
  }
  
  static func == (lhs: Foo, rhs: Foo) -> Bool { ... }
}

I really liked the example of yours which associated an index to each case before consuming the associated values of the enum case.

Yes! Feeding discriminator values to the hasher is the correct way to do it.
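Concretely, a hand-written hash(into:) with case discriminators might look like this sketch (assuming the proposed API; the enum and its cases are illustrative):

```swift
enum Token: Hashable {
    case comma
    case number(Int)
    case string(String)

    func hash(into hasher: inout Hasher) {
        switch self {
        case .comma:
            hasher.combine(0)       // discriminator only
        case .number(let value):
            hasher.combine(1)       // discriminator first...
            hasher.combine(value)   // ...then the payload
        case .string(let value):
            hasher.combine(2)
            hasher.combine(value)
        }
    }
}
```

The discriminator guarantees that, say, .comma and .number(0) feed different byte sequences to the hasher, so distinct cases don’t systematically collide.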

I have good news! This was included in SE-0185, and it is already fully supported in the current compiler. This proposal merely updates generated code to use hash(into:). For example, this works just fine:

enum Foo: Hashable {
  case a(Int, String?, [Double])
  case b([String: Bool])
  case c(Set<Foo>)
}

SE-0185 does require that associated values implement Hashable. (Note that the example above makes use of conditional Hashable conformances introduced by SE-0143 that did not make it into the 4.1 release but are implemented on the master branch.)

1 Like

To me, this seems like an extremely significant change: it replaces a single struct with three protocols, a caseless enum, and two classes. It may seem like a surface-level change to you, but it significantly complicates the hashing API. The new hash(into:) and (especially!) HashCache would also very likely not benchmark the way you expect.

We did carefully consider making Hasher a protocol. The proposal explains why we decided to go with a simple struct instead. There may be a use case for custom Hashers, but it hasn’t come up in review discussions so far.

The Bloom filter you’ve shown does not utilize multiple Hasher implementations, so it argues more against generalizing Hasher than in favor of it. This is as it should be; we expect that the stdlib’s Hasher implementation will be good enough to support Bloom filters – there is no reason to implement custom Hashers for that. (Note that Bloom64 allocates a shockingly large amount of space to cache hash values; I believe this goes directly against the spirit of bloom filtering, and some of that space would be much better used as additional state space for the filter itself. The current version of insert/contains also yields a much higher rate of false positives than what Bloom suggested.)

The idea to partially automate hash caching is intriguing, but it seems like it is orthogonal to this proposal, and it can be built on top of it later. It also adds a brand new concept for tracking mutations that seems non-obvious enough to deserve its own separate discussion thread.

(I’d like to note again that, in order to improve hash table performance, we can’t reuse hash values across Set and Dictionary instances (or even during the lifetime of a single collection instance). So caching full raw hash values would rarely be helpful for these collections.)

This is a remarkably confident assertion. It’s difficult to correctly reason about performance without actually comparing benchmarks. I strongly suspect you may be overestimating the cost of passing around an inout struct, and underestimating the cost of allocating/deallocating a class instance or of passing around a protocol existential. Performance considerations weren’t the only reason we chose the design we proposed, but they were indeed quite significant – and the hundreds of benchmarks I had run over the past few months were decidedly in favor of an interface that uses inout structs.

3 Likes

Er, I don’t quite follow. What exactly is the problem with having, say, an instance of Set cache its own hash value? (Or more accurately, cache its “prehash”, namely the xor of the hash values of its elements.)

Any mutation would clear the cached hash so it can never become stale, and I doubt that setting an optional integer to nil has any significant performance implications relative to mutating a set.

The prehash is only used in the set’s own hash(into: Hasher) implementation, so it never “leaks”. Instantiating a new set using the elements of an existing set doesn’t look at either set’s prehash.

Storing one set into another calls the inner set’s hash(into:) method with the outer set’s hasher, exactly as expected. And instantiating a new set using the elements of the outer set involves calling the inner set’s hash(into:) method with the new set’s hasher. So the inner set’s cached prehash gets reused successfully, while the outer sets are free to maintain independent hashers.

• • •

Edit: Scratch all that. We can’t cache hashes (or even prehashes) because the elements could be classes, which can be mutated without it being “mutating”.

• • •

Edit 2: Wait, if a class instance can have its hash value change out from under a Set, that poses larger problems even without caching. How do we handle sets where the element is a class today?

• • •

Edit 3: It appears we do not handle that situation today. The set claims not to contain the mutated class instance, although it does show up during iteration.
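A small sketch reproduces the situation (the class and names are mine):

```swift
final class Box: Hashable {
    var value: Int
    init(_ value: Int) { self.value = value }

    static func == (lhs: Box, rhs: Box) -> Bool {
        return lhs.value == rhs.value
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(value)
    }
}

var set: Set<Box> = []
let box = Box(1)
set.insert(box)
box.value = 2   // mutated behind the set's back

// Lookup now hashes the new value and may well probe the wrong bucket,
// so membership can fail...
print(set.contains(box))
// ...yet the instance is still reachable by iterating the storage.
print(set.contains { $0 === box })
```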

Hold your horses! :wink: I was talking about the specific idea of reusing the raw hash values across Set/Dictionary instances, not about hash caching in general. Hash tables must choose a new seed when they resize themselves, to prevent unacceptable performance regressions. (This is unrelated to collision attacks.)

Using the memoized result of a secondary hash function to reduce the number of bytes fed into Hasher in hash(into:) is a perfectly valid optimization. (That being said, it does have some fundamental correctness tradeoffs, and I believe it should only be used sparingly, and with copious amounts of justification in the form of before/after profiling data.)
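As an illustration of that optimization (a hedged sketch; BigDocument and its members are made-up names), the cached value must be invalidated on every mutation:

```swift
// Cache an expensive "prehash" of the contents, invalidate it on
// mutation, and feed only the cached value into the per-instance Hasher.
final class BigDocument: Hashable {
    private(set) var contents: [String]
    private var cachedPrehash: Int?

    init(contents: [String]) { self.contents = contents }

    func append(_ line: String) {
        contents.append(line)
        cachedPrehash = nil  // every mutation must clear the cache
    }

    private var prehash: Int {
        if let cached = cachedPrehash { return cached }
        var secondary = Hasher()
        for line in contents { secondary.combine(line) }
        let result = secondary.finalize()
        cachedPrehash = result
        return result
    }

    static func == (lhs: BigDocument, rhs: BigDocument) -> Bool {
        return lhs.contents == rhs.contents
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(prehash)  // far fewer bytes than hashing contents directly
    }
}
```

The tradeoff mentioned above: equal documents still hash equal (both feed the same prehash), but overall hash quality now depends on the secondary hash, and the cached value is only stable within a single process run.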

There is nothing wrong with that, except it would needlessly slow down mutations (especially bulk removals) for Sets that will never be hashed – which I expect to be the usual case. Much less importantly, it also interferes with using Sets as keys in advanced hashing scenarios, like Bloom filters.

Keeping an up-to-date hash cache for Sets is still an important optimization; it’s just probably better done outside of the standard library. (But if the performance tradeoff seems worth it, the standard Set may end up doing it too. This is an implementation detail that is independent of the proposal we’re discussing.)

No, it’s a serious programming error to do that.

While this is an interesting topic, it too seems unrelated to the proposal at hand.

Cocoa handles this by defining the NSCopying protocol, and making a copy of all keys inserted into NSSet/NSDictionary instances. Foundation differentiates between mutable/immutable classes, where a copy of a mutable class gets to be immutable. This works fine within types defined in Foundation, but it’s just a coding convention, not a language-enforced rule. You’re free to ignore this convention when you’re defining your own NSObject subclasses, and then you’ll run into the same problem.

Swift chose to handle the same issue by emphasizing value types over reference types. Value types are literally impossible to mutate while they’re stored in Set/Dictionary's hash table. Notably, this is enforced by the type system of the language itself, rather than just a convention.

1 Like

Okay, that makes sense.

True.

The same issue arises for a struct that contains a class instance. If the class instance is visible anywhere outside the struct (either through another reference, or even as a read-only property on the struct itself), the hash value of the struct can be changed out from under a set it is in. This applies even if the struct uses copy-on-write, because no write to the struct is occurring.

And yes, this is not related to the subject of the current thread, except that it came to mind while discussing cached hashes.

1 Like

This review is focused only on source compatibility. I read that section in depth and skimmed the rest of the proposal. I did not read all of the review thread or previous discussion.

When upgrading to Swift 4.2, Hashable types written for earlier versions of Swift will need to be migrated to implement hash(into:) instead of hashValue. There are two options for doing this:

IIUC, we will synthesize the new hash(into:) based on the user’s hashValue implementation when using the pre-4.2 language mode. Why not do the same in 4.2 and deprecate hashValue instead of forcing users to update it immediately? A warning combined with excellent fixits is a great way to get everyone to migrate over time, but it’s painful for it to become an error in the same release that the new functionality is added. The bar for making it an error in 4.2 should be that we can migrate it fully automatically, but…

The compiler should simplify the migration process by providing fix-its for both options. A fix-it to remove hashValue should only be provided if automatic synthesis is available. For the second option, it would be nice to have the compiler suggest a full implementation of hash(into:), but a template would probably suffice:

… I believe this is not sufficient for automatic migration. We will not be able to ensure that the behaviour of the program doesn’t change with either of these options. I think we really need the user to make the decision to change their hashing.
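For what it’s worth, the kind of template a fix-it might offer could look like this sketch (Point, x, and y are made-up names; note that the new implementation produces different hash values than the old xor, which is precisely why the change is not behavior-preserving):

```swift
struct Point: Hashable {
    var x: Int
    var y: Int

    // Before (Swift 4.1 and earlier):
    // var hashValue: Int { return x.hashValue ^ y.hashValue }

    // After (SE-0206). Old xor collisions such as (1, 2) vs (2, 1)
    // disappear, so observable behavior (e.g. iteration order) can change:
    func hash(into hasher: inout Hasher) {
        hasher.combine(x)
        hasher.combine(y)
    }
}
```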

1 Like