[Accepted] SE-0206: Hashable Enhancements

lorentey · April 9, 2018, 2:49pm

I found it desirable to provide full control over the byte sequence that is fed to the hasher. Having a variety of overloads for all kinds of integer widths enables users to feed exactly as many bytes to the hasher as they have available, in as few combine calls as possible. (However, it's not necessary to provide overloads for both unsigned and signed variants; it's safe to keep only one flavor.)

I agree! I would be nice to mark most combine(bits:) overloads as @usableFromInline internal ; however, in the current implementation, combine(42 as UInt8) and combine(bits: 42 as UInt8) mean different things. To unify them, we first have to resolve the question of whether it's okay to allow

(42 as UInt8).hashValue != (42 as UInt64).hashValue

even though FixedWidthInteger's heterogeneous == operator makes

(42 as UInt8) == (42 as UInt64)

My opinion is that we should allow this, because Hashable doesn't (and shouldn't) set constraints on hashing behavior across distinct types. (The == above doesn't mean that UInt8 is "Equatable" to UInt64.) However, we haven't really discussed this aspect of hashing yet, and there may be good reasons to make integer types hash the same. (The subtleties of AnyHashable and NSNumber semantics add some extra challenges to width-dependent hashes, but these seem tractable.)

Note that the combine(bits:) overload that takes a UnsafeRawBufferPointer is special; we'd want to keep that even if we remove the integer overloads. UnsafeRawBufferPointer intentionally doesn't implement Hashable.

It may also be possible to come up with a better name than combine(bits:). Suggestions are welcome!

lorentey · April 9, 2018, 3:08pm

Ah, as in Int(finalizing: &hasher)? That's an interesting idea. It looks slightly weird in use, but I think I could get used to it:

func hashValue<Key: Hashable>(for key: Key) -> Int {
  var hasher = Hasher()
  hasher.append(key)
  return Int(finalizing: &hasher)
}

I feel like it's similar to Nate's suggestion to make hash(into:) static, but here the objection about overridability doesn't apply.

Would this really fix the issue, though? For example, does it make the memoization bug below easier to spot?

class Foo: Hashable {
  ...
  var _hashCache: Int? 
  func hash(into hasher: inout Hasher) { // (Note: not threadsafe)
    if let hash = _hashCache { hasher.combine(hash); return }
    var h = Hasher()
    h.combine(a)
    h.combine(b)
    h.combine(c)
    let hash = Int(finalizing: &hasher)
    _hashCache = hash
    hasher.combine(hash)
  }
}

davedelong · April 9, 2018, 3:54pm

(disclaimer: I'm not anything closing to a hashing expert, and it's fascinating, humbling, and educational to read all of the great commentary on this thread)

One thing I'm picking up on is the safety of the hasher in preventing out-of-band combine calls after you've supposedly finished your group of relevant values.

You can easily solve this using an intermediate type:

class Hasher { // or whatever
    
    class HashCombiner {
        private init() // this can only be created by `Hasher`
        func combine(_ whatever: Thing) // haven't been following the types
    }

    func grouping(_ grouping: (HashCombiner) -> Void)
}

Then you'd use this like so:

func hash(into hasher: inout Hasher) {
    var h = Hasher()
    h.grouping {
        $0.combine(a)
        $0.combine(b)
        $0.combine(c)
    }
    // the hasher is automatically finalized when the group ends
    // outside of the `grouping` class, `Hasher`'s "generatedHash` property is always valid
}

benrimmington · April 9, 2018, 5:10pm

I'm not sure if it makes any difference. The code might be improved by:

Parameter name shadowing

extension Foo: Hashable {

  public func hash(into hasher: inout Hasher) {
    if _hashValue == nil {
      var hasher = Hasher()
      hasher.combine(a)
      hasher.combine(b)
      hasher.combine(c)
      _hashValue = Int(finalizing: &hasher)
    }
    hasher.combine(_hashValue!)
  }
}

Using a computed property

extension Foo: Hashable {

  // NOT deprecated!
  public var hashValue: Int {
    if _hashValue == nil {
      var hasher = Hasher()
      hasher.combine(a)
      hasher.combine(b)
      hasher.combine(c)
      _hashValue = Int(finalizing: &hasher)
    }
    return _hashValue!
  }

  public func hash(into hasher: inout Hasher) {
    hasher.combine(hashValue)
  }
}

If a hasher will usually be initialized and finalized in the same block, then here's a variant of Dave's idea:

extension Hasher {

  public static func finalizing(
    _ body: (inout Hasher) throws -> Void
  ) rethrows -> Int {
    var hasher = Hasher()
    try body(&hasher)
    return hasher._finalize()
  }
}

extension Foo: Hashable {

  public func hash(into hasher: inout Hasher) {
    _hashValue = _hashValue ?? Hasher.finalizing {
      $0.combine(a)
      $0.combine(b)
      $0.combine(c)
    }
    hasher.combine(_hashValue!)
  }
}

GarthSnyder · April 9, 2018, 6:43pm

lorentey:

It would be nice to mark most combine(bits:) overloads as @usableFromInline internal; however, in the current implementation, combine(42 as UInt8) and combine(bits: 42 as UInt8) mean different things. To unify them, we first have to resolve the question of whether it’s okay to allow
(42 as UInt8).hashValue != (42 as UInt64).hashValue

I have nothing more to add to what you've already written here, but since it sounds like you're looking for feedback on this point: Yes, what you suggest sounds completely reasonable. Those hash values can be different.

The fact that there's some free-floating implementation of == out there in the world that is willing to compare UInt8 and UInt64 is not relevant. The only firm rule should be that entities that are == in terms of Equatable (which I believe requires ==(Self, Self)) must have the same hash value. As you said.

And as long as I'm in voting mode... I don't think there's much wrong with Hasher as it currently stands in the proposal. The suggested contortions that attempt to separate finalize() from the rest of the API (by moving it into Int or by adding separate sub-objects or sub-protocols, etc.) just muddy the API in an effort to solve a problem that doesn't actually exist.

The scheme is perfectly simple and understandable just as it is. Nobody is going to be confused. And if they are confused, it will probably be in the direction of thinking they need to finalize a Hasher that someone else has handed to them. But that will cause their code to crash when the actual owner tries to finalize; they'll fix it and never make that mistake again.

benrimmington · April 10, 2018, 7:33am

I'd prefer:

hash instead of combine
bytes: (or no argument label) instead of bits:

public struct Hasher {

  public init()

  @inlinable
  public mutating func hash<H: Hashable>(_ value: H)

  public mutating func hash(_ bytes: UnsafeRawBufferPointer)

  public mutating func finalize() -> Int

  @inlinable
  public static func finalizing(
    _ body: (inout Hasher) throws -> Void
  ) rethrows -> Int
}

I don't think it's worth hiding the finalize() method, if you can detect when a hasher has been invalidated.

gwendal.roue · April 10, 2018, 8:29am

combine is a term of art, isn't it? But since some have asked if order was important (and it is), hasher.append(bits:) may please some people.

lorentey · April 10, 2018, 11:56am

I don't think combine is a term of art! We're free to choose whatever term we like.

append was the original name of these methods in drafts of the proposal (you can still see it in the current stdlib codebase). We switched to combine for the proposal it to make it clear that this is not a collection operation.

However, combine sounds like the operation might be order-independent, and, based on comments on this thread, this is evidently a major source of (initial) confusion. It's worth revisiting our naming choices to see if we can fix this.

append made it obvious that Hasher is sensitive to ordering, so reverting to it is one option. Hasher defines what is essentially a streaming interface; this suggests we should also consider write.

How about feed? It's not confusable with any collection or stream operation, but I think it still implies that order matters. The implication is not as strong as with append or write, but it may just be enough to prevent confusion.

func hash(into hasher: inout Hasher) {
  hasher.feed(foo)
  hasher.feed(bar)
}

It reads pretty well, I think. In retrospect, combine seems too abstract and noncommittal about what actually happens to the data; feed paints a clear mental picture.

I find hash looks repetitive in use. I really like bytes for the URBP overload!

gwendal.roue · April 10, 2018, 12:20pm

I'd reuse an existing method name, if possible:

We won't use Hasher very often, because the compiler will often generate the hashing code for us. But when we want to use a Hasher, it'd be cool if we had not to spend too much energy remembering what strange synonym of push or append or write or hash or accept has been chosen :-) I admit that feed sounds very foreign to me in Swift, and I'll never remember it (because I'm lazy).

I suggest that we don't invent a new concept and word, but instead build on top of existing concepts and vocabulary. How do you want people to see Hasher? Should you have to write the introduction chapter, would you present it as an "append-only collection"? As a "stream"?

lorentey · April 10, 2018, 2:12pm

Hm. Hasher is neither a collection nor an output stream; it's neither a queue nor a socket. I think it's desirable to name things so that it won't be confused with any of these. (C.f. SE-0187)

"Feeding data to the hash function" is a relatively standard turn of phrase in discussions about hashing. The proposal already uses "feed" in this sense in the docs:

/// Feed `value` to this hasher, mixing its essential parts into
/// the hasher state.

Admittedly I'm not a native English speaker, but the use of "feed" here seems fine to me. Does it sound foreign? If it doesn't, what makes it a bad choice as an API name? (Grammatically it seems to work the same as write or append. I can write a chapter in a book; I can append an item to my todo list; can I feed a treat to my dog?)

gwendal.roue · April 10, 2018, 2:28pm

feed would not be a bad name, don't worry. Let's just wait for more input, maybe from native speakers unlike you and me ;-) ?

bzamayo · April 10, 2018, 3:02pm

feed works grammatically but it's a weird name for the API IMO. combine or append is fine. I think append does also hint at the fact that the order of inputs matter.

I don't think we should have a special type or protocol just to hide finalize from the implementors of hash(into:). The minor improvement to elegance isn't really worth the additional standard library surface.

felix91gr · April 10, 2018, 3:40pm

I like feed.

It conveys a lasting mental image; I feel people should tend to remember it more.

It doesn’t interfere with the concept of append,

And it still conveys the idea that feeding is order-dependent.

lancep · April 10, 2018, 5:20pm

IMO, append is the best option. Hasher does not conform to any of the collection protocols, and I think the benefits of conveying that the order matters outweighs the potential confusing Hasher with a collection

QuinceyMorris · April 10, 2018, 5:49pm

An objection to the verb feed is that it makes the hasher the object of the action, not the subject, as is usual for method verbs.

An objection to append is that it's unclear what (if anything) is actually being appended to, in the sense of being made longer. (What's being appended is additional algorithmic steps, which is a bit obscure.)

If you like the mental picture of "feed", then I'd suggest consume or ingest. Both mean approximately the same as "feed on".

ingest in particular has the connotation of an industrial or biological process that takes in raw materials and transforms them. It's related to the computer-science noun "digest", which (unless I'm mistaken) is what the hash result actually is.

lorentey · April 10, 2018, 7:50pm

Oh I see!

I read array.append(42) as the command "append 42 to array". Therefore hasher.feed(foo) looks perfectly fine to me, because I read it as "feed foo to hasher". (The subject is left unspecified in my interpretation for these expressions.)

However, array.append(42) can also be read as "hey array, append 42". And "hey hasher, feed foo" is indeed super weird. Feed foo with what?

Is this a fair assessment of the objection? If so, then this clearly makes feed a bad choice.

I think consume would work, except "consume" already has a conflicting meaning in the ownership model. hasher.consume(foo) could be misunderstood to mean that the ownership of foo passes to hasher, or foo is destroyed by hasher. (Arguably this is also true for feed and ingest, although not to the same extent.)

ingest has the same meaning, but I think it goes too far in the other direction -- it seems too idiosyncratic to me. I can't find any references that use it in the context of hash functions. Maybe I'm being a hypocrite, though -- I don't think I have seen combine used to describe this operation, either.

Libraries for cryptographic hash functions often use update for the operation we're trying to name here. Hasher isn't suitable for cryptography, but the mechanics are similar, and perhaps we should just borrow the same terminology.

This would leave Hasher with the following public API:

public struct Hasher {
  public init()
  @inlinable public mutating func update<H: Hashable>(_ value: H) {
    value.hash(into: &self)
  }
  public mutating func update(bytes buffer: UnsafeRawBufferPointer)
  public mutating func finalize() -> Int
}

Unfortunately, update seems almost as noncommittal about order sensitivity as combine. It is arguably a term of art, but I don't think it's a particularly good name.

davedelong · April 10, 2018, 8:05pm

I'm not a fan of the stateful design of Hasher; i think the subtype i mentioned above will clarify the intention:

public struct Hasher {
    public class Combiner {
        private init()
        public func update(bytes buffer: UnsafeRawBufferPointer)
    }

    public init()
    public var hashedValue: Int
    public mutating func update(with: (Combiner) -> Void)
    public mutating func update<H: Hashable>(_ value: H) // convenience for calling the update(with:) method
}

This would make it impossible to forget to finalize(), since that would happen automatically after calling the update block. It would also mean that the computed hash value would always be accessible without having to call finalize().

lorentey · April 10, 2018, 8:08pm

I haven't seen this mentioned so far as a practical problem. Why do you think we need to solve it?

GarthSnyder · April 10, 2018, 8:12pm

feed is probably not the best choice here.

As an English verb, it means roughly "put something into"; it requires a preposition or other clarification unless the direct object is the entity being fed. So, for example, you can "feed the dog" (direct object is the recipient) but you can't just "feed an apple" or "feed an integer." You have to "feed an integer to the hasher" or "feed the hasher an integer" or "feed on an integer."

In addition, as @QuinceyMorris noted, the thing being fed can't be the subject of the phrase. So

hasher.feed(anInt)

is weird because the hasher isn't feeding anything; it's being fed.

append seems to me to imply that objects are being accumulated into a to-be-hashed list, which is potentially misleading. And I'm not sure I see the implication of order-dependency; yes, there is an implication that the hypothetical to-be-hashed list would be hashed in order, but that doesn't necessarily mean that the hash value is order-dependent.

consume implies disposal of the object being consumed, which is a bit odd.

I agree with @lorentey that hash is too overloaded, although otherwise it would be ideal. Too many hashers and hash values flying around already.

ingest and digest have a strangely biological ring, but ingest actually isn't bad. I like ingest more than digest because the latter is too easily confused with a noun.

Other possibilities along these same lines: absorb, integrate, mix.

All things considered, I think I still like combine as the best compromise.

Edit: update seems off to me as well, even if it is the standard term. It means "to bring up to date," but this API doesn't have anything to do with time or with staleness.

QuinceyMorris · April 10, 2018, 8:25pm

Yes, and it could be "Feed foo to what?". It's not really meaningful until a preposition appears.

It's a reasonable objection. It certainly would be a word stolen from a different discipline.

Another direction to go is with a word like complexify or perturb or even randomize that more accurately describes what the operation is doing. (Edit: To read right, these would probably need a with: keyword on the parameter.)

I'm OK with combine too. I guess one question would be: if someone misunderstood that the combine was order-dependent, what exactly would go wrong with the usability of Hasher?