[Accepted] SE-0206: Hashable Enhancements

The review of SE-0206: Hashable Enhancements begins now and runs through April 13th, 2018.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to avoid this forum or just keep your feedback private, directly to me via email, twitter, or direct message. If emailing me directly, please put “SE-0206: Hashable Enhancements” in the subject line. Your feedback is equally valuable to us no matter how we receive it.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

  • What is your evaluation of the proposal?

  • Is the problem being addressed significant enough to warrant a change to Swift?

  • Does this proposal fit well with the feel and direction of Swift?

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thanks,
Joe Groff
Review Manager

13 Likes

I am so excited to see this in review so swiftly (ha). It addresses a very real pain point for users of the language.

The solution put forward, with a small dash of compiler magic, is a pragmatic way to preserve backwards compatibility for a design that is otherwise very much idiomatic Swift at its best. The feasibility of this design has already been proved in its being used for conditional Hashable conformance (in underscored form). I enthusiastically support the design exactly as proposed over the alternatives considered.

As to how it compares with features in other languages, a comparison with Rust is hard to avoid. Here, the decision to support a single Hasher is supremely sensible, given the proposal's premise that hashers are hard to write; if I understand correctly, a user can always implement hashValue themselves to call a custom hasher should any issue arise that necessitates it.

I glanced at the final version of the proposal but studied the pitch in great detail and participated in discussion of the implementation of the original, underscored design.

8 Likes

Wow. I love this design! It’s much more ergonomic and less error-prone :slight_smile:

I really like what you did with the combine technique. This is the first time I’ve seen such a simple approach to hashing. It lowers the entry barrier, and makes it much, much more mistake-proof.

What is your evaluation of the proposal?

12/10. I think it’s well-written and very thought through. I find the design great; I’ve never seen anything like it in a language like Swift.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes. This very much lowers the barrier of entry of Hashable from Advanced to Beginner.

Does this proposal fit well with the feel and direction of Swift?

It does, better than I could’ve imagined.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

I’ve only used languages that didn’t have Hashable kind of protocols / interfaces (like C), or that have hashValue kind of implementations (like C#).

This proposal is much better than what I’ve used there.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Quick reading. I skipped from Detailed Design onwards.

3 Likes

Hey there! This API appears to be a simplification of the one Rust uses:

https://doc.rust-lang.org/core/hash/trait.BuildHasher.html
https://doc.rust-lang.org/core/hash/trait.Hasher.html <-- Hasher from this proposal
https://doc.rust-lang.org/core/hash/trait.Hash.html <-- Hashable from this proposal
https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.with_hasher

As such, I have some notes from the Rust community's experience using these APIs (I am of the opinion that we made a couple mistakes). Context: I've spent a considerable amount of time working on Rust's HashMap APIs, and briefly worked on Swift's Dictionary implementation (my greatest legacy is all the sick ASCII diagrams explaining Dictionary's architecture).

(minor disclaimer: I skimmed some parts of the proposal because I just wanted to make sure you hadn't addressed the issues I'm about to raise; I apologize if they were addressed but I missed them! Also I am lacking some citations because discourse only lets me have 2 links in a comment :upside_down_face: thanks ben!)

First off, a minor point:

Note that the custom seed is mixed with the default seed, so specifying one doesn't disable nondeterministic hashing. The algorithm implemented by Hasher is not part of its API, and enforcing randomization makes it harder for Swift programs to accidentally rely on any specific algorithm. Any such dependency would make it more difficult for the standard library to change Hasher in future releases.

I would implore you to consider a way to request deterministic hashing (which the custom seed API is perfect for). Determinism is strongly desirable for developers to be able to use Dictionary in contexts where repeatability is valuable. For instance the Rust compiler, which is written in Rust, is basically a giant HashMap benchmark suite. However it only uses deterministic hashing because binary reproducibility is very important. Yes, programs linked against different versions of Swift will have different results, but needing the same toolchain to get the same results is a reasonable constraint.

As long as determinism is an opt-in, you shouldn't really need to worry about people becoming over-reliant on the algorithm. Rust was able to migrate its default hashing algorithm from SipHash 2-4 to SipHash 1-3 painlessly. Although Rust also lets users completely override/specify the hashing algorithm used by HashMap (as explicitly and ~reasonably rejected by this proposal given swift's goals), so maybe that relieved some pressure for us?


The really major issue I would like to address is oneshot hashing. While the proposed API is very good for safely and quickly composing Hashable implementations, it isn't well-suited for the very common case of "actually I'm just one thing, no need to compose". There are two major reasons to care about this: it forces unnecessary mixin values, and it massively penalizes Google's hashing algorithms (making them ~worthless).

On Mixins

To avoid HashDOS attacks, one must ensure that composite types do not represent a vector for creating collision families. This is very easy to do accidentally (pardon my swift, it's been a while):

struct Pair<T> { var first: T; var second: T; } 

extension Pair: Hashable where T: Hashable {
  func hash(into hasher: inout Hasher) {
    hasher.combine(first)
    hasher.combine(second)
  }
}

extension String: Hashable {
  func hash(into hasher: inout Hasher) {
    // Illustrative: feed the string's raw UTF-8 bytes to the hasher
    // via the proposal's raw-buffer overload.
    Array(self.utf8).withUnsafeBytes {
      hasher.combine(bits: $0)
    }
  }
}

These two very reasonable-seeming Hashable implementations create a collision family that can be abused by attackers; all of the following have the same hash!

Pair(first: "cat", second: "")
Pair(first: "ca", second: "t")
Pair(first: "c", second: "at")
Pair(first: "", second: "cat")

To prevent this problem, all collections must hash in a mixin value that distinguishes the end of the collection. The most common solution is to hash your own size:

extension String: Hashable {
  func hash(into hasher: inout Hasher) {
    Array(self.utf8).withUnsafeBytes {
      hasher.combine(bits: $0)
    }
    hasher.combine(self.count)  // length mixin marks the end of the collection
  }
}

Now ("ca", "t") is different from ("c", "at") because it becomes ca-2-t-1 vs c-1-at-2. Ok, great! It's a minor thing that collection implementors need to know, but otherwise not a big deal. Except we've now penalized the incredibly common case of Dictionary<String, _>, which needs no such mixin for correctness!
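To make the collision-family argument concrete, here is a toy model (an illustration only, not the proposal's actual Hasher) that treats each combine call as appending bytes to an input stream:

```swift
// Toy model of the hasher's input: each combine appends bytes to a
// stream; two values that feed identical streams must hash identically.
func inputStream(_ parts: [String], lengthMixin: Bool) -> [UInt8] {
    var bytes: [UInt8] = []
    for part in parts {
        bytes.append(contentsOf: Array(part.utf8))
        if lengthMixin {
            bytes.append(UInt8(part.count))  // the "hash your own size" fix
        }
    }
    return bytes
}

// Without the mixin, ("ca", "t") and ("c", "at") feed identical streams:
assert(inputStream(["ca", "t"], lengthMixin: false)
    == inputStream(["c", "at"], lengthMixin: false))

// With the mixin (ca-2-t-1 vs c-1-at-2), the streams differ:
assert(inputStream(["ca", "t"], lengthMixin: true)
    != inputStream(["c", "at"], lengthMixin: true))
```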

On Google Hashes (CityHash and FarmHash)

Google's hashing algorithms are actually just families of hashing algorithms that are keyed off of the length of the input. Specifically they've just chosen whatever hashing algorithm is fastest for a given length and piecewise glued them together. This means hashing literally cannot be performed until the entire input is accumulated.

As a result, the Rust implementations of these algorithms that conform to the Hasher API must accumulate the entire input in an Array waiting for finalize to be called. This is incredibly bad for performance, and makes them worse than everything in my experiments.

This is, again, in spite of the fact that most hash keys do not require any composition. String, Array<UInt8>, UInt64, etc do not require any composition when used as keys, and could be fed directly into City/FarmHash without any buffering. But the API cannot express this.

A Solution: OneShot hashing

The solution to this is fairly straight-forward: augment the Hashable protocol and Hasher struct with another default method:

protocol Hashable {
  func hashOnly(into hasher: inout Hasher) -> Int
}

extension Hashable {
  func hashOnly(into hasher: inout Hasher) -> Int {
    self.hash(into: &hasher)
    return hasher.finalize()
  }
}

extension Hasher {
  mutating func hashOnly(bits buffer: UnsafeRawBufferPointer) -> Int {
    // optimized impl that doesn't need to buffer anything
  }
}

key.hashOnly(into: &hasher) would be the only method that Dictionary ever calls. Types like String, Array, and the primitive types would just pass themselves to hasher.hashOnly(bits:) in their specializations. Swift meanwhile can optimize Hasher.hashOnly to its heart's content (potentially only valuable if a Google algorithm is used?).

There is however one drawback with this design: hashing a String via hash(into:) and via hashOnly(into:) will then produce different results. This potentially represents a footgun for implementors of hash(into:) who are wrapping these types.


Also while I'm here I might as well mention that the desire to configure and mess with hashing is so great in the Rust community that I am currently in the process of adding yet another api to HashMap to support completely arbitrary nonsense.

23 Likes

Despite getting dunked on yet again for the GridPoint hash value example, I love the improvement that this proposal brings on so many fronts. One consideration I’d like to ask about, since we’re moving away from hashValue: could the hash(into:) requirement be moved from being an instance method, where it clogs autocomplete and generally is in the way, to a static method? It should basically never be used by anyone who isn’t implementing a hashed collection type. Would the following version interfere with any of the performance improvements outlined in the proposal?

protocol Hashable {
    static func hash(_ value: Self, into hasher: inout Hasher)
}

The example of manually conforming to Hashable in the documentation would translate to this:

extension GridPoint: Hashable {
    static func hash(_ value: GridPoint, into hasher: inout Hasher) {
        hasher.combine(value.x)
        hasher.combine(value.y)
    }

    static func == (lhs: GridPoint, rhs: GridPoint) -> Bool {
        return lhs.x == rhs.x && lhs.y == rhs.y
    }
}

I can spiel out some theoretical justifications for this (e.g. it matches the operator declarations levels for Equatable). But really I just don’t want 5 or [1, 2, 3] or "goofus" to have a hash(into:) method that no one should ever use appearing in the autocomplete list. (I ride the same hobbyhorse about description, debugDescription, customMirror, etc, but those APIs aren’t up for review.)

9 Likes

It is a good proposal. I have two points to make. The first one is just an echo of what Alexis wrote: the proposal needs to be extended to allow high-performance implementations. (Hashing speed matters.) I would urge you, before you accept this, to make sure you have a proposal that allows high-speed hashing. It is no big deal if the first implementation is not super fast, but it would be a shame to corner yourself into something that could never be high speed. My second point is that it should be clearly understood that this is not as general as it seems. Some standard algorithms can’t use this API. That’s kind of a minor non-blocking point, but one that should be known.

Let me elaborate:

  1. I think that Alexis makes a good point performance-wise. If you need to hash large objects (spanning many bytes), having to eat them word by word is like drinking from a straw. Here is some documentation:
  • Alakuijala et al., Fast keyed hash/pseudo-random function using SIMD multiply and permute
  • Ivanchykhin et al., Regular and almost universal hashing: an efficient implementation, Software: Practice and Experience 47 (10), 2017
  • Lemire et al., Faster 64-bit universal hashing using carry-less multiplications
    Journal of Cryptographic Engineering 6(3), 2016
  • Kaser et al., Strongly universal string hashing is fast, Computer Journal 57 (11), 2014
  • Chakraborty et al., A fast single-key two-level universal hash function, IACR Transactions on Symmetric Cryptology, 2017
  2. The approach being proposed seems to restrict the hash functions to the “iterated” type. The universality of iterated hashing is limited. But more critically, you can eat new content but you can’t release it, so it prohibits rolling hash functions. So if I want to use this for a Karp-Rabin search, I can’t. Given the string “abcdef”, if I want to hash all 4-grams (“abcd”, “bcde”, “cdef”…) I have to redo the computation from scratch for each substring.
3 Likes

Maybe it could work internally like this:

  1. Hasher gets created
  2. Every combine call adds a currentCount right after whatever was added. Like so: “ca”, “t” becomes “ca0t1” inside the Hasher

This, I think, should avoid collision attacks. Not 100% sure yet, but I may be able to go 1 step further if there’s a weakness in that algorithm.

But whatever is the implementation, that doesn’t matter for swift-evolution. It only matters that we prove it can be done in a way that is resistant to collision attacks.

1 Like

This massively penalizes everyone who uses this API. For instance, there’s no reason to add a mixin when hashing MyType { Int64, Bool }, nor is there any reason to add a mixin between every element in Array<SomeSimpleType>.

The relevance to Swift Evolution is that the proposed API constrains implementations and results in suboptimal hashing performance, which is really important.

3 Likes

Of course it might. But I’m not talking about a particular implementation here. I’m trying to show that you can solve the collision problem internally while keeping the exact same API. It doesn’t have to be done like I showed. It just has to be done somehow (and I’m sure there’s already a plan for this).

  • What is your evaluation of the proposal?

    Great addition. Would suggest two minor mods:

    1. Allow construction of a Hasher that is deterministic without having to set the environment variable. Don’t make this the default, but make it an option. Perhaps another seeded constructor or a default argument to the existing seeded constructor.

    2. Change Hasher into a protocol and provide Hashers as a struct of provided hashers:

      struct Hashers {
          typealias Default = Sip
          struct Sip: Hasher { ... }
      }
      

      This way the default can still be easily changed, but if someone needs to know the exact algorithm used they can specify it. It also allows for a ‘library’ of known good hash functions to be built up over time.

  • Is the problem being addressed significant enough to warrant a change to Swift?

    Yes

  • Does this proposal fit well with the feel and direction of Swift?

    Yes

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

    The proposed solution is more elegant than those in other languages, even where the other languages provide hash functions for the programmer to use. Congratulations on a well-thought-out proposal.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

    Followed the thread.

Did a quick skim this morning, but one thing that wasn’t immediately obvious: what if a developer wants to implement some other collection type, like Set or Dictionary, where the order of the items doesn’t matter?

Would they have to create a Hasher for each element, then mix all the finalize() results themselves in an order independent way, and then push that one value back into the Hasher they were told to use? Is this something worth calling out and/or providing better support for given the comments around performance and expense of finalize()?
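One possible shape for that (a sketch under the proposed API; the function name here is hypothetical) is exactly the pattern described: hash each element with its own fresh Hasher, fold the finalized values with a commutative operation like XOR, and feed the single result into the hasher the collection was given.

```swift
// Sketch: order-independent hashing for an unordered collection.
// Each element gets its own Hasher; the finalized values are folded
// with XOR, which is commutative, so element order cannot matter.
// Note this pays one finalize() per element, which is the cost the
// question raises.
func hashUnordered<C: Collection>(_ elements: C, into hasher: inout Hasher)
    where C.Element: Hashable
{
    var folded = 0
    for element in elements {
        var elementHasher = Hasher()
        elementHasher.combine(element)
        folded ^= elementHasher.finalize()
    }
    hasher.combine(folded)
}

// Order does not affect the result:
var h1 = Hasher()
hashUnordered([1, 2, 3], into: &h1)
var h2 = Hasher()
hashUnordered([3, 1, 2], into: &h2)
assert(h1.finalize() == h2.finalize())
```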

3 Likes

Thanks so much for taking the time to write this up! :blue_heart:

In earlier drafts of the proposal, Hasher.init(seed:) was deterministic. I changed this because I considered the risk of apps persisting hash values was too great to allow. With ABI stability, it becomes possible to link apps with future versions of the stdlib without recompiling them. Any app that depended on concrete values generated by Hasher would cause significant ABI compatibility issues that (in extreme cases) may even prevent us from updating the hash algorithm. The ability to safely evolve the algorithm is a critical part of this proposal, and I ranked it squarely above determinism.

It is possible to disable the hash seed through an environment variable, but that seems too blunt a hammer for production use. Plus, getting rid of the random hash seed doesn’t eliminate all nondeterminism: hash table ordering will still depend on allocation behavior, which happens to be repeatable now, but making this a part of the ABI contract seems like a big decision.

Where repeatability is a requirement, perhaps it’s enough to make it a policy to never iterate over the elements of a raw Set/Dictionary – instead, in these contexts, users can either wrap them into collections that preserve insertion order, or they can sort the keys into some strict ordering before iterating through them.

The proposal doesn’t include any API to control how Set and Dictionary initializes Hasher; the intention is to allow for unconditional per-instance seeding. But if there’s a clear use case, the stdlib could add Set/Dictionary initializers to control the hash seed.

It would be useful to have further input on this; if anyone reading this has a specific use case that requires deterministic Set/Dictionary ordering, please tell us about it!

This is great feedback – we completely ignored this issue in our current implementation, and we’ll need to fix this.

We’ll still be able to add new function requirements to existing protocols after the ABI is frozen, so if we want to switch to a hash algorithm like that, we’ll be able to add the hashOnly method later. It sounds like most of the types that would want to specialize top-level hashing would be in the stdlib; so we can even make the new requirement stdlib-internal. (Which would still make it a part of the ABI, but at least it wouldn’t confuse people.)

Optimizing hashing based on the length of the key is an interesting idea. I wonder if it can be done incrementally, by switching algorithms on the fly, rather than requiring an a priori count.

(My immediate plans for optimization are based on the size of the hash table: for tiny tables (say, less than a few dozen elements), an unsorted array works faster than pretty much any sort of hashing, so it makes sense to not do any hashing at all. This already covers most userInfo-style Dictionaries that are passed around through API boundaries. For tables less than a few hundred (maybe a thousand) elements, it may make sense to fall back to a less robust (but faster) hash algorithm.)

The wrappers will be distinct types, so I think it’s okay if they hash differently. (AnyHashable notwithstanding.)

This reminds me of an unresolved design issue with our integer types: should Int8 hash the same as Int64? (I see i8 and i64 do not hash the same in Rust, and this makes perfect sense to me. However, Swift’s integer types are directly comparable with ==, so it may be surprising if they compared the same, but produced different hashes. On the third hand, they are still distinct types, so arguably the operators are irrelevant.)

Yay for nonsense! Using closures to specialize (customize?) Equatable/Comparable/Hashable comes up from time to time here, too. It seems like supporting a raw (Key) -> Int closure for hashing would be a nice way to provide an escape hatch for advanced users in case the default industrial-strength hashing is inappropriate. This is not in scope for this proposal, but we’re definitely looking at it. (I think it can be done additively later, as long as Set/Dictionary can be made sufficiently resilient.)

3 Likes

I really like this proposal but I am just wondering why requiring a combine call for each property? Let’s say we want to combine the hash of 5-6 properties, it would look like this:

func hash(into hasher: inout Hasher) {
    hasher.combine(topLeft)
    hasher.combine(bottomRight)
    hasher.combine(bottomLeft)
    hasher.combine(topRight)
    hasher.combine(midBottom)
    hasher.combine(midTop)
}

Needless to say this code could be greatly reduced by having an overload that takes a Set or Array that would look like this (or by using keypaths):

func hash(into hasher: inout Hasher) {
    hasher.combine([topLeft, bottomRight, bottomLeft, topRight, midBottom, midTop])
}
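For what it’s worth, such an overload could be sketched as a plain extension (hypothetical, not part of the proposal); note that an array-literal call site still allocates the array, which is the overhead concern with this style:

```swift
// Hypothetical convenience overload (not in the proposal): combine
// every element of a sequence, equivalent to calling combine once
// per element in order.
extension Hasher {
    mutating func combine<S: Sequence>(contentsOf elements: S)
        where S.Element: Hashable
    {
        for element in elements {
            combine(element)
        }
    }
}

// Produces the same result as the element-by-element calls:
var sequential = Hasher()
sequential.combine(1)
sequential.combine(2)
sequential.combine(3)

var bulk = Hasher()
bulk.combine(contentsOf: [1, 2, 3])

assert(sequential.finalize() == bulk.finalize())
```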

Sorry! :pensive: I had to pick a sacrificial example, and the docs were the most obvious victim. To be fair, I think documenting how to implement hashValue in a paragraph or two is an impossible task; GridPoint made a valiant effort, which made it all the more delicious to rip it apart. :smiling_imp: It’s actually a great representative example for the top ~5% of hashValue implementations out there – I’ve seen far, far worse. (Including most of my own attempts.)

This is worth considering, especially if we keep the generic Hasher.combine(_:) – that function is a nice replacement for direct calls to hash(into:). But the natural way to define hash(into:) is to make it an instance method; do we really want to carve out a new API convention for these cases? We also need to ensure that we remain able to override hash(into:) in class hierarchies; making it a static requirement makes that more difficult.

I’d prefer to add some metadata attribute / doc string syntax that de-emphasizes rarely used API like hash(into:), encode(to:), init(arrayLiteral:), etc. in the docs and editor/autocomplete UI, but still allows them to be defined in the natural way. (I also think hash(into:) should still be listed in autocomplete menus, it should just not be in a prominent position. It may not be frequently called, but I’m sure it’ll still be used sometimes.)

It’s not required. A problem discussed when automatic Hashable conformance was being reviewed is that of having non-relevant properties be part of the hash input. If you have a Circle, let’s say, with:

  • float radius,
  • vec2 center,
  • float area,
  • float perimeter

Area and perimeter are not relevant because they are derived from the radius. The proposed combine technique allows you to only give it what’s relevant; that is, radius and center.
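Spelled out under the proposed API, that Circle might look like this (a sketch; the stored/derived split and property names are assumptions for illustration):

```swift
// Only the defining state (radius and center) feeds the hasher;
// area and perimeter are derived from it and are deliberately left
// out of both == and hash(into:).
struct Circle: Hashable {
    var radius: Float
    var centerX: Float
    var centerY: Float

    var area: Float { return .pi * radius * radius }
    var perimeter: Float { return 2 * .pi * radius }

    static func == (lhs: Circle, rhs: Circle) -> Bool {
        return lhs.radius == rhs.radius
            && lhs.centerX == rhs.centerX
            && lhs.centerY == rhs.centerY
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(radius)
        hasher.combine(centerX)
        hasher.combine(centerY)
    }
}

let a = Circle(radius: 2, centerX: 0, centerY: 0)
let b = Circle(radius: 2, centerX: 0, centerY: 0)
assert(a == b && a.hashValue == b.hashValue)
```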

I don’t know if that was your issue.

Regarding an overload that takes an array or list as argument: it could be good, but I think it reduces readability. It’s easier to miss something if everything’s on the same line.

Sure, I got that this combine method allows for more flexibility regarding which properties participate in the hash, but in some cases all properties except one might participate. In that case, I think it’d make sense to have a single method that takes the key paths of the properties which should (or shouldn’t) participate in the hash.
I guess many people will just copy/paste the combine calls and forget to change the property name, so they could end up with multiple combines for the same property (what will happen in that case from the compiler’s perspective?). If we used a Set of key paths instead, we could enforce a single combine per property.

1 Like

Note that this sort of comment might be taken as dismissing a commenter’s concerns. The Swift core team and engineers are humans just like you and me and not omniscient, so it’s reasonable to raise concerns about the design being proposed, especially ones like @Gankro’s that are based on real-world experience with a similar design. I know you probably didn’t intend any harm in your reply, just keep that in mind.

2 Likes

There is an overload of Hasher.combine(bits:) that takes an UnsafeRawBufferPointer. Do you think we need to provide more ways to hash bulk data?

Hash memoization and custom hash functions are still possible in the hash(into:) world, although they will incur the cost of feeding a single word to Hasher. (This is similar to how Swift 4.1’s Set and Dictionary unconditionally postprocess hash values with _mixInt; it’s just more explicit.)

class Foo: Hashable {
  // ...
  var _cachedHash: Int? = nil
  func hash(into hasher: inout Hasher) {
    if let hash = _cachedHash {
      hasher.combine(bits: hash)
      return
    }
    var hash = 0
    // Calculate hash however you like
    _cachedHash = hash
    hasher.combine(bits: hash)
  }
}

I think we should be careful not to overgeneralize Hashable; I consider it more as a support protocol for Set and Dictionary, rather than something that should cover all possible forms of hashing.

For example, custom seed values makes it possible to use Hashable types in e.g. some forms of Bloom filters. This is a nice bonus, but it is not why we added this feature – the proposal adds supports for custom seed values because this turned out to be a requirement for Set/Dictionary. These collections have no need for an uncombine operation, so Hasher doesn’t provide it – rolling hashes would needlessly restrict the pool of hash functions Hasher could implement.

For specialized/generalized hashing requirements, I think it is best to define separate protocol(s).

If there are many components, it may indeed make sense to stage them in local buffers rather than feeding them into the hasher one by one. The problem with constructing an Array or Set directly in hash(into:) is that the overhead of allocating storage for these collections may very well be larger than combine's call overhead. Keypaths currently have significant overhead on their own, too. (Edit: added “currently”)

I suspect that for six components, just calling combine six times will be competitive with most alternatives.

When buffering does turn out to be useful (e.g., for variable-length types), the combine(_:) overload that takes an UnsafeRawBufferPointer should cover the most general case – but we need more data! If you know a use case which would work better with something else, please let us know.

You’re completely right Joe.

I’m sorry @Gankro. Your concerns are very much valid. I didn’t mean otherwise.
