Combining hashes

I think the problem is mostly that there are so many good hash functions to choose from. After 2011's hash collision DoS scare, SipHash has become a popular choice; it was designed for exactly this purpose. AFAIK, it is still considered relatively secure in this niche, while also being fast enough for general use. Luckily, it's also already implemented in the stdlib.

SipHash's primary drawback is that it doesn't fit well into the classic hashValue API -- it is rather slow to repeatedly reinitialize/finalize brand new SipHash states in nested hashValue invocations.
(This is not unique to SipHash; it is also true for most/all other high-quality hashes.)

This can be solved by introducing an alternative hashing interface (like hash(into:) elsewhere in this thread). Depending on how often we expect users to roll their own hashes, that interface may not even need to be public.

I believe a high-quality hash function like SipHash will be able to generate good hash values regardless of the distribution of its input -- as long as we feed it all the relevant bits, it should produce hash values with a nice wide spread.

I think it's worth making a fun detour into the nitty-gritty details of the stdlib's hashed collections, to underline your point about hash tables being sensitive to hash value distributions:

<detour>
Interestingly, Int.hashValue returning self is not a particularly great choice for Swift's hashed collections, because it often produces long sequential chains of hash values. The hash table behind Dictionary and Set uses open addressing with linear probing, and such chains can cause disastrous performance problems, even though they may not initially contain any actual collisions.

Set and Dictionary work around this problem by not using the raw values returned by hashValue: they call _mixInt on every hash they generate. Mixing up the bits a little is enough to break these long chains into much shorter pieces, restoring hash table performance in this very common case.
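
For illustration, here is a sketch of the kind of mixer involved -- this is MurmurHash3's 64-bit finalizer, a close relative of the MurmurHash-derived mixing in _mixInt; the stdlib's exact constants and shifts may differ:

func mix(_ value: UInt64) -> UInt64 {
    // Alternating xor-shift and multiply rounds scatter nearby inputs
    // across the full 64-bit range. (Constants are MurmurHash3's fmix64.)
    var k = value
    k ^= k >> 33
    k = k &* 0xff51afd7ed558ccd
    k ^= k >> 33
    k = k &* 0xc4ceb9fe1a85ec53
    k ^= k >> 33
    return k
}

Sequential inputs like 0, 1, 2, ... come out thoroughly scattered, which is exactly what breaks the long probe chains into much shorter pieces.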

(Another way to solve this would be to switch to another probing method; however, linear probing's cache friendliness makes it difficult to replace it without significantly slowing down lookups into colliding buckets.)

So while Int.hashValue may be trivial, the actual hash function is much more complicated, to fix issues with distribution quality. (Switching to a high-quality hash function would eliminate the need for this mixing step.)

As a natural consequence of what it does, _mixInt (as well as any of the high-quality hash functions) introduces collisions that weren't in the original set of hash values. This is perfectly acceptable, but it can be surprising when you look at microbenchmarking results.

Another interesting observation is that mixing decouples the order of elements in the hash table from the numerical value of their original hash values, so it significantly reduces the expected performance of inserting/retrieving a sequential list of integer keys:

var set: Set<Int> = []
for i in 0..<1_000_000 {
    set.insert(i) // This doesn't have great locality of reference
}

This is rarely (if ever) relevant in practice, but it often pops up in naive benchmarks. (E.g., try measuring the loop above to compare the performance of Set<Int> to a hash table that does not reorder its elements, like NSSet or std::unordered_set<int64_t>. You may find Set performs slower on such artificial benchmarks, while on real-life workloads the opposite may often be true.)
</detour>

It's not nice to force users into an API they can't reasonably be expected to use correctly. I think hashValue before SE-0185 clearly fell into this category. It is possible (indeed, necessary) to implement a universally good, resilient hash function in the stdlib for synthesized hashing; and it seems quite sensible to me to provide this default hash function as a courtesy for users whose types synthesized hashing is not yet able to cover.

Sufficiently motivated, experienced developers should certainly be able to provide their own hash implementations, but the default hash should be good enough that even they wouldn't bother. A fully hand-rolled hashValue should be merely an option for optimization, rather than a requirement for correctness.

One problem I have with hashValue is that it is not easy to recognize a bad implementation. For example, I don't think it's feasible to write unit tests to verify that hashValue does not collide too often -- how many collisions are too many? Over what set of inputs? Similarly, I don't think we can put runtime warnings in the stdlib to catch badly colliding hashValues. We could provide debugging probes to determine the collision rate, but users would need to remember to look at them.
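
For instance, a debugging probe could estimate a collision rate over some sample of inputs. The helper below is hypothetical, not an existing stdlib facility, and it still begs the questions above: what rate is too high, and which inputs are representative?

func hashCollisionRate<T: Hashable>(_ samples: Set<T>) -> Double {
    // Distinct inputs mapping to fewer distinct hashes means collisions.
    guard !samples.isEmpty else { return 0 }
    let distinctHashes = Set(samples.map { $0.hashValue }).count
    return 1 - Double(distinctHashes) / Double(samples.count)
}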


From my own experience:

  • Synthesis works quite often. That's immensely great.
  • Hashable synthesis currently fails on types that contain arrays (Equatable synthesis works fine).
  • When some properties should be ignored by Hashable/Equatable synthesis, one can, for example, extract the "key" into a private struct with synthesized Hashable/Equatable conformance, and forward the ==/hashValue implementations to the inner key (see the sketch below).
  • I have one hashValue implementation that can't possibly be synthesized because the type has various inner implementations. This one would be very happy if Swift would provide a blessed way to combine hashes. Excuse the naive current implementation: https://github.com/groue/GRDB.swift/blob/2289ed9d95b16f8e852c8ba2cd0238b98f8d3fb5/GRDB/Core/Row.swift#L817-L820

(About the last example: the "inner private key" technique would help if Swift computed hashValues for arrays - even if performance would suffer because of the creation of the intermediate arrays)
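
A sketch of the technique from the third bullet, with a hypothetical Person type whose cachedDescription should not take part in equality or hashing:

import Foundation

struct Person: Hashable {
    // ==/hashValue are synthesized for this private key type.
    private struct Key: Hashable {
        var name: String
        var birthday: Date
    }

    var name: String
    var birthday: Date
    var cachedDescription: String? // ignored by ==/hashValue

    private var key: Key {
        return Key(name: name, birthday: birthday)
    }

    static func == (lhs: Person, rhs: Person) -> Bool {
        return lhs.key == rhs.key
    }

    var hashValue: Int {
        return key.hashValue
    }
}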


I'd like to provide my moral support for re-examining that proposal's ideas and improving the hashing APIs. @lorentey let me know if I can be of any help.

I think we should have Array conform to Hashable when its Element type does. In fact, many of the stdlib types could be candidates for conditional Hashable conformance:

  • ArraySlice
  • Slice (with caveat below)
  • Dictionary
  • EmptyCollection, CollectionOfOne
  • DictionaryLiteral
  • Range and the partial ranges
  • etc.

Slice is weird in that the straightforward way to calculate the hash by sequentially combining element hashes would not be correct for slices of unordered collections like Set and Dictionary.
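
For the ordered cases, a naive sketch of what such a conditional conformance could look like under the current hashValue API (the 31 &* combining step is illustrative only, with all the quality caveats discussed earlier in this thread):

extension Array: Hashable where Element: Hashable {
    public var hashValue: Int {
        // Classic sequential combining; a real implementation should
        // use a higher-quality combinator.
        var result = 0
        for element in self {
            result = 31 &* result &+ element.hashValue
        }
        return result
    }
}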

This seems like a good candidate for a proposal!

This is currently discussed at Let Optional, Dictionary and Array conditionally conform to Hashable - #19 by Morten_Bek_Ditlevsen. There is even a WIP implementation.


🤔 I think slices of sets and dictionaries probably need to be order-dependent when we compare them. Those two types are "substitutable" for their primary use cases when equal, but definitely not for slicing operations.


I do not understand how exposing one (or perhaps more in the future) hash combining functions is a bad thing. We are not eating soup, cooking soup, or participating in any other tortured metaphorical activities. We write code, and often we just need a reasonably unique hash value for a set of values. If Swift / the standard library has a way to produce one, it should be exposed to developers.

Also, there is no requirement for the hash to be cryptographically secure, and the method should be documented as such.

We do, however, want the default hash function to provide adequate protection against DoS attacks that try to flood hash tables with colliding keys. (This is especially the case in server environments, but there is no good reason not to just have it everywhere.)

Functions that do this well tend to dabble in some light cryptography; just enough to make attacks difficult, while still remaining fast enough to have performance comparable to, e.g., the MurmurHash derivative in the stdlib's _mixInt.

I know of two suitable families of algorithms, SipHash and HighwayHash. The former has the distinct advantage of already being in the stdlib. The latter is a recent contender that was designed to lend itself better to SIMD implementations. (This is a somewhat active research area; we can expect to see more algorithms in the future.)

Whatever algorithm we choose for Swift 5, we should retain the ability to replace it in future versions of Swift. The name of the algorithm shouldn't be part of the API, and the functions implementing it must not be inlined into user code (which includes synthesized hashing).

Thanks for the links. DoS attacks are far more of a danger on servers than on mobile/desktop apps, and even there only for some collections.

Assuming that the best "safe" hash algorithm is too slow, a future refinement to Dictionary and Set, for example, would be to specify an option for faster or safer hashing. This would, of course, require significant changes to the way hashing is done.

It would be reasonable to specify fast/safe hashing on types, which would be far easier to implement. Applying this recursively would be more difficult, and probably not significantly less difficult than specifying the hash algorithm at the point of use.

The added complexity may not make sense for server code, but for battery-operated devices, it may be worth it. This will probably become a consideration when Swift becomes a viable option for OS and other "high-performance" projects.

Plenty of other languages provide a hash combiner, and I have found it useful, so I think it would be a worthwhile addition. Ideally, it would be a protocol allowing seeding, appending, and finishing, plus implementations of the protocol for app use, server use, and cryptographic use.
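
A strawman sketch of that shape; every name here is hypothetical, not an existing API:

protocol HashCombiner {
    /// Creates a combiner whose output depends on the given seed.
    init(seed: UInt64)
    /// Mixes the given bits into the combiner's state.
    mutating func append(_ bits: UInt64)
    /// Consumes the state and produces the final hash.
    mutating func finish() -> UInt64
}

// Hypothetical conforming types for the different use cases:
// struct AppHasher: HashCombiner { ... }    // fast; non-adversarial input
// struct ServerHasher: HashCombiner { ... } // DoS-resistant (e.g. SipHash)
// struct CryptoHasher: HashCombiner { ... } // cryptographic strength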

I think you've hit the nail on the head here. I was writing specifically in response to the idea of making _mixInt and _combineHashValues available for public consumption within the context of the current design of Hashable and the current state of primitive hash values. I think we are in agreement here that this would not be good, either for encouraging correct implementations or for security.

On the other hand, what you've discussed would form an excellent basis for a revised protocol. If there's need for backwards compatibility, I think it should suffice to make your revised NewHashable (not actually what I'd propose as a name) a refinement of the classic Hashable with a default implementation for hashValue. If that's possible, it should permit us to tackle this design in its own time, whether before or after ABI stability.

As a step towards a better hashing API, the Standard Library now uses a randomly seeded, high-quality, resilient hash function.

I'm working towards extending this to synthesized Hashable implementations; once that is done, all hashing will be based on the new hash function and we'll be able to remove _mixInt and _combineHashValues from the standard library.

The _Hasher type and the new _hash(into:) requirement in Hashable seem like good candidates for a proposal to make them public: (Note that API names aren't final; they merely reflect the current code's choices.)

/// Represents Swift's standard hash function.
///
/// The hash function maps an arbitrary number of bytes to a single
/// fixed-size integer value (the hash), such that slight changes in
/// the input byte sequence will usually result in wildly different hashes.
///
/// The byte sequence to be hashed is fed to the function by a series
/// of `append` calls, followed by `finalize` to extract the hash value.
///
/// A hash function initialized with the same key will always map the
/// same byte sequence to the same hash value.
struct Hasher {
  /// Initializes a new hasher using a per-execution random key
  /// seeded at process startup.
  init()
  /// Initializes a new hasher using the specified key.
  init(key: (UInt64, UInt64))

  /// Feeds bits from `value` into the hasher.
  mutating func append<H: Hashable>(_ value: H)

  /// Feeds a specific sequence of bits to the hasher.
  mutating func append(bits: Int)
  mutating func append(bits: UInt64)
  mutating func append(bits: UInt32)
  // etc.

  /// Finalizes the hasher's state and extracts a hash value.
  mutating func finalize() -> UInt64
}

protocol Hashable: Equatable {
  var hashValue: Int { get }
  func hash(into hasher: inout Hasher)
}

Here is how you'd implement Hashable in new code:

struct BraveNewWorld: Hashable {
  let a: Int
  let b: String

  func hash(into hasher: inout Hasher) {
    hasher.append(a)
    hasher.append(b)
  }
  // hashValue is automatically synthesized by the compiler
}

Of course, existing code that defines hashValue would keep working, too: (Plus, there may be legitimate cases where hashValue is the most suitable API even if hash(into:) becomes available.)

struct LegacyCode: Hashable {
  let a: Int
  let b: String

  var hashValue: Int {
    return a.hashValue ^ b.hashValue // Not a great combinator
  }

  // hash(into:) is automatically synthesized by the compiler
}

Naturally, the most sensible choice would be to let the compiler do all the work:

struct Sensible: Hashable {
  let a: Int
  let b: String

  // Both hashValue and hash(into:) are synthesized by the compiler
}

I feel like this would be a much nicer way to approach hashing. What do you think?

14 Likes

I really like this approach! In the previous case, I guess the compiler would automatically synthesize hash(into:) when only hashValue is implemented, by calling append with the hashValue?

struct LegacyCode: Hashable {
  let a: Int
  let b: String

  var hashValue: Int {
    return a.hashValue ^ b.hashValue // Not a great combinator
  }

  /* synthesized */ func hash(into hasher: inout Hasher) {
    hasher.append(hashValue)
  }
}

Yes! The hash(into:) method simply appends hashValue to the hasher, exactly as you described.

Similarly, if hashValue needs to be synthesized, its body would be expressed in terms of hash(into:), like this:

struct BraveNewWorld: Hashable {
  let a: Int
  let b: String

  func hash(into hasher: inout Hasher) {
    hasher.append(a)
    hasher.append(b)
  }
  /* synthesized */ var hashValue: Int {
    var hasher = Hasher()
    self.hash(into: &hasher)
    return Int(truncatingIfNeeded: hasher.finalize()) // finalize() returns UInt64
  }
}

I love this approach, specifically because it builds on top of Hashable without introducing any new protocols. It also feels very familiar to how Comparable leans on default implementations to avoid us implementing all operators. It's a pity we can't use default implementations and have to generate those in the compiler (I guess because we can't force the user to implement one or the other). I also like the names chosen. I would go straight to proposal!


If hash(into:) is part of the Hashable protocol, wouldn’t it always need to be synthesized? ‘Others’ may wish to call it too, right?

I am a huge fan of this idea too: this form of auto-synthesized Hashable conformance, coupled with conditional Hashable conformance for the various collections, really gives you all the flexibility of rolling your own, relying on secure auto-generated implementations, or doing something partly custom, without requiring deeper knowledge of how to create good, secure hashes.


I'm not sure I understand the question. Can you provide an example of what you mean?

FWIW, if we made hash(into:) an official part of the Hashable protocol, it would become a regular requirement, just like hashValue is now:

  • Anyone could manually implement it if synthesized conformance was not available or if it wasn't the right choice for their type. In most cases, I believe hash(into:) would be much easier to implement well than hashValue.
  • Custom hashing collections outside the standard library could choose to call hash(into:) rather than hashValue to generate their hashes. They'd enjoy the benefits of guaranteed good hashing with no need for any additional postprocessing steps (see the sketch below).

The compiler would be able to synthesize a definition of hash(into:) in some, but not all, cases -- just like it is sometimes able to synthesize hashValue now. Synthesis would not become universally available: at least one of hash(into:) or hashValue would still need to be manually implemented for classes, and for types with non-hashable stored properties.
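
To illustrate the second bullet, here is a sketch of how a hypothetical custom hash table could derive bucket indices through hash(into:), assuming the Hasher API above and a power-of-two bucketCount:

func bucketIndex<Key: Hashable>(for key: Key, bucketCount: Int) -> Int {
    var hasher = Hasher()        // randomly seeded per process
    key.hash(into: &hasher)
    let hash = hasher.finalize() // already well distributed; no extra mixing
    return Int(truncatingIfNeeded: hash) & (bucketCount - 1)
}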


Ah, thanks, I see now.
I mistakenly took David’s question and your answer to mean hash(into:) was conditionally synthesized based on whether it was called from the hashValue implementation or not. I couldn’t make sense of that, but re-reading the thread I now see that I was misunderstanding the question.

Thanks for the elaboration!

I, too, am a big fan of the new interface.

However, I'm not sure that we need all that compiler magic to achieve it. Could we not introduce a refined protocol and use default implementations to provide hashValue in terms of the new protocol? The standard library even has this cool feature that underscored members get hidden, so we could even add a forward-compatibility interface into Hashable without anybody noticing.

There would be a new name, but since user code would have to opt in to the new functionality (by implementing hash(into:)), I think that's fine. Here's an example:

protocol Hashable: Equatable {
  var hashValue: Int { get }
  /* hidden via underscore */ func _hash(into hasher: inout Hasher)
}
protocol HashVisitable: Hashable { // strawman
  func hash(into hasher: inout Hasher)
}
// Backward compatibility, HashVisitable -> Hashable
extension HashVisitable {
  var hashValue: Int {
    var hasher = Hasher()
    self.hash(into: &hasher)
    return Int(truncatingIfNeeded: hasher.finalize()) // finalize() returns UInt64
  }
}

// Forward-compatibility, Hashable -> HashVisitable-like
extension Hashable {
  func _hash(into hasher: inout Hasher) { 
    hasher.append(hashValue)
  }
}
extension HashVisitable {
  func _hash(into hasher: inout Hasher) {
    hash(into: &hasher)
  }
}

We could, but I prefer a little bit of compiler magic to introducing a new protocol that users need to know about. Plus, with @lorentey's solution, you can always depend on a Hashable type being able to work with a Hasher.