Combining hashes

I fully agree; but why not go further? We already support random seeds for the SipHash implementation that we currently use for String.hashValue on Linux. We could just make use of it in _combineHashValues, so that we fulfill the per-execution hash seed promise.

(Looking at the implementation, I'm a bit worried that the DJB hash was designed for hashing ASCII characters, not arbitrary 64-bit integers. It's certainly better than what we had before, but there is probably room for further improvement.)

The "transience" idea was lurking behind my question. :) I was interested in hearing what Xiaodi thought of it (and should have caught up on the thread before asking).

Your Transient wrapper is interesting, thanks for sharing it! I also don't think it's a great long-term solution, but it's worth knowing about in the meantime. :)

I agree that transience is a tricky notion because there may be cases where some properties are transient with respect to one protocol and others are transient with respect to a different protocol. On the other hand, one benefit of synthesis in the case of Hashable is that it ensures the implementation is congruent with the implementation of Equatable. That benefit is lost by any approach involving custom hashValue implementations that forward to a combining function.

I don't have a strong opinion about how to solve this problem but I definitely agree with @lorentey on this topic (emphases added):

When we manually implement Hashable, the job should just be about identifying bits and pieces of our data that must participate in hashing. We shouldn’t need to think about how these bits are shredded into a fixed-width value, because it is a completely different concern, and even worse, it requires careful consideration and specialist knowledge. Almost nobody is able (or willing) to provide a good implementation of hashValue from scratch, because task (1) is hard.

I also think an attempt at addressing this in Swift 5 is a really good idea. I don't believe understanding the subtleties of hash combining should be a prerequisite to using a user-defined type as the key of a keyed container. It seems like there is agreement that something should be done but there is disagreement about what is best (partly because we need more information).

2 Likes

I think it’s fine, perhaps even preferable, to just have a simple concept of “transient stored property”, which indicates the marked property should not participate in any synthesized protocol conformances: not equatable, not hashable, not codable, and not any future synthesis.

If someone wants more customization than that, they can implement the protocols themselves. (And when they do so, they ought to have easy access to a simple hash combiner in the standard library :-)
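
For illustration, a hypothetical syntax for such a marker might look like this (@transient is not a real attribute; this is purely a sketch):

struct CachedRecord: Hashable, Codable {
  var id: Int
  var name: String
  @transient var formattedCache: String? // ignored by ==, hashValue, and Codable
}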

+1 to everything

What do you mean by the collection selecting the hashing function? How does that work for generic collections? For example, Dictionary would want a different hashing function when its input is a String (very not-uniformly-distributed in an infinite space) vs when its input is a native Swift object (non-deterministically distributed within a fixed, aligned, and limited address space).

(Disclaimer: I'm not a security expert!) I think it would be reasonable for Dictionary to just provide an initial seed before deferring to its keys' default hash functions. IIRC, Dictionary does this now with a per-Swift-version salt. This could be made more secure, e.g. with a per-compilation salt (a compiler-generated random immediate), a per-execution salt (e.g. the pid), the thread id, and per-instance info such as significant bits from the heap storage's address. This combination would be performed regardless of key type, and would probably not be customizable, while the hashing strategy of keys would be chosen based on key type, receiving this seed as initial state. (Or maybe Dictionary should just store a random number upon creation to use as the initial seed instead of this cleverness.)
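
For concreteness, a layered seed of this kind might be derived roughly like this (all names and constants are purely illustrative; nothing like this exists in the stdlib):

struct HashSeed {
  // Imagine this is generated randomly at process startup
  // (the per-execution salt).
  static let perExecution: UInt64 = 0x243F_6A88_85A3_08D3

  // Fold per-instance entropy (significant bits of the heap storage's
  // address) into the per-execution salt.
  static func forInstance(storageAddress: UInt) -> UInt64 {
    return perExecution ^ (UInt64(storageAddress) >> 4)
  }
}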

Either way, this would involve changing hashValue to take an initial seed, but otherwise there's not significantly more customizability. Do you think there's more for Dictionary to provide than just this initial seed? Do you think there's value to parameterizing Dictionary over choice of hash function?

What are some concrete use cases of generalizing further than caller-provided initial state? Other than for use in hashed collections, what are some legitimate use cases of hashValue? If the desire is to lump in cryptographically secure hash functions, we probably will still want a default, and that default is likely to be the kind of hash that's of use for hash tables. I didn't follow the default-random thread closely, but IIRC the discussion went towards providing a suitable but not cryptographically secure default, and this reasoning applies even more so towards hash values.

1 Like

I agree this needs to be at least partially addressed in Swift 5. As it happens, I'm working on nailing down the ABI for Set and Dictionary, and Hashable implementations are a huge part of that.

There is no fixed deadline for exposing a default hash function as public stdlib API -- we can do it after Swift 5.0, as an incremental addition. But the interfaces used to implement synthesized hashing will become part of the stdlib's ABI, so it is important to get them right for Swift 5. And as long as we're revisiting them, we may as well come up with a design that's worthy of an evolution proposal to make them public.

One aspect that we haven't discussed is that it would be desirable to be able to evolve the default hash function in the stdlib without recompiling user code. Therefore, hashing helpers should not be inlined into Hashable implementations, synthesized or otherwise.

A Better Hash Combinator Interface

Synthesized hash functions are currently based on two functions:

  • _mixInt(_: Int) -> Int scrambles bits of a single integer value. This was originally intended to postprocess hash values, improving their distribution for our linearly probed hash tables. It is also used sometimes as an intermediate mixing step.
  • _combineHashValues(_: Int, _: Int) -> Int combines two integer hash values into a single one, mixing bits from both. (A typical stateless combinator in this style is sketched below.)
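
For reference, a stateless combinator in this style looks something like the well-known Boost hash_combine construction (shown purely for illustration; it is not the stdlib's actual implementation):

func combineHashValues(_ seed: Int, _ value: Int) -> Int {
  // Mix the new value into the accumulated seed using shifts and a
  // golden-ratio-derived constant (Boost's hash_combine recipe).
  var result = UInt(bitPattern: seed)
  result ^= UInt(bitPattern: value) &+ 0x9e3779b9 &+ (result << 6) &+ (result >> 2)
  return Int(bitPattern: result)
}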

I don't think this is a good enough interface to make public. In particular, it does not allow for defining internal state for the hash function, which is a requirement for most secure hashes.

The standard library's internal SipHash implementation is an example of a hash function that we should be able to use. Its interface looks roughly like this:

struct SipHash13 {
  // 256 bits of internal state
  var v0: UInt64
  var v1: UInt64
  var v2: UInt64
  var v3: UInt64

  // Initialize state from random per-execution key generated at process startup
  public init() {...}

  // Mix bytes from `buffer` into the internal state.
  public mutating func append(from buffer: UnsafeRawBufferPointer) {...}
  // Shortcuts for integer types of various widths
  public mutating func append(_ value: Int)
  public mutating func append(_ value: Int8)
  ...
  public mutating func append(_ value: UInt64)
  // Finalize state and retrieve the resulting hash value
  public mutating func finalize() -> Int {...}
}

We can use SipHash13 to manually define an excellent hashValue in a very mechanical way:

struct Foo: Hashable {
  var i: Int
  var j: String

  var hashValue: Int {
    var hasher = SipHash13()
    hasher.append(i)
    hasher.append(j.hashValue)
    return hasher.finalize()
  }
}

I believe a construct modeled after SipHash13 would likely be a more viable interface than _combineHashValues for both Hashable synthesis and a stdlib-provided public default hash function:

public struct DefaultHasher {
  public init() {...}
  public mutating func append(buffer: UnsafeRawBufferPointer) {...}
  public mutating func append(_ value: Int) {...}
  ...
  public mutating func finalize() -> Int {...}
}

What follows is a hypothetical full redesign of Hashable, requiring source-level migration of existing code. This would be too disruptive to be viable as a proposal, but it's a nice illustration of what I mean:

public protocol Hasher {
  mutating func append(_ buffer: UnsafeRawBufferPointer)
  mutating func append(_ value: Int)
  ...
  mutating func append(_ value: UInt8)
}

public protocol Hashable { 
  func hash<H: Hasher>(into hasher: inout H)
}

Simple user-defined data types would just feed data to the hasher they are provided:

struct Foo: Hashable {
  var i: Int
  var j: String
  
  func hash<H: Hasher>(into hasher: inout H) {
    i.hash(into: &hasher) // or, equivalently, hasher.append(i)
    j.hash(into: &hasher)
  }
}

More complicated types would manually construct a suitable byte sequence to feed to the hasher. For example, String would simply feed the hasher its normalized Unicode scalars using some predetermined encoding. (It should not care about exactly how those bytes are hashed into a fixed-width value.)
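
For illustration, under the hypothetical protocol above, String's contribution might look roughly like this (purely a sketch):

extension String {
  func hash<H: Hasher>(into hasher: inout H) {
    // In reality, the scalars would be normalized first and fed as raw
    // bytes in a predetermined encoding; this sketch skips that step.
    for scalar in unicodeScalars {
      hasher.append(Int(scalar.value))
    }
  }
}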

Hashed collection types like Dictionary would indeed be able to select separate hashers for specific key types. However, I don't think that would be a frequent case, as a single hasher should suffice for all Hashable types. (After all, the normalized bits of a string aren't really all that different from a bunch of bits coming from e.g. an integer array -- in general, we know very little about the expected distribution of either of them.)

Since the collections are in full control over the initialization/finalization of their hashers, they can easily customize them with per-execution (or even per-instance) random seeds (like _SipHash13Context in the stdlib), and/or salt them with the capacity or base address of their storage -- this would all be up to the collection to decide.

extension Dictionary {
  internal func _hashValue(for key: Key) -> Int {
    var hasher = _SipHash13(
      seed1: _perExecutionRandomHashSeed, 
      seed2: self.capacity)
    key.hash(into: &hasher)
    return hasher.finalize()
  }
}

I think this would be a nicer API than hashValue because it is

  • easier to implement -- just pass the same hasher to the fields you want
  • easier to read -- what's being hashed isn't obscured by ad-hoc bit scrambling operations
  • easier to test -- e.g. we can use a mock hasher to see what exactly is fed into the hash function (see the sketch after this list)
  • harder to get wrong -- we aren't tempted to use 31 * a + b as the hash combinator, even if Stack Overflow encourages us to do so
  • easier to review -- we don't get into arguments about how to hash each individual type
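
To illustrate the testing point, here is a hypothetical mock hasher against the Hasher protocol sketched above (only the requirements spelled out there are implemented; the elided overloads would forward similarly):

struct RecordingHasher: Hasher {
  var recordedBytes: [UInt8] = []

  mutating func append(_ buffer: UnsafeRawBufferPointer) {
    recordedBytes.append(contentsOf: buffer)
  }
  mutating func append(_ value: Int) {
    var value = value
    withUnsafeBytes(of: &value) { recordedBytes.append(contentsOf: $0) }
  }
  mutating func append(_ value: UInt8) {
    recordedBytes.append(value)
  }
}

// In a unit test:
var hasher = RecordingHasher()
Foo(i: 42, j: "hello").hash(into: &hasher)
// hasher.recordedBytes now shows exactly what Foo fed to the hash function.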

FWIW, Rust uses a hashing interface very similar to this.

Having said all of this, I think having fully generic hashing might be overkill -- I don't expect people would often need to define their own hash functions. A single, resilient default Hasher provided by the stdlib would likely be sufficient for most applications.

However, as long as we want to be able to use (relatively) secure hash functions with nontrivial internal state, we probably still need to add some non-generic variant of the hash(into:) API, to achieve adequate performance. For example, we wouldn't want to repeatedly (and recursively) initialize/finalize SipHash state for each individual field -- we want to feed all data to a single hasher instance.

4 Likes

This looks pretty nice to me. It addresses the concern of an API that "over promises" by incorporating the name of the algorithm, and the choice of algorithm is explicit in custom implementations. The documentation could include a discussion of when it is appropriate to use the algorithm and when a different one might be a better choice. It is also an "open" design in the sense that we could add implementations of other hash algorithms in the future if deemed desirable.

This approach also composes well with a future version of Swift that has stronger metaprogramming features. I can imagine an approach driven by property behaviors or user-defined attributes applied to properties, where a hashValue implementation is synthesized using metaprogramming rather than directly by the compiler. This would be quite flexible, allowing users to configure a hashing algorithm, etc., while still receiving the primary benefit of synthesis (not directly specifying the hashed members). It is a more flexible version of @transient that wouldn't require direct language support.

I think the problem is mostly that there are so many good hash functions to choose from. After 2011's hash collision DoS scare, SipHash has become a popular choice; it has been designed for exactly this purpose. AFAIK, it is still considered relatively secure in this niche, while it is also fast enough for general use. Luckily, it's also already implemented in stdlib.

SipHash's primary drawback is that it doesn't fit well into the classic hashValue API -- it is rather slow to repeatedly reinitialize/finalize brand new SipHash states in nested hashValue invocations.
(This is not unique to SipHash; it is also true for most/all other high-quality hashes.)

This can be solved by introducing an alternative hashing interface (like hash(into:) elsewhere in this thread). Depending on how often we expect users to roll their own hashes, that interface may not even need to be public.

I believe a high-quality hash function like SipHash will be able to generate good hash values regardless of the distribution of its input -- as long as we feed it all the relevant bits, it should produce hash values with a nice wide spread.

I think it's worth making a fun detour into the nitty gritty details of stdlib's hashed collections, to underline your point about hash tables being sensitive to hash value distributions:

<detour>
Interestingly, Int.hashValue returning self is not a particularly great choice for Swift's hashed collections, because it often produces long sequential chains of hash values. The hash table behind Dictionary and Set uses open addressing with linear probing, and such chains can cause disastrous performance problems, even though they may not initially contain any actual collisions.

Set and Dictionary work around this problem by not using the raw values returned by hashValue: they call _mixInt on every hash they generate. Mixing up the bits a little is enough to break these long chains into much shorter pieces, restoring hash table performance in this very common case.
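
For the curious, a mixing step in this spirit is MurmurHash3's 64-bit finalizer (_mixInt is a MurmurHash derivative, as noted elsewhere in this thread; this is shown for illustration, not as the stdlib's exact code):

func fmix64(_ value: UInt64) -> UInt64 {
  // Repeated xor-shift and multiply steps spread each input bit
  // across the entire word, breaking up sequential chains.
  var x = value
  x ^= x >> 33
  x = x &* 0xff51afd7ed558ccd
  x ^= x >> 33
  x = x &* 0xc4ceb9fe1a85ec53
  x ^= x >> 33
  return x
}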

(Another way to solve this would be to switch to another probing method; however, linear probing's cache friendliness makes it difficult to replace it without significantly slowing down lookups into colliding buckets.)

So while Int.hashValue may be trivial, the actual hash function is much more complicated, to fix issues with distribution quality. (Switching to a high-quality hash function would eliminate the need for this mixing step.)

As a natural consequence of what it does, _mixInt (like all of the high-quality hash functions) introduces collisions that weren't in the original set of hash values. This is perfectly acceptable, but it can be surprising when you look at microbenchmarking results.

Another interesting observation is that mixing decouples the order of elements in the hash table from the numerical value of their original hash values, so it significantly reduces the expected performance of inserting/retrieving a sequential list of integer keys:

var set: Set<Int> = []
for i in 0..<1_000_000 {
    set.insert(i) // This doesn't have great locality of reference
}

This is rarely (if ever) relevant in practice, but it often pops up in naive benchmarks. (E.g., try measuring the loop above to compare the performance of Set<Int> to a hash table that does not reorder its elements, like NSSet or std::unordered_set<int64_t>. You may find Set performs slower on such artificial benchmarks, while on real-life workloads the opposite may often be true.)
</detour>

It's not nice to force users into an API they can't reasonably be expected to use correctly. I think hashValue before SE-0185 clearly fell into this category. It is possible (indeed, necessary) to implement a universally good, resilient hash function in the stdlib for synthesized hashing; and it seems quite sensible to me to provide this default hash function as a courtesy for users whose types synthesized hashing is not yet able to cover.

Sufficiently motivated, experienced developers should certainly be able to provide their own hash implementations, but the default hash should be good enough that even they wouldn't bother. A fully hand-rolled hashValue should be merely an option for optimization, rather than a requirement for correctness.

One problem I have with hashValue is that it is not easy to recognize a bad implementation. For example, I don't think it's feasible to write unit tests to verify that hashValue does not collide too often -- how many collisions are too many? Over what set of inputs? Similarly, I don't think we can put runtime warnings in the stdlib to catch badly colliding hashValues. We could provide debugging probes to determine the collision rate, but users would need to remember to look at them.

1 Like

My own experience gives:

  • Synthesis works quite often. That's immensely great.
  • Hashable synthesis currently fails on types that contain arrays (Equatable synthesis works fine).
  • When some properties should be ignored by Hashable/Equatable synthesis, one can, for example, extract the "key" into a private struct with synthesized Hashable/Equatable conformance, and forward the ==/hashValue implementations to the inner key (a sketch of this technique follows below).
  • I have one hashValue implementation that can't possibly be synthesized because the type has various inner implementations. This one would be very happy if Swift would provide a blessed way to combine hashes. Excuse the naive current implementation: https://github.com/groue/GRDB.swift/blob/2289ed9d95b16f8e852c8ba2cd0238b98f8d3fb5/GRDB/Core/Row.swift#L817-L820

(About the last example: the "inner private key" technique would help if Swift computed hashValues for arrays - even if performance would suffer because of the creation of temporary convenience arrays)
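
For reference, the "inner private key" technique mentioned above looks roughly like this (type and property names are illustrative):

struct Player: Hashable {
  // Only the properties inside Key participate in ==/hashValue;
  // both conformances are synthesized on the inner struct.
  private struct Key: Hashable {
    var name: String
    var score: Int
  }
  private var key: Key
  var cachedDescription: String? // ignored by Equatable/Hashable

  init(name: String, score: Int) {
    self.key = Key(name: name, score: score)
  }

  static func == (lhs: Player, rhs: Player) -> Bool {
    return lhs.key == rhs.key
  }
  var hashValue: Int {
    return key.hashValue
  }
}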

1 Like

I'd like to provide my moral support for re-examining that proposal's ideas and improving the hashing APIs. @lorentey let me know if I can be of any help.

I think we should have Array conform to Hashable when its Element type does. In fact, many of the stdlib types could be candidates for conditional Hashable conformance:

  • ArraySlice
  • Slice (with caveat below)
  • Dictionary
  • EmptyCollection, CollectionOfOne
  • DictionaryLiteral
  • Range and the partial ranges
  • etc.

Slice is weird in that the straightforward way to calculate the hash by sequentially combining element hashes would not be correct for slices of unordered collections like Set and Dictionary.
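
For illustration, under the hash(into:) interface sketched earlier in this thread, the Array conformance could be as simple as this (hypothetical):

extension Array: Hashable where Element: Hashable {
  func hash<H: Hasher>(into hasher: inout H) {
    hasher.append(count) // discriminates between nestings like [[1], [2]] and [[1, 2]]
    for element in self {
      element.hash(into: &hasher)
    }
  }
}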

This seems like a good candidate for a proposal!

This is currently discussed at Let Optional, Dictionary and Array conditionally conform to Hashable - #19 by Morten_Bek_Ditlevsen. There is even a WIP implementation.

1 Like

:thinking: I think slices of sets and dictionaries probably need to be order-dependent when we compare them. Those two types are "substitutable" for their primary use cases when equal, but definitely not for slicing operations.

2 Likes

I do not understand how exposing one (or perhaps more in the future) hash combining functions is a bad thing. We are not eating soup, cooking soup, or participating in any other tortured metaphorical activities. We write code, and often we just need a reasonably unique hash value for a set of values. If Swift / the standard library has a way to produce one, it should be exposed to developers.

Also, there is no requirement for the hash to be cryptographically secure, and the method should be documented as such.

We do however want the default hash function to provide adequate protection against DoS attacks that try to flood hash tables with colliding keys. (This is especially the case in server environments, but there is no good reason not to just have it everywhere.)

Functions that do this well tend to dabble in some light cryptography; just enough to make attacks difficult, while still remaining fast enough to have comparable performance to e.g. the MurmurHash-derivative in stdlib's _mixInt.

I know of two suitable families of algorithms, SipHash and HighwayHash. The former has the distinct advantage of already being in the stdlib. The latter is a recent contender that was designed to lend itself better to SIMD implementations. (This is a somewhat active research area; we can expect to see more algorithms in the future.)

Whatever algorithm we choose for Swift 5, we should retain the ability to replace it in future versions of Swift. The name of the algorithm shouldn't be part of the API, and the functions implementing it must not be inlined into user code (which includes synthesized hashing).

Thanks for the links. DoS attacks are far more of a danger on servers than mobile/desktop apps and then only for some collections.

Assuming that the best "safe" hash algorithm is too slow, a future refinement to Dictionary and Set, for example, would be to let users specify an option for faster or safer hashing. This would, of course, require significant changes to the way hashing is done.

It would be reasonable to specify fast/safe hashing on types, which would be far easier to implement. Applying this recursively would be more difficult, and would probably not be significantly less difficult than specifying the hash algorithm at the point of use.

The added complexity may not make sense for server code, but for battery-operated devices, it may be worth it. This will probably become a consideration when Swift becomes a viable option for OS and other "high-performance" projects.

Plenty of other languages provide a hash combiner, and I have found it useful, so I think it would be a worthwhile addition. Ideally, this would be a protocol allowing a seed, addition, and a finishing step, with implementations of the protocol provided for app use, server use, and cryptographic use.
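
A minimal sketch of such a protocol, under the naming just described (hypothetical):

protocol HashCombiner {
  // Start from a seed (e.g. random per-execution, or fixed for tests).
  init(seed: (UInt64, UInt64))
  // Feed a value into the combiner.
  mutating func combine(_ value: Int)
  // Finalize the state and extract the hash.
  mutating func finish() -> Int
}

// Implementations could then be tailored to different needs, e.g.
// a fast combiner for apps, a DoS-resistant one for servers, and a
// cryptographically secure one where that is required.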

I think you've hit the nail on the head here. I was writing specifically in response to the idea of making _mixInt and _combineHashValues available for public consumption within the context of the current design of Hashable and the current state of primitive hash values. I think we are in agreement here that this would not be good, either for encouraging correct implementations or for security.

On the other hand, what you've discussed would form an excellent basis for a revised protocol. If there's a need for backwards compatibility, I think it should suffice to make your revised NewHashable (not actually what I'd propose as a name) a refinement of the classic Hashable with a default implementation for hashValue. If that's possible, it should permit us to tackle this design in its own time, whether before or after ABI stability.

As a step towards a better hashing API, the Standard Library now uses a randomly seeded, high-quality, resilient hash function:

I'm working towards extending this to synthesized Hashable implementations; once that is done, all hashing will be based on the new hash function and we'll be able to remove _mixInt and _combineHashValues from the standard library.

The _Hasher type and the new _hash(into:) requirement in Hashable seem like good candidates for a proposal to make them public (note that API names aren't final; they merely reflect the current code's choices):

/// Represents Swift's standard hash function.
/// The hash function maps an arbitrary number of bytes to 
/// a single fixed-size integer value (the hash), such that slight changes in
/// the input byte sequence will usually result in wildly different hashes.
///
/// The byte sequence to be hashed is fed to the function by a series
/// of `append` calls, followed by `finalize` to extract the hash value.
///
/// A hash function initialized with the same key will always map the 
/// same byte sequence to the same hash value.
struct Hasher {
  // Initialize a new hasher using a per-execution random key
  // seeded at process startup.
  init()
  // Initialize a new hasher using the specified key.
  init(key: (UInt64, UInt64))

  // Feeds bits from `value` into the hasher.
  mutating func append<H: Hashable>(_ value: H)

  // Feed a specific sequence of bits to the hasher
  mutating func append(bits: Int)
  mutating func append(bits: UInt64)
  mutating func append(bits: UInt32)
  // etc.

  /// Finalize the hasher's state and extract a hash value.
  mutating func finalize() -> UInt64
}

protocol Hashable: Equatable {
  var hashValue: Int { get }
  func hash(into hasher: inout Hasher)
}

Here is how you'd implement Hashable in new code:

struct BraveNewWorld: Hashable {
  let a: Int
  let b: String

  func hash(into hasher: inout Hasher) {
    hasher.append(a)
    hasher.append(b)
  }
  // hashValue is automatically synthesized by the compiler
}

Of course, existing code that defines hashValue would keep working, too (plus, there may be legitimate cases where hashValue is the most suitable API even if hash(into:) becomes available):

struct LegacyCode: Hashable {
  let a: Int
  let b: String

  var hashValue: Int {
    return a.hashValue ^ b.hashValue // Not a great combinator
  }

  // hash(into:) is automatically synthesized by the compiler
}

Naturally, the most sensible choice would be to let the compiler do all the work:

struct Sensible: Hashable {
  let a: Int
  let b: String

  // Both hashValue and hash(into:) are synthesized by the compiler
}

I feel like this would be a much nicer way to approach hashing. What do you think?

14 Likes

I really like this approach! In the previous case, where only hashValue is implemented, I guess the compiler would automatically synthesize hash(into:) by calling append with the hashValue?

struct LegacyCode: Hashable {
  let a: Int
  let b: String

  var hashValue: Int {
    return a.hashValue ^ b.hashValue // Not a great combinator
  }

  // hash(into:) is automatically synthesized by the compiler
  func hash(into hasher: inout Hasher) {
    hasher.append(hashValue)
  }
}