Combining hashes

I do not understand how exposing one (or perhaps more in the future) hash combining functions is bad thing. We are not eating soup, cooking soup or participating in any other tortured metaphorical activities. We write code and often we just need a reasonably unique hash value for a set of values.. If Swift / The Standard Library has a way to produce one, it should be exposed to developers.

Also there is no requirement for the hash to be cryptographically secure and the method should be documented as such.

We do however want the default hash function to provide adequate protection against DoS attacks that try to flood hash tables with colliding keys. (This is especially the case in server environments, but there is no good reason not to just have it everywhere.)

Functions that do this well tend to dabble in some light cryptography; just enough to make attacks difficult, while still remaining fast enough to have comparable performance to e.g. the MurmurHash-derivative in stdlib's _mixInt.

I know of two suitable families of algorithms, SipHash and HighwayHash. The former has the distinct advantage of already being in the stdlib. The latter is a recent contender that was designed to lend itself better to SIMD implementations. (This is a somewhat active research area; we can expect to see more algorithms in the future.)

Whatever algorithm we choose for Swift 5, we should retain the ability to replace it in future versions of Swift. The name of the algorithm shouldn't be part of the API, and the functions implementing it must not be inlined into user code (which includes synthesized hashing).

Thanks for the links. DoS attacks are far more of a danger on servers than mobile/desktop apps and then only for some collections.

Assuming that the best "safe" hash algorithm is too slow, a future refinement to Dictionary, and Set for example, would be to specify an option for (faster / safer). This would, of course require significant changes to the way hashing is done.

It would be reasonable to specify fast/safe hashing on types, which would be far easier to implement. Applying this recursively would be more difficult and would be probably not be significantly less difficult than specifying the hash algorithm at the point of use.

The added complexity may not make sense for server code, but for battery-operated devices, it may be worth it. This will probably become a consideration when Swift becomes a viable option for OS and other "high-performance" projects.

Plenty of other languages provide a hash combiner and I have found it useful, therefore I think it would be a worthwhile addition. Ideally a protocol allowing a seed, addition, and finish and some implementations of the protocol for app use, server use, and crypto uses.

I think you've hit the nail on the head here. I was writing specifically in response to the idea of making _mixInt and _combineHashValues available for public consumption within the context of the current design of Hashable and the current state of primitive hash values. I think we are in agreement here that this would be not good, either for encourage correct implementation or for security.

On the other hand, what you've discussed would form an excellent basis for a revised protocol. If there's need for backwards compatibility, I think it should suffice to make your revised NewHashable (not actually what I'd propose as a name) a refinement of the classic Hashable with a default implementation for hashValue. If that's possible, it should permit us to tackle this design on its own time whether before or after ABI stability.

As a step towards a better hashing API, the Standard Library now uses a randomly seeded, high-quality, resilient hash function:

I'm working towards extending this to synthesized Hashable implementations; once that is done, all hashing will be based on the new hash function and we'll be able to remove _mixInt and _combineHashValues from the standard library.

The _Hasher type and the new _hash(into:) requirement in Hashable seem like good candidates for a proposal to make them public: (Note that API names aren't final; they merely reflect the current code's choices.)

/// Represents Swift's standard hash function.
/// The hash function maps an arbitrary number of bytes to 
/// a single fixed-size integer value (the hash), such that slight changes in
/// the input byte sequence will usually result in wildly different hashes.
///
/// The byte sequence to be hashed is fed to the function by a series
/// of `append` calls, followed by `finalize` to extract the hash value.
///
/// A hash function initialized with the same key will always map the 
/// same byte sequence to the same hash value.
struct Hasher {
  // Initialize a new hasher using a per-execution random key
  // seeded at process startup.
  init()
  // Initialize a new hasher using the specified key.
  init(key: (UInt64, UInt64))

  // Feeds bits from `value` into the hasher.
  mutating func append<H: Hashable>(_ value: H)

  // Feed a specific sequence of bits to the hasher
  mutating func append(bits: Int)
  mutating func append(bits: UInt64)
  mutating func append(bits: UInt32)
  // etc.

  /// Finalize the hasher's state and extract a hash value.
  mutating func finalize() -> UInt64
}

protocol Hashable: Equatable {
  var hashValue: Int { get }
  func hash(into hasher: inout Hasher)
}

Here is how you'd implement Hashable in new code:

struct BraveNewWorld: Hashable {
  let a: Int
  let b: String

  func hash(into hasher: inout Hasher) {
    hasher.append(a)
    hasher.append(b)
  }
  // hashValue is automatically synthesized by the compiler
}

Of course, existing code that defines hashValue would keep working, too: (Plus, there may be legitimate cases where hashValue is the most suitable API even if hash(into:) becomes available.)

struct LegacyCode: Hashable {
  let a: Int
  let b: String

  var hashValue: Int {
    return a.hashValue ^ b.hashValue // Not a great combinator
  }

  // hash(into:) is automatically synthesized by the compiler
}

Naturally, the most sensible choice would be to let the compiler do all the work:

struct Sensible: Hashable {
  let a: Int
  let b: String

  // Both hashValue and hash(into:) are synthesized by the compiler
}

I feel like this would be a much nicer way to approach hashing. What do you think?

14 Likes

I really like this approach! In the previous case, I guess the compiler would automatically synthesize hash(into:) when only hashValue is implemented by calling append with the hashValue?

struct LegacyCode: Hashable {
  let a: Int
  let b: String

  var hashValue: Int {
    return a.hashValue ^ b.hashValue // Not a great combinator
  }

  // hash(into:) is automatically synthesized by the compiler
  func hash(into hasher: inout Hasher) {
    hasher.append(hashValue)
  }
}

Yes! The hash(into:) method simply appends hashValue to the hasher, exactly as you described.

Similarly, if hashValue needs to be synthesized, its body would be expressed in terms of hash(into:), like this:

struct BraveNewWorld: Hashable {
  let a: Int
  let b: String

  func hash(into hasher: inout Hasher) {
    hasher.append(a)
    hasher.append(b)
  }
  /* synthesized */ var hashValue: Int {
    var hasher = Hasher()
    self.hash(into: &hasher)
    return hasher.finalize()
  }
}
2 Likes

I love this approach, specifically because it builds on top of Hashable without introducing any new protocols. It also feels very familiar to how Comparable leans on default implementations to avoid us implementing all operators. It's a pity we can't use default implementations and have to generate those in the compiler (I guess because we can't force the user to implement one or the other). I also like the names chosen. I would go straight to proposal!

3 Likes

If hash(into:) is part of the Hashable protocol, wouldn’t it always need to be synthesized? ‘Others’ may wish to call it too, right?

I am a huge fan of this idea too: this form of auto synthesized Hashable conformance coupled with the conditional conformance to Hashable for the various Collections really gives all the flexibility of rolling your own, relying on secure auto generated implementations or do something that is partly custom, but does not require the deeper knowledge of creating good and secure hashes.

2 Likes

I'm not sure I understand the question. Can you provide an example of what you mean?

FWIW, if we made hash(into:) an official part of the Hashable protocol, it would become a regular requirement, just like hashValue is now:

  • Anyone could manually implement it if synthesized conformance was not available or if it wasn't the right choice for their type. In most cases, I believe hash(into:) would be much easier to implement well than hashValue.
  • Custom hashing collections outside the standard library could choose to call hash(into:) rather than hashValue to generate their hashes. They'd enjoy the benefits of guaranteed good hashing with no need for any additional postprocessing steps.

The compiler would be able to synthesize a definition of hash(into:) in some, but not all, cases -- just like it is sometimes able to synthesize hashValue now. Synthesis would not become universally available: at least one of hash(into:) or hashValue would still need to be manually implemented for classes, and for types with non-hashable stored properties.

5 Likes

Ah, thanks, I see now.
I mistakenly took David’s question and your answer to mean hash(into:) was conditionally synthesized based on whether it was called from the hashValue implementation or not. I couldn’t make sense of that, but re-reading the thread I now see that I was misunderstanding the question.

Thanks for the elaboration!

I, too, am a big fan of the new interface.

However, I'm not sure that we need all that compiler magic to achieve it. Could we not introduce a refined protocol and use default implementations to provide hashValue in terms of the new protocol? The standard library even has this cool feature that underscored members get hidden, so we could even add a forwards-compatibility interface in to Hashable without anybody noticing.

There would be a new name, but since user-code would have to opt-in to the new functionality (by implementing hash(into:)), I think that's fine. Here's an example:

protocol Hashable: Equatable {
  var hashValue: Int { get }
  /* hidden via underscore */ func _hash(into hasher: inout Hasher)
}
protocol HashVisitable: Hashable { // strawman
  func hash(into hasher: inout Hasher)
}
// Backward compatibility, HashVisitable -> Hashable
extension HashVisitable {
  var hashValue: Int {
    var hasher = Hasher()
    self.hash(into: &hasher)
    return hasher.finalize()
  }
}

// Forward-compatibility, Hashable -> HashVisitable-like
extension Hashable {
  func _hash(into hasher: inout Hasher) { 
    hasher.append(hashValue)
  }
}
extension HashVisitable {
  func _hash(into hasher: inout Hasher) {
    hash(into: &hasher)
  }
}

We could, but I prefer a little bit of compiler magic than introducing a new protocol that users need to know about. Plus, with @lorentey's solution, you can always depend on a Hashable type being able to work with a Hasher.

I don't. People are going to learn about this behaviour and wonder how to achieve the same thing in their protocols. They will likely struggle around with defaults before they uncover this thread deep in a Google search and learn that we have to resort to special-case magic because the language cannot otherwise handle it.

It's not specific to this proposal - I'm worried in general about the volume of compiler magic we seem to be adding. It's sad; personally, it's deeply unsatisfying when you're in a design corner, and you imagine a solution based on something you've seen and used, only to find out that it's literally a unique carve-out in the way the language works; because someone else was in your situation and had more power to change the universe than you do.

It also makes you wonder about the state of the language going forward - if the Swift compiler developers are able carving special holes around common problems, they're not going to feel the limitations that the rest of us do. It's similar to the problem Apple has with WatchKit (WatchKit is a sweet solution that will only ever give us baby apps – Marco.org).

The standard library will likely always have quirks, but I feel like it's just been exploding recently.

2 Likes

I've done a reasonable amount of work in Java and I have to agree. The fact that almost all of the standard library is written in Swift is a breath of fresh air compared to Java, which arbitrarily gates functionality "for the language only". It's really frustrating to write your own numeric type and be forced to write an add method because the + operator can only work with primitive types and String for some reason. It's annoying that arrays have their own initialization syntax that ArrayList can't use.

That being said, IIRC this proposal mentioned something along the lines of the compiler magic going away if we had better reflection, so I think we should wait for that before giving up on it.

This is a fair concern; however, in my opinion, adding an extra protocol would cause far more confusion than the extra compiler magic necessary to synthesize these members. I would object to any solution where users would have to choose between two separate protocols for hashing. (I'd prefer to keep hash(into:) internal rather than doing that.)

Granted, even two alternative requirements within the same protocol will lead to confusion. To resolve this, I propose that we consider hash(into:) to officially replace hashValue, and that we should deprecate the latter in the same release that makes hash(into:) available. I think completely removing hashValue would be far too disruptive a change to consider, so the new compiler magic would exist merely to let us evolve Hashable without breaking source compatibility with existing code.

4 Likes

I agree that it would be far better to do these things outside of the compiler. (And not just because I don't find building Swift ASTs by hand in C++ particularly enjoyable.) Allowing automatic conformance synthesis for protocols outside of the standard library would be a useful feature for a hypothetical future language-level metaprogramming facility. The current Hashable, Equatable, Codable, RawRepresentable synthesizers are nice examples of the kinds of things that should be expressible by a facility like that. However, we don't have that today, and designing it is not in scope for the current topic.

The standard library has always been in a special symbiotic relationship with the Swift compiler: while the stdlib is written in Swift, it is not written in the usual dialect -- beyond gyb, the stdlib has access to tools and privileges that aren't available to normal Swift code, and it uses techniques and idioms that aren't often found in regular code. Optional is the flagship example of this: while it may be defined in the stdlib as a regular enum, it is so much more than just that.

Yes, and in general I'm fine with that, if it's done judiciously. Common types having a small amount of sugar, such as Array or Optional, are fine; the issue arises when too much of the library's features become inaccessible for general programs due to laziness on the part of the language developers or "baby-proofing" so that users don't hurt themselves. Swift is nowhere near that, thankfully, but features like these must be considered carefully. As long as there's a path for them to become legitimate language features rather than hacks to the compiler, I'm generally fine with it.

Sounds like we're in full agreement there. The stdlib is often a testbed for potential future language enhancements -- some of the "magic" is only magic because we don't yet know enough about the problem to produce a well-designed language feature for it. For example, consider how stdlib code has always been magically inlinable/specializable across module boundaries -- and compare that to the universally available @inlinable attribute we're now discussing in SE-0193.

Custom conformance synthesis for user protocols is not currently possible within Swift; however, we do have synthesized protocols in the stdlib, and they provide a nice set of examples for problems such a feature would be expected to solve. The either/or requirement logic of the new Hashable protocol we're discussing adds an interesting new dimension to this design space -- the compiler will probably need a little refactoring to handle such rules well. Conformance synthesis may not be ready to progress beyond magic yet; but we're collecting data and gaining practice.

3 Likes