[Accepted] SE-0206: Hashable Enhancements

See @Gankro’s first post in this thread for more context - I think that will help clarify.

The need is to delineate element subgroups within the sequence of elements presented to a Hasher. So it’s different from just finalizing the Hasher.

I’m not sure if I understand correctly. Please tell me if I’m wrong here:

  • The API proposed allows for an implementation that defends against HashDOS attacks by, for example, using mixins after every call to combine.
  • The problem with that as the default would be performance-wise. It’s true that you can implement a hash with mixins that doesn’t need to store all the pieces in a buffer before hashing — it could happen piece by piece.
  • But adding mixins in between would make combine more expensive to run, which depending on how much more expensive that is, would impose a performance hit on otherwise lightweight use-cases.
  • Therefore, the suggestion is to be able to disable security mixins as an option.

How wrong did I get it? I feel I’m missing something, but I can’t quite get at it.

The latter point, cheap rehash and finalize (you didn’t mention that), was my concern in suggesting an extra method. I also noted that instantiating a hasher needs to be cheap. As a further point your code whilst very similar to my code isn’t that obvious, whereas a method called previously CachedValue is.

The concern about the cost of instantiating the hasher is there with both pieces of code, therefore is there a better solution that avoids instantiating the hasher?

A big +1 from me on this one. :smile:

When writing my original blog post about the problems with today’s Hashable and the GridPoint documentation example in particular I jumped from my chair when I saw the collision statistics from running some quick tests on it. Being able to provide the very go-to example from the documentation as an adversarial example for the status quo was just perfect. So I have to thank you for that. It was the wind in the sails of this proposal. :raised_hands:t2:Without it I would have had to come up with a straw-man of my own, which would have greatly diminished the strength of the argument I was trying to make with the blog post.

Consider a deeply nested tree structure that recursively calls hash(into:) on its children and own value:

func hash(into hasher: inout Hasher) {
    hasher.combine([leftChild, self.value, rightChild])
}

With this one would get thousands of Array short-lived (and mostly rather small) allocations per hash computation. This would not only be bad in performance but might also badly fragment your memory, bringing your allocator to its knees.
Further more this would only be useful for homogeneous values. For my example above [leftChild, self.value, rightChild] would have to be of type [Hashable], imposing dynamic dispatch on the elements, further slowing things down.

You’re not calling finalize() as an implementor of Hashable though:

extension GridPoint: Hashable {
  func hash(into hasher: inout Hasher) {
    hasher.combine(x)
    hasher.combine(y)
  }
}

and you shouldn’t either. Finalize is meant to only ever be called once on a Hasher at the very end of hash computation, as implied by the name.

As such this won’t work either:

This hash = into.finalize() will conflict with the semantics of Hasher.

You snippet is not valid Swift (missing let and return in guard let), so I’m not exactly sure what you meant to express, so I might be reading something different into your proposed approach, than you intended.

Hasher is stateful. As such instead you would need to do something like this:

struct Expensive {
    private var hash: Int?
    override func hash(into: inout Hasher) {
        let cachedHash = self.hash ?? {
            var hasher = Hasher()
            hasher.combine( ... )
            ...
            return hasher.finalize()
        }()
        into.previouslyCachedValue(h)
    }
}

You need to create a new Hasher to compute the cached hash, as the Hasher's current state is dependent on whatever it has hashed so far. There are a couple of things wrong with this caching strategy though:

  • func hash(into: inout Hasher) now needs to be marked as mutating func in Hashable, or else your snippet won’t compile, being a struct. hash(into:) should however never ever imply that mutating self is intended behavior. It would turn the whole protocol into a foot gun by giving the impression of the wrong semantics.

  • mutating func hash(into: inout Hasher) would require types that make use of hashing and allow for key lookup à la func contains(key: T) to do this:

    func contains(key: T) {
        var mutableKey = key // 👈🏻copy of key
        var hasher = Hasher(…)
        key.hash(into: &hasher) // 👈🏻call of mutating func on key
        let hash = hasher.finalize()
        return ...
    }
    

    This would trigger Swift’s copy-on-write on every key lookup for types making use of isKnownUniquelyReferenced(_:) or having similar CoW semantics.

  • By instantiating a local Hasher via Hasher() one would be killing all of the efforts made to make this API work with stuff like Bloom Filters or other data structures/algorithms that might expect proper use of a single hasher.

  • It is unclear to me how previouslyCachedValue(_:) would play with the fact that Hasher is stateful. Allowing instances of Hashable to basically mess with the state of Hasher further more is a violation of concerns and encapsulation, imho.

1 Like

These are really interesting points; thank you for raising them.

There is a small semantic issue here – it isn’t clear if we can allow hasher.combine(42 as Int64) and hasher.combine(42 as Int8) to produce different hashes.

They are different types, so I’d argue we can and should be free to do that. However, the standard library defines heterogeneous comparison operators between all fixed-width integer types. This means that (42 as Int8) == (42 as Int64), and having the two values produce different hashes may be confusing. (AnyHashable and its interaction with NSNumber produces additional complications.)

The bits: label for the primitive combine functions makes it clear that the bit width of integer values is significant.

The implementation of hash(into:) for integer types is currently quite a bit more complicated than that. I hope to make hasher.combine(foo) mean the same thing as hasher.combine(bits: foo) for all integer types, but it will take more discussion and some extra work.

(For Int, the current implementation does indeed end up calling combine(bits: Int), but this is an accidental consequence of Hasher not implementing the full API proposed here, and we will likely need to change it once we have combine(bits: Int8) working.)

We shouldn’t be doing unnecessary work. However, as far as I can see, commutative hashing requires calculating a separate hash value for each element, and spawning a new hasher is simply the correct way to do that.

Hasher.finalize() is not in itself slow; calling it only harms performance when the call can be eliminated.

Can you provide an example how a Hasher protocol would make things faster here? (I’d hesitate to have Set and Dictionary produce weaker hash values.)

This is a good point! We changed the name from append to combine late in the drafting process; I think combine is clearly better, but it does not by itself imply strict ordering. We can (and should) make up for this in the documentation. (append is uncomfortably close to Array.append. While Hasher.combine and Array.append are similar operations in an abstract sense, they actually mean completely different things.)

I agree too, and your implementation is the correct way to do it. (I.e., spawning/finalizing a separate hasher to generate the cached hash, then feeding the hash value in hash(into:).)

Note that caching the hash involves mutating operations, so for value types, you cannot do it directly in hash(into:). (This is by design, and the same applied for hashValue.) This is not a real limitation, as you can simply force-generate the cached hash sometime before you start using the value in hashing contexts.

Hm; maybe we should simply turn the example into a demonstration of how the compiler implements hash(into:). It would serve the same purpose, but it would avoid getting bogged down in questions like why wouldn’t you want to hash the color of a GridPoint.

It seems to me this wouldn’t carry its weight as an API. It isn’t clear that we’d be able to implement it well. People implementing hashable collections would still need to be aware that they need to call this function, so we’ll need to call it out somewhere in the docs. But we may just as well describe what they need to do and why.

Not without knowing more details about the collection! In particular, unless we know more about the elements, combining the count seems like a better choice than any constant bit pattern.

Ah, yes, my bad. I was talking about this workflow:

  1. You make a data structure (like Dictionary) that uses a hashtable.
  2. Inside of that, you hash keys.
  3. You hash like so: call key.hash(&currentHasher) and then keyHash = currentHasher.finalize()

Therefore in the call to finalize(), you’re implicitly marking the end of the combine sequence. currentHasher should be able to record what’s the amount of steps in the sequence, and therefore it can append the length at the end before finalizing the hashing process.

The basic underlying requirement is that for every two values of a hashable type, they must feed the same sequence of bytes to Hasher if and only if they are equal. Two values that aren’t equal must feed data that differs by at least a single bit. Everything else follows from this one requirement.

It’s usually easy to ensure unique encodings, but it gets a little tricky in special cases like hashable collections. One subtle gotcha is that when we concatenate variable-length data together, we may lose track of the original boundary between the two components, which opens the way for hash collisions. For example, consider the Foo struct below:

struct Foo: Hashable {
  let a: [Int]
  let b: [Int]

  func hash(into hasher: inout Hasher) {
    hasher.combine(a)
    hasher.combine(b)
  }
  static func ==(left: Foo, right: Foo) -> Bool { ... }
}

We need to ensure these instances all feed different data to the hasher:

Foo(a: [], b: [1, 2, 3])
Foo(a: [1], b: [2, 3])
Foo(a: [1, 2], b: [3])
Foo(a: [1, 2, 3], b: [])

Foo's hash(into:) implementation is already perfect – we don’t want to change it. To satisfy the hash encoding requirement, we just need to make sure Array uses a suitable encoding. For example, here is one way to do this:

extension Array: Hashable where Element: Hashable {
  func hash(into hasher: inout Hasher) {
    hasher.combine(count) // <== This is new
    for element in self {
      hasher.combine(element)
    }
  }
}

This is very similar to how enumerations need to add discriminators to their hash encoding:

enum Bar: Hashable {
  case a(Int)
  case b(Int)
  case c(String)

  func hash(into hasher: inout Hasher) {
    switch self {
      case let a(v):
        hasher.combine(0)
        hasher.combine(v)
      case let b(v):
        hasher.combine(1)
        hasher.combine(v)
      case let c(v):
        hasher.combine(2)
        hasher.combine(v)
    }
  }
  static func ==(left: Bar, right: Bar) -> Bool { ... }
}

None of this affects simple cases like hashing stored properties in Foo; simply combining them one by one works great. As Alexis explained, it’s unnecessary to add such discriminators between every single combine call; doing that would only slow things down, without meaningfully improving things.

I would also add that I don’t see a need to add special support for such discriminators/delimiters in the Hasher API; types that need to think about hash encodings need to, well, think about hash encodings. Even in the most complicated cases, coming up with a good hash encoding is a lot easier than writing a good hash function; and I suspect there is no one-size-fits-all API for doing this that wouldn’t also slow down the common case too much.

Ah, good point! I haven’t realized that’s the case. This is also true for the xor-based commutative hashing in order-independent collections like Set and Dictionary.

On the other hand, Hasher's avalanche effects might be enough to make Bloom filters work even if caching/commutative Hashable types only feed 64 bits to the hasher. (As long as the cached hash values don’t collide, of course – which I guess is exactly the point: changing the seed usually gives us an entirely different collision pattern, but caching the hash value defeats that.)

I don’t think we need to fix this; hashing that (for some types) only produces 64 bits of information is still very useful.

That being said, it may make sense to fix Set and Dictionary to hash their elements into copies of the supplied hasher rather than creating their own. The code below is super subtle, but it makes Set suitable for use in Bloom filters:

extension Set {
  func hash(into hasher: inout Hasher) {
    var hash = 0
    for element in self {
      var h = hasher 
      // h is a brand new copy of the original hasher; 
      // combining/finalizing it doesn't affect hasher itself.
      h.combine(element)
      hash ^= h.finalize()
    }
    hasher.combine(bits: hash)
  }
}

Doing the same for caching is a bit more difficult. One option is to cache hash values for several different seeds, and combine them all to the supplied hasher.

1 Like

Wow, that enum example is really good!

I’m pretty sure the Array example can be fixed internally in the Hasher implementation, but the enum example needs (under this API) its own encoding. Unless it’s possible to feed in enum cases, instead of feeding in 0s, 1s and so on.

1 Like

I think Swift has dodged this bullet so far. Luckily, Set and Dictionary only provide lookup operations for values of the exact same type they contain, and in general, there is currently no expectation that distinct types ever produce matching hash values. (Although the semantics of AnyHashable and Objective-C bridging makes things fuzzy at the edges – hence my hesitation about committing to sensible hashing of integer values.)

I’m glad Vincent and Karoy have put so much effort in here! I think other commenters are covering most serious issues much better than me, so I’ll just add a usability note.

Without commenting on the deterministic hashing question…if the default behavior of init(seed:) is not going to be consistent from run to run, then it is not worth providing. As far as I can tell, it doesn’t do anything different from just immediately calling hasher.combine(bits: seed.0); hasher.combine(bits: seed.1), right? Like, it might be a slightly faster implementation, but it’s not really different as far as perturbing the initial state. Its presence would just add confusion for no real benefit.

/me doesn’t really understand modern hash functions, though, so maybe this is not the case.

3 Likes

The proposal looks amazing! Just a potentially stupid question:

Does the order of combining change the hash value?

hasher.combine(value1)
hasher.combine(value2)

hasher.combine(value2)
hasher.combine(value1)

Will these produce different hash values?

That question is not stupid! In fact, that’s already been mentioned here as part of the discussion :)

Afaik, the API doesn’t say.
But in general, if the order changes the hash, the hash is more secure. So I think we’ll be inclined to do that in the end.

2 Likes

Once you finalize, you’re not supposed to re-finalize nor combine new data, but there’s no flag for the finalization status!

Please add:

public struct Hasher {
    //...
    public var finalHash: Int? { get }
}

which returns the hash if the hasher has finalized, or nil if not yet. (I think the vast majority of people will compare against nil over getting the exact value. But it’s not Bool in case someone really is curious.)

This makes perfect sense as a completeness feature, but I suspect that if anyone actually writes code where there is ambiguity about whether a hasher has been finalized—or, by extension, where there is ambiguity about who is responsible for finalizing—that coder probably needs to get smacked upside the head.

Can you posit a scenario where this feature would be both useful and not a code smell?

3 Likes

The seed value, which SipHash and the rest of the crypto community seem to call the “key,” is a separate input to the hash function. The key is intrinsic to the design of the algorithm and since a key is always required, you might as well use it for its intended purpose.

I would imagine that in practical terms, hashing a couple of seed ints would probably have similar perturbation effects to specifying an initial seed. But we don’t really know for sure, because the design and cryptanalysis of SipHash are both predicated on the idea that you’ll modify the key to create independent hash functions that don’t have correlated outputs.

As you guess, using the key is also faster since it’s passed directly to SipHash as a bit pattern. The current proposal mixes it with the random global seed, but that’s probably a simple and cheap operation such as XOR. It’s certainly cheaper than hashing an additional 128 bits of content, which would require at least two SipHash rounds.

A interesting, related question (that I don’t know the answer to) is whether the bits in a final hash value are independent enough that it’s reasonable to divide one 64-bit hash value into two 32-bit hash values and then use those halves as if they were independent hash values for, say, a Bloom filter.

And that question raises the issue of what to do if you want to be sure that you get a full 64-bit hash value, given that Int has no particular size. It’d be nice if Hasher also defined

extension Hasher {
  public mutating func finalize() -> UInt64
}

so that you could do something like

let hashValue: UInt64 = hasher.finalize()

when you wanted to be sure you’d get back the full 64-bit hash value calculated by the algorithm.

1 Like

For this to work the supplied hasher has to a struct, unfortunately you can’t enforce that. Therefore wouldn’t this be asking for trouble.

I don’t know. Who is supposed to use this from the “I want to get a hash” side? Is it just for Set and Dictionary?

For a use: I could imagine passing a Hasher as inout to elements of a sequence where any element may arbitrarily finalize it after use. Then the outer code would check and create a new Hasher to pass to the remaining elements of the sequence. You would end up with a collection of Hasher instances. And, no, I can’t think of any reason to have that kind of code structure to start with. I just bugs me to say that you can’t violate something and then leave no way to check it.

1 Like

I raised the issue above of providing access to finalise and telling the user not to call it. Its like the old joke of having a button marked “DO NOT PRESS”. If you are not meant to call it then it should be split out into a seperate protocol.

Terms of Service

Privacy Policy

Cookie Policy