One silly (?) scheme to mitigate this that I came up with a while ago when looking at NSString hashing is to have a flag in Dictionary "high collision rate detected" that switches to an alternate "thorough hash" when set. I don't think it's something we should contemplate without a serious real life motivating example though. As @lorentey says, things seem to be generally fine with the current setup for String and the like.
It's not silly at all! Starting with a simplified Hasher and rehashing when an overlong chain is created would be very interesting to try. (The tricky part is doing this without visibly slowing down regular insertions -- e.g., normally insertions don't need to figure out where their chain starts, only where it ends.)
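A minimal sketch of that idea, under some big assumptions (a toy open-addressing table with made-up names and thresholds, nothing like the real `Dictionary` internals): count the probe length on insertion, and once it exceeds a threshold, reseed the hasher and rehash everything.

```swift
// Toy open-addressing hash table that reseeds itself when an insertion
// probes past a fixed threshold (a crude "high collision rate" detector).
// Purely illustrative; the real Dictionary is far more sophisticated.
struct ReseedingTable<Key: Hashable> {
    private var slots: [Key?]
    private var seed = 0
    private let maxProbe = 16

    init(capacity: Int = 8) {
        slots = Array(repeating: nil, count: max(capacity, 8))
    }

    private func bucket(for key: Key) -> Int {
        var hasher = Hasher()
        hasher.combine(seed)   // per-table salt on top of the global seed
        hasher.combine(key)
        let h = hasher.finalize()
        return ((h % slots.count) + slots.count) % slots.count
    }

    mutating func insert(_ key: Key) {
        var index = bucket(for: key)
        var probes = 0
        while let existing = slots[index] {
            if existing == key { return }
            index = (index + 1) % slots.count
            probes += 1
            if probes > maxProbe {      // suspiciously long chain detected
                reseedAndRehash()
                insert(key)
                return
            }
        }
        slots[index] = key
    }

    private mutating func reseedAndRehash() {
        let keys = slots.compactMap { $0 }
        seed = Int.random(in: Int.min ... Int.max)
        slots = Array(repeating: nil, count: slots.count * 2)
        for key in keys { insert(key) }
    }
}
```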
I'm actually happy I smelled something suspicious, started questioning "what's wrong with Data", and thus we opened this Pandora's box... For Swift to be taken seriously on the server side we need to spell these things out (and should not be ashamed to look at how these issues are addressed in other languages). That NSData was "fine" with the 80-byte hash trick is merely because no one was seriously using Objective-C / Foundation in server-side environments "at scale", environments which are often the target of malicious attacks.
You put everything very well! Please clarify just a few things:
You mean to implement == in terms of hashing only, without using anything else? But how? I think it is "valid" for a hash function to, say, always return 0 (it would be unwise and awfully bad, but still "valid" as long as it doesn't break the basic requirement "a == b --> a.hash == b.hash"). Or do you mean just the obvious optimisation step "if the hashes are not equal then return false, otherwise take the slow == path"?
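For concreteness (a made-up type, not a recommendation), this is the kind of "valid but awful" hash I mean, plus the fast-path reading of the requirement:

```swift
struct AlwaysZero: Hashable {
    var payload: [Int]

    static func == (lhs: AlwaysZero, rhs: AlwaysZero) -> Bool {
        lhs.payload == rhs.payload
    }

    // Trivially satisfies "a == b implies equal hashes", because every
    // value hashes the same -- but any hash table using it degrades into
    // a linear scan over one giant collision chain.
    func hash(into hasher: inout Hasher) {
        hasher.combine(0)
    }
}

// The "obvious optimisation" reading: unequal hashes prove inequality,
// so the slow element-by-element comparison can be skipped.
func definitelyNotEqual<T: Hashable>(_ a: T, _ b: T) -> Bool {
    a.hashValue != b.hashValue
}
```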
For the rest of us, what are you talking about here? And how does incomplete hashing relate to what you've just said about the necessity of looking at all the bits of a value for a proper hash function?
That's interesting. Do you think it would be possible to make it as fast as reading? This looks like a RAM I/O-bound task to me.
It is indeed. So is NSDictionary. It means that if we use NSArray/NSDictionary as NSSet¹ elements or as keys in another NSDictionary, we become open to trivial hash collision attacks... Interesting. I hope that when we bridge NSArray/NSDictionary to Swift types we "fix" their hash values (e.g. a Dictionary's hash is something like a combination of its count and the hash values of all its keys and values).
(¹ NSSet is kind of broken anyway, as you can put an NSMutableAnything into it and then mutate it, invalidating the set's invariants.)
We can implement `==` by hashing with a handful of different seeds until we find a mismatch, or until we're confident enough that the values are equal. (As long as `Hasher` works as expected, the probability of a false positive will be vanishingly low after a few of these tries -- certainly lower than the likelihood of, say, a cosmic particle causing a misbranch.)
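A rough sketch of how such a probabilistic check could look (the helper name is hypothetical and the standard library's real validation code differs; since `Hasher` has no public seeding API, the sketch salts each round by combining a round number first):

```swift
// Probabilistic equality check: values that hash identically under several
// independent salts are almost certainly equal -- provided the Hashable
// conformance feeds all distinguishing bits to the hasher.
func probablyEqual<T: Hashable>(_ a: T, _ b: T, rounds: Int = 4) -> Bool {
    for round in 0..<rounds {
        var ha = Hasher()
        var hb = Hasher()
        ha.combine(round)   // per-round salt, standing in for a real seed
        hb.combine(round)
        ha.combine(a)
        hb.combine(b)
        if ha.finalize() != hb.finalize() {
            return false    // any mismatch proves a != b
        }
    }
    return true             // no mismatch found across all salts
}
```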
We use this in code that validates `Hashable` conformances. `Hashable` conformances that do not feed all distinguishing bits to the hasher (or that only implement `hashValue`, not `hash(into:)`) do not fulfill the protocol's semantic requirements, so they are expected to fail this test.
(Violating semantic requirements isn't necessarily the end of the world, but if something breaks, it's usually considered the type's fault, not the fault of the code that relies on it.)
(Note that the test is not foolproof, because it can only work with the samples that it is given. Types with bad hashing may pass the test, but types with good hashing will never fail it.)
It doesn't matter if a hash table reduces to a super slow version of an unsorted array, as long as there are only a few items in it. (In fact, it's often a good strategy to entirely disable hashing under, say, 4-16 keys, like OrderedSet does.)
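A toy illustration of that strategy (threshold and names are made up; this is not how OrderedSet is actually implemented, just the general idea):

```swift
// Toy set that skips hashing entirely below a small element count,
// falling back to a plain linear scan over an array.
struct HybridSet<Element: Hashable> {
    private var small: [Element] = []
    private var large: Set<Element>? = nil
    private let threshold = 16   // illustrative cutoff

    mutating func insert(_ element: Element) {
        if var set = large {
            set.insert(element)
            large = set
        } else if small.contains(element) {
            return
        } else {
            small.append(element)
            if small.count >= threshold {   // switch over to real hashing
                large = Set(small)
                small.removeAll()
            }
        }
    }

    func contains(_ element: Element) -> Bool {
        large?.contains(element) ?? small.contains(element)
    }
}
```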
Yes, and yes. In practice, this is mostly an irrelevant curiosity, as keys are almost always tiny, and hashing costs are dominated by finalization. (Or, in the case of String, Unicode normalization.)
(Edit: "irrelevant curiosity" was a bit too strongly worded -- it's just that I haven't seen a use case where it matters yet. We have many options to improve throughput if it ever becomes a problem. Generally, latency on short input proved far more important so far, so that's where most of the optimization effort was spent. (FWIW, the actual underlying hasher implementation is important, but it doesn't seem to matter as much as one might expect: the largest improvement during the initial implementation effort was due to reducing the number of opaque function calls for commonly hashed types, rather than low-level bitwise trickery.))
We do -- bridged NS collection types hash the exact same way as a native Swift value. (Which should not be a surprise, as a bridged NSArray and a native Array containing the same items compare equal to each other!)
While we are talking about "fun" edge cases, it would be a shame not to mention the fact that String and NSString have different concepts of equality -- String's equality uses Unicode normalization, while NSString's doesn't. (For a nice stroll into despair, consider what this means for string-keyed dictionaries bridged over from Cocoa.)
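A small illustration of that difference (both literals spell "café", once with the precomposed U+00E9 and once with "e" plus combining U+0301):

```swift
import Foundation

let precomposed = "caf\u{E9}"    // é as a single scalar
let decomposed  = "cafe\u{301}"  // e followed by a combining acute accent

// Swift String equality is based on Unicode canonical equivalence:
print(precomposed == decomposed)                          // true

// NSString compares the underlying UTF-16 code units literally:
print((precomposed as NSString).isEqual(to: decomposed))  // false
```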
It would mean that an NSDictionary bridged to Swift could potentially have fewer elements than the original NSDictionary. Interestingly, it would also mean that an NSDictionary "round-tripped" to Swift and back would... no longer be equal to itself. Weird. Yet a round-tripped NSString is still equal to itself.
Thanks for taking the time to write such a thorough and detailed response, Karoy! Some assorted thoughts:
- Thanks for clearing up some of my misconceptions about hashing in Swift. Some of the details I gave were plainly incorrect, and in all cases I should have been significantly more specific about what I'd meant.
- I'm in agreement that I'd love to get rid of the arbitrary 80-byte limit on `Data`. I don't want types like `Data` making decisions behind anyone's back, and I certainly don't want to weaken any hashing guarantees in Swift.
- Given the details you lay out, I would personally find it valuable to even more pointedly document the following:
The current documentation does state that you should pass all "essential" properties you use to compare equality into `Hasher.combine(_:)`, but the requirement is actually significantly stricter in order to provide the guarantees that we do. Coming from other languages, it's not enough to assume that a hash implementation is just a performance detail so long as you fulfill `x == y ⇒ hash(x) == hash(y)`; rather, semantically, any bit used to determine `x == y` must be used to determine `hash(x)` and `hash(y)`. This is slightly less clear to me from the docs, and might benefit from being stated explicitly as a semantic requirement. (But this may just be me.)
The docs for `hash(into:)` itself are at least a bit more direct and avoid the "essential" terminology:

> The components used for hashing must be the same as the components compared in your type's `==` operator implementation. Call `hasher.combine(_:)` with each of these components.
I've also complained in the past about the lack of clarity in the documentation around `Hashable`, so I agree it would be great to beef it up with better explanation about the semantic requirements (and perhaps a brief elaboration on why the requirements are the way they are, as discussed in this thread and others on the forums).
Also just want to say that this has been a great thread, have really enjoyed reading everyone's contributions.
Yep, I reviewed `hash(into:)` while writing my post to make sure I wasn't missing anything egregious, and found the sticking point for me to be the same: the slight vagueness around what "components" means. It's not enough to use the same properties for hashing as equality if you subset those properties for hashing (e.g., taking a certain number of bits) — you really must pass all bits to both in the same order.
[I do think that this can be very difficult to convey in a clear way to documentation readers of various levels of experience, but it should be possible to at least show some high-level examples of what not to do (e.g., pass in the first `n` elements of an array if you compare the whole array in equality).]
Adding a concrete example like that would be a really good idea.
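Something along these lines, perhaps (a made-up type sketching the anti-pattern, not from any real codebase):

```swift
struct Route: Hashable {
    var waypoints: [String]

    static func == (lhs: Route, rhs: Route) -> Bool {
        lhs.waypoints == rhs.waypoints        // compares the whole array
    }

    // BAD: hashes only a prefix of what `==` compares. Any two routes that
    // share their first three waypoints but differ later always collide,
    // which plain bad luck -- or an attacker -- can exploit.
    func hash(into hasher: inout Hasher) {
        hasher.combine(waypoints.prefix(3))
    }

    // GOOD: feed everything that `==` looks at, in the same order.
    // func hash(into hasher: inout Hasher) {
    //     hasher.combine(waypoints)
    // }
}
```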
A lesser concern that only matters for variable-sized types, and that the documentation does not at all mention right now, is that collisions should not be possible to produce even if hash encodings are concatenated together — so e.g. `("Foo", "Bar")`, `("Foobar", "")` and `("", "Foobar")` should not feed the same bits to the `Hasher`. (This would come up with the Array example, and it would probably need to be explained.)
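For a hand-rolled conformance that feeds raw bytes, one way to rule those collisions out is to delimit each variable-length component, e.g. by combining its length first. (When you simply `combine` whole `String` or `Array` values, their standard conformances are designed to avoid this; the sketch below uses made-up names and is only about the raw-bytes case.)

```swift
struct Pair: Hashable {
    var first: [UInt8]
    var second: [UInt8]

    static func == (lhs: Pair, rhs: Pair) -> Bool {
        lhs.first == rhs.first && lhs.second == rhs.second
    }

    func hash(into hasher: inout Hasher) {
        // Combining each component's length before its bytes keeps
        // ("Foo", "Bar"), ("Foobar", "") and ("", "Foobar") from feeding
        // the Hasher identical byte streams.
        hasher.combine(first.count)
        first.withUnsafeBytes { hasher.combine(bytes: $0) }
        hasher.combine(second.count)
        second.withUnsafeBytes { hasher.combine(bytes: $0) }
    }
}
```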
Beware, this is also true for `Set` and `Dictionary` when they contain values with reference semantics. Cocoa handles this with NSCopying; Swift prefers encouraging the use of value types instead.
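For example (a deliberately bad, hypothetical key type, just to show the failure mode):

```swift
// A mutable reference type used as a Dictionary key.
final class Name: Hashable {
    var value: String
    init(_ value: String) { self.value = value }

    static func == (lhs: Name, rhs: Name) -> Bool { lhs.value == rhs.value }
    func hash(into hasher: inout Hasher) { hasher.combine(value) }
}

let key = Name("alice")
let scores = [key: 10]           // [Name: Int]

key.value = "bob"                // mutates the key while it sits in the table

print(scores[Name("alice")] as Any)  // nil: the stored key no longer compares equal
print(scores[Name("bob")] as Any)    // typically nil too: the entry sits in the
                                     // bucket computed from the old hash
```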
Ouch! That's quite bad, and even more harmful than what we discussed just before (as it is more frequently used). Another Pandora's box...
NSCopying is used for NSDictionary keys, so that one was fine. But it wasn't (and couldn't be) used for NSSet elements, so only NSSet was (and is) "conceptually broken" in Objective-C's "NSMutableType: NSImmutableType" object model.
But now in Swift, not just Set (arguably the less frequently used thing) but also Dictionary is broken when it contains keys with reference semantics. Is there a good, foolproof remedy, or just the guideline "don't use keys with reference semantics"? It seems too easy to shoot yourself in the foot! How about a new debug switch / environment variable / etc. that traps in this case? That's for a start, until someone thinks of a better compile-time magic check.
Mutation of reference-typed keys is less of a problem in practice than it may appear — I can’t recall ever coming across a case in Swift code where this was an issue. (Other than in bad attempts at creating a weak-keyed dictionary.)
FWIW, `Dictionary` and `Set` trap with a very obvious crash report if they happen to detect such issues (during resizing, for example). However, this is very nondeterministic, and the trap usually happens long after the original mutation. Detecting such problems with complete certainty sadly doesn’t seem practical.
I created a new topic, On purity grades of value types [was: ValueType protocol], to discuss issues related to reference-typed dictionary keys / set elements.
Re the other spin-off discussion related to Data's 80-byte hashValue optimisation: I believe we discussed it thoroughly here and will leave any further steps up to others (is it just a "bug report" to be filed and then resolved, or is a pitch in order, etc.). If needed, we can create another topic specific to that issue.
Sorry for hijacking this thread!