Does `Dictionary.Keys.contains(_:)` have `O(1)` time complexity?

i like that. e.g. 24-byte ranges from ten different locations, in accordance with some pseudo-random sequence unique per process. plus an environment variable to be able to switch it off in debug mode.
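
roughly something like this sketch (names and constants are made up; note this is still partial hashing, the very trade-off debated later in this thread):

import Foundation

// illustrative sketch only: hash ten 24-byte ranges picked by a
// per-process pseudo-random sequence instead of the whole value.
// (later posts in this thread argue that any partial scheme like this
// weakens Hashable's guarantees.)
struct SampledData: Hashable {
    var value: Data

    static func == (lhs: Self, rhs: Self) -> Bool { lhs.value == rhs.value }

    // one seed per process, chosen at startup
    private static let processSeed = UInt64.random(in: .min ... .max)

    func hash(into hasher: Hasher) {
        hasher.combine(value.count)
        guard value.count > 240 else {
            hasher.combine(value)          // small values: hash every byte
            return
        }
        var state = Self.processSeed
        for _ in 0..<10 {                  // ten 24-byte ranges
            state = state &* 6364136223846793005 &+ 1442695040888963407
            let offset = Int(state % UInt64(value.count - 24))
            let start = value.startIndex + offset
            hasher.combine(value.subdata(in: start ..< start + 24))
        }
    }
}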

i wonder how restrictive that actually is. maybe that's the case for 0.0001% of apps, and if you give people enough time to adopt (via deprecation warnings) that won't be an issue. plus, as a mitigation, the new tree-based alternatives to set and dictionary.

My point is that this restriction shouldn't be necessary — if Hasher takes a fixed-length subset of an unbounded value, it doesn't matter how large the value actually is. There's no downside to allowing it to exist in a hashing structure, since the hash function is guaranteed to be constant-time, and the risk is mitigated. What benefit do you have in mind here?

i am thinking of hash collision attacks done with a large number of "small" elements (be it Data, or other types like String/Array/etc. with the new "bounded" hashValue implementation). even with the new smart "we will use some bytes for the hash value but won't tell you which ones" implementation, it would be relatively easy to find the bytes that are not part of the hash (even experimentally, at runtime), create a large number of small hash-colliding elements, and thus perform the hash collision attack. with the simpler approach of "we use all bytes for the hash" but "you can't use long values for your dictionary keys / set elements", that sort of attack would not be possible to begin with.

I think it's one thing to be able to submit input to a program in order to try to effect results, but introspection of a program (from outside of it, or worse, from within it) is an entirely different beast. I have a strong feeling that if an attacker had such close access to a running process to be able to ascertain whether a particular insertion triggers a hash collision, they're already way past what you might be able to do to keep them out. So much so that in the general case, it might not even be worth trying to guard against.

But, maybe this becomes a tunable option on Hasher — a maximum on the number of bytes it hashes, along with a flag that indicates whether longer input should trap. I would strongly suggest it be off by default, but if this niche case is a concern, you could use your own Hasher and turn it on.
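
To make that concrete, here's a rough sketch expressed as a key wrapper rather than an actual Hasher option (no such option exists today; the names and limits are made up):

import Foundation

// Hypothetical sketch of the "cap the hashed length, optionally trap" idea.
// Hasher has no such setting today, so this models it as a wrapper key type.
struct LengthLimitedKey: Hashable {
    static let maxHashedBytes = 128        // illustrative limit
    static let trapOnOverlongInput = true  // illustrative flag

    var value: Data

    static func == (lhs: Self, rhs: Self) -> Bool { lhs.value == rhs.value }

    func hash(into hasher: Hasher) {
        if value.count > Self.maxHashedBytes, Self.trapOnOverlongInput {
            fatalError("Key exceeds the configured hashing limit")
        }
        // When not trapping, this still hashes everything, preserving
        // Hashable's semantic requirement.
        hasher.combine(value)
    }
}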

In any case, I think we've strayed pretty far from the original topic here. I hope this was a reasonable summary of the current state of things, and I'd be interested to see if @lorentey or someone else would be interested in expanding on Hasher functionality, but maybe that's a topic for a different thread.

(Sigh. I did say I did not want to talk about Data.)

This may be a (imo, highly questionable) expectation in some environments other than Swift. However, it's manifestly not the case with Hashable.

Swift's hashed collections are safe in the sense that as long as the key type properly implements Hashable, they guarantee that lookups cost O(1) hashing/equality operations, no matter the distribution of the data.

By properly implementing hashing, I mean that the key feeds exactly the same bits to the Hasher as it compares in ==.

There are no ifs or buts about this -- that's what the protocol requires, and this is how String, Array, etc. hash themselves in the Standard Library.
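
For illustration, a conformance that satisfies this requirement looks something like the following (the type itself is made up):

// Every property that participates in == is fed to the Hasher, in full.
struct Measurement: Hashable {
    var name: String
    var samples: [Double]

    static func == (lhs: Self, rhs: Self) -> Bool {
        lhs.name == rhs.name && lhs.samples == rhs.samples
    }

    func hash(into hasher: Hasher) {
        hasher.combine(name)       // the whole string, not a prefix
        hasher.combine(samples)    // every element, in order
    }
}

(For a struct like this, the compiler-synthesized conformance does exactly the same thing.)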

As @tera's example demonstrates, Data's current hash(into:) implementation is broken.

My apologies, but this is simply incorrect. This isn't how hashed collections work in Swift.

Lookup operations in Set or Dictionary hash the supplied key exactly once, then they execute as many equality checks as necessary to search through the corresponding collision chain. (The same goes for insertions; the hash table sometimes needs to be resized there, but the costs of that are amortized away.)
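
In pseudocode-ish Swift, the lookup strategy is roughly the following toy sketch (linear probing; not the actual Set/Dictionary implementation):

// Toy sketch: one hash per lookup, then == along the collision chain.
// Assumes `buckets.count` is a power of two and the table has empty slots.
func lookup<Key: Hashable>(_ key: Key, in buckets: [Key?]) -> Int? {
    var index = key.hashValue & (buckets.count - 1)   // hash exactly once
    while let candidate = buckets[index] {
        if candidate == key { return index }          // equality check
        index = (index + 1) & (buckets.count - 1)     // next slot in the chain
    }
    return nil                                        // hit an empty slot
}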

In order to achieve our performance goals for Set and Dictionary, we must keep the expected length of the collision chain as low as possible. The only way to guarantee that is through proper hashing.

On the contrary, there is no expectation that types conforming to Hashable must hash in bounded time. (Just like there is no such expectation for ==.)

In Swift, hashing and equality have the same worst-case performance. On average, hashing is often slower, because it always needs to scan the entire value, while equality checks can bail out at the first difference. (Note that this does not mean that Swift's hash tables are somehow inefficient.)

(Note that it is possible to implement == in terms of hashing -- indeed, that's how we validate Hashable implementations in the Swift test suite.)

I find this argument entirely unconvincing. String, Array, Data can and in fact do grow to arbitrary sizes. This is not at all a problem for Set or Dictionary -- they provide excellent performance.

It does raise the question though -- who is putting multi-megabyte Arrays in a Set and why? Huge hash keys rarely if ever occur in regular practice. Optimizing this rare case to the detriment of the security of the general case seems like a questionable choice to me.

If the keys are received from an untrusted source, then it's absolutely crucial to be able to trust that hash insertions/lookups will execute in reasonable time.

For instance, take this loop:

var stuff: Set<Data> = []
for data in chunksReceivedFromNetwork {
  stuff.insert(data)
}

With the current (incorrect) Data.hash(into:) implementation, a clever adversary can trivially use @tera's hash collisions to make this loop take time that's quadratic in the number of chunks received. They don't even need to send multi-gigabyte chunks -- 84 bytes or so per chunk will do just fine.

If Data had correct hashing, this loop would be guaranteed to finish in time that's linear in the overall number of bytes received -- the chances of a huge collision chain would be vanishingly low, no matter how cleverly the input was chosen.

The way I look at it, it's exceedingly rare for such colliding Data to happen outside of HashDoS attacks.

Like Swift itself, the Standard Library's Set and Dictionary are designed to help people write code that needs to process hostile input. It seems counter-productive to intentionally weaken this stance, for no particularly good reason.

There are no plans to weaken hashing in Swift in the general case. (There is plenty of evidence that this would be a terrible idea; there is no evidence that it would be beneficial.)

We did discuss a scheme where small hash tables (say, fewer than a hundred or so items) switch to incomplete hashing. That could be an interesting optimization at some point!

However, Hasher is fast enough in practice, and keys are generally small enough in practice, that implementing this isn't a priority. (In fact, attempts at complicating hashing usually just make Set/Dictionary slower.)

(On large contiguous keys like Data, Hasher is currently about twice as slow as simply reading memory. And we have plenty of ways to speed it up if this ever becomes a problem.)

Can you show me a use case where someone would intentionally use such a humungous Data as a dictionary key? I can certainly imagine using such data in dictionary values, but in keys? Why?

Even if Data implemented hashing the way NSArray does (it's lightning fast!¹), successfully looking up such a key would generally need to perform an equality check that fully compares contents. (Unless we are looking up the exact same instance, in which case why not just use an ObjectIdentifier or a (baseAddress, length) pair as the key?)

If profiling data indicates hashing performance is a bottleneck², and if we are confident that our input data will never be malicious, then it's super easy to customize hashing by defining a trivial wrapper type.

import Foundation

struct MyKey: Hashable {
  var value: Data
  static func ==(left: Self, right: Self) -> Bool {
    left.value == right.value
  }
  func hash(into hasher: Hasher) {
    // On careful consideration, I concluded that this is okay,
    // because I am confident that the number of keys that
    // only differ after the first 42 bytes will always be strictly
    // limited by some small constant.
    hasher.combine(value.prefix(42))
  }
}

If I have to do this, then I may grumble a bit about all the boilerplate, but then I implement it and move on. Later on, when it inevitably turns out that my analysis was wrong, and my app, server or daemon hangs in some Dictionary resize operation, it'll be relatively easy to figure out what's going on -- the code is quite explicit about the mess it's making.

On the other hand, if Data arbitrarily makes such decisions behind my back, and the decision proves harmful, then I'll need to go through a difficult debugging process, and at the end of it I'll feel betrayed by my tools. Swift promises to be a safe programming language; such unpleasant surprises break that promise, and drive people to other languages.

Partial hashing achieves a small constant speedup in the best case, but in exchange it generates a hugely increased risk of worst-case behavior -- and I think this is a bad tradeoff to make in general-purpose API.

Footnotes:
¹ Note: This is intended to be a jab at the notion of O(1) hashing, not at NSArray or Foundation's other Objective-C container types. (Foundation has valid reasons to keep these unfortunate definitions for these ancient types. Those reasons do not apply to Data.)

² In all these years, I only recall seeing such problems in code that accidentally bridged back-and-forth between Dictionary and NSDictionary in a tight loop -- a quite different problem.

One silly (?) scheme to mitigate this that I came up with a while ago when looking at NSString hashing is to have a flag in Dictionary "high collision rate detected" that switches to an alternate "thorough hash" when set. I don't think it's something we should contemplate without a serious real life motivating example though. As @lorentey says, things seem to be generally fine with the current setup for String and the like.

It's not silly at all! Starting with a simplified Hasher and rehashing when an overlong chain is created would be very interesting to try. (The tricky part is doing this without visibly slowing down regular insertions -- e.g., normally insertions don't need to figure out where their chain starts, only where it ends.)

i'm actually happy i smelled something suspicious, started questioning "what's wrong with Data", and thus we opened this Pandora's box... for swift to be serious on the server side we need to spell these things out (and should not be ashamed to look at how these issues are addressed in other languages). that NSData was "fine" with the 80-byte hash trick is merely because no-one was seriously using Objective-C / Foundation in server-side environments "at scale" -- environments which are often the subject of targeted malicious attacks.

you did put everything very well! please clarify just a few things:

you mean to implement == in terms of hashing alone, without using anything else? but how? i think it is "valid" for a hash function to, say, always return 0 (it would be unwise and awfully bad! but still "valid" so long as it doesn't break the basic requirement "a == b --> a.hash == b.hash"). or do you mean just the obvious optimisation step: "if the hashes are not equal then return false, otherwise take the slow EQ path"?

for the rest of us, what are you talking about here? and how does incomplete hashing square with what you've just said about the necessity of looking at all bits of a value for a proper hash function?

that's interesting. do you think it would be possible to make it as fast as reading? this looks like a RAM-I/O-bound task to me.

:rofl: it is indeed. so is NSDictionary. it means that if we use NSArray/NSDictionary as NSSet¹ elements or as another NSDictionary's keys, we become open to trivial hash collision attacks... interesting.. hope that when we bridge NSArray/NSDictionary to swift types we "fix" their hash values (e.g. a Dictionary's hash is something like a combination of its count and the hash values of all its keys and values).

(¹ NSSet is kind of broken anyway, as you can put an NSMutableAnything in it and then mutate it, invalidating the set's invariants).

We can implement == by hashing with a handful of different seeds until we find a mismatch, or until we're confident enough that the values are equal. (As long as Hasher works as expected, the probability of a false positive will be vanishingly low after a few of these tries -- certainly lower than the likelihood of, say, a cosmic particle causing a misbranch.)

We use this in code that validates Hashable conformances. Hashable conformances that do not feed all distinguishing bits to the hasher (or that only implement hashValue, not hash(into:)) do not fulfill the protocol's semantic requirements, so they are expected to fail this test.

(Violating semantic requirements isn't necessarily the end of the world, but if something breaks, then it's usually considered the type's fault, not the code that relies on them.)

(Note that the test is not foolproof, because it can only work with the samples that it is given. Types with bad hashing may pass the test, but types with good hashing will never fail it.)
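
Concretely, the technique is roughly the following sketch — Hasher's seed isn't publicly settable, so this version perturbs the input with a salt instead:

// Probabilistic equality check: any mismatch proves the values differ;
// agreement across all rounds makes equality overwhelmingly likely,
// provided the conformance feeds all distinguishing bits to the hasher.
func probablyEqual<T: Hashable>(_ a: T, _ b: T, rounds: Int = 8) -> Bool {
    for salt in 0 ..< rounds {
        var ha = Hasher(); ha.combine(salt); ha.combine(a)
        var hb = Hasher(); hb.combine(salt); hb.combine(b)
        if ha.finalize() != hb.finalize() { return false }
    }
    return true
}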

It doesn't matter if a hash table reduces to a super slow version of an unsorted array, as long as there are only a few items in it. (In fact, it's often a good strategy to entirely disable hashing under, say, 4-16 keys, like OrderedSet does.)

Yes, and yes. In practice, this is mostly an irrelevant curiosity, as keys are almost always tiny, and hashing costs are dominated by finalization. (Or, in the case of String, Unicode normalization.)

(Edit: "irrelevant curiosity" was a bit too strongly worded -- it's just that I haven't seen a use case where it matters yet. We have many options to improve throughput if it ever becomes a problem. Generally, latency on short input proved far more important so far, so that's where most of the optimization effort was spent. (FWIW, the actual underlying hasher implementation is important, but it doesn't seem to matter as much as one might expect: the largest improvement during the initial implementation effort was due to reducing the number of opaque function calls for commonly hashed types, rather than low-level bitwise trickery.))

We do -- bridged NS collection types hash the exact same way as a native Swift value. (Which should not be a surprise, as a bridged NSArray and a native Array containing the same items compare equal to each other!)

While we are talking about "fun" edge cases, it would be a shame not to mention the fact that String and NSString have different concepts of equality -- String's equality uses Unicode normalization, while NSString's doesn't. (For a nice stroll into despair, consider what this means for string-keyed dictionaries bridged over from Cocoa.)
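
A quick illustration of that difference:

import Foundation

let precomposed = "\u{E9}"    // "é" as a single scalar
let decomposed = "e\u{301}"   // "e" followed by a combining acute accent

print(precomposed == decomposed)                          // true  — String uses canonical equivalence
print((precomposed as NSString).isEqual(to: decomposed))  // false — NSString compares code units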

it would mean that an NSDictionary bridged to swift would potentially have fewer elements than the original NSDictionary :slight_smile: interestingly, it would also mean that an NSDictionary "round-tripped" to swift and back would... no longer be equal to itself.. weird. yet a round-tripped NSString is still equal to itself.

Thanks for taking the time to write such a thorough and detailed response, Karoy! Some assorted thoughts:

  • Thanks for clearing up some of my misconceptions about hashing in Swift. Some of the details I gave were plainly incorrect, and in all cases I should have been significantly more specific about what I'd meant

  • I'm in agreement that I'd love to get rid of the arbitrary 80-byte limit on Data. I don't want types like Data making decisions behind anyone's back, and I certainly don't want to weaken any hashing guarantees in Swift

  • Given the details you lay out, I would personally find it valuable to even more pointedly document the following:

    The current documentation does state that you should pass all "essential" properties used to compare equality into Hasher.combine(_:), but the requirement is actually significantly stricter in order to provide the guarantees that we do. Coming from other languages, it's tempting to assume that a hash implementation is just a performance detail so long as it fulfills x == y ⇒ hash(x) == hash(y); but semantically, any bit used to determine x == y must also be used to determine hash(x) and hash(y). This is slightly less clear to me from the docs, and might benefit from being stated explicitly as a semantic requirement. (But this may just be me.)

The docs for hash(into:) itself are at least a bit more direct and avoid the "essential" terminology:

The components used for hashing must be the same as the components compared in your type’s == operator implementation. Call hasher.combine(_:) with each of these components.

I've also complained in the past about the lack of clarity in the documentation around Hashable, so I agree it would be great to beef it up with better explanation about the semantic requirements (and perhaps a brief elaboration on why the requirements are the way they are, as discussed in this thread and others on the forums).

Also just want to say that this has been a great thread, have really enjoyed reading everyone's contributions.

Yep, I reviewed hash(into:) while writing my post to make sure I wasn't missing anything egregious, and found the sticking point for me to be the same: the slight vagueness around what "components" means. It's not enough to use the same properties for hashing as equality if you subset those properties for hashing (e.g., taking a certain number of bits) — you really must pass all bits to both in the same order.

[I do think that this can be very difficult to convey in a clear way to documentation readers of various levels of experience, but it should be possible to at least show some high-level examples of what not to do (e.g., passing in only the first `n` elements of an array while comparing the whole array in equality, as in the sketch below).]
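
A minimal sketch of that mistake, using a made-up type:

// What NOT to do: equality compares the whole array, but hashing
// only covers a prefix, so keys sharing their first four elements
// always collide.
struct BadKey: Hashable {
    var values: [Int]

    static func == (lhs: Self, rhs: Self) -> Bool {
        lhs.values == rhs.values          // compares every element
    }

    func hash(into hasher: Hasher) {
        hasher.combine(values.prefix(4))  // BROKEN: ignores the rest
    }
}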

Adding a concrete example like that would be a really good idea.

A lesser concern that only matters for variable-sized types, and that the documentation does not mention at all right now, is that collisions should not be possible to produce even when hash encodings are concatenated together — so, e.g., (“Foo”, “Bar”), (“Foobar”, “”), and (“”, “Foobar”) should not feed the same bits to the Hasher. (This would come up with the Array example, and it would probably need to be explained.)
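
A sketch of what that means for a hand-rolled conformance that hashes raw bytes itself (if I recall correctly, standard library types like String and Array already include their own length discriminators):

struct BytePair: Hashable {
    var first: [UInt8]
    var second: [UInt8]

    static func == (lhs: Self, rhs: Self) -> Bool {
        lhs.first == rhs.first && lhs.second == rhs.second
    }

    func hash(into hasher: Hasher) {
        // Combining each component's length acts as a separator, so that
        // shifting bytes across the boundary (as in ("Foobar", "") vs
        // ("Foo", "Bar")) cannot feed the hasher the same sequence.
        hasher.combine(first.count)
        first.withUnsafeBytes { hasher.combine(bytes: $0) }
        hasher.combine(second.count)
        second.withUnsafeBytes { hasher.combine(bytes: $0) }
    }
}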

Beware, this is also true for Set and Dictionary when they contain keys or elements with reference semantics. Cocoa handles this with NSCopying; Swift prefers encouraging the use of value types instead.
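
A contrived illustration:

// Mutating a reference-typed element after insertion silently breaks
// the Set's invariants: the element was stored under its old hash.
final class Box: Hashable {
    var number: Int
    init(_ number: Int) { self.number = number }

    static func == (lhs: Box, rhs: Box) -> Bool { lhs.number == rhs.number }
    func hash(into hasher: Hasher) { hasher.combine(number) }
}

let box = Box(1)
var set: Set<Box> = [box]
box.number = 2
print(set.contains(box))     // almost certainly false: looks in the wrong bucket
print(set.contains(Box(1)))  // also false: the stored element no longer compares equal

The element is effectively lost until the table happens to notice the inconsistency, if it ever does.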

ouch! that's quite bad, and even more harmful than what we discussed just before (as it is more frequently used). another Pandora's box.. :slight_smile:

NSCopying is used for NSDictionary keys, so that one was fine. but it wasn't (and couldn't be) used for NSSet elements, so only NSSet was (and is) "conceptually broken" in Objective-C's "NSMutableType: NSImmutableType" object model.

but now in swift, not just Set (arguably the less frequently used thing) but also Dictionary is broken when it contains keys with reference semantics.. is there a good foolproof remedy, or just the guideline "don't use keys with reference semantics"? seems too easy to shoot yourself in the foot! how about a new debug switch / environment variable / etc. that will trap in this case? that's for a start, until someone thinks of a better compile-time magic check..

Mutation of reference-typed keys is less of a problem in practice than it may appear — I can’t recall ever coming across a case in Swift code where this was an issue. (Other than in bad attempts at creating a weak-keyed dictionary.)

FWIW Dictionary and Set trap with a very obvious crash report if they happen to detect such issues (during resizing, for example). However, this is very nondeterministic, and the trap usually happens far after the original mutation. Detecting such problems with complete certainty sadly doesn’t seem practical.

i created a new ValueType protocol topic to discuss issues related to reference dictionary keys / set elements.

re the other spin-off discussion related to Data's 80-byte hashValue optimisation - i believe we discussed it thoroughly here, and will leave any further steps up to others (is it just a "bug report" to be filed and then resolved, or is a pitch in order, etc.). if needed (?) we can create another topic specific to that issue.

sorry for hijacking this thread! :wink:
