Complexity of Set.contains()

Martin · June 19, 2018, 12:58pm

The documentation for Set.contains(_:) claims that its complexity is O(1).

Is that really true? In the case of hash collisions, searching for a value requires to check multiple elements, starting with the initial bucket. As an extreme case, if all elements happen to have the same hash value then searching for a non-existent value may require to check all elements, that would be O(N).

How is the statement “Complexity is O(1)” to be interpreted?

zoul · June 19, 2018, 1:10pm

I think it’s simply the average case?

Martin · June 19, 2018, 1:32pm

Average of what?

var hashValue: Int { return 0 }

is a valid implementation of the Hashable protocol, but makes all set operations O(N). Or is there some assumption that all (2^64) possible hash values are in some sense equally likely?

zoul · June 19, 2018, 1:49pm

Exactly, see SUHA (computer science) - Wikipedia.

sbromberger · June 19, 2018, 2:56pm

Here's a similar discussion for the Julia language: Redirecting to Google Groups

Big-O notation is supposed to be the worst-case, isn't it?

Letan · June 19, 2018, 4:12pm

IIRC Ben from his dotSwift talk mentions that the docs do comment worst case. I’m sorry I don’t remember when exactly during the talk, but around the 10 min mark

zoul · June 19, 2018, 4:54pm

I’d rather say it’s an upper bound. And while we know that the general upper bound of a hash member lookup is linear (since you can’t do worse than simply walk through all the items crammed into a single bucket), it makes sense to reason about some average performance, where average relates to the number of collisions. And it turns out that when you amortize the infrequent collisions over the happy cases, the real-world upper bound of member lookup gets constant. (That’s the way I understand it, anyway, I’m a CS noob.)

Ben_Cohen · June 19, 2018, 5:19pm

It depends what you mean by average and by worst case.

For example, Array.append is amortized O(1), but sometimes an individual call to append will take O(n) because the array needs to grow. But But because of how Array's growth strategy works, you can think of this as kind of "averaging out" to O(1).

But that's not what's happening with Set. Here, I think the justification for the complexity is that it's not reasonable to have to consider the case where all values hash the same. The expectation is that all types must provide a high-quality hash. Now that Swift will auto-generate one for you, and makes it easy to write custom ones, anything else can be considered a bug in the conforming type (possibly this could be emphasized more in documentation). This assumption, combined with a decent implementation of Set, means it's OK to assume constant-time lookup, because collisions should be modest and Set should never deteriorate into just being an unordered list in practice.

This is similar to the requirement that your implementation of == be reflexive. If it isn't, then that will break Array's == under some circumstances. But that doesn't mean we need to sprinkle caveats into the documentation for Array, that would just end up spreading FUD.

yxckjhasdkjh · June 19, 2018, 6:35pm

While I agree that hash operations are, for all intents and purposes, O(1) in most practical situations, specific DDOS attacks taking advantage of badly implemented hash functions have happened in the past, e.g. in Ruby: Denial of service attack was found for Ruby's Hash algorithm (CVE-2011-4815)

So there might be value in somehow including the information that there is a theoretical worst-case that is not O(1). However I don't know how to do that without confusing inexperienced users - maybe just linking to the Wikipedia entry for hash tables would suffice.

sbromberger · June 19, 2018, 7:03pm

FWIW, I agree that a pathological case in which all elements hash to the same value is not a good basis for a measure of efficiency. I would guess that most people would expect set search to be O(1) (or something very close thereto); same with insertion even though the backing store might require a reallocation, which is typically not measured for the same reason array growth is considered an amortized cost.

anthonylatsis · June 19, 2018, 7:21pm

IIRC, a high-quality hash function that @Ben_Cohen mentions, which is expected to be provided for all types, is a function that has equal output probability for any hash value. Theoretically, this implies an expected average complexity of θ(1), which is what the documentation states.

Joe_Groff · June 19, 2018, 7:25pm

The new hashing implementation in Swift 4.2 is specifically hardened against this class of attack, both by using a strong hashing function and by using per-process and per-instance seeding to reduce determinism.

Martin · June 19, 2018, 8:21pm

All this makes much sense, I'd like to thank everybody for their responses.