Does `Dictionary.Keys.contains(_:)` have `O(1)` time complexity?

lorentey · September 13, 2021, 9:42pm

Dictionary.Keys implements contains efficiently, with the same lookup algorithm used by Dictionary's keying subscript, or Set.contains().

The expected number of hashing/equality operations performed by this method is an O(1) function, as long as the key type implements Hashable correctly. (I.e., by feeding all distinguishing bits into the hasher in its hash(into:) implementation.) The maximum number of operations performed is O(n), where n is the number of items in the dictionary.

That isn't quite how these asymptotic bounds work. For example, if f(n) = 1 almost all the time, except for once in a blue moon when f(n) = n, then f(n) is still O(n) -- as long as there isn't just a finite number of these slow cases.

(And not just because every O(1) function is also O(n), O(n^2), O(2^n) etc.)

Assuming that the contents of a hash table are "random enough" and that the hash function is "good enough", lookup operations are expected to perform O(1) hashing/equality operations. However, hash tables are probabilistic data structures: for every hash function, no matter how good, a tiny fraction of input data will unavoidably generate collision chains of a size that is proportional to the number of keys in the table. (Hashes map a (potentially) infinite set of values to a finite range of integers, so if the input type is "large enough", we can always find an arbitrarily large set if input values that map to the same hash value. The challenge is to make sure these will only rarely occur in practice.)

So technically, the cost of looking up items in a Set or Dictionary isn't an O(1) function -- the best upper bound is O(n). That said, with proper hashing, the expected value of the lookup cost is an O(1) function; linear-time lookups are vanishingly rare.

Hasher is (supposed to be) a "good enough" hash function even if the distribution of the input data is unknown. And with random seeding, it's (supposed to be) good enough even in the face of maliciously chosen data -- until and unless the seed becomes known to the attacker.

However, Set/Dictionary analysis is complicated by the fact that their performance tightly depends on the quality of the key's Hashable implementation. All bets are off if a key type implements hashValue rather than hash(into:), or if it feeds less data to the hasher than what it compares in ==. Generally, these mistakes make it trivial to generate collisions -- so they are a really, really bad idea if the input data isn't 100% trusted.

Would it be reasonable to expose this much detail in API docs? Sadly, no. FWIW, the OrderedSet docs attempt to provide guidance without delving too much into the "scary" details.

app below shows more fine-grained O() behaviour.

This is an excellent demonstration!

This behavior is also visible (albeit much less clearly) in benchmarking charts -- by the tell-tale sawtooth curve in elapsed time: