I have a stupid question. Why in Swift Documentation is not say that Set and Dictionary is based on HashTable. If i am newbie and did not hear about algorithms and about hashtable is hard to understand how Set and Dic are working. Its important to mentions in first sentence that it is a based on Hashtable. For example if hash value are same how Set know about is this element equal. And if i know that under the hood it is hashtable i go and read about hash collision. What do you thing about?
Hey Bogdan, welcome to the forums!
The first line of the "Overview" for Dictionary
is:
A dictionary is a type of hash table, providing fast access to the entries it contains.
Though honestly, I'm not sure it belongs there.
A common principle in programming is to emphasize interfaces (what things do) rather than implementations (what things are).
A dictionary is an unordered collection that provides fast access to some values by some unique key. This is all the documentation should commit to, because:
- This is what the library consumer wants to know: "What is this, and what can I do with it?"
- This is liberating to the library author, who now has the flexibility to use any data structure that fits the promises made by the documentation (namely,
O(1)
key access).
That sounds like an abstract concern, but it comes up in the real world. For example, for small enough sets/dictionaries, it's actually faster to do a linear scan through the elements in an array, than it is to compute hash values, chase pointers, etc.
So a really good implementation might be a hybrid of an array for small sizes, and a hash table for large sizes. This is a niche performance optimization detail, which doesn't belong in the documentation, IMO. And if that belongs there, then what other details should we mention? Side chaining vs double hashing?
Why in Swift Documentation is not say that Set and Dictionary is based on HashTable.
On the contrary, imagine a new programming coming to the documentation, and wondering "What the hell is a hash map?" It sounds like a geographic map of a certain plant.
If youâre a newbie you probably should not care about hash collisions. If you want to implement a specific thing down the line you will find out about it during research.
Also I donât think implementation details should be in the docs. The docs should mention things you need to know to use that component (Although some Swift Documentation is not very detailed in that regard either) and possibly some warnings if the implementation has some caveats you should be aware of.
If you are interested in how a thing works, you can always look up the sources and find detailed comments and the implementing itself there.
My understanding is hash collisions don't break hash tables. Finding an entry is a two step process of checking the hash (which locates the right bucket) then doing an equality comparison. A hash collision might slightly affect performance because a hash mismatch short-circuits the equality compare (no need to call ==
if the hash is different) but it won't break any logic.
This is why it's not part of documentation that this implementation is used. Semantically, a Set
or Dictionary
holds unique values or keys, where "unique" is defined by the result of ==
, not the hash. And that's what would define equality for any other implementation of a unique container. The only relevant part to users is the requirement to conform to Hashable
instead of just Equatable
.
In this way Hashable
is not the same as Identifiable
, and it's why implementing Identifiable
with a hash is unsound. The id
in Identifiable
is supposed to uniquely identify a value on its own. A hash is not supposed to do that. A hash collision would actually break, for example, a SwiftUI ForEach
that uses hashes as row ids.
A hash used in a hash table doesn't need to be a cryptographic hash designed to make collisions practically impossible. Collisions in this sort of hash are common and expected which is why they aren't appropriate for security.
One caveat: predictable hashing can be exploited to intentionally craft data that causes hash collisions, which can degrade hash maps into their degenerate linked-list-like behaviour (O(n)
lookups).
Luckily, you don't need the full expense of a cryptographically secure hash function, just one with a randomized seed.
Wikipedia has a section on Hash Flooding / HashDoS
this is probably straying beyond the scope of a ânewbieâ question, but in my experience so far non-deterministic hashing causes a lot of weirdness for end users. is this better than exposing a security vulnerability? of course! but itâs quite silly the way so many Swift apps show your likes/followers in a different order every time you refresh the screen. this is weird and a lot of engineers these days seem to just brush it off as the way Dictionary
/Set
are âsupposedâ to work.
the answer is not to go back to deterministic hashing. this problem happens because OrderedDictionary
/OrderedSet
never graduated to the standard library, so project leads are reluctant to attach a SwiftPM package dependency to targets very high up in a build tree. so this leads to a funny two-tiered system where âlibraryâ code outputs things in unstable order and âappâ code outputs things in stable order. i am still holding out hope that these ordered collection types will graduate some day.