(Sigh. I did say I did not want to talk about Data.)
This may be a (imo, highly questionable) expectation in some environments other than Swift. However, it's manifestly not the case with Hashable.
Swift's hashed collections are safe in the sense that as long as the key type properly implements Hashable, they guarantee that lookups cost O(1) hashing/equality operations, no matter the distribution of the data.
By properly implementing hashing, I mean that the key feeds exactly the same bits to the Hasher as it compares in ==.
There are no ifs or buts about this -- that's what the protocol requires, and this is how String, Array, etc. hash themselves in the Standard Library.
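To make that concrete, here is a minimal sketch of what such a conformance looks like -- the Blob type below is made up for illustration, not code from the Standard Library:

// Hypothetical byte-buffer type, for illustration only.
struct Blob: Hashable {
    var bytes: [UInt8]

    // Equality inspects every byte...
    static func == (lhs: Blob, rhs: Blob) -> Bool {
        lhs.bytes == rhs.bytes
    }

    // ...so hashing must feed every one of those bytes to the Hasher.
    // Hashing only a prefix or a fixed-size sample would break the
    // Hashable contract: unequal values could then be engineered to
    // collide at will.
    func hash(into hasher: inout Hasher) {
        hasher.combine(bytes.count)
        bytes.withUnsafeBytes { hasher.combine(bytes: $0) }
    }
}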
As @tera's example demonstrates, Data's current hash(into:) implementation is broken.
My apologies, but this is simply incorrect. This isn't how hashed collections work in Swift.
Lookup operations in Set or Dictionary hash the supplied key exactly once, then they execute as many equality checks as necessary to search through the corresponding collision chain. (The same goes for insertions; the hash table sometimes needs to be resized there, but the costs of that are amortized away.)
In order to achieve our performance goals for Set and Dictionary, we must keep the expected length of the collision chain as low as possible. The only way to guarantee that is through proper hashing.
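In pseudo-Swift, a lookup boils down to the sketch below. (This is deliberately simplified; the real Set and Dictionary use open addressing rather than per-bucket arrays, but the cost structure is the same: one hash, then equality checks along the collision chain.)

// Simplified sketch of a hashed lookup, for illustration only.
func simplifiedContains<Key: Hashable>(_ key: Key, in buckets: [[Key]]) -> Bool {
    precondition(!buckets.isEmpty)
    var hasher = Hasher()               // seeded once per process launch
    hasher.combine(key)                 // the key is hashed exactly once
    let index = Int(UInt(bitPattern: hasher.finalize()) % UInt(buckets.count))
    // Every further step is an == check; proper hashing keeps this
    // chain short regardless of how the keys were chosen.
    return buckets[index].contains(where: { $0 == key })
}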
On the contrary, there is no expectation that types conforming to Hashable must hash in bounded time. (Just like there is no such expectation for ==.)
In Swift, hashing and equality have the same worst-case performance. On average, hashing is often slower, because it always needs to scan the entire value, while equality checks can bail out at the first difference. (Note that this does not mean that Swift's hash tables are somehow inefficient.)
(Note that it is possible to implement == in terms of hashing -- indeed, that's how we validate Hashable implementations in the Swift test suite.)
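As a hedged illustration of that idea (not the actual test-suite code): within a single process run, separate Hasher instances use the same random seed, so one can approximate == by comparing finalized hashes.

// Probabilistic equality via hashing -- an illustrative sketch only.
// For a correctly hashing type, a hash mismatch proves the values are
// unequal; a match is (very) strong evidence that they are equal.
func probablyEqual<T: Hashable>(_ a: T, _ b: T) -> Bool {
    var ha = Hasher(); ha.combine(a)
    var hb = Hasher(); hb.combine(b)
    return ha.finalize() == hb.finalize()
}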
I find this argument entirely unconvincing. String, Array, Data can and in fact do grow to arbitrary sizes. This is not at all a problem for Set or Dictionary -- they provide excellent performance.
It does raise the question though -- who is putting multi-megabyte Arrays in a Set, and why? Huge hash keys rarely if ever occur in regular practice. Optimizing this rare case to the detriment of the security of the general case seems like a questionable choice to me.
If the keys are received from an untrusted source, then it's absolutely crucial to be able to trust that hash insertions/lookups will execute in reasonable time.
For instance, take this loop:
import Foundation

var stuff: Set<Data> = []
// chunksReceivedFromNetwork: a sequence of Data chunks supplied by an untrusted peer
for data in chunksReceivedFromNetwork {
    stuff.insert(data)
}
With the current (incorrect) Data.hash(into:) implementation, a clever adversary can trivially use @tera's hash collisions to make this loop take time that's quadratic in the number of chunks received. They don't even need to send multi-gigabyte chunks -- 84 bytes or so per chunk will do just fine.
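To make the shape of the attack concrete: assuming (as the 84-byte figure suggests) that the broken implementation feeds only a fixed-length prefix of the bytes to the hasher, an unbounded supply of mutually colliding chunks is trivial to construct. The 80-byte prefix length below is an assumption made purely for illustration:

import Foundation

// Hypothetical collision generator, assuming the defect is that only the
// first ~80 bytes are hashed. Every chunk shares that prefix and differs
// only afterwards, so all of them hash identically while comparing
// unequal -- inserting them makes the loop above quadratic.
let sharedPrefix = Data(repeating: 0, count: 80)    // assumed hashed prefix length
let collidingChunks: [Data] = (0 ..< 10_000).map { i in
    var chunk = sharedPrefix
    chunk.append(contentsOf: Array(String(i).utf8)) // bytes beyond the prefix
    return chunk                                    // ~84 bytes per chunk
}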
If Data had correct hashing, this loop would be guaranteed to finish in time that's linear in the overall number of bytes received -- the chances of a huge collision chain would be vanishingly low, no matter how cleverly the input was chosen.
The way I look at it, it's exceedingly rare for such colliding Data to happen outside of HashDoS attacks.
Like Swift itself, the Standard Library's Set and Dictionary are designed to help people write code that needs to process hostile input. It seems counter-productive to intentionally weaken this stance, for no particularly good reason.