At that scale I’d start eyeing things like the working set size overflowing the TLB, but I haven’t pulled out the CPU Counters instrument to verify that.
Set does an O(1) number of arithmetic operations and memory accesses. On physical hardware, the time required for those memory accesses increases as the data is found in L1, L2, L3, uncompressed memory, compressed memory, or on disk, and also as the addresses needed for the working set fall out of the TLB; there are a bunch of other micro-architectural factors besides. None of these hardware details changes the fact that an O(1) number of memory accesses are occurring.
We're playing sleight-of-hand tricks with the units of measurement here. The API docs usually only say "O(1)" or "O(n)" -- they tend not to explicitly specify what exactly they're counting. In the case of protocols like Collection, different conforming types sometimes have a different idea of what the units are.
Most algorithm analysis measures performance in nice high-level units that are easy to work with -- the number of equality checks on the Element type, the number of items that need to be moved around in memory etc.
On the other hand, as programmers we typically assume that documented performance describes (or at least, somewhat resembles) elapsed time.
For example, with a properly hashed Element, Set.contains is expected to perform O(1) equality checks -- but this does not mean lookups are expected to finish in constant time!
For example, just consider Set<String>, where the cost of an equality check is O(n) in the length of the values being compared (as measured in Unicode scalars, I believe).
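To make that concrete, here is a rough, hypothetical micro-benchmark (nothing below comes from the original posts). The set holds a single element, so contains performs an O(1) number of equality checks either way, but both hashing the probe key and the single equality check have to walk the string, so the elapsed time still grows with the key length:

import Foundation

func timeLookup(keyLength: Int) -> TimeInterval {
    // Two separately allocated strings with identical contents, so equality
    // can't shortcut via storage identity.
    let stored = String(repeating: "x", count: keyLength)
    let probe = String(repeating: "x", count: keyLength)
    let set: Set<String> = [stored]

    let start = Date()
    for _ in 0 ..< 1_000 {
        _ = set.contains(probe)   // O(1) equality checks, but not O(1) time
    }
    return Date().timeIntervalSince(start)
}

print(timeLookup(keyLength: 16))
print(timeLookup(keyLength: 1_000_000))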
And things aren't even straightforward with a nice simple element type like Int. We usually take the cost of a random memory access as constant, but in practice, it tends to grow with the logarithm of the size of the working set. And if one random memory access costs O(log(n)) time, what does that tell us about the performance of Set<Int>.contains? Hashing is an excellent access randomizer; linear probing makes the second and subsequent accesses super cheap, but the first one will likely dominate costs. Anything other than linear probing becomes a huge no-no.
Everything about computer performance is a complicated mess of half-truths and over-simplifications. It's very easy to get lost in details, and they change in weird ways every year. Luckily, even the most simplistic performance analysis is better than nothing!
like your posts on the subject, @lorentey. still curious what makes the Data type special.
It's better not to know, but.
wow, thanks for this gem! in my book this is clearly a bug, and it makes the built-in Data hashValue totally inappropriate in many cases!
WTF
import Foundation

for n in 78 ..< 83 {
    for i: UInt8 in 0 ..< 3 {
        var data = Data(repeating: 0, count: n - 1)
        data.append(i)
        print("Data(\(n) bytes) hash is \(data.hashValue)")
    }
    print()
}
Output:
Data(78 bytes) hash is -6057134200881960299
Data(78 bytes) hash is -1381361526395587339
Data(78 bytes) hash is -3319890988691249578
Data(79 bytes) hash is 6847656298895565850
Data(79 bytes) hash is -8737025815244955736
Data(79 bytes) hash is 5557717136421446909
Data(80 bytes) hash is 684518477260868316
Data(80 bytes) hash is -4351981612195298604
Data(80 bytes) hash is 7285011032270536178
Data(81 bytes) hash is 2933313030649029302
Data(81 bytes) hash is 2933313030649029302
Data(81 bytes) hash is 2933313030649029302
Data(82 bytes) hash is -5508073393735331768
Data(82 bytes) hash is -5508073393735331768
Data(82 bytes) hash is -5508073393735331768
we can still include "Data" in this discussion provided its hashValue is overridden with something more sensible!
struct BetterData: Hashable {
    let data: Data

    func hash(into hasher: Hasher) {
        // Hash every byte, not just the first 80.
        data.withUnsafeBytes { hasher.combine(bytes: $0) }
    }
}
in fairness it is not as bad as it was!
NSData hashValue test
import Foundation

for n in 78 ... 83 {
    for i: UInt8 in 0 ... 1 {
        for j: UInt8 in 0 ... 1 {
            let data = NSMutableData()
            var byte: UInt8 = 0
            for _ in 0 ..< n - 2 {
                data.append(&byte, length: 1)
            }
            var i = i, j = j   // mutable copies so we can pass their addresses
            data.append(&i, length: 1)
            data.append(&j, length: 1)
            print("Data with \(data.length - 2) 0s and \(i), \(j) at the end, hash is \(data.hashValue)")
        }
    }
    print()
}
Output:
Data with 76 0s and 0, 0 at the end, hash is 0
Data with 76 0s and 0, 1 at the end, hash is 1
Data with 76 0s and 1, 0 at the end, hash is 16
Data with 76 0s and 1, 1 at the end, hash is 17
Data with 77 0s and 0, 0 at the end, hash is 0
Data with 77 0s and 0, 1 at the end, hash is 1
Data with 77 0s and 1, 0 at the end, hash is 16
Data with 77 0s and 1, 1 at the end, hash is 17
Data with 78 0s and 0, 0 at the end, hash is 0
Data with 78 0s and 0, 1 at the end, hash is 1
Data with 78 0s and 1, 0 at the end, hash is 16
Data with 78 0s and 1, 1 at the end, hash is 17
Data with 79 0s and 0, 0 at the end, hash is 0
Data with 79 0s and 0, 1 at the end, hash is 0
Data with 79 0s and 1, 0 at the end, hash is 1
Data with 79 0s and 1, 1 at the end, hash is 1
Data with 80 0s and 0, 0 at the end, hash is 0
Data with 80 0s and 0, 1 at the end, hash is 0
Data with 80 0s and 1, 0 at the end, hash is 0
Data with 80 0s and 1, 1 at the end, hash is 0
Data with 81 0s and 0, 0 at the end, hash is 0
Data with 81 0s and 0, 1 at the end, hash is 0
Data with 81 0s and 1, 0 at the end, hash is 0
Data with 81 0s and 1, 1 at the end, hash is 0
I don't want to pull this discussion too far into the weeds, but in the interest of thoroughness:
@lorentey's analysis is excellent as usual, and I think it's worth calling out, alongside everything he's posted here, that another thing which typically gets lost in analysis is the algorithmic complexity of hashing itself.
Discussion of O(1) vs. O(n) access to hash tables typically makes the assumption that hashing can be done in constant time, and it's usually important for this assumption to hold at least somewhat true. Depending on how you look at it, hash table operations are typically O(k * n), where n is the size of the data set and k is the cost of access to a specific bucket (e.g. hashing a key to produce an index into underlying storage). If k is constant, then O(k * n) ⇒ k * O(n) ⇒ O(n). But if it isn't, then your O(1) access in the amortized runtime case (O(n) / n ops ⇒ O(n / n) ⇒ O(1)) is actually O(k * n) / n ops ⇒ O(k * n / n) ⇒ O(k).
This has ramifications not just theoretically, but practically. Set might guarantee O(1) equality checks as Karoy mentions, but on typical data, it likely performs many more hashing operations than equality checks. Even if the rate is algorithmically identical, the constants matter in practice. (It is possible to somewhat mitigate the costs of expensive hashing in practice by caching hash results, but this does come at the cost of memory — and everyone knows that caches are also really easy to manage.)
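As a hedged aside, the "cache the hash" mitigation mentioned above could look something like the following wrapper (the type and its fields are invented for this sketch): the full payload is hashed exactly once, up front, and the cached value doubles as a cheap early-out for equality.

import Foundation

struct CachedHashData: Hashable {
    let data: Data
    private let cachedHash: Int

    init(_ data: Data) {
        self.data = data
        // Hash every byte exactly once and remember the result.
        var hasher = Hasher()
        data.withUnsafeBytes { hasher.combine(bytes: $0) }
        self.cachedHash = hasher.finalize()
    }

    static func == (lhs: Self, rhs: Self) -> Bool {
        // Cheap early-out: different cached hashes mean the payloads differ.
        lhs.cachedHash == rhs.cachedHash && lhs.data == rhs.data
    }

    func hash(into hasher: Hasher) {
        hasher.combine(cachedHash)
    }
}

The extra stored Int per value is the memory cost alluded to above, and keeping it consistent with the payload is exactly the cache-management hazard being joked about.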
In the case of types which can be unbounded in size, like String and Data, it's important for the common case for hashing to remain relatively bounded. In practice, this doesn't necessarily need to mean that you have to guarantee O(1) access, but that the input size won't grow large enough for it to matter. For a type like String, having O(n) hashing (where n here is the length of the string) might not matter as much — the vast, vast majority of Strings aren't truly "long" in the algorithmic sense, and an even higher percentage of Strings that make their way into Dictionary keys or Sets are likely to be short enough that it doesn't matter.
But Data has different typical access patterns, and it's much more likely that Data can grow to be very large. It's pretty rare to find Strings many megabytes in size; it's significantly less rare to find Data as such. (It's also not uncommon to find Data that is gigabytes in size, especially when memory-mapped from disk.)
Because the size of data is expected to grow in a much more unbounded way relative to the amount of memory you have, it's a bit more important to impose constraints on its hashing in order to provide expected-time access. If you've got an image library with hundreds of megabytes of images, it'd be unfortunate to hash every single bit of that data on insertion into a Set when you know that the underlying data can be trivial to prove unequal (i.e. the equality check can be significantly cheaper than hashing because data is relatively uniformly different).
And to that point, in real world scenarios, it's exceedingly rare from real world testing to find collections of instances of Data whose first 80 bytes are all equal, but are not equal beyond that. Even for data of similar type (images with markers, files with magic bytes, files with constant headers, etc.), headers are rarely so large as to cause collisions.
Is this bound "defeatable"? Absolutely, as you show yourself. But in practice, it's important to find some bound that trades off practical cost with the rate of collision.
Itai, when you say "memory-mapped from disk" data, what exactly do you mean?
e.g. Data(bytesNoCopy:) has its own set of issues related to "value type-ness".
otherwise i can see your point. i'd say that:
- it's obvious and understandable that even if Set/Dictionary itself guarantees O(1) behaviour, we can easily fool it by supplying our own implementation of hashValue / the EQ function that has, say, O(n^2) behaviour, thus ruining the Set/Dictionary guarantees.
- in practice developers don't use multi-megabyte (or gigabyte!) Data as Set elements or dictionary keys. (in those extremely rare cases they do - if hashValue were "true" - they'd quickly figure out the slowdown and do it somehow differently).
- the old PICT file format had a 512-byte header of arbitrary data (commonly all zeroes). seriously, it's quite easy to encounter data formats (even audio-visual cases like "PCM data in silent conditions" or "yuv data in a dark room", but there are a zillion other non-AV cases of course) that have some common / constant data at the beginning.
- probably it makes sense to have a tree based implementation of dictionary / set in addition to the current hash based.
- it might well be the case that even shorter, under megabyte Strings could benefit speed wise from a tree based set / dictionary implementation.
- if nothing else, this magic 80-byte constant should be in the documentation.
- there might be current or future hash collision attacks targeting this 80-byte hashValue limit! this is quite scary actually...
imagine for a moment we had the same situation with strings...
obviously by now we'd have all sorts of collision attacks with malicious people constructing strings of the same length and 80 spaces at the beginning.
how is that different for Data? Data is just less frequently used for Set elements / dictionary keys, but it can be! IMHO this 80-byte limit needs to be fixed ASAP before people start exploiting it (or Data greater than 80 bytes shall be prohibited by the runtime in Set elements / Dictionary keys).
having said that... "hash calculation time attack issue" is not any better than "hash collision attack issue"...
looks like the best course of action is all of these:
- increase 80 bytes to a bigger number like 1024
- document this limitation
- deprecate (and eventually prohibit at runtime) the use of long Data as Set elements / Dictionary keys
- create new tree based Set / Dictionary implementations in addition to the current hash based ones.
- later on extend existing APIs like JSON decoder to opt in to create a tree based Dictionary instead of default hash based Dictionary.
opinions?
Mutation and value types don't have to play into this. You can create an immutable Data with bytesNoCopy: from a memory-mapped region, or have an NSData created with +dataWithContentsOfURL:options:error: with an option of NSDataReadingMappedIfSafe/NSDataReadingMappedAlways, bridged to a Data instance.
(The memory-mapped data case was an example of loading large amounts of data into memory, but you don't need the memory to be mapped — you can just as easily create a Data several gigabytes large in-memory if you've got enough RAM...)
On some of your thoughts:
FWIW, I wouldn't expect someone to use a Data directly as a dictionary key, but it's very possible to want to use a structure which has a Data value as a property as a key in a dictionary. The correct implementation of hash(into:) on that structure would include hashing the data.
Again, it's not impossible to find counterexamples for this, I agree.
There indeed might be some benefit to including a note in the documentation that not all bytes of the underlying data are hashed, but the 80-byte limit is an implementation detail. Over the course of their lifetimes, Foundation types like NSData and NSString have changed their hashing methods in response to changing use-cases or security vectors with no notice (in both senses of the word: no notice was given, but also, nobody at all noticed); there's no particular reason to expose this detail if it might change in the future.
This territory is where things get a little fuzzy. No matter what scheme you come up with, there are always opportunities for malicious entities to subvert it.
I don't know how "obvious" it is that various cases would have been exploited by now:
- String has the opposite problem, in that hashing is totally unbounded. If a malicious actor can influence the data going into a Dictionary or Set, would we expect to be hit with all sorts of attacks where attackers created multi-megabyte strings?
- This limit has been shipping in practice for a long time, and for considerably longer in NSData. If attacks of this kind were "obvious", wouldn't we expect this to have been exploited already?
I agree with this. There are always practical trade-offs to make, and if you're imagining hypothetical adversarial input, you can always come up with a worst-case scenario for your use-case. It's easy to come up with something that sounds scary, but it's not necessarily helpful as a hypothetical.
In the absence of significant concrete evidence, benchmarking, etc., I wouldn't advocate rushing to change for the sake of change.
thanks! i completely forgot about +[NSData dataWithContentsOfURL:]
note, NSData bridged to Data gives "weaker" hash values (as in the test above).
here's an idea: do this for all dynamic types, not just Data:
so String / Array / Dictionary, etc. - anything that's hashable and that can be large. yep, long strings will be impossible to use as dictionary keys or set elements - which is a change, so deprecation stage, prohibition stage - but there will be a new option for those cases (tree based set / dictionary) and there'll be neither "hash collision" attacks nor "hash calculation time" attacks, even in theory. win-win?
why document the limit (maybe even introduce a runtime variable to query the limit, but make it change infrequently): it would be clear to developers which data structure to choose - the default hash based or the new tree based - depending upon the task at hand (short keys vs long keys).
I'm generally in favor of this sort of solution, not that anything is up to me. In my mind, the optimal place for this limitation would be directly on Hasher — I would love a tunable Hasher whose combine(bytes: UnsafeRawBufferPointer) would inspect the length of the given bytes and constrain how many were read, or would choose some bounded portions of the buffer (e.g. beginning 24, middle 24, and end 24) and hash those. (And maybe even borrowing from some of the seeded randomness hashing already introduces, maybe the strategy is picked randomly there too, per process invocation.)
Then, types like String or Data or Array wouldn't care about the specifics — they could pass an entire underlying buffer to a Hasher to combine, and the Hasher would ensure that unbounded reads didn't happen on its own end.
(And if the Hasher is tunable, you could create a Hasher instance to customize the limitations to better suit your needs, while still keeping bounds in place.)
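Hasher has no such knob today, so purely as a hedged illustration of the strategy described above, here is what the "count plus beginning/middle/end windows" sampling could look like if written as a wrapper key type instead. The type name, the 24-byte window, and the window placement are all invented for this sketch; partial hashing deliberately trades collision resistance for speed, which is exactly the tension discussed in the rest of the thread.

import Foundation

struct SampledData: Hashable {
    var data: Data

    static func == (lhs: Self, rhs: Self) -> Bool { lhs.data == rhs.data }

    func hash(into hasher: Hasher) {
        let window = 24
        hasher.combine(data.count)
        guard data.count > 3 * window else {
            // Small values: just hash everything.
            data.withUnsafeBytes { hasher.combine(bytes: $0) }
            return
        }
        // Large values: hash three bounded windows instead of every byte.
        let midStart = data.index(data.startIndex, offsetBy: data.count / 2)
        let chunks = [
            data.prefix(window),
            data[midStart ..< data.index(midStart, offsetBy: window)],
            data.suffix(window)
        ]
        for chunk in chunks {
            chunk.withUnsafeBytes { hasher.combine(bytes: $0) }
        }
    }
}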
Given other solutions are possible, I think this is too restrictive. Long strings in sets aren't a programmer error — and if we're talking theoretical malicious actors, you can imagine user input exceeding the arbitrary runtime length in order to take down a program...
Instead, I think we'd prefer to have help from the system, instead of limitations. If Hasher gave a bit more of a hand here, this could be the real win-win.
(Note: @lorentey and I have discussed this briefly in the past, while I was still working on Data, and I think we're agreed that this could be a really interesting direction to go. (Or at least were at the time.) What I wouldn't want to advocate is dropping the length requirement on Data before Hasher got some of these tools.)
i like that. e.g. 24-byte ranges from ten different locations in accordance with some pseudo-random sequence unique per process. plus an environment variable to be able to switch it off in debug mode.
i wonder how restrictive that actually is. maybe that's the case for 0.0001% apps and if you give enough time to adopt (via deprecation warnings) that won't be an issue. plus as a mitigation - the new tree based alternatives of set and dictionary.
i wonder how restrictive that actually is. maybe that's the case for 0.0001% apps and if you give enough time to adopt (via deprecation warnings) that won't be an issue. plus as a mitigation - the new tree based alternatives of set and dictionary.
My point is that this restriction shouldn't be necessary — if Hasher takes a fixed-length subset of an unbounded value, it doesn't matter how large the value actually is. There's no downside to allowing it to exist in a hashing structure, since the hash function is guaranteed to be constant-time, and the risk is mitigated. What benefit do you have in mind here?
i am thinking of hash collision attacks done with a large number of "small" elements (be it Data, or other types like Strings/Array/etc with the new "bound" hashValue implementation). even with the new smart "we will use some bytes for hash value but won't tell you which ones" implementation it would be relatively easy to find those bytes that are not part of hash (even if experimentally during runtime) and create a large number of small hash colliding elements and thus perform the hash collision attack. with a simpler approach "we use all bytes for hash" but "you can't use long values for your dictionary keys / set elements" that sort of attack will not be possible to begin with.
it would be relatively easy to find those bytes that are not part of hash (even if experimentally during runtime) and create a large number of small hash colliding elements and thus perform the hash collision attack
I think it's one thing to be able to submit input to a program in order to try to effect results, but introspection of a program (from outside of it, or worse, from within it) is an entirely different beast. I have a strong feeling that if an attacker had such close access to a running process to be able to ascertain whether a particular insertion triggers a hash collision, they're already way past what you might be able to do to keep them out. So much so that in the general case, it might not even be worth trying to guard against.
But, maybe this becomes a tunable option on Hasher — a maximum length on the number of bytes it hashes, along with a flag that indicates whether longer input should trap. I would strongly suggest it be off as a default, but if this niche case is a concern, you could use your own Hasher and turn it on.
In any case, I think we've strayed pretty far from the original topic here. I hope this was a reasonable summary of the current state of things, and I'd be interested to see if @lorentey or someone else would be interested in expanding on Hasher functionality, but maybe that's a topic for a different thread.
(Sigh. I did say I did not want to talk about Data.)
Discussion of O(1) vs. O(n) access to hash tables typically makes the assumption that hashing can be done in constant time, and it's usually important for this assumption to hold at least somewhat true.
This may be a (imo, highly questionable) expectation in some environments other than Swift. However, it's manifestly not the case with Hashable.
Swift's hashed collections are safe in the sense that as long as the key type properly implements Hashable
, they guarantee that lookups cost O(1) hashing/equality operations, no matter the distribution of the data.
By properly implementing hashing, I mean that the key feeds exactly the same bits to the Hasher as it compares in ==.
There are no ifs or buts about this -- that's what the protocol requires, and this is how String, Array, etc. hash themselves in the Standard Library.
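For illustration only (the type and its fields are invented here, not taken from the posts above), a conforming type that follows this rule hashes exactly the bytes its == looks at, and nothing else:

import Foundation

struct Chunk: Hashable {
    var payload: Data
    var debugLabel: String   // intentionally not part of equality

    static func == (lhs: Self, rhs: Self) -> Bool {
        lhs.payload == rhs.payload
    }

    func hash(into hasher: Hasher) {
        // Feed the hasher the raw payload bytes -- exactly what == compares.
        payload.withUnsafeBytes { hasher.combine(bytes: $0) }
    }
}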
As @tera's example demonstrates, Data's current hash(into:) implementation is broken.
This has ramifications not just theoretically, but practically. Set might guarantee O(1) equality checks as Karoy mentions, but on typical data, it likely performs many more hashing operations than equality checks.
My apologies, but this is simply incorrect. This isn't how hashed collections work in Swift.
Lookup operations in Set or Dictionary hash the supplied key exactly once, then they execute as many equality checks as necessary to search through the corresponding collision chain. (The same goes for insertions; the hash table sometimes needs to be resized there, but the costs of that are amortized away.)
In order to achieve our performance goals for Set and Dictionary, we must keep the expected length of the collision chain as low as possible. The only way to guarantee that is through proper hashing.
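A hedged way to watch this from outside the Standard Library is to instrument a key type and count the calls during a single lookup; everything below (the Counters box, CountingKey, the element count) is invented for this sketch:

final class Counters {
    var hashes = 0
    var equalities = 0
}

struct CountingKey: Hashable {
    var value: Int
    var counters: Counters

    static func == (lhs: Self, rhs: Self) -> Bool {
        lhs.counters.equalities += 1
        return lhs.value == rhs.value
    }

    func hash(into hasher: Hasher) {
        counters.hashes += 1
        hasher.combine(value)
    }
}

let counters = Counters()
let set = Set((0 ..< 1_000).map { CountingKey(value: $0, counters: counters) })

counters.hashes = 0
counters.equalities = 0
_ = set.contains(CountingKey(value: 500, counters: counters))
print(counters.hashes, counters.equalities)
// Expect exactly one hash, plus however many equality checks the (short)
// collision chain requires -- usually just one.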
In the case of types which can be unbounded in size, like String and Data, it's important for the common case for hashing to remain relatively bounded.
On the contrary, there is no expectation that types conforming to Hashable must hash in bounded time. (Just like there is no such expectation for ==.)
In Swift, hashing and equality have the same worst-case performance. On average, hashing is often slower, because it always needs to scan the entire value, while equality checks can bail out at the first difference. (Note that this does not mean that Swift's hash tables are somehow inefficient.)
(Note that it is possible to implement == in terms of hashing -- indeed, that's how we validate Hashable implementations in the Swift test suite.)
But Data has different typical access patterns, and it's much more likely that Data can grow to be very large. It's pretty rare to find Strings many megabytes in size; it's significantly less rare to find Data as such. (It's also not uncommon to find Data that is gigabytes in size, especially when memory-mapped from disk.)
I find this argument entirely unconvincing. String, Array, Data can and in fact do grow to arbitrary sizes. This is not at all a problem for Set or Dictionary -- they provide excellent performance.
It does raise the question though -- who is putting multi-megabyte Arrays in a Set, and why? Huge hash keys rarely if ever occur in regular practice. Optimizing this rare case to the detriment of the security of the general case seems like a questionable choice to me.
If the keys are received from an untrusted source, then it's absolutely crucial to be able to trust that hash insertions/lookups will execute in reasonable time.
For instance, take this loop:
var stuff: Set<Data> = []
for data in chunksReceivedFromNetwork {
    stuff.insert(data)
}
With the current (incorrect) Data.hash(into:) implementation, a clever adversary can trivially use @tera's hash collisions to make this loop take time that's quadratic in the number of chunks received. They don't even need to send multi-gigabyte chunks -- 84 bytes or so per chunk will do just fine.
If Data had correct hashing, this loop would be guaranteed to finish in time that's linear in the overall number of bytes received -- the chances of a huge collision chain would be vanishingly low, no matter how cleverly the input was chosen.
And to that point, in real world scenarios, it's exceedingly rare from real world testing to find collections of instances of Data whose first 80 bytes are all equal, but are not equal beyond that.
The way I look at it, it's exceedingly rare for such colliding Data to happen outside of HashDoS attacks.
Like Swift itself, the Standard Library's Set and Dictionary are designed to help people write code that needs to process hostile input. It seems counter-productive to intentionally weaken this stance, for no particularly good reason.
There are no plans to weaken hashing in Swift in the general case. (There is plenty of evidence that this would be a terrible idea; there is no evidence that it would be beneficial.)
We did discuss a scheme where small hash tables (say, less than hundred or so items) switch to incomplete hashing. That could be an interesting optimization at some point!
However, Hasher is fast enough in practice, and keys are generally small enough in practice, that implementing this isn't a priority. (In fact, attempts at complicating hashing usually just make Set/Dictionary slower.)
(On large contiguous keys like Data, Hasher is currently about twice as slow as simply reading memory. And we have plenty of ways to speed it up if this ever becomes a problem.)
Mutation and value types don't have to play into this. You can create an immutable Data with bytesNoCopy: from a memory-mapped region, or have an NSData created with +dataWithContentsOfURL:options:error: with an option of NSDataReadingMappedIfSafe/NSDataReadingMappedAlways, bridged to a Data instance.
(The memory mapped data case was an example of loading large amounts of data into memory, but you don't need the memory to be mapped — you can just as easily create a Data several gigabytes large in-memory if you've got enough RAM...)
Can you show me a use case where someone would intentionally use such a humungous Data as a dictionary key? I can certainly imagine using such data in dictionary values, but in keys? Why?
Even if Data implemented hashing the way NSArray does (it's lightning fast!¹), successfully looking up such a key would generally need to perform an equality check that fully compares contents. (Unless we are looking up the exact same instance, in which case why not just use an ObjectIdentifier or a (baseAddress, length) pair as the key?)
If profiling data indicates hashing performance is a bottleneck², and if we are confident that our input data will never be malicious, then it's super easy to customize hashing by defining a trivial wrapper type.
struct MyKey: Hashable {
    var value: Data

    static func ==(left: Self, right: Self) -> Bool {
        left.value == right.value
    }

    func hash(into hasher: Hasher) {
        // On careful consideration, I concluded that this is okay,
        // because I am confident that the number of keys that
        // only differ after the first 42 bytes will always be strictly
        // limited by some small constant.
        hasher.combine(value.prefix(42))
    }
}
If I have to do this, then I may grumble a bit about all the boilerplate, but then I implement it and move on. Later on, when it inevitably turns out that my analysis was wrong, and my app, server or daemon hangs in some Dictionary resize operation, it'll be relatively easy to figure out what's going on -- the code is quite explicit about the mess it's making.
On the other hand, if Data arbitrarily makes such decisions behind my back, and the decision proves harmful, then I'll need to go through a difficult debugging process, and at the end of it I'll feel betrayed by my tools. Swift promises to be a safe programming language; such unpleasant surprises break that promise, and drive people to other languages.
Partial hashing achieves a small constant speedup in the best case, but in exchange it generates a hugely increased risk of worst-case behavior -- and I think this is a bad tradeoff to make in general-purpose API.
Footnotes:
¹ Note: This is intended to be a jab at the notion of O(1) hashing, not at NSArray or Foundation's other Objective-C container types. (Foundation has valid reasons to keep these unfortunate definitions for these ancient types. Those reasons do not apply to Data.)
² In all these years, I only recall seeing such problems in code that accidentally bridged back-and-forth between Dictionary and NSDictionary in a tight loop -- a quite different problem.