At that scale I’d start eyeing things like the working set size overflowing the TLB, but I haven’t pulled out the CPU Counters instrument to verify that.
Set does an O(1) number of arithmetic operations and memory accesses. On physical hardware, the time required for those memory accesses increases as the data is found in L1, L2, L3, uncompressed memory, compressed memory, or on disk, and also as the addresses needed for the working set fall out of the TLB; there are a bunch of other micro-architectural factors besides. None of these hardware details changes the fact that an O(1) number of memory accesses are occurring.
We're playing sleight-of-hand tricks with the units of measurement here. The API docs usually only say "O(1)" or "O(n)" -- they tend not to explicitly specify what exactly they're counting. In the case of protocols like Collection, different conforming types sometimes have a different idea of what the units are.
Most algorithm analysis measures performance in nice high-level units that are easy to work with -- the number of equality checks on the Element type, the number of items that need to be moved around in memory etc.
On the other hand, as programmers we typically assume that documented performance describes (or at least, somewhat resembles) elapsed time.
For example, with a properly hashed Element, Set.contains is expected to perform O(1) equality checks -- but this does not mean lookups are expected to finish in constant time!
For example, just consider Set<String>, where the cost of an equality check is O(n) in the length of the values being compared (as measured in Unicode scalars, I believe).
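To make that concrete, here is a rough, hypothetical micro-benchmark (nothing below comes from the original posts). The set holds a single element, so contains performs an O(1) number of equality checks either way, but both hashing the probe key and the single equality check have to walk the string, so the elapsed time still grows with the key length:

import Foundation

func timeLookup(keyLength: Int) -> TimeInterval {
    // Two separately allocated strings with identical contents, so equality
    // can't shortcut via storage identity.
    let stored = String(repeating: "x", count: keyLength)
    let probe = String(repeating: "x", count: keyLength)
    let set: Set<String> = [stored]

    let start = Date()
    for _ in 0 ..< 1_000 {
        _ = set.contains(probe)   // O(1) equality checks, but not O(1) time
    }
    return Date().timeIntervalSince(start)
}

print(timeLookup(keyLength: 16))
print(timeLookup(keyLength: 1_000_000))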
And things aren't even straightforward with a nice simple element type like Int. We usually take the cost of a random memory access as constant, but in practice, it tends to grow with the logarithm of the size of the working set. And if one random memory access costs O(log(n)) time, what does that tell us about the performance of Set<Int>.contains? Hashing is an excellent access randomizer; linear probing makes the second and subsequent accesses super cheap, but the first one will likely dominate costs. Anything other than linear probing becomes a huge no-no.
Everything about computer performance is a complicated mess of half-truths and over-simplifications. It's very easy to get lost in details, and they change in weird ways every year. Luckily, even the most simplistic performance analysis is better than nothing!
like your posts on the subject, @lorentey. still curious what makes the Data type special.
It's better not to know, but.
wow, thanks for this gem! in my book this is clearly a bug, and it makes the built-in Data hashValue totally inappropriate in many cases!
WTF
import Foundation

for n in 78 ..< 83 {
    for i: UInt8 in 0 ..< 3 {
        var data = Data(repeating: 0, count: n - 1)
        data.append(i)
        print("Data(\(n) bytes) hash is \(data.hashValue)")
    }
    print()
}
Output:
Data(78 bytes) hash is -6057134200881960299
Data(78 bytes) hash is -1381361526395587339
Data(78 bytes) hash is -3319890988691249578
Data(79 bytes) hash is 6847656298895565850
Data(79 bytes) hash is -8737025815244955736
Data(79 bytes) hash is 5557717136421446909
Data(80 bytes) hash is 684518477260868316
Data(80 bytes) hash is -4351981612195298604
Data(80 bytes) hash is 7285011032270536178
Data(81 bytes) hash is 2933313030649029302
Data(81 bytes) hash is 2933313030649029302
Data(81 bytes) hash is 2933313030649029302
Data(82 bytes) hash is -5508073393735331768
Data(82 bytes) hash is -5508073393735331768
Data(82 bytes) hash is -5508073393735331768
we can still include "Data" in this discussion provided its hashValue is overridden with something more sensible!
struct BetterData: Hashable {
    let data: Data

    func hash(into hasher: Hasher) {
        // Hash every byte, not just the first 80.
        data.withUnsafeBytes { hasher.combine(bytes: $0) }
    }
}
in fairness it is not as bad as it was!
NSData hashValue test
import Foundation

for n in 78 ... 83 {
    for i: UInt8 in 0 ... 1 {
        for j: UInt8 in 0 ... 1 {
            let data = NSMutableData()
            var byte: UInt8 = 0
            for _ in 0 ..< n - 2 {
                data.append(&byte, length: 1)
            }
            var i = i, j = j   // mutable copies so we can pass their addresses
            data.append(&i, length: 1)
            data.append(&j, length: 1)
            print("Data with \(data.length - 2) 0s and \(i), \(j) at the end, hash is \(data.hashValue)")
        }
    }
    print()
}
Output:
Data with 76 0s and 0, 0 at the end, hash is 0
Data with 76 0s and 0, 1 at the end, hash is 1
Data with 76 0s and 1, 0 at the end, hash is 16
Data with 76 0s and 1, 1 at the end, hash is 17
Data with 77 0s and 0, 0 at the end, hash is 0
Data with 77 0s and 0, 1 at the end, hash is 1
Data with 77 0s and 1, 0 at the end, hash is 16
Data with 77 0s and 1, 1 at the end, hash is 17
Data with 78 0s and 0, 0 at the end, hash is 0
Data with 78 0s and 0, 1 at the end, hash is 1
Data with 78 0s and 1, 0 at the end, hash is 16
Data with 78 0s and 1, 1 at the end, hash is 17
Data with 79 0s and 0, 0 at the end, hash is 0
Data with 79 0s and 0, 1 at the end, hash is 0
Data with 79 0s and 1, 0 at the end, hash is 1
Data with 79 0s and 1, 1 at the end, hash is 1
Data with 80 0s and 0, 0 at the end, hash is 0
Data with 80 0s and 0, 1 at the end, hash is 0
Data with 80 0s and 1, 0 at the end, hash is 0
Data with 80 0s and 1, 1 at the end, hash is 0
Data with 81 0s and 0, 0 at the end, hash is 0
Data with 81 0s and 0, 1 at the end, hash is 0
Data with 81 0s and 1, 0 at the end, hash is 0
Data with 81 0s and 1, 1 at the end, hash is 0
I don't want to pull this discussion too far into the weeds, but in the interest of thoroughness:
@lorentey's analysis is excellent as usual, and I think it's worth calling out, alongside everything he's posted here, that another thing which typically gets lost in analysis is the algorithmic complexity of hashing itself.
Discussion of O(1) vs. O(n) access to hash tables typically makes the assumption that hashing can be done in constant time, and it's usually important for this assumption to hold at least somewhat true. Depending on how you look at it, hash table operations are typically O(k * n), where n is the size of the data set and k is the cost of access to a specific bucket (e.g. hashing a key to produce an index into underlying storage). If k is constant, then O(k * n) ⇒ k * O(n) ⇒ O(n). But if it isn't, then your O(1) access in the amortized runtime case (O(n) / n ops ⇒ O(n / n) ⇒ O(1)) is actually O(k * n) / n ops ⇒ O(k * n / n) ⇒ O(k).
This has ramifications not just theoretically, but practically. Set might guarantee O(1) equality checks as Karoy mentions, but on typical data, it likely performs many more hashing operations than equality checks. Even if the rate is algorithmically identical, the constants matter in practice. (It is possible to somewhat mitigate the costs of expensive hashing in practice by caching hash results, but this does come at the cost of memory — and everyone knows that caches are also really easy to manage.)
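As a hedged aside, the "cache the hash" mitigation mentioned above could look something like the following wrapper (the type and its fields are invented for this sketch): the full payload is hashed exactly once, up front, and the cached value doubles as a cheap early-out for equality.

import Foundation

struct CachedHashData: Hashable {
    let data: Data
    private let cachedHash: Int

    init(_ data: Data) {
        self.data = data
        // Hash every byte exactly once and remember the result.
        var hasher = Hasher()
        data.withUnsafeBytes { hasher.combine(bytes: $0) }
        self.cachedHash = hasher.finalize()
    }

    static func == (lhs: Self, rhs: Self) -> Bool {
        // Cheap early-out: different cached hashes mean the payloads differ.
        lhs.cachedHash == rhs.cachedHash && lhs.data == rhs.data
    }

    func hash(into hasher: Hasher) {
        hasher.combine(cachedHash)
    }
}

The extra stored Int per value is the memory cost alluded to above, and keeping it consistent with the payload is exactly the cache-management hazard being joked about.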
In the case of types which can be unbounded in size, like String and Data, it's important for the common case for hashing to remain relatively bounded. In practice, this doesn't necessarily need to mean that you have to guarantee O(1) access, but that the input size won't grow large enough for it to matter. For a type like String, having O(n) hashing (where n here is the length of the string) might not matter as much — the vast, vast majority of Strings aren't truly "long" in the algorithmic sense, and an even higher percentage of Strings that make their way into Dictionary keys or Sets are likely to be short enough that it doesn't matter.
But Data has different typical access patterns, and it's much more likely that Data can grow to be very large. It's pretty rare to find Strings many megabytes in size; it's significantly less rare to find Data as such. (It's also not uncommon to find Data that is gigabytes in size, especially when memory-mapped from disk.)
Because the size of data is expected to grow in a much more unbounded way relative to the amount of memory you have, it's a bit more important to impose constraints on its hashing in order to provide expected-time access. If you've got an image library with hundreds of megabytes of images, it'd be unfortunate to hash every single bit of that data on insertion into a Set when you know that the underlying data can be trivial to prove unequal (i.e. the equality check can be significantly cheaper than hashing because data is relatively uniformly different).
And to that point, in real world scenarios, it's exceedingly rare from real world testing to find collections of instances of Data whose first 80 bytes are all equal, but are not equal beyond that. Even for data of similar type (images with markers, files with magic bytes, files with constant headers, etc.), headers are rarely so large as to cause collisions.
Is this bound "defeatable"? Absolutely, as you show yourself. But in practice, it's important to find some bound that trades off practical cost with the rate of collision.
Itai, when you say "memory-mapped from disk" data, what exactly do you mean?
e.g. Data(bytesNoCopy:) has its own set of issues related to "value type-ness".
otherwise i can see your point. i'd say that:
- it's obvious and understandable that even if Set/Dictionary itself guarantees O(1) behaviour, we can easily fool it by supplying our own implementation of hashValue / the EQ function that has, say, O(n^2) behaviour, thus ruining the Set/Dictionary guarantees.
- in practice developers don't use multi-megabyte (or gigabyte!) Data as Set elements or dictionary keys. (in those extremely rare cases they do - if hashValue were "true" - they'd quickly figure out the slowdown and do it somehow differently).
- the old PICT file format had a 512-byte header of arbitrary data (commonly all zeroes). seriously, it's quite easy to encounter data formats (even audio-visual cases like "PCM data in silent conditions" or "yuv data in a dark room", but there are a zillion other non-AV cases of course) that have some common / constant data at the beginning.
- probably it makes sense to have a tree based implementation of dictionary / set in addition to the current hash based.
- it might well be the case that even shorter, under megabyte Strings could benefit speed wise from a tree based set / dictionary implementation.
- if nothing else, this magic 80-byte constant should be in the documentation.
- there might be current or future hash collision attacks targeting this 80-byte hashValue limit! this is quite scary actually...
imagine for a moment we had the same situation with strings...
obviously by now we'd have all sorts of collision attacks with malicious people constructing strings of the same length and 80 spaces at the beginning.
how is that different for Data? Data is just less frequently used for Set elements / dictionary keys, but it can be! IMHO this 80-byte limit needs to be fixed ASAP before people start exploiting it (or Data greater than 80 bytes shall be prohibited by the runtime in Set elements / Dictionary keys).
having said that... "hash calculation time attack issue" is not any better than "hash collision attack issue"...
looks like the best course of action is all of these:
- increase 80 bytes to a bigger number like 1024
- document this limitation
- deprecate (and eventually prohibit at runtime) the use of long Data as Set elements / Dictionary keys
- create new tree based Set / Dictionary implementations in addition to the current hash based ones.
- later on extend existing APIs like JSON decoder to opt in to create a tree based Dictionary instead of default hash based Dictionary.
opinions?
Mutation and value types don't have to play into this. You can create an immutable Data with bytesNoCopy: from a memory-mapped region, or have an NSData created with +dataWithContentsOfURL:options:error: with an option of NSDataReadingMappedIfSafe/NSDataReadingMappedAlways, bridged to a Data instance.
(The memory-mapped data case was an example of loading large amounts of data into memory, but you don't need the memory to be mapped — you can just as easily create a Data several gigabytes large in-memory if you've got enough RAM...)
On some of your thoughts:
FWIW, I wouldn't expect someone to use a Data directly as a dictionary key, but it's very possible to want to use a structure which has a Data value as a property as a key in a dictionary. The correct implementation of hash(into:) on that structure would include hashing the data.
Again, it's not impossible to find counterexamples for this, I agree.
There indeed might be some benefit to including a note in the documentation that not all bytes of the underlying data are hashed, but the 80-byte limit is an implementation detail. Over the course of their lifetimes, Foundation types like NSData and NSString have changed their hashing methods in response to changing use-cases or security vectors with no notice (in both senses of the word: no notice was given, but also, nobody at all noticed); there's no particular reason to expose this detail if it might change in the future.
This territory is where things get a little fuzzy. No matter what scheme you come up with, there are always opportunities for malicious entities to subvert it.
I don't know how "obvious" it is that various cases would have been exploited by now:
- String has the opposite problem, in that hashing is totally unbounded. If a malicious actor can influence the data going into a Dictionary or Set, would we expect to be hit with all sorts of attacks where attackers created multi-megabyte strings?
- This limit has been shipping in practice for a long time, and for considerably longer in NSData. If attacks of this kind were "obvious", wouldn't we expect this to have been exploited already?
I agree with this. There are always practical trade-offs to make, and if you're imagining hypothetical adversarial input, you can always come up with a worst-case scenario for your use-case. It's easy to come up with something that sounds scary, but it's not necessarily helpful as a hypothetical.
In the absence of significant concrete evidence, benchmarking, etc., I wouldn't advocate rushing to change for the sake of change.
thanks! i completely forgot about +[NSData dataWithContentsOfURL:]
note, NSData bridged to Data gives "weaker" hash values (as in the test above).
here's an idea: do this for all dynamic types, not just Data:
so String / Array / Dictionary, etc. - anything that's hashable and that can be large. yep, long strings will be impossible to use as dictionary keys or set elements - which is a change, so deprecation stage, prohibition stage - but there will be a new option for those cases (tree based set / dictionary) and there'll be neither "hash collision" attacks nor "hash calculation time" attacks, even in theory. win-win?
why document the limit (maybe even introduce a runtime variable to query the limit, but make it change infrequently): it would be clear to developers which data structure to choose - the default hash based or the new tree based - depending upon the task at hand (short keys vs long keys).
I'm generally in favor of this sort of solution, not that anything is up to me. In my mind, the optimal place for this limitation would be directly on Hasher — I would love a tunable Hasher whose combine(bytes: UnsafeRawBufferPointer) would inspect the length of the given bytes and constrain how many were read, or would choose some bounded portions of the buffer (e.g. beginning 24, middle 24, and end 24) and hash those. (And maybe even borrowing from some of the seeded randomness hashing already introduces, maybe the strategy is picked randomly there too, per process invocation.)
Then, types like String or Data or Array wouldn't care about the specifics — they could pass an entire underlying buffer to a Hasher to combine, and the Hasher would ensure that unbounded reads didn't happen on its own end.
(And if the Hasher is tunable, you could create a Hasher instance to customize the limitations to better suit your needs, while still keeping bounds in place.)
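Hasher has no such knob today, so purely as a hedged illustration of the strategy described above, here is what the "count plus beginning/middle/end windows" sampling could look like if written as a wrapper key type instead. The type name, the 24-byte window, and the window placement are all invented for this sketch; partial hashing deliberately trades collision resistance for speed, which is exactly the tension discussed in the rest of the thread.

import Foundation

struct SampledData: Hashable {
    var data: Data

    static func == (lhs: Self, rhs: Self) -> Bool { lhs.data == rhs.data }

    func hash(into hasher: Hasher) {
        let window = 24
        hasher.combine(data.count)
        guard data.count > 3 * window else {
            // Small values: just hash everything.
            data.withUnsafeBytes { hasher.combine(bytes: $0) }
            return
        }
        // Large values: hash three bounded windows instead of every byte.
        let midStart = data.index(data.startIndex, offsetBy: data.count / 2)
        let chunks = [
            data.prefix(window),
            data[midStart ..< data.index(midStart, offsetBy: window)],
            data.suffix(window)
        ]
        for chunk in chunks {
            chunk.withUnsafeBytes { hasher.combine(bytes: $0) }
        }
    }
}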
Given other solutions are possible, I think this is too restrictive. Long strings in sets aren't a programmer error — and if we're talking theoretical malicious actors, you can imagine user input exceeding the arbitrary runtime length in order to take down a program...
Instead, I think we'd prefer to have help from the system, instead of limitations. If Hasher gave a bit more of a hand here, this could be the real win-win.
(Note: @lorentey and I have discussed this briefly in the past, while I was still working on Data, and I think we're agreed that this could be a really interesting direction to go. (Or at least were at the time.) What I wouldn't want to advocate is dropping the length requirement on Data before Hasher got some of these tools.)
i like that. e.g. 24-byte ranges from ten different locations in accordance with some pseudo-random sequence unique per process. plus an environment variable to be able to switch it off in debug mode.
i wonder how restrictive that actually is. maybe that's the case for 0.0001% apps and if you give enough time to adopt (via deprecation warnings) that won't be an issue. plus as a mitigation - the new tree based alternatives of set and dictionary.
i wonder how restrictive that actually is. maybe that's the case for 0.0001% apps and if you give enough time to adopt (via deprecation warnings) that won't be an issue. plus as a mitigation - the new tree based alternatives of set and dictionary.
My point is that this restriction shouldn't be necessary — if Hasher takes a fixed-length subset of an unbounded value, it doesn't matter how large the value actually is. There's no downside to allowing it to exist in a hashing structure, since the hash function is guaranteed to be constant-time, and the risk is mitigated. What benefit do you have in mind here?
i am thinking of hash collision attacks done with a large number of "small" elements (be it Data, or other types like Strings/Array/etc with the new "bound" hashValue implementation). even with the new smart "we will use some bytes for hash value but won't tell you which ones" implementation it would be relatively easy to find those bytes that are not part of hash (even if experimentally during runtime) and create a large number of small hash colliding elements and thus perform the hash collision attack. with a simpler approach "we use all bytes for hash" but "you can't use long values for your dictionary keys / set elements" that sort of attack will not be possible to begin with.
it would be relatively easy to find those bytes that are not part of hash (even if experimentally during runtime) and create a large number of small hash colliding elements and thus perform the hash collision attack
I think it's one thing to be able to submit input to a program in order to try to effect results, but introspection of a program (from outside of it, or worse, from within it) is an entirely different beast. I have a strong feeling that if an attacker had such close access to a running process to be able to ascertain whether a particular insertion triggers a hash collision, they're already way past what you might be able to do to keep them out. So much so that in the general case, it might not even be worth trying to guard against.
But, maybe this becomes a tunable option on Hasher — a maximum length on the number of bytes it hashes, along with a flag that indicates whether longer input should trap. I would strongly suggest it be off as a default, but if this niche case is a concern, you could use your own Hasher and turn it on.
In any case, I think we've strayed pretty far from the original topic here. I hope this was a reasonable summary of the current state of things, and I'd be interested to see if @lorentey or someone else would be interested in expanding on Hasher functionality, but maybe that's a topic for a different thread.
(Sigh. I did say I did not want to talk about Data.)
Discussion of O(1) vs. O(n) access to hash tables typically makes the assumption that hashing can be done in constant time, and it's usually important for this assumption to hold at least somewhat true.
This may be a (imo, highly questionable) expectation in some environments other than Swift. However, it's manifestly not the case with Hashable.
Swift's hashed collections are safe in the sense that as long as the key type properly implements Hashable
, they guarantee that lookups cost O(1) hashing/equality operations, no matter the distribution of the data.
By properly implementing hashing, I mean that the key feeds exactly the same bits to the Hasher as it compares in ==.
There are no ifs or buts about this -- that's what the protocol requires, and this is how String, Array, etc. hash themselves in the Standard Library.
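For illustration only (the type and its fields are invented here, not taken from the posts above), a conforming type that follows this rule hashes exactly the bytes its == looks at, and nothing else:

import Foundation

struct Chunk: Hashable {
    var payload: Data
    var debugLabel: String   // intentionally not part of equality

    static func == (lhs: Self, rhs: Self) -> Bool {
        lhs.payload == rhs.payload
    }

    func hash(into hasher: Hasher) {
        // Feed the hasher the raw payload bytes -- exactly what == compares.
        payload.withUnsafeBytes { hasher.combine(bytes: $0) }
    }
}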
As @tera's example demonstrates, Data's current hash(into:) implementation is broken.
This has ramifications not just theoretically, but practically. Set might guarantee O(1) equality checks as Karoy mentions, but on typical data, it likely performs many more hashing operations than equality checks.
My apologies, but this is simply incorrect. This isn't how hashed collections work in Swift.
Lookup operations in Set or Dictionary hash the supplied key exactly once, then they execute as many equality checks as necessary to search through the corresponding collision chain. (The same goes for insertions; the hash table sometimes needs to be resized there, but the costs of that are amortized away.)
In order to achieve our performance goals for Set and Dictionary, we must keep the expected length of the collision chain as low as possible. The only way to guarantee that is through proper hashing.
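A hedged way to watch this from outside the Standard Library is to instrument a key type and count the calls during a single lookup; everything below (the Counters box, CountingKey, the element count) is invented for this sketch:

final class Counters {
    var hashes = 0
    var equalities = 0
}

struct CountingKey: Hashable {
    var value: Int
    var counters: Counters

    static func == (lhs: Self, rhs: Self) -> Bool {
        lhs.counters.equalities += 1
        return lhs.value == rhs.value
    }

    func hash(into hasher: Hasher) {
        counters.hashes += 1
        hasher.combine(value)
    }
}

let counters = Counters()
let set = Set((0 ..< 1_000).map { CountingKey(value: $0, counters: counters) })

counters.hashes = 0
counters.equalities = 0
_ = set.contains(CountingKey(value: 500, counters: counters))
print(counters.hashes, counters.equalities)
// Expect exactly one hash, plus however many equality checks the (short)
// collision chain requires -- usually just one.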
In the case of types which can be unbounded in size, like String and Data, it's important for the common case for hashing to remain relatively bounded.
On the contrary, there is no expectation that types conforming to Hashable must hash in bounded time. (Just like there is no such expectation for ==.)
In Swift, hashing and equality have the same worst-case performance. On average, hashing is often slower, because it always needs to scan the entire value, while equality checks can bail out at the first difference. (Note that this does not mean that Swift's hash tables are somehow inefficient.)
(Note that it is possible to implement == in terms of hashing -- indeed, that's how we validate Hashable implementations in the Swift test suite.)
But Data has different typical access patterns, and it's much more likely that Data can grow to be very large. It's pretty rare to find Strings many megabytes in size; it's significantly less rare to find Data as such. (It's also not uncommon to find Data that is gigabytes in size, especially when memory-mapped from disk.)
I find this argument entirely unconvincing. String, Array, Data can and in fact do grow to arbitrary sizes. This is not at all a problem for Set or Dictionary -- they provide excellent performance.
It does raise the question though -- who is putting multi-megabyte Arrays in a Set, and why? Huge hash keys rarely if ever occur in regular practice. Optimizing this rare case to the detriment of the security of the general case seems like a questionable choice to me.
If the keys are received from an untrusted source, then it's absolutely crucial to be able to trust that hash insertions/lookups will execute in reasonable time.
For instance, take this loop:
var stuff: Set<Data> = []
for data in chunksReceivedFromNetwork {
    stuff.insert(data)
}
With the current (incorrect) Data.hash(into:) implementation, a clever adversary can trivially use @tera's hash collisions to make this loop take time that's quadratic in the number of chunks received. They don't even need to send multi-gigabyte chunks -- 84 bytes or so per chunk will do just fine.
If Data had correct hashing, this loop would be guaranteed to finish in time that's linear in the overall number of bytes received -- the chances of a huge collision chain would be vanishingly low, no matter how cleverly the input was chosen.
And to that point, in real world scenarios, it's exceedingly rare from real world testing to find collections of instances of Data whose first 80 bytes are all equal, but are not equal beyond that.
The way I look at it, it's exceedingly rare for such colliding Data to happen outside of HashDoS attacks.
Like Swift itself, the Standard Library's Set and Dictionary are designed to help people write code that needs to process hostile input. It seems counter-productive to intentionally weaken this stance, for no particularly good reason.
There are no plans to weaken hashing in Swift in the general case. (There is plenty of evidence that this would be a terrible idea; there is no evidence that it would be beneficial.)
We did discuss a scheme where small hash tables (say, less than hundred or so items) switch to incomplete hashing. That could be an interesting optimization at some point!
However, Hasher is fast enough in practice, and keys are generally small enough in practice, that implementing this isn't a priority. (In fact, attempts at complicating hashing usually just make Set/Dictionary slower.)
(On large contiguous keys like Data, Hasher is currently about twice as slow as simply reading memory. And we have plenty of ways to speed it up if this ever becomes a problem.)
Mutation and value types don't have to play into this. You can create an immutable Data with bytesNoCopy: from a memory-mapped region, or have an NSData created with +dataWithContentsOfURL:options:error: with an option of NSDataReadingMappedIfSafe/NSDataReadingMappedAlways, bridged to a Data instance.
(The memory mapped data case was an example of loading large amounts of data into memory, but you don't need the memory to be mapped — you can just as easily create a Data several gigabytes large in-memory if you've got enough RAM...)
Can you show me a use case where someone would intentionally use such a humungous Data as a dictionary key? I can certainly imagine using such data in dictionary values, but in keys? Why?
Even if Data implemented hashing the way NSArray does (it's lightning fast!¹), successfully looking up such a key would generally need to perform an equality check that fully compares contents. (Unless we are looking up the exact same instance, in which case why not just use an ObjectIdentifier or a (baseAddress, length) pair as the key?)
If profiling data indicates hashing performance is a bottleneck², and if we are confident that our input data will never be malicious, then it's super easy to customize hashing by defining a trivial wrapper type.
struct MyKey: Hashable {
    var value: Data

    static func ==(left: Self, right: Self) -> Bool {
        left.value == right.value
    }

    func hash(into hasher: Hasher) {
        // On careful consideration, I concluded that this is okay,
        // because I am confident that the number of keys that
        // only differ after the first 42 bytes will always be strictly
        // limited by some small constant.
        hasher.combine(value.prefix(42))
    }
}
If I have to do this, then I may grumble a bit about all the boilerplate, but then I implement it and move on. Later on, when it inevitably turns out that my analysis was wrong, and my app, server or daemon hangs in some Dictionary resize operation, it'll be relatively easy to figure out what's going on -- the code is quite explicit about the mess it's making.
On the other hand, if Data arbitrarily makes such decisions behind my back, and the decision proves harmful, then I'll need to go through a difficult debugging process, and at the end of it I'll feel betrayed by my tools. Swift promises to be a safe programming language; such unpleasant surprises break that promise, and drive people to other languages.
Partial hashing achieves a small constant speedup in the best case, but in exchange it generates a hugely increased risk of worst-case behavior -- and I think this is a bad tradeoff to make in general-purpose API.
Footnotes:
¹ Note: This is intended to be a jab at the notion of O(1) hashing, not at NSArray or Foundation's other Objective-C container types. (Foundation has valid reasons to keep these unfortunate definitions for these ancient types. Those reasons do not apply to Data.)
² In all these years, I only recall seeing such problems in code that accidentally bridged back-and-forth between Dictionary and NSDictionary in a tight loop -- a quite different problem.