Dictionary access complexity

I am spinning off the question of Dictionary lookup complexity here, as the "is String's subscript O(n)..." thread is getting too long already and should stay focused on string's subscript.

The preconditions:

  • In this topic I suggest focusing only on the best-case Dictionary lookup time (no hash collisions).

  • The precious gem post of @lorentey notwithstanding, I'd like to focus on the algorithmic complexity itself, disregarding issues related to Obj-C bridging, the potentially unbounded time of taking a lock or spinning indefinitely in an atomic compare-exchange loop, virtual-memory unpredictability, and the existence of other processes. As in a typical physics assignment ("we are in a vacuum, there is no friction, and all objects are point-sized"), we will assume no virtual memory, no memory caches, and no other threads or processes (true, under certain conditions the things we've just disregarded are the major contributors to the time of the operation in question).

With those preconditions in mind, I'd say that claiming dictionary lookup is "O(1)" is wrong, or incomplete at best, unless the units are specified, like so or similar: "O(1) key hash and EQ operations".

Without those units specified, one would naïvely assume "seconds" or "CPU clocks", which is not the case here. For example, in the case of string keys the time complexity of dictionary access is "O(keySize)", as String hash and EQ calculations are known to be O(stringSizeInUnicodePoints). I do not think it is obvious or fair to hand-wave it with just "O(1)" and let the reader figure out the missing details.
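
To illustrate with a rough sketch of mine (the absolute numbers will vary by machine; the growth is the point), timing String hashing for increasing key lengths shows this roughly linear behaviour, and a String-keyed dictionary lookup pays that cost on every access:

import Foundation

// Rough sketch (illustrative numbers only): String hashing time grows roughly
// linearly with the key's length.
for length in [1_000, 10_000, 100_000, 1_000_000] {
    let key = String(repeating: "a", count: length)
    let start = Date()
    for _ in 0..<100 { _ = key.hashValue }
    let perHash = Date().timeIntervalSince(start) / 100
    print("length \(length): \(perHash) s per hash")
}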

1 Like

When accessing a collection, O(n) or O(1) refers to the asymptotic complexity as the collection size n gets very large, and has no relation to the size or complexity of each of the input elements (Unicode strings in this case). You may be right that the type of input can make a big difference, but it is simply not what Big-O notation is concerned with.

3 Likes

…and O(keySize) refers to the asymptotic complexity as the size of the key gets very large.

Big-O notation is perfectly capable of describing asymptotic complexities in terms of things other than collection counts.

3 Likes

Of course it can, but that's not typically how Big-O notation is used. Big-O notation is a rough estimate tool used to compare algorithms operating under similar circumstances, and doesn't typically include terms which define those circumstances. For example, when we use Big-O notation for sorting algorithms, we don't typically include the time complexity cost of comparing two items — given O(M) complexity to compare two elements in an array, we don't give the time complexity of QuickSort as O(M*N*log(N)), or BubbleSort as O(M*N²); although there is absolutely a practical cost for the comparison, it's not typically interesting in the context of comparing algorithms. That O(M) cost is simply factored out of the comparison because all similar algorithms also have that same constant factor.

(FWIW, the same goes for general memory access: RAM is not actually O(1) in practice, but something closer to O(√N), N being the total amount of memory. But we don't include this as a factor in Big-O notation because all algorithms on the same machine eat this same cost, so for comparison purposes, it's typically not "interesting".)

To that end, it's very typical for Big-O notation to only describe the cost of an operation relative to the size of a collection if other similar operations also operate within the same constraints. Could, or should, the documentation mention the cost of hashing and equality? Of course, and quite possibly. But is the Big-O notation mentioned incorrect? No, not as Big-O notation is typically used.
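
To make the sorting example concrete, here's a small sketch of mine: each comparison below can cost up to O(M) in the string length (the long shared prefix has to be scanned), yet the complexity we quote for sorted counts comparisons only.

// Sketch of the comparison-cost point: sorting counts comparisons, and the
// per-comparison cost is factored out.
var comparisons = 0
let longKeys = (0..<1_000)
    .map { String(repeating: "x", count: 10_000) + String($0) }
    .shuffled()
let sortedKeys = longKeys.sorted { lhs, rhs in
    comparisons += 1          // count comparisons; each one is itself O(M)
    return lhs < rhs
}
print("\(comparisons) comparisons for \(sortedKeys.count) elements")  // roughly N log N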

3 Likes

Consider this unusual string implementation:

struct SpecialString: Hashable {
    private static var allStrings: [String] = [""]
    
    private var index: Int = 0
    
    private var string: String = "" {
        didSet {
            // TODO: ATM the table never shrinks, fix it later
            if let index = Self.allStrings.firstIndex(of: string) {
                // found equal string
                self.index = index
            } else {
                Self.allStrings.append(string)
                self.index = Self.allStrings.count - 1
            }
        }
    }
    
    func hash(into hasher: inout Hasher) {
        hasher.combine(index)
    }
    
    static func == (lhs: Self, rhs: Self) -> Bool {
        lhs.index == rhs.index
    }
    
    mutating func append(_ other: String) {
        string.append(other)
    }
    
    mutating func removeSubrange(_ range: Range<String.Index>) {
        string.removeSubrange(range)
    }
    
    // other string ops
}

It maintains a global table of strings (which in this simple sketch never shrinks, but that's irrelevant here), and every string is identified by its index in that table. In this implementation the EQ and hash operations are O(1) seconds (the obvious drawback is that every mutating string operation has to spend time recalculating the index).

Now consider two dictionaries: one keyed by normal strings and another keyed by these special strings. If you test them with strings of different sizes, you'll notice that only the "special string" keyed dictionary access is O(1) seconds †, whilst the "normal string" keyed dictionary access time grows proportionally to the string length.

† During the test I'm deliberately ignoring the time to create (or mutate) a special string, assuming I already have it, as the question at hand is dictionary access time and nothing else.
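
Here is a hypothetical sketch of such a test (the names and loop counts are mine; as per the footnote, key construction happens outside the timed region, and I'd expect the String-keyed lookups to slow down with key length while the SpecialString-keyed ones stay flat):

import Foundation

// Hypothetical benchmark sketch: compare lookup times for the two key types.
func measureLookups(keyLength: Int) {
    let raw = String(repeating: "a", count: keyLength)
    var special = SpecialString()
    special.append(raw)

    let stringDict = [raw: 1]          // [String: Int]
    let specialDict = [special: 1]     // [SpecialString: Int]

    let stringStart = Date()
    for _ in 0..<1_000 { _ = stringDict[raw] }        // hashes `raw` every time: O(keyLength)
    let stringTime = Date().timeIntervalSince(stringStart)

    let specialStart = Date()
    for _ in 0..<1_000 { _ = specialDict[special] }   // hashes the Int index: O(1)
    let specialTime = Date().timeIntervalSince(specialStart)

    print("keyLength \(keyLength): String \(stringTime) s, SpecialString \(specialTime) s")
}

for length in [1_000, 10_000, 100_000] {
    measureLookups(keyLength: length)
}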

I know the textbook tradition is to treat hash table access as O(1), ignoring (or silently assuming) the units, which are EQ/hash operations. My point is: so long as the units are not seconds (or CPU clocks, or anything proportional to normal time units), those units must be specified explicitly; otherwise people could assume "seconds", which would be incorrect.

A simpler and more trivial example:

// on Dictionary:
subscript(_ key: String) -> Value? { ... }  // "O(1)"
func foo(_ key: String) { sleep(1234) } // "O(1)"

It is obvious IMHO that only the second time complexity notation is correct here. The first is incorrect (or incomplete at best) without specifying the units, and could be corrected with something like "O(1) key hash/EQ operations". The second O(1) could be either left as is ("seconds" would be assumed) or it wouldn't harm to make it explicit as well: "O(1) seconds".

1 Like

No need to get fancy; you will notice the same for a dictionary with Int keys rather than String keys. Hashing an Int is O(1) while hashing a String is O(k) where k is the length of the String as it grows to infinity.

Despite this, the operations on Dictionary are still not described in relation to the element type, because the operations are described relative to operations on the same type with all else being equal. It's not that they don't matter (of course a dictionary with length-1,000,000 string keys is going to be significantly slower than a dictionary with integer keys); it's that Big-O notation is written to describe operations on Dictionary<T> against itself, or Array<T>, or MyCollection<T>, where the constraints on T itself are factored out, because T is arbitrary.

To give another common example: we often describe Array<T> as having O(1) access to random elements, but what if I choose T to be a type so obnoxiously large that copying a T from the array onto the stack takes, I dunno, a second? We don't describe that operation as O(sizeof(T)) because pretty much all operations on Array<T> have that implicit constraint, so it's not "interesting" from a Big-O perspective. (If you want to think about it another way: for a given T, sizeof(T) is constant, and therefore, irrelevant to Big-O, since O(c * N) = O(N) when c is constant.)
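
A rough sketch of that access-time claim (mine, with purely illustrative numbers): the time per element access stays flat as the array grows, because only the constant, per-T cost of copying one element out of the array is paid on each access.

import Foundation

// Rough sketch: repeated random access into arrays of very different sizes.
// The per-access time should stay roughly flat regardless of `count`.
for count in [1_000, 100_000, 10_000_000] {
    let array = Array(repeating: 42, count: count)
    var sink = 0
    let start = Date()
    for _ in 0..<1_000_000 {
        sink &+= array[count / 2]   // O(1) access regardless of `count`
    }
    let elapsed = Date().timeIntervalSince(start)
    print("count \(count): \(elapsed) s for 1M accesses (sink: \(sink))")
}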

And again, I'm not saying these factors are unimportant; quite the opposite: Big-O notation is hand-wavy*, and is meant to be used as a general rule of thumb to see how algorithms scale; when doing real-world performance analysis, you're significantly more likely to care about the constants you're working with in practice. e.g., there are plenty of simple O(N²) algorithms that perform significantly better than algorithms with "better" time complexity for typical input sizes, so much so that it's actually a big performance benefit to switch over to those algorithms for small inputs. (See, e.g., the adaptive sorting algorithm the stdlib uses.)

Just a minor addition: Big-O is a classification or grouping of algorithms that scale similarly; it doesn't have units. (In fact, again, units are just constant factors that would just fall out of the notation anyway.)

It's a category error to specify O(log(N)) seconds, or O(1) hashing operations; it's just O(log(N)) or O(1).

* Well, Big-O is defined very precisely from a theoretical point of view, but it is intentionally dismissive of constants, and used even more loosely in practice.

This is all to say that I think the focus on Big-O is misleading, and maybe it'd be better to describe the growth of these algorithms more precisely. These facets are really important, but they are ignored, by definition, by Big-O notation.

7 Likes

I disagree. The same sorting algorithm can be O(n) swaps but O(n^2) comparisons (e.g. selection sort) and knowing both is occasionally important.

3 Likes

I don't think it is reasonable, or even possible, for it to be the Swift standard library's job to explain what "algorithmic time complexity" means or what it is useful or not useful for.

If you don't have a solid conception of what the O() notation means, then yes, you can be confused as to what the implications are. But what is the alternative? Is it useful to replace all of the O() comments with "the amount of time this operation on Dictionary will take depends upon what type T is, as well as your particular hardware, etc?" No, it is not.

2 Likes

I wouldn't describe "swaps" and "comparisons" as units here. You're talking about two different operations, which fall into different Big-O orders.

In practice, Big-O is used very loosely, so it is very common to hear this type of description, and I think it's very understandable. If we're already concerned about the interpretation of Big-O notation in documentation though, I think we'd better be specific about what we're describing, and not propagate this (very common) type of "sloppy" notation.

FWIW, in my past experience as a Performance Engineer, Big-O notation was the bane of my existence, practically-speaking. The number of times I heard things like "this is fast because it's O(1)" in blind ignorance of the evidence… sigh.

It's not that asymptotic complexity estimation never has a use, it's just wildly over-used and over-trusted. It's fine to inform a hypothesis but it is not evidence in itself. Profile (or "measure" or whatever synonym you prefer). If an API wants to genuinely impress upon me that it is fast, then show me real benchmark results, or provide real guarantees (even if just obliquely with e.g. maximum instruction counts).

For things like libraries - especially 'standard' libraries like Swift Stdlib and Foundation - what really matters to users is:

  1. Is this a fast and/or efficient implementation of whatever this thing is? If I need a map, I need a map, but please provide me an optimal implementation of one.
  2. In what way do I influence the performance (for better or worse), e.g. the hashing and equality operations of the Key type, the significance (or not) of presizing, etc.

Unqualified statements of time complexity, in API docs, don't really address either of these actual user interests.

4 Likes

Ok, same inputs in these examples:

If you don't like that I am not using the string in "foo" at all, well, I could take the first Unicode scalar of the key and do something useful with it. Likewise, it could be:

// on Dictionary:
func bar(_ key: String) { // "O(1)"
    /* something takes linear time depending on key size */
}
func foo(_ key: String) { sleep(1234) } // "O(1)"

Both are documented as "O(1)" without specifying units; the first just "implies" that the units are EQ/hash operations without mentioning it. Would you agree that that would not be helpful at all?

I suggested this alternative:

O(1) key hash/EQ operations

the tail after O(...) being "units".

I agree. That's why I'm arguing that Big-O notation is not precise enough to be useful here. If we care about being more precise than what Big-O notation can describe, we should use something more precise than Big-O notation.

To be perfectly fair, I do remember having to refactor one function because the original implementation was O(n^4), but I’m pretty sure that’s the exception and not the norm.

I don't see why it can't be fixed to be less sloppy. For example if I see the following:

// on Dictionary:
func bar(_ key: String) { // "O(keySize), where keySize is the number of Unicode code points in the string"
    /* something takes linear time depending on key size */
}
func foo(_ key: String) { sleep(1234) } // "O(1)"

and, without looking at the sources, measured that "bar(string_of_100_unicode_points)" happens to take 1 second and "foo(string_of_100_unicode_points)" happens to take 1234 seconds, I would not be too far off saying that calling those functions with a string of 1000 Unicode points would take about 10 and 1234 seconds respectively. Whereas if "bar" were labeled with just "O(1)", I would incorrectly guess that "bar" would take about 1 second. There's nothing inherently sloppy in big-O notation itself; just don't use it in a sloppy way.

1 Like

No, he's right, it's a loose measure that is very sloppily talked about. All it is supposed to do is tell you how algorithms or data structures scale, that's it. It has long been known that it is an inadequate measure for smaller samples, but as Wade said, some will only talk about that and assume it's enough.

For your example, you can instead talk about the big O of the hashing operation, as obviously that will take longer for longer strings and have a material impact on dictionary access. It would be perfectly reasonable to document the traditional big O of the dictionary and the big O of the hashing algorithm for large input elements, the latter of which is what you appear to be looking for.

1 Like

I am going to push back extremely strongly against what you have written here.

Big-O is a classification or grouping of functions that scale similarly. It doesn’t have units, and therefore in order to apply it to anything in the real world, such as algorithms, the units must be specified or factored out.

Big-O notation can be used to make statements such as “The number of comparisons in a selection-sort of n elements is O(n²)” or “The number of swaps in a selection-sort of n elements is O(n)”.

It is a category error to say “Selection-sort is O(n²)” or “Selection-sort is O(n)” without specifying units. Sometimes the units can be inferred from context, in which case they are often omitted.

The question at hand in this thread is whether “Dictionary lookup is O(1)” is misleading, when a more precise statement would be, “Looking up a key of length m in a dictionary of size n takes O(m) steps on average” and what constitutes a “step” could also be specified as something like “fixed-width arithmetic and/or comparison operations”.
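
To make the selection-sort example concrete, here is a minimal sketch (mine) that counts the two quantities separately:

// Sketch: selection sort instrumented to count comparisons and swaps
// separately: O(n²) of the former, O(n) of the latter.
func selectionSort(_ input: [Int]) -> (sorted: [Int], comparisons: Int, swaps: Int) {
    var a = input
    var comparisons = 0
    var swaps = 0
    for i in 0..<a.count {
        var minIndex = i
        for j in (i + 1)..<a.count {
            comparisons += 1
            if a[j] < a[minIndex] { minIndex = j }
        }
        if minIndex != i {
            a.swapAt(i, minIndex)
            swaps += 1
        }
    }
    return (a, comparisons, swaps)
}

let result = selectionSort((0..<100).shuffled())
print("comparisons: \(result.comparisons)")   // n(n-1)/2 = 4950 for n = 100
print("swaps: \(result.swaps)")               // at most n - 1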

3 Likes

Especially for strings, the size of the key isn't the only thing that can slow things down. Two strings of the same size can vary a lot in the time it takes to hash them.

For instance, from my testing, there's a factor of 94 between the times taken to calculate the hashValue of the following two strings, even though the number of UTF-8 bytes is the same for both of them:

let str1 = String(repeating: "a", count: 1_000_000)
let str2 = String(repeating: "à\u{300}", count: 250_000)
// both:  1_000_000 UTF-8 bytes

You can't express this variation very well using algorithmic complexity, as it's not a matter of scaling based on the input size (both strings are the same size in bytes).

It's not even scaling based on the size of individual characters, because if you continue adding more combining code points to each character (while removing characters to keep the string size the same), the ratio goes down and stabilizes at a constant overhead of somewhere around 34 times the first string.

I guess the conclusion is that, with Unicode characters, time spent can be hidden in the mundane details of the input as much as in the length itself. Hashing is still O(n), but if you test with one input and find the time it takes satisfactory, this does not automatically imply it'll be satisfactory for all inputs of the same size. It seems that for hashing you need to accommodate a 94× factor (in the worst case that I could find).
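
For anyone who wants to reproduce this, here is a minimal timing harness (mine; the absolute numbers will differ by machine, the ratio between the two strings is the interesting part):

import Foundation

// Minimal harness: average the hashValue time over a few iterations per string.
func hashTime(_ s: String, iterations: Int = 20) -> TimeInterval {
    let start = Date()
    for _ in 0..<iterations { _ = s.hashValue }
    return Date().timeIntervalSince(start) / Double(iterations)
}

let str1 = String(repeating: "a", count: 1_000_000)
let str2 = String(repeating: "à\u{300}", count: 250_000)
print("str1: \(hashTime(str1)) s per hash")
print("str2: \(hashTime(str2)) s per hash")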

3 Likes

Interesting find. BTW, if you add a single "à\u{300}" to the end of str1, the hash time difference becomes less dramatic (3×). Perhaps it has to do with a "known to be ASCII only" optimisation, but there's more to it than just that, as there's still the 3× difference.

1 Like

Thanks for calling this out, because in trying to express the disconnect between the formal CS definitions of the terms we're using, and their looser colloquial usage, I was myself not clear enough about what I meant. I agree with your characterization of real-world usage here.

I'm providing the following not to be smarmy, but to be more precise about what I meant. This is the formal definition of O-notation that I learned, from Introduction to Algorithms (Fourth Edition) by Cormen et al. (typesetting theirs):

For a given function g(n), we denote by O(g(n)) the set of functions

O(g(n)) = { f(n) : there exist positive constants c and n₀ such that 0 ≤ f(n) ≤ cg(n) for all n ≥ n₀ }

What I was hoping to express was that in formal analysis, we are not concerned about units. O(g(n)) defines a set of functions, and a formal set cannot have units; similarly, we use O(g(n)) to relate functions f(n) and g(n), and if you consider functions to have units, then:

  1. They better have the same units in order to be comparable, and if so,
  2. Their units are constant, and therefore can be factored into c in the definition above

In the real world: absolutely, if you don't provide context for the calculations you're performing, the analysis isn't helpful. But what I'm trying to express and focus on is that in practice, in exactly that real-world type of analysis, we aren't rigorous — everyone describes Big-O notation a little bit differently. At the same time, a formal runtime analysis of these algorithms would be so obtuse as to be largely useless.

i.e., I'm just trying to say that there's a bit of a catch-22 to Big-O notation: to be somewhat more easily understood, we take liberties with notation, units, context, etc., which leaves room for ambiguity and misunderstanding; being more precise with the notation makes the analysis less accessible, and better suited to theoretical analysis and discussion.

Which is why, if we're really interested in updating comments/documentation to be more exact, then instead of trying to stick to Big-O notation as suggested above, I think we could do even better. I've worked with plenty of talented engineers who don't have a theoretical or formal CS background, and who never learned Big-O notation. Why not cut to the chase?

// `key` is used to perform a Dictionary lookup, which requires hashing every byte of
// the string. If `key` has many Unicode code points, this can be prohibitively
// expensive.
func bar(_ key: String) { ... }
3 Likes

I think we’re using the word “units” somewhat differently—and they’re both valid definitions, so it’s totally understandable.

You appear to be using “units” to mean things like “seconds” or “meters”, whereas I was using it to mean things like “time” or “distance”. As in, “velocity has units of distance over time”. (These are also called “dimensions”, though that word has its own set of possible ambiguities.)

It doesn’t matter if we measure time in seconds or centuries or cpu cycles, but it does matter that we’re measuring time (or whatever else we care to measure).

For our purposes, big-O notation is essentially about counting things in terms of the counts of other things. And you’re right, people are often fuzzy about precisely what is being counted, and that fuzziness is often a feature, not a bug. But in order to make a meaningful statement, we still have to be clear about what types of things we are counting.

3 Likes