How to check two `Array` instances for identity equality in constant time?

This argument serves as a reminder that big-O notation is not very useful without context. My Data Structures professor made a point of this when we covered radix sort in my sophomore year of college. If radix sort is O(N), and all the other sorting algorithms we’d looked at thus far were at least O(N log N), why does anyone use anything other than radix sort? The answer, of course, is that N was counting different things.

Yes, memcmp is linear in the size of the items being compared. This is pretty important when you’re talking about writing a generic algorithm over arrays of arbitrary length containing values of arbitrary type. Nobody talks about the big-O complexity of memcmp(int[4], int[4]) precisely because it’s over-constrained.
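To make that concrete, here is a sketch (my own illustration, not standard library code) of a memcmp-style bitwise comparison. Its cost is linear in MemoryLayout<T>.size, but once T is fixed, that size is a compile-time constant:

```swift
// Sketch: a memcmp-style bitwise comparison. The cost is linear in
// MemoryLayout<T>.size -- but for a concrete T, like a four-Int tuple,
// that size is a constant. Only meaningful for trivial types with no
// padding bytes and no references.
func bitwiseEqual<T>(_ a: T, _ b: T) -> Bool {
    withUnsafeBytes(of: a) { aBytes in
        withUnsafeBytes(of: b) { bBytes in
            aBytes.elementsEqual(bBytes)
        }
    }
}

let x = (1, 2, 3, 4)
let y = (1, 2, 3, 4)
print(bitwiseEqual(x, y))  // true
```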

6 Likes

I am, like Jordan himself, rather unenthusiastic about it.

A quick test for non-equality is indeed quite useful sometimes! However, I do not think implementing it generally requires "insider access" like implementing isIdentical(to:) does -- a negative test would generally be based on information that is already exposed through the type's public API surface. This makes such tests nonessential -- people who need them can easily define them on their own.

Even worse, people will most definitely disagree on how much data needs to get compared, and in what order -- there is no One Correct Implementation of isDefinitelyNotEqual; its definition varies with context.

On top of it all, there is also no obvious name for such a thing.

1 Like

It depends on context. Suppose I have a function:

func foo<T>(_: [T]) { … }

In general, the running time of foo() will depend on two parameters: the size of T, and the number of elements in the array. If we fix either parameter while letting the other vary, we may observe very different behaviors, for example the algorithm might be linear in the size of T but quadratic in the length of the array.

That is correct. Copying or moving a Swift value will in general require a number of steps that is linear in the size of the value.

5 Likes

When documenting (and thinking about) the algorithmic complexity of generic functions, we necessarily need to take the size of T to be a constant -- because it is constant in the context we're actually trying to analyze, which is running actual Swift code.

Analyzing the "growth rate" of a generic function while we "vary" its generic argument is a fun idea, but I don't really see the point of it -- how does one do that in actual Swift? Can you show a concrete example of code that does that?

Say I'm a Swift programmer, and I want to see how isKnownIdentical compares to == when I invoke it on two Array instances. My asymptotic complexity analysis tells me that Array.== has linear worst-case time complexity (as measured by the number of element comparisons it performs), while isKnownIdentical has constant time complexity (as measured by the number of bytes it compares). The two estimates are given in very different units, but keeping that in mind, this still correlates very well with actual benchmarking results -- even if Element is something simple like Int, isKnownIdentical will handily beat == in any benchmark.

Pretending that isKnownIdentical had "linear complexity" would be absurd -- even its actual wall-clock runtime has a clear constant upper bound that it never exceeds no matter what I feed into it.
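For illustration, here is one way such a constant-time test could be sketched. The method name is hypothetical, and comparing buffer base addresses is only an approximation of what the real thing would do with internal storage references -- but it shows why the cost has a fixed upper bound:

```swift
// Sketch (hypothetical method name): detect whether two arrays share the
// same underlying storage. This inspects a couple of words of memory, so
// its cost is bounded by a constant regardless of the element count.
// Caveat: the real implementation would compare storage references
// directly; base-address comparison is an approximation for native arrays.
extension Array {
    func isKnownIdenticalSketch(to other: Self) -> Bool {
        withUnsafeBufferPointer { a in
            other.withUnsafeBufferPointer { b in
                a.count == b.count && a.baseAddress == b.baseAddress
            }
        }
    }
}

let a = Array(1...1_000_000)
let b = a  // copy-on-write: storage is shared
print(a.isKnownIdenticalSketch(to: b))  // true, without touching any element
```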

Now let's say we change Array into "InlineArray" above. Instances of an "inline array" type always have the same element count, and their size is also constant. Therefore asymptotic analysis says InlineArray.== performs at most O(1) element comparisons, while InlineArray.isKnownIdentical compares O(1) bytes. The units are again wildly different, but assuming that Element.== itself has O(1) time complexity, then ==/isKnownIdentical are estimated to be in the same ballpark -- which they most definitely are in practice.

Complexity analysis does work in that it produces useful approximations surprisingly often, as long as it is applied with care. Inconsistently treating constants as variables ruins that. :worried:

2 Likes

Sure:

struct Customer: Comparable {
  var name: String
  …
}

var customers: [Customer] = …
customers.sort()

As I add more fields to my struct, the running time of the sort will increase, even if the new fields do not participate in the comparison operation. This is not the case if I change the struct into a class.

In fact, one sorting algorithm might be linear in the size of the value (if we compute a permutation and then perform all swaps at the end) while another might be super-linear (if we swap values as we go along), even if both algorithms are O(n log n) in the size of the array. These two algorithms will perform very differently when sorting an array of large value types, so knowing this detail might be important to users.
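A sketch of the first strategy (my own illustration, not a stdlib algorithm): sort cheap-to-move indices first, then apply the permutation in one pass, so each large element is moved only once:

```swift
// Sketch: sort an array of potentially large values by first ordering
// its indices (which are cheap to move), then applying the permutation
// in a single pass -- each large element is copied once, instead of
// being swapped O(log n) times as it migrates into place.
func permutationSorted<T>(
    _ array: [T],
    by areInIncreasingOrder: (T, T) -> Bool
) -> [T] {
    let order = array.indices.sorted {
        areInIncreasingOrder(array[$0], array[$1])
    }
    return order.map { array[$0] }
}

print(permutationSorted([3, 1, 2], by: <))  // [1, 2, 3]
```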

3 Likes

How does one add new fields to struct Customer inside a running Swift app? Can I somehow do that in a loop?

In my experience, Swift developers don't seem overly keen on analyzing this sort of "varying".

The time complexity of Array.sort() is documented to be O(n*log(n)) element comparisons and swaps/moves, where n is the count of the array -- no matter what type Element is.

(The "element comparisons and swaps/moves" part is crucial, but it is not explicitly stated in the documentation. We are notably terrible at spelling out what units we are using to measure these documented complexities; and the (implicit) units can differ wildly between operations, even on the same type. Sadly this is not by accident -- precisely spelling out the correct details appears to have the opposite of the intended effect: it scares some people into endlessly asking for ever more clarifying details, and sometimes even frightens them into trying to roll their own (generally much less efficient and/or incorrect) implementations. I keep hoping we can maybe some day clarify things a little more, but so far I've been bitterly disappointed every time we try. To be honest, this current discussion is not helping, either.)

Your original struct Customer with a single name field is already enough to make the absolute time complexity of customers.sort() (expressed in terms of memory accesses) quite terrifying. String is an absurdly complex type that has to perform Unicode normalization in its == implementation; that algorithm itself involves sorting arbitrarily long Unicode scalar sequences, making its worst-case time complexity O(n*log(n)), where n is the size of the string in UTF-8 code units. (Regular non-Zalgo cases are much faster, but unless Zalgo-style customer names are manually rejected, this is the lowest possible upper bound.) The worst-case cost of comparing Swift String instances grows superlinearly with the length of the input.

To sort an array of struct Customer values, we need to repeat this n * log(n) normalization algorithm for every pair of customers that the array's sort decides to compare. This turns the absolute complexity of [Customer].sort() into some eldritch horror like O(m * log(m) * n * log(n)) memory accesses, where m is the count of the array, and n is the maximum length of a customer's name in bytes. If we double the size of the single name field, the worst-case complexity of each comparison (and along with it, the array sort itself) grows superlinearly -- this is a highly unintuitive result, and I can easily imagine it ruining someone's day.

Compared to that, adding a bunch of additional fields to struct Customer seems like a trivial change to me. The growth rate as we add more fields (of the same size) matches what sort promises in its documentation, and what I would naively expect: if the items are twice as large, then comparing them will take at most twice as long, and so will sorting them in an array. There is no surprising superlinear growth.

None of this has any relevance to the idea of adding simple identity tests to types. Identity tests can be guaranteed to take O(1) time -- indeed, that performance guarantee is the sole reason for their existence, as Equatable does not set any requirement on the complexity of ==. Implementing identity tests by comparing the raw bits in a type's in-memory representation is a perfectly valid strategy (for some types), and, by definition, it always satisfies the O(1) complexity expectation.
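As a sketch of the fail-fast pattern this enables (hypothetical type and method names, my own illustration): a copy-on-write type can implement its identity test as a single reference comparison, and layer == on top of it:

```swift
// Sketch (hypothetical type): a copy-on-write wrapper whose identity
// test is a single reference comparison -- O(1) by construction -- and
// whose == uses it as a fail-fast shortcut before falling back to a
// full element comparison.
final class Storage {
    var values: [Int]
    init(_ values: [Int]) { self.values = values }
}

struct Box: Equatable {
    var storage: Storage
    init(_ values: [Int]) { storage = Storage(values) }

    func isIdenticalSketch(to other: Box) -> Bool {
        storage === other.storage  // one pointer comparison
    }

    static func == (a: Box, b: Box) -> Bool {
        a.isIdenticalSketch(to: b) || a.storage.values == b.storage.values
    }
}

let a = Box([1, 2, 3])
let b = a  // shares storage
print(a.isIdenticalSketch(to: b))  // true
print(Box([1, 2, 3]) == a)         // true, via the slow path
```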

No type should offer an isIdentical(to:) method unless (a) it is able to return a correct answer, and (b) it is meaningfully faster than ==. Array can easily satisfy both criteria, and InlineArray would generally fail both.

2 Likes

I agree with all these points… and my feeling (referenced in the pitch thread) is still leaning in favor of shipping isIdentical as opposed to definitely[Not]Equal.

1 Like

Just because the generic argument can't be "varied" at runtime doesn't mean that it's pointless to analyze it, or that time complexity notation can't be used to characterize that behavior. One can be interested in the execution time of an algorithm as a function of a compile-time provided parameter, not just runtime-provided parameters.

In practical terms, if I have a function that takes an InlineArray<N, Int> and performs some operation on it, I definitely care whether InlineArray<3000, Int> will take 1000 times longer than InlineArray<3, Int> or roughly the same time, even though N ought to be provided at compile time.

2 Likes

You all are not being rigidly pedantic enough.

Collections in Swift have a count property of type Int, so any valid instance of such a type has at most Int.max elements.

When running on any particular computer system, the value of Int.max is a constant. Therefore, any algorithm whose complexity can be expressed in terms of the number of elements in the collection, in fact has constant complexity.

For example:

  • O(n) is actually O(Int.max) which is O(1)
  • O(n²) is actually O(Int.max²) which is O(1)
  • O(n log(n)) is actually O(Int.max log(Int.max)) which is O(1)

It clearly follows that all statements of algorithmic complexity in official documentation for Swift collections should be updated post-haste to say “O(1)” in accordance with this rigorous, technically correct, objective fact.

2 Likes

I think the point being argued is what aspect of the complexity is salient or not to users. Of course, we can stretch that also: as far as I am concerned, all algorithms salient to me terminate when I terminate, so everything is O(1).

9 Likes

We can break down the use cases along two axes: whether the arrays’ length is known, and whether the arrays’ element type is known.

The time complexity of the == operation can be described in terms of these unknowns: it is O((Array1.count + Array2.count) * cost(func ==(Element, Element))).

I would posit the cases that elicit desire for a “shortcut” operation are the ones where the code that’s comparing the arrays for equality lacks knowledge of (aka is generic over) either the length or the type of one or both of the arrays. (I’m skipping the case where both are unknown because that’s essentially reimplementing the generic == operator for arrays.)

Plugging those unknowns into the expression above, those cases are O(Array1.count + Array2.count) and O(func ==(Element, Element)). The first is clearly linear in its unknowns. In the second, we can use what we know about the implementation of func ==<T>(T, T) in Swift to rewrite it as O(MemoryLayout<Element>.size). This is also clearly linear in its single unknown.
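Spelled out as code (my own sketch, not the stdlib implementation), the generic comparison whose cost is the product of those two factors looks roughly like this:

```swift
// Sketch of a generic elementwise comparison: the count check is the
// O(1) fail-fast part; the loop then performs up to `count` calls to
// Element.==, each of whose cost depends on the element type.
func elementwiseEqual<Element: Equatable>(
    _ a: [Element], _ b: [Element]
) -> Bool {
    guard a.count == b.count else { return false }
    for i in a.indices where a[i] != b[i] {
        return false
    }
    return true
}

print(elementwiseEqual([1, 2, 3], [1, 2, 3]))  // true
print(elementwiseEqual([1, 2], [1, 2, 3]))     // false
```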

Was this re-derivation really necessary for the discussion? Is there actually debate that adding some form of identity to Array would enable a fail-fast implementation of equality?

1 Like

How does one add new fields to struct Customer inside a running Swift app? Can I somehow do that in a loop?

Why does it matter? I can make a change (add a number of fields, change a constant, etc.), recompile the app, and run it again. If I'm so inclined to automate it, I could write a script that does that for me. Things like these are not important for big-O notation analysis.

1 Like

Forget Collection, the total size of memory is bounded by Int.max by definition of UnsafePointer. Therefore a computer is a finite state machine, the halting problem is decidable, and P = NP. :wink:

7 Likes

Well, a computer is a finite state machine. As is the universe itself—it's just that the number of states possible in the universe (or even an 8-bit computer from 1979) is beyond astronomically large.

(Fun fact: if you somehow managed to create an actual unbounded Turing machine in a finite space, it would instantly collapse into a black hole.)

3 Likes

I think that just as most Swift code is written with the assumption that copying is cheap, most Swift code is written with the assumption that values have reasonably small inline sizes and that moving values is cheap. One example is how (as far as I know) generic functions dynamically allocate values of the generic type on the stack. The time complexity of algorithms with respect to inline value size was therefore assumed to be unimportant. InlineArray breaks that assumption, and I think we may now have to deal with the implications of that. I'm particularly thinking about what happens if, in the future, we allow the size of an InlineArray value to be erased in an existential (for example, any<n: Int> InlineArray<n, Int>).

1 Like

I don’t think this assumption is specific to Swift. C is certainly biased this way as well, and its built-in array syntax is directly equivalent to InlineArray.

Just like in C, authors of very large Swift data structures should either discourage copying such values through documentation, or make it impossible (e.g. by making the data structure a class).