[Starter Proposal] Add Stable Sort to StdLib

felix91gr · January 31, 2018, 9:33pm

reference* types

allevato · January 31, 2018, 9:41pm

My line of reasoning is the following:

The Equatable protocol defines a notion of substitutability that is wholly independent of identity.
Whether a type is a value type or a reference type is irrelevant here—if it conforms to Equatable, then by that protocol's documented definition, two values that compare equal are interchangeable regardless of whether they are two separate objects with different identities or not.
Therefore, it doesn't matter whether you choose a stable or unstable sort, because if the values are truly interchangeable, then the order of equal elements shouldn't matter.
So, a sort of Equatable elements that does not also provide a custom comparison predicate can choose a more performant unstable sort algorithm and still provide results that are entirely consistent with the substitutability property of the protocol.

The documentation for Equatable says this:

Equality implies substitutability—any two instances that compare equally can be used interchangeably in any code that depends on their values.

Your argument seems to be that even if two reference types are Equatable, you can still distinguish them based on their identity. Yes, that's true, but if you say that distinction matters for a sort, then it clearly violates the interchangeability requirement of its Equatable conformance.

xwu · January 31, 2018, 9:43pm

Interchangeable in value, yes. But stability upon sorting an instance of T : Collection where Element : class has to do with identity.

saagarjha · January 31, 2018, 9:43pm

That's my point: there will be code that doesn't depend on their values (and people are free to write this code, since they're not value types), and you're making that code at lot harder to write.

xwu · January 31, 2018, 9:43pm

allevato:

My line of reasoning is the following:

The Equatable protocol defines a notion of substitutability that is wholly independent of identity.

Whether a type is a value type or a reference type is irrelevant here—if it conforms to Equatable, then by that protocol’s documented definition, two values that compare equal are interchangeable regardless of whether they are two separate objects with different identities or not.

Therefore, it doesn’t matter whether you choose a stable or unstable sort, because if the values are truly interchangeable, then the order of equal elements shouldn’t matter.

So, a sort of Equatable elements that does not also provide a custom comparison predicate can choose a more performant unstable sort algorithm and still provide results that are entirely consistent with the substitutability property of the protocol.

What, then, is an unstable sort if you use that definition?

allevato · January 31, 2018, 9:53pm

Let's assume that we have an array with reference types [B2, A1, C1, B1], which conform to Comparable (and thus Equatable) based on the letter alone.

If I sort those by using the Comparable conformance, my claim is that substitutability means that I shouldn't be caring if I get back [A1, B2, B1, C1] or [A1, B1, B2, C1]. Either should be just as good, because my conformance says those two Bs are interchangeable, so a more performant sort algorithm that does not guarantee order stability is fine.

If I sort those instead by using a version of sort that takes a custom comparator, then I'm asking it to ignore the conformance and instead use that comparator, which only answers the question "is X ordered before Y" and makes no claim about interchangeability. So, I would expect the sort to be the safest possible result in that case, which would be stable—[A1, B2, B1, C1].

dabrahams · January 31, 2018, 9:58pm

That's my reasoning, FWIW. I also think that in Swift's user community there are likely to be many people who:

Don't care about the performance difference
Naturally assume that sorting is stable because of effects they experience in, e.g. table UIs

I advocate using an adaptive algorithm that decides how much/whether extra memory is used, and falls back to something like this to get O(1) memory use at, potentially, some cost to performance. I'm not saying O(1) memory use overall is a critical goal, but it can easily be achieved by falling back when the number of elements exceeds some constant.

Personally I don't think I'll be thoroughly convinced we need to have an unstable sort in the standard library until someone can demonstrate performance gains over the best adaptive stable sort we come up with.

Let me also add that I've seen real performance improvements over the standard library's sort by using insertion sort (which is stable) when the number of elements is known to be small. I do realize that the standard library's sort uses insertion sort (dynamically) for some of those cases, but the way it makes the choice to use insertion sort is clearly not always optimal.

xwu · January 31, 2018, 10:07pm

Those two Bs are interchangeable in value, sure. And yes, I might not care whether I get back [A1, B2, B1, C1] or [A1, B1, B2, C1] after one sort precisely because they are interchangeable in value. But it's still legitimate for me to care if the second item in the array is going to be the same element (i.e., identity) if I sort it again.

allevato · January 31, 2018, 10:13pm

I see your point, and my counter would be "if you care about identity, then use a custom comparator to signal that to the sort algorithm because it removes the substitutability property from the equation." I can see how some might say that's too subtle, though.

That being said, I'd be perfectly fine with stable sort being the only option provided by stdlib, which makes this a non-issue anyway.

saagarjha · January 31, 2018, 10:20pm

I'm sorry, was this supposed to link to something?

saagarjha · January 31, 2018, 10:23pm

Could get away with an unstable sort on value types? Since there's no concept of identity here a faster/more memory efficient algorithm could be employed.

felix91gr · January 31, 2018, 10:25pm

We could (this is an implementation detail, but important anyway) make the algorithm pick one of three:

Insertion sort (for very small N)
O(N) space Timsort or similar (for normal N)
O(1) space in-place Mergesort or similar (for huge N)

The threshold between 1 and 2 can maybe be hardcoded after doing benchmarks over N.

The threshold between 2 and 3 could be picked from the size of the elements to be sorted and the maximum space admissible, determined either dynamically (asking the runtime for what is the memory situation) or statically (from a good reasoning on available resources).

saagarjha · January 31, 2018, 10:27pm

Actually, Timsort does this already, so you only really have two cases here.

felix91gr · January 31, 2018, 10:29pm

Hmm. Maybe?
For custom comparators (think comparing vectors by X, Y or Z) the value-equality isn’t present in the comparison. There is equality, but only under a certain criteria. There is no full equality under those comparisons.

Under Comparable conformance comparators though, I’m not sure what the answer might be.

edit: added quote. My gosh, I was ambiguous w/o it

dabrahams · January 31, 2018, 11:04pm

Fixed; refresh and try again.

ben-cohen · January 31, 2018, 11:37pm

This is the key thing for me in favor of it. Because most languages use a stable sort, users are used to that and we would encourage bugs where people make the wrong assumption if Swift zigs where other languages zag.

ben-cohen · January 31, 2018, 11:44pm

Value types can have identity or reference semantics (UnsafeBufferPointer is a value type).

We could get away with types with value semantics having an unstable sort. But there's no way for the type system to detect value semantics (many threads out there on why we don't have such a feature).

We could potentially specialize sorting of known value types like Int or String, perhaps, if it can be shown to be worth it. But hopefully, like Dave says, a good enough stable sort would obviate this need.

saagarjha · January 31, 2018, 11:50pm

Actually, I don't think this is true. Most of the languages I've used offer an unstable sort by default (i.e., the "sort" function), with an option to enable stable sorting. This includes C, C++, Objective-C, ~~Rust~~, JavaScript (actually, EMCA doesn't define a particular sort, so you have to assume it may not be stable), Java (for primitive types), and Lua (which provides no stable sort). Only Python does stable sorting by default. I'm not arguing against the fact that stable sort should be the default, but just wanted to weigh in with my experience.

ben-cohen · February 1, 2018, 12:05am

The rust sort appears to be stable.

The Java sort for primitive types is indeed unstable. I'm comfortable with my instances of the number 42 being reordered :)

saagarjha · February 1, 2018, 12:43am

Oops, I guess I should read the docs more closely…looks like stable sorts aren't as rare as I thought they were.