Pitch [stdlib]: making sorting algorithm selectable

GLanza · May 3, 2019, 4:57pm

Or similarly without the need of instances:

public protocol SortImplementation {
    static func sort<Element>(array: inout Array<Element>, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows
}

struct TimSort: SortImplementation {
    static func sort<Element>(array: inout Array<Element>, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows {
        print("TimSort")
    }
}


struct HeapSort: SortImplementation {
    static func sort<Element>(array: inout Array<Element>, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows {
        print("HeapSort")
    }
}

public struct SortAlgorithm {
    static var heapSort: SortAlgorithm { return SortAlgorithm(with: HeapSort.self) }
    static var timSort: SortAlgorithm { return SortAlgorithm(with: TimSort.self) }
    
    let implementation: SortImplementation.Type
    init<T: SortImplementation>(with implementation: T.Type) {
        self.implementation = implementation
    }
}

extension Array {
    mutating func sort(using implementation: SortImplementation.Type, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows {
        try implementation.sort(array: &self, by: areInIncreasingOrder)
    }
    
    mutating func sort(using algorithm: SortAlgorithm = .timSort, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows {
        try algorithm.implementation.sort(array: &self, by: areInIncreasingOrder)
    }
}

var array = [1, 2, 3, 4]
array.sort(using: .heapSort, by: <)

yxckjhasdkjh · May 3, 2019, 4:58pm

I'm all for having stable vs. non-stable sort (or any other characteristics that influence the result), but exposing different sorting algorithms would just increase the stdlib surface for almost no tangible benefit, IMHO.

There needs to be a point where a language stops adding everything to the stdlib and instead encourages the wider community to write libraries for more narrow use cases. I would argue that the necessity to choose a particular sorting algorithm would constitute a very narrow use case.

GLanza · May 3, 2019, 5:04pm

Frye in this particular case i'm trying to address the problem that sort in particular from swift 4 to swift 5 changed to be from unstable and in place to stable with space complexity undocumented (without unpacking stdlib). As developer i would like to know what actually happens when sort is called, and be sure that the result will be predictable across swift versions. I don't know if i explained correctly my point of view.

yxckjhasdkjh · May 3, 2019, 5:08pm

If Swift 5 broke backwards compatibility, that's a different problem and in particular, your proposal wouldn't have solved it (since the default would have changed, all your .sort() calls without an argument would have had the new behaviour).

GLanza · May 3, 2019, 5:15pm

Correct, but having the choice the developer can judge and say "for this specific use case, it doesn't matter" -> array.sort() or "ok, in this case i want to be sure that the algorithm is stable" -> array(using: .timSort) or "ok, my list is always sorted. If i add an element i must re-sort... insertionSort is my choice" -> array.sort(using: .insertionSort) or even "here i need something custom" -> array.sort(using: Custom.self)

Right now devs might make decisions assuming that the current swift algorithm is stable. If in swift 6 this changes backwards compatibility is at risk.

With this proposal, to be sure that it will stay stable, all you have to do is to not rely on the default parameter and use array.sort(using: .timSort). If swift 6 changes the default sorting algorithm dev choice won't break.

yxckjhasdkjh · May 3, 2019, 5:16pm

Without this proposal, to be sure that it will stay stable, all you have to do is write timSort(array).

(of course you have to write timSort first. But that will happen faster than waiting for the stdlib to implement it. Or you find a library.)

GLanza · May 3, 2019, 5:18pm

Why should i write timsort if array.sort() is already timsort?

stdlib, when it had introsort had already implementations for heapsort, insertionsort and quicksort. Now it has implementations for insertionsort and timsort. I say let's expose them.

jrose · May 3, 2019, 5:35pm

This seems like a cultural problem more than a technical problem. The standard library promises nothing about the implementations of its functions other than what's in the doc comments. I don't see that changing any time soon.

Now, could the standard library expose its current sorting implementations explicitly, either in a unified API, or even just as separate top-level functions? Sure, but then it's committing to supporting those implementations forever, which means larger binaries and more code to maintain. (How much maintenance does a sorting algorithm need? Depends on whether new language features like move-only types are going to affect it.)

So I'm pretty strongly on the side of "if you need to know more about your sorting algorithm, you should get it from a library other than the standard library".

GLanza · May 3, 2019, 5:49pm

Well... I disagree. It's like saying that results in terms of performances are unreliable (which is different from not making promises). The difference between a stable and unstable algorithm is big.
I do understand that there are no promises about implementations, but I believe that the exposure of 5 sorting algorithms (introsort, heapsort, insertionsort, quicksort, timsort) is not a big pain in terms of maintenance but it could give benefits to the community.

bzamayo · May 3, 2019, 6:04pm

Offering per-algorithm choices like this is overkill. At some point, I'm sure the standard library will offer a stable and unstable sorting function. I don't think it needs to go any more fine-grained than that. I believe the core team said before that the documentation for sort/sorted is intentionally left ambiguous currently, until a decision is made over whether the default sort should be stable or not. At that point, it would become a guarantee to maintain that particular behaviour.

John_McCall · May 3, 2019, 6:07pm

I agree with Jordan. The meta-request here seems to be "if the standard library makes a change which it can't guarantee is a strict improvement for everyone, it should provide a configuration to get back to the old behavior". That's an understandable reaction but ultimately not really workable.

GLanza · May 3, 2019, 6:09pm

I understand your point of view, and I respect that. I would prefer to not have to consider unreliable the current implementation until a decision is made, plus I don't see the downside in exposing algorithms currently already existing in the stdlib, and let the developers duplicate them.

scanon · May 3, 2019, 6:25pm

The trouble (or rather, one trouble) is that there is no canonical implementation of these algorithms. "Quicksort" does not refer to a single completely unambiguous sequence of comparisons and swaps. They all leave some details unspecified. How is the pivot chosen for Quicksort? If the standard library changes how the pivot is selected, do we now add .quicksort(version: 2)? We would have to, if you want to be able to perfectly reproduce the previous behavior.

Like Jordan, I think there's a group of people who want completely stable performance like this, but the place for that API to live is probably not the standard library.

Ben_Cohen · May 3, 2019, 6:28pm

What could change – what I think should change – is that the standard library could document 1) that sort is stable and 2) that it may allocate scratch space. These are the guarantees needed – not which algorithm is used (and thus, as a side effect, imply vaguely the space and time complexity). This would take an evolution proposal to lock in, but luckily since the algorithm is stable as of ABI stability, could be applied unconditionally retroactively.

I believe that a stable sort by default is the right choice. Most languages choose a stable sort, many developers will assume Swift's is stable, and it was a cause of bugs for Swift to make a different choice (I have screened radars to this effect – it was especially bad because it was stable for small inputs so often missed in testing). If constant-space or unstable-for-speed sort are useful as a fundamental facility in the std lib (to be debated), they should be considered as differently-named functions. But they should still not specify the actual algorithm used.

Note that it is not true that the old algorithm was constant space and that it didn't allocate. It both allocated when CoW triggered, and used memory on the stack for large inputs.

GLanza · May 3, 2019, 6:29pm

But you can document that. Developers choice will be conscious at that point, plus they will still have the choice to get their own implementations, but not be exposed to big changes like stable vs in-place.

Tino · May 3, 2019, 6:36pm

When you are in a situation where you depend that much on details of a certain algorithm, imho it‘s better to include the source of it in your project, as it is most likely a vital part of it (and you most likely have enough expertise to find a good implementation).

GLanza · May 3, 2019, 6:41pm

It's a fair point, yet if I judge today swift default implementation to be good enough for my needs I don't think it's right to duplicate this code to protect myself for a change that potentially might never happen.

GLanza · May 3, 2019, 6:51pm

array.sort()
array.sort(.inPlace)
array.sort(.inPlace, by: <)
array.sort(.stable)

Ben_Cohen · May 3, 2019, 6:54pm

Why determine at runtime something you are almost certainly choosing at compile time?

array.sortUnstably()
array.sortConstantSpace()

Note, "in-place" would cause confusion between the sort/sorted variants.

edit: admittedly, there are tricks that also allow you to use the enum argument to drive the decision at compile time, at the cost of extra types...

GLanza · May 3, 2019, 6:58pm

I’m not sure about naming. Plus it would break existing code.

I would go for enum with stable and unstable cases

public enum SortType {
    case stable, unstable
}

array.sort()
array.sort(.unstable)
array.sort(.unstable, by: <)
array.sort(.stable)