Not sure what you mean here. In both cases, a separate base name or a new overload that takes an enum, we are talking about a new function called by new code. The existing sort()
function must remain, unchanged, either way.
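For concreteness, the two shapes might look roughly like this sketch; every name below is hypothetical, and the bodies simply delegate to the existing sort() as placeholders:
// A rough sketch of the two API shapes under discussion. All names are
// hypothetical, and the bodies just call the existing sort() as placeholders.
enum SortStrategy {
    case stable
    case unstable
}

extension MutableCollection where Self: RandomAccessCollection, Element: Comparable {
    // Option A: one new overload that takes an enum.
    mutating func sort(strategy: SortStrategy) {
        // Placeholder: a real implementation would dispatch on `strategy`.
        sort()
    }

    // Option B: a separate base name per variant.
    mutating func stableSort() { sort() }    // placeholder
    mutating func unstableSort() { sort() }  // placeholder
}

var numbers = [3, 1, 2, 1]
numbers.sort()                   // existing API, unchanged either way
numbers.sort(strategy: .stable)  // new code opts in explicitly
numbers.stableSort()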
Good point. The enum solution is more "swifty" in my opinion, but I don't have strong preferences at this point. Just a note about naming: those two functions seem too verbose to me.
In your opinion, which algorithm should be the unstable implementation? Should we dig introsort back up?
When all those algorithms become public parts of the stdlib, they surely will confuse a lot of people who simply want to bring a list of objects into the right order. I personally welcome that kind of confusion, as it opens a door to learn something new - but still, I think there is a better way than changing the stdlib:
import HeapSort
//…
sort(myArray) // will perform heapsort; details up for debate
This would not only keep the surface of Swift smaller, but also be more convenient for those with special sorting needs: I don't think there are many situations where you need different algorithms (beyond benchmarks to compare them).
var arr = [1, 3, 5, 2, 10]
arr.sort()
arr.append(9)
arr.sort()
This is a case where you might want to use insertion sort for the second sort.
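For reference, a sketch of a plain insertion sort; nearly sorted input, like the array right after the append, is its best case and needs only about O(n) comparisons:
// A straightforward insertion sort, shown only to illustrate why nearly
// sorted input is its best case: for elements already in place the inner
// loop exits immediately, so the whole pass is roughly O(n).
func insertionSort<T: Comparable>(_ array: inout [T]) {
    guard array.count > 1 else { return }
    for i in 1..<array.count {
        let value = array[i]
        var j = i
        // Shift larger elements one slot to the right until `value` fits.
        while j > 0 && array[j - 1] > value {
            array[j] = array[j - 1]
            j -= 1
        }
        array[j] = value
    }
}

var arr = [1, 3, 5, 2, 10]
arr.sort()
arr.append(9)
insertionSort(&arr)  // only a few comparisons: all but the last element are already in place
print(arr)           // [1, 2, 3, 5, 9, 10]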
I would want SortedCollection
;-)
But still, you could always import several modules, and each of their methods could have a unique name.
The problem remains: the documentation of the stdlib sort says that it's not guaranteed to be stable. Should I use it or not? What does "not guaranteed" mean? When is it stable and when is it not?
Making the sorting algorithm the developer's choice would resolve this ambiguity in the documentation. As for confusing newcomers to the dev world: if you don't know the difference between the algorithms (and aren't interested in the docs), probably all you need is sort().
If something is not guaranteed to be stable, that means it's unstable – even if in practice it's always stable.
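To make the distinction concrete, stability only shows up once elements compare equal; a minimal, implementation-agnostic example:
// Stability only matters when elements compare equal: a stable sort keeps
// their original relative order; an unstable one may not.
let people = [("Bob", 25), ("Alice", 30), ("Dave", 25), ("Carol", 30)]
let byAge = people.sorted { $0.1 < $1.1 }
// A stable sort must produce:
//   [("Bob", 25), ("Dave", 25), ("Alice", 30), ("Carol", 30)]
// An unstable sort is free to put Dave before Bob, or Carol before Alice,
// because those pairs compare equal under this predicate.
print(byAge)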
Even if it isn't guaranteed, there is still a benefit: people who incorrectly assume it is stable won't be hit by a bug, because in practice it does happen to be stable.
I think we should lock in the stability and make it a guarantee, via an evolution proposal (that only changes the docs). The fact that it was stable as of ABI stability allows this.
As Steve points out, this isn't quite true. Even once you know the algorithm's "brand name", there are all sorts of ways that algorithm can vary by implementation. Some implementations of the same sort are better at partially sorted data than others. Some might trade comparisons for moves, which has implications depending on whether you're sorting trivial or ref-counted types. Some "unstable" sorts can be bullied into being stable, some "by nature" stable sorts can be implemented unstably. The std lib's old introsort did an insertion sort for small values, but this isn't fundamental to introsort; it was just a tweak our implementation made.
I'm aware of that. I believe we should make a choice among the implementation details of the bundled algorithms, document that choice, and propose it to the community as is. A quicksort implementation might be good for someone and less good for someone else, but at least devs can make an informed decision.
We shouldn't, in my opinion, worry about the case quicksort(version: 2)
because with the "brand name" quicksort we want to convey a sorting strategy, not its implementation details.
Right now, it's total uncertainty: "it might be stable... or maybe not... sometimes it is... but it might change". Any information that could give some level of safety and reliability would, I bet, be appreciated.
The current situation is "no promises": if as a developer you have special needs, you implement the algorithm yourself, at the risk of duplicating code that already exists in the stdlib.
As a developer, I would instead like to see some promises, like "it's stable and always will be", "it's in place and always will be", "has worst case O(n^2)", "has best case O(n)", "has space complexity O(n)". If along with these promises we could also deliver a choice, I think it would be great.
Ultimately, a dev might not like our quicksort; in that case he will write his own. It's still a win: he knows he doesn't like it and why, and he doesn't have to pin to a custom sort a priori just because he has to treat the default Swift algorithm as unreliable.
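Purely as a sketch of that idea, nothing below exists in the standard library and the body just falls back to the current sort(), the choice could be phrased in terms of guarantees rather than algorithm brand names:
// Hypothetical only: selecting by guaranteed characteristics rather than by
// algorithm name.
struct SortGuarantees: OptionSet {
    let rawValue: Int
    static let stable  = SortGuarantees(rawValue: 1 << 0)
    static let inPlace = SortGuarantees(rawValue: 1 << 1)
}

extension MutableCollection where Self: RandomAccessCollection, Element: Comparable {
    // Placeholder body: a real implementation would pick an algorithm that
    // satisfies the requested guarantees.
    mutating func sort(guaranteeing guarantees: SortGuarantees) {
        sort()
    }
}

var data = [3, 1, 2]
data.sort(guaranteeing: [.stable, .inPlace])
print(data)  // [1, 2, 3]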
We seem to be stuck in a bit of a loop.
No, there is not total uncertainty. The documentation is clear: "The sorting algorithm is not guaranteed to be stable."
I think there's consensus here that documenting that stability so it could be relied on would be good. There's support (I suspect near-consensus) that the default sort
should be stable.
Then we are into debatable territory: do we even need a non-stable sort, either for speed or space reasons. Personally, I could go either way.
Do we need a fuller menu of knobs and levers, for best case/worst case complexity? I don't think so. There's a place for a specialist library here, but not for the standard library.
I think there's going to be a lot of pushback to actually adding "named" sorting algorithms to the standard library. I would certainly be against doing this, for all the reasons that have been rehearsed above.
Maybe. Bear in mind that for all but the most giant collections, all of these complexity issues are far less important than other constant factors. These include app binary size and compile times (the sort algorithm is generic so must be fully specialized into the client), the performance criticality of specialization (the more complex and layered the sort, the less likely to get properly optimized), and whether the algorithm/collection pair can operate directly on the buffer, avoiding reference counting overhead. The old sort algorithm was flawed from these perspectives, and that's why even though the new one allocates, it's much faster in most cases.
@nnnnnnnn can probably comment further, as he implemented the new sort.
Like you said, the documentation is clear: "it's not guaranteed to be stable". But this doesn't imply that it isn't. As a reader I understand that it might be stable, or might not be; hence I can't rely on it if I need a stable sort, and I can't rely on it either if I need an in-place one.
I bet we agree that, at the very least, it must be clear to the dev whether it is stable or not. Personally, I feel we also need at least one unstable implementation, because stability is not always as important as the memory usage and speed of the algorithm. Actually, in my personal experience stability was almost never that important. I'm happy with named algorithms, but I would also be happy with the choice "stable vs unstable"; what I'm not happy with is "maybe it's stable". If this feeling is shared, maybe we have an evolution proposal?
import Foundation

/// A strategy object that can sort any mutable collection in place.
protocol SortImplementation {
    func sort<DataStructure: MutableCollection, Element>(_ dataStructure: inout DataStructure,
                                                         by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows
        where DataStructure.Element == Element
}

/// Example implementation: a straightforward bubble sort.
struct BubbleSorter: SortImplementation {
    func sort<DataStructure, Element>(_ dataStructure: inout DataStructure,
                                      by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows
        where DataStructure: MutableCollection, Element == DataStructure.Element {
        guard !dataStructure.isEmpty else { return }
        for i in 0..<dataStructure.indices.count {
            // After pass `i`, the last `i + 1` positions already hold their final values.
            for j in dataStructure.indices.dropLast(i + 1) {
                let firstIndex = dataStructure.index(after: j)
                let secondIndex = j
                if try areInIncreasingOrder(dataStructure[firstIndex],
                                            dataStructure[secondIndex]) {
                    dataStructure.swapAt(firstIndex, secondIndex)
                }
            }
        }
    }
}

extension MutableCollection {
    /// Sorts in place using the given strategy.
    mutating func sort<Sorter: SortImplementation>(using sorter: Sorter,
                                                   by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows {
        try sorter.sort(&self, by: areInIncreasingOrder)
    }

    /// Returns a sorted copy, leaving the receiver untouched.
    func sorted<Sorter: SortImplementation>(using sorter: Sorter,
                                            by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> Self {
        var copy = self
        try copy.sort(using: sorter, by: areInIncreasingOrder)
        return copy
    }
}

extension MutableCollection where Element: Comparable {
    mutating func sort<Sorter: SortImplementation>(using sorter: Sorter) {
        self.sort(using: sorter, by: <)
    }

    func sorted<Sorter: SortImplementation>(using sorter: Sorter) -> Self {
        return self.sorted(using: sorter, by: <)
    }
}

// Example
var values = (0..<10).map { $0 }.shuffled()
print("unsorted: \(values)")
values.sort(using: BubbleSorter())
print("sorted: \(values)")
I agree that the characteristics of the sort should be the focus of an evolution proposal, if there is one, rather than the specific algorithm that provides those characteristics. At the very least, we should have a proposal to guarantee that the sort is stable, so the documentation can be updated to say that.
Beyond that, I think the biggest task is to gather a set of use cases that argue for adopting multiple sorting algorithms. What are the contours of the different problems where a developer might need a sort other than the allocating, stable version, and would it be important to select a specific algorithm or just one with other characteristics? If the standard library is able to add multithreaded operations in the future, would it be a problem that people had opted out of the default sort?
There are some additional areas of research where I'd love to see the standard library improve, such as adding a SortedCollection
protocol, providing binary search into sorted collections, and performing partial sorts and stable partitions. If you're interested in this topic, that's also fertile ground, though you'll want to search the forums to find the prior discussions.
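For anyone searching later, the rough shape people tend to have in mind for that is something like the following, which is purely hypothetical and not an existing API:
// Hypothetical sketch only; neither SortedCollection nor SortedArray exists
// in the standard library. The point is that the "elements are sorted"
// invariant lives in the type, so binary search can be offered safely.
protocol SortedCollection: Collection {
    /// The ordering the elements are guaranteed to satisfy.
    var areInIncreasingOrder: (Element, Element) -> Bool { get }
}

extension SortedCollection {
    /// Binary search: the first index whose element is not ordered before `value`.
    func firstIndex(notBefore value: Element) -> Index {
        var low = startIndex
        var remaining = distance(from: startIndex, to: endIndex)
        while remaining > 0 {
            let half = remaining / 2
            let mid = index(low, offsetBy: half)
            if areInIncreasingOrder(self[mid], value) {
                low = index(after: mid)
                remaining -= half + 1
            } else {
                remaining = half
            }
        }
        return low
    }
}

/// Minimal conforming wrapper, just for demonstration.
struct SortedArray<Element: Comparable>: SortedCollection {
    private var elements: [Element]
    init(_ unsorted: [Element]) { elements = unsorted.sorted() }
    var areInIncreasingOrder: (Element, Element) -> Bool { return { $0 < $1 } }
    var startIndex: Int { return elements.startIndex }
    var endIndex: Int { return elements.endIndex }
    func index(after i: Int) -> Int { return elements.index(after: i) }
    subscript(position: Int) -> Element { return elements[position] }
}

let haystack = SortedArray([5, 1, 4, 2, 4])
print(haystack.firstIndex(notBefore: 4))  // 2, i.e. the first 4 in [1, 2, 4, 4, 5]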
Thank you all for your time. @nnnnnnnn I'll definitely search through the forum about SortedCollection.
Why are the methods locked to Array
; shouldn't any Collection
be able to get advanced sorting?
But I don't think we need this form anyway. All sorting algorithms, for a given comparison closure, must give the same result! So there's no need for us to supply sorting algorithms; the Standard Library authors should provide the best-of-breed algorithms themselves. The StdLib versions could analyze the elements to choose the best algorithm if possible, with the constraint that the pre-check should bail out if it threatens to take longer than a simple sort would.
protocol MutableCollection {
    //...
    mutating func fastlySort(by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows
    mutating func stablySort(by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows
}

extension RangeReplaceableCollection {
    func fastlySorted(by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> Self { /*...*/ }
    func stablySorted(by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> Self { /*...*/ }
}

extension Sequence {
    func fastlySorted(by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> [Element] { /*...*/ }
    func stablySorted(by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> [Element] { /*...*/ }
}
// Custom extensions for the common `<` case elided.
The MutableCollection
versions are requirements so the default implementation can work for forward-only collections while optimized implementations can be written for bi-directional and random-access collections.
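In miniature, the pattern looks like this toy (names invented, not the actual pitch): because the method is a requirement, each conformer is bound to the most specific default implementation it qualifies for, even when the call goes through generic code.
// A miniature of the requirement-plus-constrained-default pattern.
protocol InPlaceSortable {
    mutating func stablySortInPlace()
}

extension InPlaceSortable {
    // General default: would use a forward-only-friendly algorithm.
    mutating func stablySortInPlace() { print("forward-only implementation") }
}

extension InPlaceSortable where Self: RandomAccessCollection {
    // Specialized default: free to index arbitrarily.
    mutating func stablySortInPlace() { print("random-access implementation") }
}

extension Array: InPlaceSortable {}

func sortGenerically<C: InPlaceSortable>(_ collection: inout C) {
    collection.stablySortInPlace()
}

var xs = [3, 1, 2]
sortGenerically(&xs)  // prints "random-access implementation", because Array
                      // is RandomAccessCollection and the requirement
                      // dispatches through the conformance, not statically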
The word "fastly" is archaic; "fast" with no suffix is the modern usage. I don't know if "fastSorted" and "fastSort" would cause confusion with the Quick-Sort algorithm. I originally had "fasterSort(ed)," but that isn't an adverb form.
It's a pitch; just for this reason the example was with Array. It was just a rough sketch to show what the API would look like.
We all agreed on the fact that we don't need to expose specific algorithms.
As for the naming conventions, "fastly" is too big a promise. There might be cases where the "fastly" version underperforms compared to the stable version.
For the "analyze the elements" I'm not sure I understand what do you mean.
That is often the case, but not always — that's what the whole stable vs unstable issue is about, isn't it? The topic is not relevant for simple structs, but for many other elements, it can make a difference.
I'm pretty sure that finding the best sorting algorithm for a given collection is much more difficult than the actual sort... and in most applications, it shouldn't matter much anyways.
I tried to imply a "modulo stable vs. unstable," but I was too lazy and thought the fact my sample API included fast vs. stable variants would be enough of an implication.
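For what it's worth, the kind of bounded pre-check meant by "analyze the elements" could be as simple as this sketch; the name, the threshold, and the fallback are all invented, and the actual sorting is left to the existing sort():
// Hypothetical sketch of the "analyze the elements" pre-check: the cost is
// capped at `scanLimit` comparisons, so it can never dominate the sort itself.
extension Array where Element: Comparable {
    func looksNearlySorted(scanLimit: Int = 64) -> Bool {
        guard count > 1 else { return true }
        var inversions = 0
        // Only sample a bounded prefix of adjacent pairs.
        for i in 1..<Swift.min(count, scanLimit + 1) where self[i - 1] > self[i] {
            inversions += 1
            if inversions > 1 { return false }  // bail out early
        }
        return true
    }
}

var samples = [1, 2, 3, 5, 4, 6]
if samples.looksNearlySorted() {
    // A real implementation might run an insertion-style pass here, which is
    // O(n) on nearly sorted input; plain sort() is used as a stand-in.
    samples.sort()
} else {
    samples.sort()  // general-purpose algorithm
}
print(samples)  // [1, 2, 3, 4, 5, 6]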