Partial sorting?

Just cruising the C++ algorithm list and wondered about std::partial_sort and std::nth_element. I guess this could be Swift-ified like:

extension MutableCollection {

    /// Reorders the elements of the collection so the value at the given index
    /// is what it would be if the entire collection was sorted according to the
    /// given predicate.
    ///
    /// - Note: Put something about stability here, both for elements with the
    ///   same value as the one that will go in `pivot`, and for ones outside of
    ///   that.
    ///
    /// - Precondition: `pivot != endIndex`
    ///
    /// - Parameter pivot: The element location to receive its sorted value.
    /// - Parameter areInIncreasingOrder: A predicate that returns `true` if its
    ///   first argument should be ordered before its second argument;
    ///   otherwise, `false`.
    ///
    /// - Postcondition: The value at `pivot` is what it should have if the
    ///   whole collection was sorted under `areInIncreasingOrder`.  All
    ///   elements before `pivot` have values ordered before or equivalent to
    ///   the one at `pivot`.  (Which means the later elements have values
    ///   ordered equivalently or after one at `pivot`.)
    ///
    /// - Complexity:???
    public mutating func partitionToSort(at pivot: Index, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows {
        // std::nth_element
    }

    /// Sorts the collection in place up to (but not including) the given index,
    /// using the given predicate as the comparison between elements.
    ///
    /// The sort is not guaranteed to be stable.  The order of elements outside
    /// the target range is unspecified.
    ///
    /// - Parameter middle: The past-the-end point for the sorting to take
    ///   effect.
    /// - Parameter areInIncreasingOrder: A predicate that returns `true` if its
    ///   first argument should be ordered before its second argument;
    ///   otherwise, `false`.
    ///
    /// - Postcondition: `self[..<middle]` is sorted according to
    ///   `areInIncreasingOrder` considering all the elements in `self`.
    ///
    /// - Complexity: ???
    public mutating func sort(upTo middle: Index, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows {
        // std::partial_sort
    }

}

with Comparable versions skipping the predicate for <. But I'm still wondering why would these be needed, outside of implementing other sorting things? Stopping at a point much smaller then the whole collection doesn't help too much because you still have to scan the whole thing.

I think I misinterpreted one of those methods, because one of my first tries was:

extension Collection {

    /// Finds the index of the element whose value is the least ranked, after
    /// ignoring a given number of even lesser-ranked elements, using the given
    /// predicate for comparisons.
    ///
    /// - Note: Put something here about stability, like which element is chosen
    /// if there are multiple elements that match the given place.
    ///
    /// - Precondition: `count >= 0`
    ///
    /// - Parameter count: The number of elements with an ever lesser rank to
    ///   skip past.  If not given, defaults to zero.
    /// - Parameter areInIncreasingOrder: A predicate that returns `true` if its
    ///   first argument should be ordered before its second argument;
    ///   otherwise, `false`.
    /// - Returns: The index for the element with the lowest order rank after
    ///   skipping the `count` elements with even lower rank.  Returns `nil` if
    ///   the collection is empty or does not have enough elements to skip over.
    ///
    /// - Complexity: ???
    public func leastOrderedIndex(afterSkipping count: Int = 0, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> Index? {
        // ???
    }

    /// Finds the index of the element whose value is the most ranked, after
    /// ignoring a given number of even greater-ranked elements, using the given
    /// predicate for comparisons.
    public func mostOrderedIndex(afterSkipping count: Int = 0, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> Index? {
        // ???
    }

}

(with Comparable variants) but I don't think leastOrderedIndex(afterSkipping: by:) has an analog in the C++ Standard library. I'm not sure it can be implemented in a refined way, besides copying indices to a mutable Array then calling partitionToSort(at: by:) on it. This would cost O(n) space and at least O(n) time.

Can any of these be implemented stably?

There may be performance implications I'm not taking into account, but this is pretty easily accomplished via:

let thing = [3, 2, 4, -1, -4]
let partiallySorted = thing[..<3].sorted() + thing[3...]
print(partialSort) // prints [2, 3, 4, -1, -4]

This doesn't do the transform in place, but it's possible there's a way to do that as well.

Is there a use case you have in mind where in-place modification is important for this kind of thing?

You actually can do that transform in place:

var thing = [3, 2, 4, -1, -4]
thing[..<3].sort()
print(thing) // prints [2, 3, 4, -1, -4]
9 Likes

The only time I have ever used nth_element it was because it is an O(n) algorithm, so a much faster way to find the 90th percentile of a data set then sorting it. It was also a fair while ago and 200Mhz was a monster fast machine, so maybe I wouldn’t need it now :slight_smile:

(Ok, yeah, I’m sure people need it right now, just not me at the moment)

Nice! I was thinking that's how ArraySlice worked, but didn't have time to test it.

That's not partial sort, partialSorting would result in

[ -4, -1, 2,       4 , 3]
  - sorted -      - n/a -

The first n elements also need to be smallest among the whole list.

AFAICT, there's no way to do it in-just-a-few-lines in swift.

1 Like

Thanks for bringing these up, @CTMacUser — we probably will eventually want both of the C++ algorithms in the standard library. Partial sorting is more efficient than sorting a whole collection if you only need the smallest n of your elements.

I'm not sure I understand what your leastOrderedIndex and mostOrderedIndex do — are they examining the "skipped" elements at all?

There are two kinds of stability for partial_sort, right? Stability of the sorted section at the front and the stability of the unsorted section at the end. The C++ versions aren't stable for either. This would be a question we'd definitely need to discuss when designing this feature in the stdlib.

Note that a partial sort isn't the same as sorting part of a collection — the result would be more like:

var thing = [3, 2, 4, -1, -4]
thing.partialSort(upTo: 3)
// [-4, -1, 2, 3, 4]
2 Likes

I’m pretty sure it’s nth element, from the front and from the back.

2 Likes

Yes, I said:

    - Postcondition: `self[..<middle]` is sorted according to  `areInIncreasingOrder` considering all the elements in `self`.

That "considering all" is important there. Elements too large to be in the prefix get bumped out (to an unspecified order) while too-small elements on the outside get bumped in.

leastOrderedIndex(afterSkipping: 0, by: MyPredicate()) finds the index of the element min(by:) would target. leastOrderedIndex(afterSkipping: 1, by: MyPredicate()) ignores the smallest element and returns the index of the second-smallest element. I think I just accidentally made it up; I'm not sure there's a way to do it better than wasting O(n) space then calling partitionToSort(at: by:).

There's a Quickselect algorithm. To which all the technique of Quicksort applies. This can be done in-place, if you allow mutation of the original collection. This is likely what happened in nth_element in C++.

The one C++ nth_element I looked at uses Introselect, which has better worst case performance then Quickselect (O(n log n) v. O(n²)).

I haven't benchmarked them, so I can't really say how the "not worst case" compares. Wikipedia says Introselect is "comparable", so I guess "slower but not by too much for some values of too much"?

Added:

Both algorithms were introduced by David Musser in (Musser 1997), with the purpose of providing generic algorithms for the C++ Standard Library that have both fast average performance and optimal worst-case performance, thus allowing the performance requirements to be tightened.[1] However, in most C++ Standard Library implementations that use introselect, another "introselect" algorithm is used, which combines quickselect and heapselect, and has a worst-case running time of O ( n log n )[2].

I guess I should have actually read my references, not skimmed ;-)

2 Likes