[Pitch] Selection: Linear-Time Order-Statistics

Selection algorithms are used to quickly return an order statistic from a collection of values.

Suppose we have a sorted Array of five integers. We want to select the median:

let a = [1, 2, 3, 4, 5]
let median = a[2]
print(median == 3)
// true

This is easy when our Array is already sorted. What if our Array is not sorted? One option is to just go ahead sort the Array. This has a linearithmic runtime: O(n log n). We are performing more work than necessary: we don’t actually need to sort all the elements. MinMax could improve our runtime and only sort half the elements, but we are still performing more work than necessary.

An efficient algorithm to select the median should return in linear time without guaranteeing to sort the elements before or after our median.

To generalize our operation we want to select the element at any index k such that the k elements that appear before the element at index k are sorted as less or equal and the elements that appear after the element at index k are sorted as greater or equal. The comparison could be increasing lexicographic order or a custom operation. What we don’t care about is that the k elements appearing before the element at index k are themselves sorted with each other.

Here is an example:

let a = [3, 4, 5, 1, 2]
let b = a.selected(index: 2)
print(b[0] <= b[2] && b[1] <= b[2])
// true
print(b[2] <= b[3] && b[2] <= b[4])
// true
let median = b[2]
print(median == 3)
// true

Detailed Design

We declare a method on Collection that allocates space linear to our collection and returns without updating our collection.

We declare a method on MutableCollection that updates our collection in-place.

extension Collection {
  public func selected(
    index: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows -> [Element] { ... }
}

extension MutableCollection {
  public mutating func select(
    index: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows { ... }
}

We declare two convenience methods if Element is Comparable. These methods sort by increasing lexicographic order.

extension Collection where Element: Comparable {
  public func selected(index: Int) -> [Element] {
    self.selected(index: index, sortedBy: <)
  }
}

extension MutableCollection where Element: Comparable {
  public mutating func select(index: Int) {
    self.select(index: index, sortedBy: <)
  }
}

Complexity

We can achieve linear worst-case time with an algorithm similar to what is being used in Rust.[1]

Space complexity is constant when updating our collection in-place and linear when returning without mutating our original collection.

Benchmarks

Benchmarks are discussed in an adjacent thread:

Future Directions

Similar to Partition we could choose to add new methods that accept an explicit subrange.

We currently make no guarantee to preserve stability when elements are equal. This is consistent with C++ and Rust. New methods could explicitly guarantee stability similar to Partition. The tradeoff might be a potential performance penalty that leads to linearithmic time.

Comparison with other languages

C++: nth_element provides similar performance and in-place semantics with support for an explicit range.

Rust: select_nth_unstable provides similar performance and in-place semantics. This returns a tuple: the slice of elements before, the element requested, and the slice of elements after.


  1. https://github.com/rust-lang/rust/pull/107522 ↩︎

1 Like

Feedback on the naming of the index: argument label: I don't think it should be named index because that term has a specific meaning in the Collection protocol, and that meaning is not the one used by the proposed APIs ("the position at offset n from the start of the collection" after sorting).

Possible alternatives:

  • offset:
  • position:
    • I like the sound of this better than "offset", but it's not the established term.
    • Downside: The index-based Collection subscript uses position as the parameter name for an index value. We'd be using it with a different meaning here. Maybe no big deal because the position name is not part of the public API and developers generally only come into contact with it when implementing a custom Collection. It is visible in the documentation though.
2 Likes

I also find the term index is quite confusing because it implied the result is an array, instead of the k-th value. For example, your heap based implemantion doesn't return such an array. Also, index is 0-based, but k in select algorithms is 1-based. You were aware of that (see the part I highlighed) but it was confusing when I read it.

2 Likes

A few weeks ago, I implemented something similar and faced the same naming dilemma (as always). I ended up with the word rank, far from perfect, but after an hour deliberating, it was the best I could come up with. In my opinion, it does convey the meaning of the K-th element slightly better than the terms offset and position do.

2 Likes