[Pitch] Selection: Linear-Time Order-Statistics

Selection algorithms are used to quickly return an order statistic from a collection of values.

Suppose we have a sorted Array of five integers. We want to select the median:

let a = [1, 2, 3, 4, 5]
let median = a[2]
print(median == 3)
// true

This is easy when our Array is already sorted. What if our Array is not sorted? One option is to just go ahead sort the Array. This has a linearithmic runtime: O(n log n). We are performing more work than necessary: we don’t actually need to sort all the elements. MinMax could improve our runtime and only sort half the elements, but we are still performing more work than necessary.

An efficient algorithm to select the median should return in linear time without guaranteeing to sort the elements before or after our median.

To generalize our operation we want to select the element at any index k such that the k elements that appear before the element at index k are sorted as less or equal and the elements that appear after the element at index k are sorted as greater or equal. The comparison could be increasing lexicographic order or a custom operation. What we don’t care about is that the k elements appearing before the element at index k are themselves sorted with each other.

Here is an example:

let a = [3, 4, 5, 1, 2]
let b = a.selected(index: 2)
print(b[0] <= b[2] && b[1] <= b[2])
// true
print(b[2] <= b[3] && b[2] <= b[4])
// true
let median = b[2]
print(median == 3)
// true

Detailed Design

We declare a method on Collection that allocates space linear to our collection and returns without updating our collection.

We declare a method on MutableCollection that updates our collection in-place.

extension Collection {
  public func selected(
    index: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows -> [Element] { ... }
}

extension MutableCollection {
  public mutating func select(
    index: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows { ... }
}

We declare two convenience methods if Element is Comparable. These methods sort by increasing lexicographic order.

extension Collection where Element: Comparable {
  public func selected(index: Int) -> [Element] {
    self.selected(index: index, sortedBy: <)
  }
}

extension MutableCollection where Element: Comparable {
  public mutating func select(index: Int) {
    self.select(index: index, sortedBy: <)
  }
}

Complexity

We can achieve linear worst-case time with an algorithm similar to what is being used in Rust.[1]

Space complexity is constant when updating our collection in-place and linear when returning without mutating our original collection.

Benchmarks

Benchmarks are discussed in an adjacent thread:

Future Directions

Similar to Partition we could choose to add new methods that accept an explicit subrange.

We currently make no guarantee to preserve stability when elements are equal. This is consistent with C++ and Rust. New methods could explicitly guarantee stability similar to Partition. The tradeoff might be a potential performance penalty that leads to linearithmic time.

Comparison with other languages

C++: nth_element provides similar performance and in-place semantics with support for an explicit range.

Rust: select_nth_unstable provides similar performance and in-place semantics. This returns a tuple: the slice of elements before, the element requested, and the slice of elements after.


  1. https://github.com/rust-lang/rust/pull/107522 ↩︎

2 Likes

Feedback on the naming of the index: argument label: I don't think it should be named index because that term has a specific meaning in the Collection protocol, and that meaning is not the one used by the proposed APIs ("the position at offset n from the start of the collection" after sorting).

Possible alternatives:

  • offset:
  • position:
    • I like the sound of this better than "offset", but it's not the established term.
    • Downside: The index-based Collection subscript uses position as the parameter name for an index value. We'd be using it with a different meaning here. Maybe no big deal because the position name is not part of the public API and developers generally only come into contact with it when implementing a custom Collection. It is visible in the documentation though.
4 Likes

I also find the term index is quite confusing because it implied the result is an array, instead of the k-th value. For example, your heap based implemantion doesn't return such an array. Also, index is 0-based, but k in select algorithms is 1-based. You were aware of that (see the part I highlighed) but it was confusing when I read it.

3 Likes

A few weeks ago, I implemented something similar and faced the same naming dilemma (as always). I ended up with the word rank, far from perfect, but after an hour deliberating, it was the best I could come up with. In my opinion, it does convey the meaning of the K-th element slightly better than the terms offset and position do.

3 Likes

The other option I guess then is to just call it kthElement:

extension Collection {
  public func selected(
    kthElement: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows -> [Element] { ... }
}

extension MutableCollection {
  public mutating func select(
    kthElement: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows { ... }
}

extension Collection where Element: Comparable {
  public func selected(kthElement: Int) -> [Element] {
    self.selected(kthElement: kthElement, sortedBy: <)
  }
}

extension MutableCollection where Element: Comparable {
  public mutating func select(kthElement: Int) {
    self.select(kthElement: kthElement, sortedBy: <)
  }
}

The reason I like kthElement more than nthElement is when we tell engineers this runs in O(n) it is clear this is running linear to the total number of elements in the collection.

2 Likes

If we set select aside, there might be more options to consider:

// do not care about remainders
func sortedPrefix(_ maxLength: Int) -> [Element] // combine `sorted` & `prefix`
func mins(_ count: Int) -> [Element] // generalizing `min`

// keep remainders
func partition(headCount: Int) -> [Element] // generalizing `partition`
2 Likes

I like kthElement, it's better than my suggestions offset and position.

I'm now also wondering whether the basename select/selected is clear enough. I understand that this is literally called a selection algorithm (and you linked to the Wikipedia page), but:

  • Does "select" carry its weight (in the "omit needless words" sense of the API Design Guidelines)? For the non-mutating variant, I find kthElement(_:sortedBy:) clearer, following the C++ prior art.

    Edit: although kthElement(_:sortedBy:) may not work for a method that returns an array and not just one element, as @tera points out below.

    I can't think of a good counterpart for the mutating variant. But I don’t think select(kthElement:sortedBy:) describes its behavior either, unless you know the term "selection algorithm" (I didn't).

  • "select" has so many potential meanings that you really need to read it together with the argument label to figure out what it means here. If we can think of other plausible collection algorithms that might also use "select" as a basename but with different semantics, that might be a red flag. Referring to the API Design Guidelines again: don't use the same basename for operations with different semantics.

    For example, Swift Algorithms already has a set of APIs grouped under the heading Selecting Elements. None of these use "select" in their name, but it shows that "selecting" is used in other places.

Typed throws

Lastly, I was surprised to see that the comparator function is throwing, but apparently this is how it's done, including in the standard library (see e.g. Sequence.sorted(by:). Huh, I never noticed that.

But shouldn't we use typed throws instead of rethrows for new APIs? I'm assuming that the proposed APIs will only ever rethrow the error thrown by the comparison function and not change it in any way.

4 Likes

Why returning array? Returning an index of the k-th smallest number seems sufficient (which if needed is trivially convertible to an array of k-th smallest numbers via a subsequent filter operation).


The benefit of returning the element's index instead of its value could be useful, you could get the value from the index in O(1) but not the other way around.

2 Likes

+1 for kthElement(_:sortedBy:), feels the most “swifty“ yet in my opinion. Also, +1 for typed throws instead of rethrows.

1 Like

So I wouldn't hate kthElement(_:sortedBy:)… but I do agree there is then no clear name for a mutating method. If anything I believe the precedent then is that kthElement(_:sortedBy:) is the mutating method. The method that returns a new Array is then kthElemented(_:sortedBy:).

The current toolchain required by Algorithms is 5.7.[1] AFAIK typed throws would require us supporting 6.0 and up. Correct?


  1. https://github.com/apple/swift-algorithms/blob/1.2.1/Package.swift#L1 ↩︎

Well… I think no matter what the most impactful version of this API I want to ship is the mutating method:

extension MutableCollection {
  public mutating func select(
    index: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows { ... }
}

The method that returns a new Array could basically just call this on a copy of self:

extension Collection {
  public func selected(
    index: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows -> [Element] {
    var result = ContiguousArray(self)
    try result.select(
      index: index,
      sortedBy: areInIncreasingOrder
    )
    return Array(result)
  }
}

This is the same pattern used for sorted in Standard Library:

Both these methods make sense to ship first. The argument could be made that the only method that really counts here is the mutating method… but it is pretty low-effort for the first design proposal to include the version that returns a new Array. I see no big reason to block that.

The methods that assume Element is Comparable and default to increasing lexicographic order are also not "primitive" but look like impactful enough "secondary" methods to go ahead and include in the first design proposal.

Beyond that I do not see clear impact from focusing on different methods or designs at this time.

So if the argument we make here is that the linear-time operation transforming the index of the kth element to an Array of the k smallest values is "trivial" then I see no reason why the linear-time operation transforming the kth smallest value to the index of the kth element is not trivial.

What might help here is any prior art. Do there exist any major languages out there that currently implement this version which returns an index back?

I'm open to adding this as an explicit "alternative considered" but I do not currently see a compelling reason to prioritize designing and implementing this version at this time.

Try this: "you are given an array of integers (for simplicity unique and within 0 ... Int.max/2 range each). Replace the k-th smallest integer in that array with Int.max".

Yes, you could do this with a second pass through the array to do that, and the complexity is still O(n)... It just feels that the second pass is unnecessary as it could be easily avoided if indices are used instead of values... Even if it's not backed by any prior art.

The mutating method looks particular strange to me in this case. We don't even have a mutating version of "filter", which would make more sense than this one.

4 Likes

What you are describing is possible using the proposed design:

extension Collection {
  public func selectedIndicies(
    kthElement: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows -> [Self.Index] {
    var result = ContiguousArray(self.indices)
    try result.select(kthElement: kthElement) {
      try areInIncreasingOrder(self[$0], self[$1])
    }
    return Array(result)
  }
}

I see no compelling reason to add this new method to our proposal. An engineer that really wanted the index could just implement this new method in their own repo.

It's a performance hook for engineers that do not need to preserve the order of the original collection: we can return without an additional O(n) space complexity.

How do I get the single k-th smallest element from the resulting array? Via an additional O(k) max operation?

The question is still: if we don't have a mutating filter method why do we want a mutating select (or, inversely, if we do want a mutating select why won't we want the same for a mutating filter operation?) filter is obviously a more frequently used operation by far.

2 Likes

I do not understand the question. It's an O(n) algorithm:

extension Collection {
  public func selectedIndicies(
    kthElement: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows -> [Self.Index] {
    var result = ContiguousArray(self.indices)
    try result.select(kthElement: kthElement) {
      try areInIncreasingOrder(self[$0], self[$1])
    }
    return Array(result)
  }
}

extension Collection {
  public func selectedIndex(
    kthElement: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows -> Self.Index {
    let indicies = try self.selectedIndicies(
      kthElement: kthElement,
      sortedBy: areInIncreasingOrder
    )
    return indicies[kthElement]
  }
}

I am strongly opposed to including selectedIndicies or selectedIndex in this pitch. I have no problem with engineers building selectedIndicies or selectedIndex in their own repos if this proposal ships. My strong preference at this point is to keep the discussion focused on the design being pitched and identify any more blockers before moving to a proposal.

It's a performance hook for engineers that do not need to preserve the order of the original collection: we can return without an additional O(n) space complexity. This pitch has no opinion about filter and a discussion about filter can be considered out of scope. My strong preference at this point is to keep the discussion focused on the design being pitched and identify any more blockers before moving to a proposal.

2 Likes

I thought a bit more about the naming and would find these acceptable. What are your current thoughts about them? Will we go with these names? On the contrary, if the user doesn’t know what a selection algorithm does, they probably don’t know what the K-th element is either, so perhaps this is still a strong contender?

public func selected(
    kthElement: Int,
    sortedBy areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows -> [Element] { ... }
1 Like

I believe my current favorites here are still select(kthElement:sortedBy:) and selected(kthElement:sortedBy:).

I would also be open to select(kth:sortedBy:) and selected(kth:sortedBy:) if we think that is still clear enough.

I would also be open to hearing if anyone thinks we should explicitly name these as unstable: unstableSelect(kthElement:sortedBy:) and unstableSelected(kthElement:sortedBy:).

OT1H… the "soft precedent" from partition is that a function that reorders elements in-place and does not explicitly tell us it is stable in the name itself is unstable by default.

OTOH… the other precedent from sort is that a function that reorders elements in-place and does not explicitly tells us it is stable in the name itself is stable by default.

So I don't think there's any clear "right answer" here if our proposed "selection" algorithm — which is not stable — should or should not explicitly add "unstable" in the name. I'm not sure I have a very strong opinion here and I'm open to hearing more what anyone else thinks about that.

1 Like