Simultaneous min & max

Just added a pull request for Sequence.extrema(by:), which calculates min(by:) and max(by:) in a single pass. It can be used to fix SR-890.

3 Likes

I wonder if extrema might sound “too plural”, in that someone could expect it to produce a collection of all maximal and minimal elements, rather than just one of each.

2 Likes

The documentation comments for min, max, and the proposed extrema all specifically state that areInIncreasingOrder must be a “strict weak ordering” on the elements. They then go on to define what that means, including how “incomparable” elements behave.

More concretely, one could have a list of elements, and compare them by only part of their data. For example, the elements could be “players in the game” and the comparison could be “scored more points than”.

There could easily be multiple distinct players who are tied for the most or least points.

The existing min and max will only find one of them. The proposed extrema will also only find one of each, despite the plural form of the word.
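
For instance (using a hypothetical Player type, purely for illustration):

struct Player {
  let name: String
  let points: Int
}

let players = [
  Player(name: "Anna", points: 12),
  Player(name: "Ben", points: 30),
  Player(name: "Cara", points: 30), // Tied with Ben for the most points.
]

// Ben and Cara are "incomparable" under this ordering; max(by:) still
// returns a single element, and which of the tied players you get back
// is an implementation detail.
let topScorer = players.max { $0.points < $1.points }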

2 Likes

I’m not saying we need to introduce a new overload, just that the proposed name for the new function could be misleading.

3 Likes

I wonder if this would be a candidate for SIMD?

So rather than consuming the elements one-by-one, you'd consume them in batches of 8 elements, finding the maximum element in each lane, and using the mask to store the maxima in a result vector. Once you finish, you'd find the maximum element in the result vector, which would give you the overall maximum for the collection. Same for the minimum.

EDIT: A quick attempt, by somebody not at all experienced with SIMD. This just shows calculating the max, but you could calculate the min at the same time.

func max_simd<Scalar>(_ arr: [Scalar]) -> Scalar? where Scalar: SIMDScalar, Scalar: Comparable {
  var contents = arr[...]
  guard contents.count >= 8 else {
    return contents.max()
  }
  // Consume first 8.
  var simdMaxima = SIMD8(contents.prefix(8))
  contents = contents.dropFirst(8)
  // Consume and compare remaining contents in batches.
  while contents.count >= 8 {
    let elements = SIMD8(contents.prefix(8))
    contents = contents.dropFirst(8)
    simdMaxima.replace(with: elements, where: elements .> simdMaxima)
  }
  // Overall maximum is the greater of the SIMD elements and whatever is left in the tail.
  return max(contents.max() ?? simdMaxima[0], simdMaxima.max())
}

Aside from whether it's misleading, I think the name minmax is a term of art from other languages, is more obvious in meaning, and avoids sounding unnecessarily "exotic".

The one downside is that minmax has another meaning, but in practice I don't think the risk of confusion is material.

8 Likes

Question re: "by areInIncreasingOrder:"

Externally, at the call site, you cannot tell what the by argument should be:

sequence.extrema(by: somePredicate)

For this reason, in my own code I use the full areInIncreasingOrder label, which makes it very clear at the call site what the parameter should be:
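
sequence.extrema(areInIncreasingOrder: somePredicate)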

Calling the simultaneous min and max function minmax is the obvious answer. But it's the wrong answer because it's a term of art in mathematics and statistics that means something completely different.

The obvious meaning is sort of a term of art in some computer-language libraries, but that may be because they didn't do their research. Would it be proper to copy their mistake? Should we call it something like "minAndMax" to dodge the problem?

4 Likes

Would you want "extremes" for one minimum and one maximum, and "allExtrema" for all the minima and maxima?

2 Likes

I just added a suggestion to process two elements during each loop iteration. I used an #if conditional-compilation block to provide both the new code and the old loop. I'm not sure the suggestion actually helps, so you can flip the conditional to check both versions. Maybe it doesn't help with short toy sequences but does help with very long, real-world data.

I think this would be the best name.
minmax would definitely create ambiguity for many people who are used to a very different meaning for that term, and extrema is less easily discoverable.

2 Likes

I would hope that the implementation is driven by benchmarks. There are lots of implementation strategies.

For example, I suggested trying SIMD, which would reduce branching by a lot, at the cost of introducing a data dependency that limits speculative execution. I’d be interested to see how that fares against the obvious implementation for different kinds and sizes of data.

Attabench (github.com/attaswift/Attabench), a microbenchmarking app for Swift with nice log-log plots, is a useful tool for making these kinds of measurements.

1 Like

I've made a few more edits, making the new two-at-a-time loop the default. I accidentally uploaded the edits early, but now I'm not sure about the other edits I was planning, so the early upload may have been a good thing.

I made a change canonicalizing how ties are resolved: the minimum will use the earliest element in a tie, and the maximum will use the latest. Then I started work on the SIMD optimization. An hour in, I realized that the SIMD operations for min and max don't seem to specify how they resolve ties, so I can't have both features. If I'm not wrong about that, which way should I optimize (if at all)?

I’m curious what benefit you see in the 2-at-a-time loop.

Unless I’m missing something, it makes the same number of comparisons, and accesses the elements in the same order, as the direct 1-at-a-time version.

When you process one element at a time, you make up to 2 comparisons per element: one against the minimum and one against the maximum. For each pair of elements, that means you're making 4 comparisons, or 2n for the whole sequence.

When you process a pair of elements at once, you make 3 comparisons: (1) the two elements against each other, (2) the smaller element against the minimum, and (3) the larger element against the maximum. So you end up with 3n/2 comparisons in total, 25% fewer than the single-element version.
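
Here is a minimal sketch of that pairwise strategy for Comparable elements (minAndMax is just a placeholder name; the proposed API threads a by: predicate through instead):

func minAndMax<T: Comparable>(_ elements: [T]) -> (min: T, max: T)? {
  guard var minimum = elements.first else { return nil }
  var maximum = minimum
  var index = 1
  // Each pair costs 3 comparisons: one within the pair, one against the
  // running minimum, and one against the running maximum.
  while index + 1 < elements.count {
    let (smaller, larger) = elements[index] < elements[index + 1]
      ? (elements[index], elements[index + 1])
      : (elements[index + 1], elements[index])
    if smaller < minimum { minimum = smaller }
    if larger > maximum { maximum = larger }
    index += 2
  }
  // An odd element left over is compared against both accumulators.
  if index < elements.count {
    if elements[index] < minimum { minimum = elements[index] }
    if elements[index] > maximum { maximum = elements[index] }
  }
  return (minimum, maximum)
}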

5 Likes

Note that two-at-a-time is, somewhat surprisingly, optimal; it does 3n/2 comparisons. Three-at-a-time requires 3 comparisons to order the three elements, then one for the min and one for the max, for 5n/3 in total, and it only gets worse from there (a four-element sorting network is five comparisons, so 7n/4, and so on; you don't actually need a full sorting network, but two-at-a-time is still as good as it gets).

6 Likes

That makes sense, thanks.

If I’ve done my math right, for powers-of-2 at a time it stays exactly 3n/2 forever: finding both the min and max of m elements takes 3m/2 - 2 comparisons, and the 2 comparisons against the running accumulators bring each block back up to exactly 3m/2.

Still, going with 2-at-a-time introduces data dependencies for deciding which comparisons to perform, so it might not pipeline as well as 1-at-a-time.

Would be interesting to see benchmark comparisons, for both simple types like Int and compound types like String (especially strings with long identical prefixes that make comparisons slow).

The dependencies for which comparisons to perform are not an issue in practice; what matters is the loop-carried dependency (the min and max "accumulators"). The number of comparisons they perform is reduced from n to n/2 using the pairwise method, so that's actually a much bigger win. On a sufficiently wide OoO CPU, the pairwise method is basically exactly twice as fast.

You can break the loop-carried dependency further by alternating between two or more "min" and "max" accumulators, but this is only profitable on relatively wide machines, and starts to get into more and more bookkeeping (and constant overhead) as you go larger.
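
For illustration, a rough, unbenchmarked sketch of the two-accumulator idea for Ints (leaving out the pairwise comparison trick for clarity):

func minAndMaxTwoAccumulators(_ elements: [Int]) -> (min: Int, max: Int)? {
  guard elements.count >= 2 else {
    return elements.first.map { (min: $0, max: $0) }
  }
  // The comparisons feeding (min0, max0) and (min1, max1) form separate
  // dependency chains, so a wide out-of-order core can advance both in parallel.
  var (min0, max0) = (elements[0], elements[0])
  var (min1, max1) = (elements[1], elements[1])
  var index = 2
  while index + 1 < elements.count {
    min0 = min(min0, elements[index])
    max0 = max(max0, elements[index])
    min1 = min(min1, elements[index + 1])
    max1 = max(max1, elements[index + 1])
    index += 2
  }
  if index < elements.count {
    min0 = min(min0, elements[index])
    max0 = max(max0, elements[index])
  }
  // Merge the two chains at the end.
  return (min(min0, min1), max(max0, max1))
}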

You can extract maximum ILP using breadth-first binary-tree accumulators, at the cost of cache locality. In practice what people usually do when they optimize these algorithms as much as possible is accumulate within cache-sized blocks, then across them.
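
A sketch of that block-wise shape (the 4096-element block size is an arbitrary guess rather than a tuned value, and a real implementation would run a single-pass loop within each block instead of separate min()/max() calls):

func blockedMinAndMax(_ elements: [Int], blockSize: Int = 4096) -> (min: Int, max: Int)? {
  guard !elements.isEmpty else { return nil }
  // First accumulate within cache-sized blocks...
  var perBlock: [(min: Int, max: Int)] = []
  var start = 0
  while start < elements.count {
    let block = elements[start ..< min(start + blockSize, elements.count)]
    perBlock.append((min: block.min()!, max: block.max()!))
    start += blockSize
  }
  // ...then reduce across the per-block results.
  return (perBlock.map { $0.min }.min()!, perBlock.map { $0.max }.max()!)
}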

8 Likes

Thanks for the deep dive, Steve. I appreciate learning this type of thing.

A bit off topic, but how does one programmatically discover the relevant cache size from Swift code?

It’s not exposed as a language feature, and the mechanism for querying it varies between systems: one might use sysctlbyname or the commpage on macOS, /sys on Linux, some other method on Windows, and so on.
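
For example, a macOS-only sketch using sysctlbyname (hw.l1dcachesize is a Darwin-specific key, and I'm assuming the value is a 64-bit integer):

import Darwin

// Query the L1 data cache size on macOS. Returns nil if the key is
// unavailable; other platforms need entirely different mechanisms.
func l1DataCacheSize() -> Int64? {
  var value: Int64 = 0
  var size = MemoryLayout<Int64>.size
  guard sysctlbyname("hw.l1dcachesize", &value, &size, nil, 0) == 0 else {
    return nil
  }
  return value
}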

There’s also a rich family of cache-oblivious algorithms that block in such a way that they achieve (asymptotically) optimal cache locality without needing to know the precise cache size(s).

For “simple” accumulation trees like this, you don’t need an exact fit anyway. L1 caches are generally in the range of 16-128KB, so a conservative choice like 4KB will basically always work out in practice.

(Note that I think that it could be worth exposing this sort of thing with a uniform interface, but that’s a better fit for the System package than for Swift or the standard library itself)

3 Likes