Simultaneous min & max

scanon · April 10, 2021, 3:01am

The dependencies for which comparisons to perform are not an issue in practice; what matters is the loop-carried dependency (the min and max "accumulators"). The number of comparisons they perform is reduced from n to n/2 using the pairwise method, so that's actually a much bigger win. On a sufficiently wide OoO CPU, the pairwise method is basically exactly twice as fast.

You can break the loop-carried dependency further by alternating between two or more "min" and "max" accumulators, but this is only profitable on relatively wide machines, and starts to get into more and more bookkeeping (and constant overhead) as you go larger.

You can extract maximum ILP using breadth-first binary-tree accumulators, at the cost of cache locality. In practice what people usually do when they optimize these algorithms as much as possible is accumulate within cache-sized blocks, then across them.