 # Balanced binary reduction

By: Justin Meiners

Date: June 16, 2021

## Introduction

Swift's `reduce` can be used to accumulate elements in a sequence.
For example:

``````[1.0, 2.0, 3.0, 4.0].reduce(0.0, +)
``````

If we visualize the expression tree for this operation, it's linear.
You can think of it as a binary tree which is completely unbalanced.

`````` -----3.0----6.0---10.0
/     /      /    /
1.0   2.0   3.0  4.0
``````

Assuming the binary operation is associative then a program could choose which
order to evaluate arguments in.
For example, we can image a function `balancedReduce` which rearranges
the order of evaluation to the following:

``````      10.0
/      \
/         \
3.0       7.0
/  \      /    \
1.0  2.0  3.0  4.0
``````

We haven't changed the result at all, just the order it's evaluated in.
Even though the two forms are mathematically equivalent,
there are good computational reasons why one might want to do this.
Here are just a few to illustrate it's general applicability:

1. Adding up many floating point numbers (as in this example).
Adding small magnitudes to large magnitudes introduces numerical error.
Given a long enough list of values of similar magnitudes
this is guaranteed to happen.
(Accelerate has a function `vDSP.sum` in-part

A balanced evaluation of such values will
tend to add similarly sized values. (Intermediate results which are
the result of adding equal numbers of values together).

2. Merge sort can be written by applying the binary operation
of "merge sorted list" to a sequence of lists:

``````[ [a1], [a2], ... [an] ]
``````

If you aply this operation using `reduce` it is essentially insertion sort,
one element is inserted into the growing list at a time.

If you use a `balancedReduce` it will only merge lists
that are roughly the same size, giving a proper
`O(n log(n))` sort.

3. It can be used to solve the problem of finding the smallest
element, and second smallest element, in a sequence
in the optimal number of comparisons.
(Described in 5.3.3 of Volume 3 in "Art of Computer Programming").

## Proposed implementation

This capability can be added with an extension to `Sequence`:

``````extension Sequence {
func balancedReduce(
_ associativeOperation: (Element, Element) -> Element
) -> Element? {

var counter: [Element?] = []
// counter will never be larger than 64
counter.reserveCapacity(20)

for x in self {
to: &counter,
with: associativeOperation) {
counter.append(carry)
}
}

return BinaryCounter.reduce(counter: &counter,
with: associativeOperation)
}
}
``````

This relies on two algorithms:

``````struct BinaryCounter {
private init() {}

_ x: T,
to counter: inout C,
with op: (T, T) -> T
) -> T? where C.Element == T? {

var i = counter.startIndex
let end = counter.endIndex

var carry = x

while i != end {
if let y = counter[i] {
// must be in this order since op is not commutative.
// things further to the end of the counter were inserted earlier
carry = op(y, carry)
counter[i] = nil
i = counter.index(after: i)
} else {
counter[i] = carry
return nil
}
}

return carry
}

static func reduce<C: MutableCollection, T>(
counter: inout C,
with op: (T, T) -> T
) -> T? where C.Element == T? {

var i = counter.startIndex
let end = counter.endIndex

// find first non-zero
while i != end && counter[i] == nil {
i = counter.index(after: i)
}

if i == end {
return nil
}

var x = counter[i]!
i = counter.index(after: i)

while i != end {
if let y = counter[i] {
// must be in this order since op is not commutative
// things further to the end of the counter were inserted earlier
x = op(y, x)
}
i = counter.index(after: i)
}

return x
}
}
``````

Notice that this transformation is very efficient.
It does require additional memory for the counter,
but its size is roughly `log(n)` which qualifies as in-place.

## Example usage

``````[1.0, 2.0, 3.0, 5.0].balancedReduce(+)
``````

Merge sort:

``````func mergeSorted<T>(
_ a: [T],
_ b: [T]
) -> [T] where T: Comparable  {
var i = a.startIndex
var j = b.startIndex

var out: [T] = []

while i != a.endIndex && j != b.endIndex {
if b[j] < a[i] {
out.append(b[j])
j = a.index(after: j)
} else {
out.append(a[i])
i = a.index(after: i)
}
}

if j != b.endIndex {
out.append(contentsOf: b.suffix(from: j))
} else if i != a.endIndex {
out.append(contentsOf: a.suffix(from: i))
}

return out
}

[1.0, 2.0, 3.0, 5.0].map({ \$0 }).balancedReduce(mergeSorted)
``````

## References

This algorithm can be found in Alex Stepanov's (creator of C++ STL) book "Elements of Programming"
chapter 11.2.

The Haskell package Tree Fold implements it.

There is a related notion in C++.
std::accumulate
which does a traditional left fold,
std::reduce
is allowed to associate its arguments.
For technical reasons, it additionally requires the operation is commutative,
which I do not think is necessary for this.
It appears `std::reduce` is primarily to facilitate parallel
computation, but it demonstrates two similar operations
whose results only differ on their association order.

## Questions/Concerns

If this was such a need shouldn't this be a popular library?

It's one of those things that solves problems if you have it available,
but most probably aren't going to look for it, due to awareness of the issues.

As far as I know, there is no facilities in swift to add large lists
of floating point numbers in a stable way (besides sorting).
To implement it oneself would require reimplementing something similar.
(iOS and mac have `vDSP.sum` in Accelerate).

It's too theoretical

All of the theory is hidden inside.
It's just one function which works like `fold`/`reduce`.
It's hard to misuse, and arguably produces
better and less suprising results
than `reduce` for most operations.

Algebraic requirements like associativity on operations are used throughout the stdlib,
for example `sort` relies on the comparison being a `StrictWeakOrdering`.

Looking for recommendations.
`reduce` may be incorrect because reduce allows you do specify a different
accumulation type than the elements in the sequence.
In this algorithm the operation needs to take two inputs
of the same type, so it really is a `fold`, but `fold` is
probably not a friendly term.
Perhaps `accumulate`?

Could binary counter be a struct?

A struct couldn't have `operation`
as a member, unless it was escaping.
So, I don't think it would help.

Does it need to store optionals?

It's possible to instead provide a `zero` parameter
which indicates a slot in the counter is available.
Conceptually this should be a value where `op(x, zero) = x` for all
`x`, although it's not required.
I chose `nil` because the zero parameter is difficult to explain,
and is confusing is because it wouldn't be the same as `reduce`'s `initialResult`
parameter.

6 Likes

A tangential note: vDSP.sum does not address "this problem". It is somewhat more accurate, on average, than naive reduction, but that's because it maintains multiple accumulators to break the loop-carried dependency chain for performance; the added average accuracy is a happy accident. vDSP simply tries to go as fast as possible. ( "Reducing Floating Point Error in Dot Product using the Superblock Family of Algorithms" is a good reference on this subject, and discusses more sophisticated blocking strategies as well.)

Variations on these algorithms would be a welcome addition to Swift Numerics, but are probably out of scope for the standard library at this point in time. I would also be interested in implementations of exact accumulation and Kahan summation, which are both even more accurate than the tree reduction described here.

4 Likes

I wrote an implementation of Kahan summation here in a thread that you also participated in. Should I open a PR to add that to Numerics?

• • •

Also, do you have feedback on the dynamic-dispatch / generic implementation ideas I sketched here a few posts later in the same thread?

I’d really like to be able to call `sum()` in a generic context on a sequence whose elements are only constrained to `AdditiveArithmetic`, and have it use the proper algorithm for the actual element type dynamically.

This isn’t too difficult to achieve (I have a working prototype), but it does require a different design with a protocol requirement that each conforming element type implements with its preferred algorithm.

In order to make the constraint be `AdditiveArithmetic`, the requirement would have to be added to that protocol (with a default implementation), and the implementations for standard library types would have to be in the standard library itself rather than Numerics.

1 Like

My comment about `vDSP.sum` is only based on my personal observations that it is more accurate than reduce. I'm glad you can share more about what it actually does. By saying "partly addresses this problem" I just mean that programmers who are not aware of numerical details have some obvious mechanism they can throw at summing and get better results,.

I appreciate the additional numerical information and will give that a look.

For the usual Kahan summation, I would expect that you can write it directly on AdditiveArithmetic and the compiler will optimize away the computation of `excess` when the element type is an integer (since it is trivially always zero), so it probably doesn't need a customization point at all. For Neumaier's improved Kahan summation, which you appear to be using here, it may be slightly harder to achieve this, but it's probably still possible with a little bit of care.

I'm not opposed to adding new customization points, but the bar to adding them on a protocol like AdditiveArithmetic should be high.

I actually implemented Kahan summation in a private framework to add a bunch of Double matrices and the compiler optimized even that away. That was somewhat rude. Luckily, the framework was in an XCodeProject so I could pass a flag to the compiler (via „compile sources“ section) that this specific file shouldn’t be optimized. No idea how I would do this in a SwiftPM Package without project.

But this really goes off-topic now.

OP: did you also consider a parallelized version of this?

That should never happen without an unsafe math flag being set. You should not need to disable optimization to avoid it.

The matrix type in my framework was just a wrapper around an array of doubles with a fancy subscript and a bunch of inline(__always) BLAS and vdsp methods for basic arithmetic. Does this apply here?

Maybe I misinterpreted the stepping behavior of the debugger, I didn’t look too deeply into disassembly.

I would want a different algorithm for signed integers, as described here in that same thread, to avoid spurious overflow.

In regards to parallelism, I have not considered it. The algorithm could be parallelized, but the vision here is that it works just like `reduce`. Issuing work for threads and synchronizing seems a bit heavy for this operation.