[Pitch] Balanced binary reduction (stdlib addition)

justinmeiners · June 16, 2021, 11:30pm

Balanced binary reduction

By: Justin Meiners

Date: June 16, 2021

Introduction

Swift's reduce can be used to accumulate elements in a sequence.
For example:

[1.0, 2.0, 3.0, 4.0].reduce(0.0, +)

If we visualize the expression tree for this operation, it's linear.
You can think of it as a binary tree which is completely unbalanced.

 -----3.0----6.0---10.0
/     /      /    /
1.0   2.0   3.0  4.0

Assuming the binary operation is associative then a program could choose which
order to evaluate arguments in.
For example, we can image a function balancedReduce which rearranges
the order of evaluation to the following:

      10.0
    /      \
   /         \
  3.0       7.0
 /  \      /    \
1.0  2.0  3.0  4.0

We haven't changed the result at all, just the order it's evaluated in.
Even though the two forms are mathematically equivalent,
there are good computational reasons why one might want to do this.
Here are just a few to illustrate it's general applicability:

Adding up many floating point numbers (as in this example).
Adding small magnitudes to large magnitudes introduces numerical error.
Given a long enough list of values of similar magnitudes
this is guaranteed to happen.
(Accelerate has a function vDSP.sum in-part
to address this problem.)

A balanced evaluation of such values will
tend to add similarly sized values. (Intermediate results which are
the result of adding equal numbers of values together).
Merge sort can be written by applying the binary operation
of "merge sorted list" to a sequence of lists:
```
[ [a1], [a2], ... [an] ]
```
If you aply this operation using reduce it is essentially insertion sort,
one element is inserted into the growing list at a time.

If you use a balancedReduce it will only merge lists
that are roughly the same size, giving a proper
O(n log(n)) sort.
It can be used to solve the problem of finding the smallest
element, and second smallest element, in a sequence
in the optimal number of comparisons.
(Described in 5.3.3 of Volume 3 in "Art of Computer Programming").

Proposed implementation

This capability can be added with an extension to Sequence:

extension Sequence {
    func balancedReduce(
        _ associativeOperation: (Element, Element) -> Element
    ) -> Element? {

        var counter: [Element?] = []
        // counter will never be larger than 64
        counter.reserveCapacity(20)

        for x in self {
            if let carry = BinaryCounter.add(x,
                                             to: &counter,
                                             with: associativeOperation) {
                counter.append(carry) 
            }
        }

        return BinaryCounter.reduce(counter: &counter,
                                    with: associativeOperation)
    }
}

This relies on two algorithms:

struct BinaryCounter {
    private init() {}

    static func add<C: MutableCollection, T>(
        _ x: T,
        to counter: inout C,
        with op: (T, T) -> T
    ) -> T? where C.Element == T? {

        var i = counter.startIndex
        let end = counter.endIndex

        var carry = x

        while i != end {
            if let y = counter[i] {
                // must be in this order since op is not commutative.
                // things further to the end of the counter were inserted earlier
                carry = op(y, carry)
                counter[i] = nil
                i = counter.index(after: i)
            } else {
                counter[i] = carry
                return nil
            }
        }

        return carry
    }

    static func reduce<C: MutableCollection, T>(
        counter: inout C,
        with op: (T, T) -> T
    ) -> T? where C.Element == T? {

        var i = counter.startIndex
        let end = counter.endIndex

        // find first non-zero
        while i != end && counter[i] == nil {
            i = counter.index(after: i)
        }

        if i == end {
            return nil
        }

        var x = counter[i]!
        i = counter.index(after: i)

        while i != end {
            if let y = counter[i] {
                // must be in this order since op is not commutative
                // things further to the end of the counter were inserted earlier
                x = op(y, x)
            }
            i = counter.index(after: i)
        }

        return x
    }
}

Notice that this transformation is very efficient.
It does require additional memory for the counter,
but its size is roughly log(n) which qualifies as in-place.

Example usage

Adding up numbers:

[1.0, 2.0, 3.0, 5.0].balancedReduce(+)

Merge sort:

func mergeSorted<T>(
    _ a: [T],
    _ b: [T]
) -> [T] where T: Comparable  {
    var i = a.startIndex
    var j = b.startIndex
    
    var out: [T] = []
    
    while i != a.endIndex && j != b.endIndex {
        if b[j] < a[i] {
            out.append(b[j])
            j = a.index(after: j)
        } else {
            out.append(a[i])
            i = a.index(after: i)
        }
    }
    
    if j != b.endIndex {
        out.append(contentsOf: b.suffix(from: j))
    } else if i != a.endIndex {
        out.append(contentsOf: a.suffix(from: i))
    }
    
    return out
}

[1.0, 2.0, 3.0, 5.0].map({ $0 }).balancedReduce(mergeSorted)

References

This algorithm can be found in Alex Stepanov's (creator of C++ STL) book "Elements of Programming"
chapter 11.2.

The Haskell package Tree Fold implements it.

There is a related notion in C++.
std::accumulate
which does a traditional left fold,
std::reduce
is allowed to associate its arguments.
For technical reasons, it additionally requires the operation is commutative,
which I do not think is necessary for this.
It appears std::reduce is primarily to facilitate parallel
computation, but it demonstrates two similar operations
whose results only differ on their association order.

Questions/Concerns

If this was such a need shouldn't this be a popular library?

It's one of those things that solves problems if you have it available,
but most probably aren't going to look for it, due to awareness of the issues.

As far as I know, there is no facilities in swift to add large lists
of floating point numbers in a stable way (besides sorting).
To implement it oneself would require reimplementing something similar.
(iOS and mac have vDSP.sum in Accelerate).

It's too theoretical

All of the theory is hidden inside.
It's just one function which works like fold/reduce.
It's hard to misuse, and arguably produces
better and less suprising results
than reduce for most operations.

Algebraic requirements like associativity on operations are used throughout the stdlib,
for example sort relies on the comparison being a StrictWeakOrdering.

The name is bad

Looking for recommendations.
reduce may be incorrect because reduce allows you do specify a different
accumulation type than the elements in the sequence.
In this algorithm the operation needs to take two inputs
of the same type, so it really is a fold, but fold is
probably not a friendly term.
Perhaps accumulate?

Could binary counter be a struct?

A struct couldn't have operation
as a member, unless it was escaping.
So, I don't think it would help.

Does it need to store optionals?

It's possible to instead provide a zero parameter
which indicates a slot in the counter is available.
Conceptually this should be a value where op(x, zero) = x for all
x, although it's not required.
I chose nil because the zero parameter is difficult to explain,
and is confusing is because it wouldn't be the same as reduce's initialResult
parameter.

scanon · June 17, 2021, 2:20am

A tangential note: vDSP.sum does not address "this problem". It is somewhat more accurate, on average, than naive reduction, but that's because it maintains multiple accumulators to break the loop-carried dependency chain for performance; the added average accuracy is a happy accident. vDSP simply tries to go as fast as possible. ( "Reducing Floating Point Error in Dot Product using the Superblock Family of Algorithms" is a good reference on this subject, and discusses more sophisticated blocking strategies as well.)

Variations on these algorithms would be a welcome addition to Swift Numerics, but are probably out of scope for the standard library at this point in time. I would also be interested in implementations of exact accumulation and Kahan summation, which are both even more accurate than the tree reduction described here.

Nevin · June 17, 2021, 3:17am

I wrote an implementation of Kahan summation here in a thread that you also participated in. Should I open a PR to add that to Numerics?

• • •

Also, do you have feedback on the dynamic-dispatch / generic implementation ideas I sketched here a few posts later in the same thread?

More info

I’d really like to be able to call sum() in a generic context on a sequence whose elements are only constrained to AdditiveArithmetic, and have it use the proper algorithm for the actual element type dynamically.

This isn’t too difficult to achieve (I have a working prototype), but it does require a different design with a protocol requirement that each conforming element type implements with its preferred algorithm.

In order to make the constraint be AdditiveArithmetic, the requirement would have to be added to that protocol (with a default implementation), and the implementations for standard library types would have to be in the standard library itself rather than Numerics.

justinmeiners · June 17, 2021, 3:18am

My comment about vDSP.sum is only based on my personal observations that it is more accurate than reduce. I'm glad you can share more about what it actually does. By saying "partly addresses this problem" I just mean that programmers who are not aware of numerical details have some obvious mechanism they can throw at summing and get better results,.

I appreciate the additional numerical information and will give that a look.

scanon · June 17, 2021, 4:06pm

For the usual Kahan summation, I would expect that you can write it directly on AdditiveArithmetic and the compiler will optimize away the computation of excess when the element type is an integer (since it is trivially always zero), so it probably doesn't need a customization point at all. For Neumaier's improved Kahan summation, which you appear to be using here, it may be slightly harder to achieve this, but it's probably still possible with a little bit of care.

I'm not opposed to adding new customization points, but the bar to adding them on a protocol like AdditiveArithmetic should be high.

Anachron · June 17, 2021, 5:12pm

I actually implemented Kahan summation in a private framework to add a bunch of Double matrices and the compiler optimized even that away. That was somewhat rude. Luckily, the framework was in an XCodeProject so I could pass a flag to the compiler (via „compile sources“ section) that this specific file shouldn’t be optimized. No idea how I would do this in a SwiftPM Package without project.

But this really goes off-topic now.

OP: did you also consider a parallelized version of this?

scanon · June 17, 2021, 7:44pm

That should never happen without an unsafe math flag being set. You should not need to disable optimization to avoid it.

Anachron · June 17, 2021, 7:57pm

Interesting! Where can I read about that flag?

The matrix type in my framework was just a wrapper around an array of doubles with a fancy subscript and a bunch of inline(__always) BLAS and vdsp methods for basic arithmetic. Does this apply here?

Maybe I misinterpreted the stepping behavior of the debugger, I didn’t look too deeply into disassembly.

Nevin · June 18, 2021, 3:04am

I would want a different algorithm for signed integers, as described here in that same thread, to avoid spurious overflow.

justinmeiners · June 18, 2021, 3:40pm

We have had some good discussion about the numerical application.
It sounds like the agreement is there should be better solutions available in the Swift Numerics.
(I would still like a specific reference to an API or proposal that addresses that.)

In regards to parallelism, I have not considered it. The algorithm could be parallelized, but the vision here is that it works just like reduce. Issuing work for threads and synchronizing seems a bit heavy for this operation.

What are thoughts on general utility? Do we think this is useful enough? Is it a better fit for Swift algorithms? Are any of the questions that I have identified that others have opinions on?