Is it possible to auto-vectorize a loop of floating point values?

When compiled with optimization for x86,

func reduce(foo: [Int]) -> Int {
    foo.reduce(0, &+)
}

is auto-vectorized (godbolt link):

.LBB1_8:
        vpaddq  ymm0, ymm0, ymmword ptr [rdi + 8*rsi + 32]
        vpaddq  ymm1, ymm1, ymmword ptr [rdi + 8*rsi + 64]
        vpaddq  ymm2, ymm2, ymmword ptr [rdi + 8*rsi + 96]
        vpaddq  ymm3, ymm3, ymmword ptr [rdi + 8*rsi + 128]
        vpaddq  ymm0, ymm0, ymmword ptr [rdi + 8*rsi + 160]
        vpaddq  ymm1, ymm1, ymmword ptr [rdi + 8*rsi + 192]
        vpaddq  ymm2, ymm2, ymmword ptr [rdi + 8*rsi + 224]
        vpaddq  ymm3, ymm3, ymmword ptr [rdi + 8*rsi + 256]
        vpaddq  ymm0, ymm0, ymmword ptr [rdi + 8*rsi + 288]
        vpaddq  ymm1, ymm1, ymmword ptr [rdi + 8*rsi + 320]
        vpaddq  ymm2, ymm2, ymmword ptr [rdi + 8*rsi + 352]
        vpaddq  ymm3, ymm3, ymmword ptr [rdi + 8*rsi + 384]
        vpaddq  ymm0, ymm0, ymmword ptr [rdi + 8*rsi + 416]
        vpaddq  ymm1, ymm1, ymmword ptr [rdi + 8*rsi + 448]
        vpaddq  ymm2, ymm2, ymmword ptr [rdi + 8*rsi + 480]
        vpaddq  ymm3, ymm3, ymmword ptr [rdi + 8*rsi + 512]

But if the same loop is over Double:

func reduce(foo: [Double]) -> Double {
    foo.reduce(0, +)
}

then it's not auto-vectorized (godbolt link):

.LBB1_9:
        vaddsd  xmm0, xmm0, qword ptr [rdi + 8*rdx + 32]
        vaddsd  xmm0, xmm0, qword ptr [rdi + 8*rdx + 40]
        vaddsd  xmm0, xmm0, qword ptr [rdi + 8*rdx + 48]
        vaddsd  xmm0, xmm0, qword ptr [rdi + 8*rdx + 56]
        vaddsd  xmm0, xmm0, qword ptr [rdi + 8*rdx + 64]
        vaddsd  xmm0, xmm0, qword ptr [rdi + 8*rdx + 72]
        vaddsd  xmm0, xmm0, qword ptr [rdi + 8*rdx + 80]
        vaddsd  xmm0, xmm0, qword ptr [rdi + 8*rdx + 88]
        add     rdx, 8
        cmp     rcx, rdx
        jne     .LBB1_9

Is there some way to structure the code so that a loop over floating-point values is auto-vectorized?


The reason why this isn't vectorized is pretty simple to explain: floating-point addition isn't associative, and vectorization changes the order in which the various terms are summed in a reduction. What you would like to have here is a variant of the + operator that licenses reassociation (just like &+ does for integers). This is implementable in the standard library by leveraging builtins, but not really feasible in Swift outside of the standard library.
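
To see the non-associativity concretely, here is a minimal Swift snippet (the values in the comments are the standard IEEE-754 binary64 results):

let x = (0.1 + 0.2) + 0.3   // 0.6000000000000001
let y = 0.1 + (0.2 + 0.3)   // 0.6
print(x == y)               // false: Double addition is not associative

A vectorized reduction regroups the terms much like the second line does, so under strict semantics the compiler is not allowed to do it on its own.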

If this is more than curiosity, your best option today is probably to define an inline C function in a header with the desired semantics:

// myHeader.h
static inline __attribute__((always_inline))
double associative_add(double a, double b) {
#pragma clang fp reassociate(on)
  return a + b;
}
// myFile.swift
func reduce(foo: [Double]) -> Double {
    foo.reduce(0, associative_add)
}

Medium-term, I would consider making this and similar operations available as API in swift-numerics; feel free to open a GitHub issue to request the feature (it would be helpful to expand on the use case a bit).

Longer-term, there are a number of language features that could help here. @Joe_Groff has mentioned in the past the idea of scoped imports, which would allow you to import a different operator + that allows reassociation scoped to a block (i.e. without changing the behavior of any other code).

Another option, if you're on an Apple platform, is to use vDSP.sum(foo), which has an efficient vectorized implementation for all targets.
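
For reference, that approach is a one-liner (a sketch assuming the Accelerate framework is available on your deployment target):

import Accelerate

func reduce(foo: [Double]) -> Double {
    vDSP.sum(foo)  // vectorized reduction provided by Accelerate
}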


Thanks for the explanation and suggestions! I knew that integer reductions need &+ (which is associative) to vectorize, but somehow it slipped my mind that the lack of associativity is also the reason why floating-point operations aren't auto-vectorized.

I'll open an issue for the feature request later. My use case is that I'm trying to make a k-means algorithm as fast as possible, and the distances between points/observations and cluster centroids are floating-point values. I can elaborate a bit more in the feature request.

I have a (perhaps naïve) follow up question, if you don't mind:

In addition to the associative_add function you suggested, suppose I also define an associative_multiply:

static inline __attribute__((always_inline))
double associative_multiply(double a, double b) {
#pragma clang fp reassociate(on)
  return a * b;
}

Will the compiler then be able to auto-vectorize subtractions (and is it a good idea in practice) if I turn something like

a[i] - b[i]

into

associative_add(a[i], associative_multiply(-1, b[i]))

Yes (though you can also use the pragma on subtraction directly). It’s probably better not to call these “associative”, since they aren’t; we’re just telling the compiler to pretend that they are. “Relaxed” might be a better name.
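
Following the note that the pragma works on subtraction directly, a helper in the same style as the header above might look like this (the relaxed_sub name is just an example):

static inline __attribute__((always_inline))
double relaxed_sub(double a, double b) {
#pragma clang fp reassociate(on)
  return a - b;  // clang may reassociate this once inlined into a loop
}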


Example here:

In the toy benchmarks I tried, summation and dot-product loops get about 8x faster using these operations. It's possible to go faster still with hand-vectorized code--the codegen isn't as good as I would like--but I hope that we'll be able to get there with LLVM vectorizer improvements as well.

Note that this is a draft PR. I expect some details of these operations to change--this is not a "please start using this", this is a "here's an example of how you can make it work". One gotcha is that the clang included with Swift 5.3.3 does not support the #pragma clang fp reassociate(on) pragma, so if you have to support older Swift toolchains, you have to work around that (which the branch linked above does).

I didn't do subtraction, but it can be defined simply as .relaxedAdd(a, -b).
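
In Swift, that definition could be wrapped as follows (a sketch using the relaxedAdd spelling from the draft PR, which may change; since IEEE-754 negation is exact, this produces the same values a relaxed subtraction would):

extension Double {
    // Hypothetical convenience built on the draft PR's relaxedAdd;
    // the name and placement are illustrative, not final API.
    static func relaxedSub(_ a: Double, _ b: Double) -> Double {
        relaxedAdd(a, -b)
    }
}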


I noticed this sentence in the PR:

These transformations perturb results slightly, so they should not be enabled without care, but the results with the relaxed operations are--for most purposes--"just as good as" (and often better than) what strict operations produce.

Just out of curiosity, and if you don't mind me asking: Why do relaxed operations often give better results? And if the results are often better, why doesn't the compiler make strict operations opt-in? (I guess there might be some really bad edge cases?)


Also, sorry for the delay. I finally opened the feature request as promised: Feature request for "relaxed" floating-point semantics · Issue #216 · apple/swift-numerics · GitHub


These are excellent questions!

The magnitude of rounding error in a floating-point addition is bounded in terms of the magnitudes of the operands. When you sum "random" numbers, cancellation is anomalous; usually the magnitude of the result is about as big as, or bigger than, the inputs. Thus, an accumulator tends to grow in proportion to the number of terms that have been summed. A strictly in-order accumulator therefore has expected error of the form εn for some small ε¹.

If we vectorize the accumulation, we are effectively splitting the buffer into k separate in-order accumulations, each of size n/k, for some unknown integer k (in practice, the vector width times the unroll factor). The errors in each of these are like εn/k. When we sum those accumulators up, the last step has average error like εn/2, which is smaller than the εn error of the in-order accumulation.
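
The reassociation the vectorizer performs can be mimicked in plain Swift, which makes the k-accumulator picture concrete (a sketch only; the real vectorizer strides across SIMD lanes rather than indexing an array of partial sums):

func splitSum(_ a: [Double], lanes k: Int = 4) -> Double {
    var partial = [Double](repeating: 0, count: k)
    for (i, x) in a.enumerated() {
        partial[i % k] += x      // k independent in-order accumulations
    }
    return partial.reduce(0, +)  // combine the k partial sums at the end
}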

It is, of course, possible to construct cases where vectorized accumulation does worse, but on average it has smaller errors. This was folk knowledge among practitioners but not documented in the literature at all until surprisingly recently. The best reference for more details is probably "Reducing Floating Point Error in Dot Product Using the Superblock Family of Algorithms" (2008) if you're interested.

As to your second question, there are two components to the answer:

  1. In some sense, the improved accuracy "doesn't matter", because both answers satisfy the error bounds that higher-level algorithms need in order to be well behaved.
  2. The in-order summation gives the same result on all targets, while relaxed accumulation will give different results depending on what architecture it's compiled for, what compiler version is used, and even what optimization flags are set. Needless to say, reproducibility has a lot of value as a safe default.

¹ Depending on the input distribution, it can also be more like ε√n, but the same analysis holds.
