Vector extensions with Swift

I have been happily using the Accelerate framework to speed up some financial operations involving thousands of data points. Sadly, there is no vDSP.sum(_:) overload for signed integers (only for floating-point types).
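For floating-point element types the operation is a one-liner; a quick illustration (samples is just a hypothetical array):

import Accelerate

let samples: [Float] = [1.5, 2.5, 3.0]
let total = vDSP.sum(samples) // 7.0; no such overload exists for Int32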

The naive reduce() implementation is too slow for my current purposes, so I decided to use this opportunity to learn how to use AVX intrinsics with Swift. So far, this is what I have:

import _Builtin_intrinsics.intel

extension vDSP {
    /// Returns the sum of a vector of signed 32-bit integers.
    @_transparent static func sum<U>(_ vector: U) -> Int32 where U: AccelerateBuffer, U.Element == Int32 {
        vector.withUnsafeBufferPointer { (buffer) -> Int32 in
            guard let baseAddress = buffer.baseAddress else { return 0 }
            // Eight Int32 lanes fit in each 256-bit register; handle the tail separately.
            let (iterations, remaining) = (buffer.count / 8, buffer.count % 8)
            
            var result: Int32 = baseAddress.withMemoryRebound(to: __m256i.self, capacity: iterations) {
                var accumulator = _mm256_setzero_si256()
                
                for i in 0..<iterations {
                    let element = _mm256_loadu_si256($0 + i)   // unaligned 256-bit load
                    accumulator = _mm256_add_epi32(accumulator, element)
                }
                
                // Horizontal add: spill the eight lanes and fold them with wrapping addition.
                let values = unsafeBitCast(accumulator, to: SIMD8<Int32>.self)
                return values[0] &+ values[1] &+ values[2] &+ values[3] &+ values[4] &+ values[5] &+ values[6] &+ values[7]
            }
            
            // Add the remaining (fewer than eight) elements.
            for i in 0..<remaining {
                result &+= buffer[iterations * 8 + i]
            }
            
            return result
        }
    }
}

The current code has several shortcomings, and somehow I am unable to use some AVX2 intrinsics, such as _mm256_extracti128_si256. I would like to use horizontal adds and to extract parts of the registers.

You can see the current compiler output on Godbolt.

Concretely, I have several questions:

  • Is someone out there actively using vector extensions with Swift?
  • How can I activate AVX2 compilation per function? The -Xcc -Xclang -Xcc -target-feature -Xcc -Xclang -Xcc +avx2 flags seem too heavy-handed (as does setting the project-wide "Enable Additional Vector Extensions" build setting).
  • What is the best way to implement an Int32 sum with AVX2?
1 Like

If you write your reduce using wrapping addition instead of trapping addition (this is necessary for two reasons: first, the vector instructions you're trying to use all wrap; second, trapping makes an operation non-associative, which blocks most vectorization anyway), it will be vectorized automatically:

func simpleReduce(_ buffer: UnsafeBufferPointer<Int32>) -> Int32 {
    buffer.reduce(into: 0, { $0 &+= $1 })
}
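For example, calling it on an array (values here is just hypothetical test data):

let values: [Int32] = Array(repeating: 1, count: 10_000)  // hypothetical test data
let total = values.withUnsafeBufferPointer(simpleReduce)  // 10_000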

This is by far the simplest option, and generates a pretty decent inner loop:

.LBB1_10:
        movdqu  xmm2, xmmword ptr [rdi + 4*rdx]
        paddd   xmm2, xmm0
        movdqu  xmm0, xmmword ptr [rdi + 4*rdx + 16]
        paddd   xmm0, xmm1
        movdqu  xmm1, xmmword ptr [rdi + 4*rdx + 32]
        movdqu  xmm3, xmmword ptr [rdi + 4*rdx + 48]
        movdqu  xmm4, xmmword ptr [rdi + 4*rdx + 64]
        paddd   xmm4, xmm1
        paddd   xmm4, xmm2
        movdqu  xmm2, xmmword ptr [rdi + 4*rdx + 80]
        paddd   xmm2, xmm3
        paddd   xmm2, xmm0
        movdqu  xmm0, xmmword ptr [rdi + 4*rdx + 96]
        paddd   xmm0, xmm4
        movdqu  xmm1, xmmword ptr [rdi + 4*rdx + 112]
        paddd   xmm1, xmm2
        add     rdx, 32
        add     rax, 4
        jne     .LBB1_10

You can do better by hand (this is over-unrolled, and has some other minor issues), but this is pretty good for essentially zero effort.

If you enable avx2, the output is better, but still over-unrolled:

.LBB1_10:
        vpaddd  ymm0, ymm0, ymmword ptr [rdi + rdx]
        vpaddd  ymm1, ymm1, ymmword ptr [rdi + rdx + 32]
        vpaddd  ymm2, ymm2, ymmword ptr [rdi + rdx + 64]
        vpaddd  ymm3, ymm3, ymmword ptr [rdi + rdx + 96]
        vpaddd  ymm0, ymm0, ymmword ptr [rdi + rdx + 128]
        vpaddd  ymm1, ymm1, ymmword ptr [rdi + rdx + 160]
        vpaddd  ymm2, ymm2, ymmword ptr [rdi + rdx + 192]
        vpaddd  ymm3, ymm3, ymmword ptr [rdi + rdx + 224]
        add     rdx, 256
        add     rax, 2
        jne     .LBB1_10
        test    r9, r9
        je      .LBB1_13
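(For reference: enabling AVX2 for an entire SwiftPM build with the flags quoted in the question would look something like the line below, which is exactly the heavy-handedness you were complaining about.)

swift build -c release -Xcc -Xclang -Xcc -target-feature -Xcc -Xclang -Xcc +avx2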

There's no mechanism for function-level arch flags at present, unfortunately, so you can't do this on a function-by-function basis yet (https://bugs.swift.org/browse/SR-11660).

(Note that, while I mentioned that the autovectorized codegen is substandard, it's actually better than what your proposed implementation does; to hit peak throughput, you need to perform at least two vector additions per loop iteration, because a modern Intel core can load two vectors per cycle from L1, and could do three VPADD operations if the data were available, but can only turn over a loop once per cycle. So let the autovectorizer do as much work as you can get it to.)
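If you do want to write the loop by hand, a minimal sketch of the idea using the portable SIMD8<Int32> type (an illustration, not code from this thread) keeps two independent accumulators so that each iteration issues two vector adds:

func twoAccumulatorSum(_ buffer: UnsafeBufferPointer<Int32>) -> Int32 {
    var acc0 = SIMD8<Int32>()   // zero-initialized
    var acc1 = SIMD8<Int32>()
    var i = 0
    // Two independent accumulators let consecutive vector adds overlap.
    while i + 16 <= buffer.count {
        acc0 &+= SIMD8<Int32>(buffer[i ..< i + 8])
        acc1 &+= SIMD8<Int32>(buffer[i + 8 ..< i + 16])
        i += 16
    }
    // Combine the accumulators, then fold the eight lanes.
    var result = (acc0 &+ acc1).wrappedSum()
    // Scalar tail for the remaining (fewer than sixteen) elements.
    while i < buffer.count {
        result &+= buffer[i]
        i += 1
    }
    return result
}

With the adds independent, they can overlap in the pipeline; the AVX2 autovectorized loop above is doing essentially the same thing with four accumulators (ymm0 through ymm3).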

7 Likes

Thank you for your detailed answer, @scanon.

I usually try &+ and friends first. I am surprised I skipped straight to writing my own AVX code (probably out of eagerness to learn). You are, of course, right that the compiler produces a better implementation, and I will use that.

Since I am new to vector-extension programming as a whole:

  • could you recommend some good reads about it? (this has been pretty useful so far)
  • what is your workflow when you write this type of code, and what is the best way to do it from Swift?
  • why can't Swift "see" some AVX intrinsics (such as _mm256_extracti128_si256)?

I'll circle back to answer your other questions, but this one is easy: Swift doesn't import function-like macros, because there's no Swift construct to map them to. _mm256_extracti128_si256 is implemented as follows:

#define _mm256_extracti128_si256(V, M) \
(__m128i)__builtin_ia32_extract128i256((__v4di)(__m256i)(V), (int)(M))

You can generally work around this in Swift by adding your own C header that defines an actual function version of the intrinsic:

#include <immintrin.h>

// Real functions (not macros) wrapping each constant index, so Swift can import them.
static inline __attribute__((always_inline))
__m128i lowHalf(__m256i vector) {
  return _mm256_extracti128_si256(vector, 0);
}

static inline __attribute__((always_inline))
__m128i highHalf(__m256i vector) {
  return _mm256_extracti128_si256(vector, 1);
}

(You need to have two versions, because the int argument to the intrinsic is required to be an integer constant.)
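Once that header is exposed to Swift (e.g. through a small C target), the call site would look something like this sketch, where accumulator is the __m256i value from your code:

let low  = lowHalf(accumulator)        // lower 128 bits of the accumulator
let high = highHalf(accumulator)       // upper 128 bits
let halfSum = _mm_add_epi32(low, high) // four Int32 lanes remain to fold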

Note that you shouldn't actually need to do this; you should be able to simply use .wrappedSum() on your result to do the horizontal operation, but there's a little bit of compiler work that needs to happen to map that to the code that we actually want to generate; right now it produces about the same output as your workaround (this is fixable, just no one has had a chance to work on it yet).
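Concretely, the manual eight-lane fold in your code would become something like:

let values = unsafeBitCast(accumulator, to: SIMD8<Int32>.self)
return values.wrappedSum() // horizontal wrapping sum of all eight lanes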

That makes sense. It is a pity I have to add a new SPM target just for that, though.

I wanted to take the time now to properly say thank you for all the work and goodwill you are pouring in. I might not interact much in the forums, but any time I look for something, you appear as one of the top repliers, with thoughtful, detailed answers.

I understand you have your fingers in many pies (Swift Numerics, Accelerate, new evolution proposals, etc.) and still manage to come up with outstanding work. Congratulations, and please forward my thanks to the rest of the team.

5 Likes

Yeah, it's a pain. I am vaguely considering adding a module to Swift Numerics that provides at least the most common ones as a workaround for folks (though obviously, there's a tradeoff here between adding workarounds and spending time making SIMD work better so that you don't need them).
