If you write your reduce using wrapping addition instead of trapping addition (this is necessary for two reasons: first, the vector instructions you're trying to use all wrap, and second, trapping makes the operation non-associative, which blocks most vectorization anyway), it will be vectorized automatically:
func simpleReduce(_ buffer: UnsafeBufferPointer<Int32>) -> Int32 {
    buffer.reduce(into: 0, { $0 &+= $1 })
}
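For context, here's a minimal sketch of how you might call it; the sample array and the withUnsafeBufferPointer wrapper are just illustrative, not part of the original:

let values: [Int32] = Array(1...1_000)
let total = values.withUnsafeBufferPointer { simpleReduce($0) }
print(total)  // 500500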
This is by far the simplest option, and generates a pretty decent inner loop:
.LBB1_10:
        movdqu  xmm2, xmmword ptr [rdi + 4*rdx]
        paddd   xmm2, xmm0
        movdqu  xmm0, xmmword ptr [rdi + 4*rdx + 16]
        paddd   xmm0, xmm1
        movdqu  xmm1, xmmword ptr [rdi + 4*rdx + 32]
        movdqu  xmm3, xmmword ptr [rdi + 4*rdx + 48]
        movdqu  xmm4, xmmword ptr [rdi + 4*rdx + 64]
        paddd   xmm4, xmm1
        paddd   xmm4, xmm2
        movdqu  xmm2, xmmword ptr [rdi + 4*rdx + 80]
        paddd   xmm2, xmm3
        paddd   xmm2, xmm0
        movdqu  xmm0, xmmword ptr [rdi + 4*rdx + 96]
        paddd   xmm0, xmm4
        movdqu  xmm1, xmmword ptr [rdi + 4*rdx + 112]
        paddd   xmm1, xmm2
        add     rdx, 32
        add     rax, 4
        jne     .LBB1_10
You can do better by hand (this is over-unrolled, and has some other minor issues), but this is pretty good for essentially zero effort.
If you enable AVX2, the output is better, but still over-unrolled:
.LBB1_10:
        vpaddd  ymm0, ymm0, ymmword ptr [rdi + rdx]
        vpaddd  ymm1, ymm1, ymmword ptr [rdi + rdx + 32]
        vpaddd  ymm2, ymm2, ymmword ptr [rdi + rdx + 64]
        vpaddd  ymm3, ymm3, ymmword ptr [rdi + rdx + 96]
        vpaddd  ymm0, ymm0, ymmword ptr [rdi + rdx + 128]
        vpaddd  ymm1, ymm1, ymmword ptr [rdi + rdx + 160]
        vpaddd  ymm2, ymm2, ymmword ptr [rdi + rdx + 192]
        vpaddd  ymm3, ymm3, ymmword ptr [rdi + rdx + 224]
        add     rdx, 256
        add     rax, 2
        jne     .LBB1_10
        test    r9, r9
        je      .LBB1_13
There's no mechanism for function-level arch flags at present, unfortunately, so you can't do this on a function-by-function basis yet (see "[SR-11660] Umbrella: function multiversioning and dispatch on CPU features", issue #54069 on apple/swift on GitHub).
(Note that, while I mentioned that the autovectorized codegen is substandard, it's actually better than what your proposed implementation does; to hit peak throughput, you need to perform at least two vector additions per loop iteration, because a modern Intel core can load two vectors per cycle from L1, and could do three VPADDD operations if the data were available, but can only turn over a loop once per cycle. So let the autovectorizer do as much work as you can get it to.)
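For concreteness, here's a rough sketch of what a hand-written loop that respects that constraint might look like, using the standard library's SIMD4<Int32>. The two-accumulator structure, stride, and tail handling are my illustrative choices (not the original post's code), and the compiler may or may not turn the element-wise loads into single vector loads:

func handReduce(_ buffer: UnsafeBufferPointer<Int32>) -> Int32 {
    // Two independent accumulators so each iteration issues (at least) two
    // vector additions instead of serializing on a single register.
    var acc0 = SIMD4<Int32>()
    var acc1 = SIMD4<Int32>()
    var i = 0
    let vectorEnd = buffer.count & ~7   // round count down to a multiple of 8
    while i < vectorEnd {
        acc0 &+= SIMD4<Int32>(buffer[i],     buffer[i + 1], buffer[i + 2], buffer[i + 3])
        acc1 &+= SIMD4<Int32>(buffer[i + 4], buffer[i + 5], buffer[i + 6], buffer[i + 7])
        i += 8
    }
    // Combine the accumulators, reduce horizontally, then handle the scalar tail.
    var result = (acc0 &+ acc1).wrappedSum()
    while i < buffer.count {
        result &+= buffer[i]
        i += 1
    }
    return result
}

In practice the autovectorized reduce above remains the better starting point; this just illustrates the accumulator structure the note describes.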