If you write your reduce using wrapping addition instead of trapping addition (this is necessary for two reasons: first, the vector instructions you're trying to use all wrap, and second, trapping makes the operation non-associative, which blocks most vectorization anyway), it will be vectorized automatically:
func simpleReduce(_ buffer: UnsafeBufferPointer<Int32>) -> Int32 {
    buffer.reduce(into: 0, { $0 &+= $1 })
}
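For context, here's a minimal sketch of how you might call it; the sample array and the withUnsafeBufferPointer wrapper are just illustrative, not part of the original:

let values: [Int32] = Array(1...1_000)
let total = values.withUnsafeBufferPointer { simpleReduce($0) }
print(total)  // 500500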
This is by far the simplest option, and generates a pretty decent inner loop:
.LBB1_10:
        movdqu  xmm2, xmmword ptr [rdi + 4*rdx]
        paddd   xmm2, xmm0
        movdqu  xmm0, xmmword ptr [rdi + 4*rdx + 16]
        paddd   xmm0, xmm1
        movdqu  xmm1, xmmword ptr [rdi + 4*rdx + 32]
        movdqu  xmm3, xmmword ptr [rdi + 4*rdx + 48]
        movdqu  xmm4, xmmword ptr [rdi + 4*rdx + 64]
        paddd   xmm4, xmm1
        paddd   xmm4, xmm2
        movdqu  xmm2, xmmword ptr [rdi + 4*rdx + 80]
        paddd   xmm2, xmm3
        paddd   xmm2, xmm0
        movdqu  xmm0, xmmword ptr [rdi + 4*rdx + 96]
        paddd   xmm0, xmm4
        movdqu  xmm1, xmmword ptr [rdi + 4*rdx + 112]
        paddd   xmm1, xmm2
        add     rdx, 32
        add     rax, 4
        jne     .LBB1_10
You can do better by hand (this is over-unrolled, and has some other minor issues), but this is pretty good for essentially zero effort.
If you enable AVX2, the output is better, but still over-unrolled:
.LBB1_10:
        vpaddd  ymm0, ymm0, ymmword ptr [rdi + rdx]
        vpaddd  ymm1, ymm1, ymmword ptr [rdi + rdx + 32]
        vpaddd  ymm2, ymm2, ymmword ptr [rdi + rdx + 64]
        vpaddd  ymm3, ymm3, ymmword ptr [rdi + rdx + 96]
        vpaddd  ymm0, ymm0, ymmword ptr [rdi + rdx + 128]
        vpaddd  ymm1, ymm1, ymmword ptr [rdi + rdx + 160]
        vpaddd  ymm2, ymm2, ymmword ptr [rdi + rdx + 192]
        vpaddd  ymm3, ymm3, ymmword ptr [rdi + rdx + 224]
        add     rdx, 256
        add     rax, 2
        jne     .LBB1_10
        test    r9, r9
        je      .LBB1_13
There's no mechanism for function-level arch flags at present, unfortunately, so you can't do this on a function-by-function basis yet (see "[SR-11660] Umbrella: function multiversioning and dispatch on CPU features", issue #54069 on apple/swift on GitHub).
(Note that, while I mentioned that the autovectorized codegen is substandard, it's actually better than what your proposed implementation does; to hit peak throughput, you need to perform at least two vector additions per loop iteration, because a modern Intel core can load two vectors per cycle from L1, and could do three VPADDD operations if the data were available, but can only turn over a loop once per cycle. So let the autovectorizer do as much work as you can get it to.)
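For concreteness, here's a rough sketch of what a hand-written loop that respects that constraint might look like, using the standard library's SIMD4<Int32>. The two-accumulator structure, stride, and tail handling are my illustrative choices (not the original post's code), and the compiler may or may not turn the element-wise loads into single vector loads:

func handReduce(_ buffer: UnsafeBufferPointer<Int32>) -> Int32 {
    // Two independent accumulators so each iteration issues (at least) two
    // vector additions instead of serializing on a single register.
    var acc0 = SIMD4<Int32>()
    var acc1 = SIMD4<Int32>()
    var i = 0
    let vectorEnd = buffer.count & ~7   // round count down to a multiple of 8
    while i < vectorEnd {
        acc0 &+= SIMD4<Int32>(buffer[i],     buffer[i + 1], buffer[i + 2], buffer[i + 3])
        acc1 &+= SIMD4<Int32>(buffer[i + 4], buffer[i + 5], buffer[i + 6], buffer[i + 7])
        i += 8
    }
    // Combine the accumulators, reduce horizontally, then handle the scalar tail.
    var result = (acc0 &+ acc1).wrappedSum()
    while i < buffer.count {
        result &+= buffer[i]
        i += 1
    }
    return result
}

In practice the autovectorized reduce above remains the better starting point; this just illustrates the accumulator structure the note describes.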