When compiled with optimization for x86,
func reduce(foo: [Int]) -> Int {
    foo.reduce(0, &+)
}
is auto-vectorized (godbolt link):
.LBB1_8:
vpaddq ymm0, ymm0, ymmword ptr [rdi + 8*rsi + 32]
vpaddq ymm1, ymm1, ymmword ptr [rdi + 8*rsi + 64]
vpaddq ymm2, ymm2, ymmword ptr [rdi + 8*rsi + 96]
vpaddq ymm3, ymm3, ymmword ptr [rdi + 8*rsi + 128]
vpaddq ymm0, ymm0, ymmword ptr [rdi + 8*rsi + 160]
vpaddq ymm1, ymm1, ymmword ptr [rdi + 8*rsi + 192]
vpaddq ymm2, ymm2, ymmword ptr [rdi + 8*rsi + 224]
vpaddq ymm3, ymm3, ymmword ptr [rdi + 8*rsi + 256]
vpaddq ymm0, ymm0, ymmword ptr [rdi + 8*rsi + 288]
vpaddq ymm1, ymm1, ymmword ptr [rdi + 8*rsi + 320]
vpaddq ymm2, ymm2, ymmword ptr [rdi + 8*rsi + 352]
vpaddq ymm3, ymm3, ymmword ptr [rdi + 8*rsi + 384]
vpaddq ymm0, ymm0, ymmword ptr [rdi + 8*rsi + 416]
vpaddq ymm1, ymm1, ymmword ptr [rdi + 8*rsi + 448]
vpaddq ymm2, ymm2, ymmword ptr [rdi + 8*rsi + 480]
vpaddq ymm3, ymm3, ymmword ptr [rdi + 8*rsi + 512]
But if the same loop is over Double:
func reduce(foo: [Double]) -> Double {
    foo.reduce(0, +)
}
then it's not auto-vectorized (godbolt link):
.LBB1_9:
vaddsd xmm0, xmm0, qword ptr [rdi + 8*rdx + 32]
vaddsd xmm0, xmm0, qword ptr [rdi + 8*rdx + 40]
vaddsd xmm0, xmm0, qword ptr [rdi + 8*rdx + 48]
vaddsd xmm0, xmm0, qword ptr [rdi + 8*rdx + 56]
vaddsd xmm0, xmm0, qword ptr [rdi + 8*rdx + 64]
vaddsd xmm0, xmm0, qword ptr [rdi + 8*rdx + 72]
vaddsd xmm0, xmm0, qword ptr [rdi + 8*rdx + 80]
vaddsd xmm0, xmm0, qword ptr [rdi + 8*rdx + 88]
add rdx, 8
cmp rcx, rdx
jne .LBB1_9
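My guess at the cause: the scalar loop above accumulates everything into xmm0 in strict left-to-right order, whereas the integer version splits the sum across four independent accumulators. Wrapping integer addition (&+) is associative, so that regrouping is harmless there, but IEEE 754 floating-point addition is not associative, so the compiler presumably cannot regroup the Double sum without changing the result. A small illustration (the variable names are just for this example):

```swift
// Floating-point addition is not associative: regrouping changes the result.
let a = 1e100
let b = -1e100
let c = 1.0

let leftToRight = (a + b) + c   // 1.0: a and b cancel exactly, then c is added
let regrouped = a + (b + c)     // 0.0: c is lost when rounded into b

print(leftToRight, regrouped)
```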
Is there some way to structure the code so that the loop over floating-point values is also auto-vectorized?
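For reference, here is a minimal sketch of the kind of restructuring I have in mind, assuming it is acceptable for the result to differ slightly from the strict left-to-right sum. It accumulates into a SIMD4<Double> by hand and handles the leftover elements separately. The name reduceSIMD is hypothetical, and I have not verified that this actually compiles down to packed vaddpd instructions:

```swift
// Hypothetical helper: sums four lanes at a time using the standard
// library's SIMD4 type, then adds any leftover elements one by one.
// Note: the order of additions differs from foo.reduce(0, +), so the
// result may differ in the last bits for general inputs.
func reduceSIMD(_ foo: [Double]) -> Double {
    let vectorCount = foo.count / 4

    // Accumulate full groups of four elements into one SIMD register.
    let acc: SIMD4<Double> = foo.withUnsafeBufferPointer { buf in
        var acc = SIMD4<Double>(repeating: 0)
        for i in 0..<vectorCount {
            let base = i * 4
            acc += SIMD4<Double>(buf[base], buf[base + 1], buf[base + 2], buf[base + 3])
        }
        return acc
    }

    // Horizontal sum of the four lanes.
    var total = acc.sum()

    // Scalar tail for counts that are not a multiple of four.
    for i in (vectorCount * 4)..<foo.count {
        total += foo[i]
    }
    return total
}

print(reduceSIMD([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

This makes the reordering explicit in the source rather than relying on the optimizer, but I would prefer a way to keep the plain reduce call if one exists.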