I've created a small program in both Swift and C. In each, there's a struct, an, with two variables, baseState and workState, which are simple arrays in C and UnsafeMutablePointers in Swift. At one point, after operations are done on workState, the contents of baseState are added in.
The C version:
for (int i = 0; i < 16; i++)
{
an->workState[i] += an->baseState[i];
}
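For anyone who wants to reproduce this, a minimal self-contained sketch of the C side might look like the following. The struct name An, the function name add_base_state, and the uint32_t element type are assumptions made for illustration (the 32-bit width is inferred from the dword operands in the assembly); the real struct presumably has other fields as well.

```c
#include <stdint.h>

/* Hypothetical stand-in for the real struct: only the two arrays
   mentioned above are reproduced here. */
struct An {
    uint32_t baseState[16];
    uint32_t workState[16];
};

/* Adds baseState into workState element by element.
   Unsigned addition wraps on overflow, like Swift's &+= operator. */
void add_base_state(struct An *an)
{
    for (int i = 0; i < 16; i++)
    {
        an->workState[i] += an->baseState[i];
    }
}
```

Pasting something like this into Compiler Explorer with -O3 should be enough to inspect how the loop is compiled.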
When compiled with -O3, the relevant assembly (via Compiler Explorer) is:
movd xmm4, ecx
movd xmm5, r13d
movd xmm6, edi
movd xmm7, r12d
movd xmm8, r8d
punpckldq xmm2, xmm1
punpckldq xmm4, xmm3
punpcklqdq xmm4, xmm2
paddd xmm4, xmm0
movdqu xmmword ptr [rdx + 92], xmm4
movdqu xmm0, xmmword ptr [rdx + 44]
punpckldq xmm6, xmm5
punpckldq xmm8, xmm7
punpcklqdq xmm8, xmm6
paddd xmm8, xmm0
movdqu xmmword ptr [rdx + 108], xmm8
add ebx, dword ptr [rdx + 60]
mov dword ptr [rdx + 124], ebx
I.e., it vectorises the code.
With the Swift version…
for i in 0..<16 { an.workState[i] &+= an.baseState[i] }
… the output (-Ounchecked) is:
add r14d, dword ptr [rax]
mov dword ptr [rcx], r14d
mov r14d, dword ptr [rsp - 24]
add r14d, dword ptr [rax + 4]
mov dword ptr [rcx + 4], r14d
add r13d, dword ptr [rax + 8]
mov dword ptr [rcx + 8], r13d
add r10d, dword ptr [rax + 12]
mov dword ptr [rcx + 12], r10d
add edi, dword ptr [rax + 16]
mov dword ptr [rcx + 16], edi
add r11d, dword ptr [rax + 20]
mov dword ptr [rcx + 20], r11d
add r15d, dword ptr [rax + 24]
mov dword ptr [rcx + 24], r15d
add edx, dword ptr [rax + 28]
mov dword ptr [rcx + 28], edx
mov edx, dword ptr [rsp - 28]
When compiled with Clang (release), the C program is notably faster, and examining the Swift version in Instruments shows unvectorised assembly similar to the output above.
I'm curious as to why this is so. I thought that UnsafeMutablePointer, being "closer to the metal", would have been more readily vectorised. Is this a feature of this type, or is there a better way I should be writing this loop?