Difference in compilation of pointers between Swift and C

I've created a small program in both Swift and C. In each, there's a struct, an, with two variables, baseState and workState, which are simple arrays in C and UnsafeMutablePointers in Swift. At one point, after operations are done on workState, the contents of baseState are added in.

The C version:

for (int i = 0; i < 16; i++)
{
    an->workState[i] += an->baseState[i];
}

When compiled with -O3, the relevant assembly (via Compiler Explorer) is:

    movd    xmm4, ecx
    movd    xmm5, r13d
    movd    xmm6, edi
    movd    xmm7, r12d
    movd    xmm8, r8d
    punpckldq       xmm2, xmm1
    punpckldq       xmm4, xmm3
    punpcklqdq      xmm4, xmm2
    paddd   xmm4, xmm0
    movdqu  xmmword ptr [rdx + 92], xmm4
    movdqu  xmm0, xmmword ptr [rdx + 44]
    punpckldq       xmm6, xmm5
    punpckldq       xmm8, xmm7
    punpcklqdq      xmm8, xmm6
    paddd   xmm8, xmm0
    movdqu  xmmword ptr [rdx + 108], xmm8
    add     ebx, dword ptr [rdx + 60]
    mov     dword ptr [rdx + 124], ebx

I.e., it vectorises the code.

With the Swift version…

for i in 0..<16 { an.workState[i] &+= an.baseState[i] }

… the output (-Ounchecked) is:

    add     r14d, dword ptr [rax]
    mov     dword ptr [rcx], r14d
    mov     r14d, dword ptr [rsp - 24]
    add     r14d, dword ptr [rax + 4]
    mov     dword ptr [rcx + 4], r14d
    add     r13d, dword ptr [rax + 8]
    mov     dword ptr [rcx + 8], r13d
    add     r10d, dword ptr [rax + 12]
    mov     dword ptr [rcx + 12], r10d
    add     edi, dword ptr [rax + 16]
    mov     dword ptr [rcx + 16], edi
    add     r11d, dword ptr [rax + 20]
    mov     dword ptr [rcx + 20], r11d
    add     r15d, dword ptr [rax + 24]
    mov     dword ptr [rcx + 24], r15d
    add     edx, dword ptr [rax + 28]
    mov     dword ptr [rcx + 28], edx
    mov     edx, dword ptr [rsp - 28] 

When compiled with Clang (release), the C program is notably faster, and examining the Swift version in Instruments shows unvectorised assembly similar to the listing above.

I'm curious as to why this is so. I thought that UnsafeMutablePointer, being "closer to the metal", would have been more readily vectorised. Is this a feature of this type, or is there better code syntax I should be using?

I agree that this case should vectorize.

Somewhat strangely, checking quickly on godbolt, it appears not to get vectorized for C either with clang. There's not really a "why" here, and you're not doing anything wrong. This appears to be "just" a missed optimization bug in the LLVM layer.
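
For anyone who wants to check this on Compiler Explorer themselves, a minimal sketch of the C side might look like the following. The type name State, the function name add_base, and the uint32_t element type are assumptions for illustration (the dword operations in the listings suggest 32-bit elements); the original struct definition isn't shown in the question, so the real layout may differ.

```c
#include <stdint.h>

/* Hypothetical stand-in for the struct described in the question:
   two fixed-size arrays of 16 elements each. The 32-bit element
   type is an assumption based on the dword accesses in the assembly. */
typedef struct {
    uint32_t baseState[16];
    uint32_t workState[16];
} State;

/* Adds baseState into workState element by element.
   Unsigned addition in C wraps on overflow, which matches
   the wrapping &+= used in the Swift version. */
void add_base(State *an)
{
    for (int i = 0; i < 16; i++)
    {
        an->workState[i] += an->baseState[i];
    }
}
```

Compiling just this function at -O3 and comparing against the full program may help show whether the loop on its own gets vectorized, or whether the surrounding code changes what the optimizer does with it.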


Is -Ounchecked used in practice?

It really oughtn't be. It shouldn't matter here in any case, since pointer subscripting isn't bounds checked and all of the arithmetic is wrapping.