Flat indexing array of simd methods

Troy_Harvey · December 20, 2019, 7:41am

Hi All. I'm wondering if I'm missing something (or it's getting late...). There doesn't seem to be a built-in (and performant!) method to flat index the elements of an array of SIMD.

As in:
let array: [SIMD4<Float>] = [[1, 2, 3, 4], [5, 6, 7, 8]]
array.flatIndex[4] // hypothetical API; desired answer = 5

While I could double for-loop it, that would add some overhead. I could drop to walking a pointer, but I feel I must be missing a more Swifty approach.

Thoughts?

Tino · December 20, 2019, 9:56am

I think the example might be confusing: You don't want to inspect single numbers, but iterate through all of them, right?
This could be done with an iterator-iterator ;-). Maybe this already exists (in Apple Developer Documentation), but it wouldn't be hard to roll your own (unless you really want to have the fastest code possible ;-)

For single item access, I guess there's nothing better than Apple Developer Documentation
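For the iterate-everything case, a roll-your-own version could be as small as the sketch below. `forEachScalar` is a made-up name, `SIMD4<Float>` is assumed as the element type, and nothing here has been measured for speed.

```swift
/// Hypothetical helper: visits every scalar of every vector, in order.
func forEachScalar(in vectors: [SIMD4<Float>], _ body: (Float) -> Void) {
    for v in vectors {
        // scalarCount is 4 for SIMD4; the inner loop walks the lanes.
        for lane in 0..<v.scalarCount {
            body(v[lane])
        }
    }
}

var total: Float = 0
forEachScalar(in: [[1, 2, 3, 4], [5, 6, 7, 8]]) { total += $0 }
print(total) // 36.0
```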

Troy_Harvey · December 20, 2019, 5:18pm

High performance is key. If it wasn't, I wouldn't be using SIMD.

If I were using C, I'd make a union of {SIMD, Float}, which would be a clean interface for switching to flat indexing. Again, I can drop to pointers, but wondered if there is a better Swifty approach.
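For reference, the closest thing to the union idea without writing pointer-walking code by hand might be a raw-bytes view of the array's storage. This is only a sketch: `flatLoad(at:)` is a hypothetical name, and it assumes that `SIMD4<Float>` is laid out as four contiguous `Float` lanes with a 16-byte stride.

```swift
extension Array where Element == SIMD4<Float> {
    /// Hypothetical helper: reads the scalar at flat position `i`
    /// by viewing the array's storage as raw bytes.
    /// Assumes each SIMD4<Float> is four contiguous Floats (16 bytes).
    func flatLoad(at i: Int) -> Float {
        precondition(i >= 0 && i < count * 4, "flat index out of range")
        return self.withUnsafeBytes { raw in
            raw.load(fromByteOffset: i * MemoryLayout<Float>.stride, as: Float.self)
        }
    }
}

let vectors: [SIMD4<Float>] = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(vectors.flatLoad(at: 4)) // 5.0
```

Whether this beats a plain double subscript is not something the sketch shows; that would need measuring.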

Karl · December 21, 2019, 7:25am

The compiler will already generate a direct offset from a double subscript: Compiler Explorer

One thing to note is to use “-Ounchecked” to avoid bounds-checking the Array.

EDIT: A more complete example. You'll see that the doTest function boils down to the same thing (in .LBB4_2):

        movss   xmm0, dword ptr [rcx + 4*rax]
        call    (output.blackHole(Swift.Float) -> ())
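For readers without the link open, here is a minimal sketch of what such a double subscript might look like. The names `flatCount` and `flatIndex` match the loop snippets in this thread; the bodies are an assumption, not the code behind the link, and `SIMD4<Float>` is assumed as the element type.

```swift
extension Array where Element == SIMD4<Float> {
    /// Total number of scalars across all vectors.
    var flatCount: Int { return count * 4 }

    /// Scalar at flat position `i`: vector `i / 4`, lane `i % 4`.
    subscript(flatIndex i: Int) -> Float {
        return self[i / 4][i % 4]
    }
}

let arr: [SIMD4<Float>] = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(arr[flatIndex: 4]) // 5.0
```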

Karl · December 21, 2019, 9:57am

Hmm, this ended up being a curious little micro-optimisation puzzle.

Basically, if you write:

  for i in 0..<arr.flatCount {
    let result = arr[flatIndex: i]
    blackHole(result)
  }

You'll get this output (where I believe it is re-calculating the offset every iteration of the loop):

        push    rbp
        mov     rbp, rsp
        push    r15
        push    r14
        push    rbx
        push    rax
        mov     r15, qword ptr [rdi + 16]
        shl     r15, 2
        test    r15, r15
        je      .LBB5_3
        mov     r14, rdi
        add     r14, 32
        xor     eax, eax
.LBB5_2:
        mov     rcx, rax
// IIUC, all of this shifting, subtracting and adding is recalculating the offset.
        sar     rcx, 63
        shr     rcx, 62
        add     rcx, rax
        sar     rcx, 2
        lea     edx, [4*rcx]
        lea     rbx, [rax + 1]
        sub     eax, edx
        shl     rcx, 4
        add     rcx, r14
        cdqe
        movss   xmm0, dword ptr [rcx + 4*rax]
        call    (output.blackHole(Swift.Float) -> ())
        mov     rax, rbx
        cmp     r15, rbx
        jne     .LBB5_2
.LBB5_3:
        add     rsp, 8
        pop     rbx
        pop     r14
        pop     r15
        pop     rbp
        ret

However, writing it as a while loop:

    var i = 0
    while i < arr.flatCount {
        let result = arr[flatIndex: i]
        blackHole(result)
        i += 1
    }

Generates fewer, more pleasing instructions:

        push    rbp
        mov     rbp, rsp
        push    r15
        push    r14
        push    r12
        push    rbx
        mov     r15, qword ptr [rdi + 16]
        shl     r15, 2
        test    r15, r15
        jle     .LBB5_3
        mov     r14, rdi
        add     r14, 32
        xor     ebx, ebx
        movabs  r12, 4611686018427387900
.LBB5_2:
        mov     rax, rbx
        and     rax, r12
        lea     rax, [r14 + 4*rax]
        mov     ecx, ebx
        and     ecx, 3
        movss   xmm0, dword ptr [rax + 4*rcx]
        call    (output.blackHole(Swift.Float) -> ())
        add     rbx, 1
        cmp     r15, rbx
        jne     .LBB5_2
.LBB5_3:
        pop     rbx
        pop     r12
        pop     r14
        pop     r15
        pop     rbp
        ret

I very much doubt you'll get anything nicer than the latter result, even from C.
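Reading the two listings side by side, the `sar`/`shr`/`add`/`sar` sequence in the `for` version is the usual pattern for a signed division by 4, while the `while` version gets by with plain `and` masks. One way to ask for the mask form directly is to write the index math with a shift and a mask, as in the sketch below. `maskedFlatIndex` is a hypothetical name, the element type is assumed to be `SIMD4<Float>`, and whether this changes the generated code would need checking in Compiler Explorer.

```swift
extension Array where Element == SIMD4<Float> {
    /// Hypothetical variant of a flat subscript using shift and mask.
    /// For i >= 0, `i >> 2` equals `i / 4` and `i & 3` equals `i % 4`.
    subscript(maskedFlatIndex i: Int) -> Float {
        return self[i >> 2][i & 3]
    }
}

let data: [SIMD4<Float>] = [[1, 2, 3, 4], [5, 6, 7, 8]]
var i = 0
while i < data.count * 4 {
    print(data[maskedFlatIndex: i])
    i += 1
}
```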

Troy_Harvey · December 29, 2019, 9:08pm

Thanks for diving down on that one.
I hadn't gone down to the assembly output yet. It does better than expected.

Interesting difference on the while loop. Compiler Explorer is a nice little tool. Thanks for the introduction.