Using SIMD types for vector addition

Two factors at work here:

  1. The SIMD implementations are inlined into the caller, so they inherit the caller's optimization settings.
  2. Each SIMD add only does 4 additions; even if the actual addition were optimized perfectly, you still have all the other loop machinery, plus creating and unpacking the vectors on every iteration, which adds up to significantly more work than the four additions themselves (roughly the shape sketched below).
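
For illustration, the general shape being described is something like this sketch (assuming SIMD4<Double> for concreteness; this is not the exact code under discussion):

func addNaive(_ a: [Double], _ b: [Double], into c: inout [Double]) {
    precondition(a.count == b.count && a.count == c.count && a.count % 4 == 0)
    var i = 0
    while i < a.count {
        // Pack four elements from each input into vectors (done on every iteration)...
        let va = SIMD4<Double>(a[i], a[i + 1], a[i + 2], a[i + 3])
        let vb = SIMD4<Double>(b[i], b[i + 1], b[i + 2], b[i + 3])
        // ...do the four additions as one vector add...
        let vc = va + vb
        // ...then unpack the result back into the output array.
        c[i] = vc[0]; c[i + 1] = vc[1]; c[i + 2] = vc[2]; c[i + 3] = vc[3]
        i += 4
    }
}

Each pass through the loop does four additions, but also eight scalar reads to build the vectors, four scalar writes to unpack the result, plus the loop bookkeeping.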

I'm eager to see how you would write a SIMD function for element-wise addition of large vectors. I don't understand how you would get rid of the loop. And I don't see how you would create the SIMD vectors once to avoid recreating them during each iteration. I tried using var but it doesn't let me overwrite the entire SIMD vector.

I understood Steve’s point to be that a single 4-element addition is too small to really take advantage of SIMD.

Yeah I get that point. What I don't understand is how you would remove the loop machinery and how to handle the creation/unpacking of the vectors without affecting performance.

Loop unrolling can help quite a bit on CPUs that can execute more SIMD additions per cycle than taken branches per cycle. Assuming the Intel Mac you're using is a 2019 model for this example, Agner Fog's guide says:

The throughput for taken branches is one jump per clock or one jump per two clocks, depending on the density of branches.

It also mentions that ports 0, 1, and 5 can do "integer and vector arithmetic". Making a number of hand-wavy assumptions (e.g. ignoring memory throughput and vector formation, and assuming I'm not forgetting something important; I didn't actually test any of this while writing this reply), that would suggest that unrolling either 4x or 8x would allow about 3x higher throughput.
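
Back-of-the-envelope, that reasoning looks like this (same hand-wavy assumptions as above):

one element per loop iteration:  1 vector add + index update + compare/taken branch
taken-branch throughput:         ~1 per clock (best case)   → at most ~1 vector add per clock
vector-add throughput:           ~3 per clock (ports 0, 1, 5)
unrolled 4x or 8x:               one branch per 4-8 adds    → the add ports become the limit, ~3x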

Unoptimized Swift code generation is often quite inefficient, so I wouldn't expect this to be the only thing limiting it; it's just one example of something that could contribute.

If you're interested in digging into this sort of thing further, the "counters" mode in Instruments is a great tool for figuring out which CPU resource you're bottlenecked on, and of course the Apple CPU optimization guide is an essential resource if you're doing similar tuning on a newer Mac.

But running with optimizations is definitely step 1 :)


In this case, the loop is load-store bound on every mainstream architecture, so you will never saturate the ALUs. For each element, whether it is a scalar or a SIMD vector of whatever size, the best possible codegen requires 2 loads, 1 floating-point add, and 1 store.

Common big-core microarchitectures can sustain 2-4 fadd instructions (scalar or vector) per cycle, but typically only sustain 2 or 3 memory operations per cycle (again, scalar or vector, and only if they hit caches). So the limiting factor on how fast you're going to go is necessarily load-store bandwidth.
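
To put rough numbers on that (same caveat: only if everything hits cache):

per element's worth of work:   2 loads + 1 store = 3 memory ops, but only 1 fadd
at 2-3 memory ops per cycle:   those 3 memory ops take 1 to 1.5 cycles
at 2-4 fadds per cycle:        the single fadd takes 0.25 to 0.5 cycles
→ the load/store pipes saturate long before the fadd units do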

Unrolling is still beneficial because of other considerations, however; beyond the real computational work, which is invariant, you have loop overhead in the form of an index update plus a compare and branch, which adds two or three additional instructions and uOps to each loop iteration. If you only do one element per loop, that means you have to be able to turn over 5-7 instructions and 6-7 uOps per cycle to saturate the machine, and not all uArches are capable of doing so. Amortizing the loop overhead across two elements cuts that down to 8-11 instructions and maybe 10 uOps per iteration, i.e. fewer per element, which is achievable on many mainstream big core designs.

Thus, unrolling this loop 2x is a pretty common way to achieve better performance. E.g. the compiler-optimized loop on Apple Silicon:

0:	ldp	q0, q1, [x12, #-0x10] // load two vectors from a
	ldp	q2, q3, [x11, #-0x10] // load two vectors from b
	fadd.2d	v0, v0, v2        // add first vectors from a and b
	fadd.2d	v1, v1, v3        // add second vectors 
	stp	q0, q1, [x10, #-0x10] // store two result vectors to c
	add	x12, x12, #0x20       // advance pointer to a
	add	x11, x11, #0x20       // advance pointer to b
	add	x10, x10, #0x20       // advance pointer to c
	subs	x9, x9, #0x4      // subtract 4 from loop count
	b.ne	0b                // repeat loop if not done

and the compiler-optimized loop for x86_64 (without AVX[512]):

0:  movupd  xmm0, xmmword ptr [r15 + 8*rcx + 32] // first vector from a
    movupd  xmm1, xmmword ptr [r15 + 8*rcx + 48] // second vector from a
    movupd  xmm2, xmmword ptr [r14 + 8*rcx + 32] // first vector from b
    addpd   xmm2, xmm0                           // sum first vectors
    movupd  xmm0, xmmword ptr [r14 + 8*rcx + 48] // second vector from b
    addpd   xmm0, xmm1                           // sum second vectors
    movupd  xmmword ptr [r12 + 8*rcx + 32], xmm2 // store first result
    movupd  xmmword ptr [r12 + 8*rcx + 48], xmm0 // store second result
    add     rcx, 4                               // update index
    cmp     rax, rcx                             // check for loop termination
    jne     0b

Well, the best thing is to write a scalar loop and let the optimizer do it. It does a pretty OK job (LBB1_10 in this example on Godbolt). I can write it better, but for simple tasks like this, the optimizer is more than good enough for 99.999% of use cases, and your own optimization efforts are better spent on harder problems.
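
For concreteness, that scalar loop is just something along these lines (a sketch, not necessarily the exact code behind the Godbolt link):

func add(_ a: [Double], _ b: [Double], into c: inout [Double]) {
    precondition(a.count == b.count && a.count == c.count)
    // With optimization on (-O), the compiler auto-vectorizes and unrolls this
    // into loops like the ones shown earlier in the thread.
    for i in 0 ..< a.count {
        c[i] = a[i] + b[i]
    }
}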

You can't. The loop is necessary. But compiling without optimization introduces a lot of overhead from the loop.

You can't. But compiling without optimization changes something that's free ("just a load") into something that has overhead. Always use the optimizer.

If you're really dedicated to writing this example using SIMD by hand, you might do something like this, which generates pretty decent assembly, comparable to what C intrinsics would get you (I wrote this in the browser window, so it's totally untested, and worth exactly what you paid for it). The inner loop it produces:

0:      movupd  xmm0, xmmword ptr [r12 + rcx + 32]  // first vector from one source array
        movupd  xmm1, xmmword ptr [r14 + rcx + 32]  // first vector from the other
        addpd   xmm1, xmm0                          // sum first vectors
        movupd  xmmword ptr [r15 + rcx + 32], xmm1  // store first result
        movupd  xmm0, xmmword ptr [r12 + rcx + 48]  // second vector from one source array
        movupd  xmm1, xmmword ptr [r14 + rcx + 48]  // second vector from the other
        addpd   xmm1, xmm0                          // sum second vectors
        movupd  xmmword ptr [r15 + rcx + 48], xmm1  // store second result
        add     rcx, 32                             // advance byte offset by 32 (four doubles)
        add     rdx, -2                             // count down two vectors
        jne     0b                                  // repeat while the count is nonzero
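
Spelled out, a hand-rolled version in that spirit might look roughly like this (also untested; the helper names are made up, the loadUnaligned/storeBytes approach needs Swift 5.7 or later, and this isn't necessarily the exact code behind the assembly above):

@_transparent
func vload(_ p: UnsafePointer<Double>, at i: Int) -> SIMD2<Double> {
    // Unaligned load of two adjacent doubles as a SIMD2<Double>.
    UnsafeRawPointer(p + i).loadUnaligned(as: SIMD2<Double>.self)
}

@_transparent
func vstore(_ v: SIMD2<Double>, to p: UnsafeMutablePointer<Double>, at i: Int) {
    // Unaligned store of a SIMD2<Double> into two adjacent doubles.
    UnsafeMutableRawPointer(p + i).storeBytes(of: v, as: SIMD2<Double>.self)
}

func addSIMD(_ a: [Double], _ b: [Double], into c: inout [Double]) {
    precondition(a.count == b.count && a.count == c.count && a.count % 4 == 0)
    a.withUnsafeBufferPointer { a in
        b.withUnsafeBufferPointer { b in
            c.withUnsafeMutableBufferPointer { c in
                var i = 0
                while i < a.count {
                    // Two SIMD2 adds per iteration, mirroring the 2x-unrolled loop above.
                    vstore(vload(a.baseAddress!, at: i) + vload(b.baseAddress!, at: i),
                           to: c.baseAddress!, at: i)
                    vstore(vload(a.baseAddress!, at: i + 2) + vload(b.baseAddress!, at: i + 2),
                           to: c.baseAddress!, at: i + 2)
                    i += 4
                }
            }
        }
    }
}

With optimization enabled, the little load/store helpers inline into the caller, so all that's left in the loop body is the loads, the adds, and the stores.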

Why do you use @_transparent instead of @inlinable in your example?

Because I grabbed those little helper functions from a project I had lying around, which I'd developed against an older version of the compiler where @_transparent was necessary to get the right codegen. It appears to no longer be needed, which is nice.
