Swift SIMD just seems to fall back to scalar operations

i’m looking at some Swift-generated assembly for code that uses SIMD operations, and a lot of it just seems badly broken? as in, the compiler seems to use the SIMD registers only to retrieve the arguments, and then just unpacks them into the normal registers and does all the operations byte-by-byte.

For example, this really trivial function

func add(a: SIMD16<UInt8>, b: SIMD16<UInt8>) -> SIMD16<UInt8>
{
    a &+ b
}

just turns into this:

// stack setup 
	pushq	%rbp
	movq	%rsp, %rbp
// save registers
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
// store arguments as local stack vars
	movaps	%xmm0, -64(%rbp)
	movaps	%xmm1, -80(%rbp)
// reload elements b[0] and b[1] ... why???
	movb	-80(%rbp), %al
	movb	-79(%rbp), %cl
// add element a[0] to b[0]
	addb	-64(%rbp), %al
// spill a[0] + b[0]
	movb	%al, -42(%rbp)
// add element a[1] to b[1] 
	addb	-63(%rbp), %cl
// spill a[1] + b[1] 
	movb	%cl, -41(%rbp)
// same as above, but for the other 14 elements
	movb	-78(%rbp), %r8b
	addb	-62(%rbp), %r8b
	movb	-77(%rbp), %r9b
	addb	-61(%rbp), %r9b
	movb	-76(%rbp), %r10b
	addb	-60(%rbp), %r10b
	movb	-75(%rbp), %r11b
	addb	-59(%rbp), %r11b
	movb	-74(%rbp), %r14b
	addb	-58(%rbp), %r14b
	movb	-73(%rbp), %r15b
	addb	-57(%rbp), %r15b
	movb	-72(%rbp), %r12b
	addb	-56(%rbp), %r12b
	movb	-71(%rbp), %r13b
	addb	-55(%rbp), %r13b
	movb	-70(%rbp), %sil
	addb	-54(%rbp), %sil
	movb	-69(%rbp), %cl
	addb	-53(%rbp), %cl
	movb	-68(%rbp), %dl
	addb	-52(%rbp), %dl
	movb	-67(%rbp), %bl
	addb	-51(%rbp), %bl
	movb	-66(%rbp), %al
	addb	-50(%rbp), %al
	movb	-65(%rbp), %dil
	addb	-49(%rbp), %dil
// move... each byte... back into the simd registers, 
// one by one for some reason
	movzbl	%dil, %edi
	movd	%edi, %xmm0
	movzbl	%al, %eax
	movd	%eax, %xmm1
	punpcklbw	%xmm0, %xmm1
	movzbl	%bl, %eax
	movd	%eax, %xmm0
	movzbl	%dl, %eax
	movd	%eax, %xmm2
	punpcklbw	%xmm0, %xmm2
	punpcklwd	%xmm1, %xmm2
	movzbl	%cl, %eax
	movd	%eax, %xmm0
	movzbl	%sil, %eax
	movd	%eax, %xmm3
	punpcklbw	%xmm0, %xmm3
	movzbl	%r13b, %eax
	movd	%eax, %xmm0
	movzbl	%r12b, %eax
	movd	%eax, %xmm1
	punpcklbw	%xmm0, %xmm1
	punpcklwd	%xmm3, %xmm1
	punpckldq	%xmm2, %xmm1
	movzbl	%r15b, %eax
	movd	%eax, %xmm0
	movzbl	%r14b, %eax
	movd	%eax, %xmm2
	punpcklbw	%xmm0, %xmm2
	movzbl	%r11b, %eax
	movd	%eax, %xmm0
	movzbl	%r10b, %eax
	movd	%eax, %xmm3
	punpcklbw	%xmm0, %xmm3
	punpcklwd	%xmm2, %xmm3
	movzbl	%r9b, %eax
	movd	%eax, %xmm0
	movzbl	%r8b, %eax
	movd	%eax, %xmm2
	punpcklbw	%xmm0, %xmm2
// reload spilled sums, and move them into the 
// simd registers, individually
	movzbl	-41(%rbp), %eax
	movd	%eax, %xmm4
	movzbl	-42(%rbp), %eax
	movd	%eax, %xmm0
// interleave four simd registers containing 4 elements 
// each into xmm0
	punpcklbw	%xmm4, %xmm0
	punpcklwd	%xmm2, %xmm0
	punpckldq	%xmm3, %xmm0
	punpcklqdq	%xmm1, %xmm0
// restore registers
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
// leave
	popq	%rbp
	retq

this ate up fourteen normal registers and four vector registers for no reason, and still ended up spilling onto the stack…

weirdly this doesn’t seem to be a problem with eight-byte SIMD operations (maybe because a SIMD8 fits into the normal rdi, rsi, etc. argument-passing registers?):

func add(a: SIMD8<UInt8>, b: SIMD8<UInt8>) -> SIMD8<UInt8>
{
    a &+ b
}
// stack setup
	pushq	%rbp
	movq	%rsp, %rbp
// load `a` into xmm0, as it should 
	movq	%rdi, %xmm0
// load `b` into xmm1, as it should 
	movq	%rsi, %xmm1
// xmm1 += xmm0
	paddb	%xmm0, %xmm1
// return xmm1
	movq	%xmm1, %rax
// leave
	popq	%rbp
	retq

This has nothing to do with argument passing; it's a result of LLVM heuristics for loop unrolling and vectorization. There have been some changes in the past year that resulted in some regressions like this. The long-term fix is to use "generic builtins" to lower directly to the vector nodes in LLVM, so that it's not sensitive to optimizer drift like this. As I said in a sibling post, you can work around it in the short term by using concrete intrinsics (because, again, the types are the most important thing).
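
For the 16-byte example above, the intrinsics route looks roughly like the sketch below on x86_64. Treat the _Builtin_intrinsics.intel import path and the imported shapes of __m128i and _mm_add_epi8 as assumptions to verify against your toolchain; this is only a sketch, not a drop-in replacement.

// sketch of the concrete-intrinsics workaround on x86_64; the import path and
// the imported signatures of __m128i / _mm_add_epi8 are assumptions to verify
import _Builtin_intrinsics.intel

func add(a: SIMD16<UInt8>, b: SIMD16<UInt8>) -> SIMD16<UInt8> {
    // reinterpret both vectors as __m128i, add all 16 lanes with a single SSE2
    // instruction, and reinterpret the result back (both types are 16 bytes,
    // so the bit casts are layout-compatible)
    let x = unsafeBitCast(a, to: __m128i.self)
    let y = unsafeBitCast(b, to: __m128i.self)
    return unsafeBitCast(_mm_add_epi8(x, y), to: SIMD16<UInt8>.self)
}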

I’m confused by this statement; aren’t the vector operators like SIMD16.&+ supposed to be the generic builtins?

They are generic functions in the standard library. They will eventually be implemented in terms of "generic builtins" in the SIL layer. Currently those don't exist, which is why they are implemented in terms of scalar code that the optimizer has to re-vectorize. Sometimes that fails, like in your example (it usually works OK for 4-element vectors of Float and the like, which is why we've been able to limp along with it for now).
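
Concretely, the current generic implementation boils down to an element-wise loop, roughly like the simplified sketch below (wrappingAdd is just an illustrative name, not the real stdlib symbol), and the optimizer then has to recognize that loop and turn it back into a single vector add:

// simplified sketch of the shape of today's generic implementation;
// wrappingAdd is an illustrative name, not the actual stdlib operator
func wrappingAdd<V: SIMD>(_ lhs: V, _ rhs: V) -> V where V.Scalar: FixedWidthInteger {
    var result = V()
    for i in result.indices {
        result[i] = lhs[i] &+ rhs[i]   // one scalar add per lane
    }
    return result
}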

There's a bunch of work to be done here. None of it is that difficult, but it touches LLVM, the Swift compiler, and the standard library, so it's fairly subtle, which is one of the reasons it's taken a back seat while the team works on more immediately pressing performance issues.

I'll post more explaining what has to happen to get the behavior we really want, as well as a short-term hack I have planned, sometime later today or tomorrow.

FWIW, LLVM’s autovectorizer seems to work much better with UnsafePointers than with the SIMD vector types in the stdlib. That might give some insight into why different assembly is being generated.

very interesting, though the pointer-based version has the downside of generating a lot of unnecessary memory-region overlap checks.

From what I know, the vectorization passes can omit the overlap checks if they can prove that the regions will never overlap. Though, I haven’t played around with it enough to see how to force the optimizer to do so.

I got the optimizer to emit cleaner assembly with this:

func &+<V: SIMD>(_ a: V, _ b: V) -> V where V.Scalar: FixedWidthInteger {
    var result = V()
    withUnsafeMutableBytes(of: &result) { (result: UnsafeMutableRawBufferPointer) in
        withUnsafeBytes(of: a) { (a: UnsafeRawBufferPointer) in
            withUnsafeBytes(of: b) { (b: UnsafeRawBufferPointer) in
                // view each vector's raw bytes as a buffer of scalars
                let result = result.baseAddress!.assumingMemoryBound(to: V.Scalar.self)
                let a = a.baseAddress!.assumingMemoryBound(to: V.Scalar.self)
                let b = b.baseAddress!.assumingMemoryBound(to: V.Scalar.self)
                // plain element-wise loop over the lanes
                for i in 0..<V.scalarCount {
                    result[i] = a[i] &+ b[i]
                }
            }
        }
    }
    return result
}

that…is very verbose, but at least we have a workaround.

this is semantically equivalent to using the SIMD type as a fixed-size array, so i wonder if that might be a more natural vectorization model than what we have right now?

Yep, SIMD vectors are essentially fixed-size (packed) arrays of integer or floating-point values.

Though in Swift, a type's stride isn't always the same as its size... Depending on the architecture, the generated assembly may not vectorize correctly (or even at all), and in that case I'm not sure how the above code would behave. It should be fine on x86_64 and ARM, as far as I can tell.

Stride is size rounded up to alignment, so that shouldn't be a problem for any of the SIMD element types.
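
For what it's worth, you can check that directly; the expected values on x86_64/arm64 are in the comments:

// size, alignment, and stride for a SIMD scalar type and a SIMD vector type
print(MemoryLayout<UInt8>.size, MemoryLayout<UInt8>.alignment, MemoryLayout<UInt8>.stride)                          // 1 1 1
print(MemoryLayout<Float>.size, MemoryLayout<Float>.alignment, MemoryLayout<Float>.stride)                          // 4 4 4
print(MemoryLayout<SIMD16<UInt8>>.size, MemoryLayout<SIMD16<UInt8>>.alignment, MemoryLayout<SIMD16<UInt8>>.stride)  // 16 16 16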
