Simd vs plain Swift (simd is slower?)

JetForMe · April 28, 2021, 4:03am

While asking about inlining in another post, I decided to add some extensions to SCNQuaternion. Fortunately, GLKQuaternion implements them in the header in C, so I could look at their implementations to copy.

Here’s one for adding quaternions:

GLK_INLINE GLKQuaternion GLKQuaternionAdd(GLKQuaternion quaternionLeft, GLKQuaternion quaternionRight)
{
#if   defined(GLK_SSE3_INTRINSICS)
    __m128 v = _mm_load_ps(&quaternionLeft.q[0]) + _mm_load_ps(&quaternionRight.q[0]);
    return *(GLKQuaternion *)&v;
#else
    GLKQuaternion q = {{ quaternionLeft.q[0] + quaternionRight.q[0],
                         quaternionLeft.q[1] + quaternionRight.q[1],
                         quaternionLeft.q[2] + quaternionRight.q[2],
                         quaternionLeft.q[3] + quaternionRight.q[3] }};
    return q;
#endif
}

I was ready to try to massage the pointers to make use of SSE, but wanted to also play nice on Apple silicon, and remembered simd exists. SCNQuaternion is a SCNVector4 and there's already support for simd_float4 with that, so I gave that a try, and compared it to a simple Swift implementation to add the components. To my surprise, at least in the Playground, the simple Swift code seems substantially faster. I can't figure out if Xcode is optimizing Playground code or not (the code is in a separate file in the Sources dir, and the test methods are called from the Playground proper). I wonder if it's just failing to inline some calls, or what. Here's my code:

import SceneKit

import Foundation
import simd


public
func
simdAdd(_ inLHS: SCNQuaternion, _ inRHS: SCNQuaternion)
    -> SCNQuaternion
{
    let l = simd_float4(inLHS)
    let r = simd_float4(inRHS)
    let s = l + r
    return SCNQuaternion(s)
}

public
func
swiftAdd(_ inLHS: SCNQuaternion, _ inRHS: SCNQuaternion)
    -> SCNQuaternion
{
    return SCNQuaternion(inLHS.x + inRHS.x, inLHS.y + inRHS.y, inLHS.z + inRHS.z, inLHS.w + inRHS.w)
}

public
func
testSIMD()
{
    let a = SCNQuaternion(1, 2, 3, 1)
    var c = SCNQuaternion()

    let start:Double  = CFAbsoluteTimeGetCurrent()
    for _ in 0 ..< 10000
    {
        c = simdAdd(c, a)
    }
    let end: Double = CFAbsoluteTimeGetCurrent()

    print("SIMD took:  \(end - start) s")
}

public
func
testSwift()
{
    let a = SCNQuaternion(1, 2, 3, 1)
    var c = SCNQuaternion()

    let start:Double  = CFAbsoluteTimeGetCurrent()
    for _ in 0 ..< 10000
    {
        c = swiftAdd(c, a)
    }
    let end: Double = CFAbsoluteTimeGetCurrent()

    print("Swift took: \(end - start) s")
}

Running the playground produces this output:

SIMD took:  0.005373954772949219 s
Swift took: 0.0032699108123779297 s

I can't get Xcode to show me the assembly to see what it's doing. But I'm surprised at the result. Any ideas?

Jon_Shier · April 28, 2021, 4:24am

First, don’t do performance testing in playgrounds. Use a real project with optimizations turned on. Second, you can use godbolt.org to see disassembly.
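
As a rough illustration of the kind of harness that is safer to trust, here is a minimal sketch meant to be compiled with swiftc -O and run from the command line. The measure helper and the iteration count are made up for this example; it uses the standard library's SIMD4<Double> rather than any SceneKit type. The accumulated value is printed at the end so the optimizer cannot simply discard the loop.

```swift
import Foundation

// Hypothetical helper (not from the thread): runs `body` `iterations` times
// and returns the elapsed wall-clock time in seconds.
func measure(iterations: Int, _ body: (Int) -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    for i in 0 ..< iterations {
        body(i)
    }
    let end = DispatchTime.now().uptimeNanoseconds
    return Double(end - start) / 1e9
}

var acc = SIMD4<Double>(repeating: 0)
let step = SIMD4<Double>(1, 2, 3, 1)

let seconds = measure(iterations: 1_000_000) { _ in
    acc += step
}

// Using the result after the loop keeps the work observable.
print("took \(seconds) s, result \(acc)")
```

A single run like this is still noisy, so repeating the measurement a few times and comparing the fastest runs is usually more informative than one number.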

JetForMe · April 28, 2021, 7:12am

Unfortunately godbolt.org doesn't know about things like SceneKit, but I did it on the command line (code and assembly output available. Source is at the bottom of the page):

$ swift test.swift 
SIMD took:  0.00363600 s
Swift took: 0.00278902 s
$ swift -O test.swift
SIMD took:  0.00006199 s
Swift took: 0.00001395 s

The following commands got me disassembly in BBEdit:

$ swiftc -emit-assembly  test.swift | edit
$ swiftc -O -emit-assembly  test.swift | edit

To my dismay, none of the init() calls are inlined in either optimized or unoptimized build, so in the optimized build, simdAdd() makes three calls, whereas swiftAdd() makes one. In the unoptimized build, I have no idea what swiftAdd() is doing. There is a lot more code generated with calls to generic specializations and assertions and who knows what else.

So how can one make this code faster? The most obvious is to use simd_quatf, but as a general rule, don’t we want Swift to be inherently better at this sort of thing?

It's fascinating to see the unoptimized loop explode. I can't even tell where the loop is. Even the optimized loop

LucianoPAlmeida · April 28, 2021, 11:11am

One aspect of this code that can make it slower is that there are (possibly) vector allocations/copies there...
l + r would create a new vector and copy it to s? So what happens with the benchmarks if you do it like this:

    var l = simd_float4(inLHS)
    let r = simd_float4(inRHS)
    l += r
    return SCNQuaternion(l)

Also, it may be worth marking simdAdd and swiftAdd with @inline to see what kind of help the optimizer can give in that case.
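
Put together, the two suggestions might look like the sketch below. The name simdAddInPlace is made up for this example, and @inline(__always) is one possible spelling of the attribute; whether either change helps would have to be measured in an optimized build.

```swift
import SceneKit
import simd

// Hypothetical variant of simdAdd from the first post.
// @inline(__always) asks the compiler to inline this function at call sites.
@inline(__always)
public func simdAddInPlace(_ inLHS: SCNQuaternion, _ inRHS: SCNQuaternion) -> SCNQuaternion {
    var l = simd_float4(inLHS)   // `var`, because += mutates l
    l += simd_float4(inRHS)
    return SCNQuaternion(l)
}

let sum = simdAddInPlace(SCNQuaternion(1, 2, 3, 1), SCNQuaternion(1, 1, 1, 1))
print(sum)
```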

scanon · April 28, 2021, 1:41pm

I'm confused, why not use the perfectly good existing GLK implementation? It's already written for you, and the compiler can vectorize the generic fallback path just fine. Or use simd_add on simd_quatfs, which is also already written for you, and explicitly vectorized. There's no really good reason to rewrite all this stuff that already exists in the SDK.

Syre · April 28, 2021, 8:17pm

I completely agree with Steve here, and just to add on since it looks like you are using SceneKit, you can simply get the simd_quatf of a node's orientation from the simdOrientation property.

There should be no need to deal with SCNQuaternion at all.
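
For instance, a rotation can be composed with a node's orientation without touching SCNQuaternion. This is only a sketch: the node and the quarter-turn values are arbitrary examples.

```swift
import SceneKit
import simd

let node = SCNNode()

// A quarter turn about the y axis (arbitrary example values).
let quarterTurn = simd_quatf(angle: .pi / 2, axis: simd_float3(0, 1, 0))

// Read and write the orientation as a simd_quatf, staying in simd types throughout.
node.simdOrientation = quarterTurn * node.simdOrientation
print(node.simdOrientation)
```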

JetForMe · April 28, 2021, 8:25pm

At the time I wrote the original post I didn't realize simd_quatf existed, but if you look at my reply above, I state:

So, yes, I'm now using simd_quatf, because SceneKit has simd versions of many of its properties and methods.

scanon · April 28, 2021, 10:08pm

simd_quatf is imported from the simd module as a Swift type, and used just like any other; it is Swift. It doesn't make sense to say "don't we want Swift to be faster than simd_quatf?".

We would like to make it possible for users to write their own types that are as performant as (or more than) simd_quatf more easily, and that's an ongoing project.

taylorswift · April 28, 2021, 11:26pm

possibly related: Swift SIMD just seems to fallback to scalar operations

JetForMe · April 29, 2021, 12:46am

I didn't say that. And to clarify what I did say, I meant, don't we want Swift to be faster at using something like simd_quatf?

This code

	let l = simd_float4(inLHS)
	let r = simd_float4(inRHS)
	let s = l + r
	return SCNQuaternion(s)

with optimizations, generated

test.simdAdd(__C.SCNVector4, __C.SCNVector4) -> __C.SCNVector4:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	subq	$48, %rsp
	movsd	%xmm7, -32(%rbp)
	movsd	%xmm6, -24(%rbp)
	movsd	%xmm5, -16(%rbp)
	movsd	%xmm4, -8(%rbp)
	callq	(extension in SceneKit):Swift.SIMD4< where A == Swift.Float>.init(__C.SCNVector4) -> Swift.SIMD4<Swift.Float>
	movaps	%xmm0, -48(%rbp)
	movsd	-8(%rbp), %xmm0
	movsd	-16(%rbp), %xmm1
	movsd	-24(%rbp), %xmm2
	movsd	-32(%rbp), %xmm3
	callq	(extension in SceneKit):Swift.SIMD4< where A == Swift.Float>.init(__C.SCNVector4) -> Swift.SIMD4<Swift.Float>
	addps	-48(%rbp), %xmm0
	addq	$48, %rsp
	popq	%rbp
	jmp	(extension in SceneKit):__C.SCNVector4.init(Swift.SIMD4<Swift.Float>) -> __C.SCNVector4
	.cfi_endproc

My knowledge of the ABI is very limited, but I see it:

Moving four doubles from the stack to registers
Calling SIMD4<Float>.init(), which returns the vector on the stack
Repeating the call for the second SIMD4
Doing the vector add (this much seems like it inlined as you would expect), and storing the result on the stack
Jumping to the SCNQuaternion constructor (presumably such that its return will return to the caller).

Maybe the way I invoked swiftc doesn't allow for cross-module optimization (specifically, inlining the SIMD4<Float>.init() call). And to be fair, as I look at it now it seems the call within my own code to simdAdd() was in fact inlined.

But compare that to the generated code for the pure-Swift scalar code:

test.swiftAdd(__C.SCNVector4, __C.SCNVector4) -> __C.SCNVector4:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	addsd	%xmm4, %xmm0
	addsd	%xmm5, %xmm1
	addsd	%xmm6, %xmm2
	addsd	%xmm7, %xmm3
	popq	%rbp
	jmp	(extension in SceneKit):__C.SCNVector4.init(CoreGraphics.CGFloat, CoreGraphics.CGFloat, CoreGraphics.CGFloat, CoreGraphics.CGFloat) -> __C.SCNVector4
	.cfi_endproc

It’s pretty clear why it’s the winner.

Also, I wouldn't say simd_quatf is Swift. It's C, called from Swift. In fact, the constructor is defined in the header like this.

static inline SIMD_CFUNC simd_quatf simd_quaternion(float ix, float iy, float iz, float r) {
  return (simd_quatf){ { ix, iy, iz, r } };
}

Something in the journey from C header to my Swift file is creating a function to call.

scanon · April 29, 2021, 1:37am

I know, I wrote that header (and the rest of the simd headers). Swift imports C headers natively, they’re as much a supported part of the language as anything else is.

JetForMe · April 29, 2021, 1:38am

So why is it generating function calls?

scanon · April 29, 2021, 1:49am

Because the following initializer defined in the SceneKit overlay, which you are using, is not marked inlinable:

extension SIMD4 where Scalar == Float {
  public init(_ v: SCNVector4) {
    self.init(Float(v.x), Float(v.y), Float(v.z), Float(v.w))
  }
}

That's not a Swift performance bug, it's a library performance bug. The compiler is required by the overlay to generate a call.
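
To illustrate what "marked inlinable" means here, the sketch below uses a made-up Vec4 struct as a stand-in for an SDK type; it is not the actual overlay source. With @inlinable on a public initializer, the body is exposed to client modules, so the optimizer is allowed to inline it rather than emit a call into the library.

```swift
// Hypothetical stand-in for a C struct such as SCNVector4; not a real SDK type.
public struct Vec4 {
    public var x, y, z, w: Double

    public init(x: Double, y: Double, z: Double, w: Double) {
        self.x = x
        self.y = y
        self.z = z
        self.w = w
    }
}

extension SIMD4 where Scalar == Float {
    // @inlinable makes the body part of the module's public interface,
    // which permits (but does not force) inlining in client code.
    @inlinable
    public init(_ v: Vec4) {
        self.init(Float(v.x), Float(v.y), Float(v.z), Float(v.w))
    }
}

let v = SIMD4<Float>(Vec4(x: 1, y: 2, z: 3, w: 4))
print(v)
```

The trade-off for a library author is that an @inlinable body becomes part of what clients compile against, so it cannot be changed as freely later.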

Note that you're also not doing an apples-to-apples comparison (though this is a relatively minor detail); on the platform you're targeting, SCNVector4 is a vector of four doubles, so converting to simd_float4 and back requires an actual conversion operation, while your scalar code stays in double the whole time.
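
A tiny standalone sketch of why that conversion is a real operation and not just a relabeling: a Double that passes through Float generally does not come back unchanged. The value 0.1 is an arbitrary example.

```swift
let original: Double = 0.1

// Narrow to Float and widen back, as a Double -> simd_float4 -> Double round trip would.
let roundTripped = Double(Float(original))

print(original == roundTripped)  // prints "false": the Float step lost precision
```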

Note also that because of ABI considerations, vectorization isn't actually profitable for a stand-alone add function on SceneKit quaternions, because they're defined as a struct of four CGFloat, which are passed in xmm0, xmm1, xmm2, xmm3, etc. Assembling them into a contiguous register in order to do SIMD arithmetic is less efficient than just adding them as scalars. Nonetheless, simply making the definitions visible to the compiler produces essentially optimal code with this limitation in mind:

import SceneKit

extension simd_double4 {
  init(_ other: SCNVector4) {
    self = simd_double4(Double(other.x), Double(other.y), Double(other.z), Double(other.w))
  }

  var scnv4: SCNVector4 {
    SCNVector4(x: CGFloat(self.x), y: CGFloat(self.y), z: CGFloat(self.z), w: CGFloat(self.w))
  }
}

func add(a: SCNVector4, b: SCNVector4) -> SCNVector4 {
  (simd_double4(a) + simd_double4(b)).scnv4
}

_$s3addAA1a1bSo10SCNVector4VAE_AEtF: // add(a: SCNVector4, b: SCNVector4) -> SCNVector4
0000000100003f90	pushq	%rbp
0000000100003f91	movq	%rsp, %rbp
0000000100003f94	addsd	%xmm4, %xmm0
0000000100003f98	addsd	%xmm5, %xmm1
0000000100003f9c	addsd	%xmm6, %xmm2
0000000100003fa0	addsd	%xmm7, %xmm3
0000000100003fa4	popq	%rbp
0000000100003fa5	retq

If we directly use simd_quatf or simd_quatd instead, which are passed contiguously in SIMD registers, we get something nicer:

_$s3addAA1a1bSo10simd_quatdaAE_AEtF: // add(a: simd_quatd, b: simd_quatd) -> simd_quatd
0000000100003f90	pushq	%rbp
0000000100003f91	movq	%rsp, %rbp
0000000100003f94	addpd	%xmm2, %xmm0
0000000100003f98	addpd	%xmm3, %xmm1
0000000100003f9c	popq	%rbp
0000000100003f9d	retq
_$s3addAA1a1bSo10simd_quatfaAE_AEtF: // add(a: simd_quatf, b: simd_quatf) -> simd_quatf
0000000100003fa0	pushq	%rbp
0000000100003fa1	movq	%rsp, %rbp
0000000100003fa4	addps	%xmm1, %xmm0
0000000100003fa7	popq	%rbp
0000000100003fa8	retq
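
The source for those two functions isn't shown above. Judging by the symbol names, it was presumably something along these lines (a guess, using the + operator that the simd module provides for quaternions):

```swift
import simd

// Guessed from the symbol names in the listing above; the actual source may differ.
func add(a: simd_quatd, b: simd_quatd) -> simd_quatd {
    a + b
}

func add(a: simd_quatf, b: simd_quatf) -> simd_quatf {
    a + b
}

let q = add(a: simd_quatf(ix: 1, iy: 2, iz: 3, r: 1),
            b: simd_quatf(ix: 1, iy: 1, iz: 1, r: 1))
print(q.vector)
```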

Basically, everything you are seeing is a necessary result of how the types and operations are defined and exposed in SceneKit, rather than Swift compiler limitations. There are some very real compiler limitations around SIMD performance still, but these are not they.

JetForMe · April 29, 2021, 2:13am

That's an unfortunate requirement. If I put an import simd at the top will that fix that?

BTW, my code now looks like this (all simd_quatf):

let wz = node.simdConvertVector(simd_float3(x: 0, y: 0, z: 1), to: self.layer.scene!.rootNode)
let wy = node.simdConvertVector(simd_float3(x: 0, y: 1, z: 0), to: self.layer.scene!.rootNode)
let wx = node.simdConvertVector(simd_float3(x: 1, y: 0, z: 0), to: self.layer.scene!.rootNode)
let zq = simd_quatf(angle: Float(-self.multiAxisState.roll) * 0.001, axis: wz)
let yq = simd_quatf(angle: Float(self.multiAxisState.yaw) * 0.001, axis: wy)
let xq = simd_quatf(angle: Float(-self.multiAxisState.pitch) * 0.001, axis: wx)
let qq = xq * yq * zq
node.simdRotate(by: qq, aroundTarget: .zero)

I'm quite happy to forego SCNQuaternion, now that I know simd_quat exists, and that it seems to have all the convenience methods of GLKQuaternion.

The double-vs-float helps me understand part of what I see in the code, thanks. And thanks for the rest of the explanation!

All of this started because I didn't want to deal with GLKQuaternion and SCNQuaternion conversions. I’ve since learned simd_ is where it’s at.

scanon · April 29, 2021, 2:19am

Sadly, no. You can define your own internal conversions like my small example does, but the best bet is probably to simply stay in simd_quatf land as much as possible.
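
A sketch of what such internal conversions could look like for quaternions, so that SCNQuaternion only appears at the edges of the code. The scn: label and the scnQuaternion property are made-up names, and the CGFloat conversions assume macOS, where SCNQuaternion components are CGFloat.

```swift
import SceneKit
import simd

// Assumes macOS, where SCNQuaternion (an SCNVector4) has CGFloat components.
extension simd_quatf {
    // Hypothetical labeled initializer, defined in our own module so the
    // compiler can see (and inline) the body.
    init(scn q: SCNQuaternion) {
        self.init(ix: Float(q.x), iy: Float(q.y), iz: Float(q.z), r: Float(q.w))
    }

    // Hypothetical convenience for converting back when an API needs SCNQuaternion.
    var scnQuaternion: SCNQuaternion {
        SCNQuaternion(x: CGFloat(imag.x), y: CGFloat(imag.y), z: CGFloat(imag.z), w: CGFloat(real))
    }
}

let q = simd_quatf(scn: SCNQuaternion(x: 0, y: 0, z: 0, w: 1))
let back = q.scnQuaternion
print(back)
```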

JetForMe · April 29, 2021, 3:02am

Sorry, I think I'm misunderstanding something. If I write

import SceneKit
import simd
let q = simd_quatf(ix: x, iy: y, iz: z, r: w)

Is that forced to go through a SceneKit overlay wrapper call to construct q?

scanon · April 29, 2021, 1:59pm

No, that does not go through the SceneKit overlay; the init taking a SCNVector4 does.

JetForMe · April 30, 2021, 4:15am

Oh gosh. I see what I'm doing wrong. Totally lost sight of what I had been trying to do at the start of this thread to what I'm doing now, and I couldn't understand what the compiler was generating.

I tried this:

public
func
addQuats(_ inLHS: simd_quatf, _ inRHS: simd_quatf)
	-> simd_quatf
{
	return inLHS + inRHS
}

public
func
testAddQuats()
{
	let a = simd_quatf(ix: 1.0, iy: 2.0, iz: 3.0, r: 1.0)
	var b = simd_quatf(ix: -2.0, iy: -1.0, iz: -5.0, r: -0.5)
	var c = simd_quatf()

	let start:Double  = CFAbsoluteTimeGetCurrent()
	for _ in 0 ..< 10000
	{
		c = addQuats(b, a)
		b = addQuats(c, b)
	}
	let end: Double = CFAbsoluteTimeGetCurrent()
	
	print("addQuats took:  \(String(format: "%10.8f", end - start)) s")
}

and it generated

	.globl	test.addQuats(__C.simd_quatf, __C.simd_quatf) -> __C.simd_quatf
	.p2align	4, 0x90
test.addQuats(__C.simd_quatf, __C.simd_quatf) -> __C.simd_quatf:
	pushq	%rbp
	movq	%rsp, %rbp
	addps	%xmm1, %xmm0
	popq	%rbp
	retq

And a whoooole mess o' code for testAddQuats() (which I assume must be loop unrolling or something), but notably no calls to addQuats(), so it's clearly inlining that. Oddly it also inlines it when compiling without -O. I popped a @inline(never) on it to see what it would do.

In any case, I’m quite happy moving forward with simd_quatf and friends throughout. They seem to provide everything I need.

I hesitate to add this commentary: Apple’s docs on all of this could be a lot better. There's virtually no overarching documentation to say “Hey, here are four ways of doing stuff in Apple OSes, moving forward we recommend this one where possible.”