Will using SIMD2<Float> always be more efficient than (Float, Float)?

Now that we have the SIMD types, it might be tempting to rewrite eg:

struct Disk {
    var currentPosition: (x: Float, y: Float)
    var previousPosition: (x: Float, y: Float)
    var acceleration: (x: Float, y: Float)
    var radius: Float
    var miscBits: UInt32
}

as

struct Disk {
    var currentPosition: SIMD2<Float>
    var previousPosition: SIMD2<Float>
    var acceleration: SIMD2<Float>
    var radius: Float
    var miscBits: UInt32
}

There are obvious benefits to using SIMD instead of tuples (lots of ready-made methods and supposedly much better performance).

However, I'd like to know whether there are scenarios or ways of using Disk where the SIMD-based version will be slower than the tuple-based one. That is, is there any potential overhead related to (mis)using SIMD, and if so, how do we avoid it? Or can we be sure that the SIMD-based Disk will never be slower than the tuple-based one?

(Imagine lots and lots of Disks being processed in many different ways.)

FWIW, I tried replacing some simple vector types (essentially structs of four Floats) with SIMD types and actually saw slightly worse performance in the cursory testing I was able to do. I hope someone with deeper knowledge will correct me if I'm wrong, but it seemed as if the compiler was pretty good at vectorizing whether or not SIMD was used explicitly.

I would love to know what accounted for the small decrease in performance when using SIMD explicitly.

If you post some sample code I'm sure people will look at it.

With the current (compiler and standard library) implementation it's definitely possible to get worse performance this way, because the compiler may not manage to fully specialize everything, and then you're off into generic protocol witness land. This is rare in release builds, unless you're building a nest of types that are generic over an underlying scalar type, but it can happen fairly easily in debug builds, if those are a concern for you.
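To make that failure mode concrete, here's a hypothetical sketch of the "generic over an underlying scalar type" pattern; the `Particle` type and all its names are invented for illustration. In a release build this usually specializes fine, but if specialization fails (eg in a debug build), the `+` and `*` below go through protocol witnesses instead of direct vector instructions:

```swift
// Made-up example of a type generic over a SIMD vector. If the
// compiler can't fully specialize Particle<V>, the SIMD operators
// are dispatched through protocol witnesses rather than inlined.
struct Particle<V: SIMD> where V.Scalar: FloatingPoint {
    var position: V
    var velocity: V

    mutating func step(dt: V.Scalar) {
        position += velocity * dt  // generic SIMD operators
    }
}

var p = Particle(position: SIMD2<Float>(0, 0),
                 velocity: SIMD2<Float>(2, 4))
p.step(dt: 0.5)
// p.position is now (1.0, 2.0)
```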

Right. The biggest win for using the SIMD types today is that you can treat things that are conceptually a single point / vector / whatever as a single thing, instead of always referring to them by their components. They make your code easier to reason about.

The second win is that they bridge to clang ExtVectors, which is an important feature for interacting with some libraries written in C-family languages, as well as platform-specific intrinsic headers.

As the compiler and standard library get smarter about handling them, they'll also start to provide some additional performance wins that are extremely difficult to get at using tuples.
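As a small illustration of the first point (variable names invented for the example): each line below would be two per-component statements with `(Float, Float)` tuples, but is a single whole-vector expression with `SIMD2<Float>`:

```swift
// Whole-vector style: conceptually-single values get single expressions.
let current = SIMD2<Float>(1, 2)
let previous = SIMD2<Float>(0, 1)

let velocity = current - previous               // (1, 1), component-wise
let next = current + velocity * 0.5             // (1.5, 2.5)
let speedSquared = (velocity * velocity).sum()  // 2.0
```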

So how would this be different from declaring a struct with multiple components? Ie:

struct Vect {
    var x: Float
    var y: Float
    var z: Float
    var w: Float
}

let v = Vect(...)

First: you don't have to declare your own struct for each type you work with, and don't have to implement all of the operations yourself.

I could also declare my own Int64:

struct Int64 {
  var low: UInt32
  var high: Int32
  // operations
}

but it's quite a bit nicer to have the standard library provide it for me (and, that way my Int64 works with your Int64).

Second: your struct wouldn't bridge to the clang vector types, so you wouldn't get compatibility with C-family vector libraries. (Just like my Int64 above wouldn't bridge to C's int64_t type, but the standard library's Int64 does.)
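For instance, on Apple platforms the `simd` module's C-compatible types are the very same types as the standard library's vectors, so values cross the language boundary without any conversion. A minimal sketch (Apple platforms only; elsewhere the `simd` module isn't available):

```swift
// Apple platforms: simd_float4 is the same type as SIMD4<Float>,
// so stdlib vectors pass directly to C-family vector libraries.
import simd

let v: simd_float4 = SIMD4<Float>(1, 2, 3, 4)
let len = simd_length(v)  // calls straight into the C simd library
// len == sqrt(1 + 4 + 9 + 16) == sqrt(30)
```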

I've got a few specific instances where SIMD performance vs Array is unexpected; I'd love an explanation. Here's one comparing the time to accumulate at every index of an 8-element array and SIMD vector, respectively:
TEST 1

import Foundation
import Dispatch

let iterations = 10000000
let elements: Int = 8
var array = Array<Float>(repeating: 5.0, count: elements)    
var simd = SIMD8<Float>(repeating: 5.0)

//--------------------------------------------------- 
//Array
var arrayTime = 0.0
let startArray = DispatchTime.now().uptimeNanoseconds
for i in 0..<iterations{
    for j in 0..<elements{
        array[j] += 7.0
    }
}
let endArray = DispatchTime.now().uptimeNanoseconds
arrayTime += Double(endArray - startArray) / Double(1e7)
print(("Array time", arrayTime))

//--------------------------------------------------- 
//SIMD
let startSimd = DispatchTime.now().uptimeNanoseconds
var simdTime = 0.0
for i in 0..<iterations{
    for j in 0..<elements{
        simd[j] += 7.0
    }
}
let endSimd = DispatchTime.now().uptimeNanoseconds
simdTime += Double(endSimd - startSimd) / Double(1e7)
print(("simd  time", simdTime))

print("ratio array / simd", arrayTime / simdTime)

on my machine, this produces

("Array time", 21.2826863)
("simd  time", 1.6308109)
ratio array / simd 13.050370401620446

So SIMD has exciting gains over array in this case, I'm guessing because some auto-vectorization is going on. (Is that true?)

TEST 2
Now an unexpected deviation:
If I replace the j inner loop:

 for j in 0..<elements{
     array[j] += 7.0
 }

with

array[i % 8] += 7.0

(now instead of updating each element on every iteration, you're just updating one),
and the same for the SIMD loop,
it produces

("Array time", 0.5291106)
("simd  time", 7.7913118)
ratio array / simd 0.06791033571522577

SIMD is losing by quite a margin now. I guess the i % 8 is unexpected enough to stop the auto-vectorization? (again, confirmation appreciated)
Even if that's the case, why isn't it at least the same speed as the regular array?

Not sure if @scanon's comment about "generic protocol witness land" applies here; it seems like it doesn't.

I am building Release.
By the way it's so cool that SIMDs have such an easy interface now, thanks for the great work!

When you do these kinds of tests, it's important to remember that there's a big difference between having stuff defined in global vs local scope. You should probably put everything in a function unless your test is meant to measure exactly the specific case of a stored variable defined in global scope.

Also, you should always run measurements like this several times in a loop, to see if / how much the results differ between successive runs; the first run is often a lot different from the following ones, for example.

And, you have to make sure (as we'll see) that the tested code isn't optimized away entirely by dead code elimination (unless that's specifically what you want to test).
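A minimal harness sketch combining those three points (local scope, repeated runs, and a result that escapes so the loop can't be dead-code-eliminated). The names `blackHole` and `measure` are illustrative, not a standard API:

```swift
import Dispatch

// Opaque sink: keeps a value observable so the optimizer can't
// delete the computation that produced it.
@inline(never)
func blackHole<T>(_ value: T) { withExtendedLifetime(value) {} }

// Runs `body` several times inside function (not global) scope and
// prints each run's duration.
func measure(_ label: String, runs: Int = 3, _ body: () -> Float) {
    for _ in 0 ..< runs {
        let t0 = DispatchTime.now().uptimeNanoseconds
        let result = body()
        let t1 = DispatchTime.now().uptimeNanoseconds
        blackHole(result)  // keep the result live past the timer
        print(label, Double(t1 - t0) / 1e6, "ms")
    }
}

measure("simd[j]") {
    var v = SIMD8<Float>(repeating: 5.0)
    for _ in 0 ..< 1_000_000 {
        for j in 0 ..< 8 { v[j] += 7.0 }
    }
    return v.max()  // depends on every element
}
```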


Here's what I see when I run your unmodified program (as a command-line tool project) on my late 2013 MBP, using the default toolchain of Xcode 11.3.1 (11C504), release build from within Xcode:

("Array time", 2.8470284)
("simd  time", 1.0790284)
ratio array / simd 2.6385110901622237

and with array[i % 8] and simd[i % 8]:

("Array time", 8.3321858)
("simd  time", 45.5596783)
ratio array / simd 0.1828850885455001

So for some reason both got slower, and the simd version got a lot slower.


Ok, now let's put everything in a function test(), and call that function a couple of times from another function.

Here's your program with that modification.
import Foundation
import Dispatch

func test() {
    let iterations = 10000000
    let elements: Int = 8
    var array = Array<Float>(repeating: 5.0, count: elements)
    var simd = SIMD8<Float>(repeating: 5.0)

    //---------------------------------------------------
    //Array
    var arrayTime = 0.0
    let startArray = DispatchTime.now().uptimeNanoseconds
    for i in 0..<iterations{
        for j in 0..<elements{
            array[j] += 7.0
        }
    }
    let endArray = DispatchTime.now().uptimeNanoseconds
    arrayTime += Double(endArray - startArray) / Double(1e7)
    print(("Array time", arrayTime))

    //---------------------------------------------------
    //SIMD
    let startSimd = DispatchTime.now().uptimeNanoseconds

    var simdTime = 0.0
    for i in 0..<iterations{
        for j in 0..<elements{
            simd[j] += 7.0
        }
    }
    let endSimd = DispatchTime.now().uptimeNanoseconds
    simdTime += Double(endSimd - startSimd) / Double(1e7)
    print(("simd  time", simdTime))

    print("ratio array / simd", arrayTime / simdTime)
}
func runTestMultipleTimes() {
    for _ in 0 ..< 3 { test() }
}
runTestMultipleTimes()

Now, the array/simd[j]-variant will look like this:

("Array time", 1.2672577)
("simd  time", 4.41e-05)
ratio array / simd 28736.002267573695
("Array time", 1.0452441)
("simd  time", 1e-05)
ratio array / simd 104524.40999999997
("Array time", 1.0183805)
("simd  time", 9.2e-06)
ratio array / simd 110693.53260869565

And the array/simd[i % 8]-variant will look like this:

("Array time", 2.5467075)
("simd  time", 0.0189487)
ratio array / simd 134.40011715843303
("Array time", 2.7012884)
("simd  time", 0.0193017)
ratio array / simd 139.95080226094075
("Array time", 2.5367798)
("simd  time", 0.0172931)
ratio array / simd 146.69317820402358

It looks like the loop for the simd[j] case might have been optimized away entirely (I haven't checked with eg godbolt, just guessing). So let's prevent that optimization from being possible by making sure simd (and array) are used in some way after that loop, eg by printing the exponent of the max element (just something that must process all elements and won't print too many irrelevant digits):

print(("Array time", arrayTime), array.max()!.exponent)
...
print(("simd  time", simdTime), simd.max().exponent)
Here's the entire program with this modification.
import Foundation
import Dispatch

func test() {
    let iterations = 10000000
    let elements: Int = 8
    var array = Array<Float>(repeating: 5.0, count: elements)
    var simd = SIMD8<Float>(repeating: 5.0)

    //---------------------------------------------------
    //Array
    var arrayTime = 0.0
    let startArray = DispatchTime.now().uptimeNanoseconds
    for i in 0..<iterations{
        for j in 0..<elements{
            array[j] += 7.0
        }
    }
    let endArray = DispatchTime.now().uptimeNanoseconds
    arrayTime += Double(endArray - startArray) / Double(1e7)
    print(("Array time", arrayTime), array.max()!.exponent)

    //---------------------------------------------------
    //SIMD
    let startSimd = DispatchTime.now().uptimeNanoseconds

    var simdTime = 0.0
    for i in 0..<iterations{
        for j in 0..<elements{
            simd[j] += 7.0
        }
    }
    let endSimd = DispatchTime.now().uptimeNanoseconds
    simdTime += Double(endSimd - startSimd) / Double(1e7)
    print(("simd  time", simdTime), simd.max().exponent)

    print("ratio array / simd", arrayTime / simdTime)
}
func runTestMultipleTimes() {
    for _ in 0 ..< 3 { test() }
}
runTestMultipleTimes()

Now when we run it we will get this for array/simd[j]:

("Array time", 0.9755937) 26
("simd  time", 1.0709854) 26
ratio array / simd 0.9109309053139286
("Array time", 1.0400305) 26
("simd  time", 0.9804056) 26
ratio array / simd 1.0608165640832734
("Array time", 0.9691748) 26
("simd  time", 0.9465985) 26
ratio array / simd 1.0238499215876635

And this for array/simd[i % 8]:

("Array time", 2.5223742) 26
("simd  time", 45.7046157) 26
ratio array / simd 0.05518860975785428
("Array time", 2.7398013) 26
("simd  time", 45.7768663) 26
ratio array / simd 0.05985122009105284
("Array time", 2.7021) 26
("simd  time", 46.6054519) 26
ratio array / simd 0.0579781954651533

And as it turns out, we get the same result for simd[i % 8] as in your original global scope version of the program, but not for array[i % 8] and array/simd[j].


Someone else will have to explain the exact reasons behind these results by examining the generated code, seeing if there are some missed optimizations, etc. (Btw, is SIMD8<Float> backed by some actual hardware-supported SIMD type? I mean, 8 * 4 == 32 bytes > 16 bytes; isn't the max 4 × Float32?)

All I'm saying is that these types of tests can give very different results depending on exactly how they are formulated.

One last such thing (which isn't applicable here) is that you might have to use random input data, to prevent the compiler from being able to eg turn a loop of successive additions (depending on overflow behavior etc) into just a single addition and multiplication.
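A sketch of that last point, using the standard library's `random(in:)` initializer on the SIMD types so the data isn't a compile-time constant:

```swift
// Seed the benchmark data with random values so the optimizer can't
// constant-fold the whole computation at compile time.
var v = SIMD4<Float>.random(in: 0 ..< 1)    // unknown until run time
let step = SIMD4<Float>.random(in: 0 ..< 1)
for _ in 0 ..< 1_000 {
    v += step
}
print(v.sum())  // depends on the random data, so the loop must run
```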

I had one more go at this since I'd like to know the reason for this unexpected behavior too.

Using the following simplified version of your benchmark program, we will again demonstrate that using a SIMD4<Float> is slower than using a four-element [Float] when adding a constant value to one element at a time like this:

import Dispatch

typealias FourFloats = [Float]
//typealias FourFloats = SIMD4<Float>

func test() {
    print("FourFloats ==", FourFloats.self)
    let iterations = 10_000_000
    var v: FourFloats = [0, 0, 0, 0]
    for _ in 0 ..< 5 {
        let t0 = DispatchTime.now().uptimeNanoseconds
        for i in 0..<iterations{ v[i & 3] += 7.0 }
        let t1 = DispatchTime.now().uptimeNanoseconds
        print(Double(t1-t0)/1e6, "milliseconds (cs: \(v[0]+v[1]+v[2]+v[3]))")
    }
}
test()

Running this program (on my old MBP) with
typealias FourFloats = [Float]:

$ swiftc --version
Apple Swift version 5.1.3 (swiftlang-1100.0.282.1 clang-1100.0.33.15)
Target: x86_64-apple-darwin19.2.0
$ swiftc -O test.swift && ./test
FourFloats == Array<Float>
7.546169 milliseconds (cs: 7.041301e+07)
7.265836 milliseconds (cs: 1.5041301e+08)
7.134038 milliseconds (cs: 2.3041301e+08)
7.121394 milliseconds (cs: 3.10413e+08)
7.406443 milliseconds (cs: 3.90413e+08)

And running it with
typealias FourFloats = SIMD4<Float>:

$ swiftc -O test.swift && ./test
FourFloats == SIMD4<Float>
80.508516 milliseconds (cs: 7.041301e+07)
81.561493 milliseconds (cs: 1.5041301e+08)
80.465867 milliseconds (cs: 2.3041301e+08)
80.200564 milliseconds (cs: 3.10413e+08)
79.63726 milliseconds (cs: 3.90413e+08)

So, the reported times are about 10 times longer when using a SIMD4<Float> instead of a four-element [Float], which is unexpected.


Looking at this in godbolt:

This is the inner loop when typealias FourFloats = [Float]:

.LBB1_7:
        mov     edx, ecx
        and     edx, 2
        cmp     rdx, rax
        jae     .LBB1_19
        movss   xmm0, dword ptr [rbx + 4*rdx + 32]
        addss   xmm0, xmm1
        movss   dword ptr [rbx + 4*rdx + 32], xmm0
        mov     edx, ecx
        and     edx, 2
        or      rdx, 1
        cmp     rdx, rax
        jae     .LBB1_19
        add     rcx, 2
        movss   xmm0, dword ptr [rbx + 4*rdx + 32]
        addss   xmm0, xmm1
        movss   dword ptr [rbx + 4*rdx + 32], xmm0
        cmp     rcx, 10000000
        jne     .LBB1_7

And this is the inner loop when typealias FourFloats = SIMD4<Float>:

.LBB1_8:
        mov     ecx, eax
        and     ecx, 3
        movaps  xmmword ptr [rbp - 96], xmm0
        movss   xmm0, dword ptr [rbp + 4*rcx - 96]
        addss   xmm0, xmm1
        movss   dword ptr [rbp + 4*rcx - 96], xmm0
        movaps  xmm0, xmmword ptr [rbp - 96]
        lea     edx, [rax + 1]
        and     edx, 3
        movaps  xmmword ptr [rbp - 112], xmm0
        movss   xmm0, dword ptr [rbp + 4*rdx - 112]
        addss   xmm0, xmm1
        movss   dword ptr [rbp + 4*rdx - 112], xmm0
        movaps  xmm0, xmmword ptr [rbp - 112]
        lea     edx, [rax + 2]
        and     edx, 3
        movaps  xmmword ptr [rbp - 128], xmm0
        movss   xmm0, dword ptr [rbp + 4*rdx - 128]
        addss   xmm0, xmm1
        movss   dword ptr [rbp + 4*rdx - 128], xmm0
        movaps  xmm0, xmmword ptr [rbp - 128]
        lea     edx, [rax - 1]
        and     edx, 3
        movaps  xmmword ptr [rbp - 144], xmm0
        movss   xmm0, dword ptr [rbp + 4*rdx - 144]
        addss   xmm0, xmm1
        movss   dword ptr [rbp + 4*rdx - 144], xmm0
        movaps  xmm0, xmmword ptr [rbp - 144]
        add     rax, 5
        movaps  xmmword ptr [rbp - 160], xmm0
        movss   xmm0, dword ptr [rbp + 4*rcx - 160]
        addss   xmm0, xmm1
        movss   dword ptr [rbp + 4*rcx - 160], xmm0
        movaps  xmm0, xmmword ptr [rbp - 160]
        cmp     rax, 10000000
        jne     .LBB1_8

Can anyone explain what's going on? Is it a missed optimization opportunity that can be fixed?
