When you do these kinds of tests, it's important to remember that there's a big difference between having stuff defined in global vs local scope. You should probably put everything in a function unless your test is meant to measure exactly the specific case of a stored variable defined in global scope.
Also, you should always run measurements like this several times in a loop, to see whether, and by how much, the results differ between successive runs; the first run, for example, is often quite different from the following ones.
And, you have to make sure (as we'll see) that the tested code isn't optimized away entirely by dead code elimination (unless that's specifically what you want to test).
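The three points above can be sketched as a small harness. This is only a minimal sketch: `measure`, `sumOfSquares` and `runBenchmark` are hypothetical names that are not part of the original program, and the workload is just a placeholder.

```swift
import Dispatch

/// Hypothetical helper: runs `body` `runs` times and returns, for each run,
/// the elapsed time in seconds together with the value `body` produced.
func measure<T>(runs: Int, _ body: () -> T) -> [(seconds: Double, result: T)] {
    var samples: [(seconds: Double, result: T)] = []
    for _ in 0..<runs {
        let start = DispatchTime.now().uptimeNanoseconds
        let result = body()
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append((seconds: Double(end - start) / 1e9, result: result))
    }
    return samples
}

/// Placeholder workload; all of its state is local to the function.
func sumOfSquares() -> Int {
    var total = 0
    for i in 0..<1_000_000 { total &+= i &* i }
    return total
}

func runBenchmark() {
    for sample in measure(runs: 3, sumOfSquares) {
        // Printing the result keeps the work observable,
        // so the loop can't simply be discarded as dead code.
        print(sample.seconds, sample.result)
    }
}

runBenchmark()
```

Comparing the printed times across the three runs shows how much the first run differs from the later ones.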
Here's what I see when I run your unmodified program (as a command line tool prj) on my late 2013 MBP, using the default toolchain of Xcode 11.3.1 (11C504), release build from within Xcode:
("Array time", 2.8470284)
("simd time", 1.0790284)
ratio array / simd 2.6385110901622237
and with array[i % 8] and simd[i % 8]:
("Array time", 8.3321858)
("simd time", 45.5596783)
ratio array / simd 0.1828850885455001
So for some reason both got slower, and the simd version got a lot slower.
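The modified loops aren't shown here, so as an assumption: one plausible reading of the [i % 8] variant is that the subscript `j` in the inner loop is replaced by `i % 8`. A minimal sketch of that reading, with a hypothetical function name and a small iteration count so the result is easy to check:

```swift
/// Hypothetical sketch of the `[i % 8]` variant (an assumption, not the original code).
func indexVariant(iterations: Int) -> (array: [Float], simd: SIMD8<Float>) {
    let elements = 8
    var array = [Float](repeating: 5.0, count: elements)
    var simd = SIMD8<Float>(repeating: 5.0)

    // Array: every inner iteration hits the same element, chosen by the outer index.
    for i in 0..<iterations {
        for _ in 0..<elements {
            array[i % 8] += 7.0
        }
    }

    // SIMD: same access pattern through the SIMD8 subscript.
    for i in 0..<iterations {
        for _ in 0..<elements {
            simd[i % 8] += 7.0
        }
    }
    return (array, simd)
}

let result = indexVariant(iterations: 16)
print(result.array, result.simd)
```

Under this reading each element still receives the same total number of additions as in the [j] version, just in a different order.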
OK, now let's put everything in a function test(), and call that function a couple of times from another function.
Here's your program with that modification.
import Foundation
import Dispatch

func test() {
    let iterations = 10000000
    let elements: Int = 8
    var array = Array<Float>(repeating: 5.0, count: elements)
    var simd = SIMD8<Float>(repeating: 5.0)

    //---------------------------------------------------
    // Array
    var arrayTime = 0.0
    let startArray = DispatchTime.now().uptimeNanoseconds
    for i in 0..<iterations {
        for j in 0..<elements {
            array[j] += 7.0
        }
    }
    let endArray = DispatchTime.now().uptimeNanoseconds
    // Elapsed ns divided by 1e7 (the value of `iterations`), i.e. ns per outer-loop iteration.
    arrayTime += Double(endArray - startArray) / Double(1e7)
    print(("Array time", arrayTime))

    //---------------------------------------------------
    // SIMD
    let startSimd = DispatchTime.now().uptimeNanoseconds
    var simdTime = 0.0
    for i in 0..<iterations {
        for j in 0..<elements {
            simd[j] += 7.0
        }
    }
    let endSimd = DispatchTime.now().uptimeNanoseconds
    simdTime += Double(endSimd - startSimd) / Double(1e7)
    print(("simd time", simdTime))
    print("ratio array / simd", arrayTime / simdTime)
}

func runTestMultipleTimes() {
    for _ in 0 ..< 3 { test() }
}

runTestMultipleTimes()
Now, the array/simd[j] variant looks like this:
("Array time", 1.2672577)
("simd time", 4.41e-05)
ratio array / simd 28736.002267573695
("Array time", 1.0452441)
("simd time", 1e-05)
ratio array / simd 104524.40999999997
("Array time", 1.0183805)
("simd time", 9.2e-06)
ratio array / simd 110693.53260869565
And the array/simd[i % 8] variant looks like this:
("Array time", 2.5467075)
("simd time", 0.0189487)
ratio array / simd 134.40011715843303
("Array time", 2.7012884)
("simd time", 0.0193017)
ratio array / simd 139.95080226094075
("Array time", 2.5367798)
("simd time", 0.0172931)
ratio array / simd 146.69317820402358
It looks like the loop for the simd[j] case might have been optimized away entirely (I haven't checked with e.g. godbolt, just guessing), so let's prevent that optimization by making sure simd (and array) are used in some way after that loop, e.g. by printing the exponent of the max element (just something that must process all elements and won't print too many irrelevant digits):
print(("Array time", arrayTime), array.max()!.exponent)
...
print(("simd time", simdTime), simd.max().exponent)
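To make concrete what that extra printed value is: `exponent` on a `Float` is its binary exponent, so for a value around 7e7 it is 26, because 2^26 = 67_108_864 <= 7e7 < 2^27. A small sketch (the function name is hypothetical) that repeats the additions a single element receives in the program:

```swift
/// Hypothetical helper: applies the same 10_000_000 additions of 7.0
/// that one element receives in the benchmark, then reports the exponent.
func exponentAfterBenchmarkAdditions() -> Int {
    var value: Float = 5.0
    for _ in 0..<10_000_000 {
        value += 7.0
    }
    return value.exponent
}

print(exponentAfterBenchmarkAdditions())   // 26, the value printed by the runs in this post
print(Float(70_000_000).exponent)          // 26
```

Since every element is computed the same way, the max has the same exponent, and a single short number is enough to force the loops to actually run.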
Here's the entire program with this modification.
import Foundation
import Dispatch

func test() {
    let iterations = 10000000
    let elements: Int = 8
    var array = Array<Float>(repeating: 5.0, count: elements)
    var simd = SIMD8<Float>(repeating: 5.0)

    //---------------------------------------------------
    // Array
    var arrayTime = 0.0
    let startArray = DispatchTime.now().uptimeNanoseconds
    for i in 0..<iterations {
        for j in 0..<elements {
            array[j] += 7.0
        }
    }
    let endArray = DispatchTime.now().uptimeNanoseconds
    arrayTime += Double(endArray - startArray) / Double(1e7)
    // Using `array` after the loop keeps the loop from being removed as dead code.
    print(("Array time", arrayTime), array.max()!.exponent)

    //---------------------------------------------------
    // SIMD
    let startSimd = DispatchTime.now().uptimeNanoseconds
    var simdTime = 0.0
    for i in 0..<iterations {
        for j in 0..<elements {
            simd[j] += 7.0
        }
    }
    let endSimd = DispatchTime.now().uptimeNanoseconds
    simdTime += Double(endSimd - startSimd) / Double(1e7)
    // Same for `simd`: the printed exponent depends on every element.
    print(("simd time", simdTime), simd.max().exponent)
    print("ratio array / simd", arrayTime / simdTime)
}

func runTestMultipleTimes() {
    for _ in 0 ..< 3 { test() }
}

runTestMultipleTimes()
Now when we run it we get this for array/simd[j]:
("Array time", 0.9755937) 26
("simd time", 1.0709854) 26
ratio array / simd 0.9109309053139286
("Array time", 1.0400305) 26
("simd time", 0.9804056) 26
ratio array / simd 1.0608165640832734
("Array time", 0.9691748) 26
("simd time", 0.9465985) 26
ratio array / simd 1.0238499215876635
And this for array/simd[i % 8]:
("Array time", 2.5223742) 26
("simd time", 45.7046157) 26
ratio array / simd 0.05518860975785428
("Array time", 2.7398013) 26
("simd time", 45.7768663) 26
ratio array / simd 0.05985122009105284
("Array time", 2.7021) 26
("simd time", 46.6054519) 26
ratio array / simd 0.0579781954651533
And as it turns out, we get the same result for simd[i % 8] as in your original global scope version of the program, but not for array[i % 8] and array/simd[j].
Someone else will have to explain the exact reasons behind these results by examining the generated code, seeing if there are some missed optimizations etc. (Btw, is SIMD8<Float> backed by some actual hardware-supported SIMD type? I mean, 8 * 4 == 32 bytes > 16 bytes; isn't the max 4 float32?)
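The size part of that question can at least be checked directly. The sketch below only confirms the storage size of the types; whether operations on SIMD8<Float> end up in one 256-bit register (e.g. AVX on x86) or in two 128-bit ones depends on the target and compiler settings, which this doesn't show.

```swift
// Storage size in bytes of the 4-wide and 8-wide Float vectors.
print(MemoryLayout<SIMD4<Float>>.size)   // 16
print(MemoryLayout<SIMD8<Float>>.size)   // 32
// Number of scalar lanes in SIMD8<Float>.
print(SIMD8<Float>.scalarCount)          // 8
```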
All I'm saying is that these types of tests can give very different results depending on exactly how they are formulated.
One last such thing (which isn't applicable here) is that you might have to use random data to prevent the compiler from turning e.g. a loop of successive additions (depending on overflow behavior etc.) into just a single multiply and add.
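A minimal sketch of that last idea, assuming a simple summation as the measured work (the function name and workload are hypothetical): the input is generated at run time, so the compiler can't precompute the result from known constants.

```swift
/// Hypothetical example: sums `count` random values.
/// In a real measurement, generate the input before starting the timer.
func sumOfRandomInput(count: Int) -> Float {
    // Values are only known at run time, so the loop below can't be
    // folded into a compile-time constant or a closed-form expression.
    let input = (0..<count).map { _ in Float.random(in: 0..<1) }
    var sum: Float = 0
    for x in input {
        sum += x
    }
    return sum
}

// Use the result so the work stays observable.
print(sumOfRandomInput(count: 1_000))
```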