I decided to start this topic after trying out a series of slight modifications to the code example of SR-7952 (a bug that was reported in a separate thread).
The demonstration program of that bug turned out to be a good example of how seemingly insignificant modifications to the code can significantly impact the efficiency of the resulting executable, making performance issues very hard to detect, isolate and work around.
The following is a step-by-step demonstration of this.
I will use the most recent development snapshot as of today (similar, but not identical, behavior can also be observed using e.g. the default toolchains of Xcode 9.4 and the Xcode 10 beta).
Program 1
Let's start with the following (a variation of the demonstration program of SR-7952):
import Dispatch

func benchmark<T>(_ function: () -> T) -> (T, Double) {
    let start = DispatchTime.now()
    let res = function()
    let end = DispatchTime.now()
    return (res, Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1e9)
}

func runBenchmarks() {
    let base = 0 ..< 1_000_000
    let (cmR, cmT) = benchmark { base.lazy.compactMap({ $0 % 4 == 0 ? nil : $0 }).reduce(0, +) }
    let (fmR, fmT) = benchmark { base.lazy.filter({ $0 % 4 != 0 }).map({ $0 }).reduce(0, +) }
    print("compactMap: \(cmR), took \(cmT) s")
    print(" filterMap: \(fmR), took \(fmT) s")
}

runBenchmarks()
Compiling and running (a couple of times):
$ /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-06-08-a.xctoolchain/usr/bin/swiftc -O -whole-module-optimization -swift-version 4.2 -static-stdlib -sdk /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -target x86_64-apple-macosx10.13 test.swift
$ ./test
compactMap: 375000000000, took 0.318972944 s
filterMap: 375000000000, took 0.000626542 s
$ ./test
compactMap: 375000000000, took 0.317654136 s
filterMap: 375000000000, took 0.000569041 s
$ ./test
compactMap: 375000000000, took 0.317911993 s
filterMap: 375000000000, took 0.000626542 s
OK, so compactMap is much slower than filterMap.
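As a quick sanity check that the two pipelines really do compute the same thing (a minimal standalone sketch, not part of the benchmark program): both drop the multiples of 4 and sum the rest.

```swift
let base = 0 ..< 100
let viaCompactMap = base.lazy.compactMap({ $0 % 4 == 0 ? nil : $0 }).reduce(0, +)
let viaFilterMap = base.lazy.filter({ $0 % 4 != 0 }).map({ $0 }).reduce(0, +)
// Sum of 0..<100 is 4950; the multiples of 4 (0, 4, ..., 96) sum to 1200.
print(viaCompactMap, viaFilterMap)  // 3750 3750
```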
Program 2
But how much slower is it? Let's add the following two lines to find out:
import Dispatch

func benchmark<T>(_ function: () -> T) -> (T, Double) {
    let start = DispatchTime.now()
    let res = function()
    let end = DispatchTime.now()
    return (res, Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1e9)
}

func runBenchmarks() {
    let base = 0 ..< 1_000_000
    let (cmR, cmT) = benchmark { base.lazy.compactMap({ $0 % 4 == 0 ? nil : $0 }).reduce(0, +) }
    let (fmR, fmT) = benchmark { base.lazy.filter({ $0 % 4 != 0 }).map({ $0 }).reduce(0, +) }
    let ratio = Int((cmT/fmT).rounded()) // <-- Added this!
    print("compactMap: \(cmR), took \(cmT) s")
    print(" filterMap: \(fmR), took \(fmT) s")
    print("compactMap is \(ratio) times slower than filterMap") // <-- And this!
}

runBenchmarks()
Compiling and running:
$ /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-06-08-a.xctoolchain/usr/bin/swiftc -O -whole-module-optimization -swift-version 4.2 -static-stdlib -sdk /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -target x86_64-apple-macosx10.13 test.swift
$ ./test
compactMap: 375000000000, took 0.324593056 s
filterMap: 375000000000, took 0.00109638 s
compactMap is 296 times slower than filterMap
$ ./test
compactMap: 375000000000, took 0.332787167 s
filterMap: 375000000000, took 0.001176485 s
compactMap is 283 times slower than filterMap
$ ./test
compactMap: 375000000000, took 0.323394991 s
filterMap: 375000000000, took 0.001097987 s
compactMap is 295 times slower than filterMap
But wait a second! Why has filterMap suddenly become almost twice as slow as it was before? In Program 1 it took 0.0006 seconds, but now it takes 0.0011 seconds. What's going on?
Inspecting the disassembly of Program 1 and 2 to explain the details behind this is left as an exercise for the reader.
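For anyone who wants to take up that exercise, here is one way to get at the disassembly (assuming the same macOS toolchain as above; paths and flags are illustrative):

```shell
# Ask the compiler to emit assembly directly, with the same optimization flags:
swiftc -O -whole-module-optimization -swift-version 4.2 -emit-assembly test.swift -o test.s

# Or disassemble the already-built executable (macOS):
otool -tvV ./test > test.disasm
```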
Program 3
As strange as the above issue might seem, there's an equally strange workaround, which is to simply replace the string interpolation with a separate argument for ratio:
import Dispatch

func benchmark<T>(_ function: () -> T) -> (T, Double) {
    let start = DispatchTime.now()
    let res = function()
    let end = DispatchTime.now()
    return (res, Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1e9)
}

func runBenchmarks() {
    let base = 0 ..< 1_000_000
    let (cmR, cmT) = benchmark { base.lazy.compactMap({ $0 % 4 == 0 ? nil : $0 }).reduce(0, +) }
    let (fmR, fmT) = benchmark { base.lazy.filter({ $0 % 4 != 0 }).map({ $0 }).reduce(0, +) }
    let ratio = Int((cmT/fmT).rounded())
    print("compactMap: \(cmR), took \(cmT) s")
    print(" filterMap: \(fmR), took \(fmT) s")
    print("compactMap is", ratio, "times slower than filterMap") // <-- Replaced string interpolated ratio with this!
}

runBenchmarks()
Compile and run:
$ /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-06-08-a.xctoolchain/usr/bin/swiftc -O -whole-module-optimization -swift-version 4.2 -static-stdlib -sdk /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -target x86_64-apple-macosx10.13 test.swift
$ ./test
compactMap: 375000000000, took 0.322355613 s
filterMap: 375000000000, took 0.000548265 s
compactMap is 588 times slower than filterMap
$ ./test
compactMap: 375000000000, took 0.324320271 s
filterMap: 375000000000, took 0.000548251 s
compactMap is 592 times slower than filterMap
$ ./test
compactMap: 375000000000, took 0.323314934 s
filterMap: 375000000000, took 0.00062654 s
compactMap is 516 times slower than filterMap
Ah! Now filterMap takes 0.0006 seconds again, so compactMap is about 550 times slower than filterMap, and not just 300 times slower as in Program 2.
Program 4
Let's add a small improvement: Instead of running the program several times ourselves, we'll make the program do it for us:
import Dispatch

func benchmark<T>(_ function: () -> T) -> (T, Double) {
    let start = DispatchTime.now()
    let res = function()
    let end = DispatchTime.now()
    return (res, Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1e9)
}

func runBenchmarks() {
    let base = 0 ..< 1_000_000
    for _ in 0 ..< 5 { // <-- Added this for-in loop!
        let (cmR, cmT) = benchmark { base.lazy.compactMap({ $0 % 4 == 0 ? nil : $0 }).reduce(0, +) }
        let (fmR, fmT) = benchmark { base.lazy.filter({ $0 % 4 != 0 }).map({ $0 }).reduce(0, +) }
        let ratio = Int((cmT/fmT).rounded())
        print("compactMap: \(cmR), took \(cmT) s")
        print(" filterMap: \(fmR), took \(fmT) s")
        print("compactMap is", ratio, "times slower than filterMap")
    }
}

runBenchmarks()
Compile and run (once):
$ /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-06-08-a.xctoolchain/usr/bin/swiftc -O -whole-module-optimization -swift-version 4.2 -static-stdlib -sdk /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -target x86_64-apple-macosx10.13 test.swift
$ ./test
compactMap: 375000000000, took 0.32222596 s
filterMap: 375000000000, took 3.8e-08 s
compactMap is 8479631 times slower than filterMap
compactMap: 375000000000, took 0.32233228 s
filterMap: 375000000000, took 3.1e-08 s
compactMap is 10397815 times slower than filterMap
compactMap: 375000000000, took 0.320527889 s
filterMap: 375000000000, took 3.1e-08 s
compactMap is 10339609 times slower than filterMap
compactMap: 375000000000, took 0.320475311 s
filterMap: 375000000000, took 3.1e-08 s
compactMap is 10337913 times slower than filterMap
compactMap: 375000000000, took 0.329549344 s
filterMap: 375000000000, took 3.2e-08 s
compactMap is 10298417 times slower than filterMap
Wow! filterMap now takes essentially no time at all (presumably the whole computation was folded away by the optimizer), which makes compactMap 10 million times slower than filterMap!
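If the goal is to keep the optimizer from eliminating the measured work entirely, one common trick is to pass the result through an opaque, never-inlined sink (the Swift benchmark suite uses a similar helper, often called blackHole; the sketch below is my own reimplementation of the idea, not code from that suite):

```swift
import Dispatch

// An opaque sink: because it is never inlined, the optimizer must assume
// the value is observed, and so cannot fold away the work that produced it.
@inline(never)
func blackHole<T>(_ value: T) {
    withExtendedLifetime(value) {}
}

let base = 0 ..< 1_000_000
let start = DispatchTime.now()
let result = base.lazy.filter({ $0 % 4 != 0 }).map({ $0 }).reduce(0, +)
blackHole(result)  // keep the computation observable to the optimizer
let end = DispatchTime.now()
print("filterMap: \(result), took \(Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1e9) s")
```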
Program 5
As it turns out, this optimization will not happen if base is moved to the top level:
import Dispatch

func benchmark<T>(_ function: () -> T) -> (T, Double) {
    let start = DispatchTime.now()
    let res = function()
    let end = DispatchTime.now()
    return (res, Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1e9)
}

let base = 0 ..< 1_000_000 // <-- Moved this out of runBenchmarks().

func runBenchmarks() {
    for _ in 0 ..< 5 {
        let (cmR, cmT) = benchmark { base.lazy.compactMap({ $0 % 4 == 0 ? nil : $0 }).reduce(0, +) }
        let (fmR, fmT) = benchmark { base.lazy.filter({ $0 % 4 != 0 }).map({ $0 }).reduce(0, +) }
        let ratio = Int((cmT/fmT).rounded())
        print("compactMap: \(cmR), took \(cmT) s")
        print(" filterMap: \(fmR), took \(fmT) s")
        print("compactMap is", ratio, "times slower than filterMap")
    }
}

runBenchmarks()
Compile and run:
$ /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-06-08-a.xctoolchain/usr/bin/swiftc -O -whole-module-optimization -swift-version 4.2 -static-stdlib -sdk /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -target x86_64-apple-macosx10.13 test.swift
$ ./test
compactMap: 375000000000, took 0.319917178 s
filterMap: 375000000000, took 0.001174697 s
compactMap is 272 times slower than filterMap
compactMap: 375000000000, took 0.324600748 s
filterMap: 375000000000, took 0.001174689 s
compactMap is 276 times slower than filterMap
compactMap: 375000000000, took 0.318163121 s
filterMap: 375000000000, took 0.001174688 s
compactMap is 271 times slower than filterMap
compactMap: 375000000000, took 0.3232251 s
filterMap: 375000000000, took 0.001174688 s
compactMap is 275 times slower than filterMap
compactMap: 375000000000, took 0.318713242 s
filterMap: 375000000000, took 0.001174699 s
compactMap is 271 times slower than filterMap
OK. But look! filterMap is back at 0.0011 s (or rather 0.0012 s), and not the 0.0006 s that we know it can reach.
Program 6
The string-interpolation trick comes to the rescue, but now in reverse (this time by changing ratio back into a string interpolation):
import Dispatch

func benchmark<T>(_ function: () -> T) -> (T, Double) {
    let start = DispatchTime.now()
    let res = function()
    let end = DispatchTime.now()
    return (res, Double(end.uptimeNanoseconds - start.uptimeNanoseconds) / 1e9)
}

let base = 0 ..< 1_000_000

func runBenchmarks() {
    for _ in 0 ..< 5 {
        let (cmR, cmT) = benchmark { base.lazy.compactMap({ $0 % 4 == 0 ? nil : $0 }).reduce(0, +) }
        let (fmR, fmT) = benchmark { base.lazy.filter({ $0 % 4 != 0 }).map({ $0 }).reduce(0, +) }
        let ratio = Int((cmT/fmT).rounded())
        print("compactMap: \(cmR), took \(cmT) s")
        print(" filterMap: \(fmR), took \(fmT) s")
        print("compactMap is \(ratio) times slower than filterMap") // <-- Undid the change we made in Program 3 here.
    }
}

runBenchmarks()
Compile and run:
$ /Library/Developer/Toolchains/swift-DEVELOPMENT-SNAPSHOT-2018-06-08-a.xctoolchain/usr/bin/swiftc -O -whole-module-optimization -swift-version 4.2 -static-stdlib -sdk /Applications/Xcode-beta.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -target x86_64-apple-macosx10.13 test.swift
$ ./test
compactMap: 375000000000, took 0.322214887 s
filterMap: 375000000000, took 0.000548238 s
compactMap is 588 times slower than filterMap
compactMap: 375000000000, took 0.320392592 s
filterMap: 375000000000, took 0.000548245 s
compactMap is 584 times slower than filterMap
compactMap: 375000000000, took 0.319095109 s
filterMap: 375000000000, took 0.000548243 s
compactMap is 582 times slower than filterMap
compactMap: 375000000000, took 0.325831477 s
filterMap: 375000000000, took 0.000548246 s
compactMap is 594 times slower than filterMap
compactMap: 375000000000, took 0.320049229 s
filterMap: 375000000000, took 0.000572327 s
compactMap is 559 times slower than filterMap
OK. We'll stop here.
But note that we never touched the code being benchmarked, nor the benchmark function; the changes were only in the surrounding code.
The reason I went to the trouble of writing this post is to show what, in my experience, is the rule rather than the exception, although you might not see it unless you're trying hard to optimize some performance-critical part of your code.
So this was just a very small taste. In any given situation there are all sorts of seemingly irrelevant but performance-impacting little changes that can be made, e.g. adding or removing a number of spaces at the start of a string used only for reporting the result of a test, switching between a for-in loop and the equivalent while loop, wrapping some piece of code in an immediately invoked closure (which can make that section of code faster in some specific contexts), &+ sometimes being slower than +, etc.
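For example, the immediately-invoked-closure variation mentioned above looks like this (a sketch of the pattern only; whether it helps or hurts depends entirely on the surrounding context):

```swift
// Plain form:
var sum1 = 0
for i in 0 ..< 1_000 { sum1 += i }

// Immediately invoked closure form: semantically identical, but it changes
// the context the optimizer sees, which can alter the generated code.
let sum2: Int = {
    var s = 0
    for i in 0 ..< 1_000 { s += i }
    return s
}()
print(sum1, sum2)  // 499500 499500
```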
Note also that these issues are impossible to represent, test, or benchmark in e.g. XCTest or the Swift Benchmarking Suite. Why? Because the issues are all about context, and they interact in various funny ways, as shown by e.g. the string-interpolation trick working in reverse or not depending on whether base was declared at top level. XCTest and the Swift Benchmarking Suite force some layer(s) of context on the tested code, and thus their harness becomes part of the tested code.