My Swift app iterates over two Array<String> and compares their elements.
Something very strange is going on. I have the impression the compiler is doing some optimizations that I cannot understand: commenting out the print statement in MyInterface.run() improves the runtime from about 10 seconds to below 0.5 seconds, and that print statement is executed only once at the end of the program. Even commenting out the if above it has the same effect.
I'm running the app in Xcode Instruments. When debugging it in Xcode, there is no difference when commenting out any of those two lines.
Instruments shows that most of the time is spent in protocol witness for Collection.subscript.read in conformance [A], Array.subscript.read and similar, which in turn call different malloc, initialize, free and release methods.
import Cocoa
@main
class AppDelegate: NSObject, NSApplicationDelegate {
func applicationDidFinishLaunching(_ aNotification: Notification) {
let x = Array(repeating: "adsf", count: 100000)
let y = Array(repeating: "adsf", count: 100000)
let diff = MyInterface(x: x, y: y)
diff.run()
}
}
class MyInterface<List: RandomAccessCollection & MutableCollection> where List.Element: Comparable, List.Index == Int {
private let algorithm: Algorithm<List>
init(x: List, y: List) {
algorithm = AlgorithmSubclass(x: x, y: y)
}
func run() {
algorithm.run()
if (0..<1).randomElement() == 0 {
print(algorithm.x.count) // commenting out this line, or the if above, makes the program 20x faster
}
}
}
class Algorithm<List: RandomAccessCollection> where List.Element: Equatable, List.Index == Int {
var x: List
var y: List
init(x: List, y: List) {
self.x = x
self.y = y
}
func run() {
}
}
class AlgorithmSubclass<List: RandomAccessCollection>: Algorithm<List> where List.Element: Equatable, List.Index == Int {
override func run() {
var count = 0
for _ in 0..<1000 {
for i in 0..<min(x.endIndex, y.endIndex) {
if x[i] == y[i] {
count += 1
}
}
}
let alert = NSAlert()
alert.messageText = "\(count)"
alert.runModal()
}
}
With a command-line program, I am not seeing a difference that large:
/*
with print
100000
--> delta: 71.41258597373962
without print
--> delta: 73.18637597560883
*/
Code
import Foundation
@main
enum ScratchPad {
static func main () async throws {
let x = Array (repeating: "adsf", count: 100000)
let y = Array (repeating: "adsf", count: 100000)
let diff = MyInterface (x: x, y: y)
let start = Date ()
diff.run()
let delta = Date().timeIntervalSince (start)
print("--> delta: \(delta)")
}
}
class MyInterface<List: RandomAccessCollection & MutableCollection> where List.Element: Comparable, List.Index == Int {
private let algorithm: Algorithm<List>
init(x: List, y: List) {
algorithm = AlgorithmSubclass(x: x, y: y)
}
func run() {
algorithm.run()
if (0..<1).randomElement() == 0 {
#if false
print ("with print")
print(algorithm.x.count) // commenting out this line, or the if above, makes the program 20x faster
#else
print ("without print")
#endif
}
}
}
class Algorithm<List: RandomAccessCollection> where List.Element: Equatable, List.Index == Int {
var x: List
var y: List
init(x: List, y: List) {
self.x = x
self.y = y
}
func run() {
}
}
class AlgorithmSubclass<List: RandomAccessCollection>: Algorithm<List> where List.Element: Equatable, List.Index == Int {
override func run() {
var count = 0
for _ in 0..<1000 {
for i in 0..<min(x.endIndex, y.endIndex) {
if x[i] == y[i] {
count += 1
}
}
}
}
}
/*
with print
100000
--> delta: 71.41258597373962
without print
--> delta: 73.18637597560883
*/
Yes, when using #if false or #if true I also get a runtime of about 10 seconds, which shows that the compiler isn't able to optimize in these two cases either.
Yes. In my real code it's a different if condition referencing other state that is not present in the sample code, and this kind of if seems to trick the compiler into not optimizing away its body. Using if true, or leaving the if away, for some reason allows the compiler to create the optimized version that runs in under 0.5 seconds, as mentioned in the comment in the code.
You are hitting an optimised / specialised code paths in some cases and not others (normally accessing things via "x[i]" when x is existential is quite heavy, requires memory allocation and such).
If you remove all those "print" / randomElement statements and do this instead:
protocol CouldRun { func run() }
extension Algorithm: CouldRun {}
var couldRun: CouldRun!
class MyInterface<List: RandomAccessCollection & MutableCollection> where List.Element: Comparable, List.Index == Int {
...
func run() {
couldRun = algorithm
couldRun.run()
}
...
}
It would also be slow as in this case compiler's optimiser can't figure out the fast path.
Things could change for you if you use less generic types (e.g. array of things instead of RandomAccessCollection of things) but this needs testing. Also reducing class hierarchy depths, making classes final or switching from classes to structs altogether typically helps the optimiser.
As a side note: more people would be temped to try running your code example if it was expressed as a command line app with no Cocoa / UI dependencies.
What happens if you make AlgorithmSubclass final? And/or change the type of MyInterface's algorithm to AlgorithmSubclass?
I ask because one or both of those changes should eliminate existentials (if the compiler wasn't already able to, which it appears it was not) which will answer the question of whether it's existential overheads / barriers (e.g. failure to inline).
That's true. Using ContiguousArray<String> makes it faster. I simply would have liked to keep things generic for code reuse.
I had this doubt as well, but then I thought that when running it in Instruments it wouldn't be possible to see the result (currently shown in the final alert) to confirm that the whole loop run even when taking much less time. But thank you for the hint, I will see how this could be done otherwise next time.
Writing let algorithm: AlgorithmSubclass<List> makes it faster. But that defeats the point of my program: being able to choose the algorithm at runtime. In the sample I only showed one subclass to make it as simple as possible. There are some other things that can be changed to make it faster, but they all make it less flexible.
Thanks for the suggestion. I rewrote the sample using a protocol with associated types. With or without the print statement or the enclosing if, the runtime is about 10 seconds, so even less optimization than before.
import Cocoa
@main
class AppDelegate: NSObject, NSApplicationDelegate {
func applicationDidFinishLaunching(_ aNotification: Notification) {
let x = Array(repeating: "adsf", count: 100000)
let y = Array(repeating: "adsf", count: 100000)
let diff = MyInterface(x: x, y: y)
diff.run()
}
}
class MyInterface<List: RandomAccessCollection & MutableCollection> where List.Element: Comparable, List.Index == Int {
private let algorithm: any Algorithm<List>
init(x: List, y: List) {
algorithm = AlgorithmSubclass(x: x, y: y)
}
func run() {
algorithm.run()
}
}
protocol Algorithm<List> {
associatedtype List: RandomAccessCollection where List.Element: Equatable, List.Index == Int
var x: List { get }
var y: List { get }
func run()
}
final class AlgorithmSubclass<List: RandomAccessCollection>: Algorithm where List.Element: Equatable, List.Index == Int {
var x: List
var y: List
init(x: List, y: List) {
self.x = x
self.y = y
}
func run() {
var count = 0
for _ in 0..<1000 {
for i in 0..<min(x.endIndex, y.endIndex) {
if x[i] == y[i] {
count += 1
}
}
}
let alert = NSAlert()
alert.messageText = "\(count)"
alert.runModal()
}
}
That doesn't surprise me - there's still indirection involved with protocols, much like for subclasses.
I suspect what you're seeing is the difference between x being "inlined" vs not, in a nutshell. By accessing x from outside your Algorithm instance, you might be unwittingly preventing the compiler from inlining something performance-critical. Your trivial algorithm should (in principle) be optimised down to enumerating two blocks of contiguous memory and accumulating the result into a register. It should be just a couple of dozen instructions. But, even though in principle that's still possible with your original code, the compiler is not perfect.
You could verify this hypothesis by looking at the disassembly (e.g. with Hopper).
If that is correct, then you might be able to work around it by making x (and y) protected, and storing the count in a separate member variable of Algorithm.