Swift loop execution time is 20x slower when adding print statement that is executed once at the end

My Swift app iterates over two Array<String> and compares their elements.

Something very strange is going on. I have the impression the compiler is doing some optimizations that I cannot understand: commenting out the print statement in MyInterface.run() improves the runtime from about 10 seconds to below 0.5 seconds, and that print statement is executed only once at the end of the program. Even commenting out the if above it has the same effect.

I'm running the app in Xcode Instruments. When debugging it in Xcode, there is no difference when commenting out any of those two lines.

Instruments shows that most of the time is spent in protocol witness for Collection.subscript.read in conformance [A], Array.subscript.read and similar, which in turn call different malloc, initialize, free and release methods.

What's the problem with this code?

import Cocoa

@main
class AppDelegate: NSObject, NSApplicationDelegate {

    func applicationDidFinishLaunching(_ aNotification: Notification) {
        let x = Array(repeating: "adsf", count: 100000)
        let y = Array(repeating: "adsf", count: 100000)
        let diff = MyInterface(x: x, y: y)
        diff.run()
    }

}

class MyInterface<List: RandomAccessCollection & MutableCollection> where List.Element: Comparable, List.Index == Int {
    
    private let algorithm: Algorithm<List>
    
    init(x: List, y: List) {
        algorithm = AlgorithmSubclass(x: x, y: y)
    }
    
    func run() {
        algorithm.run()
        if (0..<1).randomElement() == 0 {
            print(algorithm.x.count) // commenting out this line, or the if above, makes the program 20x faster
        }
    }
    
}

class Algorithm<List: RandomAccessCollection> where List.Element: Equatable, List.Index == Int {
    
    var x: List
    var y: List
    
    init(x: List, y: List) {
        self.x = x
        self.y = y
    }
    
    func run() {
    }
    
}

class AlgorithmSubclass<List: RandomAccessCollection>: Algorithm<List> where List.Element: Equatable, List.Index == Int {
    
    override func run() {
        var count = 0
        for _ in 0..<1000 {
            for i in 0..<min(x.endIndex, y.endIndex) {
                if x[i] == y[i] {
                    count += 1
                }
            }
        }
        let alert = NSAlert()
        alert.messageText = "\(count)"
        alert.runModal()
    }
    
}

Most likely, the expression algorithm.x.count itself is causing it, not the printing.

Most likely, the expression algorithm.x.count itself is causing it, not the printing.

I guess so, but why?

count for String is slow.

Which is O(n)

count for String is slow.

It's an Array<String>, not a String.

With a command-line program, I am not seeing a difference that large:

/*
 with print
 100000
 --> delta: 71.41258597373962

 without print
 --> delta: 73.18637597560883

 */
Code
import Foundation

@main
enum ScratchPad {
    static func main () async throws {
        let x = Array (repeating: "adsf", count: 100000)
        let y = Array (repeating: "adsf", count: 100000)
        let diff = MyInterface (x: x, y: y)
        let start = Date ()
        diff.run()
        let delta = Date().timeIntervalSince (start)
        print("--> delta: \(delta)")
    }
}

class MyInterface<List: RandomAccessCollection & MutableCollection> where List.Element: Comparable, List.Index == Int {
    
    private let algorithm: Algorithm<List>
    
    init(x: List, y: List) {
        algorithm = AlgorithmSubclass(x: x, y: y)
    }
    
    func run() {
        algorithm.run()
        if (0..<1).randomElement() == 0 {
            #if false
            print ("with print")
            print(algorithm.x.count) // commenting out this line, or the if above, makes the program 20x faster
            #else
            print ("without print")
            #endif
        }
    }
    
}

class Algorithm<List: RandomAccessCollection> where List.Element: Equatable, List.Index == Int {
    
    var x: List
    var y: List
    
    init(x: List, y: List) {
        self.x = x
        self.y = y
    }
    
    func run() {
    }
    
}

class AlgorithmSubclass<List: RandomAccessCollection>: Algorithm<List> where List.Element: Equatable, List.Index == Int {
    
    override func run() {
        var count = 0
        for _ in 0..<1000 {
            for i in 0..<min(x.endIndex, y.endIndex) {
                if x[i] == y[i] {
                    count += 1
                }
            }
        }
    }
}

/*
 with print
 100000
 --> delta: 71.41258597373962

 without print
 --> delta: 73.18637597560883

 */

Removing

(0..<1).randomElement()

speeds it up for me. Not sure why

Yes, when using #if false or #if true I also get a runtime of about 10 seconds, which shows that the compiler isn't able to optimize in these two cases either.

What's wrong with Bool.random() ?

Nothing, I just didn't think of it.

(0..<1).randomElement() == 0

is always true though.

1 Like

Yes. In my real code it's a different if condition referencing other state that is not present in the sample code, and this kind of if seems to trick the compiler into not optimizing away its body. Using if true, or leaving the if away, for some reason allows the compiler to create the optimized version that runs in under 0.5 seconds, as mentioned in the comment in the code.

You are hitting an optimised / specialised code paths in some cases and not others (normally accessing things via "x[i]" when x is existential is quite heavy, requires memory allocation and such).

If you remove all those "print" / randomElement statements and do this instead:

protocol CouldRun { func run() }
extension Algorithm: CouldRun {}
var couldRun: CouldRun!

class MyInterface<List: RandomAccessCollection & MutableCollection> where List.Element: Comparable, List.Index == Int {
	...
    func run() {
        couldRun = algorithm
        couldRun.run()
    }
    ...
}

It would also be slow as in this case compiler's optimiser can't figure out the fast path.

Things could change for you if you use less generic types (e.g. array of things instead of RandomAccessCollection of things) but this needs testing. Also reducing class hierarchy depths, making classes final or switching from classes to structs altogether typically helps the optimiser.


As a side note: more people would be temped to try running your code example if it was expressed as a command line app with no Cocoa / UI dependencies.

What happens if you make AlgorithmSubclass final? And/or change the type of MyInterface's algorithm to AlgorithmSubclass?

I ask because one or both of those changes should eliminate existentials (if the compiler wasn't already able to, which it appears it was not) which will answer the question of whether it's existential overheads / barriers (e.g. failure to inline).

That's true. Using ContiguousArray<String> makes it faster. I simply would have liked to keep things generic for code reuse.

I had this doubt as well, but then I thought that when running it in Instruments it wouldn't be possible to see the result (currently shown in the final alert) to confirm that the whole loop run even when taking much less time. But thank you for the hint, I will see how this could be done otherwise next time.

Runtime stays the same.

Writing let algorithm: AlgorithmSubclass<List> makes it faster. But that defeats the point of my program: being able to choose the algorithm at runtime. In the sample I only showed one subclass to make it as simple as possible. There are some other things that can be changed to make it faster, but they all make it less flexible.

Right, but at least now we have a strong hint that the problem relates to use of existentials.

Is it important to you to use subclasses specifically instead of a protocol and conforming types (which could be classes, final classes or structs)?

1 Like

Thanks for the suggestion. I rewrote the sample using a protocol with associated types. With or without the print statement or the enclosing if, the runtime is about 10 seconds, so even less optimization than before.

import Cocoa

@main
class AppDelegate: NSObject, NSApplicationDelegate {

    func applicationDidFinishLaunching(_ aNotification: Notification) {
        let x = Array(repeating: "adsf", count: 100000)
        let y = Array(repeating: "adsf", count: 100000)
        let diff = MyInterface(x: x, y: y)
        diff.run()
    }

}

class MyInterface<List: RandomAccessCollection & MutableCollection> where List.Element: Comparable, List.Index == Int {
    
    private let algorithm: any Algorithm<List>
    
    init(x: List, y: List) {
        algorithm = AlgorithmSubclass(x: x, y: y)
    }
    
    func run() {
        algorithm.run()
    }
    
}

protocol Algorithm<List> {
    associatedtype List: RandomAccessCollection where List.Element: Equatable, List.Index == Int
    
    var x: List { get }
    var y: List { get }
    func run()
}

final class AlgorithmSubclass<List: RandomAccessCollection>: Algorithm where List.Element: Equatable, List.Index == Int {
    
    var x: List
    var y: List
    
    init(x: List, y: List) {
        self.x = x
        self.y = y
    }
    
    func run() {
        var count = 0
        for _ in 0..<1000 {
            for i in 0..<min(x.endIndex, y.endIndex) {
                if x[i] == y[i] {
                    count += 1
                }
            }
        }
        let alert = NSAlert()
        alert.messageText = "\(count)"
        alert.runModal()
    }
    
}
1 Like

That doesn't surprise me - there's still indirection involved with protocols, much like for subclasses.

I suspect what you're seeing is the difference between x being "inlined" vs not, in a nutshell. By accessing x from outside your Algorithm instance, you might be unwittingly preventing the compiler from inlining something performance-critical. Your trivial algorithm should (in principle) be optimised down to enumerating two blocks of contiguous memory and accumulating the result into a register. It should be just a couple of dozen instructions. But, even though in principle that's still possible with your original code, the compiler is not perfect.

You could verify this hypothesis by looking at the disassembly (e.g. with Hopper).

If that is correct, then you might be able to work around it by making x (and y) protected, and storing the count in a separate member variable of Algorithm.