Swift loop execution time is 20x slower when adding print statement that is executed once at the end

As far as I know there is no protected access modifier in Swift.

In the class/subclass version of the sample code, making the x and y properties private would probably solve the problem as that would automatically prevent the print statement, which we already saw prevents inlining.

In the protocol version of the sample code, removing x and y from the protocol declaration and making them private in AlgorithmSubclass unfortunately still runs in about 10 seconds.

protocol Algorithm<List> {
    associatedtype List: RandomAccessCollection where List.Element: Equatable, List.Index == Int
    
    func run()
}

final class AlgorithmSubclass<List: RandomAccessCollection>: Algorithm where List.Element: Equatable, List.Index == Int {
    
    private var x: List
    private var y: List
 ...

Just curious about the outer loop. It could be manually optimised away.

class AlgorithmSubclass<List: RandomAccessCollection>: Algorithm<List> where List.Element: Equatable, List.Index == Int {
    
    override func run() {
        let N = 1000
        var count = 0
    #if false
        for _ in 0..<N {
            for i in 0..<min(x.endIndex, y.endIndex) {
                if x[i] == y[i] {
                    count += 1
                }
            }
        }
    #else
        count = 0
        for i in 0..<min(x.endIndex, y.endIndex) {
            if x[i] == y[i] {
                count += 1
            }
        }
        count *= N
    #endif
        print ("-->", count)
    }
}
with print
100000
--> delta: 0.07785999774932861

without print
--> delta: 0.07330799102783203

I'm not experienced in these kind of things, so I wouldn't know what do check.

Sorry, I thought it was clear enough that it's just a dummy sample code. My real code does a completely different thing. The 1000 times repeated loop is just to highlight the runtime difference.

1 Like
Here's my version which takes 0.1 sec.
import Foundation

func test() {
    let x = Array(repeating: "adsf", count: 100000)
    let y = Array(repeating: "adsf", count: 100000)
    let alg = AlgorithmClass(x: x, y: y)
    let diff = MyInterface(algorithm: alg)
    diff.run()
}

test()

// or class:
struct MyInterface<Alg: Algorithm> {
    let algorithm: Alg
    // uncomment for class:
    // init(algorithm: Alg) { self.algorithm = alg }
    func run() {
        algorithm.run()
    }
}

protocol Algorithm<List> {
    associatedtype List: RandomAccessCollection where List.Element: Equatable, List.Index == Int
    var x: List { get }
    var y: List { get }
    func run()
}

final class AlgorithmClass<List: RandomAccessCollection>: Algorithm where List.Element: Equatable, List.Index == Int {
    
    var x: List
    var y: List
    
    init(x: List, y: List) {
        self.x = x
        self.y = y
    }
    
    func run() {
        print("start run")
        let start = Date()
        var count = 0
        for _ in 0..<1000 {
            for i in 0..<min(x.endIndex, y.endIndex) {
                if x[i] == y[i] {
                    count += 1
                }
            }
        }
        let elapsed = Date().timeIntervalSince(start)
        print("alert: \(count), elapsed: \(elapsed)")
    }
}

I converted it to a command line app to ease testing.

Another idea would be to not have x/y stored but pass it to "run".

Thanks, that's an interesting variation. But unfortunately it doesn't work anymore when adding an if that dynamically selects the algorithm to run, which brings runtime back to 10 seconds for me:

import Foundation

func test() {
    let x = Array(repeating: "adsf", count: 100000)
    let y = Array(repeating: "adsf", count: 100000)
    let alg: any Algorithm
    if Bool.random() {
        alg = AlgorithmClass(x: x, y: y)
    } else {
        alg = AlgorithmClass(x: x, y: y)
    }
    alg.run()
}

test()

protocol Algorithm<List> {
    associatedtype List: RandomAccessCollection where List.Element: Equatable, List.Index == Int
    var x: List { get }
    var y: List { get }
    func run()
}

final class AlgorithmClass<List: RandomAccessCollection>: Algorithm where List.Element: Equatable, List.Index == Int {
    
    var x: List
    var y: List
    
    init(x: List, y: List) {
        self.x = x
        self.y = y
    }
    
    func run() {
        print("start run")
        let start = Date()
        var count = 0
        for _ in 0..<1000 {
            for i in 0..<min(x.endIndex, y.endIndex) {
                if x[i] == y[i] {
                    count += 1
                }
            }
        }
        let elapsed = Date().timeIntervalSince(start)
        print("alert: \(count), elapsed: \(elapsed)")
    }
    
}

Maybe pass x/y to run? Along with removing associated type from Algorithm.

This code, takes 0.1 sec on my machine
import Foundation

func test() {
    let x = Array(repeating: "adsf", count: 100000)
    let y = Array(repeating: "adsf", count: 100000)
    let alg: any Algorithm
    if Bool.random() {
        alg = AlgorithmClass()
    } else {
        alg = AlgorithmClass()
    }
    alg.run(x: x, y: y)
}

test()

// or class:
struct MyInterface<A: Algorithm> {
    let algorithm: A
    // uncomment for class:
    // init(algorithm: Alg) { self.algorithm = alg }
    func run<List: RandomAccessCollection>(x: List, y: List) where List.Element: Equatable, List.Index == Int {
        algorithm.run(x: x, y: y)
    }
}

protocol Algorithm {
    func run<List: RandomAccessCollection>(x: List, y: List) where List.Element: Equatable, List.Index == Int
}

final class AlgorithmClass: Algorithm {
    
    func run<List: RandomAccessCollection>(x: List, y: List) where List.Element: Equatable, List.Index == Int {
        print("start run")
        let start = Date()
        var count = 0
        for _ in 0..<1000 {
            for i in 0..<min(x.endIndex, y.endIndex) {
                if x[i] == y[i] {
                    count += 1
                }
            }
        }
        let elapsed = Date().timeIntervalSince(start)
        print("alert: \(count), elapsed: \(elapsed)")
    }
}

Looking at this issue from a different perspective. This loop:

	for i in 0..<min(x.endIndex, y.endIndex) {
		if x[i] == y[i] {
			count += 1
		}
	}

when is used in its existential version causes two mallocs of size 40, two mallocs of size 16 and the corresponding 4 frees, and this is for each iteration step! Is malloc essential here and could it not be changed to stack allocation via, say, withUnsafeTemporaryAllocation? If it could – the existential version would be much faster!

It might work in the sample, but in my case Algorithm has extensions that access x and y. I could change them to also take x and y as parameters... although that would be less than ideal. Also each private method in AlgorithmClass called by run() would need to repeat the generic List declaration, making it harder to understand that all methods of AlgorithmClass always operate on the same type.

This version has a manually written specialisation for array and for array-of-stings:

        for _ in 0..<1000 {
            func loopBodyString(x: [String], y: [String]) {
                for i in 0..<min(x.endIndex, y.endIndex) {
                    if x[i] == y[i] { count += 1 }
                }
            }
            func loopBodyArray<T: Equatable>(x: [T], y: [T]) {
                for i in 0..<min(x.endIndex, y.endIndex) {
                    if x[i] == y[i] { count += 1 }
                }
            }
            func loopBodyCollection<T: RandomAccessCollection>(x: T, y: T) where T.Element: Equatable, T.Index == Int {
                for i in 0..<min(x.endIndex, y.endIndex) {
                    if x[i] == y[i] { count += 1 }
                }
            }
            if let x = x as? [String], let y = y as? [String] {
                loopBodyString(x: x, y: y) // 0.07 sec
            } else if let x = x as? [List.Element], let y = y as? [List.Element] {
                loopBodyArray(x: x, y: y) // 1.6 sec
            } else {
                loopBodyCollection(x: x, y: y) // 9 sec
            }
        }

It's not ideal though as you have to repeat yourself.

1 Like

I've been looking at this today. The actual problem is the line

if (0..<1).randomElement() == 0 {

which for some reason is stopping the compiler from specializing the generic and inlining everything. So, with that line in your code, it runs as unspecialized generic code, which is indeed very slow. Without, it gets optimized and it's a lot faster.

1 Like

The problem is that Algorithm.run is not final and dynamically dispatched via a vtable. The compiler cannot specialize methods called via a vtable. And there is a huge performance difference between specialized and non-specialized code.

The difference between the good and bad version is that in the good version MyInterface.run is inlined into main. Only in this case, the compiler knows the real type of the class when calling run. It can de-virtualize the call and specialize everything.

Removing print or randomElement just changes the inlining threshold for the function. In the good version, the compiler decides to inline (because the function is smaller), in the bad version, the function is too large for being selected for inlining.

4 Likes

Thanks for your input. It's just surprising to me that an arbitrary threshold can make such a big difference in the performance of a program. If at least there was a compiler warning or a way of knowing that this happens without having to use Instruments or other tools...

2 Likes

Excellent analysis!

Here are my "Five whys" as a folowup (well, 4):

  • "The compiler cannot specialize methods called via a vtable."

    • Can compiler be changed to specialise methods called via a vtable?
  • "There is a huge performance difference between specialized and non-specialized code."

  • "Only in this (inlined) case, the compiler knows the real type of the class when calling run."

    • Can compiler be changed to know the real type in case of non inlined function?
  • "In the bad version, the function is too large for being selected for inlining."

    • Can compiler be changed so those various optimisations are applied without functions being inlined?
2 Likes
  • Can compiler be changed to specialise methods called via a vtable?

Yes, we could do that. But it's not trivial

Can this performance difference be reduced?

Probably yes, by improving optimizations. But not in a big scale.

  • Can compiler be changed to know the real type in case of non inlined function?
  • Can compiler be changed so those various optimisations are applied without functions being inlined?

Well, that's not so easy, because a function can be called from different call sites with different concrete types. It would be possible to so some kind of function signature optimization or specialization of a function for a concrete type. But that also results in duplicated code (= the main problem with inlining) in most cases.

2 Likes

It looks like @_specialize works in this case (add a @_specialize(where List == [String]) to AlgorithmClass::run and it'll run fast). Note that it seems to be implemented by adding a check to the generic version that branches to the specialized version if the type matches, so it's not completely free (vs if it did it by replacing vtable pointers), and _specialize being underscored means it may break / fail to compile in future Swift versions.

2 Likes

Thanks for the suggestion! Since it's underscored, as you said, I will probably avoid it. But it would be an idea if sometimes in the future the Swift compiler recognized when a specialized version of some code would yield much better performance, and suggest to add an annotation that would either keep the specialization or suppress the suggestion/warning.