Why does separating this nested loop in Swift run roughly 10x faster?

rswift · May 24, 2021, 10:43pm

I'm building a game using Swift. I've discovered that by separating the inner part of a pair of nested loops iterating over an array, the performance increases by nearly a whopping 10x. Why is this? Is it easier for the optimizer when it's separated in this way? Am I not understanding the reference counting overhead correctly here?

I'm on Xcode 12.5, and I've also tested on Xcode 12.4. I'm targeting macOS (Intel) with a release build.

I've recreated the problem in the sample code below. The Test1 and Test2 classes show the two implementations, with Test2 running nearly 10x faster.

Test1 outputs: 0.006973981857299805
Test2 outputs: 0.0007460117340087891

import Foundation
final class Body {
    let value1: Int
    let value2: Int
    let value3: Int
    let value4: Int
    init(value1: Int, value2: Int, value3: Int, value4: Int) {
        self.value1 = value1
        self.value2 = value2
        self.value3 = value3
        self.value4 = value4
    }
    static func createBodies() -> [Body] {
        var bodies: [Body] = []
        for i in 1 ... 500 {
            // Fill with some arbitrary values.
            bodies.append(Body(value1: Int(i), value2: Int(i*10), value3: Int(i*100), value4: Int(i*1000)))
        }
        return bodies
    }
}

final class Test1 {
    var bodies: [Body]
    init() {
        self.bodies = Body.createBodies()
    }
    func run() -> Int {
        var total: Int = 0
        for body1 in bodies {
            var innerTotal = 0
            for body2 in bodies {
                // some random computation
                innerTotal += body1.value1*body2.value1+body1.value2*body2.value2+body1.value3*body2.value3+body1.value4*body2.value4
            }
            total += innerTotal
        }
        return total
    }
}

final class Test2 {
    var bodies: [Body]
    init() {
        self.bodies = Body.createBodies()
    }
    func helper(body1:Body, bodies: [Body]) -> Int {
        var innerTotal: Int = 0
        for body2 in bodies {
            // some random computation
            innerTotal += body1.value1*body2.value1+body1.value2*body2.value2+body1.value3*body2.value3+body1.value4*body2.value4
        }
        return innerTotal
    }
    func run() -> Int {
        var total: Int = 0
        for body1 in bodies {
            total += helper(body1: body1, bodies: bodies)
        }
        return total
    }
}


final class Main {
    static func main() {
        let test = Test1() // Change this to Test2 for roughly 10x more performance.
        let startTime = CFAbsoluteTimeGetCurrent()
        let total = test.run()
        let elapsedTime = CFAbsoluteTimeGetCurrent() - startTime
        print("elapsedTime: \(elapsedTime)")
        print("total: \(total)")
    }
}

Main.main()
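A single timed run in the sub-millisecond range can be noisy, so it may help to repeat the measurement and keep the fastest time. The sketch below is only an illustration: `measure` and its `runs` parameter are hypothetical names that are not part of the original post, and the workload is a stand-in so the snippet runs on its own.

```swift
import Foundation

/// Runs `work` once as an untimed warm-up, then `runs` more times,
/// and returns the last result together with the fastest time in seconds.
/// `measure` and `runs` are hypothetical names used for illustration only.
func measure<T>(runs: Int = 10, _ work: () -> T) -> (result: T, best: TimeInterval) {
    precondition(runs > 0, "runs must be positive")
    var best = TimeInterval.infinity
    var result = work() // warm-up run, not timed
    for _ in 0..<runs {
        let start = ProcessInfo.processInfo.systemUptime
        result = work()
        let elapsed = ProcessInfo.processInfo.systemUptime - start
        best = min(best, elapsed)
    }
    return (result, best)
}

// Stand-in workload: the sum of 1...500.
let (total, best) = measure { (1...500).reduce(0, +) }
print("total: \(total), best: \(best)")
```

With the code above, the same helper could be used as `let test = Test1()` followed by `measure { test.run() }`, so that building the array is not included in the timing.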

LucianoPAlmeida · May 25, 2021, 12:14am

At a quick look, the problem seems to be that in Test1 you are accessing the mutable class member bodies inside the outer loop (for body2 in bodies here goes through Test1.bodies.getter), and that is what makes it slow. When you move the inner loop into a separate function, the second loop reads the non-mutating bodies parameter instead. That access doesn't have to account for mutability, so it is faster.
So the solution to your performance issue is to make bodies a let:

final class Test1 {
    let bodies: [Body] // Make it a let instead of a var 
    init() {
        self.bodies = Body.createBodies()
    }

Here is why the getter for your let (immutable) property is faster.
Emitted code for the let bodies: [Body] getter:

output.Test1.bodies.getter : [output.Body]:
        mov     rdi, qword ptr [r13 + 16]
        jmp     swift_retain@PLT

Emitted code for the var bodies: [Body] getter:

output.Test1.bodies.getter : [output.Body]:
        push    rbx
        sub     rsp, 32
        lea     rdi, [r13 + 16]
        lea     rsi, [rsp + 8]
        xor     edx, edx
        xor     ecx, ecx
        call    swift_beginAccess@PLT
        mov     rbx, qword ptr [r13 + 16]
        mov     rdi, rbx
        call    swift_retain@PLT
        mov     rax, rbx
        add     rsp, 32
        pop     rbx
        ret

Note that with bodies as a var, the getter has to emit an extra swift_beginAccess call, which I believe is there to enforce exclusivity and possibly other mutation guarantees that I'm not aware of (maybe there is more going on). In the end, that is what may be causing the performance issue.
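As a rough illustration of what that runtime check guards against (this is my understanding of Swift's exclusivity enforcement, not something read off the assembly above), here is a small sketch. `Holder` and `modify` are hypothetical names. The conflicting call is left commented out because it would be expected to trap at runtime with a "simultaneous accesses" error.

```swift
final class Holder {
    var items: [Int] = [1, 2, 3] // mutable class property, so accesses are checked at runtime
}

/// Holds a modify access to `array` for the whole call and appends whatever `body` returns.
func modify(_ array: inout [Int], _ body: () -> Int) {
    array.append(body())
}

let holder = Holder()

// Conflicting: the closure would read holder.items while the inout access is still active.
// modify(&holder.items) { holder.items.count }

// Non-conflicting: finish the read first, then start the modify access.
let count = holder.items.count
modify(&holder.items) { count }
print(holder.items)
```

A let property can never be modified after initialization, so there is nothing for such a check to protect, which fits with the shorter getter shown above.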

So in short, the solution is just to change the bodies property of Test1 from var to let =]
Hope that helps :)

rswift · May 25, 2021, 1:15am

Thanks for the feedback. While the reasoning sounds good, it unfortunately didn't appear to make a significant difference. If you can try running this code yourself, I'd be interested in seeing whether you can reproduce the performance issue. I've run this on two different Macs and with different Xcode versions so far, and I can consistently reproduce it.

I've also tried disabling the exclusive access to memory checks, and the performance issue still persists. I'm very confused by this.

LucianoPAlmeida · May 25, 2021, 1:24am

That is strange. I could indeed reproduce the issue and see the difference when changing ... are you running that in release mode (-O)?
let bodies : 0.0009009838104248047
var bodies: 0.004495978355407715
Both on Test1 class :)

LucianoPAlmeida · May 25, 2021, 1:27am

I would post the whole code I just ran, but it is literally just

final class Test1 {
    - var bodies: [Body]
    + let bodies: [Body]
    init() {
        self.bodies = Body.createBodies()
    }
...

rswift · May 25, 2021, 1:35am

Perhaps I spoke too soon, you are correct. It turns out this fixes it on Xcode 12.5 but NOT 12.4. Thanks again!

I never realized how much overhead can come from accessing the getter of a mutable array. I often try to stick with structs for performance-critical code, but sometimes I end up needing to work with classes, and that's usually when I run into performance issues. I'll be sure to keep this in mind now when accessing the getter of a mutable member inside a performance-critical loop.
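For cases where the property really has to stay a var, one pattern that follows from this is to read it once into a local constant before the loops, which is close to what passing bodies as a parameter does in Test2. The sketch below is an assumption on my part and was not benchmarked in this thread. `Test3` is a hypothetical name, and `Body` is cut down to a single field to keep the snippet short.

```swift
final class Body {
    let value1: Int
    init(value1: Int) { self.value1 = value1 }
}

final class Test3 {
    var bodies: [Body] // stays mutable
    init(bodies: [Body]) { self.bodies = bodies }

    func run() -> Int {
        // Read the property once; both loops then iterate over the local constant.
        let bodies = self.bodies
        var total = 0
        for body1 in bodies {
            for body2 in bodies {
                total += body1.value1 * body2.value1
            }
        }
        return total
    }
}

let test3 = Test3(bodies: (1...3).map { Body(value1: $0) })
print(test3.run()) // (1 + 2 + 3) squared = 36
```

Copying the array into a local does not copy the elements, since Swift arrays are copy-on-write, so the extra line should be cheap. Whether it recovers the full speedup on a given Xcode version would need to be measured.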

LucianoPAlmeida · May 25, 2021, 1:39am

No problem, happy to help =]