Massive code execution slowdown with Xcode 16

I have some code that uses Grand Central Dispatch for concurrency, and after upgrading macOS and upgrading Xcode to version 16, I am seeing a massive slowdown in the code that runs concurrently. I have set the Swift version to 5 and to 4.2 in the build settings, and that seems to make no difference. The code produces correct results but is slow as molasses. Has anyone else experienced this? I wasn't ready to move to Swift 6 concurrency yet, and I don't know whether it is possible (or practical) to downgrade Xcode and Swift. The slowdown is a factor of 20 to 30 (roughly 7 seconds to 2 minutes 30 seconds). Any suggestions would be appreciated.

It seems more likely that the OS upgrade is to blame than the Xcode upgrade, unless you’ve tried running your built app on an older OS.

1 Like

I didn't think about Sequoia being a factor, but I guess it could be. Certainly disappointing. I only have one Mac to test on. I guess I will be waiting for updates on everything and hoping for an improvement.

You can do more than just wait. You could take a sample of your process and see what it’s doing.

4 Likes

Do you explicitly set the quality of service or execution priority of your queues? The default may have changed in the Dispatch library or in the OS.

I set the QoS/priority for the dispatch queues as below:

let queue = DispatchQueue(
    label: "ConcurrentQueue",
    qos: .userInitiated,
    attributes: [.concurrent]
)
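If the question is whether a default changed underneath you, one thing you could try is checking at runtime what QoS class a block on that queue actually executes at. A minimal Darwin-only sketch, using `qos_class_self()` (the Darwin API that reports the current thread's QoS class); the queue setup mirrors the snippet above:

```swift
import Darwin
import Dispatch

let queue = DispatchQueue(
    label: "ConcurrentQueue",
    qos: .userInitiated,
    attributes: [.concurrent]
)

let group = DispatchGroup()
queue.async(group: group) {
    // Report the QoS class this block actually runs at.
    let qos = qos_class_self()
    if qos.rawValue == QOS_CLASS_USER_INITIATED.rawValue {
        print("running at .userInitiated")
    } else {
        print("unexpected QoS class: \(qos.rawValue)")
    }
}
group.wait()
```

Using `async` plus a group wait (rather than `sync`) matters here: `sync` may run the block on the calling thread, which would report the caller's QoS instead of the queue's.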

Please do profile your code. The first rule of performance work is don't guess :slight_smile:

Even experts very often have the experience of profiling and immediately going "wait THAT'S what's taking the time??"

10 Likes

Spent some time profiling code today. I could not find anything that looked out of the ordinary.

1 Like

Ideally you'd have before and after profiles so you could see what's different, but a 30x difference is something I would expect to show up anyway. Like that suggests that almost all of the work in your profile would be absent if you did it on the old system. Is it something you could post so we could look at the profile?

3 Likes

I would rather not post the profiles. I would rather let this issue go for now.

1 Like

Other than anecdotal perceptions of operations taking longer, what other data do you have? Are there any new log records shown in the log window in Xcode? I have been noticing more log records, and also random variance in the execution path, resulting in what appear to be timing differences at the lowest level of the kernel code.

1 Like

I'm not seeing any new log records, but I did find something interesting during some testing. My application uses all processors, including both performance and efficiency cores (M3 Max: 12 and 4, respectively). When I cut the process count down to 12, it runs approximately 30-40% faster. I did not expect this. Similar code in Python does not behave the same way: the Python code runs faster on all cores than on 12.

When this happened to me in the past it was an issue where the compiler was suddenly unable to vectorize a loop (using SSE) after a minimal and seemingly innocuous change. Profiling didn't show anything unusual because it was spending time in the same function, it was just much slower (from 2-3 seconds up to several minutes).

This sounds like an issue with load balancing. Depending on how the multicore workload is partitioned, you can end up in a situation where the P cores finish their share of the work, but the E cores are way behind on their share. At that point the threads would be promoted from the E cores to the P cores, but it would take some extra time for the P cores to finish the remaining work... That 30-40% extra time you mention would be in the ballpark of what I'd expect, depending on granularity of your work items. Note that if the chunks are big enough this pattern can even be seen trivially by opening Activity Monitor and looking at the occupancy of each core.
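To make the load-balancing point concrete, here is a sketch under stated assumptions: `work(_:)` is a made-up stand-in for one unit of computation, and the numbers are arbitrary. A static even split (one big slice per core) can leave the P cores idle while the E cores grind through their slices; handing out many small chunks via `DispatchQueue.concurrentPerform` instead lets whichever core frees up first grab the next chunk:

```swift
import Dispatch
import Foundation

// Hypothetical stand-in for one unit of work (e.g., one sieving step).
func work(_ i: Int) -> Double {
    (0..<1_000).reduce(0.0) { $0 + sin(Double($1 + i)) }
}

let totalItems = 10_000
var results = [Double](repeating: 0, count: totalItems)

// Many small chunks instead of one big slice per core. Because
// concurrentPerform hands out iterations dynamically, a core that
// finishes early (a P core) simply takes the next chunk, rather than
// waiting on a slower core (an E core) to finish a fixed share.
let chunkSize = 64
let chunkCount = (totalItems + chunkSize - 1) / chunkSize

results.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
        let start = chunk * chunkSize
        let end = min(start + chunkSize, totalItems)
        for i in start..<end {
            buffer[i] = work(i)   // disjoint writes: no synchronization needed
        }
    }
}
```

The chunk size is the tuning knob: small enough that no core ends up stuck with a disproportionate tail of work, large enough that per-chunk dispatch overhead stays negligible.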

It's possible that the similar Python code is balancing the workload in smaller chunks that don't cause this behavior. If you're using a library (e.g., NumPy), it's likely that the library is better at balancing the load across multiple cores.

You should be able to isolate the impact of the OS change vs. the Xcode change fairly easily: take a version of your app that was built with the old Xcode and run it on the new OS.

My guess is that something changed in either the compiler or the SDK which is causing your code to serialize through a single actor when it didn't before.
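As a hypothetical illustration of that failure mode (names invented; not your code): if each parallel iteration does its heavy computation independently and only touches shared state briefly, the loop scales across cores; but if the heavy part itself gets funneled through one serial queue, or through one actor under Swift concurrency, the iterations effectively run one after another:

```swift
import Dispatch

// Invented stand-in for an expensive, independent computation.
func heavyWork(_ i: Int) -> Int {
    (0..<100_000).reduce(0) { $0 + (($1 ^ i) % 3) }
}

let serial = DispatchQueue(label: "shared-state")  // serial by default
var total = 0

DispatchQueue.concurrentPerform(iterations: 8) { i in
    let partial = heavyWork(i)        // runs in parallel: scales fine
    serial.sync { total += partial }  // brief critical section: harmless
    // But if heavyWork itself executed on `serial` (or was isolated to
    // a single actor), all 8 iterations would serialize, and you'd see
    // exactly this kind of order-of-magnitude slowdown with correct results.
}
```

The "correct results, massively slower" symptom you describe is consistent with this pattern, since serialization changes timing but not the computation itself.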

2 Likes

I believe I have been able to isolate the problem to the operating system. I took the application to an old 2013 Intel MacBook (Big Sur and Xcode 13.2.1) and compiled for 12 cores. I took the app back to the new Mac, and it still ran as slow as ever. I then compiled for 4 cores on the old laptop and ran it on the old machine, and it runs faster than the M3 Max laptop running Sequoia (with 12 or 16 cores). So Kyle Sluder's statement that the OS is more likely the culprit looks to be correct. This is the first time I have ever migrated to a new OS at version .0, and I won't make that mistake again.

1 Like

How are you distributing work between the cores? Does your overhead perhaps grow faster than the potential parallelism of your workload?

This reminds me of a bug I encountered years ago, where an app I worked on naïvely threw thousands of work items at parallel dispatch queues under the assumption that libdispatch would “just handle it”. It seemed to work at first, but then some upgrade caused the performance to tank, because the overhead of each work item increased, probably due to a bugfix. The moral of the story was that nothing has yet eliminated the need to batch work into appropriately sized work items.
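That batching moral can be sketched as follows (hypothetical numbers; the per-item body is a stand-in). Rather than enqueueing one dispatch work item per tiny unit, group units into batches so that each enqueue's fixed overhead is amortized over many units:

```swift
import Dispatch

let queue = DispatchQueue(label: "work", qos: .userInitiated,
                          attributes: .concurrent)
let group = DispatchGroup()
let itemCount = 100_000
let batchSize = 1_000

// Anti-pattern (don't do this): one async per tiny item. Each enqueue
// carries fixed overhead, so if an OS update makes that overhead
// larger, total runtime grows with the number of items:
//   for i in 0..<itemCount { queue.async(group: group) { tinyWork(i) } }

// Batched version: 100 work items instead of 100,000.
var partialSums = [Int](repeating: 0, count: itemCount / batchSize)
partialSums.withUnsafeMutableBufferPointer { sums in
    for batch in 0..<(itemCount / batchSize) {
        queue.async(group: group) {
            var local = 0
            for i in (batch * batchSize)..<((batch + 1) * batchSize) {
                local += i % 7   // stand-in for one tiny unit of work
            }
            sums[batch] = local  // disjoint slot per batch: no contention
        }
    }
    group.wait()  // keeps the buffer pointer valid until all batches finish
}
```

Each batch also accumulates into a thread-local `local` and writes shared state once, which avoids contention on top of the reduced dispatch overhead.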

The application in this instance is a Multiple Polynomial Quadratic Sieve factoring algorithm. It lends itself very well to parallel work in the sieving phase. Work is spread very evenly between cores and the work is totally independent. Worked great until the OS upgrade. Still produces correct numbers albeit much more slowly. The application is not the issue. I am confident now that the upgrade to Sequoia is to blame.

The point of my anecdote is that something that changed in Sequoia seems to be interacting poorly with how your application has implemented its parallelism. As you noted, a Python implementation does not suffer from the same performance regression, so it’s not like we accidentally broke multithreading or something.

If you are able to share your code privately, please attach it to a feedback report and share the FB number here.

I asked some Dispatch engineers about this, and they're asking that you file a radar. Ideally you'd attach your actual code (it's confidential), but at minimum you could provide the contrasting profiles, either with Instruments or with trace record or ktrace record.

John, how do you file a radar? I am willing to attach code.