Massive code execution slowdown with Xcode 16

I have some code that uses Grand Central Dispatch for concurrency, and after upgrading macOS and upgrading Xcode to version 16, I am seeing a massive slowdown in the code that runs concurrently. I have set the Swift version to 5 and to 4.2 in the build settings, and that seems to make no difference. The code produces correct results but is slow as molasses. Has anyone else experienced this? I wasn't ready to move to Swift 6 concurrency yet, and I don't know whether it is possible (or practical) to downgrade Xcode and Swift. The slowdown is a factor of 20 to 30 (roughly 7 seconds to 2 minutes 30 seconds). Any suggestions would be appreciated.

It seems more likely that the OS upgrade is to blame than the Xcode upgrade, unless you’ve tried running your built app on an older OS.

1 Like

I didn't think about Sequoia being a factor, but I guess it could be. Certainly disappointing. I only have one Mac to test on. I guess I will be waiting for updates on everything and hoping for an improvement.

You can do more than just wait. You could take a sample of your process and see what it’s doing.

4 Likes

Do you explicitly set the quality of service or execution priority of your queues? The default may have changed in the Dispatch library or in the OS.

I set the QoS/priority for the dispatch queues as below:

let queue = DispatchQueue(
    label: "ConcurrentQueue",
    qos: .userInitiated,
    attributes: [.concurrent]
)
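If the question is whether a default changed underneath you, one thing you could try is checking at runtime what QoS class a block on that queue actually executes at. A minimal Darwin-only sketch, using `qos_class_self()` (the Darwin API that reports the current thread's QoS class); the queue setup mirrors the snippet above:

```swift
import Darwin
import Dispatch

let queue = DispatchQueue(
    label: "ConcurrentQueue",
    qos: .userInitiated,
    attributes: [.concurrent]
)

let group = DispatchGroup()
queue.async(group: group) {
    // Report the QoS class this block actually runs at.
    let qos = qos_class_self()
    if qos.rawValue == QOS_CLASS_USER_INITIATED.rawValue {
        print("running at .userInitiated")
    } else {
        print("unexpected QoS class: \(qos.rawValue)")
    }
}
group.wait()
```

Using `async` plus a group wait (rather than `sync`) matters here: `sync` may run the block on the calling thread, which would report the caller's QoS instead of the queue's.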

Please do profile your code. The first rule of performance work is don't guess :slight_smile:

Even experts very often have the experience of profiling and immediately going "wait THAT'S what's taking the time??"

10 Likes

Spent some time profiling code today. I could not find anything that looked out of the ordinary.

1 Like

Ideally you'd have before and after profiles so you could see what's different, but a 30x difference is something I would expect to show up anyway. Like that suggests that almost all of the work in your profile would be absent if you did it on the old system. Is it something you could post so we could look at the profile?

3 Likes

I would rather not post the profiles. I would rather let this issue go for now.

1 Like

Other than anecdotal perceptions of operations taking longer, what other data do you have? Are there any new log records shown in the log window in Xcode? I have been noticing more log records, and also random variance in the execution path, resulting in what appear to be timing differences at the lowest level of the kernel code.

1 Like

I'm not seeing any new log records, but I did find something interesting during some testing. My application uses all processors, including both performance and efficiency cores (M3 Max: 12 and 4, respectively). When I cut the process count down to 12, it runs approximately 30-40% faster. I did not expect this. Similar code in Python does not behave the same way: the Python code runs faster on all cores than on 12.

When this happened to me in the past it was an issue where the compiler was suddenly unable to vectorize a loop (using SSE) after a minimal and seemingly innocuous change. Profiling didn't show anything unusual because it was spending time in the same function, it was just much slower (from 2-3 seconds up to several minutes).

This sounds like an issue with load balancing. Depending on how the multicore workload is partitioned, you can end up in a situation where the P cores finish their share of the work, but the E cores are way behind on their share. At that point the threads would be promoted from the E cores to the P cores, but it would take some extra time for the P cores to finish the remaining work... That 30-40% extra time you mention would be in the ballpark of what I'd expect, depending on granularity of your work items. Note that if the chunks are big enough this pattern can even be seen trivially by opening Activity Monitor and looking at the occupancy of each core.
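To make the load-balancing point concrete, here is a sketch under stated assumptions: `work(_:)` is a made-up stand-in for one unit of computation, and the numbers are arbitrary. A static even split (one big slice per core) can leave the P cores idle while the E cores grind through their slices; handing out many small chunks via `DispatchQueue.concurrentPerform` instead lets whichever core frees up first grab the next chunk:

```swift
import Dispatch
import Foundation

// Hypothetical stand-in for one unit of work (e.g., one sieving step).
func work(_ i: Int) -> Double {
    (0..<1_000).reduce(0.0) { $0 + sin(Double($1 + i)) }
}

let totalItems = 10_000
var results = [Double](repeating: 0, count: totalItems)

// Many small chunks instead of one big slice per core. Because
// concurrentPerform hands out iterations dynamically, a core that
// finishes early (a P core) simply takes the next chunk, rather than
// waiting on a slower core (an E core) to finish a fixed share.
let chunkSize = 64
let chunkCount = (totalItems + chunkSize - 1) / chunkSize

results.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: chunkCount) { chunk in
        let start = chunk * chunkSize
        let end = min(start + chunkSize, totalItems)
        for i in start..<end {
            buffer[i] = work(i)   // disjoint writes: no synchronization needed
        }
    }
}
```

The chunk size is the tuning knob: small enough that no core ends up stuck with a disproportionate tail of work, large enough that per-chunk dispatch overhead stays negligible.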

It's possible that the similar Python code is balancing the workload in smaller chunks that don't cause this behavior. If you're using a library (e.g., NumPy), it's likely that the library is better at balancing the load across multiple cores.

You should be able to isolate the impact of the OS change vs. the Xcode change fairly easily: take a version of your app that was built with the old Xcode and run it on the new OS.

My guess is that something changed in either the compiler or the SDK which is causing your code to serialize through a single actor when it didn't before.
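As a hypothetical illustration of that failure mode (names invented; not your code): if each parallel iteration does its heavy computation independently and only touches shared state briefly, the loop scales across cores; but if the heavy part itself gets funneled through one serial queue, or through one actor under Swift concurrency, the iterations effectively run one after another:

```swift
import Dispatch

// Invented stand-in for an expensive, independent computation.
func heavyWork(_ i: Int) -> Int {
    (0..<100_000).reduce(0) { $0 + (($1 ^ i) % 3) }
}

let serial = DispatchQueue(label: "shared-state")  // serial by default
var total = 0

DispatchQueue.concurrentPerform(iterations: 8) { i in
    let partial = heavyWork(i)        // runs in parallel: scales fine
    serial.sync { total += partial }  // brief critical section: harmless
    // But if heavyWork itself executed on `serial` (or was isolated to
    // a single actor), all 8 iterations would serialize, and you'd see
    // exactly this kind of order-of-magnitude slowdown with correct results.
}
```

The "correct results, massively slower" symptom you describe is consistent with this pattern, since serialization changes timing but not the computation itself.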

2 Likes

I believe I have been able to isolate the problem to the operating system. I took the application to an old 2013 Intel MacBook (Big Sur and Xcode 13.2.1) and compiled for 12 cores. I took the app back to the new Mac, and it still ran as slow as ever. I then compiled for 4 cores on the old laptop and ran it on the old machine, and it runs faster than the M3 Max laptop running Sequoia (with 12 or 16 cores). So Kyle Sluder's statement that the OS is more likely the culprit looks to be correct. This is the first time I have ever migrated to a new OS at version .0, and I won't make that mistake again.

1 Like

How are you distributing work between the cores? Does your overhead perhaps grow faster than the potential parallelism of your workload?

This reminds me of a bug I encountered years ago, where an app I worked on naïvely threw thousands of work items at parallel dispatch queues under the assumption that libdispatch would “just handle it”. It seemed to work at first, but then some upgrade caused the performance to tank, because the overhead of each work item increased, probably due to a bugfix. The moral of the story was that nothing has yet eliminated the need to batch work into appropriately sized work items.
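That batching moral can be sketched as follows (hypothetical numbers; the per-item body is a stand-in). Rather than enqueueing one dispatch work item per tiny unit, group units into batches so that each enqueue's fixed overhead is amortized over many units:

```swift
import Dispatch

let queue = DispatchQueue(label: "work", qos: .userInitiated,
                          attributes: .concurrent)
let group = DispatchGroup()
let itemCount = 100_000
let batchSize = 1_000

// Anti-pattern (don't do this): one async per tiny item. Each enqueue
// carries fixed overhead, so if an OS update makes that overhead
// larger, total runtime grows with the number of items:
//   for i in 0..<itemCount { queue.async(group: group) { tinyWork(i) } }

// Batched version: 100 work items instead of 100,000.
var partialSums = [Int](repeating: 0, count: itemCount / batchSize)
partialSums.withUnsafeMutableBufferPointer { sums in
    for batch in 0..<(itemCount / batchSize) {
        queue.async(group: group) {
            var local = 0
            for i in (batch * batchSize)..<((batch + 1) * batchSize) {
                local += i % 7   // stand-in for one tiny unit of work
            }
            sums[batch] = local  // disjoint slot per batch: no contention
        }
    }
    group.wait()  // keeps the buffer pointer valid until all batches finish
}
```

Each batch also accumulates into a thread-local `local` and writes shared state once, which avoids contention on top of the reduced dispatch overhead.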

The application in this instance is a Multiple Polynomial Quadratic Sieve factoring algorithm. It lends itself very well to parallel work in the sieving phase. Work is spread very evenly between cores and the work is totally independent. Worked great until the OS upgrade. Still produces correct numbers albeit much more slowly. The application is not the issue. I am confident now that the upgrade to Sequoia is to blame.

The point of my anecdote is that something that changed in Sequoia seems to be interacting poorly with how your application has implemented its parallelism. As you noted, a Python implementation does not suffer from the same performance regression, so it’s not like we accidentally broke multithreading or something.

If you are able to share your code privately, please attach it to a feedback report and share the FB number here.

I asked some Dispatch engineers about this, and they're asking that you file a radar. Ideally you'd attach your actual code (it's confidential), but at minimum you could provide the contrasting profiles, either with Instruments or with trace record or ktrace record.

John, how do you file a radar? I am willing to attach code.