[Pitch] Core team publishes results of performance study: Cooperative Scheduler just introduced, plus compile for non-atomic ARC

johnburkey · June 16, 2023, 9:23pm

A request: publish a study showing the results of turning off Atomics everywhere, and then explain how you might give us some of this turbo through a better runtime that doesn't sprinkle atomics all over our codebases for us for the cases we don't need them. (like Task local allocations, and static let allocations.)

Love Swift, and Love the core team.

ksluder · June 17, 2023, 2:45am

You can’t just “turn off atomics” and maintain correct refcounting in a multithreaded environment.

Are you suggesting a performance study of single-threaded Swift concurrency?

johnburkey · June 17, 2023, 3:26am

Read the title. Use the new cooperative scheduler. Cooperative schedulers don't have concurrent reentrancy. It's just a way to get data. By flipping on the new cooperative scheduler mentioned at WWDC, and flipping the #defines in the C++ code to turn the non atomic ARC on, you can take some measurements about cost.

Now, if the swift compiler were smart about the task local and static lets, it wouldn't emit retain/releases at all for those cases, and it would be almost free, which obviously is better.

But measurement of slow code tends to motivate, so why not start there?
Publishing the results might even inform the community how to make Swift swift.

odmir · June 17, 2023, 12:28pm

Could you post a link to the WWDC video of the new cooperative scheduler? I think I completely missed that.

johnburkey · June 17, 2023, 4:36pm

What’s new in Swift - WWDC23 - Videos - Apple Developer (developer.apple.com)

masters3d · June 17, 2023, 4:56pm

I believe they are talking about other, non-Apple platforms.

For Apple platforms, the context is this video from 2021: Swift concurrency: Behind the scenes - WWDC21 - Videos - Apple Developer

Apple platforms are out of scope for the Swift open source evolution process.

ksluder · June 17, 2023, 5:13pm

Swift Concurrency on Darwin platforms has been based on a cooperatively scheduled thread pool from the beginning. The threads themselves are still preemptively multitasked, but the tasks running on them are cooperatively scheduled across the available threads, maximizing thread occupancy and runnable time.

It is still possible for two cooperatively scheduled tasks running on a thread pool to execute simultaneously while holding strong references to the same object, thus necessitating the use of atomics or locks around refcount operations.
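A minimal sketch of that situation, assuming a hypothetical `SharedCounter` class and `runSharedTasks` helper (neither comes from the Swift runtime or from this thread). Every child task captures the same object, and the pool is free to run those tasks on different threads at the same moment, so the retains and releases of that one object can overlap in time:

```swift
import Foundation  // for NSLock

// Hypothetical shared object. The lock protects `value`; the object's
// reference count is managed separately by the Swift runtime.
final class SharedCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var value = 0

    func increment() {
        lock.lock()
        defer { lock.unlock() }
        value += 1
    }

    var current: Int {
        lock.lock()
        defer { lock.unlock() }
        return value
    }
}

// Hypothetical helper: spawns `taskCount` child tasks that all hold a
// strong reference to the same `SharedCounter`.
func runSharedTasks(taskCount: Int, incrementsPerTask: Int) async -> Int {
    let counter = SharedCounter()
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<taskCount {
            // Each closure captures (retains) `counter`. The child tasks may
            // execute simultaneously on different threads of the pool.
            group.addTask {
                for _ in 0..<incrementsPerTask {
                    counter.increment()
                }
            }
        }
    }
    return counter.current
}
```

The lock only makes the `value` updates safe. Whether the reference-count updates are safe is a separate question, and that is the part the atomics are there for.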

johnburkey · June 18, 2023, 2:37pm

Sounds exciting. How about a core-team-run performance analysis of a compute-bound task that makes tens of thousands of temporary objects and sends one result to another task?

Let’s see what we see. You can imagine why I ask
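For concreteness, the workload shape being described could look roughly like this. It is only a sketch: `Candidate` and `bestScore` are made-up names, the "scoring" is a placeholder, and an optimized build may remove or stack-promote some of these allocations, so this is not a calibrated benchmark.

```swift
// Hypothetical temporary object: one heap allocation per instance.
final class Candidate {
    let score: Int
    init(score: Int) { self.score = score }
}

// Hypothetical workload: a detached task creates many short-lived objects
// that never leave it, then hands a single Int back to the awaiting task.
func bestScore(candidateCount: Int) async -> Int {
    let worker = Task.detached { () -> Int in
        var best = Int.min
        for i in 0..<candidateCount {
            let candidate = Candidate(score: i)  // temporary, local to this task
            if candidate.score > best {
                best = candidate.score
            }
        }
        return best  // the only value that crosses to another task
    }
    return await worker.value
}
```

None of the `Candidate` instances is ever visible outside the worker task; only the final `Int` is.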

John_McCall · June 19, 2023, 9:22pm

We're always very interested in ideas for optimizing the implementation. If you'd like to contribute to that, you're welcome to start a thread over in Development. This doesn't seem like a promising start to that conversation, though; it feels like you're trying to tell us in a roundabout way that atomic reference counting has costs, which is a bit condescending.

Kyle is correct that we are required to use atomic reference counting in general because the runtime environment is concurrent. We do have some static optimizations to use non-atomic reference counting in some cases where we can prove that an object hasn't escaped a thread, and we're looking into ways to enable that in more situations. Doing that optimization dynamically sounds great in theory, but every technique I've seen for it relies on widespread code instrumentation that's often unprofitable outside of a JIT and a tracing GC (in which many of the costs can be taken for granted anyway).

If you have concrete insights to share based on your understanding of our environment, please feel free to do so. For the most part, though, we are focused on finding other ways to reduce refcounting costs, such as eliminating copies and allowing allocations to be avoided through the ownership system.

johnburkey · June 19, 2023, 10:25pm

I’ve always believed that measuring performance, publishing results, and publishing a plan to address results is a straightforward way to get real work done.

You will notice that’s what I sent.

I’ve found the replies have been mostly conjecture and emotional defensiveness instead of analysis.

One of the things we do for a big performance gain is override the debug hooks for retain/release and have a special class type that writes a value into a data member on the Swift side that tells our hooks to ignore all retain/release calls.

For “static let” instances and other permanent objects it’s a BIG win (no atomics), but obviously the compiler is still spraying function calls on top of most Swift object pointer dereferences. You could imagine someone looking into that motif and getting our win plus fewer function calls for everyone.

Imagine a type that the compiler can see that just says “no ARC”. That could be a Sendable kind of decoration on the Swift side that changes the type in a way the compiler sees and that causes a codegen change.

Or it could be that the compiler sniffs out the use case and annotates them. Any language team's job is to give library writers the tools to write performant code safely. There's a set of idioms here that are a LOT safer than Unsafe. And give us some turbo.

We did this work because Instruments clearly shows retain/release as a huge cost. So we analyzed, and then planned an attack. And got the win. The key thing with performance is to change the behavior of the code and remeasure. Atomics and other things that cause pipeline behavior changes in modern CPUs tend to have downstream performance costs that are best measured with A/B testing, not just by spot-checking profiles.

I don’t see a lot of even acknowledgement that retain/release dominates CPU-bound profiles in many cases? You know how they say admitting a problem is the first step...

In our data, it is clear that for CPU-bound stuff it’s a bigger win than going to Apple silicon. Seems like it’s worth a look.

Love Swift and hope one day its performance catches up to Rust and C++. It’s got Python beat, and that’s truly inspiring.

AlexanderM · June 19, 2023, 11:39pm

Hey John,

Respectfully... What?

It's abundantly clear to anyone in the space that ARC isn't free, and even from my uninformed outsider perspective, the compiler has had more ARC-related optimization work than any other kind, from what I can see.

I've read this whole thread, and it's still not clear to me what kind of performance measurements you're looking for, or what kind of improvements you're after. Your thesis, as far as I understand it, is "Swift's ARC has a performance cost, and we should come to acknowledge that," (as if we hadn't) to which my response is: "ummmm yeah, obviously?"

johnburkey · June 19, 2023, 11:51pm

I asked for a study that shows the actual (versus thread guessing like here) costs, and then proposes a plan to fix the costs. And no I don't believe the language team actually knows how expensive ARC is because I don't believe they've checked in a while, otherwise they would quote numbers. For example, the SwiftNIO people have a lot of perf stuff in their code, trying to dodge ARC, so I think they looked.

Our results are greater than 10x cost for compute intensive work that generates temporaries, especially bad if Arrays are used.

It's not enough to say, oh yeah, expensive. That could mean 20%. Or it could mean 10x.

10x is in the realm of, "don't use that language because it's like using chips from 5 years ago for your app".

"We" want to push Atomic ARC to a rare use case because it's expensive. And because obviously, the idiom of almost all modern code is to use local objects that aren't actually visible to other threads, and then probably push a result to another thread where it's now owned there and still local. So no need for Atomics! That's for things that actually are visible to other threads. The idea is to use languages like Swift to communicate to the compiler which things are visible and used in ways that require Atomics, and not just make everything atomic all the time. (Because that's wasteful)

So we might:

1. Activate a set of code that has non-atomic ARC
2. Activate a set of code in the compiler that generates code that does not use ARC calls all over the place

Use (1) when you still need refcounting
Use (2) when you don't need refcounting
You don't need refcounting for permanent objects.
You can use non-atomic refcounting for non-Sendable Task-local objects, and stack / func local objects.

AlexanderM · June 19, 2023, 11:58pm

Did you miss this response from John McCall?

There is already an "immortal" object optimization, but IIRC such objects still get calls to retain/release; they just return early once they see the immortal bit is set. It's not zero-cost, but there are no atomic operations there (or even ref count increments/decrements).
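As a toy model of that early return (this is not the runtime's actual header layout or code; `ToyRefCount` is a made-up type, and its counter is deliberately a plain non-atomic integer so the control flow is easy to see):

```swift
// Toy model only. The real Swift runtime packs its counts and flags
// differently and updates them atomically.
struct ToyRefCount {
    private(set) var strongCount: UInt32
    let isImmortal: Bool

    init(isImmortal: Bool) {
        self.strongCount = 1
        self.isImmortal = isImmortal
    }

    mutating func retain() {
        // The call still happens, but an immortal object returns
        // before touching the count.
        if isImmortal { return }
        strongCount += 1
    }

    /// Returns true when the last reference is gone and the object
    /// should be deallocated.
    @discardableResult
    mutating func release() -> Bool {
        if isImmortal { return false }
        strongCount -= 1
        return strongCount == 0
    }
}
```

The point of the model is that the function call and the branch remain; only the count update is skipped.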

johnburkey · June 20, 2023, 12:07am

We are aware of those optimizations and use them and our retain/release hooks on top. Immortal alone doesn't give you enough kick.

You may note for example, that that's all class reference stuff, despite Apple saying we should be using structs in most perf guidance. But because of how structs are implemented they trigger a lot of un-dodge-able costs. This is all a dense thicket of things that can be cleaned up and made much simpler by what we are talking about.

Remember, I'm saying: publish a performance study showing the cost of ARC for compute-bound ops generating temporaries, and then publish a plan to fix the costs. They are there. In our measurements they cost > 10x, and dominate profiles.

We use Swift on all platforms, including RaspPi and Jetson. We have a large AI that does very serious concurrent compute. We can't even use the async/await stuff because its costs per op are prohibitive. We use dispatch and concurrent queues with our permanent objects in pools instead. This isn't conjecture. But when the language doesn't give you the ability to speak clearly to the compiler, you end up doing weird things and causing complexity to solve for your customer that would be unnecessary otherwise.

We volunteer again for the swift team to invite us to a performance lab and we can study ARC's effects on our code base as it processes tens of millions of items in our AI and then generates probabilistic results of a few items. Like when you say "play Madonna" in something like Siri. This isn't a quick search of an array of a couple hundred thousand items. It's many orders of magnitude larger.

AlexanderM · June 20, 2023, 12:17am

Hey John, several thoughts:

Your previous messages don't make this clear. They were written in a way that sounds like you're suggesting to add these, as if they didn't already exist. As a general trend, it seems none of your posts mention/acknowledge any of the past/present/future work in this area, and many of your points are making suggestions for things that have been implemented for years already.

Sure, I would welcome that. Your phrasing makes it sound like there's a secret formula for the Krabby Patty that Mr. Krabs is holding back in some grand conspiracy. I'll point out that the GitHub project already has a CI suite that runs on almost every PR, so it's pretty ... out there. If the proposal is to make a more polished/public summary of benchmarks and performance trends, then I welcome that.

If those CI-run benchmarks aren't addressing your concerns, then it would be constructive if you specified where you think they fall short.

Your measurements sound intriguing. Perhaps you should be the one to publish stuff, and we should be making demands of you instead :)

Do the new ownership features help your issues?

Respectfully, a substantial portion of your posts read as dramatized, non-constructive ranting. You raise vague (but valid) points of performance concern, and seem to disregard the suggestions or answers being given in response. For example:

This is addressing exactly the "the language doesn't give you the ability to speak clearly to the compiler" point you're trying to make, but you seem to have just... ignored it completely?

I don't see any mention of leveraging the ownership system (or why it doesn't fit your needs) in any of your posts, so it kind of feels like we're talking to a wall here.

johnburkey · June 20, 2023, 1:46am

Ok, so can we admit that the language should be fast without immortal and unsafe, and ask ourselves why the standard libraries use constructs for performance that are less available to all apps?

Work to do: let’s analyze, iterate and fix

Do appreciate the energy, Alex

Given the dust, and the response from McCall, message received.

Thanks again Alex for your interest in Swift performance!

tera · June 21, 2023, 1:49am

I suggest we have a compiler diagnostic setting that makes retain / release non atomic, a simple integer increment / decrement. It won't work correctly in typical (multithreaded) apps, but it could be useful when running single threaded benchmarks and could answer the question how big ARC overhead actually is.
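The kind of single-threaded benchmark that could be run with and without such a setting might look like the sketch below. The setting itself is not shown or assumed to exist here; `Boxed`, `sumBoxed`, `sumPlain` and `timed` are hypothetical names, and a real measurement would need optimized builds, warm-up and many repetitions.

```swift
// Hypothetical refcounted element type.
final class Boxed {
    let value: Int
    init(_ value: Int) { self.value = value }
}

// Builds an array of heap objects, then sums them. Every element is a
// separate refcounted allocation.
func sumBoxed(_ n: Int) -> Int {
    var items: [Boxed] = []
    items.reserveCapacity(n)
    for i in 0..<n { items.append(Boxed(i)) }
    var total = 0
    for item in items { total &+= item.value }
    return total
}

// Same computation with plain Ints, so the elements need no refcounting.
func sumPlain(_ n: Int) -> Int {
    var items: [Int] = []
    items.reserveCapacity(n)
    for i in 0..<n { items.append(i) }
    var total = 0
    for item in items { total &+= item }
    return total
}

// Times one run of `body`, prints the elapsed time, and returns the result.
func timed(_ label: String, _ body: () -> Int) -> Int {
    let clock = ContinuousClock()
    var result = 0
    let elapsed = clock.measure { result = body() }
    print("\(label): \(elapsed)")
    return result
}

_ = timed("boxed") { sumBoxed(100_000) }
_ = timed("plain") { sumPlain(100_000) }
```

The gap between the two timings mixes allocation cost with refcounting cost, which is why an A/B switch on the refcounting alone would be the more informative comparison.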

johnburkey · June 21, 2023, 2:03am

What’s great about that idea is it means we can toss our code we want to investigate in a single threaded context and we can all learn - sometimes there will be something we can do ourselves- and sometimes it will be something the language and runtime team can enable -

Great idea

johnburkey · June 21, 2023, 2:17am

I’ve also realized-

Apple folk get a new piece of hardware every year, and the customer pays for it. And because Apple is a monopoly, people pay. And I love that.

We are building software that goes into smart devices that we design, and if the sw is 5x faster it means we buy a cheaper chip and pass the savings to the customer.

Very different! So we want performance because it affects our cost, our heat sinks, our power inputs, everything!

Let’s try Tera’s idea

soumyamahunt · June 21, 2023, 11:17am

Swift Concurrency introduces the idea of Actors and Sendable types. From my understanding:

Actor isolated data can't be accessed at the same time by cooperatively scheduled tasks.
Only Sendable data types can cross actor boundaries, permitting access to cooperatively scheduled tasks.
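A small sketch of those two rules at the language level, with made-up names (`Scratch`, `Accumulator`). It only illustrates the isolation rules; it says nothing about which refcount operations the compiler actually emits for `scratch`:

```swift
// Not Sendable: a mutable class with no synchronization of its own.
final class Scratch {
    var values: [Int] = []
}

actor Accumulator {
    // Actor-isolated state. The reference never leaves the actor, so only
    // one task at a time can reach it.
    private let scratch = Scratch()

    // Only Sendable values (Int in, Int out) cross the actor boundary.
    func add(_ x: Int) -> Int {
        scratch.values.append(x)
        return scratch.values.reduce(0, +)
    }
}
```

Callers have to write `await accumulator.add(1)`, and there is no way to get the `Scratch` reference out through this interface.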

Can't this behavior be used to disable atomics:

For non-Sendable objects.
For actor isolated objects.

Correct me if my understanding is wrong about the runtime.