Async Channels for Swift concurrency

gh123man · March 18, 2024, 1:20pm

Hi all, Ive been hacking on a library to bring golang style channel support to async swift. Its mostly in a proof of concept state but works quite well so far.

I spend a lot of time on a Vapor backend and iOS front end app and found myself reaching for this pattern after writing a lot of go code. Hoping to get some feedback and possibly ideas to improve it.

Here is a small example of how it works

let msg = Channel<String>(capacity: 3)
let done = Channel<Bool>()

await msg <- "Swift"
await msg <- "❤️"
await msg <- "Channels"
await msg.close()

Task {
    for await message in msg {
        print(message)
    }
    await done <- true
}
await <-done

Philippe_Hausler · March 18, 2024, 9:29pm

That seems quite similar to the swift-async-algorithms AsyncChannel (which was originally designed as a similar use case to go's channels). swift-async-algorithms/Sources/AsyncAlgorithms/Channels/AsyncChannel.swift at main · apple/swift-async-algorithms · GitHub

ktoso · March 18, 2024, 9:44pm

It'd be useful to put all those channel impls through a round of benchmarking and see if they're up to stuff.

Like e.g. this post outlines Jox 0.1: virtual-thread friendly channels for Java

gh123man · March 18, 2024, 9:50pm

@ktoso Thanks for the link - I'll draw some inspiration from this. Benchmarking is top of mind - as I'd like this to be as performant as possible.

I did some "yard stick" tests (like this one) which is far from a conclusive benchmark - just to make sure there were not any obviously horrible performance issues. (TLDR: it's about 7x slower than the equivalent go code) and only slightly slower than the blocking version. However I didn't really put a lot of effort into performance and optimization yet.

ktoso · March 18, 2024, 9:51pm

Yeah performance is a defining factor for channel implementations so I'd recommend writing some good benchmarks, inspired by that post, and also see how the impl @Philippe_Hausler linked is performing and then see what is the bottleneck.

We should be able to get nice performant channels in general.

taylorswift · March 18, 2024, 10:03pm

there are some noticeable problems with binary size when using swift-async-algorithms, so i can certainly see plenty of room for alternatives.

eventually though, my preference would be for these optimized implementations to get upstreamed to swift-async-algorithms, assuming its maintainers are open to that.

gh123man · March 22, 2024, 11:29pm

Ive made several significant performance optimizations and put together some rough benchmarks against equivalent go code. My previous claim about the swift code being ~7x slower than go was completely wrong, in fact it was much, much worse.

Here are the results I've collected so far (note: this is on an experimental benchmarking branch as I have not fully verified that this code is safe)

Some of the optimizations include:

an unsafe ring buffer for buffered channels to bypass array allocations and operations
granular locking with os_unfair_lock for critical sections that never otherwise block or suspend (inspired after reading some code in AsyncAlgorithms)
large reduction in the number of suspension points (mostly as a result of using os locks where possible instead of awaiting continuations).

Ive seem to hit a wall though. Perhaps it's unfair to make comparisons against go to begin with due to the runtime optimization go provides for concurrent work. That aside, I've noticed that simply introducing any amount of task suspension incurs a large performance impact (even in scenarios with no contention among tasks).

Curious if anyone has any ideas on how to further optimize (Or if I'm missing any obvious low hanging fruit). The code in the benchmarking branch is ~4-5x faster than the code in main given the current test cases. I'll continue to test this and merge it to main once I'm sure that it's stable.

taylorswift · March 23, 2024, 12:54am

can you try migrating this test target to a true executable target, and running it in release mode? (-c release)

there is no chance that

@testable import AsyncChannels

is going to beat a benchmark against any real competitor.

gh123man · March 23, 2024, 1:42am

I should clarify, the benchmarks were collected from the sub-project in the benchmark folder - which is an xcode project with the scheme set to the release target. It can be found here

I'll eventually clean up or delete the benchmarks in the package tests. As you stated, they are much slower.

wadetregaskis · March 23, 2024, 6:35am

You're using removeFirst on Array, which is expensive. If you switch to using Deque instead (from swift-collections) it shaves 35% - 45% off the first three benchmarks (but only ~10% off the fourth). On an M2.

With that addressed, a quick look at a Time Profile shows that e.g. testSingleReaderManyWriter spends most of its time doing locking:

~35% FastLock.lock()
~20% FastLock.unlock()

Using OSSpinLock plus some tweaks to release the locks faster shaves another ~40% off the first three benchmarks (but does nothing for the fourth).

So all told that's about 3x faster for the first three benchmarks.

After that it's not immediately apparent to me what the bottleneck(s) are - a lot of time is still spent in locking, but that's just a symptom of something being slow while the lock is held. And that looks like a lot of auto-generated thunks and runtime stuff, a little bit of ref-counting, etc.

John_McCall · March 23, 2024, 9:39am

Consider just making recvQueue just store UnsafeContinuation<T?>s instead of allocating three separate objects to get the same effect. Similarly, sendQueue could simply store UnsafeContinuation<()>s, although the performance of that is probably less important.

The entire AsyncMutex class is a little silly. An actor in Swift is an async mutex. Just make Channel an actor with normal isolated methods instead of recreating actors using actors.

Datagram · March 23, 2024, 1:47pm

Loving golang goroutine with the simplicity and performance of channels, I am very happy to see such a concept in Swift.

Thanks for your good proof of concept, well-written examples, and comparisons to golang.

Looking forward for the next step in optimization and achieving performance parity with golang.

gh123man · March 23, 2024, 2:07pm

Thanks for the suggestions.

Consider just making recvQueue just store UnsafeContinuation<T?> s instead of allocating three separate objects to get the same effect. Similarly, sendQueue could simply store UnsafeContinuation<()> s, although the performance of that is probably less important.

I'll give this a shot however I don't anticipate removing the class allocation to make much of a difference.

The entire AsyncMutex class is a little silly.

This was removed in the benchmarking branch.

Just make Channel an actor with normal isolated methods instead of recreating actors using actors.

This actually hurts performance badly, especially with many readers and writers - instead os_unfair_lock_lock is used when possible in its place.

From the testing ive done, performance improves greatly when you can remove suspension points anywhere you don't have to explicitly wait for a sender or receiver.

vns · March 23, 2024, 2:54pm

It looks like too much mutexes used there. There are locks within the channel, and more locks within semaphore. Plus objects creation (a class for each sender and receiver is not cheap, and each of them also creates two semaphores). Plus array, with already pointed expensive operations, and space allocations each time you insert new sender or receiver - you need a data structure more effective on fast insert/delete in the begin or end and access to the first element.

In that way, clearly, actors will give more cost - there is too much friction going in between. Instead, by eliminating all the locks and additional classes, replaced with continuations, and isolating all of that - currently split between 4 classes - inside on actor I would expect it to be much more efficient. You will have two crossing boundaries - sending a value to a channel and receiving it, everything else will be happening inside.

gh123man · March 23, 2024, 3:30pm

This is amazing, thank you for looking at the code!
Looks like OSSpinLock deprecated long ago - I'm not sure if that will ever be a problem long term. But given the improvement here, it's hard to ignore.

Thanks for the tip on Swift collections - that is indeed an obvious (and big) improvement.

I'll work on integrating these changes into the benchmarking branch.

wadetregaskis · March 23, 2024, 4:16pm

Making the send queue store just a tuple of the value plus continuation boosts the performance another ~15% - ~20% for the first three benchmarks, but makes the fourth (syncRw) regress by about 10%. Which is weird because the syncRw benchmark is the most trivial case and doesn't use any continuations at all.

Making the receive queue store just the continuation actually makes every benchmark slightly slower (but only ~1%). Also highly unintuitive.

wadetregaskis · March 23, 2024, 4:19pm

Making the nonBlockingSend and nonBlockingReceive methods @inline(__always) improves the first three benchmarks by another ~15% but makes syncRw regress by another ~10%.

wadetregaskis · March 23, 2024, 4:29pm

Forcing inlining of the key parts of UnsafeRingBuffer recovers nearly 10% on the syncRw case (and maybe improves the other three by a couple of percent).

I'm picking up a theme here, that the compiler is being way too conservative about inlining…

(fourth reply that is @inline(__always) by Discourse )

Ugh, and I just found a compiler bug too whereby it apparently ignores a return statement:

@usableFromInline
    func send(_ value: T) async {
        buffer.push(value)
        return // <-- The compiler completely ignores this, and just
               // proceeds onwards to mutex.lock() below, leading to spinlock.

        mutex.lock()

        if nonBlockingSend(value) {
            return
        }

        await withUnsafeContinuation { continuation in
            sendQueue.append((value, continuation))
            let waiter = selectWaiter
            mutex.unlock()
            waiter?.signal()
        }
    }

Apparently it thinks that because lock() returns Void, and send returns Void, that this code therefore means return mutex.lock(), which is… spectacular.

After working around that compiler bug (thank goodness semicolons still exist, I guess), it shows that even if you completely bypass all locking and just push & pop values in perfect pairs off of the UnsafeRingBuffer (i.e. syncRw), it still takes 0.7 seconds (vs ~2s to do the real code, or 1.6s in the best case version, or 1.8s in the original). That's way slower than I expect.

Stepping through the disassembly, it's mind-blowing how much pointless boilerplate the compiler is inserting. Hundreds (thousands?) of instructions of generics cruft, retains & release of something, creating and destroying transient optionals, etc. If I didn't know better I'd think this code is compiled with -Onone… (but I can see -O in the build transcripts)

It looks to me like it's failing to specialise some or all of the code. Unfortunately the @_specialize decorator that the stdlib uses isn't available elsewhere.

…and indeed, manually "specialising" the code by removing the generics entirely and hard-coding for Int improves the performance 400%.

testSingleReaderManyWriter()
Time elapsed: 0.2149193286895752
testHighConcurrency()
Time elapsed: 0.2130796511967977
testHighConcurrencyBuffered()
Time elapsed: 0.21261000633239746
syncRw()
Time elapsed: 0.356195330619812

For reference, the starting point from @gh123man's benchmark branch @ dc97b09 was:

testSingleReaderManyWriter()
Time elapsed: 2.4027369817097983
testHighConcurrency()
Time elapsed: 2.417561332384745
testHighConcurrencyBuffered()
Time elapsed: 2.2434799671173096
syncRw()
Time elapsed: 1.7973110278447468

So, it seems like - while there's plenty of moderate improvements to be had through code improvements, as detailed in previous posts - the biggest problem by far is the compiler. It's not specialising the generics [correctly], among other suspicious-looking behaviour (the inexplicable retain/releases in value-only code, long-winded runs through the Concurrency lib for an async function that immediately returns an integer, etc).

To be clear, when I say the problem is the compiler that might mean that the code needs to do extra things to help the compiler, whether hints like @inline(__always) or perhaps some semantic changes to permit the compiler to make certain optimisations. But this is about the limits of my current knowledge, on how to coax the compiler into better results.

John_McCall · March 23, 2024, 5:09pm

Please do not use a spin lock in application code; they are prone to catastrophically bad behavior. Spin locks are really a kernel-level technology. I'm quite surprised that you're seeing a significant performance difference vs. a lightweight lock like os_unfair_lock, because they're both not much more than a single compare-exchange in the fast path.

wadetregaskis · March 23, 2024, 5:14pm

os_unfair_lock is usually much slower, in my experience.

OSSpinLock as I understand it does actually 'sleep' at some point, it just spins for a while first. I'm not sure if os_unfair_lock has a brief spin period too, but even if it does, perhaps the duration is shorter [enough] that it penalises some situations where just a few more spins might be enough.

Some of the locking in this case could potentially be eliminated through atomics, which would be both the faster and safer way to go in any case. But not all of it (because of the select support).

The impact of the lock selection matters less as other parts are optimised. e.g. with the fully-specialised, optimised version that I currently have, OSSpinLock gets:

testSingleReaderManyWriter()
Time elapsed: 0.2149193286895752
testHighConcurrency()
Time elapsed: 0.2130796511967977
testHighConcurrencyBuffered()
Time elapsed: 0.21261000633239746
syncRw()
Time elapsed: 0.356195330619812

…while os_unfair_lock gets:

testSingleReaderManyWriter()
Time elapsed: 0.2587466637293498
testHighConcurrency()
Time elapsed: 0.2595813274383545
testHighConcurrencyBuffered()
Time elapsed: 0.26501500606536865
syncRw()
Time elapsed: 0.35403231779734295

So, still noticeably slower in the contended cases but only on the order of 20% rather than 40%.

(note also that these benchmark numbers are ±5% at least anyway, so we're getting close to the noise floor of these benchmarks)