Async Channels for Swift concurrency

shouldn’t there be an indentation warning emitted here?

:man_shrugging: No warnings from the compiler (Swift 5.10).

I've formalized the optimizations discussed above (sans OSSpinLock). The code is merged to main.
Results are here.

I also compared this to swift-async-algorithms AsyncChannel (as suggested above). This library is 5-7x faster in the applicable tests.
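For context, the AsyncChannel side of the comparison uses the swift-async-algorithms API along these lines (an illustrative sketch, not the actual benchmark harness):

import AsyncAlgorithms

func consumeChannel() async {
    let channel = AsyncChannel<Int>()

    Task {
        for i in 0..<1_000 {
            await channel.send(i)  // suspends until a consumer is ready
        }
        channel.finish()           // ends iteration for consumers
    }

    var sum = 0
    for await value in channel {
        sum += value
    }
}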

At this point it seems like any (impactful) future performance gains are held back by compiler optimizations (or by writing better code that the compiler can optimize). If anyone has resources, reading material, or examples of previous work on this topic, I'd be interested.

3 Likes

I hope this progresses further. It's frustrating to see Swift performing so poorly.

The lack of generic specialisation is something I've seen implicated in performance problems multiple times, and makes me leery about using performance-sensitive code from packages - as opposed to manually copy-pasting the necessary code into my project / module in order to ensure specialisation can actually occur. Which can of course only be done in very limited degrees, practically-speaking, due to the heavy maintenance burden it imposes (and potentially, licensing issues).

I'm surprised this specialisation limitation doesn't get discussed more. It seems like a glaring, fundamental performance bottleneck for Swift. And the workaround seems to be "make everything @inlinable and cross your fingers that the compiler chooses to do so", which doesn't really build confidence (nor necessarily result in optimal performance or code size).
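For readers unfamiliar with the workaround, it looks roughly like this (a hypothetical library type; nothing guarantees the optimiser actually specialises it at the use site):

// Library module: expose the implementation details to clients so that
// cross-module generic specialisation *can* happen.
public struct Stack<Element> {
    @usableFromInline
    internal var storage: [Element] = []

    public init() {}

    @inlinable
    public mutating func push(_ element: Element) {
        storage.append(element)
    }

    @inlinable
    public mutating func pop() -> Element? {
        storage.popLast()
    }
}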

5 Likes

There's not really much to discuss; it's not like these issues are new or unique. Swift performance comes up fairly regularly on these forums, whether through random performance issues in individual code or Swift's surprisingly poor performance in benchmark comparisons with other languages. In almost every case the ultimate findings and core team response are one of:

- these are optimizations that are known to be missing but implementable, and we just haven't gotten around to them yet;
- they're difficult to implement, and so high effort they may not be worth it;
- the optimizer is simply malfunctioning / not aggressive enough / missing an edge case;
- or they don't really matter despite what benchmarks say.

Unless there are community members with existing expertise in compiler optimization relevant to Swift who aren't already Swift contributors (not likely), there's not really much we can do. Our only hope is that the low-level, ongoing optimization work that already takes place eventually catches some of these issues (and that they don't later regress). Personally, I wish Swift was more like WebKit, where there's a published and followed list of project priorities with performance right near the top, second only to security.

4 Likes

although i have certainly posted my share of tragic godbolts on these forums, i don’t really think the language is what’s holding us back. many times, it is not Swift’s fault, but rather, poorly written libraries that are causing performance problems.

ironically, i frequently observe library authors trying to combine all their code into one module (i guess since everyone hates @inlinable), which is a localized performance win, but backfires when everyone is doing that, since you end up with huge binaries that contain lots of irrelevant code that the library authors did not want to leave on the other side of whatever cross-module optimization boundary they were thinking about.

1 Like

Indeed. Despite all the shiny evolution processes (which are cool, don't get me wrong), it shouldn't be forgotten that Swift has been, is, and will probably forever be primarily an (Apple-)product-driven language.

I think it is fair to assume that as the language matures and adoption (within Apple too) continues, further effort will go into performance work as well. I do miss a performance section in the evolution template, though, so one could understand the expected short- and long-term performance implications of certain changes.
(That being said, I do hope we'll see more such work sooner rather than later, especially in the context of generic API surfaces, as mentioned here.)

3 Likes

I'm surprised you're not getting a warning for this. I haven't checked your code, but this simpler code does generate a warning, "Expression following 'return' is treated as an argument of the 'return'":

func foo() {
    return
    print("foo")
}

Notably, the warning disappears when you change the indentation to this:

func foo() {
    return
        print("foo")
}

I learned this from @nicklockwood: tweet 1 and tweet 2

We are currently using the following pattern in the project:

func foo() {
  // some logic
  guard condition else { return assertionFailure("message") }
  guard condition else { return logError(SomeError("message")) }
}

It is OK for a void function to return the result of another void function. But in the sample provided by @wadetregaskis it is undesirable because of the mutex.lock().
I can imagine a warning emitted by the compiler with the help of something like ~Escapable types applied at the value level (i.e. a non-escapable returned Void value), but that is not a solution that can be taken seriously. So the question is: what is the right way to get diagnostic warnings in such situations?
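To make the hazard concrete, here is a minimal sketch (with a hypothetical NSLock; this is not the original sample):

import Foundation

let mutex = NSLock()

func doWork() {
    return
    mutex.lock()  // not dead code: both expressions are Void, so this parses
                  // as `return mutex.lock()` and the lock is actually acquired
}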

Your log and assertions are on the same statement as the return, and so are executed before the return. They would not be reported by an unreachable-code warning.

I've finally given it a shot with actors. I'm using a simple linked list to store senders and receivers as unsafe continuations, and as the buffer as well; Array obviously performs badly, and Deque is also less efficient than the linked list. I haven't implemented select so far. Adding inlines didn't bring any timing improvements for me; either that was my incorrect use of them, or the compiler actually optimises well even without explicit annotations.

So far, measurements are a bit worse for the first 3 tests (a consistent 1.5 s), with an improvement of more than a second for the syncRw case (still a long way off from the Go version). Tested on a MacBook Pro M1.

hyperfine -w 3 '.build/release/SwiftChannelsDemo 1'
Benchmark 1: .build/release/SwiftChannelsDemo 1
  Time (mean ± σ):      1.512 s ±  0.002 s    [User: 1.531 s, System: 0.524 s]
  Range (min … max):    1.508 s …  1.515 s    10 runs

hyperfine -w 3 '.build/release/SwiftChannelsDemo 2'
Benchmark 1: .build/release/SwiftChannelsDemo 2
  Time (mean ± σ):      1.524 s ±  0.005 s    [User: 1.541 s, System: 0.532 s]
  Range (min … max):    1.519 s …  1.536 s    10 runs

hyperfine -w 3 '.build/release/SwiftChannelsDemo 3'
Benchmark 1: .build/release/SwiftChannelsDemo 3
  Time (mean ± σ):      1.524 s ±  0.005 s    [User: 1.540 s, System: 0.530 s]
  Range (min … max):    1.519 s …  1.534 s    10 runs

hyperfine -w 3 '.build/release/SwiftChannelsDemo 4' 
Benchmark 1: .build/release/SwiftChannelsDemo 4
  Time (mean ± σ):      1.891 s ±  0.030 s    [User: 1.853 s, System: 0.005 s]
  Range (min … max):    1.877 s …  1.976 s    10 runs

I still haven't tested how much time push/pop on the buffer takes, but I suspect that the main thing currently affecting performance is the suspension points. But actors still perform well, in my opinion, especially considering the level of low-level optimisation that has gone into the Go implementation itself over the years. As for the channels idea itself, despite being a fantastic model for concurrency in Go, I believe the approach in Swift should in general differ from it, since Swift has a completely different set of abstractions and tools.

The code:

//  Channel.swift

infix operator <- : AssignmentPrecedence

public func <- <T>(c: Channel<T>, value: T) async {
    await c.send(value)
}

public func <- <T>(value: inout T?, chan: Channel<T>) async {
    value = await chan.receive()
}

prefix operator <-

@discardableResult public prefix func <- <T>(chan: Channel<T>) async -> T? {
    return await chan.receive()
}

public actor Channel<T> where T: Sendable {
    private typealias Receiver = UnsafeContinuation<T?, Never>
    private var receivers = LinkedList<Receiver>()

    private typealias Sender = (continuation: UnsafeContinuation<Void, Never>, value: T)
    private var senders = LinkedList<Sender>()
    
    private let capacity: Int
    private var buffer: LinkedList<T> = LinkedList()

    public init(capacity: Int = 0) {
        self.capacity = capacity
    }

    public func send(_ value: T) async {
        // A receiver is already waiting: hand the value over directly.
        guard receivers.isEmpty else {
            receivers.pop()?.resume(returning: value)
            return
        }
        // Buffer is full (or capacity is 0): suspend until a receiver takes this value.
        guard buffer.count < capacity else {
            await withUnsafeContinuation { continuation in
                senders.push((continuation, value))
            }
            return
        }
        buffer.push(value)
    }

    public func receive() async -> T? {
        // Drain the buffer first.
        if !buffer.isEmpty {
            return buffer.pop()
        }
        // A sender is suspended: take its value and resume it.
        if !senders.isEmpty {
            let sender = senders.pop()
            sender?.continuation.resume()
            return sender?.value
        }
        // Nothing available: suspend until a sender arrives.
        return await withUnsafeContinuation { continuation in
            receivers.push(continuation)
        }
    }
}

//  LinkedList.swift

internal struct LinkedList<T> {
    private final class Node {
        var value: T
        var next: Node?

        init(value: T) {
            self.value = value
        }
    }

    private(set) var isEmpty: Bool = true
    private var head: Node?
    private var tail: Node?
    private(set) var count: Int = 0

    mutating func push(_ value: T) {
        let node = Node(value: value)
        if head == nil {
            head = node
            tail = node
        } else {
            tail?.next = node
            tail = node
        }
        count += 1
        isEmpty = false
    }
    
    mutating func pop() -> T? {
        guard let node = head else { return nil }
        head = node.next
        if head == nil {
            isEmpty = true
            tail = nil
        }
        count -= 1
        return node.value
    }
}

And the running app is pretty much a copy-paste from the benchmarks; just instead of timing in Swift, I've used hyperfine.

//  App.swift

import Foundation
import SwiftChannels

// To run benchmark
// swift build -c release
// hyperfine -w 3 '.build/release/SwiftChannels <number>'
@main
struct App {
    static func main() async {
        guard let opt = CommandLine.arguments.last,
              let value = Int(opt) else {
            return
        }
        switch value {
        case 1: await testSingleReaderManyWriter()
        case 2: await testHighConcurrency()
        case 3: await testHighConcurrencyBuffered()
        case 4: await testSyncRw()
        default: break
        }
    }
}

func testSingleReaderManyWriter() async {
    let a = Channel<Int>()
    for _ in (0..<100) {
        Task.detached {
            for _ in (0..<10_000) {
                await a <- 1
            }
        }
    }
    var sum = 0
    while sum < 1_000_000 {
        sum += (await <-a)!
    }
}

func testHighConcurrency() async {
    let a = Channel<Int>()
    for _ in (0..<1000) {
        Task {
            for _ in (0..<1000) {
                await a <- 1
            }
        }
    }
    var sum = 0
    while sum < 1_000_000 {
        sum += (await <-a)!
    }
}

func testHighConcurrencyBuffered() async {
    let a = Channel<Int>(capacity: 20)
    for _ in (0..<1000) {
        Task {
            for _ in (0..<1000) {
                await a <- 1
            }
        }
    }
    var sum = 0
    while sum < 1_000_000 {
        sum += (await <-a)!
    }
}

func testSyncRw() async {
    let a = Channel<Int>(capacity: 1)
    for i in (0..<5_000_000) {
        await a <- i
        await <-a
    }
}

2 Likes

Oh, thanks for doing it. :+1:
I had a suspicion that introducing locks and semaphores actually reduces performance, and that just moving to actors should improve it. I haven't had time to play with this, and I'm glad we can compare now.

1 Like

In the version with locks, I've replaced Deque with LinkedList and got a performance improvement of 0.1 s on the concurrent tests and 0.5 s on syncRw.

Thanks for sharing this. Very interesting; I'll take a closer look at this later today, but from a quick look this morning I think there is at least one bug that makes this comparison not totally fair.

Consider this test case:

let a = Channel<Int>(capacity: 2)
let done = Channel<Bool>()

Task {
    await a <- 1
    await a <- 2 // buffer full
    await a <- 3 // block
    print("got here!")
    await done <- true
}
try! await Task.sleep(nanoseconds: 1_000_000_000) // sleep for 1 second

print((await <-a)!) // Should unblock `a <- 3` by freeing a buffer slot
await <-done

In your implementation "got here!" is never printed and we block forever, because receive only dequeues from the buffer but never unblocks the senders. There is a set of test cases in the main package that should cover most correctness cases (however, you'd need to implement close to make many of them work).
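For illustration, one way to address this in the actor implementation above (a sketch, not validated against the package's test suite) is to refill the buffer from a waiting sender whenever receive drains a slot:

public func receive() async -> T? {
    if !buffer.isEmpty {
        let value = buffer.pop()
        // A buffer slot just freed up: move a waiting sender's value in
        // and resume it, so the blocked `a <- 3` above can proceed.
        if let sender = senders.pop() {
            buffer.push(sender.value)
            sender.continuation.resume()
        }
        return value
    }
    if !senders.isEmpty {
        let sender = senders.pop()
        sender?.continuation.resume()
        return sender?.value
    }
    return await withUnsafeContinuation { continuation in
        receivers.push(continuation)
    }
}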

2 Likes

Could be. I don't have much experience with Go channels, and this implementation was written in fragments over the last week. I'll take the test cases in a while to cover what's missing.

1 Like

I've done some more testing. By using a linked list as @vns proposed, we see another leap in performance over Deque. I've inlined the push and pop methods, which helped as well.

I opened a PR with the relevant changes here

highlights:

Test Case                     Go (seconds)   Swift (seconds)   Swift n times slower than Go
testSingleReaderManyWriter    0.318661688    0.702384305       2.20x
testHighConcurrency           0.328830854    0.7504368067      2.28x
testHighConcurrencyBuffered   0.362022931    1.144838405       3.16x
testSyncRw                    0.132789557    1.547898102       11.66x
testSelect                    0.306248166    1.401787996       4.58x

Modest bumps across the board, but syncRw is 2x faster!

@vns If it's OK with you, I'd like to upstream the linked list you implemented into the main branch (with credit of course, as you see fit; I've mentioned you in the readme).

With all of this in mind, I am still curious to see how actors perform with further optimization.

1 Like

For benchmarks I would highly suggest against this type of pattern; instead, a task group would be much more efficient, as bringing up an entire task for each loop iteration is rather costly.
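For reference, the suggested shape looks roughly like this, adapting the testHighConcurrency benchmark above (a sketch; as noted in a reply below, it also changes what is being compared to goroutines):

func testHighConcurrencyTaskGroup() async {
    let a = Channel<Int>()
    await withTaskGroup(of: Void.self) { group in
        // Child tasks replace the per-iteration unstructured Tasks.
        for _ in 0..<1_000 {
            group.addTask {
                for _ in 0..<1_000 {
                    await a <- 1
                }
            }
        }
        var sum = 0
        while sum < 1_000_000 {
            sum += (await <-a)!
        }
    }
}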

2 Likes

I'm suspicious of this; I suspect it's not a genuine benefit of the different data structure but rather the difference between having it defined in the same module vs a different module (i.e. the same issues as before regarding lack of generic specialisation).

What happens if you copy-paste Deque out of swift-collections and into your own module? (you don't need most of it, just the very minimal set of declarations and methods that are actually used in this case)

There are very few situations in which a linked list is faster than a ring buffer (which is what Deque uses).

3 Likes

Yes, totally fine. I am only thinking about more advanced testing. As @wadetregaskis mentioned below/above, it is a bit odd that there is such a benefit, and I agree with that. I have doubts it won't degrade with more complex types than just integers, so I think it will make sense to add benchmarks with struct and class passing, as sketched below.
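A sketch of what such a benchmark might look like (hypothetical, not yet in the benchmark suite; it reuses the Channel API above):

// Pass a small struct through the channel instead of Int, to check
// whether the linked-list win holds for non-trivial element types.
struct Payload: Sendable {
    var id: Int
    var weight: Double
}

func testStructPassing() async {
    let c = Channel<Payload>()
    for _ in 0..<100 {
        Task.detached {
            for i in 0..<10_000 {
                await c <- Payload(id: i, weight: 1.0)
            }
        }
    }
    var received = 0
    while received < 1_000_000 {
        _ = await <-c
        received += 1
    }
}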

I was going to update to a task group at first, with the same intention, but then decided not to, because it seems incorrect in terms of usage and the comparison to Go in general; I think this kind of task creation is closer to goroutines. But it is worth adding another set of benchmarks using child tasks.

Yep, I was suspicious of this too. Deque is a reasonable data structure; it should actually perform better than a linked list due to fewer allocations. My assumption is that the simplified data structure here brings improvements because we don't carry the overhead Deque's complete implementation brings. I wonder if this change won't shoot us in the foot…