Forcing inlining of the key parts of UnsafeRingBuffer
recovers nearly 10% on the syncRw case (and maybe improves the other three by a couple of percent).
I'm picking up a theme here, that the compiler is being way too conservative about inlining…
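For anyone following along, the change amounts to annotating the hot-path methods. This is just a sketch of the idea, not the actual UnsafeRingBuffer (whose storage and wrap-around handling differ):

```swift
// Hypothetical sketch: forcing inlining of the hot-path methods.
// @inline(__always) tells the compiler to inline these at every
// call site, bypassing its (apparently too conservative) heuristics.
struct UnsafeRingBuffer<T> {
    private var storage: UnsafeMutablePointer<T>
    private var head = 0
    private var tail = 0
    private let capacity: Int

    init(capacity: Int) {
        self.capacity = capacity
        storage = .allocate(capacity: capacity)
    }

    @inline(__always)
    mutating func push(_ value: T) {
        storage.advanced(by: tail % capacity).initialize(to: value)
        tail += 1
    }

    @inline(__always)
    mutating func pop() -> T {
        defer { head += 1 }
        return storage.advanced(by: head % capacity).move()
    }
}
```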
Ugh, and I just found a compiler bug too whereby it apparently ignores a return
statement:
@usableFromInline
func send(_ value: T) async {
    buffer.push(value)
    return // <-- The compiler completely ignores this, and just
           // proceeds onwards to mutex.lock() below, leading to spinlock.
    mutex.lock()
    if nonBlockingSend(value) {
        return
    }
    await withUnsafeContinuation { continuation in
        sendQueue.append((value, continuation))
        let waiter = selectWaiter
        mutex.unlock()
        waiter?.signal()
    }
}
Apparently it thinks that because lock() returns Void, and send returns Void, this code therefore means return mutex.lock(), which is… spectacular.
After working around that compiler bug (thank goodness semicolons still exist, I guess), it turns out that even if you completely bypass all locking and just push & pop values in perfect pairs off of the UnsafeRingBuffer (i.e. syncRw), it still takes 0.7 seconds (vs ~2s for the real code, 1.6s in the best-case version, or 1.8s in the original). That's way slower than I expect.
Stepping through the disassembly, it's mind-blowing how much pointless boilerplate the compiler is inserting: hundreds (thousands?) of instructions of generics cruft, retains & releases of something, creating and destroying transient optionals, etc. If I didn't know better I'd think this code was compiled with -Onone… (but I can see -O in the build transcripts)
It looks to me like it's failing to specialise some or all of the code. Unfortunately the @_specialize attribute that the stdlib uses isn't available elsewhere.
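For reference, the underscored attribute looks like this (illustrative only, on a stand-in generic function; it's unsupported outside the standard library, so none of this is official API):

```swift
// @_specialize asks the compiler to emit a dedicated, non-generic
// entry point for each listed concrete type, in addition to the
// generic version. Underscored == unsupported outside the stdlib.
@_specialize(where T == Int)
@_specialize(where T == Double)
func sum<T: AdditiveArithmetic>(_ values: [T]) -> T {
    var total = T.zero
    for value in values {
        total += value
    }
    return total
}
```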
…and indeed, manually "specialising" the code by removing the generics entirely and hard-coding for Int improves performance by 400%.
testSingleReaderManyWriter()
Time elapsed: 0.2149193286895752
testHighConcurrency()
Time elapsed: 0.2130796511967977
testHighConcurrencyBuffered()
Time elapsed: 0.21261000633239746
syncRw()
Time elapsed: 0.356195330619812
For reference, the starting point from @gh123man's benchmark
branch @ dc97b09 was:
testSingleReaderManyWriter()
Time elapsed: 2.4027369817097983
testHighConcurrency()
Time elapsed: 2.417561332384745
testHighConcurrencyBuffered()
Time elapsed: 2.2434799671173096
syncRw()
Time elapsed: 1.7973110278447468
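The manual "specialisation" is conceptually just this (a sketch of the flavour of the change, not the actual diff, which touched the whole channel type):

```swift
// Hand-"specialised" ring buffer: the generic parameter is deleted
// and every T becomes Int, so values live inline in raw memory with
// no generic metadata, witness tables, or retain/release traffic.
struct UnsafeIntRingBuffer {
    private var storage: UnsafeMutablePointer<Int>
    private var head = 0
    private var tail = 0
    private let capacity: Int

    init(capacity: Int) {
        self.capacity = capacity
        storage = .allocate(capacity: capacity)
    }

    mutating func push(_ value: Int) {
        storage[tail % capacity] = value
        tail += 1
    }

    mutating func pop() -> Int {
        defer { head += 1 }
        return storage[head % capacity]
    }
}
```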
So, it seems like - while there are plenty of moderate improvements to be had through code changes, as detailed in previous posts - the biggest problem by far is the compiler. It's not specialising the generics [correctly], among other suspicious-looking behaviour (the inexplicable retains/releases in value-only code, long-winded trips through the Concurrency library for an async function that immediately returns an integer, etc).
To be clear, when I say the problem is the compiler, that might mean the code needs to do extra things to help the compiler, whether hints like @inline(__always) or perhaps semantic changes that permit the compiler to make certain optimisations. But this is about the limit of my current knowledge of how to coax the compiler into better results.
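One lever that I believe applies here, if the generic code and the benchmark live in separate modules: @inlinable and @usableFromInline expose function bodies across the module boundary, which lets the client-side optimiser specialise the generics itself. A sketch (assuming a library/client split; whether it helps in this particular project is untested):

```swift
// Exposing the generic implementation across module boundaries so
// the optimiser in the *client* module can specialise it for the
// concrete types actually used there.
@frozen
public struct RingBuffer<T> {
    @usableFromInline
    internal var storage: [T] = []

    public init() {}

    @inlinable
    public mutating func push(_ value: T) {
        storage.append(value)
    }

    @inlinable
    public mutating func pop() -> T {
        storage.removeFirst()
    }
}
```

The trade-off is that inlinable bodies become part of the module's ABI, so they can't be changed freely afterwards.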