I have a performance regression in my code after updating from Xcode 13.4 to Xcode 14: performance is reduced by 90% compared to pre-Swift 5.7.
The code is a test project I set up to better understand async byte-stream handling (as in Foundation's FileHandle.AsyncBytes or Swift Async Algorithms' AsyncBufferedByteIterator).
In the project I'm parsing JSON elements from a UTF8 byte stream. There is some amount of nested async calls that are used in a tight loop. A unit test that parses a 25MB JSON file takes 0.521 s on Xcode 13.4 and 4.808 s on Xcode 14. In Instruments I'm now seeing a lot of time spent in swift_task_switch. This does not happen in Xcode 13.4.
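For reference, here's a minimal sketch of the kind of nested async iteration the test exercises (not the actual project code; `ChunkedByteStream` and `countBytes` are made-up names): a hand-rolled async byte iterator consumed in a tight loop, where every step of the loop is an `await` on a non-actor-isolated async function.

```swift
import Foundation

// Illustrative stand-in for the real byte source: every call to
// next() is a non-actor-isolated async function, so under SE-0338
// each return is a potential executor hop.
struct ChunkedByteStream: AsyncSequence {
    typealias Element = UInt8
    let data: [UInt8]

    struct AsyncIterator: AsyncIteratorProtocol {
        let data: [UInt8]
        var index = 0

        mutating func next() async -> UInt8? {
            guard index < data.count else { return nil }
            defer { index += 1 }
            return data[index]
        }
    }

    func makeAsyncIterator() -> AsyncIterator { AsyncIterator(data: data) }
}

// A nested async call in the hot loop, mirroring the parser structure.
func countBytes(in stream: ChunkedByteStream) async -> Int {
    var count = 0
    for await _ in stream { count += 1 }
    return count
}

// Simple script-style runner (detached so it never needs the main thread).
let sem = DispatchSemaphore(value: 0)
Task.detached {
    let n = await countBytes(in: ChunkedByteStream(data: Array(repeating: 0x7B, count: 1_000)))
    print(n)   // prints 1000
    sem.signal()
}
sem.wait()
```

The real parser awaits bytes one at a time like this, which is exactly the pattern where per-call hop overhead multiplies.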
I believe the regression is a result of SE-0338 (Clarify the Execution of Non-Actor-Isolated Async Functions). As far as my understanding goes, under the proposal each async function, when returning, has to switch back to the executor from which it was called. To test the theory that the regression stems from this executor switching, I added the @_unsafeInheritExecutor attribute to each nested function. With this change in place the code runs as fast as before.
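A hedged sketch of what that workaround looks like (`nextNonWhitespace` is an illustrative name, not from the real project). The attribute is underscored and unofficial, so it's left commented out here; on a Swift 5.7 toolchain you would uncomment it on each nested helper so the callee keeps running on the caller's executor and no hop back is emitted.

```swift
import Foundation

// On Swift 5.7, uncomment the following underscored, unofficial
// attribute to suppress the hop back to the caller's executor:
// @_unsafeInheritExecutor
func nextNonWhitespace<I: AsyncIteratorProtocol>(
    _ bytes: inout I
) async throws -> UInt8? where I.Element == UInt8 {
    // Typical inner helper of a byte-stream parser: awaits the
    // iterator in a tight loop, one byte per call.
    while let byte = try await bytes.next() {
        if byte != UInt8(ascii: " ") { return byte }
    }
    return nil
}

// Demo: skip leading spaces and return the first meaningful byte.
let sem = DispatchSemaphore(value: 0)
Task.detached {
    let stream = AsyncStream<UInt8> { continuation in
        for b in "  {".utf8 { continuation.yield(b) }
        continuation.finish()
    }
    var it = stream.makeAsyncIterator()
    let byte = try? await nextNonWhitespace(&it)
    print(byte == UInt8(ascii: "{"))   // prints true
    sem.signal()
}
sem.wait()
```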
Here's the test project. Just run the unit test named testAsyncCountLargeFile (in release mode) and look at the console to see the execution time.
To be quite honest, FileHandle.AsyncBytes and AsyncBufferedByteIterator are nearly identical; the only difference is that AsyncBytes uses an actor to coordinate the read calls. It seems like you are using an actor as well, so the perf comparison being the same is not surprising. But what is surprising is that we would have some sort of massive regression. Is that true in release mode?
In off-forum discussions with @nikolai.ruhe and profiling in Instruments, we made these observations:
When run in Swift 5.7 (Xcode 14.0b2), the stack traces contain a lot of swift_task_switch calls, and also completeTaskWithClosure calls deep within the call stacks (i.e. I'm not talking about the top-most completeTaskWithClosure call that kicks off the work).
When run in Swift 5.6 (Xcode 13.4.1), these don't exist.
This is our rough hypothesis (totally not sure if this is correct):
SE-0338 stipulates that every return from a non-actor isolated async function is a potential executor hop. So the compiler must insert a swift_task_switch call on every return of an async function (is this true?).
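To make the hypothesis concrete, here is a toy example (`square` and `Accumulator` are illustrative names, not from the project): `square` is non-actor-isolated, so under SE-0338 it runs on the generic executor, and after the actor-isolated caller awaits it, execution has to hop back to the actor's executor. That hop is where we believe the swift_task_switch calls come from.

```swift
import Foundation

// Nonisolated async function: runs on the generic executor.
func square(_ x: Int) async -> Int {
    x * x   // trivial body, yet the call still crosses executors
}

actor Accumulator {
    var total = 0

    func add(_ x: Int) async {
        // Conceptually, the `await` below includes a compiler-inserted
        // switch back to this actor's executor after square() returns
        // (this is our reading of SE-0338, not actual compiler output).
        total += await square(x)
    }
}

let acc = Accumulator()
let sem = DispatchSemaphore(value: 0)
Task.detached {
    for i in 1...3 { await acc.add(i) }
    let total = await acc.total
    print(total)   // prints 14 (1 + 4 + 9)
    sem.signal()
}
sem.wait()
```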
Since SE-0338 is new in Swift 5.7, this would explain why Swift 5.6 is unaffected. Another piece of evidence is that @_unsafeInheritExecutor also fixes the performance issue.
Apparently the optimizer isn't able to eliminate the dynamic executor checks, even though the relevant code is all in-module and mostly fileprivate. I'm not sure if the optimizer can be improved or if this is a fundamental limitation of the semantics of SE-0338.
The optimizer certainly ought to be able to remove the switch from that code. Maybe it doesn't have the ability to reason that some operations don't require it to be on some specific executor. In general, it's supposed to be conservative about executor assumptions, but obviously a single multiplication is beyond "conservative" and well into "not actually recognizing anything at all".
We looked at this in some less trivial cases than Ole's square where code had a hot loop on AsyncBytes.next(). My guess from that investigation was that the code was achieving high performance by virtue of the loop making it all the way to the LLVM memory/scalar optimizer within a single function body. SE-0338 has added a hop after the return from next() in order to ensure that execution isn't dangling on an actor, and that hop is causing a split in the function, which is in turn blocking LLVM optimization. The actual cost of the hop (which does not require an actual suspension) seems minor next to its cost as an optimization barrier.
This hop back after a call is necessary as a general rule; we are not going to stop doing it. I can see two ways to avoid its impact in this case:
We could (hopefully) avoid the optimization-barrier impact by compiling hops differently in a way that would allow the LLVM pass to see a static path without a suspension. For example, we could emit an executor check, branch on it, and then emit a possible suspension point only on the slow path. This would require runtime support. It would also substantially increase code size if used universally.
After inlining next(), we could recognize that the hop is redundant in the fast path because next() does not actually hop. This would require a fair amount of CFG manipulation, for example to hoist the hop into the inlined next() and duplicate it onto both sides of the hot-path / slow-path "diamond"; but it may still be feasible.
Is it correct that the hop back is being emitted as part of the callee, i.e. the function being called is responsible for emitting the hop?
It seems to me that making the caller responsible for emitting the hop would be a more efficient calling convention. Under the rules of SE-0338, the actor isolation for each function is known statically, so the compiler should be able to only emit the hop if the caller's and callee's actor isolation differs. But this won't work if the hop is emitted in the callee because the callee has to assume it's being called from a different actor. Correct?
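A sketch of the scenario I have in mind (all names illustrative): in `combined()` below, caller and callees are all nonisolated, so with caller-emitted hops the compiler could statically see that no switch is needed there. Only the actor-to-nonisolated call in `Box.run()` would still require a hop back after the call returns.

```swift
import Foundation

func helperA() async -> Int { 1 }   // nonisolated
func helperB() async -> Int { 2 }   // nonisolated

func combined() async -> Int {
    // Caller and callees share the same (generic) isolation; with a
    // caller-side convention, no hop would have to be emitted here.
    let a = await helperA()
    let b = await helperB()
    return a + b
}

actor Box {
    func run() async -> Int {
        // Isolation differs (actor -> nonisolated), so the caller
        // would still emit a hop back to the actor after this call.
        await combined()
    }
}

let sem = DispatchSemaphore(value: 0)
Task.detached {
    let value = await Box().run()
    print(value)   // prints 3
    sem.signal()
}
sem.wait()
```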
Excellent, thank you for verifying! The small remaining regression seems worth looking into at some point but is plausibly below the noise floor (e.g. it could just be getting less lucky on function alignment or something), and the code is certainly back to being usable.