Swift 5.7 regression in async call performance

I have a performance regression in my code after updating from Xcode 13.4 to Xcode 14. In this case performance is reduced by 90% compared to pre-Swift-5.7 builds.

The code is a test project I set up to better understand async byte stream handling (like Foundation's FileHandle.AsyncBytes or Swift Async Algorithms' AsyncBufferedByteIterator).

In the project I'm parsing JSON elements from a UTF-8 byte stream. There are several layers of nested async calls used in a tight loop. A unit test that parses a 25 MB JSON file takes 0.521 s on Xcode 13.4 and 4.808 s on Xcode 14. In Instruments I'm now seeing a lot of time spent in swift_task_switch. This does not happen in Xcode 13.4.

I believe the regression is a result of SE-0338: Execution of Non-Actor-Isolated Async Functions. As far as my understanding goes, as part of the proposal, each function, when returning, has to switch back to the executor from which it was called. To test the theory that the regression stems from executor switching I added the @_unsafeInheritExecutor attribute on each nested function. With this change in place the code runs as fast as before.
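To show what that workaround looked like in shape (a hypothetical helper, not the project's actual code), here is a sketch of the attribute applied to a nested async function:

```swift
import Foundation

// Sketch only: a hypothetical parser helper, not the project's actual code.
// @_unsafeInheritExecutor is an underscored, unsupported attribute. It makes
// the function run on whatever executor its caller is already using, which
// skips the hop back to the caller's executor that SE-0338 otherwise
// requires when the function returns.
@_unsafeInheritExecutor
fileprivate func nextByte(
    from iterator: inout FileHandle.AsyncBytes.AsyncIterator
) async throws -> UInt8? {
    try await iterator.next()
}
```

Applying this to every nested function in the parsing loop removed the per-call executor switches and restored the previous performance.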

Here's the test project. Just run the unit test named testAsyncCountLargeFile (in release mode) and look at the console to see the execution time.


To be quite honest, FileHandle.AsyncBytes and AsyncBufferedByteIterator are nearly identical; the only difference is that AsyncBytes uses an actor to coordinate the read calls. It seems like you are using an actor as well, so the perf comparison being the same is not surprising. But what is surprising is that we would have some sort of massive regression - is that true in release mode?

But what is surprising is that we would have some sort of massive regression - is that true in release mode?

Yes, all measurements are made in release builds. The project contains a scheme that is set up for testing in release mode.

I noticed this too in a flaky test that relied on Task.yield(). It went from failing < 1% of the time to failing > 50% of the time.
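For context, a yield-dependent test has roughly this shape (a hypothetical reduction, not the actual test): it yields a few times and hopes an unstructured child task has run by the time it asserts, so any change in scheduling behavior changes how often it passes:

```swift
// Hypothetical reduction of a yield-dependent test (not the actual test).
actor Box {
    var value = 0
    func set(_ newValue: Int) { value = newValue }
}

func testEventuallyObservesValue() async {
    let box = Box()
    Task { await box.set(42) }
    // Yield a few times, hoping the unstructured child task runs in between.
    for _ in 0..<10 { await Task.yield() }
    let observed = await box.value
    // Fragile: nothing guarantees the task ran during those yields, so a
    // change in executor-hopping behavior can change the failure rate.
    assert(observed == 42)
}
```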

In off-forum discussions with @nikolai.ruhe and profiling in Instruments, we made these observations:

  • When run in Swift 5.7 (Xcode 14.0b2), the stack traces contain a lot of swift_task_switch calls, and also completeTaskWithClosure calls deep within the call stacks (i.e. I'm not talking about the top-most completeTaskWithClosure call that kicks off the work).

  • When run in Swift 5.6 (Xcode 13.4.1), these don't exist.

This is our rough hypothesis (totally not sure if this is correct):

  • SE-0338 stipulates that every return from a non-actor-isolated async function is a potential executor hop, so the compiler must insert a swift_task_switch call on every return of an async function (is this true?).
  • Since SE-0338 is new in Swift 5.7, this would explain why Swift 5.6 is unaffected. Another piece of evidence is that @_unsafeInheritExecutor also fixes the performance issue.
  • Apparently the optimizer isn't able to eliminate the dynamic executor checks, even though the relevant code is all in-module and mostly fileprivate. I'm not sure if the optimizer can be improved or if this is a fundamental limitation of the semantics of SE-0338.
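A conceptual sketch of that hypothesis (hypothetical functions; the comments mark where we believe the compiler inserts potential hops, and this is our reading, not verified compiler output):

```swift
// Hypothetical functions illustrating our reading of SE-0338.
fileprivate func parseValue() async -> Int {
    return 42
    // SE-0338: before returning, execution must get back onto the generic
    // executor, so the compiler conservatively emits a potential hop
    // (swift_task_switch) at the return.
}

func hotLoop() async -> Int {
    var total = 0
    for _ in 0..<1_000_000 {
        total += await parseValue()
        // Every iteration crosses an await, so each pass through the loop
        // can pay for a dynamic executor check that the optimizer
        // apparently fails to eliminate, even for fileprivate callees.
    }
    return total
}
```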

cc @John_McCall

Here's a link to an Instruments trace of the unit test in question (executed in release mode). Run 1 is Swift 5.7 (6.43 s), Run 2 is Swift 5.6 (728 ms): Dropbox - 2022-06-24-Instruments-SE-0338-Swift 5.7 (Run 1) vs. Swift 5.6 (Run 2).trace.zip

Swift 5.7 trace: (screenshot omitted)

Swift 5.6 trace: (screenshot omitted)


We also looked at this in Godbolt in an attempt to understand what's going on.

Take this simple async function that doesn't even make any async calls:

@inline(never)
fileprivate func square(n: Int) async -> Int {
    return n * n
}

In Swift 5.6 (with -O), the generated assembly for this function is:

output.(square in _60494E8B9C642A7C4A26F3A3B6CECEB9)(n: Swift.Int) async -> Swift.Int:
        imul    rdi, rdi
        jo      .LBB1_1
        mov     rax, qword ptr [r14 + 8]
        jmp     rax
.LBB1_1:
        ud2

In Swift 5.7 (with -O), the assembly for the same function is (note the tail call to swift_task_switch and the split into two partial funcs):

output.(square in _60494E8B9C642A7C4A26F3A3B6CECEB9)(n: Swift.Int) async -> Swift.Int:
        mov     qword ptr [r14 + 16], rdi
        lea     rdi, [rip + ((1) suspend resume partial function for output.(square in _60494E8B9C642A7C4A26F3A3B6CECEB9)(n: Swift.Int) async -> Swift.Int)]
        xor     esi, esi
        xor     edx, edx
        jmp     swift_task_switch@PLT

(1) suspend resume partial function for output.(square in _60494E8B9C642A7C4A26F3A3B6CECEB9)(n: Swift.Int) async -> Swift.Int:
        mov     rdi, qword ptr [r14 + 16]
        imul    rdi, rdi
        jo      .LBB2_1
        mov     rax, qword ptr [r14 + 8]
        jmp     rax
.LBB2_1:
        ud2

This may not be a fair comparison because the @inline(never) may affect the optimizer (?) but it's interesting to see the implications of SE-0338 (if I understand correctly).


The optimizer certainly ought to be able to remove the switch from that code. Maybe it doesn't have the ability to reason that some operations don't require it to be on some specific executor. In general, it's supposed to be conservative about executor assumptions, but obviously a single multiplication is beyond "conservative" and well into "not actually recognizing anything at all".


This regression appears to have gotten worse in Xcode 14 beta 3.


We looked at this in some less trivial cases than Ole's square where code had a hot loop on AsyncBytes.next(). My guess from that investigation was that the code was achieving high performance by virtue of the loop making it all the way to the LLVM memory/scalar optimizer within a single function body. SE-0338 has added a hop after the return from next() in order to ensure that execution isn't dangling on an actor, and that hop is causing a split in the function, which is in turn blocking LLVM optimization. The actual cost of the hop (which does not require an actual suspension) seems minor next to its cost as an optimization barrier.

This hop back after a call is necessary as a general rule; we are not going to stop doing it. I can see two ways to avoid its impact in this case:

  • We could (hopefully) avoid the optimization-barrier impact by compiling hops differently in a way that would allow the LLVM pass to see a static path without a suspension. For example, we could emit an executor check, branch on it, and then emit a possible suspension point only on the slow path. This would require runtime support. It would also substantially increase code size if used universally.
  • After inlining next(), we could recognize that the hop is redundant in the fast path because next() does not actually hop. This would require a fair amount of CFG manipulation, for example to hoist the hop into the inlined next() and duplicate it onto both sides of the hot-path / slow-path "diamond"; but it may still be feasible.
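The hot-loop shape being discussed looks roughly like this (a hypothetical reduction using Foundation's URL.resourceBytes, not the investigated code):

```swift
import Foundation

// Hypothetical reduction of the hot-loop shape described above. In Swift 5.6
// the iterator's next() could inline into the loop, letting LLVM's memory and
// scalar optimizers see the whole body as one function. In early Swift 5.7
// betas, the hop inserted after each awaited next() split the coroutine at
// every iteration, acting as an optimization barrier even though no actual
// suspension occurs on the fast path.
func countBytes(of url: URL) async throws -> Int {
    var count = 0
    for try await _ in url.resourceBytes {
        count += 1  // trivial body; the regression's cost was per-iteration hops
    }
    return count
}
```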

So this is a feature?


Thanks for the explanation, John. Question:

  1. Is it correct that the hop back is being emitted as part of the callee, i.e. the function being called is responsible for emitting the hop?

  2. It seems to me that making the caller responsible for emitting the hop would be a more efficient calling convention. Under the rules of SE-0338, the actor isolation for each function is known statically, so the compiler should be able to only emit the hop if the caller's and callee's actor isolation differs. But this won't work if the hop is emitted in the callee because the callee has to assume it's being called from a different actor. Correct?
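To make the question concrete, here is an annotated sketch (not real compiler output; the comments describe where a hop would conceptually be emitted under each convention):

```swift
// Annotated sketch of the calling-convention question. The comments mark
// hypothetical hop-emission points; no hops appear in the source itself.
func calleeNonisolated() async -> Int {
    // Callee-side emission would have to hop unconditionally on return,
    // because the callee cannot know statically which executor its
    // caller was running on.
    return 1
}

func callerNonisolated() async -> Int {
    let x = await calleeNonisolated()
    // Caller-side emission would happen here instead. Since both functions
    // are statically known to be non-isolated, the compiler could in
    // principle prove the hop redundant and omit it.
    return x
}
```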


No, it’s done by the caller.


Are there best practices now?

This should be fixed now. :)

(edit: [5.7] SIL Optimizer: add the OptimizeHopToExecutor in the late pipeline. by eeckstein · Pull Request #60156 · apple/swift · GitHub)


Fantastic. I can confirm. Using @nikolai.ruhe’s benchmark from post #1, i.e. running testAsyncCountLargeFile from NikolaiRuhe/JSONFormat in release mode.

For comparison, @nikolai.ruhe's original results were:

Environment                          Time
Swift 5.6 (Xcode 13.4)               0.521 sec
Swift 5.7-beta (Xcode 14.0 beta 1)   4.808 sec

My updated results from today:

Environment                 Time       Notes
Swift 5.6 (Xcode 13.4.1)    0.689 sec  Not comparable with above (different machine)
Swift 5.7 (Xcode 14.0 GM)   0.840 sec

(macOS 12.5.1, Apple M1 Pro, 10 CPU cores)

So Swift 5.7 is still a little slower than 5.6, but much, much faster than earlier betas.

Thanks for the fix @Erik_Eckstein!


Excellent, thank you for verifying! The small remaining regression seems worth looking into at some point, but it is plausibly below the noise floor (e.g. it could just be getting less lucky on function alignment or something), and performance is certainly back to being usable.


Just here to fill in missing numbers from my hardware—and say thanks!

Environment                          Time
Swift 5.6 (Xcode 13.4)               0.521 sec
Swift 5.7-beta (Xcode 14.0 beta 1)   4.808 sec
Swift 5.7 (Xcode 14.0 Release)       0.664 sec

(This adds nothing to @ole's findings above.)


Can you please tell me how to run your sample project? Thanks!

@young I followed the instructions in post #1 of this thread:

Here's the test project. Just run the unit test named testAsyncCountLargeFile (in release mode) and look at the console to see the execution time.

I did this in Xcode (make sure the scheme is set to run tests in release mode). XCTest will print the duration of the test in the console.