Non-deterministic bug resuming continuations from actors

josephlord · July 14, 2021, 11:05pm

I'm looking for help getting this bug in front of the right people, at least to triage it, because it seems quite serious to me.

I've hit this issue, reported as SR-14875 (rdar://problem/80238311), where I get hangs when calling resume on a stored continuation from within the actor. It shows up when running the test repeatedly (about 5 runs in 100 fail).

This is still happening in Beta 3 and in the latest toolchains. I've been waiting for the next release before trying to push this.

Is there a proper way to escalate the issue? I imagine it could bite quite a few people, and its non-deterministic nature may cost them a lot of time working out what is happening. @typesanitizer added it to rdar and @Douglas_Gregor had a look at a ticket that was linked to this one, but as far as I know this ticket hasn't been triaged yet.

It is possible that what I'm trying to do isn't legal, in which case I feel there should be compile errors, or at least clear documentation (which I haven't seen in the proposals, though I may have missed something).

Since originally creating the bug ticket I have produced a simplified test case (now added to the ticket). To reproduce, set the test to run 100 times and you should see some failures:

import XCTest

@available(iOS 15.0, macOS 12.0, *)
actor SUTActor {
    var continuation: CheckedContinuation<(),Never>?
    var canGo = false
    
    func pause() async {
        if canGo { return }
        await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) -> Void in
//            if canGo {
//                continuation.resume(returning: ())
//            } else {
                self.continuation = continuation
//            }
        }
    }
    
    func go() {
        canGo = true
        continuation?.resume()
    }
}

@available(iOS 15.0, macOS 12.0, *)
final class ActorContinuationTests : XCTestCase {
    func testPauseGo() {
        let sut = SUTActor()
        let exp = expectation(description: "Will resume")
        Task.detached(priority: .high) {
            await sut.pause()
            exp.fulfill()
        }
        Task.detached(priority: .default) {
            await sut.go()
        }
        waitForExpectations(timeout: 0.5, handler: nil)
        _ = sut // Ensure lifetime sufficient
    }
}

It doesn't seem to matter whether the commented-out code in pause() is enabled or not. I'm just not certain that it isn't required. Is there a potential yield between calling withCheckedContinuation and the execution of the closure passed to it?

I have tried to build the toolchain from source to try to take a look myself but haven't yet had success with the instructions to get it building.

mickeyl · July 15, 2021, 8:22am

I also think this is quite serious. Perhaps related is SR-14841, "Concurrency: Resuming a stored continuation from an actor does not work" (apple/swift issue #57188), which fails completely deterministically for me, in all cases.

typesanitizer · July 15, 2021, 4:24pm

There shouldn't be a hang when resuming continuations; we're investigating it.

That said, keeping the hang part aside, there is a good explanation for the rest of the behavior. For this one (with the commented out version), I would expect there to be some non-determinism:

If go() runs first, then it will fail to resume anything because nothing was stored.
If pause() runs first, then the continuation should be resumed.

This works most of the time: 90 runs out of 100, or 95 runs out of 100 with the commented lines in the pause method uncommented.

What this likely means is that

90% of the time, pause() is being called first, it finishes and then go() is being called. This leads to the continuation being resumed in go().
5% of the time, pause() is called first, gets to the await (as canGo == false), suspends, then go() starts and finishes (now canGo == true), then the continuation is resumed (because canGo == true).
5% of the time go() is called first, it finishes and then pause() is called. This follows the early exit in pause().

I cannot think of other executions which may lead to the same outcomes, but maybe I'm missing something.

josephlord · July 15, 2021, 7:37pm

Thanks, I just wanted to make sure it was figuratively on the radar rather than just lost in the rdar. If it is, I will leave it in your hands.

Should the commented out code be necessary? My assumption (although I'm not sure I've seen anything explicit) would have been that the closure in withCheckedContinuation would run immediately and on the actor context and the await only applies to waiting for the resume.

Your interpretation of the non-determinism seems plausible, in which case a deterministic failure should be implementable. However, if I add an await Task.sleep(10_000) to either detached task (before pause() or go()), I don't get any failures, so I do wonder if there is somehow a breach of actor isolation somewhere.

typesanitizer · July 15, 2021, 8:33pm

The await is on the withCheckedContinuation call, so execution can suspend right before that call. It's not guaranteed that withCheckedContinuation will execute after the previous statement without any suspension in between.

I don't quite get what you mean. Are you concerned that putting additional sleep calls will somehow allow simultaneous access to the actor state from multiple places? That shouldn't be possible, no. When a task is suspended, it is inert, so it is not accessing an actor's state. Before a new task can execute on an actor, the running task needs to suspend or finish.
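As a rough illustration of that isolation guarantee (not taken from the ticket), here is a minimal sketch. The `Counter` actor and the `runIncrements` helper are hypothetical names invented for this example. Many child tasks call into the same actor concurrently, yet each `increment()` runs to completion before the next one starts, so no update is lost.

```swift
// Hypothetical actor used only for this illustration.
actor Counter {
    private(set) var value = 0

    // Synchronous actor method: it contains no suspension point, so no
    // other task can run on the actor while it executes.
    func increment() {
        value += 1
    }
}

// Starts `n` child tasks that all increment the same counter,
// waits for them, and returns the final value.
func runIncrements(_ n: Int) async -> Int {
    let counter = Counter()
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<n {
            group.addTask { await counter.increment() }
        }
        // The group waits for all child tasks before returning.
    }
    return await counter.value
}
```

If actor isolation were breached, the returned value could fall below `n`. With a correctly working runtime it should always equal `n`.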

josephlord · July 15, 2021, 10:03pm

Thanks. Given that potential suspension, the code I commented out is clearly necessary for correctness.
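For reference, a minimal sketch of the actor with that check enabled might look like the following. It assumes a single waiter at a time. The names `PauseGate` and `demoPauseGo` are made up for this example, availability annotations are omitted for brevity, and clearing the stored continuation in `go()` is an addition that was not in the original test case.

```swift
// Hypothetical variant of SUTActor with the re-check enabled.
actor PauseGate {
    private var continuation: CheckedContinuation<Void, Never>?
    private var canGo = false

    func pause() async {
        if canGo { return }
        await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
            // go() may have run between the check above and this closure,
            // so look at canGo again before storing the continuation.
            if self.canGo {
                continuation.resume()
            } else {
                self.continuation = continuation
            }
        }
    }

    func go() {
        canGo = true
        continuation?.resume()
        // Addition for this sketch: drop the continuation so that it
        // cannot be resumed twice if go() is called again.
        continuation = nil
    }
}

// Usage: one task waits on the gate while the caller opens it.
func demoPauseGo() async -> Bool {
    let gate = PauseGate()
    let waiter = Task.detached { () -> Bool in
        await gate.pause()
        return true
    }
    await gate.go()
    return await waiter.value
}
```

Whichever of pause() and go() reaches the actor first, the waiter should finish: either the early return fires, the re-check inside the closure resumes immediately, or go() resumes the stored continuation.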

Regarding the sleep calls: I was trying to force the order of pause and go to see whether there was a wrong order that consistently hung the execution. But whichever of the calls I delayed, I didn't get any failed tests. This makes me think the issue only arises when the calls are in some way simultaneous, or at least interleave suspension points in some way. I'm only guessing from prodding the outside, though; I haven't had a proper look at the actor source code yet.

mickeyl · July 27, 2021, 9:57pm

Did beta 4 change anything for your issue? SR-14841, "Concurrency: Resuming a stored continuation from an actor does not work" (apple/swift issue #57188), is unfortunately unaffected: still the same deterministic hang.

josephlord · July 27, 2021, 10:21pm

It is now a Known issue in the release notes with some potential workarounds. I’m happy with that because people are alerted and it is properly on the agenda to be resolved.

I haven't retested myself yet, but given that it is noted I'm not expecting a resolution in beta 4. I'm sure it will come at some point.

[Edited to add link and quote for release notes]

Code using Task may deadlock in some circumstances. (80688213) For example:

Resuming a stored continuation in a Task.init(priority:operation:) context (previously async(priority:operation:)). (SR-14802, SR-14841, SR-14875)

Making multiple async let calls to the same actor in a Task.init(priority:operation:) context. (FB9213145)

Explicitly specifying priorities other than nil. (SR-14875)

Workaround: For deadlocks involving Task.init(priority:operation:), if possible, replace it with Task.detached(priority:operation:).
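
A minimal sketch of what that workaround could look like in practice. The helper name `runDetached` is hypothetical. It starts the work with Task.detached and leaves the priority at its default of nil, since the notes also list explicit priorities as a trigger. Note that a detached task does not inherit the caller's actor context, priority, or task-local values, so this may not be a drop-in replacement everywhere.

```swift
// Hypothetical helper: runs `work` in a detached task and returns its result.
func runDetached<T: Sendable>(_ work: @escaping @Sendable () async -> T) async -> T {
    // Instead of: Task(priority: .high) { await work() }
    // No priority argument is passed, so it defaults to nil.
    let task = Task.detached {
        await work()
    }
    return await task.value
}
```

Usage would be along the lines of `let answer = await runDetached { 21 * 2 }`.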