Hang when awaiting call to actor

I have an issue in a Vapor code base that I’m migrating to async/await, and I don’t know how to debug it.

Long story short, some unit tests simply hang at the line where they await a call into an actor. Setting a breakpoint or adding print statements in the actor’s methods/properties shows that they are never called. Sorry that I can’t post code to reproduce this, but I haven’t been able to properly isolate the issue and I don’t know how to debug it.

Does anybody have suggestions on how to debug such an issue? Can I somehow log all calls to an actor before the suspension point? Is there a way to know what threads/queues/tasks are waiting on an actor? Any other hints?

I don’t know if this is relevant, but since this is a Vapor code base, all this involves NIO event loops in the background.


Some more detail:

The unit tests use mock classes that conform to certain protocols. The tests do something that calls the mock (there are some async calls involved here, which makes isolating the issue tricky), and the mock increments a counter named requestCount. The unit tests then check that requestCount matches the expected value.

After the async/await migration, the respective protocols now have async method requirements, which means the mock classes also need to implement async methods. Because the unit tests are therefore also async, and the protocol methods might be called on any thread, I thought it would be a good idea to turn the mock classes into actors to ensure that mutations to their internal state are safe.
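
To illustrate the shape (the protocol and names here are invented for this example, not my actual code):

// Hypothetical protocol and mock, invented for illustration.
protocol InfoService {
    func fetchInfo() async throws -> String
}

actor MockInfoService: InfoService {
    var requestCount = 0

    func fetchInfo() async throws -> String {
        requestCount += 1 // safe: actor isolation serializes this mutation
        return "mock response"
    }
}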


What’s funny is that this issue can be fixed and “unfixed” by simple changes in the actor. For example, accessing the requestCount property in the unit tests either works fine or hangs, depending on how that property is declared. The call site is always await mock.requestCount in all cases.

It works fine when declared like this:

actor MyMock {
    var requestCount: Int = 0
}

It also works fine like this:

actor MyMock {
    private var _requestCount: Int = 0
    var requestCount: Int {
        get {
            return _requestCount
        }
    }
}

… but it hangs when I add the explicit async to the computed property getter:

actor MyMock {
    private var _requestCount: Int = 0
    var requestCount: Int {
        get async { // 💣
            return _requestCount
        }
    }
}

Any suggestions on what I could try would be very welcome!

Try an ordinary method, and see whether it still hangs when the method is async vs. when it isn't. That will help rule out a miscompile with computed properties.

My guess is that you're running into an issue where the task is not giving up the actor's executor after making a call to an async method or computed property, because the caller is an ordinary async function without isolation.

The current convention for calling a synchronous function or computed property isolated to an actor is to switch back to whichever executor was in use prior to the call. For an async actor-isolated method, a switch-back only happens if the caller is isolated to some actor (or executor). The reason for this difference escapes me at the moment, but is part of the design.
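
To make that concrete, here's a sketch (names invented) of the two cases:

actor Counter {
    var value = 0                     // synchronous actor-isolated property
    func bump() async { value += 1 }  // async actor-isolated method
}

func nonisolatedCaller(_ counter: Counter) async {
    // Synchronous isolated member: after the access, the runtime switches
    // back to whichever executor this function was on before the call.
    _ = await counter.value

    // Async isolated member: because this caller has no isolation of its
    // own, execution may stay on counter's executor after the call returns.
    await counter.bump()
}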

To test if that's what's going on, you could try putting an await Task.yield() right after the async call. Keep in mind that using yield in this way is a horrible thing to do in practice, and doesn't really solve the problem. Another way to test the hypothesis is to isolate the caller to the MainActor and see if it starts working.


Thanks for your insight and suggestions!

I added these two methods to my mock actor and tried calling them in the unit test:

// Calling this hangs:
func fooNonAsync() -> Int { 42 }

// Calling this works fine:
func fooAsync() async -> Int { 42 }

So yes, the async-ness does make a difference here.

I suspected something in this direction could be the case, but I still have way too little experience with async/await to really know. Thanks for sharing this!

Did you mean to suggest putting the yield() call after or before the other async call? Anyway, putting it after the async call does not help. Putting it before does fix the hang, though. That is, this works:

print("before")
await Task.yield()
await mock.fooNonAsync()
print("after")

(The yield() also fixes the hang I see with my actual mock.requestCount property access, BTW)

Yes, just putting @MainActor on the unit test method also prevents the hang. That’s probably a workaround that I can live quite well with for now.
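
For reference, the workaround looks roughly like this (test and mock names are placeholders):

import XCTest

final class MyMockTests: XCTestCase {
    // Isolating the test to the main actor means the runtime switches back
    // to the main actor after awaiting the mock, instead of staying parked
    // on the mock's executor.
    @MainActor
    func testRequestCount() async throws {
        let mock = MyMock()
        let count = await mock.requestCount
        XCTAssertEqual(count, 0)
    }
}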

Is there anything else I can try to further analyze this? Should I file an issue on JIRA? If so, do you have any tips on how best to extract a minimal sample that demonstrates the issue?

Just FYI: This workaround doesn’t play nice with test discovery on Linux, though :grimacing:

/package/.build/aarch64-unknown-linux-gnu/debug/APIPackageTests.derived/AppTests.swift:147:46: error: call to main actor-isolated instance method 'testInfoResponseEmptyResultCases()' in a synchronous nonisolated context
        ("testInfoResponseEmptyResultCases", testInfoResponseEmptyResultCases),
                                             ^
/package/Tests/AppTests/InfoResponseServiceTests.swift:43:10: note: calls to instance method 'testInfoResponseEmptyResultCases()' from outside of its actor context are implicitly asynchronous
    func testInfoResponseEmptyResultCases() {
         ^
/package/.build/aarch64-unknown-linux-gnu/debug/APIPackageTests.derived/AppTests.swift:512:43: error: converting function value of type '@MainActor () -> ()' to '() -> Void' loses global actor 'MainActor'
        testCase(InfoResponseServiceTests.__allTests__InfoResponseServiceTests),
                                          ^
/package/.build/aarch64-unknown-linux-gnu/debug/APIPackageTests.derived/AppTests.swift:514:41: error: converting function value of type '@MainActor () throws -> ()' to '() throws -> Void' loses global actor 'MainActor'
        testCase(LicenseControllerTests.__allTests__LicenseControllerTests),
                                        ^
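
One idea I haven’t actually verified: keep the test method itself nonisolated so the generated test manifest stays happy, and hop to the main actor inside the body with a small helper (the helper below is hypothetical):

import XCTest

// Hypothetical helper: runs the body isolated to the main actor while the
// calling test method stays nonisolated.
func onMainActor(_ body: @MainActor @Sendable () async throws -> Void) async rethrows {
    try await body()
}

final class InfoResponseServiceTests: XCTestCase {
    func testInfoResponseEmptyResultCases() async {
        await onMainActor {
            let mock = MyMock()
            // Because this closure is main-actor-isolated, execution should
            // switch back to the main actor after this await.
            _ = await mock.requestCount
        }
    }
}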

I've filed a JIRA with a minimal reproducer on your behalf, but feel free to add some context or motivation for how it affects your particular situation, etc.

Now, I have a larger teaching example, with a lot of comments, to help illuminate what's going on so that you can work around this in your program:

actor A {
  func f(_ i: Int) async {
    print("task \(i) called A.f()")
  }
}

@main
struct Main {
  static func main() async {
    let a = A()
    await withTaskGroup(of: Void.self) { group in
      for i in 0..<3 {
        group.addTask {
          await caller(a, i)
        }
      }
    }
  }
}

func caller(_ a: A, _ task: Int) async {
  print("task \(task) starting")

  // Because this caller function is not isolated to any actor, after completing
  // this call to an async actor function, we remain on a's executor, which 
  // can prevent other tasks from using the same actor.
  await a.f(task)
  
  /////
  // Now, here are some one-liner tricks to play with. Try commenting, 
  // uncommenting, or even reordering:

  // Temporarily gives up a's executor, but I believe it will try to 
  // resume on the same executor upon returning? I'm not sure.
  // await Task.yield()

  // This gives up a's executor and switches to the main actor during the call.
  // Similar to a.f(), since we're calling an async function, we won't give 
  // up the main actor after returning.
  // await asyncMainActorFunc(task)

  // This one would also give up a's executor during the call, but upon
  // returning it will try to switch back to whichever executor it was on prior
  // to the call. So this can still prevent forward progress if it appears after
  // a call to an async actor-isolated function.
  // await ordinaryMainActorFunc(task)

  // This terrible hack should get us off of whichever executor we're on now
  // and onto one that is unique, so every task can make progress in this func.
  // await DropExecutor().doIt()

  ///// end of one-liners
  
  // The goal is to have every task make it to `doLongRunningWork`.
  doLongRunningWork(task)
}


actor DropExecutor {
  var state: Int = 0
  func doIt() async {
    state = 0 // needed to prevent optimization
  }
}

func doLongRunningWork(_ i: Int) {
  print("task \(i) starting long-running work")
  while true {}
}

@MainActor
func asyncMainActorFunc(_ i: Int) async {
  print("task \(i) called asyncMainActorFunc()")
}

@MainActor
func ordinaryMainActorFunc(_ i: Int) {
  print("task \(i) called ordinaryMainActorFunc()")
}

To play with the example above, you can compile with:

xcrun swiftc -parse-as-library hang.swift

(just drop the xcrun if you're on Linux). I particularly recommend starting off with all four "tricks" commented out, as in the listing above. You should see something like this:

task 0 starting
task 1 starting
task 2 starting
task 0 called A.f()
task 0 starting long-running work

which shows that the other two tasks are stuck trying to call a.f(), because the one task that got through is still holding a's executor while doing its long-running work. Next, if you uncomment the line that calls asyncMainActorFunc, you should see something like this:

task 2 starting
task 0 starting
task 1 starting
task 2 called A.f()
task 0 called A.f()
task 1 called A.f()
task 2 called asyncMainActorFunc()
task 2 starting long-running work

Notice that now all three tasks made it to a.f but no further, because task 2 is now holding the main actor while doing its long-running work. Any time you uncomment the DropExecutor hack, you'll see that all three tasks make it to their long-running work:

task 1 starting
task 0 starting
task 1 called A.f()
task 2 starting
task 0 called A.f()
task 2 called A.f()
task 1 called asyncMainActorFunc()
task 0 called asyncMainActorFunc()
task 2 called asyncMainActorFunc()
task 1 starting long-running work
task 0 starting long-running work
task 2 starting long-running work

That DropExecutor hack creates a fresh actor instance and calls one of its async methods, which must run on the instance's executor to update its state. Since each instance has a unique executor, it doesn't matter that each task running caller continues on that executor after the call. Of course, this hack is terrible, so please closely watch the bug report for a better solution or fix.


@kavon: Thank you very much for your detailed answer and all your work! This really helped me understand what is going on here and I sure am glad you wrote that JIRA issue because I couldn’t have done such a good job debugging this :sweat_smile:

I’ll watch the bug report for any progress. I think it would probably be best to wait for a fix to this issue before deploying the async/await migration of the Vapor code base. Luckily, there’s no hurry here.

Vapor doesn't use actors internally, so this shouldn't be a blocker.

Thanks for weighing in, @0xTim! I didn’t mean to imply that there was an issue with Vapor – sorry about that. I should have been clearer: I think I’ll wait for a fix to this issue in Swift before deploying the async/await migration of the app that I’m working on that happens to be a Vapor app :slightly_smiling_face:


I'm not a compiler/runtime expert, so I'm curious what they will reply to this bug report. In my understanding, the concurrency system is cooperative, so if we are doing long-running blocking work, it's up to us to add suspension points (awaits) so the runtime can interleave other tasks. If we use non-blocking code, then we will have to add awaits anyway and that will just work; but if we have blocking code that takes a long time to complete, I would say it's up to us to yield back to the runtime at appropriate times.
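
For example, something like this is what I have in mind (just a sketch):

// Sketch: long-running work that periodically yields so the cooperative
// runtime can interleave other tasks.
func doLongRunningWork(_ i: Int) async {
    print("task \(i) starting long-running work")
    for step in 0..<1_000_000 {
        // ... do one small chunk of the actual work here ...
        if step % 10_000 == 0 {
            await Task.yield() // suspension point: lets other tasks run
        }
    }
}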

Of course, I could be wrong, and even if I'm right it might still make sense for the return of await a.f(task) to jump back to another executor. But I'm not sure how predictable that is, since it seems to me that the system tries to avoid context switches as much as possible, which is actually something we want.

Sorry for the rambling; I'm just curious what the resolution of this will be ^^

There’s progress on this!
