Should we document the behavior of `Mutex.withLockIfAvailable()`?

I have an actual use case you could refer to. Before we dive in, I want to emphasize that this use case is probably not universal, but that any use of tryLock() is made easier to reason about if you don't need to factor in "this might randomly not work due to quantum physical effects[1]." :melting_face:

An overly verbose case study

Swift Testing is introducing test cancellation in Swift 6.3, and one of the nuances we've struggled with in getting it to work harmoniously with task cancellation is that we must use an UnsafeCurrentTask in a way that could cause it to outlive the task's lifetime. We know that, generally speaking, our use of UnsafeCurrentTask is safe, except for one very specific race condition. Here's an oversimplified demonstration:

extension Test {
  private var body: @Sendable () async throws -> Void
  private var task: Mutex<UnsafeCurrentTask?>

  func run() async throws {
    try await withTaskCancellationHandler {
      defer {
        // Clear the associated task before returning so we don't
        // risk touching it after it has been deallocated.
        task.withLock { $0 = nil }
      }
      try await body()
    } onCancel: {
      // The *task* associated with the test has been cancelled.
      // Now we need to cancel the *test* too, but this may be a
      // no-op if test cancellation is what triggered task cancellation.
      self.cancel()
    }
  }

  func cancel() {
    task.withLock { task in
      // Cancel the task associated with the test.
      task?.cancel()
      // Clear our reference to the task so future calls are no-ops.
      task = nil
    }
    // Perform additional (irrelevant) state changes here.
    ...
  }
}

The problem here may be subtle: task?.cancel() will trigger the cancellation handler, which will call cancel(). We'll then try to acquire the lock recursively and deadlock or crash. The obvious solution is to move the call to task?.cancel() outside the critical section:

extension Test {
  ...

  func cancel() {
    let task = task.withLock { task in
      let result = task
      task = nil
      return result
    }
    task?.cancel()
    ...
  }
}

Except this solution now introduces a use-after-free bug. If a test creates an unstructured task within the test body, that task could call Test.cancel(). It would acquire the lock and move the task out, but what if that happens just as the test is returning and tearing itself down? The copied UnsafeCurrentTask is used after the lock has been released, so nothing stops the underlying task from ending before task?.cancel() runs. The technical term is Kersplat.

How do we solve the problem?

If withLockIfAvailable() has a guarantee against spurious failure, we can instead write:

extension Test {
  ...

  func cancel() {
    let didCancel = task.withLockIfAvailable { task in
      defer { task = nil }
      if let task {
        task.cancel()
        return true
      }
      return false
    } ?? false
    if didCancel {
      ...
    }
  }
}

So if we acquire the lock, we cancel the task and clear the reference. If we don't acquire it, we know that somebody else is holding it, and the only things that could be holding it are another call to cancel() (so the test and task will end up cancelled either way) or the teardown in run() (in which case cancellation is moot).
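To make the re-entrant path concrete, here is a minimal single-threaded sketch. It deliberately does not use Mutex: TryLock is a hypothetical try-only lock built on a strong compare-exchange, so a failed attempt always means the lock is held. FakeTest and taskCancelHandler are likewise made-up stand-ins for the test and its UnsafeCurrentTask, and none of this is Swift Testing's actual implementation.

```swift
import Synchronization

// Hypothetical try-only lock (an assumption for illustration, not how Mutex
// is implemented). The strong compareExchange never fails spuriously, so a
// nil result always means "somebody already holds the lock".
final class TryLock {
  private let held = Atomic<Bool>(false)

  func withLockIfAvailable<R>(_ body: () -> R) -> R? {
    let (exchanged, _) = held.compareExchange(
      expected: false, desired: true, ordering: .acquiring)
    guard exchanged else { return nil }
    defer { held.store(false, ordering: .releasing) }
    return body()
  }
}

// Hypothetical stand-in for the test. The handler plays the role of the task:
// "cancelling" it synchronously re-enters cancel(), like the onCancel handler.
final class FakeTest {
  private let lock = TryLock()
  private var taskCancelHandler: (() -> Void)?
  private(set) var results: [Bool] = []

  init() {
    taskCancelHandler = { [unowned self] in _ = self.cancel() }
  }

  @discardableResult
  func cancel() -> Bool {
    let didCancel = lock.withLockIfAvailable { () -> Bool in
      defer { taskCancelHandler = nil }
      guard let handler = taskCancelHandler else { return false }
      handler()  // re-enters cancel(); the nested attempt finds the lock held
      return true
    } ?? false
    results.append(didCancel)
    return didCancel
  }
}

let test = FakeTest()
print(test.cancel())  // true: the outer call did the cancelling
print(test.results)   // [false, true]: the nested call was a no-op
print(test.cancel())  // false: the reference was already cleared
```

The nested call records false before the outer call records true, which is the "somebody else is already handling it" outcome the reasoning above relies on.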

But if withLockIfAvailable() is subject to spurious failures, then we have no way to know if a call to cancel() failed or just hit a spurious failure. The only way to solve that problem is to loop until the lock can be acquired, but if the current thread already holds the lock (i.e. we're recursing through the task cancellation handler) then that will never succeed and we're back deadlocking at square one.

Further philosophical posturing

Raymond Chen over at Microsoft gave a decent description on his blog of the difference between weak and strong compare-and-exchange operations and pointed out that you can only really accept a weak compare-and-exchange if failure is cheap:

It comes down to whether spurious failures are acceptable and how expensive they are.

[...]

On the other hand, if recovering from the failure requires a lot of work, such as throwing away an object and constructing a new one, then you probably want to pay for the extra retries inside the strong compare-exchange operation in order to avoid an expensive recovery iteration.

And of course if there is no iteration at all, then a spurious failure could be fatal.

Mutex, as a library API, has no idea whether a tryLock() failure is cheap or expensive[2]. So it should err on the side of caution and avoid spurious failures (which, again, it already does in our implementations; I'm not proposing a pessimization here!).
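For contrast, here is a sketch of the case where weak semantics are fine: a retry loop in which a failed attempt costs one more trip around the loop. HitCounter and increment() are hypothetical names, and in real code a plain atomic add would do the job; the loop is spelled out only to show where a spurious failure gets absorbed.

```swift
import Synchronization

// Hypothetical counter, for illustration only.
final class HitCounter: Sendable {
  private let hits = Atomic<Int>(0)

  // Returns the value after the increment.
  func increment() -> Int {
    var current = hits.load(ordering: .relaxed)
    while true {
      let (exchanged, original) = hits.weakCompareExchange(
        expected: current,
        desired: current + 1,
        ordering: .relaxed
      )
      if exchanged { return current + 1 }
      // Either another thread won the race or the failure was spurious.
      // Both are handled the same way: reload and try again.
      current = original
    }
  }
}
```

The loop cannot tell a spurious failure from a real one and does not need to. A single-shot tryLock() caller like cancel() has no loop to absorb the difference.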


  1. Okay, not actually quantum physical effects, but it's still spooky action at a distance. And there's transistors involved, so, kind of yes actually quantum physics? ↩︎

  2. If you really really need weak compare-and-exchange semantics, Atomic is right there next to Mutex. ↩︎
