Should we document the behavior of `Mutex.withLockIfAvailable()`?

The documentation for Mutex.withLockIfAvailable() does not specify whether it can spuriously fail to acquire the mutex. I'd like to get feedback from the community: should we make clear that it cannot fail in this fashion?

This isn't a formal Swift Evolution proposal, but I'd like our documentation to state that Mutex.withLockIfAvailable() will not spuriously fail if its underlying platform-specific primitive is not held by another thread. Specifically, I'd like to strengthen the language in the Return Value documentation section:

- The return value, if any, of the `body` closure parameter or `nil` if the lock
- couldn’t be acquired.
+ If the lock is acquired, the return value of the `body` closure. If the lock
+ is already held by any thread (including the current thread), `nil`.

And I would like to add the following Note callout to the Discussion section:

+ - Note: This function cannot spuriously fail to acquire the lock. The behavior
+   of similar functions in other languages (such as C's `mtx_trylock()`) is
+   platform-dependent and may differ from Swift's behavior.
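
To illustrate the contract these changes describe, here's a minimal usage sketch (the counter value and messages are hypothetical):

import Synchronization

let counter = Mutex(0)

if let newValue = counter.withLockIfAvailable({ count in
  count += 1
  return count
}) {
  // The lock was acquired and the closure ran to completion.
  print("Counter is now \(newValue)")
} else {
  // The lock was already held by some thread (possibly this one).
  print("Couldn't acquire the lock")
}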

What is a "spurious failure"?

Given the following algorithm:

if mutex.tryLock() {
  doWork()
  mutex.unlock()
}

mutex.tryLock() should return true if the lock was acquired, and false if another thread has currently acquired it. A spurious failure occurs if the function returns false despite no other thread having acquired it.

Why bother?

Normally, the lack of any discussion about spurious failures wouldn't be an issue, since we don't document whether most other APIs can spuriously fail either. However, withLockIfAvailable() is Swift's spelling of the tryLock() operation, and there is a schism between the C/C++ standards and POSIX as to whether tryLock() may fail spuriously.

C/C++ vs. POSIX

C11's mtx_trylock() and C++11's std::mutex::try_lock() are both allowed to spuriously fail. Per the C++11 standard §30.4.1.1/16:

An implementation may fail to obtain the lock even if it is not held by any other thread. [ Note: This spurious failure is normally uncommon, but allows interesting implementations based on a simple compare and exchange [...] ]

(The standard is referring to weak compare-and-exchange operations, which can fail spuriously at the hardware level; strong compare-and-exchange operations cannot fail in this manner.)
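
To make the difference concrete, here's a sketch of a try-lock flag built both ways on the Synchronization module's Atomic type. This is not Mutex's actual implementation; it only illustrates the distinction:

import Synchronization

let lockFlag = Atomic<Bool>(false)

// Strong CAS: returns false only if the flag was genuinely already true.
func tryLockStrong() -> Bool {
  lockFlag.compareExchange(
    expected: false, desired: true, ordering: .acquiring
  ).exchanged
}

// Weak CAS: permitted to return false even when the flag was false
// (a spurious failure), in exchange for cheaper code on some hardware.
func tryLockWeak() -> Bool {
  lockFlag.weakCompareExchange(
    expected: false, desired: true, ordering: .acquiring
  ).exchanged
}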

But the POSIX standard for pthread_mutex_trylock() makes no such accommodation, instead stating simply:

The function pthread_mutex_trylock() is identical to pthread_mutex_lock() except that if the mutex object referenced by mutex is currently locked (by any thread, including the current thread), the call returns immediately.

Other languages

Other languages have equivalent API: Rust has mutex::try_lock(), Go has Mutex.TryLock(), Kotlin has Mutex.tryLock(), and Java has Lock.tryLock(). None of these methods documents that it can spuriously fail (setting aside the Rust-specific concept of a "poisoned" mutex, which is not germane to this pitch). That may be an argument in favour of leaving the documentation as-is.

Where does Swift stand?

It's my position that Swift should document the behaviour specifically. That way, developers who use Mutex will know what to expect from it and will (hopefully) avoid being confused if they've also read the documentation for the C/C++'s equivalents.

I've gone over the platform-specific implementations of _MutexHandle._tryLock() in the Swift repository and have confirmed that none of these implementations is subject to spurious failure (or, at least, none documents any such failure mode):

Platform      | Implementation Based On                     | Uses cmpxchg? | Documents Possible Spurious Failures?
------------- | ------------------------------------------- | ------------- | -------------------------------------
Darwin        | os_unfair_lock_trylock()                    | No            | No
FreeBSD       | UMTX_OP_MUTEX_TRYLOCK                       | No            | No
Linux/Android | Atomic.compareExchange() / FUTEX_TRYLOCK_PI | Strong        | No
OpenBSD       | pthread_mutex_trylock()                     | No            | No
Wasm          | Atomic.compareExchange()                    | Strong        | No
Windows       | TryAcquireSRWLockExclusive()                | No            | No

In other words, although we don't document it, Swift's withLockIfAvailable() implementations do not spuriously fail. These implementations are, of course, implementation details and are subject to change over time, but any change to Mutex that caused it to start spuriously failing would be a breaking change, because it could wreak havoc on code that uses withLockIfAvailable() today.

Or maybe I'm just overthinking it. What do you all think? :grin:

(My usual caveat: as an outsider…)

To me the current documentation reserves judgment on a point it can’t guarantee, which is more correct with respect to setting expectations.

The programmer’s model now includes the negative (nil) case. It’s probably an error for the programmer to conclude, in the negative case, that some other code has the lock; all this code knows is that it couldn’t get it. So the false-negative case is not operational. In particular, I think that means that changing Mutex to “spuriously fail” (e.g., for performance reasons) would not be a breaking change.

That said, explaining somewhere else that a given implementation does not spuriously fail could be very helpful (though I don’t know exactly how).

It’s hard enough to express accurately the behavior of Swift’s APIs for concurrency. I can’t imagine trying to distinguish them from all other historical forms and systems for each API call. But it’s possible that Mutex is the exception, because it’s a feature used standalone.

I do think a high-level comparison of concurrency approaches would be super helpful, and this would be part of that. In particular, Swift’s tolerance for platform differences (to avoid reducing features to the least functional available) creates a big gap in explaining behavior beyond what’s supportable in the common API documentation. Painstaking investigation like yours should not go unpublished.

I think my goal here is to guarantee it, because we can guarantee it; we just don't.

The current documentation is silent on the matter. If I were approaching the documentation in a vacuum, I wouldn't even consider the possibility of a spurious failure, because as a rule we don't concern ourselves with "oh it might randomly fail" as something we need to guard against. It's just frankly weird that the C11 and C++11 standards have this hypothetical abstract failure mode that isn't even shared by the POSIX equivalent on which they were modelled.

I'm not aware of any real-world C/C++ standard library implementation that actually has this failure mode, mind you—but if you choose to code defensively to those language standards, you have chosen to accept spurious failures into your life, and I don't know any software engineer who enjoys dealing with those. Might as well not use mtx_trylock() at all. (Yes, I feel the same way about weak compare-and-exchange operations. I have a 74-page treatise about them somewhere around here…)

None of our implementations spuriously fail, and I don't think we'd ever end up in a situation where one might because just about everything (Windows aside) supports pthread_mutex_t which doesn't have this problem… anyway, if we ever did need to support a platform where the only available tryLock() implementation could spuriously fail, I guess we'd document it as a special case. :man_shrugging:

For what it's worth, Mutex is part of the Synchronization module rather than the Concurrency module, and is not directly related to the latter.

I read that as "unpunished" and thought yeah, fair, I deserve that. :grin:

So, how would this stronger guarantee change how you program against this?

I have an actual use case you could refer to. Before we dive in, I want to emphasize that this use case is probably not universal, but that any use of tryLock() is made easier to reason about if you don't need to factor in "this might randomly not work due to quantum physical effects[1]." :melting_face:

An overly verbose case study

Swift Testing is introducing test cancellation in Swift 6.3, and one of the nuances we've struggled with in getting it to work harmoniously with task cancellation is that we must use an UnsafeCurrentTask in a way that could cause it to outlive the task's lifetime. We know that, generally speaking, our use of UnsafeCurrentTask is safe but for one very specific race condition. Here's an oversimplified demonstration:

extension Test {
  // (Stored properties are shown inline to keep the demonstration
  // short; a real extension can't declare them.)
  private var body: @Sendable () async throws -> Void
  private let task: Mutex<UnsafeCurrentTask?>

  func run() async throws {
    // Associate the current task with the test so cancel() can reach it.
    withUnsafeCurrentTask { unsafeTask in
      task.withLock { $0 = unsafeTask }
    }
    try await withTaskCancellationHandler {
      defer {
        // Clear the associated task before returning so we don't
        // risk touching it after it has been deallocated.
        task.withLock { $0 = nil }
      }
      try await body()
    } onCancel: {
      // The *task* associated with the test has been cancelled.
      // Now we need to cancel the *test* too, but this may be a
      // no-op if test cancellation is what triggered task cancellation.
      self.cancel()
    }
  }

  func cancel() {
    task.withLock { task in
      // Cancel the task associated with the test.
      task?.cancel()
      // Clear our reference to the task so future calls are no-ops.
      task = nil
    }
    // Perform additional (irrelevant) state changes here.
    ...
  }
}

The problem here may be subtle: task?.cancel() will trigger the cancellation handler, which will call cancel(). We'll then try to acquire the lock recursively and deadlock or crash. The obvious solution is to move the call to task?.cancel() outside the critical section:

extension Test {
  ...

  func cancel() {
    let task = task.withLock { task in
      let result = task
      task = nil
      return result
    }
    task?.cancel()
    ...
  }
}

Except this solution now introduces a use-after-free bug. If a test creates an unstructured task within the test body, that task could call Test.cancel(). It would acquire the lock and move the task out, but what if that happens just as the test is returning and tearing itself down? The technical term is Kersplat.

How do we solve the problem?

If withLockIfAvailable() has a guarantee against spurious failure, we can instead write:

extension Test {
  ...

  func cancel() {
    let didCancel = task.withLockIfAvailable { task in
      defer { task = nil }
      if let task {
        task.cancel()
        return true
      }
      return false
    } ?? false
    if didCancel {
      ...
    }
  }
}

So if we acquire the lock, we cancel the task and clear the reference. If we don't acquire it, we know somebody else is holding it, and the only other things that could be holding it are another call to cancel() (in which case the test and task will end up cancelled either way) or the teardown function (in which case cancellation is moot).

But if withLockIfAvailable() is subject to spurious failures, then we have no way to know whether a call to cancel() genuinely lost the race or just hit a spurious failure. The only way to solve that problem is to loop until the lock can be acquired, but if the current thread already holds the lock (i.e. we're recursing through the task cancellation handler), that loop will never terminate and we're back to the deadlock we started with.
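
In code, the defensive workaround (and its failure mode) would look something like this hypothetical sketch:

extension Test {
  // Hypothetical: what cancel() would need to look like if
  // withLockIfAvailable() were allowed to spuriously fail.
  func cancelDefensively() {
    while true {
      let acquired = task.withLockIfAvailable { task in
        task?.cancel()
        task = nil
      }
      if acquired != nil {
        return
      }
      // A nil result would no longer mean "another thread holds the
      // lock"; it might be a spurious failure, so all we could do is
      // retry. But if *this* thread already holds the lock (because
      // we're recursing through the cancellation handler), the loop
      // never terminates.
    }
  }
}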

Further philosophical posturing

Raymond Chen over at Microsoft gave a decent description on his blog of the difference between weak and strong compare-and-exchange operations and pointed out that you can only really accept a weak compare-and-exchange if failure is cheap:

It comes down to whether spurious failures are acceptable and how expensive they are.

[...]

On the other hand, if recovering from the failure requires a lot of work, such as throwing away an object and constructing a new one, then you probably want to pay for the extra retries inside the strong compare-exchange operation in order to avoid an expensive recovery iteration.

And of course if there is no iteration at all, then a spurious failure could be fatal.
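
(As an aside: the reason a strong compare-and-exchange can't fail spuriously is that it absorbs the retries itself. A conceptual sketch, not any real implementation:)

import Synchronization

// A strong CAS expressed as a retry loop around a weak CAS,
// absorbing any spurious failures internally.
func strongCAS(_ flag: borrowing Atomic<Bool>, expected: Bool, desired: Bool) -> Bool {
  while true {
    let (exchanged, original) = flag.weakCompareExchange(
      expected: expected, desired: desired, ordering: .acquiringAndReleasing
    )
    if exchanged {
      return true // the exchange happened
    }
    if original != expected {
      return false // genuine failure: the value really differed
    }
    // Otherwise the failure was spurious; try again.
  }
}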

Mutex as a library API has no idea if tryLock() failure is cheap or expensive[2]. So it should err on the side of caution and avoid spurious failures (which, again, it already does in our implementations—I'm not proposing a pessimization here!)


  1. Okay, not actually quantum physical effects, but it's still spooky action at a distance. And there's transistors involved, so, kind of yes actually quantum physics?

  2. If you really really need weak compare-and-exchange semantics, Atomic is right there next to Mutex.

(Feel free to ignore this message if/since the case study is illustrative only in this context.)

For this case study, I thought Task.cancel() was guaranteed to only call cancellation handlers once, so you could call test.cancel() in the task cancellation handler, and task.cancel() in the test.cancel() body (even if test.cancel() is called multiple times, since only the first task.cancel() is operational). No locks required for that.

For test cleanup, you could use an Atomic<Bool> strong CAS if you needed a once-only gate. So then no locks required at all?
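
(Roughly something like this hypothetical gate:)

import Synchronization

// A hypothetical once-only teardown gate built on a strong CAS.
let tornDown = Atomic<Bool>(false)

func tearDownIfFirst() {
  // Exactly one caller observes exchanged == true.
  if tornDown.compareExchange(
    expected: false, desired: true, ordering: .acquiringAndReleasing
  ).exchanged {
    // First caller wins; perform teardown exactly once.
  }
}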

Is task cancellation the only reason Test holds a task? It’d be nice not to hold it, or to hold it with a straight let/immutable lifecycle instead of using nil-ness as a gate.

That's correct.

No, you couldn't, because the thread that is bound to the task could still end up exiting before the thread that set the flag is done cancelling:

  • Original task starts to tear down
  • Unstructured task sets the "cancelled" flag
  • Original task finishes tearing down
  • Unstructured task calls UnsafeCurrentTask.cancel() and crashes.

A lock is necessary in order to prevent the original task from tearing down before the unstructured task is done calling cancel().

It is necessary to keep an external reference to the test's task for the lifetime of the test. Otherwise, code that runs in a structured child task cannot cancel the test because it can't reach "up" to the task associated with the whole test:

@Test func f() async {
  await withTaskGroup { taskGroup in
    ...
    taskGroup.addTask {
      // ...
      try? Test.cancel() // or Test.current?.cancel() or any other spelling
    }
  }
}

I want to emphasize that test cancellation is a single use case and does not represent all use cases for a tryLock() API. It's just the one I have handy that I can use for illustrative purposes.

I've opened a PR to make the proposed documentation changes here.
