This is a follow-up on @nsc’s post Deadlock When Using DispatchQueue from Swift Task from 2023-07.
Here’s a reduced example that reliably (for me) creates a deadlock in the cooperative thread pool, caused by exhausting the pool with calls into an opaque subsystem (Apple’s Vision framework in this case).
The code
The full example is on GitHub: GitHub - ole/SwiftConcurrencyDeadlock (macOS only because it uses an Apple API). The essential piece of code is this:
```swift
let imageURL: URL = …
try await withThrowingTaskGroup(of: (id: Int, faceCount: Int).self) { group in
    // This deadlocks when the number of child tasks is larger than the
    // number of CPU cores on your machine. Try using a smaller range.
    for i in 1...25 {
        group.addTask {
            // Perform face detection using the Vision framework
            print("Task \(i) starting")
            let request = VNDetectFaceRectanglesRequest()
            let requestHandler = VNImageRequestHandler(url: imageURL)
            try requestHandler.perform([request])
            let faces = request.results ?? []
            return (id: i, faceCount: faces.count)
        }
    }
    for try await (id, faceCount) in group {
        print("Task \(id) detected \(faceCount) faces")
    }
}
```
This creates a task group with a bunch of child tasks, and each child task calls synchronously into Vision.framework to perform face detection on an image. The example has been reduced from a real-world codebase that worked similarly. (In the example, all child tasks run the face detection on the same image, making it wasteful. But you can imagine wanting to run this on multiple different images in parallel.)
Running this code deadlocks every time (in my tests) as soon as the number of child tasks (= the iteration count of the for loop) is greater than or equal to the number of threads in the cooperative pool (10 on my 10-core M1 Pro machine).
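The cooperative pool's default width tracks the machine's active core count, so the deadlock threshold varies from machine to machine. A quick way to check the expected threshold on a given machine (this queries the core count, which I'm assuming matches the pool width, as it did in my tests):

```swift
import Foundation

// The cooperative thread pool defaults to one thread per active core,
// so this number is (presumably) the deadlock threshold on this machine.
print(ProcessInfo.processInfo.activeProcessorCount)
```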
Observation
When you stop the deadlocked process in the debugger, it looks like this:
- 10 child tasks have started running.
- Each of the 10 threads in the cooperative pool is hanging in a lock inside a `dispatchGroupWait` call that originated in Vision.framework.
- No other thread is doing any meaningful work that would make progress toward unblocking one of the cooperative threads.
Analysis
My interpretation:

- Although Vision.framework provides a synchronous API to clients (`VNImageRequestHandler.perform`), it internally performs some async work and uses GCD for this.
- Vision.framework schedules the async work on its own dispatch queue and then uses `DispatchGroup.wait()` to block the thread on which the request came in until the work is done.
- The async work item gets scheduled, but GCD never gives it a thread to run on because it's waiting for a task in the cooperative pool to finish. That never happens, hence the deadlock.
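My guess at the internal pattern can be sketched like this. This is entirely hypothetical: Vision's internals are opaque, and the function name, queue label, and work item below are invented stand-ins.

```swift
import Dispatch

// Hypothetical sketch of a "synchronous facade over async work" pattern.
// Called from a cooperative-pool thread, the wait() below blocks on
// "future work": if GCD never gives the async block a thread, we deadlock.
func performSynchronously() -> Int {
    let group = DispatchGroup()
    let internalQueue = DispatchQueue(label: "example.internal-work")
    var result = 0
    group.enter()
    internalQueue.async {
        // Stand-in for the framework's internal async work item.
        result = 42
        group.leave()
    }
    // Block the calling thread until the async work item has run.
    group.wait()
    return result
}
```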
Relevant quotes from last year's thread:

> `DispatchQueue` is over-committing, but if you're enqueuing to a queue with `queue.sync`, you're also tying up a thread that may be from an executor that is not. `.sync` has the isolation properties of the queue but doesn't change the underlying reality of what thread you're on.

— Deadlock When Using DispatchQueue from Swift Task - #19 by John_McCall

(I imagine `DispatchGroup.wait()` behaves very much like `DispatchQueue.sync`.)
> The general rule is that you should never block work in Swift concurrency on work that isn’t already actively running on a thread (“future work”). That is being violated here because the barrier block is not actively running — it’s been enqueued, but it cannot clear the barrier without running, which it needs a thread to do.

— Deadlock When Using DispatchQueue from Swift Task - #25 by John_McCall
This is the rule we're (unknowingly) violating here. The async work item inside Vision.framework is enqueued, but not yet actively running. And the next quote explains why GCD doesn't bring up a new thread for this work item:
> The specific implementation reason this blows up today is that both Swift concurrency and Dispatch’s queues are serviced by same underlying pool of threads — Swift concurrency’s jobs just promise not to block on future work. So when Dispatch is deciding whether to over-commit, i.e. to create an extra thread in order to process the barrier block, it sees that all the threads are tied up by Swift concurrency jobs, which promise to not be blocked on future work and therefore can be assumed to eventually terminate without extra threads being created. Therefore, it doesn’t make an extra thread, and since you are blocking on future work, you’re deadlocking.

— Deadlock When Using DispatchQueue from Swift Task - #25 by John_McCall
It's really easy to run into deadlocks
I realize that Swift concurrency’s tendency to deadlock is not qualitatively different from GCD, only quantitatively. If you rewrote the code with GCD, it would also deadlock when GCD exhausts its thread pool limit. You can try this easily by dispatching onto a global dispatch queue from inside the child task and using a continuation to bridge back into Swift concurrency land.
GCD-based “workaround”
```swift
try await withThrowingTaskGroup(of: (id: Int, faceCount: Int).self) { group in
    // This "fixes" the deadlock at the cost of thread explosion.
    // Also, GCD's max thread pool size is 64, so if you increase to 64 or
    // more child tasks it will deadlock again.
    for i in 1...64 {
        group.addTask {
            print("Task \(i) starting")
            return try await withCheckedThrowingContinuation { c in
                DispatchQueue.global().async {
                    do {
                        let request = VNDetectFaceRectanglesRequest()
                        let requestHandler = VNImageRequestHandler(url: imageURL)
                        try requestHandler.perform([request])
                        let faces = request.results ?? []
                        c.resume(returning: (id: i, faceCount: faces.count))
                    } catch {
                        c.resume(throwing: error)
                    }
                }
            }
        }
    }
    …
}
```
So I guess you can say that the code as written is bad because it creates an unbounded amount of work at once, which either leads to thread explosion (bad) or deadlock (very bad).
Conclusion
These are the points I'm hoping to make:
It's really easy to run into deadlocks with Swift concurrency.
- Not using "dangerous" APIs (such as `DispatchSemaphore`, `DispatchGroup`, or `DispatchQueue.sync`) in your async code is not enough. Any call into an opaque subsystem, no matter how innocuous it looks (cf. `requestHandler.perform()` above), may internally block on future work (as defined by @John_McCall above), thus violating the general rule.
- The small thread pool size limit makes it way more likely to run into these problems in the real world than it used to be with GCD, even if the qualitative behavior is not so different (@wadetregaskis made the same point last year).
- If you test your code only on your 20-core dev machine but your customers run it on 4–6 cores, you may not be aware of how many deadlocks you're creating.
It's unclear to me what the best workaround is.
- If any opaque subsystem is a potential deadlock problem, I don't see how you can reliably avoid such problems in real-world code, especially in the Apple world where calling into closed-source opaque frameworks is the norm.

  > The default executor for tasks does not overcommit, so if you’re using a system that relies on overcommit for progress, and you cannot rewrite it, then you need to be very careful to only call into it from a thread that is definitely from an overcommitting executor.

  — Deadlock When Using DispatchQueue from Swift Task - #17 by John_McCall

  I am willing to be careful, but I'm not sure how to reliably identify these problems before shipping my code to customers.
- We've seen above that pushing the work out to a global dispatch queue is problematic at best: it causes thread explosion and merely pushes the deadlock threshold out (to GCD's thread pool limit).
- I think a proper solution should somehow limit the amount of parallelism to a reasonable width, perhaps a little less than the number of CPU cores.
  - Is this the best solution? If so, I'd love to have built-in APIs that make this convenient. `OperationQueue` does have the ability to limit the number of operations running simultaneously, but I've always found it pretty awkward to use.
  - A width-limited `TaskGroup` could be a useful thing to have. You can write this code manually, but it's quite a bit of boilerplate. I briefly attempted to write myself an abstraction for this, but that also turned out harder than expected.
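For illustration, the manual width-limiting boilerplate might look roughly like this (a sketch, not an existing API; `maxWidth` and `work` are names I made up):

```swift
// Run `work` for every element of `ids`, but keep at most `maxWidth`
// child tasks in flight at any time.
func runAll(ids: [Int], maxWidth: Int, work: @escaping @Sendable (Int) async -> Int) async -> [Int] {
    await withTaskGroup(of: Int.self) { group -> [Int] in
        var results: [Int] = []
        var remaining = ids.makeIterator()
        // Start no more than maxWidth child tasks up front …
        for _ in 0..<maxWidth {
            guard let id = remaining.next() else { break }
            group.addTask { await work(id) }
        }
        // … then start one new child task for each one that finishes.
        while let result = await group.next() {
            results.append(result)
            if let id = remaining.next() {
                group.addTask { await work(id) }
            }
        }
        return results
    }
}
```

Even this simple version has to juggle an iterator and two submission sites, which hints at why a reusable abstraction is harder to write than it looks.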
In short, I don't know what the best answer is. cc @hborla and @mattie in case you want to include this in your concurrency migration guide (which I look forward to! Thanks for doing this!).
LIBDISPATCH_COOPERATIVE_POOL_STRICT
Sidenote: I tried setting the environment variable `LIBDISPATCH_COOPERATIVE_POOL_STRICT=1` to help identify such problems. It would be nice if running my code under this flag provided a way to detect potential deadlocks during development, independent of the number of cores of the machine the code runs on.
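For reference, the invocation looks like this (the executable name is assumed; substitute your own binary):

```shell
# Limit the cooperative thread pool to a single thread for this run.
LIBDISPATCH_COOPERATIVE_POOL_STRICT=1 ./SwiftConcurrencyDeadlock
```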
But to my surprise, the program runs to completion without deadlocking! Setting the environment variable does limit the cooperative thread pool width to 1 (you can observe that the child tasks run sequentially), but it seems that setting it has other effects too. When you stop the app in the debugger, you can see that there is now another thread serving Vision.framework's internal dispatch queue, so everything is making forward progress, hence no deadlock.
Does GCD change how it interops with the cooperative thread pool when `LIBDISPATCH_COOPERATIVE_POOL_STRICT` is set?