Surely it comes down to what you do with the primitives? The simplest example I can think of that is likely problematic: a thread could acquire a lock and never release it. While I can imagine cases of plain multithreading where that is sometimes acceptable (if inefficient), it seems likely to be unacceptable for a fixed-size cooperative thread pool.
That would be risky in any thread pool. Thread pools are not infinite; most of them have some kind of limit in the end. It might be some really high number like 1024, or they may keep going until the kernel tells you there is no more memory to create more threads. So if you have a thread which never releases a lock, and all other work items eventually need that lock, you will still eventually run into a deadlock of some kind because those work items can't make progress.
The cooperative pool simply brings the thread pool limit closer to NCPUs.
Another example: one thread could lock a mutex, then pass the lock to another thread (by move) to be unlocked. While that's not an error in a plain multithreaded system, I'm guessing it undermines the dependency information that the system assumes to be represented by a lock.
pthread mutexes very unfortunately allow this, but if you do it, the behaviour is actually undefined, regardless of whether it happens in async code or not. What is the critical region if one thread locks the mutex and another thread unlocks it? It's extremely fragile, and it is unclear what kind of synchronization you are expecting and what your protected region is.
That is why os_unfair_locks actually explicitly enforce this by crashing your process if the thread that unlocks the lock is not the same one which took it.
Edit: I just noticed the very important point you made here: 'by move'. The locks that exist today don't support this, and, as mentioned, os_unfair_locks explicitly enforce thread locality. If we gain the ability to do this in the future, we'd have to rethink our lock APIs. The caution was written with what we currently provide and support today in mind.
Lastly, since a lock can be implemented in terms of a binary semaphore, it doesn't seem to be an intrinsic property of the semaphore that makes it problematic.
Yes, you're right, but the only difference here is that you are using a single bit which is flipped on and off, instead of having, say, a pthread_t worth of information for additional bookkeeping on which thread holds the lock. This does mean that the primitive will allow you to unlock a mutex on a different thread than the one which locked it, but then you are falling into the same problem mentioned earlier: undefined behaviour.
But it's hard to imagine that's the whole story, since although you say some primitives can be used safely, you also advise caution in their use. I'm trying to figure out what the exact cautions are.
Regarding locks, the caution is as follows: locks intrinsically rely on thread locality. You need to unlock the lock on the same thread which took the lock, and you can't hold a lock across an await because there is no guarantee that the same thread will pick up the continuation.
Using thread-local storage is another example of something that is not safe in Swift concurrency, since you don't know which threads will pick up your task and its various partial tasks as it suspends and resumes.
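A plain-threads C sketch shows why this breaks (the `request_id` variable is hypothetical): thread-local state written on one thread simply does not exist on another, which is exactly the situation a task is in when it resumes on a different pool thread after an await.

```c
#include <pthread.h>
#include <stdio.h>

/* Thread-local: every thread gets its own copy, initially 0. */
static __thread int request_id = 0;

static void *resumed_on_other_thread(void *arg) {
    (void)arg;
    /* If a task stashed state in TLS before suspending and then
       "resumed" here, that state is gone: this thread's copy is 0. */
    printf("other thread sees request_id = %d\n", request_id);
    return NULL;
}

int main(void) {
    request_id = 42; /* state stored in TLS on the original thread */

    pthread_t t;
    pthread_create(&t, NULL, resumed_on_other_thread, NULL);
    pthread_join(t, NULL);

    printf("original thread still sees %d\n", request_id);
    return 0;
}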
The cases where you need caution with some primitives are really more about correctness in your state machine as opposed to "you are likely going to get into an unrecoverable state". Now if you use these primitives unsafely - like never releasing a lock - that is bad behaviour anywhere, regardless of whether that happens in async code or not.
See also SE-0340: Unavailable From Async Attribute which will provide annotations that API providers can use to warn against using such unsafe primitives in async code, and provide safer alternatives.
Another question: is failure to express dependency information potentially manifested in deadlock, or only in temporary priority inversion? The latter might be an acceptable risk for some applications, while the former is almost never acceptable.
Blocking a thread on a primitive can be safe if you can guarantee that the task which will unblock that primitive (and therefore your thread) has already run or is concurrently running. This is something you cannot guarantee with a semaphore, but you can with a lock, because you will only block on the lock when you know someone else who will unblock you is already running code in their critical region.
The likelihood of deadlock when using a primitive without clear dependency information is higher in a thread pool with a small limit than in one with a much higher limit. In a thread pool with a higher limit, the pool will keep giving you threads until one of those threads runs your semaphore-signaling task, or the thread pool limit is hit. Most thread pools have a high enough limit that you will likely get away without hitting the limit and the subsequent deadlock. But this comes at the cost of thread explosion, inefficiency, and lots of contention.
With Swift concurrency, because we have the asynchronous waiting semantics of await, the choice made was to take advantage of that to build a more efficient cooperative thread pool instead.
I also encourage you to think about the risk of priority inversion on, say, a constrained device like the Apple Watch, where you could easily see a "small" priority inversion result in multi-second user-visible hangs.