[Prepitch]: Scoped, Fixed-Width Concurrency Executors

This is a general idea I'm posting to gather feedback from various experts. It's more a need than a solution at this point, and the solutions discussed may not be possible or correct.

Problem

Today, Swift's concurrency runtime operates on an execution pool (queues, threads, etc.) whose width is fixed to the number of logical processors, with perhaps an additional executor for the main actor (queue, thread, etc.). While this protects against thread explosions, it makes any intensive work rather fraught once the desired parallelism exceeds the available thread count. Once that occurs, awaits may not resume in a timely manner, eventually starving the main actor and causing app hiccups, effectively freezing the app until the work finishes.

To deal with this issue directly, Swift offers Task.yield(), which allows the runtime to execute other work before resuming the long-running task. However, this solution is entirely manual, requiring developers to sprinkle it liberally throughout their code in order to maintain their app's responsiveness. Even where this is possible, it requires significant expertise to do well, and may not even be possible for work performed inside dependencies.
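For illustration, here's a minimal sketch of what that manual yielding looks like in practice (InputItem, OutputItem, and expensiveTransform are hypothetical placeholders):

func processAll(_ items: [InputItem]) async -> [OutputItem] {
    var results: [OutputItem] = []
    for (index, item) in items.enumerated() {
        results.append(expensiveTransform(item)) // synchronous, CPU-bound work
        if index % 1_000 == 0 {
            await Task.yield() // hand-tuned suspension point so other work can run
        }
    }
    return results
}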

Current Solution

The one partial solution currently available for execution outside the standard thread pool is a custom actor executor. Unfortunately, Swift exposes no equivalent of the fixed-width executor that could live under such an actor, forcing developers to build one themselves. Additionally, an actor encapsulates a single context, so it doesn't help control execution of other tasks created during any async work. This can lead to surprising behaviors where work assumed to run on the actor is actually behind an await that resumes on the default executor. Custom executors are also rather low level and unsafe, making them inappropriate as a general solution. A more general, flexible solution is desired.
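To make the shape of that workaround concrete, here's a minimal sketch of an actor pinned to its own queue, assuming an OS where DispatchSerialQueue conforms to SerialExecutor (someLibraryCall is a hypothetical dependency):

import Dispatch

actor MediaProcessor {
    // The queue serves as this actor's custom serial executor.
    private let queue = DispatchSerialQueue(label: "media.processor")

    nonisolated var unownedExecutor: UnownedSerialExecutor {
        queue.asUnownedSerialExecutor()
    }

    func process() async {
        // This line runs on `queue`...
        await someLibraryCall() // ...but the awaited work may run on the default executor.
    }
}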

Future Solutions?

First, Swift should expose the various fixed-width executor creation methods that are used under the hood: something compatible with custom executors, but perhaps generally useful as well. We have DispatchQueue.concurrentPerform, but that allows arbitrary width and requires the system to dynamically limit execution regardless of the work performed. For DispatchQueue, it would be nice to offer another queue type, in addition to serial and .concurrent, that is automatically limited to the processor count.

Second, Swift should add syntax to throw work off onto such a custom context. I'm not sure it's possible, but the core feature here would be scoped structured concurrency, where not just the top-level API or a single actor runs in the custom context, but every child and unstructured task created within it does too. I have no idea whether this is possible or what issues it may otherwise have, but here's what I imagine it would look like.

// Or .logicalCoreCount, .performanceCoreCount, .efficiencyCoreCount, whatever may be appropriate
let executor = FixedWidthExecutor(.explicitWidth(4)) // Runtime limited to core count?

// In a regular context.
await withExecutorContext(executor) {
  await intensiveTaskOnOtherExecutor()
}

These executors could be used for long-running computations, disk or other blocking hardware access, or could perhaps become the basis of a monitoring service needed to ensure cleanup.

So, is this at all feasible or desirable?

5 Likes

We already have everything needed for this, really. TaskExecutor preferences allow you to move code onto a specific executor. The reason it's a preference is that any usage of actors with custom executors down the structured call tree will still hop off the preferred task executor.

DispatchQueue conforms to TaskExecutor and hence can be used to set a preferred task executor. If you want a fixed-width pool, you can create your own task executor that's just backed by a few dispatch queues, round-robining through them when enqueueing jobs.
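For illustration, a minimal sketch of that round-robin idea, assuming a Swift 6-era runtime where TaskExecutor (SE-0417) and the Synchronization module are available (RoundRobinExecutor is a made-up name):

import Dispatch
import Synchronization

final class RoundRobinExecutor: TaskExecutor {
    private let queues: [DispatchQueue]
    private let counter = Atomic<Int>(0)

    init(width: Int) {
        queues = (0..<width).map { DispatchQueue(label: "pool.\($0)") }
    }

    func enqueue(_ job: consuming ExecutorJob) {
        // Pick the next queue in round-robin order (Int wraparound ignored in this sketch).
        let index = counter.wrappingAdd(1, ordering: .relaxed).oldValue % queues.count
        let unownedJob = UnownedJob(job)
        queues[index].async {
            unownedJob.runSynchronously(on: self.asUnownedTaskExecutor())
        }
    }
}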

1 Like

Is that an upcoming feature or something? I tried the following code and got the error "Argument type 'DispatchQueue' does not conform to expected type 'TaskExecutor'":

await withTaskExecutorPreference(DispatchQueue.global()) {
}

Forgive my ignorant question, but why shouldn't the solution to the problem be, say, a fixed-width task group?

Just recently, there was a post asking how to run a video decoder over a certain number of files and display the video thumbnails, a pattern that is pretty common in UI apps. There may potentially be any number of files, meaning there's a risk of overwhelming the queues.

A solution that would be very easy to explain to anyone would be, say:

withTaskGroup(width: .explicit(4)) {
    // ...
}

I.e., you don't need to understand executors if you just want to limit the number of running tasks elegantly. Why not?
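For comparison, here's a minimal sketch of how that width-limited behavior can be emulated with today's task groups: at most four tasks are in flight, and each completion frees a slot (fileURLs, Thumbnail, makeThumbnail, and show are hypothetical):

await withTaskGroup(of: Thumbnail.self) { group in
    let width = 4
    var files = fileURLs.makeIterator()

    // Fill the initial slots.
    for _ in 0..<width {
        guard let file = files.next() else { break }
        group.addTask { await makeThumbnail(for: file) }
    }
    // Handle each result as it arrives; each completion frees a slot.
    while let thumbnail = await group.next() {
        show(thumbnail)
        if let file = files.next() {
            group.addTask { await makeThumbnail(for: file) }
        }
    }
}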

5 Likes

@crontab – I agree that a tractable solution would be to add a “width” to with[Throwing][Discarding]TaskGroup.

I might also argue for something similar for concurrentPerform, too, but that is perhaps too far out of scope for this conversation.

But the idea of being able to control the degree of concurrency for an actor, itself, would be very attractive, too. Unlimited actor reentrancy sometimes introduces all sorts of problems, and it would be nice to constrain that. I know that the “Future Directions” section of the actors proposal contemplated non-reentrancy, but I think that discussion should be broadened to a more flexible “max degree of reentrancy” (where 1 means not reentrant).

It’s not terribly hard to constrain concurrency with all three patterns manually, but it ends up being a lot of syntactic noise in our code.

A minor observation on terminology: In lieu of “width”, I might argue for “max” sort of terminology, like OperationQueue’s maxConcurrentOperationCount or Combine’s flatMap(maxPublishers:_:).

2 Likes

Overall I think there's something to address here, but I'd like to clear up some statements made here before we dive into what the actual request and potential solutions might be.

In no way are custom executors unsafe; I'm not sure what you mean here?

This is exactly what task executor preferences are. There is no need for new Task APIs here, I think, especially since they already address the whole task tree, as you're asking for.

Hm, this is a bit confusing: the DispatchQueue docs claim the protocol is adopted already, but it seems it doesn't conform in macOS 15.3 yet, AFAICS.

Without getting into release details, I can confirm that yes, DispatchQueue is supposed to conform to TaskExecutor and eventually will. We also have the same "lagging adoption" problem in swift-corelibs-libdispatch, which provides the queue implementations on Linux; we'll need to make sure the adoption is done properly there too, but yes, in both places DispatchQueue is expected to implement this protocol eventually.

The OP question is a bit vague, but in my experience this use case comes up when one would like to offload some blocking work from the global shared pool.

In that case, changes to task group "width" alone are not going to make a difference at all.

You can still have a bunch of tasks land in those "width-limited groups" and have them all block the global pool, even if every group is "(task group) width: 2", etc.

So this isn't about the width of a group, but about using a different set of threads for the "other work".

This is a pattern we've successfully used in production deployments of Akka in my previous life: putting the "bad blocking legacy calls" onto some other thread pool.


Back to the root of the question though:

What you're actually getting at is that an oversubscribed system is limited by the thread count in the global pool. In reality, if all work is CPU-bound and not "blocking on IO", throwing more threads at it will not make much of a difference. It's still the same number of CPUs, and we'd only increase thread switching between all the work, which ultimately is still all just going to be executed on those same CPUs. Note that, AFAIR, the thread count is already some heuristic on the order of 2x the number of cores.

So when does throwing more threads at such a problem "help", in the sense of preventing starvation of the shared pool? Well, when there is some "bad blocking work" (calling into blocking C APIs, for example, or blocking IO -- which on Linux we'll now be able to move away from thanks to io_uring in swift-system) that we could move over to some dedicated threads: say, we're ok with 4 threads being dedicated to blocking IO.

So we have some section of the code we know makes such calls, and we want to section it off... That's the reason task executors exist!

They provide a way to "execute this task (and its child tasks) on this executor (which may have many threads at its disposal)."

The existing APIs, withTaskExecutorPreference() and Task(executorPreference:), achieve just that: a section of the task hierarchy will execute on a given executor.
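A minimal sketch of both, where pool is assumed to be any TaskExecutor-conforming instance (myBlockingIOPool and doBlockingStyleIO are hypothetical):

let pool: any TaskExecutor = myBlockingIOPool

// Structured: the closure and its child tasks prefer `pool`.
await withTaskExecutorPreference(pool) {
    await doBlockingStyleIO()
}

// Unstructured: this task and its children prefer `pool`.
Task(executorPreference: pool) {
    await doBlockingStyleIO()
}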

DispatchQueue will implement TaskExecutor, and NIO or other thread pools could also implement it. We are also actively working on a global executor customization proposal, and I suspect a follow-up there might be to expose some platform-specific (e.g. pthread-based) executor implementation which you could tell "use 4 threads please" and then use as a task executor...

--

Summing up, I don't think this prepitch is really pitching any new API? We have the APIs. You can write a bunch of task executors and make a package for them right now, even.

Perhaps we should offer more executor implementations, but that is on the radar already and @al45tair is working on those -- perhaps as a side effect we'll make them possible to instantiate and use as specific task executors? Maybe, we'll see.

6 Likes

In practice, any "more than one" reentrant call on an actor defeats the non-reentrant nature of a method or actor, so I don't think an ability to control the "number of reentrant calls in flight" really belongs in the language.

If you want to do such things, it's most likely tied to some business logic that controls this, and therefore you should be implementing that yourself. Language-model-wise, such an actor is reentrant, as you have to worry about all the usual reentrancy issues.

1 Like

I'm not quite sure why anything would be blocking in the case of task groups with widths.

Imagine if TaskGroup's addTask() could store the closure somewhere, not creating a task until one of the currently running tasks completes and frees the "slot". Of course, you would not wait on the entire group in this case.

The common pattern that comes up often in UI apps is: do something asynchronously on an arbitrary number of elements, e.g. download N images and display them. N can be very large, but you also don't need to wait for all of the tasks to complete; you update the UI as the images arrive.

Of course the "task pool" concept can be implemented today but I think it would mean using semaphores, so an API would be nice to have. Its ease of use would encourage people to consider it more often I think.

The thread started out about adding threads and ways to do so, so I explained when this would make sense: when something blocks. In that scenario, just limiting the concurrency of a single task group doesn't help globally; you can still call an API using such a task group many times and still starve the shared pool.

The limited-width task group is also useful and something I'd like to add (anyone should feel free to propose/implement it; contributions welcome!), but it's not really addressing the process-wide problem of starving the shared pool.

2 Likes

Throwing more threads at the problem does not help execution time in CPU-bound cases, but it helps keep the global thread pool able to accept new (potentially higher-priority) work.

In the past I've run into issues when adding CPU-heavy, fully multithreaded, non-yielding tasks to an app using TaskGroup, while Dispatch's concurrentPerform worked perfectly.

In my case, what happened was that the task's CPU-heavy work filled the thread pool the moment execution started, so new tasks (even higher-priority new tasks) didn't get a chance to run. Of course, adding yield to the CPU-heavy tasks worked, but:

  • I was concerned about the performance implications of adding periodic yields to the functions.
  • It's a bit fragile. I (empirically) found that yielding every N iterations worked well on my MacBook, but slower devices take longer to run each iteration, so I'd need a lower value of N to keep the thread pool "responsive" on those devices, while yielding too often hurts performance on the faster devices. (A time-based yield, sketched below, is one possible workaround.)
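For illustration, a minimal sketch of that time-based idea, yielding on elapsed wall-clock time rather than a fixed iteration count (iterations and heavyStep are hypothetical):

func crunch() async {
    let clock = ContinuousClock()
    var lastYield = clock.now
    for i in 0..<iterations {
        heavyStep(i) // synchronous, CPU-bound work
        // Yield roughly every 10 ms of real time, regardless of device speed.
        if clock.now - lastYield > .milliseconds(10) {
            await Task.yield()
            lastYield = clock.now
        }
    }
}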

This seemed like a problem that would ideally be handled by the OS's scheduler: I create a bunch of threads that will synchronously execute CPU work for a long while, while adding more tasks to the default Swift Concurrency thread pool still works, since the OS can prioritize the pool's threads when higher-priority tasks land in them.

Ideally there would be a way to express "use 4 threads for this" that is not platform-specific (though the implementation obviously will be). In the general case, the developer probably doesn't care too much what's used under the hood to create those threads, as long as it keeps the default thread pool able to accept new work and doesn't incur massive overhead.

Though at that point, it seems worth thinking about a possible alternative to Dispatch's concurrentPerform too, as it's another common way of approaching the same problems that lead to using a fixed-width thread pool...

2 Likes

Respectfully, this is a slightly circular argument. Obviously, “more than one” defeats “non-reentrancy”. But that isn’t the question, because actors are reentrant.

The question is what is more useful: non-reentrancy or constrained reentrancy. (Admittedly, I spend more time writing boilerplate code for the latter than the former, so I am biased. Whenever writing concurrent code, I ask myself “what will happen in the massively concurrent scenario?” and tend to program defensively for that case.) And to be provocative, do we really ever want infinite reentrancy?

Admittedly, the non-reentrancy requirement is generally solving a different class of problem than the constrained-reentrancy requirement. That said, I view a width of 1 as just a special case of width n. If you solve the n scenario, then you have solved the 1 scenario, too. My point is: why solve one without the other?

So, I respectfully disagree with your conclusion, but appreciate the time you took to respond and understand your position.

I'm wondering (and apologies if this is becoming off-topic): suppose you have the N-width mechanism for actors in place, how do you solve the problem of the blocked N + 1 call? What exactly happens to it?

I always thought that this type of problem in concurrency should be solved at the application level, algorithmically, rather than shifting the burden onto the lower-level scheduling mechanisms. For example, if you have an actor that manages a slow peripheral, just design the API to be responsive and never blocking. What is wrong with that?

1 Like

I would suggest that you just await it, just like all calls to that actor’s functions. That’s what would undoubtedly happen if they supported non-reentrancy. The fixed “width” scenario would be similar.

That’s a lovely example. Assume your actor is awaiting some interaction with the device initiated by a prior call. Because of reentrancy, you might want to make sure that a subsequent call to the actor doesn’t start until that prior call is done. Non-reentrancy would save us from having to write a lot of boilerplate code to, for example, store a Task for the prior call, write custom cancellation code, and await it before starting the next request, etc.
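For illustration, a minimal sketch of that boilerplate, chaining each request onto the previously stored Task (DeviceResponse and talkToDevice are hypothetical):

actor DeviceManager {
    private var previous: Task<Void, Never>?

    func send(_ command: String) async -> DeviceResponse {
        let earlier = previous
        let task = Task { () -> DeviceResponse in
            await earlier?.value               // wait for the prior call to finish
            return await talkToDevice(command)
        }
        previous = Task { _ = await task.value } // type-erased link for the next caller
        return await task.value
    }
}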

1 Like

Yes, but if the caller is itself a non-reentrant actor, isn't there a potential for deadlock? The very thing that structured concurrency tries to avoid?

True, and this pattern is also very common in UI apps: you need to download some resource from the network, and then any number of callers can request it at any time. You create a singleton task like you said, and its result (value) becomes the source of that resource for everyone who requests it, whether before or after the task completes.
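A minimal sketch of that pattern, with a single stored Task serving every caller before and after completion (Resource and download are hypothetical; note that a failed download stays cached in this sketch):

actor ResourceLoader {
    private var loadTask: Task<Resource, Error>?

    func resource() async throws -> Resource {
        if let loadTask {
            // Join the in-flight (or already finished) download.
            return try await loadTask.value
        }
        let task = Task { try await download() }
        loadTask = task
        return try await task.value
    }
}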

But that's honestly not a lot of boilerplate. I'm not sure which would require more boilerplate: this pattern, or defining a width=1 actor, possibly with its own singleton. Such an actor would probably give more flexibility, but is it worth the effort of implementing the feature, given that it also potentially opens the door to deadlocks and possibly other problems? Just for the sake of argument; I'm not entirely sure about these things.

Yes, these sorts of circular dependencies between non-reentrant actors are contemplated in SE-0306 Actors – Deadlocks with non-reentrant actors. I’d refer you to that discussion. I suspect it is a cautionary note about proper use if/when non-reentrant actors are introduced.

But as you can see from the Future Directions, they’re clearly contemplating non-reentrancy, regardless. To my eye, I would welcome that, even if they run with basic non-reentrant actors and never get to constrained-reentrancy.

FWIW, this sort of risk of circular dependency affects legacy patterns, too (e.g., dispatch queues that directly or indirectly dispatch synchronously to themselves, circular semaphore and dispatch group “waits”, “waits” in unhandled thread explosion scenarios, etc.), so I don't know if this is just a matter of “with great power comes great responsibility” or whether they’re planning some clever compiler/analyzer feature to warn us of this.

But all of this is irrelevant to the “width” argument, as it affects both non-reentrancy and constrained reentrancy. If they do not move forward with non-reentrancy at all, for whatever reason, this is all moot.

I agree, it’s not horrible. But it is still noise I add to pretty much every project. And it is something that OperationQueue and Combine publishers solved long ago, so it does feel like we’ve gone a little backward.

Frankly, I would declare success if with[Throwing][Discarding]TaskGroup offered maxConcurrent-like functionality. This is the most common use case for me (I just add if index > 4 { await group.next() }, or what have you, and that fixes it). Still, it strikes me as a simple and elegant refinement to task groups.

I only brought up the actor issue because I know non-reentrancy was on their radar, and I thought it was worth raising the question within that context, too. But, it sounds like there is no appetite to address this, regardless, so that’s moot (if disappointing).

1 Like