Concurrency, how to do heavy CPU work?

PhilippeWeidmann · November 21, 2023, 1:19pm

Hi,

We have been using structured concurrency for a while in our project and for example this works well for network tasks. However, I still don't understand what's the proper way to do work that requires heavy CPU usage.

Let's consider a simple synchronous function that computes prime numbers -> static func computePrimeNumbers() -> [Int] in a class ComputePrime. Calling this function is expansive and we want to avoid blocking the main thread.

What would be the right way to do this using structured concurrency ?

Wrap the call in a Task (that doesn't inherit main actor) ?

Task {
   let results = ComputePrime.computePrimeNumbers()
}

This doesn't seem right because we would block one thread from the thread pool and we could also potentially create the Task in the MainActor which would block the main thread.

Marking computePrimeNumbers async would prevent the function from inheriting MainActor but still block forward progress.

We could also create a DispatchQueue and wrap the code in withCheckedContinuation but this seems a bit articulated.

What is the best way or the right way to do such work ?

Max_Desiatov · November 21, 2023, 2:21pm

I'd say don't use Task in non-UI code in general, but especially if you're talking about structured concurrency. Task initializer creates unstructured tasks that don't get cancelled when a parent task is cancelled (since no parent task in the first place) and don't propagate errors.

IIUC usually there are as many threads in the Swift Concurrency thread pool as there are CPU cores. If you need that thread to be freed up for whatever reason, break up your CPU-bound work into iterative work with checkpoints that allow it to suspend and unblock that thread on the thread pool for a moment, after saving at a checkpoint. But if you don't ever run more than a few (significantly less than the number of CPU cores) of these CPU-bound blocking computations simultaneously, then checkpoints suspension logic may not be worth implementing. One benefit of checkpoints though is that it allows you to check for Task.isCancelled or run try Task.checkCancellation() and honor cancellation requests that way.

You could create a separate actor that will also hold your computation state to help you implement the checkpoints behavior as described above. In general, code in this new actor won't block MainActor.

This sample code implementations cancellation, but no suspension, so no member functions are explicitly marked async. If you do need suspension, you'll have to be mindful of actor reentrancy and make sure it doesn't cause any issues for you.

actor ComputePrimeNumbers {
  private var lowerLimit: Int
  private let upperLimit: Int

  private var alreadyComputed = [Int]()

  init(lowerLimit: Int, upperLimit: Int) {
    self.lowerLimit = lowerLimit
    self.upperLimit = upperLimit
  }

  func getResult() { self.alreadyComputed }

  func run() throws {
    while let value = self.computeNext() {
      self.lowerLimit = value + 1
      self.alreadyComputed.append(value)

      try Task.checkCancellation()
    }
  }

  private func computeNext() -> Int? {
    let result = // ... one iteration of your computation here
    guard result < upperLimit else {
      return nil
    }
  }
}

@main
struct Entrypoint {
  static func main() async throws {
    let computation = ComputePrimeNumbers(lowerLimit: x, upperLimit: y)
    try await computation.run()
    print(await computation.getResult())
  }
}

asdf_bro · November 21, 2023, 3:27pm

You should do the work on a separate DispatchQueue or pthread. Swift Concurrency's fixed width thread pool isn't a good match for CPU intensive work.

wadetregaskis · November 21, 2023, 3:37pm

PhilippeWeidmann:

Task {
   let results = ComputePrime.computePrimeNumbers()
}
This doesn't seem right because we would block one thread from the thread pool and we could also potentially create the Task in the MainActor which would block the main thread.

Ensuring non-use of the main actor

There are various ways to create Tasks that might ensure by design that you don't run them on the main actor, although alas there is no @NotMainActor decorator that can help ensure you don't accidentally hit the main thread. Though presumably you would notice that in your own testing, at least.

There's also various methods that can check at runtime iff you can design things such that your long-running tasks are tied to specific isolation contexts, e.g. Actor.preconditionIsolated(…) / GlobalActor.preconditionIsolated(…). That might mean some awkward boilerplate of creating an Actor instance for each work item, for example, but it is technically doable.

`Task.detached`

Using Task.detached is a reliable way to ensure you're not on the main thread, as its whole purpose is to create a new top-level task, which cannot be associated with any existing isolation context (main thread or otherwise). As such it's actually too far the other way sometimes, but nonetheless if used appropriately it's fine.

Just remember that as a root task it won't be automatically cancelled by anything else, so you may have to add extra plumbing - but often this works well-enough because you can use a TaskGroup that's managed from inside a non-detached task, and just manually cancel the TaskGroup as a whole, as necessary, e.g.:

// In some suitable isolation context / existing `Task`,
// even in the main actor if need be.
withTaskGroup(of: Void.self) { taskGroup in
    for item in workItems {
        taskGroup.addTaskUnlessCancelled {
            let detachedTask = Task.detached(priority: .low) {
                // CPU-hogging work, checking cancellation
                // periodically.
            }

            withTaskCancellationHandler {
                await detachedTask.result
                // You'll probably want to do something with the
                // result, too.
            } onCancel: {
                detachedTask.cancel()
            }
        }
    }
}

Avoiding CPU-hogging `Task`s

Regarding hogging the CPU, you're quite right that this is indeed a concern with Swift Structured Concurrency. Though currently the main thread / actor has its own extra kernel thread, separate from the general pool, that is subject to change in future. In any case, you might well have time-sensitive tasks going on in the general pool too.

Fortunately, there's the Task.yield() method specifically for this situation. As long as you call that often enough¹ and your CPU-hogging task's priority is lower than your other, time-sensitive tasks, you should be able to maintain good time sharing between your tasks.

¹ Alas, what is often enough is context-sensitive and varies over time as CPU's get faster. You might need to experiment to determine the best place to yield.

Max_Desiatov · November 21, 2023, 4:06pm

Can you elaborate how would that be true? If you create your own thread pool that's wider than Swift Concurrency one (more than the number of CPU cores), it won't make anything go faster, in fact it will introduce an additional context switching overhead.

gwendal.roue · November 21, 2023, 4:48pm

I remember these tweets by @ktoso, which say:

Don’t do this. This submits heavy blocking work onto the shared global width limited pool and that’s how you get Thread starvation of your entire swift concurrency using application.

Swift concurrency lacks a good answer here - custom executor pools would be that answer 1/2

So instead resort to a thread pool or dispatch queue dedicated to given “heavy blocking work things” and put it on there — returning back to swift concurrency by awaiting via an continuation resumed by that work. 2/2

Since then I avoid blindly submitting heavy CPU jobs or blocking I/O in Tasks. For example, GRDB dispatches all SQLite calls to some DispatchQueues, and syncs with continuations. But I admit this is just a cargo cult at this point. Maybe some good practices have emerged, but I wasn't lucky enough to find them.

MPLewis · November 21, 2023, 8:03pm

I think the key difference here is the distinction between blocking work (which should be offloaded from the default Swift Concurrency executor) and CPU-heavy but non-blocking work (which is fine to use the default executor for, as long as it has sufficient checkpoints for cancellation and yielding the backing thread).

wadetregaskis · November 21, 2023, 9:31pm

Yes, and if throughput is the only thing you care about, that might be bad (depending on how well you otherwise schedule your critical path tasks).

What we're talking about is fundamentally cooperative was pre-empting multitasking. There's a reason pre-emptive multitasking is the dominant paradigm today on most operating systems (the only real exceptions being in time-sensitive 'embedded' systems, for similar reasons to not using garbage collection). It's simply harder to implement a cooperative multitasking system that doesn't have cases of pathological behaviour.

In reality many (most?) programs are not purely about throughput. If they have any interactive interactions with any other agent - whether that be a human, anything over a network, or even just other processes on the same computer - then they need to be concerned about blocking.

In general I think GCD is the better option for CPU-heavy work, but as I noted in my prior post it is possible (in principle) to use Structured Concurrency for that too. There are of course other context-specific factors in which is most suitable, so each situation needs to be decided on its own needs.

Jon_Shier · November 21, 2023, 9:55pm

Technically true, but even with yielding and checking cancellation, it's possible to block all async work for non-negligible amount of time (to a computer), especially on lower core devices. Given how easily, if not obviously, this can be avoided by running in GCD, it's generally a better solution. Apple doesn't give us the best tools here, as there's no way to replicate the "auto grow to size" behavior the queue under Swift concurrency gets (perhaps it's visible in the C headers? I haven't checked), so we could still use some help from the language.

ktoso · November 22, 2023, 2:03am

Isolating "blocking calls" (be it long running or non interruptible IO etc), onto separate threads has the primary purpose not to make anything to faster, but isolate the rest of the system from the un-responsiveness during those operations.

Having to call "bad, long running / blocking API which I don't control" is absolutely a thing, and putting them on a different thread has the primary purpose of keeping the existing threads able to respond rather than "not able to respond at all if we have n (cpu count) of those bad calls happening).

It may be better to have a dedicated thread for those "bad calls" and just execute them serially rather than prevent the entire system from reacting to anything until those calls completed.

I've illustrated this issue way back with an akka http web server and the story is really the same for any shared concurrency pool, such as Swift's default global concurrent executor. Notice the difference between:

"just use the default pool":

and "isolate the blocking to a dedicated threadpool" ("my-blocking-dispacher" in this diagram):

( Turquoise - Sleeping state | Orange - Waiting state | Green - Runnable state)

You'll notice that by isolating the blocking, we kept all the default threads waiting or runnable, the system is able to respond to other jobs/tasks, while in the first example everything is blocker (sleeping) and we'll miss timeouts, health-checks and are basically entirely unresponsive until the work completes. This is why isolating blocking operations matters -- not really for speeding anything up.

I know that on Apple platforms and devices opinions differ if it's OK to do IO on the shared pool -- some would say it's okey and perhaps they're right on a device.

But on a server, where you have much more concurrent requests and clients it really isn't an option -- if you're doing so you're going to timeout requests, healthchecks and other such time sensitive jobs.

So as always it boils down to understanding your application and doing what's right for it.

If you wanted to customize where execution happens, Swift provides two tools for this: custom actor executors since Swift 5.9, and the upcoming feature of custom Task Executors whihc will undergo Swift Evolution discussion soon.

To replicate the above diagrams, it basically comes down to creating more executors dedicated to the IO tasks and using them as task and serial executors.

You can use a dispatch queue as a custom executor as well, since Swift 5.9. So we're slowly providing the necessary control to isolate such work onto dedicated threads/queues Please give the task executors proposal a look as well, it is aimed to help exactly with this.

PhilippeWeidmann · November 22, 2023, 9:00am

Thank you all for your answers and great explanations.
From what I understand:
Using Task to defer small work amount off the main thread is okay as long as you don't block for too long.

If I have to do long running work there are multiple solutions:

Use structured concurrency but with Task.yield and Task.checkCancellation to be sure to let other work happen (I think this can work well in a for loop ?).
Use GCD and bridge with structured concurrency

David_Smith · November 22, 2023, 9:25am

This approach is the preferable of the two if possible, since it reduces both memory overhead (for thread stacks) and CPU overhead (for context switching)^[1].

What it comes down to is that there's a delicate balance between two conflicting pressures:

Having more threads than CPU cores is always wasteful (there's nowhere to run them!)
If all of your threads are busy, creating a new one will allow the kernel scheduler to interleave work for you by switching between your threads

Yielding to other coroutines as you describe here attempts to reconcile this contradiction in a very elegant way: by doing our own switching in userspace, we save the overhead of having more threads while still interleaving tasks. But it introduces a third pressure instead:

You have to identify good points to interleave work yourself (either by awaiting other async calls or by explicitly yielding), rather than just letting the kernel interrupt you wherever

The reason this is preferable for CPU work but not currently for blocking IO work is that CPU-bound tasks are still executing ordinary application code, where it's possible to yield to other tasks. Tasks blocked in IO are not executing at all, so have no chance to switch until the kernel returns control to them. (also if you're waiting for a non-CPU resource then something could be using that CPU core in parallel rather than needing to interleave!)

Judicious use of GCD can achieve similar behavior, but you lose the executor stealing optimization, and it's either easy to unintentionally go too wide (if using concurrent queues) or you need to manually determine the correct width for the hardware you're running on (if using serial queues) ↩︎

wadetregaskis · November 22, 2023, 4:11pm

Be careful not to under-appreciate the challenges inherent in this. Again, consider that if this were really such a great solution, we might still be using cooperative multi-tasking from 'Classic' MacOS. There's a reason the rest of the world mocked Apple brutally throughout the 1990s when System 7, then MacOS's 8 & 9 continued to not have pre-emptive multi-tasking.

(it's a little different when you're talking OS multi-tasking vs single-process-internal multi-tasking, but there's more parallels than not)

First there's the immediate, local challenges, including (but not limited to): when do you yield? How often? How do you ensure that happens reliably - e.g. if you just naively yield with every iteration of a loop, what happens if sometimes a loop iteration is really fast and sometimes it's really slow? And what about code you call that isn't yours - does it yield appropriately?

What about across different devices and CPUs - they perform differently so your yield frequency isn't intrinsically consistent. And they change over time - usually getting faster, but sometimes slower (e.g. Apple Watch introduction)! So there's a perpetual maintenance burden from the testing and tuning needed with every new hardware system (and potentially software changes - OS updates can change various aspects of scheduling behaviour - e.g. forcing your task on or off 'Efficiency' cores).

Then there's the perpetual, non-local challenges, like what are the other cooperating tasks doing? If they yield "more often" than you relative to their amount of work getting done, your own seemingly generous yielding behaviour might still be too "greedy". e.g. consider an I/O task that reads whatever's presently available in the kernel's buffer. If your CPU-hogging task yields only every 10ms - because you figure if that's a good slice interval for pre-emptive multi-tasking, why not the same for cooperative - then your I/O is going to be incredibly slow because you're getting at most one full kernel buffer's worth of data every 10ms. Network kernel buffers tend to be pretty small (e.g. 128 KiB for TCP on macOS Sonoma), so you've now limited your throughput to just ~ten megabytes a second.

But if you try to compensate by yielding every e.g. 1ms, now you're susceptible to CPU-hogging tasks that will do 10ms of their work for every mere 1ms you get. And you're paying a lot of context-switching overhead (even in user-space switching, you're still alternating working sets and thus cycling caches etc).

Maybe that's a benefit for a very tightly controlled, holistic system - maybe you like having these implied priority differences between tasks - but that's not usually what any of us are working with.

These problems aren't insurmountable insofar as the implementation doesn't have to be perfect to be acceptable, and it's true that pre-emptive multi-tasking isn't perfectly trivial either (e.g. the thread explosion challenges), but I think it's fair to say that pre-emptive systems (a la GCD) are more robust and beginner-friendly.

Again, that's not to poo-poo Structured Concurrency or disregard it as a useful tool, but be aware in using it that it requires more care and global knowledge than alternatives (at least, if you're going to use it for anything other than quite short-lived tasks).

asdf_bro · November 22, 2023, 5:18pm

In my programs I aim for "islands of cooperation in a sea of preemption"

Swift Concurrency, with the proposals for Task Executors and an @inheritsIsolation function attribute (currently @_unsafeInheritExecutor), will be an excellent tool for building islands of cooperation.

In today's Swift Concurrency, the global executor (which at times has a width of 1), must be an island exclusively for very quick, low latency jobs.

David_Smith · November 22, 2023, 7:53pm

Absolutely! Listing selecting when to yield as a “peer” of the other issues rather than something smaller was fully intentional; I don’t expect it’ll be feasible in every situation, and I’m delighted we’re expanding the ability to mix and match approaches.

It’s entirely possible that in 6 months I’ll find myself writing a similar post extolling the virtues of preemption if the community “best practices” sentiment pendulum swings too far back the other direction

Concurrency, how to do heavy CPU work?

Ensuring non-use of the main actor

Task.detached

Avoiding CPU-hogging Tasks

`Task.detached`

Avoiding CPU-hogging `Task`s