I have an optimization problem that is currently single-threaded, and I want to parallelize it across however many CPU cores are available. I don’t have much experience with multithreading, and I’d like to learn how this ought to be done.
The setup is, essentially, I have a list of items and I need to find the lowest-cost group of those items that meets certain requirements. It looks something like this:
struct GroupFinder {
    var items: [Item]
    var bestGroupYet: ItemGroup? = nil

    mutating func findBestGroup() {
        for item in items {
            findBestGroupStartingWith(item)
        }
    }

    ...
}
The value of bestGroupYet gets updated whenever an acceptable group with lower cost is found. This lowest-cost-yet is used throughout the search to prune away entire branches of the search tree. The calls to findBestGroupStartingWith are nearly independent of each other, except for checking the current lowest cost.
There may be thousands of items in the array, but I don’t want them all trying to run at the same time and thrashing back and forth, only as many as the available CPU cores. I have seen this thread, but there the subproblems were entirely independent with no shared mutable state, and the number of parallel tasks was hard-coded.
For the most part… it's not usually the job of the product engineer to think about cores… it's usually the infra engineer's job to take that work, and the infra is where the decision is made about how many cores perform it. That's been the mental model going back to GCD. Modern concurrency tools are an abstraction layer on top of threads and cores.
If you really do want to have the level of control to specify details like number of cores… you can. But is that really what you care about? Or is what you care about performing this work concurrently in a way that the infra can make those specific decisions on your behalf?
I think step 1 is to benchmark and profile your code (use the Time Profiler instrument in Instruments). N = ~thousands is still firmly in the realm of "might actually be faster to do on one thread", albeit possibly after optimizing findBestGroupStartingWith a bit. Typically the threshold for "worth parallelizing" is more than 1ms or so of work.
However, assuming you do in fact have enough to be worth parallelizing, I would suggest a map-reduce style approach:
Split the array into chunks (ideally in a low-overhead way; this may not be trivial to get right, see the sketch after this list)
Have a bestGroupYet for each chunk
Use a TaskGroup or DispatchQueue.concurrentPerform to run a copy of findBestGroupStartingWith for each chunk
Take all the per-chunk bestGroupYet values and reduce them to find the best of the best
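For the splitting step, one rough way to do it is a small hand-rolled helper like the hypothetical chunked(into:) below (if you'd rather not hand-roll it, I believe the Swift Algorithms package has a chunks(ofCount:) that does much the same job):

extension Array {
    // Split the array into consecutive sub-arrays of at most `size` elements.
    func chunked(into size: Int) -> [[Element]] {
        guard size > 0 else { return [self] }
        return stride(from: 0, to: count, by: size).map {
            Array(self[$0 ..< Swift.min($0 + size, count)])
        }
    }
}

// e.g. a few thousand items in chunks of a few hundred each:
// let chunks = items.chunked(into: 500)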
One challenge will be deciding on the appropriate chunk size. Too big and you'll leave cores idle, too small and you'll waste overhead when dealing with smaller arrays. One strategy you could use is to benchmark your single threaded loop and determine how many items are required to exceed 1ms of work, and then split into chunks around that size.
There might be some clues in here… my takeaway is that "overcommitting" threads and "thread explosion" is less of a concern in Swift Concurrency than it might have been in earlier tools like GCD.
These are all good ideas… and I think we can also agree there is an implicit step "zero": optimize and audit the original algorithms. Concurrency tools are great… but if you have an accidental quadratic hiding there — and that's me speaking from experience — then try to optimize that first.
That's what I was thinking… what kind of work is that doing? Is that a linear time algorithm?
I have already spent weeks optimizing the underlying algorithm. It boils down to an exact-cover problem, which is NP-complete. I started with the published algorithms for such problems, and they were far too slow. We’re talking age-of-the-universe runtime for the problem sizes I’m dealing with.
Luckily the specific problem I’m solving has some additional structure that I was able to leverage. It took many attempts to find the right approach, but I eventually got the runtime down to several minutes. Each of the subproblems takes on the order of a second.
I am confident that multithreading is the appropriate next step. But as I said, I do not have experience doing that.
Yes, this is one of the strategies that came to mind when reading the other thread that I linked. However, as you say, it involves manually specifying a chunk size. Or equivalently, manually specifying the number of tasks.
I would much rather say “Here’s the work that needs to be done, please distribute it among the available CPU cores.” But I don’t know how to do that.
Also, I suspect that I would still want each chunk to at least occasionally check and update the overall best-yet value, though that may not end up being necessary.
• • •
To clarify, I am not so much looking for strategies that a person who knows how to do multithreading could use. Rather, I am looking for “Here is an example of the code you should write”, since I don’t yet know the right approach to multithreading at all. So for the problem at hand, that would likely include:
An example of “Divide up these work items among the available CPU cores, however many that may be”.
An example of “Read a piece of shared mutable state, and update it if necessary” from several tasks running in parallel.
Something like (pseudocode written hastily into website, caveat emptor)
let bestResult = await withTaskGroup(of: ItemGroup?.self) { group in
    for chunk in chunks {
        group.addTask {
            // Each task searches one chunk independently.
            findBestGroupStartingWith(chunk)
        }
    }
    // Collect the per-chunk winners and keep the lowest-cost one.
    // (Assumes ItemGroup exposes some `cost` to compare on.)
    var result: ItemGroup? = nil
    for await candidate in group {
        guard let candidate else { continue }
        if let best = result, best.cost <= candidate.cost { continue }
        result = candidate
    }
    return result
}
This looks essentially equivalent to what was shown in this thread that I previously linked, except that they hard-coded the number of chunks and you left it unspecified.
Is there a good way to make the number of chunks automatically match the number of CPU cores, or otherwise just let the concurrency system decide how many to use?
Like I said earlier, I wouldn't attempt to chunk based on core count, I would chunk based on "what's the minimum amount of work needed to occupy a thread for about a millisecond", and if that ends up being more chunks than cores, that's fine.
I don't think there's a particularly great way to automatically decide though, since the optimal size will depend on information the system doesn't have (the expected performance characteristics of the work).
I think the idea here is that you're asking the right questions, but maybe about a different problem. In a "GCD era"? Yes. There were legit and measurable performance improvements to be gained from product engineers controlling the amount of work that was dispatched to concurrency. This defended against the "thread explosion" and "overcommitting" problems referenced in the WWDC lecture.
In a Swift Concurrency and async-await world? The programming model is different enough from GCD that you might not need to defend against thread explosion anymore. If your types and your data follow the contracts of Sendability and Reentrancy you should AFAIK be fine to create "greater than N" units of async work on an N core architecture and the system would schedule these units of work in a better way than GCD would have. But the best reference on this specific topic would probably be the WWDC lecture.
Each of the findBestGroupStartingWith calls should take much longer than a millisecond.
However if I make a separate task in the task group for each of those calls, and each one tracks its own best-yet value, then the inefficiencies will be huge because they would all start from nil.
At minimum, I would need each task to grab the current overall best-yet at the time it starts, so that it doesn’t waste effort on high-cost things that have already been ruled out by previously-completed tasks.
But for that to work, I would need to be certain that the task group won’t try to start all the tasks at once and switch back and forth between them. Because if it did that, they would all begin with a nil best-yet, and the inefficiency would reappear.
Dang, I would have thought/hoped/expected that “Queue up a large number of tasks, and they will be scheduled automatically to saturate the available CPU” would be a core piece of the concurrency story. It certainly sounded like that’s what @vanvoorden was saying.
That's not the problem though. That part is trivially easy. If you throw arbitrary numbers of tasks into a task group it will only run one per core at a time.
The problem is how to structure your work so that this is useful.
You might want to look at the task local values pattern. If your graph of tasks has some context or state you want to share (like some cached "best guess" in-progress value) this might help you to share that state across task contexts.
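For what it's worth, a minimal sketch of that pattern might look like this (the names here are made up for illustration; note that a task-local binding is inherited read-only by child tasks, so it's a way to hand a starting value into a task rather than a mutable shared variable):

// Hypothetical names, just to show the shape of the @TaskLocal pattern.
enum SearchContext {
    @TaskLocal static var startingBestCost: Double = .infinity
}

func prunedSearch() {
    // Code running while the value is bound (including any child tasks created
    // in that scope) sees the bound value and can prune against it.
    let bound = SearchContext.startingBestCost
    print("pruning against a starting bound of \(bound)")
}

// Bind a value for the duration of a scope; work started inside inherits it.
SearchContext.$startingBestCost.withValue(42.0) {
    prunedSearch()
}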
If you run my code with 1000 (or 9, or 10,000, or whatever) chunks on an 8 core device it will run 8 of them at a time, which as far as I can tell is what you're asking for.
When you say “it will run 8 of them at a time”, does that mean “It will start the first 8 chunks, and then when one of them finishes it will start the 9th”?
If so, then I think “one findBestGroupStartingWith per task” is the way to go, assuming that it’s possible to access the shared mutable state from each task (which I still don’t know how to do).
Ah, I think you may have fundamentally misunderstood my initial suggestion. My suggestion is that the findBestGroupStartingWith calls don't share the array. They each work on a separate subset of it.
So if you have 100 elements, then the first call would find the best out of elements 0...9, the second would find the best out of 10...19, and so on.
So having one item per task would accomplish nothing: it would find the best group in a chunk of one, which is the only group, and then the reduce call would do 100% of the work on one thread afterwards.
Yes, this is an excellent approach. The idea is to split this large loop into chunks, have each chunk do its work independently, and have the only synchronization be at the end, where you compare the best of each chunk's iterations.
Yep, both the cooperative thread pool used by TaskGroup and GCD's concurrentPerform are automatically constrained by the number of available processor cores. So, if you have, say, 20,000 items to check, you might want to process 20 chunks, each doing 1,000 iterations. And if your CPU only has 10 cores, then both will ensure that only 10 run at a time. So both techniques explicitly avoid thread explosion and/or overcommitting the CPU.
But I might advise caution with the TaskGroup pattern, as SE-0297 (async/await) warns us about "long computations" that can impede forward progress on the cooperative thread pool.
In fact, in Visualize and optimize Swift concurrency they go so far as to suggest that you might want to keep this computationally intensive work in GCD and bridge back to Swift concurrency with a continuation.
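Roughly, that suggestion might look like the following sketch (same assumptions as the earlier pseudocode: `chunks` is precomputed, findBestGroupStartingWith(_:) returns the best ItemGroup? for one chunk, and ItemGroup has some `cost` to compare on):

import Dispatch

// Keep the heavy loop in GCD; bridge back to Swift concurrency with a continuation.
func findBestGroupConcurrently(chunks: [[Item]]) async -> ItemGroup? {
    await withCheckedContinuation { continuation in
        DispatchQueue.global(qos: .userInitiated).async {
            var results = [ItemGroup?](repeating: nil, count: chunks.count)
            results.withUnsafeMutableBufferPointer { buffer in
                // concurrentPerform blocks here until all iterations finish, and
                // runs at most one iteration per available core at a time.
                DispatchQueue.concurrentPerform(iterations: chunks.count) { index in
                    buffer[index] = findBestGroupStartingWith(chunks[index])
                }
            }
            continuation.resume(returning: results.compactMap { $0 }.min { $0.cost < $1.cost })
        }
    }
}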
I understand what David is saying here (if each chunk takes less than 1ms, the modest overhead of parallelization will likely outweigh any performance gains), but I personally adopt a slightly different approach. My general rule is to break the iterations into n chunks, where n is 2-4× the number of processors on the device. So if I have 20,000 items to check on a 10-processor device, I'll shoot for 20-40 chunks. Many more chunks than that and the parallelization overhead will start to become observable; many fewer and I won't efficiently divide the work amongst the processors. That's my starting point, and then I benchmark different numbers of chunks for the particular problem, to find a solution optimal for the algorithm in question.
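In code, that starting point is only a couple of lines (a sketch; activeProcessorCount comes from Foundation's ProcessInfo, and the 3× here is just a middle ground in the 2-4× range):

import Foundation

// `items` is the array from the original post; this just picks a chunk count and size.
let processorCount = ProcessInfo.processInfo.activeProcessorCount
let chunkCount = max(1, min(items.count, processorCount * 3))  // 2-4x cores as a rule of thumb
let chunkSize = (items.count + chunkCount - 1) / chunkCount    // round up so nothing is dropped
// let chunks = items.chunked(into: chunkSize)                 // using the helper sketched earlier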