TaskGroup and Parallelism

I'm going to add this answer because I've seen others cite this thread, drawing some incorrect conclusions in the process. So, my two cents:

  1. As noted elsewhere, the cooperative thread pool on the simulator is artificially constrained. Do not use the simulator when benchmarking Swift concurrency. Run “release” builds on an actual device.

  2. When you parallelize code, make sure you have enough work running on each thread to justify the overhead. E.g., in extreme cases (without enough work on each iteration), the performance gains from parallelism are completely wiped out by the overhead. (In these cases, we sometimes can “stride” through the iterations, allocating more work for each parallel block of code, e.g. rather than 100 separate iterations, stride through 5 at a time, only performing 20 parallel iterations.)

  3. When async-await was first introduced to Swift, my early tests of massively parallel calculations using withTaskGroup suggested that it was observably slower than concurrentPerform. But recent tests (in Xcode 13.4 and 14) suggest that any difference has since been largely eliminated.
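To make the striding idea in point 2 concrete, here is a minimal sketch. The helper name stridedParallelSum and its chunk parameter are my own assumptions for illustration, not code from the tests described here. Each child task handles chunk consecutive indices, so 100 iterations with a chunk of 5 become 20 child tasks.

```swift
import Foundation

// Hypothetical helper (not from the original post): sums body(i) for
// i in 0..<count, giving each child task `chunk` consecutive indices
// so that every task has enough work to justify its overhead.
func stridedParallelSum(
    count: Int,
    chunk: Int,
    body: @escaping @Sendable (Int) -> Double
) async -> Double {
    precondition(chunk > 0, "chunk must be positive")
    return await withTaskGroup(of: Double.self) { group in
        // One child task per block of `chunk` indices.
        for start in stride(from: 0, to: count, by: chunk) {
            group.addTask {
                var partial = 0.0
                for i in start ..< min(start + chunk, count) {
                    partial += body(i)
                }
                return partial
            }
        }
        // Combine the partial results as the child tasks finish.
        var total = 0.0
        for await partial in group {
            total += partial
        }
        return total
    }
}
```

The right chunk size depends on how expensive each iteration is, so it is worth measuring a few values in a release build on a device.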

For example, I did a quick test on a 20 processor M1 Ultra, and 40 iterations of a very computationally expensive process (calculating π with Leibniz series to nine decimal places) took:

  • 4.59 sec with async-await and TaskGroup,
  • 4.60 sec with concurrentPerform, and
  • 1.25 min (16× slower) with serial calculations.

Bottom line, withTaskGroup runs in parallel, very efficiently. Nowadays, it is neither consistently faster nor slower than concurrentPerform. Just make sure to test “release” builds on actual devices, and ensure you have enough work on each iteration to justify the overhead of parallelism.
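The benchmark code itself is not reproduced in this post. As a rough sketch of the kind of comparison described, assuming hypothetical function names (leibnizPi, runWithTaskGroup, runWithConcurrentPerform) and a caller-chosen number of series terms, the two approaches might look like this:

```swift
import Foundation

// Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
// Deliberately slow to converge, which makes it a CPU-bound workload.
func leibnizPi(terms: Int) -> Double {
    var sum = 0.0
    var sign = 1.0
    for k in 0 ..< terms {
        sum += sign / Double(2 * k + 1)
        sign = -sign
    }
    return 4 * sum
}

// Swift concurrency version: one child task per iteration.
func runWithTaskGroup(iterations: Int, terms: Int) async -> [Double] {
    await withTaskGroup(of: (Int, Double).self) { group in
        for i in 0 ..< iterations {
            group.addTask { (i, leibnizPi(terms: terms)) }
        }
        var results = [Double](repeating: 0, count: iterations)
        for await (i, value) in group {
            results[i] = value
        }
        return results
    }
}

// GCD version: each iteration writes only to its own slot of the buffer.
func runWithConcurrentPerform(iterations: Int, terms: Int) -> [Double] {
    var results = [Double](repeating: 0, count: iterations)
    results.withUnsafeMutableBufferPointer { buffer in
        DispatchQueue.concurrentPerform(iterations: iterations) { i in
            buffer[i] = leibnizPi(terms: terms)
        }
    }
    return results
}
```

Any timing of these two functions should wrap each call with a clock of your choice, in a release build on a real device, per the points above.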


@robert.ryan can you share the code?


Nice that it works well for you. My own experiences using Xcode 14 & iOS 16 are different. While DispatchQueue.concurrentPerform(iterations) guarantees parallel execution, withThrowingTaskGroup(of) does not, and more often than not it performs the tasks in sequential order using only one thread.

For example, the following code block performs approx. 3 times faster (2.8s vs. 8.2s for a large dataset/series) on an iPhone 13 Pro using DispatchQueue.concurrentPerform (concurrentForEach is just a convenience wrapper):

try series.instances.concurrentForEach { instance in
  let srcURL = instance.fileURL
  let dstURL = deidentifiedSeriesURL.appending(component: srcURL.lastPathComponent, directoryHint: .notDirectory)
  try DcmTk.copyDeIdentifiedDICOMFile(at: srcURL, to: dstURL)
}

A task involves a good mix of file access and data processing. The block below, however, gets executed sequentially:

try await withThrowingTaskGroup(of: Void.self) { group in
  for instance in series.instances {
    _ = group.addTaskUnlessCancelled { [instance] in
      let srcURL = instance.fileURL
      let dstURL = deidentifiedSeriesURL.appending(component: srcURL.lastPathComponent, directoryHint: .notDirectory)
      try DcmTk.copyDeIdentifiedDICOMFile(at: srcURL, to: dstURL)
    }
  }
  try await group.waitForAll()
}

For guaranteed parallel execution, I would still recommend using DispatchQueue.concurrentPerform.
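The concurrentForEach wrapper mentioned above is not shown in this post. One way such a wrapper could be written, purely as an assumption about its shape, is an Array extension that runs the body through concurrentPerform and rethrows the first error it records:

```swift
import Foundation

extension Array {
    // Hypothetical sketch of a throwing wrapper around concurrentPerform.
    // The poster's actual implementation may differ.
    func concurrentForEach(_ body: (Element) throws -> Void) throws {
        let lock = NSLock()
        var firstError: Error?
        DispatchQueue.concurrentPerform(iterations: count) { index in
            do {
                try body(self[index])
            } catch {
                // Record only the first failure. The remaining iterations
                // still run, because concurrentPerform cannot be cancelled.
                lock.lock()
                if firstError == nil { firstError = error }
                lock.unlock()
            }
        }
        if let error = firstError { throw error }
    }
}
```

Note that concurrentPerform blocks the calling thread until every iteration has finished, so a wrapper like this should not be called from the main thread.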


Rob's test shows that withTaskGroup is now well optimized and does not have an inherently high overhead. His example is the perfect way to measure the inherent overhead: we have plenty of performance cores (16 on the Ultra vs. 2 on the A15) and computation-intensive code that does not block for I/O or other system calls, and you see the perfect 16x performance boost vs. a single core.

In Rob's scenario, the other tasks that demand CPU time are practically negligible compared to the actual computation load of the test. But in the constrained environment of A15 (core count and default power policy), you can't neglect other factors, especially I/O, system calls, and the scheduling policy. The very name of concurrentPerform shows that its execution policy prefers parallelism.

On the other hand, the default executor that is handling Swift concurrency is balanced for the best overall experience (which includes battery life and the heat generated on the phone surface, among other things). If you want maximum parallelism at any cost, concurrentPerform is still the way to go.

When you have plenty of performance cores at your disposal, the result of the default executor becomes almost identical to concurrentPerform. It does not have anything against parallelism; the only difference is that it puts a higher priority on other factors.

Also, note that your example is I/O bound. Unfortunately, we don't have true async I/O in the Darwin kernel. The only way to get better I/O performance is to throw more threads at it. But more threads will increase memory and power consumption. Darwin's kernel development has been very stagnant for many years now. A true async subsystem is sorely needed.


Thanks for the much-needed insight. My expectations obviously didn't match the behavior of the executor. Since the parent task was created with the priority userInitiated, I expected there to be more emphasis on speed.

Are there any official documents or contracts to rely on or to get a better understanding of the behavior of the executor?

I confirm your findings. I updated the project to run on a Mac, and in my case the serial execution was 3.5 times slower than yours in absolute terms (that's because my Mac is very old), and async-await / concurrentPerform was about 4x faster than serial (matching the fact that my Mac is quad-core).

I would be wary about intermingling GCD code (including concurrentPerform) with Swift concurrency, as I would assume that the cooperative thread pool may not be able to reason correctly when there might be other threads running outside of its purview. And you cannot periodically yield within a computationally intensive concurrentPerform loop like we would do in slow computational tasks within a TaskGroup, either.
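For reference, the periodic yielding mentioned here can be sketched as follows. The function name sumOfSquares and the yieldEvery interval are assumptions chosen for illustration. The idea is that a long computational loop inside a task calls Task.yield() every so often, giving other tasks on the cooperative pool a chance to run.

```swift
// Hypothetical CPU-bound function: sums i*i for i in 0..<n and
// suspends briefly every `yieldEvery` iterations.
func sumOfSquares(below n: Int, yieldEvery: Int = 100_000) async -> Int {
    precondition(yieldEvery > 0, "yieldEvery must be positive")
    var total = 0
    for i in 0 ..< n {
        total += i * i
        // Give the cooperative thread pool a chance to schedule other work.
        if (i + 1) % yieldEvery == 0 {
            await Task.yield()
        }
    }
    return total
}
```

A function like this can be called from group.addTask inside withTaskGroup. There is no equivalent suspension point inside a concurrentPerform closure, because that closure is synchronous.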

I would like to use a Swift concurrency approach, but I just don't see one that prioritizes parallelism currently. Often, minimizing response time for the user is a top priority for interactive apps. The only control we have here is task priority, and that doesn't seem to be the right trigger for mobile hardware.

Your post got me trying again after playing with TaskGroups when concurrency was introduced. I really appreciate your nice example and the findings.

I agree with Hooman's post above. We're in an awkward intermediate position at the moment because many tasks that are not CPU-bound are still written today using blocking APIs and so cannot achieve optimal concurrency under a scheduler that tries to limit the number of threads to some fixed multiple of the number of cores. The flip side is that a scheduler which doesn't apply such a limit is inherently prone to thread explosion, and that tends to be something that cannot be fixed retroactively because people come to rely on thread non-exhaustion in order to achieve progress.

Your copyDeIdentifiedDICOMFile is an I/O-bound operation. In the long run, it will use async I/O operations and stop blocking the current thread while it's waiting for data. In that world, your code should run with optimal concurrency.


People have already filled in, but let me add my two cents:

When you are dealing with I/O-bound operations, task priority is not enough for managing UI responsiveness, because of the current limitations of the OS kernel and the existing common I/O APIs.

Swift concurrency can't help because it has a cooperative model that relies on the predictably short execution time of partial tasks, something the existing I/O subsystem can't provide. That means you have to offload I/O to a separate thread pool to keep the Swift scheduler functioning properly.

For the time being, you have to keep relying on the existing (pre-Swift concurrency) best practices to deal with the issue of a responsive UI during I/O operations. I assume there is ongoing work to introduce new async/await-based APIs to hide I/O thread pools and manage them behind the scenes.
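As a minimal sketch of that offloading pattern, assuming a hypothetical queue named blockingIOQueue and a hypothetical function readFileOffPool, a blocking read can be moved onto a GCD queue and bridged back to async code with a checked continuation:

```swift
import Foundation

// Hypothetical dedicated queue for blocking I/O, kept separate from the
// cooperative thread pool that runs Swift concurrency tasks.
let blockingIOQueue = DispatchQueue(
    label: "example.blocking-io",
    qos: .userInitiated,
    attributes: .concurrent
)

// Performs a blocking file read on the GCD queue and resumes the awaiting
// task exactly once, with either the data or the error.
func readFileOffPool(at url: URL) async throws -> Data {
    try await withCheckedThrowingContinuation { continuation in
        blockingIOQueue.async {
            do {
                let data = try Data(contentsOf: url)
                continuation.resume(returning: data)
            } catch {
                continuation.resume(throwing: error)
            }
        }
    }
}
```

The awaiting task is suspended rather than blocked, so the cooperative pool stays free for other work. The trade-off raised earlier in the thread still applies: a concurrent queue full of blocking work can bring up many threads, so some limit on the number of in-flight operations may be needed.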

EDIT: Correction based on @David_Smith and @John_McCall's responses.


FWIW modern IO systems can and do take into account priorities. For example, if I run sudo taskinfo on my Mac right now, I see things like

…
	req internal/external iotier: THROTTLE_LEVEL_TIER0 (IMPORTANT) / THROTTLE_LEVEL_TIER0 (IMPORTANT)
	req darwin BG iotier: THROTTLE_LEVEL_TIER2 (UTILITY)
…

Right. And Apple platforms especially are very good about propagating around the information that, say, a particular operation is currently blocking the UI thread.


Thank you and @John_McCall for clarification. It is very good to know.

I believe that for many client-side I/O activities, this prioritization is not going to be noticeable (I/O typically is not as congested, and the I/O operation itself is pretty slow). Do you have any idea how much practical difference it would make on an iPhone with a single foreground application initiating the I/O?

It's difficult to say in any general sense, because it's also interesting to take into account edge cases like low-priority work being sharply throttled when the device is under thermal pressure. Quoting man powermetrics:

SFI

The sfi sampler shows system wide selective forced idle statistics. Selective forced idle is a mechanism the operating system uses to limit system power while minimizing user impact, by throttling certain threads on the system. Each thread belongs to an SFI class, and this sampler displays how much each SFI class is currently being throttled. These are instantaneous values taken at the end of the sample window, and do not necessarily reflect the values at other times in the window. To get SFI wait time statistics on a per process basis use --show-process-wait-times.
