TaskGroup and Parallelism

I'm implementing an "embarrassingly parallel" algorithm. My impression was that I could simply add tasks, say 100 of them, to a TaskGroup inside my main Task and then call next() until all were complete. What I observe is that when running in the iOS simulator, I get no speed up at all (I would have expected between 3 and 4X), while running the same code on the Mac provides a 2X speed up (I would have expected 8X).

When I run with Instruments on the iPhone 12 Pro simulator, I observe that all of my Tasks execute on one of three identically named worker threads, but the work is never split across them. I've gone through the documentation and I don't see what I'm doing that would cause my child tasks to be bound to a single thread. Am I missing something conceptually?


You may want to check out this answer: Is there an equivalent of Dispatch… | Apple Developer Forums


thx! sadly what I expected


@Varun_Gandhi Thanks for the link. Can you (or someone else) expand on this? While Swift concurrency may not have a direct equivalent to concurrentPerform, it should still be possible to use task groups to execute parallel work, no?

(I realize that async functions shouldn’t block for an extended amount of time, so a CPU-heavy computation should call Task.suspend from time to time.)
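For example, a CPU-bound helper might suspend every so often. A sketch (the workload here is made up for illustration; the suspension point is spelled Task.yield() in current Swift):

```swift
// Sketch: a long-running computation that periodically suspends so it
// doesn't monopolize a thread in the cooperative pool.
func sumOfSquares(upTo n: Int) async -> Int {
    var total = 0
    for i in 0 ..< n {
        total += i * i
        // Yield occasionally so other tasks can make progress on this
        // thread.
        if i % 10_000 == 0 {
            await Task.yield()
        }
    }
    return total
}
```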

This tweet from Rokhini Prabhu seems to say that the iOS simulator doesn’t use as many threads as one would expect:

The simulator needs to run on older OS-es which don't have the new cooperative thread pool with kernel support. As such, we have to make locally optimal decisions in libdispatch on how many threads to bring up in libdispatch

And this one:

If you had kicked off tasks with different QoS-es, you'd see that we will bring up at least a thread per QoS class but we don't have intelligent load balancing

Not sure why you’re seeing a similar thing on macOS.


The distinction goes back to the definition of parallelism vs concurrency.

When you are performing a parallel compute, it means "I want to run all these work items that are part of the same parallel compute workload, as quickly as possible". In dispatch, that means that the closure passed to concurrentPerform, and the n iterations it needs to execute, are all part of one big parallel workload. We therefore have optimizations to make sure that work is executed across all cores, with intelligent work-stealing across parallel worker threads to run workloads quickly and efficiently. We are able to do these optimizations for concurrentPerform because of the semantics of the API: it is work that needs to happen as quickly as possible. For more details, check out the "Manage Parallel workloads" section of this article.
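As a point of reference, a minimal concurrentPerform sketch (the workload is made up; the pattern is that each iteration writes to a distinct index, so no synchronization is needed):

```swift
import Dispatch

// Square 1,000 integers in parallel. concurrentPerform blocks the
// calling thread until every iteration has finished, and dispatch
// spreads the iterations across the available cores.
let input = Array(0 ..< 1_000)
var output = [Int](repeating: 0, count: input.count)

output.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: input.count) { i in
        // Each index is written by exactly one iteration.
        buffer[i] = input[i] * input[i]
    }
}
```

Writing through the buffer pointer sidesteps the exclusivity checks that would otherwise flag concurrent mutation of the array.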

To contrast, concurrency simply says "These are independent work items which can be run independently". That means that the work items can be run in any order, by one or more threads, but there is no semantic notion that this is a workload that needs to be done as quickly as possible.

A task group simply provides scoping for when the independent work items need to be done by, so that the parent task can continue. As such, it doesn't provide any guarantee that this workload will not be interleaved with other unrelated tasks running on the system. A thread may execute a child task from a task group and then run some completely unrelated work item on the same thread even if there are pending child tasks in the task group.

In a parallel workload, you would not want any other work to necessarily be interleaved with the parallel work that you are looking to complete as soon as possible.

To go back to your question, can you create a task group and sometimes get multiple threads to run that work and therefore get some speedup? Absolutely. But can you also get just a single thread that does all that work? Yes. I recommend not using task groups if what you are really looking for is parallelism. The best primitive available today for parallel workloads is concurrentPerform.
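For comparison, the task-group formulation of the same kind of fan-out looks like this (a hypothetical workload, just for illustration); it guarantees concurrency, but per the discussion above, not parallelism:

```swift
// Fan out n independent work items into a task group and reduce the
// results. The child tasks *may* run in parallel, but nothing
// guarantees it.
func sumOfSquaresConcurrently(upTo n: Int) async -> Int {
    await withTaskGroup(of: Int.self) { group in
        for i in 0 ..< n {
            group.addTask { i * i }
        }
        var total = 0
        for await square in group {
            total += square
        }
        return total
    }
}
```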

For more information, I highly recommend watching this segment of our previous WWDC talk which explains parallelism vs concurrency in detail: Modernizing Grand Central Dispatch Usage - WWDC17 - Videos - Apple Developer


There's a more fundamental limitation at play here which I don't think anyone has mentioned yet: as I understand it, Swift's concurrency features currently only execute tasks on the special fixed-width parallel DispatchQueue that the runtime creates. Until we have custom executors, all concurrency features are limited to running on this queue. So even if the system allowed explicit parallel execution, you'd only ever be as wide as the underlying queue, and shared with all other executing concurrency work.

This seems similar to task priorities. Would it be correct to say that concurrentPerform acts like a task group with maximum (or at least very high) priority?

How is it similar to task priorities? I fail to follow.

To answer your question, concurrentPerform runs at the priority of the calling thread. It doesn't attach a priority to the work on its own.

Well, you said:

A thread may execute a child task from a task group and then run some completely unrelated work item on the same thread even if there are pending child tasks in the task group.

But if a task group's child tasks have a high-enough priority, the thread (or more precisely, the executor which owns the thread) would favour those child tasks over all other pending tasks in its queue.

So what's the difference between a task group with maximum/really high priority, and concurrentPerform?

(Also, I'm talking about task priorities - i.e. the relative order in which an executor should schedule tasks on the thread(s) that it owns. Those threads can of course have their own priorities, which can inform how those threads are allotted actual execution time)

Ah I see where you're coming from. I think the distinction comes in 2 places:

(a) A concurrentPerform does not have heterogeneous priority work. All work runs at the same priority: that of the calling thread. This is unlike a task group which, as you said, can have heterogeneous priority tasks.

(b) I think the second distinction is how many threads may be working on the TaskGroup vs concurrentPerform at any given time, and how quickly the system may scale them up or down. This is where having the semantic notion of "parallel work" in concurrentPerform, as opposed to "independent tasks" in a TaskGroup, comes into play. One can potentially decide that parallel work should be given threads more aggressively to run that workload, as opposed to "independent tasks".


thanks Rokhini!

for now, i'm going to see if I can't come up with something that has a TaskGroup-like API but which uses concurrentPerform under the covers.
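For what it's worth, one hypothetical shape for such a wrapper: hop off the cooperative pool, run concurrentPerform there, and resume a continuation with the results. parallelMap and its parameters are invented names for this sketch, not real API:

```swift
import Dispatch

// Hypothetical TaskGroup-like entry point that uses concurrentPerform
// under the covers.
func parallelMap<T: Sendable>(
    count: Int,
    _ body: @escaping @Sendable (Int) -> T
) async -> [T] {
    await withCheckedContinuation { continuation in
        // Hop to a global queue so concurrentPerform's blocking wait
        // doesn't tie up a thread in Swift's cooperative pool.
        DispatchQueue.global().async {
            let results = [T](unsafeUninitializedCapacity: count) { buffer, initializedCount in
                DispatchQueue.concurrentPerform(iterations: count) { i in
                    // Each iteration initializes exactly one slot.
                    (buffer.baseAddress! + i).initialize(to: body(i))
                }
                initializedCount = count
            }
            continuation.resume(returning: results)
        }
    }
}
```

The caveat from the earlier discussion still applies: the global queue's threads are shared with everything else, so this sketch trades cooperative-pool width for dispatch's parallel scheduling.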

Thanks, this leads to the follow-up question: how does the default executor / new GCD fixed-width queue decide when to use the threads available for that queue?

Or in other words, if I have a trivial app using a task group as above, with no other work being done by any other tasks (on an idle system), will the default executor end up using one or multiple threads in practice?

I still think the conduit between the concurrency model and actual CPU-core execution is a bit opaque. I don’t want to overcommit threads, but I’d like to ensure that the full fixed-width GCD queue is definitely used.

I don’t think the dispatch code with this implementation has been released yet, so any insight is appreciated.

(This relates to the previous question here: https://forums.swift.org/t/swift-concurrency-threading-model-questions )

ftr, i'm currently unable to get detached tasks to execute in parallel either. they will certainly move across threads, but not in parallel. any advice appreciated.

Ok, I'm more than a little confused here. I'm playing with the following code:

public extension Array {
    init<T>(
        performParallel: Bool,  // currently unused; kept for experimentation
        _ rows: Int,
        _ cols: Int,
        _ initializer: @escaping (Int, Int) -> T
    ) async where Element == [T] {
        self = await withTaskGroup(of: (Int, [T]).self) { group -> [[T]] in
            var results = ContiguousArray<[T]>(repeating: [], count: rows)
            (0 ..< rows).forEach { row in
                group.addTask(priority: priority(for: row)) {
                    (row, (0 ..< cols).map { col in initializer(row, col) })
                }
            }
            // Collect rows as child tasks finish (completion order is arbitrary).
            for await row in group {
                results[row.0] = row.1
            }
            return Array(results)
        }
    }
}

// Alternates between .high and .medium so the runtime brings up more
// than one QoS bucket.
private func priority(for row: Int) -> TaskPriority {
    row.isMultiple(of: 2) ? .high : .medium
}

This is a very simple Conway's Game of Life grid. The initializer closure computes the next state of the cell at (row, col).

I started this entire exercise thinking that since the default executor is concurrent with a width equal to the number of cores, doing it this way should provide a speed up by a factor of the number of cores. But it appears that the default executor is actually serial. All of the child tasks are executed serially, though they are not bound to one thread.

However, if I alternate the priorities between .high and .medium, I can manipulate the runtime into actually using two different serial queues and get a 2X speed up.

Is this actually the expected behavior? I mean, expected by someone other than me?


Which platform/environment are you running on?

iPhone 12 Pro simulator.

I've always seen that speed up on the Mac M1 even without the priority bucketing

Then this behavior is expected.

I forgot which other forum threads it was mentioned in before, but I have read that because the simulators have to run on systems which may not support native Swift concurrency, the number of Swift concurrency threads is bound to the number of thread priorities used, not the number of cores.


ok. will try on device

Ok, so I do get the 2X speed up on actual phone hardware. Less than I had hoped, but there seem to be only two threads in the concurrent queue, so that seems like the right number. On the M1 Mac I do see the work spread across 8 cores.
