TaskGroup is slow. Is this normal?

I'm trying to speed up a function by making it multithreaded,
and I'm getting very disappointing results.

First, the „flat” version; the processor gets hot:

    var correctionSubspaces: [[Vertex]] = []
    for correction in corrections {                 // every correction
        var vertexLayers: [Vertex] = []             // subgrid for one correction
        for coordinates in newGridCoordinates {     // every new grid position
            try Task.checkCancellation()
            let vertexWithStrength = try await Self.vertexStrength(for: correction,
                                                                   in: corners,
                                                                   on: coordinates)
            vertexLayers.append(vertexWithStrength)
        }
        correctionSubspaces.append(vertexLayers)
    }

It works nicely, but for some heavy input it took:

axes: 4, corners: 16, corrections: 4
✅testCountGrid() 1296 elements in 56.065s

The second version, my first with TaskGroup, was surprising:

    var correctionSubspaces: [[Vertex]] = []
    for correction in corrections { // every correction
        let correctionSubspace = try await withThrowingTaskGroup(of: Vertex.self,
                                                                 returning: [Vertex].self) { verticesTasks in
            newGridCoordinates.forEach { coordinates in
                verticesTasks.addTask {
                    return try await Self.vertexStrength(for: correction,
                                                         in: corners,
                                                         on: coordinates)
                }
            }
            var vertexLayer: [Vertex] = []
            for try await vertex in verticesTasks {
                vertexLayer.append(vertex)
            }
            return vertexLayer
        }
        correctionSubspaces.append(correctionSubspace)
    }

It took almost 140% of the „flat” version's time (about 40% SLOWER), processor red-hot too. In the debug view I can see a lot of threads working. Sweat.

axes: 4, corners: 16, corrections: 4
✅testCountGrid() 1296 elements in 78.229451s.

So I wrote a third version; newGridCoordinates has many elements (1296 in this case), while corrections has just one to five...

    let correctionSubspaces = try await withThrowingTaskGroup(of: [Vertex].self,
                                                              returning: [[Vertex]].self) { correctionTasks in
        corrections.forEach { correction in
            correctionTasks.addTask {
                var vertexLayers: [Vertex] = []
                for coordinates in newGridCoordinates { // every new grid position
                    let vertexWithStrength = try await Self.vertexStrength(for: correction,
                                                                           in: corners,
                                                                           on: coordinates)
                    vertexLayers.append(vertexWithStrength)
                }
                return vertexLayers
            }
        }
        var subspace: [[Vertex]] = []
        for try await layer in correctionTasks {
            subspace.append(layer)
        }
        return subspace
    }

Surprise, surprise:

axes: 4, corners: 16, corrections: 4
✅testCountGrid() 1296 elements in 46.218377s

but that is still only about 80% of the „flat” version's time.

Am I doing something completely wrong, or is this normal behaviour?

Can you provide a sample project we can test ourselves? There could be any number of things going on here, so we need to run the code ourselves to see. But generally I've found Swift concurrency to be pretty bad at these sorts of unbounded parallelism problems, if for no other reason than that TaskGroup isn't actually a parallel construct; it simply guarantees concurrency. A comparison to DispatchQueue.concurrentPerform or other queue usage might be informative.
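For what it's worth, a rough comparison harness might look like the sketch below. fakeVertexStrength is a hypothetical synchronous stand-in for the real computation, since concurrentPerform only runs synchronous closures:

    import Dispatch
    import Foundation

    // Hypothetical synchronous stand-in for the real per-vertex computation.
    func fakeVertexStrength(_ i: Int) -> Double {
        (0..<50_000).reduce(0.0) { $0 + sin(Double($1 &+ i)) }
    }

    let n = 1296

    // Serial baseline.
    var start = Date()
    for i in 0..<n { _ = fakeVertexStrength(i) }
    print("serial: \(Date().timeIntervalSince(start))s")

    // concurrentPerform: core-count aware, blocks until all iterations finish.
    start = Date()
    DispatchQueue.concurrentPerform(iterations: n) { i in
        _ = fakeVertexStrength(i)
    }
    print("concurrentPerform: \(Date().timeIntervalSince(start))s")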

1 Like

I can share the project, but it is a total mess right now... I'm not a professional programmer, so you may be disappointed by the code quality. I'm still learning.

Should I do it, really?

It's here: GitHub - typoland/HyperValue2

Look to Tests: GridTests: testCountGrid....

If you'd like, there are a lot of people who can help here. But as someone who's new, even before considering TaskGroup there are some other things to make sure you're doing.

  1. Do not test performance in playgrounds. This isn't a realistic environment and is unoptimized.
  2. Make sure you're compiling and running in release mode and that you haven't attached the debugger. You can do this in Xcode by modifying your Run action to use Release mode by default and turning off the debugger.
3 Likes

1: Never.
2: I don't know how to compile a package in Release mode when running tests.

I work on a package, using tests to think and to check whether it's OK or not.

You would use swift test -c release but I don't know how optimization plays with testing like this. You may want to create an executable target that would allow you to run the benchmarks directly rather than through testing.
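If you go the executable route, a benchmark target in Package.swift might look like this minimal sketch (the target names here are hypothetical):

    // swift-tools-version:5.7
    import PackageDescription

    let package = Package(
        name: "HyperValue2",
        targets: [
            .target(name: "HyperValue2"),
            // Hypothetical benchmark executable; run with: swift run -c release Benchmarks
            .executableTarget(name: "Benchmarks", dependencies: ["HyperValue2"]),
            .testTarget(name: "HyperValueTests", dependencies: ["HyperValue2"]),
        ]
    )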

2 Likes

For now, executables explode ;)
(Where do I type swift test -c release? I'm using Xcode's graphical interface; terminal commands are too smart for me right now.)

swift test -c release --filter HyperValueTests.GridTests/testDefineGrid

The test's execution time varies substantially between runs, without any code changes; e.g. sometimes it completes in a couple of seconds, sometimes in ten. And the output is different each time.

So, it appears your test is using different data (or otherwise has some random element to it) between executions, which is going to mess up any interpretations of performance. You should nail that down first - make every test execution do the exact same work - before you try to optimise anything.

3 Likes

Also, tangentially: there's no need for async on functions that don't actually block (e.g. on the network, the disk, or similar). Your functions (vertexStrength and all its callees) are pure computation; they get no benefit from being marked async and doing so may make calling them less efficient in some cases (although in your particular usage I doubt it makes a difference).

2 Likes

I marked them async just to be sure I can cancel them.
The idea was: if a result was not delivered on time and a new result needs to be computed, cancel the old task and start a new one. So I made them async and put Task.checkCancellation() inside. But maybe that was not necessary?
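For reference, a minimal sketch of that cancel-and-restart idea, assuming a countGrid() entry point (hypothetical name):

    // Keep a handle to the in-flight computation so a newer request can replace it.
    var currentWork: Task<[[Vertex]], Error>?

    func recompute() {
        currentWork?.cancel() // abandon the stale computation
        currentWork = Task {
            // countGrid() calls try Task.checkCancellation() in its loops,
            // so a cancelled run exits early instead of finishing uselessly.
            try await countGrid()
        }
    }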

Statistically, yes; but that is why I printed out how many elements were used, and compared only those runs where the number was identical.

What is most (not) funny: after a lot of time and effort, I discovered that COMPUTING THIS GRID MAKES NO SENSE AT ALL. But the question remains: how do you split any computation loop to make it two or many times faster when the processor has many cores?

You can check for task cancellation in any function. Task (the type) is always accessible, because Tasks can run any kind of function. Only its async properties and methods are inaccessible outside of a task (well, strictly speaking outside of an async context, but async contexts always execute within a task).

Though, some Task static member properties don't make as much sense outside of a task, e.g. currentPriority tries to return something sensible but may just default to default, isCancelled just returns false, checkCancellation returns without doing anything, etc. basePriority is the only one that actually has an unequivocally sensible behaviour outside of a task, which is to return nil.
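So the check works from plain synchronous code too. A sketch, with a hypothetical simplified variant of vertexStrength:

    enum Grid {
        // Pure synchronous computation that still cooperates with cancellation.
        // (Hypothetical simplified signature; the real vertexStrength takes more parameters.)
        static func vertexStrength(on coordinates: [Double]) throws -> Double {
            try Task.checkCancellation() // throws CancellationError if the enclosing task is cancelled
            return coordinates.reduce(0, +) // placeholder for the real computation
        }
    }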

Speaking very broadly, there are two boundary principles to try to adhere to:

  • Don't break your work up into more pieces than are necessary to:
    1. Use all cores, and
    2. Reasonably minimise long-tail completion delays (where some unit(s) of work take longer than others, so you end up waiting on them to finish while other cores are idle).
  • Do at least 10ms of work per unit. This is a very rough guide, and obviously what 10ms equates to varies over time as CPU cores get faster etc.

These can be in conflict - sometimes you have e.g. 10 cores and only 40ms of work total, in which case it's better to err on the side of simply utilising fewer cores, as that at least leaves the other cores available for other applications (and may have thermal advantages).
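Applied to your grid problem, those principles suggest splitting newGridCoordinates into roughly core-count chunks instead of 1296 tiny tasks. A hedged sketch, reusing the names from your code above (note that chunks may finish out of order, so index them if order matters):

    import Foundation // for ProcessInfo

    let coreCount = ProcessInfo.processInfo.activeProcessorCount
    let chunkSize = (newGridCoordinates.count + coreCount - 1) / coreCount

    let vertexLayer = try await withThrowingTaskGroup(of: [Vertex].self) { group in
        // One task per chunk: enough to use all cores, big enough to amortise overhead.
        for start in stride(from: 0, to: newGridCoordinates.count, by: chunkSize) {
            let end = min(start + chunkSize, newGridCoordinates.count)
            let chunk = Array(newGridCoordinates[start..<end])
            group.addTask {
                var partial: [Vertex] = []
                for coordinates in chunk {
                    try Task.checkCancellation()
                    partial.append(try await Self.vertexStrength(for: correction,
                                                                 in: corners,
                                                                 on: coordinates))
                }
                return partial
            }
        }
        var layer: [Vertex] = []
        for try await partial in group {
            layer.append(contentsOf: partial)
        }
        return layer
    }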

Whether you use GCD or Structured Concurrency shouldn't generally matter. Both have some runtime overhead (GCD less than Structured Concurrency) and some code overhead (Structured Concurrency less than GCD, arguably).

You can achieve better performance using other means (e.g. bespoke thread-pooling with lock-free queues for shuffling data between workers) but it's very rarely worth it in Swift. Most people that truly benefit from that level of dedication to performance aren't using Swift anyway (for better or worse).

Also, remember that concurrency is like recycling - it's not the first step; it's actually closer to the last (and worst). The priorities are, where "it" is the work in question:

  1. Don't do it.
  2. Do it less.
  3. Do it smarter.
  4. Do it concurrently.
8 Likes

I broke the process up with TaskGroup, but it did not help at all.
I think the split was reasonable: the same job with different parameters.
All cores were busy, yet the time was almost the same.

loop over parameters { // not async*
    do the job(with: parameter) on an unused core
    add the partial result to some array
}
when everything is done, return the result

Bang.

TaskGroup is sort of the wrong tool here. TaskGroups are nice for concurrently running async tasks. If you just want to run synchronous work in parallel, then concurrentPerform is the right API.

.... number of iterations to be at least three times the number of available cores
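Following that guidance, a hedged sketch that splits the index space into 3 × core-count strided slices (the computation is a placeholder, and the names are hypothetical):

    import Dispatch
    import Foundation

    let coordinatesCount = 1296
    let sliceCount = 3 * ProcessInfo.processInfo.activeProcessorCount
    let results = UnsafeMutableBufferPointer<Double>.allocate(capacity: coordinatesCount)
    defer { results.deallocate() }

    DispatchQueue.concurrentPerform(iterations: sliceCount) { slice in
        // Each iteration handles a strided share of the indices,
        // so every slot is written exactly once and no locking is needed.
        var i = slice
        while i < coordinatesCount {
            results[i] = Double(i) // placeholder for the real per-coordinate computation
            i += sliceCount
        }
    }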

:slight_smile:
So what if I have three tasks? Or four? Forget about it?

To elaborate, I think you're referring to how GCD explicitly tolerates "CPU hog" tasks whereas Structured Concurrency handles them less than ideally, even though they're technically supported.

GCD over-subscribes the cores - if you launch a hundred tasks into a concurrent queue, you will potentially get a hundred threads irrespective of how many CPU cores you have, competing for the cores and pre-emptively multitasking amongst each other. This is pointless competition & overhead if you're purely throughput-centric, but in most real-world applications it's an important benefit because it ensures e.g. your GUI updates aren't blocked for unlimited lengths of time by some long-running computation; your network sockets don't completely time out because they couldn't get a few milliseconds of CPU time to send a packet; etc.

Structured Concurrency runs (for the most part) just one thread per CPU core, and so essentially does not use pre-emptive multitasking (within a single process). So if you fill all those threads up with long-running computation tasks, nothing else runs until they finish. The Structured Concurrency infrastructure (actors, Task, etc) does take some pains to ensure the main thread doesn't [by default] run most tasks, to try to mitigate the negative consequences of this design, but it's still pretty easy to cause problems in interactive or I/O-sensitive applications.
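One mitigation, if long-running computation must live in the cooperative pool, is to yield periodically so other tasks can make progress. A minimal sketch:

    func longComputation(over items: [Int]) async throws -> Int {
        var total = 0
        for (index, item) in items.enumerated() {
            total &+= item // placeholder for the real work
            // Every so often, let the cooperative pool schedule other tasks
            // and honour cancellation.
            if index % 1_000 == 0 {
                try Task.checkCancellation()
                await Task.yield()
            }
        }
        return total
    }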

Anecdotally, I've also observed that GCD is less likely to run afoul of design flaws in Apple's kernel thread scheduler on Apple Silicon, such as where it stubbornly refuses to use all the available cores (especially when it forces use of only the slower 'Efficiency' cores). But this behaviour is unstable and inscrutable (e.g. yesterday I re-ran some tests to double-check the GCD vs Structured Concurrency behaviour, and found that for some reason the system was behaving completely differently to how it did last week when I ran the same tests).

5 Likes

Thank you very much. Now I have a lot to learn and understand. :slight_smile:

1 Like

You're correct about concurrent queues, but that's not what concurrentPerform does. concurrentPerform is core count aware and ensures proper scheduling order to not overload the CPU. Whether it's otherwise efficient I'm not sure.

In general, whether using TaskGroup or concurrentPerform I suggest only breaking work into your core count number of pieces. There's no abstraction available to Swift that allows you to schedule arbitrary numbers of work units without overhead, so being aware of your core count is unfortunately necessary.

1 Like

Right, but that's all part of it - whether or not concurrentPerform enqueues all the tasks up front or emits them on-demand is for most purposes irrelevant. The important point is that you can use it - or any other mechanisms for enqueuing work into the concurrent queues - and (within reason) you won't get massive task starvation issues.

As it happens concurrentPerform is quite cleverly implemented and will indeed feed tasks into queues only as space becomes available, which saves a bunch of overhead and avoids the higher peak memory usage of spawning all the tasks in advance. Internally it's more like a "pull" model than a "push" - when the concurrent queue has resources available, it looks for any concurrentPerforms currently running and serves them.

I don't recall, but I assume it does this in a way that avoids starvation if other things are constantly feeding tasks to the queue. I vaguely recall some kind of token system to ensure that the concurrentPerform gets to run on the queue at least its fair share of the time. It's been a while since I explored the GCD code (and it's pretty gnarly code, that's hard to follow).

3 Likes