TaskGroup and Tasks memory consumption

I was recently reading the following blog post, How Much Memory Do You Need to Run 1 Million Concurrent Tasks? | Piotr Kołaczkowski, and wondered how Swift's TaskGroup would compare.

I created a small snippet and ran it with /usr/bin/time -l ...

let numberOfTasks = 1_000_000 // Also with 100_000

try await withThrowingTaskGroup(of: Void.self) { group in
    for _ in 0..<numberOfTasks {
        group.addTask {
            try await Task.sleep(for: .seconds(10))
        }
    }
    try await group.waitForAll()
}
print("All tasks finished.")

With 1 million tasks, the code used 1448 MB of RAM; with 10k, 211 MB.

The runtime was much higher than 10 seconds, but it was not consistent.

I ran this on an M1 Max with 32GB.

Then I tried the following snippet:

let dispatchGroup = DispatchGroup()
let queue = DispatchQueue.global(qos: .utility)

for _ in 0..<numberOfTasks {
    dispatchGroup.enter()
    queue.async {
        queue.asyncAfter(deadline: .now() + 10) {
            dispatchGroup.leave()
        }
    }
}

dispatchGroup.wait()
print("All tasks finished.")

DispatchGroup uses five times less memory, and the runtime is very close to 10 seconds.

I know that DispatchGroup and tasks are not directly comparable, and that task groups provide additional safety and synchronisation. However, even compared to other languages' async task implementations, Swift's seems to have higher memory and CPU overhead.

Am I doing/measuring something incorrectly?

Is there some kind of fundamental reason why it is like this, or can this be optimised and improved sometime in future?

I am seeing this (Swift 6, release build, 2018 Mac mini with 16GB of memory):

> /usr/bin/time -l ./OneMillionTasks
measure(numberOfTasks:) 10000
time(proc:) 10 seconds
measure(numberOfTasks:) 10000
time(proc:) 10 seconds
measure(numberOfTasks:) 10000
time(proc:) 10 seconds

measure(numberOfTasks:) 100000
time(proc:) 10 seconds
measure(numberOfTasks:) 100000
time(proc:) 10 seconds
measure(numberOfTasks:) 100000
time(proc:) 10 seconds
       62.51 real         3.07 user         1.19 sys
           232247296  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              125587  page reclaims
                   1  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
              105326  involuntary context switches
         12481737913  instructions retired
         17172350679  cycles elapsed
           185376768  peak memory footprint

> /usr/bin/time -l ./OneMillionTasks
measure(numberOfTasks:) 10
time(proc:) 10 seconds
measure(numberOfTasks:) 10
time(proc:) 10 seconds
measure(numberOfTasks:) 10
time(proc:) 10 seconds

measure(numberOfTasks:) 1000000
time(proc:) 11 seconds
measure(numberOfTasks:) 1000000
time(proc:) 11 seconds
measure(numberOfTasks:) 1000000
time(proc:) 11 seconds
       67.37 real        28.10 user        11.99 sys
          2324570112  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1528266  page reclaims
                   1  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
             1145322  involuntary context switches
        120881171410  instructions retired
        160760478638  cycles elapsed
          2048065536  peak memory footprint

Please note that I have tweaked your code slightly to make internal measurements.

Details
import Foundation

@main
enum Driver {
    static func main () async throws {
        do {
            let M = 10000
            try await measure (numberOfTasks: M)
            try await measure (numberOfTasks: M)
            try await measure (numberOfTasks: M)
        }
        
        do {
            let M = 100000
            try await measure (numberOfTasks: M)
            try await measure (numberOfTasks: M)
            try await measure (numberOfTasks: M)
        }
    }
}

func measure (numberOfTasks N: Int) async throws {
    print (#function, N)
    try await time {
        try await start (numberOfTasks: N)
    }
}

func start (numberOfTasks: Int) async throws {
    try await withThrowingTaskGroup(of: Void.self) { group in
        for _ in 0..<numberOfTasks {
            group.addTask {
                try await Task.sleep(for: .seconds (10))
            }
        }
        try await group.waitForAll()
    }
}

func time (proc: () async throws -> Void) async throws {
    let start = ContinuousClock.now
    try await proc ()
    let elapsed = start.duration (to: .now)
    print (#function, elapsed.components.seconds, "seconds")
}

Here’s a pretty short article that goes into exactly this kind of thing and how to manage it more efficiently: Documentation
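The usual technique such articles describe is bounding the number of in-flight child tasks. Here is a minimal sketch of that idea (the function name and parameters are my own, and I'm assuming the linked article covers something along these lines): keep a window of `width` tasks and drain one completed child with `group.next()` before adding the next.

```swift
// Sketch: bound peak memory by keeping at most `width` child tasks alive,
// instead of enqueueing all `total` tasks up front.
func runBounded(total: Int, width: Int) async throws {
    try await withThrowingTaskGroup(of: Void.self) { group in
        for i in 0..<total {
            if i >= width {
                // Wait for one child to finish before adding another,
                // so at most `width` children exist at any moment.
                _ = try await group.next()
            }
            group.addTask {
                try await Task.sleep(for: .milliseconds(1))
            }
        }
        // Drain the final window of tasks.
        try await group.waitForAll()
    }
}
```

Peak memory is then proportional to `width` rather than `total`, though for a pure 10-second sleep workload this also serializes the batches, which changes the timing.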


This pattern creates and keeps numberOfTasks tasks in memory until you start draining them in waitForAll. You'll notice that this effectively builds up a big chunk of memory, and only afterwards does it start going down.

This is the reason we introduced withDiscardingTaskGroup (docs, proposal), which immediately discards tasks as soon as they finish, without keeping them alive to collect results. This is more like the Dispatch API you're comparing it with, and the others shown in those benchmarks:

try await withThrowingDiscardingTaskGroup { group in
    for _ in 0..<numberOfTasks {
        group.addTask {
          // is destroyed as soon as complete, not when awaited
        }
    }
    // no waitForAll() needed: a discarding group collects no results and
    // implicitly awaits all remaining children before returning
}
print("All tasks finished.")

The difference is that the Dispatch version doesn't keep and visit every block you've created, since the blocks don't return results -- this is why it's more similar to the discarding group.


@ibex

I have copied your code, and I am getting 11 seconds for 10000 items and 19 seconds for 100000.

Could you add one more zero and try it out?

@ktoso, with withDiscardingTaskGroup memory consumption is much lower; however, it still takes a long time to complete.

The following example takes around 100 seconds to complete on an M1 Max (8 performance, 2 efficiency cores) with 32GB RAM:

let numberOfTasks = 1000000
try await withThrowingDiscardingTaskGroup { group in
    for _ in 0..<numberOfTasks {
        group.addTask {
          try await Task.sleep(for: .seconds(10))
        }
    }
}

I am using macOS 26.3 and ran swiftc sample.swift -Ounchecked -wmo -o to compile.

swift-driver version: 1.127.14.1 Apple Swift version 6.2.3 (swiftlang-6.2.3.3.21 clang-1700.6.3.2)
Target: arm64-apple-macosx26.0

As a comparison, I also ran the following code to see if Task creation is the problem, but it finishes in ~40 seconds:

let dispatchGroup = DispatchGroup()
for _ in 0..<numberOfTasks {
    dispatchGroup.enter()
    Task {
        try await Task.sleep(for: .seconds(10))
        dispatchGroup.leave()
    }
}

@mattie

In this theoretical example batching would not "work", because the next batch would only start after 10 seconds.

I remember doing the same experiments two years ago and chatting a bit with @ktoso, checking Instruments and so on. As far as I remember (and I could be wrong), my conclusion was that Task.sleep does basically nothing and no threads are used, so sleeping is a useless metric in that case, and that's why it's taking 100 s and so on.

It’s not about Task.sleep, but about TaskGroup’s handling of concurrent tasks. Task.sleep is just a mechanism to ensure the tasks themselves do not consume many resources and TaskGroup does not compete with them.


It's not a useless metric. It's a way to measure the overhead of the concurrency runtime scheduling. In this case, any time that exceeds the 10 seconds indicates child task creation and enqueueing overhead. So for 100000 tasks, if it takes 18 seconds, then there is an 8 second overhead. For one million tasks the overhead seems to grow even further.


My take is that understanding this exact enqueueing overhead is not useful. I remember changing it to a task with some actual work inside, and the picture was completely different.
But then again, it was two years ago, and maybe I'm missing some internals. I need to play a bit more.

In terms of timing, the actual cost this is observing is contention on the task lock when the tasks all finish "at the same time" and need to remove themselves from the task tree. This is a known issue and could certainly be improved, but not removed entirely, due to the structured nature of child tasks, especially in the discarding case. You should use an actual profiler when doing these benchmarks; then you'll see what the contention points are.

If you want to measure the task creation overhead, you should measure how long addTask takes, since that part is done synchronously -- that is an interesting metric as well.
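A minimal sketch of that measurement (the function name and the yielding workload are my own; the timing approach mirrors the `time(proc:)` helper earlier in the thread):

```swift
// Sketch: time only the synchronous addTask calls, separately from
// waiting on the children to complete.
func measureEnqueue(numberOfTasks: Int) async {
    await withDiscardingTaskGroup { group in
        let start = ContinuousClock.now
        for _ in 0..<numberOfTasks {
            group.addTask { await Task.yield() }
        }
        let enqueue = start.duration(to: .now)
        print("addTask x\(numberOfTasks): \(enqueue)")
        // The group still implicitly awaits all children after this point,
        // but that wait is excluded from the printed duration.
    }
}
```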


Is it really measuring the scheduling overhead, though? Or is there something here related to Task.sleep specifically?

If it's the former, it should be impossible to schedule 100k tasks and take less than 8 seconds. But the following completes in just 350 ms (M1 Pro), with 100k tasks:

@_optimize(none)
func doNothing() async {
    for _ in 0..<3 {
        if Bool.random() { return }
        await Task.yield()
    }
}

let numberOfTasks = 100_000

try await withThrowingDiscardingTaskGroup { group in
    for _ in 0..<numberOfTasks {
        group.addTask {
            await doNothing()
        }
    }
}

@Andropov Did you run the code and measure it?

This snippet takes 9 seconds to complete on my M1 Max:

let numberOfTasks = 100000

@_optimize(none)
@inline(never)
func doNothing() async {
    for _ in 0..<3 {
        if Bool.random() { return }
        await Task.yield()
    }
}

try await withThrowingDiscardingTaskGroup { group in
    for _ in 0..<numberOfTasks {
        group.addTask {
            await doNothing()
        }
    }
}

> ./OneMillionTasks 1000000
measure(numberOfTasks:) 1000000
time(proc:) 11 seconds
measure(numberOfTasks:) 1000000
time(proc:) 11 seconds
measure(numberOfTasks:) 1000000
time(proc:) 11 seconds

> ./OneMillionTasks 2500000
measure(numberOfTasks:) 2500000
time(proc:) 14 seconds
measure(numberOfTasks:) 2500000
time(proc:) 14 seconds
measure(numberOfTasks:) 2500000
time(proc:) 14 seconds

> ./OneMillionTasks 3000000
measure(numberOfTasks:) 3000000
time(proc:) 14 seconds
measure(numberOfTasks:) 3000000
time(proc:) 14 seconds
measure(numberOfTasks:) 3000000
time(proc:) 14 seconds

> ./OneMillionTasks 10000000
measure(numberOfTasks:) 10000000
time(proc:) 87 seconds
measure(numberOfTasks:) 10000000
time(proc:) 75 seconds
measure(numberOfTasks:) 10000000
time(proc:) 73 seconds

I don't fully understand what you mean by this. You can schedule as many tasks as you want, since it's a synchronous operation, and nothing about what you scheduled (i.e. what the tasks actually do) should affect this scheduling. So this snippet

for _ in 0..<numberOfTasks {
    group.addTask {
        // Work
    }
}

should take, as your benchmark shows, under a second or so, regardless of what the child tasks are actually doing. Then the tasks do their work, be it sleep or some real work, and then they finish. Scheduling those tasks initially can surely (and in my opinion should) take far less than 8 seconds. How quickly the tasks actually get scheduled and run then determines how long we have to wait for the results.

It is measuring overall scheduling overhead. Task.sleep also incurs scheduling overhead, since it's basically just running a special job / resuming a continuation after the given amount of time. Most likely those costs (+ the apparent lock contention of task groups) are the reason for the abysmal performance in the million task case.
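A rough illustration of that shape (this is only a simplification built on Dispatch for demonstration; the real Task.sleep is implemented inside the concurrency runtime with its own timer machinery, not on top of DispatchQueue):

```swift
import Dispatch

// Illustration only: suspend on a continuation and resume it from a timer.
// The shape matches what Task.sleep amounts to: enqueue a wake-up,
// suspend the task, then resume the continuation when the deadline fires.
func simulatedSleep(seconds: Double) async {
    await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
        DispatchQueue.global().asyncAfter(deadline: .now() + seconds) {
            continuation.resume()
        }
    }
}
```

Each resume is itself a job for the cooperative executor, which is where the scheduling overhead shows up once millions of timers fire at roughly the same moment.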


Just to comment on the overall issue: at first glance, if the poor performance is only due to Task.sleep, then it doesn't seem like a big problem, since real programs don't just sleep. But sleeping is an essential part of some programs; in fact, the currently pitched withDeadline function depends on Task.sleep cancelling the task after a given amount of time. I surely wouldn't want my tasks in a fairly busy server, written with Swift NIO and abiding by the rules of modern Swift Concurrency, to cancel database requests after 100 seconds instead of the actual requested timeout.
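That dependency is easy to see in the usual shape of a deadline helper. This is a hypothetical sketch in the spirit of such an API, not the pitched withDeadline implementation; the names here are my own:

```swift
// Hypothetical error thrown when the sleep wins the race.
struct DeadlineExceeded: Error {}

// Race the operation against Task.sleep and cancel the loser. If the
// sleep drifts (as in the benchmarks above), the deadline fires late
// by the same amount.
func withTimeout<T: Sendable>(
    _ limit: Duration,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(for: limit)
            throw DeadlineExceeded()  // the sleep finished first
        }
        // The first child to produce a result (or throw) decides the
        // outcome; force-unwrap is safe since the group is non-empty.
        let result = try await group.next()!
        group.cancelAll()
        return result
    }
}
```

If Task.sleep only resumes after 100 seconds instead of 10, this helper enforces a 100-second deadline, which is exactly the concern above.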

The difference you’re seeing is mostly due to how Swift structured concurrency works compared to GCD.

• TaskGroup tasks carry extra metadata (cancellation, priority, parent-child tracking, async state), so creating huge numbers of them increases memory usage.
• DispatchGroup + GCD uses simpler work items, so it consumes less memory and schedules faster.
• Task.sleep suspends cooperatively, while asyncAfter uses timers, so it's not an identical workload.
• Launching 1M tasks at once stresses the scheduler and increases runtime.

You’re likely measuring correctly; this overhead is expected.

Right. Apologies, I'm seeing some inconsistent behavior as to when Xcode decides to terminate the program (I ran the code snippet as a CLI project in Xcode), where it sometimes (but not always) waits for all tasks to complete. So for a moment I thought the child tasks were completing before the program exited too.


That said, it's not the only difference I'm seeing between running the code directly in Xcode (compiled for Release) and profiling in Instruments :eyes: (or executing the binary manually). In fact, I can't reproduce the behavior observed in the original post in Instruments.

So while this takes 23s to run in Xcode (compiled for Release, 100k tasks):

import Foundation

let numberOfTasks = 100_000

let time = try await ContinuousClock().measure {
    try await withThrowingTaskGroup(of: Void.self) { group in
        for _ in 0..<numberOfTasks {
            group.addTask {
                try await Task.sleep(for: .seconds(10))
            }
        }
        try await group.waitForAll()
    }
}

print("Took \(time)")

It runs in just 10.7s in Instruments.

Similarly, increasing the task count to 1M still completes in Instruments in 12.1 seconds!

A little off-topic, but since we're talking about performance:

I'm porting the tokio runtime to Swift, and I've noticed that withTaskGroup produces 2x the jobs for the executor.

Swift produces 20 000 jobs

await withTaskGroup { group in
    for _ in 0..<10_000 {
        group.addTask {...}
    }
    await group.waitForAll()
}

Rust produces 10 000 jobs (metrics.spawned_tasks_count() = 10 000)

let mut set = JoinSet::new();
for _ in 0..10_000 {
    set.spawn(async move {...});
}
set.join_all().await;

@ktoso, is this how it should be or am I missing something?

I see no difference on my side :thinking: