What is the cost of a task group?

Sorry, this is a long question and I am not sure whether I am in the right place asking this; happy to move it if it is not.

My day job is C++; specifically, I am working on FoundationDB. I am interested in learning how the performance characteristics of Swift coroutines and their primitives compare to C++ coroutines (FoundationDB implements its own language extension to implement coroutines, but they can also be reimplemented with C++20 coroutines).

Small tangent: as of now I believe Swift async functions are strictly better than C++ coroutines. C++ coroutines are a pain if any kind of visibility is needed; they do more allocations than Swift async functions (so I would expect most use cases to be slightly slower -- in FoundationDB we work around this by providing our own allocators that effectively leak memory); and they are almost impossible to debug, etc.

However: C++ coroutines give much more control to the user than Swift async functions. This means that it's a bit easier to understand their performance characteristics and, worst case, invest a ton of effort optimizing hot paths.

One such example is the common problem (at least when working on a distributed database) of quorum waits (or waitForAll, which is just a special case of a quorum). The trivial implementation quickly became a huge bottleneck, so we optimized this to make it fast (the implementation can be found here for anyone interested).

Now if I understand this correctly, something like a quorum would be implemented in Swift like this:

enum QuorumError: Error {
    case majorityFailed
}

protocol MyServer {
    func write(data: [UInt8]) async throws
}

protocol MyService {
    associatedtype ServerImpl: MyServer
    var servers: [ServerImpl] { get }
}

extension MyService {
    func quorumWrite(data: [UInt8]) async throws {
        try await withThrowingTaskGroup(of: Void.self) { group in
            for s in servers {
                group.addTask {
                    try await s.write(data: data)
                }
            }
            // wait for a majority to finish
            let majority = servers.count / 2 + 1
            var success = 0
            var errors = 0
            while success < majority {
                do {
                    _ = try await group.next()
                    success += 1
                } catch {
                    errors += 1
                    // once a majority can no longer be reached, give up
                    if errors > servers.count - majority {
                        throw QuorumError.majorityFailed
                    }
                }
            }
            // success
            group.cancelAll()
        }
    }
}

(Obviously it might be a bit more complicated, since in real life you might want to do something with errors even if a majority succeeds, you might want to continue running even after a majority succeeded, etc. -- but generally I think this is what I would write today.)

But what exactly happens behind the scenes here? Since the number of tasks is unknown statically, I assume this needs to allocate a new stack for each task? Does each task call into some callback when it is done? Or more specifically: will group.next() run in O(1)?

Also: would the compiler do some special optimizations if it knew the size of the group? In FoundationDB code we often do things like wait(timeoutError(someFuture, someTime, someError)). In Swift I would implement timeoutError like this:

func timeoutError<T>(duration: Duration,
                     error: Error,
                     task: @escaping () async -> T)
    async throws -> T
{
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask {
            await task()
        }
        group.addTask {
            // this child never produces a value: it either throws the
            // timeout error or is cancelled below when `task` wins
            try await Task.sleep(for: duration)
            throw error
        }
        // the first child to finish determines the result
        let res = try await group.next()
        group.cancelAll()
        return res!
    }
}
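
Purely for illustration, a call site might then look like this (TimeoutError and someLongRunningOperation are placeholder names, standing in for FDB's someError and someFuture):

struct TimeoutError: Error {}

// someLongRunningOperation() is a stand-in for real work
let result = try await timeoutError(duration: .seconds(2),
                                    error: TimeoutError()) {
    await someLongRunningOperation()
}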

Wouldn't the compiler be able to run this without doing any allocations (assuming it knows about task groups)? Or would this do the same as the quorum thing.

2 Likes

> Since the number of tasks is unknown statically, I assume this needs to allocate a new stack for each task?

Yes, child tasks get new async call stacks. Note that they do not get new sync call stacks, as those are re-used on Task switch. There is also an object that represents the Task, which needs to be allocated as well.

> Does each task call into some callback when it is done?

Sort of? Conceptually a TaskGroup's result queue is an MPSC queue. In the current implementation this queue is trivially implemented via a std::mutex, but that's largely an implementation detail. When a Task completes it can chase up to its parent and push itself onto that queue. The parent can subsequently de-queue the result.
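
As a rough mental model only (a hypothetical sketch, not the actual runtime code, which lives in the C++ runtime), the conceptual shape is something like:

import Foundation

// Hypothetical sketch of the conceptual result queue; the real one
// lives in the Swift runtime and also handles suspending the parent.
final class ConceptualResultQueue<R> {
    private let lock = NSLock()   // stands in for the std::mutex
    private var results: [R] = [] // results of completed child tasks

    // Producer side: a completing child task "offers" its result.
    func offer(_ result: R) {
        lock.lock()
        defer { lock.unlock() }
        results.append(result)
    }

    // Consumer side: the parent polls when it awaits next(). The real
    // implementation suspends the parent instead of returning nil.
    func poll() -> R? {
        lock.lock()
        defer { lock.unlock() }
        return results.isEmpty ? nil : results.removeFirst()
    }
}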

> Or more specifically: will group.next() run in O(1)?

Yes.

> Wouldn't the compiler be able to run this without doing any allocations (assuming it knows about task groups)?

Today, the compiler has no particular magic here, so this does what quorum does. Conceptually this is something that Swift could do, but it requires the compiler to be very clearly able to observe the maximum stack sizes that the various async tasks will require. This is something that Rust can do (due to its widespread monomorphization) but that Swift cannot, due to its support for calling async functions across resilience boundaries.

However, requesting this feature for constrained use-cases such as Task.sleep is certainly sensible.
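
For anyone who wants a rough feel for these per-task costs, a toy measurement along the following lines works (purely illustrative; measureGroupOverhead is a made-up helper, and absolute numbers will vary wildly across platforms and build settings):

func measureGroupOverhead(children: Int) async -> Duration {
    let start = ContinuousClock.now
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<children {
            // empty child: the measured cost is dominated by task
            // creation, completion, and result-queue traffic
            group.addTask {}
        }
        await group.waitForAll()
    }
    return ContinuousClock.now - start
}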

5 Likes

For anyone curious about the details, the runtime implementation of TaskGroup isn't too hard to follow. The "child task completed" flow is in TaskGroup::offer, and the parent side of this (next and friends) is here.

5 Likes

I think allowing for unbounded recursion is the limitation here, not the lack of monomorphization. Async functions do present an initial frame size at runtime, though I don't think the compiler tries very hard yet to maximize it across a call graph when possible, nor does it support dynamically instantiating the frame size when part of the call graph straddles a resilience boundary. That seems like something we could implement and make more accessible to the runtime; it should be possible to precompute a maximum context size for a call tree at runtime even in cases where there's an ABI boundary, as long as there isn't unbounded recursion somewhere.

6 Likes

Thanks @lukasa and @Joe_Groff for the answers; this is super helpful!

I am aware, and honestly I really like the design (my knowledge comes mostly from watching this YouTube video a couple of times).

So my mental model now is that generally doing an await ... in a Swift async function is super cheap due to the way async stacks are implemented, while doing something like my timeoutError example will potentially be slightly more expensive than with C++ coroutines (as I understand it, as of now a task will allocate at least a few hundred bytes, while a C++ coroutine frame would be much smaller). I think this is generally a very sensible tradeoff.

I think for certain applications (or rather: corner cases) it could be useful to have a language construct that would allow us to optimize -- either specifically optimizing certain tasks or (I don't know whether that's possible) some constrained task group. I know this isn't valid Swift syntax, but something like this would be cool (just maybe with a prettier syntax):

try await withBoundedTaskGroup<2> { group in
  group.task[0] = {
    // some task
  }
  group.task[1] = {
    // some other task
  }
  // attempting to access group.task[2] would result in an error
  await group.next()
  // ....
}

Or, maybe even better, a language construct that would support this directly? Something like:

async let timeout = Task.delay(duration: ...)
async let task = someFunction()

let someVariant = await (timeout || task)
// check which of the two returned and cancel the other
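
(For comparison, the closest thing I could write today still goes through a task group -- a hedged sketch, where Winner and race are made-up names for illustration:)

enum Winner<T: Sendable>: Sendable {
    case task(T)
    case timeout
}

func race<T: Sendable>(timeout duration: Duration,
                       _ body: @escaping @Sendable () async throws -> T)
    async throws -> Winner<T>
{
    try await withThrowingTaskGroup(of: Winner<T>.self) { group in
        group.addTask {
            .task(try await body())
        }
        group.addTask {
            try await Task.sleep(for: duration)
            return .timeout
        }
        // whichever child finishes first determines the result...
        let winner = try await group.next()
        // ...and the loser is cancelled before we return
        group.cancelAll()
        return winner!
    }
}

The caller can then switch over the returned Winner to see which side finished first; the group already cancels the other one.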

If you look into our generic actors library you'll see many helper coroutines that we use all over the place, and most of them only need to run a constant number of things concurrently. And we know that some of them are responsible for quite a bit of CPU overhead and shouldn't be used in hot paths. Now to be fair: many of the most heavily used functions in this library would simply not be needed in Swift, since they just work around our type system limitations.

However, I do think I have one valid use case (in FoundationDB we call this io-trust-seconds):

Whenever we do IO on a disk, we might want to wrap the call in something like timeoutError. Now, the timeout will almost never trigger, since it will be very large (on the order of seconds). But the rationale here is that if a disk requires 10 seconds to finish a single IO, we might not want to trust this disk anymore, as this could be a sign of a hardware failure. I believe the way this is currently implemented in FDB is by measuring latencies, but that has another problem: sometimes a disk might break in a way that it never answers an IO request (or at least not for multiple minutes). In these cases it might be better to just crash the process than to make progress very slowly or, even worse, run on a broken disk.
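
(In Swift I would sketch this on top of the timeoutError helper from above; DiskError and the five-second budget are made up for illustration:)

enum DiskError: Error {
    // a single IO took suspiciously long -- stop trusting the disk
    case untrusted
}

// Wrap every disk IO so that an answer that takes longer than the
// trust budget surfaces as an error instead of the process silently
// limping along on failing hardware.
func trustedIO<T>(_ io: @escaping () async -> T) async throws -> T {
    try await timeoutError(duration: .seconds(5),
                           error: DiskError.untrusted,
                           task: io)
}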

A modern NVMe drive will do something like 2 million IOPS, so creating a task for each of them would be significant overhead -- at a few hundred bytes per task, that alone is on the order of a gigabyte of allocation traffic per second.

Though I do admit that this is a corner case, and I have trouble finding other use cases that are not something like a database storage engine 🙂 All this being said, I believe there's another problem, which is that tasks are cancelled cooperatively. So if you have an IO operation that is stuck in a system call, I wouldn't actually quite know how to handle this. Though in this case I guess just finding a way to crash the process might be good enough.
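
(A blunt sketch of that crash fallback; ioWithWatchdog is a made-up name, and using fatalError as the crash mechanism is just one option:)

// Cooperative cancellation cannot reach an IO stuck in a system call,
// so arm a detached watchdog that takes the process down if the IO
// does not complete within the trust budget.
func ioWithWatchdog<T>(limit: Duration,
                       _ io: () async throws -> T) async throws -> T {
    let watchdog = Task.detached {
        try await Task.sleep(for: limit)
        fatalError("IO exceeded the trust budget; assuming broken hardware")
    }
    // the IO finished (or failed) in time: disarm the watchdog
    defer { watchdog.cancel() }
    return try await io()
}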

1 Like

Task groups are very much a "primitive" operation in the structured concurrency paradigm, and while you can express a lot of things as if they were implemented in terms of task groups, I agree it would be nice to have more specialized implementations of them that can be more efficient than being implemented literally on top of task groups. Even async let is arguably an existing special case of task groups with a tuned implementation.

3 Likes