withThrowingTaskGroup doesn't (re)throw the error

rurza · March 28, 2024, 1:55pm

I have a fairly simple use of a task group:

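func blah(_ route: XPCRoute) async throws {
    try await withThrowingTaskGroup(of: Void.self) { group in
        // Child 1: the actual XPC call.
        group.addTask {
            try await self.connection.sendMessage(name: route.rawValue)
        }
        // Child 2: a 2-second timeout that throws TaskError.timedOut unless cancelled.
        group.addTask {
            try await Task.sleep(for: .seconds(2))
            try Task.checkCancellation()
            throw TaskError.timedOut
        }
        // Wait for the first child to finish (rethrowing its error, if any),
        // then cancel the other one.
        try await group.next()
        group.cancelAll()
    }
}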

I'd expect calling try await blah() to throw an error if the "timeout" child task finishes first, but NOTHING happens. The method just returns (edit: not true; it actually never returns). The sendMessage call is causing it: if I replace it with Task.sleep (for 10 s, for example), everything works fine. But that method is just a wrapper for XPC calls.
Any ideas what could be wrong?

ktoso · March 28, 2024, 1:57pm

Would you mind debugging or printing the exact execution flow (which task does what, and in what order) so we can observe the details of execution? From your description it's hard to tell whether this is a misunderstanding of the behavior or something more specific.

Yes, the group should rethrow the first error it encounters and then ignore the second one.
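A minimal sketch of that behavior, assuming two child tasks that both respond to cancellation (the names `Failure` and `firstErrorWins` are made up for illustration). Iterating the group rethrows the error of whichever child finishes first; throwing out of the body cancels the remaining child, and that child's later error is discarded.

```swift
// Hypothetical errors, used only to tell the two children apart.
enum Failure: Error, Equatable {
    case first
    case second
}

/// Returns the error that escapes the group, or nil if none does.
func firstErrorWins() async -> Failure? {
    do {
        try await withThrowingTaskGroup(of: Void.self) { group in
            group.addTask {
                try await Task.sleep(nanoseconds: 10_000_000) // 10 ms
                throw Failure.first
            }
            group.addTask {
                // Sleeps 200 ms, or wakes early once the group cancels this child.
                try? await Task.sleep(nanoseconds: 200_000_000)
                throw Failure.second
            }
            // Rethrows the first child error. Leaving the body by throwing cancels
            // the remaining child and waits for it, discarding its error.
            for try await _ in group {}
        }
        return nil
    } catch {
        return error as? Failure
    }
}
```

Calling `await firstErrorWins()` should yield `.first`; `.second` is thrown too, but only after the group has already decided which error to propagate.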

rurza · March 28, 2024, 2:13pm

Hi!

Sure thing! :)
Here's the implementation from my experiment:

func sendMessage(_ route: XPCRoute) async throws {
    let id = UUID()
    logger.debug("\(id, privacy: .public) sendMessage began")
    do {
        try await withThrowingTaskGroup(of: Void.self) { group in
            group.addTask {
                self.logger.debug("\(id, privacy: .public) Sending message with route: \(route.rawValue)")
                try await self.connection.sendMessage(name: route.rawValue)
            }
            group.addTask {
                self.logger.debug("\(id, privacy: .public) I will sleep for 2 seconds")
                try? await Task.sleep(for: .seconds(2))
                try Task.checkCancellation()
                throw TaskError.timedOut
            }
            do {
                try await group.next()
                logger.debug("\(id, privacy: .public) Did receive a message")
                group.cancelAll()
            } catch {
                self.logger.debug("\(id, privacy: .public) The first task has thrown an error: \(error.localizedDescription, privacy: .public)")
                throw error
            }
        }
    } catch {
        self.logger.debug("\(id, privacy: .public) Expected error happened")
        throw error
    }
    logger.debug("\(id, privacy: .public) sendMessage ended")
}

That's its output:

rurza · March 28, 2024, 2:14pm

It looks like withThrowingTaskGroup never returned.

ktoso · March 28, 2024, 2:29pm

Hmm, the sleep should have returned and the second task should have completed... The group automatically waits for all of its child tasks, which is why it's waiting here, but AFAICS it should have returned.

What SDK version are you using?

By the way, I'd recommend calling cancelAll() on all return paths, because as written you will ALWAYS keep waiting for the full 2 seconds, even if the error was already thrown. It's worth an experiment to see whether that changes anything here too.
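The implicit wait can be seen in a small experiment. This is a sketch under stated assumptions (the function name `timeFirstOfTwo` and the 10 ms / 500 ms durations are made up): when the body returns normally after the first child finishes, `withThrowingTaskGroup` still waits for the remaining child before returning, whereas a `defer { group.cancelAll() }` cancels leftover children on every exit path, whether the body returns or throws.

```swift
import Foundation

/// Starts a fast (10 ms) and a slow (500 ms) child, stops after the first one
/// finishes, and reports how long withThrowingTaskGroup took to return.
func timeFirstOfTwo(cancelRest: Bool) async throws -> TimeInterval {
    let start = Date()
    try await withThrowingTaskGroup(of: Void.self) { group in
        // Runs on every exit from the body, whether it returns or throws.
        defer {
            if cancelRest { group.cancelAll() }
        }
        group.addTask { try await Task.sleep(nanoseconds: 10_000_000) }
        group.addTask { try await Task.sleep(nanoseconds: 500_000_000) }
        _ = try await group.next()
    }
    return Date().timeIntervalSince(start)
}
```

With `cancelRest: false` the call should take roughly half a second, because the slow child is awaited implicitly; with `cancelRest: true` the slow child's sleep is interrupted and the call returns almost immediately.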

albertbori · May 8, 2024, 4:07pm

I ran into this same issue and found that cancelAll() does not forcibly terminate tasks. Instead, each task in the group needs to proactively and intentionally check for cancellation and handle it by returning immediately.

As you found out, this is a problem for tasks that await a single operation that may never complete: such a task will hang indefinitely, even if cancelAll() is called on the group.
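A sketch of the difference, assuming a hypothetical callback-based operation (`nonCooperativeWork`, simulated here with `DispatchQueue.asyncAfter`) that never looks at cancellation. After a 50 ms timeout child throws, `cancelAll()` marks the other child as cancelled, but the group still waits the full second for it. Splitting the same work into short slices with `Task.checkCancellation()` between them lets the group return shortly after the timeout. All names and durations are illustrative.

```swift
import Dispatch
import Foundation

struct TimedOut: Error {}

/// Hypothetical callback-based operation that never checks for cancellation.
func nonCooperativeWork(seconds: Double) async {
    await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
        DispatchQueue.global().asyncAfter(deadline: .now() + seconds) {
            continuation.resume()
        }
    }
}

/// The same amount of work split into 10 ms slices, checking for cancellation
/// between slices so it can stop early.
func cooperativeWork(seconds: Double) async throws {
    for _ in 0..<Int(seconds / 0.01) {
        try Task.checkCancellation()
        await nonCooperativeWork(seconds: 0.01)
    }
}

/// Races `work` against a 50 ms timeout; returns how long the group took to return.
func raceAgainstTimeout(_ work: @escaping @Sendable () async throws -> Void) async -> TimeInterval {
    let start = Date()
    do {
        try await withThrowingTaskGroup(of: Void.self) { group in
            defer { group.cancelAll() }
            group.addTask { try await work() }
            group.addTask {
                try await Task.sleep(nanoseconds: 50_000_000)
                throw TimedOut()
            }
            _ = try await group.next()
        }
    } catch {
        // TimedOut lands here, but only after every child has finished.
    }
    return Date().timeIntervalSince(start)
}
```

Racing `nonCooperativeWork(seconds: 1.0)` should take about a second even though the timeout fires after 50 ms, while racing `cooperativeWork(seconds: 1.0)` should return shortly after the timeout.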

dnadoba · May 8, 2024, 4:19pm

That is expected. Cancellation is cooperative, and there is no way to forcefully cancel an operation. You either need to wait for it or let it continue running in the background, which is not recommended.

ktoso · May 9, 2024, 1:17am

Yeah, Swift's cancellation is cooperative by design.

I recommend this talk, which explores this topic: "Beyond the basics of structured concurrency" (WWDC23).

rurza · May 9, 2024, 9:16am

In other words, if a stupid bug in a system library causes my app to hang indefinitely, there is no way on my end to express a timeout with structured concurrency?

ktoso · May 9, 2024, 9:18am

This isn't much different from some "dumb bug" blocking synchronously forever, though. You'd "leak" a task, which is lighter than a thread, if that's any consolation.

If you encounter such bugs, file issues, or fix them if the code is OSS or your own. That's really the way to go: cooperative cancellation enables some useful patterns and means you don't program in constant fear that any given point in your code might never actually be reached. Even with cooperative cancellation a smaller version of that fear exists, and I think we'll be looking at "cancellation shields" in the near future ("even though this task was cancelled, run this piece of code, which checks for cancellation, anyway"), as these may be useful for "definitely run my cleanup actions" and similar cases.

Yes, Swift's cancellation model is cooperative. We won't "randomly" kill and stop tasks or free memory; cancellation is handled cooperatively by the code being cancelled. It's not that different from "normal synchronous code", though, yeah.

rurza · May 9, 2024, 10:40am

Thank you for the video; it helped clarify the concept.
I believe Task.sleep now checks for cancellation—it seems this wasn't the case when Swift's structured concurrency was initially released, which explains why my example with two sleep functions worked. Is there a way to accomplish this using unstructured concurrency? While filing a radar is an option, it's not practical; I’d prefer to have reliable software before 2028.
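One unstructured workaround, sketched under stated assumptions (all names here, including `abandoningTimeout` and `OnceResumer`, are made up): race the operation against a timer in two unstructured `Task`s and resume a single continuation with whichever finishes first. This does not stop the operation. On timeout it keeps running, and holding whatever it holds, in the background, which is exactly the "let it continue running in the background" option described earlier in the thread as not recommended. It is an escape hatch, not a fix.

```swift
import Foundation

struct TimedOut: Error {}

/// Resumes a continuation at most once, no matter which racer finishes first.
final class OnceResumer<T>: @unchecked Sendable {
    private let lock = NSLock()
    private var continuation: CheckedContinuation<T, Error>?

    init(_ continuation: CheckedContinuation<T, Error>) {
        self.continuation = continuation
    }

    func resume(with result: Result<T, Error>) {
        lock.lock()
        let pending = continuation
        continuation = nil
        lock.unlock()
        pending?.resume(with: result)
    }
}

/// Returns the operation's result, or throws TimedOut after `nanoseconds`.
/// On timeout the operation is abandoned, not cancelled: it keeps running.
func abandoningTimeout<T: Sendable>(
    nanoseconds: UInt64,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<T, Error>) in
        let resumer = OnceResumer(continuation)
        // Timer: reports TimedOut unless it is cancelled first.
        let timer = Task {
            do {
                try await Task.sleep(nanoseconds: nanoseconds)
            } catch {
                return // cancelled because the operation already finished
            }
            resumer.resume(with: .failure(TimedOut()))
        }
        // Operation: deliberately not tied to the caller's task or the timer.
        Task {
            let result: Result<T, Error>
            do {
                result = .success(try await operation())
            } catch {
                result = .failure(error)
            }
            resumer.resume(with: result)
            timer.cancel()
        }
    }
}
```

Because the operation runs in an unstructured `Task`, cancelling the caller does not cancel it either; the caller simply stops waiting.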

albertbori · May 9, 2024, 3:29pm

Do you control the implementation of this line of code in your example, or does it come from another library?

try await self.connection.sendMessage(name: route.rawValue)

I ask because, if you can change its implementation, you can add proactive cancellation handling that returns early from the task when it detects cancellation.
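If the wrapper can be changed, one common way to make a callback-based call respond to cancellation is to wrap a checked continuation in `withTaskCancellationHandler`. A minimal sketch, assuming a hypothetical `HangingRequest` type that stands in for the XPC call and never replies (`ContinuationBox` and `sendCancellable` are also made-up names). On cancellation it stops *waiting* and throws `CancellationError`, but it does not abort the underlying request; a real wrapper would also cancel that request if the underlying API offers a way to.

```swift
import Foundation

/// Hypothetical stand-in for a callback-based call whose peer never replies.
final class HangingRequest: Sendable {
    func send(reply: @escaping @Sendable (Result<String, Error>) -> Void) {
        // Intentionally never calls `reply`.
    }
}

/// Lock-protected slot so that either the reply or cancellation resumes, once.
final class ContinuationBox: @unchecked Sendable {
    private let lock = NSLock()
    private var continuation: CheckedContinuation<String, Error>?
    private var isCancelled = false

    /// Stores the continuation, or fails it right away if cancellation came first.
    func install(_ newContinuation: CheckedContinuation<String, Error>) {
        lock.lock()
        if isCancelled {
            lock.unlock()
            newContinuation.resume(throwing: CancellationError())
            return
        }
        continuation = newContinuation
        lock.unlock()
    }

    /// Delivers the reply, unless cancellation already resumed the continuation.
    func finish(_ result: Result<String, Error>) {
        lock.lock()
        let pending = continuation
        continuation = nil
        lock.unlock()
        pending?.resume(with: result)
    }

    /// Called from the cancellation handler: fails the pending wait, if any.
    func cancel() {
        lock.lock()
        isCancelled = true
        let pending = continuation
        continuation = nil
        lock.unlock()
        pending?.resume(throwing: CancellationError())
    }
}

/// Awaits the request, but throws CancellationError as soon as the task is cancelled.
func sendCancellable(_ request: HangingRequest) async throws -> String {
    let box = ContinuationBox()
    return try await withTaskCancellationHandler {
        try await withCheckedThrowingContinuation { (continuation: CheckedContinuation<String, Error>) in
            box.install(continuation)
            request.send { result in box.finish(result) }
        }
    } onCancel: {
        box.cancel()
    }
}
```

Used as the first child of a two-child timeout group, this version lets the group return soon after the timeout child throws, because the cancelled child now finishes as soon as `cancelAll()` runs.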

rurza · May 9, 2024, 9:03pm

It's another library. It's OSS, nothing fancy; it primarily serves as a high-level wrapper for calls to Apple's XPC framework.
Apple's XPC framework hangs.
I could fork it and implement a mechanism with a timer to check for cancellation.
I'm just not sure what's more ridiculous—having to understand the implementation details of the code I'm using, or engineers telling me the classic, "You're holding it wrong."

eskimo · May 10, 2024, 8:47am

Apple's XPC framework hangs.

XPC shouldn’t hang. Its general philosophy is that, because it’s focused on IPC within a single computer, it should either finish or fail. That’s why the XPC APIs have no timeouts.

The vast majority of XPC hangs are caused by the remote peer failing to reply, which is something to investigate in that code.

OTOH, if the remote peer is behaving correctly, and it’s just taking a long time to return a result, then you’d want to add cancellation propagation across the XPC connection. That’s possible, but not trivial.

having to understand the implementation details of the code I'm using

You’re gluing together something very new and something fairly old, so implementation details are leaking out of the seam in your abstraction layer. If Apple shipped a new XPC API that was fully integrated with Swift concurrency, you could reasonably expect it to handle cancellation for you. Until that happens, you have to understand these implementation details.

engineers telling me the classic, "You're holding it wrong."

I think it’s reasonable to assume good faith on the part of the folks who are trying to help you.

Share and Enjoy

Quinn “The Eskimo!” @ DTS @ Apple

rurza · May 10, 2024, 9:54am

Hi Quinn!

Thanks as always for your feedback—I’ve found your answers on ADF incredibly helpful!

Please understand my frustration. As an indie developer working with niche technologies, I file radars for every bug I encounter. However, my experience with radars isn't encouraging: it’s not a question of when the issue will be fixed, but of waiting years for an automated email asking whether the issue persists in the latest OS version. When I have a reliable reproducer and no available workaround, I sometimes turn to DTS. After a month of back-and-forth emails, I finally connect with an engineer, only to receive unhelpful or irrelevant suggestions, ending with the advice to "file a radar." That’s been my experience of what should be good faith.

Regarding the XPC, I think it ties back to the code signing requirement. Meanwhile, I've adjusted my approach and lowered the security settings. I'll create a topic on ADF, maybe someone will help me.

Moving on to cancelling task groups: could the team consider the concept of a "force exit"? Konrad mentioned the idea of shielding; how about the opposite? Although it contradicts the principle of cooperative cancellation, as a user of the API who can execute any code, it seems logical to have the option to tell a task, “Cancel what you can; I’m not interested in the result, just get me out of here.”

ktoso · May 10, 2024, 9:57am

No, we're not looking into "force killing tasks". Type-wise it would not work out: a function has to return something, and not all functions can throw either.

While I understand your frustration, venting here at folks who are trying to help will discourage them from helping out. If you have any issues with Swift itself, please don't be shy about opening issues on GitHub; that way you may be able to see things "happening" a bit more. For Apple platform technologies, though, please do file feedback. Thank you!

rurza · May 10, 2024, 10:25am

Thank you for the help

eskimo · May 13, 2024, 8:00am

I think it ties back to the code signing requirement … I'll create a topic on ADF, maybe someone will help me.

Cool.

If you do, make sure to tag it with XPC so that I see it. And it’d help if you posted a link here so that future folks can follow along.


beatt83 · April 13, 2025, 2:51am

Just an FYI: this issue is still happening. I have a very simple withThrowingTaskGroup where try await group.next() does not propagate the error thrown in the child task.