[Pitch] Task Pools

jayton · December 1, 2022, 4:01pm

Only if you opt into this by using the more verbose ThrowingTaskPool variant. If you use TaskPool, the compiler will require you to explicitly handle all errors.

ksluder · December 1, 2022, 4:08pm

But then you have to handle common errors redundantly within each task, further calling into question the wisdom of having separate TaskPool and ThrowingTaskPool types.

John_McCall · December 1, 2022, 5:20pm

To be precise, my proposal was that we provide a primitive where the tasks all have to return Void and cannot throw, but where tasks have some ability to add new tasks and/or cancel the whole group. That should be a very high-performance primitive where we can instantly destroy tasks when they finish; all of the overhead above the minimum will be around maintaining the current set of tasks as tasks are created and destroyed, which is unavoidable.

We can then implement whatever result merging we want on top of that primitive by creating some temporary lock that tasks grab and merge into before they exit. It doesn't have to be limited to reducing errors. "Remember the first error" can be done with an atomic lazy reference, not a lock, but still doesn't need specific runtime support. Exactly what set of operations we provide in the library on top of the primitive is something I don't have strong opinions about.

Joe_Groff · December 1, 2022, 6:32pm

I don't think you really want TaskPool to be Sendable either, though, since it should be accessible to child tasks but still shouldn't really be able to escape the hierarchy. It would be interesting to think about how we could make it so that child tasks can submit new work without letting the handle generally escape—@ktoso mentioned that addTask could re-pass a handle for adding tasks into the child task closure, which is a bit boilerplatey but could work.

lukasa · December 2, 2022, 9:28am

I think I'd have to see an example that required that central handling, where the same handling could not be expressed in a per-task catch block in more or less identical amounts of code.

That ought not to be true: if you don't use a ThrowingTaskPool, then closures passed to addTask should not be throws and the compiler will force you to handle your errors.

It's not redundant handling if you only write the code once, which it is always possible to do:

func suppressingNonFatalErrors(_ block: () async throws -> Void) async rethrows {
    do {
        try block()
    } catch PermissionsError {
        // suppress
        ()
    }
}

This is a fairly standard Swift pattern.

I don't inherently object to adding a result reducer, but it adds meaningful complexity to and divergence from TaskGroup. And the default behaviour will be to have no error closure, which falls back to the current behaviour regardless. So I'm unconvinced by the extra cognitive burden of this idea.

lukasa · December 2, 2022, 9:33am

Oh, one further downside to the error reducer approach is that you necessarily lose context. The error reducer occurs in a separate context to the throwing task, and so doesn't know anything about the task, such as what its inputs were. All you know is "a thing went wrong", but not what thing, and at that point we're really pretty late in the game anyway.

I should note here that my objection is mostly to using error reducers as the principal error-handling API. Whether they exist in the implementation is a separate question. I'm not even opposed to adding them in the future if there's a clear need for it, but right now I'm just not convinced that anyone will actually use these error filters unless we add a multi-error as the thrown error, which has a number of downsides explained in the pitch.

ksluder · December 2, 2022, 2:51pm

When designing APIs, defaults matter. This API is intended to act as the root of an app’s concurrency domain, accepting Task submissions with no dependencies between them. Yet the path of least resistance encourages writing fragile servers that introduce a mutual-success dependency between unrelated Tasks.

Clients have to go out of their way to refactor error handling into centralized functions, remembering to call those functions from every single Task submitted to the pool. This imposes an otherwise unnecessary hygiene burden on the entire team, and introduces serious complexity to the recovery path. Recovering from arbitrary cancellation in a parallel operation is far more complex than recovering from partial completion.

At the very least, as an adopter of this API I would expect implicit cancellation to be an opt-in feature, and I would only use it if I created a TaskPool for a specific job in which the Tasks I submit had an intrinsic dependency.

lukasa · December 2, 2022, 3:16pm

I think we're disagreeing on a fundamental point here. I don't think it's only about the default behaviour though, I think it's about a very core principle, which I will derive from the Zen of Python:

Errors should never pass silently.
Unless explicitly silenced.

I think this point is the crux of the disagreement:

My argument is that implicit cancellation is opt-in: you opt-in to it by choosing the variant that allows you to write a throws sub-task function.

More importantly though, I think adopting your proposed default implicitly forces us to default to silencing all unhandled errors, which IMO makes your code far harder to debug.

This is because if you have chosen the variant that allows you to throw from addTask, then you have forced Swift to decide what to do with that error. It has only two choices: it can either silence that error, or it can throw it out of the TaskPool.

I will point to the above principle and say that silencing the error is the wrong choice, so let's assume we want to eventually re throw it. Because the TaskPool cannot allow the child Tasks to outlive it, there are only two options again: either buffer the error until the TaskPool naturally exits, or bring the TaskPool to a halt as rapidly as possible. I argue that the former is implicitly silencing the error, because TaskPools will often be used with extremely long-lived queues of work, such that the error may never be seen in the lifetime of the program. It seems to me the only behaviour that doesn't have the effect of defaulting to "silence all errors" is to cancel the TaskGroup.

Now, you have proposed up-thread that an alternative is to have a specific handler function that is forced to process the errors. I think this fundamentally defeats the structured pattern that we're trying to enforce. Errors currently propagate upward, exactly as they do in straight-line code. This is good. It's easy to understand and easy to follow. Yes, in rare cases it will cause duplication of error handling code, just as it does today in straight-line code, but I simply do not think a clever solution is required here. Swift has well-tested, well-understood error handling patterns today. This problem is entirely solvable using the existing error handling patterns that Swift provides.

ksluder · December 2, 2022, 3:21pm

I don’t think that’s true. The throwing task pool can still rethrow any errors which aren’t explicitly silenced. It’s just that throwing an error into the enclosing task pool shouldn’t cancel other tasks if the owner of the task pool wants to silence the error.

lukasa · December 2, 2022, 3:24pm

The owner of the Task Pool has a way to spell that idea, as they have always had: use a try...catch block.

To be clear, I'm not saying that the "error reducer" idea is definitely bad. What I'm saying is that it's a solution in search of a problem, and if we introduce it we can never take it back. Given that TaskGroups already have a way to handle this problem (try...catch), and that that solution is idiomatic in Swift, I think we should prefer to lean on that until we have existence proofs of a problem.

The underlying implementation should absolutely be built in a way that allows us to add an "error reducer" after the fact, but I think putting it in right now adds syntactic burden that very few users will ever actually want.

ksluder · December 2, 2022, 3:32pm

How does the owner of a task pool use a try/catch block to avoid implicit cancellation of tasks whose bodies they don’t control?

If your answer is “don’t let anyone else submit directly to the pool”, my response is that the same argument could be used to make the pool owner responsible for the cancellation behavior by wrapping all incoming tasks in a do { try incomingTask() } catch { pool.cancel() } block. And that since “explicit is better than implicit”, this is the less surprising approach.

lukasa · December 2, 2022, 3:43pm

Yes, that's my answer, and yes, you can make that counter argument. If we only want to add TaskPool and not ThrowingTaskPool, then I can trivially build most of the behaviour I want:

extension TaskPool {
    mutating func addThrowingTask(_ body: () async throws -> Void) {
        self.addTask {
            do {
                try body()
            } catch {
                pool.cancel()
            }
        }
    }
}

There are three reasons why I think we should elevate this pattern to API. Firstly, because it's not uncommon to want the implicit cancellation behaviour. TaskGroup already has it if you throw out of the TaskGroup body, and other examples of structured concurrency also default to this pattern (e.g. Python's Trio). We have prior art that this is a good idea.

Secondly, we should elevate it to API because if you want to persist the errors then life is much harder than just the above function. You need to introduce a wholly new type that doesn't just encapsulate the existing TaskPool (quite difficult to do when the TaskPool is passed to the body inout) but also that contains a thread-safe implementation of error tracking. It seems a bit merciless to force users to do this for TaskPool when they are not asked to do it for TaskGroup.

And finally we should elevate it to API because it doesn't change a default. withThrowingTaskPool is syntactically heavier than withTaskPool. The API encourages you not to use the throwing version. You have to decide you want the version that defers the choice for unhandled errors up to Swift. No-one is making anyone use ThrowingTaskPool if it isn't what they want.

ksluder · December 2, 2022, 4:07pm

I don’t disagree with having implicit cancellation as API; I disagree that the current proposal makes the difference sufficiently clear.

There are many functions in the standard library where clients have to use the rethrowing version in case the closure they pass might throw. But the API as proposed conflates the throwingness of individual tasks (a type-level property) with whether throwing from one task cancels other tasks (a runtime behavior). The client is forced to either squelch errors entirely or accept the implicit cancellation. I am asking to replace this false dichotomy with an option. I phrased it as an error filter up-thread, but I would be satisfied with withTaskPool(implicitlyCancelling: Bool = false).

lukasa · December 2, 2022, 4:11pm

I understand, but I disagree that this is a false dichotomy. A thrown error that is caught by the TaskPool is either rethrown or leaked. We can have (and are having) a discussion about putting in a filter, which is an idea that has merit, though we disagree on the spelling. But I do not think we can just turn off implicit cancellation: if we do, the error is essentially squelched silently, for all the reasons I outlined above.

ksluder · December 2, 2022, 4:19pm

This argument is weakened by the fact that ThrowingTaskPool as proposed will already squelch the second and subsequent errors. This is a concurrency primitive, after all, so it seems likely that two tasks might fail simultaneously. Why is it ok to lose one of these tasks’ error details, but not both?

lukasa · December 2, 2022, 4:21pm

It's not, necessarily. The pitch outlines that we might want to throw them all as a multi-error, instead. It also provides reasons that a multi-error may provide a sub-optimal user experience. But I don't think that throwing them all is remotely off the table. The pitch does outline that throwing none of the errors is the worst option.

ksluder · December 2, 2022, 5:13pm

I think this brings us back to why ThrowingTaskGroup’s API wound up the way it did. Calling next() in a loop (or, equivalently, iterating over the task group) forces you to use the try...catch construct to handle the potential failure of each and every task. From previous posts, I understand you and @ktoso highly value this explicit syntax and are very keen to preserve it while solving the Task leakage problem for open-ended work. But the proposal has wound up closer to ThrowingTaskGroup.waitForAll() rather than a looping next().

I believe this is because the scoped nature of try...catch syntax is fundamentally in tension with the open-ended, self-running nature of ThrowingTaskPool. The core problem with TaskGroup highlighted by the pitch is that “neither the consumption of completed Task results nor the submission of new work can be moved to a separate Task .” This is ultimately due to the very design philosophy of structured concurrency, which maps syntactic scope to runtime semantics. try...catch similarly tries to enforce error handling within its scope. But static scope maps very poorly to dynamic, unbounded workloads. This is why Swift also has unstructured concurrency.

So I guess there’s a real big question hiding here: is structured concurrency actually a good match for the kinds of programs this pitch is targeted for? I think the answer is “yes”, but it’s obviously going to require some tradeoffs. Making try...catch syntax the sole way to handle errors forces other tradeoffs, such as either dropping errors or inventing a new error type. Given that the Standard Library already has Result<Success, Error> and Task.result, perhaps it’s worth considering a little flexibility on the error consumption side?

lukasa · December 5, 2022, 10:28am

I disagree with this conclusion, but I think it's straying off-topic for the thread.

I don't disagree that we're being forced into a trade-off here. What I'm trying to avoid doing is over-designing the type. My current thesis is that it is very likely that try...catch is sufficient for 80% or more of the use-cases for TaskPool. The vast majority of use-cases for TaskPool are submitting fundamentally homogeneous work to the pool, which can centralise its error handling into the child tasks.

I don't think your objections are without merit, but I do think that attempting to solve that problem without a concrete use-case risks us launching into a premature design. The two problems that TaskPools address in their current spelling are use-cases we absolutely know exist. (For clarity, these two cases are 1) handling an endless stream of work using an identical function tree, as in all servers, and 2) adding tasks from within a child task.) Additionally, for at least the first case, the "unhandled error terminates the service" is a desirable behaviour. "Let it crash" is an excellent resiliency principle for stateless services, and the type system enforces whether the behaviour exists or not.

The result is that for all the use-cases I 100% know exist, the current design is sufficient. It is also possible for users to achieve error filtering with this design, at the potential cost of needing to build their own abstractions. The use-case the current design cannot address is the "throw a multi-error" case, but I don't believe any of your upstream suggestions deal with it either.

The upshot of all of this is that I want to pitch the minimum viable API surface area to hit the use-cases I know exist. We can always add new API surface in future, but we cannot easily take it away. So long as we enable the possibility of error filtering in the design, we can absolutely revisit it if a use-case becomes apparent.

FranzBusch · December 5, 2022, 11:26am

I agree with @lukasa here that I think the current proposals error handling is sufficient. There are four failure scenarios that you might want to handle and I think this proposal covers all of them:

Don't allow any errors: Just use a non-throwing pool
Cancel everything on the first error: Use a throwing pool and try inside addTask
Custom error handling on a per child task basis: Use try catch
Custom error handling from multiple child tasks: Use try catch and yield the errors into a AsyncSequence that you consume in the body of the task pool. You can then decide if you wanna cancel all or log etc.

Arguably 4. could be done with some convenience API on the pool but until we have more concrete use-cases where we want to do this and where the AsyncSequence based approach doesn't work, I would be hesitant of adding it.

lukasa · December 13, 2022, 9:22am

Thanks very much for all the feedback folks. Based on this thread and informal conversations held with a number of other domain experts, I've updated this into a new proposal, currently proposed for Swift Evolution.

The main takeaways from this thread:

The Sendable/"add tasks from within child tasks" idea is entirely implementable for TaskGroup, using a different spelling.

This is a great insight. As a result, we've removed the Sendable argument from the pitch entirely and have deferred from making the TaskGroup type Sendable at this time. To us, it feels like a separate proposal altogether from the "discard results" capability. We don't need TaskGroup to be Sendable to discard results, and we can make TaskGroup Sendable without discarding results.
Error handling.

There was a detailed and valuable discussion around the options for error handling in this pitch thread. We think a lot of ideas were proposed that were really valuable, and we'd like to keep space to implement those in future. However, we don't think that right now we've got sufficient insight into where and why these error filtering functions might need to be implemented.

To that end, we'd like to avoid committing to an API that might be flawed, and instead have elected to defer adding the capability. We'll ensure that the implementation is able to support this capability in future, and will leave error filtering for a future direction.

Additionally, separate to this thread, we've elected to avoid introducing a new type, and instead have recast the proposal in terms of adding a new parameter (discardResults) to TaskGroup. Several reviewers did not feel we'd done enough to justify the new type, and would prefer to see us extend TaskGroup instead.