[Sub-Pitch] Task-local values in isolated synchronous deinit and async deinit

You can find the updated proposal text here - Isolated synchronous deinit and asynchronous deinit.

Updates since the reviewed version:

  • Scope of the proposal was extended to include async deinit.
  • Synchronous deinit of GAIT's is non-isolated by default and requires explicit opt-in, as requested by the review feedback. But async deinit inherits isolation of the class. As described in the proposal, this helps to warn users if they are using more expensive async deinit, where isolated synchronous one would be sufficient.
  • Removed assertion for escaping self from future directions, because it was implemented in [Runtime] Fatal error if self escapes from deinit. ยท apple/swift@108f780 ยท GitHub.
  • I've implemented resetting task-local values, but I still think that task-local values should be copied by default and reset should be an opt-in.

And it is the last part, that I would like this thread to focus on.

There are 3 alternatives regarding task-local values:

  1. Do nothing - task-locals will be available on the fast path of the isolated deinit, but not on slow path of the isolated deinit, and not in async deinit.
  2. Make task-locals consistently available - keep them on the fast path, copy on the slow path and for async deinit.
  3. Make task-locals consistenty unavailable - add stop-node on the fast path, do noting on the slow path and for async deinit.

Current implementation in the branch supports all 3 in the runtime, and options 2 and 3 in the frontend. It is easy to support all 3, the main question is which one should be the default one.

Behavior

Second option offers the least surprising developer experience:

  • no unexpected changes when adopting the feature
  • no unexpected changes between different code paths
  • no unexpected changes between debug and release builds
  • no two cases to test: hopping vs non-hopping deinit
  • it follows the precedent of Task.init vs Task.detached

It can be hard to predict where the point of the last release will be, but as long it can be constrained to some broad scope, code in the deinit can assume that all task-locals set for the entire scope are available.

In some cases it might be a better practice to capture task-local values in the init() , store them as object properties and re-inject them into deinit . But it is not always possible to know which task-local values are used inside the deinit . For example, if it calls functions from closed-source frameworks or hidden behind type-erasure. In some cases it can be even not possible at all, if task-locals are tunneled through call stack as an implementation detail with actual task-local declarations being unaccessible.

There are no compile-time checks for availability of the non-default task-local values. Checking correctness of the code that uses task-local values relies on testing. Having different behavior depending on the point of the last release can lead to bugs, which are easy to miss during the testing and hard to reproduce once reported.

Task-local values are a tempting instrument to inject loggers or similar infrastructure. If task-local values are accessed only in exceptional situations - e.g. for logging assertion failures or runtime errors, chances of missing task-locals-related problem increase even further. But those are exactly the use cases where it is critically important for logging to work.

And finally, choosing the default behavior does not exclude other options. Advanced users can use attributes to customise behaviour of the deinit in performance critical code. Current proposal text describes new attribute @resetTaskLocals which selects option #3, because it was needed to run benchmarks. But it is easy to add another one for option #1. This part of the proposal likely to be updated based on the feedback.

Performance

To have a well-informed discussion about this topic, I've made a benchmark of performance of isolated and async deinit.

Here is a short summary:

Case Cost
Fast path allowing task-local values 15-20ns per object
Resetting task-local values on fast path 20ns per object
Slow path not copying task-local values Enqueueing: 115ns per object
Dequeueing: 140ns per object
Copying task-local values on slow path Enqueueing: 30-35ns per value per object
Dequeueing: 20ns per value per object
Async deinit not copying task-local values Enqueueing: 270ns per object
Dequeueing: 315ns per object
Copying task-local values in async deinit Enqueueing: 25-27ns per value per object
Dequeueing: 20ns per value per object

Note that for cases involving queueing there are two numbers, which cannot be simply added up together.

If jobs are enqueued faster then dequeued, then overall performance is fully determined by the dequeueing and extra costs during enqueueing do not matter. And vice versa.

Numbers for fast path and dequeueing are incremental costs, compared to executing synchronous non-isolated deinit. To understand how long job takes to be dequeued and executed for specific use case, these costs need to be added to the duration of the deinit itself. E.g. if slow path job is scheduled copying 3 task-local values, and dequeued job runs deinit costing 100ns, then in total it will take 115ns + 3 * 35ns = 220ns to schedule and 240ns + 3 * 20ns + 100ns = 400ns to dequeue and execute.

When destroying a tree of objects with synchronous deinit isolated to the same actor, slow path is taken only once - for the root object. For the rest of the objects last release happens on the expected actor and fast path is taken.

Copying task-local values adds extra cost on the slow path, but is free on the fast path. Resetting task-locals (adding a stop-node, or even just checking if stop-node needs to be added), is free on the slow path, but comes with a cost on the fast path. If we assume that fast path is taken significantly more often than the slow one, then consistently copying task-local values is faster than consistently resetting task-local values.

If we consider a common case of deleting a tree of objects with synchronous deinit isolated to the same actor, than copying task-locals happens only for the root node (and only if root is released from the wrong actor), while cost of resetting is paid for each of the for N or N-1 objects in the tree regardless from where last release have happened. If we decide to consistently copy, extra cost is proportional to the number of task-locals (V), but does not depend on number of objects. If we choose to consistently reset, then extra cost is proportional to number of objects (O) and does not depend on number of task-local values. We are comparing Sometimes(V * 35ns) vs O * 20ns. Assuming that chance of taking slow path is low and number of objects with isolated deinit is higher than number of task-local values, consistently copying is faster than consistently resetting.

For async deinit, there is no fast path, so the logic above does not apply. But based on discussions so far, there seems to be far more demand for isolated synchronous deinit, than for async deinit. So it makes sense to make decisions prioritising isolated synchronous deinit.

Also async deinit has more than double costs of isolated synchronous deinit, so relative cost of copying task-local values is lower.

One of the concerns raised in the review, was that copying task-local values might prevent future optimizations for reusing current task for async deinit of the child objects. Release of the child objects happens from complier-generated code which does interfere with task-local values. And any task-local values added inside the body of the deinit, would be removed inside the body as well. So if last release of the child object starts a new async deinit, it would need to be executed with the same set of task-local values as parent deinit, and thus can be executed in the same task.

If async deinit is initiated from the body of the deinit, then potentially it could happen with added task-local values. Technically it is possible to store only incrementally added task-local values and still reuse current task. But current task can be reused only after entire deinit completes. If parent async deinit suspends after initiating another async deinit, then reusing task would cause second deinit to wait unnecessary, underutilising hardware. So probably, tasks should not be reused in this case.

Links

12 Likes

To give a small update about concerns we're trying to figure out still here...

We're thinking about the impact of creating unstructured tasks from deinits to begin with to be honest.

Thinking out loud: The biggest problem here is how implicit and invisible this delayed deallocation effect is... And it is more problematic in real systems where the concurrency queues are under load already, so the delays, might be actually noticably worse in real systems, and hard if not impossible to fix...

I am wondering also if dedicating workers (queues, threads) for such async deallocation might be necessary. I don't think starving the concurrent thread-width limited pools with other work may ever cause starvation of tasks which MUST run in order to deallocate resources... So it also feels like those tasks must not only be high priority, but perhaps even guaranteed to run even if the width-limited default pool is starved...?

As for the question about copying the locals or not: I still think that anything other than "no accessible" is going to be a problem. Relying on where exactly a deinit happens and then sneak a local there to perform cleanups seems even riskier than the usual sneaky task local tricks, I'm not very confident this will yield in good APIs. If explicit control like this is needed, strict lifetime like moveonly+deinits or perhaps explicit APIs "terminate/close/end()" (or scopes "with...") sound like the right solution to me.

PS: For benchmarks I'd really recommend using the GitHub - ordo-one/package-benchmark: Swift benchmark runner with many performance metrics and great CI support since we'll get other metrics from it as well easily like allocations, context switching etc :slight_smile:

This isn't a complete thought through answer, but I figured we might as well discuss some more rather than leave you handing with no reply, hope that makes sense.

3 Likes

I agree with @ktoso here. Having the compiler spawn unstructured task to support async deinits seems like a big foot gun. In my opinion, any usage of unstructured tasks must be explicit and written by the user to make it possible to audit the code and have linting rules that forbid the usage of unstructured tasks.

Generally, I would love to see async deinits on ~Copyable types to model resources such as file descriptors correctly.

I would want to add that I'd love for us to find a way to solve these concerns and the overall async deinits goal still seems valuable.

It's a rough topic and I didn't want to glance over it -- the overall goal of allowing async in deinits, and being able to still reason about when resources get cleaned up is important and I do hope we can find a way to get these done. Perhaps it'll be more work than it seemed at first though.

And I really wanted to thank you for the effort, and also the benchmarking done this time :bowing_man:

I do agree that linear types are the best tool to make explicit async cleanup API. I've mentioned this in the "Future directions". Though probably it won't be an async deinit, but rather a named consuming method in combination with deleted deinit. I think these are two different features, which can coexist in the language.

But I'm not sure if we need async deinit at all. So far, all the use cases, that I have personally and what I've seen on the forums, required isolated deinit, not async deinit. If you have any use cases for async deinit, please share, that would be a valuable input. Until there is a compelling evidence that there is a demand for async deinit, I don't mind completely removing async deinit from the proposal.

And for isolated deinit, it is very likely that deinit will take the fast path. Slow path is really just a safety net. Target audience for this feature are people who already have cleanup actions in their deinit, and they work okish on practise, but isolation checking produces warnings/errors. So for isolated deinit, I want to introduce as little changes as possible compared to regular deinit. Do you agree with this, as a design goal?

Keep in mind that isolated synchronous and async deinit are not needed to deallocate memory. They are needed to perform cleanup actions. Memory is kept to support these actions. Deallocation happens immediately as soon as cleanup work is done. It is not obvious for me why priority needs to be increased for these actions. As long as producing task occasionally yields, total number of objects pending for deallocation will be bound by the executor.

1 Like

It would be great if deinitializers in actors could be figured out for Swift 6.

1 Like