A case study for `reasync`

ZPedro · April 24, 2023, 10:43pm

The principle is to take advantage of the need to explore the first alternative of a bifurcation as an opportunity to send a trainee to do so instead of doing it ourselves; rather, we will directly consider the other alternative of the bifurcation as if the first was forbidden to us. When we run out of trainees, we suspend until at least one of them comes back. A trainee will come back once he has explored his assigned alternative: when he returns to his original bifurcation, instead of continuing to the other alternative he will come back directly to us, lacking the context to explore the whole tree.

To keep complexity to a minimum, we will carefully choose the kind of bifurcation we want to support, and only consider sending a trainee when on this specific kind. Since we have three explorer functions in practice, might as well choose one of these three.

Unfortunately, while the solution to bifurcate upon considering the addition of a new floor is attractive, it would not be efficient. Why? Because bifurcations of this kind are few and in particular can be unpredictably far between. Imagine we get to one such bifurcation, and we estimate that it is still a huge part of the solution space to explore, such that we hesitate sending a trainee alone in there: he might end up taking so long that everyone else will end up waiting for him in the end, such that a significant time could be spent with concurrency reduced to a single task. At the same time, we also hesitate exploring it ourselves: while some further bifurcations of the same type (appending yet another floor) are close, some are very far, such as cases where we build complex operands first, and some of it we couldn’t delegate at all, such as when we recruit all operands in a single operation on the floor we’ve chosen to explore (remember that each operand can contribute in reverse or not…). In fact, we do not want to delegate too small explorations either, as then the overhead of sending a trainee would dwarf his contribution (even with efficient task spawning, some overhead remains, such as duplication of our own state).

Instead, we only consider spawning a new task whenever we would call iteratePossibleLeftNodes(), as we will never regret not doing so: if we choose to explore it ourselves it will soon enough delegate to iteratePossibleLeftNodes() again anyway, directly or indirectly.

But remember that iteratePossibleLeftNodes() captures mutable local variables from exploreAdditionOfRightNode(), and in fact from resolve() itself. Will be be able to make that function sendable and keep doing so? Not directly. Instead, we unnest the functions, and rather pass these variables as inout (borrowed) parameters, and further sub-borrow them as the recursive calls require. When we actually spawn a new task, we copy these parameters, then make the function inside the subtask borrow the copy.

This makes the code less straightforward, in particular resulting in a number of inout parameters being carried around. Fortunately, some of the data is immutable and does not need to be copied, this is the case for the linked list of contributions for each composition (including the currently open one) for example.

Even then, the size of the alternative to explore is rather hard to predict (it can be computed, but this is rather memory-intensive). Instead, to consider whether to spawn a sub-task or not, we compute an estimator that we will use as a heuristic; it works rather well, resulting in subtasks that run (on target hardware) for about 3ms on average, with the 95th percentile duration being less than 9 ms and the 99th percentile being about 11ms. That means we can launch subtasks without needing to bother checking for cancellation inside the subtask, as it will always finish soon enough anyway.

Since we ourselves need to suspend when we have reached our limit of subtasks in flight (besides the cancellation considerations, remember each subtask carries its own state: the copy of the state we provided it with plus whatever it has started allocating itself, so having too many of them in-flight could be a memory issue), our code needs to be async. On the other hand, the subtasks don’t need to suspend, so they can just as well run sync code. This is the point of the PseudoReasync functions: to be sync while otherwise using the same types (with the same concurrency annotations, in particular) as the async code so as to seamlessly integrate, when it comes time to generate the description for a solution tree for instance.

And the only sane way to maintain the PseudoReasync functions is by putting the async explorer functions in their own file, and run a sed script that duplicates this code into a new file with the async and await keywords removed (and the function names changed):

sed -E -e 's/ async//g' -e 's/await//gi' -e 's/Async/PseudoReasync/g' < AsyncExplorers.swift > PseudoReasyncExplorers.swift

This is the only sane way (short of having reasync) because these functions need to be kept in sync (no pun intended) in every other way: they must manipulate the same types in the same fashion and maintain the same invariants and cooperate in doing so.

Hence, the need for reasync. It is not just necessary, it is unavoidable, if you don’t want this kind of workaround to widely develop.

Now it’s time to address some likely questions:

reasync, like rethrows, only makes sense for higher order functions, not for having a sync copy of arbitrary async functions, so what is the relationship with reasync?

Except that you can always turn an ordinary function into an higher-order one, in the worst case by having it take a function as parameter invoked whenever, to which you pass a dummy function. In fact, this is already the case for my explorers, and I did not even need to involve a dummy function: I have a rule that whenever I am tempted to name a function according to what is expected of it, rather than according to what it performs, then I deduce that this function needs to be injected instead, either by itself or as a member function within a protocol. As a result, the dispatch function that decides whether to spawn a new subtask or not is injected, and is of course async; meanwhile, inside the subtask I inject a dummy sync dispatch function (that unconditionally delegates), since I am not going to spawn subsubtasks anyway.

Regardless, there are many other reasons for why such functions ought to be higher order functions with an injected dependency (testability could be one of these), so reasync does fulfil the need with only modest requirements on the part of functions for which we need a sync and an async representation.

Couldn’t you just run the async function inside the subtask, rather than involve a second representation of the same code, with the only change being the exclusive use of the architectural stack?

I could… if I was willing to take a 15% performance hit: this is the improvement I measured when going from async to sync for the subtasks. Your mileage may vary, but I wouldn’t be surprised to see a comparable difference between sync and async in other scenarios.

If you mind the loss of performance in the subtasks, why don’t you mind such a loss in the main task?

Because the main task, regardless of whether it is sync or async, takes so little execution time out of the total that its consumption is best considered as being epsilon anyway. Seriously, ever since [SR-15785] TaskGroup addTask is blocking with poor performance · Issue #58062 · apple/swift · GitHub was fixed, I have failed to see a single call stack being captured in the middle of the main task when running under the time profile instrument. Not a single one. I only ever catch subtasks being on a CPU.

OK, but do you need the main task to be async?

I do in order to use Swift-native concurrency. I have another sync copy of the code that uses libdispatch, whose main difference is that it does not make use of concurrency annotations; instead I mark the code as being preconcurrency. But this code is obviously higher maintenance (what with the completion groups and all that jazz), and less featured, e.g. it uses a semaphore whose token count is provided at main task launch, so there is no avenue for the number of subtasks to adapt to any dynamic variation in activeProcessorCount (extremely unlikely on Apple platforms, but not so unlikely elsewhere)

Such a text transformation step will be surprising to the unwary coder, not to mention being unportable. Should you be preprocessing source code in this way?

While unseemly, this is preferable to syncing the code by hand, by far. Besides, Windows hosts can use any of Cygwin, MinGW, or some other distribution that includes sed for use on Windows. So I can live with it in the meantime. If you’re really bothered by this hack, then keep in mind that it is gone as soon as reasync is available.

Couldn't the code of that study be less complex?

Complexity is very much necessary to the point of this study: if the code was significantly less complex, it would be much less trouble to keep in sync a sync and an async version, such that reasync would not be necessary in that case. Here, it very much is.