I found out that the performance of async deinit is non-linear in the number of objects and was investigating. Comparing the performance of deallocating 50000 trees of 100 objects vs 1000 trees of 5000 objects, I can see that in the second case `swift::ListMerger<swift::Job*, (anonymous namespace)::JobQueueTraits>::merge(swift::Job*)` takes 80% of the time, while in the first case it is not visible in the profiler at all.
```swift
@globalActor actor FirstActor {
    static let shared = FirstActor()
}

class Node {
    var first: Node?
    var second: Node?

    init(_ objects: Int) {
        let L = objects / 2
        let R = objects - 1 - L
        if L > 0 {
            first = Node(L)
        }
        if R > 0 {
            second = Node(R)
        }
    }

    // Async deinit (experimental): the deinit body runs as a job on FirstActor.
    @FirstActor deinit async {
        await noop()
    }
}

func noop() async {}
```
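The exact harness matters less than the shape of the workload; a minimal driver along these lines reproduces the setup (the function name is just for illustration, and the synchronization that waits for the last deinit job to run is elided):

```swift
// Hypothetical driver, only to illustrate the measured workload.
func measure(trees: Int, objectsPerTree: Int) async {
    var roots: [Node] = (0..<trees).map { _ in Node(objectsPerTree) }
    let clock = ContinuousClock()
    let start = clock.now
    roots.removeAll()  // releases every root; each release schedules async deinit jobs
    // ... wait here for the actor to drain all deinit jobs ...
    print("\(trees) x \(objectsPerTree): \(start.duration(to: clock.now))")
}
```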
When deallocating a tree of such objects, every executed deinit job spawns two new ones, until the entire tree is destroyed.
Every time `DefaultActorImpl::drainOne()` is called, we have a queue of jobs consisting of 2 unprocessed jobs (in reverse order) and N processed ones, where N keeps growing. All jobs have the same priority. Unprocessed jobs are taken out of the queue and reversed in the process. A new instance of `ListMerger` is used to merge this list of 2 jobs into the list of N processed jobs. But the merge is always an append, and it requires a linear traversal of the entire list of N processed jobs to find the insertion point, which makes the whole drain quadratic.
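To make the pattern concrete, here is a toy model in Swift of what that merge does (the real code is the C++ `ListMerger` in the runtime; names and structure here are simplified for illustration). Because all priorities are equal, the insertion point is always the tail, and every call re-walks the entire processed list to find it:

```swift
final class Job {
    let priority: Int
    var next: Job?
    init(priority: Int) { self.priority = priority }
}

// Merge sorted `newJobs` into sorted `processed`, keeping insertion stable
// for equal priorities. With all-equal priorities every insertion degenerates
// into an append, so each call scans all N processed jobs: O(N^2) overall.
func merge(_ newJobs: Job?, into processed: inout Job?) {
    var toInsert = newJobs
    while let job = toInsert {
        toInsert = job.next
        // Linear scan from the head for the last job with priority >= job.priority.
        var prev: Job? = nil
        var cur = processed
        while let c = cur, c.priority >= job.priority {
            prev = c
            cur = c.next
        }
        job.next = cur
        if let p = prev { p.next = job } else { processed = job }
    }
}
```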
I think a simple fix would be to persist `lastInsertionPoint` and `lastInsertionPointIsKnownLastOfEquals` in the `DefaultActorImpl`. They are used only inside `drainOne()`, so they would be protected by the actor's lock. There seems to be enough space in `NumWords_DefaultActor` to accommodate them.
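In terms of the toy model above (reusing its `Job` type), the idea looks roughly like this. This is only a sketch: the real change would live in the C++ runtime and would also use `lastInsertionPointIsKnownLastOfEquals` to preserve stability:

```swift
// Persisted across merge calls; must be reset whenever the processed list is rebuilt.
var lastInsertionPoint: Job? = nil

func mergeWithHint(_ newJobs: Job?, into processed: inout Job?) {
    var toInsert = newJobs
    while let job = toInsert {
        toInsert = job.next
        var prev = lastInsertionPoint
        if let hint = prev, job.priority > hint.priority {
            prev = nil  // hint lies past the insertion point; fall back to the head
        }
        // Resume scanning from the hint instead of from the list head.
        var cur: Job?
        if let p = prev { cur = p.next } else { cur = processed }
        while let c = cur, c.priority >= job.priority {
            prev = c
            cur = c.next
        }
        job.next = cur
        if let p = prev { p.next = job } else { processed = job }
        lastInsertionPoint = job  // equal-priority jobs now append here in O(1)
    }
}
```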
Would there be any known drawbacks to this?
@John_McCall, WDYT?
Making that code scale better with the number of tasks would be great, and you're probably right that it could fit into the concurrency design safely because it only needs to be touched during drain.
I do wonder if we should maybe not be generating an unbounded number of tasks in the first place here, though. Maybe there's a reasonable queuing approach we could take?
This is what I've got - Fix quadratic performance of the `ListMerger` in specific usage pattern by nickolas-pohilets · Pull Request #70910 · apple/swift · GitHub
In this example, the number of tasks is determined by the tree structure. Tasks are queued on the actor or the global executor. Since tasks are executed in FIFO order, the tree is visited in breadth-first order, so the size of the queue is determined by the dynamics of breadth-first traversal.
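To put a number on that, here's a quick simulation I'm sketching for illustration (it mirrors the split in `Node.init`): the pending-job queue peaks at roughly half of the tree, i.e. about 2500 jobs for the 5000-object trees.

```swift
// Simulate the FIFO queue of pending deinit jobs for one tree.
// Each queue entry is the size of the subtree whose root job is pending.
func peakQueueSize(objects: Int) -> Int {
    var queue = [objects]
    var head = 0
    var peak = 1
    while head < queue.count {
        let n = queue[head]
        head += 1
        let l = n / 2
        let r = n - 1 - l
        if l > 0 { queue.append(l) }
        if r > 0 { queue.append(r) }
        peak = max(peak, queue.count - head)
    }
    return peak
}

print(peakQueueSize(objects: 5000))  // roughly 2500 at the peak
```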
In some cases, tasks can be suboptimal as queue entries. E.g. I see that `flagAsAndEnqueueOnExecutor` calls `_swift_task_alloc_specific`, triggering allocation of the first slab before the task starts executing. This probably can be optimised.
But once a task starts execution, it may be suspended, and I expect this to be pretty common for async deinit. Suspended tasks cannot be reused and must be kept in memory. If we artificially limit the number of concurrent tasks available for async deinit, then we would be underutilising hardware resources because of the suspended tasks.
And what should happen when execution reaches a point where an object with async deinit needs to be destructed, but a new task cannot be started? Should we create and enqueue some kind of task-launcher job? Assuming that we can avoid creating the task allocator until the task actually starts executing, I think such a job would be not much different from the task itself. I think it is better to invest in making a non-started task as cheap as possible.