Optimizing task switching in deeply-nested concurrent code

I'm writing a Lua interpreter in pure Swift, using structured concurrency to implement features like coroutines, as well as for data race safety, which is important for my use case: I want to avoid the race condition bugs that constantly bit me in C++. I've gotten it pretty much working, but I'm seeing over 30x slowdowns in release mode compared to C, and at first glance, profiling shows swift_task_switch using a total of 12.4% of the computation time. I'd like to cut down on those context switches if possible before starting micro-optimizations elsewhere.

A breakdown of the concurrency structure:

  • Important types:
    • The entire machine state is hosted under a single instance of an actor, LuaState, which stores all global state.
    • Values used by the interpreter are held in an enum called LuaValue, which encloses all types usable in Lua. This conforms to Sendable (alongside Equatable and Hashable, which are required for certain language features), so all contained types must also be Sendable.
    • Coroutines are instances of an actor type, LuaThread, each attached to a LuaState instance. This must be an actor because it has mutable state and is a LuaValue type.
    • Each LuaThread owns a list of CallInfo instances, which are actors because they need reference semantics, have mutable state, and are sent between other actors.
    • A CallInfo holds a reference to a LuaClosure, which has to be an actor because it has mutable state (I wanted to make it immutable, but that broke a single necessary Lua function) and is a LuaValue type.
    • A LuaClosure references multiple LuaUpvalue actors, which must be actors for the same reasons.
    • LuaTable is another important LuaValue type that has to be an actor: it implements a combined dictionary and array, which is obviously mutable. (A simplified sketch of these types follows this list.)
  • Function call flow:
    • The outer environment first calls LuaThread.resume async throws, which uses some continuation magic to start/resume the coroutine.
    • On start, the coroutine then calls LuaThread.execute async throws, which constructs a new CallInfo for the body closure and executes until it fully returns. To guard against native call stack overflow from recursion, the interpreter returns back here for each call. I figured async/await nullified this concern, but upon testing, making the call direct actually hurt performance measurably.
    • The actual interpreter loop is in CallInfo.execute async throws. This loop may call a number of other functions, including the most common ones:
      • LuaValue.index async throws -> LuaTable.subscript async for accessing table contents
      • LuaClosure.upvalues {get async} -> LuaUpvalue.value {get async} -> CallInfo.stack {get async} for accessing captured local variables in a closure (commonly used to get the global variable table)
      • CallInfo.call async throws -> CallInfo.setArgs async, LuaThread.prepareCall async throws -> return and re-call CallInfo.execute for calling Lua functions
      • CallInfo.call async throws -> LuaThread.call async throws -> LuaSwiftFunction.body async throws -> ... for calling Swift functions (which could do things like resume another thread, call a Lua function, or yield the current thread - all async ops)
      • Many operations can be overloaded using a metatable, which acts as a fallback when certain types are used. These overloads are function calls, and as such may initiate a new interpreter segment through LuaThread.execute async throws again.
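
Here's a heavily simplified sketch of the shape of these types. The bodies are placeholders and only a few LuaValue cases are shown, but it illustrates why everything ends up as an actor:

```swift
// Heavily simplified sketch of the type structure described above.
// Bodies are placeholders; the real definitions are much larger.
enum LuaValue: Sendable {
    case nilValue
    case number(Double)
    case string(String)
    case table(LuaTable)   // actor references are implicitly Sendable
    case thread(LuaThread)
}

extension LuaValue: Hashable {
    // Actors don't synthesize Equatable/Hashable, so the reference cases
    // compare and hash by identity (which matches Lua's semantics).
    static func == (lhs: LuaValue, rhs: LuaValue) -> Bool {
        switch (lhs, rhs) {
        case (.nilValue, .nilValue): return true
        case let (.number(a), .number(b)): return a == b
        case let (.string(a), .string(b)): return a == b
        case let (.table(a), .table(b)): return a === b
        case let (.thread(a), .thread(b)): return a === b
        default: return false
        }
    }

    func hash(into hasher: inout Hasher) {
        switch self {
        case .nilValue: hasher.combine(0)
        case let .number(n): hasher.combine(n)
        case let .string(s): hasher.combine(s)
        case let .table(t): hasher.combine(ObjectIdentifier(t))
        case let .thread(t): hasher.combine(ObjectIdentifier(t))
        }
    }
}

actor LuaTable {
    private var storage: [LuaValue: LuaValue] = [:]

    subscript(key: LuaValue) -> LuaValue {
        storage[key] ?? .nilValue
    }

    func set(_ key: LuaValue, to value: LuaValue) { storage[key] = value }
}

actor LuaThread {
    // owns the CallInfo list, resume/execute entry points, etc.
}
```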

This results in a large number of awaits throughout my codebase, which I can't get rid of: each one either calls a function that could yield, or accesses actor-isolated state. I want to tackle the second kind, since I'm not using concurrency for actually-concurrent tasks; there's no need to isolate each actor individually, and isolating everything to the whole LuaState is fine. However, I'm not sure how to do this. I tried making every actor inherit the unownedExecutor from the global state, but this had no effect (I guess they already share the global executor), and creating my own executor backed by a DispatchQueue resulted in an additional 30x (totaling 900x!) performance loss, giving me numbers worse than I've seen on 240 MHz microcontrollers.
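
For reference, the executor-inheritance attempt looked roughly like this (simplified, with a placeholder stored value; the hook is the SE-0392 custom actor executor API):

```swift
// Simplified sketch of what I tried: every actor forwards its executor to
// the owning LuaState, hoping calls between them stay on one executor.
actor LuaState {
    // global interpreter state lives here
}

actor LuaUpvalue {
    private let state: LuaState
    private var storedValue: Int  // placeholder for the real LuaValue

    // Custom executor hook from SE-0392: reuse LuaState's serial executor.
    nonisolated var unownedExecutor: UnownedSerialExecutor {
        state.unownedExecutor
    }

    init(state: LuaState, value: Int) {
        self.state = state
        self.storedValue = value
    }
}
```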

Is there a way to reduce the impact of frequent actor isolation switches, or to optimize task switches to not be as heavy? Otherwise, does anyone have any other tips for improving performance of deeply-nested async calls like I have?

I'm developing with Swift 6.1.2 on Arch Linux. I need to avoid platform-specific APIs, as I'm writing a cross-platform app. The full code is available on GitHub at MCJack123/craftos3-lua (Lua VM & runtime written in Swift) if anyone wants to look, but I don't advise it: it's huge, undocumented, and confusing to anyone who doesn't know Lua well.

(For full disclosure, allocation is taking another 12% of the time, but I'm going to tackle that later - concurrency seemed like a quicker target to start with.)

The easiest way to optimize switching is to not require switching in the first place, and the easiest way to do that is to replace the actors you use to protect state with classes that protect their state with a Mutex instead.
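
Roughly like this; the element types are placeholders since I don't know your value representation, and Mutex comes from the Synchronization module that ships with Swift 6:

```swift
import Synchronization

// Sketch: actor replaced by a Sendable class guarding its state with a
// Mutex. Accessing it is a plain synchronous call: no task switching.
final class LuaTable: Sendable {
    private let storage = Mutex<[String: Int]>([:])  // placeholder types

    func get(_ key: String) -> Int? {
        storage.withLock { $0[key] }
    }

    func set(_ key: String, to value: Int) {
        storage.withLock { $0[key] = value }
    }
}
```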

Actor hops are indeed more expensive than simple mutexes/locks. You've described quite a constellation of actors, so all this time spent switching is to be expected.

From what I understand about Lua, it is single-threaded and uses coroutines for cooperative multithreading. (Unless you're doing something new with your interpreter?) You may not need to model this with separate actors at all.

One way to test this out is to change all your actors to classes and mark them all with a single global actor. That makes them all isolated to the same concurrency domain. If you did just this but kept everything marked as "async", it may speed things up considerably, since calling across these classes doesn't actually do any task switching: they are all in the same domain. This would at least let you quickly find out how much you'd gain from removing the task switch costs.
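
Sketched out, with placeholder member types and names from your post:

```swift
// Sketch: one global actor for the whole interpreter.
@globalActor
actor LuaIsolation {
    static let shared = LuaIsolation()
}

// Former actors become classes isolated to that single domain.
@LuaIsolation
final class LuaUpvalue {
    var value: Int = 0  // placeholder for the real LuaValue
}

@LuaIsolation
final class LuaTable {
    private var storage: [Int: Int] = [:]  // placeholder types

    // Callers can still `await` this, but when the caller is also on
    // @LuaIsolation, the call involves no actual task switch.
    func get(_ key: Int) -> Int? { storage[key] }
}
```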

Apart from your plan to use Swift continuations to implement Lua coroutines, what else do you need actual concurrency for in the interpreter?

Thanks for the response. Putting everything on a global actor did significantly reduce the amount of task switching (to 2.75% total), and gave about a 12% performance increase in my benchmark. A lot of the time is now being spent retaining and releasing, which is where I'll focus next. However, I would like to avoid using a single global actor for all instances.

Apart from your plan to use Swift continuations to implement Lua coroutines, what else do you need actual concurrency for in the interpreter?

The actual interpreter doesn't need concurrency on its own; it's expected that a single Lua state only has one call thread running at a time. However, I do want to let calls out to native Swift functions use concurrency if they need to.

I do need the ability to run multiple independent states at the same time, though, which is why a global actor isn't ideal for my use case. While a single global actor won't prevent this, it will hurt performance under load across multiple instances. For reference, the app I'm making now is a rewrite of a C++ codebase where each Lua state was given its own thread to run on, plus a main thread for UI updates. With a global actor, the states would have to share time with each other instead of taking advantage of the thread pool to parallelize. (To be fair, the really old Java version shared a single thread too, but I'd like to take advantage of multithreading where I can.)
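
What I want is roughly this shape, where each state runs independently on the shared pool (run() here is a stand-in for the real entry point):

```swift
// Hypothetical sketch of the goal: independent LuaStates running in
// parallel on the shared thread pool. With one global actor, these
// would all serialize onto a single isolation instead.
actor LuaState {
    func run() async {
        // ... interpreter main loop ...
    }
}

func runAll(_ states: [LuaState]) async {
    await withTaskGroup(of: Void.self) { group in
        for state in states {
            group.addTask { await state.run() }
        }
    }
}
```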

Is there a better way to share an isolation context across multiple objects, without sharing it across all objects?

I'm kinda confused why this needs a rewrite of the Lua VM though.

AFAICT, lua_State has no locks, and Lua never creates actual OS-level threads; its "threads" are cooperative, and the user must ensure they're not used concurrently.

This is modeled in safe Swift by having lua_State be non-Sendable, which it will be by default.

It also seems that's what you're trying to get out of your own implementation: a type with no locks or other synchronization overhead that's usable from a single isolation.

Sounds like the default behavior of the Swift import of the standard implementation is exactly what you're after.
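
Concretely (assuming a hypothetical C interop target named CLua wrapping the standard headers), the default import already gives you that confinement:

```swift
import CLua  // hypothetical clang module exposing the standard Lua C API

// The imported lua_State pointer is non-Sendable by default (pointer
// types lost their Sendable conformance in SE-0331), so the compiler
// confines each state to the isolation that created it. No locks needed.
func runState() {
    guard let L = luaL_newstate() else { return }
    defer { lua_close(L) }
    luaL_openlibs(L)
    // ... luaL_loadstring, lua_pcallk, etc. ...
}
```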

This is purely a personal endeavour to try to make a safer VM. I've had my own version of PUC Lua to hack on for years, but I kept running into various memory errors, especially around the garbage collector, where a type I implemented wasn't being freed properly. I'd rather rewrite everything in a better language with memory (and data race) safety guarantees than keep hacking through the code. I also like being able to use values directly, instead of having to operate on a stack, which gets confusing quickly.

Yeah, don't mark things as Sendable. Then they can only be used from within the single actor that owns them. You can have a whole constellation of objects like this living in a single actor; all of it together becomes the actor's "state". They can expose their own async functions and you can await on them, but they will be resumed within the same domain.

This was harder to do before Swift 6.2, but with the new isolation rules it's much easier to build a constellation of non-Sendable things and choose where to isolate them.
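
A minimal sketch of that shape (placeholder element types):

```swift
// Non-Sendable classes that together form one actor's state. Because
// LuaTable is not Sendable, the compiler keeps each instance inside
// the isolation that created it.
final class LuaTable {  // note: no Sendable conformance
    private var storage: [Int: Int] = [:]  // placeholder types
    func get(_ key: Int) -> Int? { storage[key] }
    func set(_ key: Int, to value: Int) { storage[key] = value }
}

actor LuaState {
    private let globals = LuaTable()  // part of this actor's state

    func global(_ key: Int) -> Int? {
        // We're already isolated to LuaState, so touching the
        // non-Sendable constellation is synchronous: no hop at all.
        globals.get(key)
    }
}
```

And with 6.2's caller-isolation behavior (SE-0461, where nonisolated async functions run on the calling actor), even the async paths through such objects stay in the caller's domain.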