I'm working on writing a Lua interpreter in pure Swift, using structured concurrency to implement features like coroutines, as well as for data race safety - important for my use case, since race condition bugs affected me constantly in C++. I've gotten it pretty much working, but I'm seeing over 30x slowdowns in release mode compared to C, and at first glance, profiling shows `swift_task_switch` taking 12.4% of total computation time. I'd like to cut down on those context switches if possible before starting microoptimizations elsewhere.
A breakdown of the concurrency structure:
- Important types:
  - The entire machine state is hosted under a single instance of an actor `LuaState`, which stores all global state.
  - Values used by the interpreter are held in an enum called `LuaValue`, which encloses all types usable in Lua. This conforms to `Sendable` as a prerequisite of `Equatable` and `Hashable`, which are required for certain language features. Thus, all contained types must also be `Sendable`.
  - Coroutines are an actor type `LuaThread`, which is attached to a `LuaState` instance. This must be an actor because it needs mutable state and is a type in `LuaValue`.
  - Each `LuaThread` owns a list of `CallInfo` classes, which are actors due to needing reference semantics, having mutable state, and being sent between other actors.
  - A `CallInfo` holds a reference to a `LuaClosure`, which has to be an actor due to having mutable state (I wanted to make it immutable, but it broke a single necessary Lua function) and being a `LuaValue` type.
  - A `LuaClosure` references multiple `LuaUpvalue` actors, which must be actors for the same reasons.
  - `LuaTable` is another important `LuaValue` type that has to be an actor: it implements a combined dictionary+array, which is obviously mutable.
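Condensed, the type relationships look roughly like this (a simplified sketch, not the real code - the payload cases and `set` method are stand-ins, and the identity-based `Hashable` conformance is how I satisfy the enum's requirements):

```swift
/// All global interpreter state lives under one actor.
actor LuaState {
    var globals = LuaTable()
}

/// A Lua value. Hashable (and thus Sendable) is required, which forces every
/// contained reference type to be Sendable - in practice, an actor.
enum LuaValue: Sendable, Hashable {
    case `nil`
    case boolean(Bool)
    case number(Double)
    case string(String)
    case table(LuaTable)
}

/// A mutable dictionary+array, so it has to be an actor to remain Sendable.
actor LuaTable: Hashable {
    private var hash = [LuaValue: LuaValue]()

    subscript(key: LuaValue) -> LuaValue {
        hash[key] ?? .`nil`
    }

    func set(_ key: LuaValue, _ value: LuaValue) {
        hash[key] = value
    }

    // Identity-based Hashable so LuaValue can embed tables; these must be
    // nonisolated to satisfy the (synchronous) protocol requirements.
    static func == (lhs: LuaTable, rhs: LuaTable) -> Bool {
        lhs === rhs
    }
    nonisolated func hash(into hasher: inout Hasher) {
        hasher.combine(ObjectIdentifier(self))
    }
}
```

Note that every read of `LuaTable` from outside the actor is an `await`, even though nothing ever touches it concurrently.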
- Function call flow:
  - The outer environment first calls `LuaThread.resume async throws`, which uses some continuation magic to start/resume the coroutine.
  - On start, the coroutine then calls `LuaThread.execute async throws`, which constructs a new `CallInfo` for the body closure and executes until it fully returns. To protect against call stack recursion, the interpreter returns back here for each call - I figured that async/await nullified this concern, but upon testing, making the call direct actually hurt performance measurably.
  - The actual interpreter loop is in `CallInfo.execute async throws`. This loop may call a number of other functions, including the most common ones:
    - `LuaValue.index async throws` -> `LuaTable.subscript async` for accessing table contents
    - `LuaClosure.upvalues {get async}` -> `LuaUpvalue.value {get async}` -> `CallInfo.stack {get async}` for accessing captured local variables in a closure (commonly used to get the global variable table)
    - `CallInfo.call async throws` -> `CallInfo.setArgs async`, `LuaThread.prepareCall async throws` -> return and re-call `CallInfo.execute` for calling Lua functions
    - `CallInfo.call async throws` -> `LuaThread.call async throws` -> `LuaSwiftFunction.body async throws` -> ... for calling Swift functions (which could do things like resume another thread, call a Lua function, or yield the current thread - all async ops)
  - Many operations can be overloaded using a metatable, which acts as a fallback when certain types are used. These overloads are function calls, and as such may initiate a new interpreter segment through `LuaThread.execute async throws` again.
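To make the cost concrete, here's a self-contained toy model of the upvalue chain above - reading one captured variable crosses three actors, so up to three potential executor hops (the names mirror my types, but the bodies are stand-ins, with `Int` in place of `LuaValue`):

```swift
// Toy model of the LuaClosure -> LuaUpvalue -> CallInfo chain: every `await`
// below is a potential actor hop, even though nothing runs concurrently.

actor CallInfo {
    var stack: [Int] = [42]  // stand-in for the real value stack
}

actor LuaUpvalue {
    let frame = CallInfo()
    var value: Int {
        get async { await frame.stack[0] }  // hop into CallInfo
    }
}

actor LuaClosure {
    var upvalues = [LuaUpvalue()]
}

func readUpvalue(of closure: LuaClosure) async -> Int {
    // hop into LuaClosure, then LuaUpvalue, then CallInfo
    await closure.upvalues[0].value
}
```

This pattern runs on every access to a closure's captured variables - including fetching the globals table, which nearly every instruction touches.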
This results in a large number of awaits throughout my codebase, which I can't get rid of: each one is there either because the call might reach a function that could yield, or because it needs to access actor-isolated state. I want to tackle the second type, since I'm not using concurrency for actually concurrent tasks - there's no need to isolate each actor individually, and isolating everything to the whole `LuaState` would be fine. However, I'm not sure how to do this. I tried making every actor inherit the `unownedExecutor` from the global state, but this had no effect (I assume because they already share the global executor), and creating my own executor backed by a `DispatchQueue` resulted in an additional 30x slowdown (900x total!), giving me numbers worse than I've seen on 240 MHz microcontrollers.
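For reference, the "inherit the executor" attempt looked roughly like this (simplified - the real code does this on every actor type, and `payload`/`set` are stand-in members). It uses the SE-0392 custom actor executor hook to forward each sub-actor's executor to its owning `LuaState`:

```swift
// Each sub-actor forwards its unownedExecutor to the owning LuaState, so in
// theory they all share one serial executor. In practice this had no
// measurable effect on swift_task_switch time for me.

actor LuaState {}

actor LuaUpvalue {
    let state: LuaState

    // SE-0392 customization point: reuse the LuaState's executor instead of
    // this actor's own default executor.
    nonisolated var unownedExecutor: UnownedSerialExecutor {
        state.unownedExecutor
    }

    var payload: Int = 0  // stand-in for real mutable state
    func set(_ value: Int) { payload = value }

    init(state: LuaState) {
        self.state = state
    }
}
```

My understanding is that sharing an executor should let the runtime skip the hop when the caller is already on it, but the `await`s (and their suspension-point bookkeeping) remain in the generated code either way.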
Is there a way to reduce the impact of frequent actor isolation switches, or to optimize task switches to not be as heavy? Otherwise, does anyone have any other tips for improving performance of deeply-nested async calls like I have?
I'm developing using Swift 6.1.2 on Arch Linux. I need to avoid platform-specific APIs, as I'm writing a cross-platform app. The full code is available on GitHub at MCJack123/craftos3-lua (a Lua VM & runtime written in Swift) if anyone wants to look, but I don't advise it - it's huge, undocumented, and confusing to anyone who doesn't know Lua well.
(For full disclosure, allocation is taking another 12% of time, but I'm going to tackle that later - concurrency seemed like a quicker target to start with.)