Improve actor performance using transactional memory

wowbagger · April 19, 2022, 4:34pm

Apologies for this last-minute GSoC post. This is an idea that I've been researching into for about 2 months, but the thought of submitting it as a GSoC proposal just came to me in the last few days. The general idea is basically using software transactional memory to improve actor's performance when there are more processing cores/threads than there are actors. Because of my very limited experience with the compiler, I think it could greatly benefit from some mentorship through GSoC.

I've just submitted this proposal (pasted below) to GSoC, but because of how late I am to the process, I haven't gone through the steps of recruiting a mentor yet. If anyone likes the idea presented here and is open to be a mentor for the project, I would really appreciate it. In the event that this project is not selected for GSoC, if the idea appears feasible, I would still like to try to implement it.

Below is a copy of my submitted proposal. Please let me know what you think of it. Also, please let me know if colors used in my diagrams are difficult to see/tell apart for you, and I will update this post with a better version of them.

Improve Actor Performance with Transactional Memory

Author: 冀卓疌

Introduction

Swift's "islands of serialization" approach to concurrency is great for simplifying and streamlining parallel programming, but it comes with some performance cost. In many situations, the cost is minor compared to the ergonomic gains. However, in some other situations the cost can be more significant: e.g. a single actor with many tasks scheduled on it, each accessing a different and mutually exclusive state within the actor. Transactional memory is able to improve performance in this kind of situation where computational resources are under-utilized.

Motivation

Swift's concurrency model serializes all operations that accesses properties isolated by the same actor. This approach guarantees that there cannot be conflicting access to shared mutable states within actors. However, since operations are serialized even if they can be done in parallel without conflicts, some performance is traded away for the improved safety. Consider the following example:

actor ScannerPrinter {
	var inkLevels: [Color: Double] { /* talk to printer over internet */ }
	
	var scannerInput: Paper
	var printerOutput: Paper
	
	func scannedPaper() -> Data { /* reads `scannerInput` */ }
	func printData(_ data: Data) { /* writes to `printerOutput` and `inkLevels` */ }
}

func runMaintenance(on scannerPrinter: ScannerPrinter) async {
	async let inkLevels = scannerPrinter.inkLevels
	async let emptyScan = scannerPrinter.scannedPaper()
	await scannerPrinter.printData(Data(inkLevels.description.utf8))
	await calibrateScanner(using: emptyScan)
}

The runMaintenance function calibrates the scanner by running an empty scan, and prints out the printer's ink levels. This involves reading inkLevels and scannerInput, without writing any value back to them. Even though the 2 read operations do not conflict with each other, they are scheduled one after another in runMaintenance. This is because both properties are protected by the same ScannerPrinter actor instance, and operations scheduled through the same actor are serialized, even if they could run safely in parallel.

If all processor cores are busy when runMaintenance is run, then there is little performance difference whether inkLevels and scannerInput are read in sequence or in parallel. But if there are additional free processor cores, then parallel reading can achieve higher performance.

In addition, emptyScan and printData operations are also serialized, even though reading scannerInput and writing printerOutput don't conflict. Similarly, the performance of these two operations can be improved if they can be executed in parallel .

As shown in the example above, because the current default runtime implementation isn't able to fully use all available processor cores to process non-conflicting operations on the same actor, there is an opportunity cost of performance when there are more processor cores available than there are actors. Sometimes this cost may be significant enough to motivate some users away from using actors and into a more error-prone memory-sharing concurrency model.

Proposed Solution

In the past decade, there have been some literature on improving the performance of actors by using software transactional memory (STM), which is another concurrency control mechanism.

A short primer on software transactional memory

Transactional memory is a concurrency control mechanism, by which all operations on shared memory are presumed not to conflict and speculatively executed in parallel. If no conflict is detected at the end of each of the operations' execution, then the result is committed (i.e. written back to memory); otherwise, the operations are reverted back to their initial states and re-executed on slower serialized paths.

Software transactional memory (STM) is the software implementation of this mechanism, as opposed to hardware transactional memory (HTM), which is implemented in the hardware. We propose using STM instead of HTM for reasons listed in the "Alternatives Considered" section.

Among the literature is the 2015 paper "Dynamic Message Processing and Transactional Memory in the Actor Model", in which the authors demonstrated that by integrating the Akka framework's actor's dispatcher with Scala STM, they were able to improve actors' throughput when there are additional available processor cores.

We propose to apply a similar treatment to Swift's actor model, while preserving the existing semantic effect without any modification to the syntax: Enable the concurrency runtime to conditionally use transactional memory for operations scheduled on actors when and only when computational resources are under-utilized. This way, we can recover some of the unrealized performance by making fuller use of available resources.

This is a non-trivial undertaking, with two orthogonal sub-problems within it: where STM fits in the runtime, and which STM algorithm to use.

Fitting STM in the runtime

Currently, the default actor implementation can be summarized by the diagram above (green lines mark the fast path; red lines mark the slow path): If an actor is not running nor scheduled to run any task, then a lock is acquired on the actor to run the incoming task; otherwise, the task is enqueued using libdispatch to wait for its turn to run on the actor.

With transactional memory introduced into the picture, the fast path is preserved, but the slow path is modified as shown in the diagram below: If resources are under-utilized (i.e. if there are more available processor cores than there are actors that can be scheduled to run), then new tasks are scheduled transactionally (with task boundaries becoming transaction boundaries). Tasks in the same transaction are run in parallel, so if there are more cores available than there are actors, we can speed up non-conflicting accesses scheduled on the same actor. If the tasks are detected to conflict, then they revert to the slow path and enqueued.

Choosing the right STM algorithm

There exists a huge number of STM algorithms. In order to find a suitable algorithm, we must consider some characteristics of Swift's programming model:

Tasks in Swift can be nested:
```
await scannerPrinter.printData(Data(scannerPrinter.inkLevels.description.utf8))
```
so the STM algorithm must support nesting transactions.

Tasks in Swift can produce side effects:

extension ScannerPrinter {
    func shutDown {
        log("shutting down at \(date)") // logging is an unrecoverable side effect
        /* ... */
    }
}

so the STM algorithm must be able to avoid unrecoverable conflicts resulted from side effects.

Tasks in Swift can have conflicting write accesses:

extension ScannerPrinter {
    func copyPaper() { 
        /* reads `scannerInput`, writes to `printerOutput` and `inkLevels` */
    }
}

func copySomethingAndPrint(_ data: Data) async {
    await printData(data) // writes to `printerOutput` and `inkLevels`
    await copyPaper() // writes to `printerOutput` and `inkLevels`, too
}

so the STM algorithm must be able to handle write conflicts efficiently.

We envision to use the NOrec algorithm, because it has the following desirable properties that satisfy the criteria listed above:

NOrec supports nesting transactions.
NOrec is able to avoid unrecoverable conflicts by supporting irrevocable transactions.
NOrec is able to handle write conflicts efficiently by effectively serializing them.

Alternatives Considered

Introduce STM into libdispatch instead of the concurrency runtime

Instead of modifying the slow path to introduce a potentially less slow path, the transactional memory logic can be implemented in libdispatch instead. Since the concurrency runtime (on Apple platforms) relies on libdispatch's knowledge of system load wherever available, it might be a good idea to put STM there too.

Use HTM instead of STM

Although the HTM can be more efficient than any STM, we propose to use STM for the following reasons:

Currently all HTM offered by processor vendors are "best-effort HTM". This means the HTM doesn't guarantee forward progress so it must rely on a software fallback.
HTM hardwares are practically blackboxes–failures that occur in HTM transactions are extremely difficult to diagnose.
HTM even in its best-effort form is not available on all platforms currently supported by Swift.

Future Direction

Although transactional memory can speed up non-conflicting memory accesses to the same actor, conflicting accesses result in wasted work that must be reverted and redone on the slow path. In order to minimize the possibility of wasted work being done, we might be able to detect some of them in advance in semantic analysis, so they can skip the transactional memory path.

Additional GSoC-Related Information

Skills required

Since the Swift compiler is mostly written in C++, it should be necessary for the author to have some basic proficiency with it. Additional C++ skills can be obtained while working on this project.

Additionally, basic understanding of the Swift concurrency model should be necessary.

The author believes that he has the required skills.

Project scale and expected outcome

We do not expect this idea to be developed into a production-ready component in the Swift runtime within the span of GSoC 2022. Instead, we expect only to produce a proof of concept that can be benchmarked, so it can lay a foundation for additional development in the future. Even for just a proof of concept, this project likely involves some large scale compiler work. The following tentative timeline illustrates how the project can be divided in stages, and realistically how much can be achieved within 350 hours of time across 12 weeks:

Week number	Objectives, agenda and explanation
1	Add benchmarks that are able to measure performance difference between parallel and sequential operations scheduled through the same actors. This should be a fairly light task that has a broader impact than the project itself. Because concurrency benchmarks necessitate the benchmark suite to support concurrency first, which it doesn't at the moment. Currently there exists a draft proposal for such a benchmark by the proposal author, which can serve as a head start.
2	Start writing a NOrec STM implementation in the runtime library. There exists many open-source implementation of it that we can study from. One of which is this repository which was developed as a proof of concept for a proposal of implementing language-level STM in C++.
3	Continue writing the STM implementation. The STM algorithm itself usually shouldn't take more than 1 week to implement, but because of the author's lack of experience with the Swift compiler, it's expected that a bit more time is needed to find the right headers to include. Taking some additional time in implementing the algorithm properly should serve as a learning opportunity to learn the all the headers and facilities provided in the compiler that abstract cross-platform support.
4	Explore how task boundaries can be translated to transaction boundaries, and how the STM algorithm can tell apart read and write accesses in Swift. This possibly involves getting some information through semantic analysis.
5	Continue the same work as week 4's. Because of the author is unfamiliar with the compiler frontend, it's expected to take some extra time to figure things out. So, in total 4 weeks are allocated to it.
6	Continue the same work as week 5's.
7	Continue the same work as week 6's, and start phase 1 evaluation. By this time, there should be a verifiably (via unit test) working STM algorithm implemented, and the process of integrating STM with the runtime figured out.
8	Start integrate STM with the runtime. This could be many different options for this to work, but all of them would revolve teaching the runtime to start a transaction at the start of a task boundary, and end the transaction at end of task boundary.
9	STM integration with the runtime likely will take more than 1 week. So week 9 continues week 8's work.
10	If the work down in week 4 to 7 is successful, then by this week, we should have a proof of concept integration of STM and the concurrency runtime. If this is the case, we can start benchmarking it. Additionally, we should start considering how we can obtain system load information from libdispatch. This is not exactly needed for the goal of a minimal proof of concept that shows it's possible to integrate STM with Swift's actor model. However, it's needed in order for the runtime to only use STM when there are additional unused processor cores. Additionally, this provides a buffer if any of the preceding objectives takes longer than planned.
11	Continue exploring how system load information can be obtained from libdispatch.
12	Continue the work of week 11.
13	Prepare final evaluation. By tis time, in addition to a having working STM algorithm personalized for Swift's concurrency runtime, it should be in working integration with the runtime and benchmarked. For objectives that are mainly exploratory, there should be documents outlining some key points that are figured out, so they could be further built upon in the future.

If between this proposal being submitted and the start of GSoC coding period, it becomes clear the alternative where the STM algorithm should be implemented in libdispatch is the right path to take, then then timeline needs the following modification:

During week 2 to 3, instead of implementing the STM algorithm in the concurrency runtime, it will be implemented in libdispatch.
During week 4 to 7, instead of translating task boundaries to transaction boundaries in the concurrency runtime, it will be done in libdispatch. Additionally, roughly 1 or 2 more week needs to be allotted to this objective, since it's potentially more difficult for the STM to tell apart read and write accesses without additional information that can be passed from semantic analysis.
During week 8 to 9, it should be no longer needed to integrate STM with the concurrency runtime. So this objective should be removed.
During week 10 to 12, instead of looking for ways to pass system load information from libdispatch to the STM in the concurrency runtime, we will need to look for ways of using this information within libdispatch to decide whether it should use STM.
Any remaining time should serve as a buffer for objectives that take longer than planned.