Actor Races

I was quite excited when actors were added to Swift, because much of my code is already structured with actor-like classes, utilizing OperationQueue or DispatchQueue. Unfortunately, they have fallen short of my expectations, and I couldn't use actors to replace my existing code. Let me explain...

Generally, I will use asynchronous Operation types to fully serialize and atomize high-level operations. In this way, a high-level asynchronous function like a sync, which performs many different async calls to servers, local databases, and so on, can still be reasoned about simply: it is an asynchronous function call, but I know that it is fully serialized and atomic. It is not possible for two sync calls to run concurrently, even though the function is asynchronous. It is almost as easy to reason about as serial code (if uglier, for all the callbacks).

This is how I expected actors to work too, but it isn't how they work. Calls to async actor methods are not atomic. Anywhere there is an await in an actor func, there is the potential for interleaving. In my example above, if the sync function were called multiple times in short succession, the calls would run concurrently. Not concurrently in the threading sense, but concurrently in the sense that two executions of the method would be in flight at the same time. So even though the actor guarantees that execution is serial, there is concurrency of function calls.

So what? IMO, this concurrency makes it just as difficult to reason about actors as it was to reason about thread races. Actors were supposed to protect shared data and make it easy to reason about, but the lack of atomicity of functions means it is just as difficult to reason about actor races as thread races.

To make this more concrete, here is a contrived example, but one which demonstrates the problem as simply as I could make it.

actor Adder {
    private var sum: Int = 0
    
    public func addOne() async {
        var s = await getSum()
        s = s+1
        await setSum(to: s)
    }
    
    private func getSum() async -> Int {
        await pause()
        return sum
    }
    
    private func setSum(to newSum: Int) async {
        await pause()
        sum = newSum
    }
    
    private func pause() async {
        let pause = Int.random(in: 0..<100)
        try! await Task.sleep(for: .milliseconds(pause))
    }
    
    public func printSum() {
        print("Sum is \(sum)")
    }
}

// Sum to ten in parallel
let adder = Adder()
await withTaskGroup(of: Void.self) { group in
    for _ in 0..<10 {
        group.addTask {
            await adder.addOne()
        }
    }
    await group.waitForAll()
}
await adder.printSum()

This actor just adds one to a sum on each call to addOne. I have made addOne include a few awaits to introduce a race.

My guess is that most newcomers to actors would actually think this code is completely safe. They would assume a call to addOne will be atomic — it will finish completely before the next call to addOne is handled. That isn't the case, as demonstrated when you try to run the code. The sum is always wrong in my tests.

This is a contrived and simple example, and could easily be rewritten to work properly, but real-world examples are generally much more complex. It is a shame that actors can't deliver on the promise to protect shared data and make it easy to reason about. Reasoning about a somewhat complex actor, where you have to account for interleaving of methods, is just as difficult as reasoning about multithreaded code.
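For contrast, here is one way the contrived example could be rewritten so there is no race (a minimal sketch): the mutation happens synchronously on the actor, so there is no suspension point between the read and the write.

actor SafeAdder {
    private var sum: Int = 0

    public func addOne() {
        // No await between reading and writing sum, so calls cannot interleave.
        sum += 1
    }

    public func printSum() {
        print("Sum is \(sum)")
    }
}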

This isn't intended as a rant. I have brought it up because I think the concept of actors is a good one. If they could deliver on the promise of making it easy to control access to data and reason about concurrency, it would be a great step forward.

What could we do about this? I would like a mechanism whereby a function like the sync one mentioned above could be marked as atomic. Such a function would need to run to completion before a new call from outside the actor could be handled. Effectively, there would be an extra top-level queuing or locking mechanism in actors, which would only allow one atomic function to be in flight at a time; while an atomic func was running, all calls from outside the actor would be queued up.
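To make that concrete, here is a rough sketch of what such queuing could look like if written by hand today (my own illustration, not proposed syntax or an existing API): each new call is chained onto the previous task, so only one is in flight at a time.

actor Syncer {
    private var previous: Task<Void, Never>?

    // Each call waits for the previous one to finish before starting,
    // even though the body itself contains awaits.
    func sync() {
        let prior = previous
        previous = Task {
            if let prior { await prior.value }
            await self.performSync()
        }
    }

    private func performSync() async {
        // talk to servers, local databases, etc.
    }
}

(A real implementation would also want to surface completion and errors to the caller, and drop the stored handle once the chain drains; this is only the shape of the queuing.)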

I realize this introduces other problems, like the potential for deadlocks. I think deadlocks could be avoided with adequate restrictions on these atomic functions, or adequate runtime checks, but I am a user of Swift, not a compiler developer.

What do others think? Is there a problem? And is there a good solution?

24 Likes

On a personal note, I very much agree, and we covered this in the initial actors pitch, in the Actor reentrancy section: swift-evolution/0306-actors.md at main · apple/swift-evolution · GitHub, and in the future directions of the same document: swift-evolution/0306-actors.md at main · apple/swift-evolution · GitHub

It is somewhat unfortunate that our actor runtime does not provide such flexibility (yet?), as it is something people either subconsciously expect or actively need, and can't really achieve without reaching for streams and manually reimplementing queueing...

I believe the solution would be to allow users to opt into non-reentrant behavior for methods which are in need of such a crucial guarantee, but otherwise reentrancy could be permitted, as was pitched in the evolution writeup above. Have a look and say what you think. Deadlocks are a concern, but in my opinion way easier to diagnose and debug than that once-in-a-blue-moon weird ordering that causes a concurrency bug in what is otherwise a nicely synchronized actor.

Having said that, I'm not really speaking on behalf of the concurrency or core teams, but from my experience working on actor runtimes, as well as Swift's take on them. The actual solution may need more research, and I think we'd be open to hearing alternate takes.

cc @Douglas_Gregor

18 Likes

This applies to all async functions - there is not really anything special about actors in this regard, except that users intuitively expect that, because they are executing a method, they should have exclusive read and write access to the actor's data members as well.

I say this quite a lot, but any await should be treated as a discontinuity. Code before the await may be separated from code after the await by an arbitrary period of time, and anything that isn't a local variable can change during that time. If you need some data to be consistent across an await, store it in a local variable. Just remember that rule and it all makes a lot more sense, for all async functions, everywhere you encounter them.

Non-actor examples:

var someGlobal = 42

// Stub:
func getNewValue() async -> Int { 0 }

func doSomething() async {
  print(someGlobal) // 42
  let newValue = await getNewValue()
  print(someGlobal) // ??? - Impossible to say
}

func doSomething2() async {
  let capturedLocal = someGlobal // 42
  let newValue = await getNewValue()
  print(capturedLocal) // 42. Always.
}

(Of course, someGlobal should itself be protected by a global actor -- that doesn't make a difference; it applies to all non-local variables across the entire program)

6 Likes

I'm quite in the same boat. The reason I haven't jumped to actors yet, despite many of my major components having their own private operation queues, is that I find the Swift actor system too complex to reason about (even though it is supposed to make things simpler).

I've also avoided async/await in my code for the same reasons and stuck to Operation/OperationQueue. IMHO they provide the perfect set of primitives to correctly handle "job/worker" types of concurrency (which I think is the most common), with very easy control over ordering, concurrency, and cancellation.

5 Likes

To put it another way - the Swift concurrency analog of Operation is not actor; it is Task.

So what are actors for? They serialise accesses (reads and writes) to mutable data which is shared by multiple Tasks (i.e. shared by multiple Operations). For example, each Task/Operation might perform network requests to download an image, store it in a database, etc. - but an image cache which all of those tasks/operations share and modify should be an actor.
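As a minimal sketch of that kind of shared state (the Image type here is just a stub, and the names are made up):

import Foundation

struct Image { /* stub */ }

actor ImageCache {
    private var images: [URL: Image] = [:]

    // Many independent tasks can read and write this cache; the actor
    // serialises access to the dictionary so there are no data races.
    func image(for url: URL) -> Image? {
        images[url]
    }

    func insert(_ image: Image, for url: URL) {
        images[url] = image
    }
}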

If you perform those high-level operations in Tasks, you will find that you kind of have to store your per-Task state in local variables (or at least, it becomes a lot more obvious when you're touching data that is shared between tasks because you have to await it), so you get the sort of behaviour you're looking for.

13 Likes

That's exactly what I thought after I had read about actors in The Swift Programming Language book.

But one day, when I was reading the actors proposal (SE-0306), I nearly had a heart attack when I saw this 🙂:

Actor reentrancy

Actor-isolated functions are reentrant. When an actor-isolated function suspends, reentrancy allows other work to execute on the actor before the original actor-isolated function resumes, which we refer to as interleaving.
...

I hope that whoever is maintaining the book sees this thread and updates the book accordingly to include more information about actors.

3 Likes

Thanks for the pointers. The evolution document indeed covers what I had in mind with an atomic keyword. The @reentrant attribute is the same thing (or the opposite?), and the document covers the various concerns I had with deadlocks etc. If something like this were to exist in actors, I am pretty sure it would provide exactly what I would need to replace my existing OperationQueue approach.

(It's funny how the same design patterns come back through the years. Cocoa has NSLock and NSRecursiveLock, which are analogous to the task tree approach discussed.)

2 Likes

Do tasks have the same abilities as Operations? E.g. can you set up dependencies? Ensure they run serially, one after the other?

In any case, you are no doubt correct. I just think it would be good if actors were extended to hide the implementation details of Tasks, and provide a simple mechanism for people to control access to their shared data without having to worry much about races. Once you have to say "copy all the data into local variables", I think you have already lost the game, because that is how we already did it. In that scenario, an actor provides very little advantage over just working with async funcs.

1 Like

Sure, they can have dependencies - a Task can spawn any number of other Tasks and await their results. They expose handles which you can use to cancel the Task, they automatically propagate priority information and cancellation status, and we have constructs like TaskGroup, which allow you to perform complex scheduling operations such as spawning n child tasks, awaiting the first m to complete, and cancelling the rest. Because they are built into the language, things like error propagation are built in, you can conditionally execute tasks, and you get compile-time checking when you access data outside of the task.

They are far more expressive than Operation.
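As a rough sketch of the "spawn n, await the first m, cancel the rest" pattern mentioned above (my own illustration, not from any proposal):

func firstResults(_ m: Int, from jobs: [@Sendable () async -> Int]) async -> [Int] {
    await withTaskGroup(of: Int.self) { group in
        for job in jobs {
            group.addTask { await job() }
        }
        var results: [Int] = []
        for await value in group {
            results.append(value)
            if results.count == m { break }
        }
        // Cancel whatever is still running; the group still awaits its
        // remaining (now cancelled) children before returning.
        group.cancelAll()
        return results
    }
}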

So, the other thing to remember is that all async functions are implicitly running on some Task. It's not the same programming model as Operation, where you write a subclass and store your state in data members; instead, you just write an async function, and the function's local variables are like the Operation subclass's instance variables.

When you write an async function, you are already using the Task programming model:

A task is the basic unit of concurrency in the system. Every asynchronous function is executing in a task.

SE-0304 Structured Concurrency

Going back to your example about downloading images and storing them in a database, I would refer to an example from SE-0296 async/await (with a bit of async let sugar for spawning structured child tasks). This is how you do it in the async function/Task model:

func processImageData() async throws -> Image {
  async let dataResource  = loadWebResource("dataprofile.txt")
  async let imageResource = loadWebResource("imagedata.dat")
  let imageTmp      = try await decodeImage(dataResource, imageResource)
  let imageResult   = try await dewarpAndCleanupImage(imageTmp)
  return imageResult
}

// Stubs:
struct Resource { }
struct Image { }
func loadWebResource(_ path: String) async throws -> Resource { fatalError() }
func decodeImage(_ r1: Resource, _ r2: Resource) async throws -> Image { fatalError() }
func dewarpAndCleanupImage(_ i : Image) async throws -> Image { fatalError() }

In the Operation model, each of those (stubbed) async functions could be its own operation subclass, maybe the processImageData function would also be a subclass, and stitching together those dependencies would be rather complex.

As I mentioned previously, actors are not the replacement for Operation; async functions/Tasks are. Actors are for when you have multiple independent tasks (multiple async functions which don't necessarily have a common parent), and you want to share data between them. In that case, you need some kind of independent "data island" which protects its data as all of these various tasks try to read/write it at the same time; and that's when you should use an actor.

I worry that perhaps our communication about swift concurrency has emphasised actors too much. They have a very particular use - mutable data which is shared between tasks. They should not really be all that common in most applications, IMO.

So maybe it's not that satisfying when I say "copy the data into local variables", but perhaps the more important question is: what mutable data are you even sharing between independent tasks (if any)? Maybe an actor is not the right solution in the first place, or perhaps you are isolating more data within the actor than you really need.

5 Likes

Take this func which is similar to your processImageData...

func downloadAndStore() async {
    let download = await loadWebResource("data.txt")
    await database.store(download)
}

Code like this can be very dangerous, and most people probably don't even realize it.

Assume this function may be called at pretty much any time, perhaps when someone hits a button, and has no high level queueing.

If this function is initiated twice in close succession, and the server changes data during this time, what will end up in the database? The most recent version of the data? We don't know. It's a race.

So far, I haven't heard a simple way to enqueue calls like this. This is exactly the type of method you would like to have an atomic constraint on in an actor. It would make real-world async app development so much easier and safer. (The nasty thing about races like this is that they work 999 times out of 1000, and on the 1000th run they ruin everything.)

9 Likes

In this case, the state you want to share between the tasks is whether there is another task running.

actor OneTaskOnly {
    private var runningTask: Task<Void, Never>?

    func runTask(_ block: @escaping @Sendable () async -> Void) {
        if runningTask == nil {
            runningTask = Task { 
                await block() 
                self.runningTask = nil
            }
        }
    }
}

I'm storing a Task handle here (so the task can be cancelled or awaited on), but you could store a boolean if you don't need that.

In the context of an entire application, making lots of different requests to a server, each of which may return slightly out-of-date snapshots of overlapping datasets, you may need one of these per dataset (or design your application logic in such a way that it is tolerant to some inconsistencies between datasets that may be fetched independently). You define where those boundaries are.

It's not the same though. You need a true queue here. Just seeing if there is already something in flight does not do it, because maybe the new task was triggered by a push from the server or something, indicating that new data was available, while the existing (running) task is already halfway through and didn't get that data.

It seems what we are supposed to use is an AsyncStream (How do you use AsyncStream to make Task execution deterministic?). It just seems like a very heavy-handed approach. I feel like this is such a common pattern, and no doubt the cause of many bugs, that making it as simple as possible to avoid, with a simple keyword, would help a lot.

You basically want each actor to have an AsyncStream internally, with atomic funcs being submitted to a single parent Task. Something like that.
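Here is a rough sketch of that shape (my own illustration; the name SerialJobQueue is made up, and AsyncStream.makeStream requires Swift 5.9): jobs are yielded into a stream and drained by one long-lived task, so at most one job runs at a time.

final class SerialJobQueue: Sendable {
    typealias Job = @Sendable () async -> Void

    private let continuation: AsyncStream<Job>.Continuation

    init() {
        let (stream, continuation) = AsyncStream.makeStream(of: Job.self)
        self.continuation = continuation
        // One long-lived consumer: jobs run strictly one after another.
        Task {
            for await job in stream {
                await job()
            }
        }
    }

    func enqueue(_ job: @escaping Job) {
        continuation.yield(job)
    }
}

An actor whose "atomic" funcs simply enqueue their bodies onto something like this gets the serialisation being asked for, but it loses direct return values and error propagation (you would need continuations to get those back), which is part of why it feels heavy-handed.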

2 Likes

I'm not sure that you need an AsyncStream for this particular case; as well as just the task closure, you could also store some metadata indicating what triggered the task (and a timestamp of when it started, etc).

A server notification may be a high-priority event which always causes the running task to be cancelled and a new task to run (I'm just using a simple force boolean parameter):

actor OneTaskOnly {
  private var runningTask: Task<Void, Never>?
  /*   private var runningTask: (Task<Void, Never>, Reason)? */
  private var generation = 0

  func runTask(_ block: @escaping @Sendable () async throws -> Void, force: Bool = false) {
    func launchTask() {
      generation += 1
      let myGeneration = generation
      runningTask = Task {
        try? await block()
        // Only clear the handle if a forced relaunch hasn't already
        // replaced this task with a newer one.
        if self.generation == myGeneration {
          self.runningTask = nil
        }
      }
    }

    if runningTask == nil {
      launchTask()
    } else if force {
      runningTask?.cancel()
      launchTask()
    }
  }
}

I just don't see re-entrancy as being the underlying problem here; the tasks are not trying to mutate the same variables and interleave at suspension points. You're looking for a higher-level kind of mutual exclusion, where the granularity of that exclusion is based on the kinds of requests your server offers, how the data is modelled in the database, and how strict the consistency requirements are as different portions of the database get updated by those various requests.

3 Likes

Interesting example indeed. Can you think of a fix? For simplicity, let's assume it was the old-style closure-based API:

func downloadAndStore(execute: @escaping () -> Void) {
    loadWebResource("data.txt") { download in
        database.store(download) {
            execute()
        }
    }
}

It's a race here as well, so, how'd you avoid it?
And if there's a good solution here, how best to back-port it to the async/await version?

1 Like

The way I have solved that is with a serial queue. You then use an AsyncOperation class to hold the contents of the func, and submit that to the queue. Something like this

func downloadAndStore(completion: @escaping () -> Void) {
    serialQueue.addOperation(
        AsyncOperation { finishHandler in
            loadWebResource("data.txt") { download in
                database.store(download) {
                    completion()
                    finishHandler()
                }
            }
        }
    )
}

There is no AsyncOperation in Cocoa, but the Operation class supports the concept of an asynchronous operation, so it is quite easy to make a subclass (e.g. LLVS/AsynchronousOperation.swift at master · mentalfaculty/LLVS · GitHub).

How would this look in async/await? I suspect you can do this with AsyncStream (How do you use AsyncStream to make Task execution deterministic?), but it is not really any more elegant than the AsyncOperation approach above, and not that accessible to inexperienced developers.

For this reason, I think it would be nice if actors could take care of the streams, queues, or whatever they need to make this happen, so the developer can just write something like this:

actor DataManager {
    func downloadAndStore() atomic {
         let download = await loadWebResource("data.txt")
         await database.store(download)
    }
}

4 Likes

I see, thank you.

Would the following fly as an interim / hybrid approach?

func downloadAndStore() async {
    await serialQueue.addAsynchronousOperation {
        let download = await loadWebResource("data.txt")
        await database.store(download)
    }
}

Here's a proof-of-concept implementation of the hybrid approach:

let serialQueue = OperationQueue()
serialQueue.maxConcurrentOperationCount = 1

func foo(_ name: String, timeout: UInt64) async {
    await serialQueue.addAsynchronousOperation {
        print("\(name) op started")
        try await Task.sleep(nanoseconds: timeout)
        print("\(name) op finished")
    }
}

func test() async {
    await foo("first", timeout: 2_000_000_000)
    await foo("second", timeout: 1_000_000_000)
}

print("Start")
Task {
    await test()
}
RunLoop.main.run(until: .distantFuture)
print("Stop")
print()

extension OperationQueue {
    func addAsynchronousOperation<T>(_ block: @escaping () async throws -> T) async {
        addAsynchronousOperation { finish in
            Task { () -> T in
                let result = try await block()
                finish()
                return result
            }
        }
    }
}

where "AsynchronousOperation" is taken from the link you provided above.

The test gives the desired output:

Start
first op started
first op finished
second op started
second op finished

1 Like

I do agree that this is a surprising and pernicious pitfall — not just with Swift, but with all languages that support async / await (or anything coroutine-shaped, really) and also support shared mutable state in any form. While Swift concurrency is a vast improvement over the state of the art, the existence of the difficult-to-spot partial task (i.e. the stretch of control flow until the next await) has serious downsides for our ability to reason about state and control flow.

I wrote about my concerns over this pitfall (and also here) during the reviews of the concurrency features. I still wonder if we might do anything about it.

As @ktoso noted above, non-reentrant actors could partially solve this problem…but with serious deadlock and performance downsides. Beyond those downsides, would they in fact completely solve the issue? It seems to me that the problem here is not only with actors, but with anywhere an await appears.

Two other thoughts:

  1. Are there useful warnings the compiler could emit? Could we, for example, warn if there are modifications to potentially shared state on both sides of an await within a function? Does the compiler have enough information to emit such a warning?

    If so, there could be a kind of explicit “commit work” operation that compiles to a noop (!), but lets the compiler know that the programmer believes they have left shared state in a good state and therefore a subsequent await is acceptable.

  2. Might we consider Eiffel-like invariant checks?

    For those unfamiliar, the basic idea is that a type has function(s) that verify that all the class's invariants are correct, and the compiler automatically adds a call to the appropriate invariant check function(s) every time a public method returns, or before every await. The function is only a check; in optimized builds, it does not run.

    Those willing to add invariant checks to their types would presumably be able to catch misreasoning about await sooner.
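As a rough, hand-rolled illustration of the idea in (2), not a language feature: the actor asserts its invariants explicitly before every await and before returning, so a violated assumption is caught at the suspension point rather than some arbitrary time later. The names and the invariant here are made up.

actor JobTracker {
    private var started = 0
    private var finished = 0

    // Invariant: we can never have finished more jobs than we started.
    private func checkInvariants() {
        assert(finished <= started, "finished must never exceed started")
    }

    func run(_ job: @Sendable () async -> Void) async {
        started += 1
        checkInvariants()   // state is consistent before we suspend
        await job()         // other calls on this actor may interleave here
        finished += 1
        checkInvariants()   // and again before returning to the caller
    }
}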

4 Likes

Maybe an opt-in atomic marker on functions? I don't know if something like this is doable at all. I really welcome avoiding deadlocks and improving performance, but in some cases could we nudge the scheduling to execute to our liking?

1 Like

It sounds like what's really being requested here is a TaskQueue in the standard library, analogous to a serial OperationQueue (or a serial DispatchQueue from GCD). I'm in support of this; it's not terribly difficult to mock up the basic idea of what it should look like, but in practice it's a bit tricky (here's my crack at a draft). I could see it being either an actor or a class that uses a locking mechanism of some sort.

7 Likes