Writing tests to demonstrate actor re-entrance bugs

Hi everyone,

I'm developing a multiplayer game where the global state is held on the server in a World actor. Players queue commands that change the game's state, in particular by affecting Entity values (which are structs, and thus value types). Most state changes go through async functions, and because we're dealing with value types, copies of entities get passed around.

I have a lot of tests that validate the game's logic. But I am concerned about bugs that actor re-entrance might introduce.

How could I write tests (using XCTest) that validate whether or not actor re-entrance is happening and/or unexpected things are happening?

4 Likes

I think unexpected reentrancy of functions and unexpected ordering of work are the two core problems of Swift concurrency. Either we let people build simple mental models around how all that works, or it will be unusable by average programmers.

I don't think tests are the right means of guarding against incorrect code when it comes to concurrency.

Writing a test that detects when re-entrance occurs probably wouldn't even be meaningful, because if the design allows it, it's eventually going to occur. Maybe collecting statistics would be more meaningful (e.g. in 1,000,000 randomly generated game sessions, 100 encounter re-entrance). But even then, you have to define what "re-entrance" means exactly. If it means simultaneous execution of two separate synchronous slices of a single async actor function, you need to define exactly what the entrance and exit criteria are for each slice. That a debugger would show two stacks with a frame for the actor function isn't really meaningful. You would need to set a flag at the "beginning" and "end" of each slice which defines the entrance and exit:

actor World {
  ...

  // Counts calls currently inside the awaited region below; if it ever
  // exceeds 1, two slices of doSomeWork have interleaved.
  private var occupancy = 0 {
    didSet { if occupancy > 1 { print("Re-entrance detected") } }
  }

  func doSomeWork(with entity: Entity) async {
    // Some synchronous work
    occupancy += 1              // "entrance" of the region that can interleave
    await doSomethingAsync()    // suspension point: other calls may run here
    occupancy -= 1              // "exit" of that region
    // More synchronous work
  }
}

But if you have to introduce state to track this, the re-entrance you have defined isn't meaningful to your app, it's inserted just to be tested.

Presumably you're worried about this because the async functions modify state during each synchronous slice of the body and you're worried that something will go wrong if those modifications get interleaved. You can test for that, but again if you design the code to allow it, it's just a matter of time before it happens. If it's incorrect for that to happen you need to design the code to prevent it.

If you want an exercise to test every possible interleaving combination, you can make the doSomethingAsync call mockable in tests and mock it with a continuation, and have a test spawn a parallel Task to call the actor function and use two continuations to synchronize the test and the Task. But I would be concerned if this was necessary to understand the implications of interleaving. The fact that you're even concerned about it raises the question of why the design allows it at all.
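For concreteness, here is a sketch of that kind of test. Everything in it — the World actor with an injectable dependency, the OneShotGate helper — is a hypothetical stand-in, and the continuation-juggling is exactly the overhead I'd be wary of:

```swift
import XCTest

// Hypothetical stand-ins: a `World` actor with an injectable async
// dependency, and a gate that parks the first caller on a continuation.
actor World {
    private let dependency: @Sendable () async -> Void
    private(set) var log: [String] = []

    init(dependency: @escaping @Sendable () async -> Void) {
        self.dependency = dependency
    }

    func doSomeWork(label: String) async {
        log.append("\(label) entered")   // first synchronous slice
        await dependency()               // suspension point: re-entrance can happen here
        log.append("\(label) resumed")   // second synchronous slice
    }
}

// Parks the first caller until `release()`; later callers pass straight through.
actor OneShotGate {
    private var parked: CheckedContinuation<Void, Never>?
    private var observer: CheckedContinuation<Void, Never>?
    private var armed = true

    func passOrPark() async {
        guard armed else { return }
        armed = false
        await withCheckedContinuation { continuation in
            parked = continuation
            observer?.resume()           // signal the test that someone is parked
            observer = nil
        }
    }

    /// Suspends the caller until someone has parked.
    func waitUntilParked() async {
        if parked != nil { return }
        await withCheckedContinuation { observer = $0 }
    }

    func release() {
        parked?.resume()
        parked = nil
    }
}

final class InterleavingTests: XCTestCase {
    func testSecondCallRunsWhileFirstIsSuspended() async {
        let gate = OneShotGate()
        let world = World { await gate.passOrPark() }

        let first = Task { await world.doSomeWork(label: "first") }
        await gate.waitUntilParked()             // the first call is now suspended mid-body

        await world.doSomeWork(label: "second")  // interleaves, deterministically
        await gate.release()
        await first.value

        let log = await world.log
        XCTAssertEqual(log, ["first entered", "second entered",
                             "second resumed", "first resumed"])
    }
}
```

The gate gives the test two deterministic handles: it knows when the first call has suspended, and it decides when that call resumes, so the assertion on the interleaved log can't race.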

I would expect a game design to make the entities actors and the World a struct (where the "world" is defined as the static/immutable part of the game and "entities" are the dynamic parts that interact and change state during the game session). Don't the entities need unique identifiers? Wouldn't copying an entity be a non-trivial spawn operation that creates a new game object? I would expect the concurrency protection to need to apply to each individual entity's state, but no synchronization to be required between two entities' states. Do you imagine re-entrance would still be a problem in that kind of design?

4 Likes

I tend to agree with the previous post; manually writing tests to prove you don't have some specific ordering somewhere that will cause a bug is a path that never ends and is unlikely to give you much real confidence... You could try hard and, for every method, consider every possible ordering things might arrive in... but that's a lot of thinking, and a lot of hardcoded orderings as well.

For such things, from experience, it's much better to have many high-level integration tests like "play the game with simulated inputs" and see if anything went wrong; if yes, use a trace (maybe distributed tracing, or just log messages) to figure out which ordering was "the bad one", and decide whether it's expected to happen and your logic has a bug, or whether it's not expected and you need to prevent it.

I definitely agree with the need for better fine-grained control over reentrancy, and we'll think of some solutions here, but either way -- big integration tests are always going to be useful. With reentrancy prevention you'd then be on the hunt for making sure you don't deadlock your actors, after all! So the tests are useful either way, and the closer to real world the better, rather than manually constructing "call Y, then X, then Z which calls X" scenarios, which are going to be pretty brittle as time moves on.

3 Likes

I have seen this objection raised many times against Swift's concurrency system. I think it indicates a misunderstanding of what that system is for.

A system of concurrency guaranteeing order is a contradiction. Concurrent means in parallel, which implies no guarantee of order. Guaranteeing order just reintroduces seriality. What are people asking for when they ask a concurrency system to make order guarantees?

The answer usually seems to be something about the order tasks are "started", in contrast to full serialization which implies one task is started and finished before the second task is started. First of all, this isn't something new to Swift concurrency. In this code:

  Thread.detachNewThread { print("Hello 1") }
  Thread.detachNewThread { print("Hello 2") }

The order of the print statements is indeterminate. That's the whole point of spawning new threads. There's no guarantee that the first thread "starts" before the second thread "starts".

Second of all, "start" isn't even well-defined. Nobody thinks it means the first line in the body (that is, that the print statements will be in order). So what does it mean? There are lots of CPU instructions that get run as part of adding a thread to the kernel's task scheduler and jumping to the function pointer supplied as its main. Which one exactly counts as "starting" the thread?

So why do people think that if you replace Thread.detachNewThread with Task there ought to be some kind of guarantee about tasks "starting" in a particular order, when that is equally ambiguous?

It's the same situation with every other scheduling abstraction. Replace Thread.detachNewThread with DispatchQueue.global().async and once again there's no order guarantee (that's the whole point), and it's not even well-defined what it means for one of those blocks to "start".

Regarding re-entrance, why does anyone want or expect actors to continue blocking all message processing after a task yields? The point of the await keyword is that it means to yield, which enables cooperative multitasking. Because it's not preemptive, you don't have to worry about being forced to yield anywhere and having to introduce primitives like locks to deal with that. But two slices of work inside one async func aren't any more significant than two critical sections inside a single func. Of course there's no expectation that, between unlocking a mutex and later locking it again, no one else locked it in between. That's the whole point of unlocking it!

The reason, I think, is that people don't make an actor function async or insert awaits judiciously to yield where appropriate. They insert them because they want to call a function that's async and not proceed until it completes, so they have to write await and then mark the function async. It's not "this is the right place to put an await", it's "I put an await here because I have to". They want and need the whole function to be a single critical section, but it just can't be, because they happen to need to call an async function somewhere inside it.

This is also why people repeatedly ask "can't we just get rid of the await keyword?" They think it's for the compiler, but it's not. It's for you. To most people async/await is just "awesome, I don't have to write callbacks anymore!" (the real question there is why so many things used callbacks before instead of just blocking).

This is the language forcing you to abandon poor design. It is equivalent to nested critical sections, a thread acquiring a lock for one piece of state and then while holding that lock, acquiring a lock for an unrelated piece of state, which starves the system because anyone who just needs to access the first state must wait for access to unrelated state somewhere else to finish. Critical sections should be short and to the point, getting their necessary access over with. Nesting them violates this rule. Cooperative multitasking simply doesn't let you do this.

If you start going down this road, what you actually want is a queue. This is a higher level of abstraction built on top of actors just like a serial DispatchQueue is a higher level of abstraction built on top of Threads and condition variables.

It might be that some devs get frustrated with Swift concurrency because it's new, so we've basically been thrown back to the days when we were just given Thread and could start doing multitasking. No one has built the NSOperationQueue of the new concurrency world yet.

9 Likes

Thank you for this Concurrency 101 material, which I linked to from Actors 101.

This really deserves its own topic. :slight_smile:

2 Likes

While I agree with your comment as a whole - yes, being reentrant is the whole point of async/await - there are situations when you want to prevent that from happening and I think it almost always happens when dealing with hardware. Just recently there was yet another post "how to make an actor non-reentrant" where the actor was managing Bluetooth communication.

More generally, you can't have a queue when you deal with some "dumb" single-tasked peripheral: you want to wait for a reply before you send the next command to the device, so queues don't work here.

First off, thank you all for the comprehensive replies. :pray:t2:

I’m out on a trip today, but I’ll get back to you tonight.

Maybe we should start another thread for this... but based on what you've described here, it sounds like what you want is something similar to how the newer lower level GPU libraries work, which is a serial command queue that you submit command buffers to, where a command buffer contains multiple commands. So something like this:

final class CommandQueue {
  // Supplies a command buffer that can send commands to the hardware, which
  // will be submitted to the queue as a single block of work. By `await`ing
  // this function, the caller is suspended until the device has finished
  // executing all the commands in this buffer.
  func submit<R>(_ bufferProvider: (_ buffer: borrowing CommandBuffer) async -> R) async throws -> R
}

struct CommandBuffer: ~Copyable {
  func executeCommand1() async throws -> Reply1
  func executeCommand2() async throws -> Reply2
  ...
  func executeCommandN() async throws -> ReplyN

  // Only the CommandQueue can create these
  fileprivate init(...)
}

So the CommandQueue is precisely a serial queue, but the "unit" of work isn't a single command sent to the hardware but a batch of commands. The queue guarantees that all the commands in one batch are executed together, with no other commands being interleaved in between (each buffer gets exclusive time on the queue while other buffers have to wait). The implementation of CommandQueue would be essentially the same as a more general SerialAsyncQueue (submit any async function and it ensures they are executed serially) but it would just restrict what can be submitted to it. It could probably just wrap a SerialAsyncQueue.

I made the individual commands async because I assume that's the central problem: communicating with the hardware is IO bound work that on the lowest level is suspending with a continuation that is resumed when the hardware replies asynchronously. The ability to treat an entire async function as a unit of work (i.e. the whole bufferProvider block) is the essential feature that SerialAsyncQueue provides. I think that's what people want from actors but that's not what they're for.

2 Likes

One of the properties I have found absolutely necessary is the ability to add the async work to a queue synchronously. We had this property with dispatch queues, but async functions (in the general case) can't do this.
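For illustration, one way to approximate that property today is AsyncStream: its continuation's yield(_:) is a synchronous call and the stream buffers elements in FIFO order. A rough sketch, with illustrative names:

```swift
// Sketch: synchronous, order-preserving enqueue via AsyncStream.
// `Continuation.yield(_:)` is synchronous and the stream buffers in FIFO
// order, so call order is execution order even when scheduling from
// non-async code. Names are illustrative.
typealias Job = @Sendable () async -> Void

final class FireAndForgetQueue: Sendable {
    private let jobs: AsyncStream<Job>.Continuation

    init() {
        let (stream, continuation) = AsyncStream.makeStream(of: Job.self)
        self.jobs = continuation
        // Single drain loop: each job finishes before the next one starts.
        Task {
            for await job in stream { await job() }
        }
    }

    /// Synchronous, so it can be called from non-async code; jobs run
    /// serially in the order they were scheduled.
    func schedule(_ work: @escaping Job) {
        jobs.yield(work)
    }
}
```

Nothing here hands a result back to the caller, which is why it suits the "fire and forget" case; returning results is where the Task-returning designs discussed later in the thread come in.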

7 Likes

One of the properties I have found absolutely necessary is the ability to add the async work to a queue synchronously. We had this property with dispatch queues, but async functions (in the general case) can't do this.

Especially for "fire and forget" situations

We will get the enqueue order guarantees with Closure isolation control

Nitpicking but this is a property of task creation, not async functions per se.

So for this part there's a clear path forward. It's not complete FIFO, but it's a lot of what people actually need in practice.

1 Like

One of the properties I have found absolutely necessary is the ability to add the async work to a queue synchronously. We had this property with dispatch queues, but async functions (in the general case) can't do this.

Me too, which is why I abandoned actors as the protection for my SerialAsyncQueue's internal state. By using an actor, entering the critical section to add work to the queue requires an await, which requires an async context. If you're not already in one, you have to spawn a Task first, and that messes with the order in which things get added to the queue. So I switched to Darwin locks as the protectors of the queue state.

However, thinking about this some more, I might have over-emphasized the problem in my mind. Say I'm in a non-async function and want to schedule two work items, knowing that the first item is scheduled ahead of the second item:

func scheduleSomeWork() {
  // I want the first task to run before the second task.
  queue.schedule { try? await Task.sleep(nanoseconds: 500_000); print("First Work Item") }
  queue.schedule { try? await Task.sleep(nanoseconds: 500_000); print("Second Work Item") }
}

If the schedule call is async, I'd have to wrap these calls in Task first:

func scheduleSomeWork() {
  // Now it's indeterminate which one is scheduled first
  Task { await queue.schedule { try? await Task.sleep(nanoseconds: 500_000); print("First Work Item") } }
  Task { await queue.schedule { try? await Task.sleep(nanoseconds: 500_000); print("Second Work Item") } }
}

But the "obvious" solution here is to spawn only one Task:

func scheduleSomeWork() {
  Task {
    await queue.schedule { try? await Task.sleep(nanoseconds: 500_000); print("First Work Item") }
    await queue.schedule { try? await Task.sleep(nanoseconds: 500_000); print("Second Work Item") }
  }
}

Thinking this through, I think I remember why I couldn't get this to work. I wasn't thinking about the signature of schedule properly. I started with this:

final class SerialAsyncQueue: Sendable {
  private actor State { 
    // Holds the array of scheduled blocks, etc.
  }

  private let _state = State()

  // Await this to wait for the scheduled work to run on the queue and return its result back to you
  func schedule<R>(_ work: () async throws -> R) async throws -> R
}

The above code with two await schedules in a sequence then means the second item is not submitted until after the first item completes, because awaiting the schedule waits for the result of the work. The only way to schedule a second block before the first block completes is to wrap each schedule in an individual Task, but that also destroys the ordering of the scheduling.

So I refactored to this:

final class SerialAsyncQueue: Sendable {
  private struct State { 
    // Holds the array of scheduled blocks, etc.
  }

  private let _state: Synchronized<State> // `Synchronized` protects its value with a Darwin read-write lock and is `@unchecked Sendable`

  // Await the returned `Task` to wait for the scheduled work to run on the queue and return its result back to you
  @discardableResult
  func schedule<R>(_ work: () async throws -> R) -> Task<R, any Error>
}

Now, you can schedule and not await. You get back a Task you can then await .value on if you want to wait for the scheduled work to complete, and get its result. If you don't care when it completes you just discard the returned Task.

But I missed what I'm now thinking I should have done, which didn't occur to me because it involves two levels of asynchrony and results in an interface that first seems awkward, but on closer consideration is what you actually want:

final class SerialAsyncQueue: Sendable {
  private actor State { 
    // Holds the array of scheduled blocks, etc.
  }

  private let _state = State()

  // Await this to wait for the scheduled work to get *scheduled* on the queue.  If you want to wait for the work to complete, you must additionally `await` the `Task`.
  @discardableResult
  func schedule<R>(_ work: () async throws -> R) async -> Task<R, any Error>
}

Here, it's an async func that returns a Task. I was originally thinking you'd have to write await twice to wait for the result but, similar to try, you don't: an expression only needs it once. So from a synchronous context, to schedule two tasks where one is only scheduled after the other completes:

Task {
  let _ = try? await queue.schedule { ... }.value
  await queue.schedule { ... }
}

But to schedule two tasks immediately but with a guaranteed order:

Task {
  await queue.schedule { ... } // No `.value` at the end
  await queue.schedule { ... }
}

This will be weird with work that doesn't return anything, because to wait for it to complete you need to await .value and discard it. This is a more general issue of it being awkward to await a Task<Void, ...> (I define an extension called something like waitToComplete() for this purpose).

So it's a tricky interface, and it might fool people into thinking await queue.schedule (especially for Void-returning work) waits for the work to be executed, when really it's just waiting for it to be scheduled. But that's how you get what you want: a serial queue built fully out of Swift concurrency, using an actor to protect state, that guarantees work is queued in the same order it is submitted.
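To sketch what that implementation could look like — simplified so the queue itself is the actor rather than a class wrapping a private State actor, which makes the await on schedule exactly the actor hop that performs the enqueue:

```swift
// Sketch of the "two levels of asynchrony" interface. Awaiting `schedule`
// waits only for the enqueue (an actor hop); awaiting the returned Task's
// `.value` waits for the work itself. Names are illustrative.
actor SerialAsyncQueue {
    private var pending: [@Sendable () async -> Void] = []
    private var draining = false

    @discardableResult
    func schedule<R: Sendable>(
        _ work: @escaping @Sendable () async throws -> R
    ) -> Task<R, any Error> {
        // Bridge the job's result to the returned Task through a one-shot stream.
        let (results, continuation) = AsyncThrowingStream.makeStream(of: R.self)
        pending.append {
            do {
                continuation.yield(try await work())
                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
        drainIfNeeded()
        return Task {
            for try await result in results { return result }
            throw CancellationError()  // stream finished without a value
        }
    }

    private func drainIfNeeded() {
        guard !draining else { return }
        draining = true
        Task {
            // One job at a time; the actor is free while each job runs.
            while let job = await self.nextJob() { await job() }
        }
    }

    private func nextJob() -> (@Sendable () async -> Void)? {
        if pending.isEmpty { draining = false; return nil }
        return pending.removeFirst()
    }
}
```

Because the append happens inside the actor before schedule returns, two sequential await queue.schedule calls from one task are guaranteed to enqueue in that order, which is the property described above.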

In fact, once you build this, you can build a non-async flavor of schedule as an extension that returns a Task which internally flattens the nesting, so you can immediately (synchronously) get back a Task handle that can be cancelled (it will either cancel the "real" inner Task received back from the queue, or, if it gets cancelled before that inner Task is created, just skip scheduling it at all). However, that might be a bad idea, because you can't call it multiple times in a synchronous context and expect strict ordering. You could instead make an extension that accepts an array of blocks and guarantees they are scheduled in order.

Now I need to go try refactoring my queue library (I mentioned elsewhere I'm working on getting it ready for publication) to see if this actually works!

OK, so this seemingly simple problem: "how can I write a test to demonstrate actor re-entrance" has a lot of layers to unpack.

My understanding from your responses so far is:

  • it is very hard - perhaps impossible - to write tests that give absolute confidence that actor re-entrance doesn't cause unexpected results
  • actors are not a silver bullet, they might not be the right solution for the problem at hand
  • you need to design your code/solution in such a way that it can handle re-entrance

In my particular case, the risk I want to prevent involves two commands affecting the same entity: one KOs the entity, the other moves it. If the KO command is supposed to be executed first, I don't want the move command to 'revive' the entity if some form of interleaving takes place.

The design I have in my head to cope with this is:

  1. Players can only queue commands in the world (the actual actor)
  2. Only one thread is responsible for updating the world every 'tick'. The update does two things in sequence:
    a. run every command in the queue in sequence (taking the results of the previous command into account)
    b. update every entity (health/visibility/...)
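To make that concrete, here is a sketch of the design with illustrative types. The important property is that update() contains no await, so each tick runs as one uninterruptible critical section on the actor, and a move after a KO always sees the KO'd state:

```swift
// Illustrative sketch of the tick design above. `update()` contains no
// `await`, so each tick runs as one uninterruptible critical section on
// the actor; re-entrance can only happen *between* ticks.
struct Entity {
    var id: Int
    var position: Int
    var isKnockedOut = false
}

enum Command {
    case knockOut(entityID: Int)
    case move(entityID: Int, to: Int)
}

actor World {
    private var entities: [Int: Entity] = [:]
    private var commandQueue: [Command] = []

    func add(_ entity: Entity) { entities[entity.id] = entity }
    func entity(_ id: Int) -> Entity? { entities[id] }

    // 1. Players can only queue commands.
    func enqueue(_ command: Command) { commandQueue.append(command) }

    // 2. One caller runs this once per tick.
    func update() {
        // a. Run every queued command in sequence; each sees the results
        //    of the previous one, because nothing suspends in between.
        for command in commandQueue { apply(command) }
        commandQueue.removeAll()
        // b. Per-entity upkeep (health/visibility/...) would go here.
    }

    private func apply(_ command: Command) {
        switch command {
        case .knockOut(let id):
            entities[id]?.isKnockedOut = true
        case .move(let id, let to):
            // A KO'd entity ignores moves: the KO executed first, so the
            // move cannot "revive" or relocate it.
            guard let entity = entities[id], !entity.isKnockedOut else { return }
            entities[id]?.position = to
        }
    }
}
```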

The risks I see are twofold:

  1. this design is not working as intended, i.e. it's the wrong solution.
  2. the design is good, but the implementation stinks.

So to refine the question some more, without tests, how can I mitigate these risks?
In terms of design, perhaps standard patterns will help. But then, how to validate the implementation?

I think the guarantees you are looking for (w.r.t. ordering of events) need to be addressed on an algorithmic level, not on the (low) level of concurrency building blocks (i.e., actors).

Lamport's Time, Clocks, and the Ordering of Events in a Distributed System springs to mind. Maybe Co-simulation can also offer some ideas.
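For a flavor of what Lamport's scheme amounts to, the core is just a counter and two rules — increment on every local event, and on receipt advance past the received stamp:

```swift
// A minimal Lamport clock (after "Time, Clocks, and the Ordering of Events
// in a Distributed System"). The timestamps are consistent with causality:
// if event A can influence event B, then stamp(A) < stamp(B).
struct LamportClock {
    private(set) var time: UInt64 = 0

    /// Call for every local event (e.g. a player issuing a command).
    mutating func tick() -> UInt64 {
        time += 1
        return time
    }

    /// Call when a message stamped by another clock arrives.
    mutating func receive(_ remoteTime: UInt64) -> UInt64 {
        time = max(time, remoteTime) + 1
        return time
    }
}

// Example: a command stamped by a player orders before the server event
// that processes it, regardless of wall-clock skew between the two.
var player = LamportClock()
var server = LamportClock()
let sent = player.tick()              // 1
let processed = server.receive(sent)  // max(0, 1) + 1 = 2
precondition(sent < processed)
```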

The world simulation should make use of parallelism (multi-core hardware), I presume, so "Only one thread is responsible for updating the world every 'tick'" sounds like stepping on the brake pedal.

5 Likes

Can you help me understand how?

What I am talking about is the ability to call a function with a no-suspension guarantee. We've established this guarantee is present only when isolation does not change. So if you are calling from one actor into another, you must suspend and cannot make a synchronous call, right?

1 Like

The open-source Zed editor has multi-user editing, which is exactly the same problem as a multiplayer game, so maybe looking at the source can be insightful. I remember the mention of a "Lamport clock" timestamp in one of their data structures, as discussed in one of their videos on YouTube (not sure if it was this one).

1 Like

Because your enqueue is asynchronous, it can provide this guarantee only within the same task. If you need to preserve order while going from a synchronous to an asynchronous context, this won't work.

I've found I need a queue, sometimes, to help me go from sync->async. And I need an async-compatible lock/semaphore to control reentrancy. They are pretty distinct problems, although you can also use an async queue to help with reentrancy if you want.

Thank you for these references! The document as well as the reference to the Zed editor. This also demonstrates that there are existing design patterns to cope with these kinds of challenges.

Yes, the simulation currently steps on the brake (or is only firing on one cylinder). For now that is not an issue yet, so I'll leave it for a later date. This topic is complex enough for me as it is.

BTW, while discussing this with you, my thinking about the problem has evolved. I think the biggest requirement I'm trying to implement is a form of 'transaction isolation'. I care about the sequence of commands, but even more that when a command gets executed, it either executes fully or not at all. And if the state of the world is consistent before the command executes, it remains consistent afterwards. Like the A and C in ACID.
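One nice consequence of the state being value types is that the "fully or not at all" part can fall out of copy semantics: run each command against a copy of the state and commit only on success. A sketch, with illustrative names:

```swift
// Sketch: per-command atomicity via value semantics. A command mutates a
// copy ("draft") of the state; the actor commits the draft only if the
// whole command succeeded, so a thrown error rolls back implicitly.
struct Entity {
    var id: Int
    var health: Int
}

struct WorldState {
    var entities: [Int: Entity] = [:]
}

enum CommandError: Error { case invalidTarget }

actor WorldStore {
    private(set) var state = WorldState()

    func run(_ command: @Sendable (inout WorldState) throws -> Void) rethrows {
        var draft = state    // cheap copy thanks to copy-on-write
        try command(&draft)  // may mutate the draft partially, then throw
        state = draft        // reached only if the whole command succeeded
    }
}
```

This doesn't give you the I in ACID (it says nothing about commands interleaving), but within a single-threaded tick design, A and C are exactly what's needed.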

Still would be great to find a way of reliably testing this. I assume database vendors need to do so too.

1 Like

If you're in a synchronous context, and you want to submit something to the queue, then do some more synchronous work, then submit something else, with guarantee the second submission is scheduled after the first, would this work?

func inSynchronousContext() {
  // Synchronous stuff

  let scheduleFirstWork = Task {
    await queue.schedule { ... }
  } // Notice lack of .value

  // More synchronous stuff

  let scheduleSecondWork = Task {
    // Because the block above didn't `await` the `.value` of `queue.schedule`,
    // this awaits only the enqueueing of the first work, not its completion
    _ = await scheduleFirstWork.value
    await queue.schedule { ... }
  }

  // Final synchronous stuff
}

This actually raises the question of why you need two tasks. There's no synchronization between the first work getting scheduled and the // More synchronous stuff code, but why would there need to be? What reason is there to wait until something is scheduled on the queue, except to ensure something else you schedule is done in the right order, which requires going back to an async context anyway? So you might as well just move the second schedule up to be inside the first Task. Enqueueing order is either about synchronizing two synchronous calls or two async calls, never one with the other.

Maybe you need to drill down, or up, a sync call stack before scheduling the second work, so then you just need to pass the first Task along so you can await it before scheduling again.

I started off with your perspective that I need synchronization between code in async land and code in non-async land, but now I'm arguing with myself and I can't win the argument. Every time I try to think of a scenario this won't support I knock down my own case.

I now can't think of why synchronous code would care when something has finished being scheduled, it only cares that things are scheduled in order.