Reliably testing code that adopts Swift Concurrency?

stephencelis · May 30, 2022, 6:20pm

Thanks for spending the time to go deep on this and report back! In addition to @mbrandonw's point about losing some testability, I'd like to share some thoughts on a few of your points:

The issue I'm trying to demonstrate here is subtle, but I don't think it has anything to do with AsyncSequence specifically and really does affect all async-await code more broadly.

I agree that a simple async function with no dependencies can be easy to test. But I also think a simple async sequence should be easy to test. After all, testing code that interacts with a plain ole Swift Sequence isn't much harder than testing code that interacts with another kind of value.

What I'm trying to point out is that as async-await code interacts with other code it quickly becomes impossible to reliably test due to our inability to control (or reason about) how work will be scheduled.

While it's true that the demo app I posted uses an AsyncSequence, the problem it demonstrates really has nothing to do with that construct, and really does apply to async-await code more generally. In fact, the task endpoint could be refactored to suspend for a single unit of work (with no AsyncSequence in sight) and the problem would still exist.

Unless I'm mistaken, NotificationCenter is predictable and totally controllable when you use its non-async-await interfaces (the Combine- and subscription-equivalent code demonstrates this), and so the only real issue here is the use of async-await, not NotificationCenter.

Unfortunately a lot of the discussion here has gotten lost in the weeds about the specifics of Notification Center when I want to focus on the general problem of testing async-await, so maybe I ought to change the example or come up with another one.

While testing the full view lifecycle is an admirable goal, I again think this is straying the topic from its focus on the testability of async-await. (Though I do think that testing a view model directly is a reasonable lightweight alternative to full UI testing or UIWindow-hosting.)

babbage:

I want to emphasise the first point. In the thread I noticed there are multiple references to testing ‘async/await code’. This is quite a misnomer here. The complexity here arises primary from using an AsyncSequence to respond to the notifications over time. With XCTest’s support for async test methods, it is trivally simple to create linear deterministic tests that await the results of an async method and then continue. The reason why this code is difficult to test pivots on the fact that you cannot await on an AsyncSequence in a test in the usual way as the test will never proceed forward to the subsequent waitForExpectations . It is key that the whole discussion in the thread above needs to be understood in that framework in my view—the discussion was mostly relevant to testing code that uses AsyncSequence not to testing async-await code in general. :)

I mentioned it above, but also to reiterate, the problem does not really have anything to do with AsyncSequence. While a single async function with no dependencies is relatively easy to test, a system built in async-await with many moving parts (that may or may not include async sequences) is what becomes completely untestable very quickly.

Doesn't it seem like Apple's Async Algorithms project intends to succeed Combine? If so, wouldn't it also be case that folks should always reach for the appropriate AsyncSequence over a Combine publisher alternative?

I do want to say thanks again for taking the considerable time to explore, but hope my points make sense and also help emphasize the thorniness of the problem!

babbage · May 30, 2022, 7:36pm

If you have a task that suspends with a single unit of work, you can just await it in an async XCTest and then check your test expectations when it returns. So clearly I'm missing something here. Can you explain the problem you're anticipating?

stephencelis · May 30, 2022, 7:47pm

The problem is the one where the post(notification) happens in the test before the await in the view model starts to perform its work (that work is currently an async sequence, but even if it weren't the problem remains).

babbage · May 30, 2022, 7:51pm

I guess perhaps this is 'the problem' being referred to. I think it hinges in that case on what reliably test means too. By definition when we dispatch an asynchronous task we have less control over when it will be scheduled. I'd argue however that through the use of async-await we have far more ability now to reason about how (more than precisely when) work will be completed, and in what order. Meanwhile, the ability to await functions within our XCTests means we still have control over the order of execution of tasks—other than in cases like AsyncSequence where we cannot just await the function to return. (In this regard, in a funny kind of way, AsyncSequence operates much more like asynchronous code that uses callbacks or completion handlers than async-await code.)

To grapple with this question, I think that it would be necessary to look at specific examples of where async-await code makes code untestable, or less testable than previous or current alternatives. Large applications with asynchronous execution are going to be complex, no matter how they are architected. That's why we want to break them down into specifically testable components of course. So far for me, async-await has been helping with that, rather than creating issues.

babbage · May 30, 2022, 7:57pm

This problem does not occur in my testB above, which tests that code. [Edit, later: I'm wrong here!]

babbage · May 30, 2022, 8:17pm

This refactoring of the test is logically flawed I'm afraid. You've indicated that the requirements of the feature are to drop the first item in the sequence, which indicates that on the first occasion, the action taken by the AsyncSequence should not be called. However, to test this, you have called defaultScreenshotAction() and then asserted the result should be that the model count is not iterated.

Test A is designed to test that when that action is called, it increments the model count. For your modified requirements, the test you needed to modify is a version of Test B: that when the notification is posted, what action is taken. What that refactored test should have done is insert a Mock for defaultScreenshotAction that counts when it is called. You should be issuing a series of screenshot notifications, and asserting that after the first, the defaultScreenshotAction was not called, and that after subsequent occasions it was called.

What you would find is that this test, like the ones I provided, was 100% deterministic and would demonstrate in fact that the proposed refactoring of the test code made it extremely easy to test your new requirements.

I'm really unclear why you think the move made the code untestable, because I literally provided tests for it. :) However, I can see why you might want the logic preserved in the ViewModel itself. There are not fundamental reasons why the default methods could not have been attached to/owned by the ViewModel, if a very slightly different approach was taken. The key issue is, given the original example had multiple separate functions/processes that led on to each other, some refactoring was desirable so that one component at a time could be replaced by injected test methods, in order to make the production code testable. (The reason I didn't proceed further down the path of trying to put the injected methods in the ViewModel itself is that the production code increments the ViewModel count, it seems it would have be a mutating method if on the ViewModel itself, and I have not got my head around calling mutating methods asynchronously—I'm not sure it is even possible. It may be that the problem was created because I asserted that the ViewModel.count should only be updated on the MainActor however, or perhaps due to the difference in the function being injected rather than hard coded on the type. Can't recall.)

PS. Clearly, if your requirements were that the first screenshot should be ignored, defaultScreenshotAction() should be renamed to defaultActionForScreenshotWeWantToProcess() or a pithier version. But that is a naming issue, not an issue with the refactored design.

stephencelis · May 30, 2022, 9:21pm

Just to make it more clear that the problem is code that interacts with async-await and not AsyncSequence, here's a quick refactoring of my original example that uses a continuation instead of a sequence:

class ViewModel: ObservableObject {
  @Published var didPost = false
  var continuation: UnsafeContinuation<Void, Never>?

  @MainActor
  func task() async {
    await withUnsafeContinuation { self.continuation = $0 }
    self.didPost = true
  }

  func post() {
    self.continuation?.resume()
  }
}

class ViewModelTests: XCTestCase {
  @MainActor
  func testBasics() async throws {
    let vm = ViewModel()
    let task = Task { await vm.task() }

    XCTAssertFalse(vm.didPost)

    // Give the task an opportunity to start executing its work.
    await Task.yield()

    vm.post()

    // Give the task an opportunity to update the view model.
    await Task.yield()

    XCTAssertTrue(vm.didPost)

    task.cancel()
    await task.value
  }
}

While it passes most of the time, when run repeatedly it will not (10,000 iterations provides ample opportunity for it to fail).

I'm not quite sure I understand your suggestion here, as you mentioned this about testB earlier in this thread:

I think you're seeing the same nondeterminism this entire post is about...and without an AsyncSequence

Even though your test appears to succeed when run individually, there is no guarantee it will.

mbrandonw · May 30, 2022, 9:27pm

Your testB does not currently pass:

Asynchronous wait failed: Exceeded timeout of 10 seconds, with unfulfilled expectations: "screenshotAction called."

And this failure is because of an issue you noted in your original post:

The problem you were seeing is not just a fluke, but rather is the crux of why this topic is being discussed. In your code your notification is posted before it is subscribed to causing the test to fail. It's not due to NotificationCenter not playing nicely with a multithreaded environments or XCTestExpectation. It's because this kind of asynchonous code is very difficult to test with the tools provided today.

I agree there are tests, but they aren't testing code that is run in production. They test that when a particular function is called a number is incremented, but that function isn't what is used when the app is run in the simulator.

In short, your tests are testing the mocks, Stephen's tests are testing the logic in the view model.

This can be seen by looking at your view model:

class ViewModel: ObservableObject {
    @Published @MainActor var count = 0
}

This has no logic in it whatsoever, and so to test the logic one has to supply these secondary objects (what you call actions). If you use defaultOnDisplay in tests you will get test failures (like in testB) and if you use defaultRespondToScreenshot you will not actually be testing real application code (like in testA).

babbage · May 30, 2022, 9:46pm

OK. Maybe. If so, interesting. Though this explanation doesn't explain why the test always passes when run once, because then it should always fail. That wasn't my observation.

When run repeatedly, the failure observed is:

"API violation - multiple calls made to -[XCTestExpectation fulfill] for screenshotAction called.."

I didn't dig any deeper into why. If it is the notification being broadcast too early, perhaps one of the repeat tests is somehow getting the notification from a preceding or subsequent test run, with bleed between the runs? Various posts elsewhere suggest holding weak references to XCTestExpectations etc but that didn't immediately resolve the issue and I didn't dig any further. It is yet to be demonstrated here I suggest that this has anything to do with async-await... this could be equally observed using XCTestExpectation with completion handler code or elsewhere?

This part isn't correct. The extension ContentView.Actions is part of the production code. They are also the actions that are automatically injected into the app when run both in the simulator (not for testing) and would be for release builds.

Furthermore, if you wanted to guarantee that the default actions provided were always and only the ones that could be used in production, this is quite simple to implement. It's fine to discuss whether you only want the functions to be on the ViewModel type but having the default functions elsewhere in production code doesn't mean they are mocks or that you are not testing production code.

If you've tried to cut and paste my code, perhaps you've got a partial mix of my code and Stephen's. Here's a repo: GitHub - babbage/ScreenshotWatcher: A demonstration of testing uncertainty in Swift Concurrency

babbage · May 30, 2022, 9:56pm

This bit I can't argue with, though see discussion above re the particular failure observed.

babbage · May 30, 2022, 10:03pm

This is a great example. I agree that there may not currently be any good way to test withUnsafeContinuation. As per the docs, when you call resume() the function returns immediately (note post() is not async) while the the task continues executing when its executor schedules it. There is explicitly no way to await it... though it is part of the wider Swift Concurrency ecosystem we're grappling with. I haven't found occasion to use withUnsafeContinuation yet, but wonder if it is mostly useful for bridging between legacy code and async code, since the subheading in the docs is "A mechanism to interface between synchronous and asynchronous code, without correctness checking."

I will definitely add withUnsafeContinuation to my list of hard to test async code that currently contains AsyncSequence.

mbrandonw · May 30, 2022, 10:16pm

I'm seeing the same results with your GitHub project. If you run it enough times manually (not using Xcode's repeated run feature) you will eventually get a failure. It helps to make the expectation timeout something more reasonable, like half a second. The fact is that your code cannot guarantee that the for await ... loop starts before the notification is posted, and so you will eventually have a failure.

The reason you are seeing the "API violation" error when running repeatedly is that you are not tearing down the for await ... when the test completes like @stephencelis's test does, and so the first test is receiving the notification when the second test sends it. If you fix that then it will be even easier to see the test failure by using the "Run repeatedly..." feature of Xcode.

I feel that this conversation has deviated quite a bit from the original topic of this thread. @babbage, would you want to DM me and Stephen to continue this conversation without overloading this topic?

Some interesting ideas have been shared, in particular using AsyncStreams in place of AsyncSequence has helped Stephen and me get past a few hurdles due to its buffering behavior. We'd love to get back on topic and discuss other ideas for testing async/await code, especially around serial executors if that is fruitful.

ptomaselli · June 1, 2022, 12:51pm

It was hinted at upthread, but would a controllable/serial executor even be useful for this kind of focused testing if it does not match the execution behavior of most production code (where, unless I misunderstand, execution is usually not serial nor FIFO)?

Joe_Groff · June 1, 2022, 4:49pm

I think this mindset is part of why you're getting frustrated. How work will be scheduled is an implementation detail, and robust async code should be agnostic to how it is scheduled. That is, after all, the ultimate goal of concurrency support features, to allow the system to make more efficient use of it CPUs executing independent parts of code when they're ready to run. Nondeterminism is the whole point of the feature, and if you try to hammer it out, you aren't really testing reality. How the code is scheduled isn't normally part of its contract, and you shouldn't usually be trying to test it. We can look at how libraries and frameworks can encourage code designs that allow for explicit interleaving of test setup and assertion with application code, but trying to set up an artificial controlled environment to brute-force tests is going to lead to tests that are too specific and too brittle and don't accurately reflect the actual environment async code runs in, as @ptomaselli noted.

mbrandonw · June 1, 2022, 5:08pm

For the example @stephencelis posted, and many of the simple situations we come across in day-to-day development, I think it's definitely useful.

There are plenty of times we temporarily ignore the non-determinism of reality so that we can test for a majority use case. For example, have you ever written a test for code that involved a merge of publishers?

publisherA.merge(with: publisherB)

In production you will never know if A or B will emit first, but that's no reason to not write a test. It's still nice to be able to use an immediate scheduler or test scheduler to wiggle ourselves into the subtle edge cases of this code.

I understand that having concurrency so deeply ingrained in the system means that a lot more things are non-deterministic, and so is a bit different from Combine, but there are still plenty of use cases.

The original code posted in this thread is merely a prototype of a much more common pattern in application: a view appears, we start subscribing to events, sometime later the user does something to trigger one of those events, and then we perform some logic. That is a completely serial, deterministic process that is difficult to test with the tools today.

And for the times that we do not want to serialize execution, and truly do want to test the wild west of concurrent, non-determistic code, we can turn to other tools like property-based testing or fuzz testing. But it would be unfortunate if we had to turn to such complex machinery for testing something so simple.

Perhaps async-await is just too new right now and people haven't yet written enough tests with it, but we feel that this is going to quickly become an issue for anyone writing tests for async code. We have an extensive suite of case studies and demos in our libraries involving complex, long-living effects, and nearly every one of them runs into this issue when writing tests.

Karl · June 1, 2022, 5:35pm

I wonder if it would be possible for the compiler/runtime to help test this aspect of concurrent code.

One idea could be this: lots of tests will tend to schedule the same number of Tasks every time. The code is fixed, so the suspend/resume points are fixed, and any data also tends to be fixed. So I'm thinking that we'd call an async function that we want to test, and the runtime would track all the Tasks that are scheduled, from which source locations and in which order, and it would try to generate a couple permutations of how a scheduler might behave (delays, different orders, etc).

Perhaps it would use data such as locations of suspension points, or presence of TaskGroups, to more intelligently stress your code, trying to open up new code paths. Basically a kind of guided fuzzer. Not all async functions would be testable in this way, of course, but I wonder if enough of them are for it to be useful at detecting bugs.

I expect this is an area of active research. It's quite exciting that Swift can be a part of that, especially because it brings things like structured concurrency and actors which may enable new techniques.

Joe_Groff · June 1, 2022, 6:06pm

mbrandonw:

There are plenty of times we temporarily ignore the non-determinism of reality so that we can test for a majority use case. For example, have you ever written a test for code that involved a merge of publishers?
publisherA.merge(with: publisherB)
In production you will never know if A or B will emit first, but that's no reason to not write a test.

At the same time, it isn't a reason to write a test with artificial assumptions about ordering. It would be inappropriate to assert that the merged sequence is exactly equal to some specific interleaving of As and Bs, but you could assert instead that the merged sequence contains the set of expected As and Bs (and perhaps that the As and Bs are in the right relative order to each other).

It would be nice to see some more complete examples of what you're trying to test, if you're able to provide them. To me it doesn't seem too unreasonable to inject an AsyncChannel for ad-hoc sequencing, and using AsyncStreams and other AsyncSequence implementations to structure your code gives you similar benefits to what you get with Combine, since you naturally factor your code into separable transforms that can be set up with artificial inputs and test artificial outputs by unit tests.

Maybe we need to make that machinery seem less complex, since property testing is probably overall a better fit for exercising concurrent systems. As @Karl noted, it would if anything be nice for the runtime to make itself more nondeterministic than it would normally have to be, to prevent implicit assumptions from taking root in the code.

mbrandonw · June 1, 2022, 6:18pm

Yeah, we are currently experimenting with some tools that help with this, and it's promising so far.

We will share some soon.

Honestly, the AsyncStream has solved the biggest problems for us since it buffers its elements until someone subscribes. That fixes the issue of an event occurring before a subscription. We're just not sure if there are other subtleties lurking in the shadows.

There's a chance that serial executors are not necessary. Now the only time we sprinkle in Task.yields is around time-based scheduling due to having to interface with Combine and schedulers. Hopefully a proper, controllable TestClock will help with that just as our TestScheduler helps with Combine code.

ptomaselli · June 1, 2022, 6:18pm

Genuinely asking, is this actually true? From what I can tell, the “problem” with the original code is that ViewModel is not actually “ready” once it has been initialized. There is additional asynchronous preparation required before it can actually start doing its job, and this preparation is potentially racing with the delivery of the events it is supposed to handle. But I might be misreading the code, despite having re-read this thread about 50 times at this point ?

I don’t completely follow the Combine analogy, but w/r/t serial executors specifically, it seems like being able to “inject” a serial executor here would allow a completely reliable test to be written for code that, in production, produces a bug one out of every thousand times (or whatever). Arguably a worse outcome than what we have already, imo. At least the current flakiness reflects reality?

Not that I don’t agree with the overall thrust of this thread — there does seem to be a pervasive Zalgofication lurking here and I don’t think many folks have reckoned with it yet.

cocoaphony · June 1, 2022, 6:19pm

Could you point to some production examples of this approach, perhaps within the async-algorithms package? Most of the tests in async-algorithms seem to go directly against these recommendations, relying heavily on locks, which makes some sense given that its a foundation to build on, but I expect there are some solid examples that are similar to what is expected of non-stdlib developers.