Reliably testing code that adopts Swift Concurrency?

Yes, that's how async/await differs from completion handlers or promise-based implementations built on top of completion handlers.

Yes, but from what I am seeing this is not so ideal on average in medium-to-complex scenarios, as already described by @caiozullo

Hi @Swift_Coder, please see this response:

…to see that that example is still fundamentally flawed.

tl;dr: Spinning up new tasks in a test is a way of indirectly giving the concurrency runtime a little bit of time to process your async code, but there is still nothing guaranteeing that it will execute. And the same would happen if you did a whole bunch of Task.yields or a Task.sleep for a tiny amount of time. So there's nothing special about spinning up an unstructured Task.
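For concreteness, a rough sketch of that flaky pattern (CounterModel is a made-up model whose async start() is supposed to bump count):

import XCTest

final class FlakyTests: XCTestCase {
  func testCounter_flaky() async throws {
    let model = CounterModel()               // hypothetical model doing async work in start()
    let task = Task { await model.start() }

    // None of these guarantee that start() has actually made progress:
    await Task.yield()
    await Task.yield()
    try await Task.sleep(nanoseconds: 10_000_000)

    XCTAssertEqual(model.count, 1)           // may pass or fail depending on scheduling
    task.cancel()
  }
}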

If you want to reliably test async code in Swift, then (as far as we know) this is the only way to do it:

2 Likes

To be clear, forcing everything onto a single thread is not a reliable way to test things either, it only reduces the likelihood of scheduling variability, at the cost of running tests in a very different environment from what the code will run in in production and so reducing the efficacy of the test. It's fine to use that as a stopgap, since perfect is the enemy of the good, but I would caution against looking at it as an ultimate solution. Ultimately, the reliable way to test code involving concurrency is to structure the code so that tests are able to explicitly synchronize with events at the points they need to test. In places where system APIs don't provide reliable guarantees for when important events occur (such as when a NotificationCenter subscription actually starts) or the core language doesn't provide the necessary primitives to synchronize (which is definitely a lot at this point), we still need to fix those APIs and fill in those deficiencies.

7 Likes

I don't think that the underlying problem is the system APIs like Notification Center, but that we have no tooling to reliably progress time and assert on state changes between bodies of async work. If we do not control it, we are always going to have flaky tests where we have to depend on the system scheduler and hope we assert at the correct time. Lack of tooling has put our team off Swift concurrency for a long time. For serious projects, the lack of testing tools is a deal breaker.
I agree that such hacks offered by pointfree are not the desired goal, but it's not like we have alternatives at this point in time, nor are there any being worked on.

Hi @Joe_Groff, thank you for bringing up the caveats. I did not mean to imply that it is a universal solution that is here to stay. In the library we detail the various caveats, and we do agree it is a stopgap until there is a more robust solution provided by Swift.

2 Likes

I don't think that's necessarily true. If you substitute "Tasks" for "threads", and make a similar statement: "if we don't have control of the hardware to reliably progress CPU instructions in a certain way, we are always going to have flaky tests", then that hopefully rings false at first blush. It is possible to build testable concurrent systems out of hardware threads using synchronization, without changing the scheduling environment, and it ought to ultimately be possible with Swift concurrency as well, with the right tools.

4 Likes

There definitely is something a little different between threads and tasks though, so I don't feel this analogy works. This problem is more pernicious with async than threads, dispatch queues, Combine, etc.

And the reason is because async is built directly into the language, and so it seems that only the language can provide tools to make it testable. The moment you write async in your code you are at the mercy of a runtime scheduler and there's nothing you can do about it. That was not true of threads.

Apps using threads (and dispatch queues, Combine, etc.) don't have any of the problems we are discussing in this post. An app may use URLSession.shared.dataTask, and hence incur the vagaries of threading when running on device, but in tests you would provide a configuration that immediately and synchronously provides you data when you ask for it. So, it is very easy to sidestep the background thread that URLSession uses by default and get 100% reliable tests.

So, when using threads, it's not that we want to control the scheduling of threads in order to make the code testable. Instead we can just sidestep the scheduling entirely for tests.
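For example, one common way to do that sidestepping is a URLProtocol stub that serves canned data without touching the network (StubProtocol and stubbedData are just illustrative names):

import Foundation

final class StubProtocol: URLProtocol {
  static var stubbedData = Data(#"{"count": 1}"#.utf8)

  override class func canInit(with request: URLRequest) -> Bool { true }
  override class func canonicalRequest(for request: URLRequest) -> URLRequest { request }

  override func startLoading() {
    // Deliver a canned response right away; no real networking, no background work.
    let response = HTTPURLResponse(
      url: request.url!, statusCode: 200, httpVersion: nil, headerFields: nil
    )!
    client?.urlProtocol(self, didReceive: response, cacheStoragePolicy: .notAllowed)
    client?.urlProtocol(self, didLoad: Self.stubbedData)
    client?.urlProtocolDidFinishLoading(self)
  }

  override func stopLoading() {}
}

// In tests, route every request through the stub:
let configuration = URLSessionConfiguration.ephemeral
configuration.protocolClasses = [StubProtocol.self]
let session = URLSession(configuration: configuration)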

Now it just so happens that our library does control task scheduling to get more reliable tests, but that's just an incidental conflation. We don't care about the scheduling, we just want to reliably test something. And it's the only way we have seen work from our many, many hours of experimentation.

6 Likes

Are there currently any options / APIs that would enable the community to build such tools, or are there improvements to the language, new APIs, or tools coming down the pipeline which would enable that?
I have been a silent follower of this thread for some time now and have seen very little success, except for pointfree's approach, which I agree is not ideal.

I didn't mean to imply that controlling time is the only way to have a testable system, but it is very easy to reason about, especially compared to synchronous code, which is familiar.

If you have multiple threads, you are absolutely at the mercy of a runtime scheduler, and even more pervasively than when using tasks, because you're exposed to thread concurrency whether you're using async or not. So I don't think that's the difference, and it's not so much that those other frameworks don't have these problems, but that they have established solutions to the problems. Because Tasks are built into the language, it is maybe too easy to create a lot of them, and the problems of concurrency become more pronounced.

I think part of the problem I've observed is that we don't yet have a full battery of tools analogous to those other frameworks. With Dispatch, you can submit blocks to a serial queue and know those blocks will run in order, for example, which is something pretty common that can't yet be replicated without external libraries in Swift concurrency; an analogous primitive might be a single Task that runs a loop, with a mailbox that can accept new blocks to run in order, but we don't have that mailbox primitive yet (see the sketch below).

Since Task { } is one of the few APIs we actually have bundled with the standard library, and accessory APIs are less developed, and it's also the easiest way to enter async land if you aren't in it already, I think there's a tendency to overspawn tasks, which by their fundamental nature want to run independently, and then try to claw back ordering among them, when often it would be better to have fewer tasks, running ordered work on a single task, to avoid the scheduling problems altogether.
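A rough sketch of such a mailbox, built from pieces that do exist today (AsyncStream plus a single draining Task); the Mailbox name and shape are made up for illustration:

typealias Job = @Sendable () async -> Void

final class Mailbox {
  private let continuation: AsyncStream<Job>.Continuation
  private let task: Task<Void, Never>

  init() {
    let (stream, continuation) = AsyncStream<Job>.makeStream()
    self.continuation = continuation
    // A single task drains the stream, so jobs run one at a time, in submission order.
    self.task = Task {
      for await job in stream {
        await job()
      }
    }
  }

  func submit(_ job: @escaping Job) {
    continuation.yield(job)
  }

  deinit {
    continuation.finish()
    task.cancel()
  }
}

Submitting work through submit(_:) then behaves much like dispatching onto a serial queue: ordered, one at a time, without touching the scheduler.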

5 Likes

The closest thing I can think of is the swift-async-algorithms package, which contains a lot of primitives that hopefully will eventually become part of the standard library. AsyncChannel in particular can be used to set up probe points for a test to synchronize against or to provide abstraction for the inputs and outputs around a component so that a test event source can be swapped in for testing while using a more concurrent event source in production.
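For instance, a test can use AsyncChannel as a probe: because send(_:) suspends until the value is received, the test is guaranteed to observe the event before it acts (runFeature and the Probe type are made-up names for illustration):

import AsyncAlgorithms
import XCTest

enum Probe: Equatable { case didSubscribeToScreenshots }

// Code under test announces progress through the channel.
func runFeature(probe: AsyncChannel<Probe>) async {
  await probe.send(.didSubscribeToScreenshots)
  // … subscribe to notifications and do the real work …
}

final class ProbeTests: XCTestCase {
  func testFeatureSubscribes() async {
    let probe = AsyncChannel<Probe>()
    let task = Task { await runFeature(probe: probe) }

    var events = probe.makeAsyncIterator()
    let first = await events.next()
    XCTAssertEqual(first, .didSubscribeToScreenshots)
    // Only now is it safe to post the notification, fire the trigger, etc.
    task.cancel()
  }
}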

4 Likes

Any solution which requires completely restructuring the natural implementation is a nonstarter for the vast majority of people. The ideal is a solution that doesn't require any deviation from the natural implementation at all but instead simply allows you to hook into the relevant bits at test time. However, this isn't really achievable, so we make do with minimal interfaces (most often protocols or Point Free's preferred closure interfaces) or other hooks (like URLProtocol) that give just enough control to make testing possible. Point Free has made a lot of progress in making their testing tools practically invisible (and where visible, usually pushes you towards better code structure anyway), but even minimal friction can lower adoption. So asking people to completely restructure their async code, everywhere, in every dependency, lest it not be testable at all, is not realistic. This is a language problem and requires language solutions.

10 Likes

I think I was being imprecise in my language, but when I say that those frameworks do not have the problems I really mean the problems have solutions. So, they aren't really problems in my view. There are ways to “abstract” over the scheduling mechanism so that we don’t have to worry about it in tests.

However, with async there is no solution whatsoever. The moment you use async code you do have a testing problem on your hands that cannot be solved. One cannot “abstract” over async because it is baked directly into the language.

I agree that people reach for unstructured Tasks too often, but that is a separate problem from what we are discussing here. The code Stephen shared is all 100% structured concurrency, no Task whatsoever, and yet it has lots of testing problems. When I say "task" in my previous posts I mean it in the little-"t" manner of some async job submitted to the global executor. I did not mean capital-"T" unstructured Tasks.

It really does just seem quite odd that one would litter their code with these probe points just to eke out a bit of testability. This style of testing is also quite different from the other techniques people typically use to make code testable. For example, putting an interface in front of a dependency so that you can control it in tests certainly "changes" your code, but not fundamentally. At the end of the day you are just passing around an any Client instead of a ConcreteClient, and all the rest of your code remains exactly as it did before you cared about testability.
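Something like this, where all that changes is the type you pass around (Client, ConcreteClient, and Feature are just illustrative names):

protocol Client {
  func fetchCount() async throws -> Int
}

struct ConcreteClient: Client {
  func fetchCount() async throws -> Int {
    // …real networking in production…
    0
  }
}

struct StubClient: Client {
  func fetchCount() async throws -> Int { 42 }   // immediate, deterministic data in tests
}

struct Feature {
  let client: any Client   // the rest of the feature's code is unchanged
}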

But, as @Jon_Shier mentions above, these probe points fundamentally change the structure of your application’s code. When reading this code you have to be able to mentally filter out lines that are meant only for testing from the lines that do the actual logic of your feature.

And it was mentioned earlier in the thread, but we still don't see how probe points solve the problem. Given this probe point:

await progress?.send(.didSubscribeToScreenshots)
for await _ in screenshots {
  …
}

…and this interaction with the probe point in a test:

let firstProgress = await vmProgress.next()
XCTAssertEqual(firstProgress, .didSubscribeToScreenshots)
NotificationCenter.default.post(
  name: UIApplication.userDidTakeScreenshotNotification, object: nil
)

…what guarantee is there that the subscription to screenshots will happen fast enough so that the posted notification doesn’t just go into the void? I don’t think any such guarantees can be given.

Now technically, in iOS 16.(something), notification center’s async sequence started buffering received notifications so that it plays old ones back to you when you subscribe. That means you can actually fix the original problem by eagerly creating the async sequence when the model is created, rather than lazily when the feature appears:

class FeatureModel: ObservableObject {
  @Published var count = 0
  // Created eagerly at init (rather than lazily in onAppear) so that, on iOS 16+,
  // notifications posted before the feature appears are buffered and replayed.
  let screenshots = NotificationCenter.default.notifications(
    named: UIApplication.userDidTakeScreenshotNotification
  )

  @MainActor
  func onAppear() async {
    for await _ in screenshots {
      self.count += 1
    }
  }
}

But this is also strange. We have to be intimately familiar with the implementation details of the specific async sequence to know to do this, and keeping instance variables of sequences is not how one typically deals with them. And if 6 months from now we come across this code, think it’s weird, and decide to inline it, then we will get a flaky test suite again.

I also don’t think it’s always appropriate for an async sequence to buffer its values so that old values are delivered once subscribed to. I can imagine a web socket library would only want to deliver the freshest data and never old data. And that’s not to mention whether or not it is correct to deliver all old emissions upon subscribing, or maybe just the last one, or last few, etc. These are all questions that I don’t think have a universal answer, and so the problem remains even if Notification Center has fixed it for their sequence.

10 Likes

To be clear, I absolutely agree that we need to do language/runtime work here.

It may not look the same, but I see it as being analogous. Using channels or some other communication API is a way to abstract over the communication between concurrent components, just as using an interface is a way to abstract over functions or types. And people do restructure their synchronous code for testability all the time, without really thinking about it, by factoring out functions that can be called and tested separately when they might not otherwise be.

In both cases, factoring out a function or introducing probe points, a similar problem is being addressed, that you can't generally observe values mid-execution in a monolithic function; for synchronous code, that problem is typically confined to the surface of the function, since you expect a function to do something and return, so factoring into smaller functions generally suffices for improving testability, but with asynchrony you are more likely to want to observe the function's behavior mid-execution. Manually communicating through channels is one way to reliably get that effect, manipulating the task scheduler maybe sometimes sort of has that effect. It's fair to say that manually introducing synchronization points for testing is generally too clumsy, but conversely, we also don't want every potential suspension point in every async function to be an implicit global synchronization barrier. Maybe there's some way to annotate particular local variables as being observable during suspension, so that tests can read them?

I don't see "fundamentally changing the structure of your application's code" as necessarily being an absolute dealbreaker—that is a big part of what new languages, new libraries, and new frameworks do compared to their predecessors. The language is part of the solution, but library and API design also play an important role in what the "natural solution" developers write their code in looks like.

3 Likes

:+1: Clear as day, and glad to hear it! We look forward to any future tools.

The probe points are certainly interesting, I just think for nearly every app developer out there it absolutely is a dealbreaker. I really don’t think most folks are willing to pollute their application code with probe points everywhere just to get some test coverage. They’re more likely to simply skip testing that part of their code, or avoid async in the first place in favor of older concurrency models that are more reliable to abstract over.

Having written a bunch of tests for the new backpressure APIs from the AsyncStream proposal, I can only agree that we currently do not have all the right testing tools for this. While we could introduce probe points in the implementation of AsyncStream, that feels very wrong to me and IMO is not something that I would generally want to advise everybody to do.

Comparing tasks to threads is fair, but I think we have one critical advantage here: we do have the scheduler (executor) in user space, and we can hook it. A common example from the tests for AsyncStream boils down to this:

func foo() async {
    let (stream, continuation) = AsyncStream<Int>.makeStream()

    await withTaskGroup(of: Void.self) { group in
        group.addTask { continuation.yield(1) }
        group.addTask { _ = await stream.first { _ in true } }
    }
}

I want to be able to write a deterministic test for this where I can control whether the first child task gets scheduled first or the second gets scheduled first. While I agree that we should not just serialise all scheduling, it would be great if we could have a deterministic executor which we could tell in what order jobs should be run. The only thing missing is a way to identify from which task a job comes.

I could envision something along these lines:

func foo() async {
    let (stream, continuation) = AsyncStream<Int>.makeStream()

    await withTaskGroup(of: Void.self) { group in
        group.addTask(name: "Foo") { continuation.yield(1) }
        group.addTask(name: "Bar") { _ = await stream.first { _ in true } }
    }
}

In my test I could then hook the global executor and say I want to run the jobs in the order ["Foo", "Bar"]. We could make this deterministic scheduler arbitrarily sophisticated, where it could also run jobs in parallel, etc.
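To make that concrete, a purely hypothetical usage sketch (DeterministicExecutor and withGlobalExecutor do not exist today; they only illustrate the idea):

func testYieldBeforeFirst() async {
    let executor = DeterministicExecutor(order: ["Foo", "Bar"])   // made-up type
    await withGlobalExecutor(executor) {                          // made-up hook
        await foo()
    }
    // Assert on the behavior observed for this specific interleaving…
}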

cc @ktoso

12 Likes

While there's always a fuzzy line between library code and "user" code, I can see the need for controlling scheduling in order to fully test "primitives" like that, since you want to ensure that they have the correct behavior in the face of specific possibilities of how the scheduler may act in reality. (And by analogy to threads, people do use debugger scripts to single-step threads in certain orders in order to test the behavior of concurrency primitives in the face of certain interleavings.) As one goes further up the stack, though, you have less overall control over the full set of code that's running, and it seems to me that ideal tests would reflect that lack of control and run in as close to the same general level of concurrency in a production environment as possible. Manipulating scheduling order might be a general tool of last resort but feels like something that shouldn't generally be necessary to test typical code (again acknowledging that the real world layering, if any, of "library" and "user" code is murky in practice).

1 Like

To me, this looks like Task metadata -- like a kind of specialised priority system for a particular serialising executor.

I've wanted this kind of thing for a while (for other reasons, not related to testing). I think we could probably do it with task locals + giving executors the ability to inspect locals for the partial tasks they are enqueueing.
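A rough sketch of the task-local half, which exists today (the executor-side hook to read these values does not):

enum TaskName {
    @TaskLocal static var current: String?
}

func work() async {
    await withTaskGroup(of: Void.self) { group in
        group.addTask {
            await TaskName.$current.withValue("Foo") {
                await Task.yield()   // jobs enqueued from here would carry "Foo",
                                     // if executors could inspect task locals
            }
        }
    }
}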

Why not just:

func foo() async {
  let (stream, continuation) = AsyncStream<Int>.makeStream()

  continuation.yield(1)
  _ = await stream.first { _ in true }
}

Your example removes the possibility that the yield is scheduled after the call to next on the iterator. This works of course, but what I wanted to test is what happens when next is called while the stream has no element buffered versus when it has an element buffered.

This is currently impossible to test deterministically without a sleep, since we can’t “observe” the call to next suspending. Having control over the executor would allow this.

1 Like