A New Approach to Testing in Swift

I think the RSpec vs Minitest schism in Ruby is something we can learn from and should aim to avoid repeating, so supporting RSpec-style grouping seems like a worthwhile goal.

If we allow nesting of @Suite (and perhaps some synonyms like @Context), developers will have the choice to either keep to a single level or use them for grouping. We can easily support both, without dividing into two parallel testing universes.
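
To make the idea concrete, here is a rough sketch of what nested grouping could look like with swift-testing's @Suite and @Test attributes. All the names are hypothetical, and the nesting behavior itself is exactly what's being proposed here, so treat this as an illustration rather than confirmed API behavior:

```swift
import Testing

// Hypothetical sketch: nested @Suite types used purely for grouping.
@Suite struct ProductPageTests {

    @Suite struct LoggedOut {
        @Test func recommendationsAreHidden() {
            // ... arrange a logged-out user and assert ...
        }
    }

    @Suite struct LoggedIn {
        @Test func recommendationsAreShown() {
            // ... arrange a logged-in user and assert ...
        }
    }
}
```

A developer who prefers a flat layout can simply never nest; the grouping only exists for those who opt in.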

I've certainly seen real spaghetti RSpec tests myself, but I think the issues with the alternative are underappreciated. I don't like overly DRYing tests because it hurts readability, but not DRYing them enough also hurts readability.

Imagine there are 4 different things you need to stub/create in each test. Every test might repeat those lines, with minor variations. While any one test might be simple to understand, the test suite as a whole makes you get lost in the sauce. All you see is 16 different test methods, each with almost the same setup, leaving you to play "mental difftool" to fish out which part is the same repeated boilerplate and which part is the key distinguishing precondition of a specific test.

1 Like

In my opinion, that can just as easily be a helper function on the test class. It's also even more explicit about what's going on, and you can choose not to call it for tests that don't need it.
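
For example, a minimal XCTest sketch (all names here, including the stand-in User type, are invented just for illustration):

```swift
import XCTest

// Minimal stand-in type for the sake of the example.
struct User {
    var isLoggedIn: Bool
    var activeFeatureFlags: [String]
}

final class ProductPageTests: XCTestCase {
    // A plain helper on the test class; tests that don't need it simply don't call it.
    private func makeLoggedInUser(featureFlags: [String] = []) -> User {
        User(isLoggedIn: true, activeFeatureFlags: featureFlags)
    }

    func testRecommendationsRequireLogin() {
        let user = makeLoggedInUser(featureFlags: ["recommendations"])
        // ... render the page for `user` and assert on it ...
        XCTAssertTrue(user.isLoggedIn)
    }
}
```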

2 Likes

Heya Joe,

That's my go-to (the main codebase I work on uses Minitest). It works in simple cases where there aren't many kinds of data being wrangled, and where the variation between different test cases' setup logic is easy to express with simple parameters.

On the other hand, I think it can really fall apart for more complicated tests (e.g. integration tests), because it becomes really difficult to express niche/complex variations that can't easily be described with simple Bool, Int, etc. parameters.

Example

Bear with me; I found it tricky to come up with an example, because the kind of pain point I'm trying to describe comes up in some deep, niche part of a complex system, which doesn't lend itself to a simple example.

Here's what I could come up with: imagine an integration test for the product page of a shopping app like Amazon. It might start out simple like so:

// Creates a new user with the given feature flags,
// and makes some mock recommendations for them.
private func sharedSetup(
    userIsLoggedIn: Bool,
    userActiveFeatureFlags: [String],
    recommendedProductCount: Int
) -> User { ... }

First issue: how do we know what the expected results should be, if all the fixtures/mocks were generated in sharedSetup()?

let user = sharedSetup(...) // "Arrange"
let page = renderProductPage(for: product, viewedBy: user) // "Act"
XCTAssertEqual(page.recommendedProducts.map(\.id), /* what goes here? */) // "Assert"
  1. We can hard-code the set of expected product IDs. The hard-coded list in the test expectations needs to match the hard-coded list in the setup. This is brittle, and doesn't allow for variation in the dataset size.

  2. We can return a tuple of (User, recommendedProducts: [Product], ...), but now each test case gets more boilerplate, because it needs to unpack these tuples.

    This scales poorly as more kinds of data get added to the page.

  3. We can save the generated mock data in ivars. sharedSetup() can write to self.mockedRecommendationIDs, then the test can use that as its expected value (see the sketch after this list).

    This is probably the best option, but it introduces temporal coupling and isn't obvious to a reader. (Especially in Ruby, where ivars aren't forward-declared in constructors, so it can be hard to tell where they came from, esp. when mixins start getting involved.)
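
Roughly what option 3 looks like in XCTest, with simplified stand-in types and signatures. Everything here is hypothetical and only meant to illustrate the coupling between the helper's stored state and the assertion:

```swift
import XCTest

// Simplified stand-ins for the hypothetical types in this example.
struct User { var isLoggedIn: Bool; var activeFeatureFlags: [String] }
struct Product { let id: Int }
struct ProductPage { var recommendedProducts: [Product] }
func renderProductPage(from store: [Product], viewedBy user: User) -> ProductPage {
    ProductPage(recommendedProducts: user.isLoggedIn ? store : [])
}

final class ProductPageTests: XCTestCase {
    private var user: User!
    private var mockedProducts: [Product] = []

    // Option 3: the helper records what it generated in ivars,
    // so tests can derive their expected values from the same data.
    private func sharedSetup(userIsLoggedIn: Bool,
                             userActiveFeatureFlags: [String],
                             recommendedProductCount: Int) {
        user = User(isLoggedIn: userIsLoggedIn, activeFeatureFlags: userActiveFeatureFlags)
        mockedProducts = (1...recommendedProductCount).map(Product.init)
    }

    func testRecommendedProductsAreListed() {
        sharedSetup(userIsLoggedIn: true, userActiveFeatureFlags: [], recommendedProductCount: 4)
        let page = renderProductPage(from: mockedProducts, viewedBy: user)
        // The expected value comes from state the helper wrote earlier (the temporal coupling).
        XCTAssertEqual(page.recommendedProducts.map(\.id), mockedProducts.map(\.id))
    }
}
```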

Next, consider some new product requirements being added, and how they start to blow up this design:

  1. Let's say we have a new feature, to show recommended sellers on the product page, in addition to the product recommendations.

    • Now we need to add a recommendedSellerCount parameter and create the seller recommendations, but only if userActiveFeatureFlags includes the new feature flag.

      Should we blow up sharedSetup() with more parameters, or make a new helper specific to tests that enable the new experiment? What if there are multiple experiments that can be toggled independently, and we get a whole bunch of permutations?

  2. Suppose we don't want to show out-of-stock products. Our sharedSetup() function creates the mock products and stores them in our mock data store, but now we want to mark some of them as out-of-stock. The easiest thing to do is to always mock our products with some mix of in-stock and out-of-stock products.

    • Now what if we want to hide the recommendations section entirely, if none of them are in stock?

    That test would need a way to express to the sharedSetup() helper that "all the mocked products should be made out-of-stock". Yet another Bool parameter? Or perhaps the test case can look up the products and mutate them to be out-of-stock.

  3. Suppose there's a new "Prime deals" section that only applies to premium members. userIsLoggedIn being a Bool isn't enough, because now we have 3 states: logged-out/guest, logged-in regular member, logged-in "Prime" member.

    • Perhaps we can rewrite the userIsLoggedIn: Bool parameter to instead be userKind: UserKind, with:

      enum UserKind { case guest, regular, prime }
      

      Now we're starting to create a mini-DSL for expressing our test setup (see the sketch after this list). We've taken one step closer to RSpec, but with none of the standardization or generalizability.

    • Perhaps we make multiple setup methods: sharedSetupGuest(), sharedSetupMember(), sharedSetupPrime(). Now we have yet more boilerplate and repetition.

    And how does this compose with the feature flag problem in point 1? Do we have one test per feature-flag permutation per login type? The Cartesian product gets really big, really fast.
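
To visualize where this ends up, here is the same hypothetical sharedSetup() after a few rounds of new requirements, in the same elided style as the original snippet (every parameter name here is invented for illustration):

```swift
enum UserKind { case guest, regular, prime }

// The "shared" helper has quietly become a mini-DSL of its own:
private func sharedSetup(
    userKind: UserKind,
    userActiveFeatureFlags: [String],
    recommendedProductCount: Int,
    recommendedSellerCount: Int,   // only meaningful if the seller flag is on
    allProductsOutOfStock: Bool,   // for the "hide the section entirely" test
    includePrimeDeals: Bool        // only meaningful for .prime users
) -> User { ... }
```

Each new Bool roughly doubles the number of combinations the helper claims to support, which is the Cartesian-product problem above.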

In general, a shared helper function like this struggles in cases where a new test wants almost the same thing, but with some "deep" data tweaked in some way.

Thanks for the detailed counterpoint!

I think that for each before block you could pull out a specific test helper if needed, so your examples don't need a single helper method but perhaps several. Going to an extreme, a dedicated test object with the sole purpose of test setup has served me well in the past.
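
A rough sketch of that kind of setup-only test object, reusing the hypothetical stand-in types from the sketches above (none of this is prescribed by XCTest; it's just plain OO):

```swift
// A dedicated object whose only job is test setup; each test owns one
// and tweaks exactly the parts it cares about.
struct ProductPageFixture {
    var userKind: UserKind = .regular
    var featureFlags: [String] = []
    var recommendedProducts: [Product] = (1...4).map(Product.init)

    func makeUser() -> User {
        User(isLoggedIn: userKind != .guest, activeFeatureFlags: featureFlags)
    }
}

final class ProductPageFixtureTests: XCTestCase {
    func testHidesRecommendationsWhenNoneInStock() {
        var fixture = ProductPageFixture()
        fixture.recommendedProducts = []   // the "deep" tweak lives in the test itself
        let page = renderProductPage(from: fixture.recommendedProducts, viewedBy: fixture.makeUser())
        XCTAssertTrue(page.recommendedProducts.isEmpty)
    }
}
```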

That said, if BDD-style blocks were optional, I'd have no issue with them. But I'd hate to force folks to use them instead of the OO practices that XCTest and Minitest encourage.

Hey Joe, I think you ran into exactly the point I was trying to make! :sweat_smile:

you could pull out a specific test helper if needed.

So you can't just have one shared setup helper. Each test would start with an "arrange" preamble that calls several different setup helpers, in similar but subtly different ways. This is what I was referring to earlier.

Agree!

Super nice! A final polish would be a "… running for x seconds." message that gets updated together with the animation ;-)

Ah! I had it implemented like that but reverted it; I can bring it back and update the PR and gif :)

1 Like

@hassila it turned out quite nicely when I rounded the elapsed time to hundreds of ms and aligned the output, so that the "S.XYZ seconds." part lines up between rows for running and passed tests.

(animated GIF: nice_output_with_duration_50fps)

The source branch of the PR has been updated.
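
For anyone curious how that kind of alignment can be achieved, it boils down to fixed-precision formatting of the elapsed seconds. A minimal illustrative sketch (not the PR's actual code):

```swift
import Foundation

// Format elapsed time with a fixed number of decimals so the
// "S.XYZ seconds." suffix lines up from row to row.
func formattedDuration(_ seconds: Double) -> String {
    String(format: "%.3f seconds.", seconds)
}

print(formattedDuration(0.05))  // "0.050 seconds."
print(formattedDuration(1.5))   // "1.500 seconds."
```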

15 Likes

It would be nice for the runner (when tests run in parallel) to have a single hook that could initialise the environment needed for the tests, for cases where larger end-to-end integration test cases are used.

Fuzz testing (which property testing can be categorized under) is an area we're very interested in exploring, although it hasn't been our focus yet. It'd be great to hear more about any specific use cases you have, as well as your experience writing these sorts of tests using other libraries. Would you mind starting up a separate thread where we can discuss this topic in more detail? Thanks!

3 Likes

Do you imagine fuzz testing being implemented natively in Swift as part of swift-testing or would it be built on top of LLVM's libFuzzer.dylib? Since libFuzzer is only distributed with the open-source toolchains but not with Xcode, it would be really nice to have a solution that just works out of the box for Swift (whether that means writing one ourselves or getting libFuzzer in Xcode, though the latter is of course out of scope for the Swift project...).

1 Like

:mega: The DocC bundle for swift-testing is now available on Swift Package Index here. Please have a read and feel free to provide feedback by filing issues against the GitHub repo!

10 Likes

We're hoping to implement as much as possible in Swift. For instance, it should be straightforward to provide a replayable RandomNumberGenerator type. We can then build generators, shufflers, etc. atop that RNG such that the same seed value produces the same actions/operations each time it's used.
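
As an illustration of the replayable-RNG idea, here is a generic SplitMix64-style sketch (my own example, not swift-testing API):

```swift
// A seedable, replayable RNG: the same seed always produces the same
// sequence, so a failing randomized test can be reproduced exactly.
struct ReplayableGenerator: RandomNumberGenerator {
    private var state: UInt64

    init(seed: UInt64) {
        self.state = seed
    }

    // SplitMix64 step.
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

// Usage: record the seed alongside a failure so the exact run can be replayed.
var rng = ReplayableGenerator(seed: 0xDEADBEEF)
let operations = (0..<5).map { _ in Int.random(in: 0..<100, using: &rng) }
```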

My understanding of libFuzzer is that it feeds random data to the same function repeatedly, which sounds a lot like a parameterized test taking random values to me! So I don't think libFuzzer would be necessary to build basic fuzzing functionality into the library, but there may be deeper functionality in libFuzzer that I'm not aware of.

4 Likes

IIUC the "corpus" feature and its built-in minimization tools distinguish libFuzzer from what SwiftCheck or any other simple RNG would do:

Coverage-guided fuzzers like libFuzzer rely on a corpus of sample inputs for the code under test. This corpus should ideally be seeded with a varied collection of valid and invalid inputs for the code under test; for example, for a graphics library the initial corpus might hold a variety of different small PNG/JPG/GIF files. The fuzzer generates random mutations based around the sample inputs in the current corpus. If a mutation triggers execution of a previously-uncovered path in the code under test, then that mutation is saved to the corpus for future variations.
[...]
The corpus can also act as a sanity/regression check, to confirm that the fuzzing entrypoint still works and that all of the sample inputs run through the code under test without problems.

If you have a large corpus (either generated by fuzzing or acquired by other means) you may want to minimize it while still preserving the full coverage. One way to do that is to use the -merge=1 flag: [...]

4 Likes

Right, seeding from a corpus (and the tooling to build and maintain a good corpus) are absolutely critical to getting useful information out of fuzzers for non-trivial operations.

1 Like

Out of interest and curiosity I've filed this issue against the framework to explore implementations of these concepts. I've also named a few testing patterns/techniques that would benefit from more complex machinery.

8 Likes

This is really nice. I'd be interested in how much of a performance impact such dynamic output has on overall test time (when CPU-bound rather than sleeping). Different terminals have different performance characteristics here. For instance, Terminal on macOS shares a process between tabs, meaning the performance of one tab can affect the others, or at least the output to the other tabs.

1 Like

That's fair enough, and not something I was aware of in libFuzzer! I do think it'd be possible to implement something similar in swift-testing without needing to link to libFuzzer.

I don't know that there is huge value in rewriting libFuzzer instead of just using it. It isn't even that hard to use from Swift already.

Passing -[f]sanitize=address to both swiftc and ld, plus using @_silgen_name("LLVMFuzzerRunDriver"), gets you most of the way there; the rest of the work is designing a nice API over libFuzzer for Swift.

@codafi has a sketch of this somewhere you may be able to get started with.
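
To give a rough idea of the shape, here is my own illustrative sketch (not the one mentioned above). The Swift-side type mappings for the hand-declared entry point are a best guess against libFuzzer's documented C signature, and the build/link details are as described in the post:

```swift
// Hand-declared binding to libFuzzer's driver entry point
// (see "Using libFuzzer as a library" in the LLVM docs).
@_silgen_name("LLVMFuzzerRunDriver")
func LLVMFuzzerRunDriver(
    _ argc: UnsafeMutablePointer<CInt>,
    _ argv: UnsafeMutablePointer<UnsafeMutablePointer<UnsafeMutablePointer<CChar>?>?>,
    _ testOneInput: @convention(c) (UnsafePointer<UInt8>?, Int) -> CInt
) -> CInt

// The fuzz target that libFuzzer calls repeatedly with mutated inputs.
func fuzzOneInput(_ data: UnsafePointer<UInt8>?, _ count: Int) -> CInt {
    guard let data = data, count > 0 else { return 0 }
    let input = [UInt8](UnsafeBufferPointer(start: data, count: count))
    // ... feed `input` to the code under test; crash or trap on failure ...
    _ = input
    return 0
}

var argc = CommandLine.argc
var argv: UnsafeMutablePointer<UnsafeMutablePointer<CChar>?>? = CommandLine.unsafeArgv
_ = LLVMFuzzerRunDriver(&argc, &argv, fuzzOneInput)
```

A friendlier API would presumably wrap the raw pointers in something like a (Data) -> Void fuzz target, which is the design work referred to above.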

3 Likes

swift-protobuf also has fairly robust usage of libFuzzer that (IMO) serves as a good real-world example.

3 Likes