[xctest] Removing outliers from performance tests

Hello folks,

I’m one of the heavy users of XCTest.measureBlock as it exists in Xcode 7.2. To give some hard numbers: I have ~50 performance tests in an OS X framework project, occupying about 20 minutes of wall-clock time in total, and the suite runs on every commit.

The implementation of measureBlock as it currently exists in closed-source Xcode is roughly this (sketched in code below the list):

1. Run 10 trials
2. Compare the average across those 10 trials to some baseline
3. Compare the standard deviation across those 10 trials to some threshold (10% by default)
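
My understanding of that check, expressed as code, is roughly the following. This is a sketch of the observed behavior, not Apple’s implementation; the parameter names and the 10% regression allowance over the baseline are assumptions on my part.

    import Foundation

    // Sketch of the measureBlock pass/fail rule described above (not Apple's code).
    // `allowedRegressionOverBaseline` is an assumed default, for illustration only.
    func measurementPasses(trials: [TimeInterval],                  // 10 trials by default
                           baselineAverage: TimeInterval?,
                           maxRelativeStandardDeviation: Double = 0.10,
                           allowedRegressionOverBaseline: Double = 0.10) -> Bool {
        guard !trials.isEmpty else { return false }
        let n = Double(trials.count)
        let mean = trials.reduce(0, +) / n
        let variance = trials.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / n
        let relativeStdDev = sqrt(variance) / mean

        // 2. The average must not regress past the recorded baseline.
        if let baseline = baselineAverage,
           mean > baseline * (1 + allowedRegressionOverBaseline) {
            return false
        }
        // 3. Run-to-run noise must stay under the threshold (10% by default).
        return relativeStdDev <= maxRelativeStandardDeviation
    }

Note that a single slow trial inflates both the mean and the standard deviation, so one outlier can fail either check on its own.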

There are a lot of problems with this algorithm, but maybe the biggest one is how it handles outliers. If you have a test suite running for 20 minutes, chances are “something” is going to happen on the build server in that time: a system background task, a software update, gremlins, etc.

So what happens lately is that exactly *one* of the 10 * 50 = 500 total measured runs takes a really long time, and it is a different failure each time (i.e., it’s not my code, I swear). A result like this for some test is typical:

<Screen Shot 2015-12-10 at 5.12.13 AM.png>

The probability of this kind of error grows rapidly with the size of the test suite. If we assume an individual measured run only fails due to “chance” 0.1% of the time, then the overall test suite at N = 500 will only pass about 60% of the time. This is very roughly consistent with what I experience at my scale: a test suite that does not really tell me whether my code is broken or not.
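
(To spell out the arithmetic, assuming the chance failures are independent: the suite passes only if all 500 measured runs pass, which happens with probability (1 - 0.001)^500 = 0.999^500 ≈ 0.61.)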

IMO the problem here is one of experiment design. From the data in the screenshot, this very well might be a real performance regression that should be properly investigated. It is only when I tell you a lot of extra information—e.g. that this test will pass fine the next 100 executions and it’s part of an enormous test suite where something is bound to fail—that a failure due to random chance seems likely. In other words, running 10 iterations and pretending that will find performance regressions is a poor approach.

I’ve done some prototyping on algorithms that use a dynamically sized number of trials to find performance regressions. Apple employees, see rdar://21315474 for an algorithm for a sliding window for performance tests (which also has other benefits, like measuring nanosecond-scale performance); a rough sketch of the general idea follows. I am certainly willing to contribute that work in the open if there’s consensus it’s a good direction.
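
To make the general idea concrete, here is a minimal sketch of an adaptive approach: keep running trials until a trimmed mean of the timings stabilizes (or a cap is hit), so that a single outlier trial cannot decide the result by itself. This is illustrative only and is not the algorithm in the radar; the trial counts, the thresholds, the trim fraction, and the use of Date for timing are placeholder choices.

    import Foundation

    // Run the block repeatedly until the trimmed mean of the timings is known
    // precisely enough, or until we hit `maxTrials`. All defaults are made up.
    func adaptivelyMeasure(minTrials: Int = 10,
                           maxTrials: Int = 100,
                           targetRelativeStdError: Double = 0.02,
                           trimFraction: Double = 0.1,
                           block: () -> Void) -> TimeInterval {
        var samples: [TimeInterval] = []
        var trimmedMean: TimeInterval = 0
        while samples.count < maxTrials {
            let start = Date()   // placeholder; real code would use a higher-resolution clock
            block()
            samples.append(Date().timeIntervalSince(start))
            guard samples.count >= minTrials else { continue }

            // Drop the fastest and slowest trials so one outlier cannot dominate.
            let sorted = samples.sorted()
            let drop = Int(Double(sorted.count) * trimFraction)
            let kept = Array(sorted[drop..<(sorted.count - drop)])

            trimmedMean = kept.reduce(0, +) / Double(kept.count)
            let variance = kept.map { ($0 - trimmedMean) * ($0 - trimmedMean) }
                               .reduce(0, +) / Double(kept.count)
            let relativeStdError = sqrt(variance / Double(kept.count)) / trimmedMean

            // Stop as soon as the estimate is stable; otherwise collect more trials.
            if relativeStdError <= targetRelativeStdError { break }
        }
        return trimmedMean
    }

The returned trimmed mean could then be compared against a baseline the same way the current check does, without a lone outlier being able to inflate it.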

However, now that this is happening in the open, I’m interested in getting others’ thoughts on this problem. Surely I am not the only serious user of performance tests, and maybe people with better statistics backgrounds than I have can suggest an appropriate solution.

Drew

Hi Drew,

Thanks for the detailed info on your issue. I see you filed a radar, and that is indeed the best way to make sure an issue on Darwin platforms is addressed. Unfortunately our corelibs implementation of XCTest isn’t ready yet for performance testing.

- Tony


Unfortunately our corelibs implementation of XCTest isn’t ready yet for performance testing.

That's why I'm here; I'm taking the temperature on implementing it. I'm at the pain level where I need a solution in the next several months, even if the solution is to code it up myself. My tests have failed 10x over this so far today.

I think the real question is: if I did implement basic performance testing, and I did implement a variable-sized window of runs, would that departure from the Old XCTest behavior (which uses 10 runs) disqualify the PR? It's a basic compatibility question about how closely we need to follow the Old XCTest behavior.

E.g., if XCS wanted to migrate to corelibs-xctest and it used a variable number of runs, presumably that would be an undertaking for the XCS team. But I don't know whether those concerns (if they exist) play a role in what this project decides to do.

I'm going to do something on this problem eventually, unless someone else solves it first. I'm just trying to work out whether I can do something upstream or whether this is a better candidate for an independent effort.

I think there’s a lot of room for improvement in how we measure and analyze perf test results, and I like some of the ideas you presented earlier in the thread, such as adaptive numbers of runs and what to do with outlier results.

We would very much like to avoid diverging the API of the CoreLibs XCTest from Xcode’s.

There may be a bit more room for variation in behavior, though. Varying the number of runs dynamically (not via API, but through smarter execution mechanisms) or doing better statistical analysis, such as removing outlier results, would introduce some differences between the results reported by the two implementations, but it would not lead to people writing tests that don’t work cross-platform.
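
Purely as an illustration of the “removing outlier results” idea (not a committed design), one conventional approach would be to discard trials that fall more than a few median absolute deviations from the median before computing the statistics that get compared against the baseline:

    import Foundation

    // Illustrative sketch only. Drop trials more than `cutoff` median absolute
    // deviations (MAD) from the median; the cutoff of 3 is an arbitrary choice.
    func discardingOutliers(_ trials: [TimeInterval], cutoff: Double = 3.0) -> [TimeInterval] {
        func median(_ values: [Double]) -> Double {
            let sorted = values.sorted()
            let mid = sorted.count / 2
            return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
        }
        guard !trials.isEmpty else { return trials }
        let m = median(trials)
        let mad = median(trials.map { abs($0 - m) })
        guard mad > 0 else { return trials }        // all trials effectively identical
        return trials.filter { abs($0 - m) / mad <= cutoff }
    }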

Further, there’s always room to discuss taking such ideas, and even API additions, and supporting them in Xcode’s XCTest as well. But that sort of discussion will have to include the practicalities of schedule and resources on the Xcode side (and, to some unavoidable extent, there will be aspects of that discussion that will not be as transparent, since we’ll be weighing that work against other work for the testing team that cannot be discussed as freely).

This year, our primary goal for the core libraries is to broaden the implementation of the existing APIs found in the OS X versions of the frameworks. For XCTest, we would also love to come up with a better answer for test discovery and potentially elicit help from the community in achieving that. Beyond that, the Xcode team’s bandwidth for incorporating other things that would necessitate changes to the Xcode XCTest is going to be limited.

Mike
