I’m one of the heavy users of XCTest.measureBlock as it exists in Xcode 7.2. To give some hard numbers, I have ~50 performance tests in an OS X framework project, occupying about 20m wall clock time total. The suite runs on every commit.
The implementation of measureBlock as it currently exists in closed-source Xcode is something like this:
1. Run 10 trials
2. Compare the average across those 10 trials to some baseline
3. Compare the stdev across those 10 trials to some standard value (10% by default)
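A minimal sketch of those three steps, to make the pass/fail rule concrete. This is not Xcode's code: `TrialVerdict`, `evaluate`, and the `maxRegression` tolerance are hypothetical names and values chosen for illustration, and the use of the population standard deviation is an assumption, since the real implementation is closed-source.

```swift
import Foundation

/// Outcome of one fixed-size measurement (hypothetical type, not part of XCTest).
struct TrialVerdict {
    let mean: Double
    let relativeStdDev: Double
    let passed: Bool
}

/// Applies the two checks described above to a fixed batch of timings.
/// `maxRegression` is an assumed tolerance on the average; the 10% limit on
/// the relative standard deviation is the default mentioned in step 3.
func evaluate(timings: [Double],
              baseline: Double,
              maxRegression: Double = 0.10,
              maxRelativeStdDev: Double = 0.10) -> TrialVerdict {
    precondition(!timings.isEmpty, "need at least one trial")
    let n = Double(timings.count)
    let mean = timings.reduce(0, +) / n
    // Population standard deviation (assumption: sample vs. population is not documented).
    let variance = timings.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / n
    let relStdDev = mean > 0 ? variance.squareRoot() / mean : 0
    let meanOK = mean <= baseline * (1 + maxRegression)
    let spreadOK = relStdDev <= maxRelativeStdDev
    return TrialVerdict(mean: mean, relativeStdDev: relStdDev, passed: meanOK && spreadOK)
}

// Ten quiet trials pass; nine quiet trials plus one slow one do not.
let quiet = [Double](repeating: 1.0, count: 10)
let oneSpike = [Double](repeating: 1.0, count: 9) + [3.0]
print(evaluate(timings: quiet, baseline: 1.0).passed)     // true
print(evaluate(timings: oneSpike, baseline: 1.0).passed)  // false
```

With these made-up timings, a single 3x trial moves the average to 1.2 and the relative standard deviation to 50%, so one disturbed trial out of ten fails both checks.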
There are really a lot of problems with this algorithm, but maybe the biggest one is how it handles outliers. If you have a test suite running for 20m, chances are “something” is going to happen on the build server in that time. System background task, software update, gremlins etc.
So what happens lately is exactly *one* of the 10 * 50 = 500 total measureBlock trials takes a really long time, and it is a different failure each time (i.e., it’s not my code, I swear). A result like this for some test is typical:
The probability of at least one such spurious failure climbs quickly with test suite size, because the chance of a clean run shrinks exponentially. If we assume that an individual trial only fails due to “chance” 0.1% of the time, then the overall test suite at N = 500 will only pass about 60% of the time (0.999^500 ≈ 0.61). This is very vaguely consistent with what I experience at my scale: a test suite that does not really tell me if my code is broken or not.
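As a check on that arithmetic: if each of the N trials fails spuriously with probability p, and failures are assumed independent, the whole suite passes with probability (1 - p)^N. The sketch below evaluates this for a few rates; `suitePassProbability` is a hypothetical helper, and independence is a simplifying assumption (a busy build server may well disturb several trials at once).

```swift
import Foundation

/// Probability that every one of `trials` independent trials passes,
/// given a spurious failure rate per trial.
func suitePassProbability(perTrialFailureRate p: Double, trials n: Int) -> Double {
    return pow(1.0 - p, Double(n))
}

for rate in [0.0001, 0.001, 0.01] {
    let pass = suitePassProbability(perTrialFailureRate: rate, trials: 500)
    let percent = String(format: "%.1f", pass * 100)
    print("per-trial failure rate \(rate): suite passes \(percent)% of runs")
}
```

At N = 500, a per-trial rate of 0.01% still gives roughly a 95% pass rate, 0.1% gives roughly 61%, and 1% gives under 1%, so the roughly 60% figure corresponds to about one spurious failure per thousand trials.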
IMO the problem here is one of experiment design. From the data in the screenshot, this very well might be a real performance regression that should be properly investigated. It is only when I tell you a lot of extra information—e.g. that this test will pass fine the next 100 executions and it’s part of an enormous test suite where something is bound to fail—that a failure due to random chance seems likely. In other words, running 10 iterations and pretending that will find performance regressions is a poor approach.
I’ve done some prototyping on algorithms that use a dynamically sized number of trials to find performance regressions. Apple employees, see rdar://21315474 for a sliding-window algorithm for performance tests (that also has other benefits, like measuring nanosecond-scale performance). I am certainly willing to contribute that work in the open if there’s consensus it’s a good direction.
However, now that this is happening in the open, I’m interested in getting others’ thoughts on this problem. Surely I am not the only serious user of performance tests, and maybe people with better statistics backgrounds than I have can suggest an appropriate solution.