I’m kind of new around here, still learning how the project works. At the suggestion of @Ben_Cohen I started contributing code around the Swift Benchmark Suite (more benchmarks for the `Sequence` protocol, along with support for GYB in benchmarks). I filed SR-4600 back in April of 2017. After rewriting most of `compare_perf_tests.py`, converting it from scripting style to an OOP module and adding full unit test coverage, I’ve been working on improving the robustness of performance measurements in the Swift Benchmark Suite on my own branch since the end of June.
One of my PRs in the benchmarking area was blocked pending wider discussion on swift-dev, so the linked document is a rather long way of illustrating what changes to the benchmark suite and our measurement process would, in my opinion, give us more robust results in the future.
I apologize that the report deals with the state of the Swift Benchmark Suite from autumn 2017. Once keeping the tree up to date required lengthy manual conflict resolution of the commits I was depending on, I gave up on that and kept focusing on the experiment rather than a futile chase of the tip of the tree…
The document had to be split into several pages due to the embedded interactive charts that were overwhelming the browser, so here’s a table of contents for quicker access to each chapter:
Towards Robust Performance Measurement
- Anatomy of the Swift Benchmark Suite
- Issues with the Status Quo
- The Experiment So Far
- Analysis
- Exclude Outliers
- Exclude Setup Overhead
- Detecting Changes
- Memory Use
- Corrective Measures
Please read that document for the detailed reasoning behind the suggestions quoted from the Corrective Measures chapter below. It is fully responsive, so you can enjoy it on your iPhones and iPads, including the mesmerizing visualizations of samples from the Swift Benchmark Suite like these:
I hope the sample visualization tool I’ve built can be further adapted and made useful for our compiler hackers (@Andrew_Trick, @Michael_Ilseman, @Slava_Pestov).
Regarding the changes I suggest we take, I am especially interested in a review from people who have worked on the benchmark suite before and the standard library hackers who use the benchmark suite daily: @Michael_Gottesman, @lancep, @mishal_shah, @Ben_Cohen, @dabrahams, @lorentey. Please ping all interested parties I might have forgotten...
Corrective Measures
Given that the Swift Benchmark Suite is a set of microbenchmarks, we are measuring effects that manifest in microseconds. We can significantly increase the robustness of our measurement process using statistical methods. A necessary prerequisite is having a representative sample population of reasonable size. From the experiment analyzed in the previous sections it is apparent that we can make the measurement process resilient to the effects of varying system load if the benchmarked workload stays in the range of hundreds of microseconds, up to a few thousand. Above that it becomes impossible to separate the signal from the noise on a heavily contested CPU.
Keeping the run time small means it takes less time to gather enough samples, and their quality is better. By staying well under the 10 millisecond time slice we get more pristine samples, and the samples that were interrupted by context switching are easier to identify. Excluding outliers makes our measurement more robust.
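To make the outlier filter concrete, here is a minimal Python sketch of the fence I have in mind (the function name and the naive index-based quartile estimates are purely illustrative, not the actual implementation):

```python
def exclude_outliers(samples):
    """Drop samples above the top inner fence (Q3 + 1.5 * IQR).

    `samples` are per-iteration runtimes in microseconds. Only high
    outliers are removed: preemption and other interference can only
    make a benchmark look slower, never faster.
    """
    s = sorted(samples)
    q1 = s[len(s) // 4]            # lower quartile (naive estimate)
    q3 = s[(3 * len(s)) // 4]      # upper quartile (naive estimate)
    top_inner_fence = q3 + 1.5 * (q3 - q1)
    return [x for x in samples if x <= top_inner_fence]

# exclude_outliers([1049, 1050, 1051, 1052, 9800]) -> [1049, 1050, 1051, 1052]
```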
After these resilience preconditions are met, we can speed up the whole measurement process by running it in parallel on all available CPU cores. If we gather 10 independent one-second measurements per benchmark on a 10-core machine, we can run the whole Benchmark Suite in 500 seconds, while having much better confidence in the statistical significance of the reported results!
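To sketch the idea in Python (everything here is hypothetical and simplified; `run_benchmark` is just a stand-in for invoking the actual benchmark binary, not part of the existing driver):

```python
import time
from multiprocessing import Pool, cpu_count

def run_benchmark(name, seconds):
    """Stand-in for invoking the benchmark binary; returns dummy samples."""
    time.sleep(seconds)            # pretend we measured for `seconds` seconds
    return [1000.0]                # one fake sample, in microseconds

def measure(job):
    name, seconds = job
    return name, run_benchmark(name, seconds)

def measure_suite(benchmarks, independent_runs=10, seconds=1):
    """Spread `independent_runs` independent measurements of each
    benchmark across all available cores."""
    jobs = [(name, seconds)
            for name in benchmarks for _ in range(independent_runs)]
    pool = Pool(cpu_count())
    try:
        return pool.map(measure, jobs)
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    # With 10 independent one-second measurements per benchmark, a suite
    # of N benchmarks takes roughly N * 10 / cores seconds of wall time.
    print(len(measure_suite(["BenchA", "BenchB", "BenchC"])))
```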
Based on the preceding analysis I suggest we take the following corrective measures:
One-time Benchmark Suite Cleanup
- Enable an increase of measurement frequency by lowering the base workload of individual benchmarks to run under 2500 μs. For the vast majority of benchmarks this just means lowering the constant used to drive their inner loop, effectively allowing the measurement infrastructure to peek inside the work loop more often. Benchmarks that are part of a test family meant to highlight relative costs should be exempt from strictly meeting this requirement; see for example the `DropFirst` family of benchmarks.
- Ensure the setup overhead is under 5%. Expensive setup work (> 5%) should be excluded from the main measurement by using the setup and teardown methods. Also reassess the ratio of setup to the main workload, so that it stays reasonable (< 20%) and doesn’t needlessly prolong the measurement.
- Ensure benchmarks have constant memory use independent of iteration count.
- Make all benchmark names <= 40 characters long to prevent obscuring results in report tables.
- Make all benchmark names use the CamelCase convention. (A rough sketch of how these cleanup criteria could be checked automatically follows after this list.)
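To make the cleanup criteria above concrete, here is a rough Python sketch of the kind of automated checks I have in mind; the function and its inputs are hypothetical and simplified, not the `BenchmarkDoctor` prototype mentioned below:

```python
import re

def check_benchmark(name, samples_us, setup_us):
    """Return a list of rule violations for a single benchmark.

    `samples_us` are measured runtimes of the base workload and
    `setup_us` is the measured setup overhead, both in microseconds.
    """
    problems = []
    if len(name) > 40:
        problems.append("name is longer than 40 characters")
    if not re.match(r"^[A-Z][a-zA-Z0-9]*$", name):
        problems.append("name is not CamelCase")
    runtime = min(samples_us)
    if runtime > 2500:
        problems.append("base workload runs over 2500 us")
    if setup_us > 0.05 * runtime:
        problems.append("setup overhead is over 5% of the runtime")
    return problems

# For example, a hypothetical benchmark with a 3 ms workload and 1 ms setup:
# check_benchmark("DropWhileAnySeqCntRangeLazy", [3000, 3100], 1000)
# -> ['base workload runs over 2500 us',
#     'setup overhead is over 5% of the runtime']
```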
Measurement System Improvements
- Measure memory use and context switches in Swift by calling `rusage` before the first and after the last sample is taken (abstracted for platform-specific implementations). This change is meant to exclude the overhead introduced by the measurement infrastructure, which is impossible to correct for from external scripts.
- Exclude outliers from the measured dataset by filtering out samples whose runtime exceeds the top inner fence (TIF = Q3 + 1.5 * IQR), controlled by a newly added `--exclude-outliers` option that will default to `true`.
- Expand the statistics reported for each benchmark run:
  - Minimum, Q1, Median, Q3, Maximum (to complete the 5-number summary), Mean, SD, n (number of samples after excluding outliers), maximum resident set size (in pages?), and the number of involuntary context switches (ICS) during measurement.
  - An option to report 20 percentiles in 5% increments (a 20-number summary, because 10% increments don’t fall exactly on Q1 and Q3), compressed in delta format where each successive value is expressed as a delta from the previous percentile.
- Implement parallel benchmarking in the `BenchmarkDriver` script to dramatically speed up measurement of the whole benchmark suite.
- Introduce automated benchmark validation to ensure individual benchmarks conform to the expected requirements; this will be performed for newly added tests during regular CI benchmarks, and on the whole benchmark suite as part of the validation tests. See `BenchmarkDoctor` for a prototype implementation.
- Adjust change detection to use the Mann-Whitney U-test (a minimal sketch follows right after this list).
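For that last point, here’s a minimal sketch of what the comparison could look like, using scipy’s implementation (whether we’d want that dependency in `compare_perf_tests.py`, or rather implement the rank-sum computation directly, is an open question; the function below is purely illustrative):

```python
from scipy.stats import mannwhitneyu

def changed(old_samples, new_samples, alpha=0.05):
    """Detect a statistically significant change between two benchmark runs.

    The Mann-Whitney U-test is non-parametric: it doesn't assume the
    runtimes are normally distributed, which suits samples with a long
    right tail better than comparing minima or means.
    """
    _, p_value = mannwhitneyu(old_samples, new_samples,
                              alternative="two-sided")
    return p_value < alpha
```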
What should the next steps be? I have some of the suggested solutions prototyped on my branches that are out of sync with the tip of the tree… @Michael_Gottesman, are you the right person to discuss cherry-picking the useful parts into new PRs?