Towards Robust Performance Measurement

I’m kind of new around here, still learning how the project works. At the suggestion of @Ben_Cohen I started contributing code around the Swift Benchmark Suite (more benchmarks for the Sequence protocol, along with support for GYB in benchmarks). I filed SR-4600 back in April 2017. After rewriting most of compare_perf_test.py, converting it from a script into an OOP module and adding full unit test coverage, I’ve been working on improving the robustness of performance measurements with the Swift Benchmark Suite on my own branch since the end of June.

One of my PRs in the benchmarking area was blocked pending wider discussion on swift-dev, so the linked document is a rather long-form attempt to illustrate which changes to the benchmark suite and our measurement process will, in my opinion, give us more robust results in the future.

I apologize that the report deals with the state of the Swift Benchmark Suite from autumn 2017. Once the commits I was depending on required lengthy manual conflict resolution to keep the tree up to date, I gave up on that and kept focusing on the experiment rather than a futile chase of the tip of the tree…

The document had to be split into several pages due to the embedded interactive charts that were overwhelming the browser, so here’s a table of contents for quicker access to each chapter:

Towards Robust Performance Measurement

Please read that document for the detailed reasoning behind the suggestions quoted from the Corrective Measures chapter below. It is fully responsive, so you can enjoy it on your iPhones and iPads, including the mesmerizing visualizations :stuck_out_tongue_winking_eye: of samples from the Swift Benchmark Suite like these:

I hope the sample visualization tool I’ve built could be further adapted and made useful for our compiler hackers (@Andrew_Trick, @Michael_Ilseman, @Slava_Pestov).
Regarding the changes I suggest we take, I am especially interested in review from people who worked on the benchmark suite before and the standard library hackers that use the benchmark suite daily: @Michael_Gottesman, @lancep, @mishal_shah, @Ben_Cohen, @dabrahams, @lorentey. Please ping all interested parties I might have forgotten...

Corrective Measures

Given that the Swift Benchmark Suite is a set of microbenchmarks, we are measuring effects that manifest in microseconds. We can significantly increase the robustness of our measurement process using statistical methods. A necessary prerequisite is having a representative sample population of reasonable size. From the experiment analyzed in the previous sections it is apparent that we can make the measurement process resilient to the effects of varying system load if the benchmarked workload stays in the range of hundreds of microseconds, up to a few thousand. Above that it becomes impossible to separate the signal from the noise on a heavily contended CPU.

By making the run time small, it takes less time to gather enough samples, and their quality is better. By staying well under the 10 millisecond time slice we get more pristine samples, and the samples that were interrupted by context switching are easier to identify. Excluding these outliers makes our measurement more robust.

After these resilience preconditions are met, we can speed up the whole measurement process by running it in parallel on all available CPU cores. If we gather 10 independent one-second measurements per benchmark on a 10-core machine, we can run the whole Benchmark Suite in 500 seconds, while having much better confidence in the statistical significance of the reported results!
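
To illustrate the idea, here is a rough sketch of how the driver could farm out independent measurements across all cores. The binary path and flags are only illustrative, not a final interface:

```python
# Rough sketch: distribute independent benchmark measurements across all
# cores. The binary path and flags are illustrative, not a final interface.
import subprocess
from multiprocessing import Pool, cpu_count

BENCHMARK_O = './bin/Benchmark_O'  # hypothetical location of the -O binary

def measure(benchmark):
    """Gather one independent series of samples for a single benchmark."""
    output = subprocess.check_output(
        [BENCHMARK_O, benchmark, '--num-samples=10', '--num-iters=1'],
        universal_newlines=True)
    return benchmark, output

def measure_all(benchmarks):
    """Run one measurement process per benchmark, one worker per core."""
    with Pool(processes=cpu_count()) as pool:
        return dict(pool.map(measure, benchmarks))
```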

Based on the preceding analysis I suggest we take the following corrective measures:

One-time Benchmark Suite Cleanup

  • Enable an increase of measurement frequency by lowering the base workload of individual benchmarks to run under 2500 μs. For the vast majority of benchmarks this just means lowering the constant used to drive their inner loop, effectively allowing the measurement infrastructure to peek inside the work loop more often. Benchmarks that are part of a test family meant to highlight relative costs should be exempt from strictly meeting this requirement; see for example the DropFirst family of benchmarks. (A rough sketch of automated checks for these criteria follows this list.)
  • Ensure the setup overhead is under 5%. Expensive setup work (> 5%) should be excluded from the main measurement by using the setup and teardown methods. Also reassess the ratio of setup to the main workload, so that it stays reasonable (< 20%) and doesn’t needlessly prolong the measurement.
  • Ensure benchmarks have constant memory use independent of iteration count.
  • Make all benchmark names <= 40 characters long to prevent obscuring results in report tables.
  • Make all benchmark names use CamelCase convention.
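
As a rough illustration of how these cleanup criteria could later be checked automatically (in the spirit of the benchmark validation proposed below), here is a sketch with hypothetical helper names; the setup estimate is assumed to come from the measurement infrastructure:

```python
# Sketch of automated checks for the cleanup criteria above. The thresholds
# come from the list; the setup estimate is assumed to be measured elsewhere.
import re

MAX_NAME_LENGTH = 40
RUNTIME_BUDGET_US = 2500
SETUP_OVERHEAD_BUDGET = 0.05  # 5 %

def check_name(name):
    issues = []
    if len(name) > MAX_NAME_LENGTH:
        issues.append('name is longer than 40 characters')
    if not re.match(r'^[A-Z][a-zA-Z0-9]*$', name):
        issues.append('name does not follow the CamelCase convention')
    return issues

def check_runtime(min_runtime_us):
    """The base workload (num-iters=1) should stay under the 2500 μs budget."""
    if min_runtime_us > RUNTIME_BUDGET_US:
        return ['base workload exceeds 2500 μs; lower the inner loop multiplier']
    return []

def check_setup(setup_us, min_runtime_us):
    """Expensive setup belongs in the setup and teardown methods."""
    if setup_us > SETUP_OVERHEAD_BUDGET * min_runtime_us:
        return ['setup overhead is above 5 %; move it to the setup method']
    return []
```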

Measurement System Improvements

  • Measure memory use and context switches in Swift by calling rusage before the first and after the last sample is taken (abstracted for platform-specific implementations). This change is meant to exclude the overhead introduced by the measurement infrastructure, which is impossible to correct for from external scripts.

  • Exclude outliers from the measured dataset by filtering out samples whose runtime exceeds the top inner fence (TIF = Q3 + 1.5 * IQR), controlled by a newly added --exclude-outliers option that will default to true (see the sketch after this list).

  • Expand the statistics reported for each benchmark run:

    • Minimum, Q1, Median, Q3, Maximum (to complete the 5-number summary), Mean, SD, n (number of samples after excluding outliers), maximum resident set size (in pages?), and the number of involuntary context switches (ICS) during measurement.
    • Option to report 20 percentiles in 5% increments (a 20-number summary, because 10% increments don’t fall exactly on Q1 and Q3), compressed in a delta format where each successive value is expressed as a delta from the previous percentile.
  • Implement parallel benchmarking in BenchmarkDriver script to dramatically speed up measurement of the whole benchmark suite.

  • Introduce automated benchmark validation to ensure individual benchmarks conform to the expected requirements, which will be performed for newly added tests during regular CI benchmarks and on the whole benchmark suite as a part of the validation tests. See BenchmarkDoctor for a prototype implementation.

  • Adjust change detection to use the Mann-Whitney U-test (see the sketch after this list).
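
Here is a minimal sketch of the proposed sample post-processing and change detection: outlier exclusion above the top inner fence, the expanded summary statistics, delta-compressed percentiles and a dependency-free Mann-Whitney U-test (normal approximation with tie correction). Function names are illustrative and the exact percentile method in the final implementation may differ; SciPy’s scipy.stats.mannwhitneyu would also do for the U-test:

```python
# Sketch of the proposed sample post-processing and change detection.
# Runtimes are in μs; function names are illustrative.
from math import sqrt

def percentile(sorted_samples, p):
    """Nearest-rank percentile (p in 0...100) of an already sorted list."""
    return sorted_samples[int(round(p / 100.0 * (len(sorted_samples) - 1)))]

def exclude_outliers(samples):
    """Drop samples above the top inner fence: TIF = Q3 + 1.5 * IQR."""
    s = sorted(samples)
    q1, q3 = percentile(s, 25), percentile(s, 75)
    top_inner_fence = q3 + 1.5 * (q3 - q1)
    return [x for x in s if x <= top_inner_fence]

def summary(samples):
    """5-number summary plus Mean, SD and n, after excluding outliers."""
    s = exclude_outliers(samples)
    mean = sum(s) / float(len(s))
    sd = sqrt(sum((x - mean) ** 2 for x in s) / (len(s) - 1))
    return {'min': s[0], 'q1': percentile(s, 25), 'median': percentile(s, 50),
            'q3': percentile(s, 75), 'max': s[-1],
            'mean': mean, 'sd': sd, 'n': len(s)}

def percentile_deltas(samples, step=5):
    """20 percentiles in 5 % increments, delta-encoded."""
    s = sorted(samples)
    values = [percentile(s, p) for p in range(step, 101, step)]
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def mann_whitney_z(old_samples, new_samples):
    """Z score of the U statistic; |z| > 1.96 suggests a change at the 5 % level."""
    pooled = sorted([(x, 0) for x in old_samples] + [(x, 1) for x in new_samples])
    # 1-based ranks of the pooled values, averaging ties.
    ranks, ties = [0.0] * len(pooled), 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j + 2) / 2.0
        t = j - i + 1
        ties += t ** 3 - t  # tie correction term: sum(t^3 - t)
        i = j + 1
    n1, n2, n = len(old_samples), len(new_samples), len(pooled)
    r1 = sum(r for r, (_, origin) in zip(ranks, pooled) if origin == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - ties / float(n * (n - 1)))
    return (u1 - mean_u) / sqrt(var_u)
```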


What should the next steps be? I have some of the suggested solutions prototyped on my branches that are out of sync with the tip of the tree… @Michael_Gottesman are you the right person to discuss cherrypicking of useful parts into new PRs?

8 Likes

Let’s start simple with things that everyone can agree are good: fixing setup overhead. That sounds like a good solid deliverable. We can talk about other things later on this thread (or in a new thread).

1 Like

Also, next time you want me to see something like this, post it to the dev forum. I do not read the standard library forum that often (i.e. I did not see your post).

Moved to the Compiler category. I figured that after swift-dev was split into Compiler and Standard Library in the forums (both in the Development category), Standard Library was a better fit for benchmarks due to their focus…

Wow. This shows a lot of commitment, not to mention skill as well.

Thank you for investigating this. And thanks for pushing the benchmark suite forward :pray:t2:

I agree, this is an interesting approach.

1 Like

@palimondo Also, we should probably add a way to verify that benchmarks do not have setup overhead. I imagine that we would run it on one of the bots to verify that as new benchmarks are added we do not lose this invariant. Your thoughts?

That’s where the idea of BenchmarkDoctor comes in. It uses the 10R Series measurement strategy, which makes it possible to detect the overhead. So far I have added this: https://github.com/palimondo/swift/commit/bf6be50d39559994e44ce9d689bc5925bac5618b

I got stuck a bit on the rule to verify setup overhead, as the implementations in chart.html and diag.py relied on a measure of spread (IQR) derived from a cleaned series of all samples. That’s needed to go really fine-grained, below the crude 5% rule of thumb.

That would require BenchmarkDoctor to run BenchmarkDriver in verbose mode to gather PerformanceTestSamples. That’s fine, doable and already implemented, but I was hoping to move the outlier detection and filtering down to Benchmark_X… This came up as design pressure while writing the test case for the setup overhead detection rule: too much mocking to set it all up. Not sure how to resolve that in a clean way yet :thinking:

Update: The solution is to not sweat it, just do the 5% threshold and leave improvements for the future! :nerd_face:
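
Roughly, the crude version of the rule boils down to something like this sketch, assuming we compare the minimum runtimes measured with num-iters=1 and num-iters=2, where the setup is paid in full on one iteration and amortized over two:

```python
# Simplified sketch of the setup overhead estimate, assuming minimum
# runtimes measured with num-iters=1 and num-iters=2:
#   t1 = workload + setup,  t2 = workload + setup / 2
#   =>  setup = 2 * (t1 - t2)

def setup_overhead(min_t1_us, min_t2_us):
    """Estimated setup time (μs) and its share of the num-iters=1 runtime."""
    setup = max(0, 2 * (min_t1_us - min_t2_us))
    return setup, setup / float(min_t1_us)

def has_significant_setup(min_t1_us, min_t2_us, threshold=0.05):
    """The crude 5 % rule of thumb."""
    return setup_overhead(min_t1_us, min_t2_us)[1] > threshold
```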

@Michael_Gottesman I would very much like to land BenchmarkDoctor and related changes before we start the spring cleanup of the benchmarks. Would you agree that is a better course of action than piecewise modifications of the benchmark suite?

Please fix up the benchmarks that need setup/tear down functions, ignoring the detection issues. My reasoning is:

  1. Fixing up the benchmarks’ setup/teardown is a solid, incremental deliverable that is unquestionably good.

  2. I would like to have a design discussion where we discuss the proposal before any of that other stuff goes into the tree, but I need to review the design document you posted. The benchmark suite is important infrastructure that the Swift project relies on. Major changes to this should not be done without buy-in, design review, and code review.

1 Like

@Michael_Gottesman I don’t mean to be difficult, but fixing just the setup overhead in isolation is not going to help much. Even the most egregious example of setup overhead, ReversedArray with 41%, is amortized over the high iteration count (~570) used in the current averaging measurement method, so it isn’t a problem at the moment. See the a10 vs. i0 Series. It will become an issue once we start measuring with num-iters=1; that’s why I propose fixing it, but that could be done together with lowering the base workload so that it runs under 2500 μs.

From what I understand, changes to benchmarks that would alter their runtime are better done in one swoop, rather than piecewise, because you do have some kind of long-term benchmark tracking internally at Apple that isn’t publicly visible (IMO this needs to become public!). Where possible, it should be done in such a way that we can establish a well-defined relationship between the old and altered benchmark to allow for historic comparison. That’s why I think investing in the validation infrastructure is the better course of action at the moment.

I have implemented the BenchmarkDoctor rule that validates the optimized runtime is under 2500 μs in such a way that it suggests a fix-it: the power-of-2 factor you should lower the runtime by. From the test case, the output of the check would be:
[screenshot: BenchmarkDoctor check runtime]

As most benchmarks internally scale the base workload in a simple for-in loop with some multiplier, this would be a matter of dividing that multiplier by the suggested factor. This way we could use the factor to create a known relationship between the old and altered benchmark for the purposes of historic comparisons, if you desire to retain such a feature internally.
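
For illustration, the suggested factor can be computed as the smallest power of two that brings the measured runtime under the 2500 μs budget, roughly like this sketch (not the exact BenchmarkDoctor code):

```python
# Sketch of the suggested fix-it factor: the smallest power of two that
# brings the measured runtime under the 2500 μs budget.
from math import ceil, log

RUNTIME_BUDGET_US = 2500

def fixit_factor(measured_runtime_us):
    if measured_runtime_us <= RUNTIME_BUDGET_US:
        return 1
    return 2 ** int(ceil(log(measured_runtime_us / float(RUNTIME_BUDGET_US), 2)))

# For example, a benchmark measured at 123000 μs gets a suggested factor of
# 64; dividing its inner loop multiplier by 64 brings it to roughly 1900 μs.
```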

Aside: I did look into fixing some of the benchmarks with setup overhead and I definitely need help with the ArrayOf* benchmark family. I cannot locate the source of the overhead and I don’t understand what they are testing. They are full of methods that create an array and immediately throw it away (assigning to _), followed by the comment // should be a nop. If they are indeed testing that some optimization takes place, I’d argue they should be tests/validation-tests and not benchmarks…

If you haven't seen it yet, I'd recommend looking over the design of the Criterion benchmarking system in Haskell; a Rust port also exists, which might be closer to idiomatic Swift.

It's had quite a lot of work put into achieving a good mix of comprehensiveness and comprehensibility. The default views tell you enough, with enough statistical confidence, to evaluate benchmarks well.

3 Likes

Thank you for pointing this out to me. From quick skimming, the use of KDE is roughly analogous to what I was looking for as future improvements using median detection. I will study Criterion in more depth over the weekend.

Update: I've looked at Criterion. It is more similar to @lorentey's Attabench. They both focus on scaling the workload and produce benchmark visualizations to aid in performance analysis. I don't think much from it is directly applicable to what we are doing in SBS.

Thanks to @Erik_Eckstein and @Michael_Gottesman, the first PR with prerequisites for robust microbenchmarking has landed on master. :tada:

When the issue of setUp and tearDown functions came up during review, it reminded me of related code from my experimentation.

There is a feature of Benchmark_Driver that measures the memory consumption of the benchmark using the time -lp utility, which can among other things report the maximum resident set size (MAX_RSS) used by a process. Benchmark_Driver reports MAX_RSS in the last column of the log results. This code currently works only on Mac, because the GNU implementation of time on Linux has different options and a different output format. During my experiments I extended the support to both platforms like this:

But in the end I realized that measuring this using time is still quite noisy. See the first paragraph in Memory Use for a detailed analysis. For reference, here’s the first discussion of this issue with @Luke_Larson and @Andrew_Trick in PR #8793.

I think I have a better solution and would like to gather feedback on whether this would be a reasonable next step to implement here. Given that the time utility is just a simple wrapper around the rusage function, I’ve used it directly to gather the same environmental stats and have verified that it reduces the noise in MAX_RSS by excluding the contribution of the benchmarking infrastructure overhead (about 10 MB). See this commit for the proposed implementation:
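
In outline, the idea is roughly the following, illustrated here with Python’s resource module for brevity (the proposed change itself lives in the Swift benchmark harness):

```python
# The shape of the idea: take rusage snapshots right around the measured
# samples and report deltas, so the constant ~10 MB of driver overhead
# cancels out of the reported MAX_RSS.
import resource

def measure_with_rusage(run_samples):
    """run_samples is a callable that performs the actual measurement."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    samples = run_samples()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return {
        'samples': samples,
        # ru_maxrss is reported in bytes on Darwin and kilobytes on Linux
        'max_rss_delta': after.ru_maxrss - before.ru_maxrss,
        'involuntary_context_switches': after.ru_nivcsw - before.ru_nivcsw,
    }
```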

What do you think?

This implies 2 changes to the log formats (Benchmark_* public API):

  • Benchmark_O gets an additional column reporting MAX_RSS
  • Benchmark_Driver’s MAX_RSS will report a delta instead of an absolute value

The second point is just a minor one-time change in semantics, but maybe you’re tracking these values over time somewhere... Are your internal scripts able to handle this as is, or would you need to adapt something?

6 Likes

Getting the data directly in the benchmark (with rusage) is definitely a good idea.

About the output format: not a big deal to change it. Maybe it makes sense to include the memory usage column(s) only if it is explicitly turned on with a command line option. Just to keep the output readable if someone is just interested in runtime data.

1 Like

Follow-up on the discussion on GitHub:
The reason why changing the name/runtime of many benchmarks is a problem is that it's disruptive for our performance tracking. We submit benchmark results to an LNT server over time, which gives us a history of compiler performance.
Now, if some benchmarks are problematic, we should change them, because it's better to fix noisy benchmarks than to keep them in the history. But for benchmarks which are not problematic (i.e. where we didn't see any noise so far), changing them would not solve any problem.

Based on all the data in the report, I'm very strongly convinced there are no noisy benchmarks in the Swift Benchmark Suite. All the noise is external, from the environment. The key to minimizing its impact is improving the robustness of the measurement process by making it more resilient. Staying in the micro-benchmarking range helps us avoid interruptions by the OS scheduler, which would happen at least every 10 ms on macOS (6 ms on Linux). Keeping the benchmark runtime under 2500 μs gives us a 1:4 (1:3) chance of being interrupted on an absolutely calm machine. The noise filtering technique I've proposed should be able to deal with that. Maybe we should go as low as 1000 μs (1 ms) for the base workload, to be on the safe side. That would allow us to run benchmarks in parallel, dramatically reducing the time to run SBS. (A final side benefit is that Onone benchmarks will take less time.)

But this is not possible if we don't first fix the benchmarks. Luckily, there isn't a big change needed to get the benefits of more stable and parallel benchmarking. All I'm proposing is a trivial change to the internal loop-multiplying constants, establishing a simple linear relationship between the old and recalibrated benchmarks with a lowered base workload. This is the same principle of finer-grained measurement that's behind measuring with num-iters=1. It isn't a fundamental change to what is being measured, only a policy change to allow the measurement infrastructure to peek inside the body of the main workload loop more often.

How is the performance tracking history maintained after upgrading the CI machines? Doesn't that also create a discontinuity similar to what would happen if we did a one-time recalibration of the benchmark inner-loop multipliers? Continuity for renamed benchmarks might be solved by some kind of database migration script, if deemed that important…

1 Like

Actually the biggest problem with "noisy" benchmarks was the code alignment problem, which is fixed now by PR #18318 ("IRGen, benchmarks: add an option -align-module-to-page-size for benchmarking" by eeckstein, apple/swift).

I suggest we see how the new benchmarking method goes (Improved benchmarking for pull requests) and, if there are still noisy benchmarks because of too long an iteration time, let's fix those benchmarks.

Mass-renaming/changing benchmarks creates a lot of burden for our performance tracking and I don't see how we can spend time working on this.

Could you please elaborate? How did this issue manifest? I'd like to investigate it on the dataset from my report and on the new measurements taken after PR #18318 landed (a month ago).

This is not actually noise, but a 100% reproducible performance difference caused by different instruction alignment within a code page. As we link all benchmark object files into a single executable a small change in one benchmark can change the code alignment of all other benchmarks. The fix for this was to page-align all benchmark modules.

Oh, so this was a stable new value (within a given build) for an unmodified benchmark, caused by the modification/addition of a different benchmark?

(I thought this had something to do with the same benchmark having various stable typical values between independent runs, which can be seen in the charts from the report and is still observable in the C series measurements I ran this week.)