Improved benchmarking for pull requests

Erik_Eckstein · August 24, 2018, 1:21am

I'd like to share some exciting news about benchmarking.
We have made some significant improvements to running the benchmarks in pull requests:

It's now a lot faster: down to 30min from 2h (including the compiler build time)
Reduced noise: almost no false alarms anymore
Code size differences are now reported - for the benchmark object files and also for the Swift standard library files
Some improvements to the report table format. For example, improvements are not folded by default but shown in the same table as regressions (we should be proud of improvements and not hide them!)
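
To make the last point concrete, here is a minimal sketch of a report that lists improvements and regressions side by side. It is not the actual CI script: the names `Delta`, `classify` and `report`, the 5% threshold, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Delta:
    """One benchmark measured before and after a change (hypothetical type)."""
    name: str
    old: float
    new: float

    @property
    def percent(self) -> float:
        # Relative change of the new value against the old one, in percent.
        # Assumes old > 0.
        return (self.new - self.old) / self.old * 100.0


def classify(delta: Delta, threshold_percent: float = 5.0) -> str:
    """Label a change; lower is better for both run time and code size."""
    if delta.percent > threshold_percent:
        return "regression"
    if delta.percent < -threshold_percent:
        return "improvement"
    return "unchanged"


def report(old: dict, new: dict, threshold_percent: float = 5.0) -> list:
    """Return one row per changed benchmark that is present in both runs.

    Regressions and improvements land in the same list, so neither is hidden.
    """
    rows = []
    for name in sorted(old.keys() & new.keys()):
        delta = Delta(name, old[name], new[name])
        label = classify(delta, threshold_percent)
        if label != "unchanged":
            rows.append((name, delta.old, delta.new, round(delta.percent, 1), label))
    return rows


# Made-up sample data in microseconds; not taken from a real benchmark run.
old_times = {"ExampleA": 800.0, "ExampleB": 1200.0, "ExampleC": 500.0}
new_times = {"ExampleA": 600.0, "ExampleB": 1500.0, "ExampleC": 505.0}

for row in report(old_times, new_times):
    print("{:<10} {:>8.0f} {:>8.0f} {:>+7.1f}%  {}".format(*row))
```

The same comparison would apply to code size if the dictionaries held object file sizes in bytes instead of run times, since a smaller value is better in both cases.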

Currently the new features can be tested with "@swift-ci smoke benchmark staging"; they will go live with "@swift-ci smoke benchmark" soon.
You can look at a test PR to see some sample output: [Do not merge] Test benchmark runs by eeckstein · Pull Request #18876 · apple/swift · GitHub

Now what about the non-smoke "@swift-ci benchmark"? Currently the only difference between smoke and non-smoke is the number of iterations. But as the new method reduces noise anyway, I'm actually thinking of making "smoke benchmark" the default, i.e. just having "@swift-ci benchmark", which does the new thing.

We hope that this is much more usable than before and will enable everyone to run the benchmarks for every (non-trivial) pull request.
If everything goes well, we eventually plan to run the benchmarks by default for all "@swift-ci test" runs. Because it's so fast now, it will not add any time overhead.

If you have any comments or questions, please let me know.

Erik

PS: credit goes to @palimondo, who initiated that effort in Towards Robust Performance Measurement

jrose · August 24, 2018, 1:22am

Nice changes! But how do I get -Onone times? :-)

Erik_Eckstein · August 24, 2018, 3:34am

That's a good point, which I forgot to address.
Performance of -O and -Osize is clearly more important than -Onone, but we should not neglect -Onone. For example, -Onone is an indirect way to test the performance of non-specialized generic code (in the stdlib).

So my thinking was to only show the most important data for smoke testing. But actually, it should be no problem to include the -Onone results. It should not add significant job run time.
Another option would be to keep two separate commands for smoke and regular (non-smoke) testing and include -Onone only in the regular test.

Erik_Eckstein · August 27, 2018, 11:38pm

"@swift-ci smoke benchmark" is now using the new benchmarking method. The non-smoke "@swift-ci benchmark" will follow soon.

Slava_Pestov · August 28, 2018, 1:20am

I think -Onone performance is really important because -Onone builds are much faster, and easier to debug. There is a lot of SILGen cleanup that would both simplify the implementation and produce better unoptimized code, and we should be tracking the performance as we do this work. Please make sure -Onone results are included in the full benchmark run at least.

Erik_Eckstein · August 28, 2018, 4:22pm

-Onone will be included in the full run. And I think we can also include it in the smoke run, because it only adds a few minutes of run time.