To partially address this issue (I'm guessing) the last SPEEDUP column
sometimes features mysterious question mark in brackets. Its emitted when
the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is
not checked the other way around.
That bug must have been introduced during one of the rewrites. Is that in
the driver or compare script? Why not fix that bug?
That is in the compare script. It looks like the else branch got lost during
for "(?)" in that diff). I could certainly fix that too, but I'm not sure
that would be enough to fix all our problems.
We clearly don’t want to see any false changes. The ‘?’ is a signal to me
to avoid reporting those results. They should either be ignored as flaky
benchmarks or rerun. I thought rerunning them was the fix you were working
If you have some other proposal for fixing this then please, in a separate
proposal, explain your new approach, why your new approach works, and
demonstrate it’s effectiveness with results that you’ve gathered over time
on the side. Please don’t change how the driver computes performance
changes on a whim while introducing other features.
I honestly don’t know what MEAN/SD has to do with the problem you’re
pointing to above. The benchmark harness is already setup to compute the
average iteration time, and our benchmarks are not currently designed to
measure cache effects or any other phenomenon that would have a
statistically meaningful sample distribution. Statistical methods might be
interesting if you’re analyzing benchmark results over a long period of
time or system noise levels across benchmarks.
The primary purpose of the benchmark suite is identifying performance
bugs/regressions at the point they occur. It should be no more complicated
than necessary to do that. The current approach is simple: run a
microbenchmark long enough in a loop to factor out benchmark startup time,
cache/cpu warmup effects, and timer resolution, then compute the average
iteration time. Throw away any run that was apparently impacted by system
We really have two problems:
1. spurious results
2. the turnaround time for the entire benchmark suite
I don't think we can get more consistent test results just from re-running
tests that were detected as changes in the first pass, as described in
SR-4669 <https://bugs.swift.org/browse/SR-4669>, because that improves
accuracy only for one side of the comparison - the branch. When the
measurement error is with the baseline from the master, re-running the
branch would not help.
I have sketched an algorithm for getting more consistent test results, so
far its in Numbers. I have ran the whole test suite for 100 samples and
observed the varying distribution of test results. The first result is
quite often an outlier, with subsequent results being quicker. Depending on
the "weather" on the test machine, you sometimes measure anomalies. So I'm
tracking the coefficient of variance from the sample population and purging
anomalous results when it exceeds 5%. This results in solid sample
population where standard deviation is a meaningful value, that can be use
in judging the significance of change between master and branch.
This week I'm working on transferring this algorithm to Python and putting
it probably somewhere around `Benchmark_Driver`. It is possible this would
ultimately land in Swift (DriverUtil.swift), but to demonstrate the
soundness of this approach to you all, I wanted to do the Python
Depending on how this behaves, my hunch is we could speed up the benchmark
suite, by not running test samples for 1 second and taking many samples,
but to adaptively sample each benchmark until we get a stable sample
population. In worst case this would degrade to current
(1s/sample)*num_samples. This could be further improved on by running
multiple passes through the test suite, to eliminate anomalies caused by
other background processes. That is the core idea from --rerun (SR-4669).
On Mon, Jun 12, 2017 at 11:55 PM, Andrew Trick <firstname.lastname@example.org> wrote: