Measuring MEAN Performance (was: Questions about Swift-CI)

I have run Benchmark_O with --num-iters=100 on my machine for the whole
performance test suite, to get a feeling for the distribution of benchmark
samples, because I also want to move the Benchmark_Driver to use MEAN
instead of MIN in the analysis.

I'm concerned about that, especially for microbenchmarks; it seems to me
as though MIN is the right measurement. Can you explain why MEAN is
better?

Using MEAN wasn’t part of the aforementioned SR-4669. The purpose of that
task is to reduce the time CI takes to get useful results (e.g. by using 3
runs as a baseline). MEAN isn’t useful if you’re only gathering 3 data
points.

The current approach to detecting performance changes is fragile for tests
that have very low absolute runtime, as they easily exceed the 5%
improvement/regression threshold when the test machine gets a little bit
noisy. For example, in the benchmark run on PR #9806
<https://github.com/apple/swift/pull/9806#issuecomment-303370149>:

BitCount               12    14   +16.7%   0.86x
SuffixCountableRange   10    11   +10.0%   0.91x
MapReduce             303   331    +9.2%   0.92x

These are all false changes (and there are quite a few more there).
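
For illustration, here is the arithmetic behind those rows as I read the
report (a quick sanity check of mine, not the compare script itself):

# BitCount: OLD and NEW are the reported runtimes in microseconds.
old, new = 12, 14
print('%+.1f%%' % ((new - old) / float(old) * 100))  # +16.7%, well over the 5% threshold
print('%.2fx' % (old / float(new)))                  # 0.86x

A single tick of noise on a 12 us benchmark already reads as a 16.7%
regression.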

To partially address this issue (I'm guessing), the last SPEEDUP column
sometimes features a mysterious question mark in brackets. It's emitted when
the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is
not checked the other way around.

I'm suggesting we use the MEAN value in combination with the SD
(standard deviation) to detect changes (improvements/regressions). At the
moment this is hard to do, because the aggregate test results reported by
Benchmark_O (and co.) can include anomalous results in the sample
population that mess up the MEAN and SD, too. Currently this is only
visible as a high sample range - the difference between the reported MIN
and MAX - but it is not clear how many results are anomalous.
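
To make the MEAN/SD idea concrete, here is a minimal sketch of the kind of
check I have in mind (the function name and the 2-sigma cutoff are made up,
purely for illustration - not existing Benchmark_Driver code):

def is_significant_change(old_mean, old_sd, new_mean, new_sd, sigmas=2.0):
    # Flag a change only when the difference of MEANs exceeds `sigmas`
    # times the combined standard deviation of the two sample sets.
    combined_sd = (old_sd ** 2 + new_sd ** 2) ** 0.5
    return abs(new_mean - old_mean) > sigmas * combined_sd

The catch, as described above, is that anomalous samples inflate both the
MEAN and the SD, so some filtering of anomalous samples has to happen first.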

Currently I'm working on an improved sample-filtering algorithm. Stay tuned
for a demonstration in Benchmark_Driver (Python); if it pans out, it might
be time to change the adaptive sampling in DriverUtil.swift.

Best regards
Pavol Vaskovic

···

On Tue, May 16, 2017 at 9:10 PM, Dave Abrahams via swift-dev <swift-dev@swift.org> wrote:

on Thu May 11 2017, Pavol Vaskovic <swift-dev-AT-swift.org> wrote:

On Wed, May 17, 2017 at 1:26 AM, Andrew Trick <atrick@apple.com> wrote:

I have run Benchmark_O with --num-iters=100 on my machine for the
whole performance test suite, to get a feeling for the distribution of
benchmark samples, because I also want to move the Benchmark_Driver to
use MEAN instead of MIN in the analysis.

I'm concerned about that, especially for microbenchmarks; it seems to me
as though MIN is the right measurement. Can you explain why MEAN is
better?

Using MEAN wasn’t part of the aforementioned SR-4669. The purpose of that task is to reduce the time CI takes to get useful results (e.g. by using 3 runs as a baseline). MEAN isn’t useful if you’re only gathering 3 data points.

The current approach to detecting performance changes is fragile for tests that have very low absolute runtime, as they easily exceed the 5% improvement/regression threshold when the test machine gets a little bit noisy. For example, in the benchmark run on PR #9806 <https://github.com/apple/swift/pull/9806#issuecomment-303370149>:

BitCount               12    14   +16.7%   0.86x
SuffixCountableRange   10    11   +10.0%   0.91x
MapReduce             303   331    +9.2%   0.92x
These are all false changes (and there are quite a few more there).

To partially address this issue (I'm guessing), the last SPEEDUP column sometimes features a mysterious question mark in brackets. It's emitted when the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is not checked the other way around.

That bug must have been introduced during one of the rewrites. Is that in the driver or compare script? Why not fix that bug?

We clearly don’t want to see any false changes. The ‘?’ is a signal to me to avoid reporting those results. They should either be ignored as flaky benchmarks or rerun. I thought rerunning them was the fix you were working on.

If you have some other proposal for fixing this then please, in a separate proposal, explain your new approach, why your new approach works, and demonstrate its effectiveness with results that you’ve gathered over time on the side. Please don’t change how the driver computes performance changes on a whim while introducing other features.

I'm suggesting we use the MEAN value in combination with the SD (standard deviation) to detect changes (improvements/regressions). At the moment this is hard to do, because the aggregate test results reported by Benchmark_O (and co.) can include anomalous results in the sample population that mess up the MEAN and SD, too. Currently this is only visible as a high sample range - the difference between the reported MIN and MAX - but it is not clear how many results are anomalous.

I honestly don’t know what MEAN/SD has to do with the problem you’re pointing to above. The benchmark harness is already set up to compute the average iteration time, and our benchmarks are not currently designed to measure cache effects or any other phenomenon that would have a statistically meaningful sample distribution. Statistical methods might be interesting if you’re analyzing benchmark results over a long period of time or system noise levels across benchmarks.

The primary purpose of the benchmark suite is identifying performance bugs/regressions at the point they occur. It should be no more complicated than necessary to do that. The current approach is simple: run a microbenchmark long enough in a loop to factor out benchmark startup time, cache/cpu warmup effects, and timer resolution, then compute the average iteration time. Throw away any run that was apparently impacted by system noise.
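
In rough pseudo-Python, that measurement boils down to something like this
(a sketch for orientation only, not the actual DriverUtil.swift code):

import time

def sample(benchmark, num_iters):
    # Run the workload in a loop long enough to swamp startup cost,
    # cache/cpu warm-up and timer resolution, then report the average
    # per-iteration time for this one sample.
    start = time.perf_counter()
    for _ in range(num_iters):
        benchmark()
    return (time.perf_counter() - start) / num_iters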

We really have two problems:
1. spurious results
2. the turnaround time for the entire benchmark suite

Running benchmarks on a noisy machine is a losing proposition because you won’t be able to address problem #1 without making problem #2 much worse.

-Andy

···

On Jun 12, 2017, at 1:45 PM, Pavol Vaskovic <pali@pali.sk> wrote:
On Tue, May 16, 2017 at 9:10 PM, Dave Abrahams via swift-dev <swift-dev@swift.org> wrote:
on Thu May 11 2017, Pavol Vaskovic <swift-dev-AT-swift.org> wrote:
On Wed, May 17, 2017 at 1:26 AM, Andrew Trick <atrick@apple.com> wrote:

I have run Benchmark_O with --num-iters=100 on my machine for the
whole performance test suite, to get a feeling for the distribution of
benchmark samples, because I also want to move the Benchmark_Driver to
use MEAN instead of MIN in the analysis.

I'm concerned about that, especially for microbenchmarks; it seems to me
as though MIN is the right measurement. Can you explain why MEAN is
better?

Using MEAN wasn’t part of the aforementioned SR-4669. The purpose of that task is to reduce the time CI takes to get useful results (e.g. by using 3 runs as a baseline). MEAN isn’t useful if you’re only gathering 3 data points.

The current approach to detecting performance changes is fragile for tests that have very low absolute runtime, as they easily exceed the 5% improvement/regression threshold when the test machine gets a little bit noisy. For example, in the benchmark run on PR #9806 <https://github.com/apple/swift/pull/9806#issuecomment-303370149>:

BitCount               12    14   +16.7%   0.86x
SuffixCountableRange   10    11   +10.0%   0.91x
MapReduce             303   331    +9.2%   0.92x
These are all false changes (and there are quite a few more there).

The current design assumes that in such cases the workload will be increased, so that is not an issue.

The reason why we use the min is that statistically we are not interested in estimating the "mean" or "center" of the distribution. Rather, we are actually interested in the "speed of light" of the computation, implying that we are looking for the min.

To partially address this issue (I'm guessing), the last SPEEDUP column sometimes features a mysterious question mark in brackets. It's emitted when the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is not checked the other way around.

I'm suggesting we use the MEAN value in combination with the SD (standard deviation) to detect changes (improvements/regressions). At the moment this is hard to do, because the aggregate test results reported by Benchmark_O (and co.) can include anomalous results in the sample population that mess up the MEAN and SD, too. Currently this is only visible as a high sample range - the difference between the reported MIN and MAX - but it is not clear how many results are anomalous.

What do you mean by anomalous results?

Currently I'm working on an improved sample-filtering algorithm. Stay tuned for a demonstration in Benchmark_Driver (Python); if it pans out, it might be time to change the adaptive sampling in DriverUtil.swift.

Have you looked at using the Mann-Whitney U algorithm? (I am not sure if we are using it or not)

···

On Jun 12, 2017, at 1:45 PM, Pavol Vaskovic via swift-dev <swift-dev@swift.org> wrote:
On Tue, May 16, 2017 at 9:10 PM, Dave Abrahams via swift-dev <swift-dev@swift.org> wrote:
on Thu May 11 2017, Pavol Vaskovic <swift-dev-AT-swift.org> wrote:
On Wed, May 17, 2017 at 1:26 AM, Andrew Trick <atrick@apple.com> wrote:

That is also a valid fix for the problem, which I forgot to mention.
-Andy

···

On Jun 12, 2017, at 2:55 PM, Michael Gottesman <mgottesman@apple.com> wrote:

The current approach to detecting performance changes is fragile for tests that have very low absolute runtime, as they easily exceed the 5% improvement/regression threshold when the test machine gets a little bit noisy. For example, in the benchmark run on PR #9806 <https://github.com/apple/swift/pull/9806#issuecomment-303370149>:

BitCount               12    14   +16.7%   0.86x
SuffixCountableRange   10    11   +10.0%   0.91x
MapReduce             303   331    +9.2%   0.92x
These are all false changes (and there are quite a few more there).

The current design assumes that in such cases the workload will be increased, so that is not an issue.

Hi Andrew,

To partially address this issue (I'm guessing), the last SPEEDUP column
sometimes features a mysterious question mark in brackets. It's emitted when
the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is
not checked the other way around.

That bug must have been introduced during one of the rewrites. Is that in
the driver or compare script? Why not fix that bug?

That is in the compare script. It looks like the else branch got lost during
a rewrite (commit apple/swift@cb23837, "[Compare Perf] Add support for
markdown and html"; search for "(?)" in that diff). I could certainly fix
that too, but I'm not sure that would be enough to fix all our problems.

We clearly don’t want to see any false changes. The ‘?’ is a signal to me
to avoid reporting those results. They should either be ignored as flaky
benchmarks or rerun. I thought rerunning them was the fix you were working
on.

If you have some other proposal for fixing this then please, in a separate
proposal, explain your new approach, why your new approach works, and
demonstrate its effectiveness with results that you’ve gathered over time
on the side. Please don’t change how the driver computes performance
changes on a whim while introducing other features.

...

I honestly don’t know what MEAN/SD has to do with the problem you’re
pointing to above. The benchmark harness is already set up to compute the
average iteration time, and our benchmarks are not currently designed to
measure cache effects or any other phenomenon that would have a
statistically meaningful sample distribution. Statistical methods might be
interesting if you’re analyzing benchmark results over a long period of
time or system noise levels across benchmarks.

The primary purpose of the benchmark suite is identifying performance
bugs/regressions at the point they occur. It should be no more complicated
than necessary to do that. The current approach is simple: run a
microbenchmark long enough in a loop to factor out benchmark startup time,
cache/cpu warmup effects, and timer resolution, then compute the average
iteration time. Throw away any run that was apparently impacted by system
noise.

We really have two problems:
1. spurious results
2. the turnaround time for the entire benchmark suite

I don't think we can get more consistent test results just from re-running
tests that were detected as changes in the first pass, as described in
SR-4669, because that improves accuracy only for one side of the
comparison - the branch. When the measurement error is with the baseline
from master, re-running the branch would not help.

I have sketched an algorithm for getting more consistent test results; so
far it's in Numbers. I have run the whole test suite for 100 samples and
observed the varying distribution of test results. The first result is
quite often an outlier, with subsequent results being quicker. Depending on
the "weather" on the test machine, you sometimes measure anomalies. So I'm
tracking the coefficient of variation of the sample population and purging
anomalous results when it exceeds 5%. This results in a solid sample
population where the standard deviation is a meaningful value that can be
used in judging the significance of a change between master and branch.

This week I'm working on transferring this algorithm to Python and putting
it probably somewhere around `Benchmark_Driver`. It is possible this would
ultimately land in Swift (DriverUtil.swift), but to demonstrate the
soundness of this approach to you all, I wanted to do the Python
implementation first.

Depending on how this behaves, my hunch is we could speed up the benchmark
suite by not running test samples for 1 second each and taking many
samples, but by adaptively sampling each benchmark until we get a stable
sample population. In the worst case this would degrade to the current
(1s/sample) * num_samples. This could be further improved on by running
multiple passes through the test suite, to eliminate anomalies caused by
other background processes. That is the core idea from --rerun (SR-4669).

--Pavol

···

On Mon, Jun 12, 2017 at 11:55 PM, Andrew Trick <atrick@apple.com> wrote:

The current design assumes that in such cases the workload will be
increased, so that is not an issue.

I understand. But clearly some part of our process is failing, because
there are multiple benchmarks in the 10ms range that have been in the tree
for months without this being fixed.

The reason why we use the min is that statistically we are not interested
in estimating the "mean" or "center" of the distribution. Rather, we are
actually interested in the "speed of light" of the computation, implying
that we are looking for the min.

I understand that. But all measurements have a certain degree of error
associated with them. Our issue is two-fold: we need to differentiate
between normal variation between measured samples under "perfect"
conditions and samples that are worse because of interference from other
background processes.

What do you mean by anomalous results?

I mean results that significantly stand out from the measured sample
population.

Currently I'm working on an improved sample-filtering algorithm. Stay tuned
for a demonstration in Benchmark_Driver (Python); if it pans out, it might
be time to change the adaptive sampling in DriverUtil.swift.

Have you looked at using the Mann-Whitney U algorithm? (I am not sure if
we are using it or not)

I don't know what that is. Here's what I've been doing:

Depending on the "weather" on the test machine, you sometimes measure
anomalies. So I'm tracking the coefficient of variance from the sample
population and purging anomalous results (1 sigma from max) when it exceeds
5%. This results in quite solid sample population where standard deviation
is a meaningful value, that can be use in judging the significance of
change between master and branch.

--Pavol

···

On Mon, Jun 12, 2017 at 11:55 PM, Michael Gottesman <mgottesman@apple.com> wrote:

The current design assumes that in such cases the workload will be increased, so that is not an issue.

I understand. But clearly some part of our process is failing, because there are multiple benchmarks in the 10ms range that have been in the tree for months without this being fixed.

I think that is just inertia and being busy. Patch? I'll review = ).

The reason why we use the min is that statistically we are not interested in estimating the "mean" or "center" of the distribution. Rather, we are actually interested in the "speed of light" of the computation, implying that we are looking for the min.

I understand that. But all measurements have a certain degree of error associated with them. Our issue is two-fold: we need to differentiate between normal variation between measured samples under "perfect" conditions and samples that are worse because of interference from other background processes.

I disagree. CPUs are inherently messy but disruptions tend to be due to temporary spikes most of the time once you have quieted down your system by unloading a few processes.

What do you mean by anomalous results?

I mean results that significantly stand out from the measured sample population.

What that could mean is that we need to run a couple of extra iterations to warm up the cpu/cache/etc before we start gathering samples.

Currently I'm working on an improved sample-filtering algorithm. Stay tuned for a demonstration in Benchmark_Driver (Python); if it pans out, it might be time to change the adaptive sampling in DriverUtil.swift.

Have you looked at using the Mann-Whitney U algorithm? (I am not sure if we are using it or not)

I don't know what that is.

Check it out: Mann–Whitney U test - Wikipedia <https://en.wikipedia.org/wiki/Mann–Whitney_U_test>. It is a non-parametric test of whether two sets of samples are from the same distribution. As a bonus, it does not assume that our data is from a normal distribution (a problem with using mean/standard deviation, which assumes a normal distribution).

We have been using Mann-Whitney internally for a while successfully to reduce the noise.
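
For concreteness, the check can be as small as this (a sketch using SciPy;
the function name and the 0.05 significance level are placeholders, not our
internal tooling):

from scipy.stats import mannwhitneyu

def samples_differ(old_samples, new_samples, alpha=0.05):
    # Non-parametric test of whether the two sample sets come from the
    # same distribution; no normality assumption required.
    stat, p_value = mannwhitneyu(old_samples, new_samples,
                                 alternative='two-sided')
    return p_value < alpha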

···

On Jun 12, 2017, at 4:54 PM, Pavol Vaskovic <pali@pali.sk> wrote:
On Mon, Jun 12, 2017 at 11:55 PM, Michael Gottesman <mgottesman@apple.com> wrote:

Here's what I've been doing:

Depending on the "weather" on the test machine, you sometimes measure anomalies. So I'm tracking the coefficient of variance from the sample population and purging anomalous results (1 sigma from max) when it exceeds 5%. This results in quite solid sample population where standard deviation is a meaningful value, that can be use in judging the significance of change between master and branch.

--Pavol

Hi Andrew,

To partially address this issue (I'm guessing), the last SPEEDUP column sometimes features a mysterious question mark in brackets. It's emitted when the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is not checked the other way around.

That bug must have been introduced during one of the rewrites. Is that in the driver or compare script? Why not fix that bug?

That is in the compare script. It looks like the else branch got lost during a rewrite (commit apple/swift@cb23837, "[Compare Perf] Add support for markdown and html"; search for "(?)" in that diff). I could certainly fix that too, but I'm not sure that would be enough to fix all our problems.

We clearly don’t want to see any false changes. The ‘?’ is a signal to me to avoid reporting those results. They should either be ignored as flaky benchmarks or rerun. I thought rerunning them was the fix you were working on.

If you have some other proposal for fixing this then please, in a separate proposal, explain your new approach, why your new approach works, and demonstrate its effectiveness with results that you’ve gathered over time on the side. Please don’t change how the driver computes performance changes on a whim while introducing other features.
...
I honestly don’t know what MEAN/SD has to do with the problem you’re pointing to above. The benchmark harness is already set up to compute the average iteration time, and our benchmarks are not currently designed to measure cache effects or any other phenomenon that would have a statistically meaningful sample distribution. Statistical methods might be interesting if you’re analyzing benchmark results over a long period of time or system noise levels across benchmarks.

The primary purpose of the benchmark suite is identifying performance bugs/regressions at the point they occur. It should be no more complicated than necessary to do that. The current approach is simple: run a microbenchmark long enough in a loop to factor out benchmark startup time, cache/cpu warmup effects, and timer resolution, then compute the average iteration time. Throw away any run that was apparently impacted by system noise.

We really have two problems:
1. spurious results
2. the turnaround time for the entire benchmark suite

I don't think we can get more consistent test results just from re-running tests that were detected as changes in the first pass, as described in SR-4669, because that improves accuracy only for one side of the comparison - the branch. When the measurement error is with the baseline from master, re-running the branch would not help.

When we are benchmarking, we can always have access to the baseline compiler by stashing the build directory. So we can always take more samples (in fact when I was talking about re-running I always assumed we would).

···

On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic via swift-dev <swift-dev@swift.org> wrote:
On Mon, Jun 12, 2017 at 11:55 PM, Andrew Trick <atrick@apple.com> wrote:

I have sketched an algorithm for getting more consistent test results; so far it's in Numbers. I have run the whole test suite for 100 samples and observed the varying distribution of test results. The first result is quite often an outlier, with subsequent results being quicker. Depending on the "weather" on the test machine, you sometimes measure anomalies. So I'm tracking the coefficient of variation of the sample population and purging anomalous results when it exceeds 5%. This results in a solid sample population where the standard deviation is a meaningful value that can be used in judging the significance of a change between master and branch.

This week I'm working on transferring this algorithm to Python and putting it probably somewhere around `Benchmark_Driver`. It is possible this would ultimately land in Swift (DriverUtil.swift), but to demonstrate the soundness of this approach to you all, I wanted to do the Python implementation first.

Depending on how this behaves, my hunch is we could speed up the benchmark suite by not running test samples for 1 second each and taking many samples, but by adaptively sampling each benchmark until we get a stable sample population. In the worst case this would degrade to the current (1s/sample) * num_samples. This could be further improved on by running multiple passes through the test suite, to eliminate anomalies caused by other background processes. That is the core idea from --rerun (SR-4669).

--Pavol

My understanding of this feature is that it would rerun both branches (or possibly whichever is slower or more jittery, but that’s probably overcomplicating it).

-Andy

···

On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic <pali@pali.sk> wrote:

We really have two problems:
1. spurious results
2. the turnaround time for the entire benchmark suite

I don't think we can get more consistent test results just from re-running tests that were detected as changes in the first pass, as described in SR-4669, because that improves accuracy only for one side of the comparison - the branch. When the measurement error is with the baseline from master, re-running the branch would not help.

That’s a reasonable approach for running 100 samples. I’m not sure how it fits with the goal of minimizing turnaround time. Typically you don’t need more than 3 samples (keeping in mind we’re usually averaging over thousands of iterations per sample).

-Andy

···

On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic <pali@pali.sk> wrote:

I have sketched an algorithm for getting more consistent test results; so far it's in Numbers. I have run the whole test suite for 100 samples and observed the varying distribution of test results. The first result is quite often an outlier, with subsequent results being quicker. Depending on the "weather" on the test machine, you sometimes measure anomalies. So I'm tracking the coefficient of variation of the sample population and purging anomalous results when it exceeds 5%. This results in a solid sample population where the standard deviation is a meaningful value that can be used in judging the significance of a change between master and branch.

This is a fairly important point that I didn’t stress enough. In my experience with other benchmark suites, the sample distribution is nothing close to normal, which is why I’ve always thought MEAN/SD was silly. But the “noise” I was dealing with was in the underlying H/W and OS mode transitions. General system noise from other processes might lead to a more normal distribution… but as I’ve said, benchmarking on a noisy system is something to be avoided.

-Andy

···

On Jun 12, 2017, at 5:29 PM, Michael Gottesman <mgottesman@apple.com> wrote:

I don't know what that is.

Check it out: Mann–Whitney U test - Wikipedia <https://en.wikipedia.org/wiki/Mann–Whitney_U_test>. It is a non-parametric test of whether two sets of samples are from the same distribution. As a bonus, it does not assume that our data is from a normal distribution (a problem with using mean/standard deviation, which assumes a normal distribution).

Hi Andrew,

I wrote the draft of this e-mail a few weeks ago, and the following sentence
is not true:

It's emitted when the new MIN falls inside the (MIN..MAX) range of the OLD
baseline. It is not checked the other way around.

See below...

Hi Andrew,

To partially address this issue (I'm guessing), the last SPEEDUP column
sometimes features a mysterious question mark in brackets. It's emitted when
the new MIN falls inside the (MIN..MAX) range of the OLD baseline. It is
not checked the other way around.

That bug must have been introduced during one of the rewrites. Is that in
the driver or compare script? Why not fix that bug?

That is in the compare script. It looks like the else branch got lost during
a rewrite (commit apple/swift@cb23837, "[Compare Perf] Add support for
markdown and html"; search for "(?)" in that diff). I could certainly fix
that too, but I'm not sure that would be enough to fix all our problems.

I even wrote tests for this
<https://github.com/apple/swift/blob/686c7619922d0de2bfdb3a0ec988b2c3c6bb60b5/benchmark/scripts/test_compare_perf_tests.py#L162>,
and my implementation is pretty clear too
<https://github.com/apple/swift/blob/e7b243cad7e27af63e6c25cd426fdc5359c4e51d/benchmark/scripts/compare_perf_tests.py#L135>…
somehow I forgot this.

# Add ' (?)' to the speedup column as indication of dubious changes:
# result's MIN falls inside the (MIN, MAX) interval of result they are
# being compared with.
self.is_dubious = (
    ' (?)' if ((old.min < new.min and new.min < old.max) or
               (new.min < old.min and old.min < new.max))
    else '')

I'm sorry for the confusion.

--Pavol

···

On Tue, Jun 13, 2017 at 1:45 AM, Pavol Vaskovic <pali@pali.sk> wrote:

On Mon, Jun 12, 2017 at 11:55 PM, Andrew Trick <atrick@apple.com> wrote:

Well, if I understand correctly how swift-CI builds the perf-PR, then
switching between master and branch from Benchmark_Driver is not possible...

Or are you thinking about a manual benchmarking scenario?

--Pavol

···

On Tue, Jun 13, 2017 at 2:31 AM, Michael Gottesman <mgottesman@apple.com> wrote:

I don't think we can get more consistent test results just from re-running
tests that were detected as changes in the first pass, as described in
SR-4669, because that improves accuracy only for one side of the
comparison - the branch. When the measurement error is with the baseline
from master, re-running the branch would not help.

When we are benchmarking, we can always have access to the baseline
compiler by stashing the build directory. So we can always take more
samples (in fact when I was talking about re-running I always assumed we
would).

As the next two paragraphs after the part you quoted go on explaining, I'm
hoping that with this approach we could adaptively sample the benchmark
until we get a stable population, but starting from a lower iteration count.

If the Python implementation bears this out, the proper solution would be
to change the implementation in DriverUtil.swift from the current adaptive
num-iters targeting ~1s runs to finer-grained runs. We'd be gathering more,
smaller samples, tossing out anomalies as we go, until we gather a stable
sample population (with a low coefficient of variation) or run out of the
allotted time.
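
As a rough Python sketch of that loop (made-up parameter values, and it
glosses over how a single short sample is taken - not the eventual
DriverUtil.swift change):

import statistics
import time

def adaptive_samples(measure_once, min_samples=5, max_time=1.0,
                     cv_threshold=0.05):
    # Keep taking short samples; whenever the population is too noisy,
    # toss the slowest sample as an anomaly and keep going, until the
    # coefficient of variation is low enough or the time budget runs out.
    samples = []
    deadline = time.perf_counter() + max_time
    while time.perf_counter() < deadline:
        samples.append(measure_once())
        if len(samples) < min_samples:
            continue
        cv = statistics.stdev(samples) / statistics.mean(samples)
        if cv <= cv_threshold:
            break                      # stable population, stop early
        samples.remove(max(samples))   # purge the anomaly
    return samples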

This has the potential to speed up the benchmark suite with more
intelligent management of the measurements, instead of using the brute
force of super-long runtimes to drown out the errors as we do currently.

(I am aware of various aspects this approach might introduce that have the
potential to mess with the caching: time measurement itself, more frequent
logging - this would currently rely on --verbose mode, invoking Benchmark_O
from Python…)

The proof is in the pudding, so I guess we'll learn this week whether this
approach works, when I hammer out the implementation in Python for
demonstration.

--Pavol

···

On Tue, 13 Jun 2017 at 03:19, Andrew Trick <atrick@apple.com> wrote:

On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic <pali@pali.sk> wrote:

I have sketched an algorithm for getting more consistent test results; so
far it's in Numbers. I have run the whole test suite for 100 samples and
observed the varying distribution of test results. The first result is
quite often an outlier, with subsequent results being quicker. Depending on
the "weather" on the test machine, you sometimes measure anomalies. So I'm
tracking the coefficient of variation of the sample population and purging
anomalous results when it exceeds 5%. This results in a solid sample
population where the standard deviation is a meaningful value that can be
used in judging the significance of a change between master and branch.

That’s a reasonable approach for running 100 samples. I’m not sure how it
fits with the goal of minimizing turnaround time. Typically you don’t need
more than 3 samples (keeping in mind we’re usually averaging over thousands
of iterations per sample).

-Andy

I was thinking (hoping) Benchmark_Driver would support this and we could ask for support from CI to call the driver that way.

-Andy

···

On Jun 12, 2017, at 5:55 PM, Pavol Vaskovic <pali@pali.sk> wrote:

On Tue, Jun 13, 2017 at 2:31 AM, Michael Gottesman <mgottesman@apple.com> wrote:

I don't think we can get more consistent test results just from re-running tests that were detected as changes in the first pass, as described in SR-4669, because that improves accuracy only for one side of the comparison - the branch. When the measurement error is with the baseline from master, re-running the branch would not help.

When we are benchmarking, we can always have access to the baseline compiler by stashing the build directory. So we can always take more samples (in fact when I was talking about re-running I always assumed we would).

Well, if I understand correctly how swift-CI builds the perf-PR, then switching between master and branch from Benchmark_Driver is not possible...

Or are you thinking about a manual benchmarking scenario?

--Pavol

As the next two paragraphs after the part you quoted go on explaining, I'm hoping that with this approach we could adaptively sample the benchmark until we get a stable population, but starting from a lower iteration count.

If the Python implementation bears this out, the proper solution would be to change the implementation in DriverUtil.swift from the current adaptive num-iters targeting ~1s runs to finer-grained runs. We'd be gathering more, smaller samples, tossing out anomalies as we go, until we gather a stable sample population (with a low coefficient of variation) or run out of the allotted time.

~1s might be longer than necessary for the benchmarks with cheap setup. Another option is for the benchmark to call back to the Driver’s “start button” after setup. With no setup work, I think 200 ms is a bare minimum if we care about changes in the 1% range.

I’m confused though because I thought we agreed that all samples need to run with exactly the same number of iterations. So, there would be one short run to find the desired num_iters for each benchmark, then each subsequent invocation of the benchmark harness would be handed num_iters as input.

-Andy

···

On Jun 12, 2017, at 10:36 PM, Pavol Vaskovic <pali@pali.sk> wrote:

This has the potential to speed up the benchmark suite with more intelligent management of the measurements, instead of using the brute force of super-long runtimes to drown out the errors as we do currently.

(I am aware of various aspects this approach might introduce that have the potential to mess with the caching: time measurement itself, more frequent logging - this would currently rely on --verbose mode, invoking Benchmark_O from Python…)

The proof is in the pudding, so I guess we'll learn this week whether this approach works, when I hammer out the implementation in Python for demonstration.

--Pavol

On Tue, 13 Jun 2017 at 03:19, Andrew Trick <atrick@apple.com> wrote:

On Jun 12, 2017, at 4:45 PM, Pavol Vaskovic <pali@pali.sk> wrote:

I have sketched an algorithm for getting more consistent test results; so far it's in Numbers. I have run the whole test suite for 100 samples and observed the varying distribution of test results. The first result is quite often an outlier, with subsequent results being quicker. Depending on the "weather" on the test machine, you sometimes measure anomalies. So I'm tracking the coefficient of variation of the sample population and purging anomalous results when it exceeds 5%. This results in a solid sample population where the standard deviation is a meaningful value that can be used in judging the significance of a change between master and branch.

That’s a reasonable approach for running 100 samples. I’m not sure how it fits with the goal of minimizing turnaround time. Typically you don’t need more than 3 samples (keeping in mind we’re usually averaging over thousands of iterations per sample).

-Andy

That was agreed on in the discussion about measuring memory consumption
(PR 8793) <https://github.com/apple/swift/pull/8793#issuecomment-297834790>.
MAX_RSS was variable between runs, due to dynamic `num_iters` adjustment
inside `DriverUtils` to fit the ~1s budget.

This could work for keeping num_iters the same during a comparison between
[master] and [branch], given we logged the num_iters from [master] and
used them to drive the [branch] MAX_RSS measurement. I don't know how to
extend this to make memory consumption comparable between different
measurement runs (over time...), though.

--Pavol

···

On Tue, Jun 13, 2017 at 8:51 AM, Andrew Trick <atrick@apple.com> wrote:

I’m confused though because I thought we agreed that all samples need to
run with exactly the same number of iterations. So, there would be one
short run to find the desired `num_iters` for each benchmark, then each
subsequent invocation of the benchmark harness would be handed `num_iters`
as input.

So I did a bit more research. Check out how LNT does this:

llvm-mirror/lnt on GitHub: <https://github.com/llvm-mirror/lnt/search?utf8=✓&q=mann-whitney&type=> (a search for "mann-whitney" in the repository)

I talked with Chris Matthews (+CC) about how LNT uses Mann-Whitney. In the following let n be the number of samples taken. From what he told me this is what LNT does:

1. If n is < 5, then some sort of computation around confidence intervals is used.
2. If the number of samples is > 5, then Mann-Whitney U is done.

I am not 100% sure what 1 is, but I think it has to do with some sort of quartile measurements, i.e. find the median of the new data and make sure it is within ± the median absolute deviation (basically mean + std-dev, but more robust to errors). I believe the code is in LNT, so we can find it for sure.

Thus, in my mind, the natural experiment here in terms of Mann-Whitney U is:

1. This seems to suggest that for small numbers of samples we do the sort of simple comparison we do today, and if a regression is "identified", we grab more samples of before/after and run Mann-Whitney U (see the sketch after this list).
2. Try out different values of n. I am not 100% sure if 5 is the right or wrong answer, or if it should be dependent on the test.
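
If I read that right, the decision logic would be roughly the following (a
sketch of my understanding, not LNT's actual code; the MAD check in
particular is a guess):

import statistics
from scipy.stats import mannwhitneyu

def detect_regression(old, new, alpha=0.05):
    if min(len(old), len(new)) < 5:
        # Few samples: compare the new median against the old median
        # plus the median absolute deviation (the robust analogue of
        # mean + std-dev mentioned above).
        old_median = statistics.median(old)
        mad = statistics.median([abs(x - old_median) for x in old])
        return statistics.median(new) > old_median + mad
    # Enough samples: fall back to the Mann-Whitney U test.
    _, p_value = mannwhitneyu(old, new, alternative='two-sided')
    return p_value < alpha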

Chris, did I get it right?

Michael

···

On Jun 13, 2017, at 7:11 AM, Pavol Vaskovic via swift-dev <swift-dev@swift.org> wrote:

On Tue, Jun 13, 2017 at 8:51 AM, Andrew Trick <atrick@apple.com> wrote:
I’m confused though because I thought we agreed that all samples need to run with exactly the same number of iterations. So, there would be one short run to find the desired `num_iters` for each benchmark, then each subsequent invocation of the benchmark harness would be handed `num_iters` as input.

That was agreed on in the discussion about measuring memory consumption (PR 8793) <https://github.com/apple/swift/pull/8793#issuecomment-297834790>. MAX_RSS was variable between runs, due to dynamic `num_iters` adjustment inside `DriverUtils` to fit the ~1s budget.

This could work for keeping num_iters the same during a comparison between [master] and [branch], given we logged the num_iters from [master] and used them to drive the [branch] MAX_RSS measurement. I don't know how to extend this to make memory consumption comparable between different measurement runs (over time...), though.

--Pavol
_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev