Towards Robust Performance Measurement

Progress Report

(I’m sorry for the delays.) I’m happy to announce that by the beginning of March 2018, a series of PRs that applied a legacy factor across the whole Swift Benchmark Suite (SBS) has lowered the workloads of individual benchmarks so that they execute in the 20–1000 μs range. My thanks go to @Erik_Eckstein for his patient review and guidance.

The point of this modification was to strengthen the resilience of our measurement process against accumulated errors caused by the context switches that happen every 10 ms on macOS. Here’s a box plot visualization (see also the interactive chart) of the situation before:

Notice the streaking that starts at approximately 2500 μs on the exponential X axis for the datasets collected on a more contested CPU. This is the result of accumulated errors from the uncontrolled variable of system load. On a contested machine, measuring a benchmark with a runtime longer than 10 ms will always produce corrupted samples, because each sample necessarily spans a context switch and therefore includes the execution of a different process; a sample shorter than the scheduler quantum at least has a chance to run from start to finish uninterrupted. For example, the CI bots are running Jenkins, and the Java VM will do its parallel garbage collection. To mitigate this issue, we have modified the design of the SBS so that all benchmarks should now run in the microbenchmark range of 20–1000 μs. The upper limit is well below the 10 ms scheduler quantum to give us some headroom for more robust measurements of -Onone builds as well as for running on Linux, where the scheduler quantum is 6 ms.

Here’s the current situation after the application of the legacy factor:

These measurements were taken on my ancient 2008 MBP, so some benchmarks run over 1000 μs, but that’s OK — they are still plenty fast to be measured robustly even there, and they run well within the limits on modern hardware. The outliers on the tail are a few existential sequence benchmarks (AnyCollection variants) that should probably be disabled anyway.

Improved Robustness

By lowering the multiplication constants used to size the workloads in the for loops of individual benchmarks and moving them into the legacyFactor, we have effectively increased the sampling frequency. Reporting the actually measured runtime multiplied by the legacy factor allows us to maintain the continuity of performance tracking across Swift releases.
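To make the mechanism concrete, here is a minimal sketch of what this change looks like in a single benchmark, following the BenchmarkInfo API from the suite’s TestsUtils; the benchmark name, body and constants are illustrative, not taken from an actual PR:

```swift
import TestsUtils

// The factor of 100 removed from the inner loop is declared as legacyFactor;
// the reported runtime is the measured runtime multiplied by it, preserving
// comparability with historical results. (Illustrative numbers.)
public let ExampleBenchmark = BenchmarkInfo(
  name: "ExampleBenchmark",
  runFunction: run_ExampleBenchmark,
  tags: [.validation],
  legacyFactor: 100)

@inline(never)
public func run_ExampleBenchmark(_ N: Int) {
  var checksum = 0
  for i in 1...N * 10 {   // was 1...N * 1_000 before extracting the factor
    checksum &+= i        // stand-in for the real workload
  }
  guard checksum != 0 else { fatalError("unexpected result") }
}
```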

Shorter runtimes have a much lower chance of producing corrupted samples. Gathering many more samples allows us to better detect and exclude the outliers. This is a major improvement to the robustness of the measurement process (it is much more resilient to the uncontrolled variable: system load). Here’s a zoomed look at the middle part of the previous chart:

To recap the nature of the dataset measured on a 2-core CPU: series a10 (minimized console window) and b10 were measured with 1 benchmark process. The c10 series is 2 benchmark processes running in parallel. For the series simulating the worst-case scenarios, d10 is 3 and e10 is 4 parallel benchmarks running on the 2-core CPU. The Clean modifier means that the outliers (samples above the Q3 + 1.5 * IQR threshold) were removed, as sketched in the code below.
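For concreteness, here is a minimal sketch of that outlier filter; the function name and the quartile interpolation convention are my own illustration, not the suite’s exact implementation:

```swift
/// Removes samples above the Q3 + 1.5 * IQR threshold.
func removeOutliers(_ samples: [Double]) -> [Double] {
    let sorted = samples.sorted()
    guard sorted.count > 1 else { return sorted }
    // Linear interpolation between nearest ranks (one common convention).
    func quartile(_ q: Double) -> Double {
        let position = q * Double(sorted.count - 1)
        let lower = Int(position)
        let fraction = position - Double(lower)
        guard lower + 1 < sorted.count else { return sorted[lower] }
        return sorted[lower] * (1 - fraction) + sorted[lower + 1] * fraction
    }
    let q1 = quartile(0.25)
    let q3 = quartile(0.75)
    let threshold = q3 + 1.5 * (q3 - q1)  // Q3 + 1.5 * IQR
    return sorted.filter { $0 <= threshold }
}
```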

Shorter Measurement Times

As a side effect, this benchmark cleanup also results in a much shorter execution of the whole SBS. The Benchmark_Driver now takes measurements by passing --num-iters=1 (we have fixed the setup overhead by introducing setUpFunctions where necessary) to Benchmark_O, which was also modified to stop collection after 200 samples. This means that a single pass through the commit set of benchmarks from the SBS takes less than 4 minutes on my machine, with an average of 0.3 s per benchmark, and the majority of them collect 200 individual samples.
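In pseudo-Swift, the measurement strategy amounts to something like the following sketch; the function names and timing mechanism here are illustrative, and Benchmark_O’s actual implementation differs in its details:

```swift
import Dispatch

// One sample = one timed run of the benchmark with num-iters = 1;
// collection stops after 200 samples.
func collectSamples(of runFunction: (Int) -> Void,
                    maxSamples: Int = 200) -> [UInt64] {
    var samples: [UInt64] = []
    samples.reserveCapacity(maxSamples)
    for _ in 0..<maxSamples {
        let start = DispatchTime.now().uptimeNanoseconds
        runFunction(1)  // --num-iters=1: a single iteration per sample
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append((end - start) / 1_000)  // runtime in μs
    }
    return samples
}
```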

In my opinion, the demonstrated robustness of this measurement process will allow us to run the benchmarks in parallel on multi-core machines in the future, to further lower the time it takes to collect a statistically relevant sample set.

Note: The Benchmark_Driver is a different measurement process from the run_smoke_bench.py currently used by the CI bots. It’s an evolution of the original method that was used there before.
