Towards Robust Performance Measurement

Oh, so this was a stable new value (within a given build) for an unmodified benchmark, caused by the modification or addition of a different benchmark?

exactly

OK, so the "noisy" benchmarks you mentioned above, caused by the code alignment issue, were genuine false positives when comparing SBS results between two branches. When I was talking about the source of noise, I was referring to high sample variance mostly caused by preemptive multitasking within a single SBS measurement. These interruptions corrupt the reported runtimes when measuring with num-iters higher than one, because those numbers are not raw samples but averages of num-iters true samples. They already include the cumulative error. The higher the runtime, the more accumulated error it contains.
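To illustrate what I mean, here is a minimal sketch (not the actual DriverUtils.swift code; the numbers are made up) of how averaging over num-iters folds a one-off preemption into the reported sample:

```swift
// Sketch: how averaging over num-iters smears a single interruption
// across the reported sample. Values are hypothetical, in microseconds.
let trueRuntime = 500.0              // what the benchmark really takes per iteration
let contextSwitchPenalty = 10_000.0  // one 10 ms preemption during the measurement

func reportedSample(numIters: Int) -> Double {
    // The total elapsed time includes the interruption exactly once...
    let elapsed = Double(numIters) * trueRuntime + contextSwitchPenalty
    // ...but dividing by numIters spreads it across the reported "sample",
    // so the error can no longer be isolated and discarded.
    return elapsed / Double(numIters)
}

print(reportedSample(numIters: 1))   // 10500.0 – an obvious outlier, easy to filter
print(reportedSample(numIters: 20))  // 1000.0  – looks plausible, yet is 2× the true runtime
```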

The traditional advice there is to get a calm machine (quite impractical for a machine running CI) or to increase the measurement time and hope the noise gets averaged out. The second piece of advice is common benchmarking folklore which, based on the data in my report, I find to be totally misguided. The brute-force "solution" of running long doesn't solve the quality problem, it only prolongs the time it takes to get the wrong answer. This is what we do now with the 20 one-second samples per benchmark in the full @swift-ci benchmark measurement.

Measuring with num-iters=1 fixes this issue, but exposes the outliers, which should be filtered out.

The approach you took with run_smoke_bench.py is definitely a step in the right direction! The use of --sample-time=0.0025 means it is now averaging samples (and unnecessarily accumulating errors) only for runtimes under 2500 μs; longer runtimes effectively end up with a num-iters=1 measurement. That's good. But as soon as you get over 10000 μs, each sample is guaranteed to include error from switching to another process, and that's outside of our control.
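For reference, this is roughly the idea behind the automatic iteration scaling as I understand it (a simplified sketch, not the actual Benchmark_O code):

```swift
// Sketch: num-iters is derived from how long one iteration takes relative to
// the requested sample time; anything slower than sampleTime gets num-iters = 1.
func numIters(oneIterationTakes oneIter: Double, sampleTime: Double) -> Int {
    guard oneIter < sampleTime else { return 1 }
    return max(1, Int(sampleTime / oneIter))
}

let sampleTime = 2_500.0  // μs, i.e. --sample-time=0.0025
print(numIters(oneIterationTakes: 100, sampleTime: sampleTime))    // 25 – averaged sample
print(numIters(oneIterationTakes: 5_000, sampleTime: sampleTime))  // 1  – raw sample
```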

But it does not do enough to significantly improve resilience in the presence of noise to enable parallel benchmarking. 184 benchmarks exceed 10 ms runtime on the beastly CI machines, and many more do on normal computers... The infuriating thing is that almost all of them are artificially high only because of an incorrectly selected inner-loop multiplier, not because of anything inherent to the measured task. I'm strongly convinced we need to fix these benchmarks, both for improved measurement quality and for much shorter measurement time. I'm leaning towards 1000 μs as the upper limit for optimized benchmarks and 10000 μs for Onone.

@Erik_Eckstein, could you please address these questions?

Interim Progress Report

With @Erik_Eckstein's help, PR #18124 has landed on master, implementing measurement of a benchmark's memory use that excludes the overhead from the infrastructure. It is now reported directly by Benchmark_O in the last column when running with the --memory option.

This has now also been integrated into the Benchmark_Driver in PR #18719, along with much improved unit test coverage of the whole benchmarking infrastructure. Building on the refactored BenchmarkDriver, a new benchmark validation command, check, was introduced. It is implemented in the BenchmarkDoctor class.

PR #18924 wrapped up the refactoring and full unit test coverage of Benchmark_Driver's run command. And PR #19011 is about to improve resilience on a contended machine by being a nice process and yielding the CPU when the scheduled time slice expires.
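Without getting ahead of that PR, the general idea is simply to give up the CPU voluntarily between samples, so a sample doesn't start right before the scheduler would preempt us anyway. A rough sketch of that idea (not the actual PR code):

```swift
import Darwin
import Dispatch

// Sketch: yield before taking each sample, so the measurement starts with a
// fresh time slice instead of inheriting a nearly expired one.
func measure(samples: Int, _ body: () -> Void) -> [Double] {
    var runtimes: [Double] = []
    for _ in 0..<samples {
        sched_yield()  // be a nice process: let others run first
        let start = DispatchTime.now().uptimeNanoseconds
        body()
        let end = DispatchTime.now().uptimeNanoseconds
        runtimes.append(Double(end - start) / 1_000)  // μs
    }
    return runtimes
}
```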

Benchmark Validation

The check command accepts the same benchmark filtering options as the run command, so you can validate individual benchmarks, benchmarks matching regular-expression filters, or the whole suite. The idea is that benchmarks that are part of the Swift Benchmark Suite are required to follow a set of rules that ensure quality measurements (a toy sketch of the naming checks follows the list below). These include:

  • name matches UpperCamelCase naming convention
  • name is at most 40 characters long (to prevent report tables on GitHub from overflowing)
  • robustness when varying execution parameters like num-iters and num-samples:
    • no setup overhead
    • constant memory consumption
    • runtime under a sensible threshold (currently 2500 μs; I'd like to go down to 1000 μs)
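To give a concrete flavor of the simplest of these rules, here is a toy sketch of the naming checks (the real implementation lives in the BenchmarkDoctor class; the regex here is only an illustration):

```swift
import Foundation

// Toy sketch of the naming rules: UpperCamelCase and at most 40 characters.
func namingIssues(for name: String) -> [String] {
    var issues: [String] = []
    if name.range(of: "^[A-Z][a-zA-Z0-9]*$", options: .regularExpression) == nil {
        issues.append("'\(name)' doesn't match the UpperCamelCase naming convention")
    }
    if name.count > 40 {
        issues.append("'\(name)' is longer than 40 characters")
    }
    return issues
}

print(namingIssues(for: "DropWhileArrayLazy"))  // []
print(namingIssues(for: "drop_while_array"))    // naming convention violation
```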

When the check command is run from the console, it uses a color-coded log to report on the health of the tested benchmarks. When redirected to a file, it uses a logging format with log-level prefixes. Here's a verbose log (including the DEBUG level) from diagnosing the current Swift Benchmark Suite on my machine.

My plan is to make passing these checks mandatory first for newly added benchmarks, but before that's enforced, we should probably build broader consensus about these rules, as there wasn't much debate about the specifics of my initial proposal.

Constant Memory Use

I have a problem with the constant memory use rule. The main point of it is to make sure that a benchmark doesn't vary the size of its workload with varying num-iters. This works well, but there is a secondary warning about tests with high variance, meaning there is unusually high variance in the memory used between some of the 10 independent measurements of the same benchmark. Currently I'm just comparing it to a fixed threshold of 15 pages (15 × 4096 B ≈ 60 kB), based on the 13-page variance observed in my initial report. I initially thought the variance would be a function of memory used, i.e. that using more memory would mean higher variance. But this doesn't appear to be the case. It seems to be related to how the benchmark is written. Here's a Numbers spreadsheet with memory pages extracted from the check.log:

The measurements were obtained by running the benchmarks 5 times with num-iters=1 (i1) and 5 times with num-iters=2 (i2). The columns are as follows: min i1 and min i2 are the minimum number of pages used by the benchmark for the given iteration count, 𝚫 is their difference and min min is the smaller of the two. R stands for range, i.e. the number of pages between the min and max for a given iteration count. Finally, max R is the bigger of the two ranges.
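In case it helps to follow along, this is roughly how the columns are derived from the raw page counts (a sketch with hypothetical numbers, not the actual extraction script):

```swift
// Sketch: deriving the spreadsheet columns from 5 runs with num-iters=1 (i1)
// and 5 runs with num-iters=2 (i2). Page counts below are hypothetical.
struct MemoryColumns {
    let minI1, minI2, delta, minMin, rangeI1, rangeI2, maxRange: Int
}

func columns(i1: [Int], i2: [Int]) -> MemoryColumns {
    let minI1 = i1.min()!, minI2 = i2.min()!
    let rangeI1 = i1.max()! - minI1, rangeI2 = i2.max()! - minI2
    return MemoryColumns(
        minI1: minI1, minI2: minI2,
        delta: abs(minI2 - minI1),        // 𝚫: workload size shouldn't depend on num-iters
        minMin: min(minI1, minI2),
        rangeI1: rangeI1, rangeI2: rangeI2,
        maxRange: max(rangeI1, rangeI2))  // max R: run-to-run variance for the same setting
}

let c = columns(i1: [5, 7, 19, 6, 9], i2: [6, 8, 20, 7, 9])
print(c.delta, c.maxRange)  // 1 14 – constant workload size, but suspiciously high variance
```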

The spreadsheet is filtered to hide benchmarks that use less than 20 pages of memory, but to show those with high variance or variable size. Red 𝚫s mark incorrectly written benchmarks that vary the size of the workload based on the number of iterations.

As demonstrated by the SequenceAlgos family of benchmarks, even using 15.7 MB per run doesn't necessarily imply high memory-use variance. On the other hand, I've noticed that benchmarks involving Array have unstable variance when running the same test. Sometimes they stay well under 15 pages, sometimes they overshoot. What is weird is that the variance can be 300% of the min! See DropWhileArrayLazy, which can use as little as 5 pages, but has a range of 14! This all smells fishy to me... Can somebody with more knowledge about how the Array implementation allocates memory chime in, please?

Setup Overhead

Apropos the Array weirdness… I have re-run the measurements from my report and noticed that benchmarks that used to have 4–6 μs setup overhead due to small Array initialization, like DropFirstArrayLazy, now have 14 μs overhead. This means that since November, the cost of Array(0..<4096) has gone up roughly 3.5×!!! What happened? Did no benchmarks catch this directly?
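For context, this is the kind of fix the setup overhead rule pushes for: moving the input construction into a setUpFunction so it isn't timed inside every sample. A sketch with made-up names (it only compiles inside the benchmark suite, where BenchmarkInfo comes from TestsUtils):

```swift
import TestsUtils

var array: [Int] = []

// Sketch: the Array(0..<4096) construction is paid once during setup,
// not inside every measured sample.
public let DropFirstArraySketch = BenchmarkInfo(
  name: "DropFirstArraySketch",
  runFunction: run_DropFirstArraySketch,
  tags: [.validation, .api, .Array],
  setUpFunction: { array = Array(0..<4096) })  // excluded from the measured runtime

public func run_DropFirstArraySketch(_ N: Int) {
  var sum = 0
  for _ in 1...N {
    for element in array.dropFirst(1024) { sum &+= element }
  }
  guard sum > 0 else { fatalError("unexpected result") }
}
```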

My plan is to make passing these checks mandatory first for newly added benchmarks

sounds good

but before that's enforced, we should probably build broader consensus about these rules

The rules look good to me for new benchmarks.

I'm glad to report that with @Erik_Eckstein's help we are slowly but surely nearing the end of the quality improvement effort in the Swift Benchmark Suite. The benchmark validation infrastructure is finally in place (PRs #20807 and #20074) to guard the quality of newly added benchmarks.

In PR #20212 we've implemented a workaround that enables us to lower the inner benchmark loop multipliers to achieve the short runtimes which minimize the accumulated error, while reporting runtimes adjusted by a legacy factor, which maintains the continuity of the Apple-internal long-term performance tracking.

I'm currently in the middle of the mind-numbingly boring process of applying the legacy factor throughout the SBS: PR #20861. Once this is completed, I'll look into enabling multi-process (parallel) measurement with the Benchmark_Driver.


Progress Report

(I’m sorry for the delays.) I’m happy to announce that by the beginning of March 2019, a series of PRs that applied the legacy factor across the whole Swift Benchmark Suite (SBS) has lowered the workloads of individual benchmarks to execute in the 20–1000 μs range. My thanks go to @Erik_Eckstein for his patient review and guidance.

The point of this modification was to strengthen the resilience of our measurement process against accumulated errors caused by the context switches that happen every 10 ms on macOS. Here’s a box plot visualization (see also the interactive chart) of the situation before:

Notice the streaking that starts at approximately 2500 μs on the logarithmic X axis for the datasets collected on a more contested CPU. This is the result of accumulated errors from the uncontrolled variable of system load. On a contested machine, measuring a benchmark with a runtime longer than 10 ms will always be corrupted, because the sample includes the execution of a different process. For example, the CI bots are running Jenkins, and the Java VM will do its parallel garbage collection. To mitigate this issue, we have modified the design of the SBS so that all benchmarks should now run in the microbenchmark range of 20–1000 μs. The upper limit is well below the 10 ms scheduler quantum, to give us some headroom for more robust measurements of -Onone builds as well as for running on Linux, where the scheduler quantum is 6 ms.

Here’s the current situation after the application of the legacy factor:

These measurements were taken on my ancient 2008 MBP, so some benchmarks run over 1000 μs, but that’s OK: they are plenty fast to be robust even there, and they run well within the limits on modern hardware. The outliers on the tail are a few existential sequence benchmarks (AnyCollection variants) that should probably be disabled anyway.

Improved Robustness

By lowering the multiplication constants used to size the workloads in the for loops of individual benchmarks and moving them into the legacyFactor, we have effectively increased the sampling frequency. Reporting the actually measured runtime multiplied by the legacy factor allows us to maintain the continuity of performance tracking across Swift releases.
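In benchmark code, the change looks roughly like this (a sketch with an invented benchmark, not one of the actual PRs):

```swift
import TestsUtils

// Sketch: the inner loop used to do 100× more work per iteration. We drop that
// multiplier to keep individual samples short, and declare legacyFactor: 100 so
// the harness reports measured runtime × 100, keeping long-term tracking continuous.
public let SketchBenchmark = BenchmarkInfo(
  name: "SketchBenchmark",
  runFunction: run_SketchBenchmark,
  tags: [.validation, .String],
  legacyFactor: 100)

public func run_SketchBenchmark(_ N: Int) {
  let names = ["Siri", "Svorad", "Šárka"]
  var total = 0
  for _ in 1...N {  // previously: for _ in 1...100*N
    for name in names {
      total &+= name.uppercased().count
    }
  }
  guard total > 0 else { fatalError("unexpected result") }
}
```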

Shorter runtimes have a much lower chance of producing corrupted samples. Gathering many more samples allows us to better detect and exclude the outliers. This is a major improvement to the robustness of the measurement process (it is much more resilient to the uncontrolled variable: system load). Here’s a zoomed look at the middle part of the previous chart:

To recap the nature of the dataset measured on a 2-core CPU: series a10 (minimized console window) and b10 are measured with 1 benchmark process. The c10 series is 2 benchmark processes running in parallel. For the series simulating the worst-case scenarios, d10 is 3 and e10 is 4 parallel benchmark processes running on the 2-core CPU. The Clean modifier means that the outliers (samples above the Q3 + 1.5 * IQR threshold) were removed.
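The outlier removal used for the Clean series is just the standard interquartile-range fence; a minimal sketch (with a simplified quartile estimate and made-up samples):

```swift
// Sketch: drop samples above Q3 + 1.5 × IQR.
func removeOutliers(_ samples: [Double]) -> [Double] {
    let sorted = samples.sorted()
    let q1 = sorted[sorted.count / 4]
    let q3 = sorted[(3 * sorted.count) / 4]
    let threshold = q3 + 1.5 * (q3 - q1)
    return samples.filter { $0 <= threshold }
}

let samples = [502.0, 498, 505, 501, 499, 503, 10_480, 500]  // one preempted sample
print(removeOutliers(samples))  // the 10 480 μs sample is excluded
```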

Shorter Measurement Times

As a side effect, this benchmark cleanup also results in much shorter execution of the whole Swift Benchmark Suite (SBS). The Benchmark_Driver now takes measurements by passing --num-iters=1 to Benchmark_O (we have fixed the setup overhead by introducing setUpFunctions where necessary), which was also modified to stop collection after 200 samples. This means that a single pass through the commit set of benchmarks from the SBS takes less than 4 minutes on my machine, with an average of 0.3 s per benchmark, and the majority of them collect 200 individual samples.

In my opinion, the demonstrated robustness of this measurement process allows us to run the benchmarks in parallel on multi-core machines in the future, to further lower the time it takes to collect a statistically relevant sample set.

Note: The Benchmark_Driver is a different measurement process from the run_smoke_bench.py currently used by the CI bots. It’s an evolution of the original method that was used there before.


This is all really great work @palimondo! Thanks for the update!


No benchmark expert here, but assuming that the load is completely CPU-bound, would getrusage with RUSAGE_THREAD provide accurate results in the face of context switches? (Or is it a blocker that it’s only supported on Linux?)

I don't see how that's applicable here. You can see all the details of how we do the measurements in the Swift Benchmark Suite by exploring DriverUtils.swift.

If one wants to minimize context-switching jitter, the simplest workaround is to elevate the priority of the test process slightly above the default priority where most everything runs. In fact, if you feel confident that the benchmarks don't have pathological behavior, one can use real-time scheduling features to get more reproducible results.
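For example, on macOS one could bump the process priority with something like the sketch below, assuming the process has the privileges to lower its nice value (negative values typically need root); real-time scheduling policies are a bigger hammer:

```swift
import Darwin

// Sketch: raise this process's priority slightly above the default.
// Negative nice values typically require root; check the return value.
let result = setpriority(PRIO_PROCESS, 0, -5)  // who = 0 means "this process"
if result != 0 {
    perror("setpriority")  // e.g. EPERM / EACCES when not privileged
}
```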

That being said, and in my experience from when I worked at Apple, I don't think context switching is the worst problem when trying to benchmark these days. After all, one should test on a quiet machine. The real challenge is power/energy/thermal management by the CPU and operating system. In practice, keeping a CPU running at peak predictable performance can be hard. For example, I need to disable "turbo" (performance above and beyond what is guaranteed to be sustainable) and pre-warm the CPU with busywork to force the CPU to exit low-power mode(s). I'm not sure if the former is doable on macOS without access to an internal tool, but on Linux one can use x86_energy_perf_policy to make benchmarks more reliable.

Other tricks I've used in the past include:

  1. Literally disconnecting the machine from devices that generate interrupts (like disabling Wi-Fi and unplugging Ethernet cables).
  2. Shutting down the GUI. I don't know if this still works on macOS, but one used to be able to log in as >console to shut down the GUI and get a command prompt. This can be especially helpful on macOS servers, where the GUI has mild indigestion if a keyboard/mouse is not attached and a small amount of background CPU activity is then wasted on a GUI dialog box begging for a keyboard/mouse to be attached.
  3. Delaying benchmark tool output until the process exits so that intermediate tests don't interfere with later tests.
  4. Measuring the interrupt overhead of an idle machine and then canceling it out of the test results. This really only works on a very idle machine that is disconnected from the network or on a very quiet network. Furthermore, this only matters if one is trying to do cycle or sub-cycle accurate benchmarks.
  5. Being ever mindful of how SMT and the cache hierarchy work. In my experience, this meant running benchmarks serially to avoid tests fighting over shared resources.