Towards Robust Performance Measurement

I'd recommend, if you haven't seen it yet, looking over the design of the Criterion benchmarking system in Haskell. A Rust port of it also exists, which might be closer to idiomatic Swift.

It's had quite a lot of work put into achieving a good mix of comprehensiveness and comprehensibility. The default views tell you enough, with enough statistical confidence, to evaluate benchmarks well.


Thank you for pointing this out to me. From a quick skim, the use of KDE is roughly analogous to what I was looking for as a future improvement using median detection. I will study Criterion in more depth over the weekend.

Update: I've looked at Criterion. It is more similar to @lorentey's Attabench. They both focus on scaling the workload and produce benchmark visualizations to aid in performance analysis. I don't think much from it is directly applicable to what we are doing in SBS.

Thanks to @Erik_Eckstein and @Michael_Gottesman, the first PR with prerequisites for robust microbenchmarking has landed on master. :tada:

When the issue of setUp and tearDown functions came up during review, it reminded me of related code from my experimentation.

There is a feature of Benchmark_Driver that measures the memory consumption of a benchmark using the time -lp utility, which can, among other things, report the maximum resident set size (MAX_RSS) used by a process. Benchmark_Driver reports MAX_RSS in the last column of the log results. This code currently works only on macOS, because the GNU implementation of time on Linux has different options and a different output format. During my experiments I extended the support to both platforms like this:

But in the end I realized that measuring this using time is still quite noisy. See the first paragraph in Memory Use for a detailed analysis. For reference, here's the first discussion of this issue with @Luke_Larson and @Andrew_Trick in PR #8793.

I think I have a better solution and would like to gather feedback on whether this would be a reasonable next step to implement here. Given that the time utility is just a simple wrapper around the getrusage function, I've used it directly to gather the same environmental stats and have verified that it reduces the noise in MAX_RSS by excluding the contribution from the benchmarking infrastructure overhead (about 10 MB). See this commit for the proposed implementation:
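
Conceptually it boils down to something like the following minimal sketch (not the actual commit; the helper name and the stand-in workload are mine):

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

/// Current high-water mark of resident memory, in bytes.
/// Note: ru_maxrss is reported in bytes on Darwin, but in kilobytes on Linux.
func maxResidentSetSize() -> Int {
    var usage = rusage()
    getrusage(RUSAGE_SELF, &usage)
#if canImport(Darwin)
    return Int(usage.ru_maxrss)
#else
    return Int(usage.ru_maxrss) * 1024
#endif
}

// Hypothetical stand-in for the measured benchmark.
func runBenchmark() -> Int {
    let array = Array(0..<1_000_000)
    return array.reduce(0, &+)
}

// Report only the memory the benchmark adds on top of the already-resident
// benchmarking infrastructure (about 10 MB).
let baseline = maxResidentSetSize()
_ = runBenchmark()
print("MAX_RSS(B): \(maxResidentSetSize() - baseline)")
```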

What do you think?

This implies 2 changes to the log formats (Benchmark_* public API):

  • Benchmark_O gets an additional column reporting MAX_RSS
  • Benchmark_Driver's MAX_RSS will report a delta instead of an absolute value

The second point is just a minor one-time change in semantics, but maybe you're tracking these values over time somewhere... Are your internal scripts able to handle this as is, or would you need to adapt something?


Getting the data directly in the benchmark (with rusage) is definitely a good idea.

About the output format: it's not a big deal to change it. Maybe it makes sense to include the memory usage column(s) only when explicitly turned on with a command-line option, just to keep the output readable for someone who is only interested in runtime data.


Follow-up on the discussion on GitHub:
The reason is that changing the name/runtime of many benchmarks is disruptive for our performance tracking. We submit benchmark results to an LNT server over time, which gives us a history of compiler performance.
Now, if some benchmarks are problematic, we should change them, because it's better to fix noisy benchmarks than to keep them in the history. But for benchmarks which are not problematic (i.e. where we didn't see any noise so far), it would not solve any problem to change them.

Based on all the data in the report, I'm very strongly convinced there are no noisy benchmarks in the Swift Benchmark Suite. All the noise is external, from the environment. The key to minimizing its impact is making the measurement process more resilient. Staying in the microbenchmark range helps us avoid interruptions by the OS scheduler, which happen at least every 10 ms on macOS (6 ms on Linux). Keeping the benchmark runtime under 2500 μs gives us a 1-in-4 (1-in-3) chance of being interrupted even on an absolutely calm machine. The noise filtering technique I've proposed should be able to deal with that. Maybe we should go as low as 1000 μs (1 ms) for the base workload, to be on the safe side. That would allow us to run benchmarks in parallel, dramatically reducing the time it takes to run the SBS. (A final side benefit is that Onone benchmarks will take less time.)

But this is not possible unless we first fix the benchmarks. Luckily, there isn't a big change needed to get the benefits of more stable and parallel benchmarking. All I'm proposing is a trivial change to the internal loop-multiplying constants, establishing a simple linear relationship between the old benchmarks and the recalibrated ones with a lowered base workload. This is the same principle of finer-grained measurement that's behind measuring with num-iters=1. It isn't a fundamental change to what is being measured, only a policy change that allows the measurement infrastructure to peek inside the body of the main workload loop more often.
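
To make that concrete, here's a hypothetical sketch of such a recalibration (the benchmark name, workload, and constants are all made up; blackHole stands in for the suite's helper that defeats dead-code elimination):

```swift
@inline(never) func blackHole<T>(_ value: T) {}

// The inner-loop multiplier used to be 10_000 here, which put a single sample
// in the tens of milliseconds. Lowering it to 1_000 keeps each sample in the
// microbenchmark range, and old results relate to new ones by a simple factor of 10.
@inline(never)
public func run_HypotheticalBenchmark(_ n: Int) {
    for _ in 1...1_000 * n {
        blackHole((0..<64).reduce(0, &+))
    }
}
```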

How is the performance tracking history maintained after upgrading the CI machines? Doesn't that also create a discontinuity similar to what would happen if we did a one-time recalibration of the benchmark inner-loop multipliers? Continuity for renamed benchmarks might be solved by some kind of database migration script, if deemed that important…


Actually the biggest problem with "noisy" benchmarks was the code alignment problem (IRGen, benchmarks: add an option -align-module-to-page-size for benchmarking by eeckstein · Pull Request #18318 · apple/swift · GitHub), which is fixed now.

I suggest we see how the new benchmarking method goes (Improved benchmarking for pull requests), and if there are still noisy benchmarks because of too long an iteration time, let's fix those benchmarks.

Mass-renaming/changing benchmarks creates a large burden for our performance tracking, and I don't see how we can spend time working on this.

Could you please elaborate? How did this issue manifest? I'd like to investigate it on the dataset from my report and on the new measurements taken after PR #18318 landed (a month ago).

This is not actually noise, but a 100% reproducible performance difference caused by different instruction alignment within a code page. As we link all benchmark object files into a single executable, a small change in one benchmark can change the code alignment of all other benchmarks. The fix for this was to page-align all benchmark modules.

Oh, so this was a stable new value (within a given build) for an unmodified benchmark, caused by the modification/addition of a different benchmark?

(I thought this had something to do with the same benchmark having various stable typical values across independent runs, which can be seen in the charts from the report and is still observable in the C series measurements I ran this week.)

Oh, so this was a stable new value (within a given build) for an unmodified benchmark, caused by the modification/addition of a different benchmark?

exactly

OK, so the "noisy" benchmarks you mentioned above, caused by the code alignment issue, were genuine false positives when comparing SBS results between two branches. When I was talking about the source of noise, I was referring to the high sample variance mostly caused by preemptive multitasking within a single SBS measurement. These interruptions corrupt the reported runtimes when measuring with num-iters higher than one, because those numbers are not raw samples but averages of num-iters true samples: they already include the cumulative error. The higher the runtime, the more accumulated error it contains.
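
A quick illustration of that accumulation, with made-up numbers:

```swift
// Illustrative numbers: a benchmark whose true runtime is 1_000 μs per
// iteration gets interrupted once by a 10_000 μs context switch.
let averaged = (10 * 1_000 + 10_000) / 10               // num-iters=10 reports 2_000 μs
let raw = [Int](repeating: 1_000, count: 9) + [11_000]  // num-iters=1 samples
print(averaged, raw)
// With num-iters=10 the error is baked into the single reported value;
// with num-iters=1 it shows up as one obvious outlier that can be discarded.
```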

The traditional advice here is to get a calm machine (quite impractical for a machine running CI) or to increase the measurement time and hope the noise gets averaged out. The second piece of advice is common benchmarking folklore which, based on the data in my report, I find to be totally misguided. The brute-force "solution" of running longer doesn't solve the quality problem; it only prolongs the time it takes to get the wrong answer. This is what we do now with the 20 one-second samples per benchmark in the full @swift-ci benchmark measurement.

Measuring with num-iters=1 fixes this issue, but exposes the outliers, which should then be filtered out.

The approach you took with run_smoke_bench.py is definitely a step in the right direction! The use of --sample-time=0.0025 means it now averages samples (and unnecessarily accumulates errors) only for runtimes under 2500 μs; longer runtimes effectively end up with a num-iters=1 measurement. That's good. But as soon as you get over 10000 μs, each sample is guaranteed to include error from switching to another process, and that's outside of our control.

But it does not do enough to significantly improve resilience in the presence of noise, nor to enable parallel benchmarking. 184 benchmarks exceed a 10 ms runtime on the beastly CI machines, and many more do on normal computers... The infuriating thing is that almost all of them are artificially high only because of an incorrectly selected inner-loop multiplier, not because of anything inherent to the measured task. I'm strongly convinced we need to fix these benchmarks, for improved measurement quality and much shorter measurement time. I'm leaning towards 1000 μs as the upper limit for optimized benchmarks and 10000 μs for Onone.

@Erik_Eckstein, could you please address these questions?

Interim Progress Report

With @Erik_Eckstein's help, PR #18124 has landed on master, implementing a measurement of the benchmark's memory use that excludes the overhead from the infrastructure. It is now reported directly from Benchmark_O in the last column, when running with the --memory option.

This has now also been integrated into Benchmark_Driver in PR #18719, along with much improved unit test coverage of the whole benchmarking infrastructure. Building on the refactored BenchmarkDriver, a new benchmark validation command, check, was introduced. It is implemented in the BenchmarkDoctor class.

PR #18924 wrapped up the refactoring and full unit test coverage of Benchmark_Driver's run command. And PR #19011 is about to improve resilience on a contended machine by being a nice process and yielding the CPU when the scheduled time slice expires.

Benchmark Validation

The check command accepts the same benchmark filtering options as the run command, so you can validate individual benchmarks, benchmarks matching regular expression filters, or the whole suite. The idea is that benchmarks that are part of the Swift Benchmark Suite are required to follow a set of rules that ensure quality measurements (a compliant example is sketched after this list). These include:

  • name matches UpperCamelCase naming convention
  • name is at most 40 characters long (to prevent report tables on GitHub from overflowing)
  • robustness when varying execution parameters like num-iters and num-samples:
    • no setup overhead
    • constant memory consumption
    • runtime under a sensible threshold (currently 2500 μs; I'd like to go down to 1000 μs)
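
For illustration, a hypothetical benchmark written to pass these checks might look like this (the name, tags, and sizes are made up; BenchmarkInfo, setUpFunction, and blackHole are the suite's TestsUtils API):

```swift
import TestsUtils

// UpperCamelCase name under 40 characters; the Array construction is moved
// into setUpFunction so it isn't measured; the workload is small enough to
// keep a single sample well under the runtime threshold.
var workload: [Int] = []

public let HypotheticalArraySum = BenchmarkInfo(
    name: "HypotheticalArraySum",
    runFunction: run_HypotheticalArraySum,
    tags: [.validation, .api, .Array],
    setUpFunction: { workload = Array(0..<4096) },
    tearDownFunction: { workload = [] })

@inline(never)
public func run_HypotheticalArraySum(_ n: Int) {
    for _ in 1...n {
        blackHole(workload.reduce(0, &+))
    }
}
```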

When run from the console, it uses a color-coded log to report on the health of the tested benchmarks. When redirected to a file, it uses a logging format with log-level prefixes. Here's a verbose log (including the DEBUG level) from diagnosing the current Swift Benchmark Suite on my machine.

My plan is to make passing these checks first mandatory for newly added benchmarks, but before that's enforced, we should probably build broader consensus about these rules, as there wasn't much debate about these specifics of my initial proposal.

Constant Memory Use

I have a problem with the constant memory use rule. The main point of it is to make sure that the benchmark doesn't vary the size of the workload with varying num-iters. This works well, but there is a secondary warning about tests with high variance. This means that there is unusually high variance in the memory used between some of the 10 independent measurements of the same benchmark. Currently I'm just comparing it to a fixed threshold of 15 pages (15 * 4096 B ≈ 60 kB), which I've based on the 13-page variance observed in my initial report. I initially thought the variance would be a function of the memory used, i.e. that using more memory would mean higher variance, but this doesn't appear to be the case. It seems to be related to how the benchmark is written. Here's a Numbers spreadsheet with the memory pages extracted from check.log:

The measurements were obtained by running the benchmarks 5 times with num-iters=1 (i1) and 5 times with num-iters=2 (i2). The columns are as follows: min i1 and min i2 are the minimum number of pages used by the benchmark for a given iteration count, 𝚫 is their difference, and min min is the smaller of the two. R stands for range, i.e. the number of pages between the min and max for a given iteration count. Finally, max R is the bigger of the two ranges.

It's filtered to hide benchmarks that use fewer than 20 pages of memory, but to show those with high variance or variable size. Red 𝚫s mark incorrectly written benchmarks that vary the size of the workload based on the number of iterations.

As demonstrated by the SequenceAlgos family of benchmarks, even using 15.7 MB per run doesn't necessarily imply high memory use variance. On the other hand, I've noticed that benchmarks involving Array have unstable variance when running the same test: sometimes they stay well under 15 pages, sometimes they overshoot. What is weird is that the variance can be 300% of the minimum! See DropWhileArrayLazy, which can use as little as 5 pages but has a range of 14! This all smells fishy to me... Can somebody with more knowledge about how the Array implementation allocates memory chime in, please?

Setup Overhead

Apropos the Array weirdness… I have re-run the measurements from my report and noticed that benchmarks that used to have 4–6 μs of setup overhead due to a small Array initialization, like DropFirstArrayLazy, now have 14 μs of overhead. This means that since November, the cost of Array(0..<4096) has gone up by 350%!!! What happened? Did no benchmarks catch this directly?

My plan is to make passing these checks first mandatory for newly added benchmarks

sounds good

but before that's enforced, we should probably build broader consensus about these rules

The rules look good to me for new benchmarks.

I'm glad to report that with @Erik_Eckstein's help we are slowly but surely nearing the end of the quality improvement efforts in the Swift Benchmark Suite. The benchmark validation infrastructure is finally in place (PRs #20807 and #20074) to guard the quality of newly added benchmarks.

In PR #20212 we've implemented a workaround that enables us to lower the inner benchmark loop multipliers to achieve short runtimes, which minimizes the accumulated error, while reporting the runtimes adjusted by a legacy factor; this maintains the continuity of the Apple-internal long-term performance tracking.

I'm currently in the middle of the mind-numbingly boring process of applying the legacy factor throughout the SBS: PR #20861. Once this is completed, I'll look into enabling multi-process (parallel) measurement with the Benchmark_Driver.


Progress Report

(I'm sorry for the delays.) I'm happy to announce that by the beginning of March 2019, a series of PRs that applied the legacy factor across the whole Swift Benchmark Suite (SBS) has lowered the workloads of individual benchmarks to execute in the 20–1000 μs range. My thanks go to @Erik_Eckstein for his patient review and guidance.

The point of this modification was to strengthen the resilience of our measurement process against the accumulated errors caused by context switches, which happen every 10 ms on macOS. Here's a box plot visualization (see also the interactive chart) of the situation before:

Notice the streaking that starts at approximately 2500 μs on the exponential X axis for the datasets collected on a more contested CPU. This is the result of accumulated errors from the uncontrolled variable of system load. On a contested machine, the measurement of a benchmark with a runtime longer than 10 ms will always be corrupted, because it includes the execution of a different process. For example, the CI bots run Jenkins, and the Java VM will do its parallel garbage collection. To mitigate this issue, we have modified the design of the SBS so that all benchmarks should now run in the microbenchmark range of 20–1000 μs. The upper limit is well below the 10 ms scheduler quantum, to give us some headroom for more robust measurements of -Onone builds as well as for running on Linux, where the scheduler quantum is 6 ms.

Here's the current situation after the application of the legacy factor:

These measurements were taken on my ancient 2008 MBP, so some benchmarks run over 1000 μs, but that's OK: they are plenty fast to be robust even there, and they run well within the limits on modern HW. The outliers on the tail are a few existential sequence benchmarks (AnyCollection variants) that should probably be disabled anyway.

Improved Robustness

By lowering the multiplication constants used to size the workloads in the for loops of individual benchmarks and moving them into the legacyFactor, we have effectively increased the sampling frequency. Reporting the actually measured runtime multiplied by the legacy factor allows us to maintain the continuity of performance tracking across Swift releases.
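
For illustration, a benchmark after this treatment might look roughly like this (the name, workload, and numbers are made up; legacyFactor is the BenchmarkInfo parameter added for this purpose):

```swift
import TestsUtils

// The loop used to run 10_000 × n times; it now runs 1_000 × n times and
// declares legacyFactor: 10, so the harness multiplies the measured runtime
// by 10 before reporting it, keeping the long-term history continuous.
public let HypotheticalStringWalk = BenchmarkInfo(
    name: "HypotheticalStringWalk",
    runFunction: run_HypotheticalStringWalk,
    tags: [.validation, .String],
    legacyFactor: 10)

@inline(never)
public func run_HypotheticalStringWalk(_ n: Int) {
    for _ in 1...1_000 * n {
        blackHole("Hello, Swift!".unicodeScalars.count)
    }
}
```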

Shorter runtimes have a much lower chance of producing corrupted samples. Gathering many more samples allows us to better detect and exclude the outliers. This is a major improvement to the robustness of the measurement process (it is much more resilient to the uncontrolled variable: system load). Here's a zoomed look at the middle part of the previous chart:

To recap the nature of the dataset, measured on a 2-core CPU: series a10 (minimized console window) and b10 are measured with 1 benchmark process; the c10 series is 2 benchmark processes running in parallel. For the series simulating the worst-case scenarios, d10 is 3 and e10 is 4 parallel benchmark processes running on the 2-core CPU. The Clean modifier means that the outliers (samples above the Q3 + 1.5 * IQR threshold) were removed.
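
The outlier filtering behind the Clean series boils down to something like this sketch (the helper name and example samples are mine; only the Q3 + 1.5 * IQR rule comes from the above):

```swift
/// Removes samples above the Q3 + 1.5 * IQR threshold. Only upper outliers
/// are dropped, since interruptions can only ever make a sample slower.
func removeUpperOutliers(_ samples: [Double]) -> [Double] {
    let sorted = samples.sorted()
    guard sorted.count >= 4 else { return sorted }
    // Linearly interpolated quartile of the sorted samples.
    func quartile(_ q: Double) -> Double {
        let position = q * Double(sorted.count - 1)
        let lower = Int(position)
        guard lower + 1 < sorted.count else { return sorted[lower] }
        let fraction = position - Double(lower)
        return sorted[lower] * (1 - fraction) + sorted[lower + 1] * fraction
    }
    let q1 = quartile(0.25), q3 = quartile(0.75)
    let threshold = q3 + 1.5 * (q3 - q1)
    return sorted.filter { $0 <= threshold }
}

// Example: nine clean samples and one corrupted by a ~10 ms context switch.
let samples: [Double] = [1020, 1015, 1018, 1022, 1019, 1017, 1021, 1016, 1023, 11050]
print(removeUpperOutliers(samples))  // the 11050 μs sample is dropped
```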

Shorter Measurement Times

As a side effect, this benchmark cleanup also results in a much shorter execution of the whole Swift Benchmark Suite (SBS). The Benchmark_Driver now takes measurements by passing --num-iters=1 to Benchmark_O (we have fixed the setup overhead by introducing setUpFunctions where necessary), which was also modified to stop collecting after 200 samples. This means that a single pass through the commit set of benchmarks from the SBS takes less than 4 minutes on my machine, with an average of 0.3 s per benchmark, and the majority of them collect 200 individual samples.

In my opinion, the demonstrated robustness of this measurement process will allow us to run the benchmarks in parallel on multi-core machines in the future, further lowering the time it takes to collect a statistically relevant sample set.

Note: Benchmark_Driver is a different measurement process from the run_smoke_bench.py currently used by the CI bots. It's the evolution of the original method that was used there before.


This is all really great work @palimondo! Thanks for the update!


No benchmark expert here; but assuming that the load is completely CPU-bound, would getrusage with RUSAGE_THREAD provide accurate results in the face of context switches? (Or is it a blocker that it’s only supported on Linux?)

I don't see how that's applicable here. You can see all the details of how we do the measurements in the Swift Benchmark Suite by exploring DriverUtils.swift.

If one wants to minimize context switching jitter, then the simplest workaround is to simply elevate the priority of the test process slightly above the default priority where most everything runs. In fact, if you feel confident that the benchmarks don't have pathological behavior, then one can use real-time scheduling features to get more reproducible results.
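
A minimal sketch of the first suggestion, assuming macOS and the standard setpriority call (the nice value of -5 is arbitrary):

```swift
#if canImport(Darwin)
import Darwin

// Nudge the benchmark process slightly above the default priority (nice 0).
// Negative nice values generally require elevated privileges (e.g. sudo).
if setpriority(PRIO_PROCESS, 0, -5) != 0 {
    perror("setpriority")  // e.g. EPERM/EACCES when not privileged
}
#endif
```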

That being said, and in my experience from when I worked at Apple, I don't think context switching is the worst problem when trying to benchmark these days. After all, one should test on a quiet machine. The real challenge is power/energy/thermal management by the CPU and operating system. In practice, keeping a CPU running at peak predictable performance can be hard. For example, I need to disable "turbo" (performance above and beyond what is guaranteed to be sustainable) and pre-warm the CPU with busywork to force the CPU to exit low-power mode(s). I'm not sure if the former is doable on macOS without access to an internal tool, but on Linux, one can use x86_energy_perf_policy to make benchmarks more reliable.

Other tricks I've used in the past include:

  1. Literally disconnecting the machine from devices that generate interrupts (like disabling Wi-Fi and unplugging Ethernet cables).
  2. Shutting down the GUI. I don't know if this still works on macOS, but one used to be able to log in as >console to shut down the GUI and get a command prompt. This can be especially helpful for macOS servers, where the GUI has mild indigestion if a keyboard/mouse is not attached, and a small amount of background CPU activity is then wasted on a GUI dialog box begging for a keyboard/mouse to be attached.
  3. Delaying benchmark tool output until the process exits so that intermediate tests don't interfere with later tests.
  4. Measure the interrupt overhead of an idle machine and then cancel it out of test results. This really only works on a very idle machine that is disconnected from the network or on a very quiet network. Furthermore, this only matters if one is trying to do cycle or sub-cycle accurate benchmarks.
  5. Be ever mindful of how SMT and the cache hierarchy works. In my experience, this meant running benchmarks serially to avoid tests fighting over shared resources.