OK, so the "noisy" benchmarks you mentioned above, caused by a code alignment issue, were genuine false positives when comparing SBS results between two branches. When I was talking about the source of noise, I was referring to high sample variance within a single SBS measurement, mostly caused by preemptive multitasking. These interruptions corrupt the reported runtimes when measuring with num-iters higher than one, because the reported numbers are not raw samples but averages of num-iters true samples: they already include the accumulated error. The higher the runtime, the more accumulated error it contains.
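To illustrate how averaging bakes the interruption cost into every reported sample, here's a toy simulation. All the numbers and the deterministic preemption pattern are made up for illustration; this is not how a real scheduler behaves:

```python
TRUE_RUNTIME_US = 100        # hypothetical true runtime of the workload
PREEMPTION_COST_US = 1000    # hypothetical cost of one context switch
PREEMPT_EVERY = 7            # deterministically preempt every 7th run

run_counter = 0

def run_once():
    """One true run of the benchmark; every 7th run is 'preempted'."""
    global run_counter
    run_counter += 1
    cost = TRUE_RUNTIME_US
    if run_counter % PREEMPT_EVERY == 0:
        cost += PREEMPTION_COST_US
    return cost

def sample(num_iters):
    # A reported sample is the *average* of num_iters true runs,
    # so any preemption inside the batch is baked into the average.
    return sum(run_once() for _ in range(num_iters)) / num_iters

samples_i1 = [sample(1) for _ in range(20)]
run_counter = 0  # restart the preemption pattern for a fair comparison
samples_i10 = [sample(10) for _ in range(20)]

# With raw samples, the minimum recovers the true runtime, because
# interruption-free samples exist. With averaged samples, every batch of
# 10 runs contains at least one preemption, so no sample is clean.
print(min(samples_i1))   # 100.0
print(min(samples_i10))  # 200.0
```

Note that no amount of extra measurement time helps the averaged case: the minimum can never drop below the true runtime plus at least one preemption's share of the batch.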
The traditional advice here is to get a calm machine (quite impractical for a machine running CI) or to increase the measurement time and hope the noise averages out. The second piece of advice is common benchmarking folklore which, based on the data in my report, I find to be totally misguided. The brute-force "solution" of running longer doesn't solve the quality problem; it only prolongs the time it takes to get the wrong answer. This is what we do now with the 20 one-second samples per benchmark in the full @swift-ci benchmark measurement.
Measuring with num-iters=1 fixes this issue, but exposes outliers, which should then be filtered out.
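One simple way such outlier filtering could work is a Tukey-style fence above the third quartile; this is an illustrative choice on my part, not necessarily the algorithm SBS should end up with:

```python
def filter_outliers(samples):
    """Drop the preemption-induced outliers that raw num-iters=1 samples
    expose: keep only values below Tukey's upper fence (Q3 + 1.5 * IQR).
    An illustrative sketch, not the actual SBS filtering code."""
    s = sorted(samples)
    q1 = s[len(s) // 4]            # crude quartiles, fine for a sketch
    q3 = s[(3 * len(s)) // 4]
    fence = q3 + 1.5 * (q3 - q1)
    return [x for x in samples if x <= fence]

# A preempted sample (1100) sticks out far above the clean cluster.
print(filter_outliers([100, 101, 99, 102, 100, 1100]))
# [100, 101, 99, 102, 100]
```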
The approach you took with run_smoke_bench.py is definitely a step in the right direction! The use of --sample-time=0.0025 means it now averages samples (and unnecessarily accumulates errors) only for runtimes under 2500 μs; longer runtimes effectively end up with a num-iters=1 measurement. That's good. But as soon as you get over 10000 μs, each sample is guaranteed to include the error from switching to another process, and that's outside of our control.
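A rough sketch of the scaling behavior that --sample-time implies (the function name and exact formula are my assumptions for illustration, not the actual harness code):

```python
def auto_num_iters(single_run_us, sample_time_us=2500):
    """Pick how many inner iterations to average into one sample so that
    a sample takes roughly sample_time_us. Runtimes at or above the
    sample time get num_iters=1, i.e. raw, unaveraged samples.
    A hypothetical sketch of the --sample-time=0.0025 behavior."""
    return max(1, int(sample_time_us // single_run_us))

print(auto_num_iters(100))    # 25 -- short runtimes get averaged
print(auto_num_iters(2500))   # 1  -- at the threshold, samples are raw
print(auto_num_iters(12000))  # 1  -- raw, but over ~10 ms each sample
                              #       likely includes a preemption anyway
```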
But it does not do enough to significantly improve resilience in the presence of noise to enable parallel benchmarking. 184 benchmarks exceed 10 ms runtime on the beastly CI machines, and many more do on normal computers... The infuriating thing is that almost all of them are artificially high only because of an incorrectly selected inner-loop multiplier, not because of anything inherent to the measured task. I'm strongly convinced we need to fix these benchmarks, both for improved measurement quality and for much shorter measurement time. I'm leaning towards 1000 μs as the upper limit for optimized benchmarks and 10000 μs for Onone.
With @Erik_Eckstein's help, PR #18124 has landed on master, implementing measurement of a benchmark's memory use that excludes the overhead of the infrastructure. It is now reported directly from Benchmark_O in the last column when running with the --memory option.
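The idea behind excluding the infrastructure overhead can be sketched as a simple subtraction; the names and numbers here are hypothetical, not the actual implementation from PR #18124:

```python
PAGE_SIZE_B = 4096  # memory is accounted in pages

def net_pages(total_pages, infra_pages):
    """Subtract the pages the measurement infrastructure itself needs
    (e.g. as measured against an empty benchmark) from the total,
    leaving only the benchmark's own footprint."""
    return total_pages - infra_pages

pages = net_pages(total_pages=29, infra_pages=20)
print(pages, "pages =", pages * PAGE_SIZE_B, "bytes")  # 9 pages = 36864 bytes
```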
This has now also been integrated into the Benchmark_Driver in PR #18719, along with much improved unit test coverage of the whole benchmarking infrastructure. Building on the refactored BenchmarkDriver, a new benchmark validation command, check, was introduced. It is implemented in the BenchmarkDoctor class.
PR #18924 wrapped up the refactoring and full unit test coverage of Benchmark_Driver's run command. And PR #19011 is about to improve resilience on a contended machine by being a nice process and yielding the CPU when the scheduled time slice expires.
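The yielding idea from PR #19011 can be sketched like this: once we've used up our assumed time slice, give up the CPU voluntarily between samples, so the scheduler is less likely to preempt us mid-sample. The quantum value and the structure are my assumptions, not the PR's actual code:

```python
import os
import time

QUANTUM_S = 0.01  # assumed scheduler time slice (10 ms)

def measure_with_yields(work, num_samples):
    """Collect num_samples timings of work(), yielding the CPU whenever
    the assumed time slice has been used up. A sketch of the idea in
    PR #19011, not its actual implementation."""
    samples = []
    slice_start = time.perf_counter()
    for _ in range(num_samples):
        t0 = time.perf_counter()
        work()
        samples.append(time.perf_counter() - t0)
        if time.perf_counter() - slice_start > QUANTUM_S:
            if hasattr(os, "sched_yield"):  # POSIX only
                os.sched_yield()            # be nice: give up the CPU now
            slice_start = time.perf_counter()
    return samples

samples = measure_with_yields(lambda: sum(range(1000)), 50)
print(len(samples))  # 50
```

Yielding at a point of our own choosing means the involuntary context switch, when it comes, is more likely to land between samples rather than inside one.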
The check command accepts the same benchmark filtering options as the run command, so you can validate individual benchmarks, benchmarks matching regular expression filters, or the whole suite. The idea is that benchmarks that are part of the Swift Benchmark Suite are required to follow a set of rules that ensure quality measurements. These include:
- name is at most 40 characters long (to prevent report tables on GitHub from overflowing)
- robustness when varying execution parameters like num-iters and num-samples:
  - no setup overhead
  - constant memory consumption
- runtime under a sensible threshold (currently 2500 μs; I'd like to go down to 1000 μs)
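A toy version of the kind of rule checking described above. The function name, signature, and thresholds are my assumptions for illustration; BenchmarkDoctor's actual implementation differs:

```python
def check_benchmark(name, runtime_us, setup_us, pages_i1, pages_i2):
    """Return a list of rule violations for one benchmark, given its
    runtime, measured setup overhead, and memory pages used at
    num-iters=1 vs num-iters=2. A hypothetical sketch, not the real
    BenchmarkDoctor code."""
    issues = []
    if len(name) > 40:
        issues.append("name longer than 40 characters")
    if setup_us > 0:
        issues.append("setup overhead detected")
    if pages_i2 != pages_i1:  # memory must not scale with num-iters
        issues.append("memory use varies with num-iters")
    if runtime_us > 2500:
        issues.append("runtime over 2500 us threshold")
    return issues

print(check_benchmark("DropWhileArrayLazy", 180, 0, 9, 9))  # [] -- healthy
print(check_benchmark("A" * 41, 14000, 14, 20, 36))         # four violations
```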
When run from a console, it uses a color-coded log to report on the health of the tested benchmarks. When redirected to a file, it uses a logging format with log level prefixes. Here's a verbose log (including DEBUG level) from diagnosing the current Swift Benchmark Suite on my machine.
My plan is to first make passing these checks mandatory for newly added benchmarks, but before that's enforced, we should probably build broader consensus around these rules, as there wasn't much debate about the specifics of my initial proposal.
Constant Memory Use
I have a problem with the constant memory use rule. Its main point is to make sure that a benchmark doesn't vary the size of its workload with varying num-iters. This works well, but there is a secondary warning about tests with high variance. It means that there is unusually high variance in the memory used between some of the 10 independent measurements of the same benchmark. Currently I'm just comparing it against a fixed threshold of 15 pages (15 * 4096 B = 60 kB), based on the 13-page variance observed in my initial report. I initially thought the variance would be a function of the memory used, i.e. that benchmarks using more memory would have higher variance, but this doesn't appear to be the case. It seems to be related to how the benchmark is written. Here's a Numbers spreadsheet with the mem pages extracted from the check.log:
The measurements were obtained by running the benchmarks 5 times with num-iters=1 (i1) and 5 times with num-iters=2 (i2). The columns are as follows: min i1 and min i2 are the minimum number of pages used by the benchmark for the given iteration count, 𝚫 is their difference, and min min is the smaller of the two. R stands for range, i.e. the number of pages between the min and max for a given iteration count. Finally, max R is the bigger of the two ranges.
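For clarity, the columns just described can be computed from the raw page counts like this (the page counts below are made-up values chosen only to illustrate the computation):

```python
def memory_columns(samples_i1, samples_i2):
    """Compute the spreadsheet columns described above from raw page
    counts: 5 runs at num-iters=1 and 5 runs at num-iters=2. A sketch,
    assuming the counts are already net of infrastructure overhead."""
    min_i1, min_i2 = min(samples_i1), min(samples_i2)
    r_i1 = max(samples_i1) - min_i1   # range at num-iters=1
    r_i2 = max(samples_i2) - min_i2   # range at num-iters=2
    return {
        "min i1": min_i1,
        "min i2": min_i2,
        "delta": min_i2 - min_i1,     # workload growth with iterations
        "min min": min(min_i1, min_i2),
        "max R": max(r_i1, r_i2),
    }

# Hypothetical page counts for a benchmark with constant workload size
# (delta == 0) but unstable variance (max R well above its minimum).
cols = memory_columns([5, 7, 19, 6, 9], [5, 8, 17, 6, 10])
print(cols)
# {'min i1': 5, 'min i2': 5, 'delta': 0, 'min min': 5, 'max R': 14}
```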
It's filtered to hide benchmarks that use less than 20 pages of memory, while still showing those with high variance or variable size. Red 𝚫s mark incorrectly written benchmarks that vary the size of the workload based on the number of iterations.
As demonstrated by the SequenceAlgos family of benchmarks, even using 15.7 MB per run doesn't necessarily imply high memory-use variance. On the other hand, I've noticed that benchmarks involving Array have unstable variance between runs of the same test. Sometimes they stay well under 15 pages; sometimes they overshoot. What is weird is that the variance can be 300% of the min! See DropWhileArrayLazy, which can use as little as 5 pages but has a range of 14! This all smells fishy to me... Can somebody with more knowledge of how the Array implementation allocates memory chime in, please?
Apropos Array weirdness… I have re-run the measurements from my report and noticed that benchmarks that used to have 4–6 μs setup overhead due to small Array initialization, like DropFirstArrayLazy, now have 14 μs overhead. This means that since November, the cost of Array(0..<4096) has gone up to roughly 350% of what it was!!! What happened? Did no benchmarks catch this directly?
I'm glad to report that with @Erik_Eckstein's help we are slowly but surely nearing the end of the quality improvement efforts in the Swift Benchmark Suite. The benchmark validation infrastructure is finally in place (PRs #20807 and #20074) to guard the quality of newly added benchmarks.
In PR #20212 we've implemented a workaround that enables us to lower the inner benchmark loop multipliers to achieve short runtimes, which minimize the accumulated error, while reporting the runtimes adjusted by a legacy factor; this maintains the continuity of the Apple-internal long-term performance tracking.
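The legacy factor idea boils down to scaling the reported runtime back up by the same factor the inner loop was shortened by. A sketch of the concept with a hypothetical factor; this is not the actual code from PR #20212:

```python
LEGACY_FACTOR = 20  # hypothetical: the inner loop was shortened 20x

def reported_runtime(measured_us, legacy_factor=LEGACY_FACTOR):
    """After lowering the inner-loop multiplier by legacy_factor, multiply
    the measured runtime back up so long-term performance tracking sees a
    continuous series while each sample accumulates far less error."""
    return measured_us * legacy_factor

# The benchmark now runs 20x shorter per sample (less accumulated error),
# but reports runtimes on the same scale as before the change.
print(reported_runtime(50))  # 1000
```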
I'm currently in the middle of the mind-numbingly boring process of applying the legacy factor throughout the SBS: PR #20861. Once this is completed, I'll look into enabling multi-process (parallel) measurement with the Benchmark_Driver.