Interim Progress Report
With @Erik_Eckstein's help, PR #18124 has landed on master, implementing measurement of a benchmark's memory use that excludes the overhead of the benchmarking infrastructure. It is now reported directly by Benchmark_O in the last column when running with the --memory option.
This has also been integrated into the Benchmark_Driver in PR #18719, along with much improved unit test coverage of the whole benchmarking infrastructure. Building on the refactored BenchmarkDriver, a new benchmark validation command, check, was introduced. It is implemented in the BenchmarkDoctor class.
PR #18924 wrapped up the refactoring and full unit test coverage of Benchmark_Driver's run command. PR #19011 aims to improve resilience on a contended machine by being a nice process and yielding the CPU when the scheduled time slice expires.
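To illustrate the flavor of the idea only, here is a simplified sketch assuming the POSIX sched_yield call and yielding between samples; it is not the actual code from PR #19011, which yields when the scheduled time slice expires:

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

/// Sketch only: run the measured closure for each sample and voluntarily
/// give up the CPU in between, so that other work contending for the machine
/// runs between samples instead of preempting us in the middle of one.
func measure(samples numSamples: Int, _ sample: () -> Void) {
    for _ in 0..<numSamples {
        sample()           // the measured work for one sample
        _ = sched_yield()  // yield the rest of our time slice
    }
}
```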
Benchmark Validation
The check command accepts the same benchmark filtering options as the run command, so you can validate individual benchmarks, benchmarks matching regular expression filters, or the whole suite. The idea is that benchmarks that are part of the Swift Benchmark Suite are required to follow a set of rules that ensure quality measurements (a minimal example of a compliant benchmark is sketched after the list below). These include:
- name matches UpperCamelCase naming convention
- name is at most 40 characters long (to prevent report tables on GitHub from overflowing)
- robustness when varying execution parameters like num-iters and num-samples:
  - no setup overhead
  - constant memory consumption
  - runtime under a sensible threshold (currently 2500 μs; I'd like to go down to 1000 μs)
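To make these rules concrete, here is a minimal sketch of what a compliant benchmark might look like, assuming the suite's BenchmarkInfo/CheckResults API; the benchmark itself (ConstantWorkload) is made up for illustration and is not part of the suite:

```swift
import TestsUtils

// Fixed-size workload created in setUpFunction, so the measured run function
// has no setup overhead and its memory use does not depend on num-iters.
var workload: [Int] = []

public let ConstantWorkload = BenchmarkInfo(
    name: "ConstantWorkload",          // UpperCamelCase, well under 40 characters
    runFunction: run_ConstantWorkload,
    tags: [.validation],
    setUpFunction: { workload = Array(0..<4096) },
    tearDownFunction: { workload = [] })

@inline(never)
public func run_ConstantWorkload(_ N: Int) {
    // The work scales linearly with the iteration count N supplied by the
    // harness, while the data it touches stays the same size.
    for _ in 1...N {
        var sum = 0
        for x in workload { sum &+= x }
        CheckResults(sum == 8_386_560)  // 0 + 1 + … + 4095
    }
}
```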
When run from a console, the check command uses a color-coded log to report on the health of the tested benchmarks. When redirected to a file, it uses a logging format with log level prefixes. Here's a verbose log (including the DEBUG level) from diagnosing the current Swift Benchmark Suite on my machine.
My plan is to first make passing these checks mandatory for newly added benchmarks, but before that's enforced, we should probably build broader consensus about these rules, as there wasn't much debate about the specifics of my initial proposal.
Constant Memory Use
I have a problem with the constant memory use rule. Its main point is to make sure that a benchmark doesn't vary the size of its workload with varying num-iters. This works well, but there is a secondary warning about tests with high variance, meaning there is unusually high variance in the memory used between some of the 10 independent measurements of the same benchmark. Currently I'm just comparing it against a fixed threshold of 15 pages (15 * 4096 B ≈ 60 kB), based on the 13-page variance observed in my initial report. I initially thought the variance would be a function of memory used, i.e. that benchmarks using more memory would have higher variance, but that doesn't appear to be the case. It seems to be related to how the benchmark is written. Here's a Numbers spreadsheet with the memory pages extracted from the check.log:
The measurements were obtained by running the benchmarks 5 times with num-iters=1 (i1) and 5 times with num-iters=2 (i2). The columns are as follows: min i1 and min i2 are the minimum number of pages used by the benchmark for the given iteration count, 𝚫 is their difference, and min min is the smaller of the two. R stands for range, i.e. the number of pages between the min and max for a given iteration count. Finally, max R is the bigger of the two ranges.
The table is filtered to hide benchmarks that use fewer than 20 pages of memory, but to show those with high variance or variable size. Red 𝚫s mark incorrectly written benchmarks that vary the size of the workload based on the number of iterations.
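For contrast, here is a contrived sketch of the mistake the red 𝚫s flag, next to the correct pattern; it reuses the hypothetical names and TestsUtils helpers from the earlier sketch and is not taken from actual benchmarks in the suite:

```swift
// Broken: the size of the allocated workload depends on N, so the memory
// measured with num-iters=2 differs from num-iters=1.
public func run_VariableSizeWorkload(_ N: Int) {
    let data = Array(0..<(4096 * N))   // workload grows with iteration count
    var sum = 0
    for x in data { sum &+= x }
    CheckResults(data.count == 4096 * N && sum >= 0)
}

// Correct: the workload size is fixed and only the work is repeated N times,
// so the memory footprint is identical for any num-iters value.
public func run_FixedSizeWorkload(_ N: Int) {
    let data = Array(0..<4096)
    for _ in 1...N {
        var sum = 0
        for x in data { sum &+= x }
        CheckResults(sum == 8_386_560)
    }
}
```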
As demonstrated by the SequenceAlgos family of benchmarks, even using 15.7 MB per run doesn't necessarily imply high memory-use variance. On the other hand, I've noticed that benchmarks involving Array have unstable variance across runs of the same test: sometimes they stay well under 15 pages, sometimes they overshoot. What is weird is that the variance can be 300% of the minimum! See DropWhileArrayLazy, which can use as little as 5 pages, but has a range of 14! This all smells fishy to me… Can somebody with more knowledge about how the Array implementation allocates memory chime in, please?
Setup Overhead
Apropos the Array weirdness… I have re-run the measurements from my report and noticed that benchmarks that used to have 4–6 μs of setup overhead due to a small Array initialization, like DropFirstArrayLazy, now have 14 μs of overhead. This means that since November, the cost of Array(0..<4096) has gone up roughly 3.5 times! What happened? And why didn't any benchmark catch this directly?
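For context, this is roughly the shape of such a benchmark (hypothetical code, not the actual DropFirstArrayLazy implementation): the Array initialization sits inside the measured run function but outside the inner loop, so it contributes a constant per-call cost that doesn't scale with N; that is the setup overhead.

```swift
// Hypothetical sketch of where the setup overhead comes from; this is not
// the actual DropFirstArrayLazy code from the suite.
public func run_DropFirstExample(_ N: Int) {
    let array = Array(0..<4096)        // setup cost paid inside the measured region
    for _ in 1...N {
        var sum = 0
        for x in array.lazy.dropFirst(1024) { sum &+= x }
        CheckResults(sum == 7_862_784) // 1024 + 1025 + … + 4095
    }
}
// Moving the Array initialization into setUpFunction would remove the
// overhead from the measurement, but then its cost would not show up in the
// reported runtime at all.
```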