Using benchmarking to guard against gradual performance regressions

I recently dove into benchmarking and the awesome new Benchmark package (thanks @hassila) and one thing that quickly came to mind is the possibility of gradual regressions that can slip by undetected. I'm making this thread to describe what I have in mind and to ask if there is an established best practice for guarding against it.

The recommended CI integration is that whenever a PR is created or updated, we generate benchmark results for the new code, generate results for the main branch, and then compare the two to see if there are any regressions. Of course, since these numbers always vary a bit, we have to specify a threshold percentage by which the performance is allowed to regress.
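
For concreteness, here's a minimal sketch of the kind of benchmark target those CI results would come from, using the Benchmark package - the benchmark name and workload are made-up placeholders:

```swift
import Benchmark

// Stand-in for whatever code path the project actually wants to track.
func workUnderTest() -> Int {
    (0..<1_000).reduce(0, +)
}

let benchmarks = {
    Benchmark("WorkUnderTest") { benchmark in
        for _ in benchmark.scaledIterations {
            // blackHole keeps the optimizer from discarding the result.
            blackHole(workUnderTest())
        }
    }
}
```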

The thing is, if there is an acceptably small regression then that slightly slower code will be merged into main and will become the new standard. The next PR is now free to regress a bit again, and this process can continue until we hit death by a thousand tiny regressions.

The obvious answer would be to save a particular baseline and check against it instead of checking against benchmark results freshly derived from main. Am I correct that there are potential issues with using archived results for comparison in a CI workflow? Are they related to the unpredictability of the machines that CI workflows run on? I'm curious to understand more thoroughly what those issues are.

3 Likes

The way I've wrangled this is to do a benchmark comparison in a PR, where I first build and run a local benchmark for the main branch and then for the PR branch, all in the same CI instance.

As a source of absolute benchmark numbers, running in CI is often just useless due to the inconsistent nature of the machines you're handed to run the CI jobs, but within a single instance it's consistent enough that I've had generally good luck with that flow.

In practice, I found checking every PR was kind of annoying though, so after I got it all working, I pulled it out and do it as a manual check when I'm cutting a release. I have this vague plan of making a GH Action for the process at some point, but haven't worked back to that yet.

2 Likes

For that reason, I'd recommend recording baselines for metrics that don't fluctuate much with the machine the benchmarks run on: number of allocations, syscalls, count of instructions executed (rather than the time taken to execute those instructions), etc.

For package-benchmark those are .mallocCountTotal and .syscalls. We have set those up for SwiftPM, but haven't integrated support for instruction counts yet. I know swift-syntax has its own ad hoc instruction counter setup.
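
A minimal sketch of what restricting a benchmark to those machine-stable metrics looks like (the benchmark name and workload are invented):

```swift
import Benchmark

// Placeholder workload; substitute the code path you actually care about.
func workUnderTest() -> [Int] {
    (0..<1_000).map { $0 &* 2 }
}

let benchmarks = {
    // Record only metrics that stay stable across different CI hardware,
    // so a stored baseline remains comparable between machines.
    Benchmark("MachineStableMetrics",
              configuration: .init(metrics: [.mallocCountTotal, .syscalls])) { benchmark in
        for _ in benchmark.scaledIterations {
            blackHole(workUnderTest())
        }
    }
}
```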

3 Likes

Agree with the previous two comments basically - checking baselines for metrics that don't vary much regardless of execution time is the way to go on machines you don't have control over (mallocs/syscalls/memory footprint/instruction count are the low-hanging fruit). For those you can have a zero-regression policy and thus avoid the slippery-slope problem.

If you want to measure actual timings you’d need a dedicated machine for it (that’s how we do those measurements).

It is also, as mentioned, quite useful to run and store a local benchmark baseline when working on a branch and then compare against that baseline before merging - that is just a local engineer check, not part of CI proper, but a good way to avoid breaking CI checks…

If you want absolute threshold checks you can do that too - have a look at how e.g. swift-nio sets things up for that (P90 threshold checks).
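
As a rough sketch of what per-metric thresholds can look like in package-benchmark - the numbers are invented and the thresholds initializer is written from memory, so treat the exact shape as an assumption and check the package docs:

```swift
import Benchmark

// Placeholder workload.
func workUnderTest() -> Int {
    (0..<10_000).reduce(0, +)
}

let benchmarks = {
    // Assumed API: relative threshold values are percentages keyed by percentile.
    // Allow up to a 5% p90 regression on wall clock time, and no regression at
    // all on allocation count (a zero-regression policy).
    Benchmark("ThresholdedWork",
              configuration: .init(
                  metrics: [.wallClock, .mallocCountTotal],
                  thresholds: [
                      .wallClock: .init(relative: [.p90: 5.0]),
                      .mallocCountTotal: .init(relative: [.p90: 0.0])
                  ])) { benchmark in
        for _ in benchmark.scaledIterations {
            blackHole(workUnderTest())
        }
    }
}
```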

1 Like

Ok cool, thank you, so:

this much I'm clear on now.

But this:

I'm not clear on. One thing that confused me is the statement "run a local benchmark [...] all in the same CI instance." On which machine is this benchmark measured? Could either of you describe in a bit more detail what the overall procedure is?

I meant on the local machine (where you can easily store a baseline when starting work on a PR, and where you have control over the machine), not in CI - for CI time measurements, dedicated runners are really the only way, as you need total control of the environment.

2 Likes

It's definitely the case that accumulated sub-threshold regressions can pile up into something meaningful. I recall a major project where, after ripping out a bunch of work that had gone in with no regressions, we discovered an unexpected 5% speedup - exactly this situation.

For most projects, though, it's probably the case that if you're doing this level of rigor around performance, you're already so far ahead of a typical project that the added value of being more thorough is pretty small.

3 Likes