[pitch] Swift Benchmarking Infrastructure

hassila · August 22, 2022, 12:37pm

Ok, still many details that needs to be sorted out, but would be grateful for any feedback on a next iteration with some more flesh that is going for a plugin approach instead. There's an emphasis on percentiles for all measurements instead of average, (p0 == min, p100 == max):

Performance Integration Test Harness (PITH)

Authors: Joakim Hassila
Status: Awaiting implementation

Introduction

Applications that are sensitive to various performance metrics (CPU, memory usage, ...)
needs automated testing and comparisons vs. known good baselines to ensure that changes
to the code base - or dependencies - doesn't introduce performance regressions.

Most benchmarking libraries focus on wall clock runtime primarily commonly with microbenchmarks,
which while helpful during tuning, doesn't give the full performance metric coverage desired
for larger applications.

More coarse performance tests covering more functionality is desireable for large applications -
the analogue would be that typical microbenchmarks are similar to unit tests, while the goal with
the PITH is more analogous to integration tests.

For multi-platform software, it's also desirable to be able to use platform-specific
tools to capture such performance metrics on multiple platforms, to avoid introducing
platform-specific bugs due to different interactions with the underlying platform.

This pitch provides a harness for running such performance tests in an
automated manner with a comprehensive set of external probes for interesting metrics
and will save baselines for different machines the test is run on with the
goal of reducing performance regressions shipped and help engineers during optimization
work as well as providing a data source for visualizing changes over time if desired.

The primary goal is to automate and simplify performance testing to avoid
introducing regressions in any of the captured metrics, especially as part
of PR validation (but also ad-hoc by the engineer as needed).

A key feature is to be able to automate benchmarks as part of CI workflow in a great way.

This is intended to be complementary to any work being done on improving
microbenchmarks which may appeal to a wider audience, but perhaps we can see
an integrated approach after discussions.

Primary audiences for using PITH would be library authors, 'Swift on Server' developers
and other multi-platform users caring about performance (CPU, memory or otherwise).

Motivation

For more complex applications it is common to have a performance test suite that focuses
primarily on the runtime of the test - and compare that with a baseline on a blessed
performance testing host.

This approach is missing a few important pieces, e.g. it doesn't capture a few key metrics
that may end up causing trouble in production on other machines that are not
specced identical to the blessed performance testing host (hardware as well as OS).

It also doesn't provide the typical engineer a fast verification loop
for changes that may impact performance as e.g. unit tests does for verifying
fundamental functional behavior.

There is also often a dependency on a performance testing reference host which may be
running with a different hardare / OS setup (e.g. an engineer may be developing on a
M1 Pro / macOS machine, while the performance verification may be done
on x86 / Linux in the CI pipeline).

There are a plethora of useful tools that can be used to analyze
performance on various platforms, but often they are just used
ad-hoc by most engineers. The idea here is to make it extremely easy to use them and
to support automation by packaging them.

The performance tests are very similar to integration tests and complement them,
but the dividing line is that an integration tests must always pass to be allowed to
be merged, while a failed performance test (breaching previous baseline) allows
for a responsible engineer to analyze if the regression is acceptable or not
(and if acceptable, would reset the baseline for the metric), depending
on a projects requirements.

Desired functionality

The ability to define and run benchmarks and capture a wide range of performance related information:

Wall clock time
CPU time
Number of malloc/free (to capture excessive transient memory allocations)
Peak memory usage (broken down per type)
Memory leaks (total)
System calls made (total)
Disk I/O (total IO, total data amount)
Network I/O (total IO, total data amount)
Instruction count
Cache utilization
OS thread count (peak)
Number of context switches

and to compare it to baseline / trend analysis.

Automated run in CI and providing reports for PR:s with differentiating runs should be supported
with the view of automating benchmark validation for the PR.

We also want to support local developer workflow when performing optimizations to help quantifying improvements (in contrast with analytics tools such as Instruments that help developers to pinpoint bottlenecks and then improve them).

By default, the tool should support updating of baselines that are improved and notify the user about it, such that manual steps to improve the baselines aren't required in e.g. a CI environment. To update the baseline to a worse level is a manual step.

The tool should support multiple named baseline for A/B comparisons.

Proposed solution overview

The solution will be composed of the following major pieces:

A benchmark runner that is implemented as a SwiftPM command plug-in
A Swift package providing a library for supporting benchmark measurements which is used by the actual benchmarks
A GitHub workflow action for running benchmarks on main vs PR and provide a comment to the PR with results
A visualization tool for benchmark results and trends as needed for the different workflows

The benchmarks will be discovered using a convention where a project should include a Benchmarks directory on the top level which will contains a subdirectory for each benchmark target. Each benchmark target should be a SwiftPM executable target that uses the benchmark support library to define the actual benchmarks it implements. The benchmark plugin will communicate with the benchmark supporting library over pipes.

The command plug-in has the following major responsibilities:

Implement swift package benchmark (with relevant subcommands, e.g. list, baseline [set] etc)
Ensure that all benchmark targets are rebuilt if needed before running them
Discover all benchmarks by quering the benchmark targets (also used for swift package benchmark list)
Run the benchmarks (with skip/filter support as for normal tests)
Setup relevant benchmark environment as defined by the benchmark (e.g. supporting processes, CPU set thread pinning for both supporting processes and benchmark, ...)
Verify that system load is reasonable before running benchmark, otherwise warn / abort
Setup any tools as required to capture information (e.g. preloading malloc analytics, dtrace, bpftrace, custom samplers...)
Capture the output from any tools as required
Capture the results from the benchmark
Compare / store baseline results from the captured output
Fail benchmark based on trend/baseline violations
Tear down and clean up the benchmark environment (e.g. supporting processes)
Support capture of named baselines for iterative development and A/B/C comparisons

We should use and leverage existing fundamental tools for capturing information, e.g. dtrace, bpftrace, perf, heaptrack, leaks, malloc library interposition or similar techniques as needed.

The goal with this project is not to build the low-level tools for capturing the metrics, but instead to package the best tools we can identify and to allow performance tests to be run on multiple platforms easily and to keep track of baselines.

The actual tools used per platform should remain an implementation detail and can be evolved over time if better tools are identified.

The benchmark support library has the following major responsibilities:

Declaration of benchmarks with parameters (all default to automatic, but can be overriden, e.g. time to run test or number of iterations, whether the test should be run isolated (process restart between runs), which probes are relevant/needed for the benchmark (specific, default, all), what units the benchmark is expressed in, ...)
Simplify the implementation of the benchmarks and remove as much boiler plate code as possible
Simple things should be easy, e.g. a trivial CPU microbenchmark should require minimal boilerplate, support custom benchmark counters (i.e. cache hits / cache misses ratio or similar)
Implement a communication channel with the command-plugin to handle start/ready/go/stop interactions if needed
Implement a few built-in probes that are simpler in nature (e.g. CPU/memory usage probes)
Run the actual benchmarks the appropriate number of times and capture results
Provide feedback to the command plugin of benchmark results (e.g. distributions and percentiles for internal probes)
Ability to run the benchmarks (with internal probes enabled)

The GitHub workflow action as the following major responsibilites:

Checkout and run all benchmarks for both main and branch
Create a comment on the PR with benchmark results and/or add artifacts that can be analyzed with external tool (e.g. JMH or other, perhaps something exists that we could pipeline the output with for presentation)

The visualization of benchmark results has the following major responsibilities:

Pretty-printing of benchmark results for CLI viewing
Possibly generate graphs of detailed measurements, especially compared with captured baseline
Exports / conversions to other usable formats (FMH, JSON, ...)

Detailed design

Benchmarks result storage

The .benchmarks directory keeps the latest result.

hostIdentifier - an identifier for the current host that should be set using the SWIFT_BENCHMARK_HOST_IDENTIFIER environment variable

target - the name of the executable benchmark target

test - the name of the actual test inside the target

default - results captured during normal run of the benchmark

probe - results for a given probe

named - results captured for a specific named baseline - used for A/B testing etc.

Info.json - files with metadata relevant to the host

.benchmarks
├── hostIdentifier1
│   ├── Info.json
│   ├── target1
│   │   ├── test1
│   │   │   ├── default
│   │   │   │   ├── probe1.results
│   │   │   │   ├── probe2.results
│   │   │   │   ├── probe3.results
│   │   │   │   └── probe4.results
│   │   │   ├── named1
│   │   │   │   ├── probe1.results
│   │   │   │   ├── probe2.results
│   │   │   │   ├── probe3.results
│   │   │   │   └── probe4.results
│   │   │   └── named2
│   │   │       ├── probe1.results
│   │   │       ├── probe2.results
│   │   │       ├── probe3.results
│   │   │       └── probe4.results
│   │   ├── test2
│   │   │   ├── default
│   │   │   │   ├── probe1.results
│   │   │   │   ├── probe2.results
│   │   │   │   ├── probe3.results
│   │   │   │   └── probe4.results
│   │   │   ├── named1
│   │   │   │   ├── probe1.results
│   │   │   │   ├── probe2.results
│   │   │   │   ├── probe3.results
│   │   │   │   └── probe4.results
│   │   │   └── named2
│   │   │       └── ...
│   │   └── test3
│   │       └── ...
│   ├── target2
│   │   ├── test1
│   │   │   └── ...
│   │   └── test2
│   │       └── ...
│   └── ...
├── hostIdentifier2
│   ├── Info.json
│   ├── target1
│   │   └── ...
│   ├── target2
│   └── ...
└── ...

Files

When storing a baseline, there are a few different files that will be updated:

The probeN.results will be stored in JSON format and contains the following:

timestamp = <swift timestamp>
unit = us
polarity = lowerIsBetter
type = [throghput, time, amount]
iterations = 100
warmup = 10
runtime = <swift duration>
isolation = false
p0 = 8
p25 = 14
p50 = 17
p75 = 23
p90 = 43
p99 = 88
p100 = 147

Info.json for hosts includes:

hostname = myhost
cpus = 20 (arm64e)
memory = 128 (GB)

Command line usage

Tentative command line usage examples:
swift package benchmark benchmarkTarget
swift package benchmark all
swift package benchmark benchmarkTarget --baseline-update
swift package benchmark benchmarkTarget --baseline-remove
swift package benchmark all --baseline-remove
swift package benchmark benchmarkTarget --baseline-compare
swift package benchmark benchmarkTarget --baseline-compare namedBaseline
swift package benchmark benchmarkTarget --baseline-export namedBaseline --format

Similar to swift test we should support:

  --filter <filter>       Run benchmarks matching regular expression, Format: <benchmark-target>.<benchmark-case> or <benchmark-target> or <benchmark-case>
  --skip <skip>           Skip test cases matching regular expression, Example: --skip EnduranceTests

Sample tools usage:

> swift package benchmark benchmarkTarget
Running benchmarkTarget on host with Darwin 21.6.0 / arm64e / 20 cores / 128 GB / average load (1m) 1.20
Running probes [Time, CPU, Memory] ... finished (10 iterations, 449117 samples)
Wall clock percentiles (μs):
       0.0 <= 3
      25.0 <= 6
      50.0 <= 7
      75.0 <= 9
      90.0 <= 10
      99.0 <= 14
     100.0 <= 256
CPU time percentiles (μs):
       0.0 <= 30
      25.0 <= 60
      50.0 <= 70
      75.0 <= 90
      90.0 <= 100
      99.0 <= 140
     100.0 <= 2560
Memory percentiles (MB):
       0.0 <= 192
      25.0 <= 195 
      50.0 <= 200
      75.0 <= 202
      90.0 <= 203
      99.0 <= 208
     100.0 <= 210
Running probes [Malloc] ... finished (10 iterations)
Malloc count percentiles (#):
       0.0 <= 21000
      25.0 <= 21000
      50.0 <= 21000
      75.0 <= 21003
      90.0 <= 21008
      99.0 <= 21010
     100.0 <= 21210
Running probes [Syscalls, Treads, ] ... finished (10 iterations)
Syscalls count percentiles (#):
       0.0 <= 121000
      25.0 <= 121000
      50.0 <= 121000
      75.0 <= 121003
      90.0 <= 121008
      99.0 <= 221010
     100.0 <= 221210
Thread count percentiles (#):
       0.0 <= 12
      25.0 <= 14
      50.0 <= 14
      75.0 <= 14
      90.0 <= 15
      99.0 <= 15
     100.0 <= 21

> swift package benchmark benchmarkTarget --baseline-compare --percentile p90
Running benchmarkTarget on host with Darwin 21.6.0 / arm64e / 20 cores / 128 GB / average load (1m) 1.20 / 2022-07-18 12:42
Comparing with baseline from host with Darwin 21.5.0 / arm64e / 20 cores / 128 GB / average load (1m) 1.13 / 2022-06-12 13:10
Running probes [Time, CPU, Memory] ... finished (10 iterations, 449117 samples)

Wall clock percentiles (μs): [FAIL]
       0.0 <= -1
      25.0 <= -2
      50.0 <= -1
      75.0 <= 0
      90.0 <= +3
      99.0 <= +2
     100.0 <= +1
CPU time percentiles (μs): [Success]
       0.0 <= -1
      25.0 <= -2
      50.0 <= -1
      75.0 <= 0
      90.0 <= +3
      99.0 <= +2
     100.0 <= +1
Memory percentiles (MB): [Success]
       0.0 <= -30
      25.0 <= -23
      50.0 <= -20
      75.0 <= -18
      90.0 <= -15
      99.0 <= -12
     100.0 <= -5

Success / failure of comparisons should be able to specify with up to which baseline percentile that we want to compare.

E.g. swift package benchmark benchmarkTarget --baseline-compare --percentile p80 (then all results under p80 must improve for the comparison to return a success).

Benchmark implementation sample

Tentative sample code, API needs iteration.

@testable import Frostflake

import Benchmark

final class FrostflakeTests: Benchmark {

    func benchmarkFrostflakeClassOutput() async -> Benchmark {
      benchmark = Benchmark()
      benchmark.addProbes([.cpu, .memory, .syscalls, .threads]) // optional
      benchmark.setEnvironment(isolated: false, runtime: 5000) // optional
      if benchmark.active { // optional setup
        let frostflakeFactory = Frostflake(generatorIdentifier: 1_000)      
      }
      benchmark.run {
            for _ in 0 ..< 10 {
                let frostflake = frostflakeFactory.generate()
                let decription = frostflake.frostflakeDescription()
                blackHole(decription)
            }
        }
      if benchmark.active { // optional teardown
        // teardown
      }
      return benchmark
    }

    func benchmarkFrostflakeClassOutputWithDefaultSettings() -> Benchmark {
      return Benchmark().run {
           let frostflakeFactory = Frostflake(generatorIdentifier: 1_000)      
            for _ in 0 ..< 1_000 {
                let frostflake = frostflakeFactory.generate()
                let decription = frostflake.frostflakeDescription()
                blackHole(decription)
            }
        }
    }
}

Future directions

It would be desirable to extend the support for supporting processes being set up to run on remote machines instead of on the local machine for performance testing distributed systems / networking components. E.g., it would be useful to start a multicast producer that generates traffic that the component under test would connect to.

Alternatives considered

We have looked around a lot on different testing frameworks, but haven't found one with the required focus of running tests under different kinds of instrumentation automatically and capturing it, most are focused on unit testing or microbenchmarking or are focused on functional integration testing primarily (and/or have significant setup barriers).

Acknowledgments

Karoy Lorentey's excellent benchmark library for Swift Collections - highly recommended for time complexity analysis of collections.

Karl who originally suggested implementing this as a SwiftPM command tool

The SwiftNIO team have had malloc counters as part of the integration testing for a long time and I found that very helpful.