[pitch] Swift Benchmarking Infrastructure

Swift Benchmarking Infrastructure (SwiftPM/swift-corelibs-xctest)

Introduction

There's a need for easier-to-use benchmarking infrastructure for
Swift projects that works cross-platform for users of SwiftPM.

Motivation

Currently there is no coherent story for cross-platform performance
analytics for Swift. On macOS, there is some built-in support in the
proprietary XCTest tool with the new XCTMetric suite of performance
probes, but this does not work on other platforms, as swift-corelibs-xctest
doesn't implement those features (yet). There are also other ergonomic
issues with running performance tests as part of the normal test suite
(e.g. tests are typically built without optimization, while you will
almost always want to run benchmark tests with optimization enabled).

It's also important that benchmarks can easily be integrated into
CI pipelines and be run from the command line, without requiring e.g. Xcode.

Often, it's also desirable for benchmarks to output more information than
a typical pass/fail, as well as to store baselines, allow
tools to update such baselines, and fail a test if baseline benchmark
performance isn't reached.

The focus of benchmark tests is comparison vis-à-vis baseline metrics, but
a benchmark can still be failed manually by the user, just like a normal test.

This is possible for Xcode projects today, but is done in a proprietary
way, as SwiftPM currently seems to drive XCTests on macOS with a separate
proprietary tool and not use swift-corelibs-xctest. It would be desirable
to at least optionally allow SwiftPM users to use swift-corelibs-xctest
as the implementation on all platforms, for consistency for benchmarks.

It would greatly benefit all Swift users doing cross-platform work if
a more unified approach could be taken and benchmarks/performance
testing became easier to use.

There are also a number of benchmarking solutions for Swift out there
that should be mentioned (in various states):

Google's swift-benchmark

Apple's Swift Collection benchmark

Apple's proprietary XCTest infrastructure

Proposed solution

The proposal is to extend and improve SwiftPM and swift-corelibs-xctest support
for running benchmarks in a cross-platform manner and to leverage existing
design and implementation from existing tools.

The short proposition is:

  • SwiftPM - add a new benchmarkTarget which expects performance test suites in
    the folder Benchmarks, analogous to Tests. These will be very similar to normal tests, except
    they are built with optimization, optionally use a different runtime environment to capture more
    complex metrics, generate/compare results with baselines and can generate richer output for
    certain tests - and should be optimized for CLI.
    They should be able to share the majority of code with the existing test runner in SwiftPM.

This could look like:

// swift-tools-version: 5.7

import PackageDescription

let package = Package(
    name: "package-frostflake",
    products: [
    ...
    ],
    dependencies: [
    ...
    ],
    targets: [
    ...
        .benchmarkTarget(
            name: "FrostflakeBenchmarks",
            dependencies: ["SwiftFrostflake",
                           "Frostflake"]
        ),
    ]
)
  • Swift(PM) - add a new benchmark command analogous to test that will run the benchmark suite
    and can be used to manipulate baselines etc. So we'd have e.g. swift benchmark and swift benchmark --reset-baseline and similar commands, analogous to swift test. More detailed design required.

  • Drive benchmarks on all platforms using swift-corelibs-xctest. As this is new
    additive functionality, normal tests can continue to run with the proprietary XCTest on macOS.

  • Leverage the performance metric design done with XCTMetric etc and implement those
    metrics in swift-corelibs-xctest and use that API to capture benchmark performance
    data as a start.

Storage of baseline data

This is currently broken in swift-corelibs-xctest and even
if fixed doesn't support cross-platform development that well.

We want to store baseline data for different machines to allow developers to
have a completely local development workflow while still validating vs. baselines
on e.g. a CI machine running on another platform.

The directory contains the following kind of entities:

targetN - the name of the SPM benchmark target to test
testN - the name of the actual test
machineIdentifierN - a hostname (or MAC address, or...? TBD)

The .baselines directory keeps the latest result per machine identifier:

.baselines
 ├ <target1>
 │  ├ <test1>
 │  │  ├ <machineIdentifier1>.result
 │  │  ├ <machineIdentifier2>.result
 │  └ <test2>
 │  │  ├ <machineIdentifier1>.result
 │  │  ├ <machineIdentifier2>.result
 │  └ <test3>
 │  │  ├ <machineIdentifier1>.result
 │  │  ├ <machineIdentifier2>.result
 ├ <target2>
 │  ├ <test1>
 │  │  ├ <machineIdentifier1>.result
 │  │  ├ <machineIdentifier2>.result
 │  └ <test2>
 │  │  ├ <machineIdentifier1>.result
 │  │  ├ <machineIdentifier2>.result
 │ <target3>
 ┆  └┄

The .result file format is TBD, could be JSON or something else.

Detailed design

Future Directions

Additional performance metrics

We'd like to provide a richer set of performance metrics than what is
currently supported by XCTest (which has a core set of fundamental benchmark
metrics).

E.g. we also want to capture (depending on what is possible for a platform):

  • Number of malloc/free (to capture excessive transient memory allocations and unexpected COW behavior)
  • Memory leaks (numbers/bytes)
  • System calls made (total)
  • Network I/O (total IO, total data amount)
  • OS thread count (peak)
  • OS threads created/destroyed (count)
  • Number of context switches
    and possibly e.g.
  • Number of mutex locks taken/released
  • Cache utilization

We should use and leverage existing tools for this, e.g.
dtrace, bpftrace, perf, heaptrack, leaks, etc, etc.

The goal is not to build the low-level tools for capturing the metrics, but
instead to package the best tools we can identify for each platform and to
allow benchmarks to be run seamlessly.

Swift collection benchmark visualization

We also want to leverage the Swift Collection benchmark package and extend the xctest API to
easily produce such reports for benchmarks which are suited for that kind of measurement,
and make it seamless and simple to get both visualizations and diffs between runs,
which is beautifully supported by that package.

Alternatives considered

We also considered building a completely separate benchmark infrastructure
as a SwiftPM command plug-in analogous to the DocC plugin, that would run
executable targets which would be created using a support Swift package.

This has the advantage of being independent of merging anything with
the open source projects, but we think the overall user experience and
use for the community would be significantly improved with integrated
benchmark support in SwiftPM instead.

We believe it's important to get buy-in for the approach from the community
(and Apple specifically) before investing effort into these improvements, as we
don't want to end up supporting forks of swift-corelibs-xctest and SwiftPM.
In that case it'd be more pragmatic to just support a command plug-in instead.

Acknowledgments

As mentioned before, this builds on the work and design of the teams behind:

Google's swift-benchmark

Apple's Swift Collection benchmark

Apple's proprietary XCTest infrastructure

Also, the SwiftNIO team has had malloc counters as part of its integration testing for a long time,
and that is high on the list of desired next steps.

Having hacked into place my own benchmarks, using one or more of the various libraries you've previously mentioned, I think this would be a fantastic addition.

A consistent, cross-platform way to apply benchmarks for libraries provides a means for what I'd love to use as an end-goal for some of my projects - watching benchmarks track over (major) commits to get a warning of a regression or algorithm mistake that was otherwise missed.

+1. CI benchmarks caught a 6x performance regression in swift-json last week, that i never would have known about if i didn't have benchmarks in CI

Hi @Joakim_Hassila1, this is a great topic but I think I'm going to have to push back on this approach.

Before we dive in, I'd like to stress that I'm very very supportive of getting shared benchmarking lib/plugin -- the amount of times I've reinvented one by now for my own needs is way too high (like 4 by now...).


The more "yet another type of target" SwiftPM gains, the worse IMHO. Instead, we should aim at building all such things using SwiftPM plugins.

In a previous life, I developed sbt-jmh - arguably the™ benchmarking solution for Scala. [note: I should clarify here, this calls out to Java MicroBenchmark Harness, which is part of OpenJDK and is an incredible piece of engineering, I only did the plugin and many fun integrations there]. On the JVM benchmarking is much harder, because one has to carefully account for JIT warmup etc. Though this also is important in larger benchmarks regardless of runtime; (there's always some cache somewhere :wink:). This was done as a plugin, and this way we could iterate on it regardless of build tool releases which IMHO is important.

I also recently kicked off swift package multi-node test which allows distributed actors to execute their tests across different processes and later on even across physical machines, without any changes to test code. See here: [WIP] New multi-node infrastructure for integration tests by ktoso · Pull Request #1055 · apple/swift-distributed-actors · GitHub

All this is doable in plugins, without special target types.

What you describe here with the baseline files is a very good idea, but again: we should be able to pull it off as a plugin. And when we hit limitations in sandbox etc, we should improve these aspects of plugins, rather than design a one-off thing for benchmarks.

As such, I'd instead suggest focusing on what a great benchmarking plugin looks like and start from there. So... what does an ideal benchmarking plugin look like? Can we cobble things up together from pieces from the existing benchmarking libs or do we have to build something anew?

With this being a proposal, I'm assuming you are interested in doing some of the work -- would you be able to come up with a design and goals and then interested people could perhaps help out? This would be similar to Swiftly, where an initial design was shared by a few people, and it is going to be developed and opened up in the open very soon. We could take the same approach here, even if a "tools workgroup" does not exist yet, there's nothing preventing people from coming together around a shared goal already :slight_smile:

Hi @ktoso, no worries, the pushback is absolutely fine - to get feedback was why I posted.

Doing it as a plugin was another approach we considered and would be absolutely fine with - we just have certain kinds of additional performance tests that we want to be able to run systematically (both locally and in CI, multi-platform) and would be happy to explore that as the approach.

Ok, this is one reason why I wanted to post to gauge feedback on the approach before we commit engineering time to it :wink:

Agree it's very nice to be able to iterate without tying it to releases too and easier to get more people involved.

Cool, will have a look! Multi-node tests is definitely something we'd like to have, would be super cool if it was possible to drive that.

That makes sense.

This is one of the big questions, as I'm sure different people have slightly different views there of course - to me there are also a lot of questions on how to best fit things into the existing Swift ecosystem (not having enough hands-on experience yet to immediately see what is the 'right' approach). We do have a number of requirements / things that we want to solve (that I've both had and/or missed in previous life's work with those performance testing infrastructures we built then).

Sure, we are definitely willing to do some of the work - we have done a fairly exhaustive search for performance test drivers and not really found anything that is a great fit for us, so we ended up with the conclusion that we need to do 'something' about it, as having automated performance metrics in place is quite high on our priority list.

We can summarise our goals/requirements and do an initial design, but there are a number of fundamental questions on how best to do it so that it fits well into the ecosystem and so on. So the question is perhaps how to capture that in an efficient manner so as not to go too far afield. I'm not sure how it was done for Swiftly?

if i may expand our horizons a bit, it would be much easier to write such tooling if we actually had complete, modern file system APIs.

swift-system just isn't where it needs to be yet to support these kinds of use-cases. simple operations like writing a string to a file require fiddling around with ERRNO and unsafe buffer pointers, other fundamental operations like iterating a directory or kickstarting a subprocess aren't implemented at all.

the FilePath APIs are clunky, emphasize pedantry over usability, and suffer from low ecosystem interoperability despite the stated goal of the API being to provide a common format for file paths.

libraries that only need the FilePath type definitions have to import and depend on the entire SystemPackage, which takes a long time to build (and should really be called SystemModule).

i've ended up having to write my own extensions to the package, as i'm sure many others have had to do as well. but my extensions only compile on linux, which means i can't really use them in any serious public-facing libraries that need to maintain support for macOS and Windows, and cross-compile to devices.

I think that'd be a great start! If we could write up the goals and phases... and maybe from there to get to some divisible work -- if you're looking for help, maybe people can help out etc.

I can add to the list:

  • ability to have a top level Benchmarks/ and put benchmark targets there
    • I do this in MultiNodeTests, the package simply uses a separate directory for organization; no new targets needed
  • ability to list benchmarks
  • various benchmark modes: throughput (ops per second), or avg time etc; Inspirations: code-tools/jmh: 2be2df7dbaf8 jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_02_BenchmarkModes.java
  • ability to influence reported time units (some benchmarks make sense in ns, others in ms)
  • ability to report either time, or iterations
  • benchmark results must report mean error or deviation between test results (e.g. mode:avgt samples:250 avg:2.042ns error:0.017ns unit:ns/op)
  • pretty print test results, or dump them as json or other format
  • dump information about env (how many cores, what cpu etc) before a benchmark run
  • ability to run benchmark with various -wi (warmup iterations) and -i iterations
  • ability to dump results into json or some other format; I'd think JMH format would be fantastic since many tools for visualization exist https://jmh.morethan.io
  • ability to record a baseline; perhaps similar as in JMH a benchmark type can contain @Baseline benchmark OR run a benchmark and --baseline to record a baseline before optimizations.
    • an in-line baseline helps IMHO so the benchmark doesn't accidentally measure some nonsense like "wow it is faster than doing nothing!" which sometimes happens :joy:
  • ability to declare and count extra counters; e.g. in my benchmark ops/sec is interesting but I also want to record "cache miss" or something programmatically in the benchmark; so some @AuxCounter("something") that is possible to hit during the benchmark would be useful
  • ability to run two (or more) benchmark methods concurrently; i.e. two fields annotated with Benchmark("same") should be run concurrently; this way we can benchmark code under contention
  • extra:
    • measure memory use before/after
    • other integrations

Of course this is just "my wishlist" and it depends what's most important etc... but figuring out what is important and how we can evolve the tool would be part of the journey :slight_smile: None of those need special SwiftPM things IMHO, we can get very far with just an executable and plugin.

More inspiration here: Nanotrusting the Nanotime and code-tools/jmh: 2be2df7dbaf8 /jmh-samples/src/main/java/org/openjdk/jmh/samples/ and others in the series :-)

(I'm just geeking out here as it's a topic near and dear to my heart)


I'm sorry, I didn't realize the design doc wasn't shared more publicly for Swiftly;

but I can summarize how it came to be: people interested and willing to do the work (primarily led by @patrick and @adam-fowler) worked on a design doc about "this is how we envision it'll work" and after some discussions and prototyping are now moving to implementing... That'll all be in the open soon, just the initial designs (since we weren't sure about many things) were circulated in a smaller group.

I would like to note that I personally find baselines incredibly flaky and unreliable in practice, at least in my projects. An ideal Swift benchmarking harness IMO should run benchmarks to compare commits and branches. I personally find such reports more reliable, especially if they are running on CI.

That's only a part of the functionality I'd like to see, the other big and just as important part is displaying/publishing these reports, integration with something like GitHub Actions would be a must. If it were able to execute two runs on a base branch and a PR branch, compare them and publish a report as a comment to a corresponding PR, that would be fantastic.

This allows running benchmarks on every commit and enables checks that forbid (or at least warn about) merging PRs leading to performance regressions, instead of sporadic benchmarking that may be rarely used.

i would really like to be able to see a plot over time of benchmark results

We're already doing something like this in SwiftWasm binary size tracker, source code available here.

I definitely agree that we should be able to easily run a CI validation step that compares the result from main vs. a PR branch as part of the PR and publish a report - that is more of a higher level concern though for a test driver (e.g. a GitHub Action CI workflow step) that can checkout / compare various branches.

Baselines are a bit tricky, we've used them extensively historically with good results, but always used dedicated test host actual hardware. I think those are complementary though and may be of different value to different users and what metrics you focus on - but e.g. having #malloc count as a baseline value is often fairly stable and quite useful in my experience. They can also be extracted and analysed over time to get a plot over time as @taylorswift asked for.

In my experience it's also critical to focus on distributions and percentiles rather than averages.

Anyway, I'll take a step back and digest the overall feedback and we'll discuss internally how we'll move forward.

Ok, there are still many details that need to be sorted out, but I would be grateful for any feedback on a next iteration with some more flesh that is going for a plugin approach instead. There's an emphasis on percentiles for all measurements instead of averages (p0 == min, p100 == max):

Performance Integration Test Harness (PITH)

Introduction

Applications that are sensitive to various performance metrics (CPU, memory usage, ...)
need automated testing and comparisons vs. known good baselines to ensure that changes
to the code base - or dependencies - don't introduce performance regressions.

Most benchmarking libraries focus primarily on wall clock runtime, commonly with microbenchmarks,
which, while helpful during tuning, don't give the full performance metric coverage desired
for larger applications.

Coarser performance tests covering more functionality are desirable for large applications -
the analogy would be that typical microbenchmarks are similar to unit tests, while the goal with
the PITH is more analogous to integration tests.

For multi-platform software, it's also desirable to be able to use platform-specific
tools to capture such performance metrics on multiple platforms, to avoid introducing
platform-specific bugs due to different interactions with the underlying platform.

This pitch provides a harness for running such performance tests in an
automated manner with a comprehensive set of external probes for interesting metrics
and will save baselines for different machines the test is run on with the
goal of reducing performance regressions shipped and help engineers during optimization
work as well as providing a data source for visualizing changes over time if desired.

The primary goal is to automate and simplify performance testing to avoid
introducing regressions in any of the captured metrics, especially as part
of PR validation (but also ad-hoc by the engineer as needed).

A key feature is to be able to automate benchmarks as part of CI workflow in a great way.

This is intended to be complementary to any work being done on improving
microbenchmarks which may appeal to a wider audience, but perhaps we can see
an integrated approach after discussions.

Primary audiences for using PITH would be library authors, 'Swift on Server' developers
and other multi-platform users caring about performance (CPU, memory or otherwise).

Motivation

For more complex applications it is common to have a performance test suite that focuses
primarily on the runtime of the test - and compare that with a baseline on a blessed
performance testing host.

This approach is missing a few important pieces, e.g. it doesn't capture a few key metrics
that may end up causing trouble in production on other machines that are not
specced identical to the blessed performance testing host (hardware as well as OS).

It also doesn't provide the typical engineer a fast verification loop
for changes that may impact performance, as e.g. unit tests do for verifying
fundamental functional behavior.

There is also often a dependency on a performance testing reference host which may be
running with a different hardware / OS setup (e.g. an engineer may be developing on an
M1 Pro / macOS machine, while the performance verification may be done
on x86 / Linux in the CI pipeline).

There are a plethora of useful tools that can be used to analyze
performance on various platforms, but often they are just used
ad-hoc by most engineers. The idea here is to make it extremely easy to use them and
to support automation by packaging them.

The performance tests are very similar to integration tests and complement them,
but the dividing line is that an integration test must always pass for a change to
be merged, while a failed performance test (breaching the previous baseline) allows
a responsible engineer to analyze whether the regression is acceptable or not
(and if acceptable, reset the baseline for the metric), depending
on a project's requirements.

Desired functionality

The ability to define and run benchmarks and capture a wide range of performance related information:

  • Wall clock time
  • CPU time
  • Number of malloc/free (to capture excessive transient memory allocations)
  • Peak memory usage (broken down per type)
  • Memory leaks (total)
  • System calls made (total)
  • Disk I/O (total IO, total data amount)
  • Network I/O (total IO, total data amount)
  • Instruction count
  • Cache utilization
  • OS thread count (peak)
  • Number of context switches

and to compare it to baseline / trend analysis.

Automated runs in CI that provide reports for PRs, showing the difference between runs, should be supported
with the view of automating benchmark validation for the PR.

We also want to support the local developer workflow when performing optimizations, to help quantify improvements (in contrast with analytics tools such as Instruments that help developers pinpoint bottlenecks and then improve them).

By default, the tool should support updating of baselines that are improved and notify the user about it, such that manual steps to improve the baselines aren't required in e.g. a CI environment. To update the baseline to a worse level is a manual step.
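The update rule above can be sketched as a small decision function. This is only an illustration of the policy, assuming a single scalar value per metric; the names BaselineDecision and decide are hypothetical, and higherIsBetter is an assumed counterpart to the lowerIsBetter polarity used elsewhere in this pitch.

```swift
/// Whether a smaller or a larger measurement counts as an improvement.
/// `higherIsBetter` is an assumption; the pitch only shows `lowerIsBetter`.
enum Polarity {
    case lowerIsBetter
    case higherIsBetter
}

/// Outcome of comparing a fresh measurement against the stored baseline.
enum BaselineDecision {
    case updateAutomatically   // result improved: store it and notify the user
    case keep                  // result unchanged: nothing to do
    case requiresManualReset   // result regressed: a human has to accept it
}

/// Hypothetical policy helper, not part of any existing tool.
func decide(baseline: Double, measured: Double, polarity: Polarity) -> BaselineDecision {
    if measured == baseline { return .keep }
    let improved: Bool
    switch polarity {
    case .lowerIsBetter: improved = measured < baseline
    case .higherIsBetter: improved = measured > baseline
    }
    return improved ? .updateAutomatically : .requiresManualReset
}
```

A real implementation would likely compare several percentiles and allow for a noise tolerance rather than use exact comparison.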

The tool should support multiple named baselines for A/B comparisons.

Proposed solution overview

The solution will be composed of the following major pieces:

  1. A benchmark runner that is implemented as a SwiftPM command plug-in
  2. A Swift package providing a library for supporting benchmark measurements which is used by the actual benchmarks
  3. A GitHub workflow action for running benchmarks on main vs PR and providing a comment to the PR with results
  4. A visualization tool for benchmark results and trends as needed for the different workflows

The benchmarks will be discovered using a convention where a project should include a Benchmarks directory on the top level which will contain a subdirectory for each benchmark target. Each benchmark target should be a SwiftPM executable target that uses the benchmark support library to define the actual benchmarks it implements. The benchmark plugin will communicate with the benchmark supporting library over pipes.
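As a rough sketch of the directory convention only (not of how a SwiftPM plugin would actually query the package model), discovery could amount to listing the subdirectories of Benchmarks. The function name discoverBenchmarkTargets is hypothetical.

```swift
import Foundation

/// Returns the names of the immediate subdirectories of `<packageRoot>/Benchmarks`,
/// sorted alphabetically. By the convention above, each one is a benchmark target.
/// Hypothetical helper for illustration only.
func discoverBenchmarkTargets(packageRoot: URL) -> [String] {
    let benchmarksDirectory = packageRoot.appendingPathComponent("Benchmarks")
    let fileManager = FileManager.default
    guard let entries = try? fileManager.contentsOfDirectory(atPath: benchmarksDirectory.path) else {
        return []   // no Benchmarks directory: nothing to run
    }
    return entries.filter { entry in
        // Keep directories only; plain files such as a README are ignored.
        var isDirectory: ObjCBool = false
        let path = benchmarksDirectory.appendingPathComponent(entry).path
        return fileManager.fileExists(atPath: path, isDirectory: &isDirectory) && isDirectory.boolValue
    }.sorted()
}
```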

The command plug-in has the following major responsibilities:

  • Implement swift package benchmark (with relevant subcommands, e.g. list, baseline [set] etc)
  • Ensure that all benchmark targets are rebuilt if needed before running them
  • Discover all benchmarks by querying the benchmark targets (also used for swift package benchmark list)
  • Run the benchmarks (with skip/filter support as for normal tests)
  • Setup relevant benchmark environment as defined by the benchmark (e.g. supporting processes, CPU set thread pinning for both supporting processes and benchmark, ...)
  • Verify that system load is reasonable before running benchmark, otherwise warn / abort
  • Setup any tools as required to capture information (e.g. preloading malloc analytics, dtrace, bpftrace, custom samplers...)
  • Capture the output from any tools as required
  • Capture the results from the benchmark
  • Compare / store baseline results from the captured output
  • Fail benchmark based on trend/baseline violations
  • Tear down and clean up the benchmark environment (e.g. supporting processes)
  • Support capture of named baselines for iterative development and A/B/C comparisons

We should use and leverage existing fundamental tools for capturing information, e.g. dtrace, bpftrace, perf, heaptrack, leaks, malloc library interposition or similar techniques as needed.

The goal with this project is not to build the low-level tools for capturing the metrics, but instead to package the best tools we can identify and to allow performance tests to be run on multiple platforms easily and to keep track of baselines.

The actual tools used per platform should remain an implementation detail and can be evolved over time if better tools are identified.

The benchmark support library has the following major responsibilities:

  • Declaration of benchmarks with parameters (all default to automatic, but can be overridden, e.g. time to run test or number of iterations, whether the test should be run isolated (process restart between runs), which probes are relevant/needed for the benchmark (specific, default, all), what units the benchmark is expressed in, ...)
  • Simplify the implementation of the benchmarks and remove as much boilerplate code as possible
  • Simple things should be easy, e.g. a trivial CPU microbenchmark should require minimal boilerplate, support custom benchmark counters (i.e. cache hits / cache misses ratio or similar)
  • Implement a communication channel with the command-plugin to handle start/ready/go/stop interactions if needed
  • Implement a few built-in probes that are simpler in nature (e.g. CPU/memory usage probes)
  • Run the actual benchmarks the appropriate number of times and capture results
  • Provide feedback to the command plugin of benchmark results (e.g. distributions and percentiles for internal probes)
  • Ability to run the benchmarks (with internal probes enabled)
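To make the "run the appropriate number of times and report percentiles" responsibilities a bit more concrete, here is a minimal sketch of a built-in wall clock probe. The names measure and percentile are hypothetical, and nearest-rank is just one of several common percentile definitions; the pitch does not prescribe one.

```swift
import Dispatch

/// Runs `body` `warmup` times without recording, then `iterations` times,
/// returning the wall clock duration of each recorded run in nanoseconds.
/// Hypothetical helper; both counts are assumed to be non-negative.
func measure(warmup: Int, iterations: Int, _ body: () -> Void) -> [UInt64] {
    for _ in 0..<warmup { body() }
    var samples: [UInt64] = []
    samples.reserveCapacity(iterations)
    for _ in 0..<iterations {
        let start = DispatchTime.now().uptimeNanoseconds
        body()
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append(end - start)
    }
    return samples
}

/// Nearest-rank percentile, so that p0 is the minimum and p100 the maximum.
/// Returns nil for an empty sample set or a percentile outside 0...100.
func percentile(_ p: Double, of samples: [UInt64]) -> UInt64? {
    guard !samples.isEmpty, (0...100).contains(p) else { return nil }
    let sorted = samples.sorted()
    let rank = Int((p / 100 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]
}
```

A benchmark would then call something like measure(warmup: 10, iterations: 100) { ... } and report percentile(50, of: samples), percentile(99, of: samples) and so on back to the plugin.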

The GitHub workflow action has the following major responsibilities:

  • Checkout and run all benchmarks for both main and branch
  • Create a comment on the PR with benchmark results and/or add artifacts that can be analyzed with external tool (e.g. JMH or other, perhaps something exists that we could pipeline the output with for presentation)

The visualization of benchmark results has the following major responsibilities:

  • Pretty-printing of benchmark results for CLI viewing
  • Possibly generate graphs of detailed measurements, especially compared with captured baseline
  • Exports / conversions to other usable formats (JMH, JSON, ...)

Detailed design

Benchmarks result storage

The .benchmarks directory keeps the latest result.

hostIdentifier - an identifier for the current host that should be set using the SWIFT_BENCHMARK_HOST_IDENTIFIER environment variable

target - the name of the executable benchmark target

test - the name of the actual test inside the target

default - results captured during normal run of the benchmark

probe - results for a given probe

named - results captured for a specific named baseline - used for A/B testing etc.

Info.json - files with metadata relevant to the host

.benchmarks
├── hostIdentifier1
│   ├── Info.json
│   ├── target1
│   │   ├── test1
│   │   │   ├── default
│   │   │   │   ├── probe1.results
│   │   │   │   ├── probe2.results
│   │   │   │   ├── probe3.results
│   │   │   │   └── probe4.results
│   │   │   ├── named1
│   │   │   │   ├── probe1.results
│   │   │   │   ├── probe2.results
│   │   │   │   ├── probe3.results
│   │   │   │   └── probe4.results
│   │   │   └── named2
│   │   │       ├── probe1.results
│   │   │       ├── probe2.results
│   │   │       ├── probe3.results
│   │   │       └── probe4.results
│   │   ├── test2
│   │   │   ├── default
│   │   │   │   ├── probe1.results
│   │   │   │   ├── probe2.results
│   │   │   │   ├── probe3.results
│   │   │   │   └── probe4.results
│   │   │   ├── named1
│   │   │   │   ├── probe1.results
│   │   │   │   ├── probe2.results
│   │   │   │   ├── probe3.results
│   │   │   │   └── probe4.results
│   │   │   └── named2
│   │   │       └── ...
│   │   └── test3
│   │       └── ...
│   ├── target2
│   │   ├── test1
│   │   │   └── ...
│   │   └── test2
│   │       └── ...
│   └── ...
├── hostIdentifier2
│   ├── Info.json
│   ├── target1
│   │   └── ...
│   ├── target2
│   └── ...
└── ...
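For illustration, the location of a single results file in this layout can be derived from its components. The helper names below are hypothetical; only the directory structure and the SWIFT_BENCHMARK_HOST_IDENTIFIER variable come from the description above.

```swift
import Foundation

/// Builds `.benchmarks/<host>/<target>/<test>/<baseline>/<probe>.results`
/// relative to `packageRoot`. An unnamed baseline maps to the `default` directory.
/// Hypothetical helper for illustration only.
func resultsURL(packageRoot: URL,
                hostIdentifier: String,
                target: String,
                test: String,
                baseline: String? = nil,
                probe: String) -> URL {
    return packageRoot
        .appendingPathComponent(".benchmarks")
        .appendingPathComponent(hostIdentifier)
        .appendingPathComponent(target)
        .appendingPathComponent(test)
        .appendingPathComponent(baseline ?? "default")
        .appendingPathComponent(probe + ".results")
}

/// Reads the host identifier from the environment variable named in the pitch.
/// Returns nil when the variable is not set; the fallback is left undecided here.
func currentHostIdentifier(
    environment: [String: String] = ProcessInfo.processInfo.environment
) -> String? {
    return environment["SWIFT_BENCHMARK_HOST_IDENTIFIER"]
}
```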

Files

When storing a baseline, there are a few different files that will be updated:

The probeN.results files will be stored in JSON format and contain the following:

timestamp = <swift timestamp>
unit = us
polarity = lowerIsBetter
type = [throughput, time, amount]
iterations = 100
warmup = 10
runtime = <swift duration>
isolation = false
p0 = 8
p25 = 14
p50 = 17
p75 = 23
p90 = 43
p99 = 88
p100 = 147
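One possible way to map these fields to JSON is a Codable type, sketched below. The exact layout is an assumption: higherIsBetter, runtimeSeconds (standing in for the Swift duration), the ISO 8601 timestamp encoding and the percentile dictionary are all illustrative choices, not something this pitch has settled.

```swift
import Foundation

/// Hypothetical Codable mirror of the fields listed above.
struct ProbeResults: Codable, Equatable {
    enum Polarity: String, Codable { case lowerIsBetter, higherIsBetter }
    enum Kind: String, Codable { case throughput, time, amount }

    var timestamp: Date
    var unit: String
    var polarity: Polarity
    var type: Kind
    var iterations: Int
    var warmup: Int
    var runtimeSeconds: Double      // assumed stand-in for "<swift duration>"
    var isolation: Bool
    var percentiles: [String: Int]  // keys "p0", "p25", ... "p100"
}

/// Encodes results as pretty-printed JSON with stable key order,
/// which keeps diffs small if the files are checked in.
func encodeResults(_ results: ProbeResults) throws -> Data {
    let encoder = JSONEncoder()
    encoder.dateEncodingStrategy = .iso8601
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    return try encoder.encode(results)
}

/// Decodes results written by `encodeResults`.
func decodeResults(_ data: Data) throws -> ProbeResults {
    let decoder = JSONDecoder()
    decoder.dateDecodingStrategy = .iso8601
    return try decoder.decode(ProbeResults.self, from: data)
}
```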

Info.json for hosts includes:

hostname = myhost
cpus = 20 (arm64e)
memory = 128 (GB)
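Most of this metadata is available from Foundation's ProcessInfo, so a sketch of collecting it could look as follows. The HostInfo type is hypothetical, and the CPU architecture shown in the listing (arm64e) is left out because ProcessInfo does not expose it.

```swift
import Foundation

/// Hypothetical shape for Info.json; field names follow the listing above.
struct HostInfo: Codable {
    var hostname: String
    var cpus: Int
    var memory: Int   // physical memory in whole GiB, rounded down

    /// Collects the values for the machine the benchmark is running on.
    static func current() -> HostInfo {
        let info = ProcessInfo.processInfo
        return HostInfo(hostname: info.hostName,
                        cpus: info.processorCount,
                        memory: Int(info.physicalMemory / (1024 * 1024 * 1024)))
    }
}
```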

Command line usage

Tentative command line usage examples:
swift package benchmark benchmarkTarget
swift package benchmark all
swift package benchmark benchmarkTarget --baseline-update
swift package benchmark benchmarkTarget --baseline-remove
swift package benchmark all --baseline-remove
swift package benchmark benchmarkTarget --baseline-compare
swift package benchmark benchmarkTarget --baseline-compare namedBaseline
swift package benchmark benchmarkTarget --baseline-export namedBaseline --format

Similar to swift test, we should support:

  --filter <filter>       Run benchmarks matching regular expression, Format: <benchmark-target>.<benchmark-case> or <benchmark-target> or <benchmark-case>
  --skip <skip>           Skip test cases matching regular expression, Example: --skip EnduranceTests
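
How the two options could combine is sketched below. The helper selectBenchmarks is hypothetical, names are assumed to have the form <benchmark-target>.<benchmark-case>, and the regular expressions are assumed to match anywhere in the name rather than the whole name.

```swift
import Foundation

// Hypothetical selection helper: keeps names that match `filter` (if given)
// and drops names that match `skip` (if given).
func selectBenchmarks(_ names: [String],
                      filter: String? = nil,
                      skip: String? = nil) -> [String] {
    // Unanchored regular-expression search over the full "<target>.<case>" name.
    func matches(_ name: String, _ pattern: String) -> Bool {
        return name.range(of: pattern, options: .regularExpression) != nil
    }
    return names.filter { name in
        if let filter = filter, !matches(name, filter) { return false }
        if let skip = skip, matches(name, skip) { return false }
        return true
    }
}

// Made-up benchmark names for illustration.
let allBenchmarks = ["FrostflakeTests.benchmarkClassOutput",
                     "FrostflakeTests.benchmarkDefaults",
                     "EnduranceTests.benchmarkSoak"]
let selected = selectBenchmarks(allBenchmarks, filter: "Frostflake", skip: "Defaults")
let withoutEndurance = selectBenchmarks(allBenchmarks, skip: "EnduranceTests")
print(selected)  // ["FrostflakeTests.benchmarkClassOutput"]
```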

Sample tool usage:

> swift package benchmark benchmarkTarget
Running benchmarkTarget on host with Darwin 21.6.0 / arm64e / 20 cores / 128 GB / average load (1m) 1.20
Running probes [Time, CPU, Memory] ... finished (10 iterations, 449117 samples)
Wall clock percentiles (μs):
       0.0 <= 3
      25.0 <= 6
      50.0 <= 7
      75.0 <= 9
      90.0 <= 10
      99.0 <= 14
     100.0 <= 256
CPU time percentiles (μs):
       0.0 <= 30
      25.0 <= 60
      50.0 <= 70
      75.0 <= 90
      90.0 <= 100
      99.0 <= 140
     100.0 <= 2560
Memory percentiles (MB):
       0.0 <= 192
      25.0 <= 195
      50.0 <= 200
      75.0 <= 202
      90.0 <= 203
      99.0 <= 208
     100.0 <= 210
Running probes [Malloc] ... finished (10 iterations)
Malloc count percentiles (#):
       0.0 <= 21000
      25.0 <= 21000
      50.0 <= 21000
      75.0 <= 21003
      90.0 <= 21008
      99.0 <= 21010
     100.0 <= 21210
Running probes [Syscalls, Threads] ... finished (10 iterations)
Syscalls count percentiles (#):
       0.0 <= 121000
      25.0 <= 121000
      50.0 <= 121000
      75.0 <= 121003
      90.0 <= 121008
      99.0 <= 221010
     100.0 <= 221210
Thread count percentiles (#):
       0.0 <= 12
      25.0 <= 14
      50.0 <= 14
      75.0 <= 14
      90.0 <= 15
      99.0 <= 15
     100.0 <= 21
> swift package benchmark benchmarkTarget --baseline-compare --percentile p90
Running benchmarkTarget on host with Darwin 21.6.0 / arm64e / 20 cores / 128 GB / average load (1m) 1.20 / 2022-07-18 12:42
Comparing with baseline from host with Darwin 21.5.0 / arm64e / 20 cores / 128 GB / average load (1m) 1.13 / 2022-06-12 13:10
Running probes [Time, CPU, Memory] ... finished (10 iterations, 449117 samples)

Wall clock percentiles (μs): [FAIL]
       0.0 <= -1
      25.0 <= -2
      50.0 <= -1
      75.0 <= 0
      90.0 <= +3
      99.0 <= +2
     100.0 <= +1
CPU time percentiles (μs): [Success]
       0.0 <= -1
      25.0 <= -2
      50.0 <= -1
      75.0 <= 0
      90.0 <= +3
      99.0 <= +2
     100.0 <= +1
Memory percentiles (MB): [Success]
       0.0 <= -30
      25.0 <= -23
      50.0 <= -20
      75.0 <= -18
      90.0 <= -15
      99.0 <= -12
     100.0 <= -5

When comparing against a baseline, it should be possible to specify up to which baseline percentile the results are evaluated when deciding success or failure.

E.g. swift package benchmark benchmarkTarget --baseline-compare --percentile p80 (all results up to p80 must then improve for the comparison to succeed).
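
The percentile cutoff could be evaluated roughly as follows. This is a sketch under stated assumptions: the function passesBaseline is hypothetical, the cutoff is treated as inclusive, and an unchanged value counts as a pass (a stricter reading of "must improve" would reject a delta of 0). The baseline values are the ones from the probeN.results listing; the current values are made up.

```swift
// Sketch of a baseline comparison limited to percentiles at or below a cutoff.
enum Polarity { case lowerIsBetter, higherIsBetter }

/// Returns true when no percentile up to and including `cutoff` is worse than the baseline.
/// Both dictionaries map a percentile (0...100) to the measured value.
func passesBaseline(current: [Int: Int],
                    baseline: [Int: Int],
                    upTo cutoff: Int,
                    polarity: Polarity = .lowerIsBetter) -> Bool {
    for (percentile, baseValue) in baseline where percentile <= cutoff {
        // A percentile missing from the current run is treated as a failure (assumption).
        guard let value = current[percentile] else { return false }
        let delta = value - baseValue
        switch polarity {
        case .lowerIsBetter:  if delta > 0 { return false }
        case .higherIsBetter: if delta < 0 { return false }
        }
    }
    return true
}

let baselineRun = [0: 8, 25: 14, 50: 17, 75: 23, 90: 43, 99: 88, 100: 147]
let currentRun  = [0: 7, 25: 12, 50: 16, 75: 23, 90: 46, 99: 90, 100: 148]

let passesAtP80 = passesBaseline(current: currentRun, baseline: baselineRun, upTo: 80)
let passesAtP90 = passesBaseline(current: currentRun, baseline: baselineRun, upTo: 90)
print(passesAtP80, passesAtP90)  // true false
```

With these numbers the regression only shows up at p90, so the outcome depends on the chosen cutoff.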

Benchmark implementation sample

Tentative sample code, API needs iteration.

@testable import Frostflake

import Benchmark

final class FrostflakeTests: Benchmark {

    func benchmarkFrostflakeClassOutput() async -> Benchmark {
        let benchmark = Benchmark()
        benchmark.addProbes([.cpu, .memory, .syscalls, .threads]) // optional
        benchmark.setEnvironment(isolated: false, runtime: 5000) // optional

        // Declared before the setup block so the run closure can see it.
        let frostflakeFactory: Frostflake?
        if benchmark.active { // optional setup
            frostflakeFactory = Frostflake(generatorIdentifier: 1_000)
        } else {
            frostflakeFactory = nil
        }

        benchmark.run {
            guard let frostflakeFactory = frostflakeFactory else { return }
            for _ in 0 ..< 10 {
                let frostflake = frostflakeFactory.generate()
                let description = frostflake.frostflakeDescription()
                blackHole(description) // keep the result alive so it isn't optimized away
            }
        }

        if benchmark.active { // optional teardown
            // teardown
        }
        return benchmark
    }

    func benchmarkFrostflakeClassOutputWithDefaultSettings() -> Benchmark {
        return Benchmark().run {
            let frostflakeFactory = Frostflake(generatorIdentifier: 1_000)
            for _ in 0 ..< 1_000 {
                let frostflake = frostflakeFactory.generate()
                let description = frostflake.frostflakeDescription()
                blackHole(description)
            }
        }
    }
}
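
The sample relies on a blackHole helper without defining it. A common way to write one, sketched here as an assumption about what the Benchmark module would provide, is an empty generic function marked @inline(never). This is usually enough to stop the optimizer from discarding the computed value, though it is not a hard guarantee.

```swift
// Minimal sketch of a blackHole helper (assumed, not defined by the pitch).
// @inline(never) keeps the call out of line so the argument still has to be
// computed at the call site even though the function does nothing with it.
@inline(never)
func blackHole<T>(_ value: T) {
    // Intentionally empty.
}

// Usage: without the blackHole call, an optimized build could drop `value`
// entirely if nothing else observed it.
var checksum = 0
for i in 0 ..< 1_000 {
    let value = i &* 31
    checksum &+= value
    blackHole(value)
}
print(checksum)
```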

Future directions

It would be desirable to extend this to support setting up processes on remote machines, rather than only on the local machine, for performance testing of distributed systems and networking components. For example, it would be useful to start a multicast producer that generates traffic for the component under test to connect to.

Alternatives considered

We have looked at many different testing frameworks but haven't found one with the required focus on automatically running tests under different kinds of instrumentation and capturing the results. Most focus on unit testing or microbenchmarking, or primarily on functional integration testing (and/or have significant setup barriers).

Acknowledgments

Karoy Lorentey's excellent benchmark library for Swift Collections - highly recommended for time complexity analysis of collections.

Karl, who originally suggested implementing this as a SwiftPM command tool.

The SwiftNIO team has had malloc counters as part of its integration testing for a long time, and I found that very helpful.
