Benchmark package initial release

okay i think i am using the plugin wrong, because i cannot seem to get it to run more than 11 iterations:

    Benchmark.init("dates",
        desiredIterations: 1000)
    {
        benchmark in
        
        for _:Int in benchmark.throughputIterations
        {
            blackHole(encode(dates: dates))
        }
    }

always seems to generate 11 samples no matter what i set desiredIterations to:

╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                                   │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Malloc (total) (K)                       │    1164 │    1164 │    1164 │    1164 │    1164 │    1164 │    1164 │      11 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Memory (resident peak) (M)               │      71 │      76 │      88 │      91 │      93 │     103 │     103 │      11 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Throughput (scaled / s)                  │      10 │      10 │      10 │      10 │       9 │       9 │       9 │      11 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (total CPU) (ms)                    │      90 │     100 │     100 │     110 │     110 │     110 │     110 │      11 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (wall clock) (ms)                   │      96 │      97 │      99 │     102 │     106 │     110 │     110 │      11 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

You might want to tweak desiredDuration too - it will run until the first of those two is reached, and the default is just one second IIRC.

Easiest is to set e.g. Benchmark.defaultDuration = .seconds(10)
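
For reference, here is a minimal sketch of raising both limits at once. It assumes a desiredDuration: initializer parameter alongside the desiredIterations: one shown above (the exact spelling may differ in your version), and reuses the dates/encode names from the original snippet:

    // Raise the global default so every benchmark may run for up to 10 seconds.
    Benchmark.defaultDuration = .seconds(10)

    Benchmark.init("dates",
        desiredIterations: 1000,
        desiredDuration: .seconds(10)) // assumed per-benchmark override; the run stops at whichever limit is hit first
    {
        benchmark in

        for _: Int in benchmark.throughputIterations
        {
            blackHole(encode(dates: dates))
        }
    }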

Only just found this, this is great!

Had a few questions come up while taking a look through the plugin:

  • google/swift-benchmark had an option where it would try to get a statistically meaningful performance value if no number of iterations was provided. Is that something that could be interesting?

  • Would adding a standard deviation be interesting? (I just mention it since google/swift-benchmark originally displayed that as well)

  • Could it be possible to configure benchmarks to live in folders other than Benchmarks as well? E.g. a per-target folder. I have a large project where a lot of modules are grouped together with custom paths for most targets; it would be very convenient to also put benchmarks into these groupings (a sketch of the kind of layout I mean follows below).
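
A hedged illustration of that kind of layout (the target, module and path names are hypothetical, and the benchmark-library dependencies are omitted for brevity; whether the plugin can discover benchmarks outside the top-level Benchmarks folder is exactly the open question):

    // Package.swift (excerpt): hypothetical per-module grouping with a custom path.
    .executableTarget(
        name: "FooModuleBenchmarks",
        dependencies: ["FooModule"],          // hypothetical module under test
        path: "Sources/FooModule/Benchmarks"  // instead of the default top-level Benchmarks/ folder
    ),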

I really like this! Thanks!

Hi, glad you find it useful!

I’ve never seen much use for ‘auto iterations’ in practice, as we’d usually tweak it with the combination of runtime / number of iterations - which would give more comparable test runs too - but maybe I’m missing something and would be happy to be convinced otherwise.

With regard to SD, I think it’s an error to have it in the first place in Google’s benchmark, actually - see e.g. React San Francisco 2014 : Gil Tene - Understanding Latency - YouTube (or many other nice talks from Gil Tene) around the 30-minute mark - performance measurements aren’t normally distributed in practice, so it’s not a good model to try to fit them to, IMHO.

It’d make sense to support more flexibility for benchmark placement for more complex project layouts; maybe an optional prefix for the executable targets could be one way to do it (there’s no way to mark up targets with metadata that we could use, as far as I know) - PRs are welcome!

yes, i’ve found performance tends to be multi-modal (as they do mention in the video), and this defies easy summarization with a statistic like standard deviation. in my opinion you really have to view the histogram to read these sorts of measurements.

That's exactly the response I was hoping for, haha. Thanks. I don't have that much experience with more than basic performance testing, so that's good to know; thanks for sharing.
I'll try to give a better example of a setup when I know how I'd ideally incorporate it into our project structure. Then we might be able to get to a PR at some point :)!
Thanks for the response.

@hassila Have you used this to track performance over time by any chance?

We haven’t, although the intention was to make it possible by pulling the data out of the JSON format into some external system like Grafana if you want to plot performance over time.

Our primary use cases are a) to validate PR performance vs main to avoid merging in regressions, and b) to provide a convenient workflow for engineers for improving key metrics such as mallocs/memory footprint/context switches etc. vs baselines when actively working on performance cases.

You can see a trivial sample of a) here:

Is there a convenient way to run a single benchmark? It would be cool if we could somehow get the same way of running benchmarks as tests (buttons to run individual ones in Xcode). Not sure if that's possible, however.

--filter regexp filtering is on the laundry list, as it was waiting for the new regex support - definitely something that would be nice to add.

Having Xcode integration would be fantastic - but there are no APIs for that as far as I know.

Best right now is probably to split out a separate benchmark suite where you can have the single one you want to test.

Wouldn't moving to testTargets and using XCTest allow us to do that?

It doesn’t work, as XCTest crashes with jemalloc, which is used for the Malloc counters. That is only a problem with the proprietary XCTest on macOS - the open source one on Linux works fine. I have a feedback open with Apple that was closed because they thought it was a problem with jemalloc - but I had jemalloc engineers debug it, and it seems XCTest on macOS passes jemalloc a pointer that was not allocated with jemalloc - so that would need to be fixed.

It also doesn’t build optimized for XCTest targets as far as I understand? But maybe that is possible somehow.

There are probably some other issues with regards to integration with e.g. Swift Argument Parser etc. (we need to be able to run from the command line on Linux too), but it might be possible to split that out perhaps.

Hmm good point, it would not be a perfect solution IF it could work. Thanks!

Best right now is to have a separate benchmark target (you can have multiple) if you want to run only one, and use e.g.

swift package benchmark run --target Individual-Benchmark

(You can have one ‘throwaway’ target where you copy/paste in code if you want to run it separately).

Not perfect, but something.

Great, thanks. What do you think about adding the toolchain being used to the output metadata? And what are your thoughts on the ability to add metadata to the output in general?
If you're interested I'd love to make a PR, I just wanted to get your thoughts on it first!

We’ve got hardware + OS; the toolchain version would be great to have as well, and a PR would of course be welcome!

What other metadata did you have in mind?

What do you think is the best approach for getting the toolchain? I haven't found any "great" way of getting that at runtime. I've been thinking of running swiftc --version, perhaps? I'm not sure that's as dependable as we'd like it to be, however; we definitely don't want it to report the wrong toolchain name. So I guess it's information we'd have to add at the CI level, unfortunately.
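
For what it's worth, a minimal sketch of the swiftc --version idea, shelling out via Foundation's Process API (the function name is made up, and whether the result is dependable enough is exactly the question above):

    import Foundation

    /// Best-effort capture of the toolchain version by shelling out to `swiftc --version`.
    /// Returns nil if the tool cannot be launched or produces no readable output.
    func toolchainVersion() -> String? {
        let process = Process()
        process.executableURL = URL(fileURLWithPath: "/usr/bin/env")
        process.arguments = ["swiftc", "--version"]

        let pipe = Pipe()
        process.standardOutput = pipe

        do {
            try process.run()
            process.waitUntilExit()
        } catch {
            return nil
        }

        let data = pipe.fileHandleForReading.readDataToEndOfFile()
        return String(data: data, encoding: .utf8)?
            .trimmingCharacters(in: .whitespacesAndNewlines)
    }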

The metadata example we currently have would be something like this:
We are running simulations, and we'd like to be able to tell how long they take to run. However, one simulation might cover an hour of simulated time and another several weeks, and ideally that would be part of the unit of the output. Say a simulation that covers 40 days of simulated time runs in 20 seconds; that would result in a unit of 2 days of simulation per second, which is much more informative than saying sim 1 ran in 40 seconds and sim 2 ran in 10 seconds when the properties of those simulations are vastly different.
In this case this information could either be added through extra metadata, or through an alternative approach to units in this specific case.
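
To make the arithmetic above concrete (the numbers are taken from the example; the names are hypothetical):

    // 40 days of simulated time completed in 20 wall-clock seconds.
    let simulatedDays = 40.0
    let wallClockSeconds = 20.0
    let simulatedDaysPerSecond = simulatedDays / wallClockSeconds // 2.0 days of simulation per second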

Let me know your thoughts!

Hmm, maybe supporting a custom StatisticsUnits for throughputScalingFactor would be one possible approach; then you could set that to e.g. the number of seconds that are simulated and get comparable time units?
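
A purely hypothetical sketch of that idea - a custom case on StatisticsUnits does not exist today, and the .custom spelling and runSimulation() are made up for illustration:

    Benchmark("simulation",
        throughputScalingFactor: .custom(40 * 86_400)) // hypothetical: one run covers 40 days = 3_456_000 simulated seconds
    {
        benchmark in
        blackHole(runSimulation()) // hypothetical: one full simulation run per measurement
    }

The throughput column would then, in principle, read out in simulated seconds per wall-clock second.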

Maybe toolchain metadata is better to capture outside at CI level as you suggest.

Hi @hassila, I've been trying to understand how the memory aspect of this package works. We're hoping to run our benchmarks on Linux - right now I'm just testing on macOS. I put something simple together in this repo, just trying to build up some intuition on how to interpret the results.

I put together a set of benchmarks, and then duplicated them, putting the copies explicitly in a typical test target. I right-clicked each of the tests and profiled it in Instruments with the 'Leaks' template. I then ran and created a baseline using the instructions from your repository and added photos to the README for easier readability.

The ExplicitCapture we know to be leaky, while the WeakCapture we know not to leak. The Malloc / free Δ for p100 on the ExplicitCapture suggests this might be true (although it's unclear to me exactly how to interpret it), but the ballooning persistent memory shown in Instruments was much more intuitive for me to understand.
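
For readers without the linked repo at hand, the pattern being compared is presumably along these lines - a guess at the shape of those benchmarks, not their actual code:

    final class Worker {
        var onTick: (() -> Void)?

        func startStrong() {
            // Strong capture: the instance retains the closure and the closure retains
            // the instance, so it is never deallocated (the leaky "ExplicitCapture" case).
            onTick = { self.doWork() }
        }

        func startWeak() {
            // Weak capture breaks the retain cycle (the non-leaky "WeakCapture" case).
            onTick = { [weak self] in self?.doWork() }
        }

        func doWork() {}
    }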

The other confusing aspect was that when the benchmarks were run, the Memory (allocated) for the ExplicitCapture and the WeakCapture was of the same order of magnitude, whereas there is an order-of-magnitude difference in Instruments.

Any help you can provide on how to correctly understand what's happening here (or best practices for how to define benchmarks in order to practically understand what's happening to memory in a portion of an application) would be much appreciated!
