Benchmark package initial release

hassila · February 3, 2023, 10:48am

First off, I'll be happy to help you sort this out and try to make it useful for your use case, but it might need some patience.

So far I've only had a little time to look at it, but it seems you're running into a few different issues at once here (and many thanks for the great repro repository, very helpful).

The first (and admittedly confusing) issue is that you've hit a current limitation of benchmark: it uses linear buckets for a certain range and then falls back to power-of-two buckets (Memory (allocated) (K) for explicit capture shows this with e.g. 32768K), which makes it quite hard to see the actual values. This is usually not an issue, but if you are 'unlucky' and the first measurements get very close to the threshold for moving from e.g. K -> G, you may currently end up with this. The good news is that we are actively working on moving to a more robust underlying implementation (hdrhistogram.org) and should have something to announce on that front shortly; then the results will be easier to interpret and you won't be 'unlucky' (not in that regard at least...).
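To make the bucketing issue concrete, here is a toy sketch. It is a hypothetical illustration of a linear-then-power-of-two scheme, not benchmark's actual histogram code, and the 16384K linear limit is an assumption picked for the example: once readings leave the linear range, quite different values get reported as the same power-of-two bucket, which is how a figure like 32768K can show up.

```swift
// Hypothetical illustration only: NOT package-benchmark's histogram code.
// Values below `linearLimit` get an exact (linear) bucket; at or above it,
// values are rounded up to the next power of two, so nearby values collapse.
func bucketUpperBound(_ value: Int, linearLimit: Int) -> Int {
    precondition(value >= 0 && linearLimit > 0)
    if value < linearLimit {
        return value // linear range: one bucket per value
    }
    // Power-of-two range: smallest power of two >= value.
    var bound = 1
    while bound < value {
        bound <<= 1
    }
    return bound
}

// Readings in K, with an assumed linear limit of 16384K:
// 17000K, 25000K and 32000K all end up in the same 32768K bucket.
let readingsK = [9_404, 17_000, 25_000, 32_000]
let buckets = readingsK.map { bucketUpperBound($0, linearLimit: 16_384) }
print(buckets) // [9404, 32768, 32768, 32768]
```

A log-linear scheme like HdrHistogram's avoids this by keeping a fixed relative precision within every power-of-two range instead of a single bucket per range.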

Secondly, the current implementation does not isolate benchmark runs that are done in the same benchmark target (this is something we'll probably want to add/change in the future). This is not an issue for most benchmark measurements (e.g. anything CPU related, malloc counts or context switches are fine), but if you want to see actual memory growth you'd want to separate them. The currently supported way to do that is to run them as separate targets (until we either isolate all the runs within the same target, or add support for --filter so you can easily do it manually from the command line).
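One way to see why separate targets help for memory growth: assuming, as the name suggests, that Memory (resident peak) is a high-water mark for the whole process, it never goes down between benchmarks. The toy model below (hypothetical type `ProcessModel`, not the library's code) shows a later benchmark inheriting an earlier one's peak when both run in the same process.

```swift
// Hypothetical model (not the library's code): a process-wide peak counter
// only ever goes up, so a benchmark that runs after a leaky one "inherits"
// the leaky one's peak when both share a process.
struct ProcessModel {
    private(set) var currentMB = 0
    private(set) var peakMB = 0 // high-water mark for the whole process

    mutating func allocate(_ mb: Int) {
        currentMB += mb
        peakMB = max(peakMB, currentMB)
    }

    mutating func deallocate(_ mb: Int) {
        currentMB -= mb // freeing memory does NOT lower the peak
    }
}

// Benchmark A leaks 120 MB; benchmark B only needs 5 MB at a time.
var sameProcess = ProcessModel()
sameProcess.allocate(120) // A runs and never frees
sameProcess.allocate(5)   // B runs...
sameProcess.deallocate(5) // ...and cleans up after itself
let peakForBInSharedProcess = sameProcess.peakMB // 125: dominated by A

var isolated = ProcessModel() // B alone, e.g. in its own executable target
isolated.allocate(5)
isolated.deallocate(5)
let peakForBIsolated = isolated.peakMB // 5

print(peakForBInSharedProcess, peakForBIsolated) // 125 5
```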

So, I separated the two relevant tests you provided, increased the number of iterations a bit, and got the following:

Here I was also 'lucky' and didn't run into the power-of-two issue mentioned above, so it is fairly clear that Memory (resident peak) is growing. What we see is the process memory size for each benchmark iteration bucketed into percentiles: if you have a growing process, you will see a clear progression from e.g. p25 -> p100, which is exactly what we see here for Explicit Capture Memory, going from 11M all the way up to 133M. In contrast, Weak Capture Memory stays rock solid at 9404K over time.
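Reading the percentile rows this way can be sketched with a simple nearest-rank percentile over synthetic samples (the numbers are made up, and the library may compute percentiles differently): a steadily growing series spreads out from p25 to p100, while a flat one gives the same value in every column.

```swift
// Nearest-rank percentile: a simple stand-in, not the library's implementation.
func percentile(_ p: Double, of samples: [Int]) -> Int {
    precondition(!samples.isEmpty && (0...100).contains(p))
    let sorted = samples.sorted()
    // Nearest rank = ceil(p/100 * n), clamped to at least rank 1.
    let rank = max(1, Int((p / 100 * Double(sorted.count)).rounded(.up)))
    return sorted[rank - 1]
}

// Synthetic per-iteration process sizes (arbitrary units).
let growing = (1...100).map { $0 * 10 }          // leaking process: 10, 20, ..., 1000
let flat = Array(repeating: 50, count: 100)      // stable process

for p in [25.0, 50.0, 100.0] {
    print("p\(Int(p)): growing=\(percentile(p, of: growing)) flat=\(percentile(p, of: flat))")
}
// p25: growing=250 flat=50
// p50: growing=500 flat=50
// p100: growing=1000 flat=50
```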

Now then, thirdly: Malloc (total) and Malloc / free delta unfortunately show rubbish for your test case (2 mallocs for a huge loop that should leak/allocate is just not right). We've primarily validated them with normal malloc/free calls, but in this case the allocations made by the code under measurement don't seem to show up for some reason.

It could be due to how Foundation allocates (if it somehow avoids jemalloc's interposition, those allocations wouldn't show up), or we could be misinterpreting the jemalloc counters (which are not completely obvious to get right, though at least with simpler tests that use normal malloc/free we see a good correlation between actual mallocs and these counters).

I need to investigate that a bit further to understand what the root problem is there, but something's definitely not right.

EDIT:
This turned out not to be a problem after all, just a small issue with how you had set up the benchmark. Please have a look at

ordo-one/package-benchmark-samples/blob/main/Benchmarks/MemoryOne/MemoryOne.swift

//
// Copyright (c) 2022 Ordo One AB.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
//
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
//

import BenchmarkSupport                 // import supporting infrastructure
@main extension BenchmarkRunner {}      // Required for main() definition to not get linker errors

import Foundation

@_dynamicReplacement(for: registerBenchmarks) // Register benchmarks
func benchmarks() {

    Benchmark.defaultMetrics = BenchmarkMetric.memory
    Benchmark.defaultDesiredDuration = .seconds(3)

(excerpt truncated; see the full file in the samples repository)

(specifically: don't set desiredIterations, set desiredDuration for the test run instead, and change throughputScalingFactor to .kilo; you should then get the expected results)

To clarify:
desiredIterations and desiredDuration both control how many times/for how long a given benchmark should be run (whichever of the two is reached first stops the benchmark run).

In contrast, throughputScalingFactor controls the inner loop (for _ in benchmark.throughputIterations).

As your benchmark stood, throughputScalingFactor was 1, so the code was basically called 100K times, but with a single inner-loop iteration each time.
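A toy model of the two knobs, under stated assumptions: `RunConfig`, `run` and `countCalls` are hypothetical names, not package-benchmark API, and the clock is injected so the stopping rule is deterministic. The outer loop stops at whichever of desiredIterations/desiredDuration is hit first; each outer iteration calls the benchmark closure once, and the closure's own inner loop runs throughputScalingFactor times.

```swift
// Hypothetical model of the runner loops described above; NOT the library's code.
struct RunConfig {
    var desiredIterations: Int
    var desiredDurationNanos: UInt64
    var throughputScalingFactor: Int // 1 by default; assumed 1_000 for .kilo
}

/// Calls `benchmark` once per outer iteration until either limit is reached.
/// Returns the number of outer iterations (i.e. measurements taken).
func run(_ config: RunConfig,
         clock: () -> UInt64,
         benchmark: (Range<Int>) -> Void) -> Int {
    let start = clock()
    var outer = 0
    while outer < config.desiredIterations,
          clock() - start < config.desiredDurationNanos {
        // The closure runs the inner loop itself over this range,
        // much like `for _ in benchmark.throughputIterations`.
        benchmark(0..<config.throughputScalingFactor)
        outer += 1
    }
    return outer
}

/// Counts outer iterations and total executions of the inner-loop body.
func countCalls(_ config: RunConfig, clock: () -> UInt64) -> (outer: Int, inner: Int) {
    var inner = 0
    let outer = run(config, clock: clock) { iterations in
        for _ in iterations { inner += 1 } // the code under test would go here
    }
    return (outer, inner)
}

// The setup from the thread: 100K iterations, scaling factor 1.
let a = countCalls(RunConfig(desiredIterations: 100_000, desiredDurationNanos: .max,
                             throughputScalingFactor: 1), clock: { 0 })
// 100 iterations with a scaling factor of 1_000.
let b = countCalls(RunConfig(desiredIterations: 100, desiredDurationNanos: .max,
                             throughputScalingFactor: 1_000), clock: { 0 })
// Duration-limited: a fake clock advancing 1 ms per read, 5 ms budget.
var now: UInt64 = 0
let c = countCalls(RunConfig(desiredIterations: .max, desiredDurationNanos: 5_000_000,
                             throughputScalingFactor: 10),
                   clock: { now += 1_000_000; return now })
print(a, b, c)
```

Cases `a` and `b` both execute the code under test 100K times, but `a` takes 100K measurements of a single execution each, while `b` takes 100 measurements of 1,000 executions each. Case `c` stops after 4 outer iterations because the duration budget runs out long before the (huge) iteration limit.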

What I would suggest instead is to modify the code as follows to get much better results:

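import BenchmarkSupport                 // import supporting infrastructure
@main extension BenchmarkRunner {}      // Required for main() definition to not get linker errors

import Foundation

// Foo, ExplicitCaptureEncoder and WeakCaptureEncoder are the types from your repro repository.
@_dynamicReplacement(for: registerBenchmarks) // Register benchmarks
func benchmarks() {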
    Benchmark.defaultMetrics = [
        .allocatedResidentMemory,
        .peakMemoryVirtual,
        .peakMemoryResident,
        .memoryLeaked
    ]
    Benchmark.defaultThroughputScalingFactor = .kilo
    Benchmark(
        "Int Memory"
    ) { benchmark in
        for _ in benchmark.throughputIterations {
            BenchmarkSupport.blackHole(1)
        }
    }
    Benchmark(
        "JSONEncoder Memory"
    ) { benchmark in
        for _ in benchmark.throughputIterations {
            BenchmarkSupport.blackHole(JSONEncoder())
        }
    }
    Benchmark(
        "Foo Memory"
    ) { benchmark in
        for _ in benchmark.throughputIterations {
            BenchmarkSupport.blackHole(Foo())
        }
    }
    Benchmark(
        "Explicit Capture Memory"
    ) { benchmark in
        for _ in benchmark.throughputIterations {
            BenchmarkSupport.blackHole(ExplicitCaptureEncoder())
        }
    }
    Benchmark(
        "Weak Capture Memory"
    ) { benchmark in
        for _ in benchmark.throughputIterations {
            BenchmarkSupport.blackHole(WeakCaptureEncoder())
        }
    }
}

It gives me:

If you then also split them out into separate executable targets (as suggested above), the numbers should hopefully make much more sense.

Let me know how it goes!

hassila · February 3, 2023, 11:51am

Follow-up for each respective issue:

hassila · February 3, 2023, 5:00pm

Please see edit... :-)

corymosiman12 · February 3, 2023, 5:05pm

Thank you so much for all of this! Would it be useful for me to open up an MR into your sample benchmark repo that aggregates what we've discovered here to help others?

hassila · February 3, 2023, 5:07pm

Sure, I'd be happy to take any PR improving the sample code (some more comments would probably be good too; as always, the docs are lagging a bit).

I'd also suggest simplifying one step further for memory checking, as there's a shorthand:

@_dynamicReplacement(for: registerBenchmarks) // Register benchmarks
func benchmarks() {
/*    Benchmark.defaultMetrics = [
        .allocatedResidentMemory,
        .peakMemoryVirtual,
        .peakMemoryResident,
        .memoryLeaked
    ]*/
    Benchmark.defaultMetrics = BenchmarkMetric.memory // use this instead to get all malloc-related metrics
    Benchmark.defaultThroughputScalingFactor = .kilo
    // ... register your benchmarks as in the previous example ...
}

corymosiman12 · February 3, 2023, 5:08pm

Lovely - will get that put together here shortly and ask for your review!

hassila · February 3, 2023, 6:39pm

Oh, I forgot: if you really want to run exactly 100k iterations to be able to compare with the macOS tooling, I would use a throughputScalingFactor of .kilo and set desiredIterations to 100.

corymosiman12 · February 3, 2023, 8:33pm

Hi Joakim - I somehow missed that those memory test examples were in your package-benchmark-samples repo already; I think I don't need to create an MR then. Thanks again for the clarifications.

corymosiman12 · February 10, 2023, 12:12am

@hassila maybe we can close the JSONEncoder memory leak bug? See: [SR-5501] JSONEncoder leaks memory · Issue #4275 · apple/swift-corelibs-foundation · GitHub

hassila · February 10, 2023, 3:08pm

To follow up on the linear/power-of-two bucket scheme currently used, this is the foundation we'll use instead, as mentioned:

hassila · February 13, 2023, 2:50pm

And now it's shipped, please see the announcement here for details:

hassila · February 14, 2023, 1:15pm

And to follow up about running benchmarks in isolation: that is now fixed in 0.6.1, which just shipped.

So @corymosiman12, if you update to 0.6.1 your test case no longer needs to be split into separate executable targets for things to make sense.

corymosiman12 · February 14, 2023, 1:30pm

amazing, thanks!

hassila · February 15, 2023, 11:54am

Now supporting regex for both benchmark suite and individual benchmark filtering/skipping (thanks Swift 5.7!)

This is a breaking command-line change: skip has been renamed skip-target, and the new skip argument applies to individual benchmarks instead.
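The include/skip logic can be sketched with Swift 5.7's `Regex`. This is a hypothetical helper (`selectBenchmarks` is not the plugin's code or its argument handling): a benchmark is kept if it matches any filter pattern (or no filter is given) and none of the skip patterns.

```swift
// Hypothetical sketch of regex-based filtering/skipping; not the plugin's implementation.
func selectBenchmarks(_ names: [String],
                      filter: [String],
                      skip: [String]) throws -> [String] {
    // Compile the patterns up front; an invalid pattern throws.
    let include: [Regex<AnyRegexOutput>] = try filter.map { try Regex($0) }
    let exclude: [Regex<AnyRegexOutput>] = try skip.map { try Regex($0) }
    return names.filter { name in
        // Keep a name if it matches any filter (or no filter was given)...
        let included = include.isEmpty || include.contains { name.contains($0) }
        // ...and it matches none of the skip patterns.
        let skipped = exclude.contains { name.contains($0) }
        return included && !skipped
    }
}

let names = ["Explicit Capture Memory", "Weak Capture Memory", "Int Memory", "JSONEncoder Memory"]
let picked = try selectBenchmarks(names, filter: ["Capture"], skip: ["^Weak"])
print(picked) // ["Explicit Capture Memory"]
```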

hassila · February 16, 2023, 1:05pm

Sorry, I'm just revisiting this again, trying to shape up the API for a 1.0 soonish, and as far as I can tell it's not technically possible to mimic that approach when using BenchmarkSupport as a dependency.

To make a long story short, the multi-node tests depend on a static array that defines the types that should be introspected. This works fine when things are in the same module, but as far as I have found, there is no way for a package module to get at any global state in the hosting application.

Basically, all avenues except dynamically replacing a function seem to be closed in practice (e.g. trying to hook in by overriding a function in an extension fails with "Overriding non-@objc declarations from extensions is not supported", and dynamic dispatch through protocols doesn't work for this either; see e.g. the discussion Joe Groff had here https://twitter.com/owensd/status/634270773000151040 with the conclusion "Is the extension in a different module? Modules can't change the behavior of other modules.", which is basically what's needed here in some way).

The setup is basically that the benchmark suite executable depends on the BenchmarkSupport package, which provides all of the behind-the-scenes functionality (among other things, using Swift Argument Parser to get running instructions from the hosting plugin), and these are two different modules.

If there is any mechanism apart from the currently used dynamic replacement of a function that would allow me to hook in and register things, while still allowing a nice package dependency for the supporting infrastructure, I'm all ears (but I don't want to force benchmark users to write additional glue code if at all possible).

Just wanted to put on the record why the existing benchmark registration looks the way it does...

On another note, as we close in on 1.0 (the primary thing remaining is additional export functionality for the results), a source-breaking release has now been done:

(additional samples on how to migrate any existing benchmark are available in the samples projects)

taylorswift · February 20, 2023, 11:01pm

i started getting the error while loading shared libraries: libjemalloc.so.2 error again when trying to run benchmarks with the latest version (0.8) of the package.

here’s what i get when i run sudo make install:

~/jemalloc-5.3.0$ sudo make install
/usr/bin/install -c -d /usr/local/bin
removed ‘/usr/local/bin/jemalloc-config’
‘bin/jemalloc-config’ -> ‘/usr/local/bin/jemalloc-config’
removed ‘/usr/local/bin/jemalloc.sh’
‘bin/jemalloc.sh’ -> ‘/usr/local/bin/jemalloc.sh’
removed ‘/usr/local/bin/jeprof’
‘bin/jeprof’ -> ‘/usr/local/bin/jeprof’
/usr/bin/install -c -d /usr/local/include/jemalloc
removed ‘/usr/local/include/jemalloc/jemalloc.h’
‘include/jemalloc/jemalloc.h’ -> ‘/usr/local/include/jemalloc/jemalloc.h’
/usr/bin/install -c -d /usr/local/lib
/usr/bin/install -c -v -m 755 lib/libjemalloc.so.2 /usr/local/lib
removed ‘/usr/local/lib/libjemalloc.so.2’
‘lib/libjemalloc.so.2’ -> ‘/usr/local/lib/libjemalloc.so.2’
ln -sf libjemalloc.so.2 /usr/local/lib/libjemalloc.so
/usr/bin/install -c -d /usr/local/lib
removed ‘/usr/local/lib/libjemalloc.a’
‘lib/libjemalloc.a’ -> ‘/usr/local/lib/libjemalloc.a’
removed ‘/usr/local/lib/libjemalloc_pic.a’
‘lib/libjemalloc_pic.a’ -> ‘/usr/local/lib/libjemalloc_pic.a’
/usr/bin/install -c -d /usr/local/lib/pkgconfig
removed ‘/usr/local/lib/pkgconfig/jemalloc.pc’
‘jemalloc.pc’ -> ‘/usr/local/lib/pkgconfig/jemalloc.pc’
Missing xsltproc.  doc/jemalloc.html not (re)built.
/usr/bin/install -c -d /usr/local/share/doc/jemalloc
removed ‘/usr/local/share/doc/jemalloc/jemalloc.html’
‘doc/jemalloc.html’ -> ‘/usr/local/share/doc/jemalloc/jemalloc.html’
Missing xsltproc.  doc/jemalloc.3 not (re)built.
/usr/bin/install -c -d /usr/local/share/man/man3
removed ‘/usr/local/share/man/man3/jemalloc.3’
‘doc/jemalloc.3’ -> ‘/usr/local/share/man/man3/jemalloc.3’

here is what is in /usr/local/lib/:

$ ls /usr/local/lib
libjemalloc.a  libjemalloc_pic.a  libjemalloc.so  libjemalloc.so.2  pkgconfig

here’s what i get when i run the plugin:

$ swift package benchmark
Building for debugging...
Build complete! (0.59s)
Building targets in release mode for benchmark run...
Build complete! Running benchmarks...

BSONEncodingBenchmarks: error while loading shared libraries: libjemalloc.so.2: cannot open shared object file: No such file or directory
Failed to run 'run' for /swift/swift-mongodb/Benchmarks/.build/x86_64-unknown-linux-gnu/release/BSONEncodingBenchmarks, result [32512]
Likely your benchmark crahed, try running the tool in the debugger, e.g.
lldb /swift/swift-mongodb/Benchmarks/.build/x86_64-unknown-linux-gnu/release/BSONEncodingBenchmarks
Or check Console.app for a backtrace if on macOS.

here’s what’s in the binary:

$ readelf -d .build/release/BSONEncodingBenchmarks

Dynamic section at offset 0x23d698 contains 41 entries:
  Tag        Type                         Name/Value
 0x0000000000000003 (PLTGOT)             0x23efe8
 0x0000000000000002 (PLTRELSZ)           15264 (bytes)
 0x0000000000000017 (JMPREL)             0x3e610
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000007 (RELA)               0x17e08
 0x0000000000000008 (RELASZ)             157704 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffff9 (RELACOUNT)          4197
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000006 (SYMTAB)             0x278
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000005 (STRTAB)             0x81e0
 0x000000000000000a (STRSZ)              52097 (bytes)
 0x0000000000000004 (HASH)               0x14d68
 0x0000000000000001 (NEEDED)             Shared library: [libswiftGlibc.so]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libutil.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libFoundation.so]
 0x0000000000000001 (NEEDED)             Shared library: [libswiftDispatch.so]
 0x0000000000000001 (NEEDED)             Shared library: [libdispatch.so]
 0x0000000000000001 (NEEDED)             Shared library: [libBlocksRuntime.so]
 0x0000000000000001 (NEEDED)             Shared library: [libjemalloc.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libswift_Differentiation.so]
 0x0000000000000001 (NEEDED)             Shared library: [libswift_StringProcessing.so]
 0x0000000000000001 (NEEDED)             Shared library: [libswift_Concurrency.so]
 0x0000000000000001 (NEEDED)             Shared library: [libswift_RegexParser.so]
 0x0000000000000001 (NEEDED)             Shared library: [libswiftCore.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x421b0
 0x000000000000000d (FINI)               0x1a49e4
 0x000000000000001a (FINI_ARRAY)         0x22f6e0
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x0000000000000019 (INIT_ARRAY)         0x22f6e8
 0x000000000000001b (INIT_ARRAYSZ)       16 (bytes)
 0x000000000000001d (RUNPATH)            Library runpath: [/usr/lib/swift/linux:$ORIGIN]
 0x000000006ffffff0 (VERSYM)             0x172c8
 0x000000006ffffffe (VERNEED)            0x17d68
 0x000000006fffffff (VERNEEDNUM)         3
 0x0000000000000000 (NULL)               0x0

i can run the binary successfully if i do

$ sudo ldconfig /usr/local/lib

can this be added to the getting started guide? alternatively, can the package configure this automatically?

finally, when i do run the benchmarks, i get a debug description output:

Debug result: [Benchmark.BenchmarkResult(metric: Time (wall clock), timeUnits: μs, measurements: 19, warmupIterations: 3, thresholds: nil, percentiles: [Benchmark.BenchmarkResult.Percentile.p75: 572522, Benchmark.BenchmarkResult.Percentile.p100: 639107, Benchmark.BenchmarkResult.Percentile.p99: 639107, Benchmark.BenchmarkResult.Percentile.p90: 634913, Benchmark.BenchmarkResult.Percentile.p0: 516162, Benchmark.BenchmarkResult.Percentile.p25: 533987, Benchmark.BenchmarkResul
...

i remember the plugin once printed a table to the terminal, which was useful. is there a way to re-enable that?

hassila · February 21, 2023, 7:45am

Hi!

First off, the tables are printed to the terminal when swift package benchmark runs successfully, but there seems to be some platform-specific issue here. I'm guessing you are not on Ubuntu, as I just tried a clean build using 0.8.0:

ubuntu@linux:~/x$ gh repo clone ordo-one/package-benchmark-samples
Cloning into 'package-benchmark-samples'...
remote: Enumerating objects: 137, done.
remote: Counting objects: 100% (130/130), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 137 (delta 59), reused 103 (delta 42), pack-reused 7
Receiving objects: 100% (137/137), 29.05 KiB | 303.00 KiB/s, done.
Resolving deltas: 100% (59/59), done.
ubuntu@linux:~/x$ cd package-benchmark-samples/
ubuntu@linux:~/x/package-benchmark-samples$ swift package update
Fetching https://github.com/ordo-one/package-benchmark
Fetched https://github.com/ordo-one/package-benchmark (0.85s)
Computing version for https://github.com/ordo-one/package-benchmark
Computed https://github.com/ordo-one/package-benchmark at 0.8.0 (0.22s)
...
Creating working copy for https://github.com/ordo-one/package-benchmark
Working copy of https://github.com/ordo-one/package-benchmark resolved at 0.8.0
ubuntu@linux:~/x/package-benchmark-samples$ swift package benchmark
Building for debugging...
[182/182] Linking BenchmarkTool
Build complete! (4.45s)
Building targets in release mode for benchmark run...
Build complete! Running benchmarks...

Benchmark results
============================================================================================================================

Host 'linux' with 8 'aarch64' processors with 15 GB memory, running:
#62-Ubuntu SMP Tue Nov 22 19:56:13 UTC 2022

Foundation-Benchmark
============================================================================================================================

Foundation AttributedString()
╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                                   │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Malloc (total)                           │      19 │      19 │      19 │      19 │      19 │      19 │      19 │  100000 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Memory (resident peak) (M)               │      20 │      21 │      21 │      21 │      21 │      21 │      21 │  100000 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Throughput (scaled / s) (K)              │     572 │     545 │     534 │     522 │     522 │     471 │      11 │  100000 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (total CPU) (μs)                    │       0 │       0 │       0 │       0 │       0 │       0 │   10002 │  100000 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (wall clock) (μs)                   │       2 │       2 │       2 │       2 │       2 │       2 │      89 │  100000 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

Foundation Date()
╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                                   │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Throughput (scaled / s) (M)              │      43 │      43 │      43 │      42 │      42 │      42 │      42 │      43 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (wall clock) (μs)                   │   23364 │   23446 │   23560 │   23658 │   24068 │   24068 │   24068 │      43 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

Samples
============================================================================================================================

All metrics, full concurrency, async
╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                                   │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Bytes (read logical)                     │     747 │     750 │     750 │     751 │    1077 │    1078 │    1078 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes (read physical)                    │       0 │       0 │       0 │       0 │       0 │       0 │       0 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes (write logical)                    │       0 │       0 │       0 │       0 │       0 │       0 │       0 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes (write physical)                   │       0 │       0 │       0 │       0 │       0 │       0 │       0 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Malloc (large)                           │       0 │       0 │       0 │       0 │       0 │       0 │       0 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Malloc (small)                           │     173 │     239 │     375 │     423 │     482 │     575 │     575 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Malloc (total)                           │     173 │     239 │     375 │     423 │     482 │     575 │     575 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Malloc / free Δ (K)                      │       0 │       0 │       0 │     328 │     787 │    2033 │    2033 │      78 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Memory (allocated) (M)                   │      23 │      34 │      39 │      40 │      40 │      40 │      40 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Memory (resident peak) (M)               │      24 │      26 │      26 │      26 │      27 │      27 │      27 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Memory (virtual peak) (M)                │     214 │     227 │     230 │     230 │     230 │     230 │     230 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Syscalls (read)                          │       3 │       3 │       3 │       3 │       4 │       4 │       4 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Syscalls (write)                         │       0 │       0 │       0 │       0 │       0 │       0 │       0 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Threads (peak)                           │      12 │      12 │      12 │      12 │      12 │      12 │      12 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Throughput (scaled / s)                  │      80 │      80 │      79 │      78 │      78 │      73 │      73 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (system CPU) (μs)                   │       0 │       0 │       0 │       0 │       0 │   10002 │   10002 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (total CPU) (ms)                    │      80 │      80 │      80 │      90 │      90 │      90 │      90 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (user CPU) (ms)                     │      80 │      80 │      80 │      90 │      90 │      90 │      90 │      79 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (wall clock) (ms)                   │      12 │      13 │      13 │      13 │      13 │      14 │      14 │      79 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

...

MemoryOne
============================================================================================================================

Explicit Capture Memory
╒══════════════════════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                                   │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Malloc (large)                           │       0 │       0 │       0 │       0 │       0 │       0 │       0 │   22129 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Malloc (small)                           │    1016 │    1016 │    1016 │    1016 │    1016 │    1016 │    1021 │   22129 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Malloc (total)                           │    1016 │    1016 │    1016 │    1016 │    1016 │    1016 │    1021 │   22129 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Malloc / free Δ (K)                      │       0 │       0 │       0 │     328 │     328 │     328 │     918 │   22129 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Memory (allocated) (M)                   │      13 │     901 │    1789 │    2678 │    3211 │    3532 │    3565 │   22129 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Memory (resident peak) (M)               │      22 │     911 │    1799 │    2689 │    3221 │    3540 │    3576 │   22129 │
├──────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Memory (virtual peak) (M)                │     110 │    1131 │    2179 │    3056 │    3592 │    4264 │    4264 │   22129 │
╘══════════════════════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

ubuntu@linux:~/x/package-benchmark-samples$ ls /usr/local/lib
python3.10
ubuntu@linux:~/x/package-benchmark-samples$ readelf -d .build/release/MemoryOne

Dynamic section at offset 0x1ff388 contains 38 entries:
  Tag        Type                         Name/Value
 0x0000000000000003 (PLTGOT)             0x20ffe8
 0x0000000000000002 (PLTRELSZ)           14160 (bytes)
 0x0000000000000017 (JMPREL)             0x37988
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000007 (RELA)               0x14ed0
 0x0000000000000008 (RELASZ)             142008 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffff9 (RELACOUNT)          3779
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000006 (SYMTAB)             0x2b0
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000005 (STRTAB)             0x7b28
 0x000000000000000a (STRSZ)              51430 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x14410
 0x0000000000000001 (NEEDED)             Shared library: [libswift_StringProcessing.so]
 0x0000000000000001 (NEEDED)             Shared library: [libswiftGlibc.so]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libFoundation.so]
 0x0000000000000001 (NEEDED)             Shared library: [libswiftDispatch.so]
 0x0000000000000001 (NEEDED)             Shared library: [libdispatch.so]
 0x0000000000000001 (NEEDED)             Shared library: [libBlocksRuntime.so]
 0x0000000000000001 (NEEDED)             Shared library: [libswift_Differentiation.so]
 0x0000000000000001 (NEEDED)             Shared library: [libjemalloc.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libswift_Concurrency.so]
 0x0000000000000001 (NEEDED)             Shared library: [libswiftCore.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x3b0d8
 0x000000000000000d (FINI)               0x17c598
 0x000000000000001a (FINI_ARRAY)         0x201ed0
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x0000000000000019 (INIT_ARRAY)         0x201ed8
 0x000000000000001b (INIT_ARRAYSZ)       16 (bytes)
 0x000000000000001d (RUNPATH)            Library runpath: [/usr/lib/swift/linux:$ORIGIN]
 0x000000006ffffffb (FLAGS_1)            Flags: PIE
 0x000000006ffffff0 (VERSYM)             0x14440
 0x000000006ffffffe (VERNEED)            0x14e4c
 0x000000006fffffff (VERNEEDNUM)         2
 0x0000000000000000 (NULL)               0x0
....
ubuntu@linux:~/x/package-benchmark-samples$ .build/release/MemoryOne
Debug result: [Benchmark.BenchmarkResult(metric: Memory (virtual peak), timeUnits: ms, measurements: 21503, warmupIterations: 3, thresholds: nil, percentiles: [Benchmark.BenchmarkResult.Percentile.p75: 3051, Benchmark.BenchmarkResult.Percentile.p100: 3592, Benchmark.BenchmarkResult.Percentile.p99: 3592, Benchmark.BenchmarkResult.Percentile.p90: 3592, Benchmark.BenchmarkResult.Percentile.p25: 1131, Benchmark.BenchmarkResult.Percentile.p0: 110, Benchmark.BenchmarkResult.Percentile.p50: 1836]), Benchmark.BenchmarkResult(metric: Memory (resident peak), timeUnits: ms, measurements: 21503, warmupIterations: 3, thresholds: nil, percentiles: [Benchmark.BenchmarkResult.Percentile.p75: 2613, Benchmark.BenchmarkResult.Percentile.p25: 886, Benchmark.BenchmarkResult.Percentile.p0: 22, Benchmark.BenchmarkResult.Percentile.p100: 3475, Benchmark.BenchmarkResult.Percentile.p50: 1749, Benchmark.BenchmarkResult.Percentile.p99: 3441, Benchmark.BenchmarkResult.Percentile.p90: 3131]), Benchmark.BenchmarkResult(metric: Memory (allocated), timeUnits: ms, measurements: 21503, warmupIterations: 3, thresholds: nil, percentiles: [Benchmark.BenchmarkResult.Percentile.p99: 3431, Benchmark.BenchmarkResult.Percentile.p75: 2603, Benchmark.BenchmarkResult.Percentile.p100: 3467, Benchmark.BenchmarkResult.Percentile.p50: 1739, Benchmark.BenchmarkResult.Percentile.p25: 875, Benchmark.BenchmarkResult.Percentile.p90: 3121, Benchmark.BenchmarkResult.Percentile.p0: 13]), Benchmark.BenchmarkResult(metric: Malloc / free Δ, timeUnits: μs, measurements: 21503, warmupIterations: 3, thresholds: nil, percentiles: [Benchmark.BenchmarkResult.Percentile.p75: 328, Benchmark.BenchmarkResult.Percentile.p90: 328, Benchmark.BenchmarkResult.Percentile.p99: 328, Benchmark.BenchmarkResult.Percentile.p25: 0, Benchmark.BenchmarkResult.Percentile.p100: 918, Benchmark.BenchmarkResult.Percentile.p50: 0, Benchmark.BenchmarkResult.Percentile.p0: 0]), Benchmark.BenchmarkResult(metric: Malloc (total), timeUnits: ns, measurements: 21503, warmupIterations: 3, thresholds: nil, percentiles: [Benchmark.BenchmarkResult.Percentile.p75: 1016, Benchmark.BenchmarkResult.Percentile.p90: 1016, Benchmark.BenchmarkResult.Percentile.p100: 1023, Benchmark.BenchmarkResult.Percentile.p0: 1016, Benchmark.BenchmarkResult.Percentile.p25: 1016, Benchmark.BenchmarkResult.Percentile.p99: 1016, Benchmark.BenchmarkResult.Percentile.p50: 1016]), Benchmark.BenchmarkResult(metric: Malloc (small), timeUnits: ns, measurements: 21503, warmupIterations: 3, thresholds: nil, percentiles: [Benchmark.BenchmarkResult.Percentile.p0: 1016, Benchmark.BenchmarkResult.Percentile.p90: 1016, Benchmark.BenchmarkResult.Percentile.p25: 1016, Benchmark.BenchmarkResult.Percentile.p100: 1023, Benchmark.BenchmarkResult.Percentile.p99: 1016, Benchmark.BenchmarkResult.Percentile.p50: 1016, Benchmark.BenchmarkResult.Percentile.p75: 1016]), Benchmark.BenchmarkResult(metric: Malloc (large), timeUnits: ns, measurements: 21503, warmupIterations: 3, thresholds: nil, percentiles: [Benchmark.BenchmarkResult.Percentile.p25: 0, Benchmark.BenchmarkResult.Percentile.p0: 0, Benchmark.BenchmarkResult.Percentile.p99: 0, Benchmark.BenchmarkResult.Percentile.p100: 0, Benchmark.BenchmarkResult.Percentile.p90: 0, Benchmark.BenchmarkResult.Percentile.p75: 0, Benchmark.BenchmarkResult.Percentile.p50: 0])]
ubuntu@linux:~/x/package-benchmark-samples$ uname -a
Linux linux 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:56:13 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@linux:~/x/package-benchmark-samples$
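The percentiles in the debug output above summarize all 21503 measurements of each metric. As a rough illustration of what p0 through p100 mean, here is a minimal nearest-rank sketch. It is a simplification, not package-benchmark's own code (which, as noted earlier in the thread, buckets samples into a histogram), and the `Percentile` enum is a hypothetical stand-in for the package's type. It also sorts its output: the `[key: value]` syntax in the debug output is a Swift `Dictionary` description, and dictionary iteration order is unspecified, which is why p75 can be printed before p0.

```swift
// Hypothetical stand-in for the percentile keys seen in the debug output.
enum Percentile: Int, CaseIterable, Comparable {
    case p0 = 0, p25 = 25, p50 = 50, p75 = 75, p90 = 90, p99 = 99, p100 = 100

    static func < (lhs: Percentile, rhs: Percentile) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

/// Nearest-rank percentiles over raw samples (no histogram bucketing).
func percentiles(of samples: [Int]) -> [Percentile: Int] {
    precondition(!samples.isEmpty, "need at least one sample")
    let sorted = samples.sorted()
    var result: [Percentile: Int] = [:]
    for p in Percentile.allCases {
        // rank = ceil(p * n / 100) in integer math, clamped so p0 maps to the smallest sample.
        let rank = max(1, (p.rawValue * sorted.count + 99) / 100)
        result[p] = sorted[rank - 1]
    }
    return result
}

/// Formats percentiles in ascending order; Dictionary order is unspecified,
/// so sort by key before printing.
func describe(_ stats: [Percentile: Int]) -> String {
    stats.sorted { $0.key < $1.key }
        .map { "p\($0.key.rawValue): \($0.value)" }
        .joined(separator: ", ")
}

// Usage: a steadily growing series, like the peak memory of a leaking process.
print(describe(percentiles(of: Array(1...100))))
// prints "p0: 1, p25: 25, p50: 50, p75: 75, p90: 90, p99: 99, p100: 100"
```

A steady climb from p0 to p100 like this is the same pattern that signals memory growth in the real benchmark output.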

You shouldn't need to use ldconfig manually at all; my best guess is that the way you've installed jemalloc doesn't put it in your system's dynamic library search path (on Ubuntu, just installing with sudo apt-get install -y libjemalloc-dev does the right thing).

Which distribution are you running on?

taylorswift · February 21, 2023, 8:42pm

i am on the swiftlang/swift:nightly-5.8-amazonlinux2 docker image, either jemalloc installs itself to /usr/local/lib where it shouldn’t, or the docker image just doesn’t have that directory in its search path, and it should.

i don’t think this is an issue with the package, the package could just use some better documentation for the amazon linux 2 container use case.

i’m building it from source, since the version that comes with yum is too old for the plugin. i’m still using the same steps from before.

the histogram is showing for me again, so i have no idea why it was printing a debug description before. it probably had something to do with my benchmark.

Sven · February 22, 2023, 8:26am

Yes, jemalloc-3.6 in EPEL7 is from 2015 and is too old.

For a local install of jemalloc, just run the following and you can then run benchmarks fine:

echo /usr/local/lib > /etc/ld.so.conf.d/local_lib.conf && ldconfig
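After running the ldconfig command above, one way to confirm from Swift that the dynamic loader now resolves jemalloc is to try loading it with `dlopen`. This is a minimal sketch, assuming a Linux (Glibc) host and that the from-source build installs the library under the soname `libjemalloc.so.2`; adjust that name if your build differs. `dynamicLoaderCanFind` is a hypothetical helper name, not part of Benchmark.

```swift
#if canImport(Glibc)
import Glibc
#elseif canImport(Darwin)
import Darwin
#endif

/// Returns true if the dynamic loader can resolve and load `library` using
/// its normal search rules (e.g. LD_LIBRARY_PATH and, on Linux, the
/// ld.so.cache that ldconfig rebuilds).
func dynamicLoaderCanFind(_ library: String) -> Bool {
    guard let handle = dlopen(library, RTLD_NOW) else {
        // dlerror() describes why the lookup failed.
        if let message = dlerror() {
            print("dlopen failed: \(String(cString: message))")
        }
        return false
    }
    dlclose(handle)
    return true
}

// Assumption: the from-source jemalloc build provides libjemalloc.so.2 in /usr/local/lib.
print(dynamicLoaderCanFind("libjemalloc.so.2") ? "jemalloc found" : "jemalloc NOT found")
```

If this reports that jemalloc was not found, the /usr/local/lib entry is most likely still missing from the loader configuration or cache.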

hassila · March 15, 2023, 1:28pm

I'm happy to announce a major release of Benchmark, which includes several new features and bug fixes, but most importantly significantly improved documentation thanks to @heckj, who has made a great set of contributions in this release. Thanks a lot!

Major improvements include:

Greatly expanded export formats (JMH, plus percentiles and raw samples in tab-separated-values format for easy consumption by other tools, etc.) and better analytic tools; check out some samples on the landing page
Significantly improved documentation (should be updated on SwiftPackageIndex later today; make sure you get the 0.9.0 documentation, or check out the repo and build it in Xcode with Build Documentation)
A new check command that can be used to check for regressions in CI, with much improved and more understandable output compared to 0.8.0
Progress bars with ETA while running tests
Regex filtering of benchmarks to run/skip
swift package benchmark help
Streamlined CLI interface
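The tab-separated export format is convenient because almost any tool can read it. As an illustration of the idea only, here is a sketch that writes percentile results as TSV rows, using the Malloc (total) values from the debug output earlier in the thread. The `tsv` function and its column layout are hypothetical and do not reproduce Benchmark's actual export format.

```swift
/// Hypothetical TSV export: one header row, then one row per percentile,
/// sorted ascending so the output is stable across runs.
func tsv(metric: String, percentiles: [Int: Int]) -> String {
    var lines = ["metric\tpercentile\tvalue"]
    for (p, value) in percentiles.sorted(by: { $0.key < $1.key }) {
        lines.append("\(metric)\tp\(p)\t\(value)")
    }
    return lines.joined(separator: "\n")
}

// Usage: a few Malloc (total) percentiles from the debug output above.
print(tsv(metric: "Malloc (total)", percentiles: [0: 1016, 50: 1016, 100: 1023]))
```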
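The new check command compares results against a baseline to catch regressions in CI. How Benchmark itself makes that decision is described in its documentation; the following is only a sketch of the general idea, assuming lower values are better (as for time, memory, and malloc counts) and a hypothetical relative tolerance. `Regression` and `regressions` are illustrative names, not Benchmark API.

```swift
/// One percentile that got worse than the baseline allows.
struct Regression: Equatable {
    let percentile: Int
    let baseline: Int
    let current: Int
}

/// Flags percentiles whose current value exceeds baseline * (1 + tolerance).
/// Assumes lower is better; percentiles missing from `current` are skipped.
func regressions(baseline: [Int: Int],
                 current: [Int: Int],
                 tolerance: Double) -> [Regression] {
    var found: [Regression] = []
    // Walk percentiles in ascending order so CI output is reproducible.
    for (percentile, base) in baseline.sorted(by: { $0.key < $1.key }) {
        guard let now = current[percentile] else { continue }
        if Double(now) > Double(base) * (1.0 + tolerance) {
            found.append(Regression(percentile: percentile, baseline: base, current: now))
        }
    }
    return found
}

// Usage: p99 grew by 25%, beyond a 5% tolerance; p50 grew by 2%, within it.
let storedBaseline = [50: 1000, 99: 1200]
let latestRun = [50: 1020, 99: 1500]
for r in regressions(baseline: storedBaseline, current: latestRun, tolerance: 0.05) {
    print("p\(r.percentile) regressed: \(r.baseline) -> \(r.current)")
}
// prints "p99 regressed: 1200 -> 1500"
```

In a CI job, a non-empty result would typically be turned into a non-zero exit status so that the pipeline fails.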
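Regex filtering means you can choose which benchmarks to run or skip by name. The sketch below shows the general include/skip idea using Foundation's `NSRegularExpression`, with the benchmark names from earlier in the thread. `selectBenchmarks` and its `include`/`skip` parameters are hypothetical and do not describe the plugin's actual command-line options.

```swift
import Foundation

/// Keeps names matching any `include` pattern (or all names if `include` is
/// empty), then drops names matching any `skip` pattern.
func selectBenchmarks(_ names: [String],
                      include: [String],
                      skip: [String]) throws -> [String] {
    let includeRegexes = try include.map { try NSRegularExpression(pattern: $0) }
    let skipRegexes = try skip.map { try NSRegularExpression(pattern: $0) }

    func matches(_ name: String, _ regexes: [NSRegularExpression]) -> Bool {
        let range = NSRange(name.startIndex..., in: name)
        return regexes.contains { $0.firstMatch(in: name, options: [], range: range) != nil }
    }

    return names.filter { name in
        (includeRegexes.isEmpty || matches(name, includeRegexes))
            && !matches(name, skipRegexes)
    }
}

// Usage: run only the memory benchmarks, but skip the weak-capture one.
let all = ["Explicit Capture Memory", "Weak Capture Memory", "Malloc count"]
let selected = try selectBenchmarks(all, include: ["Memory$"], skip: ["^Weak"])
print(selected)
// prints ["Explicit Capture Memory"]
```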

Release details here (there are a couple of small API breaks, outlined there, that need simple search-and-replace fixes):

Any feedback on API or output formats much appreciated as we want to wrap up a 1.0 reasonably soon.