Learning to use collections-benchmark

I’m trying to learn how to use swift-collections-benchmark, and I’m finding the documentation a bit sparse. I hope I’m in the right place to ask for help with it.

My current point of uncertainty involves timer.measure.

In looking at the benchmarks in swift-collections, I see that the Array benchmarks use timer.measure while the Set benchmarks don’t.

For example:

Array<Int> sort benchmark
    self.add(
      title: "Array<Int> sort",
      input: [Int].self
    ) { input in
      return { timer in
        var array = input
        array.reserveCapacity(0) // Force unique storage
        timer.measure {
          array.sort()
        }
        precondition(array.elementsEqual(0 ..< input.count))
      }
    }
Set<Int> remove benchmark
    self.add(
      title: "Set<Int> remove",
      input: ([Int], [Int]).self
    ) { input, removals in
      return { timer in
        var set = Set(input)
        for i in removals {
          set.remove(i)
        }
        precondition(set.isEmpty)
        blackHole(set)
      }
    }

I also notice that this array benchmark does not call blackHole.

Furthermore, some of the other array benchmarks take an input of Int.self instead of [Int].self, and others take an input of ([Int], [Int]).self and use one of the inputs as indices.

So… there’s a lot that I don’t understand.

When do I need to use timer.measure? What happens if it is omitted?

When do I need to use blackHole? What happens if it is omitted?

If the input is Int.self I assume it will be the current size, and if it is an array I assume its count will be the current size. What else can I rely on about the values for various input types?

Is there more comprehensive documentation for collections-benchmark that I can read somewhere?

I'm not the definitive expert here, but I can help a smidge.

Use it when you want to control very specifically the thing that's getting measured. In the case of array above, it's measuring "sort" and none of the work around it.

If you don't use timer.measure, it measures everything that happens in the closure - as though you'd wrapped the whole of it with timer.measure { ... }.

blackHole prevents the Swift compiler from optimizing away work whose result is unused. It's defined as:

  @inline(never)
  @_optimize(none)
  public func blackHole<T>(_ value: T) {}

It literally does nothing — but the compiler can't prove it does nothing, so it must keep the code that produces the value.

In the Array sort example, blackHole isn't needed because array.sort() mutates the array in-place, and the subsequent precondition uses the array — so the compiler can't eliminate the sort.

In the kalimbaOrdered example from the Getting Started guide, blackHole(input.kalimbaOrdered()) is essential because without it, the compiler could see that the returned array is never used and skip the computation entirely. In practice, if you start seeing ludicrous numbers for micro-benchmarks like this, the compiler may be "helpfully" optimizing away the work you're trying to measure.

What I've caught is that this is primarily useful is the work you're producing in the benchmark isn't used - wrap it with blackhole to prevent the compiler from disappearing it. I'm sure there's other optimization paths, but that's the one I've hit in experiments.

5 Likes

Well put! :100:

It's hard to tell precisely when blackHole (and similar functions, like identity) are needed -- it depends on optimizer behavior that may arbitrarily change between compiler releases. I sort of follow my instincts, defaulting to adding calls when I'm in doubt. My instincts probably mislead me quite often.

identity takes an arbitrary value and returns it unchanged, without letting the compiler know that. This is sometimes useful to prevent the compiler from recognizing that the same function is called multiple times with the same arguments. For instance, take this code:

let items: Array<Int> = ...
for i in 0 ..< 10 {
  blackHole(items.sum())
}

A sufficiently clever optimizer that sees the definition of sum() could perhaps realize that it is guaranteed to have the same result on each iteration, so it could move the computation out of the loop, defeating it. Adding identity() helps avoid that:

let items: Array<Int> = ...
for i in 0 ..< 10 {
  blackHole(identity(items).sum())
}

The input of sum() is now the result of some opaque function, preventing such optimizations. (Note: there is usually no need to wrap measurements in such explicit loops -- the package already loops tasks as needed.)

Of course, there are three issues with the use of these helpers:

  • This is all mostly speculative. (Unless we only ever add blackHole/identity to adress a real issue we see in the results -- but that requires constant vigilance.)
  • The attributes we're using may be out of date, and they are just hints anyway. A sufficiently clever compiler can simply decide to ignore the @inline(never) and @_optimize(never) attributes and still analyze the behavior of these functions, defeating their purpose. (E.g., consider a hypothetical optimizer phase that runs at link time, or after it, on the generated binary executable.)
  • These function calls aren't free -- in fact, they are potentially unspecializable generic functions with all optimizations disabled, and so they come with observable runtime overhead. This overhead will show up in the actual benchmark results we're measuring. The overhead is constant, but it can be significant, and it makes it more difficult to usefully compare absolute microbenchmark results between, say, different languages. (The package is primarily concerned with validating scalability: i.e., measuring how elapsed time changes with input size, where constant overhead doesn't matter.)
3 Likes

The standard generators are defined here:

You can add your own by calling registerInputGenerator on the Benchmark instance. The Insertions struct generates random arrays of nonnegative integers where ∀i: values[i] <= i -- useful for generating randomized locations for inserting items.

Reading the source is currently the best documentation! (The package is only 9,000 source lines or so, a large part of which is dedicated to chart generation that can usually be ignored.) PRs to improve docs are welcome, of course.

3 Likes

Thanks!

After reading this comment within the implementation of Task.measure I tried changing the implementation of my benchmark from this:

self.add(
  title: "Batch shuffle (\(rngName))",
  input: [Int].self,
  body: { input in
    return { timer in
      var array = input
      array.reserveCapacity(0)
      timer.measure {
        array.batchShuffle_loop_6(using: &rng)
      }
      blackHole(array)
    }
  }
 )

to this:

self.add(
  title: "Batch shuffle (\(rngName))",
  input: [Int].self,
  body: { input in
    var array = input
    array.reserveCapacity(0)
    return { timer in
      array.batchShuffle(using: &rng)
      blackHole(array)
    }
  }
)

and it seems to have helped smooth out the measurements, meaning the error bars in the graphs are noticeably smaller now.

This snippet is within a function on Benchmark that takes an RNG instance and makes a local mutable copy of it called rng. (Taking the RNG inout does not work because the compiler complains about capturing it in escaping closures.)

There are several of these self.add(...) calls that capture the same rng instance for use with different algorithms. My intent is that they should all be getting different random numbers, because the same shared RNG instance keeps getting mutated, and thus the branch predictor won’t mess with my results by memorizing the sequence of a seeded PRNG.

Does it look like I’m using the APIs correctly to measure what I want, or have I made some sort of subtle mistake?

There is one potentially important difference between the behavior of the two snippets!

In the original code, batchShuffle is always invoked on a fresh, unique copy of input. The benchmark loop will look like this:

var array = input; array.reserveCapacity(0)
timer.measure { array.batchShuffle(using: &rng) }
array = input; array.reserveCapacity(0)
timer.measure { array.batchShuffle(using: &rng) }
array = input; array.reserveCapacity(0)
timer.measure { array.batchShuffle(using: &rng) }
...

In the updated version, array is created exactly once per input. The unique array instance gets captured by the actual task closure, and the same array gets repeatedly mutated through batchShuffle, feeding its output to the input of the next iteration. The benchmark loop looks like this:

var array = input; array.reserveCapacity(0)
timer.measure { array.batchShuffle(using: &rng) }
timer.measure { array.batchShuffle(using: &rng) }
timer.measure { array.batchShuffle(using: &rng) }
...

This whole nesting of closures thing is really quite subtle -- but it gives us the most flexibility without forcing developers to create one or more custom types per task to represent its various states. (Although that would be a potentially palatable API alternative, perhaps leading to easier to understand (if more verbose) definitions.)

If chaining batchShuffle like that is not expected to change its performance, then the difference isn't an issue. (An example of an operation that would probably be significantly affected by such chaining is the standard Array.sort().)

Avoiding timer.measure when possible indeed improves the stability of benchmark results, as the runner can decide to batch up multiple executions in a single timer run, and that can significantly improve the effective clock resolution. (Plus it amortizes any variable cost of accessing the clock itself).

The difference should just be a lower variance of the measured results on the left (quick) side of the charts (and sometimes faster gathering of data), without significantly affecting the minimum or average execution time. Experimental results seem to confirm that the results are close enough between the various execution styles, but I haven't done a formal proof of it -- statistics is not my strongest field. :sweat_smile:

Re rng: I'm a little surprised that it's okay to have the inout capture of a local in the task closure, as add(title:input:body:) takes an escaping closure, too. But if it works, then it works! The benchmarks are always run one at a time, on a single thread, with no concurrency.

If you want the tasks to always run on the same random sequence, this would be one way to do it:

self.add(
  title: "Batch shuffle (\(rngName))",
  input: [Int].self,
  body: { input in
    var array = input
    array.reserveCapacity(0)
    return { timer in
      var rng = rng // <==
      array.batchShuffle(using: &rng)
      blackHole(array)
    }
  }
)

The tradeoff is that this would add the cost of copying the generator to the measured task.

If you want each task execution to run on a distinct random number sequence, I think one good option would be to initialize the RNG from some seed within the task:

self.add(
  title: "Batch shuffle (\(rngName))",
  input: [Int].self,
  body: { input in
    var array = input
    array.reserveCapacity(0)
    return { timer in
      var rng = MyRNG(<some unique seed>)
      array.batchShuffle(using: &rng)
      blackHole(array)
    }
  }
)

The unique seed could come from a global integer variable (using, say, Atomic<Int>.wrappingAdd(1, ordering: .relaxed).oldValue))

If initializing/copying the RNG has significant cost, it may be a good idea to revert to timer.measure and let the benchmark run for more cycles -- that should similarly help smoothing out the deviation.

2 Likes

Thanks again!

Really minor question, how do I get the vertical axis to start at something other than a power of 10 when using a log scale?

I tried --min-time 500ps but the resulting graph starts at 100ps.

You might want to try hacking inside the renderer. I do not see a public option to pass from CLI to change that.

1 Like

Yep, and the chart scales explicitly widen the range boundaries to fall on major gridlines. I added this to make the charts look more readable/pleasant; but if it's undesirable in some contexts, there is always room to add more options!

One really interesting exercise would be to add an alternative rendering method that uses the excellent Swift Charts framework. (And while we're at it, it would also be nice to have an interactive results browser app that allows easy reading/comparing values under the cursor. For extra brownie points, we should also add support for generating a static html page with interactive SVG magic.) Of course, this tool doesn't really need to be elaborated like that, and there is always more important things to rather do -- but if there is an itch...

3 Likes

How do I move the legend to the bottom left corner?

Also, is there a way to graph relative speed? As in, I have several algorithms, and I want to pick one of them as the baseline, and graph them all with a y-axis of “how many times faster than the baseline algorithm”?

So I am not a DS expert… but I think the question you are starting to ask is along the lines of "how confident can we be that algorithm B is X times faster than algorithm A". I'm not sure if the collections-benchmark package was designed specifically around those use cases. The package can display confidence intervals around one algorithm… which is the "noise" you currently see plotted. But as far as explicit head-to-head comparisons you might also want to look at expanding the package to more properly handle test groups and control groups.

IMO one of the most impactful places to use collections-benchmark is to confirm that algorithms speed up across an entire "class" of complexity: you optimized an algorithm from O(n) to O(log n). If you are optimizing just for some constant scale factor you can still record and plot with collections-benchmark but you might also look for opportunities to add some additional probability math on the JSON data that was recorded.

Maybe also try taking a look at what is possible with results compare. But I have not explored that one too closely.