RFC: BenchmarkInfo refactoring

I'd like to refactor the BenchmarkInfo type used in the Swift Benchmarking Suite (SBS) to better match the emergent style of writing benchmarks.

I think the current order of attributes in BenchmarkInfo is a bit unfortunate. The runFunction is currently sandwiched in the middle of the attribute list. I'd like to make it the last parameter, so that we can use trailing closure syntax to write the body of a performance test directly after the declaration. This eliminates the boilerplate of declaring a separate one-use function, decorating it with the @inline(never) attribute, and referring to it in the declaration.
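For reference, the current style looks roughly like this (details elided; SomeBench is just a made-up example):

public let SomeBench = BenchmarkInfo(
  name: "SomeBench",
  runFunction: run_SomeBench,  // sandwiched between name and tags
  tags: [.validation, .api])

@inline(never)
public func run_SomeBench(_ N: Int) {
  for _ in 1...100*N {
    // workload
  }
}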

If the benchmark function can be reused between multiple benchmarks, where each performs its own workload-specific setUp, it of course makes sense to have a shared run function that is referred to by name. See ExistentialPerformance.swift for a recent example of this style.

Here's a sketch of the proposed design in action:

Benchmark("Name",
  info: "Optional one line description of the benchmarks purpose ") {
  for _ in 1...100 {
    workload
  }
}

Benchmark("Name",
  aka: "OldName", legacyFactor: 10,
  setUp: { … }, tearDown: { … }) {
  for _ in 1...100 {
    workload
  }
}

Benchmark("Name", setUp: { prepareWorkload(size: 123}, run: doTheStuff)
// func doTheStuff is the run function shared between multiple benchmarks
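To make the proposed parameter order concrete, here's a rough sketch of the refactored type itself (all names, types, and default values are purely illustrative, not a final design):

struct Benchmark {
  let name: String
  let legacyName: String?        // the proposed aka:
  let legacyFactor: Int?
  let info: String?              // optional one-line description
  let setUp: (() -> Void)?
  let tearDown: (() -> Void)?
  let runFunction: (Int) -> Void // takes N, like today's runFunction

  init(_ name: String,
       aka legacyName: String? = nil,
       legacyFactor: Int? = nil,
       info: String? = nil,
       setUp: (() -> Void)? = nil,
       tearDown: (() -> Void)? = nil,
       run runFunction: @escaping (Int) -> Void) {
    self.name = name
    self.legacyName = legacyName
    self.legacyFactor = legacyFactor
    self.info = info
    self.setUp = setUp
    self.tearDown = tearDown
    self.runFunction = runFunction
  }
}

With the run function in the tail position, the trailing closure syntax in the sketches above becomes possible (those sketches gloss over the closure taking N as a parameter).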

In the Benchmark Naming Convention discussion, the addition of a one-line description attribute was suggested to better document the benchmark's purpose. I'm putting it here as an optional info parameter after the name (actually after the legacyFactor).

The existing legacyFactor parameter and the proposed aka are part of a strategy to transition to robust performance measurements without disrupting the long-term performance tracking. I'm slowly finishing the final passes through the SBS:

  • eliminating setup overhead and lowering the runtimes to under 1000 μs, and
  • setting legacyFactor to fake the original results.

After introducing aka I want to rename the benchmarks that have a specified legacyFactor and double-report the results from Benchmark_O:

  • under the new name, with the actual measurements
  • under the aka name, with measurements multiplied by the legacyFactor

There will be an option (--no-legacy?) to turn off the double reporting, which will be used in Benchmark_Driver. But the default double reporting is there to allow the internal long-term performance tracking LNT server to pick up a long enough history, so that we can remove it in the future (Swift 6, 7?).
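Roughly, the double reporting could look something like this (just a sketch against the hypothetical Benchmark type sketched above, not the actual DriverUtils code):

func report(_ benchmark: Benchmark, samples: [Int], legacyReporting: Bool) {
  // Always report the actual measurements under the new name.
  print("\(benchmark.name): \(samples)")
  // Optionally also report under the old name, scaled up by legacyFactor,
  // so the LNT history stays continuous.
  if legacyReporting,
     let legacyName = benchmark.legacyName,
     let factor = benchmark.legacyFactor {
    print("\(legacyName): \(samples.map { $0 * factor })")
  }
}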

I'd also like to retire BenchmarkCategory, also known as tags, in favor of an optional skip: Bool, as I think tags are a bit over-engineered and unused in practice.

As you can see, there are quite a few moving parts, and I'd like to get your feedback before moving forward.

cc @Erik_Eckstein


cc @lorentey @Michael_Gottesman

Instead of aka, a more appropriate name would be legacyName:. Also, the tags stuff is something added by @atrick; I think he would have strong opinions about removing it (potentially).


I love the usability improvements -- the current BenchmarkInfo struct is somewhat of a pain to initialize.

Support for legacy names and factors seems like a good idea. I like that it's not just a flag that lets us disable these for PR benchmarks; this way we get to keep running them.

To pick on something random: info: sounds a bit too generic -- I'd clarify by using description: instead.

I believe we do rely on categories for grouping benchmarks. @Andrew_Trick and @Erik_Eckstein will know!


Keep in mind that I think Andy is on vacation until the end of the week. You should re-ping this thread later in the week if he doesn't respond.

I also agree with Karoy that description sounds better than info. What kind of data goes in info? Is it a string, a list, a dictionary? description, IMO, sounds clearer.


Next week is just fine; we are also taking advantage of the winter holidays our kids have here to catch a little break till the end of the week.

I thought of using info because that name will be freed up when we rename BenchmarkInfo to just Benchmark. @Michael_Gottesman suggested we add a one-line description printed when listing benchmarks. I think that's a great idea, because we can also put it into the title (shown as a tooltip on hover) of a link to the benchmark source from the performance report on GitHub.

In my mind, the point of this extra piece of metadata is to free you from the constraints of the naming convention syntax (and its 40-character limit) to clarify, when necessary, the benchmark's purpose beyond what's in the name. It isn't meant to replace documenting the benchmarks that require it with proper source code comments. Given that we usually stick to 80 characters per line in our source code, keeping the attribute name as short as possible gives you more space for useful information (sans indentation).

The same 80-character line fitting concerns led me to propose aka instead of legacyName (which it is!), to increase the chances they both fit on the same line. The other idea I've played with was using a named tuple, legacy: (name: "OldName", factor: 42). How does this feel to you?

Benchmark("Name",
  legacy: ("OldName", factor: 10),
  setUp: { … }, tearDown: { … }) {
  for _ in 1...100 {
    workload
  }
}

I think I like that even better, because it gives us just one thing to remove in the future, when it's no longer needed, and the optional named tuple will nicely clarify the code in DriverUtils.

I agree, benchmarks need to have a "description" String. All of the other fields are "info".

Benchmark tags are essential for organizing different benchmark suites and managing the benchmarks over time. If they're underused, it's because most benchmarks predate them, and there's definitely a discoverability problem… But the tags that we are currently using are important.

I'm assuming the tag usage is Apple-internal, as the only publicly visible application is to mark tests to be skipped from the pre-commit suite. If I understand correctly, there is some kind of proprietary extended benchmark suite that relies on BenchmarkCategory for organization?

Do you have some workflows that are using the --tags and --skip-tags arguments for Benchmark_O? Could you share more about how the tags are currently used?

I don't know about Apple internal usage, but I did find tags helpful while iterating on Data tests, allowing me to more easily run just the tests I cared about. Just my (small) usage.

I assume you've used Benchmark_O directly with the argument --tags=Data. The use case of running a (usually consecutive) subset of tests can be handled by running benchmarks using test numbers. Combined with the bash shell, you can do something like Benchmark_O {124..135}, or for a more high-level approach, use Benchmark_Driver -f Data. That's the short form of the --filter argument, which takes a regular expression. You can use more than one of these.

On the contrary, I would say that any benchmark that hasn't been tagged in a way that indicates its importance for measuring some particular aspect of the system is one that still needs to be reviewed and possibly removed from the suite.

The usefulness of tagging subsets of benchmarks, and being able to hand a command line to someone to run that subset, is plainly obvious to me. The benchmark numbers aren't stable and would be a ridiculous way to track a subset of the suite. The purpose of each tag is documented somewhere in the source; it's hard to find, which is itself a big problem. But what made you think they aren't being used? I want to strongly encourage better use of the tags so we have a way to talk about the performance of different aspects of the system. For certain experiments you need to narrow down the set of data points and the amount of code that needs to be analyzed. It takes a very long time to sift through all the benchmarks and determine which are suitable for a given purpose. For example, which provide relevant metrics for comparing different h/w platforms (.cpubench).

Organized, hierarchical naming is all well and good, but it only works for the microbenchmarks. Data performance, for example, won't be limited to just the benchmarks that were expressly written for a particular code path. Larger benchmarks may also happen to depend on Data performance, and that's something we want to keep track of.


From pushback by @Erik_Eckstein on some suggested clean-ups to the Swift Benchmark Suite, I got the impression that what I see in the GitHub repo isn't all that relies on how Benchmark_O works, and that there are additional Apple-internal concerns to be considered. For that reason I have tailored my refactoring to always fully preserve existing behavior, and all the new capabilities for robust measurement are strictly additive and hidden behind flags and optional arguments. Public scripts (Benchmark_Driver, run_smoke_bench) that build on it have leveraged these features, but I got the message that more radical changes are problematic because they would create an additional Apple-internal maintenance burden which isn't budgeted for. Is there such a thing? Maybe I just got the wrong impression…

Mainly the lack of any publicly visible scripts that use this parameter for anything. No doubt there are some personal workflows that rely on tags?

I've been focusing on analyzing the SBS as a whole and building tooling on top of Benchmark_O and Benchmark_Driver; I had no use for tags yet. Now my workflow during the cleanup is to first run Benchmark_Driver -f PATTERN with the tests that were flagged by Benchmark_Driver check, and then, when I need to drill down into the details of an individual benchmark, I invoke Benchmark_O using numbers to select tests (revealed in the previous results). While I'm tuning the legacyFactor, I'm usually looking at output from runs with a rather long argument list like:
--num-iters=1 --num-samples=6 --quantile=4 --verbose

All benchmarks are tagged, because it's a mandatory parameter; it's just that it left me with the impression of a pro forma exercise, rather than a vital organizing tool. :man_shrugging: Here are stats from all 747 benchmarks in the public SBS:

bin$ ./Benchmark_O --list --skip-tags=  | cut -d ',' -f 3,4,5,6,7,8 \
| tr -d '[ ]' | tr ',' '[\n*]' | sort | uniq -c | sort -bnr
 568 validation
 559 api
 215 String
 144 skip
 114 Array
  64 bridging
  56 algorithm
  54 Data
  51 Set
  42 Dictionary
  19 runtime
  11 cpubench
  10 abstraction
   6 unstable
   5 regression
   5 refcount
   3 miniapplication
   3 exceptions
   1 metadata

Almost everything is tagged api and validation, to the point it seems like noise to me. The skip is all in the Existential and StringWalk families (and I've suggested we keep that as a Bool argument). The Array, Data, Dictionary and Set tagged benchmarks almost always include the type also in their benchmark names, so the performance tests for these currency types would be organizationally well covered just by the naming convention. The algorithm tag is quite interesting, but could still be replaced by grouping all of these into one Algorithm. benchmark family in the naming hierarchy. The runtime, cpubench and the rest of the tags are hardly used (<2%).

Only the String tag seems like a solid argument for tagging, because of its extensive coverage. @Michael_Ilseman, do you rely on running Benchmark_O --tags=String? Could you share how you use the benchmarks?

That may be the original vision behind introducing BenchmarkCategory, but is that also the practice after 1½ years? Probably I'm just unfamiliar with your workflows… Could you share some of these from your practice, to help me better understand the requirements and how the tags play into it all?

Isn't it possible to have an equally well organized Swift Benchmark Suite without tags, through a thorough application of the naming convention? It seems like you could hand someone a command line using Benchmark_Driver with multiple --filter arguments to get the same effect as using tags. I realize we currently have a functionality gap between Benchmark_Driver and Benchmark_O, since Swift doesn't ship with regexes yet. Maybe we could add simple substring-match +/- filtering?

 Benchmark_O +String +Breadcrumbs +Char -Indexing
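A naive sketch of the semantics I have in mind (hypothetical, not existing Benchmark_O functionality: +foo includes names containing foo, -bar excludes them):

import Foundation

func shouldRun(_ name: String, include: [String], exclude: [String]) -> Bool {
  let included = include.isEmpty || include.contains { name.contains($0) }
  let excluded = exclude.contains { name.contains($0) }
  return included && !excluded
}

// Benchmark_O +String +Breadcrumbs +Char -Indexing would roughly mean:
// shouldRun(name, include: ["String", "Breadcrumbs", "Char"],
//           exclude: ["Indexing"])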

To be clear, I'm not hell-bent on eliminating the tags, only taking the BenchmarkInfo refactoring as an opportunity to simplify the implementation if possible… you know, convention over configuration and all that :wink:

Apropos, @Michael_Gottesman, any thoughts on keeping or dropping the unsupportedPlatforms: BenchmarkPlatformSet? It is only used in DictionaryKeysContains.

All benchmarks are tagged, because it's a mandatory parameter; it's just that it left me with the impression of a pro forma exercise, rather than a vital organizing tool.

I suspect that was an attempt to force benchmark authors to categorize the kind of benchmark they are adding, which I do think is useful. It so happens that almost all the current benchmarks are just microbenchmarks (testing a single specific code path), and we called those validation benchmarks, as opposed to applications or regression tests. It actually happens quite often that someone requests only the "somewhat realistic" workloads. At the very least, the tags define a language for discussing the types of benchmarks that populate the suite.

bin$ ./Benchmark_O --list --skip-tags=  | cut -d ',' -f 3,4,5,6,7,8 \
| tr -d '[ ]' | tr ',' '[\n*]' | sort | uniq -c | sort -bnr

The fact that you could show me this data with a trivial invocation of the benchmark harness is a pretty good indication of the usefulness of tags. Anyway, the mechanism for tagging benchmarks is a vital organizational tool. That doesn't mean that a particular policy makes sense.

The key point here is that the benchmarks are self-describing. We absolutely do not want people maintaining configuration scripts on top of the benchmark suite with separate lists of benchmarks!

The even broader point that I think you're missing is that the benchmark binary, Benchmark_O, is the primary interface for running benchmarks. Separate driver scripts are only useful as automation aids for comparing result sets. IMO, good driver scripts just provide additional features without creating any unnecessary abstraction over the underlying interface.

Sorry for the delay on this; my past two months were consumed by fighting on the European front of the Copyright Wars (we lost). I want to finish the Swift Benchmarking Suite cleanup now. To summarize the feedback from this thread so far: it looks like there was general consensus on renaming BenchmarkInfo to just Benchmark and moving the run function to the tail position. The optional short description parameter should be spelled out as description.

The question of benchmark categorization using tags versus the naming convention can probably be debated further, while maintaining the status quo of using tags, as Andrew argued for. I think there's a chance for some golden middle road, but we should probably first get more experience with consistent application of the naming convention and see where that leads…