Towards Robust Swift Performance

performance
benchmarks

(Pavol Vaskovic) #1

:dove::snowflake::sunrise:
Unladen soars Swift
Winter south of Sahara
Airspeed still mirage

I've seen a little bit of pushback against some benchmark reforms I’ve proposed, so I'd like to solicit a wider discussion of how to improve Swift's performance and the reliability thereof.

In my personal experience, the performance of idiomatic Swift code is currently very fragile. It is not uncommon to see performance drop by one or two orders of magnitude when the optimizer fails to remove a layer of abstraction. The reasons for this are totally opaque and impossible to reason about at the level of Swift code. Using Instruments to profile such problems is essential, but correctly calibrating your expectations of what is normal and what is abnormal behavior when looking at the trace or the disassembled code requires a ton of experimentation. Overall, in my experience Swift is still painfully immature in this regard.

Our approach to such problems has been, I think, largely reactive: waiting for users to file performance bugs and adding benchmarks for those cases as we go. But in my opinion, the Swift Benchmark Suite (SBS) has been a largely underutilized resource in our project. It is fair to say that I have some strong opinions about it, but I'm sure you can all improve upon them. So here is what I think would help with the situation, and the steps I'm trying to take in that direction.

Without proper measurements, we don't know where we stand and where we need to go. So a robust measurement methodology (which is nearing completion) is the fundamental 1️⃣ step. We must be able to trust the reported results. When written properly, so that they avoid accumulating errors from environmental interruptions, the benchmarks in SBS are robust enough to be run in parallel and fast enough to run in just a few seconds for a smoke check. This means that adding more benchmarks is no longer a cost-prohibitive activity demanding extreme frugality, as it was when a full benchmark run took several hours to complete.

This enables us to take the 2️⃣ step: systematically increasing performance test coverage by adding benchmarks for various combinations of standard library types (e.g. value vs. reference types). In my experience, this approach has been an effective tool for discovering a number of de-optimization pitfalls.
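To sketch what I mean (purely illustrative code, not an actual benchmark from the suite; the names ValueBox and RefBox are invented): the same workload gets instantiated over a value type and over a class-wrapped reference type, so a regression in one combination but not the other points at a lost optimization.

```swift
// Illustrative only: the same summation workload over a value-type element
// and a class-wrapped (reference-type) element.
struct ValueBox { var value: Int }
final class RefBox {
    var value: Int
    init(_ value: Int) { self.value = value }
}

@inline(never)
func sumValues(_ boxes: [ValueBox]) -> Int {
    return boxes.reduce(0) { $0 + $1.value }
}

@inline(never)
func sumReferences(_ boxes: [RefBox]) -> Int {
    return boxes.reduce(0) { $0 + $1.value }
}

let values = (0..<10_000).map { ValueBox(value: $0) }
let references = (0..<10_000).map { RefBox($0) }

// In the benchmark suite each loop would live in its own run function
// registered via BenchmarkInfo; here they just run inline.
var checksum = 0
for _ in 0..<100 {
    checksum ^= sumValues(values)
    checksum ^= sumReferences(references)
}
print(checksum)
```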

My first substantial Swift PR introduced this for sequence slicing operations. I like to think that it eventually led to the elimination of AnySequence from that API (SE-0234). I'm using the same approach now in PR #20552. In my experience, using GYB for these was very beneficial.
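For example, roughly in GYB's template syntax (a sketch only, not the actual contents of those PRs), a single template can stamp out the same benchmark body for several element types:

```
% for Element in ['Int', 'Double']:
@inline(never)
public func run_ArraySum${Element}(_ n: Int) {
  let array: [${Element}] = (0..<1_000).map { ${Element}($0) }
  var result: ${Element} = 0
  for _ in 0..<n {
    result += array.reduce(0, +)
  }
  precondition(result >= 0)
}
% end
```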

A legitimate issue with this approach is making sense of the increased number of data points, which is the main reason I've proposed the new Benchmark Naming Convention: to systematically order the benchmarks into families, groups and variants. This leads me to the necessity of a 3️⃣ step: improved tooling on top of benchmark reports.
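(To give a concrete illustration of that family / group / variant structure, with names made up for this example rather than taken from the actual proposal, a family of flattening benchmarks might read:

```
Flatten.Array.String        // family: Flatten, group: Array,      variant: String
Flatten.Array.Object
Flatten.ArraySlice.String   // family: Flatten, group: ArraySlice, variant: String
Flatten.ArraySlice.Object
```
)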

Since the introduction of run_smoke_bench, which significantly shortened the time it takes to gather benchmark measurements with @swift-ci, public visibility into the performance of Swift has been greatly diminished, because it reports only changes. There are no publicly visible full benchmark reports. I understand there are Apple-internal tools (LNT) which cover this, but that is of no use to me as an external Swift contributor.

I'd like to build more tools to make better sense of benchmark results, but that requires publicly available access to the benchmark measurements from CI. The bare minimum would be to restore the publishing of full benchmark reports on GitHub. Then it should be possible to build tools on top of its API.

Maybe you've seen the charts from my robust-microbench report, with a fully responsive design that works great on iPads and iPhones, not just on the desktop. Adopting the proposed benchmark naming convention (mainly the name length limit) would make more of them possible. Please have a look at this manual prototype of a relative comparison table of benchmarks within one family, across the groups and variants, built on top of the proposed naming convention. If this makes sense, please approve it, as it's been stuck in review limbo for over a month now… (I don't dare to merge it without at least one :white_check_mark:)


I'm aware of the efforts to bring Rust-inspired ownership to Swift, which I expect will greatly improve performance robustness. I believe the approach outlined above is largely complementary, as it's about using the SBS to its full potential, so that all performance-related decisions are backed by hard evidence. I don't want to keep flying blind anymore…

What do you think?


(Michael Gottesman) #2

I am not sure how important looking at this data on an iPhone is. But I would like to respond to one argument that I have heard made (correct me if I am wrong): the inline table on GitHub. Specifically, the argument is that since the names of the benchmarks are on the left in the GitHub UI, the table cuts off the actual data we wish to see. To me this is an artifact of us having the names on the left-hand side of the table. Why not put the names on the right-hand side of the table? Then this isn't an issue.

In terms of viewing data on an iPhone/iPad, I do not understand why we cannot have long names and then abbreviate them in various ways. For instance, we could have a way to "namespace" benchmarks and then display the benchmarks with shortened names (i.e. with all the namespaces removed). Then that isn't an issue, and we still have names that are easy to read and actually mean something, without losing the conciseness that may be needed when displaying data on a smaller screen.
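For example (a quick, made-up sketch, not a concrete proposal for the benchmark driver), the fully namespaced name could be stored as the canonical identifier while only the final component is shown when space is tight:

```swift
// Illustrative only: display a namespaced benchmark name with the
// namespaces stripped when screen space is limited.
let qualifiedName = "SequenceSlicing.DropFirst.AnySeqCntRange"  // invented name
let displayName = qualifiedName.split(separator: ".").last.map { String($0) } ?? qualifiedName
print(displayName)  // prints "AnySeqCntRange"
```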


(Pavol Vaskovic) #3

Correct me if I'm misunderstanding your point. It reads to me like an argument that we don't have any actual problem with the length of benchmark names, or that it's trivial, so no change is necessary.

The display of names in the table on GitHub, and on mobile, is one aspect. @Andrew_Trick had this to say about it:

I've never understood the argument for absurdly long camel case names. When you have many of these similar names scattered throughout the data set, it becomes impossible for me to visually distinguish them from each other, and I often get them mixed up. Brevity increases clarity. Using namespaces and separators also helps a lot.

Here's another demonstration of the importance of concise names in a different tooling context: the relative comparison table I've been mentioning above (see the link for the full context).


(Andrew Trick) #4

Very long names are neither human-recognizable (not a good ID) nor human-readable (not a good description). Whenever you have a large data set with many very long names that differ only in a few characters, where the data is sorted by some other key, it's impossible to visually cross-reference data points. Long names reduce clarity and lead to mistakes.

Benchmark numbers would work well as identifiers, but they aren't stable, memorable, or hierarchical. Short benchmark names can be used as stable, human-recognizable identifiers, and they will work with arbitrary tools for data presentation. They do not need to be meaningful or "readable".

Obviously, the benchmarks do need a human-readable (though not necessarily recognizable), meaningful description string. We should start adding those right away. They can be printed in the CI results on the right-hand side.

I don't understand who's pushing back and what the alternate proposal is. The only thing I'm really strongly opposed to is using the benchmark name as a substitute for its description. If the argument is that we should add capability for descriptions first before renaming, then I could understand that.

I see you linked to a thread where @lorentey commented. Then, as far as I can tell, you followed his suggestions. @lorentey , do you have any other suggestions?

I saw one vague comment in the PR about the new names being less "readable". I honestly don't know what to make of that. The new names are obviously much more recognizable, and readability has nothing to do with it. Maybe it would help to show people a table of all the benchmark data, ordered by some arbitrary data point like runtime, so they can compare the old and new names in a real setting and see how much easier it is to pick out a data point. In fact, I'd like to see that to better evaluate your PR myself.

I can see why @Michael_Gottesman doesn't have a problem with the names as they are, since he always strips the benchmark labels when presenting data. If you don't use them, they aren't a problem ;)

The current benchmark names are most definitely not "easy to read" and do not "actually mean something" to me.


(Michael Gottesman) #5

You are incorrect. My point is that there may be ways we can accommodate both use cases, and if it is possible to do so without either use case harming the other, we should. I don't see why a benchmark having a short name (without meaning) precludes it from also having a long name /with/ meaning. As a simple straw man: why couldn't all tests have a short name that does not have to "actually mean something", a longer name, and a description in the BenchmarkInfo? Then in contexts where screen real estate is small (or the user doesn't care about it), we can use the abbreviated name. That satisfies your use case. If someone wants the longer names, that is fine as well for their use case. We could have a flag that enables one to control which is output by the driver.
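To make the straw man concrete (this is an invented sketch, not the shape of the existing BenchmarkInfo in the benchmark suite):

```swift
// Straw-man stand-in for BenchmarkInfo: every benchmark carries a short name
// for constrained displays, a longer namespaced name, and a description.
struct BenchmarkInfo {
    let shortName: String        // used when screen real estate is small
    let longName: String         // fully spelled-out, namespaced name
    let description: String      // human-readable explanation of what is measured
    let runFunction: (Int) -> Void
}

// All names below are invented examples.
let example = BenchmarkInfo(
    shortName: "DropFirstAnySeq",
    longName: "Sequence.DropFirst.AnySequence.Int",
    description: "dropFirst() on an AnySequence wrapping a range of Int",
    runFunction: { n in
        var sum = 0
        for _ in 0..<n {
            for value in AnySequence(0..<1_000).dropFirst(10) { sum += value }
        }
        precondition(sum >= 0)
    }
)

example.runFunction(1)
```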

Setting aside that both use cases can be satisfied without affecting each other... my personal reason for preferring such an approach is that I have noticed a lot of time being spent on the mailing list trying to come up with short benchmark names, because people are trying to come up with short names that also mean something. If there were a separate longer name associated with each benchmark, then as long as the short name is unique (and you can map it back to the longer one) you get the benefits of both and do not have to care about "its meaning". In fact, if we don't care about the meaning at all, why don't we just give it a hash or something like that (perhaps of the long name)? Then we do not have to spend any time talking about these names.
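Just to illustrate the "give it a hash" idea (a made-up sketch; FNV-1a is used here only because Swift's Hasher is randomly seeded per process, and a short ID would need to be stable across runs):

```swift
// Derive a stable short identifier from the long, meaningful benchmark name.
func stableShortID(for longName: String) -> String {
    var hash: UInt64 = 0xcbf29ce484222325          // FNV-1a offset basis
    for byte in longName.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3               // FNV-1a prime
    }
    return String(hash, radix: 36, uppercase: true)
}

// Invented long name; prints the same short ID on every run.
print(stableShortID(for: "Sequence.DropFirst.AnySequence.Int"))
```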


(Michael Gottesman) #6

Another way that we could come up with a short name (without meaning) from a long name with meaning is to introduce the concept of a namespace. When I have seen people use these long camel-case names, it is because they are trying to categorize benchmarks that are part of a benchmark suite. Why not formalize that into a concept and make it so that these benchmark-suite namespaces get camel-cased into the long name and become abbreviated in the short name (either by hashing the name in some way or by taking the upper-case letters in the namespace's name)?
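A made-up sketch of that second option, abbreviating the namespace components to their upper-case letters while keeping the final component intact (the name used here is invented):

```swift
// "SequenceSlicing.DropFirst.AnySeqCntRange" -> "SS.DF.AnySeqCntRange"
func abbreviate(_ qualifiedName: String) -> String {
    let components = qualifiedName.split(separator: ".").map { String($0) }
    guard components.count > 1 else { return qualifiedName }
    let namespaces = components.dropLast().map { $0.filter { $0.isUppercase } }
    return (namespaces + [components.last!]).joined(separator: ".")
}

print(abbreviate("SequenceSlicing.DropFirst.AnySeqCntRange"))  // "SS.DF.AnySeqCntRange"
```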


(Andrew Trick) #7

In fairness to @palimondo, I think he's the only one who has spent a lot of time on this project. He was asked to bring this issue to the forum for discussion.

Is the concern that by shortening the names we're losing information? If so, then let's make it a requirement to add a one-line description (or "long name", if you want to call it that) to any benchmark before shortening its name.

I do object to using the current camel-case strings as the one-line description. I can't read them and I don't think they serve any purpose if we have short IDs and unrestricted one-line descriptions. I think @lorentey suggested using the actual API signature as the one-line description, which sounds like a great idea to me.

AFAICT Pavol is introducing a hierarchical namespace, which is awesome.

Pavol, can you show us a full benchmark data set before and after your renaming?