Towards Robust Swift Performance

palimondo · December 8, 2018, 12:01am

Unladen soars Swift
Winter south of Sahara
Airspeed still mirage

I've seen a little bit of pushback against some benchmark reforms I’ve proposed, so I'd like to solicit a wider discussion of how to improve Swift's performance and the reliability thereof.

In my personal experience, the performance of idiomatic Swift code is currently very fragile. It is not uncommon to experience one or two orders of magnitude performance drops when the optimizer fails to remove a layer of abstraction. Reasons for this are totally opaque and impossible to reason about on the level of Swift code. Use of Instruments for profiling such problems is essential, but correctly calibrating the expectations when looking at the trace or disassembled code with regards to what is normal and abnormal behavior requires a ton of experimentation. Overall, in my experience the Swift is still painfully immature in this regard.

Our approach to such problems was, I think, largely reactive: waiting for users to file performance bugs, adding benchmarks for these cases as we go. But in my opinion, the Swift Benchmark Suite (SBS) has been a largely underutilized resource in our project. It is fair to say that I have some strong opinions about it, but I’m sure you can all improve upon them. So here is what I think would help with the situation and the steps I’m trying to take in that direction.

Without proper measurements, we don’t know where we stand and where we to go. So the robust measurement methodology (which is nearing completion) is a fundamental . step. We must have trust in the reported results. When written properly to avoid accumulating errors from environmental interruption, the benchmarks in SBS are robust enough to be run in parallel and fast enough to run in just a few seconds for a smoke check. This means that adding more benchmarks is no longer a cost prohibitive activity demanding extreme frugality, as was the case when full benchmark run took several hours to complete.

This enables us to take a . step: systematically increase the performance test coverage by adding benchmarks for various combinations of standard library types (eg. value vs. reference) This approach has been in my experience an effective tool for discovering a number of de-optimization pitfalls.

My first substantial Swift PR introduced this for sequence slicing operations. I like to think that it eventually lead to the elimination of AnySequence from of that API (SE-0234). I’m using the same approach now in PR #20552. In my experience, using GYB for these was very beneficial.

A legitimate issue with this approach is making sense of the increased number of data points, which is main reason why I’ve proposed the new Benchmark Naming Convention, to systematically order the benchmarks into families, groups and variants. This leads me to the necessity of a . step: improved tooling on top of benchmark reports.

Since the introduction of run_smoke_bench, which significantly shortened the time to take benchmark measurements with @swift-ci, the public visibility into the performance of the Swift is greatly diminished, because it reports only changes. There are no publicly visible full benchmark reports. I understand there are Apple-internal tools (lnt) which cover this, but that is of no use to me as an external Swift contributor.

I’d like to build more tools to make better sense of benchmark results, but that requires publicly available access to the benchmark measurements from CI. Bare minimum would be to restore the publishing of full benchmark reports on GitHub. Then it should be possible to build tools on top of its API.

Maybe you’ve seen the charts from my robust-microbench report, with fully responsive design — great on iPads and iPhones, not just on the desktop. Adopting the proposed benchmark naming convention (mainly the name length limit) would make more of them possible. Please have a look at this manual prototype of relative comparison table of benchmarks within one family, across the groups and variants built on top of the proposed naming convention. If this makes sense, please approve it, as it’s been stuck in review limbo for over a month now… (I don't dare to merge it without at least one )

I’m aware of efforts to bring Rust inspired ownership to Swift, which I expect will greatly improve the performance robustness. I believe the above outlined approach is largely complementary, as it’s about using the SBS to it’s full potential, so that all performance related decisions are backed by hard evidence. I don’t like to be flying blind anymore…

What do you think?

Michael_Gottesman · December 8, 2018, 12:33am

I am not sure how important looking at this data on an iPhone is. But I would like to respond to one argument that I have heard made (correct me if I am wrong): the inline table in github. Specifically, the argument is that since the names of the benchmarks are on the left in the github UI the table cuts off the actual data we wish to see. To me this is an artifact of us having the names on the left hand side of the table. Why not put the names on the right hand side of the table? Then this isn't an issue.

In terms of viewing data on iPhone/iPad I do not understand why we can not have long names and then abbreviate them in various ways. For instance, we could have a way to "namespace" benchmarks and then display the benchmarks with shortened names (i.e. with all the namespaces removed). Then that isn't an issue and we still have names that are easy to read/actually mean something, without losing conciseness that may be needed if one is trying to display data on a smaller screen.

palimondo · December 8, 2018, 1:12am

Correct me if I'm misunderstanding your point. It reads to me like argument that we don't have any actual problem with length of benchmark names, or that it's trivial, so no change is necessary.

Display of names in the table on GitHub, and on mobile is one aspect. @Andrew_Trick had this to say about it:

I've never understood the argument for absurdly long camel case names. When you have many of these similar names scattered throughout the data set, it becomes impossible for me to visually distinguish them from each other, and I often get them mixed up. Brevity increases clarity. Using namespaces and separators also helps a lot.

Here's another demonstartion of the importance of concise names in a different tooling context — that relative comparison table I've been mentioning above (see link for the full context):

Andrew_Trick · December 8, 2018, 3:05am

Very long names are neither human recognizable (not a good ID), nor human readable (not a good description). Whenever you have a large data set with many very long names that only differ in a few characters, where the data is sorted by some other key, it's impossible to visually cross reference data points. Long names reduce clarity and lead to mistakes.

The benchmark number would work well as an identifier, but they aren't stable, memorable, or hierarchical. Short benchmark names can be used as stable human recognizable identifiers, and they will work with arbitrary tools for data presentation. They do not need to be meaningul or "readable".

Obviously, the benchmarks do need a human readable (not recognizable), and meaningful description string. We should start adding those right away. They can be printed in the CI results on the right hand side.

I don't understand who's pushing back and what the alternate proposal is. The only thing I'm really strongly opposed to is using the benchmark name as a substitute for its description. If the argument is that we should add capability for descriptions first before renaming, then I could understand that.

I see you linked to a thread where @lorentey commented. Then, as far as I can tell, you followed his suggestions. @lorentey , do you have any other suggestions?

I saw one vague comment in the PR about the new names being less "readable". I honestly don't know what to make of that. The new names are obviously much more recognizable, and readability has nothing to do with it. Maybe it would help to show people a table of all the benchmark data, ordered by some arbitrary data point like runtime, so they can compare the old and new names in a real setting and see how much easier it is to pick out a data point. In fact, I'd like to see that to better evaluate your PR myself.

I can see why @Michael_Gottesman doesn't have problem with the names as is, since he always strips the benchmark labels when presenting data. If you don't use them, they aren't a problem ;)

The current benchmark names are most definitely not "easy to read" and do not "actually mean something" to me.

Michael_Gottesman · December 9, 2018, 1:00am

You are incorrect. My point is that there may be ways that we can accommodate both use cases and if it is possible to do so without any of the use cases harming each other, we should. Here I don't see why a benchmark having a short name (without meaning) precludes it from having a long name /with/ meaning. As a simple straw man: why couldn't all tests have a short name that does not have to "actually mean something", a longer name, and a description in the BenchmarkInfo? Then in contexts where screen real estate is small (or the user doesn't care about it), we can use the abbreviated name. That satisfies your use case. If someone wants the longer names, that is fine as well for their use case. We could have a flag that enables one to control which is output by the driver.

Ignoring that both use cases can be satisfied without affecting the other... my personal reason why I would prefer such an approach is that I have noticed that there is a lot of time being spent on the mailing list trying to come up with short names of benchmarks since people are trying to come up with short names that mean something. If there was a separate longer name also associated with a benchmark, then as long as the short name is unique (and you can map it back to the longer one) you get the benefits of both and do not have to care about "its meaning". In fact if we don't care about the meaning at all, why don't we just given it a hash or something like that (perhaps of the long name). Then we do not have to spend any time talking about these names.

Michael_Gottesman · December 9, 2018, 1:05am

Another way that we could come up with a short name (without meaning) from a long name with meaning, is to introduce the concept of a namespace. When I have seen people use these long camel case names, it is b/c they are trying to categorize benchmarks that are part of a benchmark suite. Why not formalize that into a concept and make it so that these benchmark suite namespaces get camel cased into the long name and become abbreviated in the short name (either hashing the name in some way or taking the upper case letters in the namespaces name).

Andrew_Trick · December 9, 2018, 3:20am

In fairness to @palimondo , I think he's the only one who's spent a lot of time on this project. He's was asked to bring this issue to the forum for discussion.

Is the concern that by shortening the names we're losing information? If so, then let's make it a requirement to add a one-line description (or "long name", if you want to call it that) to any benchmark before shortening its name.

I do object to using the current camel-case strings as the one-line description. I can't read them and I don't think they serve any purpose if we have short IDs and unrestricted one-line descriptions. I think @lorentey suggested using the actual API signature as the one-line description, which sounds like a great idea to me.

AFAICT Pavol is introducing a hierarchical namespace, which is awesome.

Pavol, can you show us a full benchmark data set before and after your renaming?

palimondo · December 10, 2018, 7:16pm

Maybe it would help to show people a table of all the benchmark data, ordered by some arbitrary data point like runtime, so they can compare the old and new names in a real setting and see how much easier it is to pick out a data point. In fact, I'd like to see that to better evaluate your PR myself.

Let me re-emphasize that for now I have proposed the naming convention (NC) to be applied to newly added benchmarks only. But to demonstrate the NC in action, especially to highlight how it solves existing issues, it makes perfect sense to try a hypothetical rename of existing benchmarks.

I've kept the idea of renaming existing tests in the back of my head, though. It would be possible to do the same trick as with legacyFactor and introduce a legacyName to aid in reorganizing SBS according to proposed NC and simply double report results from Benchmark_O and Benchmark_Driver to maintain the continuity of long-term performance tracking with lnt. It would make great sense to report 2 times: legacy with old names and factor applied and new names without legacyFactor — i.e. slowly transition to cleaned up SBS over upcoming Swift releases and finally retire the legacy in Swift 6. But that isn't part of the proposed NC and shouldn't be judged as such!

Looking at all benchmark names, the length becomes a problem only for large benchmark families. Currently there are overall 722 individual benchmarks, 555 are executed pre-commit (not marked .skip or .unstable). Biggest length offenders are in the large disabled benchmark families, which can be safely renamed and re-enabled after clean-up:

Existential (100 benchmarks) — See PR #20666 for a rename of these and addition of 8 more.
StringWalk, CharIndexing and CharIteration (72 benchmarks) — These are on my TODO list, I'll update here as soon as there's a PR applying the proposed NC.

Then there are 10 benchmarks from ObjectiveCBridging family longer than 40 characters, this has been extensively discussed in the PR. Final hypothetical rename would look like this:

Bridging (18 benchmarks)

After	Before
Bridging.NSString.as?.String	ObjectiveCBridgeFromNSString
Bridging.NSString.as!.String	ObjectiveCBridgeFromNSStringForced
Bridging.String.as.NSString	ObjectiveCBridgeToNSString
Bridging.NSArray.as?.Array.NSString	ObjectiveCBridgeFromNSArrayAnyObject
Bridging.NSArray.as!.Array.NSString	ObjectiveCBridgeFromNSArrayAnyObjectForced
Bridging.Array.String.as.NSArray	ObjectiveCBridgeToNSArray
Bridging.NSArray.as?.Array.String	ObjectiveCBridgeFromNSArrayAnyObjectToString
Bridging.NSArray.as!.Array.String	ObjectiveCBridgeFromNSArrayAnyObjectToStringForced
Bridging.NSDict.as?.Dict.NSString.NSNum	ObjectiveCBridgeFromNSDictAnyObject
Bridging.NSDict.as!.Dict.NSString.NSNum	ObjectiveCBridgeFromNSDictAnyObjectForced
Bridging.Dict.String.Int.as.NSDict	ObjectiveCBridgeToNSDict
Bridging.NSDict.as?.Dict.String.Int	ObjectiveCBridgeFromNSDictAnyObjectToString
Bridging.NSDict.as!.Dict.String.Int	ObjectiveCBridgeFromNSDictAnyObjectToStringForced
Bridging.NSSet.as?.Set.NSString	ObjectiveCBridgeFromNSSetAnyObject
Bridging.NSSet.as!.Set.NSString	ObjectiveCBridgeFromNSSetAnyObjectForced
Bridging.Set.String.as.NSSet	ObjectiveCBridgeToNSSet
Bridging.NSSet.as?.Set.String	ObjectiveCBridgeFromNSSetAnyObjectToString
Bridging.NSSet.as!.Set.String	ObjectiveCBridgeFromNSSetAnyObjectToStringForced

Outside of these, there are 11 more benchmarks that are over 40 characters long. The NC PR updates the validation with BenchmarkDoctor to issues the check report with following diagnostics:

	Benchmark Check Report
	`DictionarySubscriptDefaultMutationArrayOfObjects` name is 48 characters long. _{Benchmark name should not be longer than 40 characters.}
	`DictionarySubscriptDefaultMutationOfObjects` name is 43 characters long. _{Benchmark name should not be longer than 40 characters.}
	`FloatingPointPrinting_Double_description_small` name is 46 characters long. _{Benchmark name should not be longer than 40 characters.}
	`FloatingPointPrinting_Double_description_uniform` name is 48 characters long. _{Benchmark name should not be longer than 40 characters.}
	`FloatingPointPrinting_Double_interpolated` name is 41 characters long. _{Benchmark name should not be longer than 40 characters.}
	`FloatingPointPrinting_Float80_description_small` name is 47 characters long. _{Benchmark name should not be longer than 40 characters.}
	`FloatingPointPrinting_Float80_description_uniform` name is 49 characters long. _{Benchmark name should not be longer than 40 characters.}
	`FloatingPointPrinting_Float80_interpolated` name is 42 characters long. _{Benchmark name should not be longer than 40 characters.}
	`FloatingPointPrinting_Float_description_small` name is 45 characters long. _{Benchmark name should not be longer than 40 characters.}
	`FloatingPointPrinting_Float_description_uniform` name is 47 characters long. _{Benchmark name should not be longer than 40 characters.}
	`NormalizedIterator_nonBMPSlowestPrenormal` name is 41 characters long. _{Benchmark name should not be longer than 40 characters.}

Benchmark Family	benchmarks	longer than 40	introduced
`FloatingPointPrinting`	9	9	9 months ago
`NormalizedIterator`	8	1	2 months ago

It's quite obvious these are all also part of larger families, and they are longer than 40 characters simply because there was no forcing function. That is the point of proposed NC combined with the check of newly added benchmarks, which also includes these messages hinting at the solution:

	Benchmark Check Report
	`DictionarySubscriptDefaultMutationOfObjects` name is composed of 6 words. _{Split DictionarySubscriptDefaultMutationOfObjects name into dot-separated groups and variants. See http://bit.ly/BenchmarkNaming}
	`FloatingPointPrinting_Double_description_uniform` name doesn`t conform to benchmark naming convention. _{See http://bit.ly/BenchmarkNaming}

The second message, the naming error, is issued because of the presence of the underscore, which clearly plays the separator role suggested in the previous warning. The proposed NC is here clearly just unifying an emergent practice in SBS naming: The FloatingPointPrinting family had 2 degrees of freedom: type (Double, Float, Float80) tested method (description, interpolated string) with one extra variant for range (small vs. uniform).

Given the FloatingPoint prefix is semantically redundant, because the actual types are part of the benchmark names anyway, the renaming would be rather trivial, with longest name coming at 36 characters long:

Printing.Float.description.Small
Printing.Float80.description.Uniform
Printing.Double.IterpolatedString

Here's another example — correcting myself — as in the hindsight, my first crack at this issue while adding sequence slicing benchmarks, wasn't completely successful. I have only focused on the length limit and the readability clearly suffered.

	Benchmark Check Report
	`DropFirstAnySeqCRangeIterLazy` name is composed of 7 words. _{Split DropFirstAnySeqCRangeIterLazy name into dot-separated groups and variants. See http://bit.ly/BenchmarkNaming}

I've incorporated the lessons learned into the new proposal. Rename example straight from NC proposal:

Seq.dropFirst.Array
Seq.dropLast.Range.lazy
Seq.dropWhile.UnfoldSeq
Seq.prefix.AnySeq.RangeIter.lazy
Seq.prefixWhile.AnyCol.Array
Seq.suffix.AnySeq.UnfoldSeq.lazy

No automated solution can replace the insight of the author in designing a coherent naming system in these cases. Here are 4 benchmarks added 10 days ago, Codable.swift: JSONPerfDecode, JSONPerfEncode, PlistPerfDecode, PlistPerfEncode. With the NC in place, these should have been named like:

Codable.encode.JSON
Codable.decode.PropertyList

But this thread wasn't supposed to be about debating the naming convention and I find this diversion quite frustrating. So let me be a bit more blunt:

Swift performance is fragile (i.e. it SUCKS!!!!) What are we going to do about this?! My suggestion:
Systematically increase coverage (step from above), which requires NC to manage the increase in number of benchmarks.

Instead the NC PR is stuck on red herrings of clarity (aversion to abbreviations) and flexibility (aversion to length limit). So let me ask again, HOW are we going to un-suck Swift's fragile performance problem? Because what I'm hearing so far is:

Andrew_Trick · December 10, 2018, 7:55pm

Performance is fragile because of inadequate design in key areas of the compiler. It's not a great mystery how these problems can be fixed, but it is extremely difficult to do that once a production compiler reaches a certain level of complexity. If you hand someone a benchmark and say "fix this performance" they will take the shortest path to improving the score. This leads to more fragility and increases the maintenance burden on the people who would otherwise be improving the design. I personally spent the past two weeks maintaining a series of performance hacks that just continue to snowball.

So, while I'm very happy to have a large benchmark suite to catch regressions and to have helpful tooling surrounding it, these are tools that, depending on how they're used could make the problem worse.

palimondo · December 10, 2018, 8:12pm

Andrew, thanks for speaking about the core issue. What's your opinion of increasing the test coverage by systematically covering space adjacent to an existing test like I've tried to do in Flatten (and sequence slicing before)?

My hope would be that pointing-out the seemingly inexplicable regressions between variations increases the chances of coming up with systematic fix rather than piling up singleton hacks.

Also, what about the idea of adding additional tooling on top of benchmark measurements, to increase visibility of issues?

Andrew_Trick · December 10, 2018, 10:41pm

I appreciate that you compared different implementations with varying levels of abstraction and organized them into a nice little subset of benchmarks with their own namespace.

If there is a substantial performance difference between any two implementations for no apparent reason, then that justifies having both of them until the problem is fixed.

My primary objection to adding so many benchmarks is that we aren't documenting them. Each benchmark should have motivating comments. That's the only way for us to assign priority to regressions and to tell in the future if they become redundant or irrelevant.