Unladen soars Swift
Winter south of Sahara
Airspeed still mirage
In my personal experience, the performance of idiomatic Swift code is currently very fragile. It is not uncommon to experience one or two orders of magnitude performance drops when the optimizer fails to remove a layer of abstraction. Reasons for this are totally opaque and impossible to reason about on the level of Swift code. Use of Instruments for profiling such problems is essential, but correctly calibrating the expectations when looking at the trace or disassembled code with regards to what is normal and abnormal behavior requires a ton of experimentation. Overall, in my experience the Swift is still painfully immature in this regard.
Our approach to such problems was, I think, largely reactive: waiting for users to file performance bugs, adding benchmarks for these cases as we go. But in my opinion, the Swift Benchmark Suite (SBS) has been a largely underutilized resource in our project. It is fair to say that I have some strong opinions about it, but I’m sure you can all improve upon them. So here is what I think would help with the situation and the steps I’m trying to take in that direction.
Without proper measurements, we don’t know where we stand and where we to go. So the robust measurement methodology (which is nearing completion) is a fundamental 1️⃣. step. We must have trust in the reported results. When written properly to avoid accumulating errors from environmental interruption, the benchmarks in SBS are robust enough to be run in parallel and fast enough to run in just a few seconds for a smoke check. This means that adding more benchmarks is no longer a cost prohibitive activity demanding extreme frugality, as was the case when full benchmark run took several hours to complete.
This enables us to take a 2️⃣. step: systematically increase the performance test coverage by adding benchmarks for various combinations of standard library types (eg. value vs. reference) This approach has been in my experience an effective tool for discovering a number of de-optimization pitfalls.
My first substantial Swift PR introduced this for sequence slicing operations. I like to think that it eventually lead to the elimination of
AnySequence from of that API (SE-0234). I’m using the same approach now in PR #20552. In my experience, using GYB for these was very beneficial.
A legitimate issue with this approach is making sense of the increased number of data points, which is main reason why I’ve proposed the new Benchmark Naming Convention, to systematically order the benchmarks into families, groups and variants. This leads me to the necessity of a 3️⃣. step: improved tooling on top of benchmark reports.
Since the introduction of
run_smoke_bench, which significantly shortened the time to take benchmark measurements with @swift-ci, the public visibility into the performance of the Swift is greatly diminished, because it reports only changes. There are no publicly visible full benchmark reports. I understand there are Apple-internal tools (
lnt) which cover this, but that is of no use to me as an external Swift contributor.
I’d like to build more tools to make better sense of benchmark results, but that requires publicly available access to the benchmark measurements from CI. Bare minimum would be to restore the publishing of full benchmark reports on GitHub. Then it should be possible to build tools on top of its API.
Maybe you’ve seen the charts from my robust-microbench report, with fully responsive design — great on iPads and iPhones, not just on the desktop. Adopting the proposed benchmark naming convention (mainly the name length limit) would make more of them possible. Please have a look at this manual prototype of relative comparison table of benchmarks within one family, across the groups and variants built on top of the proposed naming convention. If this makes sense, please approve it, as it’s been stuck in review limbo for over a month now… (I don't dare to merge it without at least one )
I’m aware of efforts to bring Rust inspired ownership to Swift, which I expect will greatly improve the performance robustness. I believe the above outlined approach is largely complementary, as it’s about using the SBS to it’s full potential, so that all performance related decisions are backed by hard evidence. I don’t like to be flying blind anymore…
What do you think?