-no-resilience-just-go-fast compiler flag

dabrahams · March 24, 2020, 6:12pm

If you've ever tried to do any kind of high-performance computing with Swift, you've probably found yourself writing scads of @inlinable, @usableFromInline, @frozen, and other annotations to defeat the optimization-blocking effects of Swift's resilient compilation model. If you miss just one important annotation, it's liable to drive your performance over a cliff. This makes programming tedious and optimization needlessly error-prone. Therefore, I propose introducing a compiler flag that globally disables resilience. IMO having it would be useful even for programmers who want (selective) resilience, by revealing the performance limits they might hope to reach by selective annotation.
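For readers who haven't run into this, here is a minimal sketch of the annotation load being described. `Vector2` and `dot(_:)` are hypothetical names. The point is how many attributes one small public type picks up once its members need to be inlinable across a module boundary.

```swift
// Hypothetical library type illustrating the annotation pile-up.
@frozen  // fixed stored layout; only affects codegen under -enable-library-evolution
public struct Vector2 {
    // @usableFromInline lets the @inlinable members below reference these
    // internal stored properties.
    @usableFromInline internal var x: Double
    @usableFromInline internal var y: Double

    // @inlinable exposes the body to client modules so it can be inlined there.
    @inlinable
    public init(_ x: Double, _ y: Double) {
        self.x = x
        self.y = y
    }

    @inlinable
    public func dot(_ other: Vector2) -> Double {
        return x * other.x + y * other.y
    }
}
```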

Thoughts?

saeta · March 24, 2020, 6:22pm

Relatedly: I wonder if SwiftPM could set the flag automatically when it is compiling dependencies. SwiftPM knows that the source code is available and is built as a unit (it's downloading & compiling it!), so there should not be any resilience boundaries anywhere. Unfortunately, today we have seen cases where splitting code into separate modules still results in significant performance cliffs (>9x performance drops!), even with the -cross-module-optimization flag enabled from master.
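For reference, one way to pass that flag from a package today is through a target's swiftSettings. This is a sketch under assumptions: the package and target names are hypothetical, and the flag is the experimental one mentioned above, so its spelling and behavior may change.

```swift
// swift-tools-version:5.1
// Hypothetical manifest: "FastMath" and "Bench" are illustrative names.
import PackageDescription

let package = Package(
    name: "FastMath",
    targets: [
        .target(
            name: "FastMath",
            swiftSettings: [
                // Forward the experimental flag to swiftc for release builds only.
                .unsafeFlags(["-cross-module-optimization"], .when(configuration: .release))
            ]
        ),
        // A client target that would benefit from FastMath's serialized bodies.
        .target(name: "Bench", dependencies: ["FastMath"]),
    ]
)
```

As far as I know, SwiftPM does not allow unsafeFlags in packages consumed as versioned dependencies. So this only helps in root or local packages, which is part of why having SwiftPM apply the setting itself is appealing.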

Edit: A previous version of this post discussed exclusivity checks; because this was slightly confusing, I've now removed this paragraph.

jrose · March 24, 2020, 6:46pm

This request doesn't make sense. >:-( Swift without -enable-library-evolution is already mostly no-resilience-just-go-fast; the one exception is inlinability, mainly because we don't have a compilation model that allows you to mark everything inlinable. IIRC @eeckstein was working on such a thing at one point. (EDIT: oh, that's probably what -cross-module-optimization is; I haven't been following that work.)

If you are using -enable-library-evolution and are doing hypothetical performance testing, turn it off in one of your build configurations.

dabrahams · March 24, 2020, 6:50pm

@jrose My experience is that, without using -enable-library-evolution, but also without extensive annotation, lots of code is slower than it could be. The essence of the request is to make the annotation unnecessary. What part of that doesn't make sense?

jrose · March 24, 2020, 6:54pm

@frozen has literally no codegen effect without -enable-library-evolution.

@inlinable still does have an effect (a potentially huge effect, indeed!) in allowing inlining-based optimizations across module boundaries. The last time I was involved in this (I've since left Apple), we didn't yet have a model to make that work uniformly across all types of function bodies in a program, but once we do, I think we'd want to seriously consider having it on all the time for optimized builds, or at least for WMO+optimized builds. If we don't decide to have it on all the time, then sure, a flag might be worth it. I just don't think it's productive to have a discussion around a flag.
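Here is a small sketch of the cross-module effect @inlinable has. For a public generic function in a library, the attribute makes the body available to clients, so a call with a concrete element type can be specialized in the client. Without the attribute (and without cross-module optimization), the client can only call the unspecialized generic entry point. `sumAll` is a hypothetical example, not from the thread.

```swift
// Hypothetical library function. @inlinable serializes the body into the
// module so client code calling sumAll([Int]) can be specialized for Int.
@inlinable
public func sumAll<C: Collection>(_ values: C) -> C.Element
    where C.Element: AdditiveArithmetic {
    var total = C.Element.zero
    for value in values {
        // When unspecialized, += is dispatched through the AdditiveArithmetic witness table.
        total += value
    }
    return total
}
```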

There are no other optimization opportunities that I know of that would be turned on with this hypothetical flag and off without it.

dabrahams · March 24, 2020, 7:03pm

This is very interesting. Presumably you would package implicit @usableFromInline along with the implicit @inlinable applied in these scenarios?

I wonder what's wrong with the world that I thought we needed something else. In other words, what can be done to prevent the misperception that annotations other than @inlinable can help performance?
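The following sketch shows why the two attributes tend to travel together. An @inlinable body is copied into client modules, so every internal declaration it references must be marked @usableFromInline. The helper names below are hypothetical.

```swift
// Internal helper referenced from an @inlinable body. Removing
// @usableFromInline here makes the compiler reject scaledUnit(_:by:),
// because the inlined body would reference a symbol clients cannot see.
@usableFromInline
internal func clampToUnit(_ x: Double) -> Double {
    return min(max(x, 0), 1)
}

@inlinable
public func scaledUnit(_ x: Double, by factor: Double) -> Double {
    return clampToUnit(x) * factor
}
```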

Joe_Groff · March 24, 2020, 7:05pm

Cross-module optimization is being actively worked on. There is no reason ultimately to have a separate "go fast" mode, IMO.

anandabits · March 24, 2020, 7:05pm

I’m not sure if it’s relevant to your use case, but this is loosely related to the way I have come to think about “submodules”. I really want the ability to decompose the implementation of a module without exposing the implementation details to users of the module. We can make a normal module and statically link it, but when we do that the statically linked module is still available for import, which is undesirable.

When I think of a “submodule”, static linking without exposing the linked module is roughly what I am thinking of. Because the “submodule” would always be compiled together with the “host” module and not visible outside of the “host”, the optimizer should be free to optimize as if all code was in a single module.

I believe a relevant example is SwiftNIO’s “internal” modules mentioned in their public API guidelines.

dnadoba · March 24, 2020, 7:49pm

I have noticed the same issue. Code that is performance critical must be marked with @inlinable/@usableFromInline to get acceptable performance.

As @jrose said, I think we don’t need a new compiler flag. But if ‘-enable-library-evolution’ is not specified, the compiler should just mark everything with @inlinable/@usableFromInline by default.
Is there anything that prevents the compiler from doing this?

Joe_Groff · March 24, 2020, 8:07pm

This is essentially what cross-module optimization would do.

Joe_Groff · March 24, 2020, 8:55pm

It's also worth noting that inlining all the things is not the be-all and end-all solution to Swift performance, and it comes with its own performance hazards because of code size bloat if we do it carelessly. In addition to cross-module optimization, we should also continue working to improve the performance of unspecialized code, without relying on specialization and inlining.
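One common way to limit the code-size cost of inlining is to keep the inlinable fast path small and force the cold failure path out of line with @inline(never). This is an illustration, not something proposed in the thread, and the function names are hypothetical.

```swift
// Hot path: small enough that inlining it at every call site is cheap.
@inlinable
public func checkedIndex(_ i: Int, count: Int) -> Int {
    guard i >= 0 && i < count else {
        reportOutOfRange(i, count: count)
    }
    return i
}

// Cold path: kept out of line so inlined callers carry only a branch and a call.
@usableFromInline
@inline(never)
internal func reportOutOfRange(_ i: Int, count: Int) -> Never {
    fatalError("index \(i) out of range 0..<\(count)")
}
```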

dabrahams · March 24, 2020, 10:37pm

We have a -cross-module-optimization flag. Are you saying that's equivalent to marking everything @inlinable or @usableFromInline?

I was inclined to agree with @dnadoba until I read this:

If we're concerned about those performance hazards, are we sure we don't want an orthogonal flag that just turns "inline all the things" on or off?

Joe_Groff · March 24, 2020, 11:10pm

Currently, it's more or less equivalent to doing that for generic declarations specifically. If that doesn't lead to code size or compile time performance issues by itself, I think the plan is to eventually tweak the heuristics to automatically treat more functions as inlinable as well. Blindly making everything inlinable is possible, but not necessarily the best thing overall because of the many quadratic analyses in the compiler it may strain. If there are particular classes of function that you find would benefit from inlinability that CMO doesn't make inlinable in its current form, that would be valuable information for tuning its heuristics.

dabrahams · March 25, 2020, 3:52am

It may be so, but when trying out a programming approach, I often need to know whether it can be optimized, and I'm more than willing to wait for the compiler to churn through quadratic analyses to get the answer. If eventually I need to tune which things are actually inlinable, either for code size or compilation time reasons, I can come back and do that once I've proved the design's inherent performance. If I have to examine lots of assembly code and fiddle with annotations before I can even know whether the approach is going to have viable performance, the whole process is much harder.

It seems like the current status quo is:

- There are no flags that do exactly what I want today.
- There's some aspiration to make the right combinations of existing flags do something like what I want…
- …But Joe, at least, is not sure he really wants a combination of existing flags to go all the way to “blind inlinability.”

This tells me that—despite the fact that the request “makes no sense”—I'll want the flag I'm asking for, at least in the short term, and possibly forever. I'm inclined to code it up and submit a PR.

Joe_Groff · March 25, 2020, 3:01pm

If the compiler's default behavior is not giving you close to the "inherent performance" of the code you're writing, then we should treat that first as an issue with the compiler's heuristics for what is or isn't treated as inlinable in cross-module mode. If you can share a small benchmark program that highlights your need, that might be more actionable.

Karl · March 25, 2020, 3:29pm

We do have the @_specialize attribute (docs), which is aimed at performance testing.

So using cross-module-optimisation to make everything inlinable, and the attribute to control the heuristics, should theoretically give you a way to evaluate the upper bound.
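Here is a minimal sketch of that attribute. @_specialize is underscored, which means it is unofficial and subject to change. Each instance asks the compiler to emit an additional copy of the function specialized for the given concrete type. `sumOfSquares` is a hypothetical example.

```swift
// Request pre-specialized copies for Int and Double; other element types
// still use the generic implementation.
@_specialize(where T == Int)
@_specialize(where T == Double)
public func sumOfSquares<T: Numeric>(_ values: [T]) -> T {
    var total: T = 0
    for value in values {
        total += value * value
    }
    return total
}
```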

dabrahams · March 25, 2020, 9:49pm

The problem is that I often can't tell whether I've achieved close to “inherent performance”, and it's a lot of labor to find out. It's usually a far better use of my time for the compiler to spend quadratic time trying different optimizations.

I'm not looking for help optimizing a particular piece of code. My need is a workflow need, not something that can be revealed by looking at an example. I need an easy way to say "optimize the speed of all the things" so I can evaluate whether a coding approach is viable in the short term, without filing optimizer bugs and waiting to see if they get fixed.

I don't think actionable-ness on anyone else's part is really the issue here. I brought the question up to see whether a PR for such an experimental flag is likely to be accepted. For now, my team is building its own toolchain, so it would be easy to add the flag to our own work, but we expect our language changes to land upstream eventually, and at that point we had really hoped to switch to using a stock Swift compiler. If we find the flag useful but have to apply a patch and build a toolchain to get it, that will be sad.

Michael_Gottesman · March 25, 2020, 10:52pm

@dabrahams I think that what @Joe_Groff is saying is that when library evolution is disabled, passing -cross-module-optimization to the compiler will provide you with some of what you want, today (ignoring the heuristic).

The main thing here is that we don't want to just do it blindly; we are worried about other implications, like code size. So the idea has been: use a heuristic so that we can tune the code-size impact against the perf win, and carefully expand the cases that we support over time. So any test cases that you have would be very interesting.

That being said, my memory may be incorrect: +CC @Erik_Eckstein who is in this area.

One last thing. I do think that it would be useful to have a mode that does /all/ the things, and then measure the code-size/perf difference to help inform the decision. But I don't know if that is possible (I don't know how the heuristic is implemented, but Erik will).

dabrahams · March 25, 2020, 11:43pm

@Michael_Gottesman I think you and I have the same view of what @Joe_Groff is saying. Understanding all of what he's saying, I think we should have this flag anyway… which appears to be what you're saying at the end of your post. So… complete agreement?

Michael_Gottesman · March 26, 2020, 12:10am

I just don't think it should be a public option in the driver, though. This shouldn't be a part of the compiler's interface. I would be fine with a frontend option or an -Xllvm option. But it isn't my call. I defer to Erik.