Brave new world: best practices for cross-module optimization

for as long as i can remember, exposing generics in public APIs was something to be avoided if possible, and it was necessary to devote an enormous amount of thought and planning to architecting libraries so they would not rely on copious amounts of @inlinable.

which is why i was shocked to discover that, since 5.8, everything has been made inlinable by default (when using SPM).

i now have three questions:

  1. is cross-module optimization even a net win? back when it was gated by a feature flag, the near-uniform recommendation was not to use it, because it significantly worsened performance.

    what was the rationale for making this the new default, when its performance impact was understood to be unclear at best and negative at worst? has anything changed since it was gated by the feature flag?

  2. has anyone studied the impact on compilation times? in the event that it is significant, are there ways to limit cross-module inlining to a select group of modules within a package? can cross-module inlining take place across package boundaries?

  3. assuming neither of the above two issues is relevant anymore, are there any downsides to vending generics as part of public API, in modules intended to be built from source? is there ever a reason to manually wrap/specialize to “hide” generics in the post-5.7 world?

10 Likes

Just to make one short comment: it’s unclear to me whether it’s really “everything” that is cross-module-optimized by default in 5.8. Looking at some of the other commits, it rather seemed that a less aggressive mode was enabled by default. The test where I previously saw the regression in performance was with the more aggressive mode (which AFAIU can still be enabled with the feature flag) - can’t say I know the exact difference.

I’d be super happy to hear more from someone in a position to elucidate though.

3 Likes

In general, CMO is a significant performance win. But (as with most optimizations) there can be corner cases where you see a degradation.
The critical problem with CMO is code size. Therefore the CMO which is enabled by default is much more conservative than the "aggressive" CMO, which must be explicitly enabled with -cross-module-optimization.
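For anyone who wants to compare the two modes from the command line, here is a sketch of the SwiftPM invocations (the aggressive flag spelling is the one quoted above; `-Xswiftc` is SwiftPM's standard way of passing flags through to the compiler):

```shell
# Release build with SwiftPM; the conservative, default CMO applies.
swift build -c release

# Opt into the aggressive mode by passing the flag through to swiftc.
swift build -c release -Xswiftc -cross-module-optimization
```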

impact on compilation times

We didn't see any significant impact on compilation times. Especially with the default CMO which has only a relatively small impact on size/complexity in the optimization pipeline.

ways to limit cross-module inlining to a select group of modules within a package

The compiler option -disable-cmo disables the default CMO.

can cross-module inlining take place across package boundaries

yes

is there ever a reason to manually wrap/specialize to “hide” generics in the post-5.7 world?

It really depends. CMO makes it less likely that generic APIs will have a negative performance impact. But it still can happen (CMO is an optimization based on heuristics).

6 Likes

Thanks @Erik_Eckstein for the details!

For the record I opened Update blackHole and identity to use @_optimize(none) instead of @inline(never) by hassila · Pull Request #17 · apple/swift-collections-benchmark · GitHub for swift-collections-benchmarks - perhaps there are other places with blackHoles in the Swift universe too, but that's where I found it originally.

thanks for the detailed reply!

i’m not sure i understand the tradeoffs here correctly. inlinability shouldn’t impact code size, only inlining should. based on my (very limited!) understanding of the optimizer, i would expect there to be a lot of optimization passes (e.g. ARC optimizations) that the compiler should be able to apply by analyzing inlinable code without actually inlining it.

can you give a brief overview of what those heuristics are? is there a good workflow for inspecting if CMO has taken place?

inlinability shouldn’t impact code size, only inlining should

Not exactly. First, more functions being available for inlining will also result in more inlining (the inliner selects functions based on a heuristic, too). Second, more function specialization is done. This can have a negative or positive effect on code size.
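As a concrete sketch of the second point (the module layout and names here are hypothetical, not from the thread): an @inlinable generic function's body is serialized into the swiftmodule, so a client module can emit a specialized copy per concrete argument type, and each specialized copy adds code.

```swift
// In a library module (hypothetical): @inlinable serializes the body into
// the swiftmodule, making it available for cross-module specialization
// and inlining.
@inlinable
public func sumOfSquares<T: Numeric>(_ values: [T]) -> T {
    var total = T.zero
    for v in values {
        total += v * v
    }
    return total
}

// In a client module: the optimizer can emit a concrete copy per argument
// type instead of calling the unspecialized generic entry point.
let intResult = sumOfSquares([1, 2, 3])       // specialized for Int
let doubleResult = sumOfSquares([1.5, 2.5])   // specialized for Double
```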

can you give a brief overview of what those heuristics are?

It's mainly based on the function size.

is there a good workflow for inspecting if CMO has taken place?

It's possible to look at the generated swiftmodule file with swiftc -sil-opt and see which functions have a SIL function body. But that's more a tool for compiler engineers.

3 Likes

let’s say, as a thought experiment, i took a codebase with fifty modules, and then refactored it so that all of the code lived in one oversized module and every declaration had internal or lower access control. wouldn’t that also result in more inlining?

it seems to me that there would be two possibilities:

  1. the optimizer currently performs too much inlining, and this is unrelated to CMO, because all CMO does is make external code susceptible to the same overinlining problem that internal code already suffers from.

  2. the optimizer currently strikes the right balance for intra-module inlining, but for some reason is more aggressive when inlining things that originate from outside the module than it otherwise would be.

which is it?

Inlining decisions are probably the most complicated thing in the optimizer.
The problem is that inlining can have a negative or positive effect on code size.
The reason to limit making functions inlinable with CMO is mainly to keep additional (code size) churn to a minimum compared to not using CMO at all.

Larger binary size isn't necessarily bad - what matters is the working set size of instructions for any performance-sensitive code. I've seen real-world binaries that were approaching a gigabyte in TEXT size (C++ templates, yay :roll_eyes:) yet were super fast, because any given core tended to stay within relatively tiny working subsets of the code.

PGO (Profile-Guided Optimisation) is really helpful in this regard for helping the compiler know which parts of the code benefit from being small [enough to fit into L1 icache] (among other things, like how symbols should be arranged to minimise icache fragmentation and prefetch misses).

In my experience, most code (by machine instruction count) isn't sensitive to size and actually does benefit from aggressive inlining (for reasons less clear to me - perhaps many compounding consequences such as better elimination of redundant or unreachable code).

I mention this because WMO comes up relatively often but PGO rarely gets mentioned, and I suspect they really should go hand-in-hand (for non-trivial codebases). It looks like PGO is supported in Swift projects (in Xcode: Product > Perform Action > Generate Optimization Profile…) though I haven't tried it. It used to work quite well for Clang-based projects, at least.

2 Likes

Code size still matters on mobile devices, both for the actual space on disk and for the bandwidth it takes to download. For desktop platforms it’s not as bad, but still somewhat a concern. I agree that for servers it basically doesn’t matter these days.

I wish iOS app developers had that attitude. :stuck_out_tongue_closed_eyes:

3 Likes

It’s not the developers that are the problem there.

3 Likes

assuming neither of the above two issues is relevant anymore, are there any downsides to vending generics as part of public API, in modules intended to be built from source? is there ever a reason to manually wrap/specialize to “hide” generics in the post-5.7 world?

I've been experimenting with a library that uses heavy generics recently. The class signature looks like this:

public final class SimulationKD<NodeID, V>
where NodeID: Hashable, V: SIMD, V.Scalar: SimulatableFloatingPoint {
}

and in one of my test cases the generic version with V == simd_double2 takes ~0.59s. Turning on cross-module-optimization brings it down to ~0.17s.

By manually inlining with V = simd_double2, it takes ~0.05s with CMO disabled, and ~0.04s with CMO enabled.

So I guess at this time (Swift 5.9), generics are still something to avoid in public API.
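"Manually inlining" here presumably means writing a concrete-typed copy of the generic code by hand. A hypothetical sketch of the two variants, using SIMD2<Double> (which simd_double2 aliases on Apple platforms):

```swift
// Generic version (hypothetical): across a module boundary, calls may go
// through unspecialized generic entry points unless the optimizer
// manages to specialize them.
public func centroid<V: SIMD>(_ points: [V]) -> V where V.Scalar == Double {
    var sum = V()
    for p in points { sum += p }
    return sum / Double(points.count)
}

// Hand-specialized copy: every operation is fully concrete, so there is
// no generic machinery left for the optimizer to remove.
public func centroid2(_ points: [SIMD2<Double>]) -> SIMD2<Double> {
    var sum = SIMD2<Double>()
    for p in points { sum += p }
    return sum / Double(points.count)
}
```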

2 Likes

just curious, what made you forgo generics entirely instead of falling back on (the admittedly suboptimal) @inlinable?

I'm new to Swift, so I don't know much about how compilation works. I started this library as non-generic and then refactored it to be generic, with a huge performance downgrade.

I did some experiments, and from my observation @inlinable doesn't work very well, but I'm not sure if I'm using it correctly. By replacing V.Scalar: SimulatableFloatingPoint with V.Scalar == Double globally (still inside the where clause), I got about 20% of the speed back. Then I tried manually inlining, and that got the speed back.

1 Like

@inlinable is hard, i used to ship very large modules because i did not understand how @inlinable works. for such a fundamental building block of the language, resources for learning how to use it are dreadfully sparse.

one reason @inlinable might not be working for you is that you haven’t @inlinabled the entire call stack. if you only @inlinable the outer generic call, you will still have generic abstraction overhead in all the places where you call generic functions inside the outer function.
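to illustrate what “the entire call stack” means, a minimal sketch with hypothetical names:

```swift
// Public entry point: @inlinable serializes the body for clients.
@inlinable
public func squaredLength<V: SIMD>(_ v: V) -> V.Scalar
where V.Scalar: FloatingPoint {
    dot(v, v)
}

// The generic helper it calls must ALSO be @inlinable; otherwise clients
// would still call an unspecialized generic entry point for `dot`, and
// the abstraction overhead would remain.
@inlinable
internal func dot<V: SIMD>(_ a: V, _ b: V) -> V.Scalar
where V.Scalar: FloatingPoint {
    var result = V.Scalar.zero
    for i in 0..<a.scalarCount {
        result += a[i] * b[i]
    }
    return result
}
```

note that @inlinable on the internal helper implicitly makes it @usableFromInline, which is what allows the serialized body of squaredLength to reference it.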

3 Likes

Guess I should reinvestigate @inlinable now😆. Thank you!

2 Likes

Right, to add some colour: you need to mark @inlinable every generic public function, as well as any function that the public API calls. This includes anything that's called transitively. You can (but don't have to) stop adding @inlinable once you hit a function that isn't generic.

There might be places where you only want specialisation (but not actual inlining). In those cases use @inlinable @inline(never) func iWantYouToBeSpecialisedButNotInlined<Foo: Bar>(_ foo: Foo). And yes, that's @inlinable @inline(never) which essentially means "specialisable" :slight_smile:.
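A hypothetical, compilable sketch of that pattern:

```swift
// @inlinable serializes the body so clients can specialize it for each
// concrete Collection type; @inline(never) keeps the specialized copy as
// a real call rather than inlining it into every call site.
@inlinable
@inline(never)
public func describeCount<C: Collection>(_ items: C) -> String {
    "count: \(items.count)"
}

let description = describeCount([10, 20, 30])
```

Clients get a concrete specialization per collection type, but the call itself survives as a call, which keeps code size in check.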

6 Likes