Pitch: Support LTO for Swift

elsh · September 19, 2023, 10:41pm

Introduction

A Swift application often comprises a substantial number of modules, with instances of 200 or more modules not being uncommon, each with varying sizes. New features are continually integrated into Swift applications, resulting in a rapid increase in their code sizes. Among the most prevalent challenges encountered by Swift users is the consistent increase in binary size.

Presently available optimization techniques are limited in scope due to the lack of visibility into the entire group of modules. This is where Link Time Optimization (LTO) can prove to be crucial. In fact, numerous third-party applications have gone to the extent of implementing their own custom LTO solutions just to achieve size optimization across all modules.

We propose to support the integration of LTO for Swift, coupled with size-specific optimization. This approach aims to alleviate the burden on users, sparing them the need to develop their own custom solutions.

Proposed Solution

There are two LTO (Link Time Optimization) choices offered within LLVM: full (monolithic) and thin (incremental) — see this talk on the key differences. In Swift 5.7+, the bundled clang/llvm distribution contains a stable full-LTO capability that can be effectively applied to Swift code. However, it's worth noting that the thin-LTO mode, known for its faster build times and improved memory efficiency, is not stable in the clang/llvm version (15.0) linked to Swift 5.9, thus requires reconsideration when transitioning to a later clang/llvm version. For this reason, we'll focus on full-LTO in this proposal, though we expect it to be straightforward to extend this to thin-LTO in the future.

Our proposal entails introducing the full-LTO option as a new build setting in SwiftPM (and a corresponding flag for other build systems). When enabled, bitcode will be generated on a per-file basis, subsequently linked together to enable global optimizations. This setting proves particularly useful for products that are statically linked.

Furthermore, we suggest making the level of LTO configurable, allowing users to specify whether their optimization preference is size or performance. For instance, users could select LTO level = 1 (perf) | 2 (size) to prioritize either performance or size optimization. It’s important to note that without size optimization, enabling LTO could increase code size, as it enables more inlining across all modules. This outcome might be preferred by users aiming for better performance. Conversely, for those who prioritize the greatest possible size reduction, we could introduce an additional option (via a new build setting) for more rigorous Dead Code Elimination (DCE), such as -experimental-hermetic-seal-at-link (referred to as HSL below). This option is expected to result in significant reductions in code size when building an executable (combined with other statically linked libraries).
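As a rough illustration of the configurable preference described above, the sketch below maps a "perf" or "size" choice onto the compiler flags that appear in this thread. The function name `lto_flags`, the `"perf"`/`"size"` keys, and the `hermetic_seal` parameter are hypothetical and not part of any proposed interface; only the flags themselves (-O, -Osize, -lto=llvm-full, -experimental-hermetic-seal-at-link) come from the discussion.

```python
# Hypothetical mapping from the proposed LTO preference to swiftc flags.
# The function name and the "perf"/"size" keys are illustrative assumptions;
# the flags themselves are the ones discussed in this pitch.

def lto_flags(preference, hermetic_seal=False):
    """Return the flags for a full-LTO release build with the given preference."""
    opt_flag = {"perf": "-O", "size": "-Osize"}
    if preference not in opt_flag:
        raise ValueError(f"unknown LTO preference: {preference!r}")
    flags = [opt_flag[preference], "-lto=llvm-full"]
    if hermetic_seal:
        # HSL asks for more aggressive dead code elimination at link time.
        flags.append("-experimental-hermetic-seal-at-link")
    return flags


print(" ".join(lto_flags("size", hermetic_seal=True)))
```

The point of the sketch is only that the size/performance preference and the HSL option are independent choices layered on top of full LTO.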

Code Size
Below are preliminary results of the code size comparisons with and without full LTO combined with other optimization flags (in Swift 5.9). These were conducted in release mode on select Swift packages, Mockolo and Swift OpenAPI. The percentages are calculated in comparison to the -O/-Osize counterparts of the No LTO figures.

*The following was measured on MacBook Pro 16-in, 2019, 2.4 GHz 8-Core Intel Core i9, 64 GB 2667 MHz DDR4.

Mockolo		__TEXT	TEXT %	__DATA	DATA %	others	dec	hex
No LTO	-O	5439488	--	622592	--	4306321408	4312383488	10109c000
No LTO	-Osize	4702208	--	638976	--	4306452480	4311793664	10100c000
LTO	-O	6914048	+27.11	638976	+2.63	4305207296	4312760320	1010f8000
LTO	-Osize	4653056	-1.05	638976	0	4305338368	4310630400	100ef0000
LTO	-O + HSL	3932160	-27.71	409600	-34.21	4299014144	4303355904	100800000
LTO	-Osize + HSL	3244032	-31.01	409600	-35.90	4299292672	4302946304	10079c000

Swift OpenAPI		__TEXT	TEXT %	__DATA	DATA %	others	dec	hex
No LTO	-O	13434880	--	1392640	--	4314562560	4329390080	1020d4000
No LTO	-Osize	11993088	--	1392640	--	4314923008	4328308736	101fcc000
LTO	-O	15073280	+12.20	1409024	+1.18	4312842240	4329324544	1020c4000
LTO	-Osize	11747328	-2.05	1409024	+1.18	4313186304	4326342656	101dec000
LTO	-O + HSL	11730944	-12.68	1146880	-17.65	4305272832	4318150656	10161c000
LTO	-Osize + HSL	10420224	-13.11	1146880	-17.65	4305977344	4317544448	101588000

Note that the default release mode for SPM packages in Swift 5.9 has cross module optimization enabled (-enable-default-cmo) and builds with -O and No LTO. The above was measured with WMO as well, but the numbers were similar with and without the WMO option.

As seen in the table, LTO amplifies the impact of both performance and size optimizations. The observed code size reductions range up to approximately 30%, marking a significant improvement, especially when utilizing HSL. The combination of HSL and LTO is particularly advantageous for building executables and their associated statically linked libraries. Interestingly, the -Osize option, while delivering a good amount of reduction, achieves a similar level of optimization with or without LTO; this is attributed to the fact that a function merge pass, one of the key passes triggered by -Osize, is currently not enabled to run on a merged module. However, this potential enhancement remains a consideration for future development, as indicated below.
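The percentages in the size tables can be reproduced from the raw __TEXT figures. The following is a minimal sketch using the Mockolo rows; the helper name `percent_change` and the dictionary layout are illustrative choices, and each LTO row is compared against the No LTO row with the same optimization flag, as described above.

```python
def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# __TEXT sizes for Mockolo, copied from the table above.
no_lto = {"-O": 5439488, "-Osize": 4702208}
lto = {
    ("-O", False): 6914048,
    ("-Osize", False): 4653056,
    ("-O", True): 3932160,
    ("-Osize", True): 3244032,
}

for (opt, hsl), size in lto.items():
    label = f"LTO {opt}" + (" + HSL" if hsl else "")
    # Compare against the No LTO build that used the same -O/-Osize flag.
    print(f"{label:<20} {percent_change(no_lto[opt], size):+.2f}%")
```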

Build Time
The build time in release mode, as indicated by the following measurements, does increase with LTO: total build time grows by roughly 21% to 56%, and link time grows by a factor of up to ~68 (for example, from 3.83 to 263.48 for Swift OpenAPI with -O). This is an inherent consequence, as all per-file bitcode must be consolidated into a single module, after which optimization passes are applied sequentially. Between -O and -Osize, the build is faster with the latter option, and HSL reduces link time further in most of the measured configurations since unused code is not linked. The memory cost is also very high; for example, malloc peak for LTO of the clang binary was reported to be about 11GB with llvm 3.9 as referenced here. Because of the long build time and high memory cost, the LTO setting would be opt-in; it can be enabled exclusively for release mode so that it does not affect local development (debug mode).

Mockolo		Compile	Compile %	Link	Link %	Total	Total %
No LTO	-O	365.01	--	2.31	--	367.32	--
No LTO	-Osize	319.64	--	2.42	--	322.06	--
LTO	-O	324.21	-11.18%	129.84	+5521%	454.05	+23.61%
LTO	-Osize	313.82	-1.82%	111.18	+4494%	425.00	+31.96%
LTO	-O + HSL	342.16	-6.26%	102.73	+4347%	444.89	+21.12%
LTO	-Osize + HSL	304.90	-4.61%	91.88	+3697%	396.78	+23.20%

Swift OpenAPI		Compile	Compile %	Link	Link %	Total	Total %
No LTO	-O	448.12	--	3.83	--	451.95	--
No LTO	-Osize	409.76	--	3.94	--	413.70	--
LTO	-O	442.37	-1.28%	263.48	+6779%	705.85	+56.18%
LTO	-Osize	391.28	-4.51%	241.89	+6039%	633.17	+53.05%
LTO	-O + HSL	423.92	-5.40%	261.81	+6736%	685.73	+51.73%
LTO	-Osize + HSL	405.04	-1.15%	242.04	+6043%	647.08	+56.41%
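
The relative cost of linking can be recomputed from the raw Compile and Link columns. The sketch below uses the Swift OpenAPI -O rows; the helper names `growth_factor` and `link_share` are illustrative, and the "share of build" figure is derived here from the table rather than reported in the original measurements.

```python
# Raw times for Swift OpenAPI, copied from the table above as
# (compile, link) pairs; units are as reported in the table.
times = {
    "No LTO -O": (448.12, 3.83),
    "LTO -O": (442.37, 263.48),
}


def growth_factor(baseline, value):
    """Relative increase of value over baseline (1.0 means doubled)."""
    return (value - baseline) / baseline


def link_share(compile_time, link_time):
    """Fraction of the total build spent linking."""
    return link_time / (compile_time + link_time)


base_compile, base_link = times["No LTO -O"]
lto_compile, lto_link = times["LTO -O"]

link_growth = growth_factor(base_link, lto_link)
total_growth = growth_factor(base_compile + base_link, lto_compile + lto_link)

print(f"link time growth:  {link_growth:.2f}x ({link_growth * 100:.0f}%)")
print(f"total time growth: {total_growth * 100:.2f}%")
print(f"link share of build: {link_share(base_compile, base_link):.1%} -> "
      f"{link_share(lto_compile, lto_link):.1%}")
```

Linking goes from a negligible part of the build to more than a third of it, which is why the total grows by only about half even though link time itself grows many times over.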

Performance
LTO allows inlining across all modules, enabling further optimizations. With -O (for speed), the runtime performance is expected to be better than without LTO; the llvm talk mentioned earlier reports a performance boost of about 10%. Even with -Osize, LLVM inlining is performed across all LLVM modules combined together in the LTO mode, so the performance is expected to be better than (or at least no worse than) the option without LTO. The HSL option, which removes unused code, should help improve performance further.

Even the third parties that have adopted LTO with rigorous custom size optimizations (aggressive outlining) have reported that they have not observed significant performance regressions, as referenced in this and this talk.

Debuggability
With LTO, all the bitcode files are merged into one for optimizations, but the source file and the line numbers are retained within the debug info. Full stack traces are also available to allow tracking the origin of bugs. Additionally, if required, further metadata can be incorporated into the debug info.

Future Directions

LTO creates opportunities for more advanced optimizations, as demonstrated by third-party users who have adopted more aggressive machine code outlining techniques. A function merge pass on merged bitcode files would prove to be very useful for size optimization, so we would seek to incorporate that option. Additionally, we are open to introducing further alternatives if they are deemed necessary.
A mergeable library mechanism was introduced in Swift 5.9, where a dylib is treated as a static lib in release mode. This could enable dylibs to be part of LTO.
As we transition to a newer clang/llvm version, we would consider introducing the availability of Thin LTO as an alternative.

allevato · September 19, 2023, 11:43pm

Runtime performance is an important factor to consider, but what do the compile and link times look like under these tests? Can you include those metrics as well? My limited experimentation with some of the LTO features that are implemented today on some large apps had LTO link times that were so long (at least 30 minutes, before I gave up) that it was unusable.

LTO in theory is really interesting and powerful and I'm interested to see this, but I want to make sure that we're not using just that as the canonical solution for other very important things like symbol visibility (since this thread was linked from that one). LTO might be able to do a better job there, but we should also be able to improve the current state of the world considerably without its added complexity.

sspringer · September 20, 2023, 12:31am

It is included in the corresponding table above, isn’t it? I would say 50% more build time for 30% less size in the example is not too bad (considering it would only be used for release builds). The effect might be much greater when statically linking prebuilt libraries, e.g. via --static-swift-stdlib; wouldn’t this be the main use case? I think doing something in this direction would be really important.

jrose · September 20, 2023, 3:28am

I’m also curious about peak memory usage whenever people talk about Full LTO. It’s potentially the difference between “release builds on my dev machine” and “release builds can only be done from the dedicated, well-provisioned build machine”. That wouldn’t be a reason not to add it to the compiler, but it’s part of the information you need to provide to users.

jrose · September 20, 2023, 3:30am

I also strongly object to referring to these as “levels” rather than, say, “modes”, since we do not want to promise that one takes less time or memory than the other. They’re just using the same basic technique aiming for different goals.

hassila · September 20, 2023, 5:12am

Happy to see this! One question about HSL: if there is any documentation on how it works, it’d be great to link to it. Just trying to understand the implications; our concern is:

We are specifically interested in how aggressive DCE is for e.g. public symbols?

We load plugins dynamically. We would be happy to use HSL/DCE if we know our hosting application can keep its public interface and its internal dependencies; or would those risk being stripped out if we have no references to them in the hosting app? We link everything else statically except for the plugins and a single dynamic library that’s in evolution mode.

allevato · September 20, 2023, 11:43am

Sorry, my initial post was unclear: I'd like to see the separate compile time and link time, not the end-to-end build time. I suspect that most of the increase to the build time was in the linkage due to the extra work, but I'd like to see that explicitly. 50% more E2E build time for a couple libraries is already a pretty large bump, and monolithic LTO doesn't scale linearly; the link time increase for a large app is going to be much more than that 50%. So it would be helpful to have a better upfront understanding of what the real-world impact would be in Swift.

Thin LTO is meant to be better in this regard, as well as addressing other full LTO problems. So if we're building new support for something like that, then I'd rather see the focus placed there because full LTO just isn't usable at a large scale.

wadetregaskis · September 20, 2023, 3:36pm

I find this very intriguing. Thank you @elsh for exploring this.

FWIW, regarding build-time vs runtime trade-offs, I'm in favour of an -Oultra option that pulls out all the stops and lets build times fall where they may. The most pertinent benefit in this context being that LTO could be implemented under that option without controversy or delay, making it available to the world promptly while the merits of including it in other optimisation modes is debated.

It'd be very useful for some applications (e.g. some server-side applications where your release build happens asynchronously in some CI/CD system anyway, and peak runtime performance may trump a few extra hours of build server time), even if it's not appropriate for most applications (e.g. most Mac & iOS apps don't really benefit from high levels of optimisation because they're dominated by user input and/or network latency at runtime anyway).

That said, for my dinky little Mac & iOS apps, I'd still be inclined to use an -Oultra option because why not - I don't have hard deadlines on builds; my development pipeline can have many stages and be superscalar.

Assumed in this, though, is that the runtime performance would be better. It'd be great to see that dimension quantified in addition to build time & binary size.

Adrian_Prantl · September 21, 2023, 4:43pm

Generally speaking, LLDB's existing support for LTO objects should also work with Swift, and there is no loss in debug info quality expected that would be due to LTO (other than dealing with even more optimized code).

Keith · September 21, 2023, 9:30pm

I'm super excited to see this! We're very interested in being able to ship our large iOS app with LTO at some point. Is the core goal of this pitch to improve the UX for LTO features that already exist, or is there also work planned on lower level parts of Swift's LTO support? One thing that we saw recently for example is that -experimental-hermetic-seal-at-link hasn't been well tested with ObjC interop in general (source)

As a performance data point, today our iOS application takes ~1.5 hours to link when testing with -lto=llvm-full on an M1 Max MBP with 64 GB of RAM. The resulting binary crashes at runtime though, so I'm curious how many show-stopping bugs are out there.

Also worth noting that as a testbed bazel supports building with today's LTO support by emitting bitcode already.

John_McCall · September 21, 2023, 10:57pm

Is this purely proposing low-level (LLVM) LTO, or is there a capability for higher-level (SIL) optimization here?

elsh · September 22, 2023, 6:28pm

The post is updated with the breakdown of the build time; the link time on a sample package increases by a factor of about 68 and is likely to grow much more on a larger project.

Re: symbol visibility, this post was referenced there because there was a question on LTO; it is not meant as the canonical solution for that.

According to the llvm lto talk, the malloc peak memory of the clang binary was reported to be about 11GB (this was with llvm 3.9 though).

Dylibs are not built as part of LTO as they have to be shipped separately. You can build a dylib with a new flag -merge-library as of swift 5.9 though, which will allow a linker to treat the dylib as a static lib in release mode; we might have to modify a few things to enable LTO on such dylibs though.

Full LTO has already been adopted by multiple third parties for the sole purpose of binary size optimization. Due to the long build time and high memory cost, it’s expected to run on a dedicated machine exclusively for the release mode. The goal of this proposal is to simplify adoption by offering full LTO (as a start) as an integrated option (and then later thin LTO with a more recent version of clang/llvm). Typically, adopting LTO requires manual adjustments or customization, but this proposal aims to streamline the process.

The goal is to both make LTO more accessible and configurable and improve optimizations used by LTO such as HSL.

Purely LTO for now.

elsh · September 22, 2023, 6:37pm

That's not surprising; this talk reports about 200% increase of build time. Hopefully we can transition to thin-LTO with a more recent version of clang/llvm. Were the crashes seen with HSL or with other optimizations such as outlining? Also did the crash reports have enough info for debugging? Yeah we will have to see how many show-stopping bugs we'll need to fix.

elsh · September 22, 2023, 6:40pm

The referenced talk mentions perf gain to be about 10%; it's not on a Swift app though but still gives a useful insight.

hassila · September 22, 2023, 6:41pm

Let me try to clarify the question - I was perhaps a bit unclear;

We have a large hosting application with hundreds of modules which is what we want to build with LTO/DCE - our concern is as follows:

This application is implementing an API interface which is made available to our customers as an API package - our customers are building the plugins against this API completely isolated from us.

Our hosting application is then loading the customers plugins during runtime to provide a rich runtime environment for them.

As several parts of this hosting application’s implementation of the API aren’t used by the hosting application itself, but only by the customer plugins, we are concerned that they might be viewed as dead code by the DCE pass.

Thus, with that background, we’d like to ask what the heuristics are for deciding what is viewed as “dead” code.

Do we need to add dummy usage of all API to the hosting application to avoid DCE of those parts needed by dynamically loaded plugins, or are we overthinking it?

elsh · September 22, 2023, 6:45pm

No you don't need to add dummy usage of all APIs; they will not be removed in this scenario.

elsh · September 22, 2023, 6:47pm

The "mode" is already reserved to mean full or thin lto, thus a different term "level" but I agree it may not be the best; perhaps "config" is a better term?

allevato · September 22, 2023, 6:48pm

Great! That addresses most of my concerns that we might be putting too many eggs in one basket here.

I still fear that Full LTO won't work for most of our use cases, but I don't want to be a wet blanket here—this is really valuable work that sets the stage for future improvements, like Thin LTO that you've already mentioned. Thanks for doing this!

Keith · September 22, 2023, 6:51pm

I think I could probably isolate them with some work, but in the past it didn't seem like any work was going into LTO so I didn't work on narrowing down any. Some have been fixed organically though [SR-15964] ArgumentParser LTO crashes linkers · Issue #58225 · apple/swift · GitHub

sspringer · September 28, 2023, 6:30pm

Question: Is there a combination of settings which already “should” / “should work most times” on Linux and already helps significantly with the size of the executable when using static linking? (It is a little difficult to extract this from the above discussion.)

I tried

swift build -c release --static-swift-stdlib -Xswiftc -lto=llvm-full

and also … -lto=llvm-thin and got link errors in both cases (“no such file … xxx.o”).