A Swift application often comprises a large number of modules, sometimes 200 or more, each of varying size. As new features are continually added, code size grows rapidly, and the steady increase in binary size is among the most prevalent challenges reported by Swift users.
Presently available optimization techniques are limited in scope because no single compilation has visibility into the entire set of modules. This is where Link Time Optimization (LTO) can prove crucial; indeed, numerous third-party applications have gone so far as to implement their own custom LTO solutions solely to achieve size optimization across all modules.
We propose to support the integration of LTO for Swift, coupled with size-specific optimization. This approach aims to alleviate the burden on users, sparing them the need to develop their own custom solutions.
There are two LTO modes offered within LLVM: full (monolithic) and thin (incremental) — see this talk on the key differences. In Swift 5.7+, the bundled clang/llvm distribution contains a stable full-LTO capability that can be effectively applied to Swift code. However, thin LTO, known for its faster build times and lower memory use, is not stable in the clang/llvm version (15.0) linked to Swift 5.9, and will need to be revisited when transitioning to a later clang/llvm version. For this reason, this proposal focuses on full LTO, though we expect it to be straightforward to extend to thin LTO in the future.
Our proposal entails introducing the full-LTO option as a new build setting in SwiftPM (and a corresponding flag for other build systems). When enabled, bitcode will be generated on a per-file basis, subsequently linked together to enable global optimizations. This setting proves particularly useful for products that are statically linked.
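As a sketch of how the proposed setting might be approximated today, the existing swiftc flag `-lto=llvm-full` can be passed through SwiftPM's `unsafeFlags`. The package and target names below are hypothetical, and the dedicated build setting itself does not exist yet:

```swift
// swift-tools-version:5.9
// Package.swift — a sketch; the proposed first-class LTO build setting
// is approximated here with the existing -lto=llvm-full swiftc flag.
import PackageDescription

let package = Package(
    name: "MyTool", // hypothetical package
    targets: [
        .executableTarget(
            name: "MyTool",
            swiftSettings: [
                // Emit LLVM bitcode per file so the linker can run
                // whole-program optimization; release builds only.
                .unsafeFlags(["-lto=llvm-full"],
                             .when(configuration: .release))
            ]
        )
    ]
)
```

Gating the flag on the release configuration keeps debug builds fast, which matches the opt-in behavior proposed below for local development.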
Furthermore, we suggest making the level of LTO configurable, allowing users to specify whether optimization should prioritize size or performance. For instance, users could select LTO level = 1 (perf) | 2 (size). It is important to note that without size optimization, enabling LTO can increase code size, since it enables more inlining across all modules; that outcome may be preferred by users aiming for higher performance. Conversely, for those who prioritize the utmost size reduction, we could introduce an additional option (via a new build setting) for more rigorous Dead Code Elimination (DCE), such as hermetic seal at link (referred to as HSL below). This option is expected to result in significant reductions in code size when building an executable (combined with other statically linked libraries).
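To illustrate why HSL enables stronger DCE, consider a statically linked library with an unused public entry point. The function names below are hypothetical, and the invocation shown in the comment uses the existing experimental swiftc spelling of this behavior, which may change:

```swift
// Library.swift — part of a statically linked product.

public func renderReport() -> String {
    "report" // reachable from the executable's main path
}

// Never called anywhere in the final executable. Without HSL the
// symbol must be kept: public symbols are assumed reachable from
// outside the link unit. With full LTO plus hermetic sealing, public
// symbols can be internalized and then dead-stripped:
//
//   swiftc -O -lto=llvm-full -experimental-hermetic-seal-at-link ...
//
public func legacyExporter() -> String {
    "legacy"
}
```

The key point is that HSL tells the toolchain the link unit is closed (no external clients), which converts "public and therefore live" symbols into candidates for elimination.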
Below are preliminary results of the code size comparisons with and without full LTO combined with other optimization flags (in Swift 5.9). These were conducted in release mode on select Swift packages, Mockolo and Swift OpenAPI. The percentages are calculated in comparison to the -O/-Osize counterparts of the No LTO figures.
*The following was measured on MacBook Pro 16-in, 2019, 2.4 GHz 8-Core Intel Core i9, 64 GB 2667 MHz DDR4.
| Mockolo | | __TEXT | TEXT % | __DATA | DATA % | __OBJC | others | dec | hex |
|---|---|---|---|---|---|---|---|---|---|
| LTO | -O + HSL | 3932160 | -27.71 | 409600 | -34.21 | 0 | 4299014144 | 4303355904 | 100800000 |
| LTO | -Osize + HSL | 3244032 | -31.01 | 409600 | -35.90 | 0 | 4299292672 | 4302946304 | 10079c000 |
| Swift OpenAPI | | __TEXT | TEXT % | __DATA | DATA % | __OBJC | others | dec | hex |
|---|---|---|---|---|---|---|---|---|---|
| LTO | -O + HSL | 11730944 | -12.68 | 1146880 | -17.65 | 0 | 4305272832 | 4318150656 | 10161c000 |
| LTO | -Osize + HSL | 10420224 | -13.11 | 1146880 | -17.65 | 0 | 4305977344 | 4317544448 | 101588000 |
Note that the default release mode for SwiftPM packages in Swift 5.9 has cross-module optimization enabled (-enable-default-cmo) and builds with -O and no LTO. The above was also measured with WMO, but the numbers were similar with and without the WMO option.
As seen in the table, LTO amplifies the impact of both performance and size optimizations. The observed code size reductions range up to approximately 30%, marking a significant improvement, especially when utilizing HSL. The combination of HSL and LTO is particularly advantageous for building executables and their associated statically linked libraries. Interestingly, the -Osize option, while delivering a good amount of reduction, achieves a similar level of optimization with or without LTO; this is attributed to the fact that a function merge pass, one of the key passes triggered by -Osize, is currently not enabled to run on a merged module. However, this potential enhancement remains a consideration for future development, as indicated below.
The build time in release mode, as indicated by the following measurements, does increase with LTO; link time goes up by up to ~68%. This is an inherent consequence of consolidating all per-file bitcode into a single unit, after which optimization passes are applied sequentially. Between -O and -Osize, the build is somewhat faster with the latter, and improves further with HSL since unused code is not linked. The memory cost is also very high; for example, the peak malloc usage when applying LTO to the clang binary was reported to be about 11 GB with llvm 3.9, as referenced here. Because of the long build time and high memory cost, the LTO setting would be opt-in; it could be activated exclusively for release mode, ensuring that it does not impact local development (debug mode).
| Mockolo | | Compile (s) | % | Link (s) | % | Total (s) | % |
|---|---|---|---|---|---|---|---|
| LTO | -O + HSL | 342.16 | -6.26% | 102.73 | 43.47% | 444.89 | 21.12% |
| LTO | -Osize + HSL | 304.9 | -4.61% | 91.88 | 36.97% | 396.78 | 23.20% |

| Swift OpenAPI | | Compile (s) | % | Link (s) | % | Total (s) | % |
|---|---|---|---|---|---|---|---|
| LTO | -O + HSL | 423.92 | -5.40% | 261.81 | 67.36% | 685.73 | 51.73% |
| LTO | -Osize + HSL | 405.04 | -1.15% | 242.04 | 60.43% | 647.08 | 56.41% |
LTO allows inlining across all modules, enabling further optimizations. With -O (for speed), runtime performance is expected to be better than without LTO; the llvm talk mentioned earlier reports about a 10% boost in performance. Even with -Osize, LLVM inlining is performed across all LLVM modules combined in LTO mode, so performance is expected to be better than (or at least no worse than) without LTO. The HSL option, which removes unused code, should help improve performance further.
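The cross-module inlining described above can be sketched as follows; the module and function names are hypothetical:

```swift
// Module Geometry (separately compiled, no @inlinable annotations):
public struct Point {
    public var x, y: Double
    public init(x: Double, y: Double) { self.x = x; self.y = y }
}

public func dot(_ a: Point, _ b: Point) -> Double {
    a.x * b.x + a.y * b.y
}

// Module App:
import Geometry

// Without LTO, this is an opaque cross-module call unless dot(_:_:)
// is marked @inlinable. With full LTO, LLVM sees the merged IR of
// both modules and can inline the call and fold the arithmetic.
let d = dot(Point(x: 1, y: 2), Point(x: 3, y: 4))
```

Because the decision is made on merged LLVM IR at link time, neither module's author needs to anticipate the optimization in the source.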
Even the third parties that have adopted LTO with rigorous custom size optimizations (aggressive outlining) have reported that they have not observed significant performance regressions, as referenced in this and this talk.
With LTO, all the bitcode files are merged into one for optimizations, but the source file and the line numbers are retained within the debug info. Full stack traces are also available to allow tracking the origin of bugs. Additionally, if required, further metadata can be incorporated into the debug info.
- LTO creates opportunities for more advanced optimizations, as demonstrated by third-party users who have adopted more aggressive machine code outlining techniques. A function merge pass on merged bitcode files would prove very useful for size optimization, so we would seek to incorporate that option. Additionally, we are open to introducing further alternatives if they are deemed necessary.
- A mergeable library mechanism was introduced in Swift 5.9, where a dylib is treated as a static lib in release mode. This could enable dylibs to be part of LTO.
- As we transition to a newer clang/llvm version, we would consider making thin LTO available as an alternative.