[GSoC] LTO support progress report

I'm working on the GSoC project LTO support for Swift with @compnerd.

Here is my implementation plan for this project.

I'll post a weekly progress report every Monday, as @augusto2112 does in the GSOC Linux debugger support progress report thread.

So far, I've worked on the following:

  • Addressed code review comments from @compnerd on apple/swift#31146, and spent a lot of time retrying CI.
  • Read the LLD and Swift codebases to get a deeper understanding of each architecture.
  • Started implementing the libswiftLTO pipeline (WIP).

I'm spending a lot of time investigating which information in ASTContext is required after SILGen.

Since ASTContext is used throughout the compiler process, it holds a lot of information. But the LTO plugin runs in the linker process, so it can't inherit the ASTContext state from the compiler process and needs to create and set up an ASTContext again.

For this reason, I'm investigating which information in ASTContext should be serialized in SIB.

17 Likes

Thanks for the update and sharing the detailed plan @kateinoigakukun. I can't wait to have proper LTO working for all platforms that Swift supports and producing smaller Swift binaries! (Especially as this is critical for the WebAssembly target).

After #31146 is merged, as far as I understand one could start using the -lto=llvm flag with the master nightlies. Or would it need lld built in some special way? Also, does it already yield some reductions in size of produced binaries, or do we need to wait for language-specific LTO to kick in to start seeing any noticeable reduction?

1 Like

@Max_Desiatov wasm-ld already supports bitcode LTO, so we can use it without any special work once the PR is merged. I haven't tried it for wasm yet, but I think LLVM-level LTO can reduce the size of the produced binaries more than wasm-opt can.

2 Likes

You may also want to investigate the existing -function-sections flag which will allow you to drop unused functions as on MachO and PE/COFF targets.

2 Likes

Hey everyone!

Last week, I opened a PR to bootstrap the LTO pipeline on the compiler side. It contains only the basics for transforming SIL into LLVM IR and does not include any optimization at this time.

https://github.com/apple/swift/pull/32233

(And I'm sorry that I couldn't do a lot of things for GSoC because of too many assignments from my university.)

This week, I'll address the review comments on the PR and fix some remaining issues.

One of the issues is that my current implementation depends on the order of the input modules.

In the usual compiler process, dependent modules are loaded on demand, after the main module that uses them.

The current implementation, on the other hand, loads each input module immediately, so it depends on the order of the input modules and fails if the dependent libraries are not loaded before the user module.
So I need to implement a lazy module loading mechanism for serialized in-memory modules.
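
To illustrate the idea of order-independent loading, here is a minimal Python sketch. It is not the Swift compiler's implementation: `LazyModuleLoader`, the dictionary layout, and the module names are all hypothetical. Inputs are only registered up front, and each module is materialized when it is first requested, pulling in its imports on demand.

```python
# Hypothetical sketch: lazy, on-demand loading of in-memory serialized modules.
# The class name and the data layout are illustrative assumptions only.

class LazyModuleLoader:
    def __init__(self, serialized):
        # serialized: module name -> {"imports": [...]}.
        # Nothing is deserialized at registration time.
        self.serialized = serialized
        self.loaded = {}       # module name -> loaded module
        self.load_order = []   # order in which modules finished loading

    def load(self, name):
        if name in self.loaded:
            return self.loaded[name]
        if name not in self.serialized:
            raise KeyError(f"module {name!r} was not passed as an input")
        blob = self.serialized[name]
        # Register the module before visiting imports so cycles terminate.
        module = {"name": name, "imports": []}
        self.loaded[name] = module
        for dep in blob["imports"]:
            module["imports"].append(self.load(dep))
        self.load_order.append(name)
        return module


# The user module is listed first, before the libraries it depends on.
inputs = {
    "App": {"imports": ["Lib"]},
    "Lib": {"imports": ["Core"]},
    "Core": {"imports": []},
}
loader = LazyModuleLoader(inputs)
for name in inputs:
    loader.load(name)
print(loader.load_order)  # dependencies finish loading before their users
```

With this shape, the order of the inputs no longer matters, because a dependency is resolved when its user asks for it rather than when it appears on the command line.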

4 Likes

Hello :wave:

Last week, I made some PRs to extend the SIL pass manager so that it can optimize multiple modules.

But due to a regression, #32237 was reverted.

This week, I'll focus on getting these PRs merged.

3 Likes

Hello everyone.

Last week, I split the LLVM LTO changes into several PRs to get them back into master.

But they are still in review. I hope we'll be able to merge them this week.

1 Like

Hello.

Since last week, I've been prototyping Michael's architecture for further discussion. The architecture is similar to LLVM thin LTO.

#32462 was merged into master and #32429 is almost ready to merge.

I couldn't spend much time on this last week since I wasn't feeling well, but I'm better now.

This week, I'll continue to prototype the architecture and work on merging the remaining LLVM LTO PRs.

5 Likes

Hello. Last week, I published a prototype implementation of the architecture I mentioned in last week's report.

Now, I'm refactoring the prototype implementation and preparing to break it down into several PRs while waiting for feedback from mentors.

This week, I have a lot of university assignments, so my working time may be a little shorter than usual.

3 Likes

Hello, everyone.
Last week I got some feedback on the prototype implementation, and #32429 was merged into the master branch.
In addition, the driver part of the LLVM LTO PR is now under review.

This week, I'll work on getting it merged and run a binary size benchmark on the prototype implementation to check whether it gives better results than the existing optimizations.

7 Likes

Last week, I spent a lot of time trying to apply LTO to the stdlib for benchmarking, but I ran into some difficulties.

The main problem is that compiling SIB to an object file is not well supported. Compiling the stdlib SIB to an object file fails even without LTO.

In addition, there were many false-positive eliminations that caused assertion failures, so I fixed them.

I'll have less time for this project than usual because I have university final exams for about two weeks starting this week. After that, however, summer vacation starts, which will let me spend more time on this project.

2 Likes

Last weekend, I succeeded in optimizing some popular Swift libraries as a benchmark, using the prototype optimizer on this branch.

Here is a summary of the results so far.

SwiftyJSON

Size:

Variant | non-LTO  | Swift LTO | LLVM LTO | Swift & LLVM LTO
Onone   | 306.4 KB | 250.5 KB  | 234.0 KB | 202.2 KB
O       | 310.6 KB | 253.6 KB  | 299.2 KB | 233.1 KB
Osize   | 278.3 KB | 221.2 KB  | 251.8 KB | 203.0 KB

SwiftSyntax

Size:

Variant | non-LTO | Swift LTO | LLVM LTO | Swift & LLVM LTO
Onone   | 16.1 MB | 10.4 MB   | 8.2 MB   | 5.6 MB
O       | 6.9 MB  | 5.9 MB    | 6.9 MB   | 5.0 MB
Osize   | 5.6 MB  | 5.1 MB    | 5.3 MB   | 3.9 MB

RxSwift

Size:

Variant | non-LTO | Swift LTO | LLVM LTO | Swift & LLVM LTO
Onone   | 2.8 MB  | 2.0 MB    | 1.8 MB   | 1.4 MB
O       | 1.6 MB  | 1.4 MB    | 1.6 MB   | 1.3 MB
Osize   | 1.5 MB  | 1.3 MB    | 1.5 MB   | 1.2 MB

Right now the optimizer is conservative about witness table elimination, so that part doesn't show a significant reduction yet. Once it is done more aggressively, I expect better results.

And I'm sending patches around SIB serialization:

21 Likes

looks nice, well done.

1 Like

Last week, I sent some patches around the serialization format.

I also measured additional benchmarks of binary size, build time, and runtime performance for some libraries, including the stdlib.

The results show that LTO can also reduce build time.

Size:

Variant | non-LTO | Swift LTO | LLVM LTO | Swift & LLVM LTO
Onone   | 10.0 MB | 6.6 MB    | 6.8 MB   | 4.8 MB
O       | 7.5 MB  | 4.7 MB    | 7.4 MB   | 4.3 MB
Osize   | 7.0 MB  | 4.5 MB    | 6.8 MB   | 4.1 MB

Build time:

Variant | non-LTO  | Swift LTO | LLVM LTO | Swift & LLVM LTO
Onone   | 185.73 s | 178.56 s  | 181.19 s | 216.63 s
O       | 615.91 s | 560.77 s  | 316.85 s | 569.25 s
Osize   | 478.26 s | 420.65 s  | 172.32 s | 359.00 s

See also: https://github.com/kateinoigakukun/swift-lto-benchmark

I've also started porting my LTO work to the apple/swift repo.

https://github.com/apple/swift/pull/33324

https://github.com/apple/swift/pull/33400

In my current plan, the big changes in my forked branch will be split into:

  1. [sent] Add frontend options which are used to emit module summary
  2. [draft] Impl module summary serialization
  3. Impl a frontend action which merges multiple module summaries
  4. Impl a DCE opt pass which uses merged module summary
  5. Impl driver to handle SIB and module summary emission
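
As a rough illustration of how steps 2 to 4 could fit together, here is a Python sketch. It is only a model under stated assumptions: the summary format (function name mapped to its callees), the function names, and the helper names are hypothetical, not the actual ModuleSummary format or the compiler's API.

```python
# Hypothetical module summary: function name -> set of callee names.

def merge_summaries(summaries):
    """Combine per-module summaries into one whole-program call graph."""
    merged = {}
    for summary in summaries:
        for func, callees in summary.items():
            merged.setdefault(func, set()).update(callees)
    return merged


def live_functions(merged, roots):
    """Mark everything reachable from the roots (e.g. main) as live."""
    live, worklist = set(), list(roots)
    while worklist:
        func = worklist.pop()
        if func in live:
            continue
        live.add(func)
        worklist.extend(merged.get(func, ()))
    return live


def eliminate_dead(module_functions, live):
    """Per-module DCE: keep only functions the merged summary marks live."""
    return [f for f in module_functions if f in live]


app = {"main": {"Lib.parse"}}
lib = {
    "Lib.parse": {"Lib.helper"},
    "Lib.helper": set(),
    "Lib.unused": {"Lib.helper"},  # never reached from main
}
merged = merge_summaries([app, lib])
live = live_functions(merged, roots=["main"])
kept = eliminate_dead(sorted(lib), live)
print(kept)
```

The point of the split is that each module only needs the small merged summary, not the full SIL of every other module, to decide what it can drop.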

Right now I'm mainly blocked on reviews of those PRs and of the LLVM LTO PR.

10 Likes

Last week, I worked on more aggressive dead table elimination based on type reference information.

The optimization eliminates vtables and witness tables if the conforming types are not referenced by any instruction.
This results in further binary size reduction:

  • stdlib: -5%
  • SwiftyJSON: -13%
  • RxSwift: -8%
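
The rule described above can be modeled in a few lines of Python. This is a simplified sketch, not the actual pass: the data model (instructions carrying a `type_refs` list, tables carrying a `type` field) and the type names are assumptions made for illustration.

```python
# Hypothetical data model: each live instruction lists the types it references,
# and each table records the type it was emitted for.

def eliminate_dead_tables(tables, live_instructions):
    referenced_types = set()
    for inst in live_instructions:
        referenced_types.update(inst["type_refs"])
    kept = [t for t in tables if t["type"] in referenced_types]
    dropped = [t for t in tables if t["type"] not in referenced_types]
    return kept, dropped


live_instructions = [
    {"op": "alloc_ref", "type_refs": ["Circle"]},
    {"op": "apply", "type_refs": []},
]
tables = [
    {"kind": "vtable", "type": "Circle"},
    {"kind": "witness_table", "type": "Square"},  # Square is never referenced
]
kept, dropped = eliminate_dead_tables(tables, live_instructions)
print([t["type"] for t in kept])     # tables that survive
print([t["type"] for t in dropped])  # tables that are eliminated
```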

In addition, I implemented KeyPath accessor elimination, but it was not very effective for binary size reduction.

To find heavy live functions, I implemented a call graph visualizer and a dominator-tree-based analyzer. (But I found that the call graph is too big to see at once :sweat_smile:)
[Image: Screen Shot 2020-08-16 at 9.31.21]

e.g. dominator tree based analysis

size    | %     | symbol
24079   | 9.86  | main
23665   | 9.69  |   $s18SwiftStdlibExample5editsyShySSGSSF
15849   | 6.49  |     $sSS6append10contentsOfyx_tSTRzSJ7ElementRtzlF
3373    | 1.38  |       $sSS6append10contentsOfyx_tSTRzSJ7ElementRtzlFSs_Tg5
163     | 0.07  |         $ss15withUnsafeBytes2of_q_xz_q_SWKXEtKr0_lFs6UInt64V_ADtSWxs5Error_plyq_Isgyrzo_q_sAE_pAD_ADtRszr0_lIetlyrzo_Tpq5s15__StringStorageC_Tg5011$ss12_Smallg47V8withUTF8yxxSRys5UInt8VGKXEKlFxSWKXEfU_s02__B7H5C_TG5SRys0N0VGxsAE_plyAGIsgyrzo_s01_jG0VTf1nc_n
103     | 0.04  |           $ss12_SmallStringV8withUTF8yxxSRys5UInt8VGKXEKlFxSWKXEfU_s02__B7StorageC_TG5
97      | 0.04  |             $ss12_SmallStringV8withUTF8yxxSRys5UInt8VGKXEKlFxSWKXEfU_s02__B7StorageC_Tg5
70      | 0.03  |         $sSR5start5countSRyxGSPyxGSg_SitcfCs5UInt8V_Tgq5
11      | 0.00  |         $sSnsSxRzSZ6StrideRpzrlE8distance4from2toSix_xtFSi_Tg5
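
For readers unfamiliar with this kind of analysis, here is a small Python sketch of the general technique, assuming the size column is the total size of everything a symbol dominates in the call graph. The graph, the sizes, and the function names are made up, and the simple iterative dominator computation is a stand-in for whatever the actual analyzer uses.

```python
def reachable(graph, root):
    """Nodes reachable from root, in DFS discovery order."""
    order, seen, stack = [], set(), [root]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        order.append(n)
        stack.extend(graph.get(n, ()))
    return order


def dominator_sets(graph, root):
    """Iterative dataflow: dom[n] = {n} | intersection of dom[p] over preds p."""
    nodes = reachable(graph, root)
    preds = {n: set() for n in nodes}
    for n in nodes:
        for succ in graph.get(n, ()):
            preds[succ].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == root:
                continue
            new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def immediate_dominators(dom, root):
    # Strict dominators form a chain; the closest one has the largest set.
    return {n: max(ds - {n}, key=lambda d: len(dom[d]))
            for n, ds in dom.items() if n != root}


def retained_sizes(dom, sizes):
    """Size of a node plus everything it dominates."""
    return {d: sum(sizes[n] for n in dom if d in dom[n]) for d in dom}


def print_tree(node, idom, retained, total, depth=0):
    pct = 100 * retained[node] / total
    print(f"{retained[node]:<7} | {pct:<5.2f} | {'  ' * depth}{node}")
    children = [n for n, d in idom.items() if d == node]
    for child in sorted(children, key=lambda n: -retained[n]):
        print_tree(child, idom, retained, total, depth + 1)


# Made-up call graph: "shared" is called from both a and b.
graph = {"main": ["a", "b"], "a": ["shared", "only_a"], "b": ["shared"]}
sizes = {"main": 10, "a": 20, "b": 30, "shared": 40, "only_a": 5}
dom = dominator_sets(graph, "main")
idom = immediate_dominators(dom, "main")
retained = retained_sizes(dom, sizes)
print_tree("main", idom, retained, retained["main"])
```

In this toy graph `shared` hangs directly under `main` rather than under `a` or `b`, because removing either caller alone would not make it dead. That is what makes the dominator tree useful for finding which single function keeps a large amount of code alive.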

Small patches were merged, but the main PRs are still waiting for reviews.

9 Likes

Last week, I spent a lot of time adding an LTO build variant to apple/swift's benchmark system. The benchmark system reported the results below, comparing -Osize and -Osize with LTO.

Code size: -Osize vs. -Osize with LTO

Regression              | OLD   | NEW   | DELTA | RATIO
RandomShuffle.o         | 10692 | 11036 | +3.2% | 0.97x
SortArrayInClass.o      | 8910  | 9151  | +2.7% | 0.97x
StringMatch.o           | 7480  | 7659  | +2.4% | 0.98x
StringReplaceSubrange.o | 7010  | 7173  | +2.3% | 0.98x
Diffing.o               | 10331 | 10566 | +2.3% | 0.98x
Array2D.o               | 13683 | 13967 | +2.1% | 0.98x
Substring.o             | 31123 | 31655 | +1.7% | 0.98x
DropLast.o              | 44229 | 44751 | +1.2% | 0.99x
StringWalk.o            | 51054 | 51627 | +1.1% | 0.99x

Improvement                  | OLD    | NEW    | DELTA  | RATIO
PrimsNonStrongRef.o          | 194994 | 159624 | -18.1% | 1.22x
NIOChannelPipeline.o         | 4219   | 3647   | -13.6% | 1.16x
PolymorphicCalls.o           | 7677   | 6959   | -9.4%  | 1.10x
BucketSort.o                 | 31344  | 28799  | -8.1%  | 1.09x
COWTree.o                    | 18837  | 17508  | -7.1%  | 1.08x
Queue.o                      | 28705  | 27205  | -5.2%  | 1.06x
Exclusivity.o                | 5483   | 5284   | -3.6%  | 1.04x
Phonebook.o                  | 37757  | 36615  | -3.0%  | 1.03x
WordCount.o                  | 82401  | 80738  | -2.0%  | 1.02x
CSVParsing.o                 | 80620  | 79231  | -1.7%  | 1.02x
SortIntPyramids.o            | 39449  | 38975  | -1.2%  | 1.01x
DictOfArraysToArrayOfDicts.o | 38540  | 38097  | -1.1%  | 1.01x
FloatingPointParsing.o       | 68406  | 67629  | -1.1%  | 1.01x
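
For reference, the DELTA and RATIO columns appear to follow the usual definitions: DELTA is the relative change from OLD to NEW, and RATIO is OLD divided by NEW, so a ratio above 1 means the LTO build is smaller. The formulas in this Python sketch are inferred from the numbers in the table, not taken from the benchmark scripts, and `delta_and_ratio` is a hypothetical helper name.

```python
def delta_and_ratio(old, new):
    # Relative size change in percent, and old/new as a "times smaller" factor.
    delta_pct = (new - old) / old * 100
    ratio = old / new
    return f"{delta_pct:+.1f}%", f"{ratio:.2f}x"


# Two rows from the table: one regression and one improvement.
print("RandomShuffle.o", delta_and_ratio(10692, 11036))
print("PrimsNonStrongRef.o", delta_and_ratio(194994, 159624))
```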

Some object files regressed in code size, but I couldn't find out why the LTO path increases instruction size.

I also prototyped SwiftPM LTO support. It lets us try LTO easily by adding the --lto=swift option.

In addition, I wrote up the final evaluation report for this GSoC project.

13 Likes

Any chance you could share the links to your final report, or any instructions for developers to try this out on their own?

1 Like

As far as I understand, this work hasn't been fully merged. The main PR, "[Serialization] Add ModuleSummary serialization format" (apple/swift#33400), has been in review since August 2020, with no updates or feedback from reviewers since then. Maybe @kateinoigakukun could describe the situation in more detail.

Even though the Swift-LTO part is not complete, you can enable LLVM level thin-lto/full-lto. That part of the work is complete.

That being said, I am not sure how production-ready it is. I recently added support for building the Swift stdlib with thin-lto/full-lto and ran into bugs with the stdlib. But the stdlib is a bit of a special case, so your mileage may vary.

For LLVM-level LTO, you can use it just by adding -lto=llvm-thin or -lto=llvm-full to the driver options.

e.g.

$ swiftc -emit-library -lto=llvm-thin X.swift
$ swiftc main.swift -lX -lto=llvm-thin -o main
1 Like