[GSoC] LTO support progress report

kateinoigakukun · June 2, 2020, 11:34am

I'm working on the GSoC project LTO support for Swift with @compnerd.

Here is my implementation plan for this project.

I'll post a weekly progress report every Monday, as @augusto2112 does in the GSoC Linux debugger support progress report.

So far, I've worked on the following things.

Addressed code review comments from @compnerd on apple/swift#31146 and spent a lot of time retrying CI.
Read the LLD and Swift codebases to get a deeper understanding of each architecture.
Started implementing the libswiftLTO pipeline (WIP).

I'm spending a lot of time investigating which information in ASTContext is required after SILGen.

Since ASTContext is used throughout the compilation process, it holds a lot of information. But the LTO plugin runs in the linker process, so it can't inherit the ASTContext state from the compiler process and needs to create and set up an ASTContext again.

For this reason, I'm investigating which information in ASTContext should be serialized in SIB.

Max_Desiatov · June 2, 2020, 12:16pm

Thanks for the update and sharing the detailed plan @kateinoigakukun. I can't wait to have proper LTO working for all platforms that Swift supports and producing smaller Swift binaries! (Especially as this is critical for the WebAssembly target).

After #31146 is merged, as far as I understand one could start using the -lto=llvm flag with the master nightlies. Or would it need lld built in some special way? Also, does it already yield some reductions in size of produced binaries, or do we need to wait for language-specific LTO to kick in to start seeing any noticeable reduction?

kateinoigakukun · June 2, 2020, 12:25pm

@Max_Desiatov wasm-ld already supports bitcode LTO, so we can use it without any special work once the PR is merged. I haven't tried it for wasm yet, but I think LLVM-level LTO can reduce the size of produced binaries more than wasm-opt can.

compnerd · June 2, 2020, 4:47pm

You may also want to investigate the existing -function-sections flag which will allow you to drop unused functions as on MachO and PE/COFF targets.

kateinoigakukun · June 9, 2020, 11:16am

Hey everyone!

Last week, I opened a PR to bootstrap the LTO pipeline on the compiler side. It contains only the very basics for transforming SIL into LLVM IR and does not contain any optimization at this time.

https://github.com/apple/swift/pull/32233

(And I'm sorry that I couldn't do much for GSoC because of too many assignments from my university.)

This week, I'll address the review comments on the PR and fix some remaining issues.

One of the issues is that my current implementation depends on the order of input modules.

In the usual compiler process, all dependent modules are loaded on demand after the main module that uses them.

On the other hand, the current implementation loads each input module immediately, so it depends on the order of input modules and fails if the dependent libraries are not loaded before the user module.
So I need to implement a lazy module loading mechanism for serialized in-memory modules.
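
The lazy loading idea can be sketched roughly as follows. This is only a minimal illustration in Python, not the code from the PR: `LazyModuleLoader`, `read_imports`, and the comma-separated "serialized" format are all hypothetical names and assumptions made for this example. The point is that registering an input only stores its buffer, and dependencies are deserialized on demand, so the input order stops mattering.

```python
from typing import Callable, Dict, List


class LazyModuleLoader:
    """Keeps serialized modules in memory and deserializes them on first use."""

    def __init__(self, read_imports: Callable[[bytes], List[str]]) -> None:
        # `read_imports` is a hypothetical callback that returns the names of
        # the modules imported by one serialized module.
        self._read_imports = read_imports
        self._buffers: Dict[str, bytes] = {}
        self._imports: Dict[str, List[str]] = {}
        self.load_order: List[str] = []

    def register(self, name: str, buffer: bytes) -> None:
        # Registration only records the buffer; nothing is deserialized yet,
        # so the order of register() calls does not matter.
        self._buffers[name] = buffer

    def load(self, name: str) -> None:
        if name in self._imports:
            return  # already loaded (or currently being loaded)
        if name not in self._buffers:
            raise KeyError(f"module {name!r} was not given as an input")
        self._imports[name] = self._read_imports(self._buffers[name])
        for dependency in self._imports[name]:
            self.load(dependency)  # dependencies are resolved on demand
        self.load_order.append(name)


def read_imports(buffer: bytes) -> List[str]:
    # Toy format for this sketch: a comma-separated list of imported modules.
    return [n for n in buffer.decode().split(",") if n]


loader = LazyModuleLoader(read_imports)
loader.register("App", b"Lib")   # the user module is given first on purpose
loader.register("Lib", b"Core")
loader.register("Core", b"")
loader.load("App")
print(loader.load_order)
```

With eager loading, giving `App` before `Lib` would fail; here each dependency is loaded only when `App` asks for it.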

kateinoigakukun · June 16, 2020, 11:12pm

Hello

Last week, I made some PRs to extend the SIL pass manager so that it can be used for optimization across multiple modules.

[NFC] Add abstract class which resolves external references by kateinoigakukun · Pull Request #32368 · apple/swift · GitHub
[LTO] Extend SILPassManager to be available with multiple modules by kateinoigakukun · Pull Request #32372 · apple/swift · GitHub
[LTO] Add new transform variant for cross module optimization by kateinoigakukun · Pull Request #32373 · apple/swift · GitHub

But due to some regression, #32237 was reverted.

This week I'll focus on getting these PRs merged.

kateinoigakukun · June 24, 2020, 12:50am

Hello everyone.

Last week, I split the LLVM LTO changes into several PRs to get them back into master.

Revert "Revert "[LTO] Build lld on Windows CI"" by kateinoigakukun · Pull Request #32462 · apple/swift · GitHub
https://github.com/apple/swift/pull/32429
[LTO] Support LLVM LTO for driver by kateinoigakukun · Pull Request #32430 · apple/swift · GitHub

But they are still in review. I hope we'll be able to merge them this week.

kateinoigakukun · June 30, 2020, 11:30pm

Hello.

Since last week, I've been prototyping Michael's architecture for further discussion. The architecture is similar to LLVM ThinLTO.

#32462 was merged into master and #32429 is almost ready to merge.

I couldn't spend much time last week since I was not feeling well, but I'm getting better now.

This week, I'll continue to prototype the architecture and work on merging the remaining LLVM LTO PRs.

kateinoigakukun · July 7, 2020, 5:09pm

Hello. Last week, I published a prototype implementation of the architecture I mentioned in last week's report.

Now, I'm refactoring the prototype implementation and preparing to break it down into several PRs while waiting for feedback from mentors.

And this week, I got a lot of assignments from university, so my work time may be a little shorter than usual.

kateinoigakukun · July 14, 2020, 11:41pm

Hello, everyone.
Last week I got some feedback on the prototype implementation, and #32429 was merged into the master branch.
In addition, the driver part of the LLVM LTO PR is now under review.

This week, I'll work on getting it merged and run a binary size benchmark for the prototype implementation to check that it gives better results than the existing optimizations.

kateinoigakukun · July 21, 2020, 6:18pm

Last week, I spent a lot of time trying LTO on the stdlib for benchmarking, but I found that it has some difficulties.

The main problem is that compiling SIB to an object file is not well supported. Compiling the stdlib SIB to an object file fails even without LTO.

In addition, there were many false-positive eliminations that caused assertion errors, so I fixed them.

I can't spend as much time on this project as usual because I have university final exams for about two weeks starting this week. However, summer vacation starts after that, which will let me spend more time on this project.

kateinoigakukun · August 3, 2020, 1:59pm

Last weekend, I succeeded in optimizing some popular Swift libraries for benchmarking with the prototype optimizer on this branch.

Here is a summary of the results so far.

SwiftyJSON (binary size)

Variant	non-LTO	Swift LTO	LLVM LTO	Swift & LLVM LTO
Onone	306.4 KB	250.5 KB	234.0 KB	202.2 KB
O	310.6 KB	253.6 KB	299.2 KB	233.1 KB
Osize	278.3 KB	221.2 KB	251.8 KB	203.0 KB

SwiftSyntax (binary size)

Variant	non-LTO	Swift LTO	LLVM LTO	Swift & LLVM LTO
Onone	16.1 MB	10.4 MB	8.2 MB	5.6 MB
O	6.9 MB	5.9 MB	6.9 MB	5.0 MB
Osize	5.6 MB	5.1 MB	5.3 MB	3.9 MB

RxSwift (binary size)

Variant	non-LTO	Swift LTO	LLVM LTO	Swift & LLVM LTO
Onone	2.8 MB	2.0 MB	1.8 MB	1.4 MB
O	1.6 MB	1.4 MB	1.6 MB	1.3 MB
Osize	1.5 MB	1.3 MB	1.5 MB	1.2 MB

For now, the optimizer handles witness table elimination conservatively, so it doesn't yet show a significant reduction there. Once it does that more aggressively, the results should be better.

And I'm sending patches around SIB serialization:

https://github.com/apple/swift/pull/33255

BigSur · August 3, 2020, 2:00pm

looks nice, well done.

kateinoigakukun · August 11, 2020, 10:43am

Last week, I sent some patches around the serialization format.

And I measured additional benchmarks of binary size, build time, and runtime performance for some libraries including stdlib.

These results show that LTO can also reduce build time.

Size

Variant	non-LTO	Swift LTO	LLVM LTO	Swift & LLVM LTO
Onone	10.0 MB	6.6 MB	6.8 MB	4.8 MB
O	7.5 MB	4.7 MB	7.4 MB	4.3 MB
Osize	7.0 MB	4.5 MB	6.8 MB	4.1 MB

Build Time

Variant	non-LTO	Swift LTO	LLVM LTO	Swift & LLVM LTO
Onone	185.73 s	178.56 s	181.19 s	216.63 s
O	615.91 s	560.77 s	316.85 s	569.25 s
Osize	478.26 s	420.65 s	172.32 s	359.00 s

See also: https://github.com/kateinoigakukun/swift-lto-benchmark

And I started porting my LTO work into the apple/swift repo.

https://github.com/apple/swift/pull/33324

https://github.com/apple/swift/pull/33400

In my current plan, the big changes in my forked branch will be split into:

[sent] Add frontend options which are used to emit module summary
[draft] Impl module summary serialization
Impl a frontend action which merges multiple module summaries
Impl a DCE opt pass which uses merged module summary
Impl driver to handle SIB and module summary emission
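
To make the middle steps of this plan more concrete, here is a rough Python sketch of merging per-module summaries and running a summary-based DCE. It is an illustration under assumptions, not the format or pass from the PRs: a "summary" here is just a hypothetical mapping from a function name to the functions it references, and `merge_summaries`, `live_functions`, and `dead_functions` are made-up names.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

# Hypothetical summary shape: function name -> names of referenced functions.
Summary = Dict[str, List[str]]


def merge_summaries(summaries: Iterable[Summary]) -> Dict[str, Set[str]]:
    """Union the call edges of all per-module summaries into one graph."""
    merged: Dict[str, Set[str]] = {}
    for summary in summaries:
        for function, callees in summary.items():
            merged.setdefault(function, set()).update(callees)
    return merged


def live_functions(merged: Dict[str, Set[str]], roots: Iterable[str]) -> Set[str]:
    """Everything reachable from the preserved roots (e.g. the entry point)."""
    live: Set[str] = set()
    worklist = deque(roots)
    while worklist:
        function = worklist.popleft()
        if function in live:
            continue
        live.add(function)
        worklist.extend(merged.get(function, ()))
    return live


def dead_functions(merged: Dict[str, Set[str]], roots: Iterable[str]) -> Set[str]:
    """Functions defined in some module but never reached from the roots."""
    return set(merged) - live_functions(merged, roots)


app = {"main": ["Lib.used"]}
lib = {"Lib.used": ["Lib.helper"], "Lib.helper": [], "Lib.unused": ["Lib.other"], "Lib.other": []}
print(sorted(dead_functions(merge_summaries([app, lib]), roots=["main"])))
```

The merged graph is what lets a function in one module be proven dead based on callers in another module, which a single-module pass cannot see.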

Now I'm mainly blocked on reviews of those PRs and the LLVM LTO PR.

kateinoigakukun · August 18, 2020, 3:34pm

Last week, I worked on more aggressive dead table elimination based on type reference information.

The optimization eliminates vtables and witness tables if the conforming types are not referenced by any instruction.
This results in a further binary size reduction:

stdlib: -5%
SwiftyJSON: -13%
RxSwift: -8%
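
The rule described above (drop a table when its type is never referenced) could look roughly like this. It is a simplified Python sketch under assumptions, not the actual pass: tables are modeled as a hypothetical mapping from a table name to its conforming type, and the type references are assumed to have been collected already from the instructions of live functions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def eliminate_dead_tables(
    tables: Dict[str, str],
    type_refs_by_function: Dict[str, Iterable[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Keep only tables whose conforming type is referenced somewhere.

    `tables` maps a (hypothetical) table name to its conforming type, and
    `type_refs_by_function` maps each live function to the types that its
    instructions reference.
    """
    referenced: Set[str] = set()
    for refs in type_refs_by_function.values():
        referenced.update(refs)
    kept = {name: ty for name, ty in tables.items() if ty in referenced}
    removed = sorted(set(tables) - set(kept))
    return kept, removed


tables = {
    "witness(Cat: Animal)": "Cat",
    "witness(Dog: Animal)": "Dog",
    "vtable(Robot)": "Robot",
}
type_refs = {"main": ["Cat"], "feed": ["Cat", "Robot"]}
kept, removed = eliminate_dead_tables(tables, type_refs)
print(removed)
```

If no live instruction ever mentions `Dog`, no value of that type can exist at runtime, so its table cannot be looked up and is safe to drop in this simplified model.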

In addition, I implemented KeyPath accessor elimination, but this was not very effective for binary size reduction.

To find heavy live functions, I implemented a call graph visualizer and a dominator-tree-based analyzer (but I found that the call graph is too big to view at once).

e.g. dominator tree based analysis

size    | %     | symbol
24079   | 9.86  | main
23665   | 9.69  |   $s18SwiftStdlibExample5editsyShySSGSSF
15849   | 6.49  |     $sSS6append10contentsOfyx_tSTRzSJ7ElementRtzlF
3373    | 1.38  |       $sSS6append10contentsOfyx_tSTRzSJ7ElementRtzlFSs_Tg5
163     | 0.07  |         $ss15withUnsafeBytes2of_q_xz_q_SWKXEtKr0_lFs6UInt64V_ADtSWxs5Error_plyq_Isgyrzo_q_sAE_pAD_ADtRszr0_lIetlyrzo_Tpq5s15__StringStorageC_Tg5011$ss12_Smallg47V8withUTF8yxxSRys5UInt8VGKXEKlFxSWKXEfU_s02__B7H5C_TG5SRys0N0VGxsAE_plyAGIsgyrzo_s01_jG0VTf1nc_n
103     | 0.04  |           $ss12_SmallStringV8withUTF8yxxSRys5UInt8VGKXEKlFxSWKXEfU_s02__B7StorageC_TG5
97      | 0.04  |             $ss12_SmallStringV8withUTF8yxxSRys5UInt8VGKXEKlFxSWKXEfU_s02__B7StorageC_Tg5
70      | 0.03  |         $sSR5start5countSRyxGSPyxGSg_SitcfCs5UInt8V_Tgq5
11      | 0.00  |         $sSnsSxRzSZ6StrideRpzrlE8distance4from2toSix_xtFSi_Tg5
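
For readers unfamiliar with this kind of analysis, here is a small Python sketch of how a dominator-based size report can be computed from a call graph. It assumes that the size column above is cumulative (a function's own size plus everything it dominates), which the post does not state explicitly; the function names and the toy graph are made up for illustration, and the analyzer's real implementation may differ.

```python
from typing import Dict, List, Set


def dominators(graph: Dict[str, List[str]], root: str) -> Dict[str, Set[str]]:
    """Classic iterative dominator sets for all nodes reachable from root."""
    nodes: List[str] = []
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        for callee in graph.get(node, ()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)

    preds: Dict[str, Set[str]] = {n: set() for n in nodes}
    for node in nodes:
        for callee in graph.get(node, ()):
            preds[callee].add(node)

    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for node in nodes:
            if node == root:
                continue
            # A node is dominated by itself plus whatever dominates all callers.
            new = set.intersection(*(dom[p] for p in preds[node])) | {node}
            if new != dom[node]:
                dom[node] = new
                changed = True
    return dom


def retained_sizes(graph: Dict[str, List[str]], own_size: Dict[str, int], root: str) -> Dict[str, int]:
    """Own size plus the size of every function that the node dominates."""
    retained = {n: 0 for n in dominators(graph, root)}
    for node, doms in dominators(graph, root).items():
        for d in doms:
            retained[d] += own_size.get(node, 0)
    return retained


call_graph = {"main": ["a", "b"], "a": ["shared", "only_a"], "b": ["shared"]}
sizes = {"main": 10, "a": 20, "b": 30, "shared": 40, "only_a": 5}
for name, size in sorted(retained_sizes(call_graph, sizes, "main").items()):
    print(name, size)
```

In the toy graph, `shared` is called from both `a` and `b`, so neither dominates it and its size is attributed only to `main`. That is what makes such a report useful: a large number next to a function means removing that one function would make all of that code unreachable.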

Small patches were merged, but the main PRs are still waiting for reviews.

kateinoigakukun · August 25, 2020, 12:36am

Last week, I spent a lot of time supporting an LTO build variant in apple/swift's benchmark system. The benchmark system reported the results below, comparing -Osize and -Osize with LTO.

Code size: -Osize vs. -Osize with LTO

Regression	OLD	NEW	DELTA	RATIO
RandomShuffle.o	10692	11036	+3.2%	0.97x
SortArrayInClass.o	8910	9151	+2.7%	0.97x
StringMatch.o	7480	7659	+2.4%	0.98x
StringReplaceSubrange.o	7010	7173	+2.3%	0.98x
Diffing.o	10331	10566	+2.3%	0.98x
Array2D.o	13683	13967	+2.1%	0.98x
Substring.o	31123	31655	+1.7%	0.98x
DropLast.o	44229	44751	+1.2%	0.99x
StringWalk.o	51054	51627	+1.1%	0.99x

Improvement	OLD	NEW	DELTA	RATIO
PrimsNonStrongRef.o	194994	159624	-18.1%	1.22x
NIOChannelPipeline.o	4219	3647	-13.6%	1.16x
PolymorphicCalls.o	7677	6959	-9.4%	1.10x
BucketSort.o	31344	28799	-8.1%	1.09x
COWTree.o	18837	17508	-7.1%	1.08x
Queue.o	28705	27205	-5.2%	1.06x
Exclusivity.o	5483	5284	-3.6%	1.04x
Phonebook.o	37757	36615	-3.0%	1.03x
WordCount.o	82401	80738	-2.0%	1.02x
CSVParsing.o	80620	79231	-1.7%	1.02x
SortIntPyramids.o	39449	38975	-1.2%	1.01x
DictOfArraysToArrayOfDicts.o	38540	38097	-1.1%	1.01x
FloatingPointParsing.o	68406	67629	-1.1%	1.01x

Some object files have code size regressions, but I couldn't find the reason why the LTO path increases instruction size.
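
As a reading aid for the tables above: the DELTA and RATIO columns appear to be consistent with the relative change from OLD to NEW and with OLD divided by NEW, respectively. The Python snippet below checks that interpretation against two rows; the helper names are made up for this example, and the exact formulas used by the benchmark scripts are an assumption.

```python
def delta_percent(old: int, new: int) -> float:
    """Relative size change from OLD to NEW, in percent (positive = bigger)."""
    return (new - old) / old * 100.0


def ratio(old: int, new: int) -> float:
    """OLD divided by NEW (above 1.0 means the new object file is smaller)."""
    return old / new


rows = {
    "RandomShuffle.o": (10692, 11036),
    "PrimsNonStrongRef.o": (194994, 159624),
}
for name, (old, new) in rows.items():
    print(f"{name}\t{old}\t{new}\t{delta_percent(old, new):+.1f}%\t{ratio(old, new):.2f}x")
```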

And I prototyped SwiftPM LTO support. It lets us try LTO easily by adding the --lto=swift option.

In addition, I wrote up the final evaluation report for this GSoC project.

stevenhepting · March 29, 2021, 7:46pm

Any chance you could share the links to your final report, or any instructions for developers to try this out on their own?

Max_Desiatov · March 29, 2021, 8:24pm

As far as I understand, this work hasn't been fully merged. The main PR, "[Serialization] Add ModuleSummary serialization format" (apple/swift#33400), has been in review since August 2020, and there have been no updates or feedback from reviewers since then. Maybe @kateinoigakukun could describe the situation in more detail.

Michael_Gottesman · March 29, 2021, 9:35pm

Even though the Swift-LTO part is not complete, you can enable LLVM level thin-lto/full-lto. That part of the work is complete.

That being said, I am not sure how production ready it is. I recently added support for building the swift stdlib with thin-lto/full-lto and ran into bugs with the stdlib. But the stdlib is a bit of a special case, so your mileage may vary.

kateinoigakukun · March 30, 2021, 9:02am

For LLVM-level LTO, you can use it by just adding -lto=llvm-thin or -lto=llvm-full to the driver options.

e.g.

$ swiftc -emit-library -lto=llvm-thin X.swift
$ swiftc main.swift -lX -lto=llvm-thin -o main