Refactoring plan of SILLinker for LTO

Hi, everyone! I want to ask you to review my re-architecture plan before landing the change.

Motivation

The LTO accepts multiple SIL modules as input and executes SIL passes over them.
To prepare for cross-module optimization, I want to link external module functions to the SILFunctions that reference them. Currently, SILLinker does a similar thing for inlining.

However, SILLinker loads external module functions from LoadedModules in SILModule. In other words, a SILModule is both the unit of a module and, at the same time, the manager of its dependent modules.

But for LTO we need to think about multiple SILModules, so I want to separate these roles.
For example, if "management of dependent modules" is extracted from SILModule's roles, it becomes easier to link external SILFunctions with each other.

Proposed Change

This PR #32368 proposes a new abstract interface, "SILReferenceResolver", that abstracts how external references are resolved. After this change, SILLinker no longer depends on SILModule directly as a "dependency module manager".

class CrossModuleReferenceResolver : public SILReferenceResolver {
  std::vector<SILModule *> Modules;

public:
  bool loadFunction(SILFunction *F) override {
    // Try each module in turn until one can provide the function body.
    for (auto *M : Modules) {
      if (M->loadFunction(F))
        return true;
    }
    return false;
  }
  ...
};

What do you think about this re-architecture plan?


CC: @Michael_Gottesman, @compnerd

This is not how we will want to do this. This is going to be a full-LTO-like solution. Instead, we should go for an approach that uses summaries (i.e. ThinLTO). The reason this is important is that full LTO has been shown not to be a performant approach, since it requires the entire program to be in memory at the same time (even if, for codegen, it would not need to be).

To see a small discussion of this, see this offset into M. Amini & T. Johnson's talk “ThinLTO: Scalable and Incremental LTO”:

// Inline to avoid discourse expanding this.
https://www.youtube.com/watch?v=9OIEZAj243g&feature=youtu.be&t=300

Keep in mind that this presentation is from a few years ago, but the talk about memory consumption with some numbers from LLVM is still relevant.

So far we are using clang's LTO model as a blueprint for the architecture of how "Swift cross-module optimization with closed-system knowledge" should work.

I would like to see some discussion of why this is the right approach.

The existing proposal performs the cross-module optimization step as part of the link step. This is reasonable but not necessary: you could imagine performing the cross-module optimization as an additional step in the compiler pipeline, driven by the Swift driver. First, we compile individual modules in a cross-module mode. This step might generate summaries alongside the .swiftmodule. A second step invokes the frontend with the summaries and swiftmodules, which allows for cross-module optimization at the SIL level. At the conclusion of this step we generate llvm modules (+ optional llvm summaries) + object code. The link step can then perform what we know as llvm's ThinLTO, possibly with Swift-specific llvm LTO optimizations.

This is one way to skin this cat. I would like to hear what others think?

First of all, I know the sausage factory floor is messy, but I sort of feel sorry for the cat = p.

Let me lay out what you are saying in a bit more detail to make sure I am understanding correctly:

The clang LTO model is to use the linker. I would assume, Arnold, that you would argue that this was created so we can work with normal Makefiles/etc (similar reasoning behind auto linking being created). I assume you are questioning if doing this in the linker is the appropriate time to do so and if instead we can just do it earlier using the fact that the driver has significantly more control when compiling swift code?

Yes. Also, I think the fact that we are emitting extra .sib files as intermediate products seems derived from that architecture?

@Arnold yes. The nice thing about us not going through that system is that we do not need to write linker plug-ins for all of the various linkers. So if we can do it, I think we should.

It seems reasonable to follow the ThinLTO way. I talked about that with @compnerd a few days ago.

And I agree that we don't need to embed the optimization within each linker. What do you think @compnerd?

@Arnold I was thinking about this a bit more and realized that there is an important design decision that has not been spoken about: do we want to rely on CMO serializing everything?

If we do not want to rely on CMO doing that, then we need some intermediate representation for the SIL in between the initial compilations and the merging of summaries. Otherwise, we do not have an appropriate thing that we can codegen /after/ we have finished the cross module optimization phase.

My thoughts are that most likely we will not want to serialize everything at CMO time in the near future due to code-size concerns. But if other people have other thoughts, I am happy to discuss.

That being said, given that CMO is not going to serialize all the SIL, I imagine this is how the architecture would be:

Note that CMO serialization would happen early, and we would produce a summary file at that time for the swift module; the SIB file serialization would happen where we serialize today, in the middle of the optimizer pipeline. The SIB file would then be used to codegen (and would use CMO code from other modules) using merged summaries.

One thing that is not completely clear to me is if SIB files themselves would want a summary of some sort to enable DCE. I am imagining information that we would discover later after optimization like we DCE-ed enough that this vtable is not used anywhere in the SIB file even though it /could/ have been. But I haven't thought in great depth. It would not be useful for inlining of course.


I think if DCE just removes the llvm.used attribute from the table declarations, it would be no problem even if they are used after the CMO.

I basically agree with Arnold.
Also, we should answer the question of how much benefit we would get from building up such an infrastructure, compared to the cross-module optimization we have now.
"Bottom-up" optimizations, like cross-module inlining and specialization, are already supported by our current cross-module-optimization approach.

With swift LTO, we could do "top-down" interprocedural optimizations, most importantly DCE.
Though, DCE can also be done by the linker. Currently it does not work well with swift's witness tables, but probably that could be supported in the linker in some way.
So the question is: what's the amount of improvement we can get from cross-module "top-down" optimizations (beside DCE)? Is it worth investing in building this framework with the benefits we are expecting?

Good point. Can we do DCE as a Swift-specific LLVM LTO optimization, by providing summary info about which top-level entry points (witness/vtable entries, metadata) of an llvm module are may-used?

There is the possibility of more devirtualization if our world is more closed. It's not clear to me how beneficial that would be in practice, where we use opaque (non-CMO'ed) libraries. If those libraries came with summaries of what they import/extend (that would be ABI, though), you might get close to a real closed world...

An architecture aspect is the possibility for parallelization. How does the architecture support parallelization with as short sequential stages as possible?

An example thought:

  • The initial stage of generating individual modules (swiftinterface, llvm bc, and individual summaries) is naturally parallel.
  • The potential next stage of collating summaries is a linearization point.
  • Optimization based on collated summaries (be it a Swift CMO step or a Swift-LLVM-LTO) and code generation can be done (mostly) in parallel.
  • The ultimate link step to produce the final object is sequential.
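To make the stage structure concrete, here is a rough build-graph sketch using the prototype's command lines from this thread. The make formulation is purely illustrative (a real integration would live in the Swift driver), and the flags are prototype-only:

```make
# Stage 1 (parallel across modules): compile + per-module summary emission.
%.sib: %.swift
	swift-frontend -emit-sib $< -emit-module-summary-path $*.swiftmodule.summary

# Stage 2 (sequential): merge summaries and mark dead functions (the thin-link point).
merged-module.summary: main.sib module1.sib
	swift-frontend -cross-module-opt \
	    main.swiftmodule.summary module1.swiftmodule.summary -o $@

# Stage 3 (parallel across modules): summary-driven optimization + codegen.
%.o: %.sib merged-module.summary
	swift-frontend $< -c -o $@ -module-summary-path merged-module.summary

# Stage 4 (sequential): final link.
main: main.o module1.o
	$(CC) $^ -o $@
```

With this shape, `make -j` would run stages 1 and 3 in parallel automatically, with stages 2 and 4 as the only serialization points.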

Hi, everyone. Since last week, I have been prototyping @Arnold and @Michael_Gottesman's architecture.

Here is my rough implementation of the architecture:

Overview

Almost the same as LLVM's Thin LTO

  1. Emit module summary
    • Add a new file type .swiftmodule.summary.
    • It serializes a module's call graph, witness table, and vtable information.
    • The structure is similar to an LLVM ThinLTO summary.
    • swift-frontend's option -emit-module-summary-path controls the emission.
    • This can be done in parallel.
  2. Merge summaries
    • swift-frontend -cross-module-opt [module summary file...] links and merges multiple summaries.
    • It also prepares for optimization at this phase (e.g. marking dead functions).
    • This is a sequential stage.
  3. Perform optimizations for each module
    • Pass the merged module summary to swift-frontend via -module-summary-path.
    • My prototype implements only simple Dead Function Elimination.
    • This can be done in parallel.
# 1. Emit module summary for 'module1' into './module1.swiftmodule.summary'
$ swift-frontend -emit-sib module1.swift \
                 -emit-module-summary-path module1.swiftmodule.summary \
                 -parse-as-library

$ swift-frontend -emit-module module1.swift -parse-as-library

# 2. Emit module summary for 'main' into './main.swiftmodule.summary'
$ swift-frontend -emit-sib main.swift \
                 -emit-module-summary-path main.swiftmodule.summary

# 3. Merge module summaries into one summary file, link them and mark dead functions
$ swift-frontend -cross-module-opt \
                 main.swiftmodule.summary module1.swiftmodule.summary \
                 -o merged-module.summary

# 4. Do Dead Function Elimination for 'module1' module using the combined module summary
$ sil-opt -emit-sil module1.sib \
          -module-summary-path merged-module.summary \
          --sil-cross-deadfuncelim

# 5. Do again for 'main' module.
$ sil-opt -emit-sil main.sib \
          -module-summary-path merged-module.summary \
          --sil-cross-deadfuncelim

Module Summary file format

The module summary file consists of call graph information and virtual method table information.

func myPrint(_ text: String) { ... }
public protocol Animal {
  func bark()
}

public struct Cat: Animal {
  public func bark() { myPrint("mew") }
}

public struct Dog: Animal {
  public func bark() { myPrint("bow") }
}

public func callBark<T: Animal>(_ animal: T) {
  animal.bark()
}

For example, this swift file would be summarized as:
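As a hypothetical illustration of the kind of information such a summary would record for this file (not the actual serialized format used by the prototype), it might contain:

```
CallGraph:
  Cat.bark     -> myPrint        ; direct call
  Dog.bark     -> myPrint        ; direct call
  callBark<T>  -> Animal.bark    ; witness-method call, resolved via the tables below
WitnessTables:
  Cat: Animal  -> bark = Cat.bark
  Dog: Animal  -> bark = Dog.bark
```

Given a merged view of all modules, a pass could then walk this graph from the entry points and mark everything unreachable as dead.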


I basically like the approach laid out here for bootstrapping. I'm not sure how well it will integrate with the Swift driver in the future, or whether splitting compilation units into .sib files is well supported today. For example, sil-opt often fails today when supplied with .sil produced by swift-frontend.

I think calling this "LTO" is a misnomer, as this seems like a feature that's internal to the Swift driver, not driven by the linker. Although I do think LLVM ThinLTO is a good architecture to emulate.

Reading between the lines, here's how I understand this proposal...

Given modules A and B...

Compile A

$ swift-frontend A1.swift A2.swift -whole-module-optimization -emit-module -emit-module-summary -emit-sib -o A.swiftmodule

Output: A.swiftmodule, A.swiftmodule.summary, A.sib

The three output file types are likely all generated during SIL module serialization, but we have the option of deferring the .summary and .sib output in the future if it's useful to run more passes on those.

A.sib must contain the additional SIL function bodies that are not exported for cross-module optimization. It may also contain a copy of exported function bodies, for example, if they have been further optimized after the module was serialized. A.sib is the same file format as .swiftmodule but does not include any AST-level type-information (A.sib is useless on its own). A.sib can be arbitrarily broken down into Ax.sib, Ay.sib, Az.sib either for parallelism or incremental builds.

Compile B

$ swift-frontend B1.swift B2.swift -whole-module-optimization -emit-module -emit-module-summary -emit-sib -o B.swiftmodule

Output: B.swiftmodule, B.swiftmodule.summary, B.sib

$ swift-frontend -merge-module-summary \
                 A.swiftmodule.summary B.swiftmodule.summary \
                 -o merged_module.summary

The summary merge step seems unnecessary, but it may save compilation time because each module does not need to "re-merge" the summaries as it imports them.

I'm not sure why the proposal calls this "-cross-module-opt".

Test and debug the SIL optimizer

$ sil-opt A.sib -emit-sil \
          -module-summary-path merged-module.summary \
          --sil-cross-deadfuncelim
  • Finds A.swiftmodule and B.swiftmodule in the include path

  • I think we currently need to specify the .sib file's parent
    .swiftmodule on the command line, but that seems silly. A.sib should
    know that it comes from A.swiftmodule

CodeGen A

$ swift-frontend A.sib -c -o A.o -module-summary-path merged-module.summary
  • Finds A.swiftmodule and B.swiftmodule in the include path

There seemed to be some confusion regarding the artifacts produced by the compiler. Here's my take on that...

It's useful to separate information that has different dependence information, different lifetime, or needs to be individuated on a command line into separate files.

.swiftmodule: "what a module exports"

  • somewhat analogous to a combined header

  • produced by a single well-defined SIL serialization point in the
    optimizer pipeline (prior to dropping any semantics).

  • self-contained, AST, SIL-level declarations, and exported function
    body definitions.

  • may depend on information from function bodies that aren't included
    (at least as-is today). Ideally we would have a way of recording
    those dependencies on .swift files, and/or avoid introducing them, to
    support incremental cross-module optimization.

.swiftmodule.summary: "inclusive module summary"

  • augments .swiftmodule with summary information that's inclusive over
    the module's implementation

  • could (should?) be embedded within the .swiftmodule, but separating them
    allows for a single merged summary file

  • additional source of dependencies on .swift files. It's possible
    that updating a .swift file changes the summary but not the
    .swiftmodule

  • could potentially be emitted later in the pipeline to provide more
    refined summary

.sib/.sil: "SIL-level compilation unit"

  • somewhat analogous to .cpp/.bc/.ll

  • arbitrary subset of SIL function bodies for codegen within a
    module that can be merged or split. Never seen by other modules.

  • these may be emitted at any time during SIL optimization for testing and debugging.

  • .sib should (ideally) be isomorphic and interchangeable with .sil files


Thanks for re-organizing my proposal. That explains my thought perfectly!

In my opinion, the Swift compiler driver should focus on single-module compilation, and build systems like SwiftPM should drive this kind of cross-module cooperation.

Right, I know those issues. I'm sending patches to fix them now.

Yes, as you said, this emulates the LLVM ThinLTO architecture but doesn't perform it at link time :sweat_smile:
Do you have any ideas for a name for this optimization?

.sib/.sil and .swiftmodule.summary have the same lifetime, because they depend on internal things that are not exported in the .swiftmodule. So I think it would be better to embed the summary info in .sib rather than in .swiftmodule.

My 2 cents.

Have a look at the ThinLTO video. It is slightly different from your proposal: the thin-link step actually does optimizations instead of just merging the summaries.

Here is an example. It creates one summary for each module, i.e. each translation unit:
http://lists.llvm.org/pipermail/llvm-dev/2019-January/128955.html

As far as I know, the thin-link phase merges summaries and computes dead symbols, and the LTO backends actually do the optimizations and codegen based on the merged summary and the computed dead flags.

My prototype implementation follows this approach.

It also does an analysis for cross-TU inlining and provides this information in the per-TU summaries.

E.g. is it beneficial to inline foo from Module A into Module B depending on the size of foo and the frequency it is used in Module B.

Interesting. But I think similar things are already done with the current .swiftmodule-based cross-module optimization.
