"available externally" vs build time

Folks working on the SIL optimizer, particularly those interested in faster builds:

If I understand the SIL optimizer correctly, it seems that when the current program references an external symbol declared as @_inlinable, SILModule::linkFunction eagerly deserializes the @_inlinable body and splats it into the current module. That SIL function exists in the current module, gets optimized, inlined, etc. along with existing functions, then gets dropped on the floor at IRGen time if it still exists.

If this is true, this seems like an incredibly wasteful approach, particularly given how many @_inlinable functions exist in the standard library, and particularly for programs that have lots of small files. Try this:

$ cat u.swift
func f() {
  print("hello")
}

$ swiftc u.swift -emit-sil -o - | wc -l
    7191

That is a *TON* of SIL, most of it having to do with array internals, string internals, and other stuff. It eats memory and costs a ton of compile time to deserialize and slog all of this around, only for IRGen to promptly drop it on the floor. It also makes the -emit-sil output more difficult to work with...

Optimized builds are also bad:
$ swiftc u.swift -emit-sil -o - -O | wc -l
     861

If you look at it, only about 70 lines of that is the actual program being compiled; the rest is dropped on the floor by IRGen. It costs a ton of memory and compile time to deserialize and represent all of this, and even more is wasted running the optimizer on code that was presumably already optimized when the stdlib was built.

I imagine that this approach was inspired by LLVM’s available_externally linkage, which does things the same way. This is a simple way to make sure that interprocedural optimizations can see the bodies of external functions to inline them, etc. However, LLVM doesn’t have the benefit of a module system like Swift’s, so it has no choice.

So here are the questions: :-)

1. It looks like the MandatoryInliner is the biggest culprit at -O0 here: it deserializes the referenced function (MandatoryInlining.cpp:384) and *then* checks to see if the callee is @_transparent. Would it make sense to change this to check for @_transparent first (which might require a SIL change?), and only deserialize if so?
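To make the reordering in question 1 concrete, here is a toy model in Python (hypothetical names, not the real C++ SILModule API): read the @_transparent flag from the serialized record first, and only pay for body deserialization when it is set.

```python
# Toy model: the transparency flag is assumed to be readable from the
# serialized declaration without loading the function body.

class SerializedFunction:
    def __init__(self, name, is_transparent):
        self.name = name
        self.is_transparent = is_transparent
        self.body_loads = 0   # counts expensive deserializations

    def deserialize_body(self):
        self.body_loads += 1
        return f"sil body of {self.name}"

def mandatory_inline_eager(callee):
    """Current order: deserialize first, then check the flag."""
    callee.deserialize_body()
    return callee.is_transparent

def mandatory_inline_lazy(callee):
    """Proposed order: check the flag first, deserialize only if needed."""
    if not callee.is_transparent:
        return False
    callee.deserialize_body()
    return True

f = SerializedFunction("f", is_transparent=False)
mandatory_inline_eager(f)
print(f.body_loads)   # 1: body loaded only to be thrown away

g = SerializedFunction("g", is_transparent=False)
mandatory_inline_lazy(g)
print(g.body_loads)   # 0: the flag check avoided the load
```

Since the overwhelming majority of referenced stdlib functions are not @_transparent, the lazy order skips the load in the common case.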

2. The performance inliner will have the same issue after this, and deserializing the bodies of all inlinable referenced functions is unavoidable for it. However, we don’t have to copy the SIL into the current module and burn compile time by subjecting it to all of the standard optimizations again. Would it make sense to put deserialized function bodies into a separate SIL module, and teach the (few) IPA/IPO optimizations about this fact? This should be very straightforward to do for all of the optimizations I’m aware of.
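A rough sketch of what question 2 proposes, with entirely hypothetical structures: keep deserialized bodies in a read-only side table instead of splatting them into the module being compiled, so IRGen emits only the current module's functions while IPA/IPO passes can still look bodies up.

```python
# Toy model: two tables, one for functions IRGen will emit and one for
# deserialized imports visible to the optimizer only.

class CompilationUnit:
    def __init__(self):
        self.current = {}    # functions IRGen will actually emit
        self.imported = {}   # deserialized bodies, for IPA/IPO only

    def lookup(self, name):
        """IPA/IPO passes see both tables."""
        return self.current.get(name) or self.imported.get(name)

    def functions_to_emit(self):
        """IRGen never walks the imported table."""
        return sorted(self.current)

unit = CompilationUnit()
unit.current["main"] = "sil @main { ... }"
unit.imported["Array.append"] = "deserialized stdlib body"

print(unit.lookup("Array.append") is not None)  # True: inliner can see it
print(unit.functions_to_emit())                 # ['main']: nothing to drop
```

Nothing in the imported table ever needs to be re-optimized or dropped on the floor; it exists only to answer optimizer queries.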

I haven’t done any measurements, but this seems like it could be a big speedup, particularly for programs containing a bunch of relatively small files and not using WMO.

-Chris

Folks working on the SIL optimizer, particularly those interested in faster builds:

If I understand the SIL optimizer correctly, it seems that when the current program references an external symbol declared as @_inlinable, SILModule::linkFunction eagerly deserializes the @_inlinable body and splats it into the current module. That SIL function exists in the current module, gets optimized, inlined, etc. along with existing functions, then gets dropped on the floor at IRGen time if it still exists.

If this is true, this seems like an incredibly wasteful approach, particularly given how many @_inlinable functions exist in the standard library, and particularly for programs that have lots of small files.

In the past, I talked about making all linking-in of functions lazy; the implementation is still there. There was a large backlash from other performance people, since it adds complexity every place one attempts to analyze a function. I didn't push too hard on this (gotta save your political capital when possible), but I still think that this is an API problem, not an optimizer design problem. With the proper API design around looking up functions and function-related instructions, such problems would go away.

Try this:

$ cat u.swift
func f() {
  print("hello")
}

$ swiftc u.swift -emit-sil -o - | wc -l
    7191

That is a *TON* of SIL, most of it having to do with array internals, string internals, and other stuff. It eats memory and costs a ton of compile time to deserialize and slog all of this around, only for IRGen to promptly drop it on the floor. It also makes the -emit-sil output more difficult to work with...

Optimized builds are also bad:
$ swiftc u.swift -emit-sil -o - -O | wc -l
     861

If you look at it, only about 70 lines of that is the actual program being compiled; the rest is dropped on the floor by IRGen. It costs a ton of memory and compile time to deserialize and represent all of this, and even more is wasted running the optimizer on code that was presumably already optimized when the stdlib was built.

I imagine that this approach was inspired by LLVM’s available_externally linkage, which does things the same way. This is a simple way to make sure that interprocedural optimizations can see the bodies of external functions to inline them, etc. However, LLVM doesn’t have the benefit of a module system like Swift’s, so it has no choice.

So here are the questions: :-)

1. It looks like the MandatoryInliner is the biggest culprit at -O0 here: it deserializes the referenced function (MandatoryInlining.cpp:384) and *then* checks to see if the callee is @_transparent. Would it make sense to change this to check for @_transparent first (which might require a SIL change?), and only deserialize if so?

I think the reason this happened is that a transparent function, IIRC, must have a body, and the verifier asserts on this. I imagine you could add an intermediate step. IMO this change is worth it not only for -Onone compile time, but also because SourceKit relies on mandatory inlining for the purposes of diagnostics, so it /could/ speed up the editor experience as well.

2. The performance inliner will have the same issue after this, and deserializing the bodies of all inlinable referenced functions is unavoidable for it. However, we don’t have to copy the SIL into the current module and burn compile time by subjecting it to all of the standard optimizations again. Would it make sense to put deserialized function bodies into a separate SIL module, and teach the (few) IPA/IPO optimizations about this fact? This should be very straightforward to do for all of the optimizations I’m aware of.

I haven't thought about this completely, but it could potentially cause problems. In general the SIL optimizer assumes that there is one SILModule, so I would be careful. Couldn't you just turn off optimizations on available_externally functions? Also, one could argue that there are cases where optimizing the available_externally functions by themselves could save compile time, since you are optimizing in one place instead of in multiple places after inlining. That being said, off the top of my head I can't think of any situation where optimizing in an imported module would result in more optimization opportunities than in the original module, beyond cases where there are circular references (e.g. a function in the imported module refers to a function in my module, so I can devirtualize/optimize further in my module and do that once before inlining). But IIRC circular references are no bueno in Swift, so it is not clear to me whether that is a /real/ case. Jordan would know more about this. +CC Jordan.

···

On Dec 28, 2017, at 7:32 PM, Chris Lattner via swift-dev <swift-dev@swift.org> wrote:

I haven’t done any measurements, but this seems like it could be a big speedup, particularly for programs containing a bunch of relatively small files and not using WMO.

-Chris

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

Folks working on the SIL optimizer, particularly those interested in faster builds:

If I understand the SIL optimizer correctly, it seems that when the current program references an external symbol declared as @_inlinable, SILModule::linkFunction eagerly deserializes the @_inlinable body and splats it into the current module. That SIL function exists in the current module, gets optimized, inlined, etc. along with existing functions, then gets dropped on the floor at IRGen time if it still exists.

I’ve noticed this too, but haven’t had time to look at it yet.

If this is true, this seems like an incredibly wasteful approach, particularly given how many @_inlinable functions exist in the standard library, and particularly for programs that have lots of small files. Try this:

I agree!

1. It looks like the MandatoryInliner is the biggest culprit at -O0 here: it deserializes the referenced function (MandatoryInlining.cpp:384) and *then* checks to see if the callee is @_transparent. Would it make sense to change this to check for @_transparent first (which might require a SIL change?), and only deserialize if so?

This seems like a clear win.

2. The performance inliner will have the same issue after this, and deserializing the bodies of all inlinable referenced functions is unavoidable for it. However, we don’t have to copy the SIL into the current module and burn compile time by subjecting it to all of the standard optimizations again. Would it make sense to put deserialized function bodies into a separate SIL module, and teach the (few) IPA/IPO optimizations about this fact? This should be very straightforward to do for all of the optimizations I’m aware of.

What if we deserialized function bodies lazily instead of deserializing the transitive closure of all serialized functions referenced from a function?

Slava

···

On Dec 28, 2017, at 4:32 PM, Chris Lattner via swift-dev <swift-dev@swift.org> wrote:

I haven’t done any measurements, but this seems like it could be a big speedup, particularly for programs containing a bunch of relatively small files and not using WMO.

-Chris


Folks working on the SIL optimizer, particularly those interested in faster builds:

If I understand the SIL optimizer correctly, it seems that when the current program references an external symbol declared as @_inlinable, SILModule::linkFunction eagerly deserializes the @_inlinable body and splats it into the current module. That SIL function exists in the current module, gets optimized, inlined, etc. along with existing functions, then gets dropped on the floor at IRGen time if it still exists.

I’ve noticed this too, but haven’t had time to look at it yet.

If this is true, this seems like an incredibly wasteful approach, particularly given how many @_inlinable functions exist in the standard library, and particularly for programs that have lots of small files. Try this:

I agree!

1. It looks like the MandatoryInliner is the biggest culprit at -O0 here: it deserializes the referenced function (MandatoryInlining.cpp:384) and *then* checks to see if the callee is @_transparent. Would it make sense to change this to check for @_transparent first (which might require a SIL change?), and only deserialize if so?

This seems like a clear win.

+1

It should be a trivial change and I’m wondering why we haven’t done this yet.
I filed SR-6697 ("Don't deserialize non-transparent functions in the mandatory inliner"), now tracked as apple/swift issue #49246 on GitHub.

2. The performance inliner will have the same issue after this, and deserializing the bodies of all inlinable referenced functions is unavoidable for it. However, we don’t have to copy the SIL into the current module and burn compile time by subjecting it to all of the standard optimizations again. Would it make sense to put deserialized function bodies into a separate SIL module, and teach the (few) IPA/IPO optimizations about this fact? This should be very straightforward to do for all of the optimizations I’m aware of.

What if we deserialized function bodies lazily instead of deserializing the transitive closure of all serialized functions referenced from a function?

Well, with our pass pipeline architecture I suspect it will not make a difference. We process functions bottom-up. For example, the performance inliner optimizes the callee first before trying to inline it (because it influences the inlining decision). So the performance inliner actually visits the whole call tree.

Would it make sense to put deserialized function bodies into a separate SIL module

We serialize early in the pipeline, i.e. serialized functions are not (fully) optimized. And at least the performance inliner needs functions to be optimized to make good inlining decisions. So it makes sense to also optimize deserialized functions.

That said, I’m sure there is still potential for improvements. For example, we could exclude deserialized generic functions from optimizations, because we only inline specialized functions.

···

On Jan 2, 2018, at 1:08 PM, Slava Pestov via swift-dev <swift-dev@swift.org> wrote:

On Dec 28, 2017, at 4:32 PM, Chris Lattner via swift-dev <swift-dev@swift.org> wrote:

Slava

I haven’t done any measurements, but this seems like it could be a big speedup, particularly for programs containing a bunch of relatively small files and not using WMO.

-Chris


Folks working on the SIL optimizer, particularly those interested in faster builds:

If I understand the SIL optimizer correctly, it seems that when the current program references an external symbol declared as @_inlinable, SILModule::linkFunction eagerly deserializes the @_inlinable body and splats it into the current module. That SIL function exists in the current module, gets optimized, inlined, etc. along with existing functions, then gets dropped on the floor at IRGen time if it still exists.

I’ve noticed this too, but haven’t had time to look at it yet.

If this is true, this seems like an incredibly wasteful approach, particularly given how many @_inlinable functions exist in the standard library, and particularly for programs that have lots of small files. Try this:

I agree!

1. It looks like the MandatoryInliner is the biggest culprit at -O0 here: it deserializes the referenced function (MandatoryInlining.cpp:384) and *then* checks to see if the callee is @_transparent. Would it make sense to change this to check for @_transparent first (which might require a SIL change?), and only deserialize if so?

This seems like a clear win.

+1

It should be a trivial change and I’m wondering why we haven’t done this yet.
I filed SR-6697 ("Don't deserialize non-transparent functions in the mandatory inliner"), now tracked as apple/swift issue #49246 on GitHub.

2. The performance inliner will have the same issue after this, and deserializing the bodies of all inlinable referenced functions is unavoidable for it. However, we don’t have to copy the SIL into the current module and burn compile time by subjecting it to all of the standard optimizations again. Would it make sense to put deserialized function bodies into a separate SIL module, and teach the (few) IPA/IPO optimizations about this fact? This should be very straightforward to do for all of the optimizations I’m aware of.

What if we deserialized function bodies lazily instead of deserializing the transitive closure of all serialized functions referenced from a function?

Well, with our pass pipeline architecture I suspect it will not make a difference. We process functions bottom-up. For example, the performance inliner optimizes the callee first before trying to inline it (because it influences the inlining decision). So the performance inliner actually visits the whole call tree.

However, imagine if f() calls g() which calls h() which calls i(). If all four of f, g, h, and i are serialized, then we will deserialize them all as soon as anything references f(). But the performance inliner might choose to inline f(), and not g(), therefore the deserialization of h() and i() is unnecessary.

Or am I misunderstanding the issue here?
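The f() → g() → h() → i() chain above can be sketched as a toy in Python (hypothetical names, not the real deserializer): eager linking pulls in the transitive closure as soon as f is referenced, while lazy linking loads a body only when the inliner actually asks for it.

```python
# Call graph of the serialized functions in the example.
CALLEES = {"f": ["g"], "g": ["h"], "h": ["i"], "i": []}

def eager_link(name, loaded):
    """Deserialize `name` and, transitively, everything it references."""
    loaded.append(name)
    for callee in CALLEES[name]:
        eager_link(callee, loaded)
    return loaded

def lazy_link(requested):
    """Deserialize only the bodies the inliner actually requested."""
    return list(requested)

print(eager_link("f", []))   # ['f', 'g', 'h', 'i']
print(lazy_link(["f"]))      # ['f']: inliner inlined f but rejected g
```

In the lazy scheme, the loads for h and i happen only if the inliner later decides it wants to look inside g.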

···

On Jan 4, 2018, at 1:08 PM, Erik Eckstein <eeckstein@apple.com> wrote:

On Jan 2, 2018, at 1:08 PM, Slava Pestov via swift-dev <swift-dev@swift.org> wrote:

On Dec 28, 2017, at 4:32 PM, Chris Lattner via swift-dev <swift-dev@swift.org> wrote:

Would it make sense to put deserialized function bodies into a separate SIL module

We serialize early in the pipeline, i.e. serialized functions are not (fully) optimized. And at least the performance inliner needs functions to be optimized to make good inlining decisions. So it makes sense to also optimize deserialized functions.

That said, I’m sure there is still potential for improvements. For example, we could exclude deserialized generic functions from optimizations, because we only inline specialized functions.

Slava

I haven’t done any measurements, but this seems like it could be a big speedup, particularly for programs containing a bunch of relatively small files and not using WMO.

-Chris


1. It looks like the MandatoryInliner is the biggest culprit at -O0 here: it deserializes the referenced function (MandatoryInlining.cpp:384) and *then* checks to see if the callee is @_transparent. Would it make sense to change this to check for @_transparent first (which might require a SIL change?), and only deserialize if so?

This seems like a clear win.

+1

It should be a trivial change and I’m wondering why we haven’t done this yet.
I filed SR-6697 ("Don't deserialize non-transparent functions in the mandatory inliner"), now tracked as apple/swift issue #49246 on GitHub.

Thanks!

2. The performance inliner will have the same issue after this, and deserializing the bodies of all inlinable referenced functions is unavoidable for it. However, we don’t have to copy the SIL into the current module and burn compile time by subjecting it to all of the standard optimizations again. Would it make sense to put deserialized function bodies into a separate SIL module, and teach the (few) IPA/IPO optimizations about this fact? This should be very straightforward to do for all of the optimizations I’m aware of.

What if we deserialized function bodies lazily instead of deserializing the transitive closure of all serialized functions referenced from a function?

Well, with our pass pipeline architecture I suspect it will not make a difference. We process functions bottom-up. For example, the performance inliner optimizes the callee first before trying to inline it (because it influences the inlining decision). So the performance inliner actually visits the whole call tree.

Would it make sense to put deserialized function bodies into a separate SIL module

We serialize early in the pipeline, i.e. serialized functions are not (fully) optimized.

Really? The serialized functions in the standard library aren’t optimized? That itself seems like a significant issue: you’re pushing optimization compile-time cost onto every user’s source file that uses an unoptimized stdlib symbol.

And at least the performance inliner needs functions to be optimized to make good inlining decisions. So it makes sense to also optimize deserialized functions.

That said, I’m sure there is still potential for improvements. For example, we could exclude deserialized generic functions from optimizations, because we only inline specialized functions.

If the serialized functions are in fact optimized, you have a lot of ways to avoid deserializing in practice. There just aren’t that many IPO/IPA passes in the compiler, so you can build the summaries they need into the serialized SIL code. If they aren’t optimized, then there are bigger problems.

-Chris

···

On Jan 4, 2018, at 1:08 PM, Erik Eckstein <eeckstein@apple.com> wrote:

Folks working on the SIL optimizer, particularly those interested in faster builds:

If I understand the SIL optimizer correctly, it seems that when the current program references an external symbol declared as @_inlinable, SILModule::linkFunction eagerly deserializes the @_inlinable body and splats it into the current module. That SIL function exists in the current module, gets optimized, inlined, etc. along with existing functions, then gets dropped on the floor at IRGen time if it still exists.

I’ve noticed this too, but haven’t had time to look at it yet.

If this is true, this seems like an incredibly wasteful approach, particularly given how many @_inlinable functions exist in the standard library, and particularly for programs that have lots of small files. Try this:

I agree!

1. It looks like the MandatoryInliner is the biggest culprit at -O0 here: it deserializes the referenced function (MandatoryInlining.cpp:384) and *then* checks to see if the callee is @_transparent. Would it make sense to change this to check for @_transparent first (which might require a SIL change?), and only deserialize if so?

This seems like a clear win.

+1

It should be a trivial change and I’m wondering why we haven’t done this yet.
I filed SR-6697 ("Don't deserialize non-transparent functions in the mandatory inliner"), now tracked as apple/swift issue #49246 on GitHub.

2. The performance inliner will have the same issue after this, and deserializing the bodies of all inlinable referenced functions is unavoidable for it. However, we don’t have to copy the SIL into the current module and burn compile time by subjecting it to all of the standard optimizations again. Would it make sense to put deserialized function bodies into a separate SIL module, and teach the (few) IPA/IPO optimizations about this fact? This should be very straightforward to do for all of the optimizations I’m aware of.

What if we deserialized function bodies lazily instead of deserializing the transitive closure of all serialized functions referenced from a function?

Well, with our pass pipeline architecture I suspect it will not make a difference. We process functions bottom-up. For example, the performance inliner optimizes the callee first before trying to inline it (because it influences the inlining decision). So the performance inliner actually visits the whole call tree.

However, imagine if f() calls g() which calls h() which calls i(). If all four of f, g, h, and i are serialized, then we will deserialize them all as soon as anything references f(). But the performance inliner might choose to inline f(), and not g(), therefore the deserialization of h() and i() is unnecessary.

Or am I misunderstanding the issue here?

To make the inlining decision for g() into f() the optimizer looks at h() and i() as well.

But the question is whether the additional compile time is worth the improved accuracy.
We could definitely do something more intelligent and/or more compile-time-friendly here.

···

On Jan 4, 2018, at 1:14 PM, Slava Pestov <spestov@apple.com> wrote:

On Jan 4, 2018, at 1:08 PM, Erik Eckstein <eeckstein@apple.com> wrote:

On Jan 2, 2018, at 1:08 PM, Slava Pestov via swift-dev <swift-dev@swift.org> wrote:

On Dec 28, 2017, at 4:32 PM, Chris Lattner via swift-dev <swift-dev@swift.org> wrote:

Would it make sense to put deserialized function bodies into a separate SIL module

We serialize early in the pipeline, i.e. serialized functions are not (fully) optimized. And at least the performance inliner needs functions to be optimized to make good inlining decisions. So it makes sense to also optimize deserialized functions.

That said, I’m sure there is still potential for improvements. For example, we could exclude deserialized generic functions from optimizations, because we only inline specialized functions.

Slava

I haven’t done any measurements, but this seems like it could be a big speedup, particularly for programs containing a bunch of relatively small files and not using WMO.

-Chris


1. It looks like the MandatoryInliner is the biggest culprit at -O0 here: it deserializes the referenced function (MandatoryInlining.cpp:384) and *then* checks to see if the callee is @_transparent. Would it make sense to change this to check for @_transparent first (which might require a SIL change?), and only deserialize if so?

This seems like a clear win.

+1

It should be a trivial change and I’m wondering why we haven’t done this yet.
I filed SR-6697 ("Don't deserialize non-transparent functions in the mandatory inliner"), now tracked as apple/swift issue #49246 on GitHub.

Thanks!

2. The performance inliner will have the same issue after this, and deserializing the bodies of all inlinable referenced functions is unavoidable for it. However, we don’t have to copy the SIL into the current module and burn compile time by subjecting it to all of the standard optimizations again. Would it make sense to put deserialized function bodies into a separate SIL module, and teach the (few) IPA/IPO optimizations about this fact? This should be very straightforward to do for all of the optimizations I’m aware of.

What if we deserialized function bodies lazily instead of deserializing the transitive closure of all serialized functions referenced from a function?

Well, with our pass pipeline architecture I suspect it will not make a difference. We process functions bottom-up. For example, the performance inliner optimizes the callee first before trying to inline it (because it influences the inlining decision). So the performance inliner actually visits the whole call tree.

Would it make sense to put deserialized function bodies into a separate SIL module

We serialize early in the pipeline, i.e. serialized functions are not (fully) optimized.

Really? The serialized functions in the standard library aren’t optimized? That itself seems like a significant issue: you’re pushing optimization compile-time cost onto every user’s source file that uses an unoptimized stdlib symbol.

First of all, serialized functions are optimized, but not with the full optimization pipeline. We serialize in the middle of the pass pipeline.
The reason is that this solves two problems:
1) We cannot serialize fragile functions into which resilient functions have been inlined, because that would expose resilient code to the client. On the other hand, we want to enable this kind of inlining for code generated in the module itself. In other words, we need different optimization pipelines for code generation and serialization anyway.
2) The optimization pipeline is split into “high-level” and “low-level” parts with respect to @_semantics functions. In the high-level part, @_semantics functions are not inlined. If we serialized after such functions were inlined, we would deserialize “low-level” SIL into “high-level” SIL.

I’m not worried about the compile time impact of early serialization. When we did that change we measured compile time and didn’t see a significant difference.

And at least the performance inliner needs functions to be optimized to make good inlining decisions. So it makes sense to also optimize deserialized functions.

That said, I’m sure there is still potential for improvements. For example, we could exclude deserialized generic functions from optimizations, because we only inline specialized functions.

If the serialized functions are in fact optimized, you have a lot of ways to avoid deserializing in practice. There just aren’t that many IPO/IPA passes in the compiler, so you can build the summaries they need into the serialized SIL code. If they aren’t optimized, then there are bigger problems.

Inlining decisions also depend on the caller context, e.g. function argument values: if an argument is constant and that argument controls a condition in the callee, this is taken into account.
It’s possible to model this in summary information, but it’s not trivial.
But, as I said, there are definitely many possibilities for improvements.
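The point above can be illustrated with a hedged toy cost model (all numbers and names made up): a callee's branch on an argument makes its cost 1 or 1001 "instructions", so a context-free summary must assume the worst, while a caller passing a known constant can account for the branch folding away.

```python
# Hypothetical per-path costs for a callee like:
#   func step(x, fast) { if fast { cheap } else { expensive } }
CALLEE_COSTS = {"cheap_path": 1, "expensive_path": 1000}

def callee_cost(fast=None):
    """Cost estimate; `fast` is the argument value if known constant."""
    if fast is True:
        return CALLEE_COSTS["cheap_path"]       # branch folds away
    if fast is False:
        return CALLEE_COSTS["expensive_path"]
    return sum(CALLEE_COSTS.values())           # unknown: assume both paths

BUDGET = 50
print(callee_cost(True) <= BUDGET)   # True: worth inlining into this caller
print(callee_cost(None) <= BUDGET)   # False: context-free summary says no
```

A summary scheme would have to carry enough structure (e.g. "cost if argument N is constant") to recover the context-sensitive answer, which is exactly the non-trivial part.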

···

On Jan 4, 2018, at 4:57 PM, Chris Lattner <clattner@nondot.org> wrote:

On Jan 4, 2018, at 1:08 PM, Erik Eckstein <eeckstein@apple.com <mailto:eeckstein@apple.com>> wrote:

-Chris

Consider Slava’s example of "f() calls g() which calls h() which calls i()” where “f” is in the user module and g/h/i are in the standard library.

In most cases, when building the standard library, h and i will be inlined into g and should be serialized as just g with no calls in it. If you have a summary for the size of g (and whatever other heuristics the inliner is using) you can consult that and avoid deserializing the function if it doesn’t make sense to inline g/h/i.

OTOH, if “h" was too big to inline into “g”, then you’d consult the summary for “g” and decide whether it makes sense to inline g into f. At that point, you have a call to h, and consult the summary for h to see if it makes sense to inline it, etc.
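The summary scheme described above can be sketched as a toy in Python (all sizes and names hypothetical): serialized SIL carries a per-function summary of its size and remaining calls, and the inliner walks summaries top-down, deserializing a body only after deciding to inline it.

```python
# Per-function summaries shipped alongside the serialized SIL.
SUMMARIES = {
    "g": {"size": 12, "calls": ["h"]},   # h was too big to inline into g
    "h": {"size": 400, "calls": []},     # i was already inlined into h
}

def inline_walk(name, budget, bodies_loaded):
    """Inline `name` if its summary fits the budget, then recurse."""
    summary = SUMMARIES[name]
    if summary["size"] > budget:
        return                       # rejected from the summary alone
    bodies_loaded.append(name)       # only now pay for deserialization
    for callee in summary["calls"]:
        inline_walk(callee, budget, bodies_loaded)

loaded = []
inline_walk("g", budget=50, bodies_loaded=loaded)
print(loaded)   # ['g']: g inlined into f; h's body was never deserialized
```

Each inlining decision exposes at most the direct calls of the function just inlined, so summaries are consulted one level at a time and the walk composes.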

In other words, this all composes.

-Chris

···

On Jan 4, 2018, at 2:11 PM, Erik Eckstein via swift-dev <swift-dev@swift.org> wrote:

Well, with our pass pipeline architecture I suspect it will not make a difference. We process functions bottom-up. For example, the performance inliner optimizes the callee first before trying to inline it (because it influences the inlining decision). So the performance inliner actually visits the whole call tree.

However, imagine if f() calls g() which calls h() which calls i(). If all four of f, g, h, and i are serialized, then we will deserialize them all as soon as anything references f(). But the performance inliner might choose to inline f(), and not g(), therefore the deserialization of h() and i() is unnecessary.

Or am I misunderstanding the issue here?

To make the inlining decision for g() into f() the optimizer looks at h() and i() as well.

But the question is whether the additional compile time is worth the improved accuracy.
We could definitely do something more intelligent and/or more compile-time-friendly here.