As some people expressed interest in working on code size optimizations, I'd like to start a thread to share ideas what we could do to reduce code size.
Here are some items of my list:
Improve function merging: In the swift compiler, we are running a LLVM pass to merge "similar" functions (swift/LLVMMergeFunctions.cpp at main · apple/swift · GitHub). Currently this is quite limited. If we extend the possibility to accept more "differences" (most importantly differing constant offsets of getelementptr), the pass could merge more functions. This is especially important to merge specializations of the same generic function, where e.g. just the type sizes mismatch.
Invest in a more powerful function outlining algorithm (also on LLVM level). This could enable even more code sharing between specialized generic functions.
Improve the inlining heuristics. This is an ongoing topic and, although it sounds simple, it's a surprisingly hard problem.
Code size optimizations are an all-time ongoing effort. And it's pretty important! Since swift-3 we added countless small and larger improvements to reduce code size.
My intention here is to make this effort a bit more transparent. So that people can share ideas and contributors can pick up ideas and add optimizations improvements.
In addition to the LLVM improvements Erik mentioned, a lot of the changes we've been making to Swift fall into these categories:
Replacing code-driven runtime mechanisms with compact data-driven ones. One example of this is using mangled names to encode types we want to access metadata for, and using a runtime interpreter to perform the access, instead of open-coding sequences of accessor calls.
Systematic improvements to SIL, such as OSSA and opaque values, that allow for more pervasive optimization. Unnecessary ARC traffic and extra copies in generic code are significant code size sinks.
Looking for opportunities where on-demand-generated definitions, such as generic specializations and thunks, can be migrated to centralized locations such as the libswiftCore.dylib, so that all Swift binaries can share one copy.
Improving lazy code generation in Swift's IRGen and SILGen stages, so that fewer unused private/internal/on-demand-generated definitions get emitted in the first place.
Work is ongoing in all of those areas. More recently, we've also been investigating whether we can be less conservative about marking definitions used, so that linker-level dead stripping can be more effective, which you can see in this PR:
One hazard in doing this is breaking apps that rely on type reflection, so we'd need some way to more explicitly track types whose metadata is needed at runtime. @omax's proposal here is one possibility:
Here is a somewhat random list of code size improvement ideas we're still working on, or planning to start work on. (This should not be taken as us "claiming" this work for ourselves exclusively, if any other motivated people or teams want to investigate any of these projects!)
We now use mangled names for accessing fully specialized type references, and we want to extend the mangling to be able to do so for protocol conformances and witness tables as well. This will also address one of the launch time regression issues from using mangled names, which is the need to re-look-up conformances while resolving mangled generic arguments.
We also want to improve the performance and code size of the runtime demangler; it's based on the demangler implementation from the swift-demangle tool as a bootstrapping step, but this has a lot of unnecessary functionality, and a design that's inefficient for runtime use, where it first builds a parse tree and the runtime then has to walk the tree. A demangler implementation that could be pared down to only what was necessary for resolving types and conformances, and which was callback-based so that the mangled name could be interpreted on the fly without an intermediate tree representation, ought to be able to reduce the interpreter overhead to the point we could use mangled names for all metadata accesses, not only fully specialized ones.
Value witness functions can also be a significant source of code size, and there are a number of approaches we can investigate to reduce their size:
Types with the same layout (meaning the same size, alignment, stride, extra inhabitants for enum cases, and retain/released fields in the same places) can share value witnesses. We only take advantage of this in a handful of special cases, but IRGen could have a more systematic notion of "type layout", which we can use as the key for emitting and sharing value witnesses, instead of individual types.
We need to do type layout optimization at some point (such as reordering struct fields to minimize padding), and there is also an opportunity to optimize type layout with an eye toward creating shared layouts more often. Layout rules such as favoring putting refcounted fields at the beginning of a type's layout could cause value witnesses to be shared more often.
Along the same lines of using mangled names to represent type accesses, we can come up with a runtime encoding for type layouts, and use that instead of open-coded value witnesses for the majority of value witness implementations.
We could add language functionality that makes it easier to make the inline size of value types smaller, such as indirect fields in structs. Smaller types generally need less code to copy and destroy.
The Clang importer can generate a lot of code on behalf of imported C types, including type metadata, synthesized conformances, thunks for ObjC methods, and so on. A lot of these are not terribly interesting and don't really vary their CPU-level implementation across types (for instance, every imported C enum's RawRepresentable conversion is just a bitcast), so providing shared implementations of common witness tables and thunks in the runtime could save a lot of code size in mixed Swift-ObjC projects.
Generic type metadata instantiation has to compute the offsets of fields in a generic type, and it does so with currently open-coded instantiation code. Some or all of this could be driven by the field reflection metadata that also already exists in the runtime.
I'll add more to this list as I have time. Hopefully that at least gives you an idea of some of what we've done, and what we plan to still do.
Thanks Joe! I think we always appreciate organized "design rationale release notes" like this.
Tangent: release notes for other areas would be appreciated too. For example, @Chris_Lattner3 says resilience existed as an idea since the beginning but was only ~finished in Swift 5. "How Swift Achieved Dynamic Linking Where Rust Couldn't" happens to scratch the design rationale itch for resilience.
Maybe the following doesn't suit well for this thread, but speaking of VWT and source-level optimizations does it make sense to extend stdlib with more public Trivial/POD types that are known statically(like existing semi-private StaticString, or private _FixedArray16),
which won't break the trivialness, and to encourage developers to use them when it's appropriate?
Some of my synthetic tests show that the win of POD over non-POD types can start from 20% for simple layouts and go up very fast for complicated nested layouts with many ARC fields.
Defining known-POD types can be useful for addressing hot paths in your own code, but it has some unfortunate effects on the ecosystem and API design if we encourage people to do it pervasively. There's still a lot of work to be done to improve code generation and optimization to remove ARC, and provide language functionality to make it easier to avoid situations that fundamentally need ARC, so that the standard Swift types can be used in most situations without overhead.
Would it help if we relaxed the behaviour which preserves unique trap locations?
This is just something that I wonder about quite often. Whenever I profile Swift code and look at the disassembly, essentially every single function has a looooong tail of ud2s, and I wonder if they couldn’t be merged.
I was very hopeful that we could outline the long tails and move them to a "cold" section of the binary, but unfortunately when that was attempted it had unintuitive and not all that positive consequences. I still think there's a win lurking there somewhere but it will require deeper analysis than I expected.
Yeah, any attempt to move the trap addresses further away seems like it'll make the code size issue even worse, because you'll run into ISA-level branch range issues. It would be cool if conditional call instructions came back into fashion, so you could have one trap whose origin address got pushed into the link register by the branch to the trap.
Thanks for posting this @Erik_Eckstein. I spent some time prototyping up some rough code to hoist GEPs out of functions that are largely similar (with the exception of ConstantInt offsets in their GEPs). This is really rough, but you can have a look at https://github.com/plotfi/swift/commit/769ab586. From what I can tell, this allows for protocol witness functions and property getters to get merged together. The one downside I've noticed on top of tree swift+llvm is that when the MachineOutliner does a really good job of outlining, this GEP hoisting technique (and pushing the remaining similar instructions into a separate function that is called, which gets merged) can actually make things larger because of the added calling convention code.
I was wondering, do you have some ideas of handling the GEP cases without putting the similar code into a separate function that needs to get called? I am still working on some tests and some before/after IR to post. Thanks again.
On x86_64 in 64bit mode, it looks like only jo rel8 and jo rel32 are supported, jo rel16 isnt valid so the branch range does become an issue quite quickly.
Could some of the ud2 instructions be put in the padding used to align the functions, then checks that happen early in a function (eg precondition) can hopefully do a jo rel8 backwards to reach them instead of a jo rel32 to the end?
Also could multiple functions share ud2 instructions? Whilst all of the jos inside a function need to be to unique addresses, could multiple functions jump to the same instruction since the call stack would differentiate which function was at fault? Once a jo uses a rel32 destination it seems a function could probably use a pre-existing ud2.
Those both sound like interesting ideas. Can DWARF line tables encode different locations for a PC with different return addresses? Either of the tricks you suggest would probably need deeper cooperation from LLVM to achieve, though. That's not necessarily a bad thing, since better LLVM-level support for the trap properties Swift wants would be a generally good thing for performance and code size as well. The way we currently generate a branch and trap, with a barrier before each trap to keep LLVM from coalescing them, interferes with a lot of local optimizations that could otherwise still apply. If LLVM had a conditional trap intrinsic that didn't introduce control flow, similar to SIL's cond_fail instruction, and better support for identifiable traps, we'd probably get better code generation for Swift code, in addition to allowing the LLVM backends' instruction selection to be more clever about stuffing identifiable traps in optimal places.
@porglezomp brings up the fact that branch predictors generally treat backward branches as being taken in their ground state, so there could be some cost to branch prediction accuracy if we try to stuff traps before rather than after the main function body.
There's also the ModuleSummary work by @kateinoigakukun, which is waiting for review. In our preliminary testing with the SwiftWasm toolchain it led to at least 20% binary size improvement on average. I guess the potential improvement could be more than that, as currently we don't ship module summaries for stdlib together with the toolchain/SDK archive.
Also, I don't think anyone has mentioned ICU. It may not be a problem on platforms with dynamic linking, but as WebAssembly currently only supports static linking, it's probably the biggest cause of code size bloat for us.
I think further ARC optimiser improvements would also be valuable for code size improvements. Excessive ARC traffic leads to code bloat in a few different ways. It leads to extra register shuffling before and after some operations to shuffle arguments in and out. It can lead to the kinds of differences that prevent similar function merging. It can prevent inlining in cases where that inlining would be helpful.
Many of the necessary improvements may include giving the ABI better tools to communicate how objects are used. I've run into issues where communicating the constraints of the code via things like @_effects can substantially shrink the size of function bodies.
The thing about ICU is that the unicode algorithms are all data-driven. We could try to chop up the data in slightly different ways, but it's unlikely that we'd see significant code-size improvements from that.
That's not to say it isn't worth doing (it is), but for other reasons.
Even if we did absorb and somehow shrink the unicode data, you'd still have to bring your own timezone data for simple things like formatting dates. This is just a weakness of WebAssembly, I'm afraid.