Hi Swift-Dev,
We care deeply about the size of binaries that are generated by the Swift compiler and make an effort to optimize and shrink the generated binaries. One of the problems that we have today is that swift symbols are mangled into extremely long strings. This is especially a problem for libraries, and almost half of the size of libswiftCore.dylib (the swift runtime library on x86_64 OSX) is string tables. On MacOSX you can use the command ’size -m file.dylib’ to read the size of the string table. C++ also suffers from the problem of long names, but since we control the Swift ABI we can do better than C++.
Here are two names that I found in our libraries:
__TIF14StdlibUnittest13checkSequenceu0_Rxs14CollectionType_s12SequenceTypeWx9Generator7Element_zW_9GeneratorS3__rFTxq_KT_SS9showFrameSb10stackTraceVS_14SourceLocStack4fileSS4lineSu16resiliencyChecksVS_32CollectionMisuseResiliencyChecks9sameValueFTWxS2_S3__WxS2_S3___Sb_T_A6_
__TTSg5VSS13CharacterViewS_s14CollectionTypes_GVs17IndexingGeneratorS__GS1_S__s13GeneratorTypes_VS_5IndexS3_s16ForwardIndexTypes_SiSis18_SignedIntegerTypes_SiSis33_BuiltinIntegerLiteralConvertibles_Vs20_DisabledRangeIndex__S_S_s9IndexablesS_s12SequenceTypes_GS1_S__GS1_S__S2_s_Vs9Character_S3_S3_S4_s_SiSiS5_s_SiSiS6_s_S7__S__S10__S10____TFFSS12replaceRangeuRxs14CollectionTypeWx9Generator7Element_zVs9CharacterrFTGVs5RangeVVSS13CharacterView5Index_4withx_T_U_FRS4_T_
One way to solve this problem is to implement a better string table format that will include some kind of compression or tree encoding into MachO, ELF and PE. The work on MachO, ELF and PE is beyond the scope of the Swift project and I’d like to focus on improving things on the Swift side.
I see two independent tasks that we can do to shorten the length of string symbols. First, we can improve the existing mangling. We can do things like use non-decimal length specifiers, etc. The second task, which is the subject of the rest of this email, is the implementation of a compression layer on top of the existing mangling scheme.
Compressing long symbols
The Swift compiler is free to encode Swift names in whatever way it likes, but we need to follow a few basic rules. First, the character encoding space needs to be [_a-zA-Z0-9]. We can’t use non-ascii characters to encode Swift names because this is the character space that ELF supports. Second, names need to be unique and independent. We can’t just compress the whole string table or create names that are a running counter because we need to be able to inspect each name independently. We need to be able to decode the name independently and extract the information that’s stored in the name. We also need to be able to decode the name to be able to present something useful in the debugger. This means that a simple one directional hashing of the name is not an option.
With these restrictions in mind, I believe that the problem that we need to solve is independent compression of names. I suggest that we add a post-processing string compression pass. The input to the compress method would be a string mangled name, like the two names I showed above, and the output would be a shorter string. We would also need a decompression method that would return the original string.
Obviously, gzipping each string independently won’t be effective. I believe that we need to implement our own compression algorithm here that will work with the restrictions that I mentioned above (being textual) and with the prior knowledge we have.
Example of a basic compression algorithm
In this section I describe a trivial compression algorithm that can reduce the size of the string tables by 30%. This algorithm is not optimal but it is a good starting point for our discussion. One example of "prior knowledge” that I mentioned above is that we know what are the common sub-strings that are used in Swift names. Some of the most common substrings in Swift mangled names are:
1. "S_S_S_S"
2. "ollectio”
3. “Type”
4. “Index”
5. “erator”
6. “7Element"
7. “able"
We can use this prior knowledge about names of Swift types to compress our mangling! Consider the following encoding:
A string like this:
__TTSg5VSS13CharacterView
Would be translated into something like:
__TTSg5<reference to word #67>13<reference to word #43>View
Commonly used strings can be encoded as references to global tables. We need to have some escape character (that needs to be ascii-printable), and use that character to encode the reference.
One interesting bit of information is that the character ‘Y’ is only used 4 times in the entire standard library! The letter J, and a few other letters are also not used very frequently. We can use these letters as escape characters.
The basic encoding that I experimented with used the following rules:
1. Use two escape characters that are not frequently used in names (Y and Z). These characters are escape character and can’t be used without escaping. ‘Y’ will be encoded as ‘YY’, and ‘Z’ would be encoded as ‘YZ’.
2. The most commonly used sub-strings (calculated as length of substring times number of occurrences) would be encoded with a single escape character and a single index character. The table of highly repetitive substrings can only contain 61 entries (a-zA-Z0-9_, minus two escape characters).
A reference to the very frequent substrings would be encoded as “Yx”, where x is the character that’s translated into a numeric index. This is two chars per substring.
3. The less commonly used substrings are encoded as three-character sequence. “Zxx”, where xx is the numeric index in the large table (that can hold 63*63 substrings).
It is obvious how to reverse this compression using the same string table used to compress the names.
With this encoding scheme the name “__TwxxV14StdlibUnittest24MinimalForwardCollection” becomes “Y5wxxJ0UJ1HJ3_Jyn4J5VJ6SdY9on” and the name “__TMPVs15ContiguousArray” becomes “Y5MPJ3r5J5EJ82”.
I benchmarked this basic technique by using 5 dylibs as the training set (SwiftCore, Simd, unittests, etc) and compressed the string tables of these binaries. The result was a ~31% reduction in size.
On the Swift dylib itself (training set and input to the compressor) the wins are about ~25%.
This encoding makes the symbol name a lot less intuitive for humans, but at the SIL-level we already print human readable strings above function names as comments. I believe that the debugger would just work as-is since the internal APIs won’t change.
One of the disadvantages of using a constant string table is that once we rename the APIs in the standard library the string table would be less effective in compressing it and code that uses it.
What’s next?
The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.
Nadav