Reducing the size of Swift binaries by shortening symbols

Nadav_Rotem · December 18, 2015, 11:42pm

Hi Swift-Dev,

We care deeply about the size of binaries that are generated by the Swift compiler and make an effort to optimize and shrink the generated binaries. One of the problems that we have today is that swift symbols are mangled into extremely long strings. This is especially a problem for libraries, and almost half of the size of libswiftCore.dylib (the swift runtime library on x86_64 OSX) is string tables. On MacOSX you can use the command ’size -m file.dylib’ to read the size of the string table. C++ also suffers from the problem of long names, but since we control the Swift ABI we can do better than C++.

Here are two names that I found in our libraries:

__TIF14StdlibUnittest13checkSequenceu0_Rxs14CollectionType_s12SequenceTypeWx9Generator7Element_zW_9GeneratorS3__rFTxq_KT_SS9showFrameSb10stackTraceVS_14SourceLocStack4fileSS4lineSu16resiliencyChecksVS_32CollectionMisuseResiliencyChecks9sameValueFTWxS2_S3__WxS2_S3___Sb_T_A6_

__TTSg5VSS13CharacterViewS_s14CollectionTypes_GVs17IndexingGeneratorS__GS1_S__s13GeneratorTypes_VS_5IndexS3_s16ForwardIndexTypes_SiSis18_SignedIntegerTypes_SiSis33_BuiltinIntegerLiteralConvertibles_Vs20_DisabledRangeIndex__S_S_s9IndexablesS_s12SequenceTypes_GS1_S__GS1_S__S2_s_Vs9Character_S3_S3_S4_s_SiSiS5_s_SiSiS6_s_S7__S__S10__S10____TFFSS12replaceRangeuRxs14CollectionTypeWx9Generator7Element_zVs9CharacterrFTGVs5RangeVVSS13CharacterView5Index_4withx_T_U_FRS4_T_

One way to solve this problem is to implement a better string table format that will include some kind of compression or tree encoding into MachO, ELF and PE. The work on MachO, ELF and PE is beyond the scope of the Swift project and I’d like to focus on improving things on the Swift side.

I see two independent tasks that we can do to shorten the length of string symbols. First, we can improve the existing mangling. We can do things like use non-decimal length specifiers, etc. The second task, which is the subject of the rest of this email, is the implementation of a compression layer on top of the existing mangling scheme.

Compressing long symbols

The Swift compiler is free to encode Swift names in whatever way it likes, but we need to follow a few basic rules. First, the character encoding space needs to be [_a-zA-Z0-9]. We can’t use non-ascii characters to encode Swift names because this is the character space that ELF supports. Second, names need to be unique and independent. We can’t just compress the whole string table or create names that are a running counter because we need to be able to inspect each name independently. We need to be able to decode the name independently and extract the information that’s stored in the name. We also need to be able to decode the name to be able to present something useful in the debugger. This means that a simple one directional hashing of the name is not an option.

With these restrictions in mind, I believe that the problem that we need to solve is independent compression of names. I suggest that we add a post-processing string compression pass. The input to the compress method would be a string mangled name, like the two names I showed above, and the output would be a shorter string. We would also need a decompression method that would return the original string.

Obviously, gzipping each string independently won’t be effective. I believe that we need to implement our own compression algorithm here that will work with the restrictions that I mentioned above (being textual) and with the prior knowledge we have.

Example of a basic compression algorithm

In this section I describe a trivial compression algorithm that can reduce the size of the string tables by 30%. This algorithm is not optimal but it is a good starting point for our discussion. One example of "prior knowledge” that I mentioned above is that we know what are the common sub-strings that are used in Swift names. Some of the most common substrings in Swift mangled names are:

1. "S_S_S_S"
2. "ollectio”
3. “Type”
4. “Index”
5. “erator”
6. “7Element"
7. “able"

We can use this prior knowledge about names of Swift types to compress our mangling! Consider the following encoding:

A string like this:

__TTSg5VSS13CharacterView

Would be translated into something like:

__TTSg5<reference to word #67>13<reference to word #43>View

Commonly used strings can be encoded as references to global tables. We need to have some escape character (that needs to be ascii-printable), and use that character to encode the reference.
One interesting bit of information is that the character ‘Y’ is only used 4 times in the entire standard library! The letter J, and a few other letters are also not used very frequently. We can use these letters as escape characters.

The basic encoding that I experimented with used the following rules:

1. Use two escape characters that are not frequently used in names (Y and Z). These characters are escape character and can’t be used without escaping. ‘Y’ will be encoded as ‘YY’, and ‘Z’ would be encoded as ‘YZ’.

2. The most commonly used sub-strings (calculated as length of substring times number of occurrences) would be encoded with a single escape character and a single index character. The table of highly repetitive substrings can only contain 61 entries (a-zA-Z0-9_, minus two escape characters).

A reference to the very frequent substrings would be encoded as “Yx”, where x is the character that’s translated into a numeric index. This is two chars per substring.

3. The less commonly used substrings are encoded as three-character sequence. “Zxx”, where xx is the numeric index in the large table (that can hold 63*63 substrings).

It is obvious how to reverse this compression using the same string table used to compress the names.

With this encoding scheme the name “__TwxxV14StdlibUnittest24MinimalForwardCollection” becomes “Y5wxxJ0UJ1HJ3_Jyn4J5VJ6SdY9on” and the name “__TMPVs15ContiguousArray” becomes “Y5MPJ3r5J5EJ82”.

I benchmarked this basic technique by using 5 dylibs as the training set (SwiftCore, Simd, unittests, etc) and compressed the string tables of these binaries. The result was a ~31% reduction in size.

On the Swift dylib itself (training set and input to the compressor) the wins are about ~25%.

This encoding makes the symbol name a lot less intuitive for humans, but at the SIL-level we already print human readable strings above function names as comments. I believe that the debugger would just work as-is since the internal APIs won’t change.

One of the disadvantages of using a constant string table is that once we rename the APIs in the standard library the string table would be less effective in compressing it and code that uses it.

What’s next?

The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.

Nadav

gribozavr · December 20, 2015, 10:17am

+ Stephen Canon, because he probably has good ideas in this domain.

*What’s next?*

The small experiment I described above showed that compressing the names
in the string table has a huge potential for reducing the size of swift
binaries. I’d like for us (swift-developers) to talk about the implications
of this change and start working on the two tasks of tightening our
existing mangling format and on implementing a new compression layer on
top.

Hi Nadav,

This is a great start that shows that there is a potential for improvement
in our mangled names!

To make this effort more visible, I would suggest creating a bug on
https://bugs.swift.org/ .

I think we survey existing solutions that industry has developed for
compressing short messages. What comes to mind:

- header compression in HTTP2:
https://http2.github.io/http2-spec/compression.html

- PPM algorithms are one of the best-performing compression algorithms for
text.

- Arithmetic coding is also a natural starting point for experimentation.

Since the input mangled name also comes in a restricted character set, we
could also remove useless bits first, and try an existing compression
algorithm on the resulting binary string.

We should also build a scheme that uses shortest one between the compressed
and non-compressed names.

For running experiments it would be useful to publish a sample corpus of
mangled names that we will be using for comparing the algorithms and
approaches.

I also have a concern about making mangled names completely unreadable.
Today, I can frequently at least get a gist of what the referenced entity
is without a demangler. What we could do is make the name consist of a
human-readable prefix that encodes just the base name and a compressed
suffix that encodes the rest of the information.

_T<length><class name><length><method name><compressed suffix>

We would be able to use references to the class and the method name from
the compressed part, so that character data isn't completely wasted.

This scheme that injects human-readable parts will also allow the debugger
to quickly match the names without the need to decompress them.

We should also investigate improving existing mangling scheme to produce
shorter results. For example, one idea that comes to mind is using base-60
instead of base-10 for single-digit numbers that that specify identifier
length, falling back to base-10 for longer numbers to avoid ambiguity.
This would save one character for every identifier longer than 9 characters
and shorter than 60, which is actually the common case.

Dmitri

···

On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev < swift-dev@swift.org> wrote:

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com>*/

ddunbar · December 20, 2015, 9:01pm

Hi Nadav,

One thing you didn't call out here was any connection to visibility.

I would expect that in practice it is quite common for most of the symbols to not be visible outside the linkage unit. Inside the linkage unit, the compiler can already rename many of the symbols in any way it likes. Wouldn't one tact for solving this problem simply be to encourage more pervasive use of whole module optimization, and support compiler renaming of non-exported symbols?

- Daniel

···

On Dec 18, 2015, at 4:42 PM, Nadav Rotem via swift-dev <swift-dev@swift.org> wrote:

Hi Swift-Dev,

We care deeply about the size of binaries that are generated by the Swift compiler and make an effort to optimize and shrink the generated binaries. One of the problems that we have today is that swift symbols are mangled into extremely long strings. This is especially a problem for libraries, and almost half of the size of libswiftCore.dylib (the swift runtime library on x86_64 OSX) is string tables. On MacOSX you can use the command ’size -m file.dylib’ to read the size of the string table. C++ also suffers from the problem of long names, but since we control the Swift ABI we can do better than C++.

Here are two names that I found in our libraries:

__TIF14StdlibUnittest13checkSequenceu0_Rxs14CollectionType_s12SequenceTypeWx9Generator7Element_zW_9GeneratorS3__rFTxq_KT_SS9showFrameSb10stackTraceVS_14SourceLocStack4fileSS4lineSu16resiliencyChecksVS_32CollectionMisuseResiliencyChecks9sameValueFTWxS2_S3__WxS2_S3___Sb_T_A6_

__TTSg5VSS13CharacterViewS_s14CollectionTypes_GVs17IndexingGeneratorS__GS1_S__s13GeneratorTypes_VS_5IndexS3_s16ForwardIndexTypes_SiSis18_SignedIntegerTypes_SiSis33_BuiltinIntegerLiteralConvertibles_Vs20_DisabledRangeIndex__S_S_s9IndexablesS_s12SequenceTypes_GS1_S__GS1_S__S2_s_Vs9Character_S3_S3_S4_s_SiSiS5_s_SiSiS6_s_S7__S__S10__S10____TFFSS12replaceRangeuRxs14CollectionTypeWx9Generator7Element_zVs9CharacterrFTGVs5RangeVVSS13CharacterView5Index_4withx_T_U_FRS4_T_

One way to solve this problem is to implement a better string table format that will include some kind of compression or tree encoding into MachO, ELF and PE. The work on MachO, ELF and PE is beyond the scope of the Swift project and I’d like to focus on improving things on the Swift side.

I see two independent tasks that we can do to shorten the length of string symbols. First, we can improve the existing mangling. We can do things like use non-decimal length specifiers, etc. The second task, which is the subject of the rest of this email, is the implementation of a compression layer on top of the existing mangling scheme.

Compressing long symbols

The Swift compiler is free to encode Swift names in whatever way it likes, but we need to follow a few basic rules. First, the character encoding space needs to be [_a-zA-Z0-9]. We can’t use non-ascii characters to encode Swift names because this is the character space that ELF supports. Second, names need to be unique and independent. We can’t just compress the whole string table or create names that are a running counter because we need to be able to inspect each name independently. We need to be able to decode the name independently and extract the information that’s stored in the name. We also need to be able to decode the name to be able to present something useful in the debugger. This means that a simple one directional hashing of the name is not an option.

With these restrictions in mind, I believe that the problem that we need to solve is independent compression of names. I suggest that we add a post-processing string compression pass. The input to the compress method would be a string mangled name, like the two names I showed above, and the output would be a shorter string. We would also need a decompression method that would return the original string.

Obviously, gzipping each string independently won’t be effective. I believe that we need to implement our own compression algorithm here that will work with the restrictions that I mentioned above (being textual) and with the prior knowledge we have.

Example of a basic compression algorithm

In this section I describe a trivial compression algorithm that can reduce the size of the string tables by 30%. This algorithm is not optimal but it is a good starting point for our discussion. One example of "prior knowledge” that I mentioned above is that we know what are the common sub-strings that are used in Swift names. Some of the most common substrings in Swift mangled names are:

1. "S_S_S_S"
2. "ollectio”
3. “Type”
4. “Index”
5. “erator”
6. “7Element"
7. “able"

We can use this prior knowledge about names of Swift types to compress our mangling! Consider the following encoding:

A string like this:

__TTSg5VSS13CharacterView

Would be translated into something like:

__TTSg5<reference to word #67>13<reference to word #43>View

Commonly used strings can be encoded as references to global tables. We need to have some escape character (that needs to be ascii-printable), and use that character to encode the reference.
One interesting bit of information is that the character ‘Y’ is only used 4 times in the entire standard library! The letter J, and a few other letters are also not used very frequently. We can use these letters as escape characters.

The basic encoding that I experimented with used the following rules:

1. Use two escape characters that are not frequently used in names (Y and Z). These characters are escape character and can’t be used without escaping. ‘Y’ will be encoded as ‘YY’, and ‘Z’ would be encoded as ‘YZ’.

2. The most commonly used sub-strings (calculated as length of substring times number of occurrences) would be encoded with a single escape character and a single index character. The table of highly repetitive substrings can only contain 61 entries (a-zA-Z0-9_, minus two escape characters).

A reference to the very frequent substrings would be encoded as “Yx”, where x is the character that’s translated into a numeric index. This is two chars per substring.

3. The less commonly used substrings are encoded as three-character sequence. “Zxx”, where xx is the numeric index in the large table (that can hold 63*63 substrings).

It is obvious how to reverse this compression using the same string table used to compress the names.

With this encoding scheme the name “__TwxxV14StdlibUnittest24MinimalForwardCollection” becomes “Y5wxxJ0UJ1HJ3_Jyn4J5VJ6SdY9on” and the name “__TMPVs15ContiguousArray” becomes “Y5MPJ3r5J5EJ82”.

I benchmarked this basic technique by using 5 dylibs as the training set (SwiftCore, Simd, unittests, etc) and compressed the string tables of these binaries. The result was a ~31% reduction in size.

<PastedGraphic-2.png>

On the Swift dylib itself (training set and input to the compressor) the wins are about ~25%.

This encoding makes the symbol name a lot less intuitive for humans, but at the SIL-level we already print human readable strings above function names as comments. I believe that the debugger would just work as-is since the internal APIs won’t change.

One of the disadvantages of using a constant string table is that once we rename the APIs in the standard library the string table would be less effective in compressing it and code that uses it.

What’s next?

The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.

Nadav

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

Stephen_Canon · December 20, 2015, 3:35pm

Nadav, can you clarify what we’re really trying to accomplish here? "Smaller binaries” isn’t too important of a goal in and of itself.

Are we trying to:
– reduce storage used on disk
– reduce load time
– reduce loaded memory footprint
– make emitting swift binaries more efficient
– something else?

Yes, I know, “all of the above”, but understanding something about what’s most important would help evaluate the proposal.

It’s also worth keeping in mind that iOS and OS X have been aggressively adopting pervasive system-wide compression both on disk and in memory. This trend will continue, and it makes it quite a bit less important for individual components to explicitly adopt compression techniques themselves, except in cases where there’s a lot of special structure that those components can leverage to get better compression than a general-purpose lossless compressor can manage (images and sound are the two obvious examples of this, but also cases like huge arrays of floating-point data where the low-order bits don’t matter, etc). Linux hasn’t been as aggressive about doing this yet, but pervasive system-level compression is The Future.

– Steve

···

On Dec 20, 2015, at 5:17 AM, Dmitri Gribenko <gribozavr@gmail.com> wrote:

+ Stephen Canon, because he probably has good ideas in this domain.

On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

What’s next?

The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.

Hi Nadav,

This is a great start that shows that there is a potential for improvement in our mangled names!

To make this effort more visible, I would suggest creating a bug on https://bugs.swift.org/ .

I think we survey existing solutions that industry has developed for compressing short messages. What comes to mind:

- header compression in HTTP2:
https://http2.github.io/http2-spec/compression.html

- PPM algorithms are one of the best-performing compression algorithms for text.

- Arithmetic coding is also a natural starting point for experimentation.

Since the input mangled name also comes in a restricted character set, we could also remove useless bits first, and try an existing compression algorithm on the resulting binary string.

We should also build a scheme that uses shortest one between the compressed and non-compressed names.

For running experiments it would be useful to publish a sample corpus of mangled names that we will be using for comparing the algorithms and approaches.

I also have a concern about making mangled names completely unreadable. Today, I can frequently at least get a gist of what the referenced entity is without a demangler. What we could do is make the name consist of a human-readable prefix that encodes just the base name and a compressed suffix that encodes the rest of the information.

_T<length><class name><length><method name><compressed suffix>

We would be able to use references to the class and the method name from the compressed part, so that character data isn't completely wasted.

This scheme that injects human-readable parts will also allow the debugger to quickly match the names without the need to decompress them.

We should also investigate improving existing mangling scheme to produce shorter results. For example, one idea that comes to mind is using base-60 instead of base-10 for single-digit numbers that that specify identifier length, falling back to base-10 for longer numbers to avoid ambiguity. This would save one character for every identifier longer than 9 characters and shorter than 60, which is actually the common case.

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com <mailto:gribozavr@gmail.com>>*/

Nadav_Rotem · December 21, 2015, 5:43am

+ Stephen Canon, because he probably has good ideas in this domain.

What’s next?

The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.

Hi Dmitri,

Hi Nadav,

This is a great start that shows that there is a potential for improvement in our mangled names!

To make this effort more visible, I would suggest creating a bug on https://bugs.swift.org/ .

I think we survey existing solutions that industry has developed for compressing short messages. What comes to mind:

- header compression in HTTP2:
https://http2.github.io/http2-spec/compression.html

- PPM algorithms are one of the best-performing compression algorithms for text.

It is important to keep in mind that we are not inventing a new text-compression algorithms here. We are solving a specific problem with specific characteristics and constraints. Nothing is groundbreaking here. Lempel-Ziv based algorithms rely on repetitions within the text, and can use some prior statistical knowledge on the text (pre-computed tables). Our mangled names are relatively short and don’t have significant repetitions, so having a reference to some pre-computed table, and not having an internal reference is obvious.

On top of the basic encoding layer we need to apply some kind of variable length encoding (Cameron suggested Huffman). The restricted character set is not a problem here. To encode a string of bits using the charset [a-zA-Z0-9_] just do: mod63, div63, mod63, div63, until you consume the whole bit stream.

- Arithmetic coding is also a natural starting point for experimentation.

Since the input mangled name also comes in a restricted character set, we could also remove useless bits first, and try an existing compression algorithm on the resulting binary string.

The upper bits never come into play. You can conceptually imagine that bytes have ~6 bits (63 combinations). You only need to worry about these kind of things when you use variable length encoding (see my previous comment about Cameron’s suggestion to use Huffman encoding as a second pass).

We should also build a scheme that uses shortest one between the compressed and non-compressed names.

For running experiments it would be useful to publish a sample corpus of mangled names that we will be using for comparing the algorithms and approaches.

nm -a libswiftCore.dylib > strings.txt

I also have a concern about making mangled names completely unreadable. Today, I can frequently at least get a gist of what the referenced entity is without a demangler. What we could do is make the name consist of a human-readable prefix that encodes just the base name and a compressed suffix that encodes the rest of the information.

_T<length><class name><length><method name><compressed suffix>

When are you looking at symbols directly? SIL already shows the detangled names as comments, and for everything else there is a demangler.

We would be able to use references to the class and the method name from the compressed part, so that character data isn't completely wasted.

This scheme that injects human-readable parts will also allow the debugger to quickly match the names without the need to decompress them.

We should also investigate improving existing mangling scheme to produce shorter results. For example, one idea that comes to mind is using base-60 instead of base-10 for single-digit numbers that that specify identifier length, falling back to base-10 for longer numbers to avoid ambiguity. This would save one character for every identifier longer than 9 characters and shorter than 60, which is actually the common case.

Yes.

Thanks,
Nadav

···

On Dec 20, 2015, at 2:17 AM, Dmitri Gribenko <gribozavr@gmail.com> wrote:
On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com <mailto:gribozavr@gmail.com>>*/

Nadav_Rotem · December 21, 2015, 6:26am

Hi Nadav,

Hi Daniel,

One thing you didn't call out here was any connection to visibility.

This is a great point!

Apps usually don’t export too many symbols, but they do import symbols from external shared objects and need to keep the long names around. I created a basic Swift one-page iOS App using the Xcode wizard and the generated binary had a string-table of 70K**. The same app in Obj-C had a 12K linkedit section ("size -m" can print this info).

I would expect that in practice it is quite common for most of the symbols to not be visible outside the linkage unit. Inside the linkage unit, the compiler can already rename many of the symbols in any way it likes. Wouldn't one tact for solving this problem simply be to encourage more pervasive use of whole module optimization, and support compiler renaming of non-exported symbols?

I don’t think that Swift Access Control (public/internal/private) is affected by WMO, but I will check.

Thanks,
Nadav

** This is one of the names that the sample app has:

__TTSf4n_s_s_n___TTSg5GVSs12_ArrayBufferPSs9AnyObject__GS_PS0___Ss16_ArrayBufferTypeSs_GVSs15CollectionOfOnePS0___GS2_PS0___Ss14CollectionTypeSs_PS0___GVSs17IndexingGeneratorGS_PS0____GS4_GS_PS0____Ss13GeneratorTypeSs_PS0___PS0___GVSs12_SliceBufferPS0___GS6_PS0___Ss23_CollectionDefaultsTypeSsGS6_PS0___Ss32_CollectionGeneratorDefaultsTypeSs_GS4_GS6_PS0____GS4_GS6_PS0____S5_Ss_PS0___SiSiSs16ForwardIndexTypeSs_SiSiSs18_SignedIntegerTypeSs_SiSiSs33_BuiltinIntegerLiteralConvertibleSs_Si_PS0___GVSs14GeneratorOfOnePS0___GS12_PS0___S5_Ss_OSs3BitS13_S9_Ss_SiSiS10_Ss_SiSiS11_Ss_VSs20_DisabledRangeIndex__PS0___GVSs12_prext_SliceGS2_PS0____GS15_GS2_PS0____S7_SsGS15_GS2_PS0____S8_Ss_GS4_GS15_GS2_PS0_____GS4_GS15_GS2_PS0_____S5_Ss_PS0___S13_S13_S9_Ss_SiSiS10_Ss_SiSiS11_Ss_S14__PS0_____TFSs28_arrayNonSliceInPlaceReplaceu0_Rq_Ss16_ArrayBufferTypeq0_Ss14CollectionTypezqq_S_7Elementqqq0_Ss12SequenceType9GeneratorSs13GeneratorType7Elementzqq_Ss23_CollectionDefaultsType5IndexSizqqq_S3_5IndexSs16ForwardIndexType8DistanceSizqqq_S3_5IndexS4_19_DisabledRangeIndexSizqqqq_S3_5IndexS4_8DistanceSs25IntegerLiteralConvertible18IntegerLiteralTypeSi_FTRq_GVSs5RangeSi_Siq0__T_

···

On Dec 20, 2015, at 1:01 PM, Daniel Dunbar <daniel_dunbar@apple.com> wrote:

- Daniel

On Dec 18, 2015, at 4:42 PM, Nadav Rotem via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

Hi Swift-Dev,

We care deeply about the size of binaries that are generated by the Swift compiler and make an effort to optimize and shrink the generated binaries. One of the problems that we have today is that swift symbols are mangled into extremely long strings. This is especially a problem for libraries, and almost half of the size of libswiftCore.dylib (the swift runtime library on x86_64 OSX) is string tables. On MacOSX you can use the command ’size -m file.dylib’ to read the size of the string table. C++ also suffers from the problem of long names, but since we control the Swift ABI we can do better than C++.

Here are two names that I found in our libraries:

__TIF14StdlibUnittest13checkSequenceu0_Rxs14CollectionType_s12SequenceTypeWx9Generator7Element_zW_9GeneratorS3__rFTxq_KT_SS9showFrameSb10stackTraceVS_14SourceLocStack4fileSS4lineSu16resiliencyChecksVS_32CollectionMisuseResiliencyChecks9sameValueFTWxS2_S3__WxS2_S3___Sb_T_A6_

__TTSg5VSS13CharacterViewS_s14CollectionTypes_GVs17IndexingGeneratorS__GS1_S__s13GeneratorTypes_VS_5IndexS3_s16ForwardIndexTypes_SiSis18_SignedIntegerTypes_SiSis33_BuiltinIntegerLiteralConvertibles_Vs20_DisabledRangeIndex__S_S_s9IndexablesS_s12SequenceTypes_GS1_S__GS1_S__S2_s_Vs9Character_S3_S3_S4_s_SiSiS5_s_SiSiS6_s_S7__S__S10__S10____TFFSS12replaceRangeuRxs14CollectionTypeWx9Generator7Element_zVs9CharacterrFTGVs5RangeVVSS13CharacterView5Index_4withx_T_U_FRS4_T_

One way to solve this problem is to implement a better string table format that will include some kind of compression or tree encoding into MachO, ELF and PE. The work on MachO, ELF and PE is beyond the scope of the Swift project and I’d like to focus on improving things on the Swift side.

I see two independent tasks that we can do to shorten the length of string symbols. First, we can improve the existing mangling. We can do things like use non-decimal length specifiers, etc. The second task, which is the subject of the rest of this email, is the implementation of a compression layer on top of the existing mangling scheme.

Compressing long symbols

The Swift compiler is free to encode Swift names in whatever way it likes, but we need to follow a few basic rules. First, the character encoding space needs to be [_a-zA-Z0-9]. We can’t use non-ascii characters to encode Swift names because this is the character space that ELF supports. Second, names need to be unique and independent. We can’t just compress the whole string table or create names that are a running counter because we need to be able to inspect each name independently. We need to be able to decode the name independently and extract the information that’s stored in the name. We also need to be able to decode the name to be able to present something useful in the debugger. This means that a simple one directional hashing of the name is not an option.

With these restrictions in mind, I believe that the problem that we need to solve is independent compression of names. I suggest that we add a post-processing string compression pass. The input to the compress method would be a string mangled name, like the two names I showed above, and the output would be a shorter string. We would also need a decompression method that would return the original string.

Obviously, gzipping each string independently won’t be effective. I believe that we need to implement our own compression algorithm here that will work with the restrictions that I mentioned above (being textual) and with the prior knowledge we have.

Example of a basic compression algorithm

In this section I describe a trivial compression algorithm that can reduce the size of the string tables by 30%. This algorithm is not optimal but it is a good starting point for our discussion. One example of "prior knowledge” that I mentioned above is that we know what are the common sub-strings that are used in Swift names. Some of the most common substrings in Swift mangled names are:

1. "S_S_S_S"
2. "ollectio”
3. “Type”
4. “Index”
5. “erator”
6. “7Element"
7. “able"

We can use this prior knowledge about names of Swift types to compress our mangling! Consider the following encoding:

A string like this:

__TTSg5VSS13CharacterView

Would be translated into something like:

__TTSg5<reference to word #67>13<reference to word #43>View

Commonly used strings can be encoded as references to global tables. We need to have some escape character (that needs to be ascii-printable), and use that character to encode the reference.
One interesting bit of information is that the character ‘Y’ is only used 4 times in the entire standard library! The letter J, and a few other letters are also not used very frequently. We can use these letters as escape characters.

The basic encoding that I experimented with used the following rules:

1. Use two escape characters that are not frequently used in names (Y and Z). These characters are escape character and can’t be used without escaping. ‘Y’ will be encoded as ‘YY’, and ‘Z’ would be encoded as ‘YZ’.

2. The most commonly used sub-strings (calculated as length of substring times number of occurrences) would be encoded with a single escape character and a single index character. The table of highly repetitive substrings can only contain 61 entries (a-zA-Z0-9_, minus two escape characters).

A reference to the very frequent substrings would be encoded as “Yx”, where x is the character that’s translated into a numeric index. This is two chars per substring.

3. The less commonly used substrings are encoded as three-character sequence. “Zxx”, where xx is the numeric index in the large table (that can hold 63*63 substrings).

It is obvious how to reverse this compression using the same string table used to compress the names.

With this encoding scheme the name “__TwxxV14StdlibUnittest24MinimalForwardCollection” becomes “Y5wxxJ0UJ1HJ3_Jyn4J5VJ6SdY9on” and the name “__TMPVs15ContiguousArray” becomes “Y5MPJ3r5J5EJ82”.

I benchmarked this basic technique by using 5 dylibs as the training set (SwiftCore, Simd, unittests, etc) and compressed the string tables of these binaries. The result was a ~31% reduction in size.

<PastedGraphic-2.png>

On the Swift dylib itself (training set and input to the compressor) the wins are about ~25%.

This encoding makes the symbol name a lot less intuitive for humans, but at the SIL-level we already print human readable strings above function names as comments. I believe that the debugger would just work as-is since the internal APIs won’t change.

One of the disadvantages of using a constant string table is that once we rename the APIs in the standard library the string table would be less effective in compressing it and code that uses it.

What’s next?

The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.

Nadav

_______________________________________________
swift-dev mailing list
swift-dev@swift.org <mailto:swift-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-dev

James_Campbell · December 20, 2015, 9:52pm

I would love to add that uploading iOS apps is much bigger with swift

···

Sent from my iPhone

On 20 Dec 2015, at 15:35, Stephen Canon via swift-dev <swift-dev@swift.org> wrote:

Nadav, can you clarify what we’re really trying to accomplish here? "Smaller binaries” isn’t too important of a goal in and of itself.

Are we trying to:
– reduce storage used on disk
– reduce load time
– reduce loaded memory footprint
– make emitting swift binaries more efficient
– something else?

Yes, I know, “all of the above”, but understanding something about what’s most important would help evaluate the proposal.

It’s also worth keeping in mind that iOS and OS X have been aggressively adopting pervasive system-wide compression both on disk and in memory. This trend will continue, and it makes it quite a bit less important for individual components to explicitly adopt compression techniques themselves, except in cases where there’s a lot of special structure that those components can leverage to get better compression than a general-purpose lossless compressor can manage (images and sound are the two obvious examples of this, but also cases like huge arrays of floating-point data where the low-order bits don’t matter, etc). Linux hasn’t been as aggressive about doing this yet, but pervasive system-level compression is The Future.

– Steve

On Dec 20, 2015, at 5:17 AM, Dmitri Gribenko <gribozavr@gmail.com> wrote:

+ Stephen Canon, because he probably has good ideas in this domain.

On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev <swift-dev@swift.org> wrote:

What’s next?

The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.

Hi Nadav,

This is a great start that shows that there is a potential for improvement in our mangled names!

To make this effort more visible, I would suggest creating a bug on https://bugs.swift.org/ .

I think we survey existing solutions that industry has developed for compressing short messages. What comes to mind:

- header compression in HTTP2:
https://http2.github.io/http2-spec/compression.html

- PPM algorithms are one of the best-performing compression algorithms for text.

- Arithmetic coding is also a natural starting point for experimentation.

Since the input mangled name also comes in a restricted character set, we could also remove useless bits first, and try an existing compression algorithm on the resulting binary string.

We should also build a scheme that uses shortest one between the compressed and non-compressed names.

For running experiments it would be useful to publish a sample corpus of mangled names that we will be using for comparing the algorithms and approaches.

I also have a concern about making mangled names completely unreadable. Today, I can frequently at least get a gist of what the referenced entity is without a demangler. What we could do is make the name consist of a human-readable prefix that encodes just the base name and a compressed suffix that encodes the rest of the information.

_T<length><class name><length><method name><compressed suffix>

We would be able to use references to the class and the method name from the compressed part, so that character data isn't completely wasted.

This scheme that injects human-readable parts will also allow the debugger to quickly match the names without the need to decompress them.

We should also investigate improving existing mangling scheme to produce shorter results. For example, one idea that comes to mind is using base-60 instead of base-10 for single-digit numbers that that specify identifier length, falling back to base-10 for longer numbers to avoid ambiguity. This would save one character for every identifier longer than 9 characters and shorter than 60, which is actually the common case.

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com>*/

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

gribozavr · December 21, 2015, 5:53am

+ Stephen Canon, because he probably has good ideas in this domain.

What’s next?

The small experiment I described above showed that compressing the names
in the string table has a huge potential for reducing the size of swift
binaries. I’d like for us (swift-developers) to talk about the implications
of this change and start working on the two tasks of tightening our existing
mangling format and on implementing a new compression layer on top.

Hi Dmitri,

Hi Nadav,

This is a great start that shows that there is a potential for improvement
in our mangled names!

To make this effort more visible, I would suggest creating a bug on
https://bugs.swift.org/ .

I think we survey existing solutions that industry has developed for
compressing short messages. What comes to mind:

- header compression in HTTP2:
https://http2.github.io/http2-spec/compression.html

- PPM algorithms are one of the best-performing compression algorithms for
text.

It is important to keep in mind that we are not inventing a new
text-compression algorithms here. We are solving a specific problem with
specific characteristics and constraints. Nothing is groundbreaking here.

This is exactly the case for using a specialized algorithm: a
well-defined problem with specific constraints, not a general case of
arbitrary input.

Lempel-Ziv based algorithms rely on repetitions within the text, and can use
some prior statistical knowledge on the text (pre-computed tables). Our
mangled names are relatively short and don’t have significant repetitions,
so having a reference to some pre-computed table, and not having an internal
reference is obvious.

Names are formed according to well-defined mangling rules with their
own redundancy and patterns, and identifiers usually come from
English. This is why using a high-order statistical model makes
sense.

Pre-computed tables on the other hand have their own issues: since we
have a runtime demangler that our metadata parsing relies upon, the
whole precomputed table of known substitutions has to be a part of the
runtime, and not just the demangler. This creates interesting
constraints for bare-metal environments for example. I'm not saying
that it is wrong to use them, I'm just saying that the trade-off is
not so clear-cut.

On top of the basic encoding layer we need to apply some kind of variable
length encoding (Cameron suggested Huffman). The restricted character set is
not a problem here. To encode a string of bits using the charset
[a-zA-Z0-9_] just do: mod63, div63, mod63, div63, until you consume the
whole bit stream.

- Arithmetic coding is also a natural starting point for experimentation.

Since the input mangled name also comes in a restricted character set, we
could also remove useless bits first, and try an existing compression
algorithm on the resulting binary string.

The upper bits never come into play. You can conceptually imagine that bytes
have ~6 bits (63 combinations). You only need to worry about these kind of
things when you use variable length encoding (see my previous comment about
Cameron’s suggestion to use Huffman encoding as a second pass).

We should also build a scheme that uses shortest one between the compressed
and non-compressed names.

For running experiments it would be useful to publish a sample corpus of
mangled names that we will be using for comparing the algorithms and
approaches.

nm -a libswiftCore.dylib > strings.txt

I also have a concern about making mangled names completely unreadable.
Today, I can frequently at least get a gist of what the referenced entity is
without a demangler. What we could do is make the name consist of a
human-readable prefix that encodes just the base name and a compressed
suffix that encodes the rest of the information.

_T<length><class name><length><method name><compressed suffix>

When are you looking at symbols directly? SIL already shows the detangled
names as comments, and for everything else there is a demangler.

System tools to inspect object files on Linux don't know about Swift
mangling. Yes, I can pass their output through a demangler, sometimes
(when they don't have a GUI for example).

Dmitri

···

On Sun, Dec 20, 2015 at 9:43 PM, Nadav Rotem <nrotem@apple.com> wrote:

On Dec 20, 2015, at 2:17 AM, Dmitri Gribenko <gribozavr@gmail.com> wrote:
On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev > <swift-dev@swift.org> wrote:

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com>*/

Nadav_Rotem · December 21, 2015, 6:07am

Hi Steve,

Nadav, can you clarify what we’re really trying to accomplish here? "Smaller binaries” isn’t too important of a goal in and of itself.

Are we trying to:
– reduce storage used on disk
– reduce load time
– reduce loaded memory footprint
– make emitting swift binaries more efficient
– something else?

Yes, I know, “all of the above”, but understanding something about what’s most important would help evaluate the proposal.

It’s also worth keeping in mind that iOS and OS X have been aggressively adopting pervasive system-wide compression both on disk and in memory. This trend will continue, and it makes it quite a bit less important for individual components to explicitly adopt compression techniques themselves, except in cases where there’s a lot of special structure that those components can leverage to get better compression than a general-purpose lossless compressor can manage (images and sound are the two obvious examples of this, but also cases like huge arrays of floating-point data where the low-order bits don’t matter, etc). Linux hasn’t been as aggressive about doing this yet, but pervasive system-level compression is The Future.

Swift is a systems programming language. We’d like to be able to build the whole operating system in Swift. This mans that one day your phone will have hundreds of shared libraries (written in swift) loaded all at the same time. Thousands of shared libraries will be saved on disk, and updated every time you upgrade the OS or some apps. The string table (linkedit section) is loaded into memory (shared cow). In a world where every single process uses multiple swift libraries reducing the size of this section is very beneficial.

Disk and network compressions can help. I believe that we have domain specific information that will allow us do a better job in compressing this section.

Thanks,
-Nadav

···

On Dec 20, 2015, at 7:35 AM, Stephen Canon <scanon@apple.com> wrote:

– Steve

On Dec 20, 2015, at 5:17 AM, Dmitri Gribenko <gribozavr@gmail.com <mailto:gribozavr@gmail.com>> wrote:

+ Stephen Canon, because he probably has good ideas in this domain.

On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

What’s next?

The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.

Hi Nadav,

This is a great start that shows that there is a potential for improvement in our mangled names!

To make this effort more visible, I would suggest creating a bug on https://bugs.swift.org/ .

I think we survey existing solutions that industry has developed for compressing short messages. What comes to mind:

- header compression in HTTP2:
https://http2.github.io/http2-spec/compression.html

- PPM algorithms are one of the best-performing compression algorithms for text.

- Arithmetic coding is also a natural starting point for experimentation.

Since the input mangled name also comes in a restricted character set, we could also remove useless bits first, and try an existing compression algorithm on the resulting binary string.

We should also build a scheme that uses shortest one between the compressed and non-compressed names.

For running experiments it would be useful to publish a sample corpus of mangled names that we will be using for comparing the algorithms and approaches.

I also have a concern about making mangled names completely unreadable. Today, I can frequently at least get a gist of what the referenced entity is without a demangler. What we could do is make the name consist of a human-readable prefix that encodes just the base name and a compressed suffix that encodes the rest of the information.

_T<length><class name><length><method name><compressed suffix>

We would be able to use references to the class and the method name from the compressed part, so that character data isn't completely wasted.

This scheme that injects human-readable parts will also allow the debugger to quickly match the names without the need to decompress them.

We should also investigate improving existing mangling scheme to produce shorter results. For example, one idea that comes to mind is using base-60 instead of base-10 for single-digit numbers that that specify identifier length, falling back to base-10 for longer numbers to avoid ambiguity. This would save one character for every identifier longer than 9 characters and shorter than 60, which is actually the common case.

Dmitri

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com <mailto:gribozavr@gmail.com>>*/

ddunbar · December 24, 2015, 4:46pm

Hi Nadav,

Hi Daniel,

One thing you didn't call out here was any connection to visibility.

This is a great point!

Apps usually don’t export too many symbols, but they do import symbols from external shared objects and need to keep the long names around. I created a basic Swift one-page iOS App using the Xcode wizard and the generated binary had a string-table of 70K**. The same app in Obj-C had a 12K linkedit section ("size -m" can print this info).

Where was that from, though? I wouldn't expect that many imported symbols in a simple app (and in general would expect it to grow maybe logarithmically with the size of an application, since it is bounded how many you can import vs. how many you can write yourself).

I would expect that in practice it is quite common for most of the symbols to not be visible outside the linkage unit. Inside the linkage unit, the compiler can already rename many of the symbols in any way it likes. Wouldn't one tact for solving this problem simply be to encourage more pervasive use of whole module optimization, and support compiler renaming of non-exported symbols?

I don’t think that Swift Access Control (public/internal/private) is affected by WMO, but I will check.

My point here was that with WMO you can rename all the internal symbols with ease.

- Daniel

···

On Dec 20, 2015, at 11:26 PM, Nadav Rotem <nrotem@apple.com> wrote:

On Dec 20, 2015, at 1:01 PM, Daniel Dunbar <daniel_dunbar@apple.com <mailto:daniel_dunbar@apple.com>> wrote:

Thanks,
Nadav

** This is one of the names that the sample app has:

__TTSf4n_s_s_n___TTSg5GVSs12_ArrayBufferPSs9AnyObject__GS_PS0___Ss16_ArrayBufferTypeSs_GVSs15CollectionOfOnePS0___GS2_PS0___Ss14CollectionTypeSs_PS0___GVSs17IndexingGeneratorGS_PS0____GS4_GS_PS0____Ss13GeneratorTypeSs_PS0___PS0___GVSs12_SliceBufferPS0___GS6_PS0___Ss23_CollectionDefaultsTypeSsGS6_PS0___Ss32_CollectionGeneratorDefaultsTypeSs_GS4_GS6_PS0____GS4_GS6_PS0____S5_Ss_PS0___SiSiSs16ForwardIndexTypeSs_SiSiSs18_SignedIntegerTypeSs_SiSiSs33_BuiltinIntegerLiteralConvertibleSs_Si_PS0___GVSs14GeneratorOfOnePS0___GS12_PS0___S5_Ss_OSs3BitS13_S9_Ss_SiSiS10_Ss_SiSiS11_Ss_VSs20_DisabledRangeIndex__PS0___GVSs12_prext_SliceGS2_PS0____GS15_GS2_PS0____S7_SsGS15_GS2_PS0____S8_Ss_GS4_GS15_GS2_PS0_____GS4_GS15_GS2_PS0_____S5_Ss_PS0___S13_S13_S9_Ss_SiSiS10_Ss_SiSiS11_Ss_S14__PS0_____TFSs28_arrayNonSliceInPlaceReplaceu0_Rq_Ss16_ArrayBufferTypeq0_Ss14CollectionTypezqq_S_7Elementqqq0_Ss12SequenceType9GeneratorSs13GeneratorType7Elementzqq_Ss23_CollectionDefaultsType5IndexSizqqq_S3_5IndexSs16ForwardIndexType8DistanceSizqqq_S3_5IndexS4_19_DisabledRangeIndexSizqqqq_S3_5IndexS4_8DistanceSs25IntegerLiteralConvertible18IntegerLiteralTypeSi_FTRq_GVSs5RangeSi_Siq0__T_

- Daniel

On Dec 18, 2015, at 4:42 PM, Nadav Rotem via swift-dev <swift-dev@swift.org <mailto:swift-dev@swift.org>> wrote:

Hi Swift-Dev,

We care deeply about the size of binaries that are generated by the Swift compiler and make an effort to optimize and shrink the generated binaries. One of the problems that we have today is that swift symbols are mangled into extremely long strings. This is especially a problem for libraries, and almost half of the size of libswiftCore.dylib (the swift runtime library on x86_64 OSX) is string tables. On MacOSX you can use the command ’size -m file.dylib’ to read the size of the string table. C++ also suffers from the problem of long names, but since we control the Swift ABI we can do better than C++.

Here are two names that I found in our libraries:

__TIF14StdlibUnittest13checkSequenceu0_Rxs14CollectionType_s12SequenceTypeWx9Generator7Element_zW_9GeneratorS3__rFTxq_KT_SS9showFrameSb10stackTraceVS_14SourceLocStack4fileSS4lineSu16resiliencyChecksVS_32CollectionMisuseResiliencyChecks9sameValueFTWxS2_S3__WxS2_S3___Sb_T_A6_

__TTSg5VSS13CharacterViewS_s14CollectionTypes_GVs17IndexingGeneratorS__GS1_S__s13GeneratorTypes_VS_5IndexS3_s16ForwardIndexTypes_SiSis18_SignedIntegerTypes_SiSis33_BuiltinIntegerLiteralConvertibles_Vs20_DisabledRangeIndex__S_S_s9IndexablesS_s12SequenceTypes_GS1_S__GS1_S__S2_s_Vs9Character_S3_S3_S4_s_SiSiS5_s_SiSiS6_s_S7__S__S10__S10____TFFSS12replaceRangeuRxs14CollectionTypeWx9Generator7Element_zVs9CharacterrFTGVs5RangeVVSS13CharacterView5Index_4withx_T_U_FRS4_T_

One way to solve this problem is to implement a better string table format that will include some kind of compression or tree encoding into MachO, ELF and PE. The work on MachO, ELF and PE is beyond the scope of the Swift project and I’d like to focus on improving things on the Swift side.

I see two independent tasks that we can do to shorten the length of string symbols. First, we can improve the existing mangling. We can do things like use non-decimal length specifiers, etc. The second task, which is the subject of the rest of this email, is the implementation of a compression layer on top of the existing mangling scheme.

Compressing long symbols

The Swift compiler is free to encode Swift names in whatever way it likes, but we need to follow a few basic rules. First, the character encoding space needs to be [_a-zA-Z0-9]. We can’t use non-ascii characters to encode Swift names because this is the character space that ELF supports. Second, names need to be unique and independent. We can’t just compress the whole string table or create names that are a running counter because we need to be able to inspect each name independently. We need to be able to decode the name independently and extract the information that’s stored in the name. We also need to be able to decode the name to be able to present something useful in the debugger. This means that a simple one directional hashing of the name is not an option.

With these restrictions in mind, I believe that the problem that we need to solve is independent compression of names. I suggest that we add a post-processing string compression pass. The input to the compress method would be a string mangled name, like the two names I showed above, and the output would be a shorter string. We would also need a decompression method that would return the original string.

Obviously, gzipping each string independently won’t be effective. I believe that we need to implement our own compression algorithm here that will work with the restrictions that I mentioned above (being textual) and with the prior knowledge we have.

Example of a basic compression algorithm

In this section I describe a trivial compression algorithm that can reduce the size of the string tables by 30%. This algorithm is not optimal but it is a good starting point for our discussion. One example of "prior knowledge” that I mentioned above is that we know what are the common sub-strings that are used in Swift names. Some of the most common substrings in Swift mangled names are:

1. "S_S_S_S"
2. "ollectio”
3. “Type”
4. “Index”
5. “erator”
6. “7Element"
7. “able"

We can use this prior knowledge about names of Swift types to compress our mangling! Consider the following encoding:

A string like this:

__TTSg5VSS13CharacterView

Would be translated into something like:

__TTSg5<reference to word #67>13<reference to word #43>View

Commonly used strings can be encoded as references to global tables. We need to have some escape character (that needs to be ascii-printable), and use that character to encode the reference.
One interesting bit of information is that the character ‘Y’ is only used 4 times in the entire standard library! The letter J, and a few other letters are also not used very frequently. We can use these letters as escape characters.

The basic encoding that I experimented with used the following rules:

1. Use two escape characters that are not frequently used in names (Y and Z). These characters are escape character and can’t be used without escaping. ‘Y’ will be encoded as ‘YY’, and ‘Z’ would be encoded as ‘YZ’.

2. The most commonly used sub-strings (calculated as length of substring times number of occurrences) would be encoded with a single escape character and a single index character. The table of highly repetitive substrings can only contain 61 entries (a-zA-Z0-9_, minus two escape characters).

A reference to the very frequent substrings would be encoded as “Yx”, where x is the character that’s translated into a numeric index. This is two chars per substring.

3. The less commonly used substrings are encoded as three-character sequence. “Zxx”, where xx is the numeric index in the large table (that can hold 63*63 substrings).

It is obvious how to reverse this compression using the same string table used to compress the names.

With this encoding scheme the name “__TwxxV14StdlibUnittest24MinimalForwardCollection” becomes “Y5wxxJ0UJ1HJ3_Jyn4J5VJ6SdY9on” and the name “__TMPVs15ContiguousArray” becomes “Y5MPJ3r5J5EJ82”.

I benchmarked this basic technique by using 5 dylibs as the training set (SwiftCore, Simd, unittests, etc) and compressed the string tables of these binaries. The result was a ~31% reduction in size.

<PastedGraphic-2.png>

On the Swift dylib itself (training set and input to the compressor) the wins are about ~25%.

This encoding makes the symbol name a lot less intuitive for humans, but at the SIL-level we already print human readable strings above function names as comments. I believe that the debugger would just work as-is since the internal APIs won’t change.

One of the disadvantages of using a constant string table is that once we rename the APIs in the standard library the string table would be less effective in compressing it and code that uses it.

What’s next?

The small experiment I described above showed that compressing the names in the string table has a huge potential for reducing the size of swift binaries. I’d like for us (swift-developers) to talk about the implications of this change and start working on the two tasks of tightening our existing mangling format and on implementing a new compression layer on top.

Nadav

_______________________________________________
swift-dev mailing list
swift-dev@swift.org <mailto:swift-dev@swift.org>
https://lists.swift.org/mailman/listinfo/swift-dev

Chris_Lattner · December 21, 2015, 11:58pm

I see what you’re saying, but I don’t think that it would make sense to consider this. Given a tradeoff between size for all clients and users of the swift language vs compiler/library hackers having to run the demangler manually now and then, the benefit of reducing symbol sizes greatly outweigh any pain incurred.

-Chris

···

On Dec 20, 2015, at 9:53 PM, Dmitri Gribenko via swift-dev <swift-dev@swift.org> wrote:

nm -a libswiftCore.dylib > strings.txt

I also have a concern about making mangled names completely unreadable.
Today, I can frequently at least get a gist of what the referenced entity is
without a demangler. What we could do is make the name consist of a
human-readable prefix that encodes just the base name and a compressed
suffix that encodes the rest of the information.

_T<length><class name><length><method name><compressed suffix>

When are you looking at symbols directly? SIL already shows the detangled
names as comments, and for everything else there is a demangler.

System tools to inspect object files on Linux don't know about Swift
mangling. Yes, I can pass their output through a demangler, sometimes
(when they don't have a GUI for example).