Android app size and lib_FoundationICU.so

Android apps that incorporate native Swift need to embed all their native dependencies in the app in the form of shared object files for each supported architecture. This includes both the app-specific Swift code and all the core dependencies like the Foundation libraries. These are all weighty files, but the one that stands out in the list is the 40M lib_FoundationICU.so:

ls -laSrh ~/Library/org.swift.swiftpm/swift-sdks/swift-6.0.3-RELEASE-android-24-0.1.artifactbundle/swift-6.0.3-release-android-24-sdk/android-27c-sysroot/usr/lib/aarch64-linux-android/

  441K libFoundationXML.so
  1.8M libFoundationNetworking.so
  3.8M libFoundationInternationalization.so
  8.7M libFoundation.so
  8.8M libFoundationEssentials.so
  9.6M libswiftCore.so
   40M lib_FoundationICU.so

With Skip's native Swift support (see Building apps with Skip's native Swift Android toolchain integration (tech preview)), we have two very basic sample apps: "Hello Skip", which uses pure transpilation from Swift to Kotlin (and no native Swift), and "Hiya Skip", which uses native Swift (for 2 architectures: aarch64 and x86_64). As can be seen from those two release pages, HelloSkip-release.apk is 12.9 MB, and HiyaSkip-release.apk is 200 MB! The compressed distribution bundles, which approximate the download size that a user would experience, are a bit better (HelloSkip-release.aab is 8.44 MB and HiyaSkip-release.aab is 70.4 MB), but 70 MB is still too much for a "Hello World" app, especially as contrasted with the iOS HiyaSkip-release.ipa: 789 KB.
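To put the listing above in proportion, here is a back-of-the-envelope calculation in Python. It only counts the seven libraries shown in the `ls` output (not the app's own code or anything else in the APK), and it assumes the `-h` sizes are binary units, so treat the result as a rough estimate rather than a measurement of the real APK.

```python
# Sizes as printed by `ls -h` in the listing above, converted to MiB.
# Assumption: the K/M suffixes are binary units (1M = 1024K).
LIB_SIZES_MIB = {
    "libFoundationXML.so": 441 / 1024,
    "libFoundationNetworking.so": 1.8,
    "libFoundationInternationalization.so": 3.8,
    "libFoundation.so": 8.7,
    "libFoundationEssentials.so": 8.8,
    "libswiftCore.so": 9.6,
    "lib_FoundationICU.so": 40.0,
}
ARCH_COUNT = 2  # aarch64 and x86_64, as in the Hiya Skip sample

# Total of the listed libraries for a single architecture.
per_arch = sum(LIB_SIZES_MIB.values())
# Fraction of that total taken up by the ICU library alone.
icu_share = LIB_SIZES_MIB["lib_FoundationICU.so"] / per_arch
# The ICU library is repeated once per architecture in the APK.
icu_total = LIB_SIZES_MIB["lib_FoundationICU.so"] * ARCH_COUNT

print(f"listed libraries per architecture: {per_arch:.1f} MiB")
print(f"ICU share of that total: {icu_share:.0%}")
print(f"ICU across {ARCH_COUNT} architectures: {icu_total:.0f} MiB")
```

Under those assumptions, lib_FoundationICU.so accounts for a little over half of the listed per-architecture total, and it is paid for once per architecture.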

I would like to start an open discussion about what might be done about this. Here are the options we have considered thus far:

  1. Exclude i18n altogether: if code just imports FoundationEssentials and eschews FoundationInternationalization, then the APK size would be reduced drastically. This is the most obvious solution, but is also the most obvious one to dismiss out of hand: apps need support for localization.
  2. Implement a lib_FoundationICUAndroidNDK.so replacement library that uses the existing libicu.so: aside from only being present in Android 12 (API 31) and higher (as per ICU4C on Android) and thereby only being available on ~45% of devices (as per Android Studio's API market share estimator), the Android ICU4C NDK Reference shows that the API subset they expose is insufficient for the needs of FoundationInternationalization: it has none of the udat_*, unum_*, or ucurr_* needed for date/time formatting and parsing, nor any ucal_* calendar functions.
  3. Implement a lib_FoundationICUAndroidJNI.so that bridges localization functions to Java: unlike the Android NDK's ICU4C support, the Java SDK's ICU4J support (e.g., via the android.icu.text, android.icu.number and android.icu.lang packages) is considerably more complete. However, this would require implementing support via JNI, and incurring a bridging trip for each and every localization call. I fear that the performance implications could be significant.
  4. Reduce the size of lib_FoundationICU.so: most of the size of the library comes from the icu_packaged_main_data.*.inc.h header tables in swift-foundation-icu/icuSources/common (in the swiftlang/swift-foundation-icu repository on GitHub). Could these be somehow compressed further? Or could a significant size reduction be realized by excluding some rare-but-large locales from that file? I haven't explored this option yet.
  5. Support for dynamic thinning of ICU data: if an app only contains localization strings for 2 languages, then it might make sense to reduce the locale data to only support those languages. This would be a significant re-architecting of swift-foundation-icu to use external and removable resources rather than embedded [uint8_t] data.

I would be interested in hearing other suggestions for how we might reduce the Android app size. 200MB is too large, especially when contrasted with Flutter (~10 MB min APK size) which bundles its own i18n support.

We're facing the same problems on Android, and so far we've gone with the route of always depending on FoundationEssentials so we can omit FoundationInternationalization. Though in practice, we do want access to the Calendar APIs in Foundation - and not reinvent the wheel.

I'd love some input on this from people more familiar with the subject.

ICU changes its data and functionality quite a bit between releases. Mismatches between what happened to be around on the OS and what Foundation's clients expected from its formatted output were a cause of much confusion in previous swift-corelibs-foundation iterations. Plus, they caused constant test failures depending on what platform you were running on. In summary: ICU is not the same everywhere.

Also, some parts of Foundation depend on functionality that is specific to Apple's ICU and has not been upstreamed for one reason or another. These are usually prefixed with ualoc_, like here and here.

IMO the best long-term approach would be to combine both methods, as it aligns well with the long-term goals of SwiftFoundationICU.

Currently, the icu_packaged_main_data.*.inc.h files are built from a prebuilt version of Apple ICU. Ideally, in the future, we should build the data file as part of the SwiftPM build. This way, as you mentioned, we can also support optional thinning. However, I haven’t yet found a good way to “merge” the CMake-based build with the SPM build, which is why we’re currently using pre-built embedded files as an interim solution. I suspect the ultimate solution will involve rewriting the ICU data build tool as a plugin, allowing it to be integrated into the SPM build.

Though in practice, we do want access to the Calendar APIs in Foundation - and not reinvent the wheel.

Gregorian Calendar and ISO8601 formatting are available in FoundationEssentials. Granted, it is indeed more difficult to move other calendars into FoundationEssentials, but we could certainly look into that given the precedent of Gregorian calendar.

@Tony_Parker's response I think generally sums up my conclusions from the investigation that I did on this a while ago. The best that I remember that was possible was to filter out any unnecessary data with the tools that ICU provides for packaging the data. Of course that won't slim it down to what you can do without FoundationInternationalization, but, it does help reduce the size significantly, especially if you are building with just FoundationEssentials.

The Android ICU releases were not officially supported in their C/C++ form. Has that changed? It seems that you would need to do a full round trip to the Java API through JNI to use the data. However, you will still need to use the ICU library build and the data is intrinsically tied to that release. I believe in order to shuffle the data out in favour of the JNI bridged content, you might need to extend ICU to support a new type of data backend (computation).

Ahh, well that effectively eliminates options #2 and #3, since I don't think Android's built-in ICU is ever likely to provide functions like ualoc_getAppleParent.

Yes, I'm thinking that externalizing the data to a separate file will be the solution (like on macOS: /usr/share/icu/icudt74l.dat). This is tricky on Android due to the way they expect assets to be accessed (see: Overriding Bundle.module for loading resources from Android assets). But one immediate benefit is that the app would only need to contain one single instance of the data file, rather than repeating it N times for each supported architecture.

In addition, this opens the possibility of performing some "thinning" of the ICU data to only include those locales that are supported by the containing app. This is what Flutter does to keep its bundled ICU data size as minimal as possible. This thinning could be done as a post-processing step by the packaging tool for the app.

Is that actually the case? I know that the generator tool takes the architecture as a parameter at least, but I do not know how it uses it in the external data flow.

I'm somewhat wary of that: the data format is not stable and I do not know what gets encoded into the data. If the changes made impact the encoded data, it is possible that the tool would improperly "thin" the data. Such a tool would need to be version locked with the compiler.

Furthermore, I'm not sure I understand the idea completely. How do you "thin" the API surface for the pre-compiled module (libFoundationInternationalization.so)? Or is the proposal that the developer is responsible for not performing any dynamic calls?

The ICU data file only seems to be endian-specific, and not architecture-specific, and I think all major Android devices are little-endian.

I think the tool would need to be version-locked to the specific ICU version used by FoundationICU, rather than the compiler itself. Perhaps swift-foundation-icu could build the executable itself from a separate target, like swift-foundation-icu-tool?

As long as any tools that work with it are aware of the current ICU version's data format, I believe it is safe to manipulate it. This is one of the documented use cases for the build tool that @icharleshu had suggested using:

For example, the icupkg tool at /opt/homebrew/Cellar/icu4c@74/74.2/sbin/icupkg can be used to expand, filter, and repackage the icudt74l.dat file.

I'm not suggesting that any of the API surface be removed. Rather, once the ICU data is externalized, we could reduce the size of the data file by stripping out unneeded locales and languages.

I think that is the bit that I am confused about. There could be dynamic behaviour (e.g. setlocale(LC_ALL, getenv("MY_LC_OVERRIDE"))) that we cannot statically understand. If this is something that is user directed rather than source code directed, then why does this need to be a post-processing step? This could be something that is done up-front and the user simply selects, or we have the data files and they can run the tool to generate the packaged data as a pre-build step.

Sadly? Fortunately? :man_shrugging: The current state is that all Android devices are little endian these days as the MIPS port was retired :frowning:. It certainly makes this easier for us. But, I think that is something that we should take into consideration so as to not design ourselves into a corner.

Dynamically overriding the current locale to something that isn't present in the data file would result in ICU's fallback behavior rules being applied (e.g., en_IN → en_GB → en_001 → en → root).
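That fallback walk can be sketched as follows. This is only an illustration in Python: the parent table is copied from the example chain above rather than from ICU or CLDR data, and `fallback_chain` is a hypothetical helper, not an ICU API.

```python
def fallback_chain(locale, parents=None):
    """Return the lookup order for `locale`, ending at "root".

    `parents` holds explicit parent overrides; locales without an override
    fall back by dropping their last "_"-separated subtag.
    """
    parents = parents or {}
    chain = [locale]
    current = locale
    while current != "root":
        if current in parents:
            nxt = parents[current]             # explicit parent override
        elif "_" in current:
            nxt = current.rsplit("_", 1)[0]    # drop the last subtag
        else:
            nxt = "root"                       # bare language falls back to root
        if nxt in chain:
            nxt = "root"                       # guard against cycles in the table
        chain.append(nxt)
        current = nxt
    return chain


# Parent overrides taken from the example chain in the post above
# (illustrative only, not read from real ICU data).
EXAMPLE_PARENTS = {"en_IN": "en_GB", "en_GB": "en_001", "en_001": "en"}

print(" -> ".join(fallback_chain("en_IN", EXAMPLE_PARENTS)))
print(" -> ".join(fallback_chain("de_AT")))
```

The practical consequence for thinning is that removing a locale's data does not make lookups fail; they land on whichever ancestor is still present, ultimately root.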

I'm proposing that the app's supported locales – either explicitly enumerated in some metadata by the developer or auto-detected by just looking for the Resources/lang_REGION.lproj/ folders – would be used at the app packaging stage to perform the thinning of the locale data. If the app developer wants to support every locale in existence, they could just specify them all and accept the resulting increase in app size.
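The auto-detection half of that idea might look something like the sketch below. It assumes the localization folders follow Apple's `<locale>.lproj` naming and that `Base.lproj` should be skipped; `supported_locales` is a hypothetical helper name, and a real packaging tool would also want to honor an explicit list from the developer.

```python
import tempfile
from pathlib import Path


def supported_locales(resources_dir):
    """Return the sorted locale identifiers named by *.lproj folders."""
    suffix = ".lproj"
    return sorted(
        entry.name[: -len(suffix)]
        for entry in Path(resources_dir).iterdir()
        # Only directories count; Base.lproj is not a locale.
        if entry.is_dir() and entry.name.endswith(suffix) and entry.name != "Base.lproj"
    )


# Demo against a throwaway directory standing in for an app's Resources folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("en.lproj", "fr_CA.lproj", "Base.lproj", "Assets"):
        (Path(tmp) / name).mkdir()
    print(supported_locales(tmp))
```

The resulting list would then be the input to whatever tool filters the ICU data, together with the ancestors each locale falls back to.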

We would know which architectures the app supports at build time (based on the JNI folders that are included in the app), so we could always generate both icudt74l.dat and icudt74b.dat files in the event one of those archs is big-endian (like a resurrected MIPS or some theoretical future BE architecture).
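As a small sketch of that packaging decision: given the ABIs an app ships, work out which data files are needed. The ABI table lists the current Android ABIs, all little-endian; the "mips-be" entry in the demo is purely hypothetical, and `icu_data_files` is an illustrative helper rather than part of any existing tool.

```python
# Endianness of the current Android ABIs (all little-endian today).
ABI_ENDIANNESS = {
    "arm64-v8a": "little",
    "armeabi-v7a": "little",
    "x86": "little",
    "x86_64": "little",
}

# ICU data file names end in "l" for little-endian and "b" for big-endian.
ENDIAN_SUFFIX = {"little": "l", "big": "b"}


def icu_data_files(abis, icu_major=74, endianness=ABI_ENDIANNESS):
    """Return the sorted, de-duplicated data file names needed for `abis`."""
    names = set()
    for abi in abis:
        if abi not in endianness:
            raise ValueError(f"unknown ABI: {abi}")
        names.add(f"icudt{icu_major}{ENDIAN_SUFFIX[endianness[abi]]}.dat")
    return sorted(names)


# Two little-endian ABIs share a single data file.
print(icu_data_files(["arm64-v8a", "x86_64"]))

# A hypothetical big-endian ABI would add a second file.
with_be = dict(ABI_ENDIANNESS, **{"mips-be": "big"})
print(icu_data_files(["arm64-v8a", "mips-be"], endianness=with_be))
```

With today's ABIs this always collapses to a single file, which is the point: one copy per endianness instead of one per architecture.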

I don’t think it’s appropriate to only have string processing support for the locales of your app’s UI. If I’m using a contacts app that only has an English UI, I still want to be able to sort the contacts in proper localized order.

Unfortunately, I don’t have an answer here, the usual solution on a modern OS is “the OS has a library for this” and we just talked about why we can’t use that library.

EDIT: It’s possible you don’t feed any user data into Foundation for string processing! Great! Except then you’re probably in the FoundationEssentials case anyway.

In the general case, they do need localization, but what percentage of apps will stick localized strings or time/date lookups in their natively compiled Swift code? Probably not many. You may want to default to a build setting that disables FoundationInternationalization altogether, so those who don't need localization/calendars in Swift can benefit from excluding it.

For those who need localization or other libicu-based APIs, this seems like your best bet to reduce app size due to native code. It would require figuring out exactly what APIs are in the most minimal official Android libicu, then only adding the missing "udat_*, unum_*, or ucurr_* needed for date/time formatting and parsing... any ucal_* calendar functions" plus ualoc_* back to your lib_FoundationICUAndroidNDK.so.

Only "~45% of devices" may have this system libicu.so, but the Play store has required that app updates target API 33 or later for the last six months. I don't know what type of backward compatibility they provide: have you looked to see if Play Services or some other system update also provides a libicu.so to older devices to keep such native code updates running?

I suspect that very few apps are doing localization or time/calendar lookups on the hot path, so I doubt performance will matter. But yeah, manually writing this bridging code, even based on JavaKit from swift-java, is probably not worth it.

This was discussed several years ago and eventually led to the work to bring Unicode tables into the stdlib and remove the libicu dependency there. Unfortunately, this has not been done for Foundation, but there were some ideas discussed in that earlier thread.

This seems like something you could immediately implement, ie have lib_FoundationICU.so load this data from a local file at first access rather than have it hard-coded in each arch-specific library. This would give you the biggest gain on Android for the least work and could likely be upstreamed back to Foundation, behind an optional build setting if nothing else.

Yep, once it is separated out, you could also add a tool for this.

On the contrary, I think almost every app will need to be able to format arbitrary numbers like 1,234,567.89 for the US and as 1.234.567,89 or 1 234 567,89 in Europe. And any app that shows a date needs to know that 1/2/2025 means something different to a US user (January 2) versus most other places (February 1). And as @jrose mentions, string sorting without locale awareness will be nonsensical for many languages. These all need FoundationInternationalization and the ICU data, or a subset thereof.

I think you are right. It should be fairly straightforward to patch icu_packaged_data.h to access the data from an external resource if ICU_EXTERNAL is set, which immediately removes the duplication of the data in each arch's .so, as well as facilitating compression of the data in the archive and, optionally, pruning out unneeded locale data at the packaging stage.

We found the ICU size to be a major driver for pushback on adopting Swift in Android builds in our organization. We finally were able to convince those pushing back that the ~70MB would be worth it for shared implementations of code, but it was a close thing. Almost stopped us before we started.

Likely, we will need (or eventually need) most of the localization offered by ICU to serve the countries where we sell products. @Finagolfin we will unfortunately need it in the natively compiled swift code as we have built our own text rendering system. @compnerd mentioned using a computed data backend. If that could call into the installed Android ICU and bridge gaps between Apple's implementation and Android's, the lost performance may be preferable to the gain in size. Might be a brittle thing, though...

The biggest gain you can make right now is to factor out the ICU data into a single minimal data file that is loaded at runtime. That's because the way ICU has always been built for Swift is to hard-code that data into the library itself. You can see this in the way it was packaged before Swift 6:

> du -sk swift-5.10.1-RELEASE-fedora39/usr/lib/swift/linux/libicu*1
28008   swift-5.10.1-RELEASE-fedora39/usr/lib/swift/linux/libicudataswift.so.69.1
4148    swift-5.10.1-RELEASE-fedora39/usr/lib/swift/linux/libicui18nswift.so.69.1
2428    swift-5.10.1-RELEASE-fedora39/usr/lib/swift/linux/libicuucswift.so.69.1

> readelf -sW swift-5.10.1-RELEASE-fedora39/usr/lib/swift/linux/libicudataswift.so.69.1

Symbol table '.dynsym' contains 2 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000001000 0x1b562d0 OBJECT  GLOBAL DEFAULT    4 icudt_swift69_dat

Symbol table '.symtab' contains 3 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000001b58f10     0 OBJECT  LOCAL  DEFAULT    6 _DYNAMIC
     2: 0000000000001000 0x1b562d0 OBJECT  GLOBAL DEFAULT    4 icudt_swift69_dat

Most of the size was from libicudata, which appears to just be a single large array of data, icudt_swift69_dat.

Swift 6 brought ICU in-house into a swiftlang package with its own manifest, so it is extremely easy to tinker with now. :smiley: It may have initially supported loading its data from a file, but that was switched over to the old model of hard-coding it all into the single large ICU library, which Marc highlighted last week.

I suggest that you and Marc look into reverting that and have Foundation load the data at runtime instead, so the four shared libraries for each Android architecture can share a single icudata file, instead of shipping multiple large shared libraries that duplicate the same data for each architecture.

As for the further code slimming you mention, it will pale in comparison to such data slimming.

I'm debating what the right way to do this should be. We could:

  1. Easy: Mandate that whatever initialization/bootstrapping phase sets up the Swift environment also extracts the data to a file somewhere, and then communicates the location by setting an environment variable (like ICU_DATA_FILE) that FoundationICU will check. This is simple, and has the advantage that the data can be stored compressed in the APK's assets, but it has the downside of either needing to plop a ~20-30 MB file onto disk every time the app starts, or else adding the complexity of maintaining some checksum and comparing it with a pre-cached file on disk every time initialization takes place.
  2. Hard: Keep the ICU data in the APK and access it using the Asset Manager's support for providing an mmap-able file descriptor to the asset contents (which I discuss at Overriding Bundle.module for loading resources from Android assets). The advantage is that we wouldn't need to write a file to disk on every app startup (which reportedly caused issues for Flutter apps in the past when they used this technique), but the disadvantage is that the file couldn't be stored compressed in the APK. Plus, in order to get a handle to the NDK AAssetManager, you need a jobject pointer to the Java AssetManager, as well as access to the Android Context object. So it is a whole lot more setup, and brings JNI into the picture.
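For the "easy" option, the checksum bookkeeping could be as small as the sketch below. It is written in Python purely to show the logic; on Android this would live in the app's bootstrap code. It assumes the packaging step records a SHA-256 digest next to the asset so that startup can compare two short strings without reading the payload. `ensure_icu_data` and the `.sha256` stamp file are hypothetical names, and ICU_DATA_FILE is the example variable name from option 1, not something FoundationICU reads today.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def ensure_icu_data(expected_digest, read_payload, cache_dir, name="icudt.dat"):
    """Extract the ICU data only when the cached copy is missing or stale.

    `expected_digest` is the digest recorded at packaging time, and
    `read_payload` is a callable that returns the (decompressed) asset bytes.
    It is only invoked when a fresh copy actually has to be written.
    """
    cache_dir = Path(cache_dir)
    data_path = cache_dir / name
    stamp_path = cache_dir / (name + ".sha256")

    fresh = (
        data_path.exists()
        and stamp_path.exists()
        and stamp_path.read_text() == expected_digest
    )
    if not fresh:
        cache_dir.mkdir(parents=True, exist_ok=True)
        tmp_path = cache_dir / (name + ".tmp")
        tmp_path.write_bytes(read_payload())
        os.replace(tmp_path, data_path)      # atomic swap into place
        stamp_path.write_text(expected_digest)

    # Hypothetical hand-off: tell the native side where the file lives.
    os.environ["ICU_DATA_FILE"] = str(data_path)
    return data_path


# Demo with a stand-in payload instead of a real ICU data file.
FAKE_PAYLOAD = b"not really ICU data"
FAKE_DIGEST = hashlib.sha256(FAKE_PAYLOAD).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    reads = []

    def read_asset():
        reads.append(1)
        return FAKE_PAYLOAD

    ensure_icu_data(FAKE_DIGEST, read_asset, tmp)   # first start: extracts
    ensure_icu_data(FAKE_DIGEST, read_asset, tmp)   # later starts: cache hit
    print(f"payload read {len(reads)} time(s)")
```

This keeps the per-launch cost to two small file reads, at the price of trusting the stamp file; a corrupted data file with an intact stamp would go unnoticed unless the file itself is hashed as well.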