The embedded data size can be reduced significantly by generating the packaged data with ICU’s data build tool and applying a filter that removes unused features and limits locales to en_001. In our builds it went from 29.3 MB to 8.6 MB, with no impact on functionality on non-Apple platforms (note that the current Foundation on non-Apple platforms doesn't support locales other than en_001). You can try a slimmed-down version of the ICU data today by importing the GoodNotes/swift-icudata-slim package.
Given that this unused data only increases binary size, I’d like to propose replacing the current full dataset with the filtered version as the default in swift-foundation-icu.
Question for the Foundation team: Is there an intention to support locales other than en_001 on non-Apple platforms? The major data size reduction comes from removing unused locales, so the answer will determine how far we can take this optimization.
I already have a suitable filter file ready and I’m happy to help integrate it. I can't see how icu_packaged_data.cpp and the supplemental .inc files are currently generated, so if we decide to proceed, I’d appreciate some help from the Foundation team on this part of the process.
Yes, we should keep the full locale set. However, there is a small amount of data which can be removed (the ~6MiB sounds right). This was something that I had brought up in the past, but I don’t think that I ever got a clear answer as to whether such a change would be accepted.
Question for the Foundation team: Is there an intention to support locales other than en_001 on non-Apple platforms? The major data size reduction comes from removing unused locales, so the answer will determine how far we can take this optimization.
If you only want en_001, you can import only FoundationEssentials, which does not pull in ICU at all. That also means you won’t be able to use any FoundationInternationalization API backed by ICU, though.
Are there any ICU features that you want to use with en_001?
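For anyone following along, here is a minimal sketch of what importing only FoundationEssentials can look like in practice. The `Event` type and its values are made up for illustration, and the `canImport` guard is only there so the same file also builds where FoundationEssentials isn't a separate module. JSON coding is used because it lives in FoundationEssentials; whether a given formatting API is available without ICU depends on your toolchain version.

```swift
// Import only the ICU-free layer where it exists as a separate module;
// fall back to the Foundation umbrella elsewhere (e.g. Apple platforms).
#if canImport(FoundationEssentials)
import FoundationEssentials
#else
import Foundation
#endif

// Hypothetical model type, used only for this illustration.
struct Event: Codable, Equatable {
    var name: String
    var timestamp: Date
}

let event = Event(name: "launch", timestamp: Date(timeIntervalSince1970: 0))

// JSONEncoder/JSONDecoder are part of FoundationEssentials, so this
// round trip does not need any FoundationInternationalization API.
let encoded = try JSONEncoder().encode(event)
let decoded = try JSONDecoder().decode(Event.self, from: encoded)

print(decoded == event)
```

Anything that needs locale-sensitive behaviour (localized date or number formatting, collation and so on) would still need FoundationInternationalization and therefore the ICU data.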
Yes, we should keep the full locale set. However, there is a small amount of data which can be removed (the ~6MiB sounds right). This was something that I had brought up in the past, but I don’t think that I ever got a clear answer as to whether such a change would be accepted.
@Tony_Parker and I had discussed this. We’re definitely interested in data reduction. We are already working towards this, for example by moving APIs such as the Gregorian calendar and ISO8601 date formatting from FoundationInternationalization to FoundationEssentials. There might be other areas where we can continue in this direction.
Some other high-level ideas probably worth exploring are:
Trimming data that isn’t needed by Foundation from ICU (like you both mentioned here): I think it’s just a matter of identifying exactly which data that is, and doing it in a maintainable fashion that allows us to easily update the packaged data when upstream ICU updates theirs.
A dynamic ICU build to support limited locales: There’s already logic in place to handle locale fallback if the target locale is missing/not supported. It seems like a reasonable enhancement. I suppose the challenge is in packaging the data in a flexible way?
Add a configuration to allow static ICU builds to include only the chunk of data needed: Unlike the previous case, the data would be segmented along the other dimension, by feature rather than by locale. For example, if the developer is only interested in number formatting but does not need collation at all, can we perhaps exclude the collation data at build time?
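To make the locale fallback mentioned in the second idea more concrete, here is a rough sketch of truncation-style fallback against a limited set of packaged locales. This is a simplification and not Foundation's or ICU's actual implementation: `fallbackChain`, `resolve` and the `packaged` set are hypothetical names, and real ICU resolution also consults CLDR parent-locale data (so, as far as I know, en_GB would go through en_001 rather than straight to en).

```swift
/// Builds a fallback chain for an ICU-style identifier by dropping
/// subtags from the right, ending with "root".
func fallbackChain(for identifier: String) -> [String] {
    var parts = identifier.split(separator: "_").map(String.init)
    var chain: [String] = []
    while !parts.isEmpty {
        chain.append(parts.joined(separator: "_"))
        parts.removeLast()
    }
    chain.append("root")
    return chain
}

/// Picks the first entry of the chain that is actually packaged.
func resolve(_ requested: String, available: Set<String>) -> String {
    fallbackChain(for: requested).first { available.contains($0) } ?? "root"
}

// Hypothetical set of locales shipped in a trimmed data build.
let packaged: Set<String> = ["root", "en", "en_001", "ja"]

print(resolve("ja_JP", available: packaged)) // ja
print(resolve("fr_CA", available: packaged)) // root
print(resolve("en_GB", available: packaged)) // en
```

With something like this in place, a build that ships only a handful of locales degrades to the nearest available data instead of failing, which is why the packaging side looks like the harder part.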
If the localization data were emitted as Swift code (à la InlineArray etc.), unused data might actually be culled by whole-module optimization. That would practically entail writing an ICU4Swift instead of calling into the C functions. Of course, much more research would be needed to confirm that this optimization is worthwhile. However, IIUC this approach has served other languages well, for example Rust with ICU4X.