ICU usage in Swift

Michael_Gottesman · June 12, 2019, 5:11am

compnerd Saleem Abdulrasool
June 12
I think that, for the standard library alone at least, this is something we absolutely can and should do. The overhead difference is potentially massive.

Sure, but also the maintenance of it may be difficult and consume time. I think most of my concerns would be satisfied /if/ it is behind a flag and there is no guarantee around it working (i.e. it isn't blocking PR testing and swift.org wouldn't test it).

I guess I need to ask the obligatory... why do you want to do this?

Cross-compiling ICU is nearly impossible without the custom CMake rules. It also makes building ICU a lot more complicated. Particularly for Windows, I honestly don't see any other way to build ICU (it requires a ton of additional setup, and build-script and Python are not really scalable approaches to setting up a full Windows image to do a build). I think that upstream is interested in the CMake support, but doing that completely is a larger undertaking than what I can currently do. I welcome someone else completing this work to the point where upstream will switch over to it.

That would mean that we would be maintaining a fork of ICU. No? Looking at update checkout today, I see that we are not maintaining a fork. My local remote is unicode-org/icu.

compnerd · June 12, 2019, 6:10am

No, please don't fork ICU. We don't need that additional maintenance burden. The way that I did it is that I have a standalone CMakeLists.txt that can be dropped in. No forks, no patches, just a single file that we copy in (we could even place it external to the tree and prefix the paths).

Sure, adding a flag is acceptable to me.

As to the desire to do this: ICU is not a system library on Windows, and an ancient version ships on Android. Android will need to ship a copy of the data file for ICU to be usable. The Swift runtime is not a system-provided package, nor is Swift considered ABI stable on any platform but Darwin. This means that the only supported means of distribution is a local copy of the runtime, so every package needs to bundle a 20+ MiB data file (of which we arguably need ~1-2 MiB for the standard library). For most of the non-western world, access to packages this large can be prohibitive. Shrinking the data file would enable more packages to use Swift and provide a smaller download. Even iOS limits OTA updates to ~100 MB IIRC (and Android limits APKs to 100 MB). That means that you are spending ~10-20% of your package budget on a data file that is not even needed for the most part.
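As a rough sanity check on those percentages, here is a small back-of-the-envelope sketch. It assumes the data file sizes are in MiB and the store limit is a decimal 100 MB; `budget_share` is just a hypothetical helper name for this illustration.

```python
MIB = 1024 * 1024   # binary mebibyte, used for the ICU data file sizes
MB = 1000 * 1000    # decimal megabyte, assumed for the package size limit


def budget_share(data_bytes: int, limit_bytes: int) -> float:
    """Return the percentage of the package size limit used by the data file."""
    return 100.0 * data_bytes / limit_bytes


limit = 100 * MB

# Full data file (20 MiB) versus a trimmed one (2 MiB, the upper estimate above).
print(f"full:    {budget_share(20 * MIB, limit):.1f}%")  # full:    21.0%
print(f"trimmed: {budget_share(2 * MIB, limit):.1f}%")   # trimmed: 2.1%
```

So under these assumptions the full file takes roughly a fifth of the budget, and a trimmed file would take about a tenth of that.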

Jon_Shier · June 12, 2019, 6:44am

OTA iOS app updates are currently limited to 200 MB, and I believe that limit is removed for iOS 13 (but I don’t know for sure). Not that it matters for this discussion.

millenomi · June 12, 2019, 9:19am

I’m going to circle back with some thoughts after offline discussion.

It has been an explicit non-goal of Foundation for some time now to try to work without a full ICU installation available. Development generally assumes that all interfaces in the library be present and working in normal operating conditions; @Tony_Parker may chime in more explicitly about policy, but that’s the assumption that I’ve also been making in writing new s-c-f code. This doesn’t mean that I’m opposed on principle, but making ICU trimming a mandatory part of the release can create the cases @Michael_Gottesman was pointing at every time we update ICU (did a new dependency form inside the library that requires new files for normal operation?) or merge Core Foundation (did it introduce new API usage?). We currently cannot detect mismatches automatically here.

I do not have the confidence that you have that our regular testing can easily find the more esoteric failure cases. ICU-relying tests are notoriously hard to write, and while I introduced a variant of them this winter that I believe will be more resilient, we still don’t know whether partial string testing will be sufficient to catch bugs, how many unknown unknowns we introduce this way, or whether our point-by-point test formulation will ever catch any of them.

Now, I’ve read your Android assessment and looked at current best practices and I agree that the current situation is using significant amounts of the size budget for these developers. My current, personal position as the person who needs to set the guarantees for s-c-f’s correctness is that I would be fine with a community-driven assessment of slicing and the appropriate tooling for achieving it and testing it; but this need is so specific to Android that I would call on the part of the community that would like this port to be maintained to also maintain their slicing setup over time. I’d be happy to notify appropriate people when changes occur on my side that require a reassessment.

compnerd · June 12, 2019, 6:58pm

As to constructing the reduced set: ICU actually provides support to trace the ICU data being used. We could do builds with --enable-tracing on ICU and construct the filter from the traced categories.
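To make that concrete, here is a minimal sketch of turning a set of traced categories into a filter. Everything in it is an assumption for illustration: the category names are examples, `build_filter` is a hypothetical helper, and the `featureFilters` / `filterType: exclude` shape is modeled on the ICU data build tool's filter files and should be checked against the ICU documentation for the version in use. Parsing the actual trace output is left out.

```python
import json

# Illustrative category names only; the authoritative list should come from
# the ICU data build tool for the ICU version being built.
ALL_CATEGORIES = [
    "brkitr_tree", "coll_tree", "curr_tree", "lang_tree",
    "normalization", "region_tree", "translit", "unit_tree", "zone_tree",
]


def build_filter(traced, all_categories):
    """Exclude every known category that the traced run never touched."""
    unknown = sorted(set(traced) - set(all_categories))
    if unknown:
        # A traced category we do not know about means the list is stale.
        raise ValueError(f"unknown traced categories: {unknown}")
    feature_filters = {
        category: {"filterType": "exclude"}
        for category in sorted(all_categories)
        if category not in traced
    }
    return {"featureFilters": feature_filters}


# Hypothetical result of a traced run: only these categories were loaded.
traced = {"normalization", "coll_tree", "brkitr_tree"}
print(json.dumps(build_filter(traced, ALL_CATEGORIES), indent=2))
```

The sketch fails loudly on an unrecognized category rather than silently keeping or dropping it, since that is the case where the category list has drifted from the ICU version.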

millenomi · June 12, 2019, 7:00pm

That would actually be awesome. Is it based on runtime behavior?

Geordie_J · June 12, 2019, 7:08pm

That’s a clever way around it! I wonder how accurate it is for non-English locales? Would we have to switch to various locales at runtime? Maybe there’s scope for some automated “UI tests”.

Edit: FWIW, ICU data takes up about 40% of our app APK on Android. So this would be a welcome change!

compnerd · June 12, 2019, 7:17pm

Yes, it is tracing the runtime behaviours and shows what it is loading. But, limiting it to categories I think should be pretty easy and still can be beneficial (e.g. we always load the timezone data from the system, not ICU, so we can strip the timezone data - that is ~3 MiB).

millenomi · June 12, 2019, 10:31pm

Yeah. I'd rather have careful, well-considered and decently stress-tested category cuts than try to figure out how to run our setup with every possible language combination just to see what data happens to be loaded.
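One way such a stress test could look, as a sketch only: take the categories we decided to cut and check them against whatever traces we do have, flagging any run that loaded a cut category. The run names, category names, and the `violated_cuts` helper are all hypothetical, and the traces are assumed to already be reduced to sets of category names.

```python
def violated_cuts(excluded, traces):
    """Map each excluded category to the runs that loaded it anyway.

    excluded: set of category names removed from the data file.
    traces:   dict of run name -> set of category names that run loaded.
    """
    hits = {}
    for run_name in sorted(traces):
        for category in sorted(set(traces[run_name]) & set(excluded)):
            hits.setdefault(category, []).append(run_name)
    return hits


# Hypothetical cut and hypothetical traced runs under two locales.
excluded = {"zone_tree", "translit"}
traces = {
    "tests-en_US": {"normalization", "coll_tree"},
    "tests-ja_JP": {"normalization", "coll_tree", "translit"},
}

for category, runs in violated_cuts(excluded, traces).items():
    print(f"cut category {category!r} was loaded by: {', '.join(runs)}")
```

An empty result does not prove the cut is safe; it only says none of the runs we happened to trace needed the removed data.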

compnerd · June 17, 2019, 1:51am

Oh, an additional data point that I would like to point out: even Chrome filters the data (for both Chromium and ChromeOS, from what I can tell). So this isn't a fringe thing.