ICU usage in Swift

Hello fellow developers,

One dependency that has long been a pain point for Swift is ICU. Setting aside the question of whether the dependency makes sense for Swift at all, I was thinking it may be possible to alleviate some of that pain.

For non-Darwin, non-Windows targets, we are building ICU from source. This is one piece that currently requires autotools, which makes building this on Windows pretty painful. I have a little toy build setup using CMake which should also allow cross-compiling the ICU library. I can put the toy build up if people are interested.

Playing around with this made me start wondering: what exactly are the pieces of ICU that the standard library and Foundation/CoreFoundation need? Perhaps we can build a reduced-functionality version of ICU with just the components we need for Swift. Assuming the API surface is not very large, we should be able to do a static build of the ICU code while keeping the data shared. That keeps the largest part of ICU, the data, shared across Foundation/CoreFoundation and the runtime (and possibly even the system!). Statically linking the reduced build would enable dead-code elimination of the rest of the ICU library, so the standard library and Foundation would each carry only the ICU code that Swift actually uses.

Is there something obvious that I am overlooking here, or is there a more subtle reason that this approach wasn't considered or was deemed unreasonable?

CC: @Michael_Ilseman @Michael_Gottesman @millenomi @pvieito


Is the ICU data guaranteed to be forward and backward compatible across ICU versions? If not, static linking would not be possible, as it would require the embedded version to exactly match the OS version, which can't work if you deploy on more than one OS major version.

Well, for Linux distributions, they may be able to control the ICU data itself. But we probably still want it shared so that there is a single copy between Foundation and the standard library, since the bulk of ICU's size is the data itself.

I think the stdlib uses ICU for Unicode normalisation and grapheme breaking and there is the idea of implementing the code part of that in Swift directly:
Swift Native Grapheme Breaking: [SR-9423] Swift Native Grapheme Breaking · Issue #51887 · apple/swift · GitHub
Stop using ICU for normalisation: [SR-9432] Stop using ICU for normalization · Issue #51896 · apple/swift · GitHub

Foundation makes heavier use of it for localisation, internationalisation, and calendar support, so it may be harder to build a subset for it.

However, you mentioned CMake, and there seems to be some mention of it on ICU's Jira ([ICU-7747] - Unicode Consortium), so maybe it's simpler to get the ICU build itself moved to CMake?

Oh, nifty, I didn't know that the project was already considering that. Yes, I wrote up a pretty quick CMakeLists and put it on GitHub. It is sufficient for building ICU for Windows and Android at least.

I had come across those two SRs previously, and I think it would be great if that happens: it would mean that ICU could be entirely contained within Foundation, reducing the conflict with the system version and simplifying the build as well.

CC: @stamba

The PPC64 target was also running into some issues with ICU that this approach would help alleviate (though we should fix the underlying issue as well).

I believe it is a long-term goal of the standard library to wean itself off of ICU, but there are some challenges involved. ICU is a continual source of performance pain for us, limits the applicability of Swift to systems-level programming, complicates Swift distribution, etc.

However, ICU currently serves three valuable purposes:

1. ICU Bundles Data

ICU includes a copy of the UCD (Unicode Character Database) and similar data. Access to this data is necessary to implement all of the other functionality ICU provides us, as well as to back APIs such as Unicode.Scalar.Properties, capitalization, etc. Going through ICU for this data is typically too expensive for the hot loops implementing the algorithms below, but is fine for public API.
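
For context, here is a small example of the kind of public API this data backs. Unicode.Scalar.Properties has been available since Swift 5, and (as described above) these queries are currently answered by going through ICU:

let e: Unicode.Scalar = "\u{E9}" // é
print(e.properties.name ?? "?")                      // LATIN SMALL LETTER E WITH ACUTE
print(e.properties.isAlphabetic)                     // true
print(e.properties.canonicalCombiningClass.rawValue) // 0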

This data is large and changes with every version of Unicode. Bundling this data in the standard library would require:

  1. Pruning out the data we don't use
  2. Finding or inventing a compact binary representation and a lazy expansion mechanism
  3. Vigilantly keeping the data up to date, tying versions of the standard library to specific Unicode versions

An alternative, which would allow us to make progress on the next two points before/without tackling this one, is to ask for ICU APIs for direct access to the binary data and the means to interpret that data.

2. ICU Implements Normalization Algorithms

We use ICU to lazily convert a moving window of a string's contents to NFC for comparison (we honor canonical equivalence). We also want to add API allowing users to view the scalar or code unit contents of a String in a given normalization form, and we would utilize ICU for that.

Trying to work around performance issues here is the cause of considerable complexity in the standard library.

These algorithms are not complicated and are unlikely to change over time, as they are mostly driven by the data. But to implement something efficient, we would likely need/want more direct access to the data.

Implementing this in the standard library would take some up-front work and some perf tuning, but should yield a substantial benefit for comparisons. It would also let us more easily pre-validate contents as already being NFC, in which case canonical equivalence coincides with binary equivalence and we can just memcmp!
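
To make that fast path concrete, here is a minimal sketch (hypothetical names, not the actual stdlib implementation) of how a "known NFC" flag collapses canonical comparison into a plain binary comparison. The slow path here leans on Foundation's precomposedStringWithCanonicalMapping, which itself goes through ICU:

import Foundation

// Hypothetical sketch: when both operands are pre-validated as NFC,
// canonical equivalence coincides with binary equality.
func canonicallyEqual(_ lhs: String, _ rhs: String, bothKnownNFC: Bool) -> Bool {
    if bothKnownNFC {
        // Fast path: compare the UTF-8 bytes directly (effectively memcmp).
        return lhs.utf8.elementsEqual(rhs.utf8)
    }
    // Slow path: normalize to NFC first, then compare the normalized bytes.
    return lhs.precomposedStringWithCanonicalMapping.utf8
        .elementsEqual(rhs.precomposedStringWithCanonicalMapping.utf8)
}

let precomposed = "\u{C5}"   // Å as a single scalar
let decomposed = "A\u{30A}"  // A + COMBINING RING ABOVE
print(canonicallyEqual(precomposed, decomposed, bothKnownNFC: false)) // true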

3. ICU Implements Grapheme Breaking

Unlike normalization, where the algorithm is fixed and the data changes version-to-version of Unicode, grapheme breaking's algorithm and data both change version-to-version of Unicode.

Implementing this in the standard library would require revision and validation for every new version of Unicode, beyond the initial implementation and perf tuning. Like #1, it would tie stdlib versions to Unicode versions.
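
To make concrete what grapheme breaking decides, compare the Character count with the Unicode scalar count for strings like these; the regional-indicator and ZWJ rules involved are exactly the kind of thing that shifts between Unicode versions:

let flag = "🇺🇸"     // two regional-indicator scalars, one grapheme
let family = "👨‍👩‍👧‍👦"  // four person scalars joined by three ZWJs, one grapheme
print(flag.count, flag.unicodeScalars.count)     // 1 2
print(family.count, family.unicodeScalars.count) // 1 7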


I filed SR-10535


I asked about the status of ICU (or whatever stand-in) in the compiler itself over in another thread and still have no answer. Maybe someone here knows?

While I realize the compiler and the standard library are distinct, I am not sure how much their dependency requirements affect each other. Anyone working on this should probably be aware that access to some form of Unicode normalization will be required for the other.

The compiler itself does not use ICU, but the standard library does. The current state of affairs has the standard library and the compiler builds conflated, which makes things rather confusing. The compiler's dependencies remain LLVM and clang. The compiler doesn't really have any string normalization in it though (that I am aware of) - why do you need string normalization there?

Because right now identifiers (including operators) do not follow Unicode canonical equivalence. The core team has said here, here, here, and definitively here that it is a bug that should be fixed.

Right now surprises like the following are possible:

let café = "café" // NFD
let café = "Fwahahaha!" // NFC
print(café) // Compiles and runs, but what does it do?
infix operator ≠ // NFD

// Compiler error: operator not defined.
func ≠(lhs: Int, rhs: Int) -> Bool { // NFC
    return lhs != rhs
}
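
For what it's worth, Swift's String already honors canonical equivalence at runtime; the mismatch is that the compiler (presumably) compares identifiers by their raw code units. A quick demonstration with the café example:

let nfd = "cafe\u{301}" // e + COMBINING ACUTE ACCENT
let nfc = "caf\u{E9}"   // precomposed é
print(nfd == nfc)                                             // true: String comparison is canonical
print(Array(nfd.unicodeScalars) == Array(nfc.unicodeScalars)) // false: different scalar sequences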

For more details see the thread linked earlier.


P.S. Thank you @compnerd for your clear answer.


Ah, I see. That is rather unfortunate. The thing is, even in the standard library there is a desire to move away from ICU, so if there is a way to do Unicode normalisation efficiently in a standalone manner, that might work well. But adding a dependency on ICU to the compiler is really not very palatable to me: ICU is a large library and requires the data library, which is really large, and that would increase the load time of the compiler - a huge hit to compilation times and overall memory usage.


Yes, I do not like the idea of adding ICU at this point either, which is part of why I didn’t jump on submitting a bug fix right away.

What would be nice is if, when normalization is factored out of ICU, it is done in a way that lets both the compiler and the standard library share most of the source. Since they both essentially need the same replacement for ICU, it would be nice to provide the foundation for solutions to both issues at once.

@Michael_Ilseman - sorry to necromance this thread again.

So, ICU 64 seems to have added even more data; at this point, the ICU data alone is >20 MiB. However, there is now a tool that we should be able to use to limit the data that we package into the ICU data files. The question becomes: what data do we really need for the combination of the standard library and Foundation? (I suspect that @millenomi would be better suited to answer that.) I have a custom CMakeLists.txt setup now to build ICU, and I am considering adding support for building the data bundle as well. It seems like we should be able to reduce the packaged data to what we truly need. (icu/buildtool.md at main · unicode-org/icu · GitHub provides a good overview of the contributions of the various pieces of data.)

@spevans and I are considering moving the Linux target to use the new CMakeLists as a means of simplifying the build as well as speeding it up. This would also be a good time to shrink the actual runtime size.


The standard library basically needs a subset of the UCD. But I would guess that even bundling all of the UCD in an efficient binary representation shouldn't take that much space. What all are you measuring in the 20 MiB? Are you including the CLDR? All locale, etc. concerns are considered outside the domain of the standard library. Foundation likely makes heavy use of the CLDR and needs to pull in much more data.

The standard library needs the data that drives the following:

  1. The portions of the UCD exposed by Unicode.Scalar.Properties, understanding that more may be exposed in the future.
  2. A couple properties used for normalization fast-paths, understanding that we may want to expose all normal forms in the future (not just NFC).
  3. Anything transitively required by the ICU APIs we use, such as grapheme cluster break properties.

Yes, I am describing the full Unicode data that ICU bundles by default. I agree that the data we actually use should be much smaller, which is why I was asking what exactly that set is. We should be able to build just the subset we actually need; since the ICU APIs are not exposed through the Swift interfaces, we don't need to worry about users requesting data that we exclude from the custom bundle.

Also, please pardon my limited knowledge in the domain, I could really use some help mapping the desired data to the ICU categories.

  1. Unicode Character names (unames): ~270 KiB
  2. Normalization (normalization): ~160 KiB
  3. Break Iteration (brkitr_rules, brkitr_dictionaries, brkitr_trees): ~525 KiB, ~3 MiB, ~15 KiB

If that is correct, that comes out to ~4 MiB, which is still significantly smaller than the ~20 MiB.
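
If that mapping is roughly right, a first cut at a filter file for the ICU Data Build Tool might look something like the following (a sketch based on the buildtool.md documentation; the category names should be double-checked against it, since getting this set right is exactly the open question):

{
  "strategy": "additive",
  "featureFilters": {
    "unames": "include",
    "normalization": "include",
    "brkitr_rules": "include",
    "brkitr_dictionaries": "include",
    "brkitr_trees": "include"
  }
}

The filter file is handed to the build through the ICU_DATA_FILTER_FILE environment variable.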

Some thoughts:

  1. My main worry is that it may be non-trivial, in the general case, to map what we need to these ICU category rules (as you alluded to). That means that unless we have perfect test coverage (which we won't), it may not be obvious to an updater what to add; if we did have perfect coverage, we would at least be guaranteed to fail at runtime when data was missing. Generating the right data, and training people who may not understand ICU to maintain it, seems hard to get right and likely to lead to bugs. I am very hesitant to say we should remove data unless we have an automated way to do it that is guaranteed to avoid these problems.

  2. Have you upstreamed the CMake code for building ICU? I am not sure we should take the custom thing; it would be better to use stock ICU that the ICU team has tested.

Just to give you an idea, quoting the documentation of the ICU Data Build Tool:

File Slicing (coarse-grained features)

ICU provides a lot of features, of which you probably need only a small subset for your application. Feature slicing is a powerful way to prune out data for any features you are not using.

CAUTION: When slicing by features, you must manually include all dependencies. For example, if you are formatting dates, you must include not only the date formatting data but also the number formatting data, since dates contain numbers. Expect to spend a fair bit of time debugging your feature filter to get it to work the way you expect it to.

I think that, for the standard library at least, this is something we absolutely can and should do. The overhead difference is potentially massive.

Cross-compiling ICU is nearly impossible without the custom CMake rules, and autotools also makes building ICU a lot more complicated. Particularly for Windows, I honestly don't see any other way to build ICU (it requires a ton of additional setup, and build-script and python are not really scalable approaches to setting up a full Windows image to do a build). I think that upstream is interested in the CMake support, but doing that completely is a larger undertaking than I can currently take on. I welcome someone else completing this work to the point where upstream will switch over to it.