ICU usage in Swift

CC: @stamba

The PPC64 target was also running into some issues with ICU that this approach would help alleviate (though we should fix the underlying issue as well).

I believe it is a long-term goal of the standard library to wean itself off of ICU, but there are some challenges involved. ICU is a continual source of performance pain for us, limits the applicability of Swift to systems-level programming, complicates Swift distribution, etc.

However, ICU currently serves 3 valuable purposes:

1. ICU Bundles Data

ICU includes a copy of the UCD and similar data. Access to this data is necessary to implement the other functionality ICU provides us, as well as to answer APIs such as Unicode.Scalar.Properties, capitalization queries, etc. Going through ICU for this data is typically too expensive in a hot loop implementing one of the algorithms below, but is fine for public API.
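For illustration, these are the kinds of lookups Unicode.Scalar.Properties answers today. This is existing standard-library API; the outputs in the comments assume a recent Unicode version:

let e: Unicode.Scalar = "\u{E9}" // é
print(e.properties.name ?? "<unnamed>") // LATIN SMALL LETTER E WITH ACUTE
print(e.properties.isAlphabetic)        // true
print(e.properties.generalCategory)     // lowercaseLetter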

This data is large and changes with every version of Unicode. Bundling this data in the standard library would require:

  1. Pruning out the data we don't use
  2. Finding or inventing a compact binary representation and a lazy expansion mechanism
  3. Vigilantly keeping it up to date, tying versions of the standard library to specific Unicode versions

An alternative, which would allow us to make progress on the next two points before/without tackling this one, is to ask for ICU APIs for direct access to the binary data and the means to interpret that data.

2. ICU Implements Normalization Algorithms

We use ICU to lazily convert a moving window of a string's contents to NFC for comparison (we honor canonical equivalence). We also want to add API allowing users to view the scalar or code unit contents of a String in a given normalization form, and we would utilize ICU for that.

Trying to work around performance issues here is the cause of considerable complexity in the standard library.

These algorithms are not complicated and are unlikely to change over time, as they are mostly driven by the data. But to implement something efficient, we would likely need/want more direct access to the data.

Implementing this in the standard library would take some work up-front, and some perf tuning, but it should yield a substantial benefit for comparisons. It would also allow us to more easily pre-validate contents as already being NFC, in which case canonical equivalence is the same as binary equivalence and we can just memcmp!
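A minimal sketch of that fast path, assuming a hypothetical isKnownNFC bit on String (no such public property exists today):

func canonicallyEqual(_ lhs: String, _ rhs: String) -> Bool {
    // Hypothetical: both strings were validated as NFC up-front.
    if lhs.isKnownNFC && rhs.isKnownNFC {
        // Canonical equivalence reduces to binary equivalence; this is
        // effectively a memcmp over the UTF-8 code units.
        return lhs.utf8.elementsEqual(rhs.utf8)
    }
    // General path: == already honors canonical equivalence.
    return lhs == rhs
}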

3. ICU Implements Grapheme Breaking

Unlike normalization, where the algorithm is fixed and the data changes version-to-version of Unicode, grapheme breaking's algorithm and data both change version-to-version of Unicode.

Implementing this in the standard library would require revision and validation for every new version of Unicode, beyond the initial implementation and perf tuning. Like #1, it would tie stdlib versions to Unicode versions.
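To make concrete what grapheme breaking decides, here is the kind of result it produces (real Swift behavior on recent toolchains, under recent Unicode rules):

let family = "👨‍👩‍👧‍👦" // four person emoji joined by zero-width joiners
print(family.count)                // 1: a single extended grapheme cluster (Character)
print(family.unicodeScalars.count) // 7: four emoji scalars plus three ZWJs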


I filed SR-10535


I asked about the status of ICU (or whatever stand-in) in the compiler itself over in another thread and still have no answer. Maybe someone here knows?

While I realize the compiler and the standard library are distinct, I am not sure how much their dependency requirements affect each other. Anyone working on this should probably be aware that the compiler, too, will require access to some form of Unicode normalization.

The compiler itself does not use ICU, but the standard library does. The current state of affairs has the standard library and compiler builds conflated, which makes things rather confusing. The compiler's dependencies remain LLVM and clang. The compiler doesn't really have any string normalization in it (that I am aware of), though - why do you need string normalization there?

Because right now identifiers (including operators) do not follow Unicode equivalence. The core team has said here, here, here, and definitively here that it is a bug that should be fixed.

Right now surprises like the following are possible:

let café = "café" // NFD: identifier spelled c-a-f-e + U+0301
let café = "Fwahahaha!" // NFC: c-a-f + U+00E9 - a distinct identifier, so this compiles
print(café) // Compiles and runs, but what does it do? It depends on this line's normalization.

infix operator ≠ // NFD: = + U+0338

// Compiler error: operator not defined.
func ≠(lhs: Int, rhs: Int) -> Bool { // NFC: U+2260, a different operator name
    return lhs != rhs
}

For more details see the thread linked earlier.


P.S. Thank you @compnerd for your clear answer.


Ah, I see. That is rather unfortunate. The thing is, even in the standard library there is a desire to move away from ICU, so if there is a way to do Unicode normalisation efficiently in a standalone manner, that might work well. But adding a dependency on ICU in the compiler is really not very palatable to me - ICU is a large library and requires the data library, which is really large; that would increase the load time of the compiler, which would be a huge hit to compilation times and overall memory usage.


Yes, I do not like the idea of adding ICU at this point either, which is part of why I didn’t jump on submitting a bug fix right away.

What would be nice is if, when normalization is factored out of ICU, it is done in a way that lets the compiler and the standard library share most of the source. Since they both essentially need the same replacement for ICU, it would be nice to lay the foundation for solving both issues at once.

@Michael_Ilseman - sorry to necromance this thread again.

So, ICU 64 seems to have added even more data; at this point, the ICU data alone is >20 MiB. However, there is now a tool that we should be able to use to limit the data that we package into the ICU data files. The question now becomes: what data do we really need for the combination of the standard library and Foundation (though I suspect that @millenomi would be better suited to answer that)? I have a custom CMakeLists.txt setup now to build ICU, and I am considering adding support for building the data bundle as well. It seems like we should be able to reduce the packaged data to what we truly need (buildtool.md in the unicode-org/icu repository on GitHub provides a good overview of what the various pieces of data contribute).

@spevans and I are considering moving the Linux target to use the new CMakeLists as a means of simplifying the build as well as speeding it up. This would also be a good time to shrink the actual runtime size.


The standard library basically needs a subset of the UCD. But I would guess that even all of the UCD, bundled in an efficient binary representation, shouldn't be that big. What exactly are you measuring in the 20 MiB? Are you including the CLDR? All locale and similar concerns are considered outside the domain of the standard library; Foundation likely makes heavy use of the CLDR and needs to pull in much more data.

The standard library needs the data that drives the following:

  1. The portions of the UCD exposed by Unicode.Scalar.Properties, understanding that more may be exposed in the future (see the sketch after this list).
  2. A couple properties used for normalization fast-paths, understanding that we may want to expose all normal forms in the future (not just NFC).
  3. Anything transitively required by the ICU APIs we use, such as grapheme cluster break properties.
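As a small illustration of items 1 and 2, normalization-relevant data is already surfaced through Unicode.Scalar.Properties (existing API; the outputs in the comments assume a recent Unicode version):

let acute: Unicode.Scalar = "\u{301}" // COMBINING ACUTE ACCENT
print(acute.properties.canonicalCombiningClass.rawValue) // 230: governs reordering in normalization
print(acute.properties.isFullCompositionExclusion)       // false: it may compose during NFC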

Yes, I am describing the full Unicode data that ICU bundles by default. I agree that the data we actually use should be much smaller, which is why I was asking what exactly the set we need to include is. Since the ICU APIs are not exposed through the Swift interfaces, we don't need to worry about users requesting some data that we exclude from the custom bundle; we should be able to build just the subset that we actually need.

Also, please pardon my limited knowledge of the domain; I could really use some help mapping the desired data to the ICU categories.

  1. Unicode Character names (unames): ~270 KiB
  2. Normalization (normalization): ~160 KiB
  3. Break Iteration (brkitr_rules, brkitr_dictionaries, brkitr_trees): ~525 KiB, ~3 MiB, ~15 KiB

If that is correct, that comes out to ~4 MiB, which is still significantly smaller than the ~20 MiB.
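If that mapping is right, a first cut at a filter file for the ICU Data Build Tool might look roughly like the following. This is a sketch only: the category names are copied from the list above and would need to be validated against buildtool.md, and the file would be passed via the ICU_DATA_FILTER_FILE environment variable when configuring ICU.

{
  "strategy": "additive",
  "featureFilters": {
    "unames": "include",
    "normalization": "include",
    "brkitr_rules": "include",
    "brkitr_dictionaries": "include",
    "brkitr_trees": "include"
  }
}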

Some thoughts:

  1. My main worry is that it may be non-trivial to map our usage onto these ICU category rules in the general case (as you alluded to). That means that unless we have perfect test coverage (which we won't), it may not be obvious to an updater what to add; if we had perfect test coverage, we would at least be guaranteed to fail at runtime when data was missing. Generating the right data, and training people who may not understand ICU to do so, seems hard to make work and will lead to bugs. I am very hesitant to say we should remove data unless we have an automated way to do this that is guaranteed to avoid these problems.

  2. Have you upstreamed the CMake code for building ICU? I am not sure that we should take the custom thing; it would be better to use stock ICU that the ICU team has tested.

Just to give you an idea, to quote the documentation of the ICU Data Build Tool:

File Slicing (coarse-grained features)

ICU provides a lot of features, of which you probably need only a small subset for your application. Feature slicing is a powerful way to prune out data for any features you are not using.

CAUTION: When slicing by features, you must manually include all dependencies. For example, if you are formatting dates, you must include not only the date formatting data but also the number formatting data, since dates contain numbers. Expect to spend a fair bit of time debugging your feature filter to get it to work the way you expect it to.

I think that, for the standard library at least, this is something we absolutely can and should do. The overhead difference is potentially massive.

Cross-compiling ICU is nearly impossible without the custom CMake rules; the stock build also makes building ICU a lot more complicated. Particularly for Windows, I don't see any other way to build ICU, honestly (it requires a ton of additional setup, and build-script and python are not really scalable approaches to setting up a full Windows image to do a build). I think that upstream is interested in the CMake support, but doing that completely is a larger undertaking than what I can currently do. I welcome someone else completing this work to the point where upstream will switch over to it.

compnerd (Saleem Abdulrasool) wrote on June 12:

I think that, for the standard library at least, this is something we absolutely can and should do. The overhead difference is potentially massive.

Sure, but the maintenance of it may also be difficult and time-consuming. I think most of my concerns would be satisfied /if/ it is behind a flag and there is no guarantee around it working (i.e. it isn't blocking PR testing and swift.org wouldn't test it).

I guess I need to ask the obligatory... why do you want to do this?

Quoting @compnerd:

Cross-compiling ICU is nearly impossible without the custom CMake rules; the stock build also makes building ICU a lot more complicated. Particularly for Windows, I don't see any other way to build ICU, honestly (it requires a ton of additional setup, and build-script and python are not really scalable approaches to setting up a full Windows image to do a build). I think that upstream is interested in the CMake support, but doing that completely is a larger undertaking than what I can currently do. I welcome someone else completing this work to the point where upstream will switch over to it.

That would mean that we would be maintaining a fork of ICU, no? Looking at update-checkout today, I see that we are not maintaining a fork; my local remote is the upstream unicode-org/icu repository on GitHub (the home of the ICU project source code).

No, please don't fork ICU. We don't need that additional maintenance burden. The way that I did it is that I have a standalone CMakeLists.txt that can be dropped in. No forks, no patches, just a single file that we copy in (we could even place it external to the tree and prefix the paths).

Sure, adding a flag is acceptable to me.

As to the desire to do this: ICU is not a system library on Windows, and an ancient version ships on Android, so Android will need to ship a copy of the data file for ICU to be usable. The Swift runtime is not a system-provided package, nor is Swift considered ABI stable on any platform but Darwin, which means that the only supported means of distribution is a local copy of the runtime. That in turn means that every package needs to bundle a 20+ MiB data file, of which we arguably need only ~1-2 MiB for the standard library. For much of the non-western world, access to packages this large can be prohibitive. Shrinking the data file would let more packages use Swift and provide a smaller download. Even iOS limits OTA updates to ~100 MB IIRC (and Android limits APKs to 100 MB). That means you are spending ~10-20% of your package budget on a data file that is mostly not even needed.

OTA iOS app updates are currently limited to 200 MB, and I believe that limit is removed for iOS 13 (but I don’t know for sure). Not that it matters for this discussion.

I’m going to circle back with some thoughts after offline discussion.

It has been an explicit non-goal of Foundation for some time now to try to work without a full ICU installation available. Development generally assumes that all interfaces in the library are present and working in normal operating conditions; @Tony_Parker may chime in more explicitly about policy, but that’s the assumption that I’ve also been making in writing new s-c-f code. This doesn’t mean that I’m opposed on principle, but making ICU trimming a mandatory part of the release could create the cases @Michael_Gottesman was pointing at every time we update ICU (did a new dependency form inside the library that requires new files for normal operation?) or merge Core Foundation (did it introduce new API usage?). We currently cannot detect such mismatches automatically.

I do not share your confidence that our regular testing can easily find the more esoteric failure cases. ICU-reliant tests are notoriously hard to write, and while I introduced a more resilient variant of them this winter, we still don’t know whether partial string testing will be sufficient to catch bugs, what degree of unknown unknowns we introduce this way, or whether our point-by-point test formulation will ever catch any of them.

Now, I’ve read your Android assessment and looked at current best practices, and I agree that the current situation spends a significant amount of these developers’ size budget. My current, personal position, as the person who needs to set the guarantees for s-c-f’s correctness, is that I would be fine with a community-driven assessment of slicing and the appropriate tooling for achieving and testing it; but this need is so specific to Android that I would call on the part of the community that wants this port maintained to also maintain the slicing setup over time. I’d be happy to notify the appropriate people when changes on my side require a reassessment.


As to constructing the reduced set: ICU actually provides support for tracing the ICU data being used. We could do builds with --enable-tracing on ICU and construct the filter from the traced categories.
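A rough sketch of that workflow, assuming ICU's documented configure flags and the ICU 64 filter mechanism (the filter file name here is hypothetical):

# Build ICU with data-usage tracing enabled.
./runConfigureICU Linux --enable-tracing
# Exercise the standard library and Foundation test suites against this build,
# collect the traced data categories into a filter file, then rebuild with:
ICU_DATA_FILTER_FILE=stdlib-filter.json ./runConfigureICU Linux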
