ICU usage in Swift

windows
android
linux

(Saleem Abdulrasool) #1

Hello fellow developers,

One dependency that has been annoying for Swift is ICU. Ignoring the question of whether the dependency makes sense for Swift or not, I was thinking it may be possible to alleviate at least some of the pain of ICU.

For non-Darwin, non-Windows targets, we are building ICU from source. This is one piece that currently requires autotools, which makes building this on Windows pretty painful. I have a little toy build setup using CMake which should also allow cross-compiling the ICU library. I can put the toy build up if people are interested.

Playing around with this made me start wondering: what exactly do the standard library and Foundation/CoreFoundation need from ICU? Perhaps we can build a reduced-functionality version of ICU with just the components that we need for Swift. Assuming that the API surface is not very large, we should be able to do a static build of ICU but keep the data shared. This keeps the largest part of ICU shared across Foundation/CoreFoundation and the runtime (and possibly even the system!). Statically linking the reduced build would enable DCE (dead code elimination) of the rest of ICU, so the standard library and Foundation would each carry only the ICU code that they actually use.

Is there something obvious that I am overlooking here, or is there a more subtle reason that this approach wasn't considered or was deemed unreasonable?

CC: @Michael_Ilseman @Michael_Gottesman @millenomi @pvieito


(Jean-Daniel) #2

Is the ICU data guaranteed to be forward and backward compatible across ICU versions? If not, static linking would not be possible, as it would require that the embedded version exactly match the OS version, which isn't possible if you deploy on more than one major OS version.


(Saleem Abdulrasool) #3

Well, Linux distributions may be able to control the ICU data themselves. But we probably still want it shared, so that there is a single copy between Foundation and the standard library, since the bulk of ICU's size is the data itself.


(Simon Evans) #4

I think the stdlib uses ICU for Unicode normalisation and grapheme breaking and there is the idea of implementing the code part of that in Swift directly:
Swift Native Grapheme Breaking: https://bugs.swift.org/browse/SR-9423
Stop using ICU for normalisation: https://bugs.swift.org/browse/SR-9432

Foundation makes heavier use of it for localisation, internationalisation and calendar support, so it may be harder to build a subset there.

However, you mentioned CMake, and there seems to be some mention of it on ICU's Jira https://unicode-org.atlassian.net/browse/ICU-7747 so maybe it's simpler to get the ICU build to move to CMake?


(Saleem Abdulrasool) #5

Oh, nifty, I didn't know that the project was already considering that. Yes, I wrote up a pretty quick CMakeLists and put it on GitHub. It is sufficient for building ICU for Windows and Android at least.

I had come across those two SRs previously, and I think it would be great if that happens. It would mean that ICU could be folded entirely into Foundation, reducing the conflict with the system version and simplifying the build as well.


(Saleem Abdulrasool) #6

CC: @stamba

The PPC64 target was also running into some issues with ICU that this approach would help alleviate (though we should fix the underlying issue as well).


(Michael Ilseman) #7

I believe it is a long-term goal of the standard library to wean itself off of ICU, but there are some challenges involved. ICU is a continual source of performance pain for us, limits the applicability of Swift to systems-level programming, complicates Swift distribution, etc.

However, ICU currently serves 3 valuable purposes:

1. ICU Bundles Data

ICU includes a copy of the UCD (Unicode Character Database) and similar data. Accessing this is necessary to implement any of the other functionality ICU provides us, as well as to back some APIs such as Unicode.Scalar.Properties, capitalization, etc. Accessing this data through ICU is typically too expensive for us to do in a hot loop implementing one of the algorithms below, but is fine for public API.
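As a small illustration of the kind of public API that sits on top of this data, here is a minimal sketch that queries a few properties of one scalar. It assumes a toolchain that ships Unicode.Scalar.Properties (Swift 5 or later); the variable names are only for illustration.

```swift
// Query a few UCD-backed properties of a single scalar through the public API.
let scalar: Unicode.Scalar = "a"
let properties = scalar.properties

let isAlphabetic = properties.isAlphabetic   // Bool
let category = properties.generalCategory    // Unicode.GeneralCategory
// The mapping is a String because some scalars uppercase to more than one scalar.
let uppercase = properties.uppercaseMapping

print(isAlphabetic, category, uppercase)
```

Each of these lookups is a one-off query, which is the "fine for public API" case described above, as opposed to a per-scalar lookup inside a comparison loop.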

This data is large and changes with every version of Unicode. Bundling this data in the standard library would require:

  1. Pruning out the data we don't use
  2. Finding or inventing a compact binary representation and lazy expansion mechanism
  3. Vigilantly keeping the data up to date, tying versions of the standard library to specific Unicode versions.
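
To make the second requirement a bit more concrete, here is a minimal sketch of one possible compact representation: a sorted table of scalar ranges searched with binary search. Everything here (ScalarRangeTable, UnicodeTables, latin1Uppercase) is hypothetical and not part of the standard library or ICU, and the table holds only a tiny excerpt (the uppercase letters of Basic Latin and Latin-1) rather than real bundled data.

```swift
/// Hypothetical compact table: sorted, non-overlapping closed ranges of scalar values.
struct ScalarRangeTable {
    let ranges: [ClosedRange<UInt32>]

    /// Binary search over the ranges, O(log n) in the number of ranges.
    func contains(_ scalar: Unicode.Scalar) -> Bool {
        let value = scalar.value
        var low = 0
        var high = ranges.count
        while low < high {
            let mid = (low + high) / 2
            let range = ranges[mid]
            if value < range.lowerBound {
                high = mid
            } else if value > range.upperBound {
                low = mid + 1
            } else {
                return true
            }
        }
        return false
    }
}

enum UnicodeTables {
    // Static stored properties are initialized lazily on first access,
    // so the table costs nothing until something actually queries it.
    // Illustrative excerpt only: A-Z, U+00C0...U+00D6, U+00D8...U+00DE.
    static let latin1Uppercase = ScalarRangeTable(
        ranges: [0x41...0x5A, 0xC0...0xD6, 0xD8...0xDE]
    )
}

print(UnicodeTables.latin1Uppercase.contains("A"))        // uppercase letter
print(UnicodeTables.latin1Uppercase.contains("\u{D7}"))   // multiplication sign, not a letter
```

A real table would be generated from the UCD for each Unicode version, which is where the third requirement (keeping it up to date) comes from.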

An alternative, which would allow us to make progress on the next two points before/without tackling this one, is to ask for ICU APIs for direct access to the binary data and the means to interpret that data.

2. ICU Implements Normalization Algorithms

We use ICU to lazily convert a moving window of a string's contents to NFC for comparison (we honor canonical equivalence). We also want to add API allowing users to view the scalar or code unit contents of a String in a given normalization form, and we would utilize ICU for that.
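
For readers less familiar with canonical equivalence, a minimal sketch of what it means for String comparison: the two strings below spell the same text with different scalars, so they compare equal as Strings even though their UTF-8 bytes differ.

```swift
// "é" as one precomposed scalar (U+00E9) versus "e" plus a combining acute accent (U+0301).
let precomposed = "caf\u{E9}"
let decomposed = "cafe\u{301}"

// String equality honors canonical equivalence.
let equalAsStrings = precomposed == decomposed

// The underlying code units are different, so a bytewise comparison disagrees.
let equalAsBytes = Array(precomposed.utf8) == Array(decomposed.utf8)

print(equalAsStrings, equalAsBytes)
```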

Trying to work around performance issues here is the cause of considerable complexity in the standard library.

These algorithms are not complicated and are unlikely to change over time, as they are mostly driven by the data. But to implement something efficient, we would likely need/want more direct access to the data.

Implementing this in the standard library would take some work up-front, and some perf tuning, but should yield a substantial benefit for comparisons. It would also allow us to more easily pre-validate contents as already being NFC, in which case canonical equivalence is the same as binary equivalence and we can just memcmp!
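
The pre-validation idea can be sketched as follows. This is only an illustration under stated assumptions, not how the standard library is implemented: CheckedString and canonicallyEqual are hypothetical names, the only "known NFC" check used is the conservative fact that ASCII-only text is already in NFC, and a plain array comparison stands in for memcmp.

```swift
/// Hypothetical wrapper that remembers whether its contents are known to be NFC.
struct CheckedString {
    let text: String
    let utf8: [UInt8]
    let isKnownNFC: Bool

    init(_ text: String) {
        let bytes = Array(text.utf8)
        self.text = text
        self.utf8 = bytes
        // Conservative check: ASCII-only text is always already in NFC.
        // A real implementation would want a proper NFC quick check here.
        self.isKnownNFC = bytes.allSatisfy { $0 < 0x80 }
    }
}

func canonicallyEqual(_ a: CheckedString, _ b: CheckedString) -> Bool {
    if a.isKnownNFC && b.isKnownNFC {
        // Fast path: both sides are NFC, so binary equality is canonical equality.
        return a.utf8 == b.utf8
    }
    // Slow path: fall back to the normalizing String comparison.
    return a.text == b.text
}

print(canonicallyEqual(CheckedString("swift"), CheckedString("swift")))
print(canonicallyEqual(CheckedString("caf\u{E9}"), CheckedString("cafe\u{301}")))
```

The first call takes the bytewise fast path; the second contains non-ASCII text, so it falls back to the slower comparison and still reports the strings as equal.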

3. ICU Implements Grapheme Breaking

Unlike normalization, where the algorithm is fixed and the data changes version-to-version of Unicode, grapheme breaking's algorithm and data both change version-to-version of Unicode.

Implementing this in the standard library would require revision and validation for every new version of Unicode, beyond the initial implementation and perf tuning. Like point 1 above, it would tie stdlib versions to Unicode versions.
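
As a reminder of what grapheme breaking provides at the API level, a minimal sketch: a String's count is measured in Characters (grapheme clusters), not scalars. The two cases below were chosen because their breaking rules are long-standing; results for newer sequences, such as some emoji, can depend on the Unicode version in use.

```swift
// "e" followed by a combining acute accent: two scalars, one grapheme cluster.
let accented = "e\u{301}"
let accentedCharacters = accented.count
let accentedScalars = accented.unicodeScalars.count

// CR followed by LF is never broken apart: two scalars, one grapheme cluster.
let crlf = "\r\n"
let crlfCharacters = crlf.count
let crlfScalars = crlf.unicodeScalars.count

print(accentedCharacters, accentedScalars, crlfCharacters, crlfScalars)
```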