ICU usage in Swift

Michael_Ilseman · February 14, 2019, 8:29pm

I believe it is a long-term goal of the standard library to wean itself off of ICU, but there are some challenges involved. ICU is a continual source of performance pain for us, limits the applicability of Swift to systems-level programming, complicates Swift distribution, etc.

However, ICU currently serves 3 valuable purposes:

1. ICU Bundles Data

ICU includes a copy of the UCD and similar data. Accessing this data is necessary to implement any of the other functionality ICU provides us, as well as to back some APIs such as Unicode.Scalar.Properties, capitalization, etc. Accessing this data through ICU is typically too expensive for us to do in a hot loop implementing one of the algorithms below, but is fine for public API.
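As a small illustration of the kind of public API that is backed by this data, here is a sketch that queries a few properties of a single scalar through Unicode.Scalar.Properties. The scalars are picked only for illustration.

```swift
// Query a few UCD-backed properties through the standard library's public API.
let accent: Unicode.Scalar = "\u{0301}"   // COMBINING ACUTE ACCENT
let props = accent.properties

let name = props.name                                         // optional scalar name
let isMark = props.generalCategory == .nonspacingMark         // general category
let combiningClass = props.canonicalCombiningClass.rawValue   // used by normalization

// Case mappings come from the same data and can expand to several scalars.
let eszett: Unicode.Scalar = "\u{00DF}"   // LATIN SMALL LETTER SHARP S
let upper = eszett.properties.uppercaseMapping

print(name ?? "<unnamed>", isMark, combiningClass, upper)
```

Each of these lookups is cheap enough for one-off queries, which is the "fine for public API" case, but it is not something to call per scalar in a comparison loop.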

This data is large and changes with every version of Unicode. Bundling this data in the standard library would require us to:

Prune out data we don't use
Find or invent a compact binary representation and lazy expansion mechanism
Vigilantly keep the data up to date, tying versions of the standard library to specific Unicode versions
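To make "compact representation" a bit more concrete, here is one possible shape for such a table: sorted scalar ranges searched with binary search. This is only a sketch. The CCCRange type, the cccTable excerpt and the combiningClass(of:in:) helper are all hypothetical names, the table is a tiny hand-written excerpt rather than real generated data, and lazy expansion is not shown.

```swift
// Hypothetical compact table entry: an inclusive scalar range mapped to a
// canonical combining class. A real table would be generated from the UCD.
struct CCCRange {
    let lower: UInt32
    let upper: UInt32   // inclusive
    let ccc: UInt8
}

// Tiny illustrative excerpt, sorted and non-overlapping. NOT complete data.
let cccTable: [CCCRange] = [
    CCCRange(lower: 0x0300, upper: 0x0314, ccc: 230),
    CCCRange(lower: 0x0316, upper: 0x0319, ccc: 220),
    CCCRange(lower: 0x0327, upper: 0x0328, ccc: 202),
]

// Binary search over the sorted ranges. Scalars that are absent from the
// table are treated as combining class 0.
func combiningClass(of scalar: Unicode.Scalar, in table: [CCCRange]) -> UInt8 {
    var low = 0
    var high = table.count - 1
    while low <= high {
        let mid = low + (high - low) / 2
        let entry = table[mid]
        if scalar.value < entry.lower {
            high = mid - 1
        } else if scalar.value > entry.upper {
            low = mid + 1
        } else {
            return entry.ccc
        }
    }
    return 0
}

print(combiningClass(of: "\u{0301}", in: cccTable))
print(combiningClass(of: "a", in: cccTable))
```

Storing ranges rather than one entry per scalar is also where pruning pays off: properties we never query simply never get a table.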

An alternative, which would allow us to make progress on the next two points before/without tackling this one, is to ask ICU for APIs that give direct access to the binary data, along with the means to interpret that data.

2. ICU Implements Normalization Algorithms

We use ICU to lazily convert a moving window of a string's contents to NFC for comparison (we honor canonical equivalence). We also want to add API allowing users to view the scalar or code unit contents of a String in a given normalization form, and we would utilize ICU for that.
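For readers less familiar with canonical equivalence, this sketch shows the user-visible behavior: two strings with different scalar contents compare equal because they normalize to the same thing.

```swift
let precomposed = "caf\u{E9}"     // é as the single scalar U+00E9
let decomposed = "cafe\u{301}"    // e followed by U+0301 COMBINING ACUTE ACCENT

// String comparison honors canonical equivalence.
let sameString = precomposed == decomposed

// The underlying scalars and code units are nevertheless different.
let sameScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)
let sameBytes = precomposed.utf8.elementsEqual(decomposed.utf8)

print(sameString, sameScalars, sameBytes)                // true false false
print(precomposed.utf8.count, decomposed.utf8.count)     // 5 6
```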

Trying to work around performance issues here is the cause of considerable complexity in the standard library.

These algorithms are not complicated and are unlikely to change over time, as they are mostly driven by the data. But to implement something efficient, we would likely need/want more direct access to the data.

Implementing this in the standard library would take some work up-front, and some perf tuning, but should yield a substantial benefit for comparisons. It would also allow us to more easily pre-validate contents as already being NFC, in which case canonical equivalence is the same as binary equivalence and we can just memcmp!
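A rough sketch of that pre-validation idea follows. It relies on the fact that every scalar below U+0300 has canonical combining class 0 and is unchanged by NFC, so a string made only of such scalars is already NFC. The isTriviallyNFC and canonicallyEqual helpers are hypothetical names for illustration, not standard library API, and a real implementation would use the normalization data to recognize far more strings as NFC.

```swift
// Conservative check: true means "definitely NFC", false means "unknown".
// Every scalar below U+0300 is unchanged by NFC and has combining class 0.
func isTriviallyNFC(_ s: String) -> Bool {
    return s.unicodeScalars.allSatisfy { $0.value < 0x300 }
}

func canonicallyEqual(_ a: String, _ b: String) -> Bool {
    if isTriviallyNFC(a) && isTriviallyNFC(b) {
        // Both sides are already NFC, so canonical equivalence reduces to
        // code unit equality. This is the memcmp case.
        return a.utf8.elementsEqual(b.utf8)
    }
    // Otherwise fall back to String's normalization-aware comparison.
    return a == b
}

print(canonicallyEqual("cafe", "cafe"))
print(canonicallyEqual("caf\u{E9}", "cafe\u{301}"))
```

The second call takes the slow path because U+0301 is not below U+0300, which is exactly the kind of input where a proper NFC quick check would do better than this cutoff.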

3. ICU Implements Grapheme Breaking

Unlike normalization, where the algorithm is fixed and the data changes version-to-version of Unicode, grapheme breaking's algorithm and data both change version-to-version of Unicode.
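As a reminder of what grapheme breaking buys us at the API level, this sketch compares the Character count of a few strings with their scalar count. The examples are deliberately limited to cases whose segmentation has been stable across Unicode versions. Newer emoji sequences are where version-to-version differences tend to show up.

```swift
let accented = "e\u{301}"                  // e + COMBINING ACUTE ACCENT
let flag = "\u{1F1FA}\u{1F1F8}"            // two regional indicator scalars
let hangul = "\u{1112}\u{1161}\u{11AB}"    // conjoining jamo: L + V + T

// Each string is a single grapheme cluster (Character) made of several scalars.
for s in [accented, flag, hangul] {
    print(s.count, s.unicodeScalars.count)
}
```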

Implementing this in the standard library would require revision and validation for every new version of Unicode, beyond the initial implementation and perf tuning. Like #1, it would tie stdlib versions to Unicode versions.