[Pitch] Unicode Normalization

An alternative formulation could be to put the preferred normal form on String and expose all the other forms on the views. E.g.

extension String.[UnicodeScalar|UTF8|UTF16]View {
  public var nfd: some Sequence<Element>
  public var nfc: some Sequence<Element>
  public var nfkd: some Sequence<Element>
  public var nfkc: some Sequence<Element>
}
extension String {
  public var normalized: String
  public mutating func normalize()
}
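
For illustration, call sites under that formulation might read roughly like this (hypothetical, since none of these members exist today; the names are taken from the sketch above):

var s = "cafe\u{0301}"
s.normalize()                         // mutate in place into the preferred form
let copy = s.normalized               // non-mutating variant
for scalar in s.unicodeScalars.nfd {  // other forms live on the views
    print(scalar)
}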

The rest of the API in the pitch is not parameterized over the form enums, so those could be dropped.

1 Like

I see the rationale behind trying to design in this way, but to me it's starting to rhyme with the evolution path of String where for a while only its views conformed to Collection for maximum correctness.

In this case, the APIs are specialized enough that I doubt their design will be revisited any time soon even if we don't make the most ergonomic design choice, so I worry that we should not lightly discard the lessons of history here.

What, concretely, do we gain with this sort of fastidiousness?

4 Likes

I think this is, in general, a good intuitive guide. There are some more details in practice.

String is also where the initializers are hosted, and String(decoding:as:) will produce a string whose UnicodeScalarView yields the same scalars that came in, not a different but canonically equivalent sequence.

Additionally, String's Comparable and Hashable conformances yield a machine ordering, which is useful for invariants and storage in data structures, rather than a localized, human-facing ordering. This implies that the performance of comparisons and hashing is an important property.

The impure API you note does not change the observable results, but it does change the execution time (which, in practice, can be observable). Similarly, normalizing a string into the stdlib's preferred form can significantly change the observable execution time of an operation that lies at the heart of the performance of data structures such as Dictionary. Additionally, normalizing the string will change the contents of the code unit and scalar views.
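
As a small illustration of that last point (my own example, not from the pitch): two canonically equivalent strings compare equal and hash the same, but their scalar views differ, and normalizing either string in place is exactly what would change its views.

let precomposed = "caf\u{00E9}"  // "café" ending in U+00E9 LATIN SMALL LETTER E WITH ACUTE
let decomposed = "cafe\u{0301}"  // "cafe" + U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)  // true: equality is canonical equivalence
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)  // 4 5: the views differ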

2 Likes

Yeah. The NFKC and NFKD forms of a string are also canonically equivalent to each other, so they would probably be collapsed into a single "compatibilityNormalized()" method, maybe with an optional parameter to distinguish the two.

That's fair. There's a tradeoff here. On the one hand, we can prioritize ergonomics and consistency with the current ecosystem of APIs and user code. On the other hand, we can choose to commit to a specific, long-term vision which I think has benefits, but may or may not pan out in the end.

I think the primary benefit would be that it rewards users for picking the "right" type for the domain they're working with. This could reduce friction when users need to reach for the Equatable, Hashable, and Collection implementations that are best suited for their current problem domain. It could also reduce friction in the other direction, when users are already using, say, Set<UnicodeScalarView>, and would appreciate immediate access to the APIs most relevant for their lower-level problem domain. It could also prevent bugs, such as a user accidentally using Characters when they meant to use Unicode.Scalars, or accidentally using the incorrect Hashable implementation.

I have frequently had two needs:

  • Make sure something's NFC normalised
  • An isNFC check

This proposal takes care of that. Therefore: Very much support it!

Caveats: I didn't read through the whole discussion and I don't think I fully grasp everything. I believe I have a good understanding of NFC & NFD and this proposal delivers what I need and more.

2 Likes

Does server-side use have any need for stability of normalization? That is, knowing under which Unicode versions the string is guaranteed to be in NFC. If a code point is unassigned in one version of Unicode and in another it is assigned with a non-zero canonical combining class, it will normalize differently between those two versions. As soon as it has been assigned, however, it will always normalize the same in future versions of Unicode.

This could come up if the isNFC invariant needs to hold for the same content across different versions of Swift, a database using a different version of Unicode, or a non-Swift library or process.

Good question; honestly, I don't know. I have not had a use case where that's important.

The only thing I could see is storing only normalised byte sequences in a data store and then later on making assumptions about them being normalised. Not sure how frequently that would come up.

I was thinking about user inputs that may need to match or be hashed (like passwords) and that may accept Unicode. The general practice is to not allow Unicode in passwords for this reason, but I am certain this will eventually be considered a culturally insensitive practice. I was also thinking about certain file systems that accept Unicode characters and may have a particular normalization convention.

I hope this explains a bit more about stable normalisations.

I found it the most difficult part of the proposal to write, because it can quickly sound very complex as you talk about different systems getting different results.

"Is x Normalized?"

So, to explain stable normalisations, it's helpful to start by considering what it means when we say a string "is normalised". It's very simple; literally all it means is that normalising the string returns the same string.

isNormalized:
  normalize(x) == x

For me, it was a bit of a revelation to grasp that, in general, the result of isNormalized is not gospel and is only locally meaningful. Asking the same question at another point in space or in time may yield a different result:

  • Two machines communicating over a network may disagree about whether x is normalised.

  • The same machine may think x is normalised one day, then after an OS update, suddenly think the same x is not normalised.

"Are x and y Equivalent?"

Normalisation is how we define equivalence. Two strings, x and y, are equivalent if normalising each of them produces the same result:

areEquivalent(x, y):
  normalize(x) == normalize(y)

And so following from the previous section, when we deal in pairs (or collections) of strings, it follows that:

  • Two machines communicating over a network may disagree about whether x and y are equivalent or distinct.

  • The same machine may think x and y are distinct one day, then after an OS update, suddenly think that the same x and y are equivalent.

This has some interesting implications. For instance:

  • If you encode a Set<String> in a JSON file, when you (or another machine) decodes it later, the resulting Set's count may be less than what it was when it was encoded.

  • And if you associate values with those strings, such as in a Dictionary<String, SomeValue>, some values may be discarded because we would think they have duplicate keys.

  • If you serialise a sorted list of strings, they may not be considered sorted when you (or another machine) loads them.

A demo always helps:

let strings = [
    "e\u{1E08F}\u{031F}",
    "e\u{031F}\u{1E08F}",
]

print(strings)
print(Set(strings).count)

Each of these strings contains an "e" and the same two combining marks. One of them, U+1E08F, is COMBINING CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I, which was added in Unicode 15.0 (09/2022). (There's a list of all combining marks in Unicode, sorted by age - they still add them fairly often.)

Running the above code snippet on Swift 5.2 (via Godbolt), we find the Set has 2 strings. If we run it on nightly (Godbolt), it only contains 1 string.

There are situations where this can be a problem. And the problem, at least for these examples, is that we can't guarantee that a normalised string will always be considered normalised.

Aside: I actually think this issue is somewhat underrated by other languages and libraries, and it should be a feature of Swift's Unicode support that we offer developers the tools to deal with it.

How Stable Normalisation Helps

Unicode's stabilisation policies and the normalisation process help to limit the problems of unstable equivalence.

Firstly (without getting too far into the details), the normalize(x) function, upon which everything is built, treats unassigned characters as if they begin a new "segment" of the string. Let's consider the strings from the above example, each of which contains two combining characters. As part of normalisation, these must be sorted.

let strings = [
  "e\u{1E08F}\u{031F}",
  "e\u{031F}\u{1E08F}",
]
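
(Aside: you can inspect the canonical combining classes that drive this sorting with the standard library's Unicode.Scalar.Properties; the values you see depend on the Unicode tables your runtime ships with.)

for scalar in "e\u{031F}\u{1E08F}".unicodeScalars {
    // Starters like "e" have combining class 0; combining marks are
    // canonically ordered by ascending combining class.
    print(scalar.escaped(asASCII: true), scalar.properties.canonicalCombiningClass.rawValue)
}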

The second string is in the correct canonical order - \u{031F} before \u{1E08F} - and if the Swift runtime supports at least Unicode 15.0, it knows it is safe to rearrange the marks into that order. That means:

// On nightly:

isNormalized(strings[0]) // false
isNormalized(strings[1]) // true
areEquivalent(strings[0], strings[1]) // true

And that is why Swift nightly only has 1 string in its Set.

The Swift 5.2 system, on the other hand, doesn't know that it's safe to rearrange those characters (one of them is completely unknown to it!), so it is conservative and leaves the string as it is. That means:

// On 5.2:

isNormalized(strings[0]) // true  <-----
isNormalized(strings[1]) // true
areEquivalent(strings[0], strings[1]) // false  <-----

This is quite an important result - it considers both strings normalised, and therefore not equivalent! (This is what I meant when I said isNormalized isn't gospel.)

But one thing that is very nice: because normalisation doesn't touch things it doesn't understand, the true normalisation (strings[1]) is universally agreed to be normalised; it is stable, and it is only the status of un-normalised text that is in dispute. This is what Unicode refers to as a stabilised string, and it is what the proposal exposes as stableNormalization:

Once a string has been normalized by the NPSS for a particular normalization form, it will never change if renormalized for that same normalization form by an implementation that supports any version of Unicode, past or future.

For example, if an implementation normalizes a string to NFC, following the constraints of NPSS (aborting with an error if it encounters any unassigned code point for the version of Unicode it supports), the resulting normalized string would be stable: it would remain completely unchanged if renormalized to NFC by any conformant Unicode normalization implementation supporting a prior or a future version of the standard.

Since normalisation defines equivalence, it also follows that two distinct stable normalisations will never be considered equivalent. From a developer's perspective, if I store N stable normalisations into my Set<String> or Dictionary<String, X>, I know for a fact that any client that decodes that data will see a collection of N distinct keys.
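
For illustration, that storage-side discipline might look roughly like this. (The name stableNormalization comes from the pitch; its exact shape may differ - here I'm assuming it returns nil when the string contains code points that are unassigned in the runtime's Unicode version.)

// Hypothetical sketch: only stabilised normalisations go into the store,
// so every client that decodes it sees the same N distinct keys.
var index: [String: Int] = [:]

func store(id: Int, key: String) -> Bool {
    guard let stableKey = key.stableNormalization else {
        return false  // contains unassigned code points; not stabilisable on this system
    }
    index[stableKey] = id
    return true
}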

Historical Note

What we have discussed is the "weakest" normalisation stability guarantee - it's the baseline, the most basic guarantee we can offer: that your existing normalised text (if it is a "true" or "stable" normalisation) will always be considered normalised.

Technically, this doesn't include a guarantee about which specific string you get out when you first normalise some non-normalised text. That's because there were a handful of esoteric mapping changes between Unicode 3.1 and 4.1, which we can basically forget about now. From UAX #15:

This guarantee has been in place for Unicode 3.1 and after. It has been necessary to correct the decompositions of a small number of characters since Unicode 3.1, as listed in the Normalization Corrections data file, but such corrections are in accordance with the above principles: all text normalized on old systems will test as normalized in future systems. All text normalized in future systems will test as normalized on past systems. Prior to Unicode 4.1, what may change for those few characters, is that unnormalized text may normalize differently on past and future systems.

There are 3 published corrigenda between Unicode 3.1 and 4.1 corresponding to those characters (numbers 2, 3, and 4). The last was in 2003 (which is somehow over 20 years ago now) and two of them state clearly:

This corrigendum does not change the status of normalized text. Text that is in any of the normalization forms (NFD, NFC, NFKD, NFKC) as defined in Unicode 3.2.0 is still in that same normalization form after the application of this corrigendum.

The exception is HEBREW LETTER YOD WITH HIRIQ, an apparently very rare character that was mistakenly omitted from one table in Unicode 3.0 back in 2001 as a result of a clerical error, and then patched in 3.1 in a way that meant previously-normalised text was no longer normalised.

The stronger version of the stability guarantee is that now (since 4.1), they promise they won't even make corrections on that scale ever again. This is the guarantee @Michael_Ilseman mentioned:

If a string contains only characters from a given version of Unicode, and it is put into a normalized form in accordance with that version of Unicode, then the results will be identical to the results of putting that string [note: this means the original, un-normalised input string] into a normalized form in accordance with any subsequent version of Unicode.

And it explains why there have been no normalisation mapping corrigenda for the last 20 years.

For us: we have our own data tables, and we're never going to be running on Unicode 3.1..<4.1 data. This is never going to be an issue we need to care about. We are producing data using the corrected tables, and any system which uses these incredibly rare characters and specifically depends on data normalised to the old tables is going to have some mechanism in place to deal with it.

"Lookup Stability"?

There is a separate concern, which is the kinds of strings that could be used to perform lookups (for example, in a dictionary).

Even if you're careful to store only stabilised normalisations in the dictionary (strings[1] in the example from earlier), that's good: you'll never encounter a duplicate key, and the value will always be retrievable by giving the exact scalars of the key. But only clients with Unicode 15.0 or later would be able to look that entry up using un-normalised counterparts, such as strings[0].

Older systems will still think strings[0] and strings[1] are not equivalent, because they still lack the data to turn strings[0] into strings[1] and so just leave them be.
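
Concretely, reusing the strings from the earlier example (the results are as described above and depend on the runtime's Unicode version):

// The key is stored in its stable, canonically ordered form (strings[1]).
let table = ["e\u{031F}\u{1E08F}": "some value"]

// Looking it up via the un-normalised spelling (strings[0]):
let value = table["e\u{1E08F}\u{031F}"]
// "some value" on runtimes with Unicode 15.0 data (the keys are seen as equivalent);
// nil on older runtimes, which still treat the two spellings as distinct.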

It may be reasonable to limit allowed characters used in important strings (such as usernames and passwords) if you want to ensure users can enter them reliably on old devices, but that's more of a user-experience issue.

The technical quality you want for something like a password is stable normalisation.

So last thing to discuss: age. Is it really just a number?

Does age matter?

For producing a stable normalisation, no. The only thing that matters is that you have the data to produce the result, and the point is that all other systems will forever agree that it is normalised and therefore distinct.

If it's important to you that all clients get exactly the same lookup behaviour, and you need to support systems with older versions of Unicode, you might additionally want to add some kind of version limit. I don't think it's usually so important, though, and in any case is unrelated to whether the normalisation is stable.

17 Likes

Thanks for the detailed write up @Karl, it's very helpful and parts of it will make great additions to the doc comments of the final API.

To summarize my understanding, if a string is in stable-NFC:

  • Conversion to NFC will always produce the same string in any version of Unicode
  • Running NPSS under a prior version of Unicode could throw an error. For example, it could be reported as not stable-NFC on an older system.
  • A system with a prior version of Unicode might not correctly detect canonical equivalents to that string

The question now is how we should surface this as API.

As pitched, we're looking at an additional "stable" API for every normalization API. Should this also extend to Unicode.NormalizedScalars and Unicode.NFXNormalizer? Similarly, isNormalized()?

What if we just added an isStablyNormalized() query? Or an areAllCodePointsAssigned API with a juicy doc comment explaining NPSS and why this could be useful?
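
For what it's worth, a rough approximation of that last query is already expressible with Unicode.Scalar.Properties. This is just a sketch of the idea, not the pitched API, and it only reflects the Unicode version the running stdlib was built against:

extension String {
    // True when every scalar is assigned in the Unicode version known to this
    // runtime - the precondition NPSS imposes for a stable normalization.
    var allCodePointsAssigned: Bool {
        unicodeScalars.allSatisfy { $0.properties.generalCategory != .unassigned }
    }
}

"e\u{031F}\u{1E08F}".allCodePointsAssigned  // true with Unicode 15.0 tables, false on older runtimes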