I hope this explains a bit more about stable normalisations.
I found it the most difficult part of the proposal to write, because it can quickly sound very complex as you talk about different systems getting different results.
"Is x
Normalized?"
So, to explain stable normalisations, it's helpful to start by considering what it means when we say a string "is normalised". It's very simple; literally all it means is that normalising the string returns the same string.
isNormalized(x):
    normalize(x) == x
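As a concrete (if simplified) illustration, here's roughly what that pseudocode could look like in Swift today, assuming NFC and leaning on Foundation's canonical composition API rather than anything from the proposal. The name is mine, not the proposal's.

import Foundation

// A sketch of the pseudocode above, assuming NFC. Swift's String `==`
// itself compares canonical equivalence, so we compare the exact
// scalar sequences instead.
func isNormalizedNFC(_ x: String) -> Bool {
    x.precomposedStringWithCanonicalMapping.unicodeScalars
        .elementsEqual(x.unicodeScalars)
}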
For me, it was a bit of a revelation to grasp that in general, the result of isNormalized is not gospel and is only locally meaningful. Asking the same question, at another point in space or in time, may yield a different result:
- Two machines communicating over a network may disagree about whether x is normalised.
- The same machine may think x is normalised one day, then after an OS update, suddenly think the same x is not normalised.
"Are x
and y
Equivalent?"
Normalisation is how we define equivalence. Two strings, x and y, are equivalent if normalising each of them produces the same result:
areEquivalent(x, y):
    normalize(x) == normalize(y)
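For completeness, the same kind of sketch in Swift (same assumptions and Foundation import as above; again the name is mine). It's worth noting that Swift's String == already performs a canonical-equivalence comparison using the runtime's own Unicode tables, which is exactly what drives the Set behaviour in the demo below.

// Normalise both sides to NFC, then compare the exact scalars.
func areEquivalentNFC(_ x: String, _ y: String) -> Bool {
    x.precomposedStringWithCanonicalMapping.unicodeScalars
        .elementsEqual(y.precomposedStringWithCanonicalMapping.unicodeScalars)
}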
And so, following from the previous section, when we deal in pairs (or collections) of strings:
- Two machines communicating over a network may disagree about whether x and y are equivalent or distinct.
- The same machine may think x and y are distinct one day, then after an OS update, suddenly think that the same x and y are equivalent.
This has some interesting implications. For instance:
- If you encode a Set<String> in a JSON file, when you (or another machine) decode it later, the resulting Set's count may be less than it was when it was encoded.
- And if you associate values with those strings, such as in a Dictionary<String, SomeValue>, some values may be discarded because their keys are now considered duplicates.
- If you serialise a sorted list of strings, they may not be considered sorted when you (or another machine) load them.
A demo always helps:
let strings = [
    "e\u{1E08F}\u{031F}",
    "e\u{031F}\u{1E08F}",
]
print(strings)
print(Set(strings).count)
Each of these strings contains an "e" and the same two combining marks. One of them, U+1E08F, is COMBINING CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I, which was added in Unicode 15.0 (September 2022). (Here's a list of all combining marks in Unicode, sorted by age - they still add them fairly often.)
Running the above code snippet on Swift 5.2 (via Godbolt), we find the Set has 2 strings. If we run it on nightly (Godbolt), it only contains 1 string.
There are situations where this can be a problem. And the problem, at least for these examples, is that we can't guarantee that a normalised string will always be considered normalised.
Aside: I actually think this issue is somewhat underrated by other languages and libraries, and it should be a feature of Swift's Unicode support that we offer developers the tools to deal with it.
How Stable Normalisation Helps
Unicode's stabilisation policies and the normalisation process help to limit the problems of unstable equivalence.
Firstly (without getting too into the details), the normalize(x) function, upon which everything is built, treats unassigned characters as if they begin a new "segment" of the string. Let's consider the strings from the above example, which each contain two combining characters. As part of normalisation, these must be sorted into canonical order.
let strings = [
    "e\u{1E08F}\u{031F}",
    "e\u{031F}\u{1E08F}",
]
The second string is in the correct canonical order - \u{031F} before \u{1E08F} - and if the Swift runtime supports at least Unicode 15.0, we will know to rearrange them like that. That means:
// On nightly:
isNormalized(strings[0]) // false
isNormalized(strings[1]) // true
areEquivalent(strings[0], strings[1]) // true
And that is why Swift nightly only has 1 string in its Set.
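If you want to poke at the data behind that decision, Unicode.Scalar.Properties exposes the canonical combining class. The values in the comments are what I'd expect on a Unicode 15.0 runtime, not output I've captured:

let marks: [Unicode.Scalar] = ["\u{031F}", "\u{1E08F}"]
for mark in marks {
    // Expect 220 (a "below" mark) for U+031F and a higher non-zero class
    // for U+1E08F, which is why U+031F sorts first. Where U+1E08F is
    // unassigned, it reports class 0, and class-0 characters are never
    // reordered.
    print(String(mark.value, radix: 16), mark.properties.canonicalCombiningClass.rawValue)
}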
The Swift 5.2 system, on the other hand, doesn't know that it's safe to rearrange those characters (one of them is completely unknown to it!) so it is conservative and leaves the string as it is. That means:
// On 5.2:
isNormalized(strings[0]) // true <-----
isNormalized(strings[1]) // true
areEquivalent(strings[0], strings[1]) // false <-----
This is quite an important result - it considers both strings normalised, and therefore not equivalent! (This is what I meant when I said isNormalized isn't gospel.)
But one very nice consequence of normalisation not touching things it doesn't understand is that the true normalisation (strings[1]) is universally agreed to be normalised; it is stable, and it is only the status of un-normalised text that is in dispute. This is what Unicode refers to as a stabilised string, which the proposal exposes as stableNormalization:
Once a string has been normalized by the NPSS for a particular normalization form, it will never change if renormalized for that same normalization form by an implementation that supports any version of Unicode, past or future.
For example, if an implementation normalizes a string to NFC, following the constraints of NPSS (aborting with an error if it encounters any unassigned code point for the version of Unicode it supports), the resulting normalized string would be stable: it would remain completely unchanged if renormalized to NFC by any conformant Unicode normalization implementation supporting a prior or a future version of the standard.
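To make that quoted rule a bit more concrete, here is a rough sketch of how you could approximate a stabilised normalisation yourself today - again assuming NFC via Foundation, and using Unicode.Scalar.Properties to detect unassigned code points. The proposal's stableNormalization will have its own shape; this only illustrates the "abort on unassigned code points" idea:

// Returns nil if the result cannot be guaranteed stable on this runtime.
func stableNFCNormalization(of s: String) -> String? {
    // If the input contains any code point that is unassigned in this
    // runtime's version of Unicode, refuse to produce a result at all.
    for scalar in s.unicodeScalars
        where scalar.properties.generalCategory == .unassigned {
        return nil
    }
    return s.precomposedStringWithCanonicalMapping
}

On the earlier example, a Unicode 15.0 runtime would return the stable form for both strings, while a pre-15.0 runtime would return nil for both, because U+1E08F is unassigned there.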
Since normalisation defines equivalence, it also follows that two distinct stable normalisations will never be considered equivalent. From a developer's perspective, if I store N stable normalisations into my Set<String> or Dictionary<String, X>, I know for a fact that any client that decodes that data will see a collection of N distinct keys.
Historical Note
What we have discussed is the "weakest" normalisation stability guarantee - it's the baseline, the most basic guarantee we can offer: that your existing normalised text (if it is a "true" or "stable" normalisation) will always be considered normalised.
Technically, this doesn't include a guarantee about which specific string you get out when you first normalise some non-normalised text. That's because there were a handful of esoteric mapping changes between Unicode 3.1 and 4.1 which we can basically forget about now. From UAX15:
This guarantee has been in place for Unicode 3.1 and after. It has been necessary to correct the decompositions of a small number of characters since Unicode 3.1, as listed in the Normalization Corrections data file, but such corrections are in accordance with the above principles: all text normalized on old systems will test as normalized in future systems. All text normalized in future systems will test as normalized on past systems. Prior to Unicode 4.1, what may change for those few characters, is that unnormalized text may normalize differently on past and future systems.
There are 3 published corrigenda between Unicode 3.1 and 4.1 corresponding to those characters (numbers 2, 3, and 4). The last was in 2003 (which is somehow over 20 years ago now) and two of them state clearly:
This corrigendum does not change the status of normalized text. Text that is in any of the normalization forms (NFD, NFC, NFKD, NFKC) as defined in Unicode 3.2.0 is still in that same normalization form after the application of this corrigendum.
The exception is HEBREW LETTER YOD WITH HIRIQ, an apparently very rare character that was mistakenly omitted from one table in Unicode 3.0 back in 2001 as a result of a clerical error, and then patched in 3.1 in a way that meant previously-normalised text was no longer normalised.
The stronger version of the stability guarantee is that now (since 4.1), they promise they won't even make corrections on that scale ever again. This is the guarantee @Michael_Ilseman mentioned:
If a string contains only characters from a given version of Unicode, and it is put into a normalized form in accordance with that version of Unicode, then the results will be identical to the results of putting that string [note: this means the original, un-normalised input string] into a normalized form in accordance with any subsequent version of Unicode.
And it explains why there have been no normalisation mapping corrigenda for the last 20 years.
For us: we have our own data tables, and we're never going to be running on Unicode 3.1..<4.1 data, so this is never going to be an issue we need to care about. We are producing data using the corrected tables, and any systems which use these incredibly rare characters and specifically depend on data normalised to the old tables are going to have some mechanism in place to deal with it.
"Lookup Stability"?
There is a separate concern, which is the kinds of strings that could be used to perform lookups (for example, in a dictionary).
Even if you're careful to only store stabilised normalisations in the dictionary (strings[1] in the example from earlier), that's good - you'll never encounter a duplicate key, and the value will always be retrievable by giving the exact scalars of the key - but only clients with Unicode 15.0 would be able to look that entry up using un-normalised counterparts, such as strings[0].
Older systems will still think strings[0] and strings[1] are not equivalent, because they still lack the data to turn strings[0] into strings[1] and so just leave them be.
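Spelling that out with the strings from the earlier demo, in the same style as the earlier snippets (the comments describe the behaviour discussed above):

let table = [strings[1]: 42]   // keyed by the stable normalisation

table[strings[1]]   // 42 on every system - the exact scalars always match the key
table[strings[0]]   // 42 on runtimes with Unicode 15.0 data; nil on older ones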
It may be reasonable to limit allowed characters used in important strings (such as usernames and passwords) if you want to ensure users can enter them reliably on old devices, but that's more of a user-experience issue.
The technical quality you want for something like a password is stable normalisation.
So last thing to discuss: age. Is it really just a number?
Does age matter?
For producing a stable normalisation, no. The only thing that matters is that you have the data to produce the result, and the point is that all other systems will forever agree that it is normalised and therefore distinct.
If it's important to you that all clients get exactly the same lookup behaviour, and you need to support systems with older versions of Unicode, you might additionally want to add some kind of version limit. I don't think it's usually so important, though, and in any case is unrelated to whether the normalisation is stable.