Strings for Linguistics: pitch for default NFKD + reverse mapping

Most software applications that process text just need to process Strings and Characters, thanks to Unicode equivalence:

L01 let c1 = "\u{00F1}"           // equivalent to let c1 = "ñ"
L02 let c2 = "\u{006E}\u{0303}"   // "n" + combining tilde
L03 c1 == c2                      // => true

The following code will produce a String, but there is no guarantee whether its .unicodeScalars array represents the content with precomposed or decomposed characters, or even a mix of the two forms:

L04 let data = try Data(contentsOf: sampleUtf8TextUrl!)
L05 let text = String(data: data, encoding: .utf8)!

Most programmers don’t care how exactly the text.unicodeScalars array is represented, thanks to the Unicode equivalence shown in L03: there is no need to deal with unicodeScalars; just have your application scan text.characters, and the Unicode equivalence will take care of differences in the internal representation.

(Performing a Unicode string equivalence test such as the one in L03 is necessarily costly, though.)
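To make the point concrete, here is a small self-contained illustration (standard library only) of how the two spellings of "ñ" compare equal and contain a single Character each, while their unicodeScalars views differ:

```swift
let precomposed = "\u{00F1}"          // "ñ" as one scalar
let decomposed  = "\u{006E}\u{0303}"  // "n" + combining tilde

precomposed == decomposed         // => true: canonical equivalence
precomposed.count                 // => 1 Character
decomposed.count                  // => 1 Character as well
precomposed.unicodeScalars.count  // => 1
decomposed.unicodeScalars.count   // => 2
```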

This Unicode equivalence is enough for most software applications, but it does not address linguistic needs.

(1) In a large number of cases, a linguistic parser needs to be able to match a letter that occurs in a text without its diacritic(s):

– the word form “CAFE” occurring in French texts must match the lexical entry “café”.
– In Hebrew texts, one could find the letter ש (shin) as is, but this letter could actually match “שּׂ” (shin + dagesh + shin dot) in the dictionary or grammar.

Similarly, ligatures such as “ﬁ” (U+FB01) need to be separated into “f” + “i” before matching against grammars (ligatures in Arabic and Devanagari scripts are numerous).

These operations can be performed easily and efficiently if the .unicodeScalars array of the string is in NFKD form. Otherwise, it is costly to match a character (e.g. “e”) against the potentially large set of equivalent precomposed characters (“é”, “è”, “ê”, “ë”, “ẽ”, etc.), and even more costly when the letter occurring in the text is a ligature that contains two or more actual letters.
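As a sketch of the kind of matching described above: once a string is in NFKD form, combining marks are separate scalars with a non-zero canonical combining class, so stripping them reduces a word to its base letters. The helper name `baseLetters` is mine, not an existing API:

```swift
import Foundation

// Hypothetical helper: reduce a word to its base letters by taking the
// NFKD form and dropping combining marks, so "CAFE" can match "café"
// and the ligature "ﬁ" decomposes into "f" + "i".
func baseLetters(_ s: String) -> String {
    let nfkd = s.decomposedStringWithCompatibilityMapping  // NFKD
    // Combining marks (accents, dagesh, shin dot, ...) have a non-zero
    // canonical combining class; keep only base scalars.
    let kept = nfkd.unicodeScalars.filter {
        $0.properties.canonicalCombiningClass == .notReordered
    }
    var result = ""
    result.unicodeScalars.append(contentsOf: kept)
    return result
}

baseLetters("café").uppercased()  // => "CAFE"
baseLetters("ﬁle")               // => "file"
```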

Therefore, any linguistic parser needs to systematically compute the NFKD form of each string it wants to parse, e.g.:

L06 let textToParse = text.decomposedStringWithCompatibilityMapping

My understanding is that this operation involves converting the original .unicodeScalars array to a (char *) buffer to access the ICU library, copying and transforming the (char *), and then converting the resulting (char *) back into unicodeScalars, which is very costly. It would be much cheaper if the code in L05 produced the String in NFKD form directly while scanning the initial flow of UTF-8 bytes.

I therefore suggest that String.unicodeScalars should always be normalized in NFKD form, making L06 unnecessary.

If all Swift standard methods always produced a normalized form for String.unicodeScalars, the Unicode equality test (c1 == c2) would be very efficient.

I understand that the W3C would rather have us use the NFC form. But the Unicode equivalence that acts on Strings and Characters already fulfills all the needs NFC addresses. In other words, there is no need for NFC if we have Swift Strings and Characters.

(2) When a linguistic parser processes the string’s .unicodeScalars array, it will find matches. These matches are expressed as indices into the .unicodeScalars array, which have nothing to do with the initial data the client sent to the parser (usually a UTF-8 array of bytes).

Converting the UTF-8 array of bytes (in L05) has broken the link between the client’s data and the parser’s data: the parser cannot tell its client where the matches are.

I suggest this link should be kept, maybe via a system of indices such as:

text.utf8 => should contain the client’s initial array of bytes (including the BOM)
text.utf8.indices => should contain the beginning position of each character (i.e. grapheme cluster) in the UTF-8 array, e.g. text.utf8.indices[3] points to the beginning of the UTF-8 sequence for the 3rd grapheme cluster in the text.

text.unicodeScalars => should contain the NFKD representation of the string
text.unicodeScalars.indices => array that contains the beginning position of each character (i.e. grapheme) in .unicodeScalars, e.g. text.unicodeScalars.indices[3] points to the beginning of the sequence of unicodeScalars for the 3rd grapheme of the text.

This way, the two arrays of indices would be synchronized, and the parser could tell its client where the match occurred.
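For reference, today this link does survive as long as the parser and the client share the same String instance, because String.Index is shared across all of a String’s views; the pitch asks for this link to also survive normalization. A minimal illustration (the sample text is mine):

```swift
import Foundation

let text = "café au lait"
// A Character-level match found via Foundation's range(of:):
let match = text.range(of: "au")!

// String.Index is shared across a String's views, so the match can be
// reported to the client as a byte offset into the UTF-8 view:
let byteOffset = text.utf8.distance(from: text.utf8.startIndex,
                                    to: match.lowerBound)
// byteOffset == 6, because "café " occupies 6 UTF-8 bytes ("é" is 2)
```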

I don’t think you’ve made a strong case why this should be done by default. In fact, you’ve made an argument for the opposite by stating that most programmers don’t care about the normalised form of the underlying unicode scalars, so this would just be a performance penalty for them and also incur the other costs of normalisation (not injective/bijective, NFKD in particular is very lossy, etc). Though I realise as I write this that it’s not clear to me if you’re proposing to normalise the storage of the string itself or just providing a view that does this normalisation on the fly.

This seems like it would probably be better handled by an API to take your text in some form (an existing String, maybe Data or whatever byte buffer form) and give you a string where the unicodeScalars are normalised in a given form. Then you can also easily support all four of the Unicode normal forms and you don’t break backwards compatibility by changing how unicodeScalars works for all Strings.

Most programmers can avoid dealing with .unicodeScalars, thanks to the Unicode equivalence performed by the "==" operator.

However, programmers who process the .unicodeScalars array without normalizing it first risk writing buggy software. Using non-normalized forms of .unicodeScalars is harmful for most non-English languages. Therefore, the .unicodeScalars array should always be normalized before being processed.

You are right: normalizing the .unicodeScalars array can be done via an API that provides the four methods: decomposedStringWithCompatibilityMapping (NFKD), precomposedStringWithCompatibilityMapping (NFKC), decomposedStringWithCanonicalMapping (NFD), and precomposedStringWithCanonicalMapping (NFC). I assume these methods are based on the ICU package, which is costly.
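A quick illustration of these four Foundation properties on the "ñ" example from above, showing that all four results are equal as Strings while their scalar counts differ:

```swift
import Foundation

let s = "\u{00F1}"  // precomposed "ñ"
let nfd  = s.decomposedStringWithCanonicalMapping       // NFD
let nfc  = s.precomposedStringWithCanonicalMapping      // NFC
let nfkd = s.decomposedStringWithCompatibilityMapping   // NFKD
let nfkc = s.precomposedStringWithCompatibilityMapping  // NFKC

nfd == nfc && nfkd == nfkc    // => true: all equal as Strings
nfd.unicodeScalars.count      // => 2 ("n" + combining tilde)
nfc.unicodeScalars.count      // => 1 (precomposed "ñ")
```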

It seems to me that scanning the original UTF-8 to directly produce a normalized .unicodeScalars array would be much more efficient than first producing a non-normalized (useless) array and then converting it to a normalized one. Note that this would cost nothing for programs that process Strings and Characters and never touch the .unicodeScalars array, because this array is evaluated lazily.

I am confused by your statement that a normalized .unicodeScalars array would break compatibility for existing programs that do not assume the array is normalized: if these programs assume that the letter “ñ” can be represented either by “\u{00F1}” or by “\u{006E}\u{0303}”, then why would they break if the standard Swift library always gives them “\u{006E}\u{0303}”?

Now if (!) we agree that we need a normalized form for .unicodeScalars, which normalization form should we use? From a linguistic point of view, I argued that NFKD is the best form, because it splits graphemes into real linguistic units.

You did mention that this form is lossy, but the typical examples of loss I am aware of are not really relevant to Swift programmers, because they initially access each Character of the String (i.e. they know the Character “ﬁ” is not equal to “f”) before looking inside the .unicodeScalars array.
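This distinction can be checked directly: canonical equivalence (what String comparison uses) does not relate the ligature to its letters, while NFKD does:

```swift
import Foundation

let ligature = "\u{FB01}"  // the "ﬁ" ligature, a single Character
// String comparison uses canonical equivalence only, so the ligature
// stays distinct from the two-letter sequence...
ligature == "fi"                                           // => false
// ...while NFKD (compatibility decomposition) splits it apart:
ligature.decomposedStringWithCompatibilityMapping == "fi"  // => true
```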


It sounds like your motivation is locale-aware string processing, but you're pitching normalization of Unicode scalars and these are not the same thing. There is no guarantee that NFKD "splits graphemes" in locale-appropriate ways for all locales.

We definitely don't want users to assume that unicodeScalars is appropriate for locale-aware string processing without using the appropriate locale-aware facilities. Providing lossy (NFKD) or even non-lossy (NFD) decompositions by default only makes such potentially incorrect use more attractive, not less. Moreover, it would be inappropriate for those who wish to access the original non-normalized string, and it incurs a performance cost.


I agree with you that we are not talking about locale-aware strings. On the contrary, I would like to have robust, normalized unicodeScalars-based processing that is independent of any locale settings, so that I could build one linguistic parser that handles texts in any language. For instance, the ligature “ﬁ” is always equivalent to “f” + “i”, whatever the language. The letter “ñ” is always equivalent to both its precomposed and its decomposed representations, whatever the language.

On the one hand, I only see bad things coming from processing an uncontrolled unicodeScalars array: either there will be bugs, because programmers forget that there is more than one way to represent a given grapheme in a unicodeScalars array, or programmers will need to systematically call a costly API to normalize the unicodeScalars array before doing any computation.

On the other hand, having a normalized unicodeScalars array does not stop us from accessing the original String.utf8 view. One can both parse the String using Characters (where, thanks to the Unicode equivalence, we don’t even care about differences in representation) and inspect the String.utf8 view, which allows us to distinguish between the decomposed and the precomposed representations of a grapheme.

In other words, a normalized unicodeScalars array does not make us lose access to the String.utf8 view, which represents the original data and is not lossy.
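This last point is easy to verify: the two spellings of "ñ" compare equal as Strings, yet their utf8 views still expose the original byte-level difference:

```swift
let precomposed = "\u{00F1}"          // "ñ"
let decomposed  = "\u{006E}\u{0303}"  // "n" + combining tilde

precomposed == decomposed  // => true: equal as Strings
Array(precomposed.utf8)    // => [0xC3, 0xB1]
Array(decomposed.utf8)     // => [0x6E, 0xCC, 0x83]
```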