Most software applications that process text just need to work with Strings and Characters, thanks to Unicode canonical equivalence:
L01 let c1 = "\u{00F1}"         // equivalent to: let c1 = "ñ"
L02 let c2 = "\u{006E}\u{0303}" // "n" + combining tilde (U+0303)
L03 c1 == c2                    // => true
The following code will produce a String, but nothing tells us whether its .unicodeScalars view represents the content with precomposed or decomposed characters, or even a mix of the two forms:
L04 let data = try Data(contentsOf: sampleUtf8TextUrl)
L05 let text = String(data: data, encoding: .utf8)!
Most programmers don't care how exactly the text.unicodeScalars view is represented, thanks to the Unicode equivalence shown in L03: there is no need to deal with unicodeScalars; just have your application scan the text's Characters, and Unicode equivalence will take care of any differences in the internal representation.
(Performing the Unicode string equivalence as in L03 is necessarily costly, though.)
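A minimal sketch of this equivalence in practice, using present-day Swift where a String is itself a Collection of Characters (the post's `.characters` API dates from Swift 3):

```swift
// Canonical equivalence at the String/Character level:
// precomposed "é" (U+00E9) vs "e" + combining acute accent (U+0301).
let precomposed = "caf\u{00E9}"   // 4 unicode scalars
let decomposed  = "cafe\u{0301}"  // 5 unicode scalars

// The scalar-level representations differ...
assert(precomposed.unicodeScalars.count == 4)
assert(decomposed.unicodeScalars.count == 5)

// ...but String equality applies canonical equivalence,
// and both strings contain the same 4 Characters (graphemes):
assert(precomposed == decomposed)
assert(precomposed.count == 4 && decomposed.count == 4)
```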
This Unicode equivalence is enough for most software applications, but it does not address linguistic needs.
(1) In many cases, a linguistic parser needs to match a letter that occurs in a text without its diacritic(s):
-- the word form "CAFE" occurring in French texts must match the lexical entry "café".
-- in Hebrew texts, one may find the letter ש (shin) as is, but this letter could actually match "שּׂ" (shin + dagesh + shin dot) in the dictionary or grammar.
Similarly, ligatures such as "fi" (U+FB01) need to be separated into "f" + "i" before matching against grammars (ligatures in Arabic and Devanagari scripts are numerous).
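As an illustration, NFKD (the compatibility decomposition) is exactly what splits such ligatures; a sketch using Foundation's `decomposedStringWithCompatibilityMapping`:

```swift
import Foundation

let ligature = "\u{FB01}"   // the "fi" ligature
// Canonical equivalence alone does NOT equate the ligature with "fi"
// (the ligature is only *compatibility*-equivalent to it)...
assert(ligature != "fi")
// ...but its NFKD form is the two separate letters "f" + "i":
assert(ligature.decomposedStringWithCompatibilityMapping == "fi")
```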
These operations can be performed easily and efficiently if the string's .unicodeScalars view is in NFKD form. By contrast, it is costly to match a character (e.g. "e") against the potentially large set of equivalent precomposed characters ("é", "è", "ê", "ë", "ẽ", etc.), and even more costly when the letter occurring in the text is a ligature containing two or more actual letters.
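For instance, once a string is in NFKD form, diacritic-insensitive matching reduces to filtering out combining marks at the scalar level. A sketch in current Swift (the helper name `baseLetters` is mine, not an existing API):

```swift
import Foundation

// Hypothetical helper: reduce a string to its base letters by
// decomposing to NFKD and dropping nonspacing (combining) marks.
func baseLetters(_ s: String) -> String {
    let nfkd = s.decomposedStringWithCompatibilityMapping
    let kept = nfkd.unicodeScalars.filter {
        $0.properties.generalCategory != .nonspacingMark
    }
    return String(String.UnicodeScalarView(kept))
}

// "café" now matches the diacritic-free word form "CAFE":
assert(baseLetters("café").uppercased() == "CAFE")
```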
Therefore, any linguistic parser needs to systematically compute the NFKD form of each string it wants to parse, e.g.:
L06 let textToParse = text.decomposedStringWithCompatibilityMapping
My understanding is that this operation involves converting the original .unicodeScalars representation to (char *) buffers in order to call into the ICU library, copying and transforming those buffers, and then converting the result back into unicodeScalars => very costly. Much more costly than if the code in L05 produced the String in NFKD form directly while scanning the initial stream of UTF-8 bytes.
I therefore suggest that String.unicodeScalars should always be normalized in NFKD form, making L06 unnecessary.
If all Swift standard methods always produced a normalized form for String.unicodeScalars, the Unicode equality test (c1 == c2) would be very efficient.
I understand that the W3C would rather have us use the NFC form. But the Unicode equivalence that acts on Strings and Characters already fulfills all the needs NFC addresses. In other words, there is no need for NFC if we have Swift Strings and Characters.
(2) When a linguistic parser processes the string's .unicodeScalars view, it will find matches. These matches are expressed as indices into the .unicodeScalars view, which bear no relation to the initial data the client sent to the parser (usually an array of UTF-8 bytes).
Converting the UTF-8 byte array (in L05) has broken the link between the client's data and the parser's: the parser cannot tell its client where the matches are.
I suggest this link should be kept, perhaps via a system of indices such as:
text.utf8 => should contain the client's initial array of bytes (including the BOM);
text.utf8.indices => should contain the starting position of each character (i.e. grapheme cluster) in the UTF-8 array, e.g. text.utf8.indices[3] points to the beginning of the UTF-8 sequence for the 3rd grapheme cluster in the text;
text.unicodeScalars => should contain the NFKD representation of the string;
text.unicodeScalars.indices => should contain the starting position of each character (i.e. grapheme cluster) in .unicodeScalars, e.g. text.unicodeScalars.indices[3] points to the beginning of the sequence of unicode scalars for the 3rd grapheme cluster of the text.
This way, the two arrays of indices would be synchronized, and the parser could tell its client where the match occurred.
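For comparison, within a single (un-normalized) String, today's index machinery can already translate a match back to a UTF-8 byte offset; the problem described here is that this link is lost as soon as a normalized copy such as L06 is made. A sketch of the existing translation (`range(of:)` comes from Foundation):

```swift
import Foundation

let text = "café au lait"
if let match = text.range(of: "au") {
    // Translate the match's start into the UTF-8 view of the SAME string...
    let utf8Start = match.lowerBound.samePosition(in: text.utf8)!
    // ...and compute its byte offset: "café " is 6 UTF-8 bytes ("é" takes 2).
    let byteOffset = text.utf8.distance(from: text.utf8.startIndex,
                                        to: utf8Start)
    assert(byteOffset == 6)
}
```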