I understand, and I agree, that normalized views should be lazy. But when we do need to parse a large String (text), we will have to actually materialize the normalized view, i.e. the lazy aspect is gone. At that point, it would be nice to get the normalized form directly from the original data (which will be UTF-8 in most cases), rather than going through an (uncontrolled) array of UnicodeScalars and converting it back and forth via ICU's char*, with all the associated mallocs, frees, and copies.
- The Swift standard library needs to support at least one normalized view, and provide a method to get its opposite form: e.g. if NFKD is the default (transparent) form, then Swift should add a .toNFKC() method, or reciprocally.
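One possible shape for this, as a minimal sketch: the method names toNFKC()/toNFKD() below are my own invention, not existing standard-library API; the Foundation properties they wrap do exist today and are the current workaround.

import Foundation

// Hypothetical extension sketching the requested API; only the bridged
// Foundation properties used inside are real, existing API.
extension String {
    // Compatibility-composed form (NFKC).
    func toNFKC() -> String {
        return self.precomposedStringWithCompatibilityMapping
    }
    // Compatibility-decomposed form (NFKD).
    func toNFKD() -> String {
        return self.decomposedStringWithCompatibilityMapping
    }
}

print("ﬁ".toNFKD())                          // "fi": the ligature is split
print("café".toNFKD().unicodeScalars.count)  // 5: c, a, f, e, combining acute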
- Yes, I think the unicode scalars should always follow a known standard, so that anyone can rely on them. There are probably lots of bugs around because of the false belief that Unicode has one standard way to encode letters.
I disagree with the counter-argument: there is no "as physically encoded" in UnicodeScalars; the UnicodeScalar view does not reflect the string's original as-encoded internal representation, which is usually UTF-8 based.
- I understand that there are many applications other than NLP. I am arguing for Linguistics...
From a linguistic point of view, NFKD would be the best, because it would allow a linguistic parser to match texts as efficiently as possible. For instance, in French, uppercase letters are often written without their accent, e.g. "CAFE" in a text should match "café" in the dictionary. Using NFKD, matching "E" against "é" becomes easy: just a test of the form
text[tposition].unicodeScalars.first == dictionary[dposition].unicodeScalars.first
Using NFKC, or worse NFC, would instead mean that matching the "E" found in a text translates into a loop over all the possible precompositions of "e" with any diacritic:
for precomposedLetter in ["e", "é", "è", "ê", "ë", "ẽ" /* ... */] {
    if text[tposition].unicodeScalars.first == precomposedLetter.unicodeScalars.first {
        // MATCH
    }
}
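Here is a self-contained version of the NFKD-style test, as a sketch under my own assumptions: it lowercases both sides, and since Swift has no transparent normalized view today, it obtains the decomposed form through Foundation's decomposedStringWithCanonicalMapping.

import Foundation

// Sketch: compare a letter from a text against a letter from a dictionary
// entry by base letter only. Both sides are lowercased and decomposed (NFD)
// first, because the UnicodeScalar view currently gives no normalization
// guarantee.
func sameBaseLetter(_ textLetter: Character, _ dictionaryLetter: Character) -> Bool {
    let textBase = String(textLetter).lowercased()
        .decomposedStringWithCanonicalMapping.unicodeScalars.first
    let dictBase = String(dictionaryLetter).lowercased()
        .decomposedStringWithCanonicalMapping.unicodeScalars.first
    return textBase == dictBase
}

print(sameBaseLetter("E", "é"))  // true: the "E" of "CAFE" matches the "é" of "café"
print(sameBaseLetter("E", "a"))  // false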
The same goes for ligatures: NFC is bad for Linguistics, because every word form that occurs in a text and contains the ligature "ﬁ" (U+FB01) should match the corresponding dictionary entry spelled with "f" followed by "i". In other words, to get a linguistic parser running properly, one would first need to separate all ligatures into their linguistic units. As you mentioned earlier, this is critical for scripts such as Arabic: NFC keeps "ﶂ" as U+FD82 (i.e. one unicodeScalar for Lam + Hah + Alef Maksura in final form): how is a linguistic program supposed to find the corresponding word in a dictionary, or link it to its suffixed variants (the same ligature, but not in final form)? Same problem with Devanagari.
I strongly disagree that there is any loss of information in transforming "ﬁ" into "f" + "i": no word's spelling requires the ligature "ﬁ"; reciprocally, every single word that contains "f" followed by "i" can be written with "ﬁ".
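This is easy to check with Foundation's compatibility decomposition (the property exists today; the transparent NFKD view argued for here does not):

import Foundation

// The "fi" ligature (U+FB01) has a compatibility decomposition into "f" + "i",
// so an NFKD-normalized text can be matched directly against a dictionary
// that stores a plain "f" followed by "i".
let withLigature = "ﬁnal"   // 4 unicode scalars: ﬁ, n, a, l
let nfkd = withLigature.decomposedStringWithCompatibilityMapping
print(nfkd == "final")            // true
print(nfkd.unicodeScalars.count)  // 5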
I am confused by the 2⁵ example. Even if NFKD separates this grapheme into two unicodeScalars (the same way é becomes e followed by a combining accent), the corresponding grapheme would still be unique and different from "2". Parsers for arithmetic expressions never loop inside each grapheme: they loop over graphemes.
To recap:
- A normalized standard view is crucial for NLP applications. Without one, Swift's support for Strings is not really better than in previous languages; arguably worse, because we lost random access (!).
It is the role of a programming language to provide programmers with reliable data. Not having a normalized form in Swift means asking every programmer to use ICU (the truth is they will not, and their software will be buggy), i.e. an extra layer that is a pain and should not even need to exist in the first place.
- If Swift's array of UnicodeScalars were transparently normalized, that would be a huge and unique advantage for programmers. Providing programmers with a non-normalized array of unicodeScalars makes it useless anyway.
On the other hand, imagine telling a Java or C++ programmer that all their software is buggy unless they add ICU to every piece of String-processing code, whereas Swift processes Strings transparently and reliably.
- A transparent NFKD default representation of UnicodeScalars would be GREAT for NLP applications. I understand the W3C prefers NFC, and for some other applications (text publishing and word processors, maybe?) NFC could be better... But those programs probably don't even need to process UnicodeScalars in the first place: they deal with glyphs, and will avoid the variable-length UTF-8 representation like the plague; they would be better off working on glyphs/Characters anyway.