String for linguistic processing

(This thread is spun off from Strings in Swift 4 - #63 by Michael_Ilseman)


@silberz, as far as whether you can rely on composition for your needs, you would know your domain best. Note that Unicode has several exceptions to the rules (because of course it does). Are you familiar with composition exclusions?

For example, this might affect your support for Tibetan, which had a rocky history in early Unicode where it was added, removed, then re-added. But, I'm not familiar with the details here.

ICU provides support for NFC, NFKC, NFKD, NFD, and FCC. So the question for Swift is what is the most "Swifty" way to expose these? Since these are expert-oriented (highly domain-specific), it seems similar to the Unicode Scalar Properties pitch.
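As an aside: Foundation already exposes the four standard forms on String (via NSString bridging), which can serve as a stopgap while the standard-library question is settled; FCC has no such shortcut. A minimal sketch:

import Foundation

let s = "2⁵ café"
let nfc  = s.precomposedStringWithCanonicalMapping       // NFC
let nfd  = s.decomposedStringWithCanonicalMapping        // NFD
let nfkc = s.precomposedStringWithCompatibilityMapping   // NFKC
let nfkd = s.decomposedStringWithCompatibilityMapping    // NFKD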

Yes, this is a fallout of String not having native UTF-8 storage, where we cannot provide a mapping between a String.Index and a byte offset due to internally transcoding to UTF-16. This is a use case that's increasing in frequency and popping up in different domains (e.g. SwiftSyntax is dealing with this now).

Until then you'll have to maintain a mapping. @Xi_Ge or @harlanhaskins might be able to describe how they're doing so. While annoying, hopefully it's just a Dictionary<String.Index, Int> whose keys are from whatever view you're doing your processing in (likely a scalar view).
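For illustration only (this is a sketch, not how SwiftSyntax actually does it, and the function name is made up), such a mapping can be built in a single pass over the scalar view:

// Sketch: map each scalar-view index to its UTF-8 byte offset in one pass.
func utf8ByteOffsets(of string: String) -> [String.Index: Int] {
    var offsets: [String.Index: Int] = [:]
    var byteOffset = 0
    for index in string.unicodeScalars.indices {
        offsets[index] = byteOffset
        // Width of this scalar when encoded as UTF-8.
        switch string.unicodeScalars[index].value {
        case 0..<0x80:         byteOffset += 1
        case 0x80..<0x800:     byteOffset += 2
        case 0x800..<0x1_0000: byteOffset += 3
        default:               byteOffset += 4
        }
    }
    offsets[string.unicodeScalars.endIndex] = byteOffset  // total byte count
    return offsets
}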

Nope, this is Swift procedure: https://github.com/apple/swift-evolution/blob/master/process.md. For example, @allevato started the Unicode Scalar Properties pitch.

Indeed, linguistic parsers could benefit from processing the array of unicodeScalars, rather than the String or the array of Characters.

The best would be to have transparent support for NFKD, and conversion to NFKC if needed. From NFKD, one can deal with composition exclusions without too much cost.

Unless I am missing something, using ICU will not be efficient, as it means getting an array of unicodeScalars from a String, converting it to a (char *), converting the resulting (char *) back to an array of unicodeScalars, and then back to a String: lots of copies and mallocs and frees.

NFKD support should be totally transparent, as in:

let text = String(bytes: bytes, encoding: String.Encoding.utf8)!
let unicodeScalarArray = text.unicodeScalars // => always already in NFKD format
// if needed:
let NFKCUnicodeScalarArray = unicodeScalarArray.toNfkc() // => converts to the equivalent NFKC

This is precisely why all of String’s views are lazy: avoid extra allocations and copying. Any kind of normalized view would also be lazy. The vast (vast!) majority of normalization segments are relatively small, say under 32 bytes. Thus we can use a buffer on the stack rather than the heap when calling into ICU as a fast-path. Even if an unlikely fundamental shift in common Unicode strings were to happen, we could find various workarounds to ICU inefficiency, e.g. heap buffers managed by thread-local storage.

I really want to see your use case supported. Please understand this goal while I nit-pick this into something that I think can be accepted by the broader community of String users.

Let me split this into three smaller decisions.

  1. Should Swift’s standard library add support for various normalized views of unicode scalars?
  2. If #1, should we change the default presentation of unicode scalars to a normal form?
  3. If #2, should NFKD be that default?

Decision #1 is sufficient for your needs, while #3 lets you write myStr.unicodeScalars rather than myStr.unicodeScalars.nfkd. Splitting this into 3 points allows for #1 to happen even if #3 is a deal-breaker for many users of unicode scalars.
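To make the spellings concrete, here is roughly what decision #1 could look like, faked on top of Foundation purely for illustration (a real standard-library view would be lazy, and the .nfkd/.nfkc names are hypothetical, not a proposal):

import Foundation

extension String.UnicodeScalarView {
    // Hypothetical spelling for decision #1; eagerly computed here only for illustration.
    var nfkd: String.UnicodeScalarView {
        String(self).decomposedStringWithCompatibilityMapping.unicodeScalars
    }
    var nfkc: String.UnicodeScalarView {
        String(self).precomposedStringWithCompatibilityMapping.unicodeScalars
    }
}

// Decision #1 only:          myStr.unicodeScalars.nfkd
// Decisions #2 and #3 too:   myStr.unicodeScalars would already be in NFKD.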


Here is a potential argument against NFKD for a default. If we were to change the default to a normal form, what form should it be? It seems most likely that we would settle on NFC, for many of the same reasons that W3C requires/strongly recommends NFC.

Additionally, NFKD presents 2⁵ as 25, and comparison under NFKD states that 2⁵ is equal to 25. Outside of specialized scenarios, this is at odds with modern Unicode practices. For example, String’s == follows canonical equivalence and thus can’t be based on an NFK* form. Since normalization is mostly useful for treating equivalent strings equivalently, it’s counter-intuitive that the unicode scalar view would present a normalized, yet non-canonical view.

Finally, compatibility mappings lose information. "ﬁ" is 1 grapheme, but an NFKD-normalized string would hold 2, due to information loss. (Similarly, 2⁵ would appear as 25.) It’s counter-intuitive that a String’s default unicode scalar view would lose potentially-critical information regarding the unicode scalars that comprise the String.
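Both points are easy to demonstrate with Foundation's compatibility decomposition (a small example, not proposed stdlib behavior):

import Foundation

let ligature = "ﬁ"                                        // U+FB01: 1 grapheme, 1 scalar
let decomposed = ligature.decomposedStringWithCompatibilityMapping
print(decomposed, decomposed.count)                        // "fi" — now 2 graphemes
print("2⁵".decomposedStringWithCompatibilityMapping)       // "25"
print("café" == "cafe\u{301}")                             // true: == is canonical equivalence
print("2⁵" == "25")                                        // false: == ignores compatibility mappings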


Decision #2 is more debatable, i.e. philosophical. The current default is to present the unicode scalars that comprise a string as they are physically encoded.

An argument for presenting normalized unicode scalars by default is harm reduction. Normal forms can prevent some programmer errors which accidentally treat equivalent strings differently.

An argument against normalized unicode scalars by default is that normalization is only useful for some uses of unicode scalars (namely operating modulo canonical/compatibility equivalence). Even then, the choice between C and D forms depends on intended usage. However, many uses of the unicode scalars want to operate on the scalars as-written, i.e. as they are physically encoded by the String. Since the preferred presentation depends on usage, make the normalized presentations explicit.


Decision #1 comes down to deciding if it fits the scope of the standard library and how it is represented. If you are interested in #1, I can advise and/or collaborate with you on a pitch to add normalized views onto the unicode scalar view.

I understand and I agree that normalized views should be lazy. But when we do need to parse a large String (text), we will need to actually convert it to a normalized view, i.e. the lazy aspect is gone. At this point, it would be nice to get the normalized form directly from the original data (which will be UTF8 in most cases), rather than using an (uncontrolled) array of UnicodeScalars, converting it back and forth via ICU's (char *), copying with their malloc's and free's.

  1. The Swift standard library needs to add support for at least one normalized view, and provide a method to get the opposite form; i.e. if NFKD is the default (transparent) form, then Swift should add a .toNFKC() method, or reciprocally.

  2. Yes, I think the unicode scalar view should always follow a known standard, so that anyone can rely on it. There are probably lots of bugs around because of the false belief that Unicode has a single standard way to encode letters.

I disagree with the counter-argument: there is no "as physically encoded" in UnicodeScalars: the UnicodeScalar view does not reflect the string's original as-encoded internal representation, which is usually UTF-8-based.

  3. I understand that there are many other applications than NLP applications. I am arguing for Linguistics...

From a linguistic point of view, NFKD would be the best, because it would allow a linguistic parser to match texts in the most efficient way possible. For instance, in French, uppercase letters often lack their accents, e.g. "CAFE" in a text should match "café" in the dictionary. Using NFKD would make the "E" == "é" match easy, just a test between

text[tposition].unicodeScalars[0] == dictionary[dposition].unicodeScalars[0]

Using NFKC or worse NFC would mean matching the "E" in texts would translate into a loop over all the possible combinations of a "e" and any diacritic:

for precomposedLetter in ["e", "é", "è", "ê", "ë", "ẽ", ...] {
    if text[tposition].unicodeScalars[0] == precomposedLetter {
        // MATCH
    }
}
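For what it's worth, here is a rough sketch of that NFKD-based matching idea using Foundation's compatibility decomposition (foldedForMatching is just an illustration, not a full linguistic matcher):

import Foundation

// Rough sketch: NFKD-decompose, drop combining marks, then compare case-insensitively,
// so "CAFE" in a text matches the dictionary entry "café".
func foldedForMatching(_ s: String) -> String {
    let nfkd = s.decomposedStringWithCompatibilityMapping
    let scalars = nfkd.unicodeScalars.filter {
        $0.properties.generalCategory != .nonspacingMark   // drop accents such as U+0301
    }
    return String(String.UnicodeScalarView(scalars)).lowercased()
}

print(foldedForMatching("CAFE") == foldedForMatching("café"))   // true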

Same thing with ligatures: NFC is bad for Linguistics because absolutely all the word forms that occur in texts and contain "ﬁ" should match the corresponding dictionary entry with "f" followed by "i". In other words, to have a linguistic parser running properly, one would first need to separate all ligatures into their linguistic units. As you mentioned earlier, this is critical for scripts such as Arabic: NFC leaves "ﶂ" encoded as U+FD82 (i.e. one unicodeScalar for Lam + Hah + Alef Maksura in final form): how is a linguistic program supposed to find the corresponding word in a dictionary, or link it to its suffixed variants (the same ligature, but not in final form)? Same problem with Devanagari.
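Relying on the U+FD82 identification above, a quick check of how the forms treat the presentation-form ligature (expected output noted in comments, and worth verifying):

import Foundation

let presentationForm = "\u{FD82}"                 // the Arabic presentation-form ligature discussed above
let nfc  = presentationForm.precomposedStringWithCanonicalMapping
let nfkd = presentationForm.decomposedStringWithCompatibilityMapping
print(nfc.unicodeScalars.count)                   // expected 1: NFC leaves the presentation form intact
print(nfkd.unicodeScalars.count)                  // expected 3: the separate Lam, Hah, Alef Maksura letters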

I strongly disagree that there is any loss of information in transforming "ﬁ" into "f" "i": no word's spelling requires a "ﬁ"; reciprocally, every single word that contains "f" followed by "i" can be written with "ﬁ".

I am confused about the 2⁵ example. Although NFKD would separate this single grapheme into two unicodeScalars (the same way é becomes e followed by a combining acute), the corresponding grapheme would still be unique and different from "2". Parsers for arithmetic expressions never loop inside each grapheme: they loop over graphemes.

To recap:

  1. Decision #1 is crucial for NLP applications. Without a normalized standard view, Swift's support for Strings is not really better than previous languages'; arguably worse, because we lost random access (!).

It is the role of a programming language to provide programmers with reliable data. Not having a normalized form in Swift means asking all programmers to use ICU (the truth is they will not, and their software will be buggy), i.e. an extra layer that is a pain and should not even need to exist in the first place.

  2. If Swift's array of UnicodeScalars is transparently normalized, that would be a huge and unique advantage for programmers. Providing programmers with a non-normalized array of unicodeScalars makes it useless anyway.

On the other hand, imagine telling a Java or C++ programmer that all their software is buggy, unless they add ICU to every String processing code, whereas Swift processes Strings transparently, in a reliable way.

  3. A transparent NFKD default representation of UnicodeScalars would be GREAT for NLP applications. I understand W3C prefers NFC, and for some other applications (maybe text publishing and word processors?) NFC could be better... But those programs probably don't even need to process UnicodeScalars in the first place: they deal with glyphs, and will avoid the variable-length UTF8 representation like the plague; they would be better off working on glyphs/Characters anyway.

Sounds like you know what you want to pitch.