That is incorrect. There are many natural languages that rely heavily on combining characters for which there is no precomposed form. Off the top of my head, Devanagari, Farsi (I think), Tamil, and even Vietnamese (which is Latin-based) are examples.
My understanding was that:
-- Vietnamese: NFKC normalization's purpose is exactly to recompose any decomposed sequence of scalars into the equivalent precomposed character (i.e. a single scalar), e.g. "a" + circumflex + tilde = U+1EAB (see the Swift sketch after this list)
-- Devanagari script combines two or three actual letters into one glyph, but from a linguistic point of view, the resulting glyph still represents the sequence of the initial letters, similar to ligatures in Latin-script languages, e.g. "ﬃ" = "f" + "f" + "i" ("diﬃcult" = "difficult")
-- Farsi (and most Arabic-based scripts) do have ligatures that must be processed as one logical unit, but these units function like words (rather than letters), similar to Latin-based abbreviations such as "&" = Latin "et".
-- Same thing with emoji: I believe linguistic parsers should consider emoji as words rather than letters, e.g. "❤️" = "Noun heart" or "Verb to love", "😊" = "Adjective happy" or "Adverb happily", etc. They work like abbreviations, like "$" = "dollar".
I need to do more research...
It sounds like what you want is to operate directly on the Unicode scalars instead of graphemes. String has a `UnicodeScalarView`, which lazily decodes the contents for this purpose. If you want to make it eager, you can say `Array(myStr.unicodeScalars)`.
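A minimal sketch of the difference between the grapheme view and the scalar view (the sample string and printed names are just illustrative):

```swift
// One precomposed letter (ẫ) followed by a flag emoji (two regional indicators).
let str = "\u{1EAB}\u{1F1EB}\u{1F1F7}"

// Grapheme clusters, i.e. what `Character` iteration sees:
print(str.count)                          // 2

// Unicode scalars, decoded lazily by the view:
for scalar in str.unicodeScalars {
    print(String(scalar.value, radix: 16), scalar.properties.name ?? "?")
}

// Eager copy, if random access to the scalars is needed:
let scalars = Array(str.unicodeScalars)
print(scalars.count)                      // 3
```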
If you need control over normalization, that could be added. If so, feel free to start a pitch for this functionality.
edit: code formatting
Thanks Michael, itaiferber, and Jean-Daniel,
From your feedback, I understand that it would be possible to get a "composition" operation that follows the NFKC standard and produces an array of Unicode scalars for a given string. This would allow one to build a robust syntactic parser that could parse texts in any natural language. Ligatures would not be a linguistic problem because NFKC properly separates them into sequences of letters; from a linguistic point of view, emojis should be processed as words rather than letters.
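If I understand the suggestion, something like the following already works today by going through Foundation's NFKC property (`precomposedStringWithCompatibilityMapping`); a dedicated stdlib API would presumably look similar:

```swift
import Foundation

// "diﬃcult" spelled with the ﬃ ligature (U+FB03).
let ligature = "di\u{FB03}cult"

// NFKC splits the compatibility ligature back into plain letters.
let nfkc = ligature.precomposedStringWithCompatibilityMapping
print(nfkc)                                        // "difficult"

// The scalar array a parser would consume:
let scalars = Array(nfkc.unicodeScalars)
print(scalars.map { String($0.value, radix: 16) }) // plain ASCII letters
```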
From a previous discussion with Michael, I understand that a standard "decomposition" operation such as NFKD could be available as well. Such a feature would allow one to build a robust morphological parser that could add or remove accents or stresses to/from a letter in a linguistically natural way. That is crucial for all languages with heavy morphology.
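For the morphological side, here is a sketch using Foundation's NFKD property plus the standard library's `Unicode.Scalar.Properties`. Treating every scalar with the Diacritic property as an accent is a simplification, but it illustrates both directions:

```swift
import Foundation

// Removing accents: decompose (NFKD), then drop the combining marks.
let word = "hôtel"
let nfkd = word.decomposedStringWithCompatibilityMapping
let bare = String(String.UnicodeScalarView(
    nfkd.unicodeScalars.filter { !$0.properties.isDiacritic }))
print(bare)                                          // "hotel"

// Adding an accent: append a combining mark, then recompose (NFC).
let accented = ("e" + "\u{0301}").precomposedStringWithCanonicalMapping
print(accented.unicodeScalars.count)                 // 1 scalar: U+00E9 "é"
```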
Another missing feature that is desperately needed: a data <=> string mapping that would allow a parser to tell its client at what exact position in the initial data (byte array) a match actually occurred. Without this basic feature, I don't see how to build serious NLP applications.
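Under the assumption that the initial data is valid UTF-8 (so decoding it into a String changes nothing), byte offsets can already be recovered through the UTF-8 view. This is only a sketch of the idea, not the general data <=> string facility being asked for:

```swift
import Foundation

let data = Data("Võ Nguyên Giáp".utf8)          // the "initial data"
guard let text = String(data: data, encoding: .utf8) else { fatalError() }

if let range = text.range(of: "Nguyên") {
    // String.Index is shared across views, so distances in the UTF-8 view
    // are byte offsets into the original data.
    let start = text.utf8.distance(from: text.utf8.startIndex, to: range.lowerBound)
    let end = text.utf8.distance(from: text.utf8.startIndex, to: range.upperBound)
    print("match at bytes \(start)..<\(end)")    // 4..<11
}
```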
I don't know how to "start a pitch" to get these three features; are "pitches" Apple official procedure? I understand that the String team has other priorities and has limited Human resources (maybe I can help?). I do hope that Swift will get these features, hopefully before June when I will have to start to work full time on the next version of the NooJ linguistic platform.
Since this is an old thread (pre-forum) and likely to involve more discussion around specific use cases, I spun off "String for linguistic processing".