That is incorrect. There are many natural languages that rely heavily on combining characters for which there is no precomposed form. Off the top of my head, Devanagari, Farsi (I think), Tamil, and even Vietnamese (which is Latin-based) are examples.
My understanding was that:
-- Vietnamese: NFKC normalization's purpose is exactly to recompose any decomposed sequence of scalars into the equivalent precomposed character (i.e. a single scalar), e.g. "a" + circumflex + tilde = U+1EAB (see the Swift sketch after this list)
-- Devanagari script combines two or three actual letters into one glyph, but from a linguistic point of view, the resulting glyph still represents the sequence of the initial letters, similar to ligatures in Latin-script languages, e.g. "ﬃ" = "f" + "f" + "i" ("diﬃcult" = "difficult")
-- Farsi (and most Arabic-based scripts) do have ligatures that must be processed as one logical unit, but these units function like words (rather than letters), similar to Latin-based abbreviations such as "&" = Latin "et".
-- Same thing with emoji: I believe linguistic parsers should consider emoji as words rather than letters, e.g. "❤️" = "Noun heart" or "Verb to love", "😊" = "Adjective happy" or "Adverb happily", etc. They work like abbreviations, like "$" = "dollar".
I need to do more research...
It sounds like what you want is to operate directly on the Unicode scalars instead of graphemes. String has a `UnicodeScalarView`, which lazily decodes the contents for this purpose. If you want to make it eager, you can say `Array(myStr.unicodeScalars)`.
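A minimal sketch of the difference between the grapheme view and the scalar view (the sample string and printed names are just illustrative):

```swift
// One precomposed letter (ẫ) followed by a flag emoji (two regional indicators).
let str = "\u{1EAB}\u{1F1EB}\u{1F1F7}"

// Grapheme clusters, i.e. what `Character` iteration sees:
print(str.count)                          // 2

// Unicode scalars, decoded lazily by the view:
for scalar in str.unicodeScalars {
    print(String(scalar.value, radix: 16), scalar.properties.name ?? "?")
}

// Eager copy, if random access to the scalars is needed:
let scalars = Array(str.unicodeScalars)
print(scalars.count)                      // 3
```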
If you need control over normalization, that could be added. If so, feel free to start a pitch for this functionality.
edit: code formatting
Thanks Michael, itaiferber, and Jean-Daniel,
From your feedback, I understand that it would be possible to get a "composition" operation that follows the NFKC standard and produces an array of Unicode scalars for a given string. This would allow one to build a robust syntactic parser that could parse texts in any natural language. Ligatures would not be a linguistic problem because NFKC properly separates them into sequences of letters; from a linguistic point of view, emojis should be processed as words rather than letters.
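If I understand the suggestion, something like the following already works today by going through Foundation's NFKC property (`precomposedStringWithCompatibilityMapping`); a dedicated stdlib API would presumably look similar:

```swift
import Foundation

// "diﬃcult" spelled with the ﬃ ligature (U+FB03).
let ligature = "di\u{FB03}cult"

// NFKC splits the compatibility ligature back into plain letters.
let nfkc = ligature.precomposedStringWithCompatibilityMapping
print(nfkc)                                        // "difficult"

// The scalar array a parser would consume:
let scalars = Array(nfkc.unicodeScalars)
print(scalars.map { String($0.value, radix: 16) }) // plain ASCII letters
```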
From a previous discussion with Michael, I understand that a standard "decomposition" operation such as NFKD could be available as well. Such a feature would allow one to build a robust morphological parser that could add or remove accents or stresses to/from a letter in a linguistically natural way. That is crucial for all languages with heavy morphology.
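For the morphological side, here is a sketch using Foundation's NFKD property plus the standard library's `Unicode.Scalar.Properties`. Treating every scalar with the Diacritic property as an accent is a simplification, but it illustrates both directions:

```swift
import Foundation

// Removing accents: decompose (NFKD), then drop the combining marks.
let word = "hôtel"
let nfkd = word.decomposedStringWithCompatibilityMapping
let bare = String(String.UnicodeScalarView(
    nfkd.unicodeScalars.filter { !$0.properties.isDiacritic }))
print(bare)                                          // "hotel"

// Adding an accent: append a combining mark, then recompose (NFC).
let accented = ("e" + "\u{0301}").precomposedStringWithCanonicalMapping
print(accented.unicodeScalars.count)                 // 1 scalar: U+00E9 "é"
```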
Another missing feature that is desperately needed: a data <=> string mapping that would allow a parser to tell its client at what exact position in the initial data (byte array) a match actually occurred. Without this basic feature, I don't see how to build serious NLP applications.
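Under the assumption that the initial data is valid UTF-8 (so decoding it into a String changes nothing), byte offsets can already be recovered through the UTF-8 view. This is only a sketch of the idea, not the general data <=> string facility being asked for:

```swift
import Foundation

let data = Data("Võ Nguyên Giáp".utf8)          // the "initial data"
guard let text = String(data: data, encoding: .utf8) else { fatalError() }

if let range = text.range(of: "Nguyên") {
    // String.Index is shared across views, so distances in the UTF-8 view
    // are byte offsets into the original data.
    let start = text.utf8.distance(from: text.utf8.startIndex, to: range.lowerBound)
    let end = text.utf8.distance(from: text.utf8.startIndex, to: range.upperBound)
    print("match at bytes \(start)..<\(end)")    // 4..<11
}
```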
I don't know how to "start a pitch" to get these three features; are "pitches" Apple official procedure? I understand that the String team has other priorities and has limited Human resources (maybe I can help?). I do hope that Swift will get these features, hopefully before June when I will have to start to work full time on the next version of the NooJ linguistic platform.
Since this is an old thread (pre-forum) and likely to involve more discussion around specific use cases, I spun off "String for linguistic processing".