String for linguistic processing

Michael_Ilseman · March 19, 2018, 3:21pm

@silberz, as for whether you can rely on composition for your needs, you know your domain best. Note that Unicode has several exceptions to the rules (because of course it does). Are you familiar with composition exclusions?

For example, this might affect your support for Tibetan, which had a rocky history in early Unicode: it was added, removed, then re-added. I'm not familiar with the details here, though.
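To make composition exclusions concrete, here is a minimal sketch using Foundation's `precomposedStringWithCanonicalMapping` (NFC). The helper name `nfcScalars` is made up for illustration. The code points U+0F43 (Tibetan) and U+0958 (Devanagari) are listed in Unicode's CompositionExclusions.txt, so a conforming NFC implementation should leave them decomposed rather than recomposing them. Whether this is what matters for your Tibetan support is an assumption you'd need to check against your own data.

```swift
import Foundation

/// Hypothetical helper: the scalar values of `text` after NFC normalization.
func nfcScalars(_ text: String) -> [UInt32] {
    return text.precomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

// Ordinary case: "e" + COMBINING ACUTE ACCENT composes to U+00E9.
let composed = nfcScalars("e\u{0301}")

// Excluded cases: these single scalars have canonical decompositions,
// and NFC does not recompose them because they are composition exclusions.
let tibetan = nfcScalars("\u{0F43}")     // stays as U+0F42 U+0FB7
let devanagari = nfcScalars("\u{0958}")  // stays as U+0915 U+093C

print(composed, tibetan, devanagari)
```

So "NFC" does not mean "one scalar per character wherever a precomposed form exists"; code that assumes so will be wrong for excluded characters like these.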

ICU provides support for NFC, NFKC, NFKD, NFD, and FCC. So the question for Swift is: what is the most "Swifty" way to expose these? Since they are expert-oriented (highly domain-specific), this seems similar to the Unicode Scalar Properties pitch.
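As one possible shape for such an API (purely a sketch, not a proposal), the four standard forms could hang off a single enum. Both `NormalizationForm` and `normalized(_:)` are hypothetical names; the implementation just forwards to the existing Foundation properties. FCC is left out because Foundation has no equivalent that I know of.

```swift
import Foundation

/// Hypothetical enum naming the four standard Unicode normalization forms.
enum NormalizationForm {
    case nfc, nfd, nfkc, nfkd
}

extension String {
    /// Hypothetical convenience that forwards to Foundation's existing properties.
    func normalized(_ form: NormalizationForm) -> String {
        switch form {
        case .nfc:  return precomposedStringWithCanonicalMapping
        case .nfd:  return decomposedStringWithCanonicalMapping
        case .nfkc: return precomposedStringWithCompatibilityMapping
        case .nfkd: return decomposedStringWithCompatibilityMapping
        }
    }
}

// U+FB01 LATIN SMALL LIGATURE FI only has a compatibility decomposition,
// so NFC leaves it alone while NFKC expands it to "f" + "i".
let ligature = "\u{FB01}"
let nfcValues = ligature.normalized(.nfc).unicodeScalars.map { $0.value }
let nfkcValues = ligature.normalized(.nfkc).unicodeScalars.map { $0.value }
let nfdValues = "\u{00E9}".normalized(.nfd).unicodeScalars.map { $0.value }
print(nfcValues, nfkcValues, nfdValues)
```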

Yes, this is fallout from String not having native UTF-8 storage: because it internally transcodes to UTF-16, we cannot provide a mapping between a String.Index and a byte offset. This use case is increasing in frequency and popping up in different domains (e.g. SwiftSyntax is dealing with this now).

Until then you'll have to maintain a mapping. @Xi_Ge or @harlanhaskins might be able to describe how they're doing so. It's annoying, but hopefully it's just a Dictionary<String.Index, Int> whose keys come from whatever view you're doing your processing in (likely the scalar view).
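A minimal sketch of such a mapping, assuming you process the unicode scalar view and want UTF-8 byte offsets. The names `utf8Width` and `utf8Offsets` are made up for illustration, and this is not necessarily how SwiftSyntax does it.

```swift
/// Hypothetical helper: number of bytes a scalar occupies when encoded as UTF-8.
func utf8Width(_ scalar: Unicode.Scalar) -> Int {
    switch scalar.value {
    case 0..<0x80:    return 1
    case 0..<0x800:   return 2
    case 0..<0x10000: return 3
    default:          return 4
    }
}

/// Hypothetical helper: maps each index in the scalar view to the UTF-8
/// byte offset at which that scalar starts.
func utf8Offsets(of text: String) -> [String.Index: Int] {
    var offsets: [String.Index: Int] = [:]
    var byteOffset = 0
    let scalars = text.unicodeScalars
    for index in scalars.indices {
        offsets[index] = byteOffset
        byteOffset += utf8Width(scalars[index])
    }
    return offsets
}

// "a" (1 byte), U+00E9 (2 bytes), U+4E2D (3 bytes), U+1F600 (4 bytes)
let text = "a\u{E9}\u{4E2D}\u{1F600}"
let offsets = utf8Offsets(of: text)
let orderedOffsets = text.unicodeScalars.indices.map { offsets[$0]! }
print(orderedOffsets)
```

The table costs one pass over the string and one entry per scalar, in exchange for constant-time lookups afterwards. Only indices taken from the same view of the same string are valid keys.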

Nope, this is Swift procedure: https://github.com/apple/swift-evolution/blob/master/process.md. For example, @allevato started the Unicode Scalar Properties pitch.