[Pitch] Character Classes for String Processing

Michael_Ilseman · October 21, 2021, 12:42am

Alejandro's example shows that Unicode accepts such sequences as degenerate grapheme clusters.

From UAX#29:

Ignore degenerates. No special provisions are made to get marginally better behavior for degenerate cases that never occur in practice, such as an A followed by an Indic combining mark.

Swift, being an actual implementation of Unicode with additional semantics of its own to uphold, has to constantly wrestle with the existence of degenerate grapheme clusters. They will come up in these discussions, because Unicode explicitly chose to permit their existence in favor of simplicity/speed. We follow Unicode even where it violates Collection algebra (for example, `count` is not additive under concatenation).

However, they are degenerate, so we are not burdened with ascribing meaning to the meaningless. We need to have defined behavior (in the strict UB-sense) in a world where they exist, but our regular API design intuition does not necessarily map directly onto them. We should ascribe meaning to the meaningful graphemes and let the degenerate cases fall out naturally.

Some prior musings on the topic (I'm hoping Discourse links them properly):

Corner-cases in `Character` classification of whitespace

Degenerate graphemes, such as one that contains only a combining scalar, violate common Collection intuition:
"abcde".count // 5
"\u{0301}".count // 1
let str = "abcde" + "\u{0301}" // "abcdé"
str.count // 5
String needs to accommodate the existence of degenerate graphemes, and they can always be formed by operating on the Unicode scalar or code unit views. But we should try to avoid forming them in commonly used top-level String APIs.
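To make the point above concrete, here is a minimal sketch of how a degenerate grapheme can be re-formed from a well-formed string through the scalar and UTF-8 views, using only standard library API. The helper `startsWithCombiningMark` is hypothetical and not part of the pitch; it assumes a deliberately narrow notion of "degenerate" (a `Character` whose first scalar is a combining mark), which is only one of the shapes a degenerate grapheme can take.

```swift
// A lone combining acute accent (U+0301) is a degenerate grapheme cluster.
let base = "abcde"
let accent = "\u{0301}"
let joined = base + accent   // the accent attaches to the trailing "e"

// Character counts are not additive across concatenation...
let characterCounts = (base.count, accent.count, joined.count)   // (5, 1, 5)

// ...but Unicode scalar counts are.
let scalarCountsAdd =
    base.unicodeScalars.count + accent.unicodeScalars.count
        == joined.unicodeScalars.count

// Slicing the scalar view re-forms the degenerate grapheme.
let fromScalars = String(joined.unicodeScalars.dropFirst(5))

// The code unit views allow the same thing: U+0301 is 0xCC 0x81 in UTF-8.
let fromUTF8 = String(decoding: [0xCC, 0x81] as [UInt8], as: UTF8.self)

// Hypothetical helper (an assumption for illustration, not a proposed API):
// reports whether a Character begins with a combining mark.
func startsWithCombiningMark(_ c: Character) -> Bool {
    // A Character always contains at least one scalar.
    switch c.unicodeScalars.first!.properties.generalCategory {
    case .nonspacingMark, .spacingMark, .enclosingMark:
        return true
    default:
        return false
    }
}

print(characterCounts)
print(scalarCountsAdd)
print(startsWithCombiningMark(Character(fromScalars)))
```

Note that nothing in `joined` itself is degenerate: every `Character` in it starts with a base letter. The degenerate grapheme only appears once the string is cut at a scalar boundary that is not also a grapheme boundary, which is why keeping such cuts out of the top-level `String` API matters.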

These points become salient the moment we try to use character class definitions with API (@timv). I think this thread is probably a better place to discuss and clarify them than the API pitch. This pitch doesn't call them out specifically (@nnnnnnnn we should in the next draft), but its definitions extend to them.