[Pitch] Character Classes for String Processing

I assure you he has not.

Alejandro's example shows that Unicode accepts such sequences as degenerate grapheme clusters.

From UAX#29:

  1. Ignore degenerates. No special provisions are made to get marginally better behavior for degenerate cases that never occur in practice, such as an A followed by an Indic combining mark.

Swift, being an actual implementation of Unicode with additional semantics of its own to uphold, has to constantly wrestle with the existence of degenerate grapheme clusters. They will come up in these discussions, because Unicode explicitly chose to permit their existence in favor of simplicity/speed. We follow Unicode even where it violates Collection algebra.
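To make the Collection algebra point concrete, here is a minimal sketch using a lone combining mark as the degenerate case. The variable names are just for illustration and are not from the pitch; the counts rely on a leading combining mark forming a grapheme cluster of its own.

```swift
// A combining mark with no base to attach to forms a degenerate
// grapheme cluster all by itself.
let base = "e"
let mark = "\u{301}" // U+0301 COMBINING ACUTE ACCENT

// Once appended, the mark attaches to the preceding base character.
let joined = base + mark

print(base.count)   // 1
print(mark.count)   // 1
print(joined.count) // 1, not base.count + mark.count
```

For most collections you would expect `(a + b).count == a.count + b.count`. That does not hold here, which is the kind of violation we accept by following Unicode.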

However, they are degenerate, so we are not burdened with ascribing meaning to the meaningless. We need to have defined behavior (in the strict UB-sense) in a world where they exist, but our regular API design intuition does not necessarily map directly onto them. We should ascribe meaning to the meaningful graphemes and let the degenerate cases fall out naturally.

Some prior musings on the topic (I'm hoping Discourse links them properly):

These points become salient the moment we try to use character class definitions with API (@timv). I think this thread is probably a better place to discuss and clarify them than the API pitch. This pitch doesn't call them out specifically (@nnnnnnnn we should in the next draft), but its definitions extend to them.
