Declarative String Processing Overview

Yes, when in grapheme-semantic mode, I believe that the most important design point is to maintain equivalent behavior as if you were walking the characters yourself by hand.

Thus, we shouldn't split grapheme clusters in resulting indices. We should generalize what we can to grapheme clusters and reject the rest. More on that in the upcoming discussion on character classes.

There is no universal concept of "correct" here, only a universal concept of "consistent". Ordering is used for programming invariants but is linguistically meaningless.

For programming invariants, ordering under NFD can also be interesting, but irrelevant to your concerns. In grapheme-semantic mode, they would all combine and be the same grapheme cluster either way. Grapheme segmentation doesn't split normalization segments, and vice versa.

We do want a lot more normalization API on String. Now that @Alejandro merged native normalization support, this is a possibility.

If you want to meaningfully compare those characters, you also need:

  1. Collation data (too big to ship with stdlib)
  2. Locale or other tailoring (too high level a concern for stdlib)
  3. Know the application domain (even higher level than locale)

as the same character orders differently in different languages and even orders differently in different applications of the same language (e.g. German phonebook order vs German dictionary order). Also, what even is a "character" for ordering purposes is different than for counting purposes, e.g. some locales have digraphs that "stick" together for ordering.

No, we don't want to deviate from standard regex syntax for the literal. This is a great thing to have in Pattern though.

A related concern, which you might be actually asking for, is how to tailor custom character classes through the use of key paths, closures, etc. That is definitely something we want to support and we're baking support into the implementation early. I don't now how best this will surface in API (is it contextual?, etc.) yet, so that's a great thing to be discussing.

3 Likes