SE-0363: Unicode for String Processing

xwu · July 3, 2022, 4:35pm

To support both matching on String's default character-by-character view and more broadly-compatible Unicode scalar-based matching, you can select a matching level for an entire regex or a portion of a regex constructed with the RegexBuilder API. [...]

These specific levels of matching [or to be specific, extended grapheme cluster-based matching rather than Unicode scalar-based matching], and the options to switch between them, are unique to Swift

I'm concerned that extended grapheme cluster-based matching is both unique to Swift and also the (silent) default. Compatibility with other regex engines has been a key consideration in this series of proposals, and it is specifically desired with respect to the classical regex syntax. As the proposal acknowledges, extended grapheme cluster-based matching by default decreases that level of broad compatibility.

Moreover, since there is no "inside-the-regex" syntax to enable the more compatible Unicode scalar mode and only the rather more verbose .matchingSemantics(.unicodeScalar), I worry about discoverability even if a user knows about this compatibility difference. That said, I do understand the rationale for not including totally new regex syntax, at least at this time, and could probably be on board with that approach were it not for the first consideration above about defaults.
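
For concreteness, a minimal sketch of what opting in to the scalar mode looks like from code. It assumes a Swift 5.7 or later toolchain and uses the run-time Regex(_:) initializer rather than a literal, so it builds without the bare-slash flag. The sample strings and variable names are illustrative and are not taken from the proposal.

```swift
// Illustrative input: "café" spelled with a combining acute accent (decomposed),
// and a pattern spelled with the precomposed U+00E9.
let decomposed = "cafe\u{301}"
let pattern = "caf\u{E9}"

// Default: extended grapheme cluster semantics, i.e. Character-by-Character.
let graphemeRegex = try Regex(pattern)

// Opt-in: Unicode scalar semantics. There is no in-pattern flag for this,
// so the option has to be applied through the API.
let scalarRegex = try Regex(pattern).matchingSemantics(.unicodeScalar)

let matchesByGrapheme = decomposed.firstMatch(of: graphemeRegex) != nil
let matchesByScalar = decomposed.firstMatch(of: scalarRegex) != nil
print("grapheme:", matchesByGrapheme, "scalar:", matchesByScalar)
```

Under the default semantics the comparison is by Character, so canonically equivalent spellings are expected to match; under scalar semantics the two spellings are different scalar sequences, which is the behavior other engines would show.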

In addition, threading these two "matching semantics" modes through the entire proposal means that we have to review two distinct sets of character classes, one of them unique to Swift. They look very well thought out, and it is certainly possible that the proposed semantics for each character class is optimal as-is. But do we have deep enough Unicode expertise in this community to approximate a "many eyes" approach to evaluating these details? I am skeptical. How many people have an informed opinion on halfwidth forms, for instance? It is unclear what we would do down the line (with source compatibility constraints, etc.) if, upon reflection, some definition adopted today turns out to have included or excluded characters in error, or to conflict with a future Unicode decision. The Unicode Consortium makes certain backwards compatibility guarantees with respect to its own decisions, but it is under no obligation to avoid contradicting what we decide.

Therefore, I wonder if there is another approach possible here—something along the lines below—which would satisfy the overarching (legitimate) need for Unicode correctness by default and consistency with the user-facing representation of String as a sequence of extended grapheme clusters as elements, while obviating the need for a distinct extended grapheme cluster-based regex matching scheme:

- Support a "Unicode normalization–insensitive" option for regex, enabled by default for regex literals.
- Where the "Unicode normalization–insensitive" option is enabled, normalize the regex and match against the normalized string representation (i.e., do automatically what the proposal says one must do manually when using regex engines in other languages).
- Drop grapheme cluster-level semantics from the proposal; in place of matchingSemantics, add an ignoresNormalizationForm() option (strawman name) that defaults to being enabled, and possibly an in-regex syntax for toggling the same option.
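
A rough sketch of the second item above, done by hand, may make the idea easier to evaluate. It assumes NFC as the normalization form (via Foundation's precomposedStringWithCanonicalMapping) and a Swift 5.7 or later toolchain. The helper name firstMatchIgnoringNormalization is hypothetical and is not part of the proposal or the standard library. Normalizing the pattern text directly is a simplification: a real implementation would presumably normalize the parsed regex rather than its source string.

```swift
import Foundation

// Hypothetical helper, for illustration only: emulates a
// "normalization-insensitive" match by normalizing both sides to NFC
// and then matching scalar-by-scalar.
func firstMatchIgnoringNormalization(
    of pattern: String,
    in input: String
) throws -> Substring? {
    // Assumption: NFC is the chosen form; any one form applied to both sides would do.
    let normalizedPattern = pattern.precomposedStringWithCanonicalMapping
    let normalizedInput = input.precomposedStringWithCanonicalMapping

    // Scalar semantics, as in other regex engines.
    let regex = try Regex(normalizedPattern).matchingSemantics(.unicodeScalar)

    guard let match = normalizedInput.firstMatch(of: regex) else { return nil }
    // Note: the result indexes into the normalized copy, not the original input.
    return normalizedInput[match.range]
}

// Precomposed pattern against decomposed input.
let hit = try firstMatchIgnoringNormalization(of: "caf\u{E9}", in: "cafe\u{301}")
print(hit != nil)
```

One design question the sketch surfaces: the match ranges refer to the normalized copy, so an actual option would need a way to map them back to indices in the original string.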

I have minor concerns regarding some spellings (e.g., dotMatches... versus anchorsMatch...; can't we use the singular for both flags, particularly since it's possible to use only one anchor?) and character class definitions (e.g., the mismatch where isHexDigit is more narrowly defined than the hexDigit class), which can probably be discussed later.