[Pitch] Character Classes for String Processing

Michael_Ilseman · October 21, 2021, 4:05pm

(Acknowledging that how we switch between semantic modes is TBD and out of scope here, beyond clarifying semantics.) If we had static information that the intent was grapheme-semantic usage, we would accept but warn on \x{301}: it would only match a degenerate grapheme cluster at the front of a string or substring. This specific case is further classified by Unicode as a defective (but not ill-formed!) combining character sequence.

Unicode's specific guidance here is in the context of rendering behavior, but we can interpret the spirit of the guidance for ourselves. Warning is a great idea, and if this were used in a grapheme-semantic operation it would have exactly the problem we warned about: it would only match a leading degenerate character. I think that sensible and composable fall-back behavior would still accept this and match in this restricted way (emergent from underlying semantics).

For example, you might have a layer of scalar-semantic parsing for CSV and a higher-layer grapheme-semantic processing of individual entries. The scalar semantic parsing might (even intentionally) split what would otherwise be a grapheme cluster as it's part of the structure of its format. We'd want to prioritize sensible and understandable behavior for composition of semantic levels above an attempt to provide more-intuitive treatment of specific degenerates.
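To make the layering concrete, here is a minimal sketch in today's Swift, with no regex involved. The sample line and the variable names are made up for illustration; it only shows how a scalar-level split can hand the grapheme-level layer a field that begins with a degenerate grapheme cluster.

```swift
// Illustrative input: a comma immediately followed by U+0301 (combining acute accent).
let line = "a,\u{301}b"

// Grapheme-semantic view: the comma and the accent form one Character,
// so a Character-level split finds no separator at all.
let comma: Character = ","
let characterFields = line.split(separator: comma)

// Scalar-semantic view: the comma is its own Unicode scalar, so the split
// succeeds and cuts through what was a grapheme cluster.
let scalarComma: Unicode.Scalar = ","
let scalarFields = line.unicodeScalars
    .split(separator: scalarComma)
    .map { String(Substring($0)) }

// The second field now starts with a lone combining mark, which is a
// degenerate grapheme cluster when viewed as a Character.
let second = scalarFields[1]
let leading = second.first!

print(characterFields.count)                  // fields seen by the Character-level split
print(scalarFields.count)                     // fields seen by the scalar-level split
print(leading.unicodeScalars.first!.value)    // scalar value of the leading Character
```

A grapheme-semantic \x{301} would match only that leading character of the second field, which is the restricted behavior described above.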

nnnnnnnn · October 22, 2021, 7:38pm

That reads better, thank you!

This does look like an ambiguity — do you mind bringing it up in the thread for the Regular Expression Literal pitch?

(cc @George) I think this will come down in part to how many properties we're adding, so I'd like to defer this question for now.

There was some good discussion of this question in the "alternatives" sections of the Unicode Scalar Properties and Character Properties proposals, to which I'd add that placing only some properties into a nested type can lead to more confusion, since users then have to look in/know about/remember two places instead of just one.

The definition of "word character" in the pitch is based on common usage of \w and [:word:] in regular expressions (as described in the UTS#18 recommendation). You're definitely correct that this is only a rough approximation of "characters that make up a word", even as applied to contractions or other words that contain punctuation, but there's enough precedent for that rough approximation that it's in our best interests to stick with it.

For a more nuanced (though still not complete) algorithm for detecting word boundaries, you might take a look at the word boundary rules described in UAX#29.
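As a rough illustration of how the approximation behaves on a contraction, here is a sketch at the scalar level using the existing Unicode.Scalar.Properties API. The helper names isWordScalar and wordRuns are hypothetical, and the property set is my reading of the UTS#18 compatibility definition of \w (Alphabetic, marks, decimal numbers, connector punctuation, join controls), not the pitch's actual implementation.

```swift
// Hypothetical helper: approximates the UTS#18 compatibility definition of \w
// for a single Unicode scalar.
func isWordScalar(_ scalar: Unicode.Scalar) -> Bool {
    let properties = scalar.properties
    if properties.isAlphabetic || properties.isJoinControl {
        return true
    }
    switch properties.generalCategory {
    case .nonspacingMark, .spacingMark, .enclosingMark,
         .decimalNumber, .connectorPunctuation:
        return true
    default:
        return false
    }
}

// Hypothetical helper: splits text into maximal runs of "word characters",
// roughly what matching \w+ repeatedly would find.
func wordRuns(in text: String) -> [String] {
    text.unicodeScalars
        .split(whereSeparator: { !isWordScalar($0) })
        .map { String(Substring($0)) }
}

// The apostrophe is not a word character, so the contraction is split in two,
// while the underscore (connector punctuation) keeps "stop_now" together.
let runs = wordRuns(in: "don't stop_now")
print(runs)
```

The split of "don't" into two runs is the kind of imprecision being discussed; the UAX#29 rules handle that case differently.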

Nevin · October 22, 2021, 8:41pm

If the use-case is “detect words”, then we should make it easy to do the right thing. In particular, we should not make it easier to do the wrong thing than the right thing.

If the use-case is not “detect words”, then we should not use the name “word character”.

xwu · October 22, 2021, 9:12pm

The use case, I’d venture to say, is to approximately detect words in the same way that specific others approximately detect words. This is a significant enough use case that UTS#18 makes recommendations for this approximation, distinct from the UAX#29 word boundary algorithm, and the term of art, as far as I can tell, for this approximation is one or more consecutive “word characters.”

JanWillemBrands · October 24, 2021, 6:18am

Unicode defines formal character name aliases. It mentions:

# Note that no formal name alias for the ISO 6429 "BELL" is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.

beccadax · October 25, 2021, 7:26pm

I would recommend that we support only the subset of these that are supported in a Swift string literal. The others should probably be parsed, but we should treat them as syntax errors and offer fix-its to Swift syntax.

Why?

The missing escapes were removed because we judged that they were almost never actually needed, and the \u{...} syntax was chosen to unify the multiple existing hex syntaxes into something extensible that people could actually remember. Those reasons should apply equally to regexes.
It will be confusing if some backslash escapes which evaluate to ordinary characters are supported in regexes but not in normal strings, and especially if the "specify a character using a hex number" syntax is different.

If we take this suggestion, that means we should support:

\u{H+} for Unicode hex numbers
\\ for backslash
\/ for slash
\" for double quote
\' for single quote
\0 for null
\t for tab
\n for LF
\r for CR

And have fix-its for:

\xHH to \u{HH} (trimming leading zeroes)
\uHHHH to \u{HHHH} (trimming leading zeroes)
\UHHHHHHHH to \u{HHHHHHHH} (trimming leading zeroes)
\f to \u{c}
\e to \u{1b}
\a to \u{7}
\b to \u{8}
\cX to \u{equivalent HH}

Alternatively, if we think any of the escapes I'm proposing as fix-its should be kept because they would be valuable in regexes, they should also be added to the string literal syntax. Basically, I'm suggesting we ought to have parity in the things that are reasonable to express in a string literal.
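The hex and control-character rewrites in the fix-it list above could be sketched roughly as follows. Both function names are hypothetical, and the \cX rule assumed here is the common PCRE-style one (uppercase the letter, then flip bit 6); a real fix-it would live in the compiler's diagnostics rather than in library code like this.

```swift
// Hypothetical helper: turns the digits of \xHH, \uHHHH, or \UHHHHHHHH into
// Swift's \u{...} form, trimming leading zeroes. Returns nil if the digits are
// not valid hex or do not name a Unicode scalar (for example, a surrogate).
func swiftEscape(forHexDigits digits: String) -> String? {
    guard let value = UInt32(digits, radix: 16),
          Unicode.Scalar(value) != nil else {
        return nil
    }
    return "\\u{" + String(value, radix: 16) + "}"
}

// Hypothetical helper: turns the X of \cX into Swift's \u{...} form, assuming
// the PCRE-style rule of uppercasing ASCII letters and flipping bit 6.
func swiftEscape(forControlLetter letter: Character) -> String? {
    guard let ascii = letter.asciiValue else { return nil }
    let upper = (97...122).contains(ascii) ? ascii - 32 : ascii
    return "\\u{" + String(upper ^ 0x40, radix: 16) + "}"
}

print(swiftEscape(forHexDigits: "0A") ?? "invalid")
print(swiftEscape(forHexDigits: "0001F514") ?? "invalid")
print(swiftEscape(forControlLetter: "G") ?? "invalid")
```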