[Pitch] Character Classes for String Processing

nnnnnnnn · October 22, 2021, 7:38pm

That reads better, thank you!

This does look like an ambiguity — do you mind bringing it up in the thread for the Regular Expression Literal pitch?

(cc @George) I think this will come down in part to how many properties we're adding, so I'd like to defer this question for now.

There was some good discussion of this question in the "alternatives" sections of the Unicode Scalar Properties and Character Properties proposals. To that I'd add that placing only some properties into a nested type can lead to more confusion, since users then have to know about and check two places instead of just one.

The definition of "word character" in the pitch is based on common usage of \w and [:word:] in regular expressions (as described in the UTS#18 recommendation). You're definitely correct that this is only a rough approximation of "characters that make up a word", particularly for contractions or other words that contain punctuation, but there's enough precedent for the rough approximation that it's in our best interests to stick with it.
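To make the "rough approximation" concrete, here's a minimal sketch of a \w-style check built on the existing Unicode.Scalar.Properties API. It follows my reading of the UTS#18 compatibility definition (alphabetic, marks, decimal digits, connector punctuation, join controls). The names isWordScalar, isWordCharacter, and wordRuns are hypothetical and not part of the pitch, and treating a Character as a word character only when all of its scalars qualify is an assumption on my part.

```swift
// Sketch only: approximates the UTS#18 compatibility definition of \w
// at the Unicode scalar level. All function names here are hypothetical.
func isWordScalar(_ scalar: Unicode.Scalar) -> Bool {
    let props = scalar.properties
    switch props.generalCategory {
    case .nonspacingMark, .spacingMark, .enclosingMark, // gc=Mark
         .decimalNumber,                                // decimal digits
         .connectorPunctuation:                         // e.g. "_"
        return true
    default:
        return props.isAlphabetic || props.isJoinControl
    }
}

// Assumption: a Character counts as a word character only if every
// scalar it contains satisfies the scalar-level check.
func isWordCharacter(_ character: Character) -> Bool {
    character.unicodeScalars.allSatisfy(isWordScalar)
}

// Splits a string into maximal runs of word characters.
func wordRuns(in text: String) -> [String] {
    text.split(whereSeparator: { !isWordCharacter($0) }).map(String.init)
}

print(wordRuns(in: "don't stop_now 42"))
// prints ["don", "t", "stop_now", "42"]
```

Note how the apostrophe splits "don't" into two runs while the underscore keeps "stop_now" together. That's the kind of mismatch with everyday "words" that you pointed out.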

For a more nuanced (though still not complete) algorithm for detecting word boundaries, you might take a look at the word boundary rules described in UAX#29.