Finding Character in CharacterSet

allevato · March 29, 2018, 1:19am

I think CharacterSet is probably the wrong way to think about this problem, for the reasons you noted: despite its name, it's actually a set of Unicode.Scalar values, so you can't use it to properly handle grapheme clusters made up of multiple scalars.
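To make the mismatch concrete, here is a minimal sketch of what happens when a multi-scalar Character meets CharacterSet. The variable names are just for illustration, and it assumes Foundation's lowercaseLetters corresponds to Unicode General Category Ll, as its documentation describes:

```swift
import Foundation

let lowercase = CharacterSet.lowercaseLetters

// CharacterSet.contains(_:) takes a Unicode.Scalar, not a Character.
let plainO: Unicode.Scalar = "o"
let plainIsLowercase = lowercase.contains(plainO)

// One Character made of three scalars: "o" + combining dot below + combining grave.
let decorated: Character = "o\u{0323}\u{0300}"
let scalars = Array(String(decorated).unicodeScalars)

// The best you can do with CharacterSet is test each scalar separately.
// The combining marks are not lowercase letters, so the answers disagree.
let perScalar = scalars.map { lowercase.contains($0) }

print(plainIsLowercase, scalars.count, perScalar)
```

There is no single membership test here that answers "is this Character lowercase?"; you would have to decide yourself how to combine the per-scalar results.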

It would also be a bit difficult to express something like "a set of Characters that are lowercase", especially if you wanted that to be an iterable collection, since it would have to include every combination of a base letter followed by an arbitrarily long sequence of combining modifiers. In fact, I believe such a set would be infinite, because you can stack modifiers ad infinitum (and thus, Zalgo text was born).
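A quick sketch of the stacking behavior, using an arbitrary count of 50 combining acute accents (any number would do):

```swift
// Start from a single base letter.
var stacked = "a"
let acute: Unicode.Scalar = "\u{0301}" // COMBINING ACUTE ACCENT

// Pile 50 combining marks onto the same base letter.
for _ in 0..<50 {
    stacked.unicodeScalars.append(acute)
}

// Still a single grapheme cluster, i.e. one Character,
// even though it is built from 51 scalars.
let characterCount = stacked.count
let scalarCount = stacked.unicodeScalars.count
print(characterCount, scalarCount)
```

Each different stack height is a distinct Character, which is why enumerating "all lowercase Characters" can never finish.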

In this particular case, I think what you really want is predicates like isLowercase on Character. Such a predicate would handle something like "ọ̀" correctly by applying the test the Unicode Standard defines to the complete grapheme cluster, instead of testing individual scalars. Then you could just write str.prefix(while: { $0.isLowercase }) and be done with it.
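Here is a small sketch of that call, assuming a Swift 5 or later toolchain, where Character.isLowercase is available in the standard library. The sample string is made up for illustration:

```swift
// "ọ̀" spelled out as a base letter plus two combining marks,
// followed by two plain lowercase letters and then an uppercase one.
let str = "o\u{0323}\u{0300}bcDef"

// Each element of a String is a Character (a whole grapheme cluster),
// so the predicate sees "ọ̀" as one unit rather than three scalars.
let lowerPrefix = str.prefix(while: { $0.isLowercase })

print(lowerPrefix)       // the lowercase run before "D"
print(lowerPrefix.count) // number of Characters in that run
```

Note that prefix(while:) returns a Substring; wrap it in String(...) if you need to keep it around independently of str.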

@Michael_Ilseman and I have been discussing scalar and character properties in this thread, which may interest you.