String hygiene

Michael_Ilseman · May 20, 2018, 9:02pm

I wouldn't say that it is incorrect for Swift. I would say that it seems inconsistent with a view of String as a collection of graphemes. These semantics may be perfect for, e.g., String.UnicodeScalarView.trimmed().

Degenerate graphemes are a reality that we have to deal with, and we (unfortunately) cannot have a nice algebraic String because of them. But, it does seem odd for String to have a common-use API that produces them.

I don't think I communicated my point well. My example was meant to demonstrate some corner cases that we need to have a strategy for, even if a random package's implementation doesn't care about the details. I see (at least) 3 potential results, each of which has some upsides and downsides. The question is what the result should be.

I’ll use \u{020} to represent a space explicitly.

“\u{020}\u{301}abc\n\u{301}”.trimmed() could yield:

“abc”, under the view that String APIs operate on graphemes and a grapheme leading with whitespace is considered whitespace.
“\u{020}\u{301}abc\n\u{301}”, under the view that String APIs operate on graphemes and a grapheme with a combining character is not whitespace.
“\u{301}abc\n\u{301}”, which results in a degenerate leading grapheme.

I feel like #3 makes sense for a trimmed() on the UnicodeScalarView or the the code unit views. For String, we need to figure out which of these 3 (or perhaps an unknown 4th option) is the least harmful behavior.