Over in the Character Properties proposal and the String.trim() pitch, as well as an eventual String.lines() pitch, the topic of which Characters are and are not whitespace keeps coming up.
Character represents a grapheme, and the rules of grapheme breaking allow for the existence of odd graphemes that might not otherwise emerge organically.
What is whitespace? Is "\u{020}\u{301}" (U+0020 SPACE, U+0301 COMBINING ACUTE ACCENT), which is often rendered as " ́", whitespace?
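To make the question concrete, here is a minimal sketch of how that pair looks through the grapheme view and through the scalar view. It assumes Swift 5 or later, where Unicode.Scalar.Properties is available; it only inspects the string and does not answer the question either way.

```swift
// U+0020 SPACE followed by U+0301 COMBINING ACUTE ACCENT.
let odd = "\u{20}\u{301}"

// Grapheme breaking attaches the combining mark to the space,
// so the string holds a single Character...
print(odd.count)                 // 1

// ...that is made of two Unicode scalars.
print(odd.unicodeScalars.count)  // 2

// Scalar by scalar, only the first one has the White_Space property.
for scalar in odd.unicodeScalars {
    let codePoint = "U+" + String(scalar.value, radix: 16, uppercase: true)
    print(codePoint, scalar.properties.isWhitespace)
}
// U+20 true
// U+301 false
```

So the grapheme as a whole starts with a whitespace scalar but also carries a non-whitespace one, which is exactly where the ambiguity comes from.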
I think it’s important to view how this concept of whitespace could be applied. For that, we’ll use the following example and ask what the result of trim() and lines() should be.
let str = "\u{020}\u{301}abc\n\u{301}de\u{020}\u{301}"
// str : String = " ́abc\ńde ́"
Array(str.unicodeScalars)
// [" ", "\u{0301}", "a", "b", "c", "\n", "\u{0301}", "d", "e", " ", "\u{0301}"]
Array(str.trimmed().lines())
// ???
I see (at least) 3 possible results:
1. ["abc", "de"]
2. ["\u{020}\u{301}abc\n\u{301}de\u{020}\u{301}"]
3. ["\u{301}abc", "\u{301}de\u{020}\u{301}"]
What should the result be, and more importantly, why?
Each of these answers involves various tradeoffs, and none is perfect.
Answer #1, which says these odd graphemes are whitespace, could cause a user to lose the information of the combining accent mark in their processing. It could also break intuition of the whitespace concept as it relates to visibility or display width (though perhaps that shouldn’t be conflated).
Answer #2, which says these odd graphemes are not whitespace, could cause a user to be surprised, given that the string clearly has a leading whitespace scalar and a newline scalar inside of it.
Answer #3, which skips graphemes to operate on scalars, produces degenerate graphemes. These cause very counter-intuitive behavior on subsequent String API calls, but are also a necessary corner case permitted by grapheme breaking rules. These semantics would also deviate from String’s primary presentation as a collection of graphemes.
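As a rough illustration of the scalar-level semantics behind Answer #3, the sketch below trims and splits on the unicodeScalars view instead of on Characters. The names trimmingWhitespaceScalars and scalarLines are hypothetical helpers invented for this example; they are not the pitched trim() or lines() APIs, and the sketch only splits on U+000A. It assumes Swift 5 or later.

```swift
// Hypothetical helper: drop leading and trailing scalars that have the
// White_Space property, ignoring grapheme boundaries entirely.
func trimmingWhitespaceScalars(_ s: String) -> String {
    let scalars = s.unicodeScalars
    guard let start = scalars.firstIndex(where: { !$0.properties.isWhitespace }),
          let end = scalars.lastIndex(where: { !$0.properties.isWhitespace })
    else {
        return ""  // empty, or nothing but whitespace scalars
    }
    return String(scalars[start...end])
}

// Hypothetical helper: split on the U+000A scalar, again on the scalar view.
func scalarLines(_ s: String) -> [String] {
    let newline: Unicode.Scalar = "\n"
    return s.unicodeScalars
        .split(separator: newline, omittingEmptySubsequences: false)
        .map { String($0) }
}

let str = "\u{020}\u{301}abc\n\u{301}de\u{020}\u{301}"
let result = scalarLines(trimmingWhitespaceScalars(str))

// Print each line as its scalars so the leftover combining marks are visible.
for line in result {
    print(line.unicodeScalars.map { String($0.value, radix: 16) })
}
// ["301", "61", "62", "63"]
// ["301", "64", "65", "20", "301"]
```

Only the leading U+0020 is removed. Both resulting lines begin with a bare U+0301, which is the degenerate grapheme the answer describes, and the trailing space survives because the last scalar is the combining mark rather than whitespace.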
(I’ll follow up with my personal opinion and reasoning later in this thread)