Corner-cases in `Character` classification of whitespace

I don’t think we should prioritize inorganic corner cases when designing the overall model, though we should think through behavior in all situations. For these corner cases, we should prioritize consistency with subsequent operations and user intention.

My (weakly held) opinion is that #1 is the least-bad result, even though #2 feels more pedantically correct (the best kind of correct). I think #1 is the behavior we should provide, even if we don’t formally specify it at this time in the docs.

String.lines() will likely produce a to-be-designed LazySplitCollection, with an option to control whether the separator is preserved or not. String.trimmed() likewise is information-losing on purpose, but the trimmed characters are still accessible if it returns a Substring.
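To make the Substring point concrete, here is a minimal sketch. `trimmedSketch()` is a hypothetical name, not a proposed API, and it assumes a Character counts as whitespace when its first scalar does:

```swift
extension String {
    // Hypothetical helper for illustration only; not a proposed API.
    func trimmedSketch() -> Substring {
        // Assumption: classify a Character by its first Unicode scalar.
        func isSpace(_ c: Character) -> Bool {
            return c.unicodeScalars.first?.properties.isWhitespace ?? false
        }
        // If every Character is whitespace (or the String is empty),
        // return an empty Substring positioned at the end.
        guard let start = firstIndex(where: { !isSpace($0) }),
              let end = lastIndex(where: { !isSpace($0) }) else {
            return self[endIndex...]
        }
        return self[start...end]
    }
}

let padded = "  hello \u{301}\n"
let core = padded.trimmedSketch()         // "hello"

// The Substring shares indices with its base, so the trimmed
// Characters can still be recovered from the original String.
let leading = padded[..<core.startIndex]  // "  "
let trailing = padded[core.endIndex...]   // " \u{301}\n"
```

Nothing is lost from the caller's point of view: the information is dropped from the result, but it remains reachable through the base String.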

If a String contains a “\n\u{301}” somewhere inside it (newline with combining accent), I think it is much more surprising to not do line-splitting around that grapheme. Producing a separator of “\n\u{301}” is the more consistent behavior and less surprising than not splitting on it.

In this sense, “is-whitespace” might be more like “has-programmatic-whitespace”. Given these graphemes are atypical corner cases (AFAICT), I think it’s less confusing to just think of it as “is-whitespace”.
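As a rough sketch of what “has-programmatic-whitespace” could mean, the check below looks only at the first scalar of a grapheme. The property name is made up for illustration, and whether the real `isWhitespace` should behave this way is exactly the open question:

```swift
extension Character {
    // Hypothetical name for illustration; inspects only the first scalar.
    var hasProgrammaticWhitespace: Bool {
        return unicodeScalars.first?.properties.isWhitespace ?? false
    }
}

let spaceWithAccent: Character = " \u{301}"  // space + combining acute
let plainLetter: Character = "a"
let accentedLetter: Character = "e\u{301}"

let a = spaceWithAccent.hasProgrammaticWhitespace  // true
let b = plainLetter.hasProgrammaticWhitespace      // false
let c = accentedLetter.hasProgrammaticWhitespace   // false
```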

Regarding solution #3 and “degenerate” graphemes.

Degenerate graphemes, such as one that contains only a combining scalar, violate common Collection intuition:

"abcde".count // 5
"\u{0301}".count // 1 (a lone combining scalar is its own grapheme)
let str = "abcde" + "\u{0301}" // "abcdé"
str.count // 5, not 5 + 1

String needs to accommodate the existence of degenerate graphemes, and they can always be formed by operating on the Unicode scalar or code unit views. But we should try to avoid forming them in commonly used, top-level String APIs.
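For example, here is a small sketch of forming a degenerate grapheme through the scalar view, using only existing standard library APIs:

```swift
var s = "e\u{301}"  // "é" spelled as "e" + combining acute
let before = (s.count, s.unicodeScalars.count)  // (1, 2)

// Dropping the base scalar leaves a lone combining mark behind.
s.unicodeScalars.removeFirst()

let after = (s.count, s.unicodeScalars.count)   // (1, 1)
```

The result is still a valid String with one Character, but that Character has no base to attach to.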

Regarding whitespace and visibility or rendering

AKA “If it looks like whitespace and quacks like whitespace…”.

We need to be careful not to conflate programmatic usage of whitespace with visibility and rendering. There are examples within Unicode of whitespace scalars that have a visible representation: “ ” (U+1680 OGHAM SPACE MARK). String also can’t really answer all such questions fully or accurately, so it’s strongly recommended to consult your platform, e.g. ask CoreText.
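A quick check of the Ogham example against the scalar properties that already exist in the standard library (this says nothing about how any font actually draws it):

```swift
let ogham: Unicode.Scalar = "\u{1680}"  // OGHAM SPACE MARK

// Unicode classifies it as whitespace and as a space separator (Zs),
// even though it is typically rendered with a visible stroke.
let isSpace = ogham.properties.isWhitespace                            // true
let isSeparator = ogham.properties.generalCategory == .spaceSeparator  // true
```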

Other considerations

@torquato mentioned that the Leiden Convention for representing texts originally derived from ancient papyrus manuscripts may use a mark underneath empty space to reflect a missing or unknown character. There are different conventions for representing this electronically, and they generally recommend using tags. However, if one chooses to encode this as a whitespace scalar followed by a combining under-dot, then this usage would fall through the cracks, and such characters could be dropped from a trimmed String. My recommendation is to not let this scenario guide the final decision.

Thoughts? Are there other interesting scenarios where user intention might deviate?
