Corner-cases in `Character` classification of whitespace


(Michael Ilseman) #1

Over in the Character Properties proposal and the String.trim() pitch, as well as an eventual String.lines() pitch, the topic of which Characters are and are not whitespace keeps coming up.

Character represents a grapheme, and the rules of grapheme breaking allow for the existence of odd graphemes that might not otherwise emerge organically.

What is whitespace? Is “\u{020}\u{301}” (U+0020 SPACE, U+0301 COMBINING ACUTE ACCENT), which is often rendered as ́, whitespace?
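To make the question concrete, here is a minimal sketch of how this sequence looks through String's two views. The variable name is only illustrative.

```swift
// U+0020 SPACE followed by U+0301 COMBINING ACUTE ACCENT.
let spaceWithAccent = "\u{20}\u{301}"

// Grapheme breaking attaches the combining mark to the preceding space,
// so the String sees one Character built from two Unicode scalars.
print(spaceWithAccent.count)                 // 1
print(spaceWithAccent.unicodeScalars.count)  // 2
```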

I think it’s important to view how this concept of whitespace could be applied. For that, we’ll use the following example and ask what the result of trim() and lines() should be.

let str = "\u{020}\u{301}abc\n\u{301}de\u{020}\u{301}"
// str : String = " ́abc\ńde ́"
Array(str.unicodeScalars)
// [" ", "\u{0301}", "a", "b", "c", "\n", "\u{0301}", "d", "e", " ", "\u{0301}"]

Array(str.trimmed().lines())
// ???

I see (at least) 3 possible results:

  1. ["abc", "de"]
  2. ["\u{020}\u{301}abc\n\u{301}de\u{020}\u{301}"]
  3. ["\u{301}abc", "\u{301}de\u{020}\u{301}"]

What should the result be, and more importantly, why?

Each of these answers involves various tradeoffs and none is perfect.

Answer #1, which says these odd graphemes are whitespace, could cause a user to lose the information of the combining accent mark in their processing. It could also break intuition of the whitespace concept as it relates to visibility or display width (though perhaps that shouldn’t be conflated).

Answer #2, which says these odd graphemes are not whitespace, could cause a user to be surprised given the string clearly has a whitespace leading scalar and newline scalar inside of it.

Answer #3, which skips graphemes to operate on scalars, produces degenerate graphemes. These cause very counter-intuitive behavior on subsequent String API calls, but are also a necessary corner case permitted by grapheme breaking rules. These semantics would also deviate from String’s primary presentation as a collection of graphemes.
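As a rough illustration of how answer #3 ends up with a degenerate grapheme, the following sketch trims leading whitespace at the scalar level. The helper name is hypothetical and not a proposed API; it only assumes the `Unicode.Scalar.Properties.isWhitespace` property.

```swift
// Hypothetical helper, not a standard library API: drops leading scalars that
// carry the Unicode White_Space property, ignoring grapheme boundaries.
func trimmingLeadingWhitespaceScalars(_ input: String) -> String {
    var result = ""
    var stillTrimming = true
    for scalar in input.unicodeScalars {
        if stillTrimming && scalar.properties.isWhitespace { continue }
        stillTrimming = false
        result.unicodeScalars.append(scalar)
    }
    return result
}

let trimmed = trimmingLeadingWhitespaceScalars("\u{20}\u{301}abc")

// The space is gone, but its combining accent survives as the first scalar
// and now forms a Character on its own.
print(trimmed.unicodeScalars.first?.value == 0x301)  // true
print(trimmed.count)                                 // 4
```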

(I’ll follow up with my personal opinion and reasoning later in this thread)


(Erik Little) #2

Is there some prior art that we could follow from other languages? Or maybe from the Unicode standard? I would really hate to start having to invent functionality/meanings in order to work on things like trim. If there are no well-defined rules for what whitespace characters are, I would say go with the strictest possible interpretation for the default, and be more permissive only when passed that option.


(Tony Allevato) #3

Are there any character sets supported by the Unicode standard where a whitespace character + a combining scalar diacritic has some sort of linguistically/semantically correct meaning? I can't think of any, but I'm also by no means an expert on every language that exists on the planet.

If the answer to the question above is "no", then my preference would be that we define whitespace to mean "a grapheme cluster for which all of its scalars have the White_Space property". That seems most straightforward from an implementation point of view, and it also means that we wouldn't define something as whitespace that doesn't visually look like whitespace.
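A minimal sketch of that definition, assuming `Unicode.Scalar.Properties.isWhitespace` as the source of the White_Space property. The property name `isEntirelyWhitespace` is made up for illustration.

```swift
extension Character {
    // Hypothetical name: true only when every scalar in the grapheme
    // cluster has the White_Space property.
    var isEntirelyWhitespace: Bool {
        return unicodeScalars.allSatisfy { $0.properties.isWhitespace }
    }
}

let plainSpace = Character(" ")
let accentedSpace = Character("\u{20}\u{301}")

print(plainSpace.isEntirelyWhitespace)     // true
print(accentedSpace.isEntirelyWhitespace)  // false
```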


(Michael Ilseman) #4

There isn't that I know of. I'm interested in the why. Why is being stricter the more appealing option? I see it feels like we're being more precise when viewing an isolated grapheme, but why is it the best behavior for these APIs? I.e. why do you think result #2 is better than #1 or #3?


(Erica Sadun) #5

This answer is only in regard to the lingering acute accent: in no way is that whitespace. There is no intent to use it to separate words or lines. Only intentional characters that are meant to do that task are whitespace/newlines.


(Dante Broggi) #6

I do not know whether I would prefer the result to be #1 or #2, but I certainly would not want the result to be #3, unless the trimming was performed on a .unicodeScalars view.


(Michael Ilseman) #7

Sorry, I'm having a hard time parsing this sentence. Are you arguing for solution #3, or a result that also drops the letters in the middle?

edit: I see your edit now! Makes more sense :slight_smile:. What do you think about the example of usage on String, instead of isolated graphemes?

Intent is... a little tricky here.

From Unicode’s perspective, “whitespace” as separator applies to programmatic rather than linguistic intent (even though tons of scalars with no prior programmatic usage are still considered whitespace). Trying to interpret whitespace as having linguistic meaning does not work across languages. (This is why we should eventually expose a String.words() or similar which does more intelligent word breaking for written languages.)

For example, a String that holds the contents of a CSV file has the programmatic intent for “abc\n\u{301}def” to represent 2 entries.

#1 is “bad” because it associates the accent with the newline separator. #2 is “bad” because the intention was almost certainly for there to be two records and not one. #3 is “bad” because it produces entries with leading degenerate graphemes. Which is least bad?

Ordering the options from a principle of intent, #3 most closely adheres to intent, followed by #1. #2 does not honor the intended usage of the newline scalar as an entry separator.
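The scalar-level reading of the CSV example (the one behind result #3) can be sketched as follows. `scalarLines` is a hypothetical helper that splits on the newline scalar only; it says nothing about how a Character-based split would behave.

```swift
// Hypothetical helper: splits on U+000A at the Unicode scalar level,
// without consulting grapheme boundaries.
func scalarLines(_ input: String) -> [String] {
    var lines: [String] = [""]
    for scalar in input.unicodeScalars {
        if scalar == "\n" {
            lines.append("")
        } else {
            lines[lines.count - 1].unicodeScalars.append(scalar)
        }
    }
    return lines
}

let entries = scalarLines("abc\n\u{301}def")
print(entries.count)                                    // 2
// The second entry starts with the stranded combining accent.
print(entries[1].unicodeScalars.first?.value == 0x301)  // true
```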

I’m not arguing against #2 as the best answer, just against it being the clear-cut winner for modeling intent.


#8

Yes. In papyrological editions the combining underdot ('COMBINING DOT BELOW' (U+0323)) is used to denote that the reading of a character on the papyrus is not certain. The traces of ink most likely resemble that letter. A space with such a dot below means that there are traces of ink and that there is a character but that it is far too uncertain to tell which.

In this situation whitespace + combining dot below has the exact opposite meaning of whitespace. It is not the absence of a character but the assertion that there is one.

Therefore I strongly oppose treating such cases as whitespace as a whole.


(Michael Ilseman) #9

Out of curiosity, which space do they use for this purpose and do you have a link?


#10

Does this help?

https://www.fileformat.info/info/unicode/category/Zs/list.htm

Note that this is not an exhaustive list (tab is not included, for example).


(Michael Ilseman) #11

No, I am intimately familiar with that list ;-) I'm asking about encoding conventions for representing ancient writing originally done on papyrus instead of computers.


#12

When I did papyrology at university I just hit the spacebar. :smiley:
Sorry, that was so long ago, I don't have any typographical specifications at hand except that this is part of the Leiden Conventions.


(Erica Sadun) #13

A good question but I think it's the wrong one to drive this discussion.

Character set localization, perhaps including, as Calendars do, some edge cases for weird archaeological stuff.


(Michael Ilseman) #14

To be clear, this was prefaced with "Out of curiosity", meaning that I'm interested in understanding obscure use cases (and reasoning through their prevalence) even if they don't or shouldn't influence the final decision.

Could you elaborate more on why these graphemes would appear in any of these and why the intention is clearly for them to not be whitespace?


(Erica Sadun) #15

Either there is a standard for whitespace (Z*, and a few others) or there are localized use-cases that include such obscurities as archaeological annotations. Should these non-standard use cases exist and be considered, then there has to be a way to "localize" (so to speak) the meaning of white-space to a given use.

I think any solution is going to have to allow for malformed data and treat it in an expected way. To quote someone wise, "String’s primary presentation [is] as a collection of graphemes"


(Michael Ilseman) #16

Ah, that makes sense, thank you for clarifying. As far as standards, I'd recommend going with the White_Space derived property rather than a general-category-based definition (unless there are compelling reasons otherwise). This is what the Unicode.Scalar.Properties proposal uses.
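To illustrate the difference between the two approaches, here is a small sketch using `Unicode.Scalar.Properties`. Tab has the White_Space property but its general category is Cc (control), so a purely Z*-based definition would miss it.

```swift
let tab: Unicode.Scalar = "\t"
let noBreakSpace: Unicode.Scalar = "\u{A0}"

// Tab: White_Space, but general category Cc rather than Zs/Zl/Zp.
print(tab.properties.isWhitespace)                                 // true
print(tab.properties.generalCategory == .control)                  // true

// NO-BREAK SPACE: White_Space and general category Zs.
print(noBreakSpace.properties.isWhitespace)                        // true
print(noBreakSpace.properties.generalCategory == .spaceSeparator)  // true
```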


(TJ Usiyan) #17

Is there room for allowing multiple strategies via an enum?


(Michael Ilseman) #18

I don’t think we should prioritize inorganic corner cases when designing the overall model, though we should think through behavior in all situations. For these corner cases, we should prioritize consistency with subsequent operations and user intention.

My (weakly held) opinion is that #1 is the least-bad result, even though #2 feels more pedantically correct (the best kind of correct). I think #1 is the behavior we should provide, even if we don’t formally specify it at this time in the docs.

String.lines() will likely produce a to-be-designed LazySplitCollection, with an option to control whether the separator is preserved or not. String.trimmed() likewise is information-losing on purpose, but the trimmed characters are still accessible if it returns a Substring.
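As a sketch of why returning a Substring keeps the trimmed characters reachable: the method name `trimmedSketch(where:)` and its signature are assumptions made for illustration, not the pitched API. The predicate here uses `Character.isWhitespace`, which ships in current Swift standard libraries.

```swift
extension String {
    // Hypothetical API shape; the real signature was still to be designed.
    func trimmedSketch(where isTrimmable: (Character) -> Bool) -> Substring {
        guard let first = firstIndex(where: { !isTrimmable($0) }),
              let last = lastIndex(where: { !isTrimmable($0) }) else {
            return self[endIndex..<endIndex]  // every Character was trimmable
        }
        return self[first...last]
    }
}

let source = "  abc \n"
let trimmed = source.trimmedSketch(where: { $0.isWhitespace })
print(trimmed)  // abc

// A Substring shares indices with its base String, so the dropped prefix
// and suffix can still be recovered from the original.
let droppedPrefix = source[..<trimmed.startIndex]
let droppedSuffix = source[trimmed.endIndex...]
print(droppedPrefix.count, droppedSuffix.count)  // 2 2
```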

If a String contains a “\n\u{301}” somewhere inside it (newline with combining accent), I think it is much more surprising to not do line-splitting around that grapheme. Producing a separator of “\n\u{301}” is the more consistent behavior and less surprising than not splitting on it.

In this sense, “is-whitespace” might be more like “has-programmatic-whitespace”. Given these graphemes are atypical corner cases (AFAICT), I think it’s less confusing to just think of it as “is-whitespace”.

Regarding solution #3 and “degenerate” graphemes.

Degenerate graphemes, such as one that contains only a combining scalar, violate common Collection intuition:

"abcde".count // 5
"\u{0301}".count // 1
let str = "abcde" + "\u{0301}" // "abcdé"
str.count // 5

String needs to accommodate the existence of degenerate graphemes, and they can always be formed by operating on the Unicode scalar or code unit views. But we should try to avoid forming them in commonly used top-level String APIs.

Regarding whitespace and visibility or rendering

AKA “If it looks like whitespace and quacks like whitespace…”.

We need to be careful not to conflate programmatic usage of whitespace with visibility and rendering. There are examples within Unicode of whitespace scalars which have a visible representation: “ ” (U+1680 OGHAM SPACE MARK). String also can’t really answer all such questions fully or accurately, and it’s strongly recommended to consult your platform, e.g. ask CoreText.
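For reference, a short sketch showing that U+1680 is classified as whitespace by the scalar properties, regardless of how a font draws it:

```swift
let oghamSpace: Unicode.Scalar = "\u{1680}"

// Whitespace by property and by general category (Zs), even though it is
// commonly drawn with a visible stroke.
print(oghamSpace.properties.isWhitespace)                        // true
print(oghamSpace.properties.generalCategory == .spaceSeparator)  // true
```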

Other considerations

@torquato mentioned that the Leiden Conventions for representing texts originally derived from ancient papyrus manuscripts may utilize a mark underneath empty space to reflect a missing or unknown character. There are different conventions on how to represent this electronically, generally recommending the use of tags. However, if one chooses to encode this instead as a whitespace scalar followed by a combining under-dot, then this usage would fall through the cracks and such characters could be dropped from a trimmed String. My recommendation is to not let this scenario guide the final decision.

Thoughts? Are there other interesting scenarios where user intention might deviate?


#19

Is it right to think of this as any (a Character is whitespace if any of its constituent Unicode scalars are whitespace) versus all (a Character is whitespace if all of its constituent Unicode scalars are whitespace) or are there more complicated situations?
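The two rules can be put side by side in a small sketch. Both function names are hypothetical, and both rely on `Unicode.Scalar.Properties.isWhitespace`.

```swift
// "any" rule: one whitespace scalar is enough.
func isWhitespaceByAny(_ character: Character) -> Bool {
    return character.unicodeScalars.contains { $0.properties.isWhitespace }
}

// "all" rule: every scalar must be whitespace.
func isWhitespaceByAll(_ character: Character) -> Bool {
    return character.unicodeScalars.allSatisfy { $0.properties.isWhitespace }
}

let space = Character(" ")
let spaceWithAccent = Character("\u{20}\u{301}")
let letterWithAccent = Character("e\u{301}")

print(isWhitespaceByAny(space), isWhitespaceByAll(space))                        // true true
print(isWhitespaceByAny(spaceWithAccent), isWhitespaceByAll(spaceWithAccent))    // true false
print(isWhitespaceByAny(letterWithAccent), isWhitespaceByAll(letterWithAccent))  // false false
```

The rules only disagree on mixed clusters such as the second one, which is exactly the corner case under discussion.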

The option chosen is not that important to me (though I don't like #3 at all), and I don't have much experience in this area, but I have a couple of thoughts.

  • Your examples are mostly light parsing activities. Perhaps parsing would usually more correctly be done on one of the scalar views, but most people will reach for String first. Is this the main use case for deciding if a Character is whitespace? Are there any others?
  • If you have a visibility/rendering mindset then you're right that things get quite complicated. This became clear to me from your example with a newline instead of a space. In the places I tested, "test\n\u{301}this" is rendered over two lines, with the dangling combining acute accent at the start of the second line. So if I was approaching this from the rendering perspective, I might conclude something complex like “for trimmed() use the all rule, but for lines() use the any rule” or have a different rule for horizontal spacing vs newlines, or something.

In conclusion, I think you've convinced me about #1, even though #2 seemed more natural when just thinking about the trimmed() case.


(Nick Keets) #20

Of course there is. Here is the list for Go:

0x0009
0x000A
0x000B
0x000C
0x000D
0x0020
0x0085
0x00A0
0x1680
0x2000
0x2001
0x2002
0x2003
0x2004
0x2005
0x2006
0x2007
0x2008
0x2009
0x200A
0x2028
0x2029
0x202F
0x205F
0x3000

source: https://golang.org/src/unicode/tables.go#L6608

And here is the list for Python:

0x0009
0x000A
0x000B
0x000C
0x000D
0x001C
0x001D
0x001E
0x001F
0x0020
0x0085
0x00A0
0x1680
0x2000
0x2001
0x2002
0x2003
0x2004
0x2005
0x2006
0x2007
0x2008
0x2009
0x200A
0x2028
0x2029
0x202F
0x205F
0x3000

source: https://github.com/python/cpython/blob/279a96206f3118a482d10826a1e32b272db4505d/Objects/unicodetype_db.h#L5774

We don't need to invent whitespace from first principles...
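As a quick cross-check against Swift's own scalar properties, the sketch below filters the Python list above and reports which entries do not carry the Unicode White_Space property. The array is copied from the list in this post; the variable names are illustrative.

```swift
// Code points from the Python list above.
let pythonSpaces: [UInt32] = [
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D,
    0x001C, 0x001D, 0x001E, 0x001F,
    0x0020, 0x0085, 0x00A0, 0x1680,
    0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200A,
    0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
]

// Keep only the entries that lack the White_Space property.
let notWhiteSpace = pythonSpaces.filter { value in
    guard let scalar = Unicode.Scalar(value) else { return true }
    return !scalar.properties.isWhitespace
}

print(notWhiteSpace.map { String($0, radix: 16, uppercase: true) })
// ["1C", "1D", "1E", "1F"]
```

The four information-separator controls are the only entries outside White_Space, so the prior art mostly agrees with the derived property.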