Corner-cases in `Character` classification of whitespace

I'm not sure how relevant this is, but NSLinguisticTagger enumerates the following:

Substrings          Token Types   Lexical Classes
"\u{020}\u{301}"    Word          OtherWord
"abc"               Word          OtherWord
"\n"                Whitespace    ParagraphBreak
"\u{301}"           Word          OtherWord
"de"                Word          OtherWord
"\u{020}\u{301}"    Word          OtherWord

That is not what this thread is about. You provided arbitrary lists of scalars from other languages. I recommend using Unicode's arbitrary list of scalars. Nothing is being invented here as far as scalars are concerned.

This thread is talking about graphemes, specifically behavior surrounding corner cases involving combining scalars following whitespace scalars.

edit: I updated the title to help avoid this kind of misunderstanding.


I think I agree with most of your analysis in the original thread:

Answer #1 loses information about the scalars that were combined with the spaces. Once the user has trimmed that, they can't get it back. The same could be said for traditional whitespace itself, because the user would lose information like "is this an ASCII space, or a quad space, or...", but in this case, a combining scalar of its own is not a space as defined by Unicode, so the result is that a trim function would remove non-whitespace scalars from the string. That feels wrong.

Answer #3 breaks graphemes, which seems totally incorrect to me. If the trim function is being implemented in terms of Character, then each Character should be treated as an indivisible unit.

So, answer #2 seems like it best balances what users would expect. Attaching a combining scalar to a whitespace character feels like "intent", that they want it to be treated as something potentially other than whitespace. But that still may just be intuition on my part.
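To make that concrete, here is what each answer would produce for the leading edge of "\u{20}\u{301}abc" (a space with a combining acute), as I understand the options; these are illustrative values, not an API:

let answer1 = "abc"               // #1: the whole space-plus-mark cluster is trimmed away
let answer2 = "\u{20}\u{301}abc"  // #2: the cluster is not whitespace, so nothing is trimmed
let answer3 = "\u{301}abc"        // #3: only the space scalar is removed

// #3 breaks the grapheme: the combining mark now stands alone as a
// degenerate Character at the front of the string.
print(answer2.first!.unicodeScalars.count)  // 2: space + mark form one Character
print(answer3.first!.unicodeScalars.count)  // 1: the mark is now its own Character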

Now, just to throw a possible wrench in: my earlier post said "a Character is whitespace if all of its Unicode scalars have White_Space == true". We should note that this doesn't hold for things like alphabetic characters, where we only want to check the base scalar—"á" is alphabetic even though "´" isn't. So we don't have a consistent rule here, and I could see someone arguing that combining scalars should be ignored for whitespace in a similar fashion. But I would still prefer the more "intuitive" solution of #2, which doesn't treat something that visually doesn't look like whitespace as whitespace.
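As a sketch of the two rules being contrasted here (the property names below are my own, not stdlib API):

extension Character {
    // The whitespace rule under discussion: every scalar must be White_Space.
    var isAllScalarsWhitespace: Bool {
        unicodeScalars.allSatisfy { $0.properties.isWhitespace }
    }
    // The alphabetic rule under discussion: only the base scalar is consulted.
    var isBaseScalarAlphabetic: Bool {
        unicodeScalars.first?.properties.isAlphabetic ?? false
    }
}

let accented: Character = "\u{61}\u{301}"  // "á" as a + combining acute
let oddSpace: Character = "\u{20}\u{301}"  // space + combining acute

print(accented.isBaseScalarAlphabetic)  // true, even though U+0301 isn't alphabetic
print(oddSpace.isAllScalarsWhitespace)  // false: U+0301 isn't White_Space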


Instead of prescribing what is and isn't whitespace by deciding on what trimming whitespace is meant to do, can we not sidestep this entire issue by offering the developer a choice?

enum WhitespaceOptions {
    // Considers a `Character` to be whitespace if all of its underlying
    // Unicode scalars are whitespace.
    case hasAllWhitespace

    // Considers a `Character` to be whitespace if any of its underlying
    // Unicode scalars are whitespace.
    case hasAnyWhitespace
}

extension String {
    func trim(_ options: TrimOptions, whitespaceOptions: WhitespaceOptions = .hasAllWhitespace) { ... }
}

As I'd previously brought up in the Character and String properties pitch, whitespace has different meaning in different applications ("does not appear to draw anything" vs. "separates values" is just one distinction) and instead of trying to decide what whitespace means across the board, I think allowing the developer to meaningfully decide for themselves and the semantics they need is better for everyone.
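For illustration, here is a rough sketch of how those options might drive a trim, using the WhitespaceOptions enum above (the helper and method names are my assumptions, and only the leading edge is handled, to keep it short):

extension Character {
    fileprivate func isWhitespace(under options: WhitespaceOptions) -> Bool {
        switch options {
        case .hasAllWhitespace:
            return unicodeScalars.allSatisfy { $0.properties.isWhitespace }
        case .hasAnyWhitespace:
            return unicodeScalars.contains { $0.properties.isWhitespace }
        }
    }
}

extension String {
    func trimmingLeadingWhitespace(
        _ options: WhitespaceOptions = .hasAllWhitespace
    ) -> Substring {
        drop(while: { $0.isWhitespace(under: options) })
    }
}

let input = "\u{20}\u{301}abc"
print(input.trimmingLeadingWhitespace(.hasAllWhitespace))  // the cluster survives
print(input.trimmingLeadingWhitespace(.hasAnyWhitespace))  // "abc": the cluster is dropped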


trim() and lines() are definitely both information-losing by design. If trimmed() returned a Substring, or if lines() took an option to preserve separators, then that information could be recovered when needed.

Unicode is careful to separate the designation of White_Space (which I’ll just call “whitespace”) as a programmatic concept from linguistic usage.

This is important, as usage varies dramatically across writing systems and even across styles within the same writing system. To further hammer this point, U+200B (Zero Width Space) is often recommended to explicitly separate words in a linguistic context, but does not have the derived property White_Space.
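That claim is easy to check against the stdlib's scalar properties:

let zwsp: Unicode.Scalar = "\u{200B}"  // ZERO WIDTH SPACE
print(zwsp.properties.isWhitespace)    // false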

Visibility is also something that shouldn’t be conflated with whitespace. They are related, in that whitespace is never recommended to be rendered as invisible, but there’s no requirement that the rendering be “empty”. Both of the following Strings have 3 Characters, an “a” and a “b” separated by a whitespace Character: Whitespace Visibility Example

(edit: there seems to be an issue with Discourse, hence the gist link rather than in-line code)
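For readers who can't follow the gist, here is an illustrative stand-in (my own example, possibly not the gist's exact one): U+1680 OGHAM SPACE MARK has White_Space == true, yet typically renders as a visible stroke.

let plain = "a b"          // a, U+0020, b
let ogham = "a\u{1680}b"   // a, U+1680 OGHAM SPACE MARK, b

print(plain.count, ogham.count)  // 3 3
print(ogham.unicodeScalars.map { $0.properties.isWhitespace })
// [false, true, false]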

TL;DR: “whitespace” is a crappy name for this concept, but it’s what we have.

This is a good suggestion and I like exposing more control. I don't think it's clearly a choice between any/all; it could be a choice between leading-scalar/all, as @allevato mentioned.

We still have to pick a default, though, which I think should be leading-scalar (or any) for parsing consistency.

Right, for full control, or to adhere to a spec defined in terms of a stream of Unicode scalars or code points, the lower-level views should be used. But people often write ad-hoc parsers. As much as I prefer rigid specifications, following the robustness principle is usually the least-harmful choice for these users.
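For example, a scalar-level trim via the lower-level view might look like this (a sketch; it reproduces answer #3's behavior, ignoring grapheme boundaries):

let input = "\u{20}\u{301}abc"

// Operate on the scalar view directly; grapheme boundaries play no role.
let trimmed = String(input.unicodeScalars.drop(while: { $0.properties.isWhitespace }))
print(trimmed.unicodeScalars.map { String($0.value, radix: 16) })
// ["301", "61", "62", "63"]: the combining mark survives on its own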

(As we continue to improve String performance, the overhead of grapheme-by-grapheme processing will hopefully decline to an acceptable level for almost all users)

As for most examples being parsery, this is natural as we're talking about reading the contents of a String under a programmatic interpretation. This distinction is not relevant to creating Strings, where the user is the one making the decision. (Also, I think the stdlib should provide "pad" or "center" methods in addition to more interpolation goodies and formatting control, but that's a different topic).


Fair enough, but in this case I'd say that nobody except Unicode scholars cares about this and it could be left as an implementation detail. I'm not sure how this is Evolution Discussion material.

But to not be just negative, for what it's worth, both Go and Python do #3 for your initial example:

Go

package main

import (
	"fmt"
	"strings"
)

func main() {
	str := "\u0020\u0301abc\n\u0301de\u0020\u0301"
	lines := strings.Split(strings.TrimSpace(str), "\n")
	for _, s := range lines {
		for _, r := range s { // ranges over runes (code points)
			fmt.Printf("%x ", r)
		}
		fmt.Println()
	}
}
301 61 62 63 
301 64 65 20 301 

Python

s = "\u0020\u0301abc\n\u0301de\u0020\u0301"  # avoid shadowing the built-in str
lines = s.strip().split('\n')
for line in lines:
    print([hex(ord(x)) for x in line])
['0x301', '0x61', '0x62', '0x63']
['0x301', '0x64', '0x65', '0x20', '0x301']

Given @nick.keets's information on the behavior of Go and Python, and your note above that "ordering the options from a principle of intent, #3 most closely adheres to intent", I believe #3 is actually the ideal answer.

I am not concerned about corner-case input resulting in corner-case output (i.e., degenerate graphemes). I would agree that common/non-corner-case/sane inputs to commonly used top-level APIs shouldn't result in degenerate graphemes, but I don't think we should be falling over ourselves to avoid such output for admitted corner cases. After all, as has been discussed here, whitespace is a programmatic concept and not a linguistic one, and if there's one classic principle of programmatic manipulation of data, it's GIGO: garbage in, garbage out.

That said, I agree with @allevato that #2 seems like an acceptable result for the reasons he outlined. That it works appropriately with the Leiden Convention for ancient papyrus is the cherry on top.

I am really not enamored of #1. Yes, "trim" is information-losing, but the information the user expects to lose is whitespace. It is therefore reasonable for a user to expect to be able to enumerate, a priori, all the information that could be lost in such an operation without inspecting the input string. If some of the scalars of a grapheme are not whitespace, dropping the entire grapheme loses information beyond what the user may expect, and that seems...bad.
