Corner-cases in `Character` classification of whitespace

Yes. In papyrological editions the combining underdot (U+0323 COMBINING DOT BELOW) is used to denote that the reading of a character on the papyrus is not certain: the surviving traces of ink most likely resemble that letter. A space with such a dot below means that there are traces of ink, and thus a character, but that it is far too uncertain to tell which one.

In this situation whitespace + combining dot below has the exact opposite meaning of whitespace. It is not the absence of a character but the assertion that there is one.
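For the Swift-minded, such a pair already forms a single extended grapheme cluster, i.e. one Character, which is exactly why its classification is at issue here:

// U+0020 SPACE followed by U+0323 COMBINING DOT BELOW is one Character
// built from two Unicode scalars.
let uncertainLetter = "\u{0020}\u{0323}"
uncertainLetter.count                 // 1
uncertainLetter.unicodeScalars.count  // 2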

Therefore I'm strongly opposed to treating such cases as whitespace wholesale.


Out of curiosity, which space do they use for this purpose and do you have a link?

Does this help?

https://www.fileformat.info/info/unicode/category/Zs/list.htm

Note that this is not an exhaustive list (tab is not included, for example).


No, I am intimately familiar with that list ;-) I'm asking about encoding conventions for representing ancient writing originally done on papyrus instead of computers.

When I did papyrology at university I just hit the spacebar. :smiley:
Sorry, that was so long ago, I don't have any typographical specifications at hand except that this is part of the Leiden Conventions.


A good question but I think it's the wrong one to drive this discussion.

Character-set localization, perhaps including, as Calendars do, some edge cases for weird archaeological stuff.

To be clear, this was prefaced with "Out of curiosity", meaning that I'm interested in understanding obscure use cases (and reasoning through their prevalence) even if they don't or shouldn't influence the final decision.

Could you elaborate more on why these graphemes would appear in any of these and why the intention is clearly for them to not be whitespace?

Either there is a standard for whitespace (Z*, and a few others) or there are localized use cases that include such obscurities as archaeological annotations. If these non-standard use cases exist and are to be considered, then there has to be a way to "localize" (so to speak) the meaning of whitespace for a given use.

I think any solution is going to have to allow for malformed data and treat it in an expected way. To quote someone wise, “String’s primary presentation [is] as a collection of graphemes”.


Ah, that makes sense, thank you for clarifying. As far as standards go, I'd recommend going with the White_Space derived property rather than a general-category-based definition (unless there are compelling reasons otherwise). This is what the Unicode.Scalar.Properties proposal uses.
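As a sketch of the difference, using the Unicode.Scalar.Properties API from that proposal (assuming it lands as pitched): tab is a good test case, since it carries White_Space but is not in the Zs general category.

let tab: Unicode.Scalar = "\t"
tab.properties.isWhitespace                        // true: White_Space derived property
tab.properties.generalCategory == .spaceSeparator  // false: U+0009 is .control, not Zs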

Is there room for allowing multiple strategies via an enum?


I don’t think we should prioritize inorganic corner cases when designing the overall model, though we should think through behavior in all situations. For these corner cases, we should prioritize consistency with subsequent operations and user intention.

My (weakly held) opinion is that #1 is the least-bad result, even though #2 feels more pedantically correct (the best kind of correct). I think #1 is the behavior we should provide, even if we don’t formally specify it at this time in the docs.

String.lines() will likely produce a to-be-designed LazySplitCollection, with an option to control whether the separator is preserved or not. String.trimmed() likewise is information-losing on purpose, but the trimmed characters are still accessible if it returns a Substring.
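A minimal sketch of the non-lossiness argument; trimmedSketch is a hypothetical stand-in for the pitched trimmed(), leaning on a Character-level isWhitespace as discussed in this thread:

extension String {
    // Drop leading/trailing whitespace Characters, but return a
    // Substring so the original String stays reachable via .base.
    func trimmedSketch() -> Substring {
        guard let start = firstIndex(where: { !$0.isWhitespace }) else {
            return self[endIndex...]  // all whitespace: empty Substring
        }
        let end = lastIndex(where: { !$0.isWhitespace })!
        return self[start...end]
    }
}

let result = "  abc  ".trimmedSketch()  // "abc"
result.base                             // "  abc  ": nothing is unrecoverable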

If a String contains a “\n\u{301}” somewhere inside it (newline with combining accent), I think it is much more surprising to not do line-splitting around that grapheme. Producing a separator of “\n\u{301}” is the more consistent behavior and less surprising than not splitting on it.

In this sense, “is-whitespace” might be more like “has-programmatic-whitespace”. Given these graphemes are atypical corner cases (AFAICT), I think it’s less confusing to just think of it as “is-whitespace”.

Regarding solution #3 and “degenerate” graphemes.

Degenerate graphemes, such as one that contains only a combining scalar, violate common Collection intuition:

"abcde".count // 5
"\u{0301}".count // 1
let str = "abcde" + "\u{0301}" // "abcdé"
str.count // 5

String needs to accommodate the existence of degenerate graphemes, and they can always be formed by operating on the Unicode scalar or code unit views. But we should try to avoid forming them in common-use, top-level String APIs.
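For example, nothing stops a user from forming one through the scalar view today:

// Prepending a lone combining acute accent through the scalar view
// yields a degenerate grapheme at the front of the String.
var scalars = "abc".unicodeScalars
scalars.insert("\u{0301}", at: scalars.startIndex)
let rebuilt = String(scalars)
rebuilt.count  // 4: "´" (degenerate), "a", "b", "c"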

Regarding whitespace and visibility or rendering

AKA “If it looks like whitespace and quacks like whitespace…”.

We need to be careful not to conflate programmatic usage of whitespace with visibility and rendering. There are examples within Unicode of whitespace scalars that have a visible representation, such as U+1680 OGHAM SPACE MARK. String also can't really answer all such questions fully or accurately, and it's strongly recommended to consult your platform (e.g. ask CoreText).
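Concretely, again assuming the pitched scalar properties:

let ogham = Unicode.Scalar(0x1680)!  // OGHAM SPACE MARK
ogham.properties.isWhitespace        // true, despite rendering visibly
ogham.properties.name                // "OGHAM SPACE MARK"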

Other considerations

@torquato mentioned that the Leiden Conventions for representing texts derived from ancient papyrus manuscripts may use a mark underneath empty space to reflect a missing or unknown character. There are different conventions for representing this electronically, which generally recommend using tags. However, if one chooses instead to encode this as a whitespace scalar followed by a combining under-dot, then this usage would fall through the cracks and such characters could be dropped from a trimmed String. My recommendation is to not let this scenario guide the final decision.

Thoughts? Are there other interesting scenarios where user intention might deviate?


Is it right to think of this as any (a Character is whitespace if any of its constituent Unicode scalars are whitespace) versus all (a Character is whitespace if all of its constituent Unicode scalars are whitespace) or are there more complicated situations?
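To make the question concrete, here is a hypothetical sketch of the two rules over a Character's scalars (the names are mine, and the Character.unicodeScalars view is assumed):

extension Character {
    // "any" rule: whitespace if any constituent scalar is White_Space.
    var isWhitespaceAny: Bool {
        unicodeScalars.contains { $0.properties.isWhitespace }
    }
    // "all" rule: whitespace only if every constituent scalar is.
    var isWhitespaceAll: Bool {
        unicodeScalars.allSatisfy { $0.properties.isWhitespace }
    }
}

let spaceAcute: Character = "\u{0020}\u{0301}"  // space + combining acute
spaceAcute.isWhitespaceAny  // true
spaceAcute.isWhitespaceAll  // false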

The option chosen is not that important to me (though I don't like #3 at all), and I don't have much experience in this area, but I have a couple of thoughts.

  • Your examples are mostly light parsing activities. Perhaps parsing would usually be done more correctly on one of the scalar views, but most people will reach for String first. Is this the main use case for deciding if a Character is whitespace? Are there any others?
  • If you have a visibility/rendering mindset then you're right that things get quite complicated. This became clear to me from your example with a newline instead of a space. In the places I tested, "test\n\u{301}this" is rendered over two lines, with the dangling combining acute accent at the start of the second line. So if I was approaching this from the rendering perspective, I might conclude something complex like “for trimmed() use the all rule, but for lines() use the any rule” or have a different rule for horizontal spacing vs newlines, or something.

In conclusion, I think you've convinced me about #1, even though #2 seemed more natural when just thinking about the trimmed() case.


Of course there is. Here is the list for Go:

0x0009
0x000A
0x000B
0x000C
0x000D
0x0020
0x0085
0x1680
0x2000
0x2001
0x2002
0x2003
0x2004
0x2005
0x2006
0x2007
0x2008
0x2009
0x200A
0x2028
0x2029
0x202f
0x205f
0x3000

source: https://golang.org/src/unicode/tables.go#L6608

And here is for Python:

0x0009
0x000A
0x000B
0x000C
0x000D
0x001C
0x001D
0x001E
0x001F
0x0020
0x0085
0x00A0
0x1680
0x2000
0x2001
0x2002
0x2003
0x2004
0x2005
0x2006
0x2007
0x2008
0x2009
0x200A
0x2028
0x2029
0x202F
0x205F
0x3000

source: https://github.com/python/cpython/blob/279a96206f3118a482d10826a1e32b272db4505d/Objects/unicodetype_db.h#L5774

We don't need to invent whitespace from first principles...

I'm not sure how relevant this is, but NSLinguisticTagger enumerates the following:

Substrings          Token Types   Lexical Classes
"\u{020}\u{301}"    Word          OtherWord
"abc"               Word          OtherWord
"\n"                Whitespace    ParagraphBreak
"\u{301}"           Word          OtherWord
"de"                Word          OtherWord
"\u{020}\u{301}"    Word          OtherWord

That is not what this thread is about. You provided arbitrary lists of scalars from other languages. I recommend using Unicode's arbitrary list of scalars. Nothing is being invented here as far as scalars are concerned.

This thread is talking about graphemes, specifically behavior surrounding corner cases involving combining scalars following whitespace scalars.

edit: I updated the title to help avoid this kind of misunderstanding.


I think I agree with most of your analysis in the original thread:

Answer #1 loses information about the scalars that were combined with the spaces. Once the user has trimmed them, they can't get them back. The same could be said for traditional whitespace itself, because the user would lose information like "is this an ASCII space, or a quad space, or...", but in this case, a combining scalar on its own is not a space as defined by Unicode, so the result is that a trim function would remove non-whitespace scalars from the string. That feels wrong.

Answer #3 breaks graphemes, which seems totally incorrect to me. If the trim function is being implemented in terms of Character, then each Character should be treated as an indivisible unit.

So, answer #2 seems like it best balances what users would expect. Attaching a combining scalar to a whitespace character feels like "intent", that they want it to be treated as something potentially other than whitespace. But that still may just be intuition on my part.

Now, just to throw a possible wrench in: my earlier post said "a Character is whitespace if all of its Unicode scalars have White_Space == true". We should note that this doesn't hold for things like alphabetic characters, where we only want to check the base scalar—"á" is alphabetic even though "´" isn't. So we don't have a consistent rule here, and I could see someone arguing that combining scalars should be ignored for whitespace in a similar fashion. But I would still prefer the more "intuitive" solution of #2, which doesn't treat something that visually doesn't look like whitespace as whitespace.
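That "ignore the combining scalars" alternative would amount to a leading-scalar rule, something like this (hypothetical sketch, with an assumed Character.unicodeScalars view):

extension Character {
    // Leading-scalar rule: classify by the base scalar alone, the same
    // way "á" counts as alphabetic despite containing U+0301.
    var isWhitespaceLeadingScalar: Bool {
        unicodeScalars.first!.properties.isWhitespace  // a Character is never empty
    }
}

("\u{0020}\u{0301}" as Character).isWhitespaceLeadingScalar  // true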


Instead of prescribing what is and isn't whitespace by deciding on what trimming whitespace is meant to do, can we not sidestep this entire issue by offering the developer a choice?

enum WhitespaceOptions {
    /// Considers a `Character` to be whitespace if all of its underlying
    /// Unicode scalars are whitespace.
    case hasAllWhitespace

    /// Considers a `Character` to be whitespace if any of its underlying
    /// Unicode scalars are whitespace.
    case hasAnyWhitespace
}

extension String {
    func trim(_ options: TrimOptions, whitespaceOptions: WhitespaceOptions = .hasAllWhitespace) -> String { ... }
}

As I'd previously brought up in the Character and String properties pitch, whitespace has different meanings in different applications ("does not appear to draw anything" vs. "separates values" is just one distinction), and instead of trying to decide what whitespace means across the board, I think allowing developers to meaningfully decide for themselves, with the semantics they need, is better for everyone.


trim() and lines() are definitely both information-losing on purpose. If trimmed() returned a Substring, or if lines() took an option to preserve separators, then that information could be recovered if needed.

Unicode is careful to separate the designation of White_Space (which I’ll just call “whitespace”) as a programmatic concept from linguistic usage.

This is important, as usage varies dramatically across writing systems and even across styles within the same writing system. To further hammer this point, U+200B (Zero Width Space) is often recommended to explicitly separate words in a linguistic context, but does not have the derived property White_Space.
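In code, that looks like this (assuming the pitched scalar properties):

let zwsp = Unicode.Scalar(0x200B)!  // ZERO WIDTH SPACE
zwsp.properties.isWhitespace        // false: no White_Space derived property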

Visibility is also something that shouldn’t be conflated with whitespace. They are related, in that whitespace is never recommended to be rendered as invisible. However, there’s no requirement of “emptiness” of the rendering. Both of the following two Strings have 3 Characters, an “a” and a “b” separated by a whitespace Character: Whitespace Visibility Example

(edit: there seems to be an issue with Discourse, hence the gist link rather than in-line code)

TL;DR: “whitespace” is a crappy name for this concept, but it’s what we have.

This is a good suggestion and I like exposing more control. I don't think it's clearly a choice between any/all; it could be a choice between leading-scalar/all, as @allevato mentioned.

We still have to pick a default, though, which I think should be leading-scalar (or any) for parsing consistency.

Right, for full control, and for adhering to a spec concerning a stream of Unicode scalars or code points, the lower-level views should be used. But people often write ad-hoc parsers. As much as I prefer rigid specifications, following the robustness principle is usually the least-harmful choice for these users.
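For instance, the ad-hoc, grapheme-level parsing in question is often a one-liner over Characters, and whichever rule we pick is what such code silently inherits (split(whereSeparator:) is existing stdlib API; the Character-level isWhitespace is the property under discussion):

let fields = "name\tvalue line".split(whereSeparator: { $0.isWhitespace })
// ["name", "value", "line"]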

(As we continue to improve String performance, the overhead of grapheme-by-grapheme processing will hopefully decline to an acceptable level for almost all users)

As for most examples being parsery, this is natural as we're talking about reading the contents of a String under a programmatic interpretation. This distinction is not relevant to creating Strings, where the user is the one making the decision. (Also, I think the stdlib should provide "pad" or "center" methods in addition to more interpolation goodies and formatting control, but that's a different topic).


Fair enough, but in this case I'd say that nobody except Unicode scholars cares about this and it could be left as an implementation detail. I'm not sure how this is Evolution Discussion material.

But to not be just negative, for what it's worth, both Go and Python do #3 for your initial example:

Go

package main

import (
	"fmt"
	"strings"
)
func main() {
	str := "\u0020\u0301abc\n\u0301de\u0020\u0301"
	lines := strings.Split(strings.TrimSpace(str), "\n")
	for _, s := range lines {
		for _, r := range s {
			fmt.Printf("%x ", r)
		}
		fmt.Println()
	}
}
301 61 62 63 
301 64 65 20 301 

Python

s = "\u0020\u0301abc\n\u0301de\u0020\u0301"
lines = s.strip().split('\n')
for line in lines:
    print([hex(ord(x)) for x in line])
['0x301', '0x61', '0x62', '0x63']
['0x301', '0x64', '0x65', '0x20', '0x301']

Given @nick.keets's information on the behavior of Go and Python, and your note above that "ordering the options from a principle of intent, #3 most closely adheres to intent," I actually believe #3 is the ideal answer.

I am not concerned about corner-case input resulting in corner-case output (i.e., degenerate graphemes). I would agree that common, non-corner-case, sane inputs to common-use, top-level APIs shouldn't result in degenerate graphemes, but I don't think we should be falling over ourselves to avoid such output for admitted corner cases. After all, as has been discussed here, whitespace is a programmatic concept and not a linguistic one, and if there's one classic principle of programmatic manipulation of data, it's GIGO (garbage in, garbage out).

That said, I agree with @allevato that #2 seems like an acceptable result for the reasons he outlined. That it works appropriately with the Leiden Convention for ancient papyrus is the cherry on top.

I am really not enamored of #1. Yes, "trim" is information-losing, but the information that the user expects to lose is whitespace. It is reasonable, therefore, for a user to expect to be able to enumerate (a priori) all possible information that could be lost in such an operation, without inspecting the input string. If some of the scalars of a grapheme are not whitespace, dropping the entire grapheme loses information beyond what the user may expect, and that seems...bad.

2 Likes