Ambiguity between Character properties and Foundation's CharacterSet

john-mueller · November 19, 2019, 3:48pm

I recently encountered counter-intuitive behavior regarding whitespace checking. I changed CharacterSet.whitespaces.contains(scalar) to scalar.isWhitespace, and immediately started failing unit tests.

An hour later, the reason was clear. Apple's documentation states the following about CharacterSet.whitespaces:

Returns a character set containing the characters in Unicode General Category Zs and CHARACTER TABULATION (U+0009).

On the other hand, the documentation of the .isWhitespace property on Unicode.Scalar.Properties (and, by extension, Character) states:

This property is true for scalars that are spaces, separator characters, and other control characters that should be treated as whitespace for the purposes of parsing text elements.

This property corresponds to the "White_Space" property in the Unicode Standard.

That basically boils down to .isWhitespace returning true for various line separation characters in addition to spacing, although it returns false for 'ZERO WIDTH SPACE' (U+200B), whereas CharacterSet.whitespaces returns true.
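For anyone who wants to see the difference directly, here is a minimal sketch that prints both answers for a few scalars. The sample scalars and variable names are my own choices for illustration. On Unicode.Scalar the property is reached through `properties`. No output is shown for CharacterSet, since its results may depend on the platform and Foundation version.

```swift
import Foundation

// Scalars chosen for illustration: LINE FEED, ZERO WIDTH SPACE, and SPACE.
let samples: [Unicode.Scalar] = ["\u{000A}", "\u{200B}", "\u{0020}"]

for scalar in samples {
    // Standard library: the Unicode White_Space binary property.
    let whiteSpaceProperty = scalar.properties.isWhitespace
    // Foundation: membership in the predefined whitespace set.
    let inFoundationSet = CharacterSet.whitespaces.contains(scalar)
    let codePoint = String(scalar.value, radix: 16, uppercase: true)
    print("U+\(codePoint): isWhitespace = \(whiteSpaceProperty), CharacterSet.whitespaces = \(inFoundationSet)")
}
```

Running something like this in a unit test target is a quick way to check which scalars your own code is sensitive to before swapping one API for the other.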

Note, both properties are working as defined. However, the difference in behavior is non-intuitive, and now I find myself second-guessing whether I know exactly what corner case behavior I might be missing when I use Character properties vs. CharacterSet. This seems like the opposite of what we would want for a clearly named property in the standard library. In other words, should I have to go digging into the Unicode spec manual to understand why switching from CharacterSet.whitespaces.contains(scalar) to scalar.isWhitespace breaks my code?

So what can be done? Either definition of whitespace is justifiable, but is there any chance of standardizing on one definition, given that either would be a breaking change (at least as far as I can tell)? I suppose this is also complicated by Foundation being an Apple thing, as opposed to a Swift thing. Barring an actual change, how could we make the fact that a difference exists between these two definitions more obvious to new users of Swift?

xwu · November 19, 2019, 4:18pm

We need to document the difference for both APIs; I think it’s fair to consider the lack of such caveats in the documentation to be a bug. Also, it would probably be important to mention in the documentation the degree to which the behavior of UnicodeScalar APIs could change with future versions of Unicode.

It’s not just that Foundation and the standard library are distinct and that one is a closed-source library owned by Apple. It’s also that CharacterSet and Unicode.Scalar were designed at very different times in the evolution of Unicode. It’s obviously problematic to change the behavior of longstanding APIs, but on the other hand, new APIs shouldn’t be constrained forever to reflect Unicode at the time Foundation adopted certain definitions.

QuinceyMorris · November 19, 2019, 5:30pm

Piling on to what @xwu said:

CharacterSet is really NSCharacterSet underneath, and is related to NSString rather than String. For that reason, it is "really" about what NSString counts as characters, namely UTF-16 code units. [Note: code units, not even code points aka scalars.] Swift does try to paper over the differences, but it can't always succeed.
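To make the code unit vs. scalar distinction concrete, here is a small sketch using only the standard library. The sample string is an arbitrary choice for illustration. It only shows how the two views of a String count differently and says nothing about how CharacterSet stores its members.

```swift
// "a" is one scalar and one UTF-16 code unit.
// U+1F600 is one scalar but needs a surrogate pair (two UTF-16 code units).
let text = "a\u{1F600}"

let scalarCount = text.unicodeScalars.count
let utf16Count = text.utf16.count

print("scalars: \(scalarCount), UTF-16 code units: \(utf16Count)")
// prints "scalars: 2, UTF-16 code units: 3"
```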

My advice is to avoid using CharacterSet in Swift at all, except in cases where you don't care that the outcome is somewhat inconsistent. Stick with the spiffy new[-ish] APIs introduced in SE-0211.

SDGGiesbrecht · November 19, 2019, 8:59pm

The documentation definitely deserves improvement.

CharacterSet and NSCharacterSet do semantically contain Unicode scalars, not UTF‐16 code units like NSString.

The differences derive instead from the fact that CharacterSet’s standard sets are the general category values (which are identical to scalar.properties.generalCategory), whereas isWhitespace refers to White_Space, a separate and unrelated binary property that intersects with several general categories.
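A short sketch of how the two properties can be inspected side by side with the standard library. The list of scalars and the tuple labels are picked for illustration only.

```swift
// Illustrative scalars; the names are the Unicode character names.
let examples: [(name: String, scalar: Unicode.Scalar)] = [
    ("LINE FEED", "\u{000A}"),
    ("SPACE", "\u{0020}"),
    ("LINE SEPARATOR", "\u{2028}"),
    ("ZERO WIDTH SPACE", "\u{200B}"),
]

for example in examples {
    let properties = example.scalar.properties
    // generalCategory is a single enumerated value per scalar;
    // isWhitespace is the independent White_Space binary property.
    print("\(example.name): generalCategory = \(properties.generalCategory), isWhitespace = \(properties.isWhitespace)")
}
```

LINE FEED (category Cc), SPACE (Zs), and LINE SEPARATOR (Zl) all have White_Space set even though they fall in three different general categories, while ZERO WIDTH SPACE (Cf) does not.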

It is not a matter of old vs new. The two mean completely different things (albeit with an unfortunate resemblance in the names which begs for confusion).

The Standard Library equivalent for...

CharacterSet.whitespaces.contains(scalar)

...is...

scalar.properties.generalCategory == .spaceSeparator || scalar == "\u{9}"

The extra || scalar == "\u{9}" is because CharacterSet treats all C0 controls according to their ASCII definitions instead of their Unicode ones (or rather lack thereof). U+0009 isn’t necessarily a character tabulation:

23.1 Control Codes

There are 65 code points set aside in the Unicode Standard for compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit controls 00₁₆ to 1F₁₆ (C0 controls), 7F₁₆ (delete), and 80₁₆ to 9F₁₆ (C1 controls), respectively. For example, the 8-bit legacy control code character tabulation (or tab) is the byte value 09₁₆; the Unicode Standard encodes the corresponding control code at U+0009.

The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.

In general, the use of control codes constitutes a higher-level protocol and is beyond the scope of the Unicode Standard. For example, the use of ISO/IEC 6429 control sequences for controlling bidirectional formatting would be a legitimate higher-level protocol layered on top of the plain text of the Unicode Standard. Higher-level protocols are not specified by the Unicode Standard; their existence cannot be assumed without a separate agreement between the parties interchanging such data.

―Section 23: Special Areas and Format Characters, The Unicode Standard 12.1.0
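The Standard Library expression given above can be wrapped in a small helper, sketched here. The function name is hypothetical and is not part of the standard library or Foundation. It follows the documented definition of CharacterSet.whitespaces (general category Zs plus U+0009) and has not been checked against Foundation for every scalar.

```swift
// Hypothetical helper; the name is made up for this example.
func isSpaceSeparatorOrTab(_ scalar: Unicode.Scalar) -> Bool {
    // General category Zs, plus U+0009, which Unicode itself classifies
    // only as a control code (Cc).
    return scalar.properties.generalCategory == .spaceSeparator || scalar == "\u{9}"
}

let tab: Unicode.Scalar = "\u{9}"
let lineFeed: Unicode.Scalar = "\u{A}"
let noBreakSpace: Unicode.Scalar = "\u{A0}"

print(isSpaceSeparatorOrTab(tab))          // true
print(isSpaceSeparatorOrTab(lineFeed))     // false
print(isSpaceSeparatorOrTab(noBreakSpace)) // true
```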

mattrips · November 21, 2019, 5:04am

Great observations. Thank you.

This touches on a larger, unfortunate legacy: our character sets include things that aren’t characters or parts of characters. Line breaks, paragraph breaks, tabs, control characters, spaces, etc. Those things mostly contain information about formatting.

We need to evolve toward systems that separate information about characters/glyphs from information about formatting. In the context of Swift, we cannot change Unicode, but we can and should develop a better way of representing formatted text via logical data structures enveloped around Unicode characters.