I recently encountered counter-intuitive behavior regarding whitespace checking. I changed CharacterSet.whitespaces.contains(scalar) to scalar.isWhitespace, and my unit tests immediately started failing.
An hour later, the reason was clear. Apple's documentation states the following about CharacterSet.whitespaces:
Returns a character set containing the characters in Unicode General Category Zs and CHARACTER TABULATION (U+0009).
On the other hand, the documentation for the isWhitespace property of Unicode.Scalar.Properties (which Character.isWhitespace builds on) states:
This property is true for scalars that are spaces, separator characters, and other control characters that should be treated as whitespace for the purposes of parsing text elements. This property corresponds to the "White_Space" property in the Unicode Standard.
In practice, .isWhitespace returns true for line-separating characters such as LINE FEED (U+000A) in addition to spaces, but returns false for ZERO WIDTH SPACE (U+200B), whereas CharacterSet.whitespaces excludes the line separators and returns true for U+200B.
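For anyone who wants to check this on their own platform, here is a minimal sketch that prints both answers side by side for a few scalars. The sample list and its labels are my own choice for illustration, and I'm assuming a toolchain where Unicode.Scalar.Properties is available (Swift 5 or later). I have deliberately left out expected output, since the U+200B result in particular depends on the Foundation implementation you are running.

```swift
import Foundation

// Scalars picked to expose the difference; the labels are for display only.
let samples: [(name: String, scalar: Unicode.Scalar)] = [
    ("SPACE (U+0020)", "\u{0020}"),
    ("CHARACTER TABULATION (U+0009)", "\u{0009}"),
    ("LINE FEED (U+000A)", "\u{000A}"),
    ("ZERO WIDTH SPACE (U+200B)", "\u{200B}"),
]

for (name, scalar) in samples {
    // Foundation's definition: membership in CharacterSet.whitespaces.
    let inSet = CharacterSet.whitespaces.contains(scalar)
    // Standard library's definition: the Unicode White_Space property.
    let whiteSpace = scalar.properties.isWhitespace
    print("\(name): CharacterSet.whitespaces = \(inSet), White_Space = \(whiteSpace)")
}
```

If newlines are the only mismatch that matters for your code, CharacterSet.whitespacesAndNewlines is probably the closer Foundation counterpart to .isWhitespace, though the two still are not guaranteed to agree on every scalar.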
To be clear, both properties are working as defined. However, the difference in behavior is non-intuitive, and I now find myself second-guessing which corner cases I might be missing when I use Character properties instead of CharacterSet. This seems like the opposite of what we would want from a clearly named property in the standard library. In other words, should I have to go digging through the Unicode specification to understand why switching from CharacterSet.whitespaces.contains(scalar) to scalar.isWhitespace breaks my code?
So what can be done? Either definition of whitespace is justifiable, but is there any chance of standardizing on one, given that either choice would be a breaking change (at least as far as I can tell)? I suppose this is further complicated by Foundation being an Apple thing, as opposed to a Swift thing. Barring an actual change, how could we make the difference between these two definitions more obvious to new users of Swift?