The documentation definitely deserves improvement.
CharacterSet and NSCharacterSet do semantically contain Unicode scalars, not UTF‐16 code units like NSString.
The differences derive instead from the fact that CharacterSet’s standard sets are the general category values (identical to scalar.properties.generalCategory), whereas isWhitespace refers to White_Space, a separate and unrelated binary property that intersects with several general categories.
It is not a matter of old vs. new. The two mean completely different things (albeit with an unfortunate resemblance in the names that invites confusion).
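To make the distinction concrete, here is a minimal sketch using only the Standard Library. The helper name isSpaceSeparator and the choice of sample scalars are mine, purely for illustration.

```swift
// Hypothetical helper: true when the scalar's general category is Zs (Space_Separator).
func isSpaceSeparator(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.generalCategory == .spaceSeparator
}

// A few scalars chosen to show where the two properties agree and disagree.
let samples: [Unicode.Scalar] = ["\u{20}", "\u{9}", "\u{A}", "\u{A0}", "\u{2028}", "\u{200B}"]

for scalar in samples {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    // Print the general-category check next to the White_Space binary property.
    print("U+\(hex)", "Zs:", isSpaceSeparator(scalar),
          "White_Space:", scalar.properties.isWhitespace)
}
```

U+0009 and U+2028, for instance, have White_Space but are not in general category Zs (they are Cc and Zl respectively), so the two checks disagree on them.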
The Standard Library equivalent for...
CharacterSet.whitespaces.contains(scalar)
...is...
scalar.properties.generalCategory == .spaceSeparator || scalar == "\u{9}"
The extra || scalar == "\u{9}" is there because CharacterSet treats all C0 controls according to their ASCII definitions instead of their Unicode ones (or rather lack thereof). U+0009 isn’t necessarily a character tabulation:
23.1 Control Codes
There are 65 code points set aside in the Unicode Standard for compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit controls 00₁₆ to 1F₁₆ (C0 controls), 7F₁₆ (delete), and 80₁₆ to 9F₁₆ (C1 controls), respectively. For example, the 8-bit legacy control code character tabulation (or tab) is the byte value 09₁₆; the Unicode Standard encodes the corresponding control code at U+0009.
The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.
In general, the use of control codes constitutes a higher-level protocol and is beyond the scope of the Unicode Standard. For example, the use of ISO/IEC 6429 control sequences for controlling bidirectional formatting would be a legitimate higher-level protocol layered on top of the plain text of the Unicode Standard. Higher-level protocols are not specified by the Unicode Standard; their existence cannot be assumed without a separate agreement between the parties interchanging such data.
―Section 23: Special Areas and Format Characters, The Unicode Standard 12.1.0
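The equivalence given earlier can be spot-checked with a rough sketch like the following. The helper name isFoundationWhitespace and the decision to scan only the Basic Multilingual Plane are my own assumptions for illustration. I deliberately make no claim about the printed result, since Foundation and the Standard Library are not guaranteed to be built against the same Unicode version.

```swift
import Foundation

// Hypothetical helper wrapping the Standard Library expression from the post.
func isFoundationWhitespace(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.generalCategory == .spaceSeparator || scalar == "\u{9}"
}

// Compare against CharacterSet.whitespaces across the Basic Multilingual Plane
// and collect every scalar on which the two disagree.
var mismatches: [Unicode.Scalar] = []
for value in UInt32(0)...0xFFFF {
    // The failable initializer returns nil for surrogate code points; skip those.
    guard let scalar = Unicode.Scalar(value) else { continue }
    if CharacterSet.whitespaces.contains(scalar) != isFoundationWhitespace(scalar) {
        mismatches.append(scalar)
    }
}

// Any entries here would point to a Unicode version difference between the two libraries.
print(mismatches.map { "U+" + String($0.value, radix: 16, uppercase: true) })
```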