Swift's support for Unicode-aware operations on Strings is quite nice compared to many other modern programming languages, but one area where we're currently lacking is the ability to query properties of Unicode.Scalar and Character values (e.g., to classify them).
As one possible example of how these could be implemented, I'll point to the UnicodeScalar+BooleanProperties extension in my icu-swift project. I make no claim that this is the ideal API to be adopted by the standard library; rather, I just want it to serve as a starting point for discussion.
Based on previous discussions here, I think many of us agree that we need to support some subset of these capabilities (at least) in the standard library. The following is a bit of a stream-of-consciousness of what I think the broad goals/questions are. If I've left anything out, please feel free to add it!
Which properties to support?
The Unicode standard defines a large number of properties that range from "useful in everyday computing" to "typically used only in specialized text processing algorithms." There are also many different types of properties: Boolean-valued, string-valued, numeric-valued, enumerated-type-valued, and so forth.
A design proposal to improve the state of Unicode properties in the standard library will need to explicitly state which properties will be supported (and by elimination, which will be omitted, if any). I won't try to be exhaustive here in my first message, but to cherry-pick some examples, properties like Uppercase, Lowercase, and White_Space are obvious candidates for inclusion. Something like Logical_Order_Exception that comes up less frequently on its own may not meet the bar that we decide to set.
How to expose the properties?
How should we design the APIs that expose these properties? This may be motivated by how many properties we decide to support above. If the number we support is small, then individual properties for each probably make sense (isLowercase, isUppercase, isWhitespace, etc.). But if we end up supporting something closer to the full set of properties, would that bloat the API? Would we want to provide something enum-based like hasProperty(.lowercase) instead? That would reduce bloat, but it would also be considerably less discoverable and less familiar than the is* APIs provided by Java, C, and others.
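To make the two shapes concrete, here's a rough sketch of both side by side. Everything in it is hypothetical: the BooleanProperty enum, hasProperty(_:), and the is* properties are names I made up for illustration, and the classification is a toy ASCII-only stand-in for real Unicode data.

```swift
// Hypothetical sketch only: none of these declarations are standard library API.
// The lookups below cover ASCII only; a real version would consult Unicode data.
enum BooleanProperty {
    case lowercase, uppercase, whitespace
}

extension Unicode.Scalar {
    // Shape 2: a single enum-driven entry point.
    func hasProperty(_ property: BooleanProperty) -> Bool {
        switch property {
        case .lowercase:
            return (0x61...0x7A).contains(value)      // a...z
        case .uppercase:
            return (0x41...0x5A).contains(value)      // A...Z
        case .whitespace:
            return value == 0x20 || (0x09...0x0D).contains(value)
        }
    }

    // Shape 1: one computed property per Unicode property.
    var isLowercase: Bool { return hasProperty(.lowercase) }
    var isUppercase: Bool { return hasProperty(.uppercase) }
    var isWhitespace: Bool { return hasProperty(.whitespace) }
}

let scalar: Unicode.Scalar = "q"
print(scalar.isLowercase)              // true
print(scalar.hasProperty(.uppercase))  // false
```

The two aren't mutually exclusive; one option would be to offer is* properties for the most common cases and fall back to the enum for the long tail.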
We also have to consider the non-Boolean properties; one example, Numeric_Value, may be more appropriate to expose as a failable initializer on Int. (Or even on Double; consider that the Numeric_Value of U+00BD VULGAR FRACTION ONE HALF is 0.5.)
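Here's roughly what the failable-initializer idea could look like. The numericValueOf: label and the toyNumericValue(of:) helper are hypothetical, and the table only covers the ASCII digits and three vulgar fractions; it's just enough to show why Int and Double would behave differently.

```swift
// Hypothetical sketch: a tiny hard-coded table standing in for Numeric_Value.
private func toyNumericValue(of scalar: Unicode.Scalar) -> Double? {
    switch scalar.value {
    case 0x30...0x39: return Double(scalar.value - 0x30)  // ASCII digits 0-9
    case 0xBC: return 0.25  // U+00BC VULGAR FRACTION ONE QUARTER
    case 0xBD: return 0.5   // U+00BD VULGAR FRACTION ONE HALF
    case 0xBE: return 0.75  // U+00BE VULGAR FRACTION THREE QUARTERS
    default: return nil     // no numeric value in this toy table
    }
}

extension Double {
    // Fails only when the scalar has no numeric value at all.
    init?(numericValueOf scalar: Unicode.Scalar) {
        guard let value = toyNumericValue(of: scalar) else { return nil }
        self = value
    }
}

extension Int {
    // Also fails when the numeric value is not a whole number.
    init?(numericValueOf scalar: Unicode.Scalar) {
        guard let value = toyNumericValue(of: scalar),
              let exact = Int(exactly: value) else { return nil }
        self = exact
    }
}

print(Double(numericValueOf: "\u{BD}") as Any)  // Optional(0.5)
print(Int(numericValueOf: "\u{BD}") as Any)     // nil
print(Int(numericValueOf: "7") as Any)          // Optional(7)
```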
What applies to Unicode.Scalar and what applies to Character?
By definition, the Unicode standard defines these properties on code points, which Swift represents using Unicode.Scalar (except that Unicode.Scalar has a "hole" where the surrogate code points live, but I don't believe that will be hugely relevant to our discussion here). Because of that, in my opinion, any properties that we expose should at a minimum be supported on Unicode.Scalar.
So the question then becomes, what should we also support on Character? Characters are the "default" view on String and will be what most users iterate over and operate on unless their use case has performance requirements that make it preferable to use UnicodeScalarView instead of paying the cost of calculating grapheme cluster breaks.
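For anyone who hasn't run into the difference between the two views, a quick illustration using only existing String API:

```swift
// "cafe" followed by U+0301 COMBINING ACUTE ACCENT.
let word = "cafe\u{301}"

// The default view groups "e" + U+0301 into a single Character.
print(word.count)                 // 4

// The scalar view exposes every scalar individually.
print(word.unicodeScalars.count)  // 5
```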
When we talk about Characters (i.e., grapheme clusters of one or more scalars), I think we can classify the properties in one of three ways:
- Properties that are well-defined and are derivable solely based on the properties of the Character's constituent Unicode.Scalars. For example, consider White_Space. It's reasonable that a user would want to ask if a Character is whitespace or not. If we have a "weird" character, like a space followed by a combining accent mark, I think it's sensible to say "no, that's not whitespace". In other words, the Character has property X iff all of its scalars have property X.

- Properties that have sensible definitions for Characters but which are not derived solely based on the constituent Unicode.Scalars. For example, we expect users will want to ask "is a character lowercase", but you can't say a Character is lowercase if all of its scalars have Lowercase == true, because if you have "a" + a combining accent mark, the combining accent has Lowercase == false. In specific cases like this (upper/lower/titlecase), Unicode defines specific algorithms for strings that we can apply to Character: a Character c is uppercase if toUpper(c) == c, and likewise for lower- and titlecase.

- Properties that make no sense to support on Character. One example that comes to mind is Variation_Selector. AFAIK, variation selectors always combine with a preceding scalar unless they are in a cluster by themselves, so you don't really gain anything by asking "is this Character a variation selector?" You should just drop down to the scalars to ask.
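A rough sketch of the first two categories, to show how differently they'd be implemented. All of the names here (toyIsWhitespace, allScalarsAreWhitespace, isUnchangedByLowercasing) are made up for illustration, and the whitespace check is a toy ASCII-only stand-in for White_Space.

```swift
extension Unicode.Scalar {
    // Hypothetical, ASCII-only stand-in for the White_Space property.
    var toyIsWhitespace: Bool {
        return value == 0x20 || (0x09...0x0D).contains(value)
    }
}

extension Character {
    // Category 1: the Character has the property iff all of its scalars do.
    var allScalarsAreWhitespace: Bool {
        return String(self).unicodeScalars.allSatisfy { $0.toyIsWhitespace }
    }

    // Category 2: apply the string-level rule (lowercasing leaves it unchanged).
    // Note that uncased characters such as digits also pass this check, so a
    // real design would need to decide how to treat them.
    var isUnchangedByLowercasing: Bool {
        let string = String(self)
        return string.lowercased() == string
    }
}

let plainSpace = Character(" ")
let accentedSpace = Character(" \u{301}")  // space + combining acute accent
let accentedA = Character("a\u{301}")      // "a" + combining acute accent

print(plainSpace.allScalarsAreWhitespace)     // true
print(accentedSpace.allScalarsAreWhitespace)  // false
print(accentedA.isUnchangedByLowercasing)     // true
```

For the third category there's nothing to write at the Character level; callers would iterate the scalars and ask each one directly.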
Thoughts?