Swift’s support for Unicode-aware operations on Strings is quite nice compared to many other modern programming languages, but one area where we’re currently lacking is in the ability to query properties of Character values (e.g., to classify them).
As one possible example of how these could be implemented, I’ll point to the UnicodeScalar+BooleanProperties extension in my icu-swift project. I make no claim that this is the ideal API to be adopted by the standard library; rather, I just want it to serve as a starting point for discussion.
Based on previous discussions here, I think many of us agree that we need to support some subset of these capabilities (at least) in the standard library. The following is a bit of a stream-of-consciousness of what I think the broad goals/questions are. If I’ve left anything out, please feel free to add it!
Which properties to support?
The Unicode standard defines a large number of properties that range from “useful in everyday computing” to “typically used only in specialized text processing algorithms.” There are also many different types of properties: Boolean-valued, string-valued, numeric-valued, enumerated-type-valued, and so forth.
A design proposal to improve the state of Unicode properties in the standard library will need to explicitly state which properties will be supported (and by elimination, which will be omitted, if any). I won’t try to be exhaustive here in my first message, but to cherry-pick some examples, properties like White_Space are obvious candidates for inclusion. Something like Logical_Order_Exception, which comes up less frequently on its own, may not meet the bar that we decide to set.
How to expose the properties?
How should we design the APIs that expose these properties? This may be motivated by how many properties we decide to support above. If the number we support is small, then an individual property for each probably makes sense (isWhitespace, etc.). But if we end up supporting something closer to the full set of properties, would that bloat the API? Would we want to provide something enum-based like hasProperty(.lowercase) instead? That would reduce bloat, but it would also be considerably less discoverable and less familiar compared to the is* APIs provided by Java, C, and others.
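To make the trade-off concrete, here is a minimal sketch of both shapes side by side. Every name in it (BooleanProperty, hasProperty(_:), isWhitespace, isASCIIHexDigit) is hypothetical rather than an existing or proposed API, and the lookup tables are hard-coded only to keep the example self-contained; a real implementation would consult the Unicode Character Database instead.

```swift
// Hypothetical enum-based API: one case per supported Boolean property.
enum BooleanProperty {
    case whiteSpace
    case asciiHexDigit
}

// Hard-coded code point ranges, for illustration only (assumption: these
// match the White_Space and ASCII_Hex_Digit properties in the UCD).
let whiteSpaceRanges: [ClosedRange<UInt32>] = [
    0x0009...0x000D, 0x0020...0x0020, 0x0085...0x0085, 0x00A0...0x00A0,
    0x1680...0x1680, 0x2000...0x200A, 0x2028...0x2029, 0x202F...0x202F,
    0x205F...0x205F, 0x3000...0x3000,
]
let asciiHexDigitRanges: [ClosedRange<UInt32>] = [
    0x30...0x39, 0x41...0x46, 0x61...0x66,
]

extension Unicode.Scalar {
    // Shape 1: a single entry point keyed by an enum.
    func hasProperty(_ property: BooleanProperty) -> Bool {
        let ranges: [ClosedRange<UInt32>]
        switch property {
        case .whiteSpace: ranges = whiteSpaceRanges
        case .asciiHexDigit: ranges = asciiHexDigitRanges
        }
        return ranges.contains { $0.contains(self.value) }
    }

    // Shape 2: one discoverable is* property per Unicode property.
    var isWhitespace: Bool { return hasProperty(.whiteSpace) }
    var isASCIIHexDigit: Bool { return hasProperty(.asciiHexDigit) }
}

let tab: Unicode.Scalar = "\t"
let sameAnswer = tab.hasProperty(.whiteSpace) == tab.isWhitespace
```

The two shapes are not mutually exclusive: the is* properties here are thin wrappers, so one option would be to offer them for the common properties and keep the enum for the long tail.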
We also have to consider the non-Boolean properties; one example, Numeric_Value, may be more appropriate to expose as a failable initializer on Int. (Or even on Double; consider that the Numeric_Value of U+00BD VULGAR FRACTION ONE HALF is 1/2.)
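As a rough illustration of the failable-initializer idea, the sketch below adds hypothetical init?(numericValueOf:) initializers to Double and Int. The initializer label and the tiny sampleNumericValues table are assumptions made up for this example; a real implementation would read Numeric_Value from the Unicode Character Database.

```swift
// Hypothetical, heavily abbreviated Numeric_Value table (illustration only).
let sampleNumericValues: [Unicode.Scalar: Double] = [
    "7": 7,            // DIGIT SEVEN
    "\u{00BC}": 0.25,  // VULGAR FRACTION ONE QUARTER
    "\u{00BD}": 0.5,   // VULGAR FRACTION ONE HALF
    "\u{00BE}": 0.75,  // VULGAR FRACTION THREE QUARTERS
    "\u{2167}": 8,     // ROMAN NUMERAL EIGHT
]

extension Double {
    // Fails when the scalar has no numeric value in the table.
    init?(numericValueOf scalar: Unicode.Scalar) {
        guard let value = sampleNumericValues[scalar] else { return nil }
        self = value
    }
}

extension Int {
    // Fails when there is no numeric value or when it is not a whole number.
    init?(numericValueOf scalar: Unicode.Scalar) {
        guard let value = sampleNumericValues[scalar],
              let exact = Int(exactly: value) else { return nil }
        self = exact
    }
}

let half = Double(numericValueOf: "\u{00BD}")   // Optional(0.5)
let halfAsInt = Int(numericValueOf: "\u{00BD}") // nil, since 1/2 is not an integer
```

Whether Int should fail or round for fractional values is one of the design questions this shape raises; the sketch simply fails.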
What applies to Unicode.Scalar and what applies to Character?
By definition, the Unicode standard defines these properties on code points, which Swift represents using Unicode.Scalar (except that Unicode.Scalar has a “hole” where the surrogate code points live, but I don’t believe that will be hugely relevant to our discussion here). Because of that, in my opinion, any properties that we expose should at a minimum be supported on Unicode.Scalar.
So the question then becomes, what should we also support on Character? Characters are the “default” view on String and will be what most users iterate over and operate on unless their use case has performance requirements that make it preferable to use UnicodeScalarView instead of paying the cost of calculating grapheme cluster breaks.
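For readers less familiar with the two views, the short snippet below shows how the same string looks as Characters and as Unicode.Scalars, along with the surrogate “hole”. It uses only existing standard library APIs; the variable names are just for illustration.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT.
let accented = "e\u{301}"

// The default view iterates grapheme clusters: one Character.
let characterCount = accented.count               // 1

// The scalar view exposes the individual code points: two scalars.
let scalarCount = accented.unicodeScalars.count   // 2

// Surrogate code points are not representable as Unicode.Scalar,
// so the failable initializer returns nil for them.
let surrogate = Unicode.Scalar(0xD800 as UInt32)  // nil
```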
When we talk about Characters (i.e., grapheme clusters of one or more scalars), I think we can classify the properties in one of three ways:
Properties that are well-defined and are derivable solely based on the properties of the Unicode.Scalars. For example, consider White_Space. It’s reasonable that a user would want to ask if a Character is whitespace or not. If we have a “weird” character, like a space followed by a combining accent mark, I think it’s sensible to say “no, that’s not whitespace”. In other words, the Character has property X iff all of its scalars have property X.
Properties that have sensible definitions for Characters but which are not derived solely based on the constituent Unicode.Scalars. For example, we expect users will want to ask “is a character lowercase”, but you can’t say a Character is lowercase if all of its scalars have Lowercase == true, because if you have “a” + a combining accent mark, the combining accent has Lowercase == false. In specific cases like this (upper/lower/titlecase), Unicode defines specific algorithms for strings that we can apply to Characters: a Character c is uppercase if toUpper(c) == c, and likewise for lower- and titlecase.
Properties that make no sense to support on Character. One example that comes to mind is Variation_Selector. AFAIK, variation selectors always combine with a preceding scalar unless they are in a cluster by themselves, so you don’t really gain anything by asking “is this Character a variation selector?” You should just drop down to the scalars to ask.
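The first two categories above can be sketched roughly as follows. The property names (allScalarsAreWhitespace, isUppercaseByCaseMapping, isLowercaseByCaseMapping) are hypothetical and deliberately verbose so they don’t read as a naming proposal, and the White_Space table is hard-coded as an assumption so the example runs on its own. The case checks use String’s existing uppercased() and lowercased() methods as a stand-in for toUpper/toLower.

```swift
// Hard-coded White_Space ranges, for illustration only.
let whiteSpaceTable: [ClosedRange<UInt32>] = [
    0x0009...0x000D, 0x0020...0x0020, 0x0085...0x0085, 0x00A0...0x00A0,
    0x1680...0x1680, 0x2000...0x200A, 0x2028...0x2029, 0x202F...0x202F,
    0x205F...0x205F, 0x3000...0x3000,
]

extension Character {
    // Category 1: the Character has the property iff every scalar has it.
    var allScalarsAreWhitespace: Bool {
        return String(self).unicodeScalars.allSatisfy { scalar in
            whiteSpaceTable.contains { $0.contains(scalar.value) }
        }
    }

    // Category 2: defined via case mapping, not via per-scalar properties.
    var isUppercaseByCaseMapping: Bool {
        let text = String(self)
        return text.uppercased() == text
    }

    var isLowercaseByCaseMapping: Bool {
        let text = String(self)
        return text.lowercased() == text
    }
}

let plainSpace: Character = " "
let accentedSpace: Character = " \u{301}"  // space + combining acute accent
let accentedA: Character = "a\u{301}"      // "a" + combining acute accent

let spaceIsWhitespace = plainSpace.allScalarsAreWhitespace      // true
let weirdIsWhitespace = accentedSpace.allScalarsAreWhitespace   // false
let accentedAIsLower = accentedA.isLowercaseByCaseMapping       // true
```

One caveat of taking the toUpper(c) == c definition literally: uncased characters such as “1” come out as both uppercase and lowercase, so an actual API would probably need an additional “is cased” check.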