Adding Unicode properties to UnicodeScalar/Character

Swift's support for Unicode-aware operations on Strings is quite nice compared to many other modern programming languages, but one area where we're currently lacking is in the ability to query properties of Unicode.Scalar and Character values (e.g., to classify them).

As one possible example of how these could be implemented, I'll point to the UnicodeScalar+BooleanProperties extension in my icu-swift project. I make no claim that this is the ideal API to be adopted by the standard library; rather, I just want it to serve as a starting point for discussion.

Based on previous discussions here, I think many of us agree that we need to support some subset of these capabilities (at least) in the standard library. The following is a bit of a stream-of-consciousness of what I think the broad goals/questions are. If I've left anything out, please feel free to add it!

Which properties to support?

The Unicode standard defines a large number of properties that range from "useful in everyday computing" to "typically used only in specialized text processing algorithms." There are also many different types of properties: Boolean-valued, string-valued, numeric-valued, enumerated-type-valued, and so forth.

A design proposal to improve the state of Unicode properties in the standard library will need to explicitly state which properties will be supported (and by elimination, which will be omitted, if any). I won't try to be exhaustive here in my first message, but to cherry-pick some examples, properties like Uppercase, Lowercase, and White_Space are obvious candidates for inclusion. Something like Logical_Order_Exception that comes up less frequently on its own may not meet the bar that we decide to set.

How to expose the properties?

How should we design the APIs that expose these properties? This may be motivated by how many properties we decide to support above. If the number we support is small, then an individual property for each probably makes sense (isLowercase, isUppercase, isWhitespace, etc.). But if we end up supporting something closer to the full set of properties, would that bloat the API? Would we want to provide something enum-based like hasProperty(.lowercase) instead? That would reduce bloat, but it would also be considerably less discoverable and less familiar compared to the is* APIs provided by Java, C, and others.
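
To make the two shapes concrete, here is a rough sketch of both side by side. It assumes a toolchain whose standard library provides Unicode.Scalar.properties (Swift 5 and later) as the underlying source of truth; BinaryProperty, hasProperty(_:), and isLowercaseScalar are hypothetical names made up for illustration, not an existing or proposed API.

```swift
// Hypothetical enum naming a few Boolean properties (illustrative only).
enum BinaryProperty {
    case lowercase, uppercase, whiteSpace
}

extension Unicode.Scalar {
    // Enum-based query: a single entry point that can grow to any number
    // of properties without adding members to Unicode.Scalar.
    func hasProperty(_ property: BinaryProperty) -> Bool {
        switch property {
        case .lowercase: return properties.isLowercase
        case .uppercase: return properties.isUppercase
        case .whiteSpace: return properties.isWhitespace
        }
    }

    // Discoverable shorthand layered on top of the general query.
    var isLowercaseScalar: Bool {
        return hasProperty(.lowercase)
    }
}

let a: Unicode.Scalar = "a"
print(a.hasProperty(.lowercase), a.isLowercaseScalar, a.hasProperty(.whiteSpace))
```

The shorthand costs one extra member per property, which is the bloat trade-off described above.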

We also have to consider the non-Boolean properties; one example, Numeric_Value, may be more appropriate to expose as a failable initializer on Int. (Or even on Double; consider that the Numeric_Value of U+00BD VULGAR FRACTION ONE HALF is 0.5.)
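
As a minimal sketch of the failable-initializer idea, assuming the numeric value is available through Unicode.Scalar.properties.numericValue (Swift 5 and later); the argument label numericValueOf is a hypothetical name chosen for this example.

```swift
extension Double {
    // Hypothetical initializer: fails when the scalar has no Numeric_Value.
    init?(numericValueOf scalar: Unicode.Scalar) {
        guard let value = scalar.properties.numericValue else { return nil }
        self = value
    }
}

// U+00BD VULGAR FRACTION ONE HALF carries a fractional Numeric_Value,
// which is why Double is a more natural home for this than Int.
let half = Double(numericValueOf: "\u{00BD}")
let letter = Double(numericValueOf: "a")
print(half as Any, letter as Any)
```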

What applies to Unicode.Scalar and what applies to Character?

By definition, the Unicode standard defines these properties on code points, which Swift represents using Unicode.Scalar (except that Unicode.Scalar has a "hole" where the surrogate code points live, but I don't believe that will be hugely relevant to our discussion here). Because of that, in my opinion, any properties that we expose should at a minimum be supported on Unicode.Scalar.

So the question then becomes, what should we also support on Character? Characters are the "default" view on String and will be what most users iterate over and operate on unless their use case has performance requirements that make it preferable to use UnicodeScalarView instead of paying the cost of calculating grapheme cluster breaks.

When we talk about Characters (i.e., grapheme clusters of one or more scalars), I think we can classify the properties in one of three ways:

  1. Properties that are well-defined and are derivable solely based on the properties of the Character's constituent Unicode.Scalars. For example, consider White_Space. It's reasonable that a user would want to ask if a Character is whitespace or not. If we have a "weird" character, like a space followed by a combining accent mark, I think it's sensible to say "no, that's not whitespace". In other words, the Character has property X iff all of its scalars have property X.

  2. Properties that have sensible definitions for Characters but which are not derived solely based on the constituent Unicode.Scalars. For example, we expect users will want to ask "is a character lowercase", but you can't say a Character is lowercase if all of its scalars have Lowercase == true because if you have "a" + a combining accent mark, the combining accent has Lowercase == false. In specific cases like this (upper/lower/titlecase), Unicode defines specific algorithms for strings that we can apply to Character: a Character c is uppercase if toUpper(c) == c, and likewise for lower- and titlecase.

  3. Properties that make no sense to support on Character. One example that comes to mind is Variation_Selector. AFAIK, variation selectors always combine with a preceding scalar unless they are in a cluster by themselves, so you don't really gain anything by asking "is this Character a variation selector?" You should just drop down to the scalars to ask.
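
The first two categories can be sketched roughly as follows. This assumes Unicode.Scalar.properties (Swift 5 and later) for the scalar-level lookups and uses String.lowercased() as a stand-in for toLower; the property names isWhitespaceByAllScalars and isUnchangedByLowercasing are hypothetical.

```swift
extension Character {
    // Category 1: the Character has the property iff all of its scalars do.
    var isWhitespaceByAllScalars: Bool {
        return String(self).unicodeScalars.allSatisfy { $0.properties.isWhitespace }
    }

    // Category 2: defined through a string algorithm rather than per scalar.
    // True when lowercasing leaves the Character unchanged.
    var isUnchangedByLowercasing: Bool {
        let s = String(self)
        return s.lowercased() == s
    }
}

let plainSpace = Character(" ")
let accentedSpace = Character("\u{20}\u{301}")   // space + combining acute
let accentedA = Character("a\u{301}")            // "a" + combining acute
print(plainSpace.isWhitespaceByAllScalars,
      accentedSpace.isWhitespaceByAllScalars,
      accentedA.isUnchangedByLowercasing)
```

Note that the category 2 check on its own also returns true for uncased characters such as "7", since lowercasing leaves them unchanged.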


Thoughts?

This seems like an excellent direction for extending UnicodeScalar and Character.

Regarding the way in which the properties are exposed, it seems like we might want a rich hasProperty()-style API for all the defined properties. However, that doesn't preclude also having a simpler set of isFoo predicates for the more common properties. CharacterSet's static properties are a good jumping-off point for what might be considered common.

Correct me if I'm wrong, but it seems like all of this would be purely additive, right?

Thank you for kicking this off! https://github.com/allevato/icu-swift seems like a great testing ground for this.

This is a great way to approach the task. We may want to expose all the raw information for sophisticated use cases, and selectively bless some queries for common use.

My vote is for isLowercase, etc., to be present directly for common use, semantically equivalent to something built on top of a more general facility. Much like a subset of those provided by icu-swift.

I feel like this will end up being a case-by-case (all puns intended) tiny research project.

Case is pretty tricky. Unicode defines at least 3 different levels of thinking about case, because of course it does.

Old fashioned notions of case (such as in Java or CharacterSet) are based on general category, but that proved to not be very future-proof and skewed towards bicameral alphabets. The second level of case is from derived properties, which is likely what we'll want for UnicodeScalar at least. Relevant (trimmed) quote from the State of String thread:

Finally, the third level are the String functions such as isLowercase and isUppercase. I don't know if this level is overkill on Character, but off-the-cuff it seems viable. There might be some decent fast-paths in the implementation we can use for common scenarios.

AFAICT, isLowercase would return true for caseless graphemes, such as "7". I don't know what behavior we want to expose, e.g. perhaps a grapheme has to satisfy both Unicode's isCased and isLowercase for our Swift computed property isLowercase.

Or we expose both and let the user sort it out. Really depends on the use.

This is additive, but I think it addresses a very sore spot in String/Character/UnicodeScalar that can be developed in parallel with ABI efforts.

Agree completely.

And thanks for mentioning the point about CharacterSet's obsolete logic. IMO we should fix that as well by introducing a new and correct UnicodeScalarSet type, but that's best saved for a separate proposal. That being said, if these properties go in, it'll be important to fix because people will wonder why Unicode.Scalar.isLowercase isn't consistent with CharacterSet.lowercaseLetters.

I think this is exactly what we want for Character. AFAICT, the Unicode standard doesn't distinguish between single grapheme clusters and strings of them for the purposes of case detection above the scalar level. (The exception is titlecase, since that naturally involves multiple scalars and word boundaries, but even there, some scalars encode multiple "characters" that are inherently titlecased, e.g., U+01F2: LATIN CAPITAL LETTER D WITH SMALL LETTER Z (Dz).)

So, the same case detection we'd want to use for strings should be correct for characters as well.

That's true—the "Default Case Detection" subsection of 3.13 in the standard shows that uncased code points like "7" are simultaneously uppercase, lowercase, and titlecase. This means we have to make a decision for these:

  • ("7" as Unicode.Scalar).isLowercase == true because it's technically correct according to the Unicode standard, and users have to remember to check isCased as well.
  • ("7" as Unicode.Scalar).isLowercase == false because we would internally also factor in isCased and do the thing that's most obvious for the user.

I also imagine that nobody wants a world where ("7" as Unicode.Scalar).isLowercase != ("7" as Character).isLowercase.

Edit: It's also worth noting that if we go with the second option above, there's no way to get back from that result to the raw value; it's not reversible. So if somebody does want the "raw" value of the case property, we'd need to provide that as a separate API.
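
A rough sketch of the two options, using String.lowercased() and String.uppercased() as stand-ins for the Unicode string functions. All three function names are hypothetical, and approximatelyCased only approximates isCased(X): it ignores titlecase, which the standard library has no string-level operation for.

```swift
// Option 1: the raw string-function answer (unchanged by lowercasing).
func rawIsLowercase(_ scalar: Unicode.Scalar) -> Bool {
    let s = String(scalar)
    return s.lowercased() == s
}

// Approximation of isCased(X): some case conversion changes the string.
// Assumption: titlecase is ignored here.
func approximatelyCased(_ scalar: Unicode.Scalar) -> Bool {
    let s = String(scalar)
    return s.lowercased() != s || s.uppercased() != s
}

// Option 2: also factor in casedness, so uncased scalars report false.
func casedIsLowercase(_ scalar: Unicode.Scalar) -> Bool {
    return rawIsLowercase(scalar) && approximatelyCased(scalar)
}

let seven: Unicode.Scalar = "7"
print(rawIsLowercase(seven), casedIsLowercase(seven))
```

Because option 2 folds two answers into one Bool, a caller who sees false cannot tell "uppercase" apart from "uncased", which is the non-reversibility mentioned in the edit above.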

I would imagine this would be level 2, that is, based on the Lowercase derived property and not the Unicode string function isLowercase(X). "7" does not have that derived property. (See section 4.2 of the standard).

Nano-proposal:

Requirement 1: Canonically equivalent Characters always give the same answer for these queries.

Requirement 2: A Character comprised of a single scalar gives the same answer as querying the scalar directly.

Corollary: Any Character canonically equivalent with a single-scalar Character must give the same answer as that scalar.

Research project: What, then, is a consistent model for multi-scalar graphemes not canonically equivalent to a single-scalar one?
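
As a small illustration of why Requirement 1 constrains the design, here is a sketch of a naive "all scalars have the property" rule applied to two canonically equivalent Characters. It assumes Unicode.Scalar.properties (Swift 5 and later); allScalarsLowercase is a hypothetical helper, not a proposed API.

```swift
// Naive rule: a Character is lowercase iff every scalar has Lowercase.
func allScalarsLowercase(_ c: Character) -> Bool {
    return String(c).unicodeScalars.allSatisfy { $0.properties.isLowercase }
}

let precomposed = Character("\u{E9}")    // é as a single scalar
let decomposed = Character("e\u{301}")   // e + combining acute accent

// Swift compares Characters by canonical equivalence, so these are equal,
// yet the naive rule gives them different answers.
print(precomposed == decomposed,
      allScalarsLowercase(precomposed),
      allScalarsLowercase(decomposed))
```

The naive rule therefore violates Requirement 1, since the combining accent does not itself have the Lowercase property.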

All case folding is irreversible; it may even change the grapheme count.

Agree with earlier comments that isLowercase, etc., should be present directly for common use.

As for more esoteric properties, what of something like a property named unicodeProperties that returns an OptionSet? I think if feasible it'd afford more flexibility than hasProperty, and we're talking about advanced usage anyway.

Ah yes, you're right; I was getting wrapped up in the various definitions. In that case, it looks like the derived "Lowercase" and "Uppercase" properties would directly give us what we want for scalars and single-scalar characters.

AFAIK, this is where the Default Case Detection rules in 3.13 come in. So to try to restate everything:

  1. For Unicode.Scalars, and for Characters consisting of a single Unicode.Scalar or of multiple scalars that are canonically equivalent to a single scalar, isLowercase equals the value of the single scalar's derived Lowercase property.

  2. For Characters consisting of multiple scalars that are not canonically equivalent to a single scalar, isLowercase is true if and only if C == toLowercase(C) && isCased(C).

Case #1 is really a subset of case #2, but it presents an optimization opportunity for single scalars where we don't have to compute a temporary mapping and test equality. Overall, this behavior is consistent with what's described by 3.13 and produces the correct results for something like "a + several combining accents" (where isCased keeps it true) and for emoji sequences where isCased would be false, therefore saying the whole cluster is false.
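
A minimal sketch of that two-tier rule, under several assumptions: Unicode.Scalar.properties (Swift 5 and later) supplies the derived Lowercase property, String.lowercased()/uppercased() stand in for the Unicode string functions, isCased is approximated without titlecase, and the fast path only covers literal single-scalar Characters (detecting multi-scalar Characters that are canonically equivalent to a single scalar would need normalization, which is omitted). The name isLowercaseSketch is hypothetical.

```swift
extension Character {
    var isLowercaseSketch: Bool {
        let s = String(self)
        let scalars = s.unicodeScalars

        // Case 1 (fast path): a single scalar, so read the derived
        // Lowercase property without building temporary mapped strings.
        if scalars.count == 1, let only = scalars.first {
            return only.properties.isLowercase
        }

        // Case 2: C == toLowercase(C) && isCased(C), with isCased
        // approximated as "some case conversion changes the string".
        let lowered = s.lowercased()
        let cased = lowered != s || s.uppercased() != s
        return lowered == s && cased
    }
}

let stacked = Character("a\u{301}\u{302}")       // "a" + two combining marks
let flag = Character("\u{1F1FA}\u{1F1F8}")       // regional-indicator pair
print(stacked.isLowercaseSketch, flag.isLowercaseSketch)
```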

How does that sound?

This hadn't occurred to me, but I really like it. It moves the "bloat" out of the main Unicode.Scalar interface and into its own type for advanced users.

One concern: ICU 60 already defines 64 boolean properties, which would exhaust a UInt64 that we would use as the raw value of the OptionSet. What do we do if Unicode adds another property? We can expand our bit space with DoubleWidth<>, but since the underlying integer type is part of the public API of an OptionSet, can we safely scale it in the future in a non-breaking way?

Are any of the 64 properties easily derived from trivial combinations of the others? If so, then it may be feasible. If not, then we might need to look into other designs.

Efficiency and practicality aside, I feel that they would properly be modelled by a Set of enum cases and not an OptionSet. In reality you probably don't want to have to actually construct such a Set by querying all properties at construction, so why not just a custom type that conforms to SetAlgebra, with the Element being an enum of properties?

Each character has multiple properties.
Edit: I see what you mean--a set of enum cases. That's intriguing, but would we end up having too many types? Is there that much structure among the properties such that we have a deep hierarchy of many mutually exclusive options?

The more I think about it, the less a "set" (option set or regular set) fits with the API that ICU gives us.

There's no way to query "give me all the boolean properties of this scalar" as a single bitmask. The only function we have AFAICT is u_hasBinaryProperty, which only lets us query them one at a time. That means that if we want to support a true set type, we have to query all 64 properties any time someone wants just one of them, which seems like a poor implementation strategy.
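
To make the cost concrete, here is a rough sketch of what an eager OptionSet-based design would have to do with only one-at-a-time queries available. It uses Unicode.Scalar.properties (Swift 5 and later) as a stand-in for the per-property ICU call; ScalarPropertySet and unicodePropertySet are hypothetical names, and only three properties are shown.

```swift
// Hypothetical option set; one bit per Boolean property.
struct ScalarPropertySet: OptionSet {
    let rawValue: UInt64

    static let lowercase = ScalarPropertySet(rawValue: 1 << 0)
    static let uppercase = ScalarPropertySet(rawValue: 1 << 1)
    static let whiteSpace = ScalarPropertySet(rawValue: 1 << 2)
}

extension Unicode.Scalar {
    // Every supported property is queried up front, even when the caller
    // only goes on to test a single member of the result.
    var unicodePropertySet: ScalarPropertySet {
        var result: ScalarPropertySet = []
        if properties.isLowercase { result.insert(.lowercase) }
        if properties.isUppercase { result.insert(.uppercase) }
        if properties.isWhitespace { result.insert(.whiteSpace) }
        return result
    }
}

let a: Unicode.Scalar = "a"
print(a.unicodePropertySet.contains(.lowercase))
```

With the full list of Boolean properties, the body of unicodePropertySet would perform one lookup per property on every access.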

You wouldn't need to do that if you use a custom type that conforms to SetAlgebra. e.g. Very roughly

enum CharacterProperty {
    case upperCase, lowerCase, deprecated, diacritic // etc., possibly with raw values that match ICU's UProperty
}

struct CharacterProperties: SetAlgebra {
    typealias Element = CharacterProperty
    
    func contains(_ member: CharacterProperty) -> Bool {
        u_hasBinaryProperty(...) // map member to UProperty, call ICU
    }
    // etc
}

Some of the SetAlgebra functions would require querying all properties, though.

However, this is only if being a set is particularly desirable, which it probably isn't unless you expect people to be e.g. intersecting the properties of all characters in a string to determine what properties they have in common. A function that takes an enum, such as hasProperty mentioned above, makes the most sense to me, with shorthands for common properties.

Most of them would, if we implement contains the most efficient way by calling u_hasBinaryProperty directly and not caching anything. So yeah, that would end up being even worse than OptionSet.

I'd say defer any approach that cannot be efficiently implemented on top of ICU as future work. We can consider what an ideal future would look like, but let's also separate out something concrete for inclusion in Swift 5.

Establishing this was where my individual research left off. Do you have a relevant part of the spec, justification, and/or an argument for why this must hold?

I've dug through the spec a bit and can't find specific writing for these assertions but I believe they hold:

First, let's take Case 2:

Then, Case 1 was written as such:

So let's try to restate it in terms of Case 2. Let's say we have a single scalar S. Then we need to show that the derived Lowercase property of S is always equivalent to S == toLowercase(S) && isCased(S). The cases are:

  1. S is an un-cased code point. Then its Lowercase derived property should be false. Likewise, isCased(S) will be false, so the two are equivalent.
  2. S is a cased code point. Then,
    a. S has Lowercase property == true. Then S == toLowercase(S) is true because S doesn't get changed, and isCased(S) is true, so they're equivalent.
    b. S has Lowercase property == false. Then S == toLowercase(S) is false, so the whole expression is false.

Unfortunately I can't find more concrete properties in the spec that clarify this, but I believe this will always hold?

If I had a nickel for every time I said that! This is Unicode, and thus we throw all intuition out the door and also consult the tables:

:sweat:

For lowercase/uppercase, I think we still need more justification based on the defined semantics of these methods. From the spec:

Then the next step is to determine what this mapping is, and if the mapping is invariant for all unicode scalars with the relevant property. Basically, we need to prove:

∀(s ∈ UnicodeScalars) isLowercase(s) === Lowercase_Mapping(s) == s

We can validate this for a given version of Unicode through exhaustive search (it's only a million scalars, so this can be done in about a second). We can do this today!
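
A minimal sketch of such a search, assuming Unicode.Scalar.properties (Swift 5 and later) and reading isLowercase(s) in the statement above as the derived Lowercase property. The function name mappingMismatches is hypothetical, and the example only scans the first 256 scalars; widening the range to 0...0x10FFFF covers everything.

```swift
// Collects scalar values where "has the Lowercase property" disagrees
// with "Lowercase_Mapping(s) == s".
func mappingMismatches(in range: ClosedRange<UInt32>) -> [UInt32] {
    var result: [UInt32] = []
    for value in range {
        // Surrogate code points are not Unicode.Scalars; skip them.
        guard let scalar = Unicode.Scalar(value) else { continue }
        let hasLowercaseProperty = scalar.properties.isLowercase
        let mapsToItself = scalar.properties.lowercaseMapping == String(scalar)
        if hasLowercaseProperty != mapsToItself {
            result.append(value)
        }
    }
    return result
}

let firstBlock = mappingMismatches(in: 0...0xFF)
print(firstBlock.count)
```

Under this reading, uncased scalars such as "7" are reported as mismatches, because they map to themselves without having the Lowercase property.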

To reason about whether this will remain valid in future versions of Unicode, we need to reason through how Uppercase_Mapping(C) is defined and whether its compatibility affordances are similar/equivalent to those of the derived properties. That is, we also need to prove:

∀(V ∈ FutureUnicodeVersions, sv ∈ V.UnicodeScalars) isLowercase(sv) === Lowercase_Mapping(sv) == sv

with a fairly high degree of confidence.

From the spec:

(Commentary from me and not proven): Note that casing is normative but not fixed; however, this is OK so long as the case mappings are always in sync. String provides universal semantics by default and leaves localization to the platform, so we're not interested in SpecialCasing properties. Since we're talking about graphemes, and not whole Strings, position-dependent case mappings (such as the Greek sigma) are not relevant. So in this case (all puns intended), we only care about SimpleCasing, assuming it aligns with the other derived properties.

So I think the next step is to run the experiment on today's Unicode version and see.

Are we conflating strings and scalars? Because AFAICT this statement is tautological (or trivially proven) based on the definitions in the spec:

  • isLowercase(X) is a function that the Unicode spec defines on strings (3.13, D139) that is true when toLowercase(X) == X
  • toLowercase(X) is also defined on strings (3.13, R2) as mapping each character C in X to Lowercase_Mapping(C)
  • Lowercase_Mapping is a string-valued property of scalars; the corresponding scalar-valued property is Simple_Lowercase_Mapping

So I don't think we can evaluate the equation to prove without first treating the scalars as strings in some contexts. If we do so, we end up with:

∀(s ∈ UnicodeScalars) isLowercase(String(s)) === Lowercase_Mapping(s) == String(s)

Then expanding:

∀(s ∈ UnicodeScalars) toLowercase(String(s)) == String(s) === Lowercase_Mapping(s) == String(s)

And since String(s) consists of a single scalar, toLowercase(String(s)) is equal by definition to Lowercase_Mapping(s).

But that doesn't tell us anything about whether we can use the derived Lowercase property of a scalar as an optimization equivalent to performing these string comparisons, does it?

I started running some experiments to test that assertion, and I found some problems already, just in the first 256:

Mapping mismatch for scalar ª (U+aa): isLowercase(s) && isCased(s) == false, but Lowercase property == true
Mapping mismatch for scalar º (U+ba): isLowercase(s) && isCased(s) == false, but Lowercase property == true

This means it looks like we can't use the Lowercase property on its own for scalars if we want to be consistent with how it's defined for strings :pensive:

So, I'll posit that our API should probably be something like:

  • Unicode.Scalar.hasProperty(.lowercase) – a low-level operation that returns the Unicode property value directly
  • Unicode.Scalar.isLowercase – promotes the scalar to a String and uses the transformation/detection functions defined by the Unicode spec
  • Character.isLowercase – promotes the character to a String and uses the transformation/detection functions defined by the Unicode spec

...and likewise for uppercase (and titlecase?).
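
The three entry points above could be sketched roughly like this. Everything here is hypothetical naming: ScalarBinaryProperty, hasProperty(_:), and detectsAsLowercase are made up for the example (the latter avoids clashing with any isLowercase member a newer standard library may already provide). The raw lookup assumes Unicode.Scalar.properties (Swift 5 and later), and the string detection assumes isLowercase(X) && isCased(X) with isCased approximated without titlecase.

```swift
enum ScalarBinaryProperty {
    case lowercase, uppercase
}

// Shared string-level detection: unchanged by lowercasing, and cased
// (approximated as "some case conversion changes the string").
func stringDetectsLowercase(_ s: String) -> Bool {
    let lowered = s.lowercased()
    let cased = lowered != s || s.uppercased() != s
    return lowered == s && cased
}

extension Unicode.Scalar {
    // Low-level: returns the Unicode property value directly.
    func hasProperty(_ property: ScalarBinaryProperty) -> Bool {
        switch property {
        case .lowercase: return properties.isLowercase
        case .uppercase: return properties.isUppercase
        }
    }

    // High-level: promote the scalar to a String and use string detection.
    var detectsAsLowercase: Bool {
        return stringDetectsLowercase(String(self))
    }
}

extension Character {
    // High-level: promote the Character to a String in the same way.
    var detectsAsLowercase: Bool {
        return stringDetectsLowercase(String(self))
    }
}

let ordinal: Unicode.Scalar = "\u{AA}"   // FEMININE ORDINAL INDICATOR
print(ordinal.hasProperty(.lowercase), ordinal.detectsAsLowercase)
```

With this split, U+00AA keeps its raw Lowercase property through hasProperty while the string-based answer comes out false, which is the same divergence the mismatch output above reports.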
