Adding Unicode properties to UnicodeScalar/Character

Thank you for kicking this off! https://github.com/allevato/icu-swift seems like a great testing ground for this.

This is a great way to approach the task. We may want to expose all the raw information for sophisticated use cases, and selectively bless some queries for common use.

My vote is for isLowercase, etc., to be present directly for common use, semantically equivalent to something built on top of a more general facility, much like a subset of those provided by icu-swift.

I feel like this will end up being a case-by-case (all puns intended) tiny research project.

Case is pretty tricky. Unicode defines at least 3 different levels of thinking about case, because of course it does.

Old-fashioned notions of case (such as in Java or CharacterSet) are based on general category, but that proved not to be very future-proof and skewed towards bicameral alphabets. The second level of case comes from derived properties, which is likely what we'll want for UnicodeScalar at least. Relevant (trimmed) quote from the State of String thread:

Finally, the third level is the String functions such as isLowercase and isUppercase. I don't know if this level is overkill on Character, but off-the-cuff it seems viable. There might be some decent fast paths in the implementation we can use for common scenarios.

AFAICT, isLowercase would return true for caseless graphemes, such as "7". I don't know what behavior we want to expose, e.g. perhaps a grapheme has to satisfy both Unicode's isCased and isLowercase for our Swift computed property isLowercase.

Or we expose both and let the user sort it out. Really depends on the use.

This is additive, but I think it addresses a very sore spot in String/Character/UnicodeScalar that can be developed in parallel with ABI efforts.

Agree completely.

And thanks for mentioning the point about CharacterSet's obsolete logic. IMO we should fix that as well by introducing a new and correct UnicodeScalarSet type, but that's best saved for a separate proposal. That being said, if these properties go in, it'll be important to fix because people will wonder why Unicode.Scalar.isLowercase isn't consistent with CharacterSet.lowercaseLetters.

I think this is exactly what we want for Character. AFAICT, the Unicode standard doesn't distinguish between single grapheme clusters and strings of them for the purposes of case detection above the scalar level. (The exception is titlecase, since that naturally involves multiple scalars and word boundaries; but even then, there are specific scalars that encode multiple "characters" and are inherently titlecased—e.g., U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z (Dz).)

So, the same case detection we'd want to use for strings should be correct for characters as well.

That's true—the "Default Case Detection" subsection of 3.13 in the standard shows that uncased code points like "7" are simultaneously uppercase, lowercase, and titlecase. This means we have to make a decision for these:

  • ("7" as Unicode.Scalar).isLowercased == true because it's technically correct according to the Unicode standard, and users have to remember to check isCased as well.
  • ("7" as Unicode.Scalar).isLowercased == false because we would internally also factor in isCased and do the thing that's most obvious for the user.

I also imagine that nobody wants a world where ("7" as Unicode.Scalar).isLowercase != ("7" as Character).isLowercase.

Edit: It's also worth noting that if we go with the second option above, there's no way to get back from that result to the raw value; it's not reversible. So if somebody does want the "raw" value of the case property, we'd need to provide that as a separate API.

I would imagine this would be level 2, that is, based on the Lowercase derived property and not the Unicode string function isLowercase(X). "7" does not have that derived property. (See section 4.2 of the standard).

Nano-proposal:

Requirement 1: Canonically equivalent Characters always give the same answer for these queries.

Requirement 2: A Character composed of a single scalar gives the same answer as querying that scalar directly.

Corollary: Any Character canonically equivalent to a single-scalar Character must give the same answer as that scalar.

Research project: What, then, is a consistent model for multi-scalar graphemes not canonically equivalent to a single-scalar one?

All case folding is irreversible; it may even change the grapheme count.


Agree with earlier comments that isLowercase, etc., should be present directly for common use.

As for more esoteric properties, what of something like a property named unicodeProperties that returns an OptionSet? I think if feasible it'd afford more flexibility than hasProperty, and we're talking about advanced usage anyway.
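
For concreteness, here's a minimal sketch of that idea (the type name, property names, and bit assignments are purely illustrative, not ICU's actual numbering):

// Illustrative only: one bit per boolean Unicode property.
struct UnicodeProperties: OptionSet {
    let rawValue: UInt64
    static let lowercase  = UnicodeProperties(rawValue: 1 << 0)
    static let uppercase  = UnicodeProperties(rawValue: 1 << 1)
    static let whitespace = UnicodeProperties(rawValue: 1 << 2)
    static let diacritic  = UnicodeProperties(rawValue: 1 << 3)
    // ...and so on for the remaining boolean properties.
}

// Hypothetical usage, assuming a scalar exposed such a property:
//   if scalar.unicodeProperties.contains([.lowercase, .diacritic]) { ... }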


Ah yes, you're right—I was getting wrapped up in the various definitions. In that case, it looks like the derived "Lowercase" and "Uppercase" properties would directly give us what we want for scalars and single-scalar characters.

AFAIK, this is where the Default Case Detection rules in 3.13 come in. So to try to restate everything:

  1. For a Unicode.Scalar, or for a Character consisting of a single Unicode.Scalar (or of multiple scalars canonically equivalent to a single scalar), isLowercase equals the value of that scalar's derived Lowercase property.

  2. For Characters consisting of multiple scalars that are not canonically equivalent to a single scalar, isLowercase is true if and only if C == toLowercase(C) && isCased(C).

Case #1 is really a subset of case #2, but it presents an optimization opportunity for single scalars, where we don't have to compute a temporary mapping and test equality. Overall, this behavior is consistent with what's described in 3.13 and produces the correct results for something like "a + several combining accents" (where isCased keeps it true) and for emoji sequences (where isCased is false, so the whole cluster reports false).
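
A rough sketch of that two-tier check, assuming a scalar-level properties API along the lines of scalar.properties.isLowercase / .isCased, and using String.lowercased() as a stand-in for toLowercase(C):

// Sketch only: the fast path covers Case #1, the general path covers Case #2.
func isLowercaseGrapheme(_ character: Character) -> Bool {
    let scalars = character.unicodeScalars
    if scalars.count == 1 {
        // Case #1: defer to the scalar's derived Lowercase property.
        return scalars.first!.properties.isLowercase
    }
    // (A cluster canonically equivalent to a single scalar could be
    // normalized and routed through the branch above; elided here.)
    // Case #2: the string-level definition, C == toLowercase(C) && isCased(C),
    // reading "the cluster is cased" as "any of its scalars is cased".
    let s = String(character)
    return s.lowercased() == s && scalars.contains(where: { $0.properties.isCased })
}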

How does that sound?


This hadn't occurred to me, but I really like it. It moves the "bloat" out of the main Unicode.Scalar interface and into its own type for advanced users.

One concern: ICU 60 already defines 64 boolean properties, which would exhaust a UInt64 that we would use as the raw value of the OptionSet. What do we do if Unicode adds another property? We can expand our bit space with DoubleWidth<>, but since the underlying integer type is part of the public API of an OptionSet, can we safely scale it in the future in a non-breaking way?

Are any of the 64 properties easily derived from trivial combinations of the others? If so, then it may be feasible. If not, then we might need to look into other designs.

Efficiency and practicality aside, I feel that they would properly be modelled by a Set of enum cases and not an OptionSet. In reality you probably don't want to have to actually construct such a Set by querying all properties at construction, so why not just a custom type that conforms to SetAlgebra, with the Element being an enum of properties?

Each character has multiple properties.
Edit: I see what you mean: a set of enum cases. That's intriguing, but would we end up having too many types? Is there that much structure among the properties that we'd end up with a deep hierarchy of many mutually exclusive options?

The more I think about it, the less a "set" (option set or regular set) fits with the API that ICU gives us.

There's no way to query "give me all the boolean properties of this scalar" as a single bitmask. The only function we have AFAICT is u_hasBinaryProperty, which only lets us query them one at a time. That means that if we want to support a true set type, we have to query all 64 properties any time someone wants just one of them, which seems like a poor implementation strategy.


You wouldn't need to do that if you use a custom type that conforms to SetAlgebra. Very roughly:

enum CharacterProperty {
    case upperCase, lowerCase, deprecated, diacritic // etc., possibly with raw values that match ICU's UProperty
}

struct CharacterProperties: SetAlgebra {
    typealias Element = CharacterProperty

    func contains(_ member: CharacterProperty) -> Bool {
        u_hasBinaryProperty(...) // map member to a UProperty value and call into ICU for the scalar being queried
    }
    // remaining SetAlgebra requirements elided
}

Some of the SetAlgebra functions would require querying all properties, though.

However, this is only if being a set is particularly desirable, which it probably isn't unless you expect people to be e.g. intersecting the properties of all characters in a string to determine what properties they have in common. A function that takes an enum, such as hasProperty mentioned above, makes the most sense to me, with shorthands for common properties.

Most of them would, if we implement contains in the most efficient way, by calling u_hasBinaryProperty directly and not caching anything. So yeah, that would end up being even worse than the OptionSet approach.

I'd say defer any approach that cannot be efficiently implemented on top of ICU as future work. We can consider what an ideal future would look like, but let's also separate out something concrete for inclusion in Swift 5.

Establishing this was where my individual research left off. Do you have a relevant part of the spec, justification, and/or an argument for why this must hold?

I've dug through the spec a bit and can't find specific wording for these assertions, but I believe they hold:

First, let's take Case 2: for Characters made of multiple scalars not canonically equivalent to a single scalar, isLowercase is true if and only if C == toLowercase(C) && isCased(C).

Then, Case 1 was written such that, for a single scalar (or a Character canonically equivalent to one), isLowercase equals the scalar's derived Lowercase property.

So let's try to restate it in terms of Case 2. Say we have a single scalar S. Then we need to show that the derived Lowercase property of S is always equivalent to S == toLowercase(S) && isCased(S). The cases are:

  1. S is an uncased code point. Then its derived Lowercase property is false. Likewise, isCased(S) is false, so the two are equivalent.
  2. S is a cased code point. Then,
    a. S has the Lowercase property == true. Then S == toLowercase(S) is true because S doesn't get changed, and isCased(S) is true, so the two sides agree.
    b. S has the Lowercase property == false. Then S == toLowercase(S) is false, so the whole expression is false.
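
In other words, writing Lowercase(s) for the derived property, the claim is:

∀(s ∈ UnicodeScalars) Lowercase(s) === (toLowercase(s) == s && isCased(s))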

Unfortunately I can't find anything more concrete in the spec that clarifies this, but I believe it will always hold?


If I had a nickel for every time I said that! This is Unicode, and thus we throw all intuition out the door and also consult the tables:

😓

For lowercase/uppercase, I think we still need more justification based on the defined semantics of these methods. From the spec:

Then the next step is to determine what this mapping is, and whether the mapping is invariant for all Unicode scalars with the relevant property. Basically, we need to prove:

∀(s ∈ UnicodeScalars) isLowercase(s) === Lowercase_Mapping(s) == s

We can validate this for a given version of Unicode through exhaustive search (it's only about a million scalars, so this can be done in about a second). We can do this today!
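
Here's a minimal sketch of that exhaustive check, assuming a scalar-level properties API of the sort being discussed (scalar.properties.isLowercase, .isCased, and .lowercaseMapping); an implementation without that API would call into ICU for the same data:

// Compare the derived Lowercase property against the string-level definition
// (toLowercase(S) == S for a cased S) for every Unicode scalar.
var mismatches: [Unicode.Scalar] = []
for value in (0 as UInt32)...0x10FFFF {
    // Surrogate code points aren't valid Unicode scalars; skip them.
    guard let scalar = Unicode.Scalar(value) else { continue }
    let props = scalar.properties
    let stringLevel = props.lowercaseMapping == String(scalar) && props.isCased
    if stringLevel != props.isLowercase {
        mismatches.append(scalar)
    }
}
print("Found \(mismatches.count) scalars where the two notions disagree")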

To reason about whether this will remain valid in future versions of Unicode, we need to reason through how Uppercase_Mapping(C) is defined and whether its compatibility affordances are similar/equivalent to those of the derived properties. That is, we also need to prove:

∀(V ∈ FutureUnicodeVersions, sv ∈ V.UnicodeScalars) isLowercase(sv) === Lowercase_Mapping(sv) == sv

with a fairly high degree of confidence.

From the spec:

(Commentary from me and not proven.) Note that casing is normative but not fixed; however, this is OK so long as the case mappings are always in sync. String provides universal semantics by default and leaves localization to the platform, so we're not interested in SpecialCasing properties. Since we're talking about graphemes, and not whole Strings, position-dependent case mappings (such as the Greek sigma) are not relevant. So in this case (all puns intended), we only care about SimpleCasing, assuming it aligns with the other derived properties.

So I think the next step is to run the experiment on today's Unicode version and see.


Are we conflating strings and scalars? Because AFAICT this statement is tautological (or trivially proven) based on the definitions in the spec:

  • isLowercase(X) is a function that the Unicode spec defines on strings (3.13, D139) that is true when toLowercase(X) == X
  • toLowercase(X) is also defined on strings (3.13, R2) as mapping each character C in X to Lowercase_Mapping(C)
  • Lowercase_Mapping is a string-valued property of scalars; the corresponding scalar-valued property is Simple_Lowercase_Mapping

So I don't think we can evaluate the equation we want to prove without first treating the scalars as strings in some contexts. If we do so, we end up with:

∀(s ∈ UnicodeScalars) isLowercase(String(s)) === Lowercase_Mapping(s) == String(s)

Then expanding:

∀(s ∈ UnicodeScalars) toLowercase(String(s)) == String(s) === Lowercase_Mapping(s) == String(s)

And since String(s) consists of a single scalar, toLowercase(String(s)) is equal by definition to Lowercase_Mapping(s).

But that doesn't tell us anything about whether we can use the derived Lowercase property of a scalar as an optimization equivalent to performing these string comparisons, does it?

I started running some experiments to test that assertion, and I found some problems already, just in the first 256:

Mapping mismatch for scalar ª (U+aa): isLowercase(s) && isCased(s) == false, but Lowercase property == true
Mapping mismatch for scalar º (U+ba): isLowercase(s) && isCased(s) == false, but Lowercase property == true

This means it looks like we can't use the Lowercase property on its own for scalars if we want to be consistent with how case detection is defined for strings 😔

So, I'll posit that our API should probably be something like:

  • Unicode.Scalar.hasProperty(.lowercase) – a low-level operation that returns the Unicode property value directly
  • Unicode.Scalar.isLowercase – promotes the scalar to a String and uses the transformation/detection functions defined by the Unicode spec
  • Character.isLowercase – promotes the character to a String and uses the transformation/detection functions defined by the Unicode spec

...and likewise for uppercase (and titlecase?).
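
For illustration, here's a rough sketch of that split, with hypothetical names (CaseProperty, hasProperty). To keep the sketch self-contained, the low-level query is backed here by the derived properties exposed through scalar.properties; a real implementation could instead read the UCD data via ICU (e.g. u_hasBinaryProperty):

// Hypothetical names; the enum mirrors a few of ICU's boolean properties.
enum CaseProperty {
    case lowercase, uppercase, cased
}

extension Unicode.Scalar {
    // Low-level query: the raw Unicode property value for this scalar.
    func hasProperty(_ property: CaseProperty) -> Bool {
        switch property {
        case .lowercase: return properties.isLowercase
        case .uppercase: return properties.isUppercase
        case .cased:     return properties.isCased
        }
    }

    // High-level query: promote to String and apply the spec's detection
    // rule, i.e. toLowercase(X) == X for a cased X.
    var isLowercase: Bool {
        let s = String(self)
        return s.lowercased() == s && hasProperty(.cased)
    }
}

// Character.isLowercase would be implemented the same way, by promoting the
// whole grapheme cluster to a String.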


Another thought: We discussed earlier that CharacterSet is inadequate because its definitions of lowercaseLetters and uppercaseLetters are based on general categories instead of derived properties.

But as shown above, there are still scalars (like the feminine/masculine ordinal indicators ª/º) where the property value is inconsistent with the result of the case detection function.

If, in the future, we want a Unicode.ScalarSet type that works as one would expect, I think users would expect the following to be true:

∀ (s ∈ Unicode.ScalarSet.lowercaseScalars) s.isLowercase == true
∀ (s ∉ Unicode.ScalarSet.lowercaseScalars) s.isLowercase == false

...which means we cannot implement that set in terms of the Lowercase Unicode property alone. Likely, we would need two APIs, to match the proposed pair of APIs in the previous post:

  1. Unicode.ScalarSet.lowercaseScalars is defined as the set of scalars for which s.isLowercase == true
  2. Unicode.ScalarSet(havingProperty: .lowercase) is defined as the set of scalars for which s.hasProperty(.lowercase) == true

The second one can be built directly on top of ICU uset_* APIs. The harder question is how we implement the first in a way that's both efficient and safe with respect to future changes to the Unicode data.
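
As a purely illustrative sketch of the distinction (type and member names are hypothetical, and the membership predicates reuse the derived properties and string promotion discussed above rather than ICU's uset_* machinery):

// Hypothetical ScalarSet sketch: membership is just a predicate here.
struct ScalarSet {
    let contains: (Unicode.Scalar) -> Bool

    // 1. String-function semantics: membership defined by the spec's
    //    isLowercase(X) applied to the scalar promoted to a String.
    static let lowercaseScalars = ScalarSet(contains: { scalar in
        let s = String(scalar)
        return s.lowercased() == s && scalar.properties.isCased
    })

    // 2. Raw-property semantics: the scalar has the derived Lowercase property.
    static let scalarsHavingLowercaseProperty = ScalarSet(contains: { scalar in
        scalar.properties.isLowercase
    })
}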


No, by isLowercase I meant whether the scalar has the Lowercase derived property. I quoted R1/R2 from the spec earlier, which define toUppercase(X) to be the result of applying Uppercase_Mapping to every "character" in X. This is a context-less mapping, so we wouldn't have to worry about all sequences of scalars, just all scalars themselves.

(The standard's use of the word "character" is always vague, but it usually means scalar and/or code point, and I don't see any context to suggest otherwise here.)

Yup, this is exactly what I was worried about. Case is hard, even the spec says so.

So at this point, I think it makes sense to regroup and come up with an alternate attack plan. It seems like for casing, absent a provided language or broader context, it's less clear what universal semantics on graphemes should be.

As you mentioned, I think we definitely want Unicode.Scalar to have APIs for querying properties. In addition to exposing more functionality, this gives sophisticated users a means of last resort, in a similar vein to how Character has a unicodeScalars property.

Beyond that, I'd say to defer Character casing for later. It's still worth investigating some other properties; I think isWhitespace, isNewline, and maybe isLetter/isNumber are more useful anyway.

Of the three notions of casing (general-category based, derived-property based, many-headed stringly based), I really don't think the first is interesting. We could expose general category information on Unicode.Scalar for anyone who needs control for compatibility purposes. Otherwise, go with the derived property.

As far as a scalar set type in the future goes, we'd still probably want the derived-property semantics. I'm not sure how useful such a set type would be; a function is usually more useful and convenient than a set, unless you really need to enumerate elements.
