SE-0211: Add Unicode Properties to Unicode.Scalar

Michael_Ilseman · April 26, 2018, 11:34pm

Agreed, but if the Unicode name is directly at odds with what we expect user intuition to be, I'd prefer it to be as explicitly qualified as possible. An example could be the property isEmoji, which doesn't really behave how a non-Unicode expert would expect: ("7" as Unicode.Scalar).properties.isEmoji == true.
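A quick check of that behavior, written against the Unicode.Scalar.Properties API as it later shipped in the standard library (an assumption relative to the proposal text under review here). The digit has the Emoji property because it can be the base of a keycap sequence, but it does not default to emoji presentation:

```swift
let seven: Unicode.Scalar = "7"
let grin: Unicode.Scalar = "\u{1F600}"  // GRINNING FACE

// Emoji=Yes for the digit, which is the surprising part.
print(seven.properties.isEmoji)              // true

// Emoji_Presentation is closer to the intuitive notion of "is an emoji".
print(seven.properties.isEmojiPresentation)  // false
print(grin.properties.isEmojiPresentation)   // true
```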

You'll be happy to know Unicode has several quirky notions of "lowercase". This is an argument for surfacing the raw UCD data as directly and explicitly as we can without interpretation (which is the aim of this pitch). I worry that Unicode.Scalar.isLowercase would be ambiguous as to which notion of lowercase (and potentially at odds with other String API), while something explicitly calling out "properties" or in a UCD-accessing namespace gives better context.
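To make the ambiguity concrete, here is a small sketch comparing three notions of "lowercase" on two scalars. It assumes the property names as they later shipped (isLowercase, generalCategory, changesWhenUppercased); the choice of U+02B0 is only an illustration.

```swift
let a: Unicode.Scalar = "a"
let modifierH: Unicode.Scalar = "\u{02B0}"  // MODIFIER LETTER SMALL H

// Notion 1: the derived Lowercase property.
print(a.properties.isLowercase)                                  // true
print(modifierH.properties.isLowercase)                          // true

// Notion 2: General_Category is Lowercase_Letter (Ll).
print(a.properties.generalCategory == .lowercaseLetter)          // true
print(modifierH.properties.generalCategory == .lowercaseLetter)  // false, it is Lm

// Notion 3: Changes_When_Uppercased.
print(a.properties.changesWhenUppercased)                        // true
print(modifierH.properties.changesWhenUppercased)                // false
```

All three agree for "a" but disagree for U+02B0, which is why an unqualified isLowercase directly on the scalar could mislead.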

I think my main argument is that just "Unicode" and "Scalar" is an insufficient context for directly surfacing properties with Unicode's terminology. I feel that "Property" (which the Unicode standard always uses in this context) should be in the spelling somehow.

For other Unicode.Scalar API, we don't have that constraint and can tackle them as needed.

Not quite. Unicode defines Character Properties*, one of which is called Name. Unicode.Scalar.Properties.name directly corresponds to that property. I suppose it could be spelled Unicode.Scalar.nameProperty or Unicode.Scalar.getStringProperty(.name), where "property" echoes direct correspondence to Unicode Character Properties.
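For reference, a minimal sketch of reading the Name property through the properties namespace, assuming the spelling that later shipped (name is an optional String):

```swift
let a: Unicode.Scalar = "a"

// `name` surfaces the Unicode Name property for the scalar.
let scalarName = a.properties.name
print(scalarName ?? "<no name>")  // LATIN SMALL LETTER A
```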

This direct correspondence is also why I'm wary of breaking the link or changing the name or meaning to anything other than directly echoing the standard.

* In Unicode parlance, "character" is vague but usually ends up meaning scalar or code point.

Michael_Ilseman · April 26, 2018, 11:37pm

ICU is merely the vehicle through which we access the UCD. Names and design should more closely adhere to the Unicode standard than ICU if there's discrepancy.

Paul_Cantrell · April 27, 2018, 12:07am

Thanks, Michael, I'm convinced! The proposal as it stands does seem like the best choice. Thank you for walking me (and everyone) through it.

benrimmington · April 30, 2018, 5:30am

Would the GeneralCategory be useful as an OptionSet? It could have static properties for:

Cased_Letter (LC) = [Ll, Lt, Lu]
Letter       (L)  = [Ll, Lm, Lo, Lt, Lu]
Mark         (M)  = [Mc, Me, Mn]
Number       (N)  = [Nd, Nl, No]
Punctuation  (P)  = [Pc, Pd, Pe, Pf, Pi, Po, Ps]
Symbol       (S)  = [Sc, Sk, Sm, So]
Separator    (Z)  = [Zl, Zp, Zs]
Other        (C)  = [Cc, Cf, Cn, Co, Cs]

If the proposed isDefined property is equivalent to generalCategory != .unassigned, then I'm not sure that it's worth adding. Or it could be called isAssigned instead.
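If the two are indeed equivalent, the check can be written in terms of generalCategory alone. The sketch below uses a hypothetical isAssigned helper (not part of the proposal) and assumes U+0378, a reserved gap in the Greek and Coptic block, is still unassigned in the Unicode version the runtime uses:

```swift
extension Unicode.Scalar.Properties {
  /// Hypothetical helper: true when the scalar has any General_Category
  /// other than Unassigned (Cn).
  var isAssigned: Bool { generalCategory != .unassigned }
}

let a: Unicode.Scalar = "a"
let reserved: Unicode.Scalar = "\u{0378}"  // assumed unassigned

print(a.properties.isAssigned)         // true
print(reserved.properties.isAssigned)  // false
```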

The Unicode.Version tuple might need three components, if it will also be used with the u_getUnicodeVersion API.

allevato · April 30, 2018, 2:15pm

There are no APIs currently being proposed that would take advantage of an OptionSet representation of GeneralCategory, so I'm not sure there's an advantage there; it would also introduce a disconnect between our property definition and the property in the Standard. IMO it would be misleading to have an API return an option set but for it to always have a single element.

The static properties you mention could still be exposed as Set<GeneralCategory>, or we could add Boolean computed instance properties to the GeneralCategory enum: isLetter, isNumber, etc. Do you have a strong motivating use case now? If not, either of those could be considered for a future addition.
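A rough sketch of both alternatives, layered on Unicode.GeneralCategory as it later shipped. The names isLetter, isNumber, and letterCategories are illustrative assumptions, not proposed API:

```swift
extension Unicode.GeneralCategory {
  /// Hypothetical: the Letter (L) grouping, i.e. Lu, Ll, Lt, Lm, Lo.
  var isLetter: Bool {
    switch self {
    case .uppercaseLetter, .lowercaseLetter, .titlecaseLetter,
         .modifierLetter, .otherLetter:
      return true
    default:
      return false
    }
  }

  /// Hypothetical: the Number (N) grouping, i.e. Nd, Nl, No.
  var isNumber: Bool {
    switch self {
    case .decimalNumber, .letterNumber, .otherNumber:
      return true
    default:
      return false
    }
  }
}

// The same Letter grouping expressed as a plain Set instead of an OptionSet.
let letterCategories: Set<Unicode.GeneralCategory> = [
  .uppercaseLetter, .lowercaseLetter, .titlecaseLetter,
  .modifierLetter, .otherLetter,
]

let categoryOfA = ("a" as Unicode.Scalar).properties.generalCategory
let categoryOf7 = ("7" as Unicode.Scalar).properties.generalCategory

print(categoryOfA.isLetter)                    // true
print(categoryOf7.isNumber)                    // true
print(letterCategories.contains(categoryOf7))  // false
```

Either form keeps generalCategory itself returning exactly one value, matching the property in the Standard.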

Good point; the ICU documentation appears to indicate that this is the case, but I'll run some tests to see if they differ at any code points just to be sure. The difference may only be in the surrogate code points, which can't be created as a Unicode.Scalar anyway; if that's the case, I'll drop it.

Michael_Ilseman · April 30, 2018, 4:29pm

I defined these ad hoc when implementing Character properties; it might be nice to expose them more broadly.

allevato · April 30, 2018, 4:44pm

It looks like the versions that I defined in my icu-swift library would work for you then—I can pull those into my implementation easily.

benrimmington · April 30, 2018, 8:27pm

@allevato, I don't have a strong motivating use case.

ICU has a UCHAR_GENERAL_CATEGORY_MASK, but possibly only for the u_getPropertyValueName and u_getPropertyValueEnum APIs.

ICU has patterns such as [\p{L}] or [\p{Letter}] or [\p{General_Category=Letter}], in UnicodeSet and Regular Expressions. But if regex literals will be supported by the Swift compiler, they'd probably use the Unicode standard names, rather than the GeneralCategory APIs.

Cased_Letter is in PropertyValueAliases.txt, but isn't mentioned in section 4.5 of the core specification.

John_McCall · May 2, 2018, 2:14am

Unicode doesn't claim to be standardizing how programming languages should expose these properties. It's also a specification which can be permitted a few formalisms that shouldn't be interpreted too literally. In that light:

I don't find arguments about using the exact names very convincing. Obviously we're already changing capitalization, and I don't think minor decorations like an "is" prefix are going to introduce any confusion about which property is which. More significant changes than that seem like they'd be out of line.
It seems wrong for numericValue to produce NaN instead of nil on a scalar that's not numeric. In fact, if this is supposed to be a more direct representation of Unicode, maybe it should return an optional pair of Numeric_Value and Numeric_Type properties (or maybe a payloaded enum?), since presumably (1) they're non-nil for exactly the same scalars and (2) we should discourage the assumption that the numeric value can be interpreted as if the character were always an Arabic numeral.

allevato · May 2, 2018, 2:26pm

Thanks for the feedback!

So just to be clear about what you're replying to, you're OK with the property names as written in the proposal? (Camel-case with is* prefixes except where the property is already an indicative verb phrase.)

The main reason I went with NaN was because it's what the Standard specifies as the default value in UAX #44 4.2.9. I can see the point about someone just "blindly" grabbing a non-optional Double and trying to use it though, whereas Double? requires them to more carefully consider the value.
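The "blind use" hazard can be sketched as follows. nanNumericValue(of:) is a hypothetical stand-in for the NaN-returning design, built on top of the optional numericValue that was eventually accepted, purely to contrast the two shapes:

```swift
// Hypothetical NaN-returning variant, emulating the originally proposed behavior.
func nanNumericValue(of scalar: Unicode.Scalar) -> Double {
  scalar.properties.numericValue ?? .nan
}

// Nothing forces the caller to notice that "a" is not numeric;
// the NaN simply propagates through the arithmetic.
let sum = nanNumericValue(of: "7") + nanNumericValue(of: "a")
print(sum.isNaN)  // true

// With Double?, the non-numeric case has to be handled explicitly.
if let seven = ("7" as Unicode.Scalar).properties.numericValue,
   let a = ("a" as Unicode.Scalar).properties.numericValue {
  print(seven + a)
} else {
  print("not every scalar is numeric")
}
```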

FWIW, my personal icu-swift library does exactly what you suggest and defines numericValue as an enum with a payload, going as far as to make the payload an Int instead of a Double when the Numeric_Type is one where that's possible.

For this proposal, I intentionally stayed closer to the "bare bones" of the Standard, but if there's a lot of support for the enum representation, I can do that instead.

John_McCall · May 2, 2018, 5:19pm

So just to be clear about what you’re replying to, you’re OK with the property names as written in the proposal? (Camel-case with is* prefixes except where the property is already an indicative verb phrase.)

Yes, that seems fine. Sorry, it looks like I was relying on old information; I really should've verified it.

The main reason I went with NaN was because it’s what the Standard specifies as the default value in UAX #44 4.2.9.

I think it's most appropriate to see that as a formalism. After all, there are characters whose value cannot be correctly reported in binary floating-point (such as U+2155 "1/5") or even decimal floating-point (such as U+2153 "1/3").
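A short illustration of why 1/5 is only approximated in binary floating-point, independent of any Unicode API. The stored Double is the nearest representable value, so the rounding error shows up in arithmetic and changes with precision:

```swift
let fifth = 1.0 / 5.0
let quarter = 1.0 / 4.0  // exactly representable, for contrast

// Summing three "fifths" does not land on the Double nearest to 0.6.
let fifthsAddUp = (fifth + fifth + fifth == 0.6)
print(fifthsAddUp)      // false

// A quarter is a power of two, so the same check holds exactly.
let quartersAddUp = (quarter + quarter + quarter == 0.75)
print(quartersAddUp)    // true

// Narrowing to Float and widening again changes the value, another sign
// that 0.2 is stored as an approximation.
let survivesFloat = (Double(Float(fifth)) == fifth)
print(survivesFloat)    // false
```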

FWIW, my personal icu-swift library does exactly what you suggest and defines numericValue as an enum with a payload, going as far as to make the payload an Int instead of a Double when the Numeric_Type is one where that’s possible.

That sounds great to me. Have you considered using a pair of Ints for the fractional cases?

Michael_Ilseman · May 2, 2018, 6:36pm

+1. This is the basic design philosophy of this proposal. It uses “is”, camel-case, etc., but otherwise mirrors the spec.

Given the binary compatibility needs of the standard library, we should be careful trying to layer any interpretation beyond the UCD, as the exact meaning can differ across OSes and future versions of Unicode. That being said, careful categorization into integral and rational can directly surface the UCD.

If we want to associate numeric type with value, we can provide something like the following (excessively long names are just a placeholder):

// non-frozen
enum NumericTypeAndValue {
  // non-frozen
  enum NumericValue {
    case integral(Int)
    case rational(numerator: Int, denominator: Int)
  }

  case decimal(Int)
  case digit(Int)
  case numeric(NumericValue)
}
extension Unicode.Scalar.Properties {
  var numericTypeAndValue: NumericTypeAndValue?
}

But, I think keeping them separate or paired might be better, as the enum with associated values suggests a categorization that's somewhat meaningless beyond compatibility considerations. We shouldn't try to suggest any interpretation of the difference between integral-numeric and digit:

These are currently defined to always be rational in the UCD, but we should probably keep all such enums non-frozen, as they could all be extended in the future.

Example approach for separate enum and either separate property or paired:

// non-frozen
enum NumericType {
  case decimal, digit, numeric
}
// non-frozen
enum NumericValue {
  case integral(Int)
  case rational(numerator: Int, denominator: Int)
}

extension Unicode.Scalar.Properties {
  // Separate
  var numericType: NumericType?
  var numericValue: NumericValue?

  // Alternatively, a single paired property in place of the two above:
  var numericType: (NumericType, NumericValue)?
}

The pair demonstrates the tie between the two properties, which is an upside. One downside of the pair is that tuples are effectively "frozen", but I don't know how far down the Unicode-FUD rabbit hole that concern is.

I feel like var numericValue: NumericValue? is the more useful API anyways, as NumericType is much more niche.

allevato · May 2, 2018, 7:51pm

I dug around in the ICU documentation and unfortunately I don't think an API exists that lets us access the raw string in UnicodeData.txt (which is written as a fraction). They only have the Double-returning API, so unfortunately I don't know if this can be done without taking that value and reconstituting it back into a numerator/denominator.
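One way the reconstitution could look is sketched below. This is only an illustration under stated assumptions: the NumericValue enum mirrors the shape suggested earlier in the thread, reconstitute(_:maxDenominator:) is a hypothetical helper, and the search bound and tolerance are arbitrary choices that happen to suit small UCD fractions. It tries denominators in increasing order and accepts the first one that makes the scaled value an integer within the tolerance, which also yields lowest terms.

```swift
enum NumericValue: Equatable {
  case integral(Int)
  case rational(numerator: Int, denominator: Int)
}

/// Hypothetical helper: recovers an integer or a small fraction from a Double.
func reconstitute(_ value: Double, maxDenominator: Int = 1000) -> NumericValue? {
  guard value.isFinite, maxDenominator >= 1 else { return nil }
  for denominator in 1...maxDenominator {
    let scaled = value * Double(denominator)
    let nearest = scaled.rounded()
    // Accept the first denominator that makes the scaled value integral.
    if abs(scaled - nearest) < 1e-9 {
      guard let numerator = Int(exactly: nearest) else { return nil }
      return denominator == 1
        ? .integral(numerator)
        : .rational(numerator: numerator, denominator: denominator)
    }
  }
  return nil
}

// U+2155 VULGAR FRACTION ONE FIFTH, read through the Double-returning property.
let oneFifth = ("\u{2155}" as Unicode.Scalar).properties.numericValue
let recovered = oneFifth.flatMap { reconstitute($0) }
print(recovered as Any)
```

The absolute tolerance means this approach cannot tell a genuinely tiny value from zero, so it is a heuristic rather than a faithful read of UnicodeData.txt.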

John_McCall · May 2, 2018, 7:54pm

Ah, okay, that's unfortunate.

John_McCall · May 2, 2018, 8:04pm

My biggest concern here is just that we use optionality instead of NaN. Otherwise, a paired result was just a suggestion, and I'm happy to just let you and Tony do what you think is best. I'm not sure I understand the use-case for this API anyway, since there doesn't seem to be a uniform algorithm for parsing numeric values from strings that would take advantage of it.

griotspeak · May 19, 2018, 7:28am

(I don't look in the Proposal Reviews forum enough, it seems.)

+1

I've followed the discussion in the first thread and quickly read this discussion. (the pair of enums would be lovely.)

rlovelett · May 29, 2018, 5:16pm

So May is basically over. Is there any decision on this?

Ben_Cohen · June 19, 2018, 9:02pm

Hi everyone,

Sorry for the delay in processing the final conclusion. The core team has decided to accept the proposal, with one amendment that numericValue should be optional as discussed above.
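As a sketch of what the amended shape looks like in use, assuming the API as it later shipped: numericValue is a Double? that is nil for non-numeric scalars, so it composes directly with compactMap.

```swift
let text = "a1b\u{00BD}c"  // U+00BD is VULGAR FRACTION ONE HALF

// Non-numeric scalars yield nil and are dropped.
let values = text.unicodeScalars.compactMap { $0.properties.numericValue }
print(values)  // [1.0, 0.5]

// numericType is optional in the same way.
let typeOfOne = ("1" as Unicode.Scalar).properties.numericType
let typeOfA = ("a" as Unicode.Scalar).properties.numericType
print(typeOfOne == .decimal)  // true
print(typeOfA == nil)         // true
```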

Thanks to everyone who participated!

allevato · June 19, 2018, 10:56pm

Thanks for the update! I'll update the prototype PR and kick it off for another round of review.