Adding Unicode properties to UnicodeScalar/Character

allevato · February 4, 2018, 5:46pm

Another thought: We discussed earlier that CharacterSet is inadequate because its definition of lowercaseCharacters and uppercaseCharacters is based on general categories instead of derived properties.

But as shown above, there are still scalars (like feminine/masculine ordinals ª/º) where the property value is inconsistent with the result of the case detection function.

If, in the future, we want a Unicode.ScalarSet type that works as one would expect, I think users would expect the following to be true:

∀ (s ∈ Unicode.ScalarSet.lowercaseScalars) s.isLowercase == true
∀ (s ∉ Unicode.ScalarSet.lowercaseScalars) s.isLowercase == false

...which means we cannot implement that set in terms of the Lowercase Unicode property alone. Likely, we would need two APIs, to match the proposed pair of APIs in the previous post:

Unicode.ScalarSet.lowercaseScalars is defined as the set of scalars for which s.isLowercase == true
Unicode.ScalarSet(havingProperty: .lowercase) is defined as the set of scalars for which s.hasProperty(.lowercase) == true

The second one can be built directly on top of ICU uset_* APIs. The harder question is how we implement the first in a way that's both efficient and safe with respect to future changes to the Unicode data.
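
One way to picture a set "defined as the scalars for which a query is true" is a thin wrapper around a membership predicate. The sketch below is only an illustration: `PredicateScalarSet` is a hypothetical name, it does not use ICU's `uset_*` functions, and it relies on the `Unicode.Scalar.Properties` API that later shipped in the Swift standard library (Swift 5).

```swift
// Hypothetical type for illustration; not a proposed or existing API.
struct PredicateScalarSet {
  // Membership is defined entirely by a query on the scalar.
  let contains: (Unicode.Scalar) -> Bool

  // Enumerating members means scanning a range of code points and filtering.
  func members(in range: ClosedRange<UInt32>) -> [Unicode.Scalar] {
    return range.compactMap { Unicode.Scalar($0) }.filter(contains)
  }
}

// A set defined by the Lowercase derived property.
let lowercaseByProperty = PredicateScalarSet(contains: { $0.properties.isLowercase })

print(lowercaseByProperty.contains("a"))
print(lowercaseByProperty.members(in: 0x41...0x7A).count)
```

Because membership is computed from the query, the set and the query cannot disagree; the cost is that enumeration requires a scan.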

Michael_Ilseman · February 5, 2018, 1:32am

No, by isLowercase I meant whether the scalar has the lowercase derived property. I quoted R1/R2 from the spec earlier, which defines toUppercase(X) to be the result of applying Uppercase_Mapping to every "character" in X. This is a context-less mapping, so we wouldn't have to worry about all sequences of scalars, just all scalars themselves.

(The standard's use of the word "character" is always vague, but usually means scalar and/or code point and I don't see any context to think otherwise here).

Yup, this is exactly what I was worried about. Case is hard, even the spec says so.

So at this point, I think it makes sense to regroup and come up with an alternate attack plan. It seems like for casing, devoid of a provided language or more context, it's less clear what a universal semantics on graphemes should be.

As you mentioned, I think we definitely want Unicode.Scalar to have APIs for querying properties. In addition to exposing more functionality, this gives sophisticated users a means-of-last-resort, in similar vein to how Character has a unicodeScalars property.

Beyond that, I'd say to defer Character casing for later. It's still worth investigating some other properties. I think isWhitespace, isNewline, and maybe isLetter/isNumber are more useful anyway.

Of the 3 notions of casing (general category based, derived property based, many-headed stringly based), I really don't think the first is interesting. We could expose general category information on Unicode.Scalar for anyone who needs control for compatibility purposes. Otherwise, go with the derived property.

As far as having a scalar set type in the future, we'd still probably want the derived property semantics. I'm not sure how useful such a set type would be. A function is usually more useful and convenient than a set, unless you really need to enumerate elements.

allevato · February 6, 2018, 4:58am

My gut feeling is that Character case detection should work like String case detection since the Unicode spec doesn't appear to make any distinction between strings and grapheme clusters with regard to casing, meaning that the latter would be treated identically to single cluster strings.

(Aside: Even if we solve the case detection problem for Characters, we can never solve the case transformation problem in a way that's closed over Characters; that is, we can't have func Character.uppercased() -> Character that satisfies the relation S.uppercased() === join(C.uppercased() for each C in S). The obvious counterexample is ("ß" as Character).uppercased(). The uppercase mapping for "ß" is "SS", which can't be expressed as a Character.)
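
The "ß" counterexample can be checked directly. This sketch assumes the `Character.uppercased()` method that later shipped in the Swift standard library (Swift 5), which returns a `String` rather than a `Character` for exactly this reason.

```swift
let sharpS: Character = "ß"

// The result is a String because the mapping expands to two characters.
let upper = sharpS.uppercased()
print(upper, upper.count)

// Uppercasing character by character and joining agrees with uppercasing
// the whole string here, but only because each piece is allowed to be a String.
let word = "Straße"
let joined = word.map { $0.uppercased() }.joined()
print(joined == word.uppercased())
```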

Alright. As much as it worries me to punt on Character because it may open the door to inconsistencies in the future, I'm fine with making some progress forward.

So I'll kick things off with a proposed API for the Boolean property accessor. We'll use an enum to list the queryable properties, and I'll mark some of the cases with a comment if I think it deserves a "shortcut" property directly on Unicode.Scalar:

extension Unicode.Scalar {
  public func hasProperty(_ property: Unicode.BooleanProperty) -> Bool
}

extension Unicode {
  public enum BooleanProperty {
    case alphabetic    // also Unicode.Scalar.isAlphabetic
    case asciiHexDigit
    case bidiControl
    case bidiMirrored
    case dash
    case defaultIgnorableCodePoint
    case deprecated
    case diacritic
    case extender
    case fullCompositionExclusion
    case graphemeBase
    case graphemeExtend
    case graphemeLink
    case hexDigit
    case hyphen
    case idContinue
    case idStart
    case ideographic
    case ideographicDescriptionSequenceBinaryOperator
    case ideographicDescriptionSequenceTrinaryOperator
    case joinControl
    case logicalOrderException
    case lowercase    // also Unicode.Scalar.isLowercase
    case math
    case noncharacterCodePoint
    case quotationMark
    case radical
    case softDotted
    case terminalPunctuation
    case unifiedIdeograph
    case uppercase    // also Unicode.Scalar.isUppercase
    case whitespace    // also Unicode.Scalar.isWhitespace
    case xidContinue    // also Unicode.Scalar.isIdentifierContinuation
    case xidStart    // also Unicode.Scalar.isIdentifierStart
    case caseSensitive
    case sentenceTerminal
    case variationSelector
    case nfdInert
    case nfkdInert
    case nfcInert
    case nfkcInert
    case segmentStarter
    case patternSyntax
    case patternWhitespace
    case posixAlnum    // also Unicode.Scalar.isPOSIXAlnum
    case posixBlank    // also Unicode.Scalar.isPOSIXBlank
    case posixGraph    // also Unicode.Scalar.isPOSIXGraph
    case posixPrint    // also Unicode.Scalar.isPOSIXPrint
    case posixXDigit    // also Unicode.Scalar.isPOSIXXDigit
    case cased
    case caseIgnorable
    case changesWhenLowercased
    case changesWhenUppercased
    case changesWhenTitlecased
    case changesWhenCasefolded
    case changesWhenCasemapped
    case changesWhenNFKCCasefolded
    case emoji
    case emojiPresentation
    case emojiModifier
    case emojiModifierBase
    case emojiComponent
    case regionalIndicator
    case prependedConcatenationMark
  }
}
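
As a rough illustration of how the enum-based accessor would read at a call site, here is a cut-down sketch with four cases. The names `BooleanPropertySketch` and `hasPropertySketch` are hypothetical, and the lookups are forwarded to the `Unicode.Scalar.Properties` API that later shipped in Swift 5 instead of calling ICU directly.

```swift
// Hypothetical, reduced version of the proposed enum (four cases only).
enum BooleanPropertySketch {
  case alphabetic
  case lowercase
  case uppercase
  case whitespace
}

extension Unicode.Scalar {
  // Hypothetical accessor; each case forwards to one Boolean property.
  func hasPropertySketch(_ property: BooleanPropertySketch) -> Bool {
    switch property {
    case .alphabetic: return properties.isAlphabetic
    case .lowercase:  return properties.isLowercase
    case .uppercase:  return properties.isUppercase
    case .whitespace: return properties.isWhitespace
    }
  }
}

let letter: Unicode.Scalar = "a"
let emSpace: Unicode.Scalar = "\u{2003}"
print(letter.hasPropertySketch(.lowercase), emSpace.hasPropertySketch(.whitespace))
```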

I've tweaked the names of the properties in a way that I think makes them fit into Swift better. The ones that are also surfaced as Unicode.Scalar.is* properties are chosen as the set that I think is likely to be commonly used, but I'm open to both bikeshedding and reëvaluating that list.

Notably missing is isDigit, which isn't expressed as a Boolean property. We can provide it by returning general category == U_DECIMAL_DIGIT_NUMBER.
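
A minimal sketch of that general-category check, assuming the `generalCategory` property and `Unicode.GeneralCategory.decimalNumber` case that later shipped in Swift 5 (the counterpart of ICU's U_DECIMAL_DIGIT_NUMBER). The property name `isDecimalDigitSketch` is hypothetical.

```swift
extension Unicode.Scalar {
  // True only for general category Nd (Decimal_Number).
  var isDecimalDigitSketch: Bool {
    return properties.generalCategory == .decimalNumber
  }
}

let asciiSeven: Unicode.Scalar = "7"
let arabicIndicThree: Unicode.Scalar = "\u{0663}"
let oneHalf: Unicode.Scalar = "\u{00BD}"   // general category No, not Nd

print(asciiSeven.isDecimalDigitSketch, arabicIndicThree.isDecimalDigitSketch, oneHalf.isDecimalDigitSketch)
```

Note that fractions and Roman numerals have numeric values but are not decimal digits under this definition.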

Michael_Ilseman · February 6, 2018, 6:17pm

ª (U+00AA) is considered to be both cased and lowercase in the UCD but does not have a case mapping to transform it into another representation. This is an example where the level-1 notion of scalar casing (general category based) would say false while level 2 (derived property based) would say true. The reason it's not considered level-3 cased (string function based) is that without a case mapping transformation, it is invariant to case conversion, and thus the string function isCased() always returns false. The string based functions are all based on case mappings:
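
The three levels can be compared side by side for ª. This is a sketch that assumes the `Unicode.Scalar.Properties` API that later shipped in Swift 5; the "level 3" check here is a simplified stand-in (does any case conversion change the string?) rather than the spec's exact isCased() definition.

```swift
let ordinal: Unicode.Scalar = "\u{00AA}"   // FEMININE ORDINAL INDICATOR

// Level 1: general category based (is it Lowercase_Letter?).
let lowercaseByCategory = ordinal.properties.generalCategory == .lowercaseLetter

// Level 2: derived property based.
let lowercaseByProperty = ordinal.properties.isLowercase

// Level 3 (simplified): does case conversion change the text at all?
let text = String(ordinal)
let changesUnderCaseMapping = text.uppercased() != text || text.lowercased() != text

print(lowercaseByCategory, lowercaseByProperty, changesUnderCaseMapping)
```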

Convenience link for the properties: Unicode Utilities: Character Properties

edit: And this convenience link for what all the properties are: https://unicode.org/cldr/utility/properties.jsp

Right, this is why I'm leaning towards deferring a casing solution for Character for now until we've explored some of the other properties we want. That might sway us more in one direction or the other. We can keep investigating, I just want to also make forward progress on the others.

Sorry, I meant deferring casing for Character for now, but not necessarily punting out of this release. I think exploring the other properties on Character will help build our reasoning about whether Character is "more like Unicode.Scalar" or "more like String".

I think this is a good start. What are the kinds of non-boolean properties and could they fit together? Could an API have an enum with associated values to handle all properties?

Could we provide the functionality without the enum? One alternative, if we had something like:

extension Unicode.Scalar {
  // Some kind of lazy collection that has queries on it
  struct Properties {}

  var properties: Properties { get }
}

This would also "namespace" exhaustive query APIs together, where they can be present for code completion and discovery without getting in the way.

A separate/subsequent design task would then be convenience queries directly on Unicode.Scalar and Character.

allevato · February 9, 2018, 4:04am

Yeah, that's another good option for namespacing. I'd be happy with either one—I was leaning initially toward the enum because it maps well to other ICU concepts, like if we wanted a future Unicode.ScalarSet(havingProperty: .foo), but it's probably unlikely that we'd need to support all 64 properties in that API if we even have it at all. And like you mention, a nested Properties struct also lets us put other non-Boolean properties there more easily.

In that world, the Boolean properties are fairly straightforward:

extension Unicode {
  public struct Properties {
    public var isAlphabetic: Bool { get }
    public var isASCIIHexDigit: Bool { get }
    public var isBidiControl: Bool { get }
    public var isBidiMirrored: Bool { get }
    public var isDash: Bool { get }
    // ...and so on down the list
  }
}
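
For a feel of the call site, the following sketch queries those five properties through the nested struct. It assumes the `Unicode.Scalar.properties` accessor as it later shipped in Swift 5; the scalars were chosen only because their property values are well known.

```swift
let eAcute: Unicode.Scalar = "\u{00E9}"          // é
let capitalF: Unicode.Scalar = "F"
let leftToRightMark: Unicode.Scalar = "\u{200E}"
let openParen: Unicode.Scalar = "("
let hyphenMinus: Unicode.Scalar = "-"

// Every query is namespaced under `properties`, which keeps the
// exhaustive list out of Unicode.Scalar's own member list.
print(eAcute.properties.isAlphabetic)
print(capitalF.properties.isASCIIHexDigit)
print(leftToRightMark.properties.isBidiControl)
print(openParen.properties.isBidiMirrored)
print(hyphenMinus.properties.isDash)
```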

Outside of the Boolean properties, I don't see very many that I think we would need to surface, at least not in a first version of an API—they're fairly advanced/specific, so unless we wanted to expose all of ICU in the standard library (which I assume is a non-goal), the Booleans and the ones below probably cover the functionality most people would want:

It looks like there are (at least?) two notions of the numeric value of a scalar. One is u_getNumericValue, which returns a floating-point value. This one is pretty flexible, even supporting fractions like 'VULGAR FRACTION ONE FIFTH' (U+2155) which has a numeric value of 0.2. We could expose this as Double.init?(_ scalar: Unicode.Scalar).
Likewise, there's u_digit, which we may want to expose in a form such as Int.init?(_ scalar: Unicode.Scalar, radix: Int = 10).
We can support the inverse of the one above, u_forDigit: Unicode.Scalar.init?(digit: Int, radix: Int = 10).
Of the rest, I could imagine exposing the general category (u_charType) and allocation block (ublock_getCode) could be useful for some kinds of processing, but we'd have to define some pretty big enums to cover those and I'm not convinced we need them yet.
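
As a sketch of the numeric-value idea, the snippet below uses the optional `numericValue` property that later shipped on `Unicode.Scalar.Properties` in Swift 5, rather than the `Double.init?` spelling suggested above. `Character.wholeNumberValue` is also a later standard-library addition and is shown only as a loose analogue of a digit lookup.

```swift
let oneFifth: Unicode.Scalar = "\u{2155}"   // VULGAR FRACTION ONE FIFTH
let seven: Unicode.Scalar = "7"
let letterX: Unicode.Scalar = "x"

// numericValue is nil for scalars that have no numeric value.
if let value = oneFifth.properties.numericValue {
  print(value)
}
print(seven.properties.numericValue as Any, letterX.properties.numericValue as Any)

// Whole-number lookup at the Character level.
let arabicIndicThree: Character = "\u{0663}"
print(arabicIndicThree.wholeNumberValue as Any)
```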

WDYT?

Michael_Ilseman · February 11, 2018, 1:27am

Barring any further motivation, I think we should go with ICU APIs that are more geared towards what's surfaced in the UCD instead of ones providing Java-compatible semantics. In that case, numericValue is relevant but not Java's digit(). But this is dependent on the use case.

I think exposing the general category is also useful, at the very least because many properties are phrased in terms of them.

There are also properties that aid discovery, debugging, playing around, etc. General category (alongside full name and abbreviated), a scalar's name (aka ICU-Swift's name()), its script (also with short/long name), age, etc. A good guide could be the kinds of things someone would want to have in order to write a tool akin to UniViewSVG
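
A small "describe this scalar" helper gives a sense of what such discovery properties enable. This is a sketch assuming the `name`, `generalCategory`, and `age` properties as they later shipped in Swift 5; `describe` is a hypothetical helper, and script is left out because no such property is assumed here.

```swift
// Hypothetical helper that formats a one-line summary of a scalar.
func describe(_ scalar: Unicode.Scalar) -> String {
  let props = scalar.properties
  let hex = String(scalar.value, radix: 16, uppercase: true)
  let padded = String(repeating: "0", count: max(0, 4 - hex.count)) + hex
  let name = props.name ?? "<unnamed>"
  // age is an optional (major, minor) pair; nil means unassigned.
  let age = props.age.map { "\($0.major).\($0.minor)" } ?? "unassigned"
  return "U+\(padded) \(name) [\(props.generalCategory)] since Unicode \(age)"
}

print(describe("a"))
print(describe("\u{1F600}"))
```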

There's a few properties that might be useful in the specific context of Swift. One would be hasBoundaryBefore, which is tied to a particular choice of normalization form, and the stdlib would use the one that it uses for comparison. isInert, et al are a little less useful but might be interesting. The standard library could even start using these properties today (well, after PR-13877 is merged).

I'm slightly in favor of various segmentation properties as well, some of which are enums. These are pretty low-level, even for Unicode enthusiasts, but at the very least the stdlib internally will want to make use of them.

dabrahams · February 11, 2018, 3:41pm

Very happy to see y'all are looking into this so deeply; thanks!

allevato · February 11, 2018, 8:40pm

It's funny that you mention that, because my original motivation for implementing icu-swift was to do exactly that as a personal project to explore writing a small Kitura app. But like most of my personal projects nowadays, I got distracted after writing the low-level bits.

(Also, importing ICU's C APIs into Swift on Linux using the traditional Swift PM module map approach is fairly difficult, because the system packages are compiled with version-suffixed names and the #defines that strip the suffixes are ignored by the importer. So that enhances the case for putting more into the standard library, IMO.)

Anyway, it sounds like you're interested in having quite a bit deeper support for ICU than I anticipated. That's fine by me! I've been leaning a bit conservatively because I wasn't sure how much people would be comfortable adding to the standard library, but if you think properties like the ones you mentioned above would be generally useful, I'm happy to include them in the proposal.

I'll look at writing all this up into a draft proposal over the next couple days and post it to this thread for some more discussion.

Michael_Ilseman · February 11, 2018, 10:28pm

Yeah, I had to do all of these things (hackily) by myself in the past. I'm so happy to leverage your suffering and just use Swift-ICU for experimentation now! When we're done here, we can commiserate over U_DISABLE_RENAMING.

(We could probably do a better job in how we bundle ICU in Linux toolchains with Swift, but that's a very different topic.)

Right, I think it's good to have them all nested inside Properties. Scalar properties are never the be-all-end-all answer for human-presentable text (especially in the context of a specific locale), so this helps keep it organized into an enthusiasts/experts section. We'll always have high-level support for human-presentable text and we'll figure out what makes sense to expose directly on String/Character/UnicodeScalar. But at the very least, the properties are there for when you really need it.

I also like exposing some of the standard library's internal tools and techniques as API, so long as they surface well and have an obvious place to go. These properties would be a great fit.

Awesome! I would go with Unicode.Scalar.Properties as a resilient (non-fixed-layout, however it's spelled) struct without any stored properties at first, but that we can add some to for e.g. caching. It's also probably time for a strawman enumeration of desired properties, possibly excluding some legacy or highly ICU/vendor-specific ones.

When we have something pretty solid, we can spin off a new thread for it and either close this thread or repurpose it for discussing what to surface to Unicode.Scalar/Character/String.

(Also let me know if you'd like my help as a co-author, though I'd probably not be contributing much until after I nail down more ABI details).

Michael_Ilseman · March 6, 2018, 10:22pm

If it helps to get started, here's a rough sketch of one way to expose this:

extension Unicode.Scalar {
  // Query properties provided by the UCD...
  // <insert comment about being for expert/low-level use>
  struct Properties {

    // Boolean properties

    // U+000B, U+000C, U+0085, U+2028, U+2029
    var isNewline: Bool { get }

    // Has derived property Uppercase.
    // <extended documentation about specific semantics>
    var isUppercase: Bool { get }
    // ... isLowercase, isCased, ...

    // Has derived property White_Space
    // <extended documentation about word breaking>
    var isWhitespace: Bool { get }

    // Has derived property Hex_Digit
    var isHexDigit: Bool { get }
    // ... isASCIIHexDigit, 
    
    // Has derived property Alphabetic
    var isAlphabetic: Bool { get }
    // ... isMath, isLetter, isControl, isPunctuation, isQuotationMark, isDiacritic, 
    // ... isIdeographic, isRadical, isDash, isRegionalIndicator, isNumeric, ...

    // <misc properties, more so for discovery/enthusiasts>
    var generalCategory: ??? { get }
    var age: Unicode.Version { get } // Whatever Unicode.Version is...
    var script: ??? { get }
    var block: ??? { get }

    // <insert precise semantics based on UCD>
    var numericValue: Double { get }
    var numericType: ??? { get }

    // <some kind of random-access collection of 
    //  Unicode.Scalar. insert comment/disclaimer about what
    //  casing means. Maybe just return a String as we get
    //  small-string optimizations soon>
    var uppercaseMapping: ??? { get }
    var lowercaseMapping: ??? { get }
    var titlecaseMapping: ??? { get }

    // <some way of exposing conditions concerning casing>
    var caseCondition: ??? { get }

    // Normalization-based queries...
    var canonicalCombiningClass: Int { get }
    var isFullCompositionExclusion: Bool { get }

    // <The following use stdlib's preferred normal form
    //  for comparisons, which may change between releases>
    var hasBoundaryBefore: Bool { get }
    // ...

    // <some kind of stored properties, such as the scalar
    //  itself and perhaps a cached option set of 
    //  some of the ICU queries>
  }

  var properties: Properties { get }
}

edit: Text segmentation does define a Newline property, so use that definition.

allevato · March 7, 2018, 2:36pm

Sorry for the delay on getting this together—my free time has been more limited than I thought.

It looks like we're on similar pages, so here's what I have so far; I wanted to spend some more time on it but I'll go ahead and post what I have for now:

https://gist.github.com/allevato/6440d2c6a27f92381d47bf2cd16d36c1

swift-unicode-improvements.md

# Add Unicode Properties to `Unicode.Scalar`

* Proposal: [SE-NNNN](NNNN-filename.md)
* Authors: [Tony Allevato](https://github.com/allevato)
* Review Manager: TBD
* Status: **Awaiting implementation**

*During the review process, add the following fields as needed:*

* Implementation: [apple/swift#NNNNN](https://github.com/apple/swift/pull/NNNNN)


That's mostly just a dump of many of the common ICU properties and a direct mapping of them to Swift APIs. A lot of it is based on my icu-swift work, but I've gone through and tried improving the names and smoothing out some other edges since I implemented that the first time.

xwu · March 7, 2018, 4:57pm

Like it. Thanks for all this work.

Brief comment: for properties, especially the “is*” ones, it’d be nice to hew closer to the Unicode names—easier to find for those who are experienced in these details, and not really any less clear for those that aren’t, since no name will fully explain. These are definitely all terms of art.

For example, “extendsPrecedingScalar” is nice but “isExtender” is predictable. If a user doesn’t know what “extender” refers to, neither name tells them what it means to “extend” a Unicode scalar. But being able to glance and know that some Swift property clearly maps to a particular Unicode property and not some modified version of it or another property I don’t know about is a plus.

allevato · March 7, 2018, 5:05pm

Yeah, I went back and forth a lot on naming. The names in my UnicodeScalar+BooleanProperties.swift are essentially direct translations from the underlying Unicode names, whereas in this version I tried to make them a bit more "poetically Swift".

I'd be happy with either naming scheme, TBH.

Michael_Ilseman · March 7, 2018, 6:28pm

No worries at all! Let me know when/how/if I can help with moving things along.

--

This is looking great! I agree with @xwu regarding naming.

If you're using ICU's properties as a guide, translate the "UCD Name" column. It seems like you're already doing this, but definitely make sure to exclude anything with a "c" in the un-labeled column (I'm undecided about excluding entries without "(U)").

Comments defining behavior probably shouldn't be phrased in terms of ICU (implementation detail), but rather Unicode and the UCD.

example:

~~public var extendsPrecedingScalar: Bool { get } // UCHAR_EXTENDER~~

 // Has derived property ["Extender"](https://www.unicode.org/reports/tr44/#Extender)
public var isExtender: Bool { get } // (Implementation: UCHAR_EXTENDER)

(I really appreciate you providing the ICU mappings to help implementation!)

I'm not sure of the documentation conventions here, and whether we should parrot the spec's description. E.g. "Extender -- Characters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks.". Alternatively, we just hyperlink.

CC @nnnnnnnn for advice. We also really want to clarify we're talking about expert-use UCD semantics and that this is not necessarily generalizable to presenting results for human consumption. For example, whitespace detection is very useful for source-code processing tools, but is a hazard if you're relying on it to present text to a user in a language you haven't anticipated.
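
To make the source-code-processing case concrete, here is a sketch of a splitter that treats any scalar with the White_Space property as a separator. It assumes the `isWhitespace` property as it later shipped in Swift 5; `tokens(in:)` is a hypothetical helper, not part of any proposal.

```swift
// Hypothetical helper: split on scalars that have the White_Space property.
func tokens(in source: String) -> [String] {
  return source.unicodeScalars
    .split(whereSeparator: { $0.properties.isWhitespace })
    .map { String(String.UnicodeScalarView($0)) }
}

// IDEOGRAPHIC SPACE, NO-BREAK SPACE, and TAB all count as separators.
let pieces = tokens(in: "let\u{3000}x\u{00A0}=\t1")
print(pieces)
```

Splitting at the scalar level like this suits a tokenizer, but it says nothing about how a given language or locale expects text to be broken for display.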

--

For case mappings, I thought there was a more modern approach that recognizes that a scalar may expand to multiple other scalars. E.g. CaseFolding.txt has several multi-scalar mappings, not to mention SpecialCasing.txt. I think we'd also like to expose case condition.

Case mappings would then return a String or perhaps a String.UnicodeScalarView (utilizing small-string optimizations to avoid allocation). I'm unfamiliar with bidi and whether that also has a similar issue.
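
A sketch of what String-valued mappings look like in practice, assuming the `uppercaseMapping`, `lowercaseMapping`, and `titlecaseMapping` properties as they later shipped on `Unicode.Scalar.Properties` in Swift 5. The ligature is used because its mappings expand to two scalars.

```swift
let ligature: Unicode.Scalar = "\u{FB01}"   // LATIN SMALL LIGATURE FI
let mappings = ligature.properties

// Each mapping is a String, so a one-scalar input may yield several scalars.
print(mappings.uppercaseMapping)
print(mappings.titlecaseMapping)
print(mappings.lowercaseMapping == String(ligature))
print(mappings.uppercaseMapping.unicodeScalars.count)
```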

--

edit: Age should probably be (major, minor), à la UAX #44: Unicode Character Database

nnnnnnnn · March 7, 2018, 6:43pm

The floating-point docs are probably the most relevant example—always try to explain in plain English with examples, and for methods that implement specific requirements we have an extra sentence, e.g.:

This method implements the remainder operation defined by the IEEE 754 specification.

A link is great; the more specific the better.

allevato · March 17, 2018, 6:06pm

I've started making some progress on an implementation of this. So far, it's pretty straightforward—I've taken the non-deprecated binary properties defined by the Unicode Standard (not any that are strictly ICU additions, yet) and implemented and documented them. Thanks for the documentation tips! This is just a start, and I'll end up fleshing them out a bit more. There's still a lot of work to do there. (For example, as you mentioned, we should clarify more about how properties like casing and whitespace work.)

The work so far is pushed in this branch: https://github.com/allevato/swift/compare/master...unicode-properties

I plan to keep chipping away at this in the near term as my time allows, adding more of the common properties we discussed above.

One question I'd like some input on: when I start adding enum-typed properties, some of those enums will be quite large. The easiest thing to do would be to make them RawRepresentable with the underlying ICU-defined integer values as the raw values, but that's a leaky abstraction and I assume we don't want that (what I wouldn't give for internal conformance right now!).

So, right now I'm planning to just write those enums by hand, along with internal inits with large switches to convert them from their raw values. I could also use GYB to simplify this a bit, but the cost there is we're adding another place where we're using GYB, which isn't great. Any strong opinions either way?
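
The hand-written approach might look roughly like this for a small enum. Everything here is hypothetical: the type name, the initializer label, and in particular the integer values, which are modeled on ICU's UNumericType but are placeholders that would need checking against the ICU headers.

```swift
// Hypothetical public-facing enum with no exposed raw values.
enum NumericTypeSketch {
  case decimal
  case digit
  case numeric

  // Would be internal in the standard library: converts an ICU-style
  // integer into a case. The numbers below are illustrative placeholders.
  init?(icuRawValue: Int32) {
    switch icuRawValue {
    case 1: self = .decimal
    case 2: self = .digit
    case 3: self = .numeric
    default: return nil   // includes the "none" value and anything unknown
    }
  }
}

print(NumericTypeSketch(icuRawValue: 3) as Any)
```

Keeping the conversion in an initializer means the integer values never appear in the public interface, at the cost of one switch per enum.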

ckeithray · March 17, 2018, 6:26pm

I'm glad you're doing this.

I wonder if it would be appropriate to have static methods or properties returning a set or array of characters corresponding to "is..." properties? For example, "isHexDigit" and "static var hexDigits: Set".

For most properties this is pretty unchanging, but there are new emoji every year and returning every variation of people and family emojis can create a long list.

Also, I was contemplating creating methods to create the people and family emojis that take color modifiers, and other multi-scalar emojis like national flags. flag("ca") == "🇨🇦"

And multi-scalar character builders and deconstructors: "e".addAcuteAccent() == "é", "é".hasAcuteAccent == true, etc.

What do you think?

allevato · March 17, 2018, 6:44pm

This could be done by providing a UnicodeScalarSet that is properly implemented on top of the ICU character set API. The existing CharacterSet Foundation type has some deficiencies in this regard, so I would support doing this—but I want to keep it as a separate proposal otherwise this one will get unwieldy.

These sound like interesting ideas for a third-party library, but they seem a bit specialized to include in the standard library itself. My goal is to give folks the building blocks to write more advanced stuff like that on their own. (And these specific examples can be written today, since you only need to inspect specific Unicode scalars. The properties described in the thread above can't be accessed today from Swift without bridging and linking directly to ICU, which is currently a painful experience.)
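
As an illustration that helpers like these need only scalar arithmetic, here is a sketch of a `flag` function and an accent composition. `flag(_:)` is a hypothetical helper that returns nil for anything other than two ASCII letters; it does not check that the code is a real region.

```swift
// Hypothetical helper: maps a two-letter ASCII code to regional indicators.
func flag(_ regionCode: String) -> String? {
  let base: UInt32 = 0x1F1E6   // REGIONAL INDICATOR SYMBOL LETTER A
  var result = String.UnicodeScalarView()
  for scalar in regionCode.lowercased().unicodeScalars {
    guard scalar.value >= 0x61, scalar.value <= 0x7A,
          let indicator = Unicode.Scalar(base + scalar.value - 0x61)
    else { return nil }
    result.append(indicator)
  }
  return result.count == 2 ? String(result) : nil
}

print(flag("ca") ?? "invalid")

// Appending COMBINING ACUTE ACCENT yields a string that compares equal
// to the precomposed form, because String comparison uses canonical equivalence.
let composed = "e" + "\u{0301}"
print(composed == "\u{00E9}")
```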

Michael_Ilseman · March 18, 2018, 4:16pm

So far this is looking great! I couldn't comment directly on your repo, so here's a couple nits:

For isDefaultIgnorableCodePoint, the comment states Alphabetic property.

LogicalOrderException's comment has an uncommented line break.

Slight change to documentation regarding casing (similarly for uppercase):

-  /// A Boolean property indicating whether the scalar is lowercase.
+  /// A Boolean property indicating whether the scalar's letterform is considered lowercase.

The lowercase derived property is a guide to what users commonly think, but not a firm answer:

--

I'll have to get back to you regarding the enums.

Michael_Ilseman · March 19, 2018, 8:16pm

@allevato

I think we definitely want general category, however we roll that out. This is a very useful query for compatibility with old-fashioned Unicode processing.

For script and block, which are pretty large String enums, it's a tradeoff against stdlib binary size. Depending on the impact, I'd say we could keep or defer them for later.

What other enums were you considering?