Adding Unicode properties to UnicodeScalar/Character

Those two were probably the most concerning ones in terms of size, in addition to General_Category (which I've added to my branch).

I was thinking that some like GraphemeClusterBreakType might be useful, but since it looks like stdlib uses a grapheme-cluster-based break iterator to determine grapheme boundaries, I'm not sure whether it would be used internally or not.

If disk space were no object, my personal goal would be to expose Swift-friendly APIs for all of ICU :slight_smile:. Of course, if all of this is going into libswiftCore, then we need to be fairly selective about what we think users will actually want.

Throwing this out there: do you think anyone would object to having a new standard library specifically for advanced extensions to Unicode.Scalar, Character, and String, along with some new APIs? It could be distributed with the toolchain but as a separate library from libswiftCore, so only those who need it will link it in. The advantage here vs. doing it as a third-party library is that we still resolve much of the pain of linking to ICU for users, on both platforms: Apple OSes, where you have to link your own copy statically or get an App Store rejection; and Linux, where renaming makes using ClangImporter problematic with the system ICU libs.

If that's a viable path, we should consider it now before adding a lot of APIs to libswiftCore, so that we don't end up with a strange delineation of those APIs.


Since the grapheme breaking rules themselves change from version-to-version, we can't have an implementation based on these properties.

This has been discussed in many different flavors (e.g. Large Proposal: Non-Standard Libraries - #42 by Michael_Ilseman), but in general I would expect such a thing to happen first as a package. As for the Linux ICU pain, we could just ship a modern ICU on Linux with Swift rather than trying to rely on the OS's. Linux has the reverse situation where Swift code is typically newer than the OS's libraries.

If by all of ICU you also mean the CLDR, then rest assured that there will be plenty of unexposed well-delineated surface area ;-) I think UnicodeScalar properties within the stdlib is sensible.

Doesn't this make it fit more as an official but separate library? Something like SwiftICU or something like that... This is very good to have, but not that commonly used. Might this be a good example for the various "separate libraries" discussions?

Yes, I agree; the rest of my post above, after that statement, discusses doing just that. :slightly_smiling_face:

But in the interest of not feature creeping this pitch/proposal thread, I'll stick with the basic Unicode properties on scalars for now.


I now have the prototype implementation and proposal write-up posted as pull requests. Anyone who's interested, please take a look!


I don't know much about Unicode but I'll give my two cents: I really like this proposal. It's well designed and described. I only have two concerns.

  1. I would prefer the alternative of a hasProperty function: I find it just as discoverable, and I find that it reads better:
// This reads as asking if the scalar properties is alphabetic 
scalar.properties.isAlphabetic
// I find the following clearer:
scalar.hasProperty(.alphabetic)

Whatever ends up being chosen, I agree that we should not bloat the Scalar API directly.

  2. Do we want the Properties struct (or enum if we go with my suggestion) on Scalar? When we look at Character, won't we reuse it? In that case, shouldn't it live directly in the Unicode "namespace"?
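For comparison, the hasProperty spelling from the first concern could be layered on top of the proposed Properties struct. This is only a rough sketch: the `BooleanScalarProperty` enum and the `hasProperty(_:)` method are hypothetical names invented for illustration, and only two Boolean properties are covered.

```swift
// Hypothetical enum naming a couple of Boolean Unicode properties.
// It is not part of the proposal; it only illustrates the alternative spelling.
enum BooleanScalarProperty {
    case alphabetic
    case whitespace
}

extension Unicode.Scalar {
    // Hypothetical method that forwards to the Properties struct.
    func hasProperty(_ property: BooleanScalarProperty) -> Bool {
        switch property {
        case .alphabetic: return properties.isAlphabetic
        case .whitespace: return properties.isWhitespace
        }
    }
}

let letter: Unicode.Scalar = "a"
let digit: Unicode.Scalar = "1"

// Both spellings answer the same question for the same scalar.
let viaStruct = letter.properties.isAlphabetic
let viaMethod = letter.hasProperty(.alphabetic)
let digitIsAlphabetic = digit.hasProperty(.alphabetic)
```

Note that this shape only covers Boolean properties, so enum-, string-, or number-valued properties would still need a home elsewhere.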

Thanks for the feedback!

If we went that direction, how would you recommend handling the non-Boolean properties? Just surface them on Unicode.Scalar directly?

One advantage to the Properties struct is that there's a single location where all of the properties defined by the Unicode Standard live, whether they're Boolean, enums, strings, numbers, etc.

The Properties struct wouldn't be reusable verbatim for Character, since many of the properties defined on Unicode.Scalar only apply to single scalars, not to grapheme clusters. For example, isGraphemeBase and isGraphemeExtend only make sense for a single scalar.
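To make the scalar-versus-grapheme distinction concrete, here is a small sketch using a single Character made of two scalars. It assumes the isGraphemeBase and isGraphemeExtend properties are exposed on Unicode.Scalar.Properties as described in the proposal.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT is one Character
// (one grapheme cluster) built from two scalars.
let accented: Character = "e\u{301}"
let scalars = Array(accented.unicodeScalars)

// Each scalar gives its own answer, so there is no single answer
// for the Character as a whole.
let baseFlags = scalars.map { $0.properties.isGraphemeBase }
let extendFlags = scalars.map { $0.properties.isGraphemeExtend }
```

The base letter reports isGraphemeBase and the combining mark reports isGraphemeExtend, which is why asking either question of the whole cluster has no obvious meaning.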

As you say in the proposal, the case mappings would be quite useful to have on Scalar directly. For the other properties, perhaps we could prefix them with Unicode to group them together:

unicodeAge
unicodeGeneralCategory
...
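As a rough sketch of what that prefixing could look like, the prefixed names can be written as thin forwarding accessors over the Properties struct. The `unicodeAge` and `unicodeGeneralCategory` names come from the suggestion above; implementing them as forwarders is just an assumption for illustration.

```swift
extension Unicode.Scalar {
    // Hypothetical prefixed accessors that forward to the Properties struct.
    var unicodeGeneralCategory: Unicode.GeneralCategory {
        return properties.generalCategory
    }

    // The age is optional because unassigned scalars have none.
    var unicodeAge: Unicode.Version? {
        return properties.age
    }
}

let lowercaseA: Unicode.Scalar = "a"
let category = lowercaseA.unicodeGeneralCategory
let age = lowercaseA.unicodeAge
```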

That makes sense, that's why I don't strongly disagree with the proposal's direction. I just feel like some of those properties feel so useful that I would have liked to have them more accessible on Scalar directly instead of always going through properties. But perhaps I'm overthinking this.

I'm not sure this is the proper way to go here, but it would be possible to have a scalar.property(.uppercase) style method with a return type depending on the property.

See this gist for an example of this: generic_property.swift · GitHub

One quirk is that the property members need to be computed properties instead of stored ones, since stored ones are not supported in generic types, and computed members don't seem to be picked up by autocomplete right now (at least in Swift Playgrounds for iPad).
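The general shape of that idea can be sketched as follows. This is not the code from the gist: `ScalarProperty`, its members, and `property(_:)` are hypothetical names, and the sketch assumes the proposed Properties struct is available to forward to.

```swift
// Hypothetical generic key whose type parameter is the property's value type.
struct ScalarProperty<Value> {
    let read: (Unicode.Scalar) -> Value
}

// Members are computed (not stored) static properties, as described above.
extension ScalarProperty where Value == Bool {
    static var alphabetic: ScalarProperty<Bool> {
        return ScalarProperty<Bool>(read: { $0.properties.isAlphabetic })
    }
}

extension ScalarProperty where Value == Unicode.GeneralCategory {
    static var generalCategory: ScalarProperty<Unicode.GeneralCategory> {
        return ScalarProperty<Unicode.GeneralCategory>(read: { $0.properties.generalCategory })
    }
}

extension Unicode.Scalar {
    // The return type follows from the key that is passed in.
    func property<Value>(_ key: ScalarProperty<Value>) -> Value {
        return key.read(self)
    }
}

let sample: Unicode.Scalar = "a"
let isAlphabetic = sample.property(.alphabetic)        // inferred as Bool
let sampleCategory = sample.property(.generalCategory) // inferred as Unicode.GeneralCategory
```

The call site stays uniform across Boolean and non-Boolean properties, at the cost of a more indirect declaration.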

It's... a bit more complicated here. Unicode defines properties on scalars but almost nothing on graphemes. Any queries on Character would be Swift inventing semantics not part of the Unicode standard. I'm working on a pitch right now to expose some of these queries, but the semantics need careful consideration and need to be more "Swifty" than "Unicody". That is, their naming and usage in common Swift code should adhere to what something really means rather than whether a bit was set in the UCD.

These scalar properties have very specific semantics, and are helpful for many expert uses, but surfacing them in a "non-expert" namespace gives a false impression of linguistic correctness. For example, the "isEmoji" property on scalars is set for ASCII numbers, which might be part of an emoji sequence somewhere. Trying to "fix" the name is hard, as their semantics are often nothing more than "is the bit set in the UCD".
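A quick sketch of that isEmoji example, assuming the isEmoji and isEmojiPresentation properties are exposed on Unicode.Scalar.Properties as in the proposal:

```swift
let asciiDigit: Unicode.Scalar = "1"
let grinningFace: Unicode.Scalar = "\u{1F600}"

// The Emoji bit is set for ASCII digits because they can start
// emoji sequences such as keycaps.
let digitIsEmoji = asciiDigit.properties.isEmoji

// Emoji_Presentation is a separate bit, and it is not set for digits.
let digitHasEmojiPresentation = asciiDigit.properties.isEmojiPresentation
let faceHasEmojiPresentation = grinningFace.properties.isEmojiPresentation
```

A user-facing "is this an emoji?" query would therefore need more than the raw isEmoji bit.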

I think @xwu is right that their names as terms-of-art should reflect Unicode connotations rather than common Swift usage connotations (hence the "expert namespace" of Properties and highly technical documentation):


As someone who is unlikely to ever need to dive to this level of scalar characteristics, I like how these accessors are all neatly contained into the properties and do not pollute the namespace of Unicode.Scalar itself.


Random question, but doesn't swift core already depend on all of ICU? Or would the proposal add an additional dependency?

It does, so this proposal doesn't add new dependencies but instead makes more features from ICU available to Swift users through stdlib without the pain of linking separately to ICU.


Love this idea. Just my two cents, I'd make the Unicode.GeneralCategory enum RawRepresentable (enum Unicode.GeneralCategory: String) so that we can access the two-letter codes if needed

This could especially be useful for readability and clarity when creating regular expressions referencing those character categories, using "\\p{\(Unicode.GeneralCategory.lowercaseLetter)}" for example.
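Until something like that exists, the same effect can be approximated with an extension. This is a minimal sketch: `twoLetterCode` and `propertyPattern(for:)` are hypothetical names, and only a handful of categories are mapped here.

```swift
extension Unicode.GeneralCategory {
    // Hypothetical accessor for the two-letter General_Category abbreviation.
    // Only a few cases are mapped in this sketch; the rest return nil.
    var twoLetterCode: String? {
        switch self {
        case .uppercaseLetter: return "Lu"
        case .lowercaseLetter: return "Ll"
        case .titlecaseLetter: return "Lt"
        case .decimalNumber: return "Nd"
        case .spaceSeparator: return "Zs"
        default: return nil
        }
    }
}

// Builds a regular expression fragment such as \p{Ll} for a category.
func propertyPattern(for category: Unicode.GeneralCategory) -> String? {
    guard let code = category.twoLetterCode else { return nil }
    return "\\p{\(code)}"
}

let lowercasePattern = propertyPattern(for: .lowercaseLetter)
```

With a String raw value on the enum itself, the interpolation would instead go through rawValue, and the optional would disappear.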


Thanks! I agree that this would be an excellent addition to the API; I've called it out in the proposal review thread.