SE-0211: Add Unicode Properties to Unicode.Scalar

The review of SE-0211: Add Unicode Properties to Unicode.Scalar begins now and runs through May 1, 2018.

Reviews are an important part of the Swift evolution process. All reviews should be made in this thread on the Swift forums or, if you would like to keep your feedback private, directly to me via email or forum DM as the review manager.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

  • What is your evaluation of the proposal?

  • Is the problem being addressed significant enough to warrant a change to Swift?

  • Does this proposal fit well with the feel and direction of Swift?

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thanks,
Ben Cohen
Review Manager

(Disclaimer: I participated heavily in the pitch phase; I am not impartial)

+1, fills an important for-expert-use gap in String

Yes, there is otherwise no good way to query many of these properties.

Yes, especially as presented in a carved-out namespace

I've used ICU extensively, and this surfaces much of the useful functionality directly in Swift.

In-depth study.

I don't have much to say here, so I won't do a full question-by-question response. I think this is straightforward, well designed and useful, and I look forward to the similar proposal for Character.

As I mentioned in the discussion, I'm not sure about the numericValue property. IIUC, such things are typically expressed as failable initialisers in Swift. The same might also apply to the upper/lower case mapping Strings. This is a question of API style, and I'm not sure what the right approach is.

  • I'm not sure about the name "age" in var age: Unicode.Version?. Age typically refers to a duration between two points in time, not a reference to a specific event. I think a name like introduced or introducedInVersion would be clearer and look better at the call-site.
  • Should Unicode.Scalar.Properties.Version really be nested inside Properties? I think hoisting it a few levels to Unicode.Version would be clearer.
  • Also, if we made Unicode.Version a struct, we could make it Comparable and Equatable. I'm not exactly sure why anybody would care about which version the scalar was introduced with, but if you did, it's probably because you want to compare that version with some known version.
  • If we expose a unicode version on the scalars, it would probably make sense to expose the version of the Unicode standard which the currently-installed ICU conforms to.
  • For the generalCategory property - what is the difference between a nil value, .unassigned, and an unknown case? (I'm assuming Unicode.GeneralCategory is non-frozen).

EDIT: Apparently, the General_Category enum is frozen since Unicode 2.1.3 (Character Encoding Stability)
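For reference, here is a minimal sketch of how the generalCategory question plays out in the standard library API as it later shipped (Swift 5 and newer). There the property is non-optional, and a code point with no assigned character reports .unassigned. The choice of U+50000 as an unassigned scalar is an assumption that holds for current Unicode versions.

```swift
// "A" is an uppercase letter (Lu); "1" is a decimal digit (Nd).
let letter: Unicode.Scalar = "A"
let digit: Unicode.Scalar = "1"

// U+50000 lies in a plane with no assigned characters (assumption: this is
// still true for the Unicode version your toolchain ships with).
let unassigned = Unicode.Scalar(0x50000 as UInt32)!

// generalCategory is a plain enum value here, not an optional.
let categories = [letter, digit, unassigned].map { $0.properties.generalCategory }
print(categories)
```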

Thanks for the feedback!

As mentioned in the "Alternatives Considered" section of the proposal, we considered other "Swiftier" names for many of these properties. Given that these properties are meant for advanced usage by folks who Know What They're Doing™, it's more important to stick with the name that corresponds directly to the name as it's defined in the Unicode Standard; deviating from those names or scattering the properties across multiple types only serves to make them less discoverable for the intended audience.

For example, consider that numericValue must return a Double because of the nature of how that property is defined; I would imagine that most users who want to implement an algorithm based on their knowledge of those properties would not go searching for Double(unicodeScalar:), whose name gives no indication of which property in the Standard it maps to. It also separates it from its companion property numericType.

I would imagine that the upcoming proposal for Character properties would support the types of failable initializers that you mention, because those are more user-facing, but the scalar properties are more advanced and the benefits of making them discoverable for those advanced users outweigh the other concerns, IMO.
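To make the two API styles concrete, here is a rough sketch of the property spelling next to a failable initializer layered on top of it. The initializer Double(unicodeScalar:) is hypothetical and exists only for illustration; the sketch also assumes the standard library API as it later shipped, where numericValue is a Double? rather than the NaN-returning Double discussed in this thread.

```swift
// Hypothetical initializer, for illustration only. It is not part of the
// proposal or of the standard library.
extension Double {
    init?(unicodeScalar scalar: Unicode.Scalar) {
        // Shipped API: numericValue is nil for scalars with no numeric value.
        guard let value = scalar.properties.numericValue else { return nil }
        self = value
    }
}

// Property style: the name mirrors Numeric_Value in the Unicode Standard.
let viaProperty = ("7" as Unicode.Scalar).properties.numericValue

// Initializer style: reads like other Swift conversions, but the link to
// Numeric_Value is no longer visible at the call site.
let viaInitializer = Double(unicodeScalar: "7")
let notNumeric = Double(unicodeScalar: "x")
```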

That's a typo from when I reformatted a few parts of the proposal recently; my intention was to have it defined in Unicode. Thanks for pointing it out!

Since the standard library implements ad hoc relational operators for Equatable/Comparable tuples of arity ≤ 6, these ages can already be compared; they just can't be used generically where the protocol itself is a constraint. I'm not opposed to making it a struct and conforming if someone makes a compelling case, but I'm not sure how useful it would be.
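A small sketch of that trade-off: the tuple form compares fine with the built-in operators, but only a nominal type can conform to Comparable and be used with generic algorithms. VersionTuple and UnicodeVersion are hypothetical names used for illustration, not the proposal's declarations.

```swift
// Tuples of up to six Comparable elements get <, <=, >, >= and == from the
// standard library, compared element by element from left to right.
typealias VersionTuple = (major: Int, minor: Int)
let older: VersionTuple = (major: 9, minor: 0)
let newer: VersionTuple = (major: 10, minor: 0)
let tupleOrdering = older < newer

// Tuples cannot conform to protocols, so calling max() or sorted() with no
// closure on [VersionTuple] does not compile. A struct can conform instead.
struct UnicodeVersion: Comparable {
    var major: Int
    var minor: Int

    static func < (lhs: UnicodeVersion, rhs: UnicodeVersion) -> Bool {
        // Reuse the tuple operator for the lexicographic comparison.
        (lhs.major, lhs.minor) < (rhs.major, rhs.minor)
    }
}

// Generic algorithms constrained to Comparable now work directly.
let newest = [
    UnicodeVersion(major: 6, minor: 1),
    UnicodeVersion(major: 10, minor: 0),
    UnicodeVersion(major: 9, minor: 0),
].max()
```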

That's a good idea—I'd be fine with adding a static property to Unicode that maps to u_getUnicodeVersion.

Looks like I missed this after implementing it; in the actual implementation, the property is non-optional: [SE-0211] Add Unicode properties to Unicode.Scalar by allevato · Pull Request #15593 · apple/swift · GitHub

@AliSoftware in the companion pitch thread for this proposal had an excellent suggestion:

Adding this would be a trivial change and I think it would be an excellent enhancement to the API.

Uninformed Unicode question: Can numericValue have a sensible value when numericType is nil?

When numericType == nil (corresponding to Numeric_Type == None), the Standard defines the default value of Numeric_Value to be NaN. The Swift implementation thus also returns Double.nan in this situation (which is different than ICU's behavior, but more accurate to the Standard).
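For anyone who wants to poke at the pairing, here is a short sketch. Note that it runs against the standard library API as it later shipped, where the "no numeric value" case is modeled as nil on both properties instead of the Double.nan described above; the sample scalars are my own choice.

```swift
// A decimal digit, a vulgar fraction, a Roman numeral, and a plain letter.
let samples: [Unicode.Scalar] = ["5", "\u{00BD}", "\u{2167}", "A"]

for scalar in samples {
    let properties = scalar.properties
    // numericType and numericValue are companions: in the shipped API both
    // are nil when the scalar has Numeric_Type == None.
    print(scalar, properties.numericType as Any, properties.numericValue as Any)
}

let five = samples[0].properties
let half = samples[1].properties
let plainLetter = samples[3].properties
```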

Okay, so it's always Double.nan when NumericType is nil? Therefore, does something like public var numericValue: (Double, NumericType)? make more sense for the Swift API? As well as eliminating nonsense cases, it would give more justification as to why this isn't simply an initializer.

Apart from that, nothing else stuck out as exceptional and I support the proposal's approval. The rest of the API makes justified uses of terms of art with minor tweaks. I also like how everything is encapsulated into Unicode.Scalar.Properties as a way to namespace the more 'esoteric' parts of Unicode into a structure that is clearly meant for advanced use cases.

This would deviate from the intentional decision in the proposal to map the properties one-to-one from the Unicode Standard for discoverability. IMO we would need a very strong reason to decrease the discoverability of individual properties if we're providing an API meant to support advanced use cases for folks who are very familiar with the Standard and the properties it defines.

Yeah, TBH, as soon as I hit post I kind of regretted it. The suggestion contradicts what I wrote in the second half of the message about artfully respecting Unicode standard names. I relent :)

I'm a big fan of all of the new properties added in this proposal, but I'm skeptical of the need to bury them all the way in Unicode.Scalar.Properties.

Unicode.Scalar doesn't have that many properties or methods to begin with. I only see a handful of properties (isASCII, utf16, value, description, debugDescription, hashValue, and customMirror). It's by no means already bloated, so I think Unicode.Scalar is a more natural place to put these new properties.

Additionally, the call-site would read more clearly (scalar.isAlphabetic instead of scalar.properties.isAlphabetic).
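As a sketch of what the two call sites look like side by side, the forwarding extension below hoists two of the nested properties onto Unicode.Scalar. The forwarding members are hypothetical and written only for illustration; the nested properties spelling is the one in the proposal.

```swift
// Hypothetical forwarding members, for illustration only.
extension Unicode.Scalar {
    var isAlphabetic: Bool { properties.isAlphabetic }
    var isWhitespace: Bool { properties.isWhitespace }
}

let scalar: Unicode.Scalar = "\u{E9}" // LATIN SMALL LETTER E WITH ACUTE

// Nested spelling, as proposed.
let nested = scalar.properties.isAlphabetic

// Hoisted spelling, as suggested in this post.
let hoisted = scalar.isAlphabetic
```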

I'd love to get more feedback on this from folks. If people feel strongly that the sheer number of properties we're introducing isn't harmful to the Unicode.Scalar API as a whole, I'd be happy to hoist them out of the nested object.

Unicode.Scalar is a somewhat more technical type to begin with, so maybe the "bloat" is warranted...

Without having given it any in-depth thought, the clarity of scalar.isAlphabetic reads nicely, and the intervening .properties doesn’t. My only hesitation would be name collisions.

It seems like there are two categories of names that would move to Scalar. One is a pool of properties such as name, age, and generalCategory that isn’t likely to grow much, and whose names feel at home on Scalar.

The other pool is names such as isPatternWhitespace and changesWhenTitlecased; that pool is likely to grow. These all begin with an is or changesWhen prefix; the latter seems inherently safe, but is could cause a collision. Is it likely that we’d ever encounter some “foo” for which we want an isFoo property on Scalar and Unicode adds some “foo” property with a different meaning? That seems far-fetched, having given it all of 30 seconds of consideration.

I think it's a useful addition that won't do any harm to people who don't need it, so I can't see any reason not to accept this proposal.
But instead of the Unicode.Scalar.Properties, I'd prefer the enum/Unicode.Scalar.hasProperty alternative:
It also keeps Unicode.Scalar small, and it's easy for users to define their own extensions to query for the properties they actually need.

Also, serializing enum values is simpler than storing keypaths (I don't have a use case for this, but it wouldn't hurt to have that option).
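Roughly what I have in mind, as a sketch. The enum name BinaryProperty, its cases, and the exact hasProperty signature are my own assumptions for illustration (the proposal's "Alternatives Considered" section is the authority on the real shape), and the implementation here simply forwards to the properties API so that it runs.

```swift
extension Unicode.Scalar {
    // Hypothetical enum: only a handful of binary properties are listed.
    // String raw values make the cases trivial to store and restore.
    enum BinaryProperty: String, CaseIterable {
        case alphabetic, whitespace, uppercase, lowercase
    }

    // Hypothetical query method covering every case of the enum.
    func hasProperty(_ property: BinaryProperty) -> Bool {
        switch property {
        case .alphabetic: return properties.isAlphabetic
        case .whitespace: return properties.isWhitespace
        case .uppercase: return properties.isUppercase
        case .lowercase: return properties.isLowercase
        }
    }
}

// Users can add only the shorthands they actually need.
extension Unicode.Scalar {
    var isLowercase: Bool { hasProperty(.lowercase) }
}

// Serializing a property is just a matter of its raw value.
let stored = Unicode.Scalar.BinaryProperty.whitespace.rawValue
let restored = Unicode.Scalar.BinaryProperty(rawValue: stored)
let lowerA: Unicode.Scalar = "a"
let space: Unicode.Scalar = " "
```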

What is your evaluation of the proposal?

+1

Is the problem being addressed significant enough to warrant a change to Swift?

Yes

Does this proposal fit well with the feel and direction of Swift?

Yes.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

Swift's unicode support is better than any other language I've used, and this is a great addition. Useful, clear and at an appropriate abstraction level.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Participated in Character discussion (apparently there were 2 separate discussions), read the proposal, checked out some unicode stuff.


On the issue of whether or not the properties should be nested in a .properties structure: I don't mind either way, but we should consider that Character will likely have many equivalent properties, and it would be useful to keep the structure consistent. I think it should be easy to flip your code between the scalar/character levels without adding/removing ".properties" everywhere. For example:

if myString.unicodeScalars.contains(where: { $0.isWhitespace }) { ... }

// Oh, no! We discover that we should be using characters (or vice-versa)...

if myString.characters.contains(where: { $0.isWhitespace }) { ... }
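For comparison, here is a sketch of the same flip written against the API as it later shipped in Swift 5, where the scalar-level check goes through .properties and the Character-level check does not. This is exactly the kind of asymmetry described above; the sample string is arbitrary.

```swift
let myString = "hello world"

// Scalar level: the predicate lives on Unicode.Scalar.Properties.
let scalarLevel = myString.unicodeScalars.contains(where: { $0.properties.isWhitespace })

// Character level: the predicate lives directly on Character, and String is
// itself a collection of Characters.
let characterLevel = myString.contains(where: { $0.isWhitespace })
```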

Also, I could not find any realistic use cases for u_charAge in the wild, so I don't mind Unicode.Version just being a tuple. I don't think it's a very "interesting" type (as in, nobody will care about writing extensions or adding protocol conformances to it).

The main issue here is that these predicates mean something very specific to the Unicode standard, and not necessarily what Swift would name them or what semantics we would prefer to define for them. Without this pitch, Unicode enthusiasts/experts have no way of directly accessing this data, but the interpretation of the data is on the enthusiast/expert. isAlphabetic is perhaps the most benign, but namespacing these off all together seems more consistent.

That being said, I think there's a reasonable argument that we could promote or define some semantics on Unicode.Scalar. But, for every one of them, we'd be defining semantics for Swift that don't necessarily map directly or unambiguously in both name and spirit to Unicode's.

For example, we could define a Unicode.Scalar.name property. Which name does that return? Unicode has between 1 and 6 names for any given scalar, so we'd be promoting one as the name for Swift. In practice, we'd likely be choosing between the formal name for compatibility purposes and the corrected name. On the one hand, we would want the most correct name we could get, but on the other hand most of the world revolving around Unicode sticks to the formal name.
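A quick sketch of the formal-versus-corrected distinction, using the name and nameAlias properties from the API as it later shipped. U+FE18 is the well-known case whose formal name contains a misspelling ("BRAKCET") that the stability policy prevents from ever being changed; the exact alias string is printed rather than assumed.

```swift
let scalars: [Unicode.Scalar] = ["A", "\u{FE18}"]

for scalar in scalars {
    let properties = scalar.properties
    // name is the immutable formal name; nameAlias, when present, carries a
    // normative alias such as a spelling correction.
    print(properties.name ?? "<no name>", "|", properties.nameAlias ?? "<no alias>")
}

let formalNameOfA = scalars[0].properties.name
let formalNameOfBracket = scalars[1].properties.name
```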

Age and general category could make sense on Unicode.Scalar, though I could see an argument against "blessing" programming with GC without providing more comprehensive alternatives.

My preference would be to defer these considerations as future work for now. Unicode.Scalar.Properties.x has a direct correspondence to a similarly-named entry in the UCD.

After lurking in the pitch thread from the beginning, I think this proposal adds useful functionality to Swift. I’m for accepting it as is or with properties directly on Scalar.

It’s hard for me to judge the ergonomics of the nested properties design vs. direct Unicode.Scalar extension (suggested up-thread) in the abstract. I understand this proposal is currently a conservative attempt to surface the underlying ICU functionality for expert Unicode users. I’m unaware of any ICU challenger, so I think directly surfacing ICU properties is a relatively future-proof bet. Is it reasonable to expect that we’ll come up with anything different in the future? Though, I also don’t know how much API churn there was in ICU historically…

@allevato Could you drop some pointers to code that uses these kinds of properties in practice (for example something that uses your Swift ICU wrappers)?

Hmm, that is a compelling thought.

The counterargument, which I'm unsure of but I'll go ahead and make, is that if there is some property that has the same name but different semantics as Unicode’s, having two meanings for that name is a dangerous pitfall. I’d rather not second-guess Unicode’s semantics; if Unicode has a quirky notion of, say, “lowercase” or “quotation mark,” it seems better to go with the standard and all its problems instead of trying to preemptively correct those problems for others. In a similar vein, I would expect URL not to try to help me by redefining well-standardized URL components. I should be able to read one industry standard, not the standard plus Swift’s adjustments. This uniformity of reasoning is why we have standards in the first place.

There's the argument; as I said, I'm unsure about it either way. It boils down to these questions: what if anything does Unicode.Scalar represent other than the Unicode standard? It already has Unicode in the name; have we not entered a standardized universe already? If not, what other Swift-specific semantics does it have? Finally, are any Swift-only semantics additive, or must they be conflicting?

This suggests perhaps a single-valued name property is not the right model, regardless of whether it is namespaced under .properties.

As I understand it, Character is the Swift primitive element of a String. Scalars have no real meaning outside of Unicode, which is why they're nested like that in Swift. So I agree with you - all members of Unicode.Scalar should implicitly mirror the standard, and we shouldn't need a separate .properties thing to make it explicit.

...and, if we do eventually add properties with non-standard semantics to the Swift-level abstraction for characters (Swift.Character), it might be worth also including a Unicode.ExtendedGraphemeCluster or Unicode.Character type whose members do strictly follow the standard, in the same way Unicode.Scalar should.