SE-0221 – Character Properties

Ben_Cohen · July 23, 2018, 5:22pm

The review of SE-0221 — Character Properties begins now and runs through July 29, 2018.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager (via email or direct message in the Swift forums).

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thanks,
Ben Cohen
Review Manager

masters3d · July 23, 2018, 6:10pm

What is your evaluation of the proposal?

I am for the proposal, but I really dislike all the properties being added. I would much rather have the notion of versioned sets of Unicode classifications. As in, I want to know whether a character is an emoji in the 2012 version of Unicode. If I always want Swift to pick the latest version of Unicode, then I pick current.

Unicode.current.emojiSet.contains(myChar)

Unicode.version(someOlderVersion).emojiSet.contains(myNewChar)

I think it would be surprising if future Unicode changes made it hard to change all the Character properties. I would much rather see this functionality outside of the Character type.

With the current design, I worry that one version of Swift could resolve some properties as true while a newer version could update them to false. Could that happen?

Is the problem being addressed significant enough to warrant a change to Swift?

Yes

Does this proposal fit well with the feel and direction of Swift?

Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

No

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Read the proposal.

Michael_Ilseman · July 23, 2018, 6:24pm

I feel like there are two separable aspects you brought up here:

One is a request to continue adding APIs for Unicode-savvy users, à la Unicode Scalar Properties and the recently pitched String case folding and normalization APIs. Specifically, the ability to request version-specific information.

It sounds like you want something akin to versioned Unicode Scalar Properties, but I feel it would be out of place for the standard library proper as currently designed. It would require shipping all versions of Unicode data files, reconciling availability (so all properties might end up being optional), etc. I think it would make a very interesting SPM package today, and in the future could have a place in a form of “extended” libraries or package catalogue for Unicode experts/enthusiasts.
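A rough sketch of how little version information is available without shipping older data files: the scalar-level `age` property records when a scalar was first assigned, so a helper can answer "was this scalar assigned as of version X?". The helper name `wasAssigned(_:asOf:)` is an assumption made for illustration, and it says nothing about how other properties (such as emoji classification) looked in that version. The optional result mirrors the availability concern raised above.

```swift
// Hypothetical helper (not a standard-library API): reports whether a scalar
// was already assigned as of a given Unicode version.
// Returns nil when the runtime's Unicode data records no Age for the scalar.
func wasAssigned(_ scalar: Unicode.Scalar, asOf version: Unicode.Version) -> Bool? {
    guard let age = scalar.properties.age else { return nil }
    return (age.major, age.minor) <= (version.major, version.minor)
}

let grinning: Unicode.Scalar = "\u{1F600}" // GRINNING FACE, first assigned in Unicode 6.1
print(wasAssigned(grinning, asOf: (major: 6, minor: 0)) as Any)
print(wasAssigned(grinning, asOf: (major: 11, minor: 0)) as Any)
```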

The other is a concern about the stability of answers to this query. Many String APIs suffer from this, including String.count and String.lowercased(), which vary from version to version of Unicode. Do you see something particularly troubling for Character Properties that doesn’t already apply to String?
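As a small illustration of that existing instability, the scalar count of a string is fixed by its contents, while the grapheme count depends on the segmentation rules the runtime implements. The family emoji below is just a convenient example; no particular value of `count` is claimed here.

```swift
// man + ZWJ + woman + ZWJ + girl, written with explicit scalars
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"

// Scalar count depends only on the contents of the string.
print(family.unicodeScalars.count)

// Grapheme count depends on the Unicode segmentation rules of the runtime.
print(family.count)
```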

masters3d · July 23, 2018, 6:46pm

Yes. .isWhitespace comes to mind. All of a sudden, every character property becomes a breaking change if a character is no longer classified as whitespace. Maybe in that case they should all be methods and not properties.

xwu · July 23, 2018, 6:51pm

Unicode provides stability guarantees around whitespace, if I'm not mistaken. Swift could (and probably should) explicitly provide the same guarantees: i.e., no whitespace character will ever be classified as not a whitespace character in a future version of Swift. The same goes for newlines.
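For reference, a minimal sketch of the queries under discussion, assuming the properties land as proposed (they ship in Swift 5 and later). The sample characters are ones that carry the Unicode White_Space property.

```swift
// Space, tab, no-break space, ideographic space.
let spaces: [Character] = [" ", "\t", "\u{00A0}", "\u{3000}"]
print(spaces.allSatisfy { $0.isWhitespace })

// A line feed counts as both a newline and whitespace.
let lineFeed: Character = "\n"
print(lineFeed.isNewline, lineFeed.isWhitespace)
```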

jawbroken · July 24, 2018, 1:31am

Small documentation error, I think you need to swap lowercase and uppercase at the end:

  /// Lowercase Characters vary under case-conversion to lowercase, but not when
  /// converted to uppercase.
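The corrected wording can be checked with a small sketch. The helper name `variesOnlyUnderUppercasing` is made up for illustration; it tests the swapped definition directly rather than calling the proposed property.

```swift
// Hypothetical helper: true when a character changes under uppercasing
// but is left alone by lowercasing, i.e. the corrected "lowercase" wording.
func variesOnlyUnderUppercasing(_ c: Character) -> Bool {
    let s = String(c)
    return s.lowercased() == s && s.uppercased() != s
}

print(variesOnlyUnderUppercasing("a"))
print(variesOnlyUnderUppercasing("A"))
print(variesOnlyUnderUppercasing("1"))
```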

If I recall correctly, your opinion here:

We recommend that the precise semantics of isWhitespace and isNewline be unspecified regarding graphemes consisting of leading whitespace/newlines followed by combining scalars.

has changed since the earlier thread discussing this issue. I think leaving this unspecified is fine, since it's unlikely to be an issue in practice, and it fits with some of the other String edge cases (e.g. Collection conformance).
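A sketch of the edge case in question: a space followed by a combining acute accent forms a single Character made of two scalars. The result of `isWhitespace` is printed without a claimed value, since the proposal leaves it unspecified.

```swift
// SPACE followed by COMBINING ACUTE ACCENT: one grapheme, two scalars.
let spaceWithAccent: Character = " \u{0301}"
print(spaceWithAccent.unicodeScalars.count)

// Left unspecified by the proposal; shown only to make the case concrete.
print(spaceWithAccent.isWhitespace)
```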

There is a “lessor”/“lesser” typo somewhere.

Generally I think this is a nice pragmatic proposal, and I think it does a great job of presenting the complexity of the space and the reasoning behind the choices made. I don't think it's worthwhile handling the Unicode versioning issues at this level, because the problems there are more general (e.g. String character counts have already changed with Swift versions).

KeithTsui · July 24, 2018, 11:46am

What is your evaluation of the proposal?
I think adding those properties to Character looks too verbose for the Swift Standard Library.
Moreover, those properties all amount to placing a character into one character set or another. So why not enrich CharacterSet to achieve the same functionality?
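One wrinkle with that route is that Foundation's CharacterSet is a set of Unicode scalars, not Characters, so a Character has to be checked scalar by scalar. A minimal sketch, with the helper name `isWhitespaceViaCharacterSet` assumed for illustration:

```swift
import Foundation

// Hypothetical helper: classifies a Character through CharacterSet by
// requiring every scalar in the grapheme to be a member of the set.
func isWhitespaceViaCharacterSet(_ c: Character) -> Bool {
    return c.unicodeScalars.allSatisfy { CharacterSet.whitespacesAndNewlines.contains($0) }
}

print(isWhitespaceViaCharacterSet(" "))
print(isWhitespaceViaCharacterSet("a"))
```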

Is the problem being addressed significant enough to warrant a change to Swift?
No.

Does this proposal fit well with the feel and direction of Swift?
Yes, but it does not necessarily need to be implemented in the Standard Library.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
I think it would be handy to use, but I prefer to use CharacterSet to classify a Character instead of Character itself.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?
Quick reading.

pvieito · July 24, 2018, 8:00pm

What is your evaluation of the proposal?

+1 to the added properties to Character.
Strong opposition to renaming FixedWidthInteger.init?<S: StringProtocol>(S, radix: Int = 10) to FixedWidthInteger.init?<S: StringProtocol>(ascii: S, radix: Int = 10).

Is the problem being addressed significant enough to warrant a change to Swift?

The added Character properties seem really useful.
The renaming change is superfluous.

Does this proposal fit well with the feel and direction of Swift?

Yes for the Character properties.
No for the renaming change. I think changing something as common, simple and widespread as Int("31") to Int(ascii: "31") is not a Swifty change.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

No.
Casting a string to an integer is something very common, and all popular languages use syntax very similar to the current Int("31") initializer.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Reading the review.

Karl · July 25, 2018, 4:37pm

The thing about this is that Swift allows full-Unicode type and variable names. Maybe your code is all in Chinese, and maybe you'd expect something like Int("三") to work as well as Int("3"), especially since other String operations are Unicode-aware by default.

EDIT: That said, the Swift compiler itself doesn't recognise non-ASCII integer literals. let _ = 三 doesn't compile. So maybe there is no such expectation.
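A quick sketch of the current behaviour: the string-parsing initializer on Int only accepts ASCII digits, while the proposed Character-level `wholeNumberValue` (available from Swift 5) is the per-character numeric query.

```swift
// ASCII digits parse; a CJK numeral does not.
print(Int("3") as Any)
print(Int("三") as Any)

// The Character-level query proposed in SE-0221.
let digit: Character = "7"
print(digit.wholeNumberValue as Any)
```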

xwu · July 26, 2018, 2:47am

No such change is proposed. Int will continue to conform to LosslessStringConvertible, which requires init?(_ description: String).
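To make that concrete, here is a minimal sketch of generic code that relies on the unlabeled initializer through the protocol. The function name `parse` is an assumption made for illustration.

```swift
// Hypothetical helper: parses any LosslessStringConvertible type from text
// using the protocol's unlabeled failable initializer.
func parse<T: LosslessStringConvertible>(_ text: String, as type: T.Type = T.self) -> T? {
    return T(text)
}

let seven: Int? = parse("7")
let half: Double? = parse("0.5")
let bad = parse("seven", as: Int.self)
print(seven as Any, half as Any, bad as Any)
```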

griotspeak · July 26, 2018, 5:47am

+1

I've followed the conversation and provided some feedback. I've given the proposal a quick read.

Michael_Ilseman · July 26, 2018, 8:44pm

@pvieito

This source break is severable and if it's not worth it, we can drop it from the proposal.

However, as @xwu alluded to, this is not as drastic in practice as it seems. As proposed:

let x = Int("7") // Optional(7)
let y = Float("2") // Optional(2.0)
let z = Int(ascii: "z", radix: 36) // Optional(35)
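For comparison, the same three conversions in the spelling that compiles today, where the radix-taking initializer has no argument label on the text:

```swift
let x = Int("7")
let y = Float("2")
let z = Int("z", radix: 36)    // current spelling, no `ascii:` label
let hex = Int("ff", radix: 16)
print(x as Any, y as Any, z as Any, hex as Any)
```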

Michael_Ilseman · July 26, 2018, 9:09pm

As alluded to in “Alternatives Considered”:

We are also considering relaxing isEmoji to return true for default-textual Characters without an explicit text presentation selector. For example, U+2708 (AIRPLANE) by default is rendered as ✈ and not ✈️. As proposed, Character("\u{2708}").isEmoji would return false, but we are considering having it return true because it could be rendered as emoji.

We’re changing our minds such that isEmoji returns true for emoji-presentable Characters with a default textual presentation. isEmoji will still return false for emoji-presentable Characters containing an explicit U+FE0E (text presentation selector).

Would anyone find it useful to specifically distinguish emoji-presentable Characters with a default textual presentation? Would one additional property such as (strawman name) isEmojiWithDefaultTextualPresentation: Bool be useful? Suggestions for a less awful name?

xwu · July 26, 2018, 9:26pm

I highly doubt that this will behave as the vast majority of users would expect. By that definition:

("1" as Character).isEmoji // true

I think the original definition hews much closer to user expectations.

Michael_Ilseman · July 26, 2018, 9:31pm

No, Character("1").isEmoji should return false. Character cannot just check the emoji property on the leading scalar, as that includes emoji components without an emoji presentation. We will have to analyze portions of the grapheme in these situations. Similarly, an isolated regional indicator would also return false.
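The scalar-level properties show why the leading scalar alone is not enough. A small sketch using the existing Unicode Scalar Properties API: the digit one and the airplane both carry the Emoji property but not Emoji_Presentation, while the grinning face carries both.

```swift
let samples: [(String, Unicode.Scalar)] = [
    ("digit one", "1"),
    ("airplane", "\u{2708}"),
    ("grinning face", "\u{1F600}"),
]

for (name, scalar) in samples {
    let props = scalar.properties
    // Prints the Emoji and Emoji_Presentation properties for each scalar.
    print(name, props.isEmoji, props.isEmojiPresentation)
}
```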

pvieito · July 26, 2018, 9:34pm

@xwu @Michael_Ilseman Thanks for the clarification! In any case, I would still oppose this change.

This change would also raise the question of why LosslessStringConvertible does not include the ascii label while FixedWidthInteger does, as they both only accept ASCII-based input strings.

Also, Float("1,44") does not work, but I don't think changing it to Float(dotSeparator: "1.44") is either desirable or required.

xwu · July 26, 2018, 9:35pm

I'm afraid I don't understand how "1" can be false while "" can be true for this property. Unlike isolated regional indicators, both of these are standalone extended grapheme clusters with a non-default emoji representation (I am not referring to the keycap 1 but the emoji variant not supported by Apple's font). Can you explain?

jrose · July 26, 2018, 10:01pm

To @xwu's point: Unicode Utilities: UnicodeSet

EDIT: For comparison, Unicode Utilities: UnicodeSet contains neither "1" nor "".

Michael_Ilseman · July 26, 2018, 11:00pm

Sure! I haven’t fully vetted this approach and the below code would definitely fall under the realm of “implementation details”.

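public var isEmoji: Bool {
  // A grapheme whose first scalar lacks the Emoji property is never an emoji.
  guard _firstScalarProps.isEmoji else { return false }
  // Emoji components (digits, regional indicators, ...) need a closer look.
  if _firstScalarProps._isEmojiComponent {
    return _checkWellFormedEmoji(self)
  }
  // Otherwise, only an explicit text presentation selector (U+FE0E) opts out.
  return !unicodeScalars.contains(Unicode.Scalar._textPresentationSelector)
}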

_checkWellFormedEmoji would do the minimum necessary checking to determine if the grapheme is a well-formed emoji candidate. It would not necessarily perform full-blown validation, but rather be “permissive”. None of this would be inlinable. It would check for:

Flags: leading pair of regional indicators
Keycaps: leading with [0-9#*] U+FE0F U+20E3

(Note that flag tag sequences begin with U+1F3F4 (🏴), which is not a component, so they’re already handled).

An alternative approach could leverage the Extended_Pictographic property, but I think the above approach is superior.

edit: I didn't see your example of the "1" with emoji selector at first. We could have _checkWellFormedEmoji permit this by detecting the explicit emoji selector.
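A self-contained approximation of the approach described above, written against public API only. Every name here is hypothetical, and the component test is simplified to regional indicators and keycap bases, since the standard library's internal `_isEmojiComponent` is not public; it is a sketch of the idea, not the actual implementation.

```swift
// Hypothetical sketch; none of these names are standard-library API.
let textPresentationSelector: Unicode.Scalar = "\u{FE0E}"
let emojiPresentationSelector: Unicode.Scalar = "\u{FE0F}"
let combiningEnclosingKeycap: Unicode.Scalar = "\u{20E3}"

func isRegionalIndicator(_ s: Unicode.Scalar) -> Bool {
    return (0x1F1E6...0x1F1FF).contains(s.value)
}

func isKeycapBase(_ s: Unicode.Scalar) -> Bool {
    return "0123456789#*".unicodeScalars.contains(s)
}

// Permissive check: only looks at the leading scalars of the grapheme.
func checkWellFormedEmoji(_ scalars: [Unicode.Scalar]) -> Bool {
    // Flags: a leading pair of regional indicators.
    if scalars.count >= 2,
       isRegionalIndicator(scalars[0]), isRegionalIndicator(scalars[1]) {
        return true
    }
    // Keycaps: [0-9#*] U+FE0F U+20E3.
    if scalars.count >= 3, isKeycapBase(scalars[0]),
       scalars[1] == emojiPresentationSelector,
       scalars[2] == combiningEnclosingKeycap {
        return true
    }
    return false
}

func looksLikeEmoji(_ c: Character) -> Bool {
    let scalars = Array(c.unicodeScalars)
    guard let first = scalars.first, first.properties.isEmoji else { return false }
    // Simplified stand-in for the Emoji_Component test.
    if isRegionalIndicator(first) || isKeycapBase(first) {
        return checkWellFormedEmoji(scalars)
    }
    return !scalars.contains(textPresentationSelector)
}

print(looksLikeEmoji("1"), looksLikeEmoji("\u{2708}"))
```

Under these assumptions, a bare digit and an isolated regional indicator are rejected, while a flag pair, a keycap sequence, and a default-textual airplane without U+FE0E are accepted.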

Michael_Ilseman · July 26, 2018, 11:11pm

That initializer does not take a radix, and thus isn't Latin-skewed in its requirements. It happens to fail for non-ASCII input, but we could extend it in the future (or as part of this proposal if you're making that argument) to support more numbers.

A radix-taking one, however, pretty much restricts it to ASCII or the Latin full-width compatibility forms, i.e. the characters covered by the proposed isHexDigit property.
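A short sketch of the hex-digit queries, assuming they land as proposed (they are available from Swift 5): both the ASCII letter and its full-width compatibility form are treated as hex digits.

```swift
let asciiF: Character = "f"
let fullwidthF: Character = "\u{FF26}" // FULLWIDTH LATIN CAPITAL LETTER F
let notHex: Character = "g"

print(asciiF.hexDigitValue as Any)
print(fullwidthF.hexDigitValue as Any)
print(notHex.isHexDigit)
```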