SE-0221 – Character Properties

xwu · July 26, 2018, 11:41pm

I guess my question is, how would one define the semantics of _checkWellFormedEmoji such that "1" without the emoji selector is not "well-formed emoji" but "✈️" without the emoji selector is "well-formed emoji", short of arbitrarily hardcoding [0-9#*] as "not emoji"? Hardcoding is not scalable: there are commonly used non-ASCII symbols (some mathematical arrows, for example) that fall under the same scenario as "1", and new Unicode versions continue to add emoji variants to existing, commonly used symbols.

I think the only reasonable answer is that extended grapheme clusters where the base character has a non-default emoji variant and where there is no emoji selector are not emoji for the purposes of Swift, as you originally proposed.

Michael_Ilseman · July 26, 2018, 11:56pm

Perhaps I should have named _checkWellFormedEmoji as _checkWellFormedEmojiWithLeadingComponent to imply this is only used along the lead-scalar-is-emoji-component path. ✈ is not an emoji component, while 1 is.

The alternative approach that I alluded to is to use the Extended_Pictographic property instead of Emoji_Component for this purpose. I haven't yet investigated the overlap between Emoji_Component and the complement of Extended_Pictographic, but the spec alludes to the latter being useful for future-proofing.

The logic would be the same as before: ✈ is an extended pictograph, while 1 and regional indicators are not.

beccadax · July 27, 2018, 12:26am

I'm not a fan of the source-breaking init(ascii:radix:) change. This API ought to match the init(_:) used for its LosslessStringConvertible conformance, and we can't and shouldn't change the label on that one.

The rest of it looks pretty good. I'm a little iffy on having both e.g. isWholeNumber and an optional wholeNumberValue, but if the former can be true for characters where the latter returns nil because their numeric value is ambiguous, that's probably the best you can do. I'm also not certain we need uppercased() and lowercased() when they return Strings anyway and we don't guarantee that str.lowercased() == String(str.flatMap { $0.lowercased() }), but that's a minor quibble.
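To make the two points above concrete, here is a minimal sketch using the property names as they later shipped with Swift 5. The specific characters are my own picks for illustration and are not taken from the proposal text.

```swift
let digit: Character = "4"
let circled: Character = "④"
let half: Character = "½"

// wholeNumberValue is nil unless the character represents a whole number,
// so a character can be a number without having a whole-number value.
print(digit.wholeNumberValue as Any, circled.wholeNumberValue as Any)
print(half.isNumber, half.isWholeNumber, half.wholeNumberValue as Any)

// Case conversion on Character returns a String, because a single
// Character can map to more than one Character.
let sharpS: Character = "ß"
print(sharpS.uppercased())
```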

jrose · July 27, 2018, 1:05am

Okay, UTS #51 seems to imply that Emoji_Component is a reasonable thing to check here if we want to make AIRPLANE be considered an emoji without a presentation selector.

allevato · July 27, 2018, 1:44am

One concern: Emoji_Component and Extended_Pictographic are recent additions to the standard, and the versions of ICU that support them are also very recent (60 and 62, respectively). I haven't installed the latest macOS or iOS betas yet, but on High Sierra 10.13.6, we don't have these yet in the system ICU:

Welcome to Apple Swift version 4.2 (swiftlang-1000.0.25 clang-1000.0.28.1). Type :help for assistance.
  1> import Darwin
  2. let icu = dlopen("libicucore.dylib", 0)
  3. let u_getVersion = unsafeBitCast(dlsym(icu, "u_getVersion"), to: (@convention(c) (UnsafeMutablePointer<Int8>) -> ()).self)
  4. var version: [Int8] = [0, 0, 0, 0]
  5. u_getVersion(&version)
  6. print(version)
[59, 1, 0, 0]

Can anyone confirm whether they're available in the ICU distributed with macOS 10.14/iOS 12? If not, it could be a while before Character properties based on these would be relied on by actual clients.

jawbroken · July 27, 2018, 1:58am

I think I'm having trouble following this conversation about isEmoji because your “we’re changing our minds” post didn't really contain the reason why you want to make the change, @Michael_Ilseman. Why should isEmoji be true for ✈? I wouldn't intuitively expect it to be true. The original proposed semantics make more sense to me (default emoji and default textual with an explicit emoji presentation selector), but I'm not sure what the goal is here.

griotspeak · July 27, 2018, 3:22am

I am in favor of the source-breaking change regarding ascii. I'll admit, though, that I've always thought it unfortunate that we weren't explicit about what characters we would accept. The benefit, in my opinion, is that of explicitly considering the world and Unicode. A conversion without any labels really should accept any number from any language, in my opinion, and users should be surprised when it doesn't.

xwu · July 27, 2018, 3:23am

Ah, I see where I have a gap in understanding. The last time I read about Unicode properties for emoji (which was less than a year ago), Emoji_Component did not exist. I did not realize you were using "emoji component" to refer to a formal, new property added to Unicode data.

Yes, I see how this might work. However, as @allevato writes, these properties are so new that we cannot actually implement them on High Sierra. (Looks like Mojave does ship with ICU 62.) On the other hand, the original definition proposed can be implemented on earlier versions of macOS and iOS.

Michael_Ilseman · July 27, 2018, 8:33pm

Yes, depending on the implementation strategy used, this property might have availability constraints. However, we can pursue a variety of techniques with various trade-offs. For example, there are not many emoji components, so we could even consider adding a fall-back path that embeds Unicode 11's listings on older systems.
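A fall-back path of that kind could be as small as the sketch below. The function and table names are hypothetical, and the ranges are my own hand transcription of the Emoji_Component entries for Unicode 11, so they should be verified against the emoji-data.txt file before being relied on.

```swift
// Hypothetical fall-back table: Emoji_Component as listed for Unicode 11.
// Hand-transcribed (an assumption); verify against emoji-data.txt.
let emojiComponentRanges: [ClosedRange<UInt32>] = [
    0x0023...0x0023,    // #
    0x002A...0x002A,    // *
    0x0030...0x0039,    // 0-9
    0x200D...0x200D,    // zero width joiner
    0x20E3...0x20E3,    // combining enclosing keycap
    0xFE0F...0xFE0F,    // emoji presentation selector
    0x1F1E6...0x1F1FF,  // regional indicators
    0x1F3FB...0x1F3FF,  // skin tone modifiers
    0x1F9B0...0x1F9B3,  // hair components
    0xE0020...0xE007F,  // tag characters
]

// Returns true if the scalar falls in any of the embedded ranges.
func isEmojiComponentFallback(_ scalar: Unicode.Scalar) -> Bool {
    return emojiComponentRanges.contains { $0.contains(scalar.value) }
}
```

With only ten ranges, a linear scan is likely good enough; a real implementation might still prefer a switch or binary search.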

Michael_Ilseman · July 27, 2018, 8:44pm

Right, this is a fairly confusing area and I should provide some background. Refer to UTS #51: Unicode Emoji for the source-of-truth.

We are talking about Characters which are candidates for being displayed as emoji, i.e. rendered with colorful images rather than normal text. ("Candidate" is my word for something that could be emoji-presentable, not the spec's.)

We’re also dealing with two different ends here: the text author (input) and the text presenter (output).

Author’s Choice

The author of such a Character can use variation selectors to dictate rendering. The text presentation selector (U+FE0E) dictates that it must not be rendered as emoji while the emoji presentation selector (U+FE0F) dictates that it must be rendered as emoji.

This identifies two classifications:

- Emoji candidate explicitly dictated to be rendered as emoji
- Emoji candidate explicitly dictated to not be rendered as emoji

See Emoji Presentation Sequences, v11.0 for a listing of these.
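As a rough illustration of the author's side, the following sketch scans a Character's scalars for either variation selector. The enum and function names are hypothetical; only the two selector code points come from the text above.

```swift
// Hypothetical names; only U+FE0E and U+FE0F come from the discussion.
enum ExplicitPresentation {
    case text   // U+FE0E, text presentation selector
    case emoji  // U+FE0F, emoji presentation selector
}

// Returns the author's explicit choice, or nil if no selector is present.
func explicitPresentation(of character: Character) -> ExplicitPresentation? {
    for scalar in character.unicodeScalars {
        switch scalar.value {
        case 0xFE0E: return .text
        case 0xFE0F: return .emoji
        default: continue
        }
    }
    return nil
}

let textAirplane: Character = "\u{2708}\u{FE0E}"
let emojiAirplane: Character = "\u{2708}\u{FE0F}"
let bareAirplane: Character = "\u{2708}"
print(explicitPresentation(of: textAirplane) as Any)
print(explicitPresentation(of: emojiAirplane) as Any)
print(explicitPresentation(of: bareAirplane) as Any)
```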

Presenter’s Choice

In the absence of such an explicit selector, it is the presenter’s decision (i.e. the application or renderer). A word processing application is likely to prefer textual presentation whenever possible, while a chat app is more likely to prefer emoji rendering whenever possible.

For example, ✈ is rendered textually if I’m viewing Discourse on my laptop’s web browser and as an image when viewed through the Discourse app on my phone.

The Unicode standard never dictates presentation style for these emoji, but it wants to promote interoperability and consistency across applications and platforms. For that, it provides “default” guidance using scalar properties, but again, it is fully up to the presenter. This guidance is useful for applications without a strong preference one way or another; see Emoji vs Text Display.
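One way to picture the order of precedence described so far, for a Character that is already known to be an emoji candidate, is the sketch below. All names are hypothetical and the function is only an illustration of the precedence, not anything defined by UTS #51 or the proposal.

```swift
enum Presentation { case text, emoji }

// Hypothetical presenter policy for an emoji candidate:
// the author's selector wins, then the presenter's own preference,
// and only then Unicode's default guidance.
func resolvePresentation(
    explicit: Presentation?,            // from a variation selector, if any
    presenterPreference: Presentation?, // e.g. .emoji for a chat app
    unicodeDefault: Presentation        // derived from Emoji_Presentation
) -> Presentation {
    if let explicit = explicit { return explicit }
    if let preference = presenterPreference { return preference }
    return unicodeDefault
}

// A bare airplane (text-default) in a chat app that prefers emoji:
let chatResult = resolvePresentation(
    explicit: nil, presenterPreference: .emoji, unicodeDefault: .text)
// The same character in an app with no preference of its own:
let neutralResult = resolvePresentation(
    explicit: nil, presenterPreference: nil, unicodeDefault: .text)
print(chatResult, neutralResult)
```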

The standard identifies 3 groupings for graphemes:

- Emoji-default
- Text-default
- Text-only

Confusingly, this part of the standard describes these groups in terms of the Emoji and Emoji_Presentation properties on the leading scalar, which is insufficient to identify whether a grapheme could be presented as an emoji.

Candidate Emoji Graphemes

A grapheme without an explicit selector whose leading scalar has Emoji=YES and Emoji_Presentation=NO (a so-called text-default leading scalar) requires further interpretation to determine if it’s a candidate for emoji presentation. This was the situation that @xwu and I were discussing earlier in this thread.

For example, 1 has Emoji=YES; Emoji_Presentation=NO, yet is not an emoji candidate unless it is the leading scalar of a well-formed emoji sequence, such as U+0031 U+FE0F U+20E3 (1️⃣). In contrast, ✈ has the same two properties but is a candidate by itself.

Unicode provides the Emoji_Component and Extended_Pictographic properties to assist in interpreting sequences such as keycaps, flags, tags, and future-proofing segmentation.

✈ has Emoji_Component=NO while 1 has Emoji_Component=YES. We can thus use Emoji_Component on the leading scalar to specifically check for cases like 1 or regional indicators whose candidacy is conditional on the rest of the grapheme.
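The first two properties can be inspected directly through Unicode.Scalar.Properties, as in the short sketch below. As far as I know, Emoji_Component is not exposed there, so the sketch only shows that 1 and ✈ are indistinguishable by Emoji and Emoji_Presentation alone; the grinning face is my own addition for contrast.

```swift
let one: Unicode.Scalar = "1"
let airplane: Unicode.Scalar = "\u{2708}"
let grinning: Unicode.Scalar = "\u{1F600}"  // added for contrast (emoji-default)

for scalar in [one, airplane, grinning] {
    let properties = scalar.properties
    // Prints the code point in hex, then Emoji and Emoji_Presentation.
    print(String(scalar.value, radix: 16),
          properties.isEmoji,
          properties.isEmojiPresentation)
}
```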

See UTS #51: Unicode Emoji for more details on these sets. There are many kinds of emoji sequences, and future versions are likely to tweak these or add more.

Swift’s Character

Now, the question is how to expose this on Character, acknowledging the fact that emoji-ness is inherently a little fuzzy, application-specific, and can change over time with newer versions of Unicode. Unicode 11.0 with Emoji 11.0 definitely marks a point of emoji-maturity for Unicode, but things continue to change.

One approach could be an EmojiStatus enum and query, with a case for each of the five groups above: explicitEmojiPresentation, explicitTextPresentation, defaultEmojiPresentation, defaultTextPresentation, and notEmoji. However, this grouping may change in the future (e.g. similar to how numeric classification did), and we’d be baking it in as something with strict source compatibility constraints. Grouping based on the kind of emoji sequence would be even more fragile.

Groupings are less ergonomic for most use cases. Users are more likely to want to distinguish across 2 queries:

Could this be presented as an emoji?
If yes, what is Unicode’s guidance regarding default presentation?

Bicycle Storage Facility

Our preferred paint color is to use Bool properties on Character. We don’t think there’s much additional clarity gained from more verbose names, such as mayHaveEmojiPresentation over isEmoji.

Character.isEmoji: whether this Character could be presented as an emoji
- False for leading scalar with Emoji=NO
- False for a grapheme containing an explicit text selector
- Otherwise, true for any leading scalar with Emoji_Component=NO
- Otherwise, whether the grapheme is an emoji sequence with a leading component scalar:
  - Flags: leads with a pair of regional indicators
  - Tags: has a series of tag scalars ending with a tag terminator (U+E007F)
  - Keycaps: [0-9#*] U+FE0F U+20E3

The particulars of that algorithm would not be baked into the ABI, but the API signature would be.
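The rules listed above could be sketched roughly as follows. This is not the standard library's implementation: the function names are hypothetical, the Emoji_Component stand-in is hand-transcribed from Unicode 11 (an assumption to verify against emoji-data.txt), and the Tags case is left out for brevity.

```swift
// Hypothetical stand-in for Emoji_Component (hand-transcribed; an assumption).
func isEmojiComponent(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.value {
    case 0x23, 0x2A, 0x30...0x39,   // keycap bases: # * 0-9
         0x200D, 0x20E3, 0xFE0F,    // ZWJ, enclosing keycap, emoji selector
         0x1F1E6...0x1F1FF,         // regional indicators
         0x1F3FB...0x1F3FF,         // skin tone modifiers
         0x1F9B0...0x1F9B3,         // hair components
         0xE0020...0xE007F:         // tag characters
        return true
    default:
        return false
    }
}

// Sketch of the proposed Character.isEmoji rules (Tags case omitted).
func sketchIsEmoji(_ character: Character) -> Bool {
    let values = character.unicodeScalars.map { $0.value }
    // Rule 1: false when the leading scalar has Emoji=NO.
    guard let lead = character.unicodeScalars.first,
          lead.properties.isEmoji else { return false }
    // Rule 2: false when an explicit text selector is present.
    if values.contains(0xFE0E) { return false }
    // Rule 3: true for any leading scalar that is not an emoji component.
    if !isEmojiComponent(lead) { return true }

    // Rule 4: the leading scalar is a component, so check the sequence.
    let regionalIndicators: ClosedRange<UInt32> = 0x1F1E6...0x1F1FF
    // Flags: a pair of regional indicators.
    if values.count == 2,
       values.allSatisfy({ regionalIndicators.contains($0) }) {
        return true
    }
    // Keycaps: [0-9#*] U+FE0F U+20E3.
    if values.count == 3, values[1] == 0xFE0F, values[2] == 0x20E3 {
        switch values[0] {
        case 0x23, 0x2A, 0x30...0x39: return true
        default: break
        }
    }
    return false
}

let keycapOne: Character = "1\u{FE0F}\u{20E3}"
let flag: Character = "\u{1F1FA}\u{1F1F8}"
print(sketchIsEmoji("1"), sketchIsEmoji(keycapOne), sketchIsEmoji(flag))
print(sketchIsEmoji("\u{2708}"), sketchIsEmoji("\u{2708}\u{FE0E}"))
```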

From this thread, it seems like your intuition (and likely that of many others) is that the second conditional query, i.e. Unicode’s guidance regarding default presentation, would be a very useful addition. How do you think this should appear? One possibility:

Character.isEmojiPresentation: whether this Character is recommended to be presented as an emoji, explicitly or by default
- False for anything that Character.isEmoji doesn’t accept
- Otherwise, true for anything with an explicit emoji presentation selector
- Otherwise, whether leading scalar has Emoji_Presentation=YES
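A minimal sketch of these three rules follows. The function name is hypothetical, and the candidate check is passed in as a closure (my choice, to keep the block self-contained) in place of the Character.isEmoji query described earlier.

```swift
// Sketch of the proposed Character.isEmojiPresentation rules.
// `isCandidate` stands in for the Character.isEmoji check.
func sketchIsEmojiPresentation(
    _ character: Character,
    isCandidate: (Character) -> Bool
) -> Bool {
    // False for anything the candidate check doesn't accept.
    guard isCandidate(character),
          let lead = character.unicodeScalars.first else { return false }
    // True when an explicit emoji presentation selector is present.
    if character.unicodeScalars.contains(where: { $0.value == 0xFE0F }) {
        return true
    }
    // Otherwise, defer to Emoji_Presentation on the leading scalar.
    return lead.properties.isEmojiPresentation
}

// Trivial candidate check for demonstration only.
let acceptAll: (Character) -> Bool = { _ in true }
print(sketchIsEmojiPresentation("\u{2708}", isCandidate: acceptAll))
print(sketchIsEmojiPresentation("\u{2708}\u{FE0F}", isCandidate: acceptAll))
print(sketchIsEmojiPresentation("\u{1F600}", isCandidate: acceptAll))
```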

beccadax · July 27, 2018, 11:20pm

I get where you're coming from, but "is" sounds so definite that it becomes misleading. I would at least call it something like canBeEmoji.

xwu · July 28, 2018, 3:01am

Now that you've clarified all the new bits with Unicode 11, I think this is fine, and probably as close as we can get to user expectations.

"Can" is too flimsy of a question, IMO, and I don't think one that users are actually interested in asking. New versions of Unicode, for example, continue to emojify existing characters, such that any character "can be" emoji in that sense.

In essence, the problem is as follows:

Most users will want an answer to the question, "Is this character going to be shown to my user as an emoji?"
For a certain number of characters, that question is unanswerable from within the Swift standard library, because it is the application or renderer's choice, although each succeeding version of Unicode allows us to get closer to a consistent answer for more and more characters.

jawbroken · July 28, 2018, 2:06pm

Thanks for the thorough explanation. I guess it still seems to me like Swift is in this situation:

since it has no idea what the presenter will be. Anyone who wants to be sure would have to somehow encode the specific rules for the presenter that they are targeting. It seems like you have a different opinion, i.e. that a developer will be more interested in knowing “Could this be presented as an emoji?” and therefore this should get the isEmoji spelling, but I'm not sure why that would be the common case. In fact, I struggle to think of a good use for it at all. Perhaps I'm naïvely thinking that a lot of common presenters would follow the Unicode default guidance here, so that would be the more generally useful interpretation.

benrimmington · July 30, 2018, 6:55am

I'm not sure how useful the proposed APIs will be, but I agree with the motivation "to increase the usefulness of Character".

It might be confusing to have differently named/nested/typed properties.

`Character`	`Unicode.Scalar`
`.isHexadecimalDigit`	`.properties.isHexDigit`
`.isLetter`	`.properties.isAlphabetic`
`.isMathSymbol`	`.properties.isMath`
`.wholeNumberValue`	`.properties.numericValue`

(An isASCII property might also be useful on String or StringProtocol).
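For a side-by-side feel of the two levels, here is a small sketch using the names as they later shipped in Swift 5 (which differ in places from the proposed names in the table). The example characters are my own.

```swift
let character: Character = "½"
let scalar: Unicode.Scalar = "½"

// Character-level queries: Bool plus an optional Int.
print(character.isNumber, character.wholeNumberValue as Any)
// Scalar-level properties: an optional Double taken from the Unicode data.
print(scalar.properties.numericValue as Any)

let plus: Character = "+"
let plusScalar: Unicode.Scalar = "+"
print(plus.isMathSymbol, plusScalar.properties.isMath)

let letter: Character = "a"
let letterScalar: Unicode.Scalar = "a"
print(letter.isLetter, letterScalar.properties.isAlphabetic)
```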

Michael_Ilseman · July 30, 2018, 5:39pm

In this case, the presenter is the application itself, i.e. the developer making the query. Now, it may be that the developer has unwittingly delegated this decision to a rendering environment as you suggest. But it also may be that the developer is delegating with intent, or even handling this themselves. The standard library cannot know, and whether something is a candidate for presentation as an emoji is useful information.

You bring up an interesting point. In your view, it seems like Unicode’s recommendation about default presentation could be every bit as relevant as (or even more relevant than) whether that choice is up to the developer.

How should we expose these two pieces of information?

Strawman (with alternative names):

Character.isEmoji / Character.isEmojiPresentable / Character.canBeEmoji
- Whether this character can be rendered as an emoji, depending on the rendering environment
Character.isEmojiDefault / Character.isDefaultEmojiPresentable / Character.isEmojiByDefault
- Whether this character can be rendered as an emoji and Unicode recommends doing so by default.

Michael_Ilseman · July 30, 2018, 6:01pm

The names on Unicode.Scalar.Properties directly reflect the UCD entry's name, without any attempt at choosing a better name other than basic grammatical transforms (e.g. the leading "is"). Swift is just directly echoing the data tables.

We should pick the best name for properties directly on Character, under Swift's interpretation of the result. These names may align, but if Character's name differs somewhat from the UCD's name that could be a benefit, as the behavior does not necessarily align directly.

Side note: I think isHexDigit is a totally reasonable alternative name to isHexadecimalDigit, as the "hex" short-hand has a pretty well established term-of-art status.

Future work: As for String.isASCII, we definitely will want to expose these kinds of queries in conjunction with work on performance flags, where the standard library will want to track known-ASCII-ness and other properties to enable processing fast paths. In that world, there's an isKnownASCII vs isASCII split, where the latter may require a scan of the entire String to compute. Or, this may be spelled String.isASCII(performScan: Bool = true), etc. Similarly for normalization status, trivial graphemes, perhaps encoding validity (if not enforced at creation), etc.
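The split could be pictured with a toy wrapper like the one below. The type and its stored flag are entirely hypothetical and say nothing about how the standard library would track this internally.

```swift
// Hypothetical wrapper type, for illustration only.
struct FlaggedString {
    let contents: String
    // True only when ASCII-ness was already established, e.g. at creation.
    let isKnownASCII: Bool

    // Uses the flag as a fast path; otherwise scans the UTF-8 view, O(n).
    var isASCII: Bool {
        return isKnownASCII || contents.utf8.allSatisfy { $0 < 0x80 }
    }
}

let flagged = FlaggedString(contents: "hello", isKnownASCII: true)
let unflagged = FlaggedString(contents: "hello", isKnownASCII: false)
let accented = FlaggedString(contents: "héllo", isKnownASCII: false)
print(flagged.isASCII, unflagged.isASCII, accented.isASCII)
```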

Ben_Cohen · July 30, 2018, 6:05pm

Review Manager: review extended a bit to conclude discussion

Hi everyone – thanks for your feedback on the review so far. The review period has now ended, but I'm going to hold the review open for a short while longer specifically to continue the discussion on the naming/semantics of isEmoji.

jawbroken · July 31, 2018, 12:22am

Okay, I understand. So “can be rendered as an emoji” might be useful for people who are manually customising the rendering, or implementing functionality that e.g. tries to match the logic of some rendering environment. Using isEmoji for “can be rendered as an emoji” seems liable to cause a lot of confusion though. I would almost suggest that isEmoji should mean isEmojiByDefault, because my intuition is that it will be the more commonly used form and I think it will match people's expectations in most cases. If that's not acceptable then the pair of canBeEmoji/isEmojiByDefault seems to hit the right tradeoff between accuracy and brevity for me. This would also have the benefit of autocompleting the “right way” when someone is looking for isEmoji, with the ByDefault part warning them that it is not as simple as they might think.

benrimmington · July 31, 2018, 10:35am

When searching for ✈︎ (U+2708 U+FE0E: airplane with text presentation selector) the "instant answer" is from Emojipedia.

Should this kind of use case be supported by another isEmoji... property?

Michael_Ilseman · July 31, 2018, 7:47pm

I don't think this is the case at all. Many modern presenters prefer emoji presentation whenever possible. As I pointed out earlier, ✈ is rendered textually in Safari on Mac, but as an emoji in Safari on iOS.