SE-0221 – Character Properties

Michael_Ilseman · July 27, 2018, 8:44pm

Right, this is a fairly confusing area and I should provide some background. Refer to UTS #51: Unicode Emoji for the source-of-truth.

We are talking about Characters that are candidates for being displayed as emoji, i.e. rendered with colorful images rather than as normal text. ("Candidate" is my word for something that could be emoji-presentable, not the spec's.)

We’re also dealing with two different ends here: the text author (input) and the text presenter (output).

Author’s Choice

The author of such a Character can use variation selectors to dictate rendering. The text presentation selector (U+FE0E) dictates that it must not be rendered as emoji while the emoji presentation selector (U+FE0F) dictates that it must be rendered as emoji.

This identifies two classifications:

Emoji candidate explicitly dictated to be rendered as emoji
Emoji candidate explicitly dictated to not be rendered as emoji

See Emoji Presentation Sequences, v11.0 for a listing of these.
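
As a rough illustration of the author's side, here is a minimal sketch that looks for either selector inside a Character. The names `ExplicitPresentation` and `explicitPresentation(of:)` are hypothetical, made up for this example. Scanning every scalar is a simplification: the spec attaches the selector directly to a base scalar.

```swift
// Hypothetical names; only the two selector code points come from the standard.
enum ExplicitPresentation {
    case emoji   // the grapheme contains U+FE0F
    case text    // the grapheme contains U+FE0E
}

// Returns the author's explicit choice, or nil when no selector is present.
func explicitPresentation(of character: Character) -> ExplicitPresentation? {
    for scalar in character.unicodeScalars {
        switch scalar.value {
        case 0xFE0F: return .emoji
        case 0xFE0E: return .text
        default: continue
        }
    }
    return nil
}

let plain: Character = "\u{2708}"            // ✈ with no selector
let asText: Character = "\u{2708}\u{FE0E}"   // ✈ + text presentation selector
let asEmoji: Character = "\u{2708}\u{FE0F}"  // ✈ + emoji presentation selector

print(explicitPresentation(of: plain) == nil)      // true: presenter's choice
print(explicitPresentation(of: asText) == .text)   // true
print(explicitPresentation(of: asEmoji) == .emoji) // true
```

The selectors are grapheme extenders, so each of the three literals above is still a single Character.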

Presenter’s Choice

In the absence of such an explicit selector, it is the presenter’s decision (i.e. the application or renderer). A word processing application is likely to prefer textual presentation whenever possible, while a chat app is more likely to prefer emoji rendering whenever possible.

For example, ✈ is rendered textually if I’m viewing Discourse on my laptop’s web browser and as an image when viewed through the Discourse app on my phone.

The Unicode standard never dictates presentation style for these emoji, but it wants to promote interoperability and consistency across applications and platforms. For that, it provides “default” guidance using scalar properties, but again, it is fully up to the presenter. This guidance is useful for applications without a strong preference one way or another; see Emoji vs Text Display.

The standard identifies 3 groupings for graphemes:

Emoji-default
Text-default
Text-only

Confusingly, this part of the standard describes these groups in terms of the Emoji and Emoji_Presentation properties on the leading scalar, which is insufficient to identify whether a grapheme could be presented as an emoji.
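
To make the three groupings concrete, here is a minimal sketch that classifies a Character by its leading scalar only, using the `isEmoji` and `isEmojiPresentation` scalar properties from the standard library. The names `DefaultPresentation` and `defaultPresentation(of:)` are hypothetical, and the results depend on the Unicode data the runtime ships with.

```swift
// Hypothetical names mirroring the three groupings in the standard.
enum DefaultPresentation {
    case emojiDefault  // Emoji_Presentation=YES
    case textDefault   // Emoji=YES, Emoji_Presentation=NO
    case textOnly      // Emoji=NO
}

// Looks at the leading scalar only, which is exactly why this is
// insufficient to decide whether the whole grapheme can be an emoji.
func defaultPresentation(of character: Character) -> DefaultPresentation {
    guard let lead = character.unicodeScalars.first else { return .textOnly }
    let properties = lead.properties
    if properties.isEmojiPresentation { return .emojiDefault }
    if properties.isEmoji { return .textDefault }
    return .textOnly
}

print(defaultPresentation(of: "\u{1F600}") == .emojiDefault) // 😀
print(defaultPresentation(of: "\u{2708}") == .textDefault)   // ✈
print(defaultPresentation(of: "1") == .textDefault)          // digit one
print(defaultPresentation(of: "a") == .textOnly)
```

Note that ✈ and 1 land in the same group here even though only one of them is an emoji candidate on its own.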

Candidate Emoji Graphemes

A grapheme without an explicit selector whose leading scalar has Emoji=YES and Emoji_Presentation=NO (a so-called text-default leading scalar) requires further interpretation to determine if it’s a candidate for emoji presentation. This was the situation that @xwu and I were discussing earlier in this thread.

For example, 1 has Emoji=YES; Emoji_Presentation=NO, yet is not an emoji candidate unless it is the leading scalar of a well-formed emoji sequence, such as U+0031 U+FE0F U+20E3 (1️⃣). In contrast, ✈ has the same two properties but is a candidate by itself.

Unicode provides the Emoji_Component and Extended_Pictographic properties to assist in interpreting sequences such as keycaps, flags, and tags, and in future-proofing segmentation.

✈ has Emoji_Component=NO while 1 has Emoji_Component=YES. We can thus use Emoji_Component on the leading scalar to specifically check for cases like 1 or regional indicators whose candidacy is conditional on the rest of the grapheme.
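
A minimal sketch of that check follows. As far as I know the standard library does not expose Emoji_Component, so the table below is hand-written from my reading of the Emoji 11.0 data and should be treated as an assumption. The function names are hypothetical.

```swift
// Hand-written Emoji_Component table (assumed to match Emoji 11.0 data).
func isEmojiComponent(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.value {
    case 0x23, 0x2A,            // '#' and '*'
         0x30...0x39,           // digits 0-9
         0x200D,                // zero width joiner
         0x20E3,                // combining enclosing keycap
         0xFE0F,                // emoji presentation selector
         0x1F1E6...0x1F1FF,     // regional indicators
         0x1F3FB...0x1F3FF,     // skin tone modifiers
         0x1F9B0...0x1F9B3,     // hair components
         0xE0020...0xE007F:     // tag scalars
        return true
    default:
        return false
    }
}

// True when the leading scalar is an emoji whose candidacy depends on
// the rest of the grapheme (for example "1" or a regional indicator).
func hasConditionalCandidacy(_ character: Character) -> Bool {
    guard let lead = character.unicodeScalars.first else { return false }
    return lead.properties.isEmoji && isEmojiComponent(lead)
}

print(hasConditionalCandidacy("1"))         // true
print(hasConditionalCandidacy("\u{2708}"))  // false: ✈ is a candidate by itself
```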

See UTS #51: Unicode Emoji for more details on these sets. There are many kinds of emoji sequences, and future versions are likely to tweak these or add more.

Swift’s Character

Now, the question is how to expose this on Character, acknowledging the fact that emoji-ness is inherently a little fuzzy, application-specific, and can change over time with newer versions of Unicode. Unicode 11.0 with Emoji 11.0 definitely marks a point of emoji-maturity for Unicode, but things continue to change.

One approach could be an EmojiStatus enum and query, with a case for each of the five groups above: explicitEmojiPresentation, explicitTextPresentation, defaultEmojiPresentation, defaultTextPresentation, and notEmoji. However, this grouping may change in the future (e.g. similar to how numeric classification did), and we’d be baking it in as something with strict source compatibility constraints. Grouping based on the kind of emoji sequence would be even more fragile.

Groupings are also less ergonomic for most use cases. Users are more likely to want to ask two separate questions:

Could this be presented as an emoji?
If yes, what is Unicode’s guidance regarding default presentation?
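
For comparison, here is a sketch of how the enum alternative would relate to those two questions. The case names come from the paragraph above; the two computed properties and their names are hypothetical. Treating an explicit text selector as "not presentable as emoji" is one reading of the first question.

```swift
// Sketch of the enum alternative; not a proposed API.
enum EmojiStatus {
    case explicitEmojiPresentation
    case explicitTextPresentation
    case defaultEmojiPresentation
    case defaultTextPresentation
    case notEmoji

    // Question 1: could this be presented as an emoji?
    var couldBePresentedAsEmoji: Bool {
        switch self {
        case .explicitEmojiPresentation, .defaultEmojiPresentation, .defaultTextPresentation:
            return true
        case .explicitTextPresentation, .notEmoji:
            return false
        }
    }

    // Question 2: is emoji presentation recommended, explicitly or by default?
    var isRecommendedEmojiPresentation: Bool {
        switch self {
        case .explicitEmojiPresentation, .defaultEmojiPresentation:
            return true
        case .explicitTextPresentation, .defaultTextPresentation, .notEmoji:
            return false
        }
    }
}

let status = EmojiStatus.defaultTextPresentation
print(status.couldBePresentedAsEmoji)         // true
print(status.isRecommendedEmojiPresentation)  // false
```

Both switches are exhaustive, so any future change to the set of cases would ripple through every client that switches over the enum. That is the source compatibility concern in a nutshell.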

Bicycle Storage Facility

Our preferred paint color is to use Bool properties on Character. We don’t think there’s much additional clarity gained from more verbose names, such as mayHaveEmojiPresentation over isEmoji.

Character.isEmoji: whether this Character could be presented as an emoji
- False for leading scalar with Emoji=NO
- False for a grapheme containing an explicit text selector
- Otherwise, true for any leading scalar with Emoji_Component=NO
- Otherwise, whether the grapheme is an emoji sequence with a leading component scalar:
  - Flags: leads with a pair of regional indicators
  - Tags: has a series of tag scalars ending with a tag terminator (U+E007F)
  - Keycaps: [0-9#*] U+FE0F U+20E3
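
The steps above can be sketched as a free function. This is only an illustration under stated assumptions, not the proposed implementation: `isEmojiCandidate` is a hypothetical name, the Emoji_Component table is hand-written from my reading of the Emoji 11.0 data, and the tag-sequence branch is left out to keep the example short.

```swift
// Hand-written Emoji_Component table (assumed to match Emoji 11.0 data).
func isEmojiComponent(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.value {
    case 0x23, 0x2A, 0x30...0x39, 0x200D, 0x20E3, 0xFE0F,
         0x1F1E6...0x1F1FF, 0x1F3FB...0x1F3FF,
         0x1F9B0...0x1F9B3, 0xE0020...0xE007F:
        return true
    default:
        return false
    }
}

func isRegionalIndicator(_ scalar: Unicode.Scalar) -> Bool {
    return (0x1F1E6...0x1F1FF).contains(scalar.value)
}

// Hypothetical stand-in for the Character.isEmoji query described above.
func isEmojiCandidate(_ character: Character) -> Bool {
    let scalars = Array(character.unicodeScalars)

    // False for a leading scalar with Emoji=NO.
    guard let lead = scalars.first, lead.properties.isEmoji else { return false }

    // False for a grapheme containing an explicit text selector.
    if scalars.contains(where: { $0.value == 0xFE0E }) { return false }

    // Otherwise, true for any leading scalar with Emoji_Component=NO.
    if !isEmojiComponent(lead) { return true }

    // Flags: leads with a pair of regional indicators.
    if scalars.count >= 2,
       isRegionalIndicator(scalars[0]), isRegionalIndicator(scalars[1]) {
        return true
    }

    // Keycaps: [0-9#*] U+FE0F U+20E3, and nothing else.
    let isKeycapBase: Bool
    switch lead.value {
    case 0x23, 0x2A, 0x30...0x39: isKeycapBase = true
    default: isKeycapBase = false
    }
    return isKeycapBase && scalars.count == 3
        && scalars[1].value == 0xFE0F && scalars[2].value == 0x20E3
}

print(isEmojiCandidate("1"))                   // false
print(isEmojiCandidate("1\u{FE0F}\u{20E3}"))   // true: keycap sequence
print(isEmojiCandidate("\u{2708}"))            // true: ✈
print(isEmojiCandidate("\u{2708}\u{FE0E}"))    // false: explicit text selector
print(isEmojiCandidate("\u{1F1FA}\u{1F1F8}"))  // true: flag
```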

The particulars of that algorithm would not be baked into the ABI, but the API signature would be.

From this thread, it seems like your intuition (and likely that of many others) is that the second conditional query, i.e. Unicode’s guidance regarding default presentation, would be a very useful addition. How do you think this should appear? One possibility:

Character.isEmojiPresentation: whether this Character is recommended to be presented as an emoji, explicitly or by default
- False for anything that Character.isEmoji doesn’t accept
- Otherwise, true for anything with an explicit emoji presentation selector
- Otherwise, whether leading scalar has Emoji_Presentation=YES
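
A minimal sketch of this second query, again purely illustrative. The function name is hypothetical, and the candidacy check is passed in as a closure so the example stays self-contained. The `simplifiedCandidate` predicate used below is a deliberately crude stand-in for Character.isEmoji (it ignores the Emoji_Component cases).

```swift
// Hypothetical stand-in for the Character.isEmojiPresentation query.
// `isCandidate` plays the role of Character.isEmoji.
func isRecommendedEmojiPresentation(
    _ character: Character,
    isCandidate: (Character) -> Bool
) -> Bool {
    // False for anything the candidacy query doesn't accept.
    guard isCandidate(character) else { return false }

    let scalars = character.unicodeScalars

    // Otherwise, true for anything with an explicit emoji presentation selector.
    if scalars.contains(where: { $0.value == 0xFE0F }) { return true }

    // Otherwise, whether the leading scalar has Emoji_Presentation=YES.
    return scalars.first?.properties.isEmojiPresentation ?? false
}

// Crude stand-in: leading scalar has Emoji=YES and there is no text selector.
let simplifiedCandidate: (Character) -> Bool = { character in
    let scalars = character.unicodeScalars
    guard let lead = scalars.first, lead.properties.isEmoji else { return false }
    return !scalars.contains(where: { $0.value == 0xFE0E })
}

print(isRecommendedEmojiPresentation("\u{2708}", isCandidate: simplifiedCandidate))          // false: text-default
print(isRecommendedEmojiPresentation("\u{2708}\u{FE0F}", isCandidate: simplifiedCandidate))  // true: explicit selector
print(isRecommendedEmojiPresentation("\u{1F600}", isCandidate: simplifiedCandidate))         // true: emoji-default
```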