SE-0221 – Character Properties

The names on Unicode.Scalar.Properties directly reflect the UCD entry's name, without any attempt at choosing a better name other than basic grammatical transforms (e.g. the leading "is"). Swift is just directly echoing the data tables.
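To illustrate how closely those names track the UCD, here is a small sketch using properties that exist on Unicode.Scalar.Properties today. The sample scalars and variable names are only illustrative.

```swift
// Each property maps to a UCD property of nearly the same name:
// Alphabetic, White_Space, Hex_Digit, ASCII_Hex_Digit.
let letter: Unicode.Scalar = "A"
let space: Unicode.Scalar = " "
let fullwidthA: Unicode.Scalar = "\u{FF21}" // FULLWIDTH LATIN CAPITAL LETTER A

let letterIsAlphabetic = letter.properties.isAlphabetic
let spaceIsWhitespace = space.properties.isWhitespace

// The UCD distinguishes Hex_Digit from ASCII_Hex_Digit, and so does Swift.
let fullwidthIsHexDigit = fullwidthA.properties.isHexDigit
let fullwidthIsASCIIHexDigit = fullwidthA.properties.isASCIIHexDigit
```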

We should pick the best name for properties directly on Character, under Swift's interpretation of the result. These names may align, but if Character's name differs somewhat from the UCD's name that could be a benefit, as the behavior does not necessarily align directly.

Side note: I think isHexDigit is a totally reasonable alternative name to isHexadecimalDigit, as the "hex" short-hand has a pretty well established term-of-art status.
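For reference, the Character-level properties in the standard library as shipped are spelled isHexDigit and hexDigitValue. A short sketch with illustrative inputs:

```swift
let candidates: [Character] = ["7", "c", "F", "g"]

// isHexDigit answers the yes/no question for each character.
let flags = candidates.map { $0.isHexDigit }

// hexDigitValue gives the numeric value, or nil for non-hex characters.
let values = candidates.map { $0.hexDigitValue }
```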

Future work: As for String.isASCII, we definitely will want to expose these kinds of queries in conjunction with work on performance flags, where the standard library will want to track known-ASCII-ness and other properties to enable processing fast paths. In that world, there's an isKnownASCII vs isASCII split, where the latter may require a scan of the entire String to compute. Or, this may be spelled String.isASCII(performScan: Bool = true), etc. Similarly for normalization status, trivial graphemes, perhaps encoding validity (if not enforced at creation), etc.
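A rough sketch of what that split could look like, written as a wrapper type so it runs today. Everything here (FlaggedString, knownASCII, the performScan behavior) is hypothetical and is not standard library API.

```swift
// Hypothetical wrapper: none of these names exist in the standard library.
struct FlaggedString {
    let storage: String
    // Set only when ASCII-ness was established cheaply at creation time.
    let isKnownASCII: Bool

    init(_ storage: String, knownASCII: Bool = false) {
        self.storage = storage
        self.isKnownASCII = knownASCII
    }

    // Returns true immediately if the flag is set. Otherwise scans the UTF-8
    // view, unless the caller opts out, in which case the answer is a
    // conservative false ("not known to be ASCII").
    func isASCII(performScan: Bool = true) -> Bool {
        if isKnownASCII { return true }
        guard performScan else { return false }
        return storage.utf8.allSatisfy { $0 < 0x80 }
    }
}
```

With performScan: false the query is constant time but may under-report, which is the tradeoff the isKnownASCII spelling makes explicit.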

Review Manager: review extended a bit to conclude discussion

Hi everyone – thanks for your feedback on the review so far. The review period has now ended, but I'm going to hold the review open for a short while longer specifically to continue the discussion on the naming/semantics of isEmoji.

Okay, I understand. So “can be rendered as an emoji” might be useful for people who are manually customising the rendering, or implementing functionality that e.g. tries to match the logic of some rendering environment. Using isEmoji for “can be rendered as an emoji” seems liable to cause a lot of confusion though. I would almost suggest that isEmoji should mean isEmojiByDefault, because my intuition is that will be the more commonly used form and I think it will match people's expectations in most cases. If that's not acceptable then the pair of canBeEmoji/isEmojiByDefault seem to hit the right tradeoff between accuracy and brevity for me. This would also have the benefit of autocompleting the “right way” when someone is looking for isEmoji, with the ByDefault part warning them that it is not as simple as they might think.
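A minimal sketch of that pair, assuming each property only inspects the first scalar's UCD properties (Emoji and Emoji_Presentation). The names canBeEmoji and isEmojiByDefault are the hypothetical spellings from this post, and the sketch ignores variation selectors and multi-scalar sequences.

```swift
extension Character {
    // Hypothetical: true if the first scalar has the UCD Emoji property.
    var canBeEmoji: Bool {
        return unicodeScalars.first!.properties.isEmoji
    }

    // Hypothetical: true if the first scalar has Emoji_Presentation,
    // i.e. Unicode recommends emoji presentation without a selector.
    var isEmojiByDefault: Bool {
        return unicodeScalars.first!.properties.isEmojiPresentation
    }
}

let digitOne: Character = "1"
let grinning: Character = "\u{1F600}"
let plainA: Character = "a"
```

The digit shows why the names matter: "1" has the Emoji property (it can form a keycap sequence) but is not emoji by default.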

When searching for ✈︎ (U+2708 U+FE0E: airplane with text presentation selector) the "instant answer" is from Emojipedia.

Should this kind of use case be supported by another isEmoji... property?

I don't think this is the case at all. Many modern presenters prefer emoji presentation whenever possible. As I pointed out earlier, is rendered textually in Safari on Mac, but as an emoji in Safari on iOS.

We could add a presentation enum:

enum EmojiPresentation {
  case defaultEmoji
  case defaultTextual
  case explicitEmoji
  case explicitTextual
}
extension Character {
  var emojiPresentation: EmojiPresentation? { ... }
}

But, it’s probably too early to add this. Is EmojiPresentation frozen or not? If it’s frozen, then we have elevated something much more tightly coupled with current Unicode details regarding emojiness into the source and ABI compatibility story. If it’s non-frozen, then users cannot meaningfully switch over all cases, and checking against one case carries the threat of missing a future additional case.
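For concreteness, one way such a property might be computed from existing scalar properties. This is a sketch under simplifying assumptions: it looks only at the first scalar plus any variation selector, and ignores ZWJ sequences, flags, keycaps, and modifiers.

```swift
enum EmojiPresentation {
    case defaultEmoji
    case defaultTextual
    case explicitEmoji
    case explicitTextual
}

extension Character {
    // Sketch only; not a standard library API.
    var emojiPresentation: EmojiPresentation? {
        let scalars = unicodeScalars
        let base = scalars.first!.properties
        guard base.isEmoji else { return nil }
        // U+FE0F requests emoji presentation, U+FE0E requests text.
        if scalars.contains("\u{FE0F}") { return .explicitEmoji }
        if scalars.contains("\u{FE0E}") { return .explicitTextual }
        return base.isEmojiPresentation ? .defaultEmoji : .defaultTextual
    }
}

let airplane: Character = "\u{2708}"
let airplaneAsText: Character = "\u{2708}\u{FE0E}"
let airplaneAsEmoji: Character = "\u{2708}\u{FE0F}"
```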

My recommendation is to add var isEmoji: Bool for now, likely under availability constraints. Unicode 11 seems (and I’ll eat these words one day) mature enough for us to provide a common-sense answer to this.

Sure, but it's hard for me to tell if most presenters will match your proposed isEmoji. e.g. Does Safari on iOS match it in all cases? I can say that at least one very common and important presenter, Xcode (and Swift Playgrounds), seems to use the Unicode defaults in my non-exhaustive tests.

It's not possible to do something that will match all presenters, so someone is going to be unhappy. My opinion is that it's better to claim something is an emoji and have it render as text than to claim something is not an emoji and have it render as one. I'd liken something failing to render as emoji to the situation where your font doesn't have a glyph for a particular Character at all.

Sure, so I see one goal here to be building tools that let people explicitly match the presenter behaviour. My theory is that it is probably common for presenters to take the Unicode default behaviour here, so it would be useful to provide that information in some way. It sounds like the current idea is to only provide the “could be presented as emoji” version, which might be another common behaviour that is (possibly?) used by Safari on iOS (but not macOS). So perhaps both could be provided, as @Michael_Ilseman previously mentioned.

Okay, but I'm not sure why that is the better failure mode here, so this isn't currently my opinion. The Unicode default version seems to match Xcode/Swift Playgrounds/macOS Terminal, so it would be useful for teaching material. And I think isEmoji is definitely too strong a name for either canBeEmoji or isEmojiByDefault, whichever versions actually make it into the standard library.

I don't think this analogy really holds. These characters are not “supposed” to render as emoji, failing to, and then falling back to plain text. There is a Unicode default for how they are presented (which would be my starting point for “supposed to” in this case) that is being overridden/customised in some cases.

Having re-read UTS#51, I'm coming around to the sense that an enum is called for.

As the standard says, this is designed to allow implementations to choose between presentation behaviors, and it would allow users who want to target particular environments a better sense of what those behaviors would be. To my mind, this enum would only be coupled as tightly to UTS#51 as String is generally to Unicode, which seems entirely appropriate.

Given the terminology in UTS#51, I'd suggest a spelling as follows:

extension Character {
  enum PresentationStyle {
    case emoji
    case emojiByDefault
    case textByDefault
    case text
  }

  var presentationStyle: PresentationStyle /* non-optional */ { ... }
}

To be future-proof, the enum should be non-frozen. I don't think an exhaustive switch is necessary: if or when Unicode supports arbitrary embedded graphics, it's not as though any currently written software, unmodified, will know what to do with it, so requiring an @unknown default (which, for a renderer, say, would simply render one of those placeholder question mark glyphs) seems fine.
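As a sketch of what client code would look like under that design: the enum below is a local stand-in for the non-frozen enum a library would vend, and fontHint(for:) with its return strings is made up for illustration. Because the stand-in is declared in the same module, the @unknown default branch is never reached here; it only shows the shape a client switch would take.

```swift
// Local stand-in for a library-vended, non-frozen enum.
enum PresentationStyle {
    case emoji
    case emojiByDefault
    case textByDefault
    case text
}

// Hypothetical renderer helper.
func fontHint(for style: PresentationStyle) -> String {
    switch style {
    case .emoji, .emojiByDefault:
        return "color-emoji"
    case .textByDefault, .text:
        return "text"
    @unknown default:
        // A style added by a future Unicode version: fall back to a placeholder.
        return "placeholder"
    }
}
```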

The enum is certainly appealing, pending stability concerns.

UTS#51 is all about emoji, so I think the EmojiPresentationStyle would be a better name than PresentationStyle.

I think this is different. Most of String's churn from changes in Unicode versions surface as runtime behavior differences and not source breaks or additional enum cases. For example, while String implements grapheme-breaking behind the scenes, it does not expose grapheme-breaking rules in source, so when every recent version of Unicode has changed the rules, there is no source churn.
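A small illustration of that point: source code only ever sees the result of grapheme breaking, never the rules themselves. The sample string is illustrative.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT.
let accented = "e\u{301}"

// How scalars group into Characters is decided at run time by the
// standard library's grapheme-breaking implementation.
let characterCount = accented.count
let scalarCount = accented.unicodeScalars.count
```

If a future Unicode version changed how some sequence is segmented, code like this would still compile unchanged; only the computed count could differ.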

Adding this enum is promoting what elsewhere is runtime behavior differences into a source compatibility requirement. The closest analogy here would be enums inside of Unicode.Scalar.Properties (general category, numeric type), which are far more mature and exist as stable properties in the UCD. Presentation styles themselves, however, are not codified as UCD entries, though properties that determine what style to use are.
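For comparison, this is what the existing enums on Unicode.Scalar.Properties look like in use. The describe(_:) helper and its strings are illustrative only.

```swift
let upperA: Unicode.Scalar = "A"
let five: Unicode.Scalar = "5"

// Illustrative helper that switches over Unicode.GeneralCategory.
func describe(_ scalar: Unicode.Scalar) -> String {
    switch scalar.properties.generalCategory {
    case .uppercaseLetter: return "uppercase letter"
    case .decimalNumber: return "decimal number"
    default: return "other"
    }
}

// numericType is optional: nil for scalars with no numeric meaning.
let fiveNumericType = five.properties.numericType
let letterNumericType = upperA.properties.numericType
```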

That being said, the enum is appealing as we can convey what we know about Unicode's recommendations.

I agree that if we go the enum route, it should be non-frozen. Beyond just exhaustively switching, code which guards against one specific case (e.g. 'text') could be invalidated if a new Unicode version kicks off a re-shuffling of presentation style. I'm not saying this concern is a hard argument against the enum, just for consideration.

Since this topic is still open I'd like to make a comment about the use of the term "ASCII" in function labels and other places. Swift has been officially ported to the mainframe z/OS operating system (see Mainframe Developer Center is moving Oct 16th!). z/OS, while supporting Unicode and various ASCII code pages, is historically an EBCDIC-based operating system. Seeing "ascii" used in names such as init?(ascii:radix:), isASCII and asciiValue concerns me: there might be confusion for z/OS developers, or the result might not even be useful in an EBCDIC environment.

In the case of init?(ascii:radix:) it appears that the label in question is intended to indicate that the string is limited to digits 0-9 and characters A-Z (or a-z?). So it's really limited to a subset of ASCII, anyway. I don't have an alternative I'm really happy with, but... what about alphanumeric, with a very literal interpretation of that word: alphabetic (A-Z) and numeric (0-9) characters only. Though I dare mention that one could easily support up to Base-94 encoding by using digits, upper and lower case being distinct, and 32 other (sorry) ASCII characters:

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}

(note that the first character is the space character.)

I'm not necessarily asking for Base-94 to be supported as part of this effort, but it's something to keep in mind. I guess the ascii label would be better than alphanumeric in that case... Is latin1Character more appropriate?

As for isASCII and asciiValue I'm not sure offhand if those names are appropriate or not for their function in an EBCDIC environment. I guess I would just ask that this issue be kept in mind when naming things and developing associated features.

The point of init?(ascii: radix:) is to get an ASCII character code. It won't be especially helpful if you're actually using EBCDIC, but changing the name isn't going to make it helpful for EBCDIC either. The ASCII character codes are, by design, a strict subset of Unicode, which is why they get special affordances.
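The subset relationship is visible in the Character properties themselves: an ASCII code and the corresponding Unicode scalar value are the same number. A short sketch with illustrative values:

```swift
let letterA: Character = "A"
let eAcute: Character = "\u{E9}" // "é", outside ASCII

let aIsASCII = letterA.isASCII
let aValue = letterA.asciiValue       // the ASCII code, as UInt8?
let eAcuteValue = eAcute.asciiValue   // nil: no ASCII code exists

// Going the other way: any UInt8 ASCII code is directly a Unicode scalar.
let roundTripped = Character(Unicode.Scalar(UInt8(65)))
```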

Review Manager Update: Extension for isEmoji discussion.

The proposal has been accepted by the core team, but this thread is being held open until August 10th for specific discussion of the options for implementing the isEmoji property. Thanks for all your input so far.

As @Nobody1707 mentioned, ASCII is a subset of Unicode, hence why we have support for those 128 code points.

The standard library has built-in support for Unicode and subsets of Unicode. Natively supporting encodings that are not compatible with Unicode is not a goal of String as part of the standard library. For support of non-Unicode-compatible strings, first transcode into a compatible encoding and then create a String from there. An EBCDIC string that was transcoded into Unicode and then surfaced in Swift as a String would exhibit the behavior of the Unicode-compatible representation. This is true for all String operations and queries, including these ASCII-based Character properties.
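A toy sketch of that transcode-first workflow. The function name is hypothetical and the mapping is a hand-written, deliberately partial table for EBCDIC code page 37 (space, digits, and uppercase A-Z only); a real program would use a complete conversion table.

```swift
// Hypothetical helper covering only a small subset of code page 37.
func transcodeCP037Subset(_ bytes: [UInt8]) -> String? {
    var result = ""
    for byte in bytes {
        let ascii: UInt8
        switch byte {
        case 0x40: ascii = 0x20                          // space
        case 0xF0...0xF9: ascii = 0x30 + (byte - 0xF0)   // "0"..."9"
        case 0xC1...0xC9: ascii = 0x41 + (byte - 0xC1)   // "A"..."I"
        case 0xD1...0xD9: ascii = 0x4A + (byte - 0xD1)   // "J"..."R"
        case 0xE2...0xE9: ascii = 0x53 + (byte - 0xE2)   // "S"..."Z"
        default: return nil                              // not covered by this sketch
        }
        result.unicodeScalars.append(Unicode.Scalar(ascii))
    }
    return result
}

// EBCDIC bytes for "HI 1".
let transcoded = transcodeCP037Subset([0xC8, 0xC9, 0x40, 0xF1])
```

Once transcoded, the resulting String answers isASCII and asciiValue in terms of its Unicode representation, regardless of the original encoding.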

I'm not suggesting direct EBCDIC support within String. (Foundation does have support for EBCDIC codepage 37.) I am suggesting that perhaps "ascii:" is not the correct label for this "radix" function. The name does not make it at all clear to me what it means. Just one person's opinion...

If the core team is ok with this, I think the best action is to sever the initializer changes and emoji presentation part of this proposal.

The initializer changes are not essential to these properties. They can be deferred as future work for a proposal focusing more on number-parsing.

Emoji presentation is a complicated concept to surface as properties, due to it being context-sensitive and incorporating explicit and default presentation guidance. Surfacing this complexity directly ties source compatibility with an area of Unicode that’s more prone to changes version-to-version. I think it deserves its own proposal and we should defer it as future work.

That makes sense, though. It's an initializer for a number and accepts the subset of ascii that describes a number. I wouldn't be surprised if I handed it " and didn't receive a value. (maybe I might expect the code point or something? That seems like a stretch.)
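The existing string-based initializer, Int(_:radix:), already behaves this way, which may help calibrate expectations. A brief sketch with illustrative inputs:

```swift
// Digits and ASCII letters are accepted up to the given radix.
let hex = Int("ff", radix: 16)
let base36 = Int("zz", radix: 36)

// Anything outside that subset yields nil rather than a code point.
let quote = Int("\"", radix: 10)
let spelledOut = Int("one", radix: 10)
```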

The difference between ASCII and EBCDIC seems to be the code points themselves, which I've already said seems like a stretch to consider for the initializer.

alphanumeric has at least as much ambiguity in the direction of "one" seeming like an acceptable

All of this is to say that I think it is a mistake to look for something more precise than ascii when it is clear that we mean 'subset of ascii such that…'

Review Manager Update: Proposal Accepted with modifications

The provisional acceptance has been updated to reflect deferral of the .isEmoji property to a later proposal.

In addition, the proposed source-breaking change to the string-to-integer conversion initializer has been dropped.
