Pitch: Unicode Named Character Escape Sequence

mattt · November 29, 2018, 4:45pm

Introduction

This proposal adds a new \N{name} escape sequence to Swift string literals, where name is the name of a Unicode character.

Discussion

The Unicode named character escape sequence was previously discussed here:

Background

Each Unicode character is assigned a unique code point, a number between U+0000 and U+10FFFF, and a name, consisting of uppercase letters (A–Z), digits (0–9), hyphens, and spaces. For example, the Unicode character for the letter “A” used in English has the code point U+0041 and the name LATIN CAPITAL LETTER A. The term scalar value refers to the subset of Unicode code points that aren’t surrogate code points.

In Swift, a string literal may include a character directly ("A") or using the \u{n} escape sequence, where n is a 1–8 digit hexadecimal number corresponding to the scalar value ("\u{0041}"). A string literal may also include a character by interpolation (let letterA = "\u{0041}"; "\(letterA)").

Motivation

In Swift, it can be cumbersome to work with Unicode characters that are non-printing, confusable, or have difficulty rendering in the editor. This difficulty can inhibit developer productivity and cause programming errors.

Non-Printing Characters

Non-printing characters can’t be seen or directly interacted with in most editors (including Xcode), which makes them difficult to work with in code.

For example, the emoji 👩‍👧 is a sequence comprising 👩 (WOMAN U+1F469) + (ZERO WIDTH JOINER U+200D) + 👧 (GIRL U+1F467). The middle character, zero-width joiner (ZWJ), is a non-printing control character that changes the way glyphs are shaped for adjacent characters rather than having a distinct rendering itself.

There are currently a few different strategies for working with non-printing characters:

One approach is to use the \u{n} escape sequence in a string literal, passing the scalar value for each character.

// Unicode Scalar Value Escape
"\u{1F469}\u{200D}\u{1F467}" // "👩‍👧"

This achieves the desired results, but the use of opaque numerical constants makes the code difficult to understand.

Another approach is to assign character values to constants, using a combination of variable names and comments to clarify intent, and interpolate those values:

// Commented Declaration + Interpolation
let woman: Character = "\u{1F469}" // WOMAN
let zwj: Character = "\u{200D}" // ZERO WIDTH JOINER
let girl: Character = "\u{1F469}" // GIRL
"\(woman)\(zwj)\(girl)" // "👩‍👧"

This approach is more understandable than the previous one but requires more work on the part of the developer. More concerning, however, is that this approach can lead to difficult-to-track-down bugs if the comments and behavior conflict, whether due to a mistake initially or an erroneous change later on.

An ideal solution would combine the compiler checking of the former approach with the semantic clarity of the latter approach. This could be achieved by adding support for a new escape sequence, \N{name}, which allows Unicode characters to be included into string literals by name:

// Proposed \N Escape Sequence
"\N{WOMAN}\N{ZERO WIDTH JOINER}\N{GIRL}" // "👩‍👧"

Unicode 11.0 specifies 777 emoji ZWJ sequences, including 👩‍👧, that platforms are encouraged to support. Vendors may also choose to support other emoji ZWJ sequences, such as Microsoft’s “Hipster Cat”, which comprises 🐱 (CAT FACE U+1F431) + ZWJ + 👓 (EYEGLASSES U+1F453) and is currently supported only on Windows platforms.

In addition to Emoji, ZWJ is used in Arabic script and Indic scripts, including Devanagari and Kannada. Incidentally, Arabic script provides another example of the difficulty a developer faces when working with text in code: handling directionality.

Directional Formatting Characters

Arabic script is written right-to-left (RTL) whereas Latin script is written left-to-right (LTR). When working with text containing, for example, both Arabic and Latin script, the use of non-printing, directional formatting characters like RIGHT-TO-LEFT MARK U+200F (RLM) may be necessary to achieve the desired results.

As with the previous ZWJ example, the proposed “\N{name}” escape sequence offers a solution that’s both understandable to the developer and checked by the compiler:

// Unicode Scalar Value Escape
"The phrase is مرحبا بالعالم!\u{200F} in Arabic."

// Commented Declaration + Interpolation
let rlm: Character = "\u{200F}" // RIGHT-TO-LEFT MARK
"The phrase is مرحبا بالعالم!\(rlm) in Arabic."

// Proposed \N Escape Sequence
"The phrase is مرحبا بالعالم!\N{RIGHT-TO-LEFT MARK} in Arabic."

Confusable Characters

Even if a character is printing, its glyph may be ambiguous.

Unicode Technical Report #36 describes how characters in single-, mixed-, and whole-script contexts may be confused for another character. This phenomenon is demonstrated well by the Confusables Unicode Utility.

Correct handling of confusables is most important in security applications, such as for preventing hostname spoofing in URLs. However, confusable characters can be problematic in code as well. For example, consider the following selection from the 24 characters comprising Unicode’s Punctuation, Dash [Pd] category:

*   U+002D  HYPHEN-MINUS           -
*   U+2010  HYPHEN                 ‐
*   U+2011  NON-BREAKING HYPHEN    ‑
*   U+2012  FIGURE DASH            ‒
*   U+2013  EN DASH                –
*   U+2014  EM DASH                —
*   U+2015  HORIZONTAL BAR         ―
*   U+2E3A  TWO-EM DASH            ⸺
*   U+2E3B  THREE-EM DASH          ⸻

Most programming fonts render these characters almost identically. If a developer decides to include a character directly in a string literal, the original meaning may be lost in subsequent changes. A developer may not recognize a code convention for using en dash (–) to delimit range bounds, and instead type a hyphen-minus (-) somewhere else in the project.

Consider the following four options for including an en dash in code, including the proposed \N{name} escape sequence:

// Direct
"–"

// Unicode Scalar Value Escape
"\u{2013}"

// Commented Declaration + Interpolation
let enDash: Character = "\u{2013}" // EN DASH
"\(enDash)"

// Proposed \N Escape Sequence
"\N{EN DASH}"

Another example of confusable characters is cross-script homographs. Unicode defines separate code points for LATIN CAPITAL LETTER A (U+0041), CYRILLIC CAPITAL LETTER A (U+0410), and GREEK CAPITAL LETTER ALPHA (U+0391). However, these characters are indiscernible in most fonts.
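To make the distinction concrete, here is a minimal sketch that prints the code point and name of each of the three homographs using only existing Swift syntax. It assumes Swift 5 or later for `Unicode.Scalar.Properties.name` (SE-0211); the helper `describe(_:)` is a hypothetical name used only for this illustration.

```swift
// Three scalars that look alike in most fonts but are distinct code points.
let homographs: [Unicode.Scalar] = ["\u{0041}", "\u{0410}", "\u{0391}"]

/// Hypothetical helper: formats a scalar as "U+XXXX NAME".
func describe(_ scalar: Unicode.Scalar) -> String {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    // Pad to at least four hex digits, matching the usual U+XXXX notation.
    let padded = String(repeating: "0", count: max(0, 4 - hex.count)) + hex
    return "U+\(padded) \(scalar.properties.name ?? "<unnamed>")"
}

for scalar in homographs {
    print(describe(scalar))
}

// String comparison sees through the visual similarity.
let latinAndCyrillicDiffer = String(homographs[0]) != String(homographs[1])
```

The names come from the Unicode data available to the standard library at run time, which is the same information the proposed escape sequence would need at compile time.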

The proposed \N{name} escape sequence can be helpful for distinguishing between homographs like these:

"A == \u{0041} == \N{LATIN CAPITAL LETTER A}"
"А == \u{0410} == \N{CYRILLIC CAPITAL LETTER A}"
"Α == \u{0391} == \N{GREEK CAPITAL LETTER ALPHA}"

Design

The \N{} escape sequence is supported in a few programming languages. We propose to model the design according to these existing implementations.

Python

Python defines a \N{name} escape sequence. Support for name aliases was added in Python 3.3.

Perl

Perl defines a \N{} escape sequence that accepts code points with a U+, such as \N{U+0041}. Including the statement use charnames qw( :full ); allows Perl code to pass name arguments to \N{} as well.

Objective-C / Swift / Foundation

The \N syntax can be found when calling the (NS)String method applyingTransform(_:reverse:) with the .toUnicodeName transform:

import Foundation

"🍩".applyingTransform(.toUnicodeName, reverse: false) // \N{DOUGHNUT}
"\\N{DOUGHNUT}".applyingTransform(.toUnicodeName, reverse: true) // 🍩

Implementation

The data required to implement this feature is provided by the ICU library. However, the Swift compiler doesn't currently link libICU, and doing that may not be straightforward.

An alternative approach would be to embed this data from the Unicode Character Database (UCD) directly, using the XML representation described in Unicode Standard Annex #42. As part of the build process, this XML file could be downloaded, parsed, and used to generate a static array declaration in code that's used by the compiler to do character name lookups.
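As a rough sketch of what such generated code might look like, the following uses a table sorted by name and a binary search. The table contents, the name `generatedNameTable`, and the function `lookUpScalarValue(named:)` are all hypothetical; a real table would be produced from the UCD by a build step and would hold every named character.

```swift
// Hypothetical excerpt of a generated table, sorted by name for binary search.
let generatedNameTable: [(name: String, value: UInt32)] = [
    ("EN DASH", 0x2013),
    ("GIRL", 0x1F467),
    ("RIGHT-TO-LEFT MARK", 0x200F),
    ("WOMAN", 0x1F469),
    ("ZERO WIDTH JOINER", 0x200D),
]

/// Returns the scalar value for an exact character name, or nil if unknown.
func lookUpScalarValue(named name: String) -> UInt32? {
    var low = 0
    var high = generatedNameTable.count - 1
    while low <= high {
        let mid = low + (high - low) / 2
        let entry = generatedNameTable[mid]
        if entry.name == name {
            return entry.value
        } else if entry.name < name {
            low = mid + 1
        } else {
            high = mid - 1
        }
    }
    return nil
}
```

A compiler would more likely use a compact structure such as a trie or perfect hash, but the sorted array shows the shape of the lookup.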

To get a sense of what this entails, here's a link to the directory of UCD XML files for Unicode 11.0: Index of /Public/11.0.0/ucdxml.

In terms of code impact, Unicode 11.0 has 137,374 characters. Estimating that each character name is between 16 and 32 bytes, we could expect this to require around 2 – 4 megabytes.
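The arithmetic behind that estimate can be checked directly. This sketch counts only the name bytes under the stated 16 to 32 byte assumption; it ignores the stored code point values and any index overhead, so it is a lower bound on the real table size.

```swift
let characterCount = 137_374   // Unicode 11.0 character count quoted above
let bytesPerName = 16...32     // assumed range for the length of a name

let lowEstimate = characterCount * bytesPerName.lowerBound    // 2,197,984 bytes
let highEstimate = characterCount * bytesPerName.upperBound   // 4,395,968 bytes

print("Roughly \(lowEstimate / 1_000_000) to \(highEstimate / 1_000_000) MB of name data")
```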

It should also be possible to reference Unicode characters by normative formal name aliases, of which there are currently a few hundred.

Source Compatibility

This is a purely additive change. The syntax proposed is not currently valid Swift.

Effect on ABI Stability

None

Effect on API Resilience

None

Documentation Impact

The Swift Programming Language would need to update the “Special Characters in String Literals” section in its Strings and Characters chapter to document the new escape sequence.

In addition, documentation for the applyingTransform(_:reverse:) method would need to be updated to note support for the \N escape sequence in Swift.

Alternatives to Consider

As part of the pitch process, we are especially interested in soliciting feedback and suggestions for the following:

Using `\U{name}` instead of `\N{name}`

In terms of spelling, the strength of \N{name} comes entirely from the precedent set by the aforementioned languages that currently implement this functionality. The letter "N" is a weak mnemonic for "Unicode character name". It's also similar to, but unrelated to, the more common \n escape sequence, which might create confusion for developers.

An alternative spelling to consider for this proposal is \U{name}. The letter "U" reinforces its relation to "Unicode" and provides case symmetry with the existing, related \u{n} escape sequence.

Unfortunately, searching for \N usage in code is difficult (GitHub, for example, strips the backslash character in code search). Therefore, we don't have any data about the prevalence of \N in the wild to help make a determination of how strong the existing convention is.

Supporting Named Sequences

Unicode also provides a database of named sequences, as described by Unicode Standard Annex #34. Essentially, these are common extended grapheme clusters that are treated like characters (unique name, part of the Unicode namespace), but comprise multiple code points instead of just one. For example, the named sequence LATIN SMALL LETTER I WITH MACRON AND GRAVE (ī̀) is defined by U+012B followed by U+0300. A list of named sequences in Unicode 11.0 is provided by the data file NamedSequences.txt.

Adding support for named sequences would add complexity, as the compiler would have to treat these differently than normal characters (named sequences can't be used for Unicode scalar literals). It's unclear whether named sequences should be supported in the initial implementation, deferred until later, or implemented separately with a new escape sequence.

Supporting Emoji Sequences

Related to the previous point, there are also several hundred named Emoji sequences; see emoji-sequences.txt and emoji-zwj-sequences.txt.

We're less inclined to include these in an initial implementation, but are interested in gauging demand for this functionality later on.

jrose · November 29, 2018, 5:27pm

I'll admit my experience may not be representative, but I have never run across code in Python (or Perl) that uses this feature. I think it definitely can improve clarity in some cases, but the cost of linking a giant table of names into the compiler that has to be kept up to date seems not worth the tradeoff for me (cf. trying to decide what continuation codepoints are valid at compile time).

xwu · November 29, 2018, 5:44pm

Agree with concern about practical costs in the setting of likely low use. If this were to be adopted, though, I see no reason why it needs any new syntax: \u should be able to accommodate just fine.

mattt · November 29, 2018, 5:49pm

How feasible would it be to have the compiler link to ICU and delegate name resolution functionality to that library instead of embedding and maintaining a separate table of constants?

mattt · November 29, 2018, 5:52pm

This came up in discussion on Twitter: https://twitter.com/mattt/status/1068185975359123456

There are a few instances of collision between hexadecimal code point numbers and character names, including:

🛏 BED (U+1F6CF)
௭ TAMIL DIGIT SEVEN (U+0BED)

🐝 bee (U+1F41D)
௮ TAMIL DIGIT EIGHT (U+0BEE)

U+000C, which has the formal alias FF

U+0011 – U+0014, which are DC1, DC2, DC3, and DC4
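The collisions listed above matter if a single escape such as \u{...} were to accept both hexadecimal values and names. A minimal sketch of the check, assuming a hypothetical function `isAmbiguousEscapeBody(_:knownNames:)` and a small hand-picked set of names:

```swift
/// Returns true when `text` could be read both as a hexadecimal scalar value
/// and as a known character name or alias.
func isAmbiguousEscapeBody(_ text: String, knownNames: Set<String>) -> Bool {
    // \u{n} accepts 1 to 8 hexadecimal digits.
    let looksLikeHex = (1...8).contains(text.count) && UInt32(text, radix: 16) != nil
    return looksLikeHex && knownNames.contains(text.uppercased())
}

// Hypothetical sample of names and aliases; a real check would use the full UCD.
let sampleNames: Set<String> = ["BED", "BEE", "FF", "DC1", "GIRL", "ZERO WIDTH JOINER"]
```

With this sample, "BED", "FF", and "DC1" are ambiguous, while "GIRL" is not, because "G", "I", and "L" are not hexadecimal digits.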

jrose · November 29, 2018, 5:57pm

Hm. Apple platforms don't provide this directly, so we'd have to go through CF or Foundation—implementable, but a little annoying, and some names might be unavailable depending on the OS you're running, which I really don't like (my original complaint).

On the Linux side, we just decided to switch to embedding an ICU into the corelibs Foundation we're building, but if that's statically linked then I'm not sure we can just use it from the compiler. (And we definitely can't use it if we're cross-compiling.) But if the point of using ICU is to avoid having a manual copy of the table, we'd only want to do it if we don't need to include an extra copy in whatever we're shipping.

Joe_Groff · November 29, 2018, 5:58pm

With the new string interpolation design, and eventual support for constant evaluation, it might be interesting to look into making this a library feature. You could extend String so that \(characterNamed: "ZERO WIDTH JOINER") worked as an interpolation.
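A minimal sketch of that idea, assuming Swift 5's `DefaultStringInterpolation` and a hypothetical hand-written table `scalarsByName` standing in for real Unicode name data. Unknown names trap at run time here; nothing in this sketch is checked at compile time.

```swift
// Hypothetical name table; a real library would need the complete Unicode data.
let scalarsByName: [String: Unicode.Scalar] = [
    "WOMAN": "\u{1F469}",
    "ZERO WIDTH JOINER": "\u{200D}",
    "GIRL": "\u{1F467}",
]

extension DefaultStringInterpolation {
    /// Appends the character with the given Unicode name.
    mutating func appendInterpolation(characterNamed name: String) {
        guard let scalar = scalarsByName[name] else {
            preconditionFailure("Unknown Unicode character name: \(name)")
        }
        appendLiteral(String(scalar))
    }
}

let family = "\(characterNamed: "WOMAN")\(characterNamed: "ZERO WIDTH JOINER")\(characterNamed: "GIRL")"
```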

allevato · November 29, 2018, 6:15pm

This is one of those features that I think would be absolutely fun to have, but as has been pointed out, it seems hard to do "right" without a major decision like linking ICU, or at least the name table alone (which is probably worse), into the compiler.

If we want to expose a runtime API for this, I think the best thing to do would be to look back at SE-0211 where we added Unicode.Scalar.Properties.name and add its inverse—something like Unicode.Scalar.init?(named name: String).
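A naive sketch of such an initializer, built only on the existing `properties.name` API (Swift 5 or later). The `init?(named:)` spelling is the hypothetical one suggested above, not an existing standard library API, and the linear scan is far too slow for real use; it only shows the intended behavior.

```swift
extension Unicode.Scalar {
    /// Hypothetical inverse of `properties.name`.
    init?(named name: String) {
        // Scan every code point; surrogates are skipped because
        // Unicode.Scalar(_:) returns nil for them.
        for value in UInt32(0)...0x10FFFF {
            guard let scalar = Unicode.Scalar(value),
                  scalar.properties.name == name else { continue }
            self = scalar
            return
        }
        return nil
    }
}

let zwj = Unicode.Scalar(named: "ZERO WIDTH JOINER")
```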

But the nice thing about @mattt's proposal is that for code point names that are known at compile-time, invalid names can be detected as compiler errors, just as something like let x: Character = "ab" is an error today. Without that feature, I think the trade-off between performance and documentation is too high—it's hard to imagine preferring \(characterNamed: "ZERO WIDTH JOINER") which would involve an expensive runtime lookup versus just writing the equivalent \u{...} and including a comment.

Joe_Groff · November 29, 2018, 6:19pm

Compile-time evaluation will hopefully support the ability to evaluate assertions at compile time, allowing these sorts of conditions to be checked without hardcoded compiler support.

allevato · November 29, 2018, 6:23pm

If it can be done using compile-time evaluation support, then that does make it more appealing (but then I'd say that the Unicode.Scalar initializer should still be supported, and it could also use the same CTE support).

I missed the original reference to CTE in your original post—but that would still require linking ICU into the compiler, correct? So are you just advocating for a different syntax than the proposed \N{...}, not for turning the entire feature into a runtime one?

jrose · November 29, 2018, 6:25pm

I'm a little less optimistic about this particular one, since it involves a string table lookup and that string table (at least on Apple platforms) may not be constant.

Joe_Groff · November 29, 2018, 6:36pm

Yeah, it'd be great to use compile-time evaluation to eventually get unicode scalar and grapheme cluster validation out of the compiler as well. For the named character feature, yeah, we'd need to provide a name table somewhere the compiler can see it. I don't think it's necessary for it to exactly match the target platform's table at all times; if it were a table in the compiler, you'd have all the same issues as a pre-generated table for compile-time evaluation lookup.

SDGGiesbrecht · November 29, 2018, 8:08pm

Background: I speak five human languages which use three different scripts in two different writing directions. I spoke Unicode long before I spoke Swift. One of my very first programming projects was to redesign my keyboard so I didn’t have to switch back and forth between several.

This pitch sounds like a bad idea to me.

The code point names have limited usefulness.
- Because of the extreme specificity required for uniqueness, most of them are excessively long and cumbersome.
  - Examples:
    - “A”: LATIN CAPITAL LETTER A
    - “[”: LEFT SQUARE BRACKET
  - From UTN #27:
    
    One of the reasons why the Unicode standard publishes many informative aliases in the Unicode names list is because there often are much better, more communicative names for particular characters, even in English than the normative names in the data file. [...] Informal aliases are useful in describing a character, but cannot be used as identifiers, because they are not guaranteed to be unique or stable.
- Many are spelled in odd ways or are unintuitive. Unicode Technical Note #27 is dedicated to these problems. The name is usually insufficient to understand the nature of a character; it is often necessary to consult the charts. For example (emphasis added):
  
  0387 · GREEK ANO TELEIA
  • functions in Greek like a semicolon
  • 00B7 · is the preferred character
  ≡ 00B7 · middle dot
- Because of the specificity and inconsistency, it is often much easier to memorize the four digit hex codes than it is to remember the wording and spelling of the awkward names, and it is certainly much faster to type the shorter hex codes. Even if I were reviewing some source code, seeing the name “HYPHEN” forces me to stop and look up which character is meant, but seeing “2010” tells me much faster that it’s the unambiguous one, not the ASCII hybrid stroke thing.
From my experience, control (“non‐printing”) characters are actually a particularly poor candidate for such usage:
- Directional formatting is something I deal with every day. Unlike the non‐printing characters in ASCII—which incidentally have no names—, directional formatters and most other Unicode controls (including the non‐joiner mentioned in the pitch) are immediately obvious when you look at the text. While they don’t print per se, their reason for existence is to fix the appearance of the surrounding characters. You know you need to use the control because the text looks wrong without it. Names, on the other hand, have no such effect on the surrounding text, so they make the source even harder to read, because the shape and order of the source characters no longer matches that of the compiled result. Notice how using the RTL mark itself looks much cleaner, and has its correct order instead of the garbled order of the one with the name.
  - "The phrase is مرحبا بالعالم!\N{RIGHT-TO-LEFT MARK} in Arabic."
  - "The phrase is مرحبا بالعالم!‏ in Arabic."
  It is also worth noting that these characters are readily available on the keyboards of those whose languages use them. It is 1 keypress for the character vs 18 + a keyboard switch to type the name.
  
  Aside: What you really want for that sentence is directional isolate controls (U+2066–2069), not the right‐to‐left mark: "The phrase is \(rli)مرحبا بالعالم!\(pdi) in Arabic."
- Occasionally you do want to isolate a control character from its effects. Most commonly for me, that is to prevent a bidirectional control in a literal from making the whole source line display in a backward direction. Still however, the Unicode names are so cumbersome that I would end up defining things like let zwj = "\u{200D}" just to avoid writing "👩\N{ZERO WIDTH JOINER}👧" and be able to write "👩\(zwj)👧" instead.
Confusable characters are rarely confusable in context. Unicode calls them out because they are vulnerable to intentional spoofing, but they are not confusable in normal use. It is immediately obvious to any translator or bilingual developer which of these is a Latin “O” and which is a Greek “Ο”: ["Hello, world!", "Γεια σου, κόσμε!"]. Honestly, if I’m proofreading someone’s translation work, I don’t want to try to read "Γεια σ\N{GREEK CAPITAL LETTER OMICRON}υ, κόσμε!", let alone "\N{GREEK CAPITAL LETTER GAMMA}\N{GREEK SMALL LETTER EPSILON}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA}..." (I ran out of patience before I even finished the entire example string).
This pitch would probably find its greatest pool of users among those with a desire to conjure Emoji by typing words, like on GitHub or similar places, where :clap: results in “👏”. While that might be useful, I think that belongs in the domain of the input method. Emoji are fickle things which compose in odd ways and have even more erratic names than the rest of Unicode. They are very shallow and ambiguous semantically, having their entire value instead in a particular image. This makes them conceptually distinct from other text:
- If I’m going to use Emoji, I first want to see what they look like, not what they happen to be called. Regardless of how I entered one, I would want the character itself in the source code, not an opaque name. An analogous feature in Swift is the provision of #color literals, so that the intent can be visible directly in the source code. But using character names instead of images is the opposite philosophy.
- Most people discover Emoji by paging through lists of them displayed in visual form, not by trying various description strings to see if they produce an image. Many of those character pickers display localized descriptions (including the one in macOS). So it is usually easier to enter the character itself straight into the source than it is to look up their metadata in order to describe them by name instead.
- Aside: Unicode itself considers them a short‐term workaround awaiting a real solution, and recommends using real images instead.

If the question I started with were, What other fascinating details of Unicode can we present to users?, then this pitch might be a natural answer. But when I instead ask Is this actually useful? the answer I come to is No.

SDGGiesbrecht · November 29, 2018, 9:03pm

...in case you were cleverly trying to prove a point, yes, I did notice that your hex value for “girl” is wrong and doesn’t match the Emoji in the comment. It was not so hard to track down. ;)

mattt · November 29, 2018, 11:09pm

Thank you for taking the time to write out your thoughts about this proposal. I appreciate your perspective as a native speaker of several languages and someone who frequently works in multiple scripts.

Please allow me to try to respond to some of your points:

My intent with this proposal is to provide a clearer and more convenient way for developers to work with Unicode in a precise manner. I'm not trying to defend the technical problems or deficiencies of the standard, or attempt to work around them.

Unicode character names, for better or worse, can be helpful in clarifying the intent or behavior of text processing code—especially to developers who are less versed in a certain script or Unicode in general. Even if it's easier for you to memorize certain hex codes, it raises the barrier to entry for understanding and contributing to that code for others.

This may be true in text, but not necessarily for code. For example, if you were programmatically constructing a string with mixed directionality using interpolation, you may want to isolate text with FSI (U+2068) and PDI (U+2069).
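A small sketch of that case using today's \u{n} escapes. The helper name `isolated(_:)` is hypothetical; the code points are FIRST STRONG ISOLATE (U+2068) and POP DIRECTIONAL ISOLATE (U+2069).

```swift
let fsi = "\u{2068}" // FIRST STRONG ISOLATE
let pdi = "\u{2069}" // POP DIRECTIONAL ISOLATE

/// Wraps text of unknown direction so it doesn't reorder its surroundings.
func isolated(_ text: String) -> String {
    return fsi + text + pdi
}

let phrase = "مرحبا بالعالم!"
let sentence = "The phrase is \(isolated(phrase)) in Arabic."
```

Both controls are invisible in most editors, so a reader of this code has only the comments to confirm which characters the constants hold.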

In the context of text, no. But this proposal is intended for writing code attempting to handle those confusable cases.

I don't expect \N to be used in this context. You can continue to input text in the method most convenient to you. \N is there for exceptional cases, when meaning isn't immediately clear (or is actively misleading) in a particular context.

Again, my intent is not for this to be an input method. As I alluded to in the proposal, emoji is becoming increasingly combinatoric in nature, and there are fewer guarantees about how sequences are rendered, especially in a code editor. If you're writing software that deals precisely with Unicode characters, working with character names may be the most convenient option.

Again, I worry that the sentiment of your feedback may stem from a misunderstanding of my intent (in which case, it is helpful for refining this proposal).

Throughout your response, you made several comments about input methods when writing text. To this end, referring to characters by name is naturally going to be more cumbersome than just typing what you want.

But this proposal isn't for writing text, it's for writing code that works with text. It's for folks that really do need to work with the fascinating (/ infuriating) details of Unicode and need to do so in a precise manner.

Does that make sense? Do you see text input as being an orthogonal concern to the problem I propose solving with \N? Or do you still have concerns about the usefulness of this feature for developers working with text in code?

SDGGiesbrecht · November 30, 2018, 9:52am

This is an entirely additive proposal, so I realize we can simply choose not to use it if it were implemented.

My work falls into these two categories roughly half and half:

The “writing text” part. I translate. Often it is mixed in with source code of some kind or another, but even then it mostly involves fully readable string literals with the occasional interpolation. As you say, this is not what this proposal is (intended to be) about. It is not useful in this problem domain. But to a developer who speaks one language, it is easy to read a phrase like the following, and wrongly understand that this feature will be a necessity for his translators in order to localize his application correctly:

mattt:

When working with text containing, for example, both Arabic and Latin script, the use of non-printing, directional formatting characters like RIGHT-TO-LEFT MARK U+200F (RLM) may be necessary to achieve the desired results.

The very example provided belongs to this “working with text” domain ("The phrase is مرحبا بالعالم!‏ in Arabic."). I’m just making it clear that it is not as useful in this domain as the proposal can make it sound to someone unfamiliar with the topic.
The “fascinating details of Unicode” part. I’ve worked on input methods, which requires working with unpaired controls and isolated combining characters. I’ve worked on letter frequency analysis scripts which needed to carefully handle decomposition and undo canonical reordering in a way more logical to the grammar. I do a lot of work in this domain too.

In this problem domain, in order to do anything right, you need a vastly better understanding of how Unicode works. By the time you reach that level of understanding, you are very familiar with the hex codes and don’t use the names for anything. The hex code carries a lot more useful information: Which block are we dealing in, latin‐1 (008x) or punctuation (20xx)? Is this ASCII (00xx), BMP (xxxx), or extended (xxxxx) (ergo how many bytes in UTF‐x)? Is it spacing (02xx) or combining (03xx)? Is it a modifier (02xx) or punctuation (20xx)? None of this information is captured in a name very well. APOSTROPHE vs RIGHT SINGLE QUOTATION MARK: which one is actually recommended as an apostrophe? Consult the charts and you’ll find it’s the second one. For these sorts of reasons, I find the names unhelpful in a heavily technical Unicode setting. In every case either the character itself or the hex code is more useful, more communicative, easier to discover in the first place, and faster to input.

SDGGiesbrecht · December 2, 2018, 4:48pm

I felt bad about sounding so antagonistic, so I put some more thought into this:

What if the proposal were more focused?

ASCII gave us several escapes (\n, \t, etc.), but only for characters which were especially worthy of it. There was never any \a for & or \d for $.

So why not aim for a comparable scope in Unicode?

Each non‐printing control has an acronym used in the charts and other places where a printable representation is desired. It would be much more manageable to give such controls escapes according to their acronyms:

\c{cgj}: combining grapheme joiner
\c{zwnj}: zero width non‐joiner
\c{rli}: right‐to‐left isolate

...and so on. The entire list is not that long. Personally I would weed out those that have been superseded, much like Swift dropped C’s \a (bell), \b (backspace), etc. With deprecated and unrecommended characters removed, the list falls to around 30 controls total.

That would provide four advantages, while still fulfilling most of the pitch’s motivation:

It leaves the fickle names out of it.
It is manageable in size. No need for ICU.
The acronyms actually compete with hex codes for brevity.
The acronyms are even more familiar and easier to spell than the names they stand for. (I notice they were repeatedly called ZWJ and RLM in the text of the proposal.)
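To give a sense of how small such a table would be, here is a sketch of an acronym lookup. The table `controlsByAcronym` and the function `control(forAcronym:)` are hypothetical, the entries are only a sample of the roughly 30 controls mentioned above, and the \c{...} spelling itself is just the suggestion made in this post.

```swift
// Hypothetical sample of the acronym table.
let controlsByAcronym: [String: Unicode.Scalar] = [
    "cgj": "\u{034F}",   // COMBINING GRAPHEME JOINER
    "zwnj": "\u{200C}",  // ZERO WIDTH NON-JOINER
    "zwj": "\u{200D}",   // ZERO WIDTH JOINER
    "rlm": "\u{200F}",   // RIGHT-TO-LEFT MARK
    "rli": "\u{2067}",   // RIGHT-TO-LEFT ISOLATE
    "pdi": "\u{2069}",   // POP DIRECTIONAL ISOLATE
]

/// Resolves an acronym such as "zwj" or "ZWJ" to its control character.
func control(forAcronym acronym: String) -> Unicode.Scalar? {
    return controlsByAcronym[acronym.lowercased()]
}
```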

Joe_Groff · December 2, 2018, 5:23pm

Another preexisting source of standardized short entity names we could borrow are those from HTML &character; escapes.