Introduction
This proposal adds a new \N{name} escape sequence to Swift string literals, where name is the name of a Unicode character.
Discussion
The Unicode named character escape sequence was previously discussed here:
- https://twitter.com/mattt/status/1059805674689847296.
Background
Each Unicode character is assigned a unique code point, a number between U+0000 and U+10FFFF, and a name, consisting of uppercase letters (A–Z), digits (0–9), hyphens, and spaces. For example, the Unicode character for the letter “A” used in English has the code point U+0041 and the name LATIN CAPITAL LETTER A. The term scalar value refers to any Unicode code point that isn't a surrogate code point (U+D800 through U+DFFF).
In Swift, a string literal may include a character directly ("A") or using the \u{n} escape sequence, where n is a 1–8 digit hexadecimal number corresponding to the scalar value ("\u{0041}"). A string literal may also include a character by interpolation (let letterA = "\u{0041}"; "\(letterA)").
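As a quick check of the three forms described above, the following sketch compares them and shows that a surrogate code point is rejected as a scalar value. The constant names are illustrative only.

```swift
// A literal character, a \u{n} escape, and interpolation all produce the same string.
let direct = "A"
let escaped = "\u{0041}"
let letterA: Character = "\u{0041}"
let interpolated = "\(letterA)"
print(direct == escaped, escaped == interpolated) // true true

// Surrogate code points are not scalar values, so the failable initializer returns nil.
let surrogate = Unicode.Scalar(0xD800 as UInt32)
let valid = Unicode.Scalar(0x0041 as UInt32)
print(surrogate == nil, valid != nil) // true true
```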
Motivation
In Swift, it can be cumbersome to work with Unicode characters that are non-printing, confusable, or difficult to render in an editor. This difficulty can inhibit developer productivity and cause programming errors.
Non-Printing Characters
Non-printing characters can’t be seen or directly interacted with in most editors (including Xcode), which makes them difficult to work with in code.
For example, the emoji 👩‍👧 is a sequence comprising 👩 (WOMAN U+1F469) + (ZERO WIDTH JOINER U+200D) + 👧 (GIRL U+1F467). The middle character, zero-width joiner (ZWJ), is a non-printing control character that changes the way glyphs are shaped for adjacent characters rather than having a distinct rendering itself.
There are currently a few different strategies for working with non-printing characters:
One approach is to use the \u{n} escape sequence in a string literal, passing the scalar value for each character.
// Unicode Scalar Value Escape
"\u{1F469}\u{200D}\u{1F467}" // "👩‍👧"
This achieves the desired results, but the use of opaque numerical constants makes the code difficult to understand.
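To make the hidden joiner visible, a short sketch can walk the string's scalar view and print each value in hexadecimal. The constant names here are illustrative only.

```swift
let family = "\u{1F469}\u{200D}\u{1F467}"

// The zero-width joiner is present in the scalar view even though it has no glyph of its own.
let scalarValues = family.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
print(scalarValues) // ["1F469", "200D", "1F467"]
```

Inspecting the scalar view is useful for debugging, but it doesn't make the literal itself any easier to read.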
Another approach is to assign character values to constants, using a combination of variable names and comments to clarify intent, and interpolate those values:
// Commented Declaration + Interpolation
let woman: Character = "\u{1F469}" // WOMAN
let zwj: Character = "\u{200D}" // ZERO WIDTH JOINER
let girl: Character = "\u{1F467}" // GIRL
"\(woman)\(zwj)\(girl)" // "👩‍👧"
This approach is more understandable than the previous one but requires more work on the part of the developer. More concerning, however, is that this approach can lead to difficult-to-track-down bugs if the comments and behavior conflict, whether due to a mistake initially or an erroneous change later on.
An ideal solution would combine the compiler checking of the former approach with the semantic clarity of the latter approach. This could be achieved by adding support for a new escape sequence, \N{name}, which allows Unicode characters to be included in string literals by name:
// Proposed \N Escape Sequence
"\N{WOMAN}\N{ZERO WIDTH JOINER}\N{GIRL}" // "👩‍👧"
Unicode 11.0 specifies 777 emoji ZWJ sequences, including 👩‍👧, that platforms are encouraged to support. Vendors may also choose to support other emoji ZWJ sequences, such as Microsoft’s “Hipster Cat”, which comprises 🐱 (CAT FACE U+1F431) + ZWJ + 👓 (EYEGLASSES U+1F453), and is currently supported only on Windows platforms.
In addition to Emoji, ZWJ is used in Arabic script and Indic scripts, including Devanagari and Kannada. Incidentally, Arabic script provides another example of the difficulty a developer faces when working with text in code: handling directionality.
Directional Formatting Characters
Arabic script is written right-to-left (RTL) whereas Latin script is written left-to-right (LTR). When working with text containing, for example, both Arabic and Latin script, the use of non-printing, directional formatting characters like RIGHT-TO-LEFT MARK U+200F (RLM) may be necessary to achieve the desired results.
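To illustrate how such a character hides inside a string, here is a minimal sketch that lists scalars in the Format (Cf) general category along with their Unicode names. It assumes Swift 5 or later, where Unicode.Scalar.Properties is available. The constant names are illustrative.

```swift
let phrase = "The phrase is مرحبا بالعالم!\u{200F} in Arabic."

// Collect scalars whose general category is Format (Cf); these have no visible glyph.
let invisible = phrase.unicodeScalars.filter { $0.properties.generalCategory == .format }

for scalar in invisible {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    print("U+\(hex)", scalar.properties.name ?? "<unnamed>")
}
```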
As with the previous ZWJ example, the proposed \N{name} escape sequence offers a solution that’s both understandable to the developer and checked by the compiler:
// Unicode Scalar Value Escape
"The phrase is مرحبا بالعالم!\u{200F} in Arabic."
// Commented Declaration + Interpolation
let rlm: Character = "\u{200F}" // RIGHT-TO-LEFT MARK
"The phrase is مرحبا بالعالم!\(rlm) in Arabic."
// Proposed \N Escape Sequence
"The phrase is مرحبا بالعالم!\N{RIGHT-TO-LEFT MARK} in Arabic."
Confusable Characters
Even if a character is printing, its glyph may be ambiguous.
Unicode Technical Report #36 describes how characters in single-, mixed-, and whole-script contexts may be confused for another character. This phenomenon is demonstrated well by the Confusables Unicode Utility.
Correct handling of confusables is most important in security applications, such as for preventing hostname spoofing in URLs. However, confusable characters can be problematic in code as well. For example, consider the following selection from the 24 characters comprising Unicode’s Punctuation, Dash [Pd] category:
* U+002D HYPHEN-MINUS -
* U+2010 HYPHEN ‐
* U+2011 NON-BREAKING HYPHEN ‑
* U+2012 FIGURE DASH ‒
* U+2013 EN DASH –
* U+2014 EM DASH —
* U+2015 HORIZONTAL BAR ―
* U+2E3A TWO-EM DASH ⸺
* U+2E3B THREE-EM DASH ⸻
Most programming fonts don't render these characters distinctly. If a developer decides to include a character directly in a string literal, the original meaning may be lost in subsequent changes. A developer may not recognize a code convention for using en dash (–) to delimit range bounds, and instead type a hyphen-minus (-) somewhere else in the project.
Consider the following four options for including an en dash in code, including the proposed \N{name} escape sequence:
// Direct
"–"
// Unicode Scalar Value Escape
"\u{2013}"
// Commented Declaration + Interpolation
let enDash: Character = "\u{2013}" // EN DASH
"\(enDash)"
// Proposed \N Escape Sequence
"\N{EN DASH}"
Another example of confusable characters is cross-script homographs. Unicode defines separate code points for LATIN CAPITAL LETTER A (U+0041), CYRILLIC CAPITAL LETTER A (U+0410), and GREEK CAPITAL LETTER ALPHA (U+0391). However, these characters are indiscernible in most fonts.
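A small check, assuming Swift 5 or later for the properties API, suggests how these look-alike letters compare as distinct scalars with distinct names. The constant names are illustrative.

```swift
let latin: Unicode.Scalar = "\u{0041}"
let cyrillic: Unicode.Scalar = "\u{0410}"
let greek: Unicode.Scalar = "\u{0391}"

// Visually identical in most fonts, but never equal as scalars.
print(latin == cyrillic, cyrillic == greek) // false false

// The Unicode name is the only reliable way to tell them apart at a glance.
for scalar in [latin, cyrillic, greek] {
    print(scalar, scalar.properties.name ?? "<unnamed>")
}
```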
The proposed \N{name} escape sequence can be helpful for distinguishing between homographs like these:
"A == \u{0041} == \N{LATIN CAPITAL LETTER A}"
"А == \u{0410} == \N{CYRILLIC CAPITAL LETTER A}"
"Α == \u{0391} == \N{GREEK CAPITAL LETTER ALPHA}"
Design
The \N{} escape sequence is supported in a few programming languages. We propose to model the design according to these existing implementations.
Python
Python defines a \N{name} escape sequence. Support for name aliases was added in Python 3.3.
Perl
Perl defines a \N{} escape sequence that accepts code points with a U+ prefix, such as \N{U+0041}. Including the statement use charnames qw( :full ); allows Perl code to pass name arguments to \N{} as well.
Objective-C / Swift / Foundation
The \N syntax can be found when calling the (NS)String method applyingTransform(_:reverse:) with the .toUnicodeName transform:
import Foundation
"🍩".applyingTransform(.toUnicodeName, reverse: false) // \N{DOUGHNUT}
"\\N{DOUGHNUT}".applyingTransform(.toUnicodeName, reverse: true) // 🍩
Implementation
The data required to implement this feature is provided by the ICU library. However, the Swift compiler doesn't currently link libICU, and doing that may not be straightforward.
An alternative approach would be to embed this data from the Unicode Character Database (UCD) directly, using the XML representation described in Unicode Standard Annex #42. As part of the build process, this XML file could be downloaded, parsed, and used to generate a static array declaration in code that's used by the compiler to do character name lookups.
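As a rough sketch of what the generated lookup described above might look like, the following uses a tiny table sorted by name and a binary search. The table contents, generatedNameTable, and lookUpScalar(named:) are hypothetical and only illustrate the idea. A real build step would emit one entry per named character in the UCD and might choose a more compact encoding.

```swift
// Hypothetical generated table: (name, scalar value) pairs sorted by name.
let generatedNameTable: [(name: String, value: UInt32)] = [
    ("EN DASH", 0x2013),
    ("GIRL", 0x1F467),
    ("RIGHT-TO-LEFT MARK", 0x200F),
    ("WOMAN", 0x1F469),
    ("ZERO WIDTH JOINER", 0x200D),
]

// Binary search over the sorted table; returns nil for an unknown name.
func lookUpScalar(named name: String) -> Unicode.Scalar? {
    var low = 0
    var high = generatedNameTable.count - 1
    while low <= high {
        let mid = low + (high - low) / 2
        let entry = generatedNameTable[mid]
        if entry.name == name {
            return Unicode.Scalar(entry.value)
        } else if entry.name < name {
            low = mid + 1
        } else {
            high = mid - 1
        }
    }
    return nil
}

// Resolve a sequence of names into a string, as the compiler would for \N{...} escapes.
let names = ["WOMAN", "ZERO WIDTH JOINER", "GIRL"]
var resolved = String.UnicodeScalarView()
for name in names {
    if let scalar = lookUpScalar(named: name) {
        resolved.append(scalar)
    }
}
print(String(resolved) == "\u{1F469}\u{200D}\u{1F467}") // true
```

An unknown name returns nil here, which in a compiler would correspond to a diagnostic rather than a silent failure.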
To get a sense of what this entails, here's a link to the directory of UCD XML files for Unicode 11.0: Index of /Public/11.0.0/ucdxml.
In terms of code impact, Unicode 11.0 has 137,374 characters. Estimating that each character name takes between 16 and 32 bytes, we could expect the name data to require roughly 2 to 4 megabytes.
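The estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the per-name byte range is the assumption stated above, and the figures cover names only, not scalar values or any index structure.

```swift
let characterCount = 137_374            // figure quoted above for Unicode 11.0
let bytesPerName = (low: 16, high: 32)  // assumed size range of one name, in bytes

let lowBytes = characterCount * bytesPerName.low
let highBytes = characterCount * bytesPerName.high
print(lowBytes, highBytes) // 2197984 4395968
```

That works out to roughly 2.2 to 4.4 megabytes before any compression or deduplication of shared name prefixes.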
It should also be possible to reference Unicode characters by normative formal name aliases, of which there are currently a few hundred.
Source Compatibility
This is a purely additive change. The syntax proposed is not currently valid Swift.
Effect on ABI Stability
None
Effect on API Resilience
None
Documentation Impact
The Swift Programming Language would need to update the “Special Characters in String Literals” section in its Strings and Characters chapter to document the new escape sequence.
In addition, documentation for the applyingTransform(_:reverse:) method would need to be updated to note support for the \N escape sequence in Swift.
Alternatives to Consider
As part of the pitch process, we are especially interested in soliciting feedback and suggestions for the following:
Using \U{name} instead of \N{name}
In terms of spelling, the strength of \N{name} comes entirely from the precedent set by the aforementioned languages that currently implement this functionality. The letter "N" is a weak mnemonic for "Unicode character name". It's also similar to, but unrelated to, the more common \n escape sequence, which might create confusion for developers.
An alternative spelling to consider for this proposal is \U{name}. The letter "U" reinforces its relation to "Unicode" and provides case symmetry with the existing, related \u{n} escape sequence.
Unfortunately, searching for \N usage in code is difficult (GitHub, for example, strips the backslash character in code search). Therefore, we don't have any data about the prevalence of \N in the wild to help determine how strong the existing convention is.
Supporting Named Sequences
Unicode also provides a database of named sequences, as described by Unicode Standard Annex #34. Essentially, these are common extended grapheme clusters that are treated like characters (unique name, part of the Unicode namespace), but comprise multiple code points instead of just one. For example, the named sequence LATIN SMALL LETTER I WITH MACRON AND GRAVE (ī̀) is defined by U+012B followed by U+0300. A list of named sequences in Unicode 11.0 is provided by the data file NamedSequences.txt.
Adding support for named sequences would add complexity, as the compiler would have to treat these differently than normal characters (named sequences can't be used for Unicode scalar literals). It's unclear whether named sequences should be supported in the initial implementation, deferred until later, or implemented separately with a new escape sequence.
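The distinction can be seen with the named sequence mentioned above: it forms a single Character but consists of two scalars, so it can't be expressed as a single Unicode.Scalar. The constant names below are illustrative.

```swift
// LATIN SMALL LETTER I WITH MACRON AND GRAVE: U+012B followed by U+0300.
let namedSequence = "\u{012B}\u{0300}"

print(namedSequence.count)                // 1 (one extended grapheme cluster)
print(namedSequence.unicodeScalars.count) // 2 (two scalar values)

// A Character can hold the whole sequence; a single Unicode.Scalar cannot.
let asCharacter: Character = "\u{012B}\u{0300}"
print(asCharacter.unicodeScalars.count)   // 2
```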
Supporting Emoji Sequences
Related to the previous point, there are also several hundred named emoji sequences; see emoji-sequences.txt and emoji-zwj-sequences.txt.
We're less inclined to include these in an initial implementation, but are interested in gauging demand for this functionality later on.