In light of the core team's decision on SE-0243, I'd like to kick off a pitch for single-quoted literals based on the feedback given.
It's the product of multiple people's work but, while we figure out who's signing on to it, here it is so that we can relaunch the conversation. I'll take the blame for all typos and other errors:
Unicode scalar literals
Introduction
Swift's String type accommodates Unicode by default and models a Collection of extended grapheme clusters, which in Swift are in turn modeled by Character. This is appropriate for a type that handles human-readable text. However, the ergonomics of low-level string processing are a significant pain point for some Swift users, especially when it comes to dealing with individual code points.
To address this shortcoming, we propose a Unicode scalar literal as a single Unicode scalar surrounded by single quotation marks (e.g., 'x').
Motivation
Character is at the wrong level of abstraction when it comes to processing ASCII bytes. "\r\n" is a single extended grapheme cluster, or Character, that represents a sequence of two ASCII characters. Therefore, Character.asciiValue is fundamentally broken for the purposes of byte processing: for "\r\n" it reports only the line feed, which can cause silent data loss. As another example, Character considers the ASCII semicolon ; to be substitutable with GREEK QUESTION MARK (U+037E), because the two are canonically equivalent. These are clearly inappropriate features for the byte processing use case.
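Both behaviors can be observed today with double-quoted literals and explicit type annotations. The following is a minimal sketch using only existing standard library APIs; the variable names are illustrative.

```swift
// One Character, but two Unicode scalars (U+000D and U+000A).
let crlf: Character = "\r\n"
print(crlf.unicodeScalars.count)        // 2

// asciiValue reports only the line feed; the carriage return is dropped.
print(crlf.asciiValue == 10)            // true

// Character equality uses canonical equivalence, so these compare equal.
let semicolon: Character = ";"
let greekQuestionMark: Character = "\u{37E}"
print(semicolon == greekQuestionMark)   // true

// Unicode.Scalar equality compares code points, so these do not.
let semicolonScalar: Unicode.Scalar = ";"
let greekScalar: Unicode.Scalar = "\u{37E}"
print(semicolonScalar == greekScalar)   // false
```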
(This is not to say the Character abstraction isn't useful at all: on the contrary, it's clearly the right choice for String's element type for reasons already discussed elsewhere.)
Unicode.Scalar and its associated string view are much closer to the level of actual encodings, and they are more appropriate abstractions for low-level text processing. This is certainly true for ASCII but also applies to any other context where equivalency under Unicode normalization would be inappropriate or unnecessary.
Unicode.Scalar is a type that is crying out for its own literal syntax. It has grown an awesome set of APIs in Swift 5 for common and advanced text processing use cases, and it's a shame that its rich properties are locked away behind convoluted syntax. It would be ideal to be able to type '\u{301}'.properties.name into a playground to learn about a particular code point.
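For contrast, here is roughly what the same lookup requires today. This sketch uses only the existing Unicode.Scalar.Properties API; the explicit coercion is needed because a double-quoted literal defaults to String.

```swift
// "\u{301}".properties does not compile: the literal defaults to String.
// The scalar type has to be spelled out before its properties are reachable.
let accent = "\u{301}" as Unicode.Scalar

print(accent.properties.name ?? "(unnamed)")                 // COMBINING ACUTE ACCENT
print(accent.properties.generalCategory == .nonspacingMark)  // true
```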
A design where '\r' evaluates to the Unicode scalar U+000D would resolve the issues discussed in this proposal.
Proposed solution
We would introduce a Unicode scalar literal as a single Unicode scalar surrounded by single quotation marks (e.g., 'x').
The compiler will verify at compile time that the content of a Unicode scalar literal consists of one and only one Unicode scalar (without normalization). Note that this rule also precludes an empty Unicode scalar literal (i.e., '').
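The same compile-time check already applies today when a double-quoted literal is given the type Unicode.Scalar. The sketch below shows the current spelling, with the proposed single-quoted spelling only in comments since it does not compile yet.

```swift
// Current spelling; the comments show the proposed equivalent.
let x: Unicode.Scalar = "x"      // proposed: 'x'
let cr: Unicode.Scalar = "\r"    // proposed: '\r'

print(x.value)    // 120
print(cr.value)   // 13

// Rejected at compile time today, and likewise under the proposal:
// let two: Unicode.Scalar = "ab"     // more than one scalar
// let none: Unicode.Scalar = ""      // empty literal
```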
Go and Rust have adopted a similar design, where single quotation marks are used to surround a literal Unicode code point or Unicode scalar value, respectively.
A Unicode scalar value is any Unicode code point except high- and low-surrogate code points. In Go, a Unicode code point is known as a rune, a term now also adopted in .NET.
These modern languages do not tie this literal syntax to the "atom" of their string type, and in fact they divorce iteration over a string from the "atom" of their string type as well. In Go, a string is an arbitrary sequence of bytes (conventionally UTF-8), its "length" is the length in bytes, and indexing is by byte offset. As a special exception, iteration with range occurs over a string's runes. In Rust, a string slice (str) is a sequence of bytes that is always valid UTF-8, its "length" is the length in bytes, and slicing is by byte offset. It is not possible to iterate over a string slice directly; one must explicitly ask for its UTF-8 byte view or Unicode scalar view.
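For comparison, Swift already exposes the corresponding levels as separate views on String, so a scalar literal would pair naturally with the unicodeScalars view. A minimal sketch using existing APIs:

```swift
// "e" followed by COMBINING ACUTE ACCENT: one Character, two scalars.
let word = "e\u{301}"

print(word.count)                  // 1 (extended grapheme clusters)
print(word.unicodeScalars.count)   // 2 (Unicode scalars)
print(word.utf8.count)             // 3 (UTF-8 code units)

// Iterating the scalar view yields Unicode.Scalar values.
let scalarValues = word.unicodeScalars.map { $0.value }
print(scalarValues)                // [101, 769]
```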
Detailed design
Types that conform to ExpressibleByUnicodeScalarLiteral but not ExpressibleByExtendedGraphemeClusterLiteral will show a deprecation warning when they are expressed using string literal syntax (i.e., with double quotation marks).
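As an illustration, consider a hypothetical type (ScalarValue is an invented name, not part of the proposal) that conforms only to ExpressibleByUnicodeScalarLiteral. Today it can be initialized only from a double-quoted literal; under the proposal, that spelling would produce the deprecation warning described above.

```swift
// Hypothetical type that conforms only to ExpressibleByUnicodeScalarLiteral.
struct ScalarValue: ExpressibleByUnicodeScalarLiteral {
    let value: UInt32

    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        value = scalar.value
    }
}

// Compiles today. Under the proposal this spelling would warn,
// and the preferred form would be: let a: ScalarValue = 'a'
let a: ScalarValue = "a"
print(a.value)   // 97
```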
The default type of a Unicode scalar literal (i.e., UnicodeScalarLiteralType) will be Unicode.Scalar (a.k.a. UnicodeScalar).
Of course, types that conform to ExpressibleByExtendedGraphemeClusterLiteral (including types that conform to ExpressibleByStringLiteral) necessarily conform to ExpressibleByUnicodeScalarLiteral. Therefore, they may also be expressed using the newly proposed Unicode scalar literal syntax: let x = '1' as Character. However, regardless of the type to which the literal value is coerced, the content of the literal will be verified at compile time to contain one and only one Unicode scalar.
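The protocol refinement can be checked with current Swift. In this sketch, acceptsScalarLiteralType is a hypothetical helper whose only purpose is to require the ExpressibleByUnicodeScalarLiteral conformance; the literals use today's double-quoted spelling.

```swift
// Hypothetical helper: compiles only for types that conform to
// ExpressibleByUnicodeScalarLiteral.
func acceptsScalarLiteralType<T: ExpressibleByUnicodeScalarLiteral>(_ value: T) -> T {
    return value
}

let scalar: Unicode.Scalar = "1"
let character: Character = "1"   // proposed alternative: '1' as Character
let string: String = "1"

// All three calls type-check, because Character and String refine the protocol.
print(acceptsScalarLiteralType(scalar).value)   // 49
print(acceptsScalarLiteralType(character))      // 1
print(acceptsScalarLiteralType(string))         // 1
```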
Since the content of a Unicode scalar literal must be one and only one Unicode scalar, it isn't strictly necessary to escape a single quotation mark. We will leave it as a possible future direction to consider whether let x = ''' is supported as a statement equivalent to let x = '\''.
Source compatibility
Since the Unicode scalar literal syntax is purely additive, we foresee no source compatibility breaks.
The proposal would cause deprecation warnings to appear when Unicode scalars are expressed using string literals. A fix-it can be provided to migrate such uses.
Effect on ABI stability
None.
Effect on API resilience
None.
Alternatives considered
The principal alternative is to use the proposed dedicated literal syntax for a character literal (i.e., extended grapheme cluster literal).
However, there are no strong use cases for adding dedicated literal syntax for the Character
type. "👨👩👧👦" as Character
seems therefore sufficiently ergonomic, and indeed, of the two dozen or so most "popular" programming languages, none use a dedicated syntax for an extended grapheme cluster literal. Since member lookup for a literal value is deliberately performed only on the default literal type, using the proposed syntax for a character literal would once again lock up useful APIs for Unicode scalars behind a convoluted syntax.
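The member lookup rule can be seen with today's double-quoted literals, whose default type is String. A minimal sketch using existing APIs:

```swift
// Member lookup on a bare literal uses the default literal type (String),
// so String members are found directly.
let length = "a".count
print(length)   // 1

// Members of other literal-expressible types are not found:
// "a".asciiValue        // error: String has no member 'asciiValue'
// "a".properties.name   // error: String has no member 'properties'

// They require an explicit coercion first.
let ascii = ("a" as Character).asciiValue
let scalarName = ("a" as Unicode.Scalar).properties.name
print(ascii == 97)                                  // true
print(scalarName == "LATIN SMALL LETTER A")         // true
```

If single-quoted literals defaulted to Character instead of Unicode.Scalar, the last lookup would still need the coercion.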
Moreover, the version of Unicode supported, and therefore grapheme breaking, is a runtime concept. It is the version of the standard library linked at run time that determines whether a string's contents are one extended grapheme cluster (i.e., Character) or not. A dedicated character literal syntax can provide users no guarantees about grapheme breaking as it relates to the contents of the literal, because such knowledge cannot be "baked in" statically into the code. In other words, with only best-effort diagnostics available at compile time, a valid "character literal" might not be a valid Character.
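The difference can be sketched with a zero-width-joiner emoji sequence, written here with escapes so that the scalars are explicit. The scalar count is fixed by the source text, whereas the Character count is computed by whichever standard library is linked at run time, so no expected value is shown for it.

```swift
// MAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, BOY
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Fixed by the literal's contents: always seven scalars.
print(family.unicodeScalars.count)   // 7

// Determined by the grapheme breaking rules of the runtime in use.
print(family.count)
```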
Another alternative design could address specifically the ASCII use case by dedicating the proposed literal syntax for ASCII contents (whether a character or a string). What would be gained would be compile-time checking that any such content is ASCII. As a trade-off, we would lose compile-time checking that any such content contains one and only one Unicode scalar, and we would lose ergonomic access to Unicode scalar APIs.