Unicode scalar literals

xwu · March 27, 2019, 5:29pm

In light of the core team's decision on SE-0243, I'd like to kick off a pitch for single-quoted literals based on the feedback given.

It's the product of multiple people's work but, while we figure out who's signing on to it, here it is so that we can relaunch the conversation. I'll take the blame for all typos and other errors:

Unicode scalar literals

Introduction

Swift's String type accommodates Unicode by default and models a Collection of extended grapheme clusters, which in Swift are in turn modeled by Character. This is appropriate for a type that handles human-readable text. However, the ergonomics of low-level string processing is a significant pain point for some Swift users, especially when it comes to dealing with individual code points.

To address this shortcoming, we propose a Unicode scalar literal as a single Unicode scalar surrounded by single quotation marks (e.g., 'x').

Motivation

Character is on the wrong level of abstraction when it comes to processing ASCII bytes. "\r\n" is a single extended grapheme cluster, or Character, that represents a sequence of two ASCII characters. Therefore, Character.asciiValue is fundamentally broken for the purposes of byte processing as it can cause silent data loss. As another example, Character considers the ASCII semicolon ; to be substitutable with GREEK QUESTION MARK (U+037E). These are clearly inappropriate features for the byte processing use case.
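The CR-LF point can be checked in current Swift, using double-quoted literals with type context. This is a minimal sketch; the constant names are illustrative only.

```swift
// "\r\n" forms one extended grapheme cluster, so it fits in a single Character
// even though it is made of two Unicode scalars (U+000D and U+000A).
let crlf: Character = "\r\n"
let crlfScalarCount = crlf.unicodeScalars.count

// asciiValue reports the CR-LF pair as a single line feed,
// so the carriage return cannot be recovered from the result.
let crlfReported = crlf.asciiValue

print(crlfScalarCount, crlfReported as Any)
```

The two scalars collapse into one byte value, which is the silent loss described above.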

(This is not to say the Character abstraction isn't useful at all: on the contrary, it's clearly the right choice for String's element type for reasons already discussed elsewhere.)

Unicode.Scalar and its associated string view are much closer to the level of actual encodings, and they are more appropriate abstractions for low-level text processing. This is certainly true for ASCII but also applies to any other context where equivalency under Unicode normalization would be inappropriate or unnecessary.
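As a rough illustration of the difference in abstraction level, the sketch below compares two canonically equivalent spellings of the same word, first as String values and then through their unicodeScalars views. It uses only existing standard library API; the constant names are made up for the example.

```swift
let precomposed = "caf\u{E9}"     // é as the single scalar U+00E9
let decomposed  = "cafe\u{301}"   // e followed by U+0301 COMBINING ACUTE ACCENT

// String comparison honors canonical equivalence, so these compare equal.
let equalAsStrings = (precomposed == decomposed)

// The Unicode scalar views are compared element by element, with no normalization.
let equalAsScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)

// Same number of Characters, different number of scalars.
let characterCounts = (precomposed.count, decomposed.count)
let scalarCounts = (precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)

print(equalAsStrings, equalAsScalars, characterCounts, scalarCounts)
```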

Unicode.Scalar is a type that is crying out for its own literal syntax. It has grown an awesome set of APIs in Swift 5 for common and advanced text processing use cases, and it's a shame that its rich properties are locked away behind convoluted syntax. It would be ideal to be able to type '\u{301}'.name into a playground to learn about a particular code point.
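For comparison, this is roughly what the same lookup takes today. Note that in the current standard library the name lives under the scalar's properties member; the constant names below are illustrative.

```swift
// Today the literal needs explicit type context before member lookup finds
// Unicode.Scalar API.
let accent = "\u{301}" as Unicode.Scalar

// `name` is optional because not every scalar has a name.
let accentName = accent.properties.name

print(accentName ?? "<unnamed>")
```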

A design where '\r' evaluates to the Unicode scalar U+000D would resolve the issues discussed in this proposal.

Proposed solution

We would introduce a Unicode scalar literal as a single Unicode scalar surrounded by single quotation marks (e.g., 'x').

The compiler will verify at compile time that the content of a Unicode scalar literal consists of one and only one Unicode scalar (without normalization). Note that this rule also precludes an empty Unicode scalar literal (i.e., '').

Go and Rust have adopted a similar design, where single quotation marks are used to surround a literal Unicode code point or Unicode scalar value, respectively.

A Unicode scalar value is any Unicode code point except high- and low-surrogate code points. In Go, a Unicode code point is known as a rune, a term now also adopted in .NET.
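The distinction is visible in the existing failable Unicode.Scalar initializer, which rejects surrogate code points (and values beyond U+10FFFF). A small sketch, with illustrative constant names:

```swift
// Unicode.Scalar(_: UInt32) returns nil for anything that is not a scalar value.
let letterA       = Unicode.Scalar(0x41 as UInt32)       // U+0041: a scalar value
let highSurrogate = Unicode.Scalar(0xD800 as UInt32)     // high surrogate: rejected
let lowSurrogate  = Unicode.Scalar(0xDFFF as UInt32)     // low surrogate: rejected
let outOfRange    = Unicode.Scalar(0x110000 as UInt32)   // past U+10FFFF: rejected

print(letterA as Any, highSurrogate as Any, lowSurrogate as Any, outOfRange as Any)
```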

These modern languages do not tie this literal syntax with the "atom" of their string type, and in fact they divorce iteration over a string from the "atom" of their string type as well. In Go, a string is an arbitrary sequence of UTF-8 bytes, its "length" is the length in bytes, and indexing gives a byte offset. As a special exception, iteration occurs over a string's runes. In Rust, a string slice (str) is an arbitrary sequence of UTF-8 bytes, its "length" is the length in bytes, and indexing gives a byte offset. It is not possible to iterate over a string slice; one must explicitly ask for its UTF-8 byte view or Unicode scalar view.

Detailed design

Types that conform to ExpressibleByUnicodeScalarLiteral but not ExpressibleByExtendedGraphemeClusterLiteral will show a deprecation warning when they are expressed using string literal syntax (i.e., with double quotation marks).
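To show which kind of type this rule is about, here is a sketch of a type that conforms to ExpressibleByUnicodeScalarLiteral only. ASCIIByte is a hypothetical type invented for this example, not something in the standard library or in the pitch; its trapping policy is likewise an assumption.

```swift
// Hypothetical type: expressible by a Unicode scalar literal, but not by an
// extended grapheme cluster literal or a string literal.
struct ASCIIByte: ExpressibleByUnicodeScalarLiteral {
    let value: UInt8

    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        // Traps on non-ASCII input; a real design might choose another policy.
        precondition(scalar.isASCII, "ASCIIByte requires an ASCII scalar")
        value = UInt8(scalar.value)
    }
}

// Current spelling: double quotes plus type context.
let newlineByte: ASCIIByte = "\n"
let letterByte = "a" as ASCIIByte

print(newlineByte.value, letterByte.value)
```

Under the pitch, the two double-quoted spellings at the bottom are the ones that would draw the deprecation warning.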

The default type of a Unicode scalar literal (i.e., UnicodeScalarLiteralType) will be Unicode.Scalar (a.k.a. UnicodeScalar).

Of course, types that conform to ExpressibleByExtendedGraphemeClusterLiteral (including types that conform to ExpressibleByStringLiteral) necessarily conform to ExpressibleByUnicodeScalarLiteral. Therefore, they may also be expressed using the newly proposed Unicode scalar literal syntax: let x = '1' as Character. However, regardless of the type to which the literal value is coerced, the content of the literal will be verified at compile time to contain one and only one Unicode scalar.

Since the content of a Unicode scalar literal must be one and only one Unicode scalar, it isn't strictly necessary to escape a single quotation mark. We will leave it as a possible future direction to consider whether let x = ''' is supported as a statement equivalent to let x = '\''.

Source compatibility

Since the Unicode scalar literal syntax is purely additive, we foresee no source compatibility breaks.

The proposal would cause deprecation warnings to appear when Unicode scalars are expressed using string literals. A fix-it can be provided to migrate such uses.

Effect on ABI stability

None.

Effect on API resilience

None.

Alternatives considered

The principal alternative is to use the proposed dedicated literal syntax for a character literal (i.e., extended grapheme cluster literal).

However, there are no strong use cases for adding dedicated literal syntax for the Character type. "👨‍👩‍👧‍👦" as Character seems therefore sufficiently ergonomic, and indeed, of the two dozen or so most "popular" programming languages, none use a dedicated syntax for an extended grapheme cluster literal. Since member lookup for a literal value is deliberately performed only on the default literal type, using the proposed syntax for a character literal would once again lock up useful APIs for Unicode scalars behind a convoluted syntax.

Moreover, the version of Unicode supported, and therefore grapheme breaking, is a runtime concept. It is the version of the standard library linked at run time that determines whether a string's contents are one extended grapheme cluster (i.e., Character) or not. A dedicated character literal syntax can provide users no guarantees about grapheme breaking as it relates to the contents of the literal, because such knowledge cannot be "baked in" statically into the code. In other words, with only best-effort diagnostics available at compile time, a valid "character literal" might not be a valid Character.
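A sketch of what is and is not knowable from source alone, using the family emoji written out as escapes. The scalar count is fixed by the text of the literal; the Character count is whatever the linked standard library computes at run time, so no particular value is assumed here.

```swift
// Four person scalars joined by three ZERO WIDTH JOINER scalars.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Determined entirely by the contents of the literal.
let familyScalarCount = family.unicodeScalars.count

// Determined by the grapheme breaking rules of the standard library in use,
// which depend on the version of Unicode it implements.
let familyCharacterCount = family.count

print(familyScalarCount, familyCharacterCount)
```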

Another alternative design could address specifically the ASCII use case by dedicating the proposed literal syntax for ASCII contents (whether a character or a string). What would be gained would be compile-time checking that any such content is ASCII. As a trade-off, we would lose compile-time checking that any such content contains one and only one Unicode scalar, and we would lose ergonomic access to Unicode scalar APIs.

Michael_Ilseman · March 27, 2019, 5:44pm

I disagree. This argument applies to almost any semantics defined on top of grapheme clusters. Why is normalizing "\r\n" to "\n" for the purposes of returning a single UInt8 a loss of information that demands deprecation, but normalizing U+037E to ';' for the purposes of comparison not?

jrose · March 27, 2019, 5:50pm

I'm a negative on single-quoted literals in general at this point. The case I most want to use them is for searching through bytes, which is the controversial Int8/UInt8 case; putting it on Unicode.Scalar or Character doesn't sufficiently improve my actual use case because I can't use those for parsing bytes. (And when parsing strings, you already have enough type context for the regular "x" to work.)

I'd rather give up on "single-quoted literals" and instead keep the single quote for some kind of sigil (like in Lisp). For example, we could have used 'foo as key path syntax, since we're referring to a property without accessing it. I'm not saying that's better or worse than the \.foo that we went with, but it would have been an option if we weren't holding it in place for single-quoted literals.

hisekaldma · March 27, 2019, 5:54pm

+1 for exactly the reasons outlined. This solves a real problem in an elegant way.

Is deprecation of Character.asciiValue a part of the proposal? Otherwise, this part should probably be removed from the text.

lorentey · March 27, 2019, 6:01pm

Needless to say, I would be in favor of dedicated syntax for Unicode scalar literals.

Swift currently has major ergonomics issues around Unicode scalars: the stdlib hides them behind a five-syllable, scary-looking type name, and the language provides absolutely no concessions to make it easier to enter them into source code. The only literal syntax we have for strings is double quotes, which (well-deservedly) default to String, and is tied closely with high-level Unicode concepts such as normalization and grapheme cluster boundaries.

This covers high-level text processing needs much better than other languages. However, as it currently stands, Swift provides precious little support for lower-level string processing needs, where normalization is irrelevant/harmful. Processing ASCII data is just one of these, although it's probably the most important one.

xwu · March 27, 2019, 6:05pm

Unless I’m mistaken "\n" != "\r\n" in Swift—this is an ad hoc non-equivalent substitution for a single function. Moreover, it is reasonable to expect that if a user bothers to check that their string “is ASCII,” then mapping over that string for the “ASCII value” should be a lossless operation.

The broader point is that this is an example of an operation that does not make sense for the semantics defined on top of grapheme clusters. Some operations do and others don’t, but in part because using Unicode scalars is less ergonomic, we have been somewhat lax in shoehorning these operations onto Character even when they are not a perfect fit.

Anyway, deprecation of this API isn’t part of the pitch and is left over from an earlier draft, so we can excise the comment about that. [Update: excised.]

lorentey · March 27, 2019, 6:15pm

Here is a perfectly cromulent little function for extracting the ASCII parts of a String value:

func asciiBytes(of input: String) -> [UInt8] {
  return input.compactMap { $0.asciiValue }
}

The string "foo\r\nbar" contains 8 ASCII characters. Unfortunately, the function above returns an array of count 7. The CR character is silently lost. This is unexpected and highly dangerous.
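For contrast, a variant that walks the Unicode scalar view instead keeps both CR and LF. This is only a sketch using existing API; the name asciiScalarBytes is made up for the example.

```swift
// Scalar-based variant: each ASCII scalar maps to exactly one byte,
// and non-ASCII scalars are skipped.
func asciiScalarBytes(of input: String) -> [UInt8] {
    return input.unicodeScalars.compactMap { scalar -> UInt8? in
        scalar.isASCII ? UInt8(scalar.value) : nil
    }
}

let crlfBytes = asciiScalarBytes(of: "foo\r\nbar")
print(crlfBytes.count)
```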

Character is at a wholly inappropriate level when it comes to low-level string processing.

Normalizing U+037E to ‘;’ would also be inappropriate in some contexts. asciiValue’s behavior here is merely a value type violation that highlights the absurdity of having such a property on Character.

let a: Character = "\u{37e}"
let b: Character = ";"

a == b // ⟹ true
a.asciiValue == b.asciiValue // ⟹ false

lorentey · March 27, 2019, 6:30pm

How would a trapping UnicodeScalar.ascii property not be a sufficient improvement for the “searching through bytes” use case?

jrose · March 27, 2019, 6:33pm

At that point I can just write ascii("a") and get the same effect using contextual types (and a helper function in my project, not a proposed addition to the stdlib). I want something more compact.
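Such a project-local helper might look like the sketch below. The ascii function is hypothetical, as the post says; it simply wraps the existing UInt8(ascii:) initializer, and the sample input is made up.

```swift
// Hypothetical project-local helper, not a standard library API.
func ascii(_ scalar: Unicode.Scalar) -> UInt8 {
    // UInt8(ascii:) is the existing initializer; it traps if the scalar is not ASCII.
    return UInt8(ascii: scalar)
}

// Searching through bytes: the parameter type supplies the context for "=".
let payload: [UInt8] = Array("key=value".utf8)
let separatorIndex = payload.firstIndex(of: ascii("="))

print(separatorIndex as Any)
```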

xwu · March 27, 2019, 6:39pm

Yes, to be clear, this proposal is fundamentally addressed at improving facilities for low-level Unicode-aware text processing. We see .NET moving in the direction of adopting Go's Rune type for much the same reasons.

If the most you would want to do is to search through ASCII bytes, then indeed the proposed solution (in isolation, as per core team feedback) does not offer that at the maximum compactness; the same can be said for pretty much all of String, though, for that use case.

lorentey · March 27, 2019, 6:40pm

That’s fair. Neglecting to mention the name of the encoding would be a step too far for me.

Tino · March 27, 2019, 6:44pm

Imho this is the most convincing example that there is some issue with the status quo:
case Int8(UInt8(ascii: "a"))
looks really cumbersome compared to case 'a'.

But maybe there are other ways to solve this?
I recently had the odd idea of adding static computed properties to UInt8 - one for each character.
That would allow you to write
case .a:
instead, which is even more concise than single quotes.

It is probably a weird concept, but the good thing with ASCII is that it's rather limited compared to Unicode.
Because of the restrictions for variable names, you couldn't represent all characters in this way - but actually, I think using textual descriptions for special characters (UInt8.space instead of ' ') could even make the code easier to understand.
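A rough sketch of how that could look, with just two such properties. The property names and the describe function are assumptions made for the example; nothing like this exists in the standard library.

```swift
// Hypothetical named constants for individual ASCII bytes.
extension UInt8 {
    static var a: UInt8 { return UInt8(ascii: "a") }
    static var space: UInt8 { return UInt8(ascii: " ") }
}

// Implicit member syntax then works in patterns when switching over a UInt8.
func describe(_ byte: UInt8) -> String {
    switch byte {
    case .a: return "letter a"
    case .space: return "space"
    default: return "other"
    }
}

print(describe(97), describe(32), describe(0x41))
```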

xwu · March 27, 2019, 6:48pm

This is certainly an interesting direction.

The core team's feedback from SE-0243 is to keep this issue separated from the topic of single-quoted literals and to tackle it second, so this pitch makes no mention of it or any alternatives here.

taylorswift · March 27, 2019, 7:28pm

I’ve tried something like this for handling keypresses in an application, and what I’ve found is it’s surprisingly hard to remember what the non-alphanumeric symbol name is.

'!' -> 'exclam', 'exclamation', 'bang', 'exclamationMark'?
'@' -> 'at', 'atSign', 'atSymbol'?
'#' -> 'hashtag', 'hashTag', 'number', 'numberSign', 'pound', 'poundSymbol', 'poundSign'?
'$' -> 'dollar', 'dollarSymbol', 'dollarSign'?
'%' -> 'percent', 'percentage', 'percentSign', 'percentageSign', 'percentSymbol'? 
'^' -> ????????? (discoverability = 0)
'&' -> 'and', 'andSymbol', 'ampersand'?
'*' -> 'asterisk', 'star', 'starSymbol'?
'(' -> 'lparen', 'lParentheses', 'leftParentheses', 'openParentheses'?
')' -> 'rparen', 'rParentheses', 'rightParentheses', 'closeParentheses'?
'-' -> 'dash', 'hyphen', 'minus', 'hyphenMinus'?
'_' -> 'underscore', 'underbar'?
'+' -> 'plus', 'plusSign', 'plusSymbol'?
'=' -> 'equal', 'equals', 'equalSign', 'equalsSign'?
'{' -> 'leftCurlyBrace', 'leftCurlyBracket', 'leftFrenchBrace', 'leftFrenchBracket', 'openCurlyBrace', ...?
'}' -> ...?
'[' -> 'leftBracket', 'leftSquareBracket', 'openBracket', 'openSquareBracket'?
'|' -> 'verticalBar', 'pipe'?
'\' -> 'backslash', 'backSlash'
':' -> 'colon'
';' -> 'semi', 'semicolon', 'semiColon'?
'"' -> 'quot', 'quote', 'doubleQuote'?

jrose · March 27, 2019, 7:58pm

'^' is "caret", but the point is still valid.

(For fun, take a look at the mnemonics @Joe_Groff came up with for mapping ASCII operator characters to the English alphabet. That's a slightly different problem because the names have to have a unique starting letter, but it's still fun.)

taylorswift · March 27, 2019, 8:08pm

This argument isn’t relevant. You are (again) mixing up the concepts of literal conversion and literal coercion. Double-quoted literals are capable of providing the exact same static guarantees as single-quoted literals (as they do right now!), the difference is only in the amount of type context needed to accomplish it.
The sole benefit of single-quoted literals for Unicode.Scalar only is that this guarantee can now be provided:

let c:Character = 'a'

which guarantees that c is a single-codepoint Character. But we already decided this kind of expression is a bug of the design, not a feature, so I hardly see the utility.

This kind of discussion only proves that the entire literals system needs rework to present the concept of static coercion more correctly to users.

xwu · March 27, 2019, 8:11pm

I am referring neither to coercion nor conversion. I am arguing that a literal syntax dedicated to mean "this is a character" is misleading because it cannot guarantee that what's contained is in fact a character. This has nothing to do with the implementation of literals and is exclusively about the surface syntax for end users.

Have we? I don't see why. This exactly parallels floating-point types being expressible by integers.

taylorswift · March 27, 2019, 8:15pm

It is no more misleading than double quoted literals for Character. It sounds like an argument against any literal syntax at all for Character.

from the rejection rationale:

One concern raised during the review was that because ExpressibleByStringLiteral refines ExpressibleByExtendedGraphemeClusterLiteral, then type context will allow expressions like 'x' + 'y' == "xy". The core team agrees that this is unfortunate and that if these protocols were redesigned, this refinement would not exist. However, this is not considered enough of an issue to justify introducing new protocols to avoid the problem. Where practical, the implementation of single-quote literals should generate compile-time warnings for these kind of misuses – though this should not be done by adding additional deprecated operators to the standard library.

I believe this also covers the implicit promotion relationship between ExpressibleByExtendedGraphemeClusterLiteral and ExpressibleByUnicodeScalarLiteral.

jrose · March 27, 2019, 8:16pm

Note that the specific concern about guaranteeing that "x" is a single character is because that can change based on what version of Unicode you're using, which Swift treats as a run-time decision. On old versions of macOS, for example, skin-tone-modified emoji are not considered valid Characters, because the version of Unicode that the system provided did not know about skin tone modifiers.

xwu · March 27, 2019, 8:16pm

We don't have a dedicated syntax for Character: these are spelled identically to string literals and for good reason (i.e., because they cannot be distinguished at compile time). And yes, it is an argument against any dedicated literal syntax for Character.