Single-quoted code unit literals

preface: this is a follow-up to
Single Quoted Character Literals (Why yes, again).

@johnno1962 has communicated to me that he intends to take a short break from swift evolution, so as the other author of the “character literals” pitches, i will fill in for him in the meantime.

in light of some of the feedback we received on the most recent pitch thread, i have gone back and reworked the proposal in an attempt to relieve the logjam.


older revisions:

revision 1 (out of date)


current revision:

12 Likes

I've been following the other currently-active single-quoted literal thread (but not the one from a few years ago), and I have to say I like this proposal a lot better. I do have one question already: why does ExpressibleByASCIILiteral only support UInt8 and not Int8? Under the current proposal, whether or not code of the following form is legal depends on the platform:

var c: CChar = 'a'

I've run into a situation before where I'm working with an Int8 rather than a UInt8 because of how char is bridged, and being able to use a literal like '0' would have been very helpful for legibility.
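To illustrate the kind of code in question, here is a hedged sketch (the helper name is mine, and the single-quoted form is the pitched syntax, not current Swift):

```swift
// Today, testing a CChar against an ASCII code unit means spelling the
// byte numerically or detouring through UInt8(ascii:):
func isDigit(_ c: CChar) -> Bool {
    c >= 0x30 && c <= 0x39 // '0' ... '9' in ASCII
}

// Under the pitch, this could read `c >= '0' && c <= '9'`, but only on
// platforms where CChar is UInt8, unless Int8 also gains a conformance.
```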

1 Like

i think this is an excellent point and definitely an oversight on my part. i just pushed an update to the proposal draft, adding the following section about CChar:

Although Int8 is not as common a buffer element type as UInt8, it is the type that CChar is aliased to on some platforms. It would not make for a good user experience if code using code unit literals with CChar stopped working when ported to a different platform, so we propose allowing Int8 to be expressed with an ASCII literal as well. To avoid introducing excessive associatedtype requirements, the promoted literal type of Int8 is still UInt8.

Windows uses UInt16 as its native PlatformChar, so there is no need at this time to enable BMP literals for Int16.

for now, i have left Int16 out of the proposal, but i am interested in knowing if there are any platforms that require BMP interop with Int16.
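to make the quoted section concrete, here is a compilable sketch of what the promoted literal type means for Int8 (the initializer name follows the pitch; the protocol itself does not exist yet, so only the lowered form is shown):

```swift
extension Int8 {
    // what a single-quoted ASCII literal would lower to for Int8: the
    // literal is validated as a UInt8 code unit, then bit-cast. ASCII
    // occupies 0 ... 127, so the bit pattern always fits without going
    // negative.
    init(asciiLiteral value: UInt8) {
        self.init(bitPattern: value)
    }
}

let a = Int8(asciiLiteral: 0x61) // what 'a' as Int8 would produce: 97
```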

1 Like

I don’t know for sure, but I hope not. ASCII is fine for Int8 because it’s a 7-bit encoding, but putting BMP values above 0x7FFF into an Int16 and having them show up as negative feels…unfortunate, even if it probably wouldn’t cause problems in practice.

(Still thinking about the proposal myself.)

C++11 requires char16_t to be unsigned:

char16_t - type for UTF-16 character representation, required to be large enough to represent any UTF-16 code unit (16 bits). It has the same size, signedness, and alignment as std::uint_least16_t, but is a distinct type.

so i think the answer is no, we do not need Int16.

aside: i’ve completely left out UInt32 (char32_t/wchar_t) because Unicode.Scalar has the same stride and no one should be using [UInt32] for UTF-32 when [Unicode.Scalar] is an option. but we do not have a way to rebind arrays to another type without copying the array.
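for context, a sketch of the aside above using only current standard library API (the copy in the last step is the one being lamented):

```swift
// Unicode.Scalar wraps a UInt32 and has the same stride, so a buffer of
// scalars already models UTF-32:
assert(MemoryLayout<Unicode.Scalar>.stride == MemoryLayout<UInt32>.stride)

// but going from [UInt32] to [Unicode.Scalar] requires a copy, because
// the failable initializer must reject surrogates and out-of-range values:
let raw: [UInt32] = [0x48, 0x69, 0x1F600]
let scalars = raw.compactMap(Unicode.Scalar.init)
let text = String(String.UnicodeScalarView(scalars)) // "Hi😀"
```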

on the other hand, are UTF-32 APIs even common enough for us to be thinking about UInt32?

My first impression upon reading this is:

  • That's a lot of new protocols for a relatively small feature!
  • The proposal's name is about code units, but unexpectedly there are also new literals for code points (UnicodeScalar) and grapheme clusters (Character). I had to double-check that I had followed the correct link.
  • UnicodeScalar and Character already have their own literals; do they need new ones?

Overall I welcome the addition of those literals for UInt8 and UInt16, but I feel the rest is unnecessary, although it might keep various people happy. I wish it could be simple and straightforward.

3 Likes

correct me if i am wrong, but it sounds like you support the addition of

  1. ExpressibleByASCIILiteral and
  2. ExpressibleByBMPLiteral,

but not

  1. ExpressibleByCodepointLiteral and
  2. ExpressibleByCharacterLiteral.

and i do not think this is an unreasonable position to hold. the latter two protocols would be derived protocols, and we could always add them at a later date without breaking ABI.

on the other hand, as a proposal author, my role has been to try to draft a spec that can pass review, and there are some who think it is important that single-quoted literals be able to express Character values.1

so i would only encourage you to continue making your viewpoint known because i am just one person and i cannot decide for the whole community that single-quoted literals are going to work for (U)Int8 and UInt16, but not Unicode.Scalar or Character.


on a slightly different note, with regards to “that’s a lot of new protocols!”, i think that the criteria for whether something should go in the standard library should not be “how many declarations will it add?”, but rather “could this go in a package instead?”

and i think something like AtomicValue can go in a package, which is why it lives in swift-atomics. but the literal expression domain protocols cannot go in a package; they can only live in the standard library.


[1] actually, you could still express Character values with single quoted literals, because Character could still conform to ExpressibleByBMPLiteral. you just would not be able to write:

let c:Character = '🇺🇸'

but we could add this later alongside ExpressibleByCharacterLiteral without breaking ABI because we can always add a conformance to a more-derived protocol.

Is this going to solve the following problem? I'm sure you all have better reasons for wanting the feature, but I'd just love to be able to eliminate writing ("x" as Character) just to use extensions on literals.

func function(_: Character) { }
function("a") // compiles
extension Character {
  var property: Void { () }
  func function() { }
}
"a".property // Value of type 'String' has no member 'property'
"a".function() // Value of type 'String' has no member 'function'

yes.

In the absence of type information, the compiler will infer the Character type for a single-quoted literal, regardless of whether it qualifies for one of the more restricted expression domains.

let auto = 'x' // as Character

this is a property of single-quoted literals, and not of a specific literal expression domain protocol, like ExpressibleByCharacterLiteral.

it would infer Character even if only ExpressibleByASCIILiteral and ExpressibleByBMPLiteral were part of the proposal.

this would not prevent us from being able to express Character with a general ExpressibleByCharacterLiteral expression in the future.

1 Like

Exactly that: trying to satisfy everyone creates a situation where the only solution is not worth its weight. But I'll be happy if I'm proven wrong and we finally get code point literals.

I'm not thrilled about making integer types conform to these protocols, and I know that was a major controversy the last time this was reviewed, too. I think this proposal would be on stronger ground if the conforming types were dedicated character / code-point / code-unit types.

8 Likes

it seems this is still an area of disagreement then.

a common theme of the feedback in this thread seems to be that the proposal is too large and tries to introduce too many features at once.

what if the proposal were reduced to simply introducing the lexical expression domain protocols ExpressibleByASCIILiteral and ExpressibleByBMPLiteral, and conforming Character and Unicode.Scalar to them?

ExpressibleByASCIILiteral would still support a promoted literal type of UInt8, but an actual standard library conformance of UInt8 and Int8 to ExpressibleByASCIILiteral would be left to a subsequent proposal.

the BMP range of Character would initially be available as part of the first proposal, and a subsequent proposal could extend the single-quoted syntax to general grapheme clusters (ExpressibleByCharacterLiteral) without breaking ABI.

would this be acceptable to you?

What does this mean?

the term promoted literal type is explained in the proposal.

in this case, it means you would be able to conform a type to ExpressibleByASCIILiteral by providing an init(asciiLiteral:UInt8) initializer.
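a compilable sketch of what that would look like (the protocol here is my own reconstruction of the pitched one, not a standard library declaration):

```swift
// hypothetical reconstruction of the pitched protocol; with the promoted
// literal type fixed to UInt8, a conformance needs only one initializer:
protocol ExpressibleByASCIILiteral {
    init(asciiLiteral: UInt8)
}

struct ControlCode: ExpressibleByASCIILiteral {
    var rawValue: UInt8
    init(asciiLiteral value: UInt8) { self.rawValue = value }
}

// under the pitch, `let esc: ControlCode = '\u{1B}'` would lower to:
let esc = ControlCode(asciiLiteral: 0x1B)
```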

I would suggest that the most reviewable unit of work hews closely to the core team's feedback in 2017: That is, motivating a new kind of literal (using a single quotation mark as delimiter) and discussing whether the default conforming literal type ought to be Character or Unicode.Scalar.

I would not think that introducing new terminology about literals to make Character and Unicode.Scalar expressible by an integer would be within the spirit of that guidance any more than making UInt8 expressible by a character literal.

as i explained in my response to @Jessy above, the default inferred type of a single-quoted literal expression will be Character.

i do not recall this being controversial the first time around, and i am curious to hear what your rationale is for having the default inferred type be Unicode.Scalar.

promoted literal types such as ExtendedGraphemeClusterLiteralType exist throughout the standard library today, and the proposal only uses the term to provide background about the design of the existing double-quoted expression protocols.

rather than use the phrase promoted literal type, we could just say

type witnesses for associatedtype requirements such as UnicodeScalarLiteralType, ExtendedGraphemeClusterLiteralType, and StringLiteralType

but this is quite a mouthful for a concept we refer to repeatedly throughout the text.

nothing in the proposal calls for making either type expressible by an integer literal.

part 3 of the proposal calls for making the specific types UInt8, Int8, and UInt16 expressible by single-quoted code unit literals. is this the part of the proposal you are referring to?

it sounds like you are opposed to part 3.ii. of the proposal. would a subset of the proposal, like the one i outlined in my response to @John_McCall be acceptable to you?

Thanks, I'd misinterpreted based on the description of init(asciiLiteral: UInt8) provided above. It'd be helpful to have the proposal stick as closely to the user-facing terminology of Swift as possible in describing the proposed behavior, using code as needed to clarify.

My feelings on that part have been made plenty clear in over 50 replies in the original review thread, I think. Each part of a successful proposal would contribute to a whole that addresses a strong motivating problem, and your challenge here (in my view) is to present the case for a new literal type, and to fit it into Swift as it now is, without really any of the ASCII bits.

i am sorry none of the compromises or subsets of the proposal appear acceptable to you.

although i personally feel that single-quoted literals would hold value, even in a split-up proposal that would neither enable expressing integers nor arbitrary extended grapheme clusters with single quoted literals, the final say is up to the language workgroup, and i think it is becoming clear that there is little appetite for this feature, so i have gone ahead and closed the evolution PR.

To be clear, neither I nor Xiaodi has been speaking for the Language Workgroup.

That does seem much more in keeping with the Core Team's original review feedback of (1) being in favor of adding character literals while also (2) being reticent to have conformances on any of the integer types. All together, though, it feels like an odd position, because it adds two protocols that lack library types that specifically model them, and (at least as stated) it range-limits the new single-quoted literals so that they can't express an arbitrary code point.

I feel like a more natural evolution path here is to start by just enabling the syntax, have it require one of the two existing protocols (extended grapheme cluster / unicode scalar), and make it use Character as its default literal type. Then the limited-range protocols for scalars that fit within UTF-8 / UTF-16 code units would be a viable follow-on proposal, although this would be a much stronger proposal if there were UTF8CodeUnit / UTF16CodeUnit library types that modeled them. Speaking officially (on this point alone), the Language Workgroup is not likely to embrace adding character-literal conformances on any of the integer types.

1 Like

I’m speaking just for myself, but I don’t have any non-ASCII, non-UInt8/CChar use cases. A separate UTF8CodeUnit or ASCIIChar type could have a custom match operator and cheap conversions, but that doesn’t strictly need any language changes at all (it can be done with ExpressibleByUnicodeScalarLiteral and existing literal syntax, albeit without static enforcement of ASCII-only).

(The proposal calls this the “Low Level String Processing” use case; I want to clarify that mine is usually about byte buffers, possibly but not necessarily valid strings. This means I care less about UTF-16, because UTF-16 embedded in a binary format usually doesn’t require processing the UTF-16, just extracting it as a blob. But I see that there are occasionally use cases for operating directly on UTF-16 buffers for stringy purposes.)
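The separate type mentioned above can be sketched in today's Swift, with the caveats stated (names are mine; the ASCII restriction is a runtime precondition rather than static enforcement):

```swift
struct ASCIIChar: ExpressibleByUnicodeScalarLiteral, Equatable {
    var value: UInt8
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        precondition(scalar.isASCII, "not an ASCII scalar")
        self.value = UInt8(scalar.value)
    }
}

// a cheap match against raw bytes in a buffer, usable in switch patterns:
func ~= (pattern: ASCIIChar, byte: UInt8) -> Bool {
    pattern.value == byte
}

let bytes: [UInt8] = [0x68, 0x69, 0x0A] // "hi\n"
let newlines = bytes.filter { ("\n" as ASCIIChar) ~= $0 }.count // 1
```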