Prepitch: Character integer literals

fswarbrick · October 19, 2018, 8:55pm

ByteChar? ByteValue?

taylorswift · October 19, 2018, 8:56pm

i have taken to calling them “byte strings” but it doesn’t really have a good name as the upper half of that range wasn’t well-standardized before unicode came along

AlexanderM · October 19, 2018, 8:57pm

Again, you could just do this:

let buffer = ["+", "-", "+", "-", "+", "-", "+", "-", "+", "-", ].map(UInt8.init(ascii:))
Darwin.write(fd, buffer, buffer.count)

or

let plusAndMinusAscii = ["+", "-"].map(UInt8.init(ascii:))
let buffer = (0..<500).flatMap { _ in plusAndMinusAscii }
Darwin.write(fd, buffer, buffer.count)

or even

extension StringProtocol {
	var ascii: [UInt8] {
		return unicodeScalars.map { unicodeScalar in 
			guard unicodeScalar.isASCII else {
				fatalError("Tried to get ascii code of non-ascii unicode scalars.")
			}
			return UInt8(unicodeScalar.value)
		}
	}
}

let buffer = "+-+-+-+-+-+-".ascii
print(buffer)

taylorswift · October 19, 2018, 9:03pm

Because the codepoint version is transparent, whereas this one constructs Character objects and tries to transform them back into UInt8s at run time? It’s like initializing integers by casting Float literals. It’ll work, but why would that ever be considered ideal? Casting Float literals at the very least still has the possibility that the whole chain could be optimized out by the compiler, whereas such an optimization cannot be done for Character, because grapheme checking needs to be done at run time now that Swift (since 3.0) links against the system ICU library.

And don’t forget, an invalid grapheme (but valid Character) in your example is a run time error, whereas an invalid codepoint literal is a compile time error.

AlexanderM · October 19, 2018, 9:16pm

The solution to that is to make compiler optimizations to allow these to be converted at compile time. I don't see why it needs new language syntax. And why stop at ASCII? What about ISO 8859? What about other character encodings?

johnno1962 · October 20, 2018, 12:47am

The proposal does not stop at ASCII. Any codepoint is valid provided it fits into the target type.

let ascii: Int8 = 'a' // 97
let latin1: UInt8 = 'ÿ' // 255
let utf16: UInt16 = 'Ƥ' //  420
let emoji: UInt32 = '🙂' // 128578

michelf · October 20, 2018, 1:03am

At least this last one can sort of be written as:

let emoji: UnicodeScalar = "🙂"

... assuming you don't mind using UnicodeScalar in lieu of UInt32. They're both the same thing under the hood.
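For reference, the codepoint values quoted in @johnno1962's examples can be checked in current Swift through Unicode.Scalar. This is only a minimal sketch using existing standard library API; the variable names are made up for illustration, and the literals are written as escapes so the exact codepoints are unambiguous.

```swift
// Each constant is a Unicode.Scalar; `.value` is its UInt32 codepoint.
let letterA: Unicode.Scalar = "a"
let latin1: Unicode.Scalar = "\u{FF}"      // 'ÿ'
let wide: Unicode.Scalar = "\u{1A4}"       // 'Ƥ'
let emoji: Unicode.Scalar = "\u{1F642}"    // '🙂'

print(letterA.value, latin1.value, wide.value, emoji.value)
```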

AlexanderM · October 20, 2018, 4:16pm

So a multitude of encoding would be supported by this, and which encoding is used is an implicit function of the character and the destination datatype's size?

I'm sure there would be ambiguous cases when the same character exists in 2 encodings, with the same size. How would such a case be disambiguated?

Again, all this complexity, just to avoid something like a UInt8.init(ascii:) call? No thanks.

taylorswift · October 20, 2018, 4:44pm

What? There are no such things as duplicate codepoints in unicode, only equivalent encodings and equivalent grapheme compositions which are one of the precise existing problems this proposal is designed to help solve. Perhaps you are confusing unicode codepoints with unicode code units or unicode graphemes?

michelf · October 20, 2018, 5:26pm

I don't think anyone suggested adding multiple encodings. @johnno1962's example includes a Latin 1 character simply because Unicode code points that fit into one byte are equivalent to Latin 1. It obviously won't work for any other encoding (unless you count ASCII) because Unicode only has this particular relationship with Latin 1.

Chris_Lattner3 · October 20, 2018, 9:50pm

Thank you so much for driving this forward John, and I apologize for abandoning you with this before. I would also really love to see this make progress and am thrilled you're pushing on it (I just don't have time to dedicate to it). Thank you thank you thank you!

-Chris

Chris_Lattner3 · October 20, 2018, 9:52pm

johnno1962:

Detailed Design

These new character literals will have default type Character and be statically checked to contain only a single extended grapheme cluster. They will be processed largely as if they were a short String.

When the Character is representable by a single Unicode codepoint, however (a 21-bit number), they will be able to express a Unicode.Scalar and any of the integer types, provided the codepoint value fits into that type.

As an example:
let a = 'a' // This will have type Character
let s: Unicode.Scalar = 'a' // is also possible
let i: Int8 = 'a' // takes the ASCII value

+1 to this design.

-Chris

AlexanderM · October 21, 2018, 1:39am

I was responding to this comment. Prepitch: Character integer literals - #147 by johnno1962

My question was: why do we highlight ASCII and Latin 1? Why should those 2 encodings get such special treatment from the language? If there is a need to initialize integers from characters, I would like to see a generic mechanism that supports arbitrary encodings, and without wasting the ' sigil.

michelf · October 21, 2018, 2:04am

Both ASCII and Latin-1 are a subset of Unicode code points. So if you support Unicode you get those for free, it's just an overflow check when assigning the code point to a variable. It's the same "special" treatment you get with integer literals: the compiler tells you about the overflow in let x: Int8 = 1000.

As for other encodings, I haven't seen anyone asking for literals or a use-case that'd benefit from that. Is there a reason to support them? They've all been superseded by Unicode.

Chris_Lattner3 · October 21, 2018, 3:31am

Because ASCII is pervasive, e.g. as the basis for almost all of the ietf protocols. This matters a lot for server development, as well as handling markup formats like XML, as well as many other things that skew towards the lowest common denominator of encodings, which is unequivocally ASCII.

-Chris

fswarbrick · October 21, 2018, 3:49am

I've seen examples such as this:

let buffer: [UInt8] = ['+', '-', '+', '-', '+', '-', '+', '-', '+', '-', ]

Would allowing the following be "too much" or out of scope for this pitch?

let buffer: [UInt8] = '+-+-+-+-+-'

taylorswift · October 21, 2018, 4:21am

Hi all, I’ve completed a new draft of the proposal, which can be found here.

Proposal: SE-XXXX
Authors: Kelvin Ma (@taylorswift), Chris Lattner (@Chris_Lattner3)
Review manager:
Status: Awaiting review
Implementation:
Threads: 1

Introduction

Swift’s String type is designed for Unicode correctness and abstracts away the underlying binary representation of the string to model it as a Collection of grapheme clusters. This is an appropriate string model for human-readable text, as to a human reader, the atomic unit of a string^† is the extended grapheme cluster. When treated this way, many logical string operations “just work” the way users expect, and adopting this string model eliminates entire classes of string bugs that plague earlier C-like languages.

However, it is increasingly clear that this model is entirely inappropriate for strings in the machine-readable context, where operating on a string as a Collection of grapheme clusters can actually introduce new types of bugs, as well as introduce performance traps and harm code safety and readability. Many solutions have been proposed in the past, mostly revolving around the creation of a new, machine-readable string type (“ASCIIString”, “FixedWidthString”, etc.). We instead propose a simple and small addition to the language, which we believe would greatly streamline the writing of such code, as well as offer greater generalizability and benefit other parts of the language: adding a new literal type CodepointLiteral^‡ which takes single-quotes ('), and is transparently convertible to Swift’s integer types.

† In exceptional cases, the human “atomic unit” can actually extend beyond the definition of an extended grapheme cluster to include multiple clusters. An example of this is the German single grapheme "ß", which uppercases to the double grapheme "SS". This is generally considered the domain of locale-sensitive text processing, and is above the abstraction level of String.

‡ As written, this name is semantically incorrect for how these literals are actually defined to behave in this proposal. This proposal will use CodepointLiteral as a strawman spelling until we settle on a more correct name for these literals.

Background and terminology

Swift (and Unicode) strings and characters sit atop two levels of abstraction over a binary buffer. These levels of abstraction are the unicode codepoint and the unicode grapheme.

Unicode codepoints are the atomic unit of Unicode. They are integers from 0x00_0000 ... 0x10_FFFF which are assigned to characters such as 'é' or control characters such as '\n'. The integer value of a codepoint is called its unicode scalar^† and corresponds to the Swift type Unicode.Scalar.

Codepoints are a useful concept, but are extremely inefficient to store directly, as the vast majority of common characters are assigned to small integers which do not require the entire four-byte width of a unicode scalar to store. Unicode encodings such as UTF-8 and UTF-16 are used to compress sequences of codepoints through use of variable-length coding. For example, UTF-8 assigns the common codepoint 'a' to the single-byte codeword [0x61], and the less common codepoint '√' to the three-byte codeword [0xE2, 0x88, 0x9A].
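The two encodings mentioned above can be inspected directly through the existing `utf8` and `utf16` views on String. A minimal sketch (the constant names are illustrative only):

```swift
// 'a' (U+0061) is a single UTF-8 code unit.
let latinA = Array("a".utf8)

// '√' (U+221A) needs three UTF-8 code units, but only one UTF-16 code unit.
let squareRootUTF8 = Array("\u{221A}".utf8)
let squareRootUTF16 = Array("\u{221A}".utf16)

print(latinA, squareRootUTF8, squareRootUTF16)
```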

Unicode strings are strings which use UTF-8 or UTF-16 as their in-memory representation. Swift Strings are unicode strings. The UTF buffer of a unicode string should be considered a raw binary format, and so the constituent bytes in isolation have no meaning.

Extended grapheme clusters (usually just graphemes) are ranges of codepoint sequences which humans perceive as logically a single “character”. This corresponds to the Swift Character type. An example of a grapheme is the '👩‍✈️' emoji, which contains three codepoints: '👩', '\u{200D}' (zero-width joiner), and '✈️'. Grapheme breaking is context-dependent: '👩', '\u{200D}', and '✈️' are all valid graphemes in isolation, yet concatenating them in sequence “fuses” them into a single grapheme.

var string:String = "👩"
print(string, string.count)
// 👩 1

string.append("\u{200D}")
string.append("✈️")
print(string, string.count)
// 👩‍✈️ 1

Because characters can (somewhat confusingly) be built up from other characters, in this proposal we will only use the word character as a loose term for the general concept of a “textual unit”.

Not only can graphemes be composed of multiple codepoints, but they can have multiple possible decompositions as well. The grapheme 'á' has two decomposition forms: the single codepoint composed form ['á'] ([0xE1]), and the double codepoint decomposed form ['a', '́'] ([0x61, 0x301]). Both forms are considered to be the same grapheme, and so compare equal under unicode canonical equivalence. Graphemes can have multiple distinct single-codepoint decompositions — 'Å' (0xC5) and 'Å' (0x212B), for example, are both considered to be the same grapheme.
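Canonical equivalence is observable with today's String, whose `==` compares by canonical equivalence while the `unicodeScalars` view exposes the underlying codepoints. A small sketch of the two cases named above (constant names are illustrative):

```swift
// 'á' in composed and decomposed form.
let composed = "\u{E1}"
let decomposed = "a\u{301}"
print(composed == decomposed)                         // same grapheme
print(composed.unicodeScalars.count, decomposed.unicodeScalars.count)

// Two distinct single-codepoint spellings of 'Å'.
let aWithRing = "\u{C5}"
let angstromSign = "\u{212B}"
print(aWithRing == angstromSign)
```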

ASCII characters are 7-bit scalars in the range 0x00 ... 0x7F which represent most common characters such as 'a' or '+'. ASCII strings are arrays of ASCII characters, which are commonly used to model textual data in classical computing. ASCII strings are an efficient string representation as each character can be represented with a single 8-bit integer. In this proposal, we will refer to strings represented as arrays of 8-bit characters as bytestrings. Some developers call these strings c-strings due to their popularity in the C language, although the term can also refer to a UTF-8 unicode string that ends with the null character. This proposal will not use the term c-string.

Bytestrings can encode many more characters than there are ASCII characters, and so the mapping from 8-bit integer values to unicode codepoints can be extended to the entire 0x00 ... 0xFF range. Because ASCII strings are almost always stored as bytestrings, and the general popularity of the ASCII acronym, many people say “ASCII string” when they really mean bytestring. This distinction becomes important when attempting to retrofit String APIs to operate on bytestrings, as UTF-8 is only backwards-compatible with the ASCII subset of a bytestring’s coding range.

For example, the ASCII character 'a' has the exact same representation in both a bytestring, and a UTF-8 unicode string: 0x61, the value of its unicode codepoint. Similarly, in a bytestring, the character 'é' is directly represented by 0xE9, the value of its unicode codepoint. However, in a UTF-8 string, this character is encoded as the two-byte sequence [0xC3, 0xA9].
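To make the difference concrete, here is a rough sketch that builds both representations of the same text in current Swift. The helper `bytestring(_:)` is hypothetical (it is not part of the standard library or of this proposal); it stores one byte per codepoint and gives up on anything above 0xFF.

```swift
// Hypothetical helper: one byte per codepoint, nil if any codepoint exceeds 0xFF.
func bytestring(_ text: String) -> [UInt8]? {
    var bytes: [UInt8] = []
    for scalar in text.unicodeScalars {
        guard let byte = UInt8(exactly: scalar.value) else { return nil }
        bytes.append(byte)
    }
    return bytes
}

let text = "caf\u{E9}"                 // "café", with 'é' as U+00E9
let asBytestring = bytestring(text)    // 'é' is the single byte 0xE9
let asUTF8 = Array(text.utf8)          // 'é' is the pair 0xC3, 0xA9
let unrepresentable = bytestring("\u{221A}")   // '√' does not fit in one byte
```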

Machine strings are a generalization of the bytestring concept to wider integer widths. Machine strings are characterized by their fixed-width character encoding, so unlike Strings (“human strings”), integer subscripts make sense on them. Machine strings can be thought of as an alternative coding approach to unicode strings, where instead of using multiple code units to encode higher codepoints, we simply encode however many codepoints we can directly, and disallow the rest.

Because a machine string can always encode more codepoints directly than its equivalent-width unicode string, corresponding machine and unicode string representations are never binary-equivalent.

† Valid unicode codepoints are actually a superset of valid unicode scalars, as certain codepoint values (0xD800 ... 0xDFFF) are reserved and so do not represent characters. These codepoints are used as sentinel shorts in the UTF-16 encoding, or, are simply unused at this time. This distinction is unimportant to this proposal.
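The gap between codepoints and scalars shows up in the existing failable `Unicode.Scalar` initializer, which rejects the reserved range and anything past 0x10FFFF. A brief sketch:

```swift
// The failable initializer returns nil for values that are not unicode scalars.
let belowSurrogates = Unicode.Scalar(0xD7FF as UInt32)
let surrogate = Unicode.Scalar(0xD800 as UInt32)
let lastScalar = Unicode.Scalar(0x10FFFF as UInt32)
let outOfRange = Unicode.Scalar(0x110000 as UInt32)

print(belowSurrogates != nil, surrogate != nil, lastScalar != nil, outOfRange != nil)
```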

Motivation

Today, most people who process bytestrings are forced to use either the lower-level [UInt8] array API, or the higher-level String API. Both workarounds have serious flaws we believe cause active harm to users and the language.

A popular approach among some users is to (ab)use the String API, and attempt to spell familiar C-idioms using its syntax. This has the major bonus of readability, but leaves users vulnerable to many pitfalls.

A common mistake is to convert bytestrings to Strings and compare them to other Strings. Given two bytestrings a:[UInt8], b:[UInt8], many users assume that

String(decoding: a, as: Unicode.ASCII.self) == 
String(decoding: b, as: Unicode.ASCII.self)

if and only if

a == b

but this doesn’t actually hold for all bytestrings. A real-world example of where this can cause harm is when detecting the magic header for the JPEG image format, ['ÿ', 'Ø', 'ÿ', 'Û'] ([0xFF, 0xD8, 0xFF, 0xDB]). For obvious choices of Unicode codec, it is possible for an entirely different bytestring to match it.

// none of these codepoints are actually ASCII, so `Unicode.ASCII` 
// is clearly the wrong codec to use. 

String(decoding: [0xFF, 0xD8, 0xFF, 0xDB], as: Unicode.ASCII.self) == 
String(decoding: [0xEE, 0xC7, 0xEE, 0xCA], as: Unicode.ASCII.self)
// true

The other option, Unicode.UTF8, has the same problem.

// both of these bytestrings are considered to be UTF-8 gibberish, 
// and all gibberish strings compare equal.

String(decoding: [0xFF, 0xD8, 0xFF, 0xDB], as: Unicode.UTF8.self) == 
String(decoding: [0xEE, 0xC7, 0xEE, 0xCA], as: Unicode.UTF8.self)
// true

Indeed, the correct way to do these String comparisons is to widen our input bytestring to 16 bits, and import it as a UTF-16 unicode string!

String(decoding: [0xFF as UInt8, 0xD8, 0xFF, 0xDB].map{ UInt16($0) }, 
             as: Unicode.UTF16.self) 
             == "ÿØÿÛ"
// true (expected)

Aside from being inefficient for long strings, understanding why this is a valid identity^† requires a deep understanding of Unicode, and users are highly unlikely to discover this idiom on their own. We should not require users to be Unicode experts in order to write correct bytestring code.^‡

† Can’t figure it out? It’s because for each grapheme, the Unicode standard defines no more than one decomposition sequence which consists solely of codepoints below 0x100. (What, you haven’t read the Unicode Standard, Version 11.0.0?) Note that many graphemes still have multiple canonically-equivalent decomposition sequences containing at least one codepoint below 0x100. Because of this, the act of widening an 8-bit machine string to a 16-bit machine string can introduce no canonically equivalent decomposition sequences (so long as they are zero-extended), preserving the one-to-one relationship.

‡ A correct String comparison identity for 16-bit machine strings is left as an exercise for the reader.

False comparison positives aren’t the only correctness traps that can result from abuse of String APIs. Many textual formats, such as XML and JSON, are defined in terms of codepoints, and parsing by Character will lead to bugs. For example, a combining character after the opening quote of an XML attribute (as in attr="\u{308}value") is well-formed XML and must be parsed as a value starting with a combining character.

"\"\u{308}".count // 1

Credit to Michel Fortin for the example!
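The consequence for a parser can be sketched with today's APIs: searching the XML fragment above for quote marks gives different answers depending on whether the search runs over Characters or over unicode scalars. The constant names are illustrative only.

```swift
let attribute = "attr=\"\u{308}value\""
let quoteCharacter: Character = "\""
let quoteScalar: Unicode.Scalar = "\""

// By Character, the opening quote has fused with U+0308 and no longer matches.
let quotesByCharacter = attribute.filter { $0 == quoteCharacter }.count

// By unicode scalar, both delimiters are found.
let quotesByScalar = attribute.unicodeScalars.filter { $0 == quoteScalar }.count

print(quotesByCharacter, quotesByScalar)
```

A parser that scans by Character would only see the closing quote here, which is the kind of bug the paragraph above describes.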

In cases like these, String is simply the wrong tool for the job, and if all you have is a String, then everything looks like a Character.

According to Swift’s stated design goals,

we believe that the most obvious way to write Swift code must also be safe, fast, and expressive.

Here, the “obvious” approach of parsing by grapheme cluster is neither safe, nor fast.

An alternative approach some users prefer is to drop all String pretenses and work in [UInt8] or some other integer array type. This approach has far fewer footguns, since the code is being written in the same domain that the standard it is implementing is defined. The clearest, easiest, and least error-prone way to test if two [UInt8] buffers are equal, is, of course, to test if they are equal.

Machine strings often require you to extract characters at fixed offsets inside them. (Quick! Get the month from a "YYYY-MM-DD" datestring!) Random access integer subscripting is a completely natural operation on an Array, yet completely unnatural on a String. Users trying to subscript the kth character in a String are liable to fall into performance traps which could easily add a factor of n to their runtime.
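As an illustration of the datestring case, the following sketch extracts the month both ways in current Swift. It assumes a well-formed ASCII "YYYY-MM-DD" input and does no validation.

```swift
let date = "2018-10-21"

// Bytestring: constant-time integer subscripts at known offsets.
let bytes = Array(date.utf8)
let zero = UInt8(ascii: "0")
let month = Int(bytes[5] - zero) * 10 + Int(bytes[6] - zero)

// String: indices must be walked from the start, one grapheme at a time.
let start = date.index(date.startIndex, offsetBy: 5)
let end = date.index(start, offsetBy: 2)
let monthText = String(date[start ..< end])

print(month, monthText)
```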

Another selling point of [UInt8], is that we effectively get all the bytestring library methods we had in String, for free, as Array, of course, also conforms to BidirectionalCollection.
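For instance, the usual collection algorithms already work on a [UInt8] buffer. A short sketch with an illustrative header line (the sample data is an assumption, not taken from any real protocol trace):

```swift
let line = Array("Content-Length: 42".utf8)
let colon = UInt8(ascii: ":")

// Generic Collection and Sequence methods, applied to bytes.
let colonIndex = line.firstIndex(of: colon)
let parts = line.split(separator: colon, maxSplits: 1)
let name = String(decoding: parts[0], as: Unicode.UTF8.self)
let hasPrefix = line.starts(with: "Content".utf8)
let lastByte = line.last

print(name, hasPrefix)
```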

The main drawback to integer arrays is that they lack a clear and readable literal type. In C, 'a' is an integer character constant, equivalent to 97. Swift has no such equivalent, requiring awkward spellings like UInt8(ascii: "a"), or UInt8(truncatingIfNeeded: ("a" as Unicode.Scalar).value) for the codepoints at 0x80 and above. Alternatives, like spelling out the values in hex or decimal directly, are even worse. This harms readability of code, and is one of the sore points of bytestring processing in Swift.

static char const hexcodes[16] = {
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 
    'a', 'b', 'c', 'd', 'e', 'f'
};

// This is the best we can get right now, while showing the textual 
// letter form.
let hexcodes = [
    UInt8(ascii: "0"), UInt8(ascii: "1"), UInt8(ascii: "2"), UInt8(ascii: "3"),
    UInt8(ascii: "4"), UInt8(ascii: "5"), UInt8(ascii: "6"), UInt8(ascii: "7"),
    UInt8(ascii: "8"), UInt8(ascii: "9"), UInt8(ascii: "a"), UInt8(ascii: "b"),
    UInt8(ascii: "c"), UInt8(ascii: "d"), UInt8(ascii: "e"), UInt8(ascii: "f")
]

UInt8 has a convenience initializer for converting from ASCII, but if you're working with other types like Int8 (common when dealing with C APIs that take char), it is much more awkward. Consider scanning through a char* buffer as an UnsafeBufferPointer<Int8>:

for scalar in int8buffer {
    switch scalar {
    case Int8(UInt8(ascii: "a")) ... Int8(UInt8(ascii: "f")):
        break // lowercase hex letter
    case Int8(UInt8(ascii: "A")) ... Int8(UInt8(ascii: "F")):
        break // uppercase hex letter
    case Int8(UInt8(ascii: "0")) ... Int8(UInt8(ascii: "9")):
        break // decimal digit
    default:
        break // something else
    }
}

Aside from being ugly and verbose, transforming Character or Unicode.Scalar literals also sacrifices compile-time guarantees. The statement let codepoint:UInt16 = 128578 is a compile time error, whereas let codepoint = UInt16(("🙂" as Unicode.Scalar).value) is a run time error.
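The trapping conversion has a non-trapping counterpart in `init(exactly:)`, which makes the run-time nature of the check visible without crashing. A minimal sketch in current Swift (constant names are illustrative):

```swift
let emoji: Unicode.Scalar = "\u{1F642}"   // '🙂', codepoint 128578
let letter: Unicode.Scalar = "a"

// Both checks happen at run time; nothing here is diagnosed by the compiler.
let narrowedEmoji = UInt16(exactly: emoji.value)    // does not fit in 16 bits
let narrowedLetter = UInt16(exactly: letter.value)  // fits

print(narrowedEmoji as Any, narrowedLetter as Any)
```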

Codepoints are inherently textual, so it should be possible to express them with a textual literal without requiring layers upon layers of transformations. Just as applying the String APIs runs counter to Swift’s stated design goals of safety and efficiency, forcing users to express basic data values in such a convoluted and unreadable way runs counter to our design goal of expressiveness.

Michel Fortin put it best: “You need to express characters as code points or sometime lower-level integers in the parser. If it's a complicated mess to express this, then the parser becomes a complicated mess.”

Furthermore, improving Swift’s bytestring ergonomics is an important part of our long term goal of expanding into embedded platforms. Here’s one embedded developer’s take on the proposal.

Proposed solution

Let's do the obvious thing here, and add a textual literal type for Swift’s integer types. The value of the literal will be the value of its codepoint.

Swift’s textual literals currently exist in a hierarchy where ExpressibleByStringLiteral inherits from ExpressibleByExtendedGraphemeClusterLiteral, which in turn inherits from ExpressibleByUnicodeScalarLiteral. Types that conform to this family of protocols indicate to the compiler the strictest level of literal overflow checking it should do. For example, a type that conforms to ExpressibleByExtendedGraphemeClusterLiteral forces the compiler to verify that the input text literal is a single grapheme^†, while a type that conforms to ExpressibleByUnicodeScalarLiteral forces the compiler to also verify that the grapheme is a single valid 21-bit unicode scalar.^‡

As the guarantee for a valid unicode scalar is stricter than the guarantee needed to prove that the literal won’t overflow an Int32, we can use Unicode.Scalar literals to express Int32s, UInt32s, and higher through the ExpressibleByUnicodeScalarLiteral protocol. The natural way to extend this feature to UInt16 and below is to introduce two new protocols ExpressibleByUnicode16Literal and ExpressibleByUnicode8Literal, which do even stricter overflow checking than ExpressibleByUnicodeScalarLiteral. ExpressibleByUnicodeScalarLiteral would naturally inherit from ExpressibleByUnicode16Literal^∗, which would in turn inherit from ExpressibleByUnicode8Literal.

This allows us to statically diagnose overflowing codepoint literals, just as the compiler and standard library already work together to detect overflowing integer literals:

let a: Int16 = 128 // ok
let b: Int8 = 128  // error: integer literal '128' overflows when stored into 'Int8' 

let c: Int16 = 'Ƥ' // ok
let d: Int8  = 'Ƥ' // error: character literal 'Ƥ' overflows when stored into 'Int8'

With these changes, the hex code example can be written much more naturally:

let hexcodes: [UInt8] = [
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 
    'a', 'b', 'c', 'd', 'e', 'f'
]

for scalar in int8buffer {
    switch scalar {
    case 'a' ... 'f':
        break // lowercase hex letter
    case 'A' ... 'F':
        break // uppercase hex letter
    case '0' ... '9':
        break // decimal digit
    default:
        break // something else
    }
}

The JPEG header example can be written as:

guard bytestring == ['ÿ', 'Ø', 'ÿ', 'Û']

instead of:

guard String(decoding: bytestring.map{ UInt16($0) }, as: Unicode.UTF16.self) == "ÿØÿÛ"

or:

guard bytestring == [0xFF, 0xD8, 0xFF, 0xDB]

Choice of single quotes

The proposed solution is syntax-agnostic and can actually be implemented entirely using double quotes. However, conforming some classes of textual literals to integer types can lead to some interesting spellings such as "1" + "1" == 98 instead of "11". We foresee problems arising from this to be quite rare, as type inference will almost always catch such mistakes, and very few users are likely to express a String with two literals instead of the much shorter "11".

Nevertheless, mixing arithmetic operators with double-quoted literals seems like a recipe for confusion, and there is enough popular demand for single-quoted literals that there is a compelling case for using a different quote syntax for these literals.
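The double-quote behaviour described above can be approximated in current Swift with a retroactive conformance. To be clear, this is only a sketch under stated assumptions: it is not the design proposed here, the range check happens at run time rather than compile time, and recent compilers may warn about conforming a standard library type to a standard library protocol.

```swift
// Assumption: a run-time-checked stand-in for the proposed compile-time check.
extension UInt8: ExpressibleByUnicodeScalarLiteral {
    public init(unicodeScalarLiteral value: Unicode.Scalar) {
        guard let byte = UInt8(exactly: value.value) else {
            fatalError("codepoint \(value.value) does not fit in UInt8")
        }
        self = byte
    }
}

let sum: UInt8 = "1" + "1"   // integer addition of two codepoints: 49 + 49
let text = "1" + "1"         // no integer context, so this is still a String

print(sum, text)
```

With a type annotation the same spelling silently becomes arithmetic, which is the confusion that motivates a separate quote syntax.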

We propose to adopt the 'x' syntax for all textual literal types, up to and including ExtendedGraphemeClusterLiteral, but not including StringLiteral. These literals will be used to express integer types, Character, Unicode.Scalar, and types like UTF16.CodeUnit in the standard library.

The default inferred literal type for let x = 'a' will be Character. This follows the principle of least surprise, as most users expect '1' + '1' to evaluate to "11" more than 98.

Use of single quotes for character/scalar literals is heavily precedented in other languages, including C, Objective-C, C++, Java, and Rust, although different languages have slightly differing ideas about what a “character” is. We choose to use the single quote syntax specifically because it reinforces the notion that strings and character values are different: the former is a sequence, the latter is a scalar (and "integer-like"). Character types also don't support string literal interpolation, which is another reason to move away from double quotes.

One significant corner case is worth mentioning: some methods may be overloaded on both Character and String. This design allows natural user-side syntax for differentiating between the two.

† Since 3.0, the compiler cannot statically verify the validity of a Character literal, as the runtime must call into the system ICU library to determine this. It is still possible for the compiler to rule out character sequences which could not possibly be graphemes.

‡ Not all codepoints are valid 21-bit unicode scalars, but the codepoints that are invalid are invalid because they represent no character, and so could never be written in a literal anyway. At any rate, such edge cases are best expressed as hexadecimal integer literals.

∗ To make it possible for ExpressibleByUnicodeScalarLiteral to inherit from ExpressibleByUnicode16Literal, we need to restrict our range of valid Unicode16Literals to exclude the range 0xD800 ... 0xDFFF, since those are not valid unicode scalars. Of course, as before, we don’t lose much from making this restriction.

Single quotes in Swift, a historical perspective

In Swift 1.0, we wanted to reserve single quotes for some yet-to-be determined syntactical purpose. However, today, pretty much all of the things that we once thought we might want to use single quotes for have already found homes in other parts of the Swift syntactical space. For example, syntax for multi-line string literals uses triple quotes ("""), and string interpolation syntax uses standard double quote syntax. With the passage of SE-0200, raw-mode string literals settled into the #""# syntax. In current discussions around regex literals, most people seem to prefer slashes (/).

At this point, it is clear that the early syntactic conservatism was unwarranted. We do not foresee another use for this syntax, and given the strong precedent in other languages for characters, it is natural to use it.

Existing double quote initializers for characters

We propose deprecating the double quote literal form for Character and Unicode.Scalar types and slowly migrating them out of Swift.

let c2 = 'f'               // preferred
let c1: Character = "f"   // deprecated

Detailed Design

We will extend the ExpressibleByUnicodeScalarLiteral → ExpressibleByExtendedGraphemeClusterLiteral → ExpressibleByStringLiteral protocol chain to include ExpressibleByUnicode8Literal → ExpressibleByUnicode16Literal.

UInt8 and Int8 have the strictest compile time validation requirements, and so conform only to ExpressibleByUnicode8Literal. UInt16 and Int16 will accept valid 16-bit codepoint literals, and so conform to the more specific ExpressibleByUnicode16Literal protocol. Wider integer types conform to the yet more specific ExpressibleByUnicodeScalarLiteral, which accepts an even wider range of codepoint literals. (All of them, to be exact.)

ExpressibleByUnicode8Literal                // adopted by: UInt8,  Int8
    ↓
ExpressibleByUnicode16Literal               // adopted by: UInt16, Int16 
    ↓
ExpressibleByUnicodeScalarLiteral           // adopted by: UInt32, Int32 
                                            //             UInt64, Int64
    ↓                                       //             UInt,   Int
                                            //             Unicode.Scalar
ExpressibleByExtendedGraphemeClusterLiteral // adopted by: Character
    ↓
ExpressibleByStringLiteral                  // adopted by: String

The default inferred type for all single-quoted literals will be Character, addressing an unrelated, but longstanding pain point in Swift, where Characters had no dedicated literal syntax.

// if we create a new single quoted literal type, we should make it the 
// sole literal type for `Character` and below, and set `Character` to be 
// its default inferred type.
typealias ExtendedGraphemeClusterType = Character
typealias UnicodeScalarType           = Character 
typealias Unicode16Type               = Character  
typealias Unicode8Type                = Character

Despite the naming, we see no reason to cripple codepoint literals by preventing them from being able to express multi-codepoint grapheme clusters. Thus, the following is a valid codepoint literal:

let flag: Character = '🇨🇦'

As such, codepoint literals will be lexed in much the same manner as existing double-quoted literals, except they will be restricted to containing at most a single grapheme cluster. We welcome suggestions for better names for this family of literal types, although no new standard library symbols actually need include the term “codepoint”.

Source compatibility

This proposal could be done in a way that is strictly additive, but we feel it is best to deprecate the existing double quote initializers for characters, and the UInt8.init(ascii:) initializer.

Here is a specific sketch of a deprecation policy:

Continue accepting these in Swift 4 mode with no change.
Introduce the new syntax support into Swift 5.
Swift 5 mode would start producing deprecation warnings (with a fixit to change double quotes to single quotes.)
The Swift 4 to 5 migrator would change the syntax (by virtue of applying the deprecation fixits.)
Swift 6 would not accept the old syntax.

Effect on ABI stability

No effect as this is an additive change. Heroic work could be done to try to prevent the UInt8.init(ascii:) initializer and other to-be-deprecated conformances from being part of the ABI. This seems unnecessary though.

Effect on API resilience

None.

Alternatives considered

None.

Nevin · October 21, 2018, 11:22am

For integers, we just have one ExpressibleByIntegerLiteral protocol. We do not have sub-protocols for each bitwidth. Shouldn’t the same approach work here for Unicode scalars?

johnno1962 · October 21, 2018, 12:37pm

Thanks for this @taylorswift. If only this described what the prototype implemented we’d be headed for review. There are problems with your detailed design IMO. It’s difficult to reuse ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral, even if it seems logical, as they are already spoken for, so it would be a source breaking change and the proposal would no longer be additive. In the end there is also no way to avoid defining new protocols anyway, as you need to distinguish between single and double quoted literals or you get problems like Ints being expressible by double quoted string literals, or single-grapheme double quoted literals having a different default type from multi grapheme strings.

The simplest thing to do is define a new ExpressibleByCodepointLiteral protocol. It also helps as @Nevin points out to avoid the per integer size protocols you’ve suggested - the check being done by the compiler.

cukr · October 21, 2018, 1:47pm

What are 'x' literals meant to represent? Is it a single unicode code point i.e. a single number, or is it more complicated than that?
This thread makes it clear that you should be careful when mixing low-level and high-level strings.
This proposal is muddy and confusing because Character will be expressible by 'x' literals.

If it's going to potentially represent more than one code point, then what is the advantage of it over "x"?