Prepitch: Character integer literals


#263

Now this I could get behind. Especially the part where an ASCII string can be used to express an array of integers. I’m not entirely convinced that ASCII is worth promoting with first-class exclusive syntax in Swift—if this is only for legacy compatibility then I lean against.

But if it is worth supporting pure ASCII, then your idea sounds like the way to do it.


(John Holdsworth) #264

Just when I thought the thread was settling… I’m sorry, but I really don’t see how we should be elevating a legacy concept such as ASCII to such a prominent role in Swift. I like Swift’s highly principled abstraction that a Character is a single atomic visual entity regardless of how it is represented, and for me this is what we should be trying to encapsulate with single-quote syntax. Some of these Characters can be represented by a single unicode scalar, some of those fit in an integer storage implied by the expression context, and some of those are ASCII. These are secondary internal distinctions which can be used to gate a new shorthand that allows us to express an integer with a character value, but the requirements of the niche shorthand shouldn’t feed back into the definition of what a character literal is.

I actually argued for your approach myself further up the thread, calling them "code point literals" when they were not limited to ASCII, as it improved diagnostics, but eventually I saw the light and changed my mind. Character with a capital C is the abstraction we want to be capturing with the new single-quoted literal. Honest!


(Michel Fortin) #265

Sorry about that. I'm just going a bit beyond what was discussed.

The thinking is that if limiting character literals to ASCII makes them less error-prone, the same is true of strings when you intend to use them in terms of scalars. If you don't want combining and equivalent characters to get in your way, use an ASCII string.

This could also be a good way to dispel doubts about lookalike characters lurking in sensitive strings. You can't hide a cyrillic "а" in the string "paypal" if the string is limited to ASCII.
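The lookalike check described here can already be approximated at run time in today's Swift; a minimal sketch (the spoofed string is a constructed example):

```swift
// Detecting a non-ASCII lookalike at run time (today's Swift).
let genuine = "paypal"
let spoofed = "p\u{0430}ypal"  // U+0430 CYRILLIC SMALL LETTER A in place of "a"

print(genuine == spoofed)                                // false: not equivalent
print(spoofed.unicodeScalars.allSatisfy { $0.isASCII })  // false: the cyrillic scalar is caught
```

An ASCII-only string syntax would move this check from run time to compile time.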

But I'll admit the idea of "yet another string syntax" looks a bit unappealing.


(Xiaodi Wu) #266

Although it's possible to think of these as "truncated Unicode scalars," the stated motivation for the proposal makes it clear that this is not at all how they are going to be used, or why they are useful. There are no C APIs that semantically work with an array of the least significant 8/16 bits of Unicode scalars; they work with an array of ASCII or Latin-1 characters, or UCS-2-encoded characters.


(Michael Ilseman) #267

This seems like a pragmatic approach that can be extended in the future. I would like to get this pitch ready for SE.

This has been a very long thread with many changes. The most recent draft is out of date with this direction and also has a lot of irrelevant prose. Should we start from a clean slate?


(John Holdsworth) #268

Sorry @michelf, you threw such a curveball in there that I totally missed what you were suggesting, which is a whole ‘nother proposal. Perhaps that might be a valid alternative use of single quotes, but Character is at least precedented. Another idea for the security feature you propose could be a new escape character, e.g. \a, which would force ASCII only, as in the string "\ahttp://paypal.com".

Seems like we have pretty much settled on “Option 4 for now” then. There is one other detail about the migration to single-quoted literals that might be worth discussing here or in the review: if Character literals are to become the preferred spelling for expressing Unicode.Scalar and Character (and sometimes Int*) in Swift, should the existing conformances of String eventually be deprecated (with a fixit, which would be mildly source-breaking), and over what timescale?


(Michel Fortin) #269

I also proposed it in the hope of offering something even cleaner than option 4. See the nice table it produces below. The rest just falls out as a consequence of the design, but I don't think this was communicated very well. I'll be fine with option 4 too.

Option 5: ASCII with single-quotes & Unicode with double-quotes

Single quoted (ASCII)

        UInt8   UInt16  UInt32  Unicode.Scalar  Character  String  Notes
'x'     120     120     120     U+0078          x          x       ASCII scalar
'©'     error   error   error   error           error      error   Latin-1 scalar
'é'     error   error   error   error           error      error   Latin-1 scalar which expands under NFD
'花'    error   error   error   error           error      error   BMP scalar
';'     error   error   error   error           error      error   BMP scalar which changes under NFx
'שּׂ'     error   error   error   error           error      error   BMP scalar which expands under NFx
'𓀎'    error   error   error   error           error      error   Supplemental plane scalar
'ē̱'     error   error   error   error           error      error   Character with no single-scalar representation
'ab'    error   error   error   error           error      ab      Multiple characters

Double quoted (same as now, no duplication or deprecation)

        UInt8   UInt16  UInt32  Unicode.Scalar   Character  String  Notes
"x"     error   error   error   U+0078           x          x       ASCII scalar
"©"     error   error   error   U+00A9*          ©          ©       Latin-1 scalar
"é"     error   error   error   U+00E9/error*    é          é       Latin-1 scalar which expands under NFD
"花"    error   error   error   U+82B1*          花         花      BMP scalar
";"     error   error   error   U+037E/U+003B*†  ;          ;       BMP scalar which changes under NFx
"שּׂ"     error   error   error   U+FB2D/error*    שּׂ          שּׂ       BMP scalar which expands under NFx
"𓀎"    error   error   error   U+1300E*         𓀎         𓀎      Supplemental plane scalar
"ē̱"     error   error   error   error            ē̱          ē̱       Character with no single-scalar representation
"ab"    error   error   error   error            error      ab      Multiple characters

(^) #270

i don’t see anything wrong with option 5. in fact, if you go back to the upper part of the thread, it’s a lot closer to the original pitch than the current proposal. someone suggested a very similar idea pretty early in the process, except they were trying to cast multichar ascii literals to integer slugs instead of String.

the problem is i don’t think it’s as politically attractive as the current proposal, which is a long-negotiated compromise between a lot of different groups of people with different goals for single quoted literals. (again, read the thread, the whole thread.)

there’s a bunch of people who didn’t want to use up the single-quote syntax on “something as niche as ascii strings”. these people were only placated by appealing to the C (and basically every other language) precedent where single quotes are for “single objects” and double quotes are for “vector objects”. so i don’t think let s:String = 'ab' is going to fly.

there’s a bunch of people (especially in the core team) who wanted to extend the single quote syntax to cover Unicode.Scalar and Character, so that these (important!) types finally get a dedicated literal syntax instead of having to write as Character everywhere. they did not want to see single quotes limited to just the U+0 to U+128 single codepoint range. so i don’t think having let c:Character = '👩🏼‍💻' error out is going to fly.

this thread has 3631637642 posts because influential people wanted these features in the proposal. stripping them out realistically is just gonna put us back where we started and all these things are just gonna get rehashed anew.


(Jeremy David Giesbrecht) #271

Is your first paragraph a separate thought from the rest?

¶1 says you like Option 4.

¶2–5 say you have reservations about “it”. But none of those reservations apply to Option 4; they all apply to Michel’s alternative idea.


(Michael Ilseman) #272

What you're describing is not Option 4. Here is Option 4. Unicode.Scalar and Character can be created with (appropriate) single-quoted literals.


(^) #273

oh sorry, that’s a typo: I was referring to michel fortin’s table (option 5… so many options…). the actual option 4 looks sensible to me.


(^) #274

*option 5, but again, i like option 5— it’s simple and completely avoids the deprecation/functional duplication issues with option 4, where we would have to figure out how to phase out double quotes without disturbing ABI. i’m just saying i don’t think it has a realistic chance of passing.


(^) #275

Since option 4 seems to be the most popular, I’ve rewritten the proposal document based on it, which can be viewed here

SE-240

Integer-convertible character literals

Introduction

Swift’s String type is designed for Unicode correctness and abstracts away the underlying binary representation of the string to model it as a Collection of grapheme clusters. This is an appropriate string model for human-readable text, as to a human reader, the atomic unit of a string is (usually) the extended grapheme cluster. When treated this way, many logical string operations “just work” the way users expect.

However, it is also common in programming to need to express values which are intrinsically numeric, but which have textual meaning when taken as an ASCII value. We propose adding a new literal syntax that uses single quotes (') and is transparently convertible to Swift’s integer types. This syntax, but not this behavior, will extend to all “scalar” text literals, up to and including Character, and will become the preferred literal syntax for these types.

Motivation

For both correctness and efficiency, [UInt8] (or another integer array type) is usually the most appropriate representation for an ASCII string. (See Stop converting Data to String for a discussion on why String is an inappropriate representation.)

A major pain point of integer arrays is that they lack a clear and readable literal syntax. In C, 'a' is a char literal, equivalent to 97. Swift has no such equivalent, requiring awkward spellings like UInt8(ascii: "a"). Alternatives, like spelling out the values in hex or decimal directly, are even worse. This harms the readability of code, and is one of the sore points of bytestring processing in Swift.

// C
static char const hexcodes[16] = {
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
    'a', 'b', 'c', 'd', 'e', 'f'
};

// Swift
let hexcodes = [
    UInt8(ascii: "0"), UInt8(ascii: "1"), UInt8(ascii: "2"), UInt8(ascii: "3"),
    UInt8(ascii: "4"), UInt8(ascii: "5"), UInt8(ascii: "6"), UInt8(ascii: "7"),
    UInt8(ascii: "8"), UInt8(ascii: "9"), UInt8(ascii: "a"), UInt8(ascii: "b"),
    UInt8(ascii: "c"), UInt8(ascii: "d"), UInt8(ascii: "e"), UInt8(ascii: "f")
]

Sheer verbosity can be reduced by applying “clever” higher-level constructs such as

let hexcodes = [
    "0", "1", "2", "3",
    "4", "5", "6", "7",
    "8", "9", "a", "b",
    "c", "d", "e", "f"
].map { UInt8(ascii: $0) }

or even

let hexcodes = Array(UInt8(ascii: "0") ... UInt8(ascii: "9")) + 
               Array(UInt8(ascii: "a") ... UInt8(ascii: "f"))

though this comes at the expense of an even higher noise-to-signal ratio, as we are forced to reference concepts such as function mapping, concatenation, range construction, Array materialization, and run-time type conversion, when all we wanted to express was a fixed set of hardcoded values.

In addition, the init(ascii:) initializer only exists on UInt8. If you're working with other types like Int8 (common when dealing with C APIs that take char), it is much more awkward. Consider scanning through a char* buffer as an UnsafeBufferPointer<Int8>:

for scalar in int8buffer {
    switch scalar {
    case Int8(UInt8(ascii: "a")) ... Int8(UInt8(ascii: "f")):
        // lowercase hex letter
    case Int8(UInt8(ascii: "A")) ... Int8(UInt8(ascii: "F")):
        // uppercase hex letter
    case Int8(UInt8(ascii: "0")) ... Int8(UInt8(ascii: "9")):
        // hex digit
    default:
        // something else
    }
}

Aside from being ugly and verbose, transforming Unicode.Scalar literals also sacrifices compile-time guarantees. The statement let char: UInt8 = 1989 is a compile-time error, whereas let char: UInt8 = .init(ascii: "߅") is a run-time error.
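For illustration, here is what the run-time-only nature of the check looks like in current Swift (a sketch; UInt8(ascii:) traps on non-ASCII input, so the scalar is validated by hand first):

```swift
// Current Swift: the ASCII-ness of a scalar can only be verified at run time.
let scalar: Unicode.Scalar = "߅"  // U+07C5 NKO DIGIT FIVE, scalar value 1989

if scalar.isASCII {
    print(UInt8(ascii: scalar))   // safe: the value fits in 0 ..< 128
} else {
    // UInt8(ascii: scalar) would trap here
    print("not ASCII: scalar value \(scalar.value)")
}
```

Under the proposed syntax, the equivalent mistake would be rejected by the compiler instead.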

ASCII scalars are inherently textual, so it should be possible to express them with a textual literal without requiring layers upon layers of transformations. Just as applying the String APIs runs counter to Swift’s stated design goals of safety and efficiency, forcing users to express basic data values in such a convoluted and unreadable way runs counter to our design goal of expressiveness.

Integer character literals would provide benefits to String users. One of the future directions for String is to provide performance-sensitive or low-level users with direct access to code units. Having numeric character literals for use with this API is hugely motivating. Furthermore, improving Swift’s bytestring ergonomics is an important part of our long term goal of expanding into embedded platforms.

Proposed solution

Let's do the obvious thing here, and conform Swift’s integer types to ExpressibleByUnicodeScalarLiteral. These conversions will only be valid for the ASCII range U+0 ..< U+128; unicode scalar literals outside of that range will be invalid, and treated similarly to the way we currently diagnose overflowing integer literals. This is a conservative limitation which we believe is warranted, as allowing transparent unicode conversion to integer types carries major encoding pitfalls we want to protect users from.

ExpressibleBy     UnicodeScalarLiteral  ExtendedGraphemeClusterLiteral  StringLiteral
UInt8, … , Int    yes*                  no                              no
Unicode.Scalar    yes                   no                              no
Character         yes (inherited)       yes                             no
String            no*                   no*                             yes
StaticString      no*                   no*                             yes

Cells marked with an asterisk * indicate behavior that is different from the current language behavior.

As we are introducing a separate literal syntax 'a' for “scalar” text objects, and making it the preferred syntax for Unicode.Scalar and Character, it will no longer be possible to initialize Strings or StaticStrings from unicode scalar literals or character literals. To users, this will have no discernible impact, as double-quoted literals will simply be inferred as string literals.

This proposal will have no impact on custom ExpressibleBy conformances; however, the integer types UInt8 through Int will now be available as source types provided by the ExpressibleByUnicodeScalarLiteral.init(unicodeScalarLiteral:) initializer. For these specializations, the initializer will be responsible for enforcing the compile-time ASCII range check on the unicode scalar literal.

init()            unicodeScalarLiteral  extendedGraphemeClusterLiteral  stringLiteral
:UInt8, … , :Int  yes*                  no                              no
:Unicode.Scalar   yes                   no                              no
:Character        yes (upcast)          yes                             no
:String           yes (upcast)          yes (upcast)                    yes (upcast)
:StaticString     yes (upcast)          yes (upcast)                    yes
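As context for the custom-conformance point above, here is how an ExpressibleByUnicodeScalarLiteral adoption looks in today's Swift (a sketch; ScalarWrapper is a made-up type, and under this proposal an integer type could also appear as the literal argument type):

```swift
// A custom type expressible by a unicode scalar literal (current Swift).
struct ScalarWrapper: ExpressibleByUnicodeScalarLiteral {
    let value: UInt32
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        self.value = scalar.value
    }
}

let w: ScalarWrapper = "x"
print(w.value)  // 120
```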

The ASCII range restriction will only apply to single-quote literals coerced to an integer type. Any valid Unicode.Scalar can be written as a single-quoted unicode scalar literal, and any valid Character can be written as a single-quoted character literal.

                 'a'     'é'     'β'     '𓀎'      '👩‍✈️'    "ab"
:String          error   error   error   error    error   "ab"
:Character       'a'     'é'     'β'     '𓀎'      '👩‍✈️'    error
:Unicode.Scalar  U+0061  U+00E9  U+03B2  U+1300E  error   error
:UInt32          97      error   error   error    error   error
:UInt16          97      error   error   error    error   error
:UInt8           97      error   error   error    error   error
:Int8            97      error   error   error    error   error

With these changes, the hex code example can be written much more naturally:

let hexcodes: [UInt8] = [
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 
    'a', 'b', 'c', 'd', 'e', 'f'
]

for scalar in int8buffer {
    switch scalar {
    case 'a' ... 'f':
        // lowercase hex letter
    case 'A' ... 'F':
        // uppercase hex letter
    case '0' ... '9':
        // hex digit
    default:
        // something else
    }
}

Choice of single quotes

We propose to adopt the 'x' syntax for all textual literal types up to and including ExtendedGraphemeClusterLiteral, but not including StringLiteral. These literals will be used to express integer types, Character, Unicode.Scalar, and types like UTF16.CodeUnit in the standard library.

The default inferred literal type for let x = 'a' will be Character, following the principle of least surprise. This also allows for a natural user-side syntax for differentiating methods overloaded on both Character and String.

Single-quoted literals will be inferred to be integer types in cases where a Character or Unicode.Scalar overload does not exist, but an integer overload does. This can lead to strange spellings such as '1' + '1' == 98. However, we foresee problems arising from this to be quite rare, as the type system will almost always catch such mistakes, and very few users are likely to express a String with two literals instead of the much more obvious "11".

Use of single quotes for character/scalar literals is heavily precedented in other languages, including C, Objective-C, C++, Java, and Rust, although different languages have slightly differing ideas about what a “character” is. We choose the single-quote syntax specifically because it reinforces the notion that strings and character values are different: the former is a sequence, the latter is a scalar (and "integer-like"). Character types also don't support string literal interpolation, which is another reason to move away from double quotes.

Single quotes in Swift, a historical perspective

In Swift 1.0, we wanted to reserve single quotes for some yet-to-be-determined syntactical purpose. However, today, pretty much all of the things that we once thought we might want to use single quotes for have already found homes in other parts of the Swift syntactical space. For example, the syntax for multi-line string literals uses triple quotes ("""), and string interpolation syntax uses standard double-quote syntax. With the passage of SE-0200, raw-mode string literals settled into the #""# syntax. In current discussions around regex literals, most people seem to prefer slashes (/).

At this point, it is clear that the early syntactic conservatism was unwarranted. We do not foresee another use for this syntax, and given the strong precedent in other languages for characters, it is natural to use it.

Existing double quote initializers for characters

We propose deprecating the double quote literal form for Character and Unicode.Scalar types and slowly migrating them out of Swift.

let c2 = 'f'              // preferred
let c1: Character = "f"   // deprecated

Detailed Design

The only standard library change will be to add {UInt8, Int8, ..., Int} to the list of allowed Self.UnicodeScalarLiteralType types. (This entails conforming the integer types to _ExpressibleByBuiltinUnicodeScalarLiteral.) The ASCII range checking will be performed at compile-time in the typechecker, in essentially the same way that overflow checking for ExpressibleByIntegerLiteral.IntegerLiteralType types works today.

protocol ExpressibleByUnicodeScalarLiteral {
    associatedtype UnicodeScalarLiteralType: 
        {StaticString, ..., Unicode.Scalar} + {UInt8, Int8, ..., Int}
    
    init(unicodeScalarLiteral: UnicodeScalarLiteralType)
}

The default inferred type for all single-quoted literals will be Character, addressing a longstanding pain point in Swift, where Characters had no dedicated literal syntax.

typealias UnicodeScalarLiteralType           = Character
typealias ExtendedGraphemeClusterLiteralType = Character 

This will have no source-level impact, as all double-quoted literals get their default inferred type from the StringLiteralType typealias, which currently overshadows ExtendedGraphemeClusterLiteralType and UnicodeScalarLiteralType. The UnicodeScalarLiteralType typealias will remain unused, but the ExtendedGraphemeClusterLiteralType typealias will now be used to infer a default type for single-quoted literals.

Source compatibility

This proposal could be done in a way that is strictly additive, but we feel it is best to deprecate the existing double quote initializers for characters, and the UInt8.init(ascii:) initializer.

Here is a specific sketch of a deprecation policy:

  • Continue accepting these in Swift 5 mode with no change.

  • Introduce the new syntax support into Swift 5.1.

  • Swift 5.1 mode would start producing deprecation warnings (with a fixit to change double quotes to single quotes.)

  • The Swift 5 to 5.1 migrator would change the syntax (by virtue of applying the deprecation fixits.)

  • Swift 6 would not accept the old syntax.

During the transition period, "a" will remain a valid unicode scalar literal, so it will be possible to initialize integer types with double-quoted ASCII literals.

let ascii: Int8 = "a"  // produces a deprecation warning

However, as this will only be possible in new code, and will produce a deprecation warning from the outset, this should not be a problem.

Effect on ABI stability

All changes except deprecating the UInt8.init(ascii:) initializer are either additive, or limited to the type checker, parser, or lexer. Removing String and StaticString’s ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral conformances would otherwise be ABI-breaking, but this can be implemented entirely in the type checker, since source literals are a compile-time construct.

Removing UInt8.init(ascii:) would break ABI, but this is not necessary to implement the proposal; it’s merely housekeeping.

Effect on API resilience

None.

Alternatives considered

Integer initializers

Some have proposed extending the UInt8(ascii:) initializer to other integer types (Int8, UInt16, … , Int). However, this forgoes compile-time validity checking, and entails a substantial increase in API surface area for questionable gain.
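For comparison, this rejected alternative is easy to sketch as a library extension (hypothetical code, not part of the proposal); note that the validity check can only happen at run time, which is the main objection:

```swift
// Hypothetical: replicating UInt8(ascii:) across all fixed-width integer types.
extension FixedWidthInteger {
    init(ascii scalar: Unicode.Scalar) {
        // Run-time check only; the proposed literal syntax would instead
        // reject non-ASCII scalars at compile time.
        precondition(scalar.isASCII, "\(scalar) is not an ASCII scalar")
        self.init(scalar.value)
    }
}

let f = Int8(ascii: "f")   // 102
let a = Int16(ascii: "A")  // 65
```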

Lifting the ASCII range restriction

Some have proposed allowing any unicode scalar literal whose codepoint index does not overflow the target integer type to be convertible to that integer type. Consensus was that this is an easy source of unicode encoding bugs, and provides little utility to the user. If people change their minds in the future, this restriction can always be lifted in a source and ABI compatible way.

Single-quoted ASCII strings

Some have proposed allowing integer array types to be expressible by multi-character ASCII strings such as 'abcd'. We consider this to be out of scope of this proposal, as well as unsupported by precedent in C and related languages.


(John Holdsworth) #276

One thing I took from option 5 is initialising an array of Ints from the characters in a string. Perhaps this could be added to the standard library as an annex to the proposal:

extension Array where Element: FixedWidthInteger {
  public init(_ characters: String) {
    // Note: traps at run time if a scalar's value overflows Element
    self = characters.unicodeScalars.map { Element($0.value) }
  }
}

let hexcodes2 = [Int8]("0123456789abcdef")

You could also use ExpressibleByStringLiteral


(Michel Fortin) #277

I think it'd be better to enforce ASCII-ness and spell it like this:

let hexcodes = [UInt8](ascii: "0123456789abcdef")
let nonascii = [UInt8](ascii: "éé") // error: non-ASCII character
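A stand-in for this initializer can be written today (a sketch; [UInt8](ascii:) is not in the standard library, and without compiler support the ASCII check can only happen at run time, so this version is failable rather than producing the compile-time error shown above):

```swift
extension Array where Element: FixedWidthInteger {
    /// Builds an integer array from an ASCII string, or returns nil
    /// if any scalar falls outside the ASCII range.
    init?(ascii string: String) {
        guard string.unicodeScalars.allSatisfy({ $0.isASCII }) else { return nil }
        self = string.unicodeScalars.map { Element($0.value) }
    }
}

let hexcodes = [UInt8](ascii: "0123456789abcdef")  // non-nil, 16 elements
let nonascii = [UInt8](ascii: "éé")                // nil
```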

I'm not too sure it has much of a link to this proposal however. Unlike option 5, there is no common syntax linking the character literal and this initializer, so it's pretty much standalone to me (which gives it a better chance of success too).

Edit: forgot that UInt8.init(ascii:) is meant to be deprecated by this proposal, making this suggestion a bit out of place.


(Xiaodi Wu) #278

I like this direction much better. I would just add that, given the new direction, it seems sensible and would improve clarity to add an additional ‘ExpressibleBy’ protocol as follows:

UInt8 /* etc. */ :
  ExpressibleByASCIILiteral

Unicode.Scalar :
  ExpressibleByASCIILiteral,
  ExpressibleByUnicodeScalarLiteral

Character :
  ExpressibleByASCIILiteral,
  ExpressibleByUnicodeScalarLiteral,
  ExpressibleByExtendedGraphemeClusterLiteral

This would clarify, for example, why no Int8 value is expressible by 'é' while a Unicode.Scalar value is.


(^) #279

There is no ExpressibleByASCIILiteral for the same reason there is no ExpressibleByUInt32Literal or ExpressibleByUInt16Literal. You get the ASCII restriction by specifying the type of the argument in Self.init(unicodeScalarLiteral:T) to be one of the integer types among the many choices of Self.UnicodeScalarLiteralType. I agree this is a very confusing system, but it’s how Swift’s literal protocols currently work, and changing that would be a much bigger change than this proposal aims to be. If someone is making custom conformances to the literal protocols, we can assume they are pretty well versed in the intricacies of the type system, so I don’t think this would be a problem.


(Xiaodi Wu) #280

I know how Swift’s literal protocols currently work; I’m explicitly suggesting that they would be made less confusing, and the ASCII-restricted behavior of a numeric type expressible by a character literal more obvious, by changing that for character literals. I think this proposal should aim to make that bigger change.

If Swift’s character literal protocols were not already designed this way, it is implausible that one would ask for two different protocols for three different behaviors. For backwards compatibility reasons, we can’t coalesce them into one protocol even though integer literals work that way in Swift. The only sensible design left is to have distinct protocols, not pretending that UInt8 is actually expressible by a Unicode scalar but not expressible by an extended grapheme cluster in the same way that Unicode.Scalar is, which in this design direction it really isn’t.


(Jeremy David Giesbrecht) #281

I philosophically agree with you, but ABI stability has been a big concern so it may not be practical anymore.

The wording of the compiler error message is an alternative way to make the same thing clear. “'é' is not ASCII” should already be enough to quell any questions about why it didn’t work when "e" did. If they move on to the question of “Why shouldn’t it work?”, then ExpressibleByASCIILiteral wouldn’t really answer that for them either.

The idea of an ExpressibleByASCIILiteral could still be mentioned in the proposal. I do like it if it is ABI‐viable. It would be a small enough and likely uncontested difference that the core team can always “accept with revisions” to add or subtract it from the implementation based on their better understanding of its ABI impact. Ask for it without demanding and invite them to decide for themselves.


(^) #282

the impact would be that ExpressibleByUnicodeScalarLiteral (and indirectly ExpressibleByExtendedGraphemeClusterLiteral) would now inherit from ExpressibleByASCIILiteral instead of being a base protocol. These requirements could be satisfied by default implementations, but i don’t know the ABI impact of that.