Prepitch: Character integer literals

i don’t see anything wrong with option 4. in fact if you go back to the upper part of the thread it’s a lot closer to the original pitch than the current proposal. in fact someone suggested a very similar idea pretty early in the process except they were trying to cast multichar ascii literals to integer slugs instead of String.

the problem is i don’t think it’s as politically attractive as the current proposal, which is a long-negotiated compromise between a lot of different groups of people with different goals for single quoted literals. (again, read the thread, the whole thread.)

there’s a bunch of people who didn’t want to use up the single-quote syntax on “something as niche as ascii strings”. these people were only placated by appealing to the C (and basically every other language) precedent where single quotes are for “single objects” and double quotes are for “vector objects”. so i don’t think let s:String = 'ab' is going to fly.

there’s a bunch of people (especially in the core team) who wanted to extend the single quote syntax to cover Unicode.Scalar and Character, so that these (important!) types finally get a dedicated literal syntax instead of having to write as Character everywhere. they did not want to see single quotes limited to just the U+0 ..< U+128 single codepoint range. so i don’t think having let c:Character = '👩🏼‍💻' error out is going to fly.

this thread has 3631637642 posts because influential people wanted these features in the proposal. stripping them out realistically is just gonna put us back where we started and all these things are just gonna get rehashed anew.

Is your first paragraph a separate thought from the rest?

¶1 says you like Option 4.

¶2–5 say you have reservations about “it”. But none of those reservations apply to Option 4; they all apply to Michel’s alternative idea.

1 Like

What you're describing is not Option 4. Here is Option 4: Unicode.Scalar and Character can be created with (appropriate) single-quoted literals.

1 Like

oh sorry that’s a typo I was referring to michel fortin’s table (option 5, so many options…). the actual option 4 looks sensible to me.

1 Like

*option 5, but again, i like option 5— it’s simple and completely avoids the deprecation/functional duplication issues with option 4, where we would have to figure out how to phase out double quotes without disturbing ABI. i’m just saying i don’t think it has a realistic chance of passing.

1 Like

Since option 4 seems to be the most popular, I’ve rewritten the proposal document based on it, which can be viewed here

SE-240

Integer-convertible character literals

Introduction

Swift’s String type is designed for Unicode correctness and abstracts away the underlying binary representation of the string to model it as a Collection of grapheme clusters. This is an appropriate string model for human-readable text, as to a human reader, the atomic unit of a string is (usually) the extended grapheme cluster. When treated this way, many logical string operations “just work” the way users expect.

However, it is also common in programming to need to express values which are intrinsically numeric, but have textual meaning when taken as an ASCII value. We propose adding a new literal syntax that uses single quotes (') and is transparently convertible to Swift’s integer types. This syntax, but not the behavior, will extend to all “scalar” text literals, up to and including Character, and will become the preferred literal syntax for these types.

Motivation

For both correctness and efficiency, [UInt8] (or another integer array type) is usually the most appropriate representation for an ASCII string. (See Stop converting Data to String for a discussion on why String is an inappropriate representation.)

A major pain point of integer arrays is that they lack a clear and readable literal type. In C, 'a' is an integer character constant, equal to 97. Swift has no such equivalent, requiring awkward spellings like UInt8(ascii: "a"). Alternatives, like spelling out the values in hex or decimal directly, are even worse. This harms the readability of code, and is one of the sore points of bytestring processing in Swift.

static char const hexcodes[16] = {
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
    'a', 'b', 'c', 'd', 'e', 'f'
};
let hexcodes = [
    UInt8(ascii: "0"), UInt8(ascii: "1"), UInt8(ascii: "2"), UInt8(ascii: "3"),
    UInt8(ascii: "4"), UInt8(ascii: "5"), UInt8(ascii: "6"), UInt8(ascii: "7"),
    UInt8(ascii: "8"), UInt8(ascii: "9"), UInt8(ascii: "a"), UInt8(ascii: "b"),
    UInt8(ascii: "c"), UInt8(ascii: "d"), UInt8(ascii: "e"), UInt8(ascii: "f")
]    

Sheer verbosity can be reduced by applying “clever” higher-level constructs such as

let hexcodes = [
    "0", "1", "2", "3",
    "4", "5", "6", "7",
    "8", "9", "a", "b",
    "c", "d", "e", "f"
].map{ UInt8(ascii: $0) }

or even

let hexcodes = Array(UInt8(ascii: "0") ... UInt8(ascii: "9")) + 
               Array(UInt8(ascii: "a") ... UInt8(ascii: "f"))

though this comes at the expense of an even higher noise-to-signal ratio, as we are forced to reference concepts such as function mapping, concatenation, range construction, Array materialization, and run-time type conversion, when all we wanted to express was a fixed set of hardcoded values.

In addition, the init(ascii:) initializer only exists on UInt8. If you're working with other types like Int8 (common when dealing with C APIs that take char), it is much more awkward. Consider scanning through a char* buffer as an UnsafeBufferPointer<Int8>:

for scalar in int8buffer {
    switch scalar {
    case Int8(UInt8(ascii: "a")) ... Int8(UInt8(ascii: "f")):
        break // lowercase hex letter
    case Int8(UInt8(ascii: "A")) ... Int8(UInt8(ascii: "F")):
        break // uppercase hex letter
    case Int8(UInt8(ascii: "0")) ... Int8(UInt8(ascii: "9")):
        break // hex digit
    default:
        break // something else
    }
}

Aside from being ugly and verbose, transforming Unicode.Scalar literals also sacrifices compile-time guarantees. The statement let char: UInt8 = 1989 is a compile-time error, whereas let char: UInt8 = .init(ascii: "߅") is a run-time error.
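To make the contrast concrete (the scalar "߅", U+07C5, has value 1989):

let a: UInt8 = 1989         // compile-time error: 1989 overflows UInt8
let b = UInt8(ascii: "߅")   // compiles, but traps at run time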

ASCII scalars are inherently textual, so it should be possible to express them with a textual literal without requiring layers upon layers of transformations. Just as misusing String for binary data runs counter to Swift’s stated design goals of safety and efficiency, forcing users to express basic data values in such a convoluted and unreadable way runs counter to our design goal of expressiveness.

Integer character literals would provide benefits to String users. One of the future directions for String is to provide performance-sensitive or low-level users with direct access to code units. Having numeric character literals for use with this API is hugely motivating. Furthermore, improving Swift’s bytestring ergonomics is an important part of our long term goal of expanding into embedded platforms.

Proposed solution

Let's do the obvious thing here, and conform Swift’s integer types to ExpressibleByUnicodeScalarLiteral. These conversions will only be valid for the ASCII range U+0 ..< U+128; unicode scalar literals outside of that range will be invalid, and diagnosed similarly to the way we currently diagnose overflowing integer literals. This is a conservative limitation which we believe is warranted, as allowing transparent unicode conversion to integer types carries major encoding pitfalls we want to protect users from.
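Under this rule, coercions to integer types would behave like this (the diagnostic wording is illustrative, not final):

let a: UInt8 = 'a'     // 97
let beta: UInt8 = 'β'  // error: 'β' is outside the ASCII range U+0 ..< U+128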

| ExpressibleBy | UnicodeScalarLiteral | ExtendedGraphemeClusterLiteral | StringLiteral |
|---|---|---|---|
| UInt8, … , Int | yes* | no | no |
| Unicode.Scalar | yes | no | no |
| Character | yes (inherited) | yes | no |
| String | no* | no* | yes |
| StaticString | no* | no* | yes |

Cells marked with an asterisk * indicate behavior that is different from the current language behavior.

As we are introducing a separate literal syntax 'a' for “scalar” text objects, and making it the preferred syntax for Unicode.Scalar and Character, it will no longer be possible to initialize Strings or StaticStrings from unicode scalar literals or character literals. To users, this will have no discernible impact, as double quoted literals will simply be inferred as string literals.

This proposal will have no impact on custom ExpressibleBy conformances. However, the integer types UInt8 through Int will now be available as source types for the ExpressibleByUnicodeScalarLiteral.init(unicodeScalarLiteral:) initializer. For these specializations, the compiler will enforce the ASCII range check on the unicode scalar literal at compile time.
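For illustration, a sketch of a custom conformance that opts into an integer source type under this proposal (ASCIIByte is an invented example type, not part of the standard library):

struct ASCIIByte: ExpressibleByUnicodeScalarLiteral {
    // Choosing an integer source type opts into the compile-time ASCII range check.
    typealias UnicodeScalarLiteralType = UInt8
    let value: UInt8
    init(unicodeScalarLiteral value: UInt8) {
        self.value = value
    }
}

let comma: ASCIIByte = ','   // value == 44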

| init() | unicodeScalarLiteral | extendedGraphemeClusterLiteral | stringLiteral |
|---|---|---|---|
| :UInt8, … , :Int | yes* | no | no |
| :Unicode.Scalar | yes | no | no |
| :Character | yes (upcast) | yes | no |
| :String | yes (upcast) | yes (upcast) | yes (upcast) |
| :StaticString | yes (upcast) | yes (upcast) | yes |

The ASCII range restriction will only apply to single-quote literals coerced to an integer type. Any valid Unicode.Scalar can be written as a single-quoted unicode scalar literal, and any valid Character can be written as a single-quoted character literal.

| | 'a' | 'é' | 'β' | '𓀎' | '👩‍✈️' | "ab" |
|---|---|---|---|---|---|---|
| :String | | | | | | "ab" |
| :Character | 'a' | 'é' | 'β' | '𓀎' | '👩‍✈️' | |
| :Unicode.Scalar | U+0061 | U+00E9 | U+03B2 | U+1300E | | |
| :UInt32 | 97 | | | | | |
| :UInt16 | 97 | | | | | |
| :UInt8 | 97 | | | | | |
| :Int8 | 97 | | | | | |

With these changes, the hex code example can be written much more naturally:

let hexcodes: [UInt8] = [
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 
    'a', 'b', 'c', 'd', 'e', 'f'
]

for scalar in int8buffer {
    switch scalar {
    case 'a' ... 'f':
        break // lowercase hex letter
    case 'A' ... 'F':
        break // uppercase hex letter
    case '0' ... '9':
        break // hex digit
    default:
        break // something else
    }
}

Choice of single quotes

We propose to adopt the 'x' syntax for all textual literal types up to and including ExtendedGraphemeClusterLiteral, but not including StringLiteral. These literals will be used to express integer types, Character, Unicode.Scalar, and types like UTF16.CodeUnit in the standard library.
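For example, under the proposed syntax (UTF16.CodeUnit is a typealias for UInt16, so the ASCII range restriction applies to it):

let scalar: Unicode.Scalar = 'é'
let character: Character = '👩‍✈️'
let codeunit: UTF16.CodeUnit = 'a'   // 97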

The default inferred literal type for let x = 'a' will be Character, following the principle of least surprise. This also allows for a natural user-side syntax for differentiating methods overloaded on both Character and String.
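A sketch of that disambiguation (the describe functions are hypothetical):

func describe(_ character: Character) { print("Character: \(character)") }
func describe(_ string: String) { print("String: \(string)") }

describe('a')   // selects the Character overload
describe("a")   // selects the String overload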

Single-quoted literals will be inferred to be integer types in cases where a Character or Unicode.Scalar overload does not exist, but an integer overload does. This can lead to strange spellings such as '1' + '1' == 98. However, we expect problems arising from this to be quite rare, as the type system will almost always catch such mistakes, and very few users are likely to express a String with two literals instead of the much more obvious "11".
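For instance, assuming the fallback inference described above:

let sum: UInt8 = '1' + '1'   // 49 + 49 == 98; Character has no + overload
let text = "1" + "1"         // "11", the ordinary String spelling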

Use of single quotes for character/scalar literals is heavily precedented in other languages, including C, Objective-C, C++, Java, and Rust, although different languages have slightly differing ideas about what a “character” is. We choose the single quote syntax specifically because it reinforces the notion that strings and character values are different: the former is a sequence, the latter is a scalar (and “integer-like”). Character types also don't support string literal interpolation, which is another reason to move away from double quotes.

Single quotes in Swift, a historical perspective

In Swift 1.0, we wanted to reserve single quotes for some yet-to-be determined syntactical purpose. However, today, pretty much all of the things that we once thought we might want to use single quotes for have already found homes in other parts of the Swift syntactical space. For example, syntax for multi-line string literals uses triple quotes ("""), and string interpolation syntax uses standard double quote syntax. With the passage of SE-0200, raw-mode string literals settled into the #""# syntax. In current discussions around regex literals, most people seem to prefer slashes (/).

At this point, it is clear that the early syntactic conservatism was unwarranted. We do not foresee another use for this syntax, and given the strong precedent for character literals in other languages, it is natural to adopt it here.

Existing double quote initializers for characters

We propose deprecating the double quote literal form for Character and Unicode.Scalar types and slowly migrating them out of Swift.

let c1 = 'f'              // preferred
let c2: Character = "f"   // deprecated

Detailed Design

The only standard library change will be to add {UInt8, Int8, ..., Int} to the list of allowed Self.UnicodeScalarLiteralType types. (This entails conforming the integer types to _ExpressibleByBuiltinUnicodeScalarLiteral.) The ASCII range checking will be performed at compile-time in the typechecker, in essentially the same way that overflow checking for ExpressibleByIntegerLiteral.IntegerLiteralType types works today.

protocol ExpressibleByUnicodeScalarLiteral {
    // pseudocode: UnicodeScalarLiteralType may be any one of these types
    associatedtype UnicodeScalarLiteralType:
        {StaticString, ..., Unicode.Scalar} + {UInt8, Int8, ..., Int}

    init(unicodeScalarLiteral: UnicodeScalarLiteralType)
}

The default inferred type for all single-quoted literals will be Character, addressing a longstanding pain point in Swift, where Characters had no dedicated literal syntax.

typealias UnicodeScalarLiteralType           = Character
typealias ExtendedGraphemeClusterLiteralType = Character 

This will have no source-level impact, as all double-quoted literals get their default inferred type from the StringLiteralType typealias, which currently overshadows ExtendedGraphemeClusterLiteralType and UnicodeScalarLiteralType. The UnicodeScalarLiteralType typealias will remain unused for inference, but the ExtendedGraphemeClusterLiteralType typealias will now be used to infer a default type for single-quoted literals.
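A sketch of the resulting default inference:

let c = 'a'   // inferred as Character
let s = "a"   // still inferred as String, as today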

Source compatibility

This proposal could be done in a way that is strictly additive, but we feel it is best to deprecate the existing double quote initializers for characters, and the UInt8.init(ascii:) initializer.

Here is a specific sketch of a deprecation policy:

  • Continue accepting these in Swift 5 mode with no change.

  • Introduce the new syntax support into Swift 5.1.

  • Swift 5.1 mode would start producing deprecation warnings (with a fixit to change double quotes to single quotes.)

  • The Swift 5 to 5.1 migrator would change the syntax (by virtue of applying the deprecation fixits.)

  • Swift 6 would not accept the old syntax.

During the transition period, "a" will remain a valid unicode scalar literal, so it will be possible to initialize integer types with double-quoted ASCII literals.

let ascii:Int8 = "a" // produces a deprecation warning 

However, as this will only be possible in new code, and will produce a deprecation warning from the outset, this should not be a problem.

Effect on ABI stability

All changes except deprecating the UInt8.init(ascii:) initializer are either additive, or limited to the type checker, parser, or lexer. Removing String and StaticString’s ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral conformances would otherwise be ABI-breaking, but this can be implemented entirely in the type checker, since source literals are a compile-time construct.

Removing UInt8.init(ascii:) would break ABI, but this is not necessary to implement the proposal; it’s merely housekeeping.

Effect on API resilience

None.

Alternatives considered

Integer initializers

Some have proposed extending the UInt8(ascii:) initializer to other integer types (Int8, UInt16, … , Int). However, this forgoes compile-time validity checking, and entails a substantial increase in API surface area for questionable gain.

Lifting the ASCII range restriction

Some have proposed allowing any unicode scalar literal whose codepoint value fits in the target integer type to be convertible to that integer type. Consensus was that this is an easy source of unicode encoding bugs, and provides little utility to the user. If people change their minds in the future, this restriction can always be lifted in a source- and ABI-compatible way.

Single-quoted ASCII strings

Some have proposed allowing integer array types to be expressible by multi-character ASCII strings such as 'abcd'. We consider this to be out of scope of this proposal, as well as unsupported by precedent in C and related languages.

4 Likes

One thing I took from option 5 is initialising an array of Ints from the characters in a string. Perhaps this could be added to the standard library as an annex to the proposal:

extension Array where Element: FixedWidthInteger {
  public init(_ characters: String) {
    // Note: Element($0.value) traps at run time if a scalar's value does not fit in Element.
    self = characters.unicodeScalars.map { Element($0.value) }
  }
}

let hexcodes2 = [Int8]("0123456789abcdef")

You can also use ExpressibleByStringLiteral

1 Like

I think it'd be better to enforce ASCII-ness and spell it like this:

let hexcodes = [UInt8](ascii: "0123456789abcdef")
let nonascii = [UInt8](ascii: "éé") // error: non-ascii character
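One possible implementation of such an initializer (a sketch; the precondition-based run-time check is an assumption, since the compile-time literal checking from the proposal wouldn't apply here):

extension Array where Element: FixedWidthInteger {
    init(ascii string: String) {
        self = string.unicodeScalars.map { (scalar: Unicode.Scalar) -> Element in
            // Trap on anything outside the ASCII range.
            precondition(scalar.isASCII, "non-ASCII character in ASCII string")
            return Element(scalar.value)
        }
    }
}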

I'm not too sure it has much of a link to this proposal however. Unlike option 5, there is no common syntax linking the character literal and this initializer, so it's pretty much standalone to me (which gives it a better chance of success too).

Edit: forgot that UInt8.init(ascii:) is meant to be deprecated by this proposal, making this suggestion a bit out of place.

1 Like

I like this direction much better. I would just add that, given the new direction, it seems sensible and would improve clarity to add an additional ‘ExpressibleBy’ protocol as follows:

UInt8 /* etc. */ :
  ExpressibleByASCIILiteral

Unicode.Scalar :
  ExpressibleByASCIILiteral,
  ExpressibleByUnicodeScalarLiteral

Character :
  ExpressibleByASCIILiteral,
  ExpressibleByUnicodeScalarLiteral,
  ExpressibleByExtendedGraphemeClusterLiteral

This would clarify, for example, why no Int8 value is expressible by 'é' while a Unicode.Scalar value is.

6 Likes

There is no ExpressibleByASCIILiteral for the same reason there is no ExpressibleByUInt32Literal or ExpressibleByUInt16Literal. You get the ASCII restriction by specifying the argument type T in Self.init(unicodeScalarLiteral: T) to be one of the integer types among the many choices of Self.UnicodeScalarLiteralType. I agree this is a very confusing system, but it’s how Swift’s literal protocols currently work, and changing that would be a much bigger change than this proposal aims to be. If someone is making custom conformances to the literal protocols, we can assume they are pretty well versed in the intricacies of the type system, so I don’t think this would be a problem.

2 Likes

I know how Swift’s literal protocols currently work; I’m explicitly suggesting that they would be made less confusing, and the ASCII-restricted behavior of a numeric type expressible by a character literal more obvious, by changing that for character literals. I think this proposal should aim to make that bigger change.

If Swift’s character literal protocols were not already designed this way, it is implausible that one would ask for two different protocols for three different behaviors. For backwards compatibility reasons, we can’t coalesce them into one protocol even though integer literals work that way in Swift. The only sensible design left is to have distinct protocols, not pretending that UInt8 is actually expressible by a Unicode scalar but not expressible by an extended grapheme cluster in the same way that Unicode.Scalar is, which in this design direction it really isn’t.

2 Likes

I philosophically agree with you, but ABI stability has been a big concern so it may not be practical anymore.

The wording of the compiler error message is an alternative way to make the same thing clear. “'é' is not ASCII” should already be enough to quell any questions about why it didn’t work when "e" did. If they move on to the question of “Why shouldn’t it work?”, then ExpressibleByASCIILiteral wouldn’t really answer that for them either.

The idea of an ExpressibleByASCIILiteral could still be mentioned in the proposal. I do like it if it is ABI‐viable. It would be a small enough and likely uncontested difference that the core team can always “accept with revisions” to add or subtract it from the implementation based on their better understanding of its ABI impact. Ask for it without demanding and invite them to decide for themselves.

the impact would be that ExpressibleByUnicodeScalarLiteral (and indirectly ExpressibleByExtendedGraphemeClusterLiteral) would now inherit from ExpressibleByASCIILiteral instead of being a base protocol. These requirements could be satisfied by default implementations, but i don’t know the ABI impact of that.
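roughly, the restructured hierarchy would look something like this (requirement names are hypothetical):

protocol ExpressibleByASCIILiteral {
    associatedtype ASCIILiteralType
    init(asciiLiteral: ASCIILiteralType)
}

// no longer a base protocol under this idea
protocol ExpressibleByUnicodeScalarLiteral: ExpressibleByASCIILiteral {
    associatedtype UnicodeScalarLiteralType
    init(unicodeScalarLiteral: UnicodeScalarLiteralType)
}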

These are the constraints that guided the new implementation. Aside from being more complicated, introducing ExpressibleByASCIILiteral at this stage would also break the deal we made with ourselves to leave the door open for Option 2.

can we please just get this to review? this thread has been going in circles for a year now

1 Like

Not really. That would be no more difficult than adding conformances for Int types to ExpressibleByUnicodeScalarLiteral. In fact, that set‐up would allow you to trivially add such a conformance in your own module if you understood what you were doing and wanted to live dangerously. The way we have it now, the conformance is already there and in the way, so you cannot do it yourself. (...Well, though uglier, you could add a conformance to the character version to work around it, so maybe never mind.)

But more people are thinking this than are outgoing enough to say it:

1 Like

it’s not that, it’s the fact that we’re changing ExpressibleByUnicodeScalarLiteral from a base protocol to an inherited one. this means the requirements for ExpressibleByUnicodeScalarLiteral, and every literal protocol that inherits it, are different now. to users, there’s no change, since the new requirements all get default implementations, but again, idk how this affects ABI

I meant the present change would not make John’s desired future addition any more difficult at that future time.

I understand the difference it makes at the present.

On the contrary, I think this thread is just starting to go in some interesting directions. Given that it won’t make it into Swift 5.0, I don’t see any reason to cut off such fruitful discussion.

I have no idea what merit continuing this thread may have or if anyone reads all those messages, but as it is tagged as a pitch and even labeled as "prepitch", I think it's sensible to move forward by creating a new topic.

Also, imho it would be a good idea to start this with an introduction to the fundamental problems of encodings (I guess there are some good resources that could be linked), and their impact on Swift:
It's quite hard to find information about source file encoding for swiftc, and swiftc --help doesn't even mention what encoding it expects.

1 Like