SE-0243: Codepoint and Character Literals

Ben_Cohen · March 4, 2019, 3:45pm

Hi Swift Community,

The review of SE-0243: Codepoint and Character Literals begins now and runs through March 12, 2019.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to me as the review manager via email or direct message on the forums. If you send me email, please put "SE-0243" somewhere in the subject line.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Thank you for contributing to Swift!

Ben Cohen
Review Manager

allevato · March 4, 2019, 4:05pm

Finally! Let's get this merged.

Yes. This feature will be a boon to anyone doing C interop, textual parsing, or working with other low-level data formats. The current UInt8(ascii:) initializer is a pain to type and decreases the readability of code. Furthermore, this lets us finally represent single Characters conveniently without explicitly typing or as-casting them.
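For reference, a minimal sketch of the spelling being described as painful, using only APIs that exist in Swift today. The `key=value` input and the variable names are invented for illustration.

```swift
// Today's spelling: each ASCII byte constant goes through UInt8(ascii:).
let input = Array("key=value".utf8)          // [UInt8]
let equalsSign = UInt8(ascii: "=")           // 61

// Split the byte buffer at the first "=".
if let split = input.firstIndex(of: equalsSign) {
    let key = String(decoding: input[..<split], as: UTF8.self)
    let value = String(decoding: input[(split + 1)...], as: UTF8.self)
    print(key, value)                        // prints "key value"
}

// A bare double-quoted literal defaults to String, so a Character needs a
// type annotation or an `as` cast.
let inferred = "x"                           // String
let character = "x" as Character             // Character
```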

100%. I'm bummed that we can't get the conformances in automatically, but I understand that's a limitation of ABI stability right now and this is still a step in the right direction.

This fits nicely with how character literals are handled in many languages similar to Swift, so users will find it natural, and the compile-time validation fits perfectly.

Read and participated in various threads about the issue on the forum.

Nevin · March 4, 2019, 4:17pm

Do I understand the proposal correctly?

A new syntax is introduced for single-quoted character literals.

No new literal-expressible protocol is added. Instead, the existing unicode-scalar and extended-grapheme-cluster literal protocols are used.

The default type of a single-quoted character literal is Character.

Since String conforms to ExpressibleByExtendedGraphemeClusterLiteral, a single-quoted character literal can be assigned to a String:

let s: String = 'ü'

It is possible to extend the standard-library integer types (UInt8, Int, etc.) with conformance to ExpressibleByUnicodeScalarLiteral. Doing so will allow the assignment of a single-quoted ASCII character literal to an integer:

let n: UInt8 = '*'

This is checked at compile time to ensure the literal is in the ASCII range (0–127). The double-quoted syntax does not work for integers.

The existing double-quoted syntax for unicode-scalar and extended-grapheme-cluster literals is deprecated. In a future version of Swift, we can expect that syntax to be removed, at which point it will no longer work.

Is that correct? Did I miss anything?
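The integer point in the summary above can be approximated in today's Swift. This is only a sketch: `ASCIIByte` is a hypothetical wrapper type (used here instead of extending `UInt8` directly), the literal has to be double-quoted because single quotes are not available, and the ASCII check runs at run time rather than at compile time as the proposal describes.

```swift
// Hypothetical wrapper type; not part of the proposal or the standard library.
struct ASCIIByte: ExpressibleByUnicodeScalarLiteral {
    let value: UInt8

    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        // Checked at run time here; the proposal describes a compile-time check.
        precondition(scalar.isASCII, "literal must be in the ASCII range 0...127")
        value = UInt8(scalar.value)
    }
}

let star: ASCIIByte = "*"    // double quotes: the only spelling available today
print(star.value)            // prints "42"
```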

johnno1962 · March 4, 2019, 4:26pm

Thanks @Nevin, that's a very good summary of the crux of it.

Nevin · March 4, 2019, 4:51pm

Review:

• I think String should not be expressible by a single-quoted literal, only a double-quoted literal.

• It is not clear to me why integers are restricted to being represented by ASCII characters only. Every Unicode scalar has a unique integer value, which is already available in Swift at runtime:

let x: Unicode.Scalar = "∫"
let n: UInt32 = x.value
print(n)    // 8747

It seems straightforward to make this available at compile-time through the literal syntax being proposed.

• • •

In light of the ABI concerns regarding retroactive conformance of integer types to the ExpressibleByUnicodeScalarLiteral protocol, perhaps another possibility should be considered:

We currently have a scenario where double-quoted literals can be interpreted through any of three different ExpressibleBy___Literal protocols: String, UnicodeScalar, and ExtendedGraphemeCluster. Evidently, there is no problem with the same syntax being used for multiple kinds of literals.

We also have four different syntaxes for integer literals: decimal, hexadecimal (0x), binary (0b), and octal (0o). Thus there is also apparently no problem with multiple syntaxes for the same kind of literal.
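Both observations can be checked in current Swift. A small sketch, with variable names chosen only for illustration:

```swift
// One double-quoted spelling, several literal protocols.
let asString: String = "a"              // ExpressibleByStringLiteral
let asCharacter: Character = "a"        // ExpressibleByExtendedGraphemeClusterLiteral
let asScalar: Unicode.Scalar = "a"      // ExpressibleByUnicodeScalarLiteral

// Four spellings, one kind of literal: all of these are the integer 97.
let decimal = 97
let hexadecimal = 0x61
let binary = 0b110_0001
let octal = 0o141
```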

Therefore, perhaps we should make single-quoted characters also be valid *integer* literals. Instead of trying to shoehorn integer types into accepting UnicodeScalar literals, we could simply make the compiler recognize single-quoted characters as another way to spell integer literals.

Then there is no ABI concern, and in fact no new conformance to add at all.

jrose · March 4, 2019, 4:58pm

Small note: please don't use the term "retroactive conformances" for "conformances added in a compatible ABI version". For better or worse, we're already using it for "conformances added in a separate module from either the type or the protocol", and the issues around the two are not the same. Maybe "version-dependent conformances" or "backwards-deployable conformances" (I'm not sure which meaning's being used here).

I am weakly in favor of the proposal except for the deprecation part, because I don't want to assume we'll have a -swift-version 5.1. I also don't see why init(ascii:) makes sense to deprecate, since it can be useful for run-time values too. Finally,

Some have proposed allowing integer array types to be expressible by multi-character ASCII strings such as 'abcd'. We consider this to be out of scope of this proposal, as well as unsupported by precedent in C and related languages.

I agree that this is out of scope, but there is some precedent for it: multicharacter scalar literals as an implementation-defined part of C and C++. I've definitely had use for "sequence of bytes that have a nice ASCII representation" in parsing code and I'd support such an addition in the future.
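As a rough illustration of the "sequence of bytes with a nice ASCII representation" use case, here is how a four-byte code could be packed in today's Swift. `fourCharacterCode` is a hypothetical helper, and the big-endian packing order is an assumption made for the example; C leaves the value of a multicharacter literal implementation-defined.

```swift
// Hypothetical helper: packs exactly four ASCII bytes into a UInt32,
// first character in the most significant byte (an assumed ordering).
func fourCharacterCode(_ text: String) -> UInt32? {
    let bytes = Array(text.utf8)
    guard bytes.count == 4, bytes.allSatisfy({ $0 < 0x80 }) else { return nil }
    return bytes.reduce(UInt32(0)) { ($0 << 8) | UInt32($1) }
}

if let code = fourCharacterCode("abcd") {
    print(String(code, radix: 16))   // prints "61626364"
}
```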

johnno1962 · March 4, 2019, 5:16pm

This isn’t ideal but difficult to avoid in an ABI-stable world:

Greater flexibility would have been my preference, but a long discussion about the subtleties of Unicode normalisation persuaded the thread that it was best to steer clear of representing integers outside the ASCII range (the Option 2 vs. Option 4 debate). We can always relax this constraint later if a killer use case turns up.

This was the first implementation proposed (Codepoint literals), but it became obvious that the proper abstraction for single quotes in the Swift string world is Character literals, which can use the more flexible existing protocols for nearly the same result.

Michael_Ilseman · March 4, 2019, 5:34pm

What is your evaluation of the proposal?

+1 for the most part. As @jrose mentioned, why are we deprecating the init?

Is the problem being addressed significant enough to warrant a change to Swift?

Yes

Does this proposal fit well with the feel and direction of Swift?

Yes

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

This is very similar

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

In-depth study

Nevin · March 4, 2019, 6:21pm

I was trying to say ¿Porque no los dos?

Use single-quoted literals for ExpressibleByUnicodeScalarLiteral with all the same conforming types as today.

Use single-quoted literals for ExpressibleByExtendedGraphemeClusterLiteral (sidenote: ugh, why was this not spelled ExpressibleByCharacterLiteral?) with all the same conforming types as today.

And also, use single-quoted literals for ExpressibleByIntegerLiteral with all the same conforming types as today.

…just because the conformance is there, doesn’t mean we have to accept the syntax.

I have a vague recollection that some types conforming to Collection do not (or at least did not) allow direct subscripting, and this is/was intentional. It might have been some sort of Range, and the reasoning was that people don’t expect the subscript to be an identity operation. It still worked in generic contexts, it just wasn’t allowed to be used concretely.

I don’t know if that’s still the case, but I am fairly confident it was at some point.

So if we wanted to, I expect we could do something similar and make it an error to directly assign a single-quoted literal to a String.

Do we not have a (Character, Character) -> String overload of the + operator? Because that’s what I’d expect this to use.

Nevin · March 4, 2019, 6:30pm

This makes absolutely zero sense to me. A Unicode scalar has exactly one canonical integer value.

No normalization should be involved at all. If the source-file places a single Unicode scalar between two apostrophes, then that is a Unicode scalar literal.

It is a literal representation of a single Unicode scalar. And a single Unicode scalar has exactly one integer value, which can be accessed at runtime through its value property.

There is no ambiguity. If a Unicode scalar literal is used to express an integer value, then the only reasonable interpretation is to use the integer value of the Unicode scalar in the literal.

Alejandro · March 4, 2019, 6:34pm

Off topic and I apologize, but for those curious you can also say “porque no ambos?”

taylorswift · March 4, 2019, 8:29pm

I used to agree with you, but it turns out taking ExpressibleByStringLiteral out of the text literal protocol hierarchy brings a lot of ABI issues with it, and in principle, let s:String = 'a' isn’t too different from let f:Float = 1. Complain all you want about the binary stability, but if you ask me it’s not worth fighting the ABI over this.

Early iterations of the proposal allowed exactly what you’re asking, but developers who use a lot of international text said there would be a lot of problems relating to Unicode encoding once you get past U+007F. (literals that change value when you save the .swift file???) We think their concerns are valid, so we have the ASCII restriction.
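A sketch of the kind of situation behind that concern, assuming an editor or tool rewrites a source file into a different Unicode normalization form. The escapes below stand in for what the literal would contain before and after such a rewrite.

```swift
let precomposed = "\u{E9}"       // "é" as one scalar, U+00E9
let decomposed = "e\u{301}"      // "e" followed by U+0301 COMBINING ACUTE ACCENT

// String comparison uses canonical equivalence, so these compare equal...
print(precomposed == decomposed)                     // prints "true"

// ...but the scalars, and any integer values derived from them, differ.
print(precomposed.unicodeScalars.map { $0.value })   // prints "[233]"
print(decomposed.unicodeScalars.map { $0.value })    // prints "[101, 769]"
```

The decomposed form is still a single Character, but it is no longer a single Unicode scalar, so it has no single integer value.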

We don’t really have to. I don’t care too much about the init(ascii:) initializer and that part can be omitted from the proposal without affecting anything else.

We don’t. It’s okay, I could have sworn we had one too, lol. One draft of the proposal said specifically to add this exact operator to Character × Character, but once you have 'a' + 'a' the next logical step is 'a' * 5, which of course brings us to 'a' * 'a' (???)
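For concreteness, the operator under discussion would look roughly like this. The overload is hypothetical and does not exist in the standard library; it is written out here only to show what that draft would have added.

```swift
// Hypothetical overload; the standard library does not define it.
func + (lhs: Character, rhs: Character) -> String {
    return String([lhs, rhs])
}

let a: Character = "a"
print(a + a)                       // prints "aa"
print(String(a) + String(a))       // today's spelling, also prints "aa"
```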

we’ve been over this

Nevin · March 4, 2019, 9:02pm

That does not actually address the issue.

Saying “we’ve been over this” and linking to a post that simply dismisses the idea out of hand is not conducive to a productive Swift Evolution discussion.

lorentey · March 4, 2019, 9:02pm

I like most parts, except for the proposed integer conformances. Integer literal initialization seems like a relatively niche feature that comes with an unreasonably large conceptual/pedagogical cost.

I can fully appreciate the convenience of being able to write code like this while I'm writing a parser for some binary file format that includes ASCII elements:

let magic: [UInt8] = ['G', 'I', 'F', '8', '9', 'a']

However, in the overwhelmingly vast majority of my time, I'm not writing such a parser, and the notion that it's okay to consider some characters unambiguously identical to their ASCII codes rubs me the wrong way.

By their nature, characters aren't integer values. There is nothing inherently "71ish" about the letter G; the idea that the digit 8 is the same thing as the number 56 is absurd.

We can encode G and 8 to integers (or, rather, a series of bits) by selecting an encoding. There are many encodings to choose from, and not all use the same representations as ASCII; indeed, Swift has some support for running on IBM Z systems that prefer encodings from the EBCDIC family.

The Swift stdlib has so far been careful to always make the selection of an encoding explicit in Swift source code. This has been best practice in API design for multiple decades now, so I always assumed this was a deliberate choice -- any implicit encoding would certainly lead to hard-to-spot mistakes. This proposal is quite flagrantly breaking this practice. Limiting the feature to ASCII is a good compromise, but it does not fully eliminate the problem.
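A short sketch of what "explicit encoding" looks like with existing standard library APIs. The values are chosen only for illustration.

```swift
let digit: Character = "8"
let numericValue = digit.wholeNumberValue   // Optional(8): the number the digit denotes
let asciiCode = digit.asciiValue            // Optional(56): its code in ASCII

// The same text yields different code units under different encodings,
// so the encoding is named at the call site.
let text = "\u{E9}"                         // "é"
print(Array(text.utf8))                     // prints "[195, 169]"
print(Array(text.utf16))                    // prints "[233]"
print(text.unicodeScalars.map { $0.value }) // prints "[233]"
```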

In view of this, I would prefer if the passage about eventually adding ExpressibleByUnicodeScalarLiteral conformance to Int8/.../Int was dropped from the proposal text. Instead, users should keep explicitly opting into the integer initialization feature, even if versioned conformances become available. The required one-liner conformance declaration honestly doesn't seem like an unreasonable burden for authors of parser packages.

The UInt8(ascii:) initializer should not be deprecated; rather, it should be added to all integer types. (The objections raised in the Alternatives Considered section seem quite weak to me.)

Yes, despite the objection above.

Single-quoted Unicode.Scalar and Character literals are obviously desirable.

As detailed above, I consider integer initialization to be a bad idea in general. Limiting it to ASCII is a workable compromise; indefinitely leaving it opt-in would be even better.

I don't know any language that properly supports Unicode at the level Swift aims for. APIs that implicitly assume a string encoding are generally considered subpar in the libraries I worked with -- unfortunately, character initialization is often a legacy language feature that has to be kept unchanged.

I was involved in late-stage pitch discussions, and thought long and hard about the issues.

taylorswift · March 4, 2019, 9:07pm

I think there is enough reason to give ASCII preferred treatment over all other encodings, just from the fact that so many major binary formats explicitly specify it. For example, the PNG standard:

3.2. Chunk layout

Each chunk consists of four parts:

Length

…

Chunk Type

A 4-byte chunk type code. For convenience in description and in examining PNG files, type codes are restricted to consist of uppercase and lowercase ASCII letters (A-Z and a-z, or 65-90 and 97-122 decimal). However, encoders and decoders must treat the codes as fixed binary values, not character strings. For example, it would not be correct to represent the type code IDAT by the EBCDIC equivalents of those letters. Additional naming conventions for chunk types are discussed in the next section.

Non-ASCII 7-bit encodings are so rare compared to non-UTF-8 encodings wider than 7 bits that I really don’t think it’s valid to use the same argument for both. We shouldn’t make the common case (ASCII) difficult to use just to accommodate a very uncommon case (EBCDIC).
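To make the PNG case concrete, here is a sketch of matching a chunk type as fixed binary values in today's Swift. `isChunkType` is a hypothetical helper written for this example, not taken from any PNG library.

```swift
// Hypothetical helper for illustration; not taken from any PNG library.
func isChunkType(_ code: [UInt8], _ name: String) -> Bool {
    // Compare raw bytes against the UTF-8 (here: ASCII) bytes of `name`.
    return code.count == 4 && code.elementsEqual(name.utf8)
}

let header: [UInt8] = [0x49, 0x44, 0x41, 0x54]   // bytes as read from a file
print(isChunkType(header, "IDAT"))               // prints "true"
print(isChunkType(header, "IEND"))               // prints "false"
```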

Nevin · March 4, 2019, 9:07pm

Please don’t try to slippery-slope this.

We have a + operator that concatenates arrays, but nobody expects there to be a * operator that repeats an array n times. Some people might want that, but we don’t have it and the slope is not slippery.

Furthermore, with the proposal in its current form, 'a' * 'a' would be valid Swift (after a zero-effort user-supplied conformance, which is only required for ABI reasons), expressing the multiplication of 97 by itself as an integer. So if you think “'a' * 'a'” is objectionable, then it follows that the current proposal is as well.

Nevin · March 4, 2019, 9:14pm

How is that any different from what we can write today:

let x = "θ" as Unicode.Scalar
let n = x.value

The only difference is that today the conversion from character literal to integer must happen at runtime, whereas it could be done at compile-time with exactly the same semantics.

I have seen zero examples of a problem with doing this at compile-time, that is not also a problem with doing it at runtime as we can today.

taylorswift · March 4, 2019, 9:16pm

Your whole premise is that 'a' shouldn’t be a String, but really there isn’t a strong argument why it shouldn’t be, and if we go with the ABI-friendly route, we get 'a' + 'a' = "aa" for free. 'a' * 'a' is unfortunate, but we already decided that for users it’s fine, since the type checker would almost certainly catch that mistake. For standard library authors, 'a' * 'a' is a different question, since if we are vending + (Character, Character) -> String, we would also have to consider * (Character, Int) -> String along with it (Python has it, after all).

Nevin · March 4, 2019, 9:25pm

The reason is clarity at the point of use. If we introduce single-quoted literals as this proposal suggests, then the difference between them and double-quoted literals should be meaningful:

'a' can be a Character or Unicode.Scalar, but not a String nor StaticString

"a" can be a String or StaticString, but not a Character nor Unicode.Scalar

That way the meaning of character and string literals in source code is much clearer.

Seriously, please stop with the slippery-sloping. We are under no obligation to consider any such thing, and even if we did consider it we have no obligation to adopt it.

taylorswift · March 4, 2019, 9:38pm

off topic but @johnno1962 why does let s:String = 'a' work again? String’s ExpressibleByExtendedGraphemeClusterLiteral conformance takes a String argument

String.init(extendedGraphemeClusterLiteral: "a" as Character)
error: repl.swift:3:49: error: cannot convert value of type 'Character' to expected argument type 'String'