SE-0243: Codepoint and Character Literals

This isn’t ideal, but it’s difficult to avoid in an ABI-stable world:

Greater flexibility would have been my preference, but a long discussion about the subtleties of Unicode normalisation persuaded the thread it was best to steer clear of being able to represent integers outside the ASCII range (the Option 2 vs. Option 4 debate). We can always relax this constraint later if a killer use case turns up.

This was the first implementation proposed (codepoint literals), but it became obvious that the proper abstraction for single quotes in the Swift string world is Character literals, which can use the more flexible existing protocols for nearly the same result.

1 Like
  • What is your evaluation of the proposal?

+1 for the most part. As @jrose mentioned, why are we deprecating the init?

  • Is the problem being addressed significant enough to warrant a change to Swift?

Yes

  • Does this proposal fit well with the feel and direction of Swift?

Yes

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

This is very similar

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

In-depth study

1 Like

I was trying to say ¿Porque no los dos?

Use single-quoted literals for ExpressibleByUnicodeScalarLiteral with all the same conforming types as today.

Use single-quoted literals for ExpressibleByExtendedGraphemeClusterLiteral (sidenote: ugh, why was this not spelled ExpressibleByCharacterLiteral?) with all the same conforming types as today.

And also, use single-quoted literals for ExpressibleByIntegerLiteral with all the same conforming types as today.

…just because the conformance is there, doesn’t mean we have to accept the syntax.
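In other words, a sketch of what accepting the syntax for all three protocol hierarchies would look like (the third line goes beyond what the proposal as reviewed guarantees):

let u: Unicode.Scalar = 'a' // via ExpressibleByUnicodeScalarLiteral
let c: Character = 'a'      // via ExpressibleByExtendedGraphemeClusterLiteral
let n: UInt8 = 'a'          // via ExpressibleByIntegerLiteral, under this suggestion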

I have a vague recollection that some types conforming to Collection do not (or at least did not) allow direct subscripting, and this is/was intentional. It might have been some sort of Range, and the reasoning was that people don’t expect the subscript to be an identity operation. It still worked in generic contexts, it just wasn’t allowed to be used concretely.

I don’t know if that’s still the case, but I am fairly confident it was at some point.

So if we wanted to, I expect we could do something similar and make it an error to directly assign a single-quoted literal to a String.

Do we not have a (Character, Character) -> String overload of the + operator? Because that’s what I’d expect this to use.
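For clarity, the overload being asked about would be something like this minimal sketch (it is not in the standard library today, as the reply below confirms):

func + (lhs: Character, rhs: Character) -> String {
    return String(lhs) + String(rhs)
}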

This makes absolutely zero sense to me. A Unicode scalar has exactly one canonical integer value.

No normalization should be involved at all. If the source-file places a single Unicode scalar between two apostrophes, then that is a Unicode scalar literal.

It is a literal representation of a single Unicode scalar. And a single Unicode scalar has exactly one integer value, which can be accessed at runtime through its value property.

There is no ambiguity. If a Unicode scalar literal is used to express an integer value, then the only reasonable interpretation is to use the integer value of the Unicode scalar in the literal.
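For reference, that one-to-one relationship is already visible at runtime through the value property:

let scalar: Unicode.Scalar = "A"
print(scalar.value) // 65, the single canonical integer value of U+0041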

1 Like

Off topic and I apologize, but for those curious you can also say “porque no ambos?”

3 Likes

I used to agree with you, but it turns out taking ExpressibleByStringLiteral out of the text literal protocol hierarchy brings a lot of ABI issues with it, and in principle, let s: String = 'a' isn’t too different from let f: Float = 1. Complain all you want about binary stability, but if you ask me it’s not worth fighting the ABI over this.

Early iterations of the proposal allowed exactly what you’re asking, but developers who use a lot of international text said there would be a lot of problems relating to Unicode encoding once you get past U+007F (literals that change value when you save the .swift file???). We think their concerns are valid, so we have the ASCII restriction.

We don’t really have to. I don’t care too much about the init(ascii:) initializer and that part can be omitted from the proposal without affecting anything else.

We don’t. It’s okay, I could have sworn we had one too lol. One draft of the proposal did specifically add this exact operator on Character x Character, but once you have 'a' + 'a' the next logical step is 'a' * 5, which of course brings us to 'a' * 'a' (???)

we’ve been over this

5 Likes

That does not actually address the issue.

Saying “we’ve been over this” and linking to a post that simply dismisses the idea out of hand, is not conducive to a productive Swift Evolution discussion.

I like most parts, except for the proposed integer conformances. Integer literal initialization seems like a relatively niche feature that comes with an unreasonably large conceptual/pedagogical cost.

I can fully appreciate the convenience of being able to write code like this while I'm writing a parser for some binary file format that includes ASCII elements:

let magic: [UInt8] = ['G', 'I', 'F', '8', '9', 'a']

However, for the overwhelming majority of my time, I'm not writing such a parser, and the notion that it's okay to consider some characters unambiguously identical to their ASCII codes rubs me the wrong way.

By their nature, characters aren't integer values. There is nothing inherently "71ish" about the letter G; the idea that the digit 8 is the same thing as the number 56 is absurd.

We can encode G and 8 to integers (or, rather, a series of bits) by selecting an encoding. There are many encodings to choose from, and not all use the same representations as ASCII; indeed, Swift has some support for running on IBM Z systems that prefer encodings from the EBCDIC family.

The Swift stdlib has so far been careful to always make the selection of an encoding explicit in Swift source code. This has been best practice in API design for multiple decades now, so I always assumed this was a deliberate choice -- any implicit encoding would certainly lead to hard-to-spot mistakes. This proposal is quite flagrantly breaking this practice. Limiting the feature to ASCII is a good compromise, but it does not fully eliminate the problem.

In view of this, I would prefer if the passage about eventually adding ExpressibleByUnicodeScalarLiteral conformance to Int8/.../Int was dropped from the proposal text. Instead, users should keep explicitly opting into the integer initialization feature, even if versioned conformances become available. The required one-liner conformance declaration honestly doesn't seem like an unreasonable burden for authors of parser packages.
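Presumably that one-liner would look roughly like the following under the proposal's opt-in scheme, with the standard library supplying the default implementation (the exact spelling may differ):

extension UInt8: ExpressibleByUnicodeScalarLiteral {}
// after which 'G' (and the magic array above) would compile as UInt8 values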

The UInt8(ascii:) initializer should not be deprecated; rather, it should be added to all integer types. (The objections raised in the Alternatives Considered section seem quite weak to me.)
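A rough sketch of what adding it to all integer types could look like, mirroring the ASCII precondition of the existing UInt8.init(ascii:); the generic extension below is an illustration, not the proposal's wording:

extension FixedWidthInteger {
    init(ascii scalar: Unicode.Scalar) {
        precondition(scalar.isASCII, "code point is not ASCII")
        self = Self(scalar.value)
    }
}

let g = Int16(ascii: "G") // 71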

Yes, despite the objection above.

Single-quoted Unicode.Scalar and Character literals are obviously desirable.

As detailed above, I consider integer initialization to be a bad idea in general. Limiting it to ASCII is a workable compromise; indefinitely leaving it opt-in would be even better.

I don't know any language that properly supports Unicode at the level Swift aims for. APIs that implicitly assume a string encoding are generally considered subpar in the libraries I've worked with -- unfortunately, character initialization is often a legacy language feature that has to be kept unchanged.

I was involved in late-stage pitch discussions, and thought long and hard about the issues.

10 Likes

I think there is enough reason to give ASCII preferred treatment over all other encodings, just from the fact that so many major binary formats explicitly specify it. For example, the PNG standard:

3.2. Chunk layout

Each chunk consists of four parts:

Length

Chunk Type

A 4-byte chunk type code. For convenience in description and in examining PNG files, type codes are restricted to consist of uppercase and lowercase ASCII letters (A-Z and a-z, or 65-90 and 97-122 decimal). However, encoders and decoders must treat the codes as fixed binary values, not character strings. For example, it would not be correct to represent the type code IDAT by the EBCDIC equivalents of those letters. Additional naming conventions for chunk types are discussed in the next section.

Non-ASCII 7-bit encodings are so rare compared to non-UTF-8 wider-than-7-bit encodings that I really don’t think it’s valid to use the same argument for both. We shouldn’t make the common case (ASCII) difficult to use just to accommodate a very uncommon case (EBCDIC).
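Concretely, this is the kind of comparison the feature is aimed at (this assumes the proposal's single-quoted literals plus an integer conformance, so it does not compile today):

let chunkTypeBytes: [UInt8] = [73, 68, 65, 84] // four bytes read from a PNG file
if chunkTypeBytes == ['I', 'D', 'A', 'T'] {    // compared as fixed binary values, per the spec
    // decode image data
}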

2 Likes

Please don’t try to slippery-slope this.

We have a + operator that concatenates arrays, but nobody expects there to be a * operator that repeats an array n times. Some people might want that, but we don’t have it and the slope is not slippery.

Furthermore, with the proposal in its current form, 'a' * 'a' would be valid Swift (after a zero-effort user-supplied conformance, which is only required for ABI reasons), expressing the multiplication of 97 by itself as an integer. So if you think “'a' * 'a'” is objectionable, then it follows that the current proposal is as well.

How is that any different from what we can write today:

let x = "θ" as Unicode.Scalar
let n = x.value

The only difference is that today the conversion from character literal to integer must happen at runtime, whereas it could be done at compile-time with exactly the same semantics.

I have seen zero examples of a problem with doing this at compile-time, that is not also a problem with doing it at runtime as we can today.

Your whole premise is that 'a' shouldn’t be a String, but there really isn’t a strong argument why it shouldn’t be, and if we go with the ABI-friendly route we get 'a' + 'a' == "aa" for free. 'a' * 'a' is unfortunate, but we already decided that for users it’s fine, since the type checker would almost certainly catch that mistake. For standard library authors, 'a' * 'a' is a different question: if we are vending + (Character, Character) -> String, we would also have to consider * (Character, Int) -> String along with it (Python has it, after all).
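For concreteness, the Python-style repetition operator alluded to would be something like this sketch (purely illustrative; it is not being proposed):

func * (lhs: Character, rhs: Int) -> String {
    return String(repeating: String(lhs), count: rhs)
}
// with it, 'a' * 5 would produce "aaaaa"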

The reason is clarity at the point of use. If we introduce single-quoted literals as this proposal suggests, then the difference between them and double-quoted literals should be meaningful:

'a' can be a Character or Unicode.Scalar, but not a String nor StaticString

"a" can be a String or StaticString, but not a Character nor Unicode.Scalar

That way the meaning of character and string literals in source code is much clearer.
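Spelled out, the distinction being argued for here (stricter on the String side than the proposal as written):

'a' as Character       // ok
'a' as Unicode.Scalar  // ok
'a' as String          // rejected under this rule
"a" as Character       // likewise rejected
"a" as String          // ok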

Seriously, please stop with the slippery-sloping. We are under no obligation to consider any such thing, and even if we did consider it we have no obligation to adopt it.

Off topic, but @johnno1962 why does let s: String = 'a' work again? String’s ExpressibleByExtendedGraphemeClusterLiteral conformance takes a String argument:

String.init(extendedGraphemeClusterLiteral: "a" as Character)
error: repl.swift:3:49: error: cannot convert value of type 'Character' to expected argument type 'String'
1 Like

Way off topic, but it works because 'a' is a character literal expression that can be expressed as a single Unicode scalar, so the compiler looks for ExpressibleByUnicodeScalarLiteral, which String must conform to by virtue of the inheritance hierarchy descending from ExpressibleByStringLiteral (as opposed to Character, which is a concrete type). I don’t know exactly why you’re seeing that specific error.
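For anyone following along, the relevant hierarchy in the standard library is (simplified):

protocol ExpressibleByExtendedGraphemeClusterLiteral: ExpressibleByUnicodeScalarLiteral { /* ... */ }
protocol ExpressibleByStringLiteral: ExpressibleByExtendedGraphemeClusterLiteral { /* ... */ }

let s = String(unicodeScalarLiteral: "a") // compiles today because String: ExpressibleByStringLiteral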

ExpressibleByUnicodeScalarLiteral does allow you to run into the same issues today. But there are three significant differences.

  1. To encounter most of the issues today, you have to compose separate concepts:

    This is dangerous code:

    let x = "é" // Creating a text literal.
    let n = x.unicodeScalars.first!.value // Getting the numeric encoding.
    

    Thanks to Unicode equivalence, after copy‐and‐paste, etc., this source might compile so that n is 0x65 or 0xE8.

    But each line on its own would have been perfectly reasonable and safe in different contexts:

    let x = "é" // Creating a text literal (like line 1 above).
    if readLine().contains(x.first) { // Using it safely as text.
        print("Your string has no “é” in it.")
    }
    
    print("Your string is composed of these characters:")
    for x in readLine().unicodeScalars {
        let n = x.value // Getting the numeric encoding (like line 2 above).
        print("U+\(n)") // Safely expecting it to be absolutely anything.
    }
    

    But this proposal adds the combination as a single operation:

    let n: UInt8 = 'a'
    

    Thus it has the extra responsibility to consider where and when the combined operation as a whole is and isn’t reliable.

  2. The subset of issues you can already encounter today without composition fails immediately with a compiler error.

    This is dangerous code:

    let x = "é" as Unicode.Scalar
    

    That may or may not still fit into a Unicode.Scalar after copy‐and‐paste, but you will know immediately.

    The proposal, had it not limited itself, would have introduced instances of single operations that would be derailed silently:

    let x: UInt32 = 'Å' // Started out as 0x212B, might become 0xC5.
    

    Hidden, nebulous logic changes would be far worse than the sudden compiler failures we can encounter today.

  3. Today, the most straightforward, safest way to get a vulnerable scalar from a literal requires all the same compiler and language functionality as the dangerous code:

    let x = "\u{E8}" as Unicode.Scalar
    

    Making the dangerous variant illegal would have made this safe way impossible as well.

    On the other hand, regarding the additions in the proposal, the most straightforward safe way to get a vulnerable integer is completely unrelated in both syntax and functionality, and is thus unaffected by the safety checks:

    let x = 0xE8
    

(For any new readers who want more information about this, the relevant discussion and explanations begin here in the pitch thread and run for about 20 posts until a consensus was reached around what was called “Option 4”. Please read those first if you want to ask a question or post an opinion about the restriction to ASCII.)

7 Likes

Oh. That's unfortunate. I actually did not realize the proposal would allow that. (This isn't immediately obvious from the proposal text, and it's only tangentially implied by a test in the implementation.)

Searching back, I now see this did come up during the pitch. I'm sorry I missed it:

I wholeheartedly agree with @xwu here.

I accept that there is some utility in using ASCII character literals for the purposes of integer initialization and pattern matching. But I absolutely do object to the prospect of character literals directly appearing in arithmetic expressions like 'a' + 12, much less abominations like 'a' * 'a'.

let m1 = ('a' as Int) + 12     // this is (barely) acceptable to me
let m2 = 'a' + 12              // ...but this seems extremely unwise
let wat = 'a' * 'b' / 'z'      // ...and this is just absurd

If we cannot prevent the last two lines from compiling, then in my opinion the proposed integer conformances should be scrapped altogether.

An argument can perhaps be made to keep the Int8/UInt8 conformances to cater for the parsing use case, however absurd these expressions might be. But this argument doesn't apply to any other FixedWidthInteger type, so I believe it'd be best if the generic initializer were removed from the proposal.

4 Likes

I agree with this feedback, and it is the main thrust of my review comment about this revised proposal:

In general, I thought that the proposal was acceptable in its previous incarnation, but removing conformances due to ABI limitations and then making them opt-in is a very strange way of vending a feature. It's not discoverable, and it's certainly not convenient.

Instead of trying to push ahead with the minimum possible modification to the design, we need to revisit this design more comprehensively because the pros and cons have changed dramatically.

(To make it explicit: The previous pro of the highest possible convenience in writing let x: Int = 'a' is very much tempered when one must write a conformance. We also introduce a new con that we are encouraging people to conform a type they don't own to a protocol they don't own, which is absolutely contrary to what we tell people otherwise.)

By contrast, init(ascii:) seems very straightforward, and if we indeed care about convenience for initializing arrays of integers in this way, we can also vend a similar initializer via extension Array where Element: FixedWidthInteger.
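A sketch of what such an Array convenience might look like; the initializer name and shape here are assumptions rather than anything from the proposal:

extension Array where Element: FixedWidthInteger {
    init(ascii scalars: [Unicode.Scalar]) {
        self = scalars.map { (scalar) -> Element in
            precondition(scalar.isASCII, "code point is not ASCII")
            return Element(scalar.value)
        }
    }
}

let magic = [UInt8](ascii: ["G", "I", "F", "8", "9", "a"])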

If compile-time checking is required, there is no reason why the compiler cannot be made to have special knowledge of these init(ascii:) methods to make this work until such time as the constexpr-like features mature and supersede that magic.

7 Likes

-1. Main reason: unsigned integers are numbers, not codepoints. Codepoints are represented by unsigned integers, but that doesn't mean that UInt8 etc. need to be able to be represented by these kinds of literals. There should be a wrapper type such as

struct SomeKindOfCodepoint {
    var value: UInt8
}

instead of polluting integer interfaces. I would hate it to make code like this valid:

let a: UInt8 = 10

let b = a + 'x' //wat?

Do codepoints have multiplication? Do codepoints have division? No? Then they should be their own type that only wraps the underlying integer representation. Afterwards, this wrapper type can conform to a new type of ExpressibleBy...Literal as this proposal wants.
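A rough sketch of that direction, reusing today's ExpressibleByUnicodeScalarLiteral rather than a brand-new literal protocol (the type name and details are illustrative only):

struct ASCIICodepoint: ExpressibleByUnicodeScalarLiteral {
    var value: UInt8
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        precondition(scalar.isASCII, "not an ASCII code point")
        value = UInt8(scalar.value)
    }
}

let g: ASCIICodepoint = "G" // ('G' under the proposal); value == 71, and no +, * or / comes along for the ride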

But the way it looks now, this is moving way too close to C-like unsigned char territory, where there is no difference between text and numbers, which is a poor choice for a high-level language with a strong type system.

8 Likes

I'm -1 for any version that adds non-numeric literals to integers. It undermines Swift's approach to Unicode correctness. I can already imagine the all-too-easily copy-pasta'ed Stack Overflow posts that will recommend pattern matching over String.utf8.

The basic "Haven't you had this problem?" in the proposal isn't compelling. Nobody's forcing anybody to cram tons of UInt8(ascii:) calls into a single ugly array literal. Plenty of C parsers use a big chunk of #defines for the code points they care about, and I've seen similar approaches in Swift without any trouble.
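That approach looks roughly like this in today's Swift (the constant names are just examples):

let quote      = UInt8(ascii: "\"")
let leftBrace  = UInt8(ascii: "{")
let rightBrace = UInt8(ascii: "}")
// then match bytes against these named constants instead of bare literals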

Core Team members have often said self-hosting Swift is a non-goal because it biases the language towards what's good for a compiler. Likewise, I don't think the language needs to be unfairly biased in favor of ASCII as part of its string ergonomics story. That's every other programming language. We already have those languages.

I'm perfectly in favor of single-quote literals for Swift's UnicodeScalar, Character, and String. It notably improves on the status quo and aligns with other languages.

2 Likes