SE-0243: Codepoint and Character Literals


(John Holdsworth) #122

Presumably the user’s intention was to match ‘é’ characters in the stream, but they are in for disappointment.


(Xiaodi Wu) #123

Why would one presume that? The property is named asciiValue and 'é' is not ASCII.

That a statement can in this case be statically determined to evaluate to false is another issue altogether; it would be reasonable for the compiler to give warnings about dead/unreachable code here. That doesn’t require any specific logic for ASCII.


(Teva Merlin) #124

It would be nice to get a warning at compile time when using asciiValue on a non-ASCII literal.


(John Holdsworth) #125

Sorry, my example wasn’t very helpful. Adapting the example from the original post:

if cString.advanced(by: 2).pointee == 'é'.asciiValue {

This is a logic error: the comparison can never be true, but the developer won't receive any indication at compile time or at run time.
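
To make the failure mode concrete, here is a minimal sketch in current Swift. Shipping Swift has no single-quoted literals, so the Character is written with double quotes; the names e and byte are illustrative and not from the thread.

```swift
// "é" (U+00E9) is a valid Character, but it is outside the ASCII range,
// so asciiValue returns nil rather than a byte value.
let e: Character = "\u{E9}"
let byte: UInt8 = 0xE9   // the Latin-1 byte a user might expect to match

print(e.asciiValue as Any)    // nil
// The UInt8 is promoted to UInt8? and compared against nil: always false.
print(byte == e.asciiValue)   // false
```

The comparison compiles without any diagnostic, which is the point being made above.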


(^) #126

This has everything to do with compile-time checking. UInt8.init(ascii:) makes this complicated because, semantically, UInt8(ascii: "a") is a call to the initializer UInt8.init(ascii v:Unicode.Scalar) with {v → "a" as Unicode.Scalar}. Let’s say we add some compiler magic to make UInt8(ascii: "a") compiler-checked. We did that for SE-0213, after all.

let i:UInt8 = .init(ascii: "a")

But what about this?

let a:Unicode.Scalar = "a"
let i:UInt8 = .init(ascii: a)

How about this?

struct Tag 
{
    // the four bytes of the tag, stored after ASCII validation
    let a:UInt8, p:UInt8, r:UInt8, s:UInt8

    init(_ a:Unicode.Scalar, _ p:Unicode.Scalar, 
         _ r:Unicode.Scalar, _ s:Unicode.Scalar) 
    {
        self.a = .init(ascii: a)
        self.p = .init(ascii: p)
        self.r = .init(ascii: r)
        self.s = .init(ascii: s)
    }

    static 
    let IHDR:Tag = .init("I", "H", "D", "R")
}

It would be weird and inconsistent if the first example were checked at compile time, but the second and third weren’t.
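
As a rough illustration of why the second and third cases are hard to check statically: once the scalar arrives through a variable or a parameter, the only validation left is the run-time precondition inside UInt8.init(ascii:). The sketch below wraps that in a hypothetical helper, checkedASCII, which is not part of the standard library, so the failure becomes a nil instead of a trap.

```swift
// Hypothetical helper (not a standard library API): returns nil for
// non-ASCII scalars instead of trapping the way UInt8(ascii:) does.
func checkedASCII(_ scalar: Unicode.Scalar) -> UInt8? {
    // isASCII is true for scalars in the range U+0000...U+007F.
    return scalar.isASCII ? UInt8(ascii: scalar) : nil
}

let a: Unicode.Scalar = "a"
let eAcute: Unicode.Scalar = "\u{E9}"

print(checkedASCII(a) as Any)        // Optional(97)
print(checkedASCII(eAcute) as Any)   // nil
```

Either way, the check happens when the code runs, not when it is compiled.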

A literal syntax like a'x' or just 'x' coerced (important, not converted) to an integer type just doesn’t have any of these problems. For a'x', the compiler can just do the checking based on what the lexer tells it. For 'x', the compiler can just do the checking based on what the type checker tells it. But for UInt8.init(ascii: "x"), the compiler has to do lots and lots of analysis. On top of that, it just wouldn’t be clear to people reading (and writing!) code whether compile time validation is occurring or not.

That property returns an optional. It has to, because at compile time, it only knows that self must be a Character. It doesn’t know that it has exactly one codepoint, or that that codepoint is an ASCII codepoint.

You cannot accuse this proposal of encouraging bad coding practices, and then suggest this as an alternative:

let feature = ('l'.asciiValue!, 'i'.asciiValue!, 'g'.asciiValue!, 'a'.asciiValue!)

On top of that, if self isn’t an actual ASCII value, this produces about the least helpful error message you could get.

Fatal error: Unexpectedly found nil while unwrapping an Optional value
Current stack trace:
0    libswiftCore.so                    0x00007f7e4d59adb0 _swift_stdlib_reportFatalError + 69
1    libswiftCore.so                    0x00007f7e4d4a92d6 <unavailable> + 3314390
2    libswiftCore.so                    0x00007f7e4d4a9655 <unavailable> + 3315285
3    libswiftCore.so                    0x00007f7e4d2caef0 _fatalErrorMessage(_:_:file:line:flags:) + 19
4    ascii                              0x000055eeffda7dfc <unavailable> + 3580
5    libc.so.6                          0x00007f7e4c471ab0 __libc_start_main + 231
6    ascii                              0x000055eeffda7ada <unavailable> + 2778

On Linux, it doesn’t even tell you where the force unwrap happened. (The bug report SR-755 was basically triaged “won’t fix”.)
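
For comparison, here is a sketch of what a more helpful failure could look like with today's tools. The method name requireASCII is hypothetical and not proposed anywhere in this thread; it only shows that the call site can be threaded through to the trap message.

```swift
extension Character {
    // Hypothetical helper: like asciiValue!, but a failure names the
    // offending character and reports the caller's file and line.
    func requireASCII(file: StaticString = #file, line: UInt = #line) -> UInt8 {
        guard let value = asciiValue else {
            preconditionFailure("'\(self)' is not an ASCII character",
                                file: file, line: line)
        }
        return value
    }
}

let l: Character = "l"
print(l.requireASCII())   // 108
```

This still fails at run time rather than at compile time, so it addresses the error message and not the underlying objection.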

It’s also unfortunate that this not only compiles, but runs (and fails) silently.

if 0xE9 == ("é" as Character).asciiValue 
{ 
    print("i’m a user who thinks ASCII is the same as Latin-1, and there are lots of me") 
} 

reasonable?

// textbook example of non-foldable code
func isEAcute(_ scalar:UInt8) -> Bool 
{
    return scalar == ("é" as Character).asciiValue
}

That’s because the forum is highlighting the second one but not the first.

(a'I', a'H', a'D', a'R')
(UInt8(ascii: "I"), UInt8(ascii: "H"), UInt8(ascii: "D"), UInt8(ascii: "R")) 

Also, I don’t think it’s hard to type at all.


(^) #127

I also want to make it clear to everyone that the problem of 'a' * 'a', 'a' / 'a', and other arithmetic “abominations” has a simple solution.

@available(*, unavailable, message: "* is unsupported on Character values")
func * (lhs:Character, rhs:Character) -> Character
{
    fatalError("unreachable")
}

print("a" * "b")
$ swiftc -Onone unavailable.swift 
unavailable.swift:7:11: error: '*' is unavailable: * is unsupported on Character values
print("a" * "b")
          ^
unavailable.swift:2:6: note: '*' has been explicitly marked unavailable here
func * (lhs:Character, rhs:Character) -> Character

With “sit-in” declarations for -, *, /, %, etc, on (Character, Character), I consider this a non-issue moving forward.


(^) #128

The problem of

let x = UInt8('8')   // Optional(8)
let y = '8' as UInt8 // 56

also has a simple solution: just make 'a' not expand to a string literal ever.

I don’t know why John thought this would break ABI and tried to use the “look for digits in single quotes” lexer hack. Just because String has an inherited conformance to ExpressibleByExtendedGraphemeClusterLiteral in the ABI doesn’t mean the compiler has to ever invoke it at all. Keeping 'a' from ever being interpreted as a String literal would be simple(r) to implement and would have zero source or ABI impact.

Yes, it means you can’t do this:

struct S:ExpressibleByStringLiteral 
{
    init(stringLiteral:String) {}
}

let s:S = 'a'

But I don’t think a user would expect this to ever work in the first place (how many even know ExpressibleByStringLiteral inherits from ExpressibleByExtendedGraphemeClusterLiteral?), and if you really wanted to use the 'a' syntax, you should just conform your struct to the correct protocol, ExpressibleByExtendedGraphemeClusterLiteral, to begin with.
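
A minimal sketch of that conformance, written for current Swift, where the literal still has to be spelled with double quotes. The stored property value is an illustrative addition; the inherited init(unicodeScalarLiteral:) requirement is assumed to be satisfied by the standard library's default implementation.

```swift
struct S: ExpressibleByExtendedGraphemeClusterLiteral {
    let value: Character

    // The only requirement written by hand: accept a single grapheme cluster.
    init(extendedGraphemeClusterLiteral value: Character) {
        self.value = value
    }
}

// A one-character literal initializes S directly, without going through String.
let s: S = "a"
print(s.value)   // a
```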


At this point, I think the two main objections to this proposal,

  • let a:UInt8 = 'a' / 'a'

  • let b:UInt8 = .init('8')

have been resolved.


(John Holdsworth) #129

Eh? What “look for digits in single quotes” lexer hack?


(^) #130

i thought that’s what you meant by this


(John Holdsworth) #131

It’s not a lexer hack. It’s because a single character string looks for ExpressibleByUnicodeScalarLiteral. The actual results from the implementation are:

(swift) let x1 = UInt8("8")
// x1 : UInt8? = Optional(8)
(swift) let y1 = "8" as UInt8
<REPL Input>:1:10: error: cannot convert value of type 'String' to type 'UInt8' in coercion
let y1 = "8" as UInt8
         ^~~
(swift) let x2 = UInt8('8')
// x2 : UInt8? = Optional(8)
(swift) let y2 = '8' as UInt8
<REPL Input>:1:10: error: cannot convert value of type 'Character' to type 'UInt8' in coercion
let y2 = '8' as UInt8
         ^~~
(swift) extension UInt8: ExpressibleByUnicodeScalarLiteral {}
(swift) let x3 = UInt8("8")
<REPL Input>:1:16: error: integers can only be expressed by single quoted character literals
let x3 = UInt8("8")
               ^
(swift) let y3 = "8" as UInt8
<REPL Input>:1:10: error: integers can only be expressed by single quoted character literals
let y3 = "8" as UInt8
         ^
(swift) let x4 = UInt8('8')
// x4 : UInt8 = 56
(swift) let y4 = '8' as UInt8
// y4 : UInt8 = 56

This isn’t ideal, but how much code is out there that depends on initialising ints from a single-character string literal? Unless you conform a type to ExpressibleByUnicodeScalarLiteral, the behaviour remains as it was before.


(^) #132

okay so it’s already been fixed for a while! great!

I do think we should revisit banning let s:String = 'a'. I can see arguments going both ways, but if UInt8('8') isn’t going to be a thing (which it shouldn’t), we should tell a consistent story here.


#133

I still haven't heard a strong motivation for single quotes:


(John Holdsworth) #134

The motivation for me is twofold. Firstly, to complete the analogy with languages like C/Java, where a sequence of characters (a string) has a different literal form from something you want to be considered a single character, for ergonomic reasons. In some ways it’s “neat” that Swift has merged the two into one, but I don’t think this makes the language more approachable. The second motivation is a more pragmatic one: to create a distinction in the interests of not producing a source-breaking change to the behaviour of string literals with the new integer conversions. We didn’t want "8" as Int to suddenly be a trap people could fall into, giving 56 when they were expecting 8.
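
The two readings of "8" can already be seen side by side in current Swift, using existing initializers rather than the proposed literal conversion. The variable names are illustrative.

```swift
// Parsing: the string "8" is interpreted as a decimal number.
let parsed = UInt8("8")
// Code point: the scalar "8" is mapped to its ASCII value.
let codePoint = UInt8(ascii: "8")

print(parsed as Any)   // Optional(8)
print(codePoint)       // 56
```

Today the two meanings have visibly different spellings; the concern above is about a double-quoted literal silently picking the second one.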


(Xiaodi Wu) #135

The compiler should be able to prove that code inside the if block is unreachable. That would be a great QoL improvement for the compiler, and it doesn’t require anything special to be added to the API surface.


#136

The problem I have with single quotes is where we draw the line between single and double quotes, because there are several levels that I’d say would each make a good cutoff:
ASCII character, Unicode scalar, Unicode grapheme cluster (Character).

The proposed design seems to use Character, but then allows for ASCII checking if the underlying type requires it? Though I think that if a single-quoted literal is also to represent the raw value of whatever is inside, a grapheme cluster doesn’t seem like a good fit.
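
A short sketch of how the three candidate cutoffs differ, using only existing standard library properties. The constant names are illustrative.

```swift
let ascii: Character = "a"             // ASCII, one scalar
let precomposed: Character = "\u{E9}"  // é as one non-ASCII scalar
let decomposed: Character = "e\u{301}" // é as e + combining acute accent

print(ascii.isASCII, ascii.unicodeScalars.count)               // true 1
print(precomposed.isASCII, precomposed.unicodeScalars.count)   // false 1
print(decomposed.unicodeScalars.count)                         // 2
// Both spellings of é are the same Character, yet only one of them
// has a single scalar that could serve as a "raw value".
print(precomposed == decomposed)                               // true
```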


(Xiaodi Wu) #137

It is not true that the compiler “has” to do lots and lots of analysis for one spelling but not another. The semantics are the same.

The case has been made very clearly as to why a single-quoted character should not be coercible to an integer. Adding a combinatorial explosion of operator overloads doesn’t solve the problem; it merely hides it in select cases in concrete code. Consider how many operators you would need to add just to disable '42' / 3 and 42 / '3', recalling that the user can change the default literal type.

You show very clearly how one API, written by you, is clunky to use in Swift, but reject the alternative designs that have already been presented and are just as safe. It is a non-goal to redesign Swift’s surface syntax to make everything validated at compile time instead of runtime, and to make it “look” that way to boot. If I were to write your library, the initializer would be spelled Tag("IHDR") at the point of use.
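
One way such a string-based initializer might look is sketched below. The thread does not say how it would be implemented; the failable initializer, the bytes property, and the four-byte rule are all assumptions made for illustration.

```swift
// Hypothetical design: validate the whole tag at run time, in one place.
struct Tag: Equatable {
    let bytes: [UInt8]

    // Fails unless the name is exactly four ASCII characters.
    init?(_ name: String) {
        let utf8 = Array(name.utf8)
        guard utf8.count == 4, utf8.allSatisfy({ $0 < 0x80 }) else {
            return nil
        }
        bytes = utf8
    }
}

print(Tag("IHDR")?.bytes as Any)   // Optional([73, 72, 68, 82])
print(Tag("\u{E9}") == nil)        // true
```

The validation is just as safe as the four-scalar initializer, but it happens when the program runs, which is exactly the trade-off under dispute.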


#138

Thanks! It would be helpful to have these listed in the proposal! I don't particularly see a need to create parity with C/Java, but even though I think "8" as Int is unlikely to be a big stumbling point, I can see it being a confusing one to those that encounter it. At the very least '8' shows a bit more explicit intent. I wonder, though, if it's possible to introduce support with double quotes, evaluate any stumbling points in the wild, and then (and only then) introduce single quotes if the distinction seems necessary.


#139

Rust requires char (and its literal) to be a single code point. I wonder if that's worked well for them?


(John Holdsworth) #140

There’s a case to be made for drawing the line at ASCII characters and having 'a' be processed as just an integer literal. This was referred to as “Codepoint literals” early on. It would certainly make the implementation far simpler and be something we could deliver without ABI complications. Character feels like a better fit as the concept of “elements of a String” in Swift’s no-compromise String model, but it is certainly causing us problems.


#141

I don’t use Rust a lot, but that’s what I’m thinking as well.

It’s precisely this. The proposal allows 'a' to be both an integer literal, and also a grapheme cluster, which to me doesn't seem to go in the same direction.