SE-0243: Codepoint and Character Literals


#62

But there might be existing code that's already doing that.

Also, what about this:

UInt8("10")

Why would that be allowed, but

UInt8("9")

not? Very, very confusing.

Edit: @michelf was quicker


(Jon Shier) #63

It doesn't look like UInt8 conforms to ExpressibleByStringLiteral, which is what I understood SE-0213 to mean.


(Xiaodi Wu) #64

We're talking about conformance to ExpressibleByUnicodeScalarLiteral, which, going forward, will still be expressible using double quotation marks.

To be explicit, consider:

let x = UInt8("8")
let y = "8" as UInt8

If UInt8 conforms to ExpressibleByUnicodeScalarLiteral, x should be equal to y according to SE-0213.
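As a concrete sketch of the collision (using a hypothetical ASCIIByte type standing in for the proposed conformance, since UInt8 itself does not conform today):

```swift
// ASCIIByte is a hypothetical stand-in for the proposed conformance;
// UInt8 itself does not conform to ExpressibleByUnicodeScalarLiteral today.
struct ASCIIByte: ExpressibleByUnicodeScalarLiteral {
    var value: UInt32
    init(unicodeScalarLiteral v: Unicode.Scalar) { value = v.value }
}

let parsed = UInt8("8")       // parses the decimal string: Optional(8)
let coerced: ASCIIByte = "8"  // literal coercion: the code point of '8', i.e. 56

assert(parsed == 8)
assert(coerced.value == 56)
```

The same spelling, "8", means 8 under parsing and 56 under literal coercion, which is the inconsistency at issue.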


(John Holdsworth) #65

In the long run the plan is that it won't, and character literals will be separate from string literals, apart from the unavoidable `let a: String = 'a'`. This problem will go away when compiling Swift 6, and both `Int8("8")` and `Int("88")` will give an error. (String literals will no longer look for ExpressibleByUnicodeScalarLiteral inside the type checker; this change would be internal to the compiler.)


(Xiaodi Wu) #66

Neither the removal of the double quotes nor the removal of these actually useful initializers is part of this proposal, so while it may be somebody's long-term plan it is not the plan of record for Swift.


#67

Why? What's wrong with them?


(John Holdsworth) #68

Looks like I need to make a correction: `Int8("8")` and `Int("88")` will both work after Swift 6 as they did before this proposal, and they will no longer short-circuit to the ExpressibleByUnicodeScalarLiteral initializer for single-digit literals. Apologies for the mistake.


(^) #69

no, no, no! Please, no one ever do this. Converting bytestrings to String can change the indexing of the bytestring characters, since some of them could be utf8 continuation bytes or form grapheme clusters. (these would not be valid ASCII scalars, but it’s not like CChar knows that.) At the very minimum, use init(decoding:as:) with the Unicode.ASCII codec. It’s disheartening that so many people here who love to trumpet the importance of text encodings pay no heed to them when it actually matters.

However, as you’ve already discovered, String is exactly the wrong tool for this job, since it does not vend random access subscripts.
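The decoding advice above looks like this in practice (a small sketch; the byte values are made up for illustration):

```swift
let bytes: [UInt8] = [83, 81, 76]  // "SQL" in ASCII

// Decoding with the ASCII codec makes the intended encoding explicit;
// a stray non-ASCII byte then cannot be silently absorbed into a
// multi-byte UTF-8 sequence or a grapheme cluster.
let text = String(decoding: bytes, as: Unicode.ASCII.self)
assert(text == "SQL")
```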


(Xiaodi Wu) #70

However, according to SE-0213, if an end user adds the conformance, then the compiler should be short circuiting them and might already do so. This would be a very confusing behavior.

Bottom line is that, with the guarantees of backward compatibility and what's already in the standard library, I don't think that integer types can be made to conform to ExpressibleByUnicodeScalarLiteral without producing some really confusing inconsistent emergent behaviors in the standard library.

Since the whole point here is to help users convert ASCII characters to corresponding integer values, adding some convenience at the cost of adding these footguns does not seem wise.


(John Holdsworth) #71

I agree it’s unfortunate, which is why the implementation marks it as an error. If I’m honest, I don’t know why the compiler takes this route. There is an `IntX.init(_ string: String)` initialiser in the stdlib, but it chooses instead to process “88” as an integer literal given half the chance.


(Xiaodi Wu) #72

It does so because we agreed that it should do so in SE-0213: this is precisely the consequence of proposing to conform an integer to ExpressibleByUnicodeScalarLiteral under those rules, and (besides '1' / '1' and other examples above) another demonstration of why this conformance is unwise.

What this proposal is attempting to state by the conformance is that, semantically, "8" is an accepted literal expression for 56 in all circumstances. The trouble is that, excepting the specific circumstance where you want the ASCII encoding of the character 8, it is not. What the standard library has already included are methods where "8" in fact converts to 8, not 56. By making those conversion initializers unadorned by a label, it has already staked out a claim that in fact that is the usually desired conversion.
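The two competing readings of "8" are already spelled differently in the standard library today:

```swift
// The unlabeled initializer parses the string as a decimal number…
let parsed = UInt8("8")          // Optional(8)
assert(parsed == 8)

// …while the labeled one asks for the character's ASCII encoding.
let encoded = UInt8(ascii: "8")  // 56
assert(encoded == 56)
```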


#73

Please read the documentation before making wild assumptions like this one. @gwendal.roue stated quite clearly that these are SQLite dates, which cannot contain any non-ASCII characters: https://www.sqlite.org/lang_datefunc.html

Even if they could, chances are that String indexing would probably be the correct choice:
201🥚-01-01 could be correctly parsed as January first, twothousandeggteen, while [CChar] indexing could not recognize it properly.

As I've mentioned earlier, this is very dangerous thinking. If that logic applies, I might as well just convert all of my Strings to Arrays whenever I need some kind of character lookup. This is really just a missing convenience function that should exist as an extension of Collection.
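For what it's worth, that missing convenience could be spelled as a small extension (hypothetical, and worth noting it hides a linear walk on non-random-access collections such as String):

```swift
// Hypothetical convenience; on String this is O(n) per lookup,
// since String indices are not random access.
extension Collection {
    subscript(offset offset: Int) -> Element {
        self[index(startIndex, offsetBy: offset)]
    }
}

let date = "2019-01-01"
assert(date[offset: 5] == "0")  // the month's first digit
```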


(Jeremy David Giesbrecht) #74

Discussions about this are plentiful and they tend to run forever. If it interests you, please take it to one of these threads instead. It is off topic here.


(^) #75

If i might suggest a slight amendment that would address some of the concerns raised here:

A good analogy to the issue of text–integer encodings is the issue of numeric literal–integer encodings. Meaning, if you write down the digits 1000, we have no idea if it is decimal, octal, binary, or hexadecimal. In this case, the literal is 1000, and the codec is the numeric base. In Swift, we use 0x, 0o, 0b to distinguish between all the different integer encodings.

let n1:Int =   1000 // decimal literal, n1 ← 1000 {base ten}
let n2:Int = 0b1000 // binary literal,  n2 ← 8    {base ten}
let n3:Int = 0o1000 // octal literal    n3 ← 512  {base ten}
let n4:Int = 0x1000 // hex literal      n4 ← 4096 {base ten}

following this precedent, we should reserve 'a' exclusively for Character, and assign prefixes u'a', a'a' for Unicode.Scalar and ASCII literals. A natural extension to EBCDIC would then be e'a' for example.

let c0:String         =  "a" // string literal,    c0 ← "a"
let c1:Character      =  'a' // character literal, c1 ← 'a'
let c2:Unicode.Scalar = u'a' // codepoint literal, c2 ← U'a'
let c3:UInt8          = a'a' // ASCII literal,     c3 ← 97
let c4:UInt8          = e'a' // EBCDIC literal,    c4 ← 129

I’m aware prefixed string literals were rejected a while back when we were talking about raw string syntax, but I think there are good reasons to adopt it for single-quoted “single-element” text literals:

  • Swift already prints Unicode.Scalar values with the prefix U', and it would be nice if the literal syntax aligned with the debug description syntax.

  • We would have an unambiguous way to write both Character and Unicode.Scalar values now, as opposed to just Character literals (proposal as written), or neither (status quo).

  • It’s easily extensible to accommodate alternative character–integer mappings like EBCDIC

  • It’s concise and clear to read, especially with appropriate syntax highlighting, so we’d get

let fontFeature = (a'k', a'e', a'r', a'n')

instead of

let fontFeature = (UInt8(ascii: "k"), UInt8(ascii: "e"), UInt8(ascii: "r"), UInt8(ascii: "n"))



(Michael Ilseman) #76

There are some totally valid concerns here, but I want to address some misunderstandings about Swift and Unicode.

Swift firmly establishes that String and related types are Unicode and not some other standard.

Unicode is a “Universal Character Encoding”. It is a mapping between “characters” (cough, not in the grapheme cluster sense) and numbers starting from 0. This assignment to a specific number is the crux of the standard. The elements of this universal encoding we call Unicode are “code points” (cough, or Unicode scalar values for only the valid ones).

A “Unicode Encoding” is an encoding of the universal encoding (or a subset of it), which may have alternate numbers (e.g. EBCDIC) or more complex details (e.g. uses a smaller-width representation). The elements of such an encoding are called “code units”.

Nothing in this proposal is attempting to bake in anything about particular “code units” from some particular Unicode encoding, rather it is addressing Unicode itself’s element type, “code point”.

(This is all pretty confusing.)

I’m not presenting an argument that Swift syntax should operate under some implicit conversion between a syntactic construct such as a literal and a number. That’s the purpose of this review thread and I understand that people can disagree for valid reasons. I’m just trying to dispel some of the FUD around mixing up Unicode and some particular Unicode encoding.

Code points are integer values. The code point ‘G’ is inherently 71 in Unicode. The idea that the digit 8 is the same thing as the number 56 is the whole point of character encodings.

Swift supports Unicode, but does not mandate a particular Unicode encoding. EBCDIC is not a subset of Unicode as decoding it to Unicode involves remapping some values.

(Again, this is not an argument that Swift’s particular syntactic construct for code points should produce Swift’s number types, just that these kinds of encoding-related concerns are not relevant)

This proposal does not change that.

Swift forces Unicode on us, and then Unicode forces ASCII on us. Unicode by explicit design is a superset of ASCII. From the standard:

While taking the ASCII character set as its starting point, the Unicode Standard goes far beyond ASCII’s limited ability to encode only the upper- and lowercase letters A through Z. It provides the capacity to encode all characters used for the written languages of the world—more than 1 million characters can be encoded.

UTF-8 is a red herring here. We’re talking about Unicode itself, i.e. code points not code units. If 0x61 were to map to ‘x’, that wouldn’t be Unicode, and if it isn’t Unicode, it isn’t Swift.

Any encoding that’s not literally compatible with ASCII (i.e. without decoding) is not literally compatible with Unicode. Such encodings might be “Unicode encodings” (encoding of an encoding), meaning that they need to go through a process of decoding in order to be literally equivalent to ASCII/Unicode.

Could you elaborate? For many tasks, pattern matching over String.utf8 is exactly what you should be doing.
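For example, scanning the UTF-8 view directly avoids Character overhead entirely (a small sketch; today the code unit has to be written as the magic number 0x3A, which is exactly the ergonomic gap being discussed):

```swift
let line = "key:value"

// Scan raw UTF-8 code units; 0x3A is ASCII ':'.
if let i = line.utf8.firstIndex(of: 0x3A) {
    let key = String(decoding: line.utf8[..<i], as: UTF8.self)
    assert(key == "key")
}
```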


I do agree that arithmetic feels odd and out of place for these literals. I feel like most of the utility comes from equality comparisons and pattern matching.

@taylorswift @johnno1962, did you explore the impact of overloads for ~= and ==? I don’t know if this would cause more type checking issues in practice. (cc @xedin)

Alternatively, are there any other options for excluding these from operators? I don’t recall exactly how availability works with overload resolution (@xedin?), but would it be possible to have some kind of unavailable/obsoleted/prefer-me-but-don’t-compile-me overloads for arithmetic operators that take ExpressibleByUnicodeScalarLiteral?
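A rough sketch of what such a pattern-matching overload might look like, assuming only comparison (not arithmetic) is wanted; this is speculative and not part of the proposal text:

```swift
// Speculative convenience overload: match a byte against an ASCII scalar.
// Traps for non-ASCII scalars, mirroring UInt8(ascii:).
func ~= (pattern: Unicode.Scalar, value: UInt8) -> Bool {
    value == UInt8(ascii: pattern)
}

let byte: UInt8 = 107  // 'k'
var matched = false
switch byte {
case "k": matched = true
default:  break
}
assert(matched)
```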


(John Holdsworth) #77

I hear you. I’ve already conceded in a previous post that integer conversions may be sufficiently unacceptable to some that they may not pass, and then there are practical considerations such as this glitch. It’s good we’re thrashing this out. I’m still waiting for more people to chime in on whether single quotes for character literals, as a use for single quotes in general, would be a worthwhile ergonomic improvement to Swift in itself.


#78

It absolutely is off topic here; that's why @taylorswift shouldn't have brought it up as an argument.


#79

Or we just change the way Unicode.Scalars are printed.


(Jeremy David Giesbrecht) #80

My comment was directed at no one in particular, but rather to potential future posters in general. I saw no problem with what had been posted so far, which was a reasonable exploration of the area where the subjects intersect. I just knew it had potential to grow out of hand very quickly, and I did not want the review derailed.


#81

I accept this argument and retract my previous argument about alternative encodings.

I would support an ExpressibleByUnicodeScalarLiteral improvement with new _ExpressibleByBuiltinUnicodeScalarLiteral conformances:

extension UTF8 {
    // Get rid of the old typealias to UInt8. Leave UInt8 alone!
    struct CodeUnit: _ExpressibleByBuiltinUnicodeScalarLiteral, ExpressibleByUnicodeScalarLiteral {
        // 8-bit only, compiler-enforced. Custom types can also use
        // UTF8.CodeUnit as their UnicodeScalarLiteralType:
        typealias UnicodeScalarLiteralType = CodeUnit

        var value: UInt8

        init(unicodeScalarLiteral value: CodeUnit) {
            self = value
        }
    }
}

This would use the well-known double quotes. It would add compiler-enforced 8- and 16-bit code unit types.

It would not pollute Integer APIs at all.

The only problem, of course, is that changing the Element of String.UTF8View etc. would be a breaking change (it is currently UTF8.CodeUnit, i.e. UInt8). Maybe there needs to be a String.betterUTF8View (or whatever other name), and the old utf8 view etc. would just be deprecated.

Everyone who wants to mess around with code units can then use types like [UTF8.CodeUnit] instead of [UInt8].
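A present-day approximation of that sketch, without the (non-public) builtin protocol, might look like this; absent compiler support, the 8-bit ASCII-only guarantee has to be a runtime precondition rather than a compile-time check:

```swift
// Hypothetical stand-in; _ExpressibleByBuiltinUnicodeScalarLiteral is not
// public, so the range check happens at runtime, not in the compiler.
struct ASCIICodeUnit: ExpressibleByUnicodeScalarLiteral, Equatable {
    var value: UInt8
    init(unicodeScalarLiteral v: Unicode.Scalar) {
        precondition(v.isASCII, "\(v) is not an ASCII scalar")
        value = UInt8(v.value)
    }
}

let colon: ASCIICodeUnit = ":"
assert(colon.value == 0x3A)
```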