SE-0243: Codepoint and Character Literals

There are some totally valid concerns here, but I want to address some misunderstandings about Swift and Unicode.

Swift firmly establishes that String and related types are Unicode and not some other standard.

Unicode is a “Universal Character Encoding”. It is a mapping between “characters” (cough, not in the grapheme cluster sense) and numbers starting from 0. This assignment to a specific number is the crux of the standard. The elements of this universal encoding we call Unicode are “code points” (cough, or Unicode scalar values, for only the valid ones).

A “Unicode Encoding” is an encoding of the universal encoding (or a subset of it), which may have alternate numbers (e.g. EBCDIC) or more complex details (e.g. a smaller-width representation). The elements of such an encoding are called “code units”.

Nothing in this proposal is attempting to bake in anything about particular “code units” from some particular Unicode encoding; rather, it is addressing the element type of Unicode itself, the “code point”.
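To make the code point / code unit distinction concrete, here is how it surfaces in Swift today:

let euro: Unicode.Scalar = "€"
print(euro.value)          // 8364, i.e. U+20AC: the code point, Unicode's own number
print(Array("€".utf8))     // [226, 130, 172]: UTF-8 code units
print(Array("€".utf16))    // [8364]: UTF-16 code units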

(This is all pretty confusing.)

I’m not presenting an argument that Swift syntax should operate under some implicit conversion between a syntactic construct such as a literal and a number. That’s the purpose of this review thread and I understand that people can disagree for valid reasons. I’m just trying to dispel some of the FUD around mixing up Unicode and some particular Unicode encoding.

Code points are integer values. The code point ‘G’ is inherently 71 in Unicode. The idea that the digit ‘8’ is the same thing as the number 56 is the whole point of character encodings.
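For instance, in Swift today:

let g: Unicode.Scalar = "G"
let eight: Unicode.Scalar = "8"
print(g.value)      // 71
print(eight.value)  // 56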

Swift supports Unicode, but does not mandate a particular Unicode encoding. EBCDIC is not a subset of Unicode as decoding it to Unicode involves remapping some values.

(Again, this is not an argument that Swift’s particular syntactic construct for code points should produce Swift’s number types, just that these kinds of encoding-related concerns are not relevant)

This proposal does not change that.

Swift forces Unicode on us, and then Unicode forces ASCII on us. Unicode by explicit design is a superset of ASCII. From the standard:

While taking the ASCII character set as its starting point, the Unicode Standard goes far beyond ASCII’s limited ability to encode only the upper- and lowercase letters A through Z. It provides the capacity to encode all characters used for the written languages of the world—more than 1 million characters can be encoded.

UTF-8 is a red herring here. We’re talking about Unicode itself, i.e. code points, not code units. If 0x61 were to map to ‘x’, that wouldn’t be Unicode, and if it isn’t Unicode, it isn’t Swift.

Any encoding that’s not literally compatible with ASCII (i.e. without decoding) is not literally compatible with Unicode. Such encodings might be “Unicode encodings” (encoding of an encoding), meaning that they need to go through a process of decoding in order to be literally equivalent to ASCII/Unicode.

Could you elaborate? For many tasks, pattern matching over String.utf8 is exactly what you should be doing.
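Something along these lines, say. Today the byte values have to be spelled as hex (or decimal) literals, which is part of what this proposal aims to improve:

let header = "content-length: 42"
var digits = [UInt8]()
for byte in header.utf8 {
    switch byte {
    case 0x30 ... 0x39:      // '0' ... '9'
        digits.append(byte)
    default:
        break
    }
}
print(digits)                // [52, 50], the UTF-8 bytes of "42"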


I do agree that arithmetic feels odd and out of place for these literals. I feel like most of the utility comes from equality comparisons and pattern matching.

@taylorswift @johnno1962, did you explore the impact of overloads for ~= and ==? I don’t know if this would cause more type checking issues in practice. (cc @xedin)
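Concretely, I mean something along these lines (purely a sketch to make the question precise, not a worked-out design):

// match a UInt8 code unit against a Unicode.Scalar pattern
func ~= (pattern: Unicode.Scalar, value: UInt8) -> Bool {
    return pattern.isASCII && UInt8(pattern.value) == value
}

func == (lhs: UInt8, rhs: Unicode.Scalar) -> Bool {
    return rhs.isASCII && lhs == UInt8(rhs.value)
}

That would let byte-level code write case ":" patterns and comparisons like byte == "/" over UInt8 directly; the open question is what such overloads do to type-checker performance.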

Alternatively, are there any other options for excluding these from operators? I don’t recall exactly how availability works with overload resolution (@xedin?), but would it be possible to have some kind of unavailable/obsoleted/prefer-me-but-don’t-compile-me overloads for arithmetic operators that take ExpressibleByUnicodeScalarLiteral?

10 Likes

I hear you. I’ve already conceded, in a previous post, that integer conversions may be sufficiently unacceptable to some that they may not pass; then there are practical considerations such as this glitch. It’s good we’re thrashing this out. I’m still waiting for more people to chime in on whether single quotes for character literals, as a use for single quotes in general, would be a worthwhile ergonomic improvement to Swift in itself.

It absolutely is off topic here, which is why @taylorswift shouldn't have brought it up as an argument.

Or we just change the way Unicode.Scalars are printed.

My comment was directed at no one in particular, but rather to potential future posters in general. I saw no problem with what had been posted so far, which was a reasonable exploration of the area where the subjects intersect. I just knew it had potential to grow out of hand very quickly, and I did not want the review derailed.

1 Like

I accept this argument and retract my previous argument about alternative encodings.

I would support an ExpressibleByUnicodeScalarLiteral improvement with new _ExpressibleByBuiltinUnicodeScalarLiteral conformances:

extension UTF8 {
    //get rid of the old typealias to UInt8. Leave UInt8 alone!:
    struct CodeUnit: _ExpressibleByBuiltinUnicodeScalarLiteral, ExpressibleByUnicodeScalarLiteral {
        // 8-bit only, compiler-enforced. Custom types can also use UTF8.CodeUnit as their UnicodeScalarLiteralType:
        typealias UnicodeScalarLiteralType = CodeUnit

        var value: UInt8
    }
}

This would use the well-known double quotes. It would add compiler-enforced 8- and 16-bit code unit types.

It would not pollute Integer APIs at all.

The only problem, of course, is that changing the Element of String.UTF8View etc. would be a breaking change (its Element is currently the UTF8.CodeUnit typealias, i.e. UInt8). Maybe there needs to be a String.betterUTF8View (or whatever other name), and the old utf8 view etc. would just be deprecated.

Everyone who wants to mess around with code units can then use types like [UTF8.CodeUnit] instead of [UInt8].

2 Likes

I feel like the constant values are still important; see this ugly wall of code in the PNG library.

That doesn't seem all that bad to me; it's just a bunch of magic values expressed as static constants?

Well, it would be a lot nicer if

public static
let IHDR:Tag = .init(73, 72, 68, 82)

were

public static
let IHDR:Tag = .init('I', 'H', 'D', 'R')

or just

public static 
let IHDR:Vector4<UInt8> = ('I', 'H', 'D', 'R')

(coming soon to a swift evolution near you!)

It does look like there's an ExpressibleByStringLiteral conformance wanting to come out in that particular example.

1 Like

Not really. No compile-time validation of string length or code point range, and it requires runtime setup with ICU.
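Concretely, a string-literal-based version can only check those things when the program runs. A sketch, with a made-up FourCC type standing in for Tag:

struct FourCC: ExpressibleByStringLiteral {
    var bytes: [UInt8]

    init(stringLiteral value: String) {
        let units = Array(value.utf8)
        // length and ASCII range can only be checked here, at run time
        precondition(units.count == 4 && units.allSatisfy { $0 < 0x80 },
            "expected exactly four ASCII characters")
        self.bytes = units
    }
}

let good: FourCC = "IHDR"
//let bad: FourCC = "kloß"   // compiles fine, traps only when executed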

1 Like

For this and similar use-cases, I think you should either conform the type itself to ExpressibleByStringLiteral, or else use Jonas’s idea of an ASCII struct:

struct ASCII: ExpressibleByUnicodeScalarLiteral {
  var rawValue: UInt8
  …
}

Neither validates the Unicode.Scalar values to be ASCII. If that seems like an edge concern, think about how easy it is to accidentally type a '–' instead of a '-' or a '“' instead of a '"'. In fact, in school lecture slides I've seen more accidental uses of ’ than correct uses of '.

There is nothing stopping the library authors from writing an initializer that takes four Unicode scalars. This is an API entirely under the end user's control.

1 Like

Whether written in binary, octal, hex, or decimal, all of those are integer literals. Swift does not use any prefixes to distinguish between literals. This is actually a bit of a challenge for decimal vs hex float literals, but nonetheless, there is no use of prefixes in Swift anywhere to indicate different literals or different default literal types.

Yes, but it should take four ASCII scalars, not four Unicode.Scalars. They are only the same for a specific subset of Unicode.Scalars. What happened to the importance of text encodings?

That too is a precondition under the control of the API authors.

PNG allows user-defined chunk types, which is why Tag is a struct, and not an enum providing the ASCII slugs as computed properties on self. (This also prevents users from accidentally shadowing a public chunk type, since the stored representation means it will just get interpreted as the public chunk.) Having the initializer take four UInt8s is a form of idiotproofing, so that code like

let myType:Tag = .init("k", "l", "o", "ß")

or, god save us,

let myType:Tag = "kloß"

won’t compile to begin with. I know this because I am the API author.
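For anyone who hasn't read the library, the shape being described is roughly the following (a simplified sketch, not the actual source):

struct Tag: Hashable {
    // the stored representation: four bytes
    let b0: UInt8, b1: UInt8, b2: UInt8, b3: UInt8

    init(_ b0: UInt8, _ b1: UInt8, _ b2: UInt8, _ b3: UInt8) {
        self.b0 = b0; self.b1 = b1; self.b2 = b2; self.b3 = b3
    }

    // a user-defined Tag with the same four bytes simply is the public chunk type,
    // because equality is defined by the stored bytes
    static let IHDR: Tag = .init(73, 72, 68, 82)   // 'I', 'H', 'D', 'R'
}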

UInt8 isn’t perfect; it’s still possible to sneak a 0xFF in there. But the beautiful thing about this proposal is that once we have ASCII-validated character literals, no one in their right mind would type a decimal number instead of an ASCII-validated character literal.

Just because we haven’t done it before in Swift doesn’t mean we can’t do it in the future. It’s heavily precedented in other languages. I don’t think “no literal prefixes” is anywhere in the language design goals. We are running short on delimiters, after all.

You make the ASCII type enforce the ASCII-ness.
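That enforcement is still a runtime check, but it could look something like this (filling in the part elided from the earlier sketch):

struct ASCII: ExpressibleByUnicodeScalarLiteral {
    var rawValue: UInt8

    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        // rejects '–', '“', and anything else outside the 7-bit range
        precondition(scalar.isASCII, "'\(scalar)' is not an ASCII scalar")
        self.rawValue = UInt8(scalar.value)
    }
}

let dash: ASCII = "-"      // fine
//let bad: ASCII = "–"     // en dash: compiles, but traps at run time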

If you want compile-time enforcement, then perhaps the language should expose a mechanism for defining arbitrary-sized binary integer types, such as UInt7.

… and this type’s init would take a UInt8 (or a UInt7 in a perfect world). We’re really just kicking the can deeper into the nest.