SE-0243: Codepoint and Character Literals

I feel like the constant values are still important; see this ugly wall of code in the PNG library.

That doesn't seem all that bad to me, it's just a bunch of magic values expressed as static constants?

Well, it would be a lot nicer if

public static
let IHDR:Tag = .init(73, 72, 68, 82)

was

public static
let IHDR:Tag = .init('I', 'H', 'D', 'R')

or just

public static 
let IHDR:Vector4<UInt8> = ('I', 'H', 'D', 'R')

(coming soon to a swift evolution near you!)

It does look like there's an ExpressibleByStringLiteral conformance wanting to come out in that particular example.

1 Like

Not really. There's no compile-time validation of string length or codepoint range, and it requires runtime setup with ICU.
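
To illustrate (a rough sketch; the Tag shape here is made up for the example): a string-literal conformance can only check the length and the scalar range when the initializer actually runs.

struct Tag: ExpressibleByStringLiteral {
    var bytes: (UInt8, UInt8, UInt8, UInt8)

    init(stringLiteral value: String) {
        let scalars = Array(value.unicodeScalars)
        // these checks trap at runtime; the compiler accepts any string
        precondition(scalars.count == 4, "tag must be exactly four scalars")
        precondition(scalars.allSatisfy { $0.isASCII }, "tag must be ASCII")
        self.bytes = (UInt8(scalars[0].value), UInt8(scalars[1].value),
                      UInt8(scalars[2].value), UInt8(scalars[3].value))
    }
}

let ihdr: Tag = "IHDR" // fine
let bad: Tag = "kloß"  // also compiles; only traps when executed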

1 Like

For this and similar use-cases, I think you should either conform the type itself to ExpressibleByStringLiteral, or else use Jonas’s idea of an ASCII struct:

struct ASCII: ExpressibleByUnicodeScalarLiteral {
  var rawValue: UInt8
  …
}

Neither validates the Unicode.Scalar values to be ASCII. If that seems like an edge concern, think about how easy it is to accidentally type a '–' instead of a '-', or a '“' instead of a '"'. In fact, in school lecture slides I've seen more accidental uses of ‘ than correct uses of '.

There is nothing stopping the library authors from writing an initializer that takes four Unicode scalars. This is an API entirely under the end user's control.

1 Like

Whether written in binary, octal, hex, or decimal, all of those are integer literals. Swift does not use prefixes to distinguish between kinds of literals. This is actually a bit of a challenge for decimal vs. hex float literals, but nonetheless, there is no use of prefixes anywhere in Swift to indicate different kinds of literals or different default literal types.

Yes, but it should take four ASCII scalars, not four Unicode.Scalars. They are only the same for a specific subset of Unicode.Scalars. What happened to the importance of text encodings?

That too is a precondition under the control of the API authors.
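
For example (Tag here is just a stand-in type; Unicode.Scalar.isASCII is in the standard library):

struct Tag {
    var bytes: (UInt8, UInt8, UInt8, UInt8)

    // the library states its encoding precondition itself
    init(_ a: Unicode.Scalar, _ b: Unicode.Scalar,
         _ c: Unicode.Scalar, _ d: Unicode.Scalar) {
        precondition([a, b, c, d].allSatisfy { $0.isASCII },
            "chunk tag scalars must be ASCII")
        self.bytes = (UInt8(a.value), UInt8(b.value),
                      UInt8(c.value), UInt8(d.value))
    }
}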

PNG allows user-defined chunk types; this is why Tag is a struct, and not an enum providing the ASCII slugs as computed properties on self. (This also prevents users from accidentally shadowing a public chunk type, since the stored representation means it will just get interpreted as the public chunk.) Having the initializer take four UInt8s is a form of idiotproofing, so that code like

let myType:Tag = .init("k", "l", "o", "ß")

or, god save us,

let myType:Tag = "kloß"

won’t compile to begin with. I know this because I am the API author.

UInt8 isn’t perfect; it’s still possible to sneak a 0xFF in there. But the beautiful thing about this proposal is that once we have ASCII-validated character literals, no one in their right mind would type a decimal number instead of an ASCII-validated character literal.
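
For example, with the four-UInt8 initializer, this passes the type checker even though 0xFF is not an ASCII code:

let sneaky:Tag = .init(0xFF, 72, 68, 82) // nothing flags the stray byte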

Just because we haven’t done it before in Swift doesn’t mean we can’t do it in the future. It’s heavily precedented in other languages. I don’t think “no literal prefixes” is anywhere in the language design goals. We are running short on delimiters, after all.

You make the ASCII type enforce the ASCII-ness.

If you want compile-time enforcement, then perhaps the language should expose a mechanism for defining arbitrary-sized binary integer types, such as UInt7.

…and this type’s init would take a UInt8 (or a UInt7 in a perfect world). We’re really just kicking the can deeper into the nest.

Sounds like you want compile-time checking of preconditions. As you point out, UInt8 doesn't actually idiot-proof your code at compile time, and this proposal isn't required for making it possible either.

4 Likes

Actually, I do recall that Chris Lattner has specifically said he or the core team made that choice early on. Whether the core team still holds that view today is unclear.

…erm, what is your point? The ASCII type would also conform to ExpressibleByUnicodeScalarLiteral, and the literal initializer would ensure that only ASCII characters are permitted:

extension ASCII: ExpressibleByUnicodeScalarLiteral {
  init(unicodeScalarLiteral value: Unicode.Scalar) {
    self.rawValue = UInt8(ascii: value)
  }
}
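
Note that UInt8(ascii:) enforces the range with a runtime trap, not a compile-time error:

let i: ASCII = "I" // fine
let e: ASCII = "é" // type-checks, but UInt8(ascii:) traps when it runs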

Then you can make your Tag initializer take four ASCII values, and call it like this:

let IHDR = Tag("I", "H", "D", "R")
1 Like

It’s not quite that simple. Here’s a quick sketch, using @constexpression as a strawman attribute:

extension ASCII:ExpressibleByUnicodeScalarLiteral 
{
    init(unicodeScalarLiteral value:@constexpression Unicode.Scalar) 
    {
        #assert(value.value & 0xffff_ff80 == 0, 
            "Literal value '\(value)' is not an ASCII literal")
        self.rawValue = .init(truncatingIfNeeded: value.value)
    }
}

Of course, even in theory this wouldn’t actually compile, because ExpressibleByUnicodeScalarLiteral requires an init(unicodeScalarLiteral: Unicode.Scalar), not an init(unicodeScalarLiteral: @constexpression Unicode.Scalar). So this would probably get tied in with the improved literal initializer attributes discussed in the tuple thread. (Weird how all these things seem to connect with each other, lol.)

extension ASCII 
{
    @unicodeScalarLiteral // implied @constexpression for all arguments
    init(unicodeScalarLiteral value:Unicode.Scalar) 
    {
        #assert(value.value & 0xffff_ff80 == 0, 
            "Literal value '\(value)' is not an ASCII literal")
        self.rawValue = .init(truncatingIfNeeded: value.value)
    }
}

It really sounds though like what you’re actually chiselling out here is a standard library ASCII type, which would be ExpressibleByUnicodeScalarLiteral and get magic compiler checks in the same way that Int8 gets checked in the current proposal implementation. Definitely possible, but it would be a radically different direction from the current proposal and we’d have to start from scratch.

struct ASCII:ExpressibleByUnicodeScalarLiteral
{
    var _value:UInt8
    public 
    var value:UInt8 
    {
        return self._value
    }
    
    // dangerous, but that’s a design problem with the 
    // `ExpressibleBy` protocols
    public 
    init(unicodeScalarLiteral value:Unicode.Scalar) 
    {
        self._value = .init(truncatingIfNeeded: value.value)
    }
    
    public 
    init<T>(truncatingIfNeeded value:T) where T:BinaryInteger 
    {
        self._value = .init(truncatingIfNeeded: value & 0x7f) // mask to the 7-bit ASCII range
    }
    
    public 
    init?<T>(_ value:T) where T:BinaryInteger 
    {
        guard let value:UInt8 = .init(exactly: value), 
            value & 0x80 == 0
        else 
        {
            return nil 
        }
        
        self._value = value 
    }
    
    static 
    func &+ (lhs:ASCII, rhs:ASCII) -> ASCII 
    {
        return .init(truncatingIfNeeded: lhs.value &+ rhs.value)
    }
    static 
    func &- (lhs:ASCII, rhs:ASCII) -> ASCII 
    {
        return .init(truncatingIfNeeded: lhs.value &- rhs.value)
    }
}
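
A quick usage sketch of that type (the printed value assumes the 0x7f mask in init(truncatingIfNeeded:)):

let zero:ASCII = "0" // initialized from a unicode scalar literal
let seven:ASCII = zero &+ .init(truncatingIfNeeded: 7)
print(seven.value)   // 55, the ASCII code for "7"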

No, I’m saying you can literally get the syntax you want right now without any changes to the standard library at all. You just have to write the ASCII type in your project, and start using it.

The only thing missing is compile-time overflow checking.
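
That is, the kind of diagnostic integer literals already get today:

let a: Int8 = 300  // error: integer literal '300' overflows when stored into 'Int8'
let b: ASCII = "ß" // nothing stops this at compile time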

3 Likes

But it isn't arithmetic on characters; it is arithmetic on the integers you get from an ASCII table lookup (which, for the current internal representation as UTF-8, doesn't necessitate a table).

UInt8(ascii: "a") is an explicit way to state this, but is considered wordy. I would argue that 'a' is not wordy enough.
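
For comparison, the explicit spelling in today's Swift:

let digits = Array("2019".utf8)                    // [50, 48, 49, 57]
let values = digits.map { $0 - UInt8(ascii: "0") } // [2, 0, 1, 9]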

3 Likes