SE-0243: Codepoint and Character Literals

They are not at all limited to those circumstances. The majority of uses of literals are in contexts where the type is not explicitly coerced. In generic contexts, they are inferred wherever possible to be the default literal type. Indeed, the crux of what you are suggesting is that an integer should be expressible by '\n' by default.

I assert no such thing. Finding it "handy" to do something "from time to time" that causes the myriad other issues enumerated in this chain is the fundamental reason why I do not support such a feature.

Whether the body of developers who would want a notation for a single extended grapheme cluster is larger, smaller, or positively minuscule is largely beside the point--what they would wish for does not have the negative effects set out here. If indeed we believe that it is not a common usage, then the status quo where we do not support single-quoted literals at all is a perfectly acceptable outcome to me. But--and I think you and I agree on this--I've outlined amply why I do not consider the expressibility of integers by single-quoted letters to be an acceptable outcome.

All we can do is agree to disagree on this. I'm prepared to go with the flow, but I'll be the one who is disappointed if we don't find a use for single-quoted literals, preferably one that expresses integers somehow. We have a concrete use case and some spare syntax; it seems logical to try to marry them up.

1 Like

The introduction of unavailable/obsoleted/prefer-me-but-don’t-compile-me overloads is not a good approach from a type-checking perspective. The type checker will try to find a way to type-check the expression (avoiding the unavailable overload), and the presence of that overload will slow things down.

Doug

5 Likes

After reading the whole thread and re-reading the proposal, I'm -1 on this. The fact that ASCII characters have numerical values shouldn't IMO translate into being substitutable for numbers (really, it was the 'x'.isMultiple(of: 'a') example that pushed me over the edge - that's just nonsense).

In over 6 years of iOS app development (ObjC and Swift), I've worked on exactly one project that would have benefitted from this (just a data point, not saying that others don't have different needs).

I may also be influenced by this thread but I can't shake the feeling that this proposal is motivated to a large extent by a single PNG library...

5 Likes

Swift should provide an expressive way to write low-level PNG libraries and twiddly ASCII algorithms when the domain requires it. That would be an improvement to Swift.

How about 'A' for Character literals, but with 'A'.ascii available for those Characters belonging to ASCII (checked at compile time)? The motivation is that this is quite a compact and clear syntax.
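
A sketch of how that might read (hypothetical: neither single-quoted literals nor an .ascii member exist in Swift today):

let star: Character = '*'   // a plain Character literal
let a = 'A'.ascii           // 65 as UInt8, checked at compile time
let e = 'Ć©'.ascii           // hypothetical compile-time error: 'Ć©' is not ASCII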

6 Likes

It’s helpful to outline some big-picture roadmaps for Swift literals to understand how this concept would work and how it fits into the rest of the language. I think everyone agrees the current ExpressibleBy system is excessively complex and magical, and it’s becoming clear the language is starting to outgrow it.

To be clear, this is not part of the proposal, rather a vision for how 'x' as UInt8 will evolve from ā€œspecial compiler magicā€ to ā€œgeneral language featureā€.

Basically, in place of protocol conformances, we would have a set of @ attributes that would mark initializers as being literal initializers.

enum Int.Base 
{
    case decimal, octal, binary, hexadecimal
}
enum Double.Base 
{
    case decimal, hexadecimal
}
extension Double 
{
    // instead of receiving fully-parsed `Builtin` values, these 
    // initializers just receive minimally-parsed lexer tokens
    // the `Int(integer)` part means it depends on `Int`’s `@integerLiteral` 
    // initializer.
    @integerLiteral(expressible, Int(integer))
    init(sign:Sign, base:Int.Base, digits:[Int])

    @floatLiteral(expressible, Int(integer), Double(integer))
    init(sign:Sign, base:Base, fraction:[Int], digits:[Int], exponent:[Int])
}
// invokes Double.init(sign: .plus, base: .decimal, digits:[9, 8, 9, 1])
let x:Double = 1989

// invokes Double.init(sign: .plus, base: .hexadecimal, 
//                 fraction: [10], digits:[1, 15], exponent: [2, 1])
let y:Double = 0xF1.Ap12

Unlike the ExpressibleBy initializers, @{}Literal-annotated initializers would not be callable at runtime (a gaping hole in the current system, which makes no sense and has been the source of endless headaches); it follows that they would not be part of the ABI. This is important because it makes the initializer arguments constant expressions, allowing literal syntax errors to be thrown from static #asserts inside the initializers, rather than from C++ code in the compiler. This would move a lot of C++ implementation into the standard library, and allow us to get rid of a considerable amount of Builtin cruft. It would open up a lot of new possibilities for things like arbitrary-precision integer types, which right now have to go through Builtin.IntLiteral.

For static string, string, character, and unicode scalar literals, we would be able to drop the overlapping ExpressibleByStringLiteral:ExpressibleByExtendedGraphemeClusterLiteral:ExpressibleByUnicodeScalarLiteral mess we have right now, and just have unified @textLiteral and @textElementLiteral attributes.

extension String 
{
    @textLiteral(expressible, Unicode.Scalar(text))
    init(hashtags:Int, unicodeScalars:[Unicode.Scalar])
}

Note that @textLiteral and @textElementLiteral initializers take an array of Unicode.Scalars, because grapheme cluster boundaries aren’t known to the compiler at compile time.

The compiler would evaluate a text literal (or any other literal) in source by looking for all visible @textLiteral initializers that are marked expressible and match the type context, if there is one. If there isn’t, it would look for an initializer on the TextLiteralType typealias, analogous to what we do now.
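
For reference, that lookup mirrors a mechanism that already exists in valid Swift 5: shadowing the default literal type with a module-scope typealias redirects inference, just as the hypothetical TextLiteralType lookup would.

typealias IntegerLiteralType = Double  // shadows the stdlib default (Int) for this module
let mass = 20                          // inferred as Double via the typealias above
print(type(of: mass))                  // Double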

The as coercion operator would do something similar, but it would not need a typealias (since the rhs of the operator gives the concrete type), and the initializer would not have to be marked expressible. This gives us coercible-but-not-expressible by ___ types.

extension UInt8 
{
    @textElementLiteral(Int(integer), UInt32(integer)) 
    init(unicodeScalars:[Unicode.Scalar])
    {
        // can use `Int` and `UInt32` literals because we declared them 
        // as dependencies
        #assert(unicodeScalars.count == 1, 
            unicodeScalars[0].value & 0xff_ff_ff_80 == 0)

        self.init(truncatingIfNeeded: unicodeScalars[0].value)
    }
}

Of course, this isn’t really that different from defining a function or method that takes @compilerEvaluable arguments, and it would be spelled similarly to the existing UInt8(ascii:) initializer. (Though we obviously couldn’t reuse the function signature.) But I think as is clearer and more readable, and makes more sense given its existing semantics in the language. This would be especially true if functions with @compilerEvaluable-restricted arguments shared the same call-site syntax as normal Swift functions, since in situations like

let a:Character = 'a'
let ord:UInt8 = foo(a)

you can’t tell whether foo is folding its argument without knowing its signature.

1 Like

It is verbose, indeed, but I find that the fact that it starts with the literal makes it more readable than something like UInt8(ascii:'a').
However, I still prefer 'a'.asciiValue, which is about as verbose but has the advantage of being a lot more explicit.

2 Likes

I think 'a'.asciiValue is pretty readable, but the fact that it returns an optional is a big negative for me. I don’t think making it a trapping property on Character would fly, since non-ASCII Characters aren’t really an edge case.
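
For reference, the optional-returning API as it stands in Swift 5 (double quotes, since single-quoted literals aren’t in the language yet):

let a: Character = "a"
a.asciiValue                  // Optional(97)
("Ć©" as Character).asciiValue // nil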

1 Like

Is it possible to not have it return an optional, by making it not a property of Character, but a property that exists on the character literal itself, available only if that literal is ASCII?

this sounds pretty magical. why not be explicit and just extend as syntax instead of abusing . notation?

1 Like

I believe ASCII is important enough to deserve some dedicated API surface area. A trapping variant of asciiValue would be a much easier sell to me than 'A' as UInt8.

'a'.ascii // ⟹ 97
'Ć”'.ascii // šŸ’„ (possibly even a compile-time warning)

This would certainly simplify dealing with individual ASCII characters, to the point that they become convenient enough to eliminate the need for direct ASCII literals.

To represent ASCII byte sequences, we already have String.UTF8View:

"fred".utf8 // [102, 114, 101, 100]

It's a nice collection type and it should be possible to use it for convenient matching of ASCII byte strings. It has a huge advantage over array literals in that it supports the small string optimization.
Missing APIs can be added as needed if matching byte sequences isn't easy or fast enough in the current stdlib.
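
For example, prefix matching already works in Swift 5 because UTF8View’s elements are plain UInt8 values (a minimal sketch with an assumed byte buffer):

let bytes: [UInt8] = [102, 114, 101, 100, 10] // "fred" plus a newline
bytes.starts(with: "fred".utf8)               // true
bytes.elementsEqual("fred".utf8)              // false (trailing newline)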

If people need a dedicated ASCII string type (e.g. for marking certain APIs as requiring ASCII strings in a type-safe way), we could even add a separate String.ASCIIView type:

"fred".utf8 // [102, 114, 101, 100]
"fred".ascii // [102, 114, 101, 100]

"frƩdi".utf8 // [102, 114, 195, 169, 100, 105]
"frĆ©di".ascii // šŸ’„ (possibly even a compile-time warning)
1 Like

i don’t know if i like the idea of special-casing certain properties to have this behavior. We would probably want to do the same thing for other properties that have a similar issue like .first, and it would be confusing if some function calls were transparent and validated while others aren’t.

// what does this do?
let u:Unicode.Scalar = .init(value: 0xff_ff_ff_ff)
// this makes me expect a compiler guarantee 
let u:Unicode.Scalar = 0xff_ff_ff_ff as Unicode.Scalar // error

And if we do start implicitly validating, when do we stop? as has a clearly defined role: coerce a constant literal value on the left side to the type on the right side. What you’re suggesting is a lot more open-ended.

// if this gets validated
let s:UInt8 = 'a'.ascii
// does that mean this is too?
let t:Int   = [1, 2, 3, 4][i & 0b11]

It is already the case that some functions are validated at compile time. See SE-0213. In the long run, everything that can be validated at compile time within reasonable time, should be.
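
Concretely, SE-0213 makes T(literal) a compile-time literal coercion rather than a runtime conversion, so the compiler already validates the value:

let big = UInt64(0xffff_ffff_ffff_ffff) // ok: coerced directly, no Int intermediate
let bad = Int8(200) // error: integer literal '200' overflows when stored into 'Int8'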

It should be.

1 Like

Neither do I, which is why I added the 'possibly'. Warnings like that are helpful in catching basic mistakes, but they wouldn't catch all cases. The runtime trap would.

However, making sure it's convenient to work with ASCII data is important enough to go the extra mile.

This is a fun sidetrack that we could argue over for days, but it seems minimally relevant. Do you have views about the usability of a trapping Character.ascii property or about using the .utf8 view to represent ASCII byte strings?

If we only cared about usability, Character.ascii would be perfectly fine. But we also care about whether the actual API makes sense. Do we really want trapping .ascii to be available on all Character values? Would this encourage people to write stuff like this?

func foo(_ character:Character) 
{
    self.file.write(character.ascii)
}

"jĆŗst Ć” nƶrmĆ”l SwĆ®fĻ„ ßtrĆ­ng".forEach(foo(_:))

The standard library uses trapping very sparingly, for certain operations where the failing case is too much of an edge case for it to be worth burdening the API with an Optional. I don’t think the Character '😻' meets that threshold. We emphasize Character’s grapheme-cluster nature so much it would be weird if we suddenly started considering emojis and non-ASCII codepoints ā€œthe unexpected caseā€.

Character.asciiValue returns an optional, and that’s exactly how it should be. This tells me that a computed property on Character is not the right solution.

The .utf8 view is definitely not the right way to represent ASCII bytestrings. This just seems like opening the door to bugs where non-ASCII text slips through silently: a string like "aren’t", with its typographic apostrophe, is valid UTF-8 but not ASCII.
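
A minimal illustration, runnable today; the typographic apostrophe slips through .utf8 as a multi-byte sequence:

Array("aren’t".utf8) // [97, 114, 101, 110, 226, 128, 153, 116]: the ā€™ alone is three bytes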

Sure; I don't see why not! Any such misuse will result in clear traps, not mis-encoded text. The potential for harm is far less than in allowing character literals to be inferred as integer types.

But if people are generally happy to deal with unwrapping asciiValue's optional return value, then of course that's even better.

A significant portion of this proposal is about ergonomics enhancements to ASCII processing. It goes through some questionable contortions to allow the let a: UInt8 = 'A' syntax, ostensibly in the name of ASCII ergonomics. This increasingly looks like a mistake.

The simple, non-controversial addition of single-quoted literals for Character is already a huge boost to ASCII productivity, because it allows you to type 'A'.asciiValue to get 65. You can't do that today: the string literal gets inferred to String, which does not provide that API.

// Swift 5
"A".asciiValue // error: value of type 'String' has no member 'asciiValue'

// With the non-controversial parts of SE-0243:
'A'.asciiValue! // ⟹ 65✨ 

It should be noted that Character.asciiValue does have one interesting quirk: it normalizes the CR+LF sequence (which is a single character!) to a single LF:

'\r\n'.asciiValue! // ⟹ 10 (?!)

This is arguably not right; migrating to another property could provide the opportunity to fix this. There is a fundamental underlying issue in that the Character → ASCII encoding mapping is not one-to-one.

It would perhaps be better to work on the level of Unicode scalars instead, and have single-quoted literals default to Unicode.Scalar instead of Character:

let newline1 = '\n' // Unicode.Scalar
let newline2 = '\r\n' // error: '\r\n' is not a single Unicode scalar; did you mean to initialize a Character value?
let newline3: Character = '\r\n'

print(newline1.asciiValue) // Optional(10)

(Conveniently, the protocol is already called ExpressibleByUnicodeScalarLiteral. I'm sorry if this came up during the pitch; I did not have time to check.)

Yes, it's implicitly relying on UTF-8 being compatible with ASCII, and this may cause issues when it's used for encoding; but UTF8View seems perfectly serviceable to me when I'm looking for random pieces of ASCII text in a byte sequence.

What I was trying to get at is that it's not clear how the proposal's integer conformances would extend to cover a sequence of ASCII bytes. The examples I've seen aren't particularly convincing, to put it mildly.

let a: [UInt8] = ['1', '9', '9', '2', '-', '0', '8', '-', '0', '3']  // Yuck
let b = "1992-08-03".utf8   // Same thing, except it works in Swift 5 and requires no allocation

If you prefer, it would certainly be possible to add an ASCIIView that restricts itself to ASCII characters.

5 Likes

This is, in my view, a huge insight. Not only is it far simpler for the compiler to validate that something is a single Unicode scalar, but Unicode scalars already have many APIs for conversion to their integer values.

let x = 'Ć©'

// The following APIs already exist:
x.value   // 233
x.isASCII // false

This existing design also addresses the objection above that someone using asciiValue with a non-ASCII character might be thrown off because they expect a value instead of nil. Meanwhile, users would still have a compile-time guarantee that the stuff within the single quotes isn't a decomposed form:

// Current syntax
let decomposed = "\u{0065}\u{0301}" // Ć©
let y = Unicode.Scalar(decomposed)  // nil

// Possible future syntax
let z = '\u{0065}\u{0301}' // compile-time error

IMO it'd even be fine, if we call that notation exclusively a "Unicode scalar literal," to ditch its use for Character entirely; indeed, it might be best given how each version of Unicode subtly refines what's considered a single extended grapheme cluster.


Down the line, if after the addition of these facilities it is found that additional specific support for ASCII is still necessary, we may consider an ASCII type:

// Typed freehand, for illustrative purposes only, not working code
struct ASCII {
  internal var _value: Builtin.Int7
  public var value: UInt8 {
    return UInt8(Builtin.extOrBitCast_Int7_Int8(_value))
  }
  public init(_ value: UInt8) {
    _precondition(value < 128)
    _value = Builtin.truncOrBitCast_Int8_Int7(value._value)
  }
}

extension ASCII: ExpressibleByUnicodeScalarLiteral { /* ... */ }
extension Array: ExpressibleByStringLiteral where Element == ASCII {
  /* ... */
  public var value: [UInt8] { /* ... */ }
}

// Usage
let x = 'a' as ASCII
let pngTag = "abcd" as [ASCII]
pngTag.value // [97, 98, 99, 100]

Regarding the actual quirk at hand for asciiValue, it is weird, but the strictly correct version where ("\r\n" as Character).asciiValue evaluates to nil is also...not good.

1 Like

I'm also not sure it would be appropriate for Character to have an asciiValue property which returns an optional alongside an ascii property with the same behavior except that it traps. In a code-completion world with automatic conversion to optionals, this is asking for trouble.

An ascii property also only solves some of the issue, as you may still need to explicitly convert the integer type between UInt8 <-> Int8.
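
For instance, with today's APIs, moving between the unsigned and signed byte types (as C string code often must) requires an explicit bit-pattern conversion:

let u: UInt8 = 65               // ASCII 'A'
let i = Int8(bitPattern: u)     // reinterpret the byte for CChar-based APIs
let back = UInt8(bitPattern: i) // and back again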

1 Like

Partially off-topic, but you've just convinced me that I will never master Unicode. I had no idea that there were grapheme clusters composed entirely of multiple ASCII-range code points.

1 Like

Now that the battle for integer convertibility seems pretty much lost, I wonder if adding a few well chosen operators to the standard library couldn’t scratch the itch the proposal was trying to address...

For array initialisation I’d suggest:

extension Array where Element: FixedWidthInteger {
    @available(swift 5.1)
    public init(scalars: String) {
        self = scalars.unicodeScalars.map { Element($0.value) }
    }
}

let hex = [Int8](scalars: "0123456789abcdef")

For comparison you could simply add:

@_transparent
@available(swift 5.1)
public func ==<T: FixedWidthInteger> (lhs: T, rhs: Unicode.Scalar) -> Bool {
    return lhs == rhs.value
}

if cString.advanced(by: 2).pointee == "b" { /* ... */ }

For use in switches the following is enough:

@_transparent
@available(swift 5.1)
public func ~=<T: FixedWidthInteger> (pattern: Unicode.Scalar, value: T) -> Bool {
    return pattern.value == value
}
@available(swift 5.1)
public func ~=<T: FixedWidthInteger> (pattern: ClosedRange<Unicode.Scalar>, value: T) -> Bool {
    return pattern.contains(Unicode.Scalar(UInt32(value))!)
}

let digit = UInt8(ascii: "1")
switch digit {
case "8":
    print("Hello1")
case "1" ... "2":
    print("Hello2")
default:
    break
}

Operators that might be useful can be added without opening the flood gates to nonsense expressions.

@_transparent
@available(swift 5.1)
public func -<T: FixedWidthInteger> (lhs: T, rhs: Unicode.Scalar) -> T {
    return lhs - T(rhs.value)
}

print(digit - "0")

2 Likes