SE-0243: Codepoint and Character Literals

xwu · March 10, 2019, 4:31pm

I would not be opposed to a few well vetted operators, but I would push back on the examples you give:

If we have '1' express a Unicode scalar as @lorentey suggests, many of the examples you give become dramatically simplified. I would be hesitant to further simplify x == '1'.value to x == '1', for a reason similar to the trouble we ran into in adhering to SE-0213: namely, that it becomes less clear whether a bare literal containing a digit expresses that number or its ASCII value. In principle, heterogeneous comparison operators that eliminate conversions are fine, but only where there is no possible confusion as to how things are converted, and I do not think we meet that bar here.

As for '1'...'8', what you are showing here is a good role for regex literals, but this operator is not it. Specifically, an expression like 'A'...'z' would match '['. This invites user error. A regex literal is sorely needed, but this imitation of it is in my view misguided.
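The 'A'...'z' pitfall can be checked in current Swift, where a Unicode scalar is spelled with double quotes plus a type annotation. This is only a small sketch; the names `letters`, `upper`, `lower`, `bracket` and `isASCIILetter` are made up for illustration.

```swift
// Unicode.Scalar is Comparable, so a closed range of scalars is well-formed.
let letters: ClosedRange<Unicode.Scalar> = "A"..."z"
let bracket: Unicode.Scalar = "["

// "[" is U+005B, which sits between "Z" (U+005A) and "a" (U+0061).
print(letters.contains(bracket)) // true

// Spelling out the two intended ranges avoids the stray punctuation.
let upper: ClosedRange<Unicode.Scalar> = "A"..."Z"
let lower: ClosedRange<Unicode.Scalar> = "a"..."z"

func isASCIILetter(_ scalar: Unicode.Scalar) -> Bool {
    return upper.contains(scalar) || lower.contains(scalar)
}

print(isASCIILetter(bracket)) // false
print(isASCIILetter("q"))     // true
```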

I am not sure why one would need 42 - '0'. I have often needed to offset to or from the ASCII value of '0' or 'a', say, to get the ASCII value of a nearby character, but that is for lack of better facilities that we are attempting to design here, not an end in itself.
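For reference, the "offset from '0'" idiom mentioned above looks roughly like this with the existing UInt8(ascii:) initializer. The helper name `digitValue(ofASCII:)` is hypothetical.

```swift
// Convert an ASCII digit byte to its numeric value by offsetting from "0".
// Returns nil for bytes outside "0"..."9".
func digitValue(ofASCII byte: UInt8) -> Int? {
    let zero = UInt8(ascii: "0")
    let nine = UInt8(ascii: "9")
    guard (zero...nine).contains(byte) else { return nil }
    return Int(byte - zero)
}

print(digitValue(ofASCII: 0x37) as Any) // Optional(7)
print(digitValue(ofASCII: 0x41) as Any) // nil
```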

lorentey · March 10, 2019, 4:39pm

We should rather fix the cases where you need to deal with signed bytes. Swift APIs have pretty much standardized on using UInt8. The platform-dependent signedness of CChar is terribly unhelpful.
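A sketch of one way to normalize a CChar to UInt8 regardless of its signedness on the current platform. It relies only on truncatingIfNeeded, which preserves the bit pattern for same-width integers; the variable names are illustrative.

```swift
// CChar is Int8 on some platforms and UInt8 on others.
// truncatingIfNeeded keeps the low 8 bits in either case.
let raw = CChar(truncatingIfNeeded: 0xE9)   // a byte above 0x7F
let byte = UInt8(truncatingIfNeeded: raw)

print(byte == 0xE9) // true whether CChar is signed or unsigned
```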

lorentey · March 10, 2019, 5:24pm

As @xwu noted, these particular examples share the same problem as 'a' as UInt8: they use unspecified encodings, so they invite mistaken assumptions, and, ultimately, bugs.

As I have tried to explain during the pitch, Unicode.Scalar.value is almost never helpful when dealing with encoded data. Truncating it to fit a narrow integer type results in a random hodgepodge of misfit encodings, most of which would make truly terrible defaults:

let a = [Int8](scalars: "cafe") // → ASCII
let b = [UInt8](scalars: "café") // → ⛔️ Latin-1
let c = [Int16](scalars: "café") // → 🐞 Half of UCS-2
let d = [UInt16](scalars: "café") // → ⛔️ UCS-2
let e = [Int32](scalars: "café") // → Signed variant of UTF-32
let f = [UInt32](scalars: "café") // → UTF-32
let g = [Int64](scalars: "café") // → 🐞 This isn’t a thing
let h = [UInt64](scalars: "café") // → 🐞 UTF-64 does not exist

We already have String’s encoded views to do this sort of thing correctly.

let a = [UInt8]("café".utf8) // → ✅ UTF-8
let b = [UInt16]("café".utf16) // → ✅ UTF-16

Note how this makes the choice of encoding obvious, without sacrificing ergonomics.
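To make the contrast concrete, here is a small sketch in current Swift. The truncating map stands in for the hypothetical init(scalars:) discussed above; it is not an existing API.

```swift
// U+00E9 is written as an escape so the example does not depend on
// how the source file happens to encode or normalize "é".
let word = "caf\u{E9}"

// Truncating scalar values: the result only coincides with Latin-1 here.
let truncated = word.unicodeScalars.map { UInt8(truncatingIfNeeded: $0.value) }
print(truncated)         // [99, 97, 102, 233]

// Encoded views: the encoding is named at the call site.
print(Array(word.utf8))  // [99, 97, 102, 195, 169]
print(Array(word.utf16)) // [99, 97, 102, 233]
```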

johnno1962 · March 10, 2019, 5:35pm

But we aren’t headed in this direction are we?

(swift) '1'.value
<REPL Input>:1:5: error: value of type 'Character' has no member 'value'
'1'.value

Not at all keen on 'a'.asciiValue! as the alternative as it has a foot-gun if you leave the ! off.
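One plausible reading of that foot-gun, sketched in current Swift with double-quoted Character literals: because asciiValue is optional and Swift promotes the other operand, the comparison compiles either way and quietly changes meaning for non-ASCII input.

```swift
let byte: UInt8 = 0x61
let a: Character = "a"

// asciiValue is UInt8?, and `byte` is promoted to an optional,
// so this compiles with or without the force unwrap.
print(byte == a.asciiValue)   // true
print(byte == a.asciiValue!)  // true

// For a non-ASCII character the optional is nil: the comparison is
// silently false, while a force unwrap would trap at runtime.
let e: Character = "\u{E9}"
print(e.asciiValue == nil)    // true
print(byte == e.asciiValue)   // false
```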

I’m just slinging things against the wall and seeing if anything sticks.. Apparently not.

lorentey · March 10, 2019, 5:52pm

I do believe there is a clear and pressing need for Unicode scalar literals. Introducing them would be a huge leap for the ergonomics of dealing with encoded string data.

let a = 'A'     // inferred as Unicode.Scalar, *not* Character!
switch byte {
  case '0'.ascii ... '9'.ascii: // trapping property on Unicode.Scalar
    print("digit")
  default:
    print("not a digit")
}
let c: Character = 'A'     // Character works too but you need to spell out the type
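For comparison, the same switch can be written today with UInt8(ascii:), at the cost of more noise per case. The function name `classify` is just for illustration.

```swift
// Classify a byte using only APIs that exist in current Swift.
func classify(_ byte: UInt8) -> String {
    switch byte {
    case UInt8(ascii: "0")...UInt8(ascii: "9"):
        return "digit"
    default:
        return "not a digit"
    }
}

print(classify(0x35)) // digit
print(classify(0x2F)) // not a digit
```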

johnno1962 · March 10, 2019, 6:13pm

We need to make a decision whether the default type for character literals would be Unicode.Scalar or Character. Unicode.Scalar seems a strange choice but it’s up to the core team in the finish. @lorentey, would you accept some of the shorthand operators above (except for the array one, which I’m not at all attached to) if they all trapped on non-ASCII values?

extension Unicode.Scalar {
  // Subtract a scalar's ASCII value from an integer, e.g. `byte - '0'`.
  @_transparent
  @available(swift 5.0)
  public static func -<T: FixedWidthInteger> (lhs: T, rhs: Unicode.Scalar) -> T {
    _precondition(rhs.isASCII, "Only ASCII value accepted in this context")
    return lhs - T(rhs.value)
  }
  // Compare an integer with a scalar's ASCII value, e.g. `byte == ':'`.
  @_transparent
  @available(swift 5.0)
  public static func ==<T: FixedWidthInteger> (lhs: T, rhs: Unicode.Scalar) -> Bool {
    _precondition(rhs.isASCII, "Only ASCII value accepted in this context")
    return lhs == T(rhs.value)
  }
  // Allow `case ':':` when switching over an integer; uses the == above.
  @_transparent
  @available(swift 5.0)
  public static func ~=<T: FixedWidthInteger> (pattern: Unicode.Scalar, value: T) -> Bool {
    _precondition(pattern.isASCII, "Only ASCII value accepted in this context")
    return value == pattern
  }
}
// Allow `case '0'...'9':` when switching over an integer.
@available(swift 5.0)
public func ~=<T: FixedWidthInteger> (pattern: ClosedRange<Unicode.Scalar>, value: T) -> Bool {
  _precondition(pattern.lowerBound.isASCII && pattern.upperBound.isASCII,
                "Only ASCII value accepted in this context")
  return pattern.contains(Unicode.Scalar(UInt32(value))!)
}

ASCII is so pervasive I don’t think support for it is unreasonable.
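Outside the standard library, one of these operators can be tried as a plain free function. The following is only a sketch under that assumption: it uses the public precondition rather than the underscored helper, and spells scalars with today's double-quoted syntax.

```swift
// Hypothetical user-level version of the scalar-vs-integer pattern match.
// Traps if the pattern is not ASCII.
func ~= <T: FixedWidthInteger>(pattern: Unicode.Scalar, value: T) -> Bool {
    precondition(pattern.isASCII, "Only ASCII values are accepted in this context")
    return value == T(pattern.value)
}

let colon: Unicode.Scalar = ":"
let first: UInt8 = 0x3A
let second: UInt8 = 0x3B

print(colon ~= first)  // true
print(colon ~= second) // false
```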

lorentey · March 10, 2019, 6:32pm

I don’t like the idea of any implicit encoding, but if we are forced to select one, ASCII seems the least harmful choice.

But why is this so important? Is it really too much to ask to type byte == ':'.ascii instead of byte == ':'? The former seems vastly preferable to me in every way.

johnno1962 · March 10, 2019, 6:36pm

I’d probably agree, but that requires we choose Unicode.Scalar as the default type of character literals. Having an implied encoding shouldn’t be a problem if we restrict things to ASCII with the _preconditions. Convenience without the pernicious bugs.

lorentey · March 10, 2019, 6:54pm

I agree that the byte processing usecase is important enough to deserve some syntactic sugar. We should ensure that '\r'.ascii gives you the correct UInt8 value without type annotations, optional unwrapping or any other obstacles.

Dedicating single-quoted literals to Unicode scalars to make this work seems like a reasonable, low-effort choice. It also happens to be a great fit with the existing ExpressibleByUnicodeScalarLiteral protocol.

It does leave Character without a dedicated literal syntax. Is that bad? I haven’t seen many usecases for it in this thread.

johnno1962 · March 10, 2019, 6:58pm

I agree Unicode.Scalar is probably a more useful type than Character. Should we have different default types for 'a' and '👪', or would the latter revert to the mechanics of a String literal? Possible, but it sounds like conceptually dodgy ground.

lorentey · March 10, 2019, 7:08pm

If we go this way, '👪' should not work without an explicit Character annotation, and arguably not even with one.

let a = '👪' // error; did you mean `"👪" as Character`?
let b: Character = '👪' // questionable, but okay
let c: Character = "👪" // fine

Character is essentially a String of length one, containing an arbitrary number of Unicode scalars. String literal syntax is not inappropriate for it.
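A quick illustration of that point in current Swift: a single Character can hold more than one Unicode scalar. The examples use escapes so the scalar sequence is unambiguous.

```swift
let accented: Character = "e\u{301}"  // "e" + COMBINING ACUTE ACCENT
let crlf: Character = "\r\n"          // CR + LF form one grapheme cluster

print(accented.unicodeScalars.count)  // 2
print(crlf.unicodeScalars.count)      // 2
print(String(accented).count)         // 1
```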

taylorswift · March 10, 2019, 7:16pm

This seems like the worst of both worlds: you lose out on the compile-time validation and get saddled with unneeded domain widening, but don’t gain any clarity, as @xwu points out.

String literals are constructed at runtime due to ICU dependencies, so validating this is going to be pretty complicated. This seems like a perfect use case for the @expressible(none) compile-time literal attribute I proposed a few posts back, which takes a [Unicode.Scalar] array instead of a String. I’m sure you’re also aware of the pitfalls inherent in converting unicode-aware Strings into ASCII bytestrings.

Let’s not swat a fly with a nuclear warhead. I would hate to see people compiling regexes just to test if an ASCII byte is a digit or a letter.

I need to remind everyone that the first drafts of the proposal specified exactly this behavior, but there was a lot of pushback from people in favor of 'a' for Character literals. (read basically the 30 posts before the one i linked.) Backtracking on this is likely to bring a lot of the pro-Character literal people out of the woodwork to defend their syntax.

I have no objection to ':'.ascii, but there are a lot of practical challenges that would make it hard to make this API actually usable.

We can’t vend this on Character, because it would get too confusing to have an optional asciiValue and a trapping ascii value on the same type, and the latter seems to go against the spirit of what Character is trying to model.
We’re left with vending this on Unicode.Scalar, but we have to sacrifice Character literals to make this not require contortions like (':' as Unicode.Scalar).ascii. I would also say that many of the arguments against ascii on Character also apply to ascii on Unicode.Scalar. Unicode.Scalar can model 1,112,064 codepoints; it would be weird and against the spirit of the type to consider 1,111,936 of them “edge cases”, which is the assumption we make when we make something trapping instead of optional.
':'.ascii just doesn’t tell a great compile-time validation story. Of course, we could just special-case it and make this particular expression known to the compiler but that doesn’t sound particularly generalizable to me. I disagree with xwu’s assertion that compile-time validation should be heuristic and implicit. It’s far more useful to know when and where to trust the compiler to handle things so that I know to add in manual runtime validation (or static #asserts at the call-site) in the situations where it’s not.

johnno1962 · March 10, 2019, 7:17pm

What you are talking about is Unicode.Scalar literals, which I don’t have a problem with and which provide scope for more comprehensible error diagnostics when there is more than one Unicode.Scalar in the literal. It’s a bit untidy conceptually, as the elements of a String are Characters. Getting this to work would be difficult:

let b: Character = '👪' // questionable, but okay

Single quoted literals would be restricted to single Unicode.Scalar “strings”. We’d need to add a trapping .ascii property to Unicode.Scalar.

lorentey · March 10, 2019, 7:29pm

That would be very welcome! I’ve honestly seen very few arguments for reserving the single-quote shorthand to Character; I’d love to see some usecases for it that can compete with the crystal-clear urgency of '\r'.ascii.

We have just two flavours of stringy literals, but we have three stringy types. We need to make a choice.

taylorswift · March 10, 2019, 7:35pm

I’d read through the early discussion on the first pitch thread first. I think you have a good (but by no means new) argument, but i’d hate to re-litigate a fight that stalled the pitch for 7 months last year.

that is, if we’re discounting u'a', and the syntactical options that come with that. (ducks and hides)

nonsensery · March 10, 2019, 7:37pm

Not suggesting this is a good idea, but since there’s no need for an “empty character” literal, the language could use two single quotes as another delimiter.

let a = 'a' // Unicode.Scalar
let b = ''b'' // Character
let c = "c" // String

lorentey · March 10, 2019, 7:50pm

ASCII processing is not an edge case; it deserves some concessions. We already have UnicodeScalar.isASCII; adding a trapping .ascii property doesn’t seem problematic to me at all.

xwu · March 10, 2019, 7:56pm

Here is how I envision the proposed solution as discussed here. I think it should be on the whole fairly uncontroversial based on the degree of consensus we've already achieved:

Proposed solution

We would introduce a Unicode scalar literal as a single Unicode scalar surrounded by single quotation marks ('x').

The compiler will verify at compile time that the content of a Unicode scalar literal consists of one and only one Unicode scalar (without normalization). Note that this rule also precludes an empty Unicode scalar literal (i.e., '').

Types that conform to ExpressibleByUnicodeScalarLiteral but not ExpressibleByExtendedGraphemeClusterLiteral will show a deprecation warning when they are expressed using string literal syntax (with double quotation marks).

The default type of a Unicode scalar literal (i.e., UnicodeScalarLiteralType) will be Unicode.Scalar (aka UnicodeScalar).

Of course, types that conform to ExpressibleByExtendedGraphemeClusterLiteral (which include types that conform to ExpressibleByStringLiteral) necessarily conform to ExpressibleByUnicodeScalarLiteral. Therefore, they may also be expressed using the newly proposed Unicode scalar literal syntax: let x = '1' as Character. However, regardless of the type to which the literal value is coerced, the content of the literal will be verified at compile time to contain one and only one Unicode scalar.

To improve and streamline the syntax for obtaining the ASCII value of a Unicode scalar, the following API will be added:

extension Unicode.Scalar {
  @inlinable
  public var ascii: UInt8 {
    _precondition(value < 128)
    return UInt8(value)
  }
}
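Until something like this lands, the same behavior can be approximated in user code. The sketch below assumes nothing beyond current Swift: it uses the public precondition in place of the underscored helper, spells scalars with double quotes, and the property is a hypothetical stand-in rather than a standard library API.

```swift
extension Unicode.Scalar {
    // Hypothetical stand-in for the proposed property.
    // Traps when the scalar is outside the ASCII range.
    var ascii: UInt8 {
        precondition(isASCII, "Scalar is not ASCII")
        return UInt8(value)
    }
}

let cr: Unicode.Scalar = "\r"
print(cr.ascii)                       // 13
print(("A" as Unicode.Scalar).ascii)  // 65
```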

The initializer UInt8(ascii:) and the property Character.asciiValue will be deprecated in favor of this new API.

In the former case, the initializer becomes entirely redundant and is clearly a clumsier spelling once it is possible to spell a Unicode scalar using a literal without explicit coercion. (That is, UInt8(ascii: "1") is more ergonomic than ("1" as Unicode.Scalar).ascii, but '1'.ascii is more ergonomic than UInt8(ascii: '1').)

In the latter case, the property contains a pitfall as \r\n is a single character and is ASCII but does not have a single ASCII value; therefore, it is first normalized to \n, which is likely to be surprising and unexpected.
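The pitfall is easy to reproduce with the existing Character API:

```swift
let crlf: Character = "\r\n"

print(crlf.isASCII)                          // true
print(crlf.asciiValue as Any)                // Optional(10), i.e. "\n", not "\r"
print(crlf.unicodeScalars.map { $0.value })  // [13, 10]
```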

xwu · March 10, 2019, 7:59pm

The API would handle validation either at compile time or at runtime. I don't understand why you would add additional runtime validation if it's already validated at runtime, just not at compile time.

johnno1962 · March 10, 2019, 8:09pm

While I agree with the bulk of your assessment it seems regrettable that single quoted literals would not be Character literals. I’m not sure I see the necessity to restrict them to Unicode.Scalar when the alternative is simply to add a trapping .ascii property to Character and everybody is happy.

extension Character {
    var ascii: UInt8 {
        return asciiValue!
    }
}

This would be VERY inefficient if you look at the implementation.