SE-0243: Codepoint and Character Literals

Is it possible to not have it return an optional? By not being a property of Character, but a property that exists only on the character literal itself, and only if that literal is ASCII.

this sounds pretty magical. why not be explicit and just extend as syntax instead of abusing . notation?


I believe ASCII is important enough to deserve some dedicated API surface area. A trapping variant of asciiValue would be a much easier sell to me than 'A' as UInt8.

'a'.ascii // ⟹ 97
'á'.ascii // 💥 (possibly even a compile-time warning)

This would certainly simplify dealing with individual ASCII characters, to the point that they become convenient enough to eliminate the need for direct ASCII literals.

To represent ASCII byte sequences, we already have String.UTF8View:

"fred".utf8 // [102, 114, 101, 100]

It's a nice collection type and it should be possible to use it for convenient matching of ASCII byte strings. It has a huge advantage over array literals in that it supports the small string optimization.
Missing APIs can be added as needed if matching byte sequences isn't easy or fast enough in the current stdlib.
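For example, prefix matching and a naive search already work in Swift 5 against the .utf8 view; the names bytes and needle below are purely illustrative:

```swift
let bytes: [UInt8] = [102, 114, 101, 100, 33] // "fred!"

// Prefix-match an ASCII byte string without allocating an array:
let isFred = bytes.starts(with: "fred".utf8)

// Naive substring search for an ASCII needle (illustrative, not optimized):
let needle = Array("red".utf8)
let found = (0 ... bytes.count - needle.count).contains { start in
    bytes[start ..< start + needle.count].elementsEqual(needle)
}
```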

If people need a dedicated ASCII string type (for e.g. type safely marking certain APIs as requiring ASCII strings), we could even add a separate String.ASCIIView type:

"fred".utf8 // [102, 114, 101, 100]
"fred".ascii // [102, 114, 101, 100]

"frédi".utf8 // [102, 114, 195, 169, 100, 105]
"frédi".ascii // 💥 (possibly even a compile-time warning)

i don’t know if i like the idea of special-casing certain properties to have this behavior. We would probably want to do the same thing for other properties that have a similar issue like .first, and it would be confusing if some function calls were transparent and validated while others aren’t.

// what does this do?
let u:Unicode.Scalar = .init(value: 0xff_ff_ff_ff)
// this makes me expect a compiler guarantee 
let u:Unicode.Scalar = 0xff_ff_ff_ff as Unicode.Scalar // error

And if we do start implicitly validating, when do we stop? as has a clearly defined role: coerce a constant literal value on the left side to the type on the right side. What you’re suggesting is a lot more open-ended.

// if this gets validated
let s:UInt8 = 'a'.ascii
// does that mean this is too?
let t:Int   = [1, 2, 3, 4][i & 0b11]

It is already the case that some functions are validated at compile time. See SE-0213. In the long run, everything that can be validated at compile time within reasonable time, should be.



Neither do I, which is why I added the 'possibly'. Warnings like that are helpful in catching basic mistakes, but they wouldn't catch all cases. The runtime trap would.

However, making sure it's convenient to work with ASCII data is important enough to go the extra mile.

This is a fun sidetrack that we could argue over for days, but it seems minimally relevant. Do you have views about the usability of a trapping Character.ascii property or about using the .utf8 view to represent ASCII byte strings?

If we only cared about usability, Character.ascii would be perfectly fine. But we also care about whether the actual API makes sense. Do we really want trapping .ascii to be available on all Character values? Would this encourage people to write stuff like this?

func foo(_ character: Character) {
    // ... uses the trapping character.ascii somewhere inside
}

"júst á nörmál Swîfτ ßtríng".forEach(foo(_:))

The standard library uses trapping very sparingly, for certain operations where the failing case is too much of an edge case for it to be worth burdening the API with an Optional. I don’t think the Character '😻' meets that threshold. We emphasize Character’s grapheme-cluster nature so much it would be weird if we suddenly started considering emojis and non-ASCII codepoints “the unexpected case”.

Character.asciiValue returns an optional, and that’s exactly how it should be. This tells me that a computed property on Character is not the right solution.
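A minimal example of the optional-returning API as it exists today; the optional forces callers to handle non-ASCII characters explicitly instead of trapping:

```swift
let colon: Character = ":"
let heart: Character = "😻"

if let byte = colon.asciiValue {
    print(byte) // 58
}
print(heart.asciiValue as Any) // nil: the failure case is visible in the type
```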

The .utf8 view is definitely not the right way to represent ASCII bytestrings. This just seems like opening the door to strings like "aren’t", which look like ASCII but aren't (the curly apostrophe encodes as a multi-byte UTF-8 sequence).

Sure; I don't see why not! Any such misuse will result in clear traps, not mis-encoded text. The potential for harm is far less than in allowing character literals to be inferred as integer types.

But if people are generally happy to deal with unwrapping asciiValue's optional return value, then of course that's even better.

A significant portion of this proposal is about ergonomics enhancements to ASCII processing. It goes through some questionable contortions to allow the let a: UInt8 = 'A' syntax, ostensibly in the name of ASCII ergonomics. This increasingly looks like a mistake.

The simple, non-controversial addition of single-quoted literals for Character is already a huge boost to ASCII productivity, because it allows you to type 'A'.asciiValue to get 65. You can't do that today: the string literal gets inferred to String, which does not provide that API.

// Swift 5
"A".asciiValue // error: value of type 'String' has no member 'asciiValue'

// With the non-controversial parts of SE-0243:
'A'.asciiValue! // ⟹ 65✨ 
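For completeness, Swift 5 can already reach asciiValue by explicitly typing the literal as Character, though the coercion is noisier than the proposed single-quote syntax:

```swift
// Works in Swift 5 with an explicit Character coercion:
let a = ("A" as Character).asciiValue! // 65
let b = Character("A").asciiValue!     // 65
```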

It should be noted that Character.asciiValue does have one interesting quirk: it normalizes the CR+LF sequence (which is a single character!) to a single LF:

'\r\n'.asciiValue! // ⟹ 10 (?!)

This is arguably not right; migrating to another property could provide the opportunity to fix this. There is a fundamental underlying issue in that the Character → ASCII encoding mapping is not one-to-one.
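The quirk is easy to verify in today's Swift; both facts below (the single-grapheme count and the normalized asciiValue) hold in the current stdlib:

```swift
let crlf: Character = "\r\n"
print("\r\n".count)           // 1: CR+LF is a single grapheme cluster
print(crlf.asciiValue as Any) // Optional(10) in today's stdlib
```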

It would perhaps be better to work on the level of Unicode scalars instead, and have single-quoted literals default to Unicode.Scalar instead of Character:

let newline1 = '\n' // Unicode.Scalar
let newline2 = '\r\n' // error: '\r\n' is not a single Unicode scalar; did you mean to initialize a Character value?
let newline3: Character = '\r\n'

print(newline1.asciiValue) // Optional(10)

(Conveniently, the protocol is already called ExpressibleByUnicodeScalarLiteral. I'm sorry if this came up during the pitch; I did not have time to check.)

Yes, it's implicitly relying on UTF-8 being compatible with ASCII, and this may cause issues when it's used for encoding; but UTF8View seems perfectly serviceable to me when I'm looking for random pieces of ASCII text in a byte sequence.

What I was trying to get at is that it's not clear how the proposal's integer conformances would extend to cover a sequence of ASCII bytes. The examples I've seen aren't particularly convincing, to put it mildly.

let a: [UInt8] = ['1', '9', '9', '2', '-', '0', '8', '-', '0', '3']  // Yuck
let b = "1992-08-03".utf8   // Same thing, except it works in Swift 5 and requires no allocation

If you prefer, it would be certainly possible to add an ASCIIView that restricts itself to ASCII characters.


This is, in my view, a huge insight. Not only is compiler validation that something is a single Unicode scalar much more trivial, but Unicode scalars already have many APIs for conversion to their integer values.

let x = 'é'

// The following APIs already exist:
x.value   // 233
x.isASCII // false

This existing design also addresses the objection above that someone using asciiValue with a non-ASCII character might be thrown because they expect a value instead of nil. Meanwhile, users would still have a compile-time guarantee that the stuff within the single quotes isn't a decomposed form:

// Current syntax
let decomposed = "\u{0065}\u{0301}" // é
let y = Unicode.Scalar(decomposed)  // nil

// Possible future syntax
let z = '\u{0065}\u{0301}' // compile-time error
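These checks run in today's Swift with double-quoted literals; \u{E9} is used for the precomposed é to avoid any source-normalization ambiguity:

```swift
let precomposed = "\u{E9}" // é as a single scalar, U+00E9
let e = Unicode.Scalar(precomposed)! // failable init succeeds: exactly one scalar
print(e.value)   // 233
print(e.isASCII) // false

let decomposed = "\u{0065}\u{0301}" // "e" + combining acute accent
print(Unicode.Scalar(decomposed) as Any) // nil: two scalars, init fails
```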

IMO it'd even be fine, if we call that notation exclusively a "Unicode scalar literal," to ditch its use for Character entirely; indeed, it might be best given how each version of Unicode subtly refines what's considered a single extended grapheme cluster.

Down the line, if after the addition of these facilities it is found that additional specific support for ASCII is still necessary, we may consider an ASCII type:

// Typed freehand, for illustrative purposes only, not working code
struct ASCII {
  internal var _value: Builtin.Int7
  public var value: UInt8 {
    return UInt8(Builtin.extOrBitCast_Int7_Int8(_value))
  }
  public init(_ value: UInt8) {
    _precondition(value < 128)
    _value = Builtin.truncOrBitCast_Int8_Int7(value._value)
  }
}

extension ASCII: ExpressibleByUnicodeScalarLiteral { /* ... */ }
extension Array: ExpressibleByStringLiteral where Element == ASCII {
  /* ... */
  public var value: [UInt8] { /* ... */ }
}

// Usage
let x = 'a' as ASCII
let pngTag = "abcd" as [ASCII]
pngTag.value // [97, 98, 99, 100]
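Builtin types are only available inside the standard library, but the same idea can be sketched in user code today with UInt8 storage and double-quoted Unicode scalar literals (single-quoted literals do not exist yet); all names here are illustrative:

```swift
struct ASCII: Equatable {
    var value: UInt8
    init(_ value: UInt8) {
        // Enforce the 7-bit range at construction time.
        precondition(value < 128, "not an ASCII code unit")
        self.value = value
    }
}

extension ASCII: ExpressibleByUnicodeScalarLiteral {
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        precondition(scalar.isASCII, "not an ASCII scalar")
        self.init(UInt8(scalar.value))
    }
}

let x: ASCII = "a"
print(x.value) // 97

let pngTag: [ASCII] = ["a", "b", "c", "d"]
print(pngTag.map { $0.value }) // [97, 98, 99, 100]
```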

Regarding the actual quirk at hand for asciiValue, it is weird, but the strictly correct version where ("\r\n" as Character).asciiValue evaluates to nil is also...not good.


I'm also not sure it would be appropriate for Character to have an asciiValue property that returns an optional alongside an ascii property with the same behavior except that it traps. In a code-completion world with automatic conversion to optionals, this is asking for trouble.

An ascii property also only solves some of the issue, as you may still need to explicitly convert the integer type between UInt8 <-> Int8.


Partially off-topic, but you've just convinced me that I will never master Unicode. I had no idea that there were grapheme clusters composed entirely of multiple ASCII-range code points.


Now that the battle for integer convertibility seems pretty much lost, I wonder if adding a few well chosen operators to the standard library couldn’t scratch the itch the proposal was trying to address...

For array initialisation I’d suggest:

extension Array where Element: FixedWidthInteger {
    @available(swift 5.1)
    public init(scalars: String) {
        self = scalars.unicodeScalars.map { Element($0.value) }
    }
}

let hex = [Int8](scalars: "0123456789abcdef")

For comparison you could simply add:

@available(swift 5.1)
public func ==<T: FixedWidthInteger> (lhs: T, rhs: Unicode.Scalar) -> Bool {
    return lhs == rhs.value
}

if cString.advanced(by: 2).pointee == "b" { /* ... */ }

For use in switches the following is enough:

@available(swift 5.1)
public func ~=<T: FixedWidthInteger> (pattern: Unicode.Scalar, value: T) -> Bool {
    return pattern.value == value
}
@available(swift 5.1)
public func ~=<T: FixedWidthInteger> (pattern: ClosedRange<Unicode.Scalar>, value: T) -> Bool {
    return pattern.contains(Unicode.Scalar(UInt32(value))!)
}

let digit = UInt8(ascii: "1")
switch digit {
case "8":
    break // matched via the scalar ~= above
case "1" ... "2":
    break // matched via the range ~= above
default:
    break
}

Operators that might be useful can be added without opening the flood gates to nonsense expressions.

@available(swift 5.1)
public func -<T: FixedWidthInteger> (lhs: T, rhs: Unicode.Scalar) -> T {
    return lhs - T(rhs.value)
}


I would not be opposed to a few well vetted operators, but I would push back on the examples you give:

If we have '1' express a Unicode scalar as @lorentey suggests, many of the examples you give become dramatically simplified. I would be hesitant to further simplify x == '1'.value to x == '1', for a reason similar to the trouble we ran into adhering to SE-0213: namely, it becomes less clear whether a bare literal containing a digit expresses that number or its ASCII value. In principle, heterogeneous comparison operators that eliminate conversions are fine, but only where there is no possible confusion about how things are converted, and I do not think we meet that bar here.

As for '1'...'8', what you are showing here is a good role for regex literals, but this operator is not it. Specifically, an expression like 'A'...'z' would match '['. This invites user error. A regex literal is sorely needed, but this imitation of it is in my view misguided.

I am not sure why one would need 42 - '0'. I have often needed to offset to or from the ASCII value of '0' or 'a', say, to get the ASCII value of a nearby character, but that is for lack of better facilities that we are attempting to design here, not an end in itself.
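For reference, the offsetting idiom is already expressible in Swift 5 via UInt8(ascii:), which takes a Unicode.Scalar and makes the encoding explicit:

```swift
let byte: UInt8 = 0x35 // the ASCII code for "5"

// Today's spelling of the "subtract the code for 0" idiom:
let digit = byte - UInt8(ascii: "0")
print(digit) // 5
```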


We should rather fix the cases where you need to deal with signed bytes. Swift APIs have pretty much standardized on using UInt8. The platform-dependent signedness of CChar is terribly unhelpful.


As @xwu noted, these particular examples share the same problem as ’a’ as UInt8: they use unspecified encodings, so they invite mistaken assumptions, and, ultimately, bugs.

As I have tried to explain during the pitch, Unicode.Scalar.value is almost never helpful when dealing with encoded data. Truncating it to fit a narrow integer type results in a random hodgepodge of misfit encodings, most of which would make truly terrible defaults:

let a = [Int8](scalars: "cafe")   // → ASCII
let b = [UInt8](scalars: "café")  // → ⛔️ Latin-1
let c = [Int16](scalars: "café")  // → 🐞 Half of UCS-2
let d = [UInt16](scalars: "café") // → ⛔️ UCS-2
let e = [Int32](scalars: "café")  // → Signed variant of UTF-32
let f = [UInt32](scalars: "café") // → UTF-32
let g = [Int64](scalars: "café")  // → 🐞 This isn’t a thing
let h = [UInt64](scalars: "café") // → 🐞 UTF-64 does not exist

We already have String’s encoded views to do this sort of thing correctly.

let a = [UInt8]("café".utf8)   // → ✅ UTF-8
let b = [UInt16]("café".utf16) // → ✅ UTF-16

Note how this makes the choice of encoding obvious, without sacrificing ergonomics.
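The encodings are easy to verify; the \u{E9} spelling below pins down the precomposed form of é:

```swift
let word = "caf\u{E9}" // "café" with a precomposed é, U+00E9
let utf8Bytes  = [UInt8](word.utf8)
let utf16Units = [UInt16](word.utf16)
print(utf8Bytes)  // [99, 97, 102, 195, 169]  (é is the two-byte sequence C3 A9)
print(utf16Units) // [99, 97, 102, 233]       (é is the single unit 0x00E9)
```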


But we aren’t headed in this direction are we?

(swift) '1'.value
<REPL Input>:1:5: error: value of type 'Character' has no member 'value'

Not at all keen on 'a'.asciiValue! as the alternative as it has a foot-gun if you leave the ! off.

I’m just slinging things against the wall and seeing if anything sticks.. Apparently not.


I do believe there is a clear and pressing need for Unicode scalar literals. Introducing them would be a huge leap for the ergonomics of dealing with encoded string data.

let a = 'A'     // inferred as Unicode.Scalar, *not* Character!
switch byte {
case '0'.ascii ... '9'.ascii: // trapping property on Unicode.Scalar
    print("digit")
default:
    print("not a digit")
}
let c: Character = 'A' // Character works too, but you need to spell out the type

We need to decide whether the default type for character literals should be Unicode.Scalar or Character. Unicode.Scalar seems a strange choice, but it's up to the core team in the end. @lorentey, would you accept some of the shorthand operators above (except for the array one, which I'm not at all attached to) if they all trapped on non-ASCII values?

extension Unicode.Scalar {
  @available(swift 5.0)
  public static func -<T: FixedWidthInteger> (lhs: T, rhs: Unicode.Scalar) -> T {
    _precondition(rhs.isASCII, "Only ASCII value accepted in this context")
    return lhs - T(rhs.value)
  }
  @available(swift 5.0)
  public static func ==<T: FixedWidthInteger> (lhs: T, rhs: Unicode.Scalar) -> Bool {
    _precondition(rhs.isASCII, "Only ASCII value accepted in this context")
    return lhs == T(rhs.value)
  }
  @available(swift 5.0)
  public static func ~=<T: FixedWidthInteger> (pattern: Unicode.Scalar, value: T) -> Bool {
    _precondition(pattern.isASCII, "Only ASCII value accepted in this context")
    return value == pattern
  }
}

@available(swift 5.0)
public func ~=<T: FixedWidthInteger> (pattern: ClosedRange<Unicode.Scalar>, value: T) -> Bool {
    precondition(pattern.lowerBound.isASCII && pattern.upperBound.isASCII,
                 "Only ASCII value accepted in this context")
    return pattern.contains(Unicode.Scalar(UInt32(value))!)
}

ASCII is so pervasive I don’t think support for it is unreasonable.

I don’t like the idea of any implicit encoding, but if we are forced to select one, ASCII seems the least harmful choice.

But why is this so important? Is it really too much to ask to type byte == ':'.ascii instead of byte == ':'? The former seems vastly preferable to me in every way.


I’d probably agree, but that requires choosing Unicode.Scalar as the default type of character literals. Having an implied encoding shouldn’t be a problem if we restrict things to ASCII with the _precondition checks. Convenience without the pernicious bugs.