Single Quoted Character Literals (Why yes, again)

ksluder · December 8, 2022, 10:36pm

I wouldn’t necessarily even call this an issue of form or procedure. It’s a matter of listening to the decision makers and then working with them and others on an improved proposal. That’s just called “working in a group”.

As @beccadax elaborated, no decisions—not even technical ones—are made based solely on evaluation of blind submissions. Decisionmaking is a social process in any organization, and it’s usually counterproductive to dismiss the feedback of those whom the organization has vested with decisionmaking authority. The more productive approaches are to either incorporate that feedback, or to gather sufficient support from trusted voices in the organization to lobby against that feedback.

johnno1962 · December 8, 2022, 10:38pm

Here here. This pitch has been in development for 4 years. One wonders where the finish line is.

taylorswift · December 8, 2022, 11:11pm

beccadax:

The way this section is written is incredibly confusing—I actually read your protocol hierarchy backwards at first (because inheritance is usually drawn the other way, with more-derived classes below less-derived classes!) and spent half an hour writing increasingly confused critiques of the backwards design. Even now that I've figured that out, though, I still don't quite see how this design is supposed to work, particularly in terms of initializing types that only conform to the marker protocols.

To get everyone on the same page, I strongly recommend you write the actual declarations for the new protocols along with first cuts at their doc comments, and describe any modifications to the semantics of existing protocols. (For instance, which protocols imply support for which syntaxes?) I would also write examples of the code the compiler should generate when a single-quoted literal is used for a type that only conforms to the new protocols. (Just Swift expressions—the equivalent of saying ""a" as Unicode.Scalar lowers to Unicode.Scalar(unicodeScalarLiteral: 97)"). And I would specify which types will gain conformances to which protocols.

unless i am misunderstanding the proposal in its newest iteration, @_marker protocols cannot declare requirements, so user-defined types cannot implement ExpressibleBySingleQuotedLiteral alone; the conformances for Unicode.Scalar, Character, UInt8, etc would have to be baked into the compiler, or rely on ExpressibleByUnicodeScalarLiteral.

from what i recall during the first review, one of the more widespread criticisms of the original proposal was this:

One concern raised during the review was that because ExpressibleByStringLiteral refines ExpressibleByExtendedGraphemeClusterLiteral, then type context will allow expressions like 'x' + 'y' == "xy".

which does not coexist happily with 'x' + 'y' == 241.

with that in mind, could we simply create a new, unrelated hierarchy for ExpressibleByCharacterLiteral? (which is a serendipitously unclaimed name in the standard library.)

@_marker
protocol _ExpressibleByBuiltinCharacterLiteral
{
}

extension Unicode.Scalar:_ExpressibleByBuiltinCharacterLiteral {}
extension Character:_ExpressibleByBuiltinCharacterLiteral {}

protocol ExpressibleByASCIILiteral
{
    init(asciiLiteral:UInt8)
}

protocol ExpressibleByCharacterLiteral:ExpressibleByASCIILiteral
{
    associatedtype CharacterLiteralType:_ExpressibleByBuiltinCharacterLiteral
    init(characterLiteral:CharacterLiteralType)
}
extension ExpressibleByCharacterLiteral
    where CharacterLiteralType == Unicode.Scalar
{
    init(asciiLiteral:UInt8)
    {
        self.init(characterLiteral: .init(asciiLiteral))
    }
}
extension ExpressibleByCharacterLiteral
    where CharacterLiteralType == Character
{
    init(asciiLiteral:UInt8)
    {
        self.init(characterLiteral: .init(.init(asciiLiteral)))
    }
}

extension UInt8:ExpressibleByASCIILiteral
{
    init(asciiLiteral:UInt8) { self = asciiLiteral }
}
extension Unicode.Scalar:ExpressibleByCharacterLiteral
{
    init(characterLiteral:Self) { self = asciiLiteral }
}
extension Character:ExpressibleByCharacterLiteral
{
    init(characterLiteral:Self) { self = asciiLiteral }
}

the key thing to note here is that String does not conform to ExpressibleByCharacterLiteral. so we would not have the situation where 'x' + 'y' == "xy" can occur.

ExpressibleByExtendedGraphemeClusterLiteral and ExpressibleByUnicodeScalarLiteral could then continue to exist unchanged with the double-quoted syntax, and the language could deprecate them at whatever pace people are comfortable with, which may very well be “never”.

behavioral changes i can forsee:

Basic type identities

('€')                   → ('€' as Character)

// compilation error
('€' as String)         → Never 

("1" + "1")             → ("ab" as String)

// compilation error, because `+ (lhs:String, rhs:Character)` does not exist
("1" + '€')             → Never 

// compilation error, because `+ (lhs:Character, rhs:Character)` does not exist
('1' + '1' as String)   → Never

// compilation error, because `UInt8` is not implicitly convertible to `Int`
('1' + '1' as Int)      → Never

Initializers of integers

Int.init("0123")        → (123 as Int?)
// compilation error, because `Int.init(_:Character)` does not exist
// compilation error, because `Int.init(_:Unicode.Scalar)` does not exist
// compilation error, because `Int.init(_:UInt8)` exists but `'€'` is not ASCII
Int.init('€')           → Never

Int.init('3')           → Int.init(51 as UInt8) → (51 as Int)
(['a', 'b'] as [Int8])  → ([97, 98] as [Int8])

More arithmetic

('a' + 1)           → (98 as UInt8)
('b' - 'a' + 10)    → (11 as UInt8)
// runtime error, from integer overflow
('a' * 'b')         → Never
("123".firstIndex(of: '2')) → (String.Index.init(_rawBits: 65799) as String.Index?)

johnno1962 · December 8, 2022, 11:25pm

TBH I never had a problem with that. To me it seems logical. Strings are made up of Characters concatenated, A Character is itself a (short) String. I don't recommend implementing a new protocol hieracy just to avoid this. There is no reason this change should be affect ABI.

I would accept is more problematic if you're not expecting it but difficult to avoid if one wants to offer other more useful forms of code point arithmetic. Having both exiting at the same time depending on type context is confusing if you seek out problems but the simple case of integer conversions is at least simple.

taylorswift · December 8, 2022, 11:28pm

this does not exist in the standard library today, "x" + "y" desugars to ("x" as String) + ("y" as String).

the following is not valid swift:

let xy:String = ("x" as Character) + ("y" as Character)
// cannot convert value of type 'Character' to expected argument type 'String'

johnno1962 · December 8, 2022, 11:32pm

You can't convert a Character to String but a Character literal can express a String. In the case of 'x' + 'y', the literals are both expressing strings as that is the operator that is available.

taylorswift · December 8, 2022, 11:37pm

which is why i suggested not injecting things into the root of the ExpressibleByUnicodeScalarLiteral hierarchy, and instead having a parallel hierarchy that String would not conform to.

i think it makes sense that

static func + (lhs:Character, rhs:Character) -> String

does not exist, because it is analogous to

static func + (lhs:Int, rhs:Int) -> [Int]

moreover, i anticipate that people would not like

Int.init('3') → Int.init(51 as UInt8) → (51 as Int)

if '3' were capable of expressing a String, because we would expect Int.init("3") to return 3 as Int and not 51 as Int.

johnno1962 · December 9, 2022, 12:02am

This is the difference in emphasis between the new pitch and the old. Single Quoted literals now need to be considered more integer-like than character-like though they can take that role which is also why breaking the proposal in two will present problems.

Separating the hierarchy and introducing new types or protocols i.e. breaking ABI will require people to update their users operating systems before they can use the feature which I'm keen to avoid.

ensan-hcl · December 9, 2022, 12:20am

I believe ASCII arithmetics should be implemented not as UInt8, but as ASCII type. So that we can write 'x' + 'y' == ASCII(241), which reads clearer.

taylorswift:

@inlinable public mutating
func next() -> UInt8?
{
    while let digit:UInt8 = self.iterator.next()
    {
        switch digit 
        {
        case '0' ... '9':   return digit      - '0'
        case 'a' ... 'f':   return digit + 10 - 'a'
        case 'A' ... 'F':   return digit + 10 - 'A'
        default:            continue
        }
    }
    return nil
}

You can write like this.

@inlinable public mutating
func next() -> UInt8?
{
    while let digit = self.iterator.next()
    {
        let asciiCharacter = ASCII(digit)
        switch asciiCharacter 
        {
        case '0' ... '9':   return asciiCharacter      - '0'
        case 'a' ... 'f':   return asciiCharacter + 10 - 'a'
        case 'A' ... 'F':   return asciiCharacter + 10 - 'A'
        default:            continue
        }
    }
    return nil
}

taylorswift · December 9, 2022, 12:24am

i think that making single-quoted literals more integer-like necessitates making them less string-like, because trying to have them do both is generating difficulties that @xwu and @beccadax have highlighted.

and i think that is okay, because we do not currently consider 1 and [1] to be interchangeable, and the fact that "1" as Character looks like "1" as String is a historical oddity that arose from Swift using the " delimiter for both types.

which i think is a huge argument in favor of using ' for Character, Unicode.Scalar, and something that would unblock expressing UInt8 with them.

if we want single-quoted literals to be string-like, then Int.init('3') will become a problem because:

T.init(_:T) disregards default literal inference, and infers T (SE-2013)
we want Int to be expressible by '3'
we want String to be expressible by '3'
we already have Int.init(_:String)

these four things cannot all be true at the same time, and 1 and 4 are a fact of the language.

in my view, 3 is not needed, and is not really consistent with the first of the two major goals of the proposal, which is to have a separate syntax for characters that does not collide with the one we use for strings, and it actively undermines the second major goal of the proposal (integer coercions).

so i really think we could reach a broader “consensus” if we just accept that we will write

let x:String = .init('a')

the same way we write

let x:[Int] = [1]

taylorswift · December 9, 2022, 12:31am

this was floated during the first review, but having an ASCII type would create more problems than it would solve, because it would not be compatible with:

String.UTF8View
UnsafeBufferPointer<UInt8>
UnsafeRawBufferPointer

all of which have an element type of UInt8, and which would exclude 99% of the cases where ASCII literals would be used.

moreover ASCII implies that the codepoint is less than 0x80, which makes it impossible to inter-operate with UTF-8 strings.

allevato · December 9, 2022, 2:35am

What use case is served by adding two character literals and getting an integer back?

The way I see it, characters are a lot like dates—it makes sense to talk about advancing or backtracking a character by some distance:

let f = 'a' + 5
let t = 'z' - 6

and to compute the distance between two characters:

print('f' - 'a')  // 5

Ranges and comparisons naturally fall out of these relationships, as well.

But like adding two dates, 'x' + 'y' for two arbitrary characters seems nonsensical, only occurring as a side effect of an integer conversion. But what does that integer result actually mean, and why would we ever want to support it? Even if we supported single character literals and the operations I mentioned above, I don't think we want Swift users to be writing 'x' + 'y' in their own code.

johnno1962 · December 9, 2022, 9:09am

I think I need to grab the reins and try to wrestle this pitch back on track after some of the digressions yesterday evening. The first thing I want to say is the #1 guiding principle for the implementation from which the proposal is derived is that it is a compile time feature and does not require any new support from the language runtime. As such, new protocols, conformances or types and re-inventing or coming up with a parallel treatment of single Quoted literals is out of the question. This also has the advantage that it restricts the space inside which the proposal can iterate otherwise the degrees of freedom are so great there would be little hope of gathering any consensus. It also re-uses some critical well exercised code.

Taking this approach leaves the implementation with two identities that people seem to fixate on: 'x' + 'y' == "xy" and 'x' + 'y' == 241. The core team touched on the former in their decision noting:

I would go further and say this isn't a bug but a feature in that it seems reasonable to me that it should be possible to compose a String as the concatenation of character literals.

Looking at the ExpressibleBy hierarchy enhanced by this pitch and turning it the right way up and underscoring the marker protocols, the following is envisaged:

@_marker _ExpressibleByASCIILiteral
   ↳ @_marker _ExpressibleBySingleQuotedLiteral
     ↳ ExpressibleByUnicodeScalarLiteral
       ↳ ExpressibleByExtendedGraphemeClusterLiteral
         ↳ ExpressibleByStringLiteral

Given how type checking interacts with the protocols and literals the implication of a literal that satisfies one of these protocols is that it also satisfies all those below it. i.e. an ASCIILiteral can be a SingleQuotedLiteral which can be a UnicodeScalarLiteral and so on down to it being valid as a String. Unless we are prepared to revisit how these protocols are applied in the compiler there is nothing you can do about the identity above.

Defining these new marker protocols (which don't really exist outside the compiler and therefore don't affect ABI) allows us to target the integer conformances in a way that was not possible with that previously reviewed implementation and the following issue which was one motivation for separating out the integer conformances is less relevant (though they can still be separated out.)

With regard to the second identity if you introduce the integer conformances 'x' + 'y' == 241 and related absurdities it is only possible to defend this by mentioning C and Java have co-existed with this tenant for decades and it hasn't been reported to be an issue to my knowledge. It's testament to the power of the Swift String model's abstraction that people can no longer see past it.

More than anything else I am interested in keeping this pitch be as limited in scope as possible i.e. practically implied by the existing means literals are type checked. If we open the door to major rethinks of how things work we are lost and will only open the door to increased complexity of the implementation, subtle regressions, loss of consensus and failure to get anything actually delivered.

xwu · December 9, 2022, 1:23pm

I think the feedback you're getting is that an adequate exploration of the design space (which is what the pitch phase is about) will need to wrestle with these options. I don't see how it can be justified to start with ruling these options out as the "#1 guiding principle." Clearly, there is an appetite to discuss these topics.

This is the core team's feedback on moving forward that point (emphasis mine):

xwu · December 9, 2022, 1:43pm

I'm not sure I clearly recall this part of the discussion terribly well.

I think an ASCII type that is layout-compatible with UInt8 (which it would be) could interoperate quite smoothly with String and (particularly raw) buffer types. Note also that, in the intervening years, we've developed new expressivity in Swift such as wrappers. Imagine being able to mark a parameter as taking an @ASCII-wrapped UInt8, such that the caller just passes in an UInt8 but you get to manipulate it through ASCII-oriented APIs.

Limiting the type to actual ASCII code points would be advantageous for users who actually mean ASCII, in the usual way that making invalid states impossible is advantageous in general.

Additionally, if we wanted to be ambitious here, a Swift standard library ASCII type can use a 7-bit built-in for storage. We could explore, then, making Optional<ASCII> also layout-compatible with UInt8 in such a way that every UTF-8 buffer is reinterpretable as a buffer of optional ASCII code points. Nice—maybe too clever by half though.

I should say that I do recall exploring this angle myself some years ago and coming away a little disappoint at the ergonomics of an ASCII type. However, I just went back and sketched out a rough implementation, then played with it to see what it would be like to use by rewriting one of your example switches (slightly modified for simplicity)—and to see what the codegen would be like. It was pretty nice actually:

// Implementation
// Warning: not thoroughly tested; do not copy-and-paste for production
@frozen
public struct ASCII: ExpressibleByUnicodeScalarLiteral {
  @usableFromInline
  internal let _value: UInt8

  @inlinable
  public init(_ scalar: Unicode.Scalar) {
    precondition(scalar.isASCII)
    _value = UInt8(truncatingIfNeeded: scalar.value)
  }

  @inlinable
  public init?<T: StringProtocol>(_ string: T) {
    var it = string.utf8.makeIterator()
    guard let first = it.next(), first < 128, it.next() == nil else {
      return nil
    }
    _value = first
  }

  @_transparent
  public init(_ value: UInt8) {
    precondition(value < 128)
    _value = value
  }

  @_transparent
  public init(_ value: Int8) {
    precondition(value >= 0)
    _value = UInt8(bitPattern: value)
  }

  @inlinable
  public init<T: BinaryInteger>(_ value: T) {
    precondition(value >= 0 && value < 128)
    _value = UInt8(truncatingIfNeeded: value)
  }

  @inlinable
  public init(unicodeScalarLiteral scalar: UnicodeScalar) {
    self.init(scalar)
  }
}

extension ASCII: Equatable {
  @_transparent
  public static func == (lhs: ASCII, rhs: ASCII) -> Bool {
    lhs._value == rhs._value
  }
}

extension ASCII: Hashable { }

extension ASCII: Comparable {
  @_transparent
  public static func < (lhs: ASCII, rhs: ASCII) -> Bool {
    lhs._value < rhs._value
  }
}

extension ASCII: Strideable {
  @inlinable
  public func distance(to other: ASCII) -> Int8 {
    Int8(bitPattern: other._value) &- Int8(bitPattern: _value)
  }

  @inlinable
  public func advanced(by n: Int8) -> ASCII {
    ASCII(Int8(bitPattern: _value) + n)
  }
}

extension ASCII: Sendable { }

extension ASCII: CustomStringConvertible {
  public var description: String { String(_value) }
}

Use:

func f(_ digit: UInt8) -> Int8 {
  let ascii = ASCII(digit)
  switch ascii {
  case "A"..."Z":
    return ascii.distance(to: "A")
  case "a"..."z":
    return ascii.distance(to: "a") + 26
  case "0"..."9":
    return ascii.distance(to: "0") + 52
  case "+":
    return 62
  case "/":
    return 63
  default:
    fatalError()
  }
}

func g(_ digit: UInt8) -> Int8 {
  switch digit {
  case 0x41 ... 0x5a: // A-Z
      return Int8(digit) - 0x41
  case 0x61 ... 0x7a: // a-z
      return Int8(digit) - 0x61 + 26
  case 0x30 ... 0x39: // 0-9
      return Int8(digit) - 0x30 + 52
  case 0x2b: // +
      return 62
  case 0x2f: // /
      return 63
  default:
      fatalError()
  }
}

To me, f is definitely more readable than g; by inspection of the optimized output via godbolt.org, it produces close-to-equivalent results. Left as a subsequent exercise would be working out how f would look if ASCII could be a wrapper on the input parameter.

johnno1962 · December 9, 2022, 2:05pm

Yes this is indeed the main alternative if I was more ambitious. You'd need to define new operators for everything but at least you'd have control over which were deemed acceptable, but another type + operators slows down type checking slightly so you don't get it for free. But for me the main fail here is that in order to use this new feature however it was surfaced (would it be the default type of Single Quoted literals?) your users would have all needed to update their operating system to the runtime containing the new type. I am strenuously trying to avoid such a constraint. This is why I imposed my design rule #1 you mention above.

A more "adequate" exploration to what end? People are asking me to fix things that I don't view as a problem (A single quoted literal should be able to act as a String) and the price for fixing it is very high indeed (see my note about design rule #1 above.)

We're going round in circles on integer conversions. Some people will never find them acceptable as they allow arithmetic nonsense that has been a feature of other languages for years. Perhaps a middle ground where Single Quoted literals are reviewed and implemented but to get the integer conversions people have to opt in by say, loading a package will be the outcome though this was viewed as being bad practice and highly unacceptable in the last review (even if it was for different reasons - ABI issues with conformance availability.)

johnno1962 · December 9, 2022, 2:18pm

Thanks very much for the review @beccadax, it's clear you have a keen grasp of the dilemmas this proposal faces. The more useful feature (integer conversion) is by far the most contentious and as a result of the previous review is stuck behind a less pressing improvement to the language (single quoted literals in their own right). I wouldn't press to deprecate double quoted syntax for Character and Unicode.Scalar values however as I view the two as being essentially equivalent and if I have my way single quoted literals will have other semantics (the conversions) that others may wish to avoid.

ksluder · December 9, 2022, 4:14pm

You’re digging in your heels here. It’s not surprising that you’re not getting anywhere with this strategy, because your arguments simply aren’t strong enough, and you don’t have any political capital to leverage.

You haven’t pointed to any real-world projects that would benefit from your pitch. You haven’t showed how they would be harmed by the alternatives being raised in this thread. You haven’t addressed any adjacent problem spaces where different tradeoffs might be more beneficial. You’ve merely presented a “take it or leave it” proposal. The default answer to any such proposal is “leave it”.

xwu · December 9, 2022, 4:16pm

Back-deployment would be a slog but not impossible. That said, as a general rule, new features on ABI-stable platforms often require a new runtime, and eschewing the best design for a feature to avoid this is letting the tail wag the dog, or putting the cart before the horse, etc. In general this would not be how we'd approach this unless there's some overriding reason why the feature only makes sense if back-deployable.

Note that nothing about what I've written on the ASCII type requires a new literal, further going to the point that a new literal needs to be motivated independently.

johnno1962 · December 9, 2022, 5:43pm

But why not set out to avoid these problems if it is possible? What best design?