Single Quoted Character Literals (Why yes, again)

johnno1962 · December 12, 2022, 8:42pm

Please don't be melodramatic, nobody is asking you to make that choice. Introducing this accommodation for a particular type of coding in Swift does not deprecate any other aspect of the powerfully abstract Swift String model. The suggested feature is opt-in for those that need it.

JohnBlackburne · December 12, 2022, 9:22pm

It doesn't work like that though. If a feature is added to the language then people will use it. Then whether I want to or not I will encounter it – in sample code, in Swift library code, in code written by other people that I'm asked to work on, have to reason about.

ensan-hcl · December 12, 2022, 11:55pm

I don't understand why we need an expression like 'a' + 1 in the first place. Wouldn't an expression like 'a'.asciiOffset(1) be sufficient? And if we do so we can avoid 'x' + 'y' from the beginning.
Such operations appear natural because our expected alphabetical order happens to match the order of the ASCII codes. Expressions like '*' + '1' are never natural. I think using a method is clearer, since it makes explicit that ASCII order is being used.
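
A minimal sketch of what such a method might look like follows. Swift has no single-quoted literals today, so the example hangs a hypothetical `asciiOffset(_:)` on `Unicode.Scalar`; the name comes from the post above, but the signature, the optional return and the range checks are assumptions made for illustration.

```swift
extension Unicode.Scalar {
    /// Hypothetical helper (not part of the standard library): returns the
    /// ASCII scalar `offset` positions away, or nil if either end leaves ASCII.
    func asciiOffset(_ offset: Int) -> Unicode.Scalar? {
        // Reject non-ASCII receivers and offsets that could never stay in range.
        guard isASCII, (-127...127).contains(offset) else { return nil }
        let shifted = Int(value) + offset
        guard (0..<128).contains(shifted) else { return nil }
        return Unicode.Scalar(UInt8(shifted))
    }
}

let a: Unicode.Scalar = "a"
let next = a.asciiOffset(1)          // the scalar after "a" in ASCII order
let outOfRange = a.asciiOffset(100)  // nil: 97 + 100 falls outside ASCII
```

Because the offset is always an integer, there is no way to spell an addition of two characters with this shape of API.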

taylorswift · December 13, 2022, 12:37am

i’ve gone back and reworked the proposal based on the feedback from this thread. since it is a significant departure from the proposal in its current form, i decided to post it as a new thread. here is the link for anyone interested:

johnno1962 · January 22, 2024, 3:39pm

Unfortunately this pitch wandered into a legislative labyrinth that I don't have the wit to find the way out of, nor the wisdom to know when to give up. So.. I've started casting about for alternatives and found an extension and a new Array initialiser that I believe solve the bulk of what I was looking to achieve.

First, looking at the "awkward code" @beccadax mentioned:

If you define just this one simple extension: (Edited)

extension FixedWidthInteger {
    /// Basic equality operators
    @_transparent
    public static func == (i: Self, s: Unicode.Scalar) -> Bool {
        return i == s.value
    }
    @_transparent
    public static func != (i: Self, s: Unicode.Scalar) -> Bool {
        return i != s.value
    }
    /// Used in switch statements
    @_transparent
    public static func ~= (s: Unicode.Scalar, i: Self) -> Bool {
        return i == s.value
    }
    /// Maybe useful now and then
    @_transparent
    public static func - (i: Self, s: Unicode.Scalar) -> Self {
        return i - Self(s.value)
    }
}

The code could be transformed from:

    switch self.previous {
    case UInt8(ascii: " "), UInt8(ascii: "\r"), UInt8(ascii: "\n"), UInt8(ascii: "\t"), // whitespace
      UInt8(ascii: "("), UInt8(ascii: "["), UInt8(ascii: "{"),            // opening delimiters
      UInt8(ascii: ","), UInt8(ascii: ";"), UInt8(ascii: ":"),              // expression separators
      0:                          // whitespace / last char in file
      return false

to

    switch self.previous {
    case " ", "\r", "\n", "\t", "(", "[", "{", ",", ";", ":",              // whitespace, opening delimiters, expression separators
      0:                          // whitespace / last char in file
      return false
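
To try the transformation in isolation, here is a self-contained sketch. It narrows the extension to `UInt8` and keeps only two of the operators; `isExpressionBoundary` is a hypothetical function name standing in for the lexer code, not something taken from swift-syntax.

```swift
extension UInt8 {
    /// Lets a Unicode scalar literal appear as a `case` pattern for a byte.
    static func ~= (pattern: Unicode.Scalar, value: UInt8) -> Bool {
        return UInt32(value) == pattern.value
    }
    /// Lets a byte be compared directly with a Unicode scalar literal.
    static func == (lhs: UInt8, rhs: Unicode.Scalar) -> Bool {
        return UInt32(lhs) == rhs.value
    }
}

/// Hypothetical stand-in for the lexer check quoted above.
func isExpressionBoundary(_ previous: UInt8) -> Bool {
    switch previous {
    case " ", "\r", "\n", "\t", "(", "[", "{", ",", ";", ":", 0:
        return true
    default:
        return false
    }
}

let bytes = Array("f(x)".utf8)
let secondIsParen = bytes[1] == "("   // uses the == overload
```

The literals are typed as `Unicode.Scalar` only because these overloads exist; without them the `case` patterns would not type-check against a `UInt8`.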

This would be a candidate for inclusion in the standard library IMHO, as it is an additive change so finely targeted that it shouldn't cause collateral damage to the language. The compiler tests run through with only 12 failures, all in tests for invalid code where the diagnostic changed. I have further tested 1000 or so Swift packages from the Swift Package Index and didn't see problems.

Coming full circle on how these pitches started out:

It may be better to simply introduce a new initialiser on Arrays of FixedWidthIntegers; something along the lines of:

extension Array where Element: FixedWidthInteger {
    /// Initialise an Integer array of "characters"
    @inline(__always)
    public init(unicode: String, default: UInt32 = 0) {
        self.init(unicode.map {
            let scalars = $0.unicodeScalars
            if scalars.count == 1,
                let v = scalars.first?.value,
                v <= Element.max {
                return Element(v)
            }
            return Element(`default`)
        })
    }
}

So the code above would become:

let hexcodes = [UInt8](unicode: "0123456789abcdef")
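
The fallback behaviour is easier to see with a few non-ASCII characters. The block below is a self-contained variant of the initialiser, reworked for illustration: the `Element`-typed default and the use of `init(exactly:)` for the range check are assumptions of this sketch, not the posted code.

```swift
extension Array where Element: FixedWidthInteger {
    /// Sketch of an "array of character codes" initialiser.
    /// Characters that are not a single scalar, or whose scalar value does
    /// not fit in `Element`, are replaced by `fallback`.
    init(unicode: String, default fallback: Element = 0) {
        self = unicode.map { (character: Character) -> Element in
            let scalars = character.unicodeScalars
            guard scalars.count == 1,
                  let scalar = scalars.first,
                  let narrowed = Element(exactly: scalar.value) else {
                return fallback
            }
            return narrowed
        }
    }
}

let hexcodes = [UInt8](unicode: "0123456789abcdef")
// "a" fits, U+00E9 fits in a UInt8, U+20AC does not and becomes the default (63, "?").
let mixed = [UInt8](unicode: "a\u{E9}\u{20AC}", default: 63)
```

Note that values 128 to 255 are scalar values here, not UTF-8 code units, so the result for non-ASCII input differs from `Array(string.utf8)`.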

Between these two suggestions I believe the majority of use cases I felt were poorly served by current Swift find a solution. I've put together a small Swift Package so you can try these ideas out but would hope we could find a path for these to find their way into the stdlib:

github.com/johnno1962/Character

Dmitriy_Ignatyev · January 22, 2024, 8:07pm

Thanks for pushing this forward. Although I've been writing code for at least the last 6 years, I'm still confused when I see double quotes on scalar and grapheme cluster literals.

Dmitriy_Ignatyev · January 22, 2024, 8:12pm

I suppose an additional pitch can be made later, e.g. to make double quotes a warning in Swift 6 mode.

johnno1962 · January 22, 2024, 11:47pm

Just to be clear, I'm altering this pitch to tack away from changing the literal syntax for UnicodeScalars (and Characters) to single quotes, and from creating new integer conversions using the ExpressibleBy protocols on those literals (even if that might be nice). Instead, I'm proposing targeted operators to plug what seem to be a few gaps in the Swift language for low-level coding. As an example to show this can tidy code up, I've ported swift-syntax's Lexer mentioned above to use these new operators:

I don't know exactly what the performance implications of using a protocol extension are going to be (if someone wants to fill me in on that), but the compiler and tests are still running in the same amount of time (though I don't imagine lexing is on the critical path).

johnno1962 · January 23, 2024, 3:51pm

... It seems there aren't any performance implications to using a protocol extension, as far as I can see. I've been benchmarking a Release build of the swift-syntax project and cannot find any repeatable difference in lexer performance between the original code and the branch above of my fork with the code "tidied up" using the protocol extension I'm proposing for the standard library. TBH this surprises me. I might have thought using an ExpressibleBy protocol would have been faster, but the compiler sees right through the protocol extension and does the necessary.

Edit: P.S. For a Debug build, the ExpressibleBy version and the original code are about 30% faster, though lexing is only half of the time spent parsing.

johnno1962 · February 20, 2024, 11:48am

An update on this pitch: I've been able to field test these operators on the swift-syntax project, which is a perfect use case, being a non-trivial parser (for Swift) that is well instrumented down to instruction counts, so one can verify that performance doesn't regress when using a more abstract approach than UInt8(ascii:). A few weeks ago we were able to land a PR on the Swift Lexer ("UnicodeScalar operators", Pull Request #2439 on apple/swift-syntax) that converted switch statements such as this:

    switch self.previous {
     case UInt8(ascii: " "), UInt8(ascii: "\r"), UInt8(ascii: "\n"), UInt8(ascii: "\t"),  // whitespace
       UInt8(ascii: "("), UInt8(ascii: "["), UInt8(ascii: "{"),  // opening delimiters
       UInt8(ascii: ","), UInt8(ascii: ";"), UInt8(ascii: ":"),  // expression separators

to this (which I think we can all agree is an improvement).

    switch self.previous {
     case " ", "\r", "\n", "\t",  // whitespace
       "(", "[", "{",  // opening delimiters
       ",", ";", ":",  // expression separators

I've prepared a PR on the standard library ("Simple operators for character value comparisons", Pull Request #71749 on apple/swift) and a proposal ("Operators for UInt8 comparisons to unicode scalars", Pull Request #2329 on apple/swift-evolution) that we can discuss or review.

apple/swift-evolution/blob/ad705b2e6c4c7022fb51b3a37646fc0b8e1ea00f/proposals/0243-character-operators.md

# Character Literal Operators

* Proposal: [SE-0243](0243-character-operators.md)
* Authors: [Dianna ma (“Taylor Swift”)](https://github.com/tayloraswift), [John Holdsworth](https://github.com/johnno1962)
* Review manager: [Ben Cohen](https://github.com/airspeedswift)
* Status: **Second review** 
* Implementation: [apple/swift#71749](https://github.com/apple/swift/pull/71749)
* Previous Revision: [1](https://github.com/apple/swift-evolution/blob/9713526f3423270c27082c620c75b2e5bc92050e/proposals/0243-codepoint-and-character-literals.md)
* Threads: [1](https://forums.swift.org/t/prepitch-character-integer-literals/10442) [2](https://forums.swift.org/t/se-0243-codepoint-and-character-literals/21188) [3](https://forums.swift.org/t/single-quoted-character-literals-why-yes-again/61898)

## Introduction

This proposal improves Swift's character-literal ergonomics. This support is fundamental not only to parsing tasks within the Swift language but also to tasks that require developers to extract and manipulate data. Areas that would benefit include handling domain-specific languages (DSLs) and parsing commonly-used data formats such as JSON. Any workflow based on lexical analysis or tokenization requirements will gain from this proposal.

The Swift community previously considered single-quote syntax for character literals. While working on Swift's Lexer code, another solution came to light.  Adding well-chosen operators to the Standard Library tidied up the Lexer implementation with minimal impact on the language. These operators didn't burn the single-quote for future reserved use, they served all the most pressing use-cases effectively and demonstrated a small but measurable performance improvement.

This improvement was validated through our work on [PR 2439](https://github.com/apple/swift-syntax/pull/2439#issuecomment-1922292277). The patch showcased how to streamline character-binary integer interchange for low level code. This proposal offers the same readable solution that seamlessly integrates with the established character and style of Swift. Additionally, it provides a slight performance boost, making it a valuable enhancement for performant code.

To see how the proposal simplifies code, consider how the PR above resulted in the following changes from:

mickeyl · February 22, 2024, 12:01pm

I like it and it fixes an annoyance.

idrougge · February 26, 2024, 11:22am

One difference between this ASCII-centric proposal and the way it works in C-inspired languages is that in the latter case you can define not only UInt8 values but also wider integers, e.g. int mark = 'MARK'.

Many binary formats employ such 32-bit markers (often accompanied by a length field).

scanon · February 26, 2024, 2:43pm

... well, sort of. C's grammar lets you write it, but leaves the meaning entirely up to the implementation:

§ 6.4.4.5 p11
The value of an integer character constant containing more than one character (e.g. 'ab') ... is implementation-defined.

so in practice you can't ever take advantage of this in standards-conforming portable code. It's really a compiler/platform feature rather than a language feature.

tera · February 27, 2024, 3:00pm

If we were to bring this to Swift, we could enforce a particular endianness (big, naturally) for multi-char constants. I do appreciate seeing them in a hex viewer. That being said, the need for integer constants like 'MARK' is less and less pressing compared to 30 years ago.
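
For reference, a fixed big-endian packing can already be written as an ordinary function. This is only a sketch: `fourCharCode` is a hypothetical name, and restricting the input to exactly four ASCII characters is an assumption of the example.

```swift
/// Hypothetical helper: packs four ASCII characters into a UInt32 with the
/// first character in the most significant byte (big-endian order).
func fourCharCode(_ text: String) -> UInt32? {
    let bytes = Array(text.utf8)
    guard bytes.count == 4, bytes.allSatisfy({ $0 < 128 }) else { return nil }
    var code: UInt32 = 0
    for byte in bytes {
        code = (code << 8) | UInt32(byte)
    }
    return code
}

let mark = fourCharCode("MARK")     // 0x4D41524B: "M" ends up in the top byte
let tooLong = fourCharCode("MARKS") // nil: not exactly four characters
```

Written to a file with the most significant byte first, such a value shows up as the readable text "MARK" in a hex viewer.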

johnno1962 · March 8, 2024, 5:13pm

Alright, time to try to stir things up here. @taylorswift, @xwu, @michelf, @ksluder, @beccadax, @benrimmington, you've had plenty to say about previous incarnations of this proposal (sorry to spam you directly). Do you have anything to add about the new pared-down proposal? It provides a solution implemented in Swift rather than requiring changes to the language itself, which proved so difficult to negotiate. Perhaps it serves a bit of a niche requirement, but it is one that many new users coming to the language might expect to "simply work", and indeed with this solution it does, and it is performant. It has already found a home in Apple's swift-syntax project and aligns well with one of Swift's original goals: that it could be a systems programming language.

The main arguments against might be the obfuscation of some diagnostics for nonsense code, which you can see in the tests of the PR to stdlib, or the question of whether it is worth deploying the change to take up storage on perhaps a billion phones for such a niche requirement. With respect to the second concern, I checked the size of libSwiftCore.dylib, where the standard library lives, and it did not change between toolchains built with and without the change, as the new code did not overflow into a new page.

At this stage I feel we're ready for a second review @Ben_Cohen. Is there anything more I can do to bring this about? As @OscarWilde would have said if he'd frequented these forums: "There is only one thing in the world worse than receiving criticism: it is not receiving criticism at all."

ksluder · March 8, 2024, 6:12pm

I appreciate the attempt at pragmatism, but this approach effectively makes UInt8 an alternative to Unicode.Scalar, and in the process encounters a problem that Unicode.Scalar exists to define away.

The advantage of Unicode.Scalar is that the meanings of values 128–255 are well defined, whereas in UInt8 form they can be interpreted relative to any codepage. The behavior of your examples is well defined for only half the values you might encounter. Does behavior on Windows depend on the legacy codepage? Is it affected by the LANG environment variable on Linux and Mac?

Is there a way to reformulate your version to depend more tightly on Unicode.Scalar? Then it becomes more justifiable to assume that comparing a UInt8 against a string literal assumes the UInt8 is a Unicode codepoint.

johnno1962 · March 8, 2024, 6:16pm

Isn't the implementation I posted limited to UInt8 values less than 128? In which case the values do coincide with UnicodeScalar values. They are just more accessible.

ksluder · March 8, 2024, 6:25pm

At the usage site, what prevents someone from doing file.peek() == "È"? Or perhaps worse, file.peek() == textField.text[0]?

taylorswift · March 8, 2024, 6:35pm

i don’t think this is what we would want, because i imagine a lot of use cases for this would center around processing runs of UTF-8 text embedded inside [UInt8] buffers, and UTF-8 code units do not align with Unicode.Scalar codepoints.
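
The mismatch these replies point at can be made concrete. In this sketch the bytes of "È" come from Swift's own UTF-8 view; the variable names are only for illustration.

```swift
let scalar: Unicode.Scalar = "\u{C8}"                 // "È"
let utf8Bytes = Array(String(scalar).utf8)            // how "È" appears in a UTF-8 buffer

// The scalar value fits in a byte, so a UInt8 comparison can be written...
let fitsInByte = scalar.value <= UInt32(UInt8.max)

// ...but it never matches the first byte of the UTF-8 encoding (0xC3 vs 0xC8).
let matchesFirstByte = UInt32(utf8Bytes[0]) == scalar.value

print(utf8Bytes, fitsInByte, matchesFirstByte)
```

So a byte-to-scalar comparison that accepts values above 127 would match Latin-1 encoded text but not the UTF-8 encoding of the same character.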

michelf · March 8, 2024, 6:47pm

Personally, I'd go for an even simpler solution. We could add an ascii property on UInt8 converting it to a UnicodeScalar? as a counterpart to UInt8(ascii:). Then you can write this:

switch self.previous.ascii {
case " ", "\r", "\n", "\t", "(", "[", "{", ",", ";", ":", "\0":
  return false

and this:

if self.peek(at: 1).ascii == "!" {

No operator shenanigans needed.

I guess the main difference in behavior with the proposed implementation is that it won't trap at runtime if your character literal is outside of ASCII range, instead it'll simply return false.

The pattern is also reusable for other encodings. For instance: self.peek(at: 1).isoLatin1 == "¶" (where I assume isoLatin1 would be defined in a user extension because I don't expect the standard library to provide this one).
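
A minimal sketch of this idea, assuming the property returns nil for bytes of 128 and above; `ascii` as written here and the `isBoundary` function are hypothetical, not existing standard library API.

```swift
extension UInt8 {
    /// Hypothetical counterpart to UInt8(ascii:): the byte as a Unicode
    /// scalar if it is in the ASCII range, otherwise nil.
    var ascii: Unicode.Scalar? {
        return self < 128 ? Unicode.Scalar(self) : nil
    }
}

/// Hypothetical stand-in for the lexer check, switching on the optional scalar.
func isBoundary(_ previous: UInt8) -> Bool {
    switch previous.ascii {
    case " ", "\r", "\n", "\t", "(", "[", "{", ",", ";", ":", "\0":
        return true
    default:
        return false
    }
}

let bang = UInt8(ascii: "!").ascii == "!"     // true
let latin = UInt8(200).ascii == "\u{C8}"      // false: 200 is not ASCII, so ascii is nil
```

The second comparison shows the behaviour described above: a non-ASCII literal compiles and simply compares unequal instead of trapping.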