SE-0243: Codepoint and Character Literals


(Xiaodi Wu) #202

What is the return value of your proposed function for the single character \r\n?

I do not aim to--and I believe it is neither possible nor desirable to--make everyone happy. I'd like to arrive, through discussion, at the best possible API. Previously, I was quite adamant that 'x' should be a character literal, but after this extensive discussion I no longer believe that to be the case.

For a design to have clarity, one has to commit. What is this thing we introduce? With that clarity necessarily comes making clear what this thing is not. There are plenty of reasons why, as @lorentey points out, it is more urgent to have a Unicode scalar literal. Therefore, it is not a character literal. The whole underpinning of the proposal is about why it's been a bad idea to have one literal syntax serve three types of literals (Unicode scalar, extended grapheme cluster, string). We have a chance to correct that here; it would be key not to replicate that error.


(John Holdsworth) #203
(swift) extension Character {
            var ascii: UInt8 {
                return asciiValue!
            }
        }
(swift) '\r\n'.ascii
// r0 : UInt8 = 10
(swift)

Is this the right answer? I’m no longer sure...


(Xiaodi Wu) #204

My point is, there is no right answer. That's why any asciiValue API on Character is arguably fundamentally broken.
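For reference, the standard library has already committed to one of the possible answers: Character.asciiValue documents exactly one multi-scalar special case, mapping the CR-LF cluster to LF (10). A quick check:

```swift
let crlf: Character = "\r\n"   // one Character, two Unicode scalars

// The standard library picks an answer by fiat: CR-LF maps to LF.
print(crlf.asciiValue as Any)  // Optional(10)

// Every other multi-scalar Character has no ASCII value at all.
let flag: Character = "🇨🇦"
print(flag.asciiValue as Any)  // nil
```

Whether 10 is "the" right answer for a two-scalar cluster is precisely what's being disputed.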


(Michel Fortin) #205

The irony about this is that it already works:

("A"..."z").contains("[") // returns true

(Xiaodi Wu) #206

No need to add it, then, as I guess that feature already exists! Still, I hope we'll agree it's not ideal.


(John Holdsworth) #207

One argument could run: we're not able to decide where to draw the line between single-quoted and double-quoted literals (viz. Character, Unicode.Scalar, and ASCII), and this would be a source-breaking change generating a collective groan from the community. The targeted operators I suggest seem to meet the ergonomic requirements I was looking for, can be implemented now in Swift 5.0, and can be made free of foot-guns if we restrict the Unicode.Scalars to ASCII in their implementation. For me the question is whether these should make it into the standard library, becoming another battery included with Swift, or live in a CocoaPods/SPM module.
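For concreteness, here is one possible shape such ASCII-restricted operators could take; the names, signatures, and precondition strategy are my sketch of the idea, not a settled design:

```swift
// Sketch: heterogeneous comparison between integer code units and
// ASCII-only scalars. The precondition confines the operators to the
// unambiguous ASCII range, where Character/Unicode.Scalar/UInt8 agree.
extension UInt8 {
    static func == (codeUnit: UInt8, scalar: Unicode.Scalar) -> Bool {
        precondition(scalar.isASCII, "operand must be an ASCII scalar")
        return codeUnit == UInt8(scalar.value)
    }
    // Pattern-matching support, so scalars can appear in `switch` cases
    // over bytes.
    static func ~= (scalar: Unicode.Scalar, codeUnit: UInt8) -> Bool {
        precondition(scalar.isASCII, "pattern must be an ASCII scalar")
        return codeUnit == UInt8(scalar.value)
    }
}

let byte: UInt8 = 0x2C
print(byte == ("," as Unicode.Scalar))  // true
```

Nothing here needs compiler support, which is why it works in Swift 5.0 as-is; the trade-off is that a non-ASCII scalar is only caught at runtime.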


(Xiaodi Wu) #208

We should be able to decide, and I think we're converging on Unicode.Scalar. It will not break source.

Feature work for Swift 5.0 is completely over.


(John Holdsworth) #209

I’m not converging on Unicode.Scalar, but I could live with it. How will it not break source? ...I guess the two syntaxes could live side by side.

D’uh, I know that! I mean the operators I suggest work with Swift 5.0 as it is, but they could be added to the stdlib for 5.1, with or without single-quote syntax.


(Alex Johnson) #210

There’s a nice symmetry in using double quotes for human-readable text (String and Character), and single quotes for machine-readable text (Unicode.Scalar).

It’s unfortunate that String’s UTF-16 and UTF-8 code units are represented by UInt16 and UInt8, instead of dedicated types, because that probably excludes them from being expressed by single-quote literals.


(^) #211

not sure how useful utf-8 literals would be. How do you write a continuation byte?
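To make the problem concrete: every UTF-8 byte outside ASCII is either a lead byte or a continuation byte (0x80–0xBF), and a continuation byte on its own is not a Unicode scalar, so a scalar-based literal has no way to spell it:

```swift
let e = "\u{E9}"   // "é", a single Unicode scalar, two UTF-8 bytes
print(Array(e.utf8).map { String($0, radix: 16) })  // ["c3", "a9"]

// 0xA9 is a continuation byte (high bits 10): it never stands alone
// as a scalar, so no single-quoted literal could denote it.
let continuation: UInt8 = 0xA9
print((continuation & 0b1100_0000) == 0b1000_0000)  // true
```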


(Michel Fortin) #212

It sure is unideal. But we might as well say that this is not ideal:

"2" < "10" // returns false

and then ask people to be precise in what they write to remove any ambiguity:

"2".isLexicographicallyOrdered(with: "10")

But forcing this on everyone is going to have a lot of drawbacks. The convenience of < for strings comes with edge cases where the result can be surprising, but that's the price you pay for being able to express things concisely.
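For completeness (the isLexicographicallyOrdered(with:) spelling above is hypothetical): today String's < is character-by-character, and the numeric-aware ordering must be requested explicitly through Foundation, which is exactly the verbosity-for-precision trade being described:

```swift
import Foundation

print("2" < "10")  // false: "2" > "1" at the first character

// Opting in to numeric ordering is explicit, and wordier:
let ordered = "2".compare("10", options: .numeric) == .orderedAscending
print(ordered)     // true
```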

I accept and embrace the fact that unicode scalars and ASCII characters are numbers, and I believe anyone working with them should too, and thus to me it makes sense to create ranges of them and check for inclusion. That it lets you write bizarre things makes it not ideal, especially to the untrained eye, but a more verbose alternative is not ideal either because it makes things less readable when the verbosity gets repeated.

I think it makes sense to more clearly separate literals for things that are numeric in nature (unicode scalars and ASCII characters) from things that aren't (grapheme clusters). But I also believe restricting single quote literals to unicode scalars is going to encourage people to use scalars when they really should use Character, which is probably worse than being able to do weird unicode scalar ranges expressed with double quotes.

This last problem would be mostly gone if those new literals were restricted to ASCII range. Whatever you do with an ASCII character represented as a Character or UnicodeScalar, it'll probably do the same. The only exception would be matching "\r\n" as a Character. But then restricting it to the ASCII range means you can't deprecate double-quotes for UnicodeScalar literals. We're turning full circle.


(^) #213

I agree! The issue I have with trapping ascii is it basically implies that everything else that Unicode.Scalar can represent is the edge case. We’re basically turning Unicode.Scalar into an ASCII type with 1,111,870 invalid states. At this point, it would be better just to introduce a proper 7-bit ASCII type.

@inlinable is not enough here; to make compile-time validation work, you have to annotate it with @constexpression or whatever it’s called once the feature is finalized, to prevent it from being callable on a dynamic value. Any API ought to be either checked with runtime preconditions or with compile-time assertions, but not both. Making the behavior change depending on conditions known only to godbolt seems like a recipe for confusion.
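A minimal sketch of what such a dedicated 7-bit type might look like (the name and shape are hypothetical): every representable value is valid, the failable initializer rejects the other 1M+ scalars, and the literal path would ideally become a compile-time check once a @constexpression-style feature exists:

```swift
// Hypothetical 7-bit ASCII code-unit type: no invalid states.
struct ASCII: Equatable, ExpressibleByUnicodeScalarLiteral {
    let value: UInt8   // always < 0x80

    // Runtime-checked conversion from an arbitrary scalar.
    init?(_ scalar: Unicode.Scalar) {
        guard scalar.isASCII else { return nil }
        value = UInt8(scalar.value)
    }

    // Literal initialization: traps on a non-ASCII literal today;
    // a compile-time diagnostic would be the goal.
    init(unicodeScalarLiteral scalar: Unicode.Scalar) {
        precondition(scalar.isASCII, "literal must be ASCII")
        value = UInt8(scalar.value)
    }
}

let comma: ASCII = ","
print(comma.value)              // 44
print(ASCII("\u{E9}") as Any)   // nil: "é" is rejected, not trapped
```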


(Xiaodi Wu) #214

As I mentioned above, I am unconcerned and unconvinced that this requires compile-time validation, though if it can be done heuristically then certainly it is a bonus. It's an issue separable from the main topic here regarding literals so I would rather not delve into that further.


(John Holdsworth) #215

Huh? I get:

Welcome to Apple Swift version 5.0 (swiftlang-1001.0.60.3 clang-1001.0.37.8).
Type :help for assistance.
  1> "2" < "10"
$R0: Bool = false

(Michel Fortin) #216

Oops, my mistake. It should indeed say "returns false"; just to clarify, I wrote the post assuming it would return true. I'll edit it.


#217

I think it’s pretty clear now that this proposal should be returned for revision.


(^) #218

can you tell me what is wrong with this code?

enum BitVectorType 
{
    static 
    let uint8:(UInt8, UInt8, UInt8, UInt8) = 
    (
        (" " as Character).asciiValue!, 
        ("8" as Character).asciiValue!, 
        ("'" as Character).asciiValue!, 
        ("d" as Character).asciiValue!
    )
    
    static 
    let uint8x:(UInt8, UInt8, UInt8, UInt8) = 
    (
        (" " as Character).asciiValue!, 
        ("8" as Character).asciiValue!, 
        ("'" as Character).asciiValue!, 
        ("h" as Character).asciiValue!
    )
    
    static 
    let uint8b:(UInt8, UInt8, UInt8, UInt8) = 
    (
        (" " as Character).asciiValue!, 
        ("8" as Character).asciiValue!, 
        ("'" as Character).asciiValue!, 
        ("b" as Character).asciiValue!
    )
    
    static 
    let uint16:(UInt8, UInt8, UInt8, UInt8) = 
    (
        ("1" as Character).asciiValue!, 
        ("6" as Character).asciiValue!, 
        ("'" as Character).asciiValue!, 
        ("d" as Character).asciiValue!
    )
    
    static 
    let uint16x:(UInt8, UInt8, UInt8, UInt8) = 
    (
        ("1" as Character).asciiValue!, 
        ("6" as Character).asciiValue!, 
        ("'" as Character).asciiValue!, 
        ("h" as Character).asciiValue!
    )
    
    static 
    let uint16b:(UInt8, UInt8, UInt8, UInt8) = 
    (
        ("1" as Character).asciiValue!, 
        ("6" as Character).asciiValue!, 
        ("'" as Character).asciiValue!, 
        ("b" as Character).asciiValue!
    )
    
    static 
    let uint32:(UInt8, UInt8, UInt8, UInt8) = 
    (
        ("3" as Character).asciiValue!, 
        ("2" as Character).asciiValue!, 
        ("’" as Character).asciiValue!, 
        ("d" as Character).asciiValue!
    )
    
    static 
    let uint32x:(UInt8, UInt8, UInt8, UInt8) = 
    (
        ("3" as Character).asciiValue!, 
        ("2" as Character).asciiValue!, 
        ("'" as Character).asciiValue!, 
        ("h" as Character).asciiValue!
    )
    
    static 
    let uint32b:(UInt8, UInt8, UInt8, UInt8) = 
    (
        ("3" as Character).asciiValue!, 
        ("2" as Character).asciiValue!, 
        ("'" as Character).asciiValue!, 
        ("b" as Character).asciiValue!
    )
}

did you see it? how long did it take you?

This will compile without warnings. Because static properties are lazily computed, it will even run without errors until the affected constant is touched. An enum full of constants is also the last thing you’d think to unit-test, and how would you give this testing coverage anyway? It would be monumentally stupid if people had to add tests consisting of nothing but instantiating constants to catch this sort of bug.
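The failure mode, isolated: the typographic quote has no ASCII value, so the force-unwrap can only trap at runtime, and only once the lazy static is actually evaluated:

```swift
let typewriter:  Character = "\u{27}"     // ' APOSTROPHE
let typographic: Character = "\u{2019}"   // ’ RIGHT SINGLE QUOTATION MARK

print(typewriter.asciiValue as Any)   // Optional(39)
print(typographic.asciiValue as Any)  // nil: force-unwrapping this traps,
                                      // but only when the constant is used
```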


(Xiaodi Wu) #219

Less than 10 seconds, without the aid of any external tools, on my first read through the code, purely by visual inspection, without backtracking. It's here:

        ("’" as Character).asciiValue!, 

I don't see in what way this is specific to these constants versus any other constants. This seems like a great argument that there should be better tooling for all constants--and I would agree about that.

That said, constants very much should be tested: that's not monumentally stupid. You can't rely on any compiler to tell you that the constant value you supplied is correct: it's not enough in your example to know that you used ASCII characters--you need to know you've used the right ones. Since you're saying you can't trust careful reading, how do you intend to suss out copypasta and other silly errors without testing?


(Michel Fortin) #220

The big problem is we can't agree on goals and priorities. Hopefully the core team will offer some guidance.


(^) #221

congratulations on your eyeballs. you get the point. It’s really, really easy to accidentally type a ’ instead of a ' on a lot of platforms; some of them will turn one into the other automatically. And in a lot of programming fonts, these characters look similar. It’s also not the only character susceptible to “unicode trespassing”: there’s - and –, ` and ˋ, etc. So yes, I think this is an issue.

Of course you need careful reading. The great thing about ASCII strings is the characters are all visually different enough that careful reading is all you need to do to verify your constants are correct. Unicode means you have to examine them byte by encoded byte.
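The confusable pairs mentioned above differ only at the scalar level, which is exactly the inspection careful reading can't reliably perform:

```swift
// ASCII characters paired with common lookalikes.
let confusables: [(Unicode.Scalar, Unicode.Scalar)] = [
    ("'", "\u{2019}"),  // APOSTROPHE vs RIGHT SINGLE QUOTATION MARK
    ("-", "\u{2013}"),  // HYPHEN-MINUS vs EN DASH
    ("`", "\u{02CB}"),  // GRAVE ACCENT vs MODIFIER LETTER GRAVE ACCENT
]
for (ascii, lookalike) in confusables {
    print("U+\(String(ascii.value, radix: 16, uppercase: true))",
          "vs U+\(String(lookalike.value, radix: 16, uppercase: true))")
}
```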