SE-0243: Codepoint and Character Literals

Vogel · March 6, 2019, 12:57am

It's worth starting from scratch, because it's the much better solution.

lorentey · March 6, 2019, 1:16am

I think this is the crux of the issue.

The proposal is very explicit about introducing the proposed integer conformances to simplify the narrow but extremely important usecase of processing ASCII byte strings. Not the first 128 Unicode code points, but specifically ASCII byte strings.

This is extremely well-chosen motivation; restating it in terms of Unicode code points would not carry anywhere near the same weight. If Unicode chose to extend some flavor of EBCDIC or instead, most programmers would still want to use 'A' as a shorthand to the byte 65 when they're processing PNG files or parsing SQL results. (The fact that Unicode extends ASCII is certainly helpful, but it's mostly irrelevant here. The proposed integer conformances won't help much with processing encoded Unicode text.)

In terms of both its practical effects and its explicitly stated goals, the proposal bakes ASCII directly into the language syntax, as an implicit encoding. Clearly, ASCII is important enough to deserve some syntactic sugar. I only have an issue with the choice being implicit.

// You really can't get more implicit than this:
let a = 'A' + 10
print(a) // 75

The act of encoding a character should leave a mark in the source. The mark should indicate the specific encoding used, and it should be as close to the line that performs the act as possible.

This is a general API hygiene issue to allow internationalization; it's only tangentially related to Unicode support, and it has absolutely nothing to do with the intricate details of Unicode code points.

The character 'G' has the code point U+0047 in Unicode. The code point U+0047 has the integer value 71, as well as a bunch of other interesting properties.

However, neither the character 'G' nor the code point U+0047 are the integer 71.

Swift does have a nice type system, and the stdlib has already defined Character, Unicode.Scalar and Int as separate types to represent these distinct concepts.

dwaite · March 6, 2019, 1:43am

Could you clarify this statement?

Are you saying that having an integer literal notation expressed via a code point (as a new option alongside decimal, hexadecimal, octal and binary notations) would be a deviation from the current literal system, presumably with negative impacts? For example

 let x:UInt8 = 🚲"^" // integer literal of 94

If not, would a new notation for expressing an array literal (of integer literals) based on the code points of a string be a deviation with negative impacts? Something like:

 let y:[UInt8] = ⛺️"foo" // unfortunately no good shed emoji

Michael_Ilseman · March 6, 2019, 1:53am

That understates the value. The value of this part of the proposal is in processing text which may contain any Unicode code point, but where the only ones with semantic meaning you are checking for are in the ASCII compatible subset. This is used in consuming pretty much every textual format used in modern computing (XML/CSV/JSON), source code, etc. Those might have non-ASCII contents, which are treated as opaque data, but are not interpreted by matching against specific code points.

Dante-Broggi · March 6, 2019, 2:09am

I think I agree, none of the various CodeUnits should be integers.

michelf · March 6, 2019, 2:17am

What is your evaluation of the proposal?

I'm quite ambivalent.

Implementing the initializer for ASCII integer literals in the standard library before we can add the conformance seems a bit confusing. Sure, we can introduce the conformance ourselves where needed, but this is basically making one part of the core language optional. It's weird, even if temporary.
I'm not particularly convinced introducing a single-quote literal for the sake of representing one Character without type inference is that useful. At least the motivation part doesn't explain it well enough for me. It could be slightly better as a final result than using type inference, but is it really worth the trouble of deprecating the old syntax?
I do find it useful to be able to represent ASCII characters with single quotes for integer types and that it is statically checked to be ASCII. This is something I'd really like to see in the language. But the asymmetry between UInt8('8') vs. '8' as UInt8 is troubling I think.

Also, I'll echo others in that I don't see why UInt8(ascii: 64) need to be deprecated.

Is the problem being addressed significant enough to warrant a change to Swift?

There is basically two problems here:

making literals characters/unicode scalars/integers easier to distinguish at the point of use versus strings.
allow integers to be easily initialized from ASCII characters

I personally don't care for 1, but I think 2 is very worth it when parsing various kinds of data formats.

Does this proposal fit well with the feel and direction of Swift?

I'm not too sure.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

The current proposal is inspired from the C syntax using single quotes for single characters, which are really just numbers. C doesn't really understand Unicode, but C character literals make it easy to write low level algorithms that works fine with ASCII. And a lot of strings are guarantied to be ASCII in this world: numbers, dates, identifiers, etc. So it's great to be able to work in ASCII when needed, and this is much easier to do in C.

I wish Swift was better for expressing things of the sort, and this proposal does improve things a bit. At the same time I think it makes compromises by regrouping Unicode characters under the same umbrella and this creates tensions between the two usages. I proposed earlier in the pitch phase that we could simply use single quotes for ASCII scalars and double quotes for anything Unicode and I feel this would work better; it also avoids deprecating an existing syntax.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Read the review thread and participated in the pitch.

taylorswift · March 6, 2019, 2:29am

citation needed! i searched his twitter for a while and couldn’t find anything, and i am an ex-fangirl.

This is the most important part!

I’m curious how you guys feel about a'a' and u'a' syntax. The prefix would unambiguously indicate the encoding used, and it’s much more concise. When you use ASCII literals, you usually want to use multiple at a time, and all those UInt8(ascii:) calls add up.

It would seem to me that the opposition to the literal prefixes is almost entirely aesthetic, and i feel if we had syntax highlighting support for it a lot of those objections would go away. And in the great tradeoff triangle of clarity–expressiveness–aesthetics, i think it is better to compromise on aesthetics.

Perfect is the enemy of good

dwaite · March 6, 2019, 2:32am

Could you point toward a reference? I'm curious whether this refers to internal literal syntax or to supporting custom literals (such as SQL"SELECT * from X" ).

dwaite · March 6, 2019, 3:01am

I'd be happy with a single-codepoint syntax that was interpreted as an integer literal under the assumption that 'x' wasn't chosen as the syntax as being too confusing. I could also be talked into an eventual multi-codepoint form, that seems a much more challenging reach.

The disadvantages to that approach (with my ratings, against my scale) would be:

let x:UInt8 = 🚲'à' (as a bikesheddable syntax) would work, initializing x to 240. Int8 would be restricted to ASCII characters.
Users could be confused that the value is 240, and not some other value based on some other encoding. I think a case could be made that literals should always be interpreted as NFC-Unicode.
When interacting with native code, let x:[UInt8] = 🚲"resumé" (or other sequence of code-point literals) would not have any understanding of character encodings, and could cause misuse. This is countered somewhat in that char * in C is converted to use an Int8, which would not be initializable to non-ASCII values.

xwu · March 6, 2019, 3:44am

dwaite:

Could you clarify this statement?

Are you saying that having an integer literal notation expressed via a code point (as a new option alongside decimal, hexadecimal, octal and binary notations) would be a deviation from the current literal system, presumably with negative impacts? For example
 let x:UInt8 = 🚲"^" // integer literal of 94
If not, would a new notation for expressing an array literal (of integer literals) based on the code points of a string be a deviation with negative impacts? Something like:
 let y:[UInt8] = ⛺️"foo" // unfortunately no good shed emoji

I'm not sure what you mean. These are not integer literals; they are string literals.

Swift does not support f, ul, or other suffixes for numeric literals as found in C; this was a deliberate choice, but I would have to dig for longer to find Chris's post specifically on that topic. Nor did we accept r as a prefix for raw strings; no doubt that discussion is still fresh in your mind.

I will do you one better, however: here is Chris's opinion of this exact idea.

We are under no deadline to produce the good.

zwaldowski · March 6, 2019, 4:29am

It’s important to examine which tasks. If so many of them are related to four-char codes — as this thread has brought up so often — the language shouldn’t muddy the nicely concrete meaning it assigns to literals in favor of having wanted to decode a PNG header once.

The other cases mentioned like CSV are not at all broken under the status quo; you can write fluent, terse, and correct code that makes no encoding assumptions.

It would be an unforced and substantial policy shift for Swift to say “eh, yeah, I guess sometimes you really do mean 65...90 here because of historical context… I think… I’ll just convert it”, without paying any attention to the fewer-LoC, subtler, insidious incorrectness it inherits from the C code it’s imitating.

bobergj · March 6, 2019, 4:51am

The value of this part of the proposal is in processing text which may contain any Unicode code point, but where the only ones with semantic meaning you are checking for are in the ASCII compatible subset

Sure, I think what but what I, and a bunch of other people, are against is primarily:

Having ascii literals implicitly convert to numeric types:
let digit:UInt8 = '0' + n % 10
That 'a' should have a special behaviour for ASCII. It is not obvious why:

   let scalar: Unicode.Scalar = "ÿ"                    
   let scalarValueAttempt1: UInt32 = scalar.value  // OK
   let scalarValueAttempt2: UInt32 = 'ÿ'           // Compile time error under proposal, huh?
   let scalarValueAttempt3: UInt32 = 'a'           // OK under proposal

The primary value in the proposal is that it adds compile time checks for a Unicode scalar literal being part of the Unicode ASCII compatible subset.
Another value is that it adds some affordances for matching byte values to ascii, and ranges of ascii scalar values.

Building on my comments above about a ASCII scalar type here are some ideas for an alternative proposal: Unicode.Scalar.ASCII.md · GitHub
Brief summary: Add type Unicode.Scalar.ASCII, make the compiler bounds check its literal initialisation unicode scalar value, add some affordances for matching with byte values.

taylorswift · March 6, 2019, 9:32am

well, Chris didn’t reject it as much as pose a set of questions that needed to be answered for the idea to be viable. That proposal from 2015 was much more expansive (reworking custom operators, extending to numeric literals?) and couldn’t answer those questions. I believe we have answers today in this specific context.

I think over time this became a bigger and bigger complaint. just the vibe i’ve been getting around here, though.

the ugliness of the C suffixes comes from the fact that the suffixes are basically abbreviated reproductions of the type names (unsigned long long long long long foo = 123ullllllllllllllllll;) which are visually redundant, but mechanically needed because of C’s loose numeric promotion semantics.

Character literal prefixes on the other hand convey useful information about the value.

// destination type is UInt64, but we know `x` is an ASCII encoded 
// value between 0 and 128
let c:UInt64 = a'x' 
// destination type is UInt32, but we know `x` is an EBCDIC encoded 
// value between 0 and 256
let d:UInt32 = e'x'

Also, let c:Character = 'x'c instead of let c:Character = "x" isn’t really a boost in clarity, but let c:UInt8 = a'x' instead of let c:UInt8 = .init(ascii: 'x') is, at least when you have lots and lots of ASCII values buried in lots and lots of UInt8(ascii:) initializers in a line.

We don’t. We have no need.

In the four years since, we can actually conceive of a workable way to making this possible! (though i wouldn’t desugar 'a'c to c('a'), rather, I’d let it be a statically assertable parameter to a @textLiteral initializer.) I’d hop over to the tuple literals thread for some interesting thoughts on a fully flexible compile-time literals system.

No. We have no need.

a'x' and u'x'. Others like e'x' could be added through the evolution process, and eventually, users could define their own prefixed text literals with the aformentioned compile-time literals.

The really interesting thing to me is back then, before we had really sat down and thought about how we deal with literals in the language, this idea would have required a lot of weird compiler magic to make it make sense, and the motivation was quite lacking. Both of those premises have changed since then, which is really a testament to how far Swift has come.

xwu · March 6, 2019, 2:40pm

taylorswift:

Character literal prefixes on the other hand convey useful information about the value.

// destination type is UInt64, but we know `x` is an ASCII encoded 
// value between 0 and 128
let c:UInt64 = a'x' 
// destination type is UInt32, but we know `x` is an EBCDIC encoded 
// value between 0 and 256
let d:UInt32 = e'x'

How is this different from r"x" for raw strings, which was rejected by the community?

I do not understand why, in a language that spells a pointer as UnsafePointer, it is necessary for one specific use case to introduce a one-letter syntactic sugar for init(ascii:), which is both succinct and readable. This continues to have nothing to do with compile-time checking, which can happen with no additional syntax, which is what you say is most important.

jayton · March 6, 2019, 3:13pm

What is your evaluation of the proposal?

-1

The pitch thread was hampered by an unclear goal; it felt as if the promoters of the pitch wanted to get “some type of character literal”, regardless of what particular use cases it fulfilled.

I’m not against character literals, but the result doesn’t feel like a good fit. In particular, the clarification in this thread that 'a' * 'a' will parse pushed me over to a definite -1. This feels extremely unswifty, and far more problematic than, e.g., implicitly casing integers to Bool (which I wouldn’t support either).

As xwu said, the conformance-to-opt-in mechanism encourages people to violate good Swift style.

Regarding the a'x' option, I think it feels too much like allowing alphanumeric characters in operators, but only sometimes. I’m also not at all convinced that it’s more legible than what we have now:

(a'I', a'H', a'D', a'R')  // aIaHaDaR? Also surprisingly hard to type
(UInt8(ascii: "I"), UInt8(ascii: "H"), UInt8(ascii: "D"), UInt8(ascii: "R"))  // I ... H ... D ... R

The argument about compile-time checking in the PNG case is unconvincing. Compile time checking of invariants is desirable in general, but there is nothing to suggest that this particular case urgently needs a specialized compiler feature. Tag("I", "H", "D", "R") with a precondition check is possible now and not particularly onerous.

Is the problem being addressed significant enough to warrant a change to Swift?

Yes. While I don’t believe it’s an urgent problem, having some kind of character literal and a path towards easier working with ASCII is desirable, or at least unobjectionable. My objection is to pushing this proposal through regardless of consequences.

I would probably be happy with a new proposal that introduces an explicit ASCII struct or reworks CodeUnit as per Vogel. (Also, as Jordan suggested, keeping the path open to allow 'xyz' as an array literal in a later pitch would be good.)

Does this proposal fit well with the feel and direction of Swift?

No. The 'a' * 'a' thing and the conformance-for-opt-in thing are both major red flags.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

C comparisons abound, but before C I used Pascal. There, a Char literal is spelled 'x' and a string literal 'xy'. I don’t recall being confused about this or coming across any criticism of it, although without type inference there’s no “how do I spell "x" as Character without specifying the type” question.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I’ve read most of this thread and large chunks of the previous one.

tevamerlin · March 6, 2019, 4:18pm

gwendal.roue:

The interesting parts are of course:
if cString.advanced(by: 4).pointee == 45 /* '-' */ { ... }
if cString.advanced(by: 2).pointee == 58 /* ':' */ { ... }
In the current state of Swift, they could also have been written this way:
if cString.advanced(by: 4).pointee == Int8(UInt8(ascii: "-")) { ... }
if cString.advanced(by: 2).pointee == Int8(UInt8(ascii: ":")) { ... }
And with this proposal (as stated in the Motivation section), we would read instead:
if cString.advanced(by: 4).pointee == '-' { ... }
if cString.advanced(by: 2).pointee == ':' { ... }

For this kind of code, my preference would be to be able to write:

if cString.advanced(by: 4).pointee == '-'.ascii { ... }
if cString.advanced(by: 2).pointee == ':'.ascii { ... }

That is, no implicit use of character literals as integers, but something shorter (and, imo, much easier to read) than UInt8(ascii:) to express that we want the ASCII code for this character.
We could of course imagine a similar property for EBCDIC.

xwu · March 6, 2019, 4:22pm

This already exists as asciiValue!. We do not automatically find members on non-default literal types (for reasons), but with the introduction of single quotes that default to Character, what you suggest will automatically work.

(Standardizing on asciiValue instead of init(ascii:) may actually be a good reason to deprecate the latter.)

tevamerlin · March 6, 2019, 4:40pm

Oops! Sorry, I had missed that one.

Great!
(Even though I would have preferred the shorter ascii over asciiValue.)

johnno1962 · March 6, 2019, 5:03pm

This looked promising and does in fact work testing it in the implementation:

(swift) 'a'.asciiValue
// r0 : UInt8? = Optional(97)

It does have a nasty fail however in that if you compare against a non-ascii character it returns nil and therefore can never match without either a compilation or run-time error

(swift) if 97 == 'é'.asciiValue {                                                                                                            }

xwu · March 6, 2019, 5:04pm

johnno1962:

It does have a nasty fail however in that if you compare against a non-ascii character it returns nil and therefore can never match without either a compilation or run-time error
(swift) if 97 == 'é'.asciiValue {  

Why is this a “nasty fail”?