Prepitch: Character integer literals

A few things on this:

  • I think it should be the created type's decision whether or not to allow multiple 'char' literals.

  • Also, this should be a completely new literal type, with its own ExpressibleBy*Literal protocol, or protocols.

  • Possible names I can think of are ExpressibleByCodepointLiteral for the single 'char' case, and ExpressibleByTextLiteral for the multiple 'char' case.

As for real-world advantages of 'char' literals: some standards use integer discriminators whose ASCII interpretation is correlated with their semantic meaning. When putting that sort of data in code, it would be better to use the more readable ASCII form in source instead of a raw number.
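
For a concrete picture, here is a sketch of the kind of code this would improve (the commented-out single-quoted line is the pitched syntax, not current Swift; the PNG "IHDR" tag is just an illustrative choice, and the rest compiles today):

// Today: a four-byte tag like PNG's "IHDR" is spelled as a raw magic number.
let tag: UInt32 = 0x4948_4452
// With 'char' literals (pitched syntax), the ASCII meaning would be visible:
// let tag: UInt32 = 'IHDR'

// Recovering the ASCII spelling today takes extra work:
let bytes = withUnsafeBytes(of: tag.bigEndian) { Array($0) }
print(String(decoding: bytes, as: UTF8.self)) // "IHDR"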

2 Likes

Do you think there would be value in a separate ExpressibleBy protocol for this? As in, you could have ExpressibleBySingleQuotedLiteral (perhaps with a better name), and then FixedWidthInteger (or perhaps even Numeric), Character, and Unicode.Scalar could all conform.

One reason I ask is that, as an effect of the reference implementation (where Character conforms to ExpressibleByIntegerLiteral), users can type the following:

(8 as Character) == "8" // false
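
To make the question concrete, here is a rough sketch of the shape such a protocol could take (the name comes from this post; none of this is actual stdlib API, and the compiler support that would feed it a literal does not exist today):

// Hypothetical protocol; not real stdlib API.
protocol ExpressibleBySingleQuotedLiteral {
    init(singleQuotedLiteral value: Unicode.Scalar)
}

// Character would wrap the scalar; an integer type could
// validate the scalar's range at conversion time:
extension Character: ExpressibleBySingleQuotedLiteral {
    init(singleQuotedLiteral value: Unicode.Scalar) {
        self = Character(value)
    }
}

extension UInt8: ExpressibleBySingleQuotedLiteral {
    init(singleQuotedLiteral value: Unicode.Scalar) {
        precondition(value.isASCII, "scalar does not fit in UInt8")
        self = UInt8(value.value)
    }
}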
2 Likes

i don’t think this is a good enough reason. maybe javascript programmers see it differently, but I wouldn’t assume 8 as Character to be anything but "\u{8}"

3 Likes

Now that raw strings have a different syntax, thanks to your work, I'm happy with the idea of using single quotes for character literals. Re-reading the months-old discussion here, I find the view in these posts compelling:

This approach seems clean and straightforward. I can't say I understand the benefit (besides lower implementation complexity) of making single-quoted literals a kind of integer literal. This only limits the applications for this literal form, and therefore makes it harder for a proposal to clear the bar for usefulness. A new ExpressibleBy*Literal also lets you do the natural thing of making '🙂' default to Character instead of an Int, which cuts down on a lot of weird behaviour (e.g. I presume I can write var x = '🙂' * '🙂' or '8' + '9' in your prototype, etc).

7 Likes

Thanks for the clarifications. I think I’ve finally got the bigger picture, updated the proof-of-concept implementation, and made a toolchain available here.

There is now a separate ExpressibleByCodepointLiteral protocol, and the default type of these literals is now Character. The examples above all work except the following expectation:

let x3: Character = '🇨🇦' // not ok

Despite having type Character, single-quoted “codepoint” literals are derived from integer literals in this implementation and can only represent single-codepoint graphemes. The advantage of taking this tack is better error reporting and checking that the codepoint fits into the destination type.
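
Concretely, the behaviour described looks like this (pitched single-quote syntax, so it only means anything with the linked toolchain):

let x1: Character = 'é'     // ok: a single-codepoint grapheme
let x2: UInt8 = 'a'         // ok: U+0061 fits in UInt8
// let y: UInt8 = 'é'       // error: U+00E9 does not fit in UInt8
// let z: Character = '🇨🇦'  // error: two codepoints (regional indicators)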

I hope this experiment will be of use in moving this pitch along.

4 Likes

Great, thanks. I still think it might be hard for a user to understand why they can't write a Character using what they will roughly think of as a character literal there, though.

3 Likes

That’s Unicode ¯\_(ツ)_/¯. The suggested model is that these are "codepoint literals". You can always use double quotes:

let x3: Character = "🇨🇦" // ok

Sure, I understand that is possible, but why is it desirable? Why shouldn't the single-quoted version work?

1 Like

I've been thinking about single-delimiter literals for quite some time, and imho we could skip the second ' here.
I'm not sure how important the aspect of brevity is, but 33% would be a significant reduction ;-)

let hexcodes = ['0, '1, '2, '3, '4, '5, '6, '7, '8, '9, 'a, 'b, 'c, 'd, 'e, 'f]

doesn't look that bad to me (and whitespace is always invisible when you specify it directly).

It’s an artefact of the implementation, or should I say, it’s all I could get to work. Somebody who actually knows what they are doing might fare better, but I don’t think this is a burdensome limitation in practice. In fact, I like the model that these literals are more Int-like than Character-like. There is no escaping the need to have at least some knowledge of Unicode’s vagaries.

You can see one problem in your message - external colorising editors would get confused.

Why not use the ExpressibleByUnicodeScalarLiteral protocol and Unicode.Scalar struct?

3 Likes

That’s easy to do, and might be more correct given the way things turned out:

/// The default type for single quoted "character" literals.
public typealias CodepointLiteralType = Unicode.Scalar
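
With Unicode.Scalar as the default type, today's standard library already covers the common conversions (this is current Swift, no new syntax involved):

let s: Unicode.Scalar = "a"  // ExpressibleByUnicodeScalarLiteral works today
print(s.value)               // 97 — the codepoint as a UInt32
print(UInt8(ascii: s))       // 97 — traps if the scalar is not ASCII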

How would you write the space character ' '?

3 Likes

One thing I think these literals should be able to do is:

let literal: UInt32 = 'aeio'
assert(literal == 0x6165696f)

Which would require multi-scalar literals.

Edit: would be UInt32.
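
For reference, the packed value can be computed in current Swift, which also makes the big-endian convention explicit:

// Big-endian packing of single-byte scalars into a UInt32,
// matching the 0x6165696f expectation above.
let packed = "aeio".unicodeScalars.reduce(UInt32(0)) { acc, scalar in
    precondition(scalar.isASCII, "multi-byte scalars do not pack unambiguously")
    return acc << 8 | scalar.value
}
assert(packed == 0x6165_696f)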

1 Like

I don’t think this is obvious behavior at all. What would this be?

let literal: UInt32 = 'aθi'
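
inspecting the scalars in today's Swift shows why there is no one obvious answer here (θ is U+03B8, which does not fit in a single byte):

let scalars = "aθi".unicodeScalars.map { String($0.value, radix: 16) }
print(scalars) // ["61", "3b8", "69"] — 4 UTF-8 bytes, 3 UTF-16 units; which packing wins?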

That's a small minority of characters, and imho the numerical value is better in this case - or something like Character.space (which could be shortened to .space in many situations).
But afaics, ' followed by a space wouldn't be problematic either, given that a lonely single quote would always be an error.

Would this actually be allowed?
If that's the case, it would be a really obvious argument for keeping the closing delimiter... but I thought that the length of the literal always has to be one, and in this case, there's no need for a second way to signal its end.

I really can't get behind wasting the ' reserved character on something so niche/uncommon that could easily be done with a map call:

let hexcodes = [
	"0", "1", "2", "3", "4" ,"5", "6", "7",
	"8", "9", "a", "b", "c", "d", "e", "f"
].map(UInt8.init(ascii:))

print(hexcodes)
// [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102]
2 Likes

I see you've chosen big-endian here. This will not please everyone.

... if the compile-time code execution story finally makes some progress, this wouldn't even be expensive ;-)
But why should we not use '? If there is no other idea for it, there's little merit in not utilizing it.

1 Like

this constructs Character values from literals, which cannot be done at compile time, since the grapheme cluster stuff depends on the ICU runtime

you might also want to have them as Int8s instead of UInt8s, since that makes testing ASCII vs latin extended easy (ASCII is always positive)
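
a quick illustration of that sign trick in current Swift (Int8(bitPattern:) reinterprets the byte):

// any byte >= 0x80 (non-ASCII lead/continuation bytes) becomes negative
let bytes = "aé".utf8.map { Int8(bitPattern: $0) }
print(bytes)                 // [97, -61, -87]
print(bytes.map { $0 >= 0 }) // [true, false, false] — the ASCII test is just the sign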