Prepitch: Character integer literals

A few things on this:

  • I think it should be the created type's decision whether or not to allow multiple 'char' literals.

  • Also, this should be a completely new literal type, with its own ExpressibleBy*Literal protocol, or protocols.

  • Possible names I can think of are ExpressibleByCodepointLiteral for the single 'char' case, and ExpressibleByTextLiteral for the multiple 'char' case.

As for real-world advantages of 'char' literals: some standards use integer discriminators whose ASCII interpretation is correlated with their semantic meaning. When putting that sort of data in code, it would be better to use the more readable ASCII form in source instead of a raw number.
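
For a concrete picture, here is a sketch of the kind of code this would improve (the commented-out single-quoted line is the pitched syntax, not current Swift; the PNG "IHDR" tag is just an illustrative choice, and the rest compiles today):

// Today: a four-byte tag like PNG's "IHDR" is spelled as a raw magic number.
let tag: UInt32 = 0x4948_4452
// With 'char' literals (pitched syntax), the ASCII meaning would be visible:
// let tag: UInt32 = 'IHDR'

// Recovering the ASCII spelling today takes extra work:
let bytes = withUnsafeBytes(of: tag.bigEndian) { Array($0) }
print(String(decoding: bytes, as: UTF8.self)) // "IHDR"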

2 Likes

Do you think there would be value in a separate ExpressibleBy protocol for this? As in, you could have ExpressibleBySingleQuotedLiteral (perhaps with a better name), and then FixedWidthInteger (or perhaps even Numeric), Character, and Unicode.Scalar could all conform.

One reason I ask is that, as an effect of the reference implementation (where Character conforms to ExpressibleByIntegerLiteral), users can type the following:

(8 as Character) == "8" // false
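
To make the question concrete, here is a rough sketch of the shape such a protocol could take (the name comes from this post; none of this is actual stdlib API, and the compiler support that would feed it a literal does not exist today):

// Hypothetical protocol; not real stdlib API.
protocol ExpressibleBySingleQuotedLiteral {
    init(singleQuotedLiteral value: Unicode.Scalar)
}

// Character would wrap the scalar; an integer type could
// validate the scalar's range at conversion time:
extension Character: ExpressibleBySingleQuotedLiteral {
    init(singleQuotedLiteral value: Unicode.Scalar) {
        self = Character(value)
    }
}

extension UInt8: ExpressibleBySingleQuotedLiteral {
    init(singleQuotedLiteral value: Unicode.Scalar) {
        precondition(value.isASCII, "scalar does not fit in UInt8")
        self = UInt8(value.value)
    }
}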
2 Likes

i don’t think this is a good enough reason. maybe javascript programmers see it differently, but I wouldn’t assume 8 as Character to be anything but "\u{8}"

3 Likes

Now that raw strings have a different syntax, thanks to your work, I'm happy with the idea of using single quotes for character literals. Re-reading the months-old discussion here, I find the view in these posts compelling:

This approach seems clean and straightforward. I can't say I understand the benefit (besides lower implementation complexity) of making single-quoted literals a kind of integer literal. This only limits the applications for this literal form, and therefore makes it harder for a proposal to clear the bar for usefulness. A new ExpressibleBy*Literal also lets you do the natural thing of making '🙂' default to Character instead of an Int, which cuts down on a lot of weird behaviour (e.g. I presume I can write var x = '🙂' * '🙂' or '8' + '9' in your prototype, etc).

7 Likes

Thanks for the clarifications. I think I’ve finally got the bigger picture, updated the proof-of-concept implementation, and made a toolchain available here.

There is now a separate ExpressibleByCodepointLiteral protocol, and the default type of these literals is now Character. The examples above all work except the following expectation:

let x3: Character = '🇨🇦' // not ok

Despite having type Character, single-quoted “codepoint” literals are derived from integer literals in this implementation and can only represent single-codepoint graphemes. The advantage of taking this tack is better error reporting and checking that the codepoint fits into the destination type.
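
Concretely, the behaviour described looks like this (pitched single-quote syntax, so it only means anything with the linked toolchain):

let x1: Character = 'é'     // ok: a single-codepoint grapheme
let x2: UInt8 = 'a'         // ok: U+0061 fits in UInt8
// let y: UInt8 = 'é'       // error: U+00E9 does not fit in UInt8
// let z: Character = '🇨🇦'  // error: two codepoints (regional indicators)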

I hope this experiment will be of use in moving this pitch along.

4 Likes

Great, thanks. I still think it might be hard for a user to understand why they can't write a Character using what they will roughly think of as a character literal there, though.

3 Likes

That’s Unicode ¯\_(ツ)_/¯. The suggested model is that these are "codepoint literals". You can always use double quotes:

let x3: Character = "🇨🇦" // ok

Sure, I understand that is possible, but why is it desirable? Why shouldn't the single-quoted version work?

1 Like

I've been thinking about single-delimiter literals for quite some time, and imho we could skip the second ' here.
I'm not sure how important the aspect of brevity is, but 33% would be a significant reduction ;-)

let hexcodes = ['0, '1, '2, '3, '4, '5, '6, '7, '8, '9, 'a, 'b, 'c, 'd, 'e, 'f]

doesn't look that bad to me (and whitespace is always invisible when you specify it directly).

It’s an artefact of the implementation, or should I say, it’s all I could get to work. Somebody who actually knows what they are doing might fare better, but I don’t think this is a burdensome limitation in practice. In fact, I like the model that these literals are more Int-like than Character-like. There is no escaping the need to have at least some knowledge of Unicode’s vagaries.

You can see one problem in your message - external colorising editors would get confused.

Why not use the ExpressibleByUnicodeScalarLiteral protocol and Unicode.Scalar struct?

3 Likes

That’s easy to do, and might be more correct given the way things turned out:

/// The default type for single quoted "character" literals.
public typealias CodepointLiteralType = Unicode.Scalar
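
With Unicode.Scalar as the default type, today's standard library already covers the common conversions (this is current Swift, no new syntax involved):

let s: Unicode.Scalar = "a"  // ExpressibleByUnicodeScalarLiteral works today
print(s.value)               // 97 — the codepoint as a UInt32
print(UInt8(ascii: s))       // 97 — traps if the scalar is not ASCII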

How would you write the space character ' '?

3 Likes

One thing I think these literals should be able to do is:

let literal: UInt32 = 'aeio'
assert(literal == 0x6165696f)

Which would require multi-scalar literals.

Edit: would be UInt32.
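
For reference, the packed value can be computed in current Swift, which also makes the big-endian convention explicit:

// Big-endian packing of single-byte scalars into a UInt32,
// matching the 0x6165696f expectation above.
let packed = "aeio".unicodeScalars.reduce(UInt32(0)) { acc, scalar in
    precondition(scalar.isASCII, "multi-byte scalars do not pack unambiguously")
    return acc << 8 | scalar.value
}
assert(packed == 0x6165_696f)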

1 Like

I don’t think this is obvious behavior at all. What would this be?

let literal: UInt32 = 'aθi'
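
inspecting the scalars in today's Swift shows why there is no one obvious answer here (θ is U+03B8, which does not fit in a single byte):

let scalars = "aθi".unicodeScalars.map { String($0.value, radix: 16) }
print(scalars) // ["61", "3b8", "69"] — 4 UTF-8 bytes, 3 UTF-16 units; which packing wins?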

That's a small minority of characters, and imho the numerical value is better in this case - or something like Character.space (which could be shortened to .space in many situations).
But afaics, ' followed by a space wouldn't be problematic either, given that a lonely single quote would always be an error.

Would this actually be allowed?
If that's the case, it would be a really obvious argument for keeping the closing delimiter... but I thought that the length of the literal always has to be one, and in this case, there's no need for a second way to signal its end.

I really can't get behind wasting the ' reserved character on something so niche/uncommon that could easily be done with a map call:

let hexcodes = [
	"0", "1", "2", "3", "4" ,"5", "6", "7",
	"8", "9", "a", "b", "c", "d", "e", "f"
].map(UInt8.init(ascii:))

print(hexcodes)
// [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102]
2 Likes

I see you've chosen big-endian here. This will not please everyone.

... if the compile-time code execution story finally makes some progress, this wouldn't even be expensive ;-)
But why should we not use '? If there is no other idea for it, there's little merit in not utilizing it.

1 Like

this constructs Character values from literals, which cannot be done at compile time, since the grapheme cluster stuff depends on the ICU runtime

you might also want to have them as Int8s instead of UInt8s, since that makes testing ASCII vs latin extended easy (ASCII is always positive)
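
a quick illustration of that sign trick in current Swift (Int8(bitPattern:) reinterprets the byte):

// any byte >= 0x80 (non-ASCII lead/continuation bytes) becomes negative
let bytes = "aé".utf8.map { Int8(bitPattern: $0) }
print(bytes)                 // [97, -61, -87]
print(bytes.map { $0 >= 0 }) // [true, false, false] — the ASCII test is just the sign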