Prepitch: Character integer literals

But I don't think I do agree. If they were single-quoted, yes, but I don't recommend making potentially source-breaking changes to protocols that String conforms to via ExpressibleByUnicodeScalarLiteral. It's cleaner to give single-quoted, single-codepoint literals their own new protocol, ExpressibleByCodepointLiteral, and conform Int & co. to that.

In the case where the literal is not a single codepoint, I've taken out the error and had the parser re-lex the token as a String in the prototype. This means single-quoted characters can be either integer literals (modified to look for conformance to ExpressibleByCodepointLiteral) or string literals. This probably sounds a little on the pragmatic side, but it isn't that different from container literals, which can be Arrays or Dictionaries depending on their first element. In practical terms this means the following is now possible:

let x3: Character = '🇨🇦' // ok (yay!)

But you get some gnarly errors when you mix the two (default type is now Unicode.Scalar):

k.swift:6:20: error: binary operator '-' cannot be applied to operands of type 'Unicode.Scalar' and 'String'
let d2: Int8 = '*' - '👨🏼‍🚀'
k.swift:6:20: note: overloads for '-' exist with these partially matching parameter lists: (Float, Float), (Double, Double), (Float80, Float80), (UInt8, UInt8), (Int8, Int8), (UInt16, UInt16), (Int16, Int16), (UInt32, UInt32), (Int32, Int32), (UInt64, UInt64), (Int64, Int64), (UInt, UInt), (Int, Int), (Self, Self), (Self, Self.Stride)

I thought it was better to catch the error early on myself. The following is still illegal:

let a1: Character = '🇨🇦🇨🇦'
k.swift:2:21: error: cannot convert value of type 'String' to specified type 'Character'

It seems like it should be possible to get better error messages here somehow, while preserving the nicer user model. I don't really know what I'm talking about here, but instead of kind of hacking it to parse as a Unicode scalar with a string fallback, can you not represent it as an “extended grapheme cluster”/Character and then check the overflow, etc., conditions later on the inferred type?

Yes, the dual nature of single-quoted literals is a bit of a hack, but I did it to try to placate you and @xwu, who take the position that we do not want to surface anything sub-character related to strings in Swift. Yours is a highly principled position, but one I disagree with for practical reasons. For me, these are Unicode.Scalar literals, not Character literals; they are Int-like by nature given the likely use case, and they should be prevented from representing multi-codepoint graphemes in the interests of usability, by giving a better error early on.

let invalid = '👨🏼‍🚀'
k.swift:2:15: error: character not expressible as a single codepoint

Trying to do something else is what is responsible for the poor/confusing error message. There is always the existing "🇺🇸" syntax for character literals, which it is too late to change anyway. I've tried saying the default type is Character (even though these literals can't then represent all characters), and now this latest pragmatic approach, to try to defuse the issue so we can move forward. Ultimately, this is an implementation detail that can be directed by conclusions drawn in the review process. It's not worth spending any more time tying down specific details of the implementation before then.

It’s not clear to me at this point what problem you’re aiming to solve with this proposal design, if not the case of representing single characters as understood by ordinary users. This is very much relevant to the proposal idea overall; it is not an implementation detail.

Why, incidentally, do you say that the type Character can’t represent all characters?

I guess it comes down to what "single characters" are as understood by "ordinary users". In the world Swift is looking to promote, these are "what appears to be one on-screen entity", i.e. a single extended grapheme cluster, and there are existing, perfectly adequate APIs and character literal syntax for working entirely in this world, should you want to break a string down to the Character level.

In the legacy world you may be called on to work with an UnsafePointer<Int8> and want to deal in the definition of what a "character" was up until 2014: an int. This is the admittedly limited area where a shorthand for UInt8(ascii: "\n") might come in handy, and one that also copes with Latin-1 or 21-bit Unicode characters from before they invented zero-width joiners and things got a bit silly.
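
For context, today's spelling for that kind of byte comparison looks something like the sketch below (the helper is hypothetical, purely to illustrate the UInt8(ascii:) idiom against an UnsafePointer<Int8> buffer):

// Hypothetical helper: compare a byte read from an UnsafePointer<Int8>
// buffer against an ASCII value using today's UInt8(ascii:) spelling.
func isNewline(_ byte: Int8) -> Bool {
    return UInt8(bitPattern: byte) == UInt8(ascii: "\n")
}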

This is the use case targeted by this proposal. In this domain, the most general model is Unicode.Scalar, which is specialised by the compiler down to the specific integer type required, along with suitable compile-time checks. I mentioned only in passing that, in my world, single-quoted Unicode.Scalar literals may not represent all characters but can have a default type of Character.

Anyway, I've approached this in the abstract, described how you and I differ and why this is the case, and it will have to be thrashed out at decision time in the review. The model is not an implementation detail, but the implementation is.

Surely you don't mean that we should create a whole new literal syntax entirely as a shorthand for UInt8(ascii:)?

The use of a character literal, as I see it, is in being able to represent to the reader the assertion that I expect a certain Unicode sequence to be a single extended grapheme cluster. This is not merely a model of text being promoted by Swift (although that is salient here because we're talking about proposals for Swift evolution, which should be compatible with the direction of Swift) but also one promoted by Unicode, and therefore intended for much wider adoption. The reason it would be such an important feature is precisely that what's a single Unicode extended grapheme cluster can be non-obvious yet of huge importance when working with Unicode (for example, in text editing, the caret should advance one extended grapheme cluster at a time); having a syntax to assert that something is one such cluster and not a string of them is highly relevant for a Unicode-correct language.
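
(As a concrete illustration of how non-obvious cluster boundaries can be, here is a small example of mine using today's Swift:)

let flag = "🇨🇦"
print(flag.count)                // 1: one extended grapheme cluster
print(flag.unicodeScalars.count) // 2: two regional-indicator scalars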

That Swift allows essentially arbitrarily many types to be expressible by the same kind of literal makes possible the use case of restricting the content of a character literal to those that can be represented by a single code point--which is certainly a nice benefit.

However, I certainly don't find the use case of shorthands for UInt8(ascii:) compelling. And, yes, I have actually written code that requires such APIs. (In fact, I've thought long enough about this in the past that I renamed a C string API in SE-0134.)

Moreover, some of the examples above show likelihood of active harm in the proposed design. Specifically I would argue that it would be a nonstarter for the proposal if the following behavior didn't hold:

'1' + '1' // "11"
'1' - '1' // Error: binary operator '-' cannot be applied to two 'Character' operands

Why can't we just leave the typealias UnicodeScalarType = Character or String, so that we get the expected behavior? If we use double-quoted literals for everything, the typealiases have to stay String.

I can't speak for the real author of this proposal, @taylorswift, but this is the main point of his pitch for me. As for the motivation, it's very well set out in Chris' draft and Kevin's original.

I personally would like to be able to type:

let hexcodes: [UInt8] = [
  '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
  'a', 'b', 'c', 'd', 'e', 'f'
]

and get a comprehensible error for

let hexcodes: [UInt8] = [
  '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
  'a', 'b', 'c', 'd', 'e', '🙂'
]
k.swift:11:28: error: codepoint literal '128578' overflows when stored into 'UInt8'

and:

let hexcodes: [UInt8] = [
  '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
  'a', 'b', 'c', 'd', 'e', '👨🏼‍🚀'
]
k.swift:11:28: error: cannot convert value of type 'String' to expected element type 'UInt8'

and:

let hexcodes: [UInt8] = [
  '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
  'a', 'b', 'c', 'd', 'e', '👨🏼‍🚀👨🏼‍🚀'
]
k.swift:11:28: error: character literal is not single extended grapheme cluster

I've rowed back and stopped trying to assert that these are only Unicode.Scalar literals rather than Character literals, and I've added the assertion you ask for, as you can see above.

So, for these Character literals, all I am looking for is this: if they happen to be a single code point, they can initialise an Int of the appropriate size. The specifics of how it's been implemented in the prototype are not that relevant, but if the literal happens to be a single code point, the compiler searches for conformances to the new protocol ExpressibleByCodepointLiteral rather than ExpressibleByStringLiteral. As Character, String and Unicode.Scalar conform to ExpressibleByCodepointLiteral, you can use these literals (single code point or not) in a string context entirely as before.
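
For readers following along, a rough sketch of the conformance relationships being described might look like the following. This is my illustration only: the real protocol needs compiler support to feed the literal's code point in, and the prototype's actual requirements may differ.

protocol ExpressibleByCodepointLiteral {
    // Modelled as a plain initializer purely for illustration;
    // in the prototype the compiler supplies the code point.
    init(codepointLiteral value: Unicode.Scalar)
}

extension Unicode.Scalar: ExpressibleByCodepointLiteral {
    init(codepointLiteral value: Unicode.Scalar) { self = value }
}
extension Character: ExpressibleByCodepointLiteral {
    init(codepointLiteral value: Unicode.Scalar) { self = Character(value) }
}
extension String: ExpressibleByCodepointLiteral {
    init(codepointLiteral value: Unicode.Scalar) { self = String(Character(value)) }
}
extension UInt8: ExpressibleByCodepointLiteral {
    // The prototype diagnoses overflow at compile time; a library-level
    // sketch like this can only trap at run time.
    init(codepointLiteral value: Unicode.Scalar) { self = UInt8(value.value) }
}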

Your edge cases are interesting but do not for me “show likelihood of active harm” as you say with characteristic understatement.

print('1' + '1') // would print 98, but if you give it more context of the type:
print("1" + '1' + '1') // is happy to print 111
print('z' - 'a') // prints 25 and I would say was a feature not a bug.

There is something worthy of JavaScript about the looseness of the behaviour, I'll concede, but this comes with the integer types' "can express a code point" conformance. I'd struggle to think of an essential use case for single-quoted literals without this feature, though.

See, print('1' + '1') printing "98" is, in my view, a serious design defect. It is not made better but worse if print("1" + '1') prints "11". It is part and parcel of the design if Int can be expressible by a character literal--you are right about that. My point being that, because such a behavior is unacceptable in my opinion, I cannot accept the design as-is. That is, my takeaway is that bare Int should not be expressible by a character literal.

This is compounded by the fact that I do not find avoiding the use of UInt8(ascii:) to be a compelling use case. In fact, Swift requires explicit converting initializers between integer types of different bit widths; it is entirely consistent that a character can only be converted to its ASCII value by an explicit initializer and seems entirely inconsistent if such an initializer weren't required. Therefore, I'd argue that enabling the behavior proposed here should be an anti-goal.
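
To make the consistency argument concrete (my example, using only today's Swift):

let wide: UInt32 = 74
// let narrow: UInt8 = wide   // error: implicit narrowing is not allowed
let narrow = UInt8(wide)      // explicit narrowing conversion between widths
let j = UInt8(ascii: "J")     // explicit character-to-ASCII conversion (74)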

The compelling use case for single-quoted character literals is, in my view, as I expressed above: that it permits the user to assert that a particular Unicode sequence is a single extended grapheme cluster.

What is happening?

I think you're mixing up a lot of different ideas from a lot of different people, as I think John has a slightly different vision from mine for how this proposal should work.

Both of these should return "11" in my opinion, and that isn't at all inconsistent with my proposed design. The default inferred type for 'a' should always be a Unicode type, not an integer type; the integer coercion (please remember, we are not talking about type conversion here, we are talking about type coercion) should be achieved through ExpressibleBy_____ conformances. (More specifically, the built-in hooks we add to the standard library for these conformances.) I don't think it's too much of a burden to write

let codepoint:Int = 'a'

instead of

let codepoint = 'a' // Int

Please remember the difference between coercion and conversion. We do not require people to write

let n:Int16 = Int16(22)

for instance; in fact, this is the canonical example we use to teach people how not to use the integer initializers. There are semantic differences:

let n:UInt8 = UInt8(256) // traps at run time 
let n:UInt8 = 256 // compile time error

Are we still talking about getting rid of the "a" as Character requirement? Or are you talking about the ICU assertion that happens at runtime?

The example you are putting to the test of “acceptability” is a bit of a straw man argument. Is it worthwhile to care about what adding two character literals with no external type constraint gives? Is that a real situation a coder is likely to encounter? A more representative example is that the following behaves as the programmer might intend:

func f1(_ i: Int) {
    print(i)
}
func f2(_ s: String) {
    print(s)
}

f1('1' + '1') // prints 98 (ascii value for '1' * 2)
f2('1' + '1') // prints 11

A new toolchain is available if you want to try it out. I feel something like the approach I’ve outlined is definitely feasible and the implementation not that far off the mark. In this final version:

  • The default type for character literals is Character.

  • If the literal is a single code point, it follows the new ExpressibleByCodepointLiteral protocol path.

  • All integer types, as well as Character, Unicode.Scalar and String, have conformances to this protocol.

  • Otherwise, the literal is checked to be a single grapheme cluster and is processed as a string literal, which may follow the ExpressibleByExtendedGraphemeClusterLiteral or ExpressibleByStringLiteral path.

  • It was not possible to reuse ExpressibleByUnicodeScalarLiteral, as that would affect double-quoted strings.

Good Luck!

Good, I agree.

(However, note that this will not be enough to avoid 'f' - '1' producing an integral result, which I think is also important; it would be highly unusual if 'f' + '1' and 'f' - '1' produced results of different types and had totally different semantics. To avoid such an eventuality, it would require defining the - operation on the default type to disable it, and even that will not be enough after the next refinement to operator overload resolution (not yet pitched but already implemented on master behind a flag) is accepted, where the compiler will prefer overloads from a designated protocol (in the case of - that would be Numeric) over any other overloads.)
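
(For what it's worth, the "define it on the default type to disable it" approach mentioned here would presumably look something like the sketch below, an unavailable overload; whether it survives the overload-resolution change described is exactly the concern being raised.)

extension Character {
    // Declaring an overload and marking it unavailable rejects uses of
    // '-' on two Character operands with a tailored message.
    @available(*, unavailable, message: "'-' is not defined for Character operands")
    static func - (lhs: Character, rhs: Character) -> Character {
        fatalError("unavailable")
    }
}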

We got rid of this distinction in SE-0213, hardwiring knowledge of this syntax in the compiler so that the two are indistinguishable--i.e., UInt8(256) is now a type coercion operation.

In my view, we would leverage character literals to (a) permit let x = 'a' instead of let x: Character = "a"; and (b) permit compile-time diagnostics when the contents of the literal are definitely not a single Unicode extended grapheme cluster.

I think 'f' - '1' is fine, tbh; - isn't a string operator, so it's pretty obvious that this has to be an arithmetic operator. In general I'm not a big fan of these operator "symmetry" arguments (i.e. !! to match ??), especially since + and - aren't even a symmetric pair in string-land.

SE-0213 works because this worked in the first place:

let n:UInt8 = 256 // compile error

Indeed, UInt8(256) is defined in terms of this. You're kind of putting the cart before the horse here. If I remember right, we actually made SE-0213 because tutorial authors wouldn't quit teaching beginners UInt8(256), so we had to burn this into the compiler to keep them from shooting themselves in the foot. So: a completely different circumstance, and a completely different motivation. The "correct" Swift idiom is, and has always been:

let n:UInt8 = 256

or for those who like to put the type on the right

let n = 256 as UInt8

(b) already works in Swift

let c:Character = "ab" 
//error: cannot convert value of type 'String' to specified type 'Character'

I don't find arguments on 'symmetry' convincing for addition of new syntax, but I think it's quite sensible to say that, where x + y and x - y are both valid, it's unusual--confusing at least and maybe even harmful--for them to have totally different semantics and produce results of totally different types. This is an entirely different argument from saying, for example, that every type that supports + should support -, which is the argument you're referring to as unconvincing (and I agree).

Well, the intended idiom is actually let n = 256 as UInt8, but it truly makes no difference. It was not really justifiable that UInt8(256) had different behavior than 256 as UInt8 regardless of whether no users or all users confused the two.

But this is quite beside the point: I see this implicit interconversion between characters and integers as of a kind with other C idioms such as implicit integer promotion and even, perhaps, the equivalence between pointers and arrays. Swift rejects outright supporting such C idioms, favoring more explicit syntax (in the case of arrays and pointers, for example, withUnsafe(Mutable)BufferPointer, and adding conveniences specifically at the boundary between Swift and C APIs, where you can use &array with a C API that takes a mutable pointer). I wouldn't be opposed to burning into the compiler additional similar conveniences to treat characters as integers when working with C APIs, but a character simply should not be an integer any more than an array is a pointer.
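
As a reminder, the explicit, scoped spelling referred to above looks like this today (a minimal withUnsafeMutableBufferPointer example, unrelated to the proposal itself):

var bytes: [UInt8] = [1, 2, 3, 4]
bytes.withUnsafeMutableBufferPointer { buffer in
    // Pointer access is explicit and confined to this closure.
    for i in buffer.indices {
        buffer[i] &+= 1
    }
}
print(bytes) // [2, 3, 4, 5]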

Here's the thing: for something to be harmful, it has to contribute to a user making some kind of mistake. I'm sorry, but I just can't foresee where someone could make a mistake with -. No one writes string1 - string2 and expects it to mean anything. I have seen + used on strings, I have seen * used on strings, I have even seen w^2 used on strings. But I have to admit, I have never seen a - used on a string. The fact that they produce results of totally different types makes the chance of a mistake even smaller.

Then substitute * for - in the discussion; the same issue applies.

At base the problem is that unless you are totally inured to the strangeness of this C-ism, performing math on literal characters is an extremely odd thing to do, because characters are not numbers (although they are represented as such) any more than arrays are pointers. Propagating this throughout Swift is not a mere implementation detail of having character literals, it's a fundamental philosophical shift in the direction of the language.

My problem with the design proposed has nothing to do with the addition of syntax to distinguish single characters from sequences of them. It has everything to do with the other piece of the proposal, which is to make numbers expressible by characters (rather than allowing explicit conversion between them, which it is already possible to do).

I don't know if C pointers are a valid comparison, because unless you are writing a kernel, you almost never need to (or should) hardcode pointer addresses in source code. Interestingly, the exception is the null pointer, which we do have an ExpressibleBy conformance for: nil! On the other hand, switching on hardcoded ASCII characters is extremely common.

Beyond that though, C pointers are actually a really interesting analogy. All of us who work on/in the language agree that C pointers are problematic, and Swift offers superior tools that don’t have the same safety hazards and performance traps that C pointers have, such as Array, for _ in, map, etc. As a result, 95% of the time in Swift, you never need to touch a pointer, and probably 75% of the time you don’t even need to use indices. This is a good thing.

That being said, sometimes pointers and memory are unavoidable. And Swift does its best to provide good support for unsafe pointers to make pointer code safe and readable. For example, we check for null pointers like this

guard let buffer:UnsafePointer<Int> = foo() 
else 
{
    ... 
}

not like this

let buffer:UnsafePointer<Int> = foo()
guard buffer != .init(bitPattern: 0)
else
{
    ...
}

We provide UnsafeBufferPointer and friends with Collection conformance. We even have integer subscripts on UnsafePointer, a feature I lobbied to remove from the language because I thought it was too C-ish, and I was unsuccessful because people wanted to write p[0] instead of p.pointee!

Now, consider Unicode codepoints. All of us who work on/in the language agree they are problematic, and Swift offers superior tools that model human-readable text better, such as Character and String. As a result, 85% of the time in Swift, you never need to touch an ASCII scalar. This is a good thing.

That being said, sometimes Unicode codepoints are unavoidable. And Swift really should do its best to provide good support to make ASCII code safe and readable. Right now we have to do things like spell them in decimal (or, slightly better, hex), or use workarounds like wrapping a higher-level construct such as a Unicode.Scalar in a function that you know returns the integer value you want. Neither of these is very readable, especially if you have a lot of them in one place together, which you frequently do.

if signature == UInt32(truncatingIfNeeded: UInt8(ascii: "J")) << 24 | 
                UInt32(truncatingIfNeeded: UInt8(ascii: "F")) << 16 | 
                UInt32(truncatingIfNeeded: UInt8(ascii: "I")) <<  8 | 
                UInt32(truncatingIfNeeded: UInt8(ascii: "F"))

If we had codepoint literals we could write this as

if signature == 'J' << 24 | 'F' << 16 | 'I' << 8 | 'F'

I take issue with this because characters (with a lowercase 'c') are numbers. Most programmers have internalized that A-Z and 0-9 are contiguous runs of numbers, that most common symbols can be encoded in a fixed-width 8-bit integer, and engineers have taken to using this mapping between letters and numbers as useful mnemonics for encoding integers. So we get ('J', 'F', 'I', 'F') == (74, 70, 73, 70) instead of four arbitrary numbers. The problem is that people tried to use this system to encode characters of human text, which doesn't work so well. So Swift created the world of Character and String to handle human text correctly. But the machine-readable world still exists, and there, this system does work well. And using Character and all the higher-level Unicode constructs in this context is as clunky and inappropriate as using UInt8 and all the lower-level Unicode constructs for human text.
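
(A quick sanity check of that mapping in today's Swift:)

print("JFIF".unicodeScalars.map { $0.value }) // [74, 70, 73, 70]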

I guess I'm kind of in the middle here. I think that single-quoted literals should be able to represent any Character, but I also don't think it's a huge deal if these literals can be inferred to be the various integer types. This is going to potentially lead to some edge cases (e.g. 'f' - 1 being some integer), but I don't know that these will be frequently encountered and I think at least some of these edge cases could be handled with targeted warnings or similar (e.g. perhaps warning on use of operators that Character doesn't support when the default fallback to Character would otherwise have been used). There are already somewhat similar edge cases possible with the other literal forms and type inference.

Assuming signature is an array of CChar, why would anyone want to write either of these instead of String(signature) == "JFIF"?

No, I'm not making a comparison of the relationship between pointers and numbers to the relationship between characters and numbers; I am making a comparison of the relationship between pointers and arrays to the relationship between characters and numbers. Specifically, how C treats each of the two relationships as one of interchangeability and Swift does not.

No, characters don't have the semantics of numbers. They might be stored as numbers, but that doesn't make them numbers any more than arrays are pointers.