SE-0243: Codepoint and Character Literals

A part of me agrees, and in fact I spent quite a bit of time last night re-evaluating exactly this as an alternative proposal after @michelf’s review. It has many advantages in terms of simplicity. The practical uses of the broader Character literals are quite rare, and you can use a string literal for those. As a strategy, we could introduce them now as 'ASCII' literals with a view to extending their reach if a way is found to make it work.

2 Likes

Looking at the operators and methods we need to blacklist from integer character literals, I’ve come to the realization that there is just too much gray area when you consider every possible API that takes an integer argument.

// yes 
let buffer:[UInt8] = ['0', '8', '-', '0', '3', '-', '1', '9', '9', '2']
// yes 
if (string[0], string[1]) == ('0', 'x')
// yes 
if username.contains('\'') || username.contains('"')
// probably? 
if 'a' ... 'z' ~= byte
// maybe? 
let hexdigit:UInt8 = n < 10 ? '0' + n : 'a' - 10 + n
// no
n >>= 'a'
// no
if x.isMultiple(of: 'a')
// no
array.removeLast('x')

I think our real goal is for integer types to be coercible from 'x', but not expressible by 'x'. The UInt8(ascii:) initializer fails the first criterion; the UInt8:ExpressibleByUnicodeScalarLiteral conformance fails the second.

I think we should consider making 'x' as UInt8 the only syntax for invoking ASCII–integer equivalence, that is, overloading as to work on character literals.

<IntegerASCIIValue> ::= <CharacterLiteral> as <TypeIdentifier>

Unlike UInt8(ascii:), the left hand side would be restricted to a constant character literal, so the compiler could easily validate this.
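For example (a sketch of the proposed behavior, not current Swift; the diagnostics are hypothetical):

let a = 'a' as UInt8 // ok: compile-time constant 97
let e = 'é' as UInt8 // error: 'é' (U+00E9) is not an ASCII scalar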

This would be a compiler-only change, but in the long run as we look to reworking the literals system, we could introduce @coercibleFromTextLiteral in the same spirit as @expressibleFromTextLiteral, effectively allowing users to overload as with @constantExpression implementations.

The only drawback is that it’s a lot more verbose, and the as operator’s precedence is a bit too low.

let buffer:[UInt8] = 
[
    '0' as UInt8, '8' as UInt8, '-' as UInt8, 
    '0' as UInt8, '3' as UInt8, '-' as UInt8, 
    '1' as UInt8, '9' as UInt8, '9' as UInt8, '2' as UInt8
]
if ('a' as UInt8) ... ('z' as UInt8) ~= byte

Thoughts?

I still don’t see why you insist on the invention of new syntax or new compiler features here. What is the advantage of 'a' as UInt8 over UInt8(ascii: 'a')?

This is particularly so given that I could write: let buffer = ['0', '8', '-', /* ... */].map { $0.asciiValue! }.
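For instance, something close to this already works in today’s Swift with double-quoted literals (Character.asciiValue was added in Swift 5), no new syntax required:

let buffer:[UInt8] = "08-03-1992".map { $0.asciiValue! }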

You would support new compiler magic to special-case UInt8(ascii:) for compile-time checking (a much harder problem than you seem to imply), but not new compiler features that could eventually be integrated into a larger, cohesive, user-extensible literals system?

2 + 2 = 'a'

A better literals framework is on the horizon for us (I’m working on a write-up for that), and coercible-but-not-expressible initializers would definitely be part of it.

For me, that last example desugars to:

let buffer = 
{
    var buffer:[UInt8] = []
    for c:Character in ['0', '8', '-', '0', '3', '-', '1', '9', '9', '2'] 
    {
        buffer.append(c.asciiValue!) // force-unwrap? :(
    }
    return buffer
}() // the closure must be invoked for `buffer` to be a [UInt8]

Which is very different from loading 10 bytes from .rodata or using immediate operands. Sure, it might compile to the same thing (though without the validation checks), but if you have to ask godbolt what the code is doing, that means the code isn’t clear, and that something is lacking in the language.

Ultimately, I think you and I just fundamentally disagree on the importance of compile-time validation and the whole “shortest amount of time to a correct implementation” idea. I don’t think it’s productive to continue this discussion further.

Realistically, C-family languages have this distinction because the simplistic parser and lack of type system in B necessitated it, not because Ken Thompson thought deeply about it and decided that double-quoted character literals would be too confusing.

While that is probably true, we do have the situation that we use different literal forms for different classes of types.

  • numerics use unquoted strings of characters (mostly, but not exclusively, digits)
  • strings are bracketed by double-quotes
  • collections are bracketed by (square) brackets

I think that there's value in considering single characters to be a separate class from strings, as the type and capabilities are very different. There is apparently considerable debate to be had as to what kind of "character" a single element should represent (ASCII, Unicode scalar, grapheme cluster, etc.)

4 Likes

If we are to find a use for single-quoted syntax now that the passage of raw strings has freed it up, my intuition is that where we are headed is towards 'ASCII' literals. The last-minute additional requirement that ABI stability means no new conformances are possible until a way can be found to gate them with @available has made Character literals with a capital C very difficult to deliver. The idea that users should add a conformance to an internal protocol to enable integer conversions was always going to be a difficult sell, and despite my best efforts a module defining this conformance can have global effects, such as the unfortunate UInt8("8") issue discussed above.

In this alternative model, 'a' is a spelling for the integer literal 97 with default type Int. With this simpler and more honest type, hopefully people will be less surprised that expressions such as 'a' * 'b' are possible, even if they are meaningless, and the domain of Unicode with all its subtleties can be left to double-quoted strings undisturbed, making this alternative proposal purely additive.
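Concretely, under this model (a sketch of the proposed behavior, not anything Swift accepts today):

let a = 'a'         // inferred as Int == 97
let b:UInt8 = 'a'   // 97, usable like any integer literal
let odd = 'a' * 'b' // compiles: 97 * 98 == 9506, legal if meaningless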

This will come as a disappointment to people trying to maintain Swift's abstract model for strings (@Michael_Ilseman, responsible adult for all things string, may wish to comment here), but it comes with far fewer subtle and potentially confusing complications and is something it would be trivial to deliver now. Taking this route would not close off eventually implementing something more flexible and conceptually pure later on. The type checker is smart enough that about the only complication further down the line is when people rely on type inference, declaring a variable and using the default type. Emergent problems would be picked up by the compiler; the scope for insidious bugs is relatively limited.

For those who would say let’s just wait and talk about this some more: what we are waiting on principally is @available on conformances, and given that the implementation of SE-0068 (a proposal from April 2016 for universal Self) is only just being worked on now, I don’t anticipate it turning up soon. Given this, and that we’re not locking ourselves out of future, more conceptually pure possibilities, I’d be tempted by the simpler model. It is certainly more accessible to new or part-time programmers of Swift.

3 Likes

I am mostly an observer here, but I agree with the main point. Something that is useful to people now is better than something which is conceptually pure or more general at some unspecified future time. The gotcha, which you mention, is that we don't want to close off those "better" options by the design or implementation of this feature. For that, I defer to the people who actually understand what's going on :slight_smile:

2 Likes

Yes. If I were convinced of the necessity of compile-time checking (I am not), I would be in favor of the compiler gaining the ability to check preconditions at compile time where possible. (This could be a general feature where the compiler emits a warning whenever it can prove that an expression always reduces to a precondition failure.) I would not see any reason to fundamentally overhaul literals.
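For instance, UInt8(ascii:) already has a documented precondition that its argument be an ASCII scalar; under such a feature the compiler could flag this (the diagnostic is hypothetical):

let b = UInt8(ascii: "€") // traps at runtime today; could warn: precondition always fails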

What are the motivations you have for “a better literals framework”? In my view, literals are already very complicated for users, and anything other than a simplification would be unwise.

Coercing a literal to a non-default type is already so confusing that SE-0213 was necessary to simplify the language and make things more predictable for users. The idea that we should add more functions to the 'as' operator seems completely beyond the pale when I would wager that the number of people who can enumerate all of the current functions of 'as', 'is', 'as?', and 'as!'—even in this forum—is in the single digits.

Not just you and I, I’d say. The overarching design of Swift leans heavily on trusting the compiler, and how things are optimized is invisible in the end-user model. CoW “just works” but doesn’t come with its own keyword, whether something compiles down to a constant is completely opaque, and even fundamental language features such as for...in desugar to methods found in complex protocol hierarchies that are then optimized away by the compiler. What you assert to be important here is fundamentally inconsistent with Swift’s overarching design.
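To make the for...in point concrete, this is roughly what the compiler produces, per the Sequence and IteratorProtocol requirements:

let xs = [1, 2, 3]
var iterator = xs.makeIterator()           // `for x in xs { print(x) }`
while let x = iterator.next() { print(x) } // desugars to this, more or less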

Feel free not to continue the discussion further, but continuing it is in your interest if your aim is to overturn a fundamental design decision of the language.

This would be a gravely disappointing alternative for me. It would foreclose the use of single quotes as a character literal, as in, a user-visible assertion that the contents inside are a single extended grapheme cluster.

I fail to be convinced here why the singular use case of getting an ASCII value justifies the use of a syntax that is in other languages reserved for characters, however defined in their languages, and for which in Swift we have first-class Unicode support.

2 Likes

It need not foreclose this forever, just impose the limitation that single-quoted characters must be ASCII for now. This constraint could still be relaxed at a later date, if an implementation could be found, without being source-breaking. Is a user-visible assertion that the contents inside are a single extended grapheme cluster really useful enough to justify postponing indefinitely? The String/Character type distinction looks after this anyway at the moment.

To postpone what indefinitely? This proposal under review already reflects the use of single-quoted literals for Unicode characters and that portion of the proposal has been the least controversial. I would hope that it would be accepted.

To be clear, I (and many community members here) do not agree that integers should be expressible by letters, and the reasons for that have been plenty developed above. As @taylorswift points out above, 'x'.isMultiple(of: 'a') is on its face nonsensical. That design, in my view (and theirs), should not just be postponed but definitively rejected.

3 Likes

You have made your opinion on this abundantly clear from the get-go. It’s my view that they would extend Swift’s reach to embedded systems, parsers, etc., and add another arrow to Swift’s quiver. It is this which is more the focus of this proposal for me, as more of an engineer than a computer scientist. It’s regrettable we cannot deliver a clean transition to Character literals now, but I’m recommending this as a transitional phase. We’re not closing off any future options.

2 Likes

A design where single-quoted literals default to Int very much precludes them ever becoming character literals. As you know, in the absence of a type coercion, member lookup for literals is limited to the default type—which in what you now recommend would be Int.

Such a design amplifies the number of scenarios in which examples such as '1' / '1' are possible, closes off any possible mitigation of them, and eliminates the use of these literals for actual characters forever. In brief, it would eliminate the portion of the proposal under review here that I wholeheartedly support and take the portion that I do not support to a whole new level. This is why I write that I would be gravely disappointed.
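To spell out the '1' / '1' example under that Int-default model (hypothetical, since no such design is implemented):

let q = '1' / '1' // type-checks as Int division: 49 / 49 == 1
let s = '1' + '1' // 49 + 49 == 98, i.e. 'b', not "11"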

6 Likes

I do agree (and even thought about it myself) that there’s a disparity if we have a Unicode StringLiteral but an ASCIIValueLiteral.

Though I’m convinced that the use of UnicodeScalarLiteral or UnicodeGraphemeClusterLiteral would be very limited, especially since Character and Unicode.Scalar seem to handle StringLiteral just fine.

Still, I’m not sold that something this "textual" should have integer value.

This sounds like a Literal type without default inferred type. On the surface I do like this idea, but I want to explore the consequence of this in depth if we’re going this route.

1 Like

Mostly an observer here, but I've been following closely the discussion from the beginning and I agree with you here.

I totally agree with the "coercible from 'x', but not expressible by 'x'" distinction.

1 Like

Dearest @xwu, we could go back and forth on this all day. I don’t see the default type being such a big deal or precluding anything. The situations where there is no type context are limited to let statements where the user is too lazy to specify a type, and functions that take Any, such as print. Otherwise, what matters more is what is expressible by what, as determined by the compiler protocols. All I am saying is that, for now, only integers would be expressible by character literals while they were implemented as integer literals, but this could be opened up later.

Anyway, this is not the fundamental point at issue. You have a principled position whereby you are not prepared to countenance any numeric representation of the components of text in the Swift language, and this precludes the integer conversions. I just don’t buy your assertion that the body of developers thinking “if only there was a way I could be sure that this literal I typed in was a single extended grapheme cluster” (when they would receive an error anyway, and can see it on their screen) is larger than the body of those who would find it handy to be able to type '\n' from time to time.

I’ve not really been following this discussion, but if the suggestion is we would need to type 'a' as UInt8, I’d probably prefer UInt8(ascii: 'a') TBH. I’d stick with the expressible-by protocols myself, even if the results are sometimes absurd. It’s always been possible to pass a file’s length to a function expecting the number of blocks in a file.

2 Likes

Same here.

'a' as UInt8 is too verbose IMHO.