SE-0243: Codepoint and Character Literals

bjhomer · March 19, 2019, 6:33pm

// Error: Cannot convert value of type 'String' to specified type 'Character'
let x: Character = "volcano"

This seems to be validating at compile time…

taylorswift · March 19, 2019, 6:35pm

The compiler makes an attempt to exclude literal sequences it knows could never possibly be a grapheme cluster, but it cannot prove a literal sequence is a grapheme cluster. This is an example of a compiler heuristic, not a validation guarantee.

johnno1962 · March 19, 2019, 6:46pm

I wish considerations about how this .ascii property is implemented did not appear to be driving the concept of what character literals are, particularly where there are alternatives (operators) for which this property would not be critical to the design. We’re supposed to be designing the language top down, not letting low-level considerations drive a much more subtle and demanding model for anyone who is not level II Unicode-certified. How are we supposed to explain to a new Swift developer that 'a' is valid but '🇨🇦' is not? We’ll have to have the Core Team make a call on this or we’ll be going round in circles forever.

taylorswift · March 19, 2019, 6:55pm

Operators will not solve

// storing a bytestring value 
static 
var liga:(Int8, Int8, Int8, Int8) 
{
    return (108, 105, 103, 97) // ('l', 'i', 'g', 'a')
}

,

// storing an ASCII scalar to mixed utf8-ASCII text
var xml:[UInt8] = ...
xml.append(47) // '/'
xml.append(62) // '>'

, or

// reading an ASCII scalar from mixed utf8-ASCII text 
let xml:[Int8] = ... 
if let i:Int = xml.firstIndex(of: 60) // '<'
{
    ...
}

.

Only the third example even has an operator-based equivalent, and it is far less efficient, as it replaces a value-based search (firstIndex(of: 60)) with a predicate-based one (firstIndex { $0 == "<" }).
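For comparison, here is a rough sketch of how the three cases can be spelled in today's Swift without any new literal syntax, using the standard library's UInt8(ascii:) initializer. The buffer contents are made up for illustration, and UInt8 is used throughout (the tuple above uses Int8).

```swift
// Sample data is hypothetical. UInt8(ascii:) traps if the scalar is not ASCII.
let liga: (UInt8, UInt8, UInt8, UInt8) = (
    UInt8(ascii: "l"), UInt8(ascii: "i"), UInt8(ascii: "g"), UInt8(ascii: "a")
)

// Appending ASCII bytes to a UTF-8 buffer.
var xml: [UInt8] = Array("<a".utf8)
xml.append(UInt8(ascii: "/"))
xml.append(UInt8(ascii: ">"))

// Value-based search: looks for one known byte value.
let byValue = xml.firstIndex(of: UInt8(ascii: "<"))

// Predicate-based search: evaluates a closure for each element.
let byPredicate = xml.firstIndex(where: { $0 == UInt8(ascii: "<") })
```

Both searches find the same index here; the difference under discussion is how each one reads and what the optimizer has to see through.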

If we assume Unicode grapheme clusters are an advanced Unicode concept, it makes perfect sense to restrict users to the simple C/Python world of Unicode.Scalar at first, and introduce them to the craziness that is Unicode grapheme clusters and canonical equivalence second. '🇨🇦' is not, and has never been, a “character” in the classical sense. It is a sequence of two codepoints (the regional indicators 🇨 and 🇦) which get collapsed depending on the version of Unicode supported by the user’s OS, and which some text shaping systems conditionally render as a single glyph, provided a font supplies such a ligature.
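A small sketch of that distinction, with the flag written as escapes so the two scalars are visible. The variable names are only illustrative, and the Character count is whatever the standard library linked at run time decides.

```swift
// The Canadian flag as two regional-indicator scalars: C (U+1F1E8), then A (U+1F1E6).
let flag = "\u{1F1E8}\u{1F1E6}"

// The scalar view is fixed by the literal itself.
let scalarValues = flag.unicodeScalars.map { $0.value }
let scalarCount = flag.unicodeScalars.count

// The Character view depends on grapheme breaking performed at run time.
let characterCount = flag.count

print(scalarValues.map { String($0, radix: 16) }, scalarCount, characterCount)
```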

johnno1962 · March 19, 2019, 7:09pm

Operators won’t solve all problems but won’t allow you to type '1' / '1' either, which is a feature.

'🇨🇦' looks like a character to me. Swift has taken the position that it is not dealing in characters in this traditional sense any more, which I wholeheartedly support.

taylorswift · March 19, 2019, 7:22pm

‘Looks’ is a terrible word to use here, because this kind of talk is exactly what confuses beginners about Swift’s Character model. For example, according to Swift, '🇹🇼' is a single grapheme cluster, as are all pairs of regional indicators, including nonsense pairs like 🇽🇴, but whether you see it (and interact with it in Firefox) as a flag emoji or as a pair of regional indicators depends on where (not even when) you bought your iPhone.

(The reason, for the curious, is that China mandates that Apple not render the Taiwan flag grapheme cluster as a single emoji, for obvious reasons.)

My point is, to talk about Characters, it is not enough to just invoke “a Character is what a human perceives as a single textual element”. To Chinese people, (🇹, 🇼) is not a textual atom. To the West, it is. To talk about Characters correctly, we have to go back to Unicode definitions. And if you ask me, these are a very advanced topic. Swift is right to include first-class support for Unicode correctness. But it’s still advanced Swift, so it’s not unreasonable that APIs designed around them should presume a pretty advanced Swift user.

xwu · March 19, 2019, 7:25pm

We are to explain that single quotation marks are for Unicode scalar literals, just as they are for peer languages.

Does a user have to know what a Unicode scalar is before they can use a Unicode scalar literal? Yes.

When it comes to Unicode strings, many things look like other things that are not the same. It is a goal of any Unicode-aware language to help users correctly distinguish between these things where the distinction matters. Since what is a single code point and what isn’t does very much matter for a spectrum of operations, your example is excellent for demonstrating why having a literal syntax (so that the compiler will give an error) is helpful.

No, not a problem: in fact, precisely the goal. People cannot see what an extended grapheme cluster is any more than they can see what a Unicode scalar is. A literal syntax which allows you to assert that something is one and only one Unicode scalar is the whole point. If one could make analogous compile-time guarantees for extended grapheme clusters, then a literal syntax for that could conceivably be useful, but it is not possible to do so.

johnno1962 · March 19, 2019, 7:34pm

Isn’t that the problem? Unicode.Scalar is a very subtle distinction somewhere between a codepoint and a grapheme cluster. People can see what a Character is on their screens.

I accept that, and Unicode.Scalar is a more useful type, but it seems unfortunate conceptually.

johnno1962 · March 19, 2019, 10:16pm

Looking back, are we much further along, 9 days later?

jawbroken · March 20, 2019, 5:06am

It's the point of some people in the review thread, but not the point of the proposal and doesn't seem to obviously address the use cases therein. I personally don't find it that valuable, and ways to assert this at compile time could be designed independent of any literal syntax or default literal type (and indeed have already been implemented for using double-quoted literals with Unicode.Scalar). And similarly, there is existing compile-time validation when using double-quoted literals for Character, so best-effort validation could clearly be done for single-quoted literals, with run-time validation for the remaining cases.
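As a brief illustration of the existing double-quoted behaviour mentioned here (variable names are arbitrary):

```swift
// With an explicit type context, the compiler checks each literal against that type.
let scalar: Unicode.Scalar = "a"
let character: Character = "a"

// These are rejected at compile time, so they are left commented out:
// let bad1: Unicode.Scalar = "ab"
// let bad2: Character = "volcano"

let scalarValue = scalar.value        // UInt32 code point value
let asciiByte = character.asciiValue  // UInt8?, nil for non-ASCII characters
```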

benrimmington · March 20, 2019, 10:52am

@lorentey @taylorswift
When creating magic numbers from string literals, StaticString might be better than String.UTF8View.

In the following example, liga.magic is an integer constant when compiled with optimizations.

extension StaticString {
    public var magic: UInt32 {
        precondition(isASCII)
        precondition(hasPointerRepresentation)
        precondition(utf8CodeUnitCount == 4)
        return UInt32(utf8Start[0]) << 24
             | UInt32(utf8Start[1]) << 16
             | UInt32(utf8Start[2]) <<  8
             | UInt32(utf8Start[3])
    }
}

let liga: StaticString = "liga"
print("\(liga) is \(liga.magic)")

You could also try using a custom ExpressibleByStringLiteral type where StringLiteralType == StaticString.
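A minimal sketch of that second suggestion, assuming a hypothetical wrapper type named Tag. The checks here run when the literal is converted, so this does not by itself give a compile-time guarantee.

```swift
// `Tag` is a hypothetical four-byte tag type, not a standard library API.
struct Tag: ExpressibleByStringLiteral {
    let value: UInt32

    init(stringLiteral literal: StaticString) {
        // withUTF8Buffer exposes the literal's UTF-8 code units.
        value = literal.withUTF8Buffer { (bytes: UnsafeBufferPointer<UInt8>) -> UInt32 in
            precondition(bytes.count == 4, "tag must be exactly four bytes")
            precondition(bytes.allSatisfy { $0 < 0x80 }, "tag must be ASCII")
            // Pack the bytes big-endian: first byte ends up most significant.
            return bytes.reduce(UInt32(0)) { ($0 << 8) | UInt32($1) }
        }
    }
}

let ligaTag: Tag = "liga"
print(String(ligaTag.value, radix: 16))
```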

xwu · March 20, 2019, 11:47am

Let me be clear: I mean that it is the point that is being served by proposing that single quotation marks serve Unicode scalar literals. It is valuable for use cases not necessarily mentioned in the proposal but amply described already in this thread. These use cases apply to Unicode scalars and not extended grapheme clusters.

Again: we are looking to improve on the status quo, where the ergonomics of using Unicode scalar literals are poor despite their extensive APIs. The rationale for devoting a new syntax to Unicode scalars instead of extended grapheme cluster literals is that it improves ergonomics in a way that is relevant for Unicode scalars and that is not possible for extended grapheme clusters.

xwu · March 20, 2019, 11:55am

Note that, as I quoted above, grapheme breaking is a runtime concept (intentionally) and not specified by Swift version. It will depend on the version of the stdlib that is linked at run time whether ICU is involved or not, and therefore will not be statically verifiable.

johnno1962 · March 20, 2019, 5:09pm

I’ve seen this argument a few times that we can’t have Character literals because the compiler does not know how to segment them reliably. Is this really true? The compiler links with the ICU that happens to be on the dev’s Mac and segments them the best it can here. Technically there might be differences due to the ICU version, but I don’t see that as an overriding argument.

Edit: I’m not sure the compiler uses ICU. It seems to roll its own segmenting, but let’s assume it works.
Another Edit: The compiler does not link with the ICU. Perhaps it should?

taylorswift · March 20, 2019, 5:10pm

That seems unfortunate. This is only semi-related, but when we overhaul the literals system to support @textLiteral and @textElementLiteral, drawing the " vs ' line at Unicode.Scalar would make the distinction between the two initializers more meaningful, as one would take a static [Unicode.Scalar] array, whereas the other could take a static Unicode.Scalar, instead of both taking arrays.

johnno1962 · March 20, 2019, 10:32pm

It’s a small change for the compiler to share runtime code and use the ICU for grapheme segmentation:

Should I raise a PR on apple/swift? I can see why the compiler should remain independent of the OS.

xwu · March 21, 2019, 1:00am

Again, grapheme breaking is a runtime concept. In other words, what counts as an extended grapheme cluster is intended to be the same for all Swift programs running on the same system regardless of where and when they’re compiled.
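A sketch of what a run-time check looks like in practice. The helper name singleCharacter is hypothetical; it relies only on String.count and String.first, both of which use the grapheme breaking of the standard library linked at run time.

```swift
// Hypothetical helper: returns the string's only Character, or nil if it has none or several.
func singleCharacter(_ text: String) -> Character? {
    // `count` walks grapheme-cluster boundaries when the program runs.
    return text.count == 1 ? text.first : nil
}

let one = singleCharacter("a")
let many = singleCharacter("volcano")
let none = singleCharacter("")
```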

The compiler cannot have the final word on this, so linking ICU may change the best-effort heuristic employed statically but won’t make it anything other than a heuristic. As the code comment deleted in your PR makes clear, it’s unclear how much effort is worthwhile to put into such a heuristic, but I’d imagine that linking ICU is certainly too far.

johnno1962 · March 27, 2019, 11:06am

Hi @Ben_Cohen, have you had any luck drafting the review conclusions to post? This pitch won’t be able to proceed without some direction from the Core Team.

Ben_Cohen · March 27, 2019, 3:35pm

Hi John – yup, planning to send out the review conclusion shortly.

Ben_Cohen · March 27, 2019, 4:19pm

Review Conclusion

The review for SE-0243 has ended and the proposal has been rejected.

To move forward on this topic, and to help the community converge on more of a consensus around these features, the core team recommends breaking this proposal out into two separate proposals that could be re-pitched and (depending on the pitch outcome) re-run.

Introducing a single-quoted literal

The core team supports the direction of adding single quotes for character and scalar literals to the language, and this has broad support from the community. This part of the proposal should become a new stand-alone proposal.

This part of the proposal is also not affected by the ABI stability issues relating to backward deployment of conformances.

There was some discussion during the review of whether the "default" for these should be the Character or Unicode.Scalar type, and the pros and cons for choosing one over the other should be discussed in the pitch phase.

String and character literals

One concern raised during the review was that, because ExpressibleByStringLiteral refines ExpressibleByExtendedGraphemeClusterLiteral, type context will allow expressions like 'x' + 'y' == "xy". The core team agrees that this is unfortunate and that if these protocols were redesigned, this refinement would not exist. However, this is not considered enough of an issue to justify introducing new protocols to avoid the problem. Where practical, the implementation of single-quote literals should generate compile-time warnings for these kinds of misuse, though this should not be done by adding additional deprecated operators to the standard library.

In the longer term, the core team thinks that the overall mechanism for literals in Swift should be re-evaluated. This re-evaluation could also cover other features, such as a mechanism for arbitrary-precision numeric literals. This is a large undertaking, however, and should not hold up a proposal for single-quoted literals.

Expressing integer types with character literals

This was the cause of the majority of disagreement during the review. Once single-quoted literals have been added to the language, this part of the proposal (or an alternative, such as the addition of a trapping or nil-returning ascii property) can be re-pitched separately.
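For readers wondering what the nil-returning alternative might look like, here is one possible sketch on Unicode.Scalar. The property below is hypothetical and is not the design the core team has in mind; note that Character already has a similar asciiValue property in the standard library.

```swift
extension Unicode.Scalar {
    // Hypothetical property, not part of the standard library.
    var ascii: UInt8? {
        return isASCII ? UInt8(value) : nil
    }
}

let slash: Unicode.Scalar = "/"
let eAcute: Unicode.Scalar = "\u{E9}"

let slashByte = slash.ascii    // an ASCII scalar yields its byte value
let eAcuteByte = eAcute.ascii  // a non-ASCII scalar yields nil
```

A trapping variant would replace the nil branch with a precondition failure, which is the trade-off left for a future pitch.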

Thanks for all your contributions during this review!

Ben Cohen
Review Manager
